{"http:\/\/dx.doi.org\/10.14288\/1.0362870":{"http:\/\/vivoweb.org\/ontology\/core#departmentOrSchool":[{"value":"Science, Faculty of","type":"literal","lang":"en"},{"value":"Statistics, Department of","type":"literal","lang":"en"}],"http:\/\/www.europeana.eu\/schemas\/edm\/dataProvider":[{"value":"DSpace","type":"literal","lang":"en"}],"https:\/\/open.library.ubc.ca\/terms#degreeCampus":[{"value":"UBCV","type":"literal","lang":"en"}],"http:\/\/purl.org\/dc\/terms\/creator":[{"value":"Jun, Seong-Hwan","type":"literal","lang":"en"}],"http:\/\/purl.org\/dc\/terms\/issued":[{"value":"2018-01-05T18:54:40Z","type":"literal","lang":"en"},{"value":"2017","type":"literal","lang":"en"}],"http:\/\/vivoweb.org\/ontology\/core#relatedDegree":[{"value":"Doctor of Philosophy - PhD","type":"literal","lang":"en"}],"https:\/\/open.library.ubc.ca\/terms#degreeGrantor":[{"value":"University of British Columbia","type":"literal","lang":"en"}],"http:\/\/purl.org\/dc\/terms\/description":[{"value":"In this thesis, we consider two combinatorial problems arising in phylogenetics and computational forestry. We develop sequential Monte Carlo based inference methods to tackle the challenges arising from these applications. In the phylogenetics application, the SMC is placed under stringent restrictions in terms of computer memory to the point where using sufficient number of particles may be prohibitive. To that end, a new method, streaming particle filter (SPF), is proposed which alleviates this problem and study some of the theoretical properties of the algorithm, including L\u00b2 convergence of the estimators derived from the SPF simulation. In the case of the computational forestry, the combinatorial problem that is tackled in this thesis is referred to as the knot matching problem. This problem is formulated as a graph matching problem on a K-partite hypergraph. A model for graph matching that admits efficient parameter inference is proposed along with a sequential Monte Carlo sampler for graph matching based on this model. A detailed background on each of the two problems are presented and evaluated on simulated and real data.","type":"literal","lang":"en"}],"http:\/\/www.europeana.eu\/schemas\/edm\/aggregatedCHO":[{"value":"https:\/\/circle.library.ubc.ca\/rest\/handle\/2429\/64277?expand=metadata","type":"literal","lang":"en"}],"http:\/\/www.w3.org\/2009\/08\/skos-reference\/skos.html#note":[{"value":"Scalable Sequential Monte CarloMethods and Probabilistic Approach toCombinatorial ProblemsbySeong-Hwan JunB. Math., University of Waterloo, 2009M. Sc., The University of British Columbia, 2013A THESIS SUBMITTED IN PARTIAL FULFILLMENT OFTHE REQUIREMENTS FOR THE DEGREE OFDOCTOR OF PHILOSOPHYinThe Faculty of Graduate and Postdoctoral Studies(Statistics)THE UNIVERSITY OF BRITISH COLUMBIA(Vancouver)December 2017c\u00a9 Seong-Hwan Jun 2017AbstractIn this thesis, we consider two combinatorial problems arising in phylogenet-ics and computational forestry. We develop sequential Monte Carlo basedinference methods to tackle the challenges arising from these applications. Inthe phylogenetics application, the SMC is placed under stringent restrictionsin terms of computer memory to the point where using sufficient number ofparticles may be prohibitive. To that end, a new method, streaming particlefilter (SPF), is proposed which alleviates this problem and study some ofthe theoretical properties of the algorithm, including L2 convergence of theestimators derived from the SPF simulation. In the case of the computationalforestry, the combinatorial problem that is tackled in this thesis is referredto as the knot matching problem. This problem is formulated as a graphmatching problem on a K-partite hypergraph. A model for graph matchingthat admits efficient parameter inference is proposed along with a sequentialMonte Carlo sampler for graph matching based on this model. A detailedbackground on each of the two problems are presented and evaluated onsimulated and real data.iiLay SummaryThe first contribution made by this thesis is in the development of fast,scalable, and accurate computer simulation method for statistical inferencein phylogenetics. Phylogenetics is a discipline focussed on studying theevolutionary relationship between species or population stemming from acommon ancestor. The method developed attains accuracies that werepreviously difficult to obtain. The second contribution is to the field ofcomputational forestry, in particular, the problem that is tackled is referredto as knot matching. The knot matching problem seeks to reconstruct theinternal tree branch structure for a sawn lumber. This method is importanttowards automatically quantifying the strength of lumber. A statisticalmodel for reconstruction of the branch structure as well as a fast and accuratesimulation algorithm for statistical inference are developed to solve thisproblem.iiiPrefaceAll of Chapter 1, Chapter 2, Chapter 7, including, writing and the figuresare solely authored. Chapter 3 is a joint work with Dr. Alexandre Bouchard-C\u00f4t\u00e9. A version of this chapter is published [1], in the proceedings ofInternational Conference on Machine Learning (ICML). Dr. Bouchard-C\u00f4t\u00e9and I contributed equally to the writing of the paper as well as generationof the figures. Chapter 4 is solely authored for the thesis except Figure 4.4,which appeared in [1]. I carried out the implementation of the experiment andthe generation of this figure. Chapter 5 and Chapter 6 are jointly authoredwith Dr. Samuel W.K. Wong, Dr. James V. Zidek, and Dr. AlexandreBouchard-C\u00f4t\u00e9. These two chapters are based on [2, 3]. [2] is published inthe proceedings of Artificial Intelligence and Statistics (AISTATS) and [3]has been submitted for review. I am the first author in both of the papersand contributed towards the writing of the papers, generation of all of thefigures, as well as data collection, implementation and experimentation. Allauthors contributed towards revising the paper. Dr. Samuel W.K. Wongwrote Appendix A and I have helped in revising it. Dr. James V. Zidek, andDr. Alexandre Bouchard-C\u00f4t\u00e9 took on the supervisory roles.ivTable of ContentsAbstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iiLay Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iiiPreface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ivTable of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . vList of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ixAcknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . xiv1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.1 Phylogenetic trees . . . . . . . . . . . . . . . . . . . . . . . . 11.2 Knot matching . . . . . . . . . . . . . . . . . . . . . . . . . . 31.3 Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . 62 Review of Sequential Monte Carlo Methods . . . . . . . . . 72.1 Bayesian posterior inference for hidden Markov models . . . 72.2 Importance sampling . . . . . . . . . . . . . . . . . . . . . . 82.3 Sequential importance sampling . . . . . . . . . . . . . . . . 102.4 Sequential Monte Carlo . . . . . . . . . . . . . . . . . . . . . 112.4.1 Background . . . . . . . . . . . . . . . . . . . . . . . 112.4.2 Algorithm description . . . . . . . . . . . . . . . . . . 132.4.3 Role of resampling . . . . . . . . . . . . . . . . . . . . 142.4.4 Effective sample size . . . . . . . . . . . . . . . . . . . 152.4.5 Example: Ricker model . . . . . . . . . . . . . . . . . 162.5 SMC sampler . . . . . . . . . . . . . . . . . . . . . . . . . . . 18v2.5.1 Problem set-up and notation . . . . . . . . . . . . . . 192.5.2 Auxiliary variable techniques . . . . . . . . . . . . . . 202.6 Combinatorial SMC . . . . . . . . . . . . . . . . . . . . . . . 212.6.1 Partially ordered set . . . . . . . . . . . . . . . . . . . 212.6.2 Example: Binary tree (an illustration of an incorrectway to design an SMC sampler) . . . . . . . . . . . . 212.6.3 Overcounting problem . . . . . . . . . . . . . . . . . . 232.6.4 Combinatorial SMC . . . . . . . . . . . . . . . . . . . 232.6.5 Example: Binary tree (continued) . . . . . . . . . . . 252.7 Particle Markov chain Monte Carlo . . . . . . . . . . . . . . 262.7.1 Notation . . . . . . . . . . . . . . . . . . . . . . . . . 262.7.2 Particle independent Metropolis-Hastings sampler . . 282.7.3 Particle marginal Metropolis Hastings sampler . . . . 293 Streaming Particle Filter . . . . . . . . . . . . . . . . . . . . . 313.1 Algorithm description . . . . . . . . . . . . . . . . . . . . . . 313.1.1 Stochastic maps . . . . . . . . . . . . . . . . . . . . . 323.1.2 Task group 1: streaming expansion, implicit proposal,summation . . . . . . . . . . . . . . . . . . . . . . . . 333.1.3 Task group 2: explicit proposal, division, contraction 343.2 Stopping time . . . . . . . . . . . . . . . . . . . . . . . . . . 353.3 Convergence results . . . . . . . . . . . . . . . . . . . . . . . 363.3.1 Measure theoretic formulation of SPF . . . . . . . . . 373.3.2 Convergence result . . . . . . . . . . . . . . . . . . . . 383.3.3 SPF as adapted particle filter . . . . . . . . . . . . . . 433.4 SPF within particle MCMC . . . . . . . . . . . . . . . . . . . 463.4.1 Notation . . . . . . . . . . . . . . . . . . . . . . . . . 463.4.2 SPF algorithm description . . . . . . . . . . . . . . . 473.4.3 SPF-PMCMC . . . . . . . . . . . . . . . . . . . . . . 494 SPF for Phylogenetics Application . . . . . . . . . . . . . . . 554.1 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 554.2 Bayesian phylogenetics . . . . . . . . . . . . . . . . . . . . . 56vi4.2.1 Kingman\u2019s coalescent prior . . . . . . . . . . . . . . . 564.2.2 Felsenstein\u2019s peeling recursion for likelihood computa-tion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 574.2.3 Models of evolution . . . . . . . . . . . . . . . . . . . 614.3 SMC for inference on phylogenetic trees . . . . . . . . . . . . 634.3.1 Notation . . . . . . . . . . . . . . . . . . . . . . . . . 634.3.2 Proposal distribution . . . . . . . . . . . . . . . . . . 634.3.3 Likelihood computation . . . . . . . . . . . . . . . . . 644.4 Experimental results . . . . . . . . . . . . . . . . . . . . . . . 654.4.1 Data simulation procedure . . . . . . . . . . . . . . . 654.4.2 The speed of execution . . . . . . . . . . . . . . . . . 654.4.3 Scalability experiments . . . . . . . . . . . . . . . . . 665 Sequential Graph Matching . . . . . . . . . . . . . . . . . . . 705.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 705.2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . 725.2.1 Notation . . . . . . . . . . . . . . . . . . . . . . . . . 725.2.2 Probabilistic models . . . . . . . . . . . . . . . . . . . 735.3 A sequential decision model for matchings . . . . . . . . . . . 745.4 Formulation of the decision set . . . . . . . . . . . . . . . . . 765.4.1 Bipartite matching . . . . . . . . . . . . . . . . . . . 775.4.2 Knot matching . . . . . . . . . . . . . . . . . . . . . . 785.5 Parameter estimation . . . . . . . . . . . . . . . . . . . . . . 785.5.1 Supervised learning . . . . . . . . . . . . . . . . . . . 795.5.2 Unsupervised learning . . . . . . . . . . . . . . . . . . 825.6 SMC sampler for prediction . . . . . . . . . . . . . . . . . . . 835.6.1 Notation . . . . . . . . . . . . . . . . . . . . . . . . . 845.6.2 Poset SMC for graph matching . . . . . . . . . . . . . 855.6.3 SMC sampler for graph matching . . . . . . . . . . . 865.6.4 Overcounting correction . . . . . . . . . . . . . . . . . 875.6.5 Evaluation metrics . . . . . . . . . . . . . . . . . . . . 895.7 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . 915.7.1 Document ranking . . . . . . . . . . . . . . . . . . . . 91vii5.7.2 Image feature matching . . . . . . . . . . . . . . . . . 935.8 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 946 Knot Matching . . . . . . . . . . . . . . . . . . . . . . . . . . . 956.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 956.2 Data for knot matching . . . . . . . . . . . . . . . . . . . . . 976.2.1 Lumber and knot representation . . . . . . . . . . . . 976.2.2 Choice of covariates . . . . . . . . . . . . . . . . . . . 986.2.3 Matching by a human grader . . . . . . . . . . . . . . 1016.3 Experimental results . . . . . . . . . . . . . . . . . . . . . . . 1016.3.1 Preliminary experiments . . . . . . . . . . . . . . . . 1016.3.2 Data analysis . . . . . . . . . . . . . . . . . . . . . . . 1036.4 Conclusion and discussion . . . . . . . . . . . . . . . . . . . . 1057 Conclusion and Future Work . . . . . . . . . . . . . . . . . . 112Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115A Generating Synthetic Data for Knot Matching . . . . . . . 123viiiList of Figures1.1 Phylogenetic trees describing ancestral relationship of 5 species:(a) Relates gorilla as closest specie to humans, (b) relateschimpanzee, gorilla, and human to have derived from a commonancestor and the orangutans and gibbons to have derivedfrom another ancestor. Each tree represents a hypothesis onevolutionary of the species. . . . . . . . . . . . . . . . . . . . 31.2 Lumber strength prediction pipeline. The knot detection scansthe board, and outputs the location and size of the knotfaces. The knot matching step formulates hypothesis regardingpossible matchings. The strength prediction model producesprediction interval for the strength of the given lumber. . . . 52.1 Simulated Ricker data: (a) the observations and (b) the latentvariables. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162.2 Inference of the latent variables using sequential importancesampling and bootstrap particle filter: (a) absolute error be-tween the truth and the weighted mean prediction and (b) theeffective sample size at each iteration. The absolute error forSIS can be seen to be poor compare to BPF. The ESS for SISreaches 1, which is indicative of collapse of population to asingle particle. . . . . . . . . . . . . . . . . . . . . . . . . . . . 182.3 (a) A balanced binary tree on 4 leaves. (b) An unbalancedbinary tree on 4 leaves. (c) The results of incorrect analysiscarried out using SMC. The estimates are biased as theydeviate from the expected value of 1\/15 \u2248 0.067. . . . . . . . 24ix2.4 An example of a cycle in the Hasse diagram induced by thechoice of proposal distribution given in Section 2.6.2. Thestates connected by red edges form a cycle in the Hasse diagram,leading to the overcounting problem. . . . . . . . . . . . . . . 262.5 An example of correct analysis carried out using SMC. . . . . 273.1 Overview of the streaming particle algorithm. The circles areparticles. The black particles show the subset of the implicitparticles that are selected in the contraction step. We alsoshow the grouping of the tasks to Task group 1 and Task group2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 353.2 An illustration of the indices in action for N = 7,K = 4, R = 2.The shaded circles form a lineage of particles, where b21:2 =(3, 2) can be used for indexing the concrete particles and ib211 =i31 = 7 and ib222 = i22 = 5 can be used to index the implicitparticles. The composition of indexes can yield the index ofthe parent particle for any implicit particle xnr . For example,the parent (implicit) particle of x52 is given by ia511 = i31 = 7:xia5111 = x71. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 484.1 (a) A sample from the coalescent prior on 8 species. Note thatthe branch length merging the root of the tree and its left andright children are much longer than the other branches. (b) Anexample of a phylogenetics tree. The leaf nodes are labelled1, 2, 3, 4 and their DNA sequences are denoted by y1, y2, y3, y4 \u2208{A,C,G, T}. The branch lengths play an important role incomputing the likelihood. buv denotes the length of the branchlength from the parent u to the child v. . . . . . . . . . . . . 584.2 (a) The effective sample size at each iteration for SMC andSPF. (b) The raw execution times in milliseconds. (c) Theplot of ESS\/sec. (d) The ratio of ESS\/sec of SPF to ESS\/secof SMC. Each of the SPF and SMC are executed 10 times onthe same dataset to obtain the estimate of the standard errors. 67x4.3 (a) The estimate of the mutation rate of the Jukes-Cantormodel. The black vertical line indicates the true values, \u00b5 = 1.7.(b) The average of the ESS divided by the time (in seconds) foreach iteration of SPF and SMC. The averages are calculatedover the course of PMMH iterations. . . . . . . . . . . . . . . 684.4 (a) Decrease in negative marginal likelihood as the number ofparticles is increased. (b) Comparison of SSD versus time forSPF (red) and the standard SMC (blue). . . . . . . . . . . . . 685.1 (a) A 4-partite hypergraph representing a piece of lumber. Thenodes are labelled in the order \u03c3 that they are visited. Nodesoutside the \u2018distance span\u2019 cannot reasonably belong to thesame tree branch. (b) The decision candidates where only thesingleton and doubleton matching are permitted. See text fordetails. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 765.2 (a) Illustration of the decisions for a bipartite matching. Forbipartite matching, we only need to visit the nodes in oneof the partitions. In this illustration, we have \u03c3 = (3, 1, 5).The decision set for the nodes are presented with the selectednodes colored in yellow. (b) An illustration of sequentialdecision model used for knot matching application with eachpartition containing exactly one knot, labelled from 1 to 4.The rectangles represent matching and curly braces representedges. The visited nodes are colored in yellow whereas thenodes that are yet to be visited are unfilled. . . . . . . . . . . 775.3 (a) 4-partite hypergraph representing a piece of lumber. (b)The decision set for blue node #1. (c) The decision set for rednode #2 given the decision made by blue node #1. (d) Anexample of a final matching. This is a modified version of asimilar figure appearing in [2]. . . . . . . . . . . . . . . . . . . 795.4 (a) Example of Hasse diagram corresponding to the bipartitedecision model described in Section 5.4.1. (b) Example ofpartial order defined on bipartite matching. . . . . . . . . . . 86xi5.5 The state {{1, 2}, {3, 4}} can be reached by two distinct pathsfrom the initial state whereas the state {{1, 2, 3}} can bereached by three distinct paths. . . . . . . . . . . . . . . . . . 875.6 The possible parent states for different cases. Left: a statecontaining a 3-matching where all three nodes have been visited.Top right: containing a 3-matching and a singleton where allfour nodes have been visited. Bottom right: example of statesthat are not permitted under the decision model for the knotmatching application (see Section 5.4.2). . . . . . . . . . . . . 905.7 The performance on (a) OHSUMED, (b) TD2003, (c) imagematching. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 926.1 Sample lumber used in the real data analysis. Four sides ofthe boards with the wide surfaces shown on the first and thethird rows and the narrow surfaces shown on the second andthe last rows. . . . . . . . . . . . . . . . . . . . . . . . . . . . 986.2 3-dimensional view of the lumber (not to scale). An illustrationof 2-matching and 3-matching are provided. . . . . . . . . . . 996.3 A closer look at a segment of a plank. The matching forknot faces labelled 1, 2, 3, 4, produced by the human grader is{{1, 4}, {2, 3}}. . . . . . . . . . . . . . . . . . . . . . . . . . . 1006.4 Experiments where the sequence \u03c3 is given. (a) The plot ofRMSE as the number of nodes is increased. (b) The sampleposterior surface showing that the MAP estimate correctlyfinds the mode of the posterior. . . . . . . . . . . . . . . . . . 1076.5 Overcounting problem illustrated on sampling from uniformgraph matching. (a) The number of possible {2, 3}-matchingsfor a graph with four partitions and one node in each partition.(b)The root mean squared error with overcounting correction(blue), without the overcounting correction (red). . . . . . . 108xii6.6 The plot of the Q\u02dc function versus iterations. The error barscorrespond to Q\u02dc plus\/minus two times the sample standarddeviation (i.e., Q\u02dc\u00b1 2\u03c3\u02c6). The convergence of MC-EM seems tohave been reached in 10 iterations. . . . . . . . . . . . . . . . 1086.7 The trajectory of parameters in the MC-EM training versusiterations. The distance based covariates seem to play impor-tant roles in determining the correct matches compared to thearea based covariates. . . . . . . . . . . . . . . . . . . . . . . 1096.8 Single sample prediction accuracy and Jaccard index evaluationof the matching samples generated by SMC computed on thereal data for each board. The single sample prediction accuracyis perfect for all but 3 boards. The quality of the matchingsgenerated for each board by SMC appears reasonable basedon the values of Jaccard index. . . . . . . . . . . . . . . . . . 1106.9 Timing results on the simulated and the real boards. Theprediction times are within a second for the real boards. Thetiming results on the simulated boards serve to test the feasi-bility of deploying the SMC sampler in lumber mills. . . . . . 111xiiiAcknowledgementsI would like to thank my supervisors Dr. Bouchard-C\u00f4t\u00e9 and Dr. JamesZidek for their continued supports. This thesis would not have been possiblewithout their joint guidance. I would also like to thank the statistical machinelearning lab members, both the current and the past, for the fun memories.Also, I would like to thank my best friend Xi for being a good friend despitemy tendency to get moody at times. Last but not least, I would like to thankmy parents, my sister, and Tingting Yu for the support and love they haveshown me throughout this journey.xivChapter 1IntroductionThis thesis is motivated by two combinatorial problems: inference of phylo-genetic trees and graph matching on a specific hypergraph, an applicationarising from computational forestry. This chapter provides a gentle introduc-tion to each of the two problems so as to establish the common theme ofthe thesis as well as to provide the readers with background and a concretecontext.1.1 Phylogenetic treesThe first combinatorial object that we consider is the phylogenetic tree. Givena set of homologous biological species, a phylogenetic tree is a graph thatdescribes the relationship between the species. A phylogenetic tree is denotedT = (V,E, b) where the nodes V denote the species and the edges E denotethe evolutionary relationship between the species in V . Each of the edges areassociated with a branch length, b : E \u2192 (0,\u221e); b(e) for e = (u, v), u, v \u2208 Vcan loosely be interpreted as describing the similarity\/dissimilarity betweenthe two species u and v. A phylogenetic X-tree is a phylogenetic tree wherethe leaf nodes of the tree, X \u2282 V , are labelled (i.e., by the species forwhich genetic data is available). Some examples of phylogenetic X-treesare shown in Figure 1.1. In this figure, we have a set of five extant species:gibbons, orangutans, chimpanzee, gorilla, and humans. As can be seenfrom the trees shown in this figure, a phylogenetic tree depicts an ancestralrelationship between the species and considering that the phylogenetic tree isusually unobserved, each phylogenetic X-tree may represent one\u2019s hypothesisregarding an unknown ancestral relationship between a set of observed species.For example, in Figure 1.1 (a) hypothesizes that chimpanzee is closer to the1gibbon and orangutan whereas Figure 1.1 (b) hypothesizes that the specie ofchimpanzee is closer to the species of human and gorilla.A typical problem setup has the observations at the leaves provided inthe form of DNA sequences and the goal is to infer the latent phylogenetictree T by considering the evidence, the observed DNA sequences. To thatend, a Bayesian inference of phylogeny based on Markov chain Monte Carlo(MCMC) has been proposed [4] and gained popularity due in large part toavailability of a mature software technology [5, 6]. In the recent years, analternative Bayesian inference based on the sequential Monte Carlo methodshave started to be considered [7]. The necessity for multiple approaches toinference, in this case, MCMC and SMC, arise due to the magnitude of thespace of the phylogenetic X-tree. The number of possible tree topologies isequal to (2|X| \u2212 3)!! = 1\u00d7 3\u00d7 5\u00d7 ...(2|X| \u2212 3) [8]. For example, if |X| = 20,the number of possible trees is approximately 8.2\u00d71021. Further complicatingthe matter is that the branch lengths are also of inferential interest, whichmakes the object of interest a combination of discrete structure, that is thetopology of the phylogenetic X-tree as well as a set of continuous valuedrandom variables, one for each edge in the tree topology. The magnitude ofthe problem space calls for efficient computational algorithms; the SMC basedmethods provide a principled mechanism for interleaving pruning steps inbetween the proposal steps in the sequential construction of the phylogenetictrees and can be very effective in its exploration of the space, in particular,as it leads to natural parallelization of computation. To conclude, we remarkthat the problem of phylogenetic inference has important applications suchas in the study of the evolutionary dynamics of cancer, where the size ofcell population can be very large as well as viral phylodynamics involvingvirus samples obtained across time [9, 10]. The latest advances in sequencingtechnologies provide access to the genetic data of finest scale, opening theavenue towards significant progresses in these applications; however, thisis expected to lead to the growth in the scale of the data both in the sizeof the population (i.e., |X|) as well as in the length of the DNA (or RNA)sequences. For example, the length of sequence in the order of 10, 000 or largeris becoming more common and it is unclear if MCMC can cope with this2HumansGorillasChimpanzeesOrangutansGibbonsHumansGorillasChimpanzeesOrangutansGibbons(a) (b)Figure 1.1: Phylogenetic trees describing ancestral relationship of 5 species:(a) Relates gorilla as closest specie to humans, (b) relates chimpanzee, gorilla,and human to have derived from a common ancestor and the orangutansand gibbons to have derived from another ancestor. Each tree represents ahypothesis on evolutionary of the species.level of growth. Because the quality of approximation of the SMC methodsis provided in terms of the number of samples (also referred to as particles inthe SMC literature) used, for SMC to be useful in such large scale settings,there is a need for a scalable inference methods. This is the motivation forChapter 3.1.2 Knot matchingThe second problem that is considered in this thesis is motivated from aproblem in computation forestry: automatic grading of structural lumber.There is a growing trend amongst the lumber mills towards installation ofadvanced computer vision system along with the computing capacities toprocess high-definition images for real time grading of structural lumber.The high-definition imaging system captures the images of the surfaces of aboard, which are then analyzed by image processing algorithms to identifythe characteristics that are known to be detrimental to lumber. A knot is3a particular type of defect that is known to be the most common cause forfailure when a load is applied to a piece of lumber [11, 12]. Specifically,the knot faces are remnants of a tree branch that appear on the surfacesof the board as a result of sawing. An important task is to reconstruct the3-dimensional structure of the board by matching the knot faces observed onthe surfaces using the scanning technology; this leads to representation ofeach of the tree branches as a 3-dimensional convex body. A loss of strengthcan be computed by modelling of the distribution of force when a load isplaced on the lumber.To provide the full context, we have provided an overview of the proceduresinvolved in automatic grading of lumber in Figure 1.2. The knot detectionstep identifies the faces of the knots on the surfaces given the raw scans ofthe board. The knot matching step reconstructs the 3-D structure of thetree branches. The final step is to use the information extracted from knotmatching to produce the strength of a lumber or a prediction interval onthe strength of lumber from a strength prediction model. Each of these stepsare based on an uncertain process. For example, knot detection may failto detect a knot. Similarly, knot matching step may produce an incorrectmatching. Hence, the natural decision is to develop a probabilistic approachthat can capture uncertainty and moreover, a Bayesian approach is employedwhere uncertainty from each stage can be quantified and propagated forwardto the end stage seamlessly. The focus of this thesis is on developing aprobabilistic model for knot matching that admits efficient estimation of themodel parameters and sampling method for matchings using sequential MonteCarlo methods. The methodology developed here requires only the basicimaging technologies, eliminating the need for the mills to adopt equipmentthat is expensive to maintain, such as X-ray or CT scan based systems, whichrequire a specialist to be employed for maintenance as well as safety purposes.The development of a strength prediction model is beyond the score of thisthesis; however, there is already an on-going efforts towards this direction(see for example, [13]).4Knot DetectionKnot Matching1 3426571 342657......Strength Prediction ModelPrediction IntervalFigure 1.2: Lumber strength prediction pipeline. The knot detection scansthe board, and outputs the location and size of the knot faces. The knotmatching step formulates hypothesis regarding possible matchings. Thestrength prediction model produces prediction interval for the strength of thegiven lumber.51.3 OrganizationThe main theme of this thesis is on the sequential Monte Carlo methods forstatistical inference on combinatorial objects, in particular, on the phylo-genetic trees and graph matchings as described above. To that end, a newsequential Monte Carlo algorithm, streaming particle filter, is proposed. Thestreaming particle filter allows for use of as many particles as needed, which isnecessary for large scale phylogenetic inference problems. Also, a sequentialgraph matching model is developed for the knot matching problem. Thedevelopment of the sequential graph matching model is accompanied by asequential Monte Carlo sampler for graph matching. This sampler is fast andcan be used in real time to rapidly sample knot matchings.The rest of the thesis is organized as follows. In Chapter 2, a review ofsequential Monte Carlo methods is provided. The streaming particle filteris described in Chapter 3. The demonstration of the efficacy of streamingparticle filter on the phylogenetics inference problem is provided in Chapter 4.In Chapter 5, a probabilistic model for sequential graph matching, referred toas sequential decision model is presented. This is accompanied by an efficientestimation procedures for the model parameters along with SMC algorithmsfor sampling of graph matching according to the sequential decision model.The performance of the sequential decision model and the accompanied SMCsampler for graph matching on the knot matching application is consideredin Chapter 6. The thesis is concluded with a discussion of future work inChapter 7.6Chapter 2Review of Sequential MonteCarlo MethodsIn this chapter, sequential Monte Carlo methods described. Such methodsare useful for a wide array of inference problems. Perhaps the most fittingview of SMC for this thesis is as a principled sampling method that facilitatessequential construction of a combinatorial object. The sequential MonteCarlo methods form the focal point of this thesis; therefore, an effort is madeto provide as much detail as possible. This chapter begins by introducingSMC for inference about the latent states of a hidden Markov model. Then,underpinnings of some of the latest developments in SMC methods aredescribed as relevant to the problems considered in this thesis.2.1 Bayesian posterior inference for hiddenMarkov modelsThe specification of a simple hidden Markov model is as follows. The observa-tions, yr \u2208 R, arrive sequentially over time r = 1, ..., R. Each observation yr isknown to depend on an unobserved latent variable, xr \u2208 R. The relationshipbetween xr and yr is known and it is given by the probability density g(yr|xr)for r = 1, ..., R. The process behind the latent variables forms a Markovprocess with transition density from xr\u22121 to xr given by a density functionf(xr|xr\u22121). The joint distribution of x1:R, y1:R is written as,p(x1:R, y1:R) = \u00b5(x1)R\u220fr=1g(yr|xr)R\u220fr=2f(xr|xr\u22121), (2.1)7where the shorthand notation xi:j = (xi, ..., xj) and yi:j = (yi, ..., yj) are usedfor brevity. Note that the observation density as well as the transition density,f, g are associated with parameters. For ease of exposition, the parametersare assumed to be known and the notation for the parameters suppressed.A Bayesian approach to inference of the latent variables involves a posteriordistribution given the observations:pi(x1:R|y1:R) = p(x1:R, y1:R)m(y1:R)=p(y1:R|x1:R)\u03c4(x1:R)m(y1:R)\u221d \u03b3(x1:R), (2.2)where \u03b3(x1:R) = p(x1:R, y1:R), m(y1:R) =\u222b\u03b3(x1:R)dx1:R is the marginal like-lihood, and \u03c4(x1:R) is the prior over x1:R. In the SMC literature, the symbolsfor the observations are often omitted for brevity but also, to emphasize thatonly the latent variables x1:R are the random quantities of interest. It iscustomary in the SMC literature to denote the marginal data likelihood byZ = m(y1:R) and to express the target distribution by pi(x1:R) = \u03b3(x1:R)\/Z.The inferential goal usually involves evaluating the expectation of a testfunction \u03c6 with respect to pi, which is denoted as follows:pi\u03c6 = Epi[\u03c6(x1:R)] =\u222b\u03c6(x1:R)pi(x1:R)dx1:R. (2.3)In most cases, this integral is difficult to compute analytically. One approachis to approximate Equation 2.3 by sampling from pi. However, samplingdirectly from pi is difficult in practice; for one thing, the normalizationm(y1:R) is unknown and itself difficult to evaluate as it requires evaluating ahigh-dimensional integral over x1:R.2.2 Importance samplingThe importance sampling procedure is typically used when sampling fromthe target distribution exactly is difficult. To use importance samplingonly requires the user to choose a suitable proposal density \u03bd : RR \u2192 R,from which sampling is easy. The condition on \u03bd is relatively mild [14]:pi(x) > 0\u21d2 \u03bd(x) > 0 (and of course that \u03bd is integrable, which is assumed8to be satisfied if \u03bd is a probability density function). To approximate pi\u03c6for some test function \u03c6, draw samples xk1:R \u223c \u03bd(\u00b7) for k = 1, ...,K, andcompensate for the discrepancy between the target distribution pi and theproposal distribution \u03bd via the importance weight: wk = \u03b3(xk1:R)\/\u03bd(xk1:R).The importance sampling estimator for pi\u03c6 is given by,pi\u03c6 \u2248K\u2211k=1w\u00afk\u03c6(xk1:R), (2.4)where w\u00afk = wk\/\u2211Kk=1wk.Remark 1. Importance sampling produces an estimate of the normalizationconstant:Z\u02c6 =1KK\u2211k=1wk.This estimate is unbiased for Z =\u222b\u03b3(x1:R)dx1:R:E\u03bdZ\u02c6 = E\u03bdw(xk1:R) = E\u03bd[\u03b3(xk1:R)\u03bd(xk1:R)]=\u222b\u03b3(x1:R)dx1:R = Z.Furthermore, it is consistent by the law of large numbers (LLN) as K \u2192\u221e:1KK\u2211k=1wkp\u2212\u2192\u222b\u03b3(x1:R)dx1:R = Z.Remark 2. The estimator in Equation 2.4 is a consistent estimator for pi\u03c6.To see this, re-express the RHS of Equation 2.4 as,K\u2211k=1w\u00afk\u03c6(xk1:R) =K\u2211k=1wk\u03c6(xk1:R)K\u2211k=1wk=1KK\u2211k=1wk\u03c6(xk1:R)1KK\u2211k=1wk.The denominator of the rightmost expression converges in probability to Z asdiscussed in Remark 1. The numerator is consistent for \u03b3\u03c6 by the LLN:91KK\u2211k=1wk\u03c6(xk1:R) =1KK\u2211k=1\u03b3(xk1:R)\u03bd(xk1:R)\u03c6(xk1:R)p\u2212\u2192 E\u03bd[\u03b3(x1:R)\u03bd(x1:R)\u03c6(x1:R)]= \u03b3\u03c6.The consistency is given by the continuous mapping theorem [15, Thm 3.2.4],which is cited here for completeness: let h be a measurable function andDh = {x : h is discontinuous at x}. If Xk p\u2212\u2192 X and P(X \u2208 Dh) = 0, thenh(Xk)p\u2212\u2192 h(X).Let Wk = (Xk, Yk), where Xk = 1KK\u2211k=1wk\u03c6(xk1:R) and Yk =(1KK\u2211k=1wk).Let h(Wk) = Xk\/Yk to apply the continuous mapping theorem to get theconsistency.2.3 Sequential importance samplingOne potential difficulty with importance sampling is that it may be difficultto find a suitable multivariate proposal distribution even for a moderate valueof R. One remedy for this problem is to use sequential importance sampling(SIS). The idea is to construct x1:R by sampling one variable at a time from aunivariate proposal distribution. The SIS procedure is given in Algorithm 1.A good application for SIS is in fact, HMM, where the temporal structureof HMM leads to a natural order in which to sample the variables. That is,sample x1 first, then x2 given x1, and continuing in this fashion to sampleXR|XR\u22121. To use SIS, the user\u2019s task is to specify the density \u03bd(x1:r) in theform of \u03bd(x1:R) = \u03bd1(x1)\u220fRr=2 \u03bdr(xr|x1:r\u22121), where \u03bd1 is the initial proposaldensity and {\u03bdr}, the proposal densities for r = 2, ..., R. For r = 1, theweight can be computed as w(x1) = \u03b3(x1)\/\u03bd1(x1). For r \u2265 2, the weightcomputation can now be expressed recursively:10w(x1:r) =\u03b3(x1:r)\u03bd(x1:r)=\u03b3(x1:r\u22121)\u03bd(x1:r\u22121)\u00d7 \u03b3(x1:r)\u03b3(x1:r\u22121)\u03bdr(xr|x1:r\u22121)= w(x1:r\u22121)\u00d7 \u03b1(x1:r), (2.5)The function \u03b1 is often referred to as the incremental weight function. Notethat decomposing the sampling over a sequence of R steps brings forth thenotion of intermediate distribution. For any r \u2265 1, the joint distribution of(x1:r, y1:r) is well defined for HMM:\u03b3(x1:r) = \u00b5(x1)r\u220fj=1g(yr|xr)r\u220fj=2f(xr|xr\u22121).In general, the intermediate distributions are denoted with a subscript, forexample \u03b3r, whenever necessary to avoid any potential confusion.Algorithm 1 : SIS(y1:R, \u03b3r, \u03bdr,K)1: xk1 \u223c \u03bd1(\u00b7)2: wk1 \u2190 \u03b31(xk1)\/\u03bd1(xk1)3: for i = 1, . . . , R do4: xkr \u223c \u03bdr(\u00b7|xk1:r\u22121)5: xk1:r \u2190 (xk1:r\u22121, xkr )6: wkr \u2190 wkr\u22121 \u00d7 \u03b1(xk1:r)7: end for8: return (xk1:R, wkR)Kk=12.4 Sequential Monte Carlo2.4.1 BackgroundThe sequential importance sampling reduces the problem of finding an Rdimensional proposal distribution to that of finding an univariate proposaldistribution; however, sequential importance sampling is often found to suffer11from the problem known as particle degeneracy, where the population ofthe particles collapse to a single particle [16, 17]. In particular, this is aconsequence of the variance increasing in dimension R. To demonstrate thisproblem, consider the variance of Z\u02c6 produced by SIS:VarZ\u02c6 =1KVar(w(x1:R))=1K(E[w(x1:R)2]\u2212 E[w(x1:R)]2)=1K(\u222b\u03b32(x1:R)\u03bd2(x1:R)\u03bd(x1:R)dx1:R \u2212(\u222b\u03b3(x1:R)\u03bd(x1:R)\u03bd(x1:R)dx1:R)2)=1K(\u222b\u03b32(x1:R)\u03bd(x1:R)dx1:R \u2212 Z2),and the relative variance is given by,VarZ\u02c6Z2=1K(\u222bpi2(x1:R)\u03bd(x1:R)dx1:R \u2212 1).It can be seen from the relative variance that the importance distributionshould be chosen so that the weights are bounded, |w(x1:R)| = |pi(x1:R)\/\u03bd(x1:R)| 1 isdifficult in the context of sequential importance sampling, where the onlychoice the user has is over the local proposal density. Hence, the user mayonly be able to ensure for example, pir(xr|x1:r\u22121)\/\u03bdr(xr|x1:r\u22121) < Cr for someconstant Cr for r = 1, ..., R (note: pir(xr|x1:r\u22121) = pi(x1:r)\/pi(x1:r\u22121)). In12this case, the best bound for the relative variance of the estimator of thenormalization constant is given by by K\u22121(CR \u2212 1), where C = maxr Cr.Therefore, the relative variance can grow quite fast in certain settings (i.e.,exponential growth in R). An example of the problem that this can cause forSIS is demonstrated in [17, pp. 12] and also in Section 2.4.5 of this thesis.2.4.2 Algorithm descriptionOne remedy to guard against potentially exponential growth in the variance ofthe estimators is to interleave the resampling step between every subsequentproposal step. A commonly used resampling scheme is the multinomial resam-pling scheme parameterized by the normalized weights w\u00afr = (w\u00af1r , ..., w\u00afKr ) todraw the indices of the parent particles akr\u22121 \u2208 {1, ...,K} for k = 1, ...,K andr = 2, ..., R. Note that w\u00afkr = wkr\/\u2211Kj=1wkr , with wkr = w(xk1:r) as given inEquation 2.5. Sequential importance sampling with multinomial resamplingis commonly known as bootstrap particle filter (BPF) and it is attributed to[18]. The algorithm proceeds as follows:Iteration r = 1:(a) Proposal: xk1 \u223c \u03bd1(\u00b7).(b) Weight computation: wk1 = \u03b3(xk1)\/\u03bd1(xk1).(c) Weight normalization: w\u00afk1 = wk1\/s, where s =\u2211Kk=1wk1 .Iteration r = 2, ..., R:(a) Resampling: aKr\u22121 \u223c Multinomial(w\u00af1r\u22121, ..., w\u00afKr\u22121).(b) Proposal: xkr \u223c \u03bdr(\u00b7|xakr\u221211:r\u22121). Extend the particle path by xk1:r =(xakr\u221211:r\u22121, xkr ).(c) Weight computation: wkr = \u03b1(xakr\u221211:r\u22121, xkr )(d) Weight normalization: w\u00afkr = wkr\/s, where s =\u2211Kk=1wkr .The pseudocode for BPF is provided in Algorithm 2. At each iteration r,the particles along with the weights yield an empirical measure:p\u02c6ir =K\u2211k=1w\u00afkr \u03b4xk1:r, (2.6)13where \u03b4xk1:r is a Dirac delta measure at xk1:r. For a given test function \u03c6, theexpectation of \u03c6 with respect to pi is estimated by,pir\u03c6 \u2248 p\u02c6ir\u03c6 =K\u2211k=1w\u00afkr\u03c6(xk1:r).Consistency results for SMC compared to IS and SIS are harder to obtaincompared to SIS; extensive discussion about the convergence of SMC methodscan be found in [19].Algorithm 2 : BPF(y1:R, \u03b3r, \u03bdr,K)1: xk1 \u223c \u03bd1(\u00b7)2: wk1 \u2190 \u03b3(xk1)\/\u03bd1(xk1)3: w\u00afk1 \u2190 wk1\/\u2211k wk14: for i = 1, . . . , R do5: akr \u223c Multinomial(w\u00af1r\u22121, ..., w\u00afKr\u22121)6: xkr \u223c \u03bdr(\u00b7|xakr1:r\u22121)7: xk1:r \u2190 (xakr\u221211:r\u22121, xkr )8: wkr \u2190 \u03b1(xk1:r)9: w\u00afkr \u2190 wkr\/\u2211k wkr10: end for11: return (xk1:R, w\u00afkR)Kk=12.4.3 Role of resamplingIt is not immediately clear how SMC alleviates the particle degeneracyproblem. In fact, resampling appears to induce additional variance associatedwith the sampling of the parent particles. For example, the multinomialresampling scheme produces an approximation of Equation 2.6,p\u00afir(dx1:r) =1KK\u2211k=1Ok\u03b4xk1:r(dx1:r), (2.7)where Ok denotes the number of times the particle xk1:r is resampled. Ifanything, the variance of the estimator using p\u00afi should be higher than the14one using p\u02c6i given in Equation 2.6. The variance of the estimator of thenormalization constant when resampling is carried out at every iteration isgiven by [17, 19],VarZ\u02c6Z2=1K(\u222bpi21(x1)\u03bd1(x1)dx1 \u2212 1 +R\u2211r=2\u222bpi2r (x1:r)pir\u22121(x1:r\u22121)\u03bdr(xr|x1:r\u22121)dxr\u22121:r \u2212 1).(2.8)That is, if the incremental weights are bounded,|\u03b1(x1:r)| =\u2223\u2223\u2223\u2223 pir(x1:r)pir\u22121(x1:r\u22121)\u03bdr(xr|x1:r\u22121)\u2223\u2223\u2223\u2223 < C,then Equation 2.8 shows that the variance is increasing in the dimension, R,in an additive fashion. As explained in [17], the resampling step serves toreset the particle system.2.4.4 Effective sample sizeOne measure that is commonly used to quantify the goodness of the particlepopulation, {xk1:r, w\u00afkr}, at any given iteration is given by the effective samplesize [20, 17]:ESSr =(K\u2211k=1(w\u00afkr )2)\u22121. (2.9)Note that when w\u00afkr = 1\/K for all k = 1, ...,K, this is equivalent to drawingsamples exactly from the target distribution (i.e., the standard Monte Carloapproximation). In this case, the ESS is equal to K. At the other extreme,if a poor choice of proposal is made, it may lead to a collapse of particlepopulation to a single particle, where a single particle has most of the weight.In this case, the ESS would be equal to 1. As per [17], the effective sample sizecan be taken as a measure of the number of samples in the weighted particlepopulation corresponding to the ideal case if exact sampling were possible. Itserves as an important measure in assessing the quality of estimation usingthe samples generated by an SMC algorithm.150501001502000 10 20 30 40 50TimeObservation051015200 10 20 30 40 50TimeLatent Variable(a) (b)Figure 2.1: Simulated Ricker data: (a) the observations and (b) the latentvariables.2.4.5 Example: Ricker modelIn this section, a concrete example is considered for the purposes of comparingSMC to SIS. The model to be considered is the Ricker model, which is adynamic model of population growth (see for example, [21, Sec 4.5]). Thetransition kernel is as follows,Xr = \u03b7Xr\u22121 exp(\u2212Xr\u22121 + er), er \u223c Normal(0, \u03c32). (2.10)And the observations are distributed as follows,Yr \u223c Poisson(\u03c6Xr). (2.11)Here, the vector of model parameters is (\u03b7, \u03c6, \u03c32), which is assumed to begiven. The Ricker model is used as an example in [21, Sec 4.5] where plug-and-play algorithms are discussed. The simulation procedure for xr givenxr\u22121 for the Ricker model is as follows: first simulate \u000fr \u223c N(0, \u03c32) and then,16set xr = \u03b7xr\u22121 exp(\u2212xr\u22121 + \u000fr). The weight update formula is given by,\u03b1(x1:r\u22121, xr) =\u03b3r(x1:r)\u03b3r(x1:r\u22121)\u03bdr(xr|x1:r\u22121)=\u00b5(x1)g(y1|x1)\u220frj=2 f(xj |xj\u22121)g(yj |xj)\u00b5(x1)g(y1|x1)\u220fr\u22121j=2 f(xj |xj\u22121)g(yj |xj)f(xj |xj\u22121)= g(yr|xr),which only requires evaluation of the density of the Poisson distribution. Forevaluation purposes, a sequence of observations is simulated according tothe Ricker model with R = 50 and the parameters \u03b7 = 44.7, \u03c32 = 0.3, \u03c6 =10.0, x0 = 7 (this corresponds to the parameter values considered in [21]).The generated data (xr, yr) are shown in Figure 2.1.Note that importance sampling is not practical as 50-dimensional proposaldistribution is difficult to find. One may consider multivariate Gaussiandistribution; however, this may lead to using a prohibitively large number ofparticles to attain satisfactory estimation, especially since there is no upperbound for the weights for particles proposed from multivariate Gaussiandistribution. Therefore, only SIS and the bootstrap particle filter (BPF) arecompared, both, using K = 1, 000 particles.The evaluation is carried out to compare the quality of the approximationof the expectation where the test function is chosen as \u03c6r(x1:r) = xr:\u03c6rpir = Epir [Xr] \u2248K\u2211k=1w\u00afkrxkr . (2.12)A plot of the absolute error at each iteration between this estimator to thetrue, simulated values of xr, is given in Figure 2.2 (a) for SIS and BPF. Ascan be seen from this figure, the estimate for SIS is quite poor whereas theerror is stable and near 0 for BPF. The reason for this is given in Figure 2.2(b) where the effective sample size for both SIS and BPF are plotted. It canbe seen that the ESS becomes 1 early on in the simulation, which indicatescollapse of the particle population to a single particle for SIS. The ESS ofthe bootstrap particle filter does not suffer from the same problem.1702040600 10 20 30 40 50TimeAbsolute ErrorSISBPF(a)025050075010000 10 20 30 40 50TimeEffective Sample SizeSISBPF(b)Figure 2.2: Inference of the latent variables using sequential importancesampling and bootstrap particle filter: (a) absolute error between the truthand the weighted mean prediction and (b) the effective sample size at eachiteration. The absolute error for SIS can be seen to be poor compare to BPF.The ESS for SIS reaches 1, which is indicative of collapse of population to asingle particle.2.5 SMC samplerThe discussion thus far has focussed on describing the main ideas behind astandard sequential Monte Carlo algorithm. One important aspect of theSMC algorithm that has not been explicitly stated is that it operates on aspace of increasing dimension. To be specific, the state space for the SMCfor a hidden Markov model is increasing in dimension: S1 = R, S2 = R2, andSr = Rr. Until a key development was proposed in [22], SMC methods weremore or less limited to this type of set up. In this section, the focus is placed18on describing a key idea of [22].2.5.1 Problem set-up and notationLet s \u2208 S denote the object of interest, e.g., a phylogenetics tree, andS, the space of interest (or the support). The goal is to sample from thetarget distribution pi(s) = \u03b3(s)\/Z. In a Bayesian set up, pi is the posteriordistribution, specifically,pi(s) = pi(s|y) = p(y|s)p(s)\u222bp(s, y)ds=\u03b3(s)Z,where y denotes the observations. As before, the notation for y is suppressed.For a simple set up, where s can be sampled directly, importance samplingis a natural choice to sample from S. That is, sample sk \u223c \u03bd(\u00b7) and computethe particle weights w(sk) = \u03b3(sk)\/\u03bd(sk) to approximate \u03c6pi. However, theassumption is that S takes on a complex structure, in particular, that theobjects in S can only be constructed over a sequence of steps. SequentialMonte Carlo methods provide a principled approach to constructing suchobjects by integrating the evidence (data) sequentially, via proposal, weighting,and resampling steps. The problem is that the target space S is not associatedwith any temporal structure, as was the case for the HMM.For the set up under consideration, the notion of intermediate state spaceand the intermediate target distributions, Sr and pir = \u03b3r\/Zr, play definitiveroles and these need to be explicitly specified. Recall that the goal is toconstruct a full state s upon termination of an SMC algorithm, therefore,the design of the intermediate state spaces should lead to SR = S, the targetspace of interest. Assume that {\u03bdr} is a sequence of proposal distributionssuch that 1) each proposal distribution takes a partial state sr\u22121 \u2208 Sr\u22121and generates sr \u223c \u03bdr(\u00b7|sr\u22121), where sr \u2208 Sr and, 2) R applications of theproposal to the initial state s0 leads to an object s \u2208 S. Additionally, assumethat the sequence of proposal distributions are such that every possible states \u2208 S can be realized starting from s0.192.5.2 Auxiliary variable techniquesThe procedure described in [22] is to define an artificial target distributionby introducing the backward proposals, \u03bd\u2212r . For clarity, the forward proposalsare denoted by \u03bd+r to differentiate them from the backward proposal. First,the desired target distribution is expanded as, p\u02dcir(s1:r) = \u03b3\u02dcr(s1:r)\/Zr where\u03b3\u02dcr(s1:r) = \u03b3r(sr)\u220fr\u22121j=1 \u03bd\u2212j (sj |sj+1). These artificial target distributions takethe problem to the standard SMC set up, with increasing dimensionality:S\u02dcr = S1 \u00d7 ...\u00d7 Sr. The weight update is given by:w(s1:r) =\u03b3\u02dcr(s1:r)\u03bdr(s1:r)= w(s1:r\u22121)\u00d7 \u03b3r(sr)\u03b3r\u22121(sr\u22121)\u03bd\u2212r\u22121(sr\u22121|sr)\u03bd+r (sr|sr\u22121).The artificial target distribution admits the desired target distributionas a marginal distribution since the backward proposals are probabilitydistributions:\u222bp\u02dciR(s1:R)ds1:R\u22121 =\u222b\u03b3\u02dcR(s1:R)ZRds1:R\u22121=\u03b3R(sR)ZR\u222b R\u22121\u220fr=1\u03bd\u2212r (sr|sr+1)ds1:R\u22121= piR(xR).Also note that since the backward proposals are already normalizedprobability densities, therefore, they do not affect the normalization constantZR. Upon termination, the particle population at the final iteration R can beused to approximate the expectation of a given test function \u03c6 with respectto pi as for the standard SMC:pi\u03c6 \u2248K\u2211k=1w\u00afkR\u03c6(skR).202.6 Combinatorial SMCIn this section, a description is given for formulating an SMC algorithm whenS takes on a combinatorial space [23].2.6.1 Partially ordered setA key concept in developing an SMC sampler on a combinatorial state spaceis the notion of a partially ordered set (poset), denoted (S,\u227a). The partialorder, \u227a, is a binary relation on the elements of S that is 1) reflexive, 2) anti-symmetric, and 3) transitive. Note that not all elements of S are comparable.A partially ordered set can be represented using a Hasse diagram, which isan undirected graph G = (S, E) where the nodes are the elements of the setS and an edge exists between the nodes s, s\u2032 \u2208 S if s\u2032 covers s, which is tosay that s \u227a s\u2032 and there does not exist s\u2032\u2032 \u2208 S such that s \u227a s\u2032\u2032 \u227a s\u2032. Thepartial order endows the general state space with a sequential structure, asneeded by the SMC algorithm. Note that in the context of SMC, the partialorder is endowed by the sequence of proposal distributions, {\u03bd+r }.2.6.2 Example: Binary tree (an illustration of an incorrectway to design an SMC sampler)Suppose that the goal is to sample a binary tree on L leaves. For this example,the interest lies only on the topology, so assume that every edge has length 1.To simplify matters, let the target distribution be the uniform distributionover all possible topologies over L leaves. To formalize things, let T (l) denotethe space of the binary trees on l leaves. The target distribution is definedon t \u2208 T (L):pi(t) = 1\/Z \u221d 1 = \u03b3(t),where Z = |T (L)|, denotes the total number of tree topologies with L leaves.In fact, this uniform distribution can be defined on T (l) for any l:pil(t) = 1\/Z l \u221d 1 = \u03b3l(t),21where the superscript l distinguishes the different spaces (note that piL = piby this notation).An intuitive approach to constructing a binary tree on L leaves is tobuild it in a bottom up fashion using the proposal that samples a pair ofsubtrees to merge at random. Let the initial state s0 be a collection of Ltrivial trees (i.e., s0 = {t1, ..., tL}, where tl for l = 1, ..., L are singleton trees).The bottom-up construction proceeds as follows, at iteration r = 1, twosingleton trees are chosen at random. Without loss of generality let the twotrees chosen be denoted t1, t2. A new tree is formed, t\u2032, with t1 and t2 as theleft and right subtrees. In iteration r = 2, two trees are chosen at randomamong {t\u2032, t3, ..., tL}. And so on. This proposal distribution can be writtenas,\u03bd+r (sr|sr\u22121) =1(|sr\u22121|2) ,where |sr| denotes the number of trees in the forest sr. Note that randomlymerging a pair of trees among the trees available decreases the number oftrees in the state; hence, after R = L\u2212 1 iterations, a full binary tree can beconstructed as desired.The poset induced by this proposal corresponds to the forest of trees,where each state s \u2208 S is a collection of binary trees. An intermediatestate space Sr denotes the space of forests on L \u2212 r binary trees and theintermediate target distributions are given by:pir(sr) =\u220ft\u2208srpi|t|(t) \u221d 1,where |t| denotes the number of leaves of t. Recall that pi|t| = 1\/Z |t| asdescribed above. For sr\u22121 \u2208 Sr\u22121, if sr \u2208 {sr | \u03bd+r (sr|sr\u22121) > 0}, then theincremental weight function is as follows,\u03b1(sr\u22121, sr) =\u220ft\u2208sr \u03b3|t|(t)\u220ft\u2208sr\u22121 \u03b3|t|(t)\u00d7 \u03bd+(sr|sr\u22121)\u221d(|sr\u22121|2)\u221d 1,22and 0 if sr is not in the support set of \u03bd+r (\u00b7|sr\u22121).An experiment was carried out using an SMC algorithm designed withthe above specifications, on L = 4 leaves. It is easy to check that thereare 15 possible tree topologies. Let t\u2032 \u2208 T (4) be the balanced tree given inFigure 2.3 (a) and let t\u2032\u2032 \u2208 T (4) be the unbalanced tree given in Figure 2.3(b). The expectations, E1[T = t\u2032] and E1[T = t\u2032\u2032] should converge to 1\/15as the number of particles is increased since the goal is to sample from theuniform distribution over all possible topologies on L = 4 leaves. The resultsare shown in Figure 2.3 (c). As can be seen from this figure, the estimatesare not tending towards 1\/15 \u2248 0.067, regardless of the number of particlesused. The reason behind the incorrectness is discussed in the next section.2.6.3 Overcounting problemThe bias in the estimation is due to the problem known as the overcountingproblem [23]. This problem is attributed to the fact that the states in theposet may be reached by a differing number of paths. For example, thebalanced tree shown in Figure 2.3 (a) can be reached in two ways: 1) thenodes A and B are merged first, then the nodes C and D or, 2) C, D aremerged first, then A, B. On the other hand, the unbalanced tree shown inFigure 2.3 (b) can only be reached in one way: merge A and B, then E with C,then F with D (see Figure 2.4). This is the explanation behind the estimatefor the balanced binary tree tending towards 2\/18 \u2248 0.111 in Figure 2.3 (c)and 1\/18 \u2248 0.056 for the unbalanced trees. The observation to be made here,is that the backward proposal, which is an integral part of SMC samplersmethodology is not specified in the above example.2.6.4 Combinatorial SMCThe general solution to the overcounting problem is detailed in [23]. Thegeneral solution is to correct for this by constructing the poset to be freeof the multiple paths problem: from s0 to any state s \u2208 S, there should beexactly one path. The essential idea is to formulate a larger state space:S\u02dcr =\u22c3ri=1 Sr with SMC operating on S\u02dc =\u22c3r S\u02dcr. It is easy to verify that23A B C D A B C DEFG(a) (b)0.040.060.080.100.120 2000 4000 6000Num. ParticlesEstimateA balanced TreeAn unbalanced Tree(c)Figure 2.3: (a) A balanced binary tree on 4 leaves. (b) An unbalancedbinary tree on 4 leaves. (c) The results of incorrect analysis carried out usingSMC. The estimates are biased as they deviate from the expected value of1\/15 \u2248 0.067.24this expanded state space eliminates the multiple paths problem: for anystate s\u02dcr = (s1, ..., sr) \u2208 S\u02dc, there is exactly one parent state, which is obtainedby removing sr.The intermediate distribution is given by p\u02dcir(s\u02dcr) = pir(sr)\u220fr\u22121j=1 \u03bd\u2212j (sj |sj+1)and the incremental weight update by,\u03b1(s\u02dcr\u22121, s\u02dcr) =p\u02dcir(s\u02dcr)p\u02dcir(s\u02dcr\u22121)\u03bd+r (s\u02dcr|s\u02dcr\u22121)=pir(sr)\u03bd\u2212r\u22121(sr\u22121|sr)pir\u22121(sr\u22121)\u03bd+r (sr|sr\u22121). (2.13)The assumption on the backward proposal is as follows (see [23, Assumption1]):Assumption 1. For all s, s\u2032 \u2208 S, \u03bd+(s\u2032|s) > 0 implies \u03bd\u2212(s|s\u2032) > 0.A specific choice of the backward kernel that satisfies Assumption 1 isgiven by,\u03bd\u2212(s|s\u2032) = |Q(s\u2032)|\u22121 \u00d7 1[\u03bd+(s\u2032|s) > 0], (2.14)where Q(s\u2032) denotes the set of parent states of s\u2032. Note that this choice isnot unique and there is an optimality criterion for the backward kernels, butthese typically require computation of intractable integrals [22].The underlying workings of combinatorial SMC may appear complex.However, the user only needs to specify the sequence of forward proposalsfor constructing the combinatorial object, such that starting from a suitablychosen initial state s0, the object of interest can be constructed in R steps,as well as the backward proposal to use combinatorial SMC.2.6.5 Example: Binary tree (continued)The same analysis was carried out on the binary tree, but with the overcount-ing correction. Note that the backward kernel is easy to calculate since thenumber of parent states of any s \u2208 S is equal to the number of non-trivialtrees in s (i.e., the number of trees where the root node has non-empty chil-dren). The results are shown in Figure 2.5. The estimate of the proportionof the balanced and unbalanced trees given in Figure 2.3 (a) and (b) bothconverge towards 1\/15 \u2248 0.067 as desired.25A B C DA B C DA B C D C DA BA B C DA B C DFigure 2.4: An example of a cycle in the Hasse diagram induced by the choiceof proposal distribution given in Section 2.6.2. The states connected by rededges form a cycle in the Hasse diagram, leading to the overcounting problem.2.7 Particle Markov chain Monte CarloThe review of SMC methods concludes with the discussion of parameterestimation. Although there are many approaches to tackle this problem, thefocus here is on the particle Markov chain Monte Carlo (PMCMC) framework,in particular on particle marginal Metropolis-Hastings (PMMH) methodology,which is a specific type of PMCMC algorithm. The PMCMC admits fullBayesian inference of the parameters and the latent variables by using SMCon the space of the latent variables and familiar MCMC algorithms such asMetropolis-Hastings for the parameter space.2.7.1 NotationAn important idea behind particle MCMC is to make explicit, all of therandom variables involved in an SMC algorithm. Let x1, ...,xR denote thelatent variables generated at each iteration, where xr = (x1r , ..., xKr ). Inaddition, SMC generates via resampling, the ancestor indices, a1, ...aR\u22121,where akr\u22121 \u223c s(\u00b7|w\u00afr\u22121) for r = 2, ..., R. Note that s denotes the distributionfor the resampling step. For example, s is the multinomial distribution260.050.100.150 2000 4000 6000Num. ParticlesEstimateA balanced TreeAn unbalanced TreeFigure 2.5: An example of correct analysis carried out using SMC.parameterized by the normalized weight wr\u22121 for the bootstrap particlefilter. To facilitate the description of PMCMC, the authors of [24] introducea notation for the ancestral lineage of the k-th particle: bk1:R = (bk1, ..., bkR).That is, bkR = k and bkr = abkr+1r for r = 1, ..., R \u2212 1 (see [24, Figure 1] forexample). The distribution over the random variables generated by PMCMCcan be expressed as follows,\u03c8(x1, ...,xR,a1, ...,aR\u22121) =K\u220fk=1\u03bd1(xk1)R\u220fr=2{s(ar\u22121|wr\u22121)K\u220fk=1\u03bdr(xkr |xakr\u221211:r\u22121)}.272.7.2 Particle independent Metropolis-Hastings samplerThe particle independent Metropolis-Hastings (PIMH) algorithm is a Markovchain Monte Carlo algorithm to sample the latent variables x1:R from thetarget distribution pi(x1:R). Typically the target distribution is the posteriordistribution pi(x1:R) = p(x1:R|y1:R). Note that PIMH does not sample themodel parameters; however, it serves as a great starting point towardsdescribing particle marginal Metropolis-Hastings (PMMH), which does samplethe model parameters.Algorithm descriptionThe PIMH algorithm is initialized as follows. First, execute an SMC algorithmthat produces an approximation p\u02c6i(x1:R) of the target distribution and anapproximation of the marginal likelihood, p\u02c6(y1:R). Set X1:R(0) \u223c p\u02c6i (forexample, by drawing an index k \u223c s(\u00b7|w\u00afR)) and Z\u02c6(0) = p\u02c6(y1:R) as aninitialization of the Markov chain. For iteration i \u2265 1, execute an SMCalgorithm to produce an approximations p\u02c6i(x1:R) and Z\u02c6\u2217 = p\u02c6(y1:R). Thendraw X\u22171:R \u223c p\u02c6i. With probability,max{1,Z\u02c6\u2217Z\u02c6(i\u2212 1)}, (2.15)set X1:R(i) = X\u22171:R, Z\u02c6(i) = Z\u02c6\u2217. Otherwise, set X1:R(i) = X1:R(i\u2212 1), Z\u02c6(i) =Z\u02c6(i\u2212 1).Why PIMH worksTo see why PIMH works, note that the procedure is a standard independentMetropolis-Hastings (IMH) algorithm on (k,x1, ...,xR,a1, ...,aR\u22121), wherek is the index of the lineage selected from p\u02c6i. The proposal distribution ofthis IMH algorithm is,qK(k,x1, ...,xR,a1, ...,aR\u22121) = w\u00afkR\u03c8(x1, ...,xR,a1, ...,aR\u22121),28where w\u00afkR is the probability of drawing the k-th lineage as the proposal X\u22171:R.It was noted in [24, Section 4.2] that with the following target distribution,p\u02dciK(k,x1, ...,xR,a1, ...,aR\u22121) =pi(xk1:R)KR\u03c8(x1, ...,xR,a1, ...,aR\u22121)\u00b5(xbk11 )R\u220fr=2r(bkr\u22121|wr\u22121)\u03bd(xbkrr |xbkr\u221211:r\u22121),the MH ratio works out as given in Equation 2.15. By construction, theMetropolis-Hastings algorithm satisfies the detailed balance, aperiodicity, andpositive recurrence conditions, which establishes p\u02dci as the unique stationarydistribution as well as the ergodicity of the chain (see for example, [14,Chp 7]). Note that the target distribution, pi(x1:R) appears as the marginaldistribution inside p\u02dci and hence, it follows from the standard Markov chaintheory that the chain X1:R(i) has pi(x1:R) as the stationary distribution.The assumptions that are needed for valid PIMH algorithm are mild, andin fact, very much the same conditions needed for a valid SMC algorithmsare required and that the lineage k be chosen with probability w\u00afkR (see [24,Section 4.1] for details).2.7.3 Particle marginal Metropolis Hastings samplerThe focus now shifts to sampling from the posterior distribution of the un-known parameters \u03b8 and the latent variables x1:R given the observations y1:R.The posterior is denoted by, pi(\u03b8, x1:R) = p(\u03b8, x1:R|y1:R), which decomposesas,p(\u03b8, x1:R|y1:R) = p(\u03b8|y1:R)p(x1:R|y1:R, \u03b8). (2.16)The PMMH algorithm proceeds as follows. First, initialize \u03b8(0) (for example,from the prior distribution p(\u03b8)). Then, run an SMC algorithm to produceapproximations p\u02c6(x1:R|y1:R, \u03b8(0)) and p\u02c6(y1:R|\u03b8(0)). As in the PIMH sampler,draw a lineage X1:R(0) \u223c p\u02c6(x1:R|y1:R, \u03b8(0)) and set Z\u02c6(0) = p\u02c6(y1:R|\u03b8(0)). Foriterations i \u2265 1, draw \u03b8\u2217 \u223c q(\u00b7|\u03b8(i\u2212 1)), where q is the proposal distributionon \u03b8. Then, run SMC algorithm to produce approximations p\u02c6(x1:R|y1:R, \u03b8\u2217)29and Z\u02c6\u2217 = p\u02c6(y1:R|\u03b8\u2217). With probability,max{1,Z\u02c6(\u03b8\u2217)Z\u02c6(i\u2212 1)q(\u03b8(i\u2212 1)|\u03b8\u2217)q(\u03b8\u2217|\u03b8(i\u2212 1))}, (2.17)set Z\u02c6(i) = Z\u02c6(\u03b8\u2217) and X1:R(i) = X\u22171:R. Otherwise, set Z\u02c6(i) = Z\u02c6(i \u2212 1) andX1:R(i) = X1:R(i\u2212 1).Similar to PIMH, PMMH algorithm is a Markov chain on the parameters \u03b8as well as all of the variables generated by SMC: (\u03b8, k,x1, ...,xR,a1, ...,aR\u22121).The proposal distribution is given by,qK(\u03b8\u2217, k,x\u22171, ...,x\u2217R,a\u22171, ...,a\u2217R\u22121|\u03b8) = q(\u03b8\u2217|\u03b8)w\u00afkR\u03c8(x\u22171, ...,x\u2217R,a\u22171, ...,a\u2217R\u22121),which along the following target distribution,p\u02dci(\u03b8, k,x1, ...,xR,a1, ...,aR\u22121) =pi(\u03b8, xk1:R)KR\u03c8(x1, ...,xR,a1, ...,aR\u22121|\u03b8)\u03bd1(xbk11 |\u03b8)R\u220fr=2s(bkr\u22121|wr\u22121)\u03bdr(xbkr\u22121r |xbkr\u221211:r\u22121),yields the MH ratio given in Equation 2.17. The assumptions needed for avalid PMMH algorithm are mild (see [24, Section 4.4]).30Chapter 3Streaming Particle FilterIn this chapter, we describe the streaming particle filter. The method ismotivated by an application in phylogenetics where storage of the particles inthe computer memory becomes too prohibitive to allow for use of sufficientnumber of particles. However, the method developed in this chapter is ageneral SMC method that is applicable to the problems where SMC maybe chosen for inference. We focus on providing the algorithmic descriptionas well as fundamental convergence results in this chapter. The details ofphylogenetics application is given in Chapter 4.3.1 Algorithm descriptionWe begin this chapter by providing an alternative view of an SMC iterationas performing a set of operations on a particle population {xk1:r\u22121, w\u00afkr\u22121},k = 1, ...,K, approximating pi(x1:r\u22121|y1:r\u22121) to generate a new population,one that approximates pi(x1:r|y1:r). The set of operations performed for thestandard SMC are resampling, proposal\/extension, weight computation, andweight normalization:1. resampling: sample an ancestor index akr\u22121 \u223c r(\u00b7|w\u00afr\u22121),2. proposal (extension): xkr \u223c \u03bd(\u00b7|xakr\u22121r\u22121 ), and set xk1:r = (xakr\u22121r\u22121 , xkr ),3. weight computation: wkr = w\u00afakr\u22121r\u22121 \u00d7 \u03b1(xakr\u221211:r\u22121, xkr ), and4. weight normalization: w\u00afkr = wkr\/\u2211Kj=1wjr.This set of operations are performed K times, at which point we obtain anew particle population {xk1:r, w\u00afkr}, k = 1, ...,K, that extends the previous31particle population to approximate pi(x1:r|y1:r). The streaming particle filter(SPF) can be viewed as expanding on these set of operations performedduring one iteration of SMC. The expanded set of operations are composedof expansion, implicit proposal, summation, explicit proposal, division, andcontraction. The expanded set of operations are grouped into two task groups.The set of operations performed by SPF are designed to automaticallydetermine the stopping time, a suitable number of particles as determinedby effective sample size. This is desirable specifically for memory intensiveapplication such as phylogenetics because it results in smaller number ofmemory write operations when compared to standard SMC algorithm underthe same computational budget. Hence, SPF attains an improved accuracyin estimation at the expense of computational time. Before we describe thetask groups in details, we begin by introducing a key idea of stochastic mapsintroduced in the perfect simulation literature [25].3.1.1 Stochastic mapsLet Q : S \u00d7 FS \u2192 [0, 1] denote a Markov transition kernel. The Markovtransition kernels are fundamental objects in inference in sequential MonteCarlo methods as well as Markov chain Monte Carlo algorithms. For instance,a Markov transition kernel for a Metropolis-Hastings algorithm is constructedby first drawing from a proposal distribution given the current state, thenaccept-reject by drawing a random variable distributed according to theuniform distribution on [0, 1]. Note that each application of Markov kernelcan be characterized by a uniform number u \u2208 [0, 1]. In fact, much ofmodern computer simulation relies on generating random numbers startingfrom a uniform random variable U \u223c Uniform[0, 1]. For example, if theproposal step of MH algorithm draws from Gaussian distribution then, wecan draw a sample from the Gaussian distribution via inverse CDF method.Therefore, a stochastic map associated with a kernel Q can be defined asa map, F : [0, 1] \u00d7 S \u2192 S such that Q(s,A) = P(F (u, s) \u2208 A) for s \u2208 S,u \u2208 [0, 1], and A \u2208 FS . Typically, we write F (u, s) = F (s) to suppress thedependence on the uniform random number.32With the notion of stochastic maps, an MCMC algorithm can now bedescribed by application of N i.i.d stochastic maps where the n-th state canbe attained by Fn \u25e6 ... \u25e6 F1(x0) = Fn(Fn\u22121(...(F1(x0))...) for some arbitraryinitial state s0 \u2208 S and n \u2208 {1, ..., N}. The same principle applies whenQ is a Markov kernel used in an SMC algorithm. Each iteration of SMCcan be viewed as drawing N i.i.d. stochastic maps, applied to the particlepopulation from the previous iteration.3.1.2 Task group 1: streaming expansion, implicit proposal,summationExpansionThe expansion operation generalizes the resampling operation but ratherthan resampling K times from the particle population, we resample N \u2265 Ktimes. The expansion operation, much like resampling operation, generatesan ancestor indices anr\u22121, n = 1, ..., N .Implicit proposalThe implicit proposal operation draws a sample from proposal distributionto generate xkr \u223c \u03bdr(\u00b7|xakr\u22121r\u22121 ) and computes the weight, wkr = \u03b1(xakr\u221211:r\u22121, xkr ).Contrary to the regular proposal operation, we discard the generated valuesxkr , and do not update the particle lineage at this stage. This can be of greatadvantage when the size of the particle xk1:r is prohibitive where writing thevalues of xk1:r to the computer memory can be a bottleneck. Such a casearises in phylogenetics (more details are provided in Chapter 4).Summation subtaskThe summation operation is a subtask of weight normalization. The weightsgenerated from the implicit proposal operation update the sum of the weights:s\u2190 s+ wkr .Upon termination of the task group 1, we have the sum of the weights, s,accumulated from expanding the previous particle population and proposing33N particles. However, the concrete values of the population are not keptin computer memory. Therefore, the storage complexity is in the orderO(1). Note that a stochastic map represents the operations performed above,initialized by a uniform random number.3.1.3 Task group 2: explicit proposal, division, contractionThe operations in this group begin by re-initializing the stochastic mapsused in Task group 1 by re-using the uniform random number generator.This allows us to replay the random computations performed in task group1. Additionally, in task group 2, we generate K sorted standard uniformnumbers, V(1), ..., V(K), where K denotes the number of particles that can beaccommodated in memory.Explicit proposalWe replay the expansion and proposal operations performed in Task group 1.This yields anr\u22121, xnr and wnr .Division subtaskThe division subtask of the normalization operation is performed using thesum of total weights s: w\u00afnr = wnr \/s. Additionally, we keep the cumulativesum of the normalized weights, denoted s\u00af\u2190 s\u00af+ w\u00afnr .ContractionThe contraction operation determines if the particle xnr is to be selected inthe population of particles to be passed to the next iteration of SPF. Thiscan be performed by checking if one or more of the K sorted numbers isin the interval [s\u00af, s\u00af + w\u00afnr ). For all k such that V(k) \u2208 [s\u00af, s\u00af + w\u00afnr ), then theparticle is selected, and it is stored in the memory by extending the particlelineage as, xk1:r = (xanr\u22121r\u22121 , xnr ). Otherwise, the particle is discarded. Note thatwe can adopt lazy computation by storing the particle weights, but not theparticles in the implicit proposal step in Task group 1. In the lazy version,34K......Generation rGeneration r-1 Generation r+1...... ...NExpansion Proposal Norma-lizationContraction\u2211 \u00f7Task Group 1Task Group 2Figure 3.1: Overview of the streaming particle algorithm. The circles areparticles. The black particles show the subset of the implicit particles thatare selected in the contraction step. We also show the grouping of the tasksto Task group 1 and Task group 2.we perform explicit proposal only if xnr is selected in the contraction step.This can help to reduce the running time to O(K) from O(N) (this point isillustrated in Section 4.4).Upon termination of Task group 2, we obtain a particle population of size Kapproximating the target distribution,p\u02c6i(x1:r) =1KK\u2211k=1\u03b4xk1:r(x1:r),We have provided a schematic diagram of the streaming particle filter inFigure 3.1.3.2 Stopping timeThe vanilla version of SPF generates a fixed number of implicit particlesacross the iterations of SMC. Although this version is useful for memory35intensive settings, we may be able to benefit from adaptively selecting thenumber of implicit particles depending on the problem at hand. To this end,we introduce a stopping time Nr defined in terms of the effective sample size,Nr = min{K \u2264 n \u2264 N\u2217 : ESS(w\u00af1, ..., w\u00afn)K\u2265 \u03ba}, (3.1)where \u03ba > 0 is the desired level of effective sample size and K,N\u2217 denotethe minimum and maximum number of particles to be used in the algorithm.The minimum, K, is needed to ensure that we use sufficient number ofparticles in the algorithm. The upper bound, N\u2217, is needed to bound themaximum computation time at each iteration. Now, recall that the effectivesample size for population of particles of size N is given byESS(w\u00af1, ..., w\u00afN ) =(N\u2211n=1w\u00af2n)\u22121=(N\u2211n=1wn)2N\u2211n=1w2n. (3.2)That is, to determine the stopping time in real time, we only need tokeep track of the sum of the unnormalized weights and the sum of the squareof the unnormalized weights. Therefore, incorporating this stopping timecriterion can be accomplished in the Task group 1 by keeping track of thesum of squared weights in addition to the sum of the weights. Note thatESS\/K can be viewed as relative ESS, with \u03ba = 1 we can view the stoppingtime as the desired number of effective sample size that we wish to use in thealgorithm.3.3 Convergence resultsIn this section, we provide some convergence results of the estimators derivedfrom the streaming particle filter. The proof techniques used are standard(see [19]); however, due to the two resampling rounds that appear in eachiteration of SPF, the convergence results need to be established to ensurecorrectness.363.3.1 Measure theoretic formulation of SPFWe begin with measure theoretic formulation of SPF and provide convergenceresults. We denote the target space at each generation by Xr = E1 \u00d7 ...\u00d7Erand corresponding product \u03c3-algebra Fr. The target measure for eachgeneration is denoted by pir : Fr \u2192 R+. For a test function \u03c6, we letpir\u03c6 =\u222b\u03c6dpir. For any positive measure \u03bb on X , we write \u2016\u03bb\u2016 = \u03bb(X ) and\u03bb\u00af(A) = \u03bb(A)\/\u2016\u03bb\u2016. Note that with our notation, pir = \u03b3r\/\u03b3r(X ).Definition 2. Given an arbitrary measure \u03bb, the output of the randomoperator resK\u03bb and propK\u03bb is a new measure defined as follows: for a giventest function \u03c6,(resK\u03bb)\u03c6 = \u2016\u03bb\u2016 1KK\u2211k=1\u03c6(Sk),(propK\u03bb)\u03c6 = \u2016\u03bb\u20161KK\u2211k=1\u03b1r(Sk, S\u2032k)\u03c6(y),where Sk \u223c \u03bb\u00af and S\u2032k|Sk \u223c \u03bd(\u00b7|Sk) independently and \u03b1r is the incrementalweight function:\u03b1r(x, y) =pir(y)pir\u22121(x)1\u03bd(y|x) .With this notation, the operations performed by one iteration of thestandard SMC can be described succinctly:piSMCr+1,K = propKpiSMCr,K ,where piSMCr,K =\u2211Kk=1 w\u00afkr \u03b4skr . Note that the prop operator performs boththe resampling and proposal operations since it first samples Sk from piSMCr,Kand then proposes S\u2032 \u223c \u03bd(\u00b7|S). Similarly, we can express the operationsperformed by one iteration of SPF succinctly:piSPFr+1,K = resK(propNr(K)piSPFr,K).37For SPF, there are two resampling stages, one in the form of the outerres operator, the other one included inside the prop operator. Note thatthese two resampling stages are needed: at iteration r we do not know inadvance how many times the output discrete measure will be sampled fromat iteration r+ 1, we use the resK operator to ensure that the memory boundK will be respected no matter what is the behavior at iteration r + 1. Whilethis would be detrimental in a standard particle filter, this is not a problemin our scheme since the number of proposal calls at each iteration is suitablyexpanded via the stopping time Nr. This comes at a cost of a theoreticallyhigher running time, but as demonstrated in the experiments in Chapter 4, inpractice the running time of our method is competitive with standard SMCmethods due to the lower frequency of memory writes.3.3.2 Convergence resultWe state the main result:Theorem 3. Let \u03c6 be a bounded function. Assume that the weights arebounded and pir \u001c \u03bdr\u22121, then\u222b\u03c6dpiSPFr,KL2\u2212\u2192\u222b\u03c6pir, as K \u2192\u221e.Before we can prove the main result, we first introduce results that areinstrumental in completing the proof.Lemma 4. For any positive measure \u03bb with \u2016\u03bb\u2016 <\u221e, we haveE[(propK\u03bb)\u03c6] = (prop \u03bb)\u03c6, (3.3)E[(resK\u03bb)\u03c6] = \u03bb\u03c6, (3.4)where,(prop \u03bb)\u03c6 =\u222b\u03bb(dx)\u222b\u03bdx(dy)w(x, y)\u03c6(y). (3.5)38Proof. The proof is by the linearity of expectation.E[(propK\u03bb)\u03c6] = E[\u2016\u03bb\u2016 1KK\u2211k=1\u03b1r(Sk, S\u2032k)\u03c6(S\u2032k)]= \u2016\u03bb\u2016E[\u03b1r(Sk, S\u2032k)\u03c6(S\u2032k)]= \u2016\u03bb\u2016\u222b \u222b\u03bb\u00af(dx)\u03bdx(dy)w(x, y)\u03c6(y)= (prop \u03bb)\u03c6.The third line is due to the definition of the prop operator: we drawSk \u223c \u03bb\u00af and propose S\u2032k|Sk \u223c \u03bd(\u00b7|Sk). Similarly,E[(resK\u03bb)\u03c6] = \u2016\u03bb\u2016E[\u03c6(Sk)]= \u2016\u03bb\u2016\u222b\u03bb\u00af(dx)\u03c6(x)= \u03bb\u03c6.Lemma 5. For any positive measure \u03bb with \u2016\u03bb\u2016 <\u221e and a bounded weightfunction |\u03b1r| \u2264 C1 <\u221e, we have for a test function \u03c6 such that |\u03c6| \u2264 C2 <\u221e,E[(propK\u03bb)\u03c6\u2212 (prop \u03bb)\u03c6]2 \u2264(C1C2)2\u2016\u03bb\u20162K, (3.6)E[(resK\u03bb)\u03c6\u2212 \u03bb\u03c6]2 \u2264 C22\u2016\u03bb\u20162K. (3.7)Proof. By Equation 3.3, we can take Equation 3.6 as the variance of (propK\u03bb)\u03c6:39E[(propK\u03bb)\u03c6\u2212 (prop \u03bb)\u03c6]2 = Var[(propK\u03bb)\u03c6]= Var[\u2016\u03bb\u2016 1KK\u2211k=1\u03b1r(Sk, S\u2032k)\u03c6(S\u2032k)]=\u2016\u03bb\u20162KVar[\u03b1r(Sk, S\u2032k)\u03c6(S\u2032k)]\u2264 \u2016\u03bb\u20162KVar[C1C2]=(C1C2)2\u2016\u03bb\u20162K,where the second to the third line is due to the independence of (Sk, S\u2032k)and (Sj , S\u2032j) for j 6= k. Similarly,E[(resK\u03bb)\u03c6\u2212 \u03bb\u03c6]2 = Var[(resK\u03bb)\u03c6]= Var[\u2016\u03bb\u2016 1KK\u2211k=1\u03c6(Sk)]=\u2016\u03bb\u20162KVar[\u03c6(Sk)]\u2264 C22\u2016\u03bb\u20162K.Corollary 6. For any stopping time Nr(K) with Nr(K) \u2265 K, we have:E[(propNr(K)\u03bb)\u03c6\u2212 (prop \u03bb)\u03c6]2 \u2264 (C1C2)2\u2016\u03bb\u20162K.40Proof.E[(propNr(K)\u03bb)\u03c6\u2212 (prop \u03bb)\u03c6]2= E[E[(propNr(K)\u03bb)\u03c6\u2212 (prop \u03bb)\u03c6]2 |Nr(K)]=\u221e\u2211n=KP(Nr(K) = n)E[(propn\u03bb)\u03c6\u2212 (prop \u03bb)\u03c6]2\u2264\u221e\u2211n=KP(Nr(K) = n)(C1C2)2\u2016\u03bb\u20162n\u2264\u221e\u2211n=KP(Nr(K) = n)(C1C2)2\u2016\u03bb\u20162K=(C1C2)2\u2016\u03bb\u20162K.Lemma 7. For all r, prop pir = pir+1.Proof. To prove this lemma, we need to show that for a measurable function\u03c6, we have(prop pir)\u03c6 = pir+1\u03c6.(prop pir)\u03c6 =\u222bpir(dx)\u222b\u03bdx(dy)\u03b1(x, y)\u03c6(y)=\u222bpir(dx)\u222bpir+1(dy)pir(x)\u03c6(y)=\u222bpir+1(dy)\u03c6(y)= pir+1\u03c6.41Lemma 8. If for all bounded measurable function \u03c6,pir,K\u03c6L2\u2212\u2192 pir\u03c6,then we also have:(prop pir,K)\u03c6L2\u2212\u2192 (prop pir)\u03c6. (3.8)Moreover, by Lemma 7 the right hand side of Equation (3.8) is equal to pir+1\u03c6.Proof. Let \u03c6\u2032(x) =\u222b\u03bd(dy|x)\u03b1(x, y)\u03c6(y). Then, since |\u03b1r| \u2264 C1 and |\u03c6| \u2264 C2,\u03c6\u2032 is also bounded. Now,pir,K\u03c6\u2032 =\u222bpir,K(dx)\u222b\u03bd(dy|x)\u03b1(x, y)\u03c6(y) = (prop pir,K)\u03c6,and,pir\u03c6\u2032 =\u222bpir(dx)\u222b\u03bd(dy|x)\u03b1(x, y)\u03c6(y) = (prop pir)\u03c6.Therefore, if pir,K\u03c6L2\u2212\u2192 pir\u03c6, for all bounded \u03c6, then surely, pir,K\u03c6\u2032 L2\u2212\u2192 pir\u03c6\u2032and hence, (prop pir,K)\u03c6L2\u2212\u2192 (prop pir)\u03c6.Proof. (of Theorem 3) We show by induction that for r \u2265 0, and for allbounded \u03c6, piSPFr,K \u03c6L2\u2212\u2192 pir\u03c6.The base case is trivial since piSPFr,K and pi0 are equal to a Dirac deltameasure on the same atom. To prove the induction hypothesis, we firstdecompose the L2 norm using the Minkowski inequality and control eachterm separately:(E[piSPFr+1,K\u03c6\u2212 pir+1\u03c6]2)1\/2 \u2264 (E [piSPFr+1,K\u03c6\u2212 (propNr(K)piSPFr,K )\u03c6]2)1\/2 (3.9)+(E[(propNr(K)piSPFr,K)\u03c6\u2212 (prop piSPFr,K )\u03c6]2)1\/2(3.10)+(E[(prop piSPFr,K )\u03c6\u2212 pir+1\u03c6]2)1\/2. (3.11)42The Equation 3.9 can be shown to be bounded as follows,Equation (3.9) =(E[(resK(propNr(K)piSPFr,K))\u03c6\u2212(propNr(K)piSPFr,K)\u03c6]2)1\/2=(E [(resK\u03bbr)\u03c6\u2212 \u03bbr\u03c6]2)1\/2=(E[E[((resK\u03bbr)\u03c6\u2212 \u03bbr\u03c6)2 | \u03bbr]])1\/2\u2264(C22\u2016\u03bbr\u20162K)1\/2=C2\u2016\u03bbr\u2016\u221aK,where \u03bbr = propNr(K)piSPFr,K and the second last line follows from Lemma 5.Hence, it goes to zero as K \u2192 \u221e. The Equation 3.10 can be bounded asfollows,Equation 3.10 =(E[E[((propNr(K)piSPFr,K)\u03c6\u2212 (prop piSPFr,K )\u03c6)2 |piSPFr,K ]])1\/2\u2264 E[(C1C2)2\u2016piSPFr,K \u20162K]1\/2=C1C2\u2016piSPFr,K \u2016\u221aK,where we used Corollary 6. Therefore, as K \u2192 \u221e, Equation 3.10 goes tozero. For Equation 3.11, we invoke the induction hypothesis to concludethat piSPFr,K \u03c6L2\u2212\u2192 pir\u03c6 for all bounded \u03c6, therefore we can invoke Lemma 8 toconclude that (proppiSPFr,K )\u03c6L2\u2212\u2192 (proppir)\u03c6. By Lemma 7, (proppir) = pir+1.Therefore, Equation 3.11 goes to zero as K \u2192\u221e.3.3.3 SPF as adapted particle filterIn this section, we present a viewpoint of the implicit proposal step of SPFas performing an adaptation.43BackgroundA particle filter is said to be adapted if the proposals take the currentobservation into account [26, 27]. This is most natural to understand inthe context of hidden Markov model, as described in Section 2.1. In thissetting, a particle filter is adapted if the proposal distribution \u03bd\u2217r (xr|xr\u22121) =p(xr|xr\u22121, yr), then it leads to the incremental weight function as,\u03b1(x1:r\u22121, xr) =p(x1:r, y1:r)p(x1:r\u22121, y1:r\u22121)p(xr|xr\u22121, yr)=f(xr|xr\u22121)g(yr|xr)p(xr, yr|xr\u22121)\/p(yr|xr\u22121)=f(xr|xr\u22121)g(yr|xr)f(xr|xr\u22121)g(yr|xr)\/p(yr|xr\u22121)= p(yr|xr\u22121).This is optimal in the sense that it minimizes the conditional variance ofthe weights given x1:r\u22121 [17]. However, sampling from the optimal proposaldistribution is impossible in many cases. Furthermore, sincep(yr|xr\u22121) =\u222bp(yr, xr|xr\u22121)dxr =\u222bf(xr|xr\u22121)g(yr|xr)dxr,in many interesting cases, including Ricker model introduced in Section 2.4.5and other dynamical models (see for example, [28]), we do not have closedform for f(xr|xr\u22121) let alone be able to evaluate it.We explore the possibility of adaptation without requiring the user toprovide anything more than being able to sample from the prior f as is thecase in the plug-and-play settings [21].SPF as adapted particle filterWe present result that as the number of implicit particles is increased, wecan expect the SPF equipped with a simple proposal to perform about as44well as the SPF equipped with the optimal proposal.Definition 9. Given an arbitrary measure \u03bb, we define the prop operator on\u03bb using the proposal density \u03bdr: for a given test function \u03c6,(prop\u03bdr \u03bb)\u03c6 =\u222b\u03bb(dx)\u222b\u03bdr(dy|x)\u03b1\u03bdr(x, y)\u03c6(y), (3.12)and(prop\u03bdrK \u03bb)\u03c6 = \u2016\u03bb\u20161KK\u2211k=1\u03b1\u03bdr(Sk, S\u2032k)\u03c6(S\u2032k), (3.13)where \u03b1\u03bdr(x, y) = pir(y)pir\u22121(x)\u03bdr(y|x) .The main result of this section is as follows.Proposition 10. Let \u03c6 be a bounded function. For \u03bdr such that the weightfunction is bounded and for a fixed K > 0,E[((prop\u03bdrN piSPFr\u22121,K)\u03c6\u2212 (prop\u03bd\u2217rN piSPFr\u22121,K)\u03c6)2]1\/2 \u2192 0, as N \u2192\u221eProof. We use the Minkowski inequality:(E[(prop\u03bdrN piSPFr\u22121,K)\u03c6\u2212 (prop\u03bd\u2217rN piSPFr\u22121,K)\u03c6]2)1\/2\u2264 (E[prop\u03bdrN piSPFr\u22121,K)\u03c6\u2212 (prop\u03bdr piSPFr\u22121,K)\u03c6]2)1\/2+(E[(prop\u03bdr piSPFr\u22121,K)\u03c6\u2212 (prop\u03bd\u2217r piSPFr\u22121,K)\u03c6]2)1\/2+((E[prop\u03bd\u2217r piSPFr\u22121,K)\u03c6\u2212 (prop\u03bd\u2217rN piSPFr\u22121,K)\u03c6]2)1\/2.By Lemma 5,(E[prop\u03bdrN piSPFr\u22121,K)\u03c6\u2212 (prop\u03bdr piSPFr\u22121,K)\u03c6]2)1\/2 \u2264 C1C2\u2016piSPFr\u22121,K\u2016\u221aN,and(E[prop\u03bd\u2217rN piSPFr\u22121,K)\u03c6\u2212 (prop\u03bd\u2217r piSPFr\u22121,K)\u03c6]2)1\/2 \u2264 C1C2\u2016piSPFr\u22121,K\u2016\u221aN.45Note that, by definition of prop\u03bdr , we have,(prop\u03bdrpiSPFr\u22121,K)\u03c6 =\u222bpiSPFr\u22121,K(dx)\u222b\u03bdr(dy|x) pir(y)pir\u22121(x)\u03bdr(dy|x)\u03c6(y)=\u222bpiSPFr\u22121,K(dx)\u222bpir(y)pir\u22121(x)\u03c6(y).Hence, prop\u03bdrpiSPFr\u22121,K = prop\u03bd\u2217r piSPFr\u22121,K :(E[(prop\u03bdr piSPFr\u22121,K)\u03c6\u2212 (prop\u03bd\u2217r piSPFr\u22121,K)\u03c6]2)1\/2= 0.Therefore, as N \u2192\u221e, we obtain the desired result.The Proposition 10 provides a justification for why it helps to use largenumber of implicit particles. That is, even if we are only equipped witha simple proposal, using a large number of implicit particles can lead toperformance similar to using the optimal proposal.3.4 SPF within particle MCMCThe particle MCMC framework is a powerful inference framework whereone can carry out joint inference on the parameters and the latent variables.Being able to deploy SPF within PMCMC inevitably is of interest and cancontribute to increasing the applicability of SPF to a variety of problems.We have results showing that SPF can be used within PMCMC when a fixednumber of implicit particles is used. It remains for future work to develop away to incorporate random stopping time into SPF-PMCMC methodology.3.4.1 NotationWe keep the number of implicit particles fixed, denoted Nr = N for allr = 1, ..., R. We continue to denote by K, the number of times we performthe res operation and the size of the particle population at the end of oneiteration of SPF (i.e., K denotes the number of concrete particles). We46introduce a new notation, ir \u2208 {1, ..., N}K to denote the indices of theimplicit particle selected by the res operation and let ikr \u2208 {1, ..., N} be thek-th component of ir for r = 1, ..., R. The K concrete particles at iterationr can be indexed using ikr : xikrr for k = 1, ...,K. Let ar\u22121 \u2208 {1, ...,K}N forr = 2, ..., R denote the indices of the ancestor particles sampled from theequally weighted particle population obtained during the expansion operationof Task group 1 (i.e., inside the prop operation).We denote the concrete particles by xki:j whenever the superscript k \u2208{1, ...,K} and the implicit particles xni:j whenever the superscript n \u2208{1, ..., N} for 1 \u2264 i \u2264 j \u2264 R. This abuse of notation allows additionalflexibility in referencing the implicit and concrete particles. The reason toadopt this abuse of notation is to guard against a possible misconception thatthe concrete particles and the implicit particles are different objects, when infact the concrete particles are subset of implicit particles. To emphasize, theparticles generated by SPF are x1, ...,xR, where xr = (x1r , ..., xNr ).The backward lineage of a concrete particle xk1:R is denoted by bk1:R, wherebkr \u2208 {1, ...,K} denotes the index of the concrete particle at iteration r withbkR = k . The backward lineage allows indexing of the concrete particlesat every iteration via xbkrr for r = 1, ..., R and k = 1, ...,K. Note thatcomposition of indices allows the implicit particles of the previous iterationsto be referenced. For example, ibkrr indexes the implicit particle at iterationr of the lineage xk1:R and similarly, iakr\u22121r\u22121 indexes the implicit particle at ther\u2212 1 iteration of the lineage xk1:R. An example is provided in Figure 3.2. Weshow shortly that the cumbersome notation leads to simpler expression ofthe joint distribution of the variables generated by SPF.3.4.2 SPF algorithm descriptionWe describe the SPF in light of these newly introduced notation.Initial iteration: r = 1In the initial distribution, we begin by proposing N particles, xn1 \u223c \u03bd1(\u00b7) forn = 1, ..., N . Then, we perform the contraction operation via multinomial47x11 x31 x71x12 x32 x52 x72i1 = (3, 1, 7, 3)i2 = (3, 5, 1, 7)a1 = (2, 1, 1, 4, 3, 4, 2)Figure 3.2: An illustration of the indices in action for N = 7,K = 4, R = 2.The shaded circles form a lineage of particles, where b21:2 = (3, 2) can be usedfor indexing the concrete particles and ib211 = i31 = 7 and ib222 = i22 = 5 can beused to index the implicit particles. The composition of indexes can yield theindex of the parent particle for any implicit particle xnr . For example, theparent (implicit) particle of x52 is given by ia511 = i31 = 7: xia5111 = x71.sampling using the weights w\u00afn1 \u221d \u03b3(xn1 )\/\u03bd1(xn1 ). Let us denote the conditionaldensity of the indices i1, given the normalized weights by i1 \u223c c(\u00b7|w\u00af1) (i.e., ccorresponds to the density of the resampling distribution for the contractionoperation). After contraction, we obtain a population of K equally weightedparticles: {xk1, 1\/K}.General iteration: r \u2265 2We begin by selecting the parent index from the particle population generatedfrom the previous iteration: anr\u22121 \u223c s(\u00b7|ir\u22121), where s(\u00b7|ir\u22121) is the discreteuniform distribution over the indices of the concrete particles selected in theprevious iteration. We propose the particle, xnr \u223c \u03bdr(\u00b7|xiakr\u22121r\u221211:r\u22121). The weightfor each of the particles is given by wnr = \u03b1(xiakr\u22121r\u221211:r\u22121, xkr ). Then, we perform48the contraction operation to draw K indices from the implicit population:ir \u223c c(\u00b7|w\u00afr). After contraction operation, we obtain K equally weightedparticles forming a particle population for the next iteration. Note that theabuse of notation that we introduced on the range of the superscript allowsus simplify some of the expression, for example, xiakr\u22121r\u221211:r\u22121 can be written simplyas xakr\u221211:r\u22121 because akr\u22121 \u2208 {1, ...,K}.Joint distribution of the variables generated by SPFWe can express the joint distribution of the variables generated by SPF:\u03c8SPF(x1, ...,xR, i1, ..., iR,a1, ...,aR\u22121) =N\u220fn=1\u03bd1(xn1 )c(i1|w\u00af1)\u00d7R\u220fr=2s(ar\u22121|ir\u22121)N\u220fn=1\u03bdr(xnr |xanr\u221211:r\u22121)c(ir|w\u00afr),(3.14)Note that s(ar\u22121|ir\u22121) = K\u2212N since each anr\u22121 is independent and identicallydistributed from a discrete uniform distribution over K support points. Wemake the following assumption:Assumption 11. c(ikr = j|w\u00afr) = w\u00afjr.This assumption is easily enforced by the contraction operation based onthe multinomial resampling scheme.3.4.3 SPF-PMCMCIn this section, we show that the same iterative procedure for SMC-PMCMCcan be used with SPF in place of SMC.49Particle independent Metropolis-HastingsThe proposal density for SPF-PIMH is given by,qSPFN,K(k,x1, ...,xR,a1, ...,aR\u22121, i1, ..., iR) =1K\u03c8SPF(x1, ...,xR,a1, ...,aR\u22121, i1, ..., iR).(3.15)Note that the factor of 1\/K is because the SPF-PIMH algorithm selects aparticle genealogy from the uniform distribution over {1, ...,K} since SPFreturns population of equally weighted particles. The target distribution isgiven by,p\u02dciSPFN,K(k,x1, ...,xR,a1, ...,aR\u22121, i1, ..., iR) =pi(xk1:R)NRKR\u00d7\u03c8SPF(x1, ...,xR,a1, ...,aR\u22121, i1, ..., iR)\u03bd1(xbk11 )c(ibk11 |w\u00af1)R\u220fr=2s(bkr\u22121|ir\u22121)\u03bdr(xbkrr |xbkr\u221211:r\u22121)c(ibkrr |w\u00afr).Note that s(bkr\u22121|ir\u22121) = 1\/K since |ir\u22121| = K and s corresponds to adistribution that selects one item uniformly from ir\u22121. The derivation of the50ratio between the target and the proposal is given by,p\u02dciSPFN,KqSPFN,K=pi(xk1:R)N\u2212RKRK\u22121\u00b5(xbk11 )c(ibk11 |w\u00af1)R\u220fr=2s(bkr\u22121|w\u00afr\u22121)\u03bdr(xbkrr |xbkr\u221211:r\u22121)c(ibkrr |w\u00afr)=pi(xk1:R)N\u2212R\u00b5(xbk11 )R\u220fr=2\u03bdr(xbkrr |xbkr\u221211:r\u22121)R\u220fr=1w\u00afbkrr=pi(xk1:R)N\u2212R R\u220fr=1N\u2211j=1wjr\u00b5(xbk11 )R\u220fr=2\u03bdr(xbkrr |xbkr\u221211:r\u22121)R\u220fr=1wbkrr=pi(xk1:R)\u00b5(xbk11 )R\u220fr=2\u03bdr(xbkrr |xbkr\u221211:r\u22121)R\u220fr=1wbkrr\u00d7 Z\u02c6N=Z\u02c6NZ. (3.16)Since Z is a constant, the acceptance ratio for SPF-PIMH matches Equa-tion 2.15. Another important point to note is that the estimate of themarginal likelihood, Z\u02c6N , uses all of the implicit particles, therefore, we expectsmaller variance in the acceptance ratio compared to using just K particles.In memory restrictive settings where we can only afford K concrete particles,we expect this to be very useful.Particle marginal Metropolis-HastingsWe assume that a proposal distribution for \u03b8 is given by q(\u03b8\u2217|\u03b8). The proposaldensity for all of the variables in PMMH is given by proposing a new value,\u03b8\u2217 given the current value \u03b8, followed by performing SPF algorithm given\u03b8\u2217 to generate x1, ...,xR,a1, ...,aR\u22121, i1, ..., iR and then, sampling a k-th51genealogy:qSPF(\u03b8\u2217, k,x\u22171, ...,x\u2217R,a\u22171, ...,a\u2217R\u22121, i\u22171, ..., i\u2217R|\u03b8) = q(\u03b8\u2217|\u03b8)1K\u03c8SPF(x\u22171, ...,x\u2217R,a\u22171, ...,a\u2217R\u22121, i\u22171, ..., i\u2217R|\u03b8\u2217).With the target distribution as follows,p\u02dciSPF(\u03b8, k,x1:R,a1:R\u22121, i1:R) =pi(\u03b8, xk1:R)NRKR\u00d7\u03c8SPF(x1:R,a1:R\u22121, i1:R|\u03b8)\u03bd1(xbk11 |\u03b8)c(ibk11 |w\u00af1)R\u220fr=2s(bkr\u22121|ir\u22121)\u03bdr(xbkrr |xbkr\u221211:r\u22121, \u03b8)c(ibkrr |w\u00afr),52we get the ratio between the target distribution and the proposal as:p\u02dciSPFqSPF=pi(\u03b8, xk1:R)q(\u03b8|\u03b8\u2217)N\u2212RKRK\u22121\u03bd1(xbk11 |\u03b8)c(ibk11 |w\u00af1)R\u220fr=2s(bkr\u22121|ir\u22121)\u03bdr(xbkrr |xbkr\u221211:r\u22121, \u03b8)c(ibkrr |w\u00afr)=pi(\u03b8, xk1:R)q(\u03b8|\u03b8\u2217)N\u2212RKRK\u2212R\u03bd1(xbk11 |\u03b8)R\u220fr=2\u03bdr(xbkrr |xbkr\u221211:r\u22121, \u03b8)R\u220fr=1c(ibkrr |w\u00afr)=pi(\u03b8, xk1:R)q(\u03b8|\u03b8\u2217)N\u2212R\u03bd1(xbk11 |\u03b8)R\u220fr=2\u03bdr(xbkrr |xbkr\u221211:r\u22121, \u03b8)R\u220fr=1w\u00afibkrrr=pi(\u03b8, xk1:R)q(\u03b8|\u03b8\u2217)N\u2212RR\u220fr=1N\u2211j=1wjr\u03bd1(xbk11 |\u03b8)R\u220fr=2\u03bdr(xbkrr |xbkr\u221211:r\u22121, \u03b8)R\u220fr=1wibkrrr=pi(\u03b8, xk1:R)q(\u03b8|\u03b8\u2217)Z\u02c6N (\u03b8)\u03bd1(xbk11 |\u03b8)R\u220fr=2\u03bdr(xbkrr |xbkr\u221211:r\u22121, \u03b8)R\u220fr=1wibkrrr=pi(\u03b8, xk1:R)q(\u03b8|\u03b8\u2217)Z\u02c6N (\u03b8)\u03bd1(xbk11 |\u03b8)R\u220fr=2\u03bdr(xbkrr |xbkr\u221211:r\u22121, \u03b8)\u03b3(\u03b8,xbk11 )\u03bd1(xbk11 |\u03b8)R\u220fr=2\u03b3(\u03b8,xbkr1:r)\u03b3(\u03b8,xbkr\u221211:r\u22121)\u03bdr(xbkrr |xbkr\u221211:r\u22121,\u03b8)=pi(\u03b8, xk1:R)q(\u03b8|\u03b8\u2217)Z\u02c6N (\u03b8)\u03b3(\u03b8, xk1:r)=\u03b3(\u03b8, xk1:R)Zq(\u03b8|\u03b8\u2217)Z\u02c6N (\u03b8)\u03b3(\u03b8, xk1:R)=Z\u02c6N (\u03b8)Zq(\u03b8|\u03b8\u2217) .Since Z is a constant, we obtain the MH acceptance ratio matching the oneprovided in [24]:1 \u2227 Z\u02c6N (\u03b8\u2217)q(\u03b8|\u03b8\u2217)Z\u02c6N (\u03b8)q(\u03b8\u2217|\u03b8). (3.17)Since the SPF-PIMH and SPF-PMMH algorithm we have constructed here isthe standard MH update on the variables generated by SPF, we expect the53convergence of the chain to the desired target distribution under the samemild conditions imposed in [24]. In particular, the PMCMC chain X1:R(i)converges to the target distribution pi(x1:R), since it appears as a marginaldistribution in p\u02dciSPF.54Chapter 4SPF for PhylogeneticsApplicationIn this chapter, a phylogenetics application is described in detail for thepurpose of demonstrating the effectiveness of SPF.4.1 NotationThe phylogenetic inference problem is as follows: given a set of DNA sequencesfor homologous taxa (i.e., species originating from a common ancestor), inferthe relationship between the species via a latent tree topology as well theevolutionary distances between the species via the branch lengths. A DNAsequence is comprised of four nucleobases, adenine, cytosine, guanine, andthymine, henceforth referred to by A, C, G, T respectively. The number ofnucleobases is denoted by B = 4. The latent variable of interest is a rootedtree topology as well as the branch lengths, denoted by T = (V,E, b) whereT denotes a rooted binary tree topology with vertices and edges denoted byV,E respectively, and b : E \u2192 (0,\u221e) denotes the branch lengths.The set of observed DNA sequences is denoted by Y = {y1, ..., yL}, whereeach yl is a DNA sequence for l = 1, ..., L and L = |Y| denotes the numberof taxa. The length of the DNA sequences is assumed to be the sameacross the species, and this length is denoted by S. This assumption is tokeep the discussion focussed on describing the phylogenetics application anddemonstrating the performance of SPF. Let yl \u2208 {A,C,G, T}S , with yl,sdenoting the value of the nucleobasis at the s-th site for l = 1, ..., L. Therandom tree is denoted by T , a realization by t, and the space of feasibletrees for a given set of observations by T (Y). In general, a space of trees is55denoted by T , which is distinguished from the space of trees on leaves, Y \u2032,T (Y \u2032).4.2 Bayesian phylogeneticsA Bayesian approach to inferring the latent phylogenetics tree T = (V,E, b)involves the posterior distribution over the tree given the observations:p(t|Y) = p(Y|t)p(t)p(Y) \u221d \u03b3(t). (4.1)As can be seen from the posterior, a full specification requires placing a priordistribution on t and the likelihood model. Each of these two componentsare explained in this section.4.2.1 Kingman\u2019s coalescent priorThe prior distribution on the t considered in this thesis is Kingman\u2019s coalescentprior [29, 30]. That prior is a distribution on a tree of all sizes. An importantnotion that appears in the description of the model is the height of a tree:h : T \u2192 [0,\u221e).Kingman\u2019s coalescent prior is perhaps best explained in terms of the stepsinvolved in drawing a sample from the model. Let L be the number of leaves.The sampling procedure takes place over L\u22121 iterations. In the first iteration(r = 1), a pair of leaves is selected at random to coalesce among the(L2)possible pairs. Let tl and tr denote the pair of leaves that were chosen andlet tm denote the tree with tl and tr as the subtrees (left and right subtreesrespectively). The next step is to draw an exponential distributed randomvariable with mean(L2): \u03b41 \u223c Exponential((L2)). The height of tm is set as:h(tm) = \u03b41.For iterations r = 2, ..., L \u2212 1, note that there are L \u2212 r + 1 trees tochoose from. Therefore, a pair of subtrees, tl, tr are sampled with probability1\/(L\u2212r+12)and a new tree tm is formed with tl, tr as left and right subtrees. Anexponential distributed random variable \u03b4r \u223c Exponential((L\u2212r+12))is sam-56pled and the height of the new tree tm is set as h(tm) = max{t1, ..., tL\u2212r+1}+\u03b4r.After the final iteration, r = L\u2212 1, a full coalescent of the L species isreached as there is exactly one tree left. The probability density of a tree ton L leaves can be represented as follows:p(t) =R\u22121\u220fr=1(L\u2212 r + 12)exp(\u2212(L\u2212 r + 12)\u03b4r)\u00d7 1(L\u2212r+12)=R\u22121\u220fr=1exp(\u2212(L\u2212 r + 12)\u03b4r).An example of a tree sampled from the coalescent prior with L = 8is shown in Figure 4.1 (a). Note that the exponential distributed randomvariable \u03b4r denotes the waiting time to a new coalescent event. Since the meanparameter of the exponential distribution is decreasing in r, the expected timefor coalescence longer for larger values of r. Using an alternative view of theheight of the tree as the coalescent times of the (ancestral) species, backwardin time, this is a desirable property. For example, consider a population ofhaploid individuals that mate at random. In this case, larger the number ofindividuals competing to mate, the smaller the expected waiting time to anew coalescent event. The branch length from the root node to its left andright children in Figure 4.1 (a) depicts this phenomenon.4.2.2 Felsenstein\u2019s peeling recursion for likelihoodcomputationThe likelihood of the observations given a tree is computed by marginalizingout the DNA sequences of the unobserved ancestral species. An algorithmto efficiently perform this marginalization is known as Felsenstein\u2019s peelingrecursion, which is a type of sum-product algorithm [31, 32, 33]. Considerthe tree given in Figure 4.1 (b). The DNA sequences for the leaf nodes areobserved and denoted by y1, y2, y3, y4, whereas the DNA sequences for theancestral species denoted, y5, y6, y7, are unknown. The likelihood of the data,{y1, y2, y3, y4}, given the tree is computed by assuming site independence and571 2 3 4567b51=0.7b63=1.3b65=0.6b76=0.5b74=1.8b52=0.7(a) (b)Figure 4.1: (a) A sample from the coalescent prior on 8 species. Note that thebranch length merging the root of the tree and its left and right children aremuch longer than the other branches. (b) An example of a phylogenetics tree.The leaf nodes are labelled 1, 2, 3, 4 and their DNA sequences are denoted byy1, y2, y3, y4 \u2208 {A,C,G, T}. The branch lengths play an important role incomputing the likelihood. buv denotes the length of the branch length fromthe parent u to the child v.by summing over the DNA sequence of the ancestor species as follows:p(y1, y2, y3, y4|t) =S\u220fs=1p(y1,s, y2,s, y3,s, y4,s|t)=S\u220fs=1\u2211y7,s\u2211y6,s\u2211y5,sp(y1,s, y2,s, y3,s, y4,s, y5,s, y6,s, y7,s|t),where y5,s, y6,s, y7,s \u2208 {A,C,G, T}. As it stands, this summation requiresevaluating the joint probability of all of the variables 43 times \u2013 one evaluationfor each set of assignments to (y5,s, y6,s, y7,s). For a tree with L leaves, thisamounts to 4L\u22121 evaluations of the joint probability density for each site.A key component of the likelihood computation is the mathematical model58for evolution based on a continuous time Markov chain, which is assumed tooperate along the branches of the tree. This model for decomposition of thejoint probability density is,p(y1,s, y2,s, y3,s, y4,s, y5,s, y6,s, y7,s|t) =p(y7,s|t)\u00d7 p(y6,s|y7,s, t)\u00d7 p(y5,s|y6,s, t)\u00d7p(y4,s|y7,s, t)\u00d7 p(y3,s|y6,s, t)\u00d7p(y1,s|y5,s, t)\u00d7 p(y2,s|y5,s, t).The probability p(yj,s|yi,s, t) is computed using the transition matrix com-puted from the rate matrix of the aforementioned continuous time Markovchain. Specifically, for an edge e = (i, j) with i being the parent of j, thetransition matrix is given by Pyi,s,yj,s = exp(Qb(e))yi,s,yj,s , where b(e) is thelength of the branch e and Q is the rate matrix. For example, for e = (7, 6),b(e) = 0.5 in Figure 4.1 (b). Note that the marginal probability p(y7|t)given by the stationary distribution of the CTMC, piy7 , which stems fromthe assumption that a long enough time has elapsed to ensure stationarity ofthe CTMC has been reached upon the birth of the root ancestor specie ofthe tree. The details of the rate matrix and the CTMC is given in the nextsection. The likelihood of the data at site s can now be expressed as,p(y1,s, y2,s, y3,s, y4,s|t) =\u2211y7\u2208{A,C,G,T}p(y7,s|t)p(y4,s|y7,s, t)\u00d7 (4.2)\uf8eb\uf8ed \u2211y6,s\u2208{A,C,G,T}p(y6,s|y7,s, t)p(y3,s|y6,s, t)\u00d7 (4.3)\uf8eb\uf8ed \u2211y5,s\u2208{A,C,G,T}p(y5,s|y6,s, t)p(y1,s|y5,s, t)p(y2,s|y5,s, t)\uf8f6\uf8f8\uf8f6\uf8f8 .(4.4)The above expression suggests that the computation should be carriedout by summing over y5,s first, then over y6,s, and finally over y7,s. However,computation of Equation 4.4 involves y6,s and similarly, the computationof Equation 4.3 involves y7,s. It is not clear how this computation can be59carried out efficiently, an issue we now address.Implementation detailsAn efficient implementation for likelihood computation is a dynamic program-ming algorithm where each node of the tree is associated with a table. Thealgorithm is initialized by allocating a table of dimension B \u00d7 S to each ofthe leaf nodes of the tree. Let Du denote the table allocated to a node u.The table for the leaf nodes are initialized as follows: set Du[c, s] = 1 if andonly if yu,s = c for c \u2208 {A,C,G, T}. Note that Du[b, s] denotes the b-th rowand s-th column of the table Du.The algorithm proceeds by selecting an internal node u if and only ifboth the left and right children of u have a table with all of its entries filled.This can be enforced by a post-order tree traversal algorithm, which outputsa sequence of nodes of the tree such that for every internal node, childrennodes appear before it in the order. The table Du for an internal node u isconstructed as follows:Du[c, s] =\u220fd\u2208{l,r}\u2211z\u2208{A,C,G,T}p(z|c, t)\u00d7Dud [z, s], (4.5)where Dud [z, s] for d \u2208 {l, r} denotes the table allocated to the left and rightchild nodes of u. For concreteness, the first internal node visited in theexample corresponding to Figure 4.1 (b) would be the node 5. Note thatcarrying out the computation in Equation 4.5 to construct D5 yields:D5[c, s] =\uf8eb\uf8ed \u2211y\u20321,s\u2208{A,C,G,T}p(y\u20321,s|y5,s, t)\u00d7D1[y\u20321,s, s]\uf8f6\uf8f8\u00d7\uf8eb\uf8ed \u2211y\u20322,s\u2208{A,C,G,T}p(y\u20322,s|y5,s, t)\u00d7D2[y\u20321,s, s]\uf8f6\uf8f8=p(y1,s|y5,s, t)\u00d7 p(y2,s|y5,s, t),because D1[y\u20321,s, s] = 0 except for y1,s and similarly for y2,s. Note that the last60line of the above equation corresponds to the term p(y1,s|y5,s, t)p(y2,s|y5,s, t)in Equation 4.4. Continuing along, the next node visited is 6. The table D6is constructed as,D6[y6,s, s] =\uf8eb\uf8ed\u2211y\u20325,sp(y\u20325,s|y6,s, t)\u00d7D5[y\u20325,s, s]\uf8f6\uf8f8\uf8eb\uf8ed\u2211y\u20323,sp(y\u20323,s|y6,s, t)\u00d7D3[y\u20323,s, s]\uf8f6\uf8f8=\uf8eb\uf8ed\u2211y\u20325,sp(y\u20325,s|y6,s, t)p(y1,s|y5,s, t)p(y2,s|y5,s, t)\uf8f6\uf8f8\u00d7 p(y3,s|y6,s, t).(4.6)Note that the first term on the RHS of Equation 4.6 corresponds precisely toEquation 4.4. Continuing in this fashion, the next node visited is the rootnode 7, whose table contains computation corresponding to the product ofEquation 4.3, Equation 4.4, and p(y4,s|y7,s, t). Therefore, to compute the datalikelihood, the last step is to marginalize p(y7,s|t) over y7,s. Note that thereare L\u22121 internal nodes and for each internal node, the time to fill the entriesof the table is S\u00d7B2. Therefore, the overall runtime is polynomial in B,S, L:O((B2S +H)L), where H corresponds to the computational complexity ofcomputing the transition matrix, for example, the computational complexityof matrix exponentiation for a diagonalizable rate matrix is H = O(B3).4.2.3 Models of evolutionAn important piece of information that is needed to complete the specificationof the likelihood is the model of evolution. There are various evolutionarymodels but only two models, Jukes-Cantor model [34] and the general timereversible model (GTR) [35] are discussed here.The Jukes-Cantor model assumes that the stationary distribution overthe bases is uniform, piA = piC = piG = piT = 0.25 and that there is a singlemutation rate \u00b5 for all possible substitutions. The rate matrix is therefore61given by,Q =\uf8eb\uf8ec\uf8ec\uf8ec\uf8ec\uf8ed\u22123\u00b54\u00b54\u00b54\u00b54\u00b54\u22123\u00b54\u00b54\u00b54\u00b54\u00b54\u22123\u00b54\u00b54\u00b54\u00b54\u00b54\u22123\u00b54\uf8f6\uf8f7\uf8f7\uf8f7\uf8f7\uf8f8 . (4.7)The interpretation of the entry Qxy for x, y \u2208 {A,C,G, T} is as an instanta-neous rate of substitution from x to y.The GTR model allows the stationary probabilities piA, piC , piG, piT tovary and it encodes 6 parameters capturing the rate of change between anytwo bases. The parameters are denoted \u03b1, \u03b2, \u03b3, \u03b4, \u000f, \u03b7 and have the followinginterpretation:\u03b1 : transition rate between A and C\u03b2 : transition rate between A and G\u03b3 : transition rate between A and T\u03b4 : transition rate between C and G\u000f : transition rate between C and T\u03b7 : transition rate between G and T.The rate matrix for the GTR model is given as follows,Q =\uf8eb\uf8ec\uf8ec\uf8ec\uf8ec\uf8ed\u00b7 \u03b1piC \u03b2piG \u03b3piT\u03b1piA \u00b7 \u03b4piG \u000fpiT\u03b2piA \u03b4piC \u00b7 \u03b7piT\u03b3piA \u03b4piC \u03b7piG \u00b7\uf8f6\uf8f7\uf8f7\uf8f7\uf8f7\uf8f8 . (4.8)Note that the diagonal entries, Qxx, are equal to the negative of the sum ofthe entries, Qxy for y 6= x and y \u2208 {A,C,G, T}:Qxx = \u2212\u2211x 6=y\u2208{A,C,G,T}Qxy.The first row of the rate matrix corresponds to the instantaneous rate of62substitution from A to A,C,G, T . Similarly, the second, third, and fourthrows correspond to the instantaneous rates of substitution from C,G, T toA,C,G, T respectively.4.3 SMC for inference on phylogenetic treesThe details of the computation involved in carrying out SMC based inferencefor phylogenetics is given in this section.4.3.1 NotationLet S denote the forest of phylogenetic trees, that is, a s \u2208 S is a collectionof one or more phylogenetic trees. More precisely, let Sr denote the forestof L \u2212 r phylogenetic trees for r = 1, ..., L \u2212 1, where L = |Y| denotes thenumber of taxa. Hence, SL\u22121 corresponds to the desired target space, i.e.,SL\u22121 = T (Y) and S =\u22c3L\u22121r=1 Sr. The intermediate distribution for sr \u2208 Sr isgiven by,\u03b3r(sr) =\u220ft\u2208srp(Y(t)|t)p(t),where Y(t) denotes the set of observations corresponding to the leaves of atree t. The notion of the height of the tree is extended to s \u2208 S as follows:h : S \u2192 [0,\u221e) such that h(s) = maxt\u2208s h(t).4.3.2 Proposal distributionThe proposal distribution that we consider in this thesis is the prior-priorproposal introduced in [30]. The prior-prior proposal samples both the pairto coalesce and the exponential random variable from the prior distribution(i.e., samples from the Kingman\u2019s coalescent prior). At the r-th iteration ofthe SMC, a pair of species to coalesce are sampled with probability 1\/(L\u2212r+12)and \u03b4r \u223c Exponential((L\u2212r+12)) for r = 1, ..., L\u2212 1. The incremental weightupdate for sr\u22121 \u2208 Sr\u22121 and sr \u2208 Sr is given by,63\u03b1(sr\u22121, sr) =\u03b3r(sr)\u03b3r\u22121(sr\u22121)\u03bdpr-pr(sr|sr\u22121)=\u220ft\u2208sr p(Y(t)|t)p(t)\u220ft\u2208sr\u22121 p(Y(t)|t)p(t)\u03bdpr-pr(sr|sr\u22121)=p(Y(tm)|tm)p(Y(tl)|tl)p(Y(tr)|tr) , (4.9)where tl and tr denote the left and right subtrees selected to be merged bythe prior-prior proposal, and tm denotes the rooted tree with tl as the leftand tr as the right subtree. Recall from Section 4.2.1 that the calculationof the prior p(tm) includes the calculation involved in p(tl), p(tr), leading tocancellations, leaving only the probability density for \u03b4r times the probabilityof choosing the pair tl, tr to coalesce. But the prior-prior proposal samplesfrom the prior distribution for both the branch length and the pair to coalesce,so we are left with Equation 4.9.4.3.3 Likelihood computationThe incremental weight update formula requires computation of the likeli-hoods p(Y(tm)|tm), p(Y(tl)|tl), p(Y(tr)|tr). The Felsenstein\u2019s peeling recur-sion described in Section 4.2.2 can be carried out exactly because the tree isconstructed in a bottom-up fashion. The table for each of the extant taxonx is initialized by Dx[c, s] = 1 if the character at site s is equal to c and0 otherwise. For each coalescent event, consisting of tl, tr, resulting in thecreation of the merged tree tm, a new table Dtm is created and filled as inEquation 4.5.For practical purposes, the table associated with each node is stored incomputer memory. This eliminates the need to re-compute the tables forthe left and right subtrees required in the particle weight computation step.Note that this can lead to insufficient the storage capacity in the computer\u2019smemory. A naive calculation reveals that the storage for one particle isS \u00d7B \u00d7 8 bytes, where 8 bytes is needed to store a double precision floatingpoint number in a computer\u2019s memory. For example, if S = 1, 000 and B = 4,64storage for one particle is 31.25 kilobytes.4.4 Experimental resultsThis section demonstrates the effectiveness of the SPF.4.4.1 Data simulation procedureThe data for the experiments are simulated as follows. For a desired number oftaxa L, sample a tree the Kingman\u2019s coalescent prior, and denote the sampledtree by ttrue. The observations are generated by starting from the root ofttrue and sampling yroot,s \u223c Multinomial(piA, piC , piG, piT ) for s = 1, ..., S. Foran edge e = (u, v), with u being the parent specie and v being the childspecie, the DNA sequence for v is generated by computing the transitionprobability via matrix exponentiation, P = exp(Qb(e)). The nucleobasis issampled for site s = 1, ..., S by sampling from the multinomial distribution:yv,s \u223c Multinomial(Pyu,s,A, Pyu,s,C , Pyu,s,G, Pyu,s,T ).The DNA sequences for the internal nodes of the tree are discarded to yieldthe DNA observations only at the leaf nodes of the tree: Y = (y1, ..., yL).4.4.2 The speed of executionIt was observed that the number of unique particles at each iteration ofSMC can be quite small for a phylogenetics application when using theprior-prior proposal, which is due in large part to the discrete componentof the phylogenetics tree. This point is illustrated in Figure 4.2 (a). Thisfigure is obtained using K = 10, 000 particles with L = 16 taxa for SMC andK = N = 10, 000 for SPF with 10 replications. The effective sample sizes forthe two methods are indistinguishable and their standard errors overlap.In this experiment, the execution time for SPF is expected to be abouttwice that of SMC. However, it was noted that due to the number of memorywrites for the table required in the likelihood computation, the streaming65particle filter actually resulted in faster execution time. In particular, becauseof the small number of unique samples at almost every iteration exceptthe last one, the number of memory writes is significantly lower since SPFwrites the likelihood table to memory if and only if the particle survives thecontraction step. This phenomenon is depicted in Figure 4.2 (b), where thetiming results in milliseconds at each iteration is compared for SMC and SPF.In nearly all iterations, we notice that SPF outperforms SMC. In Figure 4.2(c) is the ESS per second, SPF is shown to outperform SMC in its ability togenerate ESS for every second of computation. Finally, Figure 4.2 (d) showsthe LOESS fit on the ratio of the ESS\/sec for SPF to the ESS\/sec for SMC.The red line indicates a ratio of 1. A ratio greater than 1 throughout showsthe SPF outperforming SMC in terms of its ability to generate higher ESSper second of execution.SPF vs SMC within PMMHThis observation carries over when using SPF within PMCMC comparedto using the standard SMC. The standard SMC with K = 10, 000 particlesversus SPF with K = 1, 000 and N = 10, 000 particles is compared in anexperiment to estimate the parameter of the Jukes-Cantor model. As canbe seen in Figure 4.3 (a), the estimation quality of SPF is similar to thatobtained from SMC, despite using a smaller number of concrete particles.On the contrary, the ESS\/sec for SPF is greater than the ESS\/sec for SMCas shown in Figure 4.3 (b). The PMMH was executed for 10, 000 iterationswith thinning applied after every 30 iterations. The mutation rate of \u00b5 = 1.7is used for the Jukes-Cantor model.4.4.3 Scalability experimentsThe low ESS observed in the first few iterations of SMC shown in Figure 4.2 (a)provides insight as to the necessity of a large scale method for phylogenetics.Specifically, it may be necessary to use large number of particles in the firstfew iterations of the SMC so as to find highly likelihood coalescent events.The Figure 4.4 shows results of large scale experiments demonstrating why66l l l l ll l lll ll llll05001000150020000 5 10 15SMC IterationESSllSMCSPFll l l l l l l l l ll l llll l llll l lll llll1000200030000 5 10 15SMC IterationTime (in milliseconds)llSMCSPF(a) (b)l lllll llll llllll l l ll l lll ll lll0500100015004 8 12SMC IterationESS\/secllSMCSPF\u221250510154 8 12SMC IterationsRatio of ESS\/sec(c) (d)Figure 4.2: (a) The effective sample size at each iteration for SMC and SPF.(b) The raw execution times in milliseconds. (c) The plot of ESS\/sec. (d)The ratio of ESS\/sec of SPF to ESS\/sec of SMC. Each of the SPF and SMCare executed 10 times on the same dataset to obtain the estimate of thestandard errors.SPF may be needed to achieve adequate accuracy of estimation. In thisexperiment, 20 taxa and 1000 sites are generated using the GTR model. Toassess the quality of the approximation obtained by large numbers of implicitparticles, the estimate of the marginal negative log-likelihood is plotted inFigure 4.4 (a), over 5 runs, as the number of particles is increased. As can beseen from this figure, about 10 million particles is needed before the negativelog-likelihood plateaus. However, based on naive memory analysis, 10 millionparticles requires about 320 gigabytes of storage at each iteration.In the next experiment, the problem of reconstructing the ancestralrelationships between the taxa is considered. For each pair of taxa, thepairwise distances is estimated from the trees sampled using SPF. In this670.000.250.500.751.001.0 1.5 2.0 2.5 3.0Parameter valuesDensitySPFSMCllll lll l llll lllllllllll lll l llll lllllll100100005 10 15SMC IterationESS\/sec (log scale)llSPFSMC(a) (b)Figure 4.3: (a) The estimate of the mutation rate of the Jukes-Cantor model.The black vertical line indicates the true values, \u00b5 = 1.7. (b) The average ofthe ESS divided by the time (in seconds) for each iteration of SPF and SMC.The averages are calculated over the course of PMMH iterations.11200112401128010 20 30 40# Particles (millions)Negative likelihood2002503003504001 10 100Time (seconds)Average SSD(a) (b)Figure 4.4: (a) Decrease in negative marginal likelihood as the number ofparticles is increased. (b) Comparison of SSD versus time for SPF (red) andthe standard SMC (blue).case, the simulated dataset is on 20 taxa and hence, there are(202)suchpairwise distances. The estimate of the pairwise distance for species i, j isdenoted d\u02c6ij . The sum of squares of the difference (SSD) of the estimate to68the true value dij is computed as:SSD =\u2211i,j(dij \u2212 d\u02c6ij)2 (4.10)The result is shown in Figure 4.4 (b), where the values of SSD\u2019s are plottedas the time of execution is increased. Note that the SPF (red) achieves lowererror level compared to the standard SMC (blue). In fact, the memory limitprevents the standard SMC from running longer (using more particles) toachieve a lower error. In our largest experiment in Figure 4.4, the theoretical(extrapolated) memory needed if we had used concrete instead of implicitparticles would have been around 4 terabytes. In contrast, we were able torun most of these experiments on a computer equipped with only 8 gigabytesof RAM.69Chapter 5Sequential Graph MatchingIn this and the next chapter, we shift focus to the problem of graph matching.In this chapter, we describe the sequential model for graph matching, anda sequential Monte Carlo algorithm to sample a graph matching based onthis model. In Chapter 6, we discuss the application of the methodologydeveloped in this chapter to the knot matching problem.5.1 IntroductionA matching is a combinatorial structure derived from a graph that consistsof a set of edges such that pairs of edges are required to have no commonnodes. Matchings have important applications in machine learning whena feature function is defined on the nodes and edges of the graph. Givensuch a feature function, the compatibility score of a matching is commonlydefined as the dot product of the features and a parameter vector \u03b8 \u2208 Rp.Based on the compatibility score we may optimize or compute expectationsover the space of matchings, for example to find the most likely alignment oflandmark points between two frames of a video or of two images [36, 37], orof nucleotides in related genomes [38, 39].Two inter-related technical challenges are associated with such graphmatching problems. The first challenge is parameter estimation for thematching models, which arises in both supervised and unsupervised settings.In the supervised setting, we are given a set of labelled matchings and thetask is to optimize over the set of parameters to minimize a suitably chosenempirical loss function [40]. The unsupervised setting is typically based ona probabilistic approach, where a probability distribution over the space ofmatchings is defined [41]. One can then optimize over the parameters by70marginalizing over matchings or by using an Expectation-Maximization (EM)algorithm. The second challenge consists of computation of expectations overmatchings E[f(M)], where M is a matching-valued random variable and f isa test function. Since the simulation of matching-valued random variables isa well-known #P hard problem, approximate inference methods are needed.We develop a model and a method of inference that addresses thesetwo challenges. Our method is based on viewing a matching as a sequen-tial decision process, where at each step, a node, selected at random ordeterministically, picks a match (if any) from among the other nodes. Eachdecision is parameterized by a locally normalized logistic regression model.We demonstrate that using this local sequential view for both the modeland approximate inference is highly beneficial. In particular, using it as thebasis for the model formulation sidesteps the need to estimate an intractablenormalization constant. This allows us to approach parameter estimation,the first technical challenge, both in the supervised and unsupervised settings,using standard tools such as Monte Carlo expectation maximization (MC-EM)[42] and stochastic approximation (SA-EM) [43]. This compares favorably tocurrent models based on global Markov random fields, where both maximumlikelihood and Bayesian parameter estimation requires specialized methods[44, 45]. While there has been work on using the local sequential decisionview for the basis of an MCMC inference method over matchings [36], itsapplication to matching models has not been broadly studied. The Plackett-Luce model is a special case that has received attention in the literature,where the matching is the basis for constructing a ranking [46, 47, 48]. Thelocal sequential view has comparably been more popular in other types ofproblems such as hidden Markov models and parse trees [49].We argue that a natural approach for performing approximate inferenceover matchings in the local model context is to use sequential Monte Carloalgorithms [17]. This yields a simple algorithm that exploits the sequentialstructure of the model to explore the matching space efficiently, thus tacklingthe second challenge. SMC methods have been successfully applied to prob-lems where the latent state space has a combinatorial structure (e.g., [30, 7])and theoretical grounds for applying SMC methods for combinatorial state71spaces have recently been established [7, 23]. Furthermore, we can benefitfrom parallelization of the computation that comes naturally with SMCmethods; as well, methods such as [1, 50] which provide mechanisms for usingmore particles as needed (automatically) to ensure sufficient exploration ofthe space of the latent variables.Our work is motivated by a novel application in computational forestry,where the internal three-dimensional tree branch structure of wood planksis reconstructed by matching the surface knots. This application requires amatching method that can be generalized to quadripartite graph as each ofthe four long faces of a wood plank represents a partition. This reconstructed3-D structure is then used for predicting the strength of lumber.Approximation algorithms on matchings has been approached in theliterature using MCMC [51, 52] and recently using SMC in [53]. A variationalmethod has also been proposed [54]. A sequential Monte Carlo sampler forbipartite matching was proposed in [55], but they do not address the problemof parameter estimation and do not account for the overcounting problem [23],which may occur in hypergraph matching problems. Such problems are ofpractical importance, for example in our computational forestry application.5.2 Background5.2.1 NotationA hypergraph is a generalization of a graph where each edge (or also calledhyperedge) can connect any number of nodes. Matching can then be formu-lated as a set packing problem, where each edge is a set of nodes. We denotea K-partite hypergraph by G = (V1, ..., VK , E). We denote a node in Vk byvk,i, i = 1, 2, . . . , |Vk|. An edge e \u2208 E is a set of nodes; for example, an edgebetween two nodes v1,i, v2,j is denoted by e = {v1,i, v2,j}. Bipartite matchingis a special case of K-partite set packing problem with two partitions andeach edge restricted to contain exactly two nodes, one from each partition. Amatching is represented by a set of edges and we denote it by M or m (forrandom variable and realization respectively) with the space of all feasible72matchings denoted byM.5.2.2 Probabilistic modelsWe begin by describing a probabilistic formulation of the matching problemon a bipartite graph, G = (V1, V2, E). For ease of exposition, we assumethat |V1| = |V2|. For bipartite matching, we can think of matching m asa permutation m : V1 \u2192 V2. A probabilistic formulation of the bipartitematching problem in [41, 36] places a Gibbs measure on the matching:P(M = m|\u03b8) = exp\u3008\u03c6(m), \u03b8\u3009\u2211m\u2032exp\u3008\u03c6(m\u2032), \u03b8\u3009 , (5.1)where \u03c6 : M \u2192 Rp is a feature function (i.e., set of covariates extractedfrom the given matching m). However, the summation in the denominatoris intractable, ruling out direct optimization over the parameters. Samplingfrom a posterior distribution of \u03b8 with the likelihood defined by Equation 5.1is also difficult for the same reason.The authors of [36] define an MCMC proposal distribution by constructingthe bipartite matching iteratively. To do so, they introduce a permutationsequence, \u03c3 : {1, ..., |V1|} \u2192 {1, ..., |V1|}, which specifies the order in which thenodes in V1 are visited. Note also that the permutation sequence randomizesthe order in which the bipartite matching is constructed. We briefly describetheir construction here. First, let v1,\u03c3(r) be the node visited at the r-thiteration. Let the partial matching up to iteration r be denoted by mr\u22121,and let Vi,r\u22121 be the nodes in Vi that have not been matched up to the r-thiteration (note that m0 is the empty matching and Vi,0 = Vi). Suppose thenode v1,\u03c3(r) decides to match with the node v2,j \u2208 V2,r\u22121, resulting in anedge er = {v1,\u03c3(r), v2,j}. The probability of this decision is given by:p(et|mr\u22121, \u03b8) = exp\u3008\u03b8, \u03c6(mr\u22121 + er)\u3009\u2211e\u2032rexp\u3008\u03b8, \u03c6(mr\u22121 + e\u2032r)\u3009, (5.2)where e\u2032r = {v1,\u03c3(r), v2,j\u2032} for v2,j\u2032 \u2208 V2,r\u22121 and we have overloaded the73addition operator to denote by m + e an addition of an edge e to thematching m. Unlike in Equation 5.1, the summation in the denominator ofEquation 5.2 can be computed exactly in O(|V2|).This local multinomial model induces a Plackett-Luce-type probabilitydistribution on the complete matching, m = {e1, ..., e|V1|}:Q(m|\u03c3, \u03b8) =|V1|\u220fr=1p(et|mr\u22121, \u03c3, \u03b8). (5.3)The authors of [36] view Equation 5.3 as an approximation of Equation 5.1,and develop an MCMC sampler for matchings with proposal distributiongiven by Q. We take this idea further and propose in Section 5.3 to replacethe distribution in Equation 5.1 by one that resembles Equation (5.3). Doingso simplifies parameter estimation as it yields a likelihood model for whichpointwise and gradient evaluation can be computed efficiently.5.3 A sequential decision model for matchingsWe view the process of matching on a K-partite hypergraph as that of setpacking, where each node is placed, sequentially, into a set of \u201csimilar\u201d nodes.For full generality, we define the permutation sequence to visit every nodein the graph, \u03c3 : {1, ..., |V |} \u2192 {1, ..., |V |}. This contrasts with the bipartitematching problem, where it is sufficient to iterate over the nodes in only onepartition. Note that \u03c3 is a device that serves to randomize the constructionof matching and in full generality, it is viewed as a random variable. Eachnode v\u03c3(r) makes a decision, denoted by dv\u03c3(r) , among the set of availabledecisions, denoted by D(v\u03c3(r),mr\u22121). Here, we use mr\u22121 to denote thepartial matching implied by the sequence of decisions, {dv\u03c3(1) , ..., dv\u03c3(r\u22121)}but we often omit mr\u22121 for notational simplicity and just write D(v\u03c3(r)).The decision dv\u03c3(r) consists of potential ways the node v\u03c3(r) can be enteredinto the matching. More precisely, picking d \u2208 D(v\u03c3(r)) means that thepartial edge e\u2032 = d \u222a {v\u03c3(r)} is to be added to the partial matching mr, i.e.mr = mr\u22121 + e\u2032 = mr\u22121\\{d} \u222a {e\u2032}. We use the terminology partial edge,74since for example an edge with two nodes might be grown at a later iterationr\u2032 > r to an edge augmented with a third node.The decision set is user configurable to suit the problem at hand. Forexample, in bipartite matching without any restrictions, the decision can-didates are V2,r\u22121 (i.e., any nodes in V2 that have not yet been matched).Depending on the problem at hand, we may also allow for the decision set toinclude a singleton decision, where the node is placed into a new set by itself.With our formalism, this decision can be modelled with D(v\u03c3(r)) containingan empty set.We illustrate an example of a sequential set packing process in Figure 5.1.There are four partitions in this example, and the nodes are labelled in theorder \u03c3 that they are visited (i.e., the first node visited is the blue nodelabelled 1 and so on). For illustration purposes, we assume that \u03c3 is given. Inthis illustration, we formulate the decision set D(v\u03c3(r)) to contain any nodethat is in a different partition from v\u03c3(r) as a candidate for matching but (i)exclude any node that is beyond a certain distance away and (ii) restrict eachset to contain at most two nodes. The restriction (ii) enforces the decisionset to contain only the unmatched nodes or an empty set. The formulationof the decision set for the knot matching application is given in Section 5.4.2.In the first step of this example, we visit the blue node #1. The decisionsavailable are to form a singleton or to match with red node #2. In step 2,the only decision available for the red node #2 is to select the empty set,this is because the other nodes are beyond the distance span configured intoour decision set but also because it is already contained in a set containingtwo nodes.We model the process of the node making a decision amongst a list ofavailable decisions using a multinomial logistic model, where the number ofcategories depends on the number of decisions available for the node. Wedenote ej \u2208 D(v\u03c3(r)) for j = 1, ..., |D(v\u03c3(r))| and we denote the edge chosenby node v\u03c3(r) by edv\u03c3(r) . The conditional probability for v\u03c3(r) to be placed75263154Distance Span for 31 21 },{=1 2 }{3{ 33 5 643,,,3{}}D(v\u0000(1))D(v\u0000(2))D(v\u0000(3))D(v\u0000(4))===}{1 2 }{1 2 }{1 2 }{ 3,m0 ===m1m2m3 = 44(a) (b)Figure 5.1: (a) A 4-partite hypergraph representing a piece of lumber. Thenodes are labelled in the order \u03c3 that they are visited. Nodes outside the\u2018distance span\u2019 cannot reasonably belong to the same tree branch. (b) Thedecision candidates where only the singleton and doubleton matching arepermitted. See text for details.into edge ej given the partial matching, mr\u22121, is expressed by,p(dv\u03c3(r) |mr\u22121, \u03c3, \u03b8) =exp\u3008\u03c6(mr\u22121 + edv\u03c3(r) ), \u03b8\u3009|D(v\u03c3(r))|\u2211j\u2032=1exp\u3008\u03c6(mr\u22121 + ej\u2032), \u03b8\u3009, (5.4)where \u03c6 is a feature vector taking as input a (partial) matching. The likelihoodof the complete sequence of decisions is simply:`(\u03b8) =|V |\u220fr=1p(dv\u03c3(r) |mr\u22121, \u03c3, \u03b8). (5.5)We emphasize that the benefit of this model is that we can evaluate Equa-tion (5.5) and compute its exact gradient efficiently, which permits numericaloptimization of the likelihood over the parameters using off-the-shelf convexoptimization routines such as L-BFGS [56].5.4 Formulation of the decision setIn this section, we provide how one would formulate the decision set.7613245 63 2{{ } 4{ } 6{ }}:1 4{{ }{ }}:5 4{{ }}:V1 V2\u0000 = {3, 1, 5}613241 2{ }1 2{ }1 2{ }31 2{ }3 { }4\u0000 = {1, 3, 2, 4}3(a) (b)Figure 5.2: (a) Illustration of the decisions for a bipartite matching. Forbipartite matching, we only need to visit the nodes in one of the partitions.In this illustration, we have \u03c3 = (3, 1, 5). The decision set for the nodesare presented with the selected nodes colored in yellow. (b) An illustrationof sequential decision model used for knot matching application with eachpartition containing exactly one knot, labelled from 1 to 4. The rectanglesrepresent matching and curly braces represent edges. The visited nodes arecolored in yellow whereas the nodes that are yet to be visited are unfilled.5.4.1 Bipartite matchingBipartite matching is a special case where it is sufficient to form a matchingby visiting the nodes in only one partition. An illustration of the decisionset is shown in Figure 5.2. In this illustration, we visit the nodes in V1 withpermutation \u03c3 = {3, 1, 5}. First, the decision set for node 3 is all of the nodesin V2: {{2}, {4}, {6}}. From this set, the node 2 is chosen (marked as yellow)first. The process continues for node \u03c3(2) = 1: D(v1,\u03c3(2)) = {{4}, {6}}. Sincenode 2 is already matched to node 3, it is not included in the decision setformulated for v1,\u03c3(2). We can see from the figure that node 1 chose to matchwith node 6. Then, we move on to node \u03c3(3) = 5, which chose to match withnode 4 (the only choice available), to arrive at the bipartite matching shownin the figure.775.4.2 Knot matchingFor the knot matching problem, we begin by imposing a restriction on thecardinality of the edges in the matching to be restricted to {2, 3}. Thisrestriction stems from the fact that 4-matching is rarely observed in practice.With this restriction in place, a decision set for an uncovered knot v can beformulated to include any knot face on a different surface from v as well asany edge whose cardinality is 2 and does not contain a knot face from thesame partition as v. If v is already covered (belongs to an edge), we formulatethe decision set with only an empty set, equivalent to allowing no decisions.We have provided an illustration of a knot matching problem in Figure 5.3 (a)and illustrative decision sets in Figure 5.3 (b) and (c), with \u03c3 = {1, 2, 3, ..., 7},pre-specified. In Figure 5.3 (b), we have m0 = \u2205, and hence, it considers allof the nodes in the graph with different colors as candidates. In Figure 5.3(c), we note that the red node labelled #2 is already contained in an edge andhence, the only decision available is an empty set (for illustration purposes,we have indicated the edge that contains it in the decision set). Finally, anexample of a final matching state is provided in Figure 5.3 (d).Note that this decision model allows a singleton set as a by-product.For example, consider the case with one node in each partition shown inFigure 5.2 (b). Given a permutation \u03c3 = (1, 3, 2, 4), the decision set for node1 is {2, 3, 4}. Suppose it matches with node 2. Node 3 is presented withdecision set {{1, 2}, {4}}. Suppose it decides to form {1, 2, 3}. Next, we notethat node 2 is already covered, so it is presented with an empty set as theonly decision. Then, when we visit node 4, the only decision presented to itis an empty set because the edge {1, 2, 3} is already saturated. Therefore,node #4 forms a singleton set. In practice, a singleton case may arise due toan imperfect knot detection step.5.5 Parameter estimationWe describe parameter estimation for the cases when we are given a set oflabelled matchings. We also show that the joint inference of the parameters782473156(a)12Candidates for:{ }4 5 6 7Candidates for: 2 2given m1 = 2{ }11312 54 67(b) (c) (d)Figure 5.3: (a) 4-partite hypergraph representing a piece of lumber. (b) Thedecision set for blue node #1. (c) The decision set for red node #2 given thedecision made by blue node #1. (d) An example of a final matching. This isa modified version of a similar figure appearing in [2].and the matching is possible, in unsupervised settings.5.5.1 Supervised learningAssume that we are given a data set of I matchings: m1, ...,mI . We wouldlike to maximize p(\u03b8|m1, ...,mI). One potential difficulty is that for a givenmatching mi, there can be multiple paths (i.e., permutation and decision79sequences) that lead to mi. For example, consider the bipartite matching{{1, 6}, {3, 2}, {5, 4}} (shown in Figure 5.2 (a)). This matching can beattained with \u03c3 = (1, 3, 5) where d\u03c3(1) = {6}, d\u03c3(2) = {2}, and d\u03c3(3) = {4}as well as \u03c3 = (3, 1, 5) where d\u03c3(1) = {2}, d\u03c3(2) = {6}, and d\u03c3(3) = {4}.We view the permutation and the decisions as latent variables and expressthe complete data likelihood as,I\u220fi=1Lc(mi, \u03c3i,d\u03c3i |\u03b8) =I\u220fi=1p(mi|\u03c3i,d\u03c3i)p(\u03c3i,d\u03c3i |\u03b8) (5.6)=I\u220fi=11[(\u03c3i,d\u03c3i)\u2192 mi]\u00d7 p(\u03c3i,d\u03c3i |\u03b8). (5.7)Note that p(\u03c3i,d\u03c3i |\u03b8) is given byp(\u03c3,d\u03c3|\u03b8) =|V |\u220fr=1p(dv\u03c3(r) |mr\u22121, \u03c3r, \u03b8)p(\u03c3r|\u03c3r\u22121), (5.8)and 1[(\u03c3i,d\u03c3i) \u2192 mi] is an indicator function equal to 1 if and only if(\u03c3i,d\u03c3i) maps to mi. The inference can be carried out iteratively using theexpectation maximization [57]:(E step): Q(\u03b8, \u03b8t) = E[log p(\u03b8|{mi, \u03c3i,d\u03c3i}Ii=1)],(M step): \u03b8t+1 = argmax\u03b8Q(\u03b8, \u03b8t).where expectation is taken with respect to (\u03c3i,d\u03c3i) \u223c p(\u03c3,d\u03c3|\u03b8t,mi). Notethat the posterior can be expressed as,p(\u03b8|{mi, \u03c3i,d\u03c3i}Ii=1) \u221dI\u220fi=1Lc(mi, \u03c3i,d\u03c3i |\u03b8)p(\u03b8),80and hence, the Q function is given as,Q(\u03b8, \u03b8t) \u221dI\u2211i=1E[logLc(mi, \u03c3i,d\u03c3i)] + log p(\u03b8)=I\u2211i=1\u2211\u03c3i,d\u03c3ilogLc(mi, \u03c3i,d\u03c3i)p(\u03c3i,d\u03c3i |\u03b8t,mi) + log p(\u03b8). (5.9)We use a Monte Carlo version of EM to approximate the expectation involvedin the E-step [42]. To sample from p(\u03c3,d\u03c3|\u03b8t,mi), we use sequential MonteCarlo by defining the state space at each iteration of the SMC as Sr = \u03a3r\u00d7Dr,where \u03a3r is the set of all possible permutation sequences of length r andDr is the set of all possible decision sequences of length r. Let \u03c3r \u2208 \u03a3rdenote the partial map \u03c3r : {1, ..., r} \u2192 {1, ..., |V |}. The intermediate targetdistribution for iteration r is,p(\u03c3ir,d\u03c3ir |\u03b8t,mi) =1[(\u03c3ir,d\u03c3ir) \u2208 mi]p(\u03c3ir,d\u03c3ir |\u03b8t)p(mi|\u03b8t) .We use the notation (\u03c3ir,d\u03c3ir) \u2208 mi to mean the following: if (\u03c3ir,d\u03c3ir)\u2192 miris such that each e \u2208 mir is contained in some edge e \u2208 mi, then we say(\u03c3ir,d\u03c3ir) \u2208 mi.As for the proposal distribution, we can use p(\u03c3r,d\u03c3r |\u03b8t), in which casethe weight update for proposing (\u03c3ir+1,d\u03c3ir+1) from (\u03c3ir,d\u03c3ir) is simply:\u03b1((\u03c3ir,d\u03c3ir)\u2192 (\u03c3ir+1,d\u03c3ir+1)) = 1[(\u03c3ir+1,d\u03c3ir+1) \u2208 mi].With the target and the proposal distribution clearly defined, we cansample the latent permutation and the decisions to approximate the Qfunction in Equation (5.9):Q\u02dc(\u03b8, \u03b8t) =I\u2211i=11NN\u2211n=1logLc(mi, \u03c3i,n,d\u03c3i,n) + log p(\u03b8), (5.10)where (\u03c3i,n,d\u03c3i,n) \u223c p(\u03c3i,d\u03c3i |\u03b8t,mi) for n = 1, ..., N for each i = 1, ..., I.81The M-step can be carried out using numerical optimization procedures.Since each decision is modelled using a multinomial logistic, the likelihoodadmits exact computation of the gradient. If the gradient can be computedexactly for p(\u03b8), then efficient numerical optimization of the objective functionover the parameters using off-the-shelf optimization routines such as L-BFGS[56] can be adopted. For example, if we take the isotropic Gaussian priorover \u03b8, then the objective function is:Q\u02dc(\u03b8, \u03b8t) =I\u2211i=11NN\u2211n=1logLc(mi, \u03c3i,n,d\u03c3i,n)\u2212 \u03bb\u2016\u03b8\u20162, (5.11)for some \u03bb > 0.5.5.2 Unsupervised learningWe place an isotropic Gaussian prior on the parameters, and focus thediscussion here on maximum a posteriori (MAP) estimation of \u03b8 \u2208 Rp usingan EM algorithm [57]. Maximum likelihood estimation can be done in a verysimilar fashion. We also discuss the prospects for a full Bayesian analysis inthe conclusion.The posterior distribution over the parameters given the sequence ofdecisions denoted by d\u03c3 can be expressed as,p(\u03b8|d\u03c3, \u03c3) = p(d\u03c3, \u03c3|\u03b8)p(\u03b8)p(d\u03c3, \u03c3)=p(d\u03c3|\u03c3, \u03b8)p(\u03c3)p(\u03b8)p(d\u03c3, \u03c3). (5.12)Note that the denominator is independent of the parameters. In someproblems, such as document ranking, there exists a canonical ordering, making\u03c3 a fixed quantity instead of a random one [58]. When no canonical orderingis provided by the context of the problem, we place a uniform distributionover the permutation sequence \u03c3, independent of the parameters.The EM algorithm alternates between computing the conditional expec-tation of the latent variables given the current estimate of the parameters (E82step), and maximizing that expectation over the parameters (M step):Q(\u03b8, \u03b8t) =\u2211d\u03c3 ,\u03c3p(d\u03c3, \u03c3|\u03b8t) log p(\u03b8|d\u03c3, \u03c3), (5.13)\u03b8t+1 = argmax\u03b8Q(\u03b8, \u03b8t). (5.14)Here, the summation is difficult to evaluate analytically, and one may ap-proximate it by generating samples from p(d\u03c3, \u03c3|\u03b8t). Thus, the Monte CarloE-step approximates the expectation by sampling and taking an average overthe sampled values:Q\u02dc(\u03b8, \u03b8t) =1NN\u2211n=1log p(d\u03c3n |\u03c3n, \u03b8) + log p(\u03c3n)+ log p(\u03b8)\u2212 log p(d\u03c3n , \u03c3n), (5.15)where (d\u03c3n , \u03c3n) \u223c p(d\u03c3, \u03c3|\u03b8t). Note that when we are optimizing Equa-tion (5.15) over the parameters, the terms log p(\u03c3n), log p(d\u03c3n , \u03c3n) are in-dependent of \u03b8 and hence, need not be evaluated. Also note that withindependent Gaussian prior on \u03b8, we essentially obtain a log-likelihood withl2 penalty:Q\u02dc(\u03b8, \u03b8t) =1NN\u2211n=1log `(\u03b8)\u2212 12\u03bb\u2016\u03b8\u20162, (5.16)where ` is defined in Equation (5.5). As mentioned earlier, computing thegradient of ` is simple and efficient, and we use L-BFGS to perform thismaximization step.5.6 SMC sampler for predictionIn this section, we describe how to sample matchings, given the MAP esti-mate of the parameters (obtained by following the procedure described inSection 5.5) and show how the samples can be used for prediction. We basethe exposition in this section on two important developments from the SMCliterature. The first is the SMC samplers method [22], which extends basic83SMC by introducing a sequence of intermediate distributions such that theSMC algorithm is defined on the common state space with the final distribu-tion coinciding with the desired target distribution (please see Section 2.5for a review). This idea allows one to draw samples from an arbitrary statespace. The second is the Poset SMC and combinatorial SMC [7, 23], whichestablishes conditions for developing SMC sampler for combinatorial statespaces (please see Section 2.6 for a review).5.6.1 NotationWe begin by establishing notation for defining intermediate target distribu-tions as well as the intermediate state spaces. The state space of interestis the space of matchings on G, which we denote M. We generalize thisspace and introduceMr, r = 1, ..., R, as the intermediate state spaces. Thestate spaceMr denotes a matching that can be realized after r nodes havemade decisions using our sequential decision model. We use r to index theiterations of the SMC algorithm; therefore, R is equal to the number of nodesin the graph to be visited. This leads toMR =M, the space where everynode has made a decision and hence, placed into an edge.The intermediate distributions will be denoted by \u03b3r and the proposaldistribution by \u03bd+. We will use K to denote the number of particles in theSMC population and denote by skr and wkr the particle k at iteration r and itsun-normalized weight. The resampling step of the SMC algorithm is carriedout using the normalized version of the weights, w\u00afkr = wkr\/\u2211Nn=1wkr . Recallthat the resampling step of the SMC algorithm induces the notion of a parentparticle for each particle; we denote the index of the parent particle of skr byakr\u22121. The weight computation is carried out in a recursive manner:wkr = w\u00afakr\u22121r\u22121 \u00d7 \u03b1(sakr\u22121r\u22121 , skr ),where w\u00afr\u22121,akr\u22121 = 1\/K if resampling is carried out in the previous iteration84of SMC and \u03b1(sanr\u22121r\u22121 , skr ) is the weight function:\u03b1(sakr\u22121r\u22121 , skr ) =\u03b3r(skr )\u03b3r(sakr\u22121r\u22121 )\u03bd+(skr |sakr\u22121r\u22121 ).The particles and weights at the last iteration are used to approximate theexpectations of a test function f viaE[f(M)] \u2248 1KK\u2211k=1f(skR),if resampling is carried out after the last iteration and,E[f(M)] \u2248K\u2211k=1w\u00afkRf(skR),if resampling is not performed at the last iteration. We refer to Section 2 formore detailed treatment of SMC methods.5.6.2 Poset SMC for graph matchingAn important notion that we need to re-visit before we can complete thespecification of our SMC sampler for graph matching is that of partiallyordered set (refer to Section 2.6). The notion of partial order on the statespace S is characterized by the proposal distribution. In particular, s coverss\u2032 if from s, we can obtain s in one application from s\u2032. An example of aHasse diagram corresponding to the decision model for bipartite graph givenin Section 5.4.1 is shown in Figure 5.4 (a). We have also provided an exampleof a case s \u227a s\u2032 on the top panel of Figure 5.4 (b) and a case where two statesare not comparable in the bottom panel of Figure 5.4 (b). In essence, thesequential structure that is needed by the SMC method is induced by thepartial order.The next natural question concerns the conditions for a valid proposaldistribution that ensures correctness of an SMC algorithm. This is providedin [23] for combinatorial state spaces. An important condition is the connect-8513245 6V1 V213245 613245 613245 6...13245 613245 6...13245 613245 613245 613245 613245 6...13245 613245 6M1M0M2M313245 6V1 V213245 6V1 V2\u000013245 6V1 V213245 6V1 V2\u0000\u0000(a) (b)Figure 5.4: (a) Example of Hasse diagram corresponding to the bipartitedecision model described in Section 5.4.1. (b) Example of partial orderdefined on bipartite matching.edness condition, that is, starting from an initial state s0, we should be ableto reach every state s \u2208 S by finite number of applications of the proposaldistribution on s0. Another important condition is that the Hasse diagraminduced by the proposal is acyclic.5.6.3 SMC sampler for graph matchingIn Section 5.5, we used SMC for sampling the latent variables (\u03c3,d). Ourgoal in this section differs in the sense that the object of interest includesmatching as well as the permutation and the decision sequences, which areneeded for evaluating the likelihood. In this section, we develop an SMCsampler that operates on an expanded state space that admits sampling ofmatching. First, recall that (\u03c3r,dr) maps to a matching mr \u2208 Mr. Wedefine the intermediate state space as Sr = Mr \u00d7 \u03a3r \u00d7 Dr and define the86{}{{1,2}}{{3,4}}{{1,2,3}}{{1,2},{3,4}}{{1,3}}{{1,4}}{{2,3}}{{2,4}}...Figure 5.5: The state {{1, 2}, {3, 4}} can be reached by two distinct pathsfrom the initial state whereas the state {{1, 2, 3}} can be reached by threedistinct paths.intermediate distribution as,\u03b3r(mr, \u03c3r,dr|\u03b8) = p(mr|\u03c3r,dr)\u00d7 p(\u03c3r,dr|\u03b8)= 1[(\u03c3r,dr)\u2192 mr]\u00d7 p(\u03c3r,dr|\u03b8). (5.17)Here, 1[(\u03c3r,dr) \u2192 mr] denotes the indicator function that is 1 if (\u03c3r,dr)maps to mr and 0 otherwise. Note that each (\u03c3r,dr) maps to exactly onematching mr \u2208Mr because each decision made by a knot results in it beingplaced in exactly one edge.The state space that our SMC sampler operates on is defined as S = \u22c3r Sr.We can take the proposal distribution \u03bd+ = p(\u03c3r, d\u03c3(r)|\u03c3r\u22121,mr\u22121, \u03b8). Thischoice ensures that the state space S is connected starting from the initialstate s0 = (m0, \u03c30,d\u03c30), where m0 = \u2205. It is easy to verify that the weightfunction reduces to 1 with this choice of the proposal for the intermediatedistribution given in Equation 5.17.5.6.4 Overcounting correctionDesigning an SMC sampler for a combinatorial state space requires carefulattention to the possibility of an overcounting problem, which may lead87to biased estimates of the desired quantities. The overcounting problemarises when there are multiple paths that lead to the same state. To seethis, consider a graph with 4 partitions with one node in each partition(see Figure 5.2 (b)). The overcounting problem for this case is illustrated inFigure 5.5. In this figure, we can see that there are 3 paths leading to the state{{1, 2, 3}} starting from the initial state whereas there are 2 paths leadingto the state {{1, 2}, {3, 4}}. Approximation of any desired quantities usingthe SMC described in Section 5.6.3 would lead to bias if this overcountingproblem is not corrected. A solution to this problem is to incorporate thebackward kernel \u03bd\u2212 as proposed in [23]. One particular form of the backwardkernel that works is,\u03bd\u2212(s\u2032 \u2192 s) = |Q(s\u2032)|\u22121 \u00d7 1[\u03bd+(s\u2192 s\u2032) > 0], (5.18)where Q(s\u2032) is the number of possible parent states of s\u2032 and \u03bd+(s\u2192 s\u2032) > 0if s\u2032 \u2208 S can be proposed in one step starting from s \u2208 S. This leads to theweight computation step as,\u03b1(sakr\u22121r\u22121 , skr ) =\u03b3r(skr )\u00d7 \u03bd\u2212(sakr\u22121r\u22121 |skr )\u03b3r\u22121(sakr\u22121r\u22121 )\u00d7 \u03bd+(skr |sakr\u22121r\u22121 )= \u03bd\u2212(sakr\u22121r\u22121 |skr ).For the decision model corresponding to the knot matching application (seeSection 5.4.2), we have compiled a list of possible cases and the number ofpossible parents for each of the cases:\u2022 If s \u2208 S does not contain any singleton edge:\u2013 For any edge with 2 nodes, if it contains1. two visited nodes, count two possible parent states.2. one visited node, count one possible parent state.\u2013 For any edge with 3 nodes, if it contains1. two visited nodes, count two possible parent states.882. three visited nodes, count six possible parent states.\u2022 If there is at least one singleton edge in s \u2208 S:\u2013 For any edge with 1 node, count one possible parent.\u2013 For any edge with 2 nodes, if it contains1. one visited node, this state is not reachable under this model.2. two visited nodes, then count two possible parent states.\u2013 For any edge with 3 nodes, if it contains1. two visited nodes, then return zero possible parent state.2. three visited nodes, then count three possible parent states.Note that if a singleton set exists in a state, and if there are 2 visitednodes in a 3-matching, then undoing the move performed by one of the visitednodes in the 3-matching breaks it into 2-matching, which produces a statewhere a singleton set cannot have been attained. Therefore, the last movemust have been made by one of the singletons and hence, we count only oneparent state (which is accounted for by the singleton edge). Illustration of asubset of the cases are provided in Figure 5.6.5.6.5 Evaluation metricsIn this section, we describe two approaches to evaluate goodness of thepopulation of the particles generated using the SMC sampler for matching.Single sample predictionWe can obtain a single sample prediction by choosing the particle withthe highest likelihood, which we denote by m\u02c6. We can then compute theprediction accuracy as,a(m\u02c6,mtrue) =1|mtrue|\u2211e\u2208mtrue1[e \u2208 m\u02c6], (5.19)where mtrue denotes the true matching. Note that a(m\u02c6,mtrue) \u2208 [0, 1] and itis equal to 1 when m\u02c6 = mtrue.891 2{ }31 2{ }31 2{ }31 2{ }31 2{ }1 3{ }2 3{ }1 2{ }3 { }41 2{ }3 { }41 2{ }3 { }41 2{ }3 { }41 2{ }31 2{ }{ }41 2{ }3 1 2{ }{ }431{ }Figure 5.6: The possible parent states for different cases. Left: a statecontaining a 3-matching where all three nodes have been visited. Top right:containing a 3-matching and a singleton where all four nodes have beenvisited. Bottom right: example of states that are not permitted under thedecision model for the knot matching application (see Section 5.4.2).Jaccard IndexTo assess the entire particle population, we use Jaccard index, which iscommonly used metric for computing the similarity coefficient [59]. We cancompute the Jaccard index to evaluate the deviation of each of the SMCparticles from the true matching. Jaccard index is defined on two sets A andB as follows:J(A,B) =|A\u22c2B||A\u22c3B| . (5.20)We can use Jaccard index to evaluate a particle mn as follows. For eachnode v, we find the edge that contains v in mn as well as in mtrue. We willdenote these edges by mn(v) and mtrue(v). Then, we compute the Jaccardindex between the n-th particle and the truth by,J(mk,mtrue) =1|V |\u2211v\u2208VJ(mk(v),mtrue(v)). (5.21)90Note that the minimum value for J(mk(v),mtrue(v)) is 1\/6 for the knotmatching application since both mk(v) and mtrue(v) must contain v. Themaximum value is 1 if the two edges contain the same set of nodes.5.7 ExperimentsIn this section, we consider two applications of bipartite graph matching:document ranking and image matching. The application to hypergraphmatching is deferred to Chapter 6.5.7.1 Document rankingWe begin by presenting results on a supervised learning experiment where theorder is provided by the structure of the problem. With our model, supervisedlearning in such context reduces to simple logistic regression training, whichcan be performed by convex optimization using the exact gradient. In contrast,globally normalized methods such as [41] require approximate inference evenin the supervised setting. These supervised learning experiments allow us tofocus on the performance of the model; evaluation of the SMC algorithm isdeferred to the subsequent subsections.We apply the sequential decision model to the document ranking task. Forthis problem, we are given a set of documents for a given query, and our goalis to rank the documents by relevance. As described in [41], the documentranking task can be formulated as a bipartite matching task where the taskis to match the rank to the documents with the goal of matching higherrank with relevant documents and lower rank with irrelevant documents. Acommon approach to this problem is to use supervised learning methodswhere we are given a dataset of queries and corresponding documents foreach query, where each document is labelled with a relevance value. In fact,our sequential decision model is related to the listwise approach proposedin [47] when applied to the document ranking task, with the difference that[47] uses neural networks and a custom-defined loss function to estimate themodel parameters whereas we optimize the log-likelihood. We demonstrate91llllll lllllllll llll llll llllll lllllll ll llllllllllllllll llllll l llllllllll ll llllll l llllllllll l lllllllllll ll lll l0.400.440.480.520.561 2 3 4 5 6 7 8 9 10kNDCG l ll lllll l lllllll l l llllllll l l lllllll l lllllllllll l lllllll ll llll ll l llllll l llllll ll ll llllllll llll l llll l ll l lllll lll0.250.300.350.401 2 3 4 5 6 7 8 9 10kNDCGMethodlllllllllllll(Our method)AdaRank\u2212MAPAdaRank\u2212NDCGFRankListNetRankBoostRankSVMRankSVM\u2212PrimalRankSVM\u2212StructRegressionRegression\u2212RegSmoothRankSVMMAP0.000.050.100.150.205\/106 8\/103 12\/99 23\/88 37\/74 56\/55 106\/5 103\/8 99\/12 88\/23 74\/37 55\/56Training\/Testing SplitAverage Error RateMethodsLA LearningGA LearningSGM w\/ Node FeaturesSGM w\/ Edge Features(a) (b) (c)Figure 5.7: The performance on (a) OHSUMED, (b) TD2003, (c) imagematching.the effectiveness of our method on the LETOR 3.0 benchmark [58]. Weperformed experiments on the OHSUMED and TD2003 datasets. In each ofthese two datasets, we have a query and documents pair, which we denote by{(qi, {xij , yij}Jij=1)}Ii=1, where I is the total number of queries and Ji is thetotal number of documents corresponding to query qi. Here, xij denotes thefeatures for document j retrieved for query qi and yij denotes the relevancelabel. For OHSUMED, yij \u2208 {0, 1, 2} and yij \u2208 {0, 1} for TD2003.To perform training, we adopt the mechanism used in [41] where eachquery-document pair (qi, {xij , yij}) is broken up to yield B training instances,each containing L documents. Each training instance is obtained by firstchoosing a document from each relevance class and then randomly samplingthe rest of the documents without replacement. The number of traininginstances is then I \u00d7 B. In [41], the experiments were carried out withL \u2208 {3, 4, 5}, partly due to the problem of intractable summation arising fora larger value of L. One advantage of our method is that we can experimentwith larger values of L, as intractable summation is not a problem under ourframework.The metric we use for measuring the performance of ranking on thetest dataset is the standard NDCG@k metric [41]. The performance of thesequential decision model for k = 1, ..., 10 is shown in Figure 5.7 (a) and(b) (the dotted red line). Our method attains a level of performance that iscompetitive with the top methods, and in some cases strictly better than allthe other methods.925.7.2 Image feature matchingWe now turn to the image matching experiment. This experiment demon-strates a case where there is no natural order \u03c3 provided by the problem. Fortraining, we sampled a single sequence of decision \u03c3i,d\u03c3i for each traininginstance i = 1, ..., I. The SMC algorithm comes in at the prediction stageafter having trained the parameters. In this case \u03c3 is sampled jointly withthe decisions by our SMC algorithm.We are given a pair of images containing 30 landmark points. Theselandmark points correspond to the nodes in the bipartite graph to be matched.We test our method on the CMU House dataset, which was used for evaluatinga supervised graph matching algorithm in [40]. There are 111 frames in thevideo, where each frame is an image still with a slight modification from theprevious frame. Each landmark point u is associated with a shape contextfeature, f(u) \u2208 R60+ . For any proposed matching (u, v), the feature functionis defined as: \u03c6p(u, v) = |fp(u)\u2212 fp(v)|, p = 1, ..., 60.We split the data exactly as in [40] for ease of comparison. For eachl \u2208 {25, 15, 10, 5, 3, 2}, the first scenario takes the training set to be theimages that are divisible by l with the remaining as testing set, while thesecond scenario takes the training set to be the images that are not divisibleby l and the testing set to be the images that are divisible by l. Thus, a totalof 12 scenarios are tested.In [40], Delaunay triangulation was performed on each image yielding agraph structure Gi = (Vi, Ei) for each image i = 1, ..., 111. We incorporatethis graph structure to extract edge features (also known as pairwise affinities)as follows. Suppose we are matching two images Gi and Gi\u2032 . Let mr\u22121 bethe partial matching and let e = {vi,j , vi\u2032,k} be the proposed matching. Foreach ej\u2032k\u2032 = {vi,j\u2032 , vi\u2032,k\u2032} \u2208 mr\u22121, we compute,cj\u2032k\u2032 = 1[{vi,j , vi,j\u2032} \u2208 E1, {vi\u2032,k, vi\u2032,k\u2032} \/\u2208 E2]+1[{vi,j , vi,j\u2032} \/\u2208 E1, {vi\u2032,k, vi\u2032,k\u2032} \u2208 E2].93Then, set \u03c661(mr\u22121 + e) = \u2212\u2211j\u2032k\u2032 cj\u2032k\u2032 . The results are shown in Figure 5.7(c). Note that our method based on the shape context feature outperforms thelinear assignment learning methodology used in [40], which uses only the shapecontext features; likewise, our method with the edge feature outperformsgraduated assignment learning, which uses the edge features.5.8 ConclusionWe have presented a method for learning a graph matching on a hypergraphby modelling the matching as a sequence of local decisions. The sequentialdecision model allows for incorporating the compatibility score in a similarfashion to the Gibbs measure whilst addressing the parameter inferenceproblem via simple algorithms based on MC-EM combined with sequentialMonte Carlo samplers. We have focussed on MAP parameter estimation,but our model is also amenable to full Bayesian analysis. While globallynormalized models require doubly-intractable methods for full Bayesiananalysis [45], our model can be easily combined to implement particle MCMC(PMCMC) methods [24]. Compatibility with PMCMC also opens the doorto more advanced SMC methods (for example, [60]). For large problems,where PMCMC methods may be computationally prohibitive, a more suitableoption may be offered by profile likelihood methodology that allows to accountfor Monte Carlo uncertainty (for example, see [61]).94Chapter 6Knot MatchingIn this chapter, we provide details of the knot matching application anddemonstrate the performance of the methodology developed in Chapter 5.6.1 IntroductionWood processed in mills to produce sawn lumber for use in constructionis assigned into grades, according to established rules and standards [62].The grading process for a piece of lumber involves identifying its visualcharacteristics and assessing its strength in a non-destructive manner. The\u201cknot\u201d, formed by a branch or limb during growth of the tree, is an importantclass of visual characteristics that affects both the aesthetic quality as well asthe strength of wood. For individual pieces of lumber, previous studies haveshown a strong relationship between the size of its knots and its strength whenloaded to failure (see for e.g., [11, 12]). Therefore, knots have an importantrole in determining the grade of a piece.Many modern mills utilize machine vision systems to automate the pro-duction process. Scanning systems incorporating lasers and cameras are usedto detect the visual characteristics for quality control and for grading [63, 64].Although images from such systems could be analyzed to provide detailedinformation about every knot on the piece, current grading rules define stan-dards and size limits on individual knots (or knot clusters) only. Therefore,much of the potential in the use of these systems to improve lumber strengthprediction has yet to be realized. The strength-reducing effects of differentknots on the piece may work together, and jointly modeling their effectsmay permit more accurate predictions of the ultimate strength of lumber,compared to models based on a single knot alone. Towards this objective, fast95and accurate algorithms for detecting and identifying all knots from surfacescans of boards are needed. On that basis, a new strength prediction modelcan be developed from the complete knot data; uncertainties in the gradingand identification can be captured in a probabilistic prediction framework(see for example [13]).We consider surface scans of lumber pieces that provide images of its fourlong sides. In processing these images, two main tasks must be performedto detect and identify knots. The first task is knot face recognition fromthe images; this task belongs mainly in the realm of computer vision as itshares similarities with the object recognition problem. The second task,which is the focus of this chapter, is automatically identifying which of thedetected knot faces on the different sides are from the same tree branch; werefer to this problem as \u201cknot matching\u201d. Note that knot refers to the threedimensional convex body that is to be reconstructed from the knot facesthat are observed on the surfaces of a piece from scanning technologies. Bycombining information from four-sided scans, the three-dimensional structureof the wood fibers can be characterized, which is important for strengthprediction [65].Formally, we shall represent a piece of lumber by a quadripartite graphwith each of its surfaces forming a partition, with the knot faces as the nodes ofthe graph. Knot matching can thus be formulated as a quadripartite matchingproblem on a non-uniform hypergraph. We propose to use the sequentialdecision model to build a matching, where each decision is modeled via alocal multinomial regression. This class of models is commonly used in otherapplication areas (for example, part-of-speech tagging in natural languageprocessing (see for e.g., [49]). This approach allows inference for the modelparameters via maximum likelihood or maximum a posteriori estimationto be performed using standard techniques, when given a sample of boardswith known matchings (e.g., manually matched by a human). We then use asequential Monte Carlo (SMC) sampler for sampling knot matchings giventhe estimated parameters. The SMC sampler draws a population of particlesfrom the space of matchings, and thus also serves to estimate uncertainty inthe unknown matching when applied to a future piece of lumber. We show96that our SMC sampler is fast and thus permits online application for gradingin lumber mills.Thus we anticipate that our contributions to this applied problem willenhance the sawn lumber production process in two important ways. First,each individual knot can be better assessed by capturing information fromall of its visible faces much like a traditional human grader would, therebyincreasing the effectiveness of automated grading without sacrificing speedand efficiency. Second, accurate automatic knot matchings will provide animportant component of the necessary data for future refinement of lumberstrength prediction models based on visual characteristics.6.2 Data for knot matchingThe images necessary for the development of our application are generatedfor a piece of lumber, as shown in Figure 6.1. High-definition cameras areinstalled to capture images of the four surfaces as it moves on a carrier (i.e.,each piece is taken along conveyor belt through the scanning station). Notethat each piece of lumber has six sides but the two ends are typically ignored,as controlled sawing leaves no knot faces to appear on the end sides. The firstprocessing task is to identify the knot faces in the images. This task is muchlike object detection and localization (e.g., face detection from a photo). Forthis purpose, we use an internally developed knot detection algorithm thatoutputs data on the location and size of the knot faces. The second task, andthe focus of this paper, is to identify which of the knot faces on the differentsurfaces belong to the same knot (i.e., from the same tree branch). In thissection, we provide the details of the relevant data for the knot matchingproblem.6.2.1 Lumber and knot representationThe raw images of the surfaces as shown in Figure 6.1 are first processed byan internally developed knot detection algorithm. Knots are often modelledas elliptical cones [66], and hence, our implementation of the knot detection97Figure 6.1: Sample lumber used in the real data analysis. Four sides of theboards with the wide surfaces shown on the first and the third rows and thenarrow surfaces shown on the second and the last rows.algorithm models each knot face as an ellipse. We view each piece of lumberas a 3-dimensional object, positioned in a standard 3-dimensional Euclideanspace as shown in Figure 6.2, with the x, y and z-axes representing length,width, and height respectively. In this fashion, the two \u2018wide\u2019 surfaces inFigure 6.3 (a) and (c) are parallel to the x-y plane, while the two \u2018narrow\u2019surfaces in Figure 6.3 (b) and (d) are parallel to the x-z plane.For each knot face on each surface, the knot detector outputs the 3-dimensional coordinate, (x, y, z), indicating the position of the center of theknot face. It also outputs the axes of the fitted ellipse on the knot face,denoted by (a, b) where a is the length of the axes along the x-axis and b isthe length of the axes along the y-axis for the two \u2018wide\u2019 surfaces and z-axisfor the two \u2018narrow\u2019 surfaces. Additionally, we have the rotation angle ofthe fitted ellipse, denoted \u03b1. In summary, each knot face is represented bythe 6-tuple (p, x, y, z, a, b, \u03b1), where p denotes the index of the surface (i.e.,partition).6.2.2 Choice of covariatesFor knot faces on distinct surfaces, we can compute a vector of associatedcovariates to assess whether the knot faces belong to the same branch, andhence should be matched together. Covariates that are useful predictorswould help distinguish matches from non-matches. For each pair of knotfaces u and v on distinct surfaces we consider the following covariates:\u2022 Both u and v appear on a wide surface: We compute the distancebetween u and v. We observed from our data the most common98xyz3-matching2-matchingFigure 6.2: 3-dimensional view of the lumber (not to scale). An illustrationof 2-matching and 3-matching are provided.occurrence among the matched knots is of this type. Knot faces atshorter distances apart are more likely to be matches.\u2022 One of u or v appears on a narrow surface: This covariate resemblesthe one above in that we compute the Euclidean distance between uand v. We found that differentiating the two cases helped improvethe performance of the matching method that we will describe in thesubsequent sections.\u2022 Comparison of sizes: To assess the size difference between knot faces,we compute the areas of their fitted ellipses and compute the absolutedifference |uarea \u2212 varea|. Knot faces belonging to the same branch areexpected to have a smaller difference in sizes.For a triplet of knot faces u, v, w, we consider the following covariate,which is a slight modification of the ones listed above to accommodate3-matching:\u2022 Maximum and minimum distances: We compute the Euclidean distancebetween each pair of knot faces and extract the maximum and theminimum pairwise distances as covariates. Recall that the knot faces991234Figure 6.3: A closer look at a segment of a plank. The matching for knotfaces labelled 1, 2, 3, 4, produced by the human grader is {{1, 4}, {2, 3}}.represent a surface of a convex body that appear when an elliptical coneis sliced. Therefore, two of the knot faces must share an axis as shownin Figure 6.2. However, we found that inaccuracies during the knotdetection stage can potentially capture two knot faces that share an axisand appear separated. This error is of a reasonable size and we foundthat computing the distance between the nearest knot faces instead isa useful approximation that leads to good empirical performance. Themaximum distance is analogous to the distance covariate computedabove for a pair of knot faces.\u2022 Comparison of sizes: To adapt the size covariate above, we sum thearea of the fitted ellipses of the two closest knot faces and take theabsolute difference with the area of the remaining knot face.We estimate the parameters associated with each of these covariates froma sample of boards where the correct knot matching is known.1006.2.3 Matching by a human graderFor a well trained person, matching the knot faces is an easy task (althoughtime consuming). That human grader would examine a piece of lumber fromend-to-end, using the visual characteristics of the knot faces to determine thematches. For example, note the knots labelled 1 to 4 in red, on Figure 6.3.The grader is able to determine the correct matching after careful examination:knot face 1 and knot face 4 belong to the same branch and knot face 2 andknot face 3 belong to the same branch. However, there are cases that can bedifficult even for a human grader and we would like to quantify uncertaintyin the matching using probabilities.We utilized a human grader to manually annotate our data. For eachboard, each knot is represented by (p, x, y, z, a, b, label), where label is aunique identifier given to knot faces that stem from a same branch. Themanually annotated matchings will be used to evaluate the performance ofour approach.6.3 Experimental resultsThis section presents the results of experiments to demonstrate the perfor-mance of the methods proposed in Chapter 5. We have 30 boards thatwere manually annotated (i.e., knots were matched manually) for evaluationpurposes. As it is expensive to acquire additional data, we used the proceduredescribed in Section A to simulate realistic knot matching data for evaluatingour model and SMC sampler. These simulated pieces of lumber supplementthe real data for analysis, in particular to test the feasibility of deploying themethodology under the real time constraint.6.3.1 Preliminary experimentsBefore tackling the knot matching data, we perform experiments to validatevarious components of the model and the methods.101Validation of parameter estimation procedureWe begin with a simple parameter estimation experiments. We simulate asynthetic graph as follows:1. Sample the parameters, \u03b8j \u223c N(0, \u03c42).2. Generate N nodes per partition.3. For each node, sample the covariates \u03c6j \u223c N(0, \u03b62) for j = 1, ..., d,where d denotes the number of covariates.4. Sample \u03c3 from uniform distribution over the permutation.5. Sample the decisions d\u03c3 \u223c p(\u00b7|\u03c3, \u03b8) (using Equation 5.4).We can repeat the synthetic graph generation process I times to obtain aset of labelled matchings {(mi, \u03c3i,d\u03c3i)}Ii=1. Given this dataset, we carriedout experiments to validate the MAP estimator of the parameters. In thisexperiment, we used the true \u03c3 that was used for generating each of thematchings. In this case, we expect the parameter estimator to be efficientas it boils down to that of estimating the parameters of a model composedof multiple multinomial logistic regression models. We have shown the rootmean squared error, d\u22121\u2016\u03b8\u02c6MAP \u2212 \u03b8true\u20162, when d = 2 in Figure 6.4 (a). Asthe number of nodes in the graph is increased, the accuracy of the estimationimproves as expected. We have also generated a surface plot of the posteriorwhen d = 2 (two covariates) in Figure 6.4 (b). Note that the responsesurface that we are optimizing over is convex and the MAP estimate attainsa higher value of the log-likelihood compared to the truth. Figure 6.4 (a) wasobtained using I = 10. We fixed the number of partitions to 2. The standarderror estimates were obtained using LOESS in the R package ggplot2 [67].Figure 6.4 (b) was obtained by evaluating the likelihood function over a gridof parameter values. This experiment serves to verify the correctness of ourparameter estimation procedure.102Overcounting correction experimentsThis subsection illustrates the overcounting problem and why it needs to beaddressed to sample graph matching using SMC. To that end, we assumea scenario where we want to sample graph matching from the uniformdistribution over all possible configurations permitted by our choice of thedecision model (for example, the decision model for knot matching given inSection 5.4.2). For illustrative purposes, suppose we have a simple example ofa quadripartite graph with one node in each partition. Note that this decisionmodel is restricted to {2, 3}-matchings so there are total of 7 matchingspossible for this graph and hence, we expect the probability of sampling amatching to be 1\/7 (see Figure 6.5 (a)).We have computed the estimate of the probability of each matchingconfiguration from the SMC population, p\u02c6m and computed the root meansquared error:\u221a7\u22121\u2211m\u2208M(p\u02c6m \u2212 1\/7)2. In Figure 6.5 (b), we show thatthe RMSE tends to 0 as the number of particles used in the SMC is increased(the blue curve). On the other hand, the RMSE stabilizes around 0.02 whenthe overcounting problem is ignored (the red curve).6.3.2 Data analysisIn this section, we analyze the simulated and real lumber data.Real data analysisIn this section, we evaluate our methods on the 30 boards that had beenmanually annotated. First, we illustrate the parameter estimation procedurethat was carried out using an MC-EM procedure. The sample size usedfor the E-step is kept at 100 for the first 10 iterations of MC-EM, whichis increased to 500 onwards to reduce the Monte Carlo error across theiterations. The convergence of MC-EM is monitored by plotting Q\u02dc across theiterations. In Figure 6.6, we show a plot of the Q\u02dc function across MC-EMiterations with the error bars computed using the standard deviation of theMonte Carlo samples to Q\u02dc at each iteration of MC-EM. The figure suggeststhat convergence is reached in about 10 iterations. The trajectory of the103parameters across MC-EM iterations is shown in Figure 6.7. This plot showsthat distance based covariates play important roles compared to area basedcovariates. The value for \u03bb is set to 1 for the experiments.To evaluate predictive performance, we perform leave-one-out cross vali-dation. That is, we leave one board out from the MC-EM inference procedure(i.e., obtain MAP estimate using the remaining 29 boards). Then, we samplegraph matchings on the held-out board to be evaluated using single sampleprediction accuracy and the Jaccard index described in Section 5.6.5. With\u03bb = 1, the overall accuracy is 375\/384 = 0.977 using a single sample pre-diction. The board-by-board performance is shown in Figure 6.8. We haveexperimented with \u03bb = 0.1 and \u03bb = 10 as well and found the single sampleprediction performance to be comparable to \u03bb = 1.Simulated data analysisA data simulation procedure is helpful for future research in the automaticstrength grading of lumber as it can be used to calibrate the performance ofmatching methodology presented in Chapter 5 on simulated boards due to thehigh cost associated with acquisition of the real data (i.e., requiring manualknot matching). In particular, there are various types of knots that we havenot been able to model due to the limitation in the dataset. The simulationprovides a testbed to develop new models for knots and new features for knotmatching for the application experts.One way that we make use of the simulated data is to test the feasibilityof deploying the SMC sampler in real time. As the end goal is to deploythe matching mechanism developed here to mills that operate under realtime constraints, it is important to study the time it takes to generatesamples using SMC. To that end, we simulated 100 boards. To speed upthe sampling procedure, we segmented each board into multiple subgraphs.This segmentation procedure was carried out based on distance so that knotswithin certain distances are placed into the same subgraph. The SMC samplerwas executed locally within each subgraph and this helped to significantlyreduce the time to draw samples since there were less decisions to consider104at each iteration of an SMC.We have plotted the timing results in Figure 6.9. The figure depicts thescatter plot of the timing results for 100 simulated boards as well as 30 realboards when the number of particles is set to 1000 against the number of knotfaces on the board. Observe that for most boards, it only takes a fraction ofa second for sampling to complete. For completeness, we carried out a 2-foldvalidation procedure to quantify the performance of our methodologies onthe simulated dataset. We attained the single sample prediction accuracy of93% on the 100 simulated boards.6.4 Conclusion and discussionIn this chapter, we evaluate the statistical inference procedure for matching onthe novel knot matching application, which can be formulated as a 4-partitehypergraph matching problem. This is an important step towards automatingthe grading of lumber, one where statistical inferential methods can be used toproduce not just a single strength prediction value but a posterior predictiveinterval that captures any uncertainties encountered in the process. Wedeveloped a sequential decision model that admits efficient inference of theparameters and an SMC sampler that allows for rapid sampling that canmeet the real time constraints of lumber mills. Furthermore, the model couldin principle admit a full Bayesian approach via particle MCMC methodology[24].The predictive performance of our methodology will be dependent on thechoice of covariates. The framework we have laid out in this paper is generaland will allow users to craft and experiment with different covariates, espe-cially when more sample data becomes available in the future. The covariatesthat we crafted here led to good performance on the current sample data;however, they are by no means complete. For example, additional informa-tion that might be incorporated to further improve predictive performanceincludes the rotation angle of the knot faces and\/or more detailed shapeinformation on the knots.Future work remains to complete an automatic lumber strength grading105pipeline. We shall develop an enhanced statistical model for strength pre-diction, using the output from our knot matching methodology as an inputfor producing the strength estimate. With accurate knot matchings anduncertainty appropriately quantified, we anticipate that our contributionswill have practical impact in this discipline and demonstrate yet anotherapplication where modern statistical methods can greatly benefit the field.1060.20.30.40.50.60.750 100 150Num Nodes per PartitionRMSE(a)xyloglikloglik@MAP: \u2212137.184 loglik@truth: \u2212145.458lll llllllllllllllllllllllllllllllll llllll llllllllllll lllllllllllllll lllllllllllllll llllllllll llllll l llllllll lllllllllllllllllll llllllllllllllllll l lllllllllllllll llllllllllllllll lllllllllll llllllllllllllllllllllllllllllllllllllllllllll lllllll lllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll llllllllllllllllllllllllllllllllllll lllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll lllllllllllllllllllllllllllll lllllllllllllllllllll llllllllllllllllllllllllllllllllllllllllllllllll lllllllllll llllllllllllllllllllllllllllllllllllllllllllllll llllllllllllll lllllllllllllllllllllllllllllllllllllll llllllllllll llllllllllll llllllllllllllllllllllllllllllllll lllllllllllllll llll lllllllllllllllllllllllll llllllllllllll lllllllllll llll llllllll llllllll lllllllllllllllllllllll llllllllllllllllllllll lllllllllllllllllll llllllllllllllllllllllllllllllllllllllllllllllllll llllllllllllllll llllllllllllllll lllll llllllllllllllllllllllllllllllllllllllllllllllll l lllllllllllllllllll llllllllll llllllllllllllllllllllllllllllllllllllllllllllllllll lll llllllllllllll llllllllll lllllllllllllllllllllllllllllllll llllllll lllllllll llll llllll llllllllllllll llllllllll lllllllllllllllllllllllllllllllll llllllllllllllllllll llllllllllllllllllllll l lllllllllllllllllllllllllllll llllllllllllllllll lllllll l lllllllllllllllllllllllll llllllllllllllllllllllllll llllllllllllllllllll llll llllll lllllllllllllllllll lll lllllllllllllllllllllllllll lllllllllllllllllllllllllllllllllllllllllll lllllll lll llllllllllll lllllllllllll llllll llllllllllllllllllllllllllllllllllllll llllllllllllllllllllllll llllllllllllll llllllllllllllllllllllllllllllllllllllllllllll llllllllllll lllll lllllll lllllllllllll llllllllllllllllllllllllllllllllllllllllllllllllllllllllll llll lllllll llll lllllllllll lllllllllllllllllllllllllllllllllllllllllllllllllllllllllll llllllll lllll lllll llllll lll lllllllllllllllllllllllllllllll lllllllllllllllllllllllllll llllllllll lll llllllllll lll l lllllllllllllllll llllllllllll llllllllllll llllllllllllll lllllllllllllllll lllllllll llll l llllllllllllllll l lllllllllllll llllllll llllllllllllll lllllllllllllllll llllllllll llllll l llllllllllll llll l lllllllllllll lllllll llllllllll lll llllllllllllllllll llllllll l llllll l lllllllllllll llllll l ll llllllllll llllllllllllllllll l lllllllllllllllll lllllllll lllllll l llllllllllll llllll l ll ll llllllllll lllllllllllllllll l lllllllllllllll l ll lllllll llllll l lllllllllllll llllll l ll ll llllllllll lllllll llllllllll l l lllllllllllllll l ll lllllll lllll l llllllllllllllllllll l ll ll llllllllll llll lll llllllllll l l lllllllllllllll l ll lllllll lllll l lllllllllllllllllll l ll ll llllllllll llll lll llllll llll l l lllll lllllllllll l ll lllllll lllll l llllllllllll lllllll l llll llllll llll llll lll llllll llll l llllll lllllllllll l lllll lll ll llll l llllllllllll lllllll l llll llllllllll llll lll llllll ll l l llllll llll lllllll l lllll lll ll llll l ll llllllllll llllll l llll llllllllll lllllll llllll ll l l ll lll l llll llllll l lllll lll llllll l ll llllllllll llllll ll ll l llll ll lll lllllll llllll ll l l ll lll l llll llllll ll llll lll lllll l llllllllllll llllll ll ll l llll ll lll ll llll llllll ll l ll lll l llll llllll ll llll llll ll lll l lllllllllll lllll ll ll l llll ll lll ll llll llll ll l l l llll l llll lllll ll llll llll ll lll ll lllllllllll lllll ll ll llll llll ll llll llll ll l l ll llll l llll lllll ll lll llll llll ll lllllllll ll llllllll l l llll llll ll lll llll ll l ll llll l ll ll llllllll lll llll llll ll lllllllll llllll ll l l llll l lll ll lll llll ll l ll lll l lll llllll ll lll llll llll ll llllllll lllllll ll l lll l llllll lll llll ll l ll lll l ll llllllll lll lll l llllll llllllll lllllll ll l ll ll lllll lll lll l l llll lll l ll lllllll llll ll ll lllll lllllll l llllll ll l ll ll llll llll lll l l lll lll lll lllllll llll ll ll llll llllllll l lllll l l lll ll llll llll ll ll l ll llll lll l lllll llllll ll llll lllllll ll lllll l l lll l l lll llllll ll l ll llll ll ll lllll l lllll ll lll llllll ll llll ll l ll ll l lll l lllll ll lll lllll ll llll ll lllll l l lll llllll ll lll ll ll ll l ll ll lllll l l lll llllll ll lll ll lll ll l ll ll lllll ll lll ll lll ll lll ll lll ll l ll ll lllll ll lll ll lll ll ll ll lll ll l lll ll lll ll lll ll lll ll ll ll lll ll l lll ll lll ll lll ll lll ll ll ll ll ll l lll ll lll ll lll ll lll ll ll ll ll ll l lll ll lll ll lll ll ll ll ll ll lll ll lll ll lll ll ll ll ll ll ll ll lll ll lll ll ll ll ll ll ll ll ll ll lll ll ll ll ll ll ll ll ll ll ll ll ll ll ll ll ll ll ll ll ll ll ll lll ll ll ll ll ll ll ll ll lll lll ll ll ll ll ll ll lll lll lll ll ll ll ll l ll lll lll lll ll ll l ll lll lll lll lll l ll l ll lll lll lll ll l ll lll lll lll ll ll l ll lll lll ll ll ll l ll lll ll ll ll ll lll ll ll ll ll ll ll ll ll ll lll ll ll ll llll ll ll lllll ll llllll llllll lllll llll llllll\u2212500\u2212400\u2212300\u2212200(b)Figure 6.4: Experiments where the sequence \u03c3 is given. (a) The plot ofRMSE as the number of nodes is increased. (b) The sample posterior surfaceshowing that the MAP estimate correctly finds the mode of the posterior.107{{1,2,3},{4}}{{1,2,4},{3}}{{1,3,4},{2}}{{2,3,4},{1}}{{1,2},{3,4}}{{1,3},{2,4}}{{1,4},{2,3}}0.000.010.020.030.040.050 25000 50000 75000 100000NRMSE(a) (b)Figure 6.5: Overcounting problem illustrated on sampling from uniform graphmatching. (a) The number of possible {2, 3}-matchings for a graph withfour partitions and one node in each partition. (b)The root mean squarederror with overcounting correction (blue), without the overcounting correction(red).\u2212100\u221275\u221250\u22122500 10 20 30 40 50IterationsLog LikelihoodFigure 6.6: The plot of the Q\u02dc function versus iterations. The error barscorrespond to Q\u02dc plus\/minus two times the sample standard deviation (i.e.,Q\u02dc \u00b1 2\u03c3\u02c6). The convergence of MC-EM seems to have been reached in 10iterations.108\u221220\u221215\u221210\u2212500 10 20 30 40 50IterationsValuesCovariatesTWO_MATCHING_DISTANCE_1THREE_MATCHING_DISTANCE_2THREE_MATCHING_DISTANCE_1TWO_MATCHING_DISTANCE_2TWO_MATCHING_AREA_DIFFTHREE_MATCHING_AREA_DIFFFigure 6.7: The trajectory of parameters in the MC-EM training versusiterations. The distance based covariates seem to play important roles indetermining the correct matches compared to the area based covariates.1090.000.250.500.751.001 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30BoardsAccuracySingleSampleJaccardIndexFigure 6.8: Single sample prediction accuracy and Jaccard index evaluationof the matching samples generated by SMC computed on the real data foreach board. The single sample prediction accuracy is perfect for all but 3boards. The quality of the matchings generated for each board by SMCappears reasonable based on the values of Jaccard index.110lllllllllllllllllllllllllllllllllll llllllllllllllllllllllllllll lllllllllllllllllllllllllllllllllllllllllllll lllllllllll0.00.51.01.520 40 60Number of Knot FacesTime (Seconds)llReal BoardsSimulated BoardsFigure 6.9: Timing results on the simulated and the real boards. Theprediction times are within a second for the real boards. The timing resultson the simulated boards serve to test the feasibility of deploying the SMCsampler in lumber mills.111Chapter 7Conclusion and Future WorkIn this thesis, we explored SMC based inference methods for phylogenetictrees and graph matching. In the first half of the thesis, we proposed anew methodology, the streaming particle filter, to facilitate inference forphylogenetics with large memory storage requirements. The methodologywas shown to provide speed advantage whenever a large number of memorywriting operations are to be performed. This was the case for phylogenetics,where the dynamic programming table needed to be computed at eachiteration of the SMC. In general, a speed up may be observed in settingswhere a message in the form of a dynamic programming table needs to bepassed across the iterations of SMC. A future research direction involvescultivating applications for SPF that suffer from memory limitations. Onedirection is to explore the applications of Rao-Blackwellized particle filters.Rao-Blackwellized particles filters deal with potential memory limitations bymarginalizing out a subset of the state variables [68]. In general, one oftenneeds to develop a specialized algorithm to be able to use large number ofparticles [69]. The streaming particle filter is a general methodology thatmay be useful for various applications.We provided some convergence properties. In particular, we have shownrelationship between the implicit step of the SPF and the adaptation, i.e.,as the number of implicit particles goes to infinity, the quality of estimationis expected to rival the quality of estimation achieved by fully adapted[27] proposal that uses the latest observation. This direction has not beenfully explored in this thesis and hence, there is work remaining to study therelationship between the number of implicit particles and adaptation in variousapplications. In particular, it remains to explore the connection between thestreaming particle filter and the nested SMC [70], which proposes an SMC112algorithm that achieves adaptation. SPF was motivated from phylogeneticsapplications and at the onset, the connection to adaptation was not realized.[70] appeared one year after SPF as an SMC algorithm that achieves perfectadaptation by composing SMC algorithm(s) within an SMC algorithm. Itremains to explore whether SPF can be posed as a special case of nestedSMC or if the two algorithms are inherently different. The work requiredto establish SPF as a special case of nested SMC does not appear trivial;however, if it can be accomplished, the proof framework established in [70]can be used to show that SPF also achieves adaptation. Either way, futurework entails establishing a proof framework similar to the one established innested SMC to show that SPF achieves adaptation.Additionally, we have also shown that SPF can be used within particlemarginal Metropolis-Hastings framework when a fixed number of implicitparticles is used. We demonstrated its effectiveness on simulated phylogeneticsdata. There is additional work remaining in this direction as well. Inparticular, recall that one advantage of SPF is in its ability to determine thesufficient number of implicit particles automatically, for example, using astopping time criterion based on effective sample size. It remains as a futurework to prove that SPF with a random number of implicit particles can beused within the PMCMC framework. We believe that being able to use arandom number of implicit particles would lead to an adaptive PMCMCwhere the number of implicit particles can be automatically adjusted to attainan optimal acceptance ratio of the PMCMC chain. Other areas that are tobe explored as future work include applicability to plug-and-play models. Inparticular, SPF may be well suited for plug-and-play settings where the latentprocess takes on high dimensions, for example, models driven by stochasticdynamics [28] where the user is left with very little choice in terms of proposal.The implicit particles provide adaptation, which can lead to improved qualityof estimation in this setting.In the second half of the thesis, we proposed a probabilistic model forgraph matching and developed a sequential Monte Carlo sampler that canbe utilized to sample graph matchings. The methodology was applied tosuccessfully identify the matching of the knot faces on 30 real lumber boards.113The methodology circumvents the need for expensive to maintain imagingequipment such as X-rays or CT scan for the purposes of reconstructing theinternal tree branch structure of a piece of lumber. The speed at which thesampling can be carried out makes it feasible to deploy these systems to thereal mills. Furthermore, it adds graph matching as an application where SMCcan be utilized by providing the detailed steps involved in defining variouscomponents involved in combinatorial SMC for graph matching. It is left asfuture work to explore applications of graph matching that can be tackledusing the methodology proposed here, in particular, the problems arisingfrom machine learning and computational biology involving hypergraphssuch as DNA sequence alignment and image matching across a sequenceof video frames. The future work for knot matching application includesdevelopment of additional features to improve on the performance, as wellas a study involving larger number of boards for a comprehensive testingof the methodology. The next step to be taken is in the development of astrength prediction model to complete the strength prediction pipeline forlumber introduce in Figure 1.2.114Bibliography[1] S-H. Jun and A. Bouchard-C\u00f4t\u00e9. Memory (and time) efficient sequentialMonte Carlo. In International Conference on Machine Learning (ICML),volume 31, pages 514\u2013522, 2014.[2] Seong-Hwan Jun, Samuel WK Wong, James Zidek, and AlexandreBouchard-Cote. Sequential graph matching with sequential Monte Carlo.In Artificial Intelligence and Statistics, pages 1075\u20131084, 2017.[3] Seong-Hwan Jun, Samuel WK Wong, James V Zidek, and AlexandreBouchard-C\u00f4t\u00e9. Sequential decision model for inference and predictionon non-uniform hypergraphs with application to knot matching fromcomputational forestry. arXiv preprint arXiv:1708.07592, 2017.[4] Bret Larget and Donald L Simon. Markov chain Monte Carlo algorithmsfor the Bayesian analysis of phylogenetic trees. Molecular biology andevolution, 16(6):750\u2013759, 1999.[5] John P Huelsenbeck and Fredrik Ronquist. MRBAYES: Bayesian infer-ence of phylogenetic trees. Bioinformatics, 17(8):754\u2013755, 2001.[6] Fredrik Ronquist and John P Huelsenbeck. MrBayes 3: Bayesian phylo-genetic inference under mixed models. Bioinformatics, 19(12):1572\u20131574,2003.[7] Alexandre Bouchard-C\u00f4t\u00e9, Sriram Sankararaman, and Michael I. Jordan.Phylogenetic inference via sequential Monte Carlo. Systematic Biology,61:579\u2013593, 2012.[8] Charles Semple and Mike A Steel. Phylogenetics, volume 24. OxfordUniversity Press on Demand, 2003.115[9] Niko Beerenwinkel, Chris D Greenman, and Jens Lagergren. Computa-tional cancer biology: An evolutionary perspective. PLoS computationalbiology, 12(2):e1004717, 2016.[10] Erik M Volz, Katia Koelle, and Trevor Bedford. Viral phylodynamics.PLoS computational biology, 9(3):e1002947, 2013.[11] P Cast\u00e9ra, C Faye, and A El Ouadrani. Prevision of the bendingstrength of timber with a multivariate statistical approach. In Annalesdes sciences foresti\u00e8res, volume 53, pages 885\u2013896. EDP Sciences, 1996.[12] Riku Hietaniemi, Jari Hannuksela, and Olli Silve\u00e9n. Camera basedlumber strength classification system. In MVA2011 IAPR Conferenceon Machine Vision Applications, pages 251\u2013254, 2011.[13] Samuel WK Wong, Conroy Lum, Lang Wu, and James V Zidek. Quanti-fying uncertainty in lumber grading and strength prediction: a Bayesianapproach. Technometrics, 58(2):236\u2013243, 2016.[14] Christian P Robert. Monte Carlo methods. Wiley Online Library, 2004.[15] Rick Durrett. Probability: theory and examples. Cambridge universitypress, 2010.[16] Thomas Bengtsson, Peter Bickel, Bo Li, et al. Curse-of-dimensionalityrevisited: Collapse of the particle filter in very large scale systems. InProbability and statistics: Essays in honor of David A. Freedman, pages316\u2013334. Institute of Mathematical Statistics, 2008.[17] Arnaud Doucet and Adam M Johansen. A tutorial on particle filteringand smoothing: Fifteen years later. Handbook of nonlinear filtering,12(656-704):3, 2009.[18] Neil J Gordon, David J Salmond, and Adrian FM Smith. Novel approachto nonlinear\/non-Gaussian Bayesian state estimation. In IEE ProceedingsF (Radar and Signal Processing), volume 140, pages 107\u2013113. IET, 1993.116[19] Pierre Del Moral. Feynman-Kac formulae. In Feynman-Kac Formulae,pages 47\u201393. Springer, 2004.[20] Augustine Kong, Jun S Liu, and Wing Hung Wong. Sequential impu-tations and Bayesian missing data problems. Journal of the Americanstatistical association, 89(425):278\u2013288, 1994.[21] Aaron King, Dao Nguyen, and Edward Ionides. Statistical inference forpartially observed Markov processes via the R package pomp. Journalof Statistical Software, Articles, 69(12):1\u201343, 2016.[22] Pierre Del Moral, Arnaud Doucet, and Ajay Jasra. Sequential MonteCarlo samplers. Journal of the Royal Statistical Society: Series B(Statistical Methodology), 68(3):411\u2013436, 2006.[23] Liangliang Wang, Alexandre Bouchard-C\u00f4t\u00e9, and Arnaud Doucet.Bayesian phylogenetic inference using a combinatorial sequential MonteCarlo method. Journal of the American Statistical Association,110(512):1362\u20131374, 2015.[24] Christophe Andrieu, Arnaud Doucet, and Roman Holenstein. ParticleMarkov chain Monte Carlo methods. Journal of the Royal StatisticalSociety: Series B (Statistical Methodology), 72(3):269\u2013342, 2010.[25] James Propp and David Wilson. Coupling from the past: a user\u2019s guide.Microsurveys in discrete probability, 41:181\u2013192, 1998.[26] Michael K Pitt and Neil Shephard. Filtering via simulation: Auxil-iary particle filters. Journal of the American statistical association,94(446):590\u2013599, 1999.[27] Adam M Johansen and Arnaud Doucet. A note on auxiliary particlefilters. Statistics & Probability Letters, 78(12):1498\u20131504, 2008.[28] Edward Ionides, Carles Bret\u00f3, and Aaron King. Inference for nonlineardynamical systems. Proceedings of the National Academy of Sciences,103(49):18438\u201318443, 2006.117[29] John FC Kingman. On the genealogy of large populations. Journal ofApplied Probability, 19(A):27\u201343, 1982.[30] Yee W Teh, Hal Daume III, and Daniel M Roy. Bayesian agglomerativeclustering with coalescents. In Advances in Neural Information ProcessingSystems, pages 1473\u20131480, 2008.[31] Joseph Felsenstein. Evolutionary trees from DNA sequences: a maximumlikelihood approach. Journal of molecular evolution, 17(6):368\u2013376, 1981.[32] Joseph Felsenstein and Joseph Felenstein. Inferring phylogenies, volume 2.Sinauer associates Sunderland, MA, 2004.[33] Michael I Jordan. Graphical models. Statistical Science, 19(1):140\u2013155,2004.[34] Thomas H Jukes and Charles R Cantor. Evolution of protein molecules.Mammalian protein metabolism, 3(21):132, 1969.[35] Simon Tavar\u00e9. Some probabilistic and statistical problems in the analysisof DNA sequences. Lectures on mathematics in the life sciences, 17(2):57\u201386, 1986.[36] Maksims Volkovs and Rich Zemel. Efficient Sampling for BipartiteMatching Problems. Advances in Neural Information Processing Systems,(25):1322\u20131330, 2012.[37] I. L. Dryden and K. V. Mardia. Statistical shape analysis, volume 4. J.Wiley Chichester, 1998.[38] I. Holmes and G. M. Rubin. Pairwise RNA structure comparison withstochastic context-free grammars. In Proceedings of the Pac. Symp.Biocomputing, volume 7, pages 163\u2013174, 2001.[39] G. Lunter, A. J. Drummond, I. Mikl\u00f3s, and J. Hein. Statistical alignment:Recent progress, new applications, and challenges. In Statistical methodsin molecular evolution, pages 375\u2013405. Springer, 2005.118[40] T. S. Caetano, J. J. McAuley, L. Cheng, Q. V. Le, and A. J. Smola.Learning graph matching. Pattern Analysis and Machine Intelligence,IEEE Transactions on, 31(6):1048\u20131058, 2009.[41] J. Petterson, J. Yu, J. J. McAuley, and T. S. Caetano. Exponentialfamily graph matching and ranking. In Advances in Neural InformationProcessing Systems, pages 1455\u20131463, 2009.[42] Greg CG Wei and Martin A Tanner. A Monte Carlo implementation ofthe EM algorithm and the poor man\u2019s data augmentation algorithms.Journal of the American statistical Association, 85(411):699\u2013704, 1990.[43] B. Delyon, M. Lavielle, and E. Moulines. Convergence of a stochasticapproximation version of the EM algorithm. Annals of statistics, pages94\u2013128, 1999.[44] N. A. Smith and J. Eisner. Contrastive estimation: Training log-linearmodels on unlabeled data. In Proceedings of the 43rd Annual Meeting onAssociation for Computational Linguistics, pages 354\u2013362. Associationfor Computational Linguistics, 2005.[45] J. M\u00f8ller, A. N. Pettitt, R. Reeves, and K. K. Berthelsen. An efficientMarkov chain Monte Carlo method for distributions with intractablenormalising constants. Biometrika, 93(2):451\u2013458, 2006.[46] R. L. Plackett. The analysis of permutations. Applied Statistics, pages193\u2013202, 1975.[47] Z. Cao, T. Qin, T. Y. Liu, M. F. Tsai, and H. Li. Learning to rank:from pairwise approach to listwise approach. In Proceedings of the 24thinternational conference on Machine learning, pages 129\u2013136. ACM,2007.[48] F. Caron, Y. W. Teh, and T. B. Murphy. Bayesian nonparametricPlackett-Luce models for the analysis of preferences for college degreeprogrammes. The Annals of Applied Statistics, 8(2):1145\u20131181, 2014.119[49] T. Berg-Kirkpatrick, A. Bouchard-C\u00f4t\u00e9, J. DeNero, and D. Klein. Pain-less unsupervised learning with features. In Proceedings of the NorthAmerican Chapter of the Association for Computational Linguistics(NAACL10), volume 8, pages 582\u2013590, 2010.[50] B. Paige, F. Wood, A. Doucet, and Y. W. Teh. Asynchronous anytimesequential Monte Carlo. In Advances in Neural Information ProcessingSystems, pages 3410\u20133418, 2014.[51] M. Jerrum and A. Sinclair. Approximating the permanent. SIAM journalon computing, 18(6):1149\u20131178, 1989.[52] M. Jerrum, A. Sinclair, and E. Vigoda. A polynomial-time approxima-tion algorithm for the permanent of a matrix with nonnegative entries.Journal of the ACM (JACM), 51(4):671\u2013697, 2004.[53] J. Wang and A. Jasra. Monte Carlo algorithms for computing \u03b1-permanents. Statistics and Computing, 26(1-2):231\u2013248, 2016.[54] A. Bouchard-C\u00f4t\u00e9 and M. I. Jordan. Variational inference over combina-torial spaces. In Advances in Neural Information Processing Systems 23(NIPS), volume 23, pages 280\u2013288, 2010.[55] Y. Suh, M. Cho, and K. M. Lee. Graph matching via sequential MonteCarlo. In European Conference on Computer Vision, pages 624\u2013637.Springer, 2012.[56] D. C. Liu and J. Nocedal. On the limited memory BFGS method forlarge scale optimization. Mathematical programming, 45:503\u2013528, 1989.[57] Arthur P Dempster, Nan M Laird, and Donald B Rubin. Maximumlikelihood from incomplete data via the EM algorithm. Journal of theroyal statistical society. Series B (methodological), pages 1\u201338, 1977.[58] T-Y. Liu, J. Xu, T. Qin, W. Xiong, and H. Li. Letor: Benchmark datasetfor research on learning to rank for information retrieval. In Proceedingsof SIGIR 2007 workshop on learning to rank for information retrieval,pages 3\u201310, 2007.120[59] Michael Levandowsky and David Winter. Distance between sets. Nature,234(5323):34\u201335, 1971.[60] Fredrik Lindsten, Michael I Jordan, and Thomas B Sch\u00f6n. Particle Gibbswith ancestor sampling. The Journal of Machine Learning Research,15(1):2145\u20132184, 2014.[61] EL Ionides, C Breto, J Park, RA Smith, and AA King. Monte carloprofile confidence intervals for dynamic systems. Journal of The RoyalSociety Interface, 14(132):20170126, 2017.[62] David W. Green, Robert J. Ross, and Kent A. McDonald. Productionof hardwood machine stress rated lumber. In Proceedings of 9th Interna-tional Symposium on Nondestructive Testing of Wood, pages 141\u2013150,1994.[63] Mattias Br\u00e4nnstr\u00f6m. The impact of a strength grading process onsawmill profitability and product quality. BioResources, 4(4):1430\u20131454,2009.[64] Riku Hietaniemi, Miguel Bordallo L\u00f3pez, Jari Hannuksela, and OlliSilv\u00e9n. A real-time imaging system for lumber strength prediction.Forest Products Journal, 64(3):126\u2013133, 2014.[65] Anders Olsson, Jan Oscarsson, Erik Serrano, Bo K\u00e4llsner, Marie Jo-hansson, and Bertil Enquist. Prediction of timber bending strengthand in-member cross-sectional stiffness variation on the basis of localwood fibre orientation. European Journal of Wood and Wood Products,71(3):319\u2013333, 2013.[66] Pablo Guindos and Manuel Guaita. A three-dimensional wood materialmodel to simulate the behavior of wood with any type of knot at themacro-scale. Wood science and technology, 47(3):585\u2013599, 2013.[67] Hadley Wickham. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York, 2009.121[68] Arnaud Doucet, Nando De Freitas, Kevin Murphy, and Stuart Rus-sell. Rao-Blackwellised particle filtering for dynamic Bayesian networks.In Proceedings of the Sixteenth conference on Uncertainty in artificialintelligence, pages 176\u2013183. Morgan Kaufmann Publishers Inc., 2000.[69] Mike Klaas, Mark Briers, Nando De Freitas, Arnaud Doucet, SimonMaskell, and Dustin Lang. Fast particle smoothing: If i had a millionparticles. In Proceedings of the 23rd international conference on Machinelearning, pages 481\u2013488. ACM, 2006.[70] Christian Naesseth, Fredrik Lindsten, and Thomas Schon. Nested se-quential Monte Carlo methods. In Proceedings of the 32nd InternationalConference on Machine Learning (ICML-15), pages 1292\u20131301, 2015.122Appendix AGenerating Synthetic Data forKnot MatchingThis section presents a mechanism to simulate synthetic boards that closelymimics the real data as was presented in [3, Section 7]. This simulationmechanism is needed partly due to the prohibitive cost associated withobtaining the real data, which limits the study of knot formulation andimplementation of new covariates to accurately capture the variety of knotshapes.We first conceptualize tree branches as approximately cone-shaped objectsemanating from the center of the tree trunk [66]. As the tree is cut intoconstruction lumber, these cones intersect with rectangular prisms represent-ing the pieces of lumber, forming knot faces on the board\u2019s surfaces. Thussynthetic boards and knot faces with known matchings and realistic geometrycan be generated by simulating locations and sizes of cones representingthe tree branches, and calculating the conic sections with the four planesrepresenting the surfaces of the board. Conic sections arising from the samebranch are matched knot faces.The board is situated in 3-D Cartesian coordinates as described in Sec-tion 6.2, with length 5000 units (x dimension), width 300 units (y dimension)and height 150 units (z dimension). Let nk denote the random number ofknots on the board. Based on the number of knots observed in real data, wedraw nk \u223c Poisson(\u03c1) and generate nk branches that intersect with the board.Lumber is cut so that most branches go through the two \u2018wide\u2019 surfaces, sowe initially position a branch according to the equation of a right circular123cone that opens upward from the origin,x2 + y2c20= z2,where the random c0 governs the slope of the cone and we restrict z > 0. Itis possible for branches emanating in different directions to appear in oneboard, so with probability 1\/2 we allow the cone to open downwards byreflecting it over the plane z = 75. We next apply a random angle of rotationto the cone around the x and y axes, to mimic the variability in the anglesof tree branches. Then, the center of the cone is translated to a random(x, y) location on the board. Additional variation in the sizes of knot facesis provided by a random translation in the z direction. Intersections of thefinal cone and the four surfaces are computed using a numerical root-findingprocedure, and ellipses are fitted to the intersections representing the knotfaces.The cone simulation is repeated for each of the nk branches required.Since different tree branches do not intersect, we impose a condition to rejectsimulated cones that overlap geometrically or are otherwise too close toexisting cones. Consider the 3-D line segment joining the centers of two knotfaces due to the same branch. Then, the line segments corresponding todifferent branches cannot be too close: specifically, the minimum distance dbetween the two line segments should exceed the typical diameter of a branch.Hence, a simulated cone is rejected and resampled if d < 50 with an existingcone. This corresponds to a real distance cutoff of about 0.6in.The procedure for generating a board is summarized in Algorithm 3 withthe specific simulation parameters used. We use this procedure to generatethe samples of synthetic boards used in the computational experiments inthe following section.124Algorithm 3 : Synthetic board generator1: Draw nk \u223c Pois(25)2: for i = 1, . . . , nk do3: Draw cone parameters: slope c0 \u223c Unif [0.025, 0.05], orientation s \u223cBern(0.5)4: Draw rotation angles: \u03b8x, \u03b8yiid\u223c Unif [\u2212pi\/6, pi\/6]5: Draw center (xt, yt): xt \u223c Unif [0, 5000], yt \u223c Unif [0, 300]6: Draw z translation: zt \u223c (2s\u2212 1)Unif [0, 500]7: for j = 1, 2, 3, 4 do8: if cone intersects with surface j then9: Compute center and covariance matrix for ellipse of conic sectionon surface j10: tij \u2190 111: else12: tij \u2190 013: end if14: end for15: Compute line segment Li between ellipse centers on two surfaces withtij = 116: for b = 1, . . . , i\u2212 1 do17: di,b \u2190 minimum distance between Li and Lb18: if di,b < 50 then19: goto 320: end if21: end for22: Define matching mi = {j : tij = 1}23: end for125","type":"literal","lang":"en"}],"http:\/\/www.europeana.eu\/schemas\/edm\/hasType":[{"value":"Thesis\/Dissertation","type":"literal","lang":"en"}],"http:\/\/vivoweb.org\/ontology\/core#dateIssued":[{"value":"2018-02","type":"literal","lang":"en"}],"http:\/\/www.europeana.eu\/schemas\/edm\/isShownAt":[{"value":"10.14288\/1.0362870","type":"literal","lang":"en"}],"http:\/\/purl.org\/dc\/terms\/language":[{"value":"eng","type":"literal","lang":"en"}],"https:\/\/open.library.ubc.ca\/terms#degreeDiscipline":[{"value":"Statistics","type":"literal","lang":"en"}],"http:\/\/www.europeana.eu\/schemas\/edm\/provider":[{"value":"Vancouver : University of British Columbia Library","type":"literal","lang":"en"}],"http:\/\/purl.org\/dc\/terms\/publisher":[{"value":"University of British Columbia","type":"literal","lang":"en"}],"http:\/\/purl.org\/dc\/terms\/rights":[{"value":"Attribution-NonCommercial-NoDerivatives 4.0 International","type":"literal","lang":"*"}],"https:\/\/open.library.ubc.ca\/terms#rightsURI":[{"value":"http:\/\/creativecommons.org\/licenses\/by-nc-nd\/4.0\/","type":"literal","lang":"*"}],"https:\/\/open.library.ubc.ca\/terms#scholarLevel":[{"value":"Graduate","type":"literal","lang":"en"}],"http:\/\/purl.org\/dc\/terms\/title":[{"value":"Scalable sequential Monte Carlo methods and probabilistic approach to combinatorial problems","type":"literal","lang":"en"}],"http:\/\/purl.org\/dc\/terms\/type":[{"value":"Text","type":"literal","lang":"en"}],"https:\/\/open.library.ubc.ca\/terms#identifierURI":[{"value":"http:\/\/hdl.handle.net\/2429\/64277","type":"literal","lang":"en"}]}}