Entangled Monte CarlobySeong-Hwan JunB. Math., University of Waterloo, 2009A THESIS SUBMITTED IN PARTIAL FULFILLMENT OFTHE REQUIREMENTS FOR THE DEGREE OFMASTER OF SCIENCEinThe Faculty of Graduate and Postdoctoral Studies(Statistics)THE UNIVERSITY OF BRITISH COLUMBIA(Vancouver)August 2013c? Seong-Hwan Jun 2013AbstractA recurrent problem in statistics is that of computing an expectation involv-ing intractable integration. In particular, this problem arises in Bayesianstatistics when computing an expectation with respect to a posterior dis-tribution known only up to a normalizing constant. A common solution isto use Monte Carlo simulation to estimate the target expectation. Two ofthe most commonly adopted simulation methods are Markov Chain MonteCarlo (MCMC) and Sequential Monte Carlo (SMC) methods. However,these methods fail to scale up with the size of the inference problem. ForMCMC, the problem takes the form of simulations that must be ran for along time in order to obtain an accurate inference. For SMC, one may notbe able to store enough particles to exhaustively explore the state space. Wepropose a novel scalable parallelization of Monte Carlo simulation, Entan-gled Monte Carlo simulation, that can scale up with the size of the inferenceproblem. Instead of transmitting particles over the network, our proposedalgorithm reconstructs the particles from the particle genealogy using thenotion of stochastic maps borrowed from perfect simulation literature. Wepropose bounds on the expected time for particles to coalesce based on thecoalescent model. Our empirical results also demonstrate the efficacy of ourmethod on datasets from the field of phylogenetics.iiPrefacePart of this thesis is based on the published work of [1], which is co-authoredwith Dr. Liangliang Wang and Dr. Alexandre Bouchard-Co?te?. All of thesections that expands on the explanations given in [1] are solely authored.Chapter 2 is solely authored except for Section 2.3.4, which is co-authored.Chapter 3 is co-authored except for Section 3.6. For Chapter 4, Section 4.1and Section 4.2 are co-authored. The rest of Chapter 4 is solely authored.iiiTable of ContentsAbstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iiPreface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iiiTable of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . ivList of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viAcknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . vii1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42.1 Basic notions . . . . . . . . . . . . . . . . . . . . . . . . . . . 42.1.1 Graph theory . . . . . . . . . . . . . . . . . . . . . . 52.1.2 Tree topology . . . . . . . . . . . . . . . . . . . . . . 62.1.3 Phylogenetic forest . . . . . . . . . . . . . . . . . . . 92.2 Bayesian phylogenetics . . . . . . . . . . . . . . . . . . . . . 92.3 Sequential Monte Carlo . . . . . . . . . . . . . . . . . . . . . 102.3.1 Importance sampling . . . . . . . . . . . . . . . . . . 112.3.2 Sequential importance sampling . . . . . . . . . . . . 132.3.3 Resampling . . . . . . . . . . . . . . . . . . . . . . . . 172.3.4 SMC algorithm . . . . . . . . . . . . . . . . . . . . . 182.4 SMC for phylogenetic inference . . . . . . . . . . . . . . . . . 203 Entangled Monte Carlo . . . . . . . . . . . . . . . . . . . . . . 243.1 Review . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25ivTable of Contents3.1.1 Stochastic maps . . . . . . . . . . . . . . . . . . . . . 253.1.2 Particle genealogy . . . . . . . . . . . . . . . . . . . . 263.2 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 263.3 Allocation and resampling . . . . . . . . . . . . . . . . . . . 283.4 Genealogy . . . . . . . . . . . . . . . . . . . . . . . . . . . . 323.5 Compact representation of the stochastic maps . . . . . . . . 353.6 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 383.6.1 The coalescent . . . . . . . . . . . . . . . . . . . . . . 383.6.2 Bound on the expected time to reconstruct . . . . . . 424 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 464.1 Experimental setup . . . . . . . . . . . . . . . . . . . . . . . 464.2 Validation of compact stochastic maps . . . . . . . . . . . . 474.3 Speed-up results . . . . . . . . . . . . . . . . . . . . . . . . . 484.4 Timing results . . . . . . . . . . . . . . . . . . . . . . . . . . 504.5 Genealogy analysis . . . . . . . . . . . . . . . . . . . . . . . . 524.5.1 Monte Carlo estimate of E[T ] . . . . . . . . . . . . . 524.5.2 Effect of greedy allocation on E[T ] . . . . . . . . . . 535 Conclusion and Future work . . . . . . . . . . . . . . . . . . . 56Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57AppendicesA Proof of Lemma 10 . . . . . . . . . . . . . . . . . . . . . . . . . 62B Extensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65B.1 Storing the genealogy . . . . . . . . . . . . . . . . . . . . . . 65B.2 Resampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66B.3 Storage limitations . . . . . . . . . . . . . . . . . . . . . . . . 68vList of Figures2.1 Example of trees . . . . . . . . . . . . . . . . . . . . . . . . . 62.2 Phylogeny examples . . . . . . . . . . . . . . . . . . . . . . . 72.3 SMC schema . . . . . . . . . . . . . . . . . . . . . . . . . . . 182.4 SMC algorithm on phylogenetic forest . . . . . . . . . . . . . 233.1 Particle genealogy . . . . . . . . . . . . . . . . . . . . . . . . 333.2 Illustration of genealogy update . . . . . . . . . . . . . . . . . 343.3 Compact stochastic map datastructure . . . . . . . . . . . . . 383.4 Simulation of theoretical time to coalescence . . . . . . . . . . 454.1 SMC inference using stochastic maps generated from uniform 484.2 SMC inference using standard pseudo-random generator . . . 494.3 EMC speed up experiments on real phylogenetic datasets . . 504.4 EMC timing results compared to naive parallelization of SMC 514.5 Monte Carlo estimate of expected time to reconstruct . . . . 534.6 Comparison of greedy and Chaos allocation . . . . . . . . . . 54viAcknowledgementsThis thesis marks the end of the four year journey that began when I re-turned to school as a mature student in 2009. As I am about to embark onanother four year journey, I have many words to say. As I have had to comethrough many obstacles and adversities, I have many people to thank for allthe help and support that I received along the way.I would like to begin by showing my gratitude to my supervisor Prof.Alexandre Bouchard-Co?te?. Our relationship as a mentor and mentee begantwo years ago when I began working as an undergraduate research assistant.As a mature student coming back to undertake study in a new discipline ofstatistics, I desperately wanted a research experience in order to shape myresearch interests. It was Alex who offered the helping hand when he offeredthe opportunity to work as his research assistant. Under his guidance, Iwas able shape out my research interest and eventually, it led to my firstpublication. Looking back, I realize that he generously invested so muchtime and effort into developing my skills and knowledge. I feel very fortunateto have had him as an adviser. I have thoroughly enjoyed working with himfor the last two years and I am excited that I can continue to work with himfor the next few years as I am set to begin my PhD under his supervision.Along the way I have received so much help from so many people. Oneperson that stands out is Prof. Matias Salibia?n-Barrera. When I first de-cided to come back to take up on statistics, Matias was the instructor forthe very first statistics course that I took. At that point in time, I wasnot sure whether I had an aptitude for statistics; in other words, I was notentirely sure whether I was going to enjoy learning statistics. After takingthe course, I was sure that I was going to enjoy learning statistics and itultimately led to the decision to pursue graduate study in statistics. I haveviiAcknowledgementsalways had trouble maintaining focus in lectures, but for some strange rea-son, it was so easy to focus in Matias? lectures. After working as a TA to oneof his courses, it became clear to me that it was his dedication to teachingthat made the lectures so interesting. Without his dedication to teaching,I must say that it is entirely possible that I may have disregarded statisticsand moved on to live a less interesting life.I did not have many friends when I first began the journey. I wouldlike to begin by thanking the first two friends that I met in UBC, Jae Leeand Yumi Kondo. During the dark period of my life, you were the onlypeople that would spend the time to engage in conversations with me andto encourage me along the way. I would also like to thank the friends Imet afterwards, the fellow SSC case-study winning team members in thealphabetical order: Maria Casquilho, Charles Fishbane, and Claude Nie.Although we argued and fought a bit, I think we were all having a lot of funalong the way as well. Of course, I will never forget the drive to Edmontonand the time we spent in Jasper.I must thank the people who have always encouraged me and believedin me. I thank my brother Zhang Xi for keeping me sane. During the mostdifficult times, the motivational talks that we shared over the last ten yearsgot me going strong. Along the way, life had a surprise for me as one personbecame very special to me as we shared a lot of fun as well as food. Thanksto Yu for making my days bright even during the most stressful times. Mostimportantly, I thank my family for the unconditional love and support thatthey have given me. My sister Shin-Won for being a lovely sister. My fatherfor the sacrifices he made for us. And my mother for everything; I am sofortunate to have such a loving and caring mother.viiiChapter 1IntroductionIn this thesis, we focus on scalable parallelization of Monte Carlo simula-tion, a problem motivated by large scale inference problems arising fromscience and engineering. Our main contribution is a method, EntangledMonte Carlo (EMC), for carrying out SMC simulation in a cluster comput-ing environment. Our method requires communication cost independent ofthe size of the inference problem and we envision that it can be seamlesslyintegrated into high-performance scientific computing cluster environmentas well as clusters assembled from commodity hardware.The statistical problems that can benefit greatly from our method iswhen expectation needs to be computed involving an intractable integrationdue to large size of the model. In particular, this problem arises in Bayesianstatistics when computing an expectation with respect to a posterior dis-tribution known only up to a normalizing constant. A common solution isto use Monte Carlo simulation to estimate the target expectation. Two ofthe most commonly adopted simulation methods are Markov Chain MonteCarlo (MCMC) [2] and Sequential Monte Carlo (SMC) [3, 4] methods. How-ever, these methods fail to scale up with the growing size of the inferenceproblems. For MCMC, the problem takes the form of simulations that mustbe ran for a long time in order to obtain an accurate inference. For SMC,one may not be able to store enough particles to exhaustively explore thestate space.Although our method is generic and it can be applied to a wide varietyof problems, our motivation is to apply EMC to solve problems arising fromBayesian analysis of phylogenetics. Phylogenetic methods have a wide rangeof applications, and to motivate this thesis, we devote the rest of this chapterto an overview of this field of phylogenetics and its applications.1Chapter 1. IntroductionPhylogeneticsThe main goal of phylogenetics is to infer the evolutionary relationship be-tween organisms, by reconstructing the history of these organisms. Theorganisms are usually represented by their DNA or RNA sequences, whichare known to derive from the same common ancestor. This is why phyloge-netics is generally linked with studying of the relationship between speciesas all species are known to have evolved from a distant common ancestor [5].However, there is a wide variety of areas where phylogenetic methods canbe applied. For one, it has application in medical genetics, as it is used forstudying the evolutionary pathway of cancer cells, which can be useful fordefining types and subtypes of cancer and may lead to finding an optimaltreatment depending on the classification of cancer [6].Another interesting field that can benefit greatly from application of phy-logenetic methods is historical linguistics. Historical linguistics [7] is a fielddedicated to study of language change and it assumes that all languagesbelonging to a language family derive from a common ancestor language,referred to as proto-language. There are many hypotheses regarding thegrouping of languages into language families and subfamilies, including onehypothesis that all languages derive from one common language [8]. Regard-less, historical linguistics can benefit greatly from application of methodsfrom Bayesian phylogenetics that can be used to potentially quantify levelof certainty in these hypothesis. Furthermore, there are computational andstatistical methods that can be readily applied to analyze the data, (for e.g.[9, 10]) but these methods usually require as input a phylogeny ? a linguisticversion of tree of life.MotivationAn immediate challenge associated with applying phylogenetic methods isthat the growth of data is currently happening at a faster pace than thatof computational power. The inference technique most commonly used isMCMC [11, 12]. MCMC suffers from the aforementioned problems, namely,that chains often need to be ran for long periods. Furthermore, it may need2Chapter 1. Introductionto be accompanied by auxiliary chains [13] in order to avoid the problemof getting stuck at a local minima. SMC is relatively new technique forphylogenetic inference [14]. As mentioned above, for SMC to work well inthe presence of large data, we need to use a large number of particles. Thisis the motivation behind EMC, to enable execution of SMC simulation withas many particles as necessary so that we can achieve high level of accuracyin our inference of phylogeny.3Chapter 2BackgroundIn Chapter 1, we gave a qualitative description of the problems arising fromphylogenetics: given a homologous sequence of observations, Y, we wantto determine the ancestral relationships between taxa represented by thesequences. In order to capture the ancestral relationships between taxa, weneed a structure that can capture notions such as monophyly and geneticdistances (a measure of relatedness) among the taxa. This chapter servestwo goals. The first goal is to formulate the qualitative description of theproblems arising from phylogenetics under a mathematical framework andto frame it as a statistical problem. The second goal is then to determinethe statistical inference method and computational technique for solving theproblem. This chapter is outlined as follows:1. Introduce basic relevant notions from graph theory2. Review of Sequential Monte Carlo (SMC) methodAfter reading this chapter, the reader should have a clear understandingof the type of problems we are interested in and of the necessary backgroundto understand about our novel algorithm presented in Chapter 3.2.1 Basic notionsIn essence, we need a structure that can capture the ancestral relation be-tween a set of observed taxa. A natural structure is a tree, T = (V,E) ? T .The vertex set V of the tree captures not only the set of observed taxa butalso the unobserved (latent) ancestral taxa. The edgeset E captures the an-cestral relationship between any pair of taxon u, v ? V . Although the trees42.1. Basic notionsare discrete objects, the space of trees grows exponentially in the number ofobserved taxa. The main goal of this section is to clearly define the type oftrees that we are looking for so that we can restrict the cardinality of thespace of trees, T .2.1.1 Graph theoryIn order to concisely discuss phylogenetic trees, we need to review basicnotions in graph theory. By definition, trees are a special type of a graphthat are acyclic and connected.Definition 1 A path from u, v ? V in a graph G = (V,E) is a sequenceof distinct vertices (v1 = u, . . . , vk = v) such that ej = (vj , vj+1) ? E forj = 1, . . . , k ? 1.Definition 2 A graph G = (V,E) is connected if ?u, v ? V , there is apath from u to v.Before we can define acyclic graphs, we need to define directed and undi-rected graphs.Definition 3 G = (V,E) is a directed graph if ?e = (u, v) ? E, e isdirected, meaning that e is an ordered pair of vertices (u, v).The definition of ordered pair of vertices is that for e = (u, v) and e? =(v, u), e 6= e?. For directed edges, we sometimes adopt the notation e =(u? v).Definition 4 G = (V,E) is an undirected graph if ?e = (u, v) ? E, e isundirected, meaning that e is an unordered pair of vertices (u, v).The definition of unordered pair of vertices is that for e = (u, v) ande? = (v, u), e = e?; that is, the two edges cannot be distinguished. Anexample of a directed and undirected trees can be found in Figure 2.1 (a)and (b) respectively.52.1. Basic notions(a) (b) (c)Figure 2.1: An example of (a) directed tree, (b) undirected, (c) rootedundirected tree. All trees have 5 taxa (leaf nodes). The undirected tree in(c) is obtained by converting the directed edges in (a) into the undirectededges.Directed trees are necessarily rooted. The rooted trees have a designatednode, the root, that represents the common ancestor of all taxa. Contrastedto the directed trees, undirected trees are not rooted but it can easily betransformed into a rooted tree by selecting an artificial node as the root.Definition 5 A graph G = (V,E) is acyclic if ?v ? V,@ a path (v1 =v, . . . , vk = v).By definition (connected and acyclic), the trees are simple graphs mean-ing that for any vertex v ? V , there is no edge of the form (v, v) (no loop)and for any two vertices u, v ? V , there is at most one edge connecting thetwo vertices.Before we finish this section, we make a final remark on connectednessof graphs. Given a directed graph G = (V,E), we can obtain an undirectedgraph G? = (V,E?) by converting all e = (u ? v) ? E to an undirectedversion, e? = (u, v) (see Figure 2.1 (c)). We say that a directed graph G =(V,E) is weakly connected if G? = (V,E?) obtained this way is connected. Inthe rest of this thesis, we do not distinguish the notion of weakly connectedgraphs from connected graphs and just refer to them as connected graphs.2.1.2 Tree topologyWe now focus on describing the topology of phylogenetic trees. We restrictour attention to bifurcating or binary trees, meaning that for each node62.1. Basic notionsHumansChimpanzeesGorillasOrangutansGibbonsHumansChimpanzeesGorillasOrangutansGibbons(a) (b)Figure 2.2: An example of (a) trifurcating, (b) bifurcating phylogenetic X-trees for a subgroup of primates [15]. The internal nodes denote speciationevents.v, it has at most 2 children. Fortunately, this assumption closely matchesthe reality, as a speciation event rarely produces more than two species atonce. In fact, trifurcating trees (Figure 2.2 (a)) usually indicates that thechronological order of speciation events is unknown.For each T = (V,E) ? T , we denote by X ? V the leaf nodes of the tree.Given the subset X ? ? X, we refer to Y(X ?) as the set of observations at theleaf nodes X ?. Assuming that we have as observation, only the extant taxa,the observations are found only at the leaves. This is another restriction wehave on the tree topology, that the leaf nodes are labelled and the internalnodes, denoting the ancestral taxa, are not. This type of tree is referred toas phylogenetic X-tree (see Figure 2.2 for examples).As mentioned earlier, the trees are discrete objects. It may seem thatexhaustively exploring this space to find the correct tree is plausible forphylogenetic X-trees with reasonable cardinality, |X|. However, complicat-ing the matter is the fact that the tree topologies that we are interested in72.1. Basic notionsare usually accompanied by the branch length function, b : E ? R, whereR ? <+. The branch lengths are key components of evolution that describethe historical relatedness (genetic distances) between the taxa. With theaddition of the branch length, the trees are really described by three com-ponents, T = (V,E, b). The trees are discrete objects if R is countable; forexample, R ? N. However, if R is uncountable, then the trees are no longerdiscrete objects and the |T | would be uncountably infinite. In either case,exhaustive search of this space is implausible. The branch length invites yetanother tree terminology.Definition 6 Suppose r is a root of the tree T = (V,E, b) and let x, x? ? Xdenote arbitrary leaf nodes. Furthermore, let px = (v1 = r, . . . , vk = x) andpx? = (v?1 = r, . . . , v?k = x?) be the paths from r to x and x? respectively. Tis an ultrametric tree if for any pair of leaf node x, x? ? X, the sum of thebranch lengths are equal for the two paths:k?1?j=1b(vj , vj+1) =k?1?j=1b(v?j , v?j+1)In plain terms, the ultrametric trees have all of the leaf nodes at anequidistant from the root node. Ultrametric trees are the final restrictionon the type of the tree that we are interested in. From this point on, whenwe use T to denote the space of the trees, we mean that each T ? T is:1. Rooted2. Bifurcating3. Leaf nodes are labelled (phylogenetic X-tree)4. UltrametricWe end our discussion of the trees with a definition of a metric on theultrametric trees:Definition 7 The height of an ultrametric tree T is a sum of the branchlengths from the root node r to any leaf node x ? X.82.2. Bayesian phylogenetics2.1.3 Phylogenetic forestA general definition of forest F ? F is that it is a disjoint union of trees{Tj}. In our case, we are interested in phylogenetic forest on X:Definition 8 A phylogenetic forest F ? F on X is a disjoint union ofphylogenetic trees Tj = (Vj , Xj) ? T for j = 1, . . . , J such that {Xj ? Vj}is a partition of X.We have examples of forests in Figure 2.4. Note that by definition, atree is a forest, i.e. F ? T . We can define a notion of height on the forestsas we did for the trees:Definition 9 Given a forest F composed of trees Tj, we define the heightof the forest as, height(F ) = maxj{height(Tj)}The notion of forest comes in handy when we describe a specific SMCalgorithm for inferring phylogeny in Section 2.4. In the rest of the thesis,we will refer to phylogenetic trees and phylogenetic forests as just trees andforests. For a detailed mathematical treatment of phylogenetics, we referthe reader to [16].2.2 Bayesian phylogeneticsThe goal of phylogenetics is to uncover the historical relationship betweentaxa by inferring the topology of the phylogenetic tree T = (V,E, b) giventhe sequences of data Y = Y(X) representing observed taxa. We take theBayesian approach to inferring phylogeny, which involves computing theexpectations of some function h, Epi[h], based on a posterior distributionover the phylogenetic trees t given the observations Y:pi(t|Y) = p(Y|t)p(t)?T p(Y, t)dt(2.1)Note that we are using t to denote a realized value of a random tree T .The likelihood term, p(Y|t), can be computed efficiently in polynomial time92.3. Sequential Monte Carlousing a pruning algorithm described in [17]. The prior distribution can besuch that it is easy to evaluate pointwise, for instance, a uniform prior overthe tree topology and geometric distribution or exponential distribution onthe branch lengths. The main difficulty lies in computing the integral inthe denominator. Although the trees by themselves are discrete objects, thenumber of possible trees grows exponentially in the number of taxa at theleaf nodes of the tree and summing over all possible trees is impractical forany reasonably sized tree of interest.The Bayesian inference for phylogeny relies heavily on the Markov ChainMonte Carlo simulation method. There are two main issues with MCMC,(i) in the presence of local peaks, a stochastic tree search can get stuck ata local minima [11] leading to subsequent proposals being rejected (wastingcomputing efforts) and (ii) MCMC is not ideal for taking advantage of thecomputing resources, i.e. parallelization over multiple computing nodes orcomputing cores. For instance, parallel tempering [13] is a popular approachto parallelizing an MCMC algorithm. However, it displays diminishing re-turn as more chains are added, which means after a certain point, addingadditional chains does not result in significant speed up. We will review theSequential Monte Carlo method and show how it can address both of theseproblems.2.3 Sequential Monte CarloOur goal is to familiarize the reader with the algorithmic aspects of theSMC method. To inspect the components of SMC in detail, we begin witha review of importance sampling and sequential importance sampling. Theexposition of this section is independent of the phylogenetic tree problem.In Section 2.4, we explain the SMC method as applied to inferring a phy-logenetic tree. There is a rich literature available. At the technical levelsuitable for the first time readers, we suggest the excellent tutorial of [4]and for more advanced topics regarding SMC, we suggest [3] and [18].102.3. Sequential Monte Carlo2.3.1 Importance samplingIn statistics, we are often interested in computing an expectation of a func-tion h of a random variable X, with respect to a probability density function(pdf) pi:Epi[h(X)] =?h(x)pi(x)dx (2.2)If we can exactly sample Xk ? pi, then the Monte Carlo estimator of theintegral is:1KK?k=1h(Xk)? Epi[h(X)] a.s.where a.s denotes almost sure convergence. The above result is knownas the strong law of large numbers [2, 19]. However, in many problems ofinterest, it may be difficult or impossible to sample from pi exactly. The ideabehind importance sampling is to modify the right hand side of the equation2.2 as follows:Epi[h(X)] =?h(x)pi(x)dx =?h(x)pi(x)?(x)?(x)dx (2.3)This modification transforms an expectation of h with respect to pdf piinto an expectation of g = hpi? , with respect to ?. Equation 2.3 is referredto as importance sampling fundamental identity [2] and ? is referred to asthe importance or proposal density. To obtain a Monte Carlo estimate forEquation 2.3, we need to be able to sample from ? and to be able to evaluatepi pointwise. The conditions for choosing the proposal density is rather weakin that all that is required is: pi > 0 ? ? > 0. The Monte Carlo estimatorfor the expectation is obtained by sampling Xk ? ?:1KK?i=1h(Xk)pi(Xk)?(Xk)=1KK?i=1w(Xk)h(Xk)? Epi[h(X)] a.s.where w(Xk) =f(Xk)?(Xk) is referred to as an importance weight for the kth112.3. Sequential Monte Carlosample. The role of the weights is to compensate for the difference betweenpi and the proposal.ExampleHere, we give an example of importance sampling as applied to Bayesianstatistics. In the process, we motivate an alternative view of importancesampling. We have the posterior distribution of a random variable X giventhe data Y:pi(x|Y) = p(Y|x)p(x)?p(Y|x)p(x)dx =?(x)??(x)dx= pi(x)We assume that the integral in the denominator is intractable and hence,the posterior distribution cannot be sampled from exactly. Another problemin this case is that we cannot evaluate pi pointwise. To solve this problem,we view the samples Xk ? ? as inducing a discrete distribution that approx-imates the proposal:??(x) =1KK?k=1?xk(x)where ?xk(x) = 1 if x = xk and 0 otherwise (delta point mass). Note thatpi(x) can be written as:pi(x) =?(x)??(x)dx=?(x)?(x)?(x)? ?(x)?(x)?(x)dx=w(x)?(x)?w(x)?(x)dxWe obtain an estimator of pi(x) by replacing ? by ??:122.3. Sequential Monte Carlop?i(x) =w(x)??(x)?w(x)??(x)dx=1KK?k=1w(xk)?xk(x)1KK?k=1w(xk)=K?k=1Wk?xk(x)Wk =w(xk)K?k=1w(xk)where Wk ? (0, 1) are the normalized importance weights. What we havehere is a discrete distribution that approximates pi. Using this approxima-tion, we can now estimate the expectation of h(X):Epi[h(X)] ??h(x)p?i(x)dx =K?k=1Wkh(xk)For the purpose of introducing the idea of importance sampling, wefocused on one-dimensional problem. The importance sampling methodfails to be practical when we look at problems involving large number ofvariables as choosing a suitable high-dimensional proposal density becomesa problem. One remedy for this problem is to use sequential importancesampling.2.3.2 Sequential importance samplingWe motivate sequential importance sampling by assuming that we have aBayesian inference problem at hand, that of computing an expectation withrespect to a posterior distribution:pi(x1, . . . , xR|Y) = pi(x1, . . . , xR) =?(x1, . . . , xR)p(Y)132.3. Sequential Monte Carlowhere p(Y) is the marginal likelihood involving intractable integral. Asbefore, we cannot sample from pi and we cannot evaluate pi pointwise. Themain idea is to build the proposal distribution as a product of local proposaldistributions:?(x1, . . . , xR) = ?1(x1)R?r=2?r(xr|x1, . . . , xr?1)The local proposals effectively enforce the samples to be generated se-quentially and hence, for importance sampling to be carried out sequentially,one generation at a time. That is, we generate X1,k ? ?1(?) and computethe importance weights:w1,k =?(x1,k)?1(x1,k)Only then can we generate X2,k ? ?2(?|x1,k) and compute the importanceweights:w2,k =?(x1,k, x2,k)?(x1,k, x2,k)=?(x1,k, x2,k)?1(x1,k)?2(x2,k|x1,k)Note that since we are re-using the sample from generation r = 1, weshould be able to express w2,k in terms of w1,k:w2,k =?(x1,k, x2,k)?1(x1,k)?2(x2,k|x1,k)? ?(x1,k)?(x1,k)=?(x1,k)?1(x1,k)?(x1,k, x2,k)?(x1,k)?2(x2,k|x1,k)= w1,k ??(x1,k, x2,k)?(x1,k)?2(x2,k|x1,k)Induction yields the conclusion that the weight at generation r can berecursively expressed based on the weight at generation r ? 1:wr,k = wr?1,k ? ?(x1,k, . . . , xr,k)142.3. Sequential Monte Carlowhere ? is the weight update function,?(x1,k, . . . , xr,k) =?(x1,k, . . . , xr,k)?(x1,k, . . . , xr?1,k)?r(xr,k|x1,k, . . . , xr?1,k)At each generation, the samples induce a discrete distribution. For r = 1,we have the same case as the importance sampling:??(x1) =K?k=1?x1,k(x)At generation r, we have the samples (x1,k, . . . , xr,k) ? ?(x1, . . . , xr) andthe induced discrete distribution is:??(x1, . . . , xr) =K?k=1?x1,k,...,xr,k(x1, . . . , xr)From which, we get an estimate of the target distribution:p?i(x1, . . . , xr) =K?k=1Wr,k?x1,k,...,xr,k(x1, . . . , xr) (2.4)ExampleThe set of problems where sequential importance sampling is most natu-ral is when the underlying model follows the Markov process such as HiddenMarkov Model (HMM) [20]. We continue with the example of Bayesianstatistics but we assume that we have HMM as the underlying model. Weassume that we are given the data, Y = (y1, . . . , yR), and we want to infera sequence of latent state variables X = (X1, . . . , XR), where Xr ? S. Asbefore, we denote the posterior distribution as:pi(x1, . . . , xR|y1, . . . , yR) =?(x1, . . . , xR)p(y1, . . . , yR)where ?(x1, . . . , xR) = p(x1, . . . , xR)p(y1, . . . , yR|x1, . . . , xR) is prior ?likelihood. For HMM, we can decompose ? into a product of densities:152.3. Sequential Monte Carlo?(x1, . . . , xR) = ?(x1)R?r=2f(xr|xr?1)R?r=1g(yr|xr)The weight update functions ? is simple to derive for HMM:?(x1, . . . , xr) =?(x1, . . . , xr)?(x1, . . . , xr?1)?r(xr|x1, . . . , xr?1)=1?r(xr|x1, . . . , xr?1)??(x1)r?j=2f(xj |xj?1)r?j=1g(yj |xj)?(x1)r?1?j=2f(xj |xj?1)r?1?j=1g(yj |xj)=f(xr|xr?1)g(yr|xr)?r(xr|x1, . . . , xr?1)At generation R, we have an estimator for the target density as in Equa-tion 2.4. An important observation in this example is that we want to inferthe values of the latent states, which is a special case of computing expec-tation (taking expectation of indicator functions). This type of problem ofinferring a path of latent states sequentially for each generation is referredto as the filtering problem. Each of the path of the samples (x1,k, . . . , xR,k)is referred to as a particle. Each particle evolves independently to yield upontermination, p?i. By choosing the particle with the highest weight, we cananswer queries such as ?what is the most likely path of the latent states??The filtering problem reveals a critical problem with sequential impor-tance sampling. The weights are computed recursively and for a reasonablylarge value of R, it is not hard to see that the compounding of the weightsthrough multiplication will lead to the weight close to 0. If only a hand-ful of the particles end up with significant non-zero weights, the samplesare clearly not representative enough to estimate the target distribution.This problem is referred to as the sample degeneracy problem where thesequential importance sampling eventually collapses as the result of high-dimensionality (also referred to as curse of dimensionality) [21]. A solutionto this problem is to include a resampling stage that prunes unpromising162.3. Sequential Monte Carloparticles whilst propagating promising particles to continue to explore thestate space.2.3.3 ResamplingLet us introduce the idea of resampling through a hypothetical scenario.If we are provided a computational budget to accommodate K particles ateach generation, then we would run sequential importance sampling with Kindependent particles to explore the state space. We run it for r generationsand obtain from the particles along with the weights, a discrete distributionas in Equation 2.4. Now, suppose that our computing budget has decreasedto K ? < K so before we can proceed to generation r + 1, we need to choosewhich of the K particles to keep. If we are trying to answer a filtering query,then we would select the most promising K ? paths among the K particlesand this would be achieved by choosing K ? particles based on their weights.Suppose this time, we resample the particles with replacement. In this case,if only a handful of the particles have high weights, these particles will beselected multiple times. Instead of deterministically selecting the particleswith the largest weights, the objective of resampling is to use the randomnessto maintain the unbiasedness of the estimator. Note that even the particleswith low weights have a chance to be selected; in fact, the expectation isthat a particle will be selected K ? ? Wk times. Through the resamplingstage, we can expect to obtain particles that represents the target densityquite accurately.From this analogy, we can see that resampling may be a good idea even ifthe computing budget is maintained at K particles. This way, the comput-ing efforts are focused on the particles that are exploring the relevant areasof the state space whilst the unpromising particles that have low weightsand are bound to have near 0 weight at the termination will most likelybe pruned. The resampling stage can be carried out at every generation orit can be carried out when the effective sample size (ESS) [22] falls undersome threshold. As for the resampling scheme, there are various ways as ex-plained in [4]: systematic resampling, residual resampling, and multinomial172.3. Sequential Monte Carlos1,1s1,2s1,3w1,1 = 0.03w1,2 = 0.02w1,3 = 0.08s1,1s1,3s1,2s2,1s2,3s2,2~~~s2,1s2,3s2,2w2,1 = 0.12w2,2 = 0.2w2,3 = 0.02ResamplingProposal Weightingcomponents decreases by one at every step. More precisely, we will build eachrooted X-tree t by proposing a sequence of X-forests s1, s2, . . . , sR = t, wherean X-forest sr = {(ti, Xi)} is a collection of rooted Xi-trees ti such that thedisjoint union of leaves of the trees in the forest is equal to the original set ofleaves, ?iXi = X. Note that with this specific construction, a forest of rank rhas |X|? r trees.The sets of partial states considered in this Section are assumed to satisfythe following three conditions:1. The sets of partial states of different ranks should be disjoint, i.e. Sr?Ssfor all r ?= s (in phylogenetics, this holds since a forest with r trees cannotbe a forest with s trees when r ?= s).2. The set of partial state of smallest rank should have a single elementdenoted by ?, S1 = {?} (in phylogenetics, ? is the disconnected graphon X).3. The set of partial states of rank R should coincide with the target space,SR = X (in phylogenetics, at rank R = |X|?1, forests have a single treeand are members of the target space X ).These conditions will be subsumed by the more general framework of Sec-tion 4.5, but the more concrete conditions above help understanding the posetframework.In order to grow particles from one rank to the next, the user needs tospecify a proposal probability kernel ?+. Given an initial partial state s anda set of destination partial states A, we denote the probability of proposingan element in A from s by ?+s (A). In the discrete case, we abuse the notation14(a) (b)Figure 2.3: (a) a graphical illustration of the SMC algorithm. (b) Particlegenealogy.resampling. There is also an optimal resampling scheme of [23] in the caseof discrete state space.2.3.4 SMC algorithmIn this section, we combine the ideas presented in the previous three sectionsand explain the generic SMC algorithm. So far, we have been sly aboutnotations on purpose so that the reader can focus on the ideas rather thangetting hung up on the notations. In order to be precise in the descriptionof the SMC algorithm, we formally introduce the notations in this section.NotationsWe denote the general probability space as (S,FS), with the state-spaceS and a ?-algebra FS . We explicitly assume that we have the target dis-tribution pi : FS ? [0, 1], which corresponds to the posterior distributionin Bayesian statistics. The goal is to compute an integral of some functionh under pi, we denote this quantity as pi(h). The target distribution, pi, isdifficult to sample from directly so it requires us to design a proposal density? : S ? FS ? [0, 1]. We describe SMC as is described in [1].AlgorithmSMC proceeds in a sequence of generations indexed by r. At each182.3. Sequential Monte Carlogeneration, the algorithm keeps in memory a weighted list of K particles,sr,1, . . . , sr,K ? S, with corresponding weights wr,1, . . . , wr,K (see Figure 2.3).The weighted particles induce a distribution on S defined by:pir(A) ?K?k=1wr,k?sr,k(A) (2.5)where A ? FS is an event and ?x(A) = 1 if x ? A and 0 otherwise (Diracdelta measure). We describe the algorithm recursively on the generationr. In the base case, we set w0,k = 1/K for all k ? {1, . . . ,K}, and thes0,k are initialized to a designated start state ?. Given the weights of thesamples at generation r? 1, a new list of particles is created in three steps.The first step is the resampling step. The resampling step is carried outby sampling K times independently from the weighted samples distributionEquation 2.5. We denote the sampled particles by s?r?1,k. The second stepis to create new particles by extending the partial states of the sampledparticles in generation r ? 1. This is done by sampling K times from theproposal distribution sr,k ? ?s?r?1,k . The third step is to compute weights forthe new particles: wr,k = ?(s?r?1,k, sr,k), where the weight update function? is an easy to evaluate deterministic function ? : S2 ? [0,?). Finally, thetarget integral pi(h) is approximated using the weighted distribution of thelast generation R, piR. The generic SMC algorithm is given in Algorithm 1.The resampling step in SMC algorithms induces a one-to-many relation-ship between the particles in generation r ? 1 and those in generation r.This relationship is called the particle genealogy, illustrated in Figure 2.3(b). Formally, a genealogy is a directed graph where nodes are particlessr,k, r ? {1, . . . , R}, k ? {1, . . . ,K}, and where node sr?1,k is deemed theparent of node sr,k? if the latter was obtained by resampling s?r?1,k? = sr?1,kfollowed by proposing sr,k? from s?r?1,k? . We use ? : I ? I to denote theparticle genealogy. For any particle index i = (r, k) ? I, ?(i) denotes theparent of this particle. The entire genealogy of particles can be accessedusing ? because we can access any particle i and its lineage by multipleapplications of ? to i.192.4. SMC for phylogenetic inferenceAlgorithm 1 : SMC(?, ?, h, I0)1: ?? empty-genealogy2: init(s, w)3: for r ? {1, . . . , R} do4: Ir ? resample(wr?1, ?, Ir?1)5: for i ? Ir do6: s(i)? propagate(s, ?, i)7: wr,k(i) ? ?(s(?(i)), s(i))8: end for9: end for10: process(s, w, h)2.4 SMC for phylogenetic inferenceIn this section, we describe how to carry out inference for ultrametric phy-logenetic trees with discrete branch lengths using SMC. The choice to usediscretized branch lengths is for the purpose of simplicity. The algorithm wedescribe is known as prior-prior in [24]. This proposal works on the space ofultrametric forests, i.e. S = F ? T . In each generation r < R, the proposalis to merge a pair of trees in the forest and to sample a branch length atrandom from a geometric distribution with parameter p.In the base case, s0,k corresponds to a forest composed of |X| trees, whereeach tree T = (x ? X, ?) and with the weights initialized to, w0,k = 1/K.Given the state s?r?1,k, the prior-prior proposal operates in three steps. First,select a pair of trees (Tl, Tr). Second, construct a rooted tree Tm with Tl(respectively Tr) as the left (right) subtree. Let v denote the root of themerged tree. Third, set the branch lengths, b(v, Tl) and b(v, Tr). Note thatwe overloaded on the notation r, by using it as a subscript for the right tree,Tr. It should be clear that the use of r as subscript of tree is not related tothe use of r as a generation of SMC algorithm.The first two steps are easy to carry out. At generation r, we randomlysample (Tl, Tr) from(|X|?r+12)pairs of trees, create a new node v and set Tland Tr as left and right subtrees respectively to get a new tree rooted at v.The third step needs to be carried out with care. We need to ensure thatthe height of the new forest, sr,k is strictly greater than the height of theprevious forest s?r?1,k in order to satisfy the condition for a valid proposal202.4. SMC for phylogenetic inference(Proposition 5 in [14]). To ensure this, we set the branch lengths as follows:1. Sample br ? geom(p)2. Compute the difference in height of the forest s?r?1,k and the left sub-tree Tl as: dl = height(s?r?1,k)? height(Tl)3. Set b(v, Tl) = dl + br.4. Repeat the same procedure for the right subtree.What remains is to specify the weight function ? to assign weights toeach of the forest states. The derivation for prior-prior weight function isprovided in [14], we provide only the result here:?(s? s?) ?LY(Xm(s?))(tm(s?))LY(Xl(s))(tl(s))? LY(Xr(s))(tr(s))Note that tm(s?) denotes the merged tree in the new state s? and tl(s)(respectively tr(s)) denotes the left (right) tree sampled from the previousstate s. So, Y(Xm(s?)) denotes the observed taxa found at the leaf nodesof the merged tree tm and LY(Xm(s?))(tm(s?)) denotes the likelihood of thedata observed at the leaf nodes of the merged tree tm in the new state s?.Similarly, LY(Xl(s?))(tl(s)) denotes the likelihood of the observed taxa foundat the leaf nodes of the left tree tl sampled from the previous state s ? samenotations hold respectively for the right tree, tr, sampled from the previousstate s.We depict the progression of the algorithm on a handful of languagesfrom the Formosan language family in Figure 2.4. The algorithm begins atgeneration r = 0 with |X| = 6 trees, each tree with only one node and emptyedgeset. This algorithm ensures that at generation R = |X| ? 1, we have acollection of full trees, sR,k ? T , which along with the weights WR,k, inducean empirical distribution on the space of the trees. This proposal is easyto implement, easy to understand, and satisfies the necessary conditionsof a valid proposal on ultrametric forests described in [14]. However, theselection of the pair to merge and the branch lengths are purely at random212.4. SMC for phylogenetic inferenceso it requires a large number of particles to be used in order to achieveaccurate inference results.222.4. SMC for phylogenetic inferenceHoanyaPaporaBununRukaiSaaroaKanakanabuHoanyaPaporaBununRukaiSaaroaKanakanabub1(a) (b)HoanyaPaporaBununRukaiSaaroaKanakanabub1b2HoanyaPaporaBununRukaiSaaroaKanakanabub1b2b3(c) (d)HoanyaPaporaBununRukaiSaaroaKanakanabub1b2b3b4HoanyaPaporaBununRukaiSaaroaKanakanabub1b2b3b4b5(e) (f)Figure 2.4: A forest with (a) 6 trees, (b) 5 trees, (c) 4 trees, (d) 3 trees, (e)2 trees, and (f) 1 tree. Also illustrated here is the prior-prior algorithm [24]on forest with 6 languages from the Formosan language family.23Chapter 3Entangled Monte CarloIn this chapter, we introduce Entangled Monte Carlo1, a scalable paral-lelization of SMC algorithms. SMC is considered as the leading candidatefor massively parallel simulation because the computation of the proposalcan be trivially parallelized as it depends only on the local state of a particle.The problem lies with the resampling stage, which distributes the particlesacross the computing nodes; depending on the application at hand, a largenumber of particles needs to be transmitted over the network at each gen-eration.In order to achieve scalability, we seek to minimize the network com-munication cost as much as possible. The main feature of our algorithm isthat we keep this cost fixed independent of the size of the inference prob-lem. The idea is that whenever a particle needs to be transmitted fromone machine (sender) to another machine (receiver), the receiver machinereconstructs the particle locally, eliminating the need for transmitting theparticle. The only required network synchronization needed in our meth-ods is that of particle weights, which is constant over generations and aretypically small compared to the size of the particles.The rest of this chapter is organized as follows. We begin with a reviewof the stochastic maps in Section 3.1. The concept of stochastic maps playa key role in our reconstruction algorithm. We provide an overview of themain algorithm in Section 3.2 and describe each of the components of thealgorithm in the following order: resampling and allocation in Section 3.3,efficient maintenance of the particle genealogy for particle reconstruction inSection 3.4 and Section 3.5, and analysis of the expected reconstruction time1Description of EMC is based on [1], which is co-authored with Dr. AlexandreBouchard-Co?te? and Dr. Liangliang Wang.243.1. Reviewin Section 3.6. The experimental results are deferred to Chapter 4.3.1 ReviewAn important concept used in the construction of our algorithms is the ideaof a stochastic map. We start by reviewing stochastic maps in the context ofa Markov chain, where it was first introduced to design perfect simulationalgorithms [25].3.1.1 Stochastic mapsLet T : S ? FS ? [0, 1] denote the transition kernel of a Markov chain(generally constructed by first proposing and then deciding whether to moveor not using a Metropolis-Hastings (MH) ratio). A stochastic map is anequivalent view of this chain, pushing the randomness into a list of randomtransition functions. Formally, it is a (S ? S)-valued random variable Fsuch that T (s,A) = P(F (s) ? A) for all state s ? S and event A ? FS .Concretely, these maps are constructed by using the observation that Tis typically defined as a transformation t(u, s) with u ? [0, 1]. The mostfundamental example is the case where t is based on the inverse cumulativedistribution method. We can then write F (s) = t(U, s) for a uniform randomvariable U on [0, 1].With this notation, we get a non-standard, but useful way of formulatingMCMC algorithms. First, sample N stochastic maps F1, F2, . . . , FN , inde-pendently and identically. Second, to compute the state of the chain aftern transitions, simply return F1(F2(. . . (Fn(x0)) . . . )) = F1 ? ? ? ? ? Fn(x0), foran arbitrary start state x0 ? S, n ? {1, 2, . . . , N}. This representation de-couples the dependencies induced by random number generation from thedependencies induced by operations on the state space. In MCMC, the latterare still not readily amenable to parallelization, and this is the motivationfor using SMC as the foundation of our method.253.2. Overview3.1.2 Particle genealogyThe particle genealogy combined with stochastic maps are the core compo-nents of the particle reconstruction algorithm. Before we explain the mainalgorithm, we briefly describe the role of particle genealogy in reconstructionof particles.To parallelize SMC, we view the applications of SMC proposals as acollection of stochastic maps to be shared across machines. Note that thereare K ? R proposal applications in total, which we will index by I 3 i =i(r, k) = (r(i), k(i)) for convenience. We denote the set of stochastic mapsas F = {Fi : i ? I}. For each of the particles sr,k, there is a sequence ofstochastic maps applied to ? to arrive at the current state. This sequence ofstochastic maps can be obtained by tracing the lineage of this particle, i.e.?(. . . ?(i(r, k)) . . .). Hence, even if the particle sr,k is not in the memory of amachine, it can reconstruct sr,k from any ancestral state sr?,k for 0 ? r? < r,as long as the machine has access to the particle genealogy and a means forefficiently storing the stochastic maps for the particles in the lineage of sr,k.3.2 OverviewAs we mentioned before, we view applications of SMC proposals as ap-plication of stochastic maps. Applying the stochastic maps, F = {Fi :i ? I}, is often computationally intensive (for example because of Rao-Blackwellization), and it is common to view this step as the computationalbottleneck. At iteration r, each machine, with index m ? {1, . . . ,M}, willtherefore be responsible for computing proposals for only a subset Ir of theparticles indices {1, . . . ,K}. We refer to machine m as the reference ma-chine. For brevity of notation, we omit notation m when it is clear that werefer to the reference machine.Parallelizing SMC is complicated by the resampling step. If roughly allparticles were resampled exactly once, we would be able to assign to eachmachine the same indices as the previous iteration, avoiding communication.However, this rarely happens in practice. Instead, a small number of parti-263.2. OverviewAlgorithm 2 : EMC(?, ?, h, I0)1: (F ,G ,H )? entangle(?) {Section 3.5}2: s? empty-hashtable3: ?? empty-genealogy4: init(s, w)5: for r ? {1, . . . , R} do6: exchange(wr?1)7: I?r?1 ? resample(wr?1, ?, Ir?1,G ) {Appendix B}8: Ir ? allocate(?, I?r?1,H ) {Section 3.3}9: for i ? Ir do10: s(i)? reconstruct(s, ?, i,F ) {Algorithm 3}11: wr,k(i) ? ?(s(?(i)), s(i))12: end for13: end for14: process(s, w, h)cles is often resampled a large number of times while many others have nooffspring. This means that Ir can radically change across iterations. Thisraises an important question: how can a machine compute a proposal if theparticle from which to propose was itself computed by a different machine?The naive approach would consist in transmitting the ?missing? particlesover the network. However, even if basic optimizations are used (for examplesending particles with multiplicities only once), we show in Chapter 4 thatthis transfer can be slow in practice. Instead, our approach relies on acombination of the stochastic maps with the particle genealogy to reconstructthe particle. Let us see what this means in more detail, by going over thekey steps of EMC, shown in Algorithm 2.The product of resample procedure (line 7) is a multiset, I?r?1. Bydefinition, a multiset is a set in which the elements can appear more thanonce. For example, for K = 4, resampling may produce a multiset such as,I?r?1 = {1, 2, 2, 2}, meaning that particle (r ? 1, 1) is resampled once andthe particle (r? 1, 2) is resampled three times. Given I?r?1, the allocationprocedure determines allocation of the particles to machines (line 8). Froma high-level perspective, allocation produces the set of particle indices, Ir,to be propagated in the current generation. We will see in the next section,the complexity of the allocation in terms of its impact on the total runningtime of EMC simulation.273.3. Allocation and resamplingAlgorithm 3 : reconstruct(s, ?, i,F = {Fi : i ? I})1: F ? I2: while (s(i) = nil) do3: F ? F ? Fi4: i? ?(i)5: end while6: return F (s(i))Suppose for now that each machine kept track of the full genealogy,in the form of a hashtable ? : I ? I of parent pointers. Each machine alsomaintains a hashtable holding the particles held in memory in the referencemachine, s : I ? S ? {nil} (the value nil represent a particle not currentlyrepresented explicitly in the reference machine). Algorithm 3 shows thatthis information, s, ?,F , is sufficient to instantiate any query particle (in-dexed by i in the pseudo-code). Note that the procedure reconstruct isguaranteed to terminate: in the procedure init, we set s(i(0, k)) to ?, andthe weights uniformly to 1/K, hence ? is an ancestor of all particles.This high-level description raises several questions. How can we effi-ciently store and retrieve the stochastic maps? Can we maintain a sparseview of the genealogical information ?, s to keep space requirements low? Fi-nally, how can we do resampling and particle allocation in this distributedframework? We will cover these issues in the remaining of this chapter,describing at the same time how the procedures allocate, resample andexchange are implemented.3.3 Allocation and resamplingIn SMC algorithms, the weights are periodically used for resampling theparticles, a step also known as the bootstrapping stage and denoted byresample in Algorithm 2. This is the only stage where EMC requirescommunication over the network to be done. With each machine havingthe full information of the weights in the current iteration, they can eachperform a standard, global resampling step without further communication.In most cases of interest, each machine can transmit all the individual283.3. Allocation and resamplingweights of its particles and to communicate it with every other machine(either via a central server, or a decentralized scheme such as [26]) withoutbecoming the bottleneck. To put this into perspective, consider the casewith K = 200, 000 particles. To store the weight of one particle requires 8bytes of storage, which is the size of a double precision floating point formatin most computer architectures. The total size of the weights that needs tobe transmitted for resampling is roughly 1.6 megabytes.In certain situations, transmitting the list of weights alone may becomeproblematic. For example, in a peer-to-peer distributed computing projectssuch as BOINC [27], the node communication is limited by a combination oflatency, throughput, and cost. Another scenario (our ambition) is where thenumber of particles in the simulation reaches an order of one billion. In sucha case, even the list of weights alone would be too large to transmit. Thesecases can be handled by transmitting only the sum of the weights of eachmachine, and using a distributed hashtable [26] to represent the genealogy.The modifications needed to implement this are discussed in Appendix B.We focus on the simpler case here.Once the resampling step determines which particles survive to the nextgeneration, the next step is to determine allocation of particles to machines.Particle allocation is an optimization problem where the objective is to re-duce the reconstruction time with respect to the set of partition of par-ticles to be processed Ir.We need to introduce new notations to clearly describe the allocationprocedure. Let Amr be a set (not a multiset) of particles allocated to machinem at generation r. The collection of these sets {A1r , . . . ,AMr } partitionsIr, which is the output of the allocate procedure. We let cm denote themaximum number of particles that can be processed by machine m, this isa pre-determined parameter to the simulation. For particle i, let ?(i,m)be the number of times the stochastic map needs to be applied in order to293.3. Allocation and resamplingreconstruct this particle in machine m. The objective function is defined as,argmin{A1r,...,AMr s.t ?m,|Amr |?cm}M?m=1?i?Amr?(i,m)Obtaining an exact solution to this optimization problem is infeasible inpractice as it requires enumerating over the set of all possible partitions. Infact, even if we can somehow solve this optimization problem exactly, it isstill a greedy approach in the perspective of the entire simulation becausewe have no way of knowing how the currently best allocation will affectthe reconstruction in the future generations. We propose a simple greedyapproach, where each machine retains as many particles from I?r?1?Amr?1as possible. Recall that I?r?1 is a multiset that stores the indices of theresampled particles and Amr?1 is a set of particle indices allocated to machinem in r? 1. We define the intersection of a multiset and a set as a multiset.For example, if I?r?1 = {1, 2, 2, 3} and Amr?1 = {1, 2}, then I?r?1?Amr?1 ={1, 2, 2}. We make the following two remarks:(I?r?1?Amr?1)?(I?r?1?Am?r?1)= ??????M?m=1(I?r?1?Amr?1)?????= KThe first remark is attributed by the fact that A1r?1, . . . ,AMr?1 partitionsIr?1, meaning Amr?1?Am?r?1 = ? for all m 6= m?. The second remark followsfrom the first remark and the fact that |I?r?1| = K.This simple greedy allocation approach will produce machines that arein surplus of particles, |I?r?1?Amr?1| ? cm > 0, and machine in deficit ofparticles, |I?r?1?Am?r?1| ? cm? < 0. We propose a variety of schemes toallocate surplus particles to deficit machines.FirstOpen: This is a deterministic scheme where a list of machinesis ordered and is known to all machines. For example, the list may beordered by machine?s computing capacity as determined by a combination303.3. Allocation and resamplingof the processing speed and memory. A machine m in surplus goes throughthis list sequentially, assigning as many particles as possible to the firstmachine, m?, that is in deficit until it is no longer in deficit. If m is stillin surplus, m continues to search through the list until it allocates all ofits surplus particles. Note that this process will terminate since the sum ofthe capacities of all of the machines must be at least K. The motivationfor proposing this approach is to utilize the preferred machines as much aspossible in the simulation.MostAvailable: This scheme attempts to allocate the surplus particlesto machines with the most capacity remaining as defined by:cm? ? |I?r?1?Am?r?1|It is similar to FirstOpen in the sense that the machines are sequentiallyvisited in the order of remaining capacity. The goal here is to not let any onemachine be idle. For example, it is possible that |I?r?1?Am?r?1| << cm? , inwhich case m? will process its particles very fast relative to other machinesand idly wait for the weight exchange stage.Random: As the name suggest, we randomly sample a machine m?among the machines that are in deficit. The intention is that the particleswill be mixed well over different machines so that with high probability, theparticle reconstruction at a later generation terminates well before reachingthe root, ?. As we will see later in the analysis of particle genealogy inSection 3.6, the expected time to reconstruct and mixing of particles overthe machines is directly related so this scheme may be attractive in the caseswhere the number of generations R is very large.Chaos: This is a scheme that ignores our basic greedy strategy of main-taining as many particles in I?r?1?Amr?1 as possible. Instead, for each par-ticle in I?r?1, we randomly sample a machine m, if m is in deficit, we assignthe particle to m. This scheme is an extreme version of the Random scheme.This scheme may also be attractive for the cases where the number of gen-erations is large.313.4. Genealogy3.4 GenealogyIn this section, we argue that for the purpose of reconstruction, only a sparsesubset of the genealogy needs to be represented at any given iteration andmachine. The key idea is that if a particle has no descendants in the currentgeneration, storing its parent is not necessary. In practice, we observed thatthe vast majority of the ancestral particles have this property. In Section 3.6,we provide an in-depth analysis of the particle genealogy.Let us first look at how we can efficiently exploit this property. First, it isuseful to draw a distinction between concrete particles, with s(i) 6= nil, andcompact particles, which are particles implicitly represented via an integer,i = i(r, k), and are therefore considerably more space-efficient. For example,in the smallest phylogenetic example considered in Chapter 4, a compactparticle occupies about 50, 000 times less memory than a concrete particle.Whereas a concrete particle can grow in size as the problem size increases,a compact particle size is fixed.Particles, concrete or compact, can become obsolete from the perspectiveof the reference machine, meaning that the algorithm can guarantee thatthe reference machine will not need them in subsequent iterations. This canhappen for at least two reasons, each of which is efficiently detected at adifferent stage of the algorithm.Update after resampling: Any lineage (path in ?) that did not survivethe resampling stage no longer needs to be maintained. This is illustratedin Figure 3.1. The greyed out particles will never be reconstructed in thefuture generation so they are no longer maintained. Note that it is easy toharness a garbage collector to perform this update in practice.Update after reconstruction: Once a particle is reconstructed, thelineage of the reconstructed particle can be updated. Let j be the particlethat is reconstructed at generation r. At any future generation r? > r, thereconstruction algorithm will only trace up to j (as s(j) 6= nil), and henceall its ancestors reconstructed along the way, can be discarded unless anancestor belongs to another particle lineage that is still alive (active).To paint a clear picture of the process, we expand on Figure 3.1 and323.4. Genealogycomponents decreases by one at every step. More precisely, we will build eachrooted X-tree t by proposing a sequence of X-forests s1, s2, . . . , sR = t, wherean X-forest sr = {(ti, Xi)} is a collection of rooted Xi-trees ti such that thedisjoint union of leaves of the trees in the forest is equal to the original set ofleaves, ?iXi = X. Note that with this specific construction, a forest of rank rhas |X|? r trees.The sets of partial states considered in this Section are assumed to satisfythe following three conditions:1. The sets of partial states of different ranks should be disjoint, i.e. Sr?Ssfor all r ?= s (in phylogenetics, this holds since a forest with r trees cannotbe a forest with s trees when r ?= s).2. The set of partial state of smallest rank should have a single elementdenoted by ?, S1 = {?} (in phylogenetics, ? is the disconnected graphon X).3. The set of partial states of rank R should coincide with the target space,SR = X (in phylogenetics, at rank R = |X|?1, forests have a single treeand are members of the target space X ).These conditions will be subsumed by the more general framework of Sec-tion 4.5, but the more concrete conditions above help understanding the posetframework.In order to grow particles from one rank to the next, the user needs tospecify a proposal probability kernel ?+. Given an initial partial state s anda set of destination partial states A, we denote the probability of proposingan element in A from s by ?+s (A). In the discrete case, we abuse the notation14r123Figure 3.1: I ustration of compact particl s (blue), concret particles(black), and discarded particles (grey).depict this process one generation at a time in Figure 3.2. The set up ofthe figures is as follows. There are 3 machines in the simulation each withcapacity cm = 3. The reference machin is m = 1, the particles allocated tomachine 1 a e enclosed inside the rectangle in the fig r . The black particlesare concrete particles and the samples for these particles are stored in thememory of the reference machine. The blue particles are compact particles,the reference machine continues to update the genealogy ? by updating theparent pointer for the compact particles indicated by the blue arrows. Notethat the concrete particle, as per the rule update after reconstruction, donot need to point to its parent. The grey particles denote the obsolete andtherefore, discarded particles. The particle index, k, at each generation rreads from right to left, i.e. the rightmost particle is indexed by k = 1 andthe leftmost particle is indexed by k = 9.The set of graphics in Figure 3.2 begins at generation 1. The referencemachine applies the stochastic maps to generate the samples for the threeparticles that are allocated to it (encapsulated in the rectangle). In thiscase, A11 = {1, 2, 3}. The reference machine only updates the compact par-ticles for the purposes of maintaining the genealogy. At generation 2, we333.4. Genealogycomponents decreases by one at every step. More precisely, we will build eachrooted X-tree t by proposing a sequence of X-forests s1, s2, . . . , sR = t, wherean X-forest sr = {(ti, Xi)} is a collection of rooted Xi-trees ti such that thedisjoint union of leaves of the trees in the forest is equal to the original set ofleaves, ?iXi = X. Note that with this specific construction, a forest of rank rhas |X|? r trees.The sets of partial states considered in this Section are assumed to satisfythe following three conditions:1. The sets of partial states of different ranks should be disjoint, i.e. Sr?Ssfor all r ?= s (in phylogenetics, this holds since a forest with r trees cannotbe a forest with s trees when r ?= s).2. The set of partial state of smallest rank should have a single elementdenoted by ?, S1 = {?} (in phylogenetics, ? is the disconnected graphon X).3. The set of partial states of rank R should coincide with the target space,SR = X (in phylogenetics, at rank R = |X|?1, forests have a single treeand are members of the target space X ).These conditions will be subsumed by the more general framework of Sec-tion 4.5, but the more concrete conditions above help understanding the posetframework.In order to grow particles from one rank to the next, the user needs tospecify a proposal probability kernel ?+. Given an initial partial state s anda set of destination partial states A, we denote the probability of proposingan element in A from s by ?+s (A). In the discrete case, we abuse the notation14r123components decreases by one at every step. More precisely, we will build eachrooted X-tree t by proposing a sequence of X-forests s1, s2, . . . , sR = t, wherean X-forest sr = {(ti, Xi)} is a collection of rooted Xi-trees ti such that thedisjoint union of leaves of the trees in the forest is equal to the original set ofleaves, ?iXi = X. Note t at with this specific construction, a forest of rank rhas |X|? r trees.The sets of partial sta es co sidered in this Section are assumed to satisfythe following three conditions:1. The sets of partial states of different ranks should be disjoint, i.e. Sr?Ssfor all r ?= s (in phylogenetics, this holds since a forest with r trees cannotbe a forest with s trees when r ?= s).2. The set of partial state of smallest rank should have a single elementdenoted by ?, S1 = {?} (in phylogenetics, ? is the disconnected graphon X).3. The set of partial states of rank R should coincide with the target space,SR = X (in phylogenetics, at rank R = |X|?1, forests have a single treeand are members of the target space X ).These conditions will be subsumed by the more general framework of Sec-tion 4.5, but the more concrete conditions above help understanding the posetframework.In order to grow particles from one rank to the next, the user needs tospecify a proposal probability kernel ?+. Given an initial partial state s anda set of destination partial states A, we denote the probability of proposingan element in A from s by ?+s (A). In the discrete case, we abuse the notation14r123(a) (b)components decreases by one at every step. More precisely, we will build eachrooted X-tree t by proposing a sequence of X-forests s1, s2, . . . , R = t, rean X-forest sr = {(ti, Xi)} is a collection of rooted Xi-trees ti such that thedisjoint union of leaves of the trees in the forest is equal to the original set ofleaves, ?iXi = X. Note that with this specific construction, a forest of rank rhas |X|? r trees.The sets of partial states considered in this Section are assumed to satisfythe following three conditions:1. The sets of partial states of different ranks should be disjoint, i.e. Sr?Ssfor all r ?= s (in phylogenetics, this holds since a forest with r trees cannotbe a forest with s trees when r ?= s).2. The set of partial state of smallest r nk should have a single elementdenoted by ?, S1 = {?} (in phylogenetics, ? is the disconnected graphon X).3. The set f partial states of ank R should coi cide with the target p ce,SR = X (in phylogenetics, at rank R = |X|?1, forests have a single treeand are members of the target space X ).These conditions will be subsumed by the more general framework of Sec-tion 4.5, but the more concrete conditions above help understanding the posetframework.In order to grow particles from one rank to the next, the user needs tospecify a proposal probability kernel ?+. Given an initial partial state s anda set of destination partial states A, we denote the probability of proposingan element in A from s by ?+s (A). In the discrete case, we abuse the notation14r123components decreases by one at every step. More precisely, we will build eachrooted X-tree t by proposing a sequence of X-forests s1, s2, . . . , sR = t, wherean X-forest sr = {(ti, Xi)} is a collection of rooted Xi-trees ti such that thedisjoint union of leaves of the trees in the forest is equal to the original set ofleaves, ?iXi = X. Note that with this pecific construction, forest of rank rhas |X|? r trees.The sets of partial states considered in this Section are assumed to satisfythe following three conditions:1. The sets of partial states of different ranks should be disjoint, i.e. Sr?Ssfor all r ?= s (in phylogenetics, this holds since a forest with r trees cannotbe a forest with s trees when r ?= s).2. The set of partial state of smallest rank should have a single elementdenoted by ?, S1 = {?} (in phylogenetics, ? is the disconnected graphon X).3. The set of partial states of rank R should coincide with the target space,SR = X (in phylogenetics, at rank R = |X|?1, forests have a single treeand are members of the target space X ).These conditions will be subsumed by the more general framework of Sec-tion 4.5, but the more concrete conditions above help understanding the posetframework.In order to grow particles from one rank to the next, the user needs tospecify a proposal probability kernel ?+. Given an initial partial state s anda set of destination partial states A, we denote the probability of proposingan element in A from s by ?+s (A). In the discrete case, we abuse the notation14r123(c) (d)component decreases by one at every step. More precisely, we will build eachrooted X-tr e t by proposing a sequence of X-forests 1, s2, . . . , sR = t, wherean X-forest sr = {(ti, Xi)} is a collection of rooted Xi- rees ti such that thedisjoint union of leaves of the trees in the forest is equal to the original set ofleaves, ?iXi = X. Note that with this specific construction, a fore t of rankhas |X|? r trees.The sets of partial states considered in this Section are assumed to satisfythe following three conditions:1. The sets of partial states of different ranks should be disjoint, i.e. Sr?Ssfor all r ?= s (in phylogenetics, this holds since a forest with r trees cannotbe a forest with s trees when r ?= s).2. The set of partial state of smallest rank should have a single elementdenoted by ?, S1 = {?} (in phylogenetics, ? is the disconnected graphon X).3. The set of partial states of rank R should coincide with the target space,SR = X (in phylogenetics, at rank R = |X|?1, forests have a single treeand are members of the target space X ).These conditions will be subsumed by the more general framework of Sec-tion 4.5, but the more concrete conditions above help understanding the posetframework.In order to grow particles from one rank to the next, the user needs tospecify a proposal probability kernel ?+. Given an initial partial state s anda set of destination partial states A, we denote the probability of proposingan element in A from s by ?+s (A). In the discrete case, we abuse the notation14r123components decreases y one at every step. M re precisely, we will build eachrooted X-tree t by proposing a sequence of X-forests s1, s2, . . . , sR = t, wheren X-forest sr = {(ti, Xi)} is collection of rooted Xi-trees ti such that thedisj int union of leaves of the trees in the forest is equal to the original set ofleaves, ?i i = X. Note that with this specific construction, a forest of rank rha |X|? r trees.The sets of partial ates consid red in this Section are assumed to satisfythe following three conditions:1. The sets of partial stat s of different r nks s ould be di joint, i.e. Sr?Ssfor all r ?= s (in phylogenetics, this holds since a forest with r trees cannotbe a forest with s trees when r ?= s).2. The set of partial state of smallest rank should have a single elementdenoted by ?, S1 = {?} (in phylogenetics, ? is the disconnected graphon X).3. The set of partial states of rank R should coincide with the target space,SR = X (in phylogenetics, at rank R = |X|?1, forests have a single treeand are members of he target space X ).These conditions will be subsumed by the more general framework of Sec-tion 4.5, but the more concrete conditions above help understanding the posetfram work.In order to grow par icles from one rank to the next, the user needs tospecify a proposal probability kernel ?+. Given an initial partial state s anda set of destina ion pa tial states A, we denote the probability of proposingan eleme t n A f om by ?+s (A). In the discrete case, we abuse the notation14r123(e) (f)Figure 3.2: Illustration of evolution of particle genealogy for K = 9, M = 3,(a) r = 1, (b) r = 2, (c) r = 3, (d) r = 4, (e) r = 5, (f) r = 6.343.5. Compact representation of the stochastic mapsobtain I?1 = {1, 1, 4, 4, 5, 5, 6, 7, 9} as a result of the resampling stage. Notethat the Figure 3.2 (b) shows the particles (1, 2), (1, 3), and (1, 8) as graydiscarded particles because the rule update after resampling is invoked ? i.e.no offspring. The reason why particle (1, 1) is gray will be explained shortly.To determine allocation of the particles, we note that the set of particlesallocated to machine 1 in generation 1 isA11 = {1, 2, 3}. We take the intersec-tion of this set with I?1 to obtain {1, 1}. According to the greedy scheme, wepropagate two offsprings for particle (1, 1). The figure illustrates that the al-location determined one of the offsprings of particle (1, 4) to machine 1. Weinvoke reconstruction algorithm to propagate particles (2, 1), (2, 2), (2, 3).After this, we invoke the update after reconstruction rule. In this case, (1, 6)has an active offspring (2, 6) as indicated by the blue arrow so we do notpurge it from the memory. Although (1, 1) has two offsprings, it does nothave any active blue arrow so it can be purged as reconstruction algorithmwill never reach it.At generation 3, I?2 = {1, 2, 3, 3, 3, 6, 8, 8, 8} and A12 = {1, 2, 3} so theintersection is {1, 2, 3, 3, 3}. The reference machine is in surplus; it endsup allocating two of three offsprings of particle (2, 3) to machine m = 2.The update after resampling purges (2, 4), (2, 5), (2, 7), (2, 9). An interestingconsequence of purging (2, 4) is that the active arrow to particle (1, 4) hasnow become inactive gray arrow and it too is purged as a result. For thesame reason, particles (1, 6) (the parent of (2, 7)) and (1, 9) (the parent of(2, 9)) are purged. Again, this update is automatically carried out by thegarbage collector. Finally for generation 3, the update after reconstructionrule purges (2, 1) and (2, 2) from the memory. This process continues onuntil the simulation terminates at r = 6.3.5 Compact representation of the stochasticmapsIn Section 3.1.2, we mentioned that we need a way to efficiently store thestochastic maps. The cardinality of the set of the stochastic maps F = {Fi :353.5. Compact representation of the stochastic mapsi ? I} grows proportionally to the number of particles K times the numberof generations R. To store these maps naively would require the storage ofO(KR) uniform random variables Ui. However, since in practice pseudo-random numbers rather than true independent numbers are typically used,the sequence can be stored implicitly by maintaining only a random seedshared between machines. A drawback to this approach is that it is notefficient to perform random access of the random numbers. Random accessof random numbers is an unusual requirement imposed by the genealogyreconstruction algorithm. Fortunately, as we discuss in this section, it is nothard to modify pseudo-random generators to support random access.The simplest strategy to obtain faster random access is to cache inter-mediate internal states of the pseudo-random generator. For example bydoing so for every particle generation, we get a faster access time of O(K)and a larger space requirement of O(R). More generally, this method canprovide tradeoff of O(n) space and O(m) time with mn = RK.We describe the details of an alternative that requires O(1) storage withO(log(KR)) time for random access of any given map with index i ? I.This method could potentially change the quality of the pseudo-randomsequences obtained, but as demonstrated in Section 4.2, we have empiricalevidence suggesting that the new pseudo-random scheme does not affect thequality of the estimated posterior expectations.As described in Section 3.1.1, each of the maps Fi is completely deter-mined by a uniform random variable Ui. Our algorithm is initialized (duringthe entangle phase of the algorithm) with two lists of uniform random num-bers V (0)n and V(1)n , n = 1, . . . N , where N = dlog2(K ? R + 1)? 1e. Thesesequences should be identical in all the machines, but in practice only theseed of each of these pseudo-random sequences need to be synchronized be-tween the machines (i.e. V (0)0 and V(1)0 ). Given an index i, the strategyis to perform a sequence of xor operations (denoted as ?) of the uniformrandom numbers to arrive at a unique seed U . Given an index i ? I, wewrite its binary expansion as, binary(i) = (b0, b1, . . . , bn), where bj ? {0, 1}corresponds to the jth bit from the left of the binary representation of i, andn+1 is the number of bits in the binary representation (b0 denotes the most363.5. Compact representation of the stochastic mapsAlgorithm 4 : uniform(i)1: U ? V (0)02: if (i = 1) then3: return U4: end if5: {b0, b1, . . . , bn} = binary(i)6: for j = 1..n do7: U ? U ? V(bj)j8: end for9: return Usignificant non-zero bit in the binary representation ? leftmost non-zero bit).We access the random seed V(bj)j at each iteration of the algorithm, whichis xor?ed with the random seeds, V(bj? )j? , j? = 0, . . . , j ? 1. Since the binaryexpansion is unique for each i, the random seed produced by the algorithmis unique to that index. The pseudocode for the algorithm is in Algorithm 4.We adopt a binary tree structure to represent this process in Figure 3.3.Accessing a node i in the tree is interpreted as accessing the stochastic mapfor particle i. It can be done as follows, first if b0 is 1, return the root(b0 = 0 is not possible since i = 1, . . . , RK). For j > 0, we traverse the treeaccording to the value of bj in the binary expansion: traverse to the left ifbj = 0 and to the right if bj = 1.For example, if i = 6, the binary representation is (1, 1, 0). We begin bysetting U ? V (0)0 . We see that b1 = 1, so we traverse to the right, which isto say that we access V (1)1 and set U ? U ? V(1)1 . Next, we see that b2 = 0,so we access V (0)2 traverse to the left and set U ? U ? V(0)2 . At this point,there are no more digit to consider so the algorithm terminates and returnsthe seed U . This example is depicted in Figure 3.3.This algorithm allows for accessing a random seed in a log(KR) time.The storage is O(1) because we only need to store the first two seeds V (0)0and V (1)0 as we can generate (access) the subsequent seeds as the algorithmprogresses.373.6. Analysis1001110101 110 1111Figure 3.3: The representation of implicit binary tree structure for obtain-ing pseudo-random numbers with O(1) storage and log(KR) random accesstime. The double layered nodes of the tree are the nodes traversed to obtainthe map for i = 6.3.6 AnalysisIn this section, we analyze the running time of the reconstruction algorithm(Algorithm 3). The question we want to answer is, given that there are Kparticles in the simulation and Km particles in the reference machine, whatis the running time of reconstructing a particle i at r-th generation? Sinceall of the particles descend from the common ancestor, ?, the worst case runtime of the algorithm is O(r). However, as we will see in the experimentsin Chapter 4, the reconstruction algorithm rarely traces all the way to theroot. Therefore, more interesting question is, what in the average runningtime of the reconstruction algorithm? In order to thoroughly analyze theaverage running time of the reconstruct, we need to first introduce thecoalescent theory [28].3.6.1 The coalescentThe coalescent is a model in population genetics that is commonly used tostudy the genealogy of individuals. The standard model assumes a Wright-Fisher model [28, 29] where we have a diploid population with a constantpopulation size, N , at each generation, yielding 2N copies of locus with no383.6. Analysisgeneration overlap. The Wright-Fisher model operates under the assump-tion of random mating, meaning that each copy in an individual, randomlychooses a parent locus (with probability 1/2N). The random mating oc-casionally leads to selfing when both copies of an individual comes froma single copy in the preceding generation. The coalescent model aims toapproximate the expected time for any n copies in the r-th generation tocoalesce (descend from the common ancestor) by tracing backward in time.If we assumed the weight function ? to be constant and that we usethe Chaos allocation scheme, the genealogy induced by resampling in EMCsimulation can be viewed as a Wright-Fisher model. The only difference isthat we have K haploid individuals ? particles. As we will soon see, the co-alescent model provides an explanation for having Algorithm 3 terminatingwell before reaching the initial symbol ?, which reflects what we observedin our experiments.Expected time to coalesce for two particlesWe will apply the coalescent model and trace the genealogy of two particlesbackward in time starting at the r-th generation to compute the expectedtime to coalescence. As mentioned in the previous section, we assume thatwe have a constant weight function ? that assigns the same weight to eachof the particles.Suppose we are given two particles, i and i?. Let {Bs}rs=1 denote asequence of random variables, each of which takes on value 1 (success) if iand i? coalesce in the generation, r ? s and 0 otherwise. This sequence ofrandom variables form a finite Bernoulli process from which we can definea random variable T , which denotes the number of Bernoulli trials neededto get the first success as defined by the first coalescent event, i.e. {T =t} = {B1 = 0, . . . , Bt?1 = 0, Bt = 1}. To compute E[T ], we need to beable to determine the probability of an event {T = t}. We first point outthat Bs has probability p = 1/K of success for all s; this is easy to see fromthere being K possible choices for a parent and 1/K2 probability that thetwo particles choose the same parent. The expected time to coalesce for two393.6. Analysisrandomly selected particles is:E[T ] =r?t=1tP ({T = t}) =r?t=1t(1? p)t?1pAlternative interpretation of T is that it is a modified geometric randomvariable, one with finite support set, {1, 2, . . . , r}.Expected time to reconstruct a particleIn the above, we computed the expected time for two particles to coalesce.Note the coalescence of particles signals termination of the reconstructionalgorithm according to Algorithm 3. Let us expand on this view further.Given a particle i = i(r, k) ? I that is not stored as a concrete particle inthe reference machine, we are interested in computing the expected time toreconstruct i under the coalescent model. Recall that an ancestor particle ofi will be stored as a concrete particle in the memory of the reference machineif any one of the Km particles in the reference machine descended from thesame ancestor as i (Section 3.4). Therefore, the reconstruction algorithmterminates upon reaching the most recent common ancestor (MRCA) of iand any one of the Km particles (if ? is the MRCA, then it will terminateafter reaching the root). This is a direct consequence of update after recon-struction rule. Now that we have a clear understanding of the connectionbetween coalescence of particles and the reconstruction time of a particle,let us precisely frame the problem at hand. We begin by establishing thenotations.Let {Ns}rs=1 be a sequence of random variables, each of which takes ona value of {0, 1, . . . ,Km} depending on the number of particles among theKm particles in the reference machine that coalesce with i at the generation,r ? s. We overload on T as there is no cause for confusion: let randomvariable T denote the number of Bernoulli trials needed to get the firstcoalescent event between i and any one of the Km particles. In other words,{T = t} = {N1 = 0, . . . , Nt?1 = 0, Nt > 0}. This probability can becomputed as:403.6. AnalysisP ({T = t}) =t?1?s=1P ({Ns = 0})P ({Nt > 0})We need to first compute the probability of no coalescent event at eachgeneration, P ({Ns = 0}). We proceed under the assumption of coalescentmodel, in particular, there is a constant population of K particles in thesimulation and the constant population of Km particles allocated to thereference machine over all generations. With this assumption, we can viewNs as a Binomial random variable with parameter p = 1/K. This is due tothe fact each particle choosing its parent independently and identically fromK possible choices. In other words, particle i chooses its parent among theK possibilities, and each of Km particles have probability 1/K of choosingthe same parent and 1? 1/K probability of choosing another parent. Sinceeach particle chooses the parent independently of each other, we have Ns asa Binomial random variable. Hence, P ({Ns = 0}) = (1 ? 1/K)Km . If wedenote q = P ({Ns = 0}), then we have:P ({T = t}) = (1? q)t?1qWe are now ready to introduce our first result on the expected time toreconstruct a particle (the proof is provided in Appendix A):Lemma 10 Under the coalescent model (the Chaos allocation scheme andconstant weight function for EMC), the expected time to reconstruct a par-ticle is,E[T ] =1? (1? 1K )rKmq(3.1)where K is the number of particles in the simulation, Km is the numberof particles in the reference machine, and q = P ({Ns = 0}) = 1?(1? 1K )Km.For r sufficiently large, we can expect the expected time to coalesce tobe approximately 1/q, which is the mean of the geometric distribution withparameter q. For example, K = 10, 000, Km = 100, and r = 1, 000, we413.6. Analysiscan see that E[T ] ? 100. Note that E[T ] is dependent on the choices forK and Km, which determines the value of q. For example, if the ratio ofK : Km = 100, then E[T ] ? 100 even for r >> 100.3.6.2 Bound on the expected time to reconstructIn the above section, we computed the expected time to reconstruct a par-ticle under the coalescent model. One of the key assumptions that we madewas the weight function ? is constant. In practice, the weights of the par-ticles are vastly different from one another. In this section, we show thatalthough we are not able to compute the expected time to coalesce when ?is not constant, it is bounded above by Equation 3.1.Effective Sample SizeWe briefly introduced the idea of Effective Sample Size (ESS) [22] as a wayto measure degeneracy of the weights in Section 2.3.4. An estimate of ESSat each generation is given in [30] as:ESSr =( K?k=1W 2r,k)?1? K (3.2)where Wr,k ? [0, 1] is the normalized weight of the particle i = i(r, k).The key point is that this ESSr is never greater than K. We will use thisproperty of ESS to prove the bound on the expected reconstruction time forthe case with non-constant weight function.Expected time to coalesce for two particlesAgain, we begin by examining the expected time to coalesce for two ran-domly chosen particles i and i?. Let {Bs,?}rs=1 be a finite Bernoulli process,each of which defines success as occurrence of the coalescent event betweenthe two particles at generation r? s. We use the subscript ? to denote thatwe are dealing with the case of non-constant weight function. We let T?be a random variable, which denotes the number of Bernoulli trials needed423.6. Analysisin order to get the first success, i.e. {T? = t} = {B1,? = 0, . . . , Bt?1,? =0, Bt,? = 1}. We can compute the probability of this event as:P ({T? = t}) =t?1?s=1P ({Bs,? = 0})P ({Bt,? = 1})We need to be able to compute P ({Bs,? = 1}). As before, there areK possible choices of a parent, but each parent has a different probabilityof survival, Wr?1,k so the probability that the two particles come from thesame parent is:P ({Bs,? = 1}) =K?k=1W 2r?1,k (3.3)Note that this is inverse of ESSr defined in Equation 3.2 from which wecan easily deduce,P ({Bs,? = 1}) =1ESS? 1K= P ({Bs = 1})Expected reconstruction timeWe now consider the case where the reference machine is given the task ofreconstructing a particle i. We carry forward the same assumptions as inSection 3.6.1 except that the weight function is not constant. Let Ns,? be theBinomial distributed random variable denoting the number of particles thatcoalesce with i among Km particles in the reference machine at generationr ? s. We let T? denote the number of Bernoulli trials need to get the firstcoalescent event: {T? = t} = {N1,? = 0, . . . , Nt?1,,? = 0, Nt,? > 0}. Foreach of Ns,?, the probability of success is given by Equation 3.3, which meansthat it varies from generation to generation. Let qs denote P ({Ns,? > 0}).We can express E[T?] as:E[T?] =r?t=1tP ({T? = t}) =r?t=1tt?1?s=1(1? qs)qtSince the realized values of qs differ from application to application, we433.6. Analysiscannot compute this quantity. However, note that the implication of qs > qis that there is a higher probability of an occurrence of coalescent eventat each generation. Intuitively, this should mean that the expected timeto coalescence should be smaller than that defined in Equation 3.1. Toexperimentally verify this intuition, we simulated qs for s = 1, . . . , 1000 asfollows:1. Sample ws,k ? Unif(0, 1) for k = 1, . . . ,K2. Normalize the weights Ws,k = ws,k/?Kk=1ws,k3. Compute qs =?Kk=1W2s,kWe repeated the same procedure with N(0, 1) (we exponentiated theweights to avoid having negative weights). The expected value for differentvalues of r was computed and plotted along with the quantity in Equation 3.1as shown in Figure 3.4. This simple simulation matches with our intuition:that the expected time to coalesce for non-constant weight case, E[T?] isbounded above by E[T ]. We have not proved the below result and leave itas a conjecture to be proved for as a future work:Conjecture 11 Under the same assumptions of Lemma 10 except that ofconstant weight function, there exists r0 > 0 such that for r > r0, E[T?] ?E[T ].Final remarkWe have studied the behavior of the expected time to reconstruct a particleunder the coalescent model for the cases where ? is constant and the real-istic case when it is not constant. We presented results showing that thereconstruction time is bounded above by the case when the weight functionis constant. However, when we use greedy allocation scheme as explained inSection 3.3, we no longer have the coalescent model. We extensively analyzethe effect of allocation on the expected time to reconstruct in Section 4.5.443.6. Analysis0 200 400 600 800 1000020406080100Expected time to coalescencerE[T]1/qUniform(0, 1)Normal(0, 1)Figure 3.4: The simulation result for expected time to coalescent.45Chapter 4ExperimentsIn this chapter, we carry out experiments on synthetic and real datasetsarising from phylogenetics. The goal for this chapter is to convince thereader of the efficacy of EMC simulation as a scalable parallelization method.We present the empirical behavior of EMC simulation first, followed bya simulation study to provide evidence supporting the correctness of theanalysis presented in Section 3.6.This chapter is outlined as follows. We begin by briefly discussing theexperimental setup in Section 4.1. As a first validation, we start by demon-strating that the behavior of our sampler equipped with our stochastic mapdatastructure (Section 3.5) is indistinguishable from that of a sampler basedon a standard pseudo-random generator in Section 4.2. Then, we delve intothe experiments by presenting the speed up results in Section 4.3 followedby the timing results in Section 4.4. We end the chapter with a simulationstudy based on the coalescent model [28] in Section 4.5.4.1 Experimental setupWe have explained our motivation for developing EMC in Section 2.2. InBayesian phylogenetics, given a collection of biological sequences for dif-ferent taxa, we are interested in computing expectations under a posteriordistribution over phylogenetic trees. For intermediate to large numbers ofspecies, Bayesian phylogenetic inference via SMC requires a large number ofparticles to achieve an accurate estimate. This is due to fact that the totalnumber of distinct tree topologies increases at a super-exponential rate asthe number of species increases [16].In this chapter, we apply the phylogenetic SMC algorithm described in464.2. Validation of compact stochastic mapsSection 2.4. Recall that the number of generations, R, in this SMC algorithmis determined by the number of taxa under study. In the experiments inSection 4.2 where we wanted to run our SMC for more iterations, we use analternation of kernel: in the first phase, the proposal described in Section 2.4is used as the kernel, and in the second phase, the MCMC kernel of [12],transformed into a SMC kernel using the technique of [18, 31].To generate synthetic datasets, we used a standard process [14]: wesampled trees from the coalescent [28], simulated data along the tree usinga K2P likelihood model [32], discarded the values at internal nodes to keeponly the observations at the leaves and held out the tree.For real datasets, we used the manually aligned ribosomal RNA (rRNA)dataset of [33]. We used a subset of 28 sequences in the directory contain-ing 5S rRNA sequences of Betaproteobacteria and a larger subset of 4,510sequences of 16S rRNA sequences from Actinobacteria. We did experimentson two different numbers of subsampled species: 20 and 100.4.2 Validation of compact stochastic mapsFor EMC simulation to achieve scalability, it is important for the machines tobe able to efficiently store and access the stochastic maps. We presented analgorithm, uniform (Algorithm 4), in Section 3.5 that requires O(1) storageand O(log(KR)) access time. In this section, we verify the correctness ofthis algorithm.We carried out experiments to compare the quality of the SMC approxi-mation based on pseudo-random numbers generated from uniform againstthe standard pseudo-random number generator. The dataset is a syntheticphylogenetics data with 20 taxa and 1000 sites. We measured a tree metric,the Robinson Foulds metric [34], on the consensus tree at every iteration, todetect potential biases in the estimator. We show random examples of pairsof runs in Figure 4.1 and Figure 4.2. It can be seen that the fluctuationsare negligible and hence, impossible to distinguish which plot was gener-ated using uniform and which plot from standard pseudo-random numbergenerator.474.3. Speed-up results0 1000 2000 3000 4000 5000 6000050100150200250Compact stochastic mapIterationsRobinson Foulds metric0 500 1000 1500 2000 2500 3000 3500050100150200250Compact stochastic mapIterationsRobinson Foulds metric0 1000 2000 3000 4000050100150200250Compact stochastic mapIterationsRobinson Foulds metric0 1000 2000 3000 4000050100150200250Compact stochastic mapIterationsRobinson Foulds metricFigure 4.1: Phylogenetic inference using the stochastic maps generated fromuniform.4.3 Speed-up resultsIn this section, we show experimental results where we measure the speed-up of an EMC algorithm on two sets of phylogenetics data by counting thenumber of times the maps Fi are applied. The question we explore here ishow deep the reconstruction algorithm has to trace back, or more precisely,how many times a parallelized version of our algorithm applies maps Ficompared to the number of times the equivalent operation is performed inthe serial version of SMC.We denote N1 to be the number of times the proposal function is applied484.3. Speed-up results0 1000 2000 3000 4000 5000 6000050100150200250Regular random generatorIterationsRobinson Foulds metric0 500 1000 1500 2000 2500 3000 3500050100150200250Regular random generatorIterationsRobinson Foulds metric0 1000 2000 3000 4000050100150200250Regular random generatorIterationsRobinson Foulds metric0 1000 2000 3000 4000050100150200250Regular random generatorIterationsRobinson Foulds metricFigure 4.2: Phylogenetic inference using stochastic maps generated from thestandard pseudo-random number generators.in serial SMC, and NM to be the number of stochastic maps applied in ouralgorithm ran on M machines. We measure the speedup as a ratio of Mand RM , SM = MRM where RM =NMN1 .We ran these experiments on the 16S and 5S subsets of the rRNA datadescribed earlier. In both subsets, we found a substantial speedup, suggest-ing that deep reconstruction was rarely needed in practice. We show theresults on 100 taxa (species) for 5S and 16S in Figure 4.3.We also obtained the following empirical ranking across the performanceof the allocation methods: FirstOpen ? MostAvailable > Random >>Chaos. As expected, Chaos performs the worst, which demonstrates the494.4. Timing results02040600 20 40 60# of MachinesSpeed UpAllocationChaosFirstOpenMostAvailableRandom16S actinobacteria dataset with 100 taxa02040600 20 40 60# of MachinesSpeed UpAllocationChaosFirstOpenMostAvailableRandom5S proteobacteria dataset with 100 taxa(a) (b)Figure 4.3: The speedup factor for (a) the 16S actinobacteria dataset with100 taxa, (b) the 5S proteobacteria dataset with 100 taxa.effectiveness of the greedy strategy. The reason why FirstOpen scheme per-forms the best can be attributed to the fact that a small set of preferredmachines are allocated more particles than other non-preferred machinesand hence store more concrete versions of the particles in the memory. Hav-ing access to more concrete particles has the effect of shortening the timeto reconstruct particles; only rarely will these machines trace deep into thegenealogy.4.4 Timing resultsAn SMC algorithm can easily be distributed over multiple machines byrelying naively on particle transmission between machines over the network.In this section, we compared the particle transmission time to reconstructiontime of EMC2.The timing results in this section builds on the results from Section 4.3where we showed that the ratio of NM and N1 is small. Here, we ranSMC algorithm for 100 generations and measured the total run time of2The experiments were carried out on Amazon EC2 micro instances.504.4. Timing results500 1000 1500 2000 2500 300002000004000006000008000001200000Total run time of EMC versus Particle transfer# of particlesTime (milliseconds)0 20 40 60 80 1000100002000030000400005000060000Run time elapsed per generationGenerationTime (milliseconds)(a) (b)Figure 4.4: (a) Total time for particle transfer (red), total time for EMC(blue). (b) Sample generation time including reconstruction time (black),reconstruction time (blue), and particle transfer time (red) by generation.the EMC algorithm and an SMC algorithm parallelized via explicit particletransfer?see Figure 4.4 (a). We fixed the number of particles per machineat 100 and produced a sequence of experiments by doubling the numberof machines and hence the number of particles at each step. In Figure 4.4(b), we show the reconstruction time, the sample generation time (whichincludes the reconstruction time), and the particle transmission time bygeneration. As expected, the particle transmission is the bottleneck to theSMC algorithm whereas the reconstruction time is stable, which verifies thatthe reconstruction algorithm rarely traced deep.The total timing result in Figure 4.4 (a) shows that the overhead aris-ing from increasing the number of particles (or increasing the number ofmachines) is much smaller compared to the particle transmission method.The breakdown of time by generation in Figure 4.4 (b) shows that the par-ticle transmission time is volatile as it depends on the network latency andthroughput. The reconstruction time is stable as it relies only on the CPUcycles.514.5. Genealogy analysis4.5 Genealogy analysisIn Section 3.6, we computed the expected time to coalesce, E[T ], for differentvalues of r under the coalescent model [28] (Lemma 10). This result givesus a bound on the expected time to reconstruct a particle and hence, thebehavior of EMC simulation. In this section, we carry out experiments tosupport the analysis presented in Section 3.6.4.5.1 Monte Carlo estimate of E[T ]We want to simulate EMC under the same assumptions used for derivingLemma 10 to obtain an approximation of E[T ] and compare it to the theo-retical quantity. To simulate EMC under the coalescent model, we assumethat we have a constant weight function, ?, that assigns same weight toevery particle and that we use the Chaos allocation scheme. Note that thecoalescent model does not have the concept of ?machines,? which is whywe use the Chaos allocation scheme. We proceed to obtain a Monte Carloestimate of E[T ] as follows:? for n = 1, . . . , N :1. Run EMC for r generations (under the assumptions of the coa-lescent model) .2. At r-th generation, we randomly select a particle, i(r, k) that isnot in the reference machine, and invoke reconstruct;3. Let Tj denote the number of stochastic maps applied by recon-struct. This is the n-th Monte Carlo sample.? Finally, our Monte Carlo estimate is: E?[T ] = 1NN?n=1TnThe Figure 4.5 (a) shows the result of this simulation. We ran EMCwith K = 1, 000 and Km = 100 for r = 1, . . . , 100 generations and for eachgeneration, collected N = 10, 000 Monte Carlo samples. In Figure 4.5 (b),we chose K = 10, 000, Km = 100, and tried for r = 1, . . . , 1000, again, wecollected N = 10, 000 samples. In both cases, the estimate of the time to524.5. Genealogy analysis0.02.55.07.510.00 25 50 75 100GenerationsE[T]TypeEstimateOptimizedTheoreticalMonte Carlo estimate of E[T]02550751000 250 500 750 1000GenerationsE[T]TypeEstimateOptimizedTheoreticalMonte Carlo estimate of E[T](a) (b)Figure 4.5: Estimated value of E[T ] over generations plotted with the the-oretical value computed from Lemma 10 (a) for K = 1, 000 and Km = 100,(b) for K = 10, 000 and Km = 100. The green line is the estimated E[T ]computed with optimization.coalesce (the red line) was very close to the theoretical quantity (the blueline) for all generations.4.5.2 Effect of greedy allocation on E[T ]Lemma 10 and the subsequent Conjecture 11 gives us a bound on the ex-pected time to coalesce for the Chaos allocation scheme. Naturally, we arenow interested in the effect of greedy allocation on the expected time tocoalesce. Before we perform any computer experiment, we wanted to runa simple thought experiment. Suppose that at each generation up to r,each particle gets resampled exactly once. Under the greedy scheme, eachmachine keeps the particle that it generated, eliminating the need for re-constructing any particles. Then at generation r, suppose that the particlegenealogy collapses, i.e. only one particle, say sr,k, gets resampled. Then,each machine (except for the machine that propagated this particle) needs534.5. Genealogy analysis0501001502002500 250 500 750 1000GenerationsE[T]Comparison of greedy and chaos allocation02500005000007500000 250 500 750 1000Generations# of callsAllocationChaosGreedyTotal # of calls to reconstruct(a) (b)Figure 4.6: Comparison of greedy allocation scheme against chaos allocationscheme for (a) estimated expected time to coalesce, (b) the total number oftimes reconstruct procedure is invoked.to reconstruct the particle by tracing back up to the root3. This would notbe the case for the Chaos allocation scheme as the particles would be mixedwell over the nodes even in the case where each particle gets resampled ex-actly once. The conclusion is that the greedy allocation should increase thetime to coalesce, which is confirmed in Figure 4.6 (a). In the simulation, weestimated the expected time to coalesce for the FirstOpen greedy allocationscheme (we chose to use the FirstOpen scheme because it performed the bestin the speed up experiment in Section 4.3) with K = 1, 000 and Km = 100.This may lead to the conclusion that greedy allocation schemes achieveinferior performance to the Chaos allocation scheme; this is not true ingeneral. The performance of EMC depends on the number of stochasticmaps applied in the simulation. The Figure 4.6 (b) shows the total number ofmaps applied in the simulation for greedy and the Chaos allocation schemes.The numbers are obtained by averaging over 100 EMC simulations. We cansee that the total number of stochastic maps applied is much smaller for thegreedy scheme up to approximately r = 500. An interpretation of this figure3Each machine needs to reconstruct the particle only once because of the update afterresampling rule described in Section 3.4.544.5. Genealogy analysisis that although each call to reconstruct procedure terminates quicker onaverage for the Chaos scheme, it makes much more calls to reconstructthan the greedy schemes. We predict that for most problems arising fromphylogenetics, the total number of stochastic maps applied by the greedyallocation will be less than the total number of stochastic maps applied bythe Chaos allocation scheme. The figures for the speed up result (Figure 4.3)serves to support this claim. In the speed up experiment, we noted that theChaos allocation was the worst of all four allocation schemes in terms of thetotal number of stochastic maps applied.We close this chapter with a final remark. In the cases where simulationneeds to be run for a long time (i.e. large value of R), the Chaos schemecan be an attractive choice because its long run behavior is well understood(Lemma 10 and Conjecture 11). The main advantage of EMC is that sincethe particles need not be transmitted over the network, it can accommodatethe Chaos scheme without incurring additional extra network cost.55Chapter 5Conclusion and Future workWe presented Entangled Monte Carlo simulation method. We analyzed thetheoretical behavior of EMC and supported it empirically by demonstratingits efficacy on the datasets arising from phylogenetics. We envision it as auseful tool for inference in other application areas such as cancer phyloge-netics and historical linguistics where large inference problems commonlyoccur.Our contribution is to the field of computational statistics, which accord-ing to [35], is dedicated to designing of algorithms for implementing statisti-cal methods on computers including ones unthinkable before the computerage (e.g. bootstrap [36] and Monte Carlo simulation [37]). As statisticians,we realize that computing is and will continue to be critical to the successof the statistics as a discipline. We need to be able to adapt to the latestadvances from computer science in order to deal with large scale inferenceproblems commonly faced nowadays by the statisticians. Naturally, the fu-ture work for EMC evolves around integrating the method on top of parallelcomputing architectures such as MapReduce [38] and Spark cluster com-puting framework [39]. Specifically, we want to integrate EMC along witha distributed hash table (Appendix B) so that EMC can be seamlessly inte-grated to run simulations on virtually any cluster computing environment.56Bibliography[1] Seong-Hwan Jun, Liangliang Wang, and Alexandre Bouchard-Co?te?. En-tangled Monte Carlo. In Advances in Neural Information ProcessingSystems 25, pages 2735?2743, 2012.[2] Christian P. Robert and George Casella. Monte Carlo Statistical Meth-ods (Springer Texts in Statistics). Springer-Verlag New York, Inc.,2004.[3] Arnaud Doucet, Nando De Freitas, and Neil Gordon. An Introductionto Sequential Monte Carlo Methods, chapter 1. Springer-Verlag,, NewYork, 2001.[4] Arnaud Doucet and Adam M Johansen. A tutorial on particle filteringand smoothing: Fifteen years later. Handbook of Nonlinear Filtering,12:656?704, 2009.[5] Charles Darwin. On the origins of species by means of natural selection.London: Murray, 1859.[6] Todd R Golub, Donna K Slonim, Pablo Tamayo, Christine Huard,Michelle Gaasenbeek, Jill P Mesirov, Hilary Coller, Mignon L Loh,James R Downing, Mark A Caligiuri, et al. Molecular classification ofcancer: class discovery and class prediction by gene expression moni-toring. science, 286(5439):531?537, 1999.[7] Lyle Campbell. Historical Linguistics: An Introduction. MIT Press,2nd edition, 1998.[8] Quentin D Atkinson. Phonemic diversity supports a serial founder effect57Bibliographymodel of language expansion from Africa. Science, 332(6027):346?349,2011.[9] Alexandre Bouchard-Co?te?, Percy Liang, Thomas Griffiths, and DanKlein. A Probabilistic Approach to Language Change, 2008.[10] David Hall and Dan Klein. Finding cognate groups using phylogenies.In Proceedings of the 48th Annual Meeting of the Association for Com-putational Linguistics, pages 1030?1039. Association for ComputationalLinguistics, 2010.[11] Bret Larget and Donald L Simon. Markov chain Monte Carlo algo-rithms for the Bayesian analysis of phylogenetic trees. Molecular Biol-ogy and Evolution, 16:750?759, 1999.[12] John P. Huelsenbeck, Fredrik Ronquist, et al. Mrbayes: Bayesian in-ference of phylogenetic trees. Bioinformatics, 17(8):754?755, 2001.[13] R. H. Swendsen and J-S. Wang. Replica Monte Carlo simulation ofspin-glasses. Phys. Rev. Lett., 57:2607?2609, Nov 1986.[14] Alexandre Bouchard-Co?te?, Sriram Sankararaman, and Michael I. Jor-dan. Phylogenetic Inference via Sequential Monte Carlo. SystematicBiology, 61:579?593, 2012.[15] T Ryan Gregory. Understanding evolutionary trees. Evolution: Educa-tion and Outreach, 1(2):121?137, 2008.[16] Charles Semple and Mike A Steel. Phylogenetics, volume 24. OxfordUniversity Press, 2003.[17] J Felsenstein. Evolutionary trees from DNA sequences: a maximumlikelihood approach. Journal of Molecular Evolution, 17:368?376, 1981.[18] P. Del Moral, A. Doucet, and A. Jasra. Sequential Monte Carlo sam-plers. Journal of The Royal Statistical Society Series B-statisticalMethodology, 68(3):411?436, 2006.58Bibliography[19] Rick Durrett. Probability: theory and examples, volume 3. Cambridgeuniversity press, 2010.[20] Dan Jurafsky, James H Martin, Andrew Kehler, Keith Vander Lin-den, and Nigel Ward. Speech and language processing: An introductionto natural language processing, computational linguistics, and speechrecognition, volume 2. MIT Press, 2000.[21] Thomas Bengtsson, Peter Bickel, and Bo Li. Curse-of-dimensionalityrevisited: Collapse of the particle filter in very large scale systems. Prob-ability and statistics: Essays in honor of David A. Freedman, 2:316?334,2008.[22] Augustine Kong, Jun S Liu, and Wing Hung Wong. Sequential impu-tations and Bayesian missing data problems. Journal of the Americanstatistical association, 89(425):278?288, 1994.[23] Paul Fearnhead and Peter Clifford. On-line inference for hidden Markovmodels via particle filters. Journal of the Royal Statistical Society.Series B, 2003.[24] Y. W. Teh, H. Daume? III, and D. M. Roy. Bayesian agglomerative clus-tering with coalescents. In Advances in Neural Information ProcessingSystems (NIPS), 2008.[25] J. G. Propp and D. B. Wilson. Coupling from the past: a user?s guide.Microsurveys in Discrete Probability. DIMACS Series in Discrete Math-ematics and Theoretical Computer Science, 41:181?192, 1998.[26] I. Stoica, R. Morris, D. Karger, M. F. Kaashoek, and H. Balakrishnan.Chord: A scalable peer-to-peer lookup service for internet applications.ACM SIGCOMM 2001, pages 149?160, 2001.[27] D. P. Anderson. BOINC: A System for Public-Resource Computing andStorage. In GRID ?04: Proceedings of the 5th IEEE/ACM InternationalWorkshop on Grid Computing, pages 4?10, Washington, DC, USA,2004. IEEE Computer Society.59Bibliography[28] J. F. C. Kingman. On the Genealogy of Large Populations. Journal ofApplied Probability, 19:27?43, 1982.[29] J. Felsenstein. Inferring phylogenies. Sinauer Associates, 2003.[30] Arnaud Doucet, Simon Godsill, and Christophe Andrieu. On sequentialMonte Carlo sampling methods for Bayesian filtering. Statistics andcomputing, 10(3):197?208, 2000.[31] P. Del Moral, A. Doucet, and A. Jasra. Sequential Monte Carlo forBayesian computation. Bayesian Statistics, 8:1?34, 2007.[32] Motoo Kimura. A simple method for estimating evolutionary rates ofbase substitutions through comparative studies of nucleotide sequences.Journal of molecular evolution, 16(2):111?120, 1980.[33] J.J. Cannone, S. Subramanian, M.N. Schnare, J.R. Collett, L.M.D?Souza, Y. Du, B. Feng, N. Lin, L.V. Madabusi, K.M. Muller,N. Pande, Z. Shang, N. Yu, and R.R. Gutell. The comparative RNAweb (CRW) site: An online database of comparative sequence andstructure information for ribosomal, intron, and other RNAs. BioMedCentral Bioinformatics, 2002.[34] DF Robinson and Leslie R Foulds. Comparison of phylogenetic trees.Mathematical Biosciences, 53(1):131?147, 1981.[35] Carlo Lauro. Computational statistics or statistical computing, is thatthe question? Computational Statistics and Data Analysis, 23(1):191 ?193, 1996.[36] B. Efron and R. J. Tibshirani. An Introduction to the Bootstrap. Chap-man & Hall, New York, 1993.[37] Nicholas Metropolis and Stanislaw M. Ulam. The Monte Carlo Method.Journal of the American Statistical Association, 44(247):335?341, sep1949.60[38] Jeffrey Dean and Sanjay Ghemawat. Mapreduce: simplified data pro-cessing on large clusters. Communications of the ACM, 51(1):107?113,2008.[39] Matei Zaharia, Mosharaf Chowdhury, Michael J Franklin, ScottShenker, and Ion Stoica. Spark: cluster computing with working sets.In Proceedings of the 2nd USENIX conference on Hot topics in cloudcomputing, pages 10?10, 2010.61Appendix AProof of Lemma 10Lemma 10: Under the coalescent model (Chaos allocation scheme and con-stant weight function for EMC), the expected time to reconstruct a particleis,E[T ] =1? (1? 1/K)rKmq(A.1)where K is the number of particles in the simulation, Km is the numberof particles in the reference machine, and q = P (Ns = 0) = 1?(1?1/K)Km .Proof:The expectation of T on support set {1, 2, . . . , r} is:E[T ] =r?1?t=1tP (T = t) =r?1?t=0tP (T = t)Note that P (T = t) = (1 ? q)t?1q for t < r and P (T = r) = (1 ? q)r?1because after r ? 1 failure to get a coalescent event, we have a coalescentevent with probability 1 as all particles descend from the common root, ?.We break up E[T ] into two components and carry out the computation:E[T ] =r?1?t=0tP (T = t) + rP (T = r)Replacing the probabilities with the actual quantities, we have:E[T ] =r?1?t=0t(1? q)t?1q + r(1? q)r?1 (A.2)62Appendix A. Proof of Lemma 10Let S =r?1?t=0t(1? q)t?1, multiply S by (1? q) to get,(1? q)S =r?1?t=0t(1? q)tNow, subtract (1? q)S from S:S ? (1? q)S =r?1?t=0t(1? q)t?1 ?r?1?t=0t(1? q)t=r?1?t=0(1? q)t ? r(1? q)r?1From this, we can get S:S =r?1?t=0(1? q)t ? r(1? q)r?1qHence, the first summand is:qS =r?1?t=0(1? q)t ? r(1? q)r?1Plugging this into the Equation A.2, we obtain an expression for E[T ]as:E[T ] =r?1?t=0(1? q)tIt can be further simplified by geometric series as:E[T ] =1? (1? q)rq(A.3)To obtain Equation A.1, we replace q = 1?(1?1/K)Km in the numeratorof Equation A.3:63Appendix A. Proof of Lemma 10E[T ] =1? [1? (1? (1? 1/K)Km)]rq=1? (1? 1/K)rKmq64Appendix BExtensionsIn this section, we discuss extensions for scaling the EMC simulation tohandle the cases where the number of particles is arbitrarily large. Theextensions discussed here are subject of future work in our research as theyhave not been implemented yet.The reconstruction algorithm requires only the compact representationof the particle consisted of its stochastic map and the pointer to its par-ent ?(i) be stored in the memory. However, for an EMC simulation withK particles and R generations, the upper bound on the storage is O(KR),which is directly proportional to the number of particles K. This is a crudeupper bound, derived by assuming that all of the particles survive the re-sampling stage at every generation (in other words, each particle has exactlyone descendent and this occurs at each generation). In fact the coalescenttheory guarantees that coalescent events occur surprisingly frequently (seeSection 3.6) and in the problem domain of interest, we observed that this isindeed the case in practice (Section 4).Nonetheless, the storage of entire particle genealogy locally in a nodebecomes infeasible as the number of particles increases. This limitationprohibits EMC from scaling up to an arbitrarily large number of particles.To alleviate this problem, we introduce a distributed hash table (DHT) forstoring the particle genealogy.B.1 Storing the genealogyThe main idea is for each machine to track only portions of the geneal-ogy rather than entire genealogy. Each machine stores a list of neighboringnodes, which can be queried for the parent ?(i). The neighboring node ei-65B.2. Resamplingther provides the parent ?(i) or queries its neighbor list for ?(i). The ideaof chord DHT proposed in [26] requires that each machine to keep trackof at most logM other machines for routing of the queries for retrieval (ofparent ?(i)) to be completed with at most O(logM) queries sent aroundthe network. Furthermore, join and failure of machines are handled withO(log2M) messages sent around the network. Since EMC relies on a net-work protocol for communication, a mechanism to handle node failure isparticularly useful. The use of DHT implies that the storage requirementon each machine is limited to the set of particles to be processed, Ir, andonly a subset of the genealogy of compact particles.Defining unique key-value pair for the DHT is simple. Since any particlein the genealogy is uniquely determined by the generation r and index kthrough the mapping i(r, k), it can be used as the key for storing and ac-cessing any particle in the DHT. The value for each key would be a datastructure that stores a pointer to parent (or parent index), ?(i), and thestochastic map. The use of DHT comes at a cost of sacrificing speed but itallows EMC to be applied to scale up to an arbitrary number of machinesand an arbitrary number of particles.B.2 ResamplingWe started with a discussion of a strategy to handle cases where the numberof particles is so large that even our compact representation of the genealogymight become a memory bottleneck. A related issue it that the transmissionof all the individual weights can become prohibitively large. This problemcan be addressed by storing the genealogy in a distributed hashtable asindicated above and by having each machine communicating the sum ofits weights rather than the individual weights. We need to first show thatresampling can still be done efficiently with only the knowledge of the sumof the weights per machine.We start by reviewing the non-parallel version of the resampling algo-rithm. Given the (unnormalized) weights wr,1, . . . , wr,K , the first step is tonormalize them to get the mean parameters ? = (w?r,1, . . . , w?r,K) of a multi-66B.2. Resamplingnomial distribution Mult(K,?) over 1, . . . ,K. Using the fact that the k-thorder statistics is distributed according to a Beta(K,K? k+ 1), this can bedone in time O(K) using a stick breaking procedure.The first and simplest way to do distributed resampling is to have eachmachine transmit the individual weight of each of its particle. This meansthat after this first version of exchange is completed for generation r, eachof the machines has a global view on the weights of the entangled sampler.The weights and the shared stochastic maps can then be used to performa coherent resampling in all machines. This is done by letting the sharedrandom maps correspond to particle indices, G = {Gi : i ? I}. Sinceaccessing each map takes time O(rK), the overall time is O(K log(rK)),and the space requirements scale in O(K).Our efficient distributed resampling scheme is based on an elementaryobservation: if K uniform [0, 1] random variables are sampled, the number ofpoints that will fall in the first half of the interval is Bin(K, 1/2) distributed.We use this idea recursively on a collection of dyadic intervals subdividing[0, 1] to get our distributed resampling scheme. More precisely, we let {Ij :j ? J } denote the dyadic intervals of size 1, I0 = [0, 1], down to size 1/K.To simplify the notation, we will assume that K is a power of two, andwe will also drop the subscript r?it is understood that each resamplingdatastructure needs to be generation-specific.The number of dyadic intervals is 2K ? 1, which can be viewed as beingorganized on a tree where each interval index j 6= 0 has a parent indexj? = pi(j) corresponding to the smallest distinct interval Ij? containing Ij .We will also assume that the index of an interval corresponding to the firsthalf of its parent interval has an odd index j, and we will denote the lengthof an interval I by `(I).For each interval that is not a leaf on the tree defined above, we at-tach a ({1, . . . ,K} ? {1, . . .K})-valued random map induced by the arti-ficial transition probabilities T (x, x?) = pBin(x,1/2)(x?), where pBin(n,p) is theprobability mass function of a Bin(n, p)-distributed random variable. Thesemaps give us an efficient way of finding the number N(Ij) of uniform ran-dom variables that fall in the first half of any non-leaf dyadic interval Ij ,67B.3. Storage limitationsj 6= 0:N(Ij ,G ) ={Gpi(j)(N(Ipi(j),G )) if j is oddN(Ipi(j),G )?Gpi(j)(N(Ipi(j),G )) o.w.Finally, we attach to the indices at the leaves a collection of randommaps described in Algorithm 4 to be able to instantiate the location of theuniform random variables inside any leaf dyadic interval. Given an arbitraryinterval I ? [0, 1], it is then efficient to find how many of the K uniformrandom variables fall inside I: if `(I) < 1/K, then it is contained in at mosttwo leaf dyadic intervals, while if `(I) ? 1/K, I ? I ? where I ? can be writtenas the union of two dyadic intervals such that `(I ?) ? 4`(I).Given this function N , the resampling step is computed as shown inAlgorithm 5. Note that the algorithm only depends on the sum of normalizedweights with indices less than or equal to the particle number k in the currentgeneration, w??k =?k??k w?i(r,k?), which only needs the sum of the weightsin the other machines and the weight of the individual particles in Ir to beprocessed in the current machine.There are logK accesses to the datastructure G required, for a totalrunning time of O(log2(rK) +Km) and memory O(Km).Algorithm 5 : resample(w = wr, ?, Ir,G )1: for i ? Ir do2: k ? N([0, w??k(i)),G )3: n? N([w??k(i), w??k(i)+1,G )4: for m ? {1, . . . , n} do5: i? ? i(r + 1, k +m)6: ?(i?)? i7: end for8: end forB.3 Storage limitationsSo far, we have addressed one form of storage limitation that arises fromincreasing the number of particles in the EMC simulation. In this section,we discuss another form of storage limitation where the raw dataset cannot68B.3. Storage limitationsbe accommodated in a single node.The EMC algorithms in its current form focusses on cases such as phy-logenetic inference where the communication between nodes is prohibitivedue to the structure of the models, where it usually involves information onlatent variables rather than dataset transmission. For example, the class ofproblems where we expect EMC to be especially attractive is when efficientSMC sampling depends on Rao-Blackwellization. In these cases the size ofthe particles is often increasing with respect to generation by the storage ofdynamic programming tables.In principle, EMC can be relaxed to overcome this limitation by parti-tioning the data and ensuring that the allocation scheme does not cross thesepartition blocks. However, for problems involving dataset transmission, thecommon approach is to use MapReduce [38] to distribute and process jobsin parallel. Another framework that is gaining traction is Spark clustercomputing framework [39]. Implementation of EMC on top of Spark clustercomputing environment is a future work.69
- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Theses and Dissertations /
- Entangled Monte Carlo
Open Collections
UBC Theses and Dissertations
Featured Collection
UBC Theses and Dissertations
Entangled Monte Carlo Jun, Seong-Hwan 2013
pdf
Page Metadata
Item Metadata
Title | Entangled Monte Carlo |
Creator |
Jun, Seong-Hwan |
Publisher | University of British Columbia |
Date Issued | 2013 |
Description | A recurrent problem in statistics is that of computing an expectation involving intractable integration. In particular, this problem arises in Bayesian statistics when computing an expectation with respect to a posterior distribution known only up to a normalizing constant. A common solution is to use Monte Carlo simulation to estimate the target expectation. Two of the most commonly adopted simulation methods are Markov Chain Monte Carlo (MCMC) and Sequential Monte Carlo (SMC) methods. However, these methods fail to scale up with the size of the inference problem. For MCMC, the problem takes the form of simulations that must be ran for a long time in order to obtain an accurate inference. For SMC, one may not be able to store enough particles to exhaustively explore the state space. We propose a novel scalable parallelization of Monte Carlo simulation, Entangled Monte Carlo simulation, that can scale up with the size of the inference problem. Instead of transmitting particles over the network, our proposed algorithm reconstructs the particles from the particle genealogy using the notion of stochastic maps borrowed from perfect simulation literature. We propose bounds on the expected time for particles to coalesce based on the coalescent model. Our empirical results also demonstrate the efficacy of our method on datasets from the field of phylogenetics. |
Genre |
Thesis/Dissertation |
Type |
Text |
Language | eng |
Date Available | 2013-08-29 |
Provider | Vancouver : University of British Columbia Library |
Rights | Attribution-NonCommercial-NoDerivatives 4.0 International |
DOI | 10.14288/1.0074196 |
URI | http://hdl.handle.net/2429/44953 |
Degree |
Master of Science - MSc |
Program |
Statistics |
Affiliation |
Science, Faculty of Statistics, Department of |
Degree Grantor | University of British Columbia |
Graduation Date | 2013-11 |
Campus |
UBCV |
Scholarly Level | Graduate |
Rights URI | http://creativecommons.org/licenses/by-nc-nd/4.0/ |
Aggregated Source Repository | DSpace |
Download
- Media
- 24-ubc_2013_fall_jun_seonghwan.pdf [ 2.69MB ]
- Metadata
- JSON: 24-1.0074196.json
- JSON-LD: 24-1.0074196-ld.json
- RDF/XML (Pretty): 24-1.0074196-rdf.xml
- RDF/JSON: 24-1.0074196-rdf.json
- Turtle: 24-1.0074196-turtle.txt
- N-Triples: 24-1.0074196-rdf-ntriples.txt
- Original Record: 24-1.0074196-source.json
- Full Text
- 24-1.0074196-fulltext.txt
- Citation
- 24-1.0074196.ris
Full Text
Cite
Citation Scheme:
Usage Statistics
Share
Embed
Customize your widget with the following options, then copy and paste the code below into the HTML
of your page to embed this item in your website.
<div id="ubcOpenCollectionsWidgetDisplay">
<script id="ubcOpenCollectionsWidget"
src="{[{embed.src}]}"
data-item="{[{embed.item}]}"
data-collection="{[{embed.collection}]}"
data-metadata="{[{embed.showMetadata}]}"
data-width="{[{embed.width}]}"
async >
</script>
</div>
Our image viewer uses the IIIF 2.0 standard.
To load this item in other compatible viewers, use this url:
http://iiif.library.ubc.ca/presentation/dsp.24.1-0074196/manifest