Divide and Conquer Sequential Monte Carlo forPhylogeneticsbySean William JewellB.Sc., McMaster University, 2012a thesis submitted in partial fulfillmentof the requirements for the degree ofMaster of Scienceinthe faculty of graduate and postdoctoralstudies(Statistics)The University of British Columbia(Vancouver)August 2015© Sean William Jewell, 2015AbstractRecently reconstructing evolutionary histories has become a computationalissue due to the increased availability of genetic sequencing data and relax-ations of classical modelling assumptions. This thesis specializes a Divide &conquer sequential Monte Carlo (DCSMC) inference algorithm to phyloge-netics to address these challenges. In phylogenetics, the tree structure usedto represent evolutionary histories provides a model decomposition used forDCSMC. In particular, speciation events are used to recursively decomposethe model into subproblems. Each subproblem is approximated by an inde-pendent population of weighted particles, which are merged and propagatedto create an ancestral population. This approach provides the flexibility torelax classical assumptions on large trees by parallelizing these recursions.iiPrefaceThis dissertation is solely authored, unpublished work by the author, SeanJewell. Alexandre Bouchard-Coˆte´ and Sean Jewell designed research andexperiments.iiiTable of ContentsAbstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iiPreface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iiiTable of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . ivList of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viList of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viiList of Listings . . . . . . . . . . . . . . . . . . . . . . . . . . . . viiiGlossary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ixAcknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . xii1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32.1 Biological Background . . . . . . . . . . . . . . . . . . . . . . 32.1.1 Phylogenetic Tools . . . . . . . . . . . . . . . . . . . . 32.1.2 Biological Hypotheses . . . . . . . . . . . . . . . . . . 72.2 Bayesian Phylogenetics . . . . . . . . . . . . . . . . . . . . . . 92.3 Sampling Background . . . . . . . . . . . . . . . . . . . . . . 132.3.1 Importance Sampling . . . . . . . . . . . . . . . . . . 142.3.2 Sequential Importance Sampling . . . . . . . . . . . . 15iv2.3.3 Sequential Monte Carlo . . . . . . . . . . . . . . . . . 163 Phylogenetic Divide and Conquer Sequential Monte Carlo 183.1 Phylogenetic Inference with IS . . . . . . . . . . . . . . . . . 183.2 Phylogenetic Inference with DCSMC . . . . . . . . . . . . . . 203.2.1 Intermediate Distributions . . . . . . . . . . . . . . . . 253.2.2 Theoretical Implications . . . . . . . . . . . . . . . . . 313.3 Relaxing the Molecular Clock with DCSMC . . . . . . . . . . 323.4 Extensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 364 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . 384.1 Synthetic Results . . . . . . . . . . . . . . . . . . . . . . . . . 424.2 Green Plant rbcL . . . . . . . . . . . . . . . . . . . . . . . . . 464.3 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . 485 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51vList of TablesTable 4.1 Green Plant Comparisons . . . . . . . . . . . . . . . . . . 47viList of FiguresFigure 2.1 Components of Phylogenetic Trees . . . . . . . . . . . . . 5Figure 2.2 Relaxed Molecular Clock via Compound Poisson Process 10Figure 2.3 Labelled Tree with Data and Message Passing Algorithm 12Figure 3.1 Importance Sampling in Phylogenetics . . . . . . . . . . . 19Figure 3.2 Na¨ıve Yule Prior . . . . . . . . . . . . . . . . . . . . . . . 26Figure 3.3 Yule Prior . . . . . . . . . . . . . . . . . . . . . . . . . . . 27Figure 3.4 Phylogenetic DCSMC . . . . . . . . . . . . . . . . . . . . 30Figure 3.5 DCSMC for the Compound Poisson Process . . . . . . . . 34Figure 3.6 MCMC Topology Move for Relaxed Clock Models . . . . 37Figure 4.1 Correctness Unit Test . . . . . . . . . . . . . . . . . . . . 39Figure 4.2 Synthetic Validation Comparisons . . . . . . . . . . . . . 45Figure 4.3 Green Plant Phylogeny . . . . . . . . . . . . . . . . . . . 47Figure 4.4 Synthetic Tree Runtime Comparisons . . . . . . . . . . . 49viiList of ListingsListing 2.1 Ultrametric Tree Class . . . . . . . . . . . . . . . . . . . . 8Listing 3.1 Particle Interface . . . . . . . . . . . . . . . . . . . . . . . 22Listing 3.2 Clock Particle . . . . . . . . . . . . . . . . . . . . . . . . . 23Listing 3.3 DCSMC Pseudocode . . . . . . . . . . . . . . . . . . . . . 24Listing 3.4 Yule-like Proposals . . . . . . . . . . . . . . . . . . . . . . 28Listing 3.5 Relaxed Clock Particle . . . . . . . . . . . . . . . . . . . . 35Listing 4.1 Mr. Bayes Parameter Specification . . . . . . . . . . . . . 42viiiGlossaryγ¯ Unnormalized target posterior distributionr Vector of compound Poisson process scalingsz Vector of compound Poisson process event locationsγ Target posterior distributionλ Poisson process rate parameterC ChildrenL Loss functionP ParentsY Data at leaf nodesb Branch length functionQ Rate matrixP CTMC probabilitypi Stationary distributionΣ Character setτ Tree topologyξ Number of compound Poisson process eventsixB Ambient spaced Height incrementsE Edge sete Collection of compound Poisson process eventsh Height functionJ Compound Poisson processM Number of taxam Substitution rateN Number of particlesn Number of nucleotides sequencedP Poisson processq Proposal distributionr RootS Stochastic processT Tree lengthV Node setX Leaf set X ⊂ VCPP Compound Poisson processCTMC Continuous time Markov chainDCSMC Divide and conquer sequential Monte CarloESS Effective sample sizeIS Importance samplingxMAP Maximum a posterioriMCMC Markov chain Monte CarloML Maximum likelihoodPMCMC Particle Markov chain Monte CarloPP Poisson processSIS Sequential importance samplingSMC Sequential Monte CarloxiAcknowledgmentsGoing to the mountains is going home.— John MuirThe last two years in Vancouver have been a wonderful experience. I amindebted to a number of people who have helped me grow professionally atUBC, and personally through many adventures in British Columbia.Firstly, none of this thesis would have been possible without the excellenttutelage of Alexandre Bouchard-Coˆte´. Alex introduced me to the beautifulfield of probabilistic approaches to machine learning. Within this area, Ihad the distinct pleasure of exploring applications spanning problems inphylogenetics (through a collaborative research project with the BC CancerAgency) to modelling urban spatial data. Alex’s tireless dedication to curatethis formative part of my research career cannot be undervalued.I am financially indebted to both the BC Cancer Agency and the NaturalSciences and Engineering Research Council of Canada. Thank you for yoursupport.Within the Department of Statistics I have made many friends. Thankyou to Andre´s Sa´nchez Ordon˜ez for always organizing creative ways for us toexplore Vancouver, and to Creagh Briercliffe for injecting his witty sense ofhumour into our office culture. A special thank you to Neil Spencer for themany lively and engaging office conversations that often resulted in laughter,new insights, or another side project.Vancouver’s close proximity to both the mountains and sea offered manyopportunities to ski powder, hike, and kayak. I am especially grateful toxiiGreg Shield, Kym Berenschot, Sunny Sheshgiri, Dan Ashby, Hyun-KyoungKoo, Ariel Bultz, and Patrick Prendergast for always spurring new adven-tures.Thanks to Janine Roller and her family for welcoming me to their homefor the holidays and for hosting me in Whistler after our skiing escapades.You helped make this coast feel like home.Lastly, I am fortunate to have very supportive parents and little sister.The amount of patience and love they have given me through this meander-ing path is remarkable.Sungmee, thank you for always believing in us.xiiiChapter 1IntroductionThe evolutionary history of species, cells, or tumours is often of interest toresearchers trying to understand the complex relationships between extantspecies and their ancestors. Knowledge of these histories allows researchersto study, for example, ancestral state reconstruction, population histories,phylogeny and alignment, and species divergence times (see Chen et al. [5,Chapter 1] and the references therein).Attempts to encapsulate the complicated dynamics of evolution havegiven rise to a host of different approaches from heuristics to model basedmethods [13]. The introduction of likelihood based approaches led to Bayesianmodelling in phylogenetics [8, 22, 25, 27, 30, 34, 43]. Interestingly, in con-trast to the usual connotations associated with Bayesian modelling, theearly Bayesian phylogenetic models provided a more computationally ef-ficient framework than standard Maximum likelihood (ML) bootstrappingmethods [22]. Moreover, the computational efficiency of Bayesian modellingversus traditional ML methods allowed for more biologically reasonable evo-lutionary models to be studied [19].Within Bayesian phylogenetics, the demand for efficient and accurateposterior inference has driven development of sophisticated Markov chainMonte Carlo (MCMC) methods [22, 25, 27, 30, 43], and more recently parti-cle filter methods [4]. These methods approximate the posterior distributionvia a collection of weighted samples. Such weighted particle populations can1be subsequently used to approximate quantities of biological interest (whichare often represented as expectations with respect to the posterior distribu-tion).Although MCMC and particle filters have facilitated inference for in-creasingly realistic evolutionary models, the recent influx of genetic sequencedata and the desire for (even) more biologically reasonable models has begunto demand more efficient methods. To address potentially inefficiencies inMCMC sampling for large scale data, Bouchard-Coˆte´ et al. [4] introduced aparticle filter approach for phylogenetics by using techniques similar to theagglomerative clustering methods of Teh et al. [40].Research in areas outside of efficient MCMC and particle filter methodshave also been developed for the data surge. A˚kerborg et al. [1] presented afast dynamic programming approach that can be used to speed-up MCMCmethods, and also in a hill-climbing maximum a posteriori (MAP) estimatefor dating inference. A difference of this work is that it provides the MAPestimate and not an estimate of the full posterior distribution as in MCMCand particle filtering methods. Our goal is to utilize the the full power ofBayesian decision theory, which couples the posterior distribution and a lossfunction to yield an optimal estimator known as the Bayes estimator. Hence,we focus on MCMC and particle filtering.The work developed herein continues the trend of particle filtering meth-ods for Bayesian phylogenetics by specializing the computationally efficientand parallelizable DCSMC algorithm of Lindsten et al. [26] to phylogenet-ics. We anticipate that this framework will become desirable as evolutionarybiologists sequence more species while simultaneously demanding more com-plex evolutionary models.This thesis is organized as follows. Chapter 2 provides the necessarybiological, phylogenetics, and sampling preliminary material for the rest ofthis thesis. Chapter 2 concludes with an introduction to particle filteringmethods which serve as the motivation and building blocks for DCSMC.DCSMC and its specialization to phylogenetics are proposed in Chapter 3.Experimental results are presented in Chapter 4. Conclusions and futuredirections are listed in Chapter 5.2Chapter 2BackgroundThis chapter is divided into three different parts. The first section dis-cusses the biological background necessary to describe phylogenetics andbriefly mentions some common biological hypotheses. The second sectiondescribes Bayesian phylogenetic models: both likelihood and prior choicesare discussed. Finally, the third part reviews sampling through MCMC andsequential Monte Carlo (SMC) methods. Taken together, these sections pro-vide the necessary notation and background to discuss the specialization ofthe DCSMC algorithm to phylogenetics.2.1 Biological BackgroundTo achieve the goal of modelling evolution several ideas need to be mathe-matically formalized. This section serves to formalize the mutation process,the notion of evolutionary histories, and several common biological hypothe-ses.2.1.1 Phylogenetic ToolsThe mutation process and evolutionary history are formalized through astochastic process for character mutations (specifically point substitutions)and a directed graph representation for evolutionary histories. Both mod-elling components are discussed in this section.3We assume that the evolutionary process of point substitution betweencharacters is governed by a Markov model [11, 12]. These discrete charac-ters could be binary to indicate the presence/absence of traits, or letters todenote DNA or protein nucleotides. The running example of a character setΣ in this work is DNA–whose four possible characters are the nucleotides{A,C,G, T}. Character substitution among these nucleotides is commonlymodelled with a continuous time Markov chain (CTMC) St whose realiza-tions are values in Σ. Such CTMCs are parameterized by a rate matrix Qthat encodes the rate of transition between characters in Σ.The topology of a phylogenetic tree is represented by a directed graphτ = (V ,E). The set of nodes V represents extant and extinct species, andcan be partitioned into sets based on biological meaning. The root of aphylogenetic tree r ∈ V represents the most recent common ancestor of allextant species X ⊂ V . The internal nodes V \ (r ∪X) represent speciationevents, and the leaves X are extant species. The edges E represent theevolutionary relationship between an ancestor v ∈ V and its descendantsC(v) ⊂ V such that C(v) = {c1, c2} if (v, c1), (v, c2) ∈ E. Associated withthe edge set E is a branch length function that maps from E to the positivereal numbers b : E → R+. These branch lengths b(vi, vj) represent thegenetic distance between two species (vi, vj) ∈ E. An important quantityrelated to branch lengths is the total length of a tree,T =∑(vi,vj)∈Eb(vi, vj). (2.1)The data at leaf nodes X ′ ⊂ X is a set Y(X ′) of |X ′| lists each of whichcontain n values in Σ. Each list is the DNA sequence for a species x ∈ Xfor each of the |X| = M species. These ideas are summarized in Figure 2.1.4Leaves: XSpeciationRoot: rData: Y(X)ACGT· · ·TACTT· · ·ATTGC· · ·TTTGG· · ·TEdges: Evrb(r, v)Figure 2.1: Components of Phylogenetic Trees: τ = (V ,E). (Left)Schematic of the partition of the node set V into the root r,speciation events, and leaves X. The observed data Y(X) isshown as M = |X| lists of n characters from a DNA characterset Σ = {A,C,G, T}. (Right) Illustration of the edge set E andthe branch length b(vi, vj) > 0 associated with each directededge (vi, vj) ∈ E.With the topology, branch length function, and stochastic process de-fined we can now discuss how these ideas are combined to form a phyloge-netic tree. The stochastic process is defined on the tree through a bijectivemapping from (τ , b) to an interval [0, T ]. St’s index set is the interval [0, T ].From a forward simulation perspective, the stochastic process St starts atthe root r with a character σ ∈ Σ drawn from the stationary distributionof Q . At speciation nodes, the stochastic process splits into two indepen-dent stochastic processes, each of which start at the state of the chain beforesplitting. Conditional probabilities (marginalized over all possible paths) areeasily computed through the matrix exponential. For example, if c ∈ C(r)then the conditional probability at c is given as:Pb(r,c)(Sc = y|Sr = x) = [exp(b(r, c) ·Q)]x,y , (2.2)where S· has been overloaded to denote the stochastic process St at nodes5r and c. The matrix exponential is defined as the infinite sum:exp(b(r, c) ·Q) =∞∑i=0(b(r, c) ·Q)ii! , (2.3)and can often be computed efficiently (e.g., Moler and Van Loan [28]).In order to address time and rate confounding, Q is constructed suchthat the expected number of transitions in unit time is one:∑σ∈Σ[Q ]σ,σ piσ = 1, (2.4)where pi is the stationary distribution of the CTMC,piQ = 0. (2.5)Consequently, the branch lengths are in terms of the expected number ofsubstitutions. For example, the normalized Jukes-Cantor rate matrix is:QJC =−1 13 13 1313 −1 13 131313 −1 13131313 −1, (2.6)and has stationary distribution piJC =(14 , 14 , 14 , 14). This form of the ratematrix assumes that the rate of substitution is equal for all possible characterpairs; however, the family of matrices given as Generalized time-reversiblematrices of Tavare´ [39] relax this assumption.62.1.2 Biological HypothesesMolecular Clock: A simple biological hypothesis is that the evolutionaryrate across all lineages is constant. This hypothesis, known as the molecularclock hypothesis, was first introduced in Zuckerkandl and Pauling [45]. Aconsequence of this conjecture is that the distance between species is ultra-metric: the distances between a species v and leaf nodes vi, vj ∈ X connectedby a (directed) path from v are equivalent. This implies that the distancesfrom the root to any leaf node are also all equivalent. Ultrametric treesinvite us to define a notion of tree height. The height of a node v ∈ V is thesum of branch lengths along a path from itself to a leaf, or more preciselythrough the following recursive definition:h(v) =b(v, c) + h(c) C(v) 6= ∅0 C(v) = ∅, (2.7)where c ∈ C(v) can be chosen arbitrarily.To prepare for the specialization of the DCSMC algorithm these defini-tions are used to define a data structure that encapsulates both tree topol-ogy and character sequences (for each node) under the clock hypothesis.As shown in Listing 2.1, the topology is represented through a list of lists,branch lengths are a function of both the list of children and their associatedheights, and observed sequence data is stored as a character string. Thislisting complements Figure 2.1.Although the molecular clock hypothesis is simple, empirical evidencesuggests that rates of evolution vary across different lineages (e.g., Gillespie[17]); for example due to differing evolutionary pressures. To account forthese empirical observations, relaxations of the molecular clock are the focusof our discussion in Section 3.3 and below.Despite the empirical evidence that the molecular clock is not perfect,it remains an important conceptual tool. It provides a timescale of evolu-tion that can be used in estimating divergence times from molecular data.Moreover, there is a rich set of unsolved inference problems under these as-sumptions. These challenges are addressed in Chapter 3. As a result, we7public class Tree<T>{// Propertiesprivate final List<Tree<T>> children;private final double height;private final T nodeLabel;private final String characterSequence;// Methodspublic Tree<T> getChildren(){// Return list of children}public branchLength(T node1, T node2){/*** Recurse children to obtain node1.height and node2.height* Return | node1.height - node2.height | if there is a* path between node1 and node2, otherwise return RuntimeError()*/}public getCharacterSequence(){// Return character sequence for current node}public boolean isLeaf(){// Return true if children = ∅}}Listing 2.1: Ultrametric Tree Class: The topology is represented asa list of lists, branch lengths are a function of both the list ofchildren and their associated heights, and the data is stored asa character string.first develop methods for this simple model then consider expanded modelsthat capture rate variation across different lineages. The expanded model isdetailed next.Relaxed Clock: Relaxed clock models aim to retain some of the propertiesof the molecular clock, such as rooting, while simultaneously relaxing theassumption that all lineages evolve with the same rate. As described in Lep-age et al. [24], these relaxations have been modelled through non-parametric[37, 38], local clock [44], and Bayesian parametric [3, 9, 18, 21, 41] methods.In this thesis the focus is on the compound Poisson process (CPP) modelof Huelsenbeck et al. [18]. It is an example of a Bayesian parametric model8where the rate is modelled as a piece-wise constant function.A CPP is a jump process J t where events occur according to a Poissonprocess (PP), and the jumps Ui are each independent identically distributedrandom variables. That is,J t =P t∑i=1Ui, (2.8)where P t is a Poisson process with rate λ. In Huelsenbeck et al. [18], theCPP construction is used to generate points of substitution rate change(according to the underlying PP) and substitution rate scalings throughgamma distributed random variables Ui ∼ Gamma(α, β). The illustrationof the CPP given in Figure 2.2 demonstrates that events of substitutionrate change give rise to differing rates of evolution. Since branch lengths areviewed as the expected number of substitutions along a branch the resultingtree is scaled. A more detailed description of the CPP is given in Section 3.3under the context of the DCSMC algorithm.2.2 Bayesian PhylogeneticsBayesian phylogenetic inference is summarized through the posterior distri-bution for the evolutionary history of species (represented through a treestructure τ), the associated branch lengths b, and additional parameterssuch as substitution rates θ, given observed data Y(X) at the leaves. In-ference of the tree topology, branch lengths, and parameters is facilitatedthrough the posterior distribution:γ(τ , b, θ|Y(X)) = l(Y(X)|τ , b, θ)pi(τ , b, θ)∫ l(Y(X)|τ , b, θ)pi( dτ , db, dθ) , (2.9)where pi(τ , b, θ) is the joint prior density over (τ , b, θ) and l(Y(X)|τ , b, θ) isthe likelihood of the data. Posterior inference is challenging computationallydue to the analytically intractable multidimensional integral:Z =∫l(Y(X)|τ , b, θ)pi( dτ , db, dθ).9Figure 2.2: Relaxed Molecular Clock via Compound Poisson Process:Illustration of the relaxed molecular clock via the compoundPoisson process. Stars represent realizations from a Poissonprocess for the location of substitution rate change. Substitu-tion rate change occurs just prior (in evolutionary time) to theevent of rate change. The rate is modified by scaling the branchlength immediately before the event by a gamma random vari-able. Since the tree is drawn such that distances are in termsof the expected number of substitutions the scalings are illus-trated by stretching the tree on lineages where a scaling eventoccurred.In this work we assume that the topology τ and rate matrix parameters θare fixed. Thus Equation 2.9 simplifies to:γ(b|Y(X)) = l(Y(X)|b)pi(b)∫ l(Y(X)|b)pi( db) =γ¯(b|Y(X))Z , (2.10)where Z =∫l(Y(X)|b)pi( db) is still analytically intractable.Probabilistic phylogenetic models can be formed by defining differentprior densities pi(·) or likelihood models l(·|·). A common model choice isindependent CTMCs for each character in a DNA sequence. Under these as-sumptions the likelihood of the leaf data given branch lengths is l(Y(X)|b) =10n∏i=1∑σr∑σvi1· · ·∑σviM−2[M−2∏j=1Pb(P(vj),vj)(Sivj = σvj |SiP(vj) = σP(vj))]×[M∏j=1Pb(P(xj),xj)(Sixj = Y(xij)|SiP(xj) = σP(xj))]× piσr ,(2.11)where {σvi1 , . . . , σviM−2} denote the values of the internal nodes of τ ,X = {x1, . . . , xM}, Siv is the value of the stochastic process at node v forthe ith character, and Y(xij) denotes the ith observed character for the jthtaxa. For example, consider the labelled tree given in Figure 2.3. In thatcase, Equation 2.11, is:∑σr∑σv1∑σv2piσrPb(r,v1) (Sv1 = σv1 |Sr = σr)Pb(r,v2) (Sv2 = σv2 |Sr = σr)×Pb(v1,x1) (Sx1 = C|Sv1 = σv1)Pb(v1,x2) (Sx2 = A|Sv1 = σv1)×Pb(v2,x3) (Sx3 = C|Sv2 = σv2)Pb(v2,x4) (Sx4 = T|Sv2 = σv2) .(2.12)In Chapter 3, methods to calculate Equation 2.11 efficiently via the so-calledpeeling recursions introduced by Felsenstein [11, 12] are discussed.This likelihood can be conceptualized as follows. Assume that characterlocations are independent and first consider only one site. The likelihoodof observing the leaf data at one site is obtained by marginalizing over allpossible internal node and root values. Finally, since independence acrosssites is assumed, the full likelihood is the product over all n sites.With the likelihood model defined, we turn attention to the prior dis-tribution over branch lengths. The methods developed in this thesis aredesigned for clock and relaxed clock trees [18] which imply that the treetopology is rooted. Therefore, the branch length prior is the only remain-ing component required to complete the Bayesian model specification. Fora detailed list of branch length priors for rooted trees see Chen et al. [5,Chapters 2.3-2.4]. The one considered here is the Yule prior.The Yule process is a convenient form of the birth-death process wherethe death rate is set to zero. This process induces a distribution over branchlengths by defining a stochastic process of cladogenesis from ancestors to ex-tant species. In particular, the Yule process induces branch lengths through11rx1x2x3x4v1v2C = [0 1 0 0]A = [1 0 0 0]C = [0 1 0 0]T = [0 0 0 1][p1(v1), p2(v1), p3(v1), p4(v1)][p1(v2), p2(v2), p3(v3), p4(v4)][p1(r), p2(r), p3(r), p4(r)]C A C TData:Figure 2.3: Labelled Tree with Data and Message Passing Algo-rithm: (Left) Labelled phylogenetic tree with leaf data. (Right)Schematic for message passing algorithm. Leaf characters arecoded as row matrices, messages propagate from the leavesthrough internal nodes to the root. Messages represent the con-ditional probability of each character for internal nodes giventhe descended observed data. The likelihood of the tree is givenas the sum ∑4i=1 pi(r)pi(i).speciation times whose increments are exponentially distributed with rateparameter proportional to the number of speciation events from the root tothe current node. This process is depicted in panel I of Figure 3.3. Mathe-matically, this prior is defined as follows. Let h be a list of M sorted heightsfor (τ , b) and let d a list of height increments obtained from h via differenc-ing the elements of h: di = hi − hi−1 for i = 1, . . . ,M − 1. The density ofthe Yule prior is defined as:pi(b) =M−1∏i=1(i+ 1) exp ((i+ 1)gi(b)) (2.13)=M−1∏i=1(i+ 1) exp ((i+ 1)di) , (2.14)where gi(b) = di.12The models discussed so far implicitly assume that the evolutionary pro-cess follows the molecular clock hypothesis. Loosely speaking, this hypothe-sis states that the rate of character change is uniform across all branches. Al-though numerous empirical studies have provided evidence to the contrary,it remains difficult to conduct (large scale) inference on models that relaxthis hypothesis to allow for rate variation across branches. The DCSMCframework is envisaged to perform well in modelling scenarios where thesetypes of assumptions are relaxed via a CPP construction as in Huelsenbecket al. [18].2.3 Sampling BackgroundThis section provides an overview of particle filtering methods starting fromits foundations in Importance Sampling (IS) and incrementally working to-wards SMC.1 The discussion serves as a way to set notation and link theseideas for sampling on product spaces, to general spaces, to the developmentof sampling via tree decompositions in DCSMC.Given a difficult to sample target distribution γ(x) = γ¯(x)/Z, whereγ¯(x) is easy to evaluate point-wise, and where Z =∫γ¯(x) dx is difficult tocalculate, our aim is to approximate γ(x) via a collection of N weighted par-ticles (xi, wi)Ni=1. Our desiderata for these collections of weighted particlesis that for well behaved measurable functions f : X → R the following resultholds:1∑Nj=1wjN∑i=1wif(xi) P−→∫f(x)γ(x) dx, as N →∞. (2.15)Different sampling methods can be distinguished in the way that such col-lections of weighted particles are generated. For example, MCMC generatesa collection of samples whose weights are equal (and deterministic), perfectsampling (either via coupling from the past [31, 32] or other exact meth-ods) generates collections with equal deterministic weights, whereas particle1An excellent introduction to particle filtering methods is provided in Doucet andJohansen [7].13methods often generate collections with unequal random weights.2 An ex-ample of a sampling method with unequal random weights is presented next.2.3.1 Importance SamplingA common method to generate samples from a target γ is via rejectionsampling. In rejection sampling a proposal distribution q whose support isa superset of the target’s support is used to generate candidate samples fromγ. To determine whether each candidate sample is kept as a sample from γor rejected a randomized algorithm is used. The algorithm’s accept-rejectdecisions are based on the probability of drawing these candidate samplesfrom γ.IS builds on rejection sampling by noticing that rejection can be com-putational expensive (especially for “bad” proposals). To ameliorate thisinefficiency, samples generated from the proposal are never discarded, butinstead weighted based on the evidence that they arose from γ. The weightsshould be viewed as a correction for the discrepancy between the proposaldistribution and the target.The rationale of this modification to rejection sampling is shown via thelaw of large numbers. Let xi ∼ q(·), i = 1, . . . , N and wi = γ¯(xi)q(xi) be a weightassigned to each sample xi. Then for well behaved measurable functions fthe strong law of large numbers, and Slutsky yields:1∑Nj=1 wjN∑i=1wif(xi)a.s.−→1∫ γ¯(x)q(x) q( dx)∫f(x)γ¯(x)q(x)q( dx) (2.16)=1Z∫f(x)γ¯(x) dx =∫f(x)γ(x) dx. (2.17)This consistency result is encouraging; however, the asymptotic variance(see Doucet and Johansen [7, Eqn. 27]),1N∫ γ2(x)q(x) (f(x)− E(f(x)))2 dx, (2.18)is known to be large in simple scenarios [7]. The next two modifications to ISimprove on this issue by incorporating statistically efficient proposal strate-2If resampling is used the weights are renormalized and each particle has equal weight.14gies and variance reduction methods via particle resampling techniques.2.3.2 Sequential Importance SamplingWe first address the inefficiency of IS when known model decompositionsexist. When the target is defined on a high dimensional space and admitsa “sequential” decomposition the proposals used in importance samplingcould be inefficient (both computationally and statistically). Suppose x canbe written as (x1, x2, . . . , xm) = x1:m. sequential importance sampling (SIS)modifies importance sampling by proposing component xj conditionally onx1:j−1 and sequentially updating weights. In particular, the proposal can bewritten as the product [7]:qm(xi1:m) = qm−1(xi1:m−1)qm(xim|xi1:m−1) (2.19)= q1(xi1)∏mk=2 qk(xik|xi1:k−1), (2.20)with unnormalized weights given as:wim =γ¯m(xi1:m)qm(xi1:m)= γ¯n−1(xi1:m−1)qm−1(xi1:m−1)γ¯m(xi1:m)γ¯m−1(xi1:m−1)qn(xi1:m). (2.21)This telescoping product yields weight updates given by:wim = wm−1(xi1:m−1)αm(xi1:m) = wi1∏mk=2 αik, (2.22)with incremental weightsαik =γm(xi1:m)γm−1(xi1:m−1)qm(xim|x1:m−1). (2.23)Data received in an online fashion is amenable to these types of de-compositions and sequential updates. Examples include target tracking andstochastic volatility models. Although proposing x via sequential proposalscan be more efficient, SIS suffers from the same high variance as IS. More-over, as particles are sequentially updated in SIS, the corresponding set of(normalized) weights often concentrates on values close to zero with a fewparticles close to one. This is known as particle degeneracy since only a few15particles contribute to the target approximation. SMC addresses this issueby adding a resampling step between sequential proposals.2.3.3 Sequential Monte CarloAs alluded to in the previous section, SMC addresses the high variancefound in both IS and SIS and particle degeneracy by adding a resamplingstep after each weight update. The details of this scheme are illustrated inAlgorithm 1, for more detail see Doucet and Johansen [7].Algorithm 1 smc:Sample xi1 ∼ q1(x1)Compute weight wi1 =γ¯1(xi1)q1(xi1)Resample (xi1, wi1)Ni=1 to obtain equally weighted particle system (xˇi1, 1)Ni=1for c = 2, . . . ,m doSample xic ∼ qc(xc|xˇ1:c−1)Set xi1:c ← (xˇi1:c−1, xic)Compute incremental weight αi1:c =γ¯c(xi1:c)γ¯c−1(xi1:c−1)qc(xic|xi1:c−1)Resample (xi1:c, wi1:c)Ni=1 to obtain equally weighted particle system(xˇi1:c, 1)Ni=1end forreturn Equally weighted particle system (xˇi1:n, 1)Ni=1The upshot of resampling is that the particle system is “reset” wheneverresampling occurs [7]. This can be shown to significantly reduce the numberof particles required to achieve reasonable performance in terms of the ratioof asymptotic variance of Zˆ to Z.The discussion of particle methods has thus far focused on specific as-sumptions of the space (X,X ). IS assumed that efficient proposals existedfor the (potentially high dimensional) space, whereas SIS and SMC assumeda sequential decomposition. The sequential decomposition was originallyused to create more efficient proposals by sequentially building a weightedcollection of particles on (X,X ). SMC introduced resampling to alleviatethe high variance in both IS and SIS. However, particle methods are not lim-16ited to these specialized domains. In fact, a general framework (confusingly)named SMC samplers has been developed to cover general spaces [6].As mentioned in the introduction, Bouchard-Coˆte´ et al. [4] introducedparticle filters for phylogenetics by using techniques similar to the agglom-erative clustering methods of Teh et al. [40]. The sequential decompositionwas possible by building the tree topology in a sequential manner. We in-vestigate another model decomposition for phylogenetics when τ is fixed.17Chapter 3Phylogenetic Divide andConquer Sequential MonteCarloThis chapter delineates the contributions made through this thesis work tothe phylogenetic sampling literature. First an example is given to showhow phylogenetic inference proceeds with the IS method introduced in Sec-tion 2.3.1. This example motivates the introduction of a specialized versionof the DCSMC algorithm which follows. Throughout, specialized DCSMCdata structures and pseudocode are introduced to aid exposition and facili-tate easy reproduction. Next, several theoretical results for general DCSMCalgorithms are described. The chapter wraps up with a discussion of how theDCSMC algorithm can be easily used for relaxations of classical biologicalassumptions and a notes a few extensions.3.1 Phylogenetic Inference with ISTo frame Bayesian phylogenetic inference in terms of particle filters we firstshow how posterior inference of γ is completed under simple IS. The variableof interest is the list of branch lengths b. These branch lengths can bedescribed through a list of height increments d and an ordering on the18internal nodes as in Figure 3.1 (Left). An obvious proposal distribution q isa product of independent exponential distributions as in the Yule prior (seeFigure 3.3 and Equation 2.13) and a permutation over the internal nodes.Figure 3.1 (Bottom-Right) shows an example realization from this proposal.The importance weights are assigned to each particle according to:wi = γ¯(bi)q(bi) , (3.1)as in Figure 3.1 (Top-Right).This procedure can be computationally inefficient in scenarios whereinference is performed on large trees. Since scalable inference motivates thiswork the simple IS method is unacceptable. However, instead of proposingjointly over all height increments and orderings, a model decomposition canbe employed to increase the statistical efficiency. In contrast to SIS andSMC methods, this decomposition is not sequential in nature, but rathermakes use of the topology of the phylogenetic tree. A general overview ofthe DCSMC methodology follows.…w1w2w3wNw4N particlesd3d2d1Figure 3.1: Importance Sampling in Phylogenetics: (Left) Phyloge-netic tree and height increments d. (Bottom-Right) N-particlepopulation of height increments. (Top-Right) IS particle weightsfor the N-particle population of height increments.193.2 Phylogenetic Inference with DCSMCIn contrast to the sequential decompositions often found in hidden Markovmodels, phylogenetic models are endowed with a tree topology that allowsfor model decompositions based on the tree structure. The DCSMC algo-rithm, described in Lindsten et al. [26], is a novel variation of the typicalparticle filtering methodology designed for models endowed with a tree de-composition. Complementary to the sequential decompositions already inuse by SIS and SMC, DCSMC uses a(n) (auxiliary) tree structure to re-cursively decompose the model into computationally easy submodels. InDCSMC each subproblem is endowed with its own population of weightedparticles which are merged at intermediate steps.In particular, DCSMC recurses through C(r) to the leaves, proposes ac-cording to intermediate distributions γv for each v ∈ V defined on Bv =(⊗c∈C(v)Bc)× B˜v, and propagates forward to v’s parents P(v) by a resam-pling step. The incremental set B˜v can be chosen arbitrarily, and is equalto R+ for the molecular clock case. The only restriction on the set of inter-mediate distributions is that γr(·) = γ(·); however, in practice, the choiceof intermediate distributions can greatly affect the quality of the resultingapproximations. A full description of DCSMC follows in Algorithm 2.In the phylogenetic setup, the target distribution γ is the posterior dis-tribution of branch lengths γ(b|Y(X)). DCSMC methods approximate thisposterior with a weighted collection of branch lengths and an estimate ofthe marginal likelihood((biv, wiv)Ni=1, ZˆNv). The phylogenetic notation intro-duced earlier is overloaded so that, bv denotes the restriction of the branchlength function to edges that are connected to the subtree with pseudo-rootv and the superscript i indexes each particle.To develop the specialization of the DCSMC method to phylogenetics,we consider an object oriented representation of particle data structures andthe main algorithmic components of DCSMC. These are described throughexplicit data structures and pseudocode. In the object oriented representa-tions both object properties and methods are presented for each class.20Algorithm 2 dc smc(v): // by convention ∏c∈∅(·) = 1for c ∈ C(v) do((bic, wic)Ni=1, ZˆNc )← dc smc(c)Resample (bic, wic)Ni=1 to obtain equally weighted particle system(bˇic, 1)Ni=1end forfor particle i = 1, 2, . . . , N doSimulate b˜iv ∼ qt(·|bˇic1 , bˇic2) from proposal kernel defined on R+Set biv = (bˇic1 , bˇic2 , b˜iv)Compute wiv = γ¯v(biv)∏c∈C(v) γ¯c(bˇic)1qv(b˜iv |bˇic1,bˇic2)end forCompute ZˆNt =[1N∑Ni=1wiv]∏c∈C(t) ZˆNcreturn ((biv, wiv)Ni=1, ZˆNv )First we formalize a computational strategy to calculate the data likeli-hood, introduced in Equation 2.11. To this end, we revisit Equation 2.11 todiscuss its implementation through an efficient message passing framework.The key observation by Felsenstein [11, 12] in efficiently calculating the like-lihood of the data is that if Equation 2.11 is calculated na¨ıvely the (full) sumcontains 4M−1 terms, whereas if the summation signs are moved inwards thenumber of computations is reduced. For example, in Equation 2.12, the to-tal number of terms is 64 (= 43). However, by “moving the summationsinwards” one can obtain the computationally efficient peeling-recursion ofFelsenstein [12].The right panel of Figure 2.3 depicts the message passing view of thepeeling algorithm. Leaf data is viewed as a row vector denoting the observedcharacter. The messages from descendant nodes propagate to the parentnode via the conditional likelihood described above. The parent messagefrom each character is represented as a row vector containing the conditionallikelihood for each character. The total likelihood is given as the dot productbetween the stationary distribution and the message at the root. This isequivalent to the peeling recursion.21The data structure of phylogenetic particles is visited next. In gen-eral, these particles need to be able to determine their height, calculatethe evolutionary distance between descendant particles, create new particlesbased on the merging of descendant particles, and compute the unnormal-ized posterior density of itself. These requirements are summarized in thePhyloParticle<P> interface given in Listing 3.1.This interface design allows for efficient code reuse for the two differentenvisaged models of evolutionary change. Each model, the molecular clockand relaxed clock, is associated with a particle class that implements thePhyloParticle<P> interface. The implementation for the molecular clockClockParticle is given in Listing 3.2.public interface PhyloParticle<P>{public P create(P particle1, P particle2, HeightScaling heightScaling);public double getHeight();public double getEvolutionaryDistance(P particle1, P particle2);public double getLogPosterior();}Listing 3.1: Particle InterfaceIn addition to the methods required by the PhyloParticle interface,clock particles require four fields to complete their specification. Listing 3.2develops this implementation of the PhyloParticle interface. Specifically,a list of children provides a set of pointers to descendant particles, a list ofsorted heights serves as node speciation times, propagated messages fromdescendant particles are represented as a matrix where each row correspondsto a character (within the character sequence) and columns correspond tothe propagated message. The log posterior is also stored for each particle.Intermediate steps within the DCSMC algorithm call for new particlecreation based on existing particles. Within the ClockParticle setup this isparticularly straightforward. The new particle’s children are the two mergedparticles, speciation times are a union over the speciation times within eachparticle to be merged and the speciation time of the merged particle, andthe new root message is calculated by propagating the messages of each22descendant particle. The message is calculated via the evolutionary distancebetween each particle and their parent according to the CTMC St.public class ClockParticle implements PhyloParticle<ClockParticle>{// Propertiespublic final List<ClockParticle> children;public final List<Double> sortedHeights;public final double[][] rootMessage;public final double logPosterior;// Methodspublic double getHeight(){// Return first element of sortedHeights}public ClockParticlecreate(ClockParticle particle1, ClockParticle particle2,HeightScaling heightScaling){/*** children = (particle1, particle2)* sortedHeights = sort(particle1.sortedHeights ∪* particle2.sortedHeights ∪ heightScaling.height)* rootMessage = propogateMessage(particle1, particle2)*/}public doublegetEvolutionaryDistance(ClockParticle parent, ClockParticle descendant){// Return parent.getHeight() - descendant.getHeight()}public double getLogPosterior(){// Return logPosterior}}Listing 3.2: Clock ParticleWith the particle and the underlying tree structure defined as objects,we now formulate the DCSMC algorithm in Listing 3.3. Listing 3.3’s fieldsare a proposal distribution (to be defined precisely shortly), and a randomseed that governs all randomness in the algorithm. The principal method ofthis class, dcSMC, generates a population of particles ParticlePopulationbased on the DCSMC algorithm. Given a fixed tree topology, the algorithmrecurses to the leaves of the tree through the tree’s list of children. At23the leaves, independent ParticlePopulations are created, resampled, andthen propagated forward through the merge function. Merging involvescreating new particles based on the previous two populations of particlesand a proposal for the height of the merged particle. The weight of eachparticle in log-scale is the difference between the log posterior of the particleand the sum of the children’s log posterior and the log proposal density.public class PhyloDCSMC<P extends PhyloParticle<P>, T>{// Propertiesprivate final Proposal<P> proposal;private final Random random;// Methodspublic ParticlePopulation<P> dcSMC(Tree<T> tree){if (tree.isLeaf())return new ParticlePopulation<P>();// Recurse treeParticlePopulation<P> leftParticles = dcSMC(tree.getChildren.get(0))leftParticles.resample(random); // Multinomial resamplingParticlePopulation<P> rightParticles = dcSMC(tree.getChildren.get(1))rightParticles.resample(random); // Multinomial resamplingreturn merge(leftParticles, rightParticles);}public ParticlePopulation<P> {merge(ParticlePopulation leftParticles, ParticlePopulation rightParticles){P[] newParticles = new P[N]; // Array of new particlesdouble[] logWeights = new double[N]; // Array of particle weights/*** New particles are formed from leftParticles and rightParticles* Proposed heights given by Proposal. For each i = 1,...,N*/newParticles[i] = proposal(random, leftParticles[i], rightParticles[i]);logweights[i] = calculateLogWeight(newParticles[i])return new ParticlePopulation<P>(newParticles, logWeights);}private calculateLogWeight(P particle){// Calculate sum of particle’s children logPosterior (kidsLogPosterior)// Calculate logDensity of proposed increment by Proposal (proposalLogDensity)return particle.logPosterior - kidsLogPosterior - proposalLogDensity;}}Listing 3.3: DCSMC Pseudocode243.2.1 Intermediate DistributionsTo complete the DCSMC specialization to phylogenetics the intermediatedistributions γv and the proposals qv must be defined for all v ∈ V . Al-though phylogenetic models considered here admit natural decompositionsthrough tree structures, a challenge arises in specifying high quality interme-diate target γv and corresponding proposal qv distributions. Both interme-diate targets and proposals are easy to specify since DCSMC only requiresthat the root “intermediate” distribution is equal to the target distributionγr(·) = γ(·). Nevertheless, since the particle population’s approximation ofthe posterior is often determined by the intermediate and proposal distribu-tions judicious choices must be made. These choices are difficult as a prioriknowledge of the posterior shape is generally unknown.For the most part, in SMC algorithms the accuracy of the posterior ap-proximation is driven by the correspondence between target and proposaldistributions. In Bayesian statistics the prior distribution is meant to en-capsulate our beliefs of the parameter of interest a priori and it thus is anatural candidate for the proposal. While this choice is convenient it maylead to a poor approximation if the posterior does not resemble the prior.We examine two different intermediate and proposal distributions in thefollowing discussion.Since the prior over the full tree is Yule based we attempt to extendthis birth process for subtrees. The main challenge in extending the Yuleprocess for subtrees arises due to independent particle populations inherentto DCSMC. In particular, independence implies that there is no informationsharing across speciation events. Therefore, the ordering of speciation eventsof a subtree is unknown with respect to the merged tree of v1 and v2.An example of a simple strategy to assign Yule-like priors to subtrees istaken up in Figure 3.2. This na¨ıve strategy ignores the prior tree structureknowledge (in addition to its ignorance of speciation times from other pop-ulations), and considers each subtree in isolation. Precisely, each subtree isassumed to be its own tree and the usual Yule prior is employed. The simplenature of this prior distribution is appealing; however, it may be desirable25to explicitly incorporate a priori knowledge of the tree structure since thisinformation could decrease the discrepancy between weight updates. Thisis described in detail in the following paragraphs.dR1~Exp(3)dR2~Exp(2)dL1~Exp(3)dL2~Exp(2)Left subtree Right subtreeFigure 3.2: Na¨ıve Yule Prior: Each subtree is viewed in isolation tothe whole tree structure.The following intermediate specification improves upon the simple strat-egy by incorporating knowledge of the tree structure through the number oftaxa M . Let Xv ⊂ X be the set of leaves descendant from species v ∈ V ,and piv(bv) be a modified form of the Yule prior chosen so that (minimally)pir(br) = pi(b). In particular, let the intermediate prior have density,piv(bv) =|Xv|−1∏i=1(M − |Xv|+ i+ 1) exp ((M − |Xv|+ i+ 1)gi(bv)) (3.2)=|Xv|−1∏i=1(M − |Xv|+ i+ 1) exp ((M − |Xv|+ i+ 1)di) , (3.3)where gi(bv) = di. The corresponding intermediate proposal qv(·) has theexponential density with rate parameter M − |Xv|+ 2. This prior is shownin panel II of Figure 3.3.Comparing Figure 3.2 and panel II of Figure 3.3 to the Yule prior overthe full tree in panel I of Figure 3.3 illustrates potential disagreement in thedistribution of the joint Yule process prior for height increments and thedistribution on increments within each subtree. The flexibility of DCSMC26allows for disagreement between the intermediate priors and proposal distri-butions. These differences are incorporated (along with the likelihood) in theparticle system weights and are subsequently corrected in merged samples;however, the quality of the posterior distribution via weighted particles de-pends upon the intermediate priors and proposals. Since the modified Yuleprior decreases the discrepancy versus the na¨ıve prior the modified setup isexpected to provide a higher quality approximation to the posterior distri-bution, especially if the prior dominates the likelihood in the posterior.d1~Exp(6)d2~Exp(5)d3~Exp(4)d4~Exp(3)d5~Exp(2)IdR1~Exp(6)dR2~Exp(5)dL1~Exp(6)dL2~Exp(5)dR3~Exp(2)IId6~Exp(1) dR4~Exp(1)Left subtree Right subtreeFigure 3.3: Yule Prior: (I) Branch length prior induced by the Yuleprocess for cladogenesis over speciation times (represented asheight increments d). The time between speciation events isexponentially distributed with rate parameter proportional tothe total number of speciation events between the node of inter-est and the root. (II) An adapted branch length prior based onthe Yule process for DCSMC. In DCSMC setups intermediatedistributions are independent and thus each subtree lacks fullknowledge of other subtrees. A consequence of independence isa disagreement between the Yule prior on the full tree versussubtrees. One solution is illustrated in (II) where the prior overbranch lengths is exponentially distributed with rate propor-tional to the number of total species less the total number ofleaves connected to the root of the subtree.27The intermediate target density for subpopulation v ∈ V is defined asγv ∝ lv(Y(Xv)|bv)piv(b). This yields intermediate weights:wiv = lv(Y(Xv)|biv)lc1 (Y(Xc1 )|bic1)lc2 (Y(Xc2 )|bic2)1q(bi|bic1 ,bic2), (3.4)where {c1, c2} = C(v). Since the root target distribution γr ≡ γ this choiceof intermediate distributions satisfies the requirements of DCSMC.The proposal distribution is implemented as a class containing the num-ber of taxa (as the prior tree knowledge), and a propose method that re-turns a new particle whose height increment is exponentially distributed.Pseduocode is presented in Listing 3.4. Future work is aimed at examiningthe posterior approximation with different proposals.public class Proposal<P extends PhyloParticle<P>>{// Propertiesprivate final int M;// Methodspublic P propose(Random random, P particle1, P particle2){/*** Determine number of leaves descendant from particle1 and particle2 |Xv|* Propose increment ∼ Exponential(rate = M − |Xv|+ 2)* Height of subtree is max(particle1.height, particle2.height) + increment* Return a new P with this height*/}}Listing 3.4: Yule-like ProposalsIn summary, a schematic showing the differences between the IS al-gorithm and DCSMC algorithm applied to phylogenetics is shown in Fig-ure 3.4. As opposed to IS, this algorithm can be roughly divided into fourdifferent steps for each recursion level: (I) Independent particle populationsare proposed and weighted for each (of the two) subtrees; (II) Multinomialresampling is carried out for each particle population to create two newequally weighted particle populations; (III) These particle populations aremerged based on the multinomial induced permutation on particle indices;28(IV) New height increments are proposed and weighted (if at the root thenthe algorithm terminates; otherwise the algorithm continues to step II). Thegains of the divide and conquer approach are from a sparse representationof the parameter space. Specifically, the binary speciation events used torecursively decompose the model yield two independent populations of Nparticles. Each population covers N2 points, but only requires O(2N) stor-age.1Intuitively, the efficiency of SMC methods over MCMC methods can bedivided into statistical and computational reasons. From a statistical view,the proposal distribution used in SMC methods are often designed to createa set of particles covering a large part of the sample space. In contrast,MCMC methods rely on a kernel distribution to sequentially explore thetarget space through accept-reject steps. Serial exploration of the samplespace via MCMC methods are prone to finding local optima, especially if ex-ploration requires visiting areas of lower probability before reaching areas ofhigher probability (since these paths are often rejected). These explorationand coverage issues are inherent to MCMC, but are not necessarily presentin SMC algorithms. From a computational perspective, the serial natureof MCMC prohibits easy parallelization of these algorithms across multi-ple computer cores or servers. On the other hand, the tree decompositionof DCSMC into independent particle populations is natural to parallelize.These two reasons provide justification for our belief that DCSMC couldshow efficiency gains over standard MCMC techniques for large trees andmore complex models of evolution.1As noted in Lindsten et al. [26], the most basic DCSMC procedure is essentially asequential importance resampling method. Such methods may perform (significantly)better if in each intermediate step the intermediate target distribution is slowly reachedthrough annealing in the intermediate likelihood.29… …… ……………I. Propose II. ResampleIII. Merge IV. ProposeFigure 3.4: Phylogenetic DCSMC: (I. Propose) For each child, {c1, c2} ∈ C(r), propose independent particlepopulations according to proposals qc1 , qc2 and weight according to DCSMC intermediate weights.(II. Resample) Conduct multinomial resampling independently according to the particle weights toeach particle populations. (III. Merge) Merge the particle populations according to the inducedpermutation from resampling. (IV. Propose) Propose an additional height increment and weightaccording to γ¯r.303.2.2 Theoretical ImplicationsThe DCSMC algorithm has two theoretically important guarantees: unbi-asedness of the marginal likelihood estimator and consistency of the particlesystem. These results are captured in the following two propositions fromLindsten et al. [26].Proposition 1 (Proposition 1 [26]). Assume that the auxiliary tree used fordecomposition is a balanced binary tree, that appropriate absolute continuityrequirements are met, and that multinomial sampling is carried out for everypopulation during every iteration. Then E(ZˆNr)= Zr for any N ≥ 1.Proposition 2 (Proposition 2 [26]). Under regularity conditions (see Lind-sten et al. [26, Appendix A.2]), the weighted particle system (xN,ir , wN,ir )Ni=1generated by dc smc(r) is consistent in that for all non-negative, bounded,measurable functions f : X → R:N∑i=1wN,it∑Nj=1wN,jtf(xN,ir) P−→∫f(x)γ(x) dx, as N →∞. (3.5)These results exceed our desiderata from Section 2.3 which only re-quested consistency for well behaved test functions f . The additional theo-retical result for the unbiasedness of the DCSMC estimate of the probabilityof the data Zˆ allows for two downstream analyses. First, Proposition 1 pro-vides a method to compare different models based on the Bayes factor [20],that is the ratio of the marginal likelihood of the two models under consid-eration:Bayes factor = ZˆModel 1ZˆModel 2. (3.6)Values of this ratio that are greater than one favour Model 1, whereas valuesthat are less than one favour Model 2. Second, as mentioned in Lindstenet al. [26], the unbiased estimate can be used in particle Markov chain MonteCarlo (PMCMC)[2]. In a PMCMC setup DCSMC is embedded inside of aMCMC chain, so that inference can be performed on additional quantities31of interest. For example, inference could be performed on branch lengths,rate matrix parameters, and the topology via the full posterior γ(τ , b, θ)displayed in Equation 2.9.In addition to applications provided for Proposition 1, Proposition 2 canbe used in debugging situations. Two immediate possibilities exist for utiliz-ing this theoretical result in debugging: (i) comparing functions of samples(to real-values) from DCSMC and MCMC through frequentist hypothesistesting; (ii) extending the MCMC correctness tests of Geweke [16]. Possi-bility (i) is presented in Chapter 4.3.3 Relaxing the Molecular Clock with DCSMCAs discussed in Section 2.1.2, the CPP is a parametric model to accountfor rate inhomogeneity across lineages. The following generative processcan be used to describe a CPP on phylogenetic trees. Simulate locations ofsubstitution rate change according to a PP with rate λ over the tree topologyand branch lengths. The rate of substitution change m just before an eventof rate change is scaled by a gamma random variable r ∼ Gamma(α, β):m′ = r ·m, (3.7)where α is the shape parameter, and β is the scale parameter.2To formalize this process, consider the notation of Huelsenbeck et al.[18]. Let the collection of events from a CPP be denoted e = (ξ, z, r), whereξ is the number of events, z is a vector of event locations, and r is a vector ofscalings. If ξ = 0, then z and r are empty, otherwise if ξ > 0, then z ∈ [0, T ]ξwhere T is the total tree length, and r ∈ (R+)ξ. Furthermore, the bijectivemap from [0, T ]→ (τ, b) allows each point in z to be identified with a pointon the tree. The prior on e given the tree topology, branch lengths, shape2For technical reasons, see Huelsenbeck et al. [18], β is set to exp(ψ(α)) where:ψ(α) =ddαlog Γ(α).32parameter α, and Poisson process rate λ is given as:f(e|τ , b, λ, α) =e−λT ξ = 0e−λT (λT )ξξ!(1T)ξ ξ∏i=1g(ri|α) ξ > 0, (3.8)where g(·|α) is the gamma density with shape parameter α and scale pa-rameter β set as above. It has been shown in Rannala [33] that the rateof change and rate multiplier parameters are non-identifiable; nonetheless,inference is possible for the mean substitution rate of a lineage.This CPP construction achieves differing evolutionary rates across lin-eages and its functional form makes it an excellent candidate for DCSMCalgorithms. In particular, the proposals shown in Figure 3.4 are easy toadjust for the CPP. Figure 3.5 demonstrates that the earlier proposals canbe adjusted to include a PP for locations of rate substitution change overthe single proposed branch and gamma random variables to scale the rate.The joint prior is calculated as:pi(b, e) = pi(e|b)pi(b), (3.9)and intermediate distributions can be constructed in a similar way as in themolecular clock example.Listing 3.5 illustrates the implementations of PhyloParticle for relaxedclock models. The main difference between molecular clock particles andrelaxed clock particles is the extra field to store the scaling factors, anda new implementation of the evolutionary distance between parents anddescendants. This extra information is sufficient to describe the relaxedclock model and allows for evolutionary distances to depend directly on thescaled branches and not the original height differences between parent anddescendent. Also note that since the CPP employs a homogeneous PP, thelocations of the events of substitution rate change need not be recorded. Theprior density only depends on the number of events and their correspondingvalues.33…N particles…N particlesFigure 3.5: DCSMC for the Compound Poisson Process: Changes tothe simple phylogenetic DCSMC algorithm to accommodate theCPP are limited to adding a Poisson point process with gammarandom variable scalings to the proposal. Purple stars indicaterealizations of the PP over the proposed branch lengths for eachparticle population.34public class RelaxedParticle implements PhyloParticle<RelaxedParticle>{// Propertiespublic final ClockParticle baseParticle;public final Map<List<Double>,List<Double>> scalingFactors;// Methodspublic double getHeight(){/*** Return first element of baseParticle’s sortedHeights*/}public RelaxedParticlecreate(RelaxedParticle particle1, RelaxedParticle particle2,HeightScaling heightScaling){/*** Create a new baseParticle via baseParticle’s create method* scalingFactors = particle1.scalingFactors ∪* particle2.scalingFactors ∪ heightScaling.scalings*/}public doublegetEvolutionaryDistance(RelaxedParticle parent, RelaxedParticle descendant){/*** Determine path between parent & descendant (via baseParticle children) p* Return∑p∑i scalingFactors.get(p).get(i)*/}}Listing 3.5: Relaxed Clock Particle353.4 ExtensionsAs discussed in Section 2.2, the CPP is one of many techniques used torelax the molecular clock. The DCSMC framework can also be utilized inapproaches where the rate is modelled as a Brownian motion. In particular,DCSMC only requires efficient simulation from a proposal distribution andpoint-wise evaluation of the prior density over perturbed branch lengths.For example, in the CPP, the piece-wise constant rate can be viewed asa special case of a Brownian motion. Future work will explicitly integratethese techniques into the DCSMC framework.We also take this opportunity to build on the earlier efficiency discussionof this chapter by introducing a specific example for relaxed clock models.Consider the height inference problem for relaxed clock models when thetree topology is unknown. Inference proceeds under MCMC methods bysampling the full state space (τ , b, relaxed clock parameters) via a collectionof MCMC kernels. Conversely, in a PMCMC setup DCSMC is embeddedinto a MCMC chain where the MCMC chain samples only the topology τand the branch lengths and relaxed clock parameters are inferred inside ofDCSMC routines.In MCMC implementations topological moves may require proposingseveral relaxed clock parameters for each branch, whereas the PMCMC im-plementation provides the flexibility to initialize a particle population fromscratch for each newly proposed topology. Figure 3.6 demonstrates a MCMCmove on tree topologies for relaxed clock models. Branch colours representdifferent values of the evolutionary rate. As illustrated in this figure, relaxedclock models may require more complicated MCMC proposals to accommo-date the Brownian motion governing the evolutionary rate.36a b c ab cq(⌧ 0|⌧)Figure 3.6: MCMC Topology Move for Relaxed Clock Models: (Left)Current state space of MCMC sampler. (Right) Possible pro-posal from a kernel restricted to topological moves. Branchcolours represent a stochastic process for the rate.37Chapter 4ExperimentsThis section covers several experiments that (i) show the correctness of theDCSMC implementation, and (ii) demonstrate areas where we predict DC-SMC will outperform traditional approaches. Implementation correctnessis assessed via several unit tests of the peeling recursions and numericalmarginal likelihood calculations. The validity of the DCSMC method inphylogenetics is compared versus traditional MCMC methods. These com-parisons are focused on two main metrics: marginal likelihood estimation,and estimated tree height hˆ. For each metric, we aim to show compara-ble performance; future work is planned to show comparable performancewith significantly reduced computational budgets on large trees. The base-line to which all marginal likelihood comparisons are made against is thephylogenetic-specialized MCMC software called Mr. Bayes [19].Correctness Tests: The DCSMC implementation can be tested with avariety of different unit tests. Two simple tests are described below. Thefirst test is to ensure the correctness of the peeling-recursions. Since thelikelihood function has the property that,∑Y(X)l(Y(X)|b) = 1, (4.1)the peeling-recursion, which calculates the likelihood, can be explicitly checkedby enumerating the support of Y(X). Figure 4.1 depicts a full enumeration38of leaf observations for one character DNA sequences in a two taxa tree.Since there are two leaves and four possible characters this space has 16(= 42) elements whose likelihoods must sum to one.A A A C A G A TC A C C C G C TG A G C G G G TT A T C T G T TFigure 4.1: Correctness Unit Test: Enumerate all possible data se-quences, determine the likelihood of each, and ensure that thesum of likelihoods is unity.The second analytical correctness test is to compare the Z estimatefrom DCSMC to numerical methods. Although intractable for any tree ofpractical interest, for small enough trees and bounded support, the integral:Z =∫l(Y(X)|b)pi( db),39can be numerically calculated via multivariate trapezoidal integration. Forinstance, suppose M = 3 and the prior over the branch lengths (b1, b2) isUnif(0, 1)×Unif(0, 1). Then the integral is simply:Z =∫ 10∫ 10l(Y(X)|b1, b2) db1 db2,which can be calculated via evaluating l(Y(X)|b1, b2) at points on the unitsquare and employing trapezoidal integration iteratively.The implementation of the DCSMC algorithm for phylogenetics passesboth peeling recursion and marginal likelihood tests.Marginal Likelihood: The marginal likelihood estimate is returned di-rectly from the DCSMC algorithm; however, its estimation from other MonteCarlo frameworks like MCMC is not straightforward. Common methods ofestimating this quantity from MCMC use different importance sampling dis-tributions to create estimators of Zˆ. One such estimator is the harmonicmean introduced by Newton and Raftery [29] via a particular choice of im-portance distribution. While Mr. Bayes [19] reports the harmonic mean bydefault, it has been shown to be inefficient since it often has infinite variance.In light of this limitation, other statistically efficient methods adaptedfrom bridge sampling and path sampling of Gelman and Meng [15] have beenintroduced into Bayesian phylogenetics. A computationally intensive alter-native, thermodynamic integration, was developed first in Friel and Pettitt[14], Lartillot and Philippe [23]. Subsequently, a more computationally fea-sible method to estimate the marginal likelihood, called the stepping stonemethod, was developed by Xie et al. [42]. The experiments presented inthis paper compare the marginal likelihood estimate of DCSMC versus theestimate provided by a stepping stone method implemented in Mr. Bayes[19].Height estimates hˆ: Within a Bayesian phylogenetic setup, parameter es-timation or estimation of functions of parameters can be cast as minimizingthe posterior expected loss E (L(·, δ)|Y(X)) given the observed data andwith respect to the posterior. The loss L(·, δ) encapsulates the error asso-ciated with evaluating · by δ. The loss function provides a principled way40to summarize trees and to estimate the tree height. Under the quadraticloss function, L(v, δ) = ||v − δ|| for v, δ ∈ Rp, the Bayes estimator can beshown to be the posterior mean, see Robert [35, Chapter 2.5]. As a result,trees are summarized by the posterior mean of b; the height of the tree h issimilarly summarized as the posterior mean of h(r).The consistency result of Proposition 2 provides a method to check thecorrectness of the algorithm. For instance, the weighted average of the parti-cle heights from a DCSMC population tend in probability to the Bayes esti-mator of tree height for the posterior distribution as the number of particlesapproach infinity. A similar consistency result holds for samples generatedfrom MCMC.There are thus two different correctness tests to evaluate an implemen-tation of the phylogenetic DCSMC. In a simulated setting, where the trueheight is known, the DCSMC and MCMC estimates can be directly com-pared to the true known height (which is not equal to the synthetic held-outheight in general). Additionally, given a correct MCMC implementationsuch as Mr. Bayes, the DCSMC estimates can be compared to those ofMCMC using frequentist hypothesis testing techniques.In all instances that follow, Mr. Bayes is executed with one run (withouttempering) with a varying number of MCMC iterations. The model is spec-ified to match the description in this thesis (Jukes-Cantor rate matrix withYule process prior)1. The tree topology is fixed for all MCMC iterations toa user-supplied tree that corresponds to the fixed tree used in DCSMC.The remainder of this section covers synthetic experiments that validatethe DCSMC implementation, and a real world example on green plant DNAsequences. It concludes with a note on future directions for this work.1There is a technical difference between the Yule prior described herein and the im-plementation in Mr. Bayes. A recompiled version of Mr. Bayes can be used to correctfor this discrepancy; however, for simplicity of exposition of the parameter settings this isnot discussed outside of this footnote.41#nexusbegin mrbayes;set autoclose = yes nowarn = yes;set seed = 1;set swapseed = 1;execute tree.nexus;lset nst = 1;prset brlenspr = clock:birthdeath;prset speciationpr = fixed(1);prset extinctionpr = fixed(0);prset statefreqpr = fixed(equal);prset topologypr = fixed(true);mcmcp nruns = 1;mcmcp nchains = 1;// Height reconstructions:mcmc ngen = 1000000;// Log(Z) estimate:ss ngen = 1000000 Alpha = 0.3 NSteps = 10 Burninss = -1;quit;end;Listing 4.1: Mr. Bayes Parameter Specification4.1 Synthetic ResultsThe first synthetic result validates the implementation of DCSMC by com-paring the log marginal estimate log(ZˆDCSMC) to the log marginal estimatelog(ZˆMCMC) from Mr. Bayes, and the estimated height from DCSMC ver-sus the true known tree. The true synthetic tree is used to initialize Mr.Bayes’ MCMC chain for the log marginal estimate. This helps ensure thatthe Markov chain explores the posterior distribution near the true value.The setup consisted of simulating in silico three rooted ultrametric treescontaining 3, 5, 10 species with 100 character DNA sequences. The in silicotrees are produced via forward sampling: (1) simulate the tree topologyand speciation times via the Yule process prior; (2) simulate the stochasticprocess over DNA characters from the root to the leaves for each site. Atthe root, DNA characters are drawn from the stationary distribution pi ofQ .42The marginal log likelihood for each tree is estimated with any increasingnumber of particles from 2 to 10,000 particles via DCSMC and 1,000,000MCMC iterations. This process is repeated for 100 Monte Carlo replicates,and performance is measured by aggregating across replicates. In total,this represents 3 taxa × 6 particle configurations × 100 replicates = 1, 800experiments.Figure 4.2 depicts the results of this validation experiment. In panel (a)the percent difference between DCSMC and Mr. Bayes is plotted for each ofthe three different simulated trees and for an increasing number of particles.Panel (b) tells a similar story for the height estimate provided by DCSMCversus the true known tree. (In Section 4.3 height reconstructions from bothMr. Bayes and DCSMC are compared to the true synthetic tree.)In Figure 4.2 the differences between the log(Zˆ) estimate from DCSMCand Mr. Bayes, and the DCSMC reconstructed height versus the known truetree height, increase as the number of leaves increase. One explanation ofthis experimental behaviour is that the unbiasedness and consistency of theDCSMC algorithm are only asymptotic guarantees, yet these experimentsuse a finite number of particles. Below, we attempt to develop a moresatisfying resolution to the performance of the algorithm under finite particlepopulations.We endeavour to characterize the discrepancy between the two log evi-dence estimates. Since the MCMC runs of Mr. Bayes are initialized to thetrue synthetic tree we believe that Mr. Bayes is not stuck in a local max-imum and that the value provided for log(ZˆMCMC) represents Mr. Bayesbest estimate. To supplement Mr. Bayes’ estimate another log(ZˆMCMC)estimate is planned via the phylogenetic software package Beast [10]. SinceDCSMC cannot be initialized to the true tree (it explores the sample spaceby generating samples from a proposal as opposed to the random MCMCwalks) this additional comparison could provide insight into the discrepancybetween the original estimates.A complementary explanation is that the proposals used in DCSMC arenot “good enough”. As discussed earlier, the quality of the approximationto the posterior distribution is dependent on the proposal distribution. To43assess and correct proposal quality the effective sample size (ESS) at eachnode can be evaluated:ESSv =1∑Ni=1(wiv)2. (4.2)The ESS is a measure of the particle degeneracy within a population ofparticles. If all particles are equally-weighted the ESS is maximized as N ; ifall particles except one have weight 0 the ESS is minimized to 1. Larger ESSvalues imply less particle degeneracy and higher quality approximations.Low ESS values are symptoms of poor proposal distributions and can oftenbe alleviated by slowly annealing the intermediate likelihoods. Future workis aimed at adding an annealing step to address these potential issues. SeeLindsten et al. [26] for a discussion of annealing in DCSMC methods.443 5 100510152 10 50 100 100010000 2 10 50 100 100010000 2 10 50 100 100010000Number of Particleslog(Z) error (%) [(D&C − MB) / MB * 100]3 5 1001002002 10 50 100 100010000 2 10 50 100 100010000 2 10 50 100 100010000Number of ParticlesHeight error (%) [(D&C − True) / True * 100](a) (b)Figure 4.2: Synthetic Validation Comparisons: Synthetic Trees Validation for 100 Monte Carlo replicatesof DCSMC over different number of taxa and differing N particles. (a) log(Zˆ) comparisons versusMr.Bayes. (b) hˆ comparisons versus the true synthetic tree.454.2 Green Plant rbcLIn addition to the in silico experiments, the DCSMC methodology is alsoevaluated on a published real dataset. This dataset consists of 10 taxa fromthe green plant family. Specifically, DNA sequences are from the chloroplast-encoded large subunit of the RuBisCO gene (rbcL) for different green plants.This gene is involved in an enzyme used when plants convert carbon dioxideto glucose (i.e., photosynthesis). Understanding RuBisCO’s evolutionaryrelationships could be of interest in areas such as its role in global climatechange, see for example Sage et al. [36]. More information on this datasetcan be found in Xie et al. [42].Similarly to the synthetic experiments, the log(Zˆ) estimates of Mr.Bayesand DCSMC are compared for the RuBisCO dataset. However, in contrastto the synthetic datasets, the true tree topology is unknown for the Ru-BisCO data. Hence, we have assumed that the tree topology is fixed to theone presented in Figure 4.3. To check that both DCSMC and Mr. Bayesestimates of log(Zˆ) are similar 50 Monte Carlo replicates of DCSMC with1, 000 particles and 500, 000 iterations of Mr.Bayes are generated. EachMonte Carlo replicate differs due to the differing initial random seeds.Table 4.1 tabulates 95% confidence intervals for these Monte Carlo repli-cates. In addition to the fact that these confidence intervals overlap, fre-quentist tests, such as the t-Test, for a difference in means between DCSMCand Mr. Bayes are unable to reject the null hypothesis that the true dif-ference between the means is zero. In conclusion, the real dataset providesevidence that the DCSMC method is valid, in the sense that its estimatesof the marginal likelihood are not significantly different than Mr. Bayes’estimates (Mr. Bayes is viewed as the gold standard).460.02Iris_unguicularis_AJ309693Conocephalum_conicum_U87067Avena_sativa_L15300Asplenium_nidus_U05907Picea_pungens_AF456382Chara_connivens_L13476Bazzania_trilobata_L11056Sphagnum_palustre_L13485Osmunda_cinnamomea_D14882Nicotiana_tabacum_J01450Figure 4.3: Green Plant Phylogeny: 10-taxon green plant data setwith DNA sequences of chloroplast-encoded large subunit ofRuBisCO gene (rbcL) [42].Table 4.1: Green Plant Comparisons: 95% confidence intervals from50 Monte Carlo replicates of the DCSMC and MCMC algorithmsinitialized with different random seeds.DCSMC Mr. Bayes1, 000 particles 500, 000 iterationslog(Zˆ) (−7298,−7295) (−7297.6,−7297.1)474.3 Future WorkThere is a rich set of problems and application areas for this framework tobe applied to. In this section, we outline future experimental work on twomain areas. The first set of planned experiments is testing the scaleabilityof this framework on large trees (i.e., M > 1000). Scalability experimentsare expected to show that DCSMC requires smaller computational budgetsthan Mr. Bayes for posterior estimation.To illustrate the idea behind this experiment, we present a prototypeon a small synthetic dataset. This prototype compares height estimatesprovided by DCSMC, Mr. Bayes, and Mr. Bayes initialized from DCSMC,to the true height of the tree. Each of these three inference strategies aregiven a pre-specified computational budget; number of iterations or numberof particles. Runtime is used to standardize across different computationalbudget metrics. Ten Monte Carlo replicates are simulated for each budgetand strategy.This prototype is illustrated in Figure 4.4 for a small ten taxa tree. Theleft panel of this figure shows the synthetic tree with true height 1.588. Theright panel shows the reconstructed height, given as the Bayes estimatordescribed earlier, for each of the three different strategies. The true heightof the tree is listed as a black horizontal line. The estimates provided byall three methods are generally in agreement. The key domain to showcaseDCSMC performance is on large trees. In these cases, we expect DCSMC toconverge to the correct true tree height with smaller wall-times than MCMCbased methods.The second set of planned experiments is to evaluate the performanceof the DCSMC framework on the relaxed clock models. As discussed inSection 3.3, the molecular clock models are easy to extend to the CPPrelaxed clock models.48●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●● ●●●●●●● ●● ● ●●●●●●●● ●● ●● ●●●● ●●●● ●●●●●●●●● ●●●●●●●●●●●●●●●● ●●●●●●●●● ●●● ●●●● ●●●●●●●● ●●●●●●●●●● ●●● ●●●● ● ●●●●●●● ●●●●●●●● ●●●● ●●●●●●●●● ●●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●●●●● ●●●●●●●● ●●●●●●●● ●● ● ●●●●●● ●●●●● ●● ●● ●●●●●●● ●●●● ●●●● ●●●●● ●●●●●●● ●●●●●●●● ●●●●●●●● ●●●●●●●●●●●●●●●●●● ●●●●●●●●● ●●●●● ● ●●●●● ●●●●●●●●●●●●●● ●●●●●●● ● ●●●●● ●●●●●●●●●●●●●●●●●●● ●●●●●●●●● ●●●●●●●● ●● ●●●●● ●●● ●●●●●●●● ●●●●● ● ●●●●●●●●● ●● ●● ●●● ●● ● ● ● ●●●● ●●● ●● ● ●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●● ●●●● ●●●●●●●●●●● ●●●● ●● ●●●●●● ●●● ●●●●●●●●● ●●●●●●●● ●●●●●●●●●●●●●● ●●●● ●●●●●●●● ●●●●●●●●●●● ●● ●●●● ●●● ●● ●●●●● ●●●●●●●●● ●●●●●●●●●●●●●● ●●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●●●●●●● ●●● ●● ●●●● ●● ●●● ●●●●●●●● ●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●●●●●●● ●●● ● ●●●● ●●●●●●●●●●●●● ● ●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●● ●●●●● ●●●●●●●●● ●●●●●●●● ●● ●●●●● ●● ●●●● ●●●●●●● ●● ●●●●●●●●●●●●●●●●● ●●●●●●● ●●●●●●● ● ●●●●●●● ●●●●●●● ●●●● ●●●●●●● ●●● ● ●●● ●●●● ●●● ●● ●● ●●●●●●●●●●●●●●●●● ● ●●●●●●● ●●●●●●●●●●●● ●●●● ●●●●●●●●● ●●●● ●●●●●●●●●●●●●●●● ●●●● ●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●●● ●●●●●●●●●●●●●●●●●● ●●●●●●●●●●●●●● ●●●● ● ●●●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●● ●●●●●●●●●●●● ●●●●●●●●●●●●●●●● ●●●●●●●●●●●● ●●● ●●●●●●●●●●●●●● ●●●●● ● ●●●●●●●●●●●● ●●● ●●●●● ●●●●● ●●●● ●● ● ●●● ●●● ●●●● ●● ●● ●●● ●● ●● ●● ●●●●● ● ●● ●●●●●● ●●●●●●●●●●● ●●● ● ●●●●●● ●1.01.52.02.50 25000 50000 75000 100000Time (milliseconds)Reconstructed HeightInference Strategy●●●D&C SMCD&C SMC + MBMBh0.2leaf7leaf4 leaf1leaf2leaf8leaf6leaf5leaf3leaf0 leaf90.46251.5880.04610.3518 0.32690.75890.98660.41230.6026h = 1.588Figure 4.4: Synthetic Tree Runtime Comparisons: Reconstructed hˆfrom DCSMC, DCSMC+Mr. Bayes, and Mr. Bayes versuscomputational budget. True synthetic tree height denoted asblack horizontal line as h.49Chapter 5ConclusionThis thesis introduced a specialized DCSMC method for Bayesian phyloge-netic inference. In contrast to traditional particle filtering methods, DCSMCuses a tree based decomposition of the probabilistic model. Tree based de-compositions can be applied naturally to models, as in phylogenetics, thatdirectly use a tree structure. In this thesis, two phylogenetic specializationsof the DCSMC algorithm are presented. The first method addresses thestandard molecular clock models, while the second specialization attends toa class of relaxed class models constructed through a CPP. Experimentalresults are presented to validate the correctness of the implementation onboth synthetic and real data. Directions for future work are suggested tofocus on demonstrating the computational performance of both large treesand relaxed models.50Bibliography[1] O¨. A˚kerborg, B. Sennblad, and J. Lagergren. Birth-Death Prior onPhylogeny and Speed Dating. BMC Evolutionary Biology, 8(1):77,2008. → page(s) 2[2] C. Andrieu, A. Doucet, and R. Holenstein. Particle Markov ChainMonte Carlo Methods. Journal of the Royal Statistical Society: SeriesB (Statistical Methodology), 72(3):269–342, 2010. → page(s) 31[3] S. Aris-Brosou and Z. Yang. Effects of Models of Rate Evolution onEstimation of Divergence Dates with Special Reference to theMetazoan 18S Ribosomal RNA Phylogeny. Systematic Biology, 51(5):703–714, 2002. → page(s) 8[4] A. Bouchard-Coˆte´, S. Sankararaman, and M. I. Jordan. PhylogeneticInference via Sequential Monte Carlo. Systematic Biology, 61:579–593,2012. → page(s) 1, 2, 17[5] M.-H. Chen, L. Kuo, and P. O. Lewis. Bayesian Phylogenetics:Methods, Algorithms, and Applications. CRC Press, 2014. → page(s)1, 11[6] P. Del Moral, A. Doucet, and A. Jasra. Sequential Monte CarloSamplers. Journal of the Royal Statistical Society: Series B(Statistical Methodology), 68(3):411–436, 2006. → page(s) 17[7] A. Doucet and A. M. Johansen. A Tutorial on Particle Filtering andSmoothing: Fifteen Years Later. Handbook of Nonlinear Filtering, 12:656–704, 2009. → page(s) 13, 14, 15, 16[8] A. J. Drummond, G. K. Nicholls, A. G. Rodrigo, and W. Solomon.Estimating Mutation Parameters, Population History and GenealogySimultaneously From Temporally Spaced Sequence Data. Genetics,161(3):1307–1320, 2002. → page(s) 151[9] A. J. Drummond, S. Y. Ho, M. J. Phillips, A. Rambaut, et al.Relaxed Phylogenetics and Dating with Confidence. PLoS Biology, 4(5):699, 2006. → page(s) 8[10] A. J. Drummond, M. A. Suchard, D. Xie, and A. Rambaut. Bayesianphylogenetics with BEAUti and the BEAST 1.7. Molecular Biologyand Evolution, 29(8):1969–1973, 2012. → page(s) 43[11] J. Felsenstein. Maximum Likelihood and Minimum-Steps Methods forEstimating Evolutionary Trees from Data on Discrete Characters.Systematic Biology, 22(3):240–249, 1973. → page(s) 4, 11, 21[12] J. Felsenstein. Evolutionary Trees from DNA sequences: A MaximumLikelihood Approach. Journal of molecular evolution, 17(6):368–376,1981. → page(s) 4, 11, 21[13] J. Felsenstein. Inferring Phylogenies. Palgrave Macmillan, 2004. ISBN9780878931774. → page(s) 1[14] N. Friel and A. N. Pettitt. Marginal Likelihood Estimation via PowerPosteriors. Journal of the Royal Statistical Society: Series B(Statistical Methodology), 70(3):589–607, 2008. → page(s) 40[15] A. Gelman and X.-L. Meng. Simulating Normalizing Constants: FromImportance Sampling to Bridge Sampling to Path Sampling.Statistical Science, pages 163–185, 1998. → page(s) 40[16] J. Geweke. Getting It Right: Joint Distribution Tests of PosteriorSimulators. Journal of the American Statistical Association, 99(467):799–804, 2004. → page(s) 32[17] J. H. Gillespie. The Causes of Molecular Evolution. Oxford UniversityPress, 1991. → page(s) 7[18] J. P. Huelsenbeck, B. Larget, and D. Swofford. A Compound PoissonProcess for Relaxing the Molecular Clock. Genetics, 154(4):1879–1892, 2000. → page(s) 8, 9, 11, 13, 32[19] J. P. Huelsenbeck, F. Ronquist, et al. MRBAYES: Bayesian Inferenceof Phylogenetic Trees. Bioinformatics, 17(8):754–755, 2001. →page(s) 1, 38, 40[20] R. E. Kass and A. E. Raftery. Bayes Factors. Journal of the AmericanStatistical Association, 90(430):773–795, 1995. → page(s) 3152[21] H. Kishino, J. L. Thorne, and W. J. Bruno. Performance of aDivergence Time Estimation Method Under a Probabilistic Model ofRate Evolution. Molecular Biology and Evolution, 18(3):352–361,2001. → page(s) 8[22] B. Larget and D. L. Simon. Markov Chain Monte Carlo Algorithmsfor the Bayesian Analysis of Phylogenetic Trees. Molecular Biologyand Evolution, 16:750–759, 1999. → page(s) 1[23] N. Lartillot and H. Philippe. Computing Bayes Factors usingThermodynamic Integration. Systematic Biology, 55(2):195–207, 2006.→ page(s) 40[24] T. Lepage, D. Bryant, H. Philippe, and N. Lartillot. A GeneralComparison of Relaxed Molecular Clock Models. Molecular Biologyand Evolution, 24(12):2669–2680, 2007. → page(s) 8[25] S. Li, D. K. Pearl, and H. Doss. Phylogenetic Tree Construction usingMarkov Chain Monte Carlo. Journal of the American StatisticalAssociation, 95(450):493–508, 2000. → page(s) 1[26] F. Lindsten, A. M. Johansen, C. A. Naesseth, B. Kirkpatrick, T. B.Scho¨n, J. Aston, and A. Bouchard-Coˆte´. Divide-and-Conquer withSequential Monte Carlo. ArXiv e-prints, June 2014. → page(s) 2, 20,29, 31, 44[27] B. Mau and M. A. Newton. Phylogenetic Inference for Binary Dataon Dendograms Using Markov chain Monte Carlo. Journal ofComputational and Graphical Statistics, 6(1):122–131, 1997. →page(s) 1[28] C. Moler and C. Van Loan. Nineteen Dubious Ways to Compute theExponential of a Matrix, Twenty-Five Years Later. SIAM Review, 45(1):3–49, 2003. → page(s) 6[29] M. A. Newton and A. E. Raftery. Approximate Bayesian Inferencewith the Weighted Likelihood Bootstrap. Journal of the RoyalStatistical Society. Series B (Methodological), pages 3–48, 1994. →page(s) 40[30] M. A. Newton, B. Mau, and B. Larget. Markov Chain Monte Carlofor the Bayesian Analysis of Evolutionary Trees from AlignedMolecular Sequences. Lecture Notes-Monograph Series, pages 143–162,1999. → page(s) 153[31] J. G. Propp and D. B. Wilson. Exact Sampling with Coupled MarkovChains and Applications to Statistical Mechanics. Random Structuresand Algorithms, 9(1-2):223–252, 1996. → page(s) 13[32] J. G. Propp and D. B. Wilson. How to Get a Perfectly RandomSample from a Generic Markov Chain and Generate a RandomSpanning Tree of a Directed Graph. Journal of Algorithms, 27(2):170–217, 1998. → page(s) 13[33] B. Rannala. Identifiability of Parameters in MCMC BayesianInference of Phylogeny. Systematic Biology, 51(5):754–760, 2002. →page(s) 33[34] B. Rannala and Z. Yang. Probability Distribution of MolecularEvolutionary Trees: A New Method of Phylogenetic Inference.Journal of Molecular Evolution, 43(3):304–311, 1996. → page(s) 1[35] C. Robert. The Bayesian Choice: From Decision-TheoreticFoundations To Computational implementation. Springer Science &Business Media, 2007. → page(s) 41[36] R. F. Sage, D. A. Way, and D. S. Kubien. Rubisco, Rubisco Activase,and Global Climate Change. Journal of Experimental Botany, 59(7):1581–1595, 2008. → page(s) 46[37] M. J. Sanderson. A Nonparametric Approach to EstimatingDivergence Times in the Absence of Rate Constancy. MolecularBiology and Evolution, 14(12):1218–1231, 1997. → page(s) 8[38] M. J. Sanderson. Estimating Absolute Rates of Molecular Evolutionand Divergence Times: A Penalized Likelihood Approach. MolecularBiology and Evolution, 19(1):101–109, 2002. → page(s) 8[39] S. Tavare´. Some Probabilistic and Statistical Problems in the Analysisof DNA Sequences. Lectures on Mathematics in the Life Sciences, 17:57–86, 1986. → page(s) 6[40] Y. Teh, H. Daume´ III, and D. Roy. Bayesian Agglomerative Clusteringwith Coalescents. In Advances in Neural Information ProcessingSystems 20-Proceedings of the 2007 Conference, 2009. → page(s) 2, 17[41] J. L. Thorne, H. Kishino, and I. S. Painter. Estimating the Rate ofEvolution of The Rate of Molecular Evolution. Molecular Biology andEvolution, 15(12):1647–1657, 1998. → page(s) 854[42] W. Xie, P. O. Lewis, Y. Fan, L. Kuo, and M.-H. Chen. ImprovingMarginal Likelihood Estimation for Bayesian Phylogenetic ModelSelection. Systematic Biology, page syq085, 2010. → page(s) 40, 46, 47[43] Z. Yang and B. Rannala. Bayesian Phylogenetic Inference Using DNAsequences: A Markov Chain Monte Carlo Method. Molecular Biologyand Evolution, 14(7):717–724, 1997. → page(s) 1[44] A. D. Yoder and Z. Yang. Estimation of Primate Speciation DatesUsing Local Molecular Clocks. Molecular Biology and Evolution, 17(7):1081–1090, 2000. → page(s) 8[45] E. Zuckerkandl and L. Pauling. Molecular Disease, Evolution andGenetic Heterogeneity. 1962. → page(s) 755
- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Theses and Dissertations /
- Divide and conquer sequential Monte Carlo for phylogenetics
Open Collections
UBC Theses and Dissertations
Featured Collection
UBC Theses and Dissertations
Divide and conquer sequential Monte Carlo for phylogenetics Jewell, Sean William 2015
pdf
Notice for Google Chrome users:
If you are having trouble viewing or searching the PDF with Google Chrome, please download it here instead.
If you are having trouble viewing or searching the PDF with Google Chrome, please download it here instead.
Page Metadata
Item Metadata
Title | Divide and conquer sequential Monte Carlo for phylogenetics |
Creator |
Jewell, Sean William |
Publisher | University of British Columbia |
Date Issued | 2015 |
Description | Recently reconstructing evolutionary histories has become a computational issue due to the increased availability of genetic sequencing data and relaxations of classical modelling assumptions. This thesis specializes a Divide & conquer sequential Monte Carlo (DCSMC) inference algorithm to phylogenetics to address these challenges. In phylogenetics, the tree structure used to represent evolutionary histories provides a model decomposition used for DCSMC. In particular, speciation events are used to recursively decompose the model into subproblems. Each subproblem is approximated by an independent population of weighted particles, which are merged and propagated to create an ancestral population. This approach provides the flexibility to relax classical assumptions on large trees by parallelizing these recursions. |
Genre |
Thesis/Dissertation |
Type |
Text |
Language | eng |
Date Available | 2015-08-18 |
Provider | Vancouver : University of British Columbia Library |
Rights | Attribution-NonCommercial-NoDerivs 2.5 Canada |
DOI | 10.14288/1.0165760 |
URI | http://hdl.handle.net/2429/54514 |
Degree |
Master of Science - MSc |
Program |
Statistics |
Affiliation |
Science, Faculty of Statistics, Department of |
Degree Grantor | University of British Columbia |
GraduationDate | 2015-09 |
Campus |
UBCV |
Scholarly Level | Graduate |
Rights URI | http://creativecommons.org/licenses/by-nc-nd/2.5/ca/ |
AggregatedSourceRepository | DSpace |
Download
- Media
- 24-ubc_2015_september_jewell_sean.pdf [ 855.1kB ]
- Metadata
- JSON: 24-1.0165760.json
- JSON-LD: 24-1.0165760-ld.json
- RDF/XML (Pretty): 24-1.0165760-rdf.xml
- RDF/JSON: 24-1.0165760-rdf.json
- Turtle: 24-1.0165760-turtle.txt
- N-Triples: 24-1.0165760-rdf-ntriples.txt
- Original Record: 24-1.0165760-source.json
- Full Text
- 24-1.0165760-fulltext.txt
- Citation
- 24-1.0165760.ris
Full Text
Cite
Citation Scheme:
Usage Statistics
Share
Embed
Customize your widget with the following options, then copy and paste the code below into the HTML
of your page to embed this item in your website.
<div id="ubcOpenCollectionsWidgetDisplay">
<script id="ubcOpenCollectionsWidget"
src="{[{embed.src}]}"
data-item="{[{embed.item}]}"
data-collection="{[{embed.collection}]}"
data-metadata="{[{embed.showMetadata}]}"
data-width="{[{embed.width}]}"
data-media="{[{embed.selectedMedia}]}"
async >
</script>
</div>
Our image viewer uses the IIIF 2.0 standard.
To load this item in other compatible viewers, use this url:
https://iiif.library.ubc.ca/presentation/dsp.24.1-0165760/manifest