Bayesian Phylogenetic Inference via Monte Carlo Methods

by Liangliang Wang

M.Sc. Statistics, McGill University, 2007
M.Sc. Informatics and Signal Processing, Peking University, 2005
B.Sc. Computer Science, Zhengzhou University, 2002

A thesis submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in The Faculty of Graduate Studies (Statistics)

The University of British Columbia (Vancouver)

August 2012

© Liangliang Wang, 2012

Abstract

A main task in evolutionary biology is phylogenetic tree reconstruction, which determines the ancestral relationships among different species based on observed molecular sequences, e.g. DNA data. When a stochastic model, typically a continuous time Markov chain (CTMC), is used to describe the evolution, phylogenetic inference depends on unknown evolutionary parameters (hyper-parameters) in the stochastic model. Bayesian inference provides a general framework for phylogenetic analysis: it can accommodate complex models of sequence evolution and it provides a coherent treatment of uncertainty for the groups on the tree. However, the conventional computational methods in Bayesian phylogenetics, based on Markov chain Monte Carlo (MCMC), cannot efficiently explore the tree space, which grows super-exponentially with the number of molecular sequences, because it is difficult to propose good tree topologies. Sequential Monte Carlo (SMC) is an alternative approach to approximating posterior distributions. However, it is non-trivial to apply SMC directly to posterior inference over phylogenetic trees because of the combinatorial intricacies of the tree space. We propose the combinatorial sequential Monte Carlo (CSMC) method, which generalizes SMC to non-clock tree inference based on the existence of a flexible partially ordered set (poset) structure, and we present it at a level of generality directly applicable to many other combinatorial spaces. We show that the proposed CSMC algorithm is consistent and fast in simulations. We also investigate two ways of combining SMC and MCMC to jointly estimate phylogenetic trees and evolutionary parameters: particle Markov chain Monte Carlo (PMCMC) algorithms with CSMC at each iteration, and an SMC sampler with MCMC moves. Further, we present a novel way to estimate the transition probabilities of a general CTMC, which can be used to remove the computing bottleneck in a general evolutionary model, the string-valued continuous time Markov chain (SCTMC), that can incorporate a wide range of molecular mechanisms.

Table of Contents

Abstract
Table of Contents
List of Tables
List of Figures
List of Algorithms
Glossary
Acknowledgments
1 Introduction
2 Monte Carlo Methods
  2.1 Markov Chain Monte Carlo (MCMC)
    2.1.1 Metropolis-Hastings (MH)
    2.1.2 Gibbs Sampler (GS)
  2.2 Sequential Monte Carlo (SMC)
    2.2.1 Generic SMC
    2.2.2 Resampling Methods
    2.2.3 Sequential Monte Carlo Samplers
  2.3 Particle MCMC
    2.3.1 Particle Independent Metropolis-Hastings (PIMH)
    2.3.2 Particle Marginal Metropolis-Hastings Sampler (PMMH)
    2.3.3 Particle Gibbs Sampler (PGS)
3 Background on Phylogenetics
  3.1 Phylogenetic Tree
  3.2 Data for Phylogenetics
    3.2.1 Data Types
    3.2.2 Aligned Data via Multiple Sequence Alignments
    3.2.3 Datasets
  3.3 Score-based Methods for Phylogeny Reconstruction
    3.3.1 Neighbour-joining Method
    3.3.2 Maximum Parsimony
  3.4 Model-based Methods for Phylogeny Reconstruction
    3.4.1 Stochastic Evolutionary Model
    3.4.2 Maximum Likelihood
    3.4.3 Framework of Bayesian Phylogenetics
    3.4.4 Bayesian Phylogenetics via MCMC
  3.5 Consensus Tree and Tree Distances
4 Combinatorial SMC
  4.1 Notation
  4.2 Method
    4.2.1 Discrete Setup
    4.2.2 Algorithm
    4.2.3 Overcounting Correction
    4.2.4 Analysis in the Discrete Case
    4.2.5 General Setup
    4.2.6 Connection to Phylogenetics with Branch Lengths
  4.3 Numerical Examples
    4.3.1 Synthetic Posets
    4.3.2 Synthetic Phylogenies
    4.3.3 RNA Data
  4.4 Discussion
    4.4.1 Choices of the Partial States and Their Distributions
    4.4.2 Choices of the Proposal Distributions
    4.4.3 Resampling Strategies
    4.4.4 Another Application: Self-avoiding Walk
5 Combining MCMC and SMC
  5.1 Particle MCMC for Phylogenetics
    5.1.1 The Particle Independent Metropolis-Hastings (PIMH)
    5.1.2 Particle Marginal Metropolis-Hastings (PMMH)
    5.1.3 The Particle Gibbs Sampler (PGS)
    5.1.4 Simulation Studies
    5.1.5 Real Data Analysis
  5.2 An SMC Sampler with MCMC Kernels
  5.3 Parallel Computing
    5.3.1 Fine-Grain Parallelization
    5.3.2 Multiple Threads on a Single CPU
  5.4 Discussion
6 A General Evolutionary Model Based on String-valued CTMC
  6.1 Importance Sampling for Transition Probabilities
    6.1.1 Transition Probabilities
    6.1.2 Importance Sampling
  6.2 Birth-death Processes
    6.2.1 Sampling a Sequence of States with Fixed Two Ends
    6.2.2 Example 1
    6.2.3 Example 2
  6.3 A General Evolutionary Model for Phylogenetics
    6.3.1 String-valued CTMC on a Phylogenetic Tree
    6.3.2 Sequential Monte Carlo
    6.3.3 Simulation Studies
  6.4 Discussion
7 Summary and Future Work
  7.1 Summary
  7.2 Inferring Large Scale Trees
    7.2.1 Divide-and-Conquer Strategies
    7.2.2 Parallel Computing on GPUs
  7.3 Harnessing Non-Local Evolutionary Events for Tree Inference
  7.4 Joint Estimation of MSA and Phylogeny
Bibliography
Appendix
A Proof of Consistency

List of Tables

Table 5.1 Settings for the simulated data.
Table 5.2 Tree estimates using PIMH for clock trees.
Table 5.3 Comparison of PMMH and MrBayes for a set of non-clock trees.
Table 5.4 Log-likelihoods of the consensus trees obtained using PMMH with different numbers of particles.
Table 5.5 Comparison of the SMC sampler with MH moves with different numbers of iterations for a set of non-clock trees.

List of Figures

Figure 3.1 Examples of a rooted tree and an unrooted tree.
Figure 3.2 Heights of subtrees on a clock tree.
Figure 3.3 The Nearest Neighbor Interchange (NNI) Proposal [58].
Figure 4.1 An example of the partial states of a discrete rooted X-tree.
Figure 4.2 A graphical illustration of the SMC algorithm.
Figure 4.3 (a) All the different sequences of partial states (forests) leading to fully specified states (rooted X-trees). (b) An example of the set of parents of a partial state s. (c) An example of a simple cyclic poset.
Figure 4.4 An example of changing a simple cyclic poset to an acyclic case.
Figure 4.5 The support of the backward kernels used in the synthetic poset experiments.
Figure 4.6 Total variation distance of the particle approximation to the exact target distribution.
Figure 4.7 The average Robinson Foulds metric and partition metric versus the running times using the simulated data.
Figure 4.8 Log-likelihoods of the consensus trees using RNA datasets.
Figure 5.1 Coverage probability versus credible intervals for the estimates of κ using the particle marginal Metropolis-Hastings sampler (PMMH).
Figure 5.2 Trace plot and histogram of κ estimated by PMMH for three arbitrarily selected non-clock datasets.
Figure 5.3 Log-likelihoods of the consensus trees for a small RNA dataset using CSMC, MB, and PMMH.
Figure 5.4 The consensus tree for cichlid fishes estimated by PMMH.
Figure 5.5 Consensus tree for the dataset of cichlid fishes using MrBayes.
Figure 5.6 Results of running the SMC sampler with MH moves.
Figure 5.7 Running time of CSMC versus different numbers of threads.
Figure 6.1 Comparison of transition probabilities P_{10,b}(T = 1) computed by our method and their true values.
Figure 6.2 Transition probabilities P_{10,b}(T) with different values of T.
Figure 6.3 Mean string lengths using forward simulation.
Figure 6.4 A subset of simulated data.
Figure 6.5 Performance on estimating trees.

List of Algorithms

2.1 Metropolis-Hastings Algorithm
2.2 Gibbs Sampling
2.3 Generic Sequential Monte Carlo
2.4 Stratified Resampling
2.5 An SMC Sampler
2.6 The Particle Independent Metropolis-Hastings Algorithm
2.7 The Particle Marginal Metropolis-Hastings Sampler
2.8 Conditional SMC
2.9 The Particle Gibbs Sampler
4.1 The Combinatorial Sequential Monte Carlo Algorithm
5.1 The Particle Independent Metropolis-Hastings (PIMH) Algorithm for Phylogenetics
5.2 The PMMH for Phylogenetics
5.3 Conditional CSMC for Phylogenetic Trees
5.4 The MH within Particle Gibbs Sampler for Phylogenetics
5.5 An SMC Sampler using Metropolis-Hastings (MH) Kernels for Phylogenetics
5.6 Outline of the Fine-grain Parallelization for SMC Algorithms
7.1 The Particle Block Gibbs Sampler for Phylogenetics
7.2 An SMC Sampler using Partial Data

Glossary

BDP birth-death process
CSMC combinatorial sequential Monte Carlo
CTMC continuous time Markov chain
CV coefficient of variation
ESS effective sample size
GS Gibbs sampler
GPUs graphics processing units
indels insertions or deletions
K2P Kimura two-parameter model
KF Kuhner-Felsenstein
MB MrBayes
MCMC Markov chain Monte Carlo
MH Metropolis-Hastings
MJP Markov jump process
MSA multiple sequence alignment
NJ neighbor joining
NNI nearest neighbor interchange
PGS particle Gibbs sampler
PIMH particle independent Metropolis-Hastings
PM partition metric
PMCMC particle Markov chain Monte Carlo
PMMH particle marginal Metropolis-Hastings sampler
poset partially ordered set
RF Robinson-Foulds
SCTMC string-valued continuous time Markov chain
SAWs self-avoiding walks
SMC sequential Monte Carlo
SSM slipped strand mispairing

Acknowledgments

I would like to take this opportunity to thank those who have influenced my life and helped me to complete my Ph.D. program and this thesis. First and foremost, I would like to express my sincerest gratitude to my co-supervisors, Dr. Arnaud Doucet and Dr. Alexandre Bouchard-Côté. I am sincerely thankful to Arnaud for introducing me to the area of sequential Monte Carlo and believing in my academic potential. I would like to thank Alex for his exceptional generosity with his time, energy, and support, as well as resources. I will never forget the numerous inspiring discussions we had about phylogenetics. I would also like to acknowledge the financial support from Arnaud and Alex. Many thanks go to the people who have been influential in my academic journey. I am sincerely thankful to Dr. Nandini Dendukuri and Dr. Alula Hadgu for collaborating with me in the area of Bayesian approaches for medical diagnostic tests and giving me an enormous amount of support in my job applications. I am grateful to Dr. James O. Ramsay for introducing me to the area of Functional Data Analysis (FDA) and inviting me to many holiday parties at his house. I would like to thank Dr. Jinwen Ma for teaching me about bioinformatics and neural networks. I am thankful to Mr. Haoran Qin, my high school biology teacher, for originally stimulating my interest in biology and mentoring me for the (Chinese) National Olympic Contest of Biology in 1997. I would like to extend my thanks to the professors at UBC, who have given me excellent lectures and all kinds of help and support. Particularly, I would like to thank Dr. Paul Gustafson, Dr. Lang Wu, and Dr. Anne Condon for serving on my examination committee and providing insightful comments and suggestions. I am very grateful to Dr. Jiahua Chen, Dr. Lang Wu, and Dr. Ruben H. Zamar for their altruistic help with my job talk. I would like to thank Dr. Nancy Heckman and Dr. Matías Salibián-Barrera for advising me on robust FDA and supporting me in many aspects. I am thankful to Dr. John Petkau for helping me with the consulting projects and academic writing. I would like to thank Dr. William J. Welch and Dr. Harry Joe for making me feel comfortable in the department. Furthermore, I would like to thank my student colleagues and friends at UBC for creating a friendly environment for development and growth. Special thanks go to Yanling (Tara) Cai, Camila P.
Estevam de Souza, Julia Viladomat, Mike Danilov, Xiaoli Yu, Li Xing, Xuekui Zhang, Yongliang (Vincent) Zhai, Ardavan Saeedi, Bonnie Kirkpatrick, Luke Bornn, Ehsan Karim, Chen Xu, Lei (Larry) Hua, Song Cai, and Yang (Seagle) Liu. I deeply thank my parents and my sister, who always stand by me and offer me endless love, care, and support. I would like to show my special appreciation to my husband, Dr. Jiguo Cao, for always being there for me and for being patient and supportive during this long journey. For partial support of this work through graduate scholarships and research grants, thanks are due to the Natural Sciences and Engineering Research Council of Canada.

Chapter 1

Introduction

A main task in evolutionary biology is phylogenetic reconstruction, which reconstructs the ancestral relationships among different species and also estimates evolutionary parameters. The available data are typically biological sequences of DNA, RNA, or protein. Bayesian phylogenetics aims to estimate the posterior distribution of the phylogenetic tree and evolutionary parameters given the observed biological sequences, and then to make inferences about a function of the tree by computing its posterior expectation. Calculating the posterior distribution of phylogenetic trees analytically would involve summation over all possible trees and, for each tree, integration over all possible combinations of branch lengths. There is no closed-form solution except for very small trees. Moreover, the total number of possible phylogenetic trees grows super-exponentially as the number of biological sequences increases. This intractable posterior computation is therefore typically approximated using Markov chain Monte Carlo (MCMC) algorithms. In phylogenetics, MCMC algorithms construct a Markov chain that has the desired posterior distribution of phylogenetic trees as its equilibrium distribution. At each MCMC iteration, a new tree is proposed by making small perturbations to the current tree; we accept the proposed tree with the probability given by the Metropolis-Hastings (MH) ratio, and otherwise remain at the current tree. Various MCMC methods for phylogenetics differ from one another in their evolutionary models, proposal distributions, and prior distributions. The development of modern Bayesian phylogenetics originated in the earlier work of [52, 67, 70, 77, 82, 113]. [82] and [113] showed how posterior probabilities of trees could be calculated under a linear birth-death prior for a small number of species. [77] proposed the GLOBAL algorithm, in which all branch lengths are changed with every cycle. In these publications, a common assumption is that the phylogenetic tree follows a molecular clock. [67] discussed two types of MCMC algorithms, the GLOBAL algorithm and the LOCAL algorithm, programmed in their software package BAMBE; [67] also extended these algorithms from clock trees to non-clock trees. [70] restricted their model to molecular clock trees; their MCMC algorithm uses the sequences at internal nodes as auxiliary variables to improve computational efficiency per cycle by avoiding integration over these sequences. [52] provided an overview of the traditional and Bayesian approaches for phylogeny estimation. Bayesian phylogenetics is becoming more and more popular for several reasons. Many studies have focused on comparing the nonparametric bootstrap and posterior probabilities in terms of estimating tree uncertainty [1, 20, 29, 96, 102].
These studies show that the bootstrap tends to be too conservative. Moreover, Bayesian methods might be a faster way to assess support for trees than maximum likelihood bootstrapping [25]. In addition, the Bayesian framework can incorporate complex evolutionary models, and there are many user-friendly software packages available for Bayesian phylogenetics using MCMC, e.g. BAMBE [97], MrBayes [55, 90, 91], and BEAST [28]. The main difficulty in Bayesian phylogenetic inference with MCMC lies in the efficiency with which topology proposals sample tree space. Since the target posterior distribution of phylogenetic trees is high dimensional and multimodal, it can be difficult to design the proposal distributions needed to achieve an efficient MCMC algorithm. [66] evaluated two classes of proposal distributions that change the topology and branch lengths simultaneously for phylogenetic inference on unrooted trees. [51] studied different proposal distributions for constrained rooted clock trees. The results in [66] and [51] show that the design of the currently used transition kernels leads to a low acceptance rate of transitions unless a small neighborhood is used. As a result, MCMC algorithms need a long computing time to give reliable estimates for the parameters under study. [50] explored new MCMC transition kernels for sampling the posterior distribution in tree space. However, due to the huge size of the tree space, it remains challenging to design proposal distributions that make Markov chains mix well and move quickly among all the trees with high probability mass. Another main disadvantage of using MCMC is that it can be difficult to determine whether the MCMC approximation has converged. Moreover, MCMC algorithms are often computationally expensive because each iteration requires recalculating the likelihood of the full tree even when only a small perturbation is made to the current tree. This computational bottleneck has prevented the application of Bayesian methods to large phylogenetic studies, and an increase in CPU speed is unlikely to address this issue because increasingly cheap sequencing technologies are creating massive molecular data at a faster rate. Such data represent an opportunity for estimating more accurate models of the evolutionary histories underlying species. At the same time, the massive amount of molecular data poses a computational challenge for Bayesian phylogenetics, where running MCMC chains for several months is not uncommon. Sequential Monte Carlo (SMC) is an increasingly popular alternative to MCMC because of its speed and scalability. [42] contributed a pioneering paper on SMC focusing on tracking applications. Since then, various SMC algorithms have been proposed to obtain full posterior distributions for problems in nonlinear dynamic systems in science and engineering. To date, these algorithms for nonlinear and non-Gaussian state-space models have been successfully applied in various fields including computer vision, signal processing, tracking, control, econometrics, finance, robotics, and statistics [15, 26, 27, 71]. However, it is still an open question how to use SMC to break the computing bottleneck imposed by many combinatorial problems in computational biology, such as phylogenetic inference, inference over alignments, and gene networks. [105] and [43] have applied SMC to a restricted class of phylogenetic models, coalescent trees.
However, ultrametric trees such as coalescent trees put unrealistic constraints on phylogenetic inference, which prevents these methods from fitting real data well; their approach cannot be applied to standard phylogenetic models based on non-clock trees. There has also been a large body of work applying sequential methods that are closely related to SMC, e.g. importance sampling and approximate Bayesian computation, to problems in population genetics, but these methods also focus on coalescent models ([45]; [7]; [76]; [57]; [81]). [9] extended the application of classical SMC to phylogenetic data analysis based on partially ordered sets (posets). However, this work requires an artificial restriction on proposals, which is difficult to satisfy in the case of non-clock trees. [109] applied sequential sampling and reweighting methods to phylogenetics, but in the context of pooling results from stratified analyses rather than constructing a single joint tree posterior. [22, 23] proposed the SMC sampler to handle a sequence of distributions defined on a common continuous space. Their proposed SMC sampler uses MCMC kernels to propose particles, and they introduced artificial backward (in time) Markov kernels in the importance weight update. However, the SMC sampler in [22, 23] is focused on state-space models and cannot be readily applied to phylogenetics. We propose a new algorithm, the combinatorial sequential Monte Carlo (CSMC) algorithm, to approximate posteriors of phylogenetic trees given fixed evolutionary parameters; CSMC is designed to improve on the efficiency of standard MCMC methods. The technique we use to generalize SMC to non-clock trees depends on the existence of a flexible poset structure, and we present it at a level of generality directly applicable to many other combinatorial spaces. We show that the proposed CSMC algorithm is consistent and fast in simulations. We investigate two frameworks for combining SMC and MCMC, the two main Monte Carlo tools for sampling from complex and high-dimensional probability distributions, to jointly estimate phylogenetic trees and evolutionary parameters (hyper-parameters). The first framework is particle Markov chain Monte Carlo (PMCMC) [3, 53], in which each MCMC iteration uses our proposed CSMC to approximate the posterior distribution of the phylogenetic trees and the marginal likelihood of the data, given the parameters of the evolutionary model. The second algorithm is an SMC sampler, in which each SMC iteration uses an MCMC kernel to propose artificial intermediate states, which are full trees with tempering parameters. Ideally, the evolutionary model in Bayesian phylogenetics should be flexible enough to incorporate a wide range of evolutionary phenomena, e.g. slipped strand mispairing (SSM), a well-known explanation for the evolution of repeated sequences. We propose more realistic evolutionary models by incorporating a wide range of evolutionary phenomena. Specifically, the model is a string-valued continuous time Markov chain (SCTMC) in which the exponential rates are allowed to depend on an unbounded context. We propose a novel SMC algorithm to sample the posterior distribution of the phylogenetic tree and hidden molecular sequences given the unaligned sequence data. The rest of this dissertation is organized as follows. In Chapter 2 we provide a brief review of Monte Carlo methods, including MCMC, SMC, and PMCMC.
Chapter 3 introduces background on phylogenetics, starting with phylogenetic trees, followed by a review of methods of phylogeny reconstruction. In particular, Bayesian phylogenetics (Section 3.4.3) is essential for understanding the rest of this dissertation. Chapter 4 presents a novel SMC algorithm, CSMC, for Bayesian phylogenetic inference on non-clock trees; note that the proposed CSMC is general enough to be applied to many other applications. In Chapter 5 we present different ways of combining MCMC and SMC, focusing on PMCMC and an SMC sampler with MCMC kernels. In Chapter 6, we propose a general evolutionary model based on the SCTMC, and we implement Bayesian phylogenetic inference using a novel sequential Monte Carlo (SMC) algorithm. Finally, Chapter 7 summarizes the contributions of this dissertation and outlines some future work.

Chapter 2

Monte Carlo Methods

Monte Carlo methods are a class of computational algorithms that make use of random samples to compute expectations. They are commonly used to approximate intractable high-dimensional integrals. From the Monte Carlo point of view, any probability distribution π, regardless of its complexity, can always be represented by a discrete (Monte Carlo) sample from it; an important issue is to find a convenient and efficient discrete representation. [88] provided an excellent review of Monte Carlo statistical methods. In this chapter, we briefly review the major Monte Carlo methods that will be used in the following chapters. Throughout this chapter, π denotes the target distribution of some random variable defined on a measurable space (E, E). Assume the density function is denoted by π(x), x ∈ E. We are interested in computing the integral of a measurable function φ : E → R with respect to π,

\[ E_\pi(\phi) = \int_E \phi(x)\,\pi(dx). \]

To simplify notation, for most of this dissertation, and especially in the algorithms, we do not distinguish random variables from their realizations.
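As a concrete illustration of the plain Monte Carlo estimator E_π(φ) ≈ (1/K) Σ_k φ(x_k), here is a minimal sketch (not from the thesis; the standard normal target and φ(x) = x² are illustrative assumptions):

import numpy as np

# Minimal sketch: approximate E_pi(phi) by averaging phi over K i.i.d.
# draws from pi. Here pi is standard normal and phi(x) = x^2, so the
# true expectation is 1.
rng = np.random.default_rng(0)
K = 100_000
x = rng.standard_normal(K)      # x_k ~ pi, k = 1, ..., K
estimate = np.mean(x ** 2)      # (1/K) * sum_k phi(x_k)
print(estimate)                 # close to E_pi(phi) = 1

The rest of this chapter concerns the harder case where i.i.d. draws from π are unavailable.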
2.1 Markov Chain Monte Carlo (MCMC)

Since it is typically difficult to obtain a large number of independent and identically distributed samples from a distribution of interest, π, MCMC methods [88] are widely used to draw a dependent sample from π by constructing an ergodic Markov chain that has the desired target distribution π as its equilibrium distribution. A Markov chain, {X_n}, can be thought of as a sequence of random variables evolving over time with the following property: given the present state, future transitions of the chain are independent of the past history. More precisely, for every initial distribution µ,

\[ P(X_{n+1} \in A \mid x_0, x_1, \cdots, x_n) = P(X_{n+1} \in A \mid x_n) = \int_A M(x_n, dx), \]

where M is a transition kernel defined on a measurable space (E, E) such that:

1. ∀x ∈ E, M(x, ·) is a probability measure;
2. ∀A ∈ E, M(·, A) is measurable.

Provided that a discrete Markov chain {X_n} is irreducible and aperiodic and π is its equilibrium distribution, we have P(X_n = y | X_0 = x) → π(y) as n → ∞ for all x, y; i.e. the equilibrium distribution of this Markov chain is our target distribution. Further, given that the chain is irreducible, for any function φ(·) such that E_π|φ(X)| < ∞,

\[ \frac{1}{K}\sum_{k=1}^{K} \phi(X_k) \to E_\pi(\phi(X)) \quad \text{as } K \to \infty, \]

with almost sure convergence. In other words, ergodic averages along this Markov chain converge to the required expectations under the target distribution. Please refer to [88] for more details. In order to construct an appropriate Markov chain, we require the transition kernel to satisfy the detailed balance condition, defined as follows.

Definition 1. A Markov chain with transition kernel M satisfies the detailed balance condition if there exists a function f satisfying

\[ M(y, x) f(y) = M(x, y) f(x) \quad (2.1) \]

for every (x, y).

According to [88], we have the following result:

Theorem 2. Suppose that a Markov chain with transition function M satisfies the detailed balance condition with π a probability density function. Then:

1. The density π is the invariant density of the chain.
2. The chain is π-reversible.

2.1.1 Metropolis-Hastings (MH)

The most popular algorithm for constructing appropriate Markov chains is the Metropolis-Hastings (MH) algorithm, first proposed by [78] and later adapted by [48]. To construct a Markov chain {X_n}_{n=1,2,...}, a proposal distribution K(x, ·) is chosen to propose a candidate value y for the state based on the current value x, with density K(x, y). The following MH acceptance probability α(x, y) is calculated:

\[ \alpha(x, y) = \min\left\{1, \frac{\pi(y)K(y, x)}{\pi(x)K(x, y)}\right\}. \]

With probability α(x, y), the new state of the Markov chain is updated to y; otherwise it remains x. Algorithm 2.1 summarizes the MH algorithm. The initial value of the chain can be selected arbitrarily. The choice of K(x, ·) is also arbitrary as long as it satisfies the requirement that the constructed Markov chain is irreducible.

Algorithm 2.1 Metropolis-Hastings Algorithm
1: Choose a starting value x_0.
2: for k ≥ 1 do
3:   Sample x* ∼ K(x_{k−1}, ·).
4:   Compute the MH acceptance probability α(x_{k−1}, x*).
5:   Accept x* with this probability, setting x_k = x*; if x* is not accepted, then x_k = x_{k−1}.
6: end for

We can show that the MH algorithm produces a Markov kernel with the desired invariant distribution by showing that this kernel satisfies the detailed balance condition for π. Using Algorithm 2.1, the Markov transition kernel M : E → E is

\[ M(x, dy) = \alpha(x, y)K(x, dy) + (1 - \alpha(x, y))\,\delta_x(dy), \]

where δ_x is the Dirac measure at x ∈ E. Suppose, without loss of generality, that α(x, y) ≤ 1, which implies α(y, x) = 1. Then, for y ≠ x (so that the Dirac terms vanish),

\[ \pi(x)M(x, y) = \pi(x)\,\alpha(x, y)\,K(x, y) = \pi(x)\,\frac{\pi(y)K(y, x)}{\pi(x)K(x, y)}\,K(x, y) = \pi(y)K(y, x) = \pi(y)\,\alpha(y, x)\,K(y, x) = \pi(y)M(y, x), \]

and the case y = x is trivial, so detailed balance holds.

2.1.2 Gibbs Sampler (GS)

The Gibbs sampler (GS) applies to problems with a multivariate state. The value of each element is sampled from its full conditional distribution given the most recent values of all the other elements. Let x = (x_1, ..., x_D) denote the vector of a multivariate random variable, and let x_k = (x_{k,1}, ..., x_{k,D}) denote the k-th sample of x. Algorithm 2.2 describes a deterministic-scan GS, in which the order in which elements are updated is fixed. In this algorithm, x_{k,−d} = (x_{k,1}, ..., x_{k,d−1}, x_{k−1,d+1}, ..., x_{k−1,D}). We can also use a random-scan GS, in which we randomly choose the element to update.

Algorithm 2.2 Gibbs Sampling
1: Choose a starting value x_0.
2: for k ≥ 1 do
3:   for d = 1 to D do
4:     Sample x_{k,d} ∼ π(· | x_{k,−d})
5:   end for
6: end for

The GS can be regarded as a special case of the MH sampler [39] in which the MH acceptance probability is always 1:

\[ \alpha(x, (x_{-d}, y_d)) = \frac{\pi((x_{-d}, y_d))\,\pi(x_d \mid x_{-d})}{\pi(x)\,\pi(y_d \mid x_{-d})} = \frac{\pi(x_{-d})\,\pi(y_d \mid x_{-d})\,\pi(x_d \mid x_{-d})}{\pi(x_{-d})\,\pi(x_d \mid x_{-d})\,\pi(y_d \mid x_{-d})} = 1, \]

where (x_{−d}, y_d) = (x_1, ..., x_{d−1}, y_d, x_{d+1}, ..., x_D). Since the GS requires sampling from the full conditional distributions, it might be impossible to sample from some of the full conditionals for many complex models; in such cases, an MH step is used to sample from the full conditional distribution. In addition, the GS can be very inefficient when the variables are strongly correlated, or when sampling from the full conditionals is extremely expensive or inefficient.
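To close this section, here is a minimal sketch of Algorithm 2.1 (not from the thesis; the standard normal target and random-walk proposal scale are illustrative assumptions — the same accept/reject line is what an MH step inside a Gibbs scan would use):

import numpy as np

# Minimal random-walk MH sketch of Algorithm 2.1. The unnormalized target
# gamma(x) is a standard normal up to a constant (illustrative); with a
# symmetric Gaussian proposal, K(y,x)/K(x,y) = 1 in the acceptance ratio.
rng = np.random.default_rng(1)

def log_gamma(x):
    return -0.5 * x**2  # log of an unnormalized standard normal density

x = 0.0
chain = []
for k in range(10_000):
    x_star = x + 0.5 * rng.standard_normal()   # sample x* ~ K(x, .)
    log_alpha = log_gamma(x_star) - log_gamma(x)
    if np.log(rng.uniform()) < log_alpha:      # accept with probability alpha
        x = x_star
    chain.append(x)                            # otherwise remain at x
print(np.mean(chain), np.var(chain))           # roughly 0 and 1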
2.2 Sequential Monte Carlo (SMC)

The sequential Monte Carlo (SMC) method is an increasingly popular alternative to MCMC. In this section, we first review the generic SMC algorithm, using the application to state-space models as a specific example. Then we summarize some resampling methods used in SMC. Finally, we introduce an SMC sampler that is relevant to our work.

2.2.1 Generic SMC

In this section, we briefly review generic SMC algorithms for sampling from a given target distribution π defined on a measurable space (E, E); the density function is denoted π(x), x ∈ E. This method requires us to introduce a sequence of intermediate probability distributions

\[ \{\pi_r(x_r),\ r = 1, \cdots, R\} \quad (2.2) \]

defined on a sequence of measurable spaces {(E_r, E_r), r = 1, ..., R} such that π_R(x_R) = π(x). Here r is a general index for the intermediate distributions; it need not be an index for time as in state-space models. Typically, x_r = x_{1:r} = (x_1, ..., x_r), where x_r ∈ X, implying that E_r = X^r. Each density of the intermediate distributions is assumed known up to a normalizing constant; i.e. for r = 1, ..., R,

\[ \pi_r(x_r) = Z_r^{-1}\gamma_r(x_r), \]

where γ_r : E_r → R^+ can be evaluated pointwise, but the normalizing constant Z_r is unknown. We will denote Z_R by Z. A concrete example of the general applications of SMC is state-space models [6, 10, 11, 30, 31, 41, 42], in which r is an index for time; the intermediate distributions are given by the state-space models themselves, i.e. π_r(x_r) = p(x_{1:r} | y_{1:r}), defined on the product space E_r = X^r, where x_{1:r} is from a hidden X-valued discrete-time Markov process and y_{1:r} are the observations. The unnormalized density can be evaluated pointwise as follows:

\[ \gamma_r(x_r) = p(x_{1:r}, y_{1:r}) = \mu(x_1)\prod_{n=2}^{r} f(x_n \mid x_{n-1})\prod_{n=1}^{r} g(y_n \mid x_n), \]

where f(x_n | x_{n−1}) are the transition probability densities and g(y_n | x_n) are the marginal densities of the observations. We wish to estimate expectations of test functions φ_r : E_r → R,

\[ E_{\pi_r}(\phi_r) = \int \phi_r(x_r)\,\pi_r(dx_r), \]

and the normalizing constants Z_r. In SMC algorithms, we do this sequentially; i.e. we first estimate π_1 and Z_1 at iteration 1, then π_2 and Z_2 at iteration 2, and so on. A generic SMC algorithm is described in Algorithm 2.3. At the r-th iteration of the algorithm, the proposal distributions q_r are used to propose the particles {x_{r,k}}_{k=1,...,K} of the current iteration based on the particles {x̃_{r−1,k}} from the previous iteration. The intermediate distribution π_r is approximated by the set of K weighted samples (also called particles) {x_{r,k}, W_{r,k}}_{k=1,...,K}, where x_{r,k} denotes the k-th particle and W_{r,k} represents its normalized weight, i.e.

\[ W_{r,k} = \frac{w_{r,k}}{\sum_{k=1}^{K} w_{r,k}}. \]

Algorithm 2.3 Generic Sequential Monte Carlo
At iteration r = 1,
  sample x_{1,k} ∼ q_1(·)
  set its unnormalized weight w_{1,k} = γ_1(x_{1,k}) / q_1(x_{1,k})
  normalize the weights: W_{1,k} = w_{1,k} / Σ_{k=1}^{K} w_{1,k}
  resample {x_{1,k}, W_{1,k}} to obtain new particles denoted {x̃_{1,k}}
At iteration r ≥ 2,
  sample x′_{r,k} ∼ q_r(x̃_{r−1,k} → ·), and set x_{r,k} = (x̃_{r−1,k}, x′_{r,k})
  compute
    w_{r,k} = w(x̃_{r−1,k}, x_{r,k}) = γ_r(x_{r,k}) / [γ_{r−1}(x̃_{r−1,k}) q_r(x̃_{r−1,k} → x_{r,k})]
  normalize the weights: W_{r,k} = w_{r,k} / Σ_{k=1}^{K} w_{r,k}
  resample {x_{r,k}, W_{r,k}} to obtain new particles denoted {x̃_{r,k}}

We have the following approximation of π_r(x_r):

\[ \pi_{r,K}(dx_r) = \sum_{k=1}^{K} W_{r,k}\,\delta_{x_{r,k}}(dx_r). \]
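A minimal sketch of Algorithm 2.3 follows (not the thesis code; the one-dimensional Gaussian state-space model and the bootstrap choice q_r = f are illustrative assumptions, under which the incremental weight reduces to g(y_r | x_r)). The running log-average of the weights anticipates the normalizing constant estimate derived next.

import numpy as np

# Minimal bootstrap-style sketch of Algorithm 2.3 for the toy model
# x_n = 0.9 x_{n-1} + N(0,1), y_n = x_n + N(0,1) (illustrative choices).
rng = np.random.default_rng(2)
R, K = 50, 1000
x_true = np.zeros(R)
for r in range(1, R):
    x_true[r] = 0.9 * x_true[r - 1] + rng.standard_normal()
y = x_true + rng.standard_normal(R)

particles = rng.standard_normal(K)        # x_{1,k} ~ q_1 = mu
log_Z = 0.0
for r in range(R):
    if r > 0:                             # propose from the transition density
        particles = 0.9 * particles + rng.standard_normal(K)
    logw = -0.5 * (y[r] - particles) ** 2 # log g(y_r | x_r), up to a constant
    m = logw.max()
    w = np.exp(logw - m)
    log_Z += m + np.log(w.mean())         # log of the average unnormalized weight
    W = w / w.sum()                       # normalized weights W_{r,k}
    particles = particles[rng.choice(K, size=K, p=W)]  # multinomial resampling
print(log_Z)  # log-likelihood estimate, up to the dropped Gaussian constant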
A byproduct of the SMC algorithm is an estimate of the normalizing constant Z. We can rewrite the first normalizing constant as

\[ Z_1 = \int \gamma_1(x_1)\,dx_1 = \int \frac{\gamma_1(x_1)}{q_1(x_1)}\,q_1(x_1)\,dx_1 = \int w_1(x_1)\,q_1(x_1)\,dx_1. \]

Correspondingly, an estimate of Z_1 is

\[ Z_{1,K} = \frac{1}{K}\sum_{k=1}^{K} w_{1,k}. \]

Similarly, using Z_{r−1} = γ_{r−1}(x_{r−1}) / π_{r−1}(x_{r−1}), we can rewrite the ratio of consecutive normalizing constants as

\[ \frac{Z_r}{Z_{r-1}} = \frac{\int \gamma_r(x_r)\,dx_r}{Z_{r-1}} = \int \frac{\gamma_r(x_r)}{\gamma_{r-1}(x_{r-1})}\,\pi_{r-1}(x_{r-1})\,dx_r = \int \frac{\gamma_r(x_r)}{\gamma_{r-1}(x_{r-1})\,q_r(x_{r-1}\to x_r)}\,\pi_{r-1}(x_{r-1})\,q_r(x_{r-1}\to x_r)\,dx_r = \int w_r(x_r)\,\pi_{r-1}(x_{r-1})\,q_r(x_{r-1}\to x_r)\,dx_r. \]

Straightforwardly, an estimate of Z_r / Z_{r−1} is provided by

\[ \widehat{\frac{Z_r}{Z_{r-1}}} = \frac{1}{K}\sum_{k=1}^{K} w_{r,k}. \]

Since the normalizing constant can be rewritten as

\[ Z \equiv Z_R = Z_1\prod_{r=2}^{R}\frac{Z_r}{Z_{r-1}}, \]

an estimate of the normalizing constant Z is

\[ Z_{R,K} = \prod_{r=1}^{R}\left(\frac{1}{K}\sum_{k=1}^{K} w_{r,k}\right), \quad (2.3) \]

which can be obtained readily from an SMC algorithm. Moreover, Equation (2.3) is a consistent estimate of Z [21, 26]. The performance of SMC algorithms can be improved by using sophisticated resampling methods, strategically designing the sequence of target distributions, and carefully choosing the proposal distributions. In the next section, we briefly review some resampling methods.

2.2.2 Resampling Methods

As the iteration index increases, the variance of the unnormalized weights {w_{r,k}} tends to increase, and all the mass becomes concentrated on a small number of particles. A resampling step is used to reset the approximation by pruning the particles with low weights and multiplying the particles with high weights. The rationale is that if a particle has a low weight at iteration r, it will often still have a low weight at iteration r + 1; we want to focus our computational efforts on the promising particles. Resampling at every iteration is neither necessary nor efficient, so we need a resampling schedule determining when the resampling step is carried out. A resampling schedule can be either deterministic or dynamic. In a deterministic schedule, one resamples at iterations r_0, 2r_0, ..., where r_0 depends on the particular problem. In a dynamic schedule, we resample only when the variation of the weights is higher than some prefixed threshold. We can use two criteria, the effective sample size (ESS) or the coefficient of variation (CV), to measure the variation of the weights. The ESS can be computed as

\[ ESS = \left(\sum_{k=1}^{K} W_{r,k}^2\right)^{-1}. \]

If W_{r,k} = 1/K for all k, we have ESS = K; if W_{r,i} = 1 and W_{r,j} = 0 for j ≠ i, we have ESS = 1. The CV can be computed as

\[ CV = \left(\frac{1}{K}\sum_{k=1}^{K}\left(K W_{r,k} - 1\right)^2\right)^{1/2}. \]

If W_{r,k} = 1/K for all k, we have CV = 0; if W_{r,i} = 1 and W_{r,j} = 0 for j ≠ i, we have CV = √(K − 1).
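As a quick check of these two criteria, here is a minimal sketch (not from the thesis) reproducing the boundary cases just stated:

import numpy as np

# ESS and CV of a vector of normalized weights W.
def ess(W):
    return 1.0 / np.sum(W ** 2)

def cv(W):
    K = len(W)
    return np.sqrt(np.mean((K * W - 1.0) ** 2))

W_uniform = np.full(100, 1 / 100)                 # all weights equal
W_degenerate = np.zeros(100); W_degenerate[0] = 1.0
print(ess(W_uniform), cv(W_uniform))              # 100.0, 0.0
print(ess(W_degenerate), cv(W_degenerate))        # 1.0, sqrt(99)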
In the following, we summarize two commonly-used resampling methods, multinomial resampling and stratified resampling.

Multinomial Resampling

The simplest resampling scheme is multinomial resampling, which samples K times x̃_{r,k} ∼ π_{r,K}(dx_r) to build the new approximation

\[ \tilde{\pi}_{r,K}(dx_r) = \frac{1}{K}\sum_{k=1}^{K}\delta_{\tilde{x}_{r,k}}(dx_r). \]

We can rewrite the above as

\[ \tilde{\pi}_{r,K}(dx_r) = \sum_{k=1}^{K}\frac{N_{r,k}}{K}\,\delta_{x_{r,k}}(dx_r), \]

where (N_{r,1}, ..., N_{r,K}) ∼ Multinomial(K; W_{r,1}, ..., W_{r,K}); thus

\[ E[N_{r,k}] = K W_{r,k}, \qquad \mathrm{Var}[N_{r,k}] = K W_{r,k}(1 - W_{r,k}). \]

Stratified Resampling

A better resampling method can be designed such that

\[ E[N_{r,k}] = K W_{r,k}, \qquad \mathrm{Var}[N_{r,k}] < K W_{r,k}(1 - W_{r,k}). \]

Stratified resampling, described in Algorithm 2.4, is a popular alternative to multinomial resampling.

Algorithm 2.4 Stratified Resampling
Normalize the weights {w_{r,k}} of the K particles and label them according to the order of the corresponding x_{r,k} to obtain W_{r,(k)}, k = 1, ..., K.
Construct the corresponding cumulative distribution function: for k = 1, ..., K,
  Q_r(0) ≡ 0, Q_r(k) ≡ Σ_{j≤k} W_{r,(j)}.
Sample U_1 uniformly on [0, 1/K] and set U_j = U_1 + (j − 1)/K for j = 2, ..., K.
For k = 1, ..., K, if there exists j ∈ {1, ..., K} such that Q_r(k − 1) ≤ U_j < Q_r(k), then x_{r,k} survives. Update N_{r,k} = #{U_j : Q_r(k − 1) ≤ U_j < Q_r(k)}.
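A minimal sketch of Algorithm 2.4 as written above (not the thesis code; one uniform draw shifted across the K strata, matched against the cumulative weights):

import numpy as np

# Stratified resampling following Algorithm 2.4: returns the indices of
# the surviving particles; counts per index satisfy E[N_k] = K * W_k.
def stratified_resample(W, rng):
    K = len(W)
    u1 = rng.uniform(0.0, 1.0 / K)          # U_1 ~ Uniform[0, 1/K]
    u = u1 + np.arange(K) / K               # U_j = U_1 + (j - 1)/K
    # index k such that Q(k-1) <= U_j < Q(k):
    return np.searchsorted(np.cumsum(W), u, side='right')

rng = np.random.default_rng(3)
W = np.array([0.5, 0.3, 0.1, 0.1])
print(stratified_resample(W, rng))           # e.g. [0 0 1 3]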
2.2.3 Sequential Monte Carlo Samplers

Recent work on SMC has focused on the problem of sampling from arbitrary sequences of distributions. The SMC sampler framework proposed by [22, 23] is a very general method for obtaining a set of samples from a sequence of distributions that can exist on the same or on different spaces. This is a generalization of the standard SMC method [27], in which the target distribution exists on a space of strictly increasing dimension. [22, 23] mainly addressed the case of a sequence of target distributions {π_r} defined on a common continuous space X, e.g. π_r is the posterior distribution of a parameter given the data collected until time r, i.e. π_r(x) = p(x | y_{1:r}). This SMC sampler can be obtained by defining a sequence of distributions that admit the distribution of interest, π_r(x_r), as the most recent marginal,

\[ \tilde{\pi}_r(\mathbf{x}_r) = \pi_r(x_r)\prod_{j=1}^{r-1} L_j(x_{j+1}, x_j), \]

where L_j(x_{j+1}, x_j) is an artificial backward (in time) Markov kernel from iteration j + 1 to j. Then we apply standard SMC to this sequence of distributions. At iteration r, sample x_{r,k} ∼ K_r(x_{r−1,k}, ·), where K_r is a Markov kernel defined on E_{r−1} × E_r. The resulting sampler has a weight update

\[ W_{r,k} \propto \frac{\pi_r(x_{r,k})\,L_{r-1}(x_{r,k}, x_{r-1,k})}{\pi_{r-1}(x_{r-1,k})\,K_r(x_{r-1,k}, x_{r,k})}, \]

which differs from the one in standard SMC. Algorithm 2.5 summarizes the SMC sampler.

Algorithm 2.5 An SMC Sampler
sample x_{1,k} ∼ q_1(·)
set its unnormalized weight w_{1,k} = γ_1(x_{1,k}) / q_1(x_{1,k})
normalize the weights: W_{1,k} = w_{1,k} / Σ_{k=1}^{K} w_{1,k}
resample {x_{1,k}, W_{1,k}} to obtain new particles denoted {x̃_{1,k}}
for r ∈ 2, ..., R do
  sample x_{r,k} ∼ K_r(x̃_{r−1,k}, ·)
  compute
    w_{r,k} = w(x̃_{r−1,k}, x_{r,k}) = [γ_r(x_{r,k}) / γ_{r−1}(x̃_{r−1,k})] · [L_{r−1}(x_{r,k}, x̃_{r−1,k}) / K_r(x̃_{r−1,k}, x_{r,k})]
  normalize the weights: W_{r,k} = w_{r,k} / Σ_{k=1}^{K} w_{r,k}
  resample {x_{r,k}, W_{r,k}} to obtain new particles denoted {x̃_{r,k}}
end for

A common approach in SMC samplers is to choose K_r(x_{r−1}, x_r) to be π_r-invariant, typically an MCMC kernel. A convenient backward Markov kernel that allows an easy evaluation of the importance weight is

\[ L_{r-1}(x_r, x_{r-1}) = \frac{\pi_r(x_{r-1})\,K_r(x_{r-1}, x_r)}{\pi_r(x_r)}. \]

With this backward kernel, the incremental importance weight becomes

\[ w_r = w(x_{r-1}, x_r) = \frac{\gamma_r(x_r)}{\gamma_{r-1}(x_{r-1})}\cdot\frac{L_{r-1}(x_r, x_{r-1})}{K_r(x_{r-1}, x_r)} = \frac{\gamma_r(x_r)}{\gamma_{r-1}(x_{r-1})}\cdot\frac{\pi_r(x_{r-1})\,K_r(x_{r-1}, x_r)}{\pi_r(x_r)\,K_r(x_{r-1}, x_r)} = \frac{\gamma_r(x_{r-1})}{\gamma_{r-1}(x_{r-1})}. \]

2.3 Particle MCMC

[3] proposed a whole class of efficient and flexible MCMC algorithms, named particle Markov chain Monte Carlo (PMCMC) methods, to approximate a target distribution π. Typically, the target distribution is a posterior distribution, i.e. π(x) ≡ p(x|Y). The PMCMC algorithms combine the respective strengths of MCMC and SMC algorithms by using SMC methods to build efficient high-dimensional proposal distributions at each of the MCMC steps; particle MCMC can make bold moves in exploring the parameter space. The PMCMC algorithms include the particle independent Metropolis-Hastings (PIMH), the particle marginal Metropolis-Hastings sampler (PMMH), and the particle Gibbs sampler (PGS).

2.3.1 Particle Independent Metropolis-Hastings (PIMH)

A standard independent Metropolis-Hastings (IMH) sampler targeting π(x) ≡ p(x|Y) uses a proposal distribution q(·) to propose candidates x* independently of the current state x, and accepts x* with probability

\[ \min\left\{1, \frac{\pi(x^*)}{\pi(x)}\cdot\frac{q(x)}{q(x^*)}\right\}. \]

The particle independent Metropolis-Hastings (PIMH) algorithm uses SMC approximations of π(x) as a proposal [3], i.e. q(x) = π_{R,K}(x). After we obtain a set of weighted samples {(W_{R,k}, x_{R,k})}_{k=1,...,K} using SMC, we draw the proposed sample using a resampling method (see Section 2.2.2) from

\[ \pi_{R,K}(dx_R) = \sum_{k=1}^{K} W_{R,k}\,\delta_{x_{R,k}}(dx_R). \]

The resulting PIMH sampler has the simple form described in Algorithm 2.6.

Algorithm 2.6 The Particle Independent Metropolis-Hastings Algorithm
1. Initialization, i = 0:
  (a) run an SMC algorithm targeting π(x);
  (b) sample x(0) ∼ π_{R,K}(·) and compute Z_{R,K}(0).
2. For iteration i ≥ 1:
  (a) run an SMC algorithm targeting π(x);
  (b) sample x* ∼ π_{R,K}(·) and calculate Z*_{R,K} using (2.3);
  (c) with probability
    min{1, Z*_{R,K} / Z_{R,K}(i − 1)},
  set x(i) = x* and Z_{R,K}(i) = Z*_{R,K}; otherwise set x(i) = x(i − 1) and Z_{R,K}(i) = Z_{R,K}(i − 1).

Generally, PIMH and SMC perform similarly with respect to estimation accuracy given the same running time; in fact, for many cases it is preferable to use SMC [3, 53]. However, the memory limits of our computers might prevent us from running the large number of particles required for a complicated model, and consequently the estimate may not be accurate. In contrast, the benefit of using PIMH is its iterative nature, which allows us to run it for any specified amount of time; we could run PIMH until a sufficient number of particles have been generated.
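A minimal sketch of the PIMH loop in Algorithm 2.6 follows (not the thesis code; run_smc is a hypothetical stand-in for any SMC routine, such as the generic sketch above, returning one draw x ∼ π_{R,K} together with the estimate Z_{R,K} from Equation (2.3)):

import numpy as np

# Minimal PIMH sketch (Algorithm 2.6). `run_smc` is a hypothetical
# callable returning (x_star, z_star); the acceptance probability is the
# ratio of successive normalizing constant estimates.
def pimh(run_smc, n_iter, rng):
    x, z = run_smc()                   # initialization, i = 0
    samples = []
    for i in range(n_iter):
        x_star, z_star = run_smc()     # independent SMC proposal
        if rng.uniform() < min(1.0, z_star / z):
            x, z = x_star, z_star      # accept the proposed draw
        samples.append(x)              # otherwise keep the current state
    return samples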
2.3.2 Particle Marginal Metropolis-Hastings Sampler (PMMH)

We are interested in sampling from a joint distribution of static parameters θ ∈ Θ and a high-dimensional variable of interest x ∈ E,

\[ \pi(\theta, x) = \frac{\gamma(\theta, x)}{Z}, \]

where γ : Θ × E → R^+ is known pointwise. As an example, there is often some static parameter θ ∈ Θ in the state-space models mentioned in Section 2.2.1. More precisely, the unnormalized target distribution is

\[ \gamma(\theta, x) = p(\theta, x_{1:R}, y_{1:R}) = \mu_\theta(x_1)\prod_{n=2}^{R} f_\theta(x_n \mid x_{n-1})\prod_{n=1}^{R} g_\theta(y_n \mid x_n). \]

Consider an MH algorithm with target density π(θ, x) and proposal density

\[ q((\theta, x) \to (\theta^*, x^*)) = q(\theta \to \theta^*)\,\pi(x^* \mid \theta^*). \]

Then the resulting MH acceptance ratio is given by

\[ \min\left\{1, \frac{\pi(\theta^*, x^*)}{\pi(\theta, x)}\,\frac{q((\theta^*, x^*) \to (\theta, x))}{q((\theta, x) \to (\theta^*, x^*))}\right\} = \min\left\{1, \frac{\pi(\theta^*, x^*)}{\pi(\theta, x)}\,\frac{q(\theta^* \to \theta)\,\pi(x \mid \theta)}{q(\theta \to \theta^*)\,\pi(x^* \mid \theta^*)}\right\} = \min\left\{1, \frac{\pi(\theta^*)}{\pi(\theta)}\,\frac{q(\theta^* \to \theta)}{q(\theta \to \theta^*)}\right\}, \]

where π(θ) is the marginal distribution. To build a proposal approximating π(x*|θ*), we use an SMC algorithm with a sequence of distributions π_r. PMMH utilizes the fact that the normalizing constant estimate obtained at iteration R of the SMC algorithm is an estimate of the unnormalized marginal π(θ), denoted γ(θ). Algorithm 2.7 shows the PMMH algorithm for sampling from π(θ, x).

Algorithm 2.7 The Particle Marginal Metropolis-Hastings Sampler
Step 1: initialization, i = 0,
  set θ(0) arbitrarily and run an SMC algorithm targeting π(x|θ(0)),
  sample x(0) ∼ π_{R,K}(·|θ(0)) and let γ_{R,K}(θ(0)) = Z_{R,K}.
Step 2: for iteration i ≥ 1,
  sample θ* ∼ q(θ(i − 1) → ·),
  run an SMC algorithm targeting π(x|θ*),
  sample x* ∼ π_{R,K}(·|θ*) and let γ_{R,K}(θ*) = Z_{R,K};
  with probability

  \[ \min\left\{1, \frac{\gamma_{R,K}(\theta^*)}{\gamma_{R,K}(\theta(i-1))}\,\frac{q\{\theta^* \to \theta(i-1)\}}{q\{\theta(i-1) \to \theta^*\}}\right\} \quad (2.4) \]

  set θ(i) = θ*, x(i) = x*, and γ_{R,K}(θ(i)) = γ_{R,K}(θ*); otherwise set θ(i) = θ(i − 1), x(i) = x(i − 1), and γ_{R,K}(θ(i)) = γ_{R,K}(θ(i − 1)).

2.3.3 Particle Gibbs Sampler (PGS)

A standard Gibbs sampler for sampling from π(θ, x) iteratively samples from the full conditionals π(θ|x) and π(x|θ). Sampling from π(θ|x) is relatively easy: in many cases it is possible to sample exactly from π(θ|x), and otherwise an MH step with invariant density π(θ|x) can be used. In contrast, since the x of interest is typically high-dimensional, e.g. a large phylogenetic tree, sampling from π(x|θ) can be very challenging. The particle Gibbs sampler (PGS) approximates the standard Gibbs sampler using a special type of PMCMC update, called the conditional SMC update, to sample from π(x|θ). The conditional SMC algorithm, described in Algorithm 2.8, is similar to a standard SMC algorithm, but it ensures that a prespecified particle and its ancestral lineage survive, and it samples the remaining K − 1 particles as usual. Suppose the j-th particle is the "frozen" one and its ancestral lineage is denoted (A_1^j, A_2^j, ..., A_R^j), where A_r^j represents the index of the "parent" at iteration r − 1 of particle x_{r,j}, for r = 2, ..., R. The PGS is described in Algorithm 2.9.

Algorithm 2.8 Conditional SMC
At iteration 1,
  for k ≠ A_1^j, sample x_{1,k} ∼ q_1(·) and set its unnormalized weight
    w_{1,k} = γ_θ(x_{1,k}) / q_1(x_{1,k});
  resample K − 1 times from {x_{1,k}, w_{1,k}} to obtain {x̃_{1,k} : k ≠ A_1^j}, and set x̃_{1,A_1^j} = x_{1,A_1^j}.
At iteration r ≥ 2,
  for k ≠ A_r^j, sample x′_{r,k} ∼ q_r(x̃_{r−1,k} → ·) and set x_{r,k} = (x̃_{r−1,k}, x′_{r,k});
  compute
    w_{r,k} = γ_θ(x_{r,k}) / [γ_θ(x̃_{r−1,k}) q_r(x̃_{r−1,k} → x_{r,k})];
  resample {x_{r,k}, w_{r,k}} to obtain new particles denoted {x̃_{r,k} : k ≠ A_r^j}, and set x̃_{r,A_r^j} = x_{r,A_r^j}.

Algorithm 2.9 The Particle Gibbs Sampler
Initialization: i = 0
  Sample θ(0) arbitrarily.
  Run an SMC algorithm targeting π(x|θ(0)).
  Sample x(0) ∼ π_{R,K}(·|θ(0)) and record its ancestral lineage.
for i ≥ 1 do
  Sample θ(i) ∼ π(·|x(i − 1)).
  Run a conditional SMC algorithm targeting π(x|θ(i)), conditional on x(i − 1) and its ancestral lineage.
  Sample x(i) ∼ π_{R,K}(·|θ(i)).
end for
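As with PIMH above, the PMMH update of Algorithm 2.7 can be sketched in a few lines (not the thesis code; run_smc(theta) is a hypothetical stand-in returning a draw x ∼ π_{R,K}(·|θ) and the estimate γ_{R,K}(θ) = Z_{R,K}, with any prior term on θ assumed folded into that value; the symmetric random-walk proposal, for which the q-ratio in (2.4) is 1, is an illustrative choice):

import numpy as np

# Minimal PMMH sketch (Algorithm 2.7) with a symmetric random-walk
# proposal on a scalar theta; `run_smc(theta)` is a hypothetical callable.
def pmmh(run_smc, theta0, n_iter, rng, step=0.1):
    theta = theta0
    x, gamma_hat = run_smc(theta)
    draws = []
    for i in range(n_iter):
        theta_star = theta + step * rng.standard_normal()
        x_star, gamma_star = run_smc(theta_star)
        if rng.uniform() < min(1.0, gamma_star / gamma_hat):  # ratio (2.4)
            theta, x, gamma_hat = theta_star, x_star, gamma_star
        draws.append((theta, x))
    return draws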
Chapter 3

Background on Phylogenetics

3.1 Phylogenetic Tree

Let X be a set of observed taxa, typically modern species, each of which is represented by a biological sequence, a string of symbols or characters from a finite alphabet Σ. For example, the alphabet for DNA sequences is the set of four nucleotides {A, C, G, T}. A phylogenetic X-tree represents the relationship among taxa via a tree topology, which is represented by a connected acyclic graph (V, E), where V is the set of vertices for the observed and unobserved taxa and E is the set of edges between taxa. The vertices in X are called external nodes or leaves; these are nodes with only one neighbour. Note that in a tree any two vertices are connected by exactly one simple path; in other words, a tree is a connected graph without cycles. Further, a forest is a disjoint union of trees. In this dissertation, we only consider binary phylogenetic X-trees, which are also the main type of tree studied in the phylogenetics literature. Each vertex of a binary phylogenetic tree has at most three neighbors. A binary phylogenetic tree can be rooted or unrooted. Figure 3.1 shows an example of a rooted tree and an unrooted tree of 5 taxa. A rooted phylogenetic tree is a directed tree with a unique node corresponding to the most recent common ancestor of all the leaf taxa of the tree. In a rooted binary tree, vertices with one, two, and three neighbors correspond to leaves, the root, and internal nodes, respectively. In contrast, an unrooted phylogenetic tree represents the relatedness of the leaf taxa without making assumptions about their common ancestor. In an unrooted binary tree, each vertex has either one or three neighbors.

[Figure 3.1: Examples of a rooted tree and an unrooted tree.]

A rooted phylogenetic X-tree describes the evolution of a set of species in the tree from a common ancestor at the root; we regard the edges as being directed away from the root. Branch lengths are positive real numbers associated with each edge, specifying the amount of evolution between nodes. We consider ultrametric (or clock) trees, in which the leaf nodes are all equally distant from the root. If a tree is a molecular clock tree, its edge lengths can be specified by distances among the taxa. Since a binary tree with n leaves has n − 1 internal nodes including the root, the branch lengths of a clock tree are determined by the n − 1 distances among the taxa. Typically, the branch lengths of a clock tree are reparameterized as differences between the heights of subtrees, as shown in Figure 3.2.

[Figure 3.2: Heights of subtrees on a clock tree. Height 4 is the height of this clock tree.]

Although a clock tree is a convenient representation of the evolutionary relationship among taxa, it is too restrictive and unrealistic because it implies a constant evolutionary rate. In contrast, a non-clock tree refers to a more general tree without these constraints. A non-clock tree is typically represented by an unrooted tree. However, most evolutionary models are reversible, in which case the likelihood of an unrooted tree is equal to the likelihood of the rooted tree obtained by rooting at an arbitrary point in the tree. For a non-clock tree, the branch lengths are determined by 2(n − 1) values.
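To give a sense of the super-exponential growth of the tree space mentioned in Chapter 1, here is a minimal sketch (not from the thesis) using the standard combinatorial fact that the number of distinct rooted binary topologies on n labeled leaves is (2n − 3)!! = 1 · 3 · 5 · ... · (2n − 3):

# Count rooted binary tree topologies on n labeled leaves.
def num_rooted_topologies(n):
    count = 1
    for k in range(3, 2 * n - 2, 2):   # odd factors 3, 5, ..., 2n - 3
        count *= k
    return count

for n in [5, 10, 20, 32]:
    print(n, num_rooted_topologies(n))
# n = 32, the size of the cichlid dataset in Section 3.2.3, already
# gives more than 10^42 topologies, before branch lengths are considered.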
3.2 Data for Phylogenetics

3.2.1 Data Types

Although all kinds of characteristics of species can be used for phylogenetic reconstruction, modern technologies have allowed us to study the evolutionary relationships among species at the molecular level. Deoxyribonucleic acid (DNA), ribonucleic acid (RNA), and proteins are the three major macromolecules that are essential for all known forms of life. Therefore, the commonly-used types of data for phylogenetic analysis include DNA, RNA, and amino acid data. In this dissertation, we reconstruct the phylogenetic relationships among taxa using DNA sequences or RNA sequences to illustrate the performance of our methods, but the methodologies can easily be adapted to any type of biological sequence data. Usually DNA is a right-handed helix composed of two complementary chains twisted around each other. Each chain is a linear polynucleotide consisting of four nucleotides: two purines, adenine (A) and guanine (G), and two pyrimidines, cytosine (C) and thymine (T). The chemical structure of RNA is very similar to that of DNA, but RNA contains uracil (U) instead of thymine (T); i.e. RNA is made up of a long chain of nucleotides from the set {A, U, G, C}. In addition, most RNA molecules are single-stranded and can adopt very complex three-dimensional structures.

3.2.2 Aligned Data via Multiple Sequence Alignments

The common methods of reconstructing phylogenetic trees are based on a single fixed multiple alignment, in which homologous nucleotides are aligned in columns using multiple sequence alignment (MSA) algorithms, e.g. ClustalW [44, 68] and T-Coffee [12]. We use Y to denote the set of observed molecular sequences for an X-tree. For simplicity, we assume that the observation Y takes the form of a matrix (y_{x,s}), where x ∈ X and s denotes an aligned position on the genomes, called a site. In this dissertation, we assume the sites are independent.

3.2.3 Datasets

The DNA sequences we will analyze are aligned protein-coding mitochondrial DNA sequences obtained from 32 species of cichlid fishes [17, 64]. Each DNA sequence consists of 1047 sites. Identical nucleotides are observed at 569 sites, and we use the nucleotides at the remaining 478 sites for phylogenetic tree construction. In this dissertation, we will also use ribosomal RNA (rRNA) sequence data arbitrarily chosen from the Comparative RNA Web (CRW) Site (http://www.rna.ccbb.utexas.edu), which contains RNA sequences and structure information, results of data analysis, interpretation, etc. [13].

3.3 Score-based Methods for Phylogeny Reconstruction

During the past several decades, researchers have proposed many methodologies for reconstructing evolutionary histories from molecular sequence data; [52] provided a good review of these methods. In this section, we briefly introduce two score-based methods, neighbor joining (NJ) [37, 92, 98] and maximum parsimony [36, 75, 93]. We will introduce the model-based methods for phylogeny reconstruction in Section 3.4.

3.3.1 Neighbour-joining Method

The neighbor joining (NJ) method is a bottom-up clustering method for reconstructing phylogenetic trees from the distances between each pair of taxa in the tree [52, 92]. The evolutionary distance represents the number of changes, e.g. A→G, that have occurred along the branches between two sequences. The main advantage of NJ is that it can be implemented in polynomial time, making it capable of handling massive datasets.
The main drawbacks of NJ are rooted in the inaccuracy of approximating evolutionary distances by observed differences between sequences. Please refer to [52] for details.

3.3.2 Maximum Parsimony

The maximum parsimony method searches all possible tree topologies to choose the tree that requires the fewest possible mutations to explain the data. The classical parsimony problem has direct connections with graph theory and enumerative combinatorics. To find the most parsimonious tree, we must have a way of calculating how many changes of state are needed on a given tree. The most famous algorithms for maximum parsimony are Fitch's algorithm [36] and Sankoff's algorithm [93].

The advantages of the maximum parsimony method include that it is fast enough for the analysis of hundreds of sequences and that it is robust if branches are short [52]. The main disadvantage is that maximum parsimony can be inconsistent in the case of 'long-branch attraction' [8, 32]; in other words, two long branches that are not adjacent on the true tree are inferred to be the closest relatives of each other by parsimony. Another drawback is that it makes the unrealistic assumption that the number of mutations is equal on all branches of a tree.

3.4 Model-based Methods for Phylogeny Reconstruction

Because the weight or cost parameters in score-based methods cannot be easily estimated and interpreted, they are replaced by stochastic modelling of nucleotide, codon, or amino acid evolution. The advantages of stochastic modelling include explicit models of evolution, meaningful parameters, and allowance for hypothesis testing and model comparison. The parameters can be estimated by maximum likelihood [33, 46, 62] or Bayesian techniques [67, 70, 77, 82, 113]. Both the maximum likelihood method and Bayesian phylogenetics are heavily dependent on the probabilistic model of evolution. We will first introduce stochastic evolutionary models and then provide more details on Bayesian phylogenetics.

3.4.1 Stochastic Evolutionary Model

Continuous time Markov chain (CTMC) over characters

The standard technique to specify the transition probability of a node v in a graphical model given its parent is to use a stochastic process. A stochastic process is a collection of random variables indexed by a set S, {Y_s, s ∈ S}. In phylogenetics, we use a stochastic process called a CTMC, where S = [0, T] represents the time interval for a large number of successive generations and the individual variables Y_s are assumed to have a countable domain Σ, e.g. Σ = {A, C, G, T} for DNA data.

The probability of transitioning from Y_0 = i to Y_T = j along a branch of length T is denoted by the (i, j)-th element, P_{i,j}(T), of a matrix P(T). The matrix of transition probabilities satisfies the matrix differential equation P′ = PQ with initial condition P(0) = I, where I is the identity matrix, and Q is called the rate matrix. Under the Markov assumption, P(T) must satisfy P(T + T*) = P(T)P(T*). This implies that the solution of P′ = PQ is

$$P(T) = \exp(QT) = I + \sum_{i=1}^{\infty} Q^i T^i / i!, \qquad (3.1)$$

where exp is the matrix exponential, and I denotes the |Σ| × |Σ| identity matrix.

In practice, the matrix exponential is computed by diagonalization. Denote by λ_1, · · · , λ_{|Σ|} the eigenvalues of Q, and by U the invertible matrix whose columns are the corresponding eigenvectors. Using the diagonalization Q = UΛU^{-1}, Equation (3.1) becomes

$$P(T) = \exp(QT) = U\,\mathrm{diag}\big(e^{\lambda_1 T}, e^{\lambda_2 T}, \cdots, e^{\lambda_{|\Sigma|} T}\big)\,U^{-1}.$$
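The diagonalization above translates directly into code. The following numpy sketch computes P(T) for a given rate matrix; the K2P construction (states ordered A, C, G, T) anticipates the models introduced below, and the helper names are our own:

```python
import numpy as np

def transition_matrix(Q, T):
    """P(T) = exp(QT) via the eigendecomposition Q = U diag(lambda) U^{-1}."""
    lam, U = np.linalg.eig(Q)
    P = U @ np.diag(np.exp(lam * T)) @ np.linalg.inv(U)
    return P.real                      # reversible models have real eigenvalues

def k2p_rate_matrix(kappa=2.0):
    """Kimura two-parameter rate matrix; states ordered A, C, G, T."""
    pi = 0.25
    Q = np.full((4, 4), pi)            # transversion entries (alpha = 1)
    Q[0, 2] = Q[2, 0] = kappa * pi     # A <-> G transitions
    Q[1, 3] = Q[3, 1] = kappa * pi     # C <-> T transitions
    np.fill_diagonal(Q, 0.0)
    np.fill_diagonal(Q, -Q.sum(axis=1))  # rows of Q sum to zero
    return Q

P = transition_matrix(k2p_rate_matrix(), T=0.1)
assert np.allclose(P.sum(axis=1), 1.0)   # rows of P(T) are distributions
```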
There are many evolutionary models for the rate matrix of DNA or RNA sequences. The HKY85 model introduced by [47] is one of the popular ones. In HKY85, the rate matrix is modelled by

$$Q = \begin{pmatrix} - & \pi_C r_{tv} & \pi_G r_{ti} & \pi_T r_{tv} \\ \pi_A r_{tv} & - & \pi_G r_{tv} & \pi_T r_{ti} \\ \pi_A r_{ti} & \pi_C r_{tv} & - & \pi_T r_{tv} \\ \pi_A r_{tv} & \pi_C r_{ti} & \pi_G r_{tv} & - \end{pmatrix},$$

where r_tv is the transversion¹ rate, r_ti is the transition² rate, and (π_A, π_C, π_G, π_T) is the stationary distribution of the CTMC. The notation − is short for minus the sum of all the other entries in the same row; for example, in the first row of Q, it represents −(π_C r_tv + π_G r_ti + π_T r_tv). The transition probabilities, P(T), of the HKY85 model are calculated using Equation (3.1); the explicit expression can be found in [70].

¹ Transversion refers to a mutation between a pyrimidine and a purine (A↔C, A↔T, G↔C, or G↔T).
² Transition refers to a mutation between two pyrimidines (T↔C) or two purines (A↔G).

The HKY85 model is often reparameterized by the transition to transversion ratio, i.e. κ = r_ti / r_tv, representing a relative bias toward the occurrence of transitional events over transversional events. More precisely,

$$Q = \alpha \begin{pmatrix} - & \pi_C & \kappa\pi_G & \pi_T \\ \pi_A & - & \pi_G & \kappa\pi_T \\ \kappa\pi_A & \pi_C & - & \pi_T \\ \pi_A & \kappa\pi_C & \pi_G & - \end{pmatrix},$$

where the parameter α represents an overall transversion rate.

A special case of the HKY85 model is the Kimura two-parameter (K2P) model [60], in which the stationary state frequencies are fixed to be equal, i.e. π_A = π_C = π_G = π_T = 0.25. When the K2P model is used for phylogenetic reconstruction, the parameter α and the branch lengths are jointly unidentifiable. Without loss of generality, we assume α = 1. In this dissertation, we use the K2P model with parameter κ = 2 as the default model.

Let θ denote the vector of parameters of the evolutionary model, θ ∈ Θ. We call θ the evolutionary parameters. In the case of the K2P model, θ ≡ κ. Recall that we use Y to denote the observed biological sequences. From now on, we use t to denote a phylogenetic tree in a tree space 𝒳.

The likelihood, L(θ, t|Y), of the evolutionary parameter θ and the phylogenetic tree t is calculated by the sum-product algorithm, which can be easily expressed as a recursion. Let L_v(j) denote the probability of everything that is observed below the node v conditional on v having character j. We start the recursion by calculating the likelihoods of the leaves: if the leaf v has observation i, let L_v(i) = 1, and L_v(j) = 0 for all j ∈ Σ, j ≠ i. Then the sum-product algorithm computes L_v(i) for each internal node v on the tree from the conditional likelihood functions of its immediate descendant nodes, C(v). Mathematically, this recursion can be expressed as follows:

$$L_v(i) = \begin{cases} 1, & \text{if } v \in X \text{ and } Y(v) = i \\ 0, & \text{if } v \in X \text{ and } Y(v) \neq i \\ \prod_{v' \in C(v)} \sum_{j} P_{ij}(b_{v'})\, L_{v'}(j), & \text{if } v \text{ is an internal node} \end{cases} \qquad (3.2)$$

where P_{ij}(b_{v'}) is the transition probability of changing from i to j along the branch of length b_{v'}, calculated by Equation (3.1). The sum-product algorithm continues the recursion until it computes L_0(i), the probability of the tree with the root having character i. Finally, the likelihood is computed by

$$L(\theta, t\,|\,Y) = \sum_{i \in \Sigma} \pi_i\, L_0(i). \qquad (3.3)$$
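For a single site, the recursion in Equations (3.2)-(3.3) is only a few lines of code. In the sketch below, a tree is stored as nested tuples (left, left branch length, right, right branch length), a leaf is an observed character index, P is a function returning the transition matrix of Equation (3.1) for a branch length, and pi is the stationary distribution; this encoding is an illustrative choice. Under the site-independence assumption of Section 3.2.2, the full likelihood is the product of such site likelihoods.

```python
import numpy as np

def site_likelihood(tree, P, pi, n_states=4):
    """Sum-product (pruning) likelihood of one site; Eqs. (3.2)-(3.3)."""
    def L(node):
        if isinstance(node, int):              # leaf: observed character index
            vec = np.zeros(n_states)
            vec[node] = 1.0
            return vec
        left, b_l, right, b_r = node
        # product over children of sum_j P_ij(b) L_child(j)
        return (P(b_l) @ L(left)) * (P(b_r) @ L(right))
    return float(pi @ L(tree))
```

With the K2P sketch above, P can be taken as `lambda b: transition_matrix(k2p_rate_matrix(), b)` and pi as the uniform stationary distribution.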
String-valued Continuous Time Markov Chain (SCTMC)

The most popular SCTMC is the TKF91 model proposed in the pioneering paper of [106]. TKF91 is a time-reversible Markov model for single-nucleotide insertions or deletions (indels). TKF91 permits a proper statistical analysis of two sequences, by obtaining pairwise maximum likelihood estimates of sequence alignments and of the evolutionary distance between the two sequences. Its main advantage is that its marginals P(Y_T = s | Y_0) can be expressed as a string transducer with a set of weights obtained from closed-form functions of T, so that it is as fast as score-based approaches. TKF92 [107] was developed based on TKF91 to deal with arbitrary-length non-overlapping indels by breaking one biological sequence into indivisible fragments. The math for TKF92 is similar to that of the TKF91 model, but with "residues" replaced by "fragments". Furthermore, [79] presented a probabilistic "long indel" model of sequence evolution, allowing for overlapping indels of arbitrary length. For more background and references, see http://biowiki.org/.

Researchers have used SCTMCs to compute the marginals of the process in order to obtain the probability of trees (and alignments) [54, 73, 106]. However, a computational bottleneck has prevented broader applications based on SCTMCs. We will mainly focus on the CTMC over characters in this dissertation, and explore a computational method for a general SCTMC in Chapter 6.

3.4.2 Maximum Likelihood

In maximum likelihood inference, trees and evolutionary parameters are estimated by maximizing the likelihood function in Equation (3.3), which fully captures what the data tell us about the phylogeny under a given model [46]. Finding the maximum of the likelihood function involves searching through a multidimensional parameter space, and whether maximum likelihood is computationally expensive depends on the thoroughness of the search. Although maximum likelihood [33] can be efficient for obtaining point estimates of phylogenies, it is difficult to determine statistical confidence [40]. The most commonly used procedure for assessing phylogenetic uncertainty is the nonparametric bootstrap [34]. Many studies have focused on comparing the nonparametric bootstrap and posterior probabilities with regard to estimating tree uncertainty [1, 20, 29, 96, 102]; these studies show that the bootstrap tends to be too conservative.

3.4.3 Framework of Bayesian Phylogenetics

Bayesian phylogenetics has a strong connection to the maximum likelihood method. The same evolutionary models used in maximum likelihood can be used in Bayesian methods simply by specifying the prior distributions for the tree and evolutionary parameters. Priors incorporate previous knowledge from sources other than the data at hand. Bayesian methods estimate the posterior distribution, which is a function of both the likelihood and the prior.

In this dissertation, we will focus on Bayesian phylogenetic methods because of several advantages. First, Bayesian methods allow complex models of sequence evolution to be implemented. Second, Bayesian phylogenetics produces both a tree estimate and measures of uncertainty for the groups on the tree. Third, Bayesian methods might be a faster way to assess support for trees than maximum likelihood bootstrapping [25].

Recall that Y is a set of observations on the leaves of a phylogenetic X-tree. For X′ ⊂ X, we use the notation Y(X′) for the subset of observations corresponding to a subset X′ of the leaves. Our objective is to use n observed biological sequences, Y, to estimate a phylogenetic tree using Bayesian approaches. Let θ ∈ Θ denote the vector of parameters of the evolutionary model, and let p(θ) denote the prior for θ.
For a tree t ∈ 𝒳, the density of the prior distribution given θ is denoted by p(t|θ). The probability model for observed data Y conditioned on a specific choice of parameters θ and tree t is P(Y|θ, t). Bayesian inference is based on the posterior density

$$p(\theta, t\,|\,Y) = \frac{P(Y|\theta, t)\, p(t|\theta)\, p(\theta)}{\int_\theta \int_t P(Y|\theta, t)\, p(t|\theta)\, p(\theta)\, dt\, d\theta}, \qquad (3.4)$$

where P(Y|θ, t) is the likelihood function L(θ, t|Y) in Equation (3.3). All statistical inferences are based on trees and model parameter values sampled from their joint posterior distribution.

The total number of distinct labelled topologies of a rooted tree of n leaves is (2n − 3)!! [95], which increases at a super-exponential rate as the number of taxa increases. For example, when n = 10, the number of possible tree topologies is greater than 34 million. Therefore, the denominator of the posterior density in Equation (3.4) involves high-dimensional summations and integrals for which there is no closed-form solution except for very small trees. Since the likelihood functions for phylogenetic models are typically too complex to integrate analytically, we use Monte Carlo methods to approximate the posterior distribution of phylogenetic trees and evolutionary parameters. Conventional Bayesian phylogenetic inference uses MCMC to approximate the posterior probabilities of trees, which will be the focus of the next subsection.

3.4.4 Bayesian Phylogenetics via MCMC

MCMC is conventionally used for approximating the posterior distribution of phylogenetic trees [50–52, 66, 67, 70, 77, 82, 113]. Many user-friendly software packages have been developed for implementing MCMC for phylogenetics; the most popular packages include BAMBE [97], MrBayes [55, 90, 91], and BEAST [28].

The space under consideration is the joint space of all possible trees and all evolutionary parameters, denoted Ω = Θ × 𝒳. An MCMC algorithm samples a dependent sequence of points from the space Ω such that after some point in the sequence, all subsequent sampled points are distributed approximately according to the posterior distribution. Each iteration of the MH algorithm, described in Algorithm 2.1, proposes a new specification of the phylogenetic tree and evolutionary parameters, usually similar to the present one, and the MH ratio is used to accept or reject it.

The main difficulty in Bayesian phylogenetic inference with MCMC lies in the efficiency with which topology proposals sample tree space. Ideally, a well-designed proposal distribution leads to a well-mixing Markov chain that can rapidly traverse the posterior distribution, so that inferences will be sufficiently accurate based on a computationally feasible sample. However, MCMC imposes relatively strict constraints on the types of proposals that can be used, making it challenging to design fast-mixing proposal distributions. Due to combinatorial constraints, the distribution on tree space is often complex and multimodal; bold moves are likely to result in sample points with low posterior probabilities because the phylogenetic tree space is so huge that the posterior density is low in most places. The design of the currently used proposal distributions leads to a low acceptance rate of transitions unless a small neighborhood is used [51, 66]. Therefore, the proposed point is usually close to the present point of the chain, requiring the MCMC chain to run for a long time to explore the full possibilities of 'tree space'.
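The (2n − 3)!! count is easy to evaluate directly, which makes the scale of the problem concrete; a small sketch:

```python
def num_rooted_topologies(n):
    """(2n - 3)!!: the number of distinct labelled rooted binary topologies."""
    count = 1
    for k in range(3, 2 * n - 2, 2):   # product of the odd numbers 3, 5, ..., 2n-3
        count *= k
    return count

print(num_rooted_topologies(10))  # 34459425, i.e. greater than 34 million
```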
With a few exceptions [50, 51, 66], the proposals used by current phylogenetic MCMC samplers [67] have remained unchanged in the past decade. [67] discussed two types of MCMC algorithms, the GLOBAL algorithm and the LOCAL algorithm, programmed in their software package BAMBE. The GLOBAL algorithm modifies all branch lengths and potentially changes the tree topology simultaneously. In contrast, the LOCAL algorithm proposes changes to only small portions of the tree. Each algorithm has two versions: one assumes a molecular clock and one assumes a non-clock tree. A composition of two different basic update mechanisms is used to traverse the space of phylogenetic trees and evolutionary parameters. One MCMC cycle consists of two stages: the first stage keeps the current tree fixed, proposes new evolutionary parameters, and either accepts or rejects them according to the MH ratio; the second stage modifies the current tree while holding the evolutionary parameters fixed.

[66] evaluated two classes of proposal distributions that change topology and branch lengths simultaneously in phylogenetic inference for unrooted trees. More precisely, branch-change proposals modify branch lengths or branch attachment points continuously in a way that produces topology changes in some cases; branch-rearrangement proposals use pruning-regrafting moves to prune a subtree from the rest of the tree and regraft it somewhere else, or use swapping moves to simply trade places of two subtrees. [51] studied different proposal distributions for constrained rooted clock trees. [50] explored new MCMC proposal distributions to find more efficient MCMC algorithms.

In this dissertation, we will use the nearest neighbor interchange (NNI) [58] to propose a new tree topology, as illustrated in Figure 3.3. The edge e has four nearest neighbors, A, B, C, and D, each of which can be a subtree. By interchanging the four nearest neighbors, the NNI proposal changes the first topology to one of the second and third topologies at random.

Figure 3.3: The Nearest Neighbor Interchange (NNI) Proposal [58].

One proposal for branch lengths, the multiplicative branch proposal, picks one edge at random and multiplies its current value by a random number distributed uniformly in [1/a, a] for some fixed parameter a > 1 (controlling how bold the move is) [66]; a sketch follows.
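A minimal sketch of this move (the Hastings correction noted in the comment follows from the asymmetry of the uniform multiplier and is our own addition, not part of the citation above):

```python
import random

def multiplicative_branch_proposal(branch_lengths, a=2.0):
    """Pick one edge uniformly and multiply its length by m ~ Uniform[1/a, a]."""
    proposed = list(branch_lengths)
    e = random.randrange(len(proposed))
    m = random.uniform(1.0 / a, a)
    proposed[e] *= m
    # the move is asymmetric: q(b -> b') = 1 / (b (a - 1/a)) on [b/a, a b],
    # so the MH acceptance ratio must include the Hastings factor
    # q(b' -> b) / q(b -> b') = b / b' = 1/m
    return proposed, 1.0 / m
```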
In addition to the difficulty in designing proposal distributions, assessing whether the chain has converged is a crucial issue when using MCMC. Moreover, existing MCMC proposals are computationally expensive because they require recalculating the likelihood of the full tree from scratch, even when only a small perturbation is made to the current tree. In order to speed up computation, the next chapter of this dissertation focuses on our work exploring SMC algorithms for reconstructing general phylogenetic trees.

3.5 Consensus Tree and Tree Distances

In this section, we address the problem of summarizing a sample of phylogenetic trees, e.g. from MCMC algorithms. Moreover, we introduce tree distances between two trees for the purpose of evaluating an estimated tree. In contrast to summarizing a sample of real numbers, summarizing a sample of trees, which are composed of a tree topology and branch lengths, is not a trivial problem. The summary tree usually refers to the maximum a posteriori (MAP) tree or the consensus tree [67]. Consensus trees are trees that summarize the information contained in a set of trees having the same species on their leaves. In this dissertation, we use the majority-rule consensus tree, which consists of those groups that are present in no less than half of the trees [33].

We measure the distance between a reference tree t and an estimated consensus tree t′ using three types of tree distance metrics: Robinson-Foulds (RF) [89], the partition metric (PM) [35], and Kuhner-Felsenstein (KF) [65]. We first discard the edge directions from rooted trees to get unrooted trees. Each branch on an unrooted tree partitions the whole set of leaves into two unordered subsets, called a bipartition. We use S(t) to denote the set of all the bipartitions of t: S(t) = {B_i, i = 1, · · · , n_e}, where B_i is the bipartition resulting from the i-th edge. The set of differing bipartitions of t and t′ is denoted by D(t, t′) = S(t) △ S(t′), where A_1 △ A_2 denotes the symmetric difference of the sets A_1 and A_2.

The partition metric of t and t′ is defined as the number of their differing bipartitions, denoted |D(t, t′)|. The RF metric of t and t′ is defined as $\sum_{B \in D(t,t')} |b(B; t) - b(B; t')|$, where b(B; t) denotes the length of the branch corresponding to the bipartition B on tree t (taken to be zero when B is absent from t). The KF metric is defined as $\sum_{B \in D(t,t')} (b(B; t) - b(B; t'))^2$.

To make these distance metrics more comparable across different trees, we normalize the partition metric of t and t′:

$$\text{Normalized partition} = \frac{|D(t, t')|}{|S(t)| + |S(t')|}, \qquad (3.5)$$

and we normalize the RF metric of t and t′:

$$\text{Normalized RF} = \frac{\sum_{B \in D(t,t')} |b(B; t) - b(B; t')|}{\sum_{i}^{n_e} b(B_i; t) + \sum_{j}^{n_e'} b(B_j'; t')}. \qquad (3.6)$$
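These definitions map directly onto set operations. In the sketch below, each unrooted tree is summarized by a dictionary from bipartitions (e.g. frozensets of the leaves on one side of an edge) to branch lengths; the encoding is an illustrative choice:

```python
def partition_metric(bt, bu):
    """|D(t, t')|: the size of the symmetric difference of bipartition sets."""
    return len(set(bt) ^ set(bu))

def normalized_partition(bt, bu):
    return partition_metric(bt, bu) / (len(bt) + len(bu))        # Eq. (3.5)

def rf_metric(bt, bu):
    """Sum of |b(B;t) - b(B;t')| over the differing bipartitions D(t, t')."""
    diff = set(bt) ^ set(bu)
    return sum(abs(bt.get(B, 0.0) - bu.get(B, 0.0)) for B in diff)

def normalized_rf(bt, bu):
    return rf_metric(bt, bu) / (sum(bt.values()) + sum(bu.values()))  # Eq. (3.6)

def kf_metric(bt, bu):
    """Sum of squared branch-length differences over D(t, t')."""
    diff = set(bt) ^ set(bu)
    return sum((bt.get(B, 0.0) - bu.get(B, 0.0)) ** 2 for B in diff)
```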
Chapter 4

Combinatorial SMC

Sequential Monte Carlo (SMC) is another approach to the approximation of posterior distributions, which has been very successful in state-space models [15, 26, 27, 71] and has more recently been extended to more general settings [22, 23]. However, because of the intricacies of phylogenetic tree spaces, it is non-trivial to directly apply these general frameworks to posterior tree inference.

Previous work on applying SMC to phylogenetics has been limited in two important ways. First, it has relied on more restrictive SMC algorithms [9, 43, 105], and as a consequence is limited in the types of phylogenetic proposals that can be used. Second, no previous results have been published on applying SMC to non-clock trees. This is an important limitation, as most current work in phylogenetics depends on non-clock tree models.

Our contribution is to show how both of these limitations can be addressed using a new framework for building SMC phylogenetic tree inference algorithms. In our combinatorial sequential Monte Carlo (CSMC) framework, the flexibility in the proposal distributions generalizes both MCMC methods and previous work on phylogenetic SMC. In particular, this flexibility makes it possible to construct non-clock tree proposals that do not require recalculating the likelihood of the full tree at each proposal step.

The proposed SMC algorithm is motivated by a certain over-counting problem in sequentially constructing a non-clock phylogenetic tree. A conventional SMC algorithm will favour the trees that can be constructed in multiple ways, whereas in our algorithm, a graded poset on an extended combinatorial space is used to compute correction terms.

The remainder of this chapter is organized as follows. Section 4.1 introduces some notation used in this chapter. Section 4.2 describes the CSMC algorithm, its asymptotic consistency properties, and its application to phylogenetics. In Section 4.3, we show results of simulation studies and apply the proposed SMC methods to real RNA datasets. Section 4.4 concludes this chapter and provides some discussion.

4.1 Notation

Recall that we use π to denote the normalized measure of interest, which can be expressed as π = γ/Z using the unnormalized measure γ and the normalizing constant Z. In this chapter, we introduce another notation, ‖γ‖, to denote the normalizing constant, i.e. Z ≡ ‖γ‖, for the purpose of describing the SMC algorithm in a compact form. In Bayesian phylogenetics, π is the normalized posterior over trees, i.e. π = p(θ, t|Y); the unnormalized posterior γ is a measure over trees t with density p(θ, t, Y) = P(Y|θ, t)p(t|θ)p(θ); and the normalizing constant is the data likelihood in the denominator of (3.4), i.e. ‖γ‖ = P(Y). In this chapter, we assume that the evolutionary parameters θ are fixed and we focus on phylogenetic tree reconstruction; we therefore omit θ from the notation for simplicity.

We are interested in approximating expectations of a function φ with respect to π. We write πφ as a short-hand for the integral of φ with respect to π, i.e. πφ = ∫ π(dt)φ(t). As a concrete example of the posterior expectation of a function φ in phylogenetics, we define φ as an indicator of whether a clade, a group consisting of a species and all its descendants, is in a phylogenetic tree t or not. Let clades(t) denote the set of all clades in t. For a clade a, define φ(t) = 1(a ∈ clades(t)), where 1 is an indicator function.

4.2 Method

In this section, we describe CSMC, an algorithm to approximate expectations πφ and the normalization of an unnormalized positive measure γ on a target combinatorial space 𝒳. To simplify the presentation, we first introduce the algorithm under the assumption that 𝒳 is a finite but large combinatorial set, e.g. a set of tree topologies. We will show in Section 4.2.5 that this assumption can be relaxed, e.g. to accommodate branch lengths in phylogenetics.

4.2.1 Discrete Setup

At a high level, the main assumption on which our algorithm depends is that any object t in the target sample space 𝒳 can be constructed incrementally using a sequence of intermediate states s_0, s_1, . . . , s_R = t. Here s_r will be called a partial state of rank r, the set of all partial states of rank r will be denoted by S_r, and the set of partial states across all ranks will be denoted by S = ∪_r S_r. Our terminology (partial states, rank) has an order-theoretic motivation that will be described in more detail shortly. Note that we consider phylogenies without branch lengths in this subsection. In phylogenetics for example, the partial states we will use are based on forests.
Consider for example a discrete rooted X-tree as in Figure 4.1. In this section, we use the term tree as a short-hand for tree topology. Such a tree can be constructed by starting from the disconnected graph with vertices X, and by adding, at each step, one internal node v ∉ X and a pair of edges connected to v, such that the number of connected components decreases by one at every step.

Figure 4.1: An example of the partial states of a discrete rooted X-tree (forests s_0 = ⊥, s_1, s_2, s_3 over the leaves A, B, C, D).

More precisely, we will build each rooted X-tree t by proposing a sequence of X-forests s_0, s_1, . . . , s_R = t, where an X-forest s_r = {(t_i, X_i)} is a collection of rooted X_i-trees t_i such that the disjoint union of the leaves of the trees in the forest is equal to the original set of leaves, ∪_i X_i = X. Note that with this specific construction, a forest of rank r has |X| − r trees.

The sets of partial states considered in this section are assumed to satisfy the following three conditions:

1. The sets of partial states of different ranks should be disjoint, i.e. S_r ∩ S_s = ∅ for all r ≠ s (in phylogenetics, this holds since a forest with r trees cannot be a forest with s trees when r ≠ s).

2. The set of partial states of smallest rank should have a single element, denoted by ⊥, S_0 = {⊥} (in phylogenetics, ⊥ is the disconnected graph on X).

3. The set of partial states of rank R should coincide with the target space, i.e. S_R = 𝒳 (in phylogenetics, at rank R = |X| − 1, forests have a single tree and are members of the target space 𝒳).

These conditions will be subsumed by the more general framework of Section 4.2.5, but the more concrete conditions above help in understanding the poset framework.

In order to put the CSMC algorithm into service, the user first needs to specify an extension of the measure γ (the corresponding normalized measure is denoted by π), defined only on the target space 𝒳, into a measure over the larger space S. The restriction of this extended measure to 𝒳 should coincide with the target measure γ. We abuse notation and use γ and π for both the extended and the target measures. In phylogenetics for example, a measure over X-trees is usually specified by the user: in the Bayesian setup of Section 3.4.3 it is given by a prior times a likelihood:¹

γ(t) = γ_Y(t) = P(Y|t)p(t),

where we are assuming fixed parameters θ, an assumption we will relax in Chapter 5. A natural choice in non-clock models to obtain an extension of γ into forests is to take a product over the trees in the forest s as follows:

$$\gamma(s) = \prod_{(t_i, X_i) \in s} \gamma_{Y(X_i)}(t_i).$$

We call this choice of extension the natural forest extension, and note that a range of other choices are possible (see Section 4.4 for more examples).

¹ Since we assume a discrete space in this section, we write γ(t) = γ({t}) and denote the probability mass function of the prior by p(t).
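Evaluating the natural forest extension is then a product of per-tree terms; a minimal sketch, with a forest stored as a list of (tree, leaf set) pairs and gamma_tree standing in for the user-specified tree measure (e.g. prior times pruning likelihood):

```python
def gamma_forest(forest, gamma_tree):
    """Natural forest extension: gamma(s) = product of gamma over trees in s."""
    value = 1.0
    for tree, leaves in forest:
        value *= gamma_tree(tree, leaves)
    return value
```

A practical consequence is that a merge only requires computing the term for the newly created tree; the factors of untouched trees can be reused, which is what allows CSMC proposals to avoid recomputing the likelihood of the full tree.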
4.2.2 Algorithm

Algorithm 4.1 The Combinatorial Sequential Monte Carlo Algorithm
1. Step 1: initialization
   (a) for all k ∈ {1, . . . , K},
       i. set s_{0,k} to ⊥
       ii. set w_{0,k} to 1/K
   (b) construct γ_{0,K} using Equation (4.1)
2. Step 2: for iteration r = 1, 2, . . . , R,
   (a) for all k ∈ {1, . . . , K}:
       i. sample s̃_{r−1,k} ∼ π_{r−1,K}
       ii. sample s_{r,k} ∼ ν⁺_{s̃_{r−1,k}}
       iii. compute w_{r,k} = w(s̃_{r−1,k}, s_{r,k}) using Equation (4.2)
   (b) construct γ_{r,K} using Equation (4.1)
3. Step 3: return γ_{R,K}

In this section, we introduce the CSMC algorithm, an iterative procedure that approximates the target measure π in R iterations. At each iteration r, a list of K partial states is kept in memory. Each element of this list is called a particle, denoted s_{r,1}, s_{r,2}, . . . , s_{r,K} ∈ S_r. We also assume there is a positive weight w_{r,k} associated with each particle s_{r,k}. Refer to Algorithm 4.1 for an overview of the steps described in more detail in the following. The general form of the algorithm has similarities with standard SMC algorithms, with the important exception of the weight updates, which need to be altered to accommodate general combinatorial structures.

Given a list of weighted particles, we construct a discrete positive measure as follows:

$$\gamma_{r,K}(A) = \|\gamma_{r-1,K}\| \, \frac{1}{K} \sum_{k=1}^{K} w_{r,k}\, \delta_{s_{r,k}}(A), \quad \text{for all } A \subset S. \qquad (4.1)$$

The algorithm constructs these discrete measures recursively, and returns the measure γ_{R,K} at termination. Note that in most descriptions of standard SMC algorithms, the algorithm is described in terms of a target probability distribution π. Allowing γ and the intermediate particle populations to be unnormalized has the advantage of revealing more explicitly the connection between the estimators for the evidence P(Y) and the posterior distribution p(t|Y).

The algorithm is initialized at rank r = 0 with K copies of the least partial state ⊥ described in the previous section. Each of these copies is associated with a uniform initial weight, 1/K. Given the empirical measure γ_{r−1,K} from the previous population of particles, a new list of particles and weights is created in three steps; Figure 4.2 provides a graphical illustration.

Figure 4.2: A graphical illustration of the SMC algorithm (the resampling, proposal, and weighting steps acting on a population of particles).

The first step is to select a promising list of particles while preserving the correct asymptotic distributions. In this section we use the simplest scheme, multinomial resampling, to accomplish this, but other schemes are possible and are discussed in Section 2.2.2. In multinomial resampling, this first step is simply done by resampling K times from the normalized discrete measure π_{r−1,K}. We denote these sampled particles by s̃_{r−1,1}, s̃_{r−1,2}, . . . , s̃_{r−1,K}.

The second step is, for each particle s̃_{r−1,k} in this list, to use the proposal ν⁺_{s̃_{r−1,k}} to grow the resampled particle into a new particle of rank r, denoted by s_{r,k}. In order to do this, the user needs to specify a proposal probability kernel ν⁺_s : S → [0, 1] for all s ∈ S. Given an initial partial state s and destination partial state s′, we denote the probability of proposing s′ from s by ν⁺_s(s′), or by the more readable notation ν⁺(s → s′), especially when subscripts are involved. We assume that the successors proposed from a partial state of rank r will always have rank r + 1, i.e. if s ∈ S_r and ν⁺_s(s′) > 0, then s′ ∈ S_{r+1}. This proposal is applied independently to sample a successor to each particle s̃_{r−1,k}: s_{r,k} ∼ ν⁺_{s̃_{r−1,k}}. For example, when building discrete rooted X-trees, the proposal needs to select a pair of trees to merge. One simple choice is to pick a pair uniformly at random among the $\binom{|X|-r}{2}$ pairs; other choices are discussed in Section 4.4.

The third step is to compute a weight for each of these new particles.
This is done using the following formula:

$$w_{r,k} = w(\tilde{s}_{r-1,k}, s_{r,k}) = \frac{\gamma(s_{r,k})}{\gamma(\tilde{s}_{r-1,k})} \cdot \frac{\nu^-(s_{r,k} \to \tilde{s}_{r-1,k})}{\nu^+(\tilde{s}_{r-1,k} \to s_{r,k})}, \qquad (4.2)$$

where we call the function ν⁻ an overcounting function. Note that while this weight update looks like a Metropolis-Hastings ratio, it has the fundamental difference that ν⁺ ≠ ν⁻ in general. The overcounting correction is more closely related to the backward kernels of [22, 23], but because of the combinatorial nature of the space, the poset framework plays an instrumental role in constructing ν⁻ in the types of spaces we are interested in.

After running this algorithm, the discrete positive measure γ_{R,K} can be used to approximate γ, where γ_{R,K} is obtained by applying Equation (4.1) to the particles output at the last generation. Concretely, this means that ‖γ_{R,K}‖ can be used to estimate P(Y), and for any statistic of interest φ, its expectation under the posterior can be estimated by π_{R,K}φ. In the next section, we show how ν⁻ can be selected to guarantee convergence of the algorithm to the target distribution π when K → ∞. The precise meaning of convergence will be discussed in Sections 4.2.4 and 4.2.5.
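One generation of Algorithm 4.1, with the weight update of Equation (4.2), can be sketched schematically as follows; gamma, propose (returning a successor along with its forward density ν⁺), and nu_minus are user-supplied callables, and the names are placeholders rather than a fixed API:

```python
import random

def csmc_iteration(particles, weights, gamma, propose, nu_minus):
    """Resample, propose, and reweight one particle generation (Eq. 4.2)."""
    K = len(particles)
    resampled = random.choices(particles, weights=weights, k=K)  # multinomial
    new_particles, new_weights = [], []
    for s in resampled:
        s_new, q_fwd = propose(s)                 # s_new ~ nu_plus(s -> .)
        w = (gamma(s_new) / gamma(s)) * (nu_minus(s_new, s) / q_fwd)
        new_particles.append(s_new)
        new_weights.append(w)
    return new_particles, new_weights
```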
4.2.3 Overcounting Correction

In the previous section, we described an algorithm similar to standard SMC, with the exception of the overcounting correction, which has an extra factor ν⁻ in the weight update. Before explaining how to compute this correction in general, we first describe the problem that would arise if we were to omit the correction in the specific case of phylogenetic non-clock inference.

To simplify the discussion, we start by considering a model with no observations (in practice, sampling from the prior would not require an approximate algorithm, but it is instructive to start by studying this simpler example). For this example, refer to Figure 4.3 (a), where we show all the different sequences of partial states (forests) leading to one of the 1 · 3 · 5 = 15 fully specified states (rooted X-trees). An arrow between partial states s and s′ in this figure means that s′ can be obtained from s by one application of the proposal, i.e. that ν⁺(s → s′) > 0.

Figure 4.3: (a) All the different sequences of partial states (forests) leading to fully specified states (rooted X-trees). (b) An example of the set of parents 𝒫(s′) of a partial state s′. (c) An example of a simple cyclic poset.

It can be seen in the figure that a balanced binary tree on four taxa, for example one with rooted clades {A, B}, {C, D}, can be constructed in two distinct ways: either by first merging A and B, then C and D, or by first merging C and D, then A and B. On the other hand, an unbalanced tree on the same set of taxa can be constructed in only one way; for example, the tree with clades {A, B}, {A, B, C} can only be constructed by first proposing to merge A and B, then C and {A, B}. A consequence of this dichotomy is that under the uniform proposal, the expected fraction of particles with a balanced topology of each type is 2/21, while it is 1/21 for unbalanced topologies (since there are 21 proposal paths, 2 for each balanced topology, 1 for each unbalanced one). Since we would like the probability π of each topology to be close to 1/5, the naive estimate is therefore inconsistent.

In order to resolve this issue (by defining an appropriate overcounting function), it will be useful to formalize the graph shown in Figure 4.3 (c). This can be done by defining a partial order ≤ on S. Recall that (S, ≤) is called a partially ordered set, or briefly a poset, if ≤ is a binary relation on S that is reflexive, antisymmetric, and transitive [99]. Also, for s, s′ ∈ S, we say that s is covered by s′, written s ≺ s′, if s ≤ s′ and there is no z ∈ S between s and s′. Since the covering relation determines the partial order in a finite ordered set, in our combinatorial setup we can induce a poset on the extended space S by deeming that s′ covers s if and only if ν⁺(s → s′) > 0.

By the way we defined S = ∪_r S_r, the order ≤ actually has an extra structure called a rank ρ: a function from S to {0, 1, · · · , R} such that ρ(s_0) = 0 if s_0 is a minimal element of the poset, and ρ(s′) = ρ(s) + 1 if s′ covers s in S. With these definitions, graphs such as Figure 4.4 (left) can then be understood as the Hasse diagram corresponding to this induced poset, namely a graph where the set of vertices is S and there is an edge between s and s′ whenever s′ covers s.

In previous work [9], the overcounting problem has been avoided by forbidding proposals ν⁺ that induce a cyclic Hasse diagram. In the CSMC algorithm, ν⁻ is used to avoid this artificial restriction. We present the solution for discrete spaces S in this section; the discrete assumption is lifted in Section 4.2.5. Generally, in order to have consistency as in Equation (4.6), the only requirement on ν⁻ is:

Assumption 3. For all s, s′ ∈ S, ν⁺(s → s′) = 0 implies ν⁻(s′ → s) = 0.

Let 𝒫 : S → F_S denote the map taking a partial state s′ to its set of parents 𝒫(s′). In phylogenetics for example, the number of parents is equal to the number of nontrivial trees in the forest, where a tree is said to be trivial if it has a single leaf in it (see Figure 4.3 (b) for an example). A simple choice of ν⁻ that satisfies this condition is

$$\nu^-(s' \to s) = |\mathcal{P}(s')|^{-1} \times \mathbb{1}[\nu^+_s(s') > 0], \qquad (4.3)$$

when |𝒫(s′)| is finite. Correspondingly, we get the following formula for the case of rooted non-clock tree inference:

$$\nu^-(s' \to s) = \left( \sum_{(t_i, X_i) \in s'} \mathbb{1}[|X_i| > 1] \right)^{-1}. \qquad (4.4)$$

While Assumption 3 is very weak, it is of interest to select ν⁻ so as to minimize the variance of the weights appearing in the CSMC algorithm. By generalizing Proposition 3.1 in [22], one can show that the optimal choice for ν⁻ is

$$\nu^-_{\mathrm{opt}}(s' \to s) = \frac{\pi(s)}{\sum_{\tilde{s} \prec s'} \pi(\tilde{s})}. \qquad (4.5)$$
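For rooted non-clock forests, Equation (4.4) amounts to counting the nontrivial trees; a two-line sketch using the (tree, leaf set) forest encoding from before:

```python
def nu_minus_uniform(new_forest):
    """Eq. (4.4): uniform over parents; one parent per nontrivial tree."""
    nontrivial = sum(1 for tree, leaves in new_forest if len(leaves) > 1)
    return 1.0 / nontrivial
```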
4.2.4 Analysis in the Discrete Case

To motivate the overcounting correction, we sketch in this section an elementary proof that CSMC converges to the correct value as the number of particles goes to infinity. Again, we restrict our attention to the discrete case for now (|S| < ∞), where the proof is transparent. In Section 4.2.5, we present a general result that does not make this assumption. Formally, we will prove the following result, where for simplicity, in this chapter → denotes convergence in L² as the number of particles K → ∞, unless stated otherwise.

Proposition 4. Under Assumption 3, we have for all test functions φ : 𝒳 → R,

$$\gamma_{R,K}\varphi \to \gamma\varphi. \qquad (4.6)$$

There are two important corollaries of consistency results of the form of Equation (4.6):

1. The normalized weighted particles obtained at iteration R of the SMC algorithm, π_{R,K}, form a consistent estimator for computing expectations under π: π_{R,K}φ → πφ. This corollary is important because πφ corresponds to a posterior expectation in the Bayesian framework.

2. The normalization of γ can be consistently estimated as follows:

$$\|\gamma_{r,K}\| \to \|\gamma\|. \qquad (4.7)$$

This corollary is important because ‖γ‖ corresponds to a marginal likelihood in the Bayesian framework. Note that the left-hand side of Equation (4.7) is equal to the product of all the weight normalizations divided by K^r (which is the way this result is usually presented), i.e.

$$Z_{r,K} \equiv \|\gamma_{r,K}\| = \prod_{j=1}^{r} \frac{1}{K} \sum_{k=1}^{K} w_{j,k}.$$

Proposition 4 can be established in two steps. First, we note that when the induced Hasse diagram is acyclic, previous SMC consistency proofs apply directly. Second, we show that in the cyclic case, we can construct a certain distribution π̌ and proposal q̌⁺ on a larger space Š with the following properties:

1. The target distribution π can be obtained from π̌ by straightforward marginalisation of the samples.

2. The induced Hasse diagram is acyclic, so the algorithm on Š is consistent by the first step of the proof.

3. The proposal steps and weight updates in the algorithm on Š can be shown to be equivalent to those of the original algorithm on S. This shows that the algorithm on S is also consistent.

As described above, let us start by assuming the poset is acyclic. In this case, we claim that we can invoke Proposition 4 of [9]. First, the boundedness assumptions required in that proposition are automatically satisfied since here |S| < ∞. Second, the connectedness assumption made in this previous work, Assumption 2b, can be shown to hold using the following argument: assume on the contrary that there is a connected component C ⊂ S of the Hasse diagram that does not contain ⊥, and let s be a minimal element of C, which exists by finiteness. Since {s′ : ν⁻(s → s′) > 0} ⊂ ρ⁻¹(ρ(s) − 1), we have a contradiction. Therefore there can be only one connected component.

If the poset is not acyclic, we now present the reduction to the acyclic case. Let S_{0:r} = S_0 × S_1 × · · · × S_r, the set of paths of length r in S. We will view the algorithm as incrementally building partial states over a larger space, with š_0 ∈ S_0, š_1 ∈ S_{0:1}, š_2 ∈ S_{0:2}, . . . , š_R ∈ S_{0:R}. In other words, instead of viewing the algorithm as operating over S = ∪_{r=0}^{R} S_r, we will view it as operating over Š = ∪_{r=0}^{R} S_{0:r}.
Let us start by introducing a new measure γ̌ on Š. Let š be an element in Š, i.e. a sequence of forests, say of length r, š = š_r = (s_0, s_1, . . . , s_r) ∈ S_{0:r}. Following [22, 23], we define the new measure by a product:

$$\check{\gamma}(\check{s}_r) = \gamma(s_r) \prod_{j=1}^{r} \nu^-(s_j \to s_{j-1}).$$

Note that since the ν⁻(s → ·) are assumed to be normalized probability densities, marginalization over s_0, s_1, . . . , s_{r−1} recovers the original extended measure γ.

The proposal over Š creates an identical copy of the sequence of forests, and adds to it a new element by sampling from the original proposal probability density ν⁺. Note that with this definition, given an element š ∈ Š, there can be only one predecessor, namely the prefix of the sequence with the last element removed. As a simple example, Figure 4.4 shows a simple finite cyclic poset and its acyclic poset over the extended space.

Figure 4.4: An example of changing a simple cyclic poset to an acyclic case.

Finally, standard SMC operating on this extended space can be seen to be equivalent to CSMC, since the weight updates simplify to:

$$\frac{\check{\gamma}(\check{s}_r)}{\nu^+(\check{s}_{r-1} \to \check{s}_r)\,\check{\gamma}(\check{s}_{r-1})} = \frac{\gamma(s_r) \prod_{j=1}^{r} \nu^-(s_j \to s_{j-1})}{\nu^+(s_{r-1} \to s_r)\,\gamma(s_{r-1}) \prod_{j=1}^{r-1} \nu^-(s_j \to s_{j-1})} = \frac{\gamma(s_r)\,\nu^-(s_r \to s_{r-1})}{\gamma(s_{r-1})\,\nu^+(s_{r-1} \to s_r)}.$$

This completes the proof in the discrete cyclic case.

4.2.5 General Setup

As in the previous section, γ denotes the target positive measure, but in this section we do not restrict γ to be defined over a discrete space. More precisely, let F_𝒳 denote a sigma-algebra on 𝒳, and let γ : F_𝒳 → [0, ∞). We assume that the user has provided an extension γ : F_S → [0, ∞), and a pair of forward and backward proposal probability kernels ν⁺, ν⁻ : S × F_S → [0, 1].
Next, we define the following measures on the product space S × S:

$$\tau^+(A \times B) = \gamma_A(\nu^+(B)) = \int \gamma_A(dx)\,\nu^+_x(B), \qquad (4.8)$$

$$\tau^-(A \times B) = \gamma_B(\nu^-(A)) = \int \gamma_B(dx)\,\nu^-_x(A), \qquad (4.9)$$

where A, B ∈ F_S, and for any measure λ and measurable A, λ_A(B) denotes λ(A ∩ B). Informally, and up to normalization, the measure in Equation (4.8) corresponds to picking A according to γ and then propagating the points in A using the forward proposal. The measure in Equation (4.9) corresponds to the inverse situation of picking B according to γ and then propagating these points using the backward proposal (this second situation is hypothetical, since the algorithm does not sample from ν⁻, but it is useful in the analysis). We make the following assumptions:

Assumption 5. We have τ⁻ ≪ τ⁺,

which implies by the Radon-Nikodym theorem the existence of a derivative dτ⁻/dτ⁺ : S × S → [0, ∞). We also assume that there is a version w = dτ⁻/dτ⁺ with the following properties:

Assumption 6. There is a ranked poset (S, ≺, ρ) such that s′ covers s iff w(s, s′) > 0, with ρ : S → Z. Using the notation γ_r as a short-hand for γ_{ρ⁻¹(r)}, we assume that the Hasse diagram induced by ≺ is (a) connected and (b) such that there is an R with γ_r = γ for r ≥ R, and γ_r a Dirac delta for r ≤ 0.

To get a compact notation for CSMC, we introduce the following Monte Carlo propagation operator:

$$(\mathrm{prop}_K\, \lambda)\varphi = \|\lambda\| \left[ \frac{1}{K} \sum_{k=1}^{K} w(S_k, S'_k)\,\varphi(S'_k) \right],$$

where S_k ∼ λ̄ (λ̄ denotes the normalized version of λ), S′_k | S_k ∼ ν⁺(S_k → ·) independently, λ : F_S → [0, ∞) is an arbitrary measure on the poset, and φ : S → R is a test function. Note that the propagation operator incorporates both a multinomial resampling step and a proposal step in one equation. If we denote the composition of these operators by prop²_K γ_0 = prop_K(prop_K γ_0), then the full CSMC algorithm can be summarized by γ_{R,K} = prop^R_K γ_0. With this notation, we can write the main result as:

Proposition 7. If φ : 𝒳 → R is measurable and such that |φ| ≤ C_1 and w ≤ C_2 for some constants C_1 and C_2, then γ_{R,K}φ → γφ.

The proof and supporting lemmas can be found in the appendix. There are many similarities between our proof and that for consistency in general state-space models; however, the main point of the appendix is to illustrate how the conditions on the poset are used in the proof. In the next section, we show how this result can be specialized to various phylogenetic setups.

4.2.6 Connection to Phylogenetics with Branch Lengths

In an ultrametric setup, the poset S is defined as the set of ultrametric forests over X. An ultrametric forest s = {(t_i, X_i)} is a set of ultrametric X_i-trees t_i such that the disjoint union of the leaves yields the set of observed taxa. We define the height of an ultrametric forest as the height of the tallest tree in the forest; heights of subtrees on a clock tree were shown in Figure 3.2. The height increment, denoted ∆_r where r = ρ(s), of the forest s represents the waiting time between two coalescent events. A common prior for ultrametric trees is the coalescent [61]. We assume
that the unconditional distribution of the topology is the uniform distribution over all possible topologies. The prior for ∆_r is $\mathrm{Exp}\big(\mu_0 \binom{|s|}{2}\big)$.

In a non-clock setup, S is defined as the set of non-clock forests over X. A non-clock forest s = {(t_i, X_i)} is a set of X_i-trees t_i in which the root is the newly formed internal node when s is a partial state, and the root is an arbitrary point on the new branch when s is a full state. A common prior on a branch length of an unrooted non-clock phylogenetic tree is Exp(µ_0).

When building rooted X-trees with branch lengths, three elements need to be proposed at each step: which pair of trees to merge, the length of the added branch, and the new position of the root. For efficiency, in a non-clock setup we assume in what follows that the pair is picked uniformly at random among the $\binom{|X|-r}{2}$ pairs, that the branch length is sampled according to the prior, and that the position of the root is sampled uniformly on the newly inserted branch. An ultrametric setup can be regarded as a special case of a non-clock setup with an additional constraint: the unique position of the root is selected to ensure that the newly formed tree is ultrametric. In practice, this constraint is satisfied by sampling the height increment rather than the branch length directly.
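The forward proposal just described takes only a few lines. In this sketch the forest is a list of nested-tuple trees, branch lengths are drawn from the Exp(µ0) prior, and the root-position step is omitted for brevity (an illustrative simplification):

```python
import random

def propose_nonclock(forest, mu0=10.0):
    """Merge a uniformly chosen pair of trees with Exp(mu0) branch lengths."""
    i, j = sorted(random.sample(range(len(forest)), 2))  # uniform over pairs
    b_i = random.expovariate(mu0)              # branch length from the prior
    b_j = random.expovariate(mu0)
    merged = (forest[i], b_i, forest[j], b_j)
    rest = [t for k, t in enumerate(forest) if k not in (i, j)]
    return rest + [merged]
```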
4.3 Numerical Examples

4.3.1 Synthetic Posets

We start with an illustration of the effect that a lack of appropriate correction can have on the approximation in cyclic posets. We use a synthetic poset with a small cardinality |S| < ∞ so that the exact target distribution can be computed easily. The poset has a support as in Figure 4.4 (left).² Here both the exact solution and the approximations are low-dimensional multinomial distributions, so the total variation distance can be computed efficiently.

² The values of π, ν⁺, ν⁻ can be downloaded from http://www.stat.ubc.ca/~l.wang/phylo/syntheticPoset/.

First, we use this simple test case to show that Assumption 5 on ν⁻ is necessary to have consistency (Proposition 7). Two versions of the particle algorithm presented in the previous section are compared: the first one, 'Poset-correction', uses a backward kernel satisfying Assumption 5, while the second one, 'Standard-SMC', does not. The latter uses a backward proposal proportional to ν̃⁻ ≡ 1, in which case the weight update reduces to the weight update found in state-space SMC algorithms. It can be checked easily that this choice violates Assumption 5. The support of the two backward kernels is shown in Figure 4.5.

Figure 4.5: The support of the backward kernels used in the synthetic poset experiments (left: Poset-correction, with the backward parameter q; right: Standard-SMC).

In Figure 4.6 (left), we compare the performance of the two algorithms as the number of particles increases. The performance of the algorithms is measured using the total variation distance, and the experiment for each number of particles is repeated 1000 times using different random seeds. The results show that only the algorithm satisfying Assumption 5 gives an approximation with a total variation distance going to zero as the number of particles increases.

We did a second experiment to give an example where cycles in posets are beneficial. In this experiment, we fix the structure of the backward kernel as in 'Poset-correction', with the exception of one of the backward transitions, which we parameterize with a number q ∈ [0, 1] shown in Figure 4.5 (left). When q ∈ {0, 1}, this effectively corresponds to removing a cycle in the Hasse diagram of the poset. We show in Figure 4.6 (right) the total variation distance as a function of this parameter q for a fixed number of particles (10). It can be seen that the best performance is attained away from the points {0, 1}, showing that the cycle is indeed beneficial.

Figure 4.6: Total variation distance of the particle approximation to the exact target distribution. Left: the performance of the two algorithms as the number of particles increases. Right: the total variation distance as a function of the parameter q for a fixed number of particles (10).

4.3.2 Synthetic Phylogenies

In this study, we evaluate the proposed CSMC method on estimating phylogenetic trees. We summarize the set of random trees obtained from the CSMC algorithm using the consensus tree [67]. We calculate the tree distance between estimated consensus trees and true trees using the RF metric and the PM; the smaller the tree distance, the better the performance. See Section 3.5 for an introduction to consensus trees and distance metrics for phylogenetic trees.

We simulated 100 ultrametric trees of 10 taxa with µ_0 = 10. The non-clock trees were obtained by perturbing the branch lengths of the ultrametric trees; more precisely, we modified a branch of length b by adding noise randomly sampled from Unif(−0.3b, 0.3b). We generated 10 datasets for each tree. In each dataset, a total of 10 DNA sequences were generated given a tree and the K2P model with the parameter κ = 2. The length of each sequence is 1000. We generated the sequence at the root by randomly sampling from the stationary distribution of the CTMC; then we generated each of its children using the transition probability computed with κ and the branch length via Equation (3.1). This procedure was applied recursively until reaching the leaves. The sequences of the internal nodes were discarded, and those on the leaves were taken as the observations.
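The data-generation scheme just described can be sketched recursively. Here simulate_site draws one site down a nested-tuple tree with named leaves, with P the transition-matrix function and pi the stationary distribution, as in the earlier sketches:

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_site(tree, P, pi, state=None):
    """Draw one site down the tree; returns {leaf name: state index}."""
    if state is None:
        state = int(rng.choice(len(pi), p=pi))     # root from stationarity
    if isinstance(tree, str):                      # leaf: record observation
        return {tree: state}
    left, b_l, right, b_r = tree
    # rows of P(b) sum to 1 (up to floating-point error) and give the
    # child's distribution given the parent's state
    obs = simulate_site(left, P, pi, int(rng.choice(len(pi), p=P(b_l)[state])))
    obs.update(simulate_site(right, P, pi, int(rng.choice(len(pi), p=P(b_r)[state]))))
    return obs
```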
CSMC was applied to each of these datasets. For comparison, we also used one widely-used Bayesian phylogeny software package, MrBayes (MB), which implements MCMC for a range of tree models [55]. For both programs, we fixed the parameter κ to its true value and focused on comparing their performance in phylogenetic tree estimation. Priors for the branch lengths were exponential distributions with rate 10. The initial tree for MB had an arbitrary topology, with branch lengths sampled from an exponential distribution with the true rate.

In both the CSMC and MCMC algorithms, the computational bottleneck is the peeling recurrence, which needs to be computed at each speciation event in order to evaluate π(t). Therefore, we report running times as the number of times the peeling recurrence is calculated.

Figure 4.7 shows the average Robinson Foulds metric and partition metric versus the running times in log scale, using 1000 datasets simulated from ultrametric trees and non-clock trees, respectively. On average, CSMC outperforms MB in terms of obtaining a smaller tree distance with a smaller number of steps. Regarding the partition metric, our SMC algorithm is 100 times, or 2 orders of magnitude, faster than MCMC at reaching 0. The vertical lines show the 50% credible intervals of the estimates. For ultrametric trees, the variance of the estimates using SMC is smaller than or equal to that from MB. For non-clock trees, with a small number of particles, SMC is clearly better. After reaching convergence, SMC and MB perform similarly.

Figure 4.7: The average Robinson Foulds metric and partition metric versus the running times (the number of times the peeling recurrence is calculated) in log scale using 1000 datasets simulated from ultrametric trees (left) and non-clock trees (right).

4.3.3 RNA Data

We analyzed aligned curated ribosomal RNA (rRNA) sequences obtained from 199 species of Chloroplast [13].³ We randomly selected four sub-datasets of 5, 10, 20, and 30 taxa, respectively, from the whole dataset. The numbers of sites are 121, 121, 126, and 140, respectively. Using a K2P model with κ = 2, both SMC and MrBayes were applied to these datasets. Figure 4.8 shows the log-likelihood of the consensus trees obtained using CSMC and MrBayes versus the running times (divided by 1000). We can see that CSMC can obtain a higher log-likelihood than MB using a smaller running time. For large numbers of iterations and particles, the MCMC sampler slightly outperformed CSMC on the real data of 30 taxa.

³ The dataset was downloaded from http://www.rna.ccbb.utexas.edu/DAT/3C/Alignment/ on 08/09/2011.

Figure 4.8: Log-likelihoods of the consensus trees using CSMC and MB versus the running times (divided by 1000) using RNA datasets of 5, 10, 20, and 30 taxa, respectively. Note that 'NCPF' on the plot represents 'particle filter for non-clock trees', i.e. CSMC for non-clock trees.

4.4 Discussion

We have presented a general class of SMC methods that provide an alternative to MCMC for Bayesian phylogenetic inference.
Our proposed method, CSMC, can efficiently explore the huge phylogenetic tree space by making use of the order-theoretic notion of a poset. Our CSMC methods can do inference not only for ultrametric trees but also for non-clock trees, which had not been approached by SMC previously. Both theoretical and experimental results show that CSMC can obtain a consistent estimate of the expectation of a function with respect to a target measure. In addition, extensive numerical simulations show that CSMC can obtain results comparable to the MCMC method in terms of computation speed and estimation accuracy.

4.4.1 Choices of the Partial States and Their Distributions

Besides the natural forest extension, described in Section 4.2.1, there are other potential extensions of π into forests. As an example, we can define π(sr) = p(Y, NJ(sr)), where NJ(sr) refers to a complete phylogenetic tree that is obtained by fixing the subtrees in the current forest sr and applying the neighbor joining (NJ) method, briefly reviewed in Section 3.3.1, to the roots of sr.

Moreover, there is more than one way to define the sequence of partial states, as long as the objects obtained at the last iteration of CSMC are in the target space, i.e. a phylogenetic tree in our example. The method we have used is based on building a complete tree by starting from the disconnected graph and connecting a pair of subtrees in a forest at every step. An alternative, called 'star-shaped decomposition', is to build a complete tree by starting with all leaves of molecular sequences connected to a single centre node and separating a pair of nodes attached to the centre node at each step. The likelihood of a star-shaped partial state can be computed in a similar way as for a binary tree using Equation (3.3).

Figure 4.8: Log-likelihoods of the consensus trees using CSMC and MB versus the running times (divided by 1000) using RNA datasets of 5, 10, 20, and 30 taxa, respectively. Note that 'NCPF' on the plot represents 'particle filter for non-clock trees', i.e. CSMC for non-clock trees.

4.4.2 Choices of the Proposal Distributions

The design of the proposal distributions has a significant impact on the performance of the CSMC algorithm. We have the flexibility to choose any forward proposal distribution as long as it induces a poset on all possible partial states. In addition to the simple proposal distribution used in Section 4.2.2, an alternative proposal distribution is to consider all possible pairs of subtrees to merge, i.e.

$$\nu_+(s_{r-1} \to s_r) = \frac{\pi(s_r)}{\sum_{s'_r : s_{r-1} \prec s'_r} \pi(s'_r)}.$$

This is called the Prior-Post proposal in [43, 105]. For a partial state sr−1 of m subtrees, the Prior-Post proposal considers m(m − 1)/2 possible pairs of subtrees, which is computationally expensive for a large m. From our simulation studies, the intensive computation for each particle allows us to use a smaller number of particles to obtain the same estimation accuracy. However, there is no obvious benefit in terms of the total computing time. Therefore, we choose to use the simple proposal distribution in this dissertation.
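To make the cost comparison above concrete, the following Python sketch illustrates one Prior-Post move. It is a minimal illustration only, not the implementation used in this dissertation; the helpers merge_pair (joining two subtree roots of a forest) and log_pi (the unnormalized log density of a forest, i.e. one peeling evaluation) are assumed placeholders.

```python
import itertools
import math
import random

def prior_post_propose(forest, log_pi, merge_pair):
    """One Prior-Post move: score every pair of subtree roots and sample
    a merged forest with probability proportional to pi(s_r).
    `forest` is a list of subtrees; `merge_pair(forest, i, j)` returns the
    forest with subtrees i and j joined; `log_pi` scores a forest."""
    pairs = itertools.combinations(range(len(forest)), 2)
    candidates = [merge_pair(forest, i, j) for i, j in pairs]
    logs = [log_pi(s) for s in candidates]      # m(m-1)/2 peeling evaluations
    m = max(logs)
    probs = [math.exp(l - m) for l in logs]     # normalize safely in log space
    u, acc = random.random() * sum(probs), 0.0
    for s, p in zip(candidates, probs):
        acc += p
        if u <= acc:
            return s
    return candidates[-1]
```

For a forest of m subtrees this costs O(m^2) likelihood evaluations per particle, which is exactly the trade-off discussed above: fewer particles are needed, but each particle becomes much more expensive.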
4.4.3 Resampling Strategies

CSMC algorithms can be improved by using more sophisticated resampling strategies that have been developed for other SMC algorithms, for example, stratified sampling [63] and adaptive resampling [24]. Other potential improvements are reviewed in [14] and [26].

4.4.4 Another Application: Self-avoiding Walk

Although the proposed method is motivated by Bayesian phylogenetic inference, CSMC is general enough to be applied to many other problems. Bayesian inference problems in computational biology, e.g. statistical alignment for proteins and DNA and grammatical inference for RNA, may benefit from the CSMC framework. Another application is to compute the total number of different self-avoiding walks (SAWs) of a fixed length. SAWs in a two- or three-dimensional lattice space have been studied as the most basic model for chain polymers in polymer physics [85, 110, 111].

In a two-dimensional lattice model, a chain polymer of length N is fully characterized by the positions of all of its molecules, x = (x1, · · · , xN), where xi = (a, b), and a and b are integers. The distance between xi and xi+1 has to be exactly 1, and xi+1 ≠ xk for all k ≤ i. The target distribution of interest is π(x) = 1/ZN, where the space of x is the set of all SAWs of length N and ZN is a normalizing constant, which is equal to the total number of different SAWs of length N.

Let s0 be the walk of length 0 as well as the initial state. We sequentially generate partial states s1, s2, · · · , by using a proposal distribution with density ν+(sr−1 → sr): for the partial state sr−1, we have a walk of length r − 1, and we generate a partial state sr of length r by uniformly inserting one step into the walk sr−1. We require sN ≡ x. Previous work [110] can be shown to be a special case of our framework where the backward proposal distribution has the density ν−(sr → sr−1) = 1/r. After running the CSMC algorithm, ZN can be estimated by

$$\hat{Z}_N = \gamma_{N,K} = \prod_{r=1}^{N} \left( \frac{1}{K} \sum_{k=1}^{K} w_{r,k} \right).$$

Chapter 5

Combining MCMC and SMC

In the previous chapters, we have introduced the applications of MCMC and SMC in Bayesian phylogenetic tree inference. As two important types of Monte Carlo methods, they have their own strengths and drawbacks. MCMC is a class of very powerful methods with a rich literature. The main problem with MCMC is that it is typically difficult to design appropriate proposal distributions for high-dimensional and multimodal target distributions; consequently, the chain often mixes poorly, and it takes an extremely long time to reach convergence. By contrast, our proposed CSMC is computationally fast, but it requires running a large number of particles simultaneously. It is tempting to study combinations of MCMC and SMC to exploit their advantages and circumvent their limitations. Since the main limitation of MCMC is unrealistic running time, while the main constraint of SMC is computer memory, an ideal framework for their combination should provide a flexible way of allocating the available total budget, including time and computing resources.

In this chapter, we will investigate two frameworks for combining SMC and MCMC in applications to phylogenetics. The first one is a set of particle Markov chain Monte Carlo (PMCMC) algorithms [3, 53], in which each MCMC iteration uses our proposed CSMC algorithm to approximate the posterior distribution of the phylogenetic trees and the marginal likelihood of the data, given the parameters in the evolutionary model.
Using PMCMC, we can incorporate global uncertainty in the hyperparameters1, e.g. random nucleotide evolutionary models. This is of interest since modelling uncertainty over evolutionary processes is one of the key selling points of Bayesian phylogenetic methods [25]. The second framework is to embed the standard MCMC algorithms in the phylogenetics literature [52, 66, 67, 70], reviewed in Section 3.4.4, into an SMC sampler proposed by [22, 23], reviewed in Section 2.2.3.

5.1 Particle MCMC for Phylogenetics

In phylogenetics, our goal is to estimate the phylogenetic tree t and evolutionary parameters θ based on n observed biological sequences Y. Using Bayesian approaches, the target distribution is the posterior distribution of t and θ, i.e. π(t, θ) ≡ p(t, θ|Y). When the evolutionary parameters θ are assumed to be known or fixed, the target distribution is denoted πθ(t) ≡ pθ(t|Y). In Section 2.3, we reviewed the generic PMCMC algorithms, including PIMH, PMMH, and PGS. In this section, we will focus on the practical details of applying these algorithms to estimate the phylogenetic trees and evolutionary parameters using our proposed CSMC algorithm described in the previous chapter.

5.1.1 The Particle Independent Metropolis-Hastings (PIMH)

If estimating the evolutionary parameters is not of interest, we can use the PIMH, described in Algorithm 5.1, to approximate the posterior of the phylogenetic tree t, i.e. pθ(t|Y). Algorithm 5.1 is a special case of the generic PIMH (Algorithm 2.6), where the CSMC algorithm (Algorithm 4.1) is used in each iteration of the PIMH. Although the PIMH algorithm and the SMC algorithm perform similarly in terms of estimation accuracy with the same total number of particles [3, 53], it is quite useful to have this alternative for Bayesian phylogenetics for a practical reason. The implementation of an SMC algorithm is often limited by the computer memory capacity, due to the fact that a large number of particles are required simultaneously. This may not be a problem when each particle is small. However, this is not the case in phylogenetics, because typically each particle is a partial or complete phylogenetic tree of many biological sequences with hundreds or thousands of sites. Contrary to SMC, the PIMH algorithm allows us to run a smaller number of particles in each iteration and to run it for any specified amount of time, until a sufficient number of particles have been generated. We will use an example to demonstrate the similarity in performance between PIMH and SMC in Section 5.1.4.

1 We use the 'hyperparameter' terminology because we view the phylogenetic tree as a parameter itself.

Algorithm 5.1 The PIMH Algorithm for Phylogenetics
1. Initialization, i = 0:
   (a) run the CSMC algorithm (Algorithm 4.1) targeting pθ(t|Y),
   (b) sample t(0) ∼ p̂θ(·|Y) and compute p̂θ(Y)(0).
2. For iteration i ≥ 1:
   (a) run the CSMC algorithm (Algorithm 4.1) targeting pθ(t|Y),
   (b) sample t∗ ∼ p̂θ(·|Y) and calculate p̂θ(Y)∗ using (2.3),
   (c) with probability
       min{1, p̂θ(Y)∗ / p̂θ(Y)(i − 1)},
       set t(i) = t∗ and p̂θ(Y)(i) = p̂θ(Y)∗; otherwise set t(i) = t(i − 1) and p̂θ(Y)(i) = p̂θ(Y)(i − 1).

5.1.2 Particle Marginal Metropolis-Hastings (PMMH)

The PMMH algorithm approximates a 'marginal Metropolis-Hastings' (MMH) update targeting directly the marginal density pθ(Y), which is estimated as a byproduct of the CSMC algorithm using

$$\hat{Z}_{R,K} = \gamma_{R,K} = \prod_{r=1}^{R} \left( \frac{1}{K} \sum_{k=1}^{K} w_{r,k} \right).$$
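The following Python sketch illustrates the PIMH accept/reject step driven by the marginal likelihood estimate above. It is a schematic under stated assumptions: run_csmc is a hypothetical routine returning a sampled tree together with the log of the estimate Ẑ_{R,K}, computed from the particle log-weights via the product formula.

```python
import math
import random

def log_marginal_estimate(log_weights_by_iter):
    """log Z_hat = sum_r log( (1/K) * sum_k w_{r,k} ), computed stably in log space."""
    total = 0.0
    for lw in log_weights_by_iter:                 # lw: list of K log-weights at iteration r
        m = max(lw)
        total += m + math.log(sum(math.exp(l - m) for l in lw)) - math.log(len(lw))
    return total

def pimh(run_csmc, n_iters):
    """Particle independent MH: each iteration runs an independent CSMC sweep
    and accepts it with probability min(1, Z_new / Z_old)."""
    tree, log_z = run_csmc()                       # initialization (i = 0)
    samples = []
    for _ in range(n_iters):
        tree_star, log_z_star = run_csmc()
        if math.log(random.random()) < log_z_star - log_z:
            tree, log_z = tree_star, log_z_star    # accept the proposed sweep
        samples.append(tree)
    return samples
```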
Algorithm 5.2 explicitly rewrites the PMMH algorithm for phylogenetics, with our CSMC Algorithm 4.1 run in each iteration.

Algorithm 5.2 The PMMH for Phylogenetics
1. Initialization, i = 0:
   (a) set θ(0) arbitrarily, and
   (b) run the CSMC algorithm targeting pθ(0)(t|Y), sample t(0) ∼ p̂θ(0)(·|Y), and let p̂θ(0)(Y) denote the marginal likelihood estimate.
2. For iteration i ≥ 1:
   (a) sample θ∗ ∼ q(θ(i − 1) → ·),
   (b) run the CSMC algorithm targeting pθ∗(t|Y), sample t∗ ∼ p̂θ∗(·|Y), and let p̂θ∗(Y) denote the marginal likelihood estimate, and
   (c) with probability

$$\min\left\{1,\ \frac{\hat{p}_{\theta^*}(Y)\, p(\theta^*)}{\hat{p}_{\theta(i-1)}(Y)\, p(\theta(i-1))} \cdot \frac{q\{\theta^* \to \theta(i-1)\}}{q\{\theta(i-1) \to \theta^*\}}\right\} \qquad (5.1)$$

   set θ(i) = θ∗, t(i) = t∗, and p̂θ(i)(Y) = p̂θ∗(Y); otherwise set θ(i) = θ(i − 1), t(i) = t(i − 1), and p̂θ(i)(Y) = p̂θ(i−1)(Y).

Recall that there is only one evolutionary parameter κ in the K2P evolutionary model, i.e. θ = κ. We choose an exponential distribution with rate µ0 as the prior for κ, i.e. κ ∼ Exp(µ0). Regarding the proposal distribution q{κ → ·}, one choice is to propose a new κ∗ by using a multiplier of κ [66]; i.e. let κ∗ = mκ, where m ∼ Unif(1/a, a) with a prefixed tuning parameter a > 1. A lower value of a will lead to a higher acceptance rate, but it can also mean that the MCMC chain will mix slowly. By [66], the proposal ratio in the MH ratio is

$$\frac{q\{\kappa^* \to \kappa(i-1)\}}{q\{\kappa(i-1) \to \kappa^*\}} = m. \qquad (5.2)$$

The prior ratio is

$$\frac{p(\kappa^*)}{p(\kappa)} = \frac{\mu_0 \exp(-\mu_0 \kappa^*)}{\mu_0 \exp(-\mu_0 \kappa)} = \exp(\mu_0(\kappa - \kappa^*)) = \exp(\mu_0 \kappa (1 - m)).$$

In each run of the CSMC algorithm, we use a total number of K particles. By the results of [3, 53], for any fixed number K ≥ 1 of particles, the transition kernels of PMCMC leave the target density of interest, p(θ, t|Y), invariant, and under weak assumptions the PMCMC sampler is ergodic. In practice, how to choose the number K is an issue that affects the performance of a PMCMC algorithm.
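As a concrete illustration of the multiplier proposal and the acceptance ratio (5.1)-(5.2), the sketch below implements one PMMH update for κ. It is illustrative only: run_csmc_given_kappa is a hypothetical routine returning a sampled tree and the log marginal likelihood estimate for a given κ, and the prior and proposal terms follow the ratios derived above.

```python
import math
import random

def pmmh_step(kappa, tree, log_z, run_csmc_given_kappa, mu0=10.0, a=1.2):
    """One PMMH iteration for the K2P parameter kappa with prior Exp(mu0)
    and multiplier proposal kappa* = m * kappa, m ~ Unif(1/a, a)."""
    m = random.uniform(1.0 / a, a)
    kappa_star = m * kappa
    tree_star, log_z_star = run_csmc_given_kappa(kappa_star)
    # log acceptance ratio: marginal-likelihood ratio + prior ratio + proposal ratio
    log_alpha = (log_z_star - log_z) \
                + mu0 * kappa * (1.0 - m) \
                + math.log(m)
    if math.log(random.random()) < log_alpha:
        return kappa_star, tree_star, log_z_star   # accept
    return kappa, tree, log_z                      # reject
```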
5.1.3 The Particle Gibbs Sampler (PGS)

Another PMCMC algorithm that can do joint estimation of phylogenetic trees and evolutionary parameters is the particle Gibbs sampler (PGS) described in Algorithm 2.9. Contrary to the PMMH algorithm, which bypasses the phylogenetic tree in calculating the MH ratio by marginalizing it, PGS requires a special type of PMCMC update called the conditional SMC update. To demonstrate how PGS applies to phylogenetics, we rewrite the two algorithms, the conditional SMC (Algorithm 2.8) and the PGS (Algorithm 2.9), explicitly for phylogenetics in Algorithm 5.3 and Algorithm 5.4, respectively, with more details and explanations.

The introduction of the notation for the ancestral lineage in the conditional SMC (Algorithm 2.8) is mainly for the purpose of the asymptotic proofs in [3]. In the conditional CSMC algorithm for phylogenetic trees (Algorithm 5.3), we abandon these complicated notations for the ancestral lineage. Instead, to simplify notation and implementation, we switch the positions of particles such that the first particle is the frozen one; obviously, we can switch particles without affecting the particle approximation of the target distributions. In addition, Algorithm 5.3 is a conditional version of the CSMC algorithm (Algorithm 4.1), whilst Algorithm 2.8 is a conditional version of the generic traditional SMC. Recall that K is the number of particles in the CSMC algorithm. Let s1:R = (s1, s2, · · · , sR) denote the frozen particle path that the conditional CSMC algorithm is conditioned on. In other words, this path always survives, whereas the remaining K − 1 particles are generated as in the CSMC algorithm.

Algorithm 5.3 Conditional CSMC for Phylogenetic Trees
s0,k ← ⊥, ∀k ∈ {1, · · · , K}
w0,k ← 1/K
construct γ0,K using Equation (4.1)
for r ∈ 1, . . . , R do
    sr,1 ← sr
    wr,1 ← γ(sr,1)/γ(sr−1,1) · ν−(sr,1 → sr−1,1)/ν+(sr−1,1 → sr,1)
    for k ∈ 2, · · · , K do
        sample s̃r−1,k ∼ πr−1,K
        sr,k ∼ ν+(s̃r−1,k → ·)
        wr,k ← γ(sr,k)/γ(s̃r−1,k) · ν−(sr,k → s̃r−1,k)/ν+(s̃r−1,k → sr,k)
    end for
    construct γr,K using Equation (4.1)
end for
return γR,K

Now we apply the conditional CSMC (Algorithm 5.3) in the framework of the generic PGS to estimate the joint posterior of θ and t, as summarized in Algorithm 5.4. Since we cannot obtain a closed-form full conditional distribution for θ, i.e. p(θ|t, Y), we use a Metropolis-Hastings step within Gibbs sampling to sample θ from p(θ|t, Y), where t is the current phylogenetic tree. The MH ratio for accepting a newly proposed θ∗ is

$$\alpha(\theta \to \theta^* | t) = \min\left\{1,\ \frac{p(\theta^*|t, Y)\, q\{\theta^* \to \theta\}}{p(\theta|t, Y)\, q\{\theta \to \theta^*\}}\right\} = \min\left\{1,\ \frac{p(Y|\theta^*, t)\, p(\theta^*|t)\, q\{\theta^* \to \theta\}}{p(Y|\theta, t)\, p(\theta|t)\, q\{\theta \to \theta^*\}}\right\}, \qquad (5.3)$$

where the likelihood p(Y|θ, t) can be calculated by (3.3). A common assumption is that the priors for θ and t are independent, i.e. p(θ|t) = p(θ). In the K2P model, the only parameter is κ, i.e. θ = κ. The same prior as described for the PMMH algorithm can be used, i.e. κ ∼ Exp(µ0), together with the multiplier proposal κ∗ = mκ, where m ∼ Unif(1/a, a) with a prefixed tuning parameter a > 1. As a result,

$$\alpha(\kappa \to \kappa^* | t) = \min\left\{1,\ \frac{p(Y|\kappa^*, t)}{p(Y|\kappa, t)} \cdot e^{\mu_0 \kappa (1-m)} \cdot m\right\}.$$

Algorithm 5.4 The MH within Particle Gibbs Sampler for Phylogenetics
Initialization: i = 0
Sample θ(0) arbitrarily
Run the CSMC algorithm targeting pθ(0)(t|Y)
Sample t(0) ∼ p̂θ(0)(·|Y) and record its ancestral lineage
for i ≥ 1 do
    Sample θ(i) ∼ p(·|t(i − 1)):
        (a) draw θ∗ ∼ q(θ(i − 1) → ·),
        (b) with probability α(θ(i − 1) → θ∗|t(i − 1)) calculated by (5.3), set θ(i) = θ∗; otherwise set θ(i) = θ(i − 1)
    Run the conditional CSMC algorithm targeting pθ(i)(t|Y) conditional on t(i − 1) and its ancestral lineage
    Sample t(i) ∼ p̂θ(i)(·|Y)
end for

5.1.4 Simulation Studies

In this section, we evaluate PIMH and PMMH using simulated datasets. In order to simulate the datasets, we first generated a set of random trees, including topologies and branch lengths, as the reference trees. Then, for each reference tree, we simulated DNA sequences using the K2P model with parameter κ = 2.0, described in Section 3.4.1. For each unrooted tree, we chose an arbitrary point on the tree as its root. The data generation started from the root of a tree by randomly sampling from the stationary distribution of the CTMC (see Section 3.4.1). Assuming site independence, we generated the data for the children of the root using the transition probabilities computed by (3.1). This procedure was applied recursively until reaching the leaves. We discarded the data at the internal nodes and took the data on the leaves as the simulated observations. Table 5.1 summarizes the settings for the simulated datasets used in this section.

Dataset   Tree type   Data type   # Trees   # taxa   # sites   Tree rate
1         clock       DNA         100       5        200       1.0
2         non-clock   DNA         100       10       200       1.0

Table 5.1: Settings for the simulated data.
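To make the data-generating procedure just described concrete, here is a small Python sketch that simulates sequences down a rooted tree under the Kimura two-parameter model. It is an illustrative sketch only: the transition probabilities are written out from the standard K2P closed form with transition/transversion parameter kappa and unit total substitution rate, which is an assumption about notation rather than a transcription of Equation (3.1), and the tree encoding is a hypothetical nested-dict format.

```python
import math
import random

BASES = "AGCT"                                     # purines A,G; pyrimidines C,T
TRANSITION = {"A": "G", "G": "A", "C": "T", "T": "C"}

def k2p_row(b, t, kappa):
    """Transition probabilities from base b over branch length t (expected
    substitutions per site) under K2P with transition/transversion ratio kappa."""
    beta = 1.0 / (kappa + 2.0)                     # scales the total rate to 1
    e1 = math.exp(-4.0 * beta * t)
    e2 = math.exp(-2.0 * (kappa + 1.0) * beta * t)
    row = {x: 0.25 - 0.25 * e1 for x in BASES}     # each transversion target
    row[b] = 0.25 + 0.25 * e1 + 0.5 * e2           # no change
    row[TRANSITION[b]] = 0.25 + 0.25 * e1 - 0.5 * e2  # transition target
    return row

def simulate_leaves(node, parent_seq, kappa, out):
    """Recursively evolve parent_seq down `node` = {'name', 'children'},
    where 'children' is a list of (child_node, branch_length) pairs.
    Internal sequences are discarded; leaf sequences are collected in `out`."""
    if not node["children"]:
        out[node["name"]] = parent_seq
        return
    for child, bl in node["children"]:
        seq = "".join(
            random.choices(BASES, weights=[k2p_row(b, bl, kappa)[x] for x in BASES])[0]
            for b in parent_seq
        )
        simulate_leaves(child, seq, kappa, out)

# root sequence drawn from the stationary (uniform) distribution of the chain
tree = {"name": "root", "children": [
    ({"name": "A", "children": []}, 0.3),
    ({"name": "B", "children": []}, 0.5),
]}
leaves = {}
simulate_leaves(tree, "".join(random.choice(BASES) for _ in range(200)), 2.0, leaves)
```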
We constructed a consensus tree using the random trees from the results of applying the PMCMC algorithms for phylogenetics to each simulated dataset. These estimated consensus trees were assessed by their log-likelihoods and by their tree distances from the reference trees, measured by the PM, RF, and KF tree metrics. Please refer to Section 3.5 for the definitions of the consensus tree and the tree distance metrics.

PIMH Versus CSMC

In this study, we compared the performance of PIMH and SMC on the same datasets using the same computing cost, i.e. a fixed number of total particles. As mentioned earlier in this chapter, the use of SMC algorithms is limited by computer memory capacity. For example, an "out-of-memory" error occurred when we tried to run 5,000,000 particles simultaneously for a dataset of 10 taxa of 200 sites using a 4-gigabyte memory limit. To avoid such a problem, we used a set of small clock trees of 5 taxa (Dataset 1 of Table 5.1), without loss of generality, for the purpose of illustration. In the CSMC algorithm, all particles are run simultaneously. In the PIMH algorithms, we need to decide the number of particles used at each iteration, shown in Table 5.2. We used a total number of 1,000,000 particles. From the results in Table 5.2, PIMH and SMC performed very similarly in terms of the log-likelihood of the consensus trees and their distances from the reference trees.

Method   #particles/iter.   # iterations   log-likelihood      RF              PM      KF
CSMC     1,000,000          -              -1091.32 (153.59)   0.063 (0.048)   0 (0)   1.38 (3.34)
PIMH     10                 100,000        -1091.40 (153.60)   0.062 (0.046)   0 (0)   1.38 (3.33)
PIMH     100                10,000         -1091.40 (153.64)   0.062 (0.046)   0 (0)   1.38 (3.35)
PIMH     1,000              1,000          -1091.39 (153.62)   0.061 (0.046)   0 (0)   1.41 (3.41)
PIMH     10,000             100            -1091.45 (153.65)   0.062 (0.046)   0 (0)   1.34 (3.24)

Table 5.2: Tree estimates using PIMH for clock trees. Log-likelihood refers to that of the consensus tree.

PMMH and MrBayes

The purpose of this simulation study is to evaluate PMMH in terms of estimating the evolutionary parameter κ and the phylogenetic trees using a set of simulated non-clock trees (Dataset 2 of Table 5.1). We used 10,000 particles at each iteration of PMMH and ran 5,000 iterations. For comparison, we also ran MrBayes for 50,000,000 MCMC iterations on the same datasets, so that the total computing cost is comparable to that of PMMH. As shown in Table 5.3, PMMH outperformed MrBayes on the 100 simulated non-clock datasets.

Method    log-likelihood      κ             RF            PM         KF
PMMH      -1894.84 (162.01)   1.99 (0.25)   1.44 (1.43)   0 (0.45)   0.45 (3.76)
MrBayes   -1898.89 (175.43)   3.72 (0.47)   1.62 (1.13)   0 (1.35)   0.56 (3.15)

Table 5.3: Comparison of PMMH and MrBayes for a set of non-clock trees in terms of estimating the parameter κ (true value 2.0) and the phylogenetic trees. Log-likelihood refers to that of the consensus tree.

For the results of PMMH, we also calibrated the credible intervals using coverage probabilities, shown in Figure 5.1. As expected, the coverage probabilities (circles) are close to the nominal levels of the credible intervals (solid line). In addition, Figure 5.2 shows the trace plots and histograms of κ estimated by PMMH for three arbitrarily selected non-clock datasets.

Figure 5.1: Coverage probability versus credible intervals for the estimates of κ using PMMH.

5.1.5 Real Data Analysis

We also applied PMMH for non-clock trees to the same datasets as in Section 4.3.3, with 1000 particles at each iteration.
By estimating κ rather than using a prefixed value, the log-likelihoods of the obtained consensus trees increased, as shown for an example in Figure 5.3. The parameter a in PMMH was set to 1.2; better performance might be achieved by choosing a more carefully tuned value.

In order to evaluate the influence of the number of particles at each PMMH step, we set this number to 100, 1000, 10000, and 25000, respectively, and compared the log-likelihoods of the consensus trees using the datasets of 5, 10, and 20 taxa, shown in Table 5.4. We found that the smaller the number of particles, the lower the log-likelihood and the higher the variance. Theoretically, we can always achieve higher accuracy by increasing the number of particles. In practice, running a large number of particles can be computationally expensive. Therefore, it is a trade-off between speed and accuracy.

Figure 5.2: Trace plots and histograms of κ estimated by PMMH for three arbitrarily selected non-clock datasets.
Figure 5.3: Log-likelihoods of the consensus trees versus the running times (divided by 1000) using RNA datasets of 5 taxa. (Left) We only estimated the phylogenetic tree with MB and CSMC. (Right) We estimated both the phylogenetic tree and the parameter κ with MB and PMMH.

                Number of particles
# taxa   100                1000               10000              25000
5        -231.12 (0.18)     -231.13 (0.059)    -231.12 (0.086)    -231.10 (0.033)
10       -438.99 (2.06)     -436.64 (0.86)     -436.66 (0.62)     -436.28 (0.23)
20       -1608.41 (3.12)    -1581.16 (3.36)    -                  -

Table 5.4: Log-likelihoods of the consensus trees obtained using PMMH for 3000 iterations with different numbers of particles at each iteration, using non-clock trees.

We analyzed the aligned protein-coding mitochondrial DNA sequences of 32 species of cichlid fishes [17, 64], described in Section 3.2, using PMMH for non-clock trees. We used 10,000 particles for each MCMC iteration and ran PMMH for 2,000 iterations. For comparison, we ran MrBayes on the same dataset for 20,000,000 iterations. The log-likelihoods of the consensus trees from PMMH and MrBayes were -6,892.51 and -6,942.20, respectively. Figure 5.4 shows the consensus tree estimated by PMMH, and Figure 5.5 the one estimated by MrBayes. The results were consistent with previous studies [17, 64]. For example, the top clade of 7 species2 on our consensus tree consists of species in the same tribe Lamprologini [64].
2 Telmatochromis temporalis, Neolamprologus tetracanthus, Neolamprologus brichardi, Lepidiolamprologus elongatus, Lamprologus callipterus, Julidochromis marlieri, Chalinochromis popelini.

Figure 5.4: The consensus tree for cichlid fishes estimated by PMMH. The species on the root is Lamprologus callipterus.

Figure 5.5: The consensus tree for the dataset of cichlid fishes using MrBayes. The species on the root is Lamprologus callipterus.

5.2 An SMC Sampler with MCMC Kernels

As reviewed in Section 2.2.3, SMC samplers [22, 23] provide a framework for converting an MCMC algorithm for a static distribution π into an SMC algorithm by doing MCMC moves within SMC iterations. In this section, we describe how the standard MCMC algorithms in Bayesian phylogenetics can be used within an SMC algorithm.

In Bayesian phylogenetics, the target distribution of interest is the joint posterior of a phylogenetic tree t and evolutionary parameters θ, i.e. π(t, θ) ≡ p(t, θ|Y). For simplicity of notation, we denote x = (t, θ). We define

$$\pi_r(x) \propto \pi(x)^{\phi_r}, \qquad (5.4)$$

where 0 ≤ φ1 < · · · < φR = 1, and πR(x) = π(x) = p(x|Y). We will use the SMC sampler (Algorithm 2.5) with the backward kernel

$$L_{r-1}(x_r, x_{r-1}) = \pi_r(x_{r-1})\, K_r(x_{r-1}, x_r) / \pi_r(x_r).$$

As mentioned in Section 2.2.3, with this backward kernel, the incremental importance weight becomes γr(xr−1)/γr−1(xr−1). More precisely, using Equation (5.4), we have

$$\gamma_r(x_{r-1})/\gamma_{r-1}(x_{r-1}) = \{\gamma(x_{r-1})\}^{\phi_r - \phi_{r-1}} = \{p(x_{r-1})\, p(Y|x_{r-1})\}^{\phi_r - \phi_{r-1}}.$$

A common choice for the Markov kernels Kr(xr−1, ·) is to use MCMC kernels [22, 23]. A typical MH kernel used in an SMC sampler is composed of the following steps (see the sketch after this list):

1. Let q(xr−1, ·) be a proposal distribution. Propose a new tree and new evolutionary parameters, denoted xr∗, from q(xr−1, ·).

2. Compute the MH ratio

$$\alpha(x_{r-1}, x_r^*) = \min\left\{1,\ \frac{\pi_r(x_r^*)\, q(x_r^*, x_{r-1})}{\pi_r(x_{r-1})\, q(x_{r-1}, x_r^*)}\right\}.$$

3. With probability α(xr−1, xr∗), the proposal xr∗ is accepted; with probability 1 − α(xr−1, xr∗), the chain remains at xr−1.
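Below is a minimal Python sketch of this tempered MH move together with the corresponding incremental weight, applied to one particle. It is illustrative only, with hypothetical helpers: log_gamma returns the unnormalized log posterior log p(x) + log p(Y|x), and propose/log_q stand in for any of the phylogenetic proposals reviewed in Section 3.4.4.

```python
import math
import random

def smc_sampler_step(x, phi_prev, phi_curr, log_gamma, propose, log_q):
    """One SMC-sampler step for one particle: reweight from pi_{r-1} to pi_r,
    then apply an MH kernel that is invariant for pi_r(x) ∝ gamma(x)^phi_curr."""
    # incremental weight: {p(x) p(Y|x)}^(phi_r - phi_{r-1}), in log form
    log_w = (phi_curr - phi_prev) * log_gamma(x)

    # MH move targeting the tempered distribution pi_r
    x_star = propose(x)
    log_alpha = phi_curr * (log_gamma(x_star) - log_gamma(x)) \
                + log_q(x_star, x) - log_q(x, x_star)   # log_q(from, to)
    if math.log(random.random()) < log_alpha:
        x = x_star
    return x, log_w
```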
In phylogenetics, there is a rich literature on using MCMC algorithms to sample from the posterior and then carry out phylogenetic inference. In order to take advantage of these methods, we can combine different MCMC samplers into mixtures and cycles of several individual samplers. This is justified by a very powerful and useful property of MCMC [2, 108]: if each of the transition kernels {K^i}, i = 1, · · · , M, has invariant distribution π, then the cycle hybrid kernel ∏_{i=1}^{M} K^i and the mixture hybrid kernel ∑_{i=1}^{M} p_i K^i, with ∑_{i=1}^{M} p_i = 1, are also transition kernels with invariant distribution π.

Algorithm 5.5 summarizes the SMC sampler for phylogenetics, where the proposal Kri can be any MCMC kernel, including those proposed in the Bayesian phylogenetics literature, e.g. [52, 66, 67, 70]. In the numerical examples of this section, we used the proposals Kri defined as follows:

1. Kr1: the multiplicative branch proposal described in Section 3.4.4.
2. Kr2: the global multiplicative branch proposal, which proposes all the branch lengths by applying the above multiplicative branch proposal to each branch.
3. Kr3: the stochastic nearest neighbour interchange (NNI) proposal described in Section 3.4.4.
4. Kr4: the stochastic NNI proposal with resampling of the edge, which uses the above NNI proposal together with the multiplicative branch proposal for the edge under consideration.
5. Kr5: the independent branch proposal, which proposes a branch length for an arbitrary branch from an exponential distribution.

Note that we have only considered estimating phylogenetic trees in our current research, but this method can also be used for estimating evolutionary parameters θ, simply by including kernels {Kri} that propose θ.

Algorithm 5.5 An SMC Sampler using MH Kernels for Phylogenetics
x1,k ← ⊥, ∀k ∈ {1, · · · , K}
w1,k ← 1/K
for r ∈ 2, . . . , R do
    sample xr,k ∼ ∑_{i=1}^{M} pi Kri(xr−1,k, ·), with ∑_{i=1}^{M} pi = 1
    wr,k ← {p(xr−1,k) p(Y|xr−1,k)}^{φr − φr−1}
    normalize the weights Wr,k ∝ wr,k and resample {xr,k, Wr,k}
end for

We are interested in how different choices of K and R affect the performance of Algorithm 5.5. In our numerical examples, we used the sequence φi = i/R for simplicity. We simulated 100 non-clock trees with rate 10 and one dataset for each tree; each sequence has length 200. Figure 5.6 shows the median (and 50% credible interval) of the log-likelihoods of the consensus trees and of their tree distances from the reference trees. From these results, the log-likelihoods are very similar, and the tree distances decrease slightly as K increases. From the results of varying R with a fixed K = 5000, shown in Table 5.5, a larger R leads to better estimates.

R      log-likelihood       RF            PM            KF
10     -2147.78 (289.48)    0.81 (0.01)   1.00 (0.08)   3.16 (21.97)
100    -1505.19 (241.31)    0.43 (0.13)   0.42 (0.136)  0.27 (42.39)
500    -1375.70 (221.25)    0.17 (0.14)   0 (0.18)      0.036 (154.43)
1000   -1371.15 (203.14)    0.11 (0.11)   0 (0.16)      0.014 (92.45)

Table 5.5: Comparison of the SMC sampler with MH moves for different numbers of iterations R, for a set of non-clock trees, in terms of estimating the phylogenetic trees. Log-likelihood refers to that of the consensus tree. Here the tree distance metrics, RF and PM, are the normalized versions.

Figure 5.6: Results of running the SMC sampler with MH moves using different numbers of particles, K ∈ {100, 500, 2000}, for a set of simulated non-clock trees (log-likelihoods and normalized RF, normalized PM, and Kuhner-Felsenstein distances).
5.3 Parallel Computing

Inferring large-scale trees is often computationally expensive or infeasible, which motivates us to develop parallel Monte Carlo algorithms to tackle the computational challenges in phylogenetics. There are two types of parallelization: coarse-grain and fine-grain methods [100]. Coarse-grain parallelization refers to multi-core CPUs that are linked loosely through an Ethernet network, where communication between nodes is supported by the Message Passing Interface (MPI). This type of parallelization is limited by a combination of latency, throughput, and cost. In contrast, fine-grain parallelization refers to shared-memory multiprocessing, e.g. multi-threaded processing within a single computer, where communication between computing units is fast and cheap. In the following, we will focus on the fine-grain methods.

5.3.1 Fine-Grain Parallelization

There is a large literature on the parallelization of both MCMC and SMC algorithms. SMC algorithms are intrinsically parallelizable, and most of the work has been on parallelization of the proposal steps [69]. Although MCMC algorithms are intrinsically serial, complicated statistical models typically require advanced MCMC algorithms in which auxiliary variables are added to consider a larger space, for the purpose of improving the mixing of the Markov chain. Many advanced MCMC algorithms, e.g. slice sampling and parallel tempering [103], can be decomposed into fine-grain problems for parallelization within MCMC iterations.

Algorithm 5.6 Outline of the Fine-grain Parallelization for SMC Algorithms
1: . . . Serial code . . .
2: for r = 1, · · · , R do
3:     Update particles to their r-th partial state using multiple threads.
4:     if resampling is needed then
5:         Synchronize the threads.
6:         Resample the particles.
7:         Synchronize the threads.
8:     end if
9: end for
10: . . . Serial code . . .

The art of designing efficient parallel programs is essentially to reduce the serial computations and to optimize the use of the multiple cores [69, 100]. In other words, the bottleneck is the fraction of irreducibly serial computation in the code. Fine-grain parallelization can be implemented by running multiple threads on a single CPU or on graphics processing units (GPUs) [69, 100].

Algorithm 5.6 outlines the fine-grain parallelization for SMC algorithms, in which we distribute the computation for the K particles over multiple threads (either on a CPU or on GPUs). The bottleneck is the resampling step, which involves synchronizing the threads. Fortunately, as mentioned in Section 2.2.2, resampling at every iteration is neither necessary nor efficient; instead, we resample only when the variation of the particle weights is higher than some prefixed threshold.

It is straightforward to parallelize PMCMC simply by using Algorithm 5.6 whenever an SMC algorithm is required. Furthermore, we can apply the idea of parallel tempering to PMCMC and implement the subchains in parallel. Within each iteration of a PMCMC subchain, we implement the SMC algorithm in parallel. Therefore, parallel tempering PMCMC is highly parallelizable.
To be concrete, we will briefly describe parallel tempering PMCMC and its parallelization, mainly based on [69] but in the context of phylogenetics and PMCMC. The idea of using parallel tempering for PMCMC is to speed up the convergence of the PMCMC chain for π defined on X by constructing a Markov chain on a joint space X^M using M − 1 auxiliary variables, each in X. In other words, we have M parallel subchains, each with stationary distribution π(i), i ∈ M = {1, · · · , M}, such that π(M) = π. Each subchain i is associated with a PMCMC kernel that leaves π(i) invariant. The stationary distribution of the joint chain is π̃(x1:M) ≡ ∏_{i=1}^{M} π(i)(xi), with support in the joint space X^M. By construction, π̃ admits π as a marginal distribution. In phylogenetics, x is a description of the phylogenetic tree and the evolutionary parameters. Using the parallel tempering methodology, we select π(i)(x) = π(x)^{φi} with 0 < φ1 < · · · < φM = 1. The rationale is that MCMC converges more rapidly when the target distribution is flatter; hopefully, these auxiliary subchains can speed up convergence of the chain of interest by swapping their values [56]. The chains i and j swap their values with probability min{1, αij}, where

$$\alpha_{ij} = \frac{\pi_{(i)}(x_j)\, \pi_{(j)}(x_i)}{\pi_{(i)}(x_i)\, \pi_{(j)}(x_j)}.$$

It is obvious that the PMCMC subchains can be executed in parallel. However, the value-swapping moves among subchains need synchronization of the threads, which becomes the bottleneck of using parallel tempering. [69] proposed a way to exchange values between adjacent chains such that all value-swapping moves can be done in parallel, i.e. swapping the values of each pair in either {{1, 2}, {3, 4}, · · · , {M − 1, M}} or {{2, 3}, {4, 5}, · · · , {M − 2, M − 1}, {M, 1}}, each with probability one half.

5.3.2 Multiple Threads on a Single CPU

We provide some results of executing multiple threads on a single CPU to illustrate the advantage of parallelizing CSMC. We ran the CSMC algorithm on two simulated datasets, from an ultrametric tree and a non-clock tree, respectively, of 30 taxa and 2000 sites, using 100,000 particles with various numbers of threads on a 2.40GHz Intel Xeon 16-core E7330 architecture. Figure 5.7 shows the computing time in milliseconds versus the number of threads for the two datasets. The computing time clearly decreases as the number of threads increases. If we parallelize more parts of our code, or use more machines, the CSMC algorithm will be even faster.

Figure 5.7: Time in milliseconds versus different numbers of threads for a dataset of an ultrametric tree (left) and a dataset of a non-clock tree (right) using CSMC.
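The following Python sketch illustrates the fine-grain pattern of Algorithm 5.6 on a single CPU, using a thread pool for the particle updates and an effective-sample-size trigger for adaptive resampling. It is a schematic sketch only: update_particle is a hypothetical callable returning an updated particle and its incremental weight.

```python
import random
from concurrent.futures import ThreadPoolExecutor

def parallel_smc(particles, weights, n_iters, update_particle, ess_threshold=0.5):
    """Fine-grain parallel SMC outline: particle proposals run on a thread
    pool; resampling (the synchronization point) happens only when the
    effective sample size drops below ess_threshold * K."""
    K = len(particles)
    with ThreadPoolExecutor() as pool:
        for r in range(n_iters):
            # parallel section: update every particle, collecting incremental weights
            results = list(pool.map(update_particle, particles))
            particles = [p for p, _ in results]
            weights = [w * inc for (_, inc), w in zip(results, weights)]
            # implicit barrier: pool.map has returned, so all workers are done
            total = sum(weights)
            norm = [w / total for w in weights]
            ess = 1.0 / sum(w * w for w in norm)
            if ess < ess_threshold * K:            # resample (serial section)
                particles = random.choices(particles, weights=norm, k=K)
                weights = [1.0] * K
    return particles, weights
```

Note that in CPython the threads only yield a real speed-up if the particle update releases the global interpreter lock (e.g. numerical kernels implemented in C); the same pattern applies unchanged with processes or GPU kernels.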
5.4 Discussion

PMCMC provides an alternative to the standard MCMC approach for sampling from the joint posterior of the phylogenetic trees and evolutionary parameters. We have demonstrated in our experiments that, for a range of situations, PMCMC can achieve better estimates of the evolutionary parameters at the same computing cost. This result can be explained by comparing PMCMC with standard MCMC, especially in the ways new states are proposed. The main disadvantage of standard MCMC methods is that it is difficult to design proposal distributions that can efficiently explore the huge parameter space. Since the likelihood of a tree may change significantly with a change in the evolutionary parameters, proposing a new tree and a new evolutionary parameter simultaneously, or proposing a tree very different from the current one, often leads to a high rejection rate. In a standard MCMC approach, this problem is alleviated by alternating between updating the parameters given a fixed phylogenetic tree and updating the tree given the parameters with small moves. In other words, at each MCMC step, only a small perturbation is made to the current tree to propose a new tree.

To show the problem with this coordinate-wise strategy, we use a simplified situation as an illustration, where the hyperparameter θ can only take two values, θ1 and θ2, and there are only two possible trees, t1 and t2. In other words, there are only 4 possible states: (θ1, t1), (θ1, t2), (θ2, t1), and (θ2, t2). At each iteration of MCMC, the chain is at one of these 4 states. Suppose the probability masses of the 4 states are 0.1, 0.3, 0.5, and 0.1, respectively. A good Markov chain should move quickly among the states with high probabilities; in this example, the states with higher probabilities are (θ1, t2) and (θ2, t1). However, if we start from the state (θ1, t2) in a standard MCMC scheme in which we either fix θ to update t or fix t to update θ, the proposal is highly likely to be rejected: any single coordinate-wise move from (θ1, t2) must pass through (θ1, t1) or (θ2, t2), both of which have low probability. As a result, it is difficult to move from (θ1, t2) to (θ2, t1).

Contrary to standard MCMC, PMCMC can make bold moves in the parameter space. Specifically, for a given θ, PMCMC considers a whole population of trees and samples from them according to the posterior. As a result, the proposal is more sensible and more likely to be accepted, especially for high-dimensional and multimodal distributions. Therefore, although the better proposals in PMCMC are paid for by a much higher computing cost at each iteration, it is worthwhile, because it leads to better estimates at a lower total cost.

The SMC sampler with MCMC moves provides another flexible framework that exploits the previous work in Bayesian phylogenetics by using MCMC moves within an SMC algorithm. This method looks very similar to parallel tempering MCMC [103], in which subchains with tempered target distributions are implemented in parallel and value-swapping moves among the subchains are used to help the chain for the target distribution converge faster. The difference between the two methods is subtle: SMC samplers bypass the awkward value-swapping moves. In an SMC sampler, each tempered target distribution is approximated by a set of weighted particles at each SMC iteration. Instead of running subchains at various temperatures in parallel, an SMC sampler starts from a very flat distribution and then approaches the target distribution gradually by increasing the temperature little by little. In this way, we can alleviate the main problem of using MCMC in phylogenetics, i.e. inefficient exploration of the tree space because only small moves are allowed. In the examples of this chapter, we have used the MCMC moves that were designed for standard MCMC algorithms in phylogenetics. In the future, it would be desirable to design bolder MCMC moves that are more suitable for SMC samplers.
Chapter 6

A General Evolutionary Model Based on String-valued CTMC

Current tree inference methods ground their analysis on a relatively restricted range of evolutionary phenomena, limited in most cases to substitution events and, less frequently, to context-independent long indels [80]. However, sequence change is caused by many other important molecular mechanisms, e.g. slipped-strand mispairing (SSM), a well-known explanation for the evolution of repeated sequences [59, 80, 104]. Most of these molecular mechanisms have not yet been exploited as phylogenetically informative events [80], since it is generally non-trivial to extend tractable sequence evolution models beyond point mutations. Researchers have used string-valued continuous time Markov chains (SCTMCs), using the marginals of the process to obtain the probability of trees (and alignments) [54, 73, 106]. The computational bottleneck has prevented broader applications based on SCTMCs.

Many applications of CTMCs need to compute the transition probability Pa,b(T) of the process moving from state a to state b in a finite time interval of length T ≥ 0. Calculating transition probabilities for a CTMC can be regarded as a building block of the likelihood computation for a full phylogenetic tree. [19] proposed an efficient method to evaluate the finite-time transition probabilities for general birth-death processes (BDPs), which are a special class of CTMCs. However, there is currently no robust and efficient method to evaluate transition probabilities for more general CTMCs, e.g. SCTMCs.

We propose a novel way to estimate the transition probabilities for a general CTMC. Our approach is based on the idea of expressing explicitly the probability of a sequence of states of a continuous-time Markov jump process (MJP) [16] and using importance sampling to sample a set of weighted sequences of states. Our method differs from the related work of [84], in which MJP paths, completely characterized by the jump times and the sequence of states, are sampled based on a method named uniformization [49]. We have verified our method using numerical examples of birth-death processes (BDPs). Further, we describe a more flexible evolutionary model based on string-valued CTMCs that can incorporate a wide range of evolutionary phenomena. The Bayesian phylogenetic inference is implemented by extending the importance sampling for transition probabilities of a general CTMC to a sequential Monte Carlo (SMC) algorithm.

The rest of this chapter is organized as follows. In Section 6.1, we introduce the transition probabilities for a general CTMC and propose to compute them using importance sampling. In Section 6.2, we provide an example of the proposed importance sampling for transition probabilities using BDPs. Section 6.3 describes an evolutionary model, a general SCTMC, and how we implement Bayesian phylogenetic inference using a novel sequential Monte Carlo (SMC) algorithm. A discussion is provided in Section 6.4.

6.1 Importance Sampling for Transition Probabilities

6.1.1 Transition Probabilities

CTMCs, {Yt : 0 ≤ t ≤ T}, are often used to model data sampled at discrete time points. For example, in phylogenetics we only observe partial data, typically on the leaves. We are interested in the transition probabilities Pa,b(T) = P(YT = b|Y0 = a). Let m be the number of state transitions between the two states, the beginning state a and the ending state b. The total time spent in these states is T.
The m state-change events occur at time points T0 = 0 < T1 < · · · < Tm < Tm+1 = T. Let Qx,x′ be the instantaneous rate of transiting from state x to state x′, and let λ(x) = ∑_{x′ ≠ x} Qx,x′ be the instantaneous rate of departing from state x.

A path of a CTMC {Yt} is completely described by the sequence of times T0:m+1 = {T0, T1, · · · , Tm+1} and the sequence of states x0:m = {x0, x1, · · · , xm}. Denote t0:m = {t0, · · · , tm}, where ti = Ti+1 − Ti is the waiting time until the next state change. The probability density of this path conditional on the beginning state x0 is

$$g(x_{0:m}, t_{0:m} | x_0, T) = \left[\prod_{i=0}^{m-1} f(t_i; \lambda_i)\, J(x_i \to x_{i+1})\right] \left(1 - F(t_m; \lambda_m)\right)
= \left[\prod_{i=0}^{m-1} \lambda_i e^{-\lambda_i t_i} \frac{Q_{x_i, x_{i+1}}}{\lambda_i}\right] e^{-\lambda_m t_m}
= \prod_{i=0}^{m-1} \frac{Q_{x_i, x_{i+1}}}{\lambda_i} \prod_{i=0}^{m-1} \lambda_i e^{-\lambda_i t_i}\, e^{-\lambda_m t_m}, \qquad (6.1)$$

where f(t; λ) is the density of an exponential distribution with rate λ, F(t; λ) is the cdf of an exponential distribution, and J(x → x′) is the probability of jumping from x to x′. Here λi depends on the i-th state xi, i.e. λi = λ(xi).

Using Monte Carlo methods, we can sample the sequence of states and the jump times to approximate transition probabilities. Alternatively, we can marginalize out the time intervals when we are not interested in the jump times. That is,

$$g(x_{0:m} | x_0, T) = \prod_{i=0}^{m-1} \frac{Q_{x_i, x_{i+1}}}{\lambda_i}\, G(\lambda_0, \lambda_1, \cdots, \lambda_m; T), \qquad (6.2)$$

where

$$G(\lambda_0, \lambda_1, \cdots, \lambda_m; T) = \int \cdots \int_{0 < t_0 + t_1 + \cdots + t_{m-1} < T} \prod_{i=0}^{m-1} \lambda_i e^{-\lambda_i t_i}\, e^{-\lambda_m (T - \sum_{i=0}^{m-1} t_i)}\, dt_0\, dt_1 \cdots dt_{m-1}.$$

The above integral is equal to the (1, m + 1)-th element of a matrix exponential, $e^{\tilde{Q}T}$, where

$$\tilde{Q} = \begin{pmatrix}
-\lambda_0 & \lambda_0 & 0 & \cdots & 0 & 0 \\
0 & -\lambda_1 & \lambda_1 & \cdots & 0 & 0 \\
\cdots & \cdots & \cdots & \cdots & \cdots & \cdots \\
0 & 0 & 0 & \cdots & -\lambda_m & \lambda_m \\
0 & 0 & 0 & \cdots & 0 & 0
\end{pmatrix}.$$

When the rates λ0, λ1, · · · , λm are distinct,

$$G(\lambda_0, \lambda_1, \cdots, \lambda_m; T) = \left[\prod_{i=0}^{m-1} \lambda_i\right] \sum_{k=0}^{m} \frac{e^{-\lambda_k T}}{\prod_{j=0,\, j \neq k}^{m} (\lambda_j - \lambda_k)}.$$

When all the rates are equal, i.e. λ = λ0 = λ1 = · · · = λm, the probabilities are given by the Poisson probabilities

$$G(\lambda_0, \lambda_1, \cdots, \lambda_m; T) = \frac{e^{-\lambda T} (\lambda T)^m}{m!}.$$

If some of the rates are equal, i.e. λk = λj, then the expression should be interpreted as a limit as λk → λj. Assume that λ0, · · · , λm take m∗ distinct values λ̃1, · · · , λ̃m∗ with frequencies n1, · · · , nm∗, where ∑_{i=1}^{m∗} ni = m + 1; then

$$G(\lambda_0, \lambda_1, \cdots, \lambda_m; T) = \left[\prod_{i=0}^{m-1} \lambda_i\right] \sum_{i=1}^{m^*} \frac{(-1)^{n_i - 1}}{(n_i - 1)!} \frac{\partial^{\,n_i - 1}}{\partial \tilde{\lambda}_i^{\,n_i - 1}} \sum_{k=1}^{m^*} \frac{e^{-\tilde{\lambda}_k T}}{\prod_{j \neq k} (\tilde{\lambda}_j - \tilde{\lambda}_k)}. \qquad (6.3)$$
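As a quick numerical check of the identities above, the sketch below evaluates G both as the (1, m + 1) entry of the matrix exponential of Q̃ and via the distinct-rate closed form. This is an illustrative sketch using scipy.linalg.expm, not code from the dissertation; the two routines should agree whenever the rates are distinct.

```python
import numpy as np
from scipy.linalg import expm

def G_matrix_exp(rates, T):
    """G(lambda_0..lambda_m; T) as entry (0, m) (0-indexed) of expm(Q~ T),
    where Q~ chains the states 0 -> 1 -> ... -> m -> absorbing."""
    m1 = len(rates)                                # m + 1 transient states
    Q = np.zeros((m1 + 1, m1 + 1))
    for i, lam in enumerate(rates):
        Q[i, i], Q[i, i + 1] = -lam, lam
    return expm(Q * T)[0, m1 - 1]

def G_closed_form(rates, T):
    """Distinct-rate closed form:
    prod_{i<m} lambda_i * sum_k exp(-lambda_k T) / prod_{j != k} (lambda_j - lambda_k)."""
    m = len(rates) - 1
    prefac = np.prod(rates[:m])
    total = 0.0
    for k, lk in enumerate(rates):
        denom = np.prod([lj - lk for j, lj in enumerate(rates) if j != k])
        total += np.exp(-lk * T) / denom
    return prefac * total

rates = [0.6, 1.1, 0.4]                            # lambda_0, lambda_1, lambda_2
print(G_matrix_exp(rates, 1.0), G_closed_form(rates, 1.0))  # should agree
```

The matrix-exponential route is the more robust of the two, since it handles repeated or nearly equal rates without the limiting argument needed for (6.3).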
6.1.2 Importance Sampling

Since the probability density of the sequence of states and jump times, g(x0:m, t0:m|x0, T), can be computed pointwise by (6.1), we can use Monte Carlo methods to obtain a sample of sequences of states and jump times that start from x0 and end with xm. We propose to use importance sampling. Denote by qpath(x0:m, t0:m) the density of the proposal distribution for x0:m and t0:m satisfying x0 = a, xm = b, and ∑_{i=0}^{m} ti = T. The importance weight for a proposed path {x0:m, t0:m} can be computed by

$$w_{\text{path}}(x_{0:m}, t_{0:m}) = \frac{g(x_{0:m}, t_{0:m} | x_0, T)}{q_{\text{path}}(x_{0:m}, t_{0:m})}.$$

The transition probability Pa,b(T) can be rewritten as

$$P(Y_T = b | Y_0 = a) = \int_{\{x_{0:m}, t_{0:m}\}:\, x_0 = a,\, x_m = b,\, \sum_{i=0}^{m} t_i = T} g(x_{0:m}, t_{0:m} | x_0 = a, T)
= \int \frac{g(x_{0:m}, t_{0:m} | x_0 = a, T)}{q_{\text{path}}(x_{0:m}, t_{0:m})}\, q_{\text{path}}(x_{0:m}, t_{0:m})
= \int w_{\text{path}}(x_{0:m}, t_{0:m})\, q_{\text{path}}(x_{0:m}, t_{0:m}).$$

Hence, a consistent estimate of Pa,b(T) is

$$\hat{P}_{a,b}(T) = \frac{1}{K} \sum_{k=1}^{K} w_{\text{path}}\left(x^{(k)}_{0:m}, t^{(k)}_{0:m}\right),$$

where $w_{\text{path}}(x^{(k)}_{0:m}, t^{(k)}_{0:m})$ is the importance weight of the k-th proposed path.

Alternatively, we can sample a sequence of states without sampling the jump times. Denote by qstate(x0:m) the density of the proposal distribution for x0:m with fixed ends, i.e. x0 = a and xm = b. The importance weight for a proposed sequence x0:m can be computed by

$$w_{\text{state}}(x_{0:m}) = \frac{g(x_{0:m} | x_0, T)}{q_{\text{state}}(x_{0:m})}.$$

The transition probability Pa,b(T) can be rewritten as

$$P(Y_T = b | Y_0 = a) = \sum_{x_{0:m}:\, x_0 = a,\, x_m = b} g(x_{0:m} | x_0 = a, T) = \sum_{x_{0:m}:\, x_0 = a,\, x_m = b} w_{\text{state}}(x_{0:m})\, q_{\text{state}}(x_{0:m}).$$

Hence, a consistent estimate of Pa,b(T) is

$$\hat{P}_{a,b}(T) = \frac{1}{K} \sum_{k=1}^{K} w_{\text{state}}\left(x^{(k)}_{0:m}\right),$$

where $w_{\text{state}}(x^{(k)}_{0:m})$ is the importance weight of the k-th proposed sequence of states.

In order to use the importance sampling method to estimate Pa,b(T), we need to choose a proposal distribution for sampling x0:m that starts from x0 = a and ends with xm = b. The choice of proposal distribution depends on the particular application. We provide more details about the proposal distribution in importance sampling for transition probabilities of CTMCs in Section 6.2, and about the one used in the SMC algorithm for an evolutionary model in Section 6.3.

6.2 Birth-death Processes

In this section, we use BDPs as an example of a CTMC to illustrate the importance sampling method for transition probabilities. In a BDP with instantaneous birth rate υn and death rate µn, the rate of departing from state n is λ(n) = υn + µn. After we sample a sequence of states x0:m, we evaluate its probability using Equation (6.2), in which the λi's are computed by λi = λ(xi) = υxi + µxi.

6.2.1 Sampling a Sequence of States with Two Fixed Ends

The task is now to sample x0:m starting from x0 = a and ending with xm = b for the BDP. If the current state is x, the next state, denoted x′, can only be x − 1 or x + 1 in a BDP. We build the proposal using a sequence of small proposals from the current state to an adjacent state. We use two parameters, αstop and η, in our proposal distribution. If x is the final state b, the proposed sequence of states ends with probability αstop, or jumps to either x − 1 or x + 1, each with probability (1 − αstop)/2. If x is not the final state, it jumps to x + 1 with probability p0 = 1(b > x)(1 − η) + 1(b < x)η, or to x − 1 with probability 1 − p0, where 0 < η < 0.5 and 1(·) is an indicator function. In this way, the proposal avoids being trapped in a loop of jumping back and forth between states such that it would never end at state b.
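The following Python sketch implements this proposal and the resulting importance-sampling estimate of Pa,b(T) for a BDP, with the (6.2)-style weight computed in log space. It is a schematic illustration of the scheme just described, reusing the hypothetical G_matrix_exp routine from the previous sketch for G.

```python
import math
import random

def propose_states(a, b, alpha_stop=0.3, eta=0.4):
    """Sample a state sequence x_0..x_m with x_0 = a, x_m = b; return the
    sequence and its log proposal density log q_state(x_0:m)."""
    x, path, log_q = a, [a], 0.0
    while True:
        if x == b:
            if random.random() < alpha_stop:
                return path, log_q + math.log(alpha_stop)
            log_q += math.log((1.0 - alpha_stop) / 2.0)
            x += 1 if random.random() < 0.5 else -1
        else:
            p_up = (1.0 - eta) if b > x else eta
            if random.random() < p_up:
                x, step_p = x + 1, p_up
            else:
                x, step_p = x - 1, 1.0 - p_up
            log_q += math.log(step_p)
        path.append(x)

def is_estimate(a, b, T, birth, death, G, K=1000):
    """Importance-sampling estimate of P_{a,b}(T); `G` evaluates
    G(lambda_0..lambda_m; T), e.g. G_matrix_exp above."""
    total = 0.0
    for _ in range(K):
        path, log_q = propose_states(a, b)
        log_g, ok = 0.0, True
        for x, y in zip(path, path[1:]):
            jump = birth(x) if y > x else death(x)
            if jump <= 0.0:                        # impossible under the BDP: g = 0
                ok = False
                break
            log_g += math.log(jump) - math.log(birth(x) + death(x))
        if ok:
            rates = [birth(x) + death(x) for x in path]
            total += math.exp(log_g + math.log(G(rates, T)) - log_q)
    return total / K

# Example 1 setting: linear rates, birth = 0.2 n, death = 0.4 n
est = is_estimate(10, 8, 1.0, lambda n: 0.2 * n, lambda n: 0.4 * n, G_matrix_exp)
```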
6.2.2 Example 1

For the simple BDP with υn = nυ0 and µn = nµ0, we have the closed-form solution given explicitly by Bailey [4] as

$$P_{a,b}(t) = \sum_{j=0}^{\min(a,b)} \binom{a}{j} \binom{a + b - j - 1}{b - 1} \alpha^{a-j} \beta^{b-j} (1 - \alpha - \beta)^j, \qquad P_{a,0}(t) = \alpha^a,$$

where

$$\alpha = \frac{\mu_0 \left(e^{(\upsilon_0 - \mu_0)t} - 1\right)}{\upsilon_0 e^{(\upsilon_0 - \mu_0)t} - \mu_0}, \qquad \beta = \frac{\upsilon_0 \left(e^{(\upsilon_0 - \mu_0)t} - 1\right)}{\upsilon_0 e^{(\upsilon_0 - \mu_0)t} - \mu_0}.$$

Figure 6.1 shows a comparison of the transition probabilities P10,b(T = 1) computed using our method with their true values for the birth-death process with υ0 = 0.2 and µ0 = 0.4. In this example, the number of sampled sequences of states is K = 1000, and the tuning parameters in the proposal distribution are αstop = 0.3 and η = 0.4.

Figure 6.1: Comparison of transition probabilities P10,b(T = 1) computed using our importance sampling algorithm (solid line) and their true values (circles) for the birth-death process with υ0 = 0.2 and µ0 = 0.4.

6.2.3 Example 2

The second example is a BDP with the linear rates υn = nυ0 + ν and µn = nµ0 + γ. This model can be used as a population model for the number of organisms in an area where new immigrants arrive at rate ν and emigrants leave at rate γ [19]. Figure 6.2 shows P10,b(T) for several times T and states b, with the parameters υ0 = 0.5, ν = 0.2, µ0 = 0.3, and γ = 0.1. The number of sampled sequences of states is K = 1000, and the tuning parameters in the proposal distribution are αstop = 0.3 and η = 0.4. For the purpose of evaluating our method, we deliberately use the same setting for this example as in [19], and the results agree with each other.

Figure 6.2: Transition probabilities P10,b(T) with T = 1 (solid) and T = 2 (dashed) computed using our importance sampling algorithm for the birth-death process with υ0 = 0.5, ν = 0.2, µ0 = 0.3, and γ = 0.1.

6.3 A General Evolutionary Model for Phylogenetics

6.3.1 String-valued CTMC on a Phylogenetic Tree

We are interested in modelling the evolution between two sequences at the ends of one branch on a phylogenetic tree. Suppose the branch length is T. We use an SCTMC, {Yt : 0 ≤ t ≤ T}, to model the evolution between the two sequences. The individual variable Yt is assumed to have a countably infinite domain of all possible molecular sequences. Let m be the total number of mutation events between the two sequences. The sequence of states is a sequence of molecular strings x0:m = {x0, x1, · · · , xm} along the branch under consideration. The state changes whenever a mutation event occurs; hence the sequence of states is completely determined by the mutation events.

We propose a general evolutionary model that can incorporate various evolutionary phenomena. This is accomplished by defining the rate of jumping away from the current state x as a flexible function of various mutation events, e.g. single-nucleotide indels, substitutions, etc. As an example, in the following we define the rate λ(x) as a function of the point substitution rate per base θsub, the global point insertion rate λpt, the point deletion rate per base µpt, the global SSM insertion rate λSSM, and the SSM deletion rate per valid SSM deletion location µSSM:

$$\lambda(x) = n\theta_{\text{sub}} + \lambda_{\text{pt}} + n\mu_{\text{pt}} + \lambda_{SSM} + k(x)\mu_{SSM}, \qquad (6.4)$$

where n is the length of x, and k(x) is the number of valid SSM deletion locations. The substitution model in this example is a simple Jukes-Cantor model, but our method can easily be adapted to handle more complicated ones, such as the General Time Reversible (GTR) family.
The jump probability from $x_k$ to $x_{k+1}$ is
\[
J(x \to x') = \frac{1}{\lambda(x)}
\begin{cases}
\theta_{\mathrm{sub}} & \text{point substitution,} \\
\lambda_{pt}/(n+1) & \text{point insertion,} \\
\mu_{pt} & \text{point deletion,} \\
\lambda_{SSM}/k(x) & \text{SSM insertion,} \\
\mu_{SSM} & \text{SSM deletion,}
\end{cases}
\]
according to the mutation type that turns $x$ into $x'$.

To extend the proposed SCTMC on one branch to describe the evolution on a phylogenetic tree, we first introduce some notation. Let $\tau$ denote the phylogenetic X-tree under consideration, including the tree topology and the associated branch lengths. Let $V(\tau)$ be the set of internal nodes except for the root. Let $M$ denote the mutation events and their times on the tree, which fully determine the hidden molecular sequences. For an internal node $v \in V(\tau)$, $\rho(v)$ denotes its parent; $\tilde{x}^{(v)}$ denotes the hidden molecular sequence at $v$; and $\beta^{(v)}$ is the length of the branch, denoted $e_v$, that points to $v$ in the direction from the root. We assume there exists a bijection between $M$ and the hidden strings. The probability of the observed sequences $Y$ and the mutation events $M$ given the phylogenetic X-tree $\tau$ is
\[
p(Y, M \mid \tau) = p(\tilde{x}^{(0)}) \prod_{v \in V(\tau)} g\big(x^{(v)}_{0:m_v}, t^{(v)}_{0:m_v} \,\big|\, x^{(v)}_0 = \tilde{x}^{(\rho(v))}, \beta^{(v)}\big) \prod_{u \in X} g\big(x^{(u)}_{0:m_u}, t^{(u)}_{0:m_u} \,\big|\, x^{(u)}_0 = \tilde{x}^{(\rho(u))}, \beta^{(u)}\big) \mathbf{1}\big(x^{(u)}_{m_u} = Y(u)\big), \tag{6.5}
\]
where $\{x^{(v)}_{0:m_v}, t^{(v)}_{0:m_v}\}$ are the sequences of hidden strings and jump times along the branch $e_v$. We use an improper uniform distribution over the strings on $\{A, C, G, T\}$ as the distribution for the root sequence $\tilde{x}^{(0)}$. Using Bayesian approaches, the target distribution is the posterior of $\tau$ and $M$ given $Y$, which is proportional to
\[
\gamma(\tau, M) = \gamma_Y(\tau, M) = p(Y, M \mid \tau)\, p(\tau), \tag{6.6}
\]
where the likelihood model is described in Equation (6.5) and $p(\tau)$ is a prior on the tree $\tau$.

6.3.2  Sequential Monte Carlo

We currently propose to use a simple SMC algorithm that requires fixing the evolutionary parameters. In the proposed method for our evolutionary model, the $r$-th partial state $s_r$ is a forest that includes the forest topology and the associated branch lengths, denoted $\tau_r$, as well as the mutation events and the jump times, denoted $M_r$; i.e. $s_r = (\tau_r, M_r)$. Consider a sequence of intermediate distributions over forests:
\[
\gamma(s_r) = \prod_{(\tau_i, X_i) \in s_r} \gamma_{Y(X_i)}\big(\tau_i, M_r(\tau_i)\big),
\]
where $M_r(\tau_i)$ denotes the mutation events and their jump times on the tree $\tau_i$. At the $r$-th iteration of SMC, the weight update for the $k$-th particle is
\[
w_{r,k} = w(\tilde{s}_{r-1,k}, s_{r,k}) = \frac{\gamma(s_{r,k})}{\gamma(\tilde{s}_{r-1,k})} \cdot \frac{\nu^-(s_{r,k} \to \tilde{s}_{r-1,k})}{\nu^+(\tilde{s}_{r-1,k} \to s_{r,k})},
\]
where $\nu^+$ is the forward proposal distribution and $\nu^-$ is the backward proposal distribution. For simplicity, in this chapter we only consider clock trees, for which $\nu^- \equiv 1$.

The proposal $\nu^+(s \to \cdot)$ randomly selects a pair of trees in $s$ to merge and proposes the hidden strings and jump times along the newly proposed branches. Suppose the root strings of the selected subtrees are $a$ and $b$. We sample the hidden strings $x_{0:m}$ that start from $x_0 = a$ and end with $x_m = b$ using an epsilon-greedy strategy based on Levenshtein distances to the fixed ending string. We use two parameters, $\alpha_{\mathrm{stop}}$ and $\alpha_{\mathrm{greedy}}$, in our proposal distributions. For the current string $x$, we consider all possible next strings. A potential next string, denoted $x'$, can be obtained by applying one mutation event to $x$. If $x$ is the final state $b$, the proposed sequence of states ends with probability $\alpha_{\mathrm{stop}}$; otherwise, $x'$ is selected probabilistically according to
\[
q_M(x, x') \propto \exp\big({-\alpha_{\mathrm{greedy}}} \cdot d(x', b)\big),
\]
where $d(x', b)$ is the Levenshtein distance between $x'$ and the ending string $b$, so that candidates closer to $b$ are favoured.
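A minimal sketch of one greedy step of this proposal is given below; `neighbours(x)` is a placeholder for enumerating the strings one mutation event away from $x$ (its definition depends on the model's mutation types), and the remaining names are ours.

import numpy as np

def levenshtein(s, t):
    """Edit distance via the standard one-row dynamic program."""
    d = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        prev, d[0] = d[0], i              # prev holds the old diagonal cell
        for j, ct in enumerate(t, 1):
            cur = min(d[j] + 1,           # deletion
                      d[j - 1] + 1,       # insertion
                      prev + (cs != ct))  # substitution or match
            prev, d[j] = d[j], cur        # slide the diagonal along the row
    return d[len(t)]

def propose_next_string(x, b, neighbours, alpha_greedy, rng):
    """One greedy step of the string proposal in Section 6.3.2: softmax over
    all one-mutation neighbours, weighted by exp(-alpha_greedy * d(x', b))."""
    cands = neighbours(x)
    logits = np.array([-alpha_greedy * levenshtein(c, b) for c in cands], float)
    p = np.exp(logits - logits.max())     # stabilized softmax
    p /= p.sum()
    return cands[rng.choice(len(cands), p=p)]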
6.3.3  Simulation Studies

Asymptotic String Lengths

We investigate the distribution of the asymptotic string length by varying one parameter at a time while fixing all the others. Figure 6.3 shows the mean string lengths and standard deviations over 10000 iterations of forward simulation. One interesting result is the behaviour of the asymptotic string length as the point substitution rate varies. If there were no SSM mutations, we would expect the string length to change little with the point substitution rate. However, the mean string length increases with the point substitution rate because of the interplay between SSM mutations and point mutations. More specifically, SSM deletion is one large factor in evolving shorter strings, and a larger point substitution rate reduces the number of positions at which SSM deletions can occur.

Figure 6.3: Mean string lengths and standard deviations using 10000 iterations of forward simulation. Panels vary the point deletion rate, global insertion rate, SSM deletion rate, global SSM insertion rate, point substitution rate, and SSM length. The basic setting for the parameters is: SSM length = 3, θ_sub = 0.01, λ_pt = 1.0, µ_pt = 0.1, λ_SSM = 3.0, and µ_SSM = 0.2.

Tree Inference

We sampled 10 random trees from the coalescent on ten leaves, and along each randomly generated tree we simulated 5 sets of molecular sequences with different random seeds according to our evolutionary model. The parameters used are: SSM length = 3, the point mutation rate per base θ_sub = 0.03, the global point insertion rate λ_pt = 0.05, the point deletion rate per base µ_pt = 0.2, the global SSM insertion rate λ_SSM = 2.0, and the SSM deletion rate per valid SSM deletion location µ_SSM = 2.0. Figure 6.4 shows one subset of the simulated data.

Figure 6.4: A subset of simulated data.

The unaligned sequences on the leaves are used for tree reconstruction using the SMC algorithm. In the proposal distribution, we used α_stop = 0.9 and α_greedy = 50. We summarized the posterior over trees using a consensus tree optimizing the posterior expected pairwise distances. Figure 6.5 shows tree distances between the generated trees and the consensus trees reconstructed under our evolutionary model. The partition metric decreases as the number of particles increases, suggesting that it is possible to reconstruct fairly accurate topologies from noisy data generated by complex evolutionary mechanisms. Although the Kuhner-Felsenstein (KF) metric results also improve with more particles, a non-trivial error remains even when a large number of particles is used. This is probably because we currently sample the times of the changes using a simple proposal, the uniform order statistics distribution.

Figure 6.5: Performance on estimating trees. Panels show the partition metric and the Kuhner-Felsenstein metric against the number of particles.
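For reference, the partition metric reported in Figure 6.5 can be computed from clade sets. The sketch below is a simplified rooted version (the metric is usually defined on unrooted bipartitions) using nested tuples as a toy tree representation; both simplifications are ours.

def clades(tree):
    """Leaf sets below each internal node of a rooted tree given as nested
    tuples, e.g. ((("A", "B"), "C"), ("D", "E"))."""
    out = set()
    def walk(node):
        if not isinstance(node, tuple):
            return frozenset([node])
        below = frozenset().union(*map(walk, node))
        out.add(below)
        return below
    walk(tree)
    return out

def partition_metric(t1, t2):
    """Symmetric difference of clade sets: a rooted analogue of the
    Robinson-Foulds partition distance used in Figure 6.5."""
    return len(clades(t1) ^ clades(t2))

# Identical topologies give 0; swapping, say, "C" and "D" above gives > 0.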
6.4  Discussion

We have presented a flexible evolutionary model using SCTMC that can incorporate a wide range of evolutionary phenomena. The flexibility comes from the perspective that molecular evolution can be modelled as a string-valued jump process where the jumps depend only on a localized context. Specifically, the rate of jumping from the current state s to other states is a function of that state, denoted λ(s), which allows us to design complicated and realistic evolutionary models.

The greatest challenge of the SMC method for our evolutionary model lies in designing appropriate proposal distributions with fixed starting and ending states. In the current SMC algorithm for our evolutionary model, we sample both the hidden strings and the jump times. As shown in our example, a non-trivial error in terms of the KF metric remains even when a large number of particles is used, implying poor proposals for the jump times. In future work, we need to explore more sophisticated and efficient proposals for both the states and the jump times. One alternative is to integrate the jump times out using Equation (6.2). If there are identical values in {λ_i : i = 0, · · · , m}, this method involves the complicated partial derivatives in Equation (6.3), which can be challenging to compute when the orders of the derivatives are high.

Our ultimate goals include estimating the evolutionary parameters in our SCTMC model. Currently we simplify the problem by focusing on estimating phylogenetic trees with fixed evolutionary parameters. In future work, we can use our SMC method within each iteration of the particle Markov chain Monte Carlo (PMCMC) algorithms to jointly estimate the evolutionary parameters and the phylogenetic trees.

Chapter 7

Summary and Future Work

In this concluding chapter, we summarize our contributions to Bayesian phylogenetics and propose some research directions for future work.

7.1  Summary

We have presented various Monte Carlo methods for Bayesian phylogenetic inference to sample from the joint posterior distribution of phylogenetic trees and evolutionary parameters. We propose a novel and general class of SMC algorithms motivated by the over-counting problem that arises when applying SMC to non-clock phylogenetic trees. In contrast to the standard application of MCMC in Bayesian phylogenetic inference, the proposed CSMC can efficiently explore the huge phylogenetic tree space. Both theoretical and experimental results show that CSMC can obtain a consistent estimate of the expectation of a function with respect to a target measure.

We have proposed to jointly estimate phylogenetic trees and evolutionary parameters in two ways of combining MCMC and SMC. The first method is a set of PMCMC algorithms with CSMC at each iteration. At each iteration of a PMCMC algorithm, a phylogenetic tree is proposed by running CSMC. Since the main advantage of CSMC is that one can use low-dimensional proposal distributions to efficiently reconstruct high-dimensional phylogenetic trees, the proposal is sensible enough to allow bold moves in the huge tree space. The second method, an SMC sampler, provides a way to embed the standard MCMC algorithms from the phylogenetics literature into an SMC algorithm. We have also described a framework for fine-grain parallelization of SMC algorithms, implemented using multiple threads on a single CPU.

The above work provides a set of flexible and efficient tools for many situations in Bayesian phylogenetic inference.
First, our methods are very general in that they work for both non-clock trees and restricted clock trees. Second, from our results, PMMH can achieve a better estimate of the evolutionary parameters than the standard MCMC. Third, users can fully utilize their computing resources and time by carefully specifying the number of particles at each PMCMC iteration and the number of iterations. For example, if computer memory does not allow us to run a sufficient number of particles simultaneously in CSMC, we can run PIMH for enough iterations as an alternative. In addition to the efficient algorithms for general phylogenetic tree reconstruction, we have presented a general evolutionary model based on a string-valued continuous time Markov chain (SCTMC), incorporating a wide range of molecular mechanisms, and we have explored an SMC algorithm to perform Bayesian phylogenetic inference under it.

7.2  Inferring Large Scale Trees

Genome-wide datasets offer opportunities for resolving difficult phylogenetic problems but also pose computational and statistical challenges for data analysis [83]. Our ultimate goal in phylogenetics is to develop new statistical evolutionary models and computational algorithms for the efficient analysis of large-scale datasets. A large-scale dataset may have a large number of taxa, a large number of sites for each sequence, or both. When the dataset is large-scale, the methods proposed in the previous chapters might fail for two possible reasons: main memory shortages or unrealistic running times. We propose to infer large scale trees by: 1) algorithms using divide-and-conquer strategies; 2) parallel computing.

7.2.1  Divide-and-Conquer Strategies

We propose three divide-and-conquer strategies to decompose a large scale phylogenetic inference problem. The first, which consists in using the PIMH algorithm to replace SMC, is a general strategy to deal with a main memory shortage. The second, the particle block Gibbs algorithm, targets datasets with a large number of taxa. The third, an SMC algorithm using partial data, targets datasets with a relatively small number of taxa in which each taxon has a long sequence.

Using the PIMH Algorithm

The PIMH algorithm can be used as a divide-and-conquer strategy to divide the task of running the large number K of particles required by an SMC algorithm into several smaller steps. More precisely, when the K particles cannot be accommodated in main memory, we can use a smaller number K′ of particles in each of R iterations of the PIMH algorithm, where K′ is the maximum number of particles that fit in main memory and RK′ ≥ K. This method is justified by the simulation studies of Section 5.1.4, in which PIMH and SMC perform similarly in terms of phylogenetic tree inference.

The Particle Block Gibbs Algorithm

We can sample sub-blocks of a phylogenetic tree t when the number of taxa, n, is so large that a prohibitive number of particles would be required for an efficient global update. We divide t into D large sub-blocks, {t_1, · · · , t_D}, and use a conditional SMC algorithm for each sub-block. Denote t_{−d} = {t_1, · · · , t_{d−1}, t_{d+1}, · · · , t_D} for d ∈ {1, · · · , D}. Algorithm 7.1 describes the proposed particle block Gibbs sampler for phylogenetics, in which p(t_d | t_{−d}, Y) can be p_θ(t_d | t_{−d}, Y), i.e. it may depend on a hyperparameter θ. We omit θ from the notation for simplicity.

Algorithm 7.1 The Particle Block Gibbs Sampler for Phylogenetics
  Initialization: i = 0; sample t(0) arbitrarily.
  for i ≥ 1 do
    Randomly divide t(i − 1) into D_i large sub-blocks {t_1(i − 1), · · · , t_{D_i}(i − 1)}.
    for d ∈ {1, · · · , D_i} do
      Run a conditional CSMC algorithm targeting p(t_d | t_{−d}, Y), conditional on t_d(i − 1) and its ancestral lineage.
      Sample t_d(i) ∼ p̂_K(· | t_{−d}, Y).
    end for
  end for
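Concretely, the outer structure of Algorithm 7.1 might be realized as the following Python skeleton, where `divide`, `conditional_csmc`, and `reassemble` are placeholders for the model-specific components discussed next; all names are ours.

def particle_block_gibbs(t_init, Y, n_iter, divide, conditional_csmc,
                         reassemble, rng):
    """Skeleton of Algorithm 7.1. `divide` returns a list of random tree
    sub-blocks; `conditional_csmc` runs a conditional CSMC targeting
    p(t_d | t_{-d}, Y), keeping the current block as the retained particle;
    `reassemble` puts the blocks back into a tree."""
    t = t_init
    for _ in range(n_iter):
        blocks = divide(t, rng)                 # random sub-blocks of t
        for d in range(len(blocks)):
            rest = blocks[:d] + blocks[d + 1:]  # the conditioning blocks t_{-d}
            blocks[d] = conditional_csmc(blocks[d], rest, Y, rng)
        t = reassemble(blocks)
    return t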
The compact pseudo-code shown in Algorithm 7.1 hides two difficulties. The first difficulty lies in how we divide t into sub-blocks. Another difficulty is that we need to adapt the conditional CSMC algorithm to make it target p(t_d | t_{−d}, Y), which depends on how we define the sub-blocks of a tree. A natural way is to define a sub-block as a subtree that can be resampled using our CSMC algorithm. In the following, we describe how we partition a non-clock tree and a clock tree, respectively, into a set of subtrees.

For non-clock trees, we randomly choose a pair of leaves and identify the path between the two leaves. Note that the path between two nodes on a tree is unique. Then we randomly choose a node on the path and select one clade connected to this node as the sub-block to update.

It is more complicated to partition a clock tree than a non-clock tree because of the ultrametric constraints. We propose to cut a clock tree using a horizontal line with a random height. The block above the horizontal line is a clock subtree with pseudo-observations; the part below this line is a forest of clock subtrees and parts of their top branches. We randomly choose one of these subtrees to update. In order to maintain a clock tree after one subtree is updated, we restrict the height of an updated subtree below the horizontal line to be its height before the update. This height restriction is embedded in the proposal of the SMC algorithm, in which the proposed forest height cannot exceed the original subtree height.

An SMC Algorithm Using Partial Data

Chopin [18] introduced an SMC algorithm that targets a posterior distribution given partial data and applied it to two examples: mixture models and discrete generalized linear models. Relevant previous work includes [5, 87], which focuses on a similar SMC algorithm for massive datasets, targeting the posterior distribution of a smaller, more manageable portion of the whole dataset at each iteration. Their examples include a mixture of transition models that can be used to model web traffic and robotics, and a Bayesian logistic regression model.

We propose to extend the traditional applications of this type of SMC algorithm to Bayesian phylogenetics. Recall that our target distribution in Bayesian phylogenetics is the joint posterior of a phylogenetic tree t and evolutionary parameters θ, i.e. π(x) ≡ p(x|Y), where x = (t, θ). For the situation where each taxon is described by a long sequence, we can use the observed data Y sequentially by considering sub-blocks of the data, Y_1, Y_2, · · · , Y_R. Denote the first r blocks of data by Y_{1:r}. We define the r-th intermediate distribution as
\[
\pi_r(x) = p(x \mid Y_{1:r}), \tag{7.1}
\]
where Y_{1:R} = Y and π_R(x) = π(x) = p(x|Y). We use the SMC sampler (Algorithm 2.5) with the backward kernel
\[
L_{r-1}(x_r, x_{r-1}) = \frac{\pi_r(x_{r-1})\, K_r(x_{r-1}, x_r)}{\pi_r(x_r)},
\]
which leads to the incremental importance weight w(x_{r−1}, x_r) = γ_r(x_{r−1})/γ_{r−1}(x_{r−1}). More precisely, using (7.1), we have
\[
w(x_{r-1}, x_r) = \frac{p(x_{r-1})\, p(Y_{1:r} \mid x_{r-1})}{p(x_{r-1})\, p(Y_{1:r-1} \mid x_{r-1})} = p(Y_r \mid x_{r-1}).
\]
The proposed method is summarized in Algorithm 7.2, where the MH kernels {K_r^i} are described in Section 5.2.

Algorithm 7.2 An SMC Sampler Using Partial Data
  x_{1,k} ← ⊥, ∀k ∈ {1, · · · , K}
  w_{1,k} ← 1/K
  for r ∈ {2, . . . , R} do
    Sample x_{r,k} ∼ Σ_{i=1}^{M} p_i K_r^i(x_{r−1,k}, ·), where Σ_{i=1}^{M} p_i = 1
    w_{r,k} ← p(Y_r | x_{r−1,k})
    Normalize the weights, W_{r,k} ∝ w_{r,k}, and resample {x_{r,k}, W_{r,k}}
  end for
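A minimal Python sketch of Algorithm 7.2 follows; for concreteness the mixture of MH kernels is collapsed into a single placeholder `mh_kernel`, multinomial resampling is used, and `init_particles` is assumed to approximate p(x | Y_1). All names are ours.

import numpy as np

def smc_partial_data(data_blocks, init_particles, mh_kernel, loglik_block, rng):
    """Sketch of Algorithm 7.2: SMC over growing subsets of the data.

    `mh_kernel(x, r, rng)` should be an MCMC move invariant for p(x | Y_{1:r});
    `loglik_block(x, Y_r)` returns log p(Y_r | x). Both are placeholders."""
    K = len(init_particles)
    xs = list(init_particles)                          # particles x_{1,k}
    for r, Y_r in enumerate(data_blocks[1:], start=2):
        logw = np.array([loglik_block(x, Y_r) for x in xs])  # w_{r,k} = p(Y_r | x_{r-1,k})
        new_xs = [mh_kernel(x, r, rng) for x in xs]          # propagate with K_r
        W = np.exp(logw - logw.max())
        W /= W.sum()                                         # normalized weights W_{r,k}
        idx = rng.choice(K, size=K, p=W)                     # multinomial resampling
        xs = [new_xs[i] for i in idx]
    return xs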
Regarding how we define the sub-blocks of the sequence data, one choice is: Y_1 includes data from site 1 to site [m/R], Y_2 includes data from site [m/R] + 1 to site [2m/R], and so on, where m is the total number of sites and the number of data sub-blocks equals the number of SMC iterations, R.

7.2.2  Parallel Computing on GPUs

As discussed in Section 5.3, SMC and PMCMC algorithms are more efficient when implemented in parallel. We have demonstrated that the performance of CSMC can be improved when it is executed with multiple threads on a single CPU. Since graphics processing units (GPUs) have an extremely large number of computer cores [69, 100], it is worthwhile to explore ways to parallelize PMCMC algorithms on GPUs.

GPUs are self-contained parallel computational devices. In contrast to CPUs, GPUs have an extremely large number of relatively simple computer cores, each capable of executing fast arithmetic and logical instructions. Computer cores on GPUs are organized in blocks. Each block has a shared memory common to all the cores of that block. In addition, there is device memory accessible to all cores. Communication between the CPU and GPUs is only possible via device memory. Threads on GPUs also differ from those on CPUs: creating and destroying threads on a CPU is very expensive, whereas the creation of GPU threads is significantly faster; one can create several thousand threads in less than ten clock cycles [69, 100].

GPU computing is characterized by SIMD: single instruction, multiple data [69, 100]. This characteristic makes GPUs suitable for SMC algorithms, in which the same set of instructions, i.e. updating and reweighting a particle, is applied to each of a large number of particles with the same structure.

It is challenging to use GPUs for reconstructing phylogenetic trees because each particle might be too large to be loaded into device memory. Therefore, inferring large scale trees on GPUs will involve a combination of the proposed divide-and-conquer strategies and super-fine-grain parallelization. Another challenge of computing with GPUs is developing the skills for low-level programming of the hardware. Two software frameworks, Open Multi-Processing (OpenMP) and Compute Unified Device Architecture (CUDA), can be used to parallelize computation on GPUs.
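The per-particle independence that makes SMC a good match for SIMD hardware can be illustrated even without a GPU. The following CPU-side sketch (a stand-in for GPU code, not CUDA) parallelizes the update-and-reweight step over a process pool; `update_and_weight` is a placeholder for the model-specific per-particle work.

from concurrent.futures import ProcessPoolExecutor

def parallel_update(particles, update_and_weight, workers=8):
    """Apply the same update-and-reweight instructions to every particle in
    parallel; each call is independent, mirroring the SIMD structure."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(update_and_weight, particles))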
7.3  Harnessing Non-Local Evolutionary Events for Tree Inference

In Chapter 6, we presented our preliminary work on this topic. By non-local, we mean that sequence changes can depend on contexts of unbounded random size. More precisely, the model is based on an extension of the Doob-Gillespie construction for continuous time Markov chains (CTMCs), in which the exponential rates are allowed to depend on an unbounded context. In the near future, we are going to work in the following directions:

1. Understand some important molecular mechanisms, e.g. context-sensitive substitutions, structural constraints in RNA evolution, etc. We are particularly interested in slipped strand mispairing (SSM), a well-known explanation for the evolution of repeated sequences [59, 80, 104]. Although λ(s) in Equation (6.4) is very flexible, biological background is required to understand how to appropriately incorporate the various molecular mechanisms in λ(s).

2. Develop a tree sampling algorithm allowing non-local sequence evolution models. We propose to use PMCMC algorithms to approximate the posterior distribution of the evolutionary parameters, phylogenetic trees, and hidden molecular sequences given the observed data on the leaves. At each iteration, the evolutionary parameters are updated; within each iteration, we use the SMC algorithm proposed in Chapter 6 to update the hidden molecular sequences and the phylogenetic trees.

3. Apply our method to study how the complex interplay between SSMs and point mutations affects the inferred trees. Our experiments will involve simulated sequence data with realistic patterns of SSMs. We will compare the tree reconstructions obtained with our model incorporating SSMs to those from models that ignore SSMs. We will also apply our method to plant intron datasets, where SSMs play an important role in sequence change.

7.4  Joint Estimation of MSA and Phylogeny

It is desirable to jointly estimate phylogeny and alignment from their joint posterior distribution [86, 101]. Using a single fixed multiple sequence alignment (MSA) will often bias the phylogenetic tree estimate toward the guide tree used to produce the alignment [112]. In addition, a single guide tree may decrease MSA accuracy [38]. On the other hand, estimating an MSA depends on the tree reconstruction: [72] proposed an MSA inference algorithm based on the result of a high-performance phylogenetic tree algorithm, and the estimation of the MSA and of the phylogenetic tree can be iterated several times. However, these systems lack theoretical grounding, have difficulty producing calibrated confidence intervals, and suffer from over-alignment problems [74, 94].

An ambitious goal is to extend our model to do joint estimation of MSA, phylogeny, and evolutionary parameters. In contrast to methods based on an MSA point estimate, joint estimation avoids guide-tree bias and over-confidence problems. Moreover, non-local evolutionary events, such as SSMs, are known to affect MSAs and are already used to do manual alignment. One future direction is to develop efficient algorithms for joint estimation of MSAs and phylogeny based on the work proposed in Section 7.3.

Bibliography

[1] M. E. Alfaro and M. T. Holder. The posterior and the prior in Bayesian phylogenetics. Annual Review of Ecology, Evolution, and Systematics, 37(1):19–42, 2006. → pages 2, 31

[2] C. Andrieu, N. de Freitas, A. Doucet, and M. I. Jordan. An introduction to MCMC for machine learning. Machine Learning, 50:5–43, 2003. → pages 76

[3] C. Andrieu, A. Doucet, and R. Holenstein. Particle Markov chain Monte Carlo methods. J. R. Statist. Soc. B, 72(3):269–342, 2010. → pages 4, 17, 18, 61, 62, 65

[4] N. T. J. Bailey. The Elements of Stochastic Processes with Applications to the Natural Sciences. John Wiley & Sons Inc., New York, 1964. → pages 90

[5] S. Balakrishnan and D. Madigan. A one-pass sequential Monte Carlo method for Bayesian analysis of massive datasets. Bayesian Analysis, 1:345–362, 2006. → pages 102

[6] O. E. Barndorff-Nielsen and N. Shephard. Non-Gaussian Ornstein–Uhlenbeck-based models and some of their uses in financial economics. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 63(2):167–241, 2001. → pages 11

[7] M. A. Beaumont, W.
Zhang, and D. J. Balding. Approximate Bayesian computation in population genetics. Genetics, 162:2025–2035, 2002. → pages 4

[8] J. Bergsten. A review of long-branch attraction. Cladistics, 21(2):163–193, 2005. → pages 27

[9] A. Bouchard-Côté, S. Sankararaman, and M. I. Jordan. Phylogenetic inference via sequential Monte Carlo. Systematic Biology, 2011. → pages 4, 37, 46, 48

[10] R. J. Boys, D. J. Wilkinson, and T. B. Kirkwood. Bayesian inference for a discretely observed stochastic kinetic model. Statistics and Computing, 18(2):125–135, 2008. → pages 11

[11] S. T. Buckland, K. B. Newman, C. Fernández, L. Thomas, and J. Harwood. Embedding population dynamics models in inference. Statistical Science, 22:44–58, 2007. → pages 11

[12] C. Notredame, D. G. Higgins, and J. Heringa. T-Coffee: A novel method for fast and accurate multiple sequence alignment. Journal of Molecular Biology, 302(1):205–217, 2000. → pages 26

[13] J. Cannone, S. Subramanian, M. Schnare, J. Collett, L. D'Souza, Y. Du, B. Feng, N. Lin, L. Madabusi, K. Muller, N. Pande, Z. Shang, N. Yu, and R. Gutell. The comparative RNA web (CRW) site: An online database of comparative sequence and structure information for ribosomal, intron, and other RNAs. BioMed Central Bioinformatics, 2002. → pages 26, 55

[14] O. Cappé, E. Moulines, and T. Rydén. Inference in Hidden Markov Models. Springer, 2005. → pages 59

[15] O. Cappé, S. J. Godsill, and E. Moulines. An overview of existing methods and recent advances in sequential Monte Carlo. Proceedings of the IEEE, 95(5):899–924, May 2007. → pages 3, 37

[16] E. Çinlar. Introduction to Stochastic Processes. Prentice Hall, 1975. → pages 85

[17] S. Cheon and F. Liang. Phylogenetic tree construction using sequential stochastic approximation Monte Carlo. Biosystems, 91(1):94–107, 2008. → pages 26, 70, 72

[18] N. Chopin. A sequential particle filter method for static models. Biometrika, 89(3):539–552, August 01 2002. → pages 102

[19] F. W. Crawford and M. A. Suchard. Transition probabilities for general birth-death processes with applications in ecology, genetics, and evolution. J. Math. Biol., 2011. doi:10.1007/s00285-011-0471-z. → pages 84, 90, 91

[20] M. P. Cummings, S. A. Handley, D. S. Myers, D. L. Reed, A. Rokas, and K. Winka. Comparing bootstrap and posterior probability values in the four-taxon case. Systematic Biology, 52(4):477–487, 2003. → pages 2, 31

[21] P. Del Moral. Feynman-Kac Formulae: Genealogical and Interacting Particle Systems with Applications. Springer, New York, 2004. → pages 13

[22] P. Del Moral, A. Doucet, and A. Jasra. Sequential Monte Carlo samplers. Journal of the Royal Statistical Society Series B (Statistical Methodology), 68(3):411–436, 2006. → pages 4, 16, 37, 43, 46, 49, 62, 75

[23] P. Del Moral, A. Doucet, and A. Jasra. Sequential Monte Carlo for Bayesian computation. Bayesian Statistics, 8:1–34, 2007. → pages 4, 16, 37, 43, 49, 62, 75

[24] P. Del Moral, A. Doucet, and A. Jasra. On adaptive resampling strategies for sequential Monte Carlo methods. Bernoulli, 2011. To appear. → pages 59

[25] C. J. Douady, F. Delsuc, Y. Boucher, W. F. Doolittle, and E. J. P. Douzery. Comparison of Bayesian and maximum likelihood bootstrap measures of phylogenetic reliability. Molecular Biology and Evolution, 20(2):248–254, February 01 2003. → pages 2, 32, 62

[26] A. Doucet and A. M. Johansen. A tutorial on particle filtering and smoothing: fifteen years later. In The Oxford Handbook of Nonlinear Filtering. Oxford University Press, 2009. → pages 3, 13, 37, 59

[27] A. Doucet, N.
de Freitas, and N. Gordon. Sequential Monte Carlo Methods in Practice. Springer, 2001. → pages 3, 16, 37

[28] A. Drummond and A. Rambaut. BEAST: Bayesian evolutionary analysis by sampling trees. BMC Evolutionary Biology, 7(1):214, 2007. → pages 2, 33

[29] P. Erixon, B. Svennblad, T. Britton, and B. Oxelman. Reliability of Bayesian posterior probabilities and bootstrap frequencies in phylogenetics. Systematic Biology, 52(5):665–673, 2003. → pages 2, 31

[30] P. Fearnhead, O. Papaspiliopoulos, and G. O. Roberts. Particle filters for partially observed diffusions. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(4):755–777, Sept. 2008. → pages 11

[31] P. Fearnhead and L. Meligkotsidou. Filtering methods for mixture models. Journal of Computational and Graphical Statistics, 16(3):586–607, 2007. → pages 11

[32] J. Felsenstein. Cases in which parsimony or compatibility methods will be positively misleading. Systematic Zoology, 27(4):401–410, Dec. 1978. → pages 27

[33] J. Felsenstein. Evolutionary trees from DNA sequences: a maximum likelihood approach. J. Mol. Evol., 17:368–376, 1981. → pages 28, 31, 36

[34] J. Felsenstein. Confidence limits on phylogenies: An approach using the bootstrap. Evolution, 39(4):783–791, Jul. 1985. → pages 31

[35] J. Felsenstein. Inferring Phylogenies. Sinauer Associates, 2003. → pages 36

[36] W. Fitch. Toward defining the course of evolution: minimal change for a specific tree topology. Syst. Zool., 20:406–416, 1971. → pages 26, 27

[37] W. M. Fitch and E. Margoliash. Construction of phylogenetic trees. Science, 155:279–284, 1967. → pages 26

[38] R. Fleissner, D. Metzler, and A. von Haeseler. Simultaneous statistical multiple alignment and phylogeny reconstruction. Systematic Biology, 54(4):548–561, August 01 2005. → pages 106

[39] S. Geman and D. Geman. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. Pattern Analysis and Machine Intelligence, IEEE Transactions on, PAMI-6(6):721–741, Nov. 1984. → pages 10

[40] N. Goldman, J. P. Anderson, and A. G. Rodrigo. Likelihood-based tests of topologies in phylogenetics. Systematic Biology, 49(4):652–670, December 01 2000. → pages 31

[41] A. Golightly and D. J. Wilkinson. Bayesian sequential inference for nonlinear multivariate diffusions. Statistics and Computing, 16:323–338, 2006. → pages 11

[42] N. Gordon, D. Salmond, and A. F. M. Smith. Novel approach to nonlinear/non-Gaussian Bayesian state estimation. Radar and Signal Processing, IEE Proceedings F, 140(2):107–113, Apr. 1993. → pages 3, 11

[43] D. Görür and Y. W. Teh. An efficient sequential Monte Carlo algorithm for coalescent clustering. In Advances in Neural Information Processing Systems (NIPS), 2009. → pages 3, 37, 59

[44] M. Goujon, H. McWilliam, W. Li, F. Valentin, S. Squizzato, J. Paern, and R. Lopez. A new bioinformatics analysis tools framework at EMBL-EBI. Nucleic Acids Research, 38(suppl 2):W695–W699, 2010. → pages 26

[45] R. Griffiths and S. Tavaré. Monte Carlo inference methods in population genetics. Math. Comput. Modelling, 23:141–158, 1996. → pages 4

[46] S. Guindon and O. Gascuel. A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood. Systematic Biology, 52(5):696–704, October 01 2003. → pages 28, 31

[47] M. Hasegawa, H. Kishino, and T. Yano. Dating of the human-ape splitting by a molecular clock of mitochondrial DNA. Journal of Molecular Evolution, 22:160–174, 1985. → pages 29

[48] W. K. Hastings.
Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57(1):97–109, April 01 1970. → pages 8

[49] A. Hobolth and E. A. Stone. Simulation from endpoint-conditioned, continuous-time Markov chains on a finite state space, with applications to molecular evolution. Annals of Applied Statistics, 3(3):1204, 2009. → pages 85

[50] S. Höhna and A. J. Drummond. Guided tree topology proposals for Bayesian phylogenetic inference. Systematic Biology, 61(1):1–11, January 01 2012. → pages 3, 33, 34

[51] S. Höhna, M. Defoin-Platel, and A. Drummond. Clock-constrained tree proposal operators in Bayesian phylogenetic inference. In BioInformatics and BioEngineering, 2008. BIBE 2008. 8th IEEE International Conference on, pages 1–7, Oct. 2008. → pages 2, 34

[52] M. Holder and P. Lewis. Phylogeny estimation: Traditional and Bayesian approaches. Nat. Rev.: Genet., 4:275–284, 2003. → pages 2, 26, 27, 33, 62, 76

[53] R. Holenstein. Particle Markov Chain Monte Carlo. PhD thesis, University of British Columbia, 2009. → pages 4, 18, 61, 62, 65

[54] I. Holmes and W. J. Bruno. Evolutionary HMMs: a Bayesian approach to multiple alignment. Bioinformatics, 17(9):803–820, September 01 2001. → pages 31, 84

[55] J. P. Huelsenbeck and F. Ronquist. MRBAYES: Bayesian inference of phylogenetic trees. Bioinformatics, 17(8):754–755, 2001. → pages 2, 33, 55

[56] K. Hukushima and K. Nemoto. Exchange Monte Carlo method and application to spin glass simulations. Journal of the Physical Society of Japan, 65(6):1604–1608, 1996. → pages 80

[57] M. D. Iorio and R. C. Griffiths. Importance sampling on coalescent histories. Adv. Appl. Prob., 36:417–433, 2004. → pages 4

[58] H. Jow, C. Hudelot, M. Rattray, and P. G. Higgs. Bayesian phylogenetics using an RNA substitution model applied to early mammalian evolution. Molecular Biology and Evolution, 19(9):1591–1601, 2002. → pages ix, 34, 35

[59] S. A. Kelchner. The evolution of non-coding chloroplast DNA and its application in plant systematics. Annals of the Missouri Botanical Garden, 87(4):482–498, 2000. → pages 84, 105

[60] M. Kimura. A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. Journal of Molecular Evolution, 16:111–120, 1980. → pages 29

[61] J. Kingman. On the genealogy of large populations. Journal of Applied Probability, 19:27–43, 1982. → pages 51

[62] H. Kishino, T. Miyata, and M. Hasegawa. Maximum likelihood inference of protein phylogeny and the origin of chloroplasts. J. Mol. Evol., 31:151–160, 1990. → pages 28

[63] G. Kitagawa. Monte Carlo filter and smoother for non-Gaussian nonlinear state space models. J. Comput. Graph. Stat., 5:1–25, 1996. → pages 59

[64] T. D. Kocher, J. A. Conroy, K. R. McKaye, J. R. Stauffer, and S. F. Lockwood. Evolution of NADH dehydrogenase subunit 2 in East African cichlid fish. Molecular Phylogenetics and Evolution, 4(4):420–432, 12 1995. → pages 26, 70, 72

[65] M. K. Kuhner and J. Felsenstein. A simulation comparison of phylogeny algorithms under equal and unequal evolutionary rates. Molecular Biology and Evolution, 11(3):459–468, 1994. → pages 36

[66] C. Lakner, P. van der Mark, J. P. Huelsenbeck, B. Larget, and F. Ronquist. Efficiency of Markov chain Monte Carlo tree proposals in Bayesian phylogenetics. Systematic Biology, 57(1):86–103, 2008. → pages 2, 33, 34, 35, 62, 64, 76

[67] B. Larget and D. Simon. Markov chain Monte Carlo algorithms for the Bayesian analysis of phylogenetic trees. Mol. Biol. Evol., 16:750–759, 1999.
→ pages 2, 28, 33, 34, 35, 54, 62, 76

[68] M. Larkin, G. Blackshields, N. Brown, R. Chenna, P. McGettigan, H. McWilliam, F. Valentin, I. Wallace, A. Wilm, R. Lopez, J. Thompson, T. Gibson, and D. Higgins. Clustal W and Clustal X version 2.0. Bioinformatics, 23(21):2947–2948, 2007. → pages 26

[69] A. Lee, C. Yau, M. B. Giles, A. Doucet, and C. C. Holmes. On the utility of graphics cards to perform massively parallel simulation of advanced Monte Carlo methods. Journal of Computational and Graphical Statistics, 19(4):769–789, 2010. → pages 79, 80, 104

[70] S. Li, D. Pearl, and H. Doss. Phylogenetic tree construction using Markov chain Monte Carlo. J. Am. Stat. Assoc., 95:493–508, 2000. → pages 2, 28, 29, 33, 62, 76

[71] J. S. Liu. Monte Carlo Strategies in Scientific Computing. Springer, 2001. → pages 3, 37

[72] K. Liu, S. Raghavan, S. Nelesen, C. R. Linder, and T. Warnow. Rapid and accurate large-scale coestimation of sequence alignments and phylogenetic trees. Science, 324(5934):1561–1564, June 19 2009. → pages 106

[73] G. Lunter, I. Miklós, A. Drummond, J. L. Jensen, and J. Hein. Bayesian coestimation of phylogeny and sequence alignment. BMC Bioinformatics, 6:83–83, 2005. → pages 31, 84

[74] G. Lunter, A. Drummond, I. Miklós, and J. Hein. Statistical alignment: recent progress, new applications and challenges. Springer, New York, 2004. → pages 106

[75] D. Maddison. The discovery and importance of multiple islands of most parsimonious trees. Syst. Zool., 40:315–328, 1991. → pages 26

[76] P. Marjoram, J. Molitor, V. Plagnol, and S. Tavaré. Markov chain Monte Carlo without likelihoods. Proc. Nat. Acad. Sci., 100:15324–15328, 2002. → pages 4

[77] B. Mau, M. Newton, and B. Larget. Bayesian phylogenetic inference via Markov chain Monte Carlo. Biometrics, 55:1–12, 1999. → pages 2, 28, 33

[78] N. Metropolis, A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, and E. Teller. Equation of state calculations by fast computing machines. The Journal of Chemical Physics, 21(6):1087–1092, 1953. → pages 8

[79] I. Miklós, G. A. Lunter, and I. Holmes. A long indel model for evolutionary sequence alignment. Molecular Biology and Evolution, 21(3):529–540, March 01 2004. → pages 31

[80] D. A. Morrison. Why would phylogeneticists ignore computerized sequence alignment? Systematic Biology, 58(1):150–158, February 01 2009. → pages 84, 105

[81] J. S. Paul, M. Steinrücken, and Y. S. Song. An accurate sequentially Markov conditional sampling distribution for the coalescent with recombination. Genetics, 187:1115–1128, 2011. → pages 4

[82] B. Rannala and Z. Yang. Probability distribution of molecular evolutionary trees: a new method of phylogenetic inference. J. Mol. Evol., 43:304–311, 1996. → pages 2, 28, 33

[83] B. Rannala and Z. Yang. Phylogenetic inference using whole genomes. Annual Review of Genomics and Human Genetics, 9(1):217–231, 2008. → pages 100

[84] V. Rao and Y. W. Teh. Fast MCMC sampling for Markov jump processes and continuous time Bayesian networks. In Proceedings of the Twenty-Seventh Conference Annual Conference on Uncertainty in Artificial Intelligence (UAI-11), pages 619–626, Corvallis, Oregon, 2011. AUAI Press. → pages 85

[85] A. Rechnitzer and E. J. J. van Rensburg. Generalized atmospheric Rosenbluth methods (GARM). J. Phys. A: Math. Theor., 41(44):442002, 2008. → pages 60

[86] B. D. Redelings and M. A. Suchard. Joint Bayesian estimation of alignment and phylogeny. Systematic Biology, 54(3):401–418, June 01 2005. → pages 106

[87] G. Ridgeway and D. Madigan.
A sequential Monte Carlo method for Bayesian analysis of massive datasets, 2003. → pages 102

[88] C. P. Robert and G. Casella. Monte Carlo Statistical Methods (Springer Texts in Statistics). Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2005. → pages 6, 7, 8

[89] D. Robinson and L. Foulds. Comparison of phylogenetic trees. Mathematical Biosciences, 53:131–147, 1981. → pages 36

[90] F. Ronquist and J. P. Huelsenbeck. MrBayes 3: Bayesian phylogenetic inference under mixed models. Bioinformatics, 19(12):1572–1574, August 12 2003. → pages 2, 33

[91] F. Ronquist, M. Teslenko, P. van der Mark, D. L. Ayres, A. Darling, S. Höhna, B. Larget, L. Liu, M. A. Suchard, and J. P. Huelsenbeck. MrBayes 3.2: Efficient Bayesian phylogenetic inference and model choice across a large model space. Systematic Biology, February 22 2012. → pages 2, 33

[92] N. Saitou and M. Nei. The neighbor-joining method: A new method for reconstructing phylogenetic trees. Molecular Biology Evolution, 4:406–425, 1987. → pages 26, 27

[93] D. Sankoff. Minimal mutation trees of sequences. SIAM Journal on Applied Mathematics, 28(1):35–42, Jan. 1975. → pages 26, 27

[94] A. S. Schwartz and L. Pachter. Multiple alignment by sequence annealing. Bioinformatics, 23(2):e24–e29, January 15 2007. → pages 106

[95] C. Semple and M. Steel. Phylogenetics. Oxford, 2003. → pages 32

[96] M. P. Simmons, K. M. Pickett, and M. Miya. How meaningful are Bayesian support values? Molecular Biology and Evolution, 21(1):188–199, 2004. → pages 2, 31

[97] D. Simon and B. Larget. Bayesian analysis in molecular biology and evolution (BAMBE), 2000. version 2.03 beta. → pages 2, 33

[98] R. R. Sokal and C. D. Michener. A statistical method for evaluating systematic relationships. University of Kansas Science Bulletin, 38:1409–1438, 1958. → pages 26

[99] R. P. Stanley. Enumerative Combinatorics. Volume I. Cambridge University Press, Cambridge, 1986. → pages 45

[100] M. Suchard, Q. Wang, C. Chan, J. Frelinger, A. Cron, and M. West. Understanding GPU programming for statistical computation: Studies in massively parallel massive mixtures. Journal of Computational and Graphical Statistics, 19(2):419–438, 2010. → pages 78, 79, 104

[101] M. A. Suchard and B. D. Redelings. BAli-Phy: simultaneous Bayesian inference of alignment and phylogeny. Bioinformatics, 22(16):2047–2048, 2006. → pages 106

[102] Y. Suzuki, G. V. Glazko, and M. Nei. Overcredibility of molecular phylogenies obtained by Bayesian phylogenetics. Proceedings of the National Academy of Sciences, 99(25):16138–16143, December 10 2002. → pages 2, 31

[103] R. H. Swendsen and J.-S. Wang. Replica Monte Carlo simulation of spin-glasses. Phys. Rev. Lett., 57:2607–2609, Nov 1986. → pages 79, 83

[104] J. S. Taylor and F. Breden. Slipped-strand mispairing at noncontiguous repeats in Poecilia reticulata: A model for minisatellite birth. Genetics, 155(3):1313–1320, July 01 2000. → pages 84, 105

[105] Y. W. Teh, H. Daumé III, and D. M. Roy. Bayesian agglomerative clustering with coalescents. In Advances in Neural Information Processing Systems (NIPS), 2008. → pages 3, 37, 59

[106] J. Thorne, H. Kishino, and J. Felsenstein. An evolutionary model for maximum likelihood alignment of DNA sequences, 1991. → pages 30, 31, 84

[107] J. L. Thorne, H. Kishino, and J. Felsenstein. Inching toward reality: an improved likelihood model of sequence evolution. Journal of Molecular Evolution, 34(1):3–16, 1992. → pages 31

[108] L. Tierney. Markov chains for exploring posterior distributions.
Annals of Statistics, 22:1701–1762, 1994. → pages 76

[109] J. A. Tom, J. S. Sinsheimer, and M. A. Suchard. Reuse, recycle, reweigh: Combating influenza through efficient sequential Bayesian computation for massive data. Ann. Appl. Stat., 4:1722–1748, 2010. → pages 4

[110] E. J. J. van Rensburg. Monte Carlo methods for the self-avoiding walk. J. Phys. A: Math. Theor., 42:323001, 2009. → pages 60

[111] E. J. J. van Rensburg and A. Rechnitzer. Generalized atmospheric sampling of self-avoiding walks. J. Phys. A: Math. Theor., 42(33):335001, 2009. → pages 60

[112] K. M. Wong, M. A. Suchard, and J. P. Huelsenbeck. Alignment uncertainty and genomic analysis. Science, 319(5862):473–476, January 25 2008. → pages 106

[113] Z. Yang and B. Rannala. Bayesian phylogenetic inference using DNA sequences: a Markov chain Monte Carlo method. Mol. Biol. Evol., 14:717–724, 1997. → pages 2, 28, 33

Appendix A

Proof of Consistency

To prove Proposition 7, we start by introducing a series of lemmas. In the following, we will use the objects $\phi : S \to \mathbb{R}$ and $\lambda : \mathcal{F}_S \to [0, \infty)$.

Lemma 8. For all positive measures $\lambda$ with $\|\lambda\| < \infty$, we have $\mathbb{E}[(\mathrm{prop}_K \lambda)\phi] = (\mathrm{prop}\,\lambda)\phi$, where
\[
(\mathrm{prop}\,\lambda)\phi = \int \lambda(\mathrm{d}x) \int \nu^+(x, \mathrm{d}y)\, w(x, y)\, \phi(y).
\]

Proof. By definition and linearity,
\[
\mathbb{E}[(\mathrm{prop}_K \lambda)(A)]
= \|\lambda\|\, \mathbb{E}\big[w(S_k, S_k');\, S_k' \in A\big]
= \|\lambda\| \int \bar{\lambda}(\mathrm{d}x) \int_A \nu^+(x, \mathrm{d}y)\, w(x, y)
= (\mathrm{prop}\,\lambda)(A),
\]
for all $A \in \mathcal{F}_S$.

Lemma 9. For all positive measures $\lambda$ with $\|\lambda\| < \infty$, we have
\[
(\mathrm{prop}_K \lambda)\phi \xrightarrow{L^2} (\mathrm{prop}\,\lambda)\phi,
\]
with the following rate:
\[
\mathbb{E}\big[(\mathrm{prop}_K \lambda)\phi - (\mathrm{prop}\,\lambda)\phi\big]^2 \le \frac{(C_1 C_2)^2\, \|\lambda\|^2}{K}.
\]

Proof. Using the independence of the $S_k$ in the definition of $\mathrm{prop}_K$, we have
\[
\mathbb{E}\big[(\mathrm{prop}_K \lambda)\phi - (\mathrm{prop}\,\lambda)\phi\big]^2
\le \frac{\|\lambda\|^2}{K}\, \mathbb{E}\big[\big(w(S_k, S_k')\,\phi(S_k')\big)^2\big]
\le \frac{(C_1 C_2)^2\, \|\lambda\|^2}{K}.
\]

Corollary 10. We have
\[
\mathbb{E}\big[\gamma_{r,K}\phi - (\mathrm{prop}\,\gamma_{r-1,K})\phi\big]^2 \to 0.
\]

Proof. Since $\|\gamma_{r,K}\| \le C^R < \infty$,
\[
\mathbb{E}\big[\gamma_{r,K}\phi - (\mathrm{prop}\,\gamma_{r-1,K})\phi\big]^2
= \mathbb{E}\Big\{ \mathbb{E}\Big[ \big(\gamma_{r,K}\phi - (\mathrm{prop}\,\gamma_{r-1,K})\phi\big)^2 \,\Big|\, \gamma_{r-1,K} \Big] \Big\}
\le \mathbb{E}\left[ \frac{\|\gamma_{r-1,K}\|^2\, C^4}{K} \right] \to 0.
\]

Lemma 11. For all $r$, $\mathrm{prop}\,\gamma_r = \gamma_{r+1}$.

Proof. First note that for all simple measurable $f : S^2 \to \mathbb{R}$,
\[
\int \gamma(\mathrm{d}x) \int \nu^+(x, \mathrm{d}y)\, f(x, y) = \int \tau^+(\mathrm{d}x, \mathrm{d}y)\, f(x, y),
\]
so the same identity holds for all bounded measurable $f$ by the dominated convergence theorem. Similarly,
\[
\int \gamma(\mathrm{d}y) \int \nu^-(y, \mathrm{d}x)\, f(x, y) = \int \tau^-(\mathrm{d}x, \mathrm{d}y)\, f(x, y).
\]
Using these identities and basic properties of the Radon-Nikodym derivative $w = \mathrm{d}\tau^- / \mathrm{d}\tau^+$:
\[
\begin{aligned}
(\mathrm{prop}\,\gamma_r)\phi
&= \int \mathbf{1}[\rho(x) = r]\, \gamma(\mathrm{d}x) \int \nu^+(x, \mathrm{d}y)\, w(x, y)\, \phi(y) \\
&= \int \mathbf{1}[\rho(y) = r + 1]\, \gamma(\mathrm{d}x)\, \nu^+(x, \mathrm{d}y)\, w(x, y)\, \phi(y) \\
&= \int \mathbf{1}[\rho(y) = r + 1]\, \tau^+(\mathrm{d}x, \mathrm{d}y)\, w(x, y)\, \phi(y) \\
&= \int \mathbf{1}[\rho(y) = r + 1]\, \tau^-(\mathrm{d}x, \mathrm{d}y)\, \phi(y) \\
&= \int \mathbf{1}[\rho(y) = r + 1]\, \gamma(\mathrm{d}y)\, \nu^-(y, \mathrm{d}x)\, \phi(y) \\
&= \int \mathbf{1}[\rho(y) = r + 1]\, \gamma(\mathrm{d}y)\, \phi(y) \int \nu^-(y, \mathrm{d}x) \\
&= \gamma_{r+1}\phi.
\end{aligned}
\]
Here, to change the indicator $\mathbf{1}[\rho(x) = r]$ into $\mathbf{1}[\rho(y) = r + 1]$, we have used the definition of the poset and the fact that its Hasse diagram is connected.

Lemma 12. If for all measurable $\phi < C$,
\[
\gamma_{r,K}\phi \xrightarrow{L^2} \gamma_r\phi, \tag{A.1}
\]
then for all $\phi' < C$, we also have
\[
(\mathrm{prop}\,\gamma_{r,K})\phi' \xrightarrow{L^2} (\mathrm{prop}\,\gamma_r)\phi'. \tag{A.2}
\]
Moreover, by Lemma 11 the right-hand side of Equation (A.2) is equal to $\gamma_{r+1}\phi'$.

Proof. It is enough to show the result for $\phi'(x) = \mathbf{1}[x \in A]$. Let
\[
\phi(x) = \int_A \nu^+(x, \mathrm{d}y)\, w(x, y).
\]
Since $\phi < C$, we can use this $\phi$ in Equation (A.1) to obtain Equation (A.2).

We can now prove the main proposition:

Proof. We proceed by induction, showing for $r \ge 0$ that the following holds:
\[
\gamma_{r,K}\phi \xrightarrow{L^2} \gamma_r\phi.
\]
The base case is trivial, since $\gamma_{0,K}$ and $\gamma_0$ are equal to a Dirac delta on the same atom.
To prove the induction step, we first decompose the $L^2$ distance using the Minkowski inequality and control each term separately:
\[
\mathbb{E}^{1/2}\big[\gamma_{r+1,K}\phi - \gamma_{r+1}\phi\big]^2
\le \mathbb{E}^{1/2}\big[\gamma_{r+1,K}\phi - (\mathrm{prop}\,\gamma_{r,K})\phi\big]^2
+ \mathbb{E}^{1/2}\big[(\mathrm{prop}\,\gamma_{r,K})\phi - \gamma_{r+1}\phi\big]^2.
\]
By Corollary 10, the first term goes to zero; and by Lemma 12 and the induction hypothesis, the second term also goes to zero.
