Open Collections

UBC Theses and Dissertations

UBC Theses Logo

UBC Theses and Dissertations

Stochastic processes, statistical inference and efficient algorithms for phylogenetic inference Zhai, Yongliang 2016

Your browser doesn't seem to have a PDF viewer, please download the PDF to view this item.

Item Metadata

Download

Media
24-ubc_2016_november_zhai_yongliang.pdf [ 4.42MB ]
Metadata
JSON: 24-1.0314148.json
JSON-LD: 24-1.0314148-ld.json
RDF/XML (Pretty): 24-1.0314148-rdf.xml
RDF/JSON: 24-1.0314148-rdf.json
Turtle: 24-1.0314148-turtle.txt
N-Triples: 24-1.0314148-rdf-ntriples.txt
Original Record: 24-1.0314148-source.json
Full Text
24-1.0314148-fulltext.txt
Citation
24-1.0314148.ris

Full Text

Stochastic Processes, StatisticalInference and Efficient Algorithms forPhylogenetic InferencebyYongliang ZhaiB.Sc., Zhejiang University, 2009M.Sc., The University of British Columbia, 2011A THESIS SUBMITTED IN PARTIAL FULFILLMENT OFTHE REQUIREMENTS FOR THE DEGREE OFDOCTOR OF PHILOSOPHYinThe Faculty of Graduate and Postdoctoral Studies(Statistics)THE UNIVERSITY OF BRITISH COLUMBIA(Vancouver)August 2016c© Yongliang Zhai 2016AbstractPhylogenetic inference aims to reconstruct the evolutionary history of pop-ulations or species. With the rapid expansion of genetic data available,statistical methods play an increasingly important role in phylogenetic in-ference by analyzing genetic variation of observed data collected at currentpopulations or species. In this thesis, we develop new evolutionary mod-els, statistical inference methods and efficient algorithms for reconstructingphylogenetic trees at the level of populations using single nucleotide poly-morphism data and at the level of species using multiple sequence alignmentdata.At the level of populations, we introduce a new inference method to es-timate evolutionary distances for any two populations to their most recentcommon ancestral population using single-nucleotide polymorphism allelefrequencies. Our method is based on a new evolutionary model for bothdrift and fixation. To scale this method to large numbers of populations, weintroduce the asymmetric neighbor-joining algorithm, an efficient methodfor reconstructing rooted bifurcating trees. Asymmetric neighbor-joiningprovides a scalable rooting method applicable to any non-reversible evolu-tionary modelling setup. We explore the statistical properties of asymmet-ric neighbor-joining, and demonstrate its accuracy on synthetic data. Wevalidate our method by reconstructing rooted phylogenetic trees from theHuman Genome Diversity Panel data. Our results are obtained without us-ing an outgroup, and are consistent with the prevalent recent single-originmodel of human migration.At the level of species, we introduce a continuous time stochastic pro-cess, the geometric Poisson indel process, that allows indel rates to varyacross sites. We design an efficient algorithm for computing the probabilityiiAbstractof a given multiple sequence alignment based on our new indel model. Wedescribe a method to construct phylogeny estimates from a fixed alignmentusing neighbor-joining. Using simulation studies, we show that ignoringindel rate variation may have a detrimental effect on the accuracy of theinferred phylogenies, and that our proposed method can sidestep this issueby inferring latent indel rate categories. We also show that our phylogeneticinference method may be more stable to taxa subsampling in a real dataexperiment compared to some existing methods that either ignore indels orignore indel rate variation.iiiPrefaceThis thesis is the original work of the author, Yongliang Zhai, under thesupervision of Dr. Alexandre Bouchard-Coˆte´.A version of Chapter 3 and Chapter 4 has been accepted by the Annalsof Applied Statistics. Dr. Alexandre Bouchard-Coˆte´ is the sole co-author. Idesigned the model and implemented the algorithm. I conducted the com-putational work and derivations. I wrote the majority of the manuscript.A version of Chapter 5 and Chapter 6 has been submitted for publica-tion. Dr. Alexandre Bouchard-Coˆte´ is the sole co-author. This manuscriptis currently under review. I designed the model and implemented the algo-rithm. I conducted the computational work and derivations. I wrote themajority of the manuscript.ivTable of ContentsAbstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iiPreface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ivTable of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . vList of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xList of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiiiAcknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . xix1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72.1 Genetic variation . . . . . . . . . . . . . . . . . . . . . . . . 82.1.1 Genetic sequences at the level of populations . . . . . 82.1.2 Genetic sequences at the level of species . . . . . . . 122.2 Phylogenetic tree . . . . . . . . . . . . . . . . . . . . . . . . 142.2.1 Basic concepts . . . . . . . . . . . . . . . . . . . . . . 142.2.2 Evolutionary process on a phylogenetic tree . . . . . 162.3 Evolutionary models . . . . . . . . . . . . . . . . . . . . . . 182.3.1 Various evolutionary models for different types of data 182.3.2 Continuous time Markov chain . . . . . . . . . . . . 202.4 Phylogenetic inference . . . . . . . . . . . . . . . . . . . . . 232.4.1 Non-likelihood based methods . . . . . . . . . . . . . 232.4.2 Likelihood-based methods . . . . . . . . . . . . . . . 26vTable of Contents2.5 Comparing, summarizing and evaluating phylogenetic results 282.5.1 Comparing phylogenetic trees . . . . . . . . . . . . . 282.5.2 Summarizing phylogenetic trees . . . . . . . . . . . . 312.5.3 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . 313 Normal approximation with general fixation model for fre-quency data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 323.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 323.2 Evolutionary models . . . . . . . . . . . . . . . . . . . . . . . 343.2.1 Review of evolutionary models for frequency data . . 353.2.2 Normal approximation with general fixation . . . . . 373.2.3 Comparisons to the Wright-Fisher model . . . . . . . 383.3 Inference method for two populations . . . . . . . . . . . . . 433.3.1 Likelihood model for two populations . . . . . . . . . 433.3.2 Modelling ancestral allele frequencies . . . . . . . . . 443.4 Simulation studies for two populations . . . . . . . . . . . . 463.4.1 Two population inference when the model is correctlyspecified . . . . . . . . . . . . . . . . . . . . . . . . . 463.4.2 Two population inference when the model is misspec-ified . . . . . . . . . . . . . . . . . . . . . . . . . . . . 493.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 534 Rooted tree reconstruction using asymmetric neighbor-joiningalgorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 544.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 544.2 Asymmetric dissimilarity matrix . . . . . . . . . . . . . . . . 564.2.1 Definition . . . . . . . . . . . . . . . . . . . . . . . . . 564.2.2 Additivity property of the normal approximation withgeneral fixation model . . . . . . . . . . . . . . . . . . 584.3 Asymmetric neighbor-joining algorithm . . . . . . . . . . . . 604.3.1 Identifying a cherry . . . . . . . . . . . . . . . . . . . 624.3.2 Estimation step and update step . . . . . . . . . . . . 644.4 Simulation Studies . . . . . . . . . . . . . . . . . . . . . . . . 67viTable of Contents4.4.1 Simulation for more than two populations . . . . . . 674.4.2 Computational cost . . . . . . . . . . . . . . . . . . . 714.5 Data analysis: Human Genome Diversity Panel . . . . . . . 724.5.1 Human population tree . . . . . . . . . . . . . . . . . 734.5.2 Assessing uncertainty of the estimated tree . . . . . . 764.6 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 785 Geometric Poisson indel process . . . . . . . . . . . . . . . . 805.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 805.2 Poisson indel process . . . . . . . . . . . . . . . . . . . . . . 835.2.1 Basic notation . . . . . . . . . . . . . . . . . . . . . . 835.2.2 Definition of the Poisson indel process . . . . . . . . . 845.2.3 Computational properties of the Poisson indel process 855.3 Geometric Poisson indel process . . . . . . . . . . . . . . . . 865.4 Efficient phylogenetic inference with the GeoPIP model . . 895.4.1 Likelihood evaluation in polynomial time . . . . . . . 905.4.2 Likelihood optimization in polynomial time . . . . . . 905.5 Hierarchical Poisson indel process . . . . . . . . . . . . . . . 935.5.1 Generative process of the hierarchical Poisson indelprocess . . . . . . . . . . . . . . . . . . . . . . . . . . 935.5.2 Properties of the hierarchical Poisson indel process . 945.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 956 Estimation of phylogenies with indel rate heterogeneity . 966.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 966.2 Details of the phylogenetic inference method . . . . . . . . . 986.2.1 Number of indel rate categories m . . . . . . . . . . . 986.2.2 Optimizing with respect to the segmentation β andindel rate indicator r . . . . . . . . . . . . . . . . . . 996.2.3 Updating the geometric parameter ρ and stationarydistribution parameter ω . . . . . . . . . . . . . . . . 1006.2.4 Updating the phylogenetic tree τ . . . . . . . . . . . 1016.2.5 Updating the indel rate θ . . . . . . . . . . . . . . . . 102viiTable of Contents6.2.6 Updating the rate matrix Q . . . . . . . . . . . . . . 1026.2.7 Convergence of the optimization algorithm . . . . . . 1036.3 Simulation studies . . . . . . . . . . . . . . . . . . . . . . . 1036.3.1 Segmentation . . . . . . . . . . . . . . . . . . . . . . 1046.3.2 Well-specified synthetic examples . . . . . . . . . . . 1056.3.3 Misspecified synthetic examples using the hPIP . . . 1106.3.4 Misspecified synthetic examples using software INDELi-ble and MUSCLE . . . . . . . . . . . . . . . . . . . . 1126.4 Data analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 1146.4.1 Molluscan ribosomal RNA data . . . . . . . . . . . . 1156.4.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . 1166.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1197 Conclusions and future work . . . . . . . . . . . . . . . . . . . 1207.1 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1207.2 Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . 1217.2.1 Extension of the FixNormal model and ANJ frame-work . . . . . . . . . . . . . . . . . . . . . . . . . . . 1217.2.2 Extension of the GeoPIP model . . . . . . . . . . . . 122Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124AppendicesA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139A.1 Analysis of the cherry-picking algorithm . . . . . . . . . . . . 139A.1.1 Proof of Proposition 1 . . . . . . . . . . . . . . . . . . 139A.1.2 Corollary 1 and its proof . . . . . . . . . . . . . . . . 139A.1.3 Proof of Theorem 1 . . . . . . . . . . . . . . . . . . . 140A.1.4 Proof of Proposition 2 . . . . . . . . . . . . . . . . . . 141A.2 Analysis of the asymmetric neighbor-joining algorithm . . . 143A.2.1 Proof of Lemma 1 . . . . . . . . . . . . . . . . . . . . 143A.2.2 Proof of Lemma 2 . . . . . . . . . . . . . . . . . . . . 144viiiTable of ContentsA.2.3 Proof of Theorem 2 . . . . . . . . . . . . . . . . . . . 145A.2.4 Proof of Proposition 3 . . . . . . . . . . . . . . . . . . 145ixList of Tables2.1 List of all partitions in the phylogenetic trees shown in Fig-ure 2.6. Partition {A, D | B, C, E} is only available in treea, and partition {C, D | A, B, E} is only available in tree b. 293.1 Mean fixation rate difference of 100 simulation runs compar-ing the Wright-Fisher model and the FixNormal model. Themean is shown as the fixation rate of the Wright-Fisher modelminus the fixation rate of the FixNormal model. Note thatwhen p = 0.1 and t = 1000, optimizing σ2 based on the like-lihood does not converge, which is omitted. . . . . . . . . . . 394.1 Mean fixation rate difference of 100 simulation runs compar-ing two consecutive FixNomal distributions versus one com-bined FixNormal distribution. . . . . . . . . . . . . . . . . . 594.2 Mean distances between estimated trees and true trees calcu-lated from 200 simulations. . . . . . . . . . . . . . . . . . . . 706.1 Simulation results on segmentation error and running time.Data are simulated based on the geometric Poisson indel pro-cess (GeoPIP) model with 2, 3, or 4 indel rates (m), on aperfect binary tree (i.e., a tree in which all internal verticeshave two descendants and all branch lengths are equal) with32 leaves (see Figure 6.1B for an example of a perfect bi-nary tree with 16 leaves). Average percentages of alignmentcolumns with incorrectly inferred indel rates from 100 simu-lations are listed. . . . . . . . . . . . . . . . . . . . . . . . . 105xList of Tables6.2 Results on synthetic data simulated from the GeoPIP on aphylogenetic tree of 8 leaves with varying branch lengths (seeFigure 6.1A). All models are well-specified, except for thestandard Poisson indel process (PIP). The weighted Robinson-Foulds (wRF) distances and the Robinson-Foulds (RF) dis-tance of 100 simulation runs are summarized. For the “scaledtree” columns, we scale the total branch length of all esti-mated trees and the true tree to be equal to one. . . . . . . 1086.3 Simulation results on synthetic data generated from the GeoPIP.The true tree is a perfect binary tree of 16 leaves with thesame branch length b for all branches (see Figure 6.1B). Dif-ferent indel rates (i.e., µ2) and different phylogenetic treebranch lengths (i.e., b) are considered. The weighted Robinson-Foulds (wRF) distances and the Robinson-Foulds (RF) dis-tance of 100 simulation runs are summarized. . . . . . . . . 1096.4 Simulation results when the true model is the hierarchicalPoisson indel process (hPIP), and the true tree has 8 leaveswith varying branch lengths (see Figure 6.1A). The PIP andGeoPIP models are misspecified, while the other, substitution-only methods are well-specified. Both wRF and RF are re-ported. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111xiList of Tables6.5 Simulation results on synthetic data generated from the soft-ware INDELible and aligned by the software MUSCLE. Thetrue tree is a perfect binary tree of 16 leaves with the samebranch length b = 0.05 for all branches (see Figure 6.1B). Thetrue alignment generated using INDELible and the estimatedalignment using the software MUSCLE are both considered.In this table, NB+NB indicates that the data are generatedusing two blocks with the same indel length model (negativebinomial with parameter 1 and 0.1) but different indel rates(0.05 and 0.25 respectively), NB+SUB+POW indicates thatthe data are generated using three blocks with different indellength models (a negative binomial distribution with parame-ter 1 and 0.1, a substitution model with no indels, and a powerlaw distribution with parameter 1.7 and maximum 30), anddifferent indel rates (0.2 for the negative binomial block and0.1 for the power law block). . . . . . . . . . . . . . . . . . . 1146.6 Description of the reference clades used for validation in termsof the species available in the dataset of Lydeard et al. (2000). 1166.7 Comparison of the clades identified by different methods, whenthe two outgroups are added, and in parentheses, when theoutgroups are excluded. In this table, “1” indicates that theclade has been identified by the corresponding tree inferencemethod (column), and “0” indicates that the clade has notbeen identified. Maximum-parsimony (MP) trees are takenfrom Lydeard et al. (2000), “MP(t.o.)” stands for MP analy-sis from transversions only (i.e., the substitution of a purinefor a pyrimidine or vice versa), and “BAli,” for BAli-Phy. . . 116xiiList of Figures2.1 A graphic illustration of the change of single nucleotide poly-morphism (SNP) frequency in a finite population. The x-axisrepresents time (t) and the y-axis represents SNP frequency(p). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102.2 A heat plot of 100 SNP frequencies across 53 populations inthe Human Genome Development Panel (HGDP). . . . . . . 112.3 An example of multiple sequence alignment. A: unaligneddata, which are segments of sequences from four molluscanspecies. B: aligned data, where nucleotide sites are aligned incolumns to illustrate shared ancestry. . . . . . . . . . . . . . 132.4 A: an unrooted bifurcating phylogenetic tree of four leaveswith an outgroup v5. B: a star shape phylogenetic tree withroot v0. C: a rooted tree assuming a molecular clock assump-tion. D: a rooted bifurcating phylogenetic tree without themolecular clock assumption (i.e., the total branch length fromeach leaf to the root may be different). The smallest subtreecontaining vertices v1 and v2 are highlighted using dottedboxes for A, B, C and D. . . . . . . . . . . . . . . . . . . . . 16xiiiList of Figures2.5 Stochastic representation of the evolutionary process basedon a phylogenetic tree. A: before time t0, there is only onegroup, with trait X0. B: at time t0, one group separates intotwo groups with independently involving traits X01 and X02after both duplicating X0 at time t0. C: at time t1, one groupseparates into two groups with independently involving traitsX011 and X012 after both duplicating X01 at time t1, whilethe trait of the other group, X02, evolves independently. D:at time t = t2, data is collected at all current groups, i.e.,realization of random variables X011, X012, X02 at time t2. . 172.6 Two unrooted bifurcating phylogenetic trees with 5 leaves. . 293.1 Comparison of evolutionary parameters in the FixNormal modeland the Wright-Fisher model. . . . . . . . . . . . . . . . . . 393.2 Quantile-quantile plots of simulated data based on the FixNor-mal model and the the Wright-Fisher model when p0 = 0.5.In the above plots, x-axis represents Wright-Fisher quantilesand y-axis represents FixNormal quantiles. . . . . . . . . . . 413.3 Quantile-quantile plots of simulated data based on the FixNor-mal model and the Wright-Fisher model when p0 = 0.1. Inthe above plots, x-axis represents Wright-Fisher quantiles andy-axis represents FixNormal quantiles. . . . . . . . . . . . . 423.4 Boxplots of σ̂21 and σ̂22 in 100 simulation runs for two popu-lations when p0 is generated from a uniform(0, 1) with pointmass 0.1 at 0 and 1. In the three left panels, the true valuemf = 0.2 is used in estimation, and in three right panels,the empirical Bayes estimate m̂f (3.10) is used in estimation.The grey lines indicate the true values of σ21 and σ22 = 2σ21. . 473.5 Boxplots of (σ̂21 − σ21)/σ21 and (σ̂22 − σ22)/σ22 in 100 simulationruns for two populations when p0 is generated from a uni-form(0, 1) with point mass 0.1 at 0 and 1. The panels arearranged in the same order as in Figure 3.4. The grey linesindicate 0. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48xivList of Figures3.6 Boxplots of σ̂21 and σ̂22 in 100 simulation runs for two popula-tions when p0 is generated from a beta(0.5, 0.5) with pointmass 0.1 at 0 and 1. The panels are arranged in the sameorder as in Figure 3.4. The grey lines indicate true values ofσ21 and σ22. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 493.7 Boxplots of (σ̂21 − σ21)/σ21 and (σ̂22 − σ22)/σ22 in 100 simula-tion runs for two populations when p0 is generated from abeta(0.5, 0.5) with point mass 0.1 at 0 and 1. The panels arearranged in the same order as in Figure 3.4. The grey linesindicate 0. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 503.8 Boxplots of σ̂21 and σ̂22 in 100 simulation runs for two popula-tions when p0 is generated from a beta(2, 2) with point mass0.1 at 0 and 1. The panels are arranged in the same order asin Figure 3.4. The grey lines indicate true values of σ21 and σ22. 513.9 Boxplots of (σ̂21 − σ21)/σ21 and (σ̂22 − σ22)/σ22 in 100 simulationruns for two populations when p0 is generated from a beta(2,2) with point mass 0.1 at 0 and 1. The panels are arrangedin the same order as in Figure 3.4. The grey lines indicate 0. 524.1 An example of a rooted bifurcating tree τ for 3 populations,and the asymmetric dissimilarity matrix A with relevant pa-rameters derived from this tree where ai,j is the sum of branchlengths from the i-th leaf to the most recent common ancestorof the i-th leaf and the j-th leaf. For example, a21 = σ22 +σ22,3is the sum of branch length from P2 to P0 which is the most re-cent common ancestor population of P2 and P1, and a12 = σ21is the branch lengths from P1 to P0. The first step of ourinference method is to estimate A based on a non-reversiblemodel using maximum likelihood (Chapter 3). The secondstep is to estimate τ based on  (Chapter 4). . . . . . . . . 554.2 Quantile-quantile plots of simulated data based on two con-secutive FixNormal distributions and one combined FixNor-mal distribution when p0 = 0.5. . . . . . . . . . . . . . . . . 60xvList of Figures4.3 Quantile-quantile plots of simulated data based on two con-secutive FixNormal distributions and one combined FixNor-mal distribution when p0 = 0.1. . . . . . . . . . . . . . . . . 614.4 Simulation results for the African populations. Panel 1 showsthe true tree and its branch lengths. Panel 2 and 3 show theconsensus rooted tree using our method and the consensusunrooted tree using NJ respectively, with the numeric an-notations being successful recovery rates of clades (subtreeswith exact topology as in the true tree) in 200 simulationruns. Panel 4 and 5 show the consensus rooted trees withunscaled and scaled estimated branch lengths, and Panel 6,the consensus unrooted tree with scaled branch lengths. . . . 694.5 Simulation results for the American and Oceanic populations.Panel 1 shows the true tree and its branch lengths. Panel2 and 3 show the consensus rooted tree with unscaled andscaled estimated branch lengths. Panel 4 shows the consensusunrooted tree with scaled estimated branch lengths. . . . . . 704.6 A. Rooted phylogenetic tree for 53 human populations inHGDP using our method. (F) indicates the root of ourtree. Populations are coloured according to regions as speci-fied in Li et al. (2008). B. Zoomed dendrogram for Europeanpopulations using our method. C. Zoomed dendrogram forEuropean populations reproduced using contml (Felsenstein,1989), which is the same as the subtree as in Li et al. (2008)for comparison. . . . . . . . . . . . . . . . . . . . . . . . . . 754.7 1. Estimated tree based on 1314 SNPs. 2. Consensus treewith estimated node uncertainty from 5000 bootstrap sam-ples. 3. Consensus tree with estimated branch lengths andtheir standard deviations in brackets. 4. Fre´chet mean treewith estimated branch length. The Fre´chet variance of thisFre´chet mean tree is 0.0013. . . . . . . . . . . . . . . . . . . 77xviList of Figures5.1 A segment of the multiple sequence alignment (MSA) of fourmolluscan species from the molluscan ribosomal RNA dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 815.2 An illustration of the GeoPIP on an MSA of four sequences.There are three MSA segments: segment 1 and 3 evolves at alow deletion rate (dyed in purple), while segment 2 evolves ata high deletion rate (dyed in yellow). Each segment evolvesindependently based on the PIP model. Insertion, deletionand substitution events for each segment are indicated on thebottom plots. . . . . . . . . . . . . . . . . . . . . . . . . . . 866.1 The reference phylogenetic trees used in simulation studies.A: a phylogenetic tree with 8 leaves and varying branch lengths.B: a perfect binary phylogenetic tree with 16 leaves and samebranch length b for all branches. . . . . . . . . . . . . . . . . 1066.2 Inferred indel rate categories for alignment columns 150-300of one set of simulated data: segments with low estimateddeletion rate (µ̂1 = 0.006) are in purple; segments with in-termediate deletion rate (µ̂2 = 0.848) are in yellow; segmentswith high deletion rate (µ̂3 = 4.150) are in blue. . . . . . . . . 1116.3 Trees reconstructed by the three indel-aware methods (columns)for the data with and without outgroups (rows). The fivenumbers measure the wRF distance and the RF distance (inbrackets) between each of the bottom trees and the corre-sponding top subtree obtained after excluding the two out-groups. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1176.4 Inferred indel rate categories for alignment columns 701-850of molluscan data: segments with lowest deletion rate (µ̂1 =0.01) are in purple; segments with low deletion rate (µ̂2 =0.11) in are yellow; segments with high deletion rate (µ̂3 =0.41) are in blue; segments with highest deletion rate (µ̂4 =1.27) are in red. . . . . . . . . . . . . . . . . . . . . . . . . . . 118xviiList of FiguresA.1 Illustrations of connected circles which contradict the bifur-cating tree assumption. . . . . . . . . . . . . . . . . . . . . . 141xviiiAcknowledgementsMy deepest debt of gratitude goes to my supervisor, Dr. Alexandre Bouchard-Coˆte´, for his warm support and profound advice. I am blessed to work withAlex for the past four years. His encouragement made me brave to ex-plore new research territories with excitement and no fear for failures. Hisguidance helped me build my own skills to develop ideas into results.I would like to thank members of my supervisory committee, Dr. JiahuaChen and Dr. John Petkau, for their inspiring suggestions and insightfulcomments. I am grateful to faculty members and staff of the StatisticsDepartment at UBC, in particular, Dr. Nancy Heckman, Dr. Lang Wu, andDr. Harry Joe, for shedding light on my academic career.I would like to thank my fellow students and friends at UBC for creatinga supportive and friendly environment, which I enjoyed so much in the last afew years. Special thanks to Yanling Cai, Jing Dong, Jia Gou, Zhenyu Guo,Chengzhi He, Yi Huang, Seong-Hwan Jun, Sha Liao, Libo Lu, LiangliangWang and Tingting Zhao.Last, I leave the warmest part of my heart for my family: my parents,Mingqin Li and Baogang Zhai, who gave birth to me, enlightened me andeducated me with their unconditional support and love; my wife, ShanshanChen, who always believes in me and stands beside me during this longjourney. Without them, everything is impossible.xixChapter 1IntroductionThe evolutionary history of populations or species has always been a fasci-nating topic for researchers in many areas including archeology, anthropol-ogy, linguistics and biology (Lipo, 2006). While pieces of information aboutthe past can be found in fossils (Lieberman, 2012), such information is alsowidely preserved in genetic information passed along generations of livingspecies (Cann et al., 1987; Rosenberg et al., 2002; Li et al., 2008; Pickrellet al., 2012; Cann, 2013; Francalacci et al., 2013; Poznik et al., 2013). Withthe rapid expansion of genetic data available, statistical methods play an in-creasingly important role in inferring the evolutionary history of populationsand species.The evolutionary history of populations or species can be informed by thereconstruction of a phylogenetic tree, a connected acyclic graph consisting ofa set of vertices and a set of edges (Semple and Steel, 2003). The phylogenetictree cannot be observed directly and is often the main parameter of interestto estimate, as data are usually only available on current populations orspecies, which are called leaves of the phylogenetic tree, in the form of geneticinformation such as deoxyribonucleic acid (DNA) sequences, ribonucleic acid(RNA) sequences, amino acid sequences and protein sequences.Phylogenetic inference aims to reconstruct the evolutionary history ofpopulations or species by analyzing variation of genetic data observed atsampled individuals from current populations or current species. Phylo-genetic studies are motivated by various scientific questions, for example,inferring the path of human migration and classifying species to constructthe tree of life. Phylogenetic inference also has wide technological applica-tions such as genome imputation and machine translation.Genetic variation may appear in different forms of data at different levels1Chapter 1. Introductionof comparisons, which are often analyzed using different phylogenetic infer-ence methods designed specifically for certain types of data. In Chapter 2,we review relevant background in phylogenetics related to this thesis.At the level of populations (roughly a group of all the organisms of thesame species, which live in a particular geographical area), sample sequencesfrom different populations within the same species (roughly the largest groupof organisms where two individuals are capable of reproducing fertile off-spring) are usually very similar to each other, sharing exactly the same nu-cleotide type at the majority of positions. The majority of genetic variationof sample sequences are results of substitution of a single nucleotide in theDNA or RNA sequences of some individual within the population during theevolutionary history. This variation of a single position in a nucleotide se-quence among individuals is called a single nucleotide polymorphism (SNP).At the level of populations, we focus on analyzing SNP data in this thesis.At the level of species, sample sequences from different species usuallydiffer more from each other. They may also differ in sequence length, asa result of accumulated insertion and deletion events in the relatively longevolutionary history. For sequences of varying length, the problem of esti-mating the phylogenetic history of the sequences often interacts with theproblem with finding homologous characters of all the sequences (i.e., char-acters of the same ancestry). If all sequences are aligned in a way that allsets of homologous characters are identified, it is called a multiple sequencealignment (MSA). At the level of species, we focus on analyzing MSA datain this thesis.In this thesis, we develop new evolutionary models, statistical inferencemethod and efficient algorithms for reconstructing phylogenetic trees at thelevel of populations using SNP data and at the level of species using MSAdata. The accuracy of our methods are evaluated in simulation studies.Population level methodAt the level of populations, our work is motivated by the scientific questionof determining the root of a phylogenetic tree, especially the root of human2Chapter 1. Introductionpopulation trees. Determining the root of a phylogenetic tree is an impor-tant step in a range of phylogenetic studies (Huelsenbeck et al., 2002). Forexample, the root of human population trees provides clues on the originand the path of human migration (Gray et al., 2009). However, most classicmethods estimate unrooted trees only because the direction of evolution isnot identifiable.The outgroup criterion is widely used to root an unrooted phylogenetictree: the idea is to find a taxon (species or population) sufficiently far fromthose under study and to use the attachment point of this taxon to the othertaxa to infer the root. But finding a reasonable outgroup is not always easy.Previous works have shown that the outgroup criterion can lead to errors(Wheeler, 1990; Outlaw and Ricklefs, 2011; Pearson et al., 2013; Outlaw andRicklefs, 2011).In Chapter 3, we present our probability model for both the fixationand drift found in SNP frequency data while keeping the computationalrequirements and the model complexity manageable. Being non-reversible,the method has the additional advantage of being able to identify the place-ment of the root without using outgroups. For any pair of two populations,the evolutionary distance from their most recent common ancestor can beestimated based on our model.In Chapter 4, we propose a novel efficient rooted-tree reconstructionalgorithm called the asymmetric neighbor-joining (ANJ) algorithm. ANJis inspired by the popular neighbor-joining (NJ) algorithm, but utilizes theadditional information provided by our non-reversible model proposed inChapter 3.The main advantage of using ANJ versus using full maximum likeli-hood or Bayesian methods for estimating a phylogenetic tree is computa-tional efficiency, as it is well known that optimization over all possible treesis computationally hard (Roch, 2006; Nichols and Warnow, 2008). How-ever, it is worth mentioning that incorporating maximum likelihood meth-ods and Bayesian methods with our proposed non-reversible model is alsofeasible. The ANJ algorithm presented here is broadly applicable to anynon-reversible setup: ANJ can turn collections of pairwise rooted trees ob-3Chapter 1. Introductiontained from any non-reversible model into a joint rooted non-clock tree. Itis therefore an interesting alternative to the outgroup criterion.We perform a series of simulation studies to assess the accuracy androbustness of the methods. The results show that our method can recovertrue topologies and root placements with higher probability than modelsthat ignore fixation. Branch lengths are also shown to be accurate. Ourmethod also has a reasonable computational cost.Using our method, we reanalyze the SNP data from 53 populations fromthe Human Genome Diversity Panel (HGDP) (Cann et al., 2002), recon-structing the full tree with particular attention to the location of the root.The internal organization of the tree is similar to Li et al. (2008), but alsocorrects some problems found in this previous work.Species level methodAt the level of species, our work is motivated by modelling the insertionand deletion (indel) rate variation observed in the evolutionary process ofDNA and RNA sequences. It is well known that different regions of nu-cleotide sequences evolve at different rates, both in terms of substitutions(Fitch and Margoliash, 1967; Nachman and Crowell, 2000), and in termsof insertions-deletions (indels) (Mouchiroud et al., 1991; Wong et al., 2004;Lunter et al., 2006; Mills et al., 2006; Chen et al., 2009; Kvikstad and Duret,2014). In phylogenetic analyses based on substitutions, rate variation hasbeen incorporated in virtually all modern phylogenetic methods (Yang, 1997;Huelsenbeck and Ronquist, 2001; Ronquist and Huelsenbeck, 2003; Suchardand Redelings, 2006; Yang, 2007; Guindon et al., 2010; Stamatakis, 2014).There are now several approaches to phylogenetic tree inference thattake indels into account (Thorne et al., 1991, 1992; Westesson et al., 2012),and some of them include substitution rate heterogeneity (Klosterman et al.,2006; Suchard and Redelings, 2006; Redelings and Suchard, 2007). However,these approaches generally do not incorporate indel rate heterogeneity aspart of the model specification.We present a simple indel rate heterogeneity model suitable for phylo-4Chapter 1. Introductiongenetic tree inference. Our model is obtained as the marginal distributionof a reversible stochastic process defined on a phylogenetic tree. We in-troduce this continuous-time Markov process, called the geometric Poissonindel process (GeoPIP), in Chapter 5.As its name suggests, the main building block of the GeoPIP is thePoisson indel process (PIP) (Bouchard-Coˆte´ and Jordan, 2013), and GeoPIPinherits the attractive computational properties of the PIP. This means inparticular that given a tree, computing the probability of an alignment (i.e.marginalizing over internal sequences) can be done in time polynomial inboth the number of the sequences and the lengths of the sequences. Thisproperty forms the basis of an efficient algorithm which determines in anunsupervised fashion the indel rates, while inferring the tree and partitioningthe sequences into segments taking on different indel rates.In Chapter 6, we propose an estimation method to estimate phylogenyfrom a fixed multiple sequence alignment using the neighbor-joining (NJ)algorithm (Saitou and Nei, 1987; Studier et al., 1988; Gascuel, 1997) as anillustration. Our inference method iteratively estimates a segmentation ofthe multiple sequence alignment, indel rates, phylogenetic tree and other rel-evant parameters. Our inference method is initialized using random starts,without requiring a guide tree.Using our method, we investigate the effect of indel rate heterogeneityon phylogenetic inference. We provide some evidence that modelling indelsenhances accuracy of phylogenetic inference, and that modelling indel rateheterogeneity can further improve the accuracy of phylogenetic inference.We demonstrate the accuracy of our method in both well-specified and mis-specified synthetic experiments, including data generated using the softwareINDELible (Fletcher and Yang, 2009) and aligned using the software MUS-CLE (Edgar, 2004a,b).For tractability reasons, we do not attempt to include long indels intoour GeoPIP model in this thesis. Instead, our strategy to avoid the branchoverestimation caused by ignoring long indels is to have the GeoPIP modelexplain them with segments of very high indel rate. On the other hand,our method has better scaling properties as the number of taxa increases,5Chapter 1. Introductioncompared to the TKF92 model (Thorne et al., 1992) which does not allowexact marginalization of internal nodes in polynomial time. To demonstratethat our strategy is sensible, we include synthetic experiments where thedata are generated from models that include long indels. We apply ourmethod to a data set of molluscan ribosomal RNA to show that the GeoPIPmodel not only may lead to more stable phylogenetic tree estimates, butalso can automatically separate fast evolving regions of the RNA sequencesfrom slowly evolving regions in real data application.Chapter 7 summarizes the contributions of this thesis, the limitation ofmethods proposed in this thesis and related further work.6Chapter 2BackgroundIn this thesis, we consider genetic data which are nucleotide sequences col-lected from individuals of different populations within the same species andindividuals from different species. We assume that data can only be collectedfrom individuals of current populations and current species in this thesis,although genetic data of extinct species may be available in some cases. Forinference on the phylogenetic history for populations or species, sequencesfrom sampled individuals within the same population or species are summa-rized to represent common characteristics of the population or species whileindividual differences within the population or species are ignored.Our objective is to reconstruct a phylogenetic tree, which is considered asan unknown parameter. The phylogenetic tree connects sampled sequencesin a way that describes the evolutionary history of sequences. We proposenew stochastic processes to model the evolutionary process of sequencesalong the unobserved phylogenetic tree, and utilize sequences collected fromcurrent populations and species to estimate the unknown evolutionary his-tory, in the form of a phylogenetic tree.In this chapter, we review necessary background in phylogenetic infer-ence. In Section 2.1, we introduce two types of genetic sequences that wefocus on in this thesis. We introduce basic concepts of phylogenetic trees inSection 2.2, popular evolutionary models in Section 2.3, and common phy-logenetic inference methods in Section 2.4. We conclude this chapter with abrief review of methods for analyzing results of phylogenetic inference usedin this thesis in Section 2.5.72.1. Genetic variation2.1 Genetic variationPhylogenetic inference aims to reconstruct the evolutionary history of popu-lations or species by analyzing variation of genetic data observed at sampledindividuals. Mutation is the source of genetic variation. The simplest formof mutation is a single substitution of the original nucleotide by a differentnucleotide in the deoxyribonucleic acid (DNA) or ribonucleic acid (RNA)sequence. Most observed mutations are neutral and either fall outside cod-ing regions or do not alter proteins encoded. Insertion and deletion (indel)of one or more nucleotides is also possible in a single mutation event.In Section 2.1.1, we review genetic variation among human populationsand genetic sequences of populations for phylogenetic inference. We alsoprovide a brief introduction to the SNP frequency data we consider in Chap-ter 3 and Chapter 4. In Section 2.1.2, we review genetic variation amongspecies and genetic sequences of species for phylogenetic inference. We alsoprovide a brief introduction to the nucleotide sequence data we consider inChapter 5 and Chapter 6.2.1.1 Genetic sequences at the level of populationsAlthough genetic variation among human individuals was first demonstratedin a study of human genes determining the ABO blood groups almost 100years ago (Hirschfeld and Hirschfeld, 1919), using genetic variation to inferdetailed history of human populations did not become possible until the de-velopment of polymerase chain reaction (PCR) and automated deoxyribonu-cleic acid (DNA) sequencing in 1980s (Cavalli-Sforza and Feldman, 2003).In the past thirty years, the availability of genetic data has improved greatly,from a small fraction of the genome to the whole-genome scale, and fromlimited populations to many populations worldwide (Consortium, 2003; Liet al., 2008; Consortium, 2010), making it the main source of data for phy-logenetic inference.Genetic variation is widespread among human populations. Depend-ing on the location and the type, genetic variations among human popula-tions that have been used for phylogenetic inference include: mitochondrial82.1. Genetic variationDNA (mtDNA) single nucleotide polymorphisms (Cann et al., 1987), auto-somal microsatellites (Rosenberg et al., 2002), autosomal single nucleotidepolymorphisms (Stephens et al., 2001; Li et al., 2008), Y-chromosome mi-crosatellites (Pritchard et al., 1999) and Y-chromosome single nucleotidepolymorphisms (Francalacci et al., 2013; Poznik et al., 2013; Cann, 2013).Phylogenetic inference at the level of populations analyzes genetic se-quences from different populations within the same species. Samples ofDNA sequences collected from individuals of the same species are usuallyvery similar to each other, consisting of exactly the same nucleotides at themajority of sites.At a single position in a nucleotide sequence among individuals, if thereis a variation, this nucleotide variation is called a single nucleotide polymor-phism (SNP). If a SNP occurs within a gene, then this gene is described ashaving more than one allele. SNPs may or may not lead to variations in theamino acid sequences or protein sequences, depending on their locations, forexample, a SNP in a non-coding region of DNA does not lead to variationin the amino acid sequences.For each population, a sequence of SNP frequencies of chosen allelesis calculated based on SNP sequences of sampled individuals within thepopulation. This sequence of population frequences represents the geneticcharacteristics of the population and is compared across populations in phy-logenetic inference. In this thesis, we consider sequences of SNP frequenciesin populations for phylogenetic inference at the level of population.Our work focuses on reconstructing the history of human populationsusing SNP frequencies at the level of populations. SNPs widely exist acrossall human chromosomes. Common SNPs are almost always bi-allelic, so SNPfrequencies of one allele in a population can be defined for bi-allelic SNPs andcompared across populations as a measure of genetic variation. We consideronly bi-allelic SNPs. In this thesis, we discuss phylogenetic data at thelevel of populations as a N ×L matrix, where each row representing L SNPfrequencies in one population and each column representing the frequenciesof one SNP across N populations.Figure 2.1 illustrates the change of SNP frequency in a finite population.92.1. Genetic variationThe frequency changes from zero to non-zero due to a single mutation inone of the individuals within this population. Then this new allele spreadswithin the populations and gets passed within generations through mating.As a result, the SNP frequency drifts between zero and one. At some point,this new allele may replace the old allele completely in the population, whichis called fixation.0.00.20.40.60.81.0 mutationfixationdrifttpFigure 2.1: A graphic illustration of the change of single nucleotide poly-morphism (SNP) frequency in a finite population. The x-axis representstime (t) and the y-axis represents SNP frequency (p).Data: Human Genome Development PanelIn this thesis, we analyze SNP data from the Human Genome DevelopmentPanel (HGDP) (Li et al., 2008). The HGDP contains data of 938 unrelatedindividuals from 53 populations from Africa, Europe, the Middle East, Asiaand the Americas at more than 650,000 common SNP loci. The number ofindividuals in one population varies from 5 to 46.102.1. Genetic variationSampled SNP ID: 1−100BiakaPygmyMbutiPygmyMandenkaYorubaBantuKenyaSanBantuSouthAfricaMozabiteBedouinDruzePalestinianBrahuiBalochiHazaraMakraniSindhiPathanKalashBurushoHanHanNChinaTujiaYiMiaoDaurMongolaHezhenXiboUygurOroqenDaiLahuSheTuYakutJapaneseCambodianPapuanMelanesianFrenchBasqueSardinianItalianOrcadianNaxiTuscanAdygeiRussianPimaMayaColombianKaritianaSurui20 40 60 800.00.20.40.60.81.0Figure 2.2: A heat plot of 100 SNP frequencies across 53 populations in theHuman Genome Development Panel (HGDP).Before sequencing all samples from every population collected, potentialSNP loci are usually marked first using a smaller number of samples fromsome populations or using results from previous data. Only marked SNPloci are sequenced for the majority of samples. This SNP loci pre-selectionprocess is called ascertainment. The allele with smaller frequency in a refer-ence population is usually defined as the minor allele. The SNP loci in theHGDP date set were selected based on data from the International HapMapProject, for which European, East Asian and African samples were usedto select SNP markers. This ascertainment process may introduce selectionbias in the HGDP data because of the preferential selection of common vari-ants in Europe, East Asia and Africa (Clark et al., 2005; Li et al., 2008),which leads to higher SNP heterozygosities in these populations.Figure 2.2 shows a heat plot of 100 SNP frequencies (of the minor allele112.1. Genetic variationdefined in the ascertainment process) across 53 populations in the HGDP.We analyze this data in Chapter 4 using our proposed methods.2.1.2 Genetic sequences at the level of speciesPhylogenetic inference at the level of species analyzes genetic sequences fromdifferent species. Samples of DNA, RNA or protein sequences collectedfrom individuals of different species usually differ more from each othercomparing to samples from individuals of the same species. Sequences fromindividuals of different species may also differ in sequence length, as a resultof accumulated indel events in the relatively longer evolutionary history.For each species, a representative sequence is constructed by aggregat-ing sampled sequences from different individuals within the species. Thesequence of species keeps only common sites of all sampled individual se-quences and ignores variation of individual sequences within the species.These sequences of species represent genetic characteristics of the speciesand are used for phylogenetic inference at the level of species. The se-quences of different species usually vary in length. For sequences of differentlengths, reconstructing a phylogenetic tree usually interacts with solving theproblem of identifying homologous sites across sequences of different species,which is also called aligning multiple sequences.We now introduce the concepts of sequence alignment and multiple se-quence alignment, which is the form of phylogenetic data we focus on atthe level of species in this thesis. A sequence alignment is a way of arrang-ing the sequences of DNA, RNA, or protein to identify sites of similarityfrom different sequences that may be a consequence of evolutionary rela-tionships between the sequences. Aligned sequences of nucleotide or aminoacid residues are typically represented as rows within a matrix, with a gapsymbol “–” inserted between the residues so characters of different sequencesare aligned in the same column of the matrix if and only if they have sharedancestry. A multiple sequence alignment (MSA) is a sequence alignmentof three or more biological sequences. From an MSA, sequence homologycan be inferred and phylogenetic analysis can be conducted to assess the122.1. Genetic variationsequences’ shared evolutionary relationship. See Figure 2.3 for an exampleof four unaligned molluscan sequences and an corresponding MSA.MSAs can be considered as processed data from the unaligned sequences.MSAs can be constructed from sequences based on the properties of the se-quences, such as protein secondary structures, by experts or by numericalcomputation methods (Thorne et al., 1991; Thompson et al., 1997; Edgar,2004a,b; Satija et al., 2009). Recently, Truszkowski and Goldman (2016)showed that maximum likelihood phylogenetic inference is consistent onMSAs under mild conditions.A.coerulea     ...ACGAGAAGACCCUUAGAAUUUUUAAAAGCAAUAAGUA...A.turrita      ...ACGAGAAGACCCUUAGAAUUUUGAAAAUGUAAAAGGUAUU...C.nemoralis    ...ACGAGAAGACCCUAGAAGCUUGUUAUUUGU...Cac.lacertina  ...ACGAGAAGACCCUGUCGAGCUUUAGAAGAAAGUAGGCUAAAAUAUAUA...A.coerulea     ...ACGAGAAGACCCUU-AGAAUUUUUAAAAG-C-AAUAAGUA----------------------...A.turrita      ...ACGAGAAGACCCUU-AGAAUUUUGAAAA--U-GUAAAAGG---------UAUU---------...C.nemoralis    ...ACGAGAAGACCCUA-GAAGCUUG------------------UUAUUUGU-------------...Cac.lacertina  ...ACGAGAAGACCCUGUCGAGCUUU-AGAAG--AAAGUAG-----------GCUAAAAUAUAUA...A. unaligned sequencesB. aligned sequencesFigure 2.3: An example of multiple sequence alignment. A: unaligned data,which are segments of sequences from four molluscan species. B: aligneddata, where nucleotide sites are aligned in columns to illustrate shared an-cestry.We focus on estimating the phylogeny given a fixed MSA in this thesis,but methods proposed in Chapter 5 and Chapter 6 can also be incorpo-rated into existing framework of co-estimating the MSA and the phylogeny(Suchard and Redelings, 2006).Data: Molluscan ribosomal RNAMolluscs are a diverse group of well studied animals, but many phylogeneticrelationships among molluscan species are still unresolved (Smith et al.,2011). Because of the vast diversity within this large group of species, inser-tions and deletions of nucleotides are prevalent in molluscan ribosomal RNA(rRNA). Lydeard et al. (2000) conducted a comparative analysis of completemitochondrial large subunit (LSU) rRNA sequences of 10 molluscan species132.2. Phylogenetic treeand two outgroups (L. terrestris and D. melanogaster), and provided anMSA of these sequences based on their secondary structure.This data set can be downloaded from http://www.rna.icmb.utexas.edu/SIM/4D/Mollusk/alignment.gb. Figure 2.3 shows a segment of theunaligned sequences and the MSA of four molluscan species.2.2 Phylogenetic treeIn this section, we introduce basic concepts of phylogenetic trees and theuse of phylogenetic trees on describing the evolutionary process.2.2.1 Basic conceptsWe start with basic concepts of trees and then link the abstract conceptswith data in the setup of phylogenetic inference.A tree τ = (V ;E) is a connected acyclic graph consisting of a set of ver-tices V (also called nodes) and a set of edges E (also called branches) (Sempleand Steel, 2003). A phylogenetic tree or evolutionary tree is a tree showingthe inferred evolutionary relationships among various biological species orother entities based on similarities and difererences in their physical or ge-netic characteristics. In this thesis, we focus on the use of trees to describethe evolutionary process and evolutionary relationship of different popula-tions and species.In a tree, each edge connects a pair of vertices and there is exactly onepath between every two vertices. Two vertices are called neighbors if they arelinked by one edge. An edge length (also called branch length) associated withan edge is a positive real number measuring amount of evolution betweenthe two vertices connected by the edge.The degree of a vertex v ∈ V is the number of neighbors of v. An internalvertex (or inner vertex or branch vertex) is a vertex of degree at least 2. Anexternal vertex (or outer vertex or leaf) is a vertex of degree 1. Let Vexdenote the set of all external vertices and Vin denote the set of all internalvertices. Then, V = Vex ∪ Vin and Vex ∩ Vin = ∅.142.2. Phylogenetic treeIn this thesis, we only consider sequences of current populations orspecies, which are obtained from sequences of individuals from current popu-lations and species. Our objective is to construct a phylogenetic tree linkingpopulations or species at leaves of the tree by comparing their sequences.This also means that we can only observe sequences at leaves of a phyloge-netic tree. The internal vertices of the phylogenetic tree represent ancestralpopulations or species, whose sequences cannot be observed.A rooted tree is a tree in which one vertex is singled out and designatedthe root, denoted v0. A rooted phylogenetic tree uses the root to identify thedirection of the evolutionary process, while an unrooted phylogenetic treedoes not contain information on the direction of the evolutionary process.For a rooted tree τ , a natural partial order ≤T on V is obtained by settingv1 ≤T v2 for v1, v2 ∈ V if the path from the root of τ to v2 includes v1. Ifv1 ≤T v2, v2 is called a descendant of v1 and v1 is called an ancestor of v2.If v1 and v2 are neighbors, and v1 ≤T v2, then v1 is called the parent of v2and v2 is called the child of v1. The parent of a vertex vi ∈ V/v0 is denotedvpa(i) ∈ Vin. The root v0 does not have a parent.A subtree is a connected subgraph of τ . A phylogenetic forest is a disjointunion of trees. A phylogenetic network is a generalization of phylogenetictree when cycles are allowed.A bifurcating phylogenetic tree is a phylogenetic tree in which each vertexhas a maximum of three neighbors. Note that all external vertices of aphylogenetic tree always have exactly one neighbor. All internal verticesof a bifurcating tree have exactly three neighbors, except for the root ina rooted tree, which has exactly two neighbors. In a rooted bifurcatingtree, all internal vertices have exactly two descendants and one ancestor,except for the root, which has two descendants but no ancestor. Therefore,a bifurcating tree is also called a binary tree. A binary phylogenetic treecan be either rooted or unrooted. Figure 2.4 shows examples of phylogenetictrees. In this thesis, we focus on bifurcating trees.The most recent common ancestor (MRCA) of N external vertices {v1,v2, . . . , vN}, is the vertex at the root of the smallest subtree containing{v1, v2, . . . , vN}, denoted vMRCA. There are N − 1 internal vertices (includ-152.2. Phylogenetic treev1 v2 v3 v4v0v1v2v3v4 v1 v2 v3 v4v0v1 v2 v3 v4v0A DB Cv5v5v6Figure 2.4: A: an unrooted bifurcating phylogenetic tree of four leaves withan outgroup v5. B: a star shape phylogenetic tree with root v0. C: a rootedtree assuming a molecular clock assumption. D: a rooted bifurcating phylo-genetic tree without the molecular clock assumption (i.e., the total branchlength from each leaf to the root may be different). The smallest subtreecontaining vertices v1 and v2 are highlighted using dotted boxes for A, B, Cand D.ing the root) representing unobserved vertices {vMRCA, vN+1, vN+2, . . . , v2N−2}in a rooted bifurcating tree (see Figure 2.4D for an illustration where v1, v2, v3, v4are leaves which can be observed and v5, v6, v0 (≡ vMRCA) are internal ver-tices which cannot be directly observed.2.2.2 Evolutionary process on a phylogenetic treeIn this thesis, we assume that the evolutionary history is of the form of arooted bifurcating phylogenetic tree. We view the evolutionary process asa stochastic process of traits along branches of the phylogenetic tree fromthe root to the leaves. A trait can be a character (for example, type ofnucleotide), a continuous value (for example, frequency of a SNP), a binaryvalue (for example, presence of absence of a trait) or a string of characters(for example, a DNA sequence).The evolutionary process can be viewed as a stochastic process on aphylogenetic tree. A sequence of traits X0, considered as a random variable,evolve from the root to the leaves along the branches of the tree. The traitsX0 can be individual sequences, population sequences, or species sequencesdepending on the area of application. For example, we may consider X0as DNA sequences of individuals, gene frequency sequences of populations,162.2. Phylogenetic treeor common nucleotide sequences of species. At each internal node, twoprocesses are generated along each edge below the node. The two processesshare the same initial value but evolve independently. Figure 2.5 illustratesthis stochastic process.X 0 X 0X 0X01 X02 X02X01X011 X012X012 X02X01X011X 0 X 0D: t = t2 C: t1 < t ≤ t2 A: t ≤ t0 B: t0 < t ≤ t1 Figure 2.5: Stochastic representation of the evolutionary process based ona phylogenetic tree. A: before time t0, there is only one group, with traitX0. B: at time t0, one group separates into two groups with independentlyinvolving traits X01 and X02 after both duplicating X0 at time t0. C: at timet1, one group separates into two groups with independently involving traitsX011 and X012 after both duplicating X01 at time t1, while the trait of theother group, X02, evolves independently. D: at time t = t2, data is collectedat all current groups, i.e., realization of random variables X011, X012, X02 attime t2.Data are only observed at current populations or species and at currenttime t, denoted x = (x1,x2, . . . ,xN ) at v1, v2, . . . , vN respectively, which isa realization of random variables X = (X1,X2, . . . ,XN ).The main objective of phylogenetic inference is to recover the phyloge-netic tree τ based on the observed data x. Non-probability based mod-els usually aim to measure the differences between current populations orspecies and then construct the tree using those differences. Probability basedmodels aim to model the evolutionary process along branches and search forthe true tree τ based on probability of observing data x at the leaves of thetree.Under the tree-shape assumption, we aim to capture the main forces ofthe evolutionary processes. It is worth mentioning that the above description172.3. Evolutionary modelshas some limitations. First, sequences are assumed to evolve independentlyafter their separation. This assumption may be violated by strong geneticflows (i.e., transfer of alleles or genes from one population to another). Inapplication, those violations usually lead to misplacing populations involvedin the inferred tree. To reduce the effects of genetic flows, populations whichare known to be involved in recent genetic flows may be removed from theanalysis. Second, sequences are assumed to be duplicated at internal verticesat the time of separation. This assumption may be violated when two newsequences separated are different, for example, two groups of populationsseparate because of their differences.2.3 Evolutionary models2.3.1 Various evolutionary models for different types ofdataDifferent evolutionary models for different evolutionary processes are pro-posed for different types of data based on the understanding of the under-lying mechanisms driving changes in genes.For one trait, we will focus on three main types of data in the following:continuous, binary, or categorical. We discuss evolutionary models for eachtype separately in this section.Evolutionary models for continuous dataSNP frequencies are examples of continuous data used in phylogenetic in-ference. By definition, frequencies are the fraction of chromosomes in apopulation with a given mutation. The popular Wright-Fisher model as-sumes that the total number of one gene or one allele in one generation isbinomially distributed based on the frequency of this gene or allele in theprevious generation. Based on the Wright-Fisher model, gene or SNP fre-quencies are a sequence of binomial distributed random variables. However,such models are computationally expensive when the branch length is large.Diffusion approximations are widely used in modelling SNP frequencies182.3. Evolutionary models(Nicholson et al., 2002; Sire´n et al., 2013). See Section 3.2 for a detailedreview of these models and our new proposed model for the evolution of SNPfrequencies. Note that the SNP frequency of a population cannot be directlyobserved. Several individuals within the population are usually sampled toestimate the SNP frequency in the population. For the HGDP data, thenumber of individuals sampled within each population varies from 5 to 46.In Chapter 3 and Chapter 4, we focus on modelling the frequency datadirectly. We discusss incorporating sampling errors caused by the individualsampling process within populations in Section 3.3.2.Evolutionary models for binary dataBefore molecular data became widely available, the main form of data usedfor phylogenetic inference was the presence or absence of certain traits inindividuals, populations or species, which is binary. There are two popularmodels for binary data: the stochastic Dollo model and the Wagner model.The Wagner model assumes that the two states 0 and 1 are reversible withequal rates of transitions, while the stochastic Dollo model fixes the ancestralstate as 0 and only allows a single transition from 0 to 1 over evolutionarytime.The Wagner model is usually formulated using a continuous-time Markovchain (Suchard et al., 2001). The Dollo model can be formulated using abirth-death process (Nicholls and Gray, 2008; Ryder and Nicholls, 2011),and can also be generalized to allow multi-states using a continuous timeMarkov chain (Alekseyenko et al., 2008). The continuous time Markov chainmodel is explained in Section 2.3.2.Evolutionary model for categorical dataWith the availability of molecular data, such as DNA, RNA and proteinsequences, categorical data became the dominating type for phylogeneticinference. For example, DNA sequences are sequences of nucleotides consistof either cytosine (C), guanine (G), adenine (A), or thymine (T) at eachlocus. As a result, there are four states for the DNA sequence data. Other192.3. Evolutionary modelsmorphological characters in languages, such as the order of subject, verb,and object, also have a finite number of states.The change of elements in categorical sequences from one state to anotheris usually modelled by a continuous time Markov chain.2.3.2 Continuous time Markov chainA continuous-time Markov chain (CTMC) is a probability model for changesof states in some finite or countable set Σ over continuous time t (t ∈ [0,∞)).In a CTMC, the time spent in each state takes non-negative real values andhas an exponential distribution. The Markov property holds in that theremaining time in the current state and the transition probabilities to thenext state depend only on the current state and not on historical behaviour.A detailed summary of CTMC models with application in genetics canbe found in Tavare´ (1986). We only introduce the basics of the CTMCmodel here. Let Σ = {s1, s2, . . . , sK} (K ∈ Z+) be the set of states. Let Xtbe a random variable denoting the state at time t. At t = 0, X0 is calledthe initial state.The transition probability matrix P(t) = (pij(t))K×K of a CTMC at timet is defined bypij(t) = P (Xt = sj |X0 = si), for 1 ≤ i, j ≤ K,where P (Xt = sj |X0 = si) is the probability of switching from start statesi to end state sj when time passes t. Note that there may be other inter-mediate state (or states) in the path from si to sj during the time interval[0, t).The rate matrix Q = (qij)K×K of a CTMC is a measure for substitutionrates and satisfies thatQ =dP(t)dt.202.3. Evolutionary modelsThe rate matrix Q is usually parameterized byqij =aij , j 6= i−∑k 6=iaik, j = i,All parameters aij (1 ≤ i, j ≤ K, and i 6= j) are assumed to be non-negativeand the diagonal entries (usually omitted using symbol ∗) are set to ensurethat each row of P(t) is well defined as a probability mass function. Inapplication, the rate matrix Q is assumed to be independent of time t inmost common models.P(t) can be calculated for Q and any time t ≥ 0 using matrix exponen-tial:P(t) = exp(tQ).If the CTMC is stationary, its stationary distribution pi = (pi1, pi2, . . . , piK)satisfies the equationpiQ = 0,with the additional constraint that∑k pik = 1. We only consider stationaryCTMC in this thesis.In order to ensure identifiability of the model, Q is usually scaled by afactor1β= − 1∑i∈Σ piiqii.Reversible CTMCA CTMC is reversible if for all 1 ≤ i, j ≤ K,piiqij = pijqji.If a CTMC is reversible, then the probability of observing one state si (1 ≤i ≤ K) at time t1 (t1 > 0) and another state sj (1 ≤ j ≤ K) at time t2(t2 > t1 > 0) is the same as the probability of observing sj at time t1 and212.3. Evolutionary modelssi at time t2, i.e.,P (Xt1 = si, Xt2 = sj) = P (Xt1 = sj , Xt2 = si)Therefore, the direction of the transition cannot be identified by data col-lected at two different time points. As a result, inference methods based onreversible CTMC can only reconstruct unrooted phylogenetic trees, but can-not identify a root because reversible CTMC cannot identify the directionof the evolutionary process.Reversible CTMC are dominant in modern phylogenetics.Popular CTMC modelsSeveral models have been proposed for nucleotide data which is a specialcase of categorical data with four categories (Tavare´, 1986).The F81 model for the evolutionary process of DNA sequences (Felsen-stein, 1981b) assumes that the rate matrix, which models the transitionrates among four states, is parameterized byQ =∗ piT piC piGpiA ∗ piC piGpiA piT ∗ piGpiA piT piC ∗ ,where pi = (piA, piT , piC , piG) is the stationary distribution.The general time reversible model (GTR) for the evolutionary processof DNA sequences (Tavare´, 1986) assumes that the rate matrix is parame-terized byQ =∗ x1 piTpiA x2piCpiAx3piGpiAx1 ∗ x4 piCpiT x5piGpiTx2 x4 ∗ x6 piGpiCx3 x5 x6 ∗ ,where pi = (piA, piT , piC , piG) is the stationary distribution. GTR is the mostgeneral model under the reversibility assumption for nucleotide data.222.4. Phylogenetic inference2.4 Phylogenetic inferenceA phylogenetic tree can be reconstructed applying a range of methods onvarious types of data. Distance-based methods measure the similarity be-tween current populations (Nei, 1972; Weir and Cockerham, 1984) and thenconstruct the tree based on the similarity using unrooted tree constructionalgorithms such as neighbor joining (NJ) (Saitou and Nei, 1987). Probabilitybased methods aim to model the evolutionary process along the branchesof the tree (Felsenstein, 1981b; Hasegawa et al., 1985; Tavare´, 1986) as astochastic process. These stochastic processes form the basis of a likelihoodwhich is suitable for maximum likelihood (Felsenstein, 1981b) or Bayesianinference (Li et al., 2000).We first summarize popular non-likelihood based methods for phylo-genetic inference and then summarize the main likelihood based methods.In this thesis, we focus on algorithms for calculating the probability of ob-served data given a fixed set of parameters, which is crucial for the efficiencyof both the maximum likelihood approach and the Bayesian approach. Ourproposed methods can be incorporated into existing frameworks of bothmaximum likelihood and Bayesian approaches.2.4.1 Non-likelihood based methodsDissimilarity measures for gene or SNP frequenciesSeveral measures have been proposed to measure the dissimilarity betweensequences on the basis of gene or SNP frequencies (Edwards, 1971; Nei,1972). Using such measures, a dissimilarity score is obtained for any pair ofsequences, and a symmetric dissimilarity matrix can be obtained for any Nsequences. Although it is traditionally called as “distance matrix”, many ofthese measures are not mathematical distances. However, most tree recon-struction methods based on the distance matrix do not require the distancemeasure to be a mathematical distance.232.4. Phylogenetic inferenceNeighbor-joining algorithmsNeighbor-joining (NJ) and related algorithms were proposed to construct anunrooted bifurcating tree (Saitou and Nei, 1987; Studier et al., 1988; Gas-cuel, 1997; Gascuel et al., 1997; Levy et al., 2006). The NJ algorithms do nottake sequence data directly as inputs, instead, they takes input as a symmet-ric dissimilarity matrix A which is pre-calculated for the N sequences usingsome dissimilarity measure. In an N ×N dissimilarity matrix A, ai,j mea-sures the evolutionary distance between the i-th taxon and the j-th taxon(1 ≤ i, j ≤ N). The majority of dissimilarity measures are symmetric, as aresult, the dissimilarity matrix is generally symmetric (i.e., ai,j = aj,i). Inpractice, the NJ algorithm is often used with simple distance measures thatmay not correspond to a probability model, but it can also be paired withlikelihood-based methods as well.The NJ algorithm takes an N ×N symmetric distance matrix A = (ai,j)of N sequences as input and produces an unrooted bifurcating phylogenetictree as output. The NJ consists of three steps at each iteration: search step,estimation step and update step. The original N sequences are representedby N nodes. The NJ algorithm aims to reconstruct a phylogenetic tree forthe N nodes based on the symmetric distance matrix A of the N respectivesequences.At the search step, (with n nodes, 1 ≤ n ≤ N , starting with n = N), theNJ algorithm aims to identify a pair of nodes that are closest to each otheramong all possible pairs. First, aggregated distances for all pairs of nodesare calculated,di,j = (n− 2)ai,j −n∑k=1ai,k −n∑k=1aj,k,and then the smallest value di1,j1 of all values in the n×n matrix D = (di,j)is identified.At the estimation step, create a new node u that joins node i1 and nodej1, and calculate the new distances from node i1 and j1 to the new node u242.4. Phylogenetic inferenceasdi1,u =12{di1,j1 +1n− 2(n∑k=1ai1,k −n∑k=1aj1,k)},anddj1,u =12{di1,j1 +1n− 2(n∑k=1aj1,k −n∑k=1ai1,k)}.At the update step, remove respective rows and columns in the sym-metric distance matrix A for the node i1 and j1, add a new row and col-umn in A for the new node u with distances from each remaining node v(v 6= u, v 6= i1, v 6= j1) to the new node u,au,v =12(di1,u + dj1,u − di1,j1) .There are n − 1 nodes available after one iteration. The NJ algorithmcompletes when an unrooted phylogenetic tree is constructed for the Noriginal nodes.The phylogenetic tree constructed using the NJ algorithm is unique andconsistent (Bryant, 2005). The branch lengths of NJ trees are interpreted asa measure of evolutionary distance, which is influenced by both evolutionarytime and evolutionary rate. As a result, the NJ tree generalizes the molecularclock tree that requires all leaves of the tree to be equally distant from theroot, by allowing branch lengths to vary as a result of evolutionary ratevariation.Maximum parsimony and maximum compatibilityMaximum parsimony (MP) and maximum compatibility (MC) are both op-timization criteria for seeking a tree of certain optimization properties. MPsearches for a tree with a minimum number of character state changes, andMC searches for a tree with a maximum number of compatible characters(i.e., characters that evolve without back mutation or parallel evolution).Both MP and MC are non-deterministic polynomial-time hard (NP-hard)(Nichols and Warnow, 2008), but approximate algorithms are available.252.4. Phylogenetic inference2.4.2 Likelihood-based methodsEvolutionary models aim to capture main forces of the evolutionary pro-cess in order to reconstruct the phylogeny. Likelihood-based phylogeneticinference methods are built on evolutionary models.Likelihood functionIn phylogenetic inference, the tree τ ∈ T is an unknown parameter whichconsists of not only a tree topology, but also a set of unknown branch lengthsd ∈ R2N−2+ (d ∈ R2N−3+ for unrooted tree) where N is the number of leavesin τ . There may also be parameters for the evolutionary process, denotedθ ∈ Θ. The log-likelihood function of (τ, θ) based on data x is given byl(τ, θ|x) = logP (X = x|τ, θ). (2.1)where X = (X1,X2, . . . ,XN ) are random variables denoting N randomsequences and x = (x1,x2, . . . ,xN ) are data, which are observed values ofthe N sequences. The data may appear in different forms, as discussed inSection 2.3.1.Although it is usually possible to compute the likelihood in (2.1) ex-actly or approximately when τ and θ are fixed, maximizing the likelihoodis not easy because the space T is discrete and the size of T is huge whenN is large. For example, there are 34,459,425 rooted bifurcating trees and2,027,025 unrooted bifurcating trees when N = 10 (Felsenstein, 2004). In or-der to explore the space of trees in practice, local search methods iterativelyupdate the tree estimate by searching for a tree with larger likelihood inthe neighborhood of the current tree estimate, (Li et al., 2000; Huelsenbeckand Ronquist, 2001; Ronquist and Huelsenbeck, 2003; Stamatakis, 2005),while incremental search methods reconstruct the tree by pruning algo-rithms, which identify two nodes at each step and replace the two nodesby one internal node (Saitou and Nei, 1987; Bouchard-Coˆte´ et al., 2012).Neither of the two approaches is guaranteed to find the global maximumlikelihood tree.262.4. Phylogenetic inferenceFrom pairwise subtrees to the full treeWe focus on only a pair of populations vi and vj (1 ≤ i, j ≤ N). Theunrooted subtree τij of two vertices vi and vj degenerated as a tree withonly one branch linking two vertices vi and vj . As a result, there is only onebranch length dij ∈ R+. See Figure 2.4A for an example. Note that the twobranches collapse into one branch linking the remaining two vertices.The likelihood function of θij based only (xi,xj) is given byl(dij |xi,xj) = logP (Xi = xi,Xj = xj |dij , θ). (2.2)Finding the maximum likelihood estimate of dij in (2.2) is usually analyti-cally or numerically solvable.The main advantage of this approach is the efficiency. When the numberof sequences are large, estimating the full tree is usually computationally ex-pensive, while estimating all pairwise distances for all pairs of sequences isfeasible. We generalize this idea in this thesis to develop phylogenetic infer-ence methods that are scalable to inference for a large number of sequences.Bayesian methodsBayesian inference of phylogenetic trees is based on the posterior densityp(τ, θ|x) = P (X = x|τ, θ)p(τ, θ)∫θ∫τ P (X = x|τ, θ)p(τ, θ)dτdθ.Markov chain Monte Carlo (MCMC) methods can be applied to sample(τ, θ) from the posterior density. Sequential Monte Carlo (SMC) methodswere also proposed for phylogenetic inference as an alternative to the com-putationally expensive MCMC methods (Bouchard-Coˆte´ et al., 2012; Wang,2012).272.5. Comparing, summarizing and evaluating phylogenetic results2.5 Comparing, summarizing and evaluatingphylogenetic resultsIn this section, we briefly review common methods for comparing, summa-rizing and evaluating results of phylogenetic inference. We only considertrees with the same set of leaves.2.5.1 Comparing phylogenetic treesComparing two phylogenetic trees usually consists of two parts: comparingthe tree topology and comparing the branch lengths. Several tree distancemeasures have been proposed to compare two phylogenetic trees, based onmeasures of two topologies and/or measures of two branch lengths.We first introduce the concept of a partition of an unrooted tree, whichis the basis for several tree distance measures, and then introduce somepopular tree distance measures that are used in this thesis.Partitions of an unrooted treeMost common tree distance measures discussed in this thesis are based onpartitions of an unrooted phylogenetic tree. A partition of an unrootedbifurcating phylogenetic tree consists of two unordered non-empty subsetsof leaves from two subtrees of this tree that are linked by a single branch. Foran unrooted phylogenetic tree with N leaves, there are 2N − 3 branches intotal, and thus 2N − 3 partitions in total. Each partition is also associatedwith a branch length, which is the length of the branch linking the twosubtrees.Figure 2.6 shows two unrooted bifurcation phylogenetic trees with 5leaves. There are 7 partitions in each, and 6 partitions are shared by bothtrees, summarized in Table 2.1. We illustrate how common tree distancemeasures are calculated using the two trees in Figure 2.6 as an example.282.5. Comparing, summarizing and evaluating phylogenetic resultsa.ADCBE0.40.10.10.20.10.30.5b.CDABE0.40.10.10.20.20.30.4Figure 2.6: Two unrooted bifurcating phylogenetic trees with 5 leaves.Table 2.1: List of all partitions in the phylogenetic trees shown in Figure 2.6.Partition {A, D | B, C, E} is only available in tree a, and partition {C, D |A, B, E} is only available in tree b.partition branch length (tree a) branch length (tree b){A | B, C, D, E} 0.1 0.2{B | A, C, D, E} 0.3 0.3{C | A, B, D, E} 0.1 0.1{D | A, B, C, E} 0.2 0.2{E | A, B, C, D} 0.5 0.4{B, E | A, C, D} 0.4 0.4{A, D | B, C, E} 0.1 −{C, D | A, B, E} − 0.1Robinson-Foulds symmetric differenceThe Robinson-Foulds (RF) symmetric difference (Robinson and Foulds, 1979)is defined as the sum of the number of partitions of the first tree which arenot a partition of the second tree and the number of partitions of the secondtree which are not a partition of the first tree. For example, there are twopartitions, {A, D | B, C, E} and {C, D | A, B, E}, listed in Table 2.1 thatare only available on one of the two trees. The RF difference of two trees inFigure 2.6 is 2. The RF difference only considers the topology difference of292.5. Comparing, summarizing and evaluating phylogenetic resultstwo unrooted phylogenetic trees and ignores the branch length difference.The Penny-Hendy (PH) difference (Penny and Hendy, 1985) is definedas twice the number of partitions at interior branches of one tree that arenot available in the other. It is easy to see that the PH difference returnsexactly the same value as the RF difference because of the symmetry.Weighted Robinson-Foulds distanceThe weighted Robinson-Foulds (wRF) is defined as the sum of absolutedifferences between respective branch lengths of all partitions for two phy-logenetic trees. If a partition is not available in one tree, then the branchlength for this partition is set as 0 for that tree. For example, the wRFdistance of the two trees in Figure 2.6 is |0.1−0.2|+ |0.3−0.3|+ |0.1−0.1|+|0.2− 0.2|+ |0.5− 0.4|+ |0.4− 0.4|+ |0.1− 0|+ |0− 0.1| = 0.4.The wRF distance takes both topology similarity and branch lengthsimilarity into consideration.Kuhner-Felsenstein distanceThe Kuhner-Felsenstein (KF) distance (Kuhner and Felsenstein, 1994) is de-fined as the sum of squares of differences between respective branch lengthsof all partitions for two phylogenetic trees. For example, the KF distanceof the two trees in Figure 2.6 is (0.1 − 0.2)2 + (0.3 − 0.3)2 + (0.1 − 0.1)2 +(0.2− 0.2)2 + (0.5− 0.4)2 + (0.4− 0.4)2 + (0.1− 0)2 + (0− 0.1)2 = 0.04.The KF distance also takes both topology similarity and branch lengthsimilarity into consideration.Geodesic distanceThe geodesic distance, introduced by Billera et al. (2001), is defined asthe length of the shortest path between two phylogenetic trees with branchlengths in the continuous tree space. Computing the geodesic distance is notas straight-forward as other distance measures. Owen and Provan (2011)proposed an algorithm for computing the geodesic distance of phylogenetictrees in polynomial time.302.5. Comparing, summarizing and evaluating phylogenetic resultsThe geodesic distance measures both the topology similarity and branchlength similarity.2.5.2 Summarizing phylogenetic treesThe consensus tree is a popular tool to summarize the agreement betweentwo or more trees. A strict consensus tree of a collection of trees retainsonly partitions that appear in all trees of this collection. A majority-ruleconsensus tree retains partitions that appear in more than half of all treesconsidered. For example, we consider the consensus tree of three phyloge-netic trees: two of them are identical to tree a in Figure 2.6 and the otheris identical to tree b. Then the majority-rule consensus tree has the sametopology as the tree a in Figure 2.6. But for the strict consensus tree, thenode linking leaves A and D collapses with the node linking the leaf C withA and D because the partition {A, D | B, C, E} does not appear in all trees,while the partition {A, C, D | B, E} appears in all trees. We use only themajority-rule consensus tree in this thesis.2.5.3 EvaluationFor phylogenetic inference methods based on probability models, simulationstudies can be conducted to evaluate their performance. Asymptotic be-haviour of the maximum likelihood estimates and Bayesian estimates mayalso be obtained under some assumptions.In real data analysis, because the true evolutionary process cannot berepeated, we can never fully verify if the inferred phylogenetic tree describesthe true underlying evolutionary history. However, there are assumptions onhuman migration which are supported by evidence collected from fossils andancient texts. These assumptions may be used to support results obtainedfrom applying new models if they agree.Comparing results obtained using real data with known facts of hu-man history or species evolution history may also provide insights into thelimitations of the evolutionary models or the violations of the tree shapeassumptions.31Chapter 3Normal approximation withgeneral fixation model forfrequency data3.1 IntroductionLocating the root of a phylogenetic tree that is consistent with observed ev-idence of the unknown evolutionary history is an important step in a rangeof phylogenetic studies (Huelsenbeck et al., 2002). For example, the rootof human population trees provides clues on the origin and the path of hu-man migration (Gray et al., 2009), and the root of the tree of life is used tostudy the origins of life via reconstruction methods. However, most classicmethods estimate unrooted trees only. Distance-based methods depend onsymmetric similarity measures and classic probability based methods de-pend on reversible continuous time Markov chain (CTMC) models. Bothapproaches ignore the direction of evolution and thus are unable to identifythe root of the tree directly.There are three main methods for constructing a rooted phylogenetictree: assuming a molecular clock tree (or a relaxation of the molecular clocktree), adding an outgroup, or using non-reversible models. In a molecularclock tree, all leaves are equally distant from the root (see Figure 2.4C). Thisis unrealistic as it ignores evolution rate variation which depends on factorssuch as the population size (Swofford et al., 1996; Huelsenbeck et al., 2002).There is a large literature on relaxing the clock assumption while modellingrooting, but these methods are generally computationally expensive (Bat-323.1. Introductiontistuzzi et al., 2010).The outgroup criterion is also widely used to root an unrooted phyloge-netic tree: the idea is to find a taxon (species or population) sufficiently farfrom those under study and to use the attachment point of this taxon to theother taxa to infer the root (see Figure 2.4A). However, finding a reasonableoutgroup is not always easy. Previous works have shown that the outgroupcriterion can lead to errors when the outgroup is distant (Wheeler, 1990;Pearson et al., 2013), or when the traits are changing at a high rate (Outlawand Ricklefs, 2011), or when the outgroup does not exist or is unknown;for example when studying the tree of life (Iwabe et al., 1989) or linguisticfamilies (Gray et al., 2009).We propose a non-reversible model for rooted bifurcating tree (see Fig-ure 2.4D) inference, with application to the evolutionary process of singlenucleotide polymorphism (SNP) frequencies. Our model is motivated bymodelling the main forces that shape the patterns of variations found inSNP frequencies across populations: mutation, which created variations inancestral populations; as well as drift and fixation, which determine the al-lele frequencies in modern populations. Among these forces, drift over timeof allele frequencies, which is usually modelled as a reversible process, hasbeen the focus of classical probability models. Brownian motion, in partic-ular, provides a simple and tractable approximation of the Wright-Fishermodel (Edwards and Cavalli-Sforza, 1964; Felsenstein, 1973, 1981b; Pickrelland Pritchard, 2012).While the reversible Brownian approximation is accurate for frequenciesbounded away from zero and one, it breaks down at the extremities. TheBrownian approximation fails in this regime because it ignores fixation: thesimple observation that if all individuals in a population share the sameallele, then the drift is fixed for a significant time period (until new muta-tions create subsequent drifts). Moreover, the distribution of reconstructedancestral states according to the Brownian motion assigns positive proba-bility to frequencies smaller than zero and greater than one. This limitationhas motivated the development of more sophisticated models which trackallele counts in a more detailed way through coalescent theory. This brings333.2. Evolutionary modelsfixation back into the model, thus non-reversibility, but at the cost of asignificant increase in computational requirements and model complexity.As a consequence, most current probability models either ignore fixation,or otherwise are limited to inference in small sets of populations at a time(RoyChoudhury et al., 2008).In this thesis, we present a tree reconstruction method that models boththe fixation and drift found in SNP frequency data while keeping the compu-tational requirements and the model complexity manageable. Our method isbased on a likelihood model that considers only valid ancestral frequencies.Being non-reversible, the method has the additional advantage of being ableto identify the placement of the root without using outgroups. Our approachconsists of two steps: first, relevant evolutionary parameters are estimatedunder the non-reversible model based on the maximum likelihood princi-ple (Chapter 3), then, a rooted phylogenetic tree is constructed via a novelefficient rooted-tree reconstruction algorithm (Chapter 4).We focus on our new evolutionary model and inference method for anytwo populations in Chapter 3, and leave our new rooted phylogenetic treeinference method to Chapter 4.3.2 Evolutionary modelsIn this section, we briefly review evolutionary models for frequency data,and propose a new model for genetic drift, called the normal approximationwith general fixation. Note that for most SNPs so far discovered, thereare only two different alleles, which are called bi-allelic sites. We focus onindependent bi-allelic sites in Chapter 3 and Chapter 4.Frequency data is widely used for the phylogenetic inference of popula-tions. Based on SNP or gene sequences collected at several individuals of thesame population, the SNP frequency sequence or gene frequency sequenceof chosen alleles is calculated for the population to represent the geneticcharacteristic of the population.343.2. Evolutionary models3.2.1 Review of evolutionary models for frequency dataEvolutionary models play a fundamental role in phylogenetics. The Wright-Fisher model and its approximations are widely used evolutionary modelsfor frequency data. Consider one locus with two alleles. Suppose that thereare N0 copies of one allele in N haploid individuals. If the number of haploidindividuals N is constant, then according to the Wright-Fisher model, thenumber of copies of this allele in the next generation, N1, follows a binomialdistribution,N1|N0 ∼ Binomial (N,N0/N).More generally, let p0 be the frequency of a binary allele in an ancestralpopulation. If the number of individuals in each generation is constant, thenthe frequency of this allele after t generations, p(t), can be modelled by asequence of binomial distributions using the Wright-Fisher model. We writep(t) as p for simplicity throughout Chapter 3.The distribution of p depends on both the number of generations t andthe initial frequency p0 = N0/N . For modelling the evolutionary history ofall human populations, the number of generations from the most recent com-mon ancestor of human populations to current human populations is large.As a result, calculating the distribution of p is computationally expensive.Edwards and Cavalli-Sforza (1964); Felsenstein (1973, 1981b) used aBrownian motion approximation for the Wright-Fisher model, i.e.,p ∼ Normal (p0, σ2t ), 0 < p0 < 1, (3.1)where σ2t is a measure of drift strength combining the effects of t and N .Larger N and smaller t lead to smaller σ2t , i.e., smaller drift. Note thatt and N are not identifiable in the normal approximation of the Wright-Fisher model and the variance σ2t depends on the frequency p0, which canbe further stabilized by a transformation (Felsenstein, 1973) for frequencieswhich are not close to 0 and 1.A main drawback of the Brownian motion approximation is that p is notbounded in [0, 1]. The normal approximation works well if the evolution353.2. Evolutionary modelsdistance from the parent population to the child population is short, butthe probability of breaking the boundary 0 or 1 is not negligible when theevolution distance is long.Nicholson et al. (2002) proposed a normal approximation with fixationfor the Wight-Fisher model which takes the probability of fixation in childpopulations into consideration:p ∼ Normal (p0, p0(1− p0)σ2), 0 < p0 < 1, (3.2)constrained in [0, 1] with atom mass on 0 and 1 equal to the total mass ofrelevant distribution on (−∞, 0) and (1,∞) respectively. In other words, pis a mixture of continuous and discrete random variable which has a Normaldensity in (0, 1), and point masses at 0 and 1 respectively modelling theprobability of p < 0 and p > 1 derived from the Normal density. In model(3.2), the component of the variance which depends on p0, p0(1 − p0), isseparated from the other component σ2, which measures the strength ofgenetic drift.Balding and Nichols (1995) used a beta distribution which matches thefirst two moments of the Brownian motion model as an approximation to theWright-Fisher diffusion for genetic drift in island populations. This model isalso widely used in modelling the effects of genetic drift of SNP frequencies(Sire´n et al., 2011, 2013).The work of Nicholson et al. (2002) and Balding and Nichols (1995)both impose restrictions on the allele frequencies at internal nodes. Moreprecisely, while both methods allow the full range of frequencies at the leaves,[0, 1], the frequencies of all alleles at any internal nodes are restricted to(0, 1). While SNPs are by definition polymorphic at the root of the tree,requiring each of them to be polymorphic in all internal nodes is limitingfor the purpose of medium or large population tree reconstruction.One should also keep in mind that the Wright-Fisher model is itself anapproximation for the genetic drift rather than an exact model of the geneticdrift.363.2. Evolutionary models3.2.2 Normal approximation with general fixationWe generalize an approximation model to the Wright-Fisher model for ge-netic drift (Nicholson et al., 2002), by extending the domain of the ancestralfrequencies p0 from (0, 1) into [0, 1]. We denote this evolution model as:p ∼ FixNormal (p0, σ2), 0 ≤ p0 ≤ 1,where σ2 is the drift parameter which models the strength of genetic drift.We call this model the normal approximation with general fixation.The density fp of p under the FixNormal model has the following formwhen 0 < p0 < 1:fp(p|p0, σ2) = φ(p− p0σ0σ), 0 < p < 1, (3.3a)fp(0|p0, σ2) =∫ 0−∞φ(t− p0σ0σ)dt = Φ(−p0σ0σ), (3.3b)fp(1|p0, σ2) =∫ ∞1φ(t− p0σ0σ)dt = 1− Φ(1− p0σ0σ), (3.3c)where σ0 =√p0(1− p0), φ and Φ are the probability density and cumulativedistribution functions of a standard normal random variable. Note that theabove density is defined with respect to a reference measure composed ofa uniform measure on (0, 1) superposed with a unit point mass at eachboundary point {0, 1}. This part of the density fp is the same as that of themodel of Nicholson et al. (2002).Under the assumption that mutations are rare, we further assume thatfixation is not reversible. In other words, once fixed, the frequency at onelocation will not change. Under this assumption:fp(0|0, σ2) = 1, fp(p|0, σ2) = 0, 0 < p ≤ 1, σ2 > 0, (3.4a)fp(1|1, σ2) = 1, fp(p|1, σ2) = 0, 0 ≤ p < 1, σ2 > 0. (3.4b)Since our model does not require the variation observed at current popu-lations to be present in ancestral populations, it allows fixation in both cur-373.2. Evolutionary modelsrent populations and ancestral populations. This simple relaxation makesthe new model a reasonable choice for modelling SNP frequencies amongdistantly related populations, and also has an important impact on the like-lihood model and inference methods.3.2.3 Comparisons to the Wright-Fisher modelIn this section, we compare the distribution of frequencies based on thenormal approximation with general fixation (i.e., FixNormal) model and thedistribution of frequencies based on the Wright-Fisher model, and show thatthe FixNormal model serves as a good approximation to the Wright-Fishermodel which is itself an approximation.The distribution of frequencies based on the Wright-Fisher model is aseries of binomially distributed random variables. When the number ofgenerations t is large, the probability density function of that distributionis hard to evaluate directly. Instead of working on its analytical probabilitydensity function, we generate samples from that distribution and comparethe empirical distribution of generated samples.We set the population size N = 1000 (i.e., the number of individuals inthe population) and the number of SNPs n = 5000. We simulate frequencydata from the Wright-Fisher model with different initial frequencies (p0 =0.5 or 0.1) and different numbers of generations t.First, we generate frequency data under the Wright-Fisher model, andthen estimate evolutionary distance σ2 in the FixNormal model based onmaximum likelihood. Figure 3.1 shows the relationship of fitted σ̂2 and t/N ,which is roughly linear when t/N is smaller than 0.5 and looks exponentialwhen t/N is larger than 0.5 for p0 = 0.5. A similar pattern is observed forp0 = 0.1, however, the maximum of the linear range shrinks to approximately0.1 because of a larger difference between the Wright-Fisher model and theFixNormal model. In this thesis, we focus on cases where σ2 is smaller than2, which is the general range of σ2 found in the real data studied in thisthesis. Almost identical σ̂2s are obtained when this simulation is repeatedto obtain other sets of randomly generated frequencies when n = 5000.383.2. Evolutionary modelsllllllllllll0.0 0.2 0.4 0.6 0.8 1.00.00.51.01.5t/1000σ2^p0 = 0.5l l ll llll0.0 0.1 0.2 0.3 0.4 0.5 0.60.00.51.01.52.0t/1000σ2^p0 = 0.1Figure 3.1: Comparison of evolutionary parameters in the FixNormal modeland the Wright-Fisher model.Table 3.1: Mean fixation rate difference of 100 simulation runs comparingthe Wright-Fisher model and the FixNormal model. The mean is shown asthe fixation rate of the Wright-Fisher model minus the fixation rate of theFixNormal model. Note that when p = 0.1 and t = 1000, optimizing σ2based on the likelihood does not converge, which is omitted.t = 5 t = 25 t = 50 t = 100 t = 250 t = 500 t = 1000p0 = 0.5 0 0 0 −0.001 −0.023 −0.024 −0.003p0 = 0.1 0 −0.016 −0.050 −0.038 0.104 0.231 -Second, we calculate the fixation rate of SNPs simulated under theWright-Fisher model, and compare it with respective fixation rate of SNPssimulated under the normal approximation with general fixation model us-ing σ̂2. Table 3.1 shows that the difference between the two fixation ratesare larger when both the number of generations is large and the ancestralfrequency is close to the boundaries (i.e., p0 = 0.1 in the simulation), butrelatively small or close to 0 when the number of generations is small or theancestral frequency is close to one half. The results are stable when thissimulation is repeated with different random seeds.Third, we compare the distribution of non-fixed SNP frequencies under393.2. Evolutionary modelsthe Wright-Fisher model with the distribution of non-fixed SNP frequenciesunder the respective FixNormal model using σ̂2. We show some quantile-quantile plots of two generated datasets in Figure 3.2 (p = 0.5) and Fig-ure 3.3 (p = 0.1) to compare the two distributions of frequencies. Figure 3.2shows that when p = 0.5, the two simulated datasets are close to each otherin distribution. Figure 3.3 shows that when p = 0.1, the differences betweenthe two distributions are larger, especially when the evolutionary parametersare large.Overall, we show that the difference between the Wright-Fisher modeland the FixNormal model is relatively small for cases of particular interest inthe application of human population tree inference. Even though there arediscrepancies between the two models when both the ancestral frequency isclose to boundary and the evolutionary distance is large, the effect of thesediscrepancies may be attenuated when a non-degenerate distribution overancestral frequencies is considered (see Section 3.4).In conclusion, when σ2 is small or when p0 is close to 0.5, the differencebetween the distributions of p from our model and the Wright-Fisher modelis negligible. When p is close to 0 or 1, and σ2 is very large, there is a largerdifference between the two models.However, this limitation is not a major concern in the types of applica-tions we are interested in. First, we are more interested in σ2 rather than p.We do not need to estimate both p and σ2. Since we marginalize over thevalues of p, the effects of those p which are close to 0 will be offset by thosep which are close to 1, especially after the symmetric transformation intro-duced in the next section of our inference method. Second, in real problems,reasonable values of σ2 are not too large (mostly between 0 and 2 in ourdata analysis).403.2. Evolutionary modelslllllllllllllllllllllllllllllll0.0 0.2 0.4 0.6 0.8 1.00.00.20.40.60.81.0t = 5lllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll0.0 0.2 0.4 0.6 0.8 1.00.00.20.40.60.81.0t = 25l llllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll0.0 0.2 0.4 0.6 0.8 1.00.00.20.40.60.81.0t = 100lllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll0.0 0.2 0.4 0.6 0.8 1.00.00.20.40.60.81.0t = 250lllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll0.0 0.2 0.4 0.6 0.8 1.00.00.20.40.60.81.0t = 500llllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll0.0 0.2 0.4 0.6 0.8 1.00.00.20.40.60.81.0t = 1000Figure 3.2: Quantile-quantile plots of simulated data based on the FixNor-mal model and the the Wright-Fisher model when p0 = 0.5. In theabove plots, x-axis represents Wright-Fisher quantiles and y-axis representsFixNormal quantiles.413.2. Evolutionary modelslllllllllllllllll0.0 0.2 0.4 0.6 0.8 1.00.00.20.40.60.81.0t = 5llllllllllllllllllllllllllllllllll0.0 0.2 0.4 0.6 0.8 1.00.00.20.40.60.81.0t = 25llllllllllllllllllllllllllllllllllllllllllllll0.0 0.2 0.4 0.6 0.8 1.00.00.20.40.60.81.0t = 50lllllllllllllllllllllllllllllllllllllllllllllllllllllllllll0.0 0.2 0.4 0.6 0.8 1.00.00.20.40.60.81.0t = 100llllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll ll llllll0.0 0.2 0.4 0.6 0.8 1.00.00.20.40.60.81.0t = 250llllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll0.0 0.2 0.4 0.6 0.8 1.00.00.20.40.60.81.0t = 500Figure 3.3: Quantile-quantile plots of simulated data based on the FixNor-mal model and the Wright-Fisher model when p0 = 0.1. In the above plots,x-axis represents Wright-Fisher quantiles and y-axis represents FixNormalquantiles.423.3. Inference method for two populations3.3 Inference method for two populations3.3.1 Likelihood model for two populationsIn this thesis, we assume the phylogenetic tree is bifurcating (see Figure 4.1for an example with three leaves). We first focus on a pair of populations andpropose a new likelihood-based method to estimate evolutionary distancesfrom each of the two current populations to their MRCA.We denote the SNP frequencies in the two current populations and theirMRCA as pi,1, pi,2, and pi,0 (i = 1, 2, . . . , L) respectively. We only considerindependent SNP sites in this thesis. We use two sets of independent normalapproximations with general fixation for the genetic drift after the separationof the two populations, i.e.,pi,1 ∼ FixNormal (pi,0, σ21), and pi,2 ∼ FixNormal (pi,0, σ22). (3.5)Note that pi,1 and pi,2 are conditionally independent given pi,0 (i = 1, 2, . . . , L).We also assume that the variances do not depend on the SNP sites, i.e., allSNP sites share the same σ21 and σ22. The log-likelihood function for twopopulations is given byl(p0, σ21, σ22) =L∑i=1{ln fp(pi,1|pi,0, σ21) + ln fp(pi,2|pi,0, σ22)}, (3.6)where p0 = (p1,0, p2,0, . . . , pL,0), and fp is the density function defined in(3.3) and (3.4).For a pair of populations, there are 2L samples, and L + 2 unknownparameters: σ21, σ22, which are univariate, and p0, which is L-dimensional.It is hard to estimate pi,0 accurately from two samples pi,1 and pi,2 even ifσ21 and σ22 are known. However, to estimate σ21 and σ22, which are sharedby all loci, we can avoid estimating p0 by modelling its distribution. Theeffects of model misspecification is illustrated in the simulation studies (seeSection 3.4).The effects of the value of p0 on the distribution of p have been discussed433.3. Inference method for two populationsin Ewens (1973); Nicholson et al. (2002); Sire´n et al. (2013), but to ourbest knowledge, there is no thorough study on the effects of misspecifieddistributions of p0 on the distribution of p using simulation studies.3.3.2 Modelling ancestral allele frequenciesFor one SNP locus, we can write the density function of (p1, p2) asf(p1, p2|σ21, σ22) =∫ 10pi( dp0)fp(p1|p0, σ21)fp(p2|p0, σ22)=∫ 10pi∗( dp0)fp(p1|p0, σ21)fp(p2|p0, σ22)+P (p0 = 0)fp(p1|0, σ21)fp(p2|0, σ22) (3.7)+P (p0 = 1)fp(p1|1, σ21)fp(p2|1, σ22),where pi is the distribution of p0 defined on [0, 1], and pi∗ is the continuouspart of pi defined on (0, 1). Let m0 = P (p0 = 0), m1 = P (p0 = 1), andm2 =∫ 10 pi∗(p0)dp0 with a constraint m0 +m1 +m2 = 1.If an allele is randomly chosen to measure the frequency, it is reasonableto assume that pi is symmetric with respect to 0.5.1 Under the symmetricassumption on pi, we can further combine m0 and m1 into one parametermf ≡ 2m0 = 2m1 because m0 = m1, when pi is symmetric. Then mnf ≡1−mf is the proportion of unfixed SNPs in the ancestral population. Thedensity of (p1, p2) in (3.7) can be simplified asf(p1, p2|σ21, σ22)=∫ 10pi∗(p0)fp(p1|p0, σ21)fp(p2|p0, σ22)dp0+mf2{fp(p1|0, σ21)fp(p2|0, σ22) + fp(p1|1, σ21)fp(p2|1, σ22)}. (3.8)The log-likelihood of (σ21, σ22) for L frequencies p1 = (p1,1, p2,1, . . . , pL,1) and1If the allele used to measure the frequency is not randomly chosen (for example, it ischosen to be the minor allele in the discovery panel of the SNP ascertainment process asin HGDP) we can perform a symmetric transformation by inverting half of the frequenciesfrom p into 1− p.443.3. Inference method for two populationsp2 = (p1,2, p2,2, . . . , pL,2) is given byl(σ21, σ22|p1,p2) =L∑i=1ln f(pi,1, pi,2|σ21, σ22). (3.9)We can maximize the log-likelihood (3.9) to find the maximum likelihoodestimate (σ̂21, σ̂22).In real data analysis, we do not observe pi,1 and pi,2. We observe thenumber of sampled individuals ni,1 and ni,2 with the number of certainalleles xi,1 and xi,2. We can integrate out p1 and p2 in (3.8) after adding twobinomial probability mass functions. In this thesis, we focus on populationfrequencies directly and simply use p̂i,l = xi,l/ni,l (i = 1, 2, · · · , L and l =1, 2) as data in the inference.In this thesis, we use the uniform(0, 1) distribution multiplied by afactor mnf to model pi∗. If there is extra information on the frequenciesof certain ancestral populations, it could also be easily incorporated intoour model. Other distributions motivated by population genetics such aspi∗(x) ∝ 1/{x(1− x)} for x ∈ (0, 1) (Ewens, 1973) can also be used.For the values of mf , we propose an empirical Bayes estimator to es-timate mf from data which works well in our simulation studies and dataanalysis. We estimatem̂0 =1LL∑i=1I(p̂i,1 = 0)I(p̂i,2 = 0) and m̂1 =1LL∑i=1I(p̂i,1 = 1)I(p̂i,2 = 1),where I is an indicator function and estimatem̂f = m̂0 + m̂1. (3.10)Note that m̂0 ≥ m0 since if pi,0 = 0, then pi,1 = pi,2 = 0, but pi,1 =pi,2 = 0 does not imply pi,0 = 0 (i = 1, 2, · · · , L). Similarly, m̂1 ≥ m1, andthus m̂f ≥ mf . The bias m̂f −mf increases when σ21 and σ22 are larger. It isalso possible to estimate mf as a parameter together with σ21 and σ22, whichis not the focus of this thesis.453.4. Simulation studies for two populationsIn the next section, we show that using this model works well for avariety of distributions on p0 even when the model is misspecified, and thatthe choice of the weight mf has a negligible effect on the estimation of σ21,σ22, as long as mf is non-zero.3.4 Simulation studies for two populationsIn this simulation study, we investigate the performance of our inferencemethod for two populations, under the scenario of both when the distribu-tion of the ancestral frequency used in simulation is the same as that usedin inference and when the distribution of the ancestral frequency used insimulation is different from that used in inference. We also investigate theeffects of branch lengths and the number of SNPs on the accuracy of theestimates.3.4.1 Two population inference when the model is correctlyspecifiedFirst, we generate three sets of SNP frequencies of an ancestral populationP0 using a mixed distribution with a continuous part from uniform(0, 1),beta(0.5, 0.5) or beta(2, 2), and discrete parts on 0 and 1 with mass m0and m1. The proportion of frequencies which are fixed in the ancestralpopulation P0 is denoted mf = m0 + m1. In the simulation, we set m0 =m1 = 0.1. We generate the values p0 of L independent SNPs with L =100, 1000, 10000 in this simulation study.Second, we generate SNP frequencies for the leaf populations P1 andP2, p1 and p2, based on p0 using the normal approximation with generalfixation model with parameters σ21 and σ22 respectively, as shown in (3.5).We set σ21 = 0.1, 0.5, 1.0, and σ22 = 2σ21 so that it is not a clock tree.The box plots of σ̂21 and σ̂22 from 100 simulation runs are shown in Fig-ure 3.4 when uniform(0, 1) was used for the continuous part of the distri-bution of p0. From Figure 3.4, we can see that the estimates σ̂21 and σ̂22 arevery accurate. The variances of the estimates decrease when the number of463.4. Simulation studies for two populationslllllllσ12 (L=100) σ22 (L=100) σ12 (L=1000) σ22 (L=1000) σ12 (L=10000) σ22 (L=10000)0.00.10.20.30.4llllllllσ12 (L=100) σ22 (L=100) σ12 (L=1000) σ22 (L=1000) σ12 (L=10000) σ22 (L=10000)0.00.10.20.30.4lllllσ12 (L=100) σ22 (L=100) σ12 (L=1000) σ22 (L=1000) σ12 (L=10000) σ22 (L=10000)0.51.01.52.0llllllσ12 (L=100) σ22 (L=100) σ12 (L=1000) σ22 (L=1000) σ12 (L=10000) σ22 (L=10000)0.51.01.52.0lllllllllllσ12 (L=100) σ22 (L=100) σ12 (L=1000) σ22 (L=1000) σ12 (L=10000) σ22 (L=10000)1234llllllllllllσ12 (L=100) σ22 (L=100) σ12 (L=1000) σ22 (L=1000) σ12 (L=10000) σ22 (L=10000)1234Figure 3.4: Boxplots of σ̂21 and σ̂22 in 100 simulation runs for two populationswhen p0 is generated from a uniform(0, 1) with point mass 0.1 at 0 and 1.In the three left panels, the true value mf = 0.2 is used in estimation,and in three right panels, the empirical Bayes estimate m̂f (3.10) is used inestimation. The grey lines indicate the true values of σ21 and σ22 = 2σ21.SNPs increases. The effect of using m̂f versus using mf on the estimatesσ̂21 and σ̂22 is generally small. There are slight biases for larger values of σ21and σ22, which can be explained by the fact that for larger values of branchlengths, m̂f tends to overestimate mf .The box plots of relative errors, i.e., (σ̂21 − σ21)/σ21 and (σ̂22 − σ22)/σ22, are473.4. Simulation studies for two populationslllllllσ12 (L=100) σ22 (L=100) σ12 (L=1000) σ22 (L=1000) σ12 (L=10000) σ22 (L=10000)−0.50.00.51.0llllllllσ12 (L=100) σ22 (L=100) σ12 (L=1000) σ22 (L=1000) σ12 (L=10000) σ22 (L=10000)−0.50.00.51.0lllllσ12 (L=100) σ22 (L=100) σ12 (L=1000) σ22 (L=1000) σ12 (L=10000) σ22 (L=10000)−0.50.00.51.0llllllσ12 (L=100) σ22 (L=100) σ12 (L=1000) σ22 (L=1000) σ12 (L=10000) σ22 (L=10000)−0.50.00.51.0llll lllllll lσ12 (L=100) σ22 (L=100) σ12 (L=1000) σ22 (L=1000) σ12 (L=10000) σ22 (L=10000)−0.50.00.51.0llllllllllllσ12 (L=100) σ22 (L=100) σ12 (L=1000) σ22 (L=1000) σ12 (L=10000) σ22 (L=10000)−0.50.00.51.0Figure 3.5: Boxplots of (σ̂21 − σ21)/σ21 and (σ̂22 − σ22)/σ22 in 100 simulationruns for two populations when p0 is generated from a uniform(0, 1) withpoint mass 0.1 at 0 and 1. The panels are arranged in the same order as inFigure 3.4. The grey lines indicate 0.shown in Figure 3.5. Figure 3.5 shows that the bias of both estimates arereasonably small, and the relative error reduces when the number of sitesincreases.483.4. Simulation studies for two populationsllllllσ12 (L=100) σ22 (L=100) σ12 (L=1000) σ22 (L=1000) σ12 (L=10000) σ22 (L=10000)0.000.100.200.30lllllllllllllσ12 (L=100) σ22 (L=100) σ12 (L=1000) σ22 (L=1000) σ12 (L=10000) σ22 (L=10000)0.00.10.20.3llllσ12 (L=100) σ22 (L=100) σ12 (L=1000) σ22 (L=1000) σ12 (L=10000) σ22 (L=10000)0.51.01.5lllσ12 (L=100) σ22 (L=100) σ12 (L=1000) σ22 (L=1000) σ12 (L=10000) σ22 (L=10000)0.51.01.5llllllllllσ12 (L=100) σ22 (L=100) σ12 (L=1000) σ22 (L=1000) σ12 (L=10000) σ22 (L=10000)0.51.01.52.02.53.03.5llllllllσ12 (L=100) σ22 (L=100) σ12 (L=1000) σ22 (L=1000) σ12 (L=10000) σ22 (L=10000)0.51.01.52.02.53.0Figure 3.6: Boxplots of σ̂21 and σ̂22 in 100 simulation runs for two populationswhen p0 is generated from a beta(0.5, 0.5) with point mass 0.1 at 0 and 1.The panels are arranged in the same order as in Figure 3.4. The grey linesindicate true values of σ21 and σ22.3.4.2 Two population inference when the model ismisspecifiedSimilar to the previous section, we generate SNP frequencies of a populationP0 at the root using a different distribution from the distribution pi used inthe inference. The box plots of σ̂21 and σ̂22 from 100 simulation runs areshown in Figure 3.6, where beta(0.5, 0.5) is used for p0, and Figure 3.8,493.4. Simulation studies for two populationsllllllσ12 (L=100) σ22 (L=100) σ12 (L=1000) σ22 (L=1000) σ12 (L=10000) σ22 (L=10000)−0.50.00.51.0lllllllllllllσ12 (L=100) σ22 (L=100) σ12 (L=1000) σ22 (L=1000) σ12 (L=10000) σ22 (L=10000)−0.50.00.51.0llllσ12 (L=100) σ22 (L=100) σ12 (L=1000) σ22 (L=1000) σ12 (L=10000) σ22 (L=10000)−0.50.00.51.01.5 lllσ12 (L=100) σ22 (L=100) σ12 (L=1000) σ22 (L=1000) σ12 (L=10000) σ22 (L=10000)−0.50.00.51.0lllllllll lσ12 (L=100) σ22 (L=100) σ12 (L=1000) σ22 (L=1000) σ12 (L=10000) σ22 (L=10000)−0.50.00.51.0lllllllσ12 (L=100) σ22 (L=100) σ12 (L=1000) σ22 (L=1000) σ12 (L=10000) σ22 (L=10000)−0.50.00.51.0Figure 3.7: Boxplots of (σ̂21 − σ21)/σ21 and (σ̂22 − σ22)/σ22 in 100 simulationruns for two populations when p0 is generated from a beta(0.5, 0.5) withpoint mass 0.1 at 0 and 1. The panels are arranged in the same order as inFigure 3.4. The grey lines indicate 0.where beta(2, 2) is used for p0, and the box plots of relative errors areshown in Figure 3.7 (for beta(0.5, 0.5)) and Figure 3.9 (for beta(2, 2)).From Figure 3.6 and Figure 3.7, we can see that the estimates of σ̂21 andσ̂22 are more accurate for the smaller branch length σ21, but they underesti-mate the larger branch length σ22 when the continuous part of p0 is generatedfrom beta(0.5, 0.5) and estimated using uniform(0, 1). Both estimates σ̂21503.4. Simulation studies for two populationsllσ12 (L=100) σ22 (L=100) σ12 (L=1000) σ22 (L=1000) σ12 (L=10000) σ22 (L=10000)0.00.10.20.30.4llσ12 (L=100) σ22 (L=100) σ12 (L=1000) σ22 (L=1000) σ12 (L=10000) σ22 (L=10000)0.00.10.20.30.4lllσ12 (L=100) σ22 (L=100) σ12 (L=1000) σ22 (L=1000) σ12 (L=10000) σ22 (L=10000)0.51.01.52.0llllσ12 (L=100) σ22 (L=100) σ12 (L=1000) σ22 (L=1000) σ12 (L=10000) σ22 (L=10000)0.51.01.52.0lllllllllllllσ12 (L=100) σ22 (L=100) σ12 (L=1000) σ22 (L=1000) σ12 (L=10000) σ22 (L=10000)12345lllllllllllllσ12 (L=100) σ22 (L=100) σ12 (L=1000) σ22 (L=1000) σ12 (L=10000) σ22 (L=10000)12345Figure 3.8: Boxplots of σ̂21 and σ̂22 in 100 simulation runs for two populationswhen p0 is generated from a beta(2, 2) with point mass 0.1 at 0 and 1. Thepanels are arranged in the same order as in Figure 3.4. The grey linesindicate true values of σ21 and σ22.and σ̂22 are more accurate when the evolutionary distances are small. Thiscan be explained by the fact that in general, evolutionary models becomeless accurate when the evolutionary distances increase. The effects of us-ing empirical Bayes estimators m̂f (3.10) are similar to the effects observedin Figure 3.4 and Figure 3.5. Although biases are introduced, the relativemagnitudes of estimated branch lengths are preserved.513.4. Simulation studies for two populationsllllllσ12 (L=100) σ22 (L=100) σ12 (L=1000) σ22 (L=1000) σ12 (L=10000) σ22 (L=10000)−0.50.00.51.0lllllllllllllσ12 (L=100) σ22 (L=100) σ12 (L=1000) σ22 (L=1000) σ12 (L=10000) σ22 (L=10000)−0.50.00.51.0llllσ12 (L=100) σ22 (L=100) σ12 (L=1000) σ22 (L=1000) σ12 (L=10000) σ22 (L=10000)−0.50.00.51.01.5 lllσ12 (L=100) σ22 (L=100) σ12 (L=1000) σ22 (L=1000) σ12 (L=10000) σ22 (L=10000)−0.50.00.51.0lllllllll lσ12 (L=100) σ22 (L=100) σ12 (L=1000) σ22 (L=1000) σ12 (L=10000) σ22 (L=10000)−0.50.00.51.0lllllllσ12 (L=100) σ22 (L=100) σ12 (L=1000) σ22 (L=1000) σ12 (L=10000) σ22 (L=10000)−0.50.00.51.0Figure 3.9: Boxplots of (σ̂21−σ21)/σ21 and (σ̂22−σ22)/σ22 in 100 simulation runsfor two populations when p0 is generated from a beta(2, 2) with point mass0.1 at 0 and 1. The panels are arranged in the same order as in Figure 3.4.The grey lines indicate 0.Similar results are observed in Figure 3.8. However, σ̂21 and σ̂22 overesti-mate the larger branch length σ22 when the continuous part of p0 is generatedfrom beta(2, 2) and estimated using uniform(0, 1), comparing to the under-estimation observed in Figure 3.6 and Figure 3.7. This can be explained bythe fact that beta(2, 2), which is bell-shaped, is more dense on values closeto 0.5 comparing to beta(0.5, 0.5) which is u-shaped. When there are more523.5. Summaryextreme values (i.e., values close to 0 and 1) in the ancestral frequency butnot in pi which models that ancestral frequency in the inference algorithm,the branch length estimates may be shortened to compensate for the smallervariance contributed by extreme frequencies.3.5 SummaryIn this Chapter, we introduced a normal approximation with general fixa-tion model for SNP frequency data. We started with the motivation in Sec-tion 3.1, reviewed evolutionary models for frequency data and introducedour new model in Section 3.2, introduced a maximum likelihood inferencemethod for two populations based on our new model in Section 3.3, andconducted simulation studies to investigate the performance of the inferencemethods for the evolutionary distances on both the correctly specified modeland misspecified models in Section 3.4.53Chapter 4Rooted tree reconstructionusing asymmetricneighbor-joining algorithm4.1 IntroductionIn this chapter, we introduce a novel rooted phylogenetic tree reconstruc-tion algorithm, called asymmetric neighbor-joining (ANJ). ANJ is inspiredby the popular neighbor-joining (NJ) algorithm, but utilizes the additionalinformation provided by our non-reversible model proposed in Chapter 3.ANJ takes as input an asymmetric the dissimilarity matrix, whose en-tries consist of evolutionary distances from any pair of leaf sequences totheir most recent common ancestor, and outputs an unrooted phylogenetictree. The ANJ is based on the bijection relationship of rooted bifurcat-ing trees with positive branch lengths and a subset of square matrices (seeFigure 4.1 for an illustration). The main advantage of using the ANJ al-gorithm for estimating the phylogenetic tree is computational efficiency, asit is well known that optimization over all possible trees is computation-ally hard (Roch, 2006; Nichols and Warnow, 2008). However, it is worthmentioning that incorporating a non-reversible evolutionary model into ourmaximum likelihood or Bayesian phylogenetic inference framework is alsofeasible. The ANJ algorithm presented here is broadly applicable to anynon-reversible setup: ANJ can turn collections of pairwise rooted trees ob-tained from any non-reversible model into a joint rooted non-clock tree (i.e.,the leaves are not required to be equally distant from the root). It is there-544.1. Introductionfore an interesting alternative to the outgroup criterion.P1P2 P3P0P2,31σ 22σ 2 3σ 22,3σ 2⇔ A = ∞ σ21 σ21σ22 + σ22,3 ∞ σ22σ23 + σ22,3 σ23 ∞Figure 4.1: An example of a rooted bifurcating tree τ for 3 populations, andthe asymmetric dissimilarity matrix A with relevant parameters derivedfrom this tree where ai,j is the sum of branch lengths from the i-th leaf tothe most recent common ancestor of the i-th leaf and the j-th leaf. Forexample, a21 = σ22 +σ22,3 is the sum of branch length from P2 to P0 which isthe most recent common ancestor population of P2 and P1, and a12 = σ21 isthe branch lengths from P1 to P0. The first step of our inference method isto estimate A based on a non-reversible model using maximum likelihood(Chapter 3). The second step is to estimate τ based on  (Chapter 4).We perform a series of simulations studies to assess the accuracy androbustness of the method. The results show that our method can recovertrue topologies and root placements with higher probability than modelsthat ignore fixation. Branch lengths are also shown to be accurate. Ourmethod also has a reasonable computational cost.Using our method, we reanalyze the SNP data from 53 populations fromthe Human Genome Diversity Panel (HGDP) (Cann et al., 2002), recon-structing the full tree with particular attention to the location of the root.The internal organization of the tree is similar to Li et al. (2008), but alsocorrects some problems found in this previous work.There has been earlier work on the construction of tractable methodstaking fixation into account, in particular the work of Nicholson et al. (2002)which our model builds on. However, this previous work has been restrictedto star shaped trees (i.e., a tree without internal vertices) of closely re-lated populations. See Figure 2.4C for an example of a star tree. There isalso a rich literature on exact calculation of marginal densities of stochastic554.2. Asymmetric dissimilarity matrixprocesses taking fixation into account (for example, Song and Steinru¨cken(2012)) and on simulation of these processes (Jenkins and Spano, 2015).These methods can, in principle, be used to estimate bifurcating trees (forexample, using a particle MCMC framework (Wang et al., 2015)), but at asignificant computational cost. While there has been considerable progresson improving the scalability of these methods via advanced Monte Carloand numerical methods (RoyChoudhury et al., 2008; Bryant et al., 2012),these other methods would not easily process the datasets used in this thesis,where the number of populations is large. For example, in a recent paper,Bryant et al. (2012) reduced the time complexity of a multispecies coalescentmodel likelihood calculation to O(LnN2 logN) for L SNPs, where n is thenumber of populations and N is the total number of individuals (typically,N  n). Our method models frequencies of populations directly, whichis beneficial when the number of individuals sampled in each population islarge. The time complexity of our algorithm is O(Ln2 +N).4.2 Asymmetric dissimilarity matrixOur inference method for n populations is a generalization of our methodfor two populations. For n populations, we can estimate the branch lengthsof any two populations to their MRCA. This information is encoded in anasymmetric dissimilarity matrix A, which is a parameter to be estimatedusing data. In this section, we discuss the relationship between asymmetricdissimilarity matrices and bifurcating trees.Suppose that we have SNP frequencies at L independent SNP loci forn populations. We denote the SNP frequency of the ith locus in the jthpopulation by pij (i = 1, 2, . . . , L, and j = 1, 2, . . . , n).4.2.1 DefinitionDefinition 1. A matrix A is said to be an asymmetric dissimilarity matrixrepresentation of a rooted bifurcating tree τ of n populations, if A is ann × n matrix, with ai,i = ∞ (i = 1, 2, . . . , n), and ai,j equal to the sum of564.2. Asymmetric dissimilarity matrixthe branch lengths from population i to the most recent common ancestralpopulation of population i and population j (i, j = 1, 2, . . . , n and i 6= j).It is easy to see that for a given rooted bifurcating tree τ , the asymmetricdissimilarity matrix A is uniquely defined. Figure 4.1 provides an examplefor 3 populations. In the tree τ , branch lengths are defined on the topologyof the tree, and they measure the evolutionary distance from the parentpopulation to the child population. These branch lengths are modelled bythe drift parameter σ2 in our evolutionary model.From the SNP frequency sequences of n populations with L sites, we cancalculate an n× n asymmetric dissimilarity matrix ÂL by settingâii =∞, 1 ≤ i ≤ n,âij = σ̂2ij , 1 ≤ i, j ≤ n, and i 6= j,where σ̂2ij and σ̂2ji are estimates of the branch lengths, respectively, from theith population and the jth population to their MRCA using our inferencemethod proposed in Section 3.3.1.Before claiming that ÂL is a reasonable estimate of A, we need one moreassumption on the additivity of the branch lengths in our model to maintainthe compatibility of the model and the bifurcating tree structure. To illus-trate the compatibility of the model and the tree, we take three populationsas an example. In Figure 4.1, there are three populations P1, P2 and P3,and two internal populations P0 and P2,3. Based on the assumption thatafter the separation of two populations, each branch evolves independently,we model the random genetic drift bypi,1|pi,0 = ci,0 ∼ FixNormal (ci,0, σ21), (4.1a)pi,23|pi,0 = ci,0 ∼ FixNormal (ci,0, σ22,3), (4.1b)pi,2|pi,23 = ci,23 ∼ FixNormal (ci,23, σ22), (4.1c)pi,3|pi,23 = ci,23 ∼ FixNormal (ci,23, σ23). (4.1d)Under this set of assumptions, we can estimate σ22 and σ23 by σ̂223 and574.2. Asymmetric dissimilarity matrixσ̂232, but we cannot estimate σ22,3 using our inference method as we do nothave data for the internal population P2,3.If we further assumepi,2|pi,0 = ci,0 ∼ FixNormal (ci,0, σ22 + σ22,3), (4.2a)pi,3|pi,0 = ci,0 ∼ FixNormal (ci,0, σ23 + σ22,3), (4.2b)we can estimate σ22,3 +σ22 by σ̂221 which is the total branch length from P2 toP0 since P0 is the MRCA of P1 and P2. Similarly, we can estimate σ22,3 +σ23by σ̂231.We emphasize that in our model, just as in the previous work (Baldingand Nichols, 1995; Nicholson et al., 2002; Pickrell and Pritchard, 2012),assumptions of the form of (4.1a-d) do not imply that pi,2 and pi,3 are exactlydistributed as specified in (4.2a-b). We show that the approximation (4.2)is reasonable in practice in the next Section.4.2.2 Additivity property of the normal approximationwith general fixation modelWe compare two scenarios: (1) the process obtained by composing two con-secutive FixNormal distributions, (2) the process obtained with a singleFixNormal with variance parameter given by the sum of the two step vari-ances. Specifically, we simulate data pi,1|pi,0 = ci,0 ∼ FixNormal (ci,0, σ21),and pi,2|pi,1 = ci,1 ∼ FixNormal (ci,1, σ22), and then compare the distributionof p2 = (p1,2, p2,2, · · · , pn,2) with the distribution of p3 = (p1,3, p2,3, · · · , pn,3)where pi,3|pi,0 = ci,0 ∼ FixNormal (ci,0, σ21 +σ22). The comparisons are madein two aspects: the fixation rate and the distribution of non-fixed frequen-cies.For each simulation run, we set c = ci,0 = 0.5 or c = ci,0 = 0.1(i = 1, 2, . . . , n) and n = 10000 duplicates. We run the simulation 100times and report the mean difference of the fixation rates. Table 4.1 showsthe difference of fixation rates under the two simulation scenarios (i.e., thefixation of two consecutive FixNormal distributions (1) minus the fixation584.2. Asymmetric dissimilarity matrixTable 4.1: Mean fixation rate difference of 100 simulation runs comparingtwo consecutive FixNomal distributions versus one combined FixNormal dis-tribution.σ22 = 0.12 σ22 = 0.22 σ22 = 0.42 σ22 = 0.62 σ22 = 0.82 σ22 = 1.02σ21 = 0.12 0 0.00 0.00 0.00 0.00 0.00σ21 = 0.22 0 0.00 0.00 −0.01 −0.01 −0.01c = 0.5 σ21 = 0.42 0 −0.01 −0.01 −0.02 −0.02 −0.02σ21 = 0.62 0 0.00 0.00 0.01 0.01 0.00σ21 = 0.82 0 0.00 0.02 0.03 0.04 0.04σ21 = 1.02 0 0.01 0.03 0.05 0.06 0.07σ21 = 0.12 −0.01 −0.01 0.00 0.00 0.00 0.00σ21 = 0.22 −0.01 0.00 0.02 0.03 0.03 0.03c = 0.1 σ21 = 0.42 0.01 0.03 0.07 0.09 0.10 0.11σ21 = 0.62 0.01 0.03 0.08 0.11 0.13 0.15σ21 = 0.82 0.01 0.03 0.07 0.11 0.14 0.16σ21 = 1.02 0.01 0.02 0.06 0.11 0.14 0.16of one combined FixNormal distribution (2)). Two consecutive FixNormaldistributions usually lead to larger fixation rate than one combined FixNor-mal distribution when the evolutionary distances are large. The fixationrate difference is also larger when p is close to the boundaries, or when σ21and σ22 are large.We use quantile-quantile plots to compare for scenarios (1) and (2) thedistribution of the frequencies which are not fixed. We show the results forc = 0.5 in Figure 4.2 and for c = 0.1 in Figure 4.3. Figure 4.2 shows that thedifference between the two scenarios is negligible in all cases where c = 0.5.Figure 4.3 shows that when c = 0.1, there are differences among the twoscenarios (1) and (2), but these are still within a reasonable range.In summary, we have shown that the difference of frequencies from twoconsecutive FixNormal distributions and one combined FixNormal distribu-tion is relatively small when evolutionary parameters are within the rangeof 0 and 1, which is the range of interest in our real data analysis.594.3. Asymmetric neighbor-joining algorithmllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll0.0 0.2 0.4 0.6 0.8 1.00.00.20.40.60.81.0σ12 = 0.12σ22 = 0.12llllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll0.0 0.2 0.4 0.6 0.8 1.00.00.20.40.60.81.0σ12 = 0.12σ22 = 0.42llllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll0.0 0.2 0.4 0.6 0.8 1.00.00.20.40.60.81.0σ12 = 0.12σ22 = 0.82lllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll0.0 0.2 0.4 0.6 0.8 1.00.00.20.40.60.81.0σ12 = 0.12σ22 = 12llllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll0.0 0.2 0.4 0.6 0.8 1.00.00.20.40.60.81.0σ12 = 0.42σ22 = 0.12lllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll0.0 0.2 0.4 0.6 0.8 1.00.00.20.40.60.81.0σ12 = 0.42σ22 = 0.42lllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll0.0 0.2 0.4 0.6 0.8 1.00.00.20.40.60.81.0σ12 = 0.42σ22 = 0.82lllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll0.0 0.2 0.4 0.6 0.8 1.00.00.20.40.60.81.0σ12 = 0.42σ22 = 12lllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll0.0 0.2 0.4 0.6 0.8 1.00.00.20.40.60.81.0σ12 = 0.82σ22 = 0.12lllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll0.0 0.2 0.4 0.6 0.8 1.00.00.20.40.60.81.0σ12 = 0.82σ22 = 0.42llllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll0.0 0.2 0.4 0.6 0.8 1.00.00.20.40.60.81.0σ12 = 0.82σ22 = 0.82llllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll0.0 0.2 0.4 0.6 0.8 1.00.00.20.40.60.81.0σ12 = 0.82σ22 = 12lllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll0.0 0.2 0.4 0.6 0.8 1.00.00.20.40.60.81.0σ12 = 12σ22 = 0.12lllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll0.0 0.2 0.4 0.6 0.8 1.00.00.20.40.60.81.0σ12 = 12σ22 = 0.42llllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll0.0 0.2 0.4 0.6 0.8 1.00.00.20.40.60.81.0σ12 = 12σ22 = 0.82llllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll0.0 0.2 0.4 0.6 0.8 1.00.00.20.40.60.81.0σ12 = 12σ22 = 12Figure 4.2: Quantile-quantile plots of simulated data based on two con-secutive FixNormal distributions and one combined FixNormal distributionwhen p0 = 0.5.4.3 Asymmetric neighbor-joining algorithmWe propose a new algorithm, called asymmetric neighbor-joining (ANJ),to construct a rooted bifurcating tree based on an asymmetric dissimilarity604.3. Asymmetric neighbor-joining algorithmlllllllllllllllllllllll0.0 0.2 0.4 0.6 0.8 1.00.00.20.40.60.81.0σ12 = 0.12σ22 = 0.12llllllllllllllllllllllllllllllllllllllllllll lll0.0 0.2 0.4 0.6 0.8 1.00.00.20.40.60.81.0σ12 = 0.12σ22 = 0.42lllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll0.0 0.2 0.4 0.6 0.8 1.00.00.20.40.60.81.0σ12 = 0.12σ22 = 0.82lllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll0.0 0.2 0.4 0.6 0.8 1.00.00.20.40.60.81.0σ12 = 0.12σ22 = 12lllllllllllllllllllllllllllllllllllllllllllll0.0 0.2 0.4 0.6 0.8 1.00.00.20.40.60.81.0σ12 = 0.42σ22 = 0.12llllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll0.0 0.2 0.4 0.6 0.8 1.00.00.20.40.60.81.0σ12 = 0.42σ22 = 0.42llllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll0.0 0.2 0.4 0.6 0.8 1.00.00.20.40.60.81.0σ12 = 0.42σ22 = 0.82llllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll0.0 0.2 0.4 0.6 0.8 1.00.00.20.40.60.81.0σ12 = 0.42σ22 = 12llllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll lll0.0 0.2 0.4 0.6 0.8 1.00.00.20.40.60.81.0σ12 = 0.82σ22 = 0.12llllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll0.0 0.2 0.4 0.6 0.8 1.00.00.20.40.60.81.0σ12 = 0.82σ22 = 0.42lllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll0.0 0.2 0.4 0.6 0.8 1.00.00.20.40.60.81.0σ12 = 0.82σ22 = 0.82llllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll0.0 0.2 0.4 0.6 0.8 1.00.00.20.40.60.81.0σ12 = 0.82σ22 = 12llllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll0.0 0.2 0.4 0.6 0.8 1.00.00.20.40.60.81.0σ12 = 12σ22 = 0.12lllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll0.0 0.2 0.4 0.6 0.8 1.00.00.20.40.60.81.0σ12 = 12σ22 = 0.42lllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll0.0 0.2 0.4 0.6 0.8 1.00.00.20.40.60.81.0σ12 = 12σ22 = 0.82lllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll0.0 0.2 0.4 0.6 0.8 1.00.00.20.40.60.81.0σ12 = 12σ22 = 12Figure 4.3: Quantile-quantile plots of simulated data based on two con-secutive FixNormal distributions and one combined FixNormal distributionwhen p0 = 0.1.matrix A induced by a rooted bifurcating tree τ , and its estimate ÂL whenthe asymmetric dissimilarity matrix is an unknown parameter in application.We first show that if an asymmetric dissimilarity matrix A is induced bya rooted bifurcating tree τ , then the ANJ algorithm can recover the rooted614.3. Asymmetric neighbor-joining algorithmbifurcating tree τ exactly based on A. We then show that in application,when the phylogenetic history is assumed to be a bifurcation tree and theasymmetric dissimilarity matrix is an unknown parameter, we can recon-struct a rooted bifurcating tree τ̂ based on an estimate of the asymmetricdissimilarity matrix  using the ANJ. If the true tree is τ and  is a consis-tent estimator of A which is induced by τ , then τ̂ is a consistent estimatorof τ .ANJ can be applied to the matrix derived using our methods and also toany asymmetric dissimilarity matrix derived by other methods. ANJ sharesmany of NJ’s desirable properties.ANJ proceeds in three steps: search step, estimation step and updatestep. At the search step, we propose a new cherry-picking algorithm toidentify a pair of populations which forms a cherry in the tree, i.e., twodistinct leaves in a tree which are adjacent to a common vertex (Semple andSteel, 2003). At the estimation step, we estimate the branch lengths fromtwo leaves of the cherry to the root of the cherry. At the update step, wecombine the two populations in the cherry into one new internal population,and update the distances related to this internal population. Proofs can befound in Appendix A.4.3.1 Identifying a cherryFirst, we show that a cherry, which is a pair of adjacent external verticeson the tree τ , can be identified by comparing elements in an asymmetricdissimilarity matrix A induced by the tree τ in Proposition 1.Definition 2. The two matrix row minimization condition (2-MRMC) issatisfied for the i-th and j-th row of a matrix A if aij is the smallest entryin the ith row of A and aji is the smallest entry in the jth row of A.Proposition 1. For any square matrix A that is induced by a rooted bifur-cating tree τ , the 2-MRMC is satisfied for (Pi, Pj) if and only if Pi and Pjform a cherry.Our cherry-picking algorithm is summarized in Algorithm 1. Its proper-ties are summarized in Theorem 1 and Proposition 2.624.3. Asymmetric neighbor-joining algorithmAlgorithm 1 Cherry-picking algorithmq ← 1(ni, nj)← index of the smallest element in Awhile (q < n) doif anj ,ni = min Anj ,· (the smallest element in the nj-th row of Athen)return (ni, nj), breakelseni ← njnj ← index of the smallest element in the ni-th row of Aend ifq ← q + 1end whileTheorem 1. For a given bifurcating phylogenetic tree τ with asymmetricdissimilarity matrix A, the cherry-picking algorithm will always return acherry, and the algorithm terminates before the iteration variable q equalsn.Proposition 2. If ÂL is a consistent estimator of an asymmetric dissim-ilarity matrix A, then the probability of identifying a cherry of the tree τusing our cherry-picking algorithm on ÂL converges to 1 as the number ofloci L goes to infinity.By consistency, we mean in Proposition 2 that limL→∞P (|ÂL −A| ≥ ) = 0for any  > 0, where the norm of a matrix is defined as the maximum of allits elements.Proposition 2 and Theorem 1 show that a cherry can be found using ourcherry-picking algorithm by searching for a pair (Pi, Pj) which satisfies the2-MRMC condition. However, when we apply the algorithm on a matrix ÂLobtained from finite data, it is possible that there is no pair (Pi, Pj) satisfyingthe 2-MRMC condition. If no pair satisfies the 2-MRMC condition, thealgorithm stops when q = n, and selects the two populations on hold as acherry. A consequence of the proof of Proposition 2 is that the probabilityof this happening goes to zero as L→∞.634.3. Asymmetric neighbor-joining algorithm4.3.2 Estimation step and update stepIt is straight-forward to see that given a rooted bifurcating tree τ , the asym-metric dissimilarity matrix A exists and it is uniquely defined. In this sec-tion, we first show that on the other hand, given any square matrix A, if Ais induced by a phylogenetic tree τ , then we can reconstruct the underlyingphylogenetic tree τ uniquely using estimation and update steps proposed inthis section together with the search step for a cherry in the previous section.Then we show that in reality when A is unknown, we can construct an esti-mated tree τ̂ for the phylogenetic tree τ based on an estimated dissimilaritymatrix Â.Let P = {P1, P2, . . . , Pn} denote the set of all external vertices, whichrepresents populations of interest. Assume that (Pi, Pj) is the selectedcherry, and denote the internal node linking Pi and Pj as Pi,j . In our ANJalgorithm, we estimate the distance from Pi and Pj to Pi,j respectively byd̂i,ij = ai,j and d̂j,ij = aj,i. (4.3)We update the distance from any other population Pm (m 6= i, j) to thenew internal node Pi,j byam,ij = (am,i + am,j)/2, (4.4)and update the distance from Pi,j to Pm byaij,m = (ai,m + aj,m − d̂i,ij − d̂j,ij)/2. (4.5)We remove the i-th and j-th columns and the i-th and j-th rows of A,add a new column and a new row for Pi,j , using am,ij and aij,m, and setaij,ij = ∞ to form a reduced matrix r(A). If ÂL is used instead of A,replace all a· with â· in the above equations to form a reduced estimatedmatrix r(ÂL).We define the reduced populations r(P) = P−{Pi, Pj}+{Pi,j}, and thereduced true tree r(τ) as a rooted bifurcating tree with leaf populations r(P),644.3. Asymmetric neighbor-joining algorithmi.e., a subtree of τ without descendants of Pi,j . We denote the asymmetricdissimilarity matrix of tree r(τ) as Ar(τ).Lemma 1 and Lemma 2 establish the relationships between Ar(τ), r(A)and r(ÂL).Lemma 1. If a matrix A is an asymmetric dissimilarity matrix represen-tation of a rooted bifurcating tree τ , then after a cherry of τ is removed, thereduced matrix r(A) is an asymmetric dissimilarity matrix representationof the reduced tree r(τ), i.e., r(A) = Ar(τ), when the leaf populations r(P)are arranged in the same order.Lemma 2. If ÂL is a consistent estimator of an asymmetric dissimilaritymatrix A, then the reduced estimated matrix r(ÂL) is a consistent estimatorof r(A) (i.e., Ar(τ)), when the leaf populations r(P) are arranged in the sameorder.We repeat the search, estimation and update steps until we have onlyone node left, which is the MRCA for all populations being considered.Alternatively, estimation and update steps analogous to NJ and un-weighted neighbor-joining (UNJ) (Gascuel et al., 1997) can also be used.However, we found these alternatives not to work as well. This can be ex-plained by the fact that ANJ uses only short branch lengths to constructthe tree and short branch length estimates are more accurate than longbranch length estimates using our method (see Section 4.4.1). If more accu-rate branch length estimates can be obtained using other methods for longbranches, the asymmetric versions of NJ and UNJ may perform better thanANJ by utilizing more branch length estimates to reconstruct the tree.Algorithm 2 summarizes our algorithm for constructing a rooted bi-furcating tree from an asymmetric dissimilarity matrix A. A bifurcatingtree can be fully recovered from the recorded pairs (Pi, Pj), and the branchlengths (d̂i,ij , d̂j,ij). The properties of ANJ are summarized in Theorem 2and Proposition 3.Theorem 2. If a matrix A is an asymmetric dissimilarity matrix represen-tation of a rooted bifurcating tree τ , then asymmetric neighbor-joining canrecover τ correctly from A, in terms of both topology and branch lengths.654.3. Asymmetric neighbor-joining algorithmAlgorithm 2 Constructing a tree from an asymmetric dissimilarity matrixindex = {1, 2, · · · , n}for k = 1 to n− 1 doidentify a cherry (Pi, Pj) using the Cherry-Picking Algorithm (Algo-rithm 1)estimate d̂i,j and d̂j,irecord (Pi, Pj) and (d̂i,j , d̂i,j)combine (Pi, Pj) into a new node Pn+kremove Pi, Pj and associated rows and columns from A (or ÂL)index ← index − {i, j}for l in index doupdate the distance from Pl and Pn+k to their MRCAend forAdd a new row and a new column in A for Pn+k using updated dis-tancesindex ← index + {n+ k}end forProposition 3. If ÂL is a consistent estimator of an asymmetric dissim-ilarity matrix A, then the probability of recovering the topology of the truetree τ using asymmetric neighbor-joining on ÂL converges to 1 when thenumber of loci L → ∞. Moreover, the branch length estimates are alsoconsistent.It is worth mentioning that if our inference method is used to estimateÂL, then ÂL is not a consistent estimator of A, as we do not know the truemarginal distributions of SNP frequencies in any of the internal populations.However, ÂL can still serve as a useful estimate of A since the asymptoticbiases of the estimates are roughly proportional to the magnitude of thethe true branch lengths in the model, as shown in the next section, whichmeans the branch lengths are scaled but still maintain their general relativelengths.664.4. Simulation Studies4.4 Simulation StudiesWe conduct simulation studies to investigate the performance of our infer-ence method, both in the well-specified and misspecified cases.For two populations, we showed in Section 3.4 that our method canestimate the evolutionary distances from each population to their MRCApopulation reasonably well when the distribution of SNP frequencies in theancestral population used in our likelihood model is the distribution used tosimulate data. When the two distributions are different, the estimates of thetwo evolutionary distances are not guaranteed to be unbiased, but still reflectthe relative lengths of the distances. Note that for two populations, a rootedbifurcating tree is uniquely constructed by the two estimated evolutionarydistances.In this section, we show that for more than two populations, our methodidentifies the root of the phylogenetic tree correctly in all simulation runs.4.4.1 Simulation for more than two populationsWe investigate the performance of our methods for datasets containing morethan two populations. In order to obtain realistic simulation scenarios, weuse trees estimated from real data as references. The two reference trees usedto simulate population SNP frequencies are based on an estimated subtree of8 African populations, and an estimated subtree of 7 American and Oceanicpopulations (with an extra outgroup, namely the Hazara population). Thereference trees come from our data analysis results using the HGDP dataof 53 human populations, which provide realistic setups for branch lengths.Results on the full tree are presented in the next section.We first generate p0, the SNP frequencies at the root of the subtree,using a mixed distribution with a continuous part from uniform(0, 1) anddiscrete parts on 0 and 1. Then we generate SNP frequencies of the internalnodes and leaf populations based on the true tree from the root to the leavesusing our normal approximation with general fixation model. We keep onlythe SNP frequencies of the leaf populations for inference.674.4. Simulation StudiesWe set m0 = m1 = 0.1, L = 5000 in the simulation for African popula-tions, and L = 2000 for the American and Oceanic populations.We construct a rooted bifurcating tree based on the asymmetric dissim-ilarity matrix for the leaf populations using our inference method and ourANJ algorithm. For comparison, we also construct an unrooted bifurcat-ing tree based on a symmetric dissimilarity matrix for the leaf populationsusing Nei’s symmetric dissimilarity measure (Nei, 1972) and NJ, a classi-cal method that is still widely used because of its efficiency and accuracy.We further root those unrooted trees from NJ results with an outgroup(Mozabite for African populations and Hazara for American and Oceanicpopulations) by setting the root at the midpoint from the outgroup to theinternal vertex that is the MRCA of all other populations for illustrationand comparison purposes. We use empirical Bayes estimates m̂f (3.10) foreach pair of populations in our likelihood model.We compute consensus trees (Felsenstein, 2004; Paradis et al., 2004; Par-adis, 2011) from 200 simulation runs. The true tree, the consensus rootedtree of all 200 estimated rooted trees using our method and the consensusunrooted tree of all 200 estimated unrooted trees using NJ are shown inFigure 4.4 (for the African populations) and Figure 4.5 (for the Americanand Oceanic populations).Note that in all comparisons in this section, we set a higher standard forresults using our methods than results using traditional methods. When wecompare estimated trees with the true tree, we use estimated rooted treesusing ANJ directly, and use estimated unrooted trees using NJ with extrainformation on rooting, i.e., a correct outgroup specified as the root. Inother words, ANJ estimates one more node than NJ in all simulation runs,but on the other hand, we set the correct location of this node for treesestimated by NJ so that results using both methods can be compared. Wescale the total branch lengths of the estimated trees so that they have thesame total branch length as the true tree. Results are reported for bothunscaled and scaled trees.Note that our method correctly identifies the location of the root of thephylogenetic tree in all simulation runs. In Figure 4.4, the branch lengths684.4. Simulation StudiesMozabiteMandenkaYorubaBiakaPygmyMbutiPygmySanBantuKenyaBantuSouthAfrica1.True tree0.090.0020.0210.0280.0980.0380.0880.4660.0390.0070.0580.0970.0320.037MandenkaYorubaBantuSouthAfricaBantuKenyaBiakaPygmyMbutiPygmySanMozabite2.Cons. rooted (clade recovery rate)110.711110.735MandenkaMozabiteBantuKenyaBantuSouthAfricaMbutiPygmySanBiakaPygmyYoruba3.Cons. unrooted (clade recovery rate)10.5951110.995BantuKenyaBantuSouthAfricaBiakaPygmyMbutiPygmySanYorubaMandenkaMozabite4.Cons. rooted (branch length)0.1040.0030.0240.0360.0930.0480.0870.4690.0380.0040.0580.0970.0320.036BantuKenyaBantuSouthAfricaBiakaPygmyMbutiPygmySanYorubaMandenkaMozabite5.Cons. rooted (scaled branch length)0.1010.0030.0230.0350.0910.0470.0840.4580.0370.0040.0570.0950.0310.035BantuKenyaBantuSouthAfricaBiakaPygmyMbutiPygmySanYorubaMandenkaMozabite6.Cons. unrooted (scaled branch length)0.120.0040.0240.0310.120.0380.0980.3540.0470.0090.0670.1080.0390.044Figure 4.4: Simulation results for the African populations. Panel 1 showsthe true tree and its branch lengths. Panel 2 and 3 show the consensusrooted tree using our method and the consensus unrooted tree using NJrespectively, with the numeric annotations being successful recovery rates ofclades (subtrees with exact topology as in the true tree) in 200 simulationruns. Panel 4 and 5 show the consensus rooted trees with unscaled andscaled estimated branch lengths, and Panel 6, the consensus unrooted treewith scaled branch lengths.in the true tree of the African populations (panel 1) are short, with theshortest edge length only 0.002. Our method achieves a high rate of clade(i.e., subtree) recovery on this data (panel 2 shows the results obtained us-ing our method, while panel 3 shows the results obtained using NJ, whichuses additional information). The consensus branch lengths estimated byour method (panel 4 and 5) achieve a high level of accuracy, even with-out rescaling (panel 4). This contrasts with the NJ method, which is lessaccurate in terms of branch length estimates, even after rescaling (panel 6).694.4. Simulation StudiesPapuanMelanesianPimaColombianKaritianaSuruiMayaHazara1.True tree0.020.0380.0370.2380.0560.0810.050.1350.4210.1330.120.2770.1590.165ColombianKaritianaSuruiPimaMayaMelanesianPapuanHazara2.Cons. rooted (branch length)0.020.0740.0330.3070.0710.1170.0720.1320.4230.1260.110.3350.1650.16ColombianKaritianaSuruiPimaMayaMelanesianPapuanHazara3.Cons. rooted (scaled branch length)0.0180.0660.0290.2760.0640.1060.0650.1190.3810.1130.0990.3010.1480.144ColombianKaritianaSuruiPimaMayaMelanesianPapuanHazara4.Cons. unrooted (scaled branch length)0.0290.0670.0670.2820.0580.0790.0430.1230.2880.1260.1240.3120.1680.164Figure 4.5: Simulation results for the American and Oceanic populations.Panel 1 shows the true tree and its branch lengths. Panel 2 and 3 show theconsensus rooted tree with unscaled and scaled estimated branch lengths.Panel 4 shows the consensus unrooted tree with scaled estimated branchlengths.Table 4.2: Mean distances between estimated trees and true trees calculatedfrom 200 simulations.African-ANJ African-NJ American-ANJ American-NJCorrect rate 0.555 0.590 1 1PH 1.110 0.820 0 0KF (unscaled) 0.016 0.045 0.109 0.331KF (scaled) 0.015 0.013 0.065 0.062geodesic (unscaled) 0.030 0.459 0.123 0.584geodesic (scaled) 0.027 0.092 0.098 0.157In Figure 4.5, our method recovers the true tree topology in each of the200 simulation runs for the American and Oceanic populations. We do notshow panel 2 and 3 of Figure 4.4 in Figure 4.5 as the recovery rates are1 for all clades. The panels in Figure 4.5 show similar results to those inFigure 4.4. Panel 1, 2, and 3 show that branch lengths which are shorter orcloser to leaf populations are estimated more accurately than branch lengthswhich are longer and closer to the root. This observation is consistent withour results in previous sections that our estimates are more accurate forshorter branch lengths and branches which are close to leaf populations.However, our method still works well for longer branch lengths in terms ofestimating the relative magnitude.Table 4.2 summarizes the mean distances between the estimated trees704.4. Simulation Studiesand the true trees under several different distance measures for phyloge-netic trees: the proportion of trees which have the same topology as thetrue tree, the Penny-Hendy (PH) distance (Penny and Hendy, 1985), theKuhner-Felsenstein (KF) distance (Kuhner and Felsenstein, 1994), and thegeodesic distances (Billera et al., 2001; Chakerian and Holmes, 2012). ThePH distance equals two times the number of clades being mistakenly speci-fied. The KF distance measures the number of misspecified clades as well asdifferences of internal branch lengths. The geodesic distance measures thenumber of misspecified clades and differences of all branch lengths.From Table 4.2, we can see that for the African populations the proba-bility of recovering the full true tree topology using ANJ is similar to thatusing NJ. ANJ performs better in terms of KF for unscaled trees and interms of geodesic distances for both scaled and unscaled trees but performsworse than NJ in terms of PH. For the American and Oceanic populations,both methods recover the true tree topology all the time, but our methodperforms better in all distance measures considered except for KF for scaledtrees.More importantly, our method identifies the root of the tree correctlyin all cases, an inference which cannot be obtained using symmetric dis-similarity measures, unless extra information in the form of an outgroup isavailable.4.4.2 Computational costFor a pair of populations, the main computational cost of our method isthe calculation of the likelihood for various parameters (σ21, σ22), which isrequired in the optimization of the likelihood. Calculation of the likelihoodfor a fixed (σ21, σ22) involves a univariate numerical integration for each SNPlocation, which is vectorized and numerically solved using Gaussian quadra-tures with 40 points in our simulation and data analysis. The numericaloptimization cost can be reduced by choosing a reasonable starting point.It can also be further improved by learning those parameters using a smallportion of SNPs and then using those estimates as starting points with more714.5. Data analysis: Human Genome Diversity PanelSNPs.In our simulation studies, calculating branch lengths for a pair of popu-lations takes around 0.14 seconds (L = 100), 1.40 seconds (L = 1000) and14.7 seconds (L = 10000) for σ21 = 0.1 and σ22 = 0.2 and uniform(0, 1). Theresults vary for different parameters but remain similar. All analyses arerun on a 2.8 GHz CPU Mac Mini (Intel dual core i5) using R version 3.1.0.For more than two populations, we estimate the asymmetric dissimilaritymatrix using our method applied to each pair of populations. The compu-tational cost is of order(n2)times the cost for one pair. Overall, this gives arunning time of O(Ln2) for the computation of the asymmetric dissimilaritymatrix, and it can be also easily paralleled. In our simulation studies, cal-culating the asymmetric dissimilarity matrix for eight African populationstook 37.5 seconds per run on average (L = 2000), and 41.3 seconds per runon average (L = 2000) for the seven American and Oceanian populations.The computational cost for the construction of the trees from an asymmetricdissimilarity matrix using our algorithm is negligible compared to the costof forming the asymmetric dissimilarity matrix.4.5 Data analysis: Human Genome DiversityPanelWe apply our methods on a subset of the data from the Human GenomeDiversity Panel (HGDP). The HGDP contains 650,000 common SNP locidata for 938 unrelated individuals from 53 populations from Africa, Europe,the Middle East, Asia and the Americas. The number of individuals in onepopulation varies from 5 to 46.Li et al. (2008) obtained an unrooted phylogenetic tree for 51 popula-tions (with Han and Han North China combined as Han, and Bantu SouthAfrica and Bantu Kenya combined as Bantu) using contml, which is imple-mented based on the pruning algorithm proposed by Felsenstein (1973). Bygenotyping two chimpanzee samples, Li et al. (2008) defined the putativeancestral allele for most of the SNPs, and used the chimpanzee data to lo-724.5. Data analysis: Human Genome Diversity Panelcate the root of the tree, i.e., the MRCA of all human populations, at thenode linking San and all other populations. Pickrell and Pritchard (2012)analyzed the same data sets to investigate population admixture, meaningthe gene exchange of two or more previously isolated populations throughinterbreeding. Pickrell and Pritchard (2012) obtained a similar phylogenetictree as Li et al. (2008), but not all populations within the same geographi-cal regions are clearly grouped together in their results, even after adjustingfor population admixtures. Pickrell and Pritchard (2012) also removed twoOceanian populations (Melanesian and Papuan) from the analysis becauseaccording to the authors, including these two populations will lead to con-fusing results. Our method does not suffer from this issue and reconstructsa high-quality phylogenetic tree from the full dataset.4.5.1 Human population treeIt is well studied that SNPs can be correlated, especially those that are closeto each other (Turnbull et al., 2010). To reduce correlation among SNPs, wefocus on a subset of the original data set containing 131,329 SNPs that areless correlated and then subsample the SNPs to one for every ten loci. As aresult, our data contains frequencies of 13,133 SNPs in 53 populations, i.e.,the data matrix X = (xij) is a 53×13, 133 matrix with each row representingone population and each column representing one SNP locus. The resultsobtained are consistent when other subsets of SNPs are used.Since the HGDP records frequencies of the minor SNP, we symmetrizethe SNP frequencies so that the prior over the ancestral frequencies can beassumed to be symmetric:x∗ij ={xij for i = 1, 2, · · · , 53, and j = 1, 3, · · · , 13 133.1− xij for i = 1, 2, · · · , 53, and j = 2, 4, · · · , 13 132.We calculate the asymmetric dissimilarity matrix for the 53 populationsusing the data X∗ = (x∗ij). We then construct a rooted bifurcating phyloge-netic tree using asymmetric neighbor-joining on the asymmetric dissimilar-ity matrix proposed in this thesis (see Figure 4.6). Our results are obtained734.5. Data analysis: Human Genome Diversity Panelusing less SNPs than Li et al. (2008) and without chimpanzee data.From Figure 4.6, we can see that populations from the same continentsare grouped together which is consistent with our knowledge of human mi-grations. The pattern of human populations moving out of African can beobserved clearly from Figure 4.6A and is consistent with the prevalent con-sensus (Cavalli-Sforza and Feldman, 2003). American and Asian populationsare farthest from the root while African and Middle Eastern populations areclosest to the root.Our results are more accurate than the results of Li et al. (2008) andPickrell and Pritchard (2012), in terms of estimating phylogenetic trees forclosely-related populations. For example, from historical records, the Tuscanand Italian populations are closely related, and this relationship is correctlyidentified using our method (see Figure 4.6B) but not in Li et al. (2008) (seeFigure 4.6C). Similarly, we can say the same for the Adygei and Russianpopulations. Our result clearly identifies closely related populations withinthe same regions into subgroups (see colours in Figure 4.6A), while thisis not the case in Pickrell and Pritchard (2012) as they failed to group allMiddle Eastern populations together and failed to incorporate two Oceanianpopulations.More importantly, our method identifies the root of the human popula-tion tree without using extra data from chimpanzees or Neandertals as in Liet al. (2008) and Pickrell and Pritchard (2012). In our results, the root islocated between a clade containing current middle and southern African pop-ulations, and a clade containing current northern African and non-Africanpopulations (see F in Figure 4.6A). However, results in this region of thetree should be treated with caution as both models ignore effects such as re-cent admixtures which have a large effect on the SNP frequencies in Africanpopulations (Pickrell et al., 2012).It takes 7.88 hours to calculate the asymmetric dissimilarity matrix for53 populations (without parallel computation) and less than 1 second toreconstruct the rooted tree based on the dissimilarity matrix.744.5. Data analysis: Human Genome Diversity PanelAdygeiRussianSardinianBasqueFrenchOrcadianTuscanItalianBACAdygeiTuscanSardinianItalianBasqueFrenchOrcadianRussianMandenkaYorubaBiakaPygmyMbutiPygmySanBantuKenyaBantuSouthAfricaMozabiteBedouinMakraniSindhiMayaPimaColombianKaritianaSuruiPapuanMelanesianYakutCambodianXiboMongolaDaurOroqenHezhenTuLahuHanDaiSheTujiaMiaoYiNaxiHanNChinaJapaneseHazaraUygurKalashBurushoPathanBrahuiBalochiPalestinianDruzeSardinianBasqueItalianTuscanOrcadianFrenchRussianAdygeiFigure 4.6: A. Rooted phylogenetic tree for 53 human populations in HGDPusing our method. (F) indicates the root of our tree. Populations arecoloured according to regions as specified in Li et al. (2008). B. Zoomeddendrogram for European populations using our method. C. Zoomed den-drogram for European populations reproduced using contml (Felsenstein,1989), which is the same as the subtree as in Li et al. (2008) for comparison.754.5. Data analysis: Human Genome Diversity Panel4.5.2 Assessing uncertainty of the estimated treeAssessing uncertainty of the estimated tree is not a simple task. The un-certainty of the estimated tree comes from not only the topology of theestimated tree, but also the associated branch lengths.The bootstrap method has been widely used to assess the uncertainty ofthe estimated tree (Felsenstein, 1983; Zharkikh and Li, 1995). Summarizingbootstrap estimates in a meaningful way is not straight-forward if more thanone tree topology are present in the bootstrap estimated trees. A similarproblem arises in the Bayesian phylogenetic framework when summarizingsample trees from the posterior distritbution (Yang and Rannala, 1997).The consensus tree is a popular tool to summarize a set of phyloge-netic trees in practice. The consensus tree usually consists only the topol-ogy (Zharkikh and Li, 1995; Gray and Atkinson, 2003; Smeulders et al.,2011; Song and Steinru¨cken, 2012). One question is how the average branchlengths should be calculated for trees of different topology since the branchlengths are conditional on the phylogenetic tree and are meaningful only forspecific trees (Yang et al., 1995). Some software calculate the mean branchlengths of the consensus tree by averaging branch length estimates fromonly trees that have the same topology as the consensus tree (Sukumaranand Holder, 2010; Revell, 2012). This approach is debatable as it ignoresall other estimated trees of different topology (Felsenstein, 2004). For thesame reason, a sound definition of variances of branch length estimates isonly available for the mean estimated tree if all estimated trees share thesame topology.On the other hand, Billera et al. (2001) investigated the geometry ofthe space of phylogenetic trees and proposed the geodesic distance, whichenables averaging phylogenetic trees and the construction of a convex hullof a family of trees as a measure of variability of the estimated tree. Effi-cient algorithms have been developed to calculate the Fre´chet mean tree andthe Fre´chet variance of the estimated tree (Owen and Provan, 2011; Benneret al., 2014) based on the geodesic distance. The Fre´chet mean tree is de-fined as the tree that achieves the minimum mean square geodesic distance764.5. Data analysis: Human Genome Diversity Panelto all trees considered, and the Fre´chet variance is the mean square geodesicdistance between the Fre´chet mean tree to all trees considered. The Fre´chetvariance can be interpreted as the uncertainty measure for the estimatedtree as a whole. However this approach does not straightforwardly provideuncertainty measures for individual nodes or individual branch length esti-mates. As a consequence it could be argued that the Fre´chet variance is noteasy to interpret.We illustrate how to use the bootstrap method to assess the uncertaintyof the estimated phylogenetic tree based on our methods. For each bootstraprun, we sample SNPs with replacement and then estimate the phylogenetictree using the bootstrap sample. We then construct the consensus tree ofall 5000 bootstrap trees with uncertainty measure on each node, and theFre´chet mean tree of all bootstrap trees with the Fre´chet variance. Thegeodesic mean tree and its variance are calculated using TrAP (Benner et al.,2014). To provide an example where the uncertainty is relatively large, weshow a reconstruction using only 1314 SNPs (i.e., every hundredth of the131,329 SNPs) from six sample populations across the world.MayaHanBurushoFrenchBedouinBiakaPygmy1. Est. tree with full data0.1710.0370.0330.0520.1110.120.0490.050.0160.148 BiakaPygmyMayaHanBurushoFrenchBedouin2. Cons. tree with node uncertainty110.99960.79660.9992BedouinBurushoHanMayaFrenchBiakaPygmy3. Cons. tree with est. branch length0.1710.0420.0350.0520.1210.1110.0470.0490.0170.148(0.014)(0.012)(0.005)(0.008)(0.011)(0.01)(0.004)(0.004)(0.002)(0.011)MayaHanBurushoFrenchBedouinBiakaPygmy4. Fréchet mean tree with est. branch length0.1710.0420.0250.0580.1110.1210.0480.0470.0170.148Figure 4.7: 1. Estimated tree based on 1314 SNPs. 2. Consensus tree withestimated node uncertainty from 5000 bootstrap samples. 3. Consensus treewith estimated branch lengths and their standard deviations in brackets. 4.Fre´chet mean tree with estimated branch length. The Fre´chet variance ofthis Fre´chet mean tree is 0.0013.Figure 4.7 shows that the root of the estimated tree is placed at the nodeseparating African populations with all other populations with probability774.6. Discussion1 in this analysis. The consensus tree (Figure 4.7-3) and the Fre´chet meantree (Figure 4.7-4) share the same topology and almost identical branchlength estimates. Uncertainty of the estimated tree topology is observedon the node of the consensus tree linking Han, Papuan and Maya (0.7968in Figure 4.7-2), meaning that this node does not appear in approximately20% of the bootstrap estimates trees. Uncertainty of the estimated branchlengths are summarized by the standard deviations of the branch lengthestimates in the consensus tree (Figure 4.7-3).In conclusion, reasonably accurate human population trees can be iden-tified using a relatively small number of SNPs based on our method. Theuncertainty of this root placement is extremely low even when the numberof SNPs used in the analysis is small. This result shows that the informationon the root of human population tree is stored in the gene of human pop-ulations, and can be recovered using non-reversible models on evolutionaryprocess of SNP frequencies of human populations, which does not dependon the choice of outgroups.4.6 DiscussionWe have presented a simple and effective method for reconstructing rootedtrees from multi-population SNP frequencies. By modelling fixation in aprincipled way, and generalizing neighbor-joining methods, we can performjoint tree and rooting inference without requiring an outgroup. We haveshown in simulations and real worldwide genome-wide data that this rootingcan be more accurate despite using less data. The method is also compu-tationally efficient with time complexity O(Ln2), where L is the number ofsites and n is the number of populations.We have also considered other tree building estimation and update rulesanalogous to NJ and UNJ. For example, consider an NJ-like algorithm forasymmetric dissimilarity matrices (referred as NJ-A), where we estimate the784.6. Discussionbranch length from Pi to Pi,j byd̂i,ij =∑l 6=iai,l/(n− 1)−∑l 6=jaj,l/(n− 1) /2 + (ai,j + ai,j)/2. (4.6)The estimation rule (4.6) utilizes all branch length estimates from populationPi to the MRCA of Pi and all other populations, while the ANJ estimationrule utilizes only the branch length estimates from Pi to the MRCA of Piand Pj .In many evolutionary models, the estimate of branch length is less accu-rate when the branch length is longer. Although averaging more estimatesusually leads to less variance in the estimates, this may not be the casehere, because longer branch length estimates tend to be associated withlarger variance. As a result, d̂i,ij in (4.6) may also be less reliable than thesimple estimating version used in the ANJ as in (4.3) of this thesis.In our simulation studies and data analysis, ANJ works better than theabove update rule. The estimation rule (4.6) has severe problems such asnegative branch length estimates, due to the large variability of long branchestimates included.79Chapter 5Geometric Poisson indelprocess5.1 IntroductionIn Chapter 5 and Chapter 6, we shift the focus of this thesis from the fre-quency data of different populations to string-type data of different species.In this Chapter, we propose a new stochastic model for the evolutionaryprocess of sequences which are strings of characters with different lengths.This model can be used to reconstruct phylogenetic tree of species, for whichsequences vary not only in characters but also in length due to insertions,deletions and substitutions. The nucleotide or protein sequences of a speciesare usually aggregated from sequences of several individuals of this speciesto represent common features of the species, by keeping only nucleotidesor proteins that are available in all sequences of the individuals from thisspecies collected. We focus on modelling the evolutionary rate variation ofnucleotide sequences at the level of species in this Chapter.It is well known that different regions of nucleotide sequences evolve atdifferent rates, both in terms of substitutions (Fitch and Margoliash, 1967;Nachman and Crowell, 2000), and in terms of insertions-deletions (indels)(Mouchiroud et al., 1991; Wong et al., 2004; Lunter et al., 2006; Mills et al.,2006; Chen et al., 2009; Kvikstad and Duret, 2014). In phylogenetic analysesbased on substitutions, rate variation is viewed as an important phenomenonto include when building evolutionary models; consequently, virtually allmodern phylogenetic methods explicitly model substitution rate variationacross sites (Yang, 1997; Huelsenbeck and Ronquist, 2001; Ronquist and805.1. IntroductionHuelsenbeck, 2003; Suchard and Redelings, 2006; Yang, 2007; Guindon et al.,2010; Stamatakis, 2014).There is substantial previous work analyzing patterns of indel rate varia-tion, but these analyses are typically done from trees and alignments inferredusing standard models which ignore rate variation. This body of previ-ous work has not only demonstrated that indel rate variation is widespread(Chen et al., 2009; Kvikstad and Duret, 2014), but also identified correlates(and in some cases, mechanisms) behind indel rate heterogeneity, includ-ing sequence context (Tanay and Siggia, 2008), substitution rate (Anandaet al., 2011; Jovelin and Cutter, 2013), selection (Carvalho and Clark, 1999;Kvikstad and Duret, 2014), recombination (Nam and Ellegren, 2012; Leushkinand Bazykin, 2013) and short tandem repeats (Ellegren, 2004).There are now several approaches to phylogenetic tree inference that takeindels into account (Thorne et al., 1991, 1992; Westesson et al., 2012), andsome of them include substitution rate heterogeneity (Klosterman et al.,2006; Suchard and Redelings, 2006; Redelings and Suchard, 2007). How-ever, these approaches generally do not incorporate indel rate heterogene-ity as part of the model specification. In the multiple sequence alignmentliterature, some methods do consider indel rate variation, but those meth-ods typically assume a fixed guide tree and are not based on a continuousstochastic process (Lo¨ytynoja and Goldman, 2008), or are limited to fixedtrees with a small number of leaves (Satija et al., 2009).A.coerulea     ...ACGAGAAGACCCUU-AGAAUUUUUAAAAG-C-AAUAAGUA----------------------...A.turrita      ...ACGAGAAGACCCUU-AGAAUUUUGAAAA--U-GUAAAAGG---------UAUU---------...C.nemoralis    ...ACGAGAAGACCCUA-GAAGCUUG------------------UUAUUUGU-------------...Cac.lacertina  ...ACGAGAAGACCCUGUCGAGCUUU-AGAAG--AAAGUAG-----------GCUAAAAUAUAUA...indel bothsubstitutionstable region fast evolving regionFigure 5.1: A segment of the multiple sequence alignment (MSA) of fourmolluscan species from the molluscan ribosomal RNA data set.In this Chapter, we present a simple indel rate heterogeneity model suit-able for phylogenetic tree inference. Figure 5.1 shows an illustration of indelrate heterogeneity on real data which motivates our work. As with substitu-815.1. Introductiontion rate heterogeneity models, we approximate the distribution over ratesusing a discrete mixture. Given a discrete indel rate mixture, our model isobtained as the marginal distribution of a reversible stochastic process de-fined on a phylogenetic tree. This continuous-time Markov process is calledthe geometric Poisson indel process (GeoPIP), which we introduce in thisthesis.As its name suggests, the main building block of the GeoPIP is thePoisson indel process (PIP) (Bouchard-Coˆte´ and Jordan, 2013), and GeoPIPinherits the attractive computational properties of the PIP. This means inparticular that given a tree, computing the probability of an alignment (i.e.,marginalizing over internal sequences) can be done in time polynomial inboth the number of the sequences and the lengths of the sequences. Thisproperty forms the basis of an efficient algorithm which determines in anunsupervised fashion the indel rates, while inferring the tree and partitioningthe sequences into segments taking on different indel rates.The statistical and computational properties of the GeoPIP differenti-ate it from the model used in the alignment method of Lunter (2007). Inthis previous work, Lunter (2007) developed a sequence aligner based on astring transducer. This transducer is equipped with groups of latent statesencoding different indel rates. While Lunter’s model is effective for pairwisealignment, there are two important challenges in applying this model to phy-logenetic tree inference. First, since it is not the marginal distribution of astochastic process, the model provides no natural way of exploiting indelsas phylogenetically informative events. Second, marginalizing the sequenceson the internal nodes of a tree using Lunter’s transducer model leads to aworst-case running time exponential in the number of taxa (this can be de-rived using the results in Hirschberg (1975)). Consequently, Lunter’s modelhas not been used for phylogenetic tree inference. Incidentally, we showthat even if one only cares about identifying the rate segmentation (witha fixed guide tree), using more sequences jointly improves inference accu-racy. Again, one would have to resort to approximations to do so with atransducer-based approach (Holmes and Bruno, 2001; Holmes, 2003; Miklo´set al., 2004; Jensen and Hein, 2005), while we can do this exactly in time825.2. Poisson indel processlinear in the number of sequences with the GeoPIP model.In this Chapter, we introduce our GeoPIP model and an efficient algo-rithm for evaluating the likelihood function based on the GeoPIP model. Weleave our complete phylogenetic tree inference algorithm based on GeoPIPto Chapter 6.5.2 Poisson indel processBefore describing the GeoPIP model, we introduce our notation and reviewthe PIP model, which is the foundation of the GeoPIP model.5.2.1 Basic notationIn the following, we assume that sequences from different species take theform of a multiple sequence alignment (MSA) of characters from a finitealphabet Σ (for example, Σ = {A, C, G, T} for DNA data). MSAs are setsof homologous characters which can be visualized using an alignment matrix,where each row represents one aligned sequence and each column representsone set of homologous characters at a certain locus. When there are nohomologous characters observed at a locus in one sequence, a gap symbol“–” is padded at the locus of that sequence so that two characters are in thesame column of the alignment matrix if and only if they are homologous.Let Σ+ = Σ ∪ {−} denote the expanded set of symbols including the gapsymbol “–”.Let x = (x1,x2, . . . ,xN )′ denote a fixed MSA of sequences from N differ-ent species with n columns, (xi ∈ Σn+, i = 1, 2, . . . , N). We will also use x asx = (c1, c2, . . . , cn) for a fixed MSA with columns (cj ∈ ΣN+ , j = 1, 2, . . . , n).We let Q denote a reversible substitution rate matrix over a state spaceX . Here, X could be taken to be the finite alphabet Σ, or X could be theset of pairs containing a character in Σ together with a substitution ratecategory annotation from a discrete set of substitution category indices. Tosimplify the notation, we take X = Σ in the following, but we note thatsubstitution heterogeneity can be handled in our framework with no change835.2. Poisson indel processon the algorithms or properties of the method. Let pi denote the stationarydistribution of the rate matrix Q.Finally, we let τ denote a latent phylogenetic tree with leaves labelledwith the same taxa as those indexing the rows of the MSA x.5.2.2 Definition of the Poisson indel processBouchard-Coˆte´ and Jordan (2013) proposed the PIP to model insertion,deletion and substitution of characters in string-valued continuous time pro-cesses. The description of the PIP on a string of k characters consists oftwo steps: first, the type of the next change (insertion, deletion or substitu-tion) is determined by a realization of 2k+ 1 exponential random variables;second, the exact change is determined based on the type of change andrealization of some type-specific random variables.The first step is generated as follows. For a sequence of length k, thePIP assumes that the smallest of 2k + 1 exponential random variables de-termines the nature of the next evolutionary event and the waiting time.The waiting time for a potential insertion event is exponentially distributedwith rate λ > 0 (this random variable does not determine the location ofthe insertion since all k + 1 possible insertion sites share the same randomvariable for insertion). The waiting times for k potential deletion eventsare independently and identically exponentially distributed with rate µ > 0(these random variables determine the location of the deletions since thereis one random variable for deletion of each site). The waiting times for k po-tential substitution events are independently exponentially distributed withrates based on the substitution rate matrix Q. We let θ = (λ, µ) denote thetwo indel parameters of the PIP.The second step is generated as follows. If the next event is an insertion,the location of the insertion is uniformly selected from k+1 possible insertionpositions, and a new character is randomly generated based on a multinomialdistribution with parameter pi, with is the stationary distribution of ratematrix Q. If the next event is deletion, the character associated with thesmallest realization of the k deletion random variables is deleted from the845.2. Poisson indel processsequence. If the next event is substitution, a new character is randomlygenerated from a multinomial distribution based on respective rows of therate matrix Q determined by the character to be substituted.It is easy to simulate sequences from the stationary distribution of se-quences induced by the PIP, by first simulating a Poisson-distributed num-ber of characters, with parameter λ/µ, and then sampling each characterindependently from a multinomial distribution of characters with stationarydistribution pi. The PIP is reversible when the initial sequence is distributedaccording to this stationary distribution. We only consider the reversiblePIPs in this thesis.5.2.3 Computational properties of the Poisson indel processBouchard-Coˆte´ and Jordan (2013) showed that under the PIP model, themarginal probability mass function of observing an alignment x = (c1, c2, . . . , cn)at the leaves of a given tree τ isPIP(x|θ, τ) = ψ(P (c∅|θ, τ), n, θ, τ)n∏i=1P (ci|θ, τ), (5.1)where c∅ is a single MSA column with empty characters “–” at each leaf, θis the indel rate, and n is the number of alignment columns. The functionψ in (5.1) is given byψ(z, k, θ, τ) =1k!‖νθ,τ‖k exp{(z − 1)‖νθ,τ‖}, (5.2)where ‖νθ,τ‖ = λ(‖τ‖ + 1/µ) and ‖τ‖ is the sum of all branch lengthsin τ . The stationary sequence length distribution is given by a Poissondistribution with mean λ/µ (Bouchard-Coˆte´ and Jordan, 2013), which isa more adequate length distribution than the geometric sequence lengthdistribution induced by the TKF model (Miklo´s, 2003).Bouchard-Coˆte´ and Jordan (2013) proposed a dynamic programmingalgorithm to calculate P (ci|θ, τ) efficiently based on a variation of Felsen-stein’s peeling recursion algorithm (Felsenstein, 1981a), as well as a Bayesian855.3. Geometric Poisson indel processframework for phylogenetic inference based on the PIP.5.3 Geometric Poisson indel processWe illustrate the generative process of the GeoPIP model in Figure 5.2, anddescribe it in detail in this section.Figure 5.2: An illustration of the GeoPIP on an MSA of four sequences.There are three MSA segments: segment 1 and 3 evolves at a low deletionrate (dyed in purple), while segment 2 evolves at a high deletion rate (dyedin yellow). Each segment evolves independently based on the PIP model.Insertion, deletion and substitution events for each segment are indicatedon the bottom plots.The GeoPIP model is based on the concept of MSA segment, which wedefine as a group of contiguous MSA columns in which indels are assumedto accumulate at the same rate. We define a segmentation β of a fixed MSAx as a partition of the MSA columns x1, . . . ,xN into MSA segments, i.e.,β = (s1, s2, . . . , sZ) where sk is the k-th segment and Z = |β| is the numberof segments (k = 1, 2, . . . , |β|). To be specific, sk = (cdk−1+1, . . . , cdk), wheredk =∑k−1j=1 |sj | (k = 1, 2, . . . , Z) and d0 = 0.Yang (1996) showed using a discrete set of possible substitution ratecategories can help to reduce bias caused by substitution rate variation inphylogenetic analysis. Here we proceed similarly, and define a finite list ofindel rate categories θ1 = (λ1, µ1), . . . , θm = (λm, µm), where each item inthe list is just a distinct PIP indel parameter setting. However, in contrast865.3. Geometric Poisson indel processto discrete substitution rate models, where each rate is often obtained usinga discretized gamma distribution, we do not assume a specific parametricform for θ1, . . . , θm.We assume that the number of MSA segments of a fixed MSA, Z ≥ 1,follows a geometric distribution with parameter ρ, (0 < ρ ≤ 1). The choice ofa geometric distribution is motivated by computational considerations: thememoryless property allows a speedup of a factor n (the number of alignmentcolumns) in applying a dynamic programming algorithm to calculate thelikelihood (see Section 5.4).Given Z, we assume that the indel rate of each segment is independentlyand identically sampled from one of the m distinct indel rates θ1, . . . , θm.We denote the prior probabilities of each of the possible m categories asω = (ω1, · · · , ωm),∑mj=1 ωj = 1. For each segment i ∈ {1, 2, . . . , Z}, weintroduce a random variable Ri indicating the rate category sampled forsegment i:P (Ri = j) = ωj , i = 1, 2, . . . , Z and j = 1, 2, . . . ,m.Now that the sampling process for the segment-specific rate categorieshas been described, we can complete the description of the GeoPIP by defin-ing how the data are generated in each segment. This is done by using thePIP model to sample the data in each segment i independently using theindel parameter θRi corresponding to the rate category associated with seg-ment i. We assume a shared substitution rate matrix Q for substitution,with stationary distribution pi in this thesis.To summarize, we obtain the following generative description of theGeoPIP:Z ∼ Geo(·|ρ)Ri ∼ Cat(·|ω) i = 1, 2, . . . , Zsi|Ri ∼ PIP(·|θRi , τ) i = 1, 2, . . . , Zβ = s1 ◦ · · · ◦ sZ ,875.3. Geometric Poisson indel processwhere Geo and Cat are the geometric and categorical distributions, and “◦”denotes concatenation of multiple sequence alignments. This gives us thefollowing probability mass function of GeoPIP:GeoPIP(β, r|γ) = GeoPIP(β, r|θ, τ, ρ, ω) = (1− ρ)|β|−1ρ|β|∏i=1ωri PIP(si|θri , τ),where γ = (θ, τ, ρ, ω) denotes all the parameters involved, R = (R1, R2, · · · , RZ)are random variables that indicate the rate category for each segment,r = (r1, r2, · · · , rZ) is a realization of R, and θ = (θ1, θ2, . . . , θm) are the mdistinct indel rates.The motivation behind this construction is that the GeoPIP inherits thePIP’s desirable properties. We start with a simple result to illustrate this:Proposition 4. For all µ > 0, λ > 0, the GeoPIP model is explosion free(i.e., the expected sequence length is finite). Moreover, when the substitu-tion rate matrix is reversible, the GeoPIP model is reversible. Its stationarylength distribution has mean (1/ρ)∑mj=1 ωjλj/µj and a probability generat-ing function given by m∑j=1ωj exp{(s− 1)λj/µj}−1 − (1− ρ)−1 ρ.Proposition 4 can be easily proved by noticing that the sequence lengthbased on the PIP model is Poisson and the GeoPIP is a mixture of PIPs.In particular, Proposition 4 means that the GeoPIP can capture richersequence length distributions than previous indel models. For example,the PIP model has a Poisson stationary length distribution, and thereforean equal mean and variance. In contrast, the GeoPIP can capture theoverdispersion found in real data because the distribution of the sequencelength based on the GeoPIP is a mixture of Poisson distributed randomvariables and thus has an unequal mean and variance. The TKF91 model hasa stationary length distribution that is even more problematic, predictinga geometrically distributed stationary sequence length, which is undesirable885.4. Efficient phylogenetic inference with the GeoPIP modelbecause that probability mass function has its mode on the empty sequence(Zhang, 2000; Miklo´s, 2003). We emphasize that the GeoPIP does not havethis deficiency. The geometric reference in its name refers to the PIP mixingdistribution, not the stationary length distribution.The most important property of the GeoPIP, however, is its amenabilityto efficient phylogenetic inference, which we describe in detail in the nextsection.5.4 Efficient phylogenetic inference with theGeoPIP modelComputational complexity is a key issue in phylogenetic inference. Approx-imation algorithms are proposed in order to explore the space of trees inpractice, either using local search (Li et al., 2000; Huelsenbeck and Ronquist,2001; Ronquist and Huelsenbeck, 2003; Barker, 2004; Stamatakis, 2005), orincrementally (Saitou and Nei, 1987; Studier et al., 1988; Gascuel, 1997;Bouchard-Coˆte´ et al., 2012). Given the large literature on phylogenetic in-ference, our goal is to show that our model can be incorporated into mostexisting phylogenetic inference frameworks with minimal changes. The mainbuilding block of most existing methods is a likelihood function taking aphylogeny as an input, `(τ). Maximum likelihood methods optimize `(τ)directly (usually while maximizing other parameters); Bayesian methodscombine `(τ) with a prior and approximate the posterior via Metropolis-Hastings methods; and neighbor-joining (NJ) methods break the likelihood`(τ) optimization into many small problems, one for each pair of leaves{k1, k2}—these smaller problems can be viewed as optimization of a likeli-hood function over a two-leaf tree, `(τ{k1,k2}). In all these cases, the treeinference method views the evolutionary model as a black box function `(τ).Since this black box is evaluated at several putative trees, it is important tohave efficient evaluation algorithms for calculating `(τ).895.4. Efficient phylogenetic inference with the GeoPIP model5.4.1 Likelihood evaluation in polynomial timeIf the segmentation β∗ and indel rate categories r∗ were known, we couldsimply pick`(τ) = GeoPIP(β∗, r∗|γ),a quantity that can be computed efficiently. Efficient evaluation in this caseis a direct corollary of results in Section 3 of Bouchard-Coˆte´ and Jordan(2013):Proposition 5. Computing GeoPIP(β∗, r∗|γ) can be done in time O(Nn),where N is the number of taxa, and n is the number of alignment columns.Importantly, this running time is of the same order as that of computingthe likelihood of a substitution-only model.5.4.2 Likelihood optimization in polynomial timeWe still need to take into account the fact that a true segmentation is notknown in practice (and the notion of a “true” segmentation is only imper-fectly applicable in real datasets). Fortunately, a simple dynamic program-ming algorithm for optimization over segmentations can address this. Asimilar algorithm to sum or sample over the segmentations is also available.However, given the low uncertainty over segmentations that we observed inpractice (see Section 6.3.1), we focus here on a maximization strategy, whichestimates a segmentation with maximum likelihood. This algorithm is re-lated to the phylogenetic hidden Markov model (Siepel and Haussler, 2005),which has been used among other things for substitution rate variation. Wefocus on how our algorithm accommodates the indel rate variation.Proposition 6. Computingmaxβ,rGeoPIP(β, r|γ), (5.3)can be done in time O(mn2 +Nn), where N is the number of taxa, n is thenumber of alignment columns, and m is the number of indel rate categories.905.4. Efficient phylogenetic inference with the GeoPIP modelThe same is true for identifying theargmax β,r GeoPIP(β, r|γ). (5.4)An optimization algorithm that achieves this running time is describedbelow. Based on this result, we derive in Chapter 6 an inference algorithmbased on coordinate optimization, composite likelihood, and NJ, for jointinference of the phylogenetic tree, substitution parameters, and indel pa-rameters. However, as explained earlier in the thesis, our model could alsobe integrated into existing likelihood-based tree inference methods.As a preprocessing step to the segmentation algorithm, we first calculate:pi,j = P (ci|θj , τ), i = 1, 2, . . . , n, j = 1, 2, . . . ,m, (5.5)which is the probability of observing a single MSA column ci with indel rateθj = (λj , µj) on a tree τ . Second, we calculatemk,j = ψ(zj , k, θj , τ), k = 1, 2, . . . , n; j = 1, 2, . . . ,m,zj = P (c∅|θj , τ), j = 1, 2, . . . ,m,which are used to calculate the factor in the PIP density determined by thelength of the MSA segment. Here ψ is defined in (5.2).To calculate mk,j efficiently, we use the following recursion:logmk+1,j = logmk,j − log(k + 1) + log(‖νj‖) for k = 1, 2, . . . , n− 1,where ‖νj‖ = ‖νθj ,τ‖ = λj(‖τ‖+ 1/µj). The recursion is initialized with:logm1,j = log ‖νj‖+ (P (c∅|θj , τ)− 1)‖νj‖,for all j = 1, 2, . . . ,m. Using this recursive formula for mk,j and the recur-sions described in Bouchard-Coˆte´ and Jordan (2013) for pi,j , the computa-tional cost for calculating all pi,j and mk,j is O(nm).Let li denote the maximum likelihood over all possible segmentations forthe first i MSA columns c1:i = (c1, c2, · · · , ci) (1 ≤ i ≤ n). We set l0 = 1.915.4. Efficient phylogenetic inference with the GeoPIP modelWe start with c1:1. There are m possible choices for the rate assigned tothis single column, yieldingl1 = max {p1,j m1,j ωjρ : j ∈ {1, 2, . . . ,m}} .The computational cost of this step is O(m). We calculate an intermediatequantity lt based on l0, l1, . . . , lt−1 recursively. To do so, we define a t ×mmatrix L(t) with entry (i, j) given by:l(t)i,j = li−1 pi,j pi+1,j · · · pt,j mt−i+1,j ωj(1− ρ), i ∈ {1, · · · , t}, j ∈ {1, · · · ,m},where l(t)i,j represents the largest likelihood if the t-th column forms a segmentwith the last t− i columns with the j-th indel rate, conditioning on knowingthe first t columns only (i.e., no information on the columns {t+ 1, . . . , n}).Therefore, the matrix L(t) considers all possible segmentation choices forthe i-th column, and utilizes previously calculated maximum likelihood forthe segmentation choices of the first t − 1 columns to calculate the largestlikelihood for all t×m possible segmentation choices when the t-th columnis added to the first t− 1 columns. Then we computelt = max{l(t)i,j : i ∈ {1, 2, · · · , t}, j ∈ {1, 2, · · · ,m}}, (5.6)The largest value of L(t) gives the maximum likelihood lt of all possiblesegmentations and indel rate assignments of the first t columns.The computational cost of naively calculating lt+1 is O(t2m). However,part of the product pi,jpi+1,j · · · pt,j in l(t)i,j can be stored and used to calculatepart of the product pi−1,jpi,j · · · pt,j in l(t)i−1,j , so the computational cost canbe reduced to O(tm). The algorithm described in this section can find themaximum likelihood shown in (5.3). We can further identify a segmentationthat achieves the maximizer in (5.4) by recording the index of entries in L(t)that achieve the maximum sequentially. See Section 6.2.2 for the detailed al-gorithm for finding a segmentation that maximizes the likelihood as in (5.4),which follows standard dynamic programming backtracking techniques.925.5. Hierarchical Poisson indel process5.5 Hierarchical Poisson indel processWe also developed a more elaborate generalization of the PIP process thatincorporates long indels. We use this more elaborate process, called theHierarchical Poisson indel process (hPIP), as an additional mechanism togenerate synthetic data that we then analyze using the simpler GeoPIPmodel. While it is easy to generate data using the hPIP, it is not compu-tationally tractable to perform tree inference. We describe the hPIP withdetails in this section. In summary, the hPIP is a two level hierarchy, wherefor the first level, MSA segments are considered as basic characters of a PIPof segments where a segment can be deleted or inserted as a whole, and forthe second level, each segment independently follows a PIP of single sites.As with the TKF92 model, the hPIP allows long indels but in a mannerthat does not cover all types of long indels expected in a biologically real-istic process (in both cases, there cannot be an overlapping long insertionand long deletion, for example).5.5.1 Generative process of the hierarchical Poisson indelprocessIn the first step of the generative process, we use a PIP to generate the struc-ture of the segments: this PIP follows the definition reviewed in Section 5.2,but instead of taking the types of characters Σtop to be nucleotides, we takethem to be rate category indices, Σtop = {1, 2, . . . ,m}. This allows segmentsto be inserted and deleted along a tree (for example, the whole segment 2in Figure 5.2 can be deleted in some sequences), and to take on differentrate categories as a function of the position on the tree and the segmentindex k. We denote the parameters of this PIP by µtop > 0, λtop > 0 andQtop. The dimension of Qtop is equal to the number of indel rate categoriesm, and encodes potential changes in category within one segment along thebranches of the phylogenetic tree.In the second step, we look at each segment, and then at each connectedsubset of the tree (in preorder, starting from the root), where this segmenttakes on a constant indel rate category index j ∈ {1, 2, . . . ,m}. For each935.5. Hierarchical Poisson indel processsuch subset of the tree, we use a PIP with parameters θj (initialized at thestring generated recursively earlier in the preorder) to generate the evolutionof the nucleotides for this part of the tree and segment. The final MSA isobtained by pasting together all the segments. We call this evolutionarymodel the hPIP since it is built from two layers of PIP.For simplicity, in this thesis, we focus on cases where within each segmentthe rate category stays constant on the tree. This is achieved by setting thesegment transition rate matrix to the zero matrix, Qtop = 0. Since thematrix exponential is then equal to the identity, exp(tQtop) = I for anyt ≥ 0, a change of indel rate across points on a tree for a given segment isnot allowed under this condition. However, we still have rate heterogeneityacross segments.5.5.2 Properties of the hierarchical Poisson indel processIt is easy to see that the hPIP is reversible when Qtop = 0 given the re-versibility of the PIP (Bouchard-Coˆte´ and Jordan, 2013).The hPIP has two major advantages over the PIP. First, hPIP modelsthe indel rate variation directly as different segments can have different indelrates while in PIP, one indel rate applies to the whole sequences. Second,hPIP models long indels directly. Under hPIP, one segment may consist ofmore than one character, thus deletion of one segment can lead to long gapsin MSAs.However, phylogenetic inference based on the hPIP is challenging evenwhen Qtop = 0. One main reason is that the segmentation of the MSAs isunknown. There are two general ways to handle unobserved segmentation:optimization, i.e., to estimate a optimal segmentation based on certain cri-teria (for example, maximum likelihood); or marginalization, i.e., to sumover all possible segmentations. Because the parameter space of all hPIPsegmentations is intractable to optimize directly we use the hPIP only togenerate datasets. We show in the experiments that the simpler GeoPIPcan infer accurate trees from datasets generated by the hPIP model.945.6. Summary5.6 SummaryIn this Chapter, we introduced a new continuous time stochastic process, thegeometric Poisson indel process, to model indel rate variation. We startedwith the motivation in Section 5.1 and reviewed the Poisson indel processin Section 5.2. We introduced the geometric Poisson indel process in Sec-tion 5.3 and proposed an efficient algorithm for calculating the probability ofa multiple sequence alignment based on the geometric Poisson indel processin Section 5.4. We introduced the hierarchical Poisson indel process thatallows long indels in Section 5.5.In the next Chapter, we introduce an algorithm for reconstructing phy-logenetic trees based on the GeoPIP model.95Chapter 6Estimation of phylogenieswith indel rate heterogeneity6.1 IntroductionUtilizing our efficient likelihood calculation algorithm to infer segmenta-tions, we propose an algorithm to estimate phylogeny from a fixed multiplesequence alignment using the neighbor-joining (NJ) algorithm (Saitou andNei, 1987; Studier et al., 1988; Gascuel, 1997) as an illustration. It is alsoworth mentioning that a full likelihood approach, as well as joint inferenceof phylogeny and multiple sequence alignments, can also be implementedbased on the GeoPIP model, using existing phylogenetic inference frame-works (Guindon et al., 2010; Bouchard-Coˆte´ et al., 2012; Bouchard-Coˆte´and Jordan, 2013).Our inference method iteratively updates an estimate of the segmenta-tion of the MSA, an estimate of the indel rates, an estimate of the phylo-genetic tree and estimates of other relevant parameters, until convergenceoccurs or the full likelihood stops increasing. The exact marginalization,which integrates over all possible states of unknown sequences at all internalvertices, still plays a key role because of the need to infer a segmentationand indel parameters. The segmentation of the multiple sequences align-ment and indel rates is estimated using the GeoPIP model, based on ourefficient algorithm to calculate the probability of a multiple sequence align-ment. The phylogenetic tree is constructed using neighbor joining based onpairwise distances, which are calculated using the GeoPIP model on pairwisesequence alignments that inherit the segmentation and indel rates estimated966.1. Introductionfrom the multiple sequence alignment. Our inference method is initializedusing random starts, without requiring a guide tree.Using our method, we investigate the effect of indel rate heterogeneityon phylogenetic inference. We provide some evidence that modelling indelsenhances accuracy of phylogenetic inference, and that modelling indel rateheterogeneity can further improve the accuracy of phylogenetic inference.We demonstrate the accuracy of our method in both well-specified and mis-specified synthetic experiments, including data generated using the softwareINDELible (Fletcher and Yang, 2009) and aligned using the software MUS-CLE (Edgar, 2004a,b).In this thesis, we focus on modelling indel rate variation and consideronly indels of size one. An important area of related work is the develop-ment of long indel models (Thorne et al., 1992; Miklo´s et al., 2004; Lunteret al., 2005b; Redelings and Suchard, 2007). Modelling long indels is im-portant in the context of phylogenetics because explaining the insertion ordeletion of a segment with many single-character indels can lead to inac-curate tree estimation. Liu et al. (2009) showed that using the affine gappenalty which models long indels directly can improve alignment and treeestimation accuracy, and at the same time, the indel rate is comparablewith the substitution rate when the indel rate and the average indel lengthare separately estimated. This leads to more interpretable results whichprovide helpful insights into the ratio of indel event frequency and substitu-tion event frequency. Unfortunately, the problem of reconciling long indelswith a model that can be obtained as a tractable, exact marginalizationof a continuous time stochastic process is still open and appears elusive.The state of affairs uses complex approximations (Knudsen and Miyamoto,2003; Miklo´s et al., 2004), models that support insertions but not deletions(or vice versa) (Miklos and Toroczkai, 2001), and most methods are limitedto sequence pairs (Thorne et al., 1992).In this thesis, we only consider the GeoPIP model for single indels and donot attempt to include long indels into the GeoPIP model directly. Instead,our strategy to avoid the branch length overestimation caused by using mul-tiple indels to approximate long indels is to have the GeoPIP model explain976.2. Details of the phylogenetic inference methodlong indels with segments of very high indel rate. Our method shares a lim-itation of previous segment-based long indel methods (Thorne et al., 1992),namely that certain overlapping patterns of indels are not explained in themost parsimonious way (see Thorne et al. (1992) for examples). On theother hand, our algorithm based on the GeoPIP model has better scalingproperties as the number of taxa increases, compared to the TKF92 modelwhich does not allow exact marginalization of internal nodes in polynomialtime. To demonstrate that our strategy is sensible, we include syntheticexperiments where the data are generated from models that include longindels.6.2 Details of the phylogenetic inference methodIn this section, we show how to estimate the parameters of the GeoPIPmodel via a coordinate ascent algorithm. The full algorithm is summarizedin Algorithm 3. Note that Algorithm 3 can also be applied when the PIP isthe evolutionary model since the PIP is a special case of the GeoPIP. It isworth mentioning that the estimates obtained using our algorithm are notguaranteed to represent a global optimum in general.6.2.1 Number of indel rate categories mIn this thesis, we assume that m is fixed for simplicity. This is a reasonableassumption when the number of distinct indel rates can be roughly inferred,for example, based on information on the evolutionary rates of differentfunctional regions of sequences. In cases that a rough estimate of distinctindel rates is not easy to obtain, choosing m to be a large number worksin application as our algorithm will naturally choose a subset of indel ratesfrom m available indel rates, but at a price of higher computational cost.986.2. Details of the phylogenetic inference methodAlgorithm 3 Iterative optimization algorithm for parameter estimation ofGeoPIPInitialize parameters Q, θ, β, r, ρ, ω.Calculate B given θ, Q, β and r.Infer τ based on B using NJ and mid-point rooting.Set tolerance level tol. Set d = tol. Set `old = 1.e-10. Set ∆` = 1.while d ≥ tol and ∆` > 0 doUpdate β∗ and r∗ given θ, Q and τ using dynamic programming.Update ρ∗ given |β∗|.Update ω∗ given r∗.Update θ∗ given τ , Q, β∗ and r∗.Update Q∗ given τ .Update B∗ given θ∗, Q∗, β∗ and r∗.Update τ∗ based on B∗ using NJ and mid-point rooting.Set d← max{‖B∗ −B‖, ‖θ∗ − θ‖, ‖Q∗ −Q‖}.Set B ← B∗, τ ← τ∗, θ ← θ∗,Q ← Q∗, β ← β∗, r ← r∗, ρ ← ρ∗, ω ←ω∗.Calculate full likelihood `new.Calculate change of likelihood ∆` = `new − `old.Set `old = `new.end while6.2.2 Optimizing with respect to the segmentation β andindel rate indicator rOptimizing segmentation β and indel rate categories has been discussed inSection 5.4. Here we add the description of the backtracking algorithm thatcan be used to find one segmentation with indel rates for each site thatachieves the maximum likelihood, i.e.,argmax β,r GeoPIP(β, r|γ).Note that in (5.6), the maximum is taken over a matrix L(t) = (l(t)i,j )of t × m elements. Let (ηt,1, ηt,2) denote the index of the largest elementin L(t). To find the optimal segmentation β for a fixed alignment withmaximum likelihood ln using the path of dynamic programming, we recorda backward function f : {1, 2, · · · , n} → {1, 2, · · · , n}, where f(t) is the row996.2. Details of the phylogenetic inference methodindex of the maximum entry in L(t), i.e.,f(t) = ηt,1, t = 1, 2, · · · , n.To find the indel rates r in each segment of the optimal segmentationβ using the path of dynamic programming, we record another backwardfunction g : {1, 2, · · · , n} → {1, 2, · · · ,m}, where g(t) is the column index ofthe maximum entry in L(t), i.e.,g(t) = ηt,2, t = 1, 2, · · · , n.We trace the optimal segmentation β with maximum likelihood and re-spective indel rates r by Algorithm 4. The lengths of all segments are givenin the ordered array A and the indel rates of all segments are given in theordered array C of Algorithm 4.Algorithm 4 Backtracking for best segmentationSet i = n. Set A = ∅. Set C = ∅.while i > 0 doj ← f(i)Add element {i− j + 1} to A as the first element.Add element g(i) to C as the first element.i← j − 1.end whileIt is easy to see that recording these two backward functions f and g doesnot change the order of the time complexity of the dynamic programming,and finding the best segmentation and rate category in each segment basedon f and g does not increase the order of the total time complexity either.6.2.3 Updating the geometric parameter ρ and stationarydistribution parameter ωWe calculate ρ̂ = 1/|β| since E (|β|) = 1/ρ. We estimate ω based on R only,by counting how many inferred states r̂i equal j for i = 1, 2, . . . , |β|, andj = 1, 2, . . . ,m. We use add-one smoothing to ensure that all elements of ω1006.2. Details of the phylogenetic inference methodare non-zero.6.2.4 Updating the phylogenetic tree τWe focus on bifurcating tree topologies in this thesis. We reconstruct τ usingNJ (Saitou and Nei, 1987; Gascuel, 1997), based on the updated pairwisedistance matrix B and root the unrooted tree by midpoint rooting. Since theGeoPIP process is reversible, the root location will not affect the inferenceof evolutionary parameters.When all other parameters are fixed, a composite log-likelihood (Varinand Vidoni, 2005) `c of B can be written as`c(B) =∑1≤i<j≤Nlog GeoPIP(β(xi,xj), r|θ, bij , ρ, ω), (6.1)where β(xi,xj) denotes the segmentation β on two sequences xi and xj only,and bij is the total branch length from sequence i to sequence j.The branch length parameter bij only appears in one composite log-likelihood componentlog GeoPIP(β(xi,xj), r|θ, bij , ρ, ω), (6.2)thus the maximum composite likelihood estimate (MCLE) b̂ij can be ob-tained by maximizing (6.2) instead of (6.1). Given β, bij does not dependon ρ, and given θ, bij does not depend on ω. Therefore, the compositelikelihood of bij depends only on β, θ, Q.1016.2. Details of the phylogenetic inference method6.2.5 Updating the indel rate θWe estimate the indel rate θ by pooling all segments with the same rates.log GeoPIP(β, r|γ)= (|β| − 1) log(1− ρ) + log ρ+|β|∑i=1logωri +|β|∑i=1log PIP(si|θri , τ)= (|β| − 1) log(1− ρ) + log ρ+|β|∑i=1logωri+m∑l=1 ∑k:rk=llog PIP(sk|θl, τ) , (6.3)where the inner summation is over all k = 1, 2, . . . , |β| satisfying rk = l, i.e.,segments with the l-th indel rate (l = 1, 2, . . . ,m).The parameter θl appears only in the component∑k:rk=llog PIP(sk|θl, τ), (6.4)therefore, the MLE of θl (l = 1, 2, · · · ,m) can be obtained by maximizing(6.4) given the rate matrix Q and tree τ , instead of (6.3).6.2.6 Updating the rate matrix QThe conditional substitution rate matrix is the same at all loci regardless ofthe indel rate of the segment. Based on this observation, we pool all datainvolving transitions only to estimate the rate matrix Q. We explain thisstep only briefly as estimating the rate matrix Q is not the focus of thisthesis, and refer readers to Hobolth and Yoshida (2005) for more details.We use a standard EM algorithm to estimate Q based on substitutionsof characters only (Tataru and Hobolth, 2011). At the E-step, we calculateexpectations of the stationary distribution of characters, transitions amongall characters and the waiting times at each character type given a ratematrix Q̂ and data. At the M-step, we maximize a penalized likelihood1026.3. Simulation studiesfunction of Q based on the general time reversible (GTR) model (Tavare´,1986) to find Q̂ given all expectations from the E-step. We repeat the E-step and M-step iteratively until the change in penalized likelihood is smallerthan a given tolerance.The GeoPIP+NJ algorithm can simply incorporate the correlation ofindel rates and substitution rates by estimating substitution rate matricesseparately for different indel rate regions. The computation cost of updatingQs will increase by a factor ofm, which is the number of indel rate categories.6.2.7 Convergence of the optimization algorithmIn our algorithm, the iterative updating procedure is terminated when thechange of parameters is smaller than the tolerance level or the full likelihooddecreases after one full iteration, as shown in Algorithm 3.We calculate the full likelihood of the new set of all parameters updatedat the end of each iteration and monitor the change of the full likelihood.This procedure is important. Because some updating steps for individualparameters, for example B, are not based on optimizing the full likelihood.Even though at each step for individual parameters, we obtain a new esti-mate which maximizes the respective (composite) likelihood, it is possiblethat the full likelihood may decrease after one full iteration.6.3 Simulation studiesThis section is organized as follows. First, we perform a simulation studyto investigate the accuracy of our segmentation inference method, given thecorrect alignment. Second, we perform simulation studies to assess the ac-curacy of the complete inference algorithm for the GeoPIP in finding thetrue tree when the evolutionary model is correctly specified (i.e., data aresimulated using the GeoPIP, and the true alignment is given) and misspec-ified (e.g., data are simulated using the software INDELible (Fletcher andYang, 2009) or hPIP, and an estimated alignment is used). We compareinference results with a set of widely used phylogenetic inference methods.1036.3. Simulation studies6.3.1 SegmentationWe consider three sets of indel rates in the simulations. In the first scenario,we consider two indel rate categories, deletion rates µ1 = 0.02 and µ2 = 2.0,insertion rates λj = 20 · µj (j = 1, 2) and multinomial parameter for thestationary distribution of segments ω = (1/2, 1/2). In the second scenario,we set m = 3, µ1 = 0.02, µ2 = 0.2 and µ3 = 2.0, λj = 20 · µj (j = 1, 2, 3),and ω = (1/3, 1/3, 1/3). In the third scenario, we set m = 4, µ1 = 0.01,µ2 = 0.1, µ3 = 1.0 and µ4 = 5.0, λj = 20 · µj (j = 1, 2, 3, 4), and ω =(1/4, 1/4, 1/4, 1/4). The geometric parameter for the number of segments isρ = 0.05 in all scenarios. A perfect binary tree with 32 leaves is used in thissimulation study. All edge lengths are set to be 0.1.In each simulation run, we generate the MSAs randomly using the GeoPIPmodel proposed in this thesis. To focus on the accuracy of the segmenta-tion inference method, we fix the tree τ , rate matrix Q, indel rates θ, andGeoPIP parameters ρ and ω as true values. Instead of generating a Poisson-distributed number of segments, we generate 20 segments at the root of thetree in all runs so that the lengths of MSA columns are less variable acrosssimulation runs.To measure the accuracy of the segmentation algorithm, we calculatethe proportion of alignment columns being identified with incorrect rates.Since each alignment column belongs to exactly one segment and thus isassociated with exactly one indel rate, we define segmentation error as thepercentage of alignment columns in the estimated segmentation which havea different indel rate than that of the true segmentation.We vary the number of sequences used for segmentation inference (us-ing 2, 4, 8, 16 or 32 sequences), and evaluate the segmentation error onMSA columns that are non-empty for the smallest set of sequences (i.e., 2sequences), to make the absolute magnitude of the errors comparable whenvarying the number of sequences.Table 6.1 summarizes the averages of the percentages of alignment columnswith incorrectly inferred indel rates from 100 simulation runs. Table 6.1 alsosummarizes the average running time of our segmentation algorithm.1046.3. Simulation studiesTable 6.1: Simulation results on segmentation error and running time. Dataare simulated based on the geometric Poisson indel process (GeoPIP) modelwith 2, 3, or 4 indel rates (m), on a perfect binary tree (i.e., a tree in whichall internal vertices have two descendants and all branch lengths are equal)with 32 leaves (see Figure 6.1B for an example of a perfect binary treewith 16 leaves). Average percentages of alignment columns with incorrectlyinferred indel rates from 100 simulations are listed.segmentation error running time (in seconds)Sequences m = 2 m = 3 m = 4 m = 2 m = 3 m = 41-2 0.0276 0.2064 0.2219 3.0593 3.7787 2.96131-4 0.0064 0.1226 0.1180 5.6851 6.8666 6.33161-8 0.0035 0.0804 0.0899 14.9626 18.6748 19.07481-16 0.0011 0.0307 0.0437 44.4989 55.6419 61.43561-32 0.0011 0.0391 0.0397 142.829 169.2812 199.2880The dramatic decrease in error rate when the number of sequences usedfor segmentation inference increases motivated the need for marginalizationof internal sequences. The fact that GeoPIP allows such marginalization ina simple and exact fashion allows us to efficiently search over segmentations,even when the number of sequences increases.6.3.2 Well-specified synthetic examplesIn this section, we perform simulation studies to assess tree reconstructionaccuracy when the data are simulated according to the GeoPIP model. Inthis case, the substitution-only models and the GeoPIP models are both well-specified. Our focus is on the effect of the additional information containedin the indels on tree reconstruction accuracy. To make the reconstruction ac-curacies more interpretable, we also include the accuracy of reconstructionsfrom PhyML (Guindon et al., 2010), and from a standard PIP model.Simulation setupWe set the number of indel categories m = 2 and the indel rate (λ1, µ1) =(0.4, 0.02) for the first segment. For the second segment, we consider three1056.3. Simulation studiessets of indel rates, (λ2, µ2) = (10, 0.5), (40, 2.0), or (80, 4.0). Note thatwhen (λ2, µ2) = (80, 4.0), the data simulated using the GeoPIP model havefast-evolving regions making the synthetic alignments visually most similarto real datasets. We consider two phylogenetic trees in the simulation: aphylogenetic tree with 8 leaves and varying branch lengths and a perfectbinary phylogenetic tree with 16 leaves and constant branch lengths (seeFigure 6.1).Aseq1seq2seq3seq4seq5seq6seq7seq80.050.050.050.050.050.050.050.050.10.050.050.10.050.25Bseq1seq2seq3seq4seq5seq6seq7seq8seq9seq10seq11seq12seq13seq14seq15seq16bbb bbb bbbb bbb bbbbb bbb bbbb bbb bbFigure 6.1: The reference phylogenetic trees used in simulation studies. A:a phylogenetic tree with 8 leaves and varying branch lengths. B: a perfectbinary phylogenetic tree with 16 leaves and same branch length b for allbranches.We focus on indel rate variation and ignore substitution rate variation forsimplicity, but we note that substitution rate variation can be incorporatedinto our methods without technical difficulty. For the first set of simulationson the tree with 8 leaves, the estimated rate matrix Q̂ from PhyML is usedas starting values for CTMC+NJ, PIP+NJ and GeoPIP+NJ estimationalgorithms and then updated iteratively together with other parameters.For the second set of simulations on the tree with 16 leaves, we fix the ratematrix Q̂ in the CTMC+NJ, GeoPIP+NJ and PIP+NJ as the estimatedrate matrix obtained from PhyML, so that the rate matrix is the same across1066.3. Simulation studiesall methods considered.For PIP, we randomly generate a deletion rate µ ∼ U(0, 1) and setλ = µη as a starting value, where η is set as the total number of observedalignment columns. For GeoPIP, we use the true valuem = 2 in the inferencebased on GeoPIP. Since our iterative optimization algorithm requires a setof starting values for the indel rates θ, the multivariate parameter ω, andthe segmentation, we show two sets of results, one using the true values asinitialization, and one using random values. For the random starting values,we randomly generate two deletion rates µ1 ∼ U(0, 1) and µ2 ∼ U(1, 2),then set λi = µiη (i = 1, 2). We set η = 20 in all simulations. The choice ofη is related to the minimum number of alignment columns in one segment.Similar results are observed when η = 10 is used instead of η = 20. We setstarting values ps = 0.1, and ωs = (1/m, 1/m) = (0.5, 0.5). Again, we foundthat different choices of starting values ps and ωs did not markedly affectthe inference results in our simulations studies. We simply set the initialsegmentation as one segment containing all MSA columns.Simulation resultsWe calculate the Robinson-Foulds (RF) and the weighted Robinson-Foulds(wRF) distance (Robinson and Foulds, 1979; Felsenstein, 2004) between eachestimated unrooted tree and the true unrooted tree from 100 simulationruns. The RF and wRF distances are calculated using the Python packagedendropy (Sukumaran and Holder, 2010).The main comparison of interest is between the GeoPIP+NJ methodand the CTMC+NJ method. Both models are well-specified here, but onlythe former uses indels. Our results show that GeoPIP+NJ reduces recon-struction error by a factor of up to two (Table 6.2 and Table 6.3) in termsof the wRF distance, and GeoPIP+NJ always outperforms CTMC+NJ interms of the RF distance as well. Reconstructions based on the standardPIP model also outperform reconstructions solely based on substitutions,but by a much smaller margin.As a reference, we also include results obtained using PhyML, which1076.3. Simulation studiesTable 6.2: Results on synthetic data simulated from the GeoPIP on a phy-logenetic tree of 8 leaves with varying branch lengths (see Figure 6.1A).All models are well-specified, except for the standard Poisson indel process(PIP). The weighted Robinson-Foulds (wRF) distances and the Robinson-Foulds (RF) distance of 100 simulation runs are summarized. For the “scaledtree” columns, we scale the total branch length of all estimated trees andthe true tree to be equal to one.wRF (unscaled trees) wRF (scaled trees) RFParameter Method mean (s.e.) median mean (s.e.) median meanPhyML 0.200 (0.006) 0.190 0.187 (0.006) 0.179 0.10 (0.07)CTMC+NJ 0.213 (0.005) 0.203 0.200 (0.005) 0.192 0.12 (0.06)µ2=0.5 PIP+NJ 0.150 (0.003) 0.144 0.137 (0.003) 0.136 0GeoPIP+NJ (true init.) 0.153 (0.003) 0.151 0.139 (0.003) 0.139 0GeoPIP+NJ (random init.) 0.153 (0.003) 0.151 0.139 (0.003) 0.138 0PhyML 0.222 (0.006) 0.208 0.208 (0.006) 0.199 0.24 (0.08)CTMC+NJ 0.240 (0.006) 0.227 0.223 (0.005) 0.218 0.30 (0.07)µ2=2.0 PIP+NJ 0.144 (0.003) 0.144 0.130 (0.003) 0.128 0GeoPIP+NJ (true init.) 0.134 (0.004) 0.130 0.116 (0.003) 0.115 0GeoPIP+NJ (random init.) 0.134 (0.004) 0.130 0.116 (0.003) 0.115 0PhyML 0.216 (0.007) 0.203 0.207 (0.006) 0.196 0.20 (0.06)CTMC+NJ 0.231 (0.007) 0.226 0.219 (0.006) 0.212 0.28 (0.07)µ2=4.0 PIP+NJ 0.203 (0.003) 0.203 0.201 (0.003) 0.203 0GeoPIP+NJ (true init.) 0.124 (0.003) 0.116 0.107 (0.002) 0.103 0GeoPIP+NJ (random init.) 0.124 (0.003) 0.116 0.107 (0.002) 0.105 0uses a statistically superior tree estimation method (compared to NJ), anda well-specified model, but no indel information. Comparing PhyML andCTMC+NJ illustrates the discrepancy introduced by the slightly subopti-mal NJ estimator. The accuracy gains obtained by modelling indel rateheterogeneity are larger than those obtained by using a more sophisticatedtree estimation method under the simulation setups we considered.Table 6.2 also shows that the difference between initializing GeoPIPparameters with true values versus random values is negligible, supportingthe robustness of our estimation procedure. In Table 6.3 and followingtables, we show only the GeoPIP+NJ results with random initial values.The average running times of 100 simulations runs on the phyloge-1086.3. Simulation studiesTable 6.3: Simulation results on synthetic data generated from the GeoPIP.The true tree is a perfect binary tree of 16 leaves with the same branchlength b for all branches (see Figure 6.1B). Different indel rates (i.e., µ2)and different phylogenetic tree branch lengths (i.e., b) are considered. Theweighted Robinson-Foulds (wRF) distances and the Robinson-Foulds (RF)distance of 100 simulation runs are summarized.wRF (unscaled trees) wRF (scaled trees) RFParameter Method mean (s.e.) median mean (s.e.) median mean (s.e.)PhyML 0.584 (0.013) 0.567 0.375 (0.008) 0.367 1.18 (0.17)µ2 = 4.0 CTMC+NJ 0.660 (0.016) 0.651 0.424 (0.009) 0.414 1.82 (0.21)b = 0.05 PIP+NJ 0.315 (0.004) 0.309 0.210 (0.003) 0.208 0GeoPIP+NJ 0.317 (0.007) 0.308 0.194 (0.004) 0.192 0PhyML 1.161 (0.038) 1.073 0.372 (0.011) 0.345 1.54 (0.20)µ2 = 4.0 CTMC+NJ 30.19 (28.76) 1.236 0.422 (0.019) 0.387 2.20 (0.33)b = 0.1 PIP+NJ 0.854 (0.011) 0.854 0.319 (0.004) 0.319 0GeoPIP+NJ 0.686 (0.016) 0.675 0.211 (0.004) 0.208 0.12 (0.05)PhyML 2.772 (0.094) 2.604 0.464 (0.019) 0.421 3.82 (0.43)µ2 = 4.0 CTMC+NJ 31.80 (14.52) 3.203 0.658 (0.040) 0.505 5.44 (0.57)b = 0.2 PIP+NJ 2.837 (0.035) 2.805 0.529 (0.005) 0.535 0.04 (0.03)GeoPIP+NJ 2.043 (0.054) 2.003 0.314 (0.007) 0.302 0.86 (0.13)PhyML 0.511 (0.010) 0.497 0.333 (0.006) 0.326 0.72 (0.12)µ2 = 0.5 CTMC+NJ 0.569 (0.013) 0.547 0.371 (0.008) 0.361 1.28 (0.18)b = 0.05 PIP+NJ 0.345 (0.006) 0.341 0.227 (0.004) 0.219 0GeoPIP+NJ 0.340 (0.008) 0.338 0.217 (0.005) 0.214 0.02 (0.02)PhyML 1.053 (0.037) 0.920 0.344 (0.014) 0.297 1.30 (0.30)µ2 = 0.5 CTMC+NJ 15.42 (14.23) 1.068 0.378 (0.020) 0.338 1.78 (0.36)b = 0.1 PIP+NJ 0.740 (0.023) 0.740 0.258 (0.007) 0.253 0GeoPIP+NJ 0.669 (0.022) 0.624 0.205 (0.004) 0.196 0.06 (0.03)PhyML 2.800 (0.437) 2.236 0.406 (0.019) 0.367 2.74 (0.34)µ2 = 0.5 CTMC+NJ 37.14 (16.22) 2.794 0.643 (0.045) 0.461 5.18 (0.56)b = 0.2 PIP+NJ 1.954 (0.092) 2.229 0.353 (0.018) 0.311 0GeoPIP+NJ 1.536 (0.063) 1.367 0.238 (0.008) 0.218 0.40 (0.08)netic tree with 8 leaves are: 4.03 seconds for PhyML, 12.76 seconds forCTMC+NJ, 124.88 seconds for PIP+NJ, 182.46 seconds for GeoPIP+NJ(true initialization), and 234.51 seconds for GeoPIP+NJ (random initializa-tion).1096.3. Simulation studies6.3.3 Misspecified synthetic examples using the hPIPIn real applications, the substitution and indel processes are unknown. Thegaps in MSAs may also be caused by long indels which are not directlycaptured by the GeoPIP model. The hPIP can be viewed as a more realisticmodel since it explicitly incorporates long indel events. This motivates theexperiments presented in this section, where we simulate data from thehPIP, and show that tree reconstructions based on the GeoPIP model arestill superior.Simulation setupWe use the same evolutionary parameters as in the previous section for thephylogenetic tree with 8 leaves and set (λ2, µ2) = (80, 4.0). For the hPIP,we set the segment insertion rate to λseg = 2 and the segment deletion rateto µseg = 0.1.We estimate the phylogenetic tree using several tree inference methodsand models. For the GeoPIP, we use m = 3 and m = 5 as the numbersof indel rate categories. These two variants of the GeoPIP are denoted byGeoPIP3 and GeoPIP5. Even though two indel rates are used in the hPIPsimulation model, there is no “true” value in this setup for m in the GeoPIPmodel, since additional rate categories can be recruited as surrogates tolong indels. Therefore, both the PIP model and the GeoPIP model aremisspecified in this simulation study. The CTMC+NJ and PhyML are stillcorrectly specified since they utilize only substitutions. Starting values forthe PIP and GeoPIP estimators are randomly generated in the same way asin the previous section.Simulation resultsBoth the GeoPIP+NJ and the PIP+NJ are misspecified models in this case,as neither models long indels directly. However, Table 6.4 shows that theGeoPIP+NJ provides a better approximation of the long indels introducedby the hPIP process, by assigning regions with possible long indels a largerindel rate. The GeoPIP+NJ also compares favorably against models that1106.3. Simulation studiesuse substitution only, which are still well-specified, but use only a subsetof the data. At the same time, the region with long indel (dyed as yellowin Figure 6.2) is perfectly identified by our inference method based on theGeoPIP.Table 6.4: Simulation results when the true model is the hierarchical Poissonindel process (hPIP), and the true tree has 8 leaves with varying branchlengths (see Figure 6.1A). The PIP and GeoPIP models are misspecified,while the other, substitution-only methods are well-specified. Both wRFand RF are reported.wRF (unscaled trees) wRF (scaled trees) RFMethod mean (s.e.) median mean (s.e.) median meanPhyML 0.232 (0.008) 0.224 0.215 (0.007) 0.209 0.20 (0.08)CTMC+NJ 0.249 (0.007) 0.242 0.237 (0.007) 0.236 0.44 (0.09)PIP+NJ 0.219 (0.004) 0.216 0.210 (0.005) 0.204 0GeoPIP3+NJ 0.172 (0.006) 0.156 0.151 (0.005) 0.147 0.02 (0.02)GeoPIP5+NJ 0.172 (0.006) 0.157 0.153 (0.006) 0.147 0.02 (0.02)Figure 6.2 shows a representative subset of the MSA segmentations in-ferred by our method in one randomly selected simulation run. The threeestimated deletion rates in the same example are µ̂1 = 0.006, µ̂2 = 0.848,and µ̂3 = 4.150. Segments with different estimated indel rates are dyed indifferent colours.    seq1    ...----A---T----T----T--AG--A-T--------------TATT-TTCTGAAAAATAAAAGACTTTATTAT-AAAATCATAATATCT-------G---T---A-----T-TT-T-TATA------------------T-----T-T-A...    seq2  ...----A---T------------AG--A-T-A--C-T-A-----TAAT-TTATTAAAAATAATAGAATTAATTAT-AAAATCATTATATCT---A---G-------------T-----T-AT----T------TCT--A---A------T-A...    seq3  ...---------T-----------A---------------A----AAAT-TTCTTAAAAATAATAGAATTAATTAT-ATAATAATTATATCT---T-AC--------A------A--TT---T--CA---A-T-T-T--A------------C...    seq4  ...T--------------------AGA--T----------A----AAATTTTCTTAAAAATAATAGAATTAATTAT-ATAATAATTATATCT---T-A---------A----AT---TT---T--CA---A-T-T----A----TA-T----A...    seq5  ...-----------------A--CAG-------AT----------TAAT-TTCGTAAAAATAATAGAATTAAATAT-AACATAATTATATAT---A-A--A----A-A---A------T---T--C--TT----A-T--A-------T----A...    seq6  ...--ATAGA---TA---A----------------------A-A---------------------------AATATTAAAATGGTAATAAATCT--A-----A-TACC-T-----------A---AA-------------AT--------T-A...    seq7  ...-A--ATA---T--------T--G------------T----A---------------------------AATATTAATATAATTATAAAT--T------T--TA-CG-A--------------CA------CTA---A---------A-AA...    seq8  ...-------A----T-C-A-------T---T----C-T---T-C--------------------------ATTAGTAAAATATTAATATTT---------T---A------------------A------A-AT--TC-------T-----A... indel rate ...333333333333333333333333333333333333333333222222222222222222222222221111111111111111111113333333333333333333333333333333333333333333333333333333333333...150 300Figure 6.2: Inferred indel rate categories for alignment columns 150-300 ofone set of simulated data: segments with low estimated deletion rate (µ̂1 =0.006) are in purple; segments with intermediate deletion rate (µ̂2 = 0.848)are in yellow; segments with high deletion rate (µ̂3 = 4.150) are in blue.The average running times of 100 simulation runs are: 4.08 seconds forPhyML, 12.81 seconds for CTMC+NJ, 145.00 seconds for PIP+NJ, 304.151116.3. Simulation studiesseconds for GeoPIP3+NJ, and 270.71 seconds for GeoPIP5+NJ.6.3.4 Misspecified synthetic examples using softwareINDELible and MUSCLEWe consider generating data using other popular indel models. We use thesoftware INDELible to generate data in this section. INDELible providesseveral options for both the indel model and the substitution model, and italso allows data to be generated in blocks with different indel models andsubstitution models.When data were generated using INDELible, the GeoPIP+NJ methodutilizes both indels and substitutions to reconstruct the phylogenetic tree,but the indel model is misspecified, while the CTMC+NJ method utilizesonly substitutions which are correctly specified. Therefore, the comparisonof results from GeoPIP+NJ and results from CTMC+NJ illustrates thepotential gain or loss of modelling indels using a misspecified indel model inreal applications.In a real application, the multiple sequence alignment is usually un-known. We use MUSCLE to obtain an alignment, then use this alignmentfor inference. We compare results obtained using the MUSCLE estimatedalignment with the results obtained using the true alignment generated byINDELible. MUSCLE does not require an input tree to estimate the align-ment, so it can be used to obtain an estimated alignment before running ourinference method when the alignment is unknown.Simulation setupWe simulate data on a perfect binary tree with 16 leaves and branch lengthb = 0.05 for all branches using INDELible. The total branch length for thistree is 1.5.We consider two simulation scenarios. First, we simulate two blockswith the same indel length distribution but different indel rates: the indellength distribution is set as a negative binomial with parameter r = 1 andp = 0.1 and the indel rate is set as 0.05 and 0.25 (same insertion and deletion1126.3. Simulation studiesrate within each block). The initial length is set to be 50 for both blocks.Second, we simulate three blocks with different indel length distributionsand different indel rates: a negative binomial indel length distribution withparameter r = 1 and p = 0.1, no indels for the second block and a powerlaw indel length distribution (Fletcher and Yang, 2009) with parameter 1.7and maximum length 30. The indel rate is 0.2 for the first block and 0.05for the third block. The initial length is set to be 30 for all three blocks.Simulation resultsTable 6.5 shows that for the first simulation scenario, GeoPIP5+NJ andPIP+NJ outperform CTMC+NJ and PhyML in terms of the RF and thewRF of the scaled trees, on both the true alignment and the MUSCLEalignment. The GeoPIP5+NJ and PIP+NJ also outperform CTMC+NJand PhyML in terms of the wRF of the unscaled trees on the true alignment,but not on the MUSCLE alignment. For the second simulation scenario,GeoPIP5+NJ and PIP+NJ outperform CTMC+NJ (but not PhyML) interms of RF, but not in terms of wRF.The results show that even when the indel model is misspecified, theGeoPIP5+NJ may still achieve a more accurate phylogenetic tree estimate,compared to the correctly-specified model CTMC+NJ that relies on thesubstitution only. The improvement in accuracy may depend on the trueindel models. When the true alignment is not available, using the MUSCLEalignment provides an alternative to apply the GeoPIP5+NJ method whichrequires a fixed alignment.On the other hand, PhyML always outperforms CTMC+NJ in all sce-narios, which shows the benefits of the likelihood approach versus the NJapproach in general, and the magnitude of potential improvement if theGeoPIP is incorporated into a full likelihood inference approach in futurework. At the same time, the comparison between the results using the truealignment and the MUSCLE alignment shows the potential gain in accuracyif the GeoPIP is incorporated into a joint inference of phylogenetic tree andalignment in future work.1136.4. Data analysisTable 6.5: Simulation results on synthetic data generated from the softwareINDELible and aligned by the software MUSCLE. The true tree is a perfectbinary tree of 16 leaves with the same branch length b = 0.05 for all branches(see Figure 6.1B). The true alignment generated using INDELible and theestimated alignment using the software MUSCLE are both considered. Inthis table, NB+NB indicates that the data are generated using two blockswith the same indel length model (negative binomial with parameter 1 and0.1) but different indel rates (0.05 and 0.25 respectively), NB+SUB+POWindicates that the data are generated using three blocks with different indellength models (a negative binomial distribution with parameter 1 and 0.1,a substitution model with no indels, and a power law distribution with pa-rameter 1.7 and maximum 30), and different indel rates (0.2 for the negativebinomial block and 0.1 for the power law block).wRF (unscaled trees) wRF (scaled trees) RFParameter Method mean (s.e.) median mean (s.e.) median meanPhyML 0.612 (0.009) 0.614 0.400 (0.006) 0.397 1.06 (0.14)true alignment CTMC+NJ 0.653 (0.010) 0.664 0.427 (0.007) 0.423 1.40 (0.16)NB+NB PIP+NJ 0.544 (0.008) 0.547 0.356 (0.006) 0.360 0.38 (0.10)GeoPIP5+NJ 0.548 (0.009) 0.550 0.358 (0.006) 0.364 0.40 (0.10)PhyML 1.301 (0.017) 1.306 0.433 (0.008) 0.419 1.68 (0.20)MUSCLE alignment CTMC+NJ 1.349 (0.016) 1.355 0.442 (0.007) 0.442 1.86 (0.18)NB+NB PIP+NJ 1.384 (0.014) 1.390 0.403 (0.007) 0.403 1.26 (0.15)GeoPIP5+NJ 1.349 (0.014) 1.357 0.408 (0.007) 0.405 1.32 (0.17)PhyML 0.653 (0.011) 0.641 0.426 (0.007) 0.422 1.24 (0.16)true alignment CTMC+NJ 0.681 (0.011) 0.670 0.443 (0.007) 0.441 1.68 (0.17)NB+SUB+POW PIP+NJ 0.724 (0.013) 0.719 0.472 (0.008) 0.472 1.02 (0.15)GeoPIP5+NJ 0.724 (0.014) 0.712 0.468 (0.008) 0.462 1.08 (0.15)PhyML 1.393 (0.015) 1.393 0.449 (0.008) 0.436 1.74 (0.18)MUSCLE alignment CTMC+NJ 1.432 (0.015) 1.426 0.459 (0.007) 0.445 2.24 (0.21)NB+SUB+POW PIP+NJ 1.589 (0.020) 1.585 0.471 (0.009) 0.458 1.88 (0.22)GeoPIP5+NJ 1.549 (0.020) 1.523 0.479 (0.010) 0.466 2.18 (0.23)6.4 Data analysisIn this section, we apply our methods to a real data set. We compareresults obtained using our methods and other tree reconstruction methods,and show some examples of inferred segmentations in real alignments.1146.4. Data analysis6.4.1 Molluscan ribosomal RNA dataMolluscs are a diverse group of well studied animals, but many phylogeneticrelationships among molluscan species are still unresolved (Smith et al.,2011). Because of the vast diversity within this large group of species, in-sertions and deletions of nucleotides are prevalent in molluscan ribosomalRNA (rRNA) alignments.Lydeard et al. (2000) conducted a comparative analysis of complete mito-chondrial large subunit (LSU) rRNA sequences of 10 molluscan species andtwo outgroups (L. terrestris and D. melanogaster), and obtained the MSAsof these sequences based on their secondary structure. Smith et al. (2011)obtained a different tree for some sub-groups of molluscs, in particular group-ing Gastropoda with Bivalvia, instead of Gastropoda with Cephalopoda. Afew other hypotheses on sub-grouping of molluscs can also be found in Kocotet al. (2011); Smith et al. (2011).We re-analyze the dataset of Lydeard et al. (2000) using the followingmethods: CTMC+NJ, PIP+NJ, GeoPIP+NJ with four indel rates (denotedas GeoPIP4), PhyML with four substitution rates (denoted PhyML4), andBAli-Phy (Suchard and Redelings, 2006; Redelings and Suchard, 2007), astate-of-the-art Bayesian approach that takes long indels into account tosimultaneously estimate both alignment and phylogeny. In the BAli-Phyexperiments, we used RS07+GTR (Redelings and Suchard, 2007) as theevolutionary model, and 10 000 MCMC iterations (10% burn-in).Although a complete classification of molluscan species is still an openquestion, some hypotheses on the evolutionary history of molluscan speciesare widely supported by the fossil record (Lydeard et al., 2000; Smith et al.,2011). We described these widely supported hypotheses in term of referenceclades of a phylogenetic tree in Table 6.6. We use these reference clades toassess the quality of the inferred trees.This data set can be downloaded from http://www.rna.icmb.utexas.edu/SIM/4D/Mollusk/alignment.gb.1156.4. Data analysisTable 6.6: Description of the reference clades used for validation in terms ofthe species available in the dataset of Lydeard et al. (2000).Reference clade ConstituentsClausiliidae A. turrita and A. coeruleaHelicoidea E. herktotsi and C. nemoralisHerterobranchia Clausiliidae and HelicoideaBivalvia P. maximus and M. edulisCerithioidea P. paludiformis and Cac. lacertina6.4.2 ResultsWe ran each method on the full dataset, as well as on the subset excludingthe two outgroups. Table 6.7 reports whether the reference clades were cor-rectly reconstructed for all algorithm and data configurations. Among thethree indel methods, both GeoPIP and BAli-Phy reconstruct all the refer-ence clades, while the PIP reconstruction (from data excluding outgroups)fails to reconstruct one of the clades (Bivalvia). This supports that usingconstant rate, point-indel models can confound phylogenetic tree inference.Table 6.7: Comparison of the clades identified by different methods, whenthe two outgroups are added, and in parentheses, when the outgroups areexcluded. In this table, “1” indicates that the clade has been identifiedby the corresponding tree inference method (column), and “0” indicatesthat the clade has not been identified. Maximum-parsimony (MP) trees aretaken from Lydeard et al. (2000), “MP(t.o.)” stands for MP analysis fromtransversions only (i.e., the substitution of a purine for a pyrimidine or viceversa), and “BAli,” for BAli-Phy..Substitution-based Indel-awareReference clade MP MP(t.o.) PhyML4 BAli PIP GeoPIPClausiliidae 1 (1) 1 (1) 1 (1) 1 (1) 1 (1) 1 (1)Helicoidea 1 (1) 1 (1) 1 (1) 1 (1) 1 (1) 1 (1)Herterobranchia 1 (1) 1 (1) 1 (1) 1 (1) 1 (1) 1 (1)Bivalvia 0 (0) 1 (1) 1 (1) 1 (1) 1 (0) 1 (1)Cerithioidea 1 (1) 1 (1) 1 (1) 1 (1) 1 (1) 1 (1)We show in Figure 6.3 the trees inferred by the indel-aware methods with1166.4. Data analysisand without outgroups. Prompted by the observation of Smith et al. (2011)that molluscan phylogenetic trees are influenced by the choice of outgroups,we assessed the robustness of each method by measuring the wRF distanceand the RF distance between the tree inferred without outgroup and thesubtree obtained after exclusion of the two outgroups from the tree inferredfrom the full dataset. Figure 6.3 shows that the wRF distance between thetwo GeoPIP trees is 0.253, which compares favorably to the wRF distancebetween results from other indel-aware methods. The RF distances tell adifferent story where the GeoPIP has the largest value of 4 due to the changeof the placement of K.tunicata. However, the total branch length K.tunicatatravels is very small (0.042), which explains why the ∆wRF is small eventhough ∆RF is 4.A.coeruleaA.turritaE.herklotsiC.nemoralisCac.lacertinaP.paludiformisP.maximusM.edulisL.terrestrisK.tunicataD.melanogasterL.bleekeri0.1550.3521.0250.8620.1390.3250.5730.7421.4670.0750.2790.7740.7320.4620.2770.4070.3060.41.5821.2411.466Cac.lacertinaP.paludiformisL.terrestrisM.edulisP.maximusC.nemoralisE.herklotsiA.coeruleaA.turritaD.melanogasterL.bleekeriK.tunicata0.0240.0160.0750.0540.3650.2450.1720.1030.150.0270.0360.2030.2670.2220.0420.3510.470.0970.2110.1870.373A.coeruleaA.turritaD.melanogasterL.bleekeriK.tunicataCac.lacertinaP.paludiformisL.terrestrisM.edulisP.maximusC.nemoralisE.herklotsi0.0350.0720.0410.0510.1130.1210.0650.1170.1120.250.140.1570.2060.1390.0840.1470.1280.0620.2750.2980.282C.nemoralisE.herklotsiA.coeruleaA.turritaD.melanogasterL.terrestrisCac.lacertinaP.paludiformisK.tunicataL.bleekeriM.edulisP.maximus0.0270.0060.0070.0750.160.1380.1840.2740.0270.230.3390.0120.1350.0290.2610.1650.0940.090.1150.3860.328C.nemoralisE.herklotsiA.coeruleaA.turritaD.melanogasterL.terrestrisK.tunicataL.bleekeriCac.lacertinaP.paludiformisM.edulisP.maximus0.0330.0040.010.2040.3340.0830.1720.1440.0370.2960.3630.0120.250.0570.370.2120.1240.1190.1310.4470.42CTMCPhyML BAli-Phy GeoPIP4PIPNo outgroupsWith outgroups∆wRF (∆RF)of subtrees0.136 (2)↑↓0.862 (0)↑↓0.265 (0)↑↓0.418 (2)↑↓0.258 (4)↑↓A.coeruleaA.turritaC.nemoralisE.herklotsiK.tunicataL.bleekeriCac.lacertinaP.paludiformisM.edulisP.maximus0.1591.2770.9150.1420.3420.5541.4840.780.3090.3180.3930.3420.5771.3721.6490.4541.025C.nemoralisE.herklotsiA.coeruleaA.turritaM.edulisCac.lacertinaP.paludiformisK.tunicataL.bleekeriP.maximus0.0310.0220.0740.0490.350.2520.1760.1070.1540.0390.3510.4540.1160.210.1910.2280.287A.coeruleaA.turritaL.bleekeriK.tunicataCac.lacertinaP.paludiformisM.edulisP.maximusC.nemoralisE.herklotsi0.0330.0720.0690.1270.1290.0740.2540.2570.1270.1340.1130.0660.1120.0960.2060.1460.191C.nemoralisE.herklotsiA.coeruleaA.turritaM.edulisCac.lacertinaP.paludiformisK.tunicataL.bleekeriP.maximus0.0130.0310.1940.0330.2990.1880.110.1080.130.2720.3930.0160.0830.1810.160.2150.31C.nemoralisE.herklotsiA.coeruleaA.turritaL.bleekeriM.edulisP.maximusCac.lacertinaP.paludiformisK.tunicata0.0270.0150.0870.1830.1610.2320.0380.3350.3840.3120.0540.3830.2260.1350.1370.1360.334Figure 6.3: Trees reconstructed by the three indel-aware methods (columns)for the data with and without outgroups (rows). The five numbers measurethe wRF distance and the RF distance (in brackets) between each of thebottom trees and the corresponding top subtree obtained after excludingthe two outgroups.Moreover, one of the two outgroups, D. melanogaster, is severely mis-1176.4. Data analysisA.coerulea      ...-----GAAATTCTATTTATATATT--ACAACCG------GTCAGTATACTAATAATGT-ATCAATTAGTAAATAGG----------------------------------------------------------TATAATAAATATTAGT...A.turrita       ...-----TCTATTCTACTA-TGTATT----TCATT---------AGTGTATTGATAATGT-ATCAATTAGTAAAATTT---------------------------------------------------------ATATTTAAAATGTTGAT...C.nemoralis     ...-----ACTAGTGTGCCG-CGGG-TG---TTTATT-------TGTTTTGTTGATAATGT-TGTCATAAGTAGC--------------------------------------------------------------ACAT-TAAATAGCCTG...Cac.lacertina   ...-----AGAATGGAAGTA-TAGTTTT-T-AATAT-------AAAAACTTATGCTAATGC-TAGGATGAGTATTTT----ACCAGTTGGT-TCGCGGGCTAATACATTAAATTAGCCCATTTTTCTATAAAAAC---TTTTAAATTAGGAAG...E.herklotsi     ...-----AGTAATTAGTAA-TAGTTAA---GTTTAAA------TTT-CTTTTGATAATGT-TTATATTAGTATAAA--------------------------------------------------------------CGACATACTTCTAT...K.tunicata      ...-----AATATCTTTTCTTATTTAA----AATTTAAAAAC---TTAAATTAGTTTATGT-TAAAATGAGTATTAAAA-TCAAATATGTTTTAAAGCCCAAGAATAAAAACTTTAAACAAAAAA------------TATTATAAATGTAAAA...L.bleekeri      ...-----GAAGTTAATAAA-TGGTTATT--AATGTA------AGTAACTAAATATAATGT-TAATATGAGTATTAGGT--AAACATATTTATATGTTTGTGGTGGTTATAATTGTAAATAGTTATAAGTTTTGTA-TAAGTAAAATAAAATA...M.edulis        ...A----AAAATAAGCCTA-TGATC-----GGAG-----------GGTCAGTAATTATGATAAAAACGAGTAAAAATAT-TATTTTTTGGTTCACTGTTTTGTTTTACCTTAAAGATTACGGC-------------TAAATAAAGGTTTTTA...P.maximus       ...TGTGTTGTACGGAATCT-CTTTTT-GA-AACAATAA---TT-ATAGAGGTGATAATGT-AAAGATGAGTAGTCGAGA---AGTTGAAGTTTTTAAGGTCGGGGAAAGAAACGAAGGTATGTTCCGT-------ATTTTTAAAAAAAGCTA...P.paludiformis  ...-----ACAATAAGGGTA-AAA-AGT---AAAACTAATAA--GCTCAATTTATTTATGC-TAAAATGAGTATTTTCAC-ACCCACAGTAGT-AAACACTAAAGGTACCTATAGAATAAGCTTAAAAAG----------TTAAAATAAAAAG...indel rate      ...444442222222222222222222444444444444444444111111111111111111111111111111333333333333333333333333333333333333333333333333344444444444444411111111111111...701 850Figure 6.4: Inferred indel rate categories for alignment columns 701-850 ofmolluscan data: segments with lowest deletion rate (µ̂1 = 0.01) are in purple;segments with low deletion rate (µ̂2 = 0.11) in are yellow; segments withhigh deletion rate (µ̂3 = 0.41) are in blue; segments with highest deletionrate (µ̂4 = 1.27) are in red.placed in the CTMC tree, the PhyML tree, and the BAli-Phy tree. Thiscan be explained by the fact that substitution-only models and some indelmodels cannot overcome the erroneous attraction due to the similar basecompositions of D. melanogaster and L.bleekeri. To restore correct place-ment, a pruning and regraft operation would require moving the stem of thatoutgroup by a total branch length of 0.861 (four branches) in the PhyML treeand 0.199 (four branches) in the BAli-Phy tree. In contrast, the placementof D. melanogaster is greatly improved in both the GeoPIP and PIP trees,requiring moving the stem by a total branch length of 0.012 (one branch)for both the PIP tree and the GeoPIP tree.Figure 6.4 shows a subset of an inferred segmentation of the molluscandata. The four estimated deletion rates are µ̂1 = 0.01, µ̂2 = 0.11, µ̂3 = 0.41and µ̂4 = 1.27. Similar results are obtained when 6 indel rates are usedinstead of 4 indel rates or when β = λi/µi is set to 10 as the initial valueinstead of 20, which shows that the choice of category numbers for indelrates is not critical as long as it is large enough to allow sufficient indel ratevariations. The choice of initial segment lengths does not markedly affectthe results as long as this choice falls into a reasonable range.The running times are: 33.8 seconds for PhyML, 5.2 minutes for PIP+NJ,48.9 minutes for GeoPIP+NJ, and 1 day and 3 hours for BAli-Phy (10 000iterations).1186.5. Discussion6.5 DiscussionWith the exception of hand-coded indel characters, mainstream methods forphylogenetic tree reconstruction have been refractory to the incorporation ofthe indel information present in the sequence data. Our experiments suggestthat one potential factor behind this is that single rate point indel modelstend to lack robustness when doing phylogenetic tree inference.We show that a simple model of indel rate variation can restore ro-bustness while improving the quality of the reconstructed phylogenies. Themodel is simple, both in the sense that its running time is the same asexisting pure-substitution reconstruction algorithms, and also that its im-plementation involves components already present in standard phylogeneticsoftware toolboxes. In particular, a promising direction is to combine othertree inference methods with the GeoPIP model, for example, Bayesian treereconstruction methods (Li, 1996; Mau, 1996; Huelsenbeck and Ronquist,2001; Drummond et al., 2012). The Bayesian approach would provide theadditional advantage of outputting credible intervals.Alignment uncertainty is an important related issue. Using a point esti-mate for the alignment can cause underestimation of tree uncertainty down-stream, and alignment errors can confound tree reconstruction (Suchard andRedelings, 2006; Redelings and Suchard, 2007; Wong et al., 2008). To ad-dress these issues while still taking indel rate heterogeneity into account,our model could be integrated into a Bayesian or maximum likelihood co-estimation method (Lunter et al., 2005a; Suchard and Redelings, 2006; Re-delings and Suchard, 2007; Liu et al., 2012).Source code can be obtained from https://github.com/yzhai220/geopip.119Chapter 7Conclusions and future work7.1 ConclusionsIn this thesis, we considered two types of data for phylogenetic inference.We proposed new evolutionary models, statistical inference methods andefficient algorithms for reconstructing phylogenetic trees for continuous sin-gle nucleotide polymorphism (SNP) frequency data and categorical multiplesequence alignment (MSA) data.For nucleotide sequence data collected from individuals of different pop-ulations within the same species, we presented a simple and effective methodfor reconstructing rooted trees from multi-population SNP frequencies. Bymodelling fixation in a principled way, and generalizing neighbor-joiningmethods, we can perform joint tree and rooting inference without requiringan outgroup. We have shown in simulations and real data that this rootingcan be more accurate despite using less data. The method is also compu-tationally efficient with time complexity O(Ln2), where L is the number ofsites and n is the number of populations.For nucleotide sequence data collected from individuals of different species,we presented a simple model of indel rate variation and an efficient algorithmfor reconstructing trees from a fixed MSA. We showed that single rate pointindel models tend to lack robustness when doing phylogenetic tree infer-ence if indel information presented in the sequence data is ignored. Weshowed that our new model of indel rate variation can restore robustnesswhile improving the quality of the reconstructed phylogenies. The modelis simple, both in the sense that its running time is the same as existingpure-substitution reconstruction algorithms, and also that its implementa-tion involves components already present in standard phylogenetic software1207.2. Future worktoolboxes.7.2 Future workWe first discuss potential extensions of our FixNormal model and ANJframework proposed in Chapter 3 and Chapter 4, and then discuss potentialextensions of our GeoPIP model proposed in Chapter 5 and Chapter 6.7.2.1 Extension of the FixNormal model and ANJframeworkGenetic admixtureWe ignored the effect of genetic admixture in this thesis. Genetic admixtureis defined as the gene exchange of two or more previously isolated popula-tions through interbreeding. We assumed that populations evolve indepen-dently after separation from their most recent common ancestor. However,an admixed population can be formed as a result of interbreeding of sepa-rated populations, when a geographic barrier separating populations, suchas a river or isthmus, is removed. Pickrell et al. (2012) provided evidence ofgenetic admixture among African populations, and showed that the effectof genetic admixture could potentially bias the phylogenetic analysis.To take admixtures into account, we could design rooted network infer-ence (Finnila¨ et al., 2001) algorithms from dissimilarity matrices instead ofbifurcating phylogenetic trees. Note also that it may be feasible to detectadmixtures automatically by looking at iterations of Algorithm 2 where nopopulation pair satisfies the 2-MRMC (this property is only guaranteed tohold for matrices derived from a tree). We have not detected such cases inour data analysis, suggesting that the tree approximation might be reason-able at the tree scales of the HGDP dataset.Although our current rooted tree reconstruction algorithm may be usedto detect genetic admixtures, there is no straight-forward revision availableto incorporate such detected genetic admixtures in the reconstructed phylo-genetic tree. It would be a useful extension of the ANJ algorithm to enable1217.2. Future workphylogenetic network reconstruction when a genetic admixture is detectedby the algorithm.Incorporating sampling errors and ascertainment biasWe ignored sampling errors and ascertainment bias in the FixNormal modeland ANJ framework. The sampling errors at leaf populations can be in-corporated without much difficulty. Since the pairwise asymmetric distancecalculations are based on a probability model, it is possible to include anerror model, for example, a binomial sampling error term, at the cost of amore expensive optimization. Similarly, ascertainment bias may also be in-corporated into the likelihood model via a correction term (Nicholson et al.,2002).On the other hand, our current framework utilized numerical integrationfor evaluating the likelihood function. Adding a sampling error term ora correction term leads to an increase of the dimensionality of numericalintegration and computation cost, which would need to be addressed infuture work.7.2.2 Extension of the GeoPIP modelBayesian phylogenetic inference based on the GeoPIP modelA promising direction is to combine other tree inference methods with theGeoPIP model, for example, Bayesian tree reconstruction methods (Li, 1996;Mau, 1996; Huelsenbeck and Ronquist, 2001; Drummond et al., 2012). Withthe Bayesian approach, the uncertainty of the estimated phylogenetic treecan be easily quantified.The Bayesian approach also provides an easy way to address alignmentuncertainty. In this thesis, we presented the GeoPIP model and an NJ basedinference algorithm for a fixed alignment, which often has to be estimatedusing other computational tools in application. However, using a point esti-mate for the alignment can cause underestimation of tree uncertainty down-stream, and alignment errors can confound tree reconstruction (Suchardand Redelings, 2006; Redelings and Suchard, 2007; Wong et al., 2008). To1227.2. Future workaddress this issue while still taking indel rate heterogeneity into account,the GeoPIP model could be integrated into a Bayesian framework for jointinference of phylogeny and alignment (Suchard and Redelings, 2006; Redel-ings and Suchard, 2007). Such integration is conceptually straight-forwardand potentially fruitful, but it would require considerable implementationefforts.Incorporating indel rate heterotachyIndel rate heterotachy has been measured in certain datasets, for examplepromoter regions (Taylor et al., 2006). The GeoPIP model could be mod-ified to take indel heterotachy into account, for example by splitting andmerging segments at random points of the tree, but at the cost of mak-ing inference significantly more complicated. A similar trade-off is foundin substitution rate variation modelling, where rate variation assumptionsthat ignore heterotachy are often preferred as they are simple and generallyeffective.Investigating the correlation of indel rates and substitution ratesCorrelation of indel and substitution rates (Ananda et al., 2011; Jovelin andCutter, 2013) is another interesting future direction to explore. One simplemethod to model such correlations would be to estimate substitution ratematrices separately for different indel rate regions based on the GeoPIPmodel. The computation cost of rate matrix estimation would only increaseby a factor of m (the number of indel rate categories) using our GeoPIPinference method.123BibliographyAlekseyenko, A. V., Lee, C. J., and Suchard, M. A. 2008. Wagner anddollo: a stochastic duet by composing two parsimonious solos. SystematicBiology , 57(5): 772–784.Ananda, G., Chiaromonte, F., and Makova, K. D. 2011. A genome-wideview of mutation rate co-variation using multivariate analyses. GenomeBiology , 12(3): R27.Balding, D. J. and Nichols, R. A. 1995. A method for quantifying differen-tiation between populations at multi-allelic loci and its implications forinvestigating identity and paternity. Genetica, 96(1): 3–12.Barker, D. 2004. Lvb: parsimony and simulated annealing in the search forphylogenetic trees. Bioinformatics, 20(2): 274–275.Battistuzzi, F. U., Filipski, A., Hedges, S. B., and Kumar, S. 2010. Per-formance of relaxed-clock methods in estimating evolutionary divergencetimes and their credibility intervals. Molecular Biology and Evolution,27(6): 1289–1300.Benner, P., Bacˇa´k, M., and Bourguignon, P.-Y. 2014. Point estimates inphylogenetic reconstructions. Bioinformatics, 30(17): i534–i540.Billera, L. J., Holmes, S. P., and Vogtmann, K. 2001. Geometry of the spaceof phylogenetic trees. Advances in Applied Mathematics, 27(4): 733–767.Bouchard-Coˆte´, A. and Jordan, M. I. 2013. Evolutionary inference via thePoisson indel process. Proceedings of the National Academy of Sciences,110(4): 1160–1166.124BibliographyBouchard-Coˆte´, A., Sankararaman, S., and Jordan, M. I. 2012. Phylogeneticinference via sequential Monte Carlo. Systematic Biology , 61: 579–593.Bryant, D. 2005. On the uniqueness of the selection criterion in neighbor-joining. Journal of Classification, 22(1): 3–15.Bryant, D., Bouckaert, R., Felsenstein, J., Rosenberg, N. A., and RoyChoud-hury, A. 2012. Inferring species trees directly from biallelic genetic mark-ers: bypassing gene trees in a full coalescent analysis. Molecular Biologyand Evolution, 29(8): 1917–1932.Cann, H. M., de Toma, C., Cazes, L., Legrand, M.-F., Morel, V., Piouffre,L., Bodmer, J., Bodmer, W. F., Bonne-Tamir, B., Cambon-Thomsen, A.,et al. 2002. A human genome diversity cell line panel. Science, 296(5566):261.Cann, R. L. 2013. Y weigh in again on modern humans. Science, 341(6145):465–467.Cann, R. L., Stoneking, M., and Wilson, A. C. 1987. Mitochondrial DNAand human evolution. Nature, 325(6099): 31–36.Carvalho, A. B. and Clark, A. G. 1999. Genetic recombination: intron sizeand natural selection. Nature, 401(6751): 344–344.Cavalli-Sforza, L. L. and Feldman, M. W. 2003. The application of moleculargenetic approaches to the study of human evolution. Nature Genetics, 33:266–275.Chakerian, J. and Holmes, S. 2012. Computational tools for evaluatingphylogenetic and hierarchical clustering trees. Journal of Computationaland Graphical Statistics, 21(3): 581–599.Chen, J.-Q., Wu, Y., Yang, H., Bergelson, J., Kreitman, M., and Tian, D.2009. Variation in the ratio of nucleotide substitution and indel ratesacross genomes in mammals and bacteria. Molecular Biology and Evolu-tion, 26(7): 1523–1531.125BibliographyClark, A. G., Hubisz, M. J., Bustamante, C. D., Williamson, S. H., andNielsen, R. 2005. Ascertainment bias in studies of human genome-widepolymorphism. Genome Research, 15(11): 1496–1502.Consortium, T. . G. P. 2010. A map of human genome variation frompopulation-scale sequencing. Nature, 467(7319): 1061–1073.Consortium, T. I. H. 2003. The international hapmap project. Nature,426(6968): 789–796.Drummond, A., Suchard, M., Xie, D., and Rambaut, A. 2012. Bayesianphylogenetics with BEAUti and the BEAST 1.7. Molecular Biology andEvolution, 29: 1969–1973.Edgar, R. C. 2004a. Muscle: a multiple sequence alignment method withreduced time and space complexity. BMC Bioinformatics, 5(1): 113.Edgar, R. C. 2004b. Muscle: multiple sequence alignment with high accuracyand high throughput. Nucleic Acids Research, 32(5): 1792–1797.Edwards, A. 1971. Distances between populations on the basis of genefrequencies. Biometrics, 27(4): 873–881.Edwards, A. and Cavalli-Sforza, L. 1964. Reconstruction of evolutionarytrees. Systematics Association, 6: 67–76.Ellegren, H. 2004. Microsatellites: simple sequences with complex evolution.Nature Reviews Genetics, 5(6): 435–445.Ewens, W. 1973. Conditional diffusion processes in population genetics.Theoretical Population Biology , 4(1): 21–30.Felsenstein, J. 1973. Maximum-likelihood estimation of evolutionary treesfrom continuous characters. American Journal of Human Genetics, 25(5):471–492.Felsenstein, J. 1981a. Evolutionary trees from DNA sequences: a maximumlikelihood approach. Journal of Molecular Evolution, 17(6): 368–376.126BibliographyFelsenstein, J. 1981b. Evolutionary trees from gene frequencies and quan-titative characters: finding maximum likelihood estimates. Evolution,35(6): 1229–1242.Felsenstein, J. 1983. Statistical inference of phylogenies. Journal of theRoyal Statistical Society. Series A (General), pages 246–272.Felsenstein, J. 1989. PHYLIP - Phylogeny inference package (Version 3.2).Cladistics, 5: 164–166.Felsenstein, J. 2004. Inferring Phytogenies. Sinauer Associates, Incorpo-rated.Finnila¨, S., Lehtonen, M. S., and Majamaa, K. 2001. Phylogenetic networkfor european mtdna. The American Journal of Human Genetics, 68(6):1475–1484.Fitch, W. M. and Margoliash, E. 1967. A method for estimating the numberof invariant amino acid coding positions in a gene using cytochrome c asa model case. Biochemical Genetics, 1(1): 65–71.Fletcher, W. and Yang, Z. 2009. Indelible: a flexible simulator of biologicalsequence evolution. Molecular Biology and Evolution, 26(8): 1879–1888.Francalacci, P., Morelli, L., Angius, A., Berutti, R., Reinier, F., Atzeni,R., Pilu, R., Busonero, F., Maschio, A., Zara, I., et al. 2013. Low-passDNA sequencing of 1200 sardinians reconstructs european y-chromosomephylogeny. Science, 341(6145): 565–569.Gascuel, O. 1997. Bionj: an improved version of the nj algorithm based ona simple model of sequence data. Molecular Biology and Evolution, 14(7):685–695.Gascuel, O. et al. 1997. Concerning the nj algorithm and its unweightedversion, unj. Mathematical Hierarchies and Biology , 37: 149–171.127BibliographyGray, R. D. and Atkinson, Q. D. 2003. Language-tree divergence timessupport the anatolian theory of indo-european origin. Nature, 426(6965):435–439.Gray, R. D., Drummond, A. J., and Greenhill, S. J. 2009. Language phylo-genies reveal expansion pulses and pauses in pacific settlement. Science,323(5913): 479–483.Guindon, S., Dufayard, J.-F., Lefort, V., Anisimova, M., Hordijk, W., andGascuel, O. 2010. New algorithms and methods to estimate maximum-likelihood phylogenies: assessing the performance of PhyML 3.0. System-atic Biology , 59(3): 307–321.Hasegawa, M., Kishino, H., and Yano, T.-a. 1985. Dating of the human-apesplitting by a molecular clock of mitochondrial DNA. Journal of MolecularEvolution, 22(2): 160–174.Hirschberg, D. S. 1975. A linear space algorithm for computing maximalcommon subsequences. Communications of the ACM , 18(6): 341–343.Hirschfeld, L. and Hirschfeld, H. 1919. Essai d’application des methodesserologiques au probleme des races. Anthropologie, 29: 505–537.Hobolth, A. and Yoshida, R. 2005. Maximum likelihood estimation of phy-logenetic tree and substitution rates via generalized neighbor-joining andthe EM algorithm. In Proceedings of the 1st international conference onalgebraic biology , pages 41–50, Tokyo. Universal Academy Press.Holmes, I. 2003. Using guide trees to construct multiple-sequence evolution-ary HMMs. Bioinformatics, 19(suppl 1): i147–i157.Holmes, I. and Bruno, W. J. 2001. Evolutionary HMMs: a Bayesian ap-proach to multiple alignment. Bioinformatics, 17(9): 803–820.Huelsenbeck, J. P. and Ronquist, F. 2001. MRBAYES: Bayesian inferenceof phylogenetic trees. Bioinformatics, 17(8): 754–755.128BibliographyHuelsenbeck, J. P., Bollback, J. P., and Levine, A. M. 2002. Inferring theroot of a phylogenetic tree. Systematic Biology , 51(1): 32–43.Iwabe, N., Kuma, K.-i., Hasegawa, M., Osawa, S., and Miyata, T. 1989.Evolutionary relationship of archaebacteria, eubacteria, and eukaryotesinferred from phylogenetic trees of duplicated genes. Proceedings of theNational Academy of Sciences, 86(23): 9355–9359.Jenkins, P. A. and Spano, D. 2015. Exact simulation of the Wright-Fisherdiffusion. arXiv preprint arXiv:1506.06998 .Jensen, J. L. and Hein, J. 2005. Gibbs sampler for statistical multiplealignment. Statistica Sinica, pages 889–907.Jovelin, R. and Cutter, A. D. 2013. Fine-scale signatures of molecular evolu-tion reconcile models of indel-associated mutation. Genome Biology andEvolution, 5(5): 978–986.Klosterman, P. S., Uzilov, A. V., Bendan˜a, Y. R., Bradley, R. K., Chao, S.,Kosiol, C., Goldman, N., and Holmes, I. 2006. XRate: a fast prototyping,training and annotation tool for phylo-grammars. BMC Bioinformatics,7: 428.Knudsen, B. and Miyamoto, M. M. 2003. Sequence alignments and pairhidden Markov models using evolutionary history. Journal of MolecularBiology , 333(2): 453–460.Kocot, K. M., Cannon, J. T., Todt, C., Citarella, M. R., Kohn, A. B.,Meyer, A., Santos, S. R., Schander, C., Moroz, L. L., Lieb, B., et al. 2011.Phylogenomics reveals deep molluscan relationships. Nature, 477(7365):452–456.Kuhner, M. K. and Felsenstein, J. 1994. A simulation comparison of phy-logeny algorithms under equal and unequal evolutionary rates. MolecularBiology and Evolution, 11(3): 459–468.129BibliographyKvikstad, E. M. and Duret, L. 2014. Strong heterogeneity in mutation ratecauses misleading hallmarks of natural selection on indel mutations in thehuman genome. Molecular Biology and Evolution, 31(1): 23–36.Leushkin, E. V. and Bazykin, G. A. 2013. Short indels are subject toinsertion-biased gene conversion. Evolution, 67(9): 2604–2613.Levy, D., Yoshida, R., and Pachter, L. 2006. Beyond pairwise distances:Neighbor-joining with phylogenetic diversity estimates. Molecular Biologyand Evolution, 23(3): 491–498.Li, J. Z., Absher, D. M., Tang, H., Southwick, A. M., Casto, A. M., Ra-machandran, S., Cann, H. M., Barsh, G. S., Feldman, M., Cavalli-Sforza,L. L., et al. 2008. Worldwide human relationships inferred from genome-wide patterns of variation. Science, 319(5866): 1100–1104.Li, S. 1996. Phylogenetic Tree Construction using Markov Chain MonteCarlo. Ph.D. thesis, Ohio State University.Li, S., Pearl, D. K., and Doss, H. 2000. Phylogenetic tree constructionusing Markov chain Monte Carlo. Journal of the American StatisticalAssociation, 95(450): 493–508.Lieberman, D. E. 2012. Human evolution: Those feet in ancient times.Nature, 483(7391): 550–551.Lipo, C. P. 2006. Mapping our ancestors: Phylogenetic approaches in an-thropology and prehistory . Transaction Publishers.Liu, K., Nelesen, S., Raghavan, S., Linder, C. R., and Warnow, T. 2009.Barking up the wrong treelength: the impact of gap penalty on alignmentand tree accuracy. IEEE/ACM Transactions on Computational Biologyand Bioinformatics, 6: 7–21.Liu, K., Warnow, T. J., Holder, M. T., Nelesen, S. M., Yu, J., Stamatakis,A. P., and Linder, C. R. 2012. SATe-II: very fast and accurate simulta-neous estimation of multiple sequence alignments and phylogenetic trees.Systematic Biology , 61(1): 90–106.130BibliographyLo¨ytynoja, A. and Goldman, N. 2008. A model of evolution and structurefor multiple sequence alignment. Philosophical Transactions of the RoyalSociety B: Biological Sciences, 363(1512): 3913–3919.Lunter, G. 2007. Probabilistic whole-genome alignments reveal high indelrates in the human and mouse genomes. Bioinformatics, 23(13): i289–i296.Lunter, G., Miklo´s, I., Drummond, A., Jensen, J., and Hein, J. 2005a.Bayesian coestimation of phylogeny and sequence alignment. BMC Bioin-formatics, 6: 83.Lunter, G., Drummond, A. J., Miklo´s, I., and Hein, J. 2005b. Statisticalalignment: recent progress, new applications, and challenges. In StatisticalMethods in Molecular Evolution, pages 375–405. Springer.Lunter, G., Ponting, C. P., and Hein, J. 2006. Genome-wide identification ofhuman functional DNA using a neutral indel model. PLoS ComputationalBiology , 2(1): e5.Lydeard, C., Holznagel, W. E., Schnare, M. N., and Gutell, R. R. 2000.Phylogenetic analysis of molluscan mitochondrial LSU rDNA sequencesand secondary structures. Molecular Phylogenetics and Evolution, 15(1):83–102.Mau, B. 1996. Bayesian Phylogenetic Inference via Markov Chain MonteCarlo Methods. Ph.D. thesis, University of Wisconsin, Madison.Miklo´s, I. 2003. Algorithm for statistical alignment of two sequences derivedfrom a Poisson sequence length distribution. Discrete Applied Mathemat-ics, 127(1): 79–84.Miklos, I. and Toroczkai, Z. 2001. An improved model for statistical align-ment. In First Workshop on Algorithms in Bioinformatics, Berlin, Hei-delberg. Springer-Verlag.131BibliographyMiklo´s, I., Lunter, G., and Holmes, I. 2004. A long indel model for evo-lutionary sequence alignment. Molecular Biology and Evolution, 21(3):529–540.Mills, R. E., Luttig, C. T., Larkins, C. E., Beauchamp, A., Tsui, C., Pittard,W. S., and Devine, S. E. 2006. An initial map of insertion and deletion(indel) variation in the human genome. Genome Research, 16(9): 1182–1190.Mouchiroud, D., D’Onofrio, G., Aı¨ssani, B., Macaya, G., Gautier, C., andBernardi, G. 1991. The distribution of genes in the human genome. Gene,100: 181–187.Nachman, M. W. and Crowell, S. L. 2000. Estimate of the mutation rateper nucleotide in humans. Genetics, 156(1): 297–304.Nam, K. and Ellegren, H. 2012. Recombination drives vertebrate genomecontraction. PLoS Genetics, 8(5): e1002680.Nei, M. 1972. Genetic distance between populations. American Naturalist ,106(949): 283–292.Nicholls, G. K. and Gray, R. D. 2008. Dated ancestral trees from binary traitdata and their application to the diversification of languages. Journal ofthe Royal Statistical Society: Series B , 70(3): 545–566.Nichols, J. and Warnow, T. 2008. Tutorial on computational linguistic phy-logeny. Language and Linguistics Compass, 2(5): 760–820.Nicholson, G., Smith, A. V., Jo´nsson, F., Gu´stafsson, O´., Stefa´nsson, K.,and Donnelly, P. 2002. Assessing population differentiation and isolationfrom single-nucleotide polymorphism data. Journal of the Royal StatisticalSociety: Series B , 64(4): 695–715.Outlaw, D. C. and Ricklefs, R. E. 2011. Rerooting the evolutionary treeof malaria parasites. Proceedings of the National Academy of Sciences,108(32): 13183–13187.132BibliographyOwen, M. and Provan, J. S. 2011. A fast algorithm for computing geodesicdistances in tree space. IEEE/ACM Transactions on Computational Bi-ology and Bioinformatics, 8(1): 2–13.Paradis, E. 2011. Analysis of Phylogenetics and Evolution with R. Springer.Paradis, E., Claude, J., and Strimmer, K. 2004. Ape: analyses of phyloge-netics and evolution in R language. Bioinformatics, 20(2): 289–290.Pearson, T., Hornstra, H. M., Sahl, J. W., Schaack, S., Schupp, J. M.,Beckstrom-Sternberg, S. M., O’Neill, M. W., Priestley, R. A., Champion,M. D., Beckstrom-Sternberg, J. S., et al. 2013. When outgroups fail; phy-logenomics of rooting the emerging pathogen, coxiella burnetii. SystematicBiology , 62(5): 752–762.Penny, D. and Hendy, M. 1985. The use of tree comparison metrics. Sys-tematic Zoology , 34(1): 75–82.Pickrell, J. K. and Pritchard, J. K. 2012. Inference of population splits andmixtures from genome-wide allele frequency data. PLoS Genetics, 8(11).Pickrell, J. K., Patterson, N., Barbieri, C., Berthold, F., Gerlach, L., Gu¨lde-mann, T., Kure, B., Mpoloka, S. W., Nakagawa, H., Naumann, C., et al.2012. The genetic prehistory of southern africa. Nature Communications,3(1143): 1–6.Poznik, G. D., Henn, B. M., Yee, M.-C., Sliwerska, E., Euskirchen, G. M.,Lin, A. A., Snyder, M., Quintana-Murci, L., Kidd, J. M., Underhill, P. A.,et al. 2013. Sequencing y chromosomes resolves discrepancy in time tocommon ancestor of males versus females. Science, 341(6145): 562–565.Pritchard, J. K., Seielstad, M. T., Perez-Lezaun, A., and Feldman, M. W.1999. Population growth of human y chromosomes: a study of y chro-mosome microsatellites. Molecular Biology and Evolution, 16(12): 1791–1798.133BibliographyRedelings, B. D. and Suchard, M. A. 2007. Incorporating indel informationinto phylogeny estimation for rapidly emerging pathogens. BMC Evolu-tionary Biology , 7(1): 40.Revell, L. J. 2012. Phytools: an R package for phylogenetic comparativebiology (and other things). Methods in Ecology and Evolution, 3(2): 217–223.Robinson, D. and Foulds, L. 1979. Comparison of weighted labelled trees.In Combinatorial Mathematics VI , pages 119–126. Springer.Roch, S. 2006. A short proof that phylogenetic tree reconstruction by max-imum likelihood is hard. IEEE/ACM Transactions on ComputationalBiology and Bioinformatics, 3(1): 92–94.Ronquist, F. and Huelsenbeck, J. P. 2003. MrBayes 3: Bayesian phylogeneticinference under mixed models. Bioinformatics, 19(12): 1572–1574.Rosenberg, N. A., Pritchard, J. K., Weber, J. L., Cann, H. M., Kidd, K. K.,Zhivotovsky, L. A., and Feldman, M. W. 2002. Genetic structure of humanpopulations. Science, 298(5602): 2381–2385.RoyChoudhury, A., Felsenstein, J., and Thompson, E. A. 2008. A two-stage pruning algorithm for likelihood computation for a population tree.Genetics, 180(2): 1095–1105.Ryder, R. J. and Nicholls, G. K. 2011. Missing data in a stochastic dollomodel for binary trait data, and its application to the dating of proto-indo-european. Journal of the Royal Statistical Society: Series C , 60(1):71–92.Saitou, N. and Nei, M. 1987. The neighbor-joining method: a new methodfor reconstructing phylogenetic trees. Molecular Biology and Evolution,4(4): 406–425.Satija, R., Nova´k, A´., Miklo´s, I., Lyngsø, R., and Hein, J. 2009. Bigfoot:Bayesian alignment and phylogenetic footprinting with MCMC. BMCEvolutionary Biology , 9(1): 217.134BibliographySemple, C. and Steel, M. 2003. Phylogenetics. Oxford University Press.Siepel, A. and Haussler, D. 2005. Phylogenetic hidden Markov models. InStatistical Methods in Molecular Evolution, pages 325–351. Springer.Sire´n, J., Marttinen, P., and Corander, J. 2011. Reconstructing populationhistories from single nucleotide polymorphism data. Molecular Biologyand Evolution, 28(1): 673–683.Sire´n, J., Hanage, W. P., and Corander, J. 2013. Inference on populationhistories by approximating infinite alleles diffusion. Molecular Biology andEvolution, 30(2): 457–468.Smeulders, M. J., Barends, T. R., Pol, A., Scherer, A., Zandvoort, M. H.,Udvarhelyi, A., Khadem, A. F., Menzel, A., Hermans, J., Shoeman, R. L.,et al. 2011. Evolution of a new enzyme for carbon disulphide conversionby an acidothermophilic archaeon. Nature, 478(7369): 412–416.Smith, S. A., Wilson, N. G., Goetz, F. E., Feehery, C., Andrade, S. C.,Rouse, G. W., Giribet, G., and Dunn, C. W. 2011. Resolving the evo-lutionary relationships of molluscs with phylogenomic tools. Nature,480(7377): 364–367.Song, Y. S. and Steinru¨cken, M. 2012. A simple method for finding explicitanalytic transition densities of diffusion processes with general diploidselection. Genetics, 190(3): 1117–1129.Stamatakis, A. 2005. An efficient program for phylogenetic inference usingsimulated annealing. In Parallel and Distributed Processing Symposium,2005. Proceedings. 19th IEEE International , pages 8–pp. IEEE.Stamatakis, A. 2014. RAxML version 8: a tool for phylogenetic analysisand post-analysis of large phylogenies. Bioinformatics, 30(9): 1312–1313.Stephens, J. C., Schneider, J. A., Tanguay, D. A., Choi, J., Acharya, T.,Stanley, S. E., Jiang, R., Messer, C. J., Chew, A., Han, J.-H., et al.2001. Haplotype variation and linkage disequilibrium in 313 human genes.Science, 293(5529): 489–493.135BibliographyStudier, J. A., Keppler, K. J., et al. 1988. A note on the neighbor-joiningalgorithm of saitou and nei. Molecular Biology and Evolution, 5(6): 729–731.Suchard, M. A. and Redelings, B. D. 2006. BAli-Phy: simultaneous Bayesianinference of alignment and phylogeny. Bioinformatics, 22(16): 2047–2048.Suchard, M. A., Weiss, R. E., and Sinsheimer, J. S. 2001. Bayesian selectionof continuous-time Markov chain evolutionary models. Molecular Biologyand Evolution, 18(6): 1001–1013.Sukumaran, J. and Holder, M. T. 2010. DendroPy: a Python library forphylogenetic computing. Bioinformatics, 26(12): 1569–1571.Swofford, D. L., Olsen, G. J., Waddell, P. J., and Hillis, D. M. 1996. Phy-logenetic inference. In M. B. Hillis D.M., Moritz C., editor, Molecularsystematics, pages 407–514. Sunderland: Sinauer Associates.Tanay, A. and Siggia, E. D. 2008. Sequence context affects the rate of shortinsertions and deletions in flies and primates. Genome Biology , 9(2): R37.Tataru, P. and Hobolth, A. 2011. Comparison of methods for calculatingconditional expectations of sufficient statistics for continuous time Markovchains. BMC Bioinformatics, 12(1): 465.Tavare´, S. 1986. Some probabilistic and statistical problems in the analysisof DNA sequences. Lectures on Mathematics in the Life Sciences, 17:57–86.Taylor, M. S., Kai, C., Kawai, J., Carninci, P., Hayashizaki, Y., and Sem-ple, C. A. 2006. Heterotachy in mammalian promoter evolution. PLoSGenetics, 2(4): e30.Thompson, J. D., Gibson, T. J., Plewniak, F., Jeanmougin, F., and Higgins,D. G. 1997. The clustal x windows interface: flexible strategies for mul-tiple sequence alignment aided by quality analysis tools. Nucleic AcidsResearch, 25(24): 4876–4882.136BibliographyThorne, J. L., Kishino, H., and Felsenstein, J. 1991. An evolutionary modelfor maximum likelihood alignment of DNA sequences. Journal of Molec-ular Evolution, 33(2): 114–124.Thorne, J. L., Kishino, H., and Felsenstein, J. 1992. Inching toward reality:an improved likelihood model of sequence evolution. Journal of MolecularEvolution, 34(1): 3–16.Truszkowski, J. and Goldman, N. 2016. Maximum likelihood phylogeneticinference is consistent on multiple sequence alignments, with or withoutgaps. Systematic biology , 65(2): 328–333.Turnbull, C., Ahmed, S., Morrison, J., Pernet, D., Renwick, A., Maranian,M., Seal, S., Ghoussaini, M., Hines, S., Healey, C. S., et al. 2010. Genome-wide association study identifies five new breast cancer susceptibility loci.Nature Genetics, 42(6): 504–507.Varin, C. and Vidoni, P. 2005. A note on composite likelihood inference andmodel selection. Biometrika, 92(3): 519–528.Wang, L. 2012. Bayesian Phylogenetic Inference via Monte Carlo methods.Ph.D. thesis, University of British Columbia.Wang, L., Bouchard-Coˆte´, A., and Doucet, A. 2015. Bayesian phylogeneticinference using the combinatorial sequential Monte Carlo method. Journalof the American Statistical Association, (In Press).Weir, B. and Cockerham, C. C. 1984. Estimating f-statistics for the analysisof population structure. Evolution, 38(6): 1358–1370.Westesson, O., Lunter, G., Paten, B., and Holmes, I. 2012. Accurate recon-struction of insertion-deletion histories by statistical phylogenetics. PLoSOne, 7(4): e34572.Wheeler, W. C. 1990. Nucleic acid sequence phylogeny and random out-groups. Cladistics, 6(4): 363–367.137Wong, G. K.-S., Liu, B., Wang, J., Zhang, Y., Yang, X., Zhang, Z., Meng,Q., Zhou, J., Li, D., Zhang, J., et al. 2004. A genetic variation mapfor chicken with 2.8 million single-nucleotide polymorphisms. Nature,432(7018): 717–722.Wong, K., Suchard, M., and Huelsenbeck, J. 2008. Alignment uncertaintyand genomic analysis. Science, 319: 473–476.Yang, Z. 1996. Among-site rate variation and its impact on phylogeneticanalyses. Trends in Ecology & Evolution, 11(9): 367–372.Yang, Z. 1997. PAML: a program package for phylogenetic analysis bymaximum likelihood. Computer Applications in the Biosciences, 13(5):555–556.Yang, Z. 2007. PAML 4: phylogenetic analysis by maximum likelihood.Molecular Biology and Evolution, 24(8): 1586–1591.Yang, Z. and Rannala, B. 1997. Bayesian phylogenetic inference using DNAsequences: a Markov chain Monte Carlo method. Molecular Biology andEvolution, 14(7): 717–724.Yang, Z., Goldman, N., and Friday, A. 1995. Maximum likelihood trees fromDNA sequences: a peculiar statistical estimation problem. SystematicBiology , 44(3): 384–399.Zhang, J. 2000. Protein-length distributions for the three domains of life.Trends in Genetics, 16(3): 107–109.Zharkikh, A. and Li, W.-H. 1995. Estimation of confidence in phylogeny:the complete-and-partial bootstrap technique. Molecular phylogeneticsand evolution, 4(1): 44–63.138Appendix AA.1 Analysis of the cherry-picking algorithmA.1.1 Proof of Proposition 1Proposition 1. For any square matrix A that is induced by a rooted bifur-cating tree τ , the 2-MRMC is satisfied for (Pi, Pj) if and only if Pi and Pjform a cherry.Proof. (⇐) Suppose that Pi and Pj form a cherry, with corresponding branchlengths σ2i , σ2j . Assume without loss of generality that there are at least 3populations, and let σ2i,j denote the length of the branch above the mostrecent common ancestor of Pi and Pj , where σ2i,j > 0 by the bifurcationhypothesis. By construction, for all other populations Pk, aik ≥ σ2i,j + σ2i >σ2i = aij and ajk ≥ σ2i,j + σ2j > σ2j = aji, therefore 2-MRMC is satisfied.Refer to Figure 4.1 in the thesis where k = 1, i = 2, j = 3.(⇒) Suppose that Pi and Pj do not form a cherry. There is at least onepopulation Pk whose common ancestor with either Pi or Pj is a descendentof the common ancestor of Pi and Pj . Suppose it is with Pi. Supposethe branch length from the common ancestor of Pi and Pk to the commonancestor of Pi and Pj is σ2k,j . Refer to Figure 4.1 where j = 1, k = 2, i = 3.By the bifurcation hypothesis, σ2k,j > 0, so aij = aik + σ2k,j > aik, i.e., aijis not the minimum element in the i-th row of A. The 2-MRMC is notsatisfied.A.1.2 Corollary 1 and its proofCorollary 1. If Pi and Pj form a cherry in a rooted bifurcating tree τ , thenaij and aji are unique minima of the i-th and j-th row of the matrix A,respectively.139A.1. Analysis of the cherry-picking algorithmProof. Let ai,· denote all elements in the i-th row of A.If Pi and Pj form a cherry, from the first part of the proof of Proposi-tion 1, for any other population Pk, aij < aik, so aij is the unique minimumof ai,·. Similarly, aji is the unique minimum of aj,·.A.1.3 Proof of Theorem 1Theorem 1. For a given bifurcating phylogenetic tree τ with asymmetricdissimilarity matrix A, the cherry-picking algorithm will always return acherry, and the algorithm terminates before the iteration variable q equalsn.Proof. Without loss of generality, we assume that a1,2 is the smallest elementin A. Then a1,2 is the smallest element in a1,·.Suppose that a2,i2 is the smallest element in a2,·, then i2 ∈ {1, 3, 4, · · · , n}because a2,2 = ∞. If i2 = 1, (P1, P2) satisfies the 2-MRMC condition, andthe process terminates with a cherry (P1, P2) and q = 1. If i2 6= 1, we assumewithout loss of generality that i2 = 3, i.e., a2,3 is the smallest element ina2,·, and denote the internal node linking P1 and P2 as P1,2. Let B2 = {1}.Recursively, in the q-th iteration (q = 1, 2, · · · , n − 2), suppose thataq+1,iq+1 is the smallest number in aq+1,·. Let Bq = {1, 2, . . . , q − 1}.(1) If iq+1 = q, then Pq and Pq+1 form a cherry within q iterations.(2) If iq+1 ∈ Bq, then q−iq+1+2 internal nodes Piq+1,iq+1+1, Piq+1+1,iq+1+2,. . ., Pq,q+1, Pq+1,iq+1 form a connected circle in tree τ (as shown in Figure A.1for i3 = 1 and iq+1 = 1) which contradicts the bifurcating tree assumption.(3) If iq+1 ∈ {q + 2, q + 3, . . . , n}, we assume without loss of generalitythat iq+1 = q+ 2, i.e., aq+1,q+2 is the smallest element in aq+1,·. Let Bq+1 ={1, 2, · · · , q} and repeat as the (q + 1)-th iteration.If q = n−1 and the algorithm is not terminated, then Bq = {1, 2, · · · , n−2} and in /∈ B, so in = n − 1, i.e., an,n−1 must be the smallest in an,·, and(Pn−1, Pn) forms a cherry with q = n− 1.140A.1. Analysis of the cherry-picking algorithmP1 P2P3P1,2P2,3P1,3...P1 P2 P3 Pq Pq+1P1,2P2,3Pq-1,qPq,q+1 Pq+1,1...Figure A.1: Illustrations of connected circles which contradict the bifurcat-ing tree assumption.A.1.4 Proof of Proposition 2Proposition 2. If ÂL is a consistent estimator of an asymmetric dissim-ilarity matrix A, then the probability of identifying a cherry of the tree τusing our cherry-picking algorithm on ÂL converges to 1 as the number ofloci L goes to infinity.Proof. It is easy to see that Proposition 2 is true when n = 2. Assume thatn ≥ 3.Let bi be the set of all values in ai,·. Let δi = c1i − c2i, where c1i andc2i are the smallest and the second smallest value in bi, respectively. Then,c2i 6= c1i because bi contains at least two elements, one positive value and∞. Let δ = miniδi.For any rooted bifurcating tree τ , there exists at least one cherry, denoted(Pi, Pj). When n ≥ 3, there exists k 6= i, j, aik <∞, and from Corollary 1,aik > aij . As a result, 0 < δi ≤ aik − aij <∞. Thus, 0 < δ <∞.Because the dimension of A is finite, if ÂL converges to A in probabilityelement-wise, then all elements of ÂL uniformly converge to their respectiveelements of A in probability when L → ∞, i.e., for any  > 0, there exists141A.1. Analysis of the cherry-picking algorithmL0 such that for all L ≥ L0 and for any i 6= j,P (|âij − aij | ≤ δ/2) ≥ 1− /{2(n− 1)}.For any aij > aik, thenP (âij ≤ âik) ≤ P [{|âij − aij | ≥ δ/2} ∪ {|âik − aik| ≥ δ/2}]≤ P{|âij − aij | ≥ δ/2}+ P{|âik − aik| ≥ δ/2}≤ /{2(n− 1)}+ /{2(n− 1)} = /(n− 1),and thus P (âij > âik) ≥ 1− /(n− 1).Let âi,· denote all elements in the i-th row of ÂL.If (Pi, Pj) is a cherry in τ , aik > aij for all k 6= j. For all L ≥ L0,P (âij = min âi,·) = 1− P⋃l 6=i(âij > âil) ≥ 1−∑l 6=iP (âij > âil) = 1− /2.Similarly, P (âji = min âj,·) ≥ 1− /2. As a consequence, for all L ≥ L0,P{(âij = min âi,·) ∩ (âji = min âj,·)}≥ P (âij = min âi,·) + P (âji = min âj,·)− 1 ≥ 1− .If (Pi, Pj) is not a cherry in τ , it does not satisfy the 2-MRMC conditionin A, we assume without loss of generality that aij 6= min âi,·. Assume thataik < aij for some k 6= j. Notice that aij − aik ≥ δi ≥ δ. For all L ≥ L0,P (âij = min âi,·) ≤ P (âij ≤ âik) ≤ /(n− 1).As a consequence, for all L ≥ L0,P{(âij = min âi,·) ∩ (âji = min âj,·)} ≤ P (âij = min âi,·) ≤ /(n− 1) ≤ .Therefore, when L → ∞, the probability of (Pi, Pj) satisfying the 2-MRMC condition in ÂL uniformly converges to 1 for any cherries (Pi, Pj)142A.2. Analysis of the asymmetric neighbor-joining algorithmin τ , and uniformly converges to 0 for any pair (Pi, Pj) which is not a cherryin τ .Combining this with Theorem 1, which shows that our cherry-pickingalgorithm picks up one pair of populations which satisfies the 2-MRMC con-dition, the probability of our cherry-picking algorithm picking up a correctcherry of τ using ÂL is 1 when the number of loci L→∞.A.2 Analysis of the asymmetric neighbor-joiningalgorithmA.2.1 Proof of Lemma 1Lemma 1. If a matrix A is an asymmetric dissimilarity matrix represen-tation of a rooted bifurcating tree τ , then after a cherry of τ is removed, thereduced matrix r(A) is an asymmetric dissimilarity matrix representationof the reduced tree r(τ), i.e., r(A) = Ar(τ), when the leaf populations r(P)are arranged in the same order.Proof. Assume that (Pi, Pj) is a correct cherry of τ , and it is to be removed.There are two types of elements in r(A). Elements of A which are notin the i-th and j-th rows and columns remains in r(A). These elementsrepresent lengths of branches which are not directly linked to the cherry, sothey all remain in Ar(τ). We prove that other elements of r(A) related tothe new leaf population Pi,j (i.e., aij,m and am,ij for m 6= i, j and 1 ≤ m ≤ n)also exist exactly in Ar(τ).Notice that in A, based on the topology of the true tree τ , for m 6= i, jand 1 ≤ m ≤ n,ai,m − ai,j = aj,m − aj,i, and am,i = am,j . (A.1)Combining (A.1) with (4.3), (4.4) and (4.5) yieldsaij,m = ai,m − ai,j = aj,m − aj,i, and am,ij = am,i = am,j . (A.2)143A.2. Analysis of the asymmetric neighbor-joining algorithmIn the true tree τ and reduced true tree r(τ), the MRCA of Pm and Piis the same as the MRCA of Pm and Pi,j , so ar(τ)ij,m = ai,m − ai,j = aij,m, andar(τ)m,ij = am,i = am,ij for m 6= i, j and 1 ≤ m ≤ n. Thus, r(A) = Ar(τ).A.2.2 Proof of Lemma 2Lemma 2. If ÂL is a consistent estimator of an asymmetric dissimilaritymatrix A, then the reduced estimated matrix r(ÂL) is a consistent estimatorof r(A) (i.e., Ar(τ)), when the leaf populations r(P) are arranged in the sameorder.Proof. Similar to the proof of Lemma 1, there are two types of elementsin r(ÂL). Elements of r(ÂL) which are also elements of ÂL are consistentestimates of respective elements of r(A), as they are unchanged. We provethat elements of the new row and new column of r(ÂL) (i.e., âij,m and âm,ijfor m 6= i, j and 1 ≤ m ≤ n) are consistent estimates of respective elementsof r(A).Replacing a· with â· in (4.3), (4.4) and (4.5) yieldsâij,m = (âi,m + âj,m − âi,j − âj,i)/2, (A.3a)âm,ij = (âm,i + âm,j)/2. (A.3b)If ÂL is a consistent estimator of A, for any  > 0 and any δ > 0, thereexists L0 > 0 such that for any L ≥ L0, P (|âi,j − ai,j | ≥ δ/2) ≤ /4 for all1 ≤ i, j ≤ n and i 6= j.Combining (A.2) and (A.3),P (|âij,m − aij,m| ≥ δ)= P{|(âi,m − ai,m) + (âj,m − aj,m)− (âi,j − ai,j)− (âj,i − aj,i)| ≥ 2δ}≤ P{(|âi,m − ai,m| ≥ δ/2) ∪ (|âj,m − aj,m| ≥ δ/2)∪(|âi,j − ai,j | ≥ δ/2) ∪ (|âj,i − aj,i| ≥ δ/2)}≤ /4 + /4 + /4 + /4 + /4 = ,144A.2. Analysis of the asymmetric neighbor-joining algorithmandP (|âm,ij − am,ij | ≥ δ) = P{|(âm,i − am,i) + (âm,j − am,j)| ≥ 2δ}≤ P [(|âm,i − am,i| ≥ δ) ∪ (|âm,j − am,j | ≥ δ)]≤ /4 + /4 = /2.Thus, r(ÂL) is a consistent estimator of r(A).A.2.3 Proof of Theorem 2Theorem 2. If a matrix A is an asymmetric dissimilarity matrix represen-tation of a rooted bifurcating tree τ , then asymmetric neighbor-joining canrecover τ correctly from A, in terms of both topology and branch lengths.Proof. First, from Theorem 1 and Lemma 1, ANJ can recover the truetopology of τ from A step-by-step. Second, the estimated branch lengths in(4.3) are the same as the respective branch lengths in the true tree when Ais used in the algorithm. By induction on n (i.e., the dimension of A), allbranch lengths can be exactly recovered.A.2.4 Proof of Proposition 3Proposition 3. If ÂL is a consistent estimator of an asymmetric dissim-ilarity matrix A, then the probability of recovering the topology of the truetree τ using asymmetric neighbor-joining on ÂL converges to 1 when thenumber of loci L → ∞. Moreover, the branch length estimates are alsoconsistent.Proof. It is easy to see that Proposition 3 is true when n = 2.Suppose that Proposition 3 is true when n = k.When n = k+ 1, a (k+ 1)× (k+ 1) matrix ÂL is a consistent estimatorof an asymmetric dissimilarity matrix A. Let Rk+1 denote the event ofrecovering the topology of true tree τ and consistently estimating all branchlengths of τ using ANJ on ÂL, and let Rk denote the event of recoveringthe topology of true tree r(τ) and consistently estimating all branch lengths145A.2. Analysis of the asymmetric neighbor-joining algorithmof r(τ) using ANJ on r(ÂL). Let Ck+1 denote the event that the cherrybeing picked up is a correct cherry of τ at the cherry-picking step, and Bk+1denote the event that branch-length estimates of the cherry are consistent.From Proposition 2, the probability of picking up a correct cherry usingthe cherry-picking algorithm on ÂL is 1 when L → ∞, i.e., for any  > 0,there exists L1 such that for any L > L1, P (Ck+1) > 1− /2.Conditioned on Ck+1, Lemma 2 shows that r(ÂL) is a consistent esti-mator of r(A). The branch-length estimates of the cherry:d̂i,ij = âi,j and d̂j,ij = âj,i,are consistent with the respective branch lengths of the true tree τ , i.e.,P (Bk+1|Ck+1) = 1.Notice that r(ÂL) is k×k. There are a finite number of cherries in τ , andthus a finite number of different reduced tree r(ÂL). For any  > 0, thereexists L2 such that for any L > L2, P (Rk+1|Bk+1Ck+1) > 1 − /2 for anyreduced tree r(ÂL) from ANJ when L > L1. Then for any L > max{L1, L2},P (Rk+1) = P (Ck+1Bk+1Rk)= P (Ck+1)P (Bk+1|Ck+1)P (Rk|Bk+1Ck+1)> (1− /2) · 1 · (1− /2) > 1− .Thus, Proposition 3 is true when n = k + 1. By induction on n (i.e., thedimension of Â), it is true for any finite n.146

Cite

Citation Scheme:

        

Citations by CSL (citeproc-js)

Usage Statistics

Share

Embed

Customize your widget with the following options, then copy and paste the code below into the HTML of your page to embed this item in your website.
                        
                            <div id="ubcOpenCollectionsWidgetDisplay">
                            <script id="ubcOpenCollectionsWidget"
                            src="{[{embed.src}]}"
                            data-item="{[{embed.item}]}"
                            data-collection="{[{embed.collection}]}"
                            data-metadata="{[{embed.showMetadata}]}"
                            data-width="{[{embed.width}]}"
                            async >
                            </script>
                            </div>
                        
                    
IIIF logo Our image viewer uses the IIIF 2.0 standard. To load this item in other compatible viewers, use this url:
http://iiif.library.ubc.ca/presentation/dsp.24.1-0314148/manifest

Comment

Related Items