Stochastic processes, statistical inference and efficient algorithms for phylogenetic inference

UBC Theses and Dissertations

Zhai, Yongliang

Abstract

Phylogenetic inference aims to reconstruct the evolutionary history of populations or species. With the rapid expansion of available genetic data, statistical methods play an increasingly important role in phylogenetic inference by analyzing the genetic variation observed in data collected from present-day populations or species. In this thesis, we develop new evolutionary models, statistical inference methods and efficient algorithms for reconstructing phylogenetic trees at the level of populations using single-nucleotide polymorphism data and at the level of species using multiple sequence alignment data.

At the level of populations, we introduce a new inference method that uses single-nucleotide polymorphism allele frequencies to estimate the evolutionary distances from any two populations to their most recent common ancestral population. Our method is based on a new evolutionary model that accounts for both drift and fixation. To scale this method to large numbers of populations, we introduce the asymmetric neighbor-joining algorithm, an efficient method for reconstructing rooted bifurcating trees. Asymmetric neighbor-joining provides a scalable rooting method applicable to any non-reversible evolutionary modelling setup. We explore the statistical properties of asymmetric neighbor-joining and demonstrate its accuracy on synthetic data. We validate our method by reconstructing rooted phylogenetic trees from the Human Genome Diversity Panel data. Our results are obtained without using an outgroup and are consistent with the prevalent recent single-origin model of human migration.

At the level of species, we introduce a continuous-time stochastic process, the geometric Poisson indel process, that allows indel rates to vary across sites. We design an efficient algorithm for computing the probability of a given multiple sequence alignment under our new indel model. We describe a method to construct phylogeny estimates from a fixed alignment using neighbor-joining. Using simulation studies, we show that ignoring indel rate variation may have a detrimental effect on the accuracy of the inferred phylogenies, and that our proposed method can sidestep this issue by inferring latent indel rate categories. In a real-data experiment, we also show that our phylogenetic inference method may be more stable under taxon subsampling than some existing methods that either ignore indels or ignore indel rate variation.
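For orientation only, the following is a minimal sketch of one join step of classic (symmetric, unrooted) neighbor-joining, the standard distance-based algorithm named in the abstract. It is not the thesis's asymmetric neighbor-joining algorithm, and it does not implement the thesis's distance estimators. The function name `nj_join_step` and the five-taxon distance matrix are illustrative assumptions.

```python
def nj_join_step(dist):
    """Perform one join step of classic neighbor-joining.

    `dist` is a symmetric matrix (list of lists) of pairwise distances.
    Returns the joined pair, their branch lengths to the new internal
    node, and the distances from that node to every remaining taxon.
    """
    n = len(dist)
    if n < 3:
        raise ValueError("neighbor-joining needs at least three taxa")

    # Total distance from each taxon to all others.
    totals = [sum(row) for row in dist]

    # Q criterion: Q(i, j) = (n - 2) * d(i, j) - total_i - total_j.
    # The pair with the smallest Q is joined.
    best = None
    for i in range(n):
        for j in range(i + 1, n):
            q = (n - 2) * dist[i][j] - totals[i] - totals[j]
            if best is None or q < best[0]:
                best = (q, i, j)
    _, i, j = best

    # Branch lengths from i and j to the new internal node.
    d_ij = dist[i][j]
    limb_i = 0.5 * d_ij + (totals[i] - totals[j]) / (2 * (n - 2))
    limb_j = d_ij - limb_i

    # Distances from the new node to each remaining taxon k.
    new_row = {
        k: 0.5 * (dist[i][k] + dist[j][k] - d_ij)
        for k in range(n)
        if k not in (i, j)
    }
    return (i, j), (limb_i, limb_j), new_row


# Hypothetical distances among five taxa (indices 0..4).
example = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
pair, limbs, new_row = nj_join_step(example)
```

On this matrix the step joins taxa 0 and 1, with branch lengths 2.0 and 3.0 to the new node. Repeating the step on the reduced matrix until three taxa remain yields an unrooted tree. The rooted, asymmetric variant introduced in the thesis differs from this sketch.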

Item Metadata

Title	Stochastic processes, statistical inference and efficient algorithms for phylogenetic inference
Creator	Zhai, Yongliang
Publisher	University of British Columbia
Date Issued	2016
Genre	Thesis/Dissertation
Type	Text
Language	eng
Date Available	2016-09-06
Provider	Vancouver : University of British Columbia Library
Rights	Attribution-NonCommercial-NoDerivatives 4.0 International
DOI	10.14288/1.0314148
URI	http://hdl.handle.net/2429/59095
Degree (Theses)	Doctor of Philosophy - PhD
Program (Theses)	Statistics
Affiliation	Science, Faculty of; Statistics, Department of
Degree Grantor	University of British Columbia
Graduation Date	2016-11
Campus	UBCV
Scholarly Level	Graduate
Rights URI	http://creativecommons.org/licenses/by-nc-nd/4.0/
Aggregated Source Repository	DSpace
