Phylogenetic classification of long read sequences

UBC Undergraduate Research

Chan, Kevin Chin-Wei

Abstract

TreeSAPP (Tree-based Sensitive and Accurate Protein Profiler) is an analysis pipeline designed to functionally and taxonomically classify protein and nucleotide sequences using marker genes and phylogenetic methods. Currently, TreeSAPP supports short read sequencing data (e.g. Illumina) but not long reads from newer sequencing platforms (e.g. Nanopore). To address this gap, ten isolate datasets sequenced using Oxford Nanopore Technologies were aligned to reference sequences of five single-copy phylogenetic marker genes. Of the four aligners tested (minimap2, GraphMap, LAST and SNAP), minimap2 performed best when judged by the raw and weighted averages of the taxonomic distance of alignments to their optimal placements, which is crucial for phylogenetic inference. Minimap2 was subsequently integrated into the long read workflow of TreeSAPP and tested on the same datasets and a mock community. While the workflow performed well with the isolate datasets, recall was poor with the mock community, suggesting that TreeSAPP’s linear model for taxonomic inference needs improvement or that higher resolution nucleotide reference packages are required.

Importance

Short read sequencing data pose several challenges for downstream bioinformatic analyses, such as sequencing error, non-uniform coverage of samples, computational time complexity and difficulty resolving repetitive regions. With the advent of cost-effective long read sequencing technologies, many of these problems are alleviated by contiguous sequences that encode full length open reading frames. Despite this benefit, long reads have higher error and insertion/deletion rates than short reads, which may limit their utility in marker gene classification. To resolve this dilemma, TreeSAPP requires a separate workflow for long read sequences.

Item Metadata

Title	Phylogenetic classification of long read sequences
Creator	Chan, Kevin Chin-Wei
Date Issued	2019-04
Genre	Graduating Project
Type	Text
Language	eng
Series	University of British Columbia. MICB 449
Date Available	2019-04-24
Provider	Vancouver : University of British Columbia Library
Rights	Attribution-NonCommercial-NoDerivatives 4.0 International
DOI	10.14288/1.0378444
URI	http://hdl.handle.net/2429/69934
Affiliation	Science, Faculty of; Microbiology and Immunology, Department of
Campus	UBCV
Peer Review Status	Unreviewed
Scholarly Level	Undergraduate
Rights URI	http://creativecommons.org/licenses/by-nc-nd/4.0/
Aggregated Source Repository	DSpace
