UBC Undergraduate Research

Phylogenetic classification of long read sequences

Chan, Kevin Chin-Wei

Abstract

TreeSAPP (Tree-based Sensitive and Accurate Protein Profiler) is an analysis pipeline designed to functionally and taxonomically classify protein and nucleotide sequences using marker genes and phylogenetic methods. Currently, TreeSAPP supports short read sequencing data (e.g. Illumina), but does not support long reads from newer sequencing platforms (e.g. Nanopore). To address this gap, ten isolate datasets sequenced using Oxford Nanopore Technologies were aligned to reference sequences of five single-copy phylogenetic marker genes. Of the four aligners tested (minimap2, GraphMap, LAST and SNAP), minimap2 performed best as judged by the raw and weighted average taxonomic distances of alignments from their optimal placements, a criterion that is crucial for phylogenetic inference. Minimap2 was subsequently integrated into the long read workflow of TreeSAPP, and the workflow was tested on the same datasets and on a mock community. While the workflow performed well with the isolate datasets, recall was poor with the mock community, suggesting that TreeSAPP’s linear model for taxonomic inference needs improvement, or that higher resolution nucleotide reference packages are required.

Importance

Short read sequencing data pose several challenges for downstream bioinformatic analyses, such as sequencing error, non-uniform coverage of samples, computational time complexity and difficulty resolving repetitive regions. With the advent of cost-effective long read sequencing technologies, many of these problems are alleviated by contiguous sequences that encode full-length open reading frames. Despite this benefit, long reads have higher error and insertion/deletion rates than short reads, which could limit their utility in marker gene classification. To resolve this dilemma, TreeSAPP requires a separate workflow for long read sequences.

Rights

Attribution-NonCommercial-NoDerivatives 4.0 International