UBC Theses and Dissertations

Computational modelling, simulation, and prediction of biological sequences

Yang, Chen

Abstract

Recent advances in sequencing technology have led to exponential growth of omics data. To leverage torrents of large and complex sequencing data, researchers need effective analytical methods to mine the underlying patterns, infer novel insights, and complement the development of related bioinformatics tools. This doctoral thesis combines descriptive and predictive data analytic strategies, including statistical modelling and machine learning, to analyze nucleotide sequences and make inferences from short- and long-read sequencing data. The tools and pipelines presented are publicly available in the service of the broader research community.

As a rapidly emerging sequencing technology, long-read sequencing, represented by Oxford Nanopore Technologies (ONT), has unprecedented advantages in providing long-range information that spans inter- and intra-genomic homologous regions and differentiates transcript isoforms. Due to the novelty and uniqueness of this sequencing technique, the features of the resulting reads, which are crucial for developing pertinent bioinformatics algorithms, are not yet well understood. This doctoral thesis combines multiple statistical modelling methodologies, including Markov models, kernel density estimation, and probability distribution fitting, to build NanoSim, the first ONT read simulator that characterizes and simulates ONT genome, transcriptome, and metagenome data. NanoSim has played, and will continue to play, an enabling role in the field and benefits the development of scalable algorithms, including assembly, alignment, quantification, mutation detection, and metagenomic analysis.

RNA sequencing has become the established sequencing method of choice for a wide range of transcriptome projects. To fully realize its potential in comprehensive analysis with isoform-level resolution, it is desirable to have a transcript completeness annotation pipeline for reconstructed transcripts and to incorporate it into RNA-Seq analysis routines. In this thesis, I explore the application of deep learning to distinguishing 3’ polyadenylation cleavage sites from non-cleavage sites. I developed a deep neural network with a novel sequence representation and propose the first assessment pipeline, Terminitor, which examines the 3’ terminus completeness of reconstructed transcripts from RNA-Seq data. This utility outperforms state-of-the-art methods in terms of sensitivity and precision and demonstrates the robustness and flexibility of the model architecture on both short- and long-read sequencing data.

License

Attribution-NonCommercial-NoDerivatives 4.0 International
