Computational modelling, simulation, and prediction of biological sequences

UBC Theses and Dissertations

Computational modelling, simulation, and prediction of biological sequences Yang, Chen

Abstract

Recent advances in sequencing technology have led to exponential growth in omics data. To leverage this torrent of large and complex sequencing data, researchers need effective analytical methods to mine the underlying patterns, infer novel insights, and complement the development of related bioinformatics tools. This doctoral thesis combines descriptive and predictive data analytic strategies, including statistical modelling and machine learning, to analyze nucleotide sequences and make inferences from short- and long-read sequencing data. The tools and pipelines presented are publicly available in the service of the broader research community.

Long-read sequencing, a rapidly rising technology represented by Oxford Nanopore Technologies (ONT), has unprecedented advantages in providing long-range information that spans inter- and intra-genomic homologous regions and differentiates transcript isoforms. Because the technique is new and distinctive, the features of the resulting reads, which are crucial for developing pertinent bioinformatics algorithms, are not yet well understood. This thesis combines multiple statistical modelling methodologies, including Markov models, kernel density estimation, and probability distribution fitting, to build NanoSim, the first ONT read simulator that characterizes and simulates ONT genome, transcriptome, and metagenome data. NanoSim has had, and will continue to have, an enabling role in the field, benefiting the development of scalable algorithms for assembly, alignment, quantification, mutation detection, and metagenomic analysis.

RNA sequencing (RNA-Seq) has become the established method of choice for a wide range of transcriptome projects. To fully realize its potential for comprehensive analysis at isoform-level resolution, it is desirable to have a transcript completeness annotation pipeline for reconstructed transcripts and to incorporate it into RNA-Seq analysis routines. In this thesis, I explore the application of deep learning to distinguishing 3’ polyadenylation cleavage sites from non-cleavage sites. I have developed a deep neural network with a novel sequence representation and propose the first assessment pipeline, Terminitor, which examines the 3’ terminus completeness of reconstructed transcripts from RNA-Seq data. This utility outperforms state-of-the-art methods in sensitivity and precision and demonstrates the robustness and flexibility of the model architecture on both short- and long-read sequencing data.
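
Two of the statistical techniques the abstract names for read simulation, a Markov model and kernel density estimation, can be illustrated with a toy sketch. This is not NanoSim's implementation: the event types, the transition probabilities, the log-scale bandwidth rule, and the function names below are all assumptions chosen for illustration, and a real simulator would fit these quantities to aligned ONT reads.

```python
import math
import random

# Hypothetical first-order transition probabilities between alignment
# event types. The numbers are illustrative, not fitted to real ONT data.
TRANSITIONS = {
    "match":     {"match": 0.90, "mismatch": 0.04, "insertion": 0.03, "deletion": 0.03},
    "mismatch":  {"match": 0.70, "mismatch": 0.20, "insertion": 0.05, "deletion": 0.05},
    "insertion": {"match": 0.65, "mismatch": 0.05, "insertion": 0.25, "deletion": 0.05},
    "deletion":  {"match": 0.65, "mismatch": 0.05, "insertion": 0.05, "deletion": 0.25},
}


def sample_event_chain(n, rng, start="match"):
    """Walk the Markov chain for n steps and return the visited states."""
    state = start
    chain = []
    for _ in range(n):
        chain.append(state)
        nxt = TRANSITIONS[state]
        # The next state depends only on the current state.
        state = rng.choices(list(nxt), weights=list(nxt.values()))[0]
    return chain


def rule_of_thumb_bandwidth(values):
    """Normal-reference rule of thumb: 1.06 * sd * n^(-1/5)."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return 1.06 * sd * n ** (-1 / 5)


def sample_read_length(observed_lengths, rng, bandwidth=None):
    """Draw one read length from a Gaussian KDE over log-lengths.

    Sampling from a Gaussian KDE is equivalent to picking one observed
    point uniformly at random and adding Gaussian noise whose standard
    deviation is the bandwidth.
    """
    logs = [math.log(x) for x in observed_lengths]
    h = bandwidth if bandwidth is not None else rule_of_thumb_bandwidth(logs)
    centre = rng.choice(logs)
    return max(1, round(math.exp(rng.gauss(centre, h))))


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so runs are repeatable
    print(sample_event_chain(10, rng))
    print([sample_read_length([500, 1200, 3000, 8000], rng) for _ in range(5)])
```

Working on log-lengths is a design choice made here to keep sampled lengths positive and to respect the long right tail typical of read-length distributions; it is an assumption of this sketch rather than something stated in the abstract.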

Item Metadata

Title	Computational modelling, simulation, and prediction of biological sequences
Creator	Yang, Chen
Supervisor	Birol, Inanc
Publisher	University of British Columbia
Date Issued	2021
Genre	Thesis/Dissertation
Type	Text
Language	eng
Date Available	2021-07-06
Provider	Vancouver : University of British Columbia Library
Rights	Attribution-NonCommercial-NoDerivatives 4.0 International
DOI	10.14288/1.0400046
URI	http://hdl.handle.net/2429/78860
Degree (Theses)	Doctor of Philosophy - PhD
Program (Theses)	Bioinformatics
Affiliation	Science, Faculty of
Degree Grantor	University of British Columbia
Graduation Date	2021-11
Campus	UBCV
Scholarly Level	Graduate
Rights URI	http://creativecommons.org/licenses/by-nc-nd/4.0/
Aggregated Source Repository	DSpace
