Machine learning methods for metabolic pathway inference from genomic sequence information

UBC Theses and Dissertations

Featured Collection

UBC Theses and Dissertations

Machine learning methods for metabolic pathway inference from genomic sequence information Mohd Abul Basher, Abdur Rahman

Abstract

Metabolic pathway prediction within and between cells from genomic sequence information is an integral problem in biology linking genotype to phenotype. This is a prerequisite to both understanding fundamental life processes and ultimately engineering these processes for specific biotechnological applications. A pathway prediction problem exists because we have limited knowledge of the reactions and pathways operating in cells even in model organisms like Escherichia coli where the majority of protein functions are determined. Consequently, over the past decades several computational tools were developed to automate the reconstruction of pathways given enzymes obtained from genomes. Unfortunately, with an ever-increasing rate in the content and diversity of publicly available genomics and metagenomics datasets, those algorithms, to this date, experience more prominent and complex problems. These include incapability of systemically solving meta-level noise, neglecting pathway interactions, not considering vagueness associated with enzymes, and inadequate to scale to heterogeneous genomic datasets. In an attempt to resolve the aforementioned problems, this thesis examines multiple pathway prediction models given a list of enzymes based on multi-label learning approaches. Specifically, it first introduces mlLGPR that encodes manually designed enzyme and pathway properties to reconstruct pathways. Then, it proposes triUMPF, a more advanced model, that characterizes interactions among pathways and enzymes, jointly, with community detection from enzyme and pathway networks to improve the precision of predictions. This requires pathway2vec, a novel representation learning model, to automatically generate features aiding triUMPF’s prediction process. Next, the thesis presents leADS that subselects more impacted examples from a dataset to increase the pathway sensitivity performance. This model may rely on reMap, a novel relabeling algorithm, that incorporates the bag concept which is composed of correlated pathways to articulate missing pathways from data. Finally, all these models are integrated into a unified framework, mltS, to achieve the desired balance between sensitivity and precision outputs while assigning a confidence score to each model. The applicability of these models to recover pathways at the individual, population, and community levels of organization were examined against the traditional inference algorithms using benchmark datasets, where all the proposed models demonstrated accurate predictions and outperformed the previous approaches.

Item Metadata

Title	Machine learning methods for metabolic pathway inference from genomic sequence information
Creator	Mohd Abul Basher, Abdur Rahman
Publisher	University of British Columbia
Date Issued	2020
Description	Metabolic pathway prediction within and between cells from genomic sequence information is an integral problem in biology linking genotype to phenotype. This is a prerequisite to both understanding fundamental life processes and ultimately engineering these processes for specific biotechnological applications. A pathway prediction problem exists because we have limited knowledge of the reactions and pathways operating in cells even in model organisms like Escherichia coli where the majority of protein functions are determined. Consequently, over the past decades several computational tools were developed to automate the reconstruction of pathways given enzymes obtained from genomes. Unfortunately, with an ever-increasing rate in the content and diversity of publicly available genomics and metagenomics datasets, those algorithms, to this date, experience more prominent and complex problems. These include incapability of systemically solving meta-level noise, neglecting pathway interactions, not considering vagueness associated with enzymes, and inadequate to scale to heterogeneous genomic datasets. In an attempt to resolve the aforementioned problems, this thesis examines multiple pathway prediction models given a list of enzymes based on multi-label learning approaches. Specifically, it first introduces mlLGPR that encodes manually designed enzyme and pathway properties to reconstruct pathways. Then, it proposes triUMPF, a more advanced model, that characterizes interactions among pathways and enzymes, jointly, with community detection from enzyme and pathway networks to improve the precision of predictions. This requires pathway2vec, a novel representation learning model, to automatically generate features aiding triUMPF’s prediction process. Next, the thesis presents leADS that subselects more impacted examples from a dataset to increase the pathway sensitivity performance. This model may rely on reMap, a novel relabeling algorithm, that incorporates the bag concept which is composed of correlated pathways to articulate missing pathways from data. Finally, all these models are integrated into a unified framework, mltS, to achieve the desired balance between sensitivity and precision outputs while assigning a confidence score to each model. The applicability of these models to recover pathways at the individual, population, and community levels of organization were examined against the traditional inference algorithms using benchmark datasets, where all the proposed models demonstrated accurate predictions and outperformed the previous approaches.
Genre	Thesis/Dissertation
Type	Text
Language	eng
Date Available	2020-10-15
Provider	Vancouver : University of British Columbia Library
Rights	Attribution-NonCommercial-NoDerivatives 4.0 International
DOI	10.14288/1.0394748
URI	http://hdl.handle.net/2429/76294
Degree (Theses)	Doctor of Philosophy - PhD
Program (Theses)	Bioinformatics
Affiliation	Science, Faculty of
Degree Grantor	University of British Columbia
Graduation Date	2020-11
Campus	UBCV
Scholarly Level	Graduate
Rights URI	http://creativecommons.org/licenses/by-nc-nd/4.0/
Aggregated Source Repository	DSpace

Open Collections

UBC Theses and Dissertations

UBC Theses and Dissertations

Machine learning methods for metabolic pathway inference from genomic sequence information Mohd Abul Basher, Abdur Rahman

Abstract

Item Metadata

Item Media

Item Citations and Data

Rights