UBC Theses and Dissertations

Machine learning methods for metabolic pathway inference from genomic sequence information

Mohd Abul Basher, Abdur Rahman

Abstract

Metabolic pathway prediction within and between cells from genomic sequence information is an integral problem in biology, linking genotype to phenotype. It is a prerequisite both to understanding fundamental life processes and, ultimately, to engineering these processes for specific biotechnological applications. The pathway prediction problem exists because we have limited knowledge of the reactions and pathways operating in cells, even in model organisms such as Escherichia coli, where the majority of protein functions have been determined. Consequently, over the past decades several computational tools have been developed to automate the reconstruction of pathways from enzymes identified in genomes. Unfortunately, as the volume and diversity of publicly available genomic and metagenomic datasets grow at an ever-increasing rate, these algorithms face increasingly prominent and complex problems. These include an inability to handle meta-level noise systematically, neglect of pathway interactions, failure to account for the vagueness associated with enzymes, and inadequate scaling to heterogeneous genomic datasets. To address these problems, this thesis examines multiple models, based on multi-label learning, that predict pathways from a list of enzymes. Specifically, it first introduces mlLGPR, which encodes manually designed enzyme and pathway properties to reconstruct pathways. It then proposes triUMPF, a more advanced model that jointly characterizes interactions among pathways and enzymes, using community detection on enzyme and pathway networks to improve the precision of predictions. triUMPF requires pathway2vec, a novel representation learning model, to automatically generate features that aid its prediction process. Next, the thesis presents leADS, which subselects the more impactful examples from a dataset to increase the sensitivity of pathway prediction. leADS may rely on reMap, a novel relabeling algorithm that incorporates the concept of a bag, a set of correlated pathways, to identify pathways missing from the data. Finally, all of these models are integrated into a unified framework, mltS, to achieve the desired balance between sensitivity and precision while assigning a confidence score to each model. The applicability of these models to recovering pathways at the individual, population, and community levels of organization was examined against traditional inference algorithms using benchmark datasets, where all the proposed models produced accurate predictions and outperformed the previous approaches.

Rights

Attribution-NonCommercial-NoDerivatives 4.0 International