Machine Learning Methods for Metabolic Pathway Inference from Genomic Sequence Information

by

Abdur Rahman Mohd Abul Basher

M.A.Sc., Concordia University, 2011
B.Sc., King Abdulaziz University, 2008

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY in The Faculty of Graduate and Postdoctoral Studies (Bioinformatics)

THE UNIVERSITY OF BRITISH COLUMBIA (Vancouver)

October 2020

© Abdur Rahman Mohd Abul Basher, 2020

The following individuals certify that they have read, and recommend to the Faculty of Graduate and Postdoctoral Studies for acceptance, the thesis entitled:

Machine Learning Methods for Metabolic Pathway Inference from Genomic Sequence Information

submitted by Abdur Rahman Mohd Abul Basher in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Bioinformatics.

Examining Committee:

Steven J. Hallam, Bioinformatics (Supervisor)
Sara Mostafavi, Statistics (Supervisory Committee Member)
Anne Condon, Computer Science (University Examiner)
Orlando Rojas, Chemical and Biological Engineering (University Examiner)

Additional Supervisory Committee Members:

Martin Hirst, Epigenomics (Supervisory Committee Member)
Raymond Ng, Computer Science (Supervisory Committee Member)

Abstract

Metabolic pathway prediction within and between cells from genomic sequence information is an integral problem in biology, linking genotype to phenotype. It is a prerequisite both to understanding fundamental life processes and, ultimately, to engineering these processes for specific biotechnological applications. A pathway prediction problem exists because we have limited knowledge of the reactions and pathways operating in cells, even in model organisms like Escherichia coli where the majority of protein functions have been determined. Consequently, over the past decades several computational tools have been developed to automate the reconstruction of pathways from the enzymes encoded in genomes.
Unfortunately, with the ever-increasing volume and diversity of publicly available genomic and metagenomic datasets, these algorithms face increasingly prominent and complex problems. These include an inability to systematically resolve meta-level noise, neglect of pathway interactions, failure to account for the ambiguity associated with enzymes, and an inability to scale to heterogeneous genomic datasets.

In an attempt to resolve these problems, this thesis examines multiple models that predict pathways from a list of enzymes using multi-label learning approaches. Specifically, it first introduces mlLGPR, which encodes manually designed enzyme and pathway properties to reconstruct pathways. It then proposes triUMPF, a more advanced model that jointly characterizes interactions among pathways and enzymes with community detection over enzyme and pathway networks to improve the precision of predictions. This model requires pathway2vec, a novel representation learning model, to automatically generate features that aid triUMPF's prediction process. Next, the thesis presents leADS, which subselects the more informative examples from a dataset to increase pathway sensitivity. This model may rely on reMap, a novel relabeling algorithm that incorporates the concept of a bag, composed of correlated pathways, to recover missing pathways from data. Finally, all of these models are integrated into a unified framework, mltS, to achieve the desired balance between sensitivity and precision while assigning a confidence score to each model.
The applicability of these models to recovering pathways at the individual, population, and community levels of organization was examined against traditional inference algorithms using benchmark datasets, where all the proposed models demonstrated accurate predictions and outperformed previous approaches.

Lay Summary

Metabolic pathways are an important class of molecular networks comprising compounds, enzymes, and their interactions in a cell. The ability to reconstruct pathways from organisms is extremely important for various biotechnological applications. Therefore, over the past decades several computational methods have been developed to predict pathways. However, these approaches are either prone to increasing false-positive predictions or computationally demanding. In both cases, they suffer from increased false-positive predictions that require periodic manual adjustments, and they fail to adapt and scale to heterogeneous genomic sequence information. To improve pathway predictions for (meta)genomes with reduced human effort, this thesis proposes multiple pathway predictors based on multi-label learning approaches. All the developed models were examined against traditional pathway inference algorithms at the individual, population, and community levels of biological organization, where these models demonstrated accurate predictions and outperformed previous approaches. Moreover, the proposed treatments are extensible to other closely associated problems in bioinformatics and multi-label learning.

Preface

All the work presented in this thesis was conducted in the laboratory of Dr. Steven J. Hallam. A number of sections of this work are partly or wholly published, in press, accepted, or under review. Copyright licenses to all works were obtained and are listed where appropriate.

• Chapter 3: I developed all the figures with the exception of Figs. 3.1a and 3.1b, which were designed by Ryan J. McLaughlin with input from Steven J.
Hallam.

• Chapter 5: A version of this work has been published in the journal PLOS Computational Biology and is also deposited on bioRxiv. It is available under a CC-BY 4.0 International license, which allows sharing and adaptation.

1. Abdur Rahman M. A. Basher, Ryan McLaughlin, and Steven J. Hallam. "Metabolic pathway inference using multi-label classification with rich pathway features." bioRxiv (2020): 919944.

I developed the mlLGPR framework and analyzed the data with support from Ryan J. McLaughlin. mlLGPR was implemented in the Python programming language. I wrote the manuscript with editorial support from Steven J. Hallam. The workflow diagram was designed by Ryan J. McLaughlin and the genomic information hierarchy was constructed by Steven J. Hallam, while I generated the remaining figures.

• Chapter 6: A version of this work has been accepted for publication in the journal Bioinformatics and is also deposited on bioRxiv. It is available under a CC-BY 4.0 International license, which allows sharing and adaptation.

2. Abdur Rahman M. A. Basher and Steven J. Hallam. "Leveraging heterogeneous network embedding for metabolic pathway prediction." bioRxiv (2020): 940205.

I was the primary researcher for the pathway2vec approach and implemented the algorithm in the Python programming language. I created the experimental design, did all the analysis, and wrote the manuscript with editorial support from Steven J. Hallam.

• Chapter 7: A version of this work has been deposited on bioRxiv. It is available under a CC-BY 4.0 International license, which allows sharing and adaptation.

3. Abdur Rahman M. A. Basher, Ryan McLaughlin, and Steven J. Hallam. "Metabolic pathway inference using non-negative matrix factorization with community detection." bioRxiv (2020): 119826.

I designed the methodology behind triUMPF and implemented the algorithm in the Python programming language. I created the experimental design and data, and wrote the manuscript with editorial support from Ryan McLaughlin and Steven J.
Hallam. Ryan McLaughlin contributed to designing the workflow diagram and the illustrative example, and to building the figures related to the visualization task, while the remaining figures were constructed by me. The manuscript of this work will be submitted soon after this thesis is written.

• Chapter 8: A version of this work has been deposited on bioRxiv. It is available under a CC-BY 4.0 International license, which allows sharing and adaptation.

4. Abdur Rahman M. A. Basher and Steven J. Hallam. "reMap: Relabeling Multi-label Pathway Data with Bags to Enhance Predictive Performance." bioRxiv (2020): 260109.

I was the primary researcher for reMap and SOAP. I implemented these algorithms in the Python programming language. I created the experimental design and data, and wrote the manuscript with input from Steven J. Hallam. The workflow diagram for reMap was designed by Ryan J. McLaughlin, while the results and figures were created by me. The manuscript of this work will be submitted soon after this thesis is written.

• Chapter 9: A version of this work has been deposited on bioRxiv. It is available under a CC-BY 4.0 International license, which allows sharing and adaptation.

5. Abdur Rahman M. A. Basher and Steven J. Hallam. "Multi-label pathway prediction based on active dataset subsampling." bioRxiv (2020): 297424.

I designed leADS and implemented it in the Python programming language. The workflow diagram was designed by Ryan J. McLaughlin. I created the experimental design, data, and results, and wrote the manuscript with input from Steven J. Hallam. The manuscript of this work will be submitted soon after this thesis is written.

• Chapter 10: I developed mltS and provided the underlying mathematical formulation. I implemented the mltS algorithm in the Python programming language and performed the experimental studies. I wrote the main text and created all figures with input from Steven J. Hallam.
This work will be prepared for submission soon after this thesis is written.

Please be advised that throughout the rest of this thesis the narrator is referred to as "we" rather than "I", unless otherwise stated, for consistency with the underlying discussion. It should be understood that the writing is my own. None of the work in this dissertation required consultation with the UBC Research Ethics Board.

Table of Contents

Abstract
Lay Summary
Preface
Table of Contents
List of Tables
List of Figures
Acknowledgments
Dedication

1 Introduction
  1.1 Desiderata of Machine Learning based Pathway Prediction
  1.2 Thesis Contributions and Road Map

I Background

2 Metabolic Pathway Databases and Inference Algorithms
  2.1 Metabolic Pathway Database
  2.2 Metabolic Pathway Prediction Algorithms
    2.2.1 Summary
3 Multi-Label Learning and Preliminaries
  3.1 Notation
  3.2 An Overview of a Metabolic Pathway
    3.2.1 Terminology and Definition
    3.2.2 Pathway Dataset
  3.3 Multi-Label Learning Problem Formulation
  3.4 Multi-Label Learning Algorithms
    3.4.1 Binary Relevance Methods
    3.4.2 Low Rank Methods
    3.4.3 Ensemble and Deep Learning Methods
    3.4.4 Partial Labeled Methods
    3.4.5 Active Learning Methods
    3.4.6 Other Notable Approaches
  3.5 Summary

4 Benchmark Data and Evaluation Metrics
  4.1 Benchmark Pathway Database
  4.2 Benchmark Pathway Datasets
    4.2.1 Golden Dataset
    4.2.2 BioCyc Dataset
    4.2.3 Symbiont Dataset
    4.2.4 CAMI Dataset
    4.2.5 Hawaii Ocean Time-Series (HOTS) Dataset
    4.2.6 Synthetic Dataset
  4.3 Benchmark Pathway Algorithms
  4.4 Evaluation Metrics
    4.4.1 Performance Metrics
    4.4.2 Equalized Loss of Accuracy

II Conventional Multi-Label Classification

5 mlLGPR: Multi-Label Classification to Metabolic Pathway Inference
  5.1 Introduction
  5.2 Problem Formulation
  5.3 The mlLGPR Method
    5.3.1 Feature Engineering
    5.3.2 Prediction Model
    5.3.3 Multi-Label Learning Process
  5.4 Experimental Setup
  5.5 Experimental Results and Discussion
    5.5.1 Parameter Sensitivity
    5.5.2 Features Selection
    5.5.3 Robustness
    5.5.4 Pathway Prediction Potential
  5.6 Summary

III Graph based Multi-Label Classification

6 pathway2vec: Learning Metabolic Pathway Representations
  6.1 Introduction
  6.2 Definitions and Problem Statement
  6.3 The pathway2vec Framework
    6.3.1 Random Walks
    6.3.2 Learning Latent Embedding in Graph
  6.4 Predicting Pathways
  6.5 Experimental Setup
    6.5.1 Preprocessing MetaCyc
    6.5.2 Parameter Settings
  6.6 Experimental Results and Discussion
    6.6.1 Parameter Sensitivity of RUST
    6.6.2 Node Clustering
    6.6.3 Manifold Visualization
    6.6.4 Metabolic Pathway Prediction
  6.7 Summary

7 triUMPF: TriNMF to Metabolic Pathway Recovery
  7.1 Introduction
  7.2 Problem Formulation
  7.3 The triUMPF Method
    7.3.1 Decomposing the Pathway EC Association Matrix
    7.3.2 Subnetwork or Community Reconstruction
    7.3.3 Multi-label Learning Process
  7.4 Experimental Setup
    7.4.1 Association Matrices
    7.4.2 Pathway and Enzymatic Reaction Features
    7.4.3 Parameter Settings
  7.5 Experimental Results and Discussion
    7.5.1 Parameter Sensitivity
    7.5.2 Network Reconstruction
    7.5.3 Visualization
    7.5.4 Metabolic Pathway Prediction
  7.6 Summary

IV Multi-Label Subsampling and Bagging

8 reMap: Relabeling Pathway Dataset with Bags
  8.1 Introduction
  8.2 Definitions and Problem Formulation
  8.3 The reMap Method
    8.3.1 Feed-Forward Phase
    8.3.2 Feed-Backward Phase
    8.3.3 Closing the Loop
  8.4 Experimental Setup
    8.4.1 Parameter Settings
  8.5 Experimental Results and Discussion
    8.5.1 Sensitivity Analysis of Correlated Models
    8.5.2 Bag Visualization
    8.5.3 Assessing the History Probability
    8.5.4 Metabolic Pathway Prediction
  8.6 Summary

9 leADS: Multi-label based on Active Dataset Subsampling
  9.1 Introduction
  9.2 Problem Formulation
  9.3 The leADS Method
    9.3.1 Building an Acquisition Model
    9.3.2 Sub-sampling Dataset
    9.3.3 Training on the Reduced Dataset
  9.4 Optimization
  9.5 Efficient Pathway Label Prediction
  9.6 Experimental Setup
    9.6.1 Parameter Settings and Protocols
  9.7 Experimental Results and Discussion
    9.7.1 Parameter Sensitivity
    9.7.2 Scalability to the Ensemble Size
    9.7.3 Metabolic Pathway Prediction
  9.8 Summary

V Multi-Label Learning from Multiple (Less-Trusted) Sources

10 mltS: Multi-label Learning from unTrusted Sources
  10.1 Introduction
  10.2 Problem Formulation
  10.3 The mltS Method
    10.3.1 Training Multiple (Local) Learners
    10.3.2 Similarity (Discrepancy) Measures
    10.3.3 Update Algorithmic Specific Weights
    10.3.4 Learning Global Parameters
    10.3.5 Closing the Loop
  10.4 Prediction
  10.5 Experimental Setup
    10.5.1 Parameter Settings
    10.5.2 Source Specific Multi-label Datasets
  10.6 Experimental Results and Discussion
    10.6.1 Parameter Sensitivity
    10.6.2 Analysis of Local Learners
    10.6.3 Analysis of Source-Specific Weights
    10.6.4 Metabolic Pathway Prediction
  10.7 Summary

VI Afterword

11 Conclusions and Future Work
  11.1 Thesis Summary
  11.2 Future Work

Bibliography

Appendices

A Synthetic Samples Generation
B Features Adopted in mlLGPR
  B.1 Reactions Evidence Features
  B.2 Pathways Evidence Features
  B.3 Pathway Common Features
  B.4 Possible Pathways Features

C Statistical Analyses of Pathway Prediction Algorithms for mlLGPR

D pathway2vec
  D.1 Definitions
  D.2 Scalability
  D.3 Similarity Search

E triUMPF
  E.1 Optimization

F reMap
  F.1 Modeling Metabolic Pathways as Bags (with Augmentation)
    F.1.1 Definitions and Problem Statement
    F.1.2 Correlated Models
    F.1.3 Deriving the Evidence Lower Bound (ELBO) for SPREAT
    F.1.4 Optimizing the ELBO Terms
    F.1.5 Posterior Predictive Distribution for SPREAT
  F.2 Algorithms and Optimization for reMap
    F.2.1 Algorithm for Computing Bag Centroid
    F.2.2 Algorithm for Extracting the Maximum Number of Bags
    F.2.3 Algorithm for Re-assigning Labels to Data
    F.2.4 Algorithm for Feed-Backward
    F.2.5 Optimization
  F.3 Parameter Settings for Correlated Models

G leADS
  G.1 Optimization

List of Tables

1.1 Structure of the thesis with reference to the associated parts and chapters.

2.1 Comparison of pathway prediction algorithms. (1) Various estimations, such as pathway abundance, are not provided in the original implementations of the algorithms; however, these estimations can be added as inputs to a downstream pipeline. (2) These algorithms are no longer available to the research community.

4.1 Different configurations of compound, enzyme (EC), and pathway objects extracted from the MetaCyc database. These are: i) full content (MetaCyc); ii) reduced content based on trimming nodes below 2 links (MetaCyc r); iii) links among enzymatic reactions removed (MetaCyc uec); and iv) a combination of unconnected enzymatic reactions and trimmed nodes (MetaCyc uec + r). The "–" indicates a non-applicable operation.

4.2 Nine amino acids, indicated by metabolism, for the symbiont dataset. These pathways are distributed between the Candidatus Moranella endobia and Candidatus Tremblaya princeps genomes [218].

4.3 Selected 45 pathways for the HOTS metagenome (DNA) dataset. The dataset is composed of complex microbial communities from 25 m, 75 m, 110 m (sunlit), and 500 m (dark) ocean depth intervals [309].

4.4 Characteristics of 13 datasets.
The notations |S|, L(S), LCard(S), LDen(S), DL(S), and PDL(S) represent the number of instances, number of pathway labels, pathway label cardinality, pathway label density, distinct pathway label set, and proportion of the distinct pathway label set for S, respectively. The notations R(S), RCard(S), RDen(S), DR(S), and PDR(S) have analogous meanings for the enzymatic reactions E in S. PLR(S) represents the ratio of L(S) to R(S). The last column denotes the domain of S.

5.1 Predictive performance of mlLGPR on T1 golden datasets. mlLGPR-L1: mlLGPR with L1 regularizer; mlLGPR-L2: mlLGPR with L2 regularizer; mlLGPR-EN: mlLGPR with elastic net penalty; AB: abundance features; RE: reaction evidence features; PE: pathway evidence features. For each performance metric, '↓' indicates that a lower score is better while '↑' indicates that a higher score is better.

5.2 Ablation tests of mlLGPR-EN trained using Synset-2 on T1 golden datasets. AB: abundance features; RE: reaction evidence features; PP: possible pathway features; PE: pathway evidence features; PC: pathway common features. mlLGPR is trained using a combination of features, represented by mlLGPR-*, on the Synset-2 training set. For each performance metric, '↓' indicates that a lower score is better while '↑' indicates that a higher score is better.

5.3 Performance and robustness scores for mlLGPR-EN with AB, RE, and PE feature sets trained on both the Synset-1 and Synset-2 training sets at 0 and ρ noise. The best performance scores are highlighted in bold. '↓' indicates that a lower score is better while '↑' indicates that a higher score is better.

5.4 Pathway prediction performance between methods using T1 golden datasets. mlLGPR-EN: mlLGPR with elastic net penalty; AB: abundance features; RE: reaction evidence features; PE: pathway evidence features.
For each performance metric, '↓' indicates that a lower score is better while '↑' indicates that a higher score is better.

5.5 Predictive performance of mlLGPR-EN with AB, RE, and PE feature sets on CAMI low complexity data.

6.1 Predictive performance of each comparing algorithm on the 6 benchmark golden T1 datasets. For each performance metric, '↓' indicates that a lower score is better while '↑' indicates that a higher score is better.

7.1 Top 5 communities with pathways predicted by triUMPF for E. coli K-12 substr. MG1655 (TAX-511145). The last column states whether a pathway is present in or absent from (a false-positive pathway) the EcoCyc reference data.

7.2 18 amino acid biosynthetic pathways and 27 pathway variants with high confidence.

7.3 Predictive performance of each comparing algorithm on the 6 benchmark golden T1 datasets. For each performance metric, '↓' indicates that a lower score is better while '↑' indicates that a higher score is better.

7.4 Predictive performance of mlLGPR and triUMPF on CAMI low complexity data.

8.1 The 11 HumanCyc pathways corresponding to bag index 16.

8.2 Predictive performance of each comparing algorithm on the 6 golden T1 datasets. For each performance metric, '↓' indicates that a lower score is better while '↑' indicates that a higher score is better. Values in boldface represent the best performance score, while underlined scores indicate the best performance among correlated models.

9.1 Predictive performance of each comparing algorithm on the 6 golden benchmark datasets.
For each performance metric, '↓' indicates that a lower score is better while '↑' indicates that a higher score is better. Values in boldface represent the best performance score, while underlined scores indicate the best performance among leADS variants.

9.2 Predictive performance of mlLGPR with elastic net penalty, triUMPF, and leADS on CAMI low complexity data. Values in boldface represent the best performance score, while underlined scores indicate the best performance among leADS variants.

10.1 Characteristics of the 5 source-specific datasets. S(1) to S(5) correspond to MSML datasets obtained using PathoLogic, MinPath, mlLGPR, triUMPF, and leADS, respectively. The notations L(S), LCard(S), LDen(S), DL(S), and PDL(S) represent the number of pathway labels, pathway label cardinality, pathway label density, distinct pathway label set, and proportion of the distinct pathway label set for the corresponding source S, respectively. The asterisk (∗) in S∗ indicates pathways intersected between the source-specific data and the golden T1 data.

10.2 Predictive performance of each comparing algorithm on the 6 benchmark golden T1 datasets. ma: meta-adaptive; mw: meta-weight; mp: meta-predict. For each performance metric, '↓' indicates that a lower score is better while '↑' indicates that a higher score is better. Values in boldface represent the best performance score, while underlined scores indicate the best performance among mltS prediction strategies.

10.3 Predictive performance of mlLGPR with elastic net penalty, triUMPF, leADS, and mltS on CAMI low complexity data. ma: meta-adaptive; mw: meta-weight; mp: meta-predict. Values in boldface represent the best performance score, while underlined scores indicate the best performance among mltS prediction strategies.
. . . . . 181

C.1 Summary of the Friedman statistics FF for 7 algorithms and 7 datasets. The critical value τ is set at the 0.05 significance level. . . . . . 244

D.1 23 nitrogen metabolism pathways, including variants, as extracted from MetaCyc. . . . . . 252

D.2 Top 5 Pathway IDs for nitrogen metabolism. . . . . . 252

F.1 Correspondence between variational and original parameters. . . . . . 270

List of Figures

1.1 Proposed models in this thesis with reference to the associated parts and chapters. Each block describes the model and its overall objective. . . . . . 8

3.1 A metabolic network of E. coli K-12 substr. MG1655 (represented by black, lime, blue, magenta, and orange colors) from the KEGG database [158] (Fig. 3.1a) and a subnetwork of four metabolic pathways (Fig. 3.1b), which are represented by lime, blue, magenta, and orange colors in Fig. 3.1a. The pathways in Fig. 3.1b are symbolized by y and correspond to: trans-cinnamate degradation (y2), fatty acid biosynthesis, initiation (y1), fatty acid biosynthesis, elongation (y3), and beta-Oxidation (y4). The large circle surrounded by the blue colored border in Fig. 3.1b corresponds to the fatty acid biosynthesis, initiation pathway (y1) and its components, defined as: compounds by c[∗], enzymes by e[∗], reactions by integer numbers i ∈ {1,2,3,4,5,6,7,8,9,10} on directed edges (→), and enzymes catalyzing reactions by dashed directed edges (99K). . . . . . 17

3.2 Enzymatic reaction and pathway graphs. The left panel corresponds to the EC graph, where a group of nodes constitutes an input instance.
The right panel indicates the pathway graph, where the blue colored node represents the true hidden pathways that are to be recovered, while the light gray colored nodes indicate false pathways. . . . . . 22

3.3 Enzymatic reaction and pathway graphs. The left panel corresponds to the EC graph, where a group of nodes constitutes an input instance and the double circled nodes indicate the abundance information (more than 1 enzymatic reaction). The right panel indicates the pathway graph, where the blue colored nodes represent the true hidden pathways that are to be recovered, while the light gray colored nodes indicate false pathways. . . . . . 23

3.4 Enzymatic reaction and pathway graphs. The left panel corresponds to the EC graph, where a group of nodes constitutes an input instance and the double circled nodes indicate the abundance information (more than 1 enzymatic reaction). Each color represents a distinct organism. The right panel indicates the pathway graph, where the blue colored nodes represent the true hidden pathways that are to be recovered, while the light gray colored nodes indicate false pathways. . . . . . 23

3.5 The enzymatic reaction graph. The nodes are considered to be input, and the doubly-circled nodes indicate the abundance information (more than 1 enzymatic reaction). Each color represents a distinct subnetwork. The dashed link indicates possible edges between discovered subnetworks. . . . . . 24

3.6 10 types of correlations among pathways and 3 input instances. The dark gray colored nodes indicate input samples, while the light-colored nodes represent pathways. . . . . . 26

3.7 Three learning frameworks. Node color indicates the category of the node type, where dark gray indicates input samples, black indicates grouping objects, and light grey is reserved for metabolic pathways. . .
. . . . . 39

4.1 Genomic information hierarchy encompassing individual, population, and community levels of cellular organization. (a) Building on the BioCyc curation-tiered structure of Pathway/Genome Databases (PGDBs) constructed from organismal genomes, two additional data structures are resolved from single-cell and plurality sequencing methods to define a 4-tiered hierarchy (T1-4) in descending order of manual curation and functional validation. (b) Completion scales for organismal genomes, single-cell amplified genomes (SAGs), and metagenome-assembled genomes (MAGs) within the 4-tiered information hierarchy. Genome completion will have a direct effect on metabolic inference outcomes, with incomplete organismal genomes, SAGs, or MAGs resolving fewer metabolic interactions. . . . . . 44

4.2 Matrix layout for all possible pathway intersections among EcoCyc, HumanCyc, AraCyc, YeastCyc, LeishCyc, and TrypanoCyc. Brown circles in the matrix indicate sets that are part of the intersection, and their distributions are shown as a vertical bar above the matrix, while the aggregated number of pathways from intersected sets for each sample is represented by a horizontal bar at the bottom left. More information is provided in Table 4.4. . . . . . 45

5.1 mlLGPR workflow. Datasets spanning the information hierarchy are used in feature engineering. The Synthetic dataset with features is split into training and test sets and used to train mlLGPR. Test data from the Gold Standard dataset (T1) with features and the Synthetic dataset with features are used to evaluate mlLGPR performance prior to the application of mlLGPR on experimental datasets (T4) from different sources. . . . . . 55

5.2 Average F1 scores of mlLGPR-EN over a range of regularization hyper-parameter values λ ∈ {1,10,100,1000,10000} on the EcoCyc, HumanCyc, AraCyc, YeastCyc, LeishCyc, TrypanoCyc, and SixDB datasets.
The x-axis is log scaled. . . . . . 60

5.3 Performance of mlLGPR-EN according to the β adaptive decision hyper-parameter on datasets. (a)- Synset-2 test dataset. (b)- SixDB dataset. . . . . . 61

5.4 Predicted pathways for symbiont datasets between mlLGPR-EN with AB, RE, and PE feature sets and PathoLogic. Red circles indicate that neither method predicted a specific pathway, while green circles indicate that both methods predicted a specific pathway. Blue circles indicate pathways predicted solely by mlLGPR. The size of circles scales with reaction abundance information. . . . . . 66

5.5 Comparison of predicted pathways for HOTS datasets between mlLGPR-EN with AB, RE, and PE feature sets and PathoLogic. Red circles indicate that neither method predicted a specific pathway, while green circles indicate that both methods predicted a specific pathway. Blue circles indicate pathways predicted solely by mlLGPR, and gray circles indicate pathways predicted solely by PathoLogic. The size of circles scales with reaction abundance information. . . . . . 68

6.1 Three interacting metabolic pathways (a), depicted as a cloud glyph, where each pathway is comprised of compounds (green) and enzymes (red). Interacting compound, enzyme, and pathway components are transformed into a multi-layer heterogeneous information network (b). . . . . . 73

6.2 Graphical representation of the pathway2vec framework. Main components: (a) a multi-layer heterogeneous information network composed from MetaCyc, showing meta-level interactions among compounds, enzymes, and pathways, (b) four random walks, and (c) two representational learning models: traditional Skip-Gram (top) and Skip-Gram with normalization over domain types (bottom).
In subfigure (a), the highlighted network neighbors of T1 (nitrifier denitrification) indicate that this pathway interacts directly with T2 (nitrogen fixation I (ferredoxin)) and indirectly, by second order, with T3 (nitrate reduction I (denitrification)), through relationships to several compounds, including nitric oxide (C3) and nitrite (C4), converted by enzymes represented by the EC numbers (Z2: EC 1.7.2.6, Z3: EC 1.7.2.1, and Z4: EC 1.7.2.5). The black colored nodes in subfigure (b) indicate the current position of the walkers; red links suggest the next possible nodes to sample, while black links indicate the route taken by a walker to reach the current node. node2vec is parameterized by the local search s and in-out h hyperparameters. For RUST, these two hyperparameters constitute a unit circle, i.e., h² + s² = 1. M stores the previously visited node types (here 2) and applies only to JUST and RUST. c is the number of consecutive nodes of the same domain type as the current node (here 3) and is associated with JUST. For metapath2vec, a walker requires a prespecified scheme, which is set to “ZCTCZ”. The normalized Skip-Gram at the bottom of subfigure (c) is simply trained based on the domain type, in contrast to the traditional Skip-Gram model. More information related to both learning strategies is provided in Section 6.3.2. . . . . . 74

6.3 An illustrative example showing the selection of the next node for both JUST and RUST on the HIN extracted from MetaCyc. The walker is currently stationed at C3, arriving from node C2 (indicated by the black colored link), where M stores two previously visited node types and c (for JUST) holds 3 consecutive nodes that are of the same domain as C3. As can be seen, JUST would prefer selecting a next node of type pathway, while RUST may prefer returning to C2 rather than jumping to T1 or T2, as indicated by red edges, because s < h, represented by an ellipse glyph. . . . . . 77

6.4 Parameter sensitivity of RUST based on the NMI metric. . .
. . . . . 79

6.5 Node clustering results based on the NMI metric using MetaCyc data. n2v: node2vec, m2v: metapath2vec, jt: JUST, rt: RUST, r: reduced content of MetaCyc based on trimming nodes below 2 links, uec: links among enzymatic reactions are removed in MetaCyc, and uec + r: combination of unconnected enzymatic reactions and trimmed nodes in MetaCyc. . . . . . 82

6.6 Node clustering results of metapath2vec++ (cm2v) and RUST-norm (crt) based on the NMI metric using MetaCyc data. . . . . . 83

6.7 2D UMAP projections of the 128-dimension embeddings, trained under the uec+full setting, depicting 185 nodes related to nitrogen metabolism. Node color indicates the category of the node type, where red indicates enzymatic reactions, green indicates compounds, and blue is reserved for metabolic pathways. n2v: node2vec, m2v: metapath2vec, jt: JUST, rt: RUST, cm2v: metapath2vec++, and crt: RUST-norm. . . . . . 85

6.8 2D UMAP projections of 80 pathways that have no enzymatic reactions, indicated by the blue color, with 109 corresponding pathway neighbors, represented by the grey color. n2v: node2vec, m2v: metapath2vec, jt: JUST, rt: RUST, cm2v: metapath2vec++, and crt: RUST-norm. . . . . . 85

7.1 The set of complete metabolic pathways extracted from MetaCyc (A) and their discovered communities (B). Zoomed-in regions of the pathway-pathway and community-community interactions, C and D respectively. Nodes are metabolic pathways or communities for A,C and B,D respectively. Edges correspond to the number of shared enzymatic reactions or shared pathways for the pathway and community nodes, respectively. . . . . . 91

7.2 A workflow diagram showing the proposed triUMPF method.
The model takes two graph topologies, corresponding to Pathway-Pathway interactions and EC-EC interactions, and a dataset to detect pathway and EC communities while simultaneously decomposing Pathway-EC association information to produce a constrained low-rank matrix. Afterwards, a set of pathways is detected from a newly annotated genome or metagenome. . . . . . 92

7.3 Sensitivity of the number of components k based on reconstruction cost. . . . . . 98

7.4 Sensitivity of community size and higher-order proximity with weights based on reconstruction cost. . . . . . 99

7.5 Link prediction results by varying noise levels ε ∈ {20%,40%,60%,80%} based on reconstruction cost. . . . . . 100

7.6 TCA cycle and associated pathways. Pathway communities visualized with and without training using BioCyc T2 & 3. (a) MetaCyc communities and (b) BioCyc communities observed using triUMPF. Nodes coloured black indicate the TCA cycle (TCA), while dark grey nodes indicate associated pathways. Remaining pathway communities not associated with the TCA cycle are indicated in light grey. PWY-7180: 2-deoxy-α-D-ribose 1-phosphate degradation; PWY-6223: gentisate degradation I. . . . . . 101

7.7 Pathway community networks for related T1 and T3 organismal genomes. Pathway communities for (a) E. coli K-12 substr. MG1655 (TAX-511145), (b) E. coli str. CFT073 (TAX-199310), and (c) E. coli O157:H7 str. EDL933 (TAX-155864) based on community detection.
Nodes colored in dark grey indicate pathways predicted by PathoLogic; lime, pathways predicted by triUMPF; salmon, pathways predicted by both PathoLogic and triUMPF; red, expected pathways predicted by neither PathoLogic nor triUMPF; magenta, expected pathways predicted only by PathoLogic; purple, expected pathways predicted solely by triUMPF; and green, expected pathways predicted by both PathoLogic and triUMPF. Light grey indicates pathways not expected to be encoded in either organismal genome. The node sizes reflect the degree of associations between pathways. . . . . . 103

7.8 A three-way set difference analysis of pathways predicted for E. coli K-12 substr. MG1655 (TAX-511145), E. coli str. CFT073 (TAX-199310), and E. coli O157:H7 str. EDL933 (TAX-155864) using (a) PathoLogic (taxonomic pruning) and (b) triUMPF. . . . . . 104

7.9 Comparison of predicted pathways for the E. coli K-12 substr. MG1655 (TAX-511145), E. coli str. CFT073 (TAX-199310), and E. coli O157:H7 str. EDL933 (TAX-155864) datasets between PathoLogic (taxonomic pruning) and triUMPF. Red circles indicate that neither method predicted a specific pathway, while green circles indicate that both methods predicted a specific pathway. Lime circles indicate pathways predicted solely by triUMPF, and gray circles indicate pathways predicted solely by PathoLogic. The size of circles corresponds to the associated pathway coverage information. . . . . . 106

7.10 Comparison of predicted pathways for the E. coli K-12 substr. MG1655 (TAX-511145), E. coli str. CFT073 (TAX-199310), and E. coli O157:H7 str. EDL933 (TAX-155864) datasets between PathoLogic (without taxonomic pruning) and triUMPF. Red circles indicate that neither method predicted a specific pathway, while green circles indicate that both methods predicted a specific pathway.
Lime circles indicate pathways predicted solely by triUMPF, and gray circles indicate pathways predicted solely by PathoLogic. The size of circles corresponds to the associated coverage information. . . . . . 107

7.11 Effect of ρ based on the average F1 scores using golden T1 datasets. The hyper-parameter ρ in Eq. 7.3.4 controls the amount of information propagation from M to the pathway label coefficients Θ. . . . . . 108

7.12 Comparative study of predicted pathways for symbiotic data between PathoLogic, triUMPF, and mlLGPR. The size of circles corresponds to the EC coverage information. . . . . . 110

7.13 Comparative study of predicted pathways for HOT DNA samples. The size of circles corresponds to the pathway abundance information. . . . . . 112

8.1 The traditional vs the proposed bag-based multi-label classification approaches. The traditional supervised multi-label classification is displayed in the left panel, where labels (i.e., red or green colors) are associated with an input instance x(i). This approach seeks to predict a set of labels for x(i) without considering any compartmentalization of labels. On the contrary, the bag-based multi-label classification approach, on the right, applies two steps: it first predicts a set of positive bags (depicted as a cloud glyph), and then the labels within these bags are predicted (green colored labels). . . . . . 117

8.2 An example of feature vectors for bags. The subfigure on the left represents the feature vector for six pathways corresponding to two instances.
The right subfigure indicates two bags, B1 and B2, and their features for the same two instances, where the first sample, D1, suggests that B1 is positive because the corresponding pathways y3 and y4 are present, while the bag feature vector for the second example, D2, suggests that both bags are present. . . . . . 118

8.3 A workflow diagram showing the proposed reMap pipeline to relabel multi-label data. The method consists of two phases: i)- feed-forward and ii)- feed-backward. The forward phase is composed of three components: (b) construction of pathway bags, which aims to build correlated bags given data (a), (c) building bag centroids, which retrieves centroids of bags based on the associated pathways, and (d) re-assigning labels, which maps samples to bags. The feed-backward phase (e) optimizes reMap’s parameters to maximize the accuracy of mapping examples to bags. The process is repeated τ times. If the current iteration q reaches the desired number of rounds τ, the training is terminated, producing the final Sbag data points (f). The bag dataset can then be used to train leADS (g). . . . . . 121

8.4 Illustration of pathway frequency (averaged over all examples) in BioCyc (v20.5 T2 & 3) and CAMI data, and their background pathways, indicated by M. . . . . . 125

8.5 Log predictive distribution on CAMI data. . . . . . 126

8.6 Visualization of 50 randomly picked bags for each model, trained with b = 200. The first term within the bracket, i.e., #bags, corresponds to the average number of correlated bags, while the second term, i.e., #pathways, represents the average pathway size per bag. The circles represent bags, and their sizes reflect the correlation strength with other bags. Two clusters of bags can be seen for the last three models, indicating that the two clusters contain distinct pathways. . . . . . 128

8.7 2D UMAP projections of BioCyc T2 & 3 pathways and the corresponding background pathways. Fig.
8.7a serves as a basis for color-coding, where examples of one color in BioCyc are clustered together, while the same examples are seen to be spread across the augmented BioCyc pathways (M) in Fig. 8.7b. Better viewed in color. . . . . . 129

8.8 Heatmap representing the bag distribution of CTM, SOAP+c2m, and SPREAT+c2m for 50 randomly picked bags with their associated 100 pathways. The entries are color-coded on a gradient scale ranging from light gray to dark gray, where higher intensity entails higher probability. . . . . . 130

8.9 Snapshot of the history probability H during the relabeling process of golden T1 data for 10 successive rounds. The x-axis shows 200 bags while the y-axis corresponds to the data. Darker colors indicate a high probability of assigning bags to the corresponding data. . . . . . 131

8.10 The probability history H during annotation of the T1 golden data after 10 successive rounds. The x-axis shows 200 bags while the y-axis corresponds to the associated probability. Darker colors indicate a high probability of assigning bags to the corresponding data. . . . . . 132

8.11 Similarity among golden datasets, measured by cosine distance. Best viewed in color. . . . . . 133

9.1 Number of samples for each pathway in the BioCyc T2 & 3 data. The horizontal axis indicates the indices of pathways while the vertical axis represents the number of associated examples in the BioCyc T2 & 3 collection. . . . . . 137

9.2 A schematic diagram showing the proposed leADS pipeline.
Using a multi-label (bag or pathway) dataset (a), leADS randomly selects data at the very first iteration (b), then builds g members of an ensemble (c), where each is trained on a randomly selected portion of the training set. Next, leADS applies an acquisition function (d), based on either entropy, mutual information, variation ratios, or normalized PSP@k, to pick per% of the samples. Upon selecting a set of subsampled data, leADS performs overall training on these samples (e). The process (b-e) is repeated τ times (f), where at each round the selected per% of samples are fed back into the dataset, and another set of samples is picked in addition to the previously selected set of samples. If the current iteration q reaches the desired number of rounds τ, then training is terminated, producing the final per% data points (g). . . . . . 139

9.3 The two possible strategies for building the acquisition model. (Left) The dependency-based acquisition model assumes input data x(i) is associated with multiple labels y(i), which are in turn associated with multiple bags d(i). (Right) The factorization-based method assumes both y(i) and d(i) are independent of each other, given x(i). . . . . . 141

9.4 The two approaches for constructing a multi-label learning algorithm. The individual multi-label learner (on the left) and the ensemble-based multi-label learning (on the right). . . . . . 142

9.5 Impact of the acquisition function on dependency (a) and factorization (b) predictive uncertainty types. The functions performed on par with each other, although variation ratios outperforms (an average F1 score of 52.84%) on both uncertainty types. . . . . .
147

9.6 Effect of dependency, factorization, and random subsampling by varying sample size. Dependency and factorization were trained with per% set to 30% of the BioCyc data. . . . . . 148

9.7 Average F1 score on CAMI data as a function of g. The ensemble size g varies across {1,3,5,10,15,20} for both dependency and factorization (a) predictive uncertainties, while the elapsed training time (in minutes) per epoch (averaged over 3 epochs) is shown in (b) based on the same ensemble size variation. . . . . . 149

9.8 Comparative study of predicted pathways for symbiont data between PathoLogic and leADS (with different configurations). Orange circles indicate that neither method predicted a specific pathway, while blue circles indicate that both methods predicted a specific pathway. Red circles indicate pathways predicted solely by leADS. The size of circles scales with reaction abundance information. . . . . . 152

9.9 Samples corresponding to the top 100 species in BioCyc T2 & 3. The black colored bars represent leADSp (per% = 70%) selected samples, while the grey colored bars indicate the overall number of samples associated with species in BioCyc T2 & 3. . . . . . 153

9.10 Samples corresponding to the top 100 species in BioCyc T2 & 3. The black colored bars represent leADSb (per% = 70%) selected samples, while the grey colored bars indicate the overall number of samples associated with species in BioCyc T2 & 3. . . . . . 154

9.11 Number of reduced examples for each pathway in the BioCyc T2 & 3 data. The horizontal axis indicates the indices of pathways, while the vertical axis represents the number of associated examples in the BioCyc T2 & 3 collection. The black colored areas in Figs. 9.11a and 9.11b represent leADSp and leADSb (per% = 70%) selected instances, respectively, while the grey colored area indicates the overall number of samples corresponding to pathways in BioCyc T2 & 3. . . . . . 155

9.12 Comparative study of predicted pathways for HOT DNA samples.
The size of circles corresponds to the associated abundance information. Orange circles indicate that none of the methods recovered the pathway, while red, gray, and blue circles indicate that leADSb+vrank, PathoLogic, and both, respectively, predicted the associated pathway. . . . . . 156

9.13 Comparative study of predicted pathways for HOT DNA samples. The size of circles corresponds to the associated abundance information. Orange circles indicate that none of the methods recovered the pathway, while red, gray, and blue circles indicate that leADSb+voting, PathoLogic, and both, respectively, predicted the associated pathway. . . . . . 157

9.14 Comparative study of predicted pathways for HOT DNA samples. The size of circles corresponds to the associated abundance information. Orange circles indicate that none of the methods recovered the pathway, while red, gray, and blue circles indicate that leADSp, PathoLogic, and both, respectively, predicted the associated pathway. . . . . . 158

10.1 A schematic diagram showing the proposed mltS pipeline. Using a multi-source multi-label pathway dataset (a), mltS trains U models (b), then builds a discrepancy table Q (c). Next, mltS optimizes source-specific weights ω (d). Q is used to optimize the global parameters Θglob (e). The process (b-e) is repeated τ times (f). If the current iteration q reaches the desired number of rounds τ, then training is terminated, producing U source-related and global weights (g). . . . . . 165

10.2 A schematic view of the global weights update algorithm. Three individual learners (Θ[1,2,3]) are learned using a small mini-batch based on allocated datasets, then the Θglob weights are optimized to accumulate patterns from the three learners. ∇l[1,2,3] are the gradients of the loss functions associated with those three learners. . . . . .
168

10.3 Average F1 score on CAMI data as a function of α and β. Values for both hyper-parameters α and β in figures (a) and (b), respectively, vary across {0.001,0.01,0.1,1,5,10,15,20}. . . . . . 173

10.4 Average F1 scores on MSML datasets. Each entry indicates a score associated with a trained model, symbolized by g, on an allocated dataset, characterized by S. . . . . . 174

10.5 A sketch of each model’s reliability weight with associated pathway reliability scores for 100 randomly subsampled pathways. The x-axis in the left figure shows 100 pathways while the y-axis represents models. Darker gradient colors for pathways indicate high reliability scores. The rightmost figure illustrates the models’ reliability weights. . . . . . 175

10.6 87 LeishCyc pathways. The x-axis indicates models while the y-axis represents LeishCyc pathways. Black circles indicate true pathways predicted by models, while grey circles represent false negative pathways that were not inferred by models. The size of circles indicates the pathway reliability score obtained from L. The © symbols are pathways that were recovered, while + suggests pathways that were incorrectly not predicted by mltS. . . . . . 179

10.7 Comparative study of predicted pathways for symbiont data between PathoLogic, mlLGPR, triUMPF, leADS, and mltS (with different prediction strategies). ma: meta-adaptive; mw: meta-weight; mp: meta-predict. Black circles indicate pathways predicted by the associated models, while grey circles indicate pathways that were not recovered by the models. The size of circles scales with reaction abundance information. . . . . .
180

10.8 Comparative study of predicted pathways for the HOT DNA 25m sample between PathoLogic, mlLGPR, triUMPF, leADS, and mltS (with three prediction strategies). ma: meta-adaptive; mw: meta-weight; mp: meta-predict. Black circles indicate pathways predicted by the associated models, while grey circles indicate pathways that were not recovered by the models. The size of circles corresponds to the pathway abundance information. . . . . . 182

10.9 Comparative study of predicted pathways for the HOT DNA 75m sample between PathoLogic, mlLGPR, triUMPF, leADS, and mltS (with three prediction strategies). ma: meta-adaptive; mw: meta-weight; mp: meta-predict. Black circles indicate pathways predicted by the associated models, while grey circles indicate pathways that were not recovered by the models. The size of circles corresponds to the pathway abundance information. . . . . . 183

10.10 Comparative study of predicted pathways for the HOT DNA 110m sample between PathoLogic, mlLGPR, triUMPF, leADS, and mltS (with three prediction strategies). ma: meta-adaptive; mw: meta-weight; mp: meta-predict. Black circles indicate pathways predicted by the associated models, while grey circles indicate pathways that were not recovered by the models. The size of circles corresponds to the pathway abundance information. . . . . . 184

10.11 Comparative study of predicted pathways for the HOT DNA 500m sample between PathoLogic, mlLGPR, triUMPF, leADS, and mltS (with three prediction strategies). ma: meta-adaptive; mw: meta-weight; mp: meta-predict. Black circles indicate pathways predicted by the associated models, while grey circles indicate pathways that were not recovered by the models. The size of circles corresponds to the pathway abundance information. . . . . . 185

C.1 Comparison of seven methods against each other with the Nemenyi test using CD diagrams.
Groups of methods that are not significantly different (at τ = 0.05) are connected. (a)- CD diagram for Hamming loss. (b)- CD diagram for average precision score. (c)- CD diagram for average recall score. (d)- CD diagram for average F1 score. . . . . . 245

D.1 Scalability measured in seconds (×10³) under the uec+full configuration. n2v: node2vec, m2v: metapath2vec, cm2v: metapath2vec++, jt: JUST, rt: RUST, crt: RUST-norm. . . . . . 251

F.1 Graphical representation of the correlated concept models. The boxes are “plates” representing replicates. The outer plate represents instances, while the inner plate represents the repeated choice of features within an example. The logistic normal distribution, used to model the latent concept proportions of an example, can capture correlations among concepts that are impossible to capture using a single Dirichlet. The observed data for each example x(i) are a set of annotated features y(i) and a set of hypothetical features Mi, while per-example concept proportions η(i), per-example concept selection parameters Λ(i), per-example hypothetical feature distributions Ω(i), per-feature concept assignments z(i)j, per-concept distributions over features Φa, and per-example beta distributions β(i) are hidden variables. The remaining hyperparameters should be provided as inputs. . . . . . 266

Acknowledgments

This dissertation would not have been possible without the support of many people. First and foremost, I would like to express my deepest gratitude to my thesis supervisor, Dr. Steven J. Hallam, for taking me under his wing, and for running a lab where so many researchers are free to explore creative ideas. When it came to the commitments of the lab schedule, he was extremely flexible. For the methods associated with my Ph.D.
research, he encouraged me to extend my horizons, and he was there in every little discussion, making the research process remarkably fruitful. I am also grateful for the unlimited time he committed to our discussions, as well as to proofreading my papers and thesis. Without his continuous guidance and encouragement, this work would not have been successfully completed.
I would like to thank several people at the Hallam lab who were instrumental in getting me interested in the bioinformatics field: Dr. Kishori M. Konwar, Dr. Aria S. Hahn, Dr. Alyse K. Hawley, Evan W. Durno, Connor Morgan-Lang, and Ryan McLaughlin. It is due to the friendly and supportive environment of the Hallam lab that I was lucky enough to find so many great people to work with. Their attitude helped me experience a different aspect of Ph.D. life and always provided a positive atmosphere, making machine learning applications to biological data more fun.
In particular, I would like to thank Connor, whose deep biological knowledge helped me absorb and polish many biological concepts, and Ryan, who contributed so much to editing and designing various pipeline modules. Our discussions led multiple times to co-authored papers. I thank Evan for many random good lunches and interesting discussions.
I would like to place on record my sincere thanks to the rest of the lab and staff members, Dr. Jennifer Bonderoff, Dan Seale, Joe Ho, Avery Noonan, Kateryna Ievdokymenko, Siddarth Raghuvanshi, and Julia Anstett, for creating a welcoming environment.
I also wish to express my appreciation to UBC for accepting me into the graduate program and awarding me the Four-Year Fellowship (4YF) through the Department of Bioinformatics, support that allowed me to fully focus on my studies.
Finally, and above all, I want to thank my parents, who planted the seeds of science in me, though their circumstances and life obstacles did not support their own research paths.
My siblings gave their unconditional encouragement and support, both emotional and financial. No words could possibly express my sincere gratitude for their endless love and unwavering assistance. To them, I dedicate this dissertation.

Dedication

To my late parents, siblings, friends, and everyone who supported me throughout the Ph.D. journey.

Chapter 1
Introduction

"They say 90% of the promotion of a book comes through word of mouth. But you've somehow got to get your book into the hands of those mouths first!"
– Claudia Osmond

Metabolic pathways are core components of a cell's metabolism [250]. They consist of complex series of biochemical reactions that transform substrates into products. In a cell, reactions may be catalyzed by groups of enzymes, which are very specific biological catalysts that accelerate reactions; such reactions are often referred to as enzymatic reactions [319]. Contextualizing enzymes onto pathways [167, 168] is an integral problem in biology linking genotype to phenotype. This is a prerequisite to both understanding fundamental life processes and ultimately translating these processes into specific biotechnological applications. Emerging applications include the production of renewable biofuels and other bioproducts [26, 96, 191], various applications in human health [202, 281], modeling disease networks for screening chemical or ligand libraries [7, 348], developing new, more efficient, less harmful drugs and new antimicrobial therapies [2, 69, 237], phylogenetic reconstruction [201], and studies related to plant reproduction that yield larger quantities at lower cost than conventional methods [86, 113].
In the literature, studies anchored around pathways are known as pathway-centric [125]; these were sought to offset the limitations of traditional gene-centric (or enzyme-based) approaches [139, 300, 312, 343].
That is, pathway-centric analysis substantially reduces computational complexity by focusing on pathways, which are far fewer in number than gene families or enzymes. To give an insight, consider the MetaCyc database [53], a multi-organism member of the BioCyc collection of Pathway/Genome Databases (PGDBs), which currently contains 2766 metabolic pathways and 12564 enzymes. If enzyme-based analysis were sought, we would require roughly four times the computational power of pathway-based analysis. It is also important to note that pathway-centric approaches are easier to interpret in terms of higher-level biological roles (e.g. the metabolic potential of cells) than gene (enzyme)-centric approaches [4, 59, 129, 290]. These benefits led to the development of pathway mapping tools, a task formally known as pathway inference, prediction, or reconstruction.
Early methods for inferring pathways (e.g. PathoLogic [165] and MinPath [370]) involve "in silico" mapping of enzymes onto a reference metabolic pathway collection stored in trusted repositories (e.g. MetaCyc [53] and KEGG [158]) using a set of manually specified rules. The reference metabolic pathways serve as templates to organize enzymes and recover pathways, where each inferred pathway may be associated with a numerical value highlighting the prediction score. These methods have become an integral component of heterogeneous bioinformatics pipelines (e.g. MetaPathways [126, 175, 176], HUMAnN [4], and HUMAnN2 [99]) for reproducible ("easy-to-use") genomic DNA information processing, advancing genomic research and its applications.
Let us illustrate the pipeline procedures in a more abstract form. Depending on the next-generation sequencing (NGS) platform (e.g.
Illumina HiSeq 2500), the DNA of an organism may be randomly shredded to produce millions or billions of short strings, typically ∼150 base pairs (bp) in length, referred to as a sequencing library [110, 183]. The DNA fragments are then sequenced in parallel at random, generating a set of reads, which are small sub-sequences. Given the reads, these bioinformatics pipelines may perform a search against a reference database to learn what the DNA encodes, or assemble reads into longer contiguous sequences (contigs) by merging their overlaps (e.g. SPAdes [21]). The assembly step is computationally intensive, as is the identification of putative gene boundaries that encode proteins through open reading frame (ORF) prediction methods (e.g. the Prokaryotic Dynamic Programming Gene-finding Algorithm (Prodigal) [146]). Following ORF prediction, an aligner (e.g. DIAMOND [47] and BLAST [9]) may be used to perform lookups in reference sequence databases with known functions (e.g. NCBI's RefSeq non-redundant proteins [264]) to recover the functional roles of ORFs, also known as ORF annotation. While a fraction of ORFs can be assigned a function, many sequences will remain uncharacterized. Of particular interest are those ORFs that encode known enzymes, which can be put in the context of metabolic pathways by applying an appropriate pathway reconstruction tool (e.g. PathoLogic [163]). With the exception of NGS, all the remaining processes can be merged into a bioinformatics pipeline.
It is clear from this simple schematic procedure that pathway inference tools follow two common consecutive steps: 1)- identify a list of enzymes encoding reactions (also known as reactome inference), then 2)- reconstruct pathways from the detected enzymes.
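To make the second step concrete, the following sketch maps a set of annotated enzymes onto a toy reference collection and scores each pathway by the fraction of its required enzymes that were detected. The pathway definitions and coverage cutoff are hypothetical placeholders, not real MetaCyc content.

```python
# Minimal sketch of step 2): reconstructing pathways from detected enzymes.
# The reference pathways and coverage threshold below are invented for
# illustration; real tools use curated pathway definitions and richer rules.

# Hypothetical reference: pathway -> set of required enzymes (EC numbers).
REFERENCE = {
    "glycolysis-like": {"EC:2.7.1.1", "EC:5.3.1.9", "EC:2.7.1.11"},
    "tca-like":        {"EC:4.2.1.3", "EC:1.1.1.42"},
}

def infer_pathways(annotated_enzymes, reference=REFERENCE, min_coverage=0.5):
    """Return (pathway, coverage) pairs whose enzyme coverage passes a cutoff."""
    detected = set(annotated_enzymes)
    results = []
    for pathway, required in reference.items():
        coverage = len(required & detected) / len(required)
        if coverage >= min_coverage:
            results.append((pathway, coverage))
    # Sort by descending coverage so the best-supported pathways come first.
    return sorted(results, key=lambda pair: -pair[1])

print(infer_pathways({"EC:2.7.1.1", "EC:5.3.1.9", "EC:1.1.1.42"}))
```

Real predictors replace the bare coverage ratio with richer scoring (taxonomic pruning in PathoLogic, parsimony in MinPath), but the map-then-score skeleton is broadly similar.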
While the pathway inference tools can suboptimally recover pathways for a single organism, they do not perform well for metagenomes [91, 310, 347], where fragments of sequences are attributed to several different microorganisms, thereby imposing challenges on the assembly process and the downstream pipelines. Metagenome-based approaches (or metagenomics) were motivated by limitations encountered in the lab itself. This is because the vast majority of microorganisms resist being cultured to produce colonies on standard Petri plates; it is estimated that less than 1% of the bacteria present in soils can be optimally grown and cultured under standard conditions [256]. As a result, metagenomics approaches were developed to solve this problem through culture-independent methods, involving high-throughput sequencing, by analyzing the genetic material of microorganisms directly sampled from the environment [124, 274]. This field is important in microbiology given that microorganisms are the most abundant and the most ancient lifeforms on Earth [129, 211].
From a diagnostics perspective, the reconstruction of complete or near-complete genomes of all organisms from a metagenome remains inefficient [8, 39, 230, 285]. Instead, researchers focus on partially recovering genomes to address higher-level questions related to taxonomic ("who is there?") and functional aspects of microbial communities ("what are they doing?"). Both questions are equally important and constitute fundamental challenges in microbial ecology. However, the latter question is more interesting, as it reveals interactions among microorganisms in communities consisting of myriad different but interacting species [169], where each community performs its own task, and the waste it produces becomes the starting material for its neighbors [106, 211, 231, 346].
Although it is difficult to discern the coexistence of microbes in such communities, owing to the enormous microbial diversity and the incompleteness of genomes, the reconstruction of pathways (together with taxonomic profiles) can provide unprecedented insight into the essential rules governing ecology and evolution [170, 211]. Therefore, in the context of metagenomics, the ultimate goal is to recover a subset of pathways to interpret the interactions of various organisms, whereas for a single organism the aim is to elucidate the metabolic network of that organism [124].
Whether the aim is to recover pathways from a single genome or a metagenome, the pathway inference tools [163, 370] to this date face multiple additional challenges associated with adapting to the ever-increasing content and diversity of publicly available genomics and metagenomics datasets [180, 209, 288, 349]. Consequently, converting these data into actionable insights to achieve the goal of pathway prediction is far from trivial or routine. This motivates the design of new approaches to disrupt the current set of inference algorithms. In response, a machine learning based approach, called PtwMLE [73], was developed. PtwMLE converts PathoLogic [165] rules into features to aid the learning process. The trained model can then be applied to make predictions for a newly sequenced and annotated genome. Experimental evaluation has shown that PtwMLE equaled or marginally exceeded the performance of PathoLogic, with the benefit of probability estimation for pathway presence and increased flexibility. Following PtwMLE, several other recent efforts incorporated metabolite information to improve pathway inference and reaction rules to infer metabolic pathways [49, 75, 321, 326]. However, none of these tools were dedicated to solving pathway inference given enzymes.
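The rule-to-feature idea can be sketched, in a heavily simplified form, as training one small classifier per pathway on rule-derived evidence. The features, toy data, and plain logistic model below are invented for illustration and are not Dale et al.'s actual implementation.

```python
# Schematic of the rule-to-feature idea: turn rule-like evidence into numeric
# features and train one small classifier per pathway. Features and data are
# hypothetical; this is NOT the PtwMLE implementation.
import math

def sigmoid(z):
    if z < -60:  # guard against math.exp overflow for very negative z
        return 0.0
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(X, y, lr=0.5, epochs=500):
    """Plain stochastic-gradient logistic regression (weights + bias)."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            err = p - yi
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
            b -= lr * err
    return w, b

# Hypothetical rule-derived features per genome:
# [fraction of pathway reactions matched, key reaction present?, taxonomic-range hit?]
X = [[0.9, 1, 1], [0.8, 1, 1], [0.2, 0, 1], [0.1, 0, 0]]
y = [1, 1, 0, 0]  # the pathway is present in the first two genomes only

w, b = train_logistic(X, y)
predict = lambda x: sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)
print(round(predict([0.85, 1, 1]), 3), round(predict([0.1, 0, 0]), 3))
```

Training one such classifier per pathway is the binary-relevance view of the problem that Part II of the thesis builds on.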
Therefore, this thesis examines multiple treatments of this problem from a mathematical perspective.

Thesis Objective
The overall objective of this thesis is to build intelligent systems that are scalable across hundreds of genomes from diverse sources and, yet, accurate in predicting pathways from enzymes, while at the same time being robust, to some extent, against errors propagated from many levels of a bioinformatics pipeline.

1.1 Desiderata of Machine Learning based Pathway Prediction

In order to deliver an applicable pathway inference tool based on machine learning, it is important to aim for some, if not all, of the desired characteristics listed below.
1. Noise insensitivity. Noise is mainly propagated from the upstream bioinformatics pipelines [152]. For example, if a sample is composed of multiple genomes, then during open reading frame (ORF) prediction many ORF searches may fail [135, 242]. In such cases, only a fraction of the gene sequences may be annotated, while the overwhelming majority of sequences will remain unknown. Although eliminating upstream noise is intractable, the noise proportion may be approximated [134, 391]. Therefore, it is a fundamental requirement for machine learning methods to systemically reduce noise.
2. Pathway correlation. Previous pathway inference tools largely ignored interactions among pathways. For example, in Homo sapiens the glycolysis pathway (glucose oxidation to obtain ATP) entails the presence of the citric acid cycle (TCA cycle) pathway (oxidation of carbohydrates and fatty acids) [236]. This imposes a prominent computational challenge that machine learning frameworks should address.
3. Enzyme disambiguation. Predicting accurate pathways from single or multiple organisms is impaired by either the "multi-enzyme single-mapping" or the "single-enzyme multi-mapping" problem.
The former case indicates that sets of enzymes (from taxonomically distinct organisms) may encode the same metabolic pathway [211, 253]. For example, the set of carboxylic acid metabolites represented in the TCA cycle (tricarboxylic acid cycle) pathway is present in all known organisms; however, these metabolites may be produced by different enzymes depending on the organism [305]. On the other hand, the single-enzyme multi-mapping problem arises when an enzyme contributes to multiple pathways. Enzymes, in such cases, are referred to as "promiscuous enzymes". For example, the enzyme acetylglutamate kinase contributes to both the ornithine and arginine biosynthesis pathways in Escherichia coli [164]. In both cases, the assignment of enzymes to reference metabolic pathways is an ambiguous task. Therefore, a machine learning model should consider solving this problem.
4. Taxonomic information. Taxonomically conserved genes are sets of genes found only in certain species (or other ranks) [154]. Based on this, incorporating a taxonomic profile may help accurately recover pathways and should not be avoided whenever this information is provided, conditioned on accurate gene annotations. However, for metagenomes, taxonomic information may negatively correlate with accuracy, since enzymes belong to many organisms [125].
To accommodate the above characteristics, this thesis examines multiple novel machine learning models to accurately recover pathways from enzymes. More concretely, we cast the solution in the context of multi-label learning. That is, the inference mechanism corresponds to predicting multiple pathways from each annotated genomic sequence.
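A minimal sketch of this multi-label view, using hypothetical genomes and pathway labels, builds the binary label indicator matrix and two statistics commonly reported for multi-label datasets (label cardinality and label density):

```python
# Minimal sketch of the multi-label view: each example (an annotated genome)
# carries a SET of pathway labels. Genomes and labels are hypothetical.
label_sets = {
    "genome_A": {"glycolysis", "tca-cycle", "arginine-biosynthesis"},
    "genome_B": {"glycolysis"},
    "genome_C": {"tca-cycle", "ornithine-biosynthesis"},
}

all_labels = sorted(set().union(*label_sets.values()))

# Binary indicator matrix Y: one row per example, one column per pathway.
Y = [[int(lab in labs) for lab in all_labels] for labs in label_sets.values()]

# Two statistics commonly reported for multi-label datasets:
cardinality = sum(sum(row) for row in Y) / len(Y)  # mean labels per example
density = cardinality / len(all_labels)            # cardinality / |labels|
print(cardinality, density)
```

A multi-label learner predicts an entire row of Y per example, rather than a single class, which is the framing used throughout the coming chapters.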
We applied these methods to diverse genomic datasets in order to evaluate and assess the recovered pathways, demonstrating that our models equal or exceed the accuracy of classical methods.

1.2 Thesis Contributions and Road Map

The thesis is divided into six parts: i)- background, ii)- traditional binary relevance based multi-label pathway prediction, iii)- graph-based multi-label classification, iv)- ensemble based multi-label subsampling and bagging, v)- multi-label learning from multiple sources, and vi)- conclusion and afterword. Table 1.1 outlines the overall structure of our research with reference to the parts and chapters of this thesis. The dependency structure among the developed models (for achieving a reasonable trade-off between the characteristics described in Section 1.1) is illustrated in Fig. 1.1. The road map of this thesis can be summarized as:
Part I Background. Part I presents an overview of metabolic pathway databases and inference algorithms in Chapter 2, while the background material on multi-label classification approaches, including common definitions, evaluation metrics, and benchmark datasets, is outlined in Chapters 3 and 4.
Part II Conventional Multi-Label Classification (Chapter 5). Part II describes an initial construction of our metabolic pathway prediction model, called mlLGPR. The model follows a multi-label classification approach that employs a rich pathway feature set based in part on the work of Dale and colleagues [73] to predict metabolic networks at the individual, population, and community levels of organization. Moreover, this chapter establishes standard protocols to perform a comparative analysis of previous state-of-the-art pathway predictors across a wide range of datasets. Results indicated that mlLGPR equaled or exceeded previous reports for organismal genomes.
Part III Graph based Multi-Label Classification.
This part consists of two chapters, where the overall goal is to project components of pathways onto a graph to encode various associations for the pathway inference task. Specifically, Chapter 6 presents the pathway2vec package to automatically generate features from a multi-layer heterogeneous information network (HIN) [302] using MetaCyc. The metabolic pathways in the HIN are decomposed into three interacting layers: compounds, enzymes, and pathways, where nodes within a layer manifest intra-layer interactions and nodes between layers manifest inter-layer interactions. This layered architecture captures relevant relationships used to learn a neural embedding-based low-dimensional space of metabolic features using the Skip-Gram model [224]. For the pathway prediction task, it is demonstrated that the incorporated embeddings are indeed a more viable choice for pathway inference than the features introduced in the mlLGPR model. Moving forward, Chapter 7 adopts the embedding features from pathway2vec to exploit the meta-level relationships, in a network manner, between enzymes and pathways using a three-stage non-negative matrix factorization technique, followed by subnetwork detection to extract the mesoscopic structure of the pathway and enzyme networks learned from a reference database and datasets. This model, called triUMPF, showed compelling performance in detecting more accurate (less noisy) and, at the same time, correlated pathways on the benchmark datasets introduced in Chapter 4.
Part IV Multi-Label Subsampling and Bagging. This part is decomposed into two chapters, where the goal is to reduce noise and subsample the most informative examples in order to relax the constraints imposed on triUMPF. In particular, Chapter 8 introduces reMap to relabel examples into bags, where each bag is comprised of non-disjoint correlated pathways collected from another proposed package, SOAP. To relabel a pathway dataset into bags, reMap applies an iterative procedure, alternating between i)- assigning bags to each example and ii)- updating reMap's internal parameters. After the relabeling process is accomplished, leADS, introduced in Chapter 9, may be used to infer metabolic pathways. leADS is inspired by both tree-based multi-label classification and uncertainty based subsampling approaches (presented in Section 3.4). While the tree-based component (an ensemble type) enhances the generalization ability, the subsampling seeks to reduce the negative impact on training loss caused by the tail-label (class-imbalance) problem. In comparison to triUMPF, this dual strategy boosted both the precision and sensitivity of pathway prediction on all datasets.

Part | Research Task | Chapter | Publication
Part I Background | 1. Metabolic Pathway Databases and Inference Algorithms | Chapter 2 | –
Part I Background | 2. Multi-Label Learning and Preliminaries | Chapter 3 | –
Part I Background | 3. Benchmark Data and Evaluation Metrics | Chapter 4 | –
Part II Conventional Multi-Label Classification | Multi-label Classification Approach to Metabolic Pathway Inference with Rich Pathway Features | Chapter 5 | [214]
Part III Graph based Multi-Label Classification | 1. Leveraging Heterogeneous Network Embedding for Metabolic Pathway Prediction | Chapter 6 | [24]
Part III Graph based Multi-Label Classification | 2. Incorporating Triple NMF with Community Detection to Metabolic Pathway Inference | Chapter 7 | [22]
Part IV Multi-Label Subsampling and Bagging | 1. Relabeling Metabolic Pathway Dataset with Bags to Enhance Predictive Performance | Chapter 8 | [25]
Part IV Multi-Label Subsampling and Bagging | 2. Multi-label Pathway Prediction based on Active Dataset Subsampling | Chapter 9 | [23]
Part V Multi-Label Learning from Multiple (Less-Trusted) Sources | Leveraging Multiple (Less-Trusted) Sources to Improve Metabolic Pathway Prediction | Chapter 10 | –
Part VI Afterword | Conclusions and Future Work | Chapter 11 | –
Table 1.1: Structure of the thesis with reference to the associated parts and chapters.

Part V Multi-Label Learning from Multiple (Less-Trusted) Sources (Chapter 10).
Figure 1.1: Proposed models in this thesis with reference to the associated parts and chapters. Each block describes the model and its overall objective.

This part addresses the problem of assigning weights to each pathway prediction model described in the previous chapters. Having access to many pathway inference algorithms may obscure the benefits of a specific model that performs well under certain conditions but degrades under others. This raises concerns with regard to the trustworthiness of predictive models. Motivated by this observation, mltS emerged, which leverages reference datasets to learn algorithm-specific weights (ascertaining the confidence level) while, at the same time, global coefficient values are learned based on meta-learning approaches [12, 92]. We show that mltS achieved competitive results against pathway predictors, and we empirically demonstrate the impact of source-related weights in making predictions aggregated from all algorithms.
Part VI Afterword (Chapter 11). Finally, this part concludes the thesis and discusses the successes and shortcomings of the analytical approaches taken. It then points out interesting future directions that could be explored, either to optimize our proposed methods or to merge them with downstream applications.

Part I
Background

Chapter 2
Metabolic Pathway Databases and Inference Algorithms

"The man who does not read good books has no advantage over the man who can't read them."
– Mark Twain

This chapter summarizes pathway databases (in Section 2.1) and the most up-to-date pathway prediction algorithms (in Section 2.2). The survey is centered around metabolic pathway databases and methods that take enzymes (or reactions) as inputs to predict pathways. Therefore, algorithms residing outside this paradigm are not considered in the discussion.

2.1 Metabolic Pathway Database

Curating a pathway database (PDB) is a complex procedure for a variety of reasons.
Foremost, there are no established formats or guidelines for biological pathway representation; hence, the same pathway may have different topological structures across multiple databases. Over the past decade, many formats were proposed to integrate and unify pathway definitions, such as the KEGG Markup Language (KGML) [14], the Biological Pathway Exchange (BioPAX) Levels 1, 2 and 3 [77], and the Systems Biology Markup Language (SBML) [57]. However, given that the information content and quality of pathways vary significantly among PDBs, the solutions provided by unified framework approaches are limited. It is up to bioinformatics practitioners and biologists to establish standard procedures for defining pathway topology and to perform manual inspections to validate pathways. Here, we discuss some of the PDBs widely used by the scientific community.
MetaCyc [51]. This is a large, comprehensive reference database of pathways (2,766 as of January 2020) encompassing all domains of life. It contains non-redundant data elucidating metabolic pathways, reactions, metabolites, genes, and enzymes, which were experimentally determined, validated, and reported in the scientific literature. Pathways in MetaCyc are of two types: i)- base pathways, comprising reactions, and ii)- super-pathways, which are combinations of base pathways or individual reactions. Each component in the MetaCyc database is periodically updated to keep pace with new discoveries. The database can be accessed through the BioCyc web portal [162] and is integrated into the Pathway Tools software [163]. Although MetaCyc claims to be the largest collection of curated metabolic pathways, in reality no pathway database is complete; therefore, it is common to use multiple PDBs to interpret results collectively [278].
KEGG [158].
As with MetaCyc, the Kyoto Encyclopedia of Genes and Genomes, or KEGG, is a comprehensive knowledge base that contains large-scale molecular information and a set of manually curated pathways, collected from the scientific literature and multiple individual resources, such as the ENZYME database [20], RefSeq [243], GenBank [30], and the NCBI Taxonomy [89]. The KEGG database is composed of 18 manually curated databases grouped into five categories: systems information, genomic information, chemical information, health information, and drug labels. To analyze components of the KEGG databases, the KEGG Mapper engine was introduced, comprising a collection of KEGG mapping tools, such as KofamKOALA [15], BlastKOALA, and GhostKOALA [159], for linking molecular objects (genes, proteins, metabolites) to higher-level objects (pathways, modules, taxonomy).
Others. Several other publicly available pathway databases include: NCI-PID [286], BioCarta [37], PANTHER [223], MACADAM [186], which uses the Pathway Tools software to generate databases concerning microbes, community-driven databases such as Reactome [156] and WikiPathways [258], and the SEED database [245, 246], which contains metabolic content derived largely from the KEGG database. Some of these databases update their contents regularly, while others do not; therefore, it is advisable to use the most up-to-date and complete information for optimal pathway reconstruction from experimental data.

2.2 Metabolic Pathway Prediction Algorithms

Metabolic pathway prediction algorithms follow a common set of steps. The first step is typically to map annotated protein-coding genes (or enzymes) onto reference pathway collections, stored in trusted repositories such as MetaCyc, followed by scoring the recovered pathways.
Many efforts have been devoted to characterizing metabolic pathways from enzymes.

Topic | PathoLogic | MinPath | PtwMLE
Approach | heuristic symbolic rule-based | hybrid (integer programming combined with symbolic rules) | statistical, using a diverse array of machine learning algorithms
Dataset | a list of annotated genes in GenBank or PathoLogic format | a list of annotated genes | a dataset comprising 5,610 pathway instances from six organisms
Reference database | MetaCyc | KEGG, MetaCyc, & the SEED | MetaCyc
Predict reactions? | yes | no | no
Predict subnetworks? | no | no | no
Predict pathways? | yes | yes | yes
Estimate subnetwork/pathway abundance? | yes | possible (1) | no
Sort outputs? | possible | possible | no
Modularity | no | no | yes
Biological component interactions | no | no | yes (partially)
Taxonomic constraints | yes | no | yes
Interface type | standalone & web-based | standalone | –
Programming language | Lisp, Java, & JavaScript | Python | Lisp
Open-source software? | no | yes | not available (2)
Other issues | inefficient on metagenomic datasets; insufficient to discriminate pathway variants; report generation excludes interpretation of which rules were engaged during prediction; relies heavily on manual inspection to re-adjust the rules | same problems as PathoLogic; oversimplified model | has some level of interpretability but requires tuning many (hyper)parameters
Table 2.1: Comparison of pathway prediction algorithms.
Notes to Table 2.1: (1) various estimations, such as pathway abundance, are not designed into the original implementation of the algorithm; however, they can be added as inputs to a downstream pipeline. (2) The algorithm is no longer available to the research community.

Here we summarize the current up-to-date panorama of prediction algorithms, which can be grouped into two categories: i)- the symbolic heuristic rule-based approach, such as PathoLogic [163] and MinPath [370], which may adopt manually defined rules to predict pathways; and ii)- the machine learning (ML) approach, such as PtwMLE [73], which uses mathematically and statistically driven methods to automatically extract patterns, without explicit rules, for pathway prediction. Table 2.1 provides a brief comparison of the algorithms discussed in this section.
PathoLogic. This algorithm is a component of the Pathway Tools software [163], which is maintained by SRI International [307]. The algorithm takes two input types: i)- an annotated genome/metagenome of the organism to be analyzed, in GenBank or PathoLogic format, and ii)- a pathway database extracted from MetaCyc. It then proceeds to predict pathways in two sequential steps: i)- reaction inference and ii)- pathway inference from the predicted reactions. The two-step predictions employ a set of manually created rules, including rules based on taxonomic range. For example, a predicted pathway is pruned from an organism if that organism is outside the expected taxonomic distribution of that pathway. While the designed rules are periodically inspected and investigated, they neglect interactions among pathways. This is contrary to the acknowledged interaction and overlap between pathways and the influence that one pathway can exert over another. The results of prediction are stored in a pathway genome database (PGDB), containing objects corresponding to genes, proteins, metabolites, biochemical reactions, predicted metabolic pathways, and many others.
PathoLogic has been used to create more than 17,000 PGDBs, which can be accessed through the BioCyc portal [38].
MinPath. This is an integer programming based approach to inferring pathways [370]. The inputs to this algorithm are of two types: i)- a set of pathways extracted from a database and ii)- a list of annotated genes or enzymes. The system then outputs the smallest number of pathways that can explain the genes or enzymes observed within samples, via an iterative integer programming procedure [32]. MinPath (Minimal set of Pathways) has since been incorporated as a core component of the integrative HUMAnN processing modules [4], which were used to compare the microbial functional diversity and organismal ecology of 649 metagenomes as part of the Human Microbiome Project (HMP) [72].
PtwMLE. This was the first machine learning-based approach to pathway prediction [73]. It incorporates a diverse array of models (naïve Bayes, k-nearest neighbor, decision trees, logistic regression, random forests) trained on a dataset comprising six organisms (EcoCyc, AraCyc, YeastCyc, MouseCyc, CattleCyc, and SynelCyc). After training, the learned estimators were applied to a newly sequenced and annotated genome for pathway inference. It was reported that this approach achieved competitive performance against PathoLogic on a test dataset. However, this framework is not suitable for pathway reconstruction, because its input consists of pathway-related features (being either present or absent from MetaCyc), whereas in a genomic sample the input is both a list of enzymes and a reference pathway collection (e.g. MetaCyc).
Others.
Notable prediction algorithms include the XPathway framework [324], which combines pathway inference with differential analysis of RNA sequence data using the KEGG reference database; the KEGG Automatic Annotation Server (KAAS) [229]; and MG-RAST [222], which first predicts protein-coding regions from short reads and then maps the identified functions onto the SEED subsystems. However, these algorithms are tightly coupled to their reference database, making them less flexible to adapt to a new reference collection.

2.2.1 Summary

The pathway prediction algorithms discussed in this section fall short on several aspects, which we collectively point out: 1)- with the exception of PtwMLE, they consider neither pathway topological information nor interactions among genes or pathways; 2)- significance scores are not associated with the inferred pathways; 3)- the prediction algorithms do not assume uncertainty in pathway recovery, nor do they consider missing reactions in the input; 4)- some algorithms, such as PathoLogic, rely on rules to recover pathways, which are inflexible, preventing any room for customization and expansion; 5)- they are not computationally feasible for large numbers of sequenced genomes; and 6)- the implementations, practicality, user-friendliness, output formats, and programming languages used by the algorithms are less appealing to users with little software development experience. These bottlenecks further motivate the need to develop a scalable computational algorithm for pathway prediction, which is the main objective of this thesis.

Chapter 3
Multi-Label Learning and Preliminaries

"To have another language is to possess a second soul."
– Charlemagne

In this chapter, we present definitions and some preliminary studies with respect to the multi-label learning framework.
First, we provide an overview of a metabolic pathway and explain its internal components in Section 3.2.1, and we establish a formal definition of the pathway dataset in Section 3.2.2. Then, we state the problem with regard to the pathway prediction task in Section 3.3. Finally, we discuss current practices in multi-label learning techniques according to the applications they solve in Section 3.4. The topics in this chapter cover a broad range of research domains; however, we narrow our discussion to the scope of this thesis. The main purpose of this chapter is to present the background necessary to understand the methods presented in the coming chapters.

3.1 Notation

It is difficult to devise a single, consistent notation covering the wide variety of data, models, and algorithms discussed in this thesis. Nonetheless, we present some basic and common mathematical notation used throughout. Unless otherwise mentioned, mathematical symbols are limited to the chapter in which they are introduced.

All vectors are represented by boldface lowercase letters, such as x, and are assumed to be column vectors, while x^⊤ represents the transposed row vector. The i-th coordinate (where i ∈ {1, ..., n}) of a vector is referenced by x_i. Matrices are denoted by boldface uppercase letters, such as X, and X_i indicates the i-th row of X. The notation X_{i,j} denotes the (i,j)-th entry of X, corresponding to the i-th row and j-th column. If we write X = [x_1, ..., x_d], where the left-hand side is a matrix, we mean to stack the x_i along the columns, creating a matrix, where d (∈ Z) is some arbitrary number of vectors. The transpose of X is denoted X^⊤ and its trace is written tr(X). The Frobenius norm of X is defined as ||X||_F = √tr(X^⊤X). An occasional superscript, X^(i) (or x^(i)), indicates an index to a sample, a power, or a position. We use calligraphic letters to represent sets (e.g.
E) while we use the notation |·| to indicate the cardinality of a given set. We denote the set of natural numbers by N, the set of integers by Z, and the set of real numbers by R. R_+ represents the non-negative half-line, while R^n represents the n-dimensional vector space over R. With these basic notations, we present some important definitions.

3.2 An Overview of a Metabolic Pathway

Having provided mathematical notation in the above section, here we present an overview of a metabolic pathway and explain its internal components.

A metabolic pathway is a finite set of biochemical reactions occurring within a cell that leads to a certain product or a change in the cell. Generally, a pathway can be either catabolic, where compounds are broken down to release energy (such as glycolysis, which converts glucose into pyruvate), or anabolic, where compounds are synthesized (such as proteins, carbohydrates, lipids, and nucleic acids) [192]. A metabolic reaction, in turn, is the transformation of one molecule (the substrate) into a different molecule (the product) and is often catalyzed by enzymes, which are protein catalysts that can alter the rate and specificity of chemical reactions inside cells [236]. A reaction catalyzed by a set of enzymes is called an enzymatic reaction [31]. If an enzyme catalyzes a single unique reaction, it is called a key enzyme and the reaction is called a key enzymatic reaction; if it contributes to multiple reactions, it is referred to as a promiscuous enzyme. The term key enzyme is coined from the "lock and key" model of substrate binding proposed by Emil Fischer [94], who suggested that the enzyme and the substrate possess complementary geometric structures that fit exactly into each other. If a key enzymatic reaction is dedicated to a specific pathway, then it is a pathway-unique key enzymatic reaction.
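The distinction between key and promiscuous enzymes can be sketched with a small routine. The reaction-to-enzyme mapping below is a hypothetical toy, loosely following Fig. 3.1, not data from any database:

```python
# Sketch: classify enzymes as "key" (catalyzing exactly one reaction) or
# "promiscuous" (catalyzing several), given a reaction -> enzymes mapping.
from collections import Counter

def classify_enzymes(reaction2enzymes):
    """Count how many reactions each enzyme catalyzes and split accordingly."""
    counts = Counter(e for enzymes in reaction2enzymes.values() for e in enzymes)
    key = {e for e, c in counts.items() if c == 1}
    promiscuous = {e for e, c in counts.items() if c > 1}
    return key, promiscuous

# Toy mapping loosely following Fig. 3.1: e3 and e4 each catalyze two reactions.
reaction2enzymes = {
    "w3": ["e1"], "w2": ["e2"], "w4": ["e3"],
    "w6": ["e3", "e4"], "w7": ["e4"], "w9": ["e5"], "w8": ["e6"],
}
key, promiscuous = classify_enzymes(reaction2enzymes)
# e3 and e4 appear in more than one reaction -> promiscuous; the rest are key.
```

The same counting view underlies the one-to-one assumption made later for enzymatic reactions.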
Under some specific thermodynamic conditions, reactions may occur without the intervention of enzymes; these are referred to as spontaneous reactions [236]. All the reactants, products, and intermediates produced by metabolic reactions are called metabolites [325], and are assumed to occur within a dedicated boundary constituting a metabolic pathway. A subset of these pathways (or reactions) corresponds to a subnetwork [152, 296] representing interactions among pathways. Subnetworks, in turn, may be associated with each other to form a wiring diagram representing a metabolic map (or network) [181] that determines the physiological and biochemical properties of a cell.

Figure 3.1: (a) A metabolic network of E. coli K-12 substr. MG1655 (represented by black, lime, blue, magenta, and orange colors) from the KEGG database [158] (Fig. 3.1a), and (b) a subnetwork of four metabolic pathways (Fig. 3.1b), which are represented by the lime, blue, magenta, and orange colors in Fig. 3.1a. The pathways in Fig. 3.1b are symbolized by y and correspond to: trans-cinnamate degradation (y2), fatty acid biosynthesis, initiation (y1), fatty acid biosynthesis, elongation (y3), and beta-oxidation (y4). The large circle surrounded by the blue colored border in Fig. 3.1b corresponds to the fatty acid biosynthesis, initiation pathway (y1) and its components, defined as: compounds by c[∗], enzymes by e[∗], reactions by integer numbers i ∈ {1, 2, ..., 10} on directed edges (→), and enzymes catalyzing reactions by dashed directed edges (⇢).

Fig. 3.1a shows a metabolic network (obtained from the KEGG database [158]) of E. coli K-12 substr. MG1655 (TAX-511145). An example of a subnetwork is shown in Fig.
3.1b, corresponding to the trans-cinnamate degradation pathway (y2) and three pathways related to fatty acid metabolism: fatty acid biosynthesis, initiation (y1), fatty acid biosynthesis, elongation (y3), and beta-oxidation (y4). A schematic view of the fatty acid biosynthesis, initiation pathway is indicated by the large circle surrounded by the blue colored border in Fig. 3.1b. Any components inside the boundary are internal to the pathway, while components residing outside the boundary are external to the pathway and may contribute to other reactions or pathways. Within the boundary, metabolites (e.g. c1) and enzymes (e.g. the promiscuous enzyme e3) are represented by gray and red circles, respectively. Except for e5 and e6, all the enzymes considered in this figure are key enzymes and are allocated inside the boundary. Metabolic reactions are indicated by directed edges (with numbers), and the arrows point to the metabolites produced by the associated reactions. Every reaction is, in theory, reversible, e.g. the reaction labeled by the number 5; however, conditions in the cell are often such that it is thermodynamically infeasible for the flux of a reaction to flow in the opposite direction, so the reaction becomes irreversible [152]. Transport reactions transform external metabolites by consuming them internally, e.g. reaction 1, or producing them outside the boundary, e.g. reactions 8 and 10 [74].

3.2.1 Terminology and Definition

The simplified description of a metabolic pathway can be translated into an undirected graph representation. This approach is convenient for the better elucidation of pathways and reduces the computational burden. In what follows, we provide a series of definitions to facilitate our discussion of a pathway graph.

Definition 3.1. Reaction Graph Topology. Let the reaction graph be represented by an undirected graph G^(rxn) = {C, Z^(c)}, where C is a set of c metabolites and Z^(c) represents r′ links between compounds.
Each link indicates a reaction, derived from a set of biochemical reactions R of size r′. Then, the reaction graph topology is defined by a matrix Ω^(c) ∈ Z_{≥0}^{r′×c}, where each entry Ω^(c)_{i,j} is a binary value of 1 or 0, indicating either that compound j is a substrate/product in reaction i or that it is not involved in that reaction, respectively. Z_{≥0} is the set of non-negative integers. Ω^(c) characterizes relationships between reactions and their associated metabolites.

Example 3.1. The incidence matrix Ω^(c) for Fig. 3.1 is a subset consisting of 6 internal metabolites and 10 reactions, and can be represented as:

Ω^(c)⊤ =
         w1  w2  w3  w4  w5  w6  w7  w8  w9  w10  ...  w_{r′}
    k1    0   1   1   0   0   0   0   0   0   0   ...
    k2    0   0   1   0   0   1   0   0   0   0   ...
    k3    0   1   0   1   1   0   0   0   0   0   ...
    k4    0   0   0   0   1   1   1   1   0   0   ...
    k5    0   0   0   0   0   1   1   0   1   0   ...
    k6    0   0   0   0   0   0   0   0   1   1   ...
    ...
    kc

where k ∈ C and w ∈ R. In this matrix, reaction w3 transforms k1 to produce the metabolite k2. If a reaction is not involved in the production/conversion of a compound, then its value is 0 for that compound.

As discussed, a reaction in G^(rxn) may be categorized as a spontaneous reaction or an enzymatic reaction catalyzed by enzymes, the latter giving rise to an enzyme-to-reaction association matrix.

Definition 3.2. Reaction-Enzyme Association (R2E). Let F represent a finite set of z metabolic enzymes. Then the reactions catalyzed by enzymes are represented as an incidence matrix Ω′^(e) ∈ Z_{≥0}^{r′×z}, where an entry Ω′^(e)_{i,j} indicates whether enzyme j catalyzes reaction i, encoded as 1, and 0 otherwise.

Example 3.2. Again in Fig. 3.1, the incidence matrix Ω′^(e) is a subset consisting of 6 enzymes and 10 reactions, and is represented as:

Ω′^(e)⊤ =
         w1  w2  w3  w4  w5  w6  w7  w8  w9  w10  ...  w_{r′}
    e1    0   0   1   0   0   0   0   0   0   0   ...
    e2    0   1   0   0   0   0   0   0   0   0   ...
    e3    0   0   0   1   0   1   0   0   0   0   ...
    e4    0   0   0   0   0   1   1   0   0   0   ...
    e5    0   0   0   0   0   0   0   0   1   0   ...
    e6    0   0   0   0   0   0   0   1   0   0   ...
    ...
    ez

where e ∈ F and w ∈ R. We observe two promiscuous enzymes (e3 and e4) catalyzing multiple reactions, as well as several spontaneous reactions having no enzymes, indicated by all-zero columns.

The matrix Ω′^(e) can be reduced to Ω^(e), such that Ω^(e) ⊆ Ω′^(e), having only r ≪ r′ enzymatic reactions, by eliminating spontaneous reactions. It is important to note that an enzymatic reaction can be given a hierarchical numerical category, known as an enzyme commission (EC) number, based on the chemical reaction catalyzed by a group of enzymes [219]. Within the framework of this thesis, we only consider enzymatic reactions represented by ECs. As explained before, an enzyme catalyzes a conserved set of reactions, where we assume, without loss of generality, a one-to-one relation, such that z ≈ r. For MetaCyc, a preliminary analysis indicated that each enzymatic reaction has fewer than 1.30 associated enzymes on average. Henceforth, we use e to synonymously indicate an enzymatic reaction.

With respect to metabolic pathways, there are several different ways to view them as a graph: i) pathway with compound, ii) pathway with enzyme (or gene), iii) pathway with reaction, or iv) a combination of all, yielding a metabolic network [181]. Since metabolite relationships are absorbed in G^(rxn) according to Def. 3.1, it is easy to model pathway-with-reaction relationships. Hence, an equivalent definition for the pathway graph topology with respect to reactions can be easily formulated.

Definition 3.3. Pathway Graph Topology.
Let G^(path) = {R, Z′^(r)} be an undirected graph, where R is as presented in Definition 3.1 and Z′^(r) represents a set of t′ links between reactions. Then, the pathway graph topology is defined by a matrix Ω^(r) ∈ Z_{≥0}^{t×r′}, where each entry Ω^(r)_{i,j} is either 0 or a positive integer, corresponding to the absence or the frequency of reaction j in pathway i, respectively, and t is the number of pathways in a set Y = {y1, y2, ..., yt}.

In a similar manner to Def. 3.2, we slightly abuse the notation to formulate the pathway to enzymatic reaction association matrix.

Definition 3.4. Pathway-EC Association (P2E). Let G′^(path) = {E, Z^(r)} be a subgraph of G^(path), such that E ⊂ R with r ≪ r′ enzymatic reactions. Then, the Pathway-EC association is defined as a matrix M ∈ Z_{≥0}^{t×r}, where each row corresponds to a pathway and each column represents an EC, such that M_{i,j} ≥ 1 if EC j is in pathway i and 0 otherwise.

The pathway and reaction topologies enable us to build various interaction adjacency matrices among the associated components, as follows.

Definition 3.5. Pathway-Pathway Interaction (P2P). Given G^(path), we define a Pathway-Pathway interaction matrix A ∈ Z_{≥0}^{t×t} such that an entry A_{i,j} is a binary value indicating an interaction between pathways i and j iff there exists a reaction k ∈ R whose associated compounds are either substrates or products in both pathways i and j.

Definition 3.6. EC-EC Interaction (E2E). Given G′^(rxn) ⊂ G^(rxn), we define an EC-EC interaction matrix B ∈ Z_{≥0}^{r×r} such that an entry B_{i,j} is a binary value encoding an interaction between two ECs i and j iff they both share a compound, i.e., Ω^(c)_{i,k} ∧ Ω^(c)_{j,k} = 1 for some k ∈ C.

For the metabolic network, our definition is loosely adapted from [181].

Definition 3.7. Metabolic Network.
A metabolic network is represented by N = (G^(rxn), G^(path), Ω^(c), Ω′^(e), Ω^(r), A, B).

Metabolic networks are complex and highly interconnected; thus, to understand metabolic genotype-phenotype associations, the global network is compartmentalized into smaller subunits, called subnetworks, which can be described as:

Definition 3.8. Metabolic Subnetwork. A subnetwork is represented by N^s = (G^{s,(rxn)}, Ω^{s,(c)}, Ω′^{s,(e)}, B^s), where G^{s,(rxn)} ⊆ G^(rxn), Ω^{s,(c)} ⊆ Ω^(c), Ω′^{s,(e)} ⊆ Ω′^(e), and B^s ⊆ B.

In the literature, a subnetwork may also be referred to as a community [277], which defines a set of densely connected nodes within that subnetwork. Note that a subnetwork agrees with the constraints supplied in the above definitions and does not necessarily establish an interconnected set of pathways. For example, if i ∈ R, j ∈ C, and Ω^{s,(c)}_{i,j} = 0, then the subgraph G^{s,(rxn)} should exclude those components jointly with their associations in the corresponding matrices, even if Ω^(c)_{i,j} = 1. Nonetheless, a subnetwork is considered a building block for metabolic network reconstruction from a bank of annotated genomic sequence datasets [82], which is our topic in the next section.

3.2.2 Pathway Dataset

Genome-scale pathway recovery is hampered by the absence of efficient computational tools that cope with quality control and evaluation standards (see Chapter 2.2). This is due to the limited resources available to the research community for conducting quantitative assessments of their tools, leading to wide variations in the produced results that may alter our understanding of systems biology. Collectively, these in-silico models initially investigate the accuracy of pathway reconstruction through simulations based on their designed standard metrics.
Afterward, qualitative analyses are performed on preferred high-throughput genomic datasets to demonstrate the potential utility of the models in interpreting the results. Despite the benefits provided by these tools, the conducted experiments are biased toward particular datasets, thereby leaving questions about standard approaches to performing evaluations.

Figure 3.2: Enzymatic reaction and pathway graphs. The left panel corresponds to the EC graph, where a group of nodes constitutes an input instance. The right panel indicates the pathway graph, where the blue colored node represents the true hidden pathway that is to be recovered, while the light gray colored nodes indicate false pathways.

Fig. 3.2 illustrates a graphical representation of a pathway dataset, corresponding to the pathway and enzymatic reactions of Fig. 3.1. The darker gray, blue, and light gray colors represent observed, true hidden (not predicted), and negative nodes, respectively. The goal is to recover y1, visualized in blue, given e1:6, which constitute an instance of pathway data. This problem is relatively easy if all the enzymes are unique key enzymatic reactions for y1; however, the two enzymes e5 and e6 violate this assumption and may participate in multiple reactions or pathways. In reality, an annotated genome comes with a large collection of ECs with abundance information, i.e., the number of copies of each EC. Fig. 3.3 shows a schematic view of such data, where abundance information is symbolized by doubly-circled EC nodes. In this figure, recovering the true pathways, highlighted in blue, poses challenges if there exist promiscuous enzymes contributing to y3 and y5, which may result in incorrectly inferring them using naïve mapping approaches.

Another important limitation to recovering pathways is specifically observed for metagenomic datasets [91, 310, 347], where enzymes from multiple species are packed together, as depicted in Fig. 3.4.
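The failure mode of such naïve mapping can be sketched as follows; the pathway-to-EC reference below is a toy illustration, not MetaCyc content:

```python
# Sketch of a naïve mapping baseline: predict every pathway whose required ECs
# are all present in the annotated genome. Promiscuous ECs shared across
# pathways can pull in pathways that are not actually active (false positives).
def naive_mapping(observed_ecs, pathway2ecs):
    """Return pathways whose EC sets are fully covered by the observed ECs."""
    observed = set(observed_ecs)
    return {p for p, ecs in pathway2ecs.items() if set(ecs) <= observed}

# Toy reference: y3 needs only the promiscuous ECs e5 and e6, which y1 uses too.
pathway2ecs = {"y1": ["e1", "e2", "e5", "e6"],
               "y3": ["e5", "e6"],
               "y4": ["e7", "e8"]}
predicted = naive_mapping(["e1", "e2", "e5", "e6"], pathway2ecs)
# The true pathway y1 is recovered, but y3 is also (spuriously) predicted.
```

This is precisely the ambiguity that the learning-based formulations in the following sections are designed to resolve.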
It should be understood that the multi-enzyme single mapping problem, discussed in Chapter 1.1, hampers precise reconstruction of pathways from microorganisms. Therefore, for metagenomic datasets, it is common to partially recover pathways in order to infer organismal interactions, whereas in the case of single cells the reconstruction corresponds to the elucidation of cellular processes [124].

Thus far, we have considered enzymatic reactions corresponding to a single organism. For multiple genomes, the matrix format is a more standard way to represent multiple annotated genomic samples.

Figure 3.3: Enzymatic reaction and pathway graphs. The left panel corresponds to the EC graph, where a group of nodes constitutes an input instance and the doubly-circled nodes indicate abundance information (more than 1 enzymatic reaction). The right panel indicates the pathway graph, where the blue colored nodes represent the true hidden pathways that are to be recovered, while the light gray colored nodes indicate false pathways.

Figure 3.4: Enzymatic reaction and pathway graphs. The left panel corresponds to the EC graph, where a group of nodes constitutes an input instance and the doubly-circled nodes indicate abundance information (more than 1 enzymatic reaction). Each color represents a distinct organism. The right panel indicates the pathway graph, where the blue colored nodes represent the true hidden pathways that are to be recovered, while the light gray colored nodes indicate false pathways.

Example 3.3. A pathway dataset of n examples (genomes or metagenomes) and r enzymatic reactions can be organized in a matrix X, while pathways can be represented in a binary matrix Y of size n examples by t pathways.

Figure 3.5: The enzymatic reaction graph. The nodes are considered to be input, and the doubly-circled nodes indicate abundance information (more than 1 enzymatic reaction). Each color represents a distinct subnetwork.
The dashed link indicates possible edges between discovered subnetworks.

The matrices X and Y can be represented as:

X =
          e1  e2  e3  e4  e5  e6  e7  ...  e_r
 x(1)      0   1   5   0   7   0   1  ...
 x(2)     12  10   0  11   0   0   0  ...
 x(3)      2   3   0   1   0   1   0  ...
 x(4)      0   0   5   1   0  11   4  ...
 ...
 x(n)

Y =
          y1  y2  y3  y4  y5  ...  yt
           1   1   1   0   0  ...
           1   0   0   1   1  ...
           0   1   1   0   1  ...
           1   0   1   0   1  ...
 ...

where X_{i,j} is the abundance of EC j in example i, and an entry Y_{i,j} indicates whether pathway j is associated with sample i.

Now, if we were given only X, without labeled responses, then in the context of machine learning an unsupervised learning approach is the more convenient way to detect patterns. Methods discovering subnetworks [98, 152, 185, 296] follow this type of learning. They all aim to find an optimal set of subnetworks using a suitable cost function; the discovered subnetworks are then chained together to form metabolic pathways or linked reactions, as shown in Fig. 3.5. While such methods may be applied to pathway recovery, verification and validation of pathways by these methods are nontrivial tasks.

On the contrary, if a model leverages both X and Y (assuming pathways are provided), then a supervised learning approach is more efficient for detecting patterns related to pathways. In this strategy, the goal is to learn a hypothesis function mapping the EC space onto the pathway space given a pathway dataset. Since each annotated genomic data instance is associated with multiple outputs (pathways), this type of data is called a multi-label pathway dataset, which can be defined as follows:

Definition 3.9. Multi-label Pathway Dataset. A genomic pathway dataset is characterized by S = {(x^(i), y^(i)) : 1 ≤ i ≤ n}, consisting of n examples, where x^(i) is a vector indicating the abundance information corresponding to enzymatic reactions.
An enzymatic reaction, in turn, is denoted by e, an element of a set of enzymatic reactions E = {e1, e2, ..., er} of r possible reactions. The abundance of an enzymatic reaction for an example i, say e^(i)_l, is defined as a^(i)_l (∈ R_{≥0}). The class label vector y^(i) = [y^(i)_1, ..., y^(i)_t] ∈ {−1,+1}^t is a pathway label vector of size t, the total number of pathways, which themselves are derived from a set of universal metabolic pathways Y. For each example i, y^(i)_j = +1 indicates the presence of label j, while y^(i)_j = −1 means the same category is absent for i. The matrix forms of x^(i) and y^(i) are symbolized as X and Y, respectively.

Both E and Y can be extracted from the reliable knowledge bases discussed in Chapter 2.1 (e.g. KEGG [158] and MetaCyc [51]). In this thesis, we adopted MetaCyc. Furthermore, because Y is composed of multiple outputs for each annotated genome, the supervised learning task is categorized as multi-label learning, which is examined in the next section.

3.3 Multi-Label Learning Problem Formulation

In Def. 3.9, we assumed that there is a numerical representation behind every instance and pathway label. We use x ∈ X = R^r to denote the r-dimensional feature vector (input space) representing an instance, and U = R^d for the d-dimensional numerical label vector. In practice, each input example is mapped into an arbitrary m-dimensional vector (m ≫ r) based on a preferred transformation function Φ : X → R^m, which may be described as a feature engineering process. Furthermore, each example in S is considered to be drawn independent and identically distributed (i.i.d.) from an unknown distribution D over X × U.

Given this notation and a multi-label dataset S, the goal of multi-label classification is to learn a hypothesis function h : Φ(x) → {−1,+1}^t from S, such that it predicts the best metabolic pathways for a hitherto unseen instance [397]. The value −1 indicates that the corresponding pathway is absent, while +1 suggests it is present in an example.
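As a minimal sketch of this ±1 output convention, the following routine thresholds per-pathway confidence scores; the cut-off and the score values are illustrative, not part of any model in this thesis:

```python
# Sketch: a hypothesis maps a feature vector to a vector in {-1, +1}^t by
# thresholding real-valued per-pathway confidence scores at a cut-off.
def predict_labels(scores, tau=0.5):
    """Turn real-valued per-pathway scores into a {-1, +1} label vector."""
    return [+1 if s >= tau else -1 for s in scores]

# Scores for t = 5 pathways; only those at or above the cut-off are "present".
y_hat = predict_labels([0.91, 0.12, 0.55, 0.08, 0.49])
# -> [+1, -1, +1, -1, -1]
```

Choosing the cut-off trades precision against sensitivity, a balance revisited throughout the later chapters.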
The estimator h is obtained by optimizing the expected risk of h with regard to a loss function l(y, h(Φ(x))) : h(Φ(x)) × y → R, where y ∈ {−1,+1}^t, according to:

(3.3.1) ε_l(h) = E_D[l(Y, h(Φ(X)))]

Since D is unknown, it is infeasible to estimate Eq. 3.3.1; instead, we apply its empirical counterpart ε̂_l(h) [3, 228]. In many modern multi-label learning systems, the estimator h(.) is formulated as:

(3.3.2) h(Φ(x)) = vec({+1 if f(Φ(x), j) ≥ τ; −1 otherwise}) ∀ j ∈ t

where τ ∈ R is a cut-off threshold, vec is a vectorization operation, and f(Φ(x), j) ∈ R is a real-valued score representing the predictive confidence of j ∈ t being a proper label index for Φ(x). The form of output transformation in Eq. 3.3.2 is adopted in this thesis. Next, we review multi-label learning algorithms given S.

Figure 3.6: 10 types of correlations among pathways and 3 input instances. The dark gray colored nodes indicate input samples while the light-colored nodes represent pathways.

3.4 Multi-Label Learning Algorithms

Probably the fundamental challenge of learning from multi-label data is to design a suitable solution with strong generalization ability while capturing correlations among labels and instances [76]. There can be a mixture of 10 types of correlations in a given multi-label dataset [251], represented as undirected graphs in Fig. 3.6, where the top layer with dark-colored nodes indicates the input instances while the corresponding pathways are depicted in the bottom layer with light-colored nodes:

1. OneXOnePathway. Each pathway label is associated with one x.
This form of relationship is rare for the pathway dataset.

2. OneXChainPathway. A pathway, say y1, is strongly associated with a single observed example x(1), while the remaining pathways are inferred based on y1, hence forming a chain structure, i.e., y2 is related to y1 and y3. This form is mostly observed for pathways that include only spontaneous reactions. Other examples are not observed to be linked to the three pathways.

3. OneXPartitionPathway. In this case, a partition is formed based on a correlation of two pathways, y1 and y2, with an observed input, x(1), while the pathway y3 may only be detected through x(1). The pathway y2 constitutes a strong bond with y1 and cannot be inferred directly from x(1), similar to the representation in OneXChainPathway. Moreover, the two remaining instances may not be linked to the three pathway labels.

4. ManyXManyPathway. All pathways are linked to the three instances, but pathways are not associated with each other. This form corresponds to organisms sharing the same set of pathways. Perhaps this may constitute duplicate instances in a pathway dataset.

5. ManyXChainPathway. This is the same as OneXChainPathway, but a pathway, e.g. y1, is linked to all instances. This form corresponds to organisms that may share the three pathways but may contain different pathway sets outside the scope of the three pathways.

6. ManyXPartitionPathway. Similar to the explanation presented for OneXPartitionPathway; however, both y1 and y3 are now linked to the three input instances. Again, organisms may not exhibit similar patterns outside the scope of the three pathways.

7. OneXTreePathway. This is similar to OneXPartitionPathway and OneXChainPathway. Here, inferring a pathway y1 entails that all its presumed correlated pathways be recovered through y1 only.

8. OneXFanPathway. Similar explanation to OneXTreePathway, but y3 can be inferred from either of the pathways y1 and y2 and not from the input data x(1).

9. ManyXTreePathway.
The same explanation as OneXTreePathway, but y1 is associated with the three input data. Again, instances may not exhibit similar patterns outside the scope of the three pathways.

10. ManyXFanPathway. Similar to OneXFanPathway; however, y1 is linked to the three observed input data. Organisms may not have the same pathway set outside the three pathways.

Among all the correlation structures, OneXOnePathway is rarely or never observed in the pathway dataset defined in Def. 3.9, while the remaining forms may be exhibited in different proportions in the input data. If a pathway is linked to all input data, then it constitutes an element of a universal pathway set, such as the TCA cycle. While a directed graph would be more convenient to illustrate the various dependencies in Fig. 3.6, this type of graph is known to be computationally intractable; therefore, we use the undirected counterpart. Discovering and assessing the distributions of correlation structures in data is a whole other research domain that is outside the scope of this thesis and constitutes an interesting research topic.

Over the past decade, many models were proposed to articulate correlation and multi-label problems. Historically, they were compartmentalized into two categories [132, 216, 386]: i) algorithm adaptation and ii) problem transformation. Multi-label models that adapt and extend specific single-label algorithms for the task of multi-label classification are called algorithm adaptation methods, such as multi-label kNN [385], multi-label decision trees [68], ranking support vector machines [85], the collective multi-label classifier [105], AdaBoost.MH [287], and hierarchical multi-label decision trees [331]. The problem transformation methods translate the multi-label learning problem into well-established learning scenarios so that traditional single-label classifiers can be applied without modification, such as binary relevance [43], label power-set and pair-wise methods, classifier chains [270], calibrated label ranking [102], and random k-label sets [327].

Although this division is common in the literature [216, 381, 386, 397], it creates difficulty in characterizing new models. Therefore, in the context of this thesis, we categorize multi-label algorithms by task-driven approaches according to: i) binary relevance, ii) low rank algorithms, iii) ensemble and deep learning methods, iv) partially labeled approaches, v) active learning methods, and vi) notable algorithms.

3.4.1 Binary Relevance Methods

Binary relevance (BR) addresses multi-label learning by decomposing the problem into t independent binary classifiers, where t is the total number of labels in Y. Each binary classification problem j ∈ t is trained on examples that are tagged positive for j, while the remaining instances are considered negative. During prediction, this approach queries outputs for each label given x* and then aggregates the produced labels to tag the example x* with a set of relevant labels according to:

(3.4.1) y* = vec({+1 if h_j(Φ(x*)) ≥ 0; −1 otherwise}) ∀ j ∈ t

where h_j(Φ(x*)) corresponds to the predictive result for label j, with a similar representation as Eq. 3.3.2. With the exception of the OneXOnePathway or ManyXManyPathway correlation structures, this approach is ineffective at exploiting other types of relations among labels in the decision function h_j.
Fortunately, one can easily address this limitation by introducing constraints into the objective function during training according to:

(3.4.2) C(h) = ⊕ Σ_{j=1}^{t} ( (1/n) Σ_{i=1}^{n} l(Y_{i,j}, h_j(Φ(X_i))) + λ Ω(h_j) )

where C is the objective cost function, comprising both a loss function l (the first term) and a constraint function Ω (the second term). ⊕ indicates a model-dependent operation with regard to h, usually either minimization or maximization, possibly with additive or subtractive terms, and λ > 0 is a tuning hyper-parameter that controls the trade-off between l(.) and Ω(.). The last term, Ω(h_j), can be understood as a series of penalties, which may correspond to a combination of regularizations to prevent overfitting and constraints on coefficients to enforce semantic similarities among labels or instances. It is, therefore, the choice of h(.) and Ω(.) that discriminates the models in this context.

For example, Zhang and colleagues [390] proposed a support vector machine (SVM) based method as the decision function h to explicitly extract relationships among labels, extending the previous conventional SVM based model [85]. Babbar and colleagues [18, 19] also applied SVMs to the extreme multi-label classification domain, where the number of labels often exceeds thousands. Concretely, they suggested an adversarial perturbation technique applied to the training data to address tail labels that occur infrequently across examples in multi-label data. In the vein of probabilistic learning [120], Cheng and colleagues [61] proposed multi-label logistic regression to capture interdependencies between labels from instances, motivated by the multi-label k-nearest neighbor (kNN) classifier with Bayesian inference techniques [385].

In addition to the above models, almost all recent binary relevance models enforce constraints to optimize the learned parameters.
These include LIFT [382], meta-level features [367], an oracle-teacher based label induction algorithm [377], shared subspaces based on the input space [149], labeling-importance based learning that incorporates the relative importance of each relevant label [200, 384], and self-paced learning that simulates the human learning process by gradually learning from easy to hard instances [193]. However, in many cases, a naïve assumption regarding the existence of correlations among labels may deteriorate performance; hence, Xu and colleagues [358] suggested an alternative solution based on causality learning. Although this model achieved good results, it is impractical for the pathway dataset, which consists of a large number of pathways (t > 2500).

Coupling constraints with the decision function has definitely improved the performance of binary relevance models; nonetheless, they remain imperfect at mining discriminative features for certain correlation structures, such as OneXChainPathway in Fig. 3.6. Regardless, these methods are arguably the most intuitive solution for learning from multi-label data, in particular when data are exposed to noise (e.g. missing labels) [48, 360, 366], due to their simplicity and the ease of distributing learning and prediction tasks across multiple cluster nodes for large-scale data, as we shall see in Section 3.4.3. In Chapter 5, we will present mlLGPR, which follows this formulation to solve the pathway inference problem.

3.4.2 Low Rank Methods

The algorithms introduced in the last section are limited to exploring only certain correlations, which is detrimental to classification performance. A feasible strategy is to assume that both the instance matrix X and the label matrix Y have effectively low rank and, by projecting them onto low-dimensional linear subspaces, to recover shared subspaces that can be supplied to a multi-label classification model [150].
This family is called low rank based methods, and it exhibits strong generalization guarantees [18]. In addition, low-rank methods generally reduce the dimension of the input and/or label space; hence, they are also considered dimensionality reduction methods. In the literature, there are multiple variants of low-rank methods, differing mainly in their choices of reduction and decompression techniques [262], and we can roughly compartmentalize them into three major categories: i)- feature space, ii)- label space, and iii)- hybrid space reduction methods.

i)- Feature space reduction. Methods in this approach compress the input variables independently of the labels, and have the following minimization objective:

(3.4.3)   $\min_{\mathbf{W}} \|\Phi(\mathbf{X})\mathbf{W} - \mathbf{Y}\|_{b}^{u} + \lambda\, \Omega(\mathbf{W})$

where $\mathbf{W}$ is a feature coefficient matrix with arbitrary dimension size, in which each entry $\mathbf{W}_{i,j}$ corresponds to the importance of the i-th feature in the approximation; $b$ and $u$ are the $\ell_{*}$ norm and power operations, respectively, used to encourage specific conditions. For example, applying $\ell_{2,1}$ with $u = 2$, where $\ell_{2,1}$ is the sum of the Euclidean norms of the columns of a matrix, will promote a sparse representation of $\mathbf{W}$. $\Omega$ has a similar interpretation as in Eq. 3.4.2, i.e. a series of constraint terms.

Notable methods in this group include principal component analysis (PCA) [155], which projects high-dimensional features onto a low-dimensional space. A similar line of work is followed in [138, 265, 314]. In summary, this approach relies on the smoothness assumption, whereby examples close to each other in the input space are more likely to share a label, without exploiting the labels themselves; hence, these methods achieve poor performance on multi-labeling.

ii)- Label space reduction. This approach performs label space reduction by approximating the label matrix with a low-dimensional subspace according to:

(3.4.4)   $\min_{\mathbf{U}} \|\Phi(\mathbf{X}) - \mathbf{Y}\mathbf{U}\|_{b}^{u} + \lambda\, \Omega(\mathbf{U})$

where $\mathbf{U}$ is a low-rank matrix with arbitrary dimensions. Several works were proposed using this approach.
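A feature-space reduction of this kind can be sketched with SVD-based PCA on toy data (illustrative only; not the thesis's benchmark features):

```python
import numpy as np

# PCA via SVD: compress the features independently of the labels.
rng = np.random.default_rng(42)
X = rng.normal(size=(50, 10))            # n=50 examples, m=10 features
Xc = X - X.mean(axis=0)                  # center the features
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
k = 3                                    # target low dimension
Z = Xc @ Vt[:k].T                        # (50, 3) projected representation
```

A downstream multi-label classifier is then trained on Z instead of X; as noted above, the labels play no role in choosing the projection, which is exactly the weakness of this category.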
For example, Tai and colleagues [322] presented a solution that maps label sets to vertices in a hypercube and then optimizes the solution using singular value decomposition (SVD).

iii)- Hybrid space reduction. This is a widely adopted strategy that preserves input-label correlations. This approach takes various forms, such as:

(3.4.5)   $\min \|\Phi(\mathbf{X}) - \mathbf{U}\mathbf{V}\|_{b}^{u} + \|\mathbf{Y} - \mathbf{U}\mathbf{Q}\|_{b}^{u} + \lambda\, \Omega(\cdot)$

(3.4.6)   $\min \|\Phi(\mathbf{X})^{\top}\mathbf{Y} - \mathbf{U}\mathbf{V}\|_{b}^{u} + \lambda\, \Omega(\cdot)$

(3.4.7)   $\min \|\Phi(\mathbf{X})^{\top}\Phi(\mathbf{X})\mathbf{U} - \mathbf{Y}^{\top}\mathbf{Y}\mathbf{V}\|_{b}^{u} + \lambda\, \Omega(\cdot)$

(3.4.8)   $\min \|\Phi(\mathbf{X})^{\top}\Phi(\mathbf{X})\mathbf{U} - \mathbf{Y}\mathbf{V}\|_{b}^{u} + \lambda\, \Omega(\cdot)$

where $\mathbf{U}$, $\mathbf{V}$, and $\mathbf{Q}$ are low-rank matrices with arbitrary dimensions.

Some prominent models include multi-label informed feature selection [151], non-negative matrix factorization based methods [44], methods based on global and local correlations among labels and instances [34, 144, 359], incorporating extensive meta-features [299], singular value decomposition [358] (possibly followed by clustering labels [316]), dictionary learning [153], canonical correlation analysis [315], margin maximization [207], output codes [389], and a unified framework for sparse local embeddings with a nonconvex penalty for extreme multi-label classification [206].

The prediction strategy among these models differs. As an example, for Eq. 3.4.3 with learned feature coefficient matrix $\mathbf{W}$, it can be defined as:

(3.4.9)   $y^{*}_{j} = \begin{cases} +1 & \text{if } \operatorname{sign}\big(\Phi(\mathbf{x})\mathbf{W}_{:,j}\big) > 0 \\ -1 & \text{otherwise} \end{cases} \quad \forall j \in \{1, \ldots, t\}$

where sign(.) corresponds to the sign of the multiplicative term inside.

Collectively, low-rank methods have been at the forefront of multi-label classification due to three main factors: i)- they adopt the low-rank structure of the instance and/or label matrices to extract abstract representations of labels while preserving local and global correlations; ii)- they assume the low-dimensional label vectors are smooth, where points close to each other are more likely to share a label; and iii)- they solve the optimization problem in an integrative way by adopting an efficient alternating minimization strategy.
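The label-space route can be sketched analogously: compress Y with an SVD, regress into the compressed space, and decode predictions back. This is a simplified sketch in the spirit of SVD-based label-space methods such as [322]; the threshold and names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(7)
Y = (rng.random((40, 12)) > 0.7).astype(float)   # n=40, t=12 binary labels
X = rng.normal(size=(40, 5))

mu = Y.mean(axis=0)
U, s, Vt = np.linalg.svd(Y - mu, full_matrices=False)
k = 4
Z = (Y - mu) @ Vt[:k].T                    # compress labels to k dimensions
W, *_ = np.linalg.lstsq(X, Z, rcond=None)  # linear map from features to Z
Y_hat = X @ W @ Vt[:k] + mu                # decode back into label space
pred = (Y_hat > 0.5).astype(int)           # threshold to binary predictions
```

The decode step (`X @ W @ Vt[:k] + mu`) is the projection back to the original high-dimensional label space mentioned below as a source of error.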
However, these approaches can be slow for training and prediction, especially in the extreme multi-label case consisting of thousands of labels. In addition, to classify new examples, many low-rank methods must project the low-rank matrices back to the original high-dimensional space, which is an error-prone task. In Chapter 7, we will be presenting triUMPF, a customized low-rank hybrid space reduction technique dedicated to solving the pathway prediction problem.

3.4.3 Ensemble and Deep Learning Methods

In this section, we first examine the ensemble based multi-label learning paradigm and then review deep learning models. In general, the algorithms in this category are highly effective, but they can be extremely slow for training and prediction on multi-label datasets containing a large number of labels.

3.4.3.1 Ensemble Methods

Ensemble learning aims to build a set of accurate and diverse multi-label base learners while simultaneously considering correlations among labels [301]. To date, most ensemble algorithms can be divided into: i)- cascade based, ii)- tree based, and iii)- low rank based.

i)- Cascade based. The core theme of this approach is to incorporate label dependency into the prediction. Almost all the correlation structures presented in Fig. 3.6 can be captured with this technique. There are multiple types of cascading schemes, most notably the probabilistic classifier chain [62], where a binary classifier for a label, say $y_j$, is trained, conditioned on a given instance $\mathbf{x}$ and the remaining labels $\mathbf{y}_{-j}$, to estimate the predictive probability of that label according to:

(3.4.10)   $h_j(\Phi(\mathbf{x})) = \arg\max_{y_j \in \{-1,+1\}} p(y_j \mid \mathbf{x}, \mathbf{y}_{-j})$

Afterward, inference can be made as a maximum a posteriori (MAP) estimate over all labels:

(3.4.11)   $\mathbf{y}^{*} = \arg\max_{\mathbf{y} \in \{-1,+1\}^{t}} \prod_{j=1}^{t} p(y_j \mid \mathbf{x}^{*}, \mathbf{y}_{-j})$

A similar analytical expression is also exhibited by other chain models, such as classifier chains and ensembles of classifier chains [270, 271].
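A chain of this kind can be sketched with scikit-learn's ClassifierChain, in which each label's classifier sees the input features plus the labels earlier in the chain (an external-library illustration with toy data, not the thesis's implementation):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import ClassifierChain

rng = np.random.default_rng(3)
X = rng.normal(size=(60, 5))
# Toy correlated labels: label 1 mostly mirrors label 0.
y0 = (X[:, 0] > 0).astype(int)
y1 = y0 ^ (rng.random(60) > 0.9).astype(int)   # flip ~10% of label 0
y2 = (X[:, 1] > 0).astype(int)
Y = np.stack([y0, y1, y2], axis=1)

# Fit one classifier per label, each conditioned on the earlier predictions.
chain = ClassifierChain(LogisticRegression(), order=[0, 1, 2], random_state=0)
pred = chain.fit(X, Y).predict(X)              # (60, 3) binary predictions
```

At prediction time the chain feeds its own predicted labels forward, so the fixed ordering `[0, 1, 2]` matters, which is precisely the coordination issue discussed next.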
One of the most fundamental concerns associated with this approach is the coordination of labels. This has an important implication because exact inference requires an enumeration over the $t$ labels, i.e., on the order of $2^{\frac{t(t-1)}{2}}$, which is a computationally intensive process with high memory usage. Consequently, a number of methods have been proposed, including conditional label dependence [157, 383], marginal accuracy [313], stochastic ordering [267], searching the order space given a fixed structure [268], searching the structure space given a fixed order [323], and undirected chains [118]. However, chain methods also suffer from class imbalance problems, due to the sparseness exhibited in the label matrix Y, as in the case of the pathway dataset. Solving these two problems constitutes the main obstacle for these methods, and is left open to future research.

ii)- Tree based. Models in this category have received significant attention in recent years due to various factors, most importantly the reduction in computational complexity for the extreme multi-label classification domain. Specifically, these methods [148, 259-262] follow the decision tree structure by recursively partitioning the space of labels (or features), with each non-leaf node comprising a small subset of relevant labels. However, to split instances at each node, these methods learn a base classifier (a weighted combination of all input features), instead of the metrics employed in traditional decision trees (e.g. information gain [123]).
In practice, tree based methods follow three key steps: 1)- construct an ensemble of trees, each of which is induced by subsampling instance/label features (either randomly or deterministically) at each level of the tree, 2)- cluster examples sharing similar labels into one node, and 3)- retrieve a small set of highly relevant true-positive labels.

During prediction, each example is passed from the root to a leaf in each induced tree, where the base classifier at each node is applied to predict labels; labels are then aggregated over all trees using some form of voting strategy [108, 275, 284]. Consequently, the inference complexity is reduced. For example, let $B$ be the number of induced trees, $D$ the depth of the trees, $m$ the feature dimension, and $t'$ the average number of labels per leaf. Then, the overall prediction cost can be approximated as $\mathcal{O}(BDm + Bt' + t' \log t')$ [204]. If the induced trees are balanced, then $D \approx \log t$ and the prediction cost is near $\mathcal{O}(Bm \log t)$, which is logarithmic in the number of labels. However, the computational cost is still troublesome since these methods require inducing multiple trees. Besides, due to their hierarchical nature, errors introduced at the top level are propagated down to the lower levels of the trees. As a result, they may not have good prediction accuracy. Nonetheless, for the extreme multi-label case, a compromise is usually made to take advantage of logarithmic prediction speed over prediction accuracy.

iii)- Low rank based. This approach is related to Section 3.4.2, which was reported to effectively capture label correlation. SLEEC (sparse local embeddings for extreme classification) [33] is the best known ensemble model of this kind, which solves the problem in two sequential steps: 1)- partitioning examples into smaller regions, followed by 2)- extracting low-dimensional label vectors.
The clustering step in SLEEC can be unstable in high-dimensional spaces; hence, an ensemble of SLEEC learners is employed to achieve good prediction accuracy. However, models in this category may converge to a local minimum, preventing them from reaching an optimal solution.

Other notable ensemble approaches include subset selection methods that cluster labels into disjoint sets and then treat each set as a single label to train a classifier, while taking into account correlations between labels [269, 276, 328]. A similar approach was pursued in Slice [147], which first retrieves a list of the most probable true labels using a generative model and then evaluates discriminative classifiers only for the shortlisted labels. Inspired by Slice, in Chapter 9 we will be presenting leADS, an ensemble approach based on the bag idea [353] for the pathway prediction problem.

3.4.3.2 Deep Learning

With the exception of low-rank methods, the vast majority of ensemble models assume features are independent of one another when making label predictions, which is a fundamental limitation of these methods. Deep learning can provide a solution to this problem. Specifically, this approach adapts deep learning models to generate low-dimensional latent continuous features, known as embeddings [225], from raw examples. One of the earliest attempts to incorporate deep learning is the convolutional neural network (CNN) model of [204], which produced promising results for the extreme multi-label classification task. Unfortunately, this model does not consider the label smoothness assumption, whereby labels that are semantically tied to an instance (or context) should be grouped together. Also, many label dependency structures, such as ManyXChainPathway and OneXTreePathway, were not addressed. Hence, several extensions to the CNN model were proposed; the most prominent include ensemble tree-based methods [364, 375], fusing label-specific information using a long short-term memory (LSTM) model [355], and a combination of CNN and LSTM with an attention-based strategy [60].

Despite the benefits of these deep learning models, they do not utilize dependencies among labels. A cascade of recurrent neural networks was proposed that orders labels according to a dynamic shuffling strategy [232], as opposed to the static ordering in [233], thereby replicating the traditional chain classifier and label powerset approaches [270, 271, 328] discussed in Section 3.4.3.1. However, this model is impractical for the pathway dataset because it requires a cascade of length ~100 labels, whereas the number of relevant pathways to be retrieved is usually >100. An alternative solution is based on graph learning techniques [187, 342, 371, 376, 388].

It is worth mentioning that deep learning models are closely related to the low-rank methods discussed in Section 3.4.2, and, interestingly, both are computationally demanding approaches [204]. Still, they offer many benefits, including relief from the extensive feature engineering process. In Chapter 6, we will be introducing a deep learning-based model, called pathway2vec, to automatically generate embeddings to supplement pathway prediction.

3.4.4 Partial Labeled Methods

Conventional multi-label learning often assumes that each training example is associated with multiple ground-truth labels. However, in many real-world applications, including pathway datasets, each training instance is annotated with a partially valid candidate label set. Formally, for each example $i$ in $\mathcal{S}$, described in Def.
3.9, the labels are encoded as $\mathbf{y}^{(i)} = [y^{(i)}_1, \ldots, y^{(i)}_t] \in \{-1, 0, +1\}^{t}$, where $y^{(i)}_j = +1$ indicates the presence of label $j$, $y^{(i)}_j = -1$ means the same category is absent for $i$, and $y^{(i)}_j = 0$ indicates that label $j$ is not annotated. A dataset of this form is called a partial multi-labeled dataset, where the task is the same as in Section 3.3. The learning paradigm in this situation (partial multi-label learning, or PML) is especially challenging because the true labels are concealed among many irrelevant labels, so an estimator is prone to retrieving many false positive outputs.

An early approach to PML is to treat the missing labels as negatives [46, 318]. This straightforward solution neglects the possible ground-truth positive labels of each example, hence inducing inferior prediction models. As an alternative, many methods adopt low rank-based approaches to complete the instance-label matrix [48, 173, 339, 350-352, 356, 357, 360, 378, 399]. These models generally treat labels and instances as graphs in which label information is iteratively propagated, and choose candidate labels using some scoring metric (e.g. confidence values). However, they all suffer from the cumulative errors induced during propagation, which may impair the predictive model. Besides, the estimation of label scores is error-prone, especially when noisy labels dominate, which can seriously deteriorate the predictor's performance. Others formulate the problem in the context of transductive learning, which seems efficient for large-scale datasets [172]. Studies from the perspective of probabilistic modeling were also considered, where missing labels are treated as latent variables [161, 330].

Deep learning models were also investigated. Models in this category include a deep sequential generative model based on a variational auto-encoder framework [66], a convolutional neural network with an adaptive loss function [83], and a tree ensemble-based deep learning method [340].
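The {-1, 0, +1} encoding and the naive treat-missing-as-negative baseline can be sketched as follows (toy matrix; names are illustrative):

```python
import numpy as np

# Partial multi-label matrix: +1 present, -1 absent, 0 not annotated.
Y = np.array([[+1,  0, -1],
              [ 0, +1,  0],
              [-1,  0, +1]])

observed = Y != 0                     # mask of annotated instance-label pairs
coverage = observed.mean()            # fraction of entries actually annotated
naive_Y = np.where(Y == 0, -1, Y)     # naive baseline: missing -> negative
```

The naive conversion silently turns any unannotated ground-truth positive into a negative, which is precisely the source of the inferior models noted above.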
Recently, the PML-GAN model was proposed, building on generative adversarial networks (GANs) [109], which use minimax adversarial training over two networks: a generator and a discriminator. This model comprises four networks: 1)- a disambiguation network that estimates the noise probability of each label for each example; 2)- a prediction network that predicts a set of probable true labels for each instance; 3)- a generation network that synthesizes samples given label vectors; and 4)- a discrimination network that separates the synthetic samples from the true data. However, PML-GAN does not incorporate the various correlations among labels that are an integral part of the pathway dataset. In Chapters 8 and 10 we will be presenting the reMap and mltS models, which aim to address the various sources of noise in the pathway dataset.

3.4.5 Active Learning Methods

The overall goal of active learning (AL) is to subselect training examples from a large pool of unlabeled instances to design a high-quality prediction model using the acquired examples [70]. The general procedure for multi-label active learning methods, under the pool-based scenario [63], is provided in Algorithm 1, where the inputs comprise two datasets: a small set of labeled data L and a pool of unlabeled instances U. At the very beginning, a classifier h is trained using L (line 2). Then, the algorithm proceeds in an iterative manner until a sufficient number of examples, say k, have been queried, where at each step: 1)- a query algorithm is applied to assess the information content of each instance from U (line 4); 2)- an oracle (e.g.
a human annotator) labels each selected example (line 5); 3)- the selected points are added to L and removed from U (lines 6-7); and 4)- the base classifier h is retrained with the new set L (line 9).

Inputs: labeled set L, unlabeled set U
1   i = 1;
2   Train multi-label classifiers h_i using L;
3   repeat
4       {x_j}_{j=1}^{k} <- query(U, L, h_i);
5       {y_j}_{j=1}^{k} <- oracle({x_j}_{j=1}^{k});
6       L <- L ⊕ {(x_j, y_j)}_{j=1}^{k};
7       U <- U ⊖ {(x_j, y_j)}_{j=1}^{k};
8       i = i + 1;
9       Retrain multi-label classifiers h_i using L;
10  until enough instances are queried;
Algorithm 1: General multi-label active learning framework

Intuitively, the most informative points should be picked at each iteration. A common approach is to pick candidate instances based on an uncertainty or informativeness criterion, which measures the effectiveness of an example by how much it reduces the classification uncertainty [195, 196, 352]. This metric does not consider the future predictive informativeness of a candidate instance from U, leading to suboptimal performance [143]. However, methods in this category have two key advantages: i)- they are computationally efficient [293], and ii)- they are easily extended with an aggregation operation over label scores for the multi-label setting [63, 363].

Representativeness is another criterion for acquiring samples; it measures the discrepancy of a candidate instance with respect to the underlying distribution of U. Prominent methods include generalization error minimization based on trained classifiers [70, 71, 279]. These approaches are generally computationally intensive because they require a new prediction model to be re-trained for each candidate instance, or require querying a relatively large number of examples before an optimal decision boundary is reached [143]. Other approaches apply heuristic measures to exploit U, such as estimating the density of the unlabeled data [294, 330, 379].

Unfortunately, all of the previous selection criteria fail to quantify the overlapping labels across a set of candidate examples.
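The uncertainty-driven query step of Algorithm 1 (line 4) can be sketched as follows; the scoring rule here is one simple choice among many, and the names are illustrative:

```python
import numpy as np

def query_uncertain(probs, k):
    """probs: (n_pool, t) per-label probabilities from the current classifier.
    Scores each pooled example by how close its probabilities sit to 0.5,
    and returns the indices of the k most uncertain examples."""
    uncertainty = 1.0 - 2.0 * np.abs(probs - 0.5).mean(axis=1)
    return np.argsort(-uncertainty)[:k]

probs = np.array([[0.50, 0.52],    # highly uncertain
                  [0.95, 0.05],    # confident
                  [0.60, 0.40]])   # mildly uncertain
picked = query_uncertain(probs, k=2)   # indices to send to the oracle
```

The mean over the label axis is the aggregation operation over label scores mentioned above for the multi-label setting.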
To address this limitation, the diversity-based criterion was proposed. Notable methods in this category include clustering instances [117, 119] and clustering label-instance pairs [240, 332]. However, these methods suffer from either information redundancy or ignoring label correlation. Alternative approaches were introduced, such as novel queries [140] and compressed sensing with Bayesian principal component analysis [303]. For a more comprehensive review of active learning, see [115, 273, 292, 368].

In the context of the pathway dataset, the methods discussed above do not address the imbalanced and long-tailed label distribution problems. In Chapter 9 we will be presenting leADS, which selects samples based on either uncertainty or diversity to subsample examples and is then trained on these subsampled data. As we shall see, leADS was able to minimize the impact of the imbalance problem while partially addressing infrequently occurring pathways.

3.4.6 Other Notable Approaches

Having surveyed a variety of multi-label models, in this section we describe three other existing approaches that are more remotely related to this thesis: i)- semi-supervised methods, ii)- multi-instance methods, and iii)- multi-label topic modeling. Among them, topic modeling was the most widely used approach in the past decade. Our discussion is kept at a high level of abstraction, without analytical or in-depth explanations.

Semi-supervised methods. If a multi-label dataset comprises a fraction of annotated examples within a pool of unlabeled instances [398], then the approach to learning from this type of data is called semi-supervised learning (SSL). It is important to note that the corresponding dataset is different from the partial multi-labeled dataset discussed in Section 3.4.4, where in the latter case each example is annotated with partially valid labels.
SSL is less explored because one may formulate this problem in the context of the partial multi-label learning framework by randomly imputing missing annotated examples with labels and then obtaining classifiers that preserve tight bonds among instances, labels, and instance-label pairs. Indeed, the models in this category largely resemble PML methods [56, 116, 361, 390].

What deserves attention here is a class of datasets consisting of valid/invalid labeled or unannotated instances, which is frequently observed for the pathway dataset. To learn from this type of data, weakly supervised multi-label learning was proposed, which is a generalization of both semi-supervised and partial multi-label learning. Nevertheless, the methods used in this setting are also similar to PML approaches. Models include deep generative weakly-supervised multi-label classification [66], a sequential deep network that learns multi-label classifiers from the aforementioned training data. Liu and colleagues [208] proposed knowledge distillation based weak learning that jointly trains two models, a teacher and a student. In this scenario, the teacher model learns the label correlations and then passes this information to the student model, which uses this knowledge to acquire feature representations, thereby forming strong dependencies between labels and instances. Related work was performed in [213].

Multi-instance methods. In the pathway dataset, a collection of examples belonging to a specific species may exhibit similar pathways. From this perspective, one may formulate the problem of multi-label learning as multi-instance multi-label learning (MIML). In this learning framework, every training example is represented by a group (or bag [353]) of multiple instances, and annotated with multiple labels to express its semantics, i.e., $(\mathbf{x}^{(i)}, \mathbf{y}^{(i)}) \in \mathcal{S}$ where $\mathbf{x}^{(i)} = \{\mathbf{z}^{(i)}_{1}, \ldots, \mathbf{z}^{(i)}_{n_i}\}$ is a group of $n_i$ instances belonging to $\mathbf{x}^{(i)}$. Fig.
3.7 illustrates the three learning frameworks. As can be seen, multi-instance learning addresses the ambiguity associated with the input space, where an object (a group), represented by black-colored nodes, may comprise multiple instances; multi-label learning studies the ambiguity of an input being associated with multiple labels; and MIML considers the uncertainties in both the input and label spaces simultaneously [396].

Figure 3.7: Three learning frameworks. Node color indicates the category of the node type, where dark gray indicates input samples, black indicates grouping objects, and light gray is reserved for metabolic pathways.

Prior works in this context include MIMLSVM [392], a generative model [365], and a nearest-neighbor based approach [380]. However, these methods are usually computationally intensive and do not scale to massive volumes of data. Consequently, Huang and colleagues [142] proposed a fast MIML algorithm that exploits label correlation within a shared space. Similarly, an instance representation learning-based approach was introduced in [90]. Alternatively, a model that combines multi-instance multi-label learning with active learning was introduced in [141] to efficiently reduce labeling and computational budgets.

For the pathway prediction task, the MIML framework may be a suitable approach if the pathway data involve species information, where one can incorporate species along with input patterns and labels. However, without access to grouping information, MIML would require a preprocessing step to discover such bags. In Chapters 7 and 8, we will be discussing triUMPF and reMap, which attempt to partially address this key observation.

Multi-label topic modeling.
This approach merges the ingredients of generative statistical topic models with multi-label classification, resulting in models that achieve interpretability and prediction simultaneously. Generative topic models are a class of probabilistic hierarchical Bayesian networks that discover the hidden composition of latent topics or concepts given a collection of examples. In particular, models such as latent Dirichlet allocation [42] assume that each example is composed of mixed proportions of topics, and topics, in turn, comprise a mixture of features with different distributions. To estimate the mixture components, these approaches either apply approximate sampling, such as MCMC [10, 11] and Gibbs sampling [50, 329], or use tractable optimization algorithms, such as variational inference [41, 136, 184]. Since this process is accomplished in a completely unsupervised manner, an additional component is necessary to tag concepts with labels, which is the fundamental problem associated with this approach.

Despite this limitation, several directions have already been explored, such as Prior-LDA and Dependency-LDA [280], which were subsequently extended to Frequency-LDA and Dependency-Frequency-LDA by Li and colleagues [197]. Correlations among labels were also considered in the correlated labeling model [338]. A Bayesian non-parametric approach [241] was considered to learn from a possibly unknown (infinite) number of multi-label correlations. Padmanabhan and colleagues [247] presented an interesting multi-label topic model that accounts for multiple noisy annotations from the crowd. All of these models are capable of learning rare labels; however, as mentioned, tagging concepts with labels and then transforming a set of real-valued predictions into binary predictions for each instance are non-trivial tasks.
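As a concrete illustration of the generative view described above, scikit-learn's LatentDirichletAllocation recovers per-example topic proportions from a count matrix (toy counts, not pathway data):

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

rng = np.random.default_rng(1)
counts = rng.poisson(2.0, size=(20, 30))   # 20 examples over 30 count features
lda = LatentDirichletAllocation(n_components=3, random_state=0)
theta = lda.fit_transform(counts)          # (20, 3) mixed topic proportions
```

Each row of `theta` is that example's mixture over the three latent topics; tagging those anonymous topics with labels is the extra step the supervised variants above must supply.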
Several strategies were explored to address this bottleneck, notably rank-based cut-off thresholds [280].

Models in this category are closely related to SOAP and SPREAT, discussed in Chapter 8, which were designed to model pathway distributions while addressing the missing pathways that were not annotated in pathway datasets.

3.5 Summary

This chapter contributes a much-needed resource on the mathematical formulation of a metabolic pathway in the context of multi-label learning. There have been numerous attempts to represent pathways; nevertheless, those representations complicate the prediction problem even further. Motivated by this demand, we simplified the problem by projecting the pathway onto a two-layer graph network, with an enzyme layer contributing to pathways and a pathway layer. This representation enabled us to establish a sequence of terminologies (serving as a cascade of building blocks) that culminated in the definition of a pathway genome dataset. The problem can then be articulated efficiently with multi-label classification approaches.

Since a pathway is depicted as a graph structure, it is inevitable that we address the ambiguity associated with multiple levels of correlation. To this end, we surveyed the current state of the art in multi-label learning methods that offer possible remedies for this problem. However, instead of trying to cover all learning techniques within a confined space, which would lead to only abridged introductions, we restricted the review to our own designed paradigms according to the delivered tasks. Consequently, frameworks that did not fit within the landscape of our discussion were omitted, such as multi-view multi-label learning [397] and max-margin based approaches [372].

Supported by our observations, the vast majority of the discussed multi-label learning algorithms fall under the low-rank based methods and appear to be gradually shifting towards deep learning-based techniques.
However, formal measures of label correlation are not yet well crystallized. To the best of our knowledge, Park and colleagues [251] made the first attempt to formally characterize the label correlation concept by pinpointing multiple scenarios that may exist in multi-label data. In the context of pathway prediction, this thesis formulates and proposes multiple multi-label learning models customized to systemically solve this problem. In the subsequent chapters, we will provide detailed elaborations and experimental results for these models.

Chapter 4

Benchmark Data and Evaluation Metrics

"Many options are not transparent. They need to be explored and evaluated with care. What you see is not always what you get."
- J. Grant Howard

In this chapter, we discuss the database, datasets, and algorithms used in our experiments in Sections 4.1, 4.2, and 4.3, respectively. Then, we explain the metrics used for evaluating the performance of pathway prediction algorithms in Section 4.4.

4.1 Benchmark Pathway Database

We used the MetaCyc knowledge-base v21 [51]. Various configurations of MetaCyc were examined throughout this thesis and are summarized in Table 4.1, where V represents the aggregation of nodes as described in Def. 3.7 while Z indicates the set of all edges among nodes in V. The configurations are: i)- full content MetaCyc, consisting of nodes with links among themselves as described in Def. 3.7; ii)- reduced content MetaCyc (r), which comprises all nodes with more than 2 links; iii)- MetaCyc (uec), obtained by removing links among EC nodes while retaining the links of all other node types; and iv)- MetaCyc (uec + r), which comprises unconnected EC and trimmed nodes. Three association matrices were also applied: Pathway-EC (M) in Def. 3.4; Pathway-Pathway interaction (A) in Def. 3.5; and EC-EC interaction (B) in Def.
3.6.

4.2 Benchmark Pathway Datasets

Experiments were conducted using a corpus of 13 experimental datasets manifesting diverse multi-label properties, including manually curated organismal genomes, synthetic microbial communities, and low complexity microbial communities. The quality of these datasets can be arranged into a four-tiered (T) hierarchy, in descending order of manual curation and functional validation (Fig. 4.1), where the top tiers reflect detailed biochemical knowledge from a complete reference genome (e.g., T1 in the information hierarchy) while the very bottom layer (T4) indicates the more complex organismal diversity found in natural and engineered environments.

Database            #EC    #Compound  #Pathway  |V|    |Z|
MetaCyc             6378   13689      2526      22593  37631
MetaCyc (r)         3606   6469       2467      12542  37631
MetaCyc (uec)       6378   13689      2526      22593  33353
MetaCyc (uec + r)   3229   6469       2467      12165  33353
M                   3650   -          2526      -      8576
A                   -      -          2526      -      9938
B                   3650   -          -         -      35629

Table 4.1: Different configurations of compound, enzyme (EC), and pathway objects extracted from the MetaCyc database. These are: i)- full content (MetaCyc), ii)- reduced content based on trimming nodes below 2 links (MetaCyc r), iii)- links among enzymatic reactions removed (MetaCyc uec), and iv)- the combination of unconnected enzymatic reactions and trimmed nodes (MetaCyc uec + r). The "-" indicates not applicable.

The detailed characteristics of the applied datasets are summarized in Table 4.4. For each dataset $\mathcal{S}$, we use $|\mathcal{S}|$ and $L(\mathcal{S})$ to represent the number of instances and pathway labels, respectively. In addition, we also present some characteristics of the multi-label datasets, which are denoted as:

1. Label cardinality ($\mathrm{LCard}(\mathcal{S}) = \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{t} \mathbb{I}[\mathbf{Y}_{i,j} \neq -1]$), where $\mathbb{I}$ is an indicator function and $t$ is the number of pathways. It denotes the average number of pathways per example in $\mathcal{S}$.

2. Label density ($\mathrm{LDen}(\mathcal{S}) = \frac{\mathrm{LCard}(\mathcal{S})}{L(\mathcal{S})}$). This is simply obtained by normalizing $\mathrm{LCard}(\mathcal{S})$ by the number of total pathways in $\mathcal{S}$.
This metric is related to LCard: $\mathrm{LCard}(\mathcal{S}) = t \times \mathrm{LDen}(\mathcal{S})$.

3. Distinct label sets ($\mathrm{DL}(\mathcal{S})$). This notation indicates the number of distinct pathway label sets observed in $\mathcal{S}$.

4. Proportion of distinct label sets ($\mathrm{PDL}(\mathcal{S}) = \frac{\mathrm{DL}(\mathcal{S})}{|\mathcal{S}|}$). It represents the normalized version of $\mathrm{DL}(\mathcal{S})$, obtained by dividing $\mathrm{DL}(.)$ by the number of instances in $\mathcal{S}$.

Figure 4.1: Genomic information hierarchy encompassing individual, population and community levels of cellular organization. (a) Building on the BioCyc curation-tiered structure of Pathway/Genome Databases (PGDBs) constructed from organismal genomes, two additional data structures are resolved from single-cell and plurality sequencing methods to define a 4-tiered hierarchy (T1-4) in descending order of manual curation and functional validation. (b) Completion scales for organismal genomes, single-cell amplified genomes (SAGs) and metagenome assembled genomes (MAGs) within the 4-tiered information hierarchy. Genome completion will have a direct effect on metabolic inference outcomes, with incomplete organismal genomes, SAGs or MAGs resolving fewer metabolic interactions.

The notations $R(\mathcal{S})$, $\mathrm{RCard}(\mathcal{S})$, $\mathrm{RDen}(\mathcal{S})$, $\mathrm{DR}(\mathcal{S})$, and $\mathrm{PDR}(\mathcal{S})$ have similar meanings as above but are defined in the context of the enzymatic reactions $\mathcal{E}$ in $\mathcal{S}$. Finally, $\mathrm{PLR}(\mathcal{S})$ represents the ratio of $L(\mathcal{S})$ to $R(\mathcal{S})$. We briefly describe these experimental datasets below.

4.2.1 Golden Dataset

The T1 golden datasets are composed from six PGDBs retrieved from BioCyc¹: EcoCyc (v21), HumanCyc (v19.5), AraCyc (v18.5), YeastCyc (v19.5), LeishCyc (v19.5), and TrypanoCyc (v18.5), refined to include only information content overlapping with MetaCyc v21 (full content) [51].
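The dataset statistics defined above (LCard, LDen, DL, and PDL) can be computed directly from a label matrix with entries in {-1, +1}; a toy sketch:

```python
import numpy as np

Y = np.array([[+1, -1, +1],
              [+1, +1, -1],
              [+1, -1, +1]])        # n=3 examples, t=3 pathway labels
n, t = Y.shape

lcard = np.sum(Y != -1) / n                 # LCard: avg pathways per example
lden = lcard / t                            # LDen: cardinality normalized by t
dl = len({tuple(row) for row in Y})         # DL: number of distinct label sets
pdl = dl / n                                # PDL: DL normalized by n
```

Here rows 1 and 3 carry the same label set, so DL is 2 even though there are 3 examples; LCard is 2.0 pathways per example.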
In addition, a composite golden dataset, referred to as SixDB, was curated; it consists of the 63 possible combinations of the T1 PGDBs, given by the following formula:

(4.2.1) $|S| = \sum_{k=1}^{6} \binom{6}{k}$

where |·| denotes the number of samples in S.

1 https://biocyc.org/

Figure 4.2: Matrix layout for all possible pathway intersections among EcoCyc, HumanCyc, AraCyc, YeastCyc, LeishCyc, and TrypanoCyc. Brown circles in the matrix indicate sets that are part of the intersection, and their distributions are shown as a vertical bar above the matrix, while the aggregated number of pathways from intersected sets for each sample is represented by a horizontal bar at the bottom left. More information is provided in Table 4.4.

Fig. 4.2 shows the intersecting pathways among the six databases, where the columns of the matrix use binary circle-shaped patterns to define each intersected dataset, and the bars just above the matrix columns represent the number of elements in each intersection. The bars at the bottom left, plotted along the rows of the matrix, give the total intersection size of each dataset.
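The count of 63 SixDB samples in Eq. 4.2.1 can be verified directly with the standard library (a one-line sketch):

```python
import math

# Number of non-empty subsets of the six T1 PGDBs: sum over k of C(6, k).
num_samples = sum(math.comb(6, k) for k in range(1, 7))
print(num_samples)  # 63, i.e. 2**6 - 1
```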
Several interesting observations can be made. For example, LeishCyc has the lowest numbers in both categories: distinct pathways (only 4 pathways) and the aggregated number of pathways over all enumerations of intersected sets, i.e., the cardinality of the LeishCyc pathways (87 pathways; see Table 4.4), while AraCyc has the highest numbers in both categories (271 distinct pathways and 510 aggregated pathways). These observations were substantially beneficial during our experimental inspections. The golden T1 datasets serve as baselines to cross-examine the performance of all pathway prediction algorithms.

MetaCyc Pathway | MetaCyc Pathway ID | Metabolism
L-phenylalanine biosynthesis I | PHESYN | Phenylalanine
L-tryptophan biosynthesis | TRPSYN-PWY | Tryptophan
L-arginine biosynthesis II (acetyl cycle) | ARGSYNBSUB-PWY | Arginine
L-valine biosynthesis | VALSYN-PWY | Valine
L-leucine biosynthesis | LEUSYN-PWY | Leucine
L-lysine biosynthesis I | DAPLYSINESYN-PWY | Threonine
L-threonine biosynthesis | HOMOSER-THRESYN-PWY | Threonine
L-isoleucine biosynthesis I (from threonine) | ILEUSYN-PWY | Isoleucine
L-histidine biosynthesis | HISTSYN-PWY | Histidine
L-methionine biosynthesis I | HOMOSER-METSYN-PWY | Methionine

Table 4.2: Nine amino acids, indicated by metabolism, for the symbiont dataset. These pathways are distributed between the Candidatus Moranella endobia and Candidatus Tremblaya princeps genomes [218].

4.2.2 BioCyc Dataset

BioCyc (v20.5, T2-3) [52] consists of 9255 Pathway/Genome Databases (PGDBs) collected from more than 1000 distinct species. We preprocessed the collection to extract ECs and pathways that intersect with MetaCyc v21, resulting in 1463 distinct pathway labels and 2705 distinct ECs.
The dataset is mainly used to train the models in this thesis.

4.2.3 Symbiont Dataset

The symbiont dataset (T4) captures metabolic pathways distributed between two interacting reduced organismal genomes, Candidatus Moranella endobia (GenBank NC-015735) and Candidatus Tremblaya princeps (GenBank NC-015736) [218]. We used MetaPathways v2.5 [175] and Pathway Tools v21 with default settings to generate the environmental Pathway/Genome Database (ePGDB). This dataset was used to investigate the ability of pathway inference algorithms to predict 9 amino acid biosynthetic pathways, listed in Table 4.2, on the individual symbiont genomes and on a composite genome consisting of both.

4.2.4 CAMI Dataset

The CAMI 2 (Critical Assessment of Metagenome Interpretation) T4 dataset [289] is a simulated dataset built from 40 genomes of low complexity. As with the symbiont data, MetaPathways v2.5 [175] was employed to generate ePGDBs. We mainly used the CAMI dataset to compare the performance gap, relative to PathoLogic, of the various models proposed in this thesis.

4.2.5 Hawaii Ocean Time-Series (HOTS) Dataset

The HOTS metagenome (DNA) T4 dataset is composed of complex microbial communities sampled from 25m, 75m, 110m (sunlit) and 500m (dark) ocean depth intervals [309]. Unassembled metagenomic pyrosequences from the Hawaii Ocean Time-Series (10m, 75m, 110m, and 500m) can be obtained from the NCBI Sequence Read Archive under accession numbers SRX007372, SRX007369, SRX007370, and SRX007371. MetaPathways v2.5 [175] was used to generate the ePGDBs. Out of 781 unique pathways, we selected 45 previously reported pathways [125], outlined in Table 4.3, to perform qualitative assessments of PathoLogic and the pathway prediction algorithms proposed in this thesis.

4.2.6 Synthetic Dataset

Two in silico datasets, namely Synset-1 and Synset-2, were constructed by first selecting a list of pathways and then creating samples to curate each dataset (summarized in Appendix A).
These datasets were used to train and evaluate mlLGPR's predictive performance (in Chapter 5). Since these datasets are simulated, we cannot map them onto the four-tier information hierarchy.

4.3 Benchmark Pathway Algorithms

We evaluated the performance of our models against Naïve [370], MinPath v1.2 [370], and PathoLogic v21 (without taxonomic pruning) [163] using the default settings. In addition, we introduce the BASELINE algorithm, which is the most straightforward way to recover pathways: the enzymatic reactions of an example are mapped directly onto the true representation of all reference pathways in MetaCyc v21, and a cutoff threshold (0.5) is then applied to retrieve a list of pathways for that example.

2 https://edwards.sdsu.edu/research/cami-challenge-datasets/

MetaCyc Pathway | MetaCyc Pathway ID | Metabolism | Topology
L-selenocysteine biosynthesis II (archaea and eukaryotes) | PWY-6281 | Amino acids | Biosynthesis
glycine biosynthesis IV | GLYSYN-THR-PWY | Amino acids | Biosynthesis
homocysteine and cysteine interconversion | PWY-801 | Amino acids | Biosynthesis
CMP-N-acetylneuraminate biosynthesis I (eukaryotes) | PWY-6138 | Carbohydrates | Biosynthesis
CMP-N-acetylneuraminate biosynthesis II (bacteria) | PWY-6139 | Carbohydrates | Biosynthesis
glycogen biosynthesis I (from ADP-D-Glucose) | GLYCOGENSYNTH-PWY | Carbohydrates | Biosynthesis
ADP-L-glycero-β-D-manno-heptose biosynthesis | PWY0-1241 | Carbohydrates | Biosynthesis
phosphopantothenate biosynthesis III (archaebacteria) | PWY-6654 | Cofactors | Biosynthesis
menaquinol-8 biosynthesis | MENAQUINONESYN-PWY | Cofactors | Biosynthesis
5,6-dimethylbenzimidazole biosynthesis II (anaerobic) | PWY-7729 | Cofactors | Biosynthesis
5,6-dimethylbenzimidazole biosynthesis I (aerobic) | PWY-5523 | Cofactors | Biosynthesis
mycothiol biosynthesis | PWY1G-0 | Cofactors | Biosynthesis
coenzyme M biosynthesis I | P261-PWY | Cofactors | Biosynthesis
pyridoxal 5'-phosphate biosynthesis II | PWY-6466 | Cofactors | Biosynthesis
coenzyme B/coenzyme M regeneration | PWY-5207 | Cofactors | Biosynthesis
thiamine diphosphate biosynthesis II (Bacillus) | PWY-6893 | Cofactors | Biosynthesis
thiamine diphosphate biosynthesis I (E. coli) | PWY-6894 | Cofactors | Biosynthesis
thiamine diphosphate biosynthesis IV (eukaryotes) | PWY-6908 | Cofactors | Biosynthesis
lipoate biosynthesis and incorporation I | PWY0-501 | Cofactors | Biosynthesis
glutathione biosynthesis | GLUTATHIONESYN-PWY | Cofactors | Biosynthesis
biotin biosynthesis from 8-amino-7-oxononanoate I | PWY0-1507 | Cofactors | Biosynthesis
trans, trans-farnesyl diphosphate biosynthesis | PWY-5123 | Cofactors | Biosynthesis
UDP-N-acetyl-D-galactosamine biosynthesis II | PWY-5514 | Cofactors | Biosynthesis
flavonoid biosynthesis | PWY1F-FLAVSYN | Secondary metabolites | Biosynthesis
diploterol and cycloartenol biosynthesis | PWY-6098 | Secondary metabolites | Biosynthesis
salidroside biosynthesis | PWY-6802 | Secondary metabolites | Biosynthesis
L-threonine degradation II | THREONINE-DEG2-PWY | Amino acids | Degradation
L-threonine degradation III (to methylglyoxal) | THRDLCTCAT-PWY | Amino acids | Degradation
L-rhamnose degradation II | PWY-6713 | Carbohydrates | Degradation
D-mannose degradation | MANNCAT-PWY | Carbohydrates | Degradation
2-methylcitrate cycle II | PWY-5747 | Carboxylates | Degradation
acetate formation from acetyl-CoA II | PWY-5535 | Carboxylates | Degradation
citrate degradation | PWY-6038 | Carboxylates | Degradation
reductive monocarboxylic acid cycle | PWY-5493 | C1 compounds | Degradation
methane oxidation to methanol I | PWY-1641 | C1 compounds | Degradation
hydrogen production VIII | PWY-6785 | Hydrogen production amino acids | Degradation
L-methionine degradation III | PWY-5082 | Hydrogen production amino acids | Degradation
ammonia oxidation I (aerobic) | AMMOXID-PWY | Non-carbon nutrients | Degradation
nitrite-dependent anaerobic methane oxidation | PWY-6523 | Non-carbon nutrients | Degradation
nitrate reduction IV (dissimilatory) | PWY-5674 | Non-carbon nutrients | Degradation
guanosine nucleotides degradation III | PWY-6608 | Nucleotides | Degradation
ribitol degradation | RIBITOLUTIL-PWY | Secondary metabolites | Degradation
D-sorbitol degradation I | PWY-4101 | Secondary metabolites | Degradation
pyruvate fermentation to (S)-acetoin | PWY-6389 | Fermentation | Energy
photosynthesis light reactions | PWY-101 | Photosynthesis | Energy

Table 4.3: The 45 pathways selected for the HOTS metagenome (DNA) dataset. The dataset is composed of complex microbial communities from 25m, 75m, 110m (sunlit) and 500m (dark) ocean depth intervals [309].

4.4 Evaluation Metrics

Here, we discuss common metrics used to evaluate the performance of pathway predictors. Additional metrics, including normalized mutual information (NMI), will be discussed in their associated contexts.

4.4.1 Performance Metrics

Four standard metrics were used to report the performance of prediction algorithms: average precision, average recall, average F1 score (F1), and Hamming loss [354]. Formally, let y^(i) and ŷ^(i) denote the true and the predicted pathway sets for the i-th sample, respectively. Then, the four measurements are defined as:

(4.4.1) $\text{Average Precision (Pr)} = \frac{1}{n}\sum_{i=1}^{n} \frac{{y^{(i)}}^{\top}\hat{y}^{(i)}}{\sum_{j \in t}\hat{y}^{(i)}_{j}}$

(4.4.2) $\text{Average Recall (Rc)} = \frac{1}{n}\sum_{i=1}^{n} \frac{{y^{(i)}}^{\top}\hat{y}^{(i)}}{\sum_{j \in t} y^{(i)}_{j}}$

(4.4.3) $\text{Average F1} = \frac{2 \cdot \text{Pr} \cdot \text{Rc}}{\text{Pr} + \text{Rc}}$

(4.4.4) $\text{Hamming Loss (hloss)} = \frac{1}{nt}\sum_{i=1}^{n}\sum_{j=1}^{t} \mathbb{1}(y^{(i)}_{j} \neq \hat{y}^{(i)}_{j})$

where 1(·) denotes the indicator function. Each metric is averaged over the sample size. The values of average precision, average recall, and average F1 vary between 0 and 1, with 1 being the optimal score. Average precision relates the number of true pathways to the number of predicted pathways, including false positives, while average recall relates the number of true pathways to the total number of expected pathways, including false negatives. While recall reflects the ability of each prediction method to find relevant pathways, precision reflects the accuracy of those predictions. Average F1 is the harmonic mean of average precision and average recall, taking the trade-off between the two metrics into account. The hloss is the fraction of pathways that are incorrectly predicted, providing a useful performance indicator.
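To make Eqs. 4.4.1-4.4.4 concrete, here is a minimal sketch (illustrative only, not the thesis implementation) that computes the four metrics from binary label matrices:

```python
def multilabel_metrics(Y_true, Y_pred):
    """Average precision, recall, F1 and Hamming loss over samples.

    Y_true, Y_pred: lists of equal-length binary (0/1) label vectors.
    Follows Eqs. 4.4.1-4.4.4; denominators are assumed non-zero.
    """
    n, t = len(Y_true), len(Y_true[0])
    pr = rc = hloss = 0.0
    for y, yhat in zip(Y_true, Y_pred):
        tp = sum(a * b for a, b in zip(y, yhat))  # y^T yhat
        pr += tp / sum(yhat)                      # Eq. 4.4.1 summand
        rc += tp / sum(y)                         # Eq. 4.4.2 summand
        hloss += sum(a != b for a, b in zip(y, yhat)) / t
    pr, rc, hloss = pr / n, rc / n, hloss / n
    f1 = 2 * pr * rc / (pr + rc)                  # Eq. 4.4.3
    return pr, rc, f1, hloss

# Hypothetical example: 2 samples, 4 candidate pathways.
pr, rc, f1, hl = multilabel_metrics([[1, 1, 0, 0], [0, 1, 1, 0]],
                                    [[1, 0, 0, 0], [0, 1, 1, 1]])
```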
From Eq. 4.4.4, we observe that when all pathways are correctly predicted, hloss = 0 while the other metrics equal 1; conversely, when the predictions of all pathways are completely incorrect, hloss = 1 while the other metrics equal 0.

4.4.2 Equalized Loss of Accuracy

We also evaluated the effects of noise on the robustness of a model's performance using the equalized loss of accuracy (ELA) metric [283]:

(4.4.5) $\mathrm{ELA}_{\rho} = \mathrm{RLA}_{\rho} + s(M_{0}), \quad \text{where } \mathrm{RLA}_{\rho} = \frac{M_{0} - M_{\rho}}{M_{0}} \text{ and } s(M_{0}) = \frac{1 - M_{0}}{M_{0}}$

The ELA score combines i)- the robustness of a model, computed by RLA_ρ at a controlled noise threshold ρ, and ii)- the performance of a model without noise, s(M_0), where M_0 is the average F1 score of the model without noise (any performance metric can be employed). A low ELA score indicates that a model continues to exhibit good performance with increasing background noise.

Dataset | |S| | L(S) | LCard(S) | LDen(S) | DL(S) | PDL(S) | R(S) | RCard(S) | RDen(S) | DR(S) | PDR(S) | PLR(S) | Domain
AraCyc | 1 | 510 | 510 | 1 | 510 | 510 | 2182 | 2182 | 1 | 1034 | 1034 | 0.2337 | Arabidopsis thaliana (v18.5)
EcoCyc | 1 | 307 | 307 | 1 | 307 | 307 | 1134 | 1134 | 1 | 719 | 719 | 0.2707 | Escherichia coli K-12 substr. MG1655 (v21)
HumanCyc | 1 | 279 | 279 | 1 | 279 | 279 | 1177 | 1177 | 1 | 693 | 693 | 0.2370 | Homo sapiens (v19.5)
LeishCyc | 1 | 87 | 87 | 1 | 87 | 87 | 363 | 363 | 1 | 292 | 292 | 0.2397 | Leishmania major Friedlin (v19.5)
TrypanoCyc | 1 | 175 | 175 | 1 | 175 | 175 | 743 | 743 | 1 | 512 | 512 | 0.2355 | Trypanosoma brucei (v18.5)
YeastCyc | 1 | 229 | 229 | 1 | 229 | 229 | 966 | 966 | 1 | 544 | 544 | 0.2371 | Saccharomyces cerevisiae (v19.5)
SixDB | 63 | 37295 | 591.9841 | 0.0159 | 944 | 14.9841 | 210080 | 3334.6032 | 0.0159 | 1709 | 27.1270 | 0.1775 | Composed from six databases
BioCyc | 9255 | 1804003 | 194.9220 | 0.0001 | 1463 | 0.1581 | 8848714 | 956.1009 | 0.0001 | 2705 | 0.2923 | 0.2039 | BioCyc v20.5 (tier 2 & 3)
Symbiotic | 3 | 119 | 39.6667 | 0.3333 | 59 | 19.6667 | 304 | 101.3333 | 0.3333 | 130 | 43.3333 | 0.3914 | Composed of Moranella and Tremblaya
CAMI | 40 | 6261 | 156.5250 | 0.0250 | 674 | 16.8500 | 14269 | 356.7250 | 0.0250 | 1083 | 27.0750 | 0.4388 | Simulated microbiomes of low complexity
HOTS | 4 | 2178 | 311.1429 | 0.1429 | 781 | 111.5714 | 182675 | 26096.4286 | 0.1429 | 1442 | 206.0000 | 0.0119 | Metagenomic Hawaii Ocean Time-series (10m, 75m, 110m, and 500m)
Synset-1 | 15000 | 6801364 | 453.4243 | 0.00007 | 2526 | 0.1684 | 30901554 | 2060.1036 | 0.00007 | 3650 | 0.2433 | 0.2201 | Synthetically generated (uncorrupted)
Synset-2 | 15000 | 6806262 | 453.7508 | 0.00007 | 2526 | 0.1684 | 34006386 | 2267.0924 | 0.00007 | 3650 | 0.2433 | 0.2001 | Synthetically generated (corrupted)

Table 4.4: Characteristics of the 13 datasets. The notations |S|, L(S), LCard(S), LDen(S), DL(S), and PDL(S) represent the number of instances, number of pathway labels, pathway label cardinality, pathway label density, distinct pathway label sets, and proportion of distinct pathway label sets for S, respectively. The notations R(S), RCard(S), RDen(S), DR(S), and PDR(S) have analogous meanings for the enzymatic reactions E in S. PLR(S) represents the ratio of L(S) to R(S). The last column denotes the domain of S.

Part II

Conventional Multi-Label Classification

Chapter 5

Multi-label Classification Approach to Metabolic Pathway Inference with Rich Pathway Features

"Your assumptions are your windows on the world. Scrub them off every once in a while, or the light won't come in."
– Isaac Asimov

Metabolic inference from genomic sequence information is an essential step in determining the capacity of cells to make a living in the world at different levels of biological organization. A common approach to determining the metabolic potential encoded in genomes is to map conceptually translated open reading frames onto a reference database containing known product descriptions. Such gene-centric methods are limited in their capacity to predict pathway presence or absence and do not support standardized rule sets for automated and reproducible research.
Pathway-centric methods based on defined rule sets or machine learning algorithms provide an adjunct or alternative inference method that supports hypothesis generation and testing of metabolic relationships within and between cells.

This chapter presents mlLGPR (multi-label learning based on logistic regression for pathway prediction), a software package that uses supervised multi-label classification and rich pathway features to infer metabolic networks at the individual, population and community levels of organization. mlLGPR was evaluated using a subset of the experimental datasets introduced in Chapter 4.2. The resulting performance metrics equaled or exceeded previous reports for organismal genomes and identify specific challenges associated with feature engineering and training data for community-level metabolic inference.

5.1 Introduction

As discussed in previous chapters, metabolic inference from genomic sequence information is a fundamental problem in biology with far-reaching implications for our capacity to perceive, evaluate and engineer cells at the individual, population and community levels of organization [122, 244]. Metabolic interactions can be described in terms of molecular events or reactions coordinated within a series or cycle. The set of reactions within and between cells defines a reactome, while the set of linked reactions defines pathways within and between cells. Reactomes and pathways can be predicted from primary sequence information and refined using mass spectrometry to both validate known and uncover novel pathways.

The development of reliable and flexible rule sets for metabolic inference is a non-trivial step that requires manual curation to add accurate taxonomic or pathway labels [326]. This problem is compounded by the ever-increasing abundance of different information structures sourced from organismal genomes, single-cell amplified genomes (SAGs) and metagenome assembled genomes (MAGs) (Fig. 4.1).
Under ideal circumstances, pathways are inferred from a bounded reactome that has been manually curated to reflect detailed biochemical knowledge from a closed reference genome (e.g., T1 in the information hierarchy in Fig. 4.1). While this is possible for a subset of model organisms, it becomes increasingly difficult to realize when dealing with the broader range of organismal diversity found in natural and engineered environments. At the same time, advances in sequencing and mass spectrometry platforms continue to lower the cost of data generation, resulting in exponential increases in the volume and complexity of multi-omic information (DNA, RNA, protein and metabolite) available for metabolic inference [13].

While PathoLogic (Chapter 2.2) provides a powerful engine for pathway-centric inference, it is a hard-coded and relatively inflexible application that does not scale efficiently for community sequencing projects. Moreover, PathoLogic does not provide probability scores associated with inferred pathways, further limiting its statistical power with respect to false discovery. An alternative inference method called MinPath uses integer programming to identify the minimum number of pathways that can be described given a set of defined input sequences (e.g., KO family annotations in KEGG [370]). However, such a parsimony approach is prone to false negatives and can be difficult to scale. Issues of probability and scale have led to the consideration of machine learning (ML) approaches for pathway prediction based on rich feature information. Dale and colleagues conducted a comprehensive comparison of PathoLogic to different types of supervised ML algorithms, including naive Bayes, k-nearest neighbors, decision trees and logistic regression, converting PathoLogic rules into features and defining new features for pathway inference [73].
They evaluated these algorithms on experimentally validated pathways from six T1 PGDBs in the BioCyc collection, randomly divided into training and test sets. The resulting performance metrics indicated that generic ML methods equaled or marginally exceeded the performance of PathoLogic, with the added benefits of probability estimation for pathway presence and increased flexibility of use.

Despite the potential benefits of adopting ML methods for pathway prediction from genomic sequence information, PathoLogic remains the primary inference engine of Pathway Tools [165], and alternative methods for pathway-centric inference expanding on the algorithms evaluated by Dale and colleagues remain nascent. Several recent efforts incorporate metabolite information to improve pathway inference and reaction rules to infer metabolic pathways [49, 75, 321, 326]. Others, including BiomeNet [296] and MetaNetSim [152], omit pathways and model reaction networks based on enzyme abundance information.

This chapter describes a multi-label classification approach to metabolic pathway inference using rich pathway feature information, called mlLGPR. mlLGPR uses logistic regression and feature vectors re-adapted from the work of Dale and colleagues to predict metabolic pathways for individual genomes as well as more complex cellular communities (e.g., microbiomes). We evaluated mlLGPR performance in relation to other inference methods, including PathoLogic and MinPath, on a subset of the datasets in Chapter 4.2, where mlLGPR achieved remarkable performance on the golden T1 datasets.

5.2 Problem Formulation

As explained in Chapter 3.3, an input example x ∈ X, where X = R^r, can be transformed into an m-dimensional vector using an appropriate function, where m ≫ r. The transformation function is defined as Φ : X → R^m and is known as the feature extraction and transformation process (see Section 5.3.1).

Metabolic Pathway Prediction. Given a multi-label dataset S (Def. 3.9), the goal of mlLGPR is to learn a hypothesis function f : Φ(x) → 2^Y that efficiently predicts target pathways for a hitherto unseen instance x* as close as possible to the actual pathways for that sample.

Figure 5.1: mlLGPR workflow. Datasets spanning the information hierarchy are used in feature engineering. The Synthetic dataset with features is split into training and test sets and used to train mlLGPR. Test data from the Gold Standard dataset (T1) with features and the Synthetic dataset with features is used to evaluate mlLGPR performance prior to the application of mlLGPR on experimental datasets (T4) from different sources.

5.3 The mlLGPR Method

In this section, we describe the mlLGPR components, including: i)- feature representation, ii)- the prediction model, and iii)- the multi-label learning process.

5.3.1 Feature Engineering

The design of feature vectors is critical for accurate classification and pathway inference. We consider five types of feature vectors inspired by the work of Dale and colleagues [73]: i)- the enzymatic reaction abundance vector (φa), ii)- the reaction evidence vector (φf), iii)- the pathway evidence vector (φy), iv)- the pathway common vector (φc), and v)- the possible pathway vector (φd). The transformation φa is an r-dimensional frequency vector recording the number of occurrences of each enzymatic reaction, φa = [a1, a2, ..., ar]^⊤. An enzymatic reaction is characterized by an enzyme commission (EC) classification number [20]. The reaction evidence vector φf indicates the properties of the enzymatic reactions for each sample. The pathway evidence features φy include a subset of features developed by Dale and colleagues, expanding on core PathoLogic rule sets to include additional information related to enzyme presence, gaps in pathways, network connectivity, taxonomic range, etc. [73].
The pathway common feature vector φc for a sample x^(i) is an r-dimensional binary vector, and the possible pathway vector φd is a t-dimensional binary vector. Each transformation function maps x to a vector of a different dimension, and the concatenated feature vector Φ = [φa(x^(i)), φf(x^(i)), φy(x^(i)), φc(x^(i)), φd(x^(i))] has a total of m dimensions for each sample. For a more in-depth description of the feature engineering process, please refer to Appendix B.

5.3.2 Prediction Model

We use the logistic regression (LR) model to infer a set of pathways given an instance feature vector Φ(x^(i)). LR was selected because of its proven power in discriminative classification across a variety of supervised machine learning problems [216]. In addition to the direct probabilistic interpretation integrated into the model, LR can handle high-dimensional data efficiently. The LR model represents conditional probabilities through a non-linear logistic function f(.) defined as

(5.3.1) $f(\theta_{j}, \Phi(x^{(i)})) = p(y^{(i)}_{j} = 1 \mid \Phi(x^{(i)}); \theta_{j}) = \frac{\exp(\theta_{j}^{\top}\Phi(x^{(i)}))}{\exp(\theta_{j}^{\top}\Phi(x^{(i)})) + 1}$

where y^(i)_j is the j-th element of the label vector y^(i) ∈ {0,1}^t and θ_j is an m-dimensional weight vector for the j-th pathway. Each element of Φ(x^(i)) corresponds to an element of θ_j for the j-th class; therefore, we can retrieve important features contributing to the prediction of pathway j by sorting the elements of Φ(x^(i)) according to the corresponding values of the weight vector θ_j. Eq. 5.3.1 is applied to all t classes for an instance i, hence multi-labeling, and the results are stored in a vector q^(i) ∈ R^t. Predicted pathways are reported based on a cutoff threshold τ, which is set to 0.5 by default:

(5.3.2) $\hat{y}^{(i)} = \mathrm{vec}\left( \begin{cases} 1 & \text{if } q^{(i)}_{j} \geq \tau \\ 0 & \text{otherwise} \end{cases} \right) \ \forall j \in t$

where vec is a vectorized operation. Note that Eq. 5.3.2 resembles Eq. 3.3.2.
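As a sketch of Eqs. 5.3.1-5.3.2 (using illustrative weights, not trained mlLGPR parameters), each per-pathway score is a logistic function of the feature vector, thresholded at τ:

```python
import math

def logistic_score(theta, phi_x):
    """Eq. 5.3.1: p(y_j = 1 | Phi(x); theta_j) for one pathway."""
    z = sum(w * f for w, f in zip(theta, phi_x))
    return math.exp(z) / (math.exp(z) + 1.0)  # equivalent to 1 / (1 + e^{-z})

def predict(thetas, phi_x, tau=0.5):
    """Eq. 5.3.2: one binary decision per pathway (cutoff threshold tau)."""
    q = [logistic_score(theta, phi_x) for theta in thetas]
    return [1 if qj >= tau else 0 for qj in q]

# Hypothetical weights for t = 2 pathways over a 3-dimensional feature vector.
thetas = [[2.0, -1.0, 0.5], [-1.5, 0.2, 0.1]]
print(predict(thetas, [1.0, 1.0, 1.0]))  # [1, 0]
```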
Given that Eq. 5.3.1 produces a conditional probability for each pathway, and the j-th class label is included in ŷ^(i) only if f(θ_j, Φ(x^(i))) ≥ τ, we adopt a soft decision boundary using the T-criterion rule [386]:

(5.3.3) $\hat{y}^{(i)} = \mathrm{vec}\left( \begin{cases} 1 & \text{if } q^{(i)}_{j} \geq \tau \\ 1 & \text{if } q^{(i)}_{j} \geq f_{\max}(q^{(i)}) \\ 0 & \text{otherwise} \end{cases} \right) \ \forall j \in t$

where $f_{\max}(q^{(i)}) = \beta \cdot \max(\{f(\theta_{j}, \Phi(x^{(i)})) : \forall j \in t\})$ is a fraction of the maximum predictive probability score. The hyper-parameter β ∈ (0,1] must be tuned based on empirical information; it cannot be set to 0, which would imply retrieving all t pathways. The set of pathways predicted using Eq. 5.3.3 is referred to as the adaptive prediction because the decision boundary, and its corresponding threshold, are tuned to the test data [336].

5.3.3 Multi-Label Learning Process

The learning process corresponds to the binary relevance technique discussed in Chapter 3.4.1. Specifically, mlLGPR decomposes the prediction problem into t independent binary classification problems, where each binary classification problem corresponds to a possible pathway in the label space. LR is then used to define a binary classifier f(.), such that for a training example (Φ(x^(i)), y^(i)), the instance Φ(x^(i)) is involved in the learning process of all t binary classifiers. Given n training samples, we estimate the weight vectors θ_1, θ_2, ..., θ_t individually by maximizing the logistic log-likelihood:

(5.3.4) $ll(\theta_{j}) = \max_{\theta_{j}} \frac{1}{n}\sum_{i=1}^{n} \left( y^{(i)}_{j}\theta_{j}^{\top}\Phi(x^{(i)}) - \log(1 + \exp(\theta_{j}^{\top}\Phi(x^{(i)}))) \right)$

Usually, a penalty or regularization term Ω(θ_j) is added to the loss function to enhance generalization to unseen data, particularly if the feature dimension m is high.
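The soft decision rule of Eq. 5.3.3 can be sketched as follows (toy score vector, not model output): a pathway is retained if its score passes either the fixed cutoff τ or the adaptive cutoff β times the maximum score:

```python
def adaptive_predict(q, tau=0.5, beta=0.45):
    """Eq. 5.3.3 (T-criterion): keep label j if q_j >= tau
    or q_j >= beta * max(q). beta must lie in (0, 1]."""
    fmax = beta * max(q)
    return [1 if (qj >= tau or qj >= fmax) else 0 for qj in q]

# With tau = 0.5 alone only the first pathway passes; the adaptive
# threshold (0.45 * 0.9 = 0.405) additionally admits the second.
print(adaptive_predict([0.9, 0.42, 0.1]))  # [1, 1, 0]
```

Setting β = 1 recovers behavior close to the hard rule of Eq. 5.3.2, since only scores equal to the maximum can pass the adaptive branch.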
Thus, the overall objective cost function (after dropping the max operator for brevity) is defined as:

(5.3.5) $C(\theta_{j}) = ll(\theta_{j}) - \lambda\,\Omega(\theta_{j})$

where λ > 0 is a hyper-parameter that controls the trade-off between ll(θ_j) and Ω(θ_j). Here, the regularization term Ω(θ_j) is chosen to be the elastic net:

(5.3.6) $\Omega(\theta_{j}) = \frac{1-\alpha}{2}\|\theta_{j}\|_{2}^{2} + \alpha\|\theta_{j}\|_{1}$

The elastic net penalty of Eq. 5.3.6 is a compromise between the L1 penalty of LASSO (α = 1) and the L2 penalty of ridge regression (α = 0) [400]. While the L1 term of the elastic net removes irrelevant variables by forcing some coefficients of θ_j to 0, yielding a sparse θ_j, the L2 penalty ensures that highly correlated variables have similar regression coefficients. Substituting Eq. 5.3.6 into Eq. 5.3.5 yields the following objective function:

(5.3.7) $C(\theta_{j}) = ll(\theta_{j}) - \lambda\left(\frac{1-\alpha}{2}\|\theta_{j}\|_{2}^{2} + \alpha\|\theta_{j}\|_{1}\right)$

During learning, the aim is to estimate the parameters θ_j so as to maximize C(θ_j), which is concave; however, the last term of Eq. 5.3.7 is non-differentiable, making the objective non-smooth. For this rightmost term, we apply the sub-gradient method [254], allowing the optimization problem to be solved using mini-batch gradient descent (GD) [190]. We initialize θ_j with random values, followed by iterations to maximize the cost function C(θ_j) using the following derivative:

(5.3.8) $\frac{\partial}{\partial\theta_{j}} C(\theta_{j}) = \frac{1}{n}\sum_{i=1}^{n} \Phi(x^{(i)})\left[y^{(i)}_{j} - f(\theta_{j}, \Phi(x^{(i)}))\right] - \lambda\left[(1-\alpha)\theta_{j} + \alpha\,\mathrm{sign}(\theta_{j})\right]$

Finally, the update for θ_j at each iteration u is obtained as:

(5.3.9) $\theta_{j}^{u+1} = \theta_{j}^{u} + \eta\left(\frac{1}{n}\sum_{i=1}^{n} \Phi(x^{(i)})\left[y^{(i)}_{j} - f(\theta_{j}, \Phi(x^{(i)}))\right] - \lambda\left[(1-\alpha)\theta_{j}^{u} + \alpha\,\mathrm{sign}(\theta_{j}^{u})\right]\right)$

5.4 Experimental Setup

In this section, we describe the experimental framework used to demonstrate mlLGPR pathway prediction performance, using the metrics described in Chapter 4.4, across multiple datasets spanning the genomic information hierarchy (Fig. 4.1).
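A minimal sketch of the update in Eqs. 5.3.8-5.3.9 for a single pathway classifier (gradient ascent with the elastic-net subgradient; toy data and hyper-parameters, not the thesis settings):

```python
import math

def sign(v):
    # Subgradient of |v|: -1, 0, or +1.
    return (v > 0) - (v < 0)

def ascent_step(theta, X, y, lam, alpha, eta):
    """One full-batch ascent step on C(theta_j) for a single pathway j.

    X: list of feature vectors Phi(x^(i)); y: 0/1 labels for pathway j.
    Implements the update of Eq. 5.3.9 with the gradient of Eq. 5.3.8.
    """
    n, m = len(X), len(theta)
    grad = [0.0] * m
    for phi_x, yi in zip(X, y):
        z = sum(w * f for w, f in zip(theta, phi_x))
        p = 1.0 / (1.0 + math.exp(-z))          # Eq. 5.3.1
        for k in range(m):
            grad[k] += phi_x[k] * (yi - p)      # data term of Eq. 5.3.8
    return [w + eta * (g / n - lam * ((1 - alpha) * w + alpha * sign(w)))
            for w, g in zip(theta, grad)]       # Eq. 5.3.9

# Toy problem: two features, one positive and one negative example.
theta = [0.0, 0.0]
for _ in range(200):
    theta = ascent_step(theta, [[1.0, 0.0], [0.0, 1.0]], [1, 0],
                        lam=0.01, alpha=0.65, eta=0.5)
```

In mlLGPR the same step is performed over mini-batches rather than the full batch, but the gradient and penalty terms are identical.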
The mlLGPR framework was written in Python v3 and depends on scikit-learn v0.20 [252], NumPy v1.16 [335], NetworkX v2.3 [121], and SciPy v1.4 [333].

For training purposes, Synset-1 and Synset-2 were subdivided into three subsets (training, validation, and test) using a stratified sampling approach [291], resulting in 10,869 training, 1,938 validation and 2,193 test samples for Synset-1, and 10,813 training, 1,930 validation, and 2,257 test instances for Synset-2. Feature extraction was implemented for each dataset in Table 4.4, resulting in a total feature vector size of 12,452 for each instance, where |φa| = 3650, |φf| = 68, |φy| = 32, |φc| = 3650, and |φd| = 5052. Integral parameter settings included: Θ initialized to uniform random values in the range [0,1], batch size set to 500, number of epochs set to 3, adaptive prediction parameter β in the range (0,1], and regularization parameters λ and α set to 10000 and 0.65, respectively. The learning rate η was adjusted according to 1/(λ + u), where u denotes the current step. The development set was used to determine critical values of λ and α. Default parameter settings were used for MinPath and PathoLogic. All tests were conducted on a Linux server using 10 cores of an Intel Xeon CPU E5-2650.

Hamming Loss ↓
Methods | EcoCyc | HumanCyc | AraCyc | YeastCyc | LeishCyc | TrypanoCyc | SixDB
mlLGPR-L1 (+AB+RE+PE) | 0.0776 | 0.0645 | 0.1069 | 0.0487 | 0.0412 | 0.0602 | 0.1365
mlLGPR-L2 (+AB+RE+PE) | 0.0606 | 0.0515 | 0.1112 | 0.0412 | 0.0234 | 0.0344 | 0.1426
mlLGPR-EN (+AB+RE+PE) | 0.0804 | 0.0633 | 0.1069 | 0.0550 | 0.0380 | 0.0590 | 0.1281

Average Precision Score ↑
Methods | EcoCyc | HumanCyc | AraCyc | YeastCyc | LeishCyc | TrypanoCyc | SixDB
mlLGPR-L1 (+AB+RE+PE) | 0.6253 | 0.6686 | 0.7390 | 0.6815 | 0.4525 | 0.5395 | 0.7391
mlLGPR-L2 (+AB+RE+PE) | 0.7437 | 0.7945 | 0.8418 | 0.7934 | 0.6186 | 0.7268 | 0.8488
mlLGPR-EN (+AB+RE+PE) | 0.6187 | 0.6686 | 0.7372 | 0.6480 | 0.4731 | 0.5455 | 0.7561

Average Recall Score ↑
Methods | EcoCyc | HumanCyc | AraCyc | YeastCyc | LeishCyc | TrypanoCyc | SixDB
mlLGPR-L1 (+AB+RE+PE) | 0.9023 | 0.8244 | 0.7275 | 0.8690 | 0.9310 | 0.8971 | 0.6738
mlLGPR-L2 (+AB+RE+PE) | 0.7655 | 0.7204 | 0.5529 | 0.7380 | 0.8391 | 0.8057 | 0.5211
mlLGPR-EN (+AB+RE+PE) | 0.8827 | 0.8459 | 0.7314 | 0.8603 | 0.9080 | 0.8914 | 0.6904

Average F1 Score ↑
Methods | EcoCyc | HumanCyc | AraCyc | YeastCyc | LeishCyc | TrypanoCyc | SixDB
mlLGPR-L1 (+AB+RE+PE) | 0.7387 | 0.7384 | 0.7332 | 0.7639 | 0.6090 | 0.6738 | 0.6919
mlLGPR-L2 (+AB+RE+PE) | 0.7544 | 0.7556 | 0.6675 | 0.7647 | 0.7122 | 0.7642 | 0.6306
mlLGPR-EN (+AB+RE+PE) | 0.7275 | 0.7468 | 0.7343 | 0.7392 | 0.6220 | 0.6768 | 0.7098

Table 5.1: Predictive performance of mlLGPR on the T1 golden datasets. mlLGPR-L1: mlLGPR with the L1 regularizer; mlLGPR-L2: mlLGPR with the L2 regularizer; mlLGPR-EN: mlLGPR with the elastic net penalty; AB: abundance features; RE: reaction evidence features; PE: pathway evidence features. For each performance metric, '↓' indicates that a lower score is better while '↑' indicates that a higher score is better.

5.5 Experimental Results and Discussion

Four types of analysis, including parameter sensitivity, feature selection, robustness, and pathway prediction potential, were used to tune and evaluate mlLGPR performance.

5.5.1 Parameter Sensitivity

Experimental setup. Three consecutive tests were performed to ascertain: 1)- the impact of the L1, L2, and elastic-net (EN) regularizers on mlLGPR performance using the T1 golden datasets, 2)- the impact of varying the hyper-parameter λ ∈ {1, 10, 100, 1000, 10000} using the T1 golden datasets, and 3)- the impact of the adaptive β ∈ (0,1] using Synset-2 and the SixDB golden datasets.

Figure 5.2: Average F1 scores of mlLGPR-EN over a range of regularization hyper-parameter values λ ∈ {1, 10, 100, 1000, 10000} on the EcoCyc, HumanCyc, AraCyc, YeastCyc, LeishCyc, TrypanoCyc, and SixDB datasets. The x-axis is log scaled.

Experimental results. Table 5.1 reports the test results across the different mlLGPR parameter settings.
Although the F1 scores of mlLGPR-L1, mlLGPR-L2 and mlLGPR-EN were compara-ble, precision and recall scores were inconsistent across the T1 golden datasets. For example,high precision scores were observed for mlLGPR-L2 on AraCyc (0.8418) and YeastCyc (0.7934)with low recall scores of 0.5529 and 0.7380, respectively. In contrast, high recall scores wereobserved for mlLGPR-L1 on AraCyc (0.7275) and YeastCyc (0.8690) with low precision scoresof 0.7390 and 0.6815, respectively. The increased recall with reduced precision scores bymlLGPR-L1 indicates a low variance model that may eliminate many relevant coefficients.The impact is especially observed for datasets encoding a small number of pathways as isthe case for LeishCyc (87 pathways) and TrypanoCyc (175 pathways). Similarly, the increasedprecision with reduced recall scores by mlLGPR-L2 is a consequence of the existence ofhighly correlated features present in the test datasets [128], resulting in a high variance60model. The impact is especially observed for LeishCyc and TrypanoCyc suggesting thatmlLGPR-L2 performance declines with increasing pathway number. mlLGPR-EN tendedto even out the scores relative to mlLGPR-L1 and mlLGPR-L2 providing more balancedperformance outcomes.Based on these results, hyper-parameters λ and β were tested to tune mlLGPR-ENperformance. Fig. 5.2 indicates that the relationship between F1 score and the regularizationhyper-parameter λ increases monotonically for the T1 golden datasets peaking at λ= 10000(having an F1 score of > 0.6 for all datasets). For the adaptive β test, in Fig. 5.3 shows theperformance of mlLGPR-EN on Synset-2 test samples across a range of β ∈ (0,1] values,indicating that this hyper-parameter has minimal impact on performance.Figure 5.3: Performance of mlLGPR-EN according to the β adaptive decision hyper-parameter on datasets. (a)- Synset-2 test dataset. 
(b)- SixDB dataset.

Taken together, parameter testing results indicated that mlLGPR-EN provided the most balanced implementation of mlLGPR, and that the regularization hyper-parameter λ at 10000 resulted in the best performance for the T1 golden datasets. This hyper-parameter should be tuned when applied to new datasets to reduce false positive pathway discovery. Minimal effects on prediction performance were observed when testing the adaptive β hyper-parameter.

5.5.2 Features Selection

Experimental setup. In this study, a series of feature set "ablation" tests were conducted using Synset-2 as a training set in a reverse manner, starting with only the reaction abundance features (AB), a fundamental feature set consisting of 3650 features, and then successively aggregating additional feature sets while recording predictive performance on the T1 golden datasets using the settings and metrics described in Section 5.4. Because testing individual features is impractical, this form of aggregate testing provides a tractable method to identify the relative contribution of feature sets to pathway prediction performance.

Experimental results. Table 5.2 indicates ablation test results. The AB feature set promotes the highest average recall on EcoCyc (0.9511) and a comparable F1 score of 0.6952. This is not unexpected given that the ratio of pathways to the number of enzymatic reactions (PLR) indicated by EC numbers for EcoCyc is high (see Table 4.4). However, although functional annotations with EC numbers increase the probability of predicting a given pathway, pathways with few or no EC numbers, such as pregnenolone biosynthesis, require additional feature sets to avoid false negatives. As additional feature sets were aggregated, mlLGPR-EN performance tends to improve unevenly for different T1 organismal genomes.
For example, adding the enzymatic reaction evidence (RE) feature set, consisting of 68 features, to the AB feature set improves F1 scores for YeastCyc (0.7394), LeishCyc (0.5830), and TrypanoCyc (0.6753). Further aggregating the pathway evidence (PE) feature set, consisting of 32 features, to the AB feature set improves the F1 score for AraCyc (0.7532) but reduces the F1 score for the remaining T1 organismal genomes. Aggregating the AB, RE, and PE feature sets resulted in the highest F1 scores for HumanCyc (0.7468), LeishCyc (0.6220), TrypanoCyc (0.6768), and SixDB (0.7078), with only marginal differences from the highest F1 scores for EcoCyc (0.7275) and AraCyc (0.7343). Additional combinations of features did not improve overall performance across the T1 golden datasets.

Taken together, ablation testing results indicated that mlLGPR-EN in combination with the AB, RE, and PE feature sets results in the most even pathway prediction performance for the T1 golden datasets.

5.5.3 Robustness

Experimental setup. Robustness, also known as accuracy loss rate, was determined for mlLGPR-EN with the AB, RE, and PE feature sets using the intact Synset-1 dataset and a "corrupted" or noisy version of the Synset-2 dataset, using the settings in Section 5.4. Relative Loss of Accuracy (RLA) and Equalized Loss of Accuracy (ELA) scores [283] were used to describe the expected behavior of mlLGPR-EN in relation to introduced noise (see Chapter 4.4.2). A low ELA score indicates that the model continues to exhibit good performance with increasing noise.

Experimental results. Table 5.3 indicates robustness test scores.
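With scores expressed in [0, 1], the two robustness measures of Sáez and colleagues [283] reduce to simple ratios of the noise-free score A₀ and the score Aρ at noise level ρ, as used to construct Table 5.3: RLAρ = (A₀ − Aρ)/A₀ and ELAρ = (1 − Aρ)/A₀ = RLAρ + s(M₀), with s(M₀) = (1 − A₀)/A₀. A minimal sketch:

```python
def rla(a0, a_rho):
    # Relative Loss of Accuracy: relative performance drop at noise level rho
    return (a0 - a_rho) / a0

def s_m0(a0):
    # s(M0): normalized error of the noise-free model
    return (1.0 - a0) / a0

def ela(a0, a_rho):
    # Equalized Loss of Accuracy: ELA = RLA + s(M0)
    return (1.0 - a_rho) / a0

# EcoCyc row of Table 5.3: a0 = 0.7280, a_rho = 0.7275
# -> RLA ~ 0.0007, s(M0) ~ 0.3736, ELA ~ 0.3743
```

Note that a negative RLA (as for HumanCyc or LeishCyc in Table 5.3) means the model trained on noisy data actually scored higher than the noise-free model.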
mlLGPR-EN with introduced noise performed better for HumanCyc (−0.0502), YeastCyc (−0.0301), LeishCyc (−0.1189), and TrypanoCyc (−0.0151), but was less robust for AraCyc (0.0416) and SixDB (0.0470), based on RLAρ scores. This suggests that noise sensitivity correlates with pathway number. The more pathways present within a dataset, the more introduced noise can upset correlations among features; the impact of negative correlations is minimized when a dataset contains fewer pathways. Note that the average number of ECs associated with pathways has little or negligible effect on robustness.

Taken together, the RLA and ELA results for the T1 golden datasets indicate that mlLGPR-EN trained on noisy datasets is robust to perturbation. This is a prerequisite for developing supervised ML methods tuned for community-level pathway prediction.

Hamming Loss ↓
Methods                  EcoCyc  HumanCyc  AraCyc  YeastCyc  LeishCyc  TrypanoCyc  SixDB
mlLGPR-AB                0.1013  0.0887    0.1025  0.0907    0.1124    0.1073      0.1412
mlLGPR-AB-RE             0.0788  0.0697    0.1101  0.0558    0.0447    0.0598      0.1348
mlLGPR-AB-PP             0.2835  0.2922    0.2898  0.2724    0.2553    0.2759      0.2842
mlLGPR-AB-PE             0.1017  0.0835    0.1002  0.0891    0.1172    0.1089      0.1387
mlLGPR-AB-PC             0.1041  0.0938    0.1409  0.0879    0.1081    0.0899      0.1844
mlLGPR-AB-RE-PP          0.2815  0.2882    0.2961  0.2648    0.2526    0.2759      0.2825
mlLGPR-AB-RE-PE          0.0804  0.0633    0.1069  0.0550    0.0380    0.0590      0.1281
mlLGPR-AB-RE-PC          0.0966  0.0732    0.1394  0.0677    0.0515    0.0625      0.1793
mlLGPR-AB-PE-PC          0.1029  0.0899    0.1441  0.0914    0.1148    0.0903      0.1820
mlLGPR-AB-PP-PC          0.2019  0.2070    0.2142  0.1876    0.1884    0.1880      0.2299
mlLGPR-AB-RE-PE-PP       0.2894  0.2993    0.2953  0.2736    0.2530    0.2755      0.2838
mlLGPR-AB-RE-PE-PC       0.0954  0.0816    0.1441  0.0673    0.0451    0.0641      0.1806
mlLGPR-AB-RE-PE-PP-PC    0.2003  0.2063    0.2209  0.1924    0.1924    0.1928      0.2317

Average Precision Score ↑
Methods                  EcoCyc  HumanCyc  AraCyc  YeastCyc  LeishCyc  TrypanoCyc  SixDB
mlLGPR-AB                0.5478  0.5610    0.7390  0.5000    0.2316    0.3873      0.7323
mlLGPR-AB-RE             0.6205  0.6373    0.7275  0.6410    0.4293    0.5414      0.7412
mlLGPR-AB-PP             0.2755  0.2508    0.3926  0.2303    0.1037    0.1855      0.4300
mlLGPR-AB-PE             0.5473  0.5773    0.7495  0.5048    0.2257    0.3843      0.7402
mlLGPR-AB-PC             0.5618  0.5673    0.7810  0.5113    0.2265    0.4217      0.7650
mlLGPR-AB-RE-PP          0.2795  0.2536    0.3845  0.2375    0.1081    0.1885      0.4322
mlLGPR-AB-RE-PE          0.6187  0.6686    0.7372  0.6480    0.4731    0.5455      0.7561
mlLGPR-AB-RE-PC          0.6019  0.6926    0.7992  0.6330    0.3862    0.5362      0.7761
mlLGPR-AB-PE-PC          0.5681  0.5844    0.7645  0.4969    0.2188    0.4223      0.7727
mlLGPR-AB-PP-PC          0.3241  0.3000    0.4730  0.2761    0.1309    0.2283      0.5122
mlLGPR-AB-RE-PE-PP       0.2706  0.2482    0.3870  0.2301    0.1068    0.1873      0.4309
mlLGPR-AB-RE-PE-PC       0.6065  0.6466    0.7744  0.6277    0.4237    0.5291      0.7715
mlLGPR-AB-RE-PE-PP-PC    0.3299  0.2997    0.4580  0.2701    0.1285    0.2244      0.5084

Average Recall Score ↑
Methods                  EcoCyc  HumanCyc  AraCyc  YeastCyc  LeishCyc  TrypanoCyc  SixDB
mlLGPR-AB                0.9511  0.9068    0.7608  0.9258    0.9770    0.9429      0.6775
mlLGPR-AB-RE             0.9055  0.8566    0.7275  0.8734    0.9080    0.8971      0.6774
mlLGPR-AB-PP             0.8176  0.8280    0.7961  0.8559    0.8391    0.8800      0.7696
mlLGPR-AB-PE             0.9414  0.9104    0.7569  0.9170    0.9885    0.9486      0.6795
mlLGPR-AB-PC             0.6515  0.6344    0.4196  0.6900    0.8851    0.8000      0.3827
mlLGPR-AB-RE-PP          0.8339  0.8280    0.7765  0.8690    0.8736    0.9029      0.7768
mlLGPR-AB-RE-PE          0.8827  0.8459    0.7314  0.8603    0.9080    0.8914      0.6904
mlLGPR-AB-RE-PC          0.6059  0.6057    0.4137  0.6026    0.8391    0.7200      0.3820
mlLGPR-AB-PE-PC          0.6384  0.6452    0.4137  0.6900    0.9080    0.8229      0.3923
mlLGPR-AB-PP-PC          0.6091  0.6559    0.5333  0.6594    0.7931    0.7200      0.5053
mlLGPR-AB-RE-PE-PP       0.8143  0.8423    0.7922  0.8603    0.8621    0.8914      0.7758
mlLGPR-AB-RE-PE-PC       0.6124  0.5771    0.4039  0.6332    0.8621    0.6743      0.3776
mlLGPR-AB-RE-PE-PP-PC    0.6287  0.6487    0.5137  0.6594    0.7931    0.7257      0.5074

Average F1 Score ↑
Methods                  EcoCyc  HumanCyc  AraCyc  YeastCyc  LeishCyc  TrypanoCyc  SixDB
mlLGPR-AB                0.6952  0.6932    0.7498  0.6493    0.3744    0.5491      0.6754
mlLGPR-AB-RE             0.7364  0.7309    0.7275  0.7394    0.5830    0.6753      0.6938
mlLGPR-AB-PP             0.4122  0.3850    0.5259  0.3630    0.1846    0.3065      0.5386
mlLGPR-AB-PE             0.6922  0.7065    0.7532  0.6512    0.3675    0.5470      0.6802
mlLGPR-AB-PC             0.6033  0.5990    0.5459  0.5874    0.3607    0.5523      0.4683
mlLGPR-AB-RE-PP          0.4186  0.3882    0.5143  0.3730    0.1924    0.3119      0.5422
mlLGPR-AB-RE-PE          0.7275  0.7468    0.7343  0.7392    0.6220    0.6768      0.7098
mlLGPR-AB-RE-PC          0.6039  0.6463    0.5452  0.6174    0.5290    0.6146      0.4853
mlLGPR-AB-PE-PC          0.6012  0.6133    0.5369  0.5777    0.3527    0.5581      0.4779
mlLGPR-AB-PP-PC          0.4231  0.4117    0.5014  0.3892    0.2248    0.3466      0.4857
mlLGPR-AB-RE-PE-PP       0.4062  0.3834    0.5199  0.3631    0.1901    0.3095      0.5407
mlLGPR-AB-RE-PE-PC       0.6094  0.6098    0.5309  0.6304    0.5682    0.5930      0.4805
mlLGPR-AB-RE-PE-PP-PC    0.4327  0.4100    0.4843  0.3832    0.2212    0.3428      0.4847

Table 5.2: Ablation tests of mlLGPR-EN trained using Synset-2 on T1 golden datasets. AB: abundance features, RE: reaction evidence features, PP: possible pathway features, PE: pathway evidence features, and PC: pathway common features. mlLGPR is trained using a combination of features, represented by mlLGPR-*, on the Synset-2 training set. For each performance metric, '↓' indicates that a lower score is better while '↑' indicates that a higher score is better.

            Average F1 Score ↑          Robustness Score ↓
Dataset     mlLGPR-EN0  mlLGPR-ENρ     RLAρ      s(M0)   ELAρ
EcoCyc      0.7280      0.7275         0.0007    0.3736  0.3743
HumanCyc    0.7111      0.7468         −0.0502   0.4063  0.3561
AraCyc      0.7662      0.7343         0.0416    0.3051  0.3468
YeastCyc    0.7176      0.7392         −0.0301   0.3935  0.3634
LeishCyc    0.5559      0.6220         −0.1189   0.7989  0.6800
TrypanoCyc  0.6667      0.6768         −0.0151   0.4999  0.4848
SixDB       0.7448      0.7098         0.0470    0.3426  0.3896

Table 5.3: Performance and robustness scores for mlLGPR-EN with the AB, RE, and PE feature sets trained on both the Synset-1 and Synset-2 training sets at 0 and ρ noise. The best performance scores are highlighted in bold. '↓' indicates that a lower score is better while '↑' indicates that a higher score is better.

5.5.4 Pathway Prediction Potential

Experimental setup.
Pathway prediction potential of mlLGPR-EN with the AB, RE, and PE feature sets trained on the Synset-2 training set was compared to four additional prediction methods described in Chapter 4.3 on the T1 golden datasets, using the settings and metrics described above. For community-level pathway prediction on the T4 datasets, including the symbiont, CAMI low complexity, and HOTS datasets, mlLGPR-EN and PathoLogic (without taxonomic pruning) results were compared.

Experimental results. Table 5.4 shows performance scores for each pathway prediction method tested. The BASELINE, Naïve, and MinPath methods infer many false positive pathways across the T1 golden datasets, indicated by high recall with low precision and F1 scores. In contrast, high precision and F1 scores were observed for PathoLogic and mlLGPR-EN across the T1 golden datasets. Although both methods gave similar results, PathoLogic F1 scores for EcoCyc (0.7631), YeastCyc (0.7890), and SixDB (0.7479) exceeded those for mlLGPR-EN. Conversely, mlLGPR-EN F1 scores for HumanCyc (0.7468), AraCyc (0.7343), LeishCyc (0.6220), and TrypanoCyc (0.6768) exceeded those for PathoLogic. In addition, the statistical analysis in Appendix C shows that mlLGPR (in all variants) is indeed comparable to the PathoLogic rule-based algorithm.

Hamming Loss ↓
Methods                  EcoCyc  HumanCyc  AraCyc  YeastCyc  LeishCyc  TrypanoCyc  SixDB
BASELINE                 0.2217  0.2486    0.3230  0.2458    0.1591    0.2526      0.3096
Naïve                    0.3856  0.4113    0.4592  0.4216    0.3215    0.4319      0.4392
MinPath                  0.2257  0.2530    0.3266  0.2482    0.1615    0.2561      0.3124
PathoLogic               0.0610  0.0633    0.1188  0.0424    0.0368    0.0424      0.1141
mlLGPR-EN (+AB+RE+PE)    0.0804  0.0633    0.1069  0.0550    0.0380    0.0590      0.1281

Average Precision Score ↑
Methods                  EcoCyc  HumanCyc  AraCyc  YeastCyc  LeishCyc  TrypanoCyc  SixDB
BASELINE                 0.3531  0.3042    0.3832  0.2694    0.1779    0.2153      0.4145
Naïve                    0.2384  0.2081    0.3035  0.1770    0.0968    0.1382      0.3357
MinPath                  0.3490  0.3004    0.3806  0.2675    0.1758    0.2129      0.4124
PathoLogic               0.7230  0.6695    0.7011  0.7194    0.4803    0.5480      0.7522
mlLGPR-EN (+AB+RE+PE)    0.6187  0.6686    0.7372  0.6480    0.4731    0.5455      0.7561

Average Recall Score ↑
Methods                  EcoCyc  HumanCyc  AraCyc  YeastCyc  LeishCyc  TrypanoCyc  SixDB
BASELINE                 0.9902  0.9713    0.9843  1.0000    1.0000    1.0000      0.9860
Naïve                    0.9902  0.9713    0.9843  1.0000    1.0000    1.0000      0.9860
MinPath                  0.9902  0.9713    0.9843  1.0000    1.0000    1.0000      0.9860
PathoLogic               0.8078  0.8423    0.7176  0.8734    0.8391    0.7829      0.7499
mlLGPR-EN (+AB+RE+PE)    0.8827  0.8459    0.7314  0.8603    0.9080    0.8914      0.6904

Average F1 Score ↑
Methods                  EcoCyc  HumanCyc  AraCyc  YeastCyc  LeishCyc  TrypanoCyc  SixDB
BASELINE                 0.5205  0.4632    0.5516  0.4245    0.3021    0.3543      0.5784
Naïve                    0.3843  0.3428    0.4640  0.3007    0.1765    0.2429      0.4939
MinPath                  0.5161  0.4589    0.5489  0.4221    0.2990    0.3511      0.5763
PathoLogic               0.7631  0.7460    0.7093  0.7890    0.6109    0.6447      0.7479
mlLGPR-EN (+AB+RE+PE)    0.7275  0.7468    0.7343  0.7392    0.6220    0.6768      0.7098

Table 5.4: Pathway prediction performance between methods using T1 golden datasets. mlLGPR-EN: mlLGPR with the elastic-net penalty, AB: abundance features, RE: reaction evidence features, and PE: pathway evidence features. For each performance metric, '↓' indicates that a lower score is better while '↑' indicates that a higher score is better.

To evaluate mlLGPR-EN performance on distributed metabolic pathway prediction between two or more interacting organismal genomes, a symbiotic system consisting of the reduced genomes of Candidatus Moranella endobia and Candidatus Tremblaya princeps, encoding a previously identified set of distributed amino acid biosynthetic pathways [218], was selected. mlLGPR-EN and PathoLogic were used to predict pathways on the individual symbiont genomes and on a composite genome consisting of both, and the resulting amino acid biosynthetic pathway distributions were determined (Fig. 5.4). mlLGPR-EN predicted 8 out of 9 expected amino acid biosynthetic pathways while PathoLogic recovered 6 on the composite genome. The missing pathway, phenylalanine biosynthesis (L-phenylalanine biosynthesis I), was excluded from analysis because the associated genes were reported to be missing during the ORF prediction process. False positives were predicted for the individual symbiont genomes of Moranella and Tremblaya using both methods, although pathway coverage was low compared to the composite genome. Additional feature information restricting the taxonomic range of certain pathways, or more restrictive pathway coverage, could reduce false discovery on individual organismal genomes.

Figure 5.4: Predicted pathways for symbiont datasets between mlLGPR-EN with the AB, RE, and PE feature sets and PathoLogic. Red circles indicate that neither method predicted a specific pathway while green circles indicate that both methods predicted a specific pathway. Blue circles indicate pathways predicted solely by mlLGPR. The size of circles scales with reaction abundance information.

To evaluate pathway prediction performance of mlLGPR-EN on more complex community-level genomes, the CAMI low complexity and HOTS datasets were selected. Table 5.5 shows performance scores for mlLGPR-EN on the CAMI dataset. Although recall was high (0.7827), precision and F1 scores were low when compared to the T1 golden datasets. Similar results were obtained for the HOTS dataset. In both cases it is difficult to validate most pathway prediction results without individual organismal genomes that can be replicated in culture. Moreover, the total number of expected pathways per dataset is relatively large, encompassing metabolic interactions at different levels of biological organization. On the one hand, these open conditions confound interpretation of performance metrics, while on the other they present numerous opportunities for hypothesis generation and testing.

Metric                       mlLGPR-EN (+AB+RE+PE)
Hamming Loss (↓)             0.0975
Average Precision Score (↑)  0.3570
Average Recall Score (↑)     0.7827
Average F1 Score (↑)         0.4866

Table 5.5: Predictive performance of mlLGPR-EN with the AB, RE, and PE feature sets on the CAMI low complexity dataset.
To better constrain this tension, mlLGPR-EN and PathoLogic prediction results were compared for a subset of 45 pathways previously reported in the HOTS dataset [125]. Fig. 5.5 shows pathway distributions spanning sunlit and dark ocean waters predicted by PathoLogic and mlLGPR-EN, grouped according to higher order functions within the MetaCyc classification hierarchy. Between the 25 and 500 m depth intervals, 7 pathways were exclusively predicted by PathoLogic and 6 were exclusively predicted by mlLGPR-EN. Another 20 pathways were predicted by both methods, while 6 pathways were not predicted by either method, including glycine biosynthesis IV, thiamine diphosphate biosynthesis II and IV, flavonoid biosynthesis, 2-methylcitrate cycle II, and L-methionine degradation III. In several instances, the depth distributions of predicted pathways also differed from those described in [125], including L-selenocysteine biosynthesis II and acetate formation from acetyl-CoA II. It remains uncertain why the current implementation of PathoLogic produced inconsistent pathway prediction results, although changes have accrued in the PathoLogic rules and the structure of the MetaCyc classification hierarchy in the intervening time interval.

Taken together, the comparative pathway prediction results indicate that mlLGPR-EN performance equals or exceeds that of other methods, including PathoLogic, on organismal genomes but diminishes with dataset complexity.

5.6 Summary

In this chapter, we have presented mlLGPR, a new method using multi-label classification and logistic regression to predict metabolic pathways at different levels in the genomic information hierarchy (Fig. 4.1).
mlLGPR effectively maps annotated enzymatic reactions using EC numbers onto reference metabolic pathways sourced from the MetaCyc database. We provide a detailed open source process, from features engineering and the construction of synthetic samples, on which mlLGPR is trained, to performance testing on increasingly complex real world datasets including organismal genomes, nested symbionts, CAMI low complexity, and HOTS.

Figure 5.5: Comparison of predicted pathways for the HOTS datasets between mlLGPR-EN with the AB, RE, and PE feature sets and PathoLogic. Red circles indicate that neither method predicted a specific pathway while green circles indicate that both methods predicted a specific pathway. Blue circles indicate pathways predicted solely by mlLGPR and gray circles indicate pathways solely predicted by PathoLogic. The size of circles scales with reaction abundance information.

With respect to features engineering, five feature sets were adapted from Dale and colleagues [73] to guide the learning process. Feature ablation studies demonstrated the usefulness of aggregating different combinations of feature sets using the elastic-net (EN) regularizer to improve mlLGPR prediction performance on the golden datasets. Using this process we determined that the abundance (AB), enzymatic reaction evidence (RE), and pathway evidence (PE) feature sets contribute disproportionately to mlLGPR-EN performance. After tuning several hyper-parameters to further improve mlLGPR performance, pathway prediction outcomes were compared to those of other methods, including MinPath and PathoLogic. The results indicated that while mlLGPR-EN performance equaled or exceeded other methods, including PathoLogic, on organismal genomes, it performed more marginally on complex datasets.
This is likely due to multiple factors, including the limited validation information for community-level metabolism as well as the need for more subtle features engineering and algorithmic improvements.

Several issues were encountered during testing and implementation that need to be resolved for improved pathway prediction outcomes using machine learning methods. While rich feature information is integral to mlLGPR performance, the current definition of feature sets relies on manual curation based on prior knowledge. We observed that in some instances the features engineering process is susceptible to noise, resulting in low performance scores. Moreover, individual enzymatic reactions may participate in multiple pathways, resulting in increased false discovery without additional feature sets that relate the presence and abundance of EC numbers to other factors. This problem has been partially addressed by designing features based on side knowledge of a pathway, such as information about "key reactions" in pathways that increase the likelihood that a given pathway is present. In the next chapter, we introduce another form of side knowledge to improve pathway prediction, namely a representational learning approach [28].

Part III

Graph based Multi-Label Classification

Chapter 6

Leveraging Heterogeneous Network Embedding for Metabolic Pathway Prediction

"Productivity is never an accident. It is always the result of a commitment to excellence, intelligent planning, and focused effort."
– Paul J. Meyer

In the previous chapter, we discussed mlLGPR, a machine learning approach to infer pathways.
This method relied on rich feature information that may be susceptible to noise. This chapter presents pathway2vec, a software package consisting of six representational learning based modules used to automatically generate features for pathway inference. Specifically, we build a three-layered network composed of compounds, enzymes, and pathways, where nodes within a layer manifest inter-interactions and nodes between layers manifest betweenness interactions. This layered architecture captures relevant relationships used to learn a neural embedding-based low-dimensional space of metabolic features. Modules in pathway2vec were evaluated based on node clustering, embedding visualization, and pathway prediction using MetaCyc as a trusted source, introduced in Chapter 2.1. In the pathway prediction task, results indicate that it is possible to leverage embeddings to improve pathway prediction outcomes.

6.1 Introduction

Metabolic pathway reconstruction from genomic sequence information is a key step in predicting the regulatory and functional potential of cells at the individual, population, and community levels of organization [4]. As we discussed in Chapter 2.2, the most common methods (e.g. PathoLogic and MinPath) for metabolic pathway reconstruction rely on a set of manually specified rules. Unfortunately, the development of accurate and flexible rule sets for pathway prediction remains a challenging enterprise, informed by expert curators incorporating thermodynamic, kinetic, and structural information for validation [326]. Updating these rule sets as new organisms or pathways are described and validated can be cumbersome and out of phase with current user needs. This has led to the consideration of machine learning (ML) approaches for pathway prediction based on rich feature information, such as mlLGPR in Chapter 5.
One of the primary challenges encountered in developing mlLGPR relates to engineering reliable features representing heterogeneous and degenerate functions within cellular communities, e.g. microbiomes [185].

Advances in representational learning have led to the development of scalable methods for engineering features from graphical networks, e.g. networks composed of multiple nodes including information systems or social networks [81, 114, 255]. These approaches learn feature vectors for nodes in a network by solving an optimization problem in an unsupervised manner, using random walks followed by Skip-Gram extraction of low-dimensional latent continuous features, known as embeddings [225]. This chapter presents pathway2vec, a software package incorporating multiple random walk based algorithms for representational learning used to automatically generate feature representations of metabolic pathways, which are decomposed into three interacting layers: compounds, enzymes, and pathways, where each layer consists of associated nodes. A Skip-Gram model is applied to extract embeddings for each node, encoding smooth decision boundaries between groups of nodes in the graph. Nodes within a layer manifest inter-interactions and nodes between layers manifest betweenness interactions, resulting in a multi-layer heterogeneous information network [302]. This layered architecture captures relevant relationships used to learn a neural embedding-based low-dimensional space of metabolic features (Fig. 6.1).

In addition to implementing several published random walk methods, we developed RUST (unit-circle based jump and stay random walk), adopting a unit-circle equation to sample node pairs that generalizes previous random walk methods [81, 114, 145]. The modules in pathway2vec were benchmarked based on node clustering, embedding visualization, and pathway prediction.
In the case of pathway prediction, pathway2vec modules provided a viable adjunct or alternative to manually curated feature sets used in ML based metabolic pathway reconstruction from genomic sequence information. The distinction of this work lies in decomposing pathways into components, so that various graph learning methods can be applied to automatically extract semantic features of metabolic pathways, and in incorporating the learned embeddings for pathway inference.

Figure 6.1: Three interacting metabolic pathways (a), depicted as a cloud glyph, where each pathway is comprised of compounds (green) and enzymes (red). Interacting compound, enzyme, and pathway components are transformed into a multi-layer heterogeneous information network (b).

6.2 Definitions and Problem Statement

The pathway inference task can be formulated as retrieving a set of pathway labels for an example i given features learned according to a heterogeneous information network, defined as:

Definition 6.1. Heterogeneous Information Network. A heterogeneous information network is defined as a graph G = (V, E), where V and E denote the set of nodes and edges (either directed or undirected), respectively [317]. Each v ∈ V is associated with an object type mapping function φ(v) : V → O, where O represents a set of object types. Each edge e ∈ E ⊆ V × V includes multiple types of links, and is associated with a link type mapping function φ(e) : E → R, where R represents a set of relation types. In particular, when |O| + |R| > 2, the graph is referred to as a heterogeneous information network.

In heterogeneous information networks, both object types and relationship types are explicitly segregated. For undirected edges, notice that if a relation exists from a type Oi (∈ O) to a type Oj (∈ O), denoted as Oi R Oj with R ∈ R, the inverse relation R−1 holds naturally for Oj R−1 Oi. However, in many circumstances, R and its inverse R−1 are not equal, unless the two objects are in the same domain and R is symmetric.
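Definition 6.1 can be made concrete with a minimal typed-graph sketch. The class below and the toy MetaCyc-style nodes (drawn from the entities named in Fig. 6.2) are hypothetical illustrations, not part of pathway2vec.

```python
class HIN:
    """Minimal heterogeneous information network: typed nodes, symmetric links."""

    def __init__(self):
        self.node_type = {}   # phi(v): V -> O, the object-type mapping
        self.adj = {}         # undirected adjacency lists

    def add_node(self, v, otype):
        self.node_type[v] = otype
        self.adj.setdefault(v, [])

    def add_edge(self, u, v):
        # symmetric link; its end-point types define the relation type
        self.adj[u].append(v)
        self.adj[v].append(u)

    def relation(self, u, v):
        # relation type induced by the two object types
        return (self.node_type[u], self.node_type[v])

# Toy slice: a compound (C), an enzyme (Z), and a pathway (T)
g = HIN()
for v, t in [("nitrite", "C"), ("EC 1.7.2.1", "Z"),
             ("nitrifier denitrification", "T")]:
    g.add_node(v, t)
g.add_edge("nitrite", "EC 1.7.2.1")                 # "transformed by"
g.add_edge("nitrite", "nitrifier denitrification")  # "involved in"
```

With three object types, |O| + |R| > 2 holds and the toy graph is heterogeneous in the sense of Definition 6.1.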
In addition, the network may be weighted, where each edge ei,j, between nodes i and j, is associated with a weight of type R. The linkage type of an edge automatically defines the node types of its end points.

Figure 6.2: Graphical representation of the pathway2vec framework. Main components: (a) a multi-layer heterogeneous information network composed from MetaCyc, showing meta-level interactions among compounds, enzymes, and pathways, (b) four random walks, and (c) two representational learning models: the traditional Skip-Gram (top) and Skip-Gram normalized by domain types (bottom). In subfigure (a), the highlighted network neighbors of T1 (nitrifier denitrification) indicate this pathway interacts directly with T2 (nitrogen fixation I (ferredoxin)) and indirectly, by second order, with T3 (nitrate reduction I (denitrification)), with relationships to several compounds, including nitric oxide (C3) and nitrite (C4), converted by enzymes represented by the EC numbers (Z2: EC 1.7.2.6, Z3: EC 1.7.2.1, and Z4: EC 1.7.2.5). The black colored nodes in subfigure (b) indicate the current position of the walkers; red links suggest the next possible nodes to sample, while black links indicate the route taken by a walker to reach the current node. node2vec is parameterized by the local search s and in-out h hyper-parameters. These two hyper-parameters constitute a unit circle, i.e., h² + s² = 1, for RUST. M stores previously visited node types, which is 2 and only applied for JUST and RUST. c is the number of nodes of the same domain type as the current node, which is 3 and is associated with JUST. For metapath2vec, a walker requires a prespecified scheme, which is set to "ZCTCZ". The normalized Skip-Gram at the bottom of subfigure (c) is trained based on the domain type, in contrast to the traditional Skip-Gram model. More information related to both learning strategies is provided in Section 6.3.2.
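A meta-path scheme such as the "ZCTCZ" mentioned above can be used to constrain a walk in the spirit of metapath2vec. The sketch below is a hedged illustration: the adjacency dictionary, node names, and cyclic-scheme handling are assumptions, not the reference metapath2vec code.

```python
import random

def metapath_walk(adj, node_type, start, scheme="ZCTCZ", walk_len=8):
    """Random walk restricted to neighbors whose type matches the next
    symbol of the meta-path scheme; the scheme cycles since it ends on
    the same type it starts with."""
    assert node_type[start] == scheme[0]
    walk, pos = [start], 0
    while len(walk) < walk_len:
        pos = (pos + 1) % (len(scheme) - 1)   # "ZCTCZ" cycles as Z,C,T,C,Z,C,...
        nxt = [v for v in adj[walk[-1]] if node_type[v] == scheme[pos]]
        if not nxt:                           # dead end for this scheme
            break
        walk.append(random.choice(nxt))
    return walk

# Toy chain: Z1 - C1 - T1 - C2 - Z2 (types Z, C, T, C, Z)
adj = {"Z1": ["C1"], "C1": ["Z1", "T1"], "T1": ["C1", "C2"],
       "C2": ["T1", "Z2"], "Z2": ["C2"]}
types = {"Z1": "Z", "C1": "C", "T1": "T", "C2": "C", "Z2": "Z"}
```

Any walk produced on this toy graph visits node types in the order Z, C, T, C, Z, regardless of which matching neighbor is sampled.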
The graph articulated in this work is considered directed and weighted (in some cases), but for simplification it is converted to an undirected network by treating edges as symmetric links. Note that if |O| = |R| = 1, the network is homogeneous; otherwise, it is heterogeneous.

Example 6.1. MetaCyc (in Chapter 2.1) can be abstracted as a heterogeneous information network, as in Fig. 6.1(b), which contains 3 types of objects, namely compounds (C), enzymes (Z), and pathways (T). There exist different types of links between objects representing semantic relationships, e.g. "composed of" and "involved in" relationships between pathways and compounds, or relations between enzymes and compounds, e.g. "transform" and "transformed by". An enzyme may be mapped to a numerical category, known as an enzyme commission (EC) number, based on the chemical reaction it catalyzes.

Two objects within heterogeneous information networks describe meta-level relationships referred to as meta-paths [317].

Definition 6.2. Meta-Path. A meta-path P ∈ P is a path over G of the form O1 −R1→ O2 −R2→ ... −Rj→ Oj+1, which defines an aggregation of relationships U = R1 ◦ R2 ◦ ... ◦ Rj between types O1 and Oj+1, where ◦ denotes the composition operator on relationships, and Oi ∈ O and Rk ∈ R are object and relation types, respectively.

Example 6.2. MetaCyc contains multiple meta-paths conveying different semantics. For example, a meta-path "ZCZ" represents the co-catalyst relationship on a compound (C) between two enzymatic reactions (Z), and "ZCTCZ" may indicate a meta-path that requires two enzymatic reactions (Z) transforming two compounds (C) within a pathway (T). Another important meta-path to consider is "CZC", which implies the "C + Z ⇒ C" transformation relationship.

Metabolic Pathway Prediction

Given three inputs: i)- a heterogeneous information network G, ii)- a dataset S (Def.
3.9), and iii)- an optional set of meta-paths P, the goal is to automatically resolve node embeddings such that leveraging the features will effectively improve pathway prediction for a hitherto unseen instance x* ∈ R^r, where r corresponds to the number of enzymatic reactions (Chapter 3.2.2).

6.3 The pathway2vec Framework

The pathway2vec framework is a package composed of five modules: i)- node2vec [114], ii)- metapath2vec [81], iii)- metapath2vec++ [81], iv)- JUST [145], and v)- RUST (this work), where each module contains a random walk modeling step and a node representation step. A graphical representation of the pathway2vec framework is depicted in Fig. 6.2.

C1. Random Walks. In this step, a sequence of random walks over an input graph (whether heterogeneous or homogeneous) is generated based on the selected model (see Section 6.3.1).

C2. Learning Node Representation. The resulting walks are fed into the Skip-Gram model to learn node embeddings [81, 100, 114, 225]. An embedding is a low-dimensional latent continuous feature for each node in G, which encodes smooth decision boundaries between groups or communities within a graph. Details are provided in Section 6.3.2.

6.3.1 Random Walks

To capture meaningful graph relationships, existing techniques such as DeepWalk [255] design simple but effective algorithms based on random walks for representational learning of features. However, DeepWalk does not address in-depth and in-breadth graph exploration. Therefore, node2vec [114] was developed to traverse local and global graph structures based on the principles of: i)- homophily [97, 238], where interconnected nodes form a community of correlated attributes, and ii)- structural equivalence [133], where nodes having similar structural roles in a graph should be close to one another. node2vec simulates a second-order random walk, where the next node is sampled conditioned on the previous and the current node in a walk.
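A single biased transition of this second-order walk can be sketched as follows, using the local search and in-out parameters (called s and h in this thesis, p and q in the original node2vec paper). The function and toy graph are illustrative assumptions, not the pathway2vec implementation.

```python
import random

def node2vec_step(adj, prev, curr, s=1.0, h=1.0):
    """Sample the next node of a second-order random walk. Unnormalized
    node2vec-style weights: 1/s to return to the previous node, 1 for a
    common neighbor of prev and curr, and 1/h otherwise (s, h play the
    roles of node2vec's p and q)."""
    weights = []
    for cand in adj[curr]:
        if cand == prev:
            weights.append(1.0 / s)     # return to the previous node
        elif prev in adj[cand]:
            weights.append(1.0)         # stay within prev's neighborhood
        else:
            weights.append(1.0 / h)     # move outward (in-depth search)
    return random.choices(adj[curr], weights=weights, k=1)[0]

# Toy graph: a-b, a-c, b-c, b-d (d is not adjacent to a)
adj = {"a": ["b", "c"], "b": ["a", "c", "d"], "c": ["a", "b"], "d": ["b"]}
```

Making both s and h large pins the walk to the shared neighborhood of the previous and current nodes, while a small s makes returning to the previous node almost certain.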
For this, two hyper-parameters are adjusted: s ∈ R>0, which extracts local information of a graph, and h ∈ R>0, which enables local and global traversals by moving deep in a graph or walking within the vicinity of the current node. This method is illustrated in Fig. 6.2 (b) top.

First-order and second-order random walks were initially proposed for homogeneous graphs, but can be readily extended to heterogeneous information networks. Sun and colleagues [317] have observed that random walks can suffer from implicit bias due to initial node selection or the presence of a small set of dominant node types skewing results toward a subset of interconnected nodes. metapath2vec [81] was developed to resolve implicit bias in graph traversal and to characterize semantic associations embodied between different types of nodes according to a certain path definition. This method is illustrated in Fig. 6.2 (b) bottom.

metapath2vec overcomes the limitation of node2vec by enabling the extraction of semantic representations over a heterogeneous graph. However, the use of meta-paths requires prior domain-specific knowledge to recover semantic associations of a HIN according to a certain path definition. As a result, groups of vertices within the heterogeneous information network may not be visited, or may be revisited multiple times.

Figure 6.3: An illustrative example showing the selection of the next node for both JUST and RUST on a HIN extracted from MetaCyc. The walker is currently stationed at C3, arriving from node C2 (indicated by the black colored link), where M stores two previously visited node types and c (for JUST) holds 3 consecutive nodes that are of the same domain as C3. As can be seen, JUST would prefer selecting the next node of type pathway while RUST may prefer returning to C2 rather than jumping to T1 or T2, as indicated by red edges, because s < h, represented by an ellipsis glyph.

This limitation was partially addressed by leveraging multiple path schemes [100] to guide random walks based on a meta-path length parameter. Hussein and colleagues developed the Jump and Stay (JUST) heterogeneous graph embedding method using random walks [145] as an alternative to meta-paths. JUST randomly selects the next node in a walk, either of the same node type or of a different node type, using an exponential decay function and a tuning parameter based on two history records: i)- c, corresponding to the number of nodes consecutively visited in the same domain as the current node, and ii)- a queue M of size m storing the previously visited node types. This method is illustrated in Fig. 6.2 (b) second from top.

However, in order to balance the node distribution over multiple node types, JUST constrains the number of memorized domains m to be within the range [1, |O| − 1]. This can misrepresent graph structure in two ways: i)- explorations within a domain, because the last c consecutively visited nodes may enforce sampling from another domain, or ii)- jumping deep towards nodes from other domains, because M is constrained. To alleviate these problems we develop a novel random walk algorithm, RUST, adopting a unit-circle equation to sample node pairs, which generalizes previous representational learning methods, as illustrated in Fig. 6.2 (b) second from bottom. The two hyper-parameters s and h constitute a unit circle, i.e., h² + s² = 1, where h ∈ [0, 1] indicates how much exploration is needed within a domain while s ∈ [0, 1] defines the in-depth search towards other domains, such that s > h encourages the walk to explore more domains and vice versa.
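As an illustration of how the unit-circle hyper-parameters could steer such a walk, the sketch below weights same-domain neighbours by h and cross-domain neighbours by s = sqrt(1 − h²), skipping domains held in a bounded memory. This is a simplified stand-in for RUST (formally defined in Appendix D.1); the node names and tie-breaking rules are invented for illustration:

```python
import math
import random
from collections import deque

def rust_like_step(graph, node_type, curr, memory, h):
    """Illustrative RUST-like step (a sketch, not the thesis implementation):
    h and s lie on the unit circle (h**2 + s**2 == 1); same-domain neighbours
    are weighted by h, cross-domain neighbours by s, and domains held in the
    bounded memory are skipped when possible."""
    s = math.sqrt(1.0 - h * h)
    candidates, weights = [], []
    for nxt in sorted(graph[curr]):
        if node_type[nxt] == node_type[curr]:
            candidates.append(nxt); weights.append(h)   # stay within the domain
        elif node_type[nxt] not in memory:
            candidates.append(nxt); weights.append(s)   # jump to a fresh domain
    if not candidates:                                   # fall back to any neighbour
        candidates, weights = sorted(graph[curr]), None
    nxt = random.choices(candidates, weights=weights)[0]
    memory.append(node_type[nxt])                        # deque(maxlen=m) evicts old types
    return nxt

# Toy HIN: enzyme (E), compound (C), and pathway (T) nodes
random.seed(3)
G = {"E1": {"C1"}, "C1": {"E1", "C2", "T1"}, "C2": {"C1"}, "T1": {"C1"}}
types = {"E1": "E", "C1": "C", "C2": "C", "T1": "T"}
mem = deque(maxlen=2)
walk = ["C1"]
for _ in range(4):
    walk.append(rust_like_step(G, types, walk[-1], mem, h=0.84))
```

With h = 0.84 (so s ≈ 0.54, i.e. s < h), same-domain moves are favoured, mirroring the C3 → C2 preference in Fig. 6.3.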
Consequently, RUST blends both semantic associations and local/global structural information when generating walks, without restricting the domain size m in M.

To better illustrate the effect of s and h on RUST, consider the example in Fig. 6.3, where the walkers in JUST and RUST are currently stationed at C3, a node of compound type. While JUST forces its walker to jump towards the pathway domain, because of the combined effect of c, which holds three consecutive nodes of compound type, and M, which is currently storing the EC and compound types, RUST may prefer returning to C2 (no links exist to C4) rather than jumping to T1 or T2. This is because s < h, entailing more exploration within the same domain as C3. If, however, s > h, then RUST will perform an in-depth search by selecting a node of type pathway. For formal definitions of the discussed random walks, see Appendix Section D.1.

6.3.2 Learning Latent Embedding in Graph

Random walks W generated using node2vec, metapath2vec, JUST, and RUST are fed into the Skip-Gram model to learn node embeddings [225]. The Skip-Gram model exploits context information, defined as a fixed number of nodes surrounding a target node. The model attempts to maximize the co-occurrence probability among pairs of nodes identified within a given window of size q in W based on the log-likelihood:

\sum_{l \in W} \sum_{j \in l} \sum_{-q \le k \le q,\, k \neq 0} \log p(v_{j+k} \mid v_j)    (6.3.1)

where v_{j-q}, ..., v_{j+q} are the context neighbor nodes of node v_j and p(v_{j+k} | v_j) defines the conditional probability of having context nodes given the node v_j. The probability p(v_{j+k} | v_j) is the commonly used softmax function, i.e.,

p(v_{j+k} \mid v_j) = \frac{e^{D_{v_{j+k}} \cdot D_{v_j}}}{\sum_{i \in V} e^{D_{v_i} \cdot D_{v_j}}}

where D ∈ R^{|V| × d} stores the embeddings of all nodes, D_v is the v-th row corresponding to the embedding vector for node v, and d represents the embedding size. In practice, the vocabulary of nodes may be very large, which intensifies the computation of p(v_{j+k} | v_j).
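For intuition on this cost, the full softmax for a single target node can be computed as below; the sizes are toy values, and the denominator sum over all |V| embeddings is exactly what makes each update expensive:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 1000, 128                 # toy vocabulary of nodes and embedding size
D = rng.normal(size=(V, d))      # embedding matrix D, one row per node

def softmax_prob(context, target):
    """p(v_context | v_target) with the full softmax: the denominator sums
    over every node in V, so each evaluation costs O(|V| * d)."""
    scores = D @ D[target]       # dot products with all |V| embeddings
    scores -= scores.max()       # subtract max for numerical stability
    expd = np.exp(scores)
    return expd[context] / expd.sum()

p = np.array([softmax_prob(j, 0) for j in range(V)])
```

Summing `p` over the whole vocabulary returns 1, as a softmax distribution must.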
The Skip-Gram model uses negative sampling, which randomly selects a small set of nodes N that are not in the context, to reduce computational complexity. This idea, represented in the updated Eq. 6.3.1, is implemented in node2vec, metapath2vec, JUST, and RUST according to:

\sum_{l \in W} \sum_{j \in l} \sum_{-q \le k \le q,\, k \neq 0} \Big( \log \sigma(D_{v_{j+k}} \cdot D_{v_j}) + \sum_{u \in N \wedge u \notin N(j)} \mathbb{E}_{v_u}[\log p(v_u \mid v_j)] \Big)    (6.3.2)

where \sigma(v) = \frac{1}{1 + e^{-v}} is the sigmoid function.

In addition to the equation above, Dong and colleagues proposed a normalized version of metapath2vec, called metapath2vec++, where the domain type of the context node is considered in calculating the probability p(v_{j+k} | v_j), resulting in the following objective formula:

\sum_{l \in W} \sum_{j \in l} \sum_{-q \le k \le q,\, k \neq 0} \Big( \log \sigma(D_{v_{j+k}} \cdot D_{v_j}) + \sum_{u \in N \wedge u \notin N(j) \wedge \phi(v_u) = \phi(v_{j+k})} \mathbb{E}_{v_u}[\log p(v_u \mid v_j)] \Big)    (6.3.3)

where \phi(v_u) = \phi(v_{j+k}) indicates that the negative nodes are of the same type as the context node v_{j+k}. The above formula is also applied for RUST, and we refer to it as RUST-norm.

Figure 6.4: Parameter sensitivity of RUST based on the NMI metric. (a) Number of memorized domains m vs. different values of h; (b) number of dimensions d (h = 0.55, m = 3); (c) neighborhood size q (h = 0.55, m = 3).

Through iterative updates over all the context nodes, whether using Eq. 6.3.2 or Eq. 6.3.3, for each walk in W, the learned features are expected to capture the semantic and structural contents of a graph, and can thereby be made useful for pathway inference.

6.4 Predicting Pathways

For pathway inference, the learned EC embedding vectors are concatenated into each example i according to:

\tilde{x}^{(i)} = x^{(i)} \oplus \frac{1}{r} x^{(i)} D_{v : v \in Z}    (6.4.1)

where ⊕ denotes the vector concatenation operation and D_{v:v∈Z} indicates the feature vectors for the r enzymatic reactions.
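The concatenation of Eq. 6.4.1 can be sketched directly; the array sizes and variable names below are illustrative:

```python
import numpy as np

def concat_features(x, D_reactions):
    """Sketch of Eq. 6.4.1: append to the r-dimensional enzyme vector x the
    scaled product (1/r) x D of the r reaction embedding rows, giving an
    (r + d)-dimensional feature vector x_tilde."""
    r = x.shape[0]
    return np.concatenate([x, (x @ D_reactions) / r])

rng = np.random.default_rng(1)
r, d = 6, 4                                   # toy sizes; the thesis uses d = 128
x = rng.integers(0, 3, size=r).astype(float)  # toy enzymatic reaction counts
D_reactions = rng.normal(size=(r, d))         # one learned embedding row per reaction
x_tilde = concat_features(x, D_reactions)
```

The first r entries of `x_tilde` are the original enzyme features; the remaining d entries summarize the learned embeddings.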
By incorporating enzymatic reaction features into x^{(i)}, the dimension size is extended to r + d, where r is the enzyme vector size and d corresponds to the embedding size. This modified version of x^{(i)} is denoted by \tilde{x}^{(i)}, which can then be used by an appropriate machine learning algorithm, such as mlLGPR (in Chapter 5), to train on and infer a set of metabolic pathways from enzymatic reactions.

6.5 Experimental Setup

In this section, we explain the experimental settings and outline the materials used to evaluate the performance of the pathway2vec modules, which were written in Python v3 and trained using TensorFlow v1.10 [1]. Unless otherwise specified, all tests were conducted on a Linux server using 10 cores of an Intel Xeon CPU E5-2650.

6.5.1 Preprocessing MetaCyc

We constructed a three-layer HIN using MetaCyc v21 [51], according to: EC (bottom layer), compound (mid layer), and pathway (top layer), as in Fig. 6.2(a). Relationships among these layers establish inter-interactions and betweenness interactions. Three inter-interactions were built: i)- EC interactions, collected based on shared metabolites, e.g., if a compound is engaged in two ECs then the two ECs were considered connected; ii)- compound interactions, processed based on shared reactions, e.g., if any two compounds constitute the substrate and product of an engaged enzymatic reaction, they would be linked; and iii)- pathway interactions, constructed based on shared metabolites, e.g., if any product of one pathway is consumed by another, then these two pathways were linked.
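As a toy illustration of the first rule, EC–EC links can be derived from shared metabolites as follows (the small reaction table is invented for illustration and is not MetaCyc content):

```python
from itertools import combinations

# Toy reaction table: EC number -> metabolites it consumes or produces
reactions = {
    "EC-1.1.1.1": {"ethanol", "NAD+", "acetaldehyde", "NADH"},
    "EC-1.2.1.3": {"acetaldehyde", "NAD+", "acetate", "NADH"},
    "EC-2.7.1.1": {"glucose", "ATP", "glucose-6-phosphate", "ADP"},
}

# One EC-EC link per pair of ECs sharing at least one compound,
# following the shared-metabolite rule described in the text.
edges = {tuple(sorted((a, b)))
         for a, b in combinations(reactions, 2)
         if reactions[a] & reactions[b]}
```

Here the two dehydrogenases share acetaldehyde (and the NAD+/NADH pair), so they are linked, while the kinase shares nothing and stays unconnected.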
With regard to betweenness interactions, we considered two forms: i)- EC-compound interactions, where if any enzyme (represented by an EC number) engages any compound, then nodes of both types were linked, and ii)- compound-pathway interactions, where if any compound is involved in any pathway, then those nodes were considered related. After building the multi-layer HIN, we apply different configurations (MetaCyc, MetaCyc (r), MetaCyc (uec), and MetaCyc (uec+r)), as summarized in Table 4.4, to explore the relationship between the different graph types and the quality of the generated walks and embeddings.

6.5.2 Parameter Settings

Parameterization for the other random walk methods can be found in [81, 114, 145]. For training, we randomly initialized model parameters with a truncated Gaussian distribution, and set the learning rate to 0.01, the batch size to 100, and the number of epochs to 10. Unless otherwise indicated, for each module, the number of sampled path instances is K = 100, the walk length is l = 100, the embedding dimension size is d = 128, the neighborhood size is 5, the size of negative samples is 5, and the number of memorized domains m for JUST and RUST is 2 and 3, respectively. The explore and in-out hyperparameters for node2vec and RUST are h = 0.7 (or h = 0.55) and s = 0.7 (or s = 0.84), respectively, using the uec configuration. For metapath2vec and metapath2vec++, we applied the meta-path scheme “ZCTCZ” to guide random walks. For brevity, we denote node2vec, metapath2vec, metapath2vec++, JUST, RUST, and RUST-norm as n2v, m2v, cm2v, jt, rt, and crt, respectively.

6.6 Experimental Results and Discussion

In this section, we first evaluate the parameter sensitivity of RUST prior to benchmarking the four random walk algorithms, jointly with the two learning methods, based on node clustering, embedding visualization, and pathway prediction.

6.6.1 Parameter Sensitivity of RUST

Experimental setup.
In this section, the effect of different hyperparameter settings in RUST on the quality of the learned node embeddings is described. Since the hyperparameter space involved in RUST is infinite, exhaustive searches for optimal settings are prohibitive. Therefore, settings were sub-selected to determine RUST performance. Specifically, the effects of the dimension d ∈ {30, 50, 80, 100, 128, 150}, the neighborhood size q ∈ {3, 5, 7, 9}, the memorized domains m ∈ {3, 5, 7}, and the two hyperparameters s and h (∈ {0.55, 0.71, 0.84}) were evaluated based on Normalized Mutual Information (NMI) scores, after 10 trials. The NMI produces scores between 0, indicating no mutual information exists, and 1, indicating node clusters (feature groups) are perfectly correlated based on class information: enzyme, compound, and pathway. Clustering was performed using the k-means algorithm [16] to group data based on the learned representations from RUST, as described in [81, 145]. Random walks W were generated using MetaCyc with the uec option for the RUST test parameters.

Experimental results. Fig 6.4a indicates that RUST performance tends to saturate when the memorized domains are concentrated around m = 5 and h = 0.55, indicating a preference to explore more domain types. By fixing m = 3 and h = 0.55, the optimal NMI scores with respect to the embedding dimensionality were found at d = 80 and d = 128 (Fig. 6.4b); beyond these values RUST performance deteriorated. A similar trend was observed when the context neighborhood size was increased beyond q > 5 (Fig 6.4c). Based on these observations, the settings m = 3, h = 0.55, d = 80 or d = 128, and q = 5 provide the most efficient and accurate clustering outcomes using MetaCyc with the uec option.
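The evaluation loop just described (k-means on the learned vectors, scored by NMI against the node classes) can be sketched with scikit-learn; the embeddings below are a synthetic stand-in for the RUST output:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import normalized_mutual_info_score

rng = np.random.default_rng(0)
# Toy stand-in for learned node embeddings: three noisy clusters, one per
# node class (enzyme / compound / pathway); real runs use RUST embeddings.
centers = rng.normal(scale=5.0, size=(3, 16))
labels = np.repeat([0, 1, 2], 50)          # ground-truth node classes
X = centers[labels] + rng.normal(size=(150, 16))

pred = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
nmi = normalized_mutual_info_score(labels, pred)  # 0 = no agreement, 1 = perfect
```

On this well-separated toy data the NMI is high; on the real MetaCyc HIN the scores are far lower (Fig. 6.4), reflecting how hard the node classes are to recover.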
For comparative purposes, we set d = 128.

Figure 6.5: Node clustering results based on the NMI metric using MetaCyc data, shown for four configurations: (a) MetaCyc, (b) MetaCyc r, (c) MetaCyc uec, and (d) MetaCyc uec + r. n2v: node2vec, m2v: metapath2vec, jt: JUST, rt: RUST; r: reduced content of MetaCyc based on trimming nodes below 2 links; uec: links among enzymatic reactions are removed in MetaCyc; and uec + r: combination of unconnected enzymatic reactions and trimmed nodes in MetaCyc.

6.6.2 Node Clustering

Experimental setup. The performance of the different random walk methods was tested in relation to node clustering using NMI after 10 trials and the hyperparameters described above on all MetaCyc graph types depicted in Table 4.1. Clustering was performed using the k-means algorithm to group homogeneous nodes based on the embeddings learned by each method.

Experimental results. Fig 6.5 shows node clustering results for node2vec, metapath2vec, JUST, and RUST.

Figure 6.6: Node clustering results of metapath2vec++ (cm2v) and RUST-norm (crt) based on the NMI metric using MetaCyc data (full, r, uec, and uec + r configurations).

node2vec, JUST, and RUST exhibited similar performance across all configurations, indicating that these methods are less likely to extract semantic knowledge, characterizing node domains, from MetaCyc. However, RUST performed better than node2vec and JUST in learning representations. In the case of metapath2vec, the random walk follows a predefined meta-path scheme, capturing the necessary relational knowledge for defining node types. For example, nitrogenase (EC-1.18.6.1), which reduces nitrogen gas into ammonium, is exclusively linked to the nitrogen fixation I (ferredoxin) pathway [84].
Without a predefined relation, a walker may explore more of the local/global structure of G and hence become less efficient in exploiting relations between these two nodes. Among the four walks, only metapath2vec is able to accurately group those nodes according to their classes. Despite its advantages, metapath2vec is biased towards a scheme, as described in [145], which is explicitly observed for the case of “uec+r” (Fig 6.5d). Under these conditions, both isolated nodes and links among ECs are discarded, resulting in a reduced number of nodes that are more easily traversed by a meta-path walker. metapath2vec++ exhibited trends similar to metapath2vec because they share the same walks. However, metapath2vec++ is trained using the normalized Skip-Gram. Therefore, it is expected to achieve good NMI scores, yielding over 0.41 on the uec+full content (in Fig. 6.6), which is also similar to the RUST-norm NMI score (∼ 0.38). This is interesting because RUST-norm employs RUST-based walks but the embeddings are learned using the normalized Skip-Gram.

Taken together, these results indicate that node2vec-, JUST-, and RUST-based walks are effective for analyzing graph structure while metapath2vec can learn good embeddings. However, RUST strikes a balance between the two properties through proper adjustment of m and the two unit-circle hyperparameters. Regarding the MetaCyc type, we recommend “uec” because the associations among ECs are captured at the pathway level. The trimmed graph is contraindicated because it eliminates many isolated, but important, pathways and ECs.

6.6.3 Manifold Visualization

Experimental setup. In this section, learned high-dimensional embeddings are visualized by projecting them onto a two-dimensional space using two case studies.
The first case examines the quality of the learned node embeddings according to the generated random walks, an approach commonly taken in most graph embedding techniques [114, 337]. We posit that a good representational learning method defines clear boundaries for nodes of the same type. For illustrative purposes, nodes corresponding to nitrogen metabolism were selected. The second case examines the limitations of meta-path based random walks, extending our discussion in Section 6.6.2. For illustrative purposes, we focus on the pathway layer in Fig 6.2a and consider the representation of pathways having no enzymatic reactions. For visualization, we use UMAP (uniform manifold approximation and projection) [220] with 1000 epochs and the remaining settings set to default values.

Experimental results. Fig 6.7 visualizes 2D UMAP projections of the 128-dimension embeddings, trained under the uec+full setting, depicting 185 nodes related to nitrogen metabolism in MetaCyc. Each point denotes a node in the HIN and each color indicates the node type. node2vec (Fig 6.7a), JUST (Fig. 6.7c), and RUST (Fig. 6.7d) appear to be less than optimal in extracting walks that preserve three-layer relational knowledge, e.g., nodes belonging to different types form unclear boundaries and diffuse clusters. In the cases of metapath2vec (Fig. 6.7b), metapath2vec++ (Fig. 6.7e), and RUST-norm (Fig. 6.7f), nodes of the same color are more optimally portrayed. In the second case, 80 pathways having no enzymatic reactions were identified, together with their 109 pathway neighbors, as shown in Fig. 6.8a. From Fig. 6.8, we observe that, in contrast to node2vec, JUST, RUST, and RUST-norm, pathway nodes are skewed incorrectly in metapath2vec and (to a lesser degree) metapath2vec++. This demonstrates the rigidness of meta-path based methods, which follow a defined scheme that limits their capacity to exploit local structure in learning embeddings.
Interestingly, RUST-norm, based on RUST walks, is the only method that combines structural and semantic information, as indicated in Fig. 6.8g and Fig. 6.7f, respectively. Taken together, these results indicate that RUST-based walks, with training using Eq. 6.3.3, provide efficient embeddings, consistent with the node clustering observations.

Figure 6.7: 2D UMAP projections of the 128-dimension embeddings, trained under the uec+full setting, depicting 185 nodes related to nitrogen metabolism; panels: (a) n2v, (b) m2v, (c) jt, (d) rt, (e) cm2v, and (f) crt. Node color indicates the category of the node type, where red indicates enzymatic reactions, green indicates compounds, and blue is reserved for metabolic pathways. n2v: node2vec, m2v: metapath2vec, jt: JUST, rt: RUST, cm2v: metapath2vec++, and crt: RUST-norm.

Figure 6.8: 2D UMAP projections of 80 pathways that have no enzymatic reactions, indicated by the blue color, with 109 corresponding pathway neighbors, represented by the grey color; panels: (a) true pathways, (b) n2v, (c) m2v, (d) jt, (e) rt, (f) cm2v, and (g) crt. n2v: node2vec, m2v: metapath2vec, jt: JUST, rt: RUST, cm2v: metapath2vec++, and crt: RUST-norm.

6.6.4 Metabolic Pathway Prediction

Experimental setup. In this section, the effectiveness of the learned embeddings from the pathway2vec modules is determined across the different pathway inference methods (in Chapter 4.3) and mlLGPR-elastic net (EN) (in Chapter 5) on T1 golden datasets (in Chapter 4.2.1) using the settings and metrics described above. In contrast to previous multi-label classification methods [114, 145, 255], where the goal is to predict the most probable label set for nodes, we leverage the learned vectors and the multi-label dataset according to Eq. 6.4.1. Pathway prediction with mlLGPR-EN used the default hyperparameter settings, after concatenating features from each learning method, to train on BioCyc (in Chapter 4.2.2).
Results are reported on the T1 golden datasets, including EcoCyc, HumanCyc, AraCyc, YeastCyc, LeishCyc, and TrypanoCyc, using the four evaluation metrics in Chapter 4.4.1.

Experimental results. Table 6.1 shows micro F1 scores for each pathway predictor. Numbers in boldface represent the best performance score in each column while underlined text indicates the best performance among the embedding methods. From the results, it is evident that all variations of the embedding methods perform consistently better than MinPath across the four T1 golden datasets (EcoCyc, YeastCyc, LeishCyc, and TrypanoCyc). With the exception of EcoCyc, the performance of the embeddings resulted in less optimal micro F1 scores than PathoLogic or mlLGPR. In the case of mlLGPR, embeddings were trained on fewer than 1470 pathways, potentially obscuring the actual benefits of the learned features. Taken together, the different pathway2vec modules performed similarly to one another, indicating that embeddings are potential alternatives to the pathway and reaction evidence features used in mlLGPR.

6.7 Summary

We have developed the pathway2vec package for learning features relevant to metabolic pathway prediction from genomic sequence information. The software package consists of six representational learning modules used to automatically generate features for pathway inference. Metabolic feature representations were decomposed into three interacting layers: compounds, enzymes, and pathways, where each layer consists of associated nodes, resulting in a multi-layer heterogeneous information network of metabolic interactions within and between layers. A Skip-Gram model was applied to extract embeddings for each node, encoding smooth decision boundaries between groups of nodes in the graph.
Three extensive empirical studies were conducted to benchmark pathway2vec, indicating that the representational learning approach is a promising adjunct or alternative to feature engineering based on manual curation. At the same time, we introduced RUST, a novel and flexible random walk method that uses unit-circle and domain size hyperparameters to exploit local/global structure while absorbing semantic information from both homogeneous and heterogeneous graphs.

Table 6.1: Predictive performance of each comparing algorithm on 6 benchmark golden T1 datasets. For each performance metric, ‘↓’ indicates a smaller score is better while ‘↑’ indicates a higher score is better.

Hamming Loss ↓
Method        EcoCyc  HumanCyc  AraCyc  YeastCyc  LeishCyc  TrypanoCyc
PathoLogic    0.0610  0.0633    0.1188  0.0424    0.0368    0.0424
MinPath       0.2257  0.2530    0.3266  0.2482    0.1615    0.2561
mlLGPR        0.0804  0.0633    0.1069  0.0550    0.0380    0.0590
mlLGPR+n2v    0.0558  0.1021    0.1706  0.0768    0.0424    0.0883
mlLGPR+m2v    0.0558  0.0998    0.1742  0.0740    0.0412    0.0926
mlLGPR+cm2v   0.0586  0.1041    0.1742  0.0744    0.0420    0.0867
mlLGPR+jt     0.0550  0.1041    0.1738  0.0724    0.0459    0.0895
mlLGPR+rt     0.0554  0.0990    0.1746  0.0752    0.0428    0.0855
mlLGPR+crt    0.0542  0.1017    0.1615  0.0760    0.0439    0.0855

Micro Precision Score ↑
Method        EcoCyc  HumanCyc  AraCyc  YeastCyc  LeishCyc  TrypanoCyc
PathoLogic    0.7230  0.6695    0.7011  0.7194    0.4803    0.5480
MinPath       0.3490  0.3004    0.3806  0.2675    0.1758    0.2129
mlLGPR        0.6187  0.6686    0.7372  0.6480    0.4731    0.5455
mlLGPR+n2v    0.7923  0.5745    0.6965  0.6446    0.4153    0.3974
mlLGPR+m2v    0.7862  0.6015    0.6786  0.6750    0.4261    0.3745
mlLGPR+cm2v   0.7770  0.5556    0.6620  0.6723    0.4159    0.4076
mlLGPR+jt     0.7979  0.5556    0.6732  0.6949    0.3840    0.3924
mlLGPR+rt     0.7889  0.6014    0.6635  0.6560    0.4146    0.4113
mlLGPR+crt    0.7993  0.5873    0.7898  0.6581    0.3983    0.4105

Micro Recall Score ↑
Method        EcoCyc  HumanCyc  AraCyc  YeastCyc  LeishCyc  TrypanoCyc
PathoLogic    0.8078  0.8423    0.7176  0.8734    0.8391    0.7829
MinPath       0.9902  0.9713    0.9843  1.0000    1.0000    1.0000
mlLGPR        0.8827  0.8459    0.7314  0.8603    0.9080    0.8914
mlLGPR+n2v    0.7329  0.2903    0.2745  0.3406    0.5632    0.5314
mlLGPR+m2v    0.7427  0.2867    0.2608  0.3537    0.5632    0.5029
mlLGPR+cm2v   0.7264  0.2867    0.2804  0.3493    0.5402    0.5543
mlLGPR+jt     0.7329  0.2867    0.2706  0.3581    0.5517    0.5314
mlLGPR+rt     0.7427  0.3082    0.2745  0.3581    0.5862    0.5429
mlLGPR+crt    0.7394  0.2652    0.2725  0.3362    0.5402    0.5371

Micro F1 Score ↑
Method        EcoCyc  HumanCyc  AraCyc  YeastCyc  LeishCyc  TrypanoCyc
PathoLogic    0.7631  0.7460    0.7093  0.7890    0.6109    0.6447
MinPath       0.5161  0.4589    0.5489  0.4221    0.2990    0.3511
mlLGPR        0.7275  0.7468    0.7343  0.7392    0.6220    0.6768
mlLGPR+n2v    0.7614  0.3857    0.3938  0.4457    0.4780    0.4548
mlLGPR+m2v    0.7638  0.3883    0.3768  0.4642    0.4851    0.4293
mlLGPR+cm2v   0.7508  0.3783    0.3939  0.4598    0.4700    0.4697
mlLGPR+jt     0.7640  0.3783    0.3860  0.4726    0.4528    0.4515
mlLGPR+rt     0.7651  0.4076    0.3883  0.4633    0.4857    0.4680
mlLGPR+crt    0.7682  0.3654    0.4052  0.4451    0.4585    0.4653

Looking forward, we intend to leverage embeddings and graph structure on more complex community-level metabolic pathway prediction problems, which will be discussed in the following chapters.

Chapter 7

Incorporating Triple NMF with Community Detection to Metabolic Pathway Inference

“Networking is not collecting contacts! Networking is about planting relations.” – MiSha

As we discussed in Chapter 6, machine learning provides a probabilistic framework for metabolic pathway inference; however, several challenges, including pathway feature engineering, multiple mappings of enzymatic reactions, and emergent or distributed metabolism within populations or communities of cells, can limit prediction performance. Here, we present triUMPF, triple non-negative matrix factorization (NMF) with community detection for metabolic pathway inference, which combines three stages of NMF to capture relationships between enzymes and pathways (using embeddings from pathway2vec) within a network, followed by community detection to extract higher-order structure based on the clustering of vertices sharing similar statistical properties.
We evaluated triUMPF performance using the datasets presented in Chapter 4.2. The resulting performance metrics equaled or exceeded those of other prediction methods on organismal genomes, with improved prediction outcomes on multi-organism datasets.

7.1 Introduction

Pathway reconstruction from genomic sequence information is an essential step in describing the metabolic potential of cells at the individual, population, and community levels of biological organization [125, 176, 214]. The resulting pathway representations provide a foundation for defining regulatory processes, modeling metabolite flux, and engineering cells and cellular consortia for defined process outcomes [122, 244]. In Chapter 5, we introduced mlLGPR, which performed effectively on organismal genomes; however, pathway prediction outcomes for multi-organismal datasets were less optimal, due in part to missing or noisy feature information. In an effort to grapple with this problem, pathway2vec was introduced to learn a neural embedding-based low-dimensional space of metabolic features based on a three-layered network architecture consisting of compounds, enzymes, and pathways (see Chapter 6). Based on several experiments, the learned feature vectors motivated the use of multi-layer networks for organismal and multi-organismal genomes.

This chapter describes triUMPF, which combines three stages of NMF to capture relationships between enzymes and pathways within a network [101], followed by community detection to extract higher-order network structure [98]. NMF is a data reduction and exploration method in which the original and factorized matrices have the property of non-negative elements with reduced ranks or features [101, 107]. In contrast to other dimension reduction methods, such as principal component analysis [45], NMF both reduces the number of features and preserves the information needed to reconstruct the original data [369].
This has important implications for noise-robust feature extraction from sparse matrices, including datasets associated with gene expression analysis and pathway prediction [369].

For pathway prediction, triUMPF uses three graphs: one representing associations between pathways and enzymes indicated by enzyme commission (EC) numbers [20], one representing interactions between enzymes, and another representing interactions between pathways. The two interaction graphs adopt the subnetwork or community concept (in Chapter 3.2.1). Community detection is performed on both interaction graphs to identify subnetworks, as shown in Fig. 7.1A, where a pathway network, extracted from MetaCyc, is represented as interactions among pathways. The detected pathway communities are illustrated in Fig. 7.1B. Similarly, enzyme interactions are used to create the enzyme network, which is used to detect enzyme communities.

We evaluated triUMPF’s parameter sensitivity, robustness, and prediction performance in relation to other inference methods, including PathoLogic, MinPath, and mlLGPR, on the datasets in Chapter 4.2 and on Escherichia coli strains. The resulting performance metrics exceeded those of other prediction methods on multiple benchmark datasets, with improved prediction outcomes.

7.2 Problem Formulation

Here, we state the problem discussed in this chapter.

Figure 7.1: The set of complete metabolic pathways extracted from MetaCyc (A) and their discovered communities (B). Zoomed-in regions of the pathway-pathway and community-community interactions, C and D respectively. Nodes are metabolic pathways or communities for A,C and B,D respectively. Edges correspond to the number of shared enzymatic reactions or shared pathways for the pathway and community nodes, respectively.

Metabolic Pathway Prediction

Given: i)- a Pathway-EC matrix M (Def. 3.4), ii)- a Pathway-Pathway interaction matrix A (Def. 3.5), iii)- an EC-EC interaction matrix B (Def. 3.6), and iv)- a dataset S (Def. 3.9), the goal is to efficiently reconstruct pathway labels for a hitherto unseen instance x∗ ∈ R^r, where r corresponds to the number of enzymatic reactions (Chapter 3.2.2).

Figure 7.2: A workflow diagram showing the proposed triUMPF method. The model takes two graph topologies, corresponding to the Pathway-Pathway and EC-EC interactions, and a dataset to detect pathway and EC communities while simultaneously decomposing the Pathway-EC association information to produce a constrained low-rank matrix. Afterwards, a set of pathways is detected from a newly annotated genome or metagenome.

7.3 The triUMPF Method

In this section, we provide a description of the triUMPF components presented in Fig. 7.2, including: i)- decomposing the pathway-EC association matrix, ii)- subnetwork or community reconstruction, and iii)- the multi-label learning process.

7.3.1 Decomposing the Pathway-EC Association Matrix

Inspired by the idea of NMF, we decompose the P2E association matrix to recover low-dimensional latent factor matrices [101]. Unlike previous applications of NMF to biological datasets [234], triUMPF incorporates learned embeddings into the matrix decomposition process. Formally, given the non-negative M, standard NMF decomposes the matrix into two low-rank matrices, i.e., M ≈ WH^⊤, where W ∈ R^{t×k} stores the latent factors for pathways while H ∈ R^{r×k}, known as the basis matrix, can be thought of as latent factors associated with ECs, and k ≪ t, r. We extend standard NMF by incorporating two constraints: i)- interactions within ECs or pathways and ii)- interactions between pathways and ECs. For this, we apply the pathway2vec framework (discussed in Chapter 6) to extract features, in the form of continuous vectors, for each EC and pathway while incorporating interaction constraints.
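For intuition, the standard decomposition M ≈ WH⊤ just described can be sketched with the classic Lee-Seung multiplicative updates on a toy sparse matrix; the regularized triUMPF objective adds feature and regularization terms beyond these plain update rules:

```python
import numpy as np

rng = np.random.default_rng(0)
t, r, k = 30, 40, 5                  # toy sizes; the thesis uses k = 100
# Sparse non-negative toy association matrix (about 20% non-zero entries)
M = rng.random((t, r)) * (rng.random((t, r)) < 0.2)

W = rng.random((t, k)) + 1e-3        # pathway latent factors
H = rng.random((r, k)) + 1e-3        # EC latent factors (basis matrix)
for _ in range(200):
    # Classic multiplicative updates for min ||M - W H^T||_F^2 (Lee & Seung);
    # each step keeps W and H non-negative and decreases the objective.
    W *= (M @ H) / (W @ H.T @ H + 1e-9)
    H *= (M.T @ W) / (H @ W.T @ W + 1e-9)

err0 = np.linalg.norm(M)             # error of the all-zero reconstruction
err = np.linalg.norm(M - W @ H.T)    # error after fitting the rank-k model
```

The updates never subtract, so non-negativity is preserved automatically, which is why this family of rules is the usual starting point for NMF-style objectives.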
This set of features can then be used to obtain the following minimization objective function:

J_{\text{fact}}(W, H, U, V) = \min_{W,H,U,V} \|M - WH^\top\|_F^2 + \lambda_1 \|W - PU\|_F^2 + \lambda_2 \|H - EV\|_F^2 + \lambda_3 \|U - V\|_F^2 + \lambda_4 (\|W\|_F^2 + \|H\|_F^2 + \|U\|_F^2 + \|V\|_F^2), \quad \text{s.t. } \{W, H, U, V\} \ge 0    (7.3.1)

where the λ∗ are regularization hyperparameters. The leftmost term is the well-known squared loss function that penalizes the deviation of the estimated entries in both W and H from the true association matrix M. The second term corresponds to the relative difference of the latent matrix W from the pathway features P ∈ R^{t×m}, learned using the pathway2vec framework, where the matrix U ∈ R^{m×k} absorbs the different scales of the matrices W and P. Similarly, the third term indicates the squared loss of H from E ∈ R^{r×m}, which denotes the feature matrix of ECs, with their differences captured by V ∈ R^{m×k}. In the fourth term, we minimize the differences between the factors U and V, capturing the shared prominent features of the low-dimensional coefficients.

7.3.2 Subnetwork or Community Reconstruction

Graph abstraction is a process of reducing a set of linked nodes into a more compact form, such as isolating densely connected nodes that possess similar properties or functions. The task of discovering distinct groups of nodes is known as the community detection problem [54, 98, 298]. Motivated by this work, we use community detection to guide the learning process for pathways on the two adjacency matrices A and B, indicating P2P and E2E associations, respectively. For example, Fig. 7.1 shows 90 communities in the pathway network, where the intra-group nodes within a community interact with each other more frequently than with those outside the group.

The two matrices A and B represent first-order proximity, capturing pairwise proximity among their related vertices [341]. However, as discussed in [198, 277], first-order proximity is inadequate to fully characterize distant relationships among pathways or ECs.
As such, higher-order (in particular second- and third-order) proximity is pursued, which can be obtained using the formula [198]:

A_{\text{prox}} = \sum_{i=1}^{l_p} \omega_i A^i, \qquad B_{\text{prox}} = \sum_{i=1}^{l_e} \gamma_i B^i    (7.3.2)

where A_prox and B_prox are polynomials of order l_p and l_e, respectively, and ω and γ are the weights associated with each term. Using these higher-order matrices, we again invoke NMF to recover communities.

Formally, let T ∈ R^{m×p} be a non-negative community representation matrix of p communities for pathways, where the j-th column T_{:,j} denotes the representation of community j. The pathway community indicator matrix is denoted by C ∈ R^{t×p}, conditioned on tr(C^⊤C) = t, where the entries C_{i,l} and C_{j,l} encode the probability that pathways i and j generate an edge belonging to a community l. The probability of i and j belonging to the same community can be assessed as A_{\text{prox}, i,j} = (P_i C_{:,l} T^\top_{l,i})^\top (P_j C_{:,l} T^\top_{l,j}). A similar discussion follows for the non-negative representation matrix R ∈ R^{m×v} and the EC community indicator matrix K ∈ R^{r×v} of v communities, conditioned on tr(K^⊤K) = r. Unfortunately, due to the constraints imposed on C and K, it is not straightforward to derive an analytical expression; instead, we resort to the more tractable solution provided in [341] and relax the condition to an orthogonality constraint, resulting in the following objective function:

J_{\text{comm}}(C, K) = \min_{C,K} \|A_{\text{prox}} - PTC^\top\|_F^2 + \|B_{\text{prox}} - ERK^\top\|_F^2 + \alpha \|C^\top C - I\|_F^2 + \beta \|K^\top K - I\|_F^2 + \lambda_5 (\|C\|_F^2 + \|K\|_F^2), \quad \text{s.t. } \{C, K\} \ge 0    (7.3.3)

where I denotes an identity matrix, λ₅ is a regularization hyperparameter, and α and β are both positive hyperparameters. The values of these hyperparameters are usually set to a large number, e.g., 10⁹ in this work, to adjust the contribution of the corresponding terms.
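Computing these higher-order proximities amounts to a weighted sum of matrix powers, e.g.:

```python
import numpy as np

def higher_order_proximity(A, weights):
    """Eq. 7.3.2 as code: a weighted sum of matrix powers A^1 ... A^l,
    so that entry (i, j) also reflects 2- and 3-step connectivity."""
    A = np.asarray(A, dtype=float)
    out = np.zeros_like(A)
    power = np.eye(A.shape[0])
    for w in weights:                 # weights = [w_1, ..., w_l]
        power = power @ A             # next matrix power A^i
        out += w * power
    return out

# Path graph 0-1-2: no direct 0-2 edge, but second-order proximity links them
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]])
A_prox = higher_order_proximity(A, weights=[0.1, 0.1, 0.1])  # l_p = 3, toy weights
```

On this path graph the (0, 2) entry of `A_prox` becomes positive through the two-step path 0-1-2, even though the nodes are not directly connected.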
The obtained communities in Eq. 7.3.3 are directly linked to the underlying graph topologies, i.e., A^{prox} and B^{prox}.

7.3.3 Multi-label Learning Process

We now bring together the NMF and community detection steps with multi-label classification for pathway prediction. The learning problem must obey the rules mandated by M while being lenient towards the dataset S, which should provide enough evidence to generate representations of communities among pathways and ECs, as suggested by A^{prox} and B^{prox}. We introduce a weight term Θ ∈ R^{t×r} that enforces X to be close to both Y and M. We also introduce two auxiliary terms: L ∈ R^{n×m}, which captures correlations between X and Y, and Z ∈ R^{r×r}, which enforces the pathway coefficients associated with M, resulting in the following objective function:

J^{path}(T, R, Θ, L, Z) = \min_{T,R,Θ,L,Z} \sum_{i=1}^{n} \sum_{k=1}^{t} \log(1 + e^{-y_k^{(i)} \Theta_k^\top x^{(i)}}) + \|X - LRK^\top\|_F^2 + \|Y - LTC^\top\|_F^2 + \rho \|\Theta - ZHW^\top\|_F^2 + \lambda_5 (\|T\|_F^2 + \|R\|_F^2) + \lambda_6 (\|\Theta\|_{2,1} + \|L\|_F^2 + \|Z\|_F^2),
s.t. \{T, R\} \geq 0        (7.3.4)

where λ5, λ6, and ρ are regularization hyperparameters, and ||·||_{2,1} represents the sum of the Euclidean norms of the columns of a matrix, introduced to encourage sparseness. Notice that we do not restrict the terms L and Z to be non-negative. Both the second and the third terms in Eq. 7.3.4 are needed to discover the pathway and EC communities, i.e., C and K, respectively.

Eqs. 7.3.1, 7.3.3, and 7.3.4 are jointly non-convex due to the non-negativity constraints on the original and factorized matrices, implying that the solutions to triUMPF are only unique up to scalings and rotations [369]. Hence, we adopt an alternating optimization algorithm to solve the objective functions, which is provided in Appendix E.

7.4 Experimental Setup

Here, we describe the experimental framework used to demonstrate triUMPF pathway prediction performance across the multiple datasets introduced in Chapter 4.2. triUMPF was implemented in the Python programming language (v3).
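To give a concrete flavour of the alternating, non-negativity-preserving optimization used throughout, the following sketch applies standard Lee–Seung multiplicative updates to the first term of Eq. 7.3.1 only (plain NMF, M ≈ WH^⊤); the full triUMPF updates, including the coupling and regularization terms, are given in Appendix E:

```python
import numpy as np

rng = np.random.default_rng(0)

def nmf_multiplicative(M, k, n_iter=200, eps=1e-9):
    """Plain NMF via multiplicative updates: M ~ W @ H.T, W, H >= 0.

    This minimises only the squared-loss term of Eq. 7.3.1; in triUMPF
    the remaining factors (U, V, ...) are handled analogously by
    alternating over one factor at a time with the others fixed.
    """
    t, r = M.shape
    W = rng.random((t, k))
    H = rng.random((r, k))
    for _ in range(n_iter):
        # Update W with H fixed, then H with W fixed; multiplicative
        # updates keep every entry non-negative by construction.
        W *= (M @ H) / (W @ (H.T @ H) + eps)
        H *= (M.T @ W) / (H @ (W.T @ W) + eps)
    return W, H

M = rng.random((30, 40))            # stand-in for a Pathway-EC matrix
W, H = nmf_multiplicative(M, k=5)
cost = np.linalg.norm(M - W @ H.T) ** 2   # squared Frobenius cost
```

The "reconstruction cost" reported in the experiments below is of this form: the residual between the original matrix and the product of its decomposed factors.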
Unless otherwise specified, all tests were conducted on a Linux server using 10 cores of an Intel Xeon CPU E5-2650.

7.4.1 Association Matrices

MetaCyc v21 ([51]) was used to obtain the three association matrices: P2E (M), P2P (A), and E2E (B). Some properties of each matrix are summarized in Table 4.1. All three matrices are extremely sparse. For example, M contains 2526 pathways, with an average of four EC associations per pathway, leaving more than 3600 columns with zero values. These matrices will be utilized to obtain higher-order proximity (Section 7.5.1) and to analyze triUMPF's robustness (Section 7.5.2).

7.4.2 Pathway and Enzymatic Reaction Features

The pathway and EC features, indicated by P and E, respectively, were obtained using pathway2vec. The following settings were applied to learn the pathway and EC features: the embedding method was "crt", the number of memorized domains was 3, the explore and in-out hyperparameters were 0.55 and 0.84, respectively, the number of sampled path instances was 100, the walk length was 100, the embedding dimension size was m = 128, the neighborhood size was 5, the number of negative samples was 5, and the MetaCyc configuration used was "uec", indicating that links among ECs are trimmed.

7.4.3 Parameter Settings

For training, unless otherwise indicated, the learning rate was set to 0.0001, the batch size to 50, the number of epochs to 10, the number of components to k = 100, and the numbers of pathway and EC communities to p = 90 and v = 100, respectively. The higher-order proximities for A^{prox} and B^{prox} (corresponding to the P2P and E2E matrices, respectively, in Section 7.4.1) were set to l_p = 3 and l_e = 1, and their associated weights were fixed as ω = 0.1 and γ = 0.3, respectively. α and β were fixed to 10^9.
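For reference, the feature-learning and training settings listed so far can be collected into a single configuration sketch (the key names are ours, not part of the triUMPF or pathway2vec APIs):

```python
# Settings from Sections 7.4.2-7.4.3; key names are illustrative only.
config = {
    # pathway2vec feature learning
    "embedding_method": "crt",
    "memorized_domains": 3,
    "explore": 0.55,
    "in_out": 0.84,
    "num_walks": 100,
    "walk_length": 100,
    "embedding_dim": 128,        # m
    "window_size": 5,
    "negative_samples": 5,
    "metacyc_config": "uec",     # links among ECs trimmed
    # triUMPF training
    "learning_rate": 1e-4,
    "batch_size": 50,
    "epochs": 10,
    "components_k": 100,
    "pathway_communities_p": 90,
    "ec_communities_v": 100,
    "order_lp": 3,
    "order_le": 1,
    "omega": 0.1,
    "gamma": 0.3,
    "alpha": 1e9,
    "beta": 1e9,
}
```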
For the regularization hyperparameters λ∗, we performed 10-fold cross-validation on MetaCyc and a subsample of BioCyc T2 & T3 data and found the settings λ_{1:5} = 0.01, λ6 = 10, and ρ = 0.001 to be optimal on golden T1 data.

7.5 Experimental Results and Discussion

Four consecutive tests were performed to ascertain the performance of triUMPF: parameter sensitivity, network reconstruction, visualization, and metabolic pathway prediction effectiveness.

7.5.1 Parameter Sensitivity

Experimental setup. The impact of seven hyperparameters (k, p, v, l_p, l_e, ω, and γ) was evaluated in relation to the reconstruction cost of the associated matrices (M, A^{prox}, and B^{prox}). The reconstruction cost (or error) is the sum of mean squared errors incurred in transforming the decomposed matrices back into their original form, where a lower cost indicates that the decomposed low-dimensional matrices better capture the representations of the original matrix. We specifically evaluated the effects of varying the following parameters: i)- the number of components k ∈ {20, 50, 70, 90, 120}; ii)- the community sizes of pathways p ∈ {20, 50, 70, 90, 100} and ECs v ∈ {20, 50, 70, 90, 100}; iii)- the higher-order proximities l_p and l_e ∈ {1, 2, 3}; and iv)- the weights of the polynomial orders ω and γ ∈ {0.1, 0.2, 0.3}. We used the full matrix M for each test; however, for community detection, we used BioCyc T2 & T3 data divided into training (80%), validation (5%), and test (15%) sets. The final costs for community detection are reported on the test set after 10 successive trials. In addition, we contrast triUMPF with standard NMF by monitoring the reconstruction costs of M for varying k values. We emphasize that M, A^{prox}, and B^{prox} were collected from MetaCyc (Section 7.4.1) and not from BioCyc T2 & T3 (Chapter 4.2.2).

Experimental results. Fig. 7.3 shows the effect of rank k on triUMPF performance. In general, we observe that the performance is steady with increasing k.
This is in contrast to standard NMF, where the reconstruction error decreases as the number of features increases. This is expected because, unlike standard NMF, triUMPF exploits two types of correlations to recover M: i)- within ECs or pathways and ii)- interactions between them, hence serving as regularizers. As observed from Fig. 7.3, higher k values result in improved outcomes. Consequently, we selected k = 100 for downstream testing.

Figure 7.3: Sensitivity of components k based on reconstruction cost.

For community detection, we observed optimal results with respect to pathway community size at p = 20 under the parameter settings k = 100 and v = 100, as shown in Fig. 7.4a. However, because A^{prox} is so sparse, we suggest that this low rank may not correspond to the optimum community size. As with all community detection methods, triUMPF is sensitive to community size and requires empirical testing. Therefore, we tested settings between p = 20 and p = 100 and observed a decrease in performance under the parameter settings k = 100 and v = 100, with p = 90 providing a balance between cost and increased community size. A similar result was observed for EC community size at v = 100 under the parameter settings p = 90 and k = 100 in Fig. 7.4b.

Finally, we show the effect of changing the polynomial orders and their weights on triUMPF performance. From Fig. 7.4c, we see that the reconstruction error progressively increases with higher orders for all three weights ω. However, for the same reasons described above, we prefer longer distances with less weight to preserve community structure; remarkably, when ω = 0.1, triUMPF performance was relatively stable after the second order. The same conclusion can be drawn for l_e and its associated weights γ in Fig.
7.4d.

Based on these results, triUMPF performance is stable while minimizing cost under the following parameter settings: k = 100, p > 90, v > 90, l_p = 3, ω = 0.1, l_e = 1, and γ = 0.3. Therefore, we recommend these settings for both MetaCyc and BioCyc T2 & T3.

Figure 7.4: Sensitivity of community size and higher-order proximity with weights based on reconstruction cost. (a) Pathway community p (k = 100, v = 100); (b) EC community v (k = 100, p = 90); (c) effect of l_p; (d) effect of l_e.

7.5.2 Network Reconstruction

Experimental setup. We next examined the robustness of triUMPF when exposed to noise. Links were randomly removed from M, A, and B according to ε ∈ {20%, 40%, 60%, 80%}. We used the partially linked matrices to refine parameters while comparing the reconstruction cost against the full association matrices M, A, and B. Specifically for M, we varied the components of M according to k ∈ {20, 50, 70, 90, 120} along with ε. For all experiments, both MetaCyc and BioCyc T2 & T3 were used for training with the hyperparameters described in Section 7.4.3.

Experimental results. Fig. 7.5a indicates that, as the noise ε applied to M progressively increases, the reconstruction cost increases when k is low. As more features are incorporated, the cost at all noise levels steadily decreases up to k = 100.

Figure 7.5: Link prediction results by varying noise levels ε ∈ {20%, 40%, 60%, 80%} based on reconstruction cost. (a) Effect of k; (b) EC links recovery; (c) pathway links recovery.
This tendency indicates that both pathway and EC features (P and E) contain useful correlations that contribute to the resilience of triUMPF's performance when M is perturbed.

For A^{prox} and B^{prox}, as shown in Figs 7.5b and 7.5c, the costs are reduced in the presence of noise, which is not surprising as the reconstruction of the associated communities is constrained on both the data and A^{prox} and B^{prox}. These results are directly linked to the sparseness of both matrices, as previously described in [98]. The pathway network, depicted in Fig. 7.1, indicates that many pathways constitute islands with no direct links, while some pathways are densely connected. For community detection, it is sufficient to group nodes that are densely connected, while links between communities can remain sparse. The same line of reasoning follows for the EC network.

7.5.3 Visualization

Experimental setup. Recall that community detection (Sections 7.3.2 and 7.3.3) was used to guide the learning process using both MetaCyc and BioCyc T2 & T3. Under circumstances where BioCyc T2 & T3 are excluded from Eq. 7.3.4, triUMPF identifies pathway communities from A defined according to MetaCyc. However, when trained with both MetaCyc and BioCyc T2 & T3, connected pathways may be distributed across multiple communities. This happens due to the heterogeneous nature of the BioCyc collection and presents an opportunity to evaluate the statistical properties of pathway communities in relation to both the taxonomic and functional diversity within the training set.

To explore these properties in more detail, we visualized MetaCyc and BioCyc communities associated with the tricarboxylic acid (TCA) cycle. The TCA cycle represents a series of reactions central to cellular metabolism and can be found in different forms, called pathway variants, in aerobic and anaerobic organismal genomes. We then visualized the impact of community detection on pathway prediction by comparing the metabolic networks predicted for E. coli K-12 substr. MG1655 (TAX-511145), uropathogenic E. coli str. CFT073 (TAX-199310), and enterohemorrhagic E. coli O157:H7 str. EDL933 (TAX-155864) using both PathoLogic (taxonomic pruning) and triUMPF. All experiments were conducted with the settings in Section 7.4.3.

Figure 7.6: TCA cycle and associated pathways. Pathway communities visualized with and without training using BioCyc T2 & T3. (a) MetaCyc communities and (b) BioCyc communities observed using triUMPF. Nodes coloured black indicate the TCA cycle (TCA) while dark grey nodes indicate associated pathways. Remaining pathway communities not associated with the TCA cycle are indicated in light grey. PWY-7180: 2-deoxy-α-D-ribose 1-phosphate degradation; PWY-6223: gentisate degradation I.

Experimental results. Fig. 7.6a shows the pathway communities obtained using MetaCyc, where pathways associated with the TCA cycle grouped together in the graph according to A^{prox}. For example, the pyruvate decarboxylation to acetyl CoA pathway, which converts pyruvate to acetyl-CoA as input to the TCA cycle, was identified in the same TCA community. In contrast, triUMPF trained using MetaCyc and BioCyc T2 & T3 assigned TCA-associated pathways to several distinct communities, as exhibited in Fig. 7.6b.
For example, the pathways 2-deoxy-α-D-ribose 1-phosphate degradation, which produces inputs to glycolysis (D-glyceraldehyde 3-phosphate) and the TCA cycle (acetyl-CoA), and gentisate degradation I, which produces inputs to the TCA cycle (fumarate and pyruvate), were not grouped in the same TCA community. Closer inspection of the training data indicated that these pathways appear together in 250 organismal genomes, altering the statistical association of pathway occurrences in the network. In this light, pathway communities reflect less the MetaCyc pathway ontology and more the statistical properties of the network itself. This aspect of triUMPF can be leveraged to improve prediction outcomes.

Community Index  MetaCyc Pathway ID        MetaCyc Pathway Name                                        Status
67               PWY0-1182                 trehalose degradation II (trehalase)                        true
                 PWY-6910                  hydroxymethylpyrimidine salvage                             true
                 HOMOSER-THRESYN-PWY       L-threonine biosynthesis                                    true
                 PUTDEG-PWY                putrescine degradation I                                    true
                 PWY-6611                  adenine and adenosine salvage V                             true
                 FERMENTATION-PWY          mixed acid fermentation                                     true
                 ENTNER-DOUDOROFF-PWY      Entner-Doudoroff pathway I                                  true
34               ASPARAGINESYN-PWY         L-asparagine biosynthesis II                                true
                 PWY-5340                  sulfate activation for sulfonation                          true
                 PWY-6618                  guanine and guanosine salvage III                           true
                 PWY0-1314                 fructose degradation                                        true
                 PWY-7181                  pyrimidine deoxyribonucleosides degradation                 true
                 PWY0-1299                 arginine dependent acid resistance                          true
                 PWY0-42                   2-methylcitrate cycle I                                     true
9                NAGLIPASYN-PWY            lipid-A-precursor biosynthesis (E. coli)                    true
                 PWY-7221                  guanosine ribonucleotides de novo biosynthesis              true
                 KDOSYN-PWY                Kdo transfer to lipid IVA I (E. coli)                       true
                 PWY0-1309                 chitobiose degradation                                      true
                 PPGPPMET-PWY              ppGpp biosynthesis                                          true
                 PWY-6608                  guanosine nucleotides degradation III                       true
                 PWY-5656                  mannosylglycerate biosynthesis I                            false
47               PLPSAL-PWY                pyridoxal 5'-phosphate salvage I                            true
                 PWY0-1313                 acetate conversion to acetyl-CoA                            true
                 PYRUVDEHYD-PWY            pyruvate decarboxylation to acetyl CoA                      true
                 PWY-4381                  fatty acid biosynthesis initiation (bacteria and plants)    true
                 PWY0-662                  PRPP biosynthesis                                           true
81               HISTSYN-PWY               L-histidine biosynthesis                                    true
                 PWY-6147                  6-hydroxymethyl-dihydropterin diphosphate biosynthesis I    true
                 PWY-7176                  UTP and CTP de novo biosynthesis                            true
                 PWY-6932                  selenate reduction                                          false

Table 7.1: Top 5 communities with pathways predicted by triUMPF for E. coli K-12 substr. MG1655 (TAX-511145). The last column asserts whether a pathway is present in (true) or absent (a false-positive pathway) from the EcoCyc reference data.

To demonstrate this, we compared the pathways predicted for the T1 gold standard E. coli K-12 substr. MG1655 (TAX-511145), henceforth referred to as MG1655, using PathoLogic and triUMPF. Fig. 7.7a shows the results, where both methods inferred 202 true-positive pathways (green-colored) in common out of 307 expected true-positive pathways (using EcoCyc as a common frame of reference). In addition, PathoLogic uniquely predicted 39 true-positive pathways (magenta-colored) while triUMPF uniquely predicted 16 true-positives (purple-colored). This difference arises from the use of taxonomic pruning in PathoLogic, which improves the recovery of taxonomically constrained pathways and limits false-positive identification.

Figure 7.7: Pathway community networks for related T1 and T3 organismal genomes. Pathway communities for (a) E. coli K-12 substr. MG1655 (TAX-511145), (b) E. coli str. CFT073 (TAX-199310), and (c) E. coli O157:H7 str. EDL933 (TAX-155864) based on community detection. Nodes colored in dark grey indicate pathways predicted by PathoLogic; lime, pathways predicted by triUMPF; salmon, pathways predicted by both PathoLogic and triUMPF; red, expected pathways not predicted by either PathoLogic or triUMPF; magenta, expected pathways predicted only by PathoLogic; purple, expected pathways predicted solely by triUMPF; and green, expected pathways predicted by both PathoLogic and triUMPF. Light grey indicates pathways not expected to be encoded in either organismal genome. Node sizes reflect the degree of associations between pathways.

With taxonomic pruning enabled, PathoLogic inferred 79 false-positive pathways, and over 170 when pruning was disabled. In contrast, triUMPF, which does not use taxonomic feature information, inferred 84 false-positive pathways. This improvement over PathoLogic with pruning disabled reinforces the idea that pathway communities improve the precision of pathway prediction with limited impact on overall recall. Based on these results, it is conceivable to train triUMPF on subsets of organismal genomes, resulting in more constrained pathway communities for pangenome analysis. Examples of pathway communities are given in Table 7.1.

To further evaluate triUMPF performance on closely related organismal genomes, we performed pathway prediction on E. coli str. CFT073 (TAX-199310) and E. coli O157:H7 str. EDL933 (TAX-155864) and compared the results to the MG1655 reference strain [345]. Both CFT073 and EDL933 are pathogens infecting the human urinary and gastrointestinal tracts, respectively. Previously, Welch and colleagues described extensive genomic mosaicism between these strains and MG1655, defining a core backbone of conserved metabolic genes interspersed with genomic islands encoding common pathogenic or niche-defining traits [345].

Figure 7.8: A three-way set-difference analysis of pathways predicted for E. coli K-12 substr. MG1655 (TAX-511145), E. coli str. CFT073 (TAX-199310), and E. coli O157:H7 str. EDL933 (TAX-155864) using (a) PathoLogic (taxonomic pruning), (b) PathoLogic (without taxonomic pruning), and (c) triUMPF.

Neither the CFT073 nor the EDL933 genome is represented in the BioCyc collection of organismal pathway genome databases.
A total of 335 and 319 unique pathways were predicted by PathoLogic and triUMPF, respectively. The resulting pathway lists were used to perform a set-difference analysis with MG1655 (Figs 7.8a and 7.8c). Both methods predicted more than 200 pathways encoded by all three strains, including core pathways like the TCA cycle (Figs 7.7b and 7.7c). CFT073 and EDL933 were predicted to share a single common pathway (TCA cycle IV (2-oxoglutarate decarboxylase)) by triUMPF. However, this pathway variant has not been previously identified in E. coli and constitutes a false-positive prediction based on its recognized taxonomic range. Both PathoLogic and triUMPF predicted the aerobactin biosynthesis pathway involved in siderophore production in CFT073, consistent with previous observations [345]. Similarly, four pathways (e.g. L-isoleucine biosynthesis III and GDP-D-perosamine biosynthesis) unique to EDL933 were inferred by both methods.

Given the lack of cross-validation standards for CFT073 and EDL933, we were unable to determine which method inferred fewer false-positive pathways across the complete set of predicted pathways. Therefore, to constrain this problem to a subset of the data, we applied GapMind [263] to analyze the amino acid biosynthetic pathways encoded in the genomes of the MG1655, CFT073, and EDL933 strains. GapMind is a web-based application developed for annotating amino acid biosynthetic pathways in prokaryotic microorganisms (bacteria and archaea), where each reconstructed pathway is supported by a confidence level. After excluding pathways that were not incorporated in the training set, a total of 102 pathways were identified across the three strains, encompassing 18 amino acid biosynthetic pathways and 27 pathway variants with high confidence (Table 7.2).
Amino Acid     MetaCyc Pathway ID        MetaCyc Pathway Name
Arginine       ARGSYNBSUB-PWY            L-arginine biosynthesis II (acetyl cycle)
               PWY-5154                  L-arginine biosynthesis III (via N-acetyl-L-citrulline)
               PWY-7400                  L-arginine biosynthesis IV (archaebacteria)
Asparagine     ASPARAGINE-BIOSYNTHESIS   L-asparagine biosynthesis I
               ASPARAGINESYN-PWY         L-asparagine biosynthesis II
Chorismate     PWY-6163                  chorismate biosynthesis from 3-dehydroquinate
Cysteine       CYSTSYN-PWY               L-cysteine biosynthesis I
               PWY-6308                  L-cysteine biosynthesis II (tRNA-dependent)
Glutamine      GLNSYN-PWY                L-glutamine biosynthesis I
Glycine        GLYSYN-PWY                glycine biosynthesis I
               GLYSYN-THR-PWY            glycine biosynthesis IV
Histidine      HISTSYN-PWY               L-histidine biosynthesis
Isoleucine     ILEUSYN-PWY               L-isoleucine biosynthesis I (from threonine)
               PWY-5104                  L-isoleucine biosynthesis IV
Leucine        LEUSYN-PWY                L-leucine biosynthesis
Lysine         DAPLYSINESYN-PWY          L-lysine biosynthesis I
               PWY-2941                  L-lysine biosynthesis II
               PWY-2942                  L-lysine biosynthesis III
Methionine     HOMOSER-METSYN-PWY        L-methionine biosynthesis I
               PWY-702                   L-methionine biosynthesis II
Phenylalanine  PHESYN                    L-phenylalanine biosynthesis I
Proline        PROSYN-PWY                L-proline biosynthesis I
Serine         SERSYN-PWY                L-serine biosynthesis
Threonine      HOMOSER-THRESYN-PWY       L-threonine biosynthesis
Tryptophan     TRPSYN-PWY                L-tryptophan biosynthesis
Tyrosine       TYRSYN                    L-tyrosine biosynthesis I
Valine         VALSYN-PWY                L-valine biosynthesis

Table 7.2: 18 amino acid biosynthetic pathways and 27 pathway variants identified with high confidence.

PathoLogic inferred 49 pathways across the three strains, encompassing 15 amino acid biosynthetic pathways and 17 pathway variants, while triUMPF inferred 54 pathways across the three strains, encompassing 16 amino acid biosynthetic pathways and 19 pathway variants, including L-methionine biosynthesis in MG1655, CFT073, and EDL933, which was not predicted by PathoLogic. Neither method was able to predict L-tyrosine biosynthesis I (see Fig. 7.9). Finally, we note that when taxonomic pruning is disabled, PathoLogic infers over 90 additional pathways (Fig.
7.8b). With regard to the GapMind results, PathoLogic without taxonomic pruning predicted 56 pathways across the three strains, encompassing 15 amino acid biosynthetic pathways and 20 pathway variants, including the L-proline biosynthesis II (from arginine) pathway that is known only in eukaryotes (Fig. 7.10), consequently increasing false-positive pathway prediction.

Figure 7.9: Comparison of predicted pathways for the E. coli K-12 substr. MG1655 (TAX-511145), E. coli str. CFT073 (TAX-199310), and E. coli O157:H7 str. EDL933 (TAX-155864) datasets between PathoLogic (taxonomic pruning) and triUMPF. Red circles indicate that neither method predicted a specific pathway while green circles indicate that both methods predicted it. Lime circles indicate pathways predicted solely by triUMPF and gray circles indicate pathways predicted solely by PathoLogic. The size of circles corresponds to the associated pathway coverage information.

Figure 7.10: Comparison of predicted pathways for the E. coli K-12 substr. MG1655 (TAX-511145), E. coli str. CFT073 (TAX-199310), and E. coli O157:H7 str. EDL933 (TAX-155864) datasets between PathoLogic (without taxonomic pruning) and triUMPF. Red circles indicate that neither method predicted a specific pathway while green circles indicate that both methods predicted it. Lime circles indicate pathways predicted solely by triUMPF and gray circles indicate pathways predicted solely by PathoLogic. The size of circles corresponds to the associated coverage information.

Figure 7.11: Effect of ρ based on the average F1 scores using the golden T1 datasets (AraCyc, EcoCyc, HumanCyc, LeishCyc, TrypanoCyc, and YeastCyc). The hyperparameter ρ in Eq. 7.3.4 controls the amount of information propagation from M to the pathway label coefficients Θ.

7.5.4 Metabolic Pathway Prediction

Experimental setup. The pathway prediction potential of triUMPF was evaluated using the parameter settings described in Section 7.4.3.
The sensitivity of ρ was initially determined across a range of values {10, 1, 0.1, 0.01, 0.001, 0.0001} using BioCyc as a training set. triUMPF performance on the T1 golden datasets was compared to the pathway inference methods in Chapter 4.3 and to mlLGPR with elastic net (EN) regularization (Chapter 5). In addition to testing on the T1 golden datasets, triUMPF performance was compared to both PathoLogic and mlLGPR on the mealybug symbiont, CAMI low complexity, and HOTS multi-organismal datasets (Chapter 4.2). We used the metrics introduced in Chapter 4.4.1 to report results.

Experimental results. Fig. 7.11 shows the inverse effect on predictive performance on the T1 golden datasets when decreasing ρ, before reaching a performance plateau at ρ = 0.001. The hyperparameter ρ in Eq. 7.3.4 controls the amount of information propagation from M to the pathway label coefficients Θ. This suggests that, in practice, fewer constraints should be placed on Θ, while not neglecting the associations between EC numbers and pathways indicated in M. Having obtained the optimum value of ρ, we compared triUMPF performance to that of MinPath, PathoLogic, and mlLGPR.
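For reference, the example-averaged multi-label metrics used in this comparison can be computed from binary indicator matrices; a minimal numpy sketch (the function is ours, and the exact metric definitions are those of Chapter 4.4.1):

```python
import numpy as np

def multilabel_scores(Y_true, Y_pred):
    """Example-averaged multi-label metrics (sketch; cf. Chapter 4.4.1).

    Y_true, Y_pred : (n_examples, n_labels) binary indicator matrices.
    """
    Y_true, Y_pred = np.asarray(Y_true), np.asarray(Y_pred)
    # Hamming loss: fraction of label assignments that disagree.
    hamming = np.mean(Y_true != Y_pred)
    # Per-example true positives, then example-averaged precision/recall.
    tp = np.sum(Y_true * Y_pred, axis=1)
    precision = np.mean(tp / np.maximum(Y_pred.sum(axis=1), 1))
    recall = np.mean(tp / np.maximum(Y_true.sum(axis=1), 1))
    f1 = 2 * precision * recall / max(precision + recall, 1e-12)
    return hamming, precision, recall, f1

# Two toy examples over four pathway labels.
Y_true = [[1, 0, 1, 0], [0, 1, 1, 0]]
Y_pred = [[1, 0, 0, 0], [0, 1, 1, 1]]
h, p, r, f = multilabel_scores(Y_true, Y_pred)
```

This makes the precision/recall trade-off in the tables below concrete: a conservative predictor (few predicted labels) drives precision up and recall down, and vice versa.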
As shown in Table 7.3, triUMPF achieved competitive performance against the other methods in terms of average precision, with optimal performance on EcoCyc (0.8662). However, with respect to average F1 scores, it under-performed on HumanCyc and AraCyc, yielding average F1 scores of 0.4703 and 0.4775, respectively.

Hamming Loss ↓
Method       EcoCyc  HumanCyc  AraCyc  YeastCyc  LeishCyc  TrypanoCyc
PathoLogic   0.0610  0.0633    0.1188  0.0424    0.0368    0.0424
MinPath      0.2257  0.2530    0.3266  0.2482    0.1615    0.2561
mlLGPR       0.0804  0.0633    0.1069  0.0550    0.0380    0.0590
triUMPF      0.0435  0.0954    0.1560  0.0649    0.0443    0.0776

Average Precision Score ↑
Method       EcoCyc  HumanCyc  AraCyc  YeastCyc  LeishCyc  TrypanoCyc
PathoLogic   0.7230  0.6695    0.7011  0.7194    0.4803    0.5480
MinPath      0.3490  0.3004    0.3806  0.2675    0.1758    0.2129
mlLGPR       0.6187  0.6686    0.7372  0.6480    0.4731    0.5455
triUMPF      0.8662  0.6080    0.7377  0.7273    0.4161    0.4561

Average Recall Score ↑
Method       EcoCyc  HumanCyc  AraCyc  YeastCyc  LeishCyc  TrypanoCyc
PathoLogic   0.8078  0.8423    0.7176  0.8734    0.8391    0.7829
MinPath      0.9902  0.9713    0.9843  1.0000    1.0000    1.0000
mlLGPR       0.8827  0.8459    0.7314  0.8603    0.9080    0.8914
triUMPF      0.7590  0.3835    0.3529  0.3319    0.7126    0.6229

Average F1 Score ↑
Method       EcoCyc  HumanCyc  AraCyc  YeastCyc  LeishCyc  TrypanoCyc
PathoLogic   0.7631  0.7460    0.7093  0.7890    0.6109    0.6447
MinPath      0.5161  0.4589    0.5489  0.4221    0.2990    0.3511
mlLGPR       0.7275  0.7468    0.7343  0.7392    0.6220    0.6768
triUMPF      0.8090  0.4703    0.4775  0.4735    0.5254    0.5266

Table 7.3: Predictive performance of each algorithm on the 6 benchmark golden T1 datasets. For each performance metric, '↓' indicates that a smaller score is better while '↑' indicates that a higher score is better.

To evaluate triUMPF performance on distributed metabolic pathways, we used the reduced genomes of the mealybug symbionts Moranella (GenBank NC-015735) and Tremblaya (GenBank NC-015736) ([218]). Collectively, the two symbiont genomes encode intact biosynthetic pathways for 9 essential amino acids.
PathoLogic, mlLGPR, and triUMPF were used to predict pathways on the individual symbiont genomes and on a composite genome consisting of both, and the resulting amino acid biosynthetic pathway distributions were determined, as illustrated in Fig. 7.12. Both triUMPF and PathoLogic predicted 6 of the expected amino acid biosynthetic pathways on the composite genome while mlLGPR predicted 8 pathways. The pathway for phenylalanine biosynthesis (L-phenylalanine biosynthesis I) was excluded from the analysis because the associated genes were reported to be missing during the ORF prediction process. False positives were predicted for the individual symbiont genomes of Moranella and Tremblaya using both methods, although pathway coverage was reduced relative to the composite genome.

Figure 7.12: Comparative study of predicted pathways for symbiotic data between PathoLogic, triUMPF, and mlLGPR. The size of circles corresponds to the EC coverage information.

To evaluate triUMPF performance on more complex multi-organismal genomes, we used the CAMI low complexity ([289]) and HOTS ([309]) datasets, comparing the resulting pathway predictions to both PathoLogic and mlLGPR. For CAMI low complexity, triUMPF achieved an average F1 score of 0.5864 in comparison to 0.4866 for mlLGPR, which is trained with more than 2500 labeled pathways (Table 7.4). Similar results were obtained for HOTS (Fig. 7.13). Among a subset of 80 selected water column pathways, PathoLogic and triUMPF predicted a total of 54 and 58 pathways, respectively, while mlLGPR inferred 62. From a real-world perspective, none of the methods predicted pathways for photosynthesis light reaction or pyruvate fermentation to (S)-acetoin, although both are expected to be prevalent in the water column. Perhaps the absence of specific ECs associated with these pathways limits rule-based or ML prediction.
Indeed, closer inspection revealed that the enzyme catabolic acetolactate synthase was missing from the pyruvate fermentation to (S)-acetoin pathway; this enzyme is an essential rule encoded in PathoLogic and is represented as a feature in mlLGPR. Conversely, although this pathway was indexed to a community, triUMPF did not predict its presence, constituting a false-negative.

Metric                       mlLGPR  triUMPF
Hamming Loss (↓)             0.0975  0.0436
Average Precision Score (↑)  0.3570  0.7027
Average Recall Score (↑)     0.7827  0.5101
Average F1 Score (↑)         0.4866  0.5864

Table 7.4: Predictive performance of mlLGPR and triUMPF on CAMI low complexity data.

7.6 Summary

In this chapter, we present a novel ML approach for metabolic pathway inference that combines three stages of NMF to capture relationships between enzymes and pathways within a network, followed by community detection to extract higher-order network structure. First, a Pathway-EC association matrix (M), obtained from MetaCyc, is decomposed using the NMF technique to learn a constrained form of the pathway and EC factors, capturing the microscopic structure of M. Then, we obtain the community structure (or mesoscopic structure) jointly from both the input datasets and the two interaction matrices, Pathway-Pathway and EC-EC. Finally, the consensus relationships between the community structure and the data, and between the factors learned from M and the pathway label coefficients, are exploited to efficiently optimize the metabolic pathway parameters.

We evaluated triUMPF performance using a corpus of experimental datasets presented in Chapter 4.2. During the benchmarking process we realized that the BioCyc collection suffers from a class imbalance problem [130], where some pathways occur infrequently across PGDBs.
This results in a significant sensitivity loss on T1 golden data, where triUMPF tended to predict more frequently observed pathways while missing more infrequent ones. One potential approach to solving this class-imbalance problem is subsampling the most informative PGDBs for training, hence reducing false-positives [182].

Despite the observed class imbalance problem, triUMPF improved pathway prediction precision without the need for taxonomic rules or EC features to constrain metabolic potential. From an ML perspective this is a promising outcome, considering that triUMPF was trained on a reduced number of pathways relative to mlLGPR. Future development efforts will explore subsampling approaches to improve sensitivity and the use of constrained taxonomic groups for pangenome and multi-organismal genome pathway inference.

Figure 7.13: Comparative study of predicted pathways for HOT DNA samples. The size of circles corresponds to the pathway abundance information.

Moreover, triUMPF showed promising results in capturing multiple types of pathway correlations, addressed in Chapter 3.4, based on statistical associations (e.g.
OneXChainPathway and OneXTreePathway).

In the next part of this thesis, we will examine the class imbalance problem.

Part IV
Multi-Label Subsampling and Bagging

Chapter 8
Relabeling Metabolic Pathway Dataset with Bags to Enhance Predictive Performance

"Be the change that you wish to see in the world."
– Mahatma Gandhi

In this chapter, we propose the reMap pipeline (relabeling a multi-label dataset based on a bagging approach), a simple yet generic framework that relabels examples to a different set of labels, characterized as bags, where a bag is comprised of correlated pathways. Bag-based classification was considered to improve the sensitivity of the pathway predictors. To obtain bags, we also present two hierarchical mixture models, SOAP (sparse correlated bag pathway) and SPREAT (distributed sparse correlated bag pathway), that incorporate pathway abundance information to encode each example as a mixture distribution of bags; each bag, in turn, is a mixture of pathways with different mixing proportions. After obtaining bags, reMap performs relabeling by alternating between (1) assigning bags to each sample and (2) updating reMap's internal parameters. reMap's effectiveness was evaluated on parameter sensitivity, visualization, annotation progress, and metabolic pathway prediction. The latter experimental study confirmed that this approach outperforms triUMPF on several golden T1 datasets introduced in Chapter 4.2.1.

8.1 Introduction

In Chapter 7, we introduced triUMPF, which consumes interactions among pathways and enzymes, in a network manner, to improve the accuracy of reconstructing pathways in terms
Despite triUMPF's predictive gains, its recall scores on pathway datasets (in Chapter 4.2) were reported to deteriorate.

In this chapter, we introduce the reMap pipeline, which annotates each example with a new label set, called a "bag" set, inspired by the multi-graph classification (MGC) technique [353], to strike a balance between precision and recall for the downstream metabolic pathway prediction task. Each bag is comprised of correlated pathways from a pathway label set, which are allowed to be inter-mixed across bags with different proportions, resulting in overlapping subsets of pathways over a subset of bags (non-disjoint). Hence, this approach is fundamentally different from triUMPF, which follows a clustering strategy. Moreover, the bag-based technique has an important implication for the pathway prediction task (in Fig. 8.1). That is, unlike mlLGPR and triUMPF, bag-based pathway prediction applies two consecutive steps, where a set of bags is inferred first and then the pathways within these bags are recovered. Since pathways are distributed over bags, it is conceivable that a pathway may be revisited multiple times, thereby increasing the chance of boosting sensitivity scores while improving precision. This potential benefit was leveraged in neither mlLGPR nor triUMPF.

To initiate the relabeling process, reMap takes two label sets: i) a pathway label set and ii) a bag set that can be obtained using either of the two developed mixed-membership hierarchical Bayesian models, SOAP and SPREAT. These two models are proposed to capture mixed bags given pathway datasets (involving abundance information). In addition, SOAP and SPREAT incorporate "background" or "supplementary" pathways (with different proportions) on top of the pathways provided in pathway datasets to partially resolve noisy pathway data. Moreover, the two models induce dual sparseness, where we allow an individual example to select only a few bags and also each bag to select its optimum set of pathways.
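As an illustration of the two-step, bag-based prediction described above, the following minimal sketch (toy, invented data; not the thesis implementation) first assumes a set of positive bags has been predicted, then recovers the pathways contained in those bags:

```python
import numpy as np

# Toy illustration of the two-step bag-based prediction: step 1 predicts
# positive bags, step 2 recovers the pathways inside those bags.
# B is a hypothetical bag-pathway membership matrix (+1 present, -1 absent).
B = np.array([[+1, +1, -1, -1, +1],   # bag 0
              [-1, +1, +1, +1, -1]])  # bag 1 (overlaps bag 0 on pathway 1)

def pathways_from_bags(bag_pred, B):
    """Union of the pathways contained in the predicted positive bags."""
    positive_bags = np.where(bag_pred == +1)[0]
    member = (B[positive_bags] == +1).any(axis=0)
    return np.where(member, +1, -1)

bag_pred = np.array([+1, -1])           # only bag 0 predicted positive
print(pathways_from_bags(bag_pred, B))  # -> [ 1  1 -1 -1  1]
```

Because bags overlap, a pathway shared by several positive bags is revisited multiple times, which is the mechanism behind the sensitivity gains discussed above.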
Once bags are obtained, reMap performs an iterative relabeling by alternating between assigning bags to examples and updating parameters, mirroring the expectation-maximization algorithm [80], while enforcing tight integrity among pathways, bags, and input enzyme instances.

We empirically examined reMap on four test cases: parameter sensitivity of SOAP and SPREAT; bag visualization; assessment of the annotation process; and the metabolic pathway prediction task using leADS (presented in Chapter 9). For the last experimental study, the results outperformed triUMPF on golden T1 datasets.

Figure 8.1: The traditional vs the proposed bag-based multi-label classification approaches. The traditional supervised multi-label classification is displayed on the left panel, where labels (i.e., red or green colors) are associated with an input instance x^(i). This approach seeks to predict a set of labels for x^(i) without considering any compartmentalization of labels. In contrast, the bag-based multi-label classification approach, on the right, applies two steps, where it first predicts a set of positive bags (depicted as a cloud glyph), then the labels within these bags are predicted (green colored labels).

8.2 Definitions and Problem Formulation

In this section, we provide important notations and definitions. Unless otherwise mentioned, we emphasize that mathematical symbols are constrained within the context of this chapter. We define the term bag, which is borrowed from the multi-graph classification (MGC) technique [353]. In MGC, the goal is to learn a model from a set of labeled bags, each containing several graphs. A bag is tagged positive if at least one graph in the bag is positive, and negative otherwise. Here, we slightly abuse the term and reserve it to describe a composition of several correlated pathways.

Definition 8.1. Pathway Bag.
Denote by B = {B_1, B_2, ..., B_b} a set of b bags, where each bag B_c ∈ {−1,+1}^t is presumed to contain a subset of correlated pathways, i.e., Y_c ⊆ Y, and t is the number of pathways in Def. 3.9. The presence or absence of a pathway in bag c is indicated by +1 or −1, respectively. The matrix representation of B is B ∈ {−1,+1}^{b×t}.

Bags are also assumed to be correlated, i.e., non-disjoint, and can be modeled by a Gaussian covariance matrix, denoted by Σ ∈ R^{b×b}. Each entry s_{i,j} in Σ characterizes the association of the i-th bag with the j-th bag, where a larger score indicates that both bags are highly correlated. This correlation can be discovered using either SOAP or SPREAT given pathway datasets (Y), introduced in Def. 3.9, incorporating pathway abundance information, which can be obtained by mapping enzymes with abundances onto pathways.

Figure 8.2: An example of feature vectors for bags. The left subfigure represents the feature vector for six pathways corresponding to two instances. The right subfigure indicates two bags, B_1 and B_2, and their features for the same two instances, where the first sample, D_1, suggests that B_1 is positive because the corresponding pathways y_3 and y_4 are present, while the bag feature vector for the second example, D_2, suggests that both bags are present.

Both models are a form of mixed-membership hierarchical Bayesian network, where each example is encoded as a vector of bag probabilities and each bag, in turn, is comprised of a set of correlated pathways. The pathways are permitted to be inter-mixed across bags with different proportions, resulting in overlapping pathways over bags. These two models extend the functions of CTM (correlated topic model) [40] by incorporating dual sparseness and supplementary pathways in modeling bag proportions. That is, SOAP and SPREAT encode missing probable pathways as "background" without compromising the original pathway labels, i.e., Y.
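To make Def. 8.1 concrete, the sketch below builds a toy bag matrix B over invented pathways and converts a hypothetical bag covariance Σ into a correlation matrix via the standard normalization ρ = C^{−1} Σ C^{−1} with C = √diag(Σ), which reMap uses later in the pipeline; all numeric values are illustrative only:

```python
import numpy as np

# Toy sketch of Def. 8.1 (hypothetical data): b = 2 bags over t = 5 pathways,
# stored as B in {-1,+1}^{b x t}, plus the standard covariance-to-correlation
# normalization applied to a toy bag covariance Sigma.
B = np.array([[+1, +1, -1, -1, +1],
              [-1, +1, +1, +1, -1]])   # the two bags overlap on pathway 1

shared = int(((B[0] == +1) & (B[1] == +1)).sum())
print(shared)                          # -> 1 overlapping pathway

Sigma = np.array([[2.0, 0.6],
                  [0.6, 0.5]])         # toy bag covariance matrix
C_inv = np.diag(1.0 / np.sqrt(np.diag(Sigma)))
rho = C_inv @ Sigma @ C_inv            # correlation matrix: unit diagonal
print(np.round(rho, 3))
```

The off-diagonal entries of `rho` play the role of the s_{i,j} scores above, rescaled to the correlation range.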
The supplementary pathways are stored in a matrix M ∈ Z_{≥0}^{n×t}, where each entry is an integer value indicating the abundance of the pathway associated with a specific sample and n corresponds to the number of samples. These background pathways can be modeled, as in the case of SPREAT, or employed directly, as in SOAP. With regard to dual sparseness, it is applied to encourage the selection of: i) a few focused bags for each individual example and ii) a few focused pathways for each bag. For thorough discussions of these models, see Appendix F.1.

As a result of correlation, we define the following two terminologies: bag feature vector and bag's neighbor.

Definition 8.2. Bag Feature Vector. Given an example i, the associated bag feature vector is indicated by d^(i) ∈ {−1,+1}^b, where d^(i)_j = +1 iff the bag j is observed for the sample i, and d^(i)_j = −1 otherwise. The matrix form is represented as D ∈ {−1,+1}^{n×b}.

Example 8.1. An example of feature vectors for bags is illustrated in Fig. 8.2, where 2-dimensional feature vectors for bags encode the presence or absence of two bags, B_1 and B_2, given a set of 6 pathways and bag-pathway association information, depicted as a cloud glyph.

Definition 8.3. Bag's Neighbors. A bag B_c ∈ B is said to be a neighbor of another bag B_j ∈ B, s.t. c ≠ j, iff there exists an intersected pathway l in both bags, i.e., B_{c,l} ∧ B_{j,l} = +1.

With the above definitions, we formulate the problem in this chapter.

Relabeling a Multi-label Pathway Dataset: Given a set of bags B and a multi-label pathway dataset S (Def. 3.9), the goal is to learn an optimum relabeling function h_bag : X → {+1,−1}^b, such that leveraging bags on X ∈ R^{n×r}, where r corresponds to the number of enzymatic reactions (Def. 3.9), incurs a high predictive score for the multi-label pathway prediction task.

To better crystallize the idea of incorporating bags in pathway prediction, consider the following example.

Example 8.2. Fig.
8.1 illustrates the benefit of incorporating bags for multi-label pathway classification (right panel). Here, a dataset consists of two bags, each of which groups a set of 4 correlated pathways. To determine the positive pathways (y_2, y_3, and y_4) given X_i, we first predict the relevant bag, indicated by +, then classify pathways within that bag. In contrast, the traditional multi-label classification approaches (left figure), mostly based on the binary relevance technique, proceed by predicting multiple pathway labels for X_i directly.

Relabeling a multi-label dataset S will result in another dataset, S_bag.

Definition 8.4. Multi-label Bag Dataset. A bag dataset is represented by S_bag = {(x^(i), d^(i)) : 1 ≤ i ≤ n}, consisting of n examples. d^(i) = [d^(i)_1, ..., d^(i)_b] ∈ {−1,+1}^b is a bag label vector of size b. Each element of d^(i) indicates the presence/absence of the associated bag that is inherited from a set B = {B_1, B_2, ..., B_b}. Each bag B_c ∈ {−1,+1}^t is presumed to contain a subset of correlated pathways, i.e., Y_c ⊆ Y. The presence or absence of a pathway in bag c is indicated by +1 or −1, respectively. The matrix representation of B is B ∈ {−1,+1}^{b×t}. In addition, bags are correlated through overlapping pathways, i.e., for two correlated bags B_c and B_j, s.t. c ≠ j, there exists an intersected pathway l in both bags, i.e., B_{c,l} ∧ B_{j,l} = +1. The matrix form of the bag label vector with n instances is denoted by D ∈ {−1,+1}^{n×b}.

8.3 The reMap Method

In this section, we provide a description of the reMap framework (depicted in Fig.
8.3), which iteratively alternates between the following two phases: i) feed-forward, which consists of three components: 1) constructing pathway bags, 2) building bag centroids, and 3) re-assigning labels to each example; and ii) feed-backward, to update reMap's parameters.

8.3.1 Feed-Forward Phase

During this phase, a minimal subset of bags is picked to tag each example according to the following three steps.

8.3.1.1 Constructing Pathway Bags

In this step, the pathways in S are partitioned into b non-disjoint bags using any of the correlated bag models (CTM, SOAP, or SPREAT). Moreover, the correlated models provide us with a mean covariance matrix of bags (Σ ∈ R^{b×b}) that is transformed into a correlation matrix ρ = C^{−1} Σ C^{−1}, where C = √diag(Σ). The correlated bag models also provide us with a pathway distribution over bags, denoted by Φ ∈ R^{b×t}, where we trim the pathway distribution to Φ′ ∈ R^{b×k} (⊆ Φ) by keeping the top k pathways for each bag, provided that the aggregation of the k pathways from all bags equals t.

Modeling the pathway distribution and bag correlations has two important implications: first, organisms encoding similar functions may share similar bags, thus encouraging near-identical statistical strength (as in triUMPF); and second, pathways that are observed to frequently occur together may suggest a similar relative contribution to a bag.

8.3.1.2 Building Bag Centroids

Having obtained a set of bags, reMap computes the centroid of each bag to harness the relative association of each pathway to each bag's centroid. We assume that pathways in a bag are semantically "close enough" to the center of that bag; hence, pathways in that bag should share a similar low-level representation, while also ensuring similar semantics among bags with overlapping pathways. Using this simple assumption, reMap computes the centroid c_s of a bag s according to:

(8.3.1)    c_s = (α / n_s) Σ_{j: B_{s,j} = +1} P_j / ||P_j||

Figure 8.3: A workflow diagram showing the proposed reMap pipeline to relabel a multi-label dataset.
The method consists of two phases: i) feed-forward and ii) feed-backward. The forward phase is composed of three components: (b) construction of pathway bags, which aims to build correlated bags given data (a); (c) building bag centroids, which retrieves the centroids of bags based on the associated pathways; and (d) re-assigning labels, which maps samples to bags. The feed-backward phase (e) optimizes reMap's parameters to maximize the accuracy of mapping examples to bags. The process is repeated τ times. If the current iteration q reaches the desired number of rounds τ, the training is terminated while producing the final S_bag data points (f). The bag dataset can then be used to train leADS (g).

where P ∈ R^{t×m} is a matrix storing the pathway representations obtained from the pathway2vec framework (in Chapter 6), n_s is the number of pathways associated with bag s, ||.|| is the length of a feature vector, and α is a hyper-parameter determined by empirical evaluation (16 in this work). It is important to note that bag centroids enable us to obtain a maximum number of expected bags for a given example by applying a suitable metric, such as cosine similarity [212]:

(8.3.2)    D̂^(i) = vec({ I( (c_s^T c̃_s^(i)) / (||c_s|| · ||c̃_s^(i)||) ≥ υ ) : 1 ≤ s ≤ b }),
           c̃_s^(i) = (α / n_s) Σ_{j: Y_{i,j} ∧ B_{s,j} = +1} P_j / ||P_j||

where I(.) is an indicator function that results in either +1 or −1 depending on a user-defined threshold υ ∈ R_{>0}, c̃_s^(i) is the centroid of bag s restricted to example i, and D̂^(i) is the aggregated hypothetical bag vector of size b for example i after the results are vectorized into a series of +1 and −1 using the vec operation.

8.3.1.3 Re-assigning Pathways to Bags

The goal of this step is to map an example's input space onto a bag space using a decision function h_bag, which would produce an optimum multi-label bag set (D_opt ∈ Z_{≥0}^{n×b}). Formally, let us denote the set of bags picked to relabel an example as B_P^(i) ⊆ arg{D̂_{i,j} = +1 : ∀j}, while we denote by B_U^(i) the set of remaining bags, where D̂^(i) is obtained using Eq.
8.3.2. Collectively, both sets of bags are stored in L^(i) = {B_P^(i) ∪ B_U^(i)}. Then, the process of re-annotation is achieved iteratively, mirroring the sequential learning and prediction strategy [112], where for each example, a bag B_j at round q is either: i) added to L^(i), indicated by L_q^(i) = L_{q−1}^(i) ⊕ {B_j : 1 < j ≤ |B_U^(i)|}; or ii) removed from the set of selected bags, represented by L_q^(i) = L_{q−1}^(i) ⊖ {B_j : 1 < j ≤ |B_P^(i)|}.

More concretely, at each iteration q, we estimate the probability of an example i given the bags selected at q − 1, using the threshold closeness (TC) metric [55]:

(8.3.3)    p(x^(i) | H_{q−1}^(i), L_{q−1}^(i), D̂_{i,j} = +1) = ( p̄_{H_{q−1}^(i)}(D̂_{i,j} | L_{q−1}^(i), x^(i)) · G + ζ ) / Z

where G = 1 − p̄_{H_{q−1}^(i)}(D̂_{i,j} | L_{q−1}^(i), x^(i)), D̂_{i,j} = +1 if bag B_j is in example i, and p̄_{H_{q−1}^(i)}(D̂_{i,j} | L_{q−1}^(i), x^(i)) is the average probability of classifying x^(i) into bag B_j over the values collected in H_{q−1}^(i), which represents the history of prediction probabilities storing all p(D̂_{i,j} | L_{q−1}^(i), x^(i)) before the current iteration q. The term ζ is a smoothness constant while Z is a normalization constant. Note that TC is a class conditional probability density function that encourages the correct class probability to be close to the true unknown decision boundary.

To estimate p(D̂_{i,j} | L_{q−1}^(i), x^(i)), we jointly compute the probability of the bags and pathways associated with D̂_{i,j} at round q − 1 as:

(8.3.4)    p(D̂_{i,j} | L_{q−1}^(i), x^(i)) ∝ H_{q−1}^(i) ( Σ_{e ∈ L_{q−1}^(i)} z_{j,e} ( Σ_{s ∈ B_j, s = +1} p(D̂_{i,j} | l_s = +1, Θ_j^bag) p(y_s^(i) | x^(i), Θ_s^path) ) ),
           z_{j,e} = (ρ_{j,e} − min(ρ)) / (max(ρ) − min(ρ)),
           p(D̂_{i,j} | l_s = +1, Θ_j^bag) = 1 / (1 + exp(−Θ_j^{bag,T} |c̃_j^(i) − P_s|)),
           p(y_s^(i) | x^(i), Θ_s^path) = 1 / (1 + exp(−Θ_s^{path,T} x^(i)))

where y_s^(i) = 1 if the pathway index s is present in both bag j and sample x^(i) and 0 otherwise, and l_s = 1 if the pathway index s is associated with bag j and 0 otherwise.
z_{j,e} is the normalized correlation between bags j and e, obtained from ρ (see Section 8.3.1.1), and c̃_j^(i) is presented in Eq. 8.3.2. Θ_j^bag ∈ R^m and Θ_s^path ∈ R^r denote the parameters of the bag j and pathway s models, respectively, and are learned during the feed-backward stage in Section 8.3.2.

To reduce computational latency, instead of applying the above procedure to all bags for each example at every round, we randomly sub-sample bags of size γ. Also, since the estimate is still in the probability realm, we utilize a cutoff decision threshold (β) to retrieve a subset of bags having fewer overlapping pathways. Afterwards, L^(i) is updated either by adding or removing bags from the previous iteration.

8.3.2 Feed-Backward Phase

Here, we set up a learning framework for computing reMap's bag and pathway parameters, jointly denoted as Θ = {Θ^bag, Θ^path}. From Eq. 8.3.3, it is clear that our algorithm has three learning components: i) a hyper-plane in the bag feature space to absorb bag correlations; ii) a hyper-plane in the pathway feature space to encode semantic information about pathways; and iii) joint learning between bags and pathways to exploit the bag-pathway relationship. Let us define three empirical loss functions, corresponding to the three components: ε^bag : {0,1}^b → R_{≥0}, ε^path : {0,1}^t → R_{≥0}, and ε^bag-path : {0,1}^b → R_{≥0}, of margins d·h_bag(x), y·h_path(x), and d·h_bag-path(y), respectively, where the h(.) are decision functions. The last two loss functions are based on the logistic loss, while the first loss is the sum of the other two. Now, to compute Θ, we maximize the posterior probability of Eq. 8.3.4:

(8.3.5)    Θ̂ = argmax_Θ Π_{q=1}^{τ} Π_{i=1}^{n} Π_{j=1}^{b} H_{q−1}^(i) p(D̂_{i,j} | L_{q−1}^(i), x^(i)) × ( Σ_{s ∈ B_j, s = +1} p(D̂_{i,j} | l_s = +1, Θ_j^bag) p(y_s^(i) | x^(i), Θ_s^path) )

Estimation of the parameters in Eq. 8.3.5 is intractable due to the chain of probabilities H_{q−1}^(i) and the two marginalizations over L_{q−1} and s.
Hence, we propose the following two simplifications: i) conditional independence assumptions, where the previous history values are independent given the most recent estimates; and ii) collapsing the marginalization over L_{q−1} by choosing only the maximum correlation z, irrespective of which bags were considered. These simplified treatments provide an efficient way to optimize the parameters, where we adopt the "one-vs-all" learning scheme for each bag and pathway, discussed in Chapter 3.4. In addition, we apply four constraints to retrieve a good set of parameters: i) enforcing similarity between the bags' and the associated pathway labels' parameters; ii) the weights of pathways assembled in a bag should be close to each other; iii) the input space (i.e., enzymes) and the pathway space should share similar statistical properties, which entails that if two instances in feature space exhibit a strong association then they may share the same label set in the label space; and iv) all of reMap's parameters should be neither too large nor too small.

With these simplifications and added constraints, we can now minimize an upper-bound approximation of the negative log-likelihood of Eq. 8.3.5, which leads to independent optimization objective functions for all classifiers (bags and pathways), according to the multi-label one-vs-all approach. For the analytical expression and pseudocode of reMap's prediction and training algorithms, see Appendix Section F.2.

8.3.3 Closing the Loop

The two phases are repeated for all samples until a predefined number of rounds (τ) is reached.
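A deliberately small sketch of the alternating loop may help fix ideas. It implements the feed-forward geometry of Eqs. 8.3.1 and 8.3.2 (centroids plus thresholded cosine similarity) on invented toy data, with the feed-backward parameter update reduced to a stub; the real reMap instead optimizes the one-vs-all objectives of Appendix F.2:

```python
import numpy as np

# Simplified sketch on hypothetical toy data: feed-forward follows
# Eqs. 8.3.1-8.3.2; the feed-backward update is a placeholder comment.
rng = np.random.default_rng(0)
t, m, b, tau = 6, 8, 2, 3              # pathways, embedding dim, bags, rounds
P = rng.normal(size=(t, m))            # stand-in pathway2vec embeddings
B = np.array([[+1, +1, +1, -1, -1, -1],
              [-1, -1, +1, +1, +1, +1]])
alpha, upsilon = 16.0, 0.2

def centroid(members):
    """Eq. 8.3.1: scaled mean of unit-normalised member embeddings."""
    idx = np.where(members == +1)[0]
    unit = P[idx] / np.linalg.norm(P[idx], axis=1, keepdims=True)
    return (alpha / len(idx)) * unit.sum(axis=0)

def feed_forward(y):
    """Eq. 8.3.2: hypothesise bags whose centroids align with example y."""
    d_hat = np.full(b, -1)
    for s in range(b):
        inter = np.where((y == +1) & (B[s] == +1), +1, -1)
        if (inter == +1).any():        # bag shares at least one pathway
            c_s, c_i = centroid(B[s]), centroid(inter)
            cos = c_s @ c_i / (np.linalg.norm(c_s) * np.linalg.norm(c_i))
            d_hat[s] = +1 if cos >= upsilon else -1
    return d_hat

y = np.array([+1, +1, -1, -1, -1, -1])  # toy example touching bag 0 only
for q in range(tau):
    d_hat = feed_forward(y)
    # feed-backward stub: the real method updates Theta^bag / Theta^path here
print(d_hat)
```

Here bag 1 is always rejected because it shares no pathway with the example, illustrating how bag assignment is constrained by pathway overlap.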
At the end, a new dataset is constructed, S_bag = {(x^(i), d^(i,opt)) : 1 ≤ i ≤ n}, consisting of n examples, where D_opt is an optimum set of bags that annotates each example with a small number of the b bags.

Figure 8.4: Illustration of pathway frequency (averaged over all examples) in BioCyc (v20.5 T2 & 3) and CAMI data, and their background pathways, indicated by M.

8.4 Experimental Setup

In this section, we describe the experimental settings and outline the materials used to evaluate the performance of reMap. The reMap pipeline was written in Python v3 and depends on third-party libraries (e.g., Numpy [335]). We use a subset of the datasets introduced in Chapter 4.2. Unless otherwise specified, all tests were conducted on a Linux server using 10 cores of an Intel Xeon CPU E5-2650.

8.4.1 Parameter Settings

For training reMap, we used the BioCyc collection with the following default settings: the learning rate η = 0.0001, the batch size 30, the number of epochs τ = 10, the bag centroid hyperparameter α = 16, the cutoff threshold for cosine similarity υ = 0.2, the cutoff decision threshold for bags β = 0.3, the number of bags b = 200, and the subsampled bag size γ = 50.

Figure 8.5: Log predictive distribution on CAMI data. (a) Effect of b for b ∈ {50, 100, 150, 200, 300}; (b) effect of b by collapsing; (c) effect of k ∈ {50, 100, 150, 200, 300, 500} with b = 200.

For the regularization hyperparameters λ_{1:5} and κ (see Appendix F.2.5), we performed 10-fold cross-validation on a subsample of the BioCyc data and found the settings λ_{1:5} = 0.01 and κ = 0.01 to be optimum on the golden T1 datasets.

To obtain pathway features, we applied the pathway2vec module using "crt" as the embedding method; the number of memorized domains was 3; the explore and in-out hyperparameters were 0.55 and 0.84, respectively; the number of sampled path instances was 100; the walk
length was 100; the embedding dimension size was m = 128; the neighborhood size was 5; the number of negative samples was 5; and the MetaCyc configuration used was "uec", indicating that links among ECs are trimmed. For the pathway prediction task, we applied leADS (presented in Chapter 9) with the predictive uncertainty set to the "factorization" option, enabling training on the obtained bags and pathways simultaneously, and the prediction strategy set to "pref-vrank" with the ranking hyperparameter set to 200.

The hyperparameter settings for CTM, SOAP, and SPREAT are provided in Appendix Section F.3. Both SOAP and SPREAT have an option "collapse2ctm" (c2m) that enables reduction to CTM while inducing dual sparseness. The supplementary pathways M for the BioCyc T2 & 3, CAMI, and golden T1 datasets were obtained using mlLGPR. For this, we trained mlLGPR using enzymatic reaction and pathway evidence features. A schematic view of pathway frequency across the BioCyc T2 & 3 and CAMI datasets, along with their augmented pathways, is depicted in Fig. 8.4. All the remaining configurations of mlLGPR, SOAP, pathway2vec, and leADS were fixed to their default values.

8.5 Experimental Results and Discussion

We conducted several experimental studies: parameter sensitivity and visualization for the correlated models, assessing the prediction probability history during reMap's relabeling process, and metabolic pathway prediction effectiveness after obtaining S_bag using leADS.

8.5.1 Sensitivity Analysis of Correlated Models

Experimental setup. A fundamental challenge for the reMap pipeline is to acquire a good distribution of bags and pathways from the correlated models for the purpose of relabeling. Following common practice, we examined various hyperparameters associated with the correlated models. First, we compared the sensitivity of SOAP and SPREAT against CTM by incorporating the background pathways M while varying the number of bags according to b ∈ {50, 100, 150, 200, 300}.
Next, we examined the c2m option for SOAP and SPREAT to show that these two models exhibit similar performance to CTM. Finally, we conducted a sparsity analysis of the bag distribution by varying the cutoff threshold value according to k ∈ {50, 100, 150, 200, 300, 500}. For the comparative analysis, we used CAMI as test data to report the log predictive distribution [136], where a lower score entails higher generalization capability for the associated model. Appendix Section F.1.5 provides the mathematical derivation of this metric for SPREAT.

Experimental results. While the log predictive scores for SOAP and SPREAT in Fig. 8.5a appear flat across bag sizes, the CTM model projects a more realistic view, where its performance is seen to gain by including more bags. For the former models, this phenomenon is not a consequence of a design flaw; rather, it is expected due to the effects of the supplementary pathways. That is, both models are encouraged to learn more pathways from M, because the average pathway size for an example in M is ∼500 whereas in BioCyc T2 & 3 it is ∼195, while only 100 pathways are retained for each bag. By excluding M (through enabling the c2m option), we observe that the log predictive distributions of SOAP and SPREAT are similar to that of CTM, as shown in Fig. 8.5b, which supports our previous discussion. We found that b = 200 gives a good set of overlapping pathways while having, on average, ∼15 distinct pathways per bag out of 2526 pathways. Fixing b = 200, we searched for an optimum k value. As illustrated in Fig. 8.5c, the performance of both SOAP and SPREAT deteriorates (< −0.6) when k > 100.
Taken together, we suggest the settings b ∈ Z_[150,300] and k ∈ Z_[50,100] to recover good bag and pathway distributions.

Figure 8.6: Visualizing 50 randomly picked bags for each model, trained with b = 200: (a) SOAP (#bags: ∼74; #pathways: ∼138); (b) SPREAT (#bags: ∼55; #pathways: ∼141); (c) CTM (#bags: ∼50; #pathways: ∼69); (d) SOAP+c2m (#bags: ∼52; #pathways: ∼8); (e) SPREAT+c2m (#bags: ∼51; #pathways: ∼8). The first term within the brackets, i.e., #bags, corresponds to the average number of correlated bags, while the second term, i.e., #pathways, represents the average pathway size per bag. The circles represent bags, and their sizes reflect the correlation strength with other bags. Two clusters of bags can be seen for the last three models, indicating that the two clusters contain distinct pathways.

8.5.2 Bag Visualization

Experimental setup. Here, we visualized the bags discovered using the models in Section 8.5.1 with the goal of assessing their quality. First, we examined the influence of augmented pathways on bag correlation patterns, i.e., Σ, in SOAP and SPREAT, and contrasted the outputs with CTM, SOAP+c2m, and SPREAT+c2m. Then, we performed an in-depth comparison of the sparse models (SOAP+c2m and SPREAT+c2m) with CTM to ensure that our modeling assumptions are aligned with the observed data, where a bag containing more focused and fewer pathways is preferred. For all experiments here, we applied the settings described in Section 8.4.1.

Experimental results. From Fig. 8.6, we notice two core findings. First, in contrast to CTM, SOAP+c2m, SPREAT, and SPREAT+c2m, the bags in SOAP are densely connected (∼74 bags), where the width of the edges indicates the strength of the correlations. Second, leveraging background pathways in SOAP and SPREAT resulted in gathering more pathways for each bag. These observations demonstrate the influence of M (obtained from mlLGPR) on the pathway distribution over bags. As an example, Fig.
8.7 shows samples from BioCyc T2 & 3 pathways and the corresponding background pathways after projecting them onto 2D space using UMAP [220]. The colors encode samples from BioCyc T2 & 3 pathways that are clustered using the K-means algorithm [160] with 10 groups. While examples of the same color in BioCyc T2 & 3 pathways form clear distinct groups, the same examples are seen to be intermixed for M, possibly comprising many false-positive pathways, as depicted in Fig. 8.4, where many pathways (represented as column bars) are differentially distributed between BioCyc T2 & 3 pathways and M.

Figure 8.7: 2D UMAP projections of (a) BioCyc T2 & 3 pathways and (b) the corresponding background pathways. Fig. 8.7a serves as a basis for color-coding, where examples of one color in BioCyc are clustered together while the same examples are seen to be spread across the augmented BioCyc pathways (M) in Fig. 8.7b. Better viewed in color.

As for the collapsed models, they are observed to share similar behaviors with CTM (Fig. 8.5b). However, their bag distributions consist of fewer pathways than CTM's (Fig. 8.6). For example, Fig. 8.8 shows 50 randomly selected bags with their associated 100 pathways, where CTM is shown to encapsulate more pathways per bag (encoded by darker gradient colors) while SOAP+c2m and SPREAT+c2m exhibit a sparse distribution.

The results from these experiments link with our previous remarks, entailing that SOAP and SPREAT are equipped to reduce irrelevant pathways by applying dual sparseness. In particular, SPREAT is observed to generate fewer correlations than SOAP. With regard to incorporating supplementary pathways, SOAP and SPREAT are both sensitive to false-positive pathways; therefore, including accurate augmented pathways may recover a better pathway distribution over bags.

8.5.3 Assessing the History Probability

Experimental setup.
After analyzing the correlated models in Sections 8.5.1 and 8.5.2, here we investigated the accumulated history probability H while relabeling the golden T1 datasets during the feed-forward phase, using the settings in Section 8.4.1. Recall that reMap employs four constraints in the feed-backward phase to select bags with high probability to annotate data. The hope is to obtain an optimum h_bag for remapping pathways to bags. For demonstration purposes, we focused only on SOAP to collect bags, although any of the models could be used to conduct this experiment.

Figure 8.8: Heatmap representing the bag distributions of CTM, SOAP+c2m, and SPREAT+c2m for 50 randomly picked bags with their associated 100 pathways. The entries are color-coded on a gradient scale ranging from light gray to dark gray, where higher intensity entails higher probability.

Experimental results. Fig. 8.9 shows H during the annotation process of the six datasets. In the beginning, reMap attempts to select a set of bags from D̂, corresponding to the maximum number of bags that may exist for each example. However, with progressive updates and calibration of parameters, reMap rectifies the bag assignments, picking fewer bags containing more informative content associated with each sample. For instance, at q = 1, HumanCyc is tagged with multiple bags, represented by darker colors, where higher intensity indicates a high probability of assigning the corresponding bag to HumanCyc, but at q = 10 fewer than 35 bags were retained out of 200 possible bags. Table 8.1 shows the 11 HumanCyc pathways corresponding to the assigned bag index 16, which includes L-proline biosynthesis I, thyroid hormone metabolism II (via conjugation and/or degradation), and bile acid biosynthesis, neutral pathway.
The pathway set in bag 16 for HumanCyc is an indication that reMap is robust to a certain degree and able to capture relevant pathways for a bag. The bag preferences of reMap can be further examined by computing the similarity (measured by the cosine distance metric) among datasets according to the EC, pathway, and bag spaces, depicted in Fig. 8.11. As elucidated in Section 8.3.2, reMap exploits four constraints to strike a balance among these spaces. This is particularly visible for AraCyc and EcoCyc, where the ECs (in Fig. 8.11a) have a similar density to the remaining golden data; however, the pathways are differentially represented (in Fig. 8.11b). Hence, this information is propagated down to the bag annotation process depicted in Fig. 8.11c. Overall, reMap exhibits a steady, albeit slow, annotation progression while preserving consistency across the different spaces.

Figure 8.9: Snapshot of the history probability H during the relabeling process of golden T1 data for 10 successive rounds. The x-axis shows 200 bags while the y-axis corresponds to the data. Darker colors indicate a high probability of assigning bags to the corresponding data.

8.5.4 Metabolic Pathway Prediction

Experimental setup. In this section, we quantified the quality of annotation by leveraging the obtained bags for the pathway prediction task. We used the correlated models in Section 8.5.2 to obtain bags. Next, we trained leADS using the configuration discussed in Section 8.4.1. After 10 epochs, we reported the results on the golden T1 data against triUMPF using the four evaluation metrics presented in Chapter 4.4.1.

Experimental results. From Table 8.2, we observe that all correlated models outperform triUMPF on HumanCyc, AraCyc, YeastCyc, and TrypanoCyc in terms of average F1 scores.

Figure 8.10: The probability history H during annotation of golden T1 data after 10 successive rounds. The x-axis shows 200 bags while the y-axis corresponds to the associated probability.
Darker colors indicate a high probability of assigning bags to the corresponding data. In Table 8.2, numbers in boldface represent the best performance score in each column, while underlined scores indicate the best performance among the correlated models. In addition, the sensitivity scores are also seen to improve, with the exception of EcoCyc. In summary, this experiment demonstrates that the bag-based approach improves pathway prediction performance.

Figure 8.11: Similarity among golden datasets in (a) EC, (b) pathway, and (c) bag spaces, measured by cosine distance. Best viewed in color.

Table 8.1: The 11 HumanCyc pathways corresponding to bag index 16.

#  | MetaCyc Pathway ID                    | MetaCyc Pathway Name
1  | PROSYN-PWY                            | L-proline biosynthesis I
2  | PWY-5137                              | fatty acid β-oxidation III (unsaturated, odd number)
3  | PWY-3982                              | uracil degradation I (reductive)
4  | TRNA-CHARGING-PWY                     | tRNA charging
5  | MANNOSYL-CHITO-DOLICHOL-BIOSYNTHESIS  | protein N-glycosylation initial phase (eukaryotic)
6  | PWY-5667                              | CDP-diacylglycerol biosynthesis I
7  | PWY0-662                              | PRPP biosynthesis I
8  | PWY-46                                | putrescine biosynthesis III
9  | PWY-6261                              | thyroid hormone metabolism II (via conjugation and/or degradation)
10 | PWY-6061                              | bile acid biosynthesis, neutral pathway
11 | PWY66-388                             | fatty acid α-oxidation III

8.6 Summary

In this chapter, we demonstrated the merit of iteratively relabeling pathway datasets with a new label set, referred to as bags, using the reMap pipeline. Specifically, reMap transforms pathway labels into bags, and then trains examples jointly with pathways and bags to optimize the relabeling process. In addition, two novel hierarchical mixture models, SOAP and SPREAT, were developed to collect bags. Both models were motivated by the problem of missing pathways, which is very common in pathway datasets. Backed by our empirical studies on the pathway prediction task with the golden T1 datasets, reMap showed promising results in boosting performance relative to triUMPF.
In the next chapter, we perform a rigorous analysis of bag-based metabolic pathway prediction using leADS.

Hamming Loss ↓
Methods            | EcoCyc | HumanCyc | AraCyc | YeastCyc | LeishCyc | TrypanoCyc
triUMPF            | 0.0435 | 0.0954   | 0.1560 | 0.0649   | 0.0443   | 0.0776
SOAP+vrank         | 0.0598 | 0.0819   | 0.1449 | 0.0724   | 0.0629   | 0.0566
SPREATE+vrank      | 0.0519 | 0.0827   | 0.1489 | 0.0748   | 0.0629   | 0.0503
CTM+vrank          | 0.0558 | 0.0835   | 0.1425 | 0.0804   | 0.0622   | 0.0503
SOAP+c2m+vrank     | 0.0590 | 0.0780   | 0.1457 | 0.0772   | 0.0614   | 0.0534
SPREATE+c2m+vrank  | 0.0542 | 0.0796   | 0.1520 | 0.0772   | 0.0598   | 0.0558

Average Precision Score ↑
Methods            | EcoCyc | HumanCyc | AraCyc | YeastCyc | LeishCyc | TrypanoCyc
triUMPF            | 0.8662 | 0.6080   | 0.7377 | 0.7273   | 0.4161   | 0.4561
SOAP+vrank         | 0.8900 | 0.6800   | 0.8600 | 0.6150   | 0.3200   | 0.5800
SPREATE+vrank      | 0.9400 | 0.6750   | 0.8350 | 0.6000   | 0.3200   | 0.6200
CTM+vrank          | 0.9150 | 0.6700   | 0.8750 | 0.5650   | 0.3250   | 0.6200
SOAP+c2m+vrank     | 0.8950 | 0.7050   | 0.8550 | 0.5850   | 0.3300   | 0.6000
SPREATE+c2m+vrank  | 0.9250 | 0.6950   | 0.8150 | 0.5850   | 0.3400   | 0.5850

Average Recall Score ↑
Methods            | EcoCyc | HumanCyc | AraCyc | YeastCyc | LeishCyc | TrypanoCyc
triUMPF            | 0.7590 | 0.3835   | 0.3529 | 0.3319   | 0.7126   | 0.6229
SOAP+vrank         | 0.5798 | 0.4875   | 0.3373 | 0.5371   | 0.7356   | 0.6629
SPREATE+vrank      | 0.6124 | 0.4839   | 0.3275 | 0.5240   | 0.7356   | 0.7086
CTM+vrank          | 0.5961 | 0.4803   | 0.3431 | 0.4934   | 0.7471   | 0.7086
SOAP+c2m+vrank     | 0.5831 | 0.5054   | 0.3353 | 0.5109   | 0.7586   | 0.6857
SPREATE+c2m+vrank  | 0.6026 | 0.4982   | 0.3196 | 0.5109   | 0.7816   | 0.6686

Average F1 Score ↑
Methods            | EcoCyc | HumanCyc | AraCyc | YeastCyc | LeishCyc | TrypanoCyc
triUMPF            | 0.8090 | 0.4703   | 0.4775 | 0.4735   | 0.5254   | 0.5266
SOAP+vrank         | 0.7022 | 0.5678   | 0.4845 | 0.5734   | 0.4460   | 0.6187
SPREATE+vrank      | 0.7416 | 0.5637   | 0.4704 | 0.5594   | 0.4460   | 0.6613
CTM+vrank          | 0.7219 | 0.5595   | 0.4930 | 0.5268   | 0.4530   | 0.6613
SOAP+c2m+vrank     | 0.7061 | 0.5887   | 0.4817 | 0.5455   | 0.4599   | 0.6400
SPREATE+c2m+vrank  | 0.7298 | 0.5804   | 0.4592 | 0.5455   | 0.4739   | 0.6240

Table 8.2: Predictive performance of each comparing algorithm on 6 golden T1 datasets. For each performance metric, '↓' indicates that a smaller score is better while '↑' indicates that a higher score is better.
Values in boldface represent the best performance score while the underlined score indicates the best performance among the correlated models.

Chapter 9
Multi-label Pathway Prediction based on Active Dataset Subsampling

"Ants are good citizens, they place group interests first."
– Clarence Day

In Chapters 5 and 7, we introduced mlLGPR and triUMPF, respectively, to automatically recover pathways from large-scale pathway datasets. However, several complications remain that can degrade prediction performance, including inadequately labeled training data, missing feature information, and inherent imbalances in the distribution of enzymes and pathways within a dataset. This class imbalance problem is commonly encountered by the machine learning community when the proportion of instances over class labels within a dataset is uneven, resulting in poor predictive performance for underrepresented classes. In this chapter, we present leADS (multi-label learning based on active dataset subsampling), which leverages the idea of subsampling examples from data to reduce the negative impact of training loss. Specifically, leADS performs an iterative procedure to: (a) construct an acquisition model in an ensemble framework; (b) subselect informative points using an appropriate acquisition function; and (c) train on the subsampled data. The ensemble approach was followed to enhance the generalization ability of multi-label learning systems by concurrently building and executing a group of multi-label base learners, where each is assigned a portion of samples, to learn pathways.
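The ensemble of multi-label base learners described above can be sketched as follows. This is a minimal, hypothetical illustration (not the thesis implementation): each of g members fits a 1-vs-All logistic model on a random portion of toy data, and predictions are averaged over members.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_1_vs_all(X, Y, lr=0.1, epochs=200):
    """Fit one logistic classifier per label (multi-label 1-vs-All)."""
    n, m = X.shape
    W = np.zeros((Y.shape[1], m))
    for _ in range(epochs):
        P = sigmoid(X @ W.T)           # (n, t) predicted probabilities
        W -= lr * (P - Y).T @ X / n    # gradient of the mean logistic loss
    return W

# Toy multi-label data: n samples, m features, t labels, g ensemble members.
n, m, t, g = 120, 10, 5, 4
X = rng.normal(size=(n, m))
true_W = rng.normal(size=(t, m))
Y = (X @ true_W.T > 0).astype(float)

# Each member trains on its own randomly selected portion of the data.
ensemble = []
for _ in range(g):
    idx = rng.choice(n, size=n // 2, replace=False)
    ensemble.append(train_1_vs_all(X[idx], Y[idx]))

# Predictive probabilities are averaged over the g members (soft voting).
probs = np.mean([sigmoid(X @ W.T) for W in ensemble], axis=0)
pred = (probs > 0.5).astype(float)
print("mean per-label accuracy:", (pred == Y).mean())
```

The random-portion allocation is what makes the members diverse; averaging their probabilities is one simple way to combine them.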
leADS was evaluated on the pathway prediction task using the datasets in Chapter 4.2, where the experiments revealed that leADS achieved compelling and competitive performance against all the previous machine learning models and the rule-based approaches.

9.1 Introduction

Recall that metabolic pathways are chemical reaction chains occurring in a cell, often catalyzed by a collaboration of enzymes, where metabolites (or chemical products) are built up or broken down for cellular processes. Reconstruction of metabolic pathways is pivotal in studying biological systems, as interpreting pathways helps decipher relationships between genotype and phenotype, and may help elucidate essential mechanisms underlying an organism's metabolism [167].

In Chapters 5 and 7, we presented two multi-label based approaches, called mlLGPR and triUMPF, respectively, that performed effectively on organismal genomes. However, one of the major obstacles to fully adopting machine learning models is the limited availability of high-quality datasets in which the potential pathways for each instance are properly annotated. As such, reMap was developed in Chapter 8, which, instead of adapting models to learn from pathway data, takes an alternative route by projecting pathways onto a different set of labels, known as bags, where a bag is comprised of correlated pathway labels. This process resulted in very good performance while suggesting that we consider subselecting samples that are more informative to the relabeling process. For example, the BioCyc T2 & T3 collection contains 112 samples related to Salmonella species with more than 100 cross-intersecting pathways. Therefore, it is conceivable that BioCyc contains many Salmonella-related examples that either do not contribute to or may negatively affect the model's performance. Moreover, pathways in BioCyc T2 & T3 follow a power law distribution (Fig. 9.1), where more than 30% of pathways were observed to occur in fewer than 25 BioCyc examples.
These less frequent pathways are referred to as tail labels (class imbalance). A potential approach to reduce the impact of redundant samples on training and to solve the tail labels problem lies at the heart of subsampling [182].

In response, this chapter presents leADS (Fig. 9.2), which extends previously established work in the active dataset subsampling (ADS) domain [64] by incorporating an ensemble of multi-label learners [102, 269, 271, 364, 386] jointly with a hard example mining strategy [64] to address the challenge of subselecting informative samples from genomic data. With regard to example selection, the literature offers two opposing strategies that work well under different scenarios: (i) incremental learning from easier to harder instances and (ii) hard example mining. While the easy instance mining approach may be effective when a model tries to learn from data tainted with many noisy labels (as in BioCyc T2 & T3) by gradually increasing the loss of hard examples, as in curriculum learning or self-paced learning [29, 179, 210, 221, 257, 304], sampling harder instances is more suitable for

Figure 9.1: Number of samples for each pathway in the BioCyc T2 & T3 data. The horizontal axis indicates the indices of pathways while the vertical axis represents the number of associated examples in the BioCyc T2 & T3 collection.

the BioCyc T2 & T3 data, which consists of more than 1450 pathway labels. This is because the size of both the BioCyc data and its corresponding pathways is large, where hard example mining can accelerate the learning process efficiently [6, 111, 210, 304, 393].

Specifically, leADS executes, in parallel, a group of multi-label base learners (constituting an ensemble), where each is allocated to learn from a randomly selected portion of samples. Then, each member of the ensemble selects data according to predefined choices of: (i) sample size and (ii) an acquisition function.
Afterwards, samples from all base learners are aggregated and reduced according to the sample size. These samples are then fed into the next round for all members of the ensemble, while additional points are incorporated as required to continue learning. At the end of training, leADS produces samples packed with informative content that could aid researchers in their investigations. The whole process is based on the ensemble learning approach, which is known to enhance generalization ability (at the expense of training and, sometimes, performance costs) while being robust to the overfitting problem [364]. Ensemble learning is also well suited to confronting the class-imbalance problem [271, 386].

To verify the effectiveness of leADS, we conducted three experimental studies: parameter sensitivity, scalability, and metabolic pathway prediction. In the latter case study, leADS was demonstrated to significantly improve pathway prediction results on 10 datasets (see Chapter 4.2). To the best of our knowledge, this work is the first study that leverages subsampling with multi-label ensemble learning to address the metabolic pathway inference problem from enzymatic reactions. Moreover, leADS is not constrained to genomic data and can be applied to any type of multi-label dataset.

9.2 Problem Formulation

Here, we state the problem raised in this chapter.

Multi-label Pathway Active Dataset Subsampling. Given either an aggregated set of the two multi-labeled datasets, $S_{path}$ (Def. 3.9) and $S_{bag}$ (Def. 8.4), denoted by $S_A = \{(\mathbf{x}^{(i)}, \mathbf{y}^{(i)}, \mathbf{d}^{(i)}) : 1 \le i \le n\}$, or $S_{path}$ alone, the goal is to retrieve a subset of $S_A$ (or $S_{path}$), denoted by $S_{per\%}$, where $per\%$ is a pre-specified hyperparameter indicating the proportion of samples to be chosen from $S_A$ (or $S_{path}$), such that learning on $S_{per\%}$ incurs a predictive score similar to that obtained by training on the full dataset, $S_A$ (or $S_{path}$).

9.3 The leADS Method

In this section, we describe the leADS framework, which consists of three consecutive steps: (i) building an acquisition model, (ii) active dataset subsampling, and (iii) learning using the reduced subsampled data. These three steps interact with each other in an iterative process, as illustrated in Fig. 9.2. At the very first iteration, a set $S^{0}_{per\%}$ is initialized with randomly selected data. In each subsequent iteration $q$, instead of re-initializing $S^{q}_{per\%}$ with randomly selected samples, the data $S^{q-1}_{per\%}$ collected from the previous iteration $q-1$ is used, thereby constituting the build-up scheme followed in many active learning approaches ([65]). This process is repeated until the maximum number of rounds $\tau$ is reached. Below, we provide a detailed description of each step at round $q-1$ for $S_A$. Similar steps can be applied to $S_{path}$ by excluding bags in the factorization-based approach.

Figure 9.2: A schematic diagram showing the proposed leADS pipeline. Using a multi-label (bag or pathway) dataset (a), leADS randomly selects data at the very first iteration (b), then it builds $g$ members of an ensemble (c), where each is trained on a randomly selected portion of the training set. Next, leADS applies an acquisition function (d), based on either entropy, mutual information, variation ratios, or normalized PSP@k, to pick $per\%$ samples. Upon selecting a set of subsampled data, leADS performs an overall training on these samples (e).
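The iterative build-up scheme above can be sketched as a simple loop. This is an illustrative skeleton under stated assumptions: the trained acquisition model is faked with random probabilities (a stand-in for Eqs. 9.3.2/9.3.3), and a binary-entropy acquisition function stands in for the four choices described later.

```python
import numpy as np

rng = np.random.default_rng(1)

def entropy(P):
    """Per-sample acquisition score: sum of binary entropies over labels."""
    P = np.clip(P, 1e-12, 1 - 1e-12)
    return -(P * np.log(P) + (1 - P) * np.log(1 - P)).sum(axis=1)

n, t, per, tau = 200, 8, 0.2, 3        # dataset size, labels, per%, rounds
budget = int(per * n)                  # number of samples to keep per round

# (b) initialize S^0_per% with randomly selected data.
selected = rng.choice(n, size=budget, replace=False)

for q in range(tau):
    # (c) "train" the ensemble on the current subset; the resulting
    # predictive probabilities are faked here with a random table.
    P = rng.uniform(size=(n, t))
    # (d-f) score every sample with the acquisition function and carry
    # the top per% most uncertain points into the next round.
    scores = entropy(P)
    selected = np.argsort(-scores)[:budget]

print("final subset size:", len(selected))
```

Swapping in a real trained model and one of the four acquisition functions recovers the full pipeline of Fig. 9.2.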
The process (b-e) is repeated $\tau$ times (f), where at each round the selected $per\%$ samples are fed back into the dataset, and another set of samples is picked in addition to the previously selected set. If the current iteration $q$ reaches the desired number of rounds $\tau$, then training is terminated while producing the final $per\%$ data points (g).

9.3.1 Building an Acquisition Model

Given $S_A = \{(\mathbf{x}^{(i)}, \mathbf{y}^{(i)}, \mathbf{d}^{(i)}) : 1 \le i \le n\}$, this stage builds an acquisition model, denoted by $E$, which consists of $g$ models. As depicted in Fig. 9.2(c), each model of the ensemble, say $s$, is devoted to learning a binary classifier for each bag according to the multi-label 1-vs-All approach (in Chapter 3), based on one of two functions: dependency or factorization. Both are based on inquiring the posterior predictive uncertainty with regard to a new test point $\mathbf{x}^*$.

1. Dependency. This strategy assumes samples are conditionally independent from bags given pathways. So, the uncertainty about $\mathbf{x}^*$ for a bag $d_j$ is estimated according to:

$$
\begin{aligned}
p(d_j = +1 \mid \mathbf{x}^*, S_A) &= \int p(d_j = +1 \mid \mathbf{x}^*, \Theta^{dep}_j)\, p(\Theta^{dep}_j \mid S_A)\, \partial\Theta^{dep}_j \\
&= \sum_{k \in B_j, k=+1} \iint p(d_j = +1 \mid l_k, \Theta^{bag}_j) \times p(y_k \mid \mathbf{x}^*, \Theta^{path}_k)\, p(\Theta^{path}_k \mid S_A) \times p(\Theta^{bag}_j \mid S_A)\, \partial\Theta^{path}_k\, \partial\Theta^{bag}_j
\end{aligned}
\tag{9.3.1}
$$

where $\Theta^{dep} = \{\Theta^{bag} \in \mathbb{R}^{b \times m}, \Theta^{path} \in \mathbb{R}^{t \times r}\}$ denotes the bag and pathway parameters, respectively; $b$ is the bag size, $m$ is the dimension of features, $t$ is the number of pathways, and $r$ is the total number of enzymatic reactions. Eq. 9.3.1 involves summation and marginalization over the $\Theta^{dep}$ parameters, which is hard to compute [184, 235]. One way to mitigate this issue is to approximate the above equation using Monte Carlo (MC) techniques [182] by generating multiple samples for each member of $E$, resulting in the following formula:

$$
\begin{aligned}
p(d_j = +1 \mid \mathbf{x}^*, S_A) &\approx \frac{1}{g} \sum_{s \in g} p^s_j \\
p^s_j &= \sum_{k \in B_j, k=+1} p(d_j = +1 \mid l_k, \Theta^{s,bag}_j)\, p(y_k \mid \mathbf{x}^*, \Theta^{s,path}_k) \\
p(d_j = +1 \mid l_k, \Theta^{s,bag}_j) &= \frac{1}{1 + e^{-\Theta^{s,bag,\top}_j \left|\tilde{\mathbf{c}}_j - P_k\right|}} \\
p(y_k \mid \mathbf{x}^*, \Theta^{s,path}_k) &= \frac{1}{1 + e^{-\Theta^{s,path,\top}_k \mathbf{x}^*}} \\
\tilde{\mathbf{c}}_j &= \frac{\alpha}{n_j} \sum_{k \in Y_i, k \wedge B_j, k=+1} \frac{P_k}{\|P_k\|}
\end{aligned}
\tag{9.3.2}
$$

where $\Theta^{s,bag}_{[.]}$ and $\Theta^{s,path}_{[.]}$ are sampled from $q^{bag}(\Theta)$ and $q^{path}(\Theta)$, respectively, which themselves are considered to be in the same family of distributions as the true hidden variables $p(\Theta^{path}_k \mid S_A)$ and $p(\Theta^{bag}_j \mid S_A)$. $\tilde{\mathbf{c}}_j$ represents the centroid for bag $j$, and $P \in \mathbb{R}^{t \times m}$ is a pathway representation matrix obtained from pathway2vec (Chapter 6).

Figure 9.3: The two possible strategies for building the acquisition model. (Left) The dependency-based acquisition model assumes input data $\mathbf{x}^{(i)}$ is associated with multiple labels $\mathbf{y}^{(i)}$, which are in turn associated with multiple bags $\mathbf{d}^{(i)}$. (Right) The factorization-based method assumes both $\mathbf{y}^{(i)}$ and $\mathbf{d}^{(i)}$ are independent of each other, given $\mathbf{x}^{(i)}$.

2. Factorization. Under this approach, both pathways and bags are factorized given $\mathbf{x}^*$. Similar to the previous expression, the MC estimation of the factorized posterior predictive is:

$$
\begin{aligned}
p(d_j = +1, y_k = +1 \mid \mathbf{x}^*, S_A) &\approx \frac{1}{g} \sum_{s \in g} p^s_j \\
\text{where}\quad p^s_j &= p(d_j = +1 \mid \mathbf{x}^*, \Phi^{s,bag}_j)\, p(y_k = +1 \mid \mathbf{x}^*, \Theta^{s,path}_k) \\
p(d_j = +1 \mid \mathbf{x}^*, \Phi^{s,bag}_j) &= \frac{1}{1 + e^{-\Phi^{s,bag,\top}_j \mathbf{x}^*}}
\end{aligned}
\tag{9.3.3}
$$

where $\Phi^{bag} \in \mathbb{R}^{b \times r}$ are the factorized bag parameters.

Both strategies are illustrated in Fig. 9.3. As can be seen, factorization decomposes the input data $\mathbf{x}$ into elementary units of $b$ independent bags and $t$ pathways so that the optimization of $E$ becomes applicable (see Section 9.4). The dependency-based approach, however, takes an alternative route, aiming to maintain the integrity between bags and pathways, forming a correlation structure that can be efficiently exploited through ensemble-based multi-label learning, as demonstrated in numerous studies [62, 269-271, 276, 311].

It is important to note that while the number of base learners in $E$ plays a pivotal role in predictive performance, theoretical studies addressing the number of members are still underdeveloped.
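The factorized MC estimate of Eq. 9.3.3 can be sketched numerically. This is a toy illustration with randomly drawn parameters standing in for the $g$ trained members; the per-member score is the product of two sigmoids, averaged over members.

```python
import numpy as np

rng = np.random.default_rng(2)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

g, r = 4, 6                       # ensemble size, feature dimension
x = rng.normal(size=r)            # test point x*

# One (Phi_bag_j, Theta_path_k) pair per member s (toy stand-in parameters).
Phi_bag = rng.normal(size=(g, r))
Theta_path = rng.normal(size=(g, r))

# Eq. 9.3.3: p(d_j=+1, y_k=+1 | x*) ~ (1/g) * sum_s p_j^s, where
# p_j^s = sigmoid(Phi_j^T x*) * sigmoid(Theta_k^T x*).
p_s = sigmoid(Phi_bag @ x) * sigmoid(Theta_path @ x)
p_joint = p_s.mean()
print("MC estimate of the joint posterior predictive:", p_joint)
```

Averaging over members is what turns the intractable marginalization into a cheap ensemble mean, at the cost of MC error that shrinks as $g$ grows.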
Furthermore, the error of the MC estimation is expected to decrease (in theory) by adding more samples and members to $E$; however, due to the label correlation problem, the computational burden at both training and inference stages may become overly complex, as examined in Section 9.7.2.

Figure 9.4: The two approaches for constructing a multi-label learning algorithm: the individual multi-label learner (left) and ensemble-based multi-label learning (right).

If we use only one multi-label base learner (Fig. 9.4a), then, depending on the learning process, we may be able to exploit label correlations. However, the individual learner may still suffer from generalization error owing to overfitting. In contrast, the goal of multi-label ensemble learning (Fig. 9.4b) is to build a group of multi-label base learners that are not only accurate but also diverse (with regard to the allocated samples), thereby potentially reducing the overfitting risk.

9.3.2 Sub-sampling Dataset

During this step, a subset of $S_A$, denoted $S^{q-1}_{per\%} \subseteq S_A$, is picked for each member using an acquisition function $f: \mathbf{x} \to \mathbb{R}$, where $per\%$ is a pre-specified threshold indicating the proportion of samples to be chosen from $S_A$ at iteration $q-1$. The predictive uncertainty distribution calculated in the previous step is accommodated into one of the following four acquisition functions for subsampling: entropy, mutual information, variation ratios, and normalized PSP@k. In all of these functions, we retrieve the top $per\%$ samples with the highest acquisition (or uncertainty) values, reflecting a ranking-based scoring strategy. Although more sophisticated active learning based selection methods, such as [117, 119, 195, 196, 303, 330, 352, 363], can be utilized to improve the selection criterion, they are computationally intensive and hence were not considered.

1. Entropy ($\mathbb{H}$) [297]. The entropy measures the uncertainty of a sample given the predictive distribution of that sample:

$$\mathbb{H} = -\mathbf{p}^\top \log(\mathbf{p}) \tag{9.3.4}$$

where $\mathbf{p}$ is a vector of predictive probabilities over $b$ bags (or $t$ pathways).

2. Mutual information ($\mathbb{M}$) [306]. This function looks for low mutual information between the $g$ models, encouraging samples with high disagreement to be selected during the data acquisition process:

$$\mathbb{M} = \mathbb{H} - \frac{1}{g} \sum_{s \in g} \mathbb{H}^s \tag{9.3.5}$$

where $\mathbb{H}^s$ denotes the entropy obtained from an individual member of $E$ for a sample before marginalization. Since entropy is always positive, the maximum possible value for $\mathbb{M}$ is $\mathbb{H}$. However, when the models make similar predictions, then $\frac{1}{g}\sum_{s \in g} \mathbb{H}^s \to \mathbb{H}$, resulting in $\mathbb{M} \to 0$, its minimum value [64]. Note that this formula is similar to multi-label negative correlation learning [301], which estimates the pairwise negative correlation of each learner's error with respect to the errors of the other members of $E$.

3. Variation ratios ($\mathbb{V}$) [103]. This metric measures the number of members of $E$ that disagree with the majority vote for a sample according to a desired pathway size $k$, where larger values indicate higher uncertainty:

$$\mathbb{V} = 1 - \frac{1}{|V|\,g} \sum_{s \in g} \left| \left(\{\arg p^s_j : 1 \le j \le k\}\right) \cap V \right| \quad \text{where } V = \mathrm{Mode}_{s \in g}\left(\{\arg p^s_j : 1 \le j \le k\}\right) \tag{9.3.6}$$

$V$ corresponds to the disagreement over $k$ bags (or pathways) across the $g$ models, where $k \in \mathbb{Z}_{>0}$ is a pre-specified number of bags (or pathways) to be considered in computing the mode operation.

4. Normalized propensity scored precision at k (nPSP@k). This is a modified version of PSP@k [148], which measures the average precision of the top $k$ relevant bags (or pathways) for a data point $i$. The higher the score for $i$, the lower the uncertainty:

$$\text{nPSP@k} = 1 - \mathrm{Norm}\left(\frac{1}{k} \sum_{j \in \mathrm{rank}_k(\mathbf{p})} \frac{y_j}{ps_j}\right) \quad \text{where } ps_j = \frac{1}{1 + (n_j + 1)^{-1}} \tag{9.3.7}$$

where $\mathrm{Norm}(.)$ scales the score within $[0,1]$. The term $\mathbf{p}$ is a vector of predictive probabilities over $b$ bags (or $t$ pathways), and $\mathrm{rank}_k(\mathbf{p})$ returns the indices of the $k$ largest values in $\mathbf{p}$, ranked in descending order, where $k \in \mathbb{Z}_{>0}$ is a hyperparameter. $ps_j$ is the propensity score for the $j$-th bag (or pathway), where $n_j$ is the number of positive training instances for the bag (or pathway) $j$. In the context of the extreme multi-label problem, PSP@k was used to derive an upper bound for missing/misclassified labels ([344]), and is reported to be a good performance metric for long-tail distributions in which a significant portion of labels are tail labels ([19, 147, 259, 260]).

9.3.3 Training on the Reduced Dataset

Depending on the acquisition model, each member of $E$ is assigned to train on randomly selected samples from $S^{q-1}_{per\%}$, where each member comprises a set of $b$ and $t$ models, representing bags and pathways, respectively, mirroring an individual multi-label learner, as shown in Fig. 9.4. $S^{q-1}_{per\%}$ is expected to contain hard samples that are difficult to learn and classify. By focusing training on only these samples, leADS provides an alternative treatment for boosting the overall performance. After learning, leADS aggregates samples from all members, selects the top $per\%$ based on their acquisition values, and feeds these selected samples back to repeat the process until the maximum number of rounds $\tau$ is reached.

9.4 Optimization

All parameters are updated based on the samples allocated to each base learner. The two objective functions, Eqs. 9.3.2 and 9.3.3, can be solved by decomposing them into $t$ and $b$ independent binary classification problems according to the multi-label 1-vs-All approach (see Appendix Section G.1), which enables us to train them in parallel. For example, the optimization for a member $s$:

$$
\begin{aligned}
\min_{\Theta^{s,bag}} &\sum_{i \in n_s} \sum_{j \in b} \sum_{k \in B_j, k=+1} \log\left(1 + e^{-d^{(i)}_j \Theta^{s,bag,\top}_j \left|\tilde{\mathbf{c}}^{(i)}_j - P_k\right|}\right) + \sum_{j \in b} \lambda_1 \|\Theta^{s,bag}_j\|_{2,1} \\
\min_{\Phi^{s,bag}} &\sum_{i \in n_s} \sum_{j \in b} \sum_{k \in B_j, k=+1} \log\left(1 + e^{-d^{(i)}_j \Phi^{s,bag,\top}_j \mathbf{x}^{(i)}}\right) + \sum_{j \in b} \lambda_2 \|\Phi^{s,bag}_j\|_{2,1} \\
\min_{\Theta^{s,path}} &\sum_{i \in n_s} \sum_{k \in t} \log\left(1 + e^{-y^{(i)}_k \Theta^{s,path,\top}_k \mathbf{x}^{(i)}}\right) + \sum_{k \in t} \lambda_3 \|\Theta^{s,path}_k\|_{2,1}
\end{aligned}
\tag{9.4.1}
$$

where $\|.\|_{2,1}$ is the $L_{2,1}$ regularization term, which is the sum of the Euclidean norms of the columns of a matrix.
The $L_{2,1}$ norm imposes sparsity on the model's parameters to minimize the negative effect of label correlations, where $\lambda_{[.]} \in \mathbb{R}_{>0}$ governs the relative contributions of the $L_{2,1}$ term and the remaining terms. Although the joint formula in Eq. 9.4.1 is convex, the logistic log-loss function still poses a problem in that no analytical solution exists for it. A possible treatment is to adopt a standard gradient-based method to learn all the model parameters gradually. In this work, mini-batch gradient descent [190] was used, which begins with an initial random guess for all of leADS's parameters and repeatedly performs updates to minimize Eq. 9.4.1. A similar line of discussion follows for the parameters $\Theta^{s,path}$ obtained using $S_{path}$ for each member $s$.

9.5 Efficient Pathway Label Prediction

Predictions need to be made efficiently for the downstream pathway classification task. For a test point $\mathbf{x}$, the existing 1-vs-All approaches (Chapter 3.4) are infeasible for low-latency and high-throughput applications, as their prediction times are linear in the number of both labels and members ($O(tg)$). In addition, some of those approaches apply a strict cut-off threshold $\xi \in \mathbb{R}_{\ge 0}$ so that only the labels with the largest probability values are retained: $L^{path}(\mathbf{x}) = \{k : p(y_k = +1 \mid \mathbf{x}, \Theta^{s,path}_k) \ge \xi, \forall k \in t, \forall s \in g\}$, where $p(y_k = +1 \mid \mathbf{x}, \Theta^{s,path}_k) = \frac{1}{1 + e^{-\Theta^{s,path,\top}_k \mathbf{x}}}$ is the well-known non-linear logistic function. This strategy may neglect a set of true labels. leADS addresses these limitations through the factorization-based acquisition model type by shortlisting a set of bags for each member as $L^{bag}_s(\mathbf{x}) = \{j : p(d_j = +1 \mid \mathbf{x}, \Phi^{s,bag}_j) \ge \xi, \forall j \in b\}$. Then, the most probable pathways can be retrieved using a cut-off threshold $\xi$ as $L^{path}_1(\mathbf{x}) = \{l : p(y_l = +1 \mid \mathbf{x}, \Theta^{s,path}_l) \ge \xi, l \in B_j, l = +1, \forall j \in L^{bag}_s(\mathbf{x})\}$. We refer to this approach as "pref-voting". Alternatively, instead of using $\xi$, leADS deploys the $\mathrm{rank}_k(.)$ operation to sort the probabilities of both bags and pathway labels, similar to [388].
After that, the label probability is obtained by marginalizing over the probabilities of bags and labels, where the goal is to retrieve the pathways with the $k$ highest scores, $L^{path}_2(\mathbf{x}) = \{l : \mathrm{rank}_k(p(y_l = +1 \mid \mathbf{x}, \Theta^{s,path}_l) \times p(d_j = +1 \mid \mathbf{x}, \Phi^{s,bag}_j)), l \in B_j, l = +1, \forall j \in b\}$. This approach is quite effective because it compromises between capturing the tail labels, obtained from bag scores, and predicting the head labels, indicated by pathway label scores. We name this strategy "pref-vrank". In both approaches, we apply hard voting, based on the majority class label over the $g$ members.

The overall complexity of leADS, based on the factorization type, using the pref-voting strategy for a test point $\mathbf{x}$ is reduced to $O(g(b+t)/2)$, while pref-vrank may incur on average less th
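The pref-vrank retrieval can be sketched for a single ensemble member. This is a toy illustration with invented bag membership and random probabilities; each pathway is scored by the product of its own probability and that of a bag containing it (taking the maximum over containing bags is one plausible reading of the marginalization), then the top-k pathways are kept.

```python
import numpy as np

rng = np.random.default_rng(4)

b, t, k = 3, 10, 4                    # bags, pathways, top-k
# Hypothetical bag membership: B[j, l] = 1 if pathway l belongs to bag j.
B = (rng.uniform(size=(b, t)) > 0.5).astype(int)

# Predicted probabilities for one test point (stand-ins for the trained
# pathway and bag models of a single ensemble member).
p_path = rng.uniform(size=t)          # p(y_l = +1 | x)
p_bag = rng.uniform(size=b)           # p(d_j = +1 | x)

# pref-vrank: combine bag and pathway probabilities, then rank_k.
scores = np.zeros(t)
for j in range(b):
    members = np.where(B[j] == 1)[0]
    scores[members] = np.maximum(scores[members], p_path[members] * p_bag[j])

top_k = np.argsort(-scores)[:k]
print("predicted pathway labels:", sorted(top_k.tolist()))
```

In the full method the same retrieval is repeated per member, with hard (majority) voting over the g members producing the final label set.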
Item Metadata
Title: Machine learning methods for metabolic pathway inference from genomic sequence information
Creator: Mohd Abul Basher, Abdur Rahman
Publisher: University of British Columbia
Date Issued: 2020
Degree: Doctor of Philosophy - PhD, Bioinformatics (Science, Faculty of)
Graduation Date: 2020-11 (UBCV campus, graduate scholarly level)
Language: eng
Date Available: 2020-10-15
Provider: Vancouver : University of British Columbia Library
Rights: Attribution-NonCommercial-NoDerivatives 4.0 International (http://creativecommons.org/licenses/by-nc-nd/4.0/)
DOI: 10.14288/1.0394748
URI: http://hdl.handle.net/2429/76294