UBC Theses and Dissertations

MetaPathways : a modular pipeline for the analysis of environmental sequence information Hanson, Niels William 2015

MetaPathways: a modular pipeline for the analysis of environmental sequence information

by

Niels William Hanson

B.Sc., Computer Science, The University of British Columbia, 2011

A DISSERTATION SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY in The Faculty of Graduate and Postdoctoral Studies (Bioinformatics)

THE UNIVERSITY OF BRITISH COLUMBIA (Vancouver)

January 2015

© Niels William Hanson 2015

Abstract

The lack of cultivated reference strains for the majority of naturally occurring microorganisms has led to the development of plurality sequencing methods and the field of metagenomics, offering a glimpse into the genomes of this so-called ‘microbial dark matter’ (MDM). An explosion of sequencing initiatives has followed, attempting to capture and extract biological meaning from MDM across a wide range of ecosystems from deep-sea vents and polar seas to waste-water bioreactors and human beings. Current analytic approaches focus on taxonomic structure and metabolic potential through a combination of phylogenetic anchor screening of the small subunit ribosomal RNA gene (SSU or 16S rRNA) and general sequence searches using homology-based inference. Though much has been learned about microbial diversity and metabolic potential within natural and engineered ecosystems using these approaches, they are insufficient to resolve the ecological relationships that couple nutrient and energy flow between community members, ultimately translating into ecosystem functions and services.
This shortcoming arises from a combination of data-intensive challenges presented by environmental sequence information that span processing, integration, and interpretation steps, and a general lack of robust statistical and analytical methods to directly address these problems.

This dissertation addresses some of these shortcomings through the development of a modular analytical pipeline, MetaPathways, allowing for the large-scale and systematic processing and integration of many forms of environmental sequence information. MetaPathways is built to scale, comparing hundreds of metagenomic samples through the efficient use of data structures, grid compute models, and interactive data query. Moreover, it attempts to bring functional analysis back to the metabolic map through the creation of environmental pathway/genome databases (ePGDBs), adopting the Pathway Tools software for metabolic pathway prediction on the MetaCyc encyclopedia of genes and genomes. ePGDBs and the pathway-centric approach are validated to provide known and novel insights into community structure and function. Finally, novel taxonomic and metabolic methods supporting the pathway-centric model are derived and demonstrated, and enhance Pathway Tools as a framework for engineering microbial communities and consortia.

Preface

A number of sections of this work are partly or wholly published in press or accepted. Copyright licences to all works were obtained and are listed where appropriate.

• Chapter 1: Niels W. Hanson wrote the main text with input from Steven J. Hallam. Shang-Ju Wu and Kishori M. Konwar edited the final manuscript and provided input and feedback on the content of the computational section. A version of the text and figures are to appear as part of an accepted book chapter:
Niels W. Hanson, Kishori M. Konwar, Shang-Ju Wu, Steven J. Hallam. Introduction to the Analysis of Environmental Sequence Information Using MetaPathways. Computational Methods for Next Generation Sequencing Data Analysis.
Wiley Series in Bioinformatics. In Press.

• Chapter 2: The MetaPathways pipeline was co-developed by Niels W. Hanson and Kishori M. Konwar. Niels W. Hanson and Kishori M. Konwar contributed equally in design and development of the software. Niels W. Hanson performed extensive regression testing, quality control, and wrote online documentation. Niels W. Hanson performed simulation studies and performance analysis. Niels W. Hanson wrote the manuscript and created all figures with editorial support from Steven J. Hallam. A version of this work was published in BMC Bioinformatics under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0):
Kishori M. Konwar, Niels W. Hanson, Antoine P. Pagé, Steven J. Hallam, MetaPathways: a modular pipeline for constructing pathway/genome databases from environmental sequence information. BMC Bioinformatics, 14, 202 (2013).

• Chapter 3: Niels W. Hanson performed pathway analysis of simulated and multi-omic samples from the Hawaii Ocean Time series. Kishori M. Konwar and Niels W. Hanson ran the MetaPathways software on high-performance computational resources provided by Compute Canada’s Western Canadian Compute Consortium (WestGrid). Niels W. Hanson identified prediction hazards and performed the analysis of nitrogen cycling pathways with the interpretive support of Alyse K. Hawley and Steven J. Hallam. Niels W. Hanson developed the weighted taxonomic distance (WTD) and algorithm with the mathematical support of Kishori M. Konwar. Tomer Altman of Stanford University’s Biomedical Informatics program and Peter D. Karp of SRI International provided valuable input on the design of WTD and its behaviour with respect to the expected taxonomic range of MetaCyc pathways. Niels W. Hanson implemented the WTD distance and algorithm in the Python programming language. Niels W. Hanson wrote the manuscript and created all figures with editorial support from Steven J. Hallam.
A version of this work is published in BMC Genomics under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0):
Niels W. Hanson, Kishori M. Konwar, Alyse K. Hawley, Tomer Altman, Peter D. Karp, and Steven J. Hallam, Metabolic pathways for the whole community. BMC Genomics, 15, 619 (2014).

• Chapter 4: MetaPathways v2.0 was co-developed by Niels W. Hanson, Kishori M. Konwar, and Shang-Ju Wu. Niels W. Hanson and Kishori M. Konwar designed and implemented the master-worker model for homology-search jobs in equal parts. Niels W. Hanson, Kishori M. Konwar, and Shang-Ju Wu implemented the graphical user interface and Knowledge Engine data structure in equal parts. Niels W. Hanson wrote the manuscript and created all figures with editorial support from Steven J. Hallam. A version of this work is published as IEEE copyrighted proceedings described below.
All text, figures and tables in Chapter 4 are copyright 2014 IEEE. Reprinted, with permission, from:
Niels W. Hanson, Kishori M. Konwar, Shang-Ju Wu, Steven J. Hallam, MetaPathways v2.0: A master-worker model for environmental Pathway/Genome Database construction on grids and clouds. 2014 IEEE Conference on Computational Intelligence in Bioinformatics and Computational Biology, May 2014.
In reference to IEEE copyrighted material which is used with permission in this dissertation, the IEEE does not endorse any of University of British Columbia’s products or services. Internal or personal use of this material is permitted. If interested in reprinting/republishing IEEE copyrighted material for advertising or promotional purposes or for creating new collective works for resale or redistribution, please go to http://www.ieee.org/publications_standards/publications/rights/rights_link.html to learn how to obtain a License from RightsLink.

• Chapter 5: Niels W. Hanson devised the LCA* method and formulated the likelihood ratio test with the mathematical support of Kishori M. Konwar. Niels W.
Hanson implemented the LCA* software and performed simulation and comparison studies. Niels W. Hanson wrote the manuscript and created all figures with editorial support from Steven J. Hallam.

Throughout this dissertation the word ‘we’ refers to Niels W. Hanson unless otherwise stated. None of the work encompassing this dissertation required consultation with the UBC Research Ethics Board.

Table of Contents

Abstract
Preface
Table of Contents
List of Tables
List of Figures
List of Algorithms
Acknowledgements
Dedication
1 Introduction
  1.1 Multi-omic sequencing and analysis
    1.1.1 Next-generation sequencing
    1.1.2 Metagenomic assembly
    1.1.3 Open reading frame (ORF) prediction
    1.1.4 Functional assessment
    1.1.5 Pathway-centric analysis
    1.1.6 Taxonomic assessment
    1.1.7 rRNA gene identification
    1.1.8 Clusters of orthologous groups
    1.1.9 Lowest common ancestor
  1.2 Computational issues
    1.2.1 Pipeline design
    1.2.2 Heterogeneous software and computational requirements
    1.2.3 Data integration and query
    1.2.4 A novel analysis pipeline
    1.2.5 Research overview
2 MetaPathways: a modular pipeline for the analysis of environmental sequence information
  2.1 Introduction
  2.2 Implementation
    2.2.1 Quality control & ORF prediction
    2.2.2 ORF annotation
    2.2.3 Analyses
    2.2.4 ePGDB construction
    2.2.5 Pathway export
  2.3 Performance
    2.3.1 Evaluation of pathway prediction with simulated metagenomes
  2.4 Discussion
    2.4.1 Pipeline limitations
  2.5 Conclusions
3 Metabolic pathways for the whole community
  3.1 Background
  3.2 Results and discussion
    3.2.1 Performance considerations
    3.2.2 Distributed metabolic pathways
    3.2.3 Comparative community metabolism
    3.2.4 Pathway prediction hazards
  3.3 Methods
    3.3.1 Metabolic pathway analysis
    3.3.2 Pathway prediction on simulated data
    3.3.3 Simulated metagenomes: Sim1, Sim2
    3.3.4 Simulated metagenomes: HOT (25 m)
    3.3.5 Taxonomic pruning experiments
    3.3.6 Weighted taxonomic distance
    3.3.7 Distributed metabolic pathway prediction
    3.3.8 Hawaii ocean time series
  3.4 Motivation and derivation of the weighted taxonomic distance
    3.4.1 WTD formulation
    3.4.2 Lowest common ancestor (LCA) algorithm
    3.4.3 The weighted taxonomic distance algorithm
    3.4.4 Calculation
  3.5 Conclusions
4 MetaPathways v2.0: a master-worker model for environmental Pathway/Genome Database construction on grids and clouds
  4.1 Introduction
  4.2 Implementation
    4.2.1 Multi-grid brokering
    4.2.2 Amazon elastic cloud integration
    4.2.3 Graphical user interface & data integration
  4.3 Results
    4.3.1 Heterogeneous grid migration
  4.4 Discussion
  4.5 Conclusions
5 LCA*: an entropy-based measure for taxonomic assignment of contigs
  5.1 Introduction
  5.2 Motivation
  5.3 LCA*: derivation and algorithm
    5.3.1 Derivation
    5.3.2 Algorithm
  5.4 Statistical significance
    5.4.1 Implementation
  5.5 Methods
    5.5.1 Simulation, sequences, and annotation
    5.5.2 Analysis
  5.6 Results
  5.7 Discussion and conclusions
6 Conclusions
  6.1 Assumptions and limitations of current approaches
  6.2 Related research and community developments
  6.3 Improvements to the MetaPathways pipeline
    6.3.1 Automated assembly
    6.3.2 Pathway prediction and ePGDB interpretation
    6.3.3 Distributed metabolism
    6.3.4 Improvements to the master-worker model
    6.3.5 Sequencing technologies on the horizon: Illumina’s NextSeq, PacBio, and Oxford Nanopore
    6.3.6 A future in the cloud
  6.4 Future developments
    6.4.1 Human microbiome
    6.4.2 Single-cell sequencing
  6.5 Closing
Bibliography
Appendices
A Chapter 2: supplementary material
B Chapter 3: supplementary material
  B.1 Confusion Table Statistics

List of Tables

1.1 Overview of next-generation sequencing technologies
2.1 MetaPathways validation summary based on a comparison of three sequencing methods on a common sample
2.2 MetaPathways validation summary based on a comparison of three sequencing methods on a common sample (continued)
2.3 Pathway classification performance statistics for simulated metagenomes Sim1 and Sim2
4.1 Overview of grid hardware specifications
4.2 Matrix of completed jobs and transfers
A.1 Source genome statistics for simulated metagenomes
A.2 Confusion tables for classification analysis of simulated metagenomes Sim1 and Sim2
B.1 Overview of the E. coli K12 genome used for simulated sequencing experiments
B.2 Overview of the tier-2 BioCyc genomes used for simulated sequencing experiments
B.3 Long-read simulated sequencing experiments for E. coli K12, Sim1, and Sim2
B.4 Short-read simulated sequencing experiments for E. coli K12, Sim1, Sim2, and HOT (25 m)
B.5 Behaviour of accuracy for a variety of confusion tables
B.6 Simulated long-read pathway confusion tables
B.7 Simulated short-read pathway confusion tables
B.8 Taxonomic pruning experiments
B.9 Predicted pathways for pairwise combined tier-2 BioCyc genomes
B.10 Candidate distributed pathways pairwise tier-2 BioCyc genomes
B.11 Summary statistics of pathway prediction for the HOT metagenomes and metatranscriptomes
B.12 Examples of observed pathway prediction hazards from the HOT analysis

List of Figures

1.1 A veritable tsunami of sequences
1.2 Overview of general stages of multi-omic sequencing and analysis
1.3 Short-read ORF prediction
1.4 The anonymous sequence problem
1.5 The small-subunit SSU rRNA gene is the “gold standard” for microbial diversity studies
1.6 Taxonomic assignment and functional gene profiling methods
1.7 Structure of a MapReduce job
2.1 The MetaPathways pipeline
2.2 Default MetaPathways settings
2.3 ePGDBs and Pathway Tools
2.4 ePGDBs and the Cellular Overview
2.5 Analysis on in silico simulated sequencing experiments across different levels of coverage and taxon distribution
3.1 A multi-tiered approach to ePGDB validation
3.2 Analysis on in silico simulated sequencing experiments
3.3 Examples of amino acid metabolism shared between Moranella endobia and Tremblaya princeps
3.4 Analysis of predicted pathways from the Hawaii Ocean time series
3.5 Comparison of predicted genomic and transcriptomic pathways in the ‘sunlit’ and ‘dark’ HOT samples
3.6 Taxonomic and functional breakdown of nitrogen cycling pathways
3.7 Illustrative phylogenetic tree for distance expectations
3.8 A phylogenetic example for function choice
4.1 A master-worker model for sequence homology searches
4.2 Configuring a MetaPathways run
4.3 Monitoring a MetaPathways run
4.4 The MetaPathways ‘Knowledge Engine’ data structure
4.5 Knowledge Engine data structure integration into data summary and visualization modules
5.1 Illustrative example of taxonomic assignment methods: LCA*, Majority, and LCA2
5.2 The NCBI taxonomy tree structure used in our derivation
5.3 Decomposition of entropy into sub-trees
5.4 Kernel densities of Simple-walk distances between predicted and actual taxonomies
5.5 Kernel densities of weighted taxonomic distances between predicted and actual taxonomies
5.6 Root-mean-squared error (RMSE) for LCA2, Majority, and LCA*
5.7 Kernel densities of supremacy p-values for voting-based methods
5.8 Pairwise supremacy p-values for the voting-based methods
6.1 Nx assembly plots
6.2 An integer optimization for distributed metabolism
6.3 Global comparative multi-omic analysis
6.4 Binding metagenomes to single-cells
B.1 Copy number distributions for the simulated metagenomes Sim1 and Sim2
B.2 Illustrative confusion table
B.3 Summarizing performance measures for long-read simulations
B.4 Summarizing performance measures for short-read simulations
B.5 Distribution of weighted taxonomic distance from HOT predicted pathways
B.6 Disagreement class distribution of HOT predicted pathways by expected taxonomic range
B.7 An example of a plausible emergent metabolism pattern
B.8 Comparison of predicted amino acid pathways in the Candidatus Moranella endobia and Candidatus Tremblaya princeps genomes
B.9 Overview of unique transcriptomic signal from HOT
B.10 Overview of unique metagenomic signal from HOT
B.11 Top-40 predicted metagenomic and metatranscriptomic pathways from the HOT
B.12 Genomic and transcriptomic signal for pathways unique to the sunlit surface (25 m)
B.13 Genomic and transcriptomic signal for pathways unique to upper photic zone (25 m and 75 m)
B.14 Genomic and transcriptomic signal for pathways unique to surface (25 m) and deep (500 m) depth intervals
B.15 Unique pathways to the ‘photic and deep’ samples (25 m, 75 m, and 500 m)
B.16 Unique pathways to the lower euphotic 110 m sample
B.17 Unique pathways to the upper and lower euphotic 25 m and 110 m samples

List of Algorithms

3.1 ComputeWTD
3.2 PathwayObservedTaxon
5.1 CalculateLCA*

Acknowledgements

I would like to acknowledge the members of the Steven J. Hallam Laboratory for their lively discussions and friendship throughout this post-graduate experience. In particular, I would like to thank Kishori M. Konwar for his continuous support, friendship, supervision, and encouragement to push my mathematical and algorithmic background; William Evan Durno for continued friendship, statistical, and mathematical expertise; Connor Morgan-Lang for his hard work and friendship; Aria S. Hahn for her hard work, taxonomic background, molecular biology and experimental knowledge, and editorial support; Alyse K. Hawley for biochemical knowledge of pathways and problem-solving skills; Maya P.
Bhatia for lively discussions on the use of MetaPathways; and Keith Mewis for technical laboratory knowledge, MetaPathways testing and quality control, and steadfast friendship and support.

I would like to acknowledge a number of undergraduate co-op and work-study students for giving me an opportunity to mentor them as well as their hard work and intelligence: Shang-Ju Wu, Dongjae Kim, Frances Russell, and Hiu Kan Cheung. I would also like to thank Curtis Huttenhower, Tomer Altman, Peter D. Karp, and Christian Von Merring for their knowledgeable discussions on pathway prediction and metabolic analysis of multi-omic samples.

Finally, I would like to thank my supervisor Steven J. Hallam for taking me on as a student, and providing continued support, great ideas, thought-provoking conversations, and pushing me to do my best.

Dedication

To my parents; for their continuous love through the ups and downs of graduate school and supporting a son who did not know what he wanted to do.

Chapter 1
Introduction

Bacterial diversity estimates suggest that less than 1% of microorganisms are cultivable using standard laboratory techniques, representing a substantial limit to the understanding of microbial ecology and the metabolic potential that exists within nature [1–3]. The developing field of metagenomics bridges this so-called ‘cultivation gap’ through clonal or plurality-sequencing methods, directly sequencing microbial community sequences [4, 5], and allowing the discovery of novel taxa and metabolic processes [6].
Fundamental ecological and functional properties of microbial systems, in addition to being primary science, are of incredible practical interest in a number of domains including [7], environmental remediation [8, 9], biocatalytic engineering [10, 11], environmental systems engineering [12], and medical diagnoses and therapeutics [13–15].

A number of multi-omic sequencing efforts, collectively referring to cultivation independent experiments across the hierarchy of biological sequence information including DNA (metagenomic), RNA (metatranscriptomic), and protein (metaproteomic), have been applied across a variety of aquatic [16–18], terrestrial [19], and host-associated [20] environments to study their microbial diversity and metabolic potential. Roughly summarized, these studies tend to focus on questions of community structure (i.e., “who’s there?”) and functional potential (i.e., “what are they doing?”) in relation to measured environmental and biogeochemical parameters (e.g., salinity, luminosity, environmental chemistry, etc.) or conditions related to host health (e.g., body-mass index (BMI), antibiotic response, etc.). Addressing these taxonomic and functional questions from multi-omic samples involves the de novo assembly or classification of environmental sequences to reference genotypes or taxonomic bins, and further in silico predictive methods to ascribe gene function based on homology to known genes in public sequence databases.

Though much progress has been made developing methods to answer these taxonomic and functional questions, many technical, computational, and analytical challenges still exist. First, technical sequencing challenges, including sparse sequence coverage [21], DNA amplification and sequencing biases [22–25], and de novo short-read assembly [26, 27], can cloud the true nature of the underlying genomic sample.
Additionally, the samples themselves represent moving targets manifesting incredible genomic diversity and lateral gene transfer [28], a variance that imposes assembly and binning problems. Moreover, large sets of multi-omic sequences impose a number of high-performance computational challenges in terms of the efficient implementation of distributed algorithms and analytical methods, preventing the high-throughput processing of many metagenomic samples (Figure 1.1). Finally, a surplus of metagenomic sequencing projects and an ever-increasing size of both samples and public sequence databases, in combination with a lack of data integration, have stymied comparative community analyses, necessitating the development of efficient data structures, analytical methods, and user interfaces that provide an interactive comparative environment for hypothesis generation [29–31].

This chapter outlines the research and motivation supporting the development of MetaPathways, a modular pipeline for the analysis of environmental sequence information. First, multi-omic sequencing and analysis is reviewed, discussing common next-generation sequencing techniques with respect to current functional, taxonomic, and higher-level pathway-centric methods. Next, computational issues associated with the development of a high-performance pipeline are explored with respect to growing sequencing capacity and database size, heterogeneous computational requirements, and algorithmic parallelism of bioinformatic software. Finally, the rationale behind the development of a novel modular pipeline for the analysis of environmental sequence information is outlined, and the structure of the current work is developed.

Figure 1.1: A veritable tsunami of sequences and search. (a) The advent of short-read next-generation sequencing platforms has led to an exponential increase in our capacity to generate sequencing data from environmental samples.
(b) With the ability to sequence new genes and proteins comes the ability to catalogue them. The NCBI RefSeq database of protein records has increased rapidly in recent years. (c) A fundamental task in the analysis of environmental sequence information is the annotation of potential function by searching reference databases. The combined rise of both sequencing capacity and reference database size has compounded the computational challenge of the task by creating an ever larger search space. Background images: (a) ‘The Great Wave off Kanagawa’, (b) ‘Blue Mount Fuji at Dawn Near Oiso’, (c) ‘Mount Fuji in the Snow Storm’ are woodblock prints by Katsushika Hokusai from his collections ‘The Thirty-six Views of Mount Fuji’ and ‘One Hundred Views of Mount Fuji’. Images obtained from: http://commons.wikimedia.org/wiki/File:Great Wave off Kanagawa.jpg, http://blog.ukiyo-e.org/wp-content/uploads/2011/09/landscapes98.jpg, and http://www.fujiarts.com/japanese-prints/r48/269r48f.jpg, respectively. Images were cropped, faded, and scaled to fit graph proportions. Copyright has expired as Katsushika Hokusai (1760–1849) passed away more than 70 years ago.

1.1 Multi-omic sequencing and analysis

The analysis of environmental sequence information, from the initial assembly of sequences to functional and taxonomic analysis, endeavours to bring insight into microbial community structure and functional potential of the uncultivated majority of microorganisms (Figure 1.2). Since the commercial inception of next-generation sequencing, there have been a variety of available sequencing platforms that seek to provide high-abundance short-read sequences at low cost. Although all platforms perform parallel sequencing, many have differing capabilities in terms of run time, read-length, and error rate that affect their viability for the analysis of multi-omic samples in terms of sequence coverage and assembly.
Once reads are sequenced, they often are not long enough to identify genes and taxonomic motifs, and must be assembled into longer contiguous sequences by merging their overlaps. This is the problem of de novo multi-omic sequence assembly, which, because of the inherent complexity of environmental samples, is very much unsolved in many practical cases. After assembly, the next step is the identification of potential genes via open reading frame (ORF) prediction methods. Multi-omic gene prediction can be challenging because prediction models are often tuned for the sequence patterns of a particular organism. But in multi-omic samples, sequences can come from a variety of taxa, causing many ORF searches to fail, and constituting the so-called ‘anonymous sequence problem’. However, contemporary ORF prediction algorithms and software have largely addressed the ‘anonymous’ issue through different taxonomically-sensitive parameter tuning procedures. Reads and predicted ORFs are then annotated for taxonomic and functional information using protein reference databases in order to compare samples to each other. Finally, pathway-centric analyses integrate functional and taxonomic information into the biochemical pathways of metabolism, putting them in the context of the overall metabolic map.
Here, predicted pathways and taxonomic annotations can be used to detect potential instances of distributed metabolism, instances where a particular metabolic pathway is shared between two or more taxa, which have potential implications for the synthetic engineering and ecological structure of microbial communities.

Figure 1.2: Overview of general stages of multi-omic sequencing and analysis. Microbial samples are first obtained from the environment (a) and sequences are extracted and prepared in libraries for sequencing (b). A next-generation sequencing platform produces billions of short-read sequences (c). If coverage of the original sample is large enough, the short-reads are assembled into contiguous sequences (contigs) (d), potential genes or open reading frames (ORFs) are predicted on the resulting sequences (e), and sequences and ORFs are annotated against reference taxonomic and protein databases (f).
Finally, annotations are categorized into different functional and taxonomic hierarchies, and analyzed in relation to other samples for functional, taxonomic and pathway-centric insights (g).

1.1.1 Next-generation sequencing

In the past decade, there have been many different commercially-available next-generation sequencing platforms, each offering analytical benefits and drawbacks in terms of sequencing time, costs, and bias. While it is beyond the scope of this introduction to give a detailed explanation of them all, this section will focus on platforms commonly used for multi-omic samples and briefly highlight their analytical strengths and weaknesses (Table 1.1).

Pyrosequencing (454)

Roche 454 Pyrosequencing (454) was one of the first widely-adopted next-generation sequencing platforms. 454 works by attaching target sequences to small beads inside water droplets in an oil solution, with the target sequences then amplified into clonal colonies on the surface of the beads, a process known as emulsion PCR [32]. Nucleotides are flooded over the beads in sequence, and a luciferase enzyme generates light each time new bases are added to the template strands. A sensor captures this released light, and based on its intensity, can deduce the number of bases added to the template strand. In terms of performance, the 454 platform can produce millions of moderately-long (500–700 bp) reads in a 24-hour time period. Unfortunately, the method is susceptible to errors in long homopolymer runs, as the signal from the luciferase is linear only up to approximately eight consecutive nucleotides. This platform is analytically very popular with multi-omic samples because the longer sequence lengths aid in the identification of genes and provide the extra length needed to simplify assembly in complex environments.
Unfortunately, the sequencing depth of 454 is approximately 1–10M bases, significantly less than that of Illumina, a competing platform, and Roche officially announced the discontinuation of 454 in 2013, meaning many multi-omic sequencing projects are likely to head to the Illumina platform.

Illumina

Illumina’s platform involves the clonal amplification of slide-attached target sequences to form “DNA Clusters” [33]. Similar to 454, four types of fluorescent nucleotides are washed across the slide in sequence; however, the key innovation with the Illumina platform is the use of reversible terminator bases (RT-bases) that block the 3’-end of the growing sequence. These RT-bases prevent more than one base from being incorporated at a time, and thus allow the accurate sequencing of homopolymer runs. A polymerase incorporates a base into the template, and a camera takes images of the slide, noting which colonies have incorporated the base. The terminal 3’-blocker is removed and the process starts again. Because incorporating a base is independent of when the sensor is active, very large arrays can be captured by taking a series of pictures using only one or a small number of sensors. Thus, inexpensive chemistry (DNA polymerase), a low error rate by design, and high throughput from a large number of colonies (600 billion bases per run) make Illumina a very attractive sequencing platform. However, the sensor relies on all clonal sequences in a cluster being ‘in-phase’, i.e., all cluster sequences incorporating a nucleotide in-sync, and there is a small probability at each round that the RT-base will not bind. As sequencing continues towards the 3’-end there is more and more likelihood that the cluster will be out-of-phase. Consequently, Illumina reads can be fairly short (50–300 bp), as phasing-error can cause 3’-bases to be too ambiguous to be useful.
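The effect of phasing on usable read length can be illustrated with a deliberately simplified model. The sketch below assumes that each strand in a cluster independently fails to incorporate a base with a fixed per-cycle probability and never recovers, so that the in-phase fraction after n cycles is (1 - p)^n. The function names, the failure probability, and the 50% threshold are illustrative assumptions and are not parameters of the Illumina platform.

```python
import math


def in_phase_fraction(p_fail: float, cycles: int) -> float:
    """Fraction of strands still in phase after `cycles` rounds.

    Hypothetical model: each strand fails independently with
    probability `p_fail` per cycle and never catches up.
    """
    if not 0.0 <= p_fail < 1.0:
        raise ValueError("p_fail must be in [0, 1)")
    return (1.0 - p_fail) ** cycles


def max_cycles(p_fail: float, min_fraction: float) -> int:
    """Largest cycle count keeping the in-phase fraction >= min_fraction."""
    if not 0.0 < p_fail < 1.0:
        raise ValueError("p_fail must be in (0, 1)")
    if not 0.0 < min_fraction <= 1.0:
        raise ValueError("min_fraction must be in (0, 1]")
    # (1 - p)^n >= f  <=>  n <= log(f) / log(1 - p), since log(1 - p) < 0
    return math.floor(math.log(min_fraction) / math.log(1.0 - p_fail))


if __name__ == "__main__":
    # Assumed 0.5% failure per cycle; require half the cluster in phase.
    print(max_cycles(0.005, 0.5))
```

Under these assumptions, even a small per-cycle failure rate bounds the number of cycles at which a cluster still yields an unambiguous signal, which is consistent with the short read lengths noted above.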
For this reason, a quality score based on the intensity ratios of a cluster is reported with sequences, and a common preprocessing step is to trim reads from the 3’-end to the desired level of accuracy.

Ion Torrent

Ion Torrent semiconductor sequencing utilizes standard polymerization of DNA to implement a sequencing-by-synthesis method that is based on the detection of hydrogen ions released when a new base is incorporated [34]. Like pyrosequencing, in homopolymer runs many bases can be incorporated at once, which overwhelms the pH sensor, and causes the homopolymer run to be incorrectly recorded. The Ion Torrent machine is inexpensive when compared to the other platforms, and has a fast turn-around time of 2–4 hours. However, reads are fairly short (50–200 bp), and in some circles have a reputation for producing many homopolymer and substitution errors [35], but these quality issues are traded off against personal flexibility, short turn-around time, and ease-of-use.

Pacific Biosciences (Pac Bio)

The Pacific Biosciences sequencing platform offers single-molecule real-time long-read (5–20 kb) sequences. Sequences are synthesized in zero-mode waveguides (ZMWs), small well-like containers with the capturing sensor located at the bottom [36]. Fluorescently labelled nucleotides flow freely in solution, releasing a fluorescent event whenever incorporated into the template strand, and the ZMWs are constructed such that only fluorescence in the base of the well is detected. In addition to the long read lengths and having no biasing amplification step, methylation of nucleotides can also be detected, which is an important gene regulatory signal that is not detected by other platforms.
Pac Bio sequencers are more expensive with moderate throughput compared to Illumina, but long sequences can be useful as reliable scaffolds in assemblies, and as such Pac Bio has become popular in the assembly of complex plant genomes which have many chromosomes and large repetitive sections.

Oxford Nanopore (ONP)

A cutting-edge technology is the sequencing and detection of small molecules through the use of nanopores, small holes of nanometer proportions. Immersing nanopores in a conducting fluid and then applying an electrical potential across them allows for the detection of nucleotides and small molecules by the characteristic distribution of change caused as they proceed through the nanopore passage [37, 38]. When manufactured and arrayed in parallel on an appropriate membrane [39], nanopore technology has the potential to perform large-scale sequencing or small-molecule detection, and has the attractive ability of sequencing individual molecules with very long read lengths without a potentially biasing amplification step. However, despite this potential, there are still challenges associated with navigation, control, and detection of molecules through the nanopore passageway. Though nanopores can be driven by differences in electrical potential or diffusion, nucleotides pass at rates of 1–5 nucleotides per microsecond, making it difficult to detect individual bases [40]. Thus, much research is focused on developing processive enzymes that bind with the nanopore and ratchet the DNA molecule through one base at a time [40]. Moreover, nanopores with dual recognition sites have also been proposed [41], allowing for paired detection of nucleotides and improved error correction methods.

Currently, nanopore sequencing is still an active area of research and development, and no platform utilizing the technology is currently commercially available.
This being said, Oxford Nanopore (https://www.nanoporetech.com/), a start-up company out of Oxford University, is in the final stages of development of its MinION sequencing platform, the first USB-based portable sequencer, and is beta-testing this technology in the research community via its MinION Access Program (MAP). While the technology is proving itself capable of producing long reads in the 1–20 kb range, assisting in the detection of genomic islands [42], there are some concerns raised about its accuracy and processing capacity compared to the billions of bases provided by Illumina [42, 43]. Nevertheless, the small scale and portability of long-read single molecule sequencing are attractive features of the MinION platform, expanding the possibilities for real-time on-site environmental sequencing and single molecule detection.

Next-generation environmental sequencing

With a number of next-generation sequencing platforms to choose from, researchers are often burdened with deciding which platform is appropriate for their environmental samples. A high-level framework for platform choice can be made based on environmental complexity and read-length. Although oversimplifying the situation slightly, the sequencing platforms described above can be placed into two general categories: short-read amplified (SRA) platforms, including the 454, Illumina, and Ion Torrent, and long-read single-molecule (LSM) platforms including Pac Bio and Oxford Nanopore.
Environments can be ranked by their community complexity into low, medium, or high classes, which affects the amount of genomic coverage obtainable by available sequencing platforms.

Low complexity environments are characterized by a small number of taxa (1–500) and can include isolate populations, small microbial consortia, or communities sampled from extreme environments (e.g., hot springs [44], hydrothermal vents [45], acid mine drainage [46], etc.). The functional and taxonomic information of these communities can be well-sequenced by SRA platforms, with de novo assemblies that include more than 60% of all sequenced reads and perhaps even closed genomes given enough coverage. However, quantitative details may still be lost in the amplification step used by these platforms, and environmental sequencing experiments could be enhanced by long-read single-molecule (LSM) platforms to close complex regions of isolate or consortia genomes or validate functional or taxonomic abundance profiles in the presence of amplification bias.

Medium complexity environments, which generally include marine [47], freshwater [48], and host-associated microbiomes [49], have approximately 500–10,000 proposed taxa, and represent most environmental sequencing initiatives today. These environments are generally well-served by SRA platforms, with approximately 10–60% of all sequenced reads ending up in assembled contigs. Here LSM platforms by themselves are likely to lack the coverage required to adequately sample the functional complement of these environments; however, environmental experiments could use LSM platforms to close abundant genomes and scaffold long contigs together. Given its error rate and coverage limitations, Ion Torrent will probably struggle to resolve large portions of functional genes in the more complex communities of this class. The choice between the SRA platforms 454 and Illumina is less clear, as there is likely a trade-off between cost and environmental complexity.
If resources permit, more complex environments may benefit from deep coverage with the longer reads of 454, allowing for better assembly and the identification of more functional genes. Alternatively, Illumina could provide deeper sequencing with larger assemblies of shorter contigs (at a reduced cost), but with a reduction in overall predicted gene size. Both 454 and Illumina assemblies could be improved by including LSM reads in a hybrid assembly to scaffold contigs, but this requires more care and attention in the downstream bioinformatic analysis.

High complexity environments including sediment and soil communities have more than 10,000 proposed taxa, pushing the analytical limits of current next-generation sequencing platforms [50]. SRA assemblies in these environments are very poor, with only 1–10% of all reads ending up in contigs. Although assembly-independent analyses like read-mapping methods based on the Burrows-Wheeler aligner offer an alternative, they are limited in the number of reference sequences they can search due to their current in-memory requirements [51]. The 454 platform may have an advantage by offering read lengths that are long enough to contain significant proportions of genes, allowing for analysis without an assembly step. Similarly, the long-read single-molecule (LSM) platforms provide a way to tap into long genes, but their current sequencing depth is too low to significantly sample high complexity sequence diversity. Increased sequencing capacity in LSM platforms will likely be critical to the future analysis of environmental sequence information from these highly complex environments.

Table 1.1: Overview of next-generation sequencing technologies.

Platform            | Read Length (bp) | Accuracy (%) | Sequencing Depth (Mbp) | Time per Run | Advantages                  | Disadvantages
Roche 454           | 500–700          | 98           | 700                    | 24 Hours     | Long reads. Fast.           | Homopolymer repeats.
Illumina            | 50–300           | 98           | 600,000                | 1–10 Days    | High coverage and accuracy. | Short reads.
Ion Torrent         | 50–400           | 97           | 100–1,000              | 2–4 Hours    | Less-expensive equipment.   | Homopolymer repeats.
Pacific Biosciences | 5,000–20,000     | 87           | 500–1,000              | 2–4 Hours    | Long reads.                 | Error rate. Cost. Coverage.
Oxford Nanopore     | 1,000–15,000     | Unknown      | 50–200                 | 1–24 Hours   | Portability. Long reads.    | Error rate. Cost. Coverage.

1.1.2 Metagenomic assembly

The short-read multi-omic de novo assembly problem is, given the current capacity of next-generation short-read sequencing platforms and algorithms, still unsolved. A number of intrinsic, experimental, technical, and analytical issues make a quality multi-omic assembly difficult to accomplish, let alone automate. Problems start with the intrinsic community complexity of environmental samples, and assembly performance is largely dependent on community complexity, read-length, and sequencing depth [26]. Additionally, experimental challenges can impede assembly through degraded or contaminated sequences, for instance host-associated sequences. Moreover, each of the sequencing platforms discussed above differs in terms of sequencing depth, read length, and error rates, all of which has led to the tailoring of preprocessing and assembly methods to their different quality-control procedures and error models [52, 53].

Computationally, the two dominant algorithmic paradigms for assembly are overlap-layout-consensus and De Bruijn graphs [54], the latter being critical to reducing memory requirements and enabling the assembly of large collections of reads from Illumina’s short-read sequencers. Larger metagenomes push even these limits, as the De Bruijn graph expands beyond the memory capacity of most large shared-memory machines, leading to the development of message passing interface-based (MPI-based) algorithms that distribute the memory requirement of the De Bruijn graph across multiple computational nodes [55, 56].
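To make the De Bruijn graph representation concrete, the following minimal sketch builds such a graph from a handful of reads. It is an illustration only, not the data structure of any particular assembler: the function name and the toy reads are hypothetical, and real assemblers add error correction, reverse-complement handling, and compact k-mer encodings. Each distinct k-mer contributes one edge between its (k-1)-mer prefix and suffix, so the size of the graph grows with the number of distinct k-mers rather than with the number of reads.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Build a De Bruijn graph as {(k-1)-mer: [successor (k-1)-mers]}.

    Each distinct k-mer in the reads becomes one edge from its
    prefix (first k-1 bases) to its suffix (last k-1 bases).
    """
    if k < 2:
        raise ValueError("k must be at least 2")
    graph = defaultdict(list)
    seen = set()  # distinct k-mers already added as edges
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer in seen:
                continue  # repeated k-mers add no new edge
            seen.add(kmer)
            graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


if __name__ == "__main__":
    # Two overlapping toy reads (hypothetical data).
    for node, successors in de_bruijn_graph(["ATGGC", "TGGCA"], k=3).items():
        print(node, "->", successors)
```

In this toy case the two reads share the k-mers TGG and GGC, which are stored once; walking the resulting path of edges spells out the merged sequence.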
Nevertheless, de novo assembly of large soil metagenomes is still a challenge on current large-scale computational grid systems [50], leading to the use of assembly-independent annotation methods based on short read-mapping to previously assembled metagenomes, reference genomes, and protein families. However, this comes at a cost of sensitivity and overall reference database size [51]. A particular difficulty with metagenomic assemblies is that even after sequence assembly has been performed, deciding what a ‘good’ assembly is in the absence of a reference is a non-trivial exercise. A number of assembly statistics like Nx (in a sorted set of contigs, Nx is the length of the contig that crosses the xth percentile of cumulative assembly length), total assembly length, number of hypothetical genes, or reads mapped attempt to serve as possible quality metrics, but these can be extremely misleading in the presence of sparse coverage and chimeric contigs [57–59]. Finally, after sequences are assembled into contiguous sequences and scaffolds, reads are often re-mapped back onto assembled contigs in order to retain a quantitative measure of read-depth. This abundance is often reported through a normalized read-depth statistic like Reads Per Kilobase per Million mapped reads (RPKM) [60], though the statistical theory is still underdeveloped for read-mapping statistics of environmental sequence information.

1.1.3 Open reading frame (ORF) prediction

With contiguous regions of the genome sequenced, the metabolic potential, i.e., the set of genes that encode metabolic enzymes, of an organism or community can be inferred from primary sequence information with the aid of computational methods that search for patterns or motifs. Potential genes, or ORFs, are determined based on the identification of common genetic parts including promoters, translation initiation sites (TIS), start and stop codons, or ribosomal binding sites (RBS) (Figure 1.3a) [61].
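As a rough illustration of the start and stop codon signals described above, the following sketch scans the three forward reading frames of a sequence for stretches that begin with ATG and end at the first in-frame stop codon. It is a naive example only: it ignores the reverse strand, promoters, RBS motifs, alternative start codons, and the statistical models used by real gene finders, and the function name and minimum-length parameter are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=2):
    """Return sorted (start, end) pairs for forward-strand ATG...stop ORFs.

    Coordinates are 0-based and end-exclusive; `min_codons` counts the
    start and stop codons. ORFs without an in-frame stop are not reported.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None  # position of the first ATG of the currently open ORF
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None  # close the ORF and look for the next start
    return sorted(orfs)


if __name__ == "__main__":
    toy = "CCATGAAATGACC"  # hypothetical sequence
    for begin, end in find_orfs(toy):
        print(begin, end, toy[begin:end])
```

Note that the second ATG in the toy sequence has no in-frame stop before the sequence ends, so it is silently dropped by this sketch.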
A variety of algorithms have been developed that accurately predict ORFs from assembled genomic sequences including GeneMark [62], Glimmer [63], and fgenesb (http://linux1.softberry.com/). Because these algorithms were developed to predict full-length genes in individual genomes, their performance declines on unassembled environmental datasets containing reads less than 1000 bp in length (the average prokaryotic gene length) (Figure 1.3b) [64]. Moreover, while algorithm accuracy can be increased through a training process that tunes parameters to common sequence patterns among closely related taxa, the genomic complexity of environmental sequence information presents an ‘anonymous sequence problem’ [65], as sequences can come from a wide variety of potential taxa. Given these constraints, short-read gene prediction algorithms need to sort incoming anonymous sequences into taxonomic bins or successively modify internal taxonomic parameters of their prediction algorithm for each sequence. Moreover, such algorithms also need to be robust to traditional ORF prediction problems like sequencing errors, nucleotide insertions/deletions and substitutions, associated with short-read sequencing platforms [66], which can disrupt a valid protein reading frame.

Figure 1.3: Short-read ORF prediction. (a) Three classic sequence features used to define an ORF are start codons or translation initiation sites (TIS) (e.g., ATG), in-frame stop codons (e.g., TGA), and the presence of a 5’ upstream ribosomal binding site (RBS). (b) Current short-read sequencing length is smaller than most bacterial genes, thus many ORF sequences will be truncated outside of the sequenced short-read window; ORFs in short reads have a number of possible incomplete signals: (i) multiple TIS sites with upstream in-frame stop codon, (ii) in-frame stop codon absent, (iii) no valid TIS site present, (iv) no start but potential stop codon present, (v) start but no potential stop codon present, (vi) no start, RBS or potential stop present. Figure to appear as a chapter of Computational Methods for Next Generation Sequencing Data Analysis [67]. Copyright John Wiley & Sons, Ltd.

A number of machine learning and statistical models have been developed to address the ‘anonymous sequence problem’ of ORF prediction (Figure 1.4). For example, MetaGene uses a pair of heuristic scoring models based on log-odds frequencies of GC content, codon frequencies, ORF length, intra-start codon length, and orientation and distances between genes trained for Bacteria and Archaea, using the highest scoring model for each sequence [68]. MetaGeneAnnotator [69], an updated version of MetaGene, adds a viral-trained model, as well as ribosomal binding site features to improve anonymous sequence identification. Glimmer-MG employs a sophisticated interpolated Hidden Markov Model using k-mer frequencies for model selection [70]. Orphelia is a two-stage algorithm that first extracts relevant genetic features via linear discriminants, and then combines these features in a neural network to create a gene prediction model [71]. MetaGUN sorts environmental sequences into taxonomic bins based on k-mer frequencies prior to training a support vector machine (SVM) to identify binned ORFs [72]. Finally, the Prokaryotic Dynamic Programming Gene-finding Algorithm (Prodigal) provides an efficient dynamic programming model capable of identifying ORFs using alternative genetic codes from anonymous input sequences [73].

The plethora of models above are relatively well-developed and accurate: all methods claim an accuracy above 80% for detecting ‘anonymous’ ORFs, with MetaGUN, MetaGeneAnnotator, and Prodigal claiming accuracies above 90% [69, 72, 74].
Thus, with all models attaining a high level of accuracy, it is more important to select an interpretable model with a robust implementation than to adopt the latest method for marginal benefit. Here, Prodigal is an attractive option: it is a heuristic method modeled after hands-on annotation experience from the Oak Ridge genome annotation and analysis pipeline, built up from layers of biological insight and experience [73]. When run in its metagenome mode, Prodigal’s internal parameters are trained via 50 hand-picked training sets, roughly separated on GC content, which improves its accuracy on ‘anonymous’ sequences [74]. Moreover, Prodigal is a well-implemented model, distributed as easily-compilable source code with limited computational dependencies (http://prodigal.ornl.gov/), making it an attractive ORF prediction option for pipeline development.

Figure 1.4: The anonymous sequence problem. Metagenomic sequences have the possibility of coming from a wide taxonomic range of genomes, creating a parameter training problem for many ORF prediction softwares. (a) MetaGene Annotator has added an up-stream taxonomic RBS motif identification feature to assist predicted ORFs. (b) Glimmer-MG incorporates two styles of clustering, first by assigning taxonomy using the highest scoring Markov model, and then by performing iterated Markov clustering within the collection of sequences directly, training on clusters before predicting ORFs with the final model. (c) Orphelia trains a one-layer neural network derived from RBS, TIS, codon frequencies, and length. (d) MetaGUN performs a Naive Bayes classification of sequences into Bacteria and Archaea before applying an appropriately trained support vector machine (SVM). (e) Finally, MetaProdigal has evaluated its own performance on different clusters of NCBI taxonomies, generating a number of parameter sets by cross-validation; it then decides which parameter set to use based on incoming sequence GC content. Figure to appear as a chapter of Computational Methods for Next Generation Sequencing Data Analysis [67]. Copyright John Wiley & Sons, Ltd.

1.1.4 Functional assessment

Following ORF prediction, functional assessment typically progresses by way of homology-search seed-and-extend algorithms (e.g., BLAST or LAST) to query user-defined protein sequence databases for functional annotations [75, 76]. Here, one of the most comprehensive and rapidly expanding public databases for automated functional annotation is the National Center for Biotechnology Information (NCBI) non-redundant Reference Sequence database (RefSeq) [77]. For example, the RefSeq Version 66 database released in July 2014 contained almost 60 million reference sequences with 20 million entries added in the last twelve months alone (Figures 1.1b and 1.1c). RefSeq expansion in combination with the ever-growing number of sample sequences underlies one of the biggest computational challenges when processing environmental sequence information due to the runtime complexity of all-versus-all BLAST searches, necessitating alternative or improved homology search methods.

In addition to RefSeq, more specialized protein databases exist for metabolic reconstruction including Kyoto Encyclopedia of Genes and Genomes (KEGG) and MetaCyc, albeit with reduced taxonomic coverage. These databases have more stringent curation requirements based on controlled vocabularies and defined organizational schemas. For example, KEGG provides an integrated database of genes, enzymes, metabolites, and pathway maps based on enzyme commission (EC) numbers [78].
The MetaCyc collection of genes, pathways, and metabolites contains more EC numbers than KEGG and encompasses an extensive collection of evolutionarily conserved pathways and variants specific to individual organisms or taxonomic groups represented in the BioCyc collection [79–82].

1.1.5 Pathway-centric analysis

Functional genes operate within the structure of metabolic networks. Although some studies have focused on specific pathways of interest [83, 84], few metagenomic studies take this structure into account. This is in spite of a variety of existing frameworks and software for pathway-level analysis using genomic data. iPath, an online tool, allows one to dynamically highlight KEGG pathways based on EC number mapping, but does not assess pathway completion, presence, or absence [85]. Similarly, a metabolic analysis module has been added to the SEED servers [86, 87], projecting subsystem annotations onto the KEGG metabolic map [78, 88], though this is done without filtering or weighting results. PathoLogic [89, 90], a heuristic algorithm within the Pathway Tools software developed by SRI International [91, 92], predicts the presence of metabolic pathways within the MetaCyc database of reactions and pathways based on key enzymes, pathway completeness, and the degradation or biosynthesis status of the pathway. More recently, MinPath, an integer optimization method, computes a conservative ‘parsimonious’ minimum set of reactions along the KEGG pathways [93]. This has since been implemented into the HUMAnN software, using the Metagenomics Reports (METAREP) pipeline for data storage and retrieval [94, 95]. The most recent release of InnateDB [96], a server and database for innate-immunity systems biology, implements a pathway prediction module (SIGORA) based on distinctive pairs of gene annotations within pathways.
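The parsimony idea behind MinPath, mentioned above, can be illustrated with a toy example. MinPath itself is formulated as an integer optimization problem; the sketch below swaps in a simple greedy set-cover heuristic, which is not MinPath’s algorithm and is not guaranteed to find a true minimum. The function name, pathway names, and reaction identifiers are invented for illustration.

```python
def greedy_min_pathways(observed, pathways):
    """Greedily choose pathways until all explainable reactions are covered.

    observed -- set of reaction identifiers annotated in a sample
    pathways -- dict mapping pathway name to its set of reactions
    """
    # Only reactions belonging to at least one pathway can be explained.
    uncovered = {r for r in observed
                 if any(r in rxns for rxns in pathways.values())}
    chosen = []
    while uncovered:
        # Pick the pathway explaining the most uncovered reactions;
        # iterating over sorted names breaks ties alphabetically.
        best = max(sorted(pathways), key=lambda p: len(pathways[p] & uncovered))
        chosen.append(best)
        uncovered -= pathways[best]
    return chosen


if __name__ == "__main__":
    toy_pathways = {  # hypothetical pathway definitions
        "glycolysis": {"R1", "R2", "R3"},
        "pentose_phosphate": {"R3", "R4"},
        "tca_cycle": {"R5"},
    }
    print(greedy_min_pathways({"R1", "R2", "R3", "R9"}, toy_pathways))
```

In the toy call, reaction R3 is shared by two pathways, but only one pathway is reported because it alone accounts for every explainable observation; a naive mapping would have reported both.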
Outside of the fields of microbiology and microbial ecology, pathway activities, based on multi-dimensional cancer-omics datasets, are inferred using Bayesian probabilistic graphical models [97, 98]. However, the fact that few metagenomic studies take advantage of the above frameworks speaks to the impediment that large-scale data transformations and formatting requirements represent to the field, and highlights the need for efficient, standardized software that simplifies or automates the task.

Pathway-centric approaches that predict metabolic networks using defined biochemical rules offer a more robust route for quantitative and predictive metabolic insights. Recently, pathway prediction and metabolic flux algorithms have been developed for microbial communities including HUMAnN and Predicted Relative Metabolic Turnover (PRMT), respectively. HUMAnN combines gene prediction with MinPath [93], a parsimony approach for metabolic reconstruction based on KEGG pathways [94]. Predicted Relative Metabolic Turnover predicts metabolic flux using KEGG pathways across multiple metagenomes [99]. Neither HUMAnN nor PRMT supports the scalable data processing steps needed to annotate large environmental datasets and neither offers a coherent data structure for exploring and interpreting predicted metabolic pathways. In addition to HUMAnN and PRMT, alternative algorithms for metabolic flux modelling within engineered microbial consortia are emerging, although their application to environmental sequence information remains to be determined [100, 101].

1.1.6 Taxonomic assessment

Taxonomy is the more traditional perspective of environmental sequence analysis, utilizing specific marker-genes and functional proteins that contain varying degrees of taxonomic signal. Ribosomal RNA (rRNA) genes are fundamental single-copy markers for taxonomy, and their analysis was the method that revealed the massive unseen taxonomic diversity present in environmental samples [102].
To supplement ribosomal markers, collections of conserved single-copy functional genes, like the Clusters of Orthologous Groups (COGs) [103], provide an adjunct or alternative set of targets for accurate taxonomic information. Finally, the lowest common ancestor (LCA) method provides less-specific taxonomic annotation [104], obtained by summarizing the taxonomic origin of collections of functional annotations that map to the same read.

1.1.7 rRNA gene identification

All cellular organisms encode ribosomal RNA genes (Figure 1.5a and 1.5b). The resulting ribonucleoprotein machines, ribosomes, are integral to protein synthesis within the cell, providing a translational mechanism for biological information flow between DNA, RNA, and proteins. As universal marker genes, rRNA sequences can be used to construct phylogenetic trees unifying all three domains of life (Figure 1.5c) [102, 105]. Indeed, prokaryotic small and large subunit ribosomal RNA (SSU/16S and LSU/23S rRNA) genes represent “gold standards” in the study of microbial diversity. Cultivation-independent rRNA gene surveys over the past two decades have revolutionized our perception of the microbial world and revealed the presence of numerous candidate divisions, so-called microbial dark matter, known solely on the basis of rRNA gene sequence information.

Today rRNA gene surveys represent a growth industry relevant to environmental monitoring and human health and disease. High-throughput amplicon sequencing projects based on polymerase chain reaction amplification with so-called “universal” primer sets are expanding sequence databases used for taxonomic profiling in natural and engineered ecosystems (GreenGenes, Silva, RDP2, etc.) [23, 106–108]. Community composition profiling can provide quantitative insights into microbial ecology; however, rRNA genes do not encode metabolic enzymes themselves, meaning they are limited in their capacity to determine metabolic pathways within uncultivated microbial communities.
While methods to extrapolate functional potential based on rRNA gene abundance in amplicon datasets have been developed [109], identification of rRNA gene abundance directly from metagenomic datasets provides a more accurate representation of donor genotypes within a sample that can be directly related to metabolic pathway information based on functional gene annotations [110].

The procedure for analyzing taxonomic rRNA gene profiles, counts or proportions of the number of rRNA genes annotated in a sample, is relatively developed. After some quality control and sample normalization, rRNA profiles are analyzed for significant variance via a number of statistical methods including principal component analysis, non-metric multidimensional scaling [111], indicator species analysis [112], and hierarchical clustering [113], many of which are implemented as part of the popular software packages Vegan [114], PC-ORD [115], or QIIME [116, 117]. These results are often further visualized through factor plots, heat maps, and dendrograms to help distill the hierarchy and structure of the detected taxonomy profile. More recently, an interest in social network analysis has allowed collections of taxonomic profiles to be used to create statistically-generated graphical networks, where nodes represent different taxa and edges represent some measure of association (e.g., correlation) [118, 119]. Force-directed layouts, graph-theoretic statistics, and module-detection methods are applied to infer possible taxonomic relationships. To further interpret potential patterns of microbial community structure, these networks are visualized with graph visualization software like Cytoscape [120], Python’s NetworkX [121], and R’s igraph package (http://igraph.sourceforge.net/).

1.1.8 Clusters of orthologous groups

The use of conserved single-copy functional genes provides a powerful adjunct or alternative to rRNA amplicon sequencing when profiling microbial community structure.
Indeed, COGs provide collections of anchor genes useful in estimating genome completion or coverage and in the construction of phylogenetic trees [103]. Each COG contains at least three proteins from different lineages that are more similar to each other than to any other protein, forming what is called a “triangle homology”, and are therefore considered likely orthologs, genes related through common descent. Because orthologous proteins tend to share equivalent functions, this information can be propagated throughout a cluster, providing a rapid route for functional annotation of individual genomes or metagenomic datasets (Figure 1.6a). Interestingly, environmental sequencing projects have generated a large number of hypothetical ORFs that are conserved across taxonomic groups. While incorporating these novel sequences into COGs has traditionally relied on manual annotation, the sheer number of incoming sequences has driven the development of automated alternatives including the EggNOG database [122]. Now in its second version, EggNOG uses reciprocal BLAST and the “triangle homology” method to build new clusters and append to existing ones [123].

Figure 1.5: The small subunit (SSU) rRNA gene is the “gold standard” for microbial diversity studies. (a) The E. coli SSU rRNA transcript and its folding structure contain four main domains (5’, C, 3’M and 3’m) and nine hyper-variable regions (V1–V9) that correspond to areas with higher mutation rates. (b) A linearized diagram of the SSU rRNA showing the relative position of each domain and hyper-variable region. (c) By aligning rRNA sequences from multiple organisms, a three-domain phylogenetic tree can be generated, revealing that the largest proportion of genetic diversity is represented by microorganisms. Part (a) copyright Dr. Harry Noller, RNA Center, University of California, Santa Cruz. Original image obtained from: http://rna.ucsc.edu/rnacenter/ribosome images.html. Figure to appear as a chapter of Computational Methods for Next Generation Sequencing Data Analysis [67]. Copyright John Wiley & Sons, Ltd.

MLTreeMap is a method that automatically places marker genes onto a highly resolved reference phylogeny made from a multiple sequence alignment or “supermatrix” of 40 universal COGs, a hand-picked set of protein-coding marker genes previously shown to provide sufficient information for phylogenetic analysis; they occur only once per genome and are rarely transferred horizontally between taxa [124–126]. Taxonomic assignment proceeds in three steps. First, marker genes are identified using nucleotide or protein sequences as input using BLAST and GeneWise [127]. Second, detected marker genes are added to curated reference alignments using hmmalign and Gblocks. Third, these aligned sequences are placed into annotated reference phylogenies using RAxML [128] (Figure 1.6b). In addition to universal COG reference phylogenies, user-defined trees can be appended to MLTreeMap, supporting both phylogenetic and functional anchor screening.

1.1.9 Lowest common ancestor

Taxonomic annotation of environmental sequence information is sensitive to read length and assembly. While the presence of phylogenetic anchors such as rRNA genes on assembled contigs enables direct taxonomic assignment, taxonomic annotations based on functional ORF assignments can be confounded by large portions of shared sequence homology between a number of different taxonomic groups. The lowest common ancestor (LCA) algorithm, originally incorporated into MEGAN, uses BLAST scores to conservatively place sequences onto the NCBI taxonomic hierarchy [104].
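The conservative placement performed by an LCA approach can be illustrated with a small sketch. This is not MEGAN's implementation: the function names, the simplified root-to-leaf lineages, and the min_bitscore threshold are all hypothetical, chosen only to show how a sequence whose hits disagree at a fine rank is assigned to the deepest taxon that all retained hits share.

```python
def lowest_common_ancestor(lineages):
    """Return the deepest taxon shared by every root-to-leaf lineage."""
    if not lineages:
        return None
    shared = None
    # zip(*lineages) walks the lineages rank by rank from the root down.
    for ranks in zip(*lineages):
        if len(set(ranks)) != 1:
            break
        shared = ranks[0]
    return shared


def assign_taxon(hits, min_bitscore=50.0):
    """Place a sequence at the LCA of its hits scoring >= min_bitscore.

    `hits` is a list of (bitscore, lineage) pairs. The threshold is an
    illustrative assumption, not a MEGAN default.
    """
    kept = [lineage for score, lineage in hits if score >= min_bitscore]
    return lowest_common_ancestor(kept)


# Simplified, hypothetical lineages for three hits of one sequence.
hits = [
    (210.0, ["root", "Bacteria", "Proteobacteria", "Gammaproteobacteria", "Escherichia"]),
    (198.0, ["root", "Bacteria", "Proteobacteria", "Gammaproteobacteria", "Vibrio"]),
    (35.0, ["root", "Archaea", "Euryarchaeota"]),
]
print(assign_taxon(hits))                    # Gammaproteobacteria
print(assign_taxon(hits, min_bitscore=0.0))  # root
```

Lowering the score threshold admits the weak archaeal hit and pushes the assignment up to the root, which is the sense in which the placement is conservative.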
The LCA represents the lowest node in the NCBI taxonomy to which a sequence can be assigned based on the expected taxonomic range of its corresponding BLAST hits (Figure 1.6c). MEGAN uses several user-defined parameters for determining this range, including a bit score cutoff, best-hit cutoff, and minimum support percent.

Figure 1.6: Taxonomic assignment and functional gene profiling methods. (a) Clusters of orthologous groups (COGs) are constructed using a “triangle homology” method. Similar to rRNA genes, many COGs can be used as phylogenetic anchors. (b) MLTreeMap leverages a subset of 40 universal COGs aligned and concatenated into a “supermatrix”. Metagenomic reads can be added to this alignment and placed on the tree of life using a Maximum Likelihood method. (c) MEGAN parses BLAST outputs and projects this information onto the NCBI taxonomic hierarchy using the lowest common ancestor (LCA) algorithm. MEGAN also supports KEGG and SEED subsystems mapping. Figure to appear as a chapter of Computational Methods for Next Generation Sequencing Data Analysis [67]. Copyright John Wiley & Sons, Ltd.

1.2 Computational issues

Bioinformatic pipelines are challenging from a software development perspective due to dynamic requirements that change with the data analysis, heterogeneous computational requirements and software, and various issues associated with models of data integration and standardization. Moreover, these computational and implementation issues are compounded when considered in the context of distributed computational systems, which often have their own set of heterogeneous setup and computational requirements.
This section discusses computational challenges in the context of environmental sequence information, and concludes by arguing for the development of a novel analytical pipeline, MetaPathways.

1.2.1 Pipeline design

Bioinformatic pipelines integrate a number of disparate software packages, and thus can depend on a variety of operating systems, system library requirements, programming languages, or reference files. For example, processing environmental sequences through quality control, assembly, ORF prediction, and functional annotations incorporates multiple software packages with many different requirements, ranging from stand-alone scripts written in Python, Perl, or Java, to executables compiled using C or C++. Many of these packages change the representation of the underlying data, meaning care needs to be taken to preserve a data structure consistent with overall analytical concerns at each step. In many cases, an overarching structure or data model should be considered throughout, preserving inputs and outputs at each step, and keeping in mind the overall analytical questions that are being asked of the data. As is common with most exploratory data analysis pipelines, this data model might have to be modified as knowledge of the problem becomes clearer or better methods become available throughout implementation. Thus modular object-oriented and agile development patterns are practical to avoid problems of overdevelopment, sometimes known as ‘big design up front’, as not all problems are clear until implementation.

1.2.2 Heterogeneous software and computational requirements

Bioinformatics pipelines are not only heterogeneous in terms of software requirements, but are also heterogeneous in terms of computational requirements and their level of computational parallelism [129].
For example, sequence homology searches are ‘embarrassingly parallel’, that is, each sequence is searched independently of every other one, and so the task is easy to distribute among multiple workers, provided all workers have efficient access to the same large reference database. Other problems generally have larger communication requirements, meaning they have certain dependent (serial) and independent (parallel) portions of their algorithms, and are therefore more difficult to parallelize among multiple worker machines. Effectively parallelizing algorithms is a challenging task, and cutting-edge methods often only have simple serial implementations and large memory requirements. Mature software may be multi-threaded to provide a degree of local multi-core parallelism, but these features are often not well advertised, so it is important to thoroughly investigate available software to make sure all available processors or nodes are being used effectively. Overall, because of the variety of software typically integrated into a bioinformatics pipeline, it is highly likely that a number of steps contain large, memory-intensive serial tasks.

While software can be multi-threaded for local parallelism on multi-core CPUs, more sophisticated implementations are optimized to run on the specialized parallel environments of computational grids. Here multiple systems are connected by way of a high-speed interconnect (e.g., InfiniBand), and use some version of the Message Passing Interface (MPI) to communicate between computational nodes. Deploying such software is often largely a matter of configuration. However, if the software is to run optimally on a grid machine, hardware structure (number of cores, available memory, and node design) may also need to be taken into account. This configuration is particularly important if specialized nodes with large GPU/CPU co-processing units are available (e.g., Intel Xeon Phi cards) [130].
Another consideration is that computational grids are separate entities that can fail or have variable performance, making them known in the field of distributed computing as ad hoc distributed networks. Grids can have variable performance for a variety of reasons: they may have a sharing policy to divide up computational time based on priority or usage, have periods of hardware and software maintenance, or experience hardware failure and power outages. This means that, ideally, pipelines manage tasks across a wide variety of grids, and scale back or migrate jobs between members if necessary.

Finally, the MapReduce paradigm is becoming an important model for cloud-based or internet computing systems (Figure 1.7). MapReduce is a model of parallel computation where two large parallel computational tasks are separated by a large communication task. The attractive feature of this model is that parallel processing can proceed via two serial functions, Map and Reduce, which fully specify a parallel job; this is a significantly simpler programming task than a full-blown implementation in MPI, which requires parallel programming considerations of concurrency, critical regions, and fault tolerance. Moreover, an open-source implementation of MapReduce, Hadoop [131], has become the de facto standard in cloud computing environments. This raises the question of which tasks related to sequence analysis can be modelled this way. In many cases, the tasks of assembly and maximum likelihood tree-building have larger communication costs that would quickly overwhelm most MapReduce implementations.
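The division of labour between the two serial functions can be sketched in a single Python process. This is a toy illustration that assumes nothing about Hadoop's API: the function names and the example records are hypothetical, and the shuffle step, which a real framework performs across the network, is reduced here to an in-memory grouping.

```python
from collections import defaultdict


def map_record(record):
    """Map: emit one (key, value) pair per input record."""
    _orf_id, annotation = record
    yield annotation, 1


def shuffle(pairs):
    """Shuffle: group values by key, as a framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_group(key, values):
    """Reduce: collapse all values for one key into a single result."""
    return key, sum(values)


def run_job(records):
    """Run Map, Shuffle, and Reduce in sequence over the records."""
    pairs = (pair for record in records for pair in map_record(record))
    grouped = shuffle(pairs)
    return dict(reduce_group(key, values) for key, values in sorted(grouped.items()))


# Hypothetical (ORF identifier, functional annotation) records.
records = [
    ("orf_1", "nitrogenase"),
    ("orf_2", "hydrogenase"),
    ("orf_3", "nitrogenase"),
]
print(run_job(records))  # {'hydrogenase': 1, 'nitrogenase': 2}
```

Only map_record and reduce_group are specific to the analysis; everything else is the part a MapReduce framework would supply, which is why the programming task is simpler than an explicit MPI implementation.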
Implementing homology search in MapReduce seems promising, but unfortunately Hadoop requires significant modifications to its Yet Another Resource Negotiator (YARN) resource manager to reduce communication costs for large homology search databases in order to be effective [132].

Figure 1.7: The structure of a MapReduce job is centred around three distributed stages: Map, Shuffle, and Reduce. First, input data is split into separate blocks based on the Hadoop Distributed File System (HDFS) (a). In the Map phase, input data splits are distributed to mapper instances where the Map tasks are performed, outputting resulting key-value pairs (b). The Shuffle phase then merges and sorts the key-value pairs from the Map phase and distributes them to their appropriate reduce instances (c). In the Reduce phase, the reduce tasks are performed on the sorted key-value pairs they received from the shuffle (d), outputting the results (e). It should be noted here that all steps acquire their required resources (input and support data) through the network via the HDFS.

1.2.3 Data integration and query

Since a significant proportion of current analyses of environmental sequence information are exploratory efforts searching for novel hypotheses, large-scale interactive data query across multiple annotation levels and datasets is an attractive goal. While exploratory analysis is routine in existing statistical environments like R or MATLAB, next-generation environmental datasets require annotation of millions of ORFs and contigs, meaning that not only is the processing computationally intensive, but the interpretation, integration, and analysis of gigabytes of output create computational challenges as well. Thus an environment to handle this scale of data requires intelligent use of data structures and out-of-core algorithms to provide interactive inquiry that scales to thousands of samples.
While command-line tools provide adequate first-pass software implementations, exploratory analysis with them is often poor compared with analysis done in graphical user interfaces (GUIs) designed around interactive data exploration. There are a number of libraries and frameworks for GUI development and information visualization, including Java Swing, Tcl/Tk, Qt, and a plethora of web-based HTML5 and JavaScript frameworks; the selection depends on a number of factors including ease of installation, cross-platform compatibility, and computational efficiency and flexibility. Finally, an attractive quality is the flexible export of data files and sequences compatible with a number of different downstream analytical software packages.

1.2.4 A novel analysis pipeline

While there are notable analytical services and software for environmental sequence information, many lack available source code, are difficult to operate, do not integrate commonly-used analytical software and formats, have varying data models, are not modular or open to expansion, or do not have the flexibility to run on heterogeneous computational resources. The Metagenome Rapid Annotation using Subsystem Technology (MG-RAST) [133], Community Cyberinfrastructure for Advanced Microbial Ecology Research and Analysis (CAMERA) [134], and Integrated Microbial Genomes and Metagenomes (IMG/M) [135, 136] services warehouse public datasets, provide web interfaces to online computation, and have different exploratory and analytical modules in consistent data models. However, these warehouse-style analytical environments are opaque with respect to significant data handling and processing steps, cannot be easily expanded to support new analytical steps or software, and place restrictions on data export for downstream analysis.
Moreover, there exist a number of stand-alone software packages, for example the MEtaGenome ANalyzer (MEGAN) and MLTreeMap [104, 126], that perform particular analytical tasks and provide data products, but require upstream annotation, data transformations, and careful integration for a coherent integrative analysis. While the above software and frameworks have great analytical potential, their opaque and idiosyncratic data transformations impede integration and analysis, necessitating the development of a transparent, high-performance pipeline and integrative framework to enable large-scale comparison of next-generation meta-omic datasets.

1.2.5 Research overview

This dissertation describes the design and implementation of MetaPathways, an open-source modular pipeline and platform for the integrative, large-scale analysis of environmental sequence information, and represents the first software framework of its kind to integrate environmental sequence analysis with heterogeneous high-performance computational resources.

Chapter 2 describes the initial design and implementation of MetaPathways v1.0, focusing on the modular analysis of environmental sequence information by way of an open-source Python framework. This chapter discusses quality control, open reading frame prediction, functional and taxonomic annotation, and the construction of environmental Pathway/Genome databases (ePGDBs) by way of the Pathway Tools software. This initial design provides a transparent data model that is expandable to support known and novel analytical tasks, and useful data products that integrate with other downstream analytical software.

Chapter 3 provides validation of predicted MetaCyc pathways contained within environmental Pathway/Genome databases via the analysis of simulated, symbiotic systems, and a re-analysis of actual metagenomic and metatranscriptomic samples from the Hawaii Ocean Time-series.
Moreover, this chapter develops an analytical framework for the interpretation of predicted metabolic pathways, highlighting analytical hazards, and illustrates the potential use of MetaPathways and Pathway Tools for engineering microbial consortia through instances of inter-pathway complementarity within symbiotic systems.

Chapter 4 discusses MetaPathways v2.0, which features substantial computational, usability, and integrative improvements through the implementation of a master-worker model for embarrassingly parallel computational tasks and a full-featured graphical user interface. Large-scale comparative multi-omic analyses can be performed by way of the memory-efficient ‘Knowledge Engine’ data structure, and functional and taxonomic annotations can be exported into custom data tables for downstream analysis.

Chapter 5 discusses the development and implementation of a contig-taxonomy classification method based on using existing lowest common ancestor (LCA) annotations as input. Since homology-search performance statistics are collapsed when looking at LCA contig-taxonomies, the method formulates a taxonomic estimate based on results from voting and information theory. Moreover, likelihood-ratio tests are formulated for the voting-based methods to provide a measurement of confidence in the reported taxonomy.
This contig taxonomic classification method has been implemented as an analytical option of the MetaPathways pipeline, demonstrating the modular capacity of the pipeline to be expanded to novel analytical tasks.

Finally, Chapter 6 concludes with a discussion of the assumptions and shortcomings of the current analytical methods, explores distributed metabolism in the engineering of microbial consortia and multi-omic samples, lays out future work and improvements to MetaPathways in the context of new sequencing technologies, cloud computing, and single-cell genomics, and speculates on the future direction of large-scale multi-omic analyses.

Chapter 2

MetaPathways: a modular pipeline for the analysis of environmental sequence information

This chapter presents the initial implementation of MetaPathways, an open source (GNU GPLv3) pipeline for pathway inference that uses the PathoLogic algorithm to map functional annotations onto the MetaCyc collection of reactions and pathways, and construct environmental Pathway/Genome Databases (ePGDBs) compatible with the editing and navigation features of Pathway Tools. The pipeline accepts assembled or unassembled nucleotide sequences, performs quality assessment and control, predicts and annotates noncoding genes and open reading frames, and produces inputs to PathoLogic.
In addition to constructing ePGDBs, MetaPathways uses MLTreeMap to build phylogenetic trees for selected taxonomic anchor and functional gene markers, converts General Feature Format (.gff) files into concatenated GenBank files for ePGDB construction based on third-party annotations, and generates useful file formats including Sequin files for direct GenBank submission and gene feature tables summarizing annotations, MLTreeMap trees, and ePGDB pathway coverage summaries for statistical comparisons.

MetaPathways provides users with a modular annotation and analysis pipeline for predicting MetaCyc metabolic interaction networks from environmental sequence information using an alternative to KEGG pathways and SEED subsystems mapping. It is extensible to genomic and transcriptomic datasets from a wide range of sequencing platforms, and generates useful data products for microbial community structure and function analysis. The MetaPathways software package, installation instructions, and example data can be obtained from https://www.github.com/hallamlab/metapathways.

2.1 Introduction

Metabolic interactions between microorganisms direct matter and energy transformations integral to ecosystem function [119, 137, 138]. Plurality sequencing methods enable exploration of potential (metagenomic) and expressed (metatranscriptomic) metabolic interactions with the aid of computational methods that assemble or cluster contiguous reads, search for patterns or motifs representing genes, and reconstruct pathways from environmental sequence information [29, 110, 139]. The prevailing paradigm in pathway reconstruction is to assign functional annotation based on sequence homology using BLAST [75].
Functional annotations are then projected onto symbolic representations of metabolism such as KEGG pathways [78, 88, 140] or SEED subsystems [141], revealing network structure.

With the expansion of next-generation sequencing technologies, increasingly complex datasets are being generated for thousands of environmental samples, resulting in analytic bottlenecks with the potential to stymie pathway reconstruction efforts. As a result, online services for metabolic reconstruction have been developed to externalize data processing burdens and provide warehousing and visualization tools for environmental sequence information. Popular online services for metabolic reconstruction include Integrated Microbial Genomes and Metagenomes (IMG/M), Community Cyberinfrastructure for Advanced Microbial Ecology Research and Analysis (CAMERA), and Metagenome Rapid Annotation using Subsystem Technology (MG-RAST). Both IMG/M [135, 136] and CAMERA [134] warehouse public datasets and provide management, exploration, and visualization tools for environmental sequence information. MG-RAST [133, 142] warehouses public datasets and provides gene prediction and annotation services based on SEED subsystems mapping using FIGfams [143] and BLAST. While online services increase access to computational resources, idiosyncratic data processing and management practices common to each service insulate users from command-line optimization and create formatting and data transfer restrictions.

Pathway Tools [91, 92] is a production-quality software system that enables construction, management and navigation of symbolic representations of metabolism in the form of Pathway/Genome databases (PGDBs). A PGDB encodes contemporary knowledge about the network properties of a cellular organism.
Pathway Tools supports four modular operations: metabolic pathway prediction using PathoLogic [89], metabolic flux modeling using MetaFlux [144], PGDB editing and navigation tools including manual or automated search functions, and comparative analysis and systems-level visualizations. Further, genes, reactions, and pathways can be exported via the Systems Biology Markup Language (SBML) framework, allowing interoperability and downstream analysis with compatible systems biology tools [145]. The PathoLogic module allows users to construct new PGDBs from an annotated genome using MetaCyc [80, 146], a highly curated, non-redundant and experimentally validated database of metabolic pathways representing all domains of life. Unlike KEGG pathways or SEED subsystems, MetaCyc emphasizes smaller, evolutionarily conserved units of metabolism or pathway variants that are regulated and transferred together. MetaCyc is also extensively commented with pathway descriptions, literature citations, and enzyme properties including subunit composition, substrate specificity, cofactors, activators, and inhibitors, each connected to specific pathway variants [82]. A web-server version of the Pathway Tools editing and navigation tools supports online browsing, manual curation and web publishing of PGDBs. Currently PGDBs for 2037 cellular organisms have been constructed and incorporated into the BioCyc collection [147].

MetaPathways extends the PGDB concept for cellular organisms to microbial community structure and function through use of the PathoLogic algorithm to build environmental PGDBs (ePGDBs) compatible with the editing and navigation features of Pathway Tools. The pipeline accepts assembled contig or unassembled nucleotide sequences, performs quality control and coverage estimation, predicts and annotates noncoding genes and open reading frames, and produces concatenated GenBank files used as inputs to PathoLogic.
In addition to constructing ePGDBs, MetaPathways uses MLTreeMap to build phylogenetic trees for selected taxonomic anchor and functional gene markers [126], converts General Feature Format (.gff) files into concatenated GenBank (.gbk) files for ePGDB construction using third-party annotations, and generates useful file formats including Sequin files for direct GenBank submission and gene feature tables summarizing annotations, MLTreeMap, and ePGDBs for statistical comparisons.

2.2 Implementation

MetaPathways is a Unix-based modular pipeline written in Python that calls software components written in C/C++, Perl (version 5), and Python (v2.6), but requires additional reference databases and the Pathway Tools software in order to run with full functionality. MetaPathways is compatible with a number of functional protein (RefSeq, KEGG, COG, MetaCyc [77, 78, 80, 148]) and taxonomic databases (Silva, GreenGenes [106, 107]), and requires that their annotated amino acid and nucleotide sequences be placed in the MetaPathways DBs/ directory. Required executables formatted for Mac OS X or Ubuntu operating systems are contained in the executables/ directory, and source code is provided for other versions of Linux. Construction of ePGDBs requires the installation of Pathway Tools v16.5 or greater (http://bioinformatics.ai.sri.com/ptools/), which is freely licensed by SRI International for academic use. More details on installation and obtaining reference databases can be found at https://www.github.com/hallamlab/metapathways.

MetaPathways accepts metagenomic or metatranscriptomic sequence data as input in a number of file formats (.fasta, .gff, or .gbk), and consists of five operational stages: (1) quality control (QC) and open reading frame (ORF) prediction, (2) ORF annotation, (3) modular analysis, (4) ePGDB construction, and (5) pathway export (Figure 2.1).
A parameter file (.parameters.txt) delimits software settings for successive operational stages and can be easily edited to enable or disable specific operations or modify default settings associated with specific software components (Figure 2.2). Output files or directories containing results for pipeline stages are specified in brackets.

2.2.1 Quality control & ORF prediction

Sequence information is processed to remove sequences below a user-defined length threshold, non-nucleotide bases (i.e., not A, T, C, or G) are converted to the character ‘N’, and input sequence identifiers are sequentially renamed (i.e., [sample] #). A mapping file (.mapping.txt) is exported, relating original input sequence names with sequential names. ORFs are predicted using a metagenomic version of the Prokaryotic Dynamic Programming Genefinding Algorithm (Prodigal), MetaProdigal, which can detect incomplete or fragmentary ORFs [73, 74]. MetaProdigal

Figure 2.1: The MetaPathways pipeline. The MetaPathways pipeline consists of five operational stages including (a) Quality control (QC) and open reading frame (ORF) prediction, (b) ORF annotation, (c) Modular analysis, (d) ePGDB construction, and (e) Pathway Export. Inputs and executables are depicted on the left with corresponding output directories and exported files on the right.
Figure originally publishedin BMC Bioinformatics under the Creative Commons Attribution Licence v2.0 [149].35##V.0.8   do not remove this line  INPUT:format fasta   # Quality Control  parameters  quality_control:min_length 180 quality_control:delete_replicates yes  # ORF prediction parameters  orf_prediction:algorithm prodigal orf_prediction:min_length 60  # ORF annotation parameters annotation:dbs  metacyc,kegg,cog,refseq annotation:min_bsr  0.4 annotation:max_evalue 0.000001 annotation:min_score 20 annotation:min_length 60 annotation:max_hits 10  #rRNA annotation parameters rRNA:refdbs GreenGenes,Silva-SSU,Silva-LSU rRNA:max_evalue 0.000001 rRNA:min_identity 20 rRNA:min_bitscore 50  # pathway tools parameters ptools_settings:taxonomic_pruning no  # grid settings grid_engine:batch_size 500 grid_engine:max_concurrent_batches 400 grid_engine:user username grid_engine:server server.address.com  # execution flags metapaths_steps:PREPROCESS_FASTA        yes metapaths_steps:ORF_PREDICTION  yes metapaths_steps:GFF_TO_AMINO    yes metapaths_steps:FILTERED_FASTA  yes metapaths_steps:COMPUTE_REFSCORE        yes metapaths_steps:BLAST_REFDB      grid metapaths_steps:PARSE_BLAST     yes metapaths_steps:SCAN_rRNA       grid metapaths_steps:STATS_rRNA      yes metapaths_steps:SCAN_tRNA       yes metapaths_steps:ANNOTATE        yes metapaths_steps:PATHOLOGIC_INPUT       yes metapaths_steps:GENBANK_FILE    skip metapaths_steps:CREATE_SEQUIN_FILE      skip metapaths_steps:CREATE_REPORT_FILES     skip metapaths_steps:MLTREEMAP_CALCULATION   skip metapaths_steps:MLTREEMAP_IMAGEMAKER    skip metapaths_steps:PATHOLOGIC      stop Figure 2.2: Default MetaPathways settings. The parameter file (.parameters.txt) delimits software settingsfor successive operational stages and can be easily edited to enable or disable specific operations or modifydefault settings associated with specific software components. Execution flags including yes, skip, andgrid control successive pipeline operations. 
Figure originally published in BMC Bioinformatics under theCreative Commons Attribution Licence v2.0 [149].36is optimized for environmental sequence information, using 50 training files derived from a pair-wise distance matrix composed of 1,415 clustered NCBI RefSeq genomes. MetaProdigal uses theGC content of each input sequence to select appropriate internal parameters for its ORF predictionmodel, which has been shown to increase the accuracy of predicted genes in metagenomic samples[74]. Moreover, a Bonferroni-like correction process reduces false-positive identification on shortreads by combining penalties for input sequence length, number of training files used and lengthof the predicted ORF [150]. When benchmarked against other ORF prediction methods tuned foranonymous and short read sequences MetaProdigal performs exceptionally well with improvedstart site identification [72]. Coordinate information, nucleotide, and translated amino acid usingbacterial translation table 11 (http://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi), areexported as .gff, .fna, and .faa files, respectively. By default, ORFs below a default length of180 nucleotides or 60 amino acids are removed (.qced.faa) and nucleotide (.nuc.stats) andamino acid sequence (.amino.stats) distribution summaries before and after post-processing areexported.2.2.2 ORF annotationTranslated ORFs are queried against user-defined reference protein databases including KEGG[78], COG [148], RefSeq [77], and MetaCyc [80], where MetaCyc refers to the pathway hole-fillerdatabase included with Pathway Tools, using the protein BLAST or optimized LAST algorithm intabular format (.blastout/.lastout) [76]. Concomitant with reference protein database queries,self-BLAST bit-scores are calculated (.refscores) enabling a measure of similarity using the BLAST-score ratio (BSR) [151]. 
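The BSR can be illustrated with a minimal sketch. It assumes the ratio is the bit-score of a hit divided by the bit-score of the query aligned against itself, and that hits are filtered at the annotation:min_bsr value of 0.4 shown in Figure 2.2. The function names and the tuple layout of the hits are hypothetical and are not taken from the MetaPathways source code.

```python
def blast_score_ratio(hit_bitscore, self_bitscore):
    """Return the BSR: the hit bit-score divided by the query's self-hit bit-score."""
    if self_bitscore <= 0:
        raise ValueError("self bit-score must be positive")
    return hit_bitscore / self_bitscore


def filter_hits_by_bsr(hits, refscores, min_bsr=0.4):
    """Keep hits whose BSR is at least min_bsr.

    hits:      list of (orf_id, target_id, bitscore) tuples (hypothetical layout)
    refscores: dict mapping orf_id to its self-BLAST bit-score
    """
    kept = []
    for orf_id, target_id, bitscore in hits:
        bsr = blast_score_ratio(bitscore, refscores[orf_id])
        if bsr >= min_bsr:
            kept.append((orf_id, target_id, bitscore, round(bsr, 3)))
    return kept


# Toy data: self-hit scores per ORF and three database hits.
refscores = {"orf_1": 200.0, "orf_2": 150.0}
hits = [
    ("orf_1", "refA", 180.0),  # BSR 0.9, kept
    ("orf_1", "refB", 60.0),   # BSR 0.3, dropped
    ("orf_2", "refC", 75.0),   # BSR 0.5, kept
]
kept_hits = filter_hits_by_bsr(hits, refscores)
```

Because the ratio is normalized by the self-hit score, it is comparable between short and long ORFs, unlike the raw bit-score.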
BLAST summary tables reporting E-values, percent identities, bit-scores, lengths, and BSRs are exported for each reference database (.blastout.parsed.txt). By default, annotations with BSRs below 0.4, corresponding to the so-called “Twilight Zone” of gene annotation, are excluded from summary tables [152].

BLAST represents a computational burden that can limit pipeline performance on big datasets when implemented on local machines [51]. Therefore, we have adopted a representational state transfer (REST) design supporting implementation on external Sun Grid Engine servers or supercomputers [153]. A user-defined connection filter (username, password, and external server address for configuration) and externalization script enable setup (uploading, formatting, and installing BLAST databases and executables), parallel splitting of BLAST jobs, queue submission and management, and the collection and consolidation of results back to the local machine. This creates a RESTful system that is robust to unforeseen interruption and is readily transferable to the Cloud. MetaPathways can also incorporate third-party annotations sourced from .gbk or .gff files directly using embedded file-interconversion scripts.

Tabular BLAST results returned from local or external resources are used to assign product descriptions to predicted ORFs based on an internal heuristic that standardizes product descriptions. The heuristic information score favors longer annotations, with a large preference for Enzyme Commission (EC) numbers, as these are more likely to be mapped to pathways via PathoLogic. For each ORF, the hit with the top E-value from each reference database is selected and given a simple information score, based on the number of distinct enzymatic words and a preference for EC numbers. Text annotations are cleaned of stop words, i.e., words that carry very little information (e.g., ‘as’, ‘the’, ‘is’, ‘at’, ‘which’), and of the words ‘protein’ and ‘enzyme’.
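The scoring heuristic of this section can be sketched as follows. This is a minimal illustration and not the pipeline's actual code: the tokenization, the regular expression used to recognize EC numbers, and the function names are assumptions. The stop-word list holds only the examples named in the text, and the +10 bonus is the value this section gives for an annotation with a valid EC number.

```python
import re

# Stop words named in the text, plus 'protein' and 'enzyme'.
STOP_WORDS = {"as", "the", "is", "at", "which", "protein", "enzyme"}

# Assumption: only fully specified EC numbers such as "EC 1.1.1.1" count as valid.
EC_PATTERN = re.compile(r"\bEC[ :]?\d+\.\d+\.\d+\.\d+\b", re.IGNORECASE)


def information_score(annotation, ec_bonus=10):
    """Count distinct informative words and add a bonus for an EC number."""
    text_without_ec = EC_PATTERN.sub(" ", annotation)
    words = re.findall(r"[a-z][a-z0-9-]*", text_without_ec.lower())
    remaining = set(word for word in words if word not in STOP_WORDS)
    score = len(remaining)
    if EC_PATTERN.search(annotation):
        score += ec_bonus
    return score


def best_annotation(candidates):
    """Return the candidate product description with the highest score."""
    return max(candidates, key=information_score)


candidates = [
    "hypothetical protein",
    "putative zinc-dependent alcohol dehydrogenase family oxidoreductase",
    "alcohol dehydrogenase (EC 1.1.1.1)",
]
chosen = best_annotation(candidates)
```

In this toy case the shorter description carrying an EC number outranks the longer description without one, which is the behaviour the preference for EC numbers is meant to produce.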
The ‘information score’ is calculated as a simple sum of the number of remaining words, with a +10 bonus if the annotation has a valid EC number. Functional annotations with the highest information score are appended to the ORF description and exported as a tabular file (.annotated.gff). Predicted ORFs with no BLAST hits are annotated as “hypothetical protein.” In addition, BLAST summaries of functional annotations at different hierarchical levels are exported for KEGG and COG databases (.[DB].stats.txt). Following functional annotation of predicted ORFs, nucleotide sequences are queried against reference nucleotide databases including SILVA [107] and GreenGenes [106] to identify ribosomal RNA genes. BLAST summary tables containing E-values, percent identities, bit-scores, lengths, and taxonomic identity are exported for each reference database (.rRNA.stats.txt). This information is combined with the file .annotated.gff to generate input files for ePGDB construction and standard .gbk and .sqn files for NCBI submission.

2.2.3 Analyses

MetaPathways currently implements three modular analyses using existing or derived files as input (.fasta, .gbk, or .gff input formats and derived tabular results). The first analytic module implements tRNA-scan (version 1.4) to identify relevant tRNAs from QC’ed nucleotide sequences [154]. Resulting tRNA identifications are appended to the .gbk and .sequin files as additional annotations. The second analytic module implements the popular and widely accepted lowest common ancestor (LCA) algorithm for taxonomic binning [104]. The lowest common ancestor in the NCBI taxonomic hierarchy is selected based on the previously calculated BLAST hits from the RefSeq database. This effectively sums the number of BLAST hits at the lowest shared position of the hierarchy. The RefSeq taxonomic names often contain multiple synonyms or alternative spellings. Therefore, names that conform to the official NCBI taxonomy are selected in preference over unknown synonyms.
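The LCA rule can be sketched on a toy taxonomy. The child-to-parent map below is a drastically simplified, hypothetical stand-in for the NCBI hierarchy, and the function names are illustrative rather than those used in the pipeline.

```python
from collections import Counter

# Toy child -> parent map; the real NCBI hierarchy has many more ranks.
PARENT = {
    "root": None,
    "Bacteria": "root",
    "Proteobacteria": "Bacteria",
    "Firmicutes": "Bacteria",
    "Escherichia coli": "Proteobacteria",
    "Vibrio cholerae": "Proteobacteria",
    "Bacillus subtilis": "Firmicutes",
}


def lineage(taxon, parent):
    """Return the path from the root down to taxon."""
    path = [taxon]
    while parent.get(path[-1]) is not None:
        path.append(parent[path[-1]])
    return path[::-1]


def lowest_common_ancestor(taxa, parent):
    """Return the deepest node shared by the lineages of all taxa."""
    lineages = [lineage(taxon, parent) for taxon in taxa]
    lca = None
    for ranks in zip(*lineages):
        if len(set(ranks)) != 1:
            break
        lca = ranks[0]
    return lca


# Taxa of the BLAST hits for each ORF (toy data), binned by their LCA.
orf_hits = {
    "orf_1": ["Escherichia coli", "Vibrio cholerae"],
    "orf_2": ["Escherichia coli", "Bacillus subtilis"],
    "orf_3": ["Bacillus subtilis"],
}
bins = Counter(lowest_common_ancestor(taxa, PARENT) for taxa in orf_hits.values())
```

An ORF whose hits disagree at a shallow rank is binned conservatively at that rank, whereas an ORF with consistent hits keeps its species-level assignment.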
The third analytic module implements MLTreeMap (version 2.061) to identify and construct trees for selected phylogenetic and functional marker genes from QC’ed nucleotide sequences [126]. Results from LCA and MLTreeMap are exported as a tabular file (fxn_and_taxa_table.txt). Additional analysis modules implemented from the command line can be directly inserted into the pipeline. By convention, results from each analysis are placed in a self-titled directory under the parent results directory (e.g., /results/mltreemap).

2.2.4 ePGDB construction

The annotated ORF file (.annotated.gff) is parsed and separated into four files: (1) an annotation file containing gene product information, (2) a nucleotide sequence file in .fasta format, (3) a genetic-elements file, and (4) a PGDB parameters file (/ptools/). For the purposes of ePGDB construction, nucleotide input files are concatenated to form a single ‘chromosomal’ element defining a composite genome. Algorithmically, Pathway Tools was not designed for an ‘organism’ with thousands of ‘chromosomes’, and runs excessively slowly if these are provided as thousands of separate files, essentially overburdening the Unix file system. Concatenation of ORF sequences into an artificial ‘chromosome’ improves Pathway Tools’ performance, allowing the creation of ePGDBs with thousands of genetic elements through PathoLogic. PathoLogic uses this input to predict metabolic pathways based on defined biochemical rules (pathway completion, diagnostic/key enzymes, biosynthesis and degradation constraints), resulting in ePGDB construction and export to the local Pathway Tools internal library ([Pathway Tools]/user/).

Environmental PGDBs and their contents are accessible, internally or externally, through a built-in web server, allowing the knowledge of genes, proteins, metabolic and regulatory networks embedded within them to be queried, compared, curated, and shared in a distributed fashion via the Internet.
In addition to powerful search and retrieval functions, Pathway Tools provides a metabolic encyclopedia, based on primary literature citations, encompassing more than 1,900 evolutionarily conserved sub-pathways within the MetaCyc schema [155]. The “Cellular Overview” feature displays ePGDB contents in the form of interactive glyphs that link sub-pathways together in a global picture of metabolism. Hovering over a glyph activates a tooltip that identifies the pathway, and clicking on a glyph reveals pathway interactions at the level of enzymes, reactions, and identified ORFs (Figure 2.3). Direct comparisons between ePGDBs can be made using coloured overlays on the cellular overview, revealing similarities and differences in metabolic pathway composition (Figure 2.4) [89].

2.2.5 Pathway export

Information is extracted from ePGDBs, including ORF identities, enzyme abundance, and pathway coverage, and exported in tabular format (.pwy.txt). A receipt and time-stamp for each successful pipeline execution is created containing the specific parameter settings used in ePGDB creation (.run_parameters.txt).

2.3 Performance

MetaPathways performance was evaluated using unassembled (Sanger fosmid-end, 454 pyrosequencing) and assembled (Illumina HiSeq) genomic sequence information sourced from a naphtha-degrading, methanogenic enrichment culture (Tables 2.1 and 2.2). Input datasets captured a range of nucleotide sequence numbers, lengths, and sample coverage. Base pathway prediction and runtime increased as a function of nucleotide sequence number. While runtime complexity varies in relation to input file size, reference database size, and external resource allocation, empirical runtimes approached an upper limit of 2,300 ORF sequences per minute (minimum length 60 amino acids) when externalizing BLAST on the Western Canadian Research Grid.
Remaining analyses and data transformations were performed locally on a Mac Pro desktop computer running Mac OSX 10.6.8 with 2 x 2.4 GHz Quad-Core Intel Xeon processors and 24 GB of 1066 MHz DDR3 RAM.

[Figure 2.3: annotated screenshots labelled ePGDB Cellular Overview, Metabolic Pathway, Reaction, and Open Reading Frame.]

Figure 2.3: ePGDBs and Pathway Tools. Environmental PGDBs (ePGDBs) and their contents are accessible through a built-in web server, allowing the knowledge of genes, proteins, metabolic and regulatory networks embedded within them to be queried, compared, curated, and shared in a distributed fashion via the Internet. The “Cellular Overview” feature displays ePGDB contents in the form of interactive glyphs that link sub-pathways together in a global picture of metabolism scalable down to the level of pathways, reactions, and individual open reading frames. Figure originally published in BMC Bioinformatics under the Creative Commons Attribution Licence v2.0 [149].

Figure 2.4: ePGDBs and the Cellular Overview. Cellular Overview in the Pathway Tools software highlighting pathways found in the naphtha-degrading culture sample. Using the assembled Illumina pathways as a backbone (blue), common predicted pathways from the 454 (red) and Sanger (green) sequencing runs are placed on top, allowing exploration of pathways predicted using different sequencing technologies and depths. Figure originally published in BMC Bioinformatics under the Creative Commons Attribution Licence v2.0 [149].

Table 2.1: MetaPathways validation summary based on a comparison of three sequencing methods on a common sample.

Sequencing Platform   Assembly Status   Sequences   Bases         Avg. Length   Std. Deviation
ABI 3730              unassembled       14,403      10,345,496    718           96
GS FLX Titanium       unassembled       345,695     126,545,084   366           87
Illumina HiSeq        assembled         553,438     376,761,374   680           2,305

Table 2.2: MetaPathways validation summary based on a comparison of three sequencing methods on a common sample (continued).

Sequencing Platform   Protein Coding Sequences   Annotated Coding Sequences   rRNA Genes   tRNA Genes   Taxa Assignment   MLTreeMap Markers   MetaCyc Pathways
ABI 3730              17,131                     6,387                        699          215          6,663             71                  123
GS FLX Titanium       286,686                    25,879                       6,649        1,826        114,235           6,245               230
Illumina HiSeq        657,832                    126,736                      5,443        5,848        275,823           7,822               284

2.3.1 Evaluation of pathway prediction with simulated metagenomes

Previous studies have evaluated PathoLogic’s performance on fully sequenced genomes, establishing its pathway prediction power in relation to other machine learning methods [156]. To determine PathoLogic’s performance on combined and incomplete genomes sourced from environmental sequence information, we generated simulated metagenomes from 10 BioCyc tier-2 PGDB genomes (Table A.1) using MetaSim (Sanger sequencing, average length 700 bp, standard deviation 100 bp) with differing sequence coverage and taxon distribution profiles (Sim1 and Sim2) [157]. Tier-2 PGDBs were selected to minimize potential name mapping errors between MetaPathways’ annotations and extant MetaCyc annotations [156]. In Sim1 each genome was present at equal coverage, and in Sim2 the Caulobacter crescentus NA1000 genome was overrepresented by 20-fold (Figure 2.5a).
Simulations manifesting progressively larger fractions of total unique sequence length (unique-Gm) revealed that pathway recovery increases with sequence coverage (Figure 2.5b). Specificity, a measure of the confidence in accurate pathway prediction, was high (> 85%) regardless of taxonomic distribution or sequence coverage (Figure 2.5c), consistent with reduced Type I errors (false positives). However, sensitivity, a measure of the confidence in predicting specific pathways present in the sample, was reduced at low coverage, consistent with increased Type II errors (false negatives) (Figure 2.5c). A 6% reduction in pathway recovery between Sim1 and Sim2 was observed, suggesting that pathway prediction follows a collector’s curve in which core metabolic functions shared between community members initially accumulate. As coverage increases, the encounter frequency for accessory genes increases, resulting in improved pathway prediction approaching a limit based on extant MetaCyc pathways. Summary statistics, including the F-measure and Matthews Correlation Coefficient, which balance Type I and Type II errors, reinforce the observation that PathoLogic’s performance improves with increasing sequence coverage (Tables 2.3 and A.2).

[Figure 2.5: three-panel plot. (a) Copy Number of each genome in Sim1 and Sim2: Agrobacterium tumefaciens C58, Aurantimonas manganoxydans SI85-9A1, Bacillus subtilis 168, Caulobacter crescentus CB15, Caulobacter crescentus NA1000, Helicobacter pylori 26695, Mycobacterium tuberculosis CDC1551, Mycobacterium tuberculosis H37Rv, Synechococcus elongatus PCC 7942, Vibrio cholerae str. N16961. (b) Predicted Pathways (% Total) versus Sequencing (% Unique-Gm). (c) Performance Value (Specificity, Sensitivity) versus Sequencing (% Unique-Gm).]

Figure 2.5: Analysis on in silico simulated sequencing experiments across different levels of coverage and taxon distribution. Sim1 (blue) contains ten tier-2 PGDB genomes in approximately equal proportion. Sim2 (red) has one taxon overrepresented by 20-fold. Tier-2 taxa were selected on the basis of approximately equal genome size and gene content (a). Predicted pathway recovery as a percentage of the total pathways predicted from the full genome (b). Specificity (triangles) and sensitivity (squares) classification performance of predicted pathways using the pathways predicted on the full genomes as the gold standard (c). Interpolating lines were drawn via a natural spline. Figure originally published in BMC Bioinformatics under the Creative Commons Attribution Licence v2.0 [149].

2.4 Discussion

While efforts to model microbial community structure in relation to environmental parameters have successfully predicted real-world distribution and diversity patterns in the surface ocean [158–160], the extension of modeling approaches to microbial metabolic interaction networks remains nascent. Function-based models such as Predicted Relative Metabolic Turnover (PRMT) predict metabolic flux in the environment based on the abundance of unique functional annotations using MG-RAST [99]. More recently, Abubucker and colleagues developed the Human Microbiome Project Unified Metabolic Analysis Network (HUMAnN) for metabolic reconstruction [94]. HUMAnN integrates MinPath to reconcile the multiple mapping problem associated with BLAST-based annotations for metabolic inference based on KEGG pathways and SEED subsystems, with additional taxonomic limitation and gap-filling algorithms to reduce false positives and correct for rare genes in abundant pathways [93]. HUMAnN results have been compared using the Metagenomics Reports (METAREP) data storage and retrieval pipeline, which supports scalable and dynamic analysis of complex environmental datasets [95].
While Pathway Tools uses its set of biochemical rules for pathway prediction, an alternative to Pathway Tools for the construction of genome-scale metabolic networks has also been integrated into SEED servers. This approach projects reactions onto the comparatively coarser KEGG metabolic map without further filtering or weighting results, and applies a mixed integer linear optimization for filling reaction gaps [161, 162]. However, this method has not yet been applied to metabolic interaction networks in the environment.

Table 2.3: Pathway classification performance statistics for simulated metagenomes Sim1 and Sim2 at progressively larger sequence coverage.

Sample   Gm     Precision   Sensitivity   Specificity   Accuracy   F-measure   Matthews
Sim1     1/32   0.96        0.31          0.99          0.73       0.47        0.79
Sim1     1/16   0.70        0.38          0.98          0.75       0.53        0.73
Sim1     1/8    0.76        0.57          0.98          0.82       0.71        0.81
Sim1     1/4    0.85        0.69          0.97          0.86       0.80        0.83
Sim1     1/2    0.81        0.80          0.98          0.91       0.87        0.88
Sim1     1/1    0.84        0.98          0.97          0.94       0.92        0.91
Sim2     1/32   0.93        0.25          0.99          0.7        0.4         0.74
Sim2     1/16   0.95        0.36          0.99          0.75       0.53        0.78
Sim2     1/8    0.93        0.49          0.98          0.79       0.64        0.78
Sim2     1/4    0.95        0.62          0.98          0.84       0.75        0.83
Sim2     1/2    0.97        0.70          0.98          0.88       0.81        0.87
Sim2     1/1    0.95        0.81          0.97          0.91       0.87        0.87

2.4.1 Pipeline limitations

Compared to current methods that project functional annotations from environmental sequence information onto KEGG pathways or SEED subsystems, MetaPathways enables an alternative algorithmic approach to metabolic reconstruction using evolutionarily conserved pathway prediction based on coverage and biochemical pathway rules. Moreover, the pipeline performs taxonomic binning and functional gene annotation, integrates external resource partitioning on compute clusters using the Sun Grid Engine, and supports useful data transformation and formatting options.
While we have demonstrated pipeline scalability with next-generation sequencing datasets, further improvements to computationally intensive stages including BLAST/LAST-based annotation and ePGDB construction are needed to keep pace with projected advances in sequencing throughput. Future pipeline implementations will enable users to harness multi-core desktop computers to build local grid engines or to externalize BLAST and ePGDB construction on commercial computing resources such as the Amazon Elastic Compute Cloud (EC2). As an alternative to comprehensive all-against-all homology searches, future pipeline implementations will also incorporate scalable and distributed clustering algorithms enabling functional annotation based on hierarchical cluster assignments [6, 163, 164].

Aside from runtime improvements, additional data transformation and visual analysis modules expanding on existing taxonomic binning and marker gene identification components are needed. These include coverage statistics for assembled sequence information, data matrices and interactive visualizations indicating numerical abundance and taxonomic distribution of enzymatic steps, self-organizing maps, and automated methods to append single-cell or population genome assemblies to the NCBI hierarchy for more accurate taxonomic binning. Additional reference databases for 5S, 7S, and 23S RNA genes and updates to the current MetaCyc database that include more biogeochemically relevant pathways are needed to improve BLAST and cluster-based annotation efforts. Finally, more experience and operational insight are needed in constructing, comparing, and interpreting ePGDBs to identify potential sources of error and inform ongoing Pathway Tools development efforts.

2.5 Conclusions

MetaPathways provides users with a modular annotation and analysis pipeline for predicting metabolic interaction networks from environmental sequence information.
It is extensible to genomic and transcriptomic datasets from multiple sequencing platforms, and generates useful data products for microbial community structure and functional analysis including phylogenetic trees, taxonomic bins, and tabular annotation files. The pipeline provides local and external computing solutions for implementing BLAST/LAST homology searches, resolves data handling issues associated with .gbk and .gff file conversion and NCBI submission, and generates ePGDBs using Pathway Tools for pathway inference and interactive visualization. The MetaPathways software, installation instructions, tutorials, and example data can be obtained from http://github.com/hallamlab/MetaPathways/.

Chapter 3

Metabolic pathways for the whole community

A convergence of high-throughput sequencing and computational power is transforming biology into information science. Despite these technological advances, converting bits and bytes of sequence information into meaningful insights remains a challenging enterprise. Biological systems operate on multiple hierarchical levels from genomes to biomes. Holistic understanding of biological systems requires agile software tools that permit comparative analyses across multiple information levels (DNA, RNA, protein, and metabolites) to identify emergent properties, diagnose system states, or predict responses to environmental change.

This chapter adopts the MetaPathways annotation and analysis pipeline and Pathway Tools to construct environmental pathway/genome databases (ePGDBs) that describe microbial community metabolism using MetaCyc, a highly curated database of metabolic pathways and components covering all domains of life. Pathway Tools’ performance is evaluated on three datasets with different complexity and coding potential, including simulated metagenomes, a symbiotic system, and the Hawaii Ocean time series.
Accuracy and sensitivity relationships between read length, coverage, and pathway recovery are highlighted, and the impact of taxonomic pruning on ePGDB construction and interpretation is evaluated. Resulting ePGDBs provide interactive metabolic maps, predict emergent metabolic pathways associated with biosynthesis and energy production, and differentiate between genomic potential and phenotypic expression across defined environmental gradients.

This multi-tiered analysis provides the user community with specific operating guidelines, performance metrics, and prediction hazards for more reliable ePGDB construction and interpretation. Moreover, it demonstrates the power of Pathway Tools in predicting metabolic interactions in natural and engineered ecosystems.

3.1 Background

Community interactions between uncultivated microorganisms give rise to dynamic metabolic networks integral to ecosystem function and global-scale biogeochemical cycles [138]. Metagenomics bridges the ‘cultivation gap’ through plurality or single-cell sequencing by providing direct and quantitative insight into microbial community structure and function [4, 165]. Although new technologies are rapidly expanding our capacity to chart microbial sequence space, persistent computational and analytical bottlenecks impede comparative analyses across multiple information levels (DNA, RNA, protein, and metabolites) [166, 167]. This in turn limits our ability to convert the genetic potential and phenotypic expression of microbial communities into predictive insights and technological or therapeutic innovations.

Functional genes operate within the structure of metabolic pathways and reactions that define metabolic networks. Despite this fact, few metagenomic studies use pathway-centric approaches to predict microbial community interaction networks based on known biochemical rules.
Recently, algorithms for pathway prediction and metabolic flux have been developed for environmental sequence information including the Human Microbiome Project Unified Metabolic Analysis Network (HUMAnN) and Predicted Relative Metabolic Turnover (PRMT). HUMAnN uses an integer optimization algorithm that conservatively computes a parsimonious minimum set of reactions along KEGG pathways based on pathway presence, absence, or completion [93, 94]. PRMT infers metabolic flux based on normalized enzyme activity counts mapped to KEGG pathways across multiple metagenomes [99]. Because KEGG pathways are coarse and do not discriminate between pathway variants, both modes of analysis have limited metabolic resolution [82]. Moreover, neither HUMAnN nor PRMT provides a coherent structure for exploring and interpreting predicted KEGG pathways.

One alternative to HUMAnN and PRMT is Pathway Tools, a production-quality software environment supporting metabolic inference and flux balance analysis based on the MetaCyc database of metabolic pathways and enzymes representing all domains of life [80, 89, 91, 144]. Unlike KEGG or SEED subsystems, MetaCyc emphasizes smaller, evolutionarily conserved or co-regulated units of metabolism and contains the largest collection (over 2,000) of experimentally validated metabolic pathways. Extensively commented pathway descriptions, literature citations, and enzyme properties combined within a pathway/genome database (PGDB) provide a coherent structure for exploring and interpreting predicted pathways. Although initially conceived for cellular organisms, recent development of the MetaPathways pipeline extends the PGDB concept to environmental sequence information, enabling pathway-centric insights into microbial community structure and function [149, 168].

This chapter provides essential guidelines for generating and interpreting ePGDBs inspired by the multi-tiered structure of BioCyc [79] (Figure 3.1).
Genome and metagenome simulations are used to assess performance on datasets manifesting different read length, coverage, and taxonomic diversity, with a weighted taxonomic distance being used to evaluate concordance between pathways predicted using environmental sequence information and reference pathways in the MetaCyc database. Given these performance metrics, Pathway Tools’ power to predict emergent metabolism is demonstrated in simulated metagenomes and a previously characterized symbiotic system [169]. Finally, ePGDBs generated from coupled metagenomic and metatranscriptomic datasets from the Hawaii Ocean time series (HOT) are used to compare and contrast genetic potential and phenotypic expression along defined environmental gradients in the ocean [16, 18, 156, 170].

[Figure 3.1: schematic with panels (a) to (e). Labels: BioCyc PGDBs; Tier-1, Tier-2, Tier-3; Highly Curated, Moderately Curated, Automatically Curated; EcoCyc Pathways; MetaSim Metagenomes (Sim1, Sim2; Copy Number); Symbionts; HOT.]

Figure 3.1: A multi-tiered approach to ePGDB validation. (a) In the absence of highly curated and validated multi-omic datasets, ePGDBs were analyzed at different levels of manual curation and coverage, taking inspiration from the curation-tiered structure of available genomes within the BioCyc family. (b/c) Through in silico simulated sequencing experiments on the E. coli K12 genome and two simulated metagenomes, the performance of the PathoLogic algorithm was evaluated under changing sequence coverage and taxonomic distributions. (d) Genomes of Candidatus Moranella endobia and Candidatus Tremblaya princeps, two symbiotic taxa with reduced genomes, were analyzed for a number of shared essential amino acid pathways. (e) Finally, predicted pathways from a previously analyzed paired metagenomic and metatranscriptomic dataset from the Hawaii Ocean time series were validated against previously identified pathways and metabolic functions.
Figure originally published in BMC Genomics under the Creative Commons Attribution Licence v2.0 [171].

3.2 Results and discussion

This section describes pathway prediction performance on simulated, symbiotic, and metagenomic and metatranscriptomic samples from the Hawaii Ocean time series. Because PathoLogic was run without taxonomic pruning, enabling it to predict pathways from all three domains of life, it may predict pathways widely outside the appropriate taxonomic range. The development of a weighted taxonomic distance and identification of interpretive hazards (e.g., reversible pathways, pathways with multifunctional enzymes, pathway variants, etc.) provide a framework for interpreting environmental sequence information within Pathway Tools and MetaCyc. Prediction of nitrogen cycling pathways from the Hawaii Ocean time series is provided as a case study of these hazards.

3.2.1 Performance considerations

Environmental pathway/genome database (ePGDB) construction commences with the MetaPathways annotation pipeline, using environmental sequence information as input (Methods Section 3.3). Resulting annotations are used by the PathoLogic algorithm, by way of Pathway Tools, to predict metabolic pathways based on multiple criteria including proportion of pathways found, pathway-specific enzymatic reactions, and purported taxon-specific pathway distributions. PathoLogic is known to perform well when compared to machine learning methods using the genomes of cellular organisms as input [156]. Previous analysis reported PathoLogic’s performance on combined and incomplete genomes using two simulated metagenomes (Sim1 and Sim2) derived from 10 BioCyc tier-2 PGDBs manifesting different coverage and taxonomic diversity using MetaSim [149, 157]. Simulations on increasing proportions of the total component genome length (Gm) showed that the performance of pathway recovery increased with sequence coverage and sample diversity, nearing an asymptote at higher coverage (Figure 2.5).
This suggests that pathway prediction follows a collector’s curve in which common core pathways accumulate in the early part of the curve, followed by less common accessory pathways near the asymptote.

To better constrain pathway recovery and performance in relation to ePGDB construction, results of previous MetaSim experiments were expanded to include the Escherichia coli K12 substr. MG1655 genome (basis of the EcoCyc database), Sim1 and Sim2, and a subsampled 25 m metagenome from HOT [170] (Tables B.1, B.2, and Figure B.1). Simulations were performed at progressively larger Gm coverage and run through the MetaPathways pipeline under default settings, where predicted ORF and CDS annotations closely followed sequencing depth in a stable manner (Tables B.3 and B.4). Consistent with previous observations for Sim1 and Sim2, all experiments showed that pathway recovery percentage and sensitivity increased with sequence coverage and sample diversity, nearing an asymptote at higher coverage (Figure 3.2a). The absolute values of these patterns were sensitive to read length and likely reflected limits imposed by open reading frame prediction and BLAST/LAST-based annotation.

[Figure 3.2: two-panel plot, each panel split into Long Read (~700 bp) and Short Read (~160 bp). (a) Pathway Recovery Fraction versus Gm. (b) Performance (Sensitivity, Precision) versus Gm. Series: HOT (25 m), K12, Sim1, Sim2.]

Figure 3.2: Analysis on in silico simulated sequencing experiments across different levels of coverage, sequencing lengths, and taxonomic distributions. (a) Predicted pathway recovery as a fraction of the total pathways predicted from the full genomes. (b) Sensitivity (circles) and precision (triangles) of predicted pathways of the in silico experiments using the pathways predicted on the full genomes as the gold standard. Figure originally published in BMC Genomics under the Creative Commons Attribution Licence v2.0 [171].
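The classification statistics used in these evaluations can be sketched from a confusion matrix in which the pathways predicted from the full genomes serve as the gold standard. The sketch below uses the standard textbook definitions of each statistic. The set-based bookkeeping, the function names, and the toy pathway identifiers are assumptions and are not taken from the thesis code, and the sketch assumes non-degenerate counts (no zero denominators).

```python
import math


def confusion_counts(predicted, gold, universe):
    """Count TP, FP, TN, and FN for sets of pathway identifiers."""
    tp = len(predicted & gold)
    fp = len(predicted - gold)
    fn = len(gold - predicted)
    tn = len(universe - predicted - gold)
    return tp, fp, tn, fn


def classification_metrics(tp, fp, tn, fn):
    """Return standard binary classification statistics as a dict."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f_measure = 2 * precision * sensitivity / (precision + sensitivity)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "accuracy": accuracy,
        "f_measure": f_measure,
        "mcc": mcc,
    }


# Toy example: 20 candidate pathways, 4 in the gold standard, 3 predicted.
universe = set("PWY-%d" % i for i in range(1, 21))
gold = {"PWY-1", "PWY-2", "PWY-3", "PWY-4"}
predicted = {"PWY-1", "PWY-2", "PWY-5"}
counts = confusion_counts(predicted, gold, universe)
metrics = classification_metrics(*counts)
```

In the toy example most candidate pathways are true negatives, so accuracy (0.85) is much higher than sensitivity (0.5), which is one reason to also report the F-measure and the Matthews correlation coefficient when positive and negative examples are unbalanced.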
In contrast, precision was high (> 85%) regardless of read length, coverage, or taxonomic diversity (Figure 3.2b). The rate of pathway recovery increased proportionally with increasing sample diversity at lower coverage values, as seen in the reduction of pathway recovery percentage between Sim1, Sim2 and E. coli for long read (700 bp) and between HOT, Sim1/2 and E. coli for short read (160 bp) datasets. Summary performance statistics that attempt to balance false positive and false negative errors (Accuracy, F-measure, and Matthews correlation coefficient) improve with sequence coverage (Figures B.3 and B.4); however, accuracy is likely an over-optimistic measure because of the highly unbalanced number of positive and negative examples (Tables B.6 and B.7). Because PathoLogic performance improves with increasing read length, coverage and sample diversity, sequencing platform selection and use of assembled versus unassembled sequence information should be considered when generating ePGDBs.

When constructing PGDBs for individual genomes, PathoLogic uses a process called taxonomic pruning to constrain pathway predictions within a specified taxonomic lineage by taking advantage of the curated ‘taxonomic range’ associated with a given pathway. For example, if a pathway is found only in plants, it will be difficult to predict this pathway in the genome of a bacterial isolate when using taxonomic pruning, unless the complete pathway is present [89]. Such a process is intended to reduce false positive predictions in individual genomes [172]; however, microbial communities are composed of diverse and largely uncultivated lineages whose combined metabolic potential and phenotypic expression must be considered both within and between individuals. Thus the taxonomic origin of environmental sequence information cannot be ascertained with the same degree of certainty as that of individual microbial genomes sourced from isolates or single-cells.
Indeed, the true taxonomic range of many pathways remains to be constrained given the limited number of isolate genomes and the proclivity for horizontal gene transfer within microbial communities.

In order to evaluate the impact of taxonomic pruning on pathway recovery from environmental sequence information, ePGDBs enabling or disabling taxon-specific pathway distributions were constructed (Table B.8). PathoLogic was run on Sim1/2 and the 25 m HOT dataset with the ‘Unclassified sequences’ pruning threshold and without pruning. With taxonomic pruning enabled, long read and short read Sim1 ePGDBs exhibited a reduction of 56% (206 compared to 604) and 61% (194 compared to 499) predicted pathways, respectively. Interestingly, the subsampled 25 m HOT dataset exhibited a 28% reduction (425 compared to 593) in pathway recovery with and without pruning, suggesting that increased sample complexity can partially offset taxon specific sensitivity losses. In all cases, the pathways predicted with taxonomic pruning were a subset of pathways predicted without taxonomic pruning. Given these observations, strict taxonomic pruning seems inappropriate for ePGDB construction, but disabling it raises potential prediction hazards associated with pathways predicted outside of their expected taxonomic range.

A weighted taxonomic distance (WTD) algorithm was developed to evaluate concordance between pathways predicted using environmental sequence information and reference pathways in the MetaCyc database (Section 3.4). For each predicted pathway, the WTD algorithm measures the taxonomic distance between predicted coding DNA sequences (CDS), e.g., BLAST hits from the RefSeq database, and the expected taxonomic range curated for each MetaCyc pathway, using the NCBI Taxonomy Database.
The NCBI Taxonomy Database is hierarchically structured, and a path between the lowest common ancestor (LCA) of observed CDS annotations and each member of the expected taxonomic range in a pathway can be charted [104], where each path length represents some measure of taxonomic distance relative to its lineage (e.g. root, cellular organism, domain, phylum/division, class, order, family, genus, species). Steps on the path near the root of the hierarchy define greater evolutionary distances than those near the tips. Thus the WTD algorithm weights steps on the connecting path by $1/2^{d}$, where $d$ is the depth position of a particular taxon in the hierarchy. To distinguish between paths descending from the expected taxonomic range and those falling outside the expected taxonomic range, paths descending from an expected taxonomic range have a non-negative distance and paths outside this range have a negative distance. The WTD algorithm gives preference to non-negative distances within expected taxonomic range(s), returning the minimum distance if found. Otherwise the maximum negative distance (i.e., closest to zero) is returned.

When the WTD algorithm was applied to HOT datasets, the taxonomic distribution of predicted pathways generally aligned with the expected taxonomic ranges of MetaCyc Pathways (Figure B.5). Predicted pathways were classified into four categories of taxonomic disagreement based on their WTD: ‘None’ if the WTD was positive, and ‘Low’, ‘Medium’, and ‘High’ if less than or equal to zero, based on distance quartiles. A pathway had ‘Low’ taxonomic disagreement if in the upper two quartiles of negative distances (i.e., those closest to zero), ‘Medium’ if in the second quartile, and ‘High’ if in the bottom (i.e., most negative) quartile.
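The classification into disagreement classes can be illustrated with a minimal sketch. The function name `disagreement_classes`, the use of Python's `statistics.quantiles` with the inclusive method, the handling of values that fall exactly on a cut point, and the example distances are all assumptions made for illustration; the text does not state how quartiles were computed.

```python
from statistics import quantiles


def disagreement_classes(wtds):
    """Map each pathway's WTD to a taxonomic disagreement class.

    wtds: dict of pathway name -> weighted taxonomic distance.
    Quartile cut points are taken over the non-positive distances only
    (at least two such values are required by statistics.quantiles).
    """
    negatives = [d for d in wtds.values() if d <= 0]
    # n=4 yields the three cut points Q1, median, Q3.
    q1, median, _ = quantiles(negatives, n=4, method="inclusive")

    classes = {}
    for pathway, d in wtds.items():
        if d > 0:
            classes[pathway] = "None"    # within the expected range
        elif d > median:
            classes[pathway] = "Low"     # upper two quartiles
        elif d >= q1:                    # tie handling is an assumption
            classes[pathway] = "Medium"  # second quartile
        else:
            classes[pathway] = "High"    # bottom quartile
    return classes


# Hypothetical distances, not values from the HOT analysis.
example = {"in_range": 0.25, "a": -0.1, "b": -0.5,
           "c": -1.0, "d": -1.5, "e": -2.0}
print(disagreement_classes(example))
```

Computing the cut points over the non-positive distances alone follows the description above, in which only pathways outside their expected range are split by quartile.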
Pathways with expected taxonomic ranges affiliated with bacteria and archaea dominated the ‘None’, ‘Low’, and ‘Medium’ disagreement classes, while pathways with expected taxonomic ranges affiliated with eukaryotes including ‘animals’, ‘fungi’, and ‘plants’ comprised the majority of the ‘High’ disagreement class (Figure B.6). While not excluded from downstream analysis, pathways with distances in the ‘High’ disagreement class are more likely to represent false positives and should be interpreted with care.

3.2.2 Distributed metabolic pathways

Public good dynamics may play an integral role in shaping microbial interactions through distributed networks of metabolite exchange. It has been suggested that such networks may increase fitness and resilience and explain the underlying difficulty in cultivating most environmental microorganisms [173–175]. Because ePGDBs are constructed from environmental sequence information, predicted pathways are represented by multiple donor genotypes providing different levels of sequence coverage for each reaction. By comparing pathway recovery for individual reference genomes to pathway recovery for combinations of reference genomes, it becomes formally possible to use Pathway Tools to identify distributed metabolic pathways that emerge between multiple interacting partners. To test this hypothesis, four tier-2 MetaCyc reference genomes were selected and ePGDBs constructed for all possible pair-wise genome combinations (Table B.9). Using set-difference analysis, thirty distributed pathways were identified in pair-wise combinations that were not predicted in PGDBs for individual cellular organisms (Table B.10). Common and unique reactions associated with distributed pathways could be identified as composite glyphs in the Pathway Tools genome browser (Figure B.7).

A symbiotic system with known nutritional provisioning requirements was selected to provide a real world example of distributed metabolic pathway prediction.
The reduced genomes of Candidatus Moranella endobia and Candidatus Tremblaya princeps (GenBank NC-015735 and NC-015736), bacterial endosymbionts of the mealybug Planococcus citri, have been previously described by McCutcheon and colleagues to distribute biosynthetic pathways for essential amino acids in a process known as ‘inter-pathway complementarity’. Environmental PGDB construction using the combined Moranella and Tremblaya genomes recovered 43 out of 44 reactions and all 9 distributed amino acid biosynthesis pathways previously reported (Figures 3.3 and B.8). Given these results, combinatorial ePGDB construction has enormous potential to predict distributed metabolic pathways within defined microbial assemblages, e.g., co-cultures or more complex microbial communities in natural and engineered ecosystems.

[Figure 3.3 image: pathway glyph diagrams for amino acid biosynthesis pathways, including lysine biosynthesis I and tryptophan biosynthesis. Legend: Moranella, Tremblaya, Both, Neither.]
Figure 3.3: Examples of amino acid metabolism shared between Moranella endobia and Tremblaya princeps. This figure illustrates examples of emergent metabolic pathways predicted between symbiotic prokaryotes Candidatus Moranella endobia and Candidatus Tremblaya princeps. Enzymes found in Moranella (red), Tremblaya (turquoise), or both taxa (purple) are highlighted in the pathway glyph diagrams, showing patterns of potentially emergent metabolism. A complete description of all amino acid pathways can be found in Figure B.8. Figure originally published in BMC Genomics under the Creative Commons Attribution Licence v2.0 [171].

3.2.3 Comparative community metabolism

Metagenome (DNA) and metatranscriptome (RNA) datasets from 25, 75, 110 m (sunlit or euphotic) and 500 m (dark) ocean depth intervals from HOT were compared to evaluate Pathway Tools’ performance on complex microbial communities at different information levels [170]. A total of 1026 unique pathways from approximately 1.2 billion base pairs with 1.2 million CDS annotations of environmental sequence information were recovered spanning defined environmental gradients including luminosity, salinity, pressure, and oxygen concentration (Table B.11).
Of these pathways, 840 met minimal quality control (QC) standards (Section 3.3) and were used for subsequent set-difference analysis (Figure 3.4a).

More than 600 pathways were shared between the sunlit and dark ocean based on combined DNA and RNA datasets, consistent with a conserved metabolic core (Figure 3.4b). A total of 14 unique pathways were predicted exclusively in sunlit samples with 20 pathways predicted at the intersection of 25, 75 and 110 m depth intervals (Figure 3.4b). More than 100 unique pathways were predicted for the 500 m complement, consistent with increased metabolic potential and niche-specialization with increasing depth (Figure 3.4b). Interestingly, the normalized proportion of genetic potential (DNA) versus expressed metabolic pathways (DNA/RNA) increased linearly between 25, 75 and 110 m depth intervals (0.4, 0.7 and 1.2, respectively) before plateauing at 500 m (1.2) (Figure 3.4c). It remains to be determined if this trend reflects an asymptote or an inflection point in pathway expression covarying as a function of metabolic status, environmental conditions or sample coverage and QC.

A total of 30 pathways were identified exclusively in RNA datasets including 11 pathway variants (Figure 3.4c and Figure B.9). Expressed cholesterol degradation and tetrahydrobiopterin biosynthesis I were common to all depth intervals. Unique expressed photorespiration and glycolate degradation III pathways were recovered at 25 and 75 m, while ammonia oxidation III, methane oxidation to methanol II, and arginine biosynthesis III were unique to 500 m (Figure B.9). More than 590 pathways were identified exclusively in DNA datasets, while 495 were shared between DNA and RNA datasets (Figure 3.4d). With respect to functional classes, unique Degradation, Biosynthesis, and Energy-Metabolism pathways increased as a function of depth in DNA datasets (Figure B.10a).
Within unique degradation classes a progression from amino acids to aromatic-compounds and secondary metabolites was observed between 25, 75, 110 and 500 m depth intervals. A similar progression was observed for a subset of Biosynthetic classes including polyamines, lipids, and cofactors and for Energy-Metabolism including C1-compounds and fermentation (Figure B.10b).

[Figure 3.4 image: (a) predicted pathways, Passed (840) and Failed (193); (b) four-way set diagram of QC Pathways (840) for 25 m, 75 m, 110 m, and 500 m; (c) sample-wise DNA/RNA sets for 25 m (685), 75 m (691), 110 m (670), and 500 m (797); (d) four-way set diagrams of the DNA Fraction (593) and Shared DNA & RNA (495).]
Figure 3.4: Analysis of predicted pathways from the Hawaii Ocean time series. (a) A total of 1033 unique pathways were predicted from the HOT samples, however only 840 unique pathways remained after all pathways in each sample with less than 10 ORFs were removed. (b) After normalizing by total predicted ORFs, a 4-way set analysis of these quality controlled (QC) pathways shows that the samples share a large core of common pathways. (c) Separating unique pathways within the DNA (light colors) and RNA (dark colors) of each sample revealed that very few pathways were unique to the RNA fraction of each sample. (d) Finally, a set analysis of the unique DNA fraction (light colors), and pathways common to DNA and RNA from each sample (dark colors) found subsets of pathways unique to each depth fraction. Figure originally published in BMC Genomics under the Creative Commons Attribution Licence v2.0 [171].

An evaluation of the 72 most abundant pathways recovered from the combined datasets indicated that 53 were both present and expressed at 25, 75, 110, and 500 m depth intervals. Moreover, several of the most abundant pathways including ammonium transport, Rubisco shunt, NADH to cytochrome electron transfer, pyruvate fermentation, denitrification, Calvin-Benson-Bassham cycle, cysteine biosynthesis I and arginine biosynthesis III exhibited depth-dependent trends in gene expression (Figure B.11). A number of abundant pathways common to 25, 75, 110, and 500 m depth intervals in the DNA datasets were exclusively expressed in sunlit or dark ocean waters (Figure 3.5). In sunlit waters these included photosynthesis light reactions, hydrogen production VIII, flavonoid biosynthesis, cofactors including heme, vitamin B-complex (thiamin, adenosylcobalamin), and glutathione for oxidative stress (Figure 3.5). Below the euphotic zone, the 500 m depth interval exclusively expressed pathways for ribitol, rhamnose, guanosine nucleotide, 2-methylcitrate, and threonine degradation as well as pathways for cofactor biosynthesis including phosphopantothenate, menaquinol-8 (vitamin K), and coenzyme M and several carbohydrate and amino acid biosynthetic pathways including CMP-N-acetylneuraminate I, ADP-L-glycero-beta-D-manno-heptose and glycine biosynthesis IV (Figure 3.5).

Consistent with previous reports, sunlit waters expressed many photosynthesis-related pathways including aerobic electron transfer, hydrogen production, and cofactors including ubiquinol, heme, vitamin B-complex (nicotinate, thiamine, cobalamin, tetrahydrofolate), chlorophyll a, and retinol biosynthesis [18, 170] (Figures B.12 and B.13).
In addition to photosynthesis, the 25 m and 75 m (upper euphotic) depth interval sets included pathways associated with degradation of plant metabolites including phytate, glucuronate, mannitol, chitin, xylose, arabinose, gallate, and quinolate. Other pathways of interest identified in sunlit waters included organophosphate, urea, and aminobutyrate degradation, as well as pathways for conversion of the plant hormone indole-3-acetic acid and mercury detoxification. Below the euphotic zone, the 500 m depth interval expressed unique pathways for intra-aerobic nitrite reduction, dissimilatory nitrate reduction, the reductive monocarboxylic acid cycle, ammonia oxidation, and methane oxidation to methanol I (Figure B.14). Thus, comparative ePGDB analysis using the combined DNA and RNA datasets differentiated between genomic potential and phenotypic expression across defined environmental gradients in the ocean and revealed known and novel patterns of functional specialization with potential implications for nutrient and energy flow within sunlit and dark ocean waters.

[Figure 3.5 image: MetaCyc pathways plotted by relative CDS abundance (%) in DNA and RNA for Sunlit (25 m, 75 m, 110 m) and Dark (500 m) samples, grouped by pathway class (Energy, Biosynthesis, Degradation).]
Figure 3.5: Comparison of predicted genomic and transcriptomic pathways with unique expression in the ‘sunlit’ and ‘dark’ HOT samples. Sunlit metabolism was indicative of photosynthesis and aerobic metabolism including photosynthesis light reactions and hydrogen production. Dark metabolism had significantly more degradation pathways. Figure originally published in BMC Genomics under the Creative Commons Attribution Licence v2.0 [171].

3.2.4 Pathway prediction hazards

While the construction of ePGDBs promotes pathway-centric analysis of environmental sequence information, prediction hazards need to be considered for optimal interpretive power.
One common hazard is the ‘multiple mapping problem’, arising when an enzyme catalyzes conserved or promiscuous reaction steps across multiple pathways, or when enzyme commission (EC) numbers represent classes with non-specific substrate activity.

For example, EC represents a non-specific enzyme class for beta-D-glucosides, allowing for spurious prediction of specific carbohydrate degradation pathways. Moreover, PathoLogic has a preference for EC numbers over product descriptions that can further exacerbate false discovery associated with non-specific enzyme classes. Hazards manifesting themselves within pathway variants sharing a number of common or reversible reaction steps have previously been described by Caspi and colleagues in the context of PGDB construction for cellular organisms [172]. For example, the tricarboxylic acid (TCA) cycle has at least eight pathway variants associated with different taxonomic groups and several incomplete or reversible forms that share multiple reaction steps.

PathoLogic has difficulty differentiating between TCA cycle variants when reversible pathway components are present, even when a diagnostic step such as ATP-citrate lyase for the reductive TCA cycle is missing from the input data. A similar problem occurs when a regulatory protein is used to provide evidence that a pathway exists even when catalytic pathway components are missing from the input. Given that ePGDBs were constructed without taxonomic pruning and that PathoLogic uses automated annotations from multiple taxonomic groups when predicting pathways from environmental sequence information, taxon specific pathways such as plant hormone biosynthesis or innate immunity can be predicted even when organisms known to encode such pathways are absent from the dataset. As described in the performance considerations section, WTD can be used to discern differences between the predicted and expected taxonomic range of pathways, pointing to potential hazards prior to interpretation.
Indeed, the extent to which these predicted pathways reflect previously unrecognized variants or prediction artifacts remains to be determined. Moreover, this hazard has the potential to confound distributed metabolic pathway identification when sequence coverage is low or microbial community composition is extremely uneven. Some examples of these hazards from the HOT analysis are provided in Table B.15.

The identification of dissimilatory nitrate reduction (denitrification), intra-aerobic nitrite reduction and ammonia oxidation pathways in the combined 500 m HOT DNA and RNA datasets provides a real world example of hazard navigation. Denitrification is a distributed form of energy metabolism resulting in the production of nitrogen gas in oxygen-deficient waters (<20 µmol O2 per kg) [119, 176]. The first step in denitrification is nitrate reduction to nitrite. In the combined HOT DNA and RNA datasets the predicted pathway variant nitrate reduction IV included a subset of CDS transcripts for ‘nitrate reductase gamma subunit’ (24 in DNA, 79 in RNA) while the predicted pathway variant nitrate reduction I included CDS transcripts for multiple nitrate reductase subunits (Figure 3.6). While CDS for nitrate reductase subunits originated from a number of different taxa including Alphaproteobacteria, Gammaproteobacteria, Nitrospira and Planctomycetes, 435 out of 523 (83%) of predicted nitrate reductase transcripts originated from Nitrospira and Planctomycetes, consistent with a role in nitrite oxidation [177–180] (Figure 3.6). The second step in denitrification is nitrite reduction to nitric oxide. Within the DNA dataset both bacterial and archaeal CDS for nitrite reductase were recovered, while transcripts originating from ammonia oxidizing archaea dominated the RNA dataset (Figure 3.6).
Coding sequences/transcripts for downstream pathway components including nitric oxide reductase and nitrous oxide reductase were not detected, although CbbQ/NirQ/NorQ family regulators necessary for inorganic carbon fixation in the Calvin-Benson-Bassham cycle, nitrite and nitric oxide reduction were identified in DNA and RNA datasets (Figure 3.6). Given that the mean oxygen concentration at 500 m is 120 µmol O2 per kg [16, 18], these results are consistent with active water column nitrite and ammonia oxidation processes. Recent studies in the Eastern Tropical South Pacific oxygen minimum zone (OMZ) observed changes in the frequency distribution of denitrification genes between free-living (0.2–1.6 µm) and particle-associated (>1.6 µm) size fractions, with nitric oxide reductase and nitrous oxide reductase encoding genes enriched on particles [181]. The extent to which denitrification or anammox processes partition between free-living and particle-associated microorganisms in the HOT water column remains to be determined.

[Figure 3.6 image: (a) nitrogen cycling reaction diagram (NH4+, NO3-, NO2-, NO, N2O, N2) with DNA and RNA read counts for nitrate reduction I (denitrification), nitrate reduction IV (dissimilatory), and intra-aerobic nitrite reduction; (b) read counts by function and taxonomy for reactions A–E. Taxa: Candidatus Nitrospira defluvii, Nitrolancetus hollandicus, planctomycete KSU-1, Archaea (Nitrosopumilaceae), Archaea (other), Alphaproteobacteria, Gammaproteobacteria, other Bacteria, Proteobacteria.]
Figure 3.6: Taxonomic and functional breakdown of nitrogen cycling pathways. (a) Nitrogen cycling pathways and reactions assigned by PathoLogic. Arrow color indicates pathway: nitrate reduction I (denitrification) (brown), nitrate reduction IV (dissimilatory) (yellow), and intra-aerobic nitrite reduction (red). Grey numbers adjacent to arrows indicate the number of reads assigned to the reaction in the DNA and RNA (RNA in parentheses). Overlapping circles indicate the distribution of reads across multiple pathways. (b) BLAST-based functional and taxonomic breakdown of reads assigned to reactions in given pathways as indicated by letters A–E. Function was determined by the top RefSeq BLAST hit, reported by the MetaPathways pipeline, and indicated by reaction arrows, with color corresponding to taxa or taxonomic group with known activity: taxa with nitrate and nitrite reducing activity (blue), nitrite oxidizing activity (green), and ammonia oxidizing activity (purple). Grey reactions indicate no reads for enzymatic activity were detected, only regulatory proteins that may be involved in gene expression regulation (*). Figure originally published in BMC Genomics under the Creative Commons Attribution Licence v2.0 [171].

3.3 Methods

This section describes some of the details of the analysis, including prediction of MetaCyc pathways from multi-omic sequence information using the MetaPathways pipeline, pathway prediction performance of simulated short-read and long-read metagenomes, taxonomic pruning experiments, and a summary of the weighted taxonomic distance (WTD) algorithm and distance to measure taxonomic disagreement in predicted pathways when pruning is disabled.
Finally, the set-difference analyses for the discovery of emergent metabolic pathways from both synthetic and symbiotic experiments are highlighted, and the pathway-centric analysis of metagenomic and transcriptomic samples from the Hawaii Ocean time series is described.

3.3.1 Metabolic pathway analysis

Environmental PGDBs were constructed from public datasets using MetaPathways (http://github.com/hallamlab/MetaPathways/) [149] with default parameter settings: open reading frame (ORF) detection by Prodigal (minimum length 60 amino acids), functional annotation by BLAST (e-value 1e-5, blast-score ratio 0.4) against protein databases KEGG [78], COG [148], MetaCyc [80] (version 16.0), and RefSeq [77] (downloaded August 2012), and pathway prediction via the PathoLogic algorithm with taxonomic pruning disabled. Predicted pathways and associated annotated CDS sequences were extracted from created ePGDBs using the utility script extract_pathway_table_from_pgdb.pl included with MetaPathways.

3.3.2 Pathway prediction on simulated data

Simulated sequencing experiments were performed using MetaSim [157] with the following parameter settings (long read: clone size 36,000 bp, Gaussian error, mean read length 700 bp, standard deviation 100 bp; short read: Gaussian error, mean 160 bp, standard deviation 40 bp) against the E. coli K12 MG1655 complete nucleotide genome (GenBank: NC_000913) at a series of fractional levels (1/32, 1/16, 1/8, 1/4, 1/2, 1/1) of the total combined length of starting component genomes (Gm). Pathways were predicted using the MetaPathways pipeline, as described above, against each of the resulting sequence sets (Tables B.3 and B.4). A classification performance analysis was performed. True positives (TP) were pathways found in both the simulated sample pathways (test set) and the complete gold standard E. coli genome. True negatives (TN) were pathways not predicted in the test set or gold standard. False positives (FP) were pathways found in the test set but not in the gold standard.
Finally, false negatives (FN) were pathways found in the gold standard but not in the test set. Multiple summary statistics for the resulting confusion tables (Sensitivity (Recall), Specificity, Precision, Accuracy, F-measure, and Matthews correlation coefficient (MCC)) were calculated. A summary of these performance statistics is provided in Section B.1 ‘Confusion Table Statistics’.

3.3.3 Simulated metagenomes: Sim1, Sim2

Simulated sequencing experiments of metagenomes Sim1 and Sim2 were generated and analyzed as described above for E. coli. To minimize name-mapping problems, prokaryotic genomes were selected from the tier-2 BioCyc database collection [156]. The Sim1 metagenome was composed of ten tier-2 BioCyc genomes (Table B.2) in equal copy number, while Sim2 was composed of the Caulobacter crescentus NA1000 genome in 20-fold excess relative to other genomes (Figure B.1). A classification performance analysis was performed as described above with the set of 646 pathways predicted from the complete tier-2 genomes used to derive Sim1 and Sim2 representing the gold standard (Figures B.3 and B.4, Tables B.6 and B.7).

3.3.4 Simulated metagenomes: HOT (25 m)

The 25 m metagenome from the Hawaii Ocean time series was sub-sampled with replacement to different fractional levels (1/20, 1/10, 3/20, 1/5, 2/5, 3/5, 4/5, and 1/1) and pathways were predicted as described above. Similarly, a classification performance analysis was performed with the set of 864 pathways predicted from the complete 454 run representing the gold standard (Figures B.4 and B.7).

3.3.5 Taxonomic pruning experiments

The full-Gm simulated sequencing samples for Sim1 and Sim2, both short and long read lengths, and the full-Gm HOT (25 m) sample, had their pathways predicted with the above method, but with taxonomic pruning enabled using the taxonomic lineage parameter set to ‘Unclassified sequences’.
The number of predicted pathways was tabulated and compared with the pathways previously predicted with taxonomic pruning disabled. A simple set analysis showed that within a sample the pruned pathways were a strict subset of the ‘no-pruning’ ones, and the reduction in pathways was calculated (Table B.8).

3.3.6 Weighted taxonomic distance

For each predicted pathway in the HOT dataset, a weighted taxonomic distance (WTD) was calculated using the WTD algorithm. First, the lowest common ancestor algorithm (LCA) was applied to a pathway’s RefSeq CDS sequences. The WTD algorithm calculates a weighted distance $D$ between the observed LCA taxonomy $x_o$ and the pathway’s expected taxonomic range(s) $x_e \in TR_{(MetaCyc)}(p)$, where $TR_{(MetaCyc)}(p)$ is the set of taxonomic range(s) for a given pathway $p$ on the NCBI Taxonomy Database hierarchy.

This WTD algorithm takes as input $p$ and $x_o$, and calculates a weighted taxonomic distance for each $x_e$ on nodes in the connecting path $P(x_e, x_o)$, as

\[
D(a, b) = \sum_{e_{a,b} \in E_{P(x_e, x_o)}} \frac{1}{2^{d(a)}} \tag{3.1}
\]

where $e_{a,b}$ is an edge between nodes $a$ and $b$ in the path and $d(a)$ is the depth of node $a$. If $x_o$ descends from the expected taxonomic range $x_e$, then the WTD is assigned a positive value, while the WTD for paths descending outside this range is assigned a negative value. After calculating the WTDs for all pairs $x_e$, $x_o$, the WTD algorithm first attempts to return the minimum non-negative distance, i.e., the WTD corresponding to the closest $x_e$ where $x_o$ is a descendant of $x_e$, and otherwise returns the maximum negative score, i.e., the one closest to zero, if all observed and expected taxonomies diverge.
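The distance calculation and selection rule can be sketched on a toy hierarchy. This is an illustration under stated assumptions, not the implementation used in this work: the taxonomy is a hand-written parent map rather than the NCBI Taxonomy Database, the root is taken to have depth 0, the node $a$ of each edge is taken to be the parent (the node nearer the root), and the function names `lineage`, `signed_distance`, and `wtd` are hypothetical.

```python
def lineage(node, parent):
    """Return the path from node up to the root, node first."""
    path = [node]
    while parent[path[-1]] is not None:
        path.append(parent[path[-1]])
    return path


def depth(node, parent):
    """Depth of a node, with the root at depth 0 (an assumption)."""
    return len(lineage(node, parent)) - 1


def signed_distance(x_e, x_o, parent):
    """Sum 1/2**d(a) over the edges on the path between x_e and x_o."""
    lin_e, lin_o = lineage(x_e, parent), lineage(x_o, parent)
    # Deepest ancestor shared by the two lineages.
    common = next(n for n in lin_o if n in lin_e)
    # Every node strictly below the shared ancestor contributes the edge
    # to its parent a, weighted by the parent's depth d(a).
    below = lin_e[:lin_e.index(common)] + lin_o[:lin_o.index(common)]
    dist = sum(1 / 2 ** depth(parent[n], parent) for n in below)
    # Non-negative when x_o descends from (or equals) x_e, else negative.
    return dist if common == x_e else -dist


def wtd(x_o, expected_range, parent):
    """Minimum non-negative distance if any, else the negative one nearest zero."""
    dists = [signed_distance(x_e, x_o, parent) for x_e in expected_range]
    non_negative = [d for d in dists if d >= 0]
    return min(non_negative) if non_negative else max(dists)


# Toy hierarchy for illustration only.
PARENT = {
    "root": None,
    "cellular organisms": "root",
    "Bacteria": "cellular organisms",
    "Eukaryota": "cellular organisms",
    "Proteobacteria": "Bacteria",
    "Viridiplantae": "Eukaryota",
}

print(wtd("Proteobacteria", ["Bacteria"], PARENT))       # 0.25
print(wtd("Proteobacteria", ["Viridiplantae"], PARENT))  # -1.5
```

In the second call the path runs from Viridiplantae up to cellular organisms and back down to Proteobacteria, so the two edges adjacent to the shared ancestor each contribute 1/2 and the two deeper edges each contribute 1/4.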
For each dataset, predicted pathways were assigned to a Disagreement Class based on the following criteria: (i) pathways with positive WTD were given the ‘None’ class, (ii) pathways with distances greater than the median of negative WTDs were given the ‘Low’ class, (iii) pathways within the 2nd quartile were given the ‘Medium’ class, and (iv) pathways in the lower quartile were given the ‘High’ disagreement class (Figure B.5). The expected taxonomic ranges of each pathway were then collapsed into the higher taxonomic levels: ‘root’, ‘cellular organisms’, ‘prokaryotes’, ‘archaea’, ‘bacteria’, ‘eukaryotes’, ‘animals’, ‘fungi’, ‘plants’, and ‘other’, as defined on the NCBI Taxonomy Database hierarchy, and pathway frequencies and disagreement classes were summarized for each sample (Figure B.6). For a detailed account of the motivation and derivation of the WTD and algorithm see Section 3.4.

3.3.7 Distributed metabolic pathway prediction

Four genomes of similar size and complexity from the MetaCyc tier-2 classification were combined in a pairwise manner: Aurantimonas manganoxydans SI85-9A (GenBank: NZ_AAPJ00000000.1), Bacillus subtilis subtilis 168 (GenBank: AL009126.3), Caulobacter crescentus NA1000 (GenBank: CP001340.1), and Helicobacter pylori 26695 (GenBank: AE000511.1), abbreviated by the first character of their proper names, A, B, C, and H, respectively. The six pairwise and four original genomes were analyzed as described above for E. coli (Table B.9). Pathways predicted in the combined PGDBs were considered candidates for distributed metabolism if they were absent from PGDBs for individual genomes (i.e., found in A and B combined, but not in either A or B individually) (Table B.10). Candidate pathways were manually inspected and deemed ‘plausible’ if there was sufficient coverage, i.e., 75% of reactions in a pathway had associated CDS sequences from both taxa (Figure B.8).
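The candidate-selection step above is a set difference, which can be sketched in Python as follows. The function names, the pathway identifiers, and the data layout are hypothetical. In the study the plausibility check was a manual inspection; `coverage_is_plausible` encodes only one possible reading of the 75% criterion (at least 75% of reactions have a CDS, and both taxa contribute at least one CDS), which is an assumption of this example.

```python
from typing import Dict, Iterable, Set


def distributed_candidates(combined: Set[str],
                           genome_a: Set[str],
                           genome_b: Set[str]) -> Set[str]:
    """Pathways predicted for the combined genomes but for neither alone."""
    return combined - (genome_a | genome_b)


def coverage_is_plausible(reaction_taxa: Dict[str, Set[str]],
                          taxa: Iterable[str],
                          min_fraction: float = 0.75) -> bool:
    """`reaction_taxa` maps each reaction of a pathway to the set of taxa
    that contributed at least one CDS to it (empty set = no CDS found)."""
    if not reaction_taxa:
        return False
    covered = sum(1 for contributors in reaction_taxa.values() if contributors)
    contributing = set().union(*reaction_taxa.values())
    enough_reactions = covered / len(reaction_taxa) >= min_fraction
    both_taxa_present = set(taxa) <= contributing
    return enough_reactions and both_taxa_present
```

A pathway that passes both functions would still warrant the manual inspection described in the text.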
Similarly, the Candidatus Moranella endobia and Candidatus Tremblaya princeps genomes (GenBank: NC-015735 and NC-015736) were downloaded from NCBI and analyzed as described above for E. coli. Resulting PGDBs for individual and combined genomes were manually inspected for amino acid biosynthetic pathways described in McCutcheon and von Dohlen [169] (Figure B.8).

3.3.8 Hawaii ocean time series

Unassembled metagenomic and transcriptomic pyrosequences from the Hawaii Ocean time series (10 m, 75 m, 110 m, and 500 m) were obtained from the NCBI Sequence Read Archive (SRA Accession: SRX007369–SRX007372, SRX156384, SRX156385, SRX016893, SRX016897) and run through the MetaPathways pipeline using default settings (Table B.11). To avoid spurious predictions, only pathways with more than ten mapped CDS sequences in an individual sample were used in downstream analysis. The pathways with nine or fewer mapped CDS sequences represent the lower quartile of pathway annotations (Figure 3.4a). Pathway CDS counts for each sample were normalized to the total number of unannotated ORFs in each dataset. Count data were then converted to percentages, providing relative ORF abundance for each pathway, along with their weighted taxonomic distances and sample-wise disagreement classes. Relative CDS abundances of the top-40 pathways from DNA and RNA datasets were compared (Figure B.11). In addition, pathways predicted in the DNA and RNA datasets were compared at each depth interval to provide sample-wise fractions for each depth, e.g., DNA-only, DNA-RNA, and RNA-only (Figure 3.4c). Given the small number of pathways in the RNA-only sets, no set-difference analysis was needed (Figure B.9). The DNA-only sets were declined and tabulated at various levels of the MetaCyc pathway hierarchy (Figure B.10). A final four-way set analysis was performed on the DNA-only and DNA-RNA pathways at each depth (Figure 3.4d).
DNA-RNA set-difference subsets with more than 5 predicted pathways were compared in detail (Figures B.12, B.13, B.14, B.15, B.16, and B.17). All data transformations, set operations, and comparisons were performed in the R statistical environment (http://www.r-project.org), and visualized using the ggplot2 graphical package (http://ggplot2.org) and d3.js JavaScript library (http://d3js.org/).

3.4 Motivation and derivation of the weighted taxonomic distance

The MetaCyc database contains a variety of genes and pathways affiliated with different taxonomic ranges. Thus pathway prediction using the PathoLogic algorithm typically involves a taxonomic pruning component based on the curated “expected taxonomic-range” for each pathway, penalizing the prediction of pathways outside of this specified range. While pruning can reduce false discovery when conducting metabolic reconstruction on individual genomes, environmental sequence information encompasses diverse donor genomes representing numerous taxonomic groups spanning multiple domains of life. A more appropriate distance measure would incorporate the expected taxonomic range(s) provided by MetaCyc and the observed taxa associated with the RefSeq annotated CDS sequences assigned to each pathway. This would assist with the interpretation of predicted pathways by providing a contextual measure of taxonomic agreement between observed and expected taxonomic range for each pathway. This section describes the derivation, applicability, and use of a weighted taxonomic distance (WTD) algorithm for making such a comparison, incorporating the NCBI Taxonomy Database and the Lowest Common Ancestor (LCA) algorithm [104].

Organismal taxonomy is a powerful organizing principle based on central tenets of evolution: inheritance, homology by common descent, and the conservation of sequence and structure. The NCBI Taxonomy provides a curated database consisting of organismal sequences submitted to GenBank.
As of April 2014, more than 300,000 taxonomic records were contained within the NCBI Taxonomy Database. Despite this large size, only about 50% of all birds and mammals are represented, and many taxonomic groups have no or limited representation. Manually curated and hierarchically structured, the NCBI Taxonomy Database classification scheme is based on the unison of morphological and molecular taxonomy methods that approximate the evolutionary relationships between extant life forms.

Due to its hierarchical structure, we can view the database as a tree where each node represents an individual taxon. For convenience we will denote this tree $T_{NCBI}$ and henceforth refer to it as the NCBI Tree. The root node represents the root of the tree, which by definition, has every taxon node contained within its sub-tree. A path from the root to any taxon is called a lineage consisting of all taxa along that path. Each node in the database can have synonyms that represent the same taxon, but a numerical identifier called a taxid uniquely represents each node. We will refer to the depth of a node as the length of its path from the root (i.e., the length of its shortest connecting path from the node to the root node).

For the purpose of providing a measure of divergence between two taxa, it would be convenient to have a distance that can be measured across the NCBI tree and that reflects three expectations. First, in ultrametric trees (where the branches are proportional to time), branches between nodes tend to decrease in length towards the present, because the increased number of species over evolutionary time has decreased the expected waiting time until the next branching event [182, 183]. Second, comparisons that include basal nodes in the tree (nearer the root) capture more evolutionary history and so should carry more weight than comparisons among the descendants of only one branch emanating from this basal node.
Third and similarly, comparisons that include basal nodes in the tree (nearer the root) should carry more weight than shallow nodes (nearer the present or leaves of the tree); for example, the ancestral node between Bacteria and Archaea should have more weight than divergences nearer the present, such as between Alphaproteobacteria and Betaproteobacteria, regardless of the length of the lineages that continue after the divergence.

Although many phylogenetic distances currently exist, many of these are explicitly formulated with respect to evolutionary time (requiring an ultrametric tree) and do not capture the above notions that branches and nodes nearer the root likely capture more unique evolutionary history and so should be weighted more strongly. These notions are particularly important to incorporate into a distance measure based on taxonomy, which does not specify node ages or branch lengths [184]. For instance, the popular UniFrac distance provides a metric for metagenomes based on sets of edges that overlap [185], but this measure pays no attention to whether these overlapping branches are near the present (and potentially very short) or in the distant past and likely long; no existing weightings for UniFrac capture the above observations. Moreover, many implementations of existing distances exhibit strange behaviour when a taxonomic assignment can be internal to a tree and not just at the tips. Such assignments are common when applying lowest common ancestor approximations to meta-omic annotations (e.g., ambiguity about the placement of an ORF as either Alphaproteobacteria or Betaproteobacteria may lead to a taxonomic assignment that is the ancestral node between these two taxa).
Thus we are motivated to implement our own simple weighted taxonomic distance to assess the differences between taxonomic lineages based on observed annotations and the curated lineages of predicted pathways they inhabit.

Let us now describe our distance with respect to our three expectations of lineage descent. First, denote the NCBI Tree as $T_{NCBI} = (V, E)$ with vertex and edge sets V and E corresponding to the taxa nodes and their relationships, respectively. Since $T_{NCBI}$ is a tree, it has no cycles, meaning the shortest path between any two nodes a, b ∈ V is unique. We will also assume that the set of edges E is undirected. Note that the depth of nodes and edges refers to the number of steps in their shortest path to the root of the tree, the root itself having a depth of zero. Below we describe our expectations and the behaviour of a hypothetical distance measure $D_{hyp}$ that respects them. We will use an example graph as an aid to intuition (Figure 3.7).

Figure 3.7: Illustrative phylogenetic tree for distance expectations. Here nodes represent taxa and edges their phylogenetic relationships. Note that the depth of a node is the number of edges along the shortest connecting path between itself and the root ($t_1$), with the root itself having a depth of zero. For example, the depth of node $t_4$ is two. Similarly, the depth of an edge is the number of steps along the shortest connecting path from its parent node to the root node. For example, the depth of the edge between $t_4$ and $t_7$ is the same as that of the parent node $t_4$, two. Numbers beside edges represent an example of edge weights that fit with our three expectations of lineage descent. Figure originally published in BMC Genomics under the Creative Commons Attribution Licence v2.0 [171].

Expectation 1 Taxonomic edge weight decreases with the depth in the tree (i.e., number of steps from the root of the tree).
Splits near the root of the tree represent major evolutionary distances, e.g., Domains (Bacteria, Archaea, and Eukaryotes), while splits closer to the tree tips represent more recently diverged taxa. Intuitively this means that edge weights should decrease with respect to their depth in the tree. For example, in Figure 3.7 the following inequalities should hold:

$$D_{hyp}(t_4, t_7) < D_{hyp}(t_2, t_4) < D_{hyp}(t_1, t_2) \tag{3.2}$$

Expectation 2 A split between two taxa in the NCBI tree separates them from a common ancestor. More specifically, if taxon A is not in the lineage of taxon B, the two taxa A and B diverge into two separate lineages. Here, the divergent distance from taxon A to taxon B should be larger than the distance from taxon A to taxon C, where C is a descendant of A, i.e., A is in the lineage of C. A distance $D_{hyp}$ should respect this divergence; for example, the following should hold in Figure 3.7:

$$D_{hyp}(t_4, t_{10}) < D_{hyp}(t_4, t_5) \tag{3.3}$$

Expectation 3 Divergence weight decreases with the number of steps from the root. Splits nearer the tree tips represent a smaller taxonomic distance than those closer to the root, i.e., the taxonomic significance of speciation events is inversely proportional to their depth in the tree, where depth is the number of steps in the connecting path to the root. Thus, we expect the following example inequality to hold in Figure 3.7:

$$D_{hyp}(t_{10}, t_5) < D_{hyp}(t_2, t_3) \tag{3.4}$$

3.4.1 WTD formulation

We will now derive a weighted taxonomic distance D between two nodes on the NCBI Tree that respects the above three expectations. First, let us define the depth of a node a as the number of edges in its shortest path to the root of the tree, and denote this d(a). Note that the depth of the root node $a_{root}$ is zero, $d(a_{root}) = 0$. Consider any link or edge between nodes k and l, $e_{k,l} \in E$, and assume without loss of generality that d(k) < d(l). We will define the lineage of a taxon a as the set of all taxa along the shortest path from a to $a_{root}$ inclusive, and denote this L(a).
We will say that a taxon b is in the lineage of a if b ∈ L(a). Note that this has the evolutionary interpretation that b is an ancestor of a (or is a itself).

Definition 1 (Weighted taxonomic distance). Consider any edge $e_{k,l}$ in the NCBI Tree $T_{NCBI}$ between the adjacent nodes k and l. Without loss of generality, since the edges are undirected, assume that d(k) < d(l), and define the weight of the edge $e_{k,l}$ as $c(e_{k,l}) = \frac{1}{2^{d(k)}}$. Next, we define the distance between two taxa a, b ∈ V as

$$D(a, b) \equiv \sum_{e_{x,y} \in E_{P(a,b)}} c(e_{x,y}) = \sum_{e_{x,y} \in E_{P(a,b)}} \frac{1}{2^{d(x)}}, \tag{3.5}$$

where $E_{P(a,b)}$ is the subset of edges in E along the shortest path P(a, b) between a and b, and x is the endpoint of each edge nearer the root.

We would like to emphasize that the above distance measure conforms to our three expectations. Although the function $c(e_{x,y})$ could be any function that decreases monotonically with depth in the tree (i.e., number of steps from the root), in defining the edge weights of D(a, b) to be inversely proportional to $2^{d(x)}$, the following interesting algebraic property holds:

$$\sum_{j=i+1}^{n} \frac{1}{2^{j}} < \frac{1}{2^{i}} \equiv \sum_{j=i+1}^{\infty} \frac{1}{2^{j}} \tag{3.6}$$

for any positive integer n and non-negative integer i. How this property affects distances in a tree is best illustrated by a quick example (Figure 3.8). Observe that the distance $D(x_1, y_1) = \frac{1}{4} + \frac{1}{4} = \frac{1}{2}$ is strictly smaller than $D(x_2, y_2) = \frac{1}{4} + \frac{1}{2} = \frac{3}{4}$. However, what is more interesting is that $D(x_1, z_1)$ is strictly smaller than $D(x_2, y_2)$ for all possible child nodes of $y_1$, illustrated as $z_1$,

$$D(x_1, z_1) < D(x_2, y_2) \tag{3.7}$$

$$\frac{1}{4} + \left(\frac{1}{4} + \frac{1}{8} + \dots\right) < \frac{1}{2} + \frac{1}{4}. \tag{3.8}$$

Now the convenience of our choice of $c(e_{x,y})$ with respect to the three expectations can be shown more explicitly:

Expectation 1 Consider two pairs of taxa $(a_1, a_2)$ and $(b_1, b_2)$, such that: (i) $a_1$ is in the lineage of $a_2$, (ii) $b_1$ is in the lineage of $b_2$, and (iii) the number of edges between $a_1$ and $a_2$, and between $b_1$ and $b_2$, are equal, say k; then $d(a_1) \le d(b_1)$ implies $D(a_1, a_2) \ge D(b_1, b_2)$, and similarly $d(a_1) > d(b_1)$ implies $D(a_1, a_2) < D(b_1, b_2)$.
To show this, without loss of generality suppose $d(a_1) \le d(b_1)$, then

$$D(a_1, a_2) \equiv \sum_{i=0}^{k-1} \frac{1}{2^{d(a_1)+i}} \ge \sum_{i=0}^{k-1} \frac{1}{2^{d(b_1)+i}} \equiv D(b_1, b_2). \tag{3.9}$$

Figure 3.8: A phylogenetic example for function choice. Again, nodes represent taxa and edges their phylogenetic relationships. Notice that edge weights are inversely proportional to $2^{d}$ where d is the depth of the edge in the tree (i.e., number of edges along the shortest connecting path between the edge’s parent node and the root). Figure originally published in BMC Genomics under the Creative Commons Attribution Licence v2.0 [171].

Expectation 2 Consider the following three taxa $t_1$, $t_2$, and $t_3$ such that: (i) $t_1$ is in the lineage of $t_2$ and (ii) $t_1$ and $t_3$ have diverged, i.e., $t_1$ is not in the lineage of $t_3$. Then it follows that $D(t_1, t_2) < D(t_1, t_3)$.
To confirm this, note that since $t_1$ is in the lineage of $t_2$, then $d(t_1) \le d(t_2)$, but since $t_1$ and $t_3$ have diverged there exists a common parent p where the divergence occurred, implying that $d(p) < d(t_1)$. Thus the following relation holds:

$$D(t_1, t_3) > \frac{1}{2^{d(p)}} \ge \sum_{i=0}^{\infty} \frac{1}{2^{d(t_1)+i}} \ge D(t_1, t_2). \tag{3.10}$$

Expectation 3 This can be shown similarly to Expectation 1, by considering two pairs of divergent taxa $(a_1, a_2)$ and $(b_1, b_2)$ where the common parents of $(a_1, a_2)$ and $(b_1, b_2)$ are $p_{(a_1,a_2)}$ and $p_{(b_1,b_2)}$, respectively. We claim that if $d(p_{(a_1,a_2)}) < d(p_{(b_1,b_2)})$, then $D(a_1, a_2) \ge D(b_1, b_2)$. Without loss of generality suppose $d(p_{(a_1,a_2)}) < d(p_{(b_1,b_2)})$. Now,

$$\begin{aligned}
D(a_1, a_2) &\equiv D(p_{(a_1,a_2)}, a_1) + D(p_{(a_1,a_2)}, a_2) \\
&\ge \frac{1}{2^{d(p_{(a_1,a_2)})}} + \frac{1}{2^{d(p_{(a_1,a_2)})}} \\
&\ge \sum_{i=0}^{\infty} \frac{1}{2^{d(p_{(b_1,b_2)})+i}} + \sum_{i=0}^{\infty} \frac{1}{2^{d(p_{(b_1,b_2)})+i}} \\
&\ge D(p_{(b_1,b_2)}, b_1) + D(p_{(b_1,b_2)}, b_2) \equiv D(b_1, b_2)
\end{aligned} \tag{3.11}$$

where the first inequality in the above equation considers only the first edge in the walk from $p_{(a_1,a_2)}$ to $a_1$ and from $p_{(a_1,a_2)}$ to $a_2$.

3.4.2 Lowest common ancestor (LCA) algorithm

The LCA algorithm traverses the NCBI Tree to return the lowest common ancestor of a set of taxa (Section 1.1.9 and MEGAN [104]). Two lineages are considered highly evolutionarily diverged if their LCA is near the root of the hierarchy (i.e., a high level rank such as phylum), while lineages whose LCA is near the leaves of the hierarchy (i.e., a low level rank like genus) are less divergent. In the MEGAN software, LCA is used to place annotated sequences on the NCBI Tree by applying the LCA Algorithm to the set of taxa found in the RefSeq annotations. The LCA algorithm calculates LCA using the set of all taxa found in the CDS annotations associated with the enzyme reactions of a pathway. Let us define LCA in more precise terms. Let $T = \{t_1, t_2, \dots, t_n\}$ be a set of taxa in $T_{NCBI}$ and let P be the set of taxa such that ∀p ∈ P, p is a parent or indirect parent of all taxa in T. Note that P ≠ ∅ since $a_{root}$ ∈ P.
Now, the lowest common ancestor LCA(T) is the taxon p ∈ P which has the highest depth in the tree (i.e., furthest from the root).

3.4.3 The weighted taxonomic distance algorithm

Now that we have described the weighted taxonomic distance D(·, ·) between two nodes on the NCBI Tree, and have described the LCA Algorithm, we can discuss the specifics of how the WTD algorithm uses both the distance and LCA to reconcile the two different sources of taxonomic annotations of predicted MetaCyc pathways. Recall that there are two sources of taxonomies associated with a pathway: (i) the observed taxonomic annotations of CDS sequences found for pathway reaction enzymes and (ii) the expected taxonomic range(s) from MetaCyc, a set of taxa curated for many MetaCyc pathways. For a given pathway, the set of CDS annotations is input to the LCA algorithm, returning their observed LCA taxon. Next, the observed LCA taxon is compared with each member of the pathway’s expected taxonomic range(s), returning the distance with the smallest magnitude, with a preference for positive distances (see below). In this way the WTD algorithm provides a measure of taxonomic disagreement between the observed and expected taxonomic information on a per-pathway basis.

Finally, there is a detail to discuss about the symmetric nature of the WTD. Intuitively the WTD captures the approximate taxonomic agreement between two positions on the NCBI Taxonomy Database. However, the distance is symmetric with regard to the lineage of the observed and expected taxa, making it impossible to tell from the distance value alone if the expected taxon is in the lineage of the observed taxon (i.e., the observed taxon is a descendant of the expected taxon) or otherwise (i.e., the two taxa have diverged). This is resolved by calculating the WTD relative to the expected taxon position.
Distances calculated on a path where the expected taxon is in the lineage of the observed taxon are given a non-negative value and are negative otherwise. Intuitively, non-negative distances represent the degree of specificity of an observed LCA taxon within the descendants of a taxonomic range, while large negative distances represent how much disagreement exists between the two positions. In the case of multiple taxonomic ranges, we have a preference for the closest taxonomic range that is still positive. Thus, the WTD algorithm first attempts to return the minimum non-negative distance, and if no positive distance is found then the maximum negative distance (i.e., closest to zero) is returned.

3.4.4 Calculation

We will now describe the details of the weighted taxonomic distance algorithm, ComputeWTD (Algorithm 3.1), which can be applied to a set of predicted pathways from an ePGDB. Globally, the algorithm will provide a distance value for each predicted pathway, expressing the taxonomic disagreement between the observed and expected taxonomic signals. Let us first provide some notation for pathways, reactions and associated CDS annotations in an ePGDB. For any ePGDB G, we introduce the following notations for convenience. The notation P(G) denotes the list of base pathways in G. Next R(p, G) denotes the set of reactions in base pathway p which appears in G. Finally, CDS(r, G) denotes the list of CDS annotations associated with reaction/enzyme r in G. The observed taxon for each pathway is obtained in the following way. For a given pathway p and ePGDB G, the observed taxon p(o) is calculated by the PathwayObservedTaxon subroutine. This subroutine collects all taxonomic CDS annotations associated with each reaction r in pathway p, and calculates and returns the lowest common ancestor as the observed taxonomy p(o) via the LCA algorithm.

Next, we compare the observed taxon p(o) (i.e., a taxon in the NCBI Tree) against the set of expected MetaCyc pathway taxonomic range(s).
Note that for a base pathway p, in the context of the ePGDB under consideration G, we denote its non-empty set of taxa in the MetaCyc taxonomic range by TR(MetaCyc)(p), and use p(o) to denote the observed taxon computed by PathwayObservedTaxon (Algorithm 3.2). Also, recall for a taxon t we denote its lineage L(t). Note that since the MetaCyc taxonomic range is a curated set of taxonomic information associated with a pathway, we would like to adjust the WTD to reflect if the expected taxonomy t is in the lineage of observed taxonomy p(o). Here the sign of the WTD is used to represent the position of p(o) relative to t: non-negative if taxonomy t is in the lineage of observed taxonomy p(o) and negative otherwise. The pairwise distance is calculated between each taxon in TR(MetaCyc)(p) and p(o); non-negative distances are added to $C_p$, and negative distances are added to $C_n$. The WTD is computed with a preference for non-negative distances. If $C_p$ is not empty, the minimum of the non-negative distances $C_p$ is returned.
If $C_p$ is empty, the maximum of the negative distances $C_n$ is returned.

Algorithm 3.1 ComputeWTD

ComputeWTD(p, G)
    C_p ← ∅    /* set of non-negative distances, initially empty */
    C_n ← ∅    /* set of negative distances, initially empty */
    p(o) ← PathwayObservedTaxon(p, G)    /* observed LCA-based taxon */
    for each t ∈ TR(MetaCyc)(p)
        if t ∈ L(p(o))    /* expected taxon in lineage of observed */
            C_p ← C_p ∪ {D(t, p(o))}
        else
            l ← LCA(t, p(o))
            C_n ← C_n ∪ {−(D(l, t) + D(l, p(o)))}
    end for
    if C_p ≠ ∅    /* non-negative distance set not empty */
        output min(C_p)
    else    /* use negative distance */
        output max(C_n)

Algorithm 3.2 PathwayObservedTaxon

PathwayObservedTaxon(p, G)
    T ← ∅
    for each r in R(p, G)
        for each c in CDS(r, G)
            /* where t_c is the taxon annotation of c from RefSeq */
            T ← T ∪ {t_c}
        end for
    end for
    return p(o) ← LCA(T)

3.5 Conclusions

While advances in high throughput sequencing technologies are rapidly giving rise to tens of thousands of environmental datasets, the computational and analytic powers needed to organize, interpret and mobilize these datasets have lagged behind [51, 186]. Conventional BLAST-based annotation methods combined with gene-centric analyses tend to overlook the network properties of microbial communities driving ecological and biogeochemical interactions [187]. Here pathway-centric analyses via the MetaPathways pipeline and Pathway Tools provide the scientific user community with an end-to-end solution for comparing ePGDBs constructed from environmental sequence information, revealing known and novel network properties. As with any automated analysis, this method is no replacement for manual curation.
Indeed, this work highlighted specific instances where taxonomic range, idiosyncratic annotation, multifunctional enzymes, regulatory functions, and reversible enzymatic forms predicted by Pathway Tools result in interpretive hazards that require expert knowledge to resolve.

Continued development efforts are needed to improve on existing features and add new functionality to both the MetaPathways pipeline and Pathway Tools. Specifically, improved import features amenable to categorical metadata, e.g., taxonomic origin, location, depth, etc., need to be integrated with Pathway Tools ‘groups’, a feature that enables users to integrate external data and group pathways and objects within Pathway Tools. The ‘groups’ feature in turn needs to be better integrated into the ‘omics’ viewer, allowing for improved pathway navigation and page summaries within the Pathway Tools browser. Tooltip enhancements that summarize the categorical data mentioned above could further enhance the browsing experience. Current ePGDBs are constructed using concatenated CDS sequences, and improved viewing features are needed that map coverage and noncoding sequence information onto complete contigs. Finally, the PathoLogic algorithm should be improved to incorporate the described prediction hazards and WTD into its calculations. Specifically, one can imagine tree-based algorithmic improvements to PathoLogic akin to the WTD described here that integrate taxonomic information with enzyme or pathway directionality.

Despite current limitations, ePGDBs are an interactive and holistic data structure in which to investigate distributed metabolism and differentiate between microbial community metabolic potential and phenotypic expression. Thus, ePGDBs provide a functional blueprint of microbial community metabolism that can be harnessed to engineer microbial consortia with defined emergent properties.
These properties can in turn be transferred to industrial strains or modelled using MetaFlux to improve process performance [144]. Although the set-difference and visual inspection methods used to identify distributed metabolic pathways described here do not scale for big datasets, future algorithmic improvements will enable comparisons of reference genomes and metagenomes in large numbers. Indeed, splitting the proverbial ‘reaction arrows’ for each step in a given metabolic pathway into taxonomic bins provides a basis for integer optimization methods that compute ‘distribution’ scores and a baseline for monitoring changes in the reaction network associated with environmental change or even human health status. Looking forward, an open source collection of ePGDBs, called EngCyc, analogous to BioCyc [79], could be queried and compared online, revealing the network properties of microbial communities in natural and engineered ecosystems on a truly global scale.

Chapter 4

MetaPathways v2.0: a master-worker model for environmental Pathway/Genome Database construction on grids and clouds

The development of high-throughput sequencing technologies over the past decade has generated a tidal wave of environmental sequence information from a variety of natural and engineered ecosystems. The resulting flood of information into public databases and archived sequencing projects has exponentially expanded computational resource requirements, rendering most local homology-based search methods inefficient. MetaPathways v1.0 introduced a modular annotation and analysis pipeline for constructing environmental Pathway/Genome Databases (ePGDBs) from environmental sequence information capable of using the Sun Grid Engine for external resource partitioning.
However, a command-line interface and facile task management presented user activation barriers and a lack of fault tolerance with respect to computational grid resources. MetaPathways v2.0 incorporates a graphical user interface (GUI) and refined task management methods. The MetaPathways GUI provides an intuitive display for set up and process monitoring and supports interactive data visualization and sub-setting via a custom Knowledge Engine data structure. A master-worker model is adopted for task management, allowing users to scavenge computational results from a number of worker grids in an ad hoc, asynchronous, distributed network that dramatically increases fault tolerance. This model facilitates the use of EC2 instances, extending ePGDB construction to the Amazon Elastic Cloud.

4.1 Introduction

High-throughput sequencing technologies over the past decade have generated a tidal wave of environmental sequence information from a variety of natural and engineered ecosystems, creating new computational challenges and opportunities [51, 188]. For example, one of the primary modes of inferring microbial community structure and function from environmental sequence information involves database searches using local alignment algorithms such as BLAST [189]. Unfortunately, BLAST-based queries have become increasingly inefficient on stand-alone machines as reference databases and sequence inputs increase in size and complexity. The advent of adaptive seed approaches such as LAST [76] for homology searches shows promise in overcoming runtime bottlenecks when implemented on grid or cloud computing resources [190].
However, many academic researchers simply do not have access to the technical or infrastructure requirements needed to achieve these calculations, and must turn to online information processing services. The use of online services, such as Metagenome Rapid Annotation using Subsystem Technology (MG-RAST), increases user access to information storage, gene prediction, and annotation services [133, 142]. Although democratizing infrastructure access, the use of online services can insulate users from parameter optimization, create formatting and data transfer restrictions, and impose barriers to downstream analytic or visualization methods. Large grid computing resources offer an alternative by providing access to high-performance compute infrastructure on local or regional scales. Such computing environments are often composed of multiple grids implementing different batch scheduling and queuing systems, e.g., Portable Batch System (PBS), TORQUE, Sun Grid Engine (SGE), or SLURM. Because most federated systems limit the number of jobs a user can submit into a batch-processing queue at one time, scheduling irregularities and job failures are not uncommon when attempting to process large datasets across multiple grid environments implementing different batch scheduling and queuing systems.

Task management can be improved with algorithms that split and load-balance across multiple grid environments, monitor job status to resubmit failed jobs or submit new jobs in succession, and consolidate batch results upon completion. These improvements can increase fault tolerance and reduce manual administration requirements.
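The split, resubmit, and consolidate steps named above can be sketched in a few lines of Python. This is a simplified, sequential illustration under stated assumptions: `split_batches`, `run_all`, the `submit` callable, and the use of `RuntimeError` to signal a failed job are all hypothetical and do not reflect the MetaPathways code or any real scheduler API; a real system would submit jobs asynchronously and poll their status.

```python
from collections import deque
from typing import Callable, Dict, List, Sequence


def split_batches(items: Sequence, size: int) -> List[list]:
    """Split the query list into consecutive batches of at most `size` items."""
    return [list(items[i:i + size]) for i in range(0, len(items), size)]


def run_all(batches: List[list],
            grids: Sequence[str],
            submit: Callable[[str, list], list],
            max_attempts: int = 3) -> list:
    """Run every batch, rotating over grids and resubmitting failed jobs.

    `submit(grid, batch)` is assumed to return the batch result as a list,
    or to raise RuntimeError when the job fails on that grid.
    """
    pending = deque((index, 0) for index in range(len(batches)))
    results: Dict[int, list] = {}
    turn = 0
    while pending:
        index, attempts = pending.popleft()
        grid = grids[turn % len(grids)]     # simple round-robin load balancing
        turn += 1
        try:
            results[index] = submit(grid, batches[index])
        except RuntimeError:
            if attempts + 1 >= max_attempts:
                raise                       # give up after repeated failures
            pending.append((index, attempts + 1))   # resubmit later
    # Consolidate the batch results in the original input order.
    merged: list = []
    for index in range(len(batches)):
        merged.extend(results[index])
    return merged
```

Because a failed batch rejoins the end of the queue, it is typically retried on a different grid than the one that failed, which is the simplest form of the fault tolerance discussed in the text.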
The development of such a method requires a number of considerations: (1) job splitting and merging with appropriate checks on completion and correctness, (2) automated grid installation of search tools such as BLAST/LAST and required databases in a non-redundant manner, (3) bandwidth optimization, (4) minimization of redundant job processing on more than one grid, (5) tolerance to job failure or intermittent availability of computing resources, (6) load-balancing to deal with slow, high-volume, or small clusters, and (7) an efficient client tool to integrate processes running on the local user’s machine.

Practically speaking, searching a large set of query sequences against reference databases by splitting them into smaller query files can be approximately viewed as an instance of the well-studied Do-All problem, in which p processors must cooperatively perform n tasks in the presence of an Adversary [191]. One of the most popular approaches is the master-worker model [192, 193], where the master process sends tasks across the Internet to worker processes, and workers execute them and report back the results. Although the literature is replete with such models [191], multiple grid computing imposes new management challenges because master and worker communicate only indirectly, via head nodes. The head nodes themselves have limited communication and compute capacity, and are restricted to performing simple calculations and job submissions. An optimal client tool should automatically minimize redundant or unproductive communications, e.g., frequently trying to submit jobs to or transfer results from a slow server, and adaptively increase that server's involvement should its performance improve.

Astronomers and mathematicians have encountered similar data processing challenges, and software has been developed for task management across distributed compute networks.
For example, SETI@HOME takes advantage of idle CPU hours from volunteers willing to install custom software to search the cosmos for intelligent life [194], and PrimeNet has a similar implementation for calculating the largest Mersenne prime numbers (http://www.mersenne.org/primenet/). Both projects are tightly controlled to accomplish monolithic tasks under the auspices of major organizations, and the software is not readily transferable to academic users interested in alternative applications such as sequence homology searches. The recent development of MetaPathways [149], an analytical pipeline for sequence annotation and pathway inference using the PathoLogic algorithm and Pathway Tools to construct environmental Pathway/Genome Databases (ePGDBs) [80, 89, 92], provides a facile method for distributing homology searches through the integration of an individual grid. However, this software does not take advantage of all available compute resources should more grids become available, nor does it address the previously discussed challenges associated with ad hoc distributed compute networks.

This chapter describes MetaPathways v2.0, representing a series of improvements to the existing pipeline that attempt to address the aforementioned computational and data integration issues. First, automated multiple grid management allows computationally intensive sections of the pipeline to be performed by multiple compute grids simultaneously in an ad hoc distributed system, accommodating dynamic availability, addition, and removal of compute clusters. Because many potential users do not have dedicated access to compute clusters, this implementation includes a module for use of the Amazon Elastic Compute Cloud (EC2) (http://aws.amazon.com/ec2/) through integration with the StarCluster library (http://star.mit.edu/cluster/). Finally, the usability of MetaPathways is improved through the development of a graphical user interface (GUI) for parameter setup, run monitoring, and result management.
Further, the integration and efficient query of results are enabled by a custom Knowledge Engine data structure. Use of this structure is integrated into customized data summary tables, visualization modules, and data export features for downstream analysis (e.g., the Metagenome Analyzer (MEGAN) [104] and the R programming environment). The GUI is written in C++ with the Qt 5.0 library under the LGPL license, and is compatible with Mac OS X and Linux-based operating systems.

4.2 Implementation

This section describes features newly implemented in the MetaPathways v2.0 code-base: the master-worker model for BLAST job distribution, the interactive GUI for run management, and the ‘Knowledge Engine’ data structure, which enables the interactive comparison of hundreds of samples.

4.2.1 Multi-grid brokering

MetaPathways v2.0 coordinates and manages the computation of sequence homology searches on compute grids implementing the Sun Grid Engine/Open Grid Engine [153] or TORQUE (http://www.adaptivecomputing.com/) batch-job queuing systems. Expanding from an individual compute grid to many grids in an ad hoc asynchronous distributed network incurs a number of additional algorithmic challenges in terms of job coordination, worker monitoring, fault tolerance, and efficient job migration. The previous implementation of MetaPathways controlled an individual grid. In the current version, a variant of the master-worker model (perhaps a ‘master-cluster’ model) has been implemented, with the local machine taking on the role of a master Broker, coordinating a set of worker grids that compute tasks on the Broker’s behalf. This setup is analogous to the way operating systems and Internet supercomputers queue requests and return results.

The Broker, operating on the local machine, commissions worker grids to set up as “Blast Services” that compute individual sequence homology search jobs in much the same way as the individual grid did in MetaPathways v1.0.
However, the availability of many worker grids with varying levels of throughput in an ad hoc distributed network greatly increases the complexity and asynchronicity of job distribution and result harvesting. The Broker not only has to monitor jobs in the context of worker grids and specific samples, but also has to initiate job migration from overly slow or ineffective workers (Figure 4.1). Our model assumes an Adversary can sabotage the distributed setup in three ways: (1) a grid core can fail or become ineffective, causing a loss of jobs currently being computed on that core, (2) an entire grid can fail or become ineffective, requiring all jobs to be migrated to other grids, and (3) a sporadic or failed Internet connection increases the asynchronicity of the system, affecting the reliable submission and harvesting of results. The Broker handles these three situations through a combination of job resubmission, job migration to other worker grids, and decreasing job submission and job harvesting from problematic grids.

The distribution algorithm is executed by the Broker. First, smaller independent jobs are created from a much larger input file. The Broker then submits jobs in a modified round-robin fashion up to the queue limit allowed by different clusters. The round-robin submission is modified by a delay counter (one for each cluster), which exponentially increases for each unsuccessful connection, submission, or harvesting attempt. Once all the jobs are submitted but not all results are available, incomplete jobs are resubmitted to other clusters in counter rank order. Finally, once all the jobs are finished and results are retrieved, the Broker consolidates results on the client’s machine.

Figure 4.1: A master-worker model for sequence homology searches. (a) The master Broker breaks a BLAST job into equal-sized sub-tasks. (b) The Broker then sets up blast services (with all the required executables and databases) on each of the available worker grids, ready to handle incoming tasks. Jobs (squares) are submitted in a round-robin manner to each of the grids. (c) The Broker then intermittently harvests results (circles) from each of the services as they become available, de-multiplexing if there are multiple samples being run. (d) An Adversary can cause nodes and whole grids to fail at random; the Broker handles this by migrating lost jobs (dashed lines) to alternative grids. (e) An Adversary can also cause an intermittent or failing Internet connection (red line), and the Broker handles this through an exponential back-off, eventually migrating jobs to other worker grids if latency becomes excessive. Figure originally published in IEEE proceedings [168]. Copyright IEEE 2014. Reprinted with permission.

Connectivity or response issues are handled by the exponential back-off of job distribution, while throughput issues are handled by the migration of outstanding or pending jobs to more available workers. On each unsuccessful attempt to submit or harvest results, the Broker waits some specified period of time before retrying. This time exponentially increases on each successive failure until an upper maximum is reached (by default set to 60 minutes). Exponential back-off is a well-accepted method in queueing theory, limiting the number of connections attempted to slower grids. Job migration is handled by a heuristic that migrates a job when its pending time at a particular grid is six standard deviations greater than the average job completion time over a parameterized time-window (by default set to 60 minutes). Such tasks are then migrated to the grid with the lowest expected completion time, which is proportional to the number of outstanding jobs.
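The back-off and migration rules described above can be sketched in a few lines of Python. This is a minimal illustration rather than the MetaPathways implementation: the function names, the one-minute base delay, and the doubling factor are assumptions made here, while the 60-minute cap and the six-standard-deviation threshold follow the defaults stated in the text. Expected completion time is approximated as the number of outstanding jobs multiplied by the mean completion time, which is likewise an assumption.

```python
import statistics


def next_delay(failures, base_minutes=1.0, max_minutes=60.0):
    """Minutes to wait after `failures` consecutive failed attempts.

    The delay doubles with every failure (assumed growth factor) and is
    capped at `max_minutes` (60 minutes is the default cap in the text).
    """
    return min(base_minutes * (2 ** failures), max_minutes)


def should_migrate(pending_minutes, recent_completion_minutes, n_sigma=6.0):
    """True when a job has been pending more than n_sigma standard deviations
    beyond the mean completion time observed in the recent time window."""
    if len(recent_completion_minutes) < 2:
        return False  # too little history to estimate a spread
    mean = statistics.mean(recent_completion_minutes)
    spread = statistics.stdev(recent_completion_minutes)
    return pending_minutes > mean + n_sigma * spread


def pick_target(grids):
    """Pick the grid with the lowest expected completion time.

    `grids` maps a grid name to (outstanding_jobs, mean_completion_minutes);
    their product is a hypothetical stand-in for expected completion time.
    """
    return min(grids, key=lambda name: grids[name][0] * grids[name][1])
```

With these assumed defaults, a grid that has failed three times in a row is retried after 8 minutes, and one that has failed ten times is retried only once per hour. A job is flagged for migration only when its pending time is far outside the recent completion-time distribution, so ordinary queue jitter does not trigger transfers.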
Choosing the grid with the lowest expected completion time ensures that jobs are not readily passed to grids with a higher load.

4.2.2 Amazon elastic cloud integration

MetaPathways v2.0 enables researchers to take advantage of existing compute grids using the TORQUE or SunGrid queue submission engine, requiring only login credentials and basic information about the queue submission system. However, there are many instances where convenient access to a compatible compute cluster is simply unavailable or difficult. Here, StarCluster (http://star.mit.edu/cluster/), an open source cluster-computing toolkit for Amazon’s Elastic Compute Cloud (EC2), was adopted in order to automate and simplify the setup of a compatible compute grid. EC2 grids are set up with a custom MetaPathways Amazon Machine Image (AMI) containing MetaPathways code, system-appropriate executables, and compiled databases, reducing setup bandwidth and latency. EC2 grids are specified similarly to other compute grids; however, the user must provide Amazon Web Services (AWS) credentials: an access key, a “secret” access key, and a user ID, obtained by registering for an Amazon EC2 account.

4.2.3 Graphical user interface & data integration

A MetaPathways run is set up via a configuration window that allows specification of input and output files, quality control parameters, target databases, executable stages, and grid computation (Figure 4.2). Available grid systems and credentials are added and stored in an “Available Grids” window, allowing the user to add additional grids when credentials become available. Currently, grid credentials using the Sun Grid or TORQUE job distribution system are supported, though expansion to include other systems including SLURM and PBS is anticipated. Once a run has started, an execution summary is displayed showing run progress through completed and projected pending stages along with a number of logs showing the exact commands associated with each processing step (Figure 4.3).

Figure 4.2: Configuring a MetaPathways run. Specific pipeline execution stages can be set to run, skip, or redo by selecting the appropriate radio button. A check box at the bottom indicates if the currently configured set of external worker grids should be used for this run. Additional grids can be configured with their credentials by clicking the “Setup Grids” button. Figure originally published in IEEE proceedings [168]. Copyright IEEE 2014. Reprinted with permission.

Figure 4.3: Monitoring a MetaPathways run. Once a MetaPathways run is configured and started, progress is monitored using two coordinated windows. Execution progress and processed results for a specific sample are displayed as a series of tabs containing tables, graphs, or other visualizations, and can easily be expanded should additional pipeline stages or modifications occur. Different environmental datasets can be selected using the drop-down combo-box above. Figure originally published in IEEE proceedings [168]. Copyright IEEE 2014. Reprinted with permission.

Environmental datasets generated on next-generation sequencing platforms require the annotation of millions of open reading frames (ORFs). Not only is this annotation computationally intensive, but the interpretation, integration, and analysis of gigabytes of output creates analytic challenges as well. Here, MetaPathways v2.0 has implemented a custom Knowledge Engine data structure and file indexing scheme that connects data primitives, such as reads, ORFs, and metadata, projecting them onto a specified classification or hierarchy (e.g., KEGG [78], COG [148], MetaCyc [80], and the NCBI Taxonomy database [195]) (Figure 4.4). The benefit of this custom abstraction is flexibility and performance; data primitives are extensible across multiple levels of biological information (DNA, RNA, and protein), and related knowledge transformations like coverage statistics or numerical normalizations can be customized for them if necessary.
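The idea of projecting data primitives onto a classification hierarchy can be illustrated with a small Python sketch. It is not the Knowledge Engine itself, which is part of the C++ GUI; the class name `Hierarchy`, its methods, and the ORF identifiers are hypothetical and chosen only to show how primitives attached to nodes can be enumerated for any sub-tree by following parent-to-child links.

```python
from collections import defaultdict


class Hierarchy:
    """A classification tree (e.g., a taxonomy) with primitives attached to nodes."""

    def __init__(self):
        self.children = defaultdict(list)   # node -> list of child nodes
        self.primitives = defaultdict(set)  # node -> IDs of ORFs/reads annotated to it

    def add_edge(self, parent, child):
        self.children[parent].append(child)

    def annotate(self, node, primitive_id):
        self.primitives[node].add(primitive_id)

    def project(self, node):
        """Collect every primitive annotated at `node` or anywhere beneath it."""
        found = set()
        stack = [node]
        while stack:
            current = stack.pop()
            found |= self.primitives.get(current, set())
            stack.extend(self.children.get(current, []))
        return found


# Hypothetical example: three ORFs annotated at different taxonomic depths.
tax = Hierarchy()
tax.add_edge("Bacteria", "Proteobacteria")
tax.add_edge("Proteobacteria", "Gammaproteobacteria")
tax.annotate("Bacteria", "orf_3")
tax.annotate("Proteobacteria", "orf_1")
tax.annotate("Gammaproteobacteria", "orf_2")
subset = tax.project("Proteobacteria")  # the ORFs at or below Proteobacteria
```

A subset selected this way is itself just a set of primitive identifiers, so it can be handed to a table, a plot, or an export routine, or projected onto a second hierarchy such as a functional classification.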
Such a custom data structure answers queries for specific results faster than typical database setups like MySQL or Oracle. Through integration with the Knowledge Engine data structure, data subsets can be easily selected to drive inquiry through custom look-up tables and visualization modules, or exported in a variety of formats for downstream analysis (e.g., custom .csv and .tsv tables, nucleotide and protein FASTA files, MEGAN-compatible csv, etc.).

Figure 4.4: The MetaPathways ‘Knowledge Engine’ data structure. One of the recent developments to enable the efficient exploration of large environmental sequence datasets is the Knowledge Engine data structure. This data structure connects data primitives, such as reads and ORFs, and projects them onto a specified classification or hierarchy. Pointer following is a computationally efficient operation, so the identification and enumeration of data primitives is robust to large samples consisting of millions of annotations. The connections between primitives and classification schemes are known as ‘Knowledge’ objects (a). Through the exploration and selection of data projected on classification schemes, new ‘Knowledge’ objects can be created (dashed lines) (b). Once defined, Knowledge objects can be projected onto custom look-up tables and visualization modules, and these, in turn, can be used to create new Knowledge objects, enabling efficient and interactive querying of millions of annotations. Figure originally published in IEEE proceedings [168]. Copyright IEEE 2014. Reprinted with permission.

MetaPathways v2.0 can also be used to build modules around user-defined tasks of interest; three implemented modules highlight this utility. The Large Table implementation exploits an order statistic algorithm to allow the interactive browsing and search of annotation tables with millions of rows (Figure 4.5a).
Next, a Tree-taxonomy Browser for the display of taxonomic annotation allows the query and projection of ORFs in the context of the NCBI Taxonomy Hierarchy (Figure 4.5b). Finally, a Contig Viewer for the inquiry of annotated ORFs on individual reads or assembled contigs allows for the interactive visualization and query of annotations in the context of their operons (Figure 4.5c). More details on specific module features, use, and implementation can be found on the MetaPathways2 wiki (http://www.github.com/hallamlab/metapathway2/). Each module enables the selection of subsets and projections of reads, ORFs, or contigs onto the above classification schemes for intuitive data exploration and analysis. Additionally, context-sensitive export functions for sequences, reads, or ORF annotations connect exploratory data to downstream analyses like tree-building, R, or MEGAN [104].

4.3 Results

This section discusses performance experiments run on Canada’s high-performance computing consortia.

4.3.1 Heterogeneous grid migration

To illustrate the ability of the task distribution algorithm to migrate tasks between a number of different homogeneous compute clusters, we ran a demonstration run on three different compute grids simultaneously (Table 4.1). These were two systems (Bugaboo and Jasper) from WestGrid, one of Compute/Calcul Canada’s high-performance computing consortia, and HallamGrid, an internal grid of 10 heterogeneous desktop computers. Connection to WestGrid systems was through the Internet, while HallamGrid was connected through a local area network. The Bugaboo and Jasper clusters have 4,584 and 4,160 cores, respectively, running the Torque or SunGrid Engine queuing systems with at least 2 GB of RAM available per core. HallamGrid ran the SunGrid Engine and consisted of 10 nodes with 4-8 cores each with a variety of processor types.
Locally, MetaPathways v2.0 was running on a Mac Pro desktop computer with Mac OS X 10.6.8, two 2.4 GHz Quad-Core Intel Xeon processors, and 16 GB of 1,066 MHz DDR3 RAM.

We ran MetaPathways v2.0, configured to the above grids, on an Illumina-sequenced metagenome, predicting 106,500 ORFs from 127,821 assembled contigs using standard MetaPathways parameters [149]. These ORFs were annotated against the COG database using the BLAST algorithm. When split into batches of 500 sequences each, this resulted in 213 sequence homology jobs. MetaPathways distributed these jobs to the network, completing the task in 0.5 real-world hours.

Table 4.1: Overview of grid hardware specifications.

System      Cores  Nodes  Memory        Connectivity
Bugaboo     4584   446    16-24GB/Node  Internet
Jasper      4160   400    16-24GB/Node  Internet
HallamGrid  80     10     32-64GB/Node  LAN

Table 4.2: Matrix of completed jobs (diagonal values) and job transfers (off-diagonal).

            Jasper  Bugaboo  HallamGrid
Jasper      49      0        0
Bugaboo     2       10       8
HallamGrid  0       0        164

A transfer matrix describes the job distribution behavior of the Broker (Table 4.2); diagonal values represent jobs completed at a particular grid, while off-diagonal values represent the transfer of a job from one grid to another by the Broker. In this run the Bugaboo grid was less performant than the others, but the Broker successfully managed the situation by distributing stale tasks to Jasper and HallamGrid, roughly in proportion to their performance. Despite having significantly fewer cores than the other grids, the locally-connected and dedicated HallamGrid completed more tasks than Bugaboo or Jasper, indicating that, hardware specifications aside, grid performance is substantially affected by the administrative and queuing policies of the compute grids, and that there are large performance advantages to having a dedicated server available, i.e., HallamGrid or a private EC2 instance.
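As a small worked example, the transfer matrix in Table 4.2 can be summarized per grid in Python. The values are copied from the table; the function names are hypothetical, and reading each row as the grid a job was transferred away from is an interpretation based on the description of Bugaboo handing stale tasks to the other two grids.

```python
grids = ["Jasper", "Bugaboo", "HallamGrid"]

# Rows: grid the job was transferred from; columns: grid that received it.
# Diagonal entries are jobs completed on the grid itself (Table 4.2).
transfers = [
    [49, 0, 0],
    [2, 10, 8],
    [0, 0, 164],
]


def transfer_summary(names, matrix):
    """Return {grid: (completed_locally, transferred_away)}."""
    summary = {}
    for i, name in enumerate(names):
        completed = matrix[i][i]
        moved = sum(matrix[i]) - completed
        summary[name] = (completed, moved)
    return summary


def transfer_fraction(completed, moved):
    """Share of a grid's jobs that the Broker had to move elsewhere."""
    total = completed + moved
    return moved / total if total else 0.0


summary = transfer_summary(grids, transfers)
fractions = {name: transfer_fraction(*counts) for name, counts in summary.items()}
```

Under this reading, half of the jobs associated with Bugaboo (10 of 20) were moved to other grids, whereas Jasper and HallamGrid gave up none.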
In addition, job transfer matrices like Table 4.2 could be used as an online tool to assess grid behavior; a high number of off-diagonal values would indicate a high number of transfers and therefore low performance.

4.4 Discussion

MetaPathways enables streamlined functional, taxonomic, and pathway-centric analysis of environmental sequence information when compared to existing methods based on KEGG pathways and SEED subsystems. This version of the pipeline focused on improving scalability through the use of a robust master-worker model to control compute grid submission. Our solution has the potential to allow collaborating research labs to take advantage of under-used idle cycles on their heterogeneous in-house networks, and to facilitate the use of external grids when additional dedicated CPU cycles are necessary. By adopting the StarCluster software, we enable the setup and use of Amazon EC2 instances for those who do not have convenient access to large federated compute grids. Further, we improved the usability and monitoring capacity of MetaPathways through the integration of a GUI to monitor pipeline progress, while also providing an efficient interactive framework to bring users closer to their data. The custom Knowledge Engine data structure drives query and sub-setting of the processed data, enabling efficient analysis of samples containing millions of annotations. We can think of no other stand-alone software tool that will allow for comparable analysis on such large datasets.

We have demonstrated that the Broker’s distribution algorithm has good empirical performance on a small number of grids. However, the current implementation uses a very heuristic queueing theory model, where the time-window used to calculate an average completion time and standard deviation is a tuneable parameter for each grid.
The interactions between the client and clusters with varying loads, user behaviors, and administrative activities could be viewed as a non-cooperative game where the various players are competing for resources. From a game theory perspective, there could be interesting equilibrium conditions that result within this dynamic network, with the Broker's transfer of jobs acting as a strategy that maximizes its expected utility. Not only would these results be theoretically interesting, they could have critical implications for the performance of the algorithm when large grid networks are being managed.

Aside from run-time improvements, additional data transformation and visual analysis modules are needed that incorporate coverage statistics indicating numerical abundance and taxonomic distribution of enzymatic steps. Additionally, the application of self-organizing maps or other machine learning methods is needed to more accurately place single-cell or population genome assemblies onto the Tree of Life. Further, as transcriptomic and proteomic datasets are becoming increasingly available, they represent new data primitives that need to be incorporated into the Knowledge Engine data structure. More generally, reference databases for 5S, 7S, and 23S RNA genes and updates to the current MetaCyc database that include more biogeochemically relevant pathways are needed to improve BLAST and cluster-based annotation efforts. Finally, more operational insight is needed to identify hazards in pathway prediction and to improve ePGDB integration within the Pathway Tools browser.

4.5 Conclusions

MetaPathways v2.0 provides users with improved performance and usability in high-performance computing through a master-worker grid submission algorithm.
The addition of a GUI significantly lowers user activation barriers, and provides better control over pipeline operation. MetaPathways v2.0 is extensible to the ever increasing data volumes produced on next-generation sequencing platforms, and generates useful data products for microbial community structure and functional analysis, including phylogenetic trees, taxonomic bins, and tabular annotation files. The MetaPathways v2.0 software, installation instructions, tutorials, and example data can be obtained from http://www.github.com/hallamlab/MetaPathways2.

Figure 4.5: MetaPathways v2.0 has integrated its Knowledge Engine data structure into three data summary and visualization modules. (a) A Large Table module that allows for the efficient query, look-up, and sub-setting of reads, ORFs, statistics, annotations, and their hierarchical classification (e.g., KEGG, COG, MetaCyc, SEED, NCBI Taxonomy). (b) A Tree-taxonomy Browser of ORF annotations on the NCBI Taxonomy classification. (c) A Contig Viewer enabling the browsing of contigs or long reads for ORF positions, showing functional and taxonomic annotations in tool-tips. Figure originally published in IEEE proceedings [168]. Copyright IEEE 2014. Reprinted with permission.

Chapter 5

LCA*: an entropy-based measure for taxonomic assignment of contigs

A perennial problem in the analysis of environmental sequence information is the taxonomic classification of unknown reads or assembled sequences (e.g., contigs or scaffolds) to discrete taxonomic bins. Although assembly of such samples has its difficulties, once contigs are found it is often important to classify them to a taxonomy based on protein annotations or ‘genomic signature’, taxonomically-relevant patterns embedded in the sequence of nucleotides. This chapter develops the LCA* statistic and algorithm to predict a contig's most likely taxonomic origin on the NCBI Taxonomic Database hierarchy.
Inspired by information and voting theory, contig annotations become votes in an election for a candidate taxonomy, and a sufficiently strong majority is constructed minimizing the change to the entropy of the observed taxonomic distribution, thereby minimizing changes to its statistical properties. Further, results from the order-statistic literature are used to formulate a likelihood-ratio hypothesis test and p-value for testing the supremacy of the final predicted taxonomy. In simulated and actual contigs, we empirically demonstrate that voting-based methods, Majority vote and LCA*, are consistently more accurate than LCA2, ascribing contig taxonomy to the Lowest Common Ancestor of its taxonomic annotations, and that LCA* estimates strike a balance between specificity and confidence to provide an estimate appropriate to the data depth. LCA* and its statistical tests have been implemented as a stand-alone Python library and have been integrated into the latest release of the MetaPathways pipeline; both of which are available on GitHub with installation instructions and use cases (http://www.github.com/hallamlab/LCAStar/).

5.1 Introduction

The rise of next-generation sequencing technologies has enabled an explosion of environmental sequencing initiatives across a plethora of clinically and ecologically relevant genomic samples. Several multi-omic analytical software pipelines, like MetaPathways [149], MG-RAST [133], and HUMAnN [94], are beginning to address the data deluge in both environmental and clinical environments, enabling researchers to learn much about functional and taxonomic structure from billions of multi-omic sequences [171].
Despite this progress, many current methods lack statistical frameworks and hypothesis tests, which impedes generalizing observations to their underlying populations, and has left investigators with limited theoretical guidance when applying different tools and methods [31, 196].

One particular area of interest in multi-omic datasets is taxonomic binning, that is, the attribution to a particular taxon of a given contig sampled from the environment. Binning methods generally fall into three categories: amplification-based marker gene sequencing, alignment-dependent, and alignment-independent methods. In amplification-based methods, sequences from targeted amplified sequences are searched against collections of taxonomically-related marker genes. These can include 16S rRNA [197], Clusters of Orthologous Genes (COGs) [103], and more recently EggNOG [122]. In alignment-dependent methods, short-read samples are aligned to existing protein reference sequences or pre-computed models. The majority of these methods work by aligning reads by seed-and-extend homology search (e.g., BLAST [75], LAST [76], RapSearch2 [198]) or Hidden Markov Models (HMMs) that correspond to known taxonomic groups [199]. For example, the Lowest Common Ancestor (LCA) algorithm, popularized by the MEtaGenomic ANalyzer software (MEGAN) [104], places high-scoring alignments for a given read on the NCBI Taxonomy Database hierarchy and returns the LCA of all alignments as the predicted taxonomy. Alignment-independent methods rely purely on statistical patterns of sequence composition to separate sequences along taxonomic lines, a hallmark being patterns within the frequencies of k-nucleotides or sequences of nucleotides [200].
Here methods based on dimensionality-reduction approaches such as Principal Component Analysis (PCA) [201, 202], Kohonen Self-Organizing Maps [203, 204], or other statistical methods such as Interpolated Markov Models [205], Linear Discriminant Analysis (LDA) [206], or Support Vector Machines (SVM) [21], attempt to capture the organism-specific “genome signature” of sequences in a lower-dimensional subspace in order to group sequences into clusters of taxonomically-similar sequences [207]. Though both alignment-dependent and alignment-independent approaches can be used for taxonomic identification, the latter more readily lends itself to sequence quality control tasks, but this is by no means a hard separation, and in fact hybrid methods exist that take advantage of both sources of information [205, 208].

Due to the popularity of the MEGAN software, the LCA method is routinely applied to ORF annotations with a correction based on homology search quality statistics. While extending LCA to predict a representative taxonomy on contigs seems straightforward, it is unclear how to temper these predictions as each ORF on a contig is often summarized to have an individual taxonomic annotation by some statistic (e.g., LCA, best-BLAST, etc.). Consider the perspective of electing a representative taxonomy for a contig, where each ORF annotation now ‘votes’ for its particular taxonomic annotation. In this election many ORF annotations have differing taxonomic opinions in both depth and breadth, casting their vote for the original taxonomy in a variety of ways. Two popular voting theory results provide theoretical justification for choosing a majority as the correct response. Condorcet’s Jury Theorem considers an election of two options, one correct and one incorrect, and voters each independently choose one of these two options with the assumption that they choose the correct response with probability p > 1/2 [209].
The observed majority converges in probability to the correct decision as the election size grows to infinity. Alternatively, Feige and colleagues studied the depth of noisy decision trees where each query at a node produces the correct answer with some probability p > 1/2 [210]. They derived tight bounds on the number of queries required to compute threshold and parity functions, and analyzed a noisy comparison model with tight bounds on comparison, sorting, selection, and merging. However, applying these voting theory methods to taxonomic count data is complicated by taxonomic dependence within the observed taxonomic annotations. For instance, many annotations belong to varying levels of a common phylogenetic lineage. Additionally, the approximate nature of popular homology search algorithms and variability within annotation databases increase the uncertainty of our observations, making correct placement on the phylogenetic hierarchy a challenge. Further, sparse sampling of related taxa often spreads observations too thin, undermining the confidence in results from a simple majority method.

This chapter introduces LCA*, an entropy-based statistic and algorithm for declaring a sufficiently supported majority on the NCBI Taxonomic Database Hierarchy [195]. The statistic offers a principled method of electing a majority taxon by applying results from information and voting theory to contig annotations, obtaining an acceptable majority while minimizing changes to the underlying taxonomic distribution. Moreover, order-restricted statistical results can be used to provide supremacy tests for an elected taxonomy as an alternative to traditional χ-squared uniformity tests. Both simulated and actual contigs demonstrate that the voting-based methods of simple majority and LCA* provide closer estimates than using a contig-level LCA, a procedure referred to as LCA2, and that LCA* strikes a balance between specificity and statistical confidence with respect to simple majority predictions.
Finally, LCA* is efficiently implemented as a stand-alone Python package, and has also been included as an analytical step of an updated MetaPathways pipeline.

5.2 Motivation

Let us motivate LCA* through an illustrative example where taxonomic annotations are quite variable and dispersed (Figure 5.1). The simple majority method, choosing the taxonomy with the most annotations, may be intuitive, but a majority of 3 out of 11 annotations is not very convincing. Alternatively, to combat this dispersion it might be a good idea to elect the LCA as the majority, but this very conservative estimate may not be very analytically useful. When ORFs are annotated using the LCA method in the MEGAN software, the LCA estimate is made less extreme by discarding B/LAST hits that do not meet certain quality thresholds, on the assumption that hits to taxa further away from the target taxonomy will be less accurate. However, in the case of contigs, this tempering by discarding annotations is not an option, because annotations have already been summarized.

Rather than temper a conservative estimate, LCA* takes the reverse approach by expanding the specific simple majority estimate up the phylogenetic tree, progressively collapsing annotations until a satisfying majority, an α-majority, is obtained. Here relevant voting theory results like Condorcet’s Jury Theorem and the work of Feige on noisy decision trees are leveraged to justify using the majority proportion α > 1/2 [209, 210]. However, there remains an issue of how to collapse the tree automatically, as the resulting majority could be significantly biased if the tree were collapsed arbitrarily. Here, an information-theoretic interpretation of entropy is used to motivate an algorithm for collapsing annotations in a principled way.

Since the application of information entropy is relatively rare in the area of microbial ecology, a brief introduction will help in the understanding and motivation of LCA*.
Entropy (Shannon's entropy) is the fundamental unit of information, defined as the average amount of information needed to specify the state of a random variable. Intuitively, it can be described as a measure of uncertainty. The uniform random variable, where all outcomes are equally likely, has the highest possible uncertainty and therefore the highest entropy, while a more spiked random variable has less uncertainty and therefore less entropy. Entropy has been central to the development of coding theory (e.g., file compression), cryptography, and signal processing. Its utility is not entirely foreign to ecology, where entropy is directly applied as a measure of taxonomic diversity [111]. Moreover, relative entropy (also known as the Kullback-Leibler divergence) is used to measure the divergence between two probability distributions [211]. For example, in the machine learning literature, Random Forest classifiers use entropy as a measure of separation quality, and it is used to identify the most discriminating separations in a hierarchy of classifier models [212]. The LCA* algorithm uses the concept of entropy to measure the amount of change that collapsing annotations up the tree to a particular node will cause. This measure is used to find an α-majority while changing the entropy of the taxonomic distribution as little as possible.

5.3 LCA*: derivation and algorithm

Let us begin with a high-level overview of the derivation of LCA* and the algorithm ComputeLCA* before going into the details. In order to reason clearly about contigs, ORF annotations, and taxonomy within the NCBI Taxonomy Database hierarchy, we must first construct a mathematical framework describing them, their relationships, and any additional notation that we need to perform our entropy calculations and algorithm. First, we will describe notation for the inherent tree structure of the NCBI Taxonomy Database Hierarchy, and add some specific notation for child nodes and child sub-trees that will be useful when calculating the entropy of annotations within the tree. Next, we will describe a contig, its ORFs, and annotations, and define what it means for a particular set of annotations to have a sufficient α-majority. To facilitate the collapsing of annotations up their phylogenetic lineages, we will define an annotation as having a phylogenetic lineage in terms of partially-ordered sets, which will allow us to define phylogenetically-valid transformations among observed annotations. In particular, we will define consistent reductions to be a special kind of transformation for collapsing all child annotations within a sub-tree up to its root node.

Figure 5.1: Illustrative example of taxonomic assignment methods: LCA*, Majority, and LCA2. Node numbers indicate the number of annotations associated with each taxonomic position of the tree, and the double-circled node is the actual originating taxonomy. As is the case in many multi-omic samples, annotations can be quite variable, spanning a number of different lineages of the tree. LCA2 provides a conservative estimate, while the simple majority method provides a specific taxon without very much support from the data. LCA* tempers the Majority estimate by collapsing annotations up the tree in a principled way until a sufficient α-majority is reached ($\alpha > 0.5$), disturbing the entropy of the underlying taxonomic distribution as little as possible.

Having devised mechanics for moving annotations around the tree, we will then define the entropy of the tree in terms of its annotations collapsed at a particular node. From here we will make a key observation that the entropy of annotations collapsed at a given node can be decomposed into the sum of itself and its children.
Using this new decomposition to formulate the difference in entropy between two annotations, we observe that minimizing the difference is equivalent to minimizing the entropy of the node we choose to move to, an observation that will be extremely useful in formulating an efficient algorithm. Reasoning that we can calculate the change in entropy of annotations collapsed at every node, there must be some node with annotations that has both a valid α-majority and a minimal entropy change compared to all other nodes on the tree. This node is our target LCA*. Finally, we formulate an algorithm to calculate LCA*. We first describe a brute-force method of finding the valid node, and then observe that a node-colouring scheme restricting calculations to observed annotations significantly reduces its computational complexity.

5.3.1 Derivation

Let the NCBI Taxonomy Database be a tree $T_{NCBI}$, where the nodes $x$ represent taxa and edges represent phylogenetic relationships. Let $X$ denote the set of all nodes in $T_{NCBI}$, $X = \{x_1, x_2, \ldots, x_M\}$, where $M$ is the total number of taxa in $T_{NCBI}$ (including taxa at internal nodes). Next, let $T_x$ denote a sub-tree within $T_{NCBI}$ rooted at node $x$. Let the set of nodes in sub-tree $T_x$ be denoted $X_x$, allowing a complete recursive notation for all trees and sub-trees of $T_{NCBI}$ (Figure 5.2). As a special case we will denote the root node of $T_{NCBI}$ as $x^*$, and it follows that $X \equiv X_{x^*}$.

It will be convenient to discuss the children of a given node $x \in X$, so let the set of immediate children of node $x$ be $Y_x = \{y_1, \ldots, y_s\}$, where $s$ is the number of immediate child nodes of $x$. Further, the immediate children of $x$ have respective sub-trees $\mathcal{Y}_x = \{T_{y_1}, T_{y_2}, \ldots, T_{y_s}\}$, where each child sub-tree has the set of nodes $X_{y_1}, X_{y_2}, \ldots, X_{y_s}$. A node $x$ is a leaf node if it has no immediate children, i.e., $Y_x = \emptyset$ and $\mathcal{Y}_x = \emptyset$; otherwise $x$ is a non-leaf node.

Next, let us describe assembled contigs, open reading frames (ORFs), and their taxonomic annotations.
Let $R$ be the set of ORFs in the metagenome, where every ORF in $R$ is, by way of some annotation, associated with a taxonomic node $x \in X$ on $T_{NCBI}$. Let ORFs that came through the annotation procedure without a known taxonomy be set to 'root' at node $x^*$. In other words, every predicted ORF in a contig has a corresponding taxonomic annotation, which is set to 'root' if it did not find an acceptable hit in the annotation database. Suppose contig $C$ has the set of $N$ ORFs $O_C = \{o_1, \ldots, o_N\}$ with a corresponding n-tuple of annotations $A_C = (a_1, \ldots, a_N)$, where annotation $a_i$ corresponds to ORF $o_i \in O_C$ for $i = 1, \ldots, N$. We use the notation $\tilde{A}$ to denote the set of annotations in the n-tuple $A$, with the set of annotations from contig $C$ being denoted as $\tilde{A}_C$.

Figure 5.2: The NCBI taxonomy tree structure used in our derivation. Nodes represent taxa and a line between two nodes shows a taxonomic relationship. $T_x$ denotes the sub-tree of $T_{NCBI}$ rooted at $x$, and $y_1, y_2, \ldots, y_s$ are the immediate children of $x$.

We need to determine or elect a taxon from annotations $\tilde{A}$ to label contig $C$. One straightforward method might be to use a simple majority vote procedure on the taxa of contig $C$, $A_C$. However, there may not be a simple majority among taxa $A_C$, or even a majority with a minimum proportion of the votes $\alpha$, where $\alpha > \frac{1}{2}$, a so-called α-majority.

Definition 2 (α-majority). Given an n-tuple of annotations $A$, for any $\alpha > \frac{1}{2}$ we say that $A$ satisfies an α-majority if there is a taxon $a \in A$ that constitutes at least an α-fraction of the elements.
Conversely, if no such taxon $a$ exists then we say annotations $A$ do not satisfy an α-majority.

Clearly, given two proportions $\alpha$ and $\alpha'$ such that $\alpha \geq \alpha' > \frac{1}{2}$, if some annotation n-tuple $A$ has an α-majority, then $A$ also has an α′-majority.

It might be possible to obtain an α-majority by replacing annotations $A$ with modified annotations $A'$, where each taxon $a$ is replaced by one of its ancestral taxa $a'$. We define such a relationship as a partial order on taxa, $a' \preceq a$, where $a, a' \in X$. For example, if $a$ is Alphaproteobacteria, $a'$ could be Proteobacteria or some other ancestor of $a$ all the way to the root $x^*$. Clearly, it is always possible to create a majority by replacing each taxon $a \in A$ with the root $x^*$; however, this trivial result is not analytically useful, as we have lost almost all taxon-specific information about contig $C$ other than "C came from life." In fact, any modified set of taxa $A'$ essentially represents some loss of taxon-specific information from $A$. Therefore, we would like to quantify this loss of information in a principled way, such that we can design an algorithm to construct an α-majority while minimizing the amount of information loss required to attain it.

To formulate this problem, we need to extend the definition of the partial order $\preceq$ to n-tuples as follows. We will now denote some specific transformations on an n-tuple of taxa that we call reductions:

(i) For any two taxa $a, a' \in X$ we denote the reduction of $a$ to $a'$ as $a \to a'$, such that $a' \preceq a$. If there exists an annotation $a''$ such that $a' \preceq a'' \preceq a$, then $a''$ is equal to either $a$ or $a'$. In other words, $a'$ is either $a$ itself or in its lineage. When $a$ is reduced to $a''$ through a series of steps $a \to a' \to \cdots \to a''$, we denote such a multistep reduction of $a$ to $a''$ as $a \xrightarrow{*} a''$.

(ii) We define the partial order relation $\preceq_r$ for n-tuples $A$ and $A'$ as: $A' \preceq_r A$ if every pair of elements $a$ and $a'$ from $A$ and $A'$, at the same index positions, satisfies the relation $a' \preceq a$. We write $A \to A'$ to mean that for every corresponding element $a$ (in $A$) and $a'$ (in $A'$) we have $a \to a'$; and we write $A \xrightarrow{*} A'$ to mean that for every corresponding pair of elements $a$ (in $A$) and $a'$ (in $A'$) we have some series of transformations $a \xrightarrow{*} a'$. Note that for both $A \to A'$ and $A \xrightarrow{*} A'$ we have $A' \preceq_r A$.

We define an annotation n-tuple $A'$ to be consistent if for every pair of distinct annotations $a$ and $a'$ in it we have $a \not\preceq a'$ and $a' \not\preceq a$. Thus, we define a consistent reduction to be any reduction $A \to A'$, and similarly a set of consistent reductions $A \xrightarrow{*} A'$, where this condition holds. This consistency condition is imposed in order not to bias a taxon in terms of its depth in the NCBI Tree. For example, the annotations $A \equiv (a_1, a_2)$, where $a_1 =$ Alphaproteobacteria and $a_2 =$ Proteobacteria, do not preserve consistency since $a_2 \preceq a_1$. However, the annotations $A' \equiv (a'_1, a'_2)$, where $a'_1 =$ Alphaproteobacteria and $a'_2 =$ Betaproteobacteria, preserve consistency.
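The definitions above can be made concrete with a minimal Python sketch. The toy taxonomy `PARENT` and the helper names (`lineage`, `precedes`, `is_consistent`, `collapse_to`, `has_alpha_majority`) are hypothetical and introduced only for illustration; the sketch assumes the tree is stored as a child-to-parent mapping and that collapsing a sub-tree rooted at x means reducing every annotation in that sub-tree to x.

```python
from collections import Counter

# Hypothetical toy taxonomy: child -> parent (the root has parent None).
PARENT = {
    "root": None,
    "Bacteria": "root",
    "Proteobacteria": "Bacteria",
    "Alphaproteobacteria": "Proteobacteria",
    "Betaproteobacteria": "Proteobacteria",
    "Firmicutes": "Bacteria",
}

def lineage(taxon):
    """Return the taxon followed by all of its ancestors up to the root."""
    path = []
    while taxon is not None:
        path.append(taxon)
        taxon = PARENT[taxon]
    return path

def precedes(a_prime, a):
    """Partial order on taxa: a_prime is a itself or one of its ancestors."""
    return a_prime in lineage(a)

def is_consistent(annotations):
    """True if no two distinct taxa in the tuple lie on a common lineage."""
    taxa = set(annotations)
    return not any(a != b and precedes(a, b) for a in taxa for b in taxa)

def collapse_to(annotations, x):
    """Reduce every annotation in the sub-tree of x to x; leave the rest as is."""
    return tuple(x if precedes(x, a) else a for a in annotations)

def has_alpha_majority(annotations, alpha):
    """True if some taxon makes up at least an alpha fraction of the tuple."""
    top_count = Counter(annotations).most_common(1)[0][1]
    return top_count >= alpha * len(annotations)

A = ("Alphaproteobacteria", "Alphaproteobacteria",
     "Betaproteobacteria", "Betaproteobacteria", "Firmicutes")
A_reduced = collapse_to(A, "Proteobacteria")
```

In this toy case `A` has no 0.6-majority (the best taxon holds 2 of 5 annotations), whereas `A_reduced` does, because four of its five elements are now Proteobacteria.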
Intuitively, we can view a consistent reduction $A \to A'$ as a reduction of all annotations descending from $x$ to $x$, or in other words, the collapsing of all annotations in the sub-tree of $x$ to $x$.

Let us note some observations about the reduction of annotation n-tuples. In every reduction step from an annotation n-tuple $A$ to another n-tuple $A'$, $A \to A'$, $A'$ is less specific with respect to $A$. It is important to realize that $A'$ cannot convey any new information about $A$. Moreover, for any annotation n-tuple $A$, there exists a reduction $A \xrightarrow{*} A''$ where $A''$ respects an α-majority for some $\alpha$ in the interval $(\frac{1}{2}, 1]$; note that $A'' = A^*$, where $A^*$ is the n-tuple in which every element is the root $x^*$, always provides a possible solution. Therefore, if annotation n-tuple $A$ does not have an α-majority, there exists an $A''$ that has an α-majority and $A \xrightarrow{*} A''$, i.e., a sequence of single-step reductions $A \to A_1 \to \cdots \to A_k \to A''$. For a given $A$ there may be multiple candidates for $A''$, and in such cases we would like to pick the candidate that loses the least amount of annotation information. In this case, we assume that information-theoretic entropy and biological "taxonomic information" coincide. We must now define the entropy of a phylogenetic tree with annotations $A$.

Definition 3. Given annotation n-tuple $A$ and node $x$ in $T_{NCBI}$, we define entropy $H(x; A)$ as
$$H(x; A) = -\sum_{z \in X} p_A(z) \log p_A(z) = -\sum_{z \in X \cap \tilde{A}} p_A(z) \log p_A(z),$$
where $p_A(z) = \frac{r_A(z)}{N}$, $\tilde{A}$ is the unique set of elements in $A$, $r_A(z)$ is the number of annotations in $A$ that are taxon $z$, and $N$ is the length of the annotation n-tuple $A$.

Having defined our reductions and the tree entropy given annotation n-tuple $A$, and given an acceptable majority proportion threshold $\alpha \in (0.5, 1]$, we can now formulate a minimal entropy reduction on $A$ to an α-majority-satisfying $A'$.

Definition 4 (Minimal Entropy Reduction).
Given annotation n-tuple $A$ from contig $C$ and a majority proportion $\alpha$, we would like to produce an annotation $A'$ through reductions $A \xrightarrow{*} A'$, such that $A'$ satisfies an α-majority and minimizes the change in entropy over all such $A'$, i.e., $\min_{\forall A',\, A' \preceq_r A} |H(x; A) - H(x; A')|$, electing the taxon that produces the α-majority in $A'$ as the origin taxon for $C$.

In order to find an annotation n-tuple $A'$ that has an α-majority and minimizes the change in entropy, it is sufficient to replace some subset $S$ of $A$ by the lowest common ancestor $a$ of all taxa in $S$, i.e., $a \preceq s$ for all $s \in S$. If there exists another lowest common ancestor $a'$ such that $a' \preceq s$ for each $s$, this implies $a' = a$ (i.e., the lowest common ancestor of $S$ is unique). Therefore, one brute-force way would be to compute the change in entropy for all valid transformations, $\Delta H(x; A, A') \equiv H(x; A) - H(x; A')$, at every node $x$.

Next we will expand on and define some simplifications of entropy $H(x; A)$ and change in entropy $\Delta H(A, A')$ that will prove useful in the actual construction of the LCA* algorithm. Notice that entropy can be written
\begin{align}
H(A) &= -\sum_{z \in X} \left(\frac{r_A(z)}{N}\right) \log\left(\frac{r_A(z)}{N}\right) = -\frac{1}{N} \sum_{z \in X_x} r_A(z) \left(\log r_A(z) - \log N\right) \tag{5.1} \\
&= -\frac{1}{N} \sum_{z \in X_x} r_A(z) \log r_A(z) + \frac{1}{N} \sum_{z \in X_x} r_A(z) \log N \tag{5.2} \\
&= -\frac{1}{N} \sum_{z \in X_x} r_A(z) \log r_A(z) + \log N, \tag{5.3}
\end{align}
where $r_A(z)$ refers to the number of annotations assigned to taxon node $z$. Similarly, observing that the set of nodes in a sub-tree at $x$, $X_x$, can be partitioned as the union of $x$ itself and the nodes in its immediate children's sub-trees, $X_x = \{x\} \cup \bigcup_{y \in Y_x} X_y$, we can partition the entropy of a set of annotations as follows:
\begin{align}
H(A) &= -\frac{1}{N} \sum_{z \in X_x} r_A(z) \log r_A(z) + \log N \tag{5.4} \\
&= -\frac{1}{N} \left( r_A(x) \log r_A(x) + \sum_{y \in Y_x} \sum_{w \in X_y} r_A(w) \log r_A(w) \right) + \log N \tag{5.5} \\
&= -\frac{1}{N} \left( r_A(x) \log r_A(x) + \sum_{y \in Y_x} L^A_y \right) + \log N, \tag{5.6}
\end{align}
where $L^A_y = r_A(y) \log r_A(y)$ if $y$ is a leaf node in $T_{NCBI}$ and $L^A_y = r_A(y) \log r_A(y) + \sum_{z \in Y_y} L^A_z$ otherwise, with $0 \log 0$ taken as $0$. Note that we decomposed the entropy into two main terms, the entropy term of node $x$, $r_A(x) \log r_A(x)$, and the sum of the entropy terms of its immediate children's trees, $\sum_{y \in Y_x} L^A_y$ (Figure 5.3).

Figure 5.3: Decomposition of entropy into sub-trees. A key observation in our derivation is that the entropy of annotation $A$ in a tree rooted at a given annotation node $x$, $T_x$, can be decomposed into the sum of node $x$ and the nodes of its immediate children's sub-trees $(y_1, y_2, \ldots, y_s)$ as $X_x = \{x\} \cup \bigcup_{y \in Y_x} X_y$. From here we can decompose the calculation of entropy $H(A)$ in the same partition.

From here we can express the change in entropy $\Delta H(A, A')$ on a consistent reduction of annotations $A \to A'$ as
$$\Delta H(A, A') = \left( -\frac{1}{N} \sum_{z \in X_x} r_A(z) \log r_A(z) + \log N \right) - \left( -\frac{1}{N} \sum_{z \in X_x} r_{A'}(z) \log r_{A'}(z) + \log N \right). \tag{5.7}$$
Since we are interested in an $A'$ that minimizes $\Delta H(A, A')$, note that all terms corresponding to $A$ in the above relation remain constant. We can simplify the calculation by focusing on finding an $A'$ such that $A'$ is a consistent reduction of $A$ and minimizes $\delta H(A, A') \equiv -\sum_{z \in X_x} r_{A'}(z) \log r_{A'}(z)$, together with the recursive relation
$$\delta H(x; A, A') = -r_{A'}(x) \log r_{A'}(x) + \sum_{z \in Y_x} \delta H(z; A, A'), \tag{5.8}$$
based on which we will design our algorithm. We will now show that such a transformation to $A'$ exists for any given starting annotations $A$.

Proposition 1. Suppose $A$ is any n-tuple of annotations. Then for any $\alpha > \frac{1}{2}$ there exists a taxon $\hat{x}$ and a consistent reduction of $A$ to some n-tuple $A''$, such that
(i) $A''$ respects an α-majority,
(ii) $\delta H(A, A'') = \min_{A' \preceq_r A} \delta H(A, A')$ over all consistent reductions $A \xrightarrow{*} A'$, and
(iii) $A$ and $A''$ only differ in the elements where $\hat{x}$ is in $A''$.

Proof. Note that in the above proposition it is easy to show the existence of a taxon $\hat{x}$ that satisfies conditions (i) and (ii). This is because $\hat{x} = x^*$ is a trivial solution that satisfies condition (i), and the set of candidates for $A''$ is non-empty; hence there exists a taxon that satisfies condition (ii).
In order to realize (iii), note that since $A \xrightarrow{*} A''$ and $A''$ has an α-majority, there exists an annotation $\tilde{x}$ in $A''$ which makes up at least an α-fraction of all elements in $A''$. Since the reductions $A \xrightarrow{*} A''$ are consistent, we can achieve the α-majority by simply collapsing the annotations that are descendants of $\tilde{x}$ in $T_{NCBI}$, or specifically, for all annotations $a \in A$ where $\tilde{x} \preceq a$, $a \xrightarrow{*} \tilde{x}$.

5.3.2 Algorithm

Since we have outlined a mathematical framework defining α-majority, consistent reductions on the NCBI Tree, and a recursive definition of the entropy of annotations $A = (a_1, a_2, \cdots, a_N)$ on $T_{NCBI}$, we can now focus on designing and implementing our algorithm, ComputeLCA*, that calculates an α-majority for a given contig $C$ while minimizing changes to its underlying information entropy.

The input to ComputeLCA* consists of the NCBI taxonomy tree $T_{NCBI}$, the n-tuple of ORF annotations $A$ for the ORFs in a contig $C$, and the threshold $\alpha$ that defines the majority (Algorithm 5.1). Since we are interested in the taxon that minimizes the change in entropy $\delta H(A, A')$, our algorithm is designed to exploit the recursive nature of the tree traversal in $T_{NCBI}$ as well as the recursive delta entropy term (Equation 5.8).

We use the global hash data-structures $S[x]$ and $L[x]$ for every node $x \in X$. $S[x]$ stores the number of annotations collapsed at node $x$ from its sub-tree, i.e., summed over all $x' \in X_x$, and similarly $L[x]$ stores the sum of entropy terms $r(x') \log r(x')$ over all nodes $x' \in X_x$ in the sub-tree of $x$, i.e., $\delta H(x; A, A')$ at a given $x$. ComputeLCA* starts at the root $x^*$ and recursively traverses $T_{NCBI}$, calculating the sums $L$ and $S$ at all nodes. The algorithm then selects the node whose sum minimizes the change in entropy and that also has sufficient support $\alpha$.

Algorithm 5.1 ComputeLCA*
Require: $T_{NCBI}$, $A$, $\alpha$
Ensure: $t^*$
 1: S ← ∅, L ← ∅   /* S and L are hashes */
 2: x* ← root(T_NCBI)
 3: call ComputeSL(x*, T_NCBI, A)
 4: t* ← argmin_{x s.t. S[x] ≥ α|A|} L[x]   /* result */
 5:
 6:
 7: /* Subroutine ComputeSL computes S and L for each taxon */
 8: subroutine ComputeSL(x, T_NCBI, A)
 9:   /* annotations observed at x itself; 0 log 0 is taken as 0 */
10:   L[x] ← r(x) log r(x)
11:   S[x] ← r(x)
12:   if x is a leaf-node in T_NCBI
13:     return
14:   for each c in Children(x)
15:     call ComputeSL(c, T_NCBI, A)
16:     L[x] ← L[x] + L[c]
17:     S[x] ← S[x] + S[c]
18:   return

Implementation

ComputeLCA* for a typical number of annotations on a contig does not take more than a few hundred milliseconds, but the described brute-force method traversing the entire NCBI Tree is computationally inefficient, and for samples with hundreds of thousands of contigs the total computation time could be large. Therefore, in the implementation of ComputeLCA*, a key optimization step is incorporated that skips the examination of sub-trees where no annotations exist.

Consider the set of $N$ ORFs and the corresponding set of $N$ annotations originating from contig $C$. Let $M$ be the total number of taxonomic nodes in $T_{NCBI}$. Then according to ComputeLCA*, it can take $O(MN)$ steps to compute the LCA* taxonomy for $C$. Note that at line 14 of ComputeLCA*, it is redundant to visit the sub-tree rooted at the child node stored in loop variable $c$ if there are no annotations in that sub-tree. However, in order to know whether annotations are present in a given sub-tree of $T_{NCBI}$, before running ComputeLCA* we color all nodes whose sub-tree contains a non-empty set of annotations.
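For illustration, Algorithm 5.1 combined with this colouring idea can be sketched in Python. The function names `mark_lineages` and `compute_lca_star`, the child-to-parent dictionary used to represent the tree, and the toy taxonomy are assumptions made for this sketch and are not taken from the LCAStar package. The sketch also assumes that each node's own annotation count contributes to S and L (following the recursion in Equation 5.8) and that ties in L are broken in favour of the deeper node.

```python
import math
from collections import Counter, defaultdict

def mark_lineages(parent, annotations):
    """Colour every node whose sub-tree contains at least one annotation."""
    marked = set()
    for a in annotations:
        node = a
        # Walk towards the root and stop at the first node already coloured.
        while node is not None and node not in marked:
            marked.add(node)
            node = parent[node]
    return marked

def compute_lca_star(parent, root, annotations, alpha):
    """Return the coloured node with minimal L[x] among those with S[x] >= alpha * N."""
    r = Counter(annotations)          # r(x): annotations observed exactly at x
    children = defaultdict(list)      # coloured children of each coloured node
    for node in sorted(mark_lineages(parent, annotations)):
        if parent[node] is not None:
            children[parent[node]].append(node)

    S, L, postorder = {}, {}, []

    def compute_sl(x):
        # Own count of x plus the sums over its coloured children.
        S[x] = r[x]
        L[x] = r[x] * math.log(r[x]) if r[x] > 0 else 0.0
        for c in children.get(x, ()):
            compute_sl(c)
            S[x] += S[c]
            L[x] += L[c]
        postorder.append(x)           # children are appended before their parent

    compute_sl(root)
    candidates = [x for x in postorder if S[x] >= alpha * len(annotations)]
    # min() returns the first minimum, so a tie goes to the deeper node.
    return min(candidates, key=lambda x: L[x])

# Hypothetical toy taxonomy: child -> parent (the root has parent None).
PARENT = {
    "root": None,
    "Bacteria": "root",
    "Proteobacteria": "Bacteria",
    "Alphaproteobacteria": "Proteobacteria",
    "Betaproteobacteria": "Proteobacteria",
    "Firmicutes": "Bacteria",
}
annotations = (["Alphaproteobacteria"] * 3 + ["Betaproteobacteria"] * 2
               + ["Proteobacteria"] + ["Firmicutes"] * 2)
elected = compute_lca_star(PARENT, "root", annotations, alpha=0.6)
```

On this toy input the simple majority (Alphaproteobacteria, 3 of 8 annotations) falls short of α = 0.6, and the sketch elects Proteobacteria, whose collapsed sub-tree holds 6 of the 8 annotations.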
We mark the nodes by considering one annotation at a time, say $a$, as follows:
(i) we start at the node $a$ in $T_{NCBI}$ and travel upwards towards the root one parent step at a time;
(ii) in each step, if the current node $p$ is not marked, we mark $p$ and move to the parent of $p$, if any; otherwise we are done with annotation $a$;
(iii) if the parent is already marked, we are done with annotation $a$.

We now add a brief note on the relative computational complexity of our optimization. Consider the partially ordered set $(\tilde{A}, \preceq)$ of a set of annotations $\tilde{A}$ on $T_{NCBI}$, and suppose $L$ is the size of the largest subset $S$ of $\tilde{A}$ such that no two annotations in $S$ are comparable via $\preceq$. Our modified algorithm therefore takes $O(DL)$ steps to mark the nodes in the upfront step, where $D$ is the maximum tree depth in our set of annotations $\tilde{A}$ on $T_{NCBI}$. Since we only visit nodes that have been colored at line 14, our modified algorithm has time complexity $O(DL)$. Although the worst-case time complexity could still be $O(MN)$, where the annotation breadth spans the entire $T_{NCBI}$, i.e., $D = M$ and $L = N$, most real-world annotation sets are much more confined, with many annotations falling in the lineages of each other. This makes $L < N$ and $D \ll M$, and hence the real-world running time $O(DL)$ is typically much smaller than $O(MN)$.

5.4 Statistical significance

Having described the LCA* statistic and the algorithm ComputeLCA* (Algorithm 5.1), we now have a way of estimating the taxonomic identity of a contig satisfying the majority threshold $\alpha$. However, although LCA* represents an α-majority taxonomic estimate, this majority might not have statistical confidence, especially for smaller contigs with few ORFs. Moreover, even with larger contigs, we may still want to apply a statistical test to give another measure of confidence, and to provide a mechanism for comparing the statistical confidence of LCA* with other voting-based statistics, like the simple majority.
Here we formulate a likelihood-ratio hypothesis test for the supremacy of the LCA* estimate, serving as an alternative to the uniformity $\chi^2$ tests, which would more typically be used but are a more indirect method of testing for supremacy.

Let us make a few observations about the LCA* statistic that will be informative when motivating a representative probability distribution. While in some cases the initial taxonomic distribution may be able to elect a satisfactory α-majority, the reduced annotation set $A'$ provided by LCA* is essentially constructing an α-majority from the original annotations $A$ while discarding as little information as possible. This means LCA* is a compromise away from estimating a specific taxonomy, which might be estimated poorly if data is sparse or highly variable, in favour of a more general but possibly more reliable taxonomic estimate. We assume that the reduced taxonomic annotations from $A'$ express individual opinions or votes in an election, and that the total number of ballots cast is the number of ORF annotations in a contig. Moreover, each annotation in $A'$ independently casts its ballot according to a multinomial probability distribution with probabilities $p_1 \leq p_2 \leq \ldots \leq p_k$ and $\sum_{i=1}^{k} p_i = 1$, a reasonable way to model whether the elected α-majority taxon is indeed correct.

To calculate the p-value for this distribution, we suppose the annotation counts from $A'$ are represented as a random vector $X = (X_1, X_2, \ldots, X_k)'$ from the multinomial distribution with cell probabilities $(p_1, p_2, \ldots, p_k)$ corresponding to the proportion of counts for each annotation. Our goal is to test the supremacy hypothesis that the proportion of the selected majority $p_k$ is indeed the largest, i.e., the statistical hypotheses $H_0 : p_k \leq \max(\{p_1, \ldots, p_{k-1}\})$ versus $H_A : p_k > \max(\{p_1, \ldots, p_{k-1}\})$. It can be shown that the null hypothesis can be written as the union of convex sets: $H_0 : p \in \bigcup_{j=1}^{k-1} C_j$, where $C_j \equiv \{p : p_k \leq p_j\}$ for $j = 1, \ldots, k-1$ [213].
It can also be shown that a likelihood-ratio test at some significance level $\alpha'$ can be approximated by a $\chi^2$ distribution with one degree of freedom [214]. Therefore, we reject $H_0$ if and only if $T_n \geq \chi^2_1(1 - 2\alpha')$, where the test statistic $T_n$ is defined as follows:
$$T_n = \begin{cases} 0 & \text{if } X_k \leq M, \\ 2\left[ M \log\left(\frac{2M}{M + X_k}\right) + X_k \log\left(\frac{2X_k}{M + X_k}\right) \right] & \text{if } X_k > M, \end{cases} \tag{5.9}$$
where $M \equiv \max\{X_1, \cdots, X_{k-1}\}$.

5.4.1 Implementation

Wanting to keep software dependencies and Monte Carlo-based approximations to a minimum, we calculated the $\chi^2_1$ cumulative distribution in terms of the Gauss error function $\mathrm{erf}(x)$ included in the standard Python 2.7 math library. The error function is closely tied to the normal distribution with mean $\mu = 0$ and variance $\sigma^2 = \frac{1}{2}$ (so $\sigma = \frac{1}{\sqrt{2}}$), whose cumulative distribution is
\begin{align}
P(t < x) &= \frac{1}{\sqrt{2\pi}\,\sigma} \int_{-\infty}^{x} e^{-\frac{t^2}{2\sigma^2}}\, dt \tag{5.10} \\
&= \frac{1}{\sqrt{2\pi}\,\frac{1}{\sqrt{2}}} \int_{-\infty}^{x} e^{-t^2}\, dt = \frac{1}{\sqrt{\pi}} \int_{-\infty}^{x} e^{-t^2}\, dt, \tag{5.11}
\end{align}
and by symmetry
$$\mathrm{erf}(x) = P(|t| < x) = \frac{2}{\sqrt{\pi}} \int_{0}^{x} e^{-t^2}\, dt. \tag{5.12}$$
Now to solve for a $\chi^2_1$ in terms of the error function $\mathrm{erf}(x)$, we observe that the error function is expressed through a normal distribution parameterized as $N(0, \frac{1}{2})$. It follows that
$$\sqrt{2}\, N\left(0, \tfrac{1}{2}\right) = N(0, 1), \tag{5.13}$$
and by squaring to solve for $\chi^2_1$,
$$N(0, 1)^2 = \left( \sqrt{2}\, N\left(0, \tfrac{1}{2}\right) \right)^2 = 2\, N\left(0, \tfrac{1}{2}\right)^2 \sim \chi^2_1. \tag{5.14}$$
Applying the same transformations, the cumulative distribution of the $\chi^2_1$ is
$$P(\chi^2_1 < t) = P\left( 2\, N\left(0, \tfrac{1}{2}\right)^2 < t \right) = P\left( \left| N\left(0, \tfrac{1}{2}\right) \right| < \sqrt{\tfrac{t}{2}} \right) = \mathrm{erf}\left( \sqrt{\tfrac{t}{2}} \right). \tag{5.15}$$

5.5 Methods

Here we describe details of simulations, sequences, parameter settings, and necessary resources used in the performance analysis of LCA*.

5.5.1 Simulation, sequences, and annotation

The two simulated sets of contigs were created by randomly sampling 10,000 bp "contigs" from 2,713 NCBI bacterial genomes (downloaded March 15, 2014) using the Python script sub_sample_ncbi.py. The first simulation (Small) is a random sample of 100 genomes, sampling 10 contigs from each genome, while the second simulation (Large) is a random subset of 2,000 genomes, also sampling 10 random contigs from each genome.
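The sampling step can be illustrated with a minimal sketch. The function name `sample_contigs`, its parameters, and the toy genome below are hypothetical and are not taken from the script mentioned above; the sketch only assumes that each simulated contig is a fixed-length substring drawn uniformly at random from a genome sequence.

```python
import random

def sample_contigs(genome, contig_length=10000, n_contigs=10, seed=None):
    """Draw n_contigs fixed-length substrings uniformly at random from genome."""
    if len(genome) < contig_length:
        raise ValueError("genome is shorter than the requested contig length")
    rng = random.Random(seed)
    last_start = len(genome) - contig_length
    # randint is inclusive at both ends, so every valid start position can occur.
    starts = [rng.randint(0, last_start) for _ in range(n_contigs)]
    return [genome[s:s + contig_length] for s in starts]

# Toy 20,000 bp "genome" used only to exercise the function.
toy_genome = "ACGT" * 5000
toy_contigs = sample_contigs(toy_genome, seed=1)
```

Passing a seed makes a simulated data set reproducible, which is convenient when the same contigs need to be re-annotated under different settings.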
Assembled contigs of 201 GEBA SAGs were obtained from IMG/M (JGI Project Name: "GEBA-MDM") [215]. All contigs had their ORFs predicted and annotated against the RefSeq database (Release-62) using the MetaPathways pipeline (Version 2.0) with default settings (i.e., Prodigal [74], LAST [76], BSR=0.4) [149, 152, 168].

5.5.2 Analysis

Annotations from the MetaPathways pipeline were used as input into the script Compute_LCAStar.py, which implements the LCA*, Majority, and LCA2 taxonomic prediction methods. The α-majority parameter of LCA* was set to its default $\alpha > 0.5$. Supremacy p-values were calculated for the voting-based methods Majority and LCA* using the Gaussian error function included in the Python 2.7 math library, as described in the Statistical Significance section. A list of contig origins for each sample was included as an optional input to produce distances as an expanded report. Prediction performance was evaluated using two taxonomic distances on the NCBI taxonomic database hierarchy (NCBI Tree) as measures of error between the predicted taxonomy and the taxonomy of origin. The first was a Simple-walk on the NCBI Tree from the node of the predicted taxonomy to the original expected taxonomy. The second was a more taxonomically relevant weighted taxonomic distance (WTD), where each edge weight is proportional to $\frac{1}{2^d}$, where $d$ is the depth of the edge in the tree from the root, so that differences in taxonomy made early in their respective lineages count for more [171] (Section 3.4). For example, two lineages that diverged at Bacteria would have a larger distance than two lineages that diverged deeper in the tree at Proteobacteria. It should be noted that we needed to slightly modify and clean the NCBI Tree. An additional node for 'prokaryotes' was added as the parent of the bacteria and archaea nodes.
Additionally, duplicate names in some taxonomies were causing ambiguities between different eukaryote and prokaryote taxa (e.g., 'Bacteria 〈stick insect genus〉', 'Bacillus 〈stick insect〉', 'Yersinia 〈mantid〉', and 'Rothia 〈angiosperm〉'). A copy of this modified tree is included in the LCAStar GitHub repository as ncbi_taxonomy_tree.txt.

5.6 Results

The performance of LCA* was compared against two other taxonomic estimation methods, LCA2 and Majority, on both simulated and actual multi-omic contigs. LCA2 is the application of LCA to the taxonomic annotations of a contig, while Majority is a simple majority method where the taxonomy with the greatest number of annotations is chosen. To evaluate these three methods, we compared their relative prediction performance on two sets of simulated metagenomic contigs and a set of actual contigs from 201 microbial "dark-matter" Single-cell Amplified Genomes (SAGs) obtained from the Genomic Encyclopedia of Bacteria and Archaea (GEBA) project [215], a United States Department of Energy and Joint Genome Institute (JGI) initiative for sequencing thousands of bacterial and archaeal genomes from diverse branches of the Tree of Life (Section 5.5).

In all instances the voting-based measures Majority and LCA* outperformed LCA2 using both the Simple-walk and WTD distances. In both the Small and Large simulations, as well as the GEBA single-cell contigs, LCA* and Majority had Simple-walk distances closer to zero when compared with LCA2 (Figure 5.4). A similar pattern was observed with the WTD, but here LCA2 is penalized more heavily for predictions far outside their original taxonomic lineage. Here, LCA2 predictions at the taxonomic root produced a cluster of strongly negative WTD values in the GEBA sample (Figure 5.5).
Taking the perspective of a regression analysis, it is also possible to express these distances as error measurements and calculate their root-mean-squared error (RMSE) as a measure of accuracy (Figure 5.6). Here the voting-based methods exhibited smaller RMSE values in all cases, but were only significantly different from LCA2 at the 95% confidence level in the Large simulation and in the GEBA SAGs when measured by the WTD. In no cases were the RMSE values of the voting-based methods LCA* and Majority significantly different from each other at the 95% confidence level.

The voting-based LCA* and Majority methods had similar performance in terms of both the Simple-walk and WTD in all experiments, with LCA* exhibiting a slightly larger tail in all densities (Figures 5.4 and 5.5). However, the two methods differ significantly in their supremacy p-values, with LCA* reporting substantially smaller p-values on average (Figure 5.7), suggesting there is more statistical confidence in reported LCA* taxonomies. Moreover, when we compare pairwise p-values, we can see that in the majority of instances LCA* reported more confident majority taxonomies, but also that in many instances the two methods were very similar, reporting the same p-value when an α-majority was found in the original annotations and no collapsing of annotations was necessary (Figure 5.8). The cluster of points where the Majority method's p-values are 1.0 highlights instances where the supremacy p-value indicates a definite hazard in interpreting the reported taxonomy. The p-value will equal 1.0 where there is a tie for a majority taxonomy (i.e., $X_k = M$), which highlights a situation where an arbitrary decision was made between two taxonomies, indicating there is significant uncertainty about the reported taxonomy.
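The tie behaviour can be reproduced with a minimal sketch of the supremacy p-value built on Equations 5.9 and 5.15. The function names are hypothetical and this is not the code of the LCAStar package. Two assumptions are made: the second logarithm of the test statistic uses 2·X_k (the standard likelihood-ratio form), and the p-value is taken as the upper-tail probability 1 − P(χ²₁ < T_n), which yields 1.0 at a tie as described here.

```python
import math

def supremacy_statistic(counts, elected):
    """Test statistic T_n for the elected taxon against its strongest rival."""
    x_k = counts[elected]
    m = max((c for taxon, c in counts.items() if taxon != elected), default=0)
    if x_k <= m:
        return 0.0
    total = m + x_k
    # A rival count of zero contributes nothing (0 log 0 is taken as 0).
    rival_term = m * math.log(2.0 * m / total) if m > 0 else 0.0
    return 2.0 * (rival_term + x_k * math.log(2.0 * x_k / total))

def chi2_1_cdf(t):
    """P(chi-squared with one degree of freedom < t) via the error function."""
    return math.erf(math.sqrt(t / 2.0))

def supremacy_p_value(counts, elected):
    # Assumption: upper-tail probability of the chi-squared(1) approximation.
    return 1.0 - chi2_1_cdf(supremacy_statistic(counts, elected))

# Hypothetical annotation counts per taxon for two contigs.
tied = {"Alphaproteobacteria": 4, "Betaproteobacteria": 4}
clear = {"Proteobacteria": 10, "Firmicutes": 2}
p_tied = supremacy_p_value(tied, "Alphaproteobacteria")
p_clear = supremacy_p_value(clear, "Proteobacteria")
```

With tied counts the statistic is zero and the p-value is exactly 1.0, whereas a taxon holding 10 annotations against a strongest rival with 2 gives a p-value below 0.05 under these assumptions.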
Interestingly, none of the LCA* estimates in our experiments had a p-value of one, suggesting that because LCA* actively tries to seek out some majority, occurrences of such a stalemate election are relatively rare.

Figure 5.4: Gaussian kernel densities of Simple-walk distances between predicted and actual taxonomies across LCA2, Majority, and LCA*, and experiments. Means (solid lines) and medians (dashed lines) of the distances are reported for each statistic and experiment. [Panels: Small, Large, GEBA; x-axis: Simple-walk Distance; y-axis: Density.]

Figure 5.5: Gaussian kernel densities of weighted taxonomic distances between predicted and actual taxonomies across LCA2, Majority, and LCA*, and experiments. Means (solid lines) and medians (dashed lines) of the distances are reported for each statistic and experiment. [Panels: Small, Large, GEBA; x-axis: Weighted Taxonomic Distance; y-axis: Density.]

Figure 5.6: Root-mean-squared error (RMSE) for LCA2, Majority, and LCA*, across experiments and distances. Error bars represent 95% confidence intervals drawn from a Student's t-distribution. [Panels: Small, Large, GEBA; x-axis: Method; y-axis: RMSE.]

Figure 5.7: Gaussian kernel densities of supremacy p-values for voting-based methods Majority and LCA*. Means (solid lines) and medians (dashed lines) are reported for each statistic and experiment. [Panels: Small, Large, GEBA; x-axis: p-value; y-axis: Density.]

Figure 5.8: Pairwise supremacy p-values for the voting-based methods Majority and LCA*. The blue points are instances where the LCA* p-value was smaller than the Majority p-value, green points are instances where the Majority p-value was smaller than the LCA* p-value, and grey points are instances where both p-values were equal, presumably when both statistics were estimated from the same underlying taxonomic distribution, where the two estimates are equivalent and report the same taxonomy. [Panels: Small, Large, GEBA; x-axis: p-value (LCA*); y-axis: p-value (Majority).]

5.7 Discussion and conclusions

This work described, formulated, and implemented LCA*, an entropy-based taxonomic binning method for the prediction of taxonomy on unknown contigs. By outlining a mathematical framework to reason about taxonomy, LCA* attempts to strike a balance between the competing goals of obtaining a sufficient majority of more than 50% of annotations, as recommended by Condorcet's Theorem, and minimizing the change in the underlying taxonomic information by minimizing changes in its entropy.
A likelihood-ratio test was implemented to test for the supremacy of predicted taxonomies, reporting a p-value that can be used as a measure of confidence and hazard in reported taxonomies. Both simulated and actual contigs demonstrated the effectiveness of the voting-based methods Majority and LCA* in predicting contig taxonomies significantly closer to their actual origin than the simple LCA2 method, suggesting that LCA* produces taxonomic predictions appropriate to the support supplied by the data, balancing specificity and confidence over the simple majority method.

While LCA* has a strong theoretical backing for constructing a majority from a variable taxonomic distribution, let us now draw attention to some significant underlying assumptions made in implementing the statistic. Observed taxonomies come from the tree at various taxonomic depths, meaning that in many cases observations fall within the same possible taxonomic lineage. This raises a philosophical issue: our observed bins of the multinomial distribution cannot be viewed as completely independent, because partial taxonomies could fall into multiple bins. A possible remedy is to discard all annotations that do not fall at the observed leaves of the NCBI Tree. However, given the variability inherent in taxonomic annotations, e.g., homology-search, annotation databases, or taxonomic summarization at each ORF annotation, it is quite likely that many annotations may end up at internal nodes of the tree. Moreover, annotation of metagenomic contigs can be quite sparse, so discarding internal annotations in the name of independence simply decreases valuable statistical power, and can artificially bias the signal for a particular taxonomy by arbitrarily removing competition from internal nodes.
Alternatively, one could attempt to distribute annotations from internal nodes equally to the observed leaves of the tree. However, this creates its own discretization issue, as in many cases votes cannot be distributed equally, and vote-splitting violates the assumptions of the multinomial model, which assumes votes are integers. Moreover, vote-splitting can make the final predicted taxonomy more difficult to interpret, as the final reported taxonomic distribution in each case could be very different from the one we actually observed, and risks electing a more specific taxonomy than the data justify. In the end, we opted to leave internal annotations in the election; discarding annotations throws valuable information away and risks creating some very unfair majorities, while vote-splitting confuses the interpretation of predictions, risks electing a more specific taxonomy than justified, and complicates the validity of the test statistic and p-value.

Though LCA* will likely give reasonable estimates when observed annotations are highly variable, being an alignment-dependent binning method, it still stands to benefit from expanding our current framework to incorporate information from the statistical properties of sequences found in alignment-independent methods. The election model could be expanded to incorporate genomic-signature information into a weighted-voting or vote-splitting framework, which perhaps could help improve statistical confidence in cases where observed annotations are extremely sparse. However, expanding our current voting theory model to include continuous grades is not without its challenges, and would likely challenge the multinomial assumptions currently in the model. Range voting attempts to provide a model for a continuous voting scale, but its theory has numerous impossibility results that challenge all three of our common sense principles of voting: preserving majority rule, requiring a minimum level of core support, and rewarding sincere voters [216].
Another improvement that would serve to integrate NCBI-based taxonomies in general would be to map 16S and COG sequence alignments to NCBI Taxonomy Database IDs, allowing an apples-to-apples comparison between functional and marker gene taxonomic methods like MLTreeMap [126].

LCA* is implemented as a convenient Python library, utilizing a tree-coloring scheme for computational efficiency. The package also includes implementations of the Majority and LCA2 methods as analytical alternatives and provides a number of analytical features with intelligent or theoretically-motivated defaults to help users perform similar analyses. Moreover, LCA* is now integrated as an additional step in the MetaPathways pipeline, which includes it as an analytical table within the available GUI. The stand-alone LCA* Python package is available from the Hallam Lab GitHub (http://www.github.com/hallamlab/LCAStar/) along with installation tutorials and example vignettes.

Chapter 6

Conclusions

This dissertation described the MetaPathways pipeline, an open-source, modular pipeline for the scalable analysis of environmental sequence information. This final chapter takes a step back to have a high-level discussion of underlying assumptions present in the analysis of multi-omic datasets, highlights related research and community developments, outlines future computational and analytical improvements, and concludes with some general speculation on future directions in the analysis of environmental sequence information.

6.1 Assumptions and limitations of current approaches

While cultivation-independent methods allow exploration of unknown microbial dark matter for taxonomic community structure and function, current approaches and analytical methods contain a number of analytical assumptions and limitations that need to be acknowledged.
First, with the exception of simpler environments like acid mine drainage [217], multi-omic datasets represent severe under-sampling experiments, limiting the assembly and analysis to the most abundant members of the community. Moreover, the sequences that are obtained are often entirely novel, making ORF prediction and annotation a challenge because of assumptions made in parameter training for known ‘genome signatures’ and sparse similarity to any characterized proteins. Public database composition is also inherently biased by the focus of past research on model organisms and medically-oriented research goals. It should also be acknowledged that the seed-and-extend approach to homology search in protein databases is an approximation to the exact Smith-Waterman alignment, so it is an accepted risk that a certain number of false-positive and false-negative annotations will permeate analyses. Taxonomic annotation methods like LCA and LCA* discussed in Chapter 5 are attractive for multi-omic samples from natural and engineered ecosystems as they allow taxonomic annotation outside a particular set of single-copy genes (e.g., 16S rRNA and COGs), but must be interpreted with care in terms of quantitative abundance and phylogenetic accuracy as many functional genes share large portions of homology. Pathway-centric analysis also suffers some of the same biases as protein annotations, but has further issues in terms of interpretive hazards, including biochemical rules and multifunctional, reversible, and partially-functional enzymes. Moreover, the use of defined pathway structures constrains the predictive universe in ways that likely underestimate the metabolic potential encoded in the genomes of MDM.
In the absence of transcriptomic, proteomic, or metabolic data to validate pathway expression at other levels of life’s central dogma, the extent to which certain metabolic pathways are actually occurring is unknown.

6.2 Related research and community developments

Since beginning this work, other analytical software has been developed for the analysis of environmental sequence information. In particular, the bioBakery collection, developed as part of the Human Microbiome Project (https://bitbucket.org/biobakery/biobakery/), includes a number of packages for advanced compositional analyses of genes and associated pathways [94, 109, 218, 219], statistical analysis [220, 221], and visualization of functional and taxonomic data. This collection supports some data integration by design and via directed tutorials, but each software package is fundamentally independent, and a significant amount of setup and integration is required for a more comprehensive analysis. Other developments include upgrades to the Joint Genome Institute’s IMG/M system, featuring a number of incremental improvements to comparative, functional, annotation, and data export features [136]. The system is still closed source and does not allow the use of computational resources outside of its own environment. Kraken, a new taxonomic binning software, assigns taxonomic labels to short-read metagenomic DNA sequences with high speed and accuracy [222].

Despite these developments, MetaPathways still holds a unique position in the analysis of environmental sequence information. It presents a transparent, modular, and integrative pipeline of software tools that process raw environmental sequences into integrative data products compatible with a number of downstream analyses.
MetaPathways represents the first method for predicting MetaCyc pathways from environmental sequence information using Pathway Tools, giving it an interpretive advantage over other pathway prediction methods by tightly integrating annotations and pathways in environmental pathway/genome databases. The MetaPathways v2.0 master-worker algorithm is to date the only software that externalizes large environmental homology search tasks to an external ad hoc distributed network of computational grids, and the GUI with its ‘Knowledge Engine’ data structure is the only software that attempts to accommodate large-scale interactive comparisons of multi-omic sequence data. Finally, LCA* is unique in its use of voting and information theory results to provide a theoretical grounding for taxonomically classifying multi-omic contigs from natural and engineered ecosystems.

Since the initial release of MetaPathways in January 2013 there has been significant community interest and adoption. In August 2014, MetaPathways was featured as a software tutorial in the Strategies and Techniques for Analyzing Microbial Population Structure (STAMPS 2014) course at the Marine Biological Laboratory, Woods Hole, MA (http://stamps.mbl.edu). There has also been a significant amount of online activity according to statistics provided by Altmetric (http://www.altmetric.com). As of January 2015 the BMC publication has achieved ‘highly-accessed’ status, being accessed 6,985 times, putting it in the 94th percentile of all articles in BMC Bioinformatics. Metabolic pathways for the whole community is also ‘highly-accessed’ with 2,170 downloads since July 2014, putting it in the 87th percentile of all articles in BMC Genomics. According to GitHub statistics, the MetaPathways software has been downloaded 810 times since January 2013.
A number of research institutes have also adopted MetaPathways for internal use: the Woods Hole Oceanographic Institution (WHOI) and Pacific Northwest National Laboratory (PNNL) in Richland, WA have custom versions to utilize their individual HPC environments.

6.3 Improvements to the MetaPathways pipeline

While MetaPathways offers a cohesive analysis pipeline for environmental sequence information, there are a number of computational and analytical aspects that could be expanded or further improved upon. Indeed, a pernicious and challenging task that is notably absent from the current pipeline is an automated de novo sequence assembly step. A potential approach could be to automatically generate and compare a collection of assemblies by a series of order and length statistics, empirically selecting the ‘best’. Moreover, the PathoLogic pathway prediction algorithm implemented in Pathway Tools needs tailoring for multi-omic samples; improvements to prediction hazards, algorithm transparency, and computational efficiency are needed to enhance its performance and interpretability on environmental sequence information. Chapter 3 demonstrated that taxonomic patterns of inter-pathway complementarity in predicted pathways can potentially be used to detect distributed metabolic patterns that are interesting from both engineering and ecological perspectives. However, the current approach for detecting distributed patterns is manual and time-consuming, and needs an algorithmic method to scale to thousands of samples, taxa, and pathways. The master-worker distributed compute model introduced in Chapter 4 is heavily biased to homology-search problems and needs to be generalized for more heterogeneous compute tasks and worker grids. While Illumina’s sequencing-by-synthesis platform is currently the dominant sequencing paradigm, alternative technologies from Pacific Biosciences and Oxford Nanopore are now offering single-molecule, long-read solutions that open new analytical possibilities.
Considerations need to be made for MetaPathways to accommodate these new datasets, which are likely to increase both the quantity and quality of predicted genes. Finally, as sequencing capacity continues to increase unabated, stand-alone computational resources are struggling to host, process, and analyze such large unstructured datasets. Cloud computing technologies like Hadoop and NoSQL databases offer potential solutions for compute and data query, which future versions of MetaPathways should take advantage of to move into the cloud.

6.3.1 Automated assembly

Although the latest version of MetaPathways now accepts assembled contigs and raw reads as input, using a read-mapping measure to provide quantitative counts on a per-ORF basis [60, 223], a stand-alone de novo sequence assembly step is notably absent. Currently, there is much debate about the proper data cleaning, read-binning, and parameter settings necessary to achieve a quality metagenomic assembly, which makes automation difficult [26, 58, 224, 225]. Indeed, our experience assembling a number of multi-omic samples from a variety of natural and engineered ecosystems suggests that there is no one-size-fits-all solution, and that assembly quality can be dependent upon laboratory experimental procedures, environmental conditions, and underlying community complexity.

Although an assembly is best assessed in the presence of a reference, multi-omic samples often do not have this luxury. However, there are a number of assembly statistics, such as number of contigs, total assembly length, number of predicted genes, and length order-statistics like Nx (the length of the contig at which, going through the contigs ranked from longest to shortest, the cumulative length first reaches x percent of the total assembly length), that can provide an empirical measure of assembly quality [59, 226].
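As a rough illustration of the Nx order statistic described above, and of how a set of Nx values could be summarized by an area under the curve to rank candidate assemblies, a minimal Python sketch follows. The function names (nx_statistic, nx_curve, nx_auc, rank_assemblies), the 5-to-95 grid of x values, and the use of trapezoidal integration are assumptions made for illustration only; they are not part of MetaPathways or of any published ranking procedure.

```python
from typing import Dict, Iterable, List, Sequence, Tuple


def nx_statistic(contig_lengths: Iterable[int], x: float) -> int:
    """Return Nx: the length of the contig at which the cumulative length of
    contigs, ranked longest first, first reaches x percent of the total."""
    if not 0 < x <= 100:
        raise ValueError("x must be in the interval (0, 100]")
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("assembly contains no contigs")
    threshold = sum(lengths) * x / 100.0
    running = 0
    for length in lengths:
        running += length
        if running >= threshold:
            return length
    return lengths[-1]


def nx_curve(contig_lengths: Sequence[int],
             xs: Iterable[int] = range(5, 100, 5)) -> List[Tuple[int, int]]:
    """Evaluate Nx on a grid of x values (hypothetical default: 5, 10, ..., 95)."""
    return [(x, nx_statistic(contig_lengths, x)) for x in xs]


def nx_auc(curve: Sequence[Tuple[float, float]]) -> float:
    """Trapezoidal area under a linearly interpolated Nx curve."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(curve, curve[1:]))


def rank_assemblies(assemblies: Dict[str, Sequence[int]]) -> List[str]:
    """Order assembly names from largest to smallest Nx AUC."""
    scores = {name: nx_auc(nx_curve(lengths))
              for name, lengths in assemblies.items()}
    return sorted(scores, key=scores.get, reverse=True)
```

In practice the AUC would be only one input among several (for example N50, N90, and total length), and any such ranking would need to be validated against manual assembly practice before being trusted.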
Although current methods of assembly are messy affairs with many parameters to tune and optimize, in an effort to press forward, it is possible to generate a set of assemblies (at some computational expense) at a variety of parameters, and proceed with the ‘best’ based on assembly statistical measures.

One interesting way of ranking assemblies is to adopt a method inspired by the de novo assembly competition Assemblathon 2 and the ideas of Mende et al. [59, 226], which recommend interpolation of a series of assembly order statistics and the use of an area under the curve (AUC) measure to assess their relative quality (Figure 6.1). AUC in combination with other statistics like N50, N90, and total length could be used to create an ordination function by way of some linear combination or decision tree method. Although the approach is intuitive, performance studies relative to manual assembly practice would have to be carried out by cross validation to see if it is effective; the ranking method may have implementation issues that are not currently obvious. Running multiple assembly programs is a computationally challenging task, but until a more optimal solution is proposed, this framework will provide an empirical solution to a currently messy optimization problem.

Figure 6.1: Nx assembly plots. Nx order statistics are a measure of the overall contig length distribution of a sequence assembly. When organized and interpolated, with each colored line representing a set of Nx statistics from an assembly, such assembly plots provide an empirical way of comparing the relative quality of multiple assemblies by inspection or through an area under the curve (AUC) measure. Curves with larger AUC generally correlate with ‘better’ assemblies. (x-axis: Nx statistic; y-axis: length.)

6.3.2 Pathway prediction and ePGDB interpretation

While Pathway Tools and its PathoLogic algorithm provide a biochemical rule-based method for the prediction of metabolic pathways, Chapter 3 highlights instances where predicted pathways from environmental sequence information present a number of interpretive hazards. Taxonomic hazards can be flagged by the integration of the weighted taxonomic distance (WTD), an algorithm to classify pathways that are predicted widely outside their taxonomic range; however, a number of other hazards could be addressed by improving the PathoLogic pathway prediction algorithm. Multi-functional enzymes that map to multiple pathways could have their counts scaled back to prevent ‘double counting’, preserving the total number of annotations in a sample; broad EC annotations could be interpreted similarly. Pathway variants and reversible reactions are more challenging problems. Here taxonomic and environmental signals from annotations need to be incorporated to improve accuracy, but exactly how to do this is non-obvious. While pathway variants have taxonomic information which can be incorporated into the WTD approach, reversible reactions are dependent on a number of biochemical equilibrium principles (e.g., Le Chatelier’s principle, environmental pH, and temperature), which are nontrivial to integrate into a unified model. Manual interpretation of predicted pathways could be improved if PathoLogic provided a more transparent report on which prediction rules were engaged, removing a significant amount of the guesswork that currently occurs. Additionally, Pathway Tools’ -omics viewer for providing summarizing histograms, bar plots, and scatter plots on pathways should be expanded to incorporate taxonomic and other categorical environmental metadata, allowing integrative analyses like those demonstrated in Chapter 3 to be more routine.

It would also be worthwhile to implement and compare alternative pathway prediction algorithms to PathoLogic.
The latest version of HUMAnN and the MinPath method is apparently being implemented on the MetaCyc framework, which should allow for a direct comparison of the two methods on the same hierarchy within ePGDBs (personal communication with Curtis Huttenhower). Additionally, moving beyond presence/absence to models that predict pathways from a probabilistic perspective would be useful in partially sampled multi-omic datasets [227]. There is still much debate on how to properly normalize functional abundance in multi-omic datasets with respect to genes and pathways, and a number of methods have been suggested based on gene family abundance, average gene size, or subsampling approaches that all attempt to statistically address unbalanced sampling depth [228, 229]. Generalized linear models with a Poisson canonical logarithmic link function have also been proposed to model significantly divergent pathways with relevant binomial and hypergeometric hypothesis tests [230]. Moreover, there are also methods that propose isolating significant subnetworks of metabolic pathways between samples [231, 232]. Finally, network reliability methods could model the reliability of a particular path between an important pair of metabolites (source and terminal) in the event particular metabolites or enzymes are removed from the metabolic network [233]. Properly fitted, these models would add a statistical measure of confidence to predicted pathways beyond the purely taxonomic one currently supported by the WTD and PathoLogic.

6.3.3 Distributed metabolism

Chapter 3 demonstrated the capacity of MetaPathways to detect potential instances of distributed metabolism through patterns of inter-pathway complementarity across predicted metabolic pathways in the known symbionts Candidatus Moranella endobia and Candidatus Tremblaya princeps.
Detecting these patterns is interesting from both metabolic engineering and community ecology perspectives. While synthetic biology and metabolic engineering of individual organisms have enabled affordable production of anti-malarial drugs [234], drug and gene delivery services [235, 236], and advanced insight into the workings of the cell [237], engineered single-organism systems are limited to the simple mechanisms of synthetic oscillators and switches [238]. Moreover, the reliability of such systems is hampered by natural variability and evolutionary processes, preventing the production of reliable biological machines [239]. More complex tasks like the degradation of complex organic molecules are generally thought to outstrip the metabolic breadth of any individual cell, directing interest to engineering microbial consortia that share metabolic pathways to accomplish complex metabolic goals [240, 241]. Patterns of distributed metabolism could also be indicative of community-level metabolic mutualism in environmental samples, potentially making the community more efficient through the effective use of metabolites; however, such interpretations must be cautious in light of the importance of habitat filtering and competition that classically define species interactions and community structures.

Microbial community diversity is thought to hold the key to stable engineered biological systems with novel metabolic functions [242]. The metabolic breadth of communities is positively correlated with resilience to invasion and with the ability to weather periods of nutrient limitation through the increased activity of microbial minorities and metabolic sharing of downstream metabolites [243].
This being said, consortia add a layer of complexity onto existing engineering challenges. Engineered communities often have unpredictable long-term homeostasis, extinctive behaviour, and rampant horizontal gene transfer [244], making the potential fine-tuning or optimization of biological processes a considerable challenge [242]. However, engineering consortia directly taps into the 99% of uncultivable microbial dark matter in vivo, which often has extremely desirable metabolic outcomes if it could be controlled [241].

From an ecological perspective, finding true instances of distributed metabolism in environmental datasets is a challenging proposition. First, multi-omic datasets in many cases represent only a small fraction of the community, meaning observed distributed patterns could merely be due to undersampling. Second, the ability of two organisms to share metabolites is dependent on the presence of appropriate cellular transport enzymes to facilitate metabolite exchange; thus detection of inter-pathway complementarity is not a sufficient condition for distributed metabolism. Moreover, the concept of sharing public good metabolites to reduce community metabolic requirements, the so-called Black Queen Hypothesis [175], is challenged by classical evolutionary theory, which argues that environmental filtering and competition are the main driving forces of community structure [245, 246]. As such, ‘cheaters’, community members that simply consume public good metabolites, should eventually outcompete cooperators that do not [247]. This creates a kind of ecological prisoner’s dilemma; while the optimal use of resources may be to share metabolites and cooperate, such open environmental systems are at risk from the introduction of cheaters that will disrupt cooperators by defecting. This model of course assumes that all community members are relying on the same resource [175].
Nonetheless, instances of shared metabolism observed to form in co-cultures of multiple isolates suggest that increased metabolic diversity serves to minimize negative competitive interactions, and as a consequence the community as a whole utilizes resources more effectively [174]. This suggests that communities can competitively re-equalize based on a diverse profile of metabolic needs (i.e., inhabit different metabolic “niches”), or that environments are complex and dynamic enough to offer a variety of temporal and spatial phenotypic optima [248]. However, accurate rates of evolution in individual community members are difficult to estimate from meta-omic samples as they generally lack a connection between observed metabolism and taxonomy, something that may be resolved as single-cell sequencing methods improve to track individual genomes over time, leading to a more general theory of evolution that encompasses complex microbial communities.

A simple in silico experiment to mine for distributed metabolic patterns would be to perform pair-wise comparisons of cultured genomes from the American Type Culture Collection (ATCC) (http://www.atcc.org/) and the German Collection of Microorganisms and Cell Cultures (DSMZ) (http://www.dsmz.de/). However, the approach for detecting distributed pathways used in MetaPathways is tedious and time-consuming with collections of even a handful of genomes. Therefore it would be desirable to have a measure for the taxonomic ‘distributiveness’ of a predicted pathway. Such a measure could be formed as an integer optimization problem (Figure 6.2), enabling the millions of calculations necessary to calculate all pair-wise patterns of inter-pathway complementarity in simulated co-cultures. Another issue when applying this measure to environmental sequence information is partial taxonomy; a pathway may receive a high ‘distributiveness’ score merely because of differences in taxonomic resolution, which is analytically misleading, so an algorithm should also be developed to detect independent lineages from observed phylogenies upstream of solving for distributiveness.

[Figure 6.2 diagram: taxa a, b, c, d linked to the reactions (rxns) r1, r2, r3 they annotate; r1 is annotated in a, r2 in b and c, r3 in a and d.]

\[
\begin{aligned}
\text{Minimize} \quad & X_a + X_b + X_c + X_d \\
\text{subject to} \quad & r_1:\; X_a \ge 1 \\
& r_2:\; X_b + X_c \ge 1 \\
& r_3:\; X_a + X_d \ge 1 \\
& X_a, X_b, X_c, X_d \in \{0, 1\}
\end{aligned}
\]

Figure 6.2: An integer optimization form for distributed metabolism. A ‘distributiveness’ measure, detecting the minimum number of taxa required to complete a pathway, could be described as an integer optimization problem. In the above example, taxa a, b, c, and d each have annotations for reactions in the pathway. Notice that only two taxa, either (a,b) or (a,c), are required to complete the pathway, giving the pathway a ‘distributiveness’ score of 2.

6.3.4 Improvements to the master-worker model

While the current master-worker model for distributing parallelizable compute tasks provides a heuristic method for distributing computationally heavy tasks, its current assumptions restrict its applicability to uniform-size, embarrassingly-parallel tasks, and it could be generalized to more heterogeneous compute tasks and worker grids. The current model is heavily based around sequence homology-search problems, and it would be worthwhile to generalize this implementation to accept any task that provides a splitting function, a task function, and a combining function, expanding the model to work with other bioinformatics tasks like Maximum Likelihood tree-building, assembly, read-mapping, or any task that can be applied on a per-sample basis. Moreover, the current model does not take into account diverse sets of workers with specialized interfaces like MPI, co-processors, or GPUs that should be more optimal for certain software. Finally, the Broker’s prioritization function could be improved to incorporate a cost function if certain compute worker nodes incurred a monetary cost, as is the case with AWS nodes.
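To make the idea of a task defined by a splitting function, a task function, and a combining function more concrete, a minimal Python sketch is given here. It is not the MetaPathways Broker: the names run_distributed, split_batches, count_gc, and combine_gc are hypothetical, and a local thread pool stands in for the remote worker grids purely for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List, Sequence, Tuple, TypeVar

Job = TypeVar("Job")
Chunk = TypeVar("Chunk")
Partial = TypeVar("Partial")
Result = TypeVar("Result")


def run_distributed(job: Job,
                    split: Callable[[Job], Iterable[Chunk]],
                    task: Callable[[Chunk], Partial],
                    combine: Callable[[List[Partial]], Result],
                    max_workers: int = 4) -> Result:
    """Split a job into chunks, run the task on each chunk in parallel,
    and combine the partial results. Threads stand in for worker grids."""
    chunks = list(split(job))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map returns results in the same order as the input chunks.
        partials = list(pool.map(task, chunks))
    return combine(partials)


# Illustrative task: overall GC fraction of a set of sequences.
def split_batches(seqs: Sequence[str], size: int = 2) -> List[Sequence[str]]:
    """Splitting function: cut the sequence list into fixed-size batches."""
    return [seqs[i:i + size] for i in range(0, len(seqs), size)]


def count_gc(batch: Sequence[str]) -> Tuple[int, int]:
    """Task function: (number of G/C bases, total bases) for one batch."""
    gc = sum(s.count("G") + s.count("C") for s in batch)
    total = sum(len(s) for s in batch)
    return gc, total


def combine_gc(partials: List[Tuple[int, int]]) -> float:
    """Combining function: merge per-batch counts into one GC fraction."""
    gc = sum(p[0] for p in partials)
    total = sum(p[1] for p in partials)
    return gc / total if total else 0.0


gc_fraction = run_distributed(["GGCC", "ATAT", "GCAT"],
                              split_batches, count_gc, combine_gc)
```

Under this interface a homology search, a read-mapping run, or a per-sample analysis would differ only in the three functions supplied; scheduling across heterogeneous or costed workers is a separate concern that this sketch does not address.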
From a theoretical perspective, job transfers could also be viewed as a non-cooperative normal-form game between the Broker and the collection of worker nodes. Here, certain Nash equilibria might reveal principled conditions for distributing and migrating tasks among workers.

6.3.5 Sequencing technologies on the horizon: Illumina’s NextSeq, PacBio, and Oxford Nanopore

The high demand for low-cost sequencing has driven a decade of innovation in sequencing platforms that parallelize the process (see Chapter 1 for a review). At present, Illumina’s short-read sequencing-by-synthesis approach is the dominant paradigm, accounting for more than 90% of all sequencing worldwide in 2014 (Presentation: Illumina BaseSpace WWDC 2014). Its latest desktop sequencer, the Illumina NextSeq, is poised to provide the long-sought-after $1000 human genome, potentially marking the beginning of large-scale personalized medicine. While these massively parallel short-read sequencing approaches have democratized sequencing technology and caused an explosion of sequencing projects, they are humbled by the assembly of repetitive regions of complex genomes and the de novo assembly of environmental multi-omic datasets, limiting their effectiveness in sequencing microbial communities or plant genomes.

This analytical intractability of repetitive and complex samples has led to a call for more affordable longer reads [249]. Illumina’s recent purchase of Moleculo is primarily for the purpose of offering synthetic long-read sequencing. It works through an upstream library preparation procedure that barcodes long strands of DNA prior to shearing, allowing for the downstream informatic barcode-stitching of shorter reads into synthetic 10–20 kb long reads.
Pacific Biosciences and Oxford Nanopore (ONP) are at this point alternative sequencing platforms that are trying to deliver on the promise of single-molecule, high-throughput, affordable long reads that would ease the assembly of repetitive regions and avoid perennial amplification biases. Pacific Biosciences, though comparatively more expensive in terms of throughput with respect to Illumina, boasts an average read length of 5–20 kb, and has been used as a way to finish a number of complex microbial genomes [250]. ONP, a relative newcomer to the sequencing field, is poised to launch its novel nanopore biosensor sequencing platforms: the MinION, the first portable USB-based sequencing platform, and the larger GridION, which boasts real-time sequencing of reads up to 20k nucleotides long. ONP is unique in that it offers early access to its MinION device via its MinION Access Programme (MAP), which allows scientists to essentially beta-test the platform for a nominal fee and an agreement to share the sequencing data. Recent publications highlight the MinION’s applicability to generate scaffold sequences to improve the assembly of bacterial genomes [42], while others have raised reservations about its sequencing error [42, 43]. The Hallam Lab is a member of the 2015 MAP program, and it will be interesting to see the application of this new technology to natural and engineered ecosystems, and the adaptation of MetaPathways to accept ONP read data.

In the context of MetaPathways, the new sequencing technologies are all pushing for longer reads and higher throughput, which will likely improve de novo metagenomic assembly and provide better opportunities for ORF prediction and annotation. MetaPathways will have to be modified to integrate hybrid assemblers that can assemble reads from multiple sequencing platforms or accept longer reads as so-called ‘trusted contigs’. Longer reads should translate into better quality ORFs to annotate.
MetaPathways should handle this increase well as it is built to annotate at scale through out-of-core processing, high-performance data structures, memory management techniques, and the discussed master-worker model for managing computational resources. Further annotation efficiency in general could be achieved through the development and adoption of sequence database clustering techniques and more efficient homology-search algorithms [76, 198]. However, two analytical bottlenecks are anticipated: the creation of ePGDBs via the PathoLogic algorithm, which is quickly becoming the longest analytical step in the pipeline, and the point at which an individual sample does not fit on an individual hard drive (approximately 4–6 TB at time of writing). The ePGDB bottleneck can likely be dealt with through algorithmic improvements or approximations to the current implementation of PathoLogic to better handle large annotation sets, or with the adoption of alternative pathway prediction methods. The second, where an individual sample cannot fit on a conventional hard drive, is a more pernicious problem; it is the point where a stand-alone machine will be unable to process a sample independently, necessitating the move to high-performance distributed file systems and cloud-compute infrastructures. As sequencing projects often have multiple samples, this hard drive limitation will likely be experienced long before the one-sample-per-hard-drive situation, meaning the transition to the cloud or grid will likely be sooner than expected if the exponential rise in next-generation sequencing capacity continues unabated.

6.3.6 A future in the cloud

With the rise in popularity and availability of cloud-based supercomputing services, distributed computing frameworks like Hadoop, its distributed file system (HDFS), and columnar key-payload NoSQL-based databases [131, 251, 252] are becoming increasingly important in the areas of business analytics, web applications, and finance.
However, this framework is largely unadopted in the area of bioinformatics [253]. When sequencing outstrips the capacity of individual hard drives, it will necessitate a move to distributed file systems, in which case moving MetaPathways to the cloud will be the inevitable outcome. In terms of software distribution, a web-based solution has a great advantage in that, by operating as an independent web service, it avoids the annoying system dependencies that plague much research-based software [254]. Assisting development is the fact that many services are adopting operating-system-level virtualization software like Docker (https://www.docker.com/). Docker provides a compartmentalized virtualization layer that handles system dependencies and assists with deployment of software on Linux-based operating systems, without the typical overhead of containing the whole operating system like existing virtual machine images. Moreover, the demand for web-based analytics has pushed for the development of JavaScript-based visualization frameworks like d3.js (http://www.d3js.org/), and a centralized repository of multi-omic data should make large-scale comparative analyses more routine (Figure 6.3). However, these visualization frameworks are limited by the capacity of in-browser memory and Internet bandwidth for real-time handling of data, and typically do not scale to millions of data points.

While the horizontal scaling of computational resources is a powerful paradigm, it is not without its own challenges. Hadoop and NoSQL databases are built around effective data query and batch processing tasks, but there are still significant implementation challenges to translate common bioinformatics algorithms like homology-search, tree-building, and read-mapping into MapReduce. In the case of homology search, the need of every node to access large reference databases creates a large ‘side data’ distribution problem, which cannot be handled by Hadoop and its YARN resource allocator [255].
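One way to picture the side-data problem is a worker-local cache that keeps recently used pieces of the reference database close to the computation. The following is a minimal sketch, assuming the reference database can be split into chunks addressed by an identifier and that a hypothetical `fetch_chunk` callable performs the expensive remote read; neither is part of Hadoop, HDFS, or MetaPathways. Eviction follows a least-recently-used policy.

```python
from collections import OrderedDict


class LRUChunkCache:
    """Keep at most `capacity` database chunks, evicting the least recently used."""

    def __init__(self, capacity, fetch_chunk):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.fetch_chunk = fetch_chunk  # hypothetical remote read, e.g. from a distributed file system
        self._store = OrderedDict()     # insertion order doubles as recency order
        self.hits = 0
        self.misses = 0

    def get(self, chunk_id):
        if chunk_id in self._store:
            self._store.move_to_end(chunk_id)  # mark as most recently used
            self.hits += 1
            return self._store[chunk_id]
        self.misses += 1
        data = self.fetch_chunk(chunk_id)      # expensive path: goes over the network
        self._store[chunk_id] = data
        if len(self._store) > self.capacity:
            self._store.popitem(last=False)    # drop the least recently used chunk
        return data


# Toy usage: record which chunks had to be fetched remotely.
remote_reads = []


def fake_fetch(chunk_id):
    remote_reads.append(chunk_id)
    return "sequences-for-" + chunk_id


cache = LRUChunkCache(capacity=2, fetch_chunk=fake_fetch)
for chunk in ["a", "b", "a", "c", "b"]:
    cache.get(chunk)
```

In this toy run only the second request for chunk `a` is served locally; the benefit of such a cache depends entirely on how often consecutive homology-search tasks on a node touch the same part of the reference database.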
Thus there is a dire need for a least-recently used (LRU) cache to reduce communication costs associated with the Hadoop distributed file system (HDFS) [132].

Recently, Illumina has introduced a centralized cloud-based environment for sequencing storage, management and analysis, BaseSpace (https://basespace.illumina.com). It utilizes an ‘internet-of-things’ model to collect sequencing information from its HiSeq and MiSeq sequencers that have been configured for the batch uploading of sequencing data into BaseSpace. Once uploaded, sequences can be analyzed by analytical software in an ‘Apple-like’ App Store that scales AWS nodes on demand to run using the Docker framework. Results are provided through web-based reports and processed results can be downloaded locally for downstream analyses. In 2014 more than 50% of sequences produced on Illumina machines were uploaded to BaseSpace, making it easily the largest cloud-based bioinformatics environment. Currently, BaseSpace does not accommodate large reference databases necessary for large-scale annotation of environmental sequence information, but this is a temporary issue that developers are aiming to solve. BaseSpace looks to be a promising cloud-based environment for large-scale global comparisons of environmental sequence information.

6.4 Future developments

The analysis of environmental sequence information is rapidly evolving by way of novel technologies and multi-omic experiments enabling their use in interesting applications across a wide variety of fields. One area in particular where there is a lot of excitement is in the analysis of human microbiomes for their potential relevance to human health. While the field and its applications are still developing, human microbiome samples, though complex, are comparatively more analytically tractable than typical aquatic or terrestrial environments with current levels of sequencing depth and assembly.
Single-cell genome sequencing is a developing technology that enables the sequencing of genetic material from isolated cells. Throughput and bias issues currently limit its sampling depth compared to current multi-omic techniques, but it represents an analytical advantage by directly relating annotations to a particular cell with a defined taxonomy, instead of relying on marker genes or statistical techniques.

Figure 6.3: Global comparative multi-omic analysis. MetaPathways scales to large sequence samples, enabling comparative multi-omic analysis on the global scale. Here sampling locations from a selection of IMG/M and MG-RAST multi-omic samples are displayed, colored by their general environmental source, i.e., terrestrial (green), aquatic (blue), air (grey), and scaled by number of nucleotide bases sequenced. Plot generated with d3.js (http://bl.ocks.org/nielshanson/23bd4b2ecf9d44ba1a8f).

6.4.1 Human microbiome

Microbiomes of the body (gastrointestinal tract, skin, etc.) now sit alongside environment and genetics as potential causes of human illness [256]. While changing one’s genetics is very difficult, modifying the microbiome is comparatively straightforward. Both broad and specific antibiotics can drastically alter one’s microbiome, decreasing or eliminating large portions of its microbial community, and effectively resetting the microbiome to a state from which it can be regrown or replaced in a controlled manner. In addition, probiotic foods and supplements containing live bacteria, or prebiotic sugars that stimulate the growth of particular commensal bacteria, could potentially be used to promote beneficial or corrective health outcomes [257]. In contrast, genome-wide association studies (GWAS) and whole-genome sequencing (WGS) experiments look for loci that are significantly associated with a health outcome, and must be followed up with functional studies to investigate biological pathways involved in the disease.
After a putative target has been found, a drug must be identified to affect the identified pathway; altogether a process that typically takes years from discovery to implementation. Microbiome studies look to have similar, but possibly more straightforward, translational pathways where a combination of antibiotic, prebiotic, and probiotic treatments could be readily applied to clinical trial patients without the time-consuming drug discovery step.

The International Human Microbiome Consortium (IHMC), and in particular two major human microbiome studies, the European MetaHIT consortium [258] and the National Institutes of Health (NIH) Human Microbiome Project (HMP) [259], have provided an effective baseline of taxonomy and function for the human microbiome [14]. While there are still claims of vast protein sets of unknown function in microbiome samples, the portion of sequences that have no database hit is less than 30%, which when compared to the proportion of unknown multi-omic annotations from aquatic (40–60%) and soil (90%) samples is very encouraging. In fact, the presence of a reasonably comprehensive reference has led to predictive functional techniques like Phylogenetic Investigation of Communities by Reconstruction of Unobserved States (PICRUSt) [109], which takes 16S taxonomic abundance and infers a predicted functional profile of COG and KEGG annotations. Human microbiome samples have shown that the predicted annotation profile is comparable to those provided by low-coverage metagenomes, which potentially allows for affordable functional profiles to be generated from less expensive amplicon tag sequencing approaches.
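The core arithmetic behind this kind of 16S-based functional prediction can be sketched in a few lines. This is a simplified illustration and not the PICRUSt implementation: it assumes that a gene-family content table and a 16S copy number are already available for every taxon (the ancestral-state reconstruction that would supply them for unsequenced taxa is omitted), and all taxon names, annotation identifiers, and counts are hypothetical.

```python
def predict_functional_profile(taxon_abundance, copy_number_16s, gene_content):
    """Weight each taxon's gene-family counts by its copy-number-normalized abundance."""
    profile = {}
    for taxon, count in taxon_abundance.items():
        # Divide by the 16S copy number so multi-copy taxa are not over-counted.
        normalized = count / copy_number_16s[taxon]
        for family, copies in gene_content[taxon].items():
            profile[family] = profile.get(family, 0.0) + normalized * copies
    return profile


# Hypothetical inputs for one sample.
taxon_abundance = {"taxonA": 10, "taxonB": 6}   # 16S read counts per taxon
copy_number_16s = {"taxonA": 2, "taxonB": 3}    # 16S gene copies per genome
gene_content = {                                # gene-family copies per genome
    "taxonA": {"K00001": 1, "K00002": 2},
    "taxonB": {"K00002": 1},
}

profile = predict_functional_profile(taxon_abundance, copy_number_16s, gene_content)
```

With these toy numbers, taxonA contributes 5 genome equivalents and taxonB contributes 2, so the predicted profile is 5 copies of `K00001` and 12 copies of `K00002`. The quality of such a prediction rests on how well the reference genomes represent the sampled community, which is why the approach is better suited to human microbiome samples than to soil.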
This functional groundwork suggests that technical sequencing, assembly, and annotation resources are becoming adequate for deep investigation of human microbiome samples, and that interesting and clinically relevant discoveries are on the horizon.

6.4.2 Single-cell sequencing

While culture-independent sampling techniques and next-generation sequencing technologies have enabled the inquiry of microbial community structure (i.e., ‘who’s there?’) and function (i.e., ‘what are they doing?’), advances in sample library preparation allow for representative sequencing from a handful of individual cells [260–262]. Improvements in fluorescence-activated cell sorting (FACS) and novel microfluidic devices are allowing high-throughput cell isolation, DNA amplification, and sequencing experiments to be performed on a previously unimagined scale [263–267]. Widespread in situ cell sequencing has many exciting applications including directly linking taxonomic and functional information (i.e., ‘who’s doing what?’) [264], phylogenetic analysis of tumor heterogeneity and microevolution for diagnostic typing [268, 269], and characterization of the earliest human embryogenesis events [270].

However, while the thought of large-scale environmental single-cell sequencing is grand, there is still significant progress to be made to bring single-cell sequencing to the analytical level of multi-omic experiments. State-of-the-art single-cell sequencing methods are practically limited in terms of sampling depth and throughput, and have a number of technical issues related to contamination suppression, amplification bias, and sequence assembly that limit their application [261, 271]. Current environmental single-cell sequencing experiments are only practical on hundreds of cells from complex environments, far below the thousands necessary to sample the equivalent depth of plurality sequencing [215].
The small amount of starting material required for single-cell experiments compounds the effect of any undesired contaminant sequence in the sample, lab, or reagents, requiring the use of clean rooms and extremely sterile techniques. A related issue is multiple-displacement amplification (MDA) bias, which causes single-cell genomes to have highly variable sequence coverage, between 20–80%, meaning much of the sampled genome after amplification is simply not available for sequencing [263, 272, 273]. As for sequence assembly, standard assumptions of Poisson-style coverage and rare chimeric sequences are violated with single-cell sequences, currently making high-quality assemblies a challenge [274].

Figure 6.4: Binding metagenomes to single-cells. Until amplification biases improve, metagenomic reads can be binned to their closest single cell, exploiting clusters of ‘genomic signatures’ in a reduced dimensional space. Once in the reduced space, a classification model like Gaussian mixtures or linear-discriminant analysis can be fit for each of the single-cells and metagenomic points (grey) can be classified to their nearest single-cell genome. (Plot axes: Component 1 and Component 2.)

However, while these issues of bias, coverage, and assembly are impediments, there is also tremendous potential for hybrid experimental approaches that combine both single-cell and multi-omic data [275], exploiting shared statistical properties of similar genomic signatures (Figure 6.4). The binning of metagenomic information to single-cell genomes also brings valuable taxonomic information to metabolic reconstruction efforts.
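The signature-based binning idea can be made concrete with a small sketch. It is a deliberate simplification of what Figure 6.4 describes: the genomic signature is taken to be a normalized k-mer frequency vector, the dimensionality reduction step is skipped, and a plain nearest-neighbor rule in the full k-mer space stands in for the Gaussian mixture or linear-discriminant classifiers. The sequences, names, and the choice of k are hypothetical.

```python
from itertools import product
from math import dist


def kmer_signature(sequence, k=2):
    """Normalized k-mer frequency vector over the alphabet ACGT."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        if kmer in counts:  # skip k-mers containing ambiguous bases
            counts[kmer] += 1
            total += 1
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[m] / total for m in kmers]


def bin_reads(reads, single_cell_genomes, k=2):
    """Assign each read to the single-cell genome with the closest signature."""
    references = {name: kmer_signature(seq, k)
                  for name, seq in single_cell_genomes.items()}
    assignments = {}
    for read_id, seq in reads.items():
        signature = kmer_signature(seq, k)
        assignments[read_id] = min(
            references, key=lambda name: dist(signature, references[name]))
    return assignments


# Hypothetical toy data: one AT-rich and one GC-rich single-cell genome.
single_cell_genomes = {
    "cell_AT": "ATATATATTTAATATAAT",
    "cell_GC": "GCGCGGCCGCGGGCCGCG",
}
reads = {"r1": "ATTATAATAT", "r2": "GGCGCCGCGG"}

assignments = bin_reads(reads, single_cell_genomes)
```

On real data, tetranucleotide signatures (k = 4) over longer contigs are more informative than the dinucleotides used here, and a probabilistic classifier would also allow reads that resemble no single-cell genome to be left unassigned.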
As single-cell reference genomes and amplification bias improve, the proportion of environmental sequence information that can be accurately binned will naturally increase, and bridge the gap between environmental taxonomy and function.

6.5 Closing

The increased throughput and lowered cost of next-generation sequencing technologies are driving an explosion of multi-omic sequencing, generating much excitement across many potential applications, including environmental monitoring and remediation, personalized medicine, and bioengineering. The ability to generate large datasets from complex ecosystems makes for an incredibly challenging analysis that integrates a large variety of information. MetaPathways endeavors to streamline the processing, integration, and analyses of these multi-omic datasets by providing researchers with a transparent framework of inquiry and data products that can be used for downstream analysis. As technologies and experimental techniques continuously develop and transition from sequences to proteins and metabolites, MetaPathways itself must also change and update its integration to accommodate these new methods.

Bibliography

[1] Van H T Pham and Jaisoo Kim. Cultivation of unculturable soil bacteria. Trends in Biotechnology, 30(9):475–484, September 2012.
[2] Can Su, Liping Lei, Yanqing Duan, Ke-Qin Zhang, and Jinkui Yang. Culture-independent methods for studying environmental microorganisms: methods, application, and perspective. Appl. Microbiol. Biotechnol., 93(3):993–1003, February 2012.
[3] William B Whitman, David C Coleman, and William J Wiebe. Prokaryotes: the unseen majority. Proceedings of the National Academy of Sciences, 95(12):6578–6583, June 1998.
[4] Jo Handelsman. Metagenomics: application of genomics to uncultured microorganisms. Microbiol. Mol. Biol. Rev., 69(1):195–195, 2005.
[5] Christian S Riesenfeld, Patrick D Schloss, and Jo Handelsman. Metagenomics: genomic analysis of microbial communities. Annu. Rev.
Genet., 38:525–552, 2004.
[6] Shibu Yooseph, Granger Sutton, Douglas B Rusch, et al. The Sorcerer II Global Ocean Sampling expedition: expanding the universe of protein families. PLoS Biol, 5(3):e16, 2007.
[7] Patrick Lorenz and Jürgen Eck. Metagenomics and industrial applications. Nat. Rev. Microbiol., 3(6):510–516, June 2005.
[8] Miguel Alcalde, Manuel Ferrer, Francisco J Plou, and Antonio Ballesteros. Environmental biocatalysis: from remediation with enzymes to novel green processes. Trends in Biotechnology, 24(6):281–287, June 2006.
[9] Smriti Rayu, Dimitrios G Karpouzas, and Brajesh K Singh. Emerging technologies in bioremediation: constraints and opportunities. Biodegradation, 23(6):917–926, November 2012.
[10] Patrick Lorenz, Klaus Liebeton, Frank Niehaus, and Jürgen Eck. Screening for novel enzymes for biocatalytic processes: accessing the metagenome as a resource of novel functional sequence space. Current Opinion in Biotechnology, 13(6):572–577, 2002.
[11] Nancy Weiland, Carolin Löscher, Rebekka Metzger, and Ruth Schmitz. Construction and screening of marine metagenomic libraries. Methods Mol. Biol., 668:51–65, 2010.
[12] Takashi Mino and Hiroyasu Satoh. Wastewater genomics. Nat Biotechnol, 24(10):1229–1230, October 2006.
[13] J Gregory Caporaso, Christian L Lauber, Elizabeth K Costello, et al. Moving pictures of the human microbiome. Genome Biol, 12(5):R50, 2011.
[14] Human Microbiome Project Consortium. Structure, function and diversity of the healthy human microbiome. Nature, 486(7402):207–214, June 2012.
[15] Ilseung Cho and Martin J Blaser. The human microbiome: at the interface of health and disease. Nature Publishing Group, 13(4):260–270, April 2012.
[16] Edward F Delong, Christina M Preston, Tracy Mincer, et al. Community genomics among stratified microbial assemblages in the ocean’s interior. Science, 311(5760):496–503, January 2006.
[17] David A Walsh, Elena Zaikova, Charles G Howes, et al.
Metagenome of a versatile chemolithoautotroph from expanding oceanic dead zones. Science, 326(5952):578–582, 2009.
[18] Yanmei Shi, Gene W Tyson, John M Eppley, and Edward F Delong. Integrated metatranscriptomic and metagenomic analyses of stratified microbial assemblages in the open ocean. ISME J, 5(6):999–1013, June 2011.
[19] Noah Fierer and Robert B Jackson. The diversity and biogeography of soil bacterial communities. Proceedings of the National Academy of Sciences, 103(3):626–631, January 2006.
[20] Falk Warnecke, Peter Luginbühl, Natalia Ivanova, et al. Metagenomic and functional analysis of hindgut microbiota of a wood-feeding higher termite. Nature, 450(7169):560–565, November 2007.
[21] Alice Carolyn McHardy, Héctor García Martín, Aristotelis Tsirigos, Philip Hugenholtz, and Isidore Rigoutsos. Accurate phylogenetic classification of variable-length DNA fragments. Nat Meth, 4(1):63–72, January 2007.
[22] Li C Xia, Jacob A Cram, Ting Chen, Jed A Fuhrman, and Fengzhu Sun. Accurate genome relative abundance estimation based on shotgun metagenomic reads. PLoS ONE, 6(12):e27992, 2011.
[23] Anna Klindworth, Elmar Pruesse, Timmy Schweer, et al. Evaluation of general 16S ribosomal RNA gene PCR primers for classical and next-generation sequencing-based diversity studies. Nucleic Acids Research, 41(1):e1–e1, January 2013.
[24] Ben Temperton, Dawn Field, Anna Oliver, et al. Bias in assessments of marine microbial biodiversity in fosmid libraries as evaluated by pyrosequencing. ISME J, 3(7):792–796, July 2009.
[25] Mariette Jérôme, Céline Noirot, and Christophe Klopp. Assessment of replicate bias in 454 pyrosequencing and a multi-purpose read-filtering tool. BMC Res Notes, 4:149, 2011.
[26] Daniel R Mende, Alison S Waller, Shinichi Sunagawa, et al. Assessment of metagenomic assembly using simulated next generation sequencing data. PLoS ONE, 7(2):e31386, 2012.
[27] Francesca Finotello, Enrico Lavezzo, Paolo Fontana, et al.
Comparative analysis of algorithms for whole-genome assembly of pyrosequencing data. Brief. Bioinformatics, 13(3):269–280, May 2012.
[28] Camilla L Nesbø, Yan Boucher, Marlena Dlutek, and W Ford Doolittle. Lateral gene transfer and phylogenetic assignment of environmental fosmid clones. Environmental Microbiology, 7(12):2011–2026, December 2005.
[29] John C Wooley, Adam Godzik, and Iddo Friedberg. A primer on metagenomics. PLoS Comput Biol, 6(2):e1000667, February 2010.
[30] Weizhong Li, Limin Fu, Beifang Niu, Sitao Wu, and John Wooley. Ultrafast clustering algorithms for metagenomic sequence analysis. Brief. Bioinformatics, 13(6):656–668, November 2012.
[31] Torsten Thomas, Jack Gilbert, and Folker Meyer. Metagenomics - a guide from sampling to data analysis. Microbial Informatics and Experimentation, 2(1):3, 2012.
[32] Marcel Margulies, Michael Egholm, William E Altman, et al. Genome sequencing in microfabricated high-density picolitre reactors. Nature, 437(7057):376–380, July 2005.
[33] David R Bentley, Shankar Balasubramanian, Harold P Swerdlow, et al. Accurate whole human genome sequencing using reversible terminator chemistry. Nature, 456(7218):53–59, November 2008.
[34] Nicole Rusk. Torrents of sequence. Nat Meth, 8(1):44–44, December 2010.
[35] Lauren M Bragg, Glenn Stone, Margaret K Butler, Philip Hugenholtz, and Gene W Tyson. Shining a light on dark sequencing: characterising errors in Ion Torrent PGM data. PLoS Comput Biol, 9(4):e1003031, April 2013.
[36] William J Greenleaf and Steven M Block. Single-molecule, motion-based DNA sequencing using RNA polymerase. Science, 313(5788):801, August 2006.
[37] John J Kasianowicz, Eric Brandin, Daniel Branton, and David W Deamer. Characterization of individual polynucleotide molecules using a membrane channel. Proceedings of the National Academy of Sciences, 93(24):13770–13773, November 1996.
[38] James Clarke, Hai-Chen Wu, Lakmal Jayasinghe, et al. Continuous base identification for single-molecule nanopore DNA sequencing.
Nature Nanotech, 4(4):265–270, February 2009.
[39] J Li, D Stein, C McMullan, et al. Ion-beam sculpting at nanometre length scales. Nature, 412(6843):166–169, July 2001.
[40] Hagan Bayley. Sequencing single molecules of DNA. Curr Opin Chem Biol, 10(6):628–637, December 2006.
[41] David Stoddart, Giovanni Maglia, Ellina Mikhailova, Andrew J Heron, and Hagan Bayley. Multiple base-recognition sites in a biological nanopore: two heads are better than one. Angew. Chem. Int. Ed. Engl., 49(3):556–559, 2010.
[42] Philip M Ashton, Satheesh Nair, Tim Dallman, et al. MinION nanopore sequencing identifies the position and structure of a bacterial antibiotic resistance island. Nat Biotechnol, December 2014.
[43] Alexander S Mikheyev and Mandy M Y Tin. A first look at the Oxford Nanopore MinION sequencer. Mol Ecol Resour, 14(6):1097–1102, September 2014.
[44] Scott R Miller, Aaron L Strong, Kenneth L Jones, and Mark C Ungerer. Bar-coded pyrosequencing reveals shared bacterial community properties along the temperature gradients of two alkaline hot springs in Yellowstone National Park. Appl. Environ. Microbiol., 75(13):4565–4572, July 2009.
[45] Wei Xie, Fengping Wang, Lei Guo, et al. Comparative metagenomics of microbial communities inhabiting deep-sea hydrothermal vent chimneys with contrasting chemistries. ISME J, 5(3):414–426, October 2010.
[46] Vincent J Denef, Ryan S Mueller, and Jillian F Banfield. AMD biofilms: using model communities to study microbial evolution and ecological complexity in nature. ISME J, 4(5):599–610, May 2010.
[47] Brandon K Swan, Ben Tupper, Alexander Sczyrba, et al. Prevalent genome streamlining and latitudinal divergence of planktonic bacteria in the surface ocean. Proceedings of the National Academy of Sciences, June 2013.
[48] S Oh, A Caro-Quintero, D Tsementzi, et al. Metagenomic Insights into the Evolution, Function, and Complexity of the Planktonic Microbial Community of Lake Lanier, a Temperate Freshwater Ecosystem. Appl. Environ.
Microbiol., 77(17):6000–6011, August 2011.
[49] Maneesh Dave, Peter D R Higgins, Sumit Middha, and Kevin Rioux. The human gut microbiome: current knowledge, challenges, and future directions. Translational Research, pages 1–12, June 2012.
[50] Adina Chuang Howe, Janet K Jansson, Stephanie A Malfatti, et al. Tackling soil diversity with the assembly of large, complex metagenomes. Proceedings of the National Academy of Sciences, 111(13):4904–4909, April 2014.
[51] Matthew B Scholz, Chien-Chi Lo, and Patrick S G Chain. Next generation sequencing and bioinformatic bottlenecks: the current state of metagenomic data analysis. Current Opinion in Biotechnology, 23(1):9–15, February 2012.
[52] Sara El-Metwally, Taher Hamza, Magdi Zakaria, and Mohamed Helmy. Next-generation sequence assembly: four stages of data processing and computational challenges. PLoS Comput Biol, 9(12):e1003345, 2013.
[53] Xiao Yang, Sriram P Chockalingam, and Srinivas Aluru. A survey of error-correction methods for next-generation sequencing. Brief. Bioinformatics, 14(1):56–66, January 2013.
[54] Niranjan Nagarajan and Mihai Pop. Sequence assembly demystified. Nat Rev Genet, pages 1–11, January 2013.
[55] Jared T Simpson, Kim Wong, Shaun D Jackman, et al. ABySS: a parallel assembler for short read sequence data. Genome Res, 19(6):1117–1123, June 2009.
[56] Sébastien Boisvert, Frédéric Raymond, Élénie Godzaridis, François Laviolette, and Jacques Corbeil. Ray Meta: scalable de novo metagenome assembly and profiling. Genome Biol, 13(12):R122, 2012.
[57] Anveshi Charuvaka and Huzefa Rangwala. Evaluation of short read metagenomic assembly. BMC Genomics, 12(Suppl 2):S8, 2011.
[58] Steven L Salzberg, Adam M Phillippy, Aleksey Zimin, et al. GAGE: A critical evaluation of genome assemblies and assembly algorithms. Genome Res, 22(3):557–567, March 2012.
[59] Keith R Bradnam, Joseph N Fass, Anton Alexandrov, et al. Assemblathon 2: evaluating de novo methods of genome assembly in three vertebrate species.
arXiv, January 2013.
[60] Kaustubh R Patil, Peter Haider, Phillip B Pope, et al. Taxonomic metagenome sequence assignment with structured output models. Nat Meth, 8(3):191–192, March 2011.
[61] Catherine Mathé, Marie-France Sagot, Thomas Schiex, and Pierre Rouzé. Current methods of gene prediction, their strengths and weaknesses. Nucleic Acids Research, 30(19):4103–4117, October 2002.
[62] Mark Borodovsky and James D McIninch. GeneMark: Parallel Gene Recognition for Both DNA Strands. Computational Chemistry, 17(2):123–133, January 1993.
[63] Arthur L Delcher, Douglas Harmon, Simon Kasif, Owen White, and Steven L Salzberg. Improved microbial gene identification with GLIMMER. Nucleic Acids Research, 27(23):4636–4641, December 1999.
[64] Katharina J Hoff, Maike Tech, Thomas Lingner, et al. Gene prediction in metagenomic fragments: a large scale machine learning approach. BMC Bioinformatics, 9:217, 2008.
[65] Wenhan Zhu, Alexandre Lomsadze, and Mark Borodovsky. Ab initio gene identification in metagenomic sequences. Nucleic Acids Research, 38(12):e132, July 2010.
[66] Katharina J Hoff. The effect of sequencing errors on metagenomic gene prediction. BMC Genomics, 10:520, 2009.
[67] Ion Mandoiu and Alexander Zelikovsky. Computational Methods for Next Generation Sequencing Data Analysis. Wiley Series in Bioinformatics. John Wiley & Sons Incorporated, 1 edition, September 2016.
[68] Hideki Noguchi, Jungho Park, and Toshihisa Takagi. MetaGene: prokaryotic gene finding from environmental genome shotgun sequences. Nucleic Acids Research, 34(19):5623–5630, 2006.
[69] Hideki Noguchi, Takeaki Taniguchi, and Takehiko Itoh. MetaGeneAnnotator: detecting species-specific patterns of ribosomal binding site for precise gene prediction in anonymous prokaryotic and phage genomes. DNA Research, 15(6):387–396, December 2008.
[70] David R Kelley, Bo Liu, Arthur L Delcher, Mihai Pop, and Steven L Salzberg. Gene prediction with Glimmer for metagenomic sequences augmented by classification and clustering.
Nucleic Acids Research, 40(1):e9–e9, December 2011.
[71] Katharina J Hoff, Thomas Lingner, Peter Meinicke, and Maike Tech. Orphelia: predicting genes in metagenomic sequencing reads. Nucleic Acids Research, 37(Web Server):W101–W105, June 2009.
[72] Yongchu Liu, Jiangtao Guo, Gangqing Hu, and Huaiqiu Zhu. Gene prediction in metagenomic fragments based on the SVM algorithm. BMC Bioinformatics, 14 Suppl 5:S12, 2013.
[73] Doug Hyatt, Gwo-Liang Chen, Philip F LoCascio, et al. Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics, 11(1):119, 2010.
[74] Doug Hyatt, Philip F LoCascio, Loren J Hauser, and Edward C Uberbacher. Gene and translation initiation site prediction in metagenomic sequences. Bioinformatics, 28(17):2223–2230, September 2012.
[75] Stephen F Altschul, Warren Gish, Webb Miller, Eugene W Myers, and David J Lipman. Basic local alignment search tool. J Mol Biol, 215(3):403–410, October 1990.
[76] Szymon M Kiełbasa, Raymond Wan, Kengo Sato, Paul Horton, and Martin C Frith. Adaptive seeds tame genomic sequence comparison. Genome Res, 21(3):487–493, March 2011.
[77] Kim D Pruitt, Tatiana Tatusova, and Donna R Maglott. NCBI reference sequences (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Research, 35(Database issue):D61–5, January 2007.
[78] Minoru Kanehisa and Susumu Goto. KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids Research, 28(1):27–30, January 2000.
[79] Peter D Karp, Christos A Ouzounis, Caroline Moore-Kochlacs, et al. Expansion of the BioCyc collection of pathway/genome databases to 160 genomes. Nucleic Acids Research, 33(19):6083–6089, 2005.
[80] Ron Caspi, Hartmut Foerster, Carol A Fulcher, et al. MetaCyc: a multiorganism database of metabolic pathways and enzymes. Nucleic Acids Research, 34(Database issue):D511–6, January 2006.
[81] Ron Caspi, Tomer Altman, Kate Dreher, et al.
The MetaCyc database of metabolic pathways and enzymes and the BioCyc collection of pathway/genome databases. Nucleic Acids Research, 40(Database issue):D742–53, January 2012.
[82] Tomer Altman, Michael Travers, Anamika Kothari, Ron Caspi, and Peter D Karp. A systematic comparison of the MetaCyc and KEGG pathway databases. BMC Bioinformatics, 14:112, 2013.
[83] Steven J Hallam, Nik Putnam, Christina M Preston, et al. Reverse methanogenesis: testing the hypothesis with environmental genomics. Science, 305(5689):1457–1462, September 2004.
[84] Steven J Hallam, Tracy J Mincer, Christa Schleper, et al. Pathways of carbon assimilation and ammonia oxidation suggested by environmental genomic analyses of marine Crenarchaeota. PLoS Biol, 4(4):e95, April 2006.
[85] Takuji Yamada, Ivica Letunic, Shujiro Okuda, Minoru Kanehisa, and Peer Bork. iPath2.0: interactive pathway explorer. Nucleic Acids Research, 39(Web Server issue):W412–5, July 2011.
[86] Ergude Bao, Tao Jiang, Isgouhi Kaloshian, and Thomas Girke. SEED: efficient clustering of next-generation sequences. Bioinformatics, 27(18):2502–2509, September 2011.
[87] Ramy K Aziz, Scott Devoid, Terrence Disz, et al. SEED servers: high-performance access to the SEED genomes, annotations, and metabolic models. PLoS ONE, 7(10):e48053, 2012.
[88] Shujiro Okuda, Takuji Yamada, Masami Hamajima, et al. KEGG Atlas mapping for global analysis of metabolic pathways. Nucleic Acids Research, 36(Web Server issue):W423–6, July 2008.
[89] Peter D Karp, Mario Latendresse, and Ron Caspi. The pathway tools pathway prediction algorithm. Stand Genomic Sci, 5(3):424–429, December 2011.
[90] Michelle L Green and Peter D Karp. A Bayesian method for identifying missing enzymes in predicted metabolic pathway databases. BMC Bioinformatics, 5:76, June 2004.
[91] Peter D Karp, Suzanne Paley, and Pedro Romero. The pathway tools software. Bioinformatics, 18(suppl 1):S225–S232, 2002.
[92] Peter D Karp, Suzanne M Paley, Markus Krummenacker, et al.
Pathway Tools version 13.0: integrated software for pathway/genome informatics and systems biology. Brief. Bioinformatics, 11(1):40–79, January 2010.
[93] Yuzhen Ye and Thomas G Doak. A parsimony approach to biological pathway reconstruction/inference for genomes and metagenomes. PLoS Comput Biol, 5(8):e1000465, 2009.
[94] Sahar Abubucker, Nicola Segata, Johannes Goll, et al. Metabolic reconstruction for metagenomic data and its application to the human microbiome. PLoS Comput Biol, 8(6):e1002358, 2012.
[95] Johannes Goll, Mathangi Thiagarajan, Sahar Abubucker, et al. A case study for large-scale human microbiome analysis using JCVI’s metagenomics reports (METAREP). PLoS ONE, 7(6):e29044, 2012.
[96] Karin Breuer, Amir K Foroushani, Matthew R Laird, et al. InnateDB: systems biology of innate immunity and beyond–recent updates and continuing curation. Nucleic Acids Research, 41(Database issue):D1228–33, January 2013.
[97] Charles J Vaske, Stephen C Benz, J Zachary Sanborn, et al. Inference of patient-specific pathway activities from multi-dimensional cancer genomics data using PARADIGM. Bioinformatics, 26(12):i237–45, June 2010.
[98] Sam Ng, Eric A Collisson, Artem Sokolov, et al. PARADIGM-SHIFT predicts the function of mutations in multiple cancers using pathway impact analysis. Bioinformatics, 28(18):i640–i646, September 2012.
[99] Peter E Larsen, Frank R Collart, Dawn Field, et al. Predicted Relative Metabolomic Turnover (PRMT): determining metabolic turnover from a coastal marine metagenomic dataset. Microbial Informatics and Experimentation, 1(1):4, 2011.
[100] Hsuan-Chao Chiu, Roie Levy, and Elhanan Borenstein. Emergent biosynthetic capacity in simple microbial communities. PLoS Comput Biol, 10(7):e1003695, July 2014.
[101] William R Harcombe, William J Riehl, Ilija Dukovski, et al. Metabolic resource allocation in individual microbes determines ecosystem interactions and spatial dynamics. Cell Rep, 7(4):1104–1115, May 2014.
[102] George E Fox, Kenneth R Pechman, and Carl R Woese.
Comparative Cataloging of 16S Ribosomal Ribonucleic Acid: Molecular Approach to Procaryotic Systematics. International Journal of Systematic Bacteriology, 27(1):44–57, January 1977.
[103] Roman L Tatusov, Eugene V Koonin, and David J Lipman. A genomic perspective on protein families. Science, 278(5338):631–637, 1997.
[104] Daniel H Huson, Alexander F Auch, Ji Qi, and Stephan C Schuster. MEGAN analysis of metagenomic data. Genome Res, 17(3):377–386, March 2007.
[105] Norman R Pace. A Molecular View of Microbial Diversity and the Biosphere. Science, 276(5313):734–740, May 1997.
[106] Todd Z DeSantis, Philip Hugenholtz, Niels Larsen, et al. Greengenes, a chimera-checked 16S rRNA gene database and workbench compatible with ARB. Appl. Environ. Microbiol., 72(7):5069–5072, 2006.
[107] Elmar Pruesse, Christian Quast, Katrin Knittel, et al. SILVA: a comprehensive online resource for quality checked and aligned ribosomal RNA sequence data compatible with ARB. Nucleic Acids Research, 35(21):7188–7196, 2007.
[108] J R Cole, B Chai, T L Marsh, et al. The Ribosomal Database Project (RDP-II): previewing a new autoaligner that allows regular updates and the new prokaryotic taxonomy. Nucleic Acids Research, 31(1):442–443, January 2003.
[109] Morgan G I Langille, Jesse Zaneveld, J Gregory Caporaso, et al. Predictive functional profiling of microbial communities using 16S rRNA marker gene sequences. Nat Biotechnol, 31(9):814–821, August 2013.
[110] Victor Kunin, Alex Copeland, Alla Lapidus, Konstantinos Mavromatis, and Philip Hugenholtz. A bioinformatician’s guide to metagenomics. Microbiol. Mol. Biol. Rev., 72(4):557–78, December 2008.
[111] Pierre Legendre and Louis Legendre. Numerical Ecology. Elsevier, July 2012.
[112] Jari Oksanen, Roeland Kindt, Pierre Legendre, et al. The vegan package. Community ecology package, 2007.
[113] Ryota Suzuki and Hidetoshi Shimodaira. Pvclust: an R package for assessing the uncertainty in hierarchical clustering.
Bioinformatics, 22(12):1540–1542, June 2006.
[114] Philip Dixon. VEGAN, a package of R functions for community ecology. Journal of Vegetation Science, 2003.
[115] Ulf Grandin. PC-ORD version 5: A user-friendly toolbox for ecologists. Journal of Vegetation Science, 17(6):843–844, February 2009.
[116] J Gregory Caporaso, Justin Kuczynski, Jesse Stombaugh, et al. QIIME allows analysis of high-throughput community sequencing data. Nat Meth, 7(5):335–336, May 2010.
[117] A Murat Eren, Michael J Ferris, and Christopher M Taylor. A framework for analysis of metagenomic sequencing data. Pac Symp Biocomput, pages 131–141, 2011.
[118] Joshua A Steele, Peter D Countway, Li Xia, et al. Marine bacterial, archaeal and protistan association networks reveal ecological linkages. ISME J, 5(9):1414–1425, September 2011.
[119] Jody J Wright, Kishori M Konwar, and Steven J Hallam. Microbial ecology of expanding oxygen minimum zones. Nat. Rev. Microbiol., 10(6):381–394, June 2012.
[120] Michael E Smoot, Keiichiro Ono, Johannes Ruscheinski, Peng-Liang Wang, and Trey Ideker. Cytoscape 2.8: new features for data integration and network visualization. Bioinformatics, 27(3):431–432, February 2011.
[121] Samuel Chaffron, Hubert Rehrauer, Jakob Pernthaler, and Christian von Mering. A global network of coexisting microbes from environmental and whole-genome sequence data. Genome Res, 20(7):947–959, July 2010.
[122] Lars J Jensen, Philippe Julien, Michael Kuhn, et al. eggNOG: automated construction and annotation of orthologous groups of genes. Nucleic Acids Research, 36(Database):D250–D254, December 2007.
[123] Jean Muller, Damian Szklarczyk, Philippe Julien, et al. eggNOG v2.0: extending the evolutionary genealogy of genes with enhanced non-supervised orthologous groups, species and functional annotations. Nucleic Acids Research, 38(Database issue):D190–5, January 2010.
[124] F D Ciccarelli. Toward Automatic Reconstruction of a Highly Resolved Tree of Life.
Science, 311(5765):1283–1287, March 2006.
[125] C von Mering, P Hugenholtz, J Raes, et al. Quantitative phylogenetic assessment of microbial communities in diverse environments. Science, 315(5815):1126–1130, February 2007.
[126] Manuel Stark, Simon A Berger, Alexandros Stamatakis, and Christian von Mering. MLTreeMap–accurate Maximum Likelihood placement of environmental DNA sequences into taxonomic and functional reference phylogenies. BMC Genomics, 11:461, 2010.
[127] Ewan Birney, Michele Clamp, and Richard Durbin. GeneWise and Genomewise. Genome Res, 14(5):988–995, May 2004.
[128] Alexandros Stamatakis. RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models. Bioinformatics, 22(21):2688–2690, November 2006.
[129] Michael McCool, James Reinders, and Arch Robison. Structured Parallel Programming. Patterns for Efficient Computation. Elsevier, July 2012.
[130] James Jeffers and James Reinders. Intel Xeon Phi Coprocessor High-Performance Programming. Newnes, February 2013.
[131] Jeffrey Dean and Sanjay Ghemawat. MapReduce: simplified data processing on large clusters. Communications of the ACM, 51(1):107–113, 2008.
[132] Dongjae Kim, Kishori M Konwar, Niels W Hanson, and Steven J Hallam. Koonkie: An Automated Software Tool for Processing Environmental Sequence Information using Hadoop. Fourth ASE International Conference on Big Data (BigData 2014), pages 1–8, December 2014.
[133] Folker Meyer, Daniel Paarmann, Mark D’Souza, et al. The metagenomics RAST server - a public resource for the automatic phylogenetic and functional analysis of metagenomes. BMC Bioinformatics, 9:386, 2008.
[134] Rekha Seshadri, Saul A Kravitz, Larry Smarr, Paul Gilna, and Marvin Frazier. CAMERA: a community resource for metagenomics. PLoS Biol, 5(3):e75, March 2007.
[135] Victor M Markowitz, Natalia N Ivanova, Ernest Szeto, et al. IMG/M: a data management and analysis system for metagenomes.
Nucleic Acids Research, 36(Database issue):D534–8, January 2008.
[136] Victor M Markowitz, I-Min A Chen, Ken Chu, et al. IMG/M: the integrated metagenome data management and comparative analysis system. Nucleic Acids Research, 40(Database issue):D123–9, January 2012.
[137] Edward F Delong. Towards microbial systems science: integrating microbial perspective, from genomes to biomes. Environmental Microbiology, 4(1):9–10, January 2002.
[138] Paul G Falkowski, Tom Fenchel, and Edward F Delong. The microbial engines that drive Earth’s biogeochemical cycles. Science, 320(5879):1034–1039, May 2008.
[139] Konstantinos Mavromatis, Natalia Ivanova, Kerrie Barry, et al. Use of simulated data sets to evaluate the fidelity of metagenomic processing methods. Nat Meth, 4(6):495–500, June 2007.
[140] Minoru Kanehisa, Susumu Goto, Yoko Sato, Miho Furumichi, and Mao Tanabe. KEGG for integration and interpretation of large-scale molecular data sets. Nucleic Acids Research, 40(Database issue):D109–14, January 2012.
[141] Ross Overbeek, Tadhg Begley, Ralph M Butler, et al. The subsystems approach to genome annotation and its use in the project to annotate 1000 genomes. Nucleic Acids Research, 33(17):5691–5702, 2005.
[142] Ramy K Aziz, Daniela Bartels, Aaron A Best, et al. The RAST Server: rapid annotations using subsystems technology. BMC Genomics, 9:75, 2008.
[143] Folker Meyer, Ross Overbeek, and Alex Rodriguez. FIGfams: yet another set of protein families. Nucleic Acids Research, 37(20):6643–6654, November 2009.
[144] Mario Latendresse, Markus Krummenacker, Miles Trupp, and Peter D Karp. Construction and completion of flux balance models from pathway databases. Bioinformatics, 28(3):388–396, February 2012.
[145] Michael Hucka, Andrew Finney, Herbert M Sauro, et al. The systems biology markup language (SBML): a medium for representation and exchange of biochemical network models. Bioinformatics, 19(4):524–531, March 2003.
[146] Peter D Karp, Monica Riley, Milton Saier, et al.
The EcoCyc and MetaCyc databases. Nucleic Acids Research, 28(1):56–59, January 2000.
[147] Mario Latendresse, Suzanne Paley, and Peter D Karp. Browsing metabolic and regulatory networks with BioCyc. Methods Mol. Biol., 804:197–216, 2012.
[148] Roman L Tatusov, Darren A Natale, Igor V Garkavtsev, et al. The COG database: new developments in phylogenetic classification of proteins from complete genomes. Nucleic Acids Research, 29(1):22–28, January 2001.
[149] Kishori M Konwar, Niels W Hanson, Antoine P Pagé, and Steven J Hallam. MetaPathways: a modular pipeline for constructing pathway/genome databases from environmental sequence information. BMC Bioinformatics, 14(1):202, 2013.
[150] Carlo E Bonferroni. Il calcolo delle assicurazioni su gruppi di teste. Tipografia del Senato, December 1935.
[151] David A Rasko, Garry S A Myers, and Jacques Ravel. Visualization of comparative genomic analyses by BLAST score ratio. BMC Bioinformatics, 6:2, 2005.
[152] Burkhard Rost. Twilight zone of protein sequence alignments. Protein Eng., 12(2):85–94, February 1999.
[153] Wolfgang Gentzsch. Sun Grid Engine: towards creating a compute power grid. In CCGRID-01, pages 35–36. IEEE Comput. Soc, 2001.
[154] Todd M Lowe and Sean R Eddy. tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Research, 25(5):0955–0964, 1997.
[155] Mario Latendresse and Peter D Karp. An advanced web query interface for biological databases. Database (Oxford), 2010:baq006, 2010.
[156] Joseph M Dale, Liviu Popescu, and Peter D Karp. Machine learning methods for metabolic pathway prediction. BMC Bioinformatics, 11:15, 2010.
[157] Daniel C Richter, Felix Ott, Alexander F Auch, Ramona Schmid, and Daniel H Huson. MetaSim - A sequencing simulator for genomics and metagenomics. PLoS ONE, 3(10):e3373, 2008.
[158] Andrew D Barton, Stephanie Dutkiewicz, Glenn Flierl, Jason Bragg, and Michael J Follows. Patterns of diversity in marine phytoplankton.
Science, 327(5972):1509–1511, March 2010.
[159] Michael J Follows, Stephanie Dutkiewicz, Scott Grant, and Sallie W Chisholm. Emergent Biogeography of Microbial Communities in a Model Ocean. Science, 315(5820):1843–1846, March 2007.
[160] Peter E Larsen, Dawn Field, and Jack A Gilbert. Predicting bacterial community assemblages using an artificial neural network approach. Nat Meth, 9(6):621–625, June 2012.
[161] Christopher S Henry, Matthew DeJongh, Aaron A Best, et al. High-throughput generation, optimization and analysis of genome-scale metabolic models. Nat Biotechnol, 28(9):977–982, September 2010.
[162] Christopher S Henry, Ross Overbeek, Fangfang Xia, et al. Connecting genotype to phenotype in the era of high-throughput sequencing. Biochim. Biophys. Acta, 1810(10):967–977, October 2011.
[163] Anantharaman Kalyanaraman, Srinivas Aluru, Suresh Kothari, and Volker Brendel. Efficient clustering of large EST data sets on parallel computers. Nucleic Acids Research, 31(11):2963–2974, June 2003.
[164] Ananth Kalyanaraman, William R Cannon, Benjamin Latt, and Douglas J Baxter. MapReduce implementation of a hybrid spectral library-database search method for large-scale peptide identification. Bioinformatics, 27(21):3072–3073, November 2011.
[165] Thomas Ishoey, Tanja Woyke, Ramunas Stepanauskas, Mark Novotny, and Roger S Lasken. Genomic sequencing of single microbial cells from environmental samples. Current Opinion in Microbiology, 11(3):198–204, June 2008.
[166] John C Wooley and Yuzhen Ye. Metagenomics: Facts and Artifacts, and Computational Challenges. J Comput Sci Technol, 25(1):71–81, January 2009.
[167] Tony Hey, Stewart Tansley, and Kristin M Tolle. The fourth paradigm: data-intensive scientific discovery. Microsoft Research, 2009.
[168] Niels W Hanson, Kishori M Konwar, Shang-Ju Wu, and Steven J Hallam. MetaPathways v2.0: A master-worker model for environmental Pathway/Genome Database construction on grids and clouds.
Computational Intelligence in Bioinformatics and Computational Biology, 2014 IEEE Conference on, pages 1–7, May 2014.
[169] John P McCutcheon and Carol D von Dohlen. An interdependent metabolic patchwork in the nested symbiosis of mealybugs. Curr. Biol., 21(16):1366–1372, August 2011.
[170] Frank J Stewart, Adrian K Sharma, Jessica A Bryant, John M Eppley, and Edward F Delong. Community transcriptomics reveals universal patterns of protein sequence conservation in natural microbial communities. Genome Biol, 12(3):R26, 2011.
[171] Niels W Hanson, Kishori M Konwar, Alyse K Hawley, et al. Metabolic pathways for the whole community. BMC Genomics, 15:619, 2014.
[172] Ron Caspi, Kate Dreher, and Peter D Karp. The challenge of constructing, classifying, and representing metabolic pathways. FEMS Microbiol Lett, 345(2):85–93, June 2013.
[173] Jacintha Ellers, E Toby Kiers, Cameron R Currie, Bradon R McDonald, and Bertanne Visser. Ecological interactions drive evolutionary loss of traits. Ecol Lett, 15(10):1071–1082, July 2012.
[174] Diane Lawrence, Francesca Fiegna, Volker Behrends, et al. Species interactions alter evolutionary responses to a novel environment. PLoS Biol, 10(5):e1001330, May 2012.
[175] J Jeffrey Morris, Richard E Lenski, and Erik R Zinser. The Black Queen Hypothesis: evolution of dependencies through adaptive gene loss. MBio, 3(2):e00036-12, February 2012.
[176] Phyllis Lam and Marcel M M Kuypers. Microbial nitrogen cycling processes in oxygen minimum zones. Ann Rev Mar Sci, 3:317–345, 2011.
[177] Silke Ehrich, Doris Behrens, Elena Lebedeva, Wolfgang Ludwig, and Eberhard Bock. A new obligately chemolithoautotrophic, nitrite-oxidizing bacterium, Nitrospira moscoviensis sp. nov. and its phylogenetic relationship. Arch. Microbiol., 164(1):16–23, July 1995.
[178] Marc Strous, Eric Pelletier, Sophie Mangenot, et al. Deciphering the evolution and metabolism of an anammox bacterium from a community genome.
Nature, 440(7085):790–794, April 2006.
[179] Sebastian Lücker, Michael Wagner, Frank Maixner, et al. A Nitrospira metagenome illuminates the physiology and evolution of globally important nitrite-oxidizing bacteria. Proceedings of the National Academy of Sciences, 107(30):13479–13484, July 2010.
[180] Boran Kartal, Wouter J Maalcke, Naomi M de Almeida, et al. Molecular mechanism of anaerobic ammonium oxidation. Nature, 479(7371):127–130, November 2011.
[181] Sangita Ganesh, Darren J Parris, Edward F Delong, and Frank J Stewart. Metagenomic analysis of size-fractionated picoplankton in a marine oxygen minimum zone. ISME J, September 2013.
[182] F Tajima. Evolutionary relationship of DNA sequences in finite populations. Genetics, 105(2):437–460, October 1983.
[183] Sean Nee, Robert M May, and Paul H Harvey. The Reconstructed Evolutionary Process. Philosophical Transactions: Biological Sciences, 344(1309):305–311, May 1994.
[184] Nathan G Swenson. Phylogenetic beta diversity metrics, trait evolution and inferring the functional beta diversity of communities. PLoS ONE, 6(6):e21264, 2011.
[185] Catherine A Lozupone and Rob Knight. UniFrac: a New Phylogenetic Method for Comparing Microbial Communities. Appl. Environ. Microbiol., 71(12):8228–8235, December 2005.
[186] C I Hunter, A Mitchell, P Jones, et al. Metagenomic analysis: the challenge of the data bonanza. Brief. Bioinformatics, September 2012.
[187] C De Filippo, M Ramazzotti, P Fontana, and D Cavalieri. Bioinformatic approaches for functional annotation and pathway inference in metagenomics data. Brief. Bioinformatics, 13(6):696–710, November 2012.
[188] Narayan Desai, Dion Antonopoulos, Jack A Gilbert, Elizabeth M Glass, and Folker Meyer. From genomics to metagenomics. Current Opinion in Biotechnology, 23(1):72–76, February 2012.
[189] Stephen F Altschul, Thomas L Madden, Alejandro A Schäffer, et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs.
Nucleic Acids Research, 25(17):3389–3402, September 1997.
[190] Gordon Bell, Tony Hey, and Alex Szalay. Beyond the data deluge. Science, 323(5919):1296–1297, March 2009.
[191] Chryssis Georgiou and Alexander A Shvartsman. Cooperative Task-Oriented Computing: Algorithms and Complexity. Synthesis Lectures on Distributed Computing Theory, 2(2):1–167, July 2011.
[192] Evgenia Christoforou, Antonio F Anta, Chryssis Georgiou, Miguel A Mosteiro, and Angel Sánchez. Applying the dynamics of evolution to achieve reliability in master–worker computing. Concurrency and Computation: Practice and Experience, 25:2363–2380, 2013.
[193] Kishori M Konwar, Seda Rajasekaran, and Alexander A Shvartsman. Robust network supercomputing with malicious processes. In Proc. of 17th Int-l Symp. on Distributed Computing (DISC), pages 474–488, 2006.
[194] David P Anderson, Jeff Cobb, Eric Korpela, Matt Lebofsky, and Dan Werthimer. SETI@home: an experiment in public-resource computing. Communications of the ACM, 45(11):56–61, 2002.
[195] Scott Federhen. The NCBI Taxonomy database. Nucleic Acids Research, 40(Database issue):D136–43, January 2012.
[196] James I Prosser. Replicate or lie. Environmental Microbiology, 12(7):1806–1810, March 2010.
[197] Carl R Woese and George E Fox. Phylogenetic structure of the prokaryotic domain: The primary kingdoms. Proceedings of the National Academy of Sciences, 74(11):5088–5090, 1977.
[198] Yongan Zhao, Haixu Tang, and Yuzhen Ye. RAPSearch2: a fast and memory-efficient protein similarity search tool for next-generation sequencing data. Bioinformatics, 28(1):125–126, December 2011.
[199] R D Finn, J Clements, and S R Eddy. HMMER web server: interactive sequence similarity searching. Nucleic Acids Research, 39(Web Server):W29–W37, June 2011.
[200] Johannes Alneberg, Brynjar Smári Bjarnason, Ino de Bruijn, et al. Binning metagenomic contigs by coverage and composition.
Nat Meth, 11(11):1144–1146, November 2014.
[201] Hanno Teeling, Jost Waldmann, Thierry Lombardot, Margarete Bauer, and Frank Oliver Glöckner. TETRA: a web-service and a stand-alone program for the analysis and comparison of tetranucleotide usage patterns in DNA sequences. BMC Bioinformatics, 5:163, October 2004.
[202] Sourav Chatterji, Ichitaro Yamazaki, Zhaojun Bai, and Jonathan A Eisen. CompostBin: A DNA Composition-Based Algorithm for Binning Environmental Shotgun Reads. In Research in Computational. . . , pages 17–28. Springer Berlin Heidelberg, Berlin, Heidelberg, 2008.
[203] Gregory J Dick, Anders F Andersson, Brett J Baker, et al. Community-wide analysis of microbial genome sequence signatures. Genome Biol, 10(8):R85, 2009.
[204] Cedric C Laczny, Nicolas Pinel, Nikos Vlassis, and Paul Wilmes. Alignment-free visualization of metagenomic data by nonlinear dimension reduction. Sci Rep, 4:4516, 2014.
[205] Arthur Brady and Steven L Salzberg. Phymm and PhymmBl: metagenomic phylogenetic classification with interpolated markov models. Nat Meth, 6(9):673–676, August 2009.
[206] Marc Strous, Beate Kraft, Regina Bisdorf, and Halina E Tegetmeyer. The binning of metagenomic contigs for microbial physiology of mixed cultures. Front Microbiol, 3:410, 2012.
[207] Chris Burge and Samuel Karlin. Prediction of complete gene structures in human genomic DNA. J Mol Biol, 1997.
[208] Monzoorul Haque Mohammed, Tarini Shankar Ghosh, Nitin Kumar Singh, and Sharmila S Mande. SPHINX–an algorithm for taxonomic binning of metagenomic sequences. Bioinformatics, 27(1):22–30, January 2011.
[209] David M Estlund. Opinion leaders, independence, and Condorcet’s Jury Theorem. Theor Decis, 36(2):131–162, March 1994.
[210] Uriel Feige, Prabhakar Raghavan, David Peleg, and Eli Upfal. Computing with Noisy Information. SIAM J. Comput., 23(5):1001–1018, October 1994.
[211] S Kullback and R A Leibler. JSTOR: The Annals of Mathematical Statistics, Vol. 22, No. 1 (Mar., 1951), pp. 79-86.
The Annals of Mathematical Statistics, 1951.
[212] Leo Breiman. Random Forests. Machine Learning, 45(1):5–32, 2001.
[213] F T Wright, R L Dykstra, and T Robertson. Order-restricted statistical inference. Wiley, New York, 1988.
[214] D Nettleton. Testing for the Supremacy of a Multinomial Cell Probability. Journal of the American Statistical Association, 2009.
[215] Christian Rinke, Patrick Schwientek, Alexander Sczyrba, et al. Insights into the phylogeny and coding potential of microbial dark matter. Nature, 499(7459):431–437, July 2013.
[216] Michel Balinski and Rida Laraki. A theory of measuring, electing, and ranking. Proceedings of the National Academy of Sciences, 104(21):8720–8725, May 2007.
[217] Rachna J Ram, Nathan C VerBerkmoes, Michael P Thelen, et al. Community proteomics of a natural microbial biofilm. Science, 308(5730):1915–1920, June 2005.
[218] Nicola Segata, Levi Waldron, Annalisa Ballarini, et al. Metagenomic microbial community profiling using unique clade-specific marker genes. Nat Meth, 9(8):811–814, August 2012.
[219] Nicola Segata, Daniela Börnigen, Xochitl C Morgan, and Curtis Huttenhower. PhyloPhlAn is a new method for improved phylogenetic and taxonomic placement of microbes. Nat Comms, 4:2304, 2013.
[220] Nicola Segata, Jacques Izard, Levi Waldron, et al. Metagenomic biomarker discovery and explanation. Genome Biol, 12(6):R60, 2011.
[221] Timothy L Tickle, Nicola Segata, Levi Waldron, Uri Weingart, and Curtis Huttenhower. Two-stage microbial community experimental design. ISME J, 7(12):2330–2339, December 2013.
[222] Derrick E Wood and Steven L Salzberg. Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome Biol, 15(3):R46, 2014.
[223] Kishori M Konwar, Niels W Hanson, Maya P Bhatia, et al. MetaPathways v2.5: Quantitative functional, taxonomic, and usability improvements. Bioinformatics, pages 1–3, January 2015.
[224] Miguel Pignatelli and Andrés Moya.
Evaluating the Fidelity of De Novo Short Read Metagenomic Assembly Using Simulated Data. PLoS ONE, 6(5):e19984, May 2011.
[225] Eric McDonald and C Titus Brown. khmer: Working with Big Data in Bioinformatics. arXiv, March 2013.
[226] Alexey Gurevich, Vladislav Saveliev, Nikolay Vyahhi, and Glenn Tesler. QUAST: quality assessment tool for genome assemblies. Bioinformatics, 29(8):1072–1075, April 2013.
[227] Dazhi Jiao, Yuzhen Ye, and Haixu Tang. Probabilistic Inference of Biochemical Reactions in Microbial Communities from Metagenomic Sequences. PLoS Comput Biol, 2013.
[228] Itai Sharon, Sivan Bercovici, Ron Y Pinter, and Tomer Shlomi. Pathway-based functional analysis of metagenomes. Journal of Computational Biology, 18(3):495–505, March 2011.
[229] J Craig Venter, Karin Remington, John F Heidelberg, et al. Environmental genome shotgun sequencing of the Sargasso Sea. Science, 304(5667):66–74, April 2004.
[230] Erik Kristiansson, Philip Hugenholtz, and Daniel Dalevi. ShotgunFunctionalizeR: an R-package for functional comparison of metagenomes. Bioinformatics, 25(20):2737–2738, October 2009.
[231] James Robert White, Niranjan Nagarajan, and Mihai Pop. Statistical methods for detecting differentially abundant features in clinical metagenomic samples. PLoS Comput Biol, 5(4):e1000352, April 2009.
[232] Bo Liu and Mihai Pop. MetaPath: identifying differentially abundant metabolic pathways in metagenomic datasets. BMC Proceedings, 5 Suppl 2:S9, 2011.
[233] Ilya B Gertsbakh and Yoseph Shpungin. Models of Network Reliability: Analysis, Combinatorics, and Monte Carlo. CRC Press, 2011.
[234] Dae-Kyun Ro, Eric M Paradise, Mario Ouellet, et al. Production of the antimalarial drug precursor artemisinic acid in engineered yeast. Nature, 440(7086):940–943, April 2006.
[235] Concetta Beninati, Marco R Oggioni, Maria Boccanera, et al. Therapy of mucosal candidiasis by expression of an anti-idiotype in human commensal bacteria.
Nat Biotechnol, 18(10):1060–1064, October 2000.
[236] David Bermudes, Li-mou Zheng, and Ivan C King. Live bacteria as anticancer agents and tumor-selective protein delivery vectors. Curr Opin Drug Discov Devel, 5(2):194–199, March 2002.
[237] Gürol M Süel, Jordi Garcia-Ojalvo, Louisa M Liberman, and Michael B Elowitz. An excitable gene regulatory circuit induces transient cellular differentiation. Nature, 440(7083):545–550, March 2006.
[238] David A Drubin, Jeffrey C Way, and Pamela A Silver. Designing biological systems. Genes & Development, 21(3):242–254, February 2007.
[239] Drew Endy. Foundations for engineering biology. Nature, 438(7067):449–453, November 2005.
[240] J Zaldivar, J Nielsen, and L Olsson. Fuel ethanol production from lignocellulose: a challenge for metabolic engineering and process integration. Appl. Microbiol. Biotechnol., 56(1-2):17–34, July 2001.
[241] Lee R Lynd, Paul J Weimer, Willem H van Zyl, and Isak S Pretorius. Microbial cellulose utilization: fundamentals and biotechnology. Microbiol. Mol. Biol. Rev., 66(3):506–77, September 2002.
[242] Katie Brenner, Lingchong You, and Frances H Arnold. Engineering microbial consortia: a new frontier in synthetic biology. Trends in Biotechnology, 26(9):483–489, September 2008.
[243] Mette Burmølle, Jeremy S Webb, Dhana Rao, et al. Enhanced biofilm formation and increased resistance to antimicrobial agents and bacterial invasion are caused by synergistic interactions in multispecies biofilms. Appl. Environ. Microbiol., 72(6):3916–3923, June 2006.
[244] John Davison. Genetic exchange between bacteria in the environment. Plasmid, 42(2):73–91, September 1999.
[245] Campbell O Webb, David D Ackerly, Mark A McPeek, and Michael J Donoghue. Phylogenies and community ecology. Annu. Rev. Ecol. Syst., pages 475–505, 2002.
[246] Juli G Pausas and Miguel Verdú. The Jungle of Methods for Evaluating Phenotypic and Phylogenetic Structure of Communities.
BioScience, 60(8):614–625, September 2010.
[247] Stuart A West, Ashleigh S Griffin, Andy Gardner, and Stephen P Diggle. Social evolution theory for microorganisms. Nat. Rev. Microbiol., 4(8):597–607, August 2006.
[248] Lee Hsiang Liow, Leigh Van Valen, and Nils Chr Stenseth. Red Queen: from populations to taxa and communities. Trends in Ecology & Evolution, 26(7):349–358, July 2011.
[249] Vivien Marx. Next-generation sequencing: The genome jigsaw. Nature, 501(7466):263–268, September 2013.
[250] Sergey Koren, Gregory P Harhay, Timothy P L Smith, et al. Reducing assembly complexity of microbial genomes with single-molecule sequencing. Genome Biol, 14(9):R101, 2013.
[251] D Borthakur. Hadoop distributed file system. Apache Software Foundation, 2007.
[252] Michael Stonebraker. SQL databases v. NoSQL databases. Communications of the ACM, 53(4):10, April 2010.
[253] Ronald C Taylor. An overview of the Hadoop/MapReduce/HBase framework and its current applications in bioinformatics. BMC Bioinformatics, 11(Suppl 12):S1, 2010.
[254] Zeeya Merali. Computational science: Error, why scientific programming does not compute. Nature, 467(7317):775–777, 2010.
[255] Vinod Kumar Vavilapalli, Siddharth Seth, Bikas Saha, et al. Apache Hadoop YARN. In SOCC ’13, pages 1–16, New York, New York, USA, 2013. ACM Press.
[256] Karen E Nelson. Metagenomics of the Human Body. Springer, November 2011.
[257] Geoffrey A Preidis and James Versalovic. Targeting the human microbiome with antibiotics, probiotics, and prebiotics: gastroenterology enters the metagenomics era. Gastroenterology, 136(6):2015–2031, May 2009.
[258] Junjie Qin, Ruiqiang Li, Jeroen Raes, et al. A human gut microbial gene catalogue established by metagenomic sequencing. Nature, 464(7285):59–65, March 2010.
[259] The Human Microbiome Jumpstart Reference Strains Consortium, K E Nelson, G M Weinstock, et al. A Catalog of Reference Genomes from the Human Microbiome. Science, 328(5981):994–999, May 2010.
[260] Ramunas Stepanauskas.
Single cell genomics: an individual look at microbes. Current Opinion in Microbiology, 15(5):613–620, September 2012.
[261] Paul C Blainey. The future is now: single-cell genomics of bacteria and archaea. FEMS Microbiol. Rev., 37(3):407–427, May 2013.
[262] Ehud Shapiro, Tamir Biezuner, and Sten Linnarsson. Single-cell sequencing-based technologies will revolutionize whole-organism science. Nature Publishing Group, 14(9):618–630, August 2013.
[263] Arumugham Raghunathan, Harley R Ferguson, Carole J Bornarth, et al. Genomic DNA amplification from a single bacterium. Appl. Environ. Microbiol., 71(6):3342–3347, June 2005.
[264] Ramunas Stepanauskas and Michael E Sieracki. Matching phylogeny and metabolism in the uncultured marine bacteria, one cell at a time. Proceedings of the National Academy of Sciences, 104(21):9052–9057, May 2007.
[265] Ryan Tewhey, Jason B Warner, Masakazu Nakano, et al. Microdroplet-based PCR enrichment for large-scale targeted sequencing. Nat Biotechnol, 27(11):1025–1031, November 2009.
[266] H Christina Fan, Jianbin Wang, Anastasia Potanina, and Stephen R Quake. Whole-genome molecular haplotyping of single cells. Nat Biotechnol, 29(1):51–57, January 2011.
[267] Kaston Leung, Hans Zahn, Timothy Leaver, et al. A programmable droplet-based microfluidic device applied to multiparameter analysis of single microbes and microbial communities. Proceedings of the National Academy of Sciences, 109(20):7665–7670, May 2012.
[268] Dan Frumkin, Adam Wasserstrom, Shai Kaplan, Uriel Feige, and Ehud Shapiro. Genomic variability within an organism exposes its cell lineage tree. PLoS Comput Biol, 1(5):e50, October 2005.
[269] Nicholas Navin, Jude Kendall, Jennifer Troge, et al. Tumour evolution inferred by single-cell sequencing. Nature, 472(7341):90–94, April 2011.
[270] Fuchou Tang, Catalin Barbacioru, Yangzhou Wang, et al. mRNA-Seq whole-transcriptome analysis of a single cell. Nat Meth, 6(5):377–382, May 2009.
[271] Paul C Blainey and Stephen R Quake.
Digital MDA for enumeration of total nucleic acid contamination. Nucleic Acids Research, 39(4):e19, March 2011.
[272] Yann Marcy, Thomas Ishoey, Roger S Lasken, et al. Nanoliter reactors improve multiple displacement amplification of genomes from single cells. PLoS Genet., 3(9):1702–1708, September 2007.
[273] Tanja Woyke, Gary Xie, Alex Copeland, et al. Assembling the marine metagenome, one cell at a time. PLoS ONE, 4(4):e5299, 2009.
[274] Anton Bankevich, Sergey Nurk, Dmitry Antipov, et al. SPAdes: A New Genome Assembly Algorithm and Its Applications to Single-Cell Sequencing. Journal of Computational Biology, 19(5):455–477, May 2012.
[275] Paul C Blainey, Annika C Mosier, Anastasia Potanina, Christopher A Francis, and Stephen R Quake. Genome of a low-salinity ammonia-oxidizing archaeon determined by single-cell and metagenomic analysis. PLoS ONE, 6(2):e16626, 2011.

Appendix A

Chapter 2: supplementary material

Table A.1: Source genome statistics for simulated metagenomes Sim1 and Sim2.

Taxa                                          Genome Size (bp)  Chromosomes  Genes
Agrobacterium tumefaciens C58                 5,674,064         4            5,469
Aurantimonas manganoxydans SI85-9A1           4,285,343         1            3,665
Bacillus subtilis subtilis 168                4,215,606         1            4,428
Caulobacter crescentus CB15                   4,016,947         1            3,819
Caulobacter crescentus NA1000                 4,042,929         1            3,968
Helicobacter pylori 26695                     1,667,867         1            1,609
Mycobacterium tuberculosis CDC1551            4,403,836         1            4,235
Mycobacterium tuberculosis H37Rv              4,411,529         1            3,916
Synechococcus elongatus PCC 7942              2,695,903         1            2,664
Vibrio cholerae O1 biovar El Tor str. N16961  4,033,464         2            3,952
Total                                         39,447,488                     37,725
Average                                       3,944,749                      3,773

Table A.2: Confusion tables for classification analysis of simulated metagenomes Sim1 and Sim2 at progressively larger sequence coverage.

Sample  Gm    TP    TN      FP   FN    P     N
sim1    1/32  200   1,023   8    446   208   1,469
sim1    1/16  244   1,007   24   402   268   1,409
sim1    1/8   368   1,007   24   278   392   1,285
sim1    1/4   446   1,002   29   200   475   1,202
sim1    1/2   517   1,006   25   129   542   1,135
sim1    1/1   576   1,003   28   70    604   1,073
sim2    1/32  163   1,019   12   483   175   1,502
sim2    1/16  235   1,019   12   411   247   1,430
sim2    1/8   316   1,008   23   330   339   1,338
sim2    1/4   403   1,009   22   243   425   1,252
sim2    1/2   455   1,015   16   191   471   1,206
sim2    1/1   521   1,002   29   125   550   1,127

Appendix B

Chapter 3: supplementary material

Table B.1: Overview of the E. coli K12 genome used for simulated sequencing experiments.

Taxa                                       GenBank    Size (bp)  Genes
Escherichia coli str. K-12 substr. MG1655  NC_000913  4,639,675  4,288

Table B.2: Overview of the tier-2 BioCyc genomes used for simulated sequencing experiments.

Taxa                                          Genome Size (bp)  Chromosomes  Genes
Agrobacterium tumefaciens C58                 5,674,064         4            5,469
Aurantimonas manganoxydans SI85-9A1           4,285,343         1            3,665
Bacillus subtilis subtilis 168                4,215,606         1            4,428
Caulobacter crescentus CB15                   4,016,947         1            3,819
Caulobacter crescentus NA1000                 4,042,929         1            3,968
Helicobacter pylori 26695                     1,667,867         1            1,609
Mycobacterium tuberculosis CDC1551            4,403,836         1            4,235
Mycobacterium tuberculosis H37Rv              4,411,529         1            3,916
Synechococcus elongatus PCC 7942              2,695,903         1            2,664
Vibrio cholerae O1 biovar El Tor str. N16961  4,033,464         2            3,952
Total                                         39,447,488                     37,725
Average                                       3,944,749                      3,773

B.1 Confusion Table Statistics

In machine learning, a confusion table (contingency table) is a method to assess the performance of a supervised classifier. Rows of the table represent class predictions, while columns represent the actual class. Given a predicted class and the known class, there are four possible outcomes for the prediction:
Given a predicted class and the known class, there are four possible outcomes for the prediction:

Correct Responses
True Positives (TP): The classifier correctly identified the class as present.
True Negatives (TN): The classifier correctly identified the class as absent.

Incorrect Responses
False Positives (FP) (Type I error): The classifier incorrectly predicted the class as present when it was absent.
False Negatives (FN) (Type II error): The classifier incorrectly predicted the class as absent when it was present.

[Figure B.1: horizontal bar chart. x-axis: Copy Number (0.0 to 0.6); y-axis: the ten BioCyc taxa of Table B.2; series: Sim1, Sim2.]
Figure B.1: Copy number distributions for the simulated metagenomes Sim1 and Sim2. Sim1 (blue) has the ten BioCyc taxa in approximately equal proportion. Sim2 (purple) has the genome copy number of Caulobacter crescentus NA1000 at approximately twenty-fold abundance. Taxa used were selected with approximately equal genome size and gene content. Figure originally published in BMC Genomics under the Creative Commons Attribution Licence v2.0 [171].

Summary Statistics
Since classifiers can have very different performance characteristics, it is often important to consider several statistics of the confusion table. In most situations there is a trade-off between the two types of error that a classifier can make. Figure B.2 illustrates a confusion table and the relationships between the various summary statistics.

Sensitivity (Recall): The ability of the classifier to find positive results. Given that a class is actually in the sample, what is the probability that it is found? High values represent a low number of false negatives (Type II errors).
\[ \text{Sensitivity} = \frac{\text{\# correctly predicted present}}{\text{\# actually present}} = \frac{TP}{TP + FN} \]

Specificity: The ability of the classifier to find negative results. What is the probability of correctly rejecting a class? High values represent a low number of false positives (Type I errors).
\[ \text{Specificity} = \frac{\text{\# correctly predicted absent}}{\text{\# actually absent}} = \frac{TN}{FP + TN} \]

Precision: Given a positive prediction, what is the probability that it is correct? High values represent a low number of false positives (Type I errors).
\[ \text{Precision} = \frac{\text{\# correctly predicted present}}{\text{\# predicted present}} = \frac{TP}{TP + FP} \]

Negative Predictive Value (NPV): Given a negative prediction, what is the probability that it is correct? High values represent a low number of false negatives (Type II errors).
\[ \text{NPV} = \frac{\text{\# correctly predicted absent}}{\text{\# predicted absent}} = \frac{TN}{FN + TN} \]

Table B.3: Overview of long-read simulated sequencing experiments for E. coli K12, Sim1, and Sim2 taxonomic distributions.

Distribution  Gm    Size (bp)    Reads    ORFs     Annotated CDS  Predicted Pathways  Recovered (%)
E. coli K12   0.03  143,339      180      243      150            71                  0.20
E. coli K12   0.06  285,694      360      494      309            85                  0.24
E. coli K12   0.12  575,821      720      973      604            154                 0.44
E. coli K12   0.25  1,151,303    1,438    1,992    1,207          206                 0.58
E. coli K12   0.50  2,305,899    2,876    4,030    2,401          271                 0.77
E. coli K12   1.0   4,594,877    5,750    7,996    4,859          352                 0.99
Sim1          0.03  1,249,104    1,564    2,978    771            208                 0.32
Sim1          0.06  2,501,345    3,126    5,857    1,487          268                 0.41
Sim1          0.12  5,003,783    6,250    11,769   3,120          392                 0.61
Sim1          0.25  10,014,496   12,500   23,422   6,107          475                 0.74
Sim1          0.50  19,991,551   25,000   47,304   12,139         542                 0.84
Sim1          1.0   40,016,291   50,000   94,438   24,388         604                 0.93
Sim2          0.03  1,245,781    1,562    2,946    760            175                 0.27
Sim2          0.06  2,496,313    3,126    5,880    1,538          247                 0.38
Sim2          0.12  4,987,646    6,250    11,756   3,154          339                 0.52
Sim2          0.25  9,994,331    12,500   23,852   6,330          425                 0.66
Sim2          0.50  19,993,717   25,000   47,350   12,676         471                 0.73
Sim2          1.0   40,006,531   50,000   94,366   25,139         550                 0.85

Table B.4: Overview of short-read simulated sequencing experiments for E. coli K12, Sim1, Sim2, and HOT (25m) taxonomic distributions.

Distribution  Gm    Size (bp)    Reads    ORFs     Annotated CDS  Predicted Pathways  Recovered (%)
E. coli K12   0.03  139,983      540      179      37             9                   0.03
E. coli K12   0.06  292,642      1,125    356      68             27                  0.08
E. coli K12   0.12  584,031      2,250    742      128            51                  0.14
E. coli K12   0.25  1,168,390    4,500    1,445    269            96                  0.27
E. coli K12   0.50  2,340,834    9,000    2,878    516            121                 0.34
E. coli K12   1.0   4,676,245    18,000   5,884    1,013          181                 0.51
Sim1          0.03  1,283,742    4,738    4,151    2,261          108                 0.17
Sim1          0.06  2,570,031    9,476    8,266    4,576          169                 0.26
Sim1          0.12  5,140,469    18,975   16,549   9,132          239                 0.37
Sim1          0.25  10,271,637   37,904   33,164   18,270         316                 0.49
Sim1          0.50  20,540,345   75,808   66,260   36,443         431                 0.67
Sim1          1.0   41,097,945   151,616  132,577  72,937         499                 0.77
Sim2          0.03  1,282,621    4,738    4,337    2,666          113                 0.17
Sim2          0.06  2,567,379    9,476    8,657    5,193          171                 0.26
Sim2          0.12  5,133,838    18,952   17,313   10,496         237                 0.37
Sim2          0.25  10,264,228   37,904   34,624   21,301         334                 0.52
Sim2          0.50  20,545,013   75,808   69,256   41,901         392                 0.61
Sim2          1.0   41,074,096   151,616  138,593  83,929         502                 0.78
HOT (25m)     0.05  8,012,746    31,178   6,668    5,978          336                 0.42
HOT (25m)     0.10  16,025,492   62,356   13,478   12,087         398                 0.50
HOT (25m)     0.15  24,038,238   93,534   20,054   17,953         438                 0.55
HOT (25m)     0.20  32,050,984   124,712  26,836   23,972         462                 0.58
HOT (25m)     0.40  64,101,968   249,424  53,300   47,617         526                 0.66
HOT (25m)     0.60  96,152,695   374,135  80,080   71,599         555                 0.70
HOT (25m)     0.80  128,203,679  498,847  106,985  95,766         585                 0.73
HOT (25m)     1.0   160,254,663  623,559  133,836  119,867        593                 0.74

Ideally one would inspect the confusion table directly; however, because comparing many values is onerous, a number of statistics have been developed to summarize the overall performance recorded in a confusion table.

Accuracy: the most intuitive measure, but it can be misleading if the numbers of positive and negative results are not of similar magnitude. It asks: of all the decisions that the classifier made, how many were correct?
Examples of the behaviour of accuracy can be seen in Table B.5.
\[ \text{Accuracy} = \frac{\text{\# correct predictions}}{\text{Total predictions}} = \frac{TP + TN}{TP + TN + FP + FN} \]

F-measure: the harmonic mean of precision and sensitivity. It therefore represents the number of correctly predicted positives balanced against both false-positive and false-negative errors. However, it does not take into account the number of true-negative responses, which can be important depending on the application.
\[ \text{F-measure} = 2 \times \frac{\text{Precision} \times \text{Sensitivity}}{\text{Precision} + \text{Sensitivity}} \]

                      Actual Positive         Actual Negative
Predicted Positive    True Positives (TP)     False Positives (FP)    Precision = TP / (TP + FP)
Predicted Negative    False Negatives (FN)    True Negatives (TN)     Negative Predictive Value = TN / (FN + TN)
                      Sensitivity =           Specificity =
                      TP / (TP + FN)          TN / (FP + TN)

Figure B.2: Illustrative confusion table. Confusion tables summarize the performance of a binary classifier through the tabulation of correct predictions, where the predicted class and the actual class agree (true positives (TP) and true negatives (TN)), and incorrect predictions, where they disagree (false positives (FP) and false negatives (FN)). A number of additional statistics further summarize a confusion table for a particular classification task or interest: those that capture the overall performance in recovering the existing positives and negatives in a population (Sensitivity and Specificity), as well as conditional statistics on the likelihood that a positive or negative result is correct (Precision and Negative Predictive Value (NPV)). Figure originally published in BMC Genomics under the Creative Commons Attribution Licence v2.0 [171].

Matthews Correlation Coefficient (MCC): a comprehensive measure that controls for the difference between the total numbers of positives and negatives in a test or training sample. Essentially it is a correlation coefficient between observed and predicted responses where +1 is perfect prediction, −1 is total disagreement, and 0 is no better than random guessing (i.e., no correlation).
The Matthews correlation is widely regarded as the best overall summary statistic of a confusion table because it is robust to unequal class sizes and takes into account both Type I and Type II errors.
\[ \mathrm{MCC} = \frac{(TP)(TN) - (FP)(FN)}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \]

Table B.5: Behaviour of accuracy for a variety of confusion tables.

Classifier  TP   TN    FP    FN   Accuracy (%)
A           25   75    25    75   50
B           0    150   0     50   75
C           50   0     150   0    25
D           30   100   50    20   65

[Figure B.3: line plots. Panels: K12, Sim1, Sim2; x-axis: Gm (0.0 to 1.0); y-axis: Performance; series: Accuracy, F-measure, Matthews.]
Figure B.3: Summarizing performance measures for long-read simulations. Performance measures Accuracy, F-measure, and Matthews correlation coefficient for simulated long-read sequencing experiments of E. coli K12, Sim1, and Sim2 at progressively larger genomic sequence coverage. Figure originally published in BMC Genomics under the Creative Commons Attribution Licence v2.0 [171].

[Figure B.4: line plots. Panels: K12, Sim1, Sim2, HOT (25m); x-axis: Gm (0.0 to 1.0); y-axis: Performance; series: Accuracy, F-measure, Matthews.]
Figure B.4: Summarizing performance measures for short-read simulations. Performance measures Accuracy, F-measure, and Matthews correlation coefficient for simulated short-read sequencing experiments of the E. coli K12, Sim1, Sim2, and HOT 25 m metagenome at progressively larger genomic sequence coverage. Figure originally published in BMC Genomics under the Creative Commons Attribution Licence v2.0 [171].

Table B.6: Confusion tables of pathway prediction using simulated long-read sequencing upon the E. coli K12 genome, Sim1, and Sim2 at progressively larger genomic sequence coverage.

Distribution  Gm    TP    TN      FP   FN    P     N
E. coli K12   0.03  71    1,453   5    317   76    1,770
E. coli K12   0.06  69    1,428   30   319   99    1,747
E. coli K12   0.12  124   1,418   40   264   164   1,682
E. coli K12   0.25  195   1,423   35   193   230   1,616
E. coli K12   0.50  242   1,402   56   146   298   1,548
E. coli K12   1.0   328   1,397   61   60    389   1,457
Sim1          0.03  200   1,023   8    446   208   1,469
Sim1          0.06  244   1,007   24   402   268   1,409
Sim1          0.12  368   1,007   24   278   392   1,285
Sim1          0.25  446   1,002   29   200   475   1,202
Sim1          0.50  517   1,006   25   129   542   1,135
Sim1          1.0   576   1,003   28   70    604   1,073
Sim2          0.03  163   1,019   12   483   175   1,502
Sim2          0.06  235   1,019   12   411   247   1,430
Sim2          0.12  316   1,008   23   330   339   1,338
Sim2          0.25  403   1,009   22   243   425   1,252
Sim2          0.50  455   1,015   16   191   471   1,206
Sim2          1.0   521   1,002   29   125   550   1,127

Table B.7: Confusion tables of pathway prediction for simulated short-read sequencing experiments of the E. coli K12, Sim1, Sim2, and the HOT 25 m metagenome at progressively larger genomic sequence coverage.

Distribution  Gm    TP    TN      FP   FN    P     N
E. coli K12   0.03  9     1,323   0    345   9     1,668
E. coli K12   0.06  21    1,317   6    333   27    1,650
E. coli K12   0.12  45    1,317   6    309   51    1,626
E. coli K12   0.25  73    1,300   23   281   96    1,581
E. coli K12   0.50  102   1,304   19   252   121   1,556
E. coli K12   1.0   153   1,295   28   201   181   1,496
Sim1          0.03  98    1,021   10   548   108   1,569
Sim1          0.06  145   1,007   24   501   169   1,508
Sim1          0.12  211   1,003   28   435   239   1,438
Sim1          0.25  272   987     44   374   316   1,361
Sim1          0.50  373   973     58   273   431   1,246
Sim1          1.0   423   955     76   223   499   1,178
Sim2          0.03  99    1,017   14   547   113   1,564
Sim2          0.06  148   1,008   23   498   171   1,506
Sim2          0.12  206   1,000   31   440   237   1,440
Sim2          0.25  295   992     39   351   334   1,343
Sim2          0.50  347   986     45   299   392   1,285
Sim2          1.0   434   963     68   212   502   1,175
HOT (25m)     0.05  323   868     13   473   336   1,341
HOT (25m)     0.10  386   869     12   410   398   1,279
HOT (25m)     0.15  420   863     18   376   438   1,239
HOT (25m)     0.20  443   862     19   353   462   1,215
HOT (25m)     0.40  501   856     25   295   526   1,151
HOT (25m)     0.60  526   852     29   270   555   1,122
HOT (25m)     0.80  552   848     33   244   585   1,092
HOT (25m)     1.0   561   849     32   235   593   1,084

Table B.8: Results of taxonomic pruning pathway recovery experiments for simulated metagenomes Sim1 and Sim2 and the HOT 25 m metagenome using the ‘Unclassified sequences’ taxonomic parameter.

Distribution  Read Length  Pruning  No Pruning  Reduction (%)
Sim1          Long         260      604         56.95
Sim1          Short        194      499         61.12
Sim2          Long         222      550         59.64
Sim2          Short        184      502         63.35
HOT (25m)     N/A          425      593         28.33

[Figure B.5: histograms. Columns: DNA, RNA; rows: 25m, 75m, 110m, 500m; x-axis: Taxonomic Distance; y-axis: Pathway Frequency; legend (Disagreement Class): None, Low, Medium, High.]
Figure B.5: Distribution of weighted taxonomic distance from HOT predicted pathways. Positive distances represent instances where the observed taxonomy was a descendant of the MetaCyc taxonomic range for that pathway, while negative distances represent divergent taxonomies. Predicted pathways were classified into taxonomic disagreement classes based on the weighted distance distribution for the sample; “None” contains positive distances, “Low” contains the upper two negative quartiles, while the “Medium” and “High” disagreement classes contain the lower two negative quartiles, respectively. Figure originally published in BMC Genomics under the Creative Commons Attribution Licence v2.0 [171].
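The summary statistics defined in Section B.1 can be computed directly from the four counts of a confusion table. The following is a minimal sketch rather than the code used in this work: the function names `safe_div` and `summarize` are hypothetical, undefined ratios (zero denominators) are reported as NaN, and the MCC is set to 0 when its denominator is zero, which is a common convention assumed here.

```python
import math


def safe_div(num, den):
    """Return num / den, or NaN when the denominator is zero (assumed convention)."""
    return num / den if den else float("nan")


def summarize(tp, tn, fp, fn):
    """Summary statistics of a binary confusion table (hypothetical helper)."""
    sensitivity = safe_div(tp, tp + fn)   # TP / (TP + FN)
    specificity = safe_div(tn, fp + tn)   # TN / (FP + TN)
    precision = safe_div(tp, tp + fp)     # TP / (TP + FP)
    npv = safe_div(tn, fn + tn)           # TN / (FN + TN)
    accuracy = safe_div(tp + tn, tp + tn + fp + fn)
    f_measure = safe_div(2 * precision * sensitivity, precision + sensitivity)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumption: report MCC = 0 when any marginal total is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "npv": npv,
        "accuracy": accuracy,
        "f_measure": f_measure,
        "mcc": mcc,
    }


# Counts (TP, TN, FP, FN) taken from Table B.5.
table_b5 = {
    "A": (25, 75, 25, 75),
    "B": (0, 150, 0, 50),
    "C": (50, 0, 150, 0),
    "D": (30, 100, 50, 20),
}

for name, counts in table_b5.items():
    s = summarize(*counts)
    print(f"{name}: accuracy={100 * s['accuracy']:.0f}% "
          f"F-measure={s['f_measure']:.2f} MCC={s['mcc']:.2f}")
```

Applied to the rows of Table B.5, the accuracy values match the table's final column (50, 75, 25, and 65 percent), while classifiers B and C, which predict a single class for every case, receive an MCC of 0 under the convention assumed above.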
[Figure B.6: horizontal bar charts. Columns: DNA, RNA; rows: 25m, 75m, 110m, 500m; x-axis: Pathway Frequency (0 to 400); y-axis: Expected Taxonomic Range (protists, plants, fungi, animals, eukaryotes, bacteria, archaea, cellular organisms, viruses, root); legend (Disagreement Class): None, Low, Medium, High.]
Figure B.6: Disagreement class distribution of HOT predicted pathways by expected taxonomic range. Predicted pathways from HOT datasets were classified into their disagreement class based on WTD. Tabulating these pathways by expected taxonomic range, the majority of pathways classified in the “Medium” and “High” disagreement classes have expected taxonomic ranges within “animals”, “fungi”, and “plants”. Figure originally published in BMC Genomics under the Creative Commons Attribution Licence v2.0 [171].

Table B.9: Total predicted pathways for pairwise combined tier-2 BioCyc genomes: Aurantimonas manganoxydans SI85-9A1 (A), Bacillus subtilis subtilis 168 (B), Caulobacter crescentus NA1000 (C), and Helicobacter pylori 26695 (H).

     A     B     C     H
A    394   497   424   435
B          361   481   402
C                378   416
H                      210

Table B.10: Number of candidate pathways that are potentially distributed by set-difference calculation.

     A     B     C     H
A    -     4     1     6
B          -     6     11
C                -     2
H                      -
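One plausible reading of the set-difference calculation behind Table B.10 is sketched below: a pathway is a candidate for being distributed if it is predicted for a combined pair of genomes but for neither genome on its own. This interpretation, the function name `candidate_distributed`, and the pathway labels are assumptions for illustration only and are not taken from the original analysis.

```python
def candidate_distributed(combined, alone_a, alone_b):
    """Pathways predicted for the combined genomes but for neither genome alone.

    Assumed interpretation of the set-difference calculation:
    candidates = P(A+B) - (P(A) union P(B)).
    """
    return set(combined) - (set(alone_a) | set(alone_b))


# Hypothetical pathway labels, not real predictions.
pathways_a = {"pwy-1", "pwy-2", "pwy-3"}
pathways_b = {"pwy-3", "pwy-4"}
pathways_ab = {"pwy-1", "pwy-2", "pwy-3", "pwy-4", "pwy-5"}

candidates = candidate_distributed(pathways_ab, pathways_a, pathways_b)
print(sorted(candidates))
```

In this toy input only `pwy-5` appears in the combined prediction without appearing in either individual prediction, so it is the single candidate. Under this reading, each off-diagonal cell of Table B.10 would be the size of such a set for the corresponding genome pair.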
[Figure B.7: pathway glyph. Metabolites shown include L-methionine, S-adenosyl-L-methionine, S-adenosyl-L-homocysteine, and L-homocysteine; reactions are labelled EC 2.5.1.6, 2.1.1.-, 3.3.1.1, and 2.1.1.14, with enzymes annotated by source genome (a, b, or a+b).]
Figure B.7: An example of a plausible emergent metabolism pattern. The completion of the pathway requires participation from multiple taxa, e.g., Aurantimonas manganoxydans SI85-9A1 (a) and Bacillus subtilis subtilis 168 (b). Pathway glyphs produced by Pathway Tools can be supplemented with taxonomic information to enable the discovery of patterns of inter-pathway complementarity and potentially distributed metabolic pathways. Figure originally published in BMC Genomics under the Creative Commons Attribution Licence v2.0 [171].
[Figure B.8: dot matrix. Columns: Amino acid, MetaCyc Pathway, Enzyme, Gene; predictions compared between McCutcheon and PathoLogic; legend: Moranella, Tremblaya, Both.]
Figure B.8: Comparison of predicted amino acid pathways in the Candidatus Moranella endobia and Candidatus Tremblaya princeps genomes. Dots represent detected presence of the pathway enzymes in Moranella (red), Tremblaya (blue), or both genomes (purple). Figure originally published in BMC Genomics under the Creative Commons Attribution Licence v2.0 [171].

Table B.11: Summary statistics of pathway prediction for the HOT metagenomes and metatranscriptomes.

Sample           GenBank SRA  Size (bp)      Reads      ORFs       Annotated CDS  MetaCyc Reactions  Predicted Pathways
HOT 25m          SRX007372    160,254,663    623,559    405,613    214,149        4,138              864
HOT 75m          SRX007369    164,376,456    673,674    430,689    222,572        4,052              854
HOT 110m         SRX007370    127,754,820    473,166    336,035    165,775        4,133              860
HOT 500m         SRX007371    274,826,172    995,747    714,743    361,193        4,464              949
HOT 25m (cDNA)   SRX016893    139,331,608    561,821    234,404    85,781         3,433              723
HOT 75m (cDNA)   SRX016897    133,294,602    557,718    203,359    66,855         3,208              669
HOT 110m (cDNA)  SRX156384    90,843,408     398,436    135,107    36,912         2,549              532
HOT 500m (cDNA)  SRX156385    127,589,826    479,661    207,465    71,400         3,034              641
Total                         1,218,271,555  4,763,782  2,667,415  1,224,637      29,011             6,092

[Figure B.9: (a) Euler diagrams for 25m, 75m, 110m, and 500m; (b) bar chart with x-axis Relative CDS Abundance and y-axis MetaCyc Pathways (grouped as Degradation and Biosynthesis); series: HOT 25m (RNA), HOT 75m (RNA), HOT 110m (RNA), HOT 500m (RNA), HOT All Depths (DNA).]
Figure B.9: Overview of unique transcriptomic signal from HOT. (a) Euler diagrams comparing common genomic and transcriptomic pathways for each depth. (b) Unique transcriptomic pathways projected to all depths. Figure originally published in BMC Genomics under the Creative Commons Attribution Licence v2.0 [171].
[Figure B.10: heat maps of pathway counts. Panels: (a), (b); axes: MetaCyc Pathways against DNA Set-difference Subsets (25m (only) through all depths); colour scale: Pathway Counts; classes: Energy Metabolism, Detoxification, Degradation, Other, Biosynthesis, Activation-Inactivation-Interconversion.]
Figure B.10: Predicted pathways unique to HOT metagenome samples. (a) Unique DNA pathways projected at the highest MetaCyc classification. (b) Unique DNA pathways projected to the next MetaCyc sub-classification. Figure originally published in BMC Genomics under the Creative Commons Attribution Licence v2.0 [171].

[Figure B.11: paired bar charts. Panels: 25m, 75m, 110m, 500m (RNA/DNA); x-axis: Relative CDS Abundance (%); y-axis: MetaCyc Pathways, grouped as Transport, Energy, Degradation, and Biosynthesis; series: DNA, RNA.]
Figure B.11: Top-40 predicted pathways based on coding DNA sequence (CDS) and transcript abundance from four HOT depth intervals. The most abundant pathways were largely stable between samples, with the Rubisco shunt, pyruvate fermentation, NADH to cytochrome electron transfer, aerobic respiration, and nitrate reduction varying between sunlit and dark ocean waters. Figure originally published in BMC Genomics under the Creative Commons Attribution Licence v2.0 [171].
[Figure B.12: paired bar chart. Panel: Unique HOT 25m (RNA/DNA); x-axis: Relative CDS Abundance (%); y-axis: MetaCyc Pathways, grouped as Energy, Detoxification, Degradation, and Biosynthesis; series: DNA, RNA.]
Figure B.12: Genomic and transcriptomic signal for pathways unique to the sunlit surface (25 m). The predicted presence of cytochrome oxidase electron transfer and reversible hydrogen production and oxidation indicate a strong signal for aerobic growth. Figure originally published in BMC Genomics under the Creative Commons Attribution Licence v2.0 [171].

[Figure B.13: paired bar chart. Panel: Upper Photic (25m, 75m) RNA/DNA; x-axis: Relative CDS Abundance (%); y-axis: MetaCyc Pathways, grouped as Methanogenesis, Fermentation, General, Degradation, and Biosynthesis; series: DNA, RNA at 25m and 75m.]
Figure B.13: Genomic and transcriptomic signal for pathways unique to upper photic zone (25 m and 75 m). Notable pathways included chlorophyll a biosynthesis, the ethylmalonyl, Entner-Doudoroff, and pyruvate fermentation pathways. Figure originally published in BMC Genomics under the Creative Commons Attribution Licence v2.0 [171].
[Figure: panel "Surface and Deep (25m, 500m) RNA/DNA"; relative CDS abundance (%) in DNA and RNA at 25 m and 500 m for MetaCyc pathways, grouped by pathway class (Transport, Energy, Degradation, Biosynthesis).]

Figure B.14: Genomic and transcriptomic signal for pathways unique to surface (25 m) and deep (500 m) depth intervals. There exist a limited number of pathways common to the surface and deep, but note that the largest signal is for nitrate and sulfate reduction, the first steps of sulfur recycling being shared with nitrate reduction. Figure originally published in BMC Genomics under the Creative Commons Attribution Licence v2.0 [171].
[Figure: panel "Surface and Deep (25m,75m,500m)"; relative CDS abundance (%) in DNA and RNA at 25 m, 75 m, and 500 m for MetaCyc pathways, grouped by pathway class (Energy, Degradation, Biosynthesis).]

Figure B.15: Unique pathways to the 'photic and deep' samples (25 m, 75 m, and 500 m). This set is characterized by a strong signal for the TCA cycle, ketolysis and pyruvate fermentation, as well as a large number of organic matter degradation pathways. Figure originally published in BMC Genomics under the Creative Commons Attribution Licence v2.0 [171].

[Figure: panel "Lower Euphotic (110m) RNA/DNA"; relative CDS abundance (%) in DNA and RNA at 110 m for MetaCyc pathways, grouped by pathway class (Degradation, Biosynthesis).]

Figure B.16: Unique pathways to the lower euphotic 110 m sample. Figure originally published in BMC Genomics under the Creative Commons Attribution Licence v2.0 [171].

[Figure: panel "Upper & Lower Euphotic"; relative CDS abundance (%) in DNA and RNA at 25 m and 110 m for MetaCyc pathways, grouped by pathway class (Energy, Degradation, Biosynthesis).]

Figure B.17: Unique pathways to the upper and lower euphotic 25 m and 110 m samples. Figure originally published in BMC Genomics under the Creative Commons Attribution Licence v2.0 [171].

Table B.12: Examples of observed pathway prediction hazards from the HOT analysis.
Columns: Pathway Name, EC Promiscuity, Pathway Variants, Unique Reactions, Taxonomic Range
threonine degradation IV X
lysine degradation X X X
ethanol degradation X
molybdenum cofactor biosynthesis X
thiamin diphosphate biosynthesis X
adenosylcobalamin biosynthesis X
TCA cycle variants X
NAD-related pathways X
intra-aerobic nitrate reduction X
ammonia assimilation X
sucrose degradation II X X
nitrate reduction X X
dTDP-D-desosamine biosynthesis X
mannitol degradation I X
linamarin degradation X
limonene degradation II (L-limonene) X

