UBC Faculty Research and Publications

Metabolic pathways for the whole community Hanson, Niels W; Konwar, Kishori M; Hawley, Alyse K; Altman, Tomer; Karp, Peter D; Hallam, Steven J Jul 22, 2014

Full Text

METHODOLOGY ARTICLE (Open Access)

Metabolic pathways for the whole community

Niels W Hanson1, Kishori M Konwar2, Alyse K Hawley2, Tomer Altman3, Peter D Karp4 and Steven J Hallam1,2*

Hanson et al. BMC Genomics 2014, 15:619
http://www.biomedcentral.com/1471-2164/15/619

* Correspondence: shallam@mail.ubc.ca
1 Graduate Program in Bioinformatics, University of British Columbia, Genome Sciences Centre, 100-570 West 7th Avenue, Vancouver, British Columbia V5Z 4S6, Canada
2 Department of Microbiology & Immunology, University of British Columbia, 2552-2350 Health Sciences Mall, Vancouver, British Columbia V6T 1Z3, Canada
Full list of author information is available at the end of the article

© 2014 Hanson et al.; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Abstract

Background: A convergence of high-throughput sequencing and computational power is transforming biology into information science. Despite these technological advances, converting bits and bytes of sequence information into meaningful insights remains a challenging enterprise. Biological systems operate on multiple hierarchical levels from genomes to biomes. Holistic understanding of biological systems requires agile software tools that permit comparative analyses across multiple information levels (DNA, RNA, protein, and metabolites) to identify emergent properties, diagnose system states, or predict responses to environmental change.

Results: Here we adopt the MetaPathways annotation and analysis pipeline and Pathway Tools to construct environmental pathway/genome databases (ePGDBs) that describe microbial community metabolism using MetaCyc, a highly curated database of metabolic pathways and components covering all domains of life. We evaluate Pathway Tools’ performance on three datasets with different complexity and coding potential, including simulated metagenomes, a symbiotic system, and the Hawaii Ocean Time-series. We define accuracy and sensitivity relationships between read length, coverage and pathway recovery and evaluate the impact of taxonomic pruning on ePGDB construction and interpretation. Resulting ePGDBs provide interactive metabolic maps, predict emergent metabolic pathways associated with biosynthesis and energy production and differentiate between genomic potential and phenotypic expression across defined environmental gradients.

Conclusions: This multi-tiered analysis provides the user community with specific operating guidelines, performance metrics and prediction hazards for more reliable ePGDB construction and interpretation. Moreover, it demonstrates the power of Pathway Tools in predicting metabolic interactions in natural and engineered ecosystems.

Background

Community interactions between uncultivated microorganisms give rise to dynamic metabolic networks integral to ecosystem function and global scale biogeochemical cycles [1]. Metagenomics bridges the “cultivation gap” through plurality or single-cell sequencing by providing direct and quantitative insight into microbial community structure and function [2,3]. Although new technologies are rapidly expanding our capacity to chart microbial sequence space, persistent computational and analytical bottlenecks impede comparative analyses across multiple information levels (DNA, RNA, protein and metabolites) [4,5]. This in turn limits our ability to convert the genetic potential and phenotypic expression of microbial communities into predictive insights and technological or therapeutic innovations.

Functional genes operate within the structure of metabolic pathways and reactions that define metabolic networks. Despite this fact, few metagenomic studies use pathway-centric approaches to predict microbial community interaction networks based on known biochemical rules. Recently, algorithms for pathway prediction and metabolic flux have been developed for environmental sequence information including the Human Microbiome Project Unified Metabolic Analysis Network (HUMAnN) and Predicted Relative Metabolic Turnover (PRMT). HUMAnN uses an integer optimization algorithm that conservatively computes a parsimonious minimum set of reactions along KEGG pathways based on pathway presence, absence or completion [6,7]. PRMT infers metabolic flux based on normalized enzyme activity counts mapped to KEGG pathways across multiple metagenomes [8]. Because KEGG pathways are coarse and do not discriminate between pathway variants, both modes of analysis have limited metabolic resolution [9]. Moreover, neither HUMAnN nor PRMT provides a coherent structure for exploring and interpreting predicted KEGG pathways.

One alternative to HUMAnN and PRMT is Pathway Tools, a production-quality software environment supporting metabolic inference and flux balance analysis based on the MetaCyc database of metabolic pathways and enzymes representing all domains of life [10-13]. Unlike KEGG or SEED subsystems, MetaCyc emphasizes smaller, evolutionarily conserved or co-regulated units of metabolism and contains the largest collection (over 2000) of experimentally validated metabolic pathways. Extensively commented pathway descriptions, literature citations, and enzyme properties combined within a pathway/genome database (PGDB) provide a coherent structure for exploring and interpreting predicted pathways. Although initially conceived for cellular organisms, recent development of the MetaPathways pipeline extends the PGDB concept to environmental sequence information enabling pathway-centric insights into microbial community structure and function [14,15].

Here we provide essential guidelines for generating and interpreting ePGDBs inspired by the multi-tiered structure of BioCyc [16] (Figure 1). We begin with genome and metagenome simulations to assess performance on datasets manifesting different read length, coverage and taxonomic diversity and we develop a weighted taxonomic distance to evaluate concordance between pathways predicted using environmental sequence information and reference pathways in the MetaCyc database.
Given these metrics, we demonstrate Pathway Tools’ power to predict emergent metabolism in simulated metagenomes and a previously characterized symbiotic system [17]. Finally, we generate ePGDBs using coupled metagenomic and metatranscriptomic datasets from the Hawaii Ocean Time-series (HOT) to compare and contrast genetic potential and phenotypic expression along defined environmental gradients in the ocean [18-20].

Results and discussion

Performance considerations

Environmental pathway/genome database (ePGDB) construction commences with the MetaPathways automated annotation pipeline using environmental sequence information as input (Materials and Methods). Resulting annotations are used by the PathoLogic algorithm implemented in Pathway Tools to predict metabolic pathways based on multiple criteria including proportion of pathways found, pathway specific enzymatic reactions, and purported taxon-specific pathway distributions. PathoLogic is known to perform well when compared to machine learning methods using the genomes of cellular organisms as input [21]. We previously reported PathoLogic’s performance on combined and incomplete genomes using two simulated metagenomes (Sim1 and Sim2) derived from 10 BioCyc tier-2 PGDBs manifesting different coverage and taxonomic diversity using MetaSim [14,22]. Simulations on increasing proportions of the total component genome length (Gm) showed that the performance of pathway recovery based on multiple metrics (F-measure, Matthews Correlation Coefficient, etc.) increased with sequence coverage and sample diversity nearing an asymptote at higher coverage (Figure 2a). This suggests that pathway prediction follows a collector’s curve in which common core pathways accumulate in the early part of the curve followed by less common accessory pathways near the asymptote.

To better constrain pathway recovery and performance in relation to ePGDB construction we compared results of MetaSim experiments using the Escherichia coli K12 substr. MG1655 genome (basis of the EcoCyc database), Sim1 and Sim2, and a subsampled 25 m metagenome from HOT [19] (Additional file 1: Materials and Methods, Tables S1-S4 and Figure S1). Simulations were performed at progressively larger Gm coverage. Consistent with previous observations for Sim1 and Sim2, all experiments showed that pathway recovery percentage and performance sensitivity increased with sequence coverage and sample diversity nearing an asymptote at higher coverage (Figure 2a-b). The absolute values of these patterns were sensitive to read length and likely reflected limits imposed by open reading frame prediction and BLAST/LAST-based annotation. In contrast, performance specificity was high (>85%) regardless of read length, coverage, or taxonomic diversity (Figure 2b). The rate of pathway recovery increased proportionally with increasing sample diversity at lower coverage values, as seen in the reduction of pathway recovery percentage between Sim1, Sim2 and E. coli for long read (~700 bp) and between HOT, Sim1/2 and E. coli for short read (~160 bp) datasets. Additional performance metrics can be found in Additional file 1: Tables S5–S8. Because PathoLogic performance improves with increasing read length, coverage and sample diversity, sequencing platform selection and use of assembled versus unassembled sequence information should be considered when generating ePGDBs.

When constructing PGDBs for individual genomes PathoLogic uses a process called taxonomic pruning to constrain pathway predictions within a specified taxonomic lineage by taking advantage of the curated ‘taxonomic-range’ associated with a given pathway. For example, if a pathway is found only in plants, it will be difficult to predict this pathway in the genome of a bacterial isolate when using taxonomic pruning. Such a process is intended to reduce false positive predictions in individual genomes [12]; however, microbial communities are composed of diverse and largely uncultivated lineages whose combined metabolic potential and phenotypic expression must be considered both within and between individuals. Thus the taxonomic origin of environmental sequence information is more difficult to ascertain with the same degree of certainty as individual microbial genomes sourced from isolates or single-cells. Indeed, the true taxonomic range of many pathways remains to be constrained given the limited number of isolate genomes and the proclivity for horizontal gene transfer within microbial communities.

Figure 1 A multi-tiered approach to ePGDB validation. (a) In the absence of highly curated and validated datasets, we took inspiration from the curation-tiered structure of available pathway/genome databases within the BioCyc family. (b/c) Through in silico simulated sequencing experiments on the E. coli K12 genome and two simulated metagenomes, we evaluated the performance of the PathoLogic algorithm under changing sequence coverage and taxonomic distributions. (d) We reanalyzed the genomes of Candidatus Moranella endobia and Candidatus Tremblaya princeps, two symbiotic taxa with reduced genomes, sharing a number of essential amino acid pathways. (e) Finally, we predicted pathways from a previously analyzed paired metagenomic and metatranscriptomic dataset from the Hawaii Ocean Time-series to validate on previously identified pathways and metabolic functions.

In order to evaluate the impact of taxonomic pruning on pathway recovery from environmental sequence information we constructed ePGDBs enabling or disabling taxon-specific pathway distributions (Additional file 1: Table S9). We ran PathoLogic on Sim1/2 and 25 m HOT datasets with the ‘Unclassified sequences’ pruning threshold and without pruning. With taxonomic pruning enabled, long read and short read Sim1 ePGDBs exhibited a reduction of 56% (206 compared to 604) and 61% (194 compared to 499) predicted pathways, respectively. Interestingly, the subsampled 25 m HOT dataset exhibited a 28% reduction (425 compared to 593) in pathway recovery with and without pruning suggesting that increased sample complexity can partially offset taxon specific sensitivity losses. In all cases, the pathways predicted with taxonomic pruning were a subset of pathways predicted without taxonomic pruning. Given these observations we posit that strict taxonomic pruning is inappropriate for ePGDB construction while recognizing potential prediction hazards associated with pathways predicted outside of their expected taxonomic range.

To evaluate concordance between pathways predicted using environmental sequence information and reference pathways in the MetaCyc database we developed a weighted taxonomic distance (WTD) algorithm. The WTD algorithm measures the taxonomic distance between predicted coding DNA sequences (CDS), e.g., BLAST hits from the RefSeq database, and expected taxonomic range for each predicted pathway using the NCBI Taxonomy Database. The NCBI Taxonomy Database is hierarchically structured, and a path between the lowest common ancestor (LCA) of observed CDS annotations and each member of the expected taxonomic range in a pathway can be charted [23], where each path length represents some measure of taxonomic distance e.g. root, cellular organism, domain, phylum/division, class, order, family, genus, species. Steps on the path near the root of the hierarchy define greater evolutionary distances than those near the tips. Thus the WTD algorithm weights steps on the connecting path by a factor of 1/2^d, where d is the depth position of a particular taxon in the hierarchy (Additional file 1: Supplementary Note 2). To distinguish between paths descending from the expected taxonomic range and those falling outside the expected taxonomic range, paths descending from an expected taxonomic range have a non-negative distance and paths outside this range have a negative distance. The WTD algorithm gives preference to non-negative distances within expected taxonomic range(s), returning the minimum distance if found. Otherwise the maximum negative distance (i.e., closest to zero) is returned.

Figure 2 Analysis on in silico simulated sequencing experiments across different levels of coverage, sequencing lengths, and taxonomic distributions. (a) Predicted pathway recovery as a fraction of the total pathways predicted from the full genomes. (b) Sensitivity (circles) and precision (triangles) of predicted pathways of the in silico experiments using the pathways predicted on the full genomes as the gold standard.
[Panels: long read (~700 bp) and short read (~160 bp); x-axis: Gm; y-axes: pathway recovery fraction (a) and performance (b); series: HOT (25 m), K12, Sim1, Sim2.]

When the WTD algorithm was applied to HOT datasets, the taxonomic distribution of predicted pathways generally aligned with the expected taxonomic ranges of MetaCyc Pathways (Additional file 1: Figure S2). Predicted pathways were classified into four categories of taxonomic disagreement based on their WTD: “None” if the WTD was positive, and “Low”, “Medium”, and “High” if less than or equal to zero, based on distance quartiles. A pathway had “Low” taxonomic disagreement if in the upper two quartiles of negative distances (i.e., those closest to zero), “Medium” if in the second quartile, and “High” if in the bottom (i.e., most negative) quartile. Pathways with expected taxonomic ranges affiliated with bacteria and archaea dominated the “None”, “Low”, and “Medium” disagreement classes, while pathways with expected taxonomic ranges affiliated with eukaryotes including “animals”, “fungi”, and “plants” comprised the majority of the “High” disagreement class (Additional file 1: Figure S3). While not excluded from downstream analysis, pathways with distances in the “High” disagreement class are more likely to represent false positives and should be interpreted with care.

Distributed metabolic pathways

Public good dynamics play an integral role in shaping microbial interactions through distributed networks of metabolite exchange [24]. Such networks promote increased fitness and resilience and may explain the underlying difficulty in cultivating most environmental microorganisms [25-27]. Because ePGDBs are constructed from environmental sequence information, predicted pathways are represented by multiple donor genotypes providing different levels of sequence coverage for each reaction. By comparing pathway recovery for individual reference genomes to pathway recovery for combinations of reference genomes, it becomes formally possible to use Pathway Tools to identify distributed metabolic pathways that emerge between multiple interacting partners. To test this hypothesis, we selected four Tier-2 reference genomes used in simulation experiments and constructed ePGDBs using all possible pair-wise genome combinations (Additional file 1: Table S10). Thirty distributed pathways were identified in pair-wise genome combinations that were not predicted in PGDBs for individual cellular organisms using set-difference analysis (Additional file 1: Table S11). Common and unique reactions associated with distributed pathways could be identified as composite glyphs in the Pathway Tools genome browser (Additional file 1: Figure S4).

To provide a real world example of distributed metabolic pathway prediction we selected a symbiotic system with known nutritional provisioning requirements. The reduced genomes of Candidatus Moranella endobia and Candidatus Tremblaya princeps (GenBank NC-015735 and NC-015736), bacterial endosymbionts of the mealybug Planococcus citri, have been previously described by McCutcheon and colleagues to distribute biosynthetic pathways for essential amino acids in a process known as “inter-pathway complementarity.” Environmental PGDB construction using the combined Moranella and Tremblaya genomes recovered 43 out of 44 reactions and all 9 distributed amino acid biosynthesis pathways previously reported (Figure 3 and Additional file 1: Figure S5). Given these results, combinatorial ePGDB construction has enormous potential to predict distributed metabolic pathways within defined microbial assemblages e.g., co-cultures or more complex microbial communities in natural and engineered ecosystems.

Comparative community metabolism

To evaluate Pathway Tools’ performance on complex microbial communities at different information levels we compared and contrasted coupled metagenome (DNA) and metatranscriptome (RNA) datasets from 25, 75, 110 m (sunlit or euphotic) and 500 m (dark) ocean depth intervals from HOT [19].
A total of 1026 unique pathways from approximately 1.2 billion base pairs of environmental sequence information were recovered spanning defined environmental gradients including luminosity, salinity, pressure, and oxygen concentration (Additional file 1: Table S12). Of these pathways, 840 met minimal quality control (QC) standards (Materials and Methods) and were used for subsequent set-difference analysis (Figure 4a).

More than 600 pathways were shared in common between the sunlit and dark ocean based on combined DNA and RNA datasets consistent with a conserved metabolic core (Figure 4b). A total of 14 unique pathways were predicted exclusively in sunlit samples with 20 pathways predicted at the intersection of 25, 75 and 110 m depth intervals (Figure 4b). More than 100 unique pathways were predicted for the 500 m complement consistent with increased metabolic potential and niche-specialization with increasing depth (Figure 4b). Interestingly, the normalized proportion of genetic potential (DNA) versus expressed metabolic pathways (DNA/RNA) increased linearly between 25, 75 and 110 m depth intervals (0.4, 0.7 and 1.2, respectively) before plateauing at 500 m (1.2) (Figure 4c). It remains to be determined if this trend reflects an asymptote or an inflection point in pathway expression co-varying as a function of metabolic status, environmental conditions or sample coverage and QC.

A total of 30 pathways were identified exclusively in RNA datasets including 11 pathway variants (Figure 4c and Additional file 1: Figure S6). Expressed cholesterol degradation and tetrahydrobiopterin biosynthesis I were common to all depth intervals. Unique expressed photorespiration and glycolate degradation III pathways were recovered at 25 and 75 m, while ammonia oxidation III, methane oxidation to methanol II, and arginine biosynthesis III were unique to 500 m (Additional file 1: Figure S6).
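The DNA/RNA comparison in this section comes down to set operations over per-sample pathway lists. The following is a minimal sketch of that idea, not the authors' implementation: the function names, the toy pathway counts, and the choice to apply the QC filter to DNA and RNA separately are all assumptions made for illustration. The threshold of 10 ORFs is the one stated in the Figure 4 caption.

```python
from typing import Dict, Set

MIN_ORFS = 10  # QC threshold stated in the Figure 4 caption


def qc_pathways(orf_counts: Dict[str, int], min_orfs: int = MIN_ORFS) -> Set[str]:
    """Keep pathways supported by at least `min_orfs` ORFs (hypothetical helper)."""
    return {pathway for pathway, n in orf_counts.items() if n >= min_orfs}


def partition(dna: Set[str], rna: Set[str]) -> Dict[str, Set[str]]:
    """Split pathways into DNA-only, RNA-only and shared sets."""
    return {
        "dna_only": dna - rna,   # genetic potential without detected expression
        "rna_only": rna - dna,   # expressed but not recovered from DNA
        "shared": dna & rna,     # present and expressed
    }


# Hypothetical toy data: pathway name -> number of supporting ORFs.
dna_counts = {"glycolysis I": 120, "photorespiration": 4, "TCA cycle I": 55}
rna_counts = {"glycolysis I": 40, "photorespiration": 12, "ammonia oxidation III": 15}

result = partition(qc_pathways(dna_counts), qc_pathways(rna_counts))
for name, members in result.items():
    print(name, sorted(members))
```

The same `partition` call can be repeated per depth interval, and the resulting sets intersected across depths, to obtain the kind of multi-way comparison summarized in Figure 4.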
More than 590 pathways were identified exclusively in DNA datasets, while 495 were shared in common between DNA and RNA datasets (Figure 4d).

With respect to functional classes, unique Degradation, Biosynthesis and Energy-Metabolism pathways increased as a function of depth in DNA datasets (Additional file 1: Figure S7a). Within unique degradation classes a progression from amino acids to aromatic-compounds and secondary metabolites was observed between 25, 75, 110 and 500 m depth intervals. A similar progression was observed for a subset of Biosynthetic classes including polyamines, lipids, and cofactors and for Energy-Metabolism including C1-compounds and fermentation (Additional file 1: Figure S7b).

Figure 3 Examples of emergent amino acid metabolism shared between the Moranella endobia and Tremblaya princeps genomes. This figure illustrates examples of emergent metabolic pathways predicted between symbiotic prokaryotes Candidatus Moranella endobia and Candidatus Tremblaya princeps. Enzymes found in Moranella (red), Tremblaya (blue), or both taxa (purple) are highlighted in the pathway glyph diagrams, showing patterns of potentially emergent metabolism. A complete description of all amino acid pathways can be found in Additional file 1: Figure S3.

An evaluation of the 72 most abundant pathways recovered from the combined datasets indicated that 53 were both present and expressed at 25, 75, 110, and 500 m depth intervals. Moreover, several of the most abundant pathways including ammonium transport, Rubisco shunt, NADH to cytochrome electron transfer, pyruvate fermentation, denitrification, Calvin-Benson-Bassham cycle, cysteine biosynthesis I and arginine biosynthesis III exhibited depth-dependent trends in gene expression (Additional file 1: Figure S8). A number of abundant pathways common to 25, 75, 110, and 500 m depth intervals in the DNA datasets were exclusively expressed in sunlit or dark ocean waters (Figure 5). In sunlit waters these included photosynthesis light reactions, hydrogen production VIII, flavonoid biosynthesis, cofactors including heme, vitamin B-complex (thiamin, adenosylcobalamin), and glutathione for oxidative stress (Figure 5). Below the euphotic zone, the 500 m depth interval exclusively expressed pathways for ribitol, rhamnose, guanosine nucleotide, 2-methylcitrate, and threonine degradation as well as pathways for cofactor biosynthesis including phosphopantothenate, menaquinol-8 (vitamin K), and coenzyme M and several carbohydrate and amino acid biosynthetic pathways including CMP-N-acetylneuraminate I, ADP-L-glycero-beta-D-manno-heptose and glycine biosynthesis IV (Figure 5).

Consistent with previous reports, sunlit waters expressed many photosynthesis-related pathways including aerobic electron transfer, hydrogen production, and cofactors including ubiquinol, heme, vitamin B-complex (nicotinate, thiamine, cobalamin, tetrahydrofolate), chlorophyll a, and retinol biosynthesis [19,20] (Additional file 1: Figures S9 and S10).
In addition to photosynthesis, 25 and 75 m depth intervals (upper euphotic) sets included pathways associated with degradation of plant metabolites including phytate, glucuronate, mannitol, chitin, xylose, arabinose, gallate, and quinolate. Other pathways of interest identified in sunlit waters included organophosphate, urea, and aminobutyrate degradation, as well as pathways for conversion of the plant hormone indole-3 acetic acid and mercury detoxification. Below the euphotic zone, the 500 m depth interval expressed unique pathways for intra-aerobic nitrite reduction, dissimilatory nitrate reduction, the reductive monocarboxylic acid cycle, ammonia oxidation, and methane oxidation to methanol I (Additional file 1: Figure S11). Thus, comparative ePGDB analysis using the combined DNA and RNA datasets differentiated between genomic potential and phenotypic expression across defined environmental gradients in the ocean and revealed known and novel patterns of functional specialization with potential implications for nutrient and energy flow within sunlit and dark ocean waters.

Figure 4 Analysis of predicted pathways from the Hawaii Ocean Time-series. (a) A total of 1033 unique pathways were predicted from the HOT samples (Additional file 3), however only 840 unique pathways remained after all pathways in each sample with fewer than 10 ORFs were removed (Additional file 4). (b) After normalizing by total predicted ORFs (Additional files 5 and 6), a 4-way set analysis of these quality controlled (QC) pathways shows that the samples share a large core of common pathways. (c) Separating unique pathways within the DNA and RNA of each sample revealed that very few pathways were unique to the RNA fraction of each sample. (d) Finally, a set analysis of the unique DNA fraction (light colors), and pathways common to DNA and RNA from each sample (dark colors) found subsets of pathways unique to each fraction (Additional files 7 and 8).

Pathway prediction hazards

While the construction of ePGDBs promotes pathway-centric analysis of environmental sequence information, prediction hazards need to be considered for optimal interpretive power. One common hazard is the ‘multiple mapping problem,’ arising when an enzyme catalyzes conserved or promiscuous reaction steps across multiple pathways or enzyme commission (EC) numbers representing classes with non-specific substrate activity. For example EC represents a non-specific enzyme class for beta-D-glucosides, allowing for spurious prediction of specific carbohydrate degradation pathways. Moreover, PathoLogic has a preference for EC numbers over product descriptions that can further exacerbate false discovery associated with non-specific enzyme classes. Hazards manifesting themselves within pathway variants sharing a number of common or reversible reaction steps have previously been described by Caspi and colleagues in the context of PGDB construction for cellular organisms [28]. For example, the tricarboxylic acid (TCA) cycle has at least eight pathway variants associated with different taxonomic groups and several incomplete or reversible forms that share multiple reaction steps. PathoLogic has difficulty differentiating between TCA cycle variants when reversible pathway components are present even when a diagnostic step such as ATP-citrate lyase for the reductive TCA cycle is missing from the input data. A similar problem occurs when a regulatory protein is used to provide evidence that a pathway exists even when catalytic pathway components are missing from the input. Given that we constructed ePGDBs without taxonomic pruning and that PathoLogic uses automated annotations from multiple taxonomic groups when predicting pathways from environmental sequence information, taxon specific pathways such as plant hormone biosynthesis or innate immunity can be predicted even when organisms known to encode such pathways are absent from the dataset. As described in the performance considerations section, WTD can be used to discern differences between the predicted and expected taxonomic range of pathways pointing to potential hazards prior to interpretation. Indeed, the extent to which these predicted pathways reflect previously unrecognized variants or prediction artifacts remains to be determined. Moreover, this hazard has the potential to confound distributed metabolic pathway identification when sequence coverage is low or microbial community composition is extremely uneven.

Figure 5 Comparison of predicted genomic and transcriptomic pathways with unique expression in the ‘sunlit’ and ‘dark’ HOT samples. Sunlit metabolism was indicative of photosynthesis and aerobic metabolism including photosynthesis light reactions and hydrogen production. Dark metabolism had significantly more degradation pathways.
Some examples of these hazards from the HOT analysis are provided in Additional file 1: Table S13.

The identification of dissimilatory nitrate reduction (denitrification), intra-aerobic nitrite reduction and ammonia oxidation in the combined 500 m HOT DNA and RNA datasets provides a real world example of hazard navigation. Denitrification is a distributed form of energy metabolism resulting in the production of nitrogen gas in oxygen-deficient waters (<20 μM O2 per kg) [29,30]. The first step in denitrification is nitrate reduction to nitrite. In the combined HOT DNA and RNA datasets the predicted pathway variant nitrate reduction IV included a subset of CDS transcripts for ‘nitrate reductase gamma subunit’ (24 in DNA, 79 in RNA) while the predicted pathway variant nitrate reduction I included CDS transcripts for multiple nitrate reductase subunits (Figure 6). While CDS for nitrate reductase subunits originated from a number of different taxa including Alphaproteobacteria, Gammaproteobacteria, Nitrospira and Planctomycetes, 435 out of 523 (83%) predicted nitrate reductase transcripts originated from Nitrospira and Planctomycetes consistent with a role in nitrite oxidation [31-34] (Figure 6). The second step in denitrification is nitrite reduction to nitric oxide. Within the DNA dataset both bacterial and archaeal CDS for nitrite reductase were recovered while transcripts originating from ammonia oxidizing archaea dominated the RNA dataset (Figure 6). Coding sequences/transcripts for downstream pathway components including nitric oxide reductase and nitrous oxide reductase were not detected, although CbbQ/NirQ/NorQ family regulators necessary for inorganic carbon fixation in the Calvin-Benson-Bassham cycle, nitrite and nitric oxide reduction were identified in DNA and RNA datasets [35] (Figure 6). Given that the mean oxygen concentration at 500 m is ~120 μM O2 per kg [18,20], these results are consistent with active water column nitrite and ammonia oxidation processes.
Recent studies in the Eastern Tropical South Pacific OMZ observed changes in the frequency distribution of denitrification genes between free-living (0.2-1.6 μm) and particle-associated (>1.6 μm) size fractions, with nitric oxide reductase and nitrous oxide reductase encoding genes enriched on particles [36]. The extent to which denitrification or anammox processes partition between free-living and particle-associated microorganisms in the HOT water column remains to be determined.

Conclusions
While advances in high throughput sequencing technologies are rapidly giving rise to tens of thousands of environmental datasets, the computational and analytic powers needed to organize, interpret and mobilize these datasets have lagged behind. Conventional BLAST-based annotation methods combined with gene-centric analyses tend to overlook the network properties of microbial communities driving ecological and biogeochemical interactions. We argue that pathway-centric analyses via the MetaPathways pipeline and Pathway Tools provide the scientific user community with an end-to-end solution for comparing ePGDBs constructed from environmental sequence information, revealing known and novel network properties. As with any automated analysis, this method is no replacement for manual curation. Indeed, we have highlighted specific instances where taxonomic range, idiosyncratic annotation, multifunctional enzymes, regulatory functions, and reversible enzymatic forms predicted by Pathway Tools result in interpretive hazards that require expert knowledge to resolve.

Continued development efforts are needed to improve on existing features and add new functionality to both the MetaPathways pipeline and Pathway Tools. Specifically, improved import features amenable to categorical metadata (e.g., taxonomic origin, location, depth) need to be integrated with Pathway Tools 'groups', a feature that enables users to integrate external data and group pathways and objects within Pathway Tools.
The ‘groups’ feature in turn needs to be better integrated into the ‘omics’ viewer, allowing for improved pathway navigation and page summaries within the Pathway Tools browser. Tooltip enhancements that summarize the categorical data mentioned above could further enhance the browsing experience. Current ePGDBs are constructed using concatenated CDS sequences, and improved viewing features are needed that map coverage and noncoding sequence information onto complete contigs. Finally, the PathoLogic algorithm should be improved to incorporate the described prediction hazards and WTD into its calculations. Specifically, one can imagine tree-based algorithmic improvements to PathoLogic akin to the WTD described here that integrate taxonomic information with enzyme or pathway directionality.

Figure 6 Taxonomic and functional breakdown of nitrogen cycling pathways. (a) Nitrogen cycling pathways and reactions assigned by PathoLogic. Arrow color indicates pathway: nitrate reduction I (denitrification) (brown), nitrate reduction IV (dissimilatory) (yellow), and intra-aerobic nitrite reduction (red). Grey numbers adjacent to arrows indicate the number of reads assigned to the reaction in the DNA and RNA (RNA in parentheses). Overlapping circles indicate the distribution of reads across multiple pathways. (b) BLAST-based functional and taxonomic breakdown of reads assigned to reactions in given pathways as indicated by letters A-E. Function was determined by the top RefSeq BLAST hit, reported by the MetaPathways pipeline, and indicated by reaction arrows, with color corresponding to taxa or taxonomic group with known activity: taxa with nitrate and nitrite reducing activity (blue), nitrite oxidizing activity (green), and ammonia oxidizing activity (purple). Grey reactions indicate no reads for enzymatic activity were detected, only regulatory proteins that may be involved in gene expression regulation (*).

Despite current limitations, ePGDBs provide an interactive and holistic data structure in which to investigate distributed metabolism and differentiate between microbial community metabolic potential and phenotypic expression. Thus, ePGDBs provide a functional blueprint of microbial community metabolism that can be harnessed to engineer microbial consortia with defined emergent properties. These properties can in turn be transferred to industrial strains or modeled using MetaFlux to improve process performance [13]. Although the set-difference and visual inspection methods used to identify distributed metabolic pathways described here do not scale for big datasets, future algorithmic improvements will enable comparisons of reference genomes and metagenomes in large numbers. Indeed, splitting the proverbial “reaction arrows” for each step in a given metabolic pathway into taxonomic bins provides a basis for integer optimization methods that compute “distribution” scores and a baseline for monitoring changes in the reaction network associated with environmental change or even human health status. Looking forward, we envision an open source collection of ePGDBs, called EngCyc, analogous to BioCyc [16], which can be queried and compared online, revealing the network properties of microbial communities in natural and engineered ecosystems on a truly global scale.

Methods
Metabolic pathway analysis
Environmental PGDBs were constructed from public datasets using MetaPathways (http://github.com/hallamlab/MetaPathways/) [14] with default parameter settings: open reading frame (ORF) detection by Prodigal (minimum length 60 amino acids), functional annotation by BLAST (e-value 1e-5, blast-score ratio 0.4) against protein databases KEGG [37], COG [38], MetaCyc [11] (version 16.0), and RefSeq [39] (downloaded August 2012), and pathway prediction via the PathoLogic algorithm with taxonomic pruning disabled. Predicted pathways and associated annotated CDS sequences were extracted from created ePGDBs using the utility script extract_pathway_table_from_pgdb.pl included with MetaPathways.

Pathway prediction on simulated data
Simulated sequencing experiments were performed using MetaSim [22] with the following parameter settings (long read: clone size 36000 bp, Gaussian error, mean read length 700 bp, standard deviation 100 bp; short read: Gaussian error, mean 160 bp, standard deviation 40 bp) against the E. coli K12 MG1655 complete nucleotide genome (GenBank: NC_000913) at a series of fractional levels (1/32, 1/16, 1/8, 1/4, 1/2, 1/1) of the total combined length of starting component genomes (Gm). Pathways were predicted using the MetaPathways pipeline, as described above, against each of the resulting sequence sets (Additional file 1: Tables S3 and S4). A classification performance analysis was performed: true positives (TP) were pathways found in both the simulated sample pathways (test set) and the complete gold standard E. coli genome. True negatives (TN) were pathways not predicted in the test set or gold standard. False positives (FP) were pathways found in the test set but not in the gold standard.
Finally, false negatives (FN) were pathways found in the gold standard but not in the test set. Multiple summary statistics for the resulting confusion tables (Sensitivity (Recall), Specificity, Precision, Accuracy, F-measure, and Matthews Correlation Coefficient (MCC)) were calculated. A summary of these performance statistics is provided in the supplement (Additional file 1: Note S1: ‘A Note on Confusion Table Statistics’).

Simulated metagenomes: Sim1, Sim2
Simulated sequencing experiments of metagenomes Sim1 and Sim2 were generated and analyzed as described above for E. coli. To minimize name-mapping problems, we used prokaryotic genomes from the tier-2 BioCyc database collection [21]. The Sim1 metagenome was composed of ten tier-2 BioCyc genomes (Additional file 1: Table S2) in equal copy number, while Sim2 was composed of the Caulobacter crescentus NA1000 genome in 20-fold excess relative to other genomes (Additional file 1: Figure S1). A classification performance analysis was performed as described above with the set of 646 pathways predicted from the complete tier-2 genomes used to derive Sim1 and Sim2 representing the gold standard (Additional file 1: Tables S5-S8).

Simulated metagenomes: HOT (25 m)
A 25 m metagenome from the Hawaii Ocean Time-series was sub-sampled with replacement to different fractional levels (1/20, 1/10, 3/20, 1/5, 2/5, 3/5, 4/5, and 1/1) and pathways were predicted as described above. Similarly, a classification performance analysis was performed with the set of 864 pathways predicted from the complete 454 run representing the gold standard (Additional file 1: Tables S7 and S8).

Taxonomic pruning experiments
The full-Gm simulated sequencing samples for Sim1 and Sim2, both short and long read lengths, and the full-Gm HOT (25 m) sample, had their pathways predicted with the above method, but with taxonomic pruning enabled using the taxonomic lineage parameter set to “Unclassified sequences”. The number of predicted pathways was tabulated and compared with the pathways previously predicted with taxonomic pruning disabled. A simple set analysis showed that within a sample the pruned pathways were a strict subset of the “no-pruning” ones, and the reduction in pathways was calculated (Additional file 1: Table S9).

Weighted taxonomic distance
For each predicted pathway in the HOT dataset, a weighted taxonomic distance (WTD) was calculated using the WTD algorithm (Additional file 1: Supplementary Note 2). First, the lowest common ancestor (LCA) algorithm was applied to a pathway’s RefSeq CDS sequences. The WTD algorithm calculates a weighted distance $D$ between the observed LCA taxonomy $x_{obs}$ and the pathway’s expected taxonomic range(s) $x_{exp} \in TR_{MetaCyc}(p)$, where $TR_{MetaCyc}(p)$ is the set of taxonomic range(s) for a given pathway $p$ on the NCBI Taxonomy Database hierarchy.

The WTD algorithm takes as input $p$ and $x_{obs}$, and calculates a weighted taxonomic distance for each $x_{exp}$ on nodes in the connecting path $P(x_{exp}, x_{obs})$, as

$$D(x_{exp}, x_{obs}) = \sum_{e_{a,b} \in E_{P(x_{exp}, x_{obs})}} \frac{1}{2^{d(a)}},$$

where $e_{a,b}$ is an edge between nodes $a$ and $b$ in the path and $d(a)$ is the depth of node $a$. If $x_{obs}$ descends from the expected taxonomic range $x_{exp}$, then the WTD is assigned a positive value; WTDs for paths descending outside this range are assigned a negative value. After calculating the WTDs for all pairs $x_{exp}$, $x_{obs}$, the WTD algorithm first attempts to return the minimum non-negative distance, i.e., the WTD corresponding to the closest $x_{exp}$ where $x_{obs}$ is a descendant of $x_{exp}$, and returns the maximum negative score, i.e., the one closest to zero, if all observed and expected taxonomies diverge.
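The WTD calculation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation used in this study (see Additional file 1: Supplementary Note 2 for that): the toy taxonomy `TAXONOMY`, the child-to-parent dictionary layout and the function names `lineage`, `depth`, `wtd` and `select_wtd` are all hypothetical, the root is taken to have depth 0, and d(a) is assumed to refer to the parent (shallower) node of each edge.

```python
from typing import Dict, List, Optional

Parent = Dict[str, Optional[str]]  # child -> parent; the root maps to None

# Hypothetical toy taxonomy used only for illustration.
TAXONOMY: Parent = {
    "root": None,
    "Bacteria": "root",
    "Archaea": "root",
    "Proteobacteria": "Bacteria",
    "Gammaproteobacteria": "Proteobacteria",
}


def lineage(node: str, parent: Parent) -> List[str]:
    """Return the path from `node` up to the root, inclusive."""
    path = [node]
    while parent[path[-1]] is not None:
        path.append(parent[path[-1]])
    return path


def depth(node: str, parent: Parent) -> int:
    """Depth of a node, with the root at depth 0 (an assumption)."""
    return len(lineage(node, parent)) - 1


def wtd(x_exp: str, x_obs: str, parent: Parent) -> float:
    """Signed weighted distance between expected and observed taxa."""
    exp_path = lineage(x_exp, parent)
    obs_path = lineage(x_obs, parent)
    exp_set = set(exp_path)
    # Lowest common ancestor: first node on the observed lineage that
    # also lies on the expected lineage.
    lca = next(n for n in obs_path if n in exp_set)
    # Child node of every edge on the connecting path P(x_exp, x_obs).
    children = obs_path[:obs_path.index(lca)] + exp_path[:exp_path.index(lca)]
    # Each edge contributes 1 / 2^d(a); ASSUMPTION: a is the parent node.
    dist = sum(1.0 / 2 ** depth(parent[c], parent) for c in children)
    # Non-negative if x_obs lies within the expected range, else negative.
    return dist if x_exp in obs_path else -dist


def select_wtd(distances: List[float]) -> float:
    """Smallest non-negative WTD if any, else the negative closest to zero."""
    non_negative = [d for d in distances if d >= 0]
    return min(non_negative) if non_negative else max(distances)
```

With this toy tree, an observed LCA of Gammaproteobacteria scored against an expected range of Bacteria stays inside the range and yields a positive distance, whereas scoring it against Archaea crosses the root and yields a negative one. Halving the edge weight at each level means that disagreements near the root dominate the score.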
For each dataset, predicted pathways were assigned to a “Disagreement Class” based on the following criteria: (i) pathways with positive WTD were given the “None” class, (ii) pathways with distances greater than the median of negative WTDs were given the “Low” class, (iii) pathways within the 2nd quartile were given the “Medium” class, and (iv) pathways in the lower quartile were given the “High” disagreement class (Additional file 1: Figure S2). The expected taxonomic ranges of each pathway were then collapsed into the higher taxonomic levels: “root”, “cellular organisms”, “prokaryotes”, “archaea”, “bacteria”, “eukaryotes”, “animals”, “fungi”, “plants”, and “other”, as defined on the NCBI Taxonomy Database hierarchy, and pathway frequencies and disagreement classes were summarized for each sample (Additional file 1: Figure S3).

Distributed metabolic pathway prediction
Four genomes of similar size and complexity from the tier-2 dataset were combined in a pairwise manner: Aurantimonas manganoxydans SI85-9A (GenBank: NZ_AAPJ00000000.1), Bacillus subtilis subtilis 168 (GenBank: AL009126.3), Caulobacter crescentus NA1000 (GenBank: CP001340.1), and Helicobacter pylori 26695 (GenBank: AE000511.1), abbreviated by the first character of their proper names, A, B, C, and H, respectively. The six pairwise and four original genomes were analyzed as described above for E. coli (Additional file 1: Table S10). Pathways predicted in the combined PGDBs were considered candidates for distributed metabolism if they were absent from PGDBs for individual genomes (i.e., found in A and B combined, but not in either A or B individually) (Additional file 1: Table S11 and Additional file 2).
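The set-difference screen for candidate distributed pathways reduces to a few set operations. The sketch below is illustrative only: the function name `distributed_candidates`, the dictionary layout and the pathway identifiers are hypothetical placeholders, and in practice the inputs would be the pathway lists extracted from each individual and combined PGDB.

```python
from itertools import combinations
from typing import Dict, FrozenSet, Set


def distributed_candidates(
    single: Dict[str, Set[str]],
    combined: Dict[FrozenSet[str], Set[str]],
) -> Dict[FrozenSet[str], Set[str]]:
    """Pathways predicted for a genome pair but for neither genome alone."""
    candidates = {}
    for a, b in combinations(sorted(single), 2):
        pair = frozenset((a, b))
        # Remove everything already predicted from either genome by itself.
        candidates[pair] = combined[pair] - (single[a] | single[b])
    return candidates


# Hypothetical pathway identifiers for two genomes, A and B.
single_genomes = {"A": {"pwy1", "pwy2"}, "B": {"pwy2", "pwy3"}}
paired_genomes = {frozenset(("A", "B")): {"pwy1", "pwy2", "pwy3", "pwy4"}}
result = distributed_candidates(single_genomes, paired_genomes)
```

Here only `pwy4` survives for the A and B pair, because it appears in the combined prediction but in neither individual prediction. Pathways flagged this way are candidates only and still require inspection.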
Candidate pathways were manually inspected and deemed ‘plausible’ if there was sufficient coverage, i.e., 75% of reactions in a pathway had associated CDS sequences from both taxa (Additional file 1: Figure S4). Similarly, the Candidatus Moranella endobia and Candidatus Tremblaya princeps genomes (GenBank: NC_015735 and NC_015736) were downloaded from NCBI and analyzed as described above for E. coli. Resulting PGDBs for individual and combined genomes were manually inspected for amino acid biosynthetic pathways described in McCutcheon and von Dohlen [17] (Additional file 1: Figure S5).

Hawaii Ocean Time-series
Unassembled metagenomic and transcriptomic pyrosequences from the Hawaii Ocean Time-series (25 m, 75 m, 110 m, and 500 m) were obtained from the NCBI Sequence Read Archive (SRA Accession: SRX007372, SRX007369, SRX007370, SRX007371, SRX016893, SRX016897, SRX156384, SRX156385) and run through the MetaPathways pipeline using default settings (Additional file 3). To avoid spurious predictions, only pathways with more than ten mapped CDS sequences in an individual sample were used in downstream analysis. The pathways with nine or fewer mapped CDS sequences represent the lower quartile of pathway annotations (Figure 4a, Additional file 4). Pathway CDS counts for each sample were normalized to the total number of unannotated ORFs in each dataset. Count data were then converted to percentages providing relative ORF abundance for each pathway (Additional file 5), along with their weighted taxonomic distances and sample-wise disagreement classes (Additional file 6). Relative CDS abundance of the top-40 pathways from DNA and RNA datasets were compared (Additional file 1: Figure S8). In addition, pathways predicted in the DNA and RNA datasets were compared at each depth interval to provide sample-wise fractions for each depth, e.g., DNA-only, DNA-RNA, and RNA-only (Figure 4c). Given the small number of pathways in the RNA-only sets, no set-difference analysis was needed (Additional file 1: Figure S6).
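The filtering, normalization and DNA/RNA partitioning steps above can be sketched as follows. This is a hedged illustration rather than the R code used for the analysis: the function names `relative_abundance` and `partition`, the parameter `min_cds` and the example values are hypothetical, `total_orfs` stands for whichever per-sample ORF total is used as the denominator, and the cutoff is assumed to keep pathways with at least ten mapped CDS sequences (the text excludes those with nine or fewer).

```python
from typing import Dict, Set


def relative_abundance(
    cds_counts: Dict[str, int], total_orfs: int, min_cds: int = 10
) -> Dict[str, float]:
    """Drop low-support pathways, then express CDS counts as percentages."""
    # ASSUMPTION: pathways with fewer than `min_cds` mapped CDS are discarded.
    kept = {p: n for p, n in cds_counts.items() if n >= min_cds}
    return {p: 100.0 * n / total_orfs for p, n in kept.items()}


def partition(dna: Set[str], rna: Set[str]) -> Dict[str, Set[str]]:
    """Split pathways at one depth into DNA-only, DNA-RNA and RNA-only sets."""
    return {
        "DNA-only": dna - rna,
        "DNA-RNA": dna & rna,
        "RNA-only": rna - dna,
    }


# Hypothetical counts for a single sample.
abundance = relative_abundance({"pwyA": 50, "pwyB": 9}, total_orfs=1000)
fractions = partition({"pwyA", "pwyC"}, {"pwyC", "pwyD"})
```

Keeping the cutoff as a parameter makes it straightforward to test how sensitive downstream comparisons are to the choice of threshold.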
The DNA-only sets were declined and tabulated at various levels of the MetaCyc pathway hierarchy (Additional file 1: Figure S7). A final four-way set analysis was performed on the DNA-only and DNA-RNA pathways at each depth (Figure 4d, Additional files 7 and 8). DNA-RNA set-difference subsets with more than 5 predicted pathways were compared in detail (Additional file 1: Figures S9-S14). All data transformations, set operations, and comparisons were performed in the R statistical environment (http://www.r-project.org), and visualized using the ggplot2 graphical package (http://ggplot2.org) and the d3.js graphical library (http://d3js.org/).

Availability of supporting data
The ten full-length genomes used to create simulated metagenomes can be downloaded from GenBank under accession numbers AE008687-AE008690, NZ_AAPJ00000000.1, AL009126.3, AE005673, CP001340.1, AE000511.1, AE000516, AL123456, NC_007604.1, AE003852, and AE003853.

The symbiotic Candidatus Moranella endobia and Candidatus Tremblaya princeps genomes can be downloaded from GenBank under accession numbers NC_015735 and NC_015736.
The Hawaii Ocean Time-series datasets can be downloaded from the NCBI Sequence Read Archive under accession numbers SRX007372, SRX007369, SRX007370, SRX007371, SRX016893, SRX016897, SRX156384, SRX156385.

Additional files
Additional file 1: Supplementary notes, figures, and tables.
Additional file 2: Summary of candidate pathways that are potentially distributed by set-difference analysis.
Additional file 3: Summary table of 1033 pre-QC predicted pathways and CDS counts for the Hawaii Ocean Time-series samples.
Additional file 4: Summary table of the 840 post-QC predicted pathways and CDS counts for the Hawaii Ocean Time-series samples.
Additional file 5: Summary table of the 840 post-QC predicted pathways and normalized CDS counts for the Hawaii Ocean Time-series samples with taxonomic disagreement class highlighted.
Additional file 6: Summary table of the 840 post-QC predicted pathways and normalized CDS counts for the Hawaii Ocean Time-series samples with observed LCA taxonomies, expected taxonomic ranges, calculated weighted taxonomic distance, and taxonomic disagreement class.
Additional file 7: Summary table of normalized CDS counts for the 593 DNA fraction pathways of samples from the Hawaii Ocean Time-series.
Additional file 8: Summary table of normalized CDS counts for the 495 pathways common to DNA and RNA samples from the Hawaii Ocean Time-series.

Competing interests
The authors are unaware of any competing interests.

Authors’ contributions
NWH conducted simulated metagenome and distributed metabolism experiments, comparative ePGDB analysis and co-wrote the paper. KMK provided pipeline and high performance computing support and co-wrote the paper. AKH participated in interpreting nitrogen cycle hazards in the HOT datasets and provided essential feedback on data products. TA and PDK provided computational and interpretive support related to MetaPathways, PathoLogic, and the MetaCyc database.
NWH, KMK and SJH conceived the weighted taxonomic distance and NWH and KMK described and implemented the algorithm with essential input from TA and PDK. SJH supervised the group, participated in data interpretation, provided essential feedback on data products, integration and formatting, and co-wrote the paper. All authors read and approved the final manuscript.

Acknowledgements
This work was carried out under the auspices of Genome Canada, Genome British Columbia, Genome Alberta, the Natural Science and Engineering Research Council (NSERC) of Canada, the Canadian Foundation for Innovation (CFI) and the Canadian Institute for Advanced Research (CIFAR) through grants awarded to SJH. The Western Canadian Research Grid (WestGrid) provided access to high-performance computing resources. KMK was supported by the Tula Foundation funded Centre for Microbial Diversity and Evolution (CMDE). NWH was supported by a four year doctoral fellowship (4YF) administered through the UBC Graduate Program in Bioinformatics. We would like to thank Suzanne Paley, Ron Caspi, and Quang Ong of SRI International for their patience, technical support, and lucid discussions on the function of Pathway Tools and the PathoLogic algorithm, Antoine Pagé for his participation in preliminary performance evaluations and all members of the Hallam Lab for helpful comments along the way.

Author details
1 Graduate Program in Bioinformatics, University of British Columbia, Genome Sciences Centre, 100-570 West 7th Avenue, Vancouver, British Columbia V5Z 4S6, Canada.
2 Department of Microbiology & Immunology, University of British Columbia, 2552-2350 Health Sciences Mall, Vancouver, British Columbia V6T 1Z3, Canada.
3 Biomedical Informatics Training Program, Stanford University, MSOB, 1265 Welch Road, X-215 MC 5479, Stanford, CA 94305-5479, USA.
4 Bioinformatics Research Group, SRI International, 333 Ravenswood Avenue, Menlo Park, CA 94025-3493, USA.

Received: 14 January 2014 Accepted: 8 July 2014 Published: 22 July 2014

References
1. 
Falkowski PG, Fenchel T, Delong EF: The microbial engines that drive Earth's biogeochemical cycles. Science 2008, 320:1034–1039.
2. Handelsman J: Metagenomics: application of genomics to uncultured microorganisms. Microbiol Mol Biol Rev 2005, 69:195–195.
3. Ishoey T, Woyke T, Stepanauskas R, Novotny M, Lasken RS: Genomic sequencing of single microbial cells from environmental samples. Curr Opin Microbiol 2008, 11:198–204.
4. Wooley JC, Ye Y: Metagenomics: facts and artifacts, and computational challenges. J Comput Sci Technol 2009, 25:71–81.
5. Hey AJ, Tansley S, Tolle KM: The fourth paradigm: data-intensive scientific discovery. Microsoft Research; 2009.
6. Ye Y, Doak TG: A parsimony approach to biological pathway reconstruction/inference for genomes and metagenomes. PLoS Comput Biol 2009, 5:e1000465.
7. Abubucker S, Segata N, Goll J, Schubert AM, Izard J, Cantarel BJ, Rodriguez-Mueller B, Zucker J, Thiagarajan M, Henrissat B, White O, Kelley ST, Methé B, Schloss PD, Gevers D, Mitreva M, Huttenhower C: Metabolic reconstruction for metagenomic data and its application to the human microbiome. PLoS Comput Biol 2012, 8:e1002358.
8. Larsen PE, Collart FR, Field D, Meyer F, Keegan KP, Henry CS, McGrath J, Quinn J, Gilbert JA: Predicted Relative Metabolomic Turnover (PRMT): determining metabolic turnover from a coastal marine metagenomic dataset. Microb Inform Exp 2011, 1:4.
9. Altman T, Travers M, Kothari A, Caspi R, Karp PD: A systematic comparison of the MetaCyc and KEGG pathway databases. BMC Bioinformatics 2013, 14:112.
10. Karp PD, Paley S, Romero P: The pathway tools software. Bioinformatics 2002, 18:S225–S232.
11. Caspi R, Foerster H, Fulcher CA, Hopkinson R, Ingraham J, Kaipa P, Krummenacker M, Paley S, Pick J, Rhee SY, Tissier C, Zhang P, Karp PD: MetaCyc: a multiorganism database of metabolic pathways and enzymes. Nucleic Acids Res 2006, 34:D511–D516.
12. Karp PD, Latendresse M, Caspi R: The pathway tools pathway prediction algorithm. Stand Genomic Sci 2011, 5:424–429.
13. Latendresse M, Krummenacker M, Trupp M, Karp PD: Construction and completion of flux balance models from pathway databases. Bioinformatics 2012, 28:388–396.
14. Konwar KM, Hanson NW, Pagé AP, Hallam SJ: MetaPathways: a modular pipeline for constructing pathway/genome databases from environmental sequence information. BMC Bioinformatics 2013, 14:202.
15. Hanson NW, Konwar KM, Wu S-J, Hallam SJ: MetaPathways v2.0: A master-worker model for environmental Pathway/Genome Database construction on grids and clouds. Conf Proc IEEE Comp Intel in Bioinf and Comp Biology 2014, (28):1–7.
16. Karp PD, Ouzounis CA, Moore-Kochlacs C, Goldovsky L, Kaipa P, Ahrén D, Tsoka S, Darzentas N, Kunin V, López-Bigas N: Expansion of the BioCyc collection of pathway/genome databases to 160 genomes. Nucleic Acids Res 2005, 33:6083–6089.
17. McCutcheon JP, von Dohlen CD: An interdependent metabolic patchwork in the nested symbiosis of mealybugs. Curr Biol 2011, 21:1366–1372.
18. Delong EF, Preston CM, Mincer T, Rich V, Hallam SJ, Frigaard N-U, Martinez A, Sullivan MB, Edwards R, Brito BR, Chisholm SW, Karl DM: Community genomics among stratified microbial assemblages in the ocean's interior. Science 2006, 311:496–503.
19. Stewart FJ, Sharma AK, Bryant JA, Eppley JM, Delong EF: Community transcriptomics reveals universal patterns of protein sequence conservation in natural microbial communities. Genome Biol 2011, 12:R26.
20. Shi Y, Tyson GW, Eppley JM, Delong EF: Integrated metatranscriptomic and metagenomic analyses of stratified microbial assemblages in the open ocean. ISME J 2011, 5:999–1013.
21. Dale JM, Popescu L, Karp PD: Machine learning methods for metabolic pathway prediction. BMC Bioinformatics 2010, 11:15.
22. Richter DC, Ott F, Auch AF, Schmid R, Huson DH: MetaSim - a sequencing simulator for genomics and metagenomics. PLoS ONE 2008, 3:e3373.
23. Huson DH, Auch AF, Qi J, Schuster SC: MEGAN analysis of metagenomic data. Genome Res 2007, 17:377–386.
24. Cordero OX, Ventouras L-A, Delong EF, Polz MF: Public good dynamics drive evolution of iron acquisition strategies in natural bacterioplankton populations. Proc Natl Acad Sci U S A 2012, 109:20059–20064.
25. Ellers J, Toby Kiers E, Currie CR, McDonald BR, Visser B: Ecological interactions drive evolutionary loss of traits. Ecol Lett 2012, 15:1071–1082.
26. Lawrence D, Fiegna F, Behrends V, Bundy JG, Phillimore AB, Bell T, Barraclough TG: Species interactions alter evolutionary responses to a novel environment. PLoS Biol 2012, 10:e1001330.
27. Morris JJ, Lenski RE, Zinser ER: The Black Queen hypothesis: evolution of dependencies through adaptive gene loss. MBio 2012, 3:e00036–12.
28. Caspi R, Dreher K, Karp PD: The challenge of constructing, classifying, and representing metabolic pathways. FEMS Microbiol Lett 2013, 345:85–93.
29. Lam P, Kuypers MMM: Microbial nitrogen cycling processes in oxygen minimum zones. Ann Rev Mar Sci 2011, 3:317–345.
30. Wright JJ, Konwar KM, Hallam SJ: Microbial ecology of expanding oxygen minimum zones. Nat Rev Microbiol 2012, 10:381–394.
31. Ehrich S, Behrens D, Lebedeva E, Ludwig W, Bock E: A new obligately chemolithoautotrophic, nitrite-oxidizing bacterium, Nitrospira moscoviensis sp. nov. and its phylogenetic relationship. Arch Microbiol 1995, 164:16–23.
32. Strous M, Pelletier E, Mangenot S, Rattei T, Lehner A, Taylor MW, Horn M, Daims H, Bartol-Mavel D, Wincker P, Barbe V, Fonknechten N, Vallenet D, Segurens B, Schenowitz-Truong C, Médigue C, Collingro A, Snel B, Dutilh BE, Op den Camp HJM, van der Drift C, Cirpus I, van de Pas-Schoonen KT, Harhangi HR, van Niftrik L, Schmid M, Keltjens J, van de Vossenberg J, Kartal B, Meier H, et al: Deciphering the evolution and metabolism of an anammox bacterium from a community genome. Nature 2006, 440:790–794.
33. Lücker S, Wagner M, Maixner F, Pelletier E, Koch H, Vacherie B, Rattei T, Damsté JSS, Spieck E, Le Paslier D, Daims H: A Nitrospira metagenome illuminates the physiology and evolution of globally important nitrite-oxidizing bacteria. Proc Natl Acad Sci U S A 2010, 107:13479–13484.
34. Kartal B, Maalcke WJ, de Almeida NM, Cirpus I, Gloerich J, Geerts W, Op den Camp HJM, Harhangi HR, Janssen-Megens EM, Francoijs K-J, Stunnenberg HG, Keltjens JT, Jetten MSM, Strous M: Molecular mechanism of anaerobic ammonium oxidation. Nature 2011, 479:127–130.
35. Zumft WG: Cell biology and molecular basis of denitrification. Microbiol Mol Biol Rev 1997, 61:533–616.
36. Ganesh S, Parris DJ, Delong EF, Stewart FJ: Metagenomic analysis of size-fractionated picoplankton in a marine oxygen minimum zone. ISME J 2013. doi:10.1038/ismej.2013.144.
37. Kanehisa M, Goto S: KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids Res 2000, 28:27–30.
38. Tatusov RL, Natale DA, Garkavtsev IV, Tatusova TA, Shankavaram UT, Rao BS, Kiryutin B, Galperin MY, Fedorova ND, Koonin EV: The COG database: new developments in phylogenetic classification of proteins from complete genomes. Nucleic Acids Res 2001, 29:22–28.
39. Pruitt KD, Tatusova T, Maglott DR: NCBI reference sequences (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res 2007, 35:D61–D65.

doi:10.1186/1471-2164-15-619
Cite this article as: Hanson et al.: Metabolic pathways for the whole community. BMC Genomics 2014 15:619.

