UBC Theses and Dissertations

Unearthing the influence of soil organic matter removal on population structure and metabolism of soil… Hahn, Aria Stefanee 2016

Unearthing the influence of soil organic matter removal on population structure and metabolism of soil microbial communities

by

Aria Stefanee Hahn

MSc Soil Science, The University of Alberta, 2012
BSc (Bilingual) ENCS Land Reclamation, The University of Alberta, 2009

A DISSERTATION SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF

DOCTOR OF PHILOSOPHY

in

The Faculty of Graduate and Postdoctoral Studies
(Microbiology and Immunology)

THE UNIVERSITY OF BRITISH COLUMBIA
(Vancouver)

December 2016

© Aria Stefanee Hahn 2016

Abstract

Microorganisms are the stewards and creators of Earth’s ecosystems, driving planetary nutrient and energy cycles. As such, the interactions and metabolic processes of microbial communities have emerged as a fundamental area of scientific research. Through the use of multi-omic (e.g., metagenomics, metatranscriptomics, proteomics) sequence information, it is possible to reconstruct the compositional, regulatory, and distributed metabolic processes connecting microbial community members. This dissertation develops an interpretative framework for the joint analysis of compositional network patterns and metabolic pathway reconstruction using metagenomic and metatranscriptomic data from soils collected from lodgepole pine forests 13 years post-harvesting and soil organic matter removal, and from adjacent undisturbed lodgepole pine forests and Interior Douglas-fir forests, at two Long Term Soil Productivity (LTSP) sites located near Kamloops and Williams Lake, B.C., Canada. Further, this dissertation serves to improve the accuracy with which environmental sequence data are analyzed by leveraging the work of statisticians to overcome known biases in canonical approaches to data normalization. Finally, this work provides a systematic approach to the functional annotation of unassembled data, and extends an existing diversity index to accept data types common in studies involving uncultivated microbial annotation.
Together, the data indicated spatiotemporal variation in, and forest harvesting impact on, metabolic interactions and genomic potential for plant biomass degradation and carbon cycling. However, redundant metabolic capacity combined with genetic variation within the microbial community ensured that natural and anthropogenically induced environmental change had disparate effects across community members, thereby moderating the consequences of localized extinctions or niche space reduction and guarding against the loss of metabolic functions within the soil ecosystem. Indeed, environmental change can result in the reshuffling of trophic relationships and information exchange (e.g., H+, metabolites, and horizontal gene transfer), allowing novel interactions between organisms to form and increasing the community’s ability to tolerate disturbance. This work represents an important step in understanding how environmental changes impact microbial communities and ecosystem function within the soil milieu. Ultimately, the data and findings from this dissertation can be integrated with future analyses of biogeochemical parameter information and thermodynamic principles to enable time-variable forecasts of microbial adaptive response to environmental change.

Preface

This work was made possible through the contributions and dedication of many collaborators. Dr. Steven Hallam, as the research advisor, was involved in all aspects of this work, including experimental design, data analysis and interpretation, and writing. Sections of this work are partly or wholly published, in press, or in review. Copyright licenses were obtained and are listed below.

• Chapter 1: Aria S. Hahn wrote the main text with editorial support from Steven J. Hallam. Part of this work is published as described below.
Aria S. Hahn, Kishori M. Konwar, Stilianos Louca, Niels W. Hanson, Steven J. Hallam. The information science of microbial ecology. Current Opinion in Microbiology, 31 (2016).

• Chapter 2: Aria S. Hahn wrote the main text with input from Sarah E.I. Perez and Steven J. Hallam. Steven J. Hallam and William W. Mohn designed the research. Sangwon Lee collected samples. Sangwon Lee, Melanie (Scofield) Sorensen and Aria S. Hahn performed laboratory work. Aria S. Hahn, Sarah E.I. Perez, Niels W. Hanson and W. Evan Durno developed software code and conducted statistical analysis. Aria S. Hahn interpreted the data with input from Sarah E.I. Perez and Steven J. Hallam. Steven J. Hallam edited the chapter.
Aria S. Hahn, Sarah E.I. Perez, Niels W. Hanson, Sangwon Lee, Melanie Scofield, W. Evan Durno, William W. Mohn and Steven J. Hallam, Probing the depths of microbial community interactions in soil. In Review.

• Chapter 3: Aria S. Hahn wrote the main text with input from Steven J. Hallam. Aria S. Hahn, Natasha J. Sihota, Ashley C. Arnold and Genesis M. Magat collected samples. Aria S. Hahn, Ashley C. Arnold, Andreas Mueller and Melanie (Scofield) Sorensen performed laboratory work. Aria S. Hahn, Connor Morgan-Lang, Dongjae Kim, and W. Evan Durno developed software code. Aria S. Hahn conducted statistical analysis. Aria S. Hahn interpreted the data with input from Steven J. Hallam. Steven J. Hallam edited the chapter.

• Chapter 4: Aria S. Hahn and Ashley C. Arnold wrote the main text with input from Natasha J. Sihota and Steven J. Hallam. Natasha J. Sihota collected the samples and conducted laboratory work. Aria S. Hahn and Ashley C. Arnold conducted statistical analysis and interpreted the data with input from Steven J. Hallam and K. Uli Mayer. Steven J. Hallam edited the chapter.

• Chapter 5: Aria S. Hahn wrote the main text with input from Kishori M. Konwar and Steven J. Hallam. Aria S. Hahn, Kishori M. Konwar, Niels W. Hanson, and Dongjae Kim developed software code. Aria S. Hahn, Niels W. Hanson, and Kishori M. Konwar conducted statistical analysis. Aria S. Hahn, Kishori M. Konwar, and Niels W. Hanson interpreted the data with input from Steven J. Hallam. Steven J.
Hallam edited both the chapter and the manuscripts adapted within this chapter. A version of this work is published as IEEE copyrighted proceedings described below.
Text, figures and tables in Chapter 5 are copyright 2014 IEEE. Reprinted, with permission, from:
Aria S. Hahn, Niels W. Hanson, Dongjae Kim, Kishori M. Konwar, and Steven J. Hallam, Assembly independent functional annotation of short-read data using SOFA: Short-ORF Functional Annotation. 2015 IEEE Conference on Computational Intelligence in Bioinformatics and Computational Biology, August 2015.
In reference to IEEE copyrighted material which is used with permission in this dissertation, the IEEE does not endorse any of University of British Columbia’s products or services. Internal or personal use of this material is permitted. If interested in reprinting/republishing IEEE copyrighted material for advertising or promotional purposes or for creating new collective works for resale or redistribution, please go to standards/publications/rights/rights link.html to learn how to obtain a License from RightsLink.

• Chapter 6: Aria S. Hahn wrote the manuscript and created all figures with editorial support from Steven J. Hallam. Part of this work is published as described below.
Aria S. Hahn, Kishori M. Konwar, Stilianos Louca, Niels W. Hanson, Steven J. Hallam. The information science of microbial ecology. Current Opinion in Microbiology, 31 (2016).

Throughout this dissertation the word ‘we’ refers to Aria S. Hahn unless otherwise stated. None of the work encompassing this dissertation required consultation with the UBC Research Ethics Board.

Table of Contents

Abstract
Preface
Table of Contents
List of Tables
List of Figures
Acknowledgements
Dedication
1 Introduction
1.1 Soil and soil microbial communities
1.1.1 Forest soils
1.2 Response of soil microbial communities to soil organic matter removal
1.3 Multi-omics data and analyses
1.3.1 Small subunit ribosomal RNA (SSU rRNA) genes
1.3.2 Shot-gun ‘omics’ data
1.4 Dissertation overview
2 Probing the Depths of Microbial Community Interactions in Soil
2.1 Introduction
2.2 Methods
2.2.1 Sampling and laboratory techniques
2.2.2 Processing of pyrotag sequences
2.2.3 Processing of metagenomic sequences
2.2.4 Statistical analyses and data visualization
2.3 Results
2.3.1 Soil characteristics
2.3.2 Microbial community structure
2.3.3 Hierarchical cluster and indicator species analysis
2.3.4 Network description
2.3.5 Microbial community function
2.3.6 Taxonomic relationships between co-occurrence and metabolic networks
2.4 Discussion
2.4.1 Perturbation effects on soil microbial community structure
2.4.2 Harvesting and depth influence soil microbial co-occurrence network topology
2.4.3 Impaired organic matter degradation and nutrient cycling potential in harvested LFH horizons
2.4.4 Consistent taxonomic patterns in network modules and metabolic pathways
2.5 Conclusions
3 Digging Deeper into Soil Microbial Community Metabolism
3.1 Introduction
3.2 Methods
3.2.1 Site description and sampling
3.2.2 Genomic DNA and RNA isolation and sequencing
3.2.3 Genomic DNA and RNA processing
3.2.4 Isolate genomes
3.2.5 Statistical analyses and data visualization
3.3 Results
3.3.1 Multi-omics sequencing
3.3.2 Metabolic pathway prediction
3.3.3 Taxonomic distinctness of potential and expressed metabolic pathways
3.3.4 Potential and expressed metabolic pathways
3.3.5 Trends in soil microbial metabolic potential and expression with season, depth, and treatment
3.3.6 Soil microbial metabolism across forest ecozones
3.3.7 Abundance and expression of carbohydrate and lignin degradation genes (CAZymes)
3.3.8 CAZyme abundance and expression within soil isolates
3.4 Discussion
3.4.1 Metabolic potential is not squandered within soil microbial communities
3.4.2 Winter conditions impact metabolic potential
3.4.3 Metabolic potential is similar across ecozones
3.4.4 Metabolic pathway expression driven by depth
3.4.5 Forest harvesting affects the abundance but not expression of biomass degradation genes
3.4.6 Spatiotemporal controls on CAZyme expression
3.5 Conclusions
4 The Ecologist’s Guide to Normalization Methods in Count Data for the Microcosmos
4.1 Introduction
4.2 Methods
4.2.1 Sample collection, DNA extraction and pyrosequencing
4.2.2 Sequence analyses
4.2.3 Data analyses
4.3 Results
4.3.1 Microbial community diversity
4.3.2 Soil microbial community structure
4.3.3 Microbial abundance
4.3.4 Potential activity
4.4 Discussion
4.4.1 Similar trends in community diversity and structure with varying normalization technique
4.4.2 Discrepancies in abundance and potential activity between normalization techniques
4.4.3 Considerations for VST normalization
4.5 Conclusions
5 Analytical Augers for Mining Soil Sequence Data
5.1 Short-ORF functional annotation (SOFA) pipeline
5.1.1 Introduction
5.1.2 Methods
5.1.3 Results
5.1.4 Implementation and availability
5.2 Taxonomic distinctness
5.2.1 Estimating diversity in metagenomic data
5.2.2 Extending taxonomic distinctness
5.2.3 Implementation and availability
6 Conclusions
6.1 Soil microbial community response to environmental change
6.2 A systematic approach to data analysis
6.3 Future challenges and directions
6.3.1 Soil microbial response to disturbance
6.3.2 Microbial ecology
6.4 Closing
Bibliography
Appendices
A Chapter 2: Supplemental Material
A.1 Supplemental methods
A.1.1 Site description
A.1.2 Genomic DNA isolation, sequencing and processing
A.1.3 Community diversity
A.1.4 Statistical differences among and between soil horizons
A.1.5 Indicator species analysis
A.1.6 Co-occurrence network construction and analysis
A.1.7 Taxonomic distinctness
A.1.8 Univariate analysis
A.2 Supplemental results
A.2.1 Soil characteristics
A.2.2 Rarefaction
A.2.3 Microbial community structure
A.2.4 Hierarchical cluster analysis (HCA)
A.2.5 Indicator species analysis
A.2.6 Network description
A.2.7 Hive plots
A.2.8 Taxonomic composition of network modules
A.2.9 Metagenomic pathway prediction
A.2.10 Differences in the relative abundance of pathways
A.2.11 Taxonomic relationships between co-occurrence and metabolic networks
A.3 Supplemental discussion
A.3.1 Perturbation effects on soil microbial community structure
A.3.2 Impaired organic matter degradation and nutrient cycling potential in harvested LFH horizon
A.3.3 Consistent taxonomic patterns in network modules and metabolic pathways
A.3.4 Difficulties in defining the metabolic potential of individual nodes
A.4 Supplemental figures
A.5 Supplemental tables
B Chapter 3: Supplemental Material
B.1 Supplemental methods
B.1.1 Site description and sampling description
B.2 Supplemental results
B.2.1 Ribosomal RNA (rRNA) depletion
B.2.2 Metagenomic sequencing
B.2.3 Metatranscriptomic sequencing
B.2.4 Potential and expressed metabolic pathways
B.2.5 Seasonal differences in microbial metabolism
B.2.6 The impact of depth on soil microbial metabolism
B.2.7 Perturbation effects on soil microbial metabolism
B.2.8 The effects of depth and perturbation on microbial metabolism within each season
B.2.9 The effects of depth and perturbation on CAZyme abundance and expression within each season
B.3 Supplemental figures
B.4 Supplemental tables
C Chapter 4: Supplemental Material
C.1 Supplemental methods
C.1.1 Study site
C.1.2 Nucleic acid extraction and cDNA synthesis
C.1.3 PCR amplification and pyrosequencing of SSU rRNA genes and cDNA
C.1.4 Processing of pyrotag sequences
C.2 Supplemental figures

List of Tables

1.1 Soil organic matter removal intensity on the Long-Term Soil Productivity (LTSP) sites across North America
2.1 Soil microbial co-occurrence network and module properties near Williams Lake, B.C.
2.2 Influences of forest harvesting and depth on soil chemistry and microbial communities
4.1 Diversity indices for samples collected from a reference (Core A) and contaminated soil profile (Core B) in southwestern Minnesota.
5.1 Sample information, comparative statistics and total COG database hits for assembled short read sequence data, read-one-only unassembled data, and SOFA processed unassembled short read data. Min: mineral sample; Org: organic sample; M: million; Assm: assembled (%); Mrgd: merged (%); Dupl: deduplicated (%). Table originally published in IEEE proceedings [1]. Copyright IEEE 2015. Reprinted with permission.
A.1 Chemical properties of soils used in community diversity and metagenome library production and sequencing
A.2 Metagenomic sequencing, assembly and annotation summary statistics
A.3 PCR barcodes and SSU rRNA library production
A.4 Microbial Phyla identified
A.5 Indicator OTUs with indicator values > 0.7 for clusters, and combinations of clusters, defined in the dendrogram and unmanaged soil profiles
B.1 rRNA content within samples with and without rRNA depletion
B.2 Sequencing and data processing statistics for 54 metagenomes generated from samples collected from 3 seasons, 3 depths, and an unmanaged and harvested forest soil plot at the O’Connor Lake LTSP site
B.3 Sequencing and data processing statistics for 49 metatranscriptomes generated from samples collected from 3 seasons, 3 depths, and an unmanaged and harvested forest soil plot at the O’Connor Lake LTSP site
B.4 Pathways uniquely found in the metagenomes at the O’Connor LTSP site
B.5 Pathways uniquely found in the metatranscriptomes at the O’Connor LTSP site

List of Figures

1.1 Earth’s soils
1.2 Landmark events in microbiology
1.3 The central dogma of molecular biology
1.4 Data generation and processing workflows
1.5 Organization of microbial communities
1.6 The MetaPathways pipeline
1.7 Sampling locations
2.1 Hierarchical cluster analysis describing the influence of horizon and treatment on microbial community assemblages
2.2 Co-occurrence networks describing the influence of horizon and treatment on microbial community
2.3 Environmental Pathway Genome Databases (ePGDB) from 26 soil metagenomes
2.4 Taxonomic distinctness of metabolic pathways within 26 environmental Pathway Genome Databases (ePGDB)
3.1 Sampling and analysis schematic
3.2 Summary of potential and expressed pathways
3.3 Taxonomic distinctness of metabolic pathways within 103 environmental Pathway Genome Databases (ePGDB)
3.4 Venn diagram of potential and expressed pathways
3.5 Multivariate regression tree of potential and expressed metabolic pathways
3.6 Differentially abundant and expressed pathways driving patterns in the MVRT
3.7 Differential abundance of potential pathways
3.8 Differential abundance of expressed pathways
3.9 Comparison of the metabolic potential between ecozones
3.10 Multivariate regression tree of potential and expressed CAZymes
3.11 Multivariate regression tree of potential and expressed CAZymes within Bradyrhizobium isolates
3.12 Abundance of potential and expressed CAZymes within Bradyrhizobium isolates
4.1 Sampling and analysis schematic for samples collected from two soil cores
4.2 Overdispersion in microbiome data
4.3 Principal Component Analysis (PCA) of microbiome data using two normalization techniques
4.4 Non-metric multidimensional scaling (NMDS) of microbiome data using two normalization techniques
4.5 Heat map depicting the potential activity (rRNA:rRNA genes) of the most variable taxonomic orders in Core B
4.6 Practical considerations for selecting normalization techniques for microbiome data analysis
5.1 The SOFA pipeline
5.2 Deduplication
5.3 Validation of the SOFA pipeline
5.4 Simulation model
5.5 Taxonomic distinctness uniform step
5.6 Taxonomic distinctness distinct step
5.7 Taxonomic distinctness using weighted taxonomic distance
5.8 Taxonomic distinctness using microbial data
6.1 Model of microbial community dynamics in response to perturbation
6.2 The microbial matrix
A.1 Sampling and analysis schematic for 26 samples from 5 soil horizons in two soil profiles near Williams Lake, B.C.
A.2 Chao1 rarefaction curves for V6-V9 pyrotags
A.3 Hierarchical cluster analysis of pyrotags
A.4 Relative abundance of phyla and indicator OTUs within co-occurrence network
A.5 Hierarchical cluster analysis describing the influence of horizon and treatment (unmanaged (N) and harvested (H)) on predicted metabolic pathways
A.6 Relative abundance of metabolic pathways more prevalent in unmanaged (N) LFH horizons compared to harvested (H) LFH horizons
A.7 Relative abundance of metabolic pathways more prevalent in harvested (H) LFH horizons compared to unmanaged (N) LFH horizons
A.8 Relative abundance of metabolic pathways more prevalent in LFH horizons compared to mineral horizons
A.9 Relative abundance of metabolic pathways more prevalent in mineral horizons compared to LFH horizons
A.10 Relative abundance of three pathways by taxa
A.11 Relative abundance of taxonomic groups in pathways
B.1 ERCC spike-in recovery
B.2 Venn diagram of potential and expressed CAZymes
B.3 Pathways abundant in both the metagenomes and metatranscriptomes
B.4 Pathways abundant in either the metagenomes or metatranscriptomes
B.5 Summary of potential and expressed CAZymes
B.6 Differential abundance of potential CAZymes
B.7 Differential abundance of expressed CAZymes
C.1 Chao1 rarefaction curves for V6-V8 pyrotags generated for 18 samples from 11 soil depths in two soil profiles
C.2 rRNA and rRNA genes of the taxonomic groups from Core A after proportion normalization
C.3 rRNA and rRNA genes of the taxonomic groups from Core A after VST normalization
C.4 rRNA:rRNA genes of the taxonomic groups from Core A after proportion and VST normalization
C.5 rRNA and rRNA genes of the taxonomic groups from Core B after proportion normalization
C.6 rRNA and rRNA genes of the taxonomic groups from Core B after VST normalization
C.7 rRNA:rRNA genes of the taxonomic groups from Core B after proportion and VST normalization
C.8 Heat map depicting the potential activity (rRNA:rRNA genes) of the most variable taxonomic orders in Core A

Acknowledgements

I am grateful to the past and present members of Steven J. Hallam's research group. It was through all of you that I learned techniques in molecular biology, statistical analyses, programming, writing, and how much coffee I am able to drink in one day. I am better having worked with all of you. In particular, I would like to thank Kishori M. Konwar for his time, patience, and expertise - I have never spent as much time on a phone as I do with you. Niels W. Hanson for his YouTube links, statistical and programming abilities, and editorial support. Dongjae Kim for lively (and sometimes heated) discussions, his admirable work ethic, and intelligence. Ashley C. Arnold for her careful thinking and for making me look good through her diligence and organization on multiple occasions. Sarah E. I. Perez for reminding me to relax and write more in Python. Elena Zaikova for giving me two pieces of advice that made my entire degree easier.
Melanie (Scofield) Sorensen for teaching me about bench work and all the best fantasy/adventure novels. Esther Gies for literally sitting with me for years. Connor Morgan-Lang for debriefing with me at least once a day, helping me with everything from scripts to pipelines, and making me delicious meals. Alyse K. Hawley for her support and impressive knowledge base. W. Evan Durno for having long, and I am sure sometimes painful, statistical discussions. Diane K. Fairley for keeping me (and the rest of the lab) in line. Darlene Birkenhead for knowing all the answers and always listening. Karoline Faust for teaching me all about network analysis and making me feel so welcome in Belgium. Jeroen Raes and his laboratory for offering their time and expertise. Maya Bhatia, Monica Torres-Beltran, Keith Mewis, Zach Armstrong, Jody Wright, Chris Lawson, Sam Kheirandish, Zaira Petruf, Martin Krzywinski, Celine Michiels, Colleen Kellogg, John Kellogg, and CarrieAyne Jones for their input, discussions, encouragement, coffee dates, lab trips and much more; without all of you, doing this work would not have been as fun. Thank you to my committee, Sue Grayston, William Mohn, and Martin Hirst, for challenging and supporting me throughout my degree. Finally, I would like to thank Steven J. Hallam for taking me on as a student, giving me a lot of freedom, encouraging me to expand my skillset and knowledge (well, well) beyond what was comfortable, for his generosity in so many things, his ideas - big, small and absolutely insane, his time, his enthusiasm, and his guidance.

Dedication

To Genesis M. Magat for his love and patience - and for digging soil pits under the snow while I watched.

Chapter 1

Introduction

We live in a microbial world. Microorganisms are the most ubiquitous life form on the planet and dominate Earth's habitable space, living as high as 77 km above ground in the mesosphere and as low as 4 km beneath the ocean floor [2].
Microbes have inhabited the Earth for at least 3.5 Gyr and continue to populate, construct, and transform almost every conceivable niche on the planet [3]. Today, approximately half of all carbon contained within living organisms is held within microbial cells [2]. This ‘unseen majority’ decomposes organic material, cycling and making available resources for new generations of beings spanning the entire tree of life. Microbes simultaneously serve as protectors from and instigators of disease, and both enhance and destroy populations of higher organisms [4]. Thus, to study and understand microorganisms is to study and understand the history of Earth, life, and humans; it is to study and understand the stewards and creators of Earth's ecosystems; and it is to study and understand the players ultimately responsible for the health and survival of all other organisms [4, 5]. As such, the interactions and metabolic processes of microbial communities have emerged as fundamental areas of scientific research. More specifically, microbial ecology aims to answer three fundamental questions: i) what is the phylogenetic distribution of microorganisms across environments?; ii) what are the metabolic processes completed by microorganisms across environments?; and iii) how does environmental change affect phylogenetic distribution and metabolic function within microbial communities?

Given that Earth's landmass spans approximately 148,300,000 km² [6], and assuming a uniform soil depth of 1 m (and thus assuming the remaining area is occupied by consolidated geological material), we can estimate that there is approximately 1.5e+14 m³ of soil on Earth, or 0.01% of the total volume of the world's oceans (approx. 1.3e+18 m³ [7]). Soils connect the atmosphere, hydrosphere, lithosphere, and biosphere, and represent some of the most complex environments on the planet [8] (Fig. 1.1A).
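The volume estimate above can be checked with a few lines of arithmetic. The sketch below reproduces it from the figures quoted in the text; the variable names are illustrative only, and the uniform 1 m depth is the same simplifying assumption made above rather than a measured value.

```python
# Back-of-the-envelope check of the soil volume estimate.
# All inputs are the figures quoted in the text; names are illustrative.
LAND_AREA_KM2 = 148_300_000   # Earth's landmass [6]
SOIL_DEPTH_M = 1.0            # assumed uniform soil depth
OCEAN_VOLUME_M3 = 1.3e18      # approximate volume of the world's oceans [7]
M2_PER_KM2 = 1_000_000        # unit conversion: 1 km^2 = 10^6 m^2

# Volume = area (converted to m^2) x depth
soil_volume_m3 = LAND_AREA_KM2 * M2_PER_KM2 * SOIL_DEPTH_M

# Soil volume expressed as a percentage of ocean volume
soil_vs_ocean_percent = 100 * soil_volume_m3 / OCEAN_VOLUME_M3

print(f"Soil volume: {soil_volume_m3:.2e} m^3")
print(f"Share of ocean volume: {soil_vs_ocean_percent:.2f}%")
```

Rounded to two significant figures, the soil volume agrees with the 1.5e+14 m³ quoted above, and the ratio rounds to the stated 0.01%.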
Furthermore, soils host Earth's most diverse microbial communities [9] and are essential to global food supply, as well as many economically important industries such as forestry, mining, and ecotourism. In addition, soils represent an important carbon sink and contain approximately 2400 Pg of carbon [10], or 16,000-16,600 grams of carbon per m³, over 500 times the quantity of carbon stored in a m³ of ocean water (38,000 Pg total [11]). This large carbon stock, stored as soil organic matter (OM), can be mobilized by soil microbes and released into the atmosphere as climate active trace gases [12, 13] due to deforestation, poor ecosystem management, or changes in land-use [14, 15]. Thus, understanding soil ecosystems and the role of microorganisms within them has both economic and environmental implications.

Figure 1.1: Earth's soils. (A.) The pedosphere connects the atmosphere, hydrosphere, lithosphere and biosphere; (B.) A soil pedon; (C.) Soil microbial communities are influenced by other microbes (distributed metabolism, symbioses, viral reprogramming, and predation), time, aboveground vegetation, edaphic factors (indigenous soil properties), and anthropogenic activities.

This chapter outlines the motivation to define and describe the taxonomic structure and metabolic function of soil microbial communities, with particular focus on the influence of forest harvesting and the removal of soil organic matter (OM) on microbial community metabolism. First, the importance of soil in Earth's ecosystem and the complexity of soil microbial communities are outlined. Next, current efforts to investigate the response of soil microbial communities to forest harvesting and OM removal are described, and the rationales for digging deeper into microbial interactions and the challenges associated with ‘multi-omics’ data and analyses are summarized.
Finally, the structure of this dissertation is developed.

1.1 Soil and soil microbial communities

The Canadian System of Soil Classification defines soils as “the naturally occurring, unconsolidated mineral or organic material at least 10 cm thick that occurs at the earth's surface and is capable of supporting plant growth” [16]. Soils develop through the interaction of climate, organisms (including animals, plants, insects, protozoa, fungi and bacteria), relief (i.e. topography and slope), parent material (the underlying consolidated geological material), and time [17], and are a non-renewable resource essential to terrestrial life. Soils are complex matrices of mineral material, organic matter, liquids, gases, and organisms [18]. Together, Earth's soils are known as the pedosphere (Fig. 1.1A).

Soil profiles, often meters deep, represent a stratified environment that forms in layers, termed horizons, distinguishable by color, texture, structure, nutrient availability and pH [19–21]. Due to the complexity and spatial heterogeneity of soils, it is estimated that a single gram of soil can contain more than 10⁹ microbial cells, or over 10,000 unique genomes [22, 23]. These rich microbial communities interact with one another and their environment to provide, store, and cycle nutrients for the world's terrestrial ecosystems [24, 25]. To date, most research in forest soils has focused on the surface organic material (LFH horizon) and the uppermost mineral horizon (A horizon), typically found in the top 15 cm of the soil profile [26] (Fig. 1.1B). Using next-generation sequencing technologies to study soil microbial communities within Canada, the composition of both bacterial and fungal communities has been found to be influenced by soil horizon and geographic location [27]. For example, LFH horizons host a distinct and more diverse microbial community than the mineral A horizons found directly below [27].
Investigation of the microbial communities in LFH and A soil horizons has been done in many contexts, including agriculture, silviculture and reclamation, and has benefited these industries by increasing the sustainability of global food supply [28–30] and by providing bioindicators of ecosystem health and recovery following management or disturbance [27, 31, 32]. Still, it is estimated that 35% to 50% of soil microbial biomass is located below 15 cm within the soil subsurface [19, 33]. Further, the microbial communities in deeper soil horizons play essential roles in long-term carbon sequestration [34, 35], soil formation [36] and contaminant degradation [37].

The distribution of soil microorganisms throughout the soil profile is influenced by a myriad of factors including distributed metabolism (wherein a single metabolic pathway is completed by multiple organisms) [38], symbioses [39], viral reprogramming [40], predation [41], time [42], aboveground vegetation [43], and indigenous soil properties, also known as edaphic factors [19, 20, 23, 44–46] (Fig. 1.1C). For example, soil microbial composition is highly influenced by sampling date (season), which is believed to be linked with climatic differences, soil moisture regimes, and interactions with management, vegetation type, and litter [47, 48]. Further, soil microbial composition and biomass are significantly different under aspen (Populus tremuloides Michx.) and spruce (Picea glauca (Moench) Voss) dominated stands [49]. Edaphic factors such as soil pH [50] and carbon and nitrogen pools [51] are also strong drivers of soil microbial community structure. Indeed, decreases in taxonomic diversity and microbial biomass [52, 53] with soil depth [19, 44–46] have both been attributed to changes in carbon resources and soil chemistry throughout the soil profile. In addition, soil microbial community composition is also known to be impacted by physical perturbations such as agriculture [54], mining [55], and forest harvesting [20].
However, the impact of changes in microbial community composition and diversity on community metabolism and ecosystem function is difficult to predict due to functional redundancy, resource fluctuations, and complex interactions among microbes and the environment [25].

1.1.1 Forest soils

Forest soils often have an LFH horizon made up of leaf litter (L), partially degraded organic material known as folic material (F), and completely decomposed organic material known as humic material (H) [16]. Mineral soil horizons are generally named A through C, with A horizons closest to the surface and B and C horizons at successively greater depths. Lower case letters are then appended to the horizon designation (e.g., Ae or Bt) and describe the dominant process or processes contributing to the horizon's formation and properties, providing information about some edaphic factors. Briefly, eluvial horizons, denoted with ‘e’ (e.g., Ae), are those from which clays and minerals are removed due to weathering or leaching into the illuvial horizon below. Illuvial horizons are thus enriched in clays and minerals from the horizon above, and are denoted with a ‘t’ (e.g., Bt). A lower case ‘h’ describes a soil horizon that has accumulated soil organic matter, and a lower case ‘p’ describes a soil horizon that has been disturbed by anthropogenic activity [16]. This dissertation focuses on two forested sites, both dominated by Gray Luvisols. By way of introduction, an abridged form of the Canadian System of Soil Classification [16] description of Luvisols is as follows: “Soils of the Luvisolic order generally have light-colored, eluvial horizons and have illuvial B horizons in which silicate clay has accumulated. The Bt horizon must have a specified increase in clay over that in the eluvial horizon, clay skins indicative of translocated clay accounting for 1% or more of the area of a section through the horizon, and be at least 5 cm thick.
Luvisolic soils may have Ah, Ahe, or dark-colored Ap horizon that does not meet the requirements of [other orders] and the dark-colored A horizon is underlain by [a] light-colored Ae horizon that extends to a depth of 15 cm from the mineral surface. In addition, the Ae horizon is at least 5 cm thick.” The Luvisolic order is divided into two Great Groups, Gray Brown Luvisols and Gray Luvisols, which are distinguished on the basis of the A horizon present: Gray Brown Luvisols occur in warmer climates and have a mull (mixed) Ah horizon, while Gray Luvisols do not [56].

1.2 Response of soil microbial communities to soil organic matter removal

Soil organic matter (OM) is both a large carbon reservoir and the primary source of nutrients for soil microbial communities. OM contains a complex mixture of proteins, sugars, fats, waxes, lignin, humic acid, fulvic acid, and insoluble structures called humins [18], composed of high molecular weight polymers and humic acids, fulvic acids, and/or lipids bound to inorganic substances [57]. OM and soil microbial communities are linked, as OM is formed through the microbial breakdown of surface inputs such as labile plant litter [58] and root exudates [59], sometimes combined with physical pedoturbation (the mixing of soils by soil dwelling animals) [60]. While the enzymatic activities of soil microbial communities transform the carbon and nutrients within OM and drive biogeochemical cycles, OM accumulates as it becomes stabilized via physical seclusion, chemical adsorption onto inorganic particles, and biological degradation (wherein the material becomes recalcitrant due to its altered chemical structure) [61]. Indeed, globally, soils store two times more carbon than the planet's atmosphere, and close to three times that in aboveground biomass, making them an important carbon reservoir [62].
However, OM can be lost via naturally occurring processes such as fire [4], mobilized by the soil microbial community and released into the atmosphere as climate active trace gases [12, 13] as a result of poor ecosystem management or changes in land-use, or removed through deforestation and/or human intervention (Table 1.1) [14, 15]. Indeed, carbon release from soils may lead to a doubling in the projected increase in global warming over the next 85 years [63, 64]. By understanding the influence of OM loss and/or removal on the microbial community, management and forestry practices can be developed to better protect and maintain the economic and environmental resources dependent upon soil systems.

Previous investigation of the effects of OM removal on soil microbial communities has yielded inconsistent results. For example, Ponder and Tadros [65] used phospholipid fatty acid (PLFA) analysis and found differences in certain microbial groups between undisturbed soil samples and those from which OM was removed. However, using the same treatments and experimental design, Busse et al. [66] were unable to detect differences in the soil microbial community among treatments. Variability in results may be attributed to the wide range and inadequacy of the methods used. Indeed, to date most studies have assessed the effects of OM removal on microbial communities by using secondary measures such as microbial biomass carbon and nitrogen, by using methods that require culturing such as BIOLOG™, or by methods with low taxonomic resolution such as PLFA analysis.
However, genomic approaches now allow for more detailed investigation of the taxonomic structure and metabolic function of soil microbial communities, avoid biases associated with culture-based approaches such as BIOLOG™, and provide more specific taxonomic identification than PLFA analysis.

Using a high-throughput, high-resolution sequencing technique, Hartmann et al. [27] were able to describe the soil microbial communities in surface soils in detail using taxonomic groups defined by operational taxonomic units (OTUs). Specifically, the authors observed that microbial community composition varied most with geographic location and subsequently soil horizon. Further, within each site studied, plant symbionts such as ectomycorrhizal fungi, recently found to play an important role in carbon sequestration [67], and saprobic taxa, such as Ascomycetes and Actinomycetes, were most intensively affected by disturbance [27]. More recently, metagenomic sequencing revealed a decline in biomass conversion potential following forest harvesting and OM removal, based on reduced abundance of 41 families of carbohydrate-active enzymes involved in lignin, cellulose and hemicellulose transformation [68]. Together these data suggest that location, forest harvesting, and OM removal impact both community composition and potential metabolic function. However, in order to understand the long-term consequences of soil disturbance on carbon and nutrient cycling, more comprehensive investigations of changes in potential and expressed microbial community metabolism are needed.

In addition to impacting the soil microbial community, OM removal and whole tree harvesting have also been shown to result in declines in soil carbon concentration and nutrient availability (e.g., phosphorus and nitrogen) in surface horizons (0-20 cm) [69–71]. However, these effects have been shown to vary with geographic location. For example, Powers et al.
[70] found that soils with a higher initial carbon concentration experienced a greater decrease in carbon than those with lower pre-harvest carbon concentrations. Further, areas that lacked large understory plant populations or had better drainage also suffered greater post-harvest carbon and nitrogen loss than those with sizeable understory populations or poorly drained soils, an effect believed to be due to higher soil temperature and greater moisture [69, 70, 72]. Additionally, while some studies have measured an increase in ammonium in the upper horizon of clear-cut soils [73], others have measured declining levels of nitrogen mineralization following disturbance [69]. Generally, the effects of forest harvesting and OM removal on nutrient availability and the soil microbial community are multifaceted [74] and influenced by soil physico-chemical structure, soil history [75], the metabolic capacity of the microorganisms present, and the effect of the disturbance on interacting community members [25]. However, while edaphic factors can be measured and soil history recorded, accurate description of metabolic capacity and microbial interactions in soil ecosystems remains challenging given both the complexity of the soil environment and the prevalence of uncultivated microorganisms.

Table 1.1: Soil organic matter removal intensity on the Long-Term Soil Productivity (LTSP) sites across North America (Powers)

Treatment    Description
OM0          no soil organic matter removal
OM1          tree boles removed; crowns, felled understory and forest floor retained
OM2          all aboveground vegetation removed; forest floor retained
OM3          all surface organic matter removed; bare soil exposed

Long term soil sustainability plots (LTSP)

This dissertation is integrated with the Long-Term Soil Productivity (LTSP) Study, one of the world's largest coordinated research networks that addresses basic and applied scientific questions related to forest management across North America.
Established by the USDA Forest Service (Washington, DC) in 1989, the LTSP Study aims to investigate the influence of forest harvesting on long-term soil productivity and sustainability and to develop biological indicators of disturbance and recovery. The LTSP Study includes over 110 research sites in the USA and Canada [70]. Research on LTSP sites is primarily focused on the impacts of OM removal during timber harvesting in over 10 ‘ecozones’, biogeographically distinct areas with specific distributional patterns of terrestrial organisms. Each LTSP ecozone contains a randomized complete factorial design with three levels of OM removal (OM1-OM3) (Table 1.1) replicated in triplicate in 40 × 70 m plots. Additionally, each ecozone has a tenth plot representing a natural undisturbed reference forest stand (OM0). OM1 describes a treatment of minimal OM removal, in which tree boles have been removed but the tree crowns, felled understory, and forest floor (LFH) are retained (51±4% net carbon removal). OM2 represents common forestry practices and an intermediate level of OM removal, wherein all aboveground vegetation is removed but the forest floor material is retained (65±3% net carbon removal). OM3 is the most extreme treatment and includes the removal of all OM including the forest floor (LFH) material, leaving bare soil exposed (84±2% net carbon removal) [27, 70].
Figure 1.2: Landmark events in microbiology. Landmark events contributing to our current understanding of the microbial world: 1660, the first bacterium is observed [76]; 1800s, germ theory begins to be developed [77]; 1860, Louis Pasteur and Robert Koch generate fundamental theories [78, 79]; 1869, Johannes Miescher discovers DNA [80]; 1928, Alexander Fleming discovers penicillin [81]; 1939, Torbjorn Caspersson finds RNA is implicated in protein synthesis [82]; 1944, Oswald Avery finds genes and chromosomes are made up of DNA [83]; 1948, Claude Shannon presents “The Mathematical Theory of Communication” [84]; 1951, Rosalind Franklin takes X-ray diffraction images of DNA [85]; 1953, Watson and Crick reveal the structure of DNA [86]; 1958, Francis Crick proposes the central dogma of molecular biology [87]; 1959, the first annual conference on scientific information; 1971, the first commercial microprocessor [88]; 1977, the advent of Sanger sequencing [89]; 1987, Carl Woese uses 16S ribosomal RNA to uncover the uncultivated majority [90]; 1990, next-generation sequencing explodes [91]; 1995, the first complete genome is sequenced [92]; 1998, Jo Handelsman coins the term ‘Metagenomics’ [93]; 2001, draft of the human genome [94]; 2004, Craig Venter performs shotgun sequencing of the Sargasso Sea [95]; and 2016, we are now faced with big genomic data
[96].

1.3 Multi-omics data and analyses

In 1958 Francis Crick proposed the central dogma of molecular biology, which describes the scheme of genetic information flow in biological systems [87, 97] (Fig. 1.3). Still the basis of the prevailing paradigm, the central dogma considers the flow of biological information wherein DNA nucleotide sequences are transcribed to RNA and then translated into amino acids (Fig. 1.3). In order to build on Crick's work and access the information stored in life's complexity pyramid (DNA, RNA, protein and metabolites) [87, 97], it became necessary to study this biological information in greater detail. By 1972, it was possible to sequence DNA through location-specific primer extension [98, 99], and by 1977 Sanger sequencing allowed sequencing of DNA fragments [89]. Over the next decade, Carl Woese and Norman Pace used small subunit ribosomal RNA (SSU rRNA) genes to reveal the “uncultured majority” of microorganisms and launched microbial ecology into the molecular era [90, 100, 101] (Fig. 1.2). By the end of the 20th century the advent of next-generation sequencing platforms resulted in an explosion of environmental sequencing projects and the generation of petabytes of environmental sequence information (Fig. 1.2).

Figure 1.3: The central dogma of molecular biology. The central dogma of molecular biology describes the flow of information in biological systems wherein DNA nucleotide sequences are transcribed to RNA and then translated into amino acids [87, 97, 102]. An adaptation of this figure was originally published in Current Opinion in Microbiology [103].

At present most microorganisms are uncultivated [104], and thus information about the “uncultured majority” is primarily obtained through molecular sequence data.
Comparative genomic analyses allow the investigation of the metabolic potential of an organism or community. Metatranscriptomic sequences (RNA) can also be recovered and used to determine the attempted synthesis of individual proteins [105–107]. Indeed, messenger RNA (mRNA) molecules encode the sequence of amino acids to be translated within the ribosome. However, mRNA molecules have short half-lives, typically ranging from seconds to minutes [108], and stability varies between genes [109–111], among species [109, 111], and with the nutritional status of individual organisms [112], making metatranscriptomic analyses challenging. Fortunately, flash freezing and the addition of chemical solutions can both be used to preserve in situ transcriptional profiles [113], and recent advancements in RNA extraction and mRNA enrichment protocols have vastly increased the efficacy of soil metatranscriptomics. Today, there exists an unprecedented quantity of DNA, RNA and protein sequences, all publicly available to the researcher. If interpretable, this tidal wave of biological information has enormous potential to reveal the metabolic networks driving matter and energy transformations in soils and other natural and engineered ecosystems, with translational benefits across a wide range of sectors including human health, biorefining and earth systems engineering. Indeed, at present, soil microbiota are largely recognized only on the basis of molecular sequence information, making ‘omics’ data the primary tool for understanding the composition and metabolism of soil microbial communities.

1.3.1 Small subunit ribosomal RNA (SSU rRNA) genes

Massively parallel, or high-throughput, analyses of SSU rRNA genes have enabled a deeper understanding of microbial community structure [114].
Sequencing SSU rRNA gene amplicon libraries typically yields 2,000-80,000 reads per sample; the number varies between studies and samples with the amplification technique and sequencing platform used. Following sequencing, reads are processed using peer-reviewed, open-access bioinformatic software pipelines such as MOTHUR [115] or QIIME [116]. While the default parameters and clustering algorithms differ slightly between the two pipelines, both support the defaults of the other, allowing the user to generate the same results regardless of pipeline choice or to select parameters that best suit their data. Briefly, sequences are typically quality controlled and clustered based on sequence similarity (often at 97% sequence similarity) [114]. Next, de novo and/or reference-based chimera checking is typically performed and suspected chimeric sequences removed. Finally, remaining clusters are annotated by comparing the sequence similarity of a representative sequence for each cluster to a reference database. Each cluster is then termed an operational taxonomic unit (OTU), a term used in lieu of ‘species’, as this remains undefined in microbiology. The abundance of each OTU is then reported as the number (count) of reads belonging to the cluster for a given sample (Fig. 1.4A). OTU data highlight the genotypic diversity of a community and can be used to determine how this diversity changes in response to environmental change or disturbance.

Figure 1.4: Data generation and processing workflows. (A.) Workflow for processing small subunit ribosomal RNA (SSU rRNA). Nucleic acids are extracted and primers are used to amplify the target gene, which is subsequently sequenced.
Reads are then processed via clustering, chimera checking, and annotation using one or more reference databases. Resulting operational taxonomic units are then analyzed; (B.) Workflow for processing shotgun ‘omics’ data. Nucleic acids are extracted and sequenced. Reads are quality controlled and assembled. Open reading frames (ORFs) are then predicted, translated, and annotated using one or more reference databases, and the resulting data are then analyzed.

Modeling microbial interactions with SSU rRNA genes

Beyond the confines of laboratory environments microbes do not live in isolation, instead interacting with one another at the population and community levels to ultimately drive distributed matter and energy transformation processes [117, 118] (Fig. 1.5). Indeed, in isolation, individual microorganisms can perform only a limited number of biochemical reactions. However, as a community, microbes exchange information with each other and the environment to form functionally redundant and resilient metabolic networks supporting ecosystem functions and services [24, 25]. For example, the breakdown of polysaccharides in human intestines is completed by several bacterial species, wherein the metabolic byproducts of one organism serve as the primary carbon source for another recipient species unable to degrade the original molecule [119].

Figure 1.5: Organization of microbial communities. Individual cells give rise to populations that interact to form communities of information exchange. These community interaction networks in turn help drive Earth's biogeochemical cycles.

Microbial exchanges and interactions have a deep impact on the flow of organic matter and nutrients via the biogeochemical cycles that shape the biosphere. Microbial networks contain individuals with variable metabolic processing power, symbiotic and mutualistic relationships [119, 120], and denizens that ‘cheat’ [121–123] or are predatory [40, 114, 124, 125].
For example, expression of lignin (an important component of soil OM) [126, 127] transformation genes and gene cassettes encoded in different host genomes can synergize in combination to produce different monoaromatic breakdown profiles [120]. Beyond catabolic processes, microbe-host interactions require signaling processes to direct biofilm formation or differentiate host tissue structures. For example, plants actively interact with soil microorganisms colonizing root structures, and the microbiome in turn produces signaling molecules that help shape community metabolism [128].

Given this increasing awareness that microbial communities can work as distributed systems giving rise to ecosystem functions and services, there is growing interest in determining the role of interactions and information exchange in the structure and function of microbial communities [39, 129, 130]. By predicting microbial interactions it is possible to move beyond basic descriptions of OTUs and the edaphic factors affecting community composition [39, 129, 131, 132], and construct a more complete view of the ecological forces shaping microbial communities. Still, due to the inherent diversity and complexity of soil microbiota, accurate description and interpretation of soil microbial interactions is a daunting task.

Using SSU rRNA genes, network models have been used to estimate and describe microbial community structure and trophic relationships within microbial communities [118, 133]. Indeed, network analysis can be used to examine the correlations (edges) between individual taxa (nodes) by exploiting the mathematical, statistical and structural properties of the sampled community [39]. Over the past 5 years there has been a marked increase in the use of network models to analyze microbial communities [130, 133–135].
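To make the node and edge terminology concrete, the following minimal sketch builds a co-occurrence edge list from a toy OTU count table using Spearman's rank correlation. It is an illustration only, not the method used in any of the cited studies: the OTU names, counts, the 0.8 threshold and all function names are assumptions chosen for the example, and real analyses would also need to address normalization, compositionality and multiple testing.

```python
from itertools import combinations


def average_ranks(values):
    """Rank values from 1..n, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either vector is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    sd_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    if sd_x == 0 or sd_y == 0:
        return 0.0
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman's rho is the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def cooccurrence_edges(otu_counts, threshold=0.8):
    """Return (taxon_a, taxon_b, rho) for pairs with |rho| >= threshold."""
    edges = []
    for a, b in combinations(sorted(otu_counts), 2):
        rho = spearman(otu_counts[a], otu_counts[b])
        if abs(rho) >= threshold:
            edges.append((a, b, round(rho, 3)))
    return edges


# Hypothetical read counts for four OTUs (nodes) across five samples
otu_counts = {
    "OTU_1": [10, 20, 30, 40, 50],
    "OTU_2": [12, 25, 31, 47, 60],
    "OTU_3": [50, 40, 30, 20, 10],
    "OTU_4": [5, 50, 3, 40, 7],
}

edges = cooccurrence_edges(otu_counts)
for taxon_a, taxon_b, rho in edges:
    print(f"{taxon_a} -- {taxon_b}: rho = {rho}")
```

Positive edges suggest taxa that rise and fall together across samples, and negative edges suggest mutual exclusion; neither demonstrates a direct biological interaction, which is part of why such patterns are hard to validate.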
For example, co-occurrence networks have been used to identify soil microbial keystone species that could serve as indicators for ecosystem functions [134, 136], and to determine ecological organizing principles driving spatial distribution of soil microbial populations within a community [39, 135]. However, it remains difficult to validate co-occurrence patterns found in these models [133]. Further, to date there has been limited consensus on the techniques used to build microbial networks (e.g. Spearman’s correlation [134, 137], Spearman’s correlation and Kullback-Leibler dissimilarity measure [133], Pearson’s correlation [136, 138], and ordination based co-correspondence analysis [124]), making it difficult to accurately compare network properties and build on previous work. Given these discontinuities, a more formal effort to develop network standards to compare data sets within and between environments, validate hypotheses, and ultimately predict and engineer system states based on biological information transfer is needed.

1.3.2 Shotgun ‘omics’ data

While the ubiquity and conserved nature of SSU rRNA genes make them useful taxonomic markers, they provide little to no information as to the metabolic capacity of the community as they encode only ribosomal RNA. However, bench-top sequencing platforms like Illumina’s MiSeq and NextSeq are becoming standard in many research laboratories and enable exploration of potential (metagenomic) and expressed (metatranscriptomic) metabolic capacity of microbial communities [139–141]. Generally, environmental sequence data are quality controlled and assembled (wherein short reads are aligned and merged to create longer sequences termed contigs), open reading frames (ORFs) are predicted, and the resulting ORFs are subsequently annotated using one or more reference databases (Fig. 1.4B). 
The data can then be used to chart taxonomic composition as well as explore and compare metabolic function and expression on local and ecosystem scales across different environmental conditions or perturbation states.

Several pipelines for comparative metagenomics are available, including IMG/M [142], Community Cyberinfrastructure for Advanced Microbial Ecology Research and Analysis (CAMERA), Metagenome Rapid Annotation using Subsystem Technology (MG-RAST), MOCAT2 [143] and MetaPathways [139–141], all of which differ slightly in their choice of annotation and processing algorithms and introduce their own idiosyncrasies in terms of output and formatting. In addition, as most of these pipelines exist as on-line services, the ever-increasing volume of next-generation sequence data represents an analytical bottleneck given current bandwidth and network limitations [139–141]. Further, many of these pipelines are inflexible, not allowing the user to select analytical or annotation algorithms, choose reference databases, or tune parameters to best fit specific sample data. Finally, as functional genes rarely operate in isolation but rather work within the larger structure of metabolic pathways (series of enzymatic interactions that complete a given function or produce a given compound) [139–141], pathway-centric analysis can be a useful tool with which to reconstruct the metabolic network of the microbial community within a given sample; however, this feature is limited to a few pipelines.

MetaPathways 2.5 [139–141] is a flexible and transparent modular pipeline that addresses many of the issues mentioned above. MetaPathways takes assembled sequence data as input and includes modules for contig quality control, open reading frame (ORF) prediction, functional and taxonomic annotation, and read count normalization [139–141] (Fig. 1.6). MetaPathways is equipped with multiple algorithms for functional annotation and is compatible with any reference database. 
MetaPathways also includes modules for the identification of ribosomal RNA genes within the assembly and provides ORF-wise (each gene) taxonomic annotation based on the Lowest Common Ancestor (LCA) algorithm [144]. Finally, MetaPathways is the only available pipeline with which environmental Pathway/Genome Databases (ePGDBs) can be constructed in an automated manner. To construct ePGDBs, MetaPathways integrates with Pathway Tools [145], a production quality software that produces consistent, rule-based metabolic pathway predictions that include pathway descriptions, literature citations, and enzyme properties. Taken together, the analysis modules and flexibility of the MetaPathways pipeline enable large-scale comparisons of the metabolic potential and expression of a community, and are thus well suited for the functional comparisons completed within this dissertation.

[Figure 1.6 graphic not reproduced. Panels: QC & ORF prediction; translation and annotation (KEGG, COG, RefSeq, MetaCyc); data table generation; ePGDB creation; downstream analysis.]

Figure 1.6: The MetaPathways pipeline [139–141]. Assembled data are quality controlled, open reading frames (ORFs) are predicted, translated, and annotated using one or more reference databases and environmental Pathway Genome Databases (ePGDBs) are created.

A pathway-centric approach to modeling community metabolism

Within a single metagenomic sample >50,000 genes can be predicted, thereby yielding datasets that are both too large (many individuals) and too complex (many unique genomes) for many canonical analyses [146]. In order to integrate information across and among the genomes within the sampled community while still accurately representing biological processes, it is possible to take a pathway-centric approach [140]. Indeed, many biological processes are not completed by single genes but rather through a series of reactions [146]. 
Further, there is increasing evidence for distributed metabolism, wherein more than one organism is responsible for completing a metabolic pathway [38, 140]. For example, to synthesize several essential amino acids, the mealybug Planococcus citri and its bacterial symbionts Candidatus Moranella endobia and Candidatus Tremblaya princeps rely on genes from each of the three organisms [38]. Thus, pathway level comparisons of microbial communities have the potential to identify biological functions that would go undetected using a gene-centric approach.

A pathway-centric approach may also increase the power of comparative multi-omics given the incomplete metagenomes typical in microbiome studies. Indeed, despite the veritable tsunami of next-generation sequence data, given the complexity and diversity of soil microbiota, the data wave fails to capture the entire genomic content within soil microbial communities [1]. By using rule-based pathway level prediction, such as the PathoLogic algorithm implemented in the Pathway Tools software [145], the identification of pathways for which not all genes are represented in the sequenced data is possible, allowing for a more comprehensive analysis of microbial metabolism [140]. For example, degradation pathways are predicted only if the last reaction in the pathway is present within the sample, while biosynthesis pathways are predicted only when the first reaction is found within the data, and energy metabolism pathways are predicted only when at least 50% of the reactions are present [140]. 
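
The three prediction rules summarized above can be illustrated with a minimal sketch. This is not the PathoLogic implementation; it only encodes the rules as stated in the text, and the `Pathway` class, its field names and the `is_predicted` function are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class Pathway:
    # Hypothetical container: reactions are listed in order, first to last.
    name: str
    kind: str              # "degradation", "biosynthesis" or "energy"
    reactions: List[str]


def is_predicted(pathway: Pathway, observed: Set[str]) -> bool:
    """Apply the simplified rules from the text to one pathway."""
    if not pathway.reactions:
        return False
    if pathway.kind == "degradation":
        # Degradation: the last reaction must be present in the sample.
        return pathway.reactions[-1] in observed
    if pathway.kind == "biosynthesis":
        # Biosynthesis: the first reaction must be present in the sample.
        return pathway.reactions[0] in observed
    if pathway.kind == "energy":
        # Energy metabolism: at least 50% of the reactions must be present.
        present = sum(1 for r in pathway.reactions if r in observed)
        return present / len(pathway.reactions) >= 0.5
    raise ValueError(f"unknown pathway kind: {pathway.kind}")
```

Under these rules a degradation pathway with reactions r1, r2, r3 would be called from a sample containing only r3, which is how pathways with incompletely sampled genes can still be recovered.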
Pathway level comparisons of genomic and transcriptomic data have been used to identify differences in nitrogen cycling, osmolyte biosynthesis, heavy metal resistance and central metabolism in multiple environmental settings [140, 147, 148].

1.4 Dissertation overview

The goal of this thesis is to investigate the taxonomic structure, metabolic potential (DNA) and gene expression (RNA) in the soil microbial communities in two forests located within British Columbia, Canada (Fig. 1.7), with particular focus on the influence of soil organic matter removal on carbon cycling. Soil microbial community structure is assessed via the V6-V9 region of the small subunit ribosomal RNA (SSU rRNA), across unmanaged and disturbed soil profiles, and differences in metabolic potential and expression among and between soil horizons, perturbation states and season are evaluated using environmental pathway genome databases (ePGDBs) constructed with environmental sequence information. Specifically this work aims to:

I Chart soil microbial community composition and model microbial interactions
II Map the metabolic function, gene expression and community structure of soil microorganisms across different soil horizons, seasons, and perturbation states
III Evaluate the impact of soil organic matter removal on the metabolic potential of microbial communities, with a focus on carbon cycling
IV Develop bioinformatic tools to process and analyse multi-omics data, ultimately enabling downstream analyses

[Figure 1.7 map not reproduced; labelled sites: SBS-3 and OCL.]

Figure 1.7: Soil sampling locations. 
A map of British Columbia, Canada wherein points denote soil sampling locations for data within this dissertation.

The specific aims for each data chapter are as follows:

Chapter 2: Probing the depths of microbial community interactions in soil

In Chapter 2 co-occurrence network properties are compared and contrasted using defined environmental parameter data and paired environmental pathway genome databases (ePGDBs) constructed from shotgun metagenomes from multiple soil horizons in unmanaged and disturbed experimental plots within a Long-term Soil Productivity (LTSP) site to test two hypotheses: (i) that differences in soil microbial community co-occurrence patterns are due to gradients in resource availability with soil depth and perturbation state, and (ii) network composition and topology can elucidate ecological design principles shaping soil microbial community ecosystem functions.

Chapter 3: Digging deeper into soil microbial community metabolism

Chapter 3 digs deeper and presents a detailed investigation of the impact of season, depth, and forest harvesting on the differential abundance and expression of metabolic pathways and carbohydrate active enzymes to test three hypotheses: (i) changes in phenotypic expression due to season and depth are constrained to a subset of metabolic pathways, (ii) reduced genomic potential for biomass degradation following forest harvesting significantly alters expression of pathways and genes implicated in carbon and nutrient cycling within the soil microbial community, and (iii) while ecozone (e.g., soil-type, climate, and initial nutrient concentration) and perturbation state (OM removal intensity) influence the taxonomic composition of soil microbial communities, most metabolic pathways will be found across locations.

Chapter 4: The ecologist’s guide to normalization methods in count data for the microcosmos

Chapter 4 outlines the suitability of canonical statistical methods for the analysis of rDNA gene and rRNA transcript surveys by conducting 
a side-by-side comparison of normalization techniques. Specifically, clustered small subunit rRNA and rDNA profiles are normalized using both proportions, which suffer from overdispersion and unequal variance, and variance stabilization transformation. Analyses of microbial abundance (rDNA) and activity based on rRNA:rDNA ratios between a subsurface soil core from a reference site and one from a biofuel-contaminated site are then completed using the two differently normalized datasets.

Chapter 5: Analytical augers for mining soil sequence data

Chapter 5 presents and validates a short-ORF functional annotation pipeline (SOFA) for assembly independent functional annotation of short-read data. The pipeline merges paired-end libraries, predicts ORFs, and completes an additional step we term ‘deduplication’. Deduplication prevents the double counting of ORFs predicted twice due to unmerged read pairs spanning a single gene, thereby generating accurate gene counts. Additionally, we extend Clarke and Warwick’s [149] taxonomic distinctness index to allow the use of incomplete taxonomic annotations, thereby yielding a more accurate description of diversity within multi-omics data sets.

Chapter 6: Conclusion

Finally, Chapter 6 concludes with a synthesis of findings related to soil microbial community response to environmental change, explores both current and looming challenges in metagenomics and lays out future directions for research in microbial ecology.

Chapter 2
Probing the Depths of Microbial Community Interactions in Soil

The microbial interactions driving nutrient and energy cycles in terrestrial ecosystems are challenging to describe given the complexity of soil microbiota. Co-occurrence analysis provides a statistical framework to chart this complexity. 
However, linking co-occurrence patterns to metabolic potential requires new inference modes incorporating genomic sequence information. This chapter presents an interpretative framework for joint analysis of co-occurrence patterns and metabolic potential, and uses an extended version of Clarke and Warwick’s taxonomic distinctness index to infer pathway diversity in harvested and unmanaged soils. Network topology and metabolic potential were influenced by soil depth and perturbation state. Indeed, metagenomic pathway reconstruction indicated harvested surface soil horizons exhibited reduced carbon and nitrogen cycling potential associated with heterotrophic carbon fixation and plant biomass conversion processes 13 years post harvesting. Further, taxonomic distinctness of pathways revealed that, compared to unmanaged surface horizons, more taxonomic groups have the potential to participate in a given pathway within harvested surface horizons. Both network and metabolic pathway-centric analyses indicate that perturbation effects are constrained to surface horizons, implying co-occurrence patterns can reflect changes in functional potential of soil microbiota. Taken together, the data indicate soil disturbance (associated with tree harvesting) impacts metabolic interactions with long-term feedback on terrestrial ecosystem services including carbon storage potential.

2.1 Introduction

Interconnected microbial communities, largely recognized on the basis of molecular sequence information, mediate matter and energy conversion processes in soil ecosystems [24, 25]. However, accurate description and interpretation of these networks remains a daunting task given the inherent diversity and complexity of soil microbiota. Numerous factors including distributed metabolism [38], symbioses [39], viral reprogramming [40], predation [41] and indigenous soil properties, i.e., edaphic factors [19, 20, 23, 44–46], contribute to microbial community structure. 
For example, changes in community structure have been attributed to gradients in pH and nutrient availability between soil horizons [19, 20, 23, 44–46], and to agricultural [54], mining [55], and forest harvesting practices [20, 27]. While compositional changes in microbial communities have been shown to affect ecosystem functions such as carbon cycling [150] and litter decomposition rates [151], the extent to which changes in microbial interactions feed back on soil ecosystem functions and services remains to be determined.

Co-occurrence analysis has previously been used to identify clusters of habitat specific microorganisms [129] and reveal potential metabolic interactions and biogeographic patterns [39]. For example, investigation of co-occurrence patterns in soil microbial communities spanning three continents indicated that abundant but site-specific taxa formed more correlations (edges) with other community members than equally abundant and co-occurring cosmopolitan taxa [39]. A related study found evidence for keystone connectivity that differed between sites in soil networks constructed from tropical forest and agricultural soils [130]. While promising, these studies made no attempt to relate network properties to metabolic potential.

Here, we compare and contrast microbial community taxonomy and co-occurrence network properties with defined environmental parameter data and paired environmental pathway genome databases (ePGDBs) constructed from shotgun metagenomes from multiple soil horizons in unmanaged and disturbed experimental plots within a Long-term Soil Productivity (LTSP) site to test two hypotheses: (i) that differences in soil microbial community co-occurrence patterns are due to gradients in resource availability, and (ii) network composition and topology can elucidate ecological design principles shaping soil microbial community ecosystem functions. 
In the process we extend Clarke and Warwick’s [149] taxonomic distinctness index to allow the use of incomplete taxonomic annotations, thereby yielding a more accurate description of diversity within co-occurrence networks and ePGDBs. Network results reveal potential changes in matter and energy conversion processes that, when combined with metabolic pathway information, can be used to illuminate differences in nutrient cycling and signaling among and between soil depths and perturbation states.

2.2 Methods

2.2.1 Sampling and laboratory techniques

This study was conducted in a boreal forest at the Skulow Lake LTSP site (SBS-3 WL) (52°20′N, 121°55′W) near Williams Lake in British Columbia, Canada [20, 27]. We compared two treatments, a natural reference plot (N) and a harvested soil plot (H) wherein all trees and soil organic matter were removed and the site was compacted 13 years prior to sampling. A detailed description of plot treatments and compaction can be found in [20, 27, 70]. Sample collection, processing and chemical analyses were previously described by Hartmann and colleagues (2009). Briefly, organic (LFH) and mineral horizons were distinguished on the basis of visible properties and chemistry, using the criteria established by the Canadian System of Soil Classification [16]. Approximately 500 g of soil was collected from each soil horizon (LFH, Ahe, Ae, AB, and Bt) per replicate sample. Three replicates were collected at each plot at randomly selected locations. Chemical properties were measured at the Ministry of Forests and Range soil laboratory (Victoria, British Columbia). Water content was measured by drying soils at 105 °C. Soil allocated for DNA extraction was passed through a 2 mm sieve to homogenize the sample and remove larger geological materials, plants and roots. Samples were stored at -80 °C prior to DNA extraction. DNA was isolated and sequenced from 26 individual soil samples. 
Detailed description of the isolation of pyrotag (small subunit ribosomal RNA or SSU rRNA) and shotgun metagenomic sequences can be found in A.

2.2.2 Processing of pyrotag sequences

The Quantitative Insights Into Microbial Ecology package (QIIME) [116] was used to remove sequences with ambiguous bases, homopolymer runs, and length less than 200 bp from a total of 299,164 V6 pyrotag sequences recovered from the 26 soil samples. Chimeras were detected and removed from further data processing using the ChimeraSlayer provided in the QIIME software package. The remaining 224,498 high quality sequences (average length = 430 bp) were clustered at the 97% identity threshold with a maximum BLAST e-value cut-off of 1e-10 using UCLUST implemented in QIIME [116]. Singletons were removed in order to decrease the likelihood of undetected chimeric sequences and sequencing errors being included in downstream analyses [152]. Because unequal numbers of sequences were recovered from each sample, the abundance of each OTU was normalized to the total number of reads recovered per sample and expressed as a percentage (Table A.3) [39]. Taxonomic assignment for each OTU cluster was performed using the Basic Local Alignment Search Tool (BLAST) and the SILVA database [153] with an e-value cut-off of 1e-

2.2.3 Processing of metagenomic sequences

Metagenomes were vector and quality trimmed (Q20, quality offset 33) before assembly with ABySS [154] using 18 k-mer lengths ranging from 28 to 96 bp. Final assemblies were chosen based on largest N50 and total base pairs assembled. See Table A.2 for a breakdown of resulting assembly information for each sample interval. 
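
As an illustration of the selection criterion, the following sketch computes N50 from a list of contig lengths and picks one assembly per sample. It is a sketch under stated assumptions: the text does not say how N50 and total assembled base pairs were weighed against each other, so ranking by N50 first and using total base pairs as a tie-breaker is an assumption made here, and `n50` and `pick_assembly` are hypothetical helper names.

```python
from typing import Dict, List


def n50(contig_lengths: List[int]) -> int:
    """Length of the contig at which the running sum of lengths, taken from
    longest to shortest, first reaches half of the total assembly size."""
    if not contig_lengths:
        raise ValueError("no contigs supplied")
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise AssertionError("unreachable for non-empty input")


def pick_assembly(assemblies: Dict[int, List[int]]) -> int:
    """Return the k-mer length whose assembly has the largest N50.

    Assumption: ties on N50 are broken by total assembled base pairs.
    """
    return max(assemblies, key=lambda k: (n50(assemblies[k]), sum(assemblies[k])))
```

Here `assemblies` maps each k-mer length tried to the contig lengths of the resulting assembly, so one call per sample returns the k-mer length to keep.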
Environmental pathway genome databases (ePGDBs) for the assembled metagenomic datasets were generated using MetaPathways 2.5 [139, 155], a modular pipeline for open reading frame (ORF) prediction, functional and taxonomic annotation, ORF count normalization (for both sequencing depth and ORF length), and the creation of ePGDBs based on the MetaCyc (MetaCyc-v4-11-07-03) [156] (a well-curated database of metabolic pathways and components representing all domains of life), Kyoto Encyclopedia of Genes and Genomes (KEGG-11-06-18) [157], SEED-14-01-30, Clusters of Orthologous Groups (COG-13-12-27) [158], Carbohydrate-Active enZYmes (CAZY-14-09-04) [159], and RefSeq-nr-14-01-18 [160] databases.

2.2.4 Statistical analyses and data visualization

Statistical analyses were performed in MATLAB with the Fathom toolbox and in R (version 3.0.2) using multiple packages (A.1). The co-occurrence network was created using CoNet (Beta version 3.2) [41] implemented in Cytoscape (version 3.1.0) [161] with the ‘ensemble’ method [41]. Network construction and analyses are described in detail in A.1. The network was visualized using hive plots [162], which were built using the tool Hive Panel Explorer [163].

Clarke and Warwick’s taxonomic distinctness index [149, 164] was extended to accept the partial taxonomies common in uncultivated microbial annotation using the previously published algorithm ‘weighted taxonomic distance’ (WTD) to calculate the taxonomic distance between operational taxonomic units (OTUs) [140]. 
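
To make the idea of averaging taxonomic path lengths over pairs of OTUs concrete, the sketch below computes an average pairwise distance for presence/absence data when some lineages are only partially annotated. It is not the WTD algorithm of [140]: the distance used here (number of tree edges separating two lineages, counted from their deepest shared rank) is a simplifying assumption, and `tree_distance` and `average_taxonomic_distinctness` are hypothetical names.

```python
from itertools import combinations
from typing import Sequence, Tuple

Lineage = Tuple[str, ...]  # e.g. ("Bacteria", "Proteobacteria"); may stop at any rank


def tree_distance(a: Lineage, b: Lineage) -> int:
    """Edges separating two lineages in the taxonomic tree.

    Partial lineages are simply shorter tuples, so they sit higher in the tree.
    """
    shared = 0
    for rank_a, rank_b in zip(a, b):
        if rank_a != rank_b:
            break
        shared += 1
    return (len(a) - shared) + (len(b) - shared)


def average_taxonomic_distinctness(lineages: Sequence[Lineage]) -> float:
    """Mean path length over all pairs of taxa (presence/absence data)."""
    pairs = list(combinations(lineages, 2))
    if not pairs:
        raise ValueError("need at least two taxa")
    return sum(tree_distance(a, b) for a, b in pairs) / len(pairs)
```

For example, two classes within Proteobacteria and one OTU annotated only to the phylum Acidobacteria give pairwise distances of 2, 3 and 3, for an average of 8/3.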
Taxonomic distinctness analysis produces three statistics: D, which describes the average taxonomic distance between two randomly selected OTUs (taxa) and considers both taxonomic relatedness and evenness; D*, which describes the average path length between two randomly selected OTUs (taxa) and considers only taxonomic relatedness; and D+, which, given presence/absence data, is equal to both D and D* and describes the average path length between two randomly selected OTUs (taxa). D+ can be used to detect whether there is a significant difference between the D+ for a given sample and the expected D+ calculated from a master list of all OTUs (taxa) in all samples [149]. D+ was used to calculate taxonomic distinctness of network modules given that network OTUs are scored as present/absent, so that D+ is equal to both D and D*. D* was used to calculate taxonomic distinctness of pathways as it focuses on the relatedness of taxa within the sample.

2.3 Results

2.3.1 Soil characteristics

To determine whether the chemical characteristics of soil horizons were significantly different, a non-parametric permutation based multivariate analysis of variance (NPMANOVA) was conducted because of the non-normal distribution of the data based on Shapiro-Wilk tests of variables (p < 0.05) [165]. Samples differed significantly with respect to depth, pH, total carbon and nitrogen (%), nitrate and nitrite (ppm), due to both soil horizon and the interaction between horizon and treatment (H versus N) (Table 2.2). Using a posteriori pairwise tests, surface horizons were found to differ significantly between soil treatments, indicating long-term harvesting effects (Table 2.2). Detailed differences between horizons and treatments can be found in A.

2.3.2 Microbial community structure

To determine microbial community structure within H and N soil samples, pyrotags targeting the V6 region of the SSU rRNA gene were generated with three-domain resolution (Table A.3). A total of 224,498 high quality sequences were clustered at the 97% identity threshold with a maximum e-value cut-off of 1e-10. After singleton removal 14,059 operational taxonomic units (OTUs) encompassing 206,739 V6 pyrotag sequences remained. Because unequal numbers of sequences were recovered from each sample, the abundance of each OTU was normalized to the total number of reads recovered per sample and expressed as a percentage (Table A.3) [39]. NPMANOVA was subsequently used to identify significant differences in microbial community structure among horizons within and between soil depth profiles (Table 2.2). Soil microbial community composition differed significantly with horizon, treatment and the interaction between horizon and treatment. Consistent with soil chemical data, only the LFH horizon (surface horizon containing organic matter) differed significantly between H and N soil profiles (Table 2.2).

2.3.3 Hierarchical cluster and indicator species analysis

We next used hierarchical cluster analysis (HCA) to explore relationships between horizon, treatment and microbial community structure (Fig. 2.1). Five clusters were identified, including harvested surface horizons (H-LFH 1, H-LFH 2, H-LFH 3 and H-Ae 1) containing 4,538 OTUs, unmanaged surface horizons (N-LFH 1, N-LFH 2, N-LFH 3 and N-Ahe 2) containing 5,909 OTUs, mixed horizons from both H and N (H-Ae 2, N-Ahe 1, H-Bt 3, H-AB 3 and N-Ae 1) containing 7,474 OTUs, mid subsurface horizons from both H and N (N-Ae 2, N-AB 1, N-AB 2, N-AB 3, H-Ae 3, H-AB 1, and H-AB 2) containing 7,554 OTUs, and lower subsurface horizons from both H and N (N-Bt 1, N-Bt 2, N-Bt 3, H-Bt 1 and H-Bt 2) containing 4,564 OTUs (Fig. 
2.1). These results indicate that both forest harvesting and sampling depth contribute to differences in microbial community structure. Differences were observed between the H and N surface (LFH) horizons (clusters 1 and 2) while the remaining samples were generally sorted by depth, consistent with NPMANOVA results. Next, to identify OTUs driving differences among the five clusters identified in HCA, indicator species analysis (ISA) (explained in detail in A.2) was conducted, resulting in the identification of 1,860 cluster-specific indicator OTUs (indicator value > 0.7 and p-value < 0.05) and 939 multi-cluster indicator OTUs (Table A.5). Compositional differences between clusters are discussed in detail in A.2.

[Figure 2.1 graphic not reproduced. Panels: sample dendrogram (axis: Bray-Curtis dissimilarity); bar charts for Clusters 1 to 5 (axis: relative proportion of pyrotags, %); Venn diagrams of OTUs shared between clusters. Legend: Natural Unmanaged (N), Harvested (H); horizons LFH, Ahe, Ae, AB, Bt; taxa Proteobacteria, Acidobacteria, Actinobacteria, Bacteroidetes, Planctomycetes, Fungi, Gemmatimonadetes, Verrucomicrobia, Archaeplastida, Nitrospirae, Armatimonadetes, Cyanobacteria, Candidate division OP11, SAR supergroup, Other.]

Figure 2.1: Hierarchical cluster analysis describing the influence of horizon and treatment on microbial community assemblages. Microbial community assemblages (V6-V8 SSU rRNA) were analysed in 25 samples from 5 soil horizons in two soil profiles (unmanaged (N) and harvested (H)) at the LTSP ecozone in Williams Lake, B.C. Phyla representing > 0.5% on average across all samples are shown; all other taxa are binned into a higher taxonomic group or ‘other’ categories. 
100% of the total microbial pyrotags clustered at 97% in OTUs are represented in this plot. Venn diagrams depict the number of OTUs shared between clusters. Box and Venn colours represent soil depth.

2.3.4 Network description

To determine potential interactions between OTUs throughout the soil profile, a co-occurrence network was constructed using both Spearman correlation measure and Kullback-Leibler dissimilarity measure (robust to compositionality) using cut-offs of | 0.8 |, and > 18 and < 1 respectively. Each node represents an OTU and each edge a statistically significant positive or negative correlation. All statistically significant co-occurrences, calculated by computing 1,000 edge and measure specific permutations and bootstrap score distributions, were included in the network (Barberan et al., 2014; Faust et al., 2012). The resulting network contains 1,880 nodes, connected by 13,605 edges. Of these, 6,967 edges correspond to positive correlations (co-presence) while 6,638 edges correspond to negative correlations (Fig. 2.2). 
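
The Spearman part of this construction can be sketched in a few lines. The example below is only an illustration of thresholding rank correlations at |0.8| to obtain signed edges; it omits the Kullback-Leibler measure and the permutation and bootstrap steps performed by CoNet, and the function names (`ranks`, `spearman`, `cooccurrence_edges`) and the input layout (OTU name mapped to abundances across samples) are assumptions made for illustration.

```python
from itertools import combinations
from math import sqrt
from typing import Dict, List, Tuple


def ranks(values: List[float]) -> List[float]:
    """1-based ranks; tied values receive the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = average_rank
        i = j + 1
    return out


def spearman(x: List[float], y: List[float]) -> float:
    """Spearman correlation computed as the Pearson correlation of ranks."""
    rx, ry = ranks(x), ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        return 0.0  # a constant profile carries no rank information
    return cov / sqrt(vx * vy)


def cooccurrence_edges(
    abundances: Dict[str, List[float]], cutoff: float = 0.8
) -> List[Tuple[str, str, str, float]]:
    """Return (otu_a, otu_b, edge_type, rho) for every pair with |rho| >= cutoff."""
    edges = []
    for a, b in combinations(sorted(abundances), 2):
        rho = spearman(abundances[a], abundances[b])
        if abs(rho) >= cutoff:
            kind = "co-presence" if rho > 0 else "mutual exclusion"
            edges.append((a, b, kind, rho))
    return edges
```

In this sketch nodes are OTUs and each retained pair becomes one edge labelled as co-presence or mutual exclusion, mirroring the positive and negative edge classes reported for the network.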
Taxonomic distinctness (D+) for the whole network was 1.71. Additional network properties and a description of how to interpret hive plots can be found in A.2.

Table 2.1: Soil microbial co-occurrence network and module properties near Williams Lake, B.C.

Property                                          Network   Module A   Module B   Module C
Number of nodes                                   1478      99         153        381
Number of edges                                   6967      502        937        3803
Number of components                              113       1          1          1
Number of nodes of largest connected component    1206      99         153        381
Number of edges of largest connected component    6801      502        937        3803
Diameter of largest connected component           19        6          6          7
Average degree                                    9.43      10.14      12.25      19.96
Connectance                                       0.01      0.10       0.08       0.05
Global clustering coefficient                     0.28      0.57       0.58       0.50
Fraction of possible triangles                    0.42      0.43       0.42       0.46
Size of largest clique                            20        9          11         20
Average path on largest connected component       6.18      2.70       2.67       3.07
Taxonomic distinctness (D+)                       1.71      2.01       1.36       1.14

To explore how depth and treatment influence network topology we used the FAG-EC algorithm [166] to identify modules (sub-graphs of highly clustered microbes within the network) in the positive edges. Negative edges were excluded from calculations of modularity to avoid grouping mutually exclusive OTUs together. Three large (≥ 99 nodes) modules with stretched exponential distributions and 11 smaller components (3-61 nodes) were identified. Total nodes and edges, weighted mean depth, average degree, average path length, diameter, clustering coefficient, connectance and D+ were calculated for each large component (Table 2.1). The weighted mean depths were then compared to the average depths of the HCA clusters, and indicator OTUs for clusters, or groups of clusters, were highlighted within the network in an effort to guide interpretation.

Module A, with the shallowest weighted mean depth (7 cm), contained 99 OTUs, 55 of which were indicators for cluster 1 (H surface horizons, LFH samples and a single Ae sample) and 15 of which were indicators for LFH horizons specifically (Fig. A.4). 
Module B, with a weighted mean depth of 13 cm, contained 153 OTUs, 25 of which were indicators for cluster 2 (N surface horizons, LFH samples and a single Ahe sample) and 1 of which was an indicator for LFH horizons specifically (Fig. A.4). Together, the indicator OTUs and weighted mean depths suggest modules A and B are representative of the H and N surface horizons respectively. Module C, with a weighted mean depth of 36 cm, had 381 nodes, 240 of which were indicator OTUs for mineral horizons, which when combined with the weighted mean depth, suggests module C is representative of combined mineral horizons (N + H) (Fig. A.4). Network topology was found to be similar among all three modules (Table 2.1). While D+ for all modules was not lower than expected given the number of OTUs within the network (Table 2.1), D+ was highest in module A (H LFH horizons) (2.01) followed by module B (N LFH horizons) (1.36) and module C (mineral horizons N + H) (1.14) (Table 2.1). Compared to module C (mineral horizons N + H), modules A and B (H LFH horizons and N LFH horizons respectively) had a slightly higher clustering coefficient and connectance but a lower average degree (Table 2.1). Taxonomic composition of network modules was comparable to that found in the clusters identified using HCA and is described in detail in A.2 (Fig. 2.1 and Fig. A.4).

[Figure 2.2 graphic not reproduced. Panels: (A) hive plot of the full network; (B) hive plots of Module A (99 nodes, 502 edges), Module B (153 nodes, 937 edges) and Module C (381 nodes, 3803 edges), each with a histogram (axes: RPKM, frequency). Legend: Harvested, Natural Unmanaged; axes X, Y, Z by degree (1, 2-15, > 15); soil depth 5 cm to 50 cm; edges: co-occurrence, mutual exclusion; nodes: Module A, Module B, Module C, Other, indicator OTU for harvested, natural, organic clusters or mineral clusters.]

Figure 2.2: Co-occurrence networks describing the influence of horizon and treatment on microbial community. (A.) 
Hive plots of co-occurrence network calculated using SSU rRNA gene pyrotag operational taxonomic units (OTUs) (clustered at 97% sequence similarity) from 5 horizons in unmanaged (N) and harvested (H) soil profiles. ‘Co-occurrence’ represents positive correlations while ‘mutual exclusion’ represents negative correlations and the edges are colored accordingly. Nodes are colored by module membership. (B.) Hive plots of modules present in the original co-occurrence network. Hive plots: Axes represent node degree class (1, 2-15 and >15 respectively). Axes have been duplicated to show links between nodes of the same degree class. Node placement along axes is by mean weighted depth. In A, nodes are colored by module membership. Nodes are colored by indicator OTU then by module membership. Histograms depict metagenomic reads recruited to the representative sequence of OTUs within the network. RPKM is reads per kilobase per million mapped.

2.3.5 Microbial community function

Environmental pathway genome databases were generated from the assembled shotgun metagenomes to compare taxonomic diversity and metabolic potential across different soil horizons and treatments. Details on the method, number of pathways, and reactions predicted can be found in A.2. Relationships between horizon, treatment and microbial community metabolic potential were initially determined using multivariate hierarchical cluster analysis based on normalized ORF counts for each predicted pathway. At this level of comparison, all samples appeared similar to one another given the low Bray-Curtis dissimilarity distance (Fig. 2.3, Fig. A.5). Because hierarchical clustering disregards potentially important negative associations between variables and has the potential to obscure abundance differences between individual variables [167], we compared differences in the relative abundance of each pathway between horizons and treatments in a univariate manner. 
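
For reference, the Bray-Curtis dissimilarity underlying the multivariate comparison can be sketched as follows. This is the standard definition (sum of absolute differences divided by the sum of all counts), written as a minimal example rather than the MATLAB or R routines used in the analysis; the function name and the convention of returning 0 for two empty profiles are choices made here.

```python
from typing import Sequence


def bray_curtis(u: Sequence[float], v: Sequence[float]) -> float:
    """Bray-Curtis dissimilarity between two non-negative abundance profiles.

    0 means identical profiles; 1 means no pathway (or OTU) is shared.
    """
    if len(u) != len(v):
        raise ValueError("profiles must have the same length")
    total = sum(u) + sum(v)
    if total == 0:
        return 0.0  # convention chosen here for two empty profiles
    return sum(abs(a - b) for a, b in zip(u, v)) / total
```

Because the statistic pools differences across all pathways into a single number, two samples can score as very similar even when a few individual pathways differ markedly in abundance.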
This approach resulted in the identification of 109 pathways with significant differential abundance between N and H LFH horizons (Fig. 2.1, Fig. A.6, and Fig. A.4) and 40 pathways with significant differential abundance between combined LFH (unmanaged and harvested horizons) and mineral horizons (unmanaged and harvested Ahe, Ae, AB and Bt) (Fig. 2.1, Fig. A.8, and Fig. A.9).

In the H LFH horizons we identified 12 pathways for carbohydrate (e.g., dTDP-L-mycarose and dTDP-L-olivose) and secondary metabolite (e.g., myo-inositol) biosynthesis that were more prevalent than in the N LFH horizons (Fig. 2.1 and Fig. A.4). Seven of the carbohydrate biosynthesis pathways synthesize lipopolysaccharides found in glycoproteins of cell membranes and many antibiotics [168], while the remaining 5 synthesize dTDP sugars known to be building blocks for macrolide antibiotics [169].

In the N LFH horizons, 55 degradation pathways were more prevalent compared to the H LFH horizons [126]. These included 16 pathways for the degradation of aromatic compounds such as protocatechuate, an intermediate metabolite in the degradation of lignin and an important component of soil organic matter, and benzene, which is contained in many higher plants and is thus a naturally occurring component of soil organic matter. In addition, 39 degradation pathways including adenosine, guanosine, pyrimidine, thiocyanate, methanol, and taurine degradation, all compounds naturally found in decaying plant tissues, and anaplerotic CO2 fixation into oxaloacetate, were also more prevalent compared to the H LFH horizons (Fig. A.6). Cumulatively, these results suggest a reduced potential to cycle plant derived organic matter in the harvested LFH horizon.

Differences between the LFH and mineral horizons were most pronounced in the higher relative abundance of 8 degradation pathways and 13 biosynthesis pathways in the LFH horizon, and a higher relative abundance of 11 degradation and 2 fermentation pathways in the mineral horizons (Fig. 2.1, Fig. A.8 and Fig.
A.9). Specifically, LFH horizons had a higher relative abundance of carbohydrate degradation pathways that target substrates commonly found in soil organic matter, while the mineral horizons had a higher relative abundance of pyruvate fermentation, dissimilatory nitrate reduction and oxalate degradation pathways. In addition, mineral horizons had a higher relative abundance of pathways related to assimilatory carbon fixation. More details on these pathways can be found in A.2.

[Figure 2.3 graphic. Panel A: box plots of pathway RPKM (log-scaled axis, 2 to 260) for all samples by pathway class (Activation/Inactivation/Interconversion, Biosynthesis, Degradation, Detoxification, Energy-Metabolism, Metabolic-Clusters), with total pathways per class and soil depth for organic and N + H samples. Panel B: number of pathways present (axis 0 to 80000) and number of pathways significantly more abundant in N, H, organic, and mineral samples, broken down by biosynthesis and degradation subcategory (for example amino acid, aromatic compound, carbohydrate, cofactor, lipid, nucleotide, secondary metabolite, C1 compound, carboxylate) plus detoxification, electron transfer, fermentation, and metabolic clusters.]

Figure 2.3: Environmental Pathway Genome Databases (ePGDB) from 26 soil metagenomes. (A.) Summary of 26 environmental Pathway Genome Databases (ePGDB) from 5 horizons in unmanaged (N) and harvested (H) soil profiles. RPKM is reads per kilobase per million mapped. The median RPKM of pathways within a given category is depicted as the line through the box. First (Q1) and third (Q3) quartiles make up the box, and whiskers represent observations with the exception of outliers (> 1.5 × (Q3 − Q1)), which are shown as single points. Total pathways denote the total number of unique pathways in a given category across all 26 ePGDBs. (B.)
Number of pathways significantly more abundant between N and H LFH (organic) horizons and between LFH and mineral (N + H) horizons.

2.3.6 Taxonomic relationships between co-occurrence and metabolic networks

To confirm network taxa were represented within the metagenomes used for metabolic reconstruction, we recruited SSU rRNA gene fragments from shotgun datasets to representative sequences for all OTUs using BWA, and normalized counts using reads per kilobase per million mapped (RPKM). Consistent with expectations [170], we observed that 0.011±0.003% of metagenomic reads were recruited to OTUs (Table A.2). In total, metagenomic SSU rRNA reads recruited to 96% of network OTUs, indicating high similarity in the taxonomic composition of the SSU rRNA and shotgun metagenomic datasets (Fig. 2.2). Given this consistency between datasets, we next determined the phylogenetic relationship between network modules and ePGDBs using the lowest common ancestor (LCA) (Huson et al. 2007) annotations provided by MetaPathways 2.5 for each ORF to calculate pathway level taxonomic distinctness (Δ*), which describes the average path length between two randomly selected taxa and considers only taxonomic relatedness (Fig. 2.4). We observed that differences in the taxonomic annotation of the ORFs within pathways were more prevalent in N LFH horizons (Fig. A.11), consistent with network module composition. For example, Proteobacteria were significantly more abundant within the N LFH microbial community compared to the H LFH microbial community (Fig. 2.1). We then compared the taxonomy of ORFs within the pathways differentially abundant between treatments with the taxonomy of network OTUs.

Within the 85 pathways more prevalent in the N LFH horizons, 62.4% of ORFs were attributed to Proteobacteria (via LCA annotation) in the N horizons compared to 34.7% within the H LFH horizons (Fig. A.11).
Relatedly, within the network modules, 39.2% of OTUs within module B (N LFH horizons) were annotated as Proteobacteria, compared to 20.2% in module A (H LFH horizons) (Fig. A.4). Furthermore, within these same pathways, only 0.2% of the ORFs were attributed to Bacteroidetes within the N horizons, compared to 11.0% within the H LFH horizons (Fig. A.11). This result is consistent with the network analysis, wherein Bacteroidetes made up 26.3% of module A (H LFH horizons) compared to 7.8% of module B (N LFH horizons) (Fig. A.4). Finally, within the pathways more prevalent in the H LFH horizons, we observed that 28.3% of ORFs were attributed to Actinobacteria in H LFH horizons compared to 18.8% in the N LFH horizons (Fig. A.11). Again within the network, 20.2% of OTUs in module A (H LFH horizons) were attributed to Actinobacteria compared to 9.8% in module B (N LFH horizons), suggesting that these samples have an increased potential for macrolide antibiotic synthesis (Fig. A.4).

[Figure 2.4 graphic. Axes: soil depth (cm, 0 to 50) and Δ*. Series: natural unmanaged (N) and harvested (H).]

Figure 2.4: Taxonomic distinctness of metabolic pathways within 26 environmental Pathway Genome Databases (ePGDB). Taxonomic distinctness Δ* (which describes the average path length between two randomly selected OTUs (taxa)) of 26 environmental Pathway Genome Databases (ePGDB) from 5 horizons in unmanaged (N) and harvested (H) soil profiles using the lowest common ancestor (LCA) annotation from each open reading frame (ORF) in each pathway.

2.4 Discussion

In the present study we used OTU count data in combination with shotgun metagenomes to identify microbial interactions and metabolic potential across multiple soil horizons in unmanaged and disturbed experimental plots within the Williams Lake LTSP site.
Here, we move beyond a standard survey of the count data using canonical metrics in community ecology and apply a systematic workflow to construct a co-occurrence network to estimate microbial interactions. Moreover, we evaluated the impact of harvesting and edaphic factors on network composition and topology in relation to microbial community metabolic potential using ePGDBs, and resolved specific pathways mediating nutrient cycling and signaling, manifesting significant differences between LFH horizons of unmanaged and disturbed experimental plots. Several of these pathways could be taxonomically linked to network components associated with cognate soil horizons, reinforcing the power of combining network and metabolic pathway perspectives to understand how perturbation impacts ecosystem functions and services within the soil milieu.

2.4.1 Perturbation effects on soil microbial community structure

Thirteen years post-harvest, forest floor removal, and severe compaction, the LFH horizon differed significantly in both edaphic factors and microbial community composition between soil treatments, indicating long-term effects of forest disturbance (Table 2.2). These observations are consistent with previous soil perturbation studies in disparate ecozones [171–174]. For example, previous work found the forest floor organic layer (LFH) had a distinct and more variable microbial community than that of the mineral horizon found directly below, and differences between microbial communities in both soil horizons increased with geographic distance among samples [27]. Within the mineral horizons, differences in soil chemistry and microbial community structure were associated with sampling depth rather than harvesting treatment. The long-term differences between the H and N LFH horizons may be due to initial disturbance or consequential change in surface vegetation, both of which have been previously observed to influence soil microbial diversity and nutrient availability [20, 27, 175, 176].
While altered microbial community structure has been linked to changes in distributed nitrogen cycling processes [150, 177], such responses are more challenging to identify from a global metabolic perspective due to functional redundancy, resource fluctuations, and the effect of disturbance on interacting community members [25].

Table 2.2: Influences of forest harvesting and depth on soil chemistry and microbial communities

                               Soil Chemistry               Pyrotags
Comparison                     Statistic     p-value        Statistic     p-value
Overall effects (F-value, NPMANOVA)
  Harvesting                   0.58          0.576          1.29*         0.007
  Soil horizon                 25.07*        <0.001         1.14*         0.01
  Interaction                  25.07*        <0.001         1.13*         0.009
Pairwise comparison, N vs H (t-statistic)
  LFH                          3.62*         0.048          1.24*         0.046
  Ae                           1.84          0.071          1             0.584
  AB                           1.62          0.212          1.03          0.35
  Bt                           1             0.43           1.11          0.064
Soil horizons, N + H (t-statistic)
  LFH vs Ahe                   1.69          0.127          1             0.545
  LFH vs Ae                    4.61*         0.001          1.04          0.284
  LFH vs AB                    -             -              1.18*         0.003
  Ae vs AB                     3.86*         0.002          0.89          0.98
  AB vs Bt                     4.63*         0.002          1.16*         0.001
* significant at a false discovery rate of [value lost in extraction]

2.4.2 Harvesting and depth influence soil microbial co-occurrence network topology

To identify differences in metabolic interactions with harvesting treatment and soil horizon, we took a systematic approach to co-occurrence network construction. Co-occurrence networks generated from clustered small subunit ribosomal RNA (SSU rRNA) genes (OTUs) are known to suffer from compositional data (relative proportions) effects that result in false correlations and/or convert legitimate positive correlations to negative ones [178]. We accounted for compositional data effects by using both the Spearman correlation measure and the Kullback-Leibler dissimilarity measure, which is robust to compositional effects wherein the increase in the relative abundance of one OTU necessitates the decrease in relative abundance of one or more other OTUs [179]. We also employed a permutation strategy where the calculation of correlations was repeated after the removal of a single OTU and data renormalization 1,000 times.
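The renormalization step described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the pipeline used in this work: the function names, the random choice of which OTU to drop, and the pure-Python Spearman implementation are all hypothetical choices made for the example.

```python
import random


def ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; returns 0.0 when either vector is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5


def spearman(x, y):
    """Spearman correlation computed as the Pearson correlation of ranks."""
    return pearson(ranks(x), ranks(y))


def renormalize(counts, dropped):
    """Per-sample relative abundances after removing OTU column `dropped`."""
    out = []
    for row in counts:
        kept = [c for k, c in enumerate(row) if k != dropped]
        total = sum(kept)
        out.append([c / total if total else 0.0 for c in kept])
    return out


def leave_one_out_correlations(counts, i, j, n_iter=1000, seed=0):
    """Correlate OTUs i and j after dropping one other OTU and renormalizing.

    `counts` is a samples x OTUs table. Each iteration removes one randomly
    chosen OTU (never i or j), recomputes relative abundances, and records
    the Spearman correlation between OTUs i and j.
    """
    rng = random.Random(seed)
    others = [k for k in range(len(counts[0])) if k not in (i, j)]
    results = []
    for _ in range(n_iter):
        dropped = rng.choice(others)
        rel = renormalize(counts, dropped)
        # Columns to the right of the dropped OTU shift left by one.
        ci = i - (dropped < i)
        cj = j - (dropped < j)
        results.append(spearman([r[ci] for r in rel], [r[cj] for r in rel]))
    return results
```

Under this sketch, a correlation would be kept as an edge candidate only if its sign and approximate magnitude are stable across the returned list, which mirrors the idea that robust associations should not depend on the presence of any single OTU in the normalization.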
Only correlations impervious to changes in data composition due to permutations were considered as edge candidates. Next, we calculated 1,000 bootstrap score distributions to determine statistical significance of correlations and corrected for multiple tests using the Benjamini-Hochberg-Yekutieli method. While network properties such as characteristic path length (3.01) and clustering coefficient (0.32) were well within the ranges reported in previous investigations of soil and marine microbial co-occurrence networks (1.06-6.3 and 0.25-0.33 respectively) [39, 132, 180, 181], other properties, such as the diameter of the largest connected component (12), were lower than previously reported (18-292) [39, 132, 180, 181]. This result is likely related to the modular network structure and the stratified nature of resources and soil microbial communities with depth (Table 2.2 and Fig. 2.1). Indeed, network modules partitioned with harvesting treatment and soil depth, consistent with NPMANOVA (Table 2.2) and HCA (Fig. 2.1) results, reinforcing the idea that co-occurrence patterns are influenced by both perturbation and edaphic factors (Fig. 2.2).

2.4.3 Impaired organic matter degradation and nutrient cycling potential in harvested LFH horizons

Given minimal shotgun metagenome assemblies from our Williams Lake samples, we used a pathway-centric approach to first identify potential metabolic differences between soil horizons and treatments, followed by transitive mapping of indicator pathways onto network modules based on the NCBI taxonomic hierarchy. Overall, metabolic potential varied with soil depth and perturbation state, similar to network results described above. Forest harvesting reduced the relative abundance of 85 pathways in the H LFH horizons and increased the relative abundance of 24 pathways.
Compared to the N LFH horizons, the H LFH horizons had markedly low relative abundances of degradation pathways, particularly aromatic carbon and nucleotide degradation pathways, thereby suggesting altered nutrient cycling 13 years post-harvest (Fig. 2.1). We observed that a total of sixteen aromatic compound degradation pathways had significantly lower relative abundance in the H LFH horizons relative to the N LFH horizons (Fig. A.6). Aromatic compounds constitute a large component of soil humus [182], a complex mixture of organic compounds that provide nutrients for soil microbial communities [183]. For example, the H LFH horizons had a significantly lower relative abundance of two protocatechuate degradation pathway variants, both implicated in the degradation of many aromatic compounds derived from lignin (a structural component of plant cells) [127]. Correspondingly, the H LFH horizons also had a lower relative abundance of the superoxide radicals degradation pathway, likely related to impaired lignin degradation as this process releases free organic radicals [184]. These results support and expand on a recent study from a nearby LTSP site suggesting a decline in biomass conversion potential based on reduced abundance of 41 families of carbohydrate-active enzymes involved in lignin, cellulose and hemicellulose transformation in harvested forest soils [68].

The H LFH horizons also had a lower relative abundance of several energy-metabolism pathways, including ammonia oxidation I (aerobic) to nitrate and CO2 fixation into oxaloacetate (anaplerotic), compared to the N LFH horizons (Fig. 2.4), perhaps both related to significant differences in total carbon, total nitrogen and organic nitrogen between the N (16.86±0.31%, 0.42±0.10%, and 69.26±0.55 ppm respectively) and H (4.36±0.85%, 0.16±0.06%, and 33.87±0.38 ppm respectively) LFH horizons (Table 2.2). Anaplerotic CO2 fixation contributes 10% of cell carbon in heterotrophic bacteria (Perez and Matin, 1982; Sonntag et al., 1995).
Given that 289 Pg of carbon are stored in the forested portions of the boreal forest [185] and 1.33-3.44% of total organic C can be attributed to forest soil microbial biomass [186], we estimate that anaplerotic CO2 fixation is responsible for between 0.40-1.00 Pg of carbon storage in the boreal forest. Soil microbial growth is often carbon limited [187]; as such, the reduced potential for anaplerotic CO2 fixation and significantly lower carbon resources in the H LFH horizons suggest potentially impeded microbial growth and compromised coupled carbon and nitrogen cycling processes. Indeed, the H LFH horizons also had a lower relative abundance of aromatic compound, carbohydrate and nucleotide degradation pathways, again suggesting an impaired ability of the microbial community to degrade plant-derived biomass and cycle nutrients essential for long-term soil productivity 13 years post-harvest.

Despite a lower relative abundance of degradation pathways, the H LFH horizons had a higher relative abundance of 3 myo-inositol pathway variants and 12 carbohydrate synthesis pathways (Fig. 2.1, Fig. A.7). Myo-inositol is produced by bacteria, fungi, algae, plants, and animals and acts as an important signaling molecule in the regulation of mRNA export and other cellular functions [188]. Of the 12 carbohydrate synthesis pathways, five synthesize building blocks for macrolide antibiotics, which function not only as antibiotics but also as signaling molecules that in low concentrations influence transcription of several cellular functions [189]. Increased production of antibiotics and signaling molecules has been shown to mediate competition [190] and may therefore be a response to significant reductions in carbon and nitrogen within the H LFH horizons (Table 2.2).
Furthermore, recent evidence suggests bacterial production of signaling molecules plays an important role in the establishment of rhizosphere microbiomes [128].

Differences in pathway abundances between the LFH and mineral horizons are likely related to differences in available resources and soil function with depth (Table 2.2). More specifically, while LFH horizons cycle nutrients and convert labile substrates into compounds used by plants, mineral horizons serve as massive carbon reservoirs [183]. Indeed, here, pathways more prevalent in the LFH horizons were related to nutrient cycling, while pathways involved in carbon fixation and degradation of recalcitrant compounds were more abundant in the mineral horizons. For example, anaerobic processes such as pyruvate fermentation pathway variants and dissimilatory nitrate reduction, all likely related to increasing anaerobic niche space with soil depth [26], also had a higher relative abundance in the mineral horizons. For more examples see Appendix A.

2.4.4 Consistent taxonomic patterns in network modules and metabolic pathways

To relate network observations with microbial community metabolism, we compared the taxonomic diversity of network modules and metabolic pathways. While canonical diversity metrics such as Chao1 [191], Shannon's [192], and Simpson's [193] consider richness and evenness, they fail to consider the phylogenetic relationship between taxa. Thus, a community with n taxa from a single phylum could be found to be equally diverse as a community with n members from n phyla. Taxonomic distinctness (Δ, Δ*, and Δ+), a diversity measure first proposed in macroecology in 1995, is designed to consider the relatedness of taxa in a given sample in addition to richness and evenness [149, 164]. However, in its original implementation, taxonomic distinctness calculations required complete phylogenetic annotation (Domain to species) of every taxon.
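For readers unfamiliar with the measure, the sketch below illustrates the original, complete-lineage form of two of these indices: Δ+ (the mean path length over all pairs of taxa present) and Δ* (the abundance-weighted mean path length over pairs of different taxa). It is a minimal sketch under stated assumptions: every lineage is complete and of equal length, each step up the hierarchy has weight 1, and the function names and example lineages are hypothetical. It does not handle partial taxonomies.

```python
from itertools import combinations


def path_weight(lineage_a, lineage_b):
    """Steps from the terminal rank up to the deepest rank both lineages share.

    Lineages are tuples ordered from the highest rank to the lowest rank.
    """
    if len(lineage_a) != len(lineage_b):
        raise ValueError("this sketch assumes complete lineages of equal length")
    shared = 0
    for a, b in zip(lineage_a, lineage_b):
        if a != b:
            break
        shared += 1
    return len(lineage_a) - shared


def delta_plus(lineages):
    """Average taxonomic distinctness: mean path weight over all taxon pairs."""
    pairs = list(combinations(lineages, 2))
    if not pairs:
        return 0.0
    return sum(path_weight(a, b) for a, b in pairs) / len(pairs)


def delta_star(lineages, abundances):
    """Abundance-weighted mean path weight over pairs of different taxa."""
    numerator = 0.0
    denominator = 0.0
    for (la, xa), (lb, xb) in combinations(list(zip(lineages, abundances)), 2):
        numerator += path_weight(la, lb) * xa * xb
        denominator += xa * xb
    return numerator / denominator if denominator else 0.0


# Hypothetical three-rank lineages (phylum, class, genus).
taxa = [
    ("PhylumA", "ClassA1", "GenusX"),
    ("PhylumA", "ClassA1", "GenusY"),
    ("PhylumB", "ClassB1", "GenusZ"),
]
print(delta_plus(taxa))
print(delta_star(taxa, [1, 1, 2]))
```

In this toy case the two genera from the same class contribute a short path while every pair spanning both phyla contributes the maximum path, so a community drawn from many phyla scores higher than one of equal richness drawn from a single phylum.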
We extended the taxonomic distinctness algorithm to accept the partial taxonomies common in uncultivated microbial annotation to calculate the phylogenetic distance between taxa, and determine network, module, and pathway diversity (available for download and use). Differences in network module taxonomic distinctness with disturbance and depth suggest co-occurrence patterns may reflect changes in community metabolism. Indeed, pathway level Δ*, calculated using the abundance of each ORF in each pathway, also varied with soil depth and perturbation state. Specifically, Δ* of pathways within the H LFH horizons was significantly higher than that of the N LFH horizons (Fig. 2.4), signifying that, on average, more taxonomic groups have the potential to participate in a given pathway within the H LFH horizons. This result is consistent with co-occurrence patterns wherein module A (H LFH horizons) has the highest Δ+ (Table 2.1). While it has long been recognized that perturbation can affect microbial community structure [194], higher taxonomic distinctness within metabolic pathways within the H LFH horizons and network module A (H LFH horizons) suggests forest harvesting increases functional redundancy and may reflect a resilience force that buffers the community against large-scale process perturbations.

Together with higher taxonomic distinctness in the H LFH horizon network module and ePGDBs, the observations in this dataset are consistent with models and experimental evidence indicating that competition within microbial communities increases antibiotic production [195] and microbial community diversity [194].
However, conclusive links between network taxa and metabolic potential were not possible given the data, as this would require large assembled contigs containing both functional genes and small subunit ribosomal RNA gene sequences (to which the representative sequences of OTUs could be mapped), or alternatively, more comprehensive reference databases allowing for more precise LCA annotations of functional genes (beyond the phylum level) and thus more precise comparisons of network OTUs and metabolic potential.

2.5 Conclusions

Here, we establish that soil microbial community co-occurrence patterns change with harvesting treatment, OM removal, and soil horizon using a principled approach to network construction and interpretation. Both network and metabolic pathway-centric analyses indicate that perturbation effects are constrained to surface LFH horizons, implying co-occurrence patterns can reflect changes in functional potential of soil microbiota 13 years post harvesting. Together, the data suggest that within LFH horizons, forest harvesting alters the dominant architecture of connectivity that knits together the soil microbial community, resulting in a new adaptive landscape in which interactions are free to form beyond the constraints of previously established interactions, ultimately influencing nutrient and energy cycling. Given the data, we posit that perturbation events can result in the reshuffling of trophic relationships and information exchange (e.g., signaling molecules, metabolites, and horizontal gene transfer). These new interactions may or may not be equivalent to the previous state, with long-term implications for soil productivity and forest renewal. The challenge in resolving this uncertainty lies in mapping specific metabolic functions onto the network background, essentially defining the metabolic potential of individual nodes in relation to one another.
This work represents an important step in understanding the impacts of disturbance on microbial community structure and function in soil. From a modeling perspective, knowing how microbial community members are connected may be essential to defining adaptation and response patterns relevant to forest harvesting practice, long-term soil productivity and carbon storage potential within terrestrial ecosystems.

Chapter 3

Digging Deeper into Soil Microbial Community Metabolism

Soil microorganisms power carbon and nutrient transformations in terrestrial ecosystems. However, owing to the complexity of the soil environment and the prevalence of uncultivated organisms, description and analysis of soil microbial metabolism remains challenging. This chapter digs deeper into microbial metabolism using paired metagenomic and metatranscriptomic data to compare and contrast the metabolic potential and expression of soil microbial communities across three seasons and three sampling depths in unmanaged and disturbed experimental plots within the O'Connor Lake Long Term Soil Productivity (LTSP) site. Using both environmental Pathway Genome Databases (ePGDBs) and Carbohydrate Active enZYmes (CAZymes) gene surveys, changes in metabolic potential across time, space and in response to environmental disturbance were found to be decoupled from changes in metabolic expression across these same variables. Indeed, while metabolic potential was strongly influenced by winter conditions, metabolic pathway and CAZyme expression were most strongly influenced by soil depth and harvesting treatment respectively. Next, in an effort to determine whether forest harvesting had similar effects on soil microbial community metabolism across ecozones, the metabolic potential of soil microbial communities at O'Connor Lake and Williams Lake (an LTSP site described in detail in Chapter 2 of this dissertation) were compared.
We found that despite differences in the taxonomy of the soil microbial communities at the two ecozones, metabolic potential was alike, as only approximately 3.6% of pathways identified at Williams Lake were not present at O'Connor Lake. This suggests that metabolic capacity is similar between forest ecosystems, despite geographic distance and differences in Biogeoclimatic Ecosystem Classification (BEC), harvesting intensity and compaction. Finally, we examined the expression of CAZymes from 4 isolate genomes known to be abundant at the site as, due to both their abundance and degradation potential, these isolates present an opportunity to determine how season, depth, and forest harvesting influence CAZyme abundance and expression within a specific taxonomic group. We found that regardless of phylogenetic relatedness, CAZyme expression differed between isolates, suggesting the impact of spatiotemporal variation and harvesting differs amongst related taxa and indicating that taxonomic identity can be decoupled from metabolic function. Given the data, we posit that redundant metabolic capacity both among phyla and within closely related species, combined with large and fine scale variation in metabolic potential, ensures environmental change has disparate effects across community members, thereby tempering the effects of localized extinctions or niche space reduction, and guarding against the loss of metabolic functions within the soil ecosystem.

3.1 Introduction

Essential to biogeochemical cycling, soil microbiota represent one of the most taxonomically diverse communities on the planet. Termed functionally redundant, many soil taxa possess metabolically equivalent genes and pathways [196]. While this may imply taxonomic identity is often decoupled from metabolic potential [197–199], functional redundancy is believed to increase community stability [196, 200].
Indeed, not all microorganisms with similar metabolic potential compete directly, as they may occupy distinct metabolic niches [201], form obligate or mutualistic relationships with specific taxa [39, 130, 202], be narrowly distributed with little spatial mobility [201], or express their metabolic potential in different ways under different circumstances. More generally, microbial composition and metabolic potential are thought to be shaped by environmental parameters, as soil microbial communities are known to be spatiotemporally variable [203–205] as well as sensitive to perturbation [20, 27]. However, given the technical and computational challenges associated with soil metatranscriptomics [113], understanding of the extent to which compositional differences and metabolic redundancies buffer against changes in microbial community metabolic expression with seasonal variation, soil depth, and soil disturbance remains fragmentary.

Soil microbial biomass can turn over in days to months, and can thereby result in successional microbial communities through time [206]. Further, seasonal differences have been observed in soil microbial composition [206], metabolic potential [207] and more recently phenotypic expression [208]. In temperate ecozones, these differences are largely attributed to fluctuations in photosynthetic production between summer, known as the vegetation period, and winter, wherein freezing temperatures and shorter days limit plant activity and thus alter litter and rhizodeposition (the release of organic compounds from plant roots) [208, 209]. Seasonal snowpacks also impact the soil environment by providing insulation that raises soil temperatures relative to the ambient air temperature [135] and reduces soil oxygen [135, 210, 211], which in turn affect microbial composition and activity [208, 209].
To illustrate, recent work found 2-29% of functional metabolic categories (KEGG [157]) identified were differentially expressed between samples collected in the summer during peak photosynthetic production, and in the winter under the snowpack. However, while this work was the first analysis of shotgun soil metatranscriptomics across seasons, metabolic comparisons were completed only for high-level metabolic categories and little attempt was made to relate shifts in metabolic categories either with specific pathways, or to larger ecosystem functions.

Soils are also spatially complex at nm [8], cm, and m scales [212], and vertical stratification of soil microbial communities has been well documented [19]. Differences with depth are generally attributed to gradients in indigenous soil properties, i.e., edaphic factors [23, 27, 44, 45], as well as differences in soil carbon inputs. For example, while LFH horizons (referred to as the O horizon in the USDA soil classification system) are comprised of both labile litter and recalcitrant compounds stabilized via physico-chemical and biological processes [61], carbon inputs in the mineral horizons below are primarily from rhizodeposition [67]. Consistent with differences in carbon resources, enzymatic expression of plant biomass degradation genes has been observed to differ between LFH and mineral soil horizons [68, 213, 214], and strain level differences in genomic content from soil isolates have been linked with fine-scale adaptations to LFH and mineral soil environments [215].

Finally, natural and anthropogenic disturbances such as fire and forest harvesting have also been demonstrated to alter soil microbial communities [4]. For example, persistent impacts in community level diversity between undisturbed and clear-cut soils [20, 65] as well as decreases in microbial biomass have been detected more than 15 years after initial tree harvesting at multiple sampling locations [20, 27, 49, 66, 216].
More recently, forest harvesting was found to change potential community interactions (Fig. 2.4), and reduce metabolic potential for biomass decomposition at the gene [68] and pathway level (Fig. 2.3). However, whether differences in metabolic potential translate into differences in carbon and nitrogen cycling potential and plant biomass conversion processes remains unclear, and thus the impact of soil disturbance on soil microbial function is poorly constrained.

Here, we construct environmental pathway genome databases (ePGDBs) and analyze a known set of carbohydrate degradation genes using paired shotgun metagenomes and metatranscriptomes collected across three seasons and three depths in unmanaged and disturbed experimental plots within a Long-term Soil Productivity (LTSP) site. We compare and contrast predicted potential and expressed metabolic networks, and conduct a detailed investigation of the impact of season, depth, and forest harvesting on the differential abundance and expression of metabolic pathways to test three hypotheses: (i) changes in phenotypic expression due to season and depth will be constrained to a subset of metabolic pathways as genotypic diversity buffers against large scale changes in metabolic processes within the microbial community, (ii) reduced metabolic potential for biomass degradation following forest harvesting significantly alters expression of pathways and genes implicated in carbon and nutrient cycling within the soil microbial community, and (iii) while ecozone (e.g., soil-type, climate, and initial nutrient concentration) and perturbation state (OM removal intensity) influence the taxonomic composition of soil microbial communities, most metabolic pathways will be found across locations (similar metabolic potential).

3.2 Methods

3.2.1 Site description and sampling

This study was conducted in the Interior Douglas-fir (IDF) ecozone at the O'Connor Lake Long Term Soil Productivity (LTSP) site (5088, 12035) near Kamloops in British Columbia, Canada.
Each LTSP site contains 40 m × 70 m (0.28 ha) replicated plots with varying degrees of organic matter (OM) removal. A detailed description of the LTSP study and OM removal treatments can be found in Powers 2005 and 2010 [70, 217] and in Table 1.1. Here, samples were collected from two plots, the OM0 unmanaged natural reference site (N) and the OM2 whole tree harvested treatment plot (H), in June 2013, October 2013, and February 2014 (Fig. 3.1). After harvesting, lodgepole pine seedlings (Pinus contorta Dougl.) were planted with 2.5 m spacing in 2000 [218]. A more detailed description of the O'Connor Lake site can be found in B.1. Each plot was sampled once per day for three days, thereby generating triplicate samples from randomly selected locations within the plot. Over the three days of sampling in June, the site received 31.2 mm of rainfall (over 83% of the 1981-2010 monthly average), resulting in wet soil conditions [219]. During the October sampling, conditions were drier, with the site receiving only 1.2 mm of rainfall (6.2% of the 1981-2010 monthly average) [219]. Prior to sampling in February 2014, the area had received 61.8 cm of snowfall (from November 2013 to February 2014) and thus sampling took place beneath the snowpack [219]. At each location, samples were taken from the LFH (surface organic matter) and mineral soil at 5 cm (min 1) and 15 cm (min 2). Thus, 3 samples from 3 locations in 2 soil plots (18) were collected each season, resulting in a total of 54 samples (Fig. 3.1). Samples were flash frozen in liquid nitrogen in the field and transferred to the University of British Columbia on dry ice, where they were stored at -80 °C prior to nucleic acid extraction.

3.2.2 Genomic DNA and RNA isolation and sequencing

DNA for shotgun metagenomes was extracted from all 54 samples using the FastDNA™ SPIN Kit for Soil (MPBio) according to the manufacturer's instructions and sent to Génome Québec for sequencing. Clustering was done on an Illumina cBot using a pool of 7 libraries at 8 pM on each lane.
Libraries were then sequenced on the Illumina® HiSeq-2000 platform (paired-end 150 bp reads) according to the manufacturer's instructions. Three samples were multiplexed per lane.

[Figure 3.1 graphic: O'Connor Lake (OCL) Reference and Disturbed plots; Feb '14, Oct '13, June '13; LFH, Mineral 1, Mineral 2; DNA and RNA.]
Figure 3.1: Sampling and analysis schematic. Sampling and analysis schematic for 103 samples (5 samples failed sequencing) from 3 soil depths and 3 seasons in two soil plots at O'Connor Lake, B.C.

RNA for shotgun metatranscriptomes was also extracted from all 54 samples using a chemical extraction method [220]. Extractions were completed within a laminar flow hood cleaned with RNase Away™ (ThermoFisher). To 0.5 g of soil, 0.5 ml of CTAB extraction buffer, 0.5 ml of phenol:chloroform:isoamyl alcohol, and 50-100 µl of AmAlS were added, and the mixture was placed into the FastPrep®-24 instrument (MPBio) for 30 seconds. Following centrifugation, the aqueous phase was transferred to a Phase-Lock Gel (Heavy) tube™ (VWR), 0.5 ml of chloroform:isoamyl alcohol was added, and the sample was centrifuged. The aqueous phase was once again removed and nucleic acids were precipitated using ice-cold ethanol. The above steps were repeated for each sample. RNA was then isolated using the QIAGEN AllPrep DNA/RNA Kit, modifying the RLT buffer by adding 10 µl of 2-mercaptoethanol per 1 ml of RLT. Any remaining DNA was removed using the QIAGEN RNase-free DNase Set. Samples were quantified using the Quant-iT™ RiboGreen® RNA Assay Kit (ThermoFisher). ERCC Spike-In Control Mix 1 [221] was then added to each sample according to the manufacturer's instructions. Samples were PCR amplified to check for DNA contamination. Samples were stored at -80 °C with 1 µl of RNase inhibitor and sent to Génome Québec. Total RNA was quantified using a NanoDrop Spectrophotometer ND-1000 (NanoDrop Technologies, Inc.) and its integrity was assessed using a 2100 Bioanalyzer (Agilent Technologies).
Bacterial rRNA was depleted using Ribo-Zero rRNA Removal kits specific for bacterial RNA (Illumina®). Residual RNA was cleaned up using the Agencourt RNAClean™ XP Kit (Beckman Coulter) and eluted in water. The Elute/Frag/Prime buffer from the TruSeq Stranded mRNA Sample Preparation Kit (Illumina) was added and the remainder of the protocol was performed as per the manufacturer's recommendations. Libraries were quantified using the Quant-iT™ PicoGreen® dsDNA Assay Kit (Life Technologies) and the Kapa Illumina® GA with Revised Primers-SYBR Fast Universal kit (D-Mark). Average fragment size was determined using a TapeStation (Agilent Technologies) instrument. Samples were sequenced at Génome Québec on the Illumina® HiSeq-2000 platform (paired-end 100 bp reads). Six samples were multiplexed per lane.

3.2.3 Genomic DNA and RNA processing

Metagenomic and metatranscriptomic sequence reads were processed with Trimmomatic (v. 0.32) to remove adapter sequences, low-quality reads, and low-quality strings of bases [222]. Soft trimming was completed with the parameters LEADING:3, TRAILING:3, SLIDINGWINDOW:4:15, and MINLEN:36. Paired reads were interleaved while orphaned reads were concatenated into a separate FASTA file. Trimmed reads were assembled into contigs using MEGAHIT, selected due to its use of a multi-sized de Bruijn graph, high speed, and low RAM requirements [223]. Assemblies were generated with k-mer values ranging from 27 to 97 with a step of 10. The --merge-level option was set to 10,0.99 such that bubbles in the de Bruijn graph no longer than 10 × k base pairs were merged when they were similar at 99% identity or greater. Environmental pathway genome databases (ePGDBs) for the assembled metagenomic and metatranscriptomic datasets were generated using MetaPathways 2.5 [139, 141], a modular pipeline for open reading frame (ORF) prediction, functional and taxonomic annotation, ORF abundance normalization (for both sequencing depth and ORF length), and the creation of ePGDBs.
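Normalizing ORF abundance for both sequencing depth and ORF length is commonly expressed as RPKM (reads per kilobase per million mapped reads). The following is a minimal sketch of that calculation, assuming the standard RPKM definition; the exact formula implemented in MetaPathways may differ, and the ORF names and counts are hypothetical values chosen only for illustration.

```python
def rpkm(read_count: int, orf_length_bp: int, total_mapped_reads: int) -> float:
    """Reads per kilobase of ORF per million mapped reads (standard definition)."""
    if orf_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("ORF length and total mapped reads must be positive")
    length_kb = orf_length_bp / 1_000                  # corrects for ORF length
    mapped_millions = total_mapped_reads / 1_000_000   # corrects for sequencing depth
    return read_count / (length_kb * mapped_millions)


# Hypothetical ORFs: name -> (reads recruited, ORF length in bp)
orfs = {"orf_a": (500, 1_000), "orf_b": (500, 2_000)}
total_mapped = 10_000_000  # hypothetical number of mapped reads in one sample

normalized = {name: rpkm(count, length, total_mapped)
              for name, (count, length) in orfs.items()}
print(normalized)
```

Both hypothetical ORFs recruit the same number of reads, but the ORF that is twice as long receives half the normalized abundance, which is the length correction the normalization is meant to provide.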
Read recruitment and ORF abundance normalization for metatranscriptomes were strand specific. All ePGDBs were based on a well-curated database of metabolic pathways and components representing all domains of life (MetaCyc-v4-11-07-03) [224], together with the Kyoto Encyclopedia of Genes and Genomes (KEGG-11-06-18) [157], Clusters of Orthologous Groups (COG-13-12-27) [158], Carbohydrate-Active enZYmes (CAZY-14-09-04) [159], and RefSeq-nr-14-01-18 [160] databases. Gene-centric results were also produced for the CAZy database.

3.2.4 Isolate genomes

Four isolate genomes named LTSP857, LTSP855, LTSP849, and LTSPM299 (accession numbers SAMN03340218, SAMN03340243, SAMN03340244 and SAMN0334024), known to be abundant at the O'Connor Lake LTSP site [215], were downloaded and processed through the MetaPathways pipeline [139, 141]. Each metagenomic and metatranscriptomic sample generated in this study was then recruited to each isolate using BWA, and read counts were normalized and expressed as reads per kilobase per million mapped reads (RPKM) such that gene abundances and expression could be compared among samples. Again, read recruitment and ORF abundance normalization for metatranscriptomes were strand specific.

3.2.5 Statistical analyses and data visualization

Statistical analyses were performed in R (version 3.1.2) using the ggplot2, RUV [225], edgeR [226], dplyr, vegan (Dixon, 2003), ecodist [227], and mvpart [228] packages, and a custom Venn diagram script (\am2.r). Analyses were performed on normalized gene counts (RPKM) with the exception of differential abundance (DNA) and expression (RNA) analysis. Differential abundance analysis utilized the normalization procedure within edgeR, known to be more accurate for univariate comparisons [225]. More specifically, for metagenomic data, counts were normalized using the relative log expression (RLE) method and multiple test correction was completed via false discovery rate [226, 229].
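To illustrate what a false discovery rate correction does to a set of per-pathway p-values, the sketch below implements the Benjamini-Hochberg step-up adjustment in plain Python. That this is the specific FDR procedure applied here is an assumption, as the text names only "false discovery rate"; in practice the adjustment would be performed within R, and the p-values below are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that adjusted values
    # never increase as the raw p-value decreases (monotonicity).
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw p-values for four pathways
raw_p = [0.01, 0.04, 0.03, 0.20]
adjusted_p = benjamini_hochberg(raw_p)
print(adjusted_p)
```

A pathway would then be called differentially abundant when its adjusted value falls below the chosen FDR threshold, rather than when its raw p-value does.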
For the metatranscriptomic data, External RNA Controls Consortium (ERCC) spike-ins were first fit to input quantities with a generalized linear model [221] (Fig. B.1) and subsequently used to normalize read counts according to the Remove Unwanted Variation (RUV) procedure [225]. These normalized counts were then passed to edgeR for analysis and once again multiple test correction was completed via false discovery rate [226, 229]. Clarke and Warwick's taxonomic distinctness index [149, 164], extended to accept the partial taxonomies common in uncultivated microbial annotation using the previously published weighted taxonomic distance (WTD) algorithm, was used to calculate the taxonomic distance between taxa identified within the dataset [139, 141]. Here, taxonomic distinctness (Δ) describes the average taxonomic distance between two randomly selected taxa and considers both taxonomic relatedness and evenness [228].

3.3 Results

3.3.1 Multi-omics sequencing

Across the 54 metagenomes an average of 53,824,117±13,576,409 reads were sequenced per sample. On average 2,040,460±722,325 open reading frames (ORFs) over 180 bp (60 amino acids when translated) were predicted within each assembly. Using the 5 reference databases, 64.20±3.83% of ORFs could be annotated, as is typical of omics data [1]. A more detailed description of the assembly and annotation results can be found in B.1. For the metatranscriptomic samples, 5 of the 54 sequencing libraries (Oct-H-Min 1-Rep 2, June-N-Min 2-Rep 3, June-N-Min 1-Rep 3, June-N-LFH-Rep 3, Feb-N-Min 1-Rep 1) failed quality control, resulting in a total of 49 sequenced metatranscriptomes. Details on ribosomal RNA (rRNA) depletion can be found in B.2. On average 26,926,664±7,583,382 reads were sequenced per sample. Following de novo assembly, an average of 80.74±4.48% of total metatranscriptomic reads from a given sample could be mapped to the corresponding assembly (again this includes contigs <200 bp not included in downstream analysis).
A more detailed description of the assembly and annotation results can be found in B.

3.3.2 Metabolic pathway prediction

In order to compare the metabolic potential of soil microbial communities across soil depths and treatments, environmental pathway genome databases (ePGDBs) were generated from the assembled metagenomes and metatranscriptomes. A total of 1,275 unique metabolic pathways involved in biosynthesis (627), degradation (524), detoxification (17), energy-metabolism (101), metabolic clusters (4), and activation/inactivation/interconversion (2) were predicted within the 54 ePGDBs built for each metagenomic assembly (Fig. 3.2). On average 927±45 pathways were predicted within each ePGDB.

[Figure 3.2 graphic: Metagenome and Metatranscriptome panels (axis: RPKM) for Reference and Disturbed plots in Feb '14, Oct '13 and June '13; legend: Activation-Inactivation-Interconversion, Biosynthesis, Degradation, Detoxification, Energy-Metabolism, Metabolic-Clusters.]
Figure 3.2: Summary of potential and expressed pathways. Predicted pathways within 54 metagenomes and 49 metatranscriptomes from 3 soil depths and 3 seasons from unmanaged (N) and harvested (H) soil plots near O'Connor Lake, B.C.

Next, the number of pathways predicted within ePGDBs built from de novo assembly of metatranscriptomes was compared to the number of pathways predicted within ePGDBs built when metatranscriptomic reads were mapped to the de novo assembly of the metagenomes. Within the ePGDBs built using de novo metatranscriptomic assemblies, a total of 1,147 unique metabolic pathways involved in biosynthesis (561), degradation (472), detoxification (14), energy-metabolism (94), metabolic clusters (4), and activation/inactivation/interconversion (2) were predicted (Fig.
3.2). When metatranscriptomic reads were mapped to de novo metagenomic ePGDBs, only 1,020 unique pathways were predicted (127 fewer than in the de novo metatranscriptome assembly ePGDBs). Further, on average almost twice as many pathways were predicted within each individual ePGDB built from de novo metatranscriptome assemblies (632±142) compared to ePGDBs built when metatranscriptomic reads were mapped to the de novo metagenomic assembly (302±93). Based on both the total number of unique pathways predicted and the average number of pathways predicted per sample, ePGDBs built from de novo metatranscriptomic assemblies were used for downstream analyses.

3.3.3 Taxonomic distinctness of potential and expressed metabolic pathways

In order to determine the impact of season, depth and treatment on the taxonomic diversity of the soil microbial communities, we used the lowest common ancestor (LCA) [144] annotations provided by MetaPathways 2.5 for each ORF to calculate pathway-level taxonomic distinctness (Δ), a measure describing the taxonomic distance between two randomly selected taxa that considers both taxonomic relatedness and evenness [149, 164] (Fig. 3.3). Using t-tests and multiple test correction, no significant differences in Δ were found among or between metagenomes and metatranscriptomes across seasons, horizons or treatments. However, the standard deviation in Δ within the metatranscriptomes was an order of magnitude larger than that of the metagenomes (0.72 and 0.08 respectively). This suggests that while taxonomic distinctness remains relatively stable within the metagenomes, expression among microbial community members varies in time and space.
This observation is likely related to the heterogeneity of the soil environment, wherein the availability of resources changes in seconds or across nanometers [8].

[Figure 3.3 graphic: Metagenome and Metatranscriptome panels of taxonomic distinctness (Δ) for N and H plots in Feb '14, Oct '13 and June '13 at LFH, Min 1 and Min 2.]
Figure 3.3: Taxonomic distinctness of metabolic pathways within 103 environmental Pathway Genome Databases (ePGDB). Taxonomic distinctness Δ (which describes the average path length between two randomly selected OTUs (taxa)) of 103 environmental Pathway Genome Databases (ePGDB) from 5 horizons in unmanaged (N) and harvested (H) soil profiles using the lowest common ancestor (LCA) annotation from each open reading frame (ORF) in each pathway.

3.3.4 Potential and expressed metabolic pathways

To determine the number of potential pathways that were expressed, and to assess the number of pathways expressed but not identified within the metagenomic assemblies, pathways within metagenomic and metatranscriptomic ePGDBs were compared (Fig. 3.2). A total of 1,072 pathways were shared between the metagenomic and metatranscriptomic ePGDBs (84.1% and 93.5% of their pathways, respectively). However, little correlation between the relative abundance of pathways within the metagenomic ePGDBs and the metatranscriptomic ePGDBs was observed (R² = 0.10), signifying that common genomic pathways are not the most highly expressed. Of the 203 pathways present only within the metagenome (and thus not within the metatranscriptomes), 110 were involved in biosynthesis, 80 in degradation, 3 in detoxification, 7 in energy-metabolism, 2 in metabolic clusters, and 1 in activation/inactivation/interconversion.
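The counts reported above can be cross-checked with simple arithmetic. The sketch below is only a worked check using the totals stated in the text (1,275 metagenomic pathways, 1,147 metatranscriptomic pathways, 1,072 shared); the variable names are illustrative and not part of any pipeline.

```python
# Totals reported in the text
metagenome_total = 1_275          # unique pathways across metagenomic ePGDBs
metatranscriptome_total = 1_147   # unique pathways across metatranscriptomic ePGDBs
shared = 1_072                    # pathways present in both

# Pathways found in only one of the two data types
metagenome_only = metagenome_total - shared
metatranscriptome_only = metatranscriptome_total - shared

# Shared pathways as a percentage of each total
pct_of_metagenome = round(100 * shared / metagenome_total, 1)
pct_of_metatranscriptome = round(100 * shared / metatranscriptome_total, 1)

# Class breakdown of the metagenome-only pathways, as listed in the text
metagenome_only_by_class = {
    "biosynthesis": 110,
    "degradation": 80,
    "detoxification": 3,
    "energy-metabolism": 7,
    "metabolic clusters": 2,
    "activation/inactivation/interconversion": 1,
}
breakdown_total = sum(metagenome_only_by_class.values())

print(metagenome_only, metatranscriptome_only)
print(pct_of_metagenome, pct_of_metatranscriptome)
print(breakdown_total == metagenome_only)
```

The class breakdown sums to the 203 metagenome-only pathways, and the two percentages reproduce the reported 84.1% and 93.5%.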
A full list of these pathways can be found in B.4. These pathways may be expressed in low quantities and thus poorly assembled within the metatranscriptome, expressed under alternative sampling conditions, or represent unexpressed genomic elements.

In total 75 pathways (44 biosynthetic, 28 degradation, 2 metabolic cluster and 1 activation) were expressed but not predicted within the metagenomes. A full list of these pathways can be found in B.5. Of these 75 pathways, 54 were from eukaryotic organisms, including tRNA splicing; gibberellin biosynthesis I (non C-3, non C-13 hydroxylation), implicated in early plant growth [230, 231]; 2 pathways for juvenile hormone III biosynthesis; oxalate biosynthesis, a fungal and plant compound degraded by soil bacteria in order to capture cations and energy [232]; and free phenylpropanoid acid biosynthesis, a precursor to plant compounds such as cinnamic acid, caffeic acid, and sinapic acid, all components in lignin. As all expressed genes must be present within genomes, the prevalence of eukaryotic pathways identified within the metatranscriptomes but not the metagenomes suggests that eukaryotic ORFs did not readily assemble (genes are longer than transcripts due to splicing) or that eukaryotic ORFs were poorly predicted within the metagenome as they include introns and exons. Alternatively, these pathways may be relatively rare within the metagenome yet highly expressed and thus better assembled within the metatranscriptome.

Next we compared the presence and absence of pathways within each season (June, October, and February), treatment (N and H) and depth (LFH, Min 1, and Min 2) within both the metagenomes and metatranscriptomes (Fig. 3.4). In all cases most pathways were shared, and while all seasons, treatments and depths did contain unique pathways, there was a higher proportion of unique pathways within the metatranscriptome comparisons.
This suggests both metabolic potential and expression are impacted by season, depth, and treatment.

[Figure 3.4 graphic: Venn diagrams for Metagenome and Metatranscriptome; sets: N, H; Feb '14, Oct '13, June '13; Mineral 2, Mineral 1, LFH.]
Figure 3.4: Venn diagram of potential and expressed pathways. Venn diagrams comparing the presence and absence of pathways within each season (June, October, and February), treatment (N and H) and depth (LFH, Min 1, and Min 2) within both the metagenomes and metatranscriptomes.

3.3.5 Trends in soil microbial metabolic potential and expression with season, depth, and treatment

Relationships among season, depth, treatment and metabolic potential and expression were initially evaluated using multivariate regression trees (MVRTs), a statistical technique designed to describe and predict relationships between multidimensional data and sample characteristics [228]. At this broad level of comparison, season had the greatest effect on metabolic potential (Fig. 3.5). Indeed, the largest differences were seen between the February samples and the June and October samples. Within the February samples, metabolic potential differed most between the LFH and mineral horizons. Further, differences in metabolic potential between treatments were identified within the February LFH samples. In contrast, differences in metabolic expression were greatest between the LFH and mineral horizons (both Min 1 and Min 2).
Differences were also observed by season, wherein LFH samples from both February and June differed from the October samples. Similarities in pathway expression between February and June LFH may be reflective of the snowpack and wet soil conditions respectively (the site received >33 mm of rain during the three days of sampling in June), both of which reduce soil oxygen availability [135, 210, 211, 233]. Finally, differences in metabolic expression between treatments in the October LFH samples were also observed. Thus, while season, depth, and forest harvesting impact both potential and expressed microbial metabolism, metabolic potential was most highly influenced by winter conditions while phenotypic expression was more highly influenced by soil depth (Fig. 3.5).

[Figure 3.5 graphic: Metagenome tree splitting by Season (June & October, n=36; February), then Depth (Mineral 1 & Mineral 2, n=12; LFH), then Treatment (N, n=3; H, n=3). Metatranscriptome tree splitting by Depth (Mineral 1 & Mineral 2, n=32; LFH), then Season (June & February, n=11; October), then Treatment (N, n=3; H, n=3). Pie chart legend: Activation-Inactivation-Interconversion, Biosynthesis, Degradation, Detoxification, Energy-Metabolism, Metabolic-Clusters; scale: log(pathways differentially abundant/expressed).]
Figure 3.5: Multivariate regression tree of potential and expressed metabolic pathways. The influence of season, depth and forest harvesting on potential (error = 0.69) and expressed metabolic pathways (error = 0.80). Pie charts denote the number of pathways differentially abundant or expressed.

In order to identify pathways driving the patterns with season, depth and treatment, we compared the relative abundance and expression of the groups of samples identified in the metagenomic and metatranscriptomic MVRT respectively (Fig. 3.5). First we compared the relative abundance of pathways identified in June and October samples with those identified in February (Fig. 3.6). Samples collected in June and October had a significantly greater abundance of 6 pathways (Fig.
3.6), including 2 for the degradation of trehalose, a disaccharide found in bacteria and fungi [234], and the biosynthesis of fosfomycin, a compound with antibiotic properties produced by some bacteria [235]. In contrast, samples collected in February had a greater relative abundance of 5 pathways (Fig. 3.6), including 2 pathways involved in the biosynthesis of thiamine (vitamin B1) and L-carnitine, both implicated in energy metabolism [236, 237], and the degradation of phenylacetate, an intermediate produced in the breakdown of many aromatic compounds [238] common in soil organic matter [182].

Next we compared the relative abundance of pathways within the LFH and mineral horizons in the February samples. LFH horizons had a higher relative abundance of 13 pathways, including 2 detoxification pathways, 8 biosynthesis pathways involved in cofactor biosynthesis, a lipid biosynthesis pathway, and actinorhodin biosynthesis, an antibiotic produced by Actinobacteria and some yeasts [239] (Fig. 3.6). Additionally, the LFH horizons had a higher abundance of 2 degradation pathways, namely starch degradation V and methanol oxidation to formaldehyde. Interestingly, within the mineral horizons below, the pathway for methane oxidation to methanol, the input necessary for methanol oxidation to formaldehyde, was more abundant, suggesting metabolic processes within mineral samples may stimulate the microbial communities in the LFH samples above. The mineral horizons also had a higher relative abundance of 6 additional degradation pathways, including 4-coumarate degradation (anaerobic), phenol degradation II (anaerobic), and 4-ethylphenol degradation (anaerobic), all involved in the breakdown of plant biomass [240–242] (Fig. 3.6), and 4 biosynthetic pathways. Finally, we compared the N and H LFH in February. A total of 10 pathways were differentially abundant between treatments (Fig. 3.6). N LFH samples had a higher relative abundance of 1 detoxification, 3 degradation and 2 biosynthesis pathways, including actinorhodin biosynthesis.
In contrast, the H horizons had a significantly higher abundance of 4 pathways (Fig. 3.6), including 2 tetrapyrrole biosynthesis pathways involved in metal binding [243] and 1 pathway involved in resistance to the antibiotic polymyxin (Fig. 3.6).

Within the metatranscriptomes, 43 pathways had significantly higher expression in the LFH horizons compared to the mineral horizons (Fig. 3.6), including 19 biosynthetic pathways, 13 of which are involved in the biosynthesis of lipids, a principal component of cell membranes [244]. The LFH samples also had a higher relative expression of 17 degradation pathways, including 5 carbohydrate degradation pathways, namely those for trehalose, again a disaccharide found in bacteria and fungi [234]; L-arabinose, a hydrolysis product of plant hemicellulose [245]; xylose, a monosaccharide found in wood [246]; and 2 pathways for the degradation of chitin, a common compound in soils found in the cell walls of fungi and the exoskeletons of invertebrates [247]. Finally, the LFH horizons also had a higher relative expression of 7 energy metabolism pathways, including hydrogen oxidation I (aerobic) and chitin degradation to ethanol (Fig. 3.6). In contrast, the mineral horizons had a higher relative expression of 13 biosynthetic and 23 degradation pathways. More specifically, these included lipid IVA biosynthesis, a major component of cell membranes, the biosynthesis of the amino acid histidine, and 6 pathways for nucleoside and nucleotide biosynthesis, all likely indicative of cell growth.
Degradation pathways included 10 pathways involved in nucleotide degradation, known to be an important source of phosphorus for the soil microbial community [142], and 5 pathways involved in the degradation of aromatic compounds, including phenol degradation and protocatechuate degradation, both involved in the degradation of lignin, an important component of soil organic matter [126].

Four pathways were more highly expressed in the LFH samples in June and February compared to October, including 1 pathway each for carbohydrate and nucleotide biosynthesis, and 1 pathway each for oxalate degradation, used to acquire cations such as calcium, carbon, and energy [232], and nucleotide degradation (Fig. 3.6). No pathways were found to be more highly expressed in the LFH horizons sampled in October. Between October N LFH samples and October H LFH samples, 2 pathways for the synthesis of dimethylsulfoniopropionate, an antioxidant and osmolyte produced by plants, were more highly expressed in the N LFH samples [248]. In contrast, the H LFH horizons had a higher relative expression of 4 biosynthetic pathways, including 1 for lipids, 1 for nucleotides, 1 for carbohydrates and 1 for a cofactor involved in protection against oxidative damage [224], and of 2 degradation pathways, 1 for the degradation of pyrimidine, a component of nucleic acids, and 1 for oxalate degradation (Fig. 3.6).

[Figure 3.6 graphic: Metagenome and Metatranscriptome panels; columns contrast the group a pathway is "more abundant in" with the group it is "compared to" (e.g. June & Oct versus Feb; Feb LFH versus Feb Min 1 and Min 2; LFH versus Min 1 & Min 2); rows list pathway subclasses within Activation-Inactivation-Interconversion, Biosynthesis, Degradation, Detoxification, Energy-Metabolism and Metabolic-Clusters; legend: number of pathways (3, 5, 8, 10, 13).]
Figure 3.6: Differentially abundant potential and expressed pathways driving patterns in the MVRT. Pathways found to drive MVRT patterns in the abundance of potential and expressed metabolic pathways.

While MVRTs provide a powerful method by which to assess broad trends in pathway abundance and expression across the combination of sample characteristics (season, depth, and treatment), given the hierarchical study design it is possible to isolate the average effect of each of the three sample characteristics individually [226, 229, 249]. A detailed description of these results is given in B.2. First, we assessed differences in pathway abundance and expression among seasons. Briefly, few differences in pathway abundances were found across seasons (Fig. 3.7), suggesting metabolic potential remains relatively stable through time. Within the metatranscriptomes, June had the least in common with the other seasons, which may be related to photosynthetic activity (Fig.
3.8) [208, 209] and/or increased soil moisture due to rainfall, which has previously been shown to stimulate microbial respiration and alter microbial composition and physical soil properties [47, 250].

[Figure 3.7 graphic: Metagenome panel; columns contrast the group a pathway is "more abundant in" with the group it is "compared to" across seasons (June, Oct, Feb), depths (LFH versus Min 1 & Min 2) and treatments (N versus H); rows list pathway subclasses within Activation-Inactivation-Interconversion, Biosynthesis, Degradation, Detoxification, Energy-Metabolism and Metabolic-Clusters; legend: number of pathways (2, 5, 8).]
Figure 3.7: Differential abundance of potential pathways. Differential abundance of potential pathways among seasons, depths, and treatments.

We then evaluated the average impact of depth on microbial metabolism by assessing differences in pathway abundance and expression between the LFH and mineral samples across all samples. Differences between the LFH and mineral horizons were more pronounced than differences between treatments and seasons in both the metagenomes and metatranscriptomes (Fig. 3.7 and Fig. 3.8). Within the metagenomes a total of 48 pathways were more abundant within LFH samples (Fig. 3.7).
Differences with depth in the metatranscriptomes are described in detail above, as depth was identified as the major driver of differences in metabolic expression (Fig. 3.8). However, none of the pathways more abundant in the LFH samples were more highly expressed in the LFH samples, consistent with the poor correlation between pathway abundance and expression.

Finally, to better determine the impact of forest harvesting on microbial metabolism, we assessed differences in pathway abundance and expression between N and H samples across all seasons and depths. At the metagenomic level few differences were found between N and H samples across all depths and seasons (Fig. 3.7). This result is consistent with the MVRT, wherein the effect of forest harvesting was smaller than the effects of both season and depth (Fig. 3.5). In contrast, at the metatranscriptomic level 24 and 22 pathways were differentially expressed between N and H samples across all horizons, and between the N and H LFH horizons, respectively (Fig. 3.8).
These differences may reflect differences in resource availability, as both stand age and tree species affect soil organic matter, microbial biomass and activity [186].

[Figure 3.8 graphic: Metatranscriptome panel; columns contrast groups across seasons (June, Oct, Feb), depths (LFH versus Min 1 & Min 2) and treatments (N versus H); rows list pathway subclasses within Activation-Inactivation-Interconversion, Biosynthesis, Degradation, Detoxification, Energy-Metabolism and Metabolic-Clusters; legend: number of pathways (3, 5, 8, 10, 13).]
Figure 3.8: Differential abundance of expressed pathways. Differential abundance of expressed pathways among seasons, depths, and treatments.

3.3.6 Soil microbial metabolism across forest ecozones

Previous investigations of soil microbial communities at LTSP sites have identified persistent harvesting impacts, detectable more than 15 years post disturbance, and elucidated decreases in microbial biomass [27, 49, 66, 216] and changes in community-level diversity between undisturbed and harvested sites.
Further, OM removal and harvesting have also been shown to result in declines in soil carbon concentration and nutrient availability (e.g., phosphorus and nitrogen) in surface horizons (0-20 cm) [69–71]. However, these effects have been shown to vary with ecozone [69, 70, 72]. Owing to both the previously observed variation in nutrient and microbial response among ecozones and the currently limited understanding of the relationship between soil disturbance and microbial metabolism, we assessed similarities and differences in microbial community metabolism between two ecozones. To do so we compared the 103 ePGDBs generated here to 26 ePGDBs generated from metagenomic samples extracted from 5 soil horizons from an unmanaged (OM0) and a harvested (OM3) site located in the Sub-Boreal Spruce zone at the Skulow Lake LTSP site (SBS-3 WL) (52°20′N, 121°55′W) near Williams Lake in British Columbia, Canada. We found that 1,048 pathways were present at the Skulow Lake LTSP site and either present or expressed at O'Connor Lake, representing 84% and 78% of the total pathways recovered at the sites respectively. Given the differences in soil treatment, geographic location, and ecozone between the two sampling locations, the overlap in pathways identified between the two sites suggests there exists a large number of core metabolic functions within forest soil microbial communities. Further, as the taxonomic composition of soil microbial communities at the two sites is known to be different [27], this result also implies high functional redundancy amongst taxa at the two sites. However, there was poor correlation between the abundance of these pathways at the two sites (r² = 0.41), suggesting that despite the overlap in pathway presence, the relative abundance of pathways varies with location. Further, 53 pathways found in the metatranscriptomic ePGDBs but not the metagenomic ePGDBs from O'Connor Lake were identified within the Skulow Lake ePGDBs (Fig.
3.9), confirming that these pathways are likely comprised of reactions that are relatively rare within the O'Connor Lake metagenome yet highly expressed and thus better assembled within the metatranscriptome.

In total 141 and 45 pathways were unique to the O'Connor Lake and Skulow Lake datasets respectively (Fig. 3.9). Within the 141 pathways unique to O'Connor Lake, 38 were implicated in secondary metabolite biosynthesis. Of these, 21 are thought to be exclusively produced by plants, including free phenylpropanoid acid biosynthesis, a precursor to plant compounds such as cinnamic acid, caffeic acid, and sinapic acid, all components in lignin synthesis, and barbaloin biosynthesis, a component of plant leaves, oleoresin monoterpene biosynthesis, a resin secreted by coniferous trees [251], and monoterpene biosynthesis, a compound produced by plants as a defense mechanism [252]. Within the 45 pathways unique to Skulow Lake, 10 are related to secondary metabolite biosynthesis, 6 of which are believed to be exclusively produced by plants, including sinapate ester biosynthesis and gibberellin biosynthesis III (early C-13 hydroxylation), both implicated in early plant growth [230, 253].

In addition, there were 10 aromatic compound degradation pathways unique to O'Connor Lake, all involved in the breakdown of plant biomass [240–242] (Fig. 3.9). These included 2 pathways for vanillin and vanillate degradation, important components of lignin, and 1 pathway for 3-hydroxycinnamate degradation (note the precursor for cinnamic acids was also found exclusively at O'Connor Lake). Similarly, 3 aromatic carbon degradation pathways were also uniquely present at Skulow Lake, namely 2-nitrophenol degradation, anthranilate degradation IV (aerobic), and toluene degradation, again all of which are related to plant biomass degradation [254, 255]. Finally, there were also large differences in non-carbon nutrient pathways between the two sites (Fig. 3.9), most of which are involved in sulfur cycling. 
For example, sulfur oxidation II (Fe+3-dependent), sulfoacetaldehyde degradation III, (R)-cysteate degradation, a precursor of sulfolipids [256], dimethyl sulfide degradation I, and thiosulfate disproportionation I (thiol-dependent) were all unique to O'Connor Lake. In contrast, tetrathionate reduction I (to thiosulfate), methylthiopropionate degradation I (cleavage), involved in sulfur metabolism, dimethyl sulfoxide degradation and thiosulfate oxidation I (to tetrathionate) were unique to Skulow Lake. However, closer investigation of these pathways revealed that their abundance was low (bottom quartile) and, with the exception of thiosulfate oxidation I, rare, occurring in less than 25% of samples. Indeed, while some discrepancies in pathway presence may be due to sampling depth, differences in plant-produced secondary metabolites together with differences in plant biomass degradation and sulfur cycling potential suggest that pathways unique to each site may be indicative of differences in soil organic matter content and above ground vegetation. Indeed, while both sites were replanted with lodgepole pine (Pinus contorta Dougl.), the native vegetation differed between the two sites wherein O'Connor Lake is dominated by Douglas fir (N) while Skulow Lake is dominated by lodgepole pine (Pinus contorta Dougl.) and hybrid spruce (Picea glauca engelmannii), Douglas-fir, aspen (Populus tremuloides Michx.) and cottonwood (Populus balsamifera L.) (N).

Figure 3.9: Comparison of the metabolic potential between ecozones. Presence and absence of pathways between 25 ePGDBs at Skulow Lake and 103 ePGDBs at O'Connor Lake. The figure compares the Skulow Lake metagenome, the O'Connor Lake metagenome and the O'Connor Lake metatranscriptome, and shows the number of unique pathways in each pathway class.

Previous work at the Skulow Lake site identified a reduced potential to cycle plant derived organic matter in the harvested LFH horizon, highlighting 55 degradation pathways significantly more abundant within the N LFH horizons (Fig. 2.4). However, fewer differences in potential metabolic pathways between treatments were identified at O'Connor Lake. Further, no differences were identified in carbohydrate biosynthesis, secondary metabolite biosynthesis, aromatic compound degradation, or non-carbon nutrient degradation pathways, as was observed within the Skulow Lake metagenomic ePGDBs. This is likely related to differences in the harvesting treatments sampled at each site: while the forest floor was left intact at O'Connor Lake, the OM3 treatment sampled at Skulow Lake suffered complete forest floor removal, effectively removing the majority of organic inputs.

In addition to impaired potential of the microbial community to degrade plant-derived biomass at the Skulow Lake site, the H horizons were found to have a reduced capacity for CO2 fixation into oxaloacetate (anaplerotic), a process which contributes 10% of cell carbon in heterotrophic bacteria [257, 258]. 
While this pathway was not found to be differentially abundant between treatments at O'Connor Lake, it was present in 30/54 metagenomic ePGDBs and 47/49 metatranscriptomic ePGDBs. Further, within the Skulow Lake ePGDBs average RPKM for anaplerotic CO2 fixation was 752.11±433.2 whereas it was 4.1±4.9 and 14.9±34.1 in the O'Connor Lake metagenomic and metatranscriptomic ePGDBs respectively. This is consistent with the idea that while pathway presence is similar between sites, pathway abundance varies between locations.

Differences in metabolic potential between LFH and mineral horizons were also identified at both the Skulow and O'Connor Lake sites. Four pathways were more abundant in the LFH samples at both sites, including lysine and ergothioneine biosynthesis, both essential amino acids, homoglutathione biosynthesis, a plant pathway that plays a role in redox control and detoxification [259], and degradation of xyloglucan, a component of hemicellulose in some plant cell walls [260]. No pathways were more abundant in the mineral horizons at both sites.

Within metagenomic ePGDBs at both sites, the LFH horizons contained a higher abundance of carbohydrate degradation pathways related to soil organic matter degradation (Fig. 3.9). However, while degradation pathways for homogalacturonan, a component in plant cell walls [259], and L-arabinose, a hydrolysis product of plant hemicellulose [245], were more prevalent in LFH horizons at Skulow Lake, glucose, pectin, lactose, cellulose and rhamnogalacturonan degradation pathways were more prevalent in the LFH horizons at O'Connor Lake, again suggesting pathway abundance is related to differences in soil organic matter and above ground vegetation. Within the mineral horizons oxalate degradation was more prevalent compared to the LFH horizons at both sites. 
Oxalate degradation genes are widespread among bacteria as oxalates are the most commonly oxidized 2-carbon compound within the environment [224], and are used to acquire cations such as calcium, carbon, and energy [232].

3.3.7 Abundance and expression of carbohydrate and lignin degradation genes (CAZymes)

Recent work at LTSP sites has found forest harvesting to reduce metabolic potential for biomass decomposition at the gene [68] and pathway level (Fig. 2.3). Indeed, using Carbohydrate Active enZymes (CAZyme) gene surveys, Cardenas et al. found 41 CAZyme families to be consistently affected by forest harvesting at O'Connor Lake based on samples collected in 2010 and 2011 [27]. Soils represent a globally important carbon sink, storing two times the carbon in Earth’s atmosphere and three times the carbon in above ground biomass [62]. Thus, we dug deeper into soil organic matter degradation metabolism and assessed whether forest harvesting affected both metabolic potential and expression of CAZyme gene families within our 103 sequence libraries.

A total of 265 CAZyme families were identified within the 103 samples (Fig. B.5). 83 CAZymes were present in the metagenomes but never expressed under the conditions sampled (Fig. B.2). In contrast, only 2 CAZyme families, CBM19 and CBM45, both carbohydrate binding modules (CBMs), non-catalytic proteins that enhance the catalytic power of the enzymes with which they bind [159], were expressed but not found within the metagenome. In total 2.24% of total metagenomic ORFs were annotated as CAZymes, consistent with previously reported studies in which 1-5% of genes in free-living organisms are CAZymes [261]. However, despite the prevalence of CAZymes, only 0.22% of total metatranscriptomic ORFs were annotated as CAZymes. Indeed, consistent with pathway level analyses, there was little correlation between the abundance and expression of CAZyme families within the dataset (R² = 0.27). 
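The weak relationship summarized by this R² value can be illustrated with a minimal sketch. It assumes that R² is the squared Pearson correlation between per-family mean RPKM in the metagenomes and in the metatranscriptomes; the correlation method is not stated in this section, so this is an assumption, and the function name and all RPKM values below are hypothetical and chosen only for illustration.

```python
def r_squared(x, y):
    """Squared Pearson correlation for two equal-length numeric sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of cross-products and squared deviations about the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return (sxy * sxy) / (sxx * syy)


# Hypothetical mean RPKM per CAZyme family (invented values, not thesis data).
abundance = {"GH13": 120.0, "GT2": 310.0, "CE4": 95.0, "AA3": 40.0}
expression = {"GH13": 0.4, "GT2": 2.1, "CE4": 0.2, "AA3": 1.5}

# Only families observed in both data types can be compared.
families = sorted(set(abundance) & set(expression))
r2 = r_squared([abundance[f] for f in families],
               [expression[f] for f in families])
print(f"R^2 across {len(families)} families: {r2:.2f}")
```

Restricting the comparison to families present in both data types is a design choice of this sketch; families detected in only one data type would otherwise have to be assigned an RPKM of zero.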
The most abundant families from each class were GH13, GT2, CE4, PL4 and AA3 in the metagenome, while the most expressed families were GH36, GT2, CE11, PL1 and AA2. Further, while the average RPKM for CAZyme families within the metagenome was 79.4±265.2, the average RPKM for expression was 1.3±9.5, more than an order of magnitude lower (Fig. B.5). The disparity between CAZyme abundance and expression strongly suggests CAZyme expression is regulated by environmental conditions. As such we next explored the relationships among season, depth, treatment and metabolic potential and expression of CAZymes using multivariate regression trees (MVRTs) (Fig. 3.10).

Within the metagenomes the differences in CAZyme abundance were only attributed to seasons. Indeed, consistent with pathway analysis, June and October samples differed from February samples, again likely related to the differences in edaphic factors under winter conditions (Fig. 3.10). In contrast, soil treatment was identified as having the greatest influence on CAZyme family expression (Fig. 3.10). The H samples were further influenced by season, wherein samples collected in February differed from those collected in June and October. Finally, samples collected in February were also influenced by soil depth and LFH samples differed from mineral samples (Fig. 3.10). 
Differences in CAZyme family expression between seasons and depth may be more prominent in February due to the increased consistency of soil moisture and temperature and decrease in labile inputs from aboveground vegetation beneath the snowpack.

Figure 3.10: Multivariate regression tree of potential and expressed CAZymes. (A.) The influence of season, depth and forest harvesting on potential (Error = 0.93) and expressed (Error = 0.72) CAZymes. Pie charts denote the number of CAZymes differentially abundant or expressed. (B.) CAZymes driving patterns in the MVRT. Families shown in panel B: GH5, GH17, GH24, GH28, GH35, GH36, GH48, GH55, GH103, GT17, GT30, GT31, GT70, GT85, GT89, PL4, CE5, AA3, AA5, AA6, CBM2 and CBM43.

In order to identify pathways driving the patterns with season, depth and treatment, we compared the relative abundance and expression of CAZyme families among the groups identified in the metagenomic and metatranscriptomic MVRT respectively (Fig. 3.10). Within the metagenomes, the abundance of 10 CAZyme families was greater in February compared to June and October (Fig. 3.10), including GH24 and GH103, both of which act on peptidoglycan, a compound found in the cell walls of many bacteria [159], GH28, known to catalyze degradation of pectin [159], an important component of plant cell walls [262], 5 Glycosyl Transferase families, 1 Polysaccharide Lyase family and 1 Carbohydrate Esterase family. Within the metatranscriptomes, 1 CAZyme family, AA3, known to act on cellobiose, an intermediate in cellulose degradation [159], was more highly transcribed within the N samples. 
In contrast, 5 CAZyme families, including GH5 and GH48, both of which have cellulolytic activity [159], GH35 and GH36, both of which contain galactosidases, as well as CBM2, known to bind to over 100 residues [159], were more highly transcribed within the H samples (Fig. 3.10). Next, we identified 6 CAZyme families that were more highly expressed in H samples collected in February compared to H samples collected in June and October. Notably these included GH17, a group of glucosidases, and CBM43, known to bind to GH17 genes [159]. Finally, following multiple test correction no CAZyme families were significantly different between the February LFH and February mineral samples (p > 0.05) (Fig. 3.10).

We then determined the average effect of season, depth, and treatment on CAZyme abundance and expression across the samples. Few differences were identified between individual seasons in both metagenome (Fig. B.6) and metatranscriptome (Fig. B.7). Indeed, 4, 15, and 2 CAZyme families were differentially abundant between June and October, June and February, and February and October respectively (Fig. B.6). Further, within the metatranscriptomes 3, 2, and 1 CAZyme families were differentially abundant between these same seasons (Fig. B.7). Within each season, 4, 0, and 45 CAZyme families were differentially abundant between N and H samples in June, October and February respectively. However, at the transcriptomic level, 0, 10, and 1 CAZyme families were differentially expressed in June, October and February respectively (Fig. B.7). This may be related to decreased soil oxygen in June and February due to rain and snow respectively, as changes in redox conditions alter microbial composition and metabolism. Differences between depth and treatments within each season are detailed in B.2.

We then evaluated the impact of depth on CAZyme abundance and expression by assessing differences between the LFH and mineral samples across all seasons and treatments (Fig. 
B.6). We found 158 CAZyme families were differentially abundant between the LFH and mineral samples. More specifically, 111 were more abundant in the LFH horizons while 47 were more abundant within the mineral horizons. This result is consistent with the pathway level analyses, wherein the largest differences in pathway abundance were also identified between LFH and mineral samples, and affirms the idea that differences in plant activity and available resources between LFH and mineral samples shape the metabolic potential of the soil microbial community. However, despite the large number of CAZyme families differentially abundant, only 5 CAZyme families were differentially expressed, including GT4, AA5, AA10 and CBM1, all of which were more highly expressed within the LFH samples, and GT20, more highly expressed in the mineral horizons, again suggesting that vertical stratification of metabolic potential does not necessarily translate into differences in microbial transcription (Fig. B.7).

To better determine the impact of forest harvesting on microbial metabolism we assessed differences in abundance and expression between all N and H samples. 49 CAZyme families were differentially abundant between the N and H samples collected across all seasons and depths (Fig. B.6), including GH18, GH26, GH28, GH76, GT34, GT58, AA3 and AA9. Of the families differentially abundant between treatments, 34 were more abundant in the N samples and 15 were more abundant within the H samples, suggesting that forest harvesting significantly alters the biomass degradation potential of forest soils. However, within the metatranscriptomes, only one of these families, AA3, was differentially expressed, suggesting that expression is regulated by niche partitioning and substrate availability rather than gene abundance (Fig. B.7). 
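Because each of these comparisons tests many CAZyme families at once, the reported counts depend on the multiple test correction applied. The correction procedure is not named in this section; the sketch below assumes, purely for illustration, a Benjamini-Hochberg false discovery rate adjustment. The function name and the p-values are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values for four CAZyme families (invented values).
raw = {"GH5": 0.01, "GH48": 0.04, "AA3": 0.03, "CBM2": 0.005}
names = list(raw)
adj = benjamini_hochberg([raw[n] for n in names])
significant = [n for n, q in zip(names, adj) if q <= 0.05]
print(dict(zip(names, adj)), significant)
```

A family would then be counted as differentially abundant or expressed only if its adjusted value falls at or below the chosen threshold (0.05 in this sketch).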
Consistent with the pathway analysis, next we examined differences in CAZyme family abundance and expression in only the LFH horizons and found no families to be differentially abundant between LFH N and H horizons and only 1 family, GH2, related to cellulose degradation [159], to be more expressed in the H LFH horizons compared to N LFH horizons (Fig. B.7).

3.3.8 CAZyme abundance and expression within soil isolates

Previous work at the O'Connor Lake LTSP site based on the V1-V3 region of the 16S small subunit ribosomal RNA (16S SSU rRNA) gene pyrotag data, clustered at 97%, identified a single operational taxonomic unit (OTU) affiliated with the genus Bradyrhizobium that represented 8-35% of the microbial community. Genomic analyses revealed all 4 isolates, named LTSP857, LTSP855, LTSP849 and LTSPM299, were capable of aromatic compound degradation [215]. Indeed, collectively the isolates contain CAZymes from 40 families [215]. Due to both their abundance and degradation potential, these isolates present an opportunity to determine how season and soil depth influence CAZyme abundance and expression within a specific taxonomic group. As such, metagenomic and metatranscriptomic reads were recruited to the CAZyme ORFs identified in each of the isolates and relationships among season, depth, treatment and metabolic potential and expression were evaluated using multivariate regression trees (MVRTs). CAZyme abundance and expression were dissimilar, consistent with poor correlation between metabolic potential and expression at the community level. Within the metagenomes, CAZyme family abundance was similar between LTSP857 and LTSP855 isolates and between LTSP849 and LTSPM299 isolates (Fig. 3.11).

Within the metatranscriptomes, season was the greatest determiner of CAZyme expression across the 4 isolates. Indeed, CAZyme expression of LTSP857 and LTSP855 differed most in October, while for LTSP849 and LTSPM299 CAZyme expression differed most in February (Fig. 3.10). 
Expression of GH13, AA2 and AA6 was only observed within LTSP849 and LTSPM299 and was greater in February compared to June and October in all cases (Fig. 3.12). This suggests that despite high similarity of rRNA genes, strain level variations within a single population result in alternative lifestyles and metabolic expression, thereby increasing the metabolic capacity of the genus.

Figure 3.11: Multivariate regression tree of potential and expressed CAZymes within Bradyrhizobium isolates. (A.) CAZyme potential and expression differ within Bradyrhizobium isolates (Error = 0.12). (B.) Seasonal influences on CAZyme expression within Bradyrhizobium isolates (Error = 0.88, 0.22, 0.88, and 0.92 respectively).

Figure 3.12: Abundance of potential and expressed CAZymes within Bradyrhizobium isolates. 
The abundance and expression of CAZymes within Bradyrhizobium isolates across seasons.

3.4 Discussion

Based on 103 sequence libraries, we described the metabolic diversity of soil microbial communities and shed light on the relationship between metabolic potential and metabolic expression within the soil milieu in order to better understand how genomic survey results can be interpreted. Next we evaluated the impact of spatiotemporal variation and forest harvesting on potential and expressed metabolic pathways using both ePGDBs and CAZyme gene surveys. We identified specific pathways and gene families manifesting significant differences among the three seasons, between LFH and mineral samples, and between unmanaged and disturbed experimental plots using both genomic and transcriptomic data, and discuss the potential impacts of these differences on ecosystem function. Finally, we examined the expression of CAZymes from 4 isolate genomes known to be abundant at the site because, due to both their abundance and biomass degradation potential, these isolates present an opportunity to determine how season, depth and forest harvesting influence CAZyme abundance and expression within a specific taxonomic group.

3.4.1 Metabolic potential is not squandered within soil microbial communities

Together the results indicate that changes in metabolic potential across time, space and in response to environmental disturbance are decoupled from changes in metabolic expression across these same variables (Fig. 3.5). Indeed, there was little correlation between pathway abundance and expression. Though not unexpected, as this observation is consistent with investigations of microbial potential and expression in marine systems [263], it suggests rare taxa contribute disproportionately to community activity and implies ecological organizing principles may be shared between terrestrial and marine environments despite discrepancies in average genome size and community diversity [2, 264]. 
Recent work using stable isotope probing with terrestrial ecosystems suggests dormant and rare taxa can be disproportionately active and ultimately contribute to ecosystem function [265]. Only approximately 3.6% of pathways identified at Williams Lake were not present at O'Connor Lake (Fig. 3.4), again consistent with observations in marine systems wherein gene richness was similar between genomic and transcriptomic data [140, 266]. While it is possible similarities in pathway richness are related to the size of the database used for pathway prediction, only 55% of possible pathways were predicted within the dataset, suggesting that rather than being an artifact of database size, metabolic expression is constrained to neither a limited fraction of the community nor a limited fraction of most microbial genomes [266]. This idea is further supported by the lack of significant differences in taxonomic distinctness between the genomic and transcriptomic samples (Fig. 3.3), which suggests that most taxa present in the samples are also active, although individual taxa may be active in proportionally different ways.

Consistent with the pathway analyses we observed little correlation in CAZyme family abundance and expression. However, less than 70% of total CAZyme families identified within the metagenomes were identified within the metatranscriptomes. Moreover, within some samples less than 40% of CAZyme families identified within the metagenome were identified within the metatranscriptome extracted from the same sample (Fig. B.2). While this observation may be related to the short half-life of most transcripts in soil ecosystems [108], combined with evidence that biomass degradation capacity is constrained to a subset of soil microbiota, many of which contain multiple CAZyme genes [68, 214, 215, 267], the data suggest CAZyme expression is tightly linked to substrate availability. 
Indeed, within the gut microbiome, bacteria are known to degrade host carbohydrates only in the absence of an alternative substrate [268]. Furthermore, recent work in marine settings suggests that in addition to substrate availability, biomass degradation is performed by a subset of bacterial, viral, and eukaryotic organisms whose metabolic activity is regulated by multiple processes of community interaction such as parasitism, infection, and predation [118]. Similar processes may be occurring in soil microbial communities and could be investigated by integrating taxonomic and functional profiling of archaeal, bacterial, eukaryotic, and viral soil communities with co-occurrence network approaches and organic matter content and particle size analyses.

3.4.2 Winter conditions impact metabolic potential

Winter conditions had the greatest impact on metabolic potential (Fig. 3.5). Samples collected in February had a lower abundance of several disaccharide degradation pathways, and a higher abundance of a degradation pathway for phenylacetate, a common intermediate produced in the breakdown of aromatic compounds in soil [238]. This may be linked to decreased availability of labile sugars under the snow due to limited plant productivity given freezing temperatures [206, 208, 209], forcing the community to utilize more recalcitrant plant compounds during winter. Further, beneath the snow pathway abundance differed between LFH and mineral samples, wherein mineral samples contained a greater abundance of anaerobic pathways as well as pathways involved in the degradation of aromatic compounds, perhaps linked to differences in available carbon resources and oxygen concentration with depth [26]. Finally, differences between N and H LFH samples taken in February were generally related to cell growth and may be reflective of the compositional differences between the two treatments [20, 27] as growth requirements vary amongst taxa. Snow cover also impacted CAZyme expression (Fig. 
3.10) and 10 CAZyme families related to the degradation of cell membranes, cell walls, and intermediate degradation products were more abundant in February (Fig. 3.10), suggesting increased degradation of microbial biomass and more recalcitrant substrates. Indeed, consequential release of nutrients from the die-off of soil microbes and the degradation of phenolic compounds represent an important resource pool for springtime plant growth [206]. Together, these observations are consistent with the idea that environmental conditions shape microbial community composition, metabolic potential and metabolic expression [19, 269, 270]. However, snow cover was not identified as a significant driver of trends in either pathway or CAZyme family expression (Fig. 3.5 and Fig. 3.10), suggesting a decoupling of metabolic function and expression.

3.4.3 Metabolic potential is similar across ecozones

To determine the similarity in functional potential between microbial communities across ecozones, we compared pathway presence between the O'Connor Lake LTSP site and the Skulow Lake LTSP site, located approximately 280 km northwest (Fig. 3.9). Although there was little correlation between the abundance of these pathways at the two sites (r² = 0.41), only approximately 3.6% of pathways identified at Williams Lake were not present at O'Connor Lake despite differences in ecozone, sampling season, sampling year, sampling depth, and harvesting treatment (OM2 was sampled at O'Connor Lake and OM3 was sampled at Skulow Lake). Hartmann et al. [20, 27] demonstrated both soil edaphic factors and community composition differ between sites and among OM0, OM2 and OM3 plots within each site. Taken together, this indicates that most soil functions are preserved across forest ecosystems and suggests that while taxonomic composition, biogeographic location, and spatiotemporal variation have little impact on pathway richness, these factors may influence the abundance of each pathway within the community [263, 271]. 
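The presence and absence comparison described here reduces to set operations on the pathways recovered at each site. The following is a minimal sketch under that assumption; the function name and pathway identifiers are hypothetical placeholders rather than the thesis data or the tools used to build the ePGDBs.

```python
def compare_presence(site_a, site_b):
    """Summarize pathway overlap between two sites from presence lists."""
    a, b = set(site_a), set(site_b)
    only_a = a - b
    only_b = b - a
    return {
        "shared": len(a & b),
        "only_a": len(only_a),
        "only_b": len(only_b),
        # Share of each site's pathways that is absent from the other site.
        "pct_a_unique": 100.0 * len(only_a) / len(a) if a else 0.0,
        "pct_b_unique": 100.0 * len(only_b) / len(b) if b else 0.0,
    }


# Hypothetical pathway identifiers pooled across all ePGDBs at each site.
site_one = {"pwy_1", "pwy_2", "pwy_3", "pwy_4"}
site_two = {"pwy_2", "pwy_3", "pwy_4", "pwy_5", "pwy_6"}

summary = compare_presence(site_one, site_two)
print(summary)
```

Because this summary uses presence only, two sites can share nearly all pathways while still differing strongly in pathway abundance, which is the pattern reported above.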
However, given the poor correlation of pathway abundance and expression, differences in abundance may not impact ecosystem function.

Pathways unique to either the O'Connor Lake or Skulow Lake site were mainly involved in secondary metabolite biosynthesis, and carbohydrate and non-carbon nutrient degradation (Fig. 3.9). Over half of the secondary metabolite biosynthesis pathways at each site were attributed exclusively to plants, and thus these differences are likely linked to differences in vegetation at each site. Indeed, while both H plots were replanted with lodgepole pine seedlings, the N plot at O'Connor Lake is dominated by Douglas fir while Skulow Lake is dominated by lodgepole pine (Pinus contorta) and hybrid spruce (Picea glauca engelmannii), Douglas-fir, aspen (Populus tremuloides Michx.) and cottonwood (Populus balsamifera L.) [20, 27, 70, 217]. In contrast, carbohydrate degradation pathways unique to each site were all involved in the degradation of plant biomass [159], suggesting that despite the large similarities in pathway richness, the microbial community may adapt to resources available at a given location. Indeed, we found that both a precursor for cinnamic acid, a compound found in resinous exudates of some plants [272], and cinnamate degradation occurred exclusively at O'Connor Lake, supporting the idea that communities adapt to the locally available substrates, as above ground vegetation has been directly linked to soil organic matter composition [273].

Interestingly, the non-carbon nutrient pathways were almost all implicated in sulfur cycling (Fig. 3.9). However, these pathways were both rare (found in few samples) and scarce (low RPKM), suggesting the presence of these pathways may be related to sequencing depth and coverage. Sulfur is rarely limiting in forest ecosystems [274] and while sulfur concentration was not measured at either site, previous work has found forest harvesting has little impact on sulfur pools [275]. 
Furthermore, over 90% of sulfur cycling pathways identified at Skulow Lake were found at O'Connor Lake and many of these pathways were present in all samples and were two orders of magnitude more abundant. Together this suggests that while abundance of rare sulfur cycling pathways may vary, more broadly, sulfur cycling potential is likely similar at both sites.

Metabolic potential differed between LFH and mineral samples at both sites (Fig. 3.5 and Fig. 2.3). Specifically, the abundance of carbohydrate degradation pathways that target substrates commonly found in soil organic matter was higher in LFH samples at both O'Connor Lake and Skulow Lake. This observation suggests that, although the individual pathways varied between sites, similar processes dominate within LFH samples at the two locations. Indeed, available resources and soil function change with depth: while LFH horizons cycle nutrients, mineral horizons function as carbon reservoirs [183]. Within the mineral horizons, fewer similarities in differentially abundant pathways were identified. This is likely related to differences in sampling, wherein 4 mineral horizons (down to 60 cm) were sampled at Skulow Lake, while soil depths (5 cm and 15 cm) not corresponding to specific soil horizons were sampled at O'Connor Lake.

Differences in metabolic potential between treatments were dissimilar between the two sites. While 149 pathways (most of which are involved in biomass degradation) were differentially abundant between N and H treatments at Skulow Lake (Fig. 2.3), fewer than 5 pathways were differentially abundant between N and H treatments at O'Connor Lake (Fig. 3.7). This is likely related to the difference in harvesting treatment between the two sites and implies OM2 harvesting has a lesser impact on biomass degradation potential than OM3 harvesting. Indeed, this is consistent with previous work wherein biomass degradation genes were differentially abundant between OM2 and OM3 plots [68]. 
Still, the impact of the OM3 harvesting treatment on microbial function should be corroborated with measures of microbial activity.

3.4.4 Metabolic pathway expression driven by depth

Given metabolic potential is poorly correlated with metabolic expression, we sought to determine the effect of season, depth, and harvesting on microbial community transcription. Soil depth was the largest driver of pathway expression. Normalization methods have been previously shown to impact the accuracy of tests of differential expression in RNA-seq data [225]. As such we used sequence recovery of External RNA Control Consortium (ERCC) spike-in controls [221] for data normalization and removal of unwanted variance due to amplification and sequencing, thereby improving accuracy of differential expression tests within the transcriptomic data [225]. Microbial communities in LFH samples exhibited increased expression of pathways for the degradation of saccharides (e.g., trehalose, L-arabinose, and xylose) and of chitin, a common compound found in cell walls of fungi and the exoskeleton of invertebrates [247] (Fig. 3.8). Further, 13 pathways involved in the biosynthesis of lipids, a principal component of cell membranes [244], were also more expressed in LFH samples, a signal likely related to microbial growth. These results are concomitant with higher concentrations of soil carbon and microbial respiration in surface soils [276]. Within the mineral samples, the expression of pathways involved in aromatic carbon and nucleotide degradation increased compared to the LFH samples above. Microbial communities within the mineral horizons may be more likely to capture carbon from recalcitrant aromatic compounds owing to decreased carbon availability with soil depth [277]. 
Furthermore, over 10% of the total soil organic phosphorus pool is contained within nucleic acids [142], and therefore increased nucleotide degradation may reflect community efforts to acquire phosphorus.

Microbial expression in LFH samples was more similar in June and February compared to October (Fig. 3.8). While the initial hypothesis was that this difference was due to decreased soil oxygen caused by heavy rain (June) and snow cover (February), pathway-level analyses did not reveal increased expression of anaerobic processes (Fig. 3.8). Rather, expression of pathways associated with some fungi, such as dolichyl-diphosphooligosaccharide biosynthesis and oxalate degradation, was also higher in June and October (Fig. 3.8). As many fungal species grow under both aerobic and anaerobic conditions [278, 279], these organisms may outcompete other community members for resources under low oxygen conditions. Between soil treatments in October, plant expression of antioxidant and osmolyte production pathways was higher in N LFH samples, likely linked to differences in vegetation between N and H plots.

Although soil depth was not the primary driver of differences in metabolic potential, 95 pathways were differentially abundant between the LFH and mineral samples (Fig. 3.8). While there was little overlap between pathways or functions differentially abundant and expressed within LFH samples, methane oxidation to methanol II was both more abundant and more highly expressed within mineral horizons. Additionally, several metabolic functions, including oxalate, phenol, and 2-amino-3-carboxymuconate semialdehyde degradation, were also more abundant and highly expressed within the mineral horizons, although the pathway variants for these functions differed between the genomic and transcriptomic datasets. Pathway variants may be present within different organisms adapted to specific niches, or represent functions expressed only under specific conditions.
As such, similarities between the differential abundance and expression of pathways with analogous metabolic functions may reflect a community response to the presence or absence of a given resource, and differences in metabolic potential and expression may result from rare taxa outcompeting more abundant community members. Indeed, rare taxa have been observed to make important contributions to biodiversity and biogeochemical cycles in terrestrial ecosystems [265, 280].

3.4.5 Forest harvesting affects the abundance but not expression of biomass degradation genes

Forest harvesting treatment had the most significant effect on the expression of plant biomass degradation genes (Fig. 3.10). Similarly, a recent study hypothesized that harvested soils suffer long-term impairment of plant biomass degradation based on differences in metabolic potential [68]. Indeed, Cardenas et al. [68] identified decreases in gene abundance of 41 CAZyme families following forest harvesting at the O’Connor Lake site using samples collected approximately 5 years prior to this study. Here, we also identified 49 CAZyme families differentially abundant between harvesting treatments, 34 of which were more abundant in the N samples. However, only a single CAZyme family, namely AA3, a cellobiose dehydrogenase, was more highly expressed in the N samples, despite this family being significantly less abundant in N samples at the genomic level (Fig. B.6 and Fig. B.7). In contrast, 5 CAZyme families known to act on cellulose and hemicellulose were more highly expressed in H samples, an effect likely related to the consequent change in surface vegetation rather than the initial disturbance. Indeed, there was no evidence of impaired biomass degradation at the expression level.
This suggests differences in CAZyme abundance have little impact on ecosystem function and corroborates previous work wherein rare organisms were found to play an important role in plant biomass degradation processes within soil [213].

3.4.6 Spatiotemporal controls on CAZyme expression

CAZyme expression also varied with season and depth (Fig. 3.10), although these differences were constrained to very few gene families (Fig. 3.10). Indeed, while 158 CAZyme families were differentially abundant between the LFH and mineral samples (Fig. B.6), only 5 CAZyme families were differentially expressed between the LFH and mineral samples (Fig. B.7). Once again, this suggests that CAZyme expression is tightly linked to substrate availability, and that organisms may retain a suite of CAZyme-coding genes that allow individuals to tune their metabolic priorities according to their immediate surroundings. Finally, based on the recruitment of metatranscriptomes to 4 Bradyrhizobium isolates previously identified as belonging to the most abundant OTU at the O’Connor Lake LTSP site [215], we observed that while harvesting treatment had the biggest impact on CAZyme expression across the entire community, season impacted CAZyme expression within these isolates (Fig. 3.11 and Fig. 3.12).
This implies the impact of spatiotemporal variation and harvesting differs amongst taxa and indicates that taxonomic identity can be decoupled from metabolic function [197–199] even within closely related organisms, resulting in increased metabolic capacity within the genus and potential impacts on the long-term likelihood of survival within the heterogeneous soil environment.

3.5 Conclusions

Here, we establish that changes in metabolic potential with season, depth, and harvesting disturbance are inconsistent with changes in metabolic expression across these same variables. Both pathway- and gene-centric analyses suggest that while winter conditions strongly influence metabolic potential, pathway and gene expression are regulated according to gradients in edaphic factors and available carbon resources. Together, the data suggest that functional redundancy within the soil microbial community effectively buffers against large-scale process changes with season and surface vegetation following whole tree harvesting. Given the data, we posit that redundant metabolic capacity both among phyla and within closely related species, combined with large- and fine-scale variation in metabolic potential, ensures environmental change has disparate effects across community members, thereby tempering the effects of localized extinctions or niche space reduction and guarding against the loss of metabolic functions within the soil ecosystem. Indeed, it has been widely speculated that soil microbial communities have the potential to be resistant and resilient to forest disturbance due to functional redundancy given their high genetic and metabolic diversity. Correspondingly, we find that most metabolic functions are conserved within the larger microbial community across seasons, sites, and treatments.
We speculate that there is a disturbance threshold beyond which the community is unable to recover, but this would require the removal of metabolic functions from the community, rather than simply a reduction in their abundance.

This work represents an important step in understanding how changes in community composition and metabolic potential due to spatiotemporal change and anthropogenic disturbance ultimately impact microbial community function within the soil milieu. Given the inconsistency in metabolic potential and expression observed here, from a modeling and management perspective the use of multi-omics techniques that provide information pertaining to microbial activity (e.g., metatranscriptomics, metaproteomics, and metametabolics) may be integral to determining the long-term impacts of natural and anthropogenically induced environmental changes on nutrient and carbon cycling within terrestrial ecosystems.

Chapter 4

The Ecologist’s Guide to Normalization Methods in Count Data for the Microcosmos

Massively parallel analyses of small subunit ribosomal genes and transcripts (SSU rRNA genes and rRNA, respectively) enable detailed investigation of microbial communities. Indeed, SSU rRNA genes were used in Chapter 2 of this dissertation to elucidate patterns in taxonomic composition with soil depth and treatment, effectively answering the questions “Who is there?” and “How do they respond to change?”. SSU rRNA genes were also used to create microbial correlation networks in an effort to model and better understand microbial interactions. Still, despite the power of SSU genes and transcripts to illuminate the taxonomic diversity of the uncultured majority, the data remain imperfect and read counts must be normalized prior to comparative analyses due to amplification and sequencing bias. Read count normalization is typically accomplished by expressing counts as proportions or by rarefying samples (subsampling each sample down to an equal number of reads).
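The two conventional normalizations named above can be sketched in a few lines of Python. This is a minimal illustration only, not the pipeline used in this work: the function names, the example counts, and the fixed random seed are all assumptions introduced here.

```python
import random


def to_proportions(counts):
    """Express each OTU count as a fraction of the sample's total reads."""
    total = sum(counts)
    return [c / total for c in counts]


def rarefy(counts, depth, seed=0):
    """Subsample a sample's reads, without replacement, down to `depth` reads."""
    if depth > sum(counts):
        raise ValueError("depth exceeds the number of reads in the sample")
    # Expand the counts into one entry per read, labelled by OTU index.
    reads = [otu for otu, c in enumerate(counts) for _ in range(c)]
    kept = random.Random(seed).sample(reads, depth)
    rarefied = [0] * len(counts)
    for otu in kept:
        rarefied[otu] += 1
    return rarefied


# Hypothetical OTU counts for two samples with unequal library sizes.
sample_a = [120, 30, 0, 50]     # 200 reads
sample_b = [600, 150, 50, 200]  # 1,000 reads

props_a = to_proportions(sample_a)
rare_b = rarefy(sample_b, depth=sum(sample_a))
```

Note that rarefying `sample_b` discards 800 of its 1,000 reads, whereas proportions retain every read but force each sample to sum to 1.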
However, in lieu of canonical normalization approaches such as the expression of read counts as proportions, recent work advocates for the use of a mixture model termed variance stabilization technique (VST) [281]. Indeed, VST has been demonstrated to reduce false positives in tests for taxa that are differentially abundant across samples. However, both a comparison of summary graphics and a practical guide to methodological considerations given different normalization techniques are lacking.

This chapter details a side-by-side comparison of patterns and biological conclusions resulting from the analysis of rRNA gene and rRNA surveys wherein read counts were normalized with either proportions or VST. Next, using normalized clustered rRNA and rRNA gene profiles from reference and ethanol-contaminated soil cores, differences in patterns and biological conclusions due to normalization technique are highlighted. Finally, a list of practical considerations that can help microbial ecologists effectively use VST normalization and accurately analyze rRNA gene and rRNA data is provided.

4.1 Introduction

Recent advances in sequencing technologies [103, 116] have now enabled a deeper understanding of microbial community structure and function through massively parallel analyses of small subunit ribosomal genes (rRNA genes) and ribosomal RNA (rRNA), respectively. For example, surveys of rRNA have been widely used to enumerate and compare the taxonomic composition of microbial communities in natural and engineered ecosystems [27, 282]. Further, comparisons of rRNA:rRNA gene ratios have been used to gain insight into potential community activity [283, 284] and describe microbial community responses to environmental conditions such as fluctuating redox conditions [282, 285] and salinity gradients [286]. Sequencing rRNA gene and rRNA libraries typically results in 2,000-80,000 reads per sample, a number that differs among studies and samples due to the amplification technique and sequencing platform used.
Following sequencing, reads are usually clustered based on sequence similarity (often at 97% sequence similarity [27, 282]), yielding operational taxonomic units (OTUs). The abundance of each OTU is thus expressed as the number (count) of reads belonging to the cluster for a given sample. Owing to the variation in the number of reads produced for each sample by current sequencing technologies, it is necessary to normalize the read counts prior to comparing OTU distributions and abundances among and between samples.

Normalization is most commonly completed through rarefying the data (subsampling down to a given number of sequences for each sample) or by expressing the counts as percentages/proportions [282, 287]. However, despite the widespread use of both of these normalization techniques in microbial ecology, these approaches have recently been called into question. Indeed, McMurdie and Holmes (2014) [281] present a strong statistical argument against using these conventional normalization approaches. Small subunit ribosomal RNA gene amplicon count data are characteristically overdispersed, wherein there is greater variability in the dataset than would be expected under the underlying statistical model (typically the Poisson model), which then contributes to inaccuracies when comparing the relative abundance of OTUs among and between samples [281]. Furthermore, rarefying requires the arbitrary elimination of usable data, and proportions fail to address heteroscedasticity, wherein a subset of the variables in a dataset has different variabilities from other variables in the same dataset, which can invalidate statistical tests of significance [281].
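Overdispersion can be checked for a single OTU by comparing its variance across samples with its mean. The sketch below is illustrative only: it assumes the common negative binomial parameterization Var = mu + phi * mu^2 (where phi = 0 recovers the Poisson expectation Var = mu), uses a simple method-of-moments estimate rather than the fitting procedure of any particular package, and the counts are hypothetical.

```python
from statistics import mean, variance


def dispersion_estimate(counts):
    """Method-of-moments estimate of phi in Var = mu + phi * mu**2.

    Returns 0.0 when the variance does not exceed the mean, i.e. when
    the counts are no more variable than a Poisson model predicts.
    """
    mu = mean(counts)
    if mu == 0:
        return 0.0
    var = variance(counts)  # sample variance (n - 1 denominator)
    return max(0.0, (var - mu) / mu ** 2)


# Hypothetical read counts for one OTU across eight samples.
otu_counts = [2, 0, 15, 40, 1, 7, 0, 90]

mu = mean(otu_counts)
var = variance(otu_counts)
phi = dispersion_estimate(otu_counts)

print(f"mean = {mu:.2f}, variance = {var:.2f}, phi = {phi:.2f}")
```

For these counts the variance is far larger than the mean, so the estimated phi is well above zero and a Poisson model would understate the variability.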
Instead, McMurdie and Holmes (2014) [281] advocate the use of mixture models, specifically a negative binomial model termed variance stabilization technique (VST) [288], to account for differences in library size (number of sequenced reads) between samples [281].

Already employed in numerous RNA-seq data analyses [289–291], VST allows the researcher to retain all data collected, minimize type 1 errors due to overdispersion and, as the name suggests, stabilize the variance across samples, thus directly addressing heteroscedasticity. Indeed, VST has been demonstrated to reduce false positives in tests for taxa that are differentially abundant across samples [281]. In addition, many canonical analyses such as principal component analysis (PCA) and non-metric multidimensional scaling (NMDS) assume homoscedasticity (wherein the variance around a regression line is the same for all values of a predictor variable) [292]. Because VST normalization produces transformed homoscedastic read counts, VST may be useful prior to multivariate modeling [293]. In addition, VST is a more suitable normalization technique for analyses relying on correlations, such as co-occurrence networks, because compositional data (wherein each sample sums to the same number) are known to produce unreliable correlations, as the data are relative fractions of OTUs rather than absolute abundances [178]. Unlike proportion normalization, VST normalization does not produce a compositional data table.
Still, despite the strong statistical argument for VST and the availability of R packages and functions within QIIME (a popular bioinformatics pipeline for the processing of microbiome data [116]) that facilitate the use of VST, a direct comparison of summary graphics and data products resulting from conventional normalization techniques and VST has yet to be completed.

Here, we outline the suitability of typical statistical methods used in the analysis of rRNA gene and rRNA surveys wherein read counts are normalized with either proportions or VST. Next, we perform a side-by-side comparison of normalization via proportions (which suffer from overdispersion and unequal variance but retain all data) and VST using clustered small subunit ribosomal rRNA and rRNA gene profiles to compare microbial abundance (rRNA genes) and activity based on rRNA:rRNA gene ratios between a subsurface soil core from a reference site (Core A) and a biofuel-contaminated site (Core B). Finally, we provide a list of practical considerations that can be used to guide statistical analyses post-VST in order to empower microbial ecologists to accurately analyze rRNA gene and rRNA data and maximize its storytelling power.

4.2 Methods

4.2.1 Sample collection, DNA extraction and pyrosequencing

Samples were taken at 10 cm intervals from sediment Core A and Core B collected using a GeoProbe in October 2012 in southwestern Minnesota (lat. 44.247083, long. -94.3371059) (Fig. 4.1). Details of the site can be found in the SI Methods. At each interval, approximately 15 g of sediment was collected in Falcon tubes and immediately frozen at -80 °C. Samples were shipped on dry ice to the University of British Columbia, where DNA and RNA were extracted using PowerMax® Soil DNA and PowerSoil® RNA Isolation Kits (MoBio, Carlsbad, CA; SI Methods). RNA was converted to cDNA using a SuperScript® III first-strand synthesis kit (Invitrogen, Carlsbad, CA; SI Methods).
The V6-8 region of the SSU rRNA gene was amplified using universal (capturing all three domains of life) primers described in SI Methods. The resultant amplicons were sequenced using Roche 454 GS FLX Titanium technology (454 Life Sciences, Branford, CT, USA) at the Génome Québec Innovation Centre (Appendix C).

Figure 4.1: Sampling and analysis schematic for samples collected from two soil cores. A. Sampling location and analysis schematic for 18 samples from 11 soil depths in two soil profiles, one reference (Core A) and one ethanol contaminated (Core B), in southwestern Minnesota. B. Sampling depths and sequencing type (rRNA genes or rRNA) for the two soil profiles sampled and ethanol concentration with depth. Vertical line represents the location of the water table.

4.2.2 Sequence analyses

275,606 and 127,187 total sequences were obtained from the pooled rRNA gene and rRNA amplicons, respectively. All trimming, clustering, and classifications were performed in the QIIME software package (version 1.4.0) [116]. Quality control removed sequences that failed to meet the following criteria: minimum length = 200 bp, average quality score = 25, maximum number of Ns = 0, maximum homopolymers = X, resulting in 224,599 and 102,550 sequences for the pooled rRNA gene and rRNA amplicons, respectively. Using UCLUST [294], sequences were clustered into operational taxonomic units (OTUs) at the 97% identity threshold with a maximum e-value cut-off of 1e-10. Samples containing <400 OTUs in either the rRNA gene or rRNA fraction were excluded from the datasets, resulting in a total of 36 samples from reference (Core A) and ethanol-contaminated (Core B) soil cores. Sequences were clustered with an additional 165 samples taken from nearby cores, as Core A and Core B are part of a larger, ongoing study in the area.
Although OTUs were generated from an ongoing time-series, we focused on the first time point for which there were data from a reference and a contaminated core. This approach increased the sensitivity of our study by providing a larger dataset from which to select representative sequences for each OTU [295, 296], and results from this work will inform the statistical approach taken in the larger analyses of the time-series dataset. OTUs occurring in less than 5% of the samples were removed in order to improve the signal-to-noise ratio [281], resulting in a total of 6,694 OTUs encompassing 95.3% of all sequence data.

4.2.3 Data analyses

Statistical analyses were performed in R (version 3.0.2 (2013-09-25)), CoNet (Beta version 3.2) [41, 297], and using in-house Perl scripts (Appendix C.1). Diversity estimates were calculated using unnormalized data. The OTU table was then normalized using both proportions (wherein each OTU is expressed as the relative fraction of the sample) and VST (wherein reference DNA, reference RNA, contaminated DNA, and contaminated RNA were used as the conditions on which to estimate size factors). In order to calculate rRNA:rRNA gene ratios in the proportion-normalized dataset, it was necessary to impute rRNA gene values for OTUs wherein rRNA was recovered but rRNA genes were not, using non-parametric multiplicative replacement [282, 298]. Imputation was unnecessary for the VST-transformed data, as it does not contain zeros. To avoid calculating log(0) and producing negative values, which make rRNA:rRNA gene ratios more difficult to interpret, a constant was added prior to VST normalization, resulting in an overall scaling (multiplication) of the data on the untransformed scale [281]. Both imputation (proportions) and the addition of a constant (VST) serve to eliminate division by zero and negative numbers, respectively, but keep values small such that rRNA:rRNA gene ratios can be calculated as accurately as possible.
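The prevalence filter and the zero handling for proportion-based rRNA:rRNA gene ratios can be sketched as follows. This is a simplified illustration rather than the in-house scripts used here: the function names, the toy data, and the replacement value `delta` are assumptions, and the replacement step follows the general form of multiplicative replacement (zeros set to a small value, non-zero parts shrunk so the sample still sums to 1).

```python
def filter_by_prevalence(table, min_fraction=0.05):
    """Keep OTUs detected (count > 0) in at least `min_fraction` of samples.

    `table` maps an OTU identifier to its list of counts, one per sample.
    """
    kept = {}
    for otu, counts in table.items():
        prevalence = sum(1 for c in counts if c > 0) / len(counts)
        if prevalence >= min_fraction:
            kept[otu] = counts
    return kept


def multiplicative_replacement(proportions, delta):
    """Replace zeros with `delta` and rescale non-zero parts to keep a sum of 1."""
    n_zero = sum(1 for p in proportions if p == 0)
    scale = 1 - n_zero * delta
    return [delta if p == 0 else p * scale for p in proportions]


# Hypothetical table of 40 samples: one OTU seen once, one seen twice.
table = {
    "otu_rare": [1] + [0] * 39,       # 2.5% prevalence, removed
    "otu_common": [3, 2] + [0] * 38,  # 5.0% prevalence, kept
}
filtered = filter_by_prevalence(table)

# Hypothetical proportions for three OTUs in paired DNA and RNA libraries.
dna = [0.5, 0.0, 0.5]  # rRNA genes; the second OTU was not detected
rna = [0.2, 0.3, 0.5]  # rRNA
dna_imputed = multiplicative_replacement(dna, delta=0.01)
ratios = [r / d for r, d in zip(rna, dna_imputed)]
```

The ratio for the undetected OTU depends directly on the choice of `delta`, which is why keeping the imputed value small, as described above, matters for interpretation.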
Non-metric multidimensional scaling (NMDS) was completed using the Bray-Curtis dissimilarity index, which is able to handle both count and abundance data [299]. rRNA genes were used as a proxy for microbial abundance, while rRNA:rRNA gene ratios were used to estimate microbial activity.

4.3 Results

4.3.1 Microbial community diversity

In order to compare data products resulting from conventional proportion and VST normalization, we investigated soil microbial community structure and potential activity from a pristine reference core (Core A) and an ethanol-contaminated core (Core B). Specifically, we performed 454 pyrosequencing of rRNA genes and rRNA sequences with three-domain resolution. Following clustering at 97% identity and the removal of singletons and operational taxonomic units (OTUs) present in <5% of the 36 samples, 6,694 OTUs were used for downstream analysis. Using rarefaction curves to determine sample coverage, we found samples exhibited similar slopes at the 97% identity threshold approaching 4,000 unique OTUs, suggesting that at this level of genetic distance only rare OTUs remain unrecovered (Fig. C.1). Of the total OTUs, 5,834 (87.2%) were bacterial, 519 (7.8%) were archaeal, 243 (3.6%) were eukaryotic, and 98 (1.5%) were unclassified (representing sequencing errors or rare taxa).

First, we compared the number of OTUs in common between the two cores and between the rRNA genes and rRNA in both cores (Fig. 4.2). More OTUs were shared between rRNA genes and rRNA within each core than between Core A and Core B (Fig. 4.2). Diversity was estimated on unnormalized data using both Shannon’s and Simpson’s diversity indices [300] (Table 4.1). While Simpson’s diversity was close to 1 for all samples, in general, Shannon’s diversity was higher in Core A compared to Core B. Indeed, using a two-tailed t-test, Core A had significantly higher diversity than Core B (p=0.029) (Table 4.1). However, within each core, no significant differences in diversity between rRNA genes and rRNA were found (p>0.05).
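For reference, the two diversity indices can be computed from raw counts as in the sketch below. It assumes the natural-logarithm form of Shannon’s index and the 1 - sum(p^2) form of Simpson’s index (consistent with values close to 1 for diverse samples); the exact variants used to produce Table 4.1 are not stated here, and the example communities are hypothetical.

```python
import math


def shannon(counts):
    """Shannon's index H' = -sum(p * ln p) over OTUs with non-zero counts."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def simpson(counts):
    """Simpson's index expressed as 1 - sum(p**2)."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)


# Hypothetical communities with four OTUs each.
even = [25, 25, 25, 25]    # perfectly even
dominated = [97, 1, 1, 1]  # one OTU dominates

h_even, d_even = shannon(even), simpson(even)
h_dom, d_dom = shannon(dominated), simpson(dominated)
```

The even community reaches the maximum Shannon value for four OTUs (ln 4) and a Simpson value of 0.75, while the dominated community scores much lower on both.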
Next, OTU counts were normalized using both proportions and VST. Variance estimates were fit using both Poisson (φ = 0) and negative binomial models (Fig. 4.2). While variances were larger than estimated under the Poisson model (the data are overdispersed), the negative binomial model fit the variances well, indicating VST is an appropriate normalization method for the data (Fig. 4.2) [281].

Figure 4.2: Overdispersion in microbiome data. A. Overdispersion in the microbiome data of 18 samples from 11 soil depths in two soil profiles, one reference (Core A) and one ethanol contaminated (Core B), in southwestern Minnesota. Points represent the mean and variance estimate of each OTU across study samples. Gray line represents Poisson (φ = 0), while the blue line represents fitted variance using variance stabilizing technique in DESeq [288, 301]. B. Venn diagram of rRNA gene and rRNA operational taxonomic units (OTUs), showing the presence of OTUs throughout the soil profile in the rRNA genes of the undisturbed soil core and in the rRNA genes and rRNA of the ethanol-contaminated soil profile in southwestern Minnesota.

4.3.2 Soil microbial community structure

To evaluate differences in microbial community structure among and between Core A and Core B, we employed principal component analysis (PCA) (Fig. 4.3) and non-metric multidimensional scaling (NMDS) (Fig. 4.4) using both the proportion- and VST-normalized data. Large-scale trends in the PCA produced by both normalization approaches were similar, grouping samples primarily by depth and subsequently by core (Fig. 4.3). However, using proportion-normalized data, no strong trends were identified within the NMDS (Fig. 4.4). Indeed, samples from both cores and across depths grouped together, suggesting both soil depth and disturbance had little influence on microbial abundance and activity.
In contrast, using the VST-normalized data, NMDS revealed samples grouped along a depth gradient. Taken together, the data suggest microbial abundance and activity are affected by depth and thus the NMDS using proportions should be interpreted with caution (Fig. 4.3 and Fig. 4.4).

Table 4.1: Diversity indices for samples collected from a reference (Core A) and contaminated soil profile (Core B) in southwestern Minnesota.

Sample      Simpson’s   Shannon’s
CA1         0.99        6.38
CA1cDNA     1.00        6.69
CA3         0.94        5.26
CA3cDNA     0.98        5.34
CA4         1.00        6.46
CA4cDNA     0.74        2.70
CA5         0.99        5.88
CA5cDNA     0.98        5.36
CA6II       0.98        5.17
CA7II       0.99        6.08
CA7cDNA     0.73        3.30
CA8         0.98        5.41
CA8cDNA     0.88        3.20
CA10        0.98        5.37
CB1         1.00        7.07
CB1cDNA     0.99        6.41
CB2         0.99        6.54
CB2cDNA     0.99        6.43
CB3         1.00        7.07
CB3cDNA     0.98        6.08
CB4         0.99        6.13
CB4cDNA     0.99        5.88
CB5         0.92        4.30
CB5cDNA     0.95        4.34
CB6         0.91        4.62
CB6cDNA     0.99        5.96
CB7         0.91        4.30
CB7cDNA     0.97        4.97
CB8         0.99        5.82
CB8cDNA     0.95        4.81
CB9         0.99        5.83
CB9cDNA     0.98        5.34
CB10        0.99        5.85
CB10cDNA    0.98        5.52
CB11        0.99        5.53
CB11cDNA    0.94        4.48

Figure 4.3: Principal Component Analysis (PCA) of microbiome data using two normalization techniques. Principal Component Analysis (PCA) describing the influence of soil depth and ethanol concentration on microbial assemblages (V6-V8 SSU rRNA genes and rRNA) for 18 samples from 11 soil depths in two soil profiles, one reference (Core A) and one ethanol contaminated (Core B), in southwestern Minnesota calculated after proportion and variance stabilization technique normalization.

Figure 4.4: Non-metric multidimensional scaling (NMDS) of microbiome data using two normalization techniques.
Non-metric multidimensional scaling (NMDS) analysis describing the influence of soil depth and ethanol concentration on microbial assemblages (V6-V8 SSU rRNA genes and rRNA) for 18 samples from 11 soil depths in two soil profiles, one reference (Core A) and one ethanol contaminated (Core B), in southwestern Minnesota calculated after proportion and variance stabilization technique normalization.

4.3.3 Microbial abundance

To better understand the microbial community composition with depth and across natural and contaminated conditions, taxonomic composition at the phylum level was evaluated. At the phylum level, trends in rRNA genes and rRNA across Core A and Core B were similar using both proportion- (Fig. C.2 and Fig. C.5) and VST-normalized datasets (Fig. C.3 and Fig. C.6). Cores were dominated by the bacterial phyla Proteobacteria (Alpha, Beta, Delta and Gamma), Actinobacteria, Chloroflexi, Firmicutes, Bacteroidetes and Acidobacteria as well as the archaeal phyla Euryarchaeota and Thaumarchaeota. Additionally, 18 candidate phyla, including OP11 and WS3, were identified in the dataset. Within Core A, Alpha-Proteobacteria were most abundant (rRNA genes) from 10-50 cm compared to deeper depths (60-100 cm), while Beta-, Delta-, and Gamma-Proteobacteria either remained constant with depth or showed an increase in rRNA genes below 60 cm (Fig. C.2 and Fig. C.5). Similarly, the abundance of Actinobacteria, Chloroflexi, and Acidobacteria and, to a lesser extent, Firmicutes and Bacteroidetes varied little throughout Core A using both normalization techniques (Fig. C.2 and Fig. C.5). In addition, the archaeal phyla Euryarchaeota and Thaumarchaeota were generally most abundant from 10-30 cm, above the ethanol spill (Fig. C.2 and Fig. C.5).

Next, we compared the abundance of microbial phyla between the two cores in order to identify taxonomic responses to ethanol contamination.
The abundance of Bacteroidetes remained consistent between Core A and B using both normalization techniques, suggesting there is little impact of ethanol contamination on this phylum. The abundance of Deltaproteobacteria and Chloroflexi was lowest in the ethanol-contaminated zone (50-70 cm) (Fig. 4.1) using both normalization techniques, suggesting that these groups are sensitive to ethanol contamination (Fig. C.2 and Fig. C.5). In addition, Actinobacteria and Acidobacteria decreased in abundance below 40 cm (where ethanol concentration is higher) from Core A to Core B using both normalization techniques, although the magnitude of this change differs between the two normalization techniques. Indeed, within the proportion-normalized data Actinobacteria and Acidobacteria exhibited a 2-fold and 217-fold decrease, respectively, while within the VST-normalized data these taxa decreased by 1-fold and 2-fold, respectively. Conversely, Firmicutes displayed a 1.4-8.1-fold (proportions) / 1.6-2.4-fold (VST) increase in rRNA gene abundance from 50-70 cm from Core A to Core B, suggesting a biological response to ethanol concentrations.

Although the two normalization techniques produced similar trends in the abundance of microbial phyla, this consistency was not universal. For example, within the proportion-normalized data Alpha-, Beta-, and Gamma-Proteobacteria decreased in abundance below 40 cm by 1.1-2.3-fold, 2.0-8.6-fold, and 1.8-2.7-fold, respectively, again suggesting these taxa are sensitive to ethanol, as peak contamination occurred at 60 cm. However, within the VST-normalized data Alphaproteobacteria did not appear to decrease, while Beta- and Gamma-Proteobacteria decreased by only 1.0-1.6-fold and 1.0-1.2-fold, respectively, suggesting that these groups are not as sensitive to ethanol but rather are able to tolerate the disturbance (Fig. C.2 and Fig. C.5).
This represents a discrepancy between the normalization techniques that could impact downstream interpretation of the data.

4.3.4 Potential activity

As both rRNA genes and rRNA were sequenced and rRNA can be used as a proxy to represent the active community, the ratio of rRNA to rRNA genes was used to evaluate the potential activity of the microbial community throughout both cores. Within both Core A and Core B, the ratio of rRNA:rRNA genes for most phyla produced similar trends using both normalization approaches, though on different scales (Fig. C.4 and Fig. C.7). For example, candidate phylum WS3 and the archaeal phyla Euryarchaeota and Thaumarchaeota were more active between 30-60 cm in Core B compared to Core A, suggesting these groups may be stimulated by the ethanol present within Core B. In order to explore relationships between depth and potential activity of the microbial community with increased taxonomic resolution, we identified the taxonomic orders (when possible given the annotations in the current database) whose abundance and activity varied most throughout both cores using both normalization techniques. To determine which orders were highly variable, the standard deviation of the abundance of rRNA genes and rRNA of each taxonomic order across Core A and Core B was calculated. The 30 orders (top 10%) with the highest standard deviation were selected from both the proportion- and VST-normalized data, and the potential activity (rRNA:rRNA genes) of these groups was explored. Of the 30 orders selected, 18 were common to both the proportion- and VST-normalized datasets (Fig. C.8 and Fig. 4.5). Of these 18 orders, 11 were bacterial, 5 were archaeal and the remaining 2 were eukaryotic (Fig. C.8 and Fig. 4.5).

Figure 4.5: Heat map depicting the potential activity (rRNA:rRNA genes) of the most variable taxonomic orders in Core B. Heat map depicting the potential activity (rRNA:rRNA genes) in Core B of taxonomic orders (or phyla when further annotation was not possible given insufficient references within the database) exhibiting the most variation (top 10%) in southwestern Minnesota using proportion and variance stabilization technique normalization. A. represents taxonomic groups found to be among the most variable after using both proportion and variance stabilization technique normalization. B. represents taxonomic groups found to be among the most variable after using only proportion normalization. C. represents taxonomic groups found to be among the most variable after using only variance stabilization technique normalization. [Heat map axes: Depth [cm], 10-110; color scale: rRNA:rDNA ratio (log).]

Within Core A, trends were similar using both the proportion- and VST-normalized data, wherein the archaeal orders Methanobacteriales, Methanomicrobiales, and Methanosarcinales all displayed a peak in potential activity at 70 cm (Fig. C.8). However, within the proportion-normalized data Methanomicrobiales and Methanosarcinales were also active at 30 cm. Potential activity of Coriobacteriales, Bacteroidales and vadinHA17 also peaked at 30 cm and 70 cm in both normalized datasets. Additionally, Charophyta, a ubiquitous soil alga [302], had the highest potential activity from 50-80 cm using both normalization techniques. Among groups that were identified as being highly variable within the proportion-normalized data but not the VST-normalized data, Anaerolineales, Nitrospirales, and Syntrophobacterales were most active at surface depths of 10 cm and 30 cm. In contrast, among groups identified as highly variable in the VST-normalized data but not the proportion-normalized data, Thaumarchaeota Group C3, Halobacteriales and Mollicutes showed elevated activity at 30 cm and within the saturated zone (>70 cm), suggesting there may be anaerobic niches at 30 cm [303].

Within Core B, most trends in potential activity were consistent across both normalization techniques (Fig. C.8). For example, the peak in potential activity for Enterobacteriales occurred at 60 cm, while Methanosarcinales and Charophyta were most active in the surface samples (10-30 cm), although these taxa likely occupied anaerobic and aerobic niches, respectively. In addition, Methylococcales were most active below 60 cm, again using both the proportion- and VST-normalized data.
Within groups that were identified as being highly variable within the proportion-normalized data but not the VST-normalized data, Sphingomonadales and Syntrophobacterales were most active below 60 cm in the saturated zone, suggesting these groups are most active under anaerobic conditions. Further, Burkholderiales were most active in the zone where ethanol concentration peaked, and, within groups that were identified as being highly variable within the VST-normalized data but not the proportion-normalized data, Thaumarchaeota Group C3, Halobacteriales and Mollicutes were most active in the surface samples (10-30 cm), consistent with the potential activity of these groups within Core A. Finally, OP11 was most active in the saturated zone (>70 cm).

4.4 Discussion

In the present study we performed a side-by-side comparison of normalization via proportions and via the variance stabilization technique (VST) in order to highlight differences in patterns and biological conclusions resulting from the two analyses, using OTU count data from microbial communities found within two soil cores, one from a reference site (Core A) and one from an ethanol-contaminated site (Core B). We compared microbial abundance and activity based on rRNA:rRNA gene ratios with a standard survey of the count data using canonical metrics in community ecology, and evaluated the impact of normalization technique on the resulting data products.
Our comparison included high-level multivariate community-wide exploratory techniques such as non-metric multi-dimensional scaling (NMDS) and principal component analysis (PCA), as well as phylum and order level assessments, to identify potential discrepancies in taxonomic patterns between the two normalization techniques that could influence downstream interpretation.

4.4.1 Similar trends in community diversity and structure with varying normalization technique

Diversity was estimated to be higher in the samples from Core A [300], suggesting ethanol exposure changes community composition and decreases taxonomic diversity. Within both cores no significant differences in diversity were identified between the rRNA genes and rRNA, suggesting that most species present were likely active. As diversity estimates are best performed on unnormalized data [304], no comparison was made between proportion and VST data for this analysis (Table 4.1). However, it should be noted that Shannon's and Simpson's diversity estimates are not affected by proportion normalization, as both methods calculate the relative fraction of each taxon and, as each sample sums to 1 or 100, individual values for OTUs do not change [300]. For example, if OTU X represents 12% of a given sample, the relative proportion is equal to 12/100 or 0.12 or 12%.

Next we examined community structure and found that large-scale trends were similar regardless of which normalization approach was used. Using PCA, samples grouped first by depth and subsequently by core. Community structure within Core A and Core B was most similar in samples between 10-60 cm and differed most in saturated samples (>70 cm) (Fig. 4.3). However, while the NMDS generated with VST-normalized data produced trends consistent with the PCA, the NMDS generated with proportion-normalized data revealed no consistent trend with depth (Fig. 4.4). Compositional data bias, wherein the increase in the relative abundance of one OTU necessitates the decrease in relative abundance of one or more other OTUs, is known to affect ordination results [179]. As such, we suggest VST normalization prior to ordination, as this may increase the consistency of trends produced using PCA and NMDS.

4.4.2 Discrepancies in abundance and potential activity between normalization techniques

Both normalized datasets revealed an increase in the abundance of Firmicutes, a large phylum known to contain many members that form endospores to survive extreme conditions [305], and Thaumarchaeota, an archaeal phylum comprised of ammonia oxidizers [306], within saturated (>70 cm) samples from Core A to Core B (Fig. C.2 and Fig. C.5). However, the magnitude of these increases changed with normalization technique. Such discrepancies in the magnitude of change may influence our understanding of the biological role of certain taxa. For example, the ability of Firmicutes to endure ethanol contamination in soil may be overestimated in the proportion-normalized data, as the VST-normalized data suggest the increase is less substantial (Fig. C.2 and Fig. C.5). Indeed, the proportion-normalized data suggest the increase in Firmicutes is 2-fold greater than the increase observed within the VST-normalized data. Conversely, the increase in Thaumarchaeota within the proportion-normalized data may be overlooked, as the increase was 4-fold larger in the VST-normalized data compared to the proportion-normalized data, a difference that could have an impact on nitrogen transformations within the soil milieu [307]. We posit that discrepancies in the magnitude of change can influence the biological interpretation of the results, as the influence of environmental factors on a given taxon, or the functional role of a given taxon, may be exaggerated or overlooked depending on how the data were normalized.
Indeed, in addition to trends in the abundance of OTUs or phyla, the magnitude of changes in abundance must also be considered. As such, we propose it is best practice to employ VST, as this model better fits overdispersed OTU data.

We examined potential activity (rRNA:rRNA genes) throughout Core A and Core B using both proportion and VST-normalized data (Fig. C.4 and Fig. C.7). Several taxa, including candidate phylum WS3 and the archaeal phylum Euryarchaeota, were more active between 30-60 cm in Core B compared to Core A using both normalization techniques, suggesting these groups may be stimulated by the ethanol spill. Indeed, Euryarchaeota are known to contain all archaeal methanogens [308] and candidate phylum WS3 has previously been found to degrade hydrocarbons [309]. In similar biofuel-contaminated environments, ethanol concentrations up to 10 g/L (1.3% v/v) have been known to stimulate microbial growth, while concentrations above 40-100 g/L exhibit toxic, sterilizing effects [310]. In general, microbial species are unable to tolerate ethanol concentrations above 9% (v/v) [311]. Here, ethanol concentrations between 40-60 cm decreased from 4-6 g ethanol/kg soil (0.8% to 1.2%), thus providing a favorable growth substrate [310] across this depth interval. Indeed, we did not observe any phyla active in Core A that were no longer present or active in Core B, consistent with the idea that the ethanol concentrations at this site are neither toxic nor sterilizing. Still, care must be taken when interpreting rRNA:rRNA gene ratios, as elevated levels of rRNA have been attributed to sources other than increased activity (e.g., multiple ribosome copies [312], dormant cell function [284, 313]).
Explicitly, elevated rRNA:rRNA gene ratios can be indicative of cells entering dormancy or be reflective of past environmental conditions prompting high ribosomal levels [284].

Finally, to explore discrepancies in the relationships between depth and potential microbial activity between proportion and VST-normalized data, we identified the 10% most variable taxonomic orders and examined the rRNA:rRNA gene ratio across Core A and Core B using both normalization techniques (Fig. C.8 and Fig. 4.5). Twelve of the 30 orders selected (40%) differed between the proportion and VST-normalized data, indicating that the magnitude of potential activity varied across the profile differently using the two normalization techniques. For example, using the proportion-normalized data the activity of Burkholderiales, known to grow on ethanol [308], varied 3-6 fold from 50-110 cm, while using the VST-normalized data, potential activity of Burkholderiales varied only 1.5-2 fold across this same depth interval. Differences in magnitude of change across depth intervals and between uncontaminated and contaminated soils contribute to the high false positive rate in calculations of differential abundance when data are normalized using proportions [281]; as such, fold-changes within proportion-normalized data should be interpreted with caution.

4.4.3 Considerations for VST normalization

Although normalizing OTU count data with VST has not yet been widely used in microbial ecology outside the context of differential abundance testing [289–291], VST allows the researcher to retain all data collected, minimize type I errors due to overdispersion and stabilize the variance across samples, thus directly addressing heteroscedasticity [281]. However, despite the statistical rigor of VST, there are several considerations one must take into account when generating exploratory summary graphics with VST.
Here, we provide a list of practical considerations that can help microbial ecologists both select the most suitable normalization technique for a given analysis and effectively use VST normalization to analyze rRNA gene and rRNA data (Fig. 4.6).

Diversity indices

• As discussed above, diversity estimates are best performed on unnormalized data [304] (Fig. 4.6).

Summary graphics

• NMDS and PCA ordination both make many assumptions about the properties of the data [179, 292, 293, 299, 314–318], including homoscedasticity, a property not respected within proportion-normalized data. However, VST-normalized data are homoscedastic and thus do not violate the assumptions of many exploratory statistical techniques. Indeed, compositional data bias is known to affect ordination results [179]. However, the magnitude of the bias decreases with an increase in the number of variables [179]. As such, a higher number of OTUs may reduce the effect of the bias introduced via normalization technique, and thus normalization technique is more likely to impact the interpretation of low diversity environments (Fig. 4.6).

• Phylum and order level exploratory graphics such as bubble plots and heatmaps do not have statistical assumptions, and as such both proportion and VST normalization can be used. However, it is important to note that, given high input values, VST normalization approaches a log base 2 transformation [288, 301]. As a result, it is important to take logarithmic rules and identities into consideration when working with VST-normalized data. For example, as adding log-transformed values is equivalent to multiplying the original count data (log(a) + log(b) = log(a*b)), OTU count data must first be summed at the taxonomic level at which the analysis or comparison will be completed (e.g., phylum or order) prior to VST normalization [FIG]. Indeed, here unnormalized count data were summed at both the phylum and order level prior to VST normalization and figure generation. In contrast, proportion-normalized data scale linearly, and thus the data can be normalized only once at the OTU level and proportions can subsequently be summed directly to create phylum and/or order level tables and figures (Fig. 4.6).

[Figure 4.6 image: grid of coloured dots with columns "Proportions" and "VST" and rows grouped as Diversity Indices (Shannon's, Simpson's, Chao1), Summary Graphics (PCA, NMDS, bubble plots, heatmaps), Determining Drivers of Multivariate Patterns (indicator species analysis, differential abundance) and Other (correlations, co-occurrence networks); legend: caution, avoid using, appropriate for use.]

Figure 4.6: Practical considerations for selecting normalization techniques for microbiome data analysis. A guide to selecting the most suitable normalization technique for the analysis of microbiome data. Red signifies avoid using, yellow signifies use with caution and green signifies appropriate for use. A more detailed explanation of each dot can be found within the text.

Indicator species analyses

• Given that VST normalization approaches a log base 2 transformation with high values [288, 301], indicator species analyses should be approached with caution when using VST-normalized data to ensure results are not due to a misappropriation of logarithmic identities (Fig. 4.6).

Correlations and co-occurrence networks

• Correlations and co-occurrence networks generated from small subunit ribosomal RNA (SSU rRNA) genes (OTUs) are known to suffer from compositional data (relative proportions) effects that result in false correlations and/or convert legitimate positive correlations to negative ones [178]. As such, it is advisable to use VST-normalized data to calculate correlations [281] or employ alternative methods for co-occurrence network analyses such as SparCC [178] or the ‘ensemble’ method [41, 297], both of which are reviewed in detail in [319] (Fig. 4.6).

4.5 Conclusions

Canonical normalization techniques such as proportions applied to discrete rRNA gene and rRNA sequence read count data may be problematic. Using SSU rRNA and rRNA gene surveys it is possible to capture the diversity of the whole community and the abundance and potential activity of taxonomic groups and individual OTUs, which, if accurately analyzed and interpreted, can ultimately lead to a deeper understanding of microbial responses to perturbations and assist in the development of conceptual models for monitoring microbial activity given ethanol or ethanol-blend biofuel contamination. We found that normalization technique had the largest impact on exploratory graphics with statistical assumptions (e.g., NMDS) as well as on the magnitude of change in microbial abundance and activity. Indeed, trends in microbial composition with depth were only observable within the NMDS using VST normalization. Furthermore, given that differential abundance analyses using proportion-normalized data are known to increase the likelihood of type I errors, interpretation of the magnitude of change across depth intervals or in response to ethanol contamination using proportion-normalized data is more likely to be inaccurate. Based on our analyses, we advocate the use of VST in summary and exploratory graphics as it is statistically appropriate, and propose that normalization technique can impact our understanding of microbial abundance and potential activity. This work contributes to methodological improvements in OTU count data normalization and analyses by providing both a comparison of summary graphics and a practical guide to methodological considerations given different normalization techniques.

Chapter 5

Analytical Augers for Mining Soil Sequence Data

“Multi-omics” data produced via next generation sequencing platforms have enormous potential to reveal the hidden metabolic powers of microbial communities in even the most complex ecosystems such as soils.
One of the fundamental questions in microbial ecology is “What are the metabolic processes completed by microorganisms across environments?” or “What are they doing?”, referring to both the function of individual microbes and the community as a whole. Indeed, this dissertation makes use of metagenomes and metatranscriptomes to reveal trends in microbial community metabolism with season and soil depth, discover that perturbation can reduce carbon storage and organic matter degradation potential, and suggest that changes in phenotypic expression due to season and depth are constrained to a subset of metabolic pathways, as genotypic diversity buffers against large scale changes in metabolic processes within the broader microbial community. However, as our understanding of microbial ecosystems improves and sequencing technologies continue to advance and produce unprecedented quantities of biological information, the development of new technologies and interpretative frameworks is needed to overcome computational and analytical bottlenecks, and empower microbial ecologists to accurately analyze data. This chapter presents two methodological techniques: i) the Short-ORF Functional Annotation (SOFA) pipeline for assembly independent functional annotation of short-read data, which allows researchers to quantitatively analyze unassembled data and maximize the amount of sequence data used in downstream analysis; and ii) an extension of Clarke and Warwick's Diversity and Distinctness measures that allows the use of incomplete taxonomic annotations common in uncultivated microbial annotation.

5.1 Short-ORF functional annotation (SOFA) pipeline

5.1.1 Introduction

Accurate description of the microbial communities driving matter and energy transformations in soil ecosystems remains challenging given both the complexity of soil and the prevalence of uncultivated microorganisms.
While next generation sequencing technologies are yielding a veritable tsunami of environmental sequence information, the data wave fails to provide sufficient depth of coverage needed to illuminate microbial diversity in most terrestrial ecosystems. Using published estimates of community diversity and assuming 100× or 10× genome coverage is necessary for the assembly of short read (Illumina) or long read (PacBio) sequence data, it is predicted that $5.00 \times 10^{14}$ (Illumina) and $5.00 \times 10^{13}$ (PacBio) base pairs (bps) are needed to achieve complete soil metagenome assembly [320]. This presents a predicted lower boundary with vexing computational costs. Indeed, despite recent advances in de novo metagenome assembly that leverage distributed high-performance computational resources and memory efficient algorithms, as well as data reduction via read filtering and normalization, substantially more data are required before soil microbial communities can be adequately analyzed using assembly-based approaches [321, 322].

With bench top sequencing platforms like Illumina’s MiSeq and NextSeq becoming standard in many research laboratories, the need for scalable analytical pipelines capable of processing many gigabytes of unassembled short read data is increasingly apparent. Routinely used in metagenomic studies [270, 323], Metagenome Rapid Annotation using Subsystem Technology (MG-RAST) provides users with an on-line pipeline for the metabolic reconstruction of unassembled sequences [324]. Briefly, the pipeline merges paired reads when possible using FastqJoin [116] and provides the user with the option to remove or retain unmerged reads. FragGeneScan, a hidden Markov model-based software designed to predict gene fragments on reads as small as 70 bps, is then used to predict open reading frames (ORFs) [325].
Further, because ORFs may be predicted twice due to unmerged read pairs spanning a single gene, several strategies, such as using only a single read from each mate pair and/or subsampling annotated reads, have been employed that sacrifice up to 50% of the sequence data in order to prevent artificially inflating gene counts [270, 323]. Thus, there is a need for a high-throughput solution for the removal of ORFs artificially predicted twice.

[Figure 5.1 image: flow diagram of the SOFA pipeline, from an unassembled metagenome through (1) FLASH (Fast Length Adjustment of SHort reads), (2) FragGeneScan (predict fragmented genes in short reads) and (3) deduplication (remove half of unmerged read pairs that span the same gene), followed by annotation and downstream analysis in MetaPathways. Files shown include merged.fastq, unmerged.fastq, ORFs.faa, finalORFs.faa, dupsrmvd.faa, allreads.fasta and nohits.faa.]

Figure 5.1: The SOFA pipeline. The SOFA pipeline consists of three operational stages, (1) read pair merging, (2) ORF prediction, and (3) deduplication, providing accurate quantification of ORF abundance that can be used for annotation and downstream analysis. Inputs and executables are depicted on the left with corresponding exported output files on the right. Figure originally published in IEEE proceedings [1]. Copyright IEEE 2015. Reprinted with permission.

Here we present a short-ORF functional annotation pipeline (SOFA) for assembly independent functional annotation of short-read data. The pipeline merges paired-end libraries, predicts ORFs, and completes an additional step we term ‘deduplication’. Deduplication prevents the double counting of ORFs predicted twice due to unmerged read pairs spanning a single gene, thereby generating accurate gene counts (Fig. 5.1).
The effectiveness of SOFA is validated with both simulated and bona fide soil metagenomes. Moreover, empirical results are compared to existing strategies for obtaining accurate ORF counts, including a model of read duplication. SOFA outputs are natively compatible with MetaPathways, a modular annotation and analysis pipeline, where SOFA predicted ORFs are annotated, environmental pathway/genome databases constructed, and results from multiple samples are compiled and compared [139–141].

5.1.2 Methods

Merge mate pairs and quality control sequences

Short read sequence data from paired end libraries can be merged to produce longer read lengths. Overlapping and extending the 70 to 300 bp mate pairs typical of most modern sequencing technologies improves both de novo assembly and gene prediction [325–327]. Here, we use Fast Length Adjustment of SHort reads (FLASH) to merge fastq libraries of paired-end reads to maximize the length of sequences to be processed [327]. SOFA uses the default FLASH settings, allowing for mate pair merging in both the forward and reverse orientations (-O flag) (Fig. 5.1). Because not all read pairs can be merged, both merged and unmerged reads are retained in order to preserve the maximum amount of sequence information.

Predict open reading frames (ORFs)

Open reading frames are predicted on merged and unmerged reads of 70 bps or longer using FragGeneScan+, an optimized and multi-threaded version of FragGeneScan that can process large samples on a local desktop computer approximately 5–50 times faster than the original FragGeneScan (Fig. 5.1) [325, 328]. Predicted ORFs below a default length of 70 nucleotides or 25 amino acids are removed before deduplication.

‘Deduplication’

Prokaryotic genomes have an average coding density of one gene per kilobase of DNA [329]. Because even merged short read data are 2-10 times smaller than an average ORF, several merge scenarios can arise (Fig. 5.2). Unmerged read pairs may span 1 or 2 genes (Fig. 5.2 A and B). In the case where a mate pair spans two genes, each gene will be counted once, accurately representing the data (Fig. 5.2 A). However, should both mates span the same gene, the gene would be predicted twice and thus be overrepresented in the data set, thereby artificially inflating gene counts (Fig. 5.2 B). We term such a read pair as having duplication and define deduplication as the process of detecting and removing one mate in a duplicated read pair. In contrast, genes predicted on merged reads are counted only once regardless of whether they span 1 or 2 genes (Fig. 5.2 C and D).

To address the issue of double counting genes predicted from unmerged read pairs (Fig. 5.2 B) we present the method of ‘deduplication’. Translated amino acid sequences for ORFs predicted from unmerged read pairs are locally aligned to a reference protein database, and one half of read pairs found to be homologous to the same reference protein are eliminated from downstream analysis. By default, SOFA uses the RefSeq database [330]. Owing to the computational expense of comparing a large number of ORFs to such a large and comprehensive reference database, and because we aim to identify read pairs with the same function, the reference database (RefSeq) was clustered at 85% sequence identity using Cluster Database at High Identity with Tolerance (CD-HIT) [331]. Homology searches between unmerged pairs and representative protein sequences (selected based on sequence length) are performed using LAST, a computationally efficient and sensitive local alignment algorithm that uses adaptive and spaced seeds [332]. The clustered database is not used for ORF annotation but for deduplication only. Deduplication provides an efficient means to avoid double counting genes predicted from unmerged read pairs, thereby providing quantitatively accurate gene counts.
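
The deduplication rule described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the SOFA implementation: the function name deduplicate, the input layout (one best-hit reference cluster per mate, with None meaning no significant hit) and the choice to drop mate 2 rather than mate 1 are all hypothetical, and the alignment step that would produce the hits is not shown.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical input layout (an assumption for this sketch): for each
# unmerged read pair, the best-hit reference cluster of the ORF on mate 1
# and on mate 2, or None when the ORF has no significant hit.
PairHits = Dict[str, Tuple[Optional[str], Optional[str]]]


def deduplicate(pair_hits: PairHits) -> List[Tuple[str, int]]:
    """Return the (pair_id, mate_number) ORFs kept after deduplication.

    A pair is treated as a duplicate only when both mates hit the same
    reference cluster; one mate (here mate 2, an arbitrary choice) is then
    dropped. Pairs without hits cannot be assessed and are kept whole.
    """
    kept: List[Tuple[str, int]] = []
    for pair_id, (hit1, hit2) in pair_hits.items():
        kept.append((pair_id, 1))
        is_duplicate = hit1 is not None and hit1 == hit2
        if not is_duplicate:
            kept.append((pair_id, 2))
    return kept


example = {
    "pair1": ("clusterA", "clusterA"),  # both mates span the same gene
    "pair2": ("clusterA", "clusterB"),  # mates span two different genes
    "pair3": (None, None),              # no hits: cannot be deduplicated
}
kept_orfs = deduplicate(example)
```

In this toy input only pair1 loses a mate, so five of the six ORFs are retained.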
‘Deduplicated’ ORFs can then be passed to the existing MetaPathways pipeline, which includes ORF annotation [333], rRNA and tRNA identification [153, 334, 335], Lowest Common Ancestor (LCA) estimation [144], pathway/genome database construction [336, 337], calculation of reads per kilobase per million mapped reads (RPKM) [141], and reporting of basic summary statistics [139].

[Figure 5.2 image: four schematic panels labelled A, B, C and D.]

Figure 5.2: Deduplication. (A) Unassembled read pairs spanning one gene; (B) unassembled read pairs spanning two genes; (C) merged read pair spanning one gene; (D) merged read pair spanning two genes. Figure originally published in IEEE proceedings [1]. Copyright IEEE 2015. Reprinted with permission.

ORFs predicted by SOFA

The total number of ORFs predicted by the SOFA pipeline can be stated as follows. Let n be the number of read pairs in the initial sample, m the fraction of merged pairs, f the fraction of duplicates in unmerged pairs, and d the average number of ORFs predicted on merged read pairs (which we assume ranges between 0–2).

Total ORFs predicted on merged reads is equal to $nmd$. If we assume only one ORF is predicted on each unmerged read, then total ORFs predicted on unmerged pairs is the sum of read pairs that span the same gene, $(1-m)nf$, and read pairs spanning two genes, $2(1-m)n(1-f)$. Thus, total ORFs predicted by SOFA is

\[ nmd + (1-m)nf + 2(1-m)n(1-f) = n\bigl(md + (1-m)(2-f)\bigr). \tag{5.1} \]

For example, given a data set in which all read pairs were merged (i.e., m = 1), the total ORFs predicted by SOFA is $nd$. When no read pairs can be merged (i.e., m = 0), the total ORFs predicted by SOFA is $n(2-f)$.

The total ORFs predicted by SOFA is therefore dependent on m, the fraction of merged pairs, and f, the fraction of duplicates in unmerged pairs. Merging reads when possible improves the quality of both predicted ORFs and ORF annotation [327], thereby increasing the accuracy of ORF predictions by removing a systematic bias (Fig. 5.1).
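
As a quick numerical check of Eq. 5.1, the sketch below evaluates the three terms separately and compares their sum with the factored form. It is illustrative only: the function name total_orfs and the input values are assumptions chosen for the example and are not taken from the SOFA code base or from the data in this chapter.

```python
def total_orfs(n: int, m: float, f: float, d: float) -> float:
    """Total ORFs predicted by SOFA according to Eq. 5.1.

    n: number of read pairs; m: fraction of merged pairs;
    f: fraction of duplicates among unmerged pairs;
    d: average number of ORFs predicted on a merged read pair.
    """
    if not (0.0 <= m <= 1.0 and 0.0 <= f <= 1.0):
        raise ValueError("m and f are fractions and must lie in [0, 1]")
    merged = n * m * d                      # ORFs on merged reads
    same_gene = (1 - m) * n * f             # duplicates, counted once
    two_genes = 2 * (1 - m) * n * (1 - f)   # one ORF on each mate
    return merged + same_gene + two_genes


def total_orfs_factored(n: int, m: float, f: float, d: float) -> float:
    """Right-hand side of Eq. 5.1: n(md + (1 - m)(2 - f))."""
    return n * (m * d + (1 - m) * (2 - f))


# Hypothetical inputs: 1,000 read pairs, 40% merged, 25% duplicates,
# 1.2 ORFs per merged read on average.
lhs = total_orfs(1000, 0.4, 0.25, 1.2)
rhs = total_orfs_factored(1000, 0.4, 0.25, 1.2)
```

The two limiting cases in the text follow directly: with m = 1 the function reduces to nd, and with m = 0 it reduces to n(2 - f).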
In addition, the ‘deduplication’ step (which approximates f) allows the use of the entire read set, thus maximizing the sequence information included in the analysis without risking the overrepresentation of ORFs predicted twice on unmerged read pairs.

‘Deduplicating’ hypothetical proteins and normalizing data for comparative genomic analysis

Hypothetical proteins, or read pairs containing predicted ORFs with no significant hits in the RefSeq database, cannot be deduplicated due to a lack of representative sequences. Because 40–60% of ORFs predicted in metagenomes typically cannot be functionally annotated beyond “hypothetical” [139], it is necessary to account for duplicates within the pool of reads for which no homologous sequences can be found. Given that the value of m can be measured directly following read merging (Fig. 5.1) and f can be estimated using the fraction of duplicates in functionally annotated unmerged pairs ($\hat{f}$), distinct ORFs predicted by SOFA can be approximated as follows:

\[ \text{distinct ORFs} \approx nmd + (1-m)n\hat{f} + 2(1-m)n(1-\hat{f}). \tag{5.2} \]

For example, given n = 50,000,000, m = 0.5 and $\hat{f}$ = 0.3, the unmerged-pair terms of Eq. 5.2 contribute $(1-0.5) \times 50{,}000{,}000 \times 0.3 + 2(1-0.5) \times 50{,}000{,}000 \times (1-0.3) = 42{,}500{,}000$ distinct ORFs, to which the merged-read term $nmd$ is added. Total estimated distinct ORFs from the SOFA pipeline can then be expressed in relative proportions, enabling quantitative comparisons among and between samples with variable values of n and m.

5.1.3 Results

In order to demonstrate the analytical effectiveness of the SOFA pipeline we tested both simulated reads and real world soil metagenomes.

First, in order to validate the deduplication process we simulated read pairs ranging in size from 45–65 bps (found to be the average length of ORFs predicted on unmerged and merged reads respectively) from 10,000 COG sequences, thus generating ‘duplicate’ read pairs. COG sequences were chosen as their presence and number are often used to compare metagenomes [142, 324, 338].
The simulated read pairs were then run through SOFA’s deduplication process. SOFA was able to successfully deduplicate 97% of duplicate read pairs.

Next, we simulated 10,000 paired reads from the E. coli K-12 (MG1655) genome (GenBank: NC_000913.3). This strain was chosen due to its well annotated status, making validation of the analysis reliable. The genome of this strain is 4,641,652 bps, 87% of which belongs to protein coding regions encompassing 4,319 genes with an average length of 937 bps (Fig. 5.3 green ring). We uniformly simulated 10,000 150 bp read pairs, which were then processed by the SOFA pipeline. Using this empirical analysis we show that (i) duplication occurs in a large proportion of read pairs (> 80%, Fig. 5.3 red ring) of length 150 bp (comparable to the read length of high-throughput next generation sequencing technologies), (ii) such duplicates occur more or less uniformly across the genome (Fig. 5.3 red ring), and (iii) SOFA can effectively deduplicate unmerged read pairs spanning the same gene (Fig. 5.3 purple ring). For this analysis we considered any mate pair with at least 60 bps in the same gene to be a duplicate read pair (Fig. 5.3 red ring). We found that 82% or 8,170 read pairs were potential duplicates and located uniformly across the genome, leaving 18% or 1,830 non-duplicate read pairs (Fig. 5.3 blue ring). SOFA successfully deduplicated 7,194 or 88% of the duplicate read pairs (Fig. 5.3 purple ring). Note that the effectiveness of the deduplication is affected by both annotation (approximately 4% of the E. coli genome is unannotated and thus cannot be deduplicated by SOFA) and ORF prediction, as FragGeneScan+ does not predict genes on sequences shorter than 70 bps and thus ORFs on read pairs overlapping with the same gene by 60–69 bps would not be predicted by the SOFA pipeline.
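
The criterion used above to label a simulated read pair as a duplicate (both mates sharing at least 60 bps with the same gene) can be written as a short interval test. The sketch below is an assumption-laden illustration rather than the code used for the analysis: coordinates are taken as half-open intervals on a single strand, and the names overlap and is_duplicate_pair, as well as the toy gene coordinates, are hypothetical.

```python
from typing import Iterable, Tuple

Interval = Tuple[int, int]  # half-open [start, end) genome coordinates


def overlap(a: Interval, b: Interval) -> int:
    """Number of positions shared by two half-open intervals."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def is_duplicate_pair(mate1: Interval, mate2: Interval,
                      genes: Iterable[Interval],
                      min_overlap: int = 60) -> bool:
    """True when both mates share at least min_overlap bps with one gene."""
    return any(
        overlap(mate1, gene) >= min_overlap
        and overlap(mate2, gene) >= min_overlap
        for gene in genes
    )


# Toy annotation with two genes separated by a 100 bp gap (made-up values).
genes = [(0, 1000), (1100, 2000)]

both_in_gene_one = is_duplicate_pair((100, 250), (370, 520), genes)
split_across_genes = is_duplicate_pair((900, 1050), (1170, 1320), genes)
```

Here the first pair lies entirely inside the first gene and is flagged as a duplicate, whereas the second pair places one mate in each gene and is not.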
Indeed, FragGeneScan+ predicted ORFs on 19,400 of 20,000 reads (10,000 read pairs).

Table 5.1: Sample information, comparative statistics and total COG database hits for assembled short read sequence data, read-one-only unassembled data, and SOFA processed unassembled short read data. Min: mineral sample; Org: organic sample; M: million; Assm: assembled (%); Mrgd: merged (%); Dupl: deduplicated (%). Table originally published in IEEE proceedings [1]. Copyright IEEE 2015. Reprinted with permission.

Sample | Reads | Assemblies: Assm | Assemblies: N50 | Assemblies: ORFs | Assemblies: COGs | Read 1 only: ORFs | Read 1 only: COGs | SOFA: Merged | SOFA: Dedup | SOFA: ORFs | SOFA: COGs
BS Min | 199.28 M | 12 | 548 | 231,872 | 208,560 | 45.93 M | 9.10 M | 67 | 30 | 38.20 M | 12.36 M
JW Org | 115.79 M | 3 | 345 | 56,455 | 13,557 | 21.51 M | 5.85 M | 20 | 27 | 32.90 M | 10.69 M
JW Min | 43.66 M | 6 | 336 | 3,866 | 1,061 | 11.83 M | 2.42 M | 53 | 30 | 7.08 M | 2.30 M

Finally, three soil metagenomes sequenced on the Illumina HiSeq platform (150 bp reads, 270 bp insert) were analyzed (Table 5.1). A surface organic and two deeper mineral samples were selected from unmanaged soil plots established as part of the Long Term Soil Productivity (LTSP) project in Ontario, Canada [217]. Metagenomes were assembled with SOAPdenovo [339] (Table 5.1) and processed through the MetaPathways pipeline using the default settings and the COG database to annotate the ORFs [340]. The unassembled reads corresponding to the above samples were analyzed using SOFA and annotated against the same protein database (Table 5.1). Concomitantly, the same analysis was performed on just one of the read pairs (an existing strategy for avoiding double counting ORFs) to illustrate the utility of including both unmerged mate pairs and the deduplication procedure (Table 5.1).
Compared to the metagenome assemblies, predicting ORFs on unassembled data, using either the read-one-only set or the SOFA pipeline, increased ORF prediction and total significant COG hits by up to three orders of magnitude (Table 5.1), emphasizing the need for unassembled short read processing pipelines.

The SOFA pipeline merged 20–67% of read pairs, providing a substantial number of longer reads on which to predict ORFs (Table 5.1). ORFs predicted on merged reads had an average length of 69 amino acids, while ORFs on unmerged reads had an average length of 47 amino acids, demonstrating the utility of merging reads prior to ORF prediction. When over 50% of the mate pairs were merged, 20–60% fewer ORFs were predicted with SOFA than with the read-one-only set. Conversely, when less than 50% of reads were merged, 50% more ORFs were predicted with SOFA than with the read-one-only set. Deduplication found 27–30% of unmerged reads to be duplicates (Table 5.1), reinforcing the importance of this process in providing quantitatively accurate predictions. Finally, ORFs from both the SOFA pipeline and the read-one-only set were annotated with LAST using an e-value cutoff of 10^-10 (conservative, to reduce the total number of spurious hits) and COG as the reference protein database [340]. Despite the differences in ORF predictions, the total number of high-scoring segment pairs (HSPs) using SOFA was either comparable to (within 5%) or over 35% higher than that of the read-one-only set (Table 5.1).
It should also be noted that while subsampling annotated ORFs can normalize the bias of duplicate ORFs across samples, the SOFA pipeline allows the user to retain the maximum quantity of sequence information and does not result in the potential removal of rare ORFs.

In order to estimate the expected percentage of duplicate read pairs in a simple model, we start with a very long (meta)genome (denoted by G) that contains uniformly distributed protein-coding regions or genes (of length g) and assume the non-coding region (of length g_c) to be negligible (Fig. 5.4). We assume a sequenced read pair can originate from anywhere along this (meta)genome. Now, consider a randomly selected read pair and imagine where the center of the read pair (the midpoint between the two mates in the pair) could land on the (meta)genome (G). Note that this read pair is a duplicate if both mates in the pair overlap with the same gene by enough base pairs that the same ORF would be called on each mate by FragGeneScan+ (approximately 70 bp); that is, only if the center of the read pair lands in region c of Fig. 5.4 would the read pair not be a duplicate. Thus, the probability of a duplicate read pair can be expressed as the ratio

\frac{g - g_c - 2r}{g + g_c}    (5.3)

or (1000 − 10 − (2 × 70))/(1000 + 10) = 84% using the example above.

In a typical metagenome 40–60% of predicted ORFs are unannotated or hypothetical proteins. Hypothetical proteins cannot be deduplicated due to a lack of representative sequences in the reference protein database.
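The arithmetic behind Eq. 5.3 can be checked with a few lines of Python. This is a minimal sketch under stated assumptions: the function name duplicate_fraction is hypothetical, the values g = 1000, g_c = 10 and r = 70 are those of the worked example, and scaling the ratio by the annotated fraction of ORFs (40–60%) is our reading of how the unannotated proteins enter the estimate.

```python
def duplicate_fraction(g: float, g_c: float, r: float) -> float:
    """Ratio from Eq. 5.3: expected share of read pairs whose mates hit the same gene."""
    return (g - g_c - 2 * r) / (g + g_c)


# Worked example from the text: 1000 bp genes, 10 bp non-coding gaps, r = 70 bp.
p_duplicate = duplicate_fraction(g=1000, g_c=10, r=70)

# Assumption: only annotated ORFs can be deduplicated, so the ratio is scaled
# by the annotated fraction, i.e. 1 minus the 40-60% of hypothetical proteins.
low = p_duplicate * (1 - 0.60)
high = p_duplicate * (1 - 0.40)

print(f"Eq. 5.3 ratio: {p_duplicate:.1%}")
print(f"Expected deduplicable read pairs: {low:.0%} to {high:.0%}")
```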
Thus, in a bona fide metagenome, we expect approximately 34–50% of read pairs to be duplicates, which is in close agreement with our empirical analysis (Table 5.1). Note that due to the high diversity of soil microbial communities and the low number of cultured organisms originating from soils, soil metagenomes are more likely to have a higher proportion of unannotated or hypothetical proteins than a typical metagenome, and thus to fall on the lower end of the estimated proportion of expected successful read pair deduplication.

5.1.4 Implementation and availability

Short-ORF Functional Annotation (SOFA) is an open source pipeline capable of accurately predicting ORFs on large unassembled short read environmental sequence data. SOFA predicts open reading frames (ORFs) on merged reads when possible and ‘deduplicates’ ORFs predicted on unmerged read pairs with the intent to increase the quality of ORFs predicted and maximize the quantity of sequence information that can be processed without sacrificing the accuracy of gene counts. Indeed, a simple analytical model for read duplication predicted 34–50% of read pairs to be duplicates, highlighting the importance of ‘duplicate’ ORF removal prior to downstream analysis and biological interpretation of the data. The SOFA pipeline produces longer ORFs, resulting in more reliable annotation results against a reference protein database while retaining the sequence information provided on short unmerged read pairs. Consequently, the SOFA pipeline provides a quantitative estimate of ORF abundance with improved functional annotations useful in analyzing complex ecosystems such as soils. SOFA is available for download with example data and usage tutorials at and test can be found on https://IMG/M with taxon ids 3300001098 (JW Org), 3300001142 (JW Min), and 3300001154 (BS Min).

Figure 5.3: Validation of the SOFA pipeline. Validation of the deduplication process implemented by SOFA using simulated read pairs from E. coli strain K-12 (MG1655). The grey ring represents the scale in million base pairs. The green ring represents the loci of genes on E. coli K-12 (4,319 genes). The blue ring represents non-duplicated simulated read pairs, i.e., read pairs without duplication (1,830; 18%). The red ring represents ‘duplicate’ simulated read pairs (8,170; 82%). The purple ring represents read pairs deduplicated by SOFA (7,194; 88% of duplicates). rps: read pairs. Figure generated using the ggbio R package [341]. Figure originally published in IEEE proceedings [1]. Copyright IEEE 2015. Reprinted with permission.

Figure 5.4: Simulation model. Validation of the deduplication process implemented by SOFA using simulated read pairs from E. coli strain K-12 (MG1655). The genome length G is assumed to be very long relative to the length of genes. Coding and non-coding regions are assumed to be evenly dispersed throughout the genome, where g represents the average length of a gene coding region, g_c is a complement or non-coding region (assumed to be very short compared to g), and c is a coverage threshold for overlapping read pairs. Read pairs are represented by r1 and r2. The blue read pair does not have duplication. The red read pair has duplication.

5.2 Taxonomic distinctness

5.2.1 Estimating diversity in metagenomic data

Canonical diversity metrics such as the Chao1, Shannon’s [192], and Simpson’s [193] diversity indices consider richness and evenness, but fail to consider the phylogenetic relationship between taxa. Thus, a community with n taxa from a single phylum could be found to be equally diverse as a community with n members from n phyla.
Taxonomic distinctness (Δ, Δ*, and Δ+), a diversity measure first proposed in macroecology in 1995, is designed to consider the relatedness of the taxa in a given sample in addition to richness and evenness [149, 164].

However, in its original implementation, taxonomic distinctness calculations required complete phylogenetic annotation (domain to species) of every taxon. Because ribosomal DNA is difficult to assemble [170], the phylogenetic identity of genes within metagenomes is often assigned using the lowest common ancestor (LCA) algorithm [144]. Briefly, following alignment of a protein to a large reference database, the lowest ancestor shared by the set of taxa for which there were significant alignments is used as an estimate of the phylogenetic annotation of the protein [144]. As such, the depth of phylogenetic assignment ranges from domain to species, making taxonomic distinctness calculations inextensible to most multi-omics studies. However, the taxonomic distinctness index can be extended to accept partial taxonomic annotations, thereby creating a powerful tool for estimating community diversity from modern sequence data.

Clarke and Warwick’s taxonomic distinctness index considers richness, evenness, and phylogenetic relatedness and is less sample size dependent than other diversity measures [149, 164]. The analysis produces three statistics: Δ, which describes the average taxonomic distance between two randomly selected taxa and considers both taxonomic relatedness and evenness; Δ*, which describes the average path length between two randomly selected taxa and considers only taxonomic relatedness; and Δ+, which, given presence/absence data, is equal to both Δ and Δ*, describes the average path length between two randomly selected taxa, and can be used to detect whether there is a significant difference between the taxonomic distinctness for a given sample and the expected taxonomic distinctness calculated from a master list of all taxa in all samples [149, 164].

Initially designed for macroecology, the taxonomic distinctness index in its original implementation used string matching to determine the relatedness between taxa within the study. Indeed, the authors propose multiple ways in which to assign distance, including ‘uniform step length’, wherein step lengths are equal at each taxonomic level (domain through species), and ‘distinct step length’, wherein step lengths are relative to the number of distinct taxonomic annotations at each successive level of classification (domain through species) [149, 164]. Both methods require taxonomic annotations of equal length to accurately assess taxonomic distance, and are thus not extensible to the partial annotations common in uncultivated microbial annotation. To illustrate, using a small test set of 19 taxa, we calculated taxonomic distance using both uniform (Fig. 5.5) and distinct (Fig. 5.6) step lengths and found that the level to which each taxon was annotated, rather than taxonomic relatedness, determined the distance between two taxa. Indeed, archaeal and bacterial taxa annotated only to the phylum level were found to be more related to one another than to other, more precisely annotated archaeal and bacterial taxa, respectively.
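For readers who want to see how the three statistics introduced earlier in this section relate to one another, the sketch below computes them from an abundance vector and a pairwise taxonomic distance matrix, following the standard Clarke and Warwick definitions. It is an illustrative sketch only: the function name taxonomic_distinctness and the toy distance matrix are hypothetical, and the distances could come from any step-length scheme.

```python
from itertools import combinations
from typing import Sequence, Tuple


def taxonomic_distinctness(
    abund: Sequence[float], dist: Sequence[Sequence[float]]
) -> Tuple[float, float, float]:
    """Return (delta, delta_star, delta_plus) for one sample.

    abund[i] is the abundance of taxon i; dist[i][j] is the taxonomic
    distance (path length) between taxa i and j.
    """
    present = [i for i, x in enumerate(abund) if x > 0]
    pairs = list(combinations(present, 2))
    n = sum(abund)       # total number of individuals
    s = len(present)     # number of taxa present

    weighted = sum(dist[i][j] * abund[i] * abund[j] for i, j in pairs)
    cross = sum(abund[i] * abund[j] for i, j in pairs)

    delta = weighted / (n * (n - 1) / 2)                             # relatedness and evenness
    delta_star = weighted / cross                                    # relatedness only
    delta_plus = sum(dist[i][j] for i, j in pairs) / (s * (s - 1) / 2)  # presence/absence
    return delta, delta_star, delta_plus


# Toy distance matrix for three taxa: 0 and 1 are close relatives, 2 is distant.
toy_dist = [
    [0, 50, 100],
    [50, 0, 100],
    [100, 100, 0],
]

presence_absence = taxonomic_distinctness([1, 1, 1], toy_dist)
uneven = taxonomic_distinctness([10, 1, 1], toy_dist)
print(presence_absence)
print(uneven)
```

With presence/absence data all three values coincide, as stated above; with uneven abundances Δ drops because most random pairs of individuals belong to the same dominant taxon.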
Thus it was necessary to extend the existing index to accept partial taxonomies.

5.2.2 Extending taxonomic distinctness

To allow the use of taxonomic distinctness with partially annotated taxa, we extended Clarke and Warwick’s taxonomic distinctness index [149, 164] by using the previously published weighted taxonomic distance (WTD) algorithm to calculate the taxonomic distance between taxa. The extended taxonomic distinctness does not require a phylogenetic tree produced from a multiple sequence alignment, as required by UniFrac distance [114], but can instead use the publicly available NCBI taxonomic hierarchy [342]. Briefly, WTD is based on the assumptions that divergence at a node in the NCBI taxonomic hierarchy represents speciation and that the importance of a divergence decreases with depth; thus divergences at the species level should have less weight than divergences at the genus level, which should have less weight than divergences at the family level, and so on. WTD weights steps in the hierarchy by 1/2^d, where d is the depth of the position in the hierarchy of a given taxon. The WTD algorithm calculates a weighted distance D between the observed LCA taxonomy x_o and the observed LCA taxonomy of a second ORF x_e given a path p on the NCBI Taxonomy Database hierarchy.

The WTD algorithm takes as input p and x_o, and calculates a weighted taxonomic distance for each x_e on nodes in the connecting path P(x_e, x_o), as

D(a, b) = \sum_{e_{a,b} \in E_{P(x_e, x_o)}} \frac{1}{2^{d(a)}}    (5.4)

where e_{a,b} is an edge between nodes a and b in the path and d(a) is the depth of node a [141, 343].

In order to demonstrate the implementation of the WTD in the calculation of taxonomic distinctness, we used the Dune dataset, for which examples of taxonomic distance calculated using Clarke and Warwick’s original method are publicly available [149].
We found that WTD was able to replicate distances between species and is thus accurately calculating taxonomic distances (Fig. 5.7). Next, we used our test set of 19 taxa and found that, using WTD to calculate the distance matrix, the taxa clustered according to taxonomic relatedness, effectively splitting the archaeal and bacterial domains as well as the phyla within these domains (Fig. 5.8).

Figure 5.5: Taxonomic distinctness uniform step. Clarke and Warwick’s [149, 164] taxonomic distinctness index with uniform step length incorrectly groups incompletely annotated taxa. [Figure: pairwise comparison of the 19 test taxa; scale: taxonomic distance, 0–100.]

Figure 5.6: Taxonomic distinctness distinct step. Clarke and Warwick’s [149, 164] taxonomic distinctness index with distinct step length incorrectly groups incompletely annotated taxa. [Figure: pairwise comparison of the 19 test taxa; scale: taxonomic distance, 0–100.]

Figure 5.7: Taxonomic distinctness using weighted taxonomic distance. Using weighted taxonomic distance replicates distances between species as calculated by the uniform method proposed by the original authors of the method [149, 164]. [Figure: panels A and B, pairwise comparison of the Dune dataset species; scales: weighted taxonomic distance and taxonomic distance, 0–100.]

5.2.3 Implementation and availability

The extended version of taxonomic distinctness was then applied to several metagenomic datasets included in Chapters 2 and 3 of this dissertation to yield a more accurate description of diversity within co-occurrence networks and environmental pathway genome databases (ePGDBs) (Fig. 2.4). Indeed, this method has been implemented utilizing both operational taxonomic units (OTUs) defined via small subunit ribosomal genes and open reading frames (ORFs) defined from shotgun metagenomic and metatranscriptomic data for which taxonomy was assigned using the lowest common ancestor algorithm [144].
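The weighted taxonomic distance of Section 5.2.2 can be sketched for semicolon-delimited lineage strings such as those in the 19-taxon test set. The Python below is a hypothetical illustration, not the released implementation: it assumes the hierarchy root sits at depth 0, that each edge is weighted by the depth of its parent node (one possible reading of d(a) in Eq. 5.4), and that lineage strings stand in for the NCBI taxonomic hierarchy. The function name wtd is invented for this example.

```python
def wtd(taxon_x: str, taxon_y: str) -> float:
    """Weighted taxonomic distance between two semicolon-delimited lineages.

    The path between the taxa runs up to their lowest common ancestor and
    back down. An edge whose parent node is at depth d contributes 1 / 2**d,
    so divergences near the root weigh more than divergences near the tips.
    """
    x = taxon_x.split(";")
    y = taxon_y.split(";")

    # Count the leading ranks the two lineages share.
    shared = 0
    for rank_x, rank_y in zip(x, y):
        if rank_x != rank_y:
            break
        shared += 1

    def branch(lineage_length: int) -> float:
        # Child nodes at depths shared+1 .. lineage_length; parent depth is child depth - 1.
        return sum(1 / 2 ** (depth - 1) for depth in range(shared + 1, lineage_length + 1))

    return branch(len(x)) + branch(len(y))


# Partially annotated taxa from different domains stay far apart...
print(wtd("Archaea;pISA1", "Bacteria;Actinobacteria"))
# ...while a phylum-level annotation stays close to its own domain.
print(wtd("Bacteria", "Bacteria;Proteobacteria"))
print(wtd("Bacteria;Proteobacteria", "Bacteria;Actinobacteria"))
```

Under these assumptions, any two taxa from the same domain are separated by a distance below 2, while taxa from different domains are separated by at least 2, regardless of how deeply each is annotated.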
The extended version of taxonomic distinctness, example code and test datasets are publicly available for download and use.

Figure 5.8: Taxonomic distinctness using microbial data. Taxonomic distinctness using weighted taxonomic distance correctly groups incompletely annotated microbial data. [Figure: pairwise comparison of the 19 test taxa; scale: WTD, 0–100.]

Chapter 6

Conclusions

This dissertation described the taxonomic structure, metabolic potential, and metabolic expression of soil microbial communities in the context of spatiotemporal variation and forest harvesting. In addition, this work explored the suitability of canonical methods for normalization of rDNA gene and rRNA transcript data, and presented both a novel method for the analysis of unassembled sequence data and an extension of the taxonomic distinctness index. This chapter synthesizes findings related to soil microbial community response to natural and anthropogenically induced environmental change, considers the importance of careful examination and improvement of existing methods for data processing and statistical analyses, and concludes with a discussion of future challenges and directions in soil microbial ecology.

6.1 Soil microbial community response to environmental change

This dissertation establishes that community composition, community interactions, metabolic potential, and metabolic expression differ with spatiotemporal variation and disturbance. Together the data suggest that, despite the influence of geographic location, season, depth, and forest harvesting, metabolic redundancy within the microbial community buffers against large-scale changes in nutrient and carbon cycling processes. Akin to a mutation in a branched pathway that shunts metabolic flux into an accessory branch of that pathway [344], we propose that perturbation events can result in the reshuffling of trophic relationships and information exchange (e.g., H+, metabolites, and horizontal gene transfer), allowing novel interactions between organisms to form and increasing the community’s ability to tolerate disturbance (Fig. 6.1).
Conceptually this means that rather than simply a subset of the microbial network responding to environmental change, the network transforms, adapting to current ecosystem conditions (Fig. 6.1). This transformation of the microbial network results in changes in microbial composition, community interactions, and metabolic potential, but given metabolic redundancy within the microbial community, ecosystem function is preserved. In short, we propose it is the ability of the network to transform that allows the microbial communities to be both resistant (defy disturbance) and resilient (recover from disturbance) [25, 187] to environmental changes within soil ecosystems. Still, we submit that there is a disturbance threshold (the inflection point of the hysteresis curve) (Fig. 6.1) beyond which the community is unable to recover and instead reestablishes as an alternative community structure or new state, wherein there is little overlap between previous and new community members [345] (Fig. 6.1). However, the work within this dissertation suggests that such catastrophic change would require the removal of one or more metabolic functions from the community, and necessitate complete reestablishment, rather than transformation, of the soil microbial network.

Figure 6.1: Model of microbial community dynamics in response to perturbation. The curve represents similarity in the function of the soil microbial community, where state ‘1’ is the unmanaged natural forest state; ‘?’ is the harvested forest state, whose position along the curve is unknown but projected to be approaching ‘1’; ‘2’ is a disturbance tipping point beyond which a microbial community is unable to return to the previous state; and ‘3’ is the new ecosystem state after recovery from an intense disturbance. Color is used to present data and grey scale is used to represent hypothetical results. Hive plots: axes represent node degree class (1, 2–15, and >15, respectively). Node placement along axes is by mean weighted depth. * The axes of each hive plot have been scaled to the depths of the OTUs within the component to better show detail. Edges are positive and colored by connected component. Nodes are colored by indicator OTU. [Figure labels: Perturbation; State; Similarities in soil microbial community function.]

6.2 A systematic approach to data analysis

To chart information transfer processes in microbial community networks, it is necessary to elucidate compositional, regulatory and distributed metabolic processes connecting community members using multi-omic sequence information. For example, as presented in Chapters 2 and 3 of this dissertation, gene- and pathway-centric metagenomics and metatranscriptomics can be used to reconstruct both interaction and metabolic networks. However, while possible, the reconstruction, description, and interpretation of these networks remain challenging due to inaccuracies in sequencing [346], incomplete or insufficient sequencing coverage [1], inadequate computational resources [103], and the complexity and diversity of soil microbial communities.

While achieving the perfect multi-omics dataset, one which accurately details the complete genomic, transcriptomic, proteomic, and metabolic profiles of an entire microbial community, remains well beyond current technologies, a systematic approach to data analysis can eliminate some of the inherent bias in contemporary datasets. For example, in order to account for compositional data effects introducing specious correlations within the interaction network presented in Chapter 2 of this dissertation, multiple correlation measures, including Kullback-Leibler dissimilarity (robust to compositional effects), were used, a permutation strategy to remove unstable edges was employed, and bootstrap scores were calculated such that the statistical significance of correlations could be determined.
Further, Chapter 4 of this dissertation demonstrates how microbial ecologists can leverage the work of statisticians to overcome known bias in canonical approaches to data normalization and thereby improve the accuracy with which data are analyzed. Finally, Chapter 5 presents a technique for circumventing poor assembly of molecular sequence data sourced from soil microbial communities as well as an extension of the taxonomic distinctness index that describes community diversity. Indeed, by considering the average length of microbial genes and the length of current next-generation sequencing reads, it was possible to improve upon existing methods for analyses of unassembled data. Additionally, extension of the taxonomic distinctness index to accept partial annotations allows microbial ecologists to integrate taxonomy into estimates of microbial diversity without requiring computationally intensive multiple sequence alignments.

More generally, methodological advancement in microbial ecology requires iterative refinement of existing analytical tools as well as close collaboration among microbial ecologists and domain experts in statistics and computing to develop novel technologies and experimental techniques. However, the onus ultimately lies on the individual researcher to challenge established workflows and generate reliable results and data-driven conclusions.

6.3 Future challenges and directions

6.3.1 Soil microbial response to disturbance

Persistent impacts of disturbance on potential community interactions (Fig. 2.4) and genomic potential for biomass decomposition have been observed at the gene [68] (Fig. 3.10) and pathway level (Fig. 2.3). However, no studies to date have combined process rate and real-time trace gas flux measurements with cultivation-independent gene expression and metabolite profiling to determine impacts of soil organic matter removal on microbial cycling of climate active trace gases.
By linking ‘multi-omics’ data with surficial gas flux measurements, it may be possible to connect climate active trace gas cycling to specific environmental conditions and microbial community interaction networks and build upon the work presented in this dissertation. Indeed, it has been recognized that integration of interdisciplinary models is critical to the advancement of our current understanding of the impacts of forest management and climate change on soil ecosystem services [347]. Such a study could leverage the replicated plots established by the LTSP project [217], and the data could ultimately be used to (i) define regulatory and distributed metabolic networks connecting microbial community members and assess network dynamics in response to soil organic matter removal and forest harvesting across soil horizons and ecozones using gene expression (RNA, protein) and metabolomics information, (ii) identify environmental conditions regulating microbial community metabolism and ecosystem services with emphasis on soil properties, process rates, and climate active trace gas fluxes, and (iii) create flux balance models to predict carbon sequestration/loss and climate active trace gas cycling within forest soil ecosystems.

6.3.2 Microbial ecology

The increased throughput and decreased cost of next-generation sequencing technologies have led to the production of vast and varied environmental data sets and large-scale initiatives to make this information accessible [348–350]. Indeed, there has been recent development of publicly available information storage resources (e.g., the Tara Oceans project [348, 349], the Human Microbiome Project [350], the National Center for Biotechnology Information [351], and the Joint Genome Institute’s Genome Portal [352]).
However, despite access to ever-increasing numbers of sequencing projects and publicly available data sets, analysis of sequence information in a custom and customizable way is often stymied by limited access to relevant tools [353], computational resources, and processing time [141]. While this tidal wave of biological information has enormous potential to reveal the hidden metabolic powers of microbial communities, realizing this potential requires new technologies and interpretative frameworks to overcome computational and analytical bottlenecks. Further, given the complexity and volume of next-generation data, and the often ad hoc or in-house methods used for data analysis, there is also mounting concern about the lack of reproducible data products [354, 355].

Integrating multiple sources of information, including sequence reads, mass spectra, and environmental parameters, and processing this information using a cloud-based system may alleviate many operational and interpretive challenges faced by microbial ecologists (Fig. 6.2). Cloud computing gives any researcher with an Internet connection access to high performance computing resources, obviating capital hardware investment and information technology administration costs [356] (Fig. 6.2). Further, cloud computing reduces energy waste, as the so-called computational “heavy lifting” is done externally, requiring only inexpensive laptops and mobile devices to initiate data processing and collect results. Finally, cloud computing can directly address reproducibility problems. Among the “ten commandments of reproducible science”, cloud computing can offer automated documentation of how every result was produced, help avoid manual data manipulation, archive programs, store raw data, and provide public access to developed code and data products [354].
While cloud-based services look to be promising models for large-scale high-throughput data processing that is reproducible between users and institutions, effective implementation of such systems will only be realized through the collective efforts of biologists, bioinformaticians, and information scientists.

Figure 6.2: The microbial matrix. The matrix is a conceptual model of the interconnected network of multi-omic sequence information, processing, storage and researchers needed to chart the microcosmos. (a) Samples are sourced from diverse natural and engineered ecosystems, including our own bodies. (c) Samples are processed in laboratory settings. (d) Biological information is converted into digital information via high-throughput instruments, such as NGS sequencers or tandem mass spectrometers, which produce petabytes of data. (e) Many sequencing centers can push the enormous volume of data to cloud-based storage via high-bandwidth networks. The cloud infrastructure consists of hundreds of thousands of (i) processing and (j) storage units, which collectively provide scalable data storage and processing capabilities to millions of users. (b) Environmental monitoring devices detect ecosystem perturbations such as harmful algal blooms or pathogenic strains in almost real time by gathering target information and transmitting it to storage systems. (f) A network of environmental monitoring systems around the globe can collect and transmit data to storage using a multiplicity of communication links, including satellites and cables. (h) Code for bioinformatic tools can be stored in the same infrastructure where the data reside. (g) Microbial ecologists, computer scientists and engineers from around the world can collaborate, refine and share their data and code. (p) Environmental and health professionals can gather, monitor and study the data from monitoring sites located in faraway or inaccessible places. The processed data are sent to end users via Internet connections on the World Wide Web. (l) Desktops and (m) mobile devices can be used by end users to explore data interactively, while triggering on-demand processing in the cloud and gathering (k) interpretable data, such as matrices, interactive graphics and summary statistics, to more local settings, driving (o) idea generation and (n) knowledge translation. Figure originally published in Current Opinion in Microbiology [103].

6.4 Closing

Advances in high-throughput sequencing, ecology, and bioinformatics are enabling humans to perceive, reconstruct, and interact with the microbial networks mediating matter and energy transformations in the world around us. This dissertation presented a multi-omics approach to understanding the complex communities driving biogeochemical cycling in terrestrial ecosystems and presented new methods with which to analyze molecular sequence data. By taking a systematic approach to the analysis and reconstruction of interaction and metabolic networks, this work represents an important step in understanding how natural and anthropogenically induced environmental changes impact microbial communities and ecosystem function within the soil milieu. While spatiotemporal variation and forest harvesting incite the transformation of the microbial network, thereby shifting community composition, interactions, and metabolic potential, redundant metabolic capacity combined with large- and fine-scale variation in genomic content ensures that environmental change has disparate effects across community members, thereby tempering the consequences of localized extinctions or niche space reduction and guarding against the loss of metabolic functions within the soil ecosystem.
Looking forward, the data and findings from this dissertation can be integrated with detailed biogeochemical parameter information and thermodynamic principles to enable time-variable forecasts of microbial community metabolism and adaptive response to environmental change. This will, in turn, ultimately facilitate the design and manipulation of microbial communities with beneficial metabolic properties.

Bibliography

[1] A. S. Hahn, N. W. Hanson, D. Kim, K. M. Konwar, and S. J. Hallam. Assembly independent functional annotation of short-read data using SOFA: Short-ORF functional annotation. Computational Intelligence in Bioinformatics and Computational Biology (CIBCB), 2015.
[2] W. B. Whitman, D. C. Coleman, and W. J. Wiebe. Prokaryotes: The unseen majority. Proceedings of the National Academy of Sciences of the United States of America, 95(12):6578–6583, 1998.
[3] E. G. Nisbet and N. H. Sleep. The habitat and nature of early life. Nature, 409(6823):1083–1091, 2001.
[4] A. Gonzalez, A. King, M. S. Robeson, et al. Characterizing microbial communities through space and time. Current Opinion in Biotechnology, 23(3):431–436, 2012.
[5] R. R. Colwell. Microbial diversity: The importance of exploration and conservation. Journal of Industrial Microbiology & Biotechnology, 18(5):302–307, 1997.
[6] C. R. Coble, E. G. Murray, and D. R. Rice. Earth Science. Prentice Hall PTR, New York, 1897.
[7] T. S. Garrison. Oceanography: An Invitation to Marine Science. Cengage Learning, 7th edition, 2009.
[8] J. Lehmann, D. Solomon, J. Kinyangi, et al. Spatial complexity of soil organic matter forms at nanometre scales. Nature Geoscience, 1(4):238–242, 2008.
[9] R. Daniel. The metagenomics of soil. Nature Reviews Microbiology, 3(6):470–478, 2005.
[10] N. H. Batjes. Total carbon and nitrogen in the soils of the world. European Journal of Soil Science, 47(2):151–163, 1996.
[11] I. C. Prentice, M. T. Sykes, M. Lautenschlager, et al. Modeling global vegetation patterns and terrestrial carbon storage at the last glacial maximum. Global Ecology and Biogeography Letters, 3(3):67–76, 1993.
[12] G. B. Bonan. Forests and climate change: Forcings, feedbacks, and the climate benefits of forests. Science, 320(5882):1444–1449, 2008.
[13] T. J. Fahey, P. B. Woodbury, J. J. Battles, et al. Forest carbon storage: ecology, management, and policy. Frontiers in Ecology and the Environment, 8(5):245–252, 2010.
[14] R. Lal. Forest soils and carbon sequestration. Forest Ecology and Management, 220(1-3):242–258, 2005.
[15] E. Eriksson, A. R. Gillespie, L. Gustavsson, et al. Integrated carbon analysis of forest management practices and wood substitution. Canadian Journal of Forest Research-Revue Canadienne De Recherche Forestiere, 37(3):671–681, 2007.
[16] Soil Classification Working Group. The Canadian System of Soil Classification. Agric. and Agri-Food Can. Publ. 1646 (Revised) 187 pp. NRC Research Press, Ottawa. 3rd ed. pp.5., 1998.
[17] R. Amundson and H. Jenny. The place of humans in the state factor theory of ecosystems and their soils. Soil Science, 151(1):99–109, 1991.
[18] N. C. Brady and R. R. Weil. The Nature and Properties of Soils. 13th ed. Pearson Education, Inc., Upper Saddle River, New Jersey. 960 pp., 2002.
[19] N. Fierer, J. P. Schimel, and P. A. Holden. Variations in microbial community composition through two soil depth profiles.
Soil Biology and Biochemistry, 35:167–176, 2003.
[20] M. Hartmann, S. Lee, S. J. Hallam, and W. W. Mohn. Bacterial, archaeal and eukaryal community structures throughout soil horizons of harvested and naturally disturbed forest stands. Environmental Microbiology, 11(12):3045–3062, 2009.
[21] J. Lee, J. Wu, Y. Y. Deng, et al. A cell-cell communication signal integrates quorum sensing and stress response. Nature Chemical Biology, 9(5):339–+, 2013.
[22] V. Torsvik and L. Ovreas. Microbial diversity and function in soil: from genes to ecosystems. Current Opinion in Microbiology, 5(3):240–245, 2002.
[23] C. M. Hansel, S. Fendorf, P. M. Jardine, and C. A. Francis. Changes in bacterial and archaeal community structure and functional diversity along a geochemically variable soil profile. Applied and Environmental Microbiology, 74(5):1620–1633, 2008.
[24] G. H. Wagner and D. C. Wolf. Principles and Applications of Soil Microbiology. Pages 218–258 in D. M. Sylvia, J. J. Fuhrmann, P. G. Hartel, and D. A. Zuberer, eds. Principles and Applications of Soil Microbiology. Prentice Hall, Upper Saddle River, N.J., 1998.
[25] A. Bissett, M. V. Brown, S. D. Siciliano, and P. H. Thrall. Microbial community responses to anthropogenically induced environmental change: towards a systems approach. Ecology Letters, 16(Suppl. 1):128–139, 2013.
[26] K. G. Eilers, S. Debenport, S. Anderson, and N. Fierer. Digging deeper to find unique microbial communities: The strong effect of depth on the structure of bacterial and archaeal communities in soil. Soil Biology & Biochemistry, 50:58–65, 2012.
[27] M. Hartmann, C. G. Howes, D. VanInsberghe, et al. Significant and persistent impact of timber harvesting on soil microbial communities in northern coniferous forests. The ISME Journal, 6(12):2199–2218, 2012.
[28] A. C. Kennedy and K. L. Smith. Soil microbial diversity and the sustainability of agricultural soils. Plant and Soil, 170(1):75–86, 1995.
[29] P. A. Matson, W. J. Parton, A. G. Power, and M. J. Swift. Agricultural intensification and ecosystem properties. Science, 277(5325):504–509, 1997.
[30] D. Bru, A. Ramette, N. P. A. Saby, et al. Determinants of the distribution of nitrogen-cycling microbial communities at the landscape scale. The ISME Journal, 5(3):532–542, 2011.
[31] E. Hackl, M. Pfeffer, C. Donat, G. Bachmann, and S. Zechmeister-Boltenstern. Composition of the microbial communities in the mineral soil under different types of natural forest. Soil Biology & Biochemistry, 37(4):661–671, 2005.
[32] J. Harris. Soil microbial communities and restoration ecology: Facilitators or followers? Science, 325(5940):573–574, 2009.
[33] K. Schutz, E. Kandeler, P. Nagel, S. Scheu, and L. Ruess. Functional microbial community response to nutrient pulses by artificial groundwater recharge practice in surface soils and subsoils. FEMS Microbiology Ecology, 72(3):445–455, 2010.
[34] N. Fierer, O. A. Chadwick, and S. E. Trumbore. Production of CO2 in soil profiles of a California annual grassland. Ecosystems, 8(4):412–429, 2005.
[35] C. Rumpel and I. Kogel-Knabner. Deep soil organic matter-a key but poorly understood component of terrestrial C cycle. Plant and Soil, 338(1-2):143–158, 2011.
[36] H. L. Buss, M. A. Bruns, M. J. Schultz, et al. The coupling of biological iron cycling and mineral weathering during saprolite formation, Luquillo Mountains, Puerto Rico.
Geobiology, 3(4):247–260, 2005.
[37] A. Konopka and R. Turco. Biodegradation of organic-compounds in vadose zone and aquifer sediments. Applied and Environmental Microbiology, 57(8):2260–2268, 1991.
[38] J. P. McCutcheon and C. D. von Dohlen. An interdependent metabolic patchwork in the nested symbiosis of mealybugs. Current Biology, 21(16):1366–1372, 2011.
[39] A. Barberan, S. T. Bates, E. O. Casamayor, and N. Fierer. Using network analysis to explore co-occurrence patterns in soil microbial communities. The ISME Journal, 6(2):343–351, 2012.
[40] B. L. Hurwitz, S. J. Hallam, and M. B. Sullivan. Metabolic reprogramming by viruses in the sunlit and dark ocean. Genome Biology, 14(11), Nov 2013.
[41] K. Faust, J. F. Sathirapongsasuti, J. Izard, et al. Microbial co-occurrence relationships in the human microbiome. PLoS Computational Biology, 8(7), 2012.
[42] A. S. Hahn and S. A. Quideau. Shifts in soil microbial community biomass and resource utilization along a Canadian glacier chronosequence. Canadian Journal of Soil Science, 93(3):305–318, 2013.
[43] L. H. Bach, A. Frostegard, and M. Ohlson. Variation in soil microbial communities across a boreal spruce forest landscape. Canadian Journal of Forest Research-Revue Canadienne De Recherche Forestiere, 38(6):1504–1516, 2008.
[44] F. Ekelund, R. Ronn, and S. Christensen. Distribution with depth of protozoa, bacteria and fungi in soil profiles from three Danish forest sites. Soil Biology & Biochemistry, 33(4-5):475–481, 2001.
[45] A. Agnelli, J. Ascher, G. Corti, et al.
Distribution of microbial communities in a forest soil profile investigated by microbial biomass, soil respiration and DGGE of total and extracellular DNA. Soil Biology & Biochemistry, 36(5):859–868, 2004.
[46] C. Will, A. Thurmer, A. Wollherr, et al. Horizon-specific bacterial community composition of German grassland soils, as revealed by pyrosequencing-based analysis of 16S rRNA genes. Applied and Environmental Microbiology, 76(20):6751–6759, 2010.
[47] M. Swallow and S. A. Quideau. Moisture effects on microbial communities in boreal forest floors are stand-dependent. Applied Soil Ecology, 63:120–126, 2013.
[48] T. Bell, J. A. Newman, B. W. Silverman, S. L. Turner, and A. K. Lilley. The contribution of species richness and composition to bacterial services. Nature, 436(7054):1157–1160, 2005.
[49] K. D. Hannam, S. A. Quideau, and B. E. Kishchuk. Forest floor microbial communities in relation to stand composition and timber harvesting in northern Alberta. Soil Biology & Biochemistry, 38(9):2565–2575, 2006.
[50] C. L. Lauber, M. Hamady, R. Knight, and N. Fierer. Pyrosequencing-based assessment of soil pH as a predictor of soil bacterial community structure at the continental scale. Applied and Environmental Microbiology, 75(15):5111–5120, 2009.
[51] W. R. Cookson, D. A. Abaye, P. Marschner, et al. The contribution of soil organic matter fractions to carbon and nitrogen mineralization and microbial community size and structure. Soil Biology & Biochemistry, 37(9):1726–1737, 2005.
[52] E. A. Kaiser and O. Heinemeyer. Seasonal-variations of soil microbial biomass carbon within the plow layer. Soil Biology & Biochemistry, 25(12):1649–1655, 1993.
[53] H. Fritze, J. Pietikainen, and T. Pennanen. Distribution of microbial biomass and phospholipid fatty acids in podzol profiles under coniferous forest. European Journal of Soil Science, 51(4):565–573, 2000.
[54] G. C. Ding, Y. M. Piceno, H. Heuer, et al. Changes of soil bacterial diversity as a consequence of agricultural land use in a semi-arid ecosystem. PLoS ONE, 8(3), 2013.
[55] Y. Y. Li, L. Q. Chen, H. Y. Wen, et al. 454 pyrosequencing analysis of bacterial diversity revealed by a comparative study of soils from mining subsidence and reclamation areas. Journal of Microbiology and Biotechnology, 24(3):313–323, 2014.
[56] L. M. Lavkulich and J. M. Arocena. Luvisolic soils of Canada: Genesis, distribution, and classification. Canadian Journal of Soil Science, 91(5):781–806, 2011.
[57] J. A. Rice. Humin. Soil Science, 166(11):848–857, 2001.
[58] P. Saetre and E. Baath. Spatial variation and patterns of soil microbial community structure in a mixed spruce-birch stand. Soil Biology & Biochemistry, 32(7):909–917, 2000.
[59] R. Miethling, G. Wieland, H. Backhaus, and C. C. Tebbe. Variation of microbial rhizosphere communities in response to crop species, soil origin, and inoculation with Sinorhizobium meliloti L33. Microbial Ecology, 40(1):43–56, 2000.
[60] J. Balesdent, C. Chenu, and M. Balabane. Relationship of soil organic matter dynamics to physical protection and tillage. Soil & Tillage Research, 53(3-4):215–230, 2000.
[61] C. E. Norris, S. A. Quideau, J. S. Bhatti, R. E. Wasylishen, and M. D. MacKenzie. Influence of fire and harvest on soil organic carbon in jack pine sites.
Canadian Journal of Forest Research-Revue Canadienne De Recherche Forestiere, 39(3):642–654, 2009.
[62] H. Eswaran, E. Vandenberg, and P. Reich. Organic-carbon in soils of the world. Soil Science Society of America Journal, 57(1):192–194, 1993.
[63] L. Miles and V. Kapos. Reducing greenhouse gas emissions from deforestation and forest degradation: Global land-use implications. Science, 320(5882):1454–1455, 2008.
[64] J. Koarashi, W. C. Hockaday, C. A. Masiello, and S. E. Trumbore. Dynamics of decadally cycling carbon in subsurface soils. Journal of Geophysical Research-Biogeosciences, 117, 2012.
[65] F. Ponder and M. Tadros. Phospholipid fatty acids in forest soil four years after organic matter removal and soil compaction. Applied Soil Ecology, 19(2):173–182, 2002.
[66] M. D. Busse, S. E. Beattie, R. F. Powers, F. G. Sanchez, and A. E. Tiarks. Microbial community responses in forest mineral soil to compaction, organic matter removal, and vegetation control. Canadian Journal of Forest Research, 36:577–588, 2006.
[67] K. E. Clemmensen, A. Bahr, O. Ovaskainen, et al. Roots and associated fungi drive long-term carbon sequestration in boreal forest. Science, 339(6127):1615–8, 2013. doi: 10.1126/science.1231923.
[68] E. Cardenas, J. M. Kranabetter, G. Hope, et al. Forest harvesting reduces the soil metagenomic potential for biomass decomposition. The ISME Journal, 2015. doi: 10.1038/ismej.2015.57.
[69] Q. C. Li, H. L. Allen, and C. A. Wilson.
Nitrogen mineralization dynamics following the establishment of a loblolly pine plantation. Canadian Journal of Forest Research-Revue Canadienne De Recherche Forestiere, 33(2):364–374, 2003.
[70] R. F. Powers, D. A. Scott, F. G. Sanchez, et al. The North American long-term soil productivity experiment: Findings from the first decade of research. Forest Ecology and Management, 220(1-3):31–50, 2005.
[71] X. Tan, S. X. Chang, and R. Kabzems. Soil compaction and forest floor removal reduced microbial biomass and enzyme activities in a boreal aspen forest soil. Biology and Fertility of Soils, 44(3):471–479, 2008.
[72] A. M. Gordon, R. E. Schlentner, and K. Vancleve. Seasonal patterns of soil respiration and CO2 evolution following harvesting in the white spruce forests of interior Alaska. Canadian Journal of Forest Research-Revue Canadienne De Recherche Forestiere, 17(4):304–310, 1987.
[73] X. Tan, S. X. Chang, and R. Kabzems. Effects of soil compaction and forest floor removal on soil microbial properties and N transformations in a boreal forest long-term soil productivity study. Forest Ecology and Management, 217(2-3):158–170, 2005.
[74] L. Mariani, S. X. Chang, and R. Kabzems. Effects of tree harvesting, forest floor removal, and compaction on soil microbial biomass, microbial respiration, and N availability in a boreal aspen forest in British Columbia. Soil Biology & Biochemistry, 38(7):1734–1744, 2006.
[75] B. S. Griffiths and L. Philippot. Insights into the resistance and resilience of the soil microbial community. FEMS Microbiology Reviews, 37(2):112–129, 2013.
[76] C. Dobell. Antony van Leeuwenhoek and his “Little animals”. New York, Harcourt, Brace and Company, 1932.
[77] P. W. Smith, K.
Watkins, and A. Hewlett. Infection control through the ages. American Journal of Infection Control, 40(1):35–42, 2012.
[78] S. Falkow. Molecular Koch’s postulates applied to microbial pathogenicity. Reviews of Infectious Diseases, 10 Suppl 2:S274–6, 1988.
[79] K. A. Smith. Louis Pasteur, the father of immunology? Frontiers in Immunology, 3:68, 2012.
[80] R. Dahm. Friedrich Miescher and the discovery of DNA. Developmental Biology, 278(2):274–88, 2005.
[81] S. Y. Tan and Y. Tatsumura. Alexander Fleming (1881-1955): Discoverer of penicillin. Singapore Medical Journal, 56(7):366–7, 2015.
[82] F. H. Ruddle. Tribute to Torbjorn Caspersson. American Journal of Human Genetics, 44(4):439–40, 1989.
[83] M. Cobb. Oswald Avery, DNA, and the transformation of biology. Current Biology, 24(2):R55–60, 2014.
[84] C. E. Shannon. The mathematical theory of communication (reprinted). MD Computing, 14(4):306–317, 1997.
[85] A. Klug. Rosalind Franklin and the discovery of the structure of DNA. Nature, 219(5156):808–10 passim, 1968.
[86] J. D. Watson and F. H. Crick. Molecular structure of nucleic acids; a structure for deoxyribose nucleic acid. Nature, 171(4356):737–8, 1953.
[87] F. H. Crick. Central dogma of molecular biology. Nature, 227(5258):561–3, 1970.
[88] R. W. Henkel. It takes time, but justice does triumph - belatedly, a key man behind the microprocessor is getting the attention he deserves - this month, Federico Faggin receives a Marconi fellowship. Electronics, 61(8):3–3, 1988.
[89] F. Sanger, S. Nicklen, and A. R. Coulson. DNA sequencing with chain-terminating inhibitors. Proceedings of the National Academy of Sciences, 74(12):5463–5467, 1977.
[90] C. R. Woese. Bacterial evolution.
Microbiological Reviews, 51(2):221–271, 1987.
[91] D. C. Koboldt, K. M. Steinberg, D. E. Larson, R. K. Wilson, and E. R. Mardis. The next-generation sequencing revolution and its impact on genomics. Cell, 155(1):27–38, 2013.
[92] H. O. Smith, J. F. Tomb, B. A. Dougherty, R. D. Fleischmann, and J. C. Venter. Frequency and distribution of DNA uptake signal sequences in the Haemophilus influenzae Rd genome. Science, 269(5223):538–540, 1995.
[93] J. Handelsman, M. R. Rondon, S. F. Brady, J. Clardy, and R. M. Goodman. Molecular biological access to the chemistry of unknown soil microbes: A new frontier for natural products. Chemistry and Biology, 5(10):R245–R249, 1998.
[94] E. S. Lander, Int Human Genome Sequencing Consortium, L. M. Linton, et al. Initial sequencing and analysis of the human genome. Nature, 409(6822):860–921, 2001.
[95] J. C. Venter, K. Remington, J. F. Heidelberg, et al. Environmental genome shotgun sequencing of the Sargasso Sea. Science, 304(5667):66–74, 2004.
[96] A. C. Greene, K. A. Giffin, C. S. Greene, and J. H. Moore. Adapting bioinformatics curricula for big data. Briefings in Bioinformatics, 17(1):43–50, 2016.
[97] F. H. Crick. The biological replication of macromolecules. In Symp. Soc. Exp. Biol., 1958.
[98] R. Padmanabhan and R. Wu. Nucleotide sequence analysis of DNA. 9. Use of oligonucleotides of defined sequence as primers in DNA sequence analysis. Biochemical and Biophysical Research Communications, 48(5):1295–&, 1972.
[99] E. Jay, R. Bambara, R. Padmanabhan, and R. Wu. Nucleotide-sequence analysis of DNA. 13. DNA sequence-analysis - a general, simple and rapid method for sequencing large oligodeoxyribonucleotide fragments by mapping. Nucleic Acids Research, 1(3):331–353, 1974.
[100] D. J. Lane, B. Pace, G. J. Olsen, et al. Rapid determination of 16S ribosomal RNA sequences for phylogenetic analyses.
Proceedings of the National Academy of Sciences, 82(20):6955–6959, 1985.
[101] C. R. Woese. On the evolution of cells. Proceedings of the National Academy of Sciences of the United States of America, 99(13):8742–7, 2002.
[102] N. D. Cook. The case for reverse translation. Journal of Theoretical Biology, 64(1):113–35, 1977.
[103] A. S. Hahn, K. M. Konwar, S. Louca, N. W. Hanson, and S. J. Hallam. The information science of microbial ecology. Current Opinion in Microbiology, 31:209–216, 2016.
[104] J. C. Wooley, A. Godzik, and I. Friedberg. A primer on metagenomics. PLoS Computational Biology, 6(2), 2010.
[105] R. I. Griffiths, A. S. Whiteley, A. G. O’Donnell, and M. J. Bailey. Rapid method for coextraction of DNA and RNA from natural environments for analysis of ribosomal DNA- and rRNA-based microbial community composition. Applied and Environmental Microbiology, 66(12):5488–5491, 2000.
[106] P. M. Shrestha, M. Kube, R. Reinhardt, and W. Liesack. Transcriptional activity of paddy soil bacterial communities. Environmental Microbiology, 11(4):960–70, 2009. doi: 10.1111/j.1462-2920.2008.01821.x.
[107] F. Warnecke and M. Hess. A perspective: Metatranscriptomics as a tool for the discovery of novel biocatalysts. Journal of Biotechnology, 142(1):91–95, 2009.
[108] M. P. Deutscher. Degradation of RNA in bacteria: comparison of mRNA and stable RNA. Nucleic Acids Research, 34(2):659–666, 2006.
[109] J. A. Bernstein, A. B. Khodursky, P. H. Lin, S. Lin-Chao, and S. N. Cohen.
Global analysis of mRNA decay and abundance in Escherichia coli at single-gene resolution using two-color fluorescent DNA microarrays. Proceedings of the National Academy of Sciences of the United States of America, 99(15):9697–9702, 2002.
[110] G. Hambraeus, C. von Wachenfeldt, and L. Hederstedt. Genome-wide survey of mRNA half-lives in Bacillus subtilis identifies extremely stable mRNAs. Molecular Genetics and Genomics, 269(5):706–714, 2003.
[111] D. W. Selinger, R. M. Saxena, K. J. Cheung, G. M. Church, and C. Rosenow. Global RNA half-life analysis in Escherichia coli reveals positional patterns of transcript degradation. Genome Research, 13(2):216–223, 2003.
[112] E. Redon, P. Loubiere, and M. Cocaign-Bousquet. Role of mRNA stability during genome-wide adaptation of Lactococcus lactis to carbon starvation. Journal of Biological Chemistry, 280(43):36380–36385, 2005.
[113] L. C. Carvalhais, P. G. Dennis, G. W. Tyson, and P. M. Schenk. Application of metatranscriptomics to soil environments. Journal of Microbiological Methods, 91(2):246–251, 2012.
[114] C. Lozupone and R. Knight. UniFrac: a new phylogenetic method for comparing microbial communities. Applied and Environmental Microbiology, 71(12):8228–8235, 2005.
[115] P. D. Schloss, S. L. Westcott, T. Ryabin, et al. Introducing mothur: Open-source, platform-independent, community-supported software for describing and comparing microbial communities. Applied and Environmental Microbiology, 75(23):7537–7541, 2009.
[116] J. G. Caporaso, J. Kuczynski, J. Stombaugh, et al. QIIME allows analysis of high-throughput community sequencing data.
Nature Methods, 7(5):335–336, 2010.
[117] N. Takeuchi, O. X. Cordero, E. V. Koonin, and K. Kaneko. Gene-specific selective sweeps in bacteria and archaea caused by negative frequency-dependent selection. BMC Biology, 13(1):1–11, 2015.
[118] L. Guidi, S. Chaffron, L. Bittner, et al. Plankton networks driving carbon export in the oligotrophic ocean. Nature, advance online publication, Feb 2016.
[119] S. Rakoff-Nahoum, M. J. Coyne, and L. E. Comstock. An ecological network of polysaccharide utilization among human intestinal symbionts. Current Biology, 24(1):40–49, 2014.
[120] C. R. Strachan, R. Singh, D. VanInsberghe, et al. Metagenomic scaffolds enable combinatorial lignin transformation. Proceedings of the National Academy of Sciences of the United States of America, 111(28):10143–10148, 2014.
[121] J. J. Morris. Black Queen evolution: the role of leakiness in structuring microbial communities. Trends in Genetics, 31(8):475–482, 2015.
[122] J. J. Morris, S. E. Papoulis, and R. E. Lenski. Coexistence of evolving bacteria stabilized by a shared Black Queen function. Evolution, 68(10):2960–2971, 2014.
[123] S. Estrela, J. J. Morris, and B. Kerr. Private benefits and metabolic conflicts shape the emergence of microbial interdependencies. Environmental Microbiology, 2015.
[124] C. E. T. Chow, D. Y. Kim, R. Sachdeva, D. A. Caron, and J. A. Fuhrman. Top-down controls on bacterial community structure: microbial network analysis of bacteria, T4-like viruses and protists. The ISME Journal, 8(4):816–829, 2014.
[125] C. E. Chow, D. M. Winget, R. A. White III, S. J. Hallam, and C. A. Suttle. Combining genomic sequencing methods to explore viral diversity and reveal potential virus-host interactions. Frontiers in Microbiology, 6:265, 2015.
[126] P. Simonart and L. Batistic. Aromatic hydrocarbons in soil. Nature, 212(5069):1461–&, 1966.
[127] Y.
Katayama, S. Nishikawa, A. Murayama, et al. The metabolism of biphenyl structures in lignin by the soil bacterium (Pseudomonas paucimobilis SYK-6). FEBS Letters, 233(1):129–133, 1988.
[128] S. L. Lebeis, S. H. Paredes, D. S. Lundberg, et al. Salicylic acid modulates colonization of the root microbiome by specific bacterial taxa. Science, 349(6250):860–864, 2015.
[129] S. Freilich, A. Kreimer, I. Meilijson, et al. The large-scale organization of the bacterial network of ecological co-occurrence interactions. Nucleic Acids Research, 38(12):3857–3868, 2010.
[130] M. Lupatini, A. K. A. Suleiman, R. J. S. Jacques, et al. Network topology reveals high connectance levels and few key microbial genera within soils. Frontiers in Environmental Science, 2:10, 2014.
[131] Q. S. Ruan, D. Dutta, M. S. Schwalbach, et al. Local similarity analysis reveals unique associations among marine bacterioplankton species and environmental factors. Bioinformatics, 22(20):2532–2538, 2006.
[132] S. Chaffron, H. Rehrauer, J. Pernthaler, and C. von Mering. A global network of coexisting microbes from environmental and whole-genome sequence data. Genome Research, 20(7):947–959, 2010.
[133] G. Lima-Mendez, K. Faust, N. Henry, et al. Determinants of community structure in the global plankton interactome. Science, 348(6237), 2015. **Using both genomic and environmental data, a robust co-occurrence analysis is used to model a plankton interaction network that captures predatory and symbiotic relationships. This work not only expands current knowledge about the ocean’s food webs but represents one of the first attempts to validate predicted interactions, using microscopy to corroborate relationships.
[134] R. J. Williams, A. Howe, and K. S. Hofmockel.
Demonstrating microbial co-occurrence pattern analyses within and between ecosystems. Frontiers in Microbiology, 5(358), 2014.
[135] Y. Zhang, W. J. Chen, S. L. Smith, D. W. Riseborough, and J. Cihlar. Soil temperature in Canada during the twentieth century: Complex responses to atmospheric climate change. Journal of Geophysical Research-Atmospheres, 110(D3), 2005.
[136] M. Lupatini, A. K. A. Suleiman, R. J. S. Jacques, et al. Network topology reveals high connectance levels and few key microbial genera within soils. Frontiers in Environmental Science, 2:10, 2014.
[137] B. Ma, H. Wang, M. Dsouza, et al. Geographic patterns of co-occurrence network topological features for soil microbiota at continental scale in eastern China. The ISME Journal, 2016.
[138] Z. Zhang, J. Geng, X. Tang, et al. Spatial heterogeneity and co-occurrence patterns of human mucosal-associated intestinal microbiota. The ISME Journal, 8(4):881–93, 2014.
[139] K. M. Konwar, N. W. Hanson, A. P. Page, and S. J. Hallam. MetaPathways: a modular pipeline for constructing pathway/genome databases from environmental sequence information. BMC Bioinformatics, 14, 2013.
[140] N. W. Hanson, K. M. Konwar, S.-J. Wu, and S. J. Hallam. MetaPathways v2.0: A master-worker model for environmental Pathway/Genome Database construction on grids and clouds. Computational Intelligence in Bioinformatics and Computational Biology, 2014 IEEE Conference on, pages 1–7, May 2014.
[141] K. M. Konwar, N. W. Hanson, M. P. Bhatia, et al. MetaPathways v2.5: Quantitative functional, taxonomic, and usability improvements. Bioinformatics, pages 1–3, July 2015.
[142] V. M. Markowitz, I. M. A. Chen, K. Chu, et al. IMG/M 4 version of the integrated metagenome comparative analysis system. Nucleic Acids Research, 42(D1):D568–D573, 2014.
[143] J. R. Kultima, L. P. Coelho, K. Forslund, et al.
MOCAT2: a metagenomic assembly, annotation and profiling framework. Bioinformatics, 2016.
[144] D. H. Huson, A. F. Auch, J. Qi, and S. C. Schuster. MEGAN analysis of metagenomic data. Genome Research, 17(3):377–386, 2007.
[145] P. D. Karp, S. M. Paley, and M. Krummenacker. Pathway Tools version 13.0: Integrated software for pathway/genome informatics and systems biology. Briefings in Bioinformatics, 11:40–79, 2010.
[146] S. I. Greenblum, S. Efroni, C. F. Schaefer, and K. H. Buetow. The PathOlogist: an automated tool for pathway-centric analysis. BMC Bioinformatics, 12, 2011.
[147] V. P. Edgcomb, M. G. Pachiadaki, P. Mara, et al. Gene expression profiling of microbial activities and interactions in sediments under haloclines of E. Mediterranean deep hypersaline anoxic basins. The ISME Journal, 2016. doi: 10.1038/ismej.2016.58.
[148] R. Ruvindy, R. A. White, B. A. Neilan, and B. P. Burns. Unravelling core microbial metabolisms in the hypersaline microbial mats of Shark Bay using high-throughput metagenomics. The ISME Journal, 10(1):183–196, 2016.
[149] K. R. Clarke and R. M. Warwick. A taxonomic distinctness index and its statistical properties. Journal of Applied Ecology, 35(4):523–531, 1998.
[150] T. C. Balser and M. K. Firestone. Linking microbial community composition and soil processes in a California annual grassland and mixed-conifer forest. Biogeochemistry, 73(2):395–415, 2005.
[151] C. C. Cleveland, D. R. Nemergut, S. K. Schmidt, and A. R. Townsend. Increases in soil respiration following labile carbon additions linked to rapid shifts in soil microbial community composition. Biogeochemistry, 82(3):229–240, 2007.
[152] G. W. Tyson, J. Chapman, P. Hugenholtz, et al. Community structure and metabolism through reconstruction of microbial genomes from the environment. Nature, 428(6978):37–43, 2004.
[153] E. Pruesse, C. Quast, K. Knittel, et al. SILVA: a comprehensive online resource for quality checked and aligned ribosomal RNA sequence data compatible with ARB. Nucleic Acids Research, 35(21):7188–7196, 2007.
[154] J. T. Simpson, K. Wong, S. D. Jackman, et al. ABySS: A parallel assembler for short read sequence data. Genome Research, 19(6):1117–1123, 2009.
[155] K. M. Konwar, N. W. Hanson, M. P. Bhatia, et al. MetaPathways v2.5: quantitative functional, taxonomic and usability improvements. Bioinformatics, 31(20):3345–3347, 2015.
[156] R. Caspi, T. Altman, K. Dreher, et al. The MetaCyc database of metabolic pathways and enzymes and the BioCyc collection of pathway/genome databases. Nucleic Acids Research, 40(Database issue):D742–53, 2012. doi: 10.1093/nar/gkr1014.
[157] M. Kanehisa. The KEGG database. In Silico Simulation of Biological Processes, 247:91–103, 2002. Novartis Foundation Symposium.
[158] M. Kaufmann. The role of the COG database in comparative and functional genomics. Current Bioinformatics, 1(3):291–300, 2006.
[159] B. L. Cantarel, P. M. Coutinho, C.
Rancurel, et al. The carbohydrate-active enzymes database (CAZy): an expert resource for glycogenomics. Nucleic Acids Research, 37:D233–D238, 2009. Special issue.
[160] R. Agarwala, T. Barrett, J. Beck, et al. Database resources of the National Center for Biotechnology Information. Nucleic Acids Research, 43(D1):D6–D17, 2015.
[161] P. Shannon, A. Markiel, O. Ozier, et al. Cytoscape: A software environment for integrated models of biomolecular interaction networks. Genome Research, 13(11):2498–2504, 2003.
[162] M. Krzywinski, I. Birol, S. J. M. Jones, and M. A. Marra. Hive plots-rational approach to visualizing networks. Briefings in Bioinformatics, 13(5):627–644, 2012.
[163] S. I. E. Perez. Exploring microbial community structure and resilience through visualization and analysis of microbial co-occurrence networks. PhD thesis, University of British Columbia, 2015.
[164] R. M. Warwick and K. R. Clarke. New ’biodiversity’ measures reveal a decrease in taxonomic distinctness with increasing stress. Marine Ecology Progress Series, 129(1-3):301–305, 1995.
[165] S. S. Shapiro and M. B. Wilk. An analysis of variance test for normality (complete samples). Biometrika, 52(3-4):591–611, 1965.
[166] M. Li, J. X. Wang, and J. Chen. A fast agglomerate algorithm for mining functional modules in protein interaction networks. BMEI 2008: Proceedings of the International Conference on Biomedical Engineering and Informatics, Vol 1, pages 3–7, 2008.
[167] A. Butte. The use and analysis of microarray data. Nature Reviews Drug Discovery, 1(12):951–960, 2002.
[168] A.
Pfostl, S. Zayni, A. Hofinger, et al. Biosynthesis of dTDP-3-acetamido-3,6-dideoxy-alpha-D-glucose. Biochemical Journal, 410(1):187–194, 2008.
[169] I. Aguirrezabalaga, C. Olano, N. Allende, et al. Identification and expression of genes involved in biosynthesis of L-oleandrose and its intermediate L-olivose in the oleandomycin producer Streptomyces antibioticus. Antimicrobial Agents and Chemotherapy, 44(5):1266–1275, 2000.
[170] J. R. Guo, J. R. Cole, Q. P. Zhang, C. T. Brown, and J. M. Tiedje. Microbial community analysis with ribosomal gene fragments from shotgun metagenomes. Applied and Environmental Microbiology, 82(1):157–166, 2016.
[171] J. D. Anderson, L. J. Ingram, and P. D. Stahl. Influence of reclamation management practices on microbial biomass carbon and soil organic carbon accumulation in semiarid mined lands of Wyoming. Applied Soil Ecology, 40(2):387–397, 2008.
[172] N. C. Banning, D. B. Gleeson, A. H. Grigg, et al. Soil microbial community successional patterns during forest ecosystem restoration. Applied and Environmental Microbiology, 77(17):6158–6164, 2011.
[173] P. A. Dimitriu, C. E. Prescott, S. A. Quideau, and S. J. Grayston. Impact of reclamation of surface-mined boreal forest soils on microbial community composition and function. Soil Biology & Biochemistry, 42(12):2289–2297, 2010.
[174] H. Insam and K. H. Domsch. Relationship between soil organic carbon and microbial biomass on chronosequences of reclamation sites. Microbial Ecology, 15(2):177–188, 1988.
[175] S. J. Grayston and C. E. Prescott. Microbial communities in forest floors under four tree species in coastal British Columbia.
Soil Biology & Biochemistry, 37(6):1157–1167, 2005.
[176] Z. A. Sylvain and D. H. Wall. Linking soil biodiversity and vegetation: Implications for a changing planet. American Journal of Botany, 98(3):517–527, 2011.
[177] T. Bell, J. A. Newman, B. W. Silverman, S. L. Turner, and A. K. Lilley. The contribution of species richness and composition to bacterial services. Nature, 436(7054):1157–1160, 2005.
[178] J. Friedman and E. J. Alm. Inferring correlation networks from genomic survey data. PLoS Computational Biology, 8(9), 2012.
[179] D. A. Jackson. Compositional data in community ecology: The paradigm or peril of proportions? Ecology, 78(3):929–940, 1997.
[180] J. M. Beman, J. A. Steele, and J. A. Fuhrman. Co-occurrence patterns for abundant marine archaeal and bacterial lineages in the deep chlorophyll maximum of coastal California. ISME Journal, 5(7):1077–1085, 2011.
[181] J. Z. Zhou, Y. Deng, F. Luo, Z. L. He, and Y. F. Yang. Phylogenetic molecular ecological network of soil microbial communities in response to elevated CO2. mBio, 2(4), 2011.
[182] I. Kogel-Knabner, P. G. Hatcher, and W. Zech. Chemical structural studies of forest soil humic acids: aromatic carbon fraction. Soil Science Society of America Journal, 55(1):241–247, 1991.
[183] J. P. Martin and K. Haider. Microbial activity in relation to soil humus formation. Soil Science, 111(1):54, 1971.
[184] C. Steelink. Free radical studies of lignin, lignin degradation products and soil humic acids. Geochimica et Cosmochimica Acta, 28(Oct):1615–1622, 1964.
[185] E. S. Kasischke, N. L. Christensen, and B. J. Stocks. Fire, global warming, and the carbon balance of boreal forests. Ecological Applications, 5(2):437–451, 1995.
[186] J. Bauhus, D. Pare, and L. Cote. Effects of tree species, stand age and soil type on soil microbial biomass and its activity in a southern boreal forest. Soil Biology & Biochemistry, 30(8-9):1077–1089, 1998.
[187] S. D. Allison and J. B. H. Martiny. Resistance, resilience, and redundancy in microbial communities. Proceedings of the National Academy of Sciences of the United States of America, 105(Suppl. 1):11512–11519, 2008.
[188] F. A. Loewus and P. P. N. Murthy. myo-Inositol metabolism in plants. Plant Science, 150(1):1–19, 2000.
[189] G. Yim, H. M. H. Wang, and J. Davies. Antibiotics as signalling molecules. Philosophical Transactions of the Royal Society B-Biological Sciences, 362(1483):1195–1200, 2007.
[190] O. X. Cordero, H. Wildschutte, B. Kirkup, et al. Ecological populations of bacteria act as socially cohesive units of antibiotic production and resistance. Science, 337(6099):1228–1231, 2012.
[191] A. Chao. Nonparametric estimation of the number of classes in a population. Scandinavian Journal of Statistics, 11(4):265–270, 1984.
[192] C. E. Shannon. A mathematical theory of communication. The Bell System Technical Journal, 27:379–423, 623–656, July, October 1948.
[193] E. H. Simpson. Measurement of diversity. Nature, 163(4148):688, 1949.
[194] T. L. Czaran, R. F. Hoekstra, and L. Pagie.
Chemical warfare between microbes promotes biodiversity. Proceedings of the National Academy of Sciences of the United States of America, 99(2):786–790, 2002.
[195] J. Davies and D. Davies. Origins and evolution of antibiotic resistance. Microbiology and Molecular Biology Reviews, 74(3):417, 2010.
[196] C. R. Fonseca and G. Ganade. Species functional redundancy, random extinctions and the stability of ecosystems. Journal of Ecology, 89(1):118–125, 2001.
[197] A. Briones and L. Raskin. Diversity and dynamics of microbial communities in engineered environments and their implications for process stability. Current Opinion in Biotechnology, 14(3):270–276, 2003.
[198] A. Fernandez, S. Y. Huang, S. Seston, et al. How stable is stable? Function versus community composition. Applied and Environmental Microbiology, 65(8):3697–3704, 1999.
[199] S. Louca. The Ecology of Microbial Metabolic Pathways. PhD thesis, 2016.
[200] U. N. Nielsen, E. Ayres, D. H. Wall, and R. D. Bardgett. Soil biodiversity and carbon cycling: a review and synthesis of studies examining diversity-function relationships. European Journal of Soil Science, 62(1):105–116, 2011.
[201] J. B. H. Martiny, B. J. M. Bohannan, J. H. Brown, et al. Microbial biogeography: putting microorganisms on the map. Nature Reviews Microbiology, 4(2):102–112, 2006.
[202] P. Larsen, Y. Hamada, and J. Gilbert. Modeling microbial communities: Current, developing, and future technologies for predicting microbial community interaction. Journal of Biotechnology, 160(1-2):17–24, 2012.
[203] R. G. Bjork, M. P. Bjorkman, M. X. Andersson, and L. Klemedtsson.
Temporal variation in soil microbial communities in alpine tundra. Soil Biology & Biochemistry, 40(1):266–268, 2008.
[204] M. Reichstein, A. Rey, A. Freibauer, et al. Modeling temporal and large-scale spatial variability of soil respiration from soil water availability, temperature and vegetation productivity indices. Global Biogeochemical Cycles, 17(4), 2003.
[205] J. W. Tang and D. D. Baldocchi. Spatial-temporal variation in soil respiration in an oak-grass savanna ecosystem in California and its partitioning into autotrophic and heterotrophic components. Biogeochemistry, 73(1):183–207, 2005.
[206] S. K. Schmidt, E. K. Costello, D. R. Nemergut, et al. Biogeochemical consequences of rapid microbial turnover and seasonal succession in soil. Ecology, 88(6):1379–1385, 2007.
[207] H. Juottonen, E. S. Tuittila, S. Juutinen, H. Fritze, and K. Yrjala. Seasonality of rDNA- and rRNA-derived archaeal communities and methanogenic potential in a boreal mire. ISME Journal, 2(11):1157–1168, 2008.
[208] L. Zifcakova, T. Vetrovsky, A. Howe, and P. Baldrian. Microbial activity in forest soil reflects the changes in ecosystem properties between summer and winter. Environmental Microbiology, 18(1):288–301, 2016.
[209] J. Voriskova, V. Brabcova, T. Cajthaml, and P. Baldrian. Seasonal dynamics of fungal communities in a temperate oak forest soil. New Phytologist, 201(1):269–278, 2014.
[210] P. D. Brooks, M. W. Williams, D. A. Walker, and S. K. Schmidt. The Niwot Ridge snow fence experiment: Biogeochemical responses to changes in the seasonal snowpack. Biogeochemistry of Seasonally Snow-Covered Catchments, (228):293–302, 1995.
IAHS Publications.
[211] A. C. Edwards, R. Scalenghe, and M. Freppaz. Changes in the seasonal snow cover of alpine regions and its effect on soil processes: A review. Quaternary International, 162:172–181, 2007.
[212] P. A. Burrough. Multiscale sources of spatial variation in soil. 2. A non-Brownian fractal model and its application in soil survey. Journal of Soil Science, 34(3):599–620, 1983.
[213] P. Baldrian, M. Kolarik, M. Stursova, et al. Active and total microbial communities in forest soil are largely different and highly stratified during decomposition. ISME Journal, 6(2):248–258, 2012.
[214] R. Lopez-Mondejar, D. Zuhlke, D. Becher, K. Riedel, and P. Baldrian. Cellulose and hemicellulose decomposition by forest soil bacteria proceeds by the action of structurally variable enzymatic systems. Scientific Reports, 6, 2016.
[215] D. VanInsberghe, K. R. Maas, E. Cardenas, et al. Non-symbiotic Bradyrhizobium ecotypes dominate North American forest soils. ISME Journal, 9(11):2435–2441, 2015.
[216] M. L. Chow, C. C. Radomski, J. M. McDermott, J. Davies, and P. E. Axelrood. Molecular characterization of bacterial diversity in lodgepole pine (Pinus contorta) rhizosphere soils from British Columbia forest soils differing in disturbance and geographic source. FEMS Microbiology Ecology, 42(3):347–357, 2002.
[217] R. F. Powers, D. H. Alban, R. E. Miller, et al. Sustaining site productivity in North American forests: problems and prospects. Proceedings of the Seventh North American Forest Soils Conference on Sustained Productivity of Forest Soils. Faculty of Forestry, University of British Columbia, Vancouver, BC, pages 49–79, 1990.
[218] A. M. Reid, W. K.
Chapman, and C. E. Prescott. Comparing lodgepole pine growth and disease occurrence at six long-term soil productivity (LTSP) sites in British Columbia, Canada. Canadian Journal of Forest Research, 46(4):595–599, 2016.
[219] Government of Canada. Historical climate data. 2016-06-10.
[220] E. Brodie, S. Edwards, and N. Clipson. Bacterial community dynamics across a floristic gradient in a temperate upland grassland ecosystem. Microbial Ecology, 44(3):260–270, 2002.
[221] L. C. Jiang, F. Schlesinger, C. A. Davis, et al. Synthetic spike-in standards for RNA-seq experiments. Genome Research, 21(9):1543–1551, 2011.
[222] A. M. Bolger, M. Lohse, and B. Usadel. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics, 30(15):2114–2120, 2014.
[223] D. H. Li, C. M. Liu, R. B. Luo, K. Sadakane, and T. W. Lam. MEGAHIT: an ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph. Bioinformatics, 31(10):1674–1676, 2015.
[224] R. Caspi, T. Altman, R. Billington, et al. The MetaCyc database of metabolic pathways and enzymes and the BioCyc collection of pathway/genome databases. Nucleic Acids Research, 42(D1):D459–D471, 2014.
[225] D. Risso, J. Ngai, T. P. Speed, and S. Dudoit. Normalization of RNA-seq data using factor analysis of control genes or samples. Nature Biotechnology, 32(9):896–902, 2014.
[226] D. J. McCarthy, Y. Chen, and G. K. Smyth. Differential expression analysis of multifactor RNA-seq experiments with respect to biological variation. Nucleic Acids Research, 40(10):4288–97, 2012.
doi: 10.1093/nar/gks042.
[227] S. C. Goslee and D. L. Urban. The ecodist package for dissimilarity-based analysis of ecological data. Journal of Statistical Software, 22(7):1–19, 2007.
[228] G. De’Ath. Multivariate regression trees: a new technique for modeling species-environment relationships. Ecology, 83(4):1105–1117, 2002.
[229] M. D. Robinson, D. J. McCarthy, and G. K. Smyth. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics, 26(1):139–40, 2010. doi: 10.1093/bioinformatics/btp616.
[230] N. E. Appleford, D. J. Evans, J. R. Lenton, et al. Function and transcript analysis of gibberellin-biosynthetic enzymes in wheat. Planta, 223(3):568–82, 2006.
[231] T. Yoshihisa, K. Yunoki-Esaki, C. Ohshima, N. Tanaka, and T. Endo. Possibility of cytoplasmic pre-tRNA splicing: the yeast tRNA splicing endonuclease mainly localizes on the mitochondria. Molecular Biology of the Cell, 14(8):3266–3279, 2003.
[232] S. R. Khan. Calcium Oxalate in Biological Systems. CRC Press, 1995.
[233] W. L. Silver, A. E. Lugo, and M. Keller. Soil oxygen availability and biogeochemistry along rainfall and topographic gradients in upland wet tropical forest soils.
Biogeochemistry, 44(3):301–328, 1999.
[234] J. C. Arguelles. Physiological roles of trehalose in bacteria and yeasts: a comparative analysis. Archives of Microbiology, 174(4):217–224, 2000.
[235] J. Shoji, T. Kato, H. Hinoo, et al. Production of fosfomycin (phosphonomycin) by Pseudomonas syringae. Journal of Antibiotics, 39(7):1011–1012, 1986.
[236] L. Govindasamy, T. Kukar, W. Lian, et al. Structural and mutational characterization of L-carnitine binding to human carnitine acetyltransferase. Journal of Structural Biology, 146(3):416–424, 2004.
[237] R. A. Kaufman and H. P. Broquist. Biosynthesis of carnitine in Neurospora crassa. Journal of Biological Chemistry, 252(21):7437–7439, 1977.
[238] J. M. Luengo, J. L. Garcia, and E. R. Olivera. The phenylacetyl-CoA catabolon: a complex catabolic unit with broad biotechnological applications. Molecular Microbiology, 39(6):1434–1442, 2001.
[239] D. J. Aceti and W. C. Champness. Transcriptional regulation of Streptomyces coelicolor pathway-specific antibiotic regulators by the absA and absB loci. Journal of Bacteriology, 180(12):3100–3106, 1998.
[240] M. Carmona, M. T. Zamarro, B. Blazquez, et al. Anaerobic catabolism of aromatic compounds: a genetic and genomic view. Microbiology and Molecular Biology Reviews, 73(1):71, 2009.
[241] C. S. Harwood, G. Burchhardt, H. Herrmann, and G. Fuchs. Anaerobic metabolism of aromatic compounds via the benzoyl-CoA pathway. FEMS Microbiology Reviews, 22(5):439–458, 1998.
[242] L. Wohlbrand, H. Wilkes, T. Halder, and R. Rabus.
Anaerobic degradation of p-ethylphenol by "Aromatoleum aromaticum" strain EbN1: Pathway, regulation, and involved proteins. Journal of Bacteriology, 190(16):5699–5709, 2008.
[243] M. J. Warren and A. I. Scott. Tetrapyrrole assembly and modification into the ligands of biologically functional cofactors. Trends in Biochemical Sciences, 15(12):486–491, 1990.
[244] G. van Meer. Membrane lipids, where they are and how they behave: Sphingolipids on the move. FASEB Journal, 24, 2010.
[245] R. S. Rao, C. P. Jyothi, R. S. Prakasham, P. N. Sarma, and L. V. Rao. Xylitol production from corn fiber and sugarcane bagasse hydrolysates by Candida tropicalis. Bioresource Technology, 97(15):1974–1978, 2006.
[246] H. Sorensen. Decomposition of lignin by soil bacteria and complex formation between autoxidized lignin and organic nitrogen compounds. Journal of General Microbiology, 27(1):21, 1962.
[247] M. S. Cretoiu, G. W. Korthals, J. H. M. Visser, and J. D. van Elsas. Chitin amendment increases soil suppressiveness toward plant pathogens and modulates the actinobacterial and oxalobacteraceal communities in an experimental agricultural field. Applied and Environmental Microbiology, 79(17):5291–5301, 2013.
[248] M. L. Otte, G. Wilson, J. T. Morris, and B. M. Moran. Dimethylsulphoniopropionate (DMSP) and related compounds in higher plants. Journal of Experimental Botany, 55(404):1919–1925, 2004.
[249] G. P. Quinn and M. J. Keough. Experimental Design and Data Analysis for Biologists. Cambridge University Press, 2002.
[250] V. A. Orchard and F. J. Cook. Relationship between soil respiration and soil moisture. Soil Biology & Biochemistry, 15(4):447–453, 1983.
[251] S. Trapp and R. Croteau. Defensive resin biosynthesis in conifers. Annual Review of Plant Physiology and Plant Molecular Biology, 52:689–724, 2001.
[252] N. Dudareva, D. Martin, C. M. Kish, et al. (E)-beta-ocimene and myrcene synthase genes of floral scent biosynthesis in snapdragon: Function and expression of three terpene synthase genes of a new terpene synthase subfamily. Plant Cell, 15(5):1227–1241, 2003.
[253] I. S. Booij-James, S. K. Dube, M. A. K. Jansen, M. Edelman, and A. K. Mattoo. Ultraviolet-B radiation impacts light-mediated turnover of the photosystem II reaction center heterodimer in Arabidopsis mutants altered in phenolic metabolism. Plant Physiology, 124(3):1275–1283, 2000.
[254] G. Fuchs. Anaerobic metabolism of aromatic compounds. Incredible Anaerobes: From Physiology to Genomics to Fuels, 1125:82–99, 2008. Annals of the New York Academy of Sciences.
[255] J. R. Simpson and W. C. Evans. The metabolism of nitrophenols by certain bacteria. Biochemical Journal, 55(5):R24, 1953.
[256] R. H. White. Biosynthesis of the sulfonolipid 2-amino-3-hydroxy-15-methylhexadecane-1-sulfonic acid in the gliding bacterium Cytophaga johnsonae. Journal of Bacteriology, 159(1):42–46, 1984.
[257] R. C. Perez and A. Matin. Carbon dioxide assimilation by Thiobacillus novellus under nutrient-limited mixotrophic conditions. Journal of Bacteriology, 150(1):46–51, 1982.
[258] K. Sonntag, J. Schwinde, A. A. de Graaf, et al.
C-13 NMR studies of the fluxes in the central metabolism of Corynebacterium glutamicum during growth and overproduction of amino acids in batch cultures. Applied Microbiology and Biotechnology, 44(3-4):489–495, 1995.
[259] M. A. Matamoros, M. R. Clemente, S. Sato, et al. Molecular analysis of the pathway for the synthesis of thiol tripeptides in the model legume Lotus japonicus. Molecular Plant-Microbe Interactions, 16(11):1039–1046, 2003.
[260] A. Ebringerova. Structural diversity and application potential of hemicelluloses. Macromolecular Symposia, 232:1–12, 2006.
[261] N. Lombard, E. Prestat, J. D. van Elsas, and P. Simonet. Soil-specific limitations for access and analysis of soil microbial communities by metagenomics. FEMS Microbiology Ecology, 78(1):31–49, 2011. Special issue.
[262] M. C. Jarvis. Structure and properties of pectin gels in plant cell walls. Plant Cell and Environment, 7(3):153–164, 1984.
[263] F. J. Stewart, O. Ulloa, and E. F. DeLong. Microbial metatranscriptomics in a permanent marine oxygen minimum zone. Environmental Microbiology, 14(1):23–40, 2012. Special issue.
[264] J. Raes, J. O. Korbel, M. J. Lercher, C. von Mering, and P. Bork. Prediction of effective genome size in metagenomic samples. Genome Biology, 8(1), 2007.
[265] Z. T. Aanderud, S. E. Jones, N. Fierer, and J. T. Lennon. Resuscitation of the rare biosphere contributes to pulses of ecosystem activity. Frontiers in Microbiology, 6:24, 2015. doi: 10.3389/fmicb.2015.00024.
[266] C. Ginestet. ggplot2: Elegant graphics for data analysis.
Journal of the Royal Statistical Society Series A-Statistics in Society, 174(1):245, 2011.
[267] A. Hatakka. Biodegradation of lignin. Biopolymers Online. Wiley-VCH Verlag GmbH & Co KGaA, Weinheim, Germany, 2005.
[268] N. A. Pudlo, K. Urs, S. S. Kumar, et al. Symbiotic human gut bacteria with variable metabolic priorities for host mucosal glycans. mBio, 6(6), 2015.
[269] N. Fierer and R. B. Jackson. The diversity and biogeography of soil bacterial communities. Proceedings of the National Academy of Sciences of the United States of America, 103(3):626–631, 2006.
[270] N. Fierer, J. W. Leff, B. J. Adams, et al. Cross-biome metagenomic analyses of soil microbial communities and their functional attributes. Proceedings of the National Academy of Sciences of the United States of America, 109(52):21390–21395, 2012.
[271] F. O. Aylward, J. M. Eppley, J. M. Smith, et al. Microbial community transcriptional networks are conserved in three domains at ocean basin scales. Proceedings of the National Academy of Sciences of the United States of America, 112(17):5443–5448, 2015.
[272] A. G. Smith and A. C. Neish. Alkaline oxidation of [14C]-labelled protolignin formed from cinnamic acid in spruce and aspen twigs. Phytochemistry, 3:609–615, 1964.
[273] S. A. Quideau, O. A. Chadwick, A. Benesi, R. C. Graham, and M. A. Anderson. A direct link between forest vegetation type and soil organic matter composition. Geoderma, 104(1-2):41–60, 2001.
[274] D. W. Johnson. Sulfur cycling in forests. Biogeochemistry, 1(1):29–43, 1984.
[275] M. J. Mitchell, C. T. Driscoll, R. D. Fuller, M. B. David, and G. E. Likens. Effect of whole-tree harvesting on the sulfur dynamics of a forest soil. Soil Science Society of America Journal, 53(3):933–940, 1989.
[276] C. M. Fang and J. B. Moncrieff. The variation of soil microbial respiration with depth in relation to soil carbon composition. Plant and Soil, 268(1-2):243–253, 2005.
[277] W. Z. Huang and J. J. Schoenau. Forms, amounts and distribution of carbon, nitrogen, phosphorus and sulfur in a boreal aspen forest soil. Canadian Journal of Soil Science, 76, 1996.
[278] B. J. Macauley and D. M. Griffin. Effects of carbon dioxide and oxygen on the activity of some soil fungi. Transactions of the British Mycological Society, 53(1):53–62, 1969.
[279] M. J. Taherzadeh, L. Gustafsson, C. Niklasson, and G. Liden. Conversion of furfural in aerobic and anaerobic batch fermentation of glucose by Saccharomyces cerevisiae. Journal of Bioscience and Bioengineering, 87(2):169–174, 1999.
[280] A. B. Dohrmann, M. Kuting, S. Junemann, et al. Importance of rare taxa for bacterial diversity in the rhizosphere of Bt- and conventional maize varieties. ISME Journal, 7(1):37–49, 2013. doi: 10.1038/ismej.2012.77.
[281] P. J. McMurdie and S. Holmes. Waste not, want not: Why rarefying microbiome data is inadmissible. PLoS Computational Biology, 10(4), 2014.
[282] C. E. Lawson, B. J. Strachan, N. W. Hanson, et al. Rare taxa have potential to make metabolic contributions in enhanced biological phosphorus removal ecosystems. Environmental Microbiology, 17(12):4979–4993, 2015.
[283] M. L. Sogin, H. G. Morrison, J. A. Huber, et al. Microbial diversity in the deep sea and the underexplored "rare biosphere". Proceedings of the National Academy of Sciences of the United States of America, 103(32):12115–20, 2006.
[284] S. J. Blazewicz, R. L. Barnard, R. A. Daly, and M. K. Firestone. Evaluating rRNA as an indicator of microbial activity in environmental communities: limitations and uses. ISME Journal, 7(11):2061–2068, 2013.
[285] K. M. DeAngelis, W. L. Silver, A. W. Thompson, and M. K. Firestone. Microbial communities acclimate to recurring changes in soil redox potential status. Environmental Microbiology, 12(12):3137–49, 2010. doi: 10.1111/j.1462-2920.2010.02286.x.
[286] B. J. Campbell and D. L. Kirchman. Bacterial diversity, community structure and potential growth rates along an estuarine salinity gradient. ISME Journal, 7(1):210–20, 2013. doi: 10.1038/ismej.2012.93.
[287] O. U. Mason, T. C. Hazen, S. Borglin, et al. Metagenome, metatranscriptome and single-cell sequencing reveal microbial response to Deepwater Horizon oil spill. ISME Journal, 6(9):1715–1727, 2012.
[288] S. Anders. Analysing RNA-seq data with the DESeq package. Molecular Biology, 43(4):17, 2010.
[289] D. Bottomly, N. A. R. Walter, J. E. Hunter, et al. Evaluating gene expression in C57BL/6J and DBA/2J mouse striatum using RNA-seq and microarrays. PLoS ONE, 6(3), 2011.
[290] F. M. Giorgi, C. Del Fabbro, and F. Licausi. Comparative study of rna-seq- and microarray-derived coexpression networks in arabidopsis thaliana. Bioinformatics, 29(6):717–724, 2013.
[291] K. von Wyschetzki, H. Lowack, and J. Heinze. Transcriptomic response to injury sheds light on the physiological costs of reproduction in ant queens. Molecular Ecology, 25(9):1972–1985, 2016.
[292] R. Laforge. Homoscedasticity. American Psychologist, 18(4):213–214, 1963.
[293] I. Zwiener, B. Frisch, and H. Binder. Transforming rna-seq data to improve the performance of prognostic gene signatures. PLoS One, 9(1), 2014.
[294] R. C. Edgar. Search and clustering orders of magnitude faster than blast. Bioinformatics, 26(19):2460–2461, 2010.
[295] E. A. Gies, K. M. Konwar, J. T. Beatty, and S. J. Hallam. Illuminating microbial dark matter in meromictic sakinaw lake. Applied and Environmental Microbiology, 80(21):6807–6818, 2014.
[296] A. Shade, S. E. Jones, J. G. Caporaso, et al. Conditionally rare taxa disproportionately contribute to temporal changes in microbial diversity. mBio, 5(4), 2014.
[297] K. Faust and J. Raes. Microbial interactions: from networks to models. Nat Rev Microbiol, 10(8):538–50, 2012. doi:10.1038/nrmicro2832.
[298] J. A. Martin-Fernandez, C. Barcelo-Vidal, and V. Pawlowsky-Glahn. Dealing with zeros and missing values in compositional data sets using nonparametric imputation. Mathematical Geology, 35(3):253–278, 2003.
[299] E. W. Beals. Bray-curtis ordination - an effective strategy for analysis of multivariate ecological data. Advances in Ecological Research, 14:1–55, 1984.
[300] M. O. Hill. Diversity and evenness: A unifying notation and its consequences. Ecology, 54(2):427–432, 1973.
[301] M. I. Love, W. Huber, and S. Anders. Moderated estimation of fold change and dispersion for rna-seq data with deseq2. Genome Biology, 15(12), 2014.
[302] J. Jahnke, D. M. Mahlmann, P. Jacobs, and U. B. Priefer. The influence of growth conditions on the cell dry weight per unit biovolume of klebsormidium flaccidum (charophyta), a typical ubiquitous soil alga. Journal of Applied Phycology, 23(4):655–664, 2011.
[303] J. M. Tiedje, A. J. Sexstone, T. B. Parkin, N. P. Revsbech, and D. R. Shelton. Anaerobic processes in soil. Plant and Soil, 76(1-3):197–212, 1984.
[304] P. J. McMurdie and S. Holmes. phyloseq: An r package for reproducible interactive analysis and graphics of microbiome census data. PLoS One, 8(4), 2013.
[305] M. Bueche, T. Wunderlin, L. Roussel-Delif, et al. Quantification of endospore-forming firmicutes by quantitative pcr with the functional gene spo0a. Applied and Environmental Microbiology, 79(17):5302–5312, 2013.
[306] B. P. Hedlund, J. A. Dodsworth, J. K. Cole, and H. H. Panosyan. An integrated study reveals diverse methanogens, thaumarchaeota, and yet-uncultivated archaeal lineages in armenian hot springs. Antonie Van Leeuwenhoek International Journal of General and Molecular Microbiology, 104(1):71–82, 2013.
[307] H. W. Hu, L. M. Zhang, Y. Dai, H. J. Di, and J.
Z. He. ph-dependent distribution of soil ammonia oxidizers across a large geographical scale as revealed by high-throughput pyrosequencing. Journal of Soils and Sediments, 13(8):1439–1449, 2013.
[308] G. W. Nicol, L. A. Glover, and J. I. Prosser. Molecular analysis of methanogenic archaeal communities in managed and natural upland pasture soils. Global Change Biology, 9(10):1451–1457, 2003.
[309] W. Chang, Y. S. Um, B. Hoffman, and T. R. P. Holoman. Molecular characterization of polycyclic aromatic hydrocarbon (pah)-degrading methanogenic communities. Biotechnology Progress, 21(3):682–688, 2005.
[310] N. L. Capiro, M. L. B. Da Silva, B. P. Stafford, W. G. Rixey, and P. J. J. Alvarez. Microbial community response to a release of neat ethanol onto residual hydrocarbons in a pilot-scale aquifer tank. Environmental Microbiology, 10(9):2236–2244, 2008.
[311] L. O. Ingram. Ethanol tolerance in bacteria. Critical Reviews in Biotechnology, 9(4):305–319, 1990.
[312] Y. C. Wang, B. J. Hao, Q. Zhang, et al. Discovery of multiple igs haplotypes within genotypes of puccinia striiformis. Fungal Biology, 116(4):522–528, 2012.
[313] A. Sukenik, R. N. Kaplan-Levy, J. M. Welch, and A. F. Post. Massive multiplication of genome and ribosomes in dormant cells (akinetes) of aphanizomenon ovalisporum (cyanobacteria). ISME Journal, 6(3):670–679, 2012.
[314] J. B. Kruskal. Multidimensional-scaling by optimizing goodness of fit to a nonmetric hypothesis. Psychometrika, 29(1):1–27, 1964.
[315] J. B. Kruskal. Citation classic - multidimensional-scaling by optimizing goodness of fit to a nonmetric hypothesis.
Current Contents/Social & Behavioral Sciences, (39):S12–S12, 1979.
[316] P. Pearce. Factor-analysis - statistical-methods and practical issues - jaeon,k, muller,cw. Australian Journal of Psychology, 38(1):89–89, 1986.
[317] G. D. Garson. Factor analysis: Statistical methods and practical issues. Social Science Computer Review, 17(1):129–131, 1999.
[318] M. Norris and L. Lecavalier. Evaluating the use of exploratory factor analysis in developmental disability psychological research. Journal of Autism and Developmental Disorders, 40(1):8–20, 2010.
[319] S. Weiss, W. Van Treuren, C. Lozupone, et al. Correlation detection strategies in microbial data sets vary widely in sensitivity and precision. ISME Journal, 10(7):1669–1681, 2016.
[320] T. C. Brown. How much sequencing is needed for...?, 2012. Retrieved 2016-06-10.
[321] J. Pell, A. Hintze, R. Canino-Koning, et al. Scaling metagenome sequence assembly with probabilistic de bruijn graphs. Proceedings of the National Academy of Sciences of the United States of America, 109(33):13272–13277, 2012.
[322] A. C. Howe, J. K. Jansson, S. A. Malfatti, et al. Tackling soil diversity with the assembly of large, complex metagenomes (vol 111, pg 4904, 2014). Proceedings of the National Academy of Sciences of the United States of America, 111(16):6115–6115, 2014.
[323] E. A. Dinsdale, R. A. Edwards, D. Hall, et al. Functional metagenomic profiling of nine biomes. Nature, 452(7187):629–U8, 2008.
[324] F. Meyer, D. Paarmann, M. D’Souza, et al. The metagenomics rast server - a public resource for the automatic phylogenetic and functional analysis of metagenomes. BMC Bioinformatics, 9, 2008.
[325] M. N. Rho, H. X. Tang, and Y. Z. Ye.
Fraggenescan: predicting genes in short and error-prone reads. Nucleic Acids Research, 38(20), 2010.
[326] D. Hyatt, G. L. Chen, P. F. LoCascio, et al. Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics, 11, 2010.
[327] T. Magoc and S. L. Salzberg. Flash: fast length adjustment of short reads to improve genome assemblies. Bioinformatics, 27(21):2957–2963, 2011.
[328] Dongjae Kim, Aria S Hahn, Shang-Ju Wu, et al. FragGeneScan+: high-throughput short-read gene prediction. Computational Intelligence in Bioinformatics and Computational Biology, 2015 IEEE Conference on, pages 1–7, August 2015.
[329] S. D. Bentley and J. Parkhill. Comparative genomic structure of prokaryotes. Annual Review of Genetics, 38:771–792, 2004.
[330] K. D. Pruitt, T. Tatusova, and D. R. Maglott. Ncbi reference sequences (refseq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Research, 35:D61–D65, 2007.
[331] W. Z. Li and A. Godzik. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics, 22(13):1658–1659, 2006.
[332] S. M. Kielbasa, R. Wan, K. Sato, P. Horton, and M. C. Frith. Adaptive seeds tame genomic sequence comparison. Genome Research, 21(3):487–493, 2011.
[333] Doug Hyatt, Philip F LoCascio, Loren J Hauser, and Edward C Uberbacher. Gene and translation initiation site prediction in metagenomic sequences. Bioinformatics, 28(17):2223–2230, 2012.
[334] Todd Z DeSantis, Philip Hugenholtz, Niels Larsen, et al. Greengenes, a chimera-checked 16S rRNA gene database and workbench compatible with ARB. Appl. Environ. Microbiol., 72(7):5069–5072, 2006.
[335] Todd M Lowe and Sean R Eddy. tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Research, 25(5):0955–0964, 1997.
[336] P D Karp, M Krummenacker, S Paley, and J Wagg. Integrated pathway-genome databases and their role in drug discovery.
Trends in Biotechnology, 17(7):275–281, July 1999.
[337] Niels W Hanson, Kishori M Konwar, Alyse K Hawley, et al. Metabolic pathways for the whole community. BMC Genomics, 15:619, 2014.
[338] S. G. Tringe, C. von Mering, A. Kobayashi, et al. Comparative metagenomics of microbial communities. Science, 308(5721):554–557, 2005.
[339] R. Q. Li, H. M. Zhu, J. Ruan, et al. De novo assembly of human genomes with massively parallel short read sequencing. Genome Research, 20(2):265–272, 2010.
[340] R. L. Tatusov, M. Y. Galperin, D. A. Natale, and E. V. Koonin. The cog database: a tool for genome-scale analysis of protein functions and evolution. Nucleic Acids Research, 28(1):33–36, 2000.
[341] Tengfei Yin, Dianne Cook, and Michael Lawrence. ggbio: an R package for extending the grammar of graphics for genomic data. Genome Biol, 13(8):R77, 2012.
[342] S. Federhen. The ncbi taxonomy database. Nucleic Acids Research, 40(D1):D136–D143, 2012.
[343] N. W. Hanson, K. M. Konwar, and S. J. Hallam. LCA*: an entropy-based measure for taxonomic assignment within assembled metagenomes. Bioinformatics, 2016.
[344] M. D. Rausher. The evolution of genes in branched metabolic pathways. Evolution, 67(1):34–48, 2013.
[345] R. J. Hobbs, E. Higgs, and J. A. Harris. Novel ecosystems: implications for conservation and restoration. Trends in Ecology & Evolution, 24(11):599–605, 2009.
[346] M. A. Quail, M. Smith, P. Coupland, et al. A tale of three next generation sequencing platforms: comparison of ion torrent, pacific biosciences and illumina miseq sequencers. BMC Genomics, 13, 2012.
[347] R. T. Amos, K. U. Mayer, B. A. Bekins, G. N. Delin, and R. L. Williams.
Use of dissolved and vapor-phase gases to investigate methanogenic degradation of petroleum hydrocarbon contamination in the subsurface. Water Resources Research, 41(2), 2005.
[348] Tara Oceans Data. Retrieved 2016-06-10.
[349] J. A. Gilbert, J. K. Jansson, and R. Knight. The earth microbiome project: successes and aspirations. BMC Biology, 12, 2014.
[350] P. J. Turnbaugh, R. E. Ley, M. Hamady, et al. The human microbiome project. Nature, 449(7164):804–810, 2007.
[351] H. Horai, M. Arita, S. Kanaya, et al. Massbank: a public repository for sharing mass spectral data for life sciences. Journal of Mass Spectrometry, 45(7):703–714, 2010.
[352] Joint Genome Institute (JGI). Retrieved 2016-06-10.
[353] D. Blankenberg, G. Von Kuster, E. Bouvier, et al. Dissemination of scientific software with galaxy toolshed. Genome Biology, 15(2):403, 2014. doi:10.1186/gb4161.
[354] G. K. Sandve, A. Nekrutenko, J. Taylor, and E. Hovig. Ten simple rules for reproducible computational research. PLoS Computational Biology, 9(10), 2013.
[355] J. Ison, K. Rapacki, H. Menager, et al. Tools and data services registry: a community effort to document bioinformatics resources. Nucleic Acids Research, 44(D1):D38–47, 2016.
[356] K. M. Fisch, T. Meissner, L. Gioia, et al. Omics pipe: a community-based framework for reproducible multi-omics data analysis. Bioinformatics, 31(11):1724–1728, 2015.
[357] R. W. Holcomb. The long-term soil productivity study in british columbia.
Victoria, BC, Canada: Canadian Forest Service and British Ministry of Forests, 1996.
[358] J. P. Battigelli, J. R. Spence, D. W. Langor, and S. M. Berch. Short-term impact of forest soil compaction and organic matter removal on soil mesofauna density and oribatid mite diversity. Canadian Journal of Forest Research-Revue Canadienne De Recherche Forestiere, 34(5):1136–1149, 2004.
[359] S. R. Proulx, D. E. L. Promislow, and P. C. Phillips. Network thinking in ecology and evolution. Trends in Ecology & Evolution, 20(6):345–353, 2005.
[360] V. Latora and M. Marchiori. Is the boston subway a small-world network? Physica a-Statistical Mechanics and Its Applications, 314(1-4):109–113, 2002.
[361] J. C. S. Cardoso, C. Baquero, and P. S. Almeida. Probabilistic estimation of network size and diameter. Ladc: 2009 4th Latin-American Symposium on Dependable Computing, pages 33–40, 2009.
[362] J. A. Dunne, R. J. Williams, and N. D. Martinez. Food-web structure and network theory: The role of connectance and size. Proceedings of the National Academy of Sciences of the United States of America, 99(20):12917–12922, 2002.
[363] A. Clauset, C. R. Shalizi, and M. E. J. Newman. Power-law distributions in empirical data. Siam Review, 51(4):661–703, 2009.
[364] M. Jackson, D. C. Crick, and P. J. Brennan. Phosphatidylinositol is an essential phospholipid of mycobacteria. Journal of Biological Chemistry, 275(39):30092–30099, 2000.
[365] K. H. Caffall and D. Mohnen. The structure, function, and biosynthesis of plant cell wall pectic polysaccharides. Carbohydrate Research, 344(14):1879–1900, 2009.
[366] D. Grundemann, S.
Harlfinger, S. Golz, et al. Discovery of the ergothioneine transporter. Proceedings of the National Academy of Sciences of the United States of America, 102(14):5256–5261, 2005.
[367] K. Jungermann, R. K. Thauer, and K. Decker. The synthesis of one-carbon units from co2 in clostridium kluyveri. European journal of biochemistry / FEBS, 3(3):351–9, 1968.
[368] E. Blume, M. Bischoff, J. M. Reichert, et al. Surface and subsurface microbial biomass, community structure and metabolic activity as a function of soil depth and season. Applied Soil Ecology, 20(3):171–181, 2002.
[369] M. I. Makarov, L. Haumaier, and W. Zech. Nature of soil organic phosphorus: an assessment of peak assignments in the diester region of p-31 nmr spectra. Soil Biology & Biochemistry, 34(10):1467–1477, 2002.
[370] J. Stratford, A. E. X. O. Dias, and C. J. Knowles. The utilization of thiocyanate as a nitrogen-source by a heterotrophic bacterium - the degradative pathway involves formation of ammonia and tetrathionate. Microbiology-Uk, 140:2657–2662, 1994.
[371] Y. C. Sung, D. Parsell, P. M. Anderson, and J. A. Fuchs. Identification, mapping, and cloning of the gene encoding cyanase in escherichia-coli k-12. Journal of Bacteriology, 169(6):2639–2642, 1987.
[372] C. L. Yu, T. M. Louie, R. Summers, et al. Two distinct pathways for metabolism of theophylline and caffeine are coexpressed in pseudomonas putida cbb5. Journal of Bacteriology, 191(14):4624–4632, 2009.
[373] C. Burstein and A. Kepes. Alpha-galactosidase from escherichia-coli k12. Biochimica Et Biophysica Acta, 230(1):52–&, 1971.
[374] S.
Fetzner and F. Lingens. Bacterial dehalogenases - biochemistry, genetics, and biotechnological applications. Microbiological Reviews, 58(4):641–685, 1994.
[375] H. Laue and A. M. Cook. Biochemical and molecular characterization of taurine:pyruvate aminotransferase from the anaerobe bilophila wadsworthia. European Journal of Biochemistry, 267(23):6841–6848, 2000.
[376] J. Vonck, N. Arfman, G. E. Devries, et al. Electron-microscopic analysis and biochemical-characterization of a novel methanol dehydrogenase from the thermotolerant bacillus-sp c1. Journal of Biological Chemistry, 266(6):3949–3954, 1991.
[377] S. A. Fischinger and J. Schulze. The importance of nodule co2 fixation for the efficiency of symbiotic nitrogen fixation in pea at vegetative growth and during pod formation. Journal of Experimental Botany, 61(9):2281–2291, 2010.
[378] C. Rinke, P. Schwientek, A. Sczyrba, et al. Insights into the phylogeny and coding potential of microbial dark matter. Nature, 499(7459):431–7, 2013. doi:10.1038/nature12352.
[379] Y. Huang, P. Gilna, and W. Z. Li. Identification of ribosomal rna genes in metagenomic fragments. Bioinformatics, 25(10):1338–1340, 2009.
[380] M. Krzic, C. E. Bulmer, F. Teste, L. Dampier, and S. Rahman. Soil properties influencing compactability of forest soils in british columbia.
Canadian Journal of Soil Science, 84(2):219–226, 2004.
[381] S. F. Altschul, W. Gish, W. Miller, E. W. Myers, and D. J. Lipman. Basic local alignment search tool. Journal of Molecular Biology, 215(3):403–410, 1990.
[382] J. M. Shively, R. S. English, S. H. Baker, and G. C. Cannon. Carbon cycling: the prokaryotic contribution. Current Opinion in Microbiology, 4(3):301–306, 2001.
[383] E. Smith and H. J. Morowitz. Universality in intermediary metabolism. Proceedings of the National Academy of Sciences of the United States of America, 101(36):13168–13173, 2004.
[384] I. A. Berg, D. Kockelkorn, W. Buckel, and G. Fuchs. A 3-hydroxypropionate/4-hydroxybutyrate autotrophic carbon dioxide assimilation pathway in archaea. Science, 318(5857):1782–1786, 2007.
[385] L. K. Massey, J. R. Sokatch, and R. S. Conrad. Branched-chain amino-acid catabolism in bacteria. Bacteriological Reviews, 40(1):42–54, 1976.
[386] S. W. Ayer, A. G. Mcinnes, P. Thibault, et al. Jadomycin, a novel 8h-benz[b]oxazolo[3,2-f]phenanthridine antibiotic from streptomyces-venezuelae isp5230. Tetrahedron Letters, 32(44):6301–6304, 1991.
[387] D. Mohnen, R. L. Doong, K. Liljebjelke, G. Fralish, and J. Chan. Pectins and pectinases. Elsevier Science B.V., Amsterdam, 2003.
[388] R. F. Spalding, M. A. Toso, M. E. Exner, et al. Long-term groundwater monitoring results at large, sudden denatured ethanol releases. Ground Water Monitoring and Remediation, 31(3):69–81, 2011.
[389] A. Engelbrektson, V. Kunin, K. C. Wrighton, et al. Experimental factors affecting pcr-based estimates of microbial species richness and evenness.
ISME Journal, 4(5):642–647, 2010.
[390] V. Kunin, A. Engelbrektson, H. Ochman, and P. Hugenholtz. Wrinkles in the rare biosphere: pyrosequencing errors can lead to artificial inflation of diversity estimates. Environmental Microbiology, 12(1):118–123, 2010.

Appendix A

Chapter 2: Supplemental Material

A.1 Supplemental methods

A.1.1 Site description

LFH and mineral horizons were distinguished on the basis of visible properties and chemistry, using the criteria established by the Canadian System of Soil Classification (The Soil Classification Working Group, 1998 [16]). Approximately 500 g of soil was collected from each soil horizon per replicate sample, frozen on site and stored at -80 °C. Soil properties for each sample, including pH, bulk carbon (C), nitrogen (N), total organic N, ammonium (NH4) and nitrate (NO3), as well as sample processing, have been previously published in Hartmann et al. [20] (Table A.1).

This study was conducted in a boreal forest at the Skulow Lake Long-Term Soil Productivity (LTSP) site (SBS-3 WL) (52°20′N, 121°55′W) near Williams Lake in British Columbia, Canada [20, 27]. The SBS-3 site was established in 1994 as part of the LTSP program, and details of plot establishment and treatments have been previously published [357]. The LTSP program is one of the world’s largest coordinated research networks that addresses basic and applied scientific questions related to forest management across North American ecozones. Previous work at LTSP sites has focused on establishing comprehensive inventories of the microbial communities in reference and disturbed soil horizons experiencing varying levels of harvesting, organic matter removal and compaction [27, 49, 65, 66, 216].
Sample collection focused on a 40 × 70 m LTSP OM3C2 or harvested treatment (H) that experienced tree trunk and crown harvesting with complete forest floor removal and severe soil compaction (approaching 80% of the difference between theoretical growth-limiting bulk density and bulk density measured at 10-20 cm before treatment establishment). Following harvesting, lodgepole pine seedlings (Pinus contorta Dougl.) were planted in 1995 at this site. An adjacent natural unmanaged stand (N) that experienced natural disturbance in 2005 (i.e., mountain pine beetle infestation) was used as a comparison treatment. Although not part of the original LTSP treatment design, N has been classified as OM0C0 in previous studies [358]. Soil horizons including organic and mineral horizons (Ahe, Ae, AB and Bt) were sampled from H and N treatments in triplicate, resulting in 26 samples supporting downstream network analyses and metabolic profiling.

The site is located at an elevation of 1050 m in the Sub-Boreal Spruce climatic zone, where Orthic Gray Luvisols (Haplocryalfs according to the USDA soil classification system) are the most prevalent soils. The region has a mean annual temperature of 4.1 °C and receives an average of 455 mm of precipitation yearly. Prior to the establishment of research plots, lodgepole pine (Pinus contorta Dougl.) was the dominant tree species, but smaller populations of hybrid spruce (Picea glauca × engelmannii), Douglas-fir (Pseudotsuga menziesii (Mirb.) Franco), aspen (Populus tremuloides Michx.) and cottonwood (Populus balsamifera L.) were also present.

A.1.2 Genomic DNA isolation, sequencing and processing

For pyrotag sequencing, genomic DNA was isolated from 26 individual samples (including triplicate samples from two soil plots; Fig. A.1) using the PowerMax® soil DNA isolation kit (MoBio Laboratories) following the manufacturer’s instructions. PCR reactions for V6-V8 pyrotag amplicons were performed in triplicate as described in Appendix S1.
Resulting amplicon pools were purified using a QIAquick PCR Purification Kit® (Qiagen) and quantified using Picogreen (Invitrogen). A total of 10 µl of purified PCR product at 10 ng/µl per sample was sent to the McGill University and Génome Québec Innovation Centre for 454 pyrosequencing. DNA for shotgun metagenomes was extracted from all 26 samples using the FastDNA™ SPIN Kit for Soil (MPBio) according to the manufacturer’s instructions and sent to Canada’s Michael Smith Genome Sciences Centre in Vancouver, BC for sequencing on the Illumina HiSeq-10 platform (paired-end 100 bp reads). Three samples were multiplexed per lane. See Table A.2 and Table A.3 for a breakdown of resulting pyrotag and metagenomic data for each sample interval.

Genomic DNA extracts were stored at -80 °C in 10 mM Tris-Cl pH 8.5 prior to PCR amplification. PCR reactions for V6-V8 pyrotag amplicons were run under the following conditions: initial denaturation at 95 °C for 3 minutes, followed by 25 cycles of 95 °C for 30 seconds, primer annealing at 55 °C for 45 seconds and extension at 72 °C for 90 seconds, followed by a final extension of 10 minutes at 72 °C. Each 25 µl reaction contained 10 ng of genomic DNA template, 200 nM each of forward primer (5’-cc tat ccc ctg tgt gcc ttg gca gtc tca gaa act yaa akg aat tgr cgg-3’) and reverse primer (5’-cgt atc gcc tcc ctc gcg cca tca gac ggg cgg tgt gtRc-3’) and barcode (Table 2.1), 12.5 µl of 2X PCR Master Mix (Fermentas, 0.05 U/µl Taq DNA Polymerase, 1X reaction buffer, 4 mM MgCl2, 0.4 mM of each dNTP) and 6.5 µl of nuclease-free water.

A.1.3 Community diversity

Chao1 rarefaction curves [191] were calculated using Qiime [116]. Only 653 pyrotags (compared to >1,000 pyrotags for all other samples) were recovered for sample N-Ae-3 and thus a rarefaction curve was not calculated (Fig. A.2); this sample was subsequently removed from downstream analysis.
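The Chao1 richness estimator behind these curves, together with the Shannon index applied to the remaining samples, can be illustrated with a minimal Python sketch. This is not the Qiime or Vegan implementation: the function names and the toy counts are hypothetical, the bias-corrected form of Chao1 is assumed, and the natural logarithm is assumed for the Shannon index.

```python
import math


def chao1(counts):
    """Bias-corrected Chao1 richness estimate from per-OTU read counts.

    Assumption: the bias-corrected form S_obs + F1*(F1-1) / (2*(F2+1)),
    where F1 and F2 are the numbers of singleton and doubleton OTUs.
    """
    observed = [c for c in counts if c > 0]
    singletons = sum(1 for c in observed if c == 1)
    doubletons = sum(1 for c in observed if c == 2)
    return len(observed) + singletons * (singletons - 1) / (2 * (doubletons + 1))


def shannon(counts):
    """Shannon diversity H' = -sum(p_i * ln p_i) over observed OTUs."""
    observed = [c for c in counts if c > 0]
    total = sum(observed)
    return -sum((c / total) * math.log(c / total) for c in observed)


# Hypothetical toy sample (not thesis data): read counts for eight OTUs.
toy_counts = [10, 5, 2, 2, 1, 1, 1, 0]
print("Chao1:", chao1(toy_counts))
print("Shannon:", round(shannon(toy_counts), 3))
```

Because Chao1 depends only on the singleton and doubleton counts, it is sensitive to sequencing depth, which is consistent with excluding a sample that yielded far fewer pyrotags than the others.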
The Chao1 species estimator and Shannon’s diversity index [192] were also calculated for the remaining samples in R using the Vegan package.

A.1.4 Statistical differences among and between soil horizons

Data were checked for normality using the Shapiro–Wilk test [165]. To assess differences among and between soil horizons and treatments in both the soil chemical data (Table 2.2) and the microbial abundance data generated in Qiime [116], we performed non-parametric permutation-based multivariate analyses of variance (NPMANOVA), iterated 1000 times in Matlab with the Fathom toolbox (α = 0.05). Pairwise differences in soil chemistry and community structure were then assessed using a posteriori multiple comparison tests, also with the Fathom toolbox. Hierarchical cluster analysis was then used to assign each sample to a given group based on the relative abundance data from Qiime, using both the pvclust and hclust packages in R. A Bray–Curtis distance matrix was used to determine dissimilarity. P-values for each branch in the dendrogram were determined through multi-scale bootstrap resampling. Non-parametric Kruskal-Wallis tests were used to determine if there were differences in diversity between clusters, and one-tailed Wilcoxon’s signed-rank tests were used to determine if there were differences in the abundance of major phyla between Clusters 1 and 2 and Clusters 4 and 5. Finally, we controlled for Type I errors by calculating the maximum p-value cut-off for a false discovery rate of 0.05 using the q-value package in R.
We employed the bootstrapping method and default settings, with the exception of comparisons for the soil chemical data (Table 2.2) and the phylum-level differences between Clusters 1 and 2 (Table 2.2), for which the upper bound of lambda sequences was lowered to 0.7 and 0.4 respectively, due to the narrower range in p-values.

Differences in taxonomy at the class, order and genus level were not statistically tested owing to the large number of possible tests (> 2,500 at the class level) and hence the necessarily stringent control of alpha. Taxa representing >0.1% in relative abundance (arbitrarily defined here as abundant taxa) were annotated at the lowest taxonomic level available (e.g., class, order or genus) while all other taxa (representing < 0.01% in relative abundance) were binned into ‘other’ categories. Differences between the resulting taxonomic groups (Fig. A.3) were assessed by comparing the means and standard deviations of OTU abundances among and between clusters.

A.1.5 Indicator species analysis

Indicator species analysis (ISA) was conducted using the ‘multipatt’ function available in the ‘indicspecies’ package in R with default settings and 1000 iterations. This package identifies indicator OTUs between predefined groups as well as indicator OTUs for multiple groups (groups of groups). Sample groups were defined based on clusters identified in the hierarchical cluster analysis: harvested surface horizons (n = 4), unmanaged surface horizons (n = 4), mixed horizons from both the harvested and unmanaged soil profiles (n = 5), mid subsurface horizons from both the harvested and unmanaged soil profiles (n = 7), and lower subsurface horizons from both the harvested (H) and unmanaged natural (N) soil profiles (n = 5).
Taxonomic identity of indicators was explored using heat map and hierarchical cluster analysis (Manhattan distance and complete method) with the “pvclust” package in R.

A.1.6 Co-occurrence network construction and analysis

Only OTUs present in over 25% of samples (6 samples of 25) were included in network analyses (2,614 OTUs). Operational taxonomic units failing to meet this requirement were summed and included as a single OTU within the analysis to maintain data structure [41]. Spearman correlation and Kullback-Leibler dissimilarity (robust to compositionality) were both calculated between all pairs of OTUs using thresholds of |0.8|, and >18 and <1, respectively. Statistical significance was assigned by computing 1,000 edge- and measure-specific permutations and bootstrap score distributions. Compositional similarities associated with Spearman correlation [178] were addressed by renormalizing the data and calculating a null distribution for each of the 1,000 permutations. p-values were calculated by z-scoring the permuted null and bootstrap confidence intervals using pooled variance, removing all edges outside of the 95% confidence interval [41]. Finally, false discovery rate q-values were computed using the Benjamini-Hochberg-Yekutieli method and edges with q-values above 0.05 were removed.
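The final filtering step can be sketched as follows. This is an illustrative Python sketch, not the network software used in the thesis: it assumes that "Benjamini-Hochberg-Yekutieli" refers to the Benjamini-Yekutieli step-up adjustment (valid under arbitrary dependence among tests), and the function name, edge labels and p-values are hypothetical.

```python
def by_qvalues(pvalues):
    """Benjamini-Yekutieli adjusted p-values (q-values).

    Assumption: q_(i) = min over j >= i of p_(j) * m * c(m) / j, capped
    at 1, where c(m) = sum_{k=1..m} 1/k and p_(1) <= ... <= p_(m).
    """
    m = len(pvalues)
    if m == 0:
        return []
    c_m = sum(1.0 / k for k in range(1, m + 1))
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so each q-value is the minimum
    # over all ranks at or above its own (enforces monotonicity).
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m * c_m / rank)
        qvalues[idx] = running_min
    return qvalues


# Hypothetical edges between OTU pairs with made-up p-values.
edges = {
    ("otu_a", "otu_b"): 0.001,
    ("otu_a", "otu_c"): 0.02,
    ("otu_b", "otu_d"): 0.03,
    ("otu_c", "otu_d"): 0.5,
}
q = by_qvalues(list(edges.values()))
kept = [edge for edge, qv in zip(edges, q) if qv <= 0.05]
print(kept)
```

In this toy case only the first edge survives the 0.05 cut-off; the harmonic-sum factor c(m) makes this adjustment more conservative than the plain Benjamini-Hochberg procedure.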
Basic network properties such as average path length, clustering coefficient, network diameter, and degree distribution were calculated using NetworkAnalyzer in the software package Cytoscape [161].

A.1.7 Taxonomic distinctness

Taxonomic distinctness does not require a phylogenetic tree produced from a multiple sequence alignment, as required by UniFrac distance [114], but can instead use the publicly available NCBI taxonomic hierarchy [342].

A.1.8 Univariate analysis

Univariate differences in the relative abundance of metabolic potential between soil horizons and treatments were identified using edgeR after removing pathways not present in at least 25% of the samples being compared (reducing the occurrence of type II errors) [229]. All statistical tests used α = 0.05.

A.2 Supplemental results

A.2.1 Soil characteristics

The unmanaged and disturbed LFH horizons differed in pH (4.97±0.71 and 6.03±0.16 respectively), total carbon (C) (16.86±7.31% and 4.36±1.85% respectively), total nitrogen (N) (0.42±0.10% and 0.16±0.06% respectively), C:N (38.40±26.75 and 26.66±4.45 respectively) and total organic N (69.26±11.55 ppm and 33.87±12.38 ppm respectively). With the exception of the LFH and Ahe horizons, all soil horizons differed significantly from the adjacent horizon when H and N treatments were analyzed together. Based on these results, a defined environmental gradient could be discerned with the potential to influence microbial community structure and function in H and N treatments.

A.2.2 Rarefaction

Rarefaction curves were generated to determine sample coverage. Samples exhibited similar slopes at the 97% identity threshold approaching 5,000 unique OTUs (Fig. A.2). At this level of genetic distance, the curves predict that additional sampling will lead to incremental increases in estimates of total diversity consistent with near saturation coverage.
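How a rarefaction curve flattens as coverage approaches saturation can be illustrated with a short Python sketch. It uses the analytic expected-richness formula (Hurlbert's rarefaction) instead of the repeated subsampling and Chao1 estimation performed in Qiime, so it is a stand-in under that stated assumption; the function name and the toy counts are hypothetical.

```python
from math import comb


def rarefied_richness(counts, depth):
    """Expected number of OTUs observed in a random subsample of `depth` reads.

    Analytic rarefaction: E[S_n] = sum_i (1 - C(N - N_i, n) / C(N, n)),
    where N is the total read count and N_i the count of OTU i.
    """
    total = sum(counts)
    if not 0 <= depth <= total:
        raise ValueError("depth must lie between 0 and the total read count")
    denominator = comb(total, depth)
    # comb(a, b) is 0 when b > a, so an OTU is certain to be seen once the
    # subsample is larger than the number of reads from all other OTUs.
    return sum(1 - comb(total - c, depth) / denominator for c in counts if c > 0)


# Hypothetical toy community (not thesis data): a few dominant OTUs, many rare.
toy_counts = [50, 30, 10, 5, 2, 1, 1, 1]
curve = [rarefied_richness(toy_counts, n) for n in range(0, sum(toy_counts) + 1, 10)]
print([round(s, 2) for s in curve])
```

A curve whose increments shrink toward zero at the largest depths indicates that further sequencing would add few new OTUs, which is the pattern described as near saturation above.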
A total of 47 phyla including 20 candidate divisions were identified across all samples (Table A.4).

A.2.3 Microbial community structure

When samples from both H and N soil profiles were analyzed together using NPMANOVA, significant differences in microbial community composition were found between the LFH horizon and the AB horizon (a mineral horizon near the middle of the soil column), and the AB and the Bt horizon (a mineral horizon at the bottom of the soil column) (Fig. A.1).

Overall, 13,327 Bacterial OTUs (94.08%), 67 Archaeal OTUs (0.48%) and 741 Eukaryotic OTUs (5.27%) were identified with an additional 24 OTUs (443 reads or 0.17% of the data set) having no BLAST hit. These sequences may be sequencing or amplification errors or represent bacteria not currently in the database. Taxonomic abundances at the order level between H and N surface horizon clusters (1 and 2) differed most within the Gammaproteobacteria order Xanthomonadales (0.56±0.13% and 7.18±2.87% respectively) and the Alphaproteobacteria orders Rhodospirillales (2.27±0.43% and 4.78±0.76% respectively) and Rhizobiales (9.19±2.09 and 14.74±2.36% respectively).

Community composition within the mixed subsurface horizons cluster (3) was most similar to the mid subsurface horizon cluster (4). These clusters shared 3,378 OTUs or 58.32% and 57.61% of total OTUs, respectively (Fig. 2.1). Overall, differences between cluster 3 and clusters 4 and 5 were most pronounced within Acidobacteria and Actinobacteria (Fig. A.3). This suggests the differences identified in chemical characteristics between the AB and Bt horizons do indeed limit the growth of some microbial species.

The greatest differences between mid subsurface horizons (cluster 4) and low subsurface horizons (cluster 5) were in the Actinobacteria order MB-A2-108 (0.84±0.28% and 3.75±0.49% respectively) and ‘other’ Actinobacteria (1.84±0.88% and 6.29±2.18% respectively).
The term ‘other’ refers to orders representing less than 0.05% of the community or for which more detailed annotations were unavailable.

Both H and N surface clusters (1 and 2) differed from subsurface clusters (3, 4 and 5) in 24 and 16 taxa respectively. Specifically, cluster 1 (H) differed from subsurface clusters most in the Bacteroidetes order Sphingobacteriia (15.94±5.03% and 5.50±1.97% respectively), the phyla Archaeplastida (6.42±5.04% and 0.39±0.28% respectively) and the Alphaproteobacteria Sphingomonadales (2.25±1.16 and 0.22±0.12% respectively), for which cluster 1 (H) was comparatively enriched, and the Deltaproteobacteria GR-WP33-30 (0.12±0.08 and 2.12±0.58% respectively), the Acidobacteria DA023 (1.84±1.19% and 5.50±1.68% respectively), and Nitrospirae (0.02±0.01 and 1.18±0.59% respectively) for which the subsurface clusters (3, 4, and 5) were comparatively enriched (Fig. 2.1). Cluster 2, the unmanaged surface horizons, differed from the subsurface clusters most in the Gammaproteobacteria Xanthomonadales (7.1±0.13% and 1.45±1.22% respectively) and the Alphaproteobacteria Rhodospirillales (4.78±0.76% and 2.28±0.92% respectively) and Rhizobiales (14.75±2.36% and 8.78±2.81% respectively) for which cluster 2 was comparatively enriched and the Acidobacteria RB41 (0.28±0.25% and 5.16±2.75% respectively), the ‘other’ Acidobacteria group (4.70±0.67% and 8.12±1.68% respectively) and the ‘other’ Chloroflexi group (0.81±0.69 and 4.48±2.24% respectively) for which the subsurface horizons were comparatively enriched.
Compositional differences between surface horizons comparatively rich in organic matter and the underlying mineral soil, which differ greatly in available resources and physical structure, have been well documented.

A.2.4 Hierarchical cluster analysis (HCA)

To determine compositional differences between clusters 1 and 2 (the two surface clusters) and clusters 4 and 5 (mid and lower soil horizons) we performed Wilcoxon rank sum tests for the 15 most abundant taxonomic groups (representing 97.1±1.22% of the total microbial community in each sample). Proteobacteria dominated all five clusters (Fig. 2.1). With the exception of the harvested surface horizons cluster, Acidobacteria were the second most abundant group. The relative abundance of Bacteroidetes, Planctomycetes, Armatimonadetes, and Cyanobacteria tended to decrease with depth, while Chloroflexi increased with depth (Fig. 2.1). Phylum level differences between clusters 1 (H) and 2 (N) were significant for Proteobacteria, Chloroflexi, Gemmatimonadetes, Archaeplastida, Armatimonadetes and the SAR (Stramenopile, Alveolates and Rhizaria) super group. These clusters had 1,875 OTUs in common, representing 41.32% of OTUs in cluster 1 (H), and 31.73% of OTUs in cluster 2 (N). Clusters 4 and 5 differed significantly for Planctomycetes and Gemmatimonadetes and shared 3,260 OTUs representing 43.16% and 71.43% of total OTUs respectively. Cluster abundance at lower taxonomic levels (order, genus, and class) can be found in Fig. A.3.

A.2.5 Indicator species analysis

A total of 903 indicator OTUs were found for cluster 1, 516 for cluster 2, 124 for cluster 3, 61 for cluster 4, and 256 for cluster 5. Additionally, there were 81 indicator OTUs for the two surface clusters (1 and 2) and 342 indicator OTUs for two or more of the subsurface clusters (3, 4, and 5). These ‘group’ indicators allow for the identification of OTUs broadly adapted to LFH (organic) or mineral soils.
A full list of indicator OTUs and their taxonomic affiliation can be found in Table A.5.

To better interpret the potential ecological roles of indicator OTUs, the taxonomic identity of significant indicators was determined. At the phylum level, the taxonomic identity of indicators from clusters 1, 2, 3 and the two surface clusters combined (1 and 2) were similar to each other while clusters 4 and 5 and combinations of subsurface clusters (3 and 4, 3 and 5, 4 and 5, and 3, 4, and 5) were also similar to one another. Cluster 3 (mixed subsurface) contained samples from the interface between organic and mineral horizons (Ahe and Ae). This may explain why indicators within this horizon were more similar to the surface than deeper soil. All clusters were dominated by indicator OTUs from Acidobacteria and Actinobacteria, while clusters 1, 2, 3 and the two surface clusters combined (1 and 2) also had a large percentage of Bacteroidetes and Alphaproteobacteria indicator OTUs. Similarly, Bacteroidetes and Alphaproteobacteria were also found in higher relative abundance in the surface clusters (Fig. 2.1). The deeper clusters had a greater proportion of indicator OTUs from Deltaproteobacteria and Chloroflexi.

A.2.6 Network description

Consistent with compositional profiling, 95.21% of network nodes were Bacteria (1,790 OTUs), 0.69% Archaea (13 OTUs) and 4.10% Eukaryotes (77 OTUs). The network comprised 41 connected components (1 containing 1,791 nodes, 40 containing 2-3 nodes). Average node degree (mean edges per node) [359] was 14.5, the average path length on the largest connected component (the expected distance between two connected nodes) was 3, and the network diameter (the longest shortest path between any two nodes) [360][361] was equal to 9. The clustering coefficient, which describes the connectedness of a node's neighbors [359], was 0.3 and connectance, which describes the proportion of all possible links realized [362], was 0.01 (Table 2.1).
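The summary statistics reported above can be related to one another with a short Python sketch. It assumes an undirected network without self-loops, given as a list of edges, and it scores nodes with fewer than two neighbours as having a clustering coefficient of zero. The function name and the returned keys are hypothetical, and this is not the NetworkAnalyzer implementation.

```python
from itertools import combinations
from typing import Dict, Hashable, Iterable, Set, Tuple


def network_summary(edges: Iterable[Tuple[Hashable, Hashable]]) -> Dict[str, float]:
    """Mean degree, connectance and mean local clustering of an edge list."""
    neighbors: Dict[Hashable, Set[Hashable]] = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)

    n_nodes = len(neighbors)
    n_edges = sum(len(nb) for nb in neighbors.values()) // 2

    local_clustering = []
    for nb in neighbors.values():
        k = len(nb)
        if k < 2:
            local_clustering.append(0.0)  # assumption: undefined values count as 0
            continue
        # Count the pairs of neighbours that are themselves connected.
        links = sum(1 for a, b in combinations(nb, 2) if b in neighbors[a])
        local_clustering.append(2 * links / (k * (k - 1)))

    return {
        "mean_degree": 2 * n_edges / n_nodes,
        # Realized links divided by all possible undirected links.
        "connectance": 2 * n_edges / (n_nodes * (n_nodes - 1)),
        "clustering": sum(local_clustering) / n_nodes,
    }
```

Under these definitions connectance equals the mean degree divided by (number of nodes - 1). With the 1,880 nodes listed above (1,790 + 13 + 77) and a mean degree of 14.5, this gives roughly 14.5 / 1,879 ≈ 0.008, which is in line with the reported value of 0.01 after rounding.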
Typical of many biological networks, the degree distribution was a stretched exponential distribution, also known as a power law with exponential tail [363].

A.2.7 Hive plots

Hive plots [162] were used to visualize positive and negative edges separately (Fig. 2.2). Node axis assignment and positioning rules were chosen to reflect the number of potential interactions and the location within the soil profile of each OTU as the HCA analyses revealed community partitioning with depth (Fig. 2.1). As such, nodes were placed on one of three axes according to degree class and positioned along each axis according to their average weighted depth. Nodes with a single edge appear on one axis, nodes with two to fifteen edges appear on a second axis, and nodes with more than fifteen edges appear on a third. Nodes closer to the center of the hive plot represent OTUs most abundant near the surface of the soil, and nodes located towards the outer perimeter of the hive plot represent OTUs most prevalent in deeper soil horizons. Negative edges occurred predominantly between nodes that were abundant in different soil horizons. Positive edges occurred predominantly between nodes that were abundant in similar horizons (Fig. 2.2).

A.2.8 Taxonomic composition of network modules

Module B (N LFH horizons) had approximately two times more Proteobacteria than module A (H LFH horizons) (Fig. A.4). Correspondingly, significantly more Proteobacteria were found in Cluster 2 (N LFH horizons) compared to Cluster 1 (H LFH horizons) (Fig. 2.1). Module A (H LFH horizons) contained two times more Chloroflexi than module B (N LFH horizons) and, while module A (H LFH horizons) contained OTUs from the Gemmatimonadetes, Archaeplastida, Armatimonadetes and the SAR (Stramenopile, Alveolates and Rhizaria) phyla, module B (N LFH horizons) did not (Fig. A.4). Similarly, these phyla were all significantly more abundant within Cluster 1 (H LFH horizons) compared to Cluster 2 (N LFH horizons) (Fig. 2.1).
Taken together, indicator OTUs, average depth of each OTU, and taxonomic composition of the network module suggest network modules accurately reflect HCA clusters and the broader compositional profiling.

A.2.9 Metagenomic pathway prediction

A total of 1,254 pathways, involved in biosynthesis (617), degradation (516), detoxification (15), energy-metabolism (98), metabolic clusters (5), and activation/inactivation/interconversion (3) were identified (Fig. 2.3). On average we identified 40,703±16,816 hits in the MetaCyc database [156], 250,797±109,156 KEGG orthologs (KOs) [157], 126,467±51,352 clusters of orthologous groups (COGs) [158], 11,688±5,350 carbohydrate-active enzymes [159] and 285,063±123,248 hits in the RefSeq database [160]. Using these annotations, the PathoLogic algorithm [145] was employed to generate ePGDBs.

A.2.10 Differences in the relative abundance of pathways

In the H LFH horizons we identified 12 pathways for carbohydrate (e.g., dTDP-L-mycarose and dTDP-L-olivose) and secondary metabolite (e.g., myo-inositol) biosynthesis that were more prevalent than in the N LFH horizons (Fig. A.3 and Fig. A.7). dTDP-D-forosamine biosynthesis produces building blocks for naturally occurring insecticides (red). Six myo-inositol biosynthesis pathway variants were also more prevalent in the H LFH horizons (Fig. A.3 and Fig. A.7). Myo-inositol is known as an important component of phospholipid membranes [364], and as a signaling molecule in the regulation of mRNA export and other cellular functions [188].

Compared to the mineral horizons, LFH horizons had a higher relative abundance of carbohydrate degradation pathways that target substrates commonly found in soil organic matter, such as homogalacturonan, a compound which makes up 60% of the pectin found in plant cell walls [365], xyloglucan, a component of hemicellulose in type 1 cell walls [260] and L-arabinose, a hydrolysis product of plant hemicellulose [245].
Biosynthesis pathways more prevalent in the LFH horizons included several pathway variants for the biosynthesis of lysine, an essential amino acid; docosahexanoate, a glycerol lipid found in bacterial cell membranes; and ergothioneine, an amino acid needed by non-fungal eukaryotes such as plants but produced only by prokaryotes and fungi [366]. The mineral horizons had a higher relative abundance of pyruvate fermentation, dissimilatory nitrate reduction, carbon tetrachloride degradation and oxalate degradation pathways. In addition, mineral horizons had a higher relative abundance of pathways related to assimilatory carbon fixation. Specifically, the mineral horizons had a higher relative abundance of the reductive monocarboxylic acid cycle, which produces one-carbon units that can be anabolically assimilated by some bacteria [367]. Furthermore, the mineral horizons had a higher relative abundance of a formate oxidation pathway capable of producing CO2, the necessary input for the monocarboxylic acid cycle.

A.2.11 Taxonomic relationships between co-occurrence and metabolic networks

We determined the phylogenetic relationship between network modules and ePGDBs using the lowest common ancestor (LCA) [144] annotations provided by MetaPathways 2.5 [139, 141] for each ORF to calculate pathway level taxonomic distinctness (D*) (Fig. 2.4). For example, aerobic respiration I (cytochrome c) (classified in MetaCyc as PWY-3781, energy metabolism, respiration, aerobic respiration) contains 4 reactions, the LCA annotations of which vary in all 26 samples (Fig. A.10). Using a 2-tailed t-test, we found that, analogous to the network analysis, taxonomic distinctness was significantly higher in the H LFH horizon ePGDBs than the N LFH horizon ePGDBs (p = 0.02). While taxonomic distinctness was lowest within the mid (Ae and AB) mineral horizon ePGDBs, it increased in the deep (Bt) mineral horizon ePGDBs (Fig.
2.4).

A.3 Supplemental discussion

A.3.1 Perturbation effects on soil microbial community structure

All samples contained OTUs affiliated with 47-67 phyla but were dominated by Proteobacteria, Acidobacteria and Actinobacteria (Fig. 2.1, Fig. A.3, and Fig. A.4). Moreover, microbial communities inhabiting mineral horizons were distinct from the communities inhabiting LFH horizons (Fig. 2.1 and Table 2.1), supporting previous observations [19, 26, 53, 368].

A.3.2 Impaired organic matter degradation and nutrient cycling potential in harvested LFH horizon

In addition to carbohydrate and aromatic compound degradation pathways, 37 degradation pathways including adenosine, guanosine, pyrimidine, thiocyanate, methanol, and taurine degradation were also found in lower relative abundance in the H LFH horizons. Nucleic acids (e.g., purines adenosine, guanosine, and pyrimidines thymine and cytosine) constitute over 10% of the total soil organic phosphorus pool, representing an essential resource for both microbial and plant growth [369]. Additionally, thiocyanate, methanol, and taurine are all naturally found in decaying plant tissues and are known to be important sources of carbon, nitrogen, and sulfur for bacteria [102, 370–376]. The H LFH horizons also had a lower relative abundance of oxaloacetate degradation pathways. Oxaloacetate is known to support N assimilation from phosphoenolpyruvate carboxylase (PEPC) activity within root nodules and thus the two pathways may be linked [377].

Finally, differences in pathway abundances between the LFH and mineral horizons are likely related to differences in available resources and soil function with depth (Table 2.2). For example, the LFH horizons had a higher relative abundance of pathways for the biosynthesis of ergothioneine, an essential amino acid produced only by prokaryotes and fungi and acquired by plants via specific transporters [366].
Additionally, degradation pathways for labile carbon sources such as homogalacturonan, which makes up 60% of the pectin found in plant cell walls [365], and xyloglucan, a component of hemicellulose in type 1 cell walls [260], were also more abundant in LFH horizons. Within the mineral horizons, pathways for the degradation of oxalates, compounds degraded by sub-surface soil bacteria in order to capture cations such as calcium, along with carbon and energy [232], were more abundant compared to LFH horizons. Lastly, formate oxidation, which produces CO2, the necessary input for the reductive monocarboxylic acid cycle, which in turn produces one-carbon units that can be anabolically assimilated by some bacteria [367], was also more prevalent in the mineral horizons.

A.3.3 Consistent taxonomic patterns in network modules and metabolic pathways

In addition to differences in taxonomic distinctness with soil depth and treatment, we observed differences in the taxonomic annotation (assigned using the LCA algorithm [144]) of the ORFs within pathways more prevalent in N LFH horizons that were consistent with network module composition (Fig. A.11). For example, across the 85 pathways more prevalent in the N organic horizons, we observed a 350% and 550% decrease in ORFs attributed to Actinobacteria and Bacteroidetes, respectively, within the N LFH horizons compared to the H LFH horizons (Fig. A.11). This is again consistent with the network analysis wherein Actinobacteria and Bacteroidetes made up twice as much of module A (H LFH horizons) compared to module B (N LFH horizons) (Fig. A.4).

A.3.4 Difficulties in defining the metabolic potential of individual nodes

When representative isolate genomes exist for specific and ample OTUs, connecting OTUs in the network to specific metabolisms becomes tractable.
However, for most environmental samples where co-metabolic innovations make representative isolates difficult to cultivate, single-cell genomes (SAGs) or metagenome assemblies are needed to fill in the gaps [378]. Currently, though, throughput limits compounded by incomplete genome coverage can confound the use of SAGs in isolation while metagenome assembly comes with a cadre of caveats, including chimeric contigs and poor depth of coverage in complex samples such as soils. Moreover, short-read aligners have difficulty in assembling SSU rRNA genes due to poor coverage [379] (approx. 0.03% of shotgun metagenomes are made up of ribosomal gene fragments [170]), which would be needed for direct mapping of assembled metagenome sequence bins onto OTU networks. Here, we recruited metagenomic reads to the representative sequences of our OTUs, which span only the V6-V8 region (approximately 1/3 of the gene), and we therefore expect 0.01% of our reads to recruit. Indeed, we found 0.011±0.003% of our metagenomic reads recruited to the OTUs, suggesting network taxa accurately reflect the taxonomic diversity within the metagenomes.

A.4 Supplemental figures

[Figure A.1 graphic omitted. Recoverable labels: Natural Unmanaged Forest Stand; Harvested Stand with LFH Horizon Removed; SBS-3; horizon labels LFH, Ahe, Ae, AB, Bt.]

Figure A.1: Sampling and analysis schematic for 26 samples from 5 soil horizons in two soil profiles near Williams Lake, B.C. An example soil horizon; a schematic of sample plots and sampled horizons.

[Figure A.2 graphic omitted. Recoverable labels: y-axis, Chao 1 Rarefaction Metric; x-axis, Sequences per Sample; one curve per sample; Soil Depth.]

Figure A.2: Chao 1 rarefaction curves for V6-V9 pyrotags.
Chao 1 rarefaction curves for V6-V9 pyrotags generated for 25 samples from 5 soil horizons in two soil profiles near Williams Lake, B.C.

[Figure A.3 graphic omitted. Recoverable labels: Percent of Total Pyrotags (%); Number of indicators; Bray-Curtis Dissimilarity; Cluster 1 to Cluster 5; Natural Unmanaged (N); Harvested (H); Soil Depth.]

Figure A.3: Hierarchical cluster analysis of pyrotags. Hierarchical cluster analysis of pyrotags (V6-V9 SSU rRNA) from 5 sample clusters identified using hierarchical cluster analysis on 26 soil samples spanning 5 soil horizons and two treatments (unmanaged (N) and harvested (H)) at the LTSP ecozone in Williams Lake, B.C.
Taxa representing > 0.1% (arbitrarily defined as intermediate and abundant taxa) are displayed at the phylum (or super phylum level for eukaryota) while all other taxa (representing < 0.01%) are binned into ‘other’ categories.

[Figure A.4 graphic omitted. Recoverable labels: panels Module A (H organic horizons), Module B (N organic horizons), Module C (mineral horizons N + H); x-axis, Relative Proportion of OTUs in Network Modules (%); y-axis, Phylum.]

Figure A.4: Relative abundance of phyla and indicator OTUs within co-occurrence network. Relative abundance of phyla and indicator OTUs within co-occurrence network calculated using SSU rRNA gene pyrotag operational taxonomic units (OTUs) (clustered at 97% sequence similarity) from 5 horizons in unmanaged (N) and harvested (H) soil profiles.

[Figure A.5 graphic omitted. Recoverable labels: Bray-Curtis dissimilarity; Harvested; Natural Unmanaged; Soil Depth.]

Figure A.5: Hierarchical cluster analysis describing the influence of horizon and treatment (unmanaged (N) and harvested (H)) on predicted metabolic pathways.
Hierarchical cluster analysis describing the influence of horizon and treatment (unmanaged (N) and harvested (H)) on predicted metabolic pathways in 26 samples from 5 soil horizons in two soil profiles at the LTSP ecozone in Williams Lake, B.C.

[Figure A.6 graphic omitted. Recoverable labels: x-axis, RPKM; y-axis, individual MetaCyc pathway names, some marked "* (x10)"; Natural Unmanaged (N); Harvested (H); Soil Depth.]

Figure A.6: Relative abundance of metabolic pathways more prevalent in unmanaged (N) LFH horizons compared to harvested (H) LFH horizons. Relative abundance of metabolic pathways more prevalent in unmanaged (N) LFH horizons compared to harvested (H) LFH horizons at the LTSP ecozone in Williams Lake, B.C. RPKM is reads per kilobase per million mapped.
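For readers unfamiliar with the unit used in these figures, the following Python sketch shows how an RPKM value is obtained from raw counts. The function names are hypothetical, and summing ORF-level values to obtain a pathway-level value is an assumption made for illustration, since the source does not spell out how pathway RPKM was aggregated.

```python
from typing import Iterable, Tuple


def rpkm(reads_mapped: int, length_bp: int, total_mapped_reads: int) -> float:
    """Reads per kilobase of feature per million mapped reads."""
    return reads_mapped * 1e9 / (length_bp * total_mapped_reads)


def pathway_rpkm(orfs: Iterable[Tuple[int, int]], total_mapped_reads: int) -> float:
    """Sum ORF-level RPKM over (reads_mapped, length_bp) pairs.

    Assumption: a pathway's abundance is the sum over its annotated ORFs.
    """
    return sum(rpkm(reads, length, total_mapped_reads) for reads, length in orfs)
```

For example, an ORF of 1,000 bp recruiting 500 reads in a library with 10 million mapped reads has an RPKM of 50.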
[Figure A.7 graphic omitted. Recoverable labels: x-axis, RPKM; y-axis, individual MetaCyc pathway names; Natural Unmanaged (N); Harvested (H); Soil Depth.]

Figure A.7: Relative abundance of metabolic pathways more prevalent in harvested (H) LFH horizons compared to unmanaged (N) LFH horizons. Relative abundance of metabolic pathways more prevalent in harvested (H) LFH horizons compared to unmanaged (N) LFH horizons at the LTSP ecozone in Williams Lake, B.C.
RPKM is reads per kilobase per million mapped.

[Figure A.8 graphic omitted. Recoverable labels: panels Organic Horizons (N + H) and Mineral Horizons (N + H); y-axis, individual MetaCyc pathway names; Soil Depth.]

Figure A.8: Relative abundance of metabolic pathways more prevalent in LFH horizons compared to mineral horizons. Relative abundance of metabolic pathways more prevalent in LFH horizons compared to mineral horizons at the LTSP ecozone in Williams Lake, B.C.
RPKM is reads per kilobase per million mapped.

[Figure A.9 graphic omitted. Recoverable labels: panels Organic Horizons (N + H) and Mineral Horizons (N + H); y-axis, individual MetaCyc pathway names; Soil Depth.]

Figure A.9: Relative abundance of metabolic pathways more prevalent in mineral horizons compared to LFH horizons. Relative abundance of metabolic pathways more prevalent in mineral horizons compared to LFH horizons at the LTSP ecozone in Williams Lake, B.C. RPKM is reads per kilobase per million mapped.

[Figure A.10 graphic omitted. Recoverable labels: panels A, B and C; pathways CO2 fixation into oxaloacetate (anaplerotic), aerobic respiration I (cytochrome c), and dTDP-L-rhamnose biosynthesis I, each with its individual reactions; axis, RPKM; legend of taxonomic groups; Natural Unmanaged (N); Harvested (H); Soil Depth.]

Figure A.10: Relative abundance of three pathways by taxa.
Relative abundance of three pathways in both the unmanaged (N) LFH and harvested (H) LFH horizons for the total pathway and each pathway reaction broken down by taxonomic group. (A.) Pathways with same relative abundance in the unmanaged (N) LFH and harvested (H) LFH horizons. (B.) Pathway more abundant in the unmanaged (N) LFH horizons compared to the harvested (H) LFH horizons. (C.) Pathway more abundant in the harvested (H) LFH horizons compared to the unmanaged (N) LFH horizons. RPKM is reads per kilobase per million mapped.

[Figure A.11 graphic omitted. Recoverable labels: panels Pathways more prevalent in harvested organic (LFH) horizons and Pathways more prevalent in natural organic (LFH) horizons; Harvested; Natural; taxonomic group names arranged from root to class.]

Figure A.11: Relative abundance of taxonomic groups in pathways. Relative abundance of taxonomic groups (as determined using lowest common ancestor annotations) in pathways more prevalent in (A.) harvested (H) LFH horizons and (B.) unmanaged (N) LFH horizons.
For each treatment, the proportions of all taxa sum to 100%.

A.5 Supplemental tables

Table A.1: Chemical properties of soils used in community diversity and metagenome library production and sequencing
Sample name | Horizon | Treatment | Depth (cm) | pH | Total C (%) | Total N (%) | Total organic N (ppm) | NH4 (ppm) | NO3 (ppm)
N-LFH 1 | LFH | Natural | 2 | 4.25 | 14.4 | 0.4 | 76.85 | 7.85 | 0.01
N-Ahe 1 | Ahe | Natural | 6.5 | 4.69 | 1.8 | 0.08 | 11.82 | 0.64 | 0.01
N-Ae 1 | Ae | Natural | 13 | 5.42 | 0.95 | 0.05 | 7.23 | 0.36 | 0.01
N-AB 1 | AB | Natural | 38.5 | 5.92 | 0.67 | 0.05 | 6.3 | 0.54 | 0.01
N-Bt 1 | Bt | Natural | 62 | 6.72 | 0.89 | 0.06 | 5.64 | 2.83 | 0.1
N-LFH 2 | LFH | Natural | 2 | 4.5 | 26.64 | 0.55 | 67.56 | 7.53 | 0.03
N-Ahe 2 | Ahe | Natural | 7.5 | 5.07 | 8.71 | 0.25 | 40.41 | 2.37 | 0.01
N-Ae 2 | Ae | Natural | 20 | 6.22 | 0.73 | 0.05 | 5.12 | 0.26 | 0.01
N-AB 2 | AB | Natural | 27 | 6.34 | 1.06 | 0.05 | 6.65 | 0.47 | 0.01
N-Bt 2 | Bt | Natural | 55.5 | 6.9 | 0.74 | 0.05 | 1.5 | 0.64 | 0.01
N-LFH 3 | LFH | Natural | 10.5 | 5.59 | 27.46 | 0.61 | 90.52 | 13.53 | 1.48
N-Ae 3 | Ae | Natural | 18.5 | 5.7 | 1.12 | 0.08 | 7.34 | 1.57 | 0.01
N-AB 3 | AB | Natural | 39.5 | 5.77 | 0.6 | 0.05 | 2.25 | 0.81 | 0.01
N-Bt 3 | Bt | Natural | 54.5 | 6.6 | 0.61 | 0.05 | 1.52 | 2.45 | 0.01
H-LFH 1 | LFH | Harvested | 0.3 | 5.87 | 5.73 | 0.21 | 47.91 | 14.14 | 1.53
H-Ae 1 | Ae | Harvested | 6.2 | 6.13 | 0.28 | 0.03 | 2.16 | 0.11 | 0.02
H-AB 1 | AB | Harvested | 33.3 | 6.5 | 0.36 | 0.03 | 0.01 | 0.88 | 0.06
H-Bt 1 | Bt | Harvested | 52.8 | 6.9 | 0.58 | 0.04 | 0.01 | 0.83 | 0.15
H-LFH 2 | LFH | Harvested | 0.2 | 6.18 | 2.26 | 0.1 | 24.51 | 5.93 | 1.54
H-Ae 2 | Ae | Harvested | 6.7 | 6.05 | 1.13 | 0.06 | 5.73 | 0.56 | 0.06
H-AB 2 | AB | Harvested | 21.7 | 6.27 | 0.36 | 0.03 | 0.01 | 0.07 | 0.07
H-Bt 2 | Bt | Harvested | 59.7 | 6.76 | 0.65 | 0.05 | 3.85 | 0.87 | 0.12
H-LFH 3 | LFH | Harvested | 0.2 | 6.03 | 5.1 | 0.17 | 29.2 | 4.93 | 0.65
H-Ae 3 | Ae | Harvested | 4.2 | 5.16 | 2.1 | 0.09 | 9.51 | 0.89 | 0.01
H-AB 3 | AB | Harvested | 27.2 | 5.72 | 0.88 | 0.04 | 3.26 | 0.32 | 0.01
H-Bt 3 | Bt | Harvested | 49.2 | 6.45 | 0.64 | 0.05 | 3.17 | 1.17 | 0.05
Skulow Lake (SBS-3) Long-term Soil Productivity (LTSP) site located at coordinates 52°20'N, 121°55'W

Table A.2: Metagenomic sequencing, assembly and annotation summary statistics
Sample name | Barcode | Total sequences | Reads mapped (%) | Sequences assembled | Assembled min bps | Assembled avg bps | Assembled max bps | Assembled total bps | ORFs | ORF min bps | ORF avg bps | ORF max bps | ORF total bps | Total annotations | 16S rRNA reads (%)
H-LFH 1 | A26980 | 125987682 | 11.38 | 545607 | 60 | 260 | 126501 | 142105713 | 160378 | 60 | 125 | 5283 | 20179581 | 91369 | 0.01
H-Ae 1 | A26981 | 168989328 | 23.43 | 7221418 | 40 | 139 | 882491 | 1009516212 | 714203 | 60 | 117 | 13953 | 84162721 | 421574 | 0.01
H-AB 1 | A26982 | 83207826 | 18.48 | 2849218 | 44 | 139 | 293192 | 396370669 | 313767 | 60 | 105 | 2808 | 32988502 | 170351 | 0.01
H-Bt 1 | A26983 | 130605338 | 26.06 | 6358606 | 40 | 131 | 245314 | 835985953 | 653135 | 60 | 108 | 4497 | 70614557 | 344698 | 0.02
H-LFH 2 | A26984 | 124952720 | 12.4 | 4721642 | 40 | 118 | 157575 | 560121175 | 320397 | 60 | 103 | 3632 | 33124723 | 186053 | 0.01
H-Ae 2 | A26985 | 136037648 | 9.11 | 805396 | 52 | 188 | 326485 | 151953526 | 153718 | 60 | 114 | 3952 | 17580872 | 86623 | 0.01
H-AB 2 | A26986 | 117048788 | 22.37 | 5268581 | 40 | 131 | 370568 | 690203407 | 532576 | 60 | 105 | 3170 | 56286885 | 293333 | 0.01
H-Bt 2 | A26987 | 131120192 | 33.11 | 7152679 | 40 | 140 | 55860 | 1007810691 | 868411 | 60 | 112 | 3125 | 97522014 | 464540 | 0.01
H-LFH 3 | A26988 | 151098190 | 15.54 | 6970429 | 36 | 113 | 97416 | 794439146 | 477359 | 60 | 110 | 4892 | 52792441 | 268870 | 0.01
H-Ae 3 | A26989 | 157609844 | 35.04 | 8185118 | 40 | 137 | 118201 | 1124625812 | 921308 | 60 | 114 | 5467 | 105587482 | 499659 | 0.01
H-AB 3 | A26990 | 150244154 | 19.5 | 6223913 | 40 | 125 | 203253 | 779533091 | 562025 | 60 | 103 | 10806 | 58257148 | 326467 | 0.01
H-Bt 3 | A26991 | 137325258 | 12.29 | 3927952 | 44 | 124 | 195496 | 489335972 | 291462 | 60 | 99 | 2394 | 28977045 | 181133 | 0.01
N-LFH 1 | A26992 | 149977894 | 27.89 | 7159930 | 40 | 135 | 243494 | 972401701 | 742663 | 60 | 109 | 2748 | 81355870 | 449463 | 0.01
N-Ahe 1 | A26993 | 148999174 | 29.08 | 7610079 | 40 | 128 | 68461 | 976324197 | 748243 | 60 | 106 | 2785 | 80028436 | 436649 | 0.01
N-Ae 1 | A26994 | 128843246 | 17.31 | 5138557 | 40 | 120 | 70479 | 619721290 | 415116 | 60 | 100 | 2626 | 41850844 | 250152 | 0.01
N-AB 1 | A26995 | 93743392 | 18.36 | 3945818 | 40 | 124 | 49367 | 491859012 | 358359 | 60 | 100 | 1688 | 36051498 | 203307 | 0.01
N-Bt 1 | A26996 | 159206126 | 33.78 | 8323218 | 40 | 140 | 175453 | 1172521362 | 996006 | 60 | 114 | 3185 | 113941528 | 529283 | 0.02
N-LFH 2 | A26997 | 130566292 | 16.56 | 8563078 | 32 | 106 | 73505 | 911896328 | 579547 | 60 | 100 | 3902 | 58164373 | 341083 | 0.01
N-Ahe 2 | A26998 | 131201506 | 12.45 | 4338290 | 40 | 120 | 96668 | 522912941 | 305446 | 60 | 104 | 4408 | 31983922 | 193849 | 0.01
N-Ae 2 | A26999 | 94216422 | 13.05 | 3127014 | 40 | 118 | 38264 | 371632027 | 242236 | 60 | 96 | 1757 | 23443144 | 149186 | 0.01
N-AB 2 | A27000 | 109387312 | 17.96 | 3347714 | 44 | 130 | 176660 | 436968842 | 325924 | 60 | 100 | 3159 | 32848104 | 183434 | 0.01
N-Bt 2 | A27001 | 115798136 | 27.44 | 5664900 | 40 | 136 | 56104 | 773412521 | 655595 | 60 | 107 | 9372 | 70675415 | 358249 | 0.01
N-LFH 3 | A27002 | 129992810 | 18.15 | 5171933 | 40 | 124 | 78993 | 644883227 | 472374 | 60 | 100 | 3321 | 47445943 | 285715 | 0.01
N-Ae 3 | A27003 | 122762670 | 31.03 | 4522170 | 44 | 157 | 267974 | 711372945 | 621170 | 60 | 117 | 4403 | 73003872 | 359978 | 0.01
N-AB 3 | A27004 | 82205230 | 14.66 | 3925988 | 36 | 115 | 33375 | 453772858 | 298413 | 60 | 100 | 1772 | 29941856 | 175289 | 0.01
N-Bt 3 | A27005 | 96305846 | 35.92 | 4417892 | 40 | 147 | 391881 | 652253329 | 576126 | 60 | 115 | 9524 | 66634744 | 310006 | 0.01
a Skulow Lake (SBS-3) located at coordinates 52°20'N, 121°55'W

Table A.3: PCR barcodes and SSU rRNA library production
Sample name | Barcode | Quality pyrotags
N-LFH 1 | CGTCGATCTC | 8499
N-Ahe 1 | CTACGACTGC | 9357
N-Ae 1 | CTAGTCACTC | 9278
N-AB 1 | CTCTACGCTC | 9286
N-Bt 1 | TAGCGCGCGC | 9294
N-LFH 2 | TAGCTCTATC | 7701
N-Ahe 2 | TATAGACATC | 8296
N-Ae 2 | TATGATACGC | 7418
N-AB 2 | TCACTCATAC | 12090
N-Bt 2 | TCATCGAGTC | 10036
N-LFH 3 | TCGAGCTCTC | 7496
N-Ae 3 | TCGCAGACAC | 653
N-AB 3 | TCTGTCTCGC | 8871
N-Bt 3 | TGAGTGACGC | 9172
H-LFH 1 | AGCGACTAGC | 7413
H-Ae 1 | AGTAGTGATC | 7733
H-AB 1 | AGTGTATGTC | 6303
H-Bt 1 | ATAGATAGAC | 7002
H-LFH 2 | ATATAGTCGC | 10157
H-Ae 2 | ATCTACTGAC | 8252
H-AB 2 | CACGTAGATC | 8248
H-Bt 2 | CACGTGTCGC | 4293
H-LFH 3 | CATACTCTAC | 7702
H-Ae 3 | CGACACTATC | 6276
H-AB 3 | CGAGACGCGC | 7169
H-Bt 3 | CGTATGCGAC | 8744

Table A.4: Microbial phyla identified
Lineage | Average (%) | Standard deviation (%)
Proteobacteria | 30.605 | 6.395
Acidobacteria | 21.028 | 6.656
Actinobacteria | 9.862 | 4.455
Bacteroidetes | 9.112 | 5.844
Chloroflexi | 6.060 | 4.149
Opisthokonta | 5.612 | 3.288
Planctomycetes | 5.587 | 1.850
Gemmatimonadetes | 2.867 | 1.282
Verrucomicrobia | 1.895 | 1.099
Archaeplastida | 1.348 | 2.887
Nitrospirae | 0.819 | 0.721
Armatimonadetes | 0.788 | 1.103
Cyanobacteria | 0.721 | 1.006
Candidate Division OP11 | 0.663 | 0.379
SAR | 0.551 | 0.532
Candidate Division WS3 | 0.411 | 0.399
Firmicutes | 0.354 | 0.251
Thaumarchaeota | 0.340 | 0.439
Elusimicrobia | 0.250 | 0.154
Chlorobi | 0.188 | 0.093
Candidate Division SM2F11 | 0.169 | 0.114
Euryarchaeota | 0.133 | 0.137
Candidate Division WCHB1-60 | 0.109 | 0.086
Candidate Division TM6 | 0.087 | 0.072
Candidate Division OD1 | 0.071 | 0.074
Candidate Division TM7 | 0.065 | 0.042
Fibrobacteres | 0.064 | 0.088
Candidate Division BRC1 | 0.034 | 0.036
Chlamydiae | 0.031 | 0.034
Candidate Division MVP-21 | 0.030 | 0.030
Amoebozoa | 0.019 | 0.024
Spirochaetes | 0.018 | 0.036
Candidate Division OP3 | 0.012 | 0.013
Candidate Division BHI80-139 | 0.012 | 0.014
Excavata | 0.009 | 0.034
Candidate Division TA06 | 0.006 | 0.012
Candidate Division WS6 | 0.006 | 0.015
Lentisphaerae | 0.005 | 0.008
Candidate Division NPL-UPA2 | 0.004 | 0.012
Candidate Division JL-ETNP-Z39 | 0.004 | 0.010
Candidate Division GOUTA4 | 0.003 | 0.007
Candidate Division OP9 | 0.002 | 0.006
Candidate Division BD1-5 | 0.002 | 0.007
Caldiserica | 0.001 | 0.004
Kazan-3B-28 | 0.001 | 0.005
RT5iin25 | 0.001 | 0.003
No BLAST hit | 0.040 | 0.030

Table A.5: Indicator OTUs with indicator values > 0.7 for clusters, and combinations of clusters, defined in the dendrogram and unmanaged soil profiles
Indicator cluster
Taxa | 1 | 2 | 3 | 4 | 5 | 1, 2 a | 1, 3 | 1, 4 | 1, 5 | 1, 2, 3 | 1, 2, 4 | 1, 2, 5 | 1, 3, 4 | 1, 3, 5 | 1, 4, 5 | 2, 3 | 2, 4 | 2, 5 | 2, 3, 4 | 2, 4, 5 | 3, 4 b | 3, 5 b | 3, 4, 5 b | 4, 5 b
Acidobacteria 86 78 37 14 32 15 3 2 13 7 1 38 3 14 1 31 1 34 28
Actinobacteria 108 60 6 3 60 14 5 1 7 8 1 1 4 11 1 2 5 10 14
Amoebozoa 2
Archaeplastida 33 1
Armatimonadetes 73 3 1 1 1 5 1
Bacteroidetes 134 57 2 1 3 11 4 1 9 1 1 3 14 6 3 2 2
Candidate division BRC1 1 1
Candidate division OD1 1
Candidate division OP11 5 4 12 1 3
Candidate division TM7 2 1
Candidate division WS3 2 2 2 1 1 2
Candidate division WS6 1
Chlorobi 2 1 1 1 2
Chloroflexi 45 3 7 5 51 5 2 2 1 2 5 1 1 5 7 29
Cyanobacteria 29 3 2 1 1 1 3
Elusimicrobia 4 1 1 1 1 1
Euryarchaeota 1 1 1
Excavata 1
Fibrobacteres 1
Firmicutes 1 3 1 1 1
Gemmatimonadetes 31 6 4 2 4 1 1 1 1 2 1 1 1 3 3 3
Incertae Sedis 1
MVP-21 1 1
Nitrospirae 1 5 6 4
No blast hit 1
Opisthokonta 25 25 9 1 1 2 2 1 2 6 1 1 3
Planctomycetes 66 37 26 7 14 8 10 10 6 8 4 21 1 5 7
Proteobacteria 206 220 26 17 61 26 7 1 3 32 1 9 2 60 4 13 10 18 1 31 28
SAR 20 1 2 1 3 4
SM2F11 1 1 1 1 1 1
Spirochaetes 1
Thaumarchaeota 1 1 1 1 1
TM6 1
Verrucomicrobia 21 12 1 2 1 3 2 1 12 4 2
WCHB1-60 4 3
a: used for organic clusters
b: used for mineral clusters

Appendix B
Chapter 3: Supplemental Material

B.1 Supplemental methods

B.1.1 Site description and sampling description
The O’Connor site is located within the Interior Douglas-fir (IDF) ecozone at an elevation of 1,075 m. Based on historical averages between 1981 and 2010, the area receives an average of 224.3 mm of rainfall and 63.5 cm of snowfall. The area is dominated by Douglas fir (Pseudotsuga menziesii), ponderosa pine (Pinus ponderosa), and Rocky Mountain juniper (Juniperus scopulorum). Parent material consists of eolian veneer over glacial till, and Gray Luvisols (Haplocryalfs according to the USDA soil classification system) are the most common soil type. From 0-20 cm, soils are 34% sand, 51% silt and 15% clay. There is an average of 16.8±5.9 g kg-1 of carbon. The area receives an average of 279 mm of precipitation annually [380]. We compared two treatments, a natural reference (N) and a harvested soil plot (H) wherein all aboveground vegetation is removed but the forest floor material is retained (65±3% net carbon removal) [27, 70].

B.2 Supplemental results

B.2.1 Ribosomal RNA (rRNA) depletion
To determine the efficacy of ribosomal RNA (rRNA) depletion within the soil metatranscriptomes, prior to sequencing all 54 metatranscriptomic samples, 3 samples were sequenced with and without rRNA depletion using the Ribo-Zero rRNA Removal kit specific for bacterial RNA (Illumina®). Samples (June N LFH, Oct N LFH, Oct H Min 2) were chosen from two seasons, both the natural forest plot (N) and harvested plot (H), and two depths (LFH and Min 2) in an effort to capture multiple environmental conditions and sample properties.
Following sequencing, reads were trimmed to a quality score of 20 and aligned to the SILVA database [153] using the Basic Local Alignment Search Tool (BLAST) [381] with a maximum e-value cut-off of 1e-3. On average, 32.25±12.24% of total RNA (undepleted) reads aligned to an SSU rRNA reference sequence, compared to 0.93±1.39% of reads from the rRNA-depleted libraries (B.1). Given the reduction in alignment to rRNA reference sequences following rRNA depletion, the Ribo-Zero rRNA Removal kit was used prior to sequencing the complete set of metatranscriptomes.

B.2.2 Metagenomic sequencing
Following de novo assembly, an average of 59.80±8.40% of total metagenomic reads from a given sample could be mapped to the corresponding assembly (including contigs <200 bps not included in downstream analysis), representing relatively good assembly given the complexity of soil microbial communities [1]. Detailed assembly, ORF prediction, quality control and annotation statistics can be found in Table B.2.

B.2.3 Metatranscriptomic sequencing
On average, 63,837±32,066 open reading frames (ORFs) over 180 bps (60 amino acids when translated) were predicted within each assembly. Using the 5 reference databases, 51.77±7.80% of ORFs could be annotated, again typical of omics data (Hahn et al., 2015). Detailed assembly, ORF prediction, quality control and annotation statistics can be found in Table B.3.

B.2.4 Potential and expressed metabolic pathways
Next, we identified the 50 most abundant pathways in the metagenomes and the 50 most highly expressed pathways in the metatranscriptomes (Fig. B.3). Of these, 22 pathways were shared between the metagenomic and metatranscriptomic datasets, including 4 biosynthesis pathways, 3 degradation pathways, and 15 energy metabolism (both aerobic and anaerobic) pathways (Fig. B.3). Thus, despite no general correlation between gene abundance and expression, a subset of pathways is both common and highly expressed.
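
The comparison of the two top-50 lists described above amounts to a set intersection and two set differences. The following is a minimal Python sketch of that bookkeeping, assuming per-pathway abundance and expression scores are available as dictionaries. The function names, pathway scores and the use of n=2 are hypothetical illustrations and are not taken from the analysis pipeline used in this work.

```python
def top_n(scores, n):
    """Return the names of the n highest-scoring pathways."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:n])


def compare_top_pathways(abundance, expression, n=50):
    """Split the top-n abundant and top-n expressed pathways into
    shared, abundant-only and expressed-only sets."""
    top_abundant = top_n(abundance, n)
    top_expressed = top_n(expression, n)
    return {
        "shared": top_abundant & top_expressed,
        "abundant_only": top_abundant - top_expressed,
        "expressed_only": top_expressed - top_abundant,
    }


# Toy scores (hypothetical values, for illustration only).
toy_abundance = {
    "TCA cycle": 9.0,
    "trehalose degradation": 7.0,
    "thiazole biosynthesis": 2.0,
    "formate oxidation": 1.0,
}
toy_expression = {
    "TCA cycle": 8.0,
    "formate oxidation": 6.0,
    "thiazole biosynthesis": 1.0,
    "trehalose degradation": 0.5,
}

result = compare_top_pathways(toy_abundance, toy_expression, n=2)
for group, pathways in result.items():
    print(group, sorted(pathways))
```

With n=50, the sizes of the three sets would correspond to the shared, abundant-only and expressed-only pathway counts reported in this section.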
Generally, these pathways were involved in amino-acid, nucleotide, carbohydrate and cofactor biosynthesis, and in nucleotide, carbohydrate, and non-carbon nutrient degradation. Additionally, 3 pathways involved in electron transfer, 2 in fermentation, 3 in aerobic respiration, 1 in anaerobic respiration, and 6 in the TCA cycle were also both abundant and highly expressed (Fig. B.3).

Pathways that were abundant but not highly expressed included 15 biosynthesis, 9 degradation, 1 detoxification, 2 energy metabolism, and 1 metabolic cluster pathways. Each pathway and the distribution of its abundance can be found in Fig. B.4. In contrast, pathways that were highly expressed but not among the most abundant included 10 biosynthesis, 10 degradation, 1 detoxification and 7 energy metabolism pathways involved in common metabolic functions such as the TCA cycle and thiazole biosynthesis (thiazole is a component of thiamine, a vitamin essential to energy metabolism), as well as 8 pathways involved in carbon fixation. Indeed, 5 of the most highly expressed pathways included formate oxidation, which produces CO2 [367], a necessary input for 5 additional pathways also among the most highly expressed: the Calvin-Benson-Bassham cycle, the primary carbon fixation pathway found in plants and autotrophic bacteria [382]; 3 pathways for the reductive TCA cycle, an alternative pathway by which some bacteria and archaea synthesize carbon compounds [383]; and the 3-hydroxypropionate/4-hydroxybutyrate cycle, by which some archaea can also fix carbon [384] (Fig. B.4). In addition, 3 pathways implicated in methanotrophic metabolism were also highly expressed, including methane oxidation to methanol and 2 pathways for the assimilation of formaldehyde, which is produced during the oxidation of methane (Fig. B.4).
Together, the data suggest these pathways represent a signal for both microbial community growth and carbon storage within microbial biomass.

B.2.5 Seasonal differences in microbial metabolism
While MVRT provide a powerful method by which to assess broad trends in pathway abundance and expression across the combination of sample characteristics (season, depth, and treatment), given the hierarchical study design it is possible to isolate the average effect of each of the three sample characteristics individually [226, 229, 249]. First, we assessed differences in pathway abundance and expression among seasons. Within the metagenomes, few differences were found between seasons. Indeed, no differences were found in the relative abundance of pathways between June and October samples (Fig. B.3). Between June and February, 7 pathways differed in abundance: trehalose degradation, formaldehyde assimilation and 3 pathways for antibiotic biosynthesis (2 for fosfomycin and 1 for dehydrophos) were more abundant in June, while L-carnitine biosynthesis, which plays a role in energy metabolism and stress response in eukaryotic organisms [236, 237], and L-valine degradation (L-valine is a branched-chain amino acid widely produced by soil microorganisms [385]) were more abundant in February. Finally, between samples collected in October and February, only 1 pathway for lipid biosynthesis was found to be more abundant in October. This suggests that genomic potential remains relatively stable through time.

Within the metatranscriptomes, June had the least in common with the other seasons, possibly related to photosynthetic activity [208, 209] and/or increased soil moisture due to rainfall, which has previously been shown to stimulate microbial respiration and alter microbial composition and physical soil properties [47, 250]. Between June and October samples, 11 pathways were differentially expressed (Fig. 3.8).
Of these, 6 pathways, including 3 pyrimidine biosynthesis pathways and 1 vitamin B9 biosynthesis pathway, as well as 1 pyrimidine degradation and 1 catechol degradation pathway, were more highly expressed in June compared to October. In contrast, 2 amino-acid (alanine and cysteine) biosynthesis, 2 co-factor biosynthesis and 1 lipid biosynthesis pathway were more highly expressed in October.

Between samples collected in June and February, 59 pathways were differentially expressed. More specifically, in June 31 pathways were more highly expressed, including 12 pathways involved in carbohydrate biosynthesis, likely related to cell growth; 5 pathways implicated in aromatic compound degradation, likely involved in plant biomass degradation [240–242]; and 4 anaerobic pathways, namely benzoyl-CoA degradation III (anaerobic), two pathway variants for purine nucleobases degradation (anaerobic), and pyruvate fermentation to lactate (Fig. 3.8). In contrast, 28 pathways were more highly expressed in February, including 10 pathways involved in co-factor biosynthesis, also likely related to cell growth; 2 pathways involved in carbohydrate oxidation; 2 pathways involved in carboxylate degradation; and 2 fermentation pathways, namely pyruvate fermentation to acetate VIII and acetoin biosynthesis III. Given the similarities in metabolic expression within the LFH horizons of June and February, as identified in Fig. 3.5, differences identified here are likely related to differences within the mineral samples. Finally, only 5 pathways were differentially expressed between samples collected in October and February (Fig. 3.8): 1 carbohydrate biosynthesis pathway, 2 carbohydrate degradation pathways, and 1 aromatic carbon degradation pathway were more highly expressed in October, while only 1 energy-metabolism pathway for pyruvate fermentation to acetate was more highly expressed in February samples.

B.2.6 The impact of depth on soil microbial metabolism
We then evaluated the average impact of depth on microbial metabolism by assessing differences in pathway abundance and expression between the LFH and mineral samples across all samples. Differences between the LFH and mineral horizons were more pronounced than differences between treatments and seasons in both the metagenomes and metatranscriptomes. Within the metagenomes, a total of 48 pathways were more abundant within LFH samples (Fig. 3.7). Differences were most pronounced for the biosynthesis of secondary metabolites, including several antibiotics, namely actinorhodin biosynthesis and gramicidin S biosynthesis, both produced by bacteria [156, 224], and jadomycin biosynthesis, a family of antibiotics produced by some fungi [386]. Additionally, LFH horizons had a higher abundance of 7 carbohydrate degradation pathways implicated in the degradation of cellulose, glucose, pectin (found in plant cell walls [365]), lactose, and rhamnogalacturonan (an important component of plant cell walls). Finally, oleoresin monoterpene volatiles biosynthesis (i.e., resin or pitch, a compound secreted by coniferous trees as a defense mechanism [252]) was also more abundant in the LFH samples.

In contrast, the mineral samples had a higher abundance of 47 pathways (Fig. 3.7). These included 8 aromatic compound degradation pathways, such as cinnamate and 3-hydroxycinnamate degradation to 2-oxopent-4-enoate, involved in the degradation of proteins that have putrefied within the soil [156], and cyanurate degradation, which can be used as the primary source of nitrogen for both soil bacteria and fungi [102].
In addition, several of the aromatic degradation pathways were anaerobic, including 4-ethylphenol degradation (anaerobic), known to exist in denitrifying Betaproteobacteria [242]; 4-coumarate degradation (anaerobic), which targets a compound produced by green plants [156]; and phenol degradation II (anaerobic), phenol degradation II (anaerobic), both pathways isolated within Proteobacteria and known to aid in the detoxification of phenolic compounds and metals, and assist in antimicrobial defense [240–242].

Differences with depth in the metatranscriptomes are described in detail above, as depth was identified as the major driver of differences in metabolic expression (