Open Collections

UBC Theses and Dissertations

Functional metagenomic screening for glycoside hydrolases Mewis, Keith 2016

Functional Metagenomic Screening for Glycoside Hydrolases

by

Keith Mewis

B.Sc., The University of British Columbia, 2009

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF

DOCTOR OF PHILOSOPHY

in

The Faculty of Graduate and Postdoctoral Studies
(Genome Sciences and Technology)

THE UNIVERSITY OF BRITISH COLUMBIA
(Vancouver)

December 2016

© Keith Mewis 2016

Abstract

Limitations on the cultivation of a majority of naturally occurring microbes have spurred the rise of culture-independent methods for the investigation of environmental microbial communities, a field known as metagenomics. This thesis addresses both functional and informatic approaches to metagenomics with the aim of improving our knowledge of carbohydrate degradation. A high throughput functional metagenomic screen was developed and applied to over 350,000 fosmid clones to search for glycoside hydrolases (GHs) in metagenomic libraries. Screening yielded 798 fosmid clones capable of hydrolyzing a model sugar compound, and the genes responsible were subcloned and biochemically characterized for pH and temperature stability, and substrate specificity. The combination of functional and in silico methods developed was used in a longitudinal study of the beaver (Castor canadensis) digestive tract, in order to gain insight into the sequential degradation of biomass. A linear model was used to identify enrichment of endo-acting versus exo-acting GH families at five locations throughout the digestive tract. The discovery of high numbers of GH43 family genes on functionally identified fosmids resulted in their combination with all other known GH43 genes in order to create subfamily classifications that provide finer resolution of enzyme activities. This classification system resulted in an improved ability to assign functional characteristics to enzymes identified through informatic studies. Of the 37 subfamilies created, only 22 contained a characterized enzyme.
Fosmids identified earlier in this work harboured genes from four uncharacterized GH43 subfamilies, and future characterization efforts will further our understanding of the GH43 family. Altogether, the developed methods provide a framework for future studies of biomass degradation and improve the power of both functional and in silico metagenomics.

Preface

A number of sections of this work are partly or wholly published in press.

• Portions of Chapter 1 and Chapter 5 drew references and ideas from previous publications but contain wording original to this thesis.

  Marcus Taupp, Keith Mewis, and Steven J. Hallam. “The art and design of functional metagenomic screens.” Current opinion in biotechnology 22(3): 465-472. (2011)

  Zachary Armstrong, Keith Mewis, Cameron Strachan, and Steven J. Hallam. “Biocatalysts for biomass deconstruction from environmental genomics.” Current opinion in chemical biology 29: 18-25. (2015)

• Chapter 2: The functional screening method was developed and applied for study of the anaerobic bioreactor and forest soils by Keith Mewis. Other screening efforts were undertaken by Sam Kheirandish under supervision of Keith Mewis. Fosmid libraries for environments screened were previously created by Marcus Taupp, Sangwon Lee, Payal Sipahimalani, and Melanie Scofield. Characterization of discovered enzymes was performed by Zach Armstrong, and phylogenetic placement of discovered genes among GH family trees was performed by Young Song.

  Keith Mewis, Marcus Taupp, and Steven J. Hallam. “A high throughput screen for biomining cellulase activity from metagenomic libraries.” Journal of Visualized Experiments 48: e2461. (2011)

  Keith Mewis, Zachary Armstrong, Young C. Song, Susan A. Baldwin, Stephen G. Withers, and Steven J. Hallam. “Biomining active cellulases from a mining bioremediation system.” Journal of biotechnology 167(4): 462-471.
  (2013)

• Chapter 3: Sampling of beaver feces was performed by Zach Armstrong and Kevin Mehr of the Withers Lab, and the fosmid library was created by Zach Armstrong and Melanie Scofield. Enzyme characterizations were performed by Zach Armstrong and Feng Liu of the Withers Lab. Beaver intestinal samples were collected by Keith Mewis, and DNA extraction performed by Keith Mewis, Zach Armstrong, and Melanie Scofield. DNA and fosmid sequencing was performed by Keith Mewis at the UBC Pharmaceutical Sciences Sequencing Center (PSSC) with help from Dr. Sunita Sinha and Jennifer Chiang. Guidance with linear modelling was provided by Rick White of the UBC Statistical Consulting and Research Laboratory (SCARL) and Evan Durno.

• Chapter 4: Dr. Bernard Henrissat proposed the idea for the project, and Keith Mewis designed testing and classification of sequences, and analyzed data. Dr. Nicolas Lenfant created programs and scripts necessary for data analysis. This chapter has been published:

  Keith Mewis, Nicolas Lenfant, Vincent Lombard, and Bernard Henrissat. “Dividing the large glycoside hydrolase family 43 into subfamilies: a motivation for detailed enzyme characterization.” Applied and environmental microbiology 82(6): 1686-1692. (2016)

The UBC Office of Research Ethics was consulted related to work with dissected beavers in Chapter 3, but no ethical applications or approval was required.

Throughout this work, the term “we” refers to Keith Mewis, unless otherwise stated.

Table of Contents

Abstract
Preface
Table of Contents
List of Tables
List of Figures
Acknowledgements
Dedication
1 Introduction
  1.1 Metagenomics
    1.1.1 High Throughput DNA Sequencing
    1.1.2 in silico Metagenomic Approaches
    1.1.3 Functional Metagenomic Approaches
  1.2 Carbohydrates and Carbohydrate Active Enzymes
    1.2.1 Glycoside Hydrolases
    1.2.2 Polysaccharide Utilization Loci
  1.3 Functional Metagenomic Screening for Glycoside Hydrolases
  1.4 Research Overview
2 Development of a High Throughput Functional Metagenomic Screen for Glycoside Hydrolases
  2.1 Introduction
  2.2 Materials and Methods
    2.2.1 Replication of Fosmid Libraries
    2.2.2 Lysis and Addition of Screening Substrate
    2.2.3 Identification of Positive Clones
    2.2.4 Fosmid Purification and Sequencing
    2.2.5 Gene Annotation
    2.2.6 Transposon Knockout Mutagenesis
    2.2.7 Taxonomic Assignment of Fosmids
  2.3 Application of Screen to Anaerobic Bioreactor
    2.3.1 Screening Results
    2.3.2 Taxonomic Analysis
    2.3.3 Specific Analysis of GH Genes
    2.3.4 Phylogenetic Placement of Discovered GH Genes
    2.3.5 Biochemical Characterization of Functional Genes
  2.4 Further Environments Studied
  2.5 Conclusion
    2.5.1 Improvements
    2.5.2 Limitations
3 Functional and In Silico Metagenomic Analysis of the Castor canadensis Gut Microbiome
  3.1 Introduction
  3.2 Materials and Methods
    3.2.1 Sample Collection
    3.2.2 Extraction of High Molecular Weight DNA
    3.2.3 PCR Amplification of Small Subunit Ribosomal RNA Genes
    3.2.4 Small Subunit Ribosomal RNA Gene Sequencing
    3.2.5 Metagenomic Sequencing
    3.2.6 Metagenomic Assembly
    3.2.7 Gene Annotation
    3.2.8 Metagenomic Binning
    3.2.9 Creation of Large-Insert Fosmid Libraries
    3.2.10 Functional Screening
    3.2.11 Prediction of PULs on Fosmid Sequences
  3.3 Beaver Fecal Metagenome
    3.3.1 Community Member Analysis via SSU Ribosomal RNA Gene and Metagenome Sequencing
    3.3.2 Comparison to Other Microbiomes
    3.3.3 Assignment of Function to Taxonomy Using Metagenomic Binning
    3.3.4 Functional Screening of Beaver Fecal Library
    3.3.5 Identification of PULs on Fosmids
  3.4 Dissected Beaver Samples
    3.4.1 Community Member Analysis via SSU rRNA Gene Sequencing
    3.4.2 Functional Analysis Using Metagenomic Sequencing
    3.4.3 Linear Mixed Effects Model to Search for Patterns of CAZyme Abundance
    3.4.4 Functional Screening of Metagenomic Libraries From Lower Beaver Digestive Tract
    3.4.5 Identification of PULs on Digestive Tract Fosmids
  3.5 Conclusion
4 Subfamily Classification of Glycoside Hydrolase Family 43 Enzymes
  4.1 Introduction
  4.2 Subfamily Assignment of GH43 Protein Domains
  4.3 Mapping of Functional Characteristics to Subfamilies
    4.3.1 β-D-Xylosidase and α-L-Arabinofuranosidase Containing Subfamilies
    4.3.2 Endo-α-L-Arabinanase Containing Subfamilies
    4.3.3 β-1,3-Galactosidase Containing Subfamilies
    4.3.4 Uncharacterized Subfamilies
  4.4 GH43 Domain Co-occurrence with Other CAZy Modules
    4.4.1 CBM6
    4.4.2 CBM35
    4.4.3 CBM13
    4.4.4 CBM42
    4.4.5 X19
    4.4.6 Signal Peptides
  4.5 Proteins Containing Multiple GH43 Domains
  4.6 Conclusion
    4.6.1 Limitations
    4.6.2 Future Directions
5 Conclusion
  5.1 Related Research and Context
  5.2 Assumptions and Limitations
  5.3 Future Directions and Applications
  5.4 Closing
Bibliography

List of Tables

1.1 Properties of current high throughput sequencing methodologies
2.1 Relative substrate specificities of each enzyme, measured using 100 µM substrate. pNP: p-nitrophenyl
2.2 Biochemical characteristics of enzymes subcloned from positive fosmids
3.1 Sequencing and assembly statistics for beaver intestinal samples
3.2 Information on mammal microbiomes included for comparison to beaver feces microbiome.
3.3 Statistics for metagenomic bins created by MaxBin2 containing assembled metagenomic contigs.
3.4 Differential abundance of CAZy families in metagenome bins, sorted by log2 fold abundance in Bacteroidetes bins compared to Firmicutes bins.
3.5 Number of SSU rRNA gene sequences generated for each site and animal of dissected beavers.
3.6 Listing of families denoted as endo-acting or exo-acting for purposes of linear modelling
3.7 Enrichment of endo-acting:exo-acting CAZymes in the stomach compared to other locations within the digestive tract, as generated by linear modelling.
4.1 GH43 Subfamily Characteristics. Individual GH43 subfamily membership numbers and characteristics.

List of Figures

1.1 Functional Metagenomic Screening Workflow
1.2 Structure of Lignocellulose
1.3 Schematic of PUL Enzymes
2.1 Structure of DNPC Substrate and Hydrolysis Pathways
2.2 ABR Screening Results
2.3 Positive Clones Identified from ABR
2.4 Taxonomic Assignment of Fosmid Clones
2.5 Microbial Composition of ABR and Fosmid Sequences
2.6 Phylogenetic Tree of GH1 Sequences
2.7 Phylogenetic Tree of GH3 Sequences
2.8 Phylogenetic Tree of GH5_25 Sequences
2.9 Graphs of pH Dependence of Discovered Enzymes
2.10 Graphs of Thermal Stability of Discovered Enzymes
2.11 Summary of Positive Clones by Environment
3.1 Schematic of Beaver Digestive Tract Sampling
3.2 Relative Abundance of Phyla in Beaver Feces
3.3 Relative Abundance of Families in Beaver Feces
3.4 Clustering of Mammals and Xylotrophs Based on CAZyme Abundance
3.5 Comparison of CAZyme Abundance Between Mammalian Microbiomes
3.6 Screening Results from Beaver Fecal Library
3.7 Positive Clones Identified from Beaver Fecal Library
3.8 Comparison of GH Gene Abundance from Metagenome and Fosmids
3.9 Unique PULs Identified on Positive Fosmid Clones
3.10 Percent Abundance of Phyla At Dissected Beaver Sites
3.11 Clustering of Beaver Digestive Tract Compartments Based on SSU rRNA Gene Sequences
3.12 Abundance of CAZy Classes in Dissected Beaver Sites
3.13 Relative CAZy Family Abundance in Dissected Beaver Sites
3.14 Comparison of CAZyme Abundance on Fosmids From Each Digestive Tract Site
3.15 Venn Diagram of PULs Identified from Lower Digestive Tract Libraries
4.1 Structure of Arabinoxylan
4.2 Subfamily Tree of GH43 Protein Domain Sequences
4.3 Steric Similarity of α-L-Arabinofuranose and β-D-Xylopyranose
4.4 Co-occurrence of GH43 domains with other CAZy modules
4.5 Counts of enzymes containing multiple GH43 domains

Acknowledgements

Firstly, I would like to thank my advisor, Dr. Steven Hallam, for not only giving me an opportunity to work in his group, but for his endless support and encouragement, and for providing ideas, advice, and discussion on many scientific and non-scientific matters. I would like to thank members of my committee, Dr. Lindsay Eltis, Dr. Steve Withers and Dr. Carl Hansen for their support and timely insights into the direction of my project.

I would like to thank the members of the Hallam lab, past and present, for their assistance with my project. In particular, I would like to thank Zach Armstrong for being a complementary force in the multitude of projects we undertook together; Sam Kheirandish for overseeing and coordinating a wide range of laboratory efforts; Marcus Taupp for initial training and guidance for my project; Diane Fairley, Melanie Scofield, and Payal Sipahimalani for their tireless work ensuring the lab was kept running; and Niels Hanson, Kishori Konwar, Aria Hahn, Evan Durno, Connor Morgan-Lang, Young Song, and Charles Howes for their patient support and teaching related to statistics, computation, and programming.
All of your companionship will be greatly missed!

Additionally, I would like to thank the members of Bernard Henrissat's laboratory at the Architecture et Fonction des Macromolécules Biologiques laboratory in Luminy, particularly Nicolas Lenfant, Nicolas Terrapon, Mathieu Hainaut, Vincent Lombard, and Pedro Coutinho, for their assistance with CAZyme annotation and for their wonderful camaraderie during my visit.

I would like to thank all past and present Microballologists, particularly Jan Burian, Eric Brown, Erik Nielson, Craig Kerr, Sean Workman, Mike Jones, and Dave Nogas, among others, for constantly livening up the week and reminding me that we live in a microbial world.

I would like to thank my brother Craig for his pointed humour related to many aspects of my work, as well as my step-mom, Lee, and my Dad for their efforts to understand my work, and for many opportunities to practice describing it to a lay audience.

Finally, I would like to thank my amazing wife Georgia, without whose unending support I would not be at this point today.

Dedication

I dedicate this thesis to my Mum, who always had words of encouragement and positivity for all my endeavours both within and outside of science.

Chapter 1

Introduction

Microorganisms comprise the invisible majority of living things on the planet. There are an estimated 10^31 microbial cells on earth at any given time [74], with the vast majority remaining uncultivated in a laboratory setting [56]. Recent advances in high throughput DNA sequencing and molecular biology techniques have enabled access to this “uncultivated majority” through the direct sequencing of entire microbial communities, an approach termed metagenomics. These cultivation-independent methods have allowed for the discovery of novel taxa [135] and metabolic processes [103], and have contributed to our understanding of microbial evolution [135] and physiological roles within environments [170].
In a translational capacity, metagenomics has been applied to the fields of microbial systems engineering and synthetic biology [151], environmental remediation [3], and human health and disease [29]. The investigation of previously unexplored environments can reveal gene homologues with novel biochemical characteristics amenable to industrial processes [126]. These applications position metagenomic techniques as important tools for use in both primary and applied research in the near future.

Metagenomic studies typically seek to address questions of community membership (“who’s there?”) and function (“what are they doing?”) in the context of other environmental parameters (e.g., oxygen concentration, salinity, temperature, etc.) [190] or host conditions (e.g., diet, disease, age, etc.) [189]. The sequencing of marker genes such as the recombination gene recA, elongation factor thermo unstable (EF-Tu), or the small subunit ribosomal RNA gene (SSU rRNA) allows for assessment of the taxonomic structure of the community, because these genes are highly conserved across organisms. Taxonomic complexity varies widely across different environments, with extreme environments, such as hydrothermal vents and acid mine drainage, containing fewer than 500 unique taxa, and soils and sediments often containing more than 10,000 unique taxa. The functional potential of an environment can be assessed through environmental DNA sequence comparison to curated or uncurated databases of interest, such as the carbohydrate active enzymes database (CAZy) [94] or the NCBI nr database, respectively. Recent advances in sequencing technologies have rapidly increased the number of sequences in these databases, with the CAZy database more than doubling from 2008 to 2013, and doubling again in the three years since [94], and the nr database more than doubling in size every two years since 2012 [57].
Through metagenomic sequencing, gene catalogues have been established for environments including the human gut microbiome [130], oceans [160] and grassland soils [38], allowing for the elucidation of functional processes in these ecosystems. These studies provide baseline characterization of these ecosystems and allow further studies to identify how changing environmental conditions affect ecosystem function and diversity.

While biological sequences can be assigned to a specific protein family or pathway, biochemical attributes of proteins often cannot be predicted through sequence similarity alone. The accuracy of computational (in silico) gene annotation based on comparisons to curated databases is thus dependent on the quality of sequences deposited in them, the so-called “garbage in, garbage out” effect, which can result in significant mis-annotation of sequences [142]. The growth of sequence databases has not been mirrored by functional characterization efforts; for instance, although the CAZy database doubled in number of entries from 2008 to 2013, biochemically characterized CAZy enzyme (CAZyme) sequences increased only 35% over the same time frame. The use of a heterologous host such as E. coli to express random DNA fragments or targeted genes detected through in silico methods allows biochemical activities to be tested directly, better tying sequence data to protein function. Functional metagenomic screening approaches make it possible to discover new metabolites [73], protein subfamilies [62], and genetic operons [180] that share no characterized homologues. Thus, to continue expanding our knowledge through sequencing approaches, functional screening and characterization efforts must increase to keep pace.

This chapter outlines the existing research and motivation for the development of a high throughput functional metagenomic screen.
First, an overview of metagenomics is presented, detailing the high throughput sequencing methods that enable it, as well as the in silico and functional methods used to investigate environments of interest. Next, carbohydrates are discussed in the context of structure and function, as well as the enzymes responsible for their degradation, with a focus on the GH43 family and polysaccharide utilization loci (PULs) that act synergistically for carbohydrate breakdown. Finally, studies that have previously used metagenomic approaches for the identification of carbohydrate-degrading enzymes are discussed, framing the rationale behind the current work in developing a high throughput functional metagenomic screen and applying it to the beaver gut microbiome.

1.1 Metagenomics

The aim of metagenomics is to provide as broad and unbiased a view as possible of the structure and function of the microbial community for any given environment. Cultivation-dependent microbiological approaches are frequently biased towards bacteria whose metabolic necessities and laboratory growth conditions are well established, reflected in the dominance of four phyla (Actinobacteria, Firmicutes, Bacteroidetes, and Proteobacteria) representing 88% of all sequenced bacterial and archaeal genomes [135]. While direct cloning of environmental DNA was proposed as early as 1986 [121], large scale efforts required millions of individual sequencing reactions in order to generate the depth necessary to investigate complex environments [176]. With the advent of high throughput sequencing technologies, the ability to rapidly and cheaply sequence DNA from any environment of interest resulted in a plethora of metagenomic studies [95] [35] [170]. The continued development of DNA sequencers with longer reads, higher throughput, and greater accuracy has allowed for the exploration of more highly complex and changing environments [160].

Within the field of metagenomics, two key areas have emerged: in silico predictive methods to analyze DNA sequences and predict genes based on their similarity to sequence databases, and functional methods to interrogate metagenomic libraries for a desired trait based on growth selection or phenotypic screening. Computational methods have the advantage of much higher throughput, with datasets often containing billions of reads [68]. This allows the identification of fine scale differences, such as those between monozygotic twin microbiomes [171], as well as broader studies that look at microbiome diversity across both age and geography [189]. These studies serve as foundations for further research, but also provide insight into the genetic potential of an ecosystem and can be used for targeted functional searches for genes of therapeutic or industrial relevance [33]. Activity based functional screens are effective at identifying genes belonging to novel families [45], or with relevant biochemical attributes [126]. These methods form a strong complement to each other in the search to understand microbial diversity and function within an ecosystem.

1.1.1 High Throughput DNA Sequencing

In the past decade, there has been a rapid advancement in the capabilities of commercially-available DNA sequencing technologies, each with different strengths suitable for different applications. Most current sequencers rely on sequencing-by-synthesis, with different methods of detection to identify the newly polymerized base (Table 1.1). The work presented here relies on both 454 Pyrosequencing and Illumina reversible-terminator sequencing, and the merits and limitations of each will be discussed briefly here.

Pyrosequencing was introduced by 454 Life Sciences in 2005, representing the first platform capable of sequencing heterogeneous populations of DNA molecules in a single reaction [138].
Pyrosequencing libraries are prepared by fragmentation of DNA molecules followed by ligation to microbeads and clonal amplification on the surface of the beads in a water-in-oil emulsion. The beads are then deposited into a microwell chip that holds a single bead per chamber. A solution containing a single species of nucleotide (A, T, C or G) is flowed over the chip, and a luciferase enzyme produces light when a nucleotide is incorporated. This light is detected by a sensor, and the number of bases incorporated (which may be more than one in the case of a homopolymer repeat) can be deduced from the intensity of the light. This method relies on a comparable number of molecules being bound to each bead following emulsion PCR, which is not always the case. This can result in homopolymer errors, where the number of bases in a homopolymer run may be incorrect, leading to insertions or deletions that may interrupt gene prediction or annotation. In terms of performance, a single 454 reaction can produce up to one million reads from 500 to 700 bp in length.

The technology behind Illumina sequencing first debuted in 2006 under the name Solexa, which was subsequently acquired by Illumina in 2007. Illumina’s method attaches DNA fragments to a flow cell, where they are amplified to form clonal DNA clusters of thousands of identical DNA molecules. A single species of nucleotide is flowed over the entire cell and incorporated into all molecules in each cluster for which that nucleotide is required for extension. Illumina uses a fluorescent “reversible terminator” chemistry that blocks the 3’ extension of the newly synthesized DNA strand after incorporation, ensuring only a single base is added during each flow cycle. The flow cell is then imaged using a laser to excite the fluorophores, with each cluster that incorporated the nucleotide exhibiting a fluorescent signal. Following imaging, the reversible terminator moiety is removed and synthesis can continue for the next flow cycle.
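
To make the contrast with pyrosequencing concrete, the following minimal Python sketch simulates an idealized flowgram, in which light intensity scales with homopolymer length, and shows how a bead carrying fewer template molecules than expected can yield a homopolymer deletion. The cyclic flow order, the `gain` parameter (standing in for bead loading), and the function names are illustrative assumptions and do not reflect any instrument's actual software. Reversible-terminator chemistry sidesteps this particular inference because at most one base is added per cycle.

```python
from itertools import groupby

FLOW_ORDER = "TACG"  # assumed cyclic nucleotide flow order (illustrative only)


def flowgram(template, gain=1.0, flow_order=FLOW_ORDER):
    """Return (nucleotide, signal) pairs for an idealized pyrosequencing run.

    `template` is the strand being synthesized, read 5'->3'. `gain` is a
    hypothetical stand-in for the number of template molecules on a bead:
    the light produced by a flow is gain * (bases incorporated in that flow).
    """
    if set(template) - set(flow_order):
        raise ValueError("template contains bases outside the flow order")
    # Collapse the template into homopolymer runs, e.g. TTTTAC -> T4, A1, C1.
    runs = [(base, len(list(group))) for base, group in groupby(template)]
    signals = []
    next_run = 0
    flow = 0
    while next_run < len(runs):
        nucleotide = flow_order[flow % len(flow_order)]
        base, length = runs[next_run]
        if base == nucleotide:
            # The whole homopolymer run is incorporated in a single flow.
            signals.append((nucleotide, gain * length))
            next_run += 1
        else:
            signals.append((nucleotide, 0.0))
        flow += 1
    return signals


def call_bases(signals, expected_gain=1.0):
    """Infer run lengths by rounding each signal to the nearest whole base."""
    called = []
    for nucleotide, signal in signals:
        count = int(signal / expected_gain + 0.5)
        called.append(nucleotide * count)
    return "".join(called)


template = "TTTTACGG"
well_loaded = call_bases(flowgram(template, gain=1.0))
under_loaded = call_bases(flowgram(template, gain=0.85))
print(well_loaded)   # TTTTACGG
print(under_loaded)  # TTTACGG: the four-base T run is read as three
```

In this toy model a 15% shortfall in signal is harmless for single bases but is enough to shorten the four-base run, which is why longer homopolymers are the most error-prone.
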
A chief limitation with this technology is the requirement that all molecules within a cluster incorporate each base. Towards the end of each read, individual molecules within a cluster may vary from the consensus, leading to what is termed the “phasing” issue, which limits the lengths of reads. Improved chemistry has lengthened reads significantly, from 30 bp in the first generation machines to 300 bp in the current iteration. Currently, Illumina produces a range of machines with varying throughput, with the MiSeq used in this work able to produce up to 25 million paired-end reads up to 300 bp in length.

The choice between these technologies depends on the application. 454 sequencing has the advantage of longer read lengths, which facilitate better sequence assembly and allow for better prediction of genes from individual reads. It is also the most commonly used method for amplicon sequencing of a single gene, typically the SSU rRNA gene, in order to generate a taxonomic profile of an environment. Illumina sequencing benefits from the greater depth that can be obtained at the same cost when sequencing an environment, as well as from a lower error rate than 454, owing to the homopolymer errors of the latter. The use of paired-end reads by Illumina provides scaffolding information that can aid in complex assemblies. Whereas Illumina technologies previously suffered from very short (< 75 bp) read lengths, recent increases up to 300 bp and the use of mate-pair merging software such as FLASh [99] allow for the generation of highly accurate merged reads that approach the length of 454 reads.
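
The idea behind merging overlapping paired-end reads can be sketched as follows. This is a simplified illustration in Python and is not the algorithm implemented in FLASh; the function names, the minimum overlap, and the mismatch threshold are assumptions chosen for the example, and base quality scores are ignored entirely.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.translate(COMPLEMENT)[::-1]


def merge_pair(forward, reverse, min_overlap=10, max_mismatch_ratio=0.1):
    """Merge a read pair into one fragment, or return None if no overlap fits.

    `forward` and `reverse` are the two reads as sequenced from opposite ends
    of the same fragment. The reverse read is flipped onto the forward strand,
    and candidate overlaps are tried from longest to shortest; the first one
    whose mismatch fraction is within `max_mismatch_ratio` is accepted.
    """
    reverse_on_forward_strand = reverse_complement(reverse)
    longest = min(len(forward), len(reverse_on_forward_strand))
    for overlap in range(longest, min_overlap - 1, -1):
        tail = forward[-overlap:]
        head = reverse_on_forward_strand[:overlap]
        mismatches = sum(a != b for a, b in zip(tail, head))
        if mismatches <= max_mismatch_ratio * overlap:
            # Keep the forward read and append the non-overlapping remainder.
            return forward + reverse_on_forward_strand[overlap:]
    return None


# Hypothetical 30 bp fragment sequenced with two 20 bp reads (10 bp overlap).
fragment = "ATGCGTACCTGAAGCTTCGGATCCAGTCAA"
forward_read = fragment[:20]
reverse_read = reverse_complement(fragment[10:])
merged = merge_pair(forward_read, reverse_read)
print(merged == fragment)  # True
```

Here two 20 bp reads recover the full 30 bp fragment, which is the sense in which merged Illumina pairs can approach the length of longer single reads. A production tool would additionally score competing overlaps and use base qualities to resolve mismatches.
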
For accurate sequencing of complete genomes, a hybrid approach using both technologies has been used that relies on the long reads of 454 to produce scaffolds, which are then error corrected using the highly accurate Illumina data [84] [172]. The increased throughput, higher accuracy, and improving read lengths of the Illumina technology led Roche Life Sciences to discontinue support for the 454 platform in 2016.

In addition to these two dominant technologies, newer methods have arisen or are currently in development. The IonTorrent platform uses microchips with individual pH sensors capable of detecting the shift in pH arising from the incorporation of a new base onto a nascent DNA strand [137], and has the stated advantages of cheap and simple chemistry, and the use of naturally occurring bases. The MinION system from Oxford Nanopore detects the change in electrical current through a membrane pore as DNA bases of different sizes occlude different volumes of the pore [8]. This technology also has the advantage of being able to distinguish 5-methylcytosine for epigenetic studies. Pacific Biosciences technology uses single molecule real-time monitoring that allows for extremely long reads (up to 20 kb) and obviates the need for potentially biasing amplification steps [22], but currently has a limited throughput and lower accuracy.
While most sequencing is currently performed on the Illumina platform, the development of alternative technologies to complement it will enable a range of hybrid sequencing approaches that may provide greater insight into complex microbial communities.

Table 1.1: Properties of current high throughput sequencing methodologies

Platform             Read Length (bp)  Accuracy (%)  Data per run (Mbp)  Advantages                         Limitations
Roche 454            500 - 800         98            800                 Long reads                         Homopolymer errors
Illumina MiSeq       150 - 300         99            15,000              High coverage, low cost per base   Short reads
Ion Torrent          50 - 400          95            10 - 1000           Small size, inexpensive equipment  Homopolymer errors
Pacific Biosciences  5000 - 20,000     87            5,000 - 10,000      Very long reads                    Error rate, cost
Oxford Nanopore      10,000 - 60,000   unknown       600                 Very long reads                    Error rate, accessibility

1.1.2 in silico Metagenomic Approaches

The widespread availability and improvement of high-throughput DNA sequencing technologies has allowed exponential growth of data generation. The development of DNA sequencing technology has been accompanied by the development of the informatic tools necessary to handle ever-increasing amounts of data. Algorithms for DNA assembly, gene prediction, gene annotation, and taxonomic assignment have all seen improvements in speed and efficiency [68].

Assessment of Environmental Taxonomy via Small Subunit Ribosomal RNA Marker Genes

A primary goal of metagenomics is to answer the question “who’s there?” in terms of taxonomic distribution. Fox et al. [48] first proposed using small subunit ribosomal RNA (SSU rRNA) genes to assign taxonomy due to their ubiquity among organisms, constancy of function, and resilience to mutation. This was furthered by Lane et al. [87], who identified regions of the SSU rRNA gene that are conserved across all three domains of life, allowing for a “universal” primer set that is able to target SSU rRNA genes from all organisms within a sample. Along with conserved regions, Lane et al.
[87] also showed that partial sequencing of variable regions was sufficient to generate the same phylogenetic tree as was created from cloning and sequencing of the entire rRNA gene. Further efforts to generate reference trees based on multiple variable regions, firstly the V4 to V6 regions, allowed this approach to be scaled to short read, high throughput DNA sequencers [169]. Together, these primary insights into microbial sequence similarity established the SSU rRNA gene as the “gold standard” marker gene for microbial diversity studies.

Metagenomic DNA Sequence Assembly

The reads provided by the dominant high throughput sequencing methods (454 and Illumina) are all below 1 kb, which is sufficient for the prediction of individual genes or partial genes, but does not allow for the recovery of larger genomic structures such as operons. The assembly of metagenomic sequences into larger contiguous sequences, or contigs, allows for access to such information. A primary challenge with DNA sequence assembly is the inherent complexity of an environmental sample, with the quality of assembly typically being poor for samples with high complexity. Briefly, the two dominant methods of DNA assembly are overlap layout consensus and de Bruijn graph assembly. The computational technicalities of each are beyond the scope of this thesis, but a more comprehensive review is provided by Chaisson et al. [25]. The typical metric for the quality of an assembly is the N50 value, which is the largest contig length such that 50% of assembled bases are contained in contigs of length N50 or greater. While N50 is not a definitive metric of success, other metrics, including total assembly length or the percent of reads mapped to assembled contigs, can also provide insight into the quality of an assembly.

Metagenomic Binning Approaches

Another issue with short read sequences is the inability to attribute taxonomy to reads or genes due to the small amount of sequence content. By assembling short reads into contigs, additional genetic information becomes available for this purpose, allowing for the elucidation of function of a particular taxonomic group within a community. Recently, metagenomic binning methods such as MetaWatt [159], MetaCluster [179], and MaxBin [188] have been used to partition contigs into bins of shared taxonomy based on read depth abundance, tetranucleotide frequencies, and gene content. Once contigs are binned based on these metrics, taxonomy can be assigned to a bin through the use of last common ancestor (LCA) algorithms [69] [57] or by assessment of conserved single-copy marker genes [36]. These binning approaches enable the recovery of complete genomes from complex metagenomes [2], and allow attribution of key functions or pathways to individual populations [58].

1.1.3 Functional Metagenomic Approaches

A key aspect of metagenomic studies is the ability to assign function and taxonomy to the uncovered genes. Most studies rely on nucleotide or amino acid sequence similarity to known databases using seed-and-extend algorithms such as LAST [78] and BLAST [4], or weighted sequence comparisons that account for residues conserved across all members of a family, such as Hidden Markov Models (HMMs) [111]. However, the prediction of genes based on these methods remains a hypothesis until experimentally verified through functional studies. Additionally, it is usually not possible to predict specific biochemical capabilities of an enzyme, such as halotolerance, thermal stability, or inhibitor tolerance, without performing a biochemical assay for these traits.

The use of functional metagenomic screening approaches helps to address these limitations. Functional metagenomics entails the random cloning of environmental DNA into a suitable vector for expression within a heterologous bacterial host for identification of genes of interest (Figure 1.1).
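As a brief aside on the N50 metric defined in the assembly discussion above, the following minimal Python sketch computes N50 from a list of contig lengths. The function name `n50` and the example lengths are illustrative assumptions and are not taken from any assembler used in this work.

```python
def n50(contig_lengths):
    """Return the largest length L such that contigs of length >= L
    together contain at least 50% of all assembled bases."""
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("at least one contig length is required")
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        # The first contig that pushes the running total to half of the
        # assembly defines N50, because lengths are visited longest first.
        if running >= half_total:
            return length


# Hypothetical contig lengths in bp, for illustration only
example_lengths = [100, 200, 300, 400, 500]
print(n50(example_lengths))
```

For the hypothetical lengths above the total is 1500 bp; the two longest contigs (500 and 400 bp) together hold 900 bp, which exceeds half of the assembly, so the sketch reports an N50 of 400 bp.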
The environmental DNA typically bestows additional metabolic capacity to the host dependent on the genes encoded on it. The resulting collection of bacterial clones is termed a “metagenomic library” and ranges from thousands to hundreds of thousands of clones. A suitable screening approach for identification of a biochemical activity of interest is applied to these libraries, typically a synthetic model compound for degradation processes, or detection of a desired end product for biosynthetic processes. Once clones of interest are identified, the environmental insert is sequenced and analyzed similarly to an in silico study, or further work is performed to identify the gene or genes responsible for activity amongst the many that are present on the environmental insert. With this information, the gene may be subcloned into a more specialized vector for improved expression and purification in order to study the enzyme in isolation. The identification and characterization of an enzyme in this manner is much lower in throughput than in silico studies, but identifies proteins amenable to heterologous expression and with biochemical attributes that correspond to screening conditions.

Figure 1.1: Functional Metagenomic Screening Workflow. Beginning with a sample containing environmental microbes, functional screening is carried out by extracting the DNA, cloning it into a vector, transforming it into a host organism, and screening for an activity of interest. Downstream analysis includes sequencing of positive clones, gene identification, biochemical characterization, and comparison to other relevant environments. Figure adapted from Armstrong et al. [6].

The most common host for functional screening studies is E. coli, although other hosts including Pseudomonas putida and Caulobacter vibrioides have been explored in order to maximize recovery of genes in the face of host expression biases [32]. E. coli strains have been developed with attributes desirable for functional screening, including removal of recombination machinery [152] or inclusion of heterologous sigma factors to improve transcription of heterologous genes [50]. Additionally, a range of host vectors has been developed for E. coli, allowing for the use of both small (< 3 kb) [61] and large (> 20 kb) [165] environmental inserts as best suited for different applications. Small inserts are suitable for the discovery of single genes and can be driven by a strong promoter system [182], while large insert vectors such as fosmids are more suitable for the discovery of larger genetic loci [180]. Copy control systems have been developed that maintain stable, single copies of fosmids or bacterial artificial chromosomes (BACs), while also allowing for induction of high copy number [183] for improved screening or DNA purification for downstream processing [103]. The large insert size of fosmids may preclude the vector-encoded promoter from driving expression of genes in the middle of the insert, instead relying on recognition of endogenous promoters by the E. coli host for expression. This may limit the number of genes able to be functionally identified in the absence of genetically engineered E. coli as mentioned above [50].
The multitude of genes found on a large insert fosmid is a key advantage: since each environmental insert originated from a single donor genome, the additional genes allow for better taxonomic placement of inserts.

While functional metagenomics and in silico metagenomics offer two different approaches to the study of organisms and environment, they complement each other, as well as culture-based microbiology approaches. Part of the success of a functional metagenomic screen for a particular biochemical activity depends on looking in the right environment. The use of in silico methods can inform us of the genetic potential that exists within a sample prior to the creation of an environmental library. Similarly, novel genes of proven function are used to populate the databases upon which in silico gene annotations are based, thus improving the accuracy of these predictions. The amplification of bacterial genomes from single cells has made significant recent advances [105] [135], but the genomes produced still remain largely incomplete (mean completeness of 40% for Rinke et al. [135]). As such, the use of culture-based methods for the study of microbial genomics remains important. However, for microbes whose metabolic requirements are unknown, the use of metagenomically derived information can help identify the appropriate culture conditions to enable laboratory cultivation [146]. In this manner, in silico metagenomics, functional metagenomics, and culture-based microbiology methods all complement each other in efforts to understand the microbial forces that shape an ecosystem.

1.2 Carbohydrates and Carbohydrate Active Enzymes

Along with nucleic acids, proteins, and lipids, carbohydrates are considered one of the essential biological macromolecules.
Carbohydrates are comprised of individual monosaccharide units linked together in a linear or branched chain conformation, and may act as structural components of cells or as storage molecules for energy within a cell. As a structural component of plant cell walls, carbohydrate is most commonly found as lignocellulose, which is comprised primarily of cellulose, hemicellulose, pectin, and lignin (Figure 1.2). The ratio of these components varies depending on the type of plant, as well as the type of tissue within a plant (shoots and leaves versus bark or wood).

Cellulose is nature’s most abundant macromolecule [17] and is comprised of a linear chain of glucose subunits that typically number in the thousands, connected to each other by β-1,4 glycosidic bonds. On a macromolecular level, these β glycosidic bonds allow cellulose chains to pack tightly against each other and form inter-chain hydrogen bonds in structures known as microfibrils that become recalcitrant to enzymatic hydrolysis. Recently, enzymes have been discovered that are capable of attacking these crystalline microstructures and revealing free polysaccharide ends [19], but the large majority of cellulases are unable to cleave them, and they remain a hurdle in the industrial degradation of cellulosic biomass.

Hemicelluloses are a class of polysaccharides that aid in plant cell wall strength by binding cellulose microfibrils and aiding in cross-linking. Hemicelluloses are comprised of a range of monosaccharides, including xylose, mannose, arabinose, and galactose. Hemicelluloses such as arabinoxylan, arabinomannan, and xyloglucan contain a homogeneous polysaccharide backbone decorated with a range of both sugar and non-sugar sidechains.
Because of the variety of linkages found in hemicelluloses, they typically require more than one family of enzyme to completely degrade them, and the required enzymes are often co-regulated in polysaccharide utilization loci (PULs), discussed later in this chapter.

Lignin, while not a polysaccharide, also forms a substantial part of woody biomass. The heterogeneous nature of lignin precludes it from being degraded by defined enzymatic hydrolysis in the same manner as cellulose and hemicellulose. The predominant mechanism of lignin degradation is through radical oxygen chemistry, primarily by white-rot fungi belonging to the phylum Basidiomycota [104]. Lignin’s resistance to degradation hampers the ability of cellulose degrading enzymes to access their substrate, and poses a problem to industrial biomass processing efforts [139].

Figure 1.2: Structure of Lignocellulose. Lignocellulose is comprised of cellulose (green), hemicellulose (yellow), and lignin (red). Each of these components forms structures called microfibrils, which come together in larger macrofibrils that are responsible for structural stability in the plant cell wall. Figure from Rubin [139].

The ability of carbohydrates to form branched structures in nature, unlike the other biological macromolecules, means there is an incredible amount of carbohydrate diversity found in living organisms. This extensive variety of linkages, both between monosaccharides and conjugated to other molecules, is paralleled by the large cohort of enzymes involved in their synthesis and degradation. For most organisms, genes encoding carbohydrate active enzymes (CAZymes) comprise 1 - 5% of all genes, and range as high as 9.5% in the degradation powerhouse B. cellulosilyticus WH2 [108]. Efforts to discover and classify CAZymes began in 1991 with the creation of the CAZy database [65], which has become central to carbohydrate biochemistry.
The CAZy database currently harbours over 660,000 individual CAZymes separated into six classes: glycoside hydrolases (GHs), glycosyl transferases (GTs), carbohydrate esterases (CEs), polysaccharide lyases (PLs), carbohydrate binding modules (CBMs), and auxiliary activity enzymes (AAs) [94]. Each of these classes is further divided into families sharing sequence, structural, and biochemical characteristics.

The CAZy database is regularly expanded through the daily scouring of new entries in the GenBank database, as well as the weekly scouring of the PDB and UniProt databases. Newly deposited sequences are compared to a hidden Markov model (HMM) for each CAZy family created from all entries in that family, and all putative CAZymes are assigned to a family and gathered for manual curation. Each sequence is further compared to all sequences in the putative family via BLAST [4], and checked for C-terminal or N-terminal truncations relative to proteins with similar sequences. Full length entries meeting threshold criteria for both HMM (e-value at least 10^-20 lower for the specific family than any other family) and BLAST (at least 27% amino acid similarity over 100 residues) comparisons are entered into the public CAZy database. This manual curation step ensures only high quality sequences are entered into the database, which preserves the accuracy of annotations and predictive capacity of the database compared to other less-stringently curated databases.

1.2.1 Glycoside Hydrolases

Glycoside hydrolases (GHs) represent the largest documented class in the CAZy database, currently numbering over 320,000 individual entries. These entries are separated into 135 families based on sequence similarity that reflects homologous structures/folds, and catalytic capacity [65].
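The HMM and BLAST threshold criteria described above for entry into the CAZy database can be illustrated with a short Python sketch. This is not the CAZy pipeline itself: the `CandidateHit` fields and the `passes_thresholds` function are hypothetical names, and the phrase “at least 10^-20 lower” is interpreted here, as an assumption, to mean that the best family e-value is at least 20 orders of magnitude smaller than that of any other family.

```python
import math
from dataclasses import dataclass


@dataclass
class CandidateHit:
    """Hypothetical summary of one candidate sequence (illustrative only)."""
    best_family_evalue: float    # HMM e-value for the best-matching family
    next_family_evalue: float    # HMM e-value for the next-best family
    blast_similarity_pct: float  # percent amino acid similarity from BLAST
    blast_alignment_len: int     # number of residues in that alignment
    full_length: bool            # no N- or C-terminal truncation detected


def _log10_evalue(evalue, floor=1e-300):
    # E-values are sometimes reported as 0.0; clamp to avoid log10(0).
    return math.log10(max(evalue, floor))


def passes_thresholds(hit, min_log10_gap=20.0, min_similarity=27.0,
                      min_alignment=100):
    """Apply the assumed reading of the HMM and BLAST criteria."""
    gap = (_log10_evalue(hit.next_family_evalue)
           - _log10_evalue(hit.best_family_evalue))
    hmm_ok = gap >= min_log10_gap
    blast_ok = (hit.blast_similarity_pct >= min_similarity
                and hit.blast_alignment_len >= min_alignment)
    return hit.full_length and hmm_ok and blast_ok


# Hypothetical candidates, for illustration only
clear_hit = CandidateHit(1e-80, 1e-50, 35.0, 250, True)
ambiguous_hit = CandidateHit(1e-60, 1e-50, 35.0, 250, True)
print(passes_thresholds(clear_hit), passes_thresholds(ambiguous_hit))
```

Comparing e-values in log space avoids floating point underflow for very small values. In this sketch a candidate that passes would still go on to manual curation, as described in the text.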
New GH families are added upon discovery and characterization of an enzyme showing insufficient sequence similarity to other families, as well as a demonstrated catalytic activity.

Glycoside hydrolases cleave the glycosidic bond between two monosaccharides via general acid catalysis, using an acidic amino acid residue as a proton donor and a second acidic residue as a nucleophilic base, although exceptions that utilize a co-factor in one of these roles have been identified [20]. Glycoside hydrolases are classified as either endo- or exo-acting enzymes, depending on the location of polysaccharide cleavage. Exo-hydrolases act on the terminal sugars of a polysaccharide chain, typically cleaving off a single monosaccharide. Endo-hydrolases bind and cleave at a point in the middle of a polysaccharide, creating additional terminal ends that allow for action by exo-hydrolases. Exo-hydrolases are usually responsible for the debranching of complex polysaccharides, while endo-hydrolases are responsible for the cleavage of the backbone once branching chains have been removed [37]. The three primary topologies of the catalytic regions of GHs are termed pockets, clefts, and tunnels [37]. These topologies loosely correspond with the processivity (the number of enzymatic actions an enzyme performs before dissociating from the substrate) of a GH, as well as its target substrate. Enzymes with a pocket topology are most common in exo-hydrolases for debranching, while enzymes with clefts or tunnels are typically endo-hydrolases that act on longer chain sugars. The tunnel topology is thought to be most important for enzymes requiring a high processivity, as the enclosed nature of the enzyme allows it to remain tightly bound to the substrate following cleavage, which is important for the degradation of insoluble crystalline cellulose.
Each of these three topologies can, in principle, be built on similar protein folds, and as such a single GH family is not tied to a particular topology or exo- or endo-acting mechanism.

GH43 Family

The GH43 family is currently the ninth largest family in the CAZy database, containing 6215 members from all three domains of life, although these are predominantly bacterial (5834 sequences). The major reported activities are β-D-xylosidase, α-L-arabinofuranosidase, endo-α-L-arabinanase, and 1,3-β-galactosidase, a range of activities important for the debranching and degradation of hemicellulose substrates, in particular arabinoxylans and pectins. A number of GH43 enzymes have had their crystal structures solved and have been biochemically characterized for enzyme stereochemistry and active site residues [175] [117]. GH43 has emerged as an important family for biomass deconstruction efforts, as studies have found it to be expanded relative to other families in a number of plant cell wall degrading organisms [80]. In addition to biotechnological applications, the GH43 family has been implicated in human health, being among the most abundant gene families in the human gut microbiota [42] [187]. This importance has led to an increased focus on understanding [23] and engineering [107] enzymes belonging to the GH43 family.

1.2.2 Polysaccharide Utilization Loci

Many carbohydrates, such as xyloglucan and arabinoxylan, contain multiple different linkages between monosaccharide units that are unable to be degraded by a single enzyme. The genes underlying the sequential degradation of these substrates have been studied most extensively in members of the phylum Bacteroidetes, where multiple genes are located in close genomic proximity to each other, in clusters known as polysaccharide utilization loci (PULs) that can encompass up to 18% of the genome in some gut microbes [100].
Genes found within a PUL share transcriptional regulation machinery that promotes expression in response to a specific substrate [102]. This specific reactivity allows PUL-containing microbes to occupy a specific niche within the gut environment. PULs, with few exceptions, contain homologues of two outer membrane proteins (SusC and SusD) that are responsible for substrate binding (SusD) and import (SusC) [100], and can additionally contain sensor regulators such as sigma and anti-sigma factors, and phosphorelay systems to coordinate gene expression (Figure 1.3). These factors make each PUL responsive to a particular substrate, and confer a distinct fitness advantage to their microbial host in the presence of this substrate, allowing defined perturbation of gut microbial communities through the introduction of a specific polysaccharide into the diet of mice [187]. Identifying and cataloguing PULs is therefore an important facet of the increasing scientific focus on the human gut microbiome and its impact on human health.

The first characterization of PUL specificity was undertaken by Martens et al. in 2011 [102], in which two common gut microbes (B. thetaiotaomicron and B. ovatus) were grown in the presence of different dietary polysaccharides. Transcriptional profiling revealed that individual PULs responded to a particular polysaccharide, and that while some PULs are shared across species, each species also contains PULs unique to itself. In addition to PULs that target dietary polysaccharides, there are also PULs known to target host glycans in the mucosal lining of the intestine [129]. These PULs are strongly down-regulated in the presence of dietary polysaccharides, and are thus presumed to aid in the stability of the gut community in periods of low carbohydrate intake.
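The SusC/SusD pairing described earlier in this section suggests a simple way to flag candidate PULs in an annotated sequence. The Python sketch below is a toy heuristic only and is not a published PUL prediction algorithm. It assumes, hypothetically, that genes are supplied in genomic order with the annotation labels "SusC" and "SusD", and it reports adjacent pairs encoded on the same strand.

```python
from typing import List, NamedTuple, Tuple


class Gene(NamedTuple):
    """Hypothetical minimal gene record (illustrative only)."""
    name: str        # locus tag
    strand: str      # "+" or "-"
    annotation: str  # e.g. "SusC", "SusD", "GH43", or "" if unannotated


def find_susCD_pairs(genes: List[Gene]) -> List[Tuple[str, str]]:
    """Return names of adjacent, same-strand SusC/SusD homologue pairs."""
    pairs = []
    # Walk over neighbouring genes in genomic order.
    for left, right in zip(genes, genes[1:]):
        same_strand = left.strand == right.strand
        labels = {left.annotation, right.annotation}
        if same_strand and labels == {"SusC", "SusD"}:
            pairs.append((left.name, right.name))
    return pairs


# Hypothetical annotated fosmid insert, for illustration only
example_genes = [
    Gene("g1", "+", "GH43"),
    Gene("g2", "+", "SusC"),
    Gene("g3", "+", "SusD"),
    Gene("g4", "-", "SusD"),
    Gene("g5", "+", ""),
]
print(find_susCD_pairs(example_genes))
```

In this hypothetical example only g2 and g3 are reported. A realistic predictor would also need to consider intergenic distances and neighbouring CAZyme genes, which this sketch deliberately omits.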
Further study of PULs has shown they act in a defined, sequential order on a substrate [88], further emphasizing their specificity of degradative potential.

While characterization of individual PULs offers important insights into their pathways and mechanisms of action, this approach is not scalable to the hundreds or thousands of individual species found in the gut microbiome [18], each harbouring dozens of individual PULs [187]. Algorithms to automate prediction of PULs from genome sequences based on known and characterized members have identified thousands of novel PULs across the Bacteroidetes phylum [168]. These efforts require well curated databases to aid in annotation, and large sequence fragments in order to determine intergenic distances, directions, and gene locality. Large-insert functional metagenomic screening is well poised to identify potentially novel PULs from any given environment.

Figure 1.3: Schematic of Enzyme Locations and Functions of Genes Encoded in the Xyloglucan PUL of Bacteroides ovatus. In the phylum Bacteroidetes, long chain polysaccharides are extracellularly degraded into shorter oligosaccharides before import to the periplasm, where concerted actions of PUL-specific enzymes further degrade them and allow for monosaccharide import into the cytoplasm. Figure from Larsbrink et al. [88].

1.3 Functional Metagenomic Screening for Glycoside Hydrolases

The first reported use of functional metagenomic screens was by Healey et al. in 1995, in which organic matter from an anaerobic digester fed lignocellulose was used to create a metagenomic library that was subsequently screened for cellulase activity [61]. Early efforts to screen for GHs grew clones on solid agar media containing carboxymethyl cellulose (CMC) and used the method outlined by Teather and Wood [167] to detect clones showing hydrolase activity.
More recently, a range of fluorogenic substrates with different sugar moieties have been synthesized [26] that allow for more sensitive detection of hydrolytic activity in liquid assay plates. These substrates allow for the use of automated liquid handling robots that are capable of processing hundreds of thousands of clones per week [110]. While functional screening can identify a wide range of enzymes, the search for glycoside hydrolases has been among the most common [165].

Carbohydrates form a large part of most mammalian diets, despite most mammal genomes lacking the necessary genes to degrade a significant proportion of them, including cellulose and hemicellulose. Humans are capable of degrading simple starches through genomically encoded alpha-amylases, but the degradation of more recalcitrant carbohydrates is largely outsourced to commensal microbes that inhabit the lower digestive tract (defined here as beyond the small intestine). These microbes degrade the carbohydrates into monosaccharides, and ferment them into short chain fatty acids (SCFAs) that can be absorbed by the host through the colonic wall [98]. The composition of the gut microbial community is largely driven by host diet and digestive strategy, with functional convergence occurring across different mammalian lineages [90].

As such, gut microbiomes have been a primary target of many studies in the search for novel carbohydrate degradation genes. Feng et al. [44] identified novel cellulase genes from the rabbit cecum, Pope et al. [124] characterized the GH abundance of the foregut fermenting tammar wallaby and noted its difference from ruminants, and Hess et al. [66] used high throughput sequencing to identify genes from cow rumen that were subsequently cloned for characterization against a panel of substrates.
In addition to mammalian gut microbiomes, studies have looked at the gut communities of wood-feeding organisms (xylotrophs) including termites [181] and shipworms [119] in an effort to identify genes capable of processing woody biomass that contains a higher proportion of lignin. Because of lignin’s heterogeneous nature, synthesizing a model compound for it is very difficult, but creative screening approaches such as that by Strachan et al. [158] have identified potential lignin degrading genes in environments outside the gut. The inhibition of cellulases by softwood lignin preparations [14] has led to a desire for enzymes that may show lower levels of inhibition for use in industrial biomass processing facilities [67].

The North American beaver, Castor canadensis, is a natural candidate for the identification of lignocellulose degrading genes. Beavers are hindgut-fermenting herbivores whose diet is largely composed of bark, shoots, leaves, and other wood-based fibers from hardwood deciduous trees including poplar, aspen, and cottonwood [178] [34]. Previous work on their digestion has shown the capability of beavers to degrade 32% of available cellulose [34], making it a staple of their diet. Recent studies have looked at bacterial and archaeal diversity within the beaver tract [54], as well as wood degradation properties of isolate cultures from beaver feces [185], but no functional metagenomic studies have been undertaken to identify individual genes of interest. The use of functional methods for identification of these genes or gene cassettes can provide candidates for further biochemical characterization against natural lignocellulosic substrates.
Additionally, a comparative bioinformatic approach to identify similarities and differences between the beaver and other hindgut fermenters, as well as other xylotrophs, may reveal patterns of GH diversity and abundance that underlie the beaver’s capability to degrade lignocellulosic biomass.

1.4 Research Overview

This dissertation describes the design and implementation of a high throughput functional metagenomic screening assay for the detection of glycoside hydrolases, the use of this method to study carbohydrate degradation by the North American beaver microbiome, and the identification of genes and genetic loci that allow for a better link between protein sequence and biochemical function.

Chapter 2 describes the initial development of the screen, and its application to a metagenomic library derived from an anaerobic bioreactor fed with cellulosic biomass and designed for remediation of mine tailing wastes. This system provides a useful test-case due to its relatively low complexity, allowing for detailed characterization of discovered enzymes and validation of the screening and analysis workflow.

Chapter 3 presents work using the screen developed in Chapter 2 to interrogate metagenomic libraries from fecal matter and longitudinal digestive tract samples from the North American beaver Castor canadensis. Patterns of GH abundance and diversity were compared to other studies that investigated a range of mammals and other wood-feeding organisms (xylotrophs) in an attempt to identify patterns that separate the beaver from other hindgut fermenting herbivores. Longitudinal samples were used to build a linear model of enzyme abundance in an attempt to identify the sequential nature of biomass degradation as it traverses the beaver digestive tract. The use of large insert fosmids in functional screening approaches allowed for the identification of previously unknown PULs within both the fecal and digestive tract samples.

Chapter 4 combines data generated in the previous chapters with existing entries in the CAZy database to create a subfamily classification system for the GH43 family. This family is among the most abundant in the human gut microbiome, and is known to be expanded in the genomes of plant cell wall degrading microbes. GH43 genes were readily recovered by the substrates used in the functional screen, and three functionally identified GH43 enzymes were seen to belong to subfamilies that currently have no functionally characterized members. The future characterization of these discovered enzymes will allow for better linkage of enzyme function with sequence based annotation for these subfamilies.

Finally, Chapter 5 concludes with a discussion of the limitations of the results presented in the previous chapters, places the work in the scientific context of other studies, and lays out the basis for future work and improvements to the functional screen that may help in directing other researchers.

Chapter 2

Development of a High Throughput Functional Metagenomic Screen for Glycoside Hydrolases

This chapter presents the initial development and application of a high throughput screening pipeline capable of identifying GH genes from metagenomic libraries hosted in E. coli. This pipeline includes growth and lysis of E. coli clones hosted in 384-well plates, and detection of glycoside hydrolase activity through the hydrolysis of a chromogenic or fluorogenic model sugar substrate. The end product of this hydrolysis can then be detected using a suitable plate reader to detect clones harbouring GH genes. These clones can then be re-grown from the stock plate, and fosmid DNA can be purified in preparation for DNA sequencing.
Using sequence-similarity search algorithms such as BLAST or LAST, glycoside hydrolase genes can be identified through comparison to the CAZy database [94]. For fosmids containing multiple GH genes, transposon mutagenesis of fosmid DNA, followed by re-transformation and screening for loss of function transformants, can identify genes necessary for substrate hydrolysis. These genes can then be subcloned, expressed, and purified for more detailed biochemical characterization. This approach can be extended to screen for any activity for which a chromogenic or fluorogenic substrate can be developed. This screening pipeline was applied to metagenomic libraries originating from an anaerobic mining bioreactor [109], forest soils, hydrocarbon extraction sites, marine environments, and a fecal sample from the North American Beaver Castor canadensis.

2.1 Introduction

Cellulose is the most abundant biopolymer on Earth [79], with wide ranging industrial applications including an emphasis on biofuel production [139]. This has resulted in studies focussed on the identification of carbohydrate active enzymes (CAZymes) for the degradation of cellulosic biomass. Within these studies, two key areas have emerged:

1. In silico predictive methods to analyze DNA sequences and predict genes based on their similarity to sequence databases
2. Functional screening to interrogate metagenomic libraries for a desired trait based on growth selection or phenotypic screening

While recent advances in DNA sequencing technology, gene prediction, and annotation have buoyed computational methods, there remains a lag in the improvement of functional screening approaches. Looking forward, accurate predictions of enzyme function will depend on further validation of proteins with potentially novel activities [126] or mechanisms [13] to inform sequence guided searches.

Beyond the discovery of functional genes, the taxonomic assignment of such genes is of interest to microbial ecologists. Current DNA sequencing technologies output short reads (typically less than 300 bp) that hinder accurate taxonomic assignment from in silico studies. By screening fosmid libraries, additional genes located on the same fosmid as a GH gene can be used to more accurately predict the taxonomy of a DNA fragment.

2.2 Materials and Methods

The impetus for the development of this screen was to improve upon the carboxymethyl cellulose (CMC)/congo red screen for the detection of cellulase activity developed by Teather and Wood [167]. As opposed to the agar plates used by Teather and Wood, the use of a soluble, chromogenic model compound, 2,4-dinitrophenyl cellobioside (DNP-C), allows this screen to be performed in a liquid, 384-well plate format amenable to automation by laboratory robotics. This automation allows for screening to be performed on the order of 10^6 clones per week, representing approximately 40 Gb of DNA sequence.

This screening methodology was tailored for application to existing large insert metagenomic fosmid libraries hosted in E. coli, and created with the Epicentre CopyControl Fosmid Library Production Kit (Epicentre, Madison, WI). From glycerol stocks of these libraries, the screening pipeline involves replication of the library, lysis and addition of screening substrate, identification of positive clones, fosmid purification and sequencing, gene annotation, transposon knockout mutagenesis, and taxonomic assignment of fosmids.

2.2.1 Replication of Fosmid Libraries

Fosmid libraries were stored in 384-well plate format at −80 °C in Luria-Bertani (LB) broth supplemented with 12.5 µg/mL chloramphenicol for antibiotic selection, and 10% v/v glycerol as a cryoprotectant. These libraries were defrosted at 37 °C and used to inoculate 384-well plates, with each well containing 45 µL of LB supplemented with 12.5 µg/mL of chloramphenicol and 100 µg/mL of arabinose, without glycerol. The pCC1 vector backbone used for these libraries allows for the induction of a high fosmid copy number in the presence of arabinose. While elevated copy numbers of a fosmid do not guarantee elevated expression of their encoded proteins, empirical observations show an improved signal-to-noise ratio under the induction of arabinose. Replication was automated using the “replicating” function of a Genetix qPix2 picking robot (Molecular Devices, Sunnyvale, California). Following inoculation, the new plates were incubated in a humidified chamber at 37 °C for 16 hours, and the original library plates were returned to −80 °C.

2.2.2 Lysis and Addition of Screening Substrate

Following overnight growth, plates were removed from the incubator and cell density measurements were taken for each well by reading absorbance at a wavelength of 600 nm. These measurements can be used to identify wells that did not grow following inoculation.

Cells were lysed to release enzymes into the supernatant to hydrolyze the substrate. An equal volume of pre-made lysis and assay mix was added to each well, with the final concentration in each well being 1% Triton X-100 (Sigma-Aldrich, St.
Louis, Missouri), 10 mM Tris, 1 mM EDTA, 50 mM potassium acetate (pH 5.5), and 0.1 mg/mL of DNP-C. The DNP released from the DNP-C substrate has a pKa near 4.0 (Steve Withers, personal communication), which allows the screen to be carried out at pH 5.5, within the typical optimum pH range for cellulases of 3 to 7 [64].

After addition of lysis and substrate mix, 384-well plates were incubated at 37 °C for 16 hours to allow enzymes to hydrolyze the substrate.

2.2.3 Identification of Positive Clones

Enzymatic activity cleaves the glycosidic bond and releases 2,4-dinitrophenol into solution. This can be carried out by endo-acting cellulases, or by the sequential action of exo-acting β-glucosidases (Figure 2.1). The 2,4-dinitrophenol compound is a yellow chromophore that can be detected either by visual inspection, or through the use of a spectrophotometer, as it absorbs strongly at a wavelength of 400 nm.

To call a well positive for substrate hydrolysis, we set a 400 nm absorbance threshold of six standard deviations above the mean absorbance of all wells. Based on assumptions of normality, readings above this threshold have a type I error rate of 1.9 x 10^-9, sufficient for all fosmid libraries in our collection. For the remainder of this thesis, I will use the term “positive clone” to denote a clone capable of hydrolyzing the substrate of interest.

Figure 2.1: Methods of DNPC substrate breakdown. Chemical structure of DNPC substrate, and hydrolysis through the actions of cellulase (left), or multiple steps of β-glucosidase (right).

2.2.4 Fosmid Purification and Sequencing

Following identification of positive clones from a library, each clone is re-grown, and the fosmid is purified using the FosmidMAX DNA Purification Kit (Epicentre, Madison, WI).

Fosmids identified early in this work were sequenced on an Illumina GA IIx sequencer, and assembled fosmid sequences were provided by the Michael Smith Genome Sciences Center (GSC) in Vancouver, B.C. To improve turnaround time from purification to DNA sequence data, I sequenced later fosmids at the UBC Pharmaceutical Sciences Sequencing Center (PSSC) using the Nextera XT Library Preparation Kit (Illumina, San Diego, CA) on the Illumina MiSeq platform, using paired end 150 bp sequencing technology.

Raw DNA sequences were received from the PSSC in fastq format. As a quality control measure, reads were aligned to the E. coli K12 genome and the pCC1 vector backbone using the BWA aligner [91] and removed if they aligned to either sequence. Unaligned reads were further checked for quality using FastQC, and trimmed to a phred quality score of 30 (representing an error rate of 1/1000). Examination of reads using FastQC showed significantly similar per base sequence content over the first and last 15 bp of sequence; these bases were thus removed using the fastx-toolkit [53]. Those reads passing QC were assembled using the ABySS assembler [148] with a range of kmer values from 64 to 128.

Following assembly, fosmid sequences were compared to fosmid end sequences using BLAST [4], for libraries where end sequences were available, in order to verify correct well locations.
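The read trimming criteria in Section 2.2.4 can be illustrated with a short sketch. It converts a phred score to its error probability, clips a fixed 15 bases from both ends of a read, and then trims low-quality bases from the 3' end. This is a minimal stand-in written for illustration and is not the FastQC or fastx-toolkit implementation; the function names, the quality offset of 33, and the choice to trim only from the 3' end are assumptions.

```python
def phred_to_error(q):
    """Convert a phred quality score to the probability of a base-call error."""
    return 10 ** (-q / 10)


def decode_qualities(quality_string, offset=33):
    """Decode a FASTQ quality string (an ASCII offset of 33 is assumed)."""
    return [ord(ch) - offset for ch in quality_string]


def clip_and_trim(seq, quals, clip=15, min_q=30):
    """Clip `clip` bases from both ends, then trim 3' bases below `min_q`.

    Hypothetical helper: the thesis does not state the exact trimming
    strategy, so a simple trailing-base trim is used here.
    """
    seq = seq[clip:len(seq) - clip]
    quals = quals[clip:len(quals) - clip]
    end = len(seq)
    while end > 0 and quals[end - 1] < min_q:
        end -= 1
    return seq[:end], quals[:end]


# Illustrative read: 15 bp flanks around an 8 bp core with a low-quality tail.
read = "A" * 15 + "ACGTACGT" + "C" * 15
scores = [40] * 15 + [40, 40, 40, 40, 35, 30, 20, 10] + [40] * 15
trimmed_seq, trimmed_scores = clip_and_trim(read, scores)
print(phred_to_error(30), trimmed_seq)
```

A phred score of 30 corresponds to an error probability of 10^-3, which is the 1/1000 rate quoted in the text.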
For all downstream analysis, sequences greater than 10 kbp were retained.

2.2.5 Gene Annotation

Fosmid sequences were annotated for gene content using MetaPathways v2.5 [82], and compared to the curated and annotated KEGG [75], COG [164], RefSeq [128], MetaCyc [24], and CAZy [94] databases using the LAST algorithm. Parameters used were: minimum length of 60 bp, minimum bitscore of 20, minimum BLAST score ratio (BSR) of 0.4, and maximum e-value of 1 x 10^-6. Fosmid gene content was explored and exported through the graphical user interface of MetaPathways2 [57].

2.2.6 Transposon Knockout Mutagenesis

Many fosmids were found to contain multiple GH genes that may be capable of hydrolyzing the substrate. In order to identify active genes, transposon knockout mutagenesis was performed using the EZ-Tn5 KAN-2 Insertion Kit (Epicentre, Madison, WI). Purified fosmid DNA was subjected to transposon mutagenesis, transformed into competent Epi300 E. coli cells, and plated on LB agar supplemented with chloramphenicol (12.5 µg/mL) and kanamycin (100 µg/mL). For each positive clone, up to 1152 transposon mutated clones (mutants) were picked into 384-well plates and screened in the same manner as above for loss of activity. For each positive clone, the 24 mutants showing the lowest 400 nm absorbance were selected for Sanger sequencing using primers targeting each end of the transposon insert to generate one read in each direction. Sequencing was performed at the GSC using an Applied Biosystems 3730xl DNA Analyzer (Applied Biosystems, Foster City, CA). The resulting reads were mapped to each fosmid using BLAST.

2.2.7 Taxonomic Assignment of Fosmids

To identify the taxonomy of each fosmid clone, the Lowest Common Ancestor (LCA) algorithm was applied as implemented by MEGAN [69]. Each predicted open reading frame (ORF) was compared to the non-redundant (nr) protein database from the National Center for Biotechnology Information (NCBI) as a reference database.
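The lowest common ancestor idea can be illustrated with a short sketch. Given the taxonomic lineages of the reference hits for one ORF, the assignment is the deepest rank shared by every hit. This is a simplified illustration only: MEGAN applies additional filtering of hits that is not modelled here, and the function name and the two example lineages are hypothetical.

```python
def lowest_common_ancestor(lineages):
    """Return the longest shared prefix of a list of root-to-leaf lineages."""
    if not lineages:
        return []
    shared = []
    # zip(*lineages) walks the ranks in parallel, stopping at the shortest lineage.
    for ranks in zip(*lineages):
        if all(rank == ranks[0] for rank in ranks):
            shared.append(ranks[0])
        else:
            break
    return shared


# Hypothetical hits for a single ORF, written from domain down to genus.
hits = [
    ["Bacteria", "Bacteroidetes", "Bacteroidia", "Bacteroidales",
     "Bacteroidaceae", "Bacteroides"],
    ["Bacteria", "Bacteroidetes", "Bacteroidia", "Bacteroidales",
     "Prevotellaceae", "Prevotella"],
]
assignment = lowest_common_ancestor(hits)
print(assignment[-1] if assignment else "Unassigned")
```

In this example the two hits disagree at the family level, so the ORF is assigned no deeper than the order; its class-level assignment is still Bacteroidia.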
Assignment of all ORFs found on a fosmid served as a marker for fosmid taxonomy, and assignment of fosmids at the class level allowed for greater than 50% of fosmids to have greater than 50% of their ORFs assigned to a class.

2.3 Application of Screen to Anaerobic Bioreactor

The functional screen described above was first applied to a fosmid library created from core samples derived from an anaerobic bioreactor (ABR) system used to treat wastewater from a mine near Trail, B.C. [77]. The purpose of ABR systems such as this is to sequester heavy metal ions (including iron, arsenic, and zinc) from solution in the form of metal sulfides [115]. This particular ABR was fed with lignocellulosic feedstock from the pulp and paper industry as a carbon source for microbial growth. A previous study by Logan and colleagues [92] identified cellulose degradation as the limiting step in ABR performance, indicating potential enrichment of GH encoding genes in this system.

Samples were sourced by Marcus Taupp in collaboration with Dr. Susan Baldwin in the Chemical and Biological Engineering (CHBE) department at UBC. Marcus created a fosmid library consisting of 6144 clones from these samples prior to my work. Each of these clones had been end-sequenced from each end of the pCC1 vector backbone to give a small metagenomic dataset that provided insight into library composition. Additionally, a subset of these clones had previously been screened using the CMC/congo red assay described by Teather and Wood [167], which identified a clone that was used as a positive control for development of my high throughput assay.

2.3.1 Screening Results

All clones belonging to this library were screened according to the above workflow. Mean 400 nm absorbance across all clones was 0.55, with a standard deviation of 0.073. Results are presented in Figure 2.2.
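The six standard deviation rule from Section 2.2.3 can be sketched in a few lines. The example below computes the threshold from a set of well readings, reports which wells exceed it, and evaluates the normal tail probability beyond six standard deviations. The plate values are synthetic, the function names are hypothetical, and the use of the sample standard deviation is an assumption, since the thesis does not say which estimator was used.

```python
import math
import statistics


def positive_wells(absorbances, n_sd=6.0):
    """Return the calling threshold and the indices of wells above it."""
    mean = statistics.mean(absorbances)
    sd = statistics.stdev(absorbances)  # sample SD (assumption)
    threshold = mean + n_sd * sd
    hits = [i for i, a in enumerate(absorbances) if a > threshold]
    return threshold, hits


def normal_tail(n_sd, two_sided=True):
    """Probability that a normal variate lies more than n_sd SDs from the mean."""
    one_sided = 0.5 * math.erfc(n_sd / math.sqrt(2))
    return 2 * one_sided if two_sided else one_sided


# Synthetic 384-well plate: 383 background wells and one strongly yellow well.
plate = [0.5, 0.6] * 191 + [0.5] + [2.0]
threshold, hits = positive_wells(plate)
print(round(threshold, 3), hits)
print(normal_tail(6.0))
```

Under the normality assumption, the two-sided tail probability beyond six standard deviations evaluates to about 2 x 10^-9, in line with the error rate quoted in Section 2.2.3. Note that a single strong positive inflates the standard deviation of its own plate, so a very small plate could mask a true hit; with hundreds of wells per plate this effect is minor.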
Screening identified 15 positive clones, representing a positive hit rate of 1/410. Other functional screening assays for cellulase activity have reported hit rates of 1/2954 in rabbit cecum [44], 1/4600 in grassland soils [114], and 1/25,000 in compost [122], positioning this environment as relatively abundant in cellulase and β-glucosidase encoding genes.

Figure 2.2: Screening results from ABR fosmid library. 400 nm absorbance results for all 6144 clones from the ABR fosmid library screened with DNPC substrate. Green dotted line indicates mean absorbance of all clones (0.55), black dotted line indicates six standard deviations above the mean (0.89).
[Axes: ID number (x), 400 nm absorbance (y).]

Across all fifteen sequenced fosmids, 623 ORFs were predicted, including 41 GH genes from 15 separate GH families. To visualize fosmid similarity, fosmids were compared to each other using BLAST, showing 9 unique fosmid scaffolds (Figure 2.3). Each fosmid scaffold contained at least one GH1, GH3, or GH5 gene, and transposon mutagenesis reads mapped to regions harbouring these genes, showing their necessity for substrate hydrolysis.

Figure 2.3: Positive clones identified from ABR library. Circos representation of completed fosmid sequences. Grey bars represent each fosmid, labeled A1, B1, etc. Outer numbers show scale in kilobases (kb). Coloured bars within fosmids show locations of GH genes. Red histogram shows the abundance and locations of transposon insertions in 1 kb bins arising from transposon mutagenesis. Connections in the center are coloured by sequence similarity and show regions of nucleotide homology at greater than 90% similarity across intervals of more than 300 bp.
[Legend: GH1, GH2, GH3, GH5, GH6, GH16, GH20, GH27, GH29, GH30, GH43, GH55, GH65, GH95, GH106.]

2.3.2 Taxonomic Analysis

I sought to identify the taxonomic composition of positive clones to elucidate which members of the community were responsible for biomass degradation. Of the 623 ORFs predicted across all fosmids, 405 could be assigned at the class level, with a majority (329) of these assigned to the class Bacteroidia. MEGAN assignment of all ORFs found on each fosmid allowed for a reliable assignment of taxonomy for the complete fosmid, because each fosmid originates from a contiguous stretch of a single donor genome. Based on these assignments, nine of the fifteen positive clones were seen to have arisen from the class Bacteroidia. Bacteroidia contains the genus Bacteroides, the most studied genus of organic matter degrading organisms. Bacteroides spp. are most common in mammalian gut microbiomes, and contain PULs that have coordinated expression for the degradation of complex polysaccharide substrates [102]. Their presence in this organic matter rich ABR is unsurprising, and it is likely they are the primary drivers of lignocellulosic biomass degradation.

To determine the microbial taxonomic composition of the ABR, ORFs were predicted from all fosmid end sequences using MetaPathways [83], resulting in 17,648 ORFs, of which 9,659 could be assigned at the class level using MEGAN. Based on these results, the class Methanomicrobia dominated the fosmid library, accounting for 39.4% of all predicted ORFs (Figure 2.5). Members of this class of methanogenic Archaea have been previously seen in waste treatment bioreactors [15], and are known to utilize acetate or CO2 as an energy source [46]. The second most abundant class was Bacteroidia, a well known class of efficient organic matter degrading organisms, accounting for 6.2% of all predicted ORFs. Other classes, including Clostridia and Proteobacteria, were identified as less than 1% of the community.
The sharp dropoff of taxonomic abundance signals this system is of relatively low complexity compared to other ecosystems, such as forest soils [59].

Figure 2.4: Taxonomic assignment of fosmid clones. Taxonomic diversity at the class level for each completed fosmid. The width of each connection represents the percentage of predicted ORFs from each fosmid assigned to a class.
[Legend, class: Chlorobi, Proteobacteria, Anaerolineae, Verrucomicrobia, Bacteroidia, Other, Unassigned.]

2.3.3 Specific Analysis of GH Genes

In total, 130 GH genes from 37 separate GH families were identified on end sequences, representing 0.73% of all end sequence ORFs. Individual GH genes from end sequence ORFs were assigned taxonomically using MEGAN, with 83 genes assigned to Bacteroidia, 8 assigned to Clostridia, and the remaining 39 unable to be assigned at the class level. No identified GH genes belonged to Methanomicrobia, suggesting that the dominant community member does not contribute directly to carbohydrate degradation within the ABR.

2.3.4 Phylogenetic Placement of Discovered GH Genes

In conjunction with Young Song (M.Sc student, Hallam Lab, UBC), phylogenetic trees were built for GH families 1, 3, and 5 in order to identify GH genes that may be least similar to known genes. All known protein sequences for these families (1630 GH1, 2297 GH3, 875 GH5) were downloaded from the CAZy database in August 2011. Sequences were filtered by length (all retained sequences fell within one standard deviation of the mean in order to provide high quality alignments), and clustered by UCLUST [41] at 96% identity. Representative sequences of each cluster were aligned with MUSCLE [40] and used as input to RAxML [155]. Hidden Markov models (HMMs) were generated using HMMer3 [111], and each family was appended to the MLTreeMap [156] package as a set of functional marker genes.
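The length filter applied before clustering can be sketched as follows. The example keeps only sequences whose length lies within one standard deviation of the mean length. It is an illustration under stated assumptions rather than the script used in this work: the function name is hypothetical, the population standard deviation is assumed, and the sequences are toy placeholders.

```python
import statistics


def filter_by_length(sequences, n_sd=1.0):
    """Keep sequences whose length is within n_sd standard deviations of the mean.

    `sequences` maps a sequence identifier to its amino acid string.
    The population standard deviation is used here (assumption).
    """
    lengths = [len(seq) for seq in sequences.values()]
    mean = statistics.mean(lengths)
    sd = statistics.pstdev(lengths)
    low, high = mean - n_sd * sd, mean + n_sd * sd
    return {name: seq for name, seq in sequences.items()
            if low <= len(seq) <= high}


# Toy family: three typical-length members, one fragment, one fusion protein.
family = {
    "fragment": "M" * 100,
    "typical_a": "M" * 480,
    "typical_b": "M" * 500,
    "typical_c": "M" * 520,
    "fusion": "M" * 900,
}
retained = filter_by_length(family)
print(sorted(retained))
```

Removing unusually short and unusually long sequences in this way reduces the number of gapped columns in the downstream alignment, at the cost of discarding genuine but atypical family members.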
The phylogeny of these reference datasets was reconstructed with MLTreeMap and visualized using FigTree [134] (Figures 2.6, 2.7, and 2.8). The GH5 tree retained only 27% of original sequences after filtering and clustering, likely due to the high sequence divergence and length heterogeneity that resulted in the subfamily classification of the GH5 family in 2012 [9]. All but one GH5 gene from this study were seen to belong to subfamily 25, so the same procedure was used to reconstruct the GH5_25 tree, containing 33 sequences and rooted using the well characterized Trichoderma reesei endoglucanase III from subfamily 4 as an outgroup.

Figure 2.5: Microbial Composition of ABR and Fosmid Sequences. Taxonomic diversity of ABR system based on MEGAN assignment of predicted ORFs from end sequences (red) and positive fosmid clones (green). Bubble size is scaled as a percentage of all ORFs identified in end sequences (17648) or positive clones (623).
[Legend: End Sequence ORFs, Positive Fosmid ORFs.]

Figure 2.6: Phylogenetic tree of GH1 sequences. Phylogenetic tree showing the placement of discovered GH1 sequences from fosmids (red) and end sequences (green) among cluster representatives of the GH1 sequences from the CAZy database. Branch points with no circles show bootstrap values < 50%.
[Legend: Bootstrap value > 50%, Bootstrap value > 75%, End Sequences GH ORFs, Full Fosmid GH ORFs, Collapsed node.]

Figure 2.7: Phylogenetic tree of GH3 sequences. Phylogenetic tree showing the placement of discovered GH3 sequences from fosmids (red) and end sequences (green) among cluster representatives of the GH3 sequences from the CAZy database. Branch points with no circles show bootstrap values < 50%.
[Legend: Bootstrap value > 50%, Bootstrap value > 75%, End Sequences GH ORFs, Full Fosmid GH ORFs, Collapsed node.]

Figure 2.8: Phylogenetic tree of GH5_25 sequences. Phylogenetic tree showing the placement of discovered GH5_25 sequences from fosmids (red) and end sequences (green) among cluster representatives of the GH5_25 sequences from the CAZy database. Branch points with no circles show bootstrap values < 50%.
[Legend: Bootstrap value > 50%, Bootstrap value > 75%, End Sequences GH ORFs, Full Fosmid GH ORFs.]

Phylogenetic comparison of GH genes discovered from functional screening and end sequence analysis showed no overlap, suggesting a bias introduced through the use of the heterologous screening host, and/or a lack of sufficient coverage of the sequence complexity of the environment represented in the end sequences. Of particular note is that the most frequently recovered fosmid-encoded GH gene (GH3 from fosmids A1, F1, G1, and C2) was not found at all in end sequences, underlining this bias.
While it is possible that the lack of identification of end sequence encoded GH genes via functional screening may be due to truncated genes located at the ends of the fosmid insert, the converse observation, that no functionally identified GH genes were found on the end sequences, argues against this explanation.

The positioning of discovered GH genes within a family (or subfamily) reference tree indicates how far each gene has diverged from other family members. In future studies, this phylogenetic positioning analysis could be used to identify candidate genes for further biochemical characterization, as sequence diversity can signal unique characteristics, such as thermal tolerance or halotolerance, that may be of interest for specific academic or industrial applications.

2.3.5 Biochemical Characterization of Functional Genes

In order to assess biochemical characteristics of functionally identified genes, each of the nine unique GH1, GH3, and GH5 genes was cloned into a pET28a expression vector (Novagen/EMD Millipore, Darmstadt, Germany) hosted in E. coli BL21(DE3). Proteins were purified using a His-bind Resin (Novagen), and concentration was determined using the Micro BCA assay (Thermo-Fisher Scientific, Waltham, MA). Purified enzymes were tested for substrate specificity, pH dependence of kcat/KM, and thermal stability.

Substrate Specificity

Purified enzyme was added at a final concentration as listed in Table 2.2 to 100 µM of nine selected substrates, using p-nitrophenyl (pNP) glycoside substrates instead of 2,4-dinitrophenyl (DNP) substrates due to availability. The maximum absorbance change per unit time for each protein was given a value of 100 percent, and all other values are expressed as a percentage of this value. All GH1 and GH3 enzymes were seen to be optimally active on the glucoside substrate, while only one of the four GH5 enzymes was optimally active on the cellobioside substrate.
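The normalization used for the relative substrate specificities can be written out explicitly. In the sketch below, the fastest rate observed for an enzyme is set to 100 percent and every other rate is scaled against it. The function name, substrate labels, and rate values are hypothetical and are not measurements from this work.

```python
def relative_activity(rates):
    """Scale each rate to a percentage of the highest rate for that enzyme.

    `rates` maps a substrate name to an absorbance change per unit time.
    """
    top = max(rates.values())
    if top <= 0:
        raise ValueError("at least one substrate must show activity")
    return {substrate: round(100.0 * rate / top, 1)
            for substrate, rate in rates.items()}


# Hypothetical initial rates for one enzyme (absorbance units per minute).
example_rates = {
    "glucoside": 0.40,
    "cellobioside": 0.10,
    "xyloside": 0.01,
}
percentages = relative_activity(example_rates)
print(percentages)
```

Because each enzyme is normalized against its own best substrate, the resulting percentages describe substrate preference within an enzyme and cannot be compared between enzymes as absolute activities.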
As galactose is a C4 epimer of glucose, the optimal activity of enzymes B2_GH5 and H2_GH5 on the lactoside substrate suggests a tolerance for stereoisomers at this position. Since enzyme B1_GH5 harbours significant glucosidase activity, the lactosidase activity shown is likely the result of two sequential reactions, a galactosidase reaction followed by a glucosidase reaction. It is interesting to note that even though these enzymes were identified using DNP-cellobiose, they did not hydrolyze pNP-cellobiose as effectively, demonstrating the increased sensitivity of the DNP substrate and validating its use in functional screening assays.

pH Dependence of kcat/KM

To evaluate the pH dependence of kcat/KM, a substrate depletion method was employed [7]. A limiting substrate concentration (lower than 1/5 of KM for each enzyme) was added to buffer (100 mM sodium phosphate, 50 mM citrate, 10 mM sodium chloride, pH 3.5 to 8.0), and reactions were initiated by the addition of enzyme to a final concentration as listed in Table 2.2. Absorbance at 400 nm was monitored using a BioTek Synergy H1 plate reader (Biotek, Winooski, VT) until greater than 75% substrate depletion was observed. The pH of the reaction mixture was measured, and an aliquot was assayed for activity to verify enzyme stability. The progress curves were fit to the first order rate equation:

\[ A(t) = A_{\mathrm{inf}}\left(1 - e^{-kt}\right) + c \tag{2.1} \]

using GraFit 5.0 software (Erithacus Software, West Sussex, UK), to yield the first order rate constants. Division of these constants by the enzyme concentration gave kcat/KM values. The pKa values were assigned by fitting kcat/KM versus pH plots to the equation:

\[ \frac{k_{cat}}{K_M} = \left(\frac{k_{cat}}{K_M}\right)_{max} \left( \frac{K_{a1}\,[H^+]}{\left(K_{a1} + [H^+]\right)\left(K_{a2} + [H^+]\right)} \right) \tag{2.2} \]

The pKa1 value for protein E1_GH5 could not be determined accurately due to protein instability at pH values lower than 5.5.
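The two equations above can be made concrete with a small numerical sketch. It recovers a first order rate constant from a simulated progress curve, converts it to kcat/KM by dividing by the enzyme concentration, and evaluates the bell-shaped pH profile of Equation 2.2. The fits in this work were performed in GraFit by nonlinear regression; the log-linear estimate below is a simpler stand-in that assumes the plateau and offset are already known, and all function names and numerical values are hypothetical.

```python
import math


def estimate_k(times, absorbances, a_inf, c=0.0):
    """Estimate k in A(t) = a_inf * (1 - exp(-k t)) + c.

    Uses a least-squares line through the origin of
    -ln(1 - (A - c) / a_inf) versus t, assuming a_inf and c are known.
    """
    y = [-math.log(1.0 - (a - c) / a_inf) for a in absorbances]
    return sum(t * v for t, v in zip(times, y)) / sum(t * t for t in times)


def kcat_over_km(k_obs, enzyme_molar):
    """With [S] well below KM, the first order constant equals (kcat/KM)[E]."""
    return k_obs / enzyme_molar


def ph_profile(ph, max_value, pka1, pka2):
    """Evaluate Equation 2.2 at a given pH."""
    h = 10.0 ** (-ph)
    ka1 = 10.0 ** (-pka1)
    ka2 = 10.0 ** (-pka2)
    return max_value * (ka1 * h) / ((ka1 + h) * (ka2 + h))


# Simulated noise-free progress curve with k = 0.002 per second.
true_k, a_inf, offset = 0.002, 0.8, 0.05
times = [60.0 * i for i in range(1, 16)]
curve = [a_inf * (1.0 - math.exp(-true_k * t)) + offset for t in times]

k_fit = estimate_k(times, curve, a_inf, offset)
print(k_fit, kcat_over_km(k_fit, 10e-9))  # 10 nM enzyme (hypothetical)
print(ph_profile(5.5, 1.0, 4.3, 6.5))
```

In Equation 2.2 the profile falls to half of its maximum when the pH equals pKa1 on the acidic limb or pKa2 on the basic limb, provided the two pKa values are well separated.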
Complete pH profiles for all nine subcloned enzymes are shown in Figure 2.9.

Thermal Stability

Aliquots of each purified enzyme were incubated at temperatures between 27 °C and 65 °C for 10 minutes prior to performing a substrate depletion assay at the temperatures listed in Table 2.2 to determine the temperature at which the enzyme irreversibly loses 50% of activity (herein called the T50 value). All enzymes displayed T50 above 37 °C, with most values between 42 °C and 47 °C, except enzyme D2_GH1, which had a melting point of approximately 38 °C and thus had no activity after this test. Complete thermal stability profiles above 37 °C for all other enzymes are shown in Figure 2.10.

Table 2.1: Relative substrate specificities of each enzyme, measured using 100 µM substrate. pNP: p-nitrophenyl

Substrate                     D2_GH1  F2_GH1  H1_GH3  A1_GH3  D1_GH3  E1_GH5  B1_GH5  B2_GH5  H2_GH5
pNP β-cellobioside                 -    22.8    28.6       -       -   100.0    11.7    16.7    22.6
pNP β-D-glucopyranoside        100.0   100.0   100.0   100.0   100.0    56.0    28.3     3.1     2.8
pNP β-lactoside                  2.1     1.5       -       -     1.9       -    76.7   100.0   100.0
pNP β-D-galactopyranoside       27.1       -     1.6     1.0       -    28.0   100.0     9.2     1.0
pNP β-D-fucopyranoside          70.8    42.7       -       -       -    68.0    15.8     1.3       -
pNP β-D-xylopyranoside           2.1     2.6       -       -       -     2.0       -       -       -
pNP α-L-rhamnofuranoside         2.1       -       -       -       -       -       -       -       -
pNP β-D-mannopyranoside            -     9.1       -       -       -       -       -       -       -
pNP α-L-arabinofuranoside          -       -       -       -       -       -       -       -       -

Table 2.2: Biochemical characteristics of enzymes subcloned from positive fosmids

Enzyme   [Enzyme] (nM)  Optimal glycoside  Assay temp (°C)  Optimal pH  kcat/KM (M^-1 s^-1)  pKa1  pKa2  Tm (°C)
D2_GH1   100            Glucose            37               5.5         1.19 ± 0.01 x 10^5   4.3   6.5   38
F2_GH1   10             Glucose            37               5.0         1.41 ± 0.03 x 10^5   4.6   5.7   46
A1_GH3   1              Glucose            37               6.5         1.32 ± 0.01 x 10^5   5.2   8.8   46
D1_GH3   10             Glucose            37               5.0         5.80 ± 0.03 x 10^3   4.2   5.9   47
H1_GH3   10             Glucose            37               5.5         1.46 ± 0.01 x 10^5   4.1   7.4   47
B1_GH5   100            Glucose            37               5.5         7.76 ± 0.04 x 10^3   4.9   7.0   43
B2_GH5   100            Cellobiose         30               6.0         2.1 ± 0.02 x 10^4    4.6   6.8   45
E1_GH5   1              Cellobiose         27               5.5         8.4 ± 0.07 x 10^3    -     6.4   42
H2_GH5   100            Cellobiose         30               5.5         1.1 ± 0.01 x 10^4    4.7   6.3   52

Figure 2.9: pH dependence for subcloned enzymes. pH stability profiles for all nine subcloned enzymes.
[Panels: A1_GH3, B1_GH5, D1_GH3, E1_GH5, H1_GH3, B2_GH5, D2_GH1, F2_GH1, H2_GH5. Axes: pH (x), kcat/KM in M^-1 s^-1 (y).]

Figure 2.10: Thermal stability of subcloned enzymes. Thermal stability profiles above 37 °C for all subcloned enzymes, except D2_GH1.
[Axes: Temperature in °C (x), [E]/[Eo] (y).]

2.4 Further Environments Studied

In addition to the mining ABR discussed above, this screen has identified glycoside hydrolase positive clones from metagenomic libraries derived from natural and timber-harvesting impacted forest soil sites, hydrocarbon extraction sites, toluene and naphthalene degrading enrichment cultures, marine sites in the Pacific Ocean, and fecal and intestinal samples from the North American beaver Castor canadensis. In total, 798 fosmids were identified and sequenced, with a breakdown by site presented in Figure 2.11.

Figure 2.11: Summary of positive clones by environment. A summary of each fosmid library this screen was applied to, and positive clones identified from each.

2.5 Conclusion

This high-throughput functional metagenomic screen provides a rapid and scalable method for the detection of enzymatic activity from fosmid libraries. The ability to utilize any substrate of interest that produces a fluorometric or colorimetric readout makes it extensible to a wide range of targets. Furthermore, clones identified through this screen are readily expressed and retain activity in E. coli, enabling efficient purification and biochemical characterization. The application of this screen to a fosmid library derived from an anaerobic bioreactor serves as a strong proof of principle for its utility.

2.5.1 Improvements

Since the initial implementation of the screen described here, further refinements have been made to improve sensitivity and scalability.

To improve sensitivity, the current iteration of this screen utilizes fluorometric 6-chloro-4-methylumbelliferyl substrates [26]. These substrates provide a higher signal-to-noise ratio compared to the original DNPC substrate, thus further reducing type I errors. The 6-chloro derivative of 4-methylumbelliferone has a pKa of 5.9, and provides a strong fluorescent signal at pH 7, relevant for most screening applications. Additionally, the acetate buffer has been replaced with sodium phosphate buffer. Rescreening the ABR library in phosphate buffer identified the same clones for cellulase activity, as well as cellobiose phosphorylase enzymes. These enzymes have been identified as more efficient for cellobiose fermentation, as they generate one molecule of glucose and one molecule of glucose-1-phosphate, which can feed directly into cellular metabolism without the use of additional ATP [28].
Although phosphate buffers can sequester divalent cations necessary for some reactions, our empirical data suggest that the use of phosphate buffer does not detract from cellulase screening efficiency.

To improve scalability, Zach Armstrong characterized a panel of 6-chloro-4-methylumbelliferyl glycosides to identify ones that may be pooled in a single reaction vessel in order to multiplex screening conditions. Following screening, positive clones can be de-multiplexed by rescreening each hit against the substrates individually. For libraries that may contain hundreds of plates, this approach generates significant cost and time savings.

2.5.2 Limitations

While this method allows for the discovery of enzymes from any environment, it is still limited in a number of respects.

The discovery of genes of interest relies on their successful expression in E. coli. As evidenced by the discrepancy between functionally recovered GH genes and those identified on end sequences, this bias means that using only one approach is unlikely to yield complete knowledge of a system. A number of strategies have been attempted to overcome this limitation (reviewed in [86]), including the use of alternative expression hosts [32], or the modification of E. coli [50].

Despite improvements in the throughput of functional screening, it remains significantly slower than informatic searches of metagenomic sequencing data based on sequence similarity. The advantage of functional screening is the ability to characterize discovered genes. However, with the decreasing cost of gene synthesis it is now feasible to select a broad phylogenomic range of enzymes to characterize, and have a library of these informatically identified genes created for screening [63].
The limitation to this approach is that it will not be able to identify genes with novel sequences or mechanisms.

The biochemical characterization of discovered genes showed they all shared similar characteristics with respect to pH dependence and thermal tolerance, both of which closely match the screening parameters used. In other words, you get what you screen for. Without rescreening each library at a range of pH or temperature values, we are likely biased to the identification of enzymes optimized for screening conditions. Furthermore, in our E. coli host system, we are limited by the host growth conditions, which may have an impact on heterologous expression or protein folding.

Chapter 3

Functional and In Silico Metagenomic Analysis of the Castor canadensis Gut Microbiome

This chapter presents the application of the screen presented in Chapter 2 to fosmid libraries derived from the digestive tract and feces of the North American Beaver (Castor canadensis). Initial work was performed on samples obtained from a holding pond at Critter Care Wildlife Society in Langley, B.C. Followup work involved obtaining the digestive tract of freshly trapped beavers from a local trapper (Allan Starkey, Maple Ridge, B.C.), and sampling contents at five locations throughout the tract for similar analyses. For each sample, small subunit (SSU) rRNA gene pyrotag sequencing and metagenomic sequencing were performed, and for each unique site within the lower digestive tract (cecum, proximal colon, rectum) and fecal sample, a metagenomic fosmid library was created and used for functional screening. Metagenomic reads were used to determine potential carbohydrate degradation processes within each sample.
Functional screening of fosmid libraries was used to identify genes active against model carbohydrate substrates, and to recover long stretches of genomic DNA suitable for identifying polysaccharide utilization loci (PULs).

The fecal sample was processed prior to, and separately from, samples from the dissected beavers, but shared some common methodology. Where methodologies differ, I will denote them as fecal sample and intestinal samples, respectively. I will first detail analysis of the fecal sample, followed by that of the dissected beavers.

3.1 Introduction

Within the mammalian gut, the microbial communities inhabiting the digestive tract mediate dietary access to recalcitrant starches and fibers [101]. Recent studies of gut microbiomes have been performed to understand how nutrients are made available to the host [145], how diets influence metabolic capacity [140], and to identify novel degradation genes [124]. Taken together, these studies suggest that a gut microbiome can be viewed as a bioreactor with selective pressure for the production of enzymes that are optimized to degrade organic material from the host's diet. In turn, this provides an ideal environment for the discovery and study of genes encoding the enzymes responsible for these degradation processes. This perspective has been applied to the study of the gut microbiomes of herbivores such as wallabies [124] and cows [66], along with wood-eating invertebrates such as termites [11], the Asian long-horned beetle [143] and the shipworm [119]. In an extension to these efforts, we chose to examine the North American beaver due to its diet of woody biomass, with hopes to identify genes particularly suited to the degradation of lignocellulose.

Beavers have a diet largely composed of bark, shoots, and leaves from hardwood deciduous trees including poplar, aspen, and cottonwood [178].
As the beaver is a hindgut fermenter, commensal microbes within its lower digestive tract are responsible for the breakdown of polysaccharides, and fermentation of monosaccharides to short chain fatty acids for the mammalian host [116]. A study by Currier et al. [34] showed the beaver is capable of digesting 32% of dietary cellulose, making it a staple component of the beaver diet, and making the beaver digestive tract an ideal environment to study the processes and genes specific to the deconstruction of lignocellulose. Using functional metagenomics, we can identify genes acting on any specific polysaccharide linkages for which a substrate can be synthesized, as well as gain insight into the genomic organization of genes that may act in concert to catabolize complex polysaccharides.

While previous studies have gathered single samples of fecal matter from the cecum or post-defecation, I also looked at a longitudinal section of the gut to search for patterns of abundance of individual GH families and PULs that may change as biomass progresses through the digestive tract.

3.2 Materials and Methods

3.2.1 Sample Collection

Fecal samples were collected by Zach Armstrong and Kevin Mehr in 50 mL Falcon tubes (Thermo-Fisher Scientific, Waltham, MA) on April 24th, 2012 from two beavers that were being cared for at the Critter Care Wildlife Society located in Langley, B.C., Canada. Animals were fed branches from a variety of woody plant species native to the area. Due to the difficulty of obtaining fresh fecal matter, as beavers defecate underwater, samples were collected from material that had accumulated at the enclosure water outflow grating within 12 hours of cleaning. As both beavers shared the same enclosure, it was not possible to identify from which animal the samples came.
Samples were frozen in a slurry of dry ice and ethanol, transported to the lab on dry ice, and stored at −80 °C prior to DNA extraction.

Intestinal samples were collected by Zach Armstrong and myself from beavers freshly trapped (< 24 hours) by Allan Starkey in Maple Ridge, B.C., Canada between January 18th, 2014 and April 14th, 2014. Six beavers were dissected in Maple Ridge to remove the entire digestive tract (esophagus to rectum), which was transported on ice to UBC. Digestive tracts were dissected and sampled at 5 locations (stomach, small intestine, cecum, proximal colon, and rectum) as shown in Figure 3.1. Collected samples were flash-frozen in a bath of dry ice and ethanol and stored at −80 °C.

Figure 3.1: Beaver digestive tract sampling. Stars denote locations of samples collected from freshly trapped beavers. Figure adapted from Vispo et al. [178].

3.2.2 Extraction of High Molecular Weight DNA

High molecular weight DNA was extracted by Zach Armstrong and Melanie Scofield as described previously [89]. Four grams of beaver feces were thawed and extracted in two-gram duplicates. The samples were ground by mortar and pestle under liquid nitrogen, and extracted three times with extraction buffer (100 mM sodium phosphate pH 7.0, 100 mM Tris-HCl, 100 mM EDTA, 0.5 M NaCl, 1% hexadecyltrimethylammonium bromide, 2% sodium dodecyl sulfate) at 65 °C with rotation. The resulting supernatant was washed with chloroform-isoamyl alcohol. Finally, DNA was purified by isopropanol precipitation and quantified using the PicoGreen assay (Invitrogen).

3.2.3 PCR Amplification of Small Subunit Ribosomal RNA Genes

Following DNA isolation, the V6-V8 region of the SSU rRNA gene was amplified by Zach Armstrong with primers 926F (5′-AAA CTY AAA KGA ATT GAC GG-3′) and 1392R (5′-ACG GGC GGT GTG TRC-3′).
Reverse primer sequences were modified to include the 454 adaptor sequence and a 5 bp barcode for multiplexing during sequencing. Reactions were run in duplicate under the following conditions: initial denaturation at 95 °C for 3 minutes; 25 cycles of denaturation at 95 °C for 30 seconds, primer annealing at 55 °C for 45 seconds, and extension at 72 °C for 90 seconds; and a final extension at 72 °C for 10 minutes. Each 50 µL reaction contained: 1-10 ng template DNA, 0.6 µL Taq polymerase (Bioshop Canada Inc., 5 U/µL), 5 µL 10x reaction buffer, 4 mM MgCl2, 0.4 mM of each dNTP (Invitrogen), 200 nM each of forward and bar-coded reverse primers, and 33.4 µL nuclease free water (Thermo-Fisher Scientific). Duplicate reactions were pooled and purified using a QIAquick PCR Purification Kit (Qiagen, Hilden, Germany) and quantified using the PicoGreen assay (Invitrogen, Carlsbad, CA). Samples were diluted to 10 ng/L and pooled in equal concentrations for multiplexed SSU rRNA gene pyrosequencing.

3.2.4 Small Subunit Ribosomal RNA Gene Sequencing

Amplicon pools were sent to the McGill University and Genome Quebec Innovation Centre for 454 pyrosequencing using the Roche 454 GS FLX Titanium (454 Life Sciences, Branford, CT, USA) technology according to the manufacturer's instructions.

The software package Quantitative Insights Into Microbial Ecology (QIIME) [21] was used by Zach Armstrong to analyze SSU rRNA gene sequences. As a quality control step, sequences with quality scores less than Q25, those containing ambiguous bases or homopolymer runs, chimeric sequences, and those shorter than 200 bp were removed. The remaining sequences were clustered at the 97% identity threshold with a maximum e-value cut-off of 1 × 10⁻¹⁰ using UCLUST [41], as implemented in QIIME.
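
The quality-control criteria listed above can be illustrated with a minimal Python sketch. It is not the QIIME implementation: the `Read` class, the function names, the use of a mean quality score, and the homopolymer limit of 6 bp are all assumptions made for illustration (the text states only the Q25 and 200 bp thresholds), and chimera detection is omitted because it requires a dedicated tool.

```python
import re
from dataclasses import dataclass
from statistics import mean
from typing import List

MIN_MEAN_QUALITY = 25  # Q25 threshold stated in the text
MIN_LENGTH = 200       # minimum read length (bp) stated in the text
MAX_HOMOPOLYMER = 6    # ASSUMPTION: the text does not give a run length


@dataclass
class Read:
    """Hypothetical container for one amplicon read."""
    seq: str
    quals: List[int]  # one Phred score per base


def longest_homopolymer(seq: str) -> int:
    """Length of the longest run of a single repeated base."""
    return max((len(m.group(0)) for m in re.finditer(r"(.)\1*", seq)), default=0)


def passes_qc(read: Read) -> bool:
    """Apply the length, ambiguity, quality, and homopolymer filters."""
    if len(read.seq) < MIN_LENGTH:
        return False
    if set(read.seq.upper()) - set("ACGT"):  # any ambiguous base (N, R, Y, ...)
        return False
    if not read.quals or mean(read.quals) < MIN_MEAN_QUALITY:
        return False
    if longest_homopolymer(read.seq) > MAX_HOMOPOLYMER:
        return False
    return True


if __name__ == "__main__":
    reads = [
        Read("ACGT" * 60, [30] * 240),          # passes every filter
        Read("ACGT" * 10, [30] * 40),           # too short
        Read("ACGT" * 59 + "ACGN", [30] * 240), # ambiguous base
    ]
    kept = [r for r in reads if passes_qc(r)]
    print(f"{len(kept)} of {len(reads)} reads kept")
```

Filtering on length first keeps the cheaper checks ahead of the regular-expression scan; the order does not change which reads are retained.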
Singleton operational taxonomic units (OTUs) were omitted from downstream analyses.

Taxonomic assignment for each OTU cluster was performed using the Basic Local Alignment Search Tool (BLAST) [4] and the SILVA database version 111 [131] with a confidence level of 0.8 and a maximum e-value cut-off of 1 × 10⁻³. OTU abundance was normalized to the total number of reads recovered, and expressed as a normalized percentage for analysis.

3.2.5 Metagenomic Sequencing

DNA from the fecal sample was sent to Genome Quebec Innovation Center for 454 pyrosequencing using the Roche 454 GS FLX Titanium (454 Life Sciences, Branford, CT, USA) technology according to the manufacturer's instructions. This resulted in 616,811 reads with an average length of 761 bp and a total length of 469.2 Mbp.

DNA from intestinal samples was sequenced at the UBC Pharmaceutical Sciences Sequencing Center on an Illumina MiSeq using paired-end, 300 bp Nextera XT chemistry (Illumina, San Diego, CA). Fifteen samples (each of five sites from three different beavers) were indexed and pooled for sequencing.

3.2.6 Metagenomic Assembly

Reads from the fecal sample were trimmed to a Phred quality score of Q30 (0.1% error rate) [43] using prinseq lite+ [141], and assembled using MIRA v4.0 [27]. Default parameters were used, except that the minimum read length was set to 200 bp and the minimum number of reads per contig was set to 10. This resulted in an assembly of 75,523 contigs with an N50 value of 1787 bp and 134.7 Mbp of consensus sequence.

Reads from intestinal samples were trimmed to a Phred quality score of Q30 and assembled using ABySS [148] at a range of k-mer values from 64 to 168. Assemblies were generally poor, incorporating less than 5% of the unassembled data and having N50 values below 700 bp. Assembly statistics for each sample and site are shown in Table 3.1. With the help of Connor Morgan-Lang, we tried two other assembly algorithms, IDBA-UD [123] and SPAdes [10], without significant improvement. I suspect that the poor assembly compared to the fecal sample may be due to the increased abundance of DNA from plants in the beaver diet.

Table 3.1: Sequencing and assembly statistics for beaver intestinal samples

Beaver    Site             Unassembled    Assembled      Assembly
                           Length (Mbp)   Length (Mbp)   N50 (bp)
Beaver 1  Stomach                 382.3            6.2        497
          Small Intestine         244.3            2.0        429
          Cecum                  1452.2           62.9        694
          Proximal Colon         1062.1           39.4        598
          Rectum                 1311.7           46.8        606
Beaver 2  Stomach                 162.1            1.5        488
          Small Intestine        1043.7           22.8        472
          Cecum                  1403.7           51.9        631
          Proximal Colon          867.6           31.1        592
          Rectum                 1215.5           32.2        516
Beaver 3  Stomach                1382.7           21.4        428
          Small Intestine        1169.9           19.1        431
          Cecum                   927.7           27.9        625
          Colon                  1040.2           22.9        624
          Rectum                 1368.3           21.4        428

Instead of using poor assemblies, paired-end reads were merged using FLASH [99], with parameters specifying a minimum 20 base pair overlap with 95% similarity, to generate high-quality reads of up to 580 bp. Assembly attempts with these combined reads remained poor, with no samples showing an N50 above 1000 bp.

3.2.7 Gene Annotation

Assembled contigs and unassembled reads from the fecal DNA sample were compared to the KEGG, COG, RefSeq, MetaCyc and CAZy databases and annotated as in Chapter 2.

Reads from intestinal samples were annotated in the same way, with the exception of CAZy genes, which were annotated using the CAZy pipeline with the help of Vincent Lombard and Nicolas Terrapon.
Using HMMer3 [111], all predicted ORFs were compared to hidden Markov models (HMMs) built from existing CAZy families, and ORFs that matched a family's HMM with an e-value below 10⁻¹⁰ were assigned to that family.

3.2.8 Metagenomic Binning

In order to understand how each community member contributes to community metabolism, assembled contigs from the beaver feces were assigned into bins based on tetranucleotide frequency, read abundance, and conserved marker genes using default parameters of MaxBin2 [188]. MaxBin2 also provides estimates of genome size and completion for each bin based on conserved marker gene frequencies. The LCA algorithm implemented in MEGAN [69] was used to assign taxonomy to each bin, and results were cross-validated using PhyloSift [36].

As the intestinal beaver metagenomes could not be assembled, this approach could not be applied to those samples.

3.2.9 Creation of Large-Insert Fosmid Libraries

Large insert fosmid libraries were made by Zach Armstrong and Melanie Scofield following the Epicentre CopyControl Fosmid Library Production Kit (Epicentre, Madison, WI) with modifications as per Taupp et al. [165]. Libraries were created from the beaver fecal DNA, as well as one library each from the cecum, proximal colon, and rectum sites of dissected beaver 2. Due to low DNA yield from the stomach and small intestine, no libraries could be made from these sites.

3.2.10 Functional Screening

Fosmid libraries were screened by Sam Kheirandish and Zach Armstrong against 6-chloro-4-methylumbelliferyl (CMU) substrates, with all other details as in Chapter 2. For this screening, we used CMU-cellobioside, CMU-xyloside, and CMU-1,4-β-xylobioside substrates.

3.2.11 Prediction of PULs on Fosmid Sequences

In conjunction with Nicolas Terrapon of the Henrissat lab at Université de Marseille, fully sequenced and CAZy-annotated fosmids were used in the automated PUL prediction pipeline.
Briefly, this pipeline predicts PULs based on nucleotide distance between genes, gene directionality, and the presence of SusC and SusD genes, the hallmark of a PUL. For further details, see Terrapon et al. [168].

3.3 Beaver Fecal Metagenome

3.3.1 Community Member Analysis via SSU Ribosomal RNA Gene and Metagenome Sequencing

SSU rRNA gene sequencing and quality control generated 12,579 high quality SSU rRNA gene sequences. Following clustering, 404 unique OTUs were identified, the majority of which belonged to either Firmicutes (214 OTUs comprising 58.4% of sequences) or Bacteroidetes (93 OTUs comprising 24.4% of sequences). Both of these phyla are known organic matter degraders common in mammalian guts. Within the Firmicutes, the family Lachnospiraceae was dominant (143 OTUs, 43.6% of all sequences). Lachnospiraceae contains xylanotrophic, butyric acid-producing members first isolated from human feces [31]. Within the Bacteroidetes, the most common family was S24-7 (39 OTUs, 15.5% of all counts), which is frequently found in microbiomes but at present still lacks a cultured representative [18]. This community structure, dominated by Firmicutes and Bacteroidetes, respectively, is common among studied gut microbiomes [113].

To compare and account for potential PCR amplification bias from SSU rRNA methods, I examined the SSU rDNA sequences found in the metagenome by comparing unassembled reads to the Silva SSU database using the LAST algorithm within the MetaPathways [82] software. Of the 616,811 unassembled reads, 1812 were annotated as having SSU rDNA genes, of which the majority were either Firmicutes (890, 49.1%) or Bacteroidetes (438, 24.2%), mirroring the community structure seen from pyrosequencing. Assignments at the family level also correlated with those seen in the SSU rRNA genes, with Lachnospiraceae (615, 34.1%) and S24-7 (242, 13.4%) once again being the most abundant. Comparing the rank abundance of the five most abundant families from both methods shows an agreement on community structure and improves our confidence in the measures.

Figure 3.2: Relative abundance of phyla in beaver fecal SSU rRNA genes and metagenome. Abundance of phyla based on 16S rDNA sequences compared to the Silva SSU database. Figure by Zach Armstrong.

Figure 3.3: Relative abundance of families in beaver fecal SSU rRNA genes and metagenome. Abundance of phyla based on SSU rRNA gene sequences compared to the Silva SSU database.

3.3.2 Comparison to Other Microbiomes

Given the similarity of the beaver fecal community to other mammal microbiomes at the taxonomic level, I sought to compare the CAZyme content of the beaver gut microbiome with that of the gut microbiomes of other mammals and other xylotrophs. I downloaded data from previous metagenomic studies that had sampled microbiomes of other wood degrading organisms, particularly Asian long-horned beetles (Anoplophora glabripennis) [143], "higher" termites (Nasutitermes spp.) [181], and the shipworm Bankia setacea [120], in addition to a range of mammalian microbiomes [113] [124] [125] [191] (statistics in Table 3.2). All datasets were compared to the CAZy database using MetaPathways to generate CAZyme abundance for all samples. Due to variations in sequencing depth between samples, a variance stabilizing transformation (VST) was applied to the data using the DESeq2 package [97] as implemented in the provided tutorial [96].
Relative abundance of each family was calculated, and a z-score was determined for each CAZy family for each mammal. Because the Manhattan distance metric weights each additional gene equally regardless of abundance, whereas a Euclidean distance metric places lesser importance on additional genes, a hierarchical clustering based on Manhattan distances of z-scores was used. Xylotrophs looked markedly different from most mammals (Figure 3.4), including beavers, despite the beaver's status as a xylotroph. Beavers clustered within the same group as hindgut fermenting herbivores, suggesting that digestive tract layout and/or phylogenetic relatedness plays a larger role in CAZyme family abundance than does xylophagy.

Table 3.2: Information on mammal microbiomes included for comparison to beaver feces microbiome.

Animal             Classification     Total ORFs
African Elephant   Hindgut Fermenter       53875
Armadillo          Omnivore                19135
Baboon             Omnivore                37506
Beaver             Hindgut Fermenter      730206
Bighorn Sheep      Foregut Fermenter       43379
Black Bear         Omnivore                53811
Black Lemur        Omnivore                23135
Black Rhino        Hindgut Fermenter       55214
Bush Dog           Carnivore               42147
Callimicos         Omnivore                33226
Capybara           Hindgut Fermenter       73999
Chimpanzee         Omnivore                43777
Colobus            Foregut Fermenter       45934
Dog                Carnivore              450855
Echidna            Carnivore               44316
Gazelle            Foregut Fermenter       46196
Giant Panda        Hindgut Fermenter       23366
Giraffe            Foregut Fermenter       26570
Gorilla            Hindgut Fermenter        7796
Horse              Hindgut Fermenter       63937
Hyena              Carnivore               46542
Kangaroo           Foregut Fermenter       14893
Lion               Carnivore               61617
Okapi              Foregut Fermenter       20472
Orangutan          Hindgut Fermenter       17111
Polar Bear         Carnivore               45115
Rabbit             Hindgut Fermenter       48927
Reindeer           Foregut Fermenter      609928
Ringtailed Lemur   Hindgut Fermenter       42989
Rock Hyrax         Hindgut Fermenter       46968
Saki               Hindgut Fermenter      102152
Spectacled Bear    Omnivore                26745
Springbok          Foregut Fermenter       34502
Squirrel           Omnivore                29878
Urial Sheep        Foregut Fermenter       16569
Visayan Warty Pig  Hindgut Fermenter       48046
Wallaby            Foregut Fermenter       74987
Zebra              Hindgut Fermenter       17760

Figure 3.4: Clustering of Mammals and Xylotrophs Based on CAZyme Abundance. Hierarchical clustering of mammal and xylotroph gut microbiomes using a Manhattan distance metric based on CAZyme abundance shows that the beaver microbiome is more similar to those of other mammals than to those of phylogenetically distant xylophagous organisms.

To explain the clustering of mammalian microbiomes, groups of CAZy families driving these clusters were investigated further by Zach Armstrong and myself. In particular, we saw three groups for which carnivores showed enrichment or depletion compared to herbivores for all members of the group, with groups highlighted in Figure 3.5. Clusters 1 and 2 were enriched in herbivore samples, and contained enzymes that were active on plant cell wall polysaccharides such as cellulose and xylan. Cluster 3 was enriched in carnivore samples and contained enzymes active on animal-associated polysaccharides such as the glycosaminoglycans heparin, chondroitin, and hyaluronan. In clustering of the mammals, the beaver clustered most closely with other hindgut fermenting herbivores, including ring-tailed lemur, Visayan warty pig, and saki. These similarities in CAZyme profiles existed despite differences in food sources (primarily wood for the beaver; fruits, leaves and grasses for most others). Taken together, this analysis shows we can distinguish broad digestive strategy differences, such as carnivores versus herbivores, but not finer grained differences, such as xylophagy versus folivory, based on the CAZyme profile of gut microbiomes.

Figure 3.5: Comparison of beaver fecal metagenome with other sequenced mammal microbiomes. Heatmap shows enrichment (blue) or depletion (red) of CAZymes for each mammal. Clusters of genes enriched in herbivores include: 1) families active on plant polysaccharides including cellulose, hemicellulose and pectin; 2) families active on xylan. Clusters of genes enriched in carnivores include: 3) families active on animal polysaccharides such as glycosaminoglycans.

3.3.3 Assignment of Function to Taxonomy Using Metagenomic Binning

To delve further into the roles of particular phyla in degradation processes in beaver feces, I used MaxBin2 [188] to separate contigs into phylogenetically related bins. By using metagenomic bins, I am able to assess the degradation potential of each phylum within the sample in an effort to attribute specialized roles to each.
MaxBin2 separated assembled contigs into 24 bins comprising 83.6 Mbp of total sequence (61.9% of the assembled metagenomic data). Phylogenetic analysis with PhyloSift and MEGAN agreed on phylogeny for 22 of 24 bins. PhyloSift searches for conserved marker genes within a bin to assign taxonomy, while MEGAN assigns all genes within a bin to a taxon based on BLAST comparisons. PhyloSift also provides genome completeness estimates of each bin based on the presence of known single-copy marker genes. Statistics of these bins are shown in Table 3.3.

In the two cases of disagreement between MEGAN and PhyloSift (bin 5 and bin 15), I attributed the bins to the MEGAN assignments because of the greater number of data points supporting the placement. The agreement of two orthogonal methods for attributing taxonomy to these bins in the majority of cases improves my confidence in their annotation. However, the relative incompleteness of most bins limits the conclusions that can be drawn from any individual bin, as there may be some CAZy families that were undetected by random chance due to insufficient coverage or completeness. By looking at all bins assigned to a particular phylum, the probability of missing the same CAZy family across multiple bins is reduced.

Table 3.3: Statistics for metagenomic bins created by MaxBin2 containing assembled metagenomic contigs.

Bin  Size (Mb)  MEGAN Assigned (%)     PhyloSift Assigned (%)  Est. Completeness (%)
1    3.9        Firmicutes (91.2)      Firmicutes (92)         36.4
2    2.4        Bacteroidetes (92.7)   Bacteroidetes (86)      79.4
3    3.5        Bacteroidetes (86.4)   Bacteroidetes (82)      51.4
4    3          Bacteroidetes (91.8)   Bacteroidetes (97)      56.1
5    2.3        Bacteroidetes (51.2)   Firmicutes (67)         52.3
6    3.4        Firmicutes (95.2)      Firmicutes (94)         35.5
7    3.3        Firmicutes (95.5)      Firmicutes (100)        45.8
8    4.1        Firmicutes (97.1)      Firmicutes (100)        49.5
9    2.7        Bacteroidetes (83.1)   Bacteroidetes (73)      45.8
10   3.7        Firmicutes (50.9)      Firmicutes (78)         47.7
11   1.4        Bacteroidetes (95.1)   Bacteroidetes (92)      34.6
12   3.3        Firmicutes (93.8)      Firmicutes (91)         39.3
13   3.3        Bacteroidetes (75.7)   Bacteroidetes (75)      44.9
14   3.9        Bacteroidetes (88.3)   Bacteroidetes (55)      52.3
15   2.5        Bacteroidetes (52.2)   Proteobacteria (57)     51.4
16   2.1        Firmicutes (89.8)      Firmicutes (94)         82.2
17   1.9        Firmicutes (84.9)      Firmicutes (100)        22.4
18   2.5        Bacteroidetes (96.9)   Bacteroidetes (88)      71
19   2.2        Bacteroidetes (90.4)   Bacteroidetes (100)     10.3
20   1.3        Firmicutes (58.5)      Firmicutes (55)         29.9
21   1.6        Firmicutes (93.1)      Firmicutes (96)         60.7
22   4.8        Proteobacteria (94.8)  Proteobacteria (100)    69.2
23   1.3        Firmicutes (96.2)      Firmicutes (100)        24.3
24   1.3        Firmicutes (95.5)      Firmicutes (100)        43.9

In order to recover abundance measures that are lost during assembly, the unassembled reads belonging to each bin were recovered by aligning reads to assembled contigs using bwa [91]. As done previously, ORFs from each bin were annotated using LAST comparisons via MetaPathways2. Total gene counts for each CAZy family were generated, and DESeq2 was used to identify CAZymes that were more abundant in each phylum via pairwise comparisons.
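
Testing many CAZy families at once calls for a multiple-testing correction. As an illustration only, the following Python sketch implements the standard Benjamini-Hochberg step-up adjustment; it is not the DESeq2 code, and the function name and example p-values are hypothetical.

```python
from typing import List


def benjamini_hochberg(pvalues: List[float]) -> List[float]:
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


if __name__ == "__main__":
    # Hypothetical raw p-values for four CAZy families.
    raw = [0.01, 0.04, 0.03, 0.005]
    for p, q in zip(raw, benjamini_hochberg(raw)):
        print(f"raw p = {p:.3f}  adjusted p = {q:.3f}")
```

A family would be called differentially abundant when its adjusted p-value falls below the chosen threshold, 0.05 in this chapter.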
Because only one bin was assigned to Proteobacteria, no CAZymes were found to be differentially abundant relative to Proteobacteria at a statistically significant level (Benjamini-Hochberg adjusted p-value < 0.05). Because more than one CAZy family is tested, the Benjamini-Hochberg adjusted p-value was used; it controls the false discovery rate, which would otherwise be inflated by multiple-hypothesis testing.

For Bacteroidetes relative to Firmicutes, fifteen CAZymes were differentially abundant, either increased (9 families) or decreased (6 families) (Table 3.4). Both phyla showed an increase in relative abundance for different starch binding and degradation families, suggesting that they carry out the process via different mechanisms. Firmicutes bins show an abundance of phosphorylase enzymes to cleave starch, such as those belonging to the GH94 and GT35 families, while Bacteroidetes bins have an abundance of endo- and exo-hydrolysis families, such as those belonging to the GH57 and GH97 families. I found no previous accounts of these differences between phyla in the literature. I propose these differences are a result of different trophic strategies between the phyla. Phosphorylases release glucose-1-phosphate, which can be fed directly into central metabolic processes, allowing for the rapid release of monosaccharides from storage polysaccharides during times of high energy demand. Bacteroidetes are known to be more involved than Firmicutes in the fermentation of monosaccharides to short chain fatty acids [47], which provides further pathways for glucose breakdown when sugars are present. The difference in trophic strategy employed by each phylum is thus reflected in the differential abundance of CAZymes. A large scale search of all known Bacteroidetes and Firmicutes genomes would be needed to support this conclusion and extend it beyond the beaver feces sample.

Table 3.4: Differential abundance of CAZy families in metagenome bins, sorted by log2 fold abundance in Bacteroidetes bins compared to Firmicutes bins.

CAZy Family  Mean (all Bins)  log2 Fold Change  Function
GH57                    1.29              5.49  alpha amylase
GT3                     0.83              4.94  glycogen synthase
GH106                   0.89              4.85  alpha-L-rhamnosidase
GT30                    1.03              4.51  KDO transferase
CBM20                   1.55              4.42  cyclodextrin binding
GH97                    3.65              3.63  glucoamylase
GH20                    1.81              2.90  hexosaminidase
CBM32                   1.31              2.72  galactose binding
GH92                    3.74              2.68  alpha-mannosidase
GT5                     3.10             -2.12  starch glucosyl transferase
GT35                   15.51             -2.45  starch phosphorylase
GH94                    4.21             -2.51  beta-glucan phosphorylase
GH42                    1.80             -2.65  beta-galactosidase
CBM48                   5.22             -2.66  glycogen and starch binding
CBM34                   3.07             -3.63  granular starch binding

3.3.4 Functional Screening of Beaver Fecal Library

In order to gain further insight into the carbohydrate degradation processes of the fecal community, a fosmid library was created that resulted in a total of 4608 clones. This library was screened in multiplex with three CMU substrates: CMU-cellobioside, CMU-xyloside, and CMU-xylobioside. Screening identified 52 clones positive for at least one of the three activities, of which 42 showed activity against all three substrates and 45 were active against the cellobioside substrate.
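
The screening counts above translate into hit rates by simple division. The short Python sketch below works through that arithmetic using only the counts reported for this library; the function and variable names are illustrative and not part of the screening software.

```python
LIBRARY_SIZE = 4608  # total fosmid clones screened

# Positive-clone counts reported for the beaver fecal library.
POSITIVE_CLONES = {
    "any substrate": 52,
    "all three substrates": 42,
    "cellobioside": 45,
}


def hit_rate_percent(positives: int, screened: int) -> float:
    """Percentage of screened clones that scored positive."""
    if screened <= 0:
        raise ValueError("number of screened clones must be positive")
    return 100.0 * positives / screened


rates = {
    label: hit_rate_percent(count, LIBRARY_SIZE)
    for label, count in POSITIVE_CLONES.items()
}

if __name__ == "__main__":
    for label, rate in rates.items():
        print(f"{label}: {rate:.2f}%")
    # any substrate: 1.13%
    # all three substrates: 0.91%
    # cellobioside: 0.98%
```

Expressing each result as positives per clones screened is what allows libraries of very different sizes to be compared on the same scale.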
Looking solely at cellobiose-active clones in order to compare with other cellulase discovery studies, this represents a rate of 0.98% (Figure 3.6). This hit rate is higher than that seen for the anaerobic bioreactor library presented in Chapter 2, and higher than those of other gut microbiome studies using functional metagenomic screening for cellulases and hemicellulases from fosmids, including rabbit cecum (11/32,500 or 0.03%) [44], wallaby foregut (98/30,528 or 0.3%) [124], and reindeer rumen (48/5000 or 0.96%) [125]. While functional screening is not a direct proxy for the presence of glycoside hydrolase genes in an environment, the increased recovery rate from beaver feces suggests the beaver is a good environment for discovery of glycoside hydrolases capable of expression in an E. coli host.

Figure 3.6: Screening results from beaver fecal library. Multiplex screening of the beaver metagenomic library identified 52 clones capable of degrading one of the three substrates used. Dotted lines represent six standard deviations above the mean for each substrate used. Shown here are the results from demultiplexing of 103 putative positive clones identified in the initial screen. [Plot: 450 nm fluorescence versus sample well for the xylose, cellobiose, and xylobiose substrates.]

Sequencing of all 52 fosmids showed some redundancy of inserts, defined as >95% similar across >90% of insert length, and 38 unique fosmid clones were identified (Figure 3.7). ORF prediction yielded a total of 1403 ORFs, of which 147 were annotated as glycoside hydrolases by comparison to the CAZy database. Surprisingly, the fosmids did not encode an abundance of GH families that function as cellobiohydrolases or endoglucanases (GH families 5, 6, 7, 8, 9, 12, 26, 44, 45, 48, 51, 74, 124); we only identified one GH5 gene, two GH8 genes, and one GH51 gene.
These families were present in the metagenome, but at a lower level than the families that were most abundant in functionally identified fosmids, such as GH2, GH3, and GH43. Taxonomic assignment using the LCA algorithm assigned 14 fosmids to Firmicutes and 38 fosmids to Bacteroidetes, suggesting that despite Firmicutes being the most abundant community member, the linkages modelled by these substrates are likely hydrolyzed by members of Bacteroidetes. It should be acknowledged that limitations of the screen, such as library size, cloning or transformation bias within the library, or expression bias from the E. coli host, may play a part in this discrepancy.

I compared the abundance of each CAZy family in the fosmid dataset with the metagenome in order to see which families are enriched in the positive clones (Figure 3.8). The most abundant GH family found on the fosmids was GH3, which contains both β-glucosidases and β-xylosidases. This was followed in abundance by GH43, a family harbouring both β-xylosidases and xylanases. These families show an enrichment of 12.8 and 15.8 fold, respectively, in the fosmids compared to the metagenome. One fosmid in particular (12 H03) contained four different GH43 genes. The close location of four genes from the same family suggested a functional specialization for each gene that cannot be distinguished at the family level, warranting investigation as performed in Chapter 4. While these genes are close to one another genomically, they do not belong to a PUL as currently defined due to the lack of SusC and SusD genes.

Figure 3.7: Positive clones identified from beaver fecal library. Circos representation of completed fosmid sequences. Grey bars represent each fosmid, with scale bar for length in kbp. Outer rings show activities identified on each fosmid. Coloured bars within fosmids show locations of GH genes. Orange bars inside the ring show location of identified PULs.
Connections in the center show regions of nucleotide homology at greater than 90% similarity across intervals of more than 300 bp.

Figure 3.8: Comparison of GH gene abundance between metagenome and fosmids. Histogram shows relative abundances of each GH family recovered from positive fosmid clones compared to those recovered from metagenome sequencing.

Figure 3.9: Unique PULs identified on positive fosmid clones. Schematic of unique PULs identified on positive fosmid clones. Clone 4 C03 contains 2 unique PULs. Figure made with assistance from Zach Armstrong.

3.3.5 Identification of PULs on Fosmids

In addition to providing functional identification of CAZymes, sequenced fosmids provide additional information about the locations of CAZymes relative to one another. Large insert fosmids thus serve ideally to identify genetic loci, such as PULs, from an environment. In collaboration with Nicolas Terrapon in the Henrissat Lab at Université de Marseille, we searched for PULs on fosmids using an automated PUL prediction tool [168]. This search annotated 16 PULs, of which seven were unique (Figure 3.9). The most abundant PUL was that found on fosmid 5 P08, but the eight fosmids that carry it are identical in sequence, suggesting the copies arose as an artifact of fosmid library creation. The PUL found on fosmid 5 H01 (hereafter termed PUL01) was seen on three other fosmids that were not identical, suggesting it is present in relatively high abundance in the feces compared to other PULs.

To further investigate this PUL, I sought to see if multiple species may have converged on PUL membership required to break down a particular substrate.
The PULDB feature of the CAZy database allows for comparison of PULs based on gene membership rather than nucleotide identity. PUL01 contains a GH28-GH97-GH43-GH88 ordering not seen in any PULs within PULDB. While PUL01 has not been studied in the literature, its membership can inform us of putative function. The GH28 family contains polygalacturonases, which cleave the homogalacturonan backbone of many pectins and other uronic acid-containing polysaccharides. The GH88 family is known to act only on β-linked, unsaturated glucuronic acid residues found on the reducing end of a polysaccharide chain following cleavage by polysaccharide lyases. β-linked uronic acid residues are primarily found in glycosaminoglycans, such as chondroitin sulfate or heparin sulfate, which are major components of the mucosal layer of the digestive tract. The closely related GH105 family, not identified here, degrades α-linked uronic acid residues which are more typically found in plant biomass, such as the pectin rhamnogalacturonan. Taken together, this suggests a PUL that is responsible for degradation of host mucins rather than dietary polysaccharides. Previous studies have shown that PULs specific for complex dietary polysaccharides are often unique to a species [102], while those responsible for host mucin degradation are better conserved across the Bacteroides genus [133]. This may reflect the necessity of these gut microbes to persist in the mucosal lumen during periods of low dietary nutrient availability. Thus, the abundance of this host mucin degrading PUL may be driven by its importance in maintaining the stability of the gut community.

To place identified PULs within the context of known genomes, I compared all PUL sequences to the RefSeq database.
PULs from fosmids 9_C22 and 9_G01 showed 99% and 98% identity, respectively, across the whole fosmid sequence to two regions of the Alistipes senegalensis JC50 genome, implicating this species as a catalytically active member of the beaver microbiome. Fosmid 5_P08 displayed 99% identity to the genome of Alistipes finegoldii DSM17242, although only 73% coverage of the fosmid was observed. Investigation of the genomic contig in RefSeq shows that the fosmid matches the end of the contig, indicating that the 73% coverage is due to a fragmented reference genome. No other fosmid PULs showed greater than 7% identity to genomes in the RefSeq genomic database, showing the power of large insert functional screening to identify novel PULs from metagenomes. As more work is done on the importance of PULs within a microbiome, high throughput functional screening from fosmid libraries represents an effective method for the identification and recovery of PUL sequences and genes for further characterization, such as that performed by Larsbrink et al [88].

3.4 Dissected Beaver Samples

Following investigation of the beaver fecal metagenome, I sought to perform a longitudinal study of the beaver digestive tract to search for patterns of gene abundance that may signal a sequential degradation of biomass within the gut.

3.4.1 Community Member Analysis via SSU rRNA Gene Sequencing

SSU rRNA gene sequencing of all five digestive tract locations from all six animals generated an additional 142,453 SSU rRNA gene sequences after quality control (Table 3.5). Analysis with QIIME included SSU genes from beaver feces and identified the same dominant phyla (Bacteroidetes, Firmicutes, Proteobacteria), but also revealed mammalian SSU sequences to be dominant in the stomach and the small intestine (Figure 3.10).
In particular, these mammalian sequences were attributed to the American pika, a close relative of the beaver, likely because beaver SSU sequences are not included in the Silva database. The abundance of mammalian sequences is due to the decreased abundance of bacteria within the upper digestive tract; in humans this is approximately 10 bacteria per gram of material, compared to 10^12 per gram of material in the colon [144]. The lower digestive tract more closely matches the distribution observed in the feces, with Firmicutes being the most abundant, followed by Bacteroidetes. While there is some variation in abundance in the final three sites of the tract, no differences are statistically significant using Student's t-test (p > 0.20 for all pairwise comparisons).

I sought to answer whether microbial community composition is more conserved within each beaver, or within each digestive site across beavers. To answer this, I clustered the samples based on OTU abundance. Due to high dimensionality, clusters were determined using the pvclust package in R [161], using the unweighted pair group method with arithmetic mean (UPGMA) [150] as used by Yatsunenko et al. [189], with distances between clusters calculated using the correlation distance metric [162] (Figure 3.11). This grouping showed two major clusters, with one containing almost exclusively samples from the stomach and small intestine, and another containing primarily samples from the lower tract. These clusters are driven by the shared abundance of mammalian counts in the stomach and small intestine due to low microbial abundance. Within the lower tract, samples showed most similarity to others from the same animal due to the same community being shared via the procession of biomass through the tract.

Table 3.5: Number of SSU rRNA gene sequences generated for each site and animal of dissected beavers.

Site                 1      2     3     4     5     6
Stomach           3659   3594  3580  3564  4482  3927
Small Intestine   2115   5178  7718  4339  2620  8066
Cecum             2593    241  1264  4818  5050  2028
Proximal Colon    5222  16367  3163  4300  4290  4156
Rectum           15015   3690  4033  4130  5412  3839

Figure 3.10: Percent abundances of phyla at each digestive tract site. Abundances of major phyla (Bacteroidetes, Firmicutes, Proteobacteria, Mammalia) at each site within the beaver digestive tract based on SSU rRNA gene analysis. Boxes represent interquartile range (Q1 - Q3), black bar within each box represents the median, and whiskers extend to a 95% confidence interval for comparison of two medians.

Figure 3.11: Clustering of Beaver Digestive Tract Compartments Based on SSU rRNA Gene Sequences. Hierarchical clustering of beaver compartment samples shows that samples from the lower three sites of the digestive tract cluster by animal, while stomach and small intestine samples were most similar to each other. Red numbers denote approximately unbiased p-values according to Shimodaira et al [147].

3.4.2 Functional Analysis Using Metagenomic Sequencing

To investigate carbohydrate degradation processes occurring at each site, merged paired-end metagenomic reads were compared to the CAZy database using MetaPathways2 to annotate CAZymes. Firstly, I sought to see if there were different abundances of each class of CAZyme at each location within the digestive tract. CAZymes of each class were most abundant in the lower digestive tract, with each class peaking in abundance in the cecum (mean 4.0% CAZymes).
While these results are not statistically significant due to the limited number of samples, a trend towards decreasing numbers of CAZymes at each successive site is emerging (mean 3.1% in proximal colon, mean 2.6% in rectum) (Figure 3.12). Glycoside hydrolases were the most abundant class at each site, signalling an environment dominated by carbohydrate breakdown rather than synthesis.

After seeing samples clustered by taxonomic abundance using SSU rRNA genes, I questioned whether the CAZy profiles of each sample would show similar clustering patterns. While microbial taxonomy is shared via the passage of biomass through the tract, the CAZy profile could be independent of taxonomy, with different community members undertaking similar functional roles in the same site across animals. Alternatively, the CAZy profile within an animal may be determined by dietary composition. Clustering and analysis similar to that used previously to compare the different mammal microbiomes was undertaken (Figure 3.13).

Clustering of samples followed a similar pattern to clustering based on SSU rRNA gene sequence data: stomach and small intestine samples were most similar to each other, and sites in the lower digestive tract were more similar within a beaver than across beavers at the same site. This suggests that the CAZyme profile is largely driven by individual animal diets or attributes, rather than having key carbohydrate degradation steps at a site fulfilled by different community members in different animals.

Figure 3.12: Abundance of each class of CAZyme at each site. Percent abundance of the 4 main classes of CAZymes (glycoside hydrolases, glycosyl transferases, carbohydrate esterases, polysaccharide lyases) at each digestive tract site. Error bars represent one standard deviation.

Figure 3.13: Relative CAZy family abundance at each digestive tract site. Heatmap shows enrichment (blue) or depletion (red) of CAZymes for each site, on a z-score scale from -2 to 2. Hierarchical clustering of samples based on relative CAZyme abundance using a Manhattan distance metric shows that within the lower digestive tract, the relative abundance of each CAZy family is dictated more by the particular beaver than it is across beavers but at the same site in the digestive tract.

3.4.3 Linear Mixed Effects Model to Search for Patterns of CAZyme Abundance

In order to investigate whether there were patterns of CAZyme abundance underlying the processive degradation of biomass, I sought assistance from Rick White of the UBC Statistical Consulting and Research Lab (SCARL). Rick suggested a generalized linear mixed effects model using the lme4 package as implemented in R [12]. However, he pointed to the number of samples available as a limiting factor, and recommended a minimum of ten samples for each degree of freedom I was seeking a pattern for (here, each CAZy family, which would entail more than 2000 samples). Thus, instead of searching for patterns of CAZy families, I sought to identify whether there was a pattern in the ratio of endo-acting to exo-acting GH families through the tract. I reasoned that the earlier locations within the digestive tract would have an increase in endo-acting enzymes for degradation of longer chain polysaccharides, while the later sites would encode more exo-acting enzymes for degradation of shorter oligosaccharides into monosaccharides. These monosaccharides are then fermented to short chain fatty acids such as butanoate and propionate in the lower digestive tract, which provide energy for the host.
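The endo:exo ratio described above can be illustrated with a minimal Python sketch. This is not the analysis code used in the thesis: the input format (a mapping from GH family number to gene count), the function name, and the counts are hypothetical, and only a small subset of the families from Table 3.6 is included for brevity.

```python
# Subsets of the exo- and endo-acting GH families listed in Table 3.6.
# (Assumption: a few families are shown for brevity, not the full lists.)
EXO_FAMILIES = {1, 2, 3, 43, 67, 88, 97}
ENDO_FAMILIES = {5, 9, 10, 16, 28}


def endo_exo_ratio(gh_counts):
    """Return the endo:exo ratio for one sample.

    gh_counts maps a GH family number to the number of annotated genes
    (a hypothetical input format). Families in neither set are ignored.
    """
    endo = sum(n for fam, n in gh_counts.items() if fam in ENDO_FAMILIES)
    exo = sum(n for fam, n in gh_counts.items() if fam in EXO_FAMILIES)
    if exo == 0:
        raise ValueError("sample contains no exo-acting families")
    return endo / exo


# Made-up counts for two samples, purely illustrative.
stomach = {5: 30, 10: 20, 2: 40, 3: 40, 43: 20}
cecum = {5: 20, 10: 20, 2: 50, 3: 60, 43: 50}

print(endo_exo_ratio(stomach))  # 50 endo / 100 exo
print(endo_exo_ratio(cecum))    # 40 endo / 160 exo
```

Reducing each sample to a single ratio is what lets the model be fit with the limited number of samples available, rather than estimating one effect per CAZy family.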
Table 3.6: Listing of families denoted as endo-acting or exo-acting for purposes of linear modelling.

Exo-acting GH families: 1, 2, 3, 4, 13, 14, 15, 20, 26, 27, 29, 31, 33, 35, 36, 37, 38, 39, 42, 43, 47, 54, 57, 59, 63, 65, 67, 72, 76, 77, 78, 79, 84, 88, 89, 91, 92, 93, 94, 95, 97, 100, 105, 106, 109, 110, 112, 115, 116, 117, 120, 121, 123, 125, 127, 129, 130, 133

Endo-acting GH families: 5, 6, 8, 9, 10, 11, 16, 17, 18, 19, 22, 23, 24, 25, 28, 30, 32, 44, 46, 48, 50, 51, 53, 55, 56, 64, 66, 73, 74, 81, 85, 87, 98, 99, 101, 102, 103, 108, 113, 124, 128, 132

All CAZy families identified in the metagenomes were grouped into endo-acting (42 families) or exo-acting (57 families) (Table 3.6). A linear regression model was set up to test whether location along the digestive tract correlated significantly with the ratio of endo:exo-acting families. A model was fit, and pairwise comparisons (ten) were made between each site and all others to generate a p-value describing the likelihood of correlation. All p-values were gathered and a Bonferroni correction was applied to each p-value to account for the ten hypotheses tested. Across all tests, the only statistically significant shifts in endo:exo families seen were between the stomach and all other sites (adjusted p-value ≤ 3.96 x 10^-4 for all tests). That is, the only location with a significantly different ratio of endo:exo acting CAZymes is the stomach, which shows between a 23.2% and 35.7% increase in endo:exo ratio compared to other locations (Table 3.7).

While these enrichments are statistically significant, our previous analysis of SSU rRNA gene data showed the stomach and small intestine sites to be dominated by DNA from the beaver itself. As a result, this comparison is not between the microbiota contained at each site, but rather between the beaver genome itself and the microbiota occurring in the lower tract.
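As a rough illustration of the multiple-testing step, the sketch below enumerates the ten pairwise site comparisons and applies a Bonferroni correction. It is a minimal sketch, not the thesis analysis: the helper name and the raw p-values in the usage example are hypothetical, and the model fit that would produce the raw p-values is not shown.

```python
from itertools import combinations

SITES = ["Stomach", "Small Intestine", "Cecum", "Proximal Colon", "Rectum"]

# Five sites give C(5, 2) = 10 pairwise comparisons.
PAIRS = list(combinations(SITES, 2))


def bonferroni(p_values):
    """Multiply each raw p-value by the number of tests, capped at 1.0.

    p_values maps a comparison label to its raw (unadjusted) p-value.
    """
    m = len(p_values)
    return {label: min(1.0, p * m) for label, p in p_values.items()}


# Hypothetical raw p-values for two comparisons, for illustration only.
example = bonferroni({"a": 0.02, "b": 0.5})
print(len(PAIRS))
print(example)
```

With ten comparisons, a raw p-value must fall below 0.005 to remain below 0.05 after correction, which makes the approach conservative for a study with few samples.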
Table 3.7: Enrichment of endo-acting:exo-acting CAZymes in the stomach compared to other locations within the digestive tract, as generated by linear modelling.

Location         Enrichment  Standard Error  Adjusted p-value
Small Intestine  35.7%       8.7%            3.96 x 10^-4
Cecum            24.1%       4.4%            5.07 x 10^-7
Proximal Colon   23.3%       4.5%            2.61 x 10^-6
Rectum           23.2%       4.5%            2.73 x 10^-6

Comparisons between sites in the lower tract did not show any significant shift in the ratio of endo-acting:exo-acting CAZy families.

3.4.4 Functional Screening of Metagenomic Libraries From Lower Beaver Digestive Tract

To provide more specific functional insight on carbohydrate degradation genes within the lower digestive tract, multiplexed functional screening was performed as previously described. In total, 124 fosmids were identified as positive against one of the three substrates used. The cecum produced 53 positive hits, the proximal colon 34 positive hits, and the rectum 37 positive hits. In total, 2791 ORFs were predicted from all fosmids, including 483 CAZymes. Fosmids were separated by site, and abundance of each CAZy family was examined to gain insight into processes occurring at each site (Figure 3.14).

Figure 3.14: Comparison of CAZyme abundance on fosmids from each digestive tract site. Counts of each CAZy family are shown by site. CAZyme abundance shows the same families are most abundant in the beaver digestive tract as those found on fosmids from the beaver fecal library.

The most abundant families identified on digestive tract fosmids were the same as those identified on fosmids from the fecal library (GH10, GH43, GH3, and GH2), with the addition of GH67 and the exclusion of GH1.

The paucity of GH1 genes on recovered fosmids relative to the metagenomic data poses an interesting question.
The GH1 family is abundant in the metagenomic samples (7th most abundant family), and GH1 members have been previously seen to have activity on the CMU-cellobioside substrate through successive actions of β-glucosidase. Furthermore, GH1 was the fourth most abundant GH family to be recovered from the beaver fecal library. The community composition of the fecal library and digestive tract libraries was seen to be very similar, and each of the digestive tract libraries contains more clones than the fecal library. One explanation may be that another family containing β-glucosidase activity, such as GH2 or GH3, substitutes for GH1 in the beaver intestinal environment.

The GH67 family contains α-glucuronidase enzymes responsible for the debranching of xylooligosaccharides via the removal of glucuronic acid residues at the non-reducing end. This debranching allows β-xylosidase enzymes to further degrade hemicellulose substrates. None of the substrates used for screening should be capable of detecting GH67 activity from fosmids, suggesting this activity is complementary to an activity for which it was screened. Checking all fosmids with a GH67 gene shows that in every case, the GH67 gene is located at most two genes upstream or downstream of either a GH43 or GH10 gene, two families with known xylanase or xylosidase activity. The recovery of GH67 enzymes in this manner shows GH genes on fosmids may co-occur more frequently with other families both within and outside of PULs.

3.4.5 Identification of PULs on Digestive Tract Fosmids

Identification of PULs within the digestive tract may give insight into the processing of dietary biomass. I reasoned that the upper sites of the digestive tract would be enriched in PUL sequences, as complex polysaccharides in the diet often require combinations of genes acting in a particular order for complete degradation of substrate [88].
Thus, the locations where dietary biomass is least degraded would be most likely to harbour both a greater abundance, and a greater diversity, of PULs. To test this, I predicted PULs from all fosmids identified in functional screening.

PUL prediction from all positive clones identified 27 PUL-containing fosmids from the cecum, 9 PUL-containing fosmids from the proximal colon, and 18 PUL-containing fosmids from the rectum. Additionally, one fosmid from the cecum, one fosmid from the proximal colon, and four fosmids from the rectum contained two PULs, for totals of 28, 10, and 21 PULs at each site respectively. Predicted PULs ranged in size from two genes (comprising only a SusC and SusD pair), to fifteen genes (including nine GHs), with a mean membership of 6.4 genes. In many cases, PULs are found at the end of a fosmid, and may not be complete.

Of the 28 PULs in the cecum, fourteen were unique. Of the ten PULs in the proximal colon, seven were unique. Of the 21 PULs in the rectum, ten were unique. Across all sites, of the 58 PULs identified, 19 unique PULs were identified, with overlaps between each site (Figure 3.15).

Figure 3.15: Venn diagram of PULs identified from lower digestive tract sites (cecum, colon, rectum). Nineteen unique PULs were identified from lower beaver digestive tract fosmid libraries, with four PULs identified at all sites.

Of the 483 CAZyme genes identified on the fosmids, 175 fell into a PUL. CAZy family abundance within PULs mirrored that of CAZy family abundance from all fosmid GH genes, with the most abundant CAZy family being GH10, with 34 occurrences. The GH10 family contains endo-β-1,4-xylanases for degradation of the xylan backbone of hemicelluloses. The second and third most abundant families identified were CBM4 and GH43. The identified CBM4 domains are always found in conjunction with a GH43 or GH10 domain on the same protein, and have been shown to bind to soluble xylans and amorphous cellulose [149].
The abundance of these families lends further support to the degradation potential of xylan containing polysaccharides by the gut microbiome.

Comparison of all PUL containing fosmids to the RefSeq database showed that none of these sequences belong to known genomes, with all fosmids showing less than 37% similarity to any sequenced genome. This underscores our current lack of understanding of PUL diversity in nature, and shows the value in functional screening to recover novel PULs.

3.5 Conclusion

This study of the beaver gut microbiome using both functional and in silico analysis methods provides insight into the degradation of lignocellulosic biomass by a hindgut fermenting herbivore. Comparative metagenomic analysis of beaver fecal DNA showed the beaver is more similar to other hindgut fermenters than to other xylophagous organisms. Taxonomic analysis of SSU rRNA genes showed the community composition is similar to other mammalian guts, dominated by Firmicutes and Bacteroidetes. Metagenomic binning analysis revealed a subset of CAZymes enriched or depleted in each of these phyla, giving insight into the different catabolic strategies employed by each. Functional screening identified large-insert fosmids that contain gene linkage information that is unavailable from short read metagenomic sequencing. The prediction of PULs from fosmids was successful and showed the utility of large insert fosmids for identification of novel PULs from any environment. The discovery of novel PULs is important in a translational context, as enzymes within a PUL act together and share similar regulation in order to degrade complex carbohydrates. By understanding which genes act in unison, we may further improve bioprocessing capabilities or efforts to engineer microbes or microbial consortia for carbohydrate degradation.

A limitation of this approach is the use of metagenomic sequencing instead of metatranscriptomic sequencing.
While taxonomic abundance may remain stable across sites due to the movement of fecal matter through the tract, transcript abundance is able to tell which genes are being actively transcribed or expressed. The study of transcriptomes would allow us to quantify the expression of genes at each site rather than solely genetic potential, and may show more prominent patterns of CAZyme abundance that underlie sequential biomass degradation. Future efforts could focus on high throughput comparison of PULs based on CAZyme annotations rather than sequence based metrics. This would allow us to address the question of whether PULs from different species have functionally converged to contain the same gene repertoire despite sequence differences. Such convergence would signal the importance of degrading that particular component of biomass.

Additional work on PUL prediction was performed with Nicolas Terrapon during my time in Marseille. In particular, we sought to predict the presence of known PULs from an environment based solely on short metagenomic reads. We worked on the assumption that genes within a PUL should show similar depth of coverage in the metagenome, and reads would show very high (> 90%) levels of nucleotide similarity to each gene within a PUL. Simulated metagenomes were created from known genomes using BEAR [72] to see if a solution could be provided in an ideal case scenario. However, due to the frequent reorganization of PULs within a family, and some genes being common to multiple PULs, the complexity could not be overcome. Alternative algorithms would need to be tested in order to solve this problem.

These future directions are both helped by the use of fosmid screening to identify novel PULs. As more PULs are identified and functionally characterized, we will gain better predictive power about substrate specificity for computationally identified PULs.
As our knowledge of PUL diversity increases, we will be more capable of assigning genes to a given PUL, to aid in separation of genes within a PUL from those that are more likely to act alone. Once we are able to distinguish between CAZymes within PULs and CAZymes not within PULs, the ability to predict PULs from short read metagenomic sequences should be more feasible.

Chapter 4

Subfamily Classification of Glycoside Hydrolase Family 43 Enzymes

This chapter presents a subfamily classification system for glycoside hydrolase family 43 enzymes completed in the laboratory of Dr. Bernard Henrissat at the Laboratoire Architecture et Fonction des Macromolécules Biologiques (AFMB) in Marseille, France. Subfamily classifications have been produced for other CAZy families including GH13 [154], GH30 [153], GH5 [9], and all nine polysaccharide lyase families [93]. With help from Nicolas Lenfant, all GH43 protein domain sequences contained in the CAZy database were divided into 37 subfamilies based solely on protein sequence, and subfamilies were tested for robustness by comparison to all other GH43 domain sequences using both pfam [111] and BLAST [4]. Known functional characteristics from previous biochemical studies were mapped on to each subfamily to determine subfamily substrate specificities. Complete sequences for all GH43 domain containing proteins were retrieved to examine co-occurrence of other CAZy domains with GH43 domains to give further insight into potential synergies and evolutionary trajectory of GH43 domain containing proteins. In total, this classification and mapping of characterized enzymes to subfamilies helps to bridge the gap between the high throughput nature of sequence based enzyme discovery, and the characterization provided by functional discovery methods.

4.1 Introduction

Advancements in DNA sequencing technologies over the past decade have exponentially increased the number of sequences stored in biological databases, including CAZy.
The CAZy database has over 530,000 sequences across six enzyme classes (GH, GT, PL, CE, AA, and CBM), with the largest being glycoside hydrolases, currently containing over 285,000 sequences classified into 133 families based on amino acid sequence similarity. A large majority of these are annotated from in silico predictive studies that may identify tens of thousands of protein domains [66], while functional studies typically characterize fewer than 20 genes [106] [109]. This relative lack of functional characterization hampers the predictive power of higher throughput in silico studies.

The GH43 family is currently the 5th largest GH family, containing 17,844 total protein domains. Of these, 148 members have been biochemically characterized against a natural or synthetic substrate. The major reported activities are β-D-xylosidase (EC 3.2.1.37), α-L-arabinofuranosidase (EC 3.2.1.55), endo-α-arabinanase (EC …), and 1,3-β-galactosidase (EC …). Taken together, this family comprises a range of debranching enzymes for aiding the degradation of hemicellulose, particularly arabinoxylan (Figure 4.1), and pectin. Crystal structure studies have shown a range of binding mechanisms and catalytic residues [175] [117] [71] within this family, suggesting the existence of different clades within the GH43 family.

Figure 4.1: Structure of Arabinoxylan. Schematic of arabinoxylan to show common linkages found in arabinoxylan polysaccharides. Residues and substituents shown are xylose, galactose, glucuronic acid, arabinose, acetyl groups, and ferulic acid, joined by β-1,4, α-1,2, and α-1,3 linkages.

The GH43 family has emerged as an important family for biomass deconstruction efforts, as studies have found it to be expanded in a number of plant cell wall degrading microorganisms [80]. Additionally, studies of the human gut microbiome have identified GH43 enzymes to be among the most abundant CAZymes present [42] [187].
Such abundance in both cases suggests an important role for GH43 enzymes in accessing a wide range of complex substrates, and highlights the need for accurate functional predictions of GH43 enzymes in genomic and metagenomic studies.

4.2 Subfamily Assignment of GH43 Protein Domains

Using SQL programs developed by Vincent Lombard for accessing the CAZy database, all complete GH43 domains (17,844 in total) were retrieved from the CAZy database on May 25th, 2015. To reduce redundancy and improve processing time, sequences were clustered at 95% similarity using CD-HIT [49], resulting in 4189 remaining sequences. Multiple sequence alignment was performed by MUSCLE [40]. In order to generate high quality and relevant alignments, MAFFT [76] was used to iteratively remove highly dissimilar sequences. Such sequences are defined as those having a gap > 3 residues or insertion > 1 residue that is not seen in at least 2 other sequences. Following these quality control measures, 3337 sequences remained. With these sequences, FastTree [127] was used to generate a phylogenetic tree based on the midpoint root method (Figure 4.2).

Figure 4.2: Subfamily tree of GH43 protein domain sequences. Radial phylogenetic tree consisting of 4189 GH43 sequences separated into 37 subfamilies. Symbols next to subfamily numbers denote EC numbers present in each subfamily. The choice of colors is arbitrary. Those branches with no color represent sequences that could not be assigned to any subfamily given the current criteria.

Manual separation of subfamilies was decided based on phylogenetic distances in this reduced tree. Subfamilies were required to contain at least five sequences found in this reduced tree in order to generate a proper multiple sequence alignment. Additionally, each subfamily was required to show taxonomic diversity above the Class level to ensure a subfamily was not comprised solely of taxonomically recent gene homologues due to genome sequencing bias for a particular Class, Order, or Family. These criteria resulted in 37 putative subfamilies.

Hidden Markov models (HMMs) were created for each subfamily, as well as for the complete GH43 family, using HMMER3 [111]. All GH43 sequences were compared to these HMMs with HMMER3 to assign a subfamily to each sequence. Each sequence was compared to all other GH43 sequences using BLASTP [4], and the top 100 BLAST hits were retained. These top 100 BLAST hits for each sequence were grouped by subfamily, and the sum total bitscore for each subfamily was calculated.

Criteria for assignment of a domain into a subfamily were two-fold:

1. HMM comparison must provide an e-value at least 20 orders of magnitude (e-20) lower for the specific subfamily than for any other subfamily HMM, including the HMM generated from all GH43 sequences.

2. The subfamily with the highest bitscore ratio (resulting from the sum of the top 100 BLAST hits divided by the number of sequences in the subfamily) must agree with the HMM designation.

After assignment of all GH43 sequences meeting the above criteria, individual alignments, trees, and HMMs were built for each putative subfamily using all sequences and excluding the CD-HIT and MAFFT procedure. Subfamilies containing proteins with known structures and catalytic residues were inspected manually to ensure conservation of catalytic residues within a subfamily. This inspection identified an important distinction within one putative subfamily, resulting in the creation of subfamilies GH43_24, GH43_25 and GH43_37. The HMM and BLAST analysis was repeated with all complete subfamilies to assign subfamily membership to each sequence. The resulting 37 subfamilies collectively contained 17,138 sequences (96.0% of all GH43 domains analyzed), ranging from 31 to 1890 sequences each.
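The two assignment criteria can be sketched in Python as follows. This is a minimal illustration rather than the code used for the classification: the function name, the input dictionaries, and the example scores are hypothetical, the e-value criterion is interpreted here as a factor of 1e-20, and the parsing of HMMER and BLAST output is not shown. The subfamily sizes in the usage example are taken from Table 4.1.

```python
def assign_subfamily(hmm_evalues, bitscore_sums, subfamily_sizes, margin=1e-20):
    """Return the assigned subfamily name, or None if either criterion fails.

    hmm_evalues     -- e-value of the query against each subfamily HMM and
                       against the whole-family HMM (key "GH43")
    bitscore_sums   -- summed bitscores of the top 100 BLAST hits, by subfamily
    subfamily_sizes -- number of sequences in each subfamily
    (All three inputs are assumed formats, not actual tool output.)
    """
    best = min(hmm_evalues, key=hmm_evalues.get)
    if best == "GH43":
        # The whole-family model fits best, so no subfamily is assigned.
        return None

    # Criterion 1: the best e-value must be lower than every other e-value
    # by at least the required margin.
    for name, evalue in hmm_evalues.items():
        if name != best and hmm_evalues[best] > evalue * margin:
            return None

    # Criterion 2: the subfamily with the highest size-normalised bitscore
    # sum must agree with the HMM designation.
    ratios = {s: bitscore_sums[s] / subfamily_sizes[s] for s in bitscore_sums}
    top_blast = max(ratios, key=ratios.get)
    return best if top_blast == best else None


# Hypothetical scores for one query domain.
sizes = {"GH43_4": 544, "GH43_5": 545}
hmm = {"GH43_4": 1e-90, "GH43_5": 1e-60, "GH43": 1e-55}
blast = {"GH43_4": 30000.0, "GH43_5": 5000.0}
print(assign_subfamily(hmm, blast, sizes))
```

Requiring agreement between two independent lines of evidence (profile HMM and pairwise BLAST) leaves ambiguous sequences unassigned, which is consistent with the 4.0% of domains that fall outside any subfamily.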
4.3 Mapping of Functional Characteristics to Subfamilies

In order to assess functional capabilities of each subfamily, all GH43 proteins that have been biochemically characterized against natural or synthetic substrates were mapped onto subfamily trees displaying either the EC number or substrate they have shown activity against. In total, 22 of 37 subfamilies contained at least one characterized member. Specific details about each subfamily are provided in Table 4.1. Subfamilies are grouped by specificity and detailed below.

Table 4.1: GH43 Subfamily Characteristics. Individual GH43 subfamily membership numbers and characteristics.

Subfamily  Members  Characterized  Lowest Taxonomy        Activity (EC Number)
None       706      2              Root                   …
GH43_1     849      16             Root                   3.2.1.37, 3.2.1.55
GH43_2     188      0              Neocallimastigomycota  -
GH43_3     345      0              Root                   -
GH43_4     544      14             Neocallimastigomycota  …
GH43_5     545      10             Root                   …
GH43_6     849      9              Fungi                  …
GH43_7     61       0              Bacteria               -
GH43_8     102      0              Bacteria               -
GH43_9     217      0              Chytridiomycota        -
GH43_10    1118     12             Root                   3.2.1.37, 3.2.1.55
GH43_11    860      22             Ascomycota             3.2.1.37, 3.2.1.55
GH43_12    794      12             Mollusca               3.2.1.37, 3.2.1.55
GH43_13    374      0              Fungi                  -
GH43_14    720      1              Fungi                  …
GH43_15    31       0              Bacteria               -
GH43_16    394      11             Neocallimastigomycota  3.2.1.37, 3.2.1.55
GH43_17    159      0              Root                   -
GH43_18    334      0              Root                   -
GH43_19    348      1              Neocallimastigomycota  …
GH43_20    41       1              Basidiomycota          -
GH43_21    291      1              Fungi                  …
GH43_22    394      1              Root                   …
GH43_23    89       0              Root                   -
GH43_24    1414     8              Root                   …
GH43_25    68       0              Fungi                  -
GH43_26    1890     10             Root                   …
GH43_27    56       2              Bacteria               3.2.1.37, 3.2.1.55
GH43_28    346      0              Root                   -
GH43_29    866      7              Root                   3.…
GH43_30    713      0              Root                   -
GH43_31    269      0              Bacteroidetes          -
GH43_32    36       0              Rotifera               -
GH43_33    150      1              Root                   …
GH43_34    928      0              Root                   -
GH43_35    221      4              Neocallimastigomycota  3.2.1.37, 3.2.1.55
GH43_36    411      2              Root                   …
GH43_37    78       2              Root                   -
Figure 4.3: Steric Similarity of α-L-Arabinofuranose and β-D-Xylopyranose (panels: R-α-L-arabinofuranoside and R-β-D-xylopyranoside). α-L-arabinofuranose and β-D-xylopyranose share similar stereochemistry surrounding the glycosidic bond, which may account for a significant co-occurrence of these two activities by a single GH43 protein domain.

4.3.1 β-D-Xylosidase and α-L-Arabinofuranosidase Containing Subfamilies

These two activities constitute the majority of characterized GH43 enzymes. To date, all subfamilies that have been characterized and shown to be polyspecific harbor both β-D-xylosidase (EC 3.2.1.37) and α-L-arabinofuranosidase (EC 3.2.1.55) activities. The overlap of these activities within a subfamily, and in some cases within a single protein, is not altogether surprising, as the α-L-arabinofuranose and β-D-xylose conformations are sterically similar near the glycosidic bond (Figure 4.3), and indeed this co-occurrence has been reported previously [70] [85].

The abundance of characterized enzymes with these two activities highlights the reliance of functional screening efforts on easily available synthetic p-nitrophenol (pNP) sugars, in particular pNP-β-D-xyloside and pNP-α-L-arabinofuranoside, for identification of activity.

Of particular note is subfamily 36, which has demonstrated activity against di-substituted (1,2- and 1,3-arabinofuranoside) xylopyranose residues [107] [173]. These residues are typically recalcitrant to enzymatic attack and as such this subfamily has significant biotechnological interest. Current EC designations do not reflect such stereospecificities, so while this subfamily shows significant functional differences it is not distinguishable based on EC number alone.

4.3.2 Endo-α-L-Arabinanase Containing Subfamilies

Subfamilies GH43_4, GH43_5 and GH43_6 are the only ones showing α-L-arabinanase activity.
These three closely related but distinct subfamilies are among the best characterized, with each showing endo-α-L-arabinanase activity. Subfamilies GH43_4 and GH43_5 both have multiple crystal structures available, with subfamily 5 having the first crystal structure obtained for a GH43 enzyme [117].

4.3.3 β-1,3-Galactosidase Containing Subfamilies

Subfamily GH43_24 is well characterized, with eight members having been biochemically characterized. This is the only subfamily shown to have β-1,3-D-galactosidase activity. Members of GH43_24 have been characterized and structures obtained [71], which showed a shift in the catalytic base residue from Asp38 (in the CjArb43A reference sequence from Nurizzo et al [117]) to Glu112 (in the Ct1,3Gal43A reference from Jiang et al [71]).

In addition to subfamily GH43_24, the uncharacterized subfamily GH43_37 shares a similar motif at the catalytic base, except that a glycine replaces the glutamic acid residue. The effect of such a shift is unknown, but it may result in a loss of functional activity or a re-purposing of this domain for other functions. This potential re-purposing of domains has been addressed by Aspeborg et al [9] and includes inactivated chitinases repurposed as xylanase inhibitors in GH18 [39] and amino acid transporters arising from ancestral α-amylases of GH13 [154]. It is impossible to rule out a change in protein structure that may bring a different residue into the active site to serve as the catalytic base, or the requirement of a co-factor for deglycosylation as seen for ascorbate in the myrosinases of the GH1 family [20].

4.3.4 Uncharacterized Subfamilies

Of the 37 subfamilies defined here, only 22 contain at least one protein that has been characterized via a biochemical assay. Furthermore, only eleven subfamilies have at least five characterized members, which may explain why some subfamilies harbouring only EC 3.2.1.37 or EC 3.2.1.55 activity are identified as monospecific.
This lack of biochemical characterization stretches throughout the GH43 tree, with some poorly explored regions of the tree being only distantly related to subfamilies with characterized members. The subfamily analysis presented here brings these clades into focus as targets for further functional and structural exploration.

4.4 GH43 Domain Co-occurrence with Other CAZy Modules

As CAZymes often contain multiple domains from different families or classes, I searched for other CAZy domains frequently found in the same protein as a given GH43 subfamily, producing a matrix of co-occurrence counts. These counts were normalized against the total number of domains in a given subfamily to generate a frequency of co-occurrence matrix, which revealed a number of significant associations between individual GH43 subfamilies and other CAZy modules (Figure 4.4). Such associations have the potential to inform the functions of subfamilies that have no characterized members, as well as the evolutionary trajectory followed by each subfamily. Modules found to co-occur with greater than 55% frequency with any GH43 subfamily are described below.

4.4.1 CBM6

The most frequently co-occurring module was CBM6, a module with demonstrated binding to amorphous cellulose and β-1,4-xylan; 6.0% of all GH43 domain-containing proteins harbour a CBM6 domain. This module is found associated with ten different subfamilies, but is most striking in subfamilies GH43_15 and GH43_16, with 100% and 64% co-occurrence, respectively. CBM6 modules have been shown to improve the catalytic activity of associated enzymes on non-soluble substrates [1]. That study identified four clades of CBM6 family domains. The CBM6 modules found associated
here are either not assigned to any of the four subfamilies defined by Abbott et al (348 of 1072 domains), or belong to subfamily CBM6b (724 domains), a clade with demonstrated xylan binding capabilities, which matches the activity of characterized members of this subfamily.

Figure 4.4: Co-occurrence of GH43 domains with other CAZy modules. Heatmap showing frequency of domain co-occurrence within a protein between GH43 subfamilies and other CAZy domains (axis labels: Subfamily; Co-occurrence (%), 0 to 100). CBM: Carbohydrate Binding Module; DOC: Dockerin domain; SIGN: Signal peptide; X: Conserved domain of unknown function.

4.4.2 CBM35

Found in co-occurrence with 40.6% of GH43_24 domains and 72.1% of GH43_25 domains, these CBM modules are known to bind xylan and mannooligosaccharides, but also bind 1,3-β-D-galactose [51]. Such binding is expected, as the GH43_24 subfamily is the only one containing a characterized enzyme with 1,3-β-D-galactosidase activity.

4.4.3 CBM13

Subfamily GH43_7 has yet to have a member characterized, but all GH43_7 proteins harbour a CBM13 module. A CBM13-containing α-L-arabinofuranosidase from family GH61 (abfB) has been characterized in the soil actinomycete Streptomyces lividans [177], and has demonstrated xylan-binding functionality. This co-occurrence hints towards β-D-xylosidase (EC 3.2.1.37) or α-L-arabinofuranosidase (EC 3.2.1.55) activities, as seen in the subfamilies GH43_15 and GH43_16 associated with CBM6 xylan-binding domains.

4.4.4 CBM42

CBM42 modules are seen in 59% of subfamily GH43_20 proteins. Assays of a CBM42-containing family GH54 α-L-arabinofuranosidase (AkAbfB) from Aspergillus kawachii [112] revealed arabinose binding capacity for CBM42, which differs from the xylan-binding CBMs found associated with other GH43 subfamilies.
This correlation suggests these proteins recognize the arabinofuranoside component of arabinoxylans rather than the xylan backbone. Such a strategy could avoid competition for limited binding space on the xylan backbone, or may represent an organism's utilization of the arabinose sidechains rather than the more abundant xylan backbone. Subfamily GH43_20 contains only bacterial sequences, none of which have been biochemically characterized; additional study would be necessary to further either of these hypotheses.

4.4.5 X19

The 3-dimensional structure of several GH43 members shows a C-terminal extension, hereafter termed X19, which folds independently of the beta-propeller but has no apparent function. The systematic C-terminal position of X19 relative to GH43, as well as the absence of any linking peptide or domain between them, suggests it is a C-terminal cap that may aid in protein stability. This X19 domain is found only with a subset of GH43 subfamilies (GH43_9 through GH43_14, and GH43_36) and its presence originates at a single point in the GH43 subfamily tree. It is possible that this domain may be an evolutionary remnant unnecessary for catalytic function. Of additional note is that the X19 domain is not seen to occur with any GH family other than GH43.

4.4.6 Signal Peptides

To identify the cellular locations of GH43-containing proteins, I searched for secretion signal peptides found to co-occur with GH43 domains. Across all GH43-containing proteins, 69% contain a signal peptide directing the translated protein outside the cytoplasm. One exception is subfamily GH43_11, none of whose 860 members co-occur with a signal peptide. This subfamily is limited to the Ascomycota, and this lack of signal peptide suggests it is involved in intracellular processes such as the degradation of imported disaccharides, or cell wall remodelling.
In contrast, proteins containing multiple GH43 domains are found to contain a signal peptide in over 92% of cases, suggesting they play a role in degradation of extracellular substrates.

4.5 Proteins Containing Multiple GH43 Domains

In addition to searching for other CAZy modules found with GH43 subfamily domains, I also searched for proteins containing multiple GH43 domains from different subfamilies. This identified 301 proteins that contained two GH43 domains, and 20 proteins that contained three GH43 domains (Figure 4.5). As noted in Section 4.4.6, over 92% of these multi-domain proteins contain a signal peptide. The most common GH43 domain to be found with another GH43 domain is from subfamily GH43_34, which has 244 of 928 total entries (26.3%) found with another GH43 domain. Furthermore, the GH43_34 domain is found to be C-terminal in 231 of these, which may allude to an unknown functional characteristic. This subfamily is currently uncharacterized, which opens the door for many questions related to both its membership and position within multi-modular proteins. Such a frequent level of co-occurrence may suggest a potential synergistic interaction, an auxiliary role for this subfamily in identifying or binding substrates, or a potential loss of function that would need to be confirmed through functional biochemical analysis of individual protein domains. The existence of enzymes containing up to three distinct GH43 domains further underscores the functional diversity within the GH43 family.

4.6 Conclusion

The recent interest in biomass degrading enzymes and gut microbiome studies has led to a rapid expansion in glycoside hydrolase sequences, including the GH43 family. This abundance of data allows for a more finely detailed analysis of the family, but also exposes limitations arising from the existence of such a large GH family.
Functional and structural characteristics assigned to the family are not shared amongst all members, but are partitioned at a level below that of the current GH family designation. The subfamily classification system developed here partitions sequences into more homogeneous, finer subgroups in order to improve protein sequence annotation and functional prediction for future genomic and metagenomic studies. These subfamilies encompass over 96% of all completely sequenced GH43 modules and show both phylogenetic and functional characteristics to support their assignments. Of these 37 subfamilies, 22 have characterized members, showing that progress in the field of functional characterization has lagged significantly behind sequencing progress. Nonetheless, for subfamilies with multiple characterized members, there is strong agreement within a subfamily towards a particular enzymatic activity.

Figure 4.5: Counts of enzymes containing multiple GH43 domains. Table showing counts of all enzymes identified to contain multiple GH43 domains (columns: 1st, 2nd and 3rd domain, and count; total 321). Domains are listed in order of occurrence (1st, 2nd, 3rd) beginning at the N-terminal end of the peptide. GH43 represents domains unable to be classified into a subfamily.

4.6.1 Limitations

While this work was purely computational, it relies on the experimental efforts of dozens of previous studies for enzyme characterization and structure determination. However, many functional studies rely upon synthetic substrates, such as p-nitrophenol (pNP) monosaccharides, similar to those used in Chapter 2 of this thesis.
While they are convenient in application, they do not allow elucidation of mechanistic subtleties of an enzyme, such as the specificity for di-substituted xylan residues as evidenced by McKee et al [107], that may underlie subfamily divisions. The use of natural substrates during screening efforts provides additional information about how the enzyme behaves on its native substrate.

Additionally, many experimental screening studies use an insufficient diversity of substrates to detect enzymatic activity, as evidenced by a strong but incomplete overlap of α-L-arabinofuranosidase and β-D-xylosidase activity among subfamilies. Many studies reporting one of these activities based on pNP sugar substrates failed to include the other in their assay, and thus may be reporting incomplete substrate specificities. To avoid this, I recommend that future screening efforts first identify the CAZy family to which an enzyme belongs, and test for all activities reported within the family wherever possible.

4.6.2 Future Directions

As was done previously for other CAZy families [154] [9] [153] [93], the subfamily classifications presented here can be extended to additional families. Before such an analysis can be performed, there needs to exist a demonstrated diversity of enzymatic functions, including substrate specificity, taxonomic diversity, and catalytic machinery, on which to base the classification. With continued advances in the DNA sequencing field, there will surely be more subfamily classifications for a range of CAZy families in the future.

The mapping of demonstrated biochemical characteristics to individual subfamilies allows for finer resolution of enzyme function, but the dearth of characterized enzymes leaves many holes unfilled in the GH43 tree.
This subfamily classification provides targets for functional characterization efforts in order to gain a fuller understanding of the diversity of GH43 activity. While unlikely, it may be possible to uncover previously unidentified activities within the GH43 family. Analysis of fosmids from the high-throughput screening performed in Chapter 2 revealed fosmids containing GH43 domains from four currently uncharacterized subfamilies: GH43_2, GH43_7, GH43_18, and GH43_28. Future efforts will focus on cloning, expressing and characterizing these proteins in order to fill in these characterization gaps. Bioinformatic studies not only benefit from experimental science to improve predictions, but can also guide experimentalists towards subfamilies with no characterized members.

Chapter 5

Conclusion

This dissertation presented the development and application of a high-throughput functional metagenomic screen for glycoside hydrolases. This final chapter places this work in the broader scientific context of the research community, discusses assumptions and limitations of these approaches, and outlines potential future directions and applications.

5.1 Related Research and Context

During the course of this work, similar high-throughput functional metagenomic approaches were developed that complement and extend the methods and results presented herein. Searching novel environments under relevant conditions [66] [120], characterizing enzymes related to the sequential degradation of biomass [88] [100], and improving in silico predictions based on functional characterization [142] [132] have been recurring themes in many studies that have paralleled this work.

In order to identify enzymes with commercially relevant activities, studies have looked at environments with phenotypes of interest to maximize their potential for discovery. O'Connor and colleagues examined the digestive
strategy of the shipworm Bankia setacea, using a combination of endosymbiont isolate sequencing, metagenomics, and proteomics [120]. B. setacea uses a novel digestive system in which a subset of 42 proteins are transported from the gill-inhabiting endosymbiont into the cecum to deconstruct dietary biomass. Of these 42 proteins, 41 were seen to be plant cell wall-degrading enzymes, and several contained unknown catalytic domains that were linked to CBMs known to interact with cellulose and xylan. Additionally, this subset represents a set of enzymes known to act in a coordinated manner for the degradation of lignocellulose. This targeted selection of an environment led to the discovery of novel catalytic domains and provided important co-localization information about plant cell wall-degrading enzymes in the shipworm cecum.

Selection of relevant screening conditions also plays a key role in the discovery of enzymes of interest. Towards that end, Gladden et al. [52] identified thermotolerant and ionic-liquid tolerant cellulases from a microbial community grown on switchgrass, by combining metagenomics with expression and screening of select enzymes. Candidate genes were identified in silico by metagenomic methods, synthesized, expressed both in vitro in a cell-free system and in vivo using a low copy plasmid, and tested against model compounds at differing pH, temperature, and ionic-liquid (IL) concentrations. IL pretreatment of biomass can improve efficiency of degradation, but current cellulase cocktails are incompatible with the presence of IL [163]. By testing temperature and IL tolerance, the authors were able to identify a correlation between thermotolerance and IL tolerance within an individual enzyme, which may improve commercial bioprocessing efforts.

Studies of PULs within gut microbes have provided important information about the sequential degradation of complex polysaccharides. Work by Larsbrink et al.
[88] on the xyloglucan utilization locus encoded in Bacteroides ovatus, responsible for the breakdown of this abundant dietary polysaccharide, used genetic knockouts and substrate specificity assays to characterize each of the eight GH genes found in the locus. This work outlined a distinct pathway for xyloglucan degradation, establishing ordered roles for individual enzymes. While many studies have identified the activity of single CAZymes, understanding their activities in combination with others can help guide synthetic biology approaches for engineered microbes or consortia.

Efforts to more completely characterize CAZy families in relation to functional diversity and substrate specificity have also been recently undertaken. Subfamily divisions of the PL families [93], GH30 [153], and GH5 [9] have provided more finely resolved details of activity within the subfamilies, and in the case of GH5 and GH30, identified subfamilies that have lost activity altogether. These subfamily classifications became possible due to increased biochemical characterization of individual family members, and feed this information forward to improve functional predictions of newly discovered genes. Additionally, subclassification of genes brings to light the areas of phylogenetic GH family trees that have been comparatively understudied, and opens them up to gene synthesis approaches that can improve understanding of these families.

The investigation of the beaver digestive tract paralleled these approaches in that it investigated a previously unexplored environment to search for patterns of sequential biomass degradation. The selection of beaver feces and digestive tract for study leverages known information about beaver digestion [34], and provides a rationale for searching for lignocellulose degrading genes. Furthermore, by undertaking a longitudinal study of the beaver digestive tract, I identified patterns that may underlie sequential biomass degradation in the context of ecosystems rather than individual organisms. The abundance of GH43 genes in these environments fuelled the subfamily classification scheme of GH43 enzymes, and the discovery of genes that belong to currently uncharacterized subfamilies will help to further improve functional annotation of these sequences from in silico approaches in the future.

5.2 Assumptions and Limitations

The high-throughput functional screening approach here was successful in identifying glycoside hydrolases derived from both engineered and natural microbial communities, and in using this information to improve functional predictions made by in silico approaches. Nevertheless, the genes and enzymes discovered here still cannot present a complete picture of glycoside hydrolase abundance and diversity in any given environment due to expression bias of the E. coli host, the limited sample size of large-insert metagenomic libraries, the lack of gene expression data, and current knowledge limitations of enzyme activity on natural substrates.

While the Epi300 E. coli used in this screen confers advantages including reduced homologous recombination (achieved through the removal of the recA gene) and induced high copy-number of fosmids, it remains limited in its expression of heterologous proteins. Codon biases or sigma-factor requirements of genes from different bacterial phyla may hamper their expression in a heterologous host [55]. To address this, plasmids have been engineered for E. coli that carry genes for tRNAs that are rare in E. coli, in order to increase expression of heterologous genes [16]. Additionally, Gaida et al.
[50] have developed an E. coli strain with additional sigma factors and shown this to be an effective metagenomic screening strain for expression of heterologous proteins. These approaches, combined with the advantages of the Epi300 strain used in this work, may result in additional expression of heterologous genes, and improve the recovery of phylogenomically-distinct glycoside hydrolases.

Cellulases were the first target for functional screening approaches [61], and have since become among the most common targets [166]. Given this, many enzymes that show the strongest activity (as measured by a z-score cutoff) have been previously identified in functional screens. While the approach in this study used a z-score cutoff of six to identify positive clones, there may be value in closer characterization of clones in the lower ranges of z-scores in an attempt to identify novel or promiscuous enzymes. These efforts would require high-throughput characterization approaches to accommodate the larger number of positive clones, as well as computational efforts to select clones with phylogenomically-distinct sequences that were previously overlooked.

Metagenomic sequencing identifies the genetic potential present in any given environment, but the presence of a gene does not signal its transcription or translation. Metatranscriptomics and metaproteomics can be combined with metagenomics to give information about the processes that are occurring at a given time or location, and can be tied to other environmental parameters in search of biological signal [60].
For example, work to study changing GH patterns as biomass traverses the beaver digestive tract would benefit from metatranscriptomic sequencing, as microbes will travel down the tract with the biomass, but may change transcription profiles as different carbohydrate components of the substrate are consumed. The ability to resolve changing expression profiles may reveal patterns that remain indistinguishable with metagenomic sequencing alone.

For all but the simplest environments, metagenomic libraries represent under-sampling experiments that have difficulty in identifying rare (<1% abundance) community members [170]. For example, an individual human has been estimated to carry an average of nearly 600,000 unique genes, and the total human gut pan-microbiome has been estimated to contain 3.3 million non-redundant microbial genes at varying abundances [130]. Calculation of metagenomic sequence coverage is non-trivial, requiring a weighted Poisson distribution with known species abundance values, but methods of estimation based on subsampling approaches or assumptions of abundance of particular taxa have been undertaken (reviewed in [136]). The largest of the metagenomic libraries in this work (beaver proximal colon) contained 46 plates of 384 wells each, or 17,664 clones. Assuming each clone contained on average 40 genes, this represents 706,560 genes, insufficient for identification of rare genes within the environment due to repeated recovery of the most abundant genes. To offset this, targeted single-cell genomics approaches should be employed in future studies to selectively target genes from relatively low abundance community members [105]. The combined use of single-cell genomics with high-throughput functional screening can provide a more complete picture of total community function.

Curated sequence databases rely on biochemically-characterized enzyme activities in order to assign function to DNA sequence.
These activities are most often determined through testing against model compounds with defined linkages; in the case of GHs these are typically short (1 to 4 residues) oligosaccharides attached to a fluorophore [26]. Enzymes are given an Enzyme Commission (EC) number to denote the reactions catalyzed. In nature, however, the linkages encountered by enzymes are often involved in additional inter- or intra-molecular bonds [118] that may affect binding ability or enzyme efficiency and are not reflected in the EC number designation. Conserved residues amongst an enzyme family can be identified in silico using protein alignments, but more particular details including steric constraints and product inhibition cannot be detected unless specifically tested for. By testing enzymes against natural instead of synthetic substrates, we can map more specific details of activity to sequences, and thus improve functional predictions of enzymes in silico.

5.3 Future Directions and Applications

The metagenomic recovery of biomass degradation genes is rapidly progressing in terms of both throughput [30] and scope [158]. The genes recovered in this study present a suite of enzymes that can be used to aid in biomass conversion efforts, and for extended characterization of GH families. The large-insert fosmids also provide important gene linkage information that allows for identification of gene clusters, such as PULs. Incorporation of newly available methods and technologies will further increase the utility of this approach for improving our understanding of biomass degradation.

The genes identified here serve as a base upon which optimization or genetic engineering approaches can begin. Directed evolution of these genes using either random (such as error-prone PCR) [174] or sequence-guided [186] approaches can be employed to improve particular functional traits of interest for commercial application.
Current biomass processing efforts are limited by enzyme cost and recovery [67]; thus the improvement of enzyme stability and inhibitor tolerance may aid in the large-scale adoption of these approaches. Additionally, these discovered enzymes may be tested as-is in conjunction with current commercial enzyme cocktails to test for improved breakdown of feedstock. Functionally informed combinatorial approaches can select enzymes from different GH families in an effort to tailor a cocktail for particular substrates.

The declining cost of gene synthesis allows for large-scale efforts to more completely characterize gene families of interest. Work by Heins et al. [63] to characterize activities across the complete phylogenomic landscape of the GH1 family highlights the potential for using gene synthesis to better understand CAZy families. Similar to the approach presented in Chapter 4, the future characterization of the discovered GH43 genes that belong to currently uncharacterized subfamilies will provide more confident functional predictions of these enzymes from in silico studies. Recent work showing the expansion of GH43 in plant cell wall-degrading organisms [80], as well as the importance of GH43 enzymes in PULs found in the human gut microbiome [187], shows the necessity for greater understanding of this large GH family. Applying the same methods to other GH families can similarly improve our capacity to understand carbohydrate degradation.

The screening approach used here is capable of surveying hundreds of thousands of clones per week. While this is significant, it still represents only a fraction of the depth necessary to fully characterize the carbohydrate breakdown processes in an environment. Technologies including fluorescence-activated cell sorting (FACS) [5] and droplet-based microfluidics [30] have been applied to screen orders of magnitude more clones.
Application of these technologies to GH discovery represents the next generation of screening approaches compared to the methods developed and used here. A challenge for these efforts will be in the downstream analysis and characterization, but the previously mentioned efforts to map function across GH family trees will allow for selection of genes belonging to subgroups of interest.

The ability to recover long, contiguous DNA sequences through the use of fosmid screening is one advantage of functional metagenomics compared to in silico approaches. Difficulties in assembly and limited DNA sequencing read length previously stood as a barrier to the identification of large gene clusters. However, the advancement of single-cell sequencing [157] and improving read lengths for commercial sequencing technologies [8] have minimized this advantage. Recently, Illumina has developed the TruSeq platform, which partitions large DNA sequences into individual reactions before library preparation. This allows for easier and more reliable assembly, and provides synthetic long (>10 kbp) reads that can be used in a similar manner to fosmid sequences for the identification of PULs or other gene clusters. The higher throughput of sequencing-based metagenomic methods positions these developments as improvements over fosmid sequencing for future efforts in PUL identification.

The screening platform developed here has also provided a starting point for a range of other projects currently underway in the Hallam Lab. It has been adapted with alternative substrates for the identification of glycoside phosphorylases, which are of particular biotechnological interest due to their easy modification into glycosynthases. Identification of phosphorylases that can tolerate functional groups, such as azides or amides, at different positions of the sugar ring shows promise for use in click chemistry [81].
The large number of enzymes identified also has potential for screening applications in which a particular substrate can be tested against all clones, in a manner similar to what the Biolog plate [184] does for cellular metabolism. Biorefining efforts benefit from enzyme cocktails tailored to particular substrates, so the ability to link an efficiency test to a “Cellulog” plate can provide a rapid test to select for tailored enzymes. Another approach that would help take these genes from bench towards batch or bioreactor applications is to express them in a more commercially applicable system, such as Pichia pastoris or within the S-layer of Caulobacter crescentus. Given the limitations of E. coli discussed above, showing strong activity in a biorefinery-ready system would add further value to the work presented here. These projects have helped to establish the importance and applicability of high-throughput functional metagenomic screening across different projects in the lab.

5.4 Closing

The use of functional metagenomic screening for glycoside hydrolases has improved our understanding of carbohydrate degradation in natural and engineered ecosystems. This has generated interest in fields including microbial ecology, biotechnology, and human health. The incorporation of new technologies will improve both discovery and characterization approaches moving forward. The work here represents a small step in establishing a framework for the discovery and analysis of new genes, and provides a base upon which a multitude of future questions can be pursued.
The continued investigation of carbohydrate degradation processes in various ecosystems will drive forward both primary and applied research objectives, and expand our ability to understand and engineer these processes for future benefit.

Bibliography

[1] D Wade Abbott, Elizabeth Ficko-Blean, Alicia Lammerts van Bueren, Artur Rogowski, Alan Cartmell, Pedro M Coutinho, Bernard Henrissat, Harry J Gilbert, and Alisdair B Boraston. Analysis of the structural and functional diversity of plant cell wall specific family 6 carbohydrate binding modules. Biochemistry, 48(43):10395–10404, 2009.
[2] Mads Albertsen, Philip Hugenholtz, Adam Skarshewski, Kåre L Nielsen, Gene W Tyson, and Per H Nielsen. Genome sequences of rare, uncultured bacteria obtained by differential coverage binning of multiple metagenomes. Nature biotechnology, 31(6):533–538, 2013.
[3] Miguel Alcalde, Manuel Ferrer, Francisco J Plou, and Antonio Ballesteros. Environmental biocatalysis: from remediation with enzymes to novel green processes. Trends in biotechnology, 24(6):281–287, 2006.
[4] Stephen F Altschul, Thomas L Madden, Alejandro A Schäffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J Lipman. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic acids research, 25(17):3389–3402, 1997.
[5] Isabelle Andre, Gabrielle Potocki-Véronèse, Sophie Barbe, Claire Moulis, and Magali Remaud-Siméon. CAZyme discovery and design for sweet dreams. Current opinion in chemical biology, 19:17–24, 2014.
[6] Zachary Armstrong, Keith Mewis, Cameron Strachan, and Steven J Hallam. Biocatalysts for biomass deconstruction from environmental genomics. Current opinion in chemical biology, 29:18–25, 2015.
[7] Zachary Armstrong, Stephan Reitinger, Terrence Kantner, and Stephen G Withers. Enzymatic thioxyloside synthesis: Characterization of thioglycoligase variants identified from a site-saturation mutagenesis library of Bacillus circulans xylanase.
ChemBioChem, 11(4):533–538, 2010.
[8] Philip M Ashton, Satheesh Nair, Tim Dallman, Salvatore Rubino, Wolfgang Rabsch, Solomon Mwaigwisya, John Wain, and Justin O’Grady. MinION nanopore sequencing identifies the position and structure of a bacterial antibiotic resistance island. Nature biotechnology, 33(3):296–300, 2015.
[9] Henrik Aspeborg, Pedro M Coutinho, Yang Wang, Harry Brumer, and Bernard Henrissat. Evolution, substrate specificity and subfamily classification of glycoside hydrolase family 5 (GH5). BMC evolutionary biology, 12(1):1, 2012.
[10] Anton Bankevich, Sergey Nurk, Dmitry Antipov, Alexey A Gurevich, Mikhail Dvorkin, Alexander S Kulikov, Valery M Lesin, Sergey I Nikolenko, Son Pham, Andrey D Prjibelski, et al. SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. Journal of Computational Biology, 19(5):455–477, 2012.
[11] Géraldine Bastien, Grégory Arnal, Sophie Bozonnet, Sandrine Laguerre, Fernando Ferreira, Régis Fauré, Bernard Henrissat, Fabrice Lefèvre, Patrick Robe, Olivier Bouchez, et al. Mining for hemicellulases in the fungus-growing termite Pseudacanthotermes militaris using functional metagenomics. Biotechnology for biofuels, 6(1):1, 2013.
[12] Douglas Bates, Martin Maechler, Ben Bolker, Steven Walker, et al. lme4: Linear mixed-effects models using Eigen and S4. R package version, 1(7), 2014.
[13] William T Beeson, Christopher M Phillips, Jamie HD Cate, and Michael A Marletta. Oxidative cleavage of cellulose by fungal copper-dependent polysaccharide monooxygenases. Journal of the American Chemical Society, 134(2):890–892, 2011.
[14] Alex Berlin, Mikhail Balakshin, Neil Gilkes, John Kadla, Vera Maximenko, Satoshi Kubo, and Jack Saddler. Inhibition of cellulase, xylanase and β-glucosidase activities by softwood lignin preparations. Journal of Biotechnology, 125(2):198–209, 2006.
[15] Lorenzo Bertin, Serena Capodicasa, Stefano Fedi, Davide Zannoni, Leonardo Marchetti, and Fabio Fava.
Biotransformation of a highly chlorinated PCB mixture in an activated sludge collected from a membrane biological reactor (MBR) subjected to anaerobic digestion. Journal of hazardous materials, 186(2):2060–2067, 2011.
[16] Ulrich Brinkmann, Ralf E Mattes, and Peter Buckel. High-level expression of recombinant genes in Escherichia coli is dependent on the availability of the dnaY gene product. Gene, 85(1):109–114, 1989.
[17] R Malcolm Brown Jr. Algae as tools in studying the biosynthesis of cellulose, nature’s most abundant macromolecule. In Cell walls and surfaces, reproduction, photosynthesis, pages 20–39. Springer, 1990.
[18] Hilary P Browne, Samuel C Forster, Blessing O Anonye, Nitin Kumar, B Anne Neville, Mark D Stares, David Goulding, and Trevor D Lawley. Culturing of ‘unculturable’ human microbiota reveals novel taxa and extensive sporulation. Nature, 533(7604):543–546, 2016.
[19] Roman Brunecky, Markus Alahuhta, Qi Xu, Bryon S Donohoe, Michael F Crowley, Irina A Kataeva, Sung-Jae Yang, Michael G Resch, Michael WW Adams, Vladimir V Lunin, et al. Revealing nature’s cellulase diversity: the digestion mechanism of Caldicellulosiruptor bescii CelA. Science, 342(6165):1513–1516, 2013.
[20] Wilhelm Pascal Burmeister, Sylvain Cottaz, Patrick Rollin, Andrea Vasella, and Bernard Henrissat. High resolution X-ray crystallography shows that ascorbate is a cofactor for myrosinase and substitutes for the function of the catalytic base. Journal of Biological Chemistry, 275(50):39385–39393, 2000.
[21] J Gregory Caporaso, Justin Kuczynski, Jesse Stombaugh, Kyle Bittinger, Frederic D Bushman, Elizabeth K Costello, Noah Fierer, Antonio Gonzalez Pena, Julia K Goodrich, Jeffrey I Gordon, et al. QIIME allows analysis of high-throughput community sequencing data. Nature methods, 7(5):335–336, 2010.
[22] Mauricio O Carneiro, Carsten Russ, Michael G Ross, Stacey B Gabriel, Chad Nusbaum, and Mark A DePristo.
Pacific Biosciences sequencing technology for genotyping and variation discovery in human data. BMC genomics, 13(1):1, 2012.
[23] Alan Cartmell, Lauren S McKee, Maria J Peña, Johan Larsbrink, Harry Brumer, Satoshi Kaneko, Hitomi Ichinose, Richard J Lewis, Anders Viksø-Nielsen, Harry J Gilbert, et al. The structure and function of an arabinan-specific α-1,2-arabinofuranosidase identified from screening the activities of bacterial GH43 glycoside hydrolases. Journal of Biological Chemistry, 286(17):15483–15495, 2011.
[24] Ron Caspi, Tomer Altman, Richard Billington, Kate Dreher, Hartmut Foerster, Carol A Fulcher, Timothy A Holland, Ingrid M Keseler, Anamika Kothari, Aya Kubo, et al. The MetaCyc database of metabolic pathways and enzymes and the BioCyc collection of pathway/genome databases. Nucleic acids research, 42(D1):D459–D471, 2014.
[25] Mark JP Chaisson, Richard K Wilson, and Evan E Eichler. Genetic variation and the de novo assembly of human genomes. Nature Reviews Genetics, 2015.
[26] Hong-Ming Chen, Zachary Armstrong, Steven J Hallam, and Stephen G Withers. Synthesis and evaluation of a series of 6-chloro-4-methylumbelliferyl glycosides as fluorogenic reagents for screening metagenomic libraries for glycosidase activity. Carbohydrate Research, 421:33–39, 2016.
[27] Bastien Chevreux, Thomas Pfisterer, Bernd Drescher, Albert J Driesel, Werner EG Müller, Thomas Wetter, and Sándor Suhai. Using the miraEST assembler for reliable and automated mRNA transcript assembly and SNP detection in sequenced ESTs. Genome research, 14(6):1147–1159, 2004.
[28] Kulika Chomvong, Vesna Kordić, Xin Li, Stefan Bauer, Abigail E Gillespie, Suk-Jin Ha, Eun Joong Oh, Jonathan M Galazka, Yong-Su Jin, and Jamie HD Cate. Overcoming inefficient cellobiose fermentation by cellobiose phosphorylase in the presence of xylose. Biotechnol Biofuels, 7:85, 2014.
[29] Jose C Clemente, Luke K Ursell, Laura Wegener Parfrey, and Rob Knight.
The impact of the gut microbiota on human health: an integrative view. Cell, 148(6):1258–1270, 2012.
[30] Pierre-Yves Colin, Balint Kintses, Fabrice Gielen, Charlotte M Miton, Gerhard Fischer, Mark F Mohamed, Marko Hyvönen, Diego P Morgavi, Dick B Janssen, and Florian Hollfelder. Ultrahigh-throughput discovery of promiscuous enzymes by picodroplet functional metagenomics. Nature communications, 6, 2015.
[31] Michael Cotta and Robert Forster. The family Lachnospiraceae, including the genera Butyrivibrio, Lachnospira and Roseburia. In The Prokaryotes, pages 1002–1021. Springer, 2006.
[32] Jeffrey W Craig, Fang-Yuan Chang, Jeffrey H Kim, Steven C Obiajulu, and Sean F Brady. Expanding small-molecule functional metagenomics through parallel screening of broad-host-range cosmid environmental DNA libraries in diverse proteobacteria. Applied and environmental microbiology, 76(5):1633–1641, 2010.
[33] Eamonn P Culligan, Julian R Marchesi, Colin Hill, and Roy D Sleator. Mining the human gut microbiome for novel stress resistance genes. Gut microbes, 3(4):394–397, 2012.
[34] A Currier, WD Kitts, and I McT Cowan. Cellulose digestion in the beaver (Castor canadensis). Canadian journal of zoology, 38(6):1109–1116, 1960.
[35] Rolf Daniel. The metagenomics of soil. Nature Reviews Microbiology, 3(6):470–478, 2005.
[36] Aaron E Darling, Guillaume Jospin, Eric Lowe, Frederick A Matsen IV, Holly M Bik, and Jonathan A Eisen. PhyloSift: phylogenetic analysis of genomes and metagenomes. PeerJ, 2:e243, 2014.
[37] Gideon Davies and Bernard Henrissat. Structures and mechanisms of glycosyl hydrolases. Structure, 3(9):853–859, 1995.
[38] Tom O Delmont, Emmanuel Prestat, Kevin P Keegan, Michael Faubladier, Patrick Robe, Ian M Clark, Eric Pelletier, Penny R Hirsch, Folker Meyer, Jack A Gilbert, et al. Structure, fluctuation and magnitude of a natural grassland soil metagenome.
The ISME journal, 6(9):1677–1687, 2012.
[39] Anne Durand, Richard Hughes, Alain Roussel, Ruth Flatman, Bernard Henrissat, and Nathalie Juge. Emergence of a subfamily of xylanase inhibitors within glycoside hydrolase family 18. FEBS Journal, 272(7):1745–1755, 2005.
[40] Robert C Edgar. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic acids research, 32(5):1792–1797, 2004.
[41] Robert C Edgar. Search and clustering orders of magnitude faster than BLAST. Bioinformatics, 26(19):2460–2461, 2010.
[42] Abdessamad El Kaoutari, Fabrice Armougom, Jeffrey I Gordon, Didier Raoult, and Bernard Henrissat. The abundance and variety of carbohydrate-active enzymes in the human gut microbiota. Nature Reviews Microbiology, 11(7):497–504, 2013.
[43] Brent Ewing and Phil Green. Base-calling of automated sequencer traces using phred. II. Error probabilities. Genome research, 8(3):186–194, 1998.
[44] Yi Feng, Cheng-Jie Duan, Hao Pang, Xin-Chun Mo, Chun-Feng Wu, Yuan Yu, Ya-Lin Hu, Jie Wei, Ji-Liang Tang, and Jia-Xun Feng. Cloning and identification of novel cellulase genes from uncultured microorganisms in rabbit cecum and characterization of the expressed cellulases. Applied microbiology and biotechnology, 75(2):319–328, 2007.
[45] Manuel Ferrer, Olga V Golyshina, Tatyana N Chernikova, Amit N Khachane, Dolores Reyes-Duarte, Vitor AP Santos, Carsten Strompl, Kieran Elborough, Graeme Jarvis, Alexander Neef, et al. Novel hydrolase diversity retrieved from a metagenome library of bovine rumen microflora. Environmental Microbiology, 7(12):1996–2010, 2005.
[46] James G Ferry. How to make a living by exhaling methane. Annual review of microbiology, 64:453–473, 2010.
[47] Harry J Flint, Edward A Bayer, Marco T Rincon, Raphael Lamed, and Bryan A White. Polysaccharide utilization by gut bacteria: potential for new insights from genomic analysis.
Nature Reviews Microbiology, 6(2):121–131, 2008.
[48] GE Fox, E Stackebrandt, RB Hespell, J Gibson, J Maniloff, TA Dyer, RS Wolfe, WE Balch, RS Tanner, LJ Magrum, et al. The phylogeny of prokaryotes. Science (New York, NY), 209(4455):457, 1980.
[49] Limin Fu, Beifang Niu, Zhengwei Zhu, Sitao Wu, and Weizhong Li. CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics, 28(23):3150–3152, 2012.
[50] Stefan M Gaida, Nicholas R Sandoval, Sergios A Nicolaou, Yili Chen, Keerthi P Venkataramanan, and Eleftherios T Papoutsakis. Expression of heterologous sigma factors enables functional screening of metagenomic and heterologous genomic libraries. Nature communications, 6, 2015.
[51] Arabinda Ghosh, Ana Sofia Luís, Joana LA Brás, Neeta Pathaw, Nikhil K Chrungoo, Carlos MGA Fontes, and Arun Goyal. Deciphering ligand specificity of a Clostridium thermocellum family 35 carbohydrate binding module (CtCBM35) for gluco- and galacto-substituted mannans and its calcium induced stability. PloS one, 8(12):e80415, 2013.
[52] John M Gladden, Joshua I Park, Jessica Bergmann, Vimalier Reyes-Ortiz, Patrik D’haeseleer, Betania F Quirino, Kenneth L Sale, Blake A Simmons, and Steven W Singer. Discovery and characterization of ionic liquid-tolerant thermophilic cellulases from a switchgrass-adapted microbial community. Biotechnology for biofuels, 7(1):1, 2014.
[53] A Gordon and GJ Hannon. Fastx-toolkit. FASTQ/A short-reads pre-processing tools. Unpublished. Available online at: http://hannonlab.cshl.edu/fastx_toolkit, 2010.
[54] Robert J Gruninger, Tim A McAllister, and Robert J Forster. Bacterial and archaeal diversity in the gastrointestinal tract of the North American beaver (Castor canadensis). PloS one, 11(5):e0156457, 2016.
[55] Claes Gustafsson, Sridhar Govindarajan, and Jeremy Minshull. Codon bias and heterologous protein expression. Trends in biotechnology, 22(7):346–353, 2004.
[56] Jo Handelsman.
Metagenomics: application of genomics to uncultured microorganisms. Microbiology and molecular biology reviews, 68(4):669–685, 2004.
[57] Niels William Hanson. MetaPathways: a modular pipeline for the analysis of environmental sequence information. PhD thesis, University of British Columbia, 2015.
[58] Mohamed F Haroon, Shihu Hu, Ying Shi, Michael Imelfort, Jurg Keller, Philip Hugenholtz, Zhiguo Yuan, and Gene W Tyson. Anaerobic oxidation of methane coupled to nitrate reduction in a novel archaeal lineage. Nature, 500(7464):567–570, 2013.
[59] Martin Hartmann, Charles G Howes, David VanInsberghe, Hang Yu, Dipankar Bachar, Richard Christen, Rolf Henrik Nilsson, Steven J Hallam, and William W Mohn. Significant and persistent impact of timber harvesting on soil microbial communities in northern coniferous forests. The ISME journal, 6(12):2199–2218, 2012.
[60] Alyse K Hawley, Heather M Brewer, Angela D Norbeck, Ljiljana Paša-Tolić, and Steven J Hallam. Metaproteomics reveals differential modes of metabolic coupling among ubiquitous oxygen minimum zone microbes. Proceedings of the National Academy of Sciences, 111(31):11395–11400, 2014.
[61] FG Healy, RM Ray, HC Aldrich, AC Wilkie, LO Ingram, and KT Shanmugam. Direct isolation of functional genes encoding cellulases from the microbial consortia in a thermophilic, anaerobic digester maintained on lignocellulose. Applied microbiology and biotechnology, 43(4):667–674, 1995.
[62] Jan-Hendrik Hehemann, Gaëlle Correc, Tristan Barbeyron, William Helbert, Mirjam Czjzek, and Gurvan Michel. Transfer of carbohydrate-active enzymes from marine bacteria to Japanese gut microbiota. Nature, 464(7290):908–912, 2010.
[63] Richard A Heins, Xiaoliang Cheng, Sangeeta Nath, Kai Deng, Benjamin P Bowen, Dylan C Chivian, Supratim Datta, Gregory D Friedland, Patrik D’Haeseleer, Dongying Wu, et al.
Phylogenomically guided identification of industrially relevant GH1 β-glucosidases through DNA synthesis and nanostructure-initiator mass spectrometry. ACS chemical biology, 9(9):2082–2091, 2014.
[64] B Henrissat, H Driguez, C Viet, and M Schülein. Synergism of cellulases from Trichoderma reesei in the degradation of cellulose. Nature Biotechnology, 3(8):722–726, 1985.
[65] Bernard Henrissat. A classification of glycosyl hydrolases based on amino acid sequence similarities. Biochemical Journal, 280(2):309–316, 1991.
[66] Matthias Hess, Alexander Sczyrba, Rob Egan, Tae-Wan Kim, Harshal Chokhawala, Gary Schroth, Shujun Luo, Douglas S Clark, Feng Chen, Tao Zhang, et al. Metagenomic discovery of biomass-degrading genes and genomes from cow rumen. Science, 331(6016):463–467, 2011.
[67] Michael E Himmel, Shi-You Ding, David K Johnson, William S Adney, Mark R Nimlos, John W Brady, and Thomas D Foust. Biomass recalcitrance: engineering plants and enzymes for biofuels production. Science, 315(5813):804–807, 2007.
[68] Adina Chuang Howe, Janet K Jansson, Stephanie A Malfatti, Susannah G Tringe, James M Tiedje, and C Titus Brown. Tackling soil diversity with the assembly of large, complex metagenomes. Proceedings of the National Academy of Sciences, 111(13):4904–4909, 2014.
[69] Daniel H Huson, Alexander F Auch, Ji Qi, and Stephan C Schuster. MEGAN analysis of metagenomic data. Genome research, 17(3):377–386, 2007.
[70] Nguyen Duc Huy, Palvannan Thayumanavan, Tae-Ho Kwon, and Seung-Moon Park. Characterization of a recombinant bifunctional xylosidase/arabinofuranosidase from Phanerochaete chrysosporium. Journal of bioscience and bioengineering, 116(2):152–159, 2013.
[71] Daohua Jiang, Junping Fan, Xianping Wang, Yan Zhao, Bo Huang, Jianfeng Liu, and Xuejun C Zhang. Crystal structure of 1,3Gal43A, an exo-β-1,3-galactanase from Clostridium thermocellum. Journal of structural biology, 180(3):447–457, 2012.
[72] Stephen Johnson, Brett Trost, Jeffrey R Long, Vanessa Pittet, and Anthony Kusalik.
A better sequence-read simulator program for metagenomics. BMC bioinformatics, 15(9):1, 2014.
[73] Dimitris Kallifidas, Hahk-Soo Kang, and Sean F Brady. Tetarimycin A, an MRSA-active antibiotic identified through induced expression of environmental DNA gene clusters. Journal of the American Chemical Society, 134(48):19552–19555, 2012.
[74] Jens Kallmeyer, Robert Pockalny, Rishi Ram Adhikari, David C Smith, and Steven D’Hondt. Global distribution of microbial abundance and biomass in subseafloor sediment. Proceedings of the National Academy of Sciences, 109(40):16213–16216, 2012.
[75] Minoru Kanehisa, Yoko Sato, Masayuki Kawashima, Miho Furumichi, and Mao Tanabe. KEGG as a reference resource for gene and protein annotation. Nucleic acids research, 44(D1):D457–D462, 2016.
[76] Kazutaka Katoh and Daron M Standley. MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Molecular biology and evolution, 30(4):772–780, 2013.
[77] Jonathan D. E. Kawaja, Katy Morin, and W. Douglas Gould. A duplicate column study of arsenic, cadmium and zinc treatment in an anaerobic bioreactor based on a system operated by Teck Cominco in Trail, British Columbia, 2005.
[78] Szymon M Kiełbasa, Raymond Wan, Kengo Sato, Paul Horton, and Martin C Frith. Adaptive seeds tame genomic sequence comparison. Genome research, 21(3):487–493, 2011.
[79] Dieter Klemm, Brigitte Heublein, Hans-Peter Fink, and Andreas Bohn. Cellulose: Fascinating biopolymer and sustainable raw material. Angewandte Chemie International Edition, 44(22):3358–3393, 2005.
[80] Annegret Kohler, Alan Kuo, Laszlo G Nagy, Emmanuelle Morin, Kerrie W Barry, Francois Buscot, Björn Canbäck, Cindy Choi, Nicolas Cichocki, Alicia Clum, et al. Convergent losses of decay mechanisms and rapid turnover of symbiosis genes in mycorrhizal mutualists. Nature Genetics, 47(4):410–415, 2015.
[81] Hartmuth C Kolb, MG Finn, and K Barry Sharpless.
Click chemistry: diverse chemical function from a few good reactions. Angewandte Chemie International Edition, 40(11):2004–2021, 2001.
[82] Kishori M Konwar, Niels W Hanson, Maya P Bhatia, Dongjae Kim, Shang-Ju Wu, Aria S Hahn, Connor Morgan-Lang, Hiu Kan Cheung, and Steven J Hallam. MetaPathways v2.5: quantitative functional, taxonomic and usability improvements. Bioinformatics, 31(20):3345–3347, 2015.
[83] Kishori M Konwar, Niels W Hanson, Antoine P Pagé, and Steven J Hallam. MetaPathways: a modular pipeline for constructing pathway/genome databases from environmental sequence information. BMC bioinformatics, 14(1):202, 2013.
[84] Sergey Koren, Michael C Schatz, Brian P Walenz, Jeffrey Martin, Jason T Howard, Ganeshkumar Ganapathy, Zhong Wang, David A Rasko, W Richard McCombie, Erich D Jarvis, et al. Hybrid error correction and de novo assembly of single-molecule sequencing reads. Nature biotechnology, 30(7):693–700, 2012.
[85] Stijn Lagaert, Annick Pollet, Christophe M Courtin, and Guido Volckaert. β-xylosidases and α-L-arabinofuranosidases: Accessory enzymes for arabinoxylan degradation. Biotechnology advances, 32(2):316–332, 2014.
[86] Camilla Lambertz, Megan Garvey, Johannes Klinger, Dirk Heesel, Holger Klose, Rainer Fischer, and Ulrich Commandeur. Challenges and advances in the heterologous expression of cellulolytic enzymes: a review. Biotechnology for biofuels, 7(1):1, 2014.
[87] David J Lane, Bernadette Pace, Gary J Olsen, David A Stahl, Mitchell L Sogin, and Norman R Pace. Rapid determination of 16S ribosomal RNA sequences for phylogenetic analyses. Proceedings of the National Academy of Sciences, 82(20):6955–6959, 1985.
[88] Johan Larsbrink, Theresa E Rogers, Glyn R Hemsworth, Lauren S McKee, Alexandra S Tauzin, Oliver Spadiut, Stefan Klinter, Nicholas A Pudlo, Karthik Urs, Nicole M Koropatkin, et al.
A discrete genetic locus confers xyloglucan metabolism in select human gut Bacteroidetes. Nature, 506(7489):498–502, 2014.
[89] Sangwon Lee and Steven J Hallam. Extraction of high molecular weight genomic DNA from soils and sediments. Journal of visualized experiments: JoVE, (33), 2009.
[90] Ruth E Ley, Micah Hamady, Catherine Lozupone, Peter J Turnbaugh, Rob Roy Ramey, J Stephen Bircher, Michael L Schlegel, Tammy A Tucker, Mark D Schrenzel, Rob Knight, et al. Evolution of mammals and their gut microbes. Science, 320(5883):1647–1651, 2008.
[91] Heng Li and Richard Durbin. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics, 25(14):1754–1760, 2009.
[92] Miranda V Logan, Kenneth F Reardon, Linda A Figueroa, Jean ET McLain, and Dianne M Ahmann. Microbial community activities during establishment, performance, and decline of bench-scale passive treatment systems for mine drainage. Water research, 39(18):4537–4551, 2005.
[93] Vincent Lombard, Thomas Bernard, Corinne Rancurel, Harry Brumer, Pedro M Coutinho, and Bernard Henrissat. A hierarchical classification of polysaccharide lyases for glycogenomics. Biochemical Journal, 432(3):437–444, 2010.
[94] Vincent Lombard, Hemalatha Golaconda Ramulu, Elodie Drula, Pedro M Coutinho, and Bernard Henrissat. The carbohydrate-active enzymes database (CAZy) in 2013. Nucleic acids research, 42(D1):D490–D495, 2014.
[95] Patrick Lorenz and Jürgen Eck. Metagenomics and industrial applications. Nature Reviews Microbiology, 3(6):510–516, 2005.
[96] Michael Love, Simon Anders, and Wolfgang Huber. Differential analysis of count data–the DESeq2 package. Genome Biology, 15:550, 2014.
[97] Michael I Love, Wolfgang Huber, and Simon Anders. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome biology, 15(12):1–21, 2014.
[98] George T Macfarlane and Sandra Macfarlane. Bacteria, colonic fermentation, and gastrointestinal health.
Journal of AOAC International, 95(1):50–60, 2012.
[99] Tanja Magoč and Steven L Salzberg. FLASH: fast length adjustment of short reads to improve genome assemblies. Bioinformatics, 27(21):2957–2963, 2011.
[100] Eric C Martens, Herbert C Chiang, and Jeffrey I Gordon. Mucosal glycan foraging enhances fitness and transmission of a saccharolytic human gut bacterial symbiont. Cell host & microbe, 4(5):447–457, 2008.
[101] Eric C Martens, Amelia G Kelly, Alexandra S Tauzin, and Harry Brumer. The devil lies in the details: how variations in polysaccharide fine-structure impact the physiology and evolution of gut microbes. Journal of molecular biology, 426(23):3851–3865, 2014.
[102] Eric C Martens, Elisabeth C Lowe, Herbert Chiang, Nicholas A Pudlo, Meng Wu, Nathan P McNulty, D Wade Abbott, Bernard Henrissat, Harry J Gilbert, David N Bolam, et al. Recognition and degradation of plant cell wall polysaccharides by two human gut symbionts. PLoS Biol, 9(12):e1001221, 2011.
[103] A Martinez, AS Bradley, JR Waldbauer, RE Summons, and EF DeLong. Proteorhodopsin photosystem gene expression enables photophosphorylation in a heterologous host. Proceedings of the National Academy of Sciences, 104(13):5590–5595, 2007.
[104] Diego Martinez, Luis F Larrondo, Nik Putnam, Maarten D Sollewijn Gelpke, Katherine Huang, Jarrod Chapman, Kevin G Helfenbein, Preethi Ramaiya, J Chris Detter, Frank Larimer, et al. Genome sequence of the lignocellulose degrading fungus Phanerochaete chrysosporium strain RP78. Nature biotechnology, 22(6):695–700, 2004.
[105] Manuel Martinez-Garcia, David M Brazel, Brandon K Swan, Carol Arnosti, Patrick SG Chain, Krista G Reitenga, Gary Xie, Nicole J Poulton, Monica Lluesma Gomez, Dashiell ED Masland, et al. Capturing single cell genomes of active polysaccharide degraders: an unexpected contribution of Verrucomicrobia. PLoS One, 7(4):e35314, 2012.
[106] Mukil Maruthamuthu, Diego Javier Jiménez, Patricia Stevens, and Jan Dirk van Elsas.
A multi-substrate approach for functional metagenomics-based screening for (hemi)cellulases in two wheat straw-degrading microbial consortia unveils novel thermoalkaliphilic enzymes. BMC genomics, 17(1):1, 2016.
[107] Lauren S McKee, Maria J Peña, Artur Rogowski, Adam Jackson, Richard J Lewis, William S York, Kristian BRM Krogh, Anders Viksø-Nielsen, Michael Skjøt, Harry J Gilbert, et al. Introducing endo-xylanase activity into an exo-acting arabinofuranosidase that targets side chains. Proceedings of the National Academy of Sciences, 109(17):6537–6542, 2012.
[108] Nathan P McNulty, Meng Wu, Alison R Erickson, Chongle Pan, Brian K Erickson, Eric C Martens, Nicholas A Pudlo, Brian D Muegge, Bernard Henrissat, Robert L Hettich, et al. Effects of diet on resource utilization by a model human gut microbiota containing Bacteroides cellulosilyticus WH2, a symbiont with an extensive glycobiome. PLoS Biol, 11(8):e1001637, 2013.
[109] Keith Mewis, Zachary Armstrong, Young C Song, Susan A Baldwin, Stephen G Withers, and Steven J Hallam. Biomining active cellulases from a mining bioremediation system. Journal of biotechnology, 167(4):462–471, 2013.
[110] Keith Mewis, Marcus Taupp, and Steven J Hallam. A high throughput screen for biomining cellulase activity from metagenomic libraries. Journal of visualized experiments: JoVE, (48), 2011.
[111] Jaina Mistry, Robert D Finn, Sean R Eddy, Alex Bateman, and Marco Punta. Challenges in homology search: HMMER3 and convergent evolution of coiled-coil regions. Nucleic acids research, page gkt263, 2013.
[112] Akimasa Miyanaga, Takuya Koseki, Hiroshi Matsuzawa, Takayoshi Wakagi, Hirofumi Shoun, and Shinya Fushinobu. Crystal structure of a family 54 α-L-arabinofuranosidase reveals a novel carbohydrate-binding module that can bind arabinose.
Journal of Biological Chemistry, 279(43):44907–44914, 2004.
[113] Brian D Muegge, Justin Kuczynski, Dan Knights, Jose C Clemente, Antonio González, Luigi Fontana, Bernard Henrissat, Rob Knight, and Jeffrey I Gordon. Diet drives convergence in gut microbiome functions across mammalian phylogeny and within humans. Science, 332(6032):970–974, 2011.
[114] Heiko Nacke, Martin Engelhaupt, Silja Brady, Christiane Fischer, Janine Tautzt, and Rolf Daniel. Identification and characterization of novel cellulolytic and hemicellulolytic genes and enzymes derived from German grassland soil metagenomes. Biotechnology letters, 34(4):663–675, 2012.
[115] Carmen-Mihaela Neculita, Gérald J Zagury, and Bruno Bussière. Passive treatment of acid mine drainage in bioreactors using sulfate-reducing bacteria. Journal of Environmental Quality, 36(1):1–16, 2007.
[116] Jeremy K Nicholson, Elaine Holmes, James Kinross, Remy Burcelin, Glenn Gibson, Wei Jia, and Sven Pettersson. Host-gut microbiota metabolic interactions. Science, 336(6086):1262–1267, 2012.
[117] Didier Nurizzo, Johan P Turkenburg, Simon J Charnock, Shirley M Roberts, Eleanor J Dodson, Vincent A McKie, Edward J Taylor, Harry J Gilbert, and Gideon J Davies. Cellvibrio japonicus α-L-arabinanase 43A has a novel five-blade β-propeller fold. Nature Structural & Molecular Biology, 9(9):665–668, 2002.
[118] Antoinette C O’Sullivan. Cellulose: the structure slowly unravels. Cellulose, 4(3):173–207, 1997.
[119] Roberta M O’Connor, Jennifer M Fung, Koty H Sharp, Jack S Benner, Colleen McClung, Shelley Cushing, Elizabeth R Lamkin, Alexey I Fomenkov, Bernard Henrissat, Yuri Y Londer, et al. Gill bacteria enable a novel digestive strategy in a wood-feeding mollusk. Proceedings of the National Academy of Sciences, 111(47):E5096–E5104, 2014.
[120] Roberta M O’Connor, Jennifer M Fung, Koty H Sharp, Jack S Benner, Colleen McClung, Shelley Cushing, Elizabeth R Lamkin, Alexey I Fomenkov, Bernard Henrissat, Yuri Y Londer, et al.
Gill bacteria enable a novel digestive strategy in a wood-feeding mollusk. Proceedings of the National Academy of Sciences, 111(47):E5096–E5104, 2014.
[121] Norman R Pace, David A Stahl, David J Lane, and Gary J Olsen. The analysis of natural microbial populations by ribosomal RNA sequences. In Advances in microbial ecology, pages 1–55. Springer, 1986.
[122] Hao Pang, Peng Zhang, Cheng-Jie Duan, Xin-Chun Mo, Ji-Liang Tang, and Jia-Xun Feng. Identification of cellulase genes from the metagenomes of compost soils and functional characterization of one novel endoglucanase. Current microbiology, 58(4):404–408, 2009.
[123] Yu Peng, Henry CM Leung, Siu-Ming Yiu, and Francis YL Chin. IDBA-UD: a de novo assembler for single-cell and metagenomic sequencing data with highly uneven depth. Bioinformatics, 28(11):1420–1428, 2012.
[124] PB Pope, SE Denman, M Jones, SG Tringe, K Barry, SA Malfatti, AC McHardy, J-F Cheng, P Hugenholtz, CS McSweeney, et al. Adaptation to herbivory by the tammar wallaby includes bacterial and glycoside hydrolase profiles different from other herbivores. Proceedings of the National Academy of Sciences, 107(33):14793–14798, 2010.
[125] Phillip B Pope, Alasdair K Mackenzie, Ivan Gregor, Wendy Smith, Monica A Sundset, Alice C McHardy, Mark Morrison, and Vincent GH Eijsink. Metagenomics of the Svalbard reindeer rumen microbiome reveals abundance of polysaccharide utilization loci. PLoS One, 7(6):e38571, 2012.
[126] Julia Pottkämper, Peter Barthen, Nele Ilmberger, Ulrich Schwaneberg, Alexander Schenk, Michael Schulte, Nikolai Ignatiev, and Wolfgang R Streit. Applying metagenomics for the identification of bacterial cellulases that are stable in ionic liquids. Green chemistry, 11(7):957–965, 2009.
[127] Morgan N Price, Paramvir S Dehal, and Adam P Arkin. FastTree: computing large minimum evolution trees with profiles instead of a distance matrix.
Molecular biology and evolution, 26(7):1641–1650,2009.[128] Kim D Pruitt, Tatiana Tatusova, Garth R Brown, and Donna R Ma-glott. Ncbi reference sequences (refseq): current status, new features160Bibliographyand genome annotation policy. Nucleic acids research, 40(D1):D130–D135, 2012.[129] Nicholas A Pudlo, Karthik Urs, Supriya Suresh Kumar, J Bruce Ger-man, David A Mills, and Eric C Martens. Symbiotic human gut bacte-ria with variable metabolic priorities for host mucosal glycans. mBio,6(6):e01282–15, 2015.[130] Junjie Qin, Ruiqiang Li, Jeroen Raes, Manimozhiyan Arumugam,Kristoffer Solvsten Burgdorf, Chaysavanh Manichanh, Trine Nielsen,Nicolas Pons, Florence Levenez, Takuji Yamada, et al. A human gutmicrobial gene catalogue established by metagenomic sequencing. na-ture, 464(7285):59–65, 2010.[131] Christian Quast, Elmar Pruesse, Pelin Yilmaz, Jan Gerken, TimmySchweer, Pablo Yarza, Jo¨rg Peplies, and Frank Oliver Glo¨ckner. Thesilva ribosomal rna gene database project: improved data processingand web-based tools. Nucleic acids research, 41(D1):D590–D596, 2013.[132] Predrag Radivojac, Wyatt T Clark, Tal Ronnen Oron, Alexandra MSchnoes, Tobias Wittkop, Artem Sokolov, Kiley Graim, ChristopherFunk, Karin Verspoor, Asa Ben-Hur, et al. A large-scale evalua-tion of computational protein function prediction. Nature methods,10(3):221–227, 2013.[133] Varsha Raghavan and Eduardo A Groisman. Species-specific dynamicresponses of gut bacteria to a mammalian glycan. Journal of bacteri-ology, 197(9):1538–1548, 2015.161Bibliography[134] A Rambaut. Figtree, a graphical viewer of phylogenetic trees. Seehttp://tree. bio. ed. ac. uk/software/figtree, 2007.[135] Christian Rinke, Patrick Schwientek, Alexander Sczyrba, Natalia NIvanova, Iain J Anderson, Jan-Fang Cheng, Aaron Darling, StephanieMalfatti, Brandon K Swan, Esther A Gies, et al. Insights into thephylogeny and coding potential of microbial dark matter. 
Nature,499(7459):431–437, 2013.[136] Luis M Rodriguez-R and Konstantinos T Konstantinidis. Estimatingcoverage in metagenomic data sets and why it matters. The ISMEjournal, 8(11):2349, 2014.[137] Jonathan M Rothberg, Wolfgang Hinz, Todd M Rearick, JonathanSchultz, William Mileski, Mel Davey, John H Leamon, Kim John-son, Mark J Milgrew, Matthew Edwards, et al. An integrated semi-conductor device enabling non-optical genome sequencing. Nature,475(7356):348–352, 2011.[138] Jonathan M Rothberg and John H Leamon. The development andimpact of 454 sequencing. Nature biotechnology, 26(10):1117–1124,2008.[139] Edward M Rubin. Genomics of cellulosic biofuels. Nature,454(7206):841–845, 2008.[140] Jon G Sanders, Annabel C Beichman, Joe Roman, Jarrod J Scott,David Emerson, James J McCarthy, and Peter R Girguis. Baleen162Bibliographywhales host a unique gut microbiome with similarities to both carni-vores and herbivores. Nature communications, 6, 2015.[141] Robert Schmieder and Robert Edwards. Quality control and prepro-cessing of metagenomic datasets. Bioinformatics, 27(6):863–864, 2011.[142] Alexandra M Schnoes, Shoshana D Brown, Igor Dodevski, and Patri-cia C Babbitt. Annotation error in public databases: misannotationof molecular function in enzyme superfamilies. PLoS Comput Biol,5(12):e1000605, 2009.[143] Erin D Scully, Scott M Geib, Kelli Hoover, Ming Tien, Susannah GTringe, Kerrie W Barry, Tijana Glavina del Rio, Mansi Chovatia,Joshua R Herr, and John E Carlson. Metagenomic profiling revealslignocellulose degrading system in a microbial community associatedwith a wood-feeding beetle. PLoS One, 8(9):e73827, 2013.[144] Inna Sekirov, Shannon L Russell, L Caetano M Antunes, and B BrettFinlay. Gut microbiota in health and disease. Physiological reviews,90(3):859–904, 2010.[145] Martin J Sergeant, Chrystala Constantinidou, Tristan A Cogan,Michael R Bedford, Charles W Penn, and Mark J Pallen. 
Extensivemicrobial and functional diversity within the chicken cecal microbiome.PloS one, 9(3):e91941, 2014.[146] Vega Shah, Bonnie X Chang, and Robert M Morris. Cultivation of achemoautotroph from the sup05 clade of marine bacteria that producesnitrite and consumes ammonium. The ISME Journal, 2016.163Bibliography[147] Hidetoshi Shimodaira. An approximately unbiased test of phylogenetictree selection. Systematic biology, 51(3):492–508, 2002.[148] Jared T Simpson, Kim Wong, Shaun D Jackman, Jacqueline E Schein,Steven JM Jones, and Inanc¸ Birol. Abyss: a parallel assembler forshort read sequence data. Genome research, 19(6):1117–1123, 2009.[149] Peter J Simpson, Stuart J Jamieson, Maher Abou-Hachem, Eva Nord-berg Karlsson, Harry J Gilbert, Olle Holst, and Michael P Williamson.The solution structure of the cbm4-2 carbohydrate binding modulefrom a thermostable rhodothermus marinus xylanase. Biochemistry,41(18):5712–5719, 2002.[150] Robert R Sokal and Charles D Michener. A statistical method forevaluating systematic relationships. Multivariate statistical methods,among-groups covariation, page 269, 1975.[151] Morten OA Sommer, George M Church, and Gautam Dantas. A func-tional metagenomic approach for expanding the synthetic biology tool-box for biomass conversion. Molecular systems biology, 6(1):360, 2010.[152] Hans Peter Sørensen and Kim Kusk Mortensen. Advanced geneticstrategies for recombinant protein expression in escherichia coli. Jour-nal of biotechnology, 115(2):113–128, 2005.[153] Franz J St John, Javier M Gonza´lez, and Edwin Pozharski. Consoli-dation of glycosyl hydrolase family 30: a dual domain 4/7 hydrolasefamily consisting of two structurally distinct groups. Febs Letters,584(21):4435–4441, 2010.164Bibliography[154] Mark R Stam, Etienne GJ Danchin, Corinne Rancurel, Pedro MCoutinho, and Bernard Henrissat. Dividing the large glycoside hy-drolase family 13 into subfamilies: towards improved functional an-notations of α-amylase-related proteins. 
Protein Engineering Designand Selection, 19(12):555–562, 2006.[155] Alexandros Stamatakis. Raxml version 8: a tool for phylogeneticanalysis and post-analysis of large phylogenies. Bioinformatics, pagebtu033, 2014.[156] Manuel Stark, Simon A Berger, Alexandros Stamatakis, and Christianvon Mering. Mltreemap-accurate maximum likelihood placement ofenvironmental dna sequences into taxonomic and functional referencephylogenies. BMC genomics, 11(1):461, 2010.[157] Ramunas Stepanauskas. Single cell genomics: an individual look atmicrobes. Current opinion in microbiology, 15(5):613–620, 2012.[158] Cameron R Strachan, Rahul Singh, David VanInsberghe, KaterynaIevdokymenko, Karen Budwill, William W Mohn, Lindsay D Eltis,and Steven J Hallam. Metagenomic scaffolds enable combinatoriallignin transformation. Proceedings of the National Academy of Sci-ences, 111(28):10143–10148, 2014.[159] Marc Strous, Beate Kraft, Regina Bisdorf, and Halina Tegetmeyer.The binning of metagenomic contigs for microbial physiology of mixedcultures. Frontiers in microbiology, 3:410, 2012.165Bibliography[160] Shinichi Sunagawa, Luis Pedro Coelho, Samuel Chaffron, Jens RoatKultima, Karine Labadie, Guillem Salazar, Bardya Djahanschiri,Georg Zeller, Daniel R Mende, Adriana Alberti, et al. Structure andfunction of the global ocean microbiome. Science, 348(6237):1261359,2015.[161] Ryota Suzuki and Hidetoshi Shimodaira. Pvclust: an r package forassessing the uncertainty in hierarchical clustering. Bioinformatics,22(12):1540–1542, 2006.[162] Ga´bor J Sze´kely, Maria L Rizzo, Nail K Bakirov, et al. Measuringand testing dependence by correlation of distances. The Annals ofStatistics, 35(6):2769–2794, 2007.[163] Haregewine Tadesse and Rafael Luque. Advances on biomass pre-treatment using ionic liquids: an overview. 
Energy & EnvironmentalScience, 4(10):3913–3929, 2011.[164] Roman L Tatusov, Natalie D Fedorova, John D Jackson, Aviva RJacobs, Boris Kiryutin, Eugene V Koonin, Dmitri M Krylov, RajaMazumder, Sergei L Mekhedov, Anastasia N Nikolskaya, et al. Thecog database: an updated version includes eukaryotes. BMC bioinfor-matics, 4(1):1, 2003.[165] Marcus Taupp, Sangwon Lee, Alyse Hawley, Jinshu Yang, andSteven J Hallam. Large insert environmental genomic library produc-tion. JoVE (Journal of Visualized Experiments), (31):e1387–e1387,2009.166Bibliography[166] Marcus Taupp, Keith Mewis, and Steven J Hallam. The art and designof functional metagenomic screens. Current opinion in biotechnology,22(3):465–472, 2011.[167] Ronald M Teather and Peter J Wood. Use of congo red-polysaccharideinteractions in enumeration and characterization of cellulolytic bacte-ria from the bovine rumen. Applied and environmental microbiology,43(4):777–780, 1982.[168] Nicolas Terrapon, Vincent Lombard, Harry J Gilbert, and BernardHenrissat. Automatic prediction of polysaccharide utilization loci inbacteroidetes species. Bioinformatics, 31(5):647–655, 2015.[169] Susannah G Tringe and Philip Hugenholtz. A renaissance for thepioneering 16s rrna gene. Current opinion in microbiology, 11(5):442–446, 2008.[170] Susannah Green Tringe, Christian Von Mering, Arthur Kobayashi,Asaf A Salamov, Kevin Chen, Hwai W Chang, Mircea Podar, Jay MShort, Eric J Mathur, John C Detter, et al. Comparative metage-nomics of microbial communities. Science, 308(5721):554–557, 2005.[171] Peter J Turnbaugh, Micah Hamady, Tanya Yatsunenko, Brandi LCantarel, Alexis Duncan, Ruth E Ley, Mitchell L Sogin, William JJones, Bruce A Roe, Jason P Affourtit, et al. A core gut microbiomein obese and lean twins. nature, 457(7228):480–484, 2009.[172] Sagar M Utturkar, Dawn M Klingeman, Miriam L Land, Christo-pher W Schadt, Mitchel J Doktycz, Dale A Pelletier, and Steven D167BibliographyBrown. 
Evaluation and validation of de novo and hybrid assemblytechniques to derive high-quality genome sequences. Bioinformatics,30(19):2709–2716, 2014.[173] Lambertus AM Van den Broek, Ruth M Lloyd, Gerrit Beldman, Jan CVerdoes, Barry V McCleary, and Alphons GJ Voragen. Cloning andcharacterization of arabinoxylan arabinofuranohydrolase-d3 (axhd3)from bifidobacterium adolescentis dsm20083. Applied microbiology andbiotechnology, 67(5):641–647, 2005.[174] Bert van Loo, Jeffrey H Lutje Spelberg, Jaap Kingma, Theo Sonke,Marcel G Wubbolts, and Dick B Janssen. Directed evolution of epox-ide hydrolase from a. radiobacter toward higher enantioselectivity byerror-prone pcr and dna shuffling. Chemistry & biology, 11(7):981–990,2004.[175] Elien Vandermarliere, Tine M Bourgois, Martyn D Winn, StevenVan Campenhout, Guido Volckaert, Jan A Delcour, Sergei V Strelkov,Anja Rabijns, and Christophe M Courtin. Structural analysis of aglycoside hydrolase family 43 arabinoxylan arabinofuranohydrolase incomplex with xylotetraose reveals a different binding mechanism com-pared with other members of the same family. Biochemical journal,418(1):39–47, 2009.[176] J Craig Venter, Karin Remington, John F Heidelberg, Aaron LHalpern, Doug Rusch, Jonathan A Eisen, Dongying Wu, Ian Paulsen,168BibliographyKaren E Nelson, William Nelson, et al. Environmental genome shot-gun sequencing of the sargasso sea. science, 304(5667):66–74, 2004.[177] Patrick Vincent, Francois Shareck, Claude Dupont, Rolf Morosoli, andDieter Kluepfel. New α-l-arabinofuranosidase produced by strepto-myces lividans: cloning and dna sequence of the abfb gene and charac-terization of the enzyme. Biochemical Journal, 322(3):845–852, 1997.[178] Conrad Vispo and Ian D Hume. The digestive tract and digestive func-tion in the north american porcupine and beaver. 
Canadian Journalof Zoology, 73(5):967–974, 1995.[179] Yi Wang, Henry CM Leung, Siu-Ming Yiu, and Francis YL Chin.Metacluster 5.0: a two-round binning approach for metagenomicdata for low-abundance species in a noisy sample. Bioinformatics,28(18):i356–i362, 2012.[180] Adam J Wargacki, Effendi Leonard, Maung Nyan Win, Drew D Regit-sky, Christine Nicole S Santos, Peter B Kim, Susan R Cooper, Ryan MRaisner, Asael Herman, Alicia B Sivitz, et al. An engineered micro-bial platform for direct biofuel production from brown macroalgae.Science, 335(6066):308–313, 2012.[181] Falk Warnecke, Peter Luginbu¨hl, Natalia Ivanova, Majid Ghassemian,Toby H Richardson, Justin T Stege, Michelle Cayouette, Alice CMcHardy, Gordana Djordjevic, Nahla Aboushadi, et al. Metagenomicand functional analysis of hindgut microbiota of a wood-feeding highertermite. Nature, 450(7169):560–565, 2007.169Bibliography[182] Michael J Weickert, Daniel H Doherty, Elaine A Best, and Peter OOlins. Optimization of heterologous protein production in escherichiacoli. Current opinion in biotechnology, 7(5):494–499, 1996.[183] Jadwiga Wild, Zdenka Hradecna, and Waclaw Szybalski. Condition-ally amplifiable bacs: switching from single-copy to high-copy vectorsand genomic clones. Genome research, 12(9):1434–1444, 2002.[184] Anne Winding and Niels Bohse Hendriksen. Biolog substrate utilisa-tion assay for metabolic fingerprints of soil bacteria: incubation effects.In Microbial Communities, pages 195–205. Springer, 1997.[185] Mabel Ting Wong, Weijun Wang, Michael Lacourt, Marie Couturier,Elizabeth Edwards, and Emma Master. Substrate-driven convergenceof the microbial community in lignocellulose-amended enrichments ofgut microflora from the canadian beaver (castor canadensis) and northamerican moose (alces americanus). Frontiers in Microbiology, 7:961,2016.[186] Indira Wu and Frances H Arnold. Engineered thermostable fungalcel6a and cel7a cellobiohydrolases hydrolyze cellulose efficiently at el-evated temperatures. 
Biotechnology and bioengineering, 110(7):1874–1883, 2013.[187] Meng Wu, Nathan P McNulty, Dmitry A Rodionov, Matvei SKhoroshkin, Nicholas W Griffin, Jiye Cheng, Phil Latreille, Randall AKerstetter, Nicolas Terrapon, Bernard Henrissat, et al. Genetic deter-170Bibliographyminants of in vivo fitness and diet responsiveness in multiple humangut bacteroides. Science, 350(6256):aac5992, 2015.[188] Yu-Wei Wu, Yung-Hsu Tang, Susannah G Tringe, Blake A Simmons,and Steven W Singer. Maxbin: an automated binning method torecover individual genomes from metagenomes using an expectation-maximization algorithm. Microbiome, 2(1):1, 2014.[189] Tanya Yatsunenko, Federico E Rey, Mark J Manary, Indi Trehan,Maria Gloria Dominguez-Bello, Monica Contreras, Magda Magris,Glida Hidalgo, Robert N Baldassano, Andrey P Anokhin, et al.Human gut microbiome viewed across age and geography. Nature,486(7402):222–227, 2012.[190] Elena Zaikova, David A Walsh, Claire P Stilwell, William W Mohn,Philippe D Tortell, and Steven J Hallam. Microbial community dy-namics in a seasonally anoxic fjord: Saanich inlet, british columbia.Environmental microbiology, 12(1):172–191, 2010.[191] Lifeng Zhu, Qi Wu, Jiayin Dai, Shanning Zhang, and Fuwen Wei. Evi-dence of cellulose metabolism by the giant panda gut microbiome. Pro-ceedings of the National Academy of Sciences, 108(43):17714–17719,2011.171

