Open Collections

UBC Theses and Dissertations

Exploring microbial controls on methane cycling using MLTreeMap Song, Young Chang 2013

EXPLORING MICROBIAL CONTROLS ON METHANE CYCLING USING MLTREEMAP

by

Young Chang Song
B.Sc., University of Waterloo, 2009

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF
Master of Science
in
The Faculty of Graduate Studies (Bioinformatics)

THE UNIVERSITY OF BRITISH COLUMBIA (Vancouver)
April 2013

© Young Chang Song, 2013

Abstract

Biological methane (CH4) production, or methanogenesis, plays a crucial role in the global carbon cycle. Biologically generated CH4 can be emitted into the atmosphere, where it acts as a greenhouse gas twenty-five times more potent than carbon dioxide (CO2); stored as “methane ice” (clathrates or hydrates) in marine sediments along continental margins; or oxidized under aerobic or anaerobic conditions by microbial agents, effectively limiting atmospheric flux. Methanogenesis is orchestrated by a group of obligate anaerobic archaea within the phylum Euryarchaeota, known as the methanogens, that produce CH4 as the end product of their energy metabolism. Three methanogenic pathways have been described: hydrogenotrophic, methylotrophic and acetoclastic. Although differing in their use of electron donors, all three pathways converge on a terminal step catalyzed by the heterohexameric methyl-coenzyme M reductase (MCR). Over the last decade, cultivation-independent studies have identified anaerobic methane-oxidizing archaea (ANME-1, -2 and -3), related to methanogens, that appear to run one or more canonical methanogenic pathways in reverse, including the terminal step catalyzed by MCR. The three genes encoding MCR subunits, mcrA, mcrB and mcrG, possess phylogenetic resolution similar to that of the small subunit ribosomal RNA gene, making them useful functional markers for the detection and differentiation of methanogenic and methane-oxidation pathways in natural and human engineered ecosystems. Here I introduce an automated and culture-independent method for monitoring the taxonomic structure and genomic potential of methane-cycling environments that leverages and improves upon an existing software package called MLTreeMap. MLTreeMap is a user-extensible software framework that automates maximum likelihood analysis to recover phylogenetic or functional marker genes from environmental sequence data.
I first describe the taxonomic structure and pathway representation of methane-cycling environments on a global scale based on the identification of mcrA alleles. I then chart both the metabolic potential and gene expression of marine sediments supporting the anaerobic oxidation of CH4 using a series of reference trees representing near-complete methane-cycling pathways.

Preface

Chapters 2 and 3 are based on collaborative work between Young Song and Christian von Mering’s laboratory at the University of Zurich. The source code of MLTreeMap was written and provided by Manuel Stark from Christian von Mering’s group. Implementation of the Python scripts, as well as the data analysis and visualization, was done by Young Song. The manuscript was written by Young Song and Steven Hallam.

Data analysis and visualization for Chapter 4 were done by Young Song, with the assistance of Antoine Pagé and an undergraduate student in the Hallam lab. Peptide data were provided by Angela Norbeck at the Pacific Northwest National Laboratory. The manuscript was written by Young Song and Steven Hallam.

Table of Contents

Abstract
Preface
Table of Contents
List of Tables
List of Figures
Glossary
Acknowledgments
Dedication
1. Introduction
2. MLTreeMap-based bioinformatics pipeline
    2.1 Methods of taxonomic binning of environmental sequences
    2.2 Phylogenetic based taxonomic analysis
    2.3 Implementation of MLTreeMap-based workflow
3. Global scale analysis of methyl-coenzyme M reductase subunit A
    3.1 Introduction
    3.2 Material and methods
        3.2.1 Methyl-coenzyme M reductase subunit A reference tree construction
        3.2.2 Testing taxonomic and environmental coverage of McrA reference tree
    3.3 Results and discussion
        3.3.1 Global taxonomic composition of McrA sequences
        3.3.2 Distribution of isolation sources in various taxonomic groups
        3.3.3 Distribution of isolation sources in various methanogenic pathways
4. Monitoring AOM in Eel River Basin
    4.1 Introduction
    4.2 Materials and methods
        4.2.1 Sample and data collection
        4.2.2 Generating reference data for functional marker genes
        4.2.3 Taxonomic composition of the PC45 metagenomic data
        4.2.4 Taxonomic composition of PC44 and PC45 pyrosequencing data
        4.2.5 AOM pathway detection
        4.2.6 AOM expression profiling
    4.3 Results and discussion
        4.3.1 Identification of phylogenetic and functional marker genes
        4.3.2 Taxonomic structure of phylogenetic markers
        4.3.3 Taxonomic composition based on universal phylogenetic marker genes
        4.3.4 Taxonomic composition of the eight core intervals based on pyrosequencing data
        4.3.5 Functional markers of H4MPT-linked C1 transfer reactions
        4.3.6 Functional markers of associated sulfate-reduction
        4.3.7 AOM expression
Conclusion
References
Appendices
    Appendix A: Results of preliminary MLTreeMap analysis of ERB data
    Appendix B: Updated phylogeny of marker genes detected in the ERB

List of Tables

Table 4.1 Summary of functional marker genes used in the MLTreeMap analysis

List of Figures

Figure 2.1 MLTreeMap analysis workflow
Figure 3.1 Reference maximum likelihood tree of McrA
Figure 3.2 Phylogenetic placements of McrA cluster representatives using MLTreeMap
Figure 3.3 Global taxonomic and metabolomic analysis of McrA sequences
Figure 4.1 Ecological context and input data for MLTreeMap pipeline
Figure 4.2 Taxonomic structure of ERB metagenomic and pyrotag data
Figure 4.3 Genomic potential of ERB metagenomic libraries associated with reverse methanogenesis
Figure 4.4 Genomic potential of ERB metagenomic libraries associated with sulfate reduction pathway
Figure 4.5 Gene expression of ERB proteomic data

Glossary

ANME  Anaerobic methane oxidizers
AOM  Anaerobic oxidation of methane
CBM  Coal Bed Methane
CH4  Methane
CO2  Carbon Dioxide
COG  Clusters of Orthologous Groups
ERB  Eel River Basin
H4MPT  Tetrahydromethanopterin
MCR  Methyl-coenzyme M reductase
SRB  Sulfate reducing bacteria
SSU rRNA  Small subunit ribosomal RNA
SO4^2-  Sulfate

Acknowledgments

I would like to thank my supervisor, Steven Hallam, not only for providing me with an opportunity and advice for the thesis project, but also for inspiring me to realize my potential with his passion for science and his vision; and my committee members, Steven Jones and Paul Pavlidis, for their time and useful advice during my master's degree. I would also like to thank my external examiner, Wyeth Wasserman, for his time, and my collaborators Christian von Mering and Manuel Stark of the University of Zurich for inviting me to their laboratory to share and implement ideas for the thesis project. I am grateful to Angela Norbeck and Heather Brewer at the Pacific Northwest National Laboratory for providing me with proteomics data. Sequencing data used in this analysis were provided by the US Department of Energy (DOE) Joint Genome Institute (JGI). I am grateful to Genome Alberta, Genome British Columbia, Genome Canada and the Natural Sciences and Engineering Research Council of Canada (NSERC) for funding this project. Finally, I would like to extend my gratitude to past and present members of the Hallam lab, with special thanks to Antoine Pagé, for their support over the past five years.

Dedication

To my mother and father for teaching me to live every day with courage, patience and kindness. To my sister for her undying enthusiasm and optimism. To my family in Korea, for keeping me in their thoughts and prayers. To the friends past and present, for being with me even in my hardest time.

Chapter 1 Introduction

Biological methane (CH4) production, or methanogenesis, plays a crucial role in the global carbon cycle. Biologically generated CH4 can be emitted into the atmosphere, where it acts as a greenhouse gas twenty-five times more potent than carbon dioxide (CO2); stored as “methane ice” (clathrates or hydrates) in marine sediments along continental margins [2, 3]; or oxidized under aerobic or anaerobic conditions by microbial agents, effectively limiting atmospheric flux. About 74% of atmospheric CH4 is derived from biological sources. Methane emissions are sourced from a variety of natural and human engineered ecosystems, including, but not limited to, wetlands, lakes, ocean waters, sediments, landfills, ruminants and rice paddies [1, 2]. Current global CH4 emissions are 500-600 Tg CH4 per year [1]. In the last 200 years, the atmospheric CH4 concentration has increased threefold. This increase is responsible for about 20% of the increased greenhouse effect observed for all climate active trace gases [1]. Once in the atmosphere, CH4 has a half-life of about 8.4 years and is removed by reactions with hydroxyl radicals [1].

Methanogenesis is catalyzed by a group of obligate anaerobic archaea within the phylum Euryarchaeota, known as the methanogens, that produce CH4 as the end product of their energy metabolism. Methanogenic archaea are a large and diverse group and have been cultivated from a wide variety of anaerobic environments. Methanogenesis pathways are complex and require a number of unique coenzymes and membrane-bound enzyme complexes, the details of which have been recently reviewed [1, 5]. Despite their diversity, known methanogens utilize a restricted number of substrates, including CO2, methyl-group containing compounds, and acetate.
These energetic substrates are typically sourced from other microbial community members capable of converting organic substances such as carbohydrates, long-chain fatty acids, and alcohols [1]. Most methanogens are capable of utilizing CO2 as a substrate, in a pathway known as hydrogenotrophic methanogenesis. In hydrogenotrophic methanogenesis, H2 is the primary electron donor. When formate serves as the electron donor, four molecules of formate are oxidized to CO2 by formate dehydrogenase before one molecule of CO2 is reduced to methane [1]. Methanogens affiliated with Methanosphaera are known to use methyl-group containing compounds as substrates in a pathway known as methylotrophic methanogenesis. These compounds include methanol, methylated amines, and methylated sulfides. Methylotrophic methanogenesis involves methyl-group transfer from methylated compounds to coenzyme M (CoM) to form methyl-CoM [1].

Almost two-thirds of biologically generated methane is derived from acetate [1]. This pathway, known as acetoclastic methanogenesis, is catalyzed by methanogens affiliated with Methanosarcina and Methanosaeta [1]. The reaction steps of the acetoclastic pathway in Methanosarcina and Methanosaeta are similar but not identical. Methanosarcina use a two-step reaction involving acetate kinase (ACK) and phosphotransacetylase (PTA) to convert acetate into acetyl-CoA, while Methanosaeta use adenosine monophosphate (AMP)-forming acetyl-CoA synthetase (ACS) [1]. All three methanogenic pathways converge on a terminal step catalyzed by the methyl-coenzyme M reductase (Mcr) holoenzyme in complex with the nickel porphyrin cofactor F430 [2, 5].

In marine sediments, the majority of biologically generated methane that escapes hydrate formation is oxidized to CO2. This process, known as the anaerobic oxidation of methane (AOM), has been estimated to consume more than 70 billion kilograms of methane annually [6], limiting CH4 flux between ocean and atmosphere. AOM is conducted by uncultivated anaerobic methane-oxidizing archaea (ANME) closely related to methanogenic archaea [6-10]. Several ANME lineages (ANME-1, -2, -3) have been described through the use of molecular taxonomic, lipid biomarker, and fluorescent in situ hybridization approaches [6-8]. These methods have also revealed specific physical associations between ANME-2 and sulfate-reducing bacterial (SRB) groups belonging to the Delta- and Beta-proteobacteria.
Previous studies have implicated microbial consortia composed of ANME groups and SRB in methane oxidation coupled to sulfate reduction [5, 10]. As such, it has been proposed that the two organisms mediate AOM in a syntrophic manner, sharing the free energy produced as the sum of two half reactions [11]. A recently revised AOM model, however, indicates that AOM may not be an obligate syntrophic process [11]. In this model, ANME-2 oxidizes CH4 to CO2 and, during this reaction, also reduces sulfate to zero-valent sulfur (S0). The S0 is exported outside the archaeal cell, where it reacts with sulfide (HS-) to form polysulfides, including disulfide. The disulfide is then consumed by SRB.
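
As a worked illustration of the "sum of two half reactions" in the syntrophic model, the textbook stoichiometry of sulfate-coupled AOM can be written as below. These equations are standard redox chemistry and are not taken from this thesis; assigning the first half reaction to ANME and the second to SRB is the idealized syntrophic picture, not the revised model.

```latex
\begin{align*}
\mathrm{CH_4} + 2\,\mathrm{H_2O} &\rightarrow \mathrm{CO_2} + 8\,\mathrm{H^+} + 8\,e^- \\
\mathrm{SO_4^{2-}} + 9\,\mathrm{H^+} + 8\,e^- &\rightarrow \mathrm{HS^-} + 4\,\mathrm{H_2O} \\
\mathrm{CH_4} + \mathrm{SO_4^{2-}} + \mathrm{H^+} &\rightarrow \mathrm{CO_2} + \mathrm{HS^-} + 2\,\mathrm{H_2O}
\end{align*}
```

The third line is the sum of the first two after cancelling the eight electrons, eight protons and two water molecules that appear on both sides.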
Phylogenetic analysis of the 16S small subunit ribosomal RNA (16S SSU rRNA) gene confirms the close relationship between the ANME clades and the cultivated methanogenic orders Methanosarcinales and Methanomicrobiales [10]. Previous biochemical studies have also indicated that methanogens can oxidize a small portion (up to 1%) of the CH4 they produce, while AOM consortia can produce small amounts of CH4, highlighting the biochemical link between the two pathways [10]. These observations therefore suggest a co-evolution of their biochemistry. Preliminary studies of AOM were motivated by the hypothesis that the initial step in methane oxidation is essentially a reversal of the terminal reaction in methanogenesis [6, 10, 17-20]. This reaction, in which methyl-coenzyme M is reduced with coenzyme B to yield CH4 and heterodisulfide, is mediated by Mcr [6, 10, 17].

The alpha subunit of MCR (MCRA) has been studied extensively in past decades. Previous studies have described the isolation of the mcrA gene from a wide variety of methanogenic environments, including wet soils [21-23], rice-paddy soils [24], and methane seep sediments [5, 17, 25-32]. Some of these studies have also reported the discovery of uncultured mcrA clades that are unique to the environments under observation [5, 17, 21-24], highlighting the taxonomically diverse nature of the gene. Furthermore, Hallam and colleagues have elaborated on the phylogenetic relationship shared between ANME-affiliated mcrA and the 16S SSU rRNA gene [4]. Here, gene content and organization within each of the ANME-1 and ANME-2 mcrA subdivisions are highly conserved, a trait that has also been found in the ANME-1 and ANME-2 SSU rRNA.
The taxonomic diversity of mcrA, in addition to its close phylogenetic relationship with the universal phylogenetic marker gene, makes it an ideal biomarker for diagnosing the presence of methane-cycling archaea, including ANMEs, in natural and human engineered environments. Such a diagnosis has implications for oil and gas resource discovery and for estimating the impact of rapid warming on methane emissions on a planetary scale.

Over geological time, the CH4 trapped in hydrates is released periodically during intervals of climate change [12-15]. With increasing temperatures, the dissociation of methane ice is thought to trigger a series of biological responses, including an increase in AOM activity [15]. Hinrichs and colleagues have discussed this phenomenon using evidence collected from ancient marine sediments in the Santa Barbara Basin (SBB) [15]. The SBB sediments are characterized by the presence of planktonic foraminiferal calcite. Hinrichs’ group first noted a foraminiferal signal that corresponds to the distribution pattern of a biomarker indicative of past methane-oxidation activity [15, 16]. The authors suggested that these observations are associated with CH4 release caused by hydrate dissociation and the fluctuating concentration of CH4 in basin waters. Furthermore, Hinrichs and colleagues described the discovery of 13C-depleted ether lipids that are diagnostic for anaerobic methane-oxidizing bacteria and archaea. The combined evidence therefore suggests that past CH4 emissions caused by hydrate dissociation induced AOM, creating a balancing force with respect to rapid climate change.

Despite the widely acknowledged significance of methane-cycling reactions, the molecular mechanisms underlying these pathways in natural and human engineered ecosystems are poorly constrained, in part because the microbial groups known to be involved are difficult to access, exhibit slow in situ growth rates, and remain uncultivated in the laboratory.

This thesis introduces a culture-independent and automated method of monitoring the taxonomic composition and genomic potential of microbial communities harboring methane-cycling reactions. The approach builds on an existing software framework called MLTreeMap [33]. I construct novel reference trees representing methane-cycling pathways for use in automated gene identification with MLTreeMap, and code several utility scripts that improve performance and data product generation. In Chapter 3, I describe the application of my MLTreeMap workflow to determine the taxonomic and metabolic affiliation of mcrA genes recovered from various environments across the globe. In Chapter 4, I elaborate on the phylogenetic recruitment and analysis of methane-cycling marker genes from metagenomic data collected from the Eel River Basin (ERB) off the coast of Mendocino, California.
Previous studies have indicated that ANME and SRB groups associated with AOM are well represented in the ERB [5]. Hence, the ERB is used as a benchmark environment for monitoring methane-cycling pathways. In Chapter 5, I discuss the implications of my findings and speculate on future applications of my workflow.  	
Chapter 2 MLTreeMap-based bioinformatics pipeline

2.1 Methods of taxonomic binning of environmental sequences

In recent years, metagenomics or “environmental genomics” has emerged as a powerful method of exploring the physiology and genetics of uncultured microbial species [33-34]. The underlying principle of metagenomics, cloning DNA directly from environmental samples, provides researchers with the means to bypass the challenges of cultivating naturally occurring microbes. As of 2009, two hundred metagenomics studies had been recorded, with 68% of the projects pertaining to various environmental habitats [35]. Most of these projects involve assigning sequence fragments to their microbial origin in an effort to explore the taxonomic composition of the environments under study [36].

There is an ever-increasing number of software tools designed to estimate the taxonomic composition of environmental datasets. Most of these tools employ large-scale homology searches against sequence databases in order to reconstruct phylogeny and determine the metabolic potential of the sample [36]. One such tool is MEGAN (MEta Genome ANalyzer; [37]). Here, input nucleotide sequences are searched against databases of known sequences using BLAST. MEGAN then uses the NCBI taxonomy to summarize and order the results of the search for each read, and applies a simple algorithm that assigns each read to the lowest common ancestor (LCA) of the set of taxa whose sequences the read matched. Users are then able to estimate and interactively explore the taxonomic content of the dataset. The latest version of MEGAN (MEGAN4; [38]) supports taxonomic analysis using the NCBI taxonomy, as well as functional analysis using the SEED classification of subsystems and functional roles [39] or the KEGG classification of pathways and enzymes [40].
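
The LCA assignment described above can be illustrated with a minimal Python sketch. This is not MEGAN's implementation, which applies additional score-based filtering that is omitted here. The `LINEAGES` table and the function name are hypothetical, and the lineages are simplified to four ranks for illustration.

```python
from typing import Dict, List, Sequence

# Hypothetical, simplified lineages (root -> leaf) for taxa that a read might match.
LINEAGES: Dict[str, List[str]] = {
    "Methanosarcina": ["Archaea", "Euryarchaeota", "Methanosarcinales", "Methanosarcina"],
    "Methanosaeta": ["Archaea", "Euryarchaeota", "Methanosarcinales", "Methanosaeta"],
    "Methanosphaera": ["Archaea", "Euryarchaeota", "Methanobacteriales", "Methanosphaera"],
}


def lowest_common_ancestor(taxa: Sequence[str], lineages: Dict[str, List[str]]) -> str:
    """Return the deepest rank shared by the lineages of all matched taxa."""
    if not taxa:
        return "unassigned"
    paths = [lineages[taxon] for taxon in taxa]
    lca = "root"
    # Walk all lineages in parallel from the root and stop at the first disagreement.
    for ranks in zip(*paths):
        if len(set(ranks)) != 1:
            break
        lca = ranks[0]
    return lca
```

In this sketch, a read matching both Methanosarcina and Methanosaeta is assigned to Methanosarcinales, while a read matching Methanosarcina and Methanosphaera can only be resolved to Euryarchaeota.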
Another widely used metagenomics analysis tool is MG-RAST (Meta Genome RAST server; [41]), an open source system based on the SEED framework. Unlike MEGAN, MG-RAST screens the input sequences for potential protein-encoding genes using a BLASTX search against the SEED non-redundant database (SEED-NR) and the FIGfam protein families [42]. In addition, the input sequences are compared to a range of rDNA databases, including Greengenes [43], RDP-II [44] and the European 16S RNA database [45].
Phylogenetic information contained in the SEED-NR database and the rRNA databases is used to construct the phylogenomic profile of the sample, while the results of the FIGfam search are used to compute functional classifications of the protein-encoding genes. The functional assignment of these genes can in turn be used to describe the putative metabolic repertoire of the sample.

2.2 Phylogenetic based taxonomic analysis

Traditional phylogenetic methods, such as maximum parsimony [46], neighbor joining [47] and maximum likelihood [48], provide a robust means of assigning input reads to their corresponding taxonomic lineage. Of these three methods, maximum likelihood (ML) is one of the best-described and most accurate techniques in phylogenetics [33, 49-51]. The ML method involves searching for the evolutionary tree that yields the highest probability of generating the observed sequence data [49]. Depending on the size and nature of the input data, the computational cost of this method can vary drastically. The cost can, however, be reduced significantly through algorithmic improvements and the use of a restricted (but user-expandable) set of informative reference data [33]. MLTreeMap (Maximum Likelihood TreeMap; [33]) is a software framework that applies these principles. The software employs maximum likelihood to recover functional and phylogenetic marker genes from environmental sequence data. The resulting output files include alignments, composite trees, and distribution tables that map marker genes to annotated reference trees. This information is useful in assigning reads to specific taxonomic groups and profiling the metabolic potential of environmental samples.

2.3 Implementation of MLTreeMap-based workflow

I introduce a workflow that uses the functionalities of MLTreeMap to place detected marker genes at the optimal loci of the corresponding reference tree.
My goal is to eliminate the algorithmic limitations of previous versions of MLTreeMap by implementing functions that were absent from the framework. These functions include searching for a selected set of reference genes (instead of searching the input sequences against the entire set of available reference data), as well as reconstructing the reference tree using the marker genes that were detected in the input data. A visualization method was also implemented to explore the taxonomic composition of the detected marker genes. The overall scheme of the workflow can be divided into two major procedures, as shown in Figure 2.1.

Upon the initial installation of MLTreeMap, users gain access to a set of alignment files, Hidden Markov Model (HMM) [53] files, and Newick tree files [54] that represent the informative phylogenetic and functional marker genes. The phylogenetic markers include the 16S small subunit ribosomal RNA (16S SSU rRNA) genes of Archaea and Bacteria, as well as genes affiliated with 40 clusters of orthologous groups (COGs) that occur universally across all three domains of life as near-perfect single-copy genes [33, 55]. Users are encouraged to build reference data for their own candidate marker genes and integrate them into the existing set.

At the current stage, users need to generate the reference data for MLTreeMap manually. Marker genes of interest are identified through literature research or from lists of genes available in public databases such as KEGG, SEED and MetaCyc [56]. Once candidate marker genes have been chosen, the amino acid sequences of these genes are downloaded from public databases such as NCBI GenPept [57-58] or the Functional Gene Pipeline/Repository (FGPR; [59]). The FGPR is a monthly updated interactive database tool that aids functional genomics studies, especially in environmental contexts. The FGPR search is based on a protein model built from a set of well-characterized training sequences submitted by experts. The downloaded sequences are then aligned with tools such as MUSCLE [60] or ClustalW [61]. The sequence alignment file is used as input for A) the hmmbuild program, available as part of the HMMER software package [53], to generate the Hidden Markov Model for the reference data; and B) maximum likelihood analysis software such as RAxML [62] to build the reference phylogenetic tree.
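
The manual steps above (alignment, HMM construction, tree inference) can be sketched as a list of command lines assembled in Python. This is a minimal sketch under stated assumptions, not the author's script: the function name, the output directory and the file naming are hypothetical, the MUSCLE options follow the version 3 command-line style, and the RAxML model string `PROTGAMMAWAG` and seed value are illustrative choices. The commands are only constructed here, not executed.

```python
from pathlib import Path
from typing import List


def reference_build_commands(fasta: str, marker: str, out_dir: str = "ref_data") -> List[List[str]]:
    """Build (but do not run) command lines for alignment, HMM and tree construction."""
    out = Path(out_dir)
    alignment = (out / f"{marker}.aln.fasta").as_posix()
    hmm_file = (out / f"{marker}.hmm").as_posix()
    return [
        # MUSCLE v3-style invocation; newer MUSCLE releases use different options.
        ["muscle", "-in", fasta, "-out", alignment],
        # HMMER: hmmbuild <hmm output file> <alignment input file>.
        ["hmmbuild", hmm_file, alignment],
        # RAxML protein tree; accepted alignment formats depend on the RAxML version.
        ["raxmlHPC", "-s", alignment, "-n", marker, "-m", "PROTGAMMAWAG", "-p", "12345"],
    ]
```

With the tools installed, each command could be passed to `subprocess.run(cmd, check=True)` in order. Keeping command construction separate from execution makes the file-naming conventions easy to inspect before anything is run.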
Throughout these steps, the intermediate output files need to be formatted so that the final set of reference data can be processed by MLTreeMap without producing computational errors. For example, the names of sequences in the alignment file need to follow a specific syntax enforced by MLTreeMap’s source code, while the names of the same sequences need to be represented in a different format in the Newick tree file. Hence, manually constructing the MLTreeMap reference data is not only time-consuming but also error-prone. To alleviate these difficulties, I designed and implemented a Python script, create_mltree_ref_data.py, that automatically generates the reference data for MLTreeMap. This tool automates the sequence alignment using MUSCLE and passes the
resulting alignment file to hmmbuild [48] and makeblastdb, generating a Hidden Markov Model and a BLAST database, respectively, for downstream MLTreeMap analysis. The script then uses the same alignment file to infer a phylogenetic tree using RAxML. Here, the Whelan and Goldman (WAG) substitution model [63] and an estimated gamma (Γ) distribution are used as default parameters for the phylogenetic tree inference.

Once the reference data have been built, fragments of environmental nucleotide sequences are supplied to MLTreeMap, and the program is executed from the command line. Upon execution, the presence of marker genes within the input sequences is determined by running BLASTX against a set of reference proteins. MLTreeMap then employs GeneWise [64] to extract the amino acid sequences of all marker genes, based on the HMMs provided by both the software developers and the users. The query protein sequences are then aligned to the corresponding reference sequences using hmmalign [53]. Minor gap removal is applied to the alignment using Gblocks [65], and the refined alignment is used as input to RAxML for maximum likelihood analysis. Here, RAxML first optimizes the maximum likelihood model parameters and calculates all branch lengths of the reference tree, based on the corresponding alignment. As each query protein sequence is inserted into every possible branch of the tree, the branch lengths are re-calculated for each insertion attempt. For each query sequence, the position with the optimal branch length is reported. These reports are finally combined into a single output that describes the distribution of reads (as percentages) mapped onto specific loci of the corresponding reference tree. A custom Perl script called mltreemap_stats.pl was developed to parse various statistics from the MLTreeMap output and generate tables that can be visualized as heatmaps and histograms using the R statistical package [66].
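The reference-data steps described above (alignment, HMM construction, BLAST database construction and tree inference) can be sketched as a list of external commands. This is a minimal illustration, not the contents of create_mltree_ref_data.py: the function name, file-naming scheme and random seed are hypothetical, the MUSCLE flags follow the version 3 command-line syntax, and the BLAST database is built here from the unaligned FASTA file as an assumption.

```python
from pathlib import Path


def reference_build_commands(fasta, marker):
    """Return the external commands for one marker gene, in run order.

    `fasta` holds the unaligned reference proteins; `marker` is a short
    label (e.g. "mcrA") used to name the outputs. Both the function and
    the naming scheme are hypothetical.
    """
    fasta = Path(fasta)
    aln = fasta.with_name(f"{marker}.afa")   # aligned FASTA from MUSCLE
    hmm = fasta.with_name(f"{marker}.hmm")   # profile HMM from hmmbuild
    return [
        # MUSCLE v3 syntax; newer releases use different option names.
        ["muscle", "-in", str(fasta), "-out", str(aln)],
        # hmmbuild takes the output HMM first, then the alignment.
        ["hmmbuild", str(hmm), str(aln)],
        # Protein BLAST database (built from the unaligned sequences here).
        ["makeblastdb", "-in", str(fasta), "-dbtype", "prot", "-out", marker],
        # WAG matrix with gamma-distributed rates; older RAxML releases
        # may need the alignment converted to PHYLIP format first.
        ["raxmlHPC", "-m", "PROTGAMMAWAG", "-p", "12345",
         "-s", str(aln), "-n", marker],
    ]
```

Each command list could then be handed to `subprocess.run(cmd, check=True)` in order, so that a failure in one tool stops the build before later steps consume incomplete output.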
The statistics extracted from the MLTreeMap output include a relative taxonomic distribution for each detected marker gene, based on the parsed information from the final output. The phylogenetic position of marker genes generated by MLTreeMap can be visualized using MLTreeMap_Imagemaker. However, this method lacks taxonomic resolution for sequences from more distant genotypes. I have, therefore, implemented a Python-based workflow (create_new_func_tree.py) to insert the marker genes identified from query nucleotide or protein sequences into optimal positions in the corresponding reference trees. The process is similar to that used in MLTreeMap: protein sequences of marker genes are aligned to the reference sequences, and the alignment is used to conduct
maximum likelihood analysis. There are, however, several distinguishing features in this workflow that are not present in MLTreeMap. First, the framework provides users with the option of employing UCLUST [67] to cluster the marker genes detected from the query sequences. If this option is chosen, only the representative sequence of each cluster is aligned to the reference sequences. This functionality therefore reduces the computational cost of the maximum likelihood analysis while maintaining the comprehensiveness of the phylogeny. Second, while MLTreeMap aligns the detected marker genes to the corresponding reference sequences on an individual basis, create_new_func_tree.py aligns the entire set of query protein sequences to the reference sequences. Users can opt to align the sequences with MUSCLE using the default parameters, or by using the profile-profile alignment option, which is suitable for remote homology detection [68]. Third, while minimal gap removal using Gblocks is a mandatory step in MLTreeMap, I have designed the program so that users can choose between a trimmed and a full-length alignment. In my experience, removing the gapped regions of an alignment does not improve the phylogeny in some cases (e.g. homologous sequences with similar lengths). Similar to create_mltree_ref_data.py, this pipeline uses RAxML with the WAG substitution matrix and 100 bootstrap replicates as parameters. The resulting appended tree can be visualized using software such as FigTree [69].
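To illustrate why clustering reduces the number of sequences entering the maximum likelihood analysis, the sketch below groups sequences greedily around representatives. It is a simplified stand-in and not the UCLUST algorithm: the similarity measure is Python's `difflib` ratio rather than an alignment-based identity, and the function names and default threshold are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Crude similarity in [0, 1]; a stand-in for alignment-based identity."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def greedy_cluster(seqs, threshold=0.96):
    """Assign each sequence to the first representative it matches.

    `seqs` maps sequence IDs to protein strings. Sequences are visited
    from longest to shortest (ties broken by ID), and a sequence that
    matches no existing representative founds a new cluster.
    """
    clusters = {}  # representative ID -> list of member IDs
    for sid in sorted(seqs, key=lambda s: (-len(seqs[s]), s)):
        for rep in clusters:
            if similarity(seqs[sid], seqs[rep]) >= threshold:
                clusters[rep].append(sid)
                break
        else:
            clusters[sid] = [sid]
    return clusters
```

Only the dictionary keys (the representatives) would be aligned to the reference sequences, while the member lists record which sequences each representative stands for.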
Figure 2.1 MLTreeMap analysis workflow. A) MLTreeMap analysis. In the initial step of the analysis, fragments of nucleotide sequences from environmental samples are read into MLTreeMap. A parallel version of MLTreeMap reads amino acid sequences and can be used for gene expression profiling and pathway validation. Output data obtained from MLTreeMap analysis include a composite tree, alignment files and a maximum-likelihood result for each detected functional and phylogenetic marker gene. The taxonomic composition of each detected reference gene is included in the output data as well. B) Results of MLTreeMap can be visualized using the Perl script MLTreeMap_Imagemaker, available for download from MLTreeMap’s official website. A Python-based script was implemented to insert the marker genes identified from query nucleotide (or protein) sequences into the optimal positions in the corresponding reference trees. The resulting appended trees can be visualized using software such as FigTree. C) Prior to conducting the MLTreeMap analysis, users can choose to create a data set representing the functional marker genes of interest and append it to the existing set of MLTreeMap reference data. The sequences of selected marker genes can be downloaded from public repositories such as GenBank or the Functional Gene Pipeline and Repository (Fungenes). A Python script was developed to read the input FASTA file and generate the different components of the reference data for the selected marker gene.
    	
Chapter 3 Global scale analysis of methyl-coenzyme M reductase subunit A

3.1 Introduction

Methane is a potent greenhouse gas and an important hydrocarbon energy resource. Under anaerobic conditions it is produced by methanogenic archaea through a series of one-carbon transfer reactions linked to the coenzymes H4MPT and coenzyme M (CoM). The terminal step in archaeal CH4 production is mediated by methyl-CoM reductase (MCR) in complex with the nickel-containing porphinoid cofactor F430. Although MCR was once considered diagnostic for methanogenic archaea, the discovery of MCR-encoding operons, protein complexes and cofactors in anaerobic CH4-oxidizing archaea is consistent with a reversible role for this enzyme. Here I describe a bioinformatics pipeline for automated detection of MCR subunit alpha (McrA) from environmental sequence information using MLTreeMap, an automated maximum likelihood method for phylogenetic inference and sequence identification. A comprehensive reference tree for McrA was constructed based on public sequence information. This tree was used to explore the global distribution and diversity of McrA, with specific emphasis on archaeal subgroups mediating anaerobic CH4 oxidation.

3.2 Material and methods

3.2.1 Methyl-coenzyme M reductase subunit A reference tree construction

A total of 5,696 McrA protein sequences were downloaded from the FGPR database (version 6.6.1). From this dataset, 153 sequences representing methanogenic isolates and defined environmental groups, including anaerobic CH4-oxidizing archaea (ANME-1, ANME-2 and ANME-3), were selected for reference tree construction (Figure 3.1). Reference tree construction and sequence processing were automated using the Python script create_mltree_ref_data.py, introduced in the previous chapter.
Figure 3.1 RAxML maximum likelihood tree based on 153 McrA sequences from representative isolates and environmental groups. Bootstrap values (%) are based on 100 replicates and are shown for branches with greater than 50% support. The scale bar represents 0.2 substitutions per site. Colour-coded boxes represent isolation source or environmental origin for each sequence in the tree.
    	
3.2.2 Testing taxonomic and environmental coverage of McrA reference tree

The set of 5,696 McrA proteins downloaded from the FGPR database was clustered at 96% sequence similarity using UCLUST, generating 1,428 sequence clusters. Representative sequences from each cluster were mapped onto the McrA reference tree using MLTreeMap and visualized using mltreemap_imagemaker, a Perl script downloaded from the MLTreeMap website (http://www.mltreemap.org). The resulting taxonomic assignments of representative sequences were used to transitively assign the remaining 4,268 McrA sequences to one of 19 different taxonomic groups in the McrA reference tree. The isolation source or environmental origin of each sequence was determined from accession numbers using a custom Python script (retrieve_isolation_source.py) that utilizes the NCBI E-utilities package [70].

3.3 Results and discussion

3.3.1 Global taxonomic composition of McrA sequences

Using MLTreeMap analysis, I was able to assign 1,416 out of 1,428 representative McrA sequences and the remaining 4,268 sequences to one of 19 taxonomic groups in the McrA reference tree (Figure 3.2). The 12 representative sequences excluded from the taxonomic assignment comprised 11 partial environmental sequences, with an average amino acid sequence length of 143, and a hypothetical protein that was misannotated as an McrA sequence in the FGPR database. During the MLTreeMap analysis procedure, these sequences were shortened to fewer than 50 amino acids (a.a.) by the mild gap-removal step [65]. They were subsequently removed from the maximum-likelihood analysis conducted by RAxML and were not mapped onto the reference tree. Removal of these sequences, however, did not have a major effect on the analysis, as they did not cluster with any other of the 5,696 McrA sequences at the 96% AAI threshold (i.e. they were singletons).
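The isolation-source lookup described in the methods can be sketched as follows. This is not the retrieve_isolation_source.py script itself: the function names are hypothetical, the sketch assumes records are fetched as GenPept flat files through the E-utilities `efetch` endpoint, and it assumes the isolation source is recorded in the `/isolation_source` qualifier of the record.

```python
import re
from urllib.parse import urlencode

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(accession, db="protein"):
    """URL that returns a GenPept flat-file record as plain text."""
    query = {"db": db, "id": accession, "rettype": "gp", "retmode": "text"}
    return EFETCH + "?" + urlencode(query)


def isolation_source(record_text):
    """Return the /isolation_source qualifier, or 'Unspecified' if absent."""
    match = re.search(r'/isolation_source="([^"]*)"', record_text)
    if match is None:
        return "Unspecified"
    # Qualifier values can wrap across lines; collapse the whitespace.
    return " ".join(match.group(1).split())
```

The record text for an accession could be downloaded with `urllib.request.urlopen(efetch_url(accession))` and passed to `isolation_source`; keeping the parsing separate from the download makes it possible to check the parser against saved records without network access.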
Figure 3.2 Phylogenetic placements of McrA cluster representatives using MLTreeMap. The size of each bubble represents the percentage of cluster representatives assigned to each node of the reference tree. The histogram represents the percentage of sequences split evenly across successive nodes in the tree sharing a common ancestor.
    	
A profile of the taxonomic composition of the 5,684 McrA sequences was generated, as shown in Figure 3.3. The profile suggests that 21.86% of McrA sequences (i.e. 1,243 sequences) are associated with the methanogenic archaeal order Methanomicrobiales, while 19.07% (i.e. 1,084 sequences) share a close relationship with the order Methanosarcinales. The McrA sequences affiliated with the order Methanobacteriales make up 16.64% of the total (i.e. 946 sequences). McrA sequences associated with Methanocellales, a novel methanogenic order recognized as a key archaeal group for CH4 emission in rice fields [71], were also recovered; these sequences represent 6.44% of the total (i.e. 366 sequences).

McrA sequences associated with various clades of anaerobic CH4 oxidizers (ANMEs) were also detected. The sequences associated with ANME-1a and ANME-1b comprise 2.08% (i.e. 118 sequences) and 3.03% (i.e. 172 sequences) of the total McrA sequences, respectively. The sequences affiliated with the ANME-2a and ANME-2c groups are slightly more abundant than the ANME-1 affiliated sequences, representing 3.94% (i.e. 224 sequences) and 2.99% (i.e. 170 sequences) of the total, respectively. The sequences sharing a close relationship with the ANME-3 clade make up only 0.24% of the total (i.e. 14 sequences).

McrA sequences affiliated with six environmental archaeal clades recovered from an oligotrophic fen, known as the Fen clusters [21], were recovered as well. The sequences associated with these clusters make up 14.72% (i.e. 837 sequences) of the total McrA sequences analyzed. Among these six clades, Fen clusters III and IV are the most abundant, representing 3.94% (i.e. 224 sequences) and 8.95% (i.e. 509 sequences) of the total McrA sequences, respectively.
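The transitive assignment and the resulting composition profile can be illustrated with a short sketch. The data structures and function names are assumptions made for illustration: clusters are represented as a mapping from a representative ID to its member IDs, and placed representatives as a mapping from representative ID to taxonomic group.

```python
from collections import Counter


def assign_members(clusters, rep_taxon):
    """Give every cluster member the taxon assigned to its representative."""
    taxon_of = {}
    for rep, members in clusters.items():
        if rep not in rep_taxon:  # representative was not placed on the tree
            continue
        for member in members:
            taxon_of[member] = rep_taxon[rep]
    return taxon_of


def composition(taxon_of):
    """Percentage of assigned sequences in each taxonomic group."""
    counts = Counter(taxon_of.values())
    total = sum(counts.values())
    return {taxon: 100.0 * n / total for taxon, n in counts.items()}
```

In this arrangement, a cluster whose representative could not be placed contributes nothing to the profile, which mirrors the exclusion of unplaced singletons from the total.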
Figure 3.3 Global taxonomic and metabolomic analysis of McrA sequences. A) Taxonomic analysis of the McrA sequences. The grey histogram represents the distribution of taxonomic groups (in percentage) amongst the 5,684 McrA sequences. For each taxonomic group identified using MLTreeMap analysis, the number of sequences derived from each isolation source is shown, as indicated in the legend. Among the taxonomic groups, ‘Unclassified’ refers to sequences that were not assigned to any taxonomic group present in the reference tree. If the isolation source of a sequence was not defined in its GenBank file, the sequence was placed in the ‘Unspecified’ category. B) Pathway analysis of the McrA sequences. Sequences belonging to taxonomic groups associated with one of four major categories of methanogenic pathways are shown. As in A, the distribution of isolation sources is shown for each of the four pathway categories.
    	
3.3.2 Distribution of isolation sources in various taxonomic groups

In addition to the taxonomic composition of the 5,684 McrA sequences, I retrieved information regarding the isolation sources of these sequences in order to determine the environmental diversity of the taxonomic groups present (Figure 3.3A). The results show that while the McrA sequences affiliated with major methanogenic groups such as Methanobacteriales, Methanomicrobiales and Methanosarcinales are found in various isolation sources, environment-specific sequences such as those associated with the ANME clades and the Fen clusters are found in a limited number of isolation sources. For example, of the 946 McrA sequences associated with Methanobacteriales, 28.54% are derived from synthetic environments such as anaerobic digesters and landfills, while 27.80% were isolated from soil samples (dry soil, rice-field soil and wetland soil). The Methanobacteriales-affiliated McrA sequences retrieved from mammalian hosts make up 28.96% of the sequences affiliated with that order. The remaining 139 McrA sequences associated with Methanobacteriales are distributed sparsely across various marine sediments and geological features (e.g. asphalt lake and glacial ice sediments). A similar trend is observed in Methanosarcinales, where 16.32% of the 1,084 McrA sequences belong to synthetic environments, while 29.52% and 28.13% of the sequences are found in marine sediments and soil samples, respectively. The remaining Methanosarcinales-affiliated McrA sequences are observed in a wide range of soil samples and non-marine sediments, as well as in mammalian hosts. Unlike the major methanogenic archaeal orders, the majority of the ANME-affiliated McrA sequences originate from methane seep sediment [5, 17, 25-32].
Among the McrA sequences affiliated with the ANME-1a group, those retrieved from methane seep sediment make up 91.52% of the 118 total sequences. Among the ANME-1b affiliated McrA sequences, 51.74% of the 172 total sequences are from methane seep sediment [5, 17, 25-32]. Furthermore, 12.79% of the ANME-1b related McrA sequences are present in marine mud volcanoes [72], while 26.16% were recovered from a natural sinkhole [73]. The McrA sequences associated with the ANME-2a group are dominated by sequences derived from methane seep sediment and tidal creek sediment [74], which make up 45.98% and 41.96% of the 224 total sequences, respectively. The remaining 12.06% of the McrA sequences originate from marine and terrestrial mud volcanoes [75]. Finally, ANME-2c
affiliated McrA sequences recovered from methane seep sediment represent 90.59% of the 170 total sequences, while the remaining 9.41% of the sequences were isolated from marine mud volcanoes. The McrA sequences affiliated with the six Fen clusters are dominated by sequences isolated from various soil samples [22-23, 76-81] and lake water samples [82-83]. This pattern is apparent in Fen clusters II, III, IV and V, where the McrA sequences derived from soil samples make up 100%, 60.27%, 42.23% and 81.36% of each clade, respectively. The McrA sequences isolated from lake water samples represent 30.80% of Fen cluster III and 42.83% of Fen cluster IV. Unlike the aforementioned Fen clusters, 87.50% of the 16 McrA sequences associated with Fen cluster VI were isolated from various biogas plants [84].

3.3.3 Distribution of isolation sources in various methanogenic pathways

The final aim in the global analysis of the McrA sequences was to determine the distribution of taxonomic groups associated with one of four different methanogenic pathways across different isolation sources. To do so, I compiled the numbers of sequences affiliated with the taxonomic groups known to catalyze one or more of the four known methanogenic reactions [1]. Following the method used to visualize the distribution of isolation sources in the taxonomic groups, I generated the composition of isolation sources for the CH4-cycling pathways, as shown in Figure 3.3B. The results indicate that 4,911 McrA sequences are affiliated with taxonomic groups known to catalyze hydrogenotrophic methanogenesis. The McrA sequences associated with methylotrophic archaeal groups (i.e. the order Methanosarcinales and the genus Methanosphaera) represent 19.21% of the total McrA sequences analyzed. Sequences affiliated with acetoclastic methanogenesis (i.e. the genera Methanosaeta and Methanosarcina) and the anaerobic oxidation of methane (AOM) make up 11.17% and 12.28% of the total McrA sequences, respectively. Perhaps the most intriguing observation is the distribution of isolation sources across the four pathways. The McrA sequences associated with hydrogenotrophic, methylotrophic and acetoclastic methanogenesis are found in various isolation sources, including rice field soil samples, wetland soil samples, anaerobic digesters and biogas plants. For example, the McrA sequences isolated from rice field soil samples represent 10.20% of the total McrA
sequences affiliated with hydrogenotrophic methanogenesis, while the McrA sequences derived from wetland soil make up 18.79% of the sequences associated with the same type of methanogenesis. The McrA sequences from anaerobic digesters and biogas plants make up 6.43% and 11.78% of the hydrogenotrophic methanogenesis affiliated McrA sequences, respectively. A similar trend is observed in the distribution of sequences affiliated with the methylotrophic and acetoclastic pathways. The only exception to this pattern is that McrA sequences derived from marine microbial mats are also present among the sequences affiliated with the hydrogenotrophic and methylotrophic pathways. These sequences make up 0.10% of the hydrogenotrophic methanogenesis affiliated McrA sequences and 22.07% of the McrA sequences associated with the methylotrophic pathway. In contrast, the McrA sequences associated with AOM are dominated by sequences derived from methane seep sediments, which account for 66.33% of the AOM-affiliated sequences. The other prevalent isolation sources of the AOM-affiliated McrA sequences are terrestrial and marine mud volcanoes, a natural sinkhole, and tidal creek sediments. The McrA sequences derived from these sources make up 4.44%, 3.68%, 7.74% and 13.89% of the AOM-affiliated McrA sequences, respectively.
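The pathway-level tallies discussed above amount to a cross-tabulation of pathway category against isolation source. The sketch below shows one way to compute such a table; the function name and record layout are hypothetical, and the group-to-pathway mapping is a partial, illustrative one limited to groups named in the text. A group may be listed under more than one pathway, in which case its sequences are counted once per pathway.

```python
from collections import Counter, defaultdict

# Partial, illustrative mapping of taxonomic groups to pathway categories.
PATHWAYS = {
    "Methanosarcinales": {"methylotrophic"},
    "Methanosphaera": {"methylotrophic"},
    "Methanosaeta": {"acetoclastic"},
    "Methanosarcina": {"acetoclastic"},
    "ANME-1a": {"AOM"},
}


def source_profile(records, pathways=PATHWAYS):
    """Percentage of each isolation source within each pathway category.

    `records` is an iterable of (taxonomic_group, isolation_source) pairs;
    groups missing from the mapping are skipped.
    """
    tallies = defaultdict(Counter)
    for group, source in records:
        for pathway in pathways.get(group, ()):
            tallies[pathway][source] += 1
    return {
        pathway: {src: 100.0 * n / sum(c.values()) for src, n in c.items()}
        for pathway, c in tallies.items()
    }
```

The percentages within each pathway sum to 100, so pathway categories of very different sizes can be compared on the same scale.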
Chapter 4 Monitoring AOM in Eel River Basin

4.1 Introduction

Microbially mediated AOM is of significant interest in the field of microbial ecology because of its role in reducing the CH4 flux from ocean to atmosphere, as well as in stimulating subsurface microbial metabolism and supporting thriving deep-sea communities along the continental margins [8, 10]. Despite its widespread significance, the molecular mechanisms underlying AOM are not well understood, in part because the microbial groups known to be involved are difficult to access, exhibit slow in situ growth rates, and remain uncultivated in the laboratory. In sediments, AOM typically occurs in a region of sulfate and CH4 depletion known as the sulfate methane transition zone, where CH4 is converted into CO2 and reduced by-products, such as acetate, formate, and molecular hydrogen, via H4MPT-linked C1 transfer reactions [8, 18-20]. These by-products are in turn used as electron donors in the conversion of sulfate to hydrogen sulfide and water via dissimilatory sulfate reduction. Lipid biomarker and environmental genomic studies aimed at determining the biological component of AOM have implicated anaerobic CH4-oxidizing archaea (ANME-1, -2, -3) and sulfate-reducing bacteria (SRB) affiliated with Delta- and Beta-proteobacteria [7, 9, 20]. My aim in this project, therefore, was to employ the MLTreeMap workflow to investigate the taxonomic composition, genomic potential and gene expression of communities involved in the anaerobic oxidation of methane (AOM).

4.2 Materials and methods

4.2.1 Sample and data collection

The MLTreeMap analysis workflow was used to analyze the metagenomic data generated from the 6- to 9-cm interval of push-core 45 (PC45) from the Eel River Basin (ERB) methane seep. The data used in this analysis consisted of 131,761 whole-genome shotgun (WGS) sequences averaging 732 bp, as well as 6,600 fosmid libraries averaging 700 bp.
These libraries were generated from microbial cells of ANME-1, ANME-2 and associated SRB enriched from the sediment using density centrifugation and size selection. From the WGS sequences, 119,611 open reading frames (ORFs) were  	
predicted, while over 10,000 ORFs were predicted from the fosmid libraries using FGenesB (http://www.softberry.com). The MLTreeMap workflow was also applied to 9,457 peptide sequences collected from eight depth intervals of PC44 and PC45 in order to determine the expression level of genes detected in these samples. The detailed workflow of data collection is described in Figure 4.1.

4.2.2 Generating reference data for functional marker genes

Prior to conducting the analysis, I constructed a reference dataset for genes that are known to be involved in H4MPT-linked C1 transfer reactions [6, 18-20] and in the associated sulfate reduction reactions. A full list of selected functional marker genes is shown in Table 4.1. Protein sequences of candidate marker genes were downloaded from the NCBI GenPept database (GenBank release 184, June 2011), with the exception of McrA genes, which were retrieved from the FGPR database (version 6.6.1). For each reference dataset, I was unable to use the entire set of downloaded sequences in the analysis because of the high computational cost (i.e. time and memory) required to compute the phylogeny. Hence, I aimed to reduce the number of sequences used in the analysis while still maintaining the overall taxonomic diversity of each reference marker gene. To achieve this goal, the sequences were clustered using UCLUST with a similarity threshold of 0.96 [86]. The list of representative sequences was inspected to determine whether more than one sequence was associated with a single organism (i.e. multiple-copy genes). As well, I wanted to ensure that the list of representative sequences also included the sequences associated with taxonomic groups that were unique to the representative sequences. In cases where multiple-copy genes were detected, I used the longer sequence in the subsequent analysis.
The sequences of interest that had previously been filtered out by the clustering procedure were also appended to the existing representative sequences.
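The handling of multiple-copy genes described above, where the longer sequence is kept when one organism contributes several copies, can be sketched as follows. The function name and the record layout (sequence ID, organism name, amino acid sequence) are assumptions for illustration; ties in length are resolved here by keeping the first copy encountered.

```python
def longest_per_organism(records):
    """Keep one sequence ID per organism, preferring the longest sequence.

    `records` is an iterable of (seq_id, organism, sequence) tuples.
    """
    best = {}  # organism -> (seq_id, sequence)
    for seq_id, organism, sequence in records:
        current = best.get(organism)
        if current is None or len(sequence) > len(current[1]):
            best[organism] = (seq_id, sequence)
    return {organism: seq_id for organism, (seq_id, _) in best.items()}
```

The returned mapping identifies which representative to retain for each organism before the reference alignment and tree are built.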
Figure 4.1 Ecological context and input data for MLTreeMap pipeline. A) Marine sediments were sampled from the Eel River Basin off the coast of Mendocino, California, USA. Biomass was obtained from nine depth intervals derived from two pushcores (PC44 and PC45) collected by the ROV Tiburon operated by the Monterey Bay Aquarium Research Institute. Anaerobic methane oxidizers, including ANME-1 filaments and ANME-2 aggregates with associated sulfate-reducing bacteria (SRB), were enriched from the PC45 6-9 cm depth interval using density centrifugation followed by size selection on a 3 µm filter. Genomic libraries were constructed from DNA extracted from the cell enrichment and Sanger sequenced. Open reading frames were predicted using FGenesB (Softberry Inc.). Proteins were extracted from all eight depth samples and were analyzed by liquid chromatography tandem mass spectrometry (LC-MS/MS) at Pacific Northwest National Labs. Detected peptides were matched to the PC45 metagenomic assembly using the SEQUEST algorithm. Only peptides with a peptide prophet score of ≥ 0.95 were used in subsequent analysis. B) Summary of WGS and fosmid library sequencing data fed into the MLTreeMap pipeline. C) Summary of peptide and pyrosequencing data used in metaproteomic and community structure analysis, respectively.
The catalogue of sequences obtained from the clustering and filtering procedures was supplied to the Python script create_mltree_ref_data.py in order to generate the reference data for the selected functional marker genes involved in the H4MPT-linked C1 transfer reactions.

Table 4.1 Summary of functional marker genes used in the MLTreeMap analysis

Gene Name (Locus) | Reaction Step | Number of downloaded sequences | Number of sequences in the reference tree | Number of genes detected
16S SSU rRNA | Phylogenetic Marker | * | 176 | 106
40 Cluster of Orthologous Genes | Phylogenetic Marker | * | 267 | 1,119
Formylmethanofuran dehydrogenase subunit A (fmdA) | 1 | 76 | 43 | 77
Formylmethanofuran dehydrogenase subunit B (fmdB) | 1 | 279 | 89 | 72
Formylmethanofuran dehydrogenase subunit C (fmdC) | 1 | 194 | 73 | 44
Formylmethanofuran dehydrogenase subunit D (fmdD) | 1 | 53 | 32 | 21
Formylmethanofuran dehydrogenase subunit E (fmdE) | 1 | 1,010 | 85 | 20
Formylmethanofuran tetrahydromethanopterin formyltransferase (ftr) | 2 | 228 | 64 | 67
N5, N10-methenyltetrahydromethanopterin cyclohydrolase (mch) | 3 | 400 | 89 | 34
F420-dependent methylene-tetrahydromethanopterin dehydrogenase (mtd) | 4 | 122 | 45 | 28
H2-forming methylene-tetrahydromethanopterin dehydrogenase (hmd) | 4 | 79 | 30 | -
Methylene-tetrahydro-methanopterin reductase (mer) | 5 | 533 | 77 | 2
Tetrahydro-methanopterin methyltransferase subunit A (mtrA) | 6 | 193 | 61 | 14
Tetrahydro-methanopterin methyltransferase subunit B (mtrB) | 6 | 102 | 45 | 7
Tetrahydro-methanopterin methyltransferase subunit C (mtrC) | 6 | 107 | 48 | 18
Tetrahydro-methanopterin methyltransferase subunit D (mtrD) | 6 | 48 | 20 | 17
Tetrahydro-methanopterin methyltransferase subunit F (mtrF) | 6 | 86 | 41 | 8
Tetrahydro-methanopterin methyltransferase subunit G (mtrG) | 6 | 56 | 19 | -
Tetrahydro-methanopterin methyltransferase subunit H (mtrH) | 6 | 173 | 59 | 46
Methyl-coenzyme M reductase subunit alpha (mcrA) | 7 | 5,400 | 153 | 25
Methyl-coenzyme M reductase subunit beta (mcrB) | 7 | 25 | 13 | 51
Methyl-coenzyme M reductase subunit C (mcrC) | 7 | 33 | 22 | 27
Methyl-coenzyme M reductase subunit delta (mcrD) | 7 | 39 | 19 | 5
Methyl-coenzyme M reductase subunit gamma (mcrG) | 7 | 56 | 28 | 42
Heterodisulfide reductase, subunit A (hdrA) |  | 71 | 35 | 404
Heterodisulfide reductase, subunit B (hdrB) |  | 85 | 36 | 91
Heterodisulfide reductase, subunit C (hdrC) |  | 55 | 30 | 27
Acetyl-CoA decarbonylase/synthase subunit alpha (cdhA) |  | 107 | 50 | 141
Acetyl-CoA decarbonylase/synthase subunit epsilon (cdhB) |  | 44 | 19 | 30
Acetyl-CoA decarbonylase/synthase subunit beta (cdhC) |  | 199 | 82 | 51
Acetyl-CoA decarbonylase/synthase subunit delta (cdhD) |  | 102 | 36 | 43
Acetyl-CoA decarbonylase/synthase subunit gamma (cdhE) |  | 62 | 29 | 58
Acetate kinase (ackA) |  | 410 | 150 | 18
ADP-forming acetyl-CoA synthetase (acd) |  | 328 | 100 | 98
Adenosine-5'-phosphosulfate (APS) reductase subunits alpha/beta (aprAB) | Sulfate reduction | 245 | 107 | 7

4.2.3 Taxonomic composition of the PC45 metagenomic data

The taxonomic structure of the PC45 metagenomic libraries was surveyed using phylogenetic markers detected with MLTreeMap. To do so, two versions of MLTreeMap were used in the analysis. The first version of the software can be downloaded from MLTreeMap’s official website (http://www.mltreemap.org). This tool uses nucleotide sequences as input and was used to detect 16S small subunit ribosomal RNA (16S SSU rRNA) genes from the metagenomic data (Appendix A). The second version of MLTreeMap [87] bypasses the ORF prediction step mediated by GeneWise, and hence is suitable for users who have already predicted ORFs from their nucleotide data using tools such as Orphelia [88] or Prodigal [89]. I used the latter version of MLTreeMap to detect genes associated with the 40 COGs and the selected functional markers from the ORF data (Appendix A).

Sequences of 16S SSU rRNA detected from the metagenomic libraries were clustered at a nucleotide similarity threshold of 0.97 using UCLUST, producing 50 clusters. Sequences representing each of these clusters were then aligned with the sequences present in the reference 16S SSU rRNA phylogeny using MUSCLE. The resulting alignment file was supplied to RAxML for maximum likelihood analysis, using the same set of algorithm options that were applied when constructing reference trees for the functional markers.

Using MLTreeMap, more than 1,000 sequences of COG genes were detected from the PC45 metagenomic data. Performing the maximum likelihood analysis with data of such magnitude on a single CPU would incur a significant computational cost in time and memory.
Hence, instead of reconstructing one large phylogenetic tree containing information on all 40 COGs, I decided to reconstruct the phylogeny of each COG separately. Each detected COG gene was assigned to the one of the 40 COG groups to which it corresponds. The query sequences were then aligned against the reference sequences of their designated COG group. The alignment file was used as input for maximum
likelihood analysis, producing a phylogenetic tree containing taxonomic information for the sequences affiliated with each COG group. The 60 completed fosmid libraries had already been assigned to one of the ANME-1 or ANME-2 subgroups by Hallam and colleagues [6]. Hence, it was possible to identify the branches corresponding to the ANME-1 and ANME-2 subgroups in the 16S SSU rRNA and COG trees by locating the positions in each tree of the sequences derived from the completed fosmid libraries. The phylogenetic marker genes derived from the whole-genome shotgun and fosmid end libraries were binned into taxonomic groups, including the ANME-1 and ANME-2 lineages, using the nearest ancestral node in the tree with a defined taxonomic assignment [90]. Based on the result of the binning process, the taxonomic composition of the PC45 metagenomic data was tabulated and visualized as a histogram using the R statistical package (Figure 4.2).

4.2.4 Taxonomic composition of PC44 and PC45 pyrosequencing data

Nearly 71,000 V6 pyrotag sequences recovered from eight different PC44 and PC45 core intervals were analyzed using QIIME (Quantitative Insights Into Microbial Ecology; [91]). The sequences were clustered at a 0.97 similarity threshold using UCLUST. Representative sequences were selected from the 4,140 operational taxonomic units (OTUs) generated by the clustering procedure. The taxonomic assignment of each OTU was performed using the Ribosomal Database Project (RDP) classifier. The rows of the resulting OTU table were collapsed so that taxonomic groups were represented at the class level. I then summed the number of sequences associated with each class across all eight core intervals. If the sum was equal to or greater than 0.2% of the total pyrotags analyzed, the row of the OTU table was retained for the subsequent visualization procedure.
Histograms depicting the taxonomic structure of the eight core depth intervals based on the V6 pyrotag sequences were generated using the R statistical package and compared with the histograms showing the microbial community composition of the PC45 metagenomic data based on MLTreeMap results (Figure 4.2).
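
The class-level collapsing and 0.2% abundance cut-off described in this section can be illustrated with a minimal Python sketch. It does not reproduce the QIIME commands used in the analysis; the function names, the dictionary-based table layout and the toy counts are assumptions made purely for illustration.

```python
def collapse_to_class(otu_counts, otu_class):
    """Sum the per-interval counts of all OTUs that share a class label.

    otu_counts maps an OTU id to a list of counts, one per core interval.
    otu_class maps an OTU id to its class-level assignment (hypothetical layout).
    """
    table = {}
    for otu, counts in otu_counts.items():
        label = otu_class.get(otu, "Unclassified")
        row = table.setdefault(label, [0] * len(counts))
        for i, count in enumerate(counts):
            row[i] += count
    return table


def filter_by_abundance(class_table, min_fraction=0.002):
    """Keep classes whose summed count is at least min_fraction of all sequences."""
    total = sum(sum(row) for row in class_table.values())
    return {
        label: row
        for label, row in class_table.items()
        if sum(row) >= min_fraction * total
    }


# Toy data: four OTUs counted in two core intervals (not thesis data).
otu_counts = {"otu1": [500, 300], "otu2": [100, 98], "otu3": [1, 0], "otu4": [0, 1]}
otu_class = {
    "otu1": "Methanomicrobia",
    "otu2": "Deltaproteobacteria",
    "otu3": "Methanomicrobia",
    "otu4": "Spirochaetes",
}

class_table = collapse_to_class(otu_counts, otu_class)
retained = filter_by_abundance(class_table)
print(sorted(retained))
```

In this toy table the single Spirochaetes sequence falls below 0.2% of the 1,000 sequences in total, so that row is dropped, while the two abundant classes are kept for plotting.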
4.2.5 AOM pathway detection

I appended the sequences of detected functional marker genes associated with AOM and SO42- reduction pathways to their respective reference trees by applying the methods used to update the trees of the phylogenetic marker genes (Appendix B). Prior to the maximum likelihood analysis, if there were more than 30 alleles of a particular functional marker gene, those sequences were clustered using UCLUST with an AAI threshold of 0.96. Based on the results of the maximum likelihood analyses, the numbers of marker genes associated with the taxonomic groups present in the PC45 6-9 cm sample were recorded in tabular format and subsequently visualized as a histogram using the R statistical package (Figure 4.3).

4.2.6 AOM expression profiling

Peptides with a PeptideProphet score of ≥ 0.95 were used in the AOM expression profiling analysis. Peptides detected from each core-depth interval were divided into groups. Each group contained unique peptides that matched more than one ORF derived from the PC45 libraries. These peptide sequences were compiled to generate a list of unique peptides in each core-depth interval. As discussed previously, I applied the MLTreeMap approach to identify ORFs generated from the PC45 data as phylogenetic and functional markers associated with AOM. Using these annotated ORFs as a guide, I detected unique peptides that mapped to AOM pathways in the PC44 and PC45 core intervals. The abundance of these peptides, along with their taxonomic origin, was visualized using the R statistical package (Figure 4.5).

4.3 Results and discussion

4.3.1 Identification of phylogenetic and functional marker genes

The initial release of MLTreeMap was implemented to search for the entire set of reference genes, even if they had already been searched for in a previous analysis. Hence, users were unable to search selectively for the genes of interest to them.
I modified the source code of both the nucleotide and protein versions of MLTreeMap to introduce an option to search for selected sets of genes. Using the nucleotide version of MLTreeMap, the nucleotide WGS sequences were searched against the entire set of reference data to detect 16S SSU rRNA sequences. The analysis
was completed within 37 hours on a single-processor system with a 3 GHz Intel Xeon CPU and 16 GB of memory. This search recovered 106 16S SSU rRNA sequences. The same phylogenetic reference marker was then searched using the selective search option introduced into the MLTreeMap source code. Using this option, MLTreeMap detected the same number of 16S SSU rRNA sequences within 1 hour and 7 minutes. The alternate version of MLTreeMap was used to detect the COG genes and the functional marker genes from the protein ORF sequences. The analysis took 28 hours and 19 minutes in the same computational environment used with the original version of MLTreeMap. Using the alternate version of MLTreeMap, 1,119 protein sequences affiliated with the 40 COGs were detected. A number of genes involved in H4MPT-linked C1 transfer reactions, including 150 alleles of methyl-coenzyme M reductase subunits (mcrABCDG), were also recovered. The PC45 metagenomics data contain 427 sequences of heterodisulfide reductase (hdrABC), which is involved in generating the intermediate substrate required for the reaction mediated by Mcr in reverse methanogenesis [5, 8, 92]. Coding regions of H2-forming methylene-H4MPT dehydrogenase (hmd), however, are not present in the metagenomics data. This enzyme uses H2 as the electron donor in the reduction of N5,N10-methenyl-H4MPT to N5,N10-methylene-H4MPT [82]. Under hydrogen-limited conditions such as those at methane seeps, however, this step of methanogenesis is carried out by coenzyme F420-dependent methylene-H4MPT dehydrogenase (mtd) [1, 93]. Consistent with this, a number of mtd genes were detected in the PC45 metagenomics data. Furthermore, I recovered genes affiliated with the acetoclastic methanogenesis pathway, such as carbon monoxide dehydrogenase (cdhABCDE), acetate kinase (ackA) and ADP-forming acetyl-CoA synthetase (acd).
Also detected in the PC45 6-9 cm data are functional marker genes for the associated SO42- reduction, such as adenosine-5'-phosphosulfate (APS) reductase (aprAB) and dissimilatory sulfite reductase (dsrAB).

4.3.2 Taxonomic structure of phylogenetic markers

Using the preliminary MLTreeMap analyses, I recovered phylogenetic and functional markers that are useful for assessing the taxonomic composition and genomic potential of the AOM communities. However, this method does not always provide clear taxonomic resolution for sequences from more distantly related genotypes. For instance, the results of the preliminary
analyses suggested that none of the detected 16S SSU rRNA or COG sequences were affiliated with the ANME lineages that are known to be present in the PC45 metagenomics libraries [6, 13]. To demonstrate the lack of accuracy in MLTreeMap's taxonomic binning method, I summarized the results of the preliminary MLTreeMap analysis in Figure 4.2. The taxonomic profile of the 16S SSU rRNA sequences suggests that 53.48% of the phylogenetic markers detected in the PC45 data are affiliated with Methanosarcinales, while 2.40% of the sequences are closely related to the Methanomicrobiales group. In addition, 14.15% of the 16S SSU rRNA genes are closely related to the Archaeoglobales group, while the remaining 32.27% are distributed across various groups of Bacteria. The taxonomic composition of the universal COG genes is somewhat different from that observed for the 16S SSU rRNA, but likewise suggests that none of the ANME lineages are present in the PC45 data (Figure 4.2). Here, 29.53% of the detected phylogenetic markers are assigned to the Methanosarcinales group, while 7.23% are affiliated with Methanomicrobiales. Additionally, 1.02% of the total detected COG sequences are closely related to the order Methanobacteriales. These discrepancies arise because MLTreeMap assigns sequences to the “next best relative” available in the reference tree [33]. Because ANME-affiliated sequences were not available in the 16S SSU rRNA and COG reference trees, the majority of the phylogenetic marker genes detected in the environmental data were assigned to the archaeal sequences most closely related to the ANME lineages.

4.3.3 Taxonomic composition based on universal phylogenetic marker genes

The MLTreeMap workflow was employed to produce a more accurate taxonomic assessment of the PC45 data.
Here, the 16S SSU rRNA and COG sequences detected in the metagenomics libraries were appended at the optimal loci in their respective reference trees. Consistent with the previous findings of Hallam and colleagues [6], results of the phylogenetic analysis indicate that the 16S SSU rRNA sequences detected in the PC45 data are dominated by ANME-1, ANME-2 and SRB groups (Figure 4.2). In fact, 48.1% of the total detected 16S SSU rRNA genes are affiliated with the ANME-1 group, while 22.6% share a close relationship with the ANME-2 group. Sequences that share a close relationship with SO42--reducing δ-proteobacteria groups make up just over 12.0% of the total 16S SSU rRNA sequences
detected. Contrary to the preliminary MLTreeMap results, there are no 16S SSU rRNA sequences associated with methanogenic archaeal groups. The remaining 16S SSU rRNA sequences are distributed among various bacterial groups, including Firmicutes (1.9%), Chloroflexi (0.9%), Planctomycetes (0.9%), α-proteobacteria (0.9%) and γ-proteobacteria (1.9%). Many of these bacterial groups have also been detected in various deep marine methane seep sediments [94-95]. Furthermore, past studies have demonstrated that Planctomycetes harbour genes for C1 transfer reactions mediated by methanopterin and methanofuran [96], indicating a possible role for this group of bacteria in CH4 cycling reactions. Similar to the taxonomic composition of the 16S SSU rRNA genes, a number of ANME-1 and ANME-2 affiliated COG genes were also found. The ANME-1 affiliated COG genes make up 11.30% of the total marker COG sequences detected in the PC45 6-9 cm interval, while 2.1% are affiliated with the ANME-2 group. Unlike the 16S SSU rRNA results, however, there are also COG sequences affiliated with the class Methanomicrobia, comprising 18.6% of the total detected COG genes. Almost 22% of the COG genes are assigned to an unknown Euryarchaeal group. There are also COG genes affiliated with various groups of Bacteria. Almost 3.7% of the total COG genes are classified as SO42--reducing δ-proteobacteria, such as Desulfobacterales, Desulfovibrionales and Syntrophobacterales. Bacterial COG genes assigned to an unknown group of δ-proteobacteria are also present in the PC45 metagenomics data and make up 13.2% of the total COG genes detected. Nearly 1.8% of the COG genes detected share a close relationship with Chloroflexi. Just over 15% of the COG genes are unclassified bacterial sequences.
The rest of the sequences are distributed sparsely across various bacterial groups, including Bacteroidetes (0.5%), Planctomycetes (0.5%), γ-proteobacteria (0.3%) and Spirochaetes (0.2%).
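
The binning rule behind these assignments, in which each placed sequence takes the taxonomy of its nearest ancestral node with a defined label [90], can be illustrated with a minimal Python sketch. The child-to-parent dictionary, the node names and the function name are hypothetical and do not reflect the internal tree representation used by MLTreeMap.

```python
def nearest_labelled_ancestor(node, parent, taxonomy, default="Unclassified"):
    """Walk from a placed query towards the root and return the first label found."""
    current = parent.get(node)
    while current is not None:
        if current in taxonomy:
            return taxonomy[current]
        current = parent.get(current)
    return default


# Toy tree (not thesis data): child -> parent; the root has no parent entry.
parent = {
    "query_a": "n3", "n3": "n2", "n2": "n1", "n1": "root",
    "query_b": "n1",
    "query_c": "n4", "n4": "root",
}
# Only some internal nodes carry a defined taxonomy assignment.
taxonomy = {"n2": "ANME-1", "n1": "Euryarchaeota"}

bins = {
    query: nearest_labelled_ancestor(query, parent, taxonomy)
    for query in ("query_a", "query_b", "query_c")
}
print(bins)
```

In this toy tree, a query placed inside a labelled ANME-1 clade inherits that label, a query attached higher up receives only the broader Euryarchaeota label, and a query with no labelled ancestor remains unclassified. This mirrors why reference trees lacking ANME sequences cannot yield ANME assignments.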
Figure 4.2 Taxonomic structures of ERB metagenomic and pyrotag data. A) The taxonomic structure of the PC45_6-9 cm metagenomic libraries was surveyed based on detection of 16S small subunit ribosomal RNA (16S SSU rRNA) genes and 40 “universal” clustered orthologous groups (COG) using MLTreeMap. The label “_original” indicates the taxonomic structure generated using the original taxonomic composition report produced by MLTreeMap, while “_updated” indicates the taxonomic structure generated using the updated phylogenetic trees. B) The taxonomic structure of PC44 and PC45 core intervals, including PC45_6-9 cm, was also surveyed based on analysis of V6 pyrotag sequences using QIIME.

4.3.4 Taxonomic composition of the eight core intervals based on pyrosequencing data

The taxonomic composition of PC45 determined using 16S SSU rRNA and COG sequences does not reflect the overall microbial diversity in the environment, because the sequence data were retrieved from a size-selected cell enrichment sample. To address this issue, I analyzed 70,776 V6 pyrotag sequences collected from eight depth intervals of PC44 and PC45 using QIIME. The results indicate that, for each core, community structure shifts across depth intervals (Figure 4.2). In PC44, sequences affiliated with ANME-1 and ANME-2 make up only 7.9% of the 21,941 total sequences derived from the 0-3 cm interval. In the PC44
3-7 cm interval, 11.4% of the 5,847 total sequences are closely related to the ANME-1 and ANME-2 lineages. In the 10-13 cm and 13-15 cm intervals, the ANME-affiliated sequences make up 37.8% (of 1,841 total sequences) and 32% (of 4,320 total sequences), respectively. Similar trends are observed in the PC45 intervals, where ANME-affiliated sequences make up 11.8% of the 7,254 total sequences in the 0-3 cm interval. This proportion increases to 28% in the 12-15 cm interval. Apart from the ANME lineages, the eight interval samples contain a diverse range of Euryarchaeal groups. Among these groups, there are a number of sequences affiliated with extremophiles, such as Halobacteriales and Thermoplasmatales. Only a small fraction of sequences classified as methanogenic archaeal groups, such as Methanobacteriales, Methanococcales and Methanosarcinales, are present across the different intervals. Within the Bacterial domain, a significant number of sequences closely related to SO42--reducing δ-proteobacterial groups, such as Desulfobacterales and Desulfovibrionales, were recovered. These sequences make up over 10% of the total sequence diversity in five of the eight intervals, and more than 6% in the remaining three intervals. There are also sequences assigned to various γ-proteobacterial groups affiliated with higher alkane, aromatic and heavy oil degradation, including Pseudomonadales and Oceanospirillales [97]. These sequences are most abundant in the PC44 0-3 cm and PC45 0-3 cm samples.

4.3.5 Functional markers of H4MPT-linked C1 transfer reactions

In order to determine the genomic potential of the PC45 data, I reconstructed the phylogeny of the marker genes involved in H4MPT-linked C1 transfer reactions. The detected functional peptide sequences were assigned to taxonomic groups based on their positions in the corresponding tree.
Results of the taxonomic binning process suggest that, with the exception of AckA and coenzyme F420-dependent N5,N10-methylenetetrahydromethanopterin reductase (Mer), all detected functional marker genes are dominated by alleles associated with the ANME-1 and/or ANME-2 groups (Figure 4.3). From these results, it is also evident that the ANME-1 group contains most of the genetic machinery necessary to conduct AOM through reverse methanogenesis, as suggested in previous studies [6, 92].
Marker genes involved in the C1 transfer reactions affiliated with various methanogenic archaeal groups, such as Methanosarcinales, Methanococcales and Methanomicrobiales, were also detected. The results indicate that in the PC45 metagenomics data, Methanosarcinales is the only group, apart from ANME-1, that contains the five subunits of formylmethanofuran dehydrogenase (FmdABCDE), formylmethanofuran-H4MPT-formyltransferase (Ftr) and F420-dependent methylene-H4MPT dehydrogenase (Mtd). Four of the eight subunits of the membrane-bound protein N5-methyl-H4MPT-coenzyme M methyltransferase (MtrCDAH) and methylene-H4MPT reductase (Mer), all affiliated with the Methanosarcinales group, are also present in the PC45 metagenomics data. Past studies have shown that species such as Methanosarcina barkeri are capable of utilizing CO as a reducing equivalent for methanogenesis from CO2 [1]. The absence of genes encoding Mtr, Mer, Mtd and Ftr has been shown to prevent this organism from growing on methanol or H2/CO2, suggesting an indispensable role for these enzymes in hydrogenotrophic and methylotrophic methanogenesis [98-99]. Hence, the results indicate a possible role for these methanogenic archaeal groups in C1 cycling. Functional marker genes associated with acetoclastic methanogenesis were also recovered from the PC45 metagenomics data. Such genes include those encoding the ANME-1 affiliated carbon monoxide dehydrogenase (CdhA) and ADP-forming acetyl-CoA synthetase (Acd). Consistent with the results of a previous study [6], genes encoding enzymes such as acetate kinase (AckA) and phosphoacetyltransferase (Pta) are absent from the metagenomics data. The reaction catalyzed by Acd is similar to that mediated by acetate synthase (Acs) harboured by the Methanosaeta group, where acetyl-CoA is converted into acetate in a single step [100].
These findings indicate that the ANME-1 group may be able to mediate acetoclastic methanogenesis in marine sediment, although expression of the genes affiliated with these activities was not detected, as discussed in Section 4.3.7, ‘AOM expression’.
Figure 4.3 Genomic potential of ERB metagenomic library associated with reverse methanogenesis. The number of genes and peptides detected is shown as a histogram. The colour of each bar represents the taxonomic origin of each gene or protein detected using the MLTreeMap pipeline.

4.3.6 Functional markers of associated sulfate reduction

Functional marker genes of the SO42- reduction associated with AOM, such as dissimilatory sulfite reductase (dsrAB) and adenosine-5'-phosphosulfate (APS) reductase (aprAB), were also recovered from the PC45 metagenomics libraries. The results of the analysis indicate that the majority of the genes encoding the Dsr enzyme are closely related to the δ-proteobacterial order Desulfobacterales, while the remaining Dsr sequences are not affiliated with any known reference taxonomic group (Figure 4.4). Like the unclassified Dsr peptide sequences, the seven Apr protein sequences could not be taxonomically assigned.
Figure 4.4 Genomic potential of ERB metagenomic libraries associated with SO42- reduction pathways. The number of genes and peptides detected is shown as a histogram. The colour of each bar represents the taxonomic origin of each gene or protein detected using MLTreeMap.

4.3.7 AOM expression

In order to determine the expression level of detected genes in the eight depth intervals of PC44 and PC45, I analyzed the proteomics data retrieved from these samples. The goal was to generate a list of unique peptides for each interval sample. There were a total of 4,033 unique peptides across all eight intervals. From these, I identified a number of peptide sequences that are functionally relevant to AOM and related SO42--reduction reactions. Most of the enzymes were recovered from the PC44 3-7 cm interval and the PC45 6-9 cm interval (Figure 4.5). Across all
eight samples, the majority of the AOM-affiliated proteins belong to ANME-1, ANME-2 and Methanosarcinales. In the PC44 3- to 7-cm interval, 608 proteins (out of 1,509 unique peptides) affiliated with AOM and related SO42--reduction were identified. Of these enzymes, 86.02% are ANME-1 and ANME-2 encoded Mcr subunits α, β and γ (McrABG). Mch proteins affiliated with ANME-1 are also present, comprising 1.97% of the total peptides identified in this sample. The other H4MPT-linked C1 cycling proteins detected in the PC44 3- to 7-cm interval include Fmd subunits D and B (FmdDB), Ftr, Mtd and Mtr subunit H (MtrH). All of these enzymes, however, are encoded by the Methanosarcinales group. A small fraction of enzymes involved in the SO42--reduction steps, such as adenosine-5'-phosphosulfate reductase (Apr) and dissimilatory sulfite reductase (Dsr), were also detected in this sample. In the PC45 6- to 9-cm interval, there are 461 enzymes (out of 1,334 unique peptides) affiliated with AOM and related SO42--reduction reactions. As in the PC44 3- to 7-cm sample, the ANME-1 and ANME-2 affiliated McrABG make up the majority (91.14%) of the total proteins detected. One notable difference between the PC44 3- to 7-cm interval and the PC45 6- to 9-cm interval is that the vast majority of the Mtd and Mtr proteins detected in the PC45 sample are encoded by the ANME-1 group. In addition, the HdrA protein affiliated with ANME-1 is present in this core interval. Proteins including Fmd, Ftr, Mch, Acd and Cdh are absent from the proteomics data. Additionally, while Apr and Dsr proteins were detected in this sample, Apr could not be assigned to any known reference taxonomic group.
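
The peptide selection described in Section 4.2.6 (a PeptideProphet score cut-off of 0.95, followed by compilation of unique peptides per core interval and mapping through annotated ORFs) can be sketched in Python as follows. The record layout, the function names, the peptide strings and the rule of counting each marker and taxon pair once per peptide are illustrative assumptions, not a description of the scripts used in this study.

```python
def unique_peptides_by_interval(identifications, min_score=0.95):
    """Return {interval: set of peptide sequences} for peptides passing the cut-off."""
    unique = {}
    for interval, peptide, score in identifications:
        if score >= min_score:
            unique.setdefault(interval, set()).add(peptide)
    return unique


def count_by_marker_and_taxon(peptides, peptide_to_orfs, orf_annotation):
    """Count unique peptides per (marker, taxon) pair via the ORFs they match."""
    counts = {}
    for peptide in peptides:
        # A peptide may match several ORFs; count each (marker, taxon) pair once.
        pairs = {
            orf_annotation[orf]
            for orf in peptide_to_orfs.get(peptide, ())
            if orf in orf_annotation
        }
        for pair in pairs:
            counts[pair] = counts.get(pair, 0) + 1
    return counts


# Toy records (not thesis data): (core interval, peptide sequence, score).
identifications = [
    ("PC45_6-9cm", "AEELGK", 0.99),
    ("PC45_6-9cm", "AEELGK", 0.97),  # repeated identification of the same peptide
    ("PC45_6-9cm", "LLNDR", 0.80),   # below the cut-off, discarded
    ("PC44_3-7cm", "VTDLR", 0.95),
]
peptide_to_orfs = {"AEELGK": ["orf1", "orf2"], "VTDLR": ["orf3"]}
orf_annotation = {
    "orf1": ("McrA", "ANME-1"),
    "orf2": ("McrA", "ANME-1"),
    "orf3": ("Mtd", "Methanosarcinales"),
}

unique = unique_peptides_by_interval(identifications)
pc45_counts = count_by_marker_and_taxon(
    unique["PC45_6-9cm"], peptide_to_orfs, orf_annotation
)
print(pc45_counts)
```

Tables of this kind, one per core interval, are the sort of input from which a stacked histogram of peptide abundance by marker and taxonomic origin could be drawn.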
Figure 4.5 Summary and taxonomic origin of peptides mapping to AOM pathways in PC44 and PC45 core intervals, revealing depth-specific trends in protein expression and taxonomic composition.
The observations from the proteomics analysis indicate that most of the ANME-1 encoded enzymatic components required to conduct AOM are expressed in both the PC44 3- to 7-cm and PC45 6- to 9-cm intervals. As mentioned previously, however, some of the proteins that play important roles in AOM are missing from one or both of the core intervals. Several factors may explain the absence of these enzymes. During the MLTreeMap analysis, the detected ORF sequences are aligned against the corresponding reference sequences. The alignment is then subjected to minimal gap removal using Gblocks. If the length of a query sequence is reduced to fewer than 50 amino acids by the trimming process, the sequence is not used in the subsequent maximum likelihood step. Among the ORFs that were trimmed to less than the minimum sequence length, I identified Acd and Apr proteins from the PC45 6- to 9-cm interval. Because these ORFs had not been processed through maximum likelihood analysis, they were searched against the non-redundant database using BLASTP. The results of the BLAST search confirmed that these proteins are affiliated with SO42--reducing bacterial species, such as Desulfobulbus propionicus and Desulfobacterium catecholicum (data not shown). Re-inspection of the MLTreeMap reference trees confirmed that none of the proteins encoded by these species were present in the corresponding trees. These results, therefore, further underscore the need to improve the taxonomic coverage of the reference trees in order to obtain more accurate MLTreeMap results.
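
The minimum-length rule described above can be illustrated with a short Python sketch that separates trimmed query sequences into those passed on to the maximum likelihood step and those set aside. The function names and the dictionary input are assumptions for illustration; the sketch operates on already trimmed sequences and does not call Gblocks or MLTreeMap.

```python
def residue_count(aligned_seq, gap_chars="-."):
    """Number of amino acid residues left in a trimmed, aligned sequence."""
    return sum(1 for ch in aligned_seq if ch not in gap_chars)


def split_by_trimmed_length(trimmed_alignment, min_length=50):
    """Separate queries that keep at least min_length residues from those that do not."""
    kept, too_short = {}, {}
    for name, seq in trimmed_alignment.items():
        target = kept if residue_count(seq) >= min_length else too_short
        target[name] = seq
    return kept, too_short


# Toy sequences (not thesis data): 50 and 49 residues after trimming.
trimmed = {
    "orf_long": "M" * 50 + "-" * 10,
    "orf_short": "M" * 49 + "-" * 11,
}
kept, too_short = split_by_trimmed_length(trimmed)
print(sorted(kept), sorted(too_short))
```

Retaining the set-aside sequences rather than discarding them makes it straightforward to examine them by another route, as was done here with BLASTP for the Acd and Apr ORFs.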

Chapter 5 Conclusion

In this study, I designed and implemented a workflow based on the MLTreeMap framework to reconstruct the phylogeny of user-defined marker genes detected in environmental samples. With this pipeline, I explored the taxonomic diversity and environmental disposition of the methyl-coenzyme M reductase subunit A (mcrA) gene, which is considered diagnostic for methanogenic archaea. The pathway coverage of methanogenesis was then expanded to investigate microbial controls on CH4 cycling in the Eel River Basin (ERB) off the coast of Mendocino, California. The mcrA gene analysis identified dramatic shifts in taxonomic groups across different isolation sources. For instance, while the majority of ANME clades are found in methane seep environments, groups such as the Fen Soil clades, Methanocellales and Methanosarcinales are found mostly in rice field soil, wetland soil and freshwater sediment environments. These distributional patterns appear to be correlated with the availability of methanogenic substrates, as well as the level of competition for these substrates in the environment. For instance, marine sediments are often rich in SO42-, allowing SO42--reducing bacteria (SRB) to outcompete methanogens for H2 and acetate [1]. Therefore, hydrogenotrophic methanogenesis is limited in these environments. High SO42- reduction activity can be coupled with the CH4 oxidation pathway, in which electrons flow between ANME archaea and SRB. In freshwater sediments, where the SO42- concentration is lower than in marine sediments [1], SRB activity is limited, allowing methanogens to utilize the H2 and acetate pools generated from anaerobic organic matter degradation. Consistent with this observation, acetoclastic and hydrogenotrophic methanogenesis account for about 70% and 30% of CH4 production in freshwater sediments, respectively. Similar metabolic trends are observed in rice paddy soils and wetland soils [1].
The analysis of ERB sediments provided specific insights into CH4 cycling under anaerobic conditions. Specifically, a number of genes encoded by uncultivated ANME clades that mediate anaerobic oxidation of methane (AOM), including ANME-1, were detected. Peptides mapping to ANME-affiliated proteins, including MTD, MTR and MCR, were identified
in the PC44 and PC45 core samples recovered from ERB sediments. In the PC44 core samples, the majority of MCR peptides detected were associated with ANME-2, while ANME-1 affiliated MCR peptides dominated the PC45 core. The genes encoding methylene-H4MPT reductase (mer) were under-represented in metagenome sequences recovered from the PC45 core, but MER peptides were detected in the PC44 and PC45 cores. To date, mer has not been associated with the ANME-1 group. Loss of mer activity in ANME-1 has been postulated to increase the activation barrier for the conversion of 5-methylene-H4MPT into 5-methyl-H4MPT, ultimately promoting AOM [6]. Moreover, genes associated with the acetoclastic pathway, including ANME-1 affiliated acd and ack genes, were detected. These observations support a model of acetoclastic CH4 cycling activities catalyzed by microbial communities in the ERB and are broadly consistent with reverse acetoclastic methanogenesis in ANME-1. Another interesting observation is the presence of bacteria-encoded genes related to acetate production. In particular, acd and ack genes affiliated with Bacteroidetes, Chloroflexi and unclassified bacterial groups were detected. Bacterial acetate production has the potential to drive methanogenesis in ERB sediments, resulting in a cryptic CH4 cycle. The role of bacterial groups found in ERB sediments and in trophic ecotypes needs to be further explored using metabolomic and single-cell approaches. It remains puzzling why ANME-1 retains upstream genes in the hydrogenotrophic pathway. Indeed, this raises specific questions regarding the fate of the intermediate compounds produced by these reactions. Hallam and colleagues have posited two possible outcomes [6]. First, the 5-methylene-H4MPT derived from reduced CO2 could become a substrate for assimilatory metabolism via the serine cycle [6, 101].
Alternatively, the C1 transfer module in ANME-1 may be involved in formaldehyde detoxification, analogous to its function in methylotrophic and non-methylotrophic bacteria [6, 102-103]. Expanding the scope of reference data available in the current MLTreeMap workflow may help determine the validity of these possibilities. In conclusion, the MLTreeMap workflow provides a useful tool for conducting global-scale phylogenetic analysis of microbial communities, emphasizing the genomic potential of user-defined marker genes and reactions. Furthermore, my development of reference trees for C1 pathways has significant implications for identifying biomarkers that indicate the type of
   CH4 cycling organisms in the environment with application to monitoring and exploitation of hydrocarbon resource environments.  	

References

1. Liu Y and WB Whitman (2008) Ann. N Y Acad. Sci. 1125, 171-189.
2. RK Thauer et al. (2008) Nat Rev Microbiol. 6, 579-591.
3. GJ MacDonald (1990) Annu. Rev. Energy. 15, 53-83.
4. Isaksen IS and SB Dalsoren (2011) Science. 331, 38-39.
5. RK Thauer (1998) Microbiology. 144, 2377-2406.
6. Hallam SJ, Putnam N, Preston CM et al. (2004) Science. 305, 1457-1462.
7. Boetius A, Ravenschlag K, Schubert CJ et al. (2000) Nature. 407, 623-626.
8. Hallam SJ, Pagé AP, Constan L et al. (2011) Methods Enzymol. 494, 75-90.
9. Orphan VJ, House CH, Hinrichs K et al. (2002) Proc Natl Acad Sci USA. 99, 7663-7668.
10. Knittel K and A Boetius (2009) Annu. Rev. Microbiol. 63, 311-334.
11. J Milucka et al. (2012) Nature. 491, 541-546.
12. A Mascarelli (2009) Nat. Rep. Clim. Change. 3, 46-49.
13. A Milkov (2004) Earth Sci. Rev. 66, 183-197.
14. CD Ruppel (2011) Nat. Educ. Knowledge. 3(10), 29.
15. Hinrichs KU, Hmelo LR and SP Sylva (2003) Science. 299, 1214-1217.
16. KU Hinrichs (2001) Geochem. Geophys. Geosyst. 2, 2000GC00118.
17. Hallam SJ, Girguis PR, Preston CM, Richardson PM and EF Delong (2003) Appl. Environ. Microbiol. 69, 5483-5491.
18. Meyerdierks A, Kube M, Lombardot et al. (2005) Environ Microbiol. 7, 1937-1951.
19. A Meyerdierks et al. (2010) Environ. Microbiol. 12, 422-439.
20. Pernthaler A, Dekas AE, Brown CT et al. (2008) Proc Nat Acad Sci. 105, 7052-7057.
21. Galand PE, Saarnio S, Fritze H and K Yrjälä (2002) FEMS Microbiol. Ecol. 42, 441-449.
22. S Hunger et al. (2011) Appl. Environ. Microbiol. 77, 3773-3785.
23. H Juottonen et al. (2012) Appl. Environ. Microbiol. 78, 6386-6389.
24. Lueders T, Chin K, Conrad R and M Friedrich (2001) Environ. Microbiol. 3, 194-204.
25. Beal MJ, House CH and VJ Orphan (2009) Science. 325, 184-187.
26. M Krüger et al. (2003) Nature. 426, 878-881.
27. G Webster et al. (2011) FEMS Microbiol. Ecol. 77, 248-263.
28. Dang H, Luan X, Zhao X and J Li (2009) Appl. Environ. Microbiol. 75, 2238-2245.
29. Harrison BK, Zhang H, Berelson W and VJ Orphan (2009) Appl. Environ. Microbiol. 75, 1487-1499.
30. T Holler et al. (2011) ISME J. 5, 1946-1956.
31. Lloyd KG, Lapham L and A Teske (2006) Appl. Environ. Microbiol. 72, 7218-7230.
32. Nunoura T, Oida H, Toki T, Ashi J, Takai K and K Horikoshi (2006) FEMS Microbiol. Ecol. 57, 149-157.
33. Stark M, Berger SA, Stamatakis A and C von Mering (2010) BMC Genomics. 11, 461-471.
34. J Handelsman (2004) Microbiol Mol Biol. Rev. 68, 669-685.
35. Tringe SG, von Mering C, Kobayashi A et al. (2005) Science. 308, 554-557.
36. Liolios K, Chen IA, Mavromatis K, Tavernarakis N, Hugenholtz P, Markowitz VM and NC Kyrpides (2009) Nucl Acids Res. 38, D346-D354.
37. Huson DH, Auch AF, Qi J and SC Schuster (2007) Genome Res. 17, 377-386.
38. Huson DH, Mitra S, Ruscheweyh HJ, Weber N and SC Schuster. Genome Res. 21, 1552-1560.
39. Overbeek R, Begley T, Butler RM et al. (2005) Nucl Acids Res. 17, 5691-5702.
40. Kanehisa M and S Goto (2000) Nucl Acids Res. 28, 27-30.
41. Meyer F, Paarmann D, D'Souza M et al. (2008) BMC Bioinformatics. 9, 386-393.
42. Meyer F, Overbeek R and A Rodriguez (2009) Nucl Acids Res. 37, 6643-6654.
43. DeSantis TZ, Hugenholtz P, Larsen N et al. (2006) Appl Env Microbiol. 72, 5069-5072.
44. Maidak BL, Cole JR, Lilburn TG et al. (2001) Nucl Acids Res. 29, 173-174.
45. Wuyts J, Van de Peer Y, Windelmans T and R De Wachter (2002) Nucl Acids Res. 30, 183-185.
46. Wu M and JA Eisen (2008) Genome Biol. 9, R151.1-R151.11.
47. Krause L, Diaz NN, Goesmann A et al. (2008) Nucl Acids Res. 36, 2230-2239.
48. Schreiber F, Bumrich P, Daniel R and P Meinicke (2010) Bioinformatics. 26, 960-961.
49. Felsenstein J (1981) J Mol Evol. 17, 368-376.
50. Felsenstein J (2004) Inferring phylogenies. Sunderland, Mass: Sinauer Assoc.
51. Whelan S, Lió P and N Goldman (2001) Trends Genet. 17, 262-272.
52. Holder M and PO Lewis (2003) Nat Rev. 4, 275-284.
53. S Eddy (1998) Bioinformatics. 14, 755-763.
54. Felsenstein J, Archie J, Day W et al. (1986) The Newick tree format: http://evolution.genetics.washington.edu/phylip/newicktree.html. Retrieved on August 24, 2012.
55. Ciccarelli FD, Doerks T, von Mering C et al. (2006) Science. 311, 1283-1287.
56. Caspi R, Foerster H, Fulcher CA et al. (2008) Nucl Acids Res. 36, D623-D631.
57. Benson DA, Boguski MS, Lipman DJ and Ostell J (1997) Nucl Acids Res. 25, 1-6.
58. Wheeler DL, Barrett T, Benson DA et al. (2005) Nucl Acids Res. 33, D39-D45.
59. Michigan State University (2009) Functional Gene Pipeline/Repository. http://fungene.cme.msu.edu. Retrieved on September 19, 2012.
60. RC Edgar (2004) Nucl Acids Res. 32, 1792-1797.
61. Larkin MA, Blackshields G, Brown NP et al. (2007) Bioinformatics. 23, 2947-2948.
62. A Stamatakis (2006) Bioinformatics. 22, 2688-2690.
63. Whelan S and N Goldman (2001) Mol Biol Evol. 18, 691-699.
64. Birney E, Clamp M and R Durbin (2004) Genome Res. 14, 988-995.
65. J Castresana (2007) Mol Biol Evol. 17, 540-552.
66. R Core Team, R Foundation for Statistical Computing (2012) Vienna, Austria.
67. RC Edgar (2010) Bioinformatics. 26, 2460-2461.
68. Gribskov M, McLachlan AD and D Eisenberg (1987) Proc Natl Acad Sci USA. 84, 4355-4358.
69. A Rambaut (2006) http://tree.bio.ed.ac.uk/software/figtree. Retrieved on September 12, 2012.
70. National Center for Biotechnology Information (2011) Entrez Programming Utilities Help. http://www.ncbi.nlm.nih.gov/books/NBK25501. Retrieved on November 28, 2012.
71. Sakai S et al. (2008) Int. J. Syst. Evol. Microbiol. 58, 929-936.
72. Lazar C, L’Haridon S, Pignet P and L Toffin (2011) Appl. Environ. Microbiol. 77, 3120-3131.
73. Sahl JW, Gary MO, Harris JK and JR Spear (2009) Environ. Microbiol. 13, 226-240.
74. Edmonds JW, Weston NB, Joye SB and MA Moran (2008) Appl. Environ. Microbiol. 74, 1836-1844.
75. Alain K, Holler T, Musat F, Elvert M, Treude T and M Krüger (2006) Environ. Microbiol. 8, 574-590.
76. Castro H, Ogram A and KR Reddy (2004) Appl. Environ. Microbiol. 70, 6559-6568.
77. Frey B, Niklaus PA, Kremer J and S Zimmermann (2011) Appl. Environ. Microbiol. 77, 6060-6068.
78. Milferstedt K, Youngblut ND and RJ Whitaker (2010) ISME J. 4, 764-776.
79. Narihiro T, Hori T, Nagata O, Hoshino T, Yumoto I and Y Kamagata (2011) Biosci. Biotechnol. Biochem. 75, 1727-1734.
80. Smith JM, Castro H and A Ogram (2007) Appl. Environ. Microbiol. 73, 4135-4141.
81. Zhang G, Tian J, Jiang N, Guo X, Wang Y and X Dong (2008) Environ. Microbiol. 10, 1850-1860.
82. C Biderre-Petit et al. (2011) FEMS Microbiol. Ecol. 77, 533-545.
83. Earl J, Hall G, Pickup RW, Ritchie DA and C Edwards (2003) Microb. Ecol. 46, 270-278.
84. Rastogi G, Ranade DR, Yeole TY, Patole MS and YS Shouche (2008) Bioresour. Technol. 99, 5317-5326.
85. Knittel K and A Boetius (2009) Annu Rev Microbiol. 63, 311-334.
86. Konstantinidis KT and JM Tiedje (2005) J Bacteriol. 187, 6258-6264.
87. Knief C, Delmotte N, Chaffron S et al. (2012) ISME J. 6, 1378-1390.
88. Hoff KJ, Lingner T, Meinicke P and M Tech (2009) Nucl Acids Res. 37, W101-W105.
89. Hyatt D, Chen GL, LoCascio PF et al. (2010) BMC Bioinformatics. 11, 119-129.
90. Liu Z, DeSantis TZ, Andersen GL and R Knight (2008) Nucl Acids Res. 36, e120-e130.
91. Caporaso JG, Kuczynski J, Stombaugh J et al. (2010) Nat Methods. 7, 335-336.
92. RK Thauer (2011) Curr Opin Microbiol. 14, 292-299.
93. Goldman AD, Leigh JA and R Samudrala (2009) BMC Evol Biol. 9, 199-210.
94. Inagaki F, Nunoura T, Nakagawa S et al. (2006) Proc Natl Acad Sci USA. 103, 2815-2820.
95. Kormas KA, Smith DA, Edgcomb V and A Teske (2003) FEMS Microbiol Ecol. 45, 115-125.
96. Chistoserdova L, Jenkins C, Kalyuzhnaya MG et al. (2004) Mol Biol Evol. 21, 1234-1241.
97. Schulze-Makuch D, Haque S, Antonio MR et al. (2011) Astrobiology. 11, 241-258.
98. Welander PV and WW Metcalf (2005) Proc Natl Acad Sci USA. 102, 10664-10669.
99. Welander PV and WW Metcalf (2008) J Bacteriol. 190, 1928-1936.
100. G Fournier (2009) Methods Molec Biol. 532, 163-179.
101. S Angelaccio et al. (2003) J Biol Chem. 278, 41789-41797.
102. L Chistoserdova et al. (2000) Microbiology. 146, 233-238.
103. Marx CJ, Miller JA, Chistoserdova L and ME Lidstrom (2004) J Bacteriol. 186, 2173-2178.
Appendices

Appendix A: Results of preliminary MLTreeMap analysis of ERB data
   	
    	
    	
    Appendix B: Updated phylogeny of marker genes detected in the ERB  	
    	
    
