UBC Theses and Dissertations


Precise correlation and metagenomic binning uncovers fine microbial community structure Durno, W. Evan 2017

Precise correlation and metagenomic binning uncovers fine microbial community structure
by
W. Evan Durno
B.A. Mathematics, minor Statistics, The University of British Columbia, 2012
A DISSERTATION SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF
MASTER OF SCIENCE
in
The Faculty of Graduate and Postdoctoral Studies
(Bioinformatics)
THE UNIVERSITY OF BRITISH COLUMBIA
(Vancouver)
July 2017
© W. Evan Durno 2017

Abstract

Bacteria and Archaea represent the invisible majority of living things on Earth with an estimated numerical abundance exceeding $10^{30}$ cells. This estimate surpasses the number of grains of sand on Earth and stars in the known universe. Interdependent microbial communities drive fluxes of matter and energy underlying biogeochemical processes, and provide essential ecosystem functions and services that help create the operating conditions for life. Despite their abundance and functional imperative, the vast majority of microorganisms remain uncultivated in laboratory settings, and therefore remain extremely difficult to study. Recent advances in high-throughput sequencing are opening a multi-omic (DNA and RNA) window to the structure and function of microbial communities providing new insights into coupled biogeochemical cycling and the metabolic problem solving power of otherwise uncultivated microbial dark matter (MDM). These technological advances have created bottlenecks with respect to information processing, and innovative bioinformatics solutions are required to analyze immense biological data sets. This is particularly apparent when dealing with metagenome assembly, population genome binning, and network analysis.

This work investigates combined use of single-cell amplified genomes (SAGs) and metagenomes to more precisely construct population genome bins and evaluates the use of covariance matrix regularization methods to identify putative metabolic interdependencies at the population and community levels of organization.
Applying dimensional reduction with principal components and a Gaussian mixture model to k-mer statistics from SAGs and metagenomes is shown to bin more precisely, and has been implemented as a novel pipeline, SAG Extrapolator (SAGEX). Also, correlation networks derived from small subunit ribosomal RNA gene sequences are shown to be more precisely inferred through regularization with factor analysis models applied via Gaussian copula. SAGEX and regularized correlation are applied toward 368 SAGs and 91 metagenomes, postulating populations’ metabolic capabilities via binning, and constraining interpretations via correlation. The application describes coupled biogeochemical cycling in low-oxygen waters. Use of SAGEX leverages SAGs’ deep taxonomic descriptions and metagenomes’ breadth, produces precise population genome bins, and enables metabolic reconstruction and analysis of population dynamics over time. Regularizing correlation networks overcomes a known analytic bottleneck based in precision limitations.

Lay summary

Prokaryotic microorganisms, including bacteria and archaea, work together to transform the environment on local and global scales. Global scale influences occur through a combination of two important factors: 1) microbial life, while for the most part invisible, exists on truly massive scales, and 2) each microbial cell derives energy from local environmental transformations. The cumulative effects are substantial enough to drive global biogeochemical cycles over billions of years. Despite playing these integral roles, the majority of microorganisms remain uncultivated, rendering them similar to astronomical dark matter. High-throughput genome sequencing approaches are now shining light onto uncultivated microbial diversity and function, creating a number of bioinformatics challenges related to microbial genome assembly, taxonomic binning, and community metabolic network reconstruction.
This thesis contributes toward precise taxonomic binning and correlation network methods, improving our capacity to understand the metabolic linkages between uncultivated microorganisms and biogeochemical cycles in natural and engineered ecosystems.

Preface

While the majority of the work for this thesis was done by the author, W. Evan Durno, his reading on oxygen minimum zone microbial ecology was mostly directed by Alyse K. Hawley, but also by Steven J. Hallam. Both Alyse K. Hawley and Steven J. Hallam made essential contributions toward assisting the author’s interpretation of all data and analyses. All data were taken from a pre-existing project studying the Saanich Inlet oxygen minimum zone.

For the metagenomic binning project (see chapter 2), essential classifier algorithms were largely inspired by the work of Dodsworth et al. [76]. The project motivations were initially imagined by Steven J. Hallam and Jody J. Wright prior to writing Hawley et al. [126]. Connor Morgan-Lang wrote the SAGEX interface. The author designed and wrote the primary software pipeline, and ran the methods comparison experiment. W. Evan Durno, Steven J. Hallam, and Alyse K. Hawley designed experiments which were implemented by W. Evan Durno, Alyse K. Hawley, or Connor Morgan-Lang.

For the SSU rRNA correlation project (see chapter 3), microbial counts were generated and taxonomically annotated by Monica Torres-Beltran. Kai He and Jessica Ngo assisted in the univariate model goodness-of-fit survey under the direction of the author. The author designed this project, implemented nearly all software, and ran all experiments. Bo Chang’s reading group on graphical models with guidance from Harry Joe and Ruben Zamar was invaluable for this project. Ed Gabbott provided the GeForce GTX 980 Ti GPU, which was put to good use.

Copyright permissions

• Figure 1.1 is republished with permission of Current Opinion in Microbiology, from The information science of microbial ecology, A. S. Hahn, K. M. Konwar, S. Louca, N. W. Hanson, and S. J. Hallam, 31, 209-216, 2016; permission conveyed through Copyright Clearance Center, Inc.
• Figure 1.2 is reprinted by permission from Macmillan Publishers Ltd: Nature Reviews Microbiology [271], copyright 2012; permission conveyed through Copyright Clearance Center, Inc.
• Figure 1.3 is reprinted by permission from Macmillan Publishers Ltd: Nature Microbiology [133], copyright 2016; permission conveyed through Copyright Clearance Center, Inc.
• Figure 1.7 is reprinted by permission from the author, Alyse Hawley [125], in accordance with PNAS copyright permissions (http://www.pnas.org/site/aboutpnas/rightperm.xhtml).
• Figure 1.8 is reprinted by permission from the author, Stilianos Louca [167], in accordance with PNAS copyright permissions (http://www.pnas.org/site/aboutpnas/rightperm.xhtml).
• Figure D.1 is reprinted by permission from Macmillan Publishers Ltd: The ISME Journal [267], copyright 2016; permission conveyed through Copyright Clearance Center, Inc.

Table of Contents

Abstract
Lay summary
Preface
Table of Contents
List of Tables
List of Figures
Glossary
  Bioinformatics
  Math and Statistics
  Computation
Acknowledgments
1 Introduction
  1.1 Metagenomics and the network perspective
    1.1.1 Data types
    1.1.2 Information flow and the network abstraction
  1.2 Taxonomic estimation
    1.2.1 Assembly
    1.2.2 Phylogenetic estimation
    1.2.3 QIIME, SSU rRNA data processing
    1.2.4 Metagenomic binning
  1.3 SSU rRNA correlation
    1.3.1 Compositional vs. mixed effect perspective
  1.4 Use case: Saanich Inlet OMZ
    1.4.1 Oceanic nitrogen loss
    1.4.2 Saanich Inlet
    1.4.3 Important taxa
    1.4.4 A conceptual model
    1.4.5 A differential model
  1.5 Math concepts
    1.5.1 Set theory
    1.5.2 Probability
    1.5.3 Gaussian models
  1.6 Statistics concepts
    1.6.1 Estimation
    1.6.2 Regression
    1.6.3 Model selection
    1.6.4 Copula & marginals
    1.6.5 Hypothesis testing & classification
  1.7 Computation
    1.7.1 Numerical calculus
    1.7.2 Non-linear programs
    1.7.3 GPU supercomputing
  1.8 Deliverables
2 Metagenomic binning
  2.1 Introduction
    2.1.1 Definitions
    2.1.2 Software
  2.2 Methods
    2.2.1 SAGs
    2.2.2 SAGEX
    2.2.3 Precision-recall comparisons
    2.2.4 Saanich Inlet
  2.3 Results
    2.3.1 Precision-recall comparisons
    2.3.2 Saanich Inlet
  2.4 Discussion
    2.4.1 Metabolic discovery
    2.4.2 Precise binning
  2.5 Conclusions
3 SSU rRNA correlation
  3.1 Introduction
    3.1.1 The overfit hypothesis
    3.1.2 Precision-recall comparison
    3.1.3 Saanich Inlet
  3.2 Methods
    3.2.1 Multivariate construction
    3.2.2 Marginal model selection
    3.2.3 Full model definition
    3.2.4 Estimation
    3.2.5 Precision-recall comparison
    3.2.6 Saanich Inlet
  3.3 Results
    3.3.1 Marginal model selection
    3.3.2 Compositional vs. mixed effect perspective
    3.3.3 Exploring GPU necessity
    3.3.4 Precision-recall comparison
    3.3.5 Saanich Inlet
  3.4 Discussion
    3.4.1 Precision-recall comparison
    3.4.2 Univariate SSU rRNA models
    3.4.3 Multivariate SSU rRNA models
    3.4.4 Saanich Inlet
    3.4.5 Partial correlations and succinct representation
  3.5 Conclusions
4 Future directions
  4.1 Denitrification in Saanich Inlet
  4.2 Regularization as reduced parameter complexity
  4.3 A more succinct representation
Bibliography
Appendices
A Data-driven argument as a Hidden Markov Model
  A.1 Theoretical argument
  A.2 An example
B CMP variance bound proof
  B.1 Preliminaries
  B.2 Properties of $\lambda_{\mu,\nu}$
  B.3 Properties of $\sigma^2_{\mu,\nu}$
  B.4 Borrowed material
C Precision with imprecise binners
  C.1 Marker gene strategy
  C.2 Common trait strategy
  C.3 Formal arguments
D Miscellaneous
  D.1 Factorial experiment regression summaries
  D.2 Taxa regressed
  D.3 Marginal regression survey results
  D.4 Poor precision-recall exchanges
  D.5 SAGs sequenced
  D.6 SAG decontamination taxa ranges
  D.7 Evaluation levels
  D.8 ESOM R script
  D.9 ESOM U-matrices and bins
  D.10 All binner precision recall statistics

List of Tables

2.1 Binner precision-recall statistics
2.2 CheckM statistics
3.1 AIC score statistics per marginal model
D.1 All PhylopythiaS and SAGEX (classify) precision-recall statistics
D.2 All MaxBin2.0 and SAGEX (cluster) precision-recall statistics
D.3 All ESOM+R precision-recall statistics

List of Figures

1.1 The Central Dogma compared to Information Theory and microbial communities. Image credit: [120]
1.2 Microbial co-occurrences produced from Saanich Inlet, Hawaii Ocean Time-series, Microbial ecology of expanding oxygen minimum zones, and eastern tropical South Pacific OMZs’ SSU rRNA data. Image credit: [271]
1.3 Tree of Life phylogeny estimated with binning. Image credit: [133]
1.4 A covariance structure (A) as implied by a phylogenetic estimate (B)
1.5 Nitrogen loss examples. Taxa will be argued throughout this work.
1.6 Saanich Inlet average chemical concentrations
1.7 A conceptual model of Saanich Inlet denitrification. Image credit: [125]
1.8 An illustration describing a differential model of Saanich Inlet denitrification. Image credit: [167]
1.9 The standard Gaussian curve
1.10 Bivariate Gaussian simulation. Arrows are covariance matrix eigenvectors.
1.11 Data simulated from a Gaussian mixture model
1.12 Illustration of two correlations between (X, Z) and (Y, Z), generating a partial correlation (X, Y|Z)
1.13 The NVidia GeForce 980 Ti GPU
1.14 Blocks of GPU threads
2.1 The first six lines of an example .fasta file
2.2 SAGEX pipeline
2.3 Tetranucleotide signatures are illustrated for various SAGs, an E. coli genome, and a 200 m Saanich Inlet metagenome.
2.4 A SAGEX work flow
2.5 The probability that SUP05 1c recruits nitric oxide reductase drops off over time. SAGEX was run on all pairs of metagenomes and SUP05 1c SAGs, aligned to RefSeq-nr (e-value cut-off: $10^{-3}$), then tested with logistic regression for statistically significant interactions between time and recruitment of denitrification genes. Models control for effects of O2 concentrations and metagenome size.
3.1 A simplified depiction of a poor precision-recall exchange, like those observed in Weiss et al. [267]. See Figure D.1 for actual.
3.2 Average abundances of select chemical concentrations and taxa
3.3 This work’s initial correlation estimates compared to a more robust method
3.4 Histogram of final estimated values $\hat{\nu}$
3.5 Descriptive statistics for the SSU rRNA data set
3.6 Statistically significant parameter values per taxa, testing equality with zero. Each $\beta_x$ describes regressor weight against variable x. For example, $\beta_1$ is the intercept, $\beta_{O_2}$ is the weight against O2 concentration, and so on. The parameters σ, ν and Ψ must always be positive. Majority-positive values for $L_1$ demonstrate the observation of a mixed effect. The lack of significant values for $L_2$ and $L_3$ does not stop their associated covariance matrix $\Sigma = L_{-1} L_{-1}^{T} + \Psi$ from attaining significant values.
3.7 Testing the necessity of GPU acceleration in estimation. Test A shows GPU acceleration is not necessary for general model parameters. Test B shows GPU acceleration is necessary for correctly estimating correlations.
3.8 (A) Precision-recall curves, (B) Expected precisions after beta regression
3.9 All statistically significant correlations
3.10 Statistically significant partial correlations and regressors superimposed over metabolisms. Metabolic relationships reflect both previous interpretations described in subsection 1.4.3 and observations from chapter 2.
4.1 Statistical power decreases as dimension increases for α = 0.05.
4.2 Simplified Tree of Life superimposed with a succinct correlation structure. The red line is a cutting line, which separates the entire tree into clades. Each clade’s correlation structure is dictated entirely by its own tree and clade parameters. Clade parameters are latent random variables with a complete correlation structure. Correlations are illustrated with black and magenta lines. Tree image credit: [130]
A.1 A linearly dependent Hidden Markov Model analogizing an inferential pipeline.
D.1 Precision recall curves for popular 16S correlation techniques (lines) on several models (plots). Image credit: [267]
D.2 SAGs sampled and sequenced (picked). Image credit: Alyse Hawley
D.3 ESOM U-matrices and bins

Glossary

Bioinformatics

ACE: Abundance-based coverage estimator
BLAST: Basic Local Alignment Search Tool
BWA: Burrows-Wheeler Aligner
CONCOCT: Clustering cONtigs with COverage and ComposiTion
DNA: Deoxyribonucleic acid
EBPR: Enhanced biological phosphorus removal
ESOM: Emergent Self Organizing Map
HGT: Horizontal gene transfer
IMG: Integrated Microbial Genomes system
LCA: Lowest common ancestor
MEGAN: MEtaGenome ANalyzer
MGA: Marine Group A or Marinimicrobia
MSA: Multiple sequence alignment
NCBI: National Center for Biotechnology Information
OMZ: Oxygen minimum zone
ORF: Open reading frame
OTU: Operational taxonomic unit
PCR: Polymerase chain reaction
QIIME: Quantitative Insights Into Microbial Ecology (software)
RNA: Ribonucleic acid
RPKM: Reads per kilobase mapped
SAGEX: SAG EXtrapolator
SAG: Single-cell amplified genome
SOM: Self Organizing Map
SSU rRNA: Small subunit of the ribosomal RNA gene

Math and Statistics

AIC: Akaike information criterion
CCA: Canonical Correspondence Analysis
CMP: Conway-Maxwell-Poisson (distribution)
FDR: False discovery rate
FMWC: Floor model with copula
HMM: Hidden Markov Model
LNM: Logistic normal multinomial
MLE: Maximum likelihood estimate
MM: Method of moment estimate
NBGC: Negative binomial with Gaussian copula
PCA: Principal component analysis
PC: Principal component
VST: Variance stabilizing transform

Computation

ACM: Association for Computing Machinery
BFGS: Broyden-Fletcher-Goldfarb-Shanno numerical optimization algorithm
BWT: Burrows-Wheeler transform
CPU: Central processing unit
CUDA: Compute unified device architecture
GPU: Graphics processing unit
LAPACK: Linear Algebra PACKage
MC: Monte Carlo (integration)
RAM: Random access memory

Acknowledgments

This degree would not have been possible without support from my family, especially Agnes. Neither would I have succeeded without the scholarly guidance provided by Alyse K. Hawley.
I’d like to thank Harry Joe and Ruben Zamar for actively guiding me toward resources which gave me excellent results. I would like to thank the Hallam Lab students and staff for their guidance. I’d like to thank Steven J. Hallam, not just for research guidance and resources, but also for guidance in writing which made this document something I can be proud of.

Chapter 1
Introduction

Prokaryotic life is estimated to account for 4-6 × $10^{30}$ cells, and to sequester 350-550 Pg of carbon, comparable to that of all plant life [269]. Because microbes make their living transforming energy and matter, their cumulative transformative effect is truly massive, contributing greatly to Earth’s biogeochemical cycles [85]. Microbial forces not only contribute toward long-term effects in biogeochemical cycling, determining much of the Earth’s chemical history [148, 236] and future [128, 229], but are also applicable in more immediate time-scales [31, 99, 184, 198, 202, 219, 259]. Understanding how these microbial forces work in ecological contexts requires sequencing genetic material directly from the environment (see section 1.1). For studying microbial life’s ecological machine, this work’s inferential mechanism is a two-step, bioinformatic process. First, metabolic capabilities are attributed to taxa. This can be done in a variety of ways (see section 1.2). Second, the breadth of ecological interpretations is constrained through a correlative analysis (see section 1.3). This work makes contributions toward such inferential mechanisms in chapter 2 and chapter 3, and efforts are evaluated with precision-recall curves (see subsection 1.6.5) and through application toward studying denitrification in Saanich Inlet (see section 1.4).
Requisite math, statistics, and computational concepts are described in section 1.5, section 1.6, and section 1.7, respectively.

1.1 Metagenomics and the network perspective

The greatest challenge to understanding microbial forces is caused by the lab itself, because the vast majority of taxa will not grow in the lab. For example, less than 1% of marine microbes will grow on standard agar plates [58, 89, 197]. This means that results concluded from lab-grown microbes and communities are prone to cultivation bias. Fortunately, modern cultivation-independent methods, many of which involve genetic sequencing, exist to exchange the concerns due to the lab bench for the uncertainties of observational experiments. The exchange to reduce bias at the cost of experimental constraints is worthwhile, especially when both perspectives can be combined. The key difference that makes cultivation-independent methods work is that samples are sequenced soon after being taken directly from the environment, with a significantly reduced opportunity for bias-inducing effects to afflict the sample. Genetic sequencing operates like a quickly-taken snapshot, most closely representing the actual community in the environment as it was at sampling. This work largely focuses on three cultivation-independent data types: metagenomes, single-cell amplified genomes (SAGs), and small subunit of the ribosomal RNA gene (SSU rRNA).

1.1.1 Data types

A metagenome is a sample of DNA taken directly from the environment [260]. It has the advantage of describing DNA as it existed in the environment, without bias-inducing, requisite cultivation steps. The DNA itself is not immediately useful and requires further processing. Modern high-throughput sequencing through the Illumina platform [26] translates the DNA into millions or billions of short strings, often 100-200 base pairs (bp) in length. Each string of A,T,C,G characters is sampled randomly from the initial DNA, and strings may overlap one another.
Overlaps are valuable because the small per-base-pair error rate (often below 1%) still accumulates many opportunities for error across the many reads, and overlapping reads provide repeated observations of the same bases. If certain DNA was sequenced more often, it is said to have more (sequencing) depth. Important alternatives to metagenomes include metatranscriptomes and metaproteomes. A metatranscriptome is sequenced from reverse-transcribed, environmentally sampled RNA. A metaproteome is a set of amino acid sequences from environmentally sampled protein.

After sequencing, metagenomes exist as many reads, which are small sub-sequences. To learn what the DNA encodes, a short-read aligner (such as BWA (Burrows-Wheeler Aligner) [160]) can be employed to perform look-ups in a database of known functions or taxa (such as NCBI’s (National Center for Biotechnology Information) RefSeq-nr [254, 268], GreenGenes [71], or SILVA [220]). It is also possible to attempt a reconstruction of the underlying genomes which produced the reads through assembly (see subsection 1.2.1), but the process tends to produce merely longer sub-sequences and can introduce errors. After assembly, several genes may share a single sequence. While some genes may identify genetic function or taxonomy, many sequences cannot have their taxonomy known definitively. This presents a fundamental challenge in Microbial Ecology, because linking function to taxa is such an important goal.

If taxonomy is the focus of a study, all sequencing power can be directed toward sequencing SSU rRNA genes. This is done by amplifying SSU rRNA genes prior to sequencing. After sequencing and processing (see subsection 1.2.3), sequence counts can be used as a proxy for microbial abundances. The analysis of SSU rRNA data is the subject of chapter 3 and further described in section 1.3.

Genetic function and taxa can be more confidently linked with SAGs [29, 249, 253], which are the genomes of individual cells.
SAGs are produced by first sorting cells (perhaps with a microfluidic device [106]), amplifying with multiple displacement amplification (MDA), then sequencing. While contamination is possible, software can be used for quality assurance [207], though it is not yet proven to provide perfect results. The result is a relatively confident description of functional and taxonomic links. SAG collections are growing large [225]. For example, in chapter 2, 368 SAGs are bioinformatically studied. However, SAGs remain focused descriptions of relatively few organisms and cannot meet the descriptive breadth achievable with metagenomes and SSU rRNA data. Therefore a combined approach is often motivated.

1.1.2 Information flow and the network abstraction

Figure 1.1: The Central Dogma compared to Information Theory and microbial communities. Image credit: [120]

The Central Dogma of Molecular Biology [64] is an abstraction which describes how information encoded in the genetic alphabet ({A,T,C,G}) is transcribed to RNA and then translated to protein. This direction of information flow is assumed in this work, but does not hold in general. Effectively, it provides a permanent caveat for all genomic analyses: genetic potential does not ensure expression. So if a gene is found, it is not certain to be used. It might be argued that in resource-starved environments, microbial life can hardly spare wasted genetic material, but ultimately transcriptomics (RNA) is more reliable, and proteomics even more so. While describing a likely future of cloud-based bioinformatic analysis, Hahn et al. [120] described the information flow in the Central Dogma (Figure 1.1 (b)) and related information warehousing (Figure 1.1 (a)) with Claude Shannon’s Theory of Communication [239], and also likened microbes to information processors acting in larger networks (Figure 1.1 (c)), metabolically processing their world as a communal machine.
This work adopts such a network perspective through genetically-described correlation networks and compares it to previous interpretations [125, 167] (see subsection 3.4.4).

Working toward a better understanding of microbial-mediated environmental transformation, it is pragmatic to digest data and communicate understanding with ecological models. Often models are conceptual [125, 198], other times they are differential [20, 27, 179, 187], but all these models are coherently abstractable to entities and their interactions. Taking entities as nodes and interactions as edges, there is a network representation of biological interaction [22, 137]. These network representations faced some controversy in their early years [4, 78], but the scientific community has since adopted a necessary respect for the possible breadth of interpretation [267]. Modern genomic data now provides many opportunities to survey general community behaviour on a grand scale. For example, Figure 1.2 is a co-occurrence network generated from globally-sampled data. These network abstractions allow large-scale analysis and interpretation [87, 98].

The idealization of conceptual and differential ecological models into network abstractions provides a perspective of deeply complex entities that is seductively simple. For example, it is easy to forget that a correlation network only comments on data covariation, and it might be perceived as similar to a metabolic network. The perspectives are dependent but not equivalent. Worse yet, the automatic generation of networks through correlation may require great familiarity with an inferential model to properly interpret results, especially in cases of model failures which result in artefactual expressions. The greatest challenge in SSU rRNA correlation methods is confident edge calling.
Popular methods have been shown to have low precision in successfully detecting correlations [267]. Doubt in network edges invalidates claims of interaction, preventing confident estimation of correlation networks. Low edge-calling precision denies network application outside of any graph summary statistics which might be robust to this. This work shows how edge-calling precision can be achieved in chapter 3.

Figure 1.2: Microbial co-occurrences produced from Saanich Inlet, Hawaii Ocean Time-series, Microbial ecology of expanding oxygen minimum zones, and eastern tropical South Pacific OMZs' SSU rRNA data. Image credit: [271]

This work further concretizes its network abstraction by describing its nodes (chosen taxa) genomically. The task of assigning genomic sequences and their functions to taxonomies is a popular bioinformatic problem in cultivation-independent analysis and thus Microbial Ecology. In section 1.2, methods for attributing taxa to genomic function are described, where the concept of metagenomic binning is developed. The process is naturally tedious and error-prone, motivating pragmatism, and thus clear descriptions of confidence are desired. In chapter 2, a fundamental pragmatism-quality exchange is explored through precision-recall analyses amongst select binning strategies, and new strategies are described which allow descriptive frontiers to be expanded. Ultimately, a careful comparison and measurement of binning methodologies describes how pragmatism and precision may be exchanged, and binning is generally described as more accurate when attributing taxa nearer to the phylogenetic root.

This work adopts the network model perspective of oxygen minimum zone (OMZ) microbial ecology, and the majority of its emphasis is on precise network inference. This thesis targets both the nodes and edges of these networks.
First, concerns about binning [240] are met by showing that SAG-guided binning can improve precision, thus better describing the network's nodes. Second, this work addresses concerns about imprecise correlation detection [267] by using an appropriate statistical model, thus better describing the network's edges. Ultimately, the data-driven argument made with correlation networks is made more credible through better nodes and better edges.

1.2 Taxonomic estimation

The task of understanding how microbes cycle energy and nutrients is inseparable from evolutionary dependencies. For the sake of this work, it is sufficient to define a phylogeny as a particular evolutionary history or tree. It is also sufficient to define a taxon (plural: taxa) as a clade or entire branch of a phylogenetic tree, and taxonomy is the science of defining or classifying organisms into taxa. An example of an estimated phylogenetic tree is in Figure 1.3.

Figure 1.3: Tree of Life phylogeny estimated with binning. Image credit: [133]

1.2.1 Assembly

Metagenomic sequences may be assigned a sense of phylogenetic similarity by attempting to reconstruct the originating genetic sequence. If in fact several metagenomic sequences do originate from a single strand of DNA, then they certainly share a single taxonomic source. A sequence assembled from overlapping DNA is a contig (as in contiguous). Reconstructing originating genetic material is difficult, because genetic sequencing technology is only able to read DNA in pieces. For example, the popular Illumina [26] platform may translate DNA into 150 base pair (bp)-length reads in a summary file several gigabytes in size. Therefore modern assembly is a computational problem with necessary approximations and algorithmic solutions [195].
It is important to note that the assembly of metagenomes increases the risk of joining sequences of different taxonomy (chimeras) [35], and also raises philosophical questions such as: what is a genome?

The intuitive algorithm of searching for similarity at sequences' ends, overlapping, and walking through the genome [194] is the overlap-layout-consensus (OLC) algorithm, and while still relevant, it is computationally intractable for many sequences. An alternative for generating larger contigs from many small reads is the de Bruijn graph approach [215, 280]. De Bruijn graphs are abstract data structures, concretized in both obvious and clever ways. These data structures can still be large, perhaps motivating distribution over a computer network [244]. A succinct de Bruijn graph [33] is a modern reinterpretation inspired by the Burrows-Wheeler Transform (BWT) [43], and decreases memory requirements drastically [159].

1.2.2 Phylogenetic estimation

Bootstrapped phylogenetic tree estimation

A phylogenetic tree can be estimated for a set of genetic sequences. The genetic sequences must be similar enough to compare (perhaps encoding the same protein), and need to exist across the breadth of the tree; otherwise only individual clades would be estimable, and they would be incomparable. Genes which are known to exist ubiquitously are marker genes. An example of an important marker gene is the SSU rRNA gene [155], because it has both identifiable, conserved regions and known hyper-variable regions. A common approach to phylogenetic estimation is to first bioinformatically process marker gene sequences with multiple sequence alignment (MSA) [72, 80], which lines up genes as closely as possible. Then, sequences may be processed through an evolutionary model to produce a phylogenetic tree [248]. The comparison of ubiquitous marker genes makes this possible [207, 240].

There is variance in the estimation of phylogenetic trees.
It is fair to ask "is this clade real?" or "is my result merely due to chance?" Statistical methods largely exist to extract signals from noisy data, and despite the complexity of phylogenetic estimation, there is a method for describing clade confidence: the bootstrap. Bootstrapping phylogenetic trees [84, 88, 132] requires randomly re-sampling data and estimating a tree for each sample. If a clade exists in many estimates, it is unlikely to be an artefact of sampling's natural variation. It is important to note that phylogenetic tree estimates have a typical error profile: confidence increases near the root [204]. For example, it is easier to attribute a phylum than a species.

Phylogenetics for statisticians

Having described estimation methods, it is fair to ask what was being estimated. Different underlying models are used per context depending on how useful they are, and often go unstated. A framework which forces explicit model definition is phylogenetic regression [178], where hypothesis testing can be used to compare the likelihood of different phylogenetic structures. Using the tree to constrain Brownian motion or Ornstein-Uhlenbeck drift, a covariance structure can be defined (via precision matrices) [224] as illustrated in Figure 1.4.

Figure 1.4: A covariance structure (A) as implied by a phylogenetic estimate (B)

1.2.3 QIIME, SSU rRNA data processing

QIIME (Quantitative Insights Into Microbial Ecology, pronounced 'chime') [51] is a bioinformatic pipeline used to ready raw genomic sequencing data for statistical analysis. In this work, QIIME is used to prepare SSU rRNA sequence data for multivariate regression analysis. An early step in this QIIME workflow is demultiplexing, the sorting of genetic sequences into originating samples based on prepended nucleic tags on every sequence. Second, QIIME employs UCLUST [82] to approximately cluster SSU rRNA sequences in linear time [81].
For this work, SSU rRNA gene sequences must be at least 97% identical to share a cluster (percent-identity is an adjustable parameter). Third, cluster representatives are aligned to a taxonomy database. This work uses GreenGenes [71], but SILVA [220] is also popular. After counting the number of sequences recruited to each cluster, data are summarized in a tabular format. An example of this sort of data follows.

          taxon1  taxon2  taxon3  ...  taxonp
sample1        0       0       1  ...       0
sample2       12       4     521  ...      91
sample3     1642    1373    1209  ...    1031
...          ...     ...     ...  ...     ...
samplen        1       3       2  ...       2

While these data provide valuable descriptions of a community's taxa [59], they come with some statistical challenges. First, they suffer a common problem in Bioinformatics: they often have thousands of dimensions (taxa) and relatively few samples. For example, using SSU rRNA data to survey for inter-taxa correlations has a high error rate [267]. Second, sequencing depth is variable per sample, which causes statistical dependence between taxa, obfuscating meaningful signals; biological interpretation is hidden behind methodological complications. These statistical complications are addressed in chapter 3.

1.2.4 Metagenomic binning

In subsection 1.1.1, metagenomes were described as having vast descriptive potential while often suffering a decoupling of function and taxonomy. Knowing which organisms are performing which functions is a central concern of any ecological field, including microbial ecology. Attempts at linking taxonomy and function are often made through a combination genomic sequence

Not all genetic material encodes marker genes or can be further assembled. Further strategies fall into the broad category of metagenomic binning. For this work, it is sufficient to define metagenomic binning or binning as any attempt to decide that genetic sequences are phylogenetically similar. Binning produces bins, which are collections of sequences. A binner is a tool which assists or automates binning.
A binner which produces bins with a taxonomic label is a classifying binner, otherwise it is a clustering binner. Binning strategies are often automated and work with assembly software [93, 234, 261].

Metagenomic binning has been reinterpreted over time [174, 180, 234], so generally applicable definitions will now be provided. A general trend has been toward pragmatic strategies and away from precise methods. The earliest software specifically for metagenomic binning includes TETRA [256] and Phylopythia [182]. TETRA is best described as a clustering method whereas Phylopythia is a classifier. Self-Organizing Maps (SOMs) were used as a clustering binner [1], despite originally being designed for data visualization and dimensional reduction [147]. A popular emergent SOM (ESOM; designed for cases with more nodes than data) software source is available from http://databionic-esom.sourceforge.net/ [263]. Another clustering method is MaxBin 2.0 [276], and another classifying method is MEGAN's (Metagenome Analyzer) LCA (lowest common ancestor) algorithm [135]. Many binners use tetranucleotide frequencies [255]. By 2012, a variety of both classifying and clustering binners had come into existence [174]. Recently the field has been re-interpreting binning as exclusively clustering methods [234]. Clustering methods have been used to inform on symbiotic relationships [40], genome isolations [65], and the tree of life [133]. A classic example of a binning application is an extraction of genomes from an acidophilic biofilm [261].

Since there are so many binning tools, it is convenient to organize them by strategy. Some binners may only require a metagenome to operate (consider ESOMs [1, 147, 263]), and will leverage kmer (pronounced 'Kay-mer') and coverage statistics. A kmer statistic is a vector of counts recording, for a given genetic sequence, how many times the sequence contains each unique subsequence of length k (k = 4 is popular, in which case the statistic is also called a tetranucleotide frequency).
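As a minimal sketch of the kmer statistic just defined (not taken from any particular binner), the following counts every length-k substring of a sequence over the alphabet {A, C, G, T}. The function name `kmer_counts` and the toy sequence are illustrative assumptions; practical binners often also merge reverse complements and normalize the counts, both of which are omitted here.

```python
from itertools import product


def kmer_counts(sequence, k=4):
    """Return a dict mapping each of the 4**k possible kmers to its count."""
    # Enumerate all kmers up front so every sequence yields a vector
    # with the same entries, including kmers that never occur.
    counts = {"".join(kmer): 0 for kmer in product("ACGT", repeat=k)}
    sequence = sequence.upper()
    # Slide a window of width k along the sequence, one base at a time.
    for i in range(len(sequence) - k + 1):
        window = sequence[i:i + k]
        # Windows containing ambiguous bases (such as N) are skipped.
        if window in counts:
            counts[window] += 1
    return counts


# Hypothetical toy contig, for illustration only.
tetra = kmer_counts("AAAATAAAA", k=4)
```

For k = 4 the returned dictionary always has 256 entries, so the values for different sequences can be compared entry by entry as vectors.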
So a kmer statistic will describe how many times a sequence contains the substrings AAAA, AAAT, AAAC, AAAG, AATA, and so on. A tetranucleotide statistic is a vector of 256 = 4^4 counts. A coverage statistic is a description of how much DNA from a sample belongs to a sequence. This is necessary because assembly algorithms will collapse many reads into a single sequence. RPKM (reads per kilobase per million mapped reads) [191] is a popular coverage statistic. Other binners will also use additional information, such as a marker gene database [275]. Marker genes are conserved genes with a known phylogenetic distribution. An example of one such binner is MaxBin 2.0 [276]. Other binners may further use aggregate genome collections such as the Integrated Microbial Genomes (IMG) [176] system and NCBI's [268] RefSeq [254]. Phylopythia is one such binner [111, 182, 208]. Use of additional data resources is important for quality binning, but also narrows a binner's applicable scope.

Evaluation of binning products is a topic of debate amongst microbial ecologists. A popular approach has been to study marker gene distribution within bins [5, 225, 240]. This approach has been automated [207] and will likely be included in future genomic standards [90, 91]. This work shows that choice of binning strategy is an important factor in determining binning quality.

A modern concern for binning is that error and incompleteness are inconsistently described and understood [55, 175, 240]. In chapter 2, error profiles for binning strategies are analyzed for trends, contributing toward a better understanding of binning error.

1.3 SSU rRNA correlation

Characterizations of ecological diversity are long-sought and have various perspectives.
While α, β, and γ-diversity concepts [13] emphasize diversity at the sample level or higher, and have employed statistics like Chao1 [56], ACE (abundance-based coverage estimator) [134], Sørensen, Jaccard, and Bray-Curtis [36] amongst others, genomic sequencing has allowed exploration of a high-resolution frontier. Correlating sequence counts describes taxa interactions and holds weight in ecological models [87]. Sophisticated modern correlation approaches in microbial ecology include Local Similarity Analysis [79, 231, 278], random matrix theory-based [70], hypergeometric sampling-based [54], maximal information coefficient [223], SparCC [95], and others [157]. Simple techniques such as co-presence and mutual exclusion [162], Pearson, and Spearman correlation are also popular.

An important recent discovery is that modern correlation techniques in microbial ecology have poor precision-recall exchanges [267], meaning that graph edges are often incorrectly attributed. This is a clear concern, motivating further research, because it casts doubt over the results of correlation network-based methods. This work contributes to solving this problem in chapter 3. While low edge-calling precision should increase spurious edges in graphs, it might not disturb certain graph structures in larger graphs. The current applied potential of existing methods can be seen in studies of the soil microbiome relationships with carbon dioxide (CO2) [281], human gut microbiome [14] relationships with genetics [107] and disease [109], and poultry [201]. Considering these achievements highlights the fact that precise edge-calling is a frontier for Microbial Ecology. Precision in individual correlation inference is valuable, because it would confidently inform on interactions between individual pairs of taxa.

Relevant multivariate regression approaches exist for SSU rRNA data, particularly Canonical Correspondence Analysis (CCA) [258]. The fundamental challenge in analyzing SSU rRNA data is high-dimensionality.
There are far more measured taxa than samples taken. The solution for CCA is similar to the approaches in chapter 3; both are a form of dimensional reduction. In CCA, dimensional reduction is accomplished by regressing against the primary axis of variation (an eigenvector). In chapter 3, precision is increased by constraining the breadth of possible correlation matrices. In this way, CCA is the spiritual predecessor of methods proposed in chapter 3.

Modern correlation networks in microbial ecology are usually generated with SSU rRNA data. SSU rRNA counts are an attempt to indirectly count taxa abundances per sample. The technical aspect of generating SSU rRNA data is elaborated upon in an example in subsection 1.1.1 and subsection 1.2.3. The effective data is thus a list of non-negative integer counts per microbial taxon for each sample. Often many samples are gathered, and the resulting data product is a non-negative integer matrix. This count matrix may be further processed before network production with a previously mentioned correlation tool. SSU rRNA data is often sampled along with environmental or experimental data (regressors), which may be used to infer meaningful changes in community structure.

The indirect observation of taxa counts through SSU rRNA data carries certain caveats. The primary caveat is that amplification and sequencing can induce a general positive correlative effect between all taxa. Effectively, every sample's counts will be shifted up or down together, thus obfuscating the true values. A popular solution has been to convert each sample into a list of proportions or relative abundances. Unfortunately, ratios inherently induce a general negative correlative effect between all taxa. While converting to proportions has enabled plenty of insight, proportions' inherent negative correlative dependencies can cause obfuscation as well, thus better corrections are motivated.
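The two opposing effects described above can be illustrated with a small simulation. This is a sketch under stated assumptions rather than an analysis of real data: the three taxa with independent, uniformly distributed true abundances and the per-sample depth multiplier are hypothetical choices, and the helper `pearson` is defined here only to keep the example self-contained.

```python
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def column(rows, j):
    """Extract column j from a list of rows."""
    return [row[j] for row in rows]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
n_samples, n_taxa = 2000, 3

true_abund, observed, proportions = [], [], []
for _ in range(n_samples):
    # Assumption: true abundances are independent across taxa.
    a = [rng.uniform(50, 150) for _ in range(n_taxa)]
    # Assumption: one depth multiplier scales every taxon in a sample.
    depth = rng.uniform(0.5, 2.0)
    y = [depth * v for v in a]
    total = sum(y)
    true_abund.append(a)
    observed.append(y)
    proportions.append([v / total for v in y])

r_true = pearson(column(true_abund, 0), column(true_abund, 1))
r_observed = pearson(column(observed, 0), column(observed, 1))
r_proportion = pearson(column(proportions, 0), column(proportions, 1))
```

Under these assumptions, `r_true` should sit near zero, `r_observed` should be clearly positive because the shared depth multiplier moves both taxa together, and `r_proportion` should be clearly negative because the proportions are forced to sum to one.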
The best modern solution has been to estimate the effect as an unmeasured regressor in a count model [185]. While unmeasured, this regressor is not treated as a mixed effect, because an estimate (generated from the count matrix) is instead used as an observed regressor. Another approach is to apply variance stabilizing transforms (VSTs), as described by McMurdie and Holmes [185].

1.3.1 Compositional vs. mixed effect perspective

The compositional and mixed effect perspectives of SSU rRNA modelling are the products of independent thought, but have elegantly produced a false dichotomy. Under the compositional perspective, it is imagined that taxa's SSU rRNA counts compete for sequencing depth, and thus tend to be negatively correlated. Under the mixed effect perspective, it is imagined that taxa's SSU rRNA counts are all driven up or down together with overall sequencing depth, and thus tend to be positively correlated. The idea that counts might bias toward positive or negative correlation with each other is the false dichotomy, because it is only imposed by the models used to understand this data. Of course, it is possible that either perspective might be more or less relevant dependent on unknown circumstances. The compositional perspective has been realized in the study of correlation networks [95, 277], while the mixed effect perspective is the result of univariate regression surveys [168, 185, 226, 228]. In chapter 3, this work explores SSU rRNA data with multivariate regression, which brings these two perspectives together in a surprisingly meaningful way (see subsection 3.3.2).

While the compositional effect merely indicates a tendency toward negative correlation, the mixed effect perspective borrows from the broad category of mixed effect linear models [183]. In regression analyses (see subsection 1.6.2), a mixed effect is an unobserved random variable which obfuscates statistical signals, but can be dealt with through modelling the effect.
In RNA-seq and SSU rRNA regression, the concept is generalized to an unobserved random effect shared over all measurements in a single observation. The effect is usually attributed to sequencing depth effects. For example, if (Y1, Y2) are taxa counts, X is environmental measurements, (ε1, ε2) are observational errors, and M is the mixed effect, then the following equations share M as a mixed effect. These kinds of equations are described in section 1.5.2.

    Y1 = µ1 + Xβ1 + Mγ1 + ε1,    Y2 = µ2 + Xβ2 + Mγ2 + ε2

1.4 Use case: Saanich Inlet OMZ

1.4.1 Oceanic nitrogen loss

An oxygen minimum zone (OMZ) is a subsurface body of anoxic water (< 20 µM O2/kg). Under this definition, OMZs account for 7% of the ocean by volume [209]. Oceanic oxygen (O2) concentrations have recently decreased rapidly in the north east Pacific OMZ [143], and OMZs are generally expected to expand as the planet warms [250]. OMZs play a disproportionately large role in oceanic denitrification, and because nitrogen is a limiting nutrient, excessive nitrogen loss limits the ocean's ability to sequester atmospheric CO2 [61]. These facts combined with the clear correlation between CO2 and global warming [229] motivate understanding. It is important that global warming's tipping points [158] be identified. The microbial component of the global nitrogen cycle is both substantial and not fully understood [50, 154, 216, 262]. It is clear that the anoxic conditions favour chemolithotrophic energy metabolism [77], and these alternative metabolic strategies produce microbial ecological networks resulting in oceanic nitrogen loss [271].

Figure 1.5: Nitrogen loss examples. Taxa will be argued throughout this work. [Panels: Nitrification (NH4+ → NO2−); Denitrification (NO3− → NO2− → NO → N2O → N2); Anammox (NH4+ + NO2− → N2 + H2O)]

Nitrogen gas (N2) is abundant in the atmosphere, very stable, and nearly biologically inaccessible.
When N2 is produced, it has effectively left the biotic realm of the nitrogen cycle. Biologically accessible nitrogen is produced during nitrogen fixation (N2 → NH3), an energetically costly transformation. Diazotrophs, which include many cyanobacteria, are nitrogen-fixing microorganisms. Microbially mediated nitrification causes accumulations of nitrate (NO3−) in the ocean. Because OMZs favour chemolithotrophy, nitrate and ammonium are used as an energetic resource. Anammox (anaerobic ammonium oxidation) and denitrification are the primary avenues of oceanic nitrogen loss [19] as exemplified in Figure 1.5, with anammox accounting for 30–50% of oceanic N2 production [73].

1.4.2 Saanich Inlet

Saanich Inlet is a seasonally anoxic fjord on the coast of Vancouver Island, British Columbia, Canada [12]. Most of the year, the fjord has an anoxic basin. In the later summer and early fall, oxic waters flow into the basin, renewing oxygenation. The inlet is part of the Line P transect, an oceanographic time series [212]. From an OMZ research perspective, Saanich Inlet provides an opportunity to study microbial ecology and nutrient cycling in OMZs [266, 271], particularly under oxic-anoxic shifts [279]. The Saanich Inlet time series is a valuable collection of data including both chemical and genomic measurements, and also a large collection of sequenced genomic data including 91 metagenomes, 368 SAGs [230], and 298 pyrotag SSU rRNA samples [127]. In this work, all metagenomes, all SAGs, and 112 SSU rRNA samples are used. The SAGs were sampled in August 2010 along with four other metagenomes.

The environmental and SSU rRNA data sampling scheme featured a time series spanning from 2006 to 2011, but this study will only use 90 samples from 2008 onwards. This is because target variables (concentrations of oxygen, nitrate (NO3−), and hydrogen sulfide (H2S)) have missing measurements outside of this range, and model-based interpolations are eventually computationally intractable.
A distribution of average chemical concentrations is shown in Figure 1.6, and it can be seen that the OMZ can be described in distinct sections, such as upper and lower oxycline, S/N transition zone, and sulfidic zone [125]. A common concern about this data set is a lack of technical or biological replication, since samples were rarely taken in duplicate. However, intelligent model selection and problem phrasing overcomes these irregularities and concerns. This work will eventually study this data through multivariate regression (see chapter 3), which treats each sample's SSU rRNA counts Yi as conditionally distributed given the environmental variables Xi. The 90 conditional variables (Yi|Xi) are each replicates. The only constraint on the data is that it is from an observational study, which is now a common caveat for microbial ecologists due to environmental sampling. Observational studies often cannot explore a full combination nor range of variable values. However, the Saanich Inlet OMZ time series remains a natural experiment describing oxic-anoxic transitions.

Figure 1.6: Saanich Inlet average chemical concentrations. [Axes: concentration (0 to 200 µM) against depth (0 to 200 m) for O2, NO3−, and H2S]

In this work, Saanich Inlet OMZ data will be used to demonstrate how precise network inference methods reproduce known understanding and provide novel perspectives on microbially mediated denitrification in the Saanich Inlet OMZ. Microbially mediated denitrification, which is closely related to oceanic nitrogen loss, is a chosen focus from many possible topics. The conceptual model posed by Hawley et al. [125] describes this process in detail (see Figure 1.7), and will often be used as the primary argumentative basis when interpreting new evidence.

1.4.3 Important taxa

SUP05 (Candidatus Thioglobus autotrophicus)

SUP05 gammaproteobacteria are a major participant in the OMZ denitrification pipeline. Following first observation [252], an early metagenomic analysis by Walsh et al.
[266] used SSU rRNA data to show that SUP05 was often a dominant population in several OMZs, not just Saanich Inlet [279] but also the ESP and Namibia OMZs. The work also provides early insights into SUP05's metabolic capacity through assembly of metagenomic fosmids. This SUP05 metagenome encodes the metabolic potential for autotrophic carbon fixation of CO2, dissimilatory sulfite reductase (SO3^2− → SO^2−), thereby suggesting anaerobic respiration of sulfur, and incomplete nitrate reduction genes (NO3− → N2O). These metabolic capabilities describe SUP05 as a sulfide-driven partial denitrifier. These findings are strictly genomic, and therefore invite scrutiny. Fortunately, the later proteomic analysis of Saanich Inlet's OMZ by Hawley et al. [125] further corroborated this perspective, while also emphasizing that sulfur reduction was driving SUP05's carbon fixation. This narrative was confirmed by the work of Shah et al. [237], with SUP05's cultivation. Cultivation enabled on-the-bench (in vitro) manipulation and measurements of microbial activity, and demonstrated that sulfur was essential to SUP05's growth, that growth was increased through nitrate reduction (NO3− → NO2−), and further increased in the presence of NH4+. The discovery of SUP05's reliance on sulfur reduction is important, because SUP05 is often abundant in OMZs' non-sulfidic zones. Shah et al. remarked that this suggests an ecological hypothesis: that SUP05 is somehow reliant on a sulfur oxidizer.

Marinimicrobia (Marine Group A, MGA, or SAR406)

The Marinimicrobia phylum is an important accomplice in OMZ denitrification. Following early observation in diversity surveys [96, 97], catalyzed reporter deposition fluorescence in situ hybridization and SSU rRNA analysis demonstrated [6] that Marinimicrobia is abundant in the NESAP OMZ, comprising 0.3–2.4% of total bacterial sequences, and increases to as high as 11.0% under O2 deficient conditions.
Statistically significant Spearman correlation statistics show Marinimicrobia negatively correlating with O2. A later analysis [272] combined NESAP and Saanich Inlet data sets, and provided genomic insights through the inspection of 46 fosmid libraries. Genomic information suggests a sulfur-based energy metabolism, particularly suggesting dissimilatory sulfur oxidation. Marinimicrobia has also been implicated in syntrophic reactions in methanogenic bioreactors [199].

Planctomycetes

The Planctomycetes phylum is known to harbour anammox, an integral process in oceanic nitrogen loss [73]. The existence of an anammox-harbouring clade of Planctomycetes was first identified in a bioreactor designed to remove ammonia from waste water [251]. Anammox abbreviates anaerobic ammonium oxidation, and converts ammonium to nitrogen (NH4+ + NO2− → N2 + H2O), thereby making the nitrogen largely biologically inaccessible. Anammox has been implicated in the Black Sea [151] and Costa Rica [66] to account for between 28% [60] and 48% [189] of oceanic nitrogen loss.

Thaumarchaeota

The Thaumarchaeota are an extremely abundant archaeal phylum, making up as much as 20% of all picoplankton (plankton sized 0.2–2.0 µm) [141]. Thaumarchaeota conduct ammonia oxidation (NH4+ → NO2−), though they likely use alternative energetic strategies under ammonia-poor conditions [214]. Ammonia oxidation is important because it makes fixed nitrogen accessible to denitrification.

Nitrospira & Nitrospina

Nitrate (NO3−) accounts for 88% of fixed marine nitrogen [116], and the only known biological nitrate-forming reaction is nitrite oxidation (NO2− → NO3−) [170]. Nitrospina and Nitrospira are nitrite oxidizers. An enrichment culture of Nitrospira sampled from a sponge was observed to convert NO2− to NO3− [203].
An early Nitrospira metagenome sampled from an activated sludge enrichment culture corroborated this result, encoding genetic capacity for nitrite oxidation and CO2 fixation while lacking classic defense mechanisms against oxidative stress [169]. While the lack of genes in a metagenome is notable, it is a poor argument for an authentic lack of genetic material. Though of different phyla, a cultivated Nitrospira genome [170] suggests participation in a similar nitrite oxidizing niche. Cultivation confirmed the lack of genes encoding coping mechanisms for oxidative stress, explaining the anaerobic nature of these organisms.

SAR324 (Marine Group B)

SAR324 are deltaproteobacteria common to the dark ocean, particularly in OMZs [41, 96, 271, 273]. Single-cell amplified genomes sampled from the Atlantic ocean encoded the capacity for C1-metabolism, sulfur oxidation, and a particle-associated life-style [253]. An ESOM binning [74] experiment concluded that SAR324 might also harbour nitrite reductase (NO2− → NO) [241]. A later MetaBAT [140] binning experiment (CheckM statistics [207]: > 96% complete, 0% contamination) would conclude that SAR324 does not harbour nitrite reductase, but does participate in sulfur oxidation [123].

Sulfurimonas gotlandica & Arcobacteraceae

Sulfurimonas gotlandica is an epsilonproteobacterium with a cultured representative, strain GD1 [115], from the Baltic Sea OMZ. A sub-clade, Sulfurimonas GD17, is common to OMZs [113] and is phylogenetically similar to Arcobacteraceae. Epsilonproteobacteria are common to the Baltic Sea, African shelf, and Black Sea OMZs, representing up to 25% of all prokaryotic cells [38, 47, 113–115, 152, 156, 163]. In vitro experiments have shown Sulfurimonas gotlandica to facilitate sulfide-oxidizing (S^2− → SO4^2−) complete denitrification (NO3− → N2).

1.4.4 A conceptual model

In the work of Hawley et al.
[125], a metaproteomic analysis of Saanich Inlet is conducted. Normalized spectral abundance factor (NSAF) values are used to describe how metabolic activity occurs in relation to the oxygen gradient. Whenever possible, protein expression is attributed to taxa. This information is digested into a conceptual model illustrated in Figure 1.7. The model highlights many of the taxa described in the previous section, subsection 1.4.3. A hypothesis, that SUP05 produces NH₄⁺, is illustrated via "?NH₄⁺" and was later refuted by SUP05's cultivation and subsequent growth experiments [237], which showed that SUP05 cultures can grow by consuming NH₄⁺. A nitrogen loss mechanism is clearly observed through attributing anammox behaviour to Planctomycetes, but the final denitrifying enzyme nosZ (N₂O → N₂) goes taxonomically unattributed despite being observed in the sulfidic zone. As this conceptual model is a description of entities and their interactions, this work considers it a network perspective of Saanich Inlet microbial ecology, and it will be revisited.

Figure 1.7: A conceptual model of Saanich Inlet denitrification. Image credit [125]

1.4.5 A differential model

In the work of Louca et al. [167], a multi-omic analysis integrates DNA, mRNA, and protein with environmental measurements of O₂, NO₃⁻, and H₂S concentrations to inform on Saanich Inlet's denitrifying community. The work is done with a philosophical twist, and considers genes independently of the taxa. The dynamics of gene abundance with the environment are modelled with a system of differential equations. The perspective, despite being independent of taxa, fits so well with the previous conceptual model that an illustration of the differential system, Figure 1.8, includes taxa. The method is applied toward estimating variables like PDNO (partial denitrification to nitrous oxide) gene abundance through a method that is very different from statistical theory.
Instead of maximum likelihood estimates, the differential system is run to steady state (it converges), then variables are measured. After cultivating SUP05, Shah et al. demonstrated the genomic potential for SUP05 to work with a sulfur reducer. This differential model supports that hypothesis, but also an interpretation of sulfide-driven denitrification. As this differential model is a description of entities and their interactions, this work considers it a network perspective of Saanich Inlet microbial ecology, and it will be revisited.

Figure 1.8: An illustration describing a differential model of Saanich Inlet denitrification. Image credit [167]

1.5 Math concepts

1.5.1 Set theory

A set $S$ is a collection of distinct elements. If $x$ is an element of set $S$, then it is written $x \in S$. If $y$ is not an element of $S$, then it is written $y \notin S$. For example, $S = \{1, 2, \{3\}\}$ is a set satisfying $1 \in S$, $2 \in S$, $3 \notin S$, and $\{3\} \in S$. If all elements in set $A$ are also elements in set $B$, then it is said that $A$ is a subset of $B$ and it is written $A \subset B$, otherwise it is written $A \not\subset B$. An important set is the empty set $\emptyset = \{\}$, which is uniquely defined as the only set for which every element $x$ satisfies $x \notin \emptyset$. Sets important to this work are the integers $\mathbb{Z} = \{\ldots, -2, -1, 0, 1, 2, \ldots\}$ and the continuum of real numbers $\mathbb{R}$. Other useful sets are the non-negative integers (counts) $\mathbb{Z}_{\geq 0} = \{0, 1, 2, 3, \ldots\}$ and the positive reals $\mathbb{R}_{>0}$. Notice that $\mathbb{Z} \subset \mathbb{R}$, yet $\mathbb{R} \not\subset \mathbb{Z}$.

A cartesian product $\times$ is an operation on two sets, producing a third set. So the cartesian product of sets $A$ and $B$ is a third set $A \times B$. It is defined as the set of all pairs of elements between $A$ and $B$. Formally, for every $a \in A$ and $b \in B$, $(a, b) \in A \times B$. It is conventionally recognized that the cartesian product may be iterated so that for every $s_1 \in S_1$, $s_2 \in S_2$, $\ldots$, and $s_n \in S_n$, $(s_1, s_2, \ldots, s_n) \in S_1 \times S_2 \times \cdots \times S_n$.
A cartesian product of a set $A$ with itself is written $A^2 = A \times A$, $A^3 = A \times A \times A$, and so on.

Dimensions

Most mathematical statements made in this work establish relationships between several variables. Hence the statements are multivariate in nature. The cartesian product is used to construct multi-dimensional spaces. For the purposes of this work, it is sufficient to define the dimension of a variable $x$ satisfying $x \in \mathbb{R}^p$ as $p$. So if $x \in \mathbb{R} = \mathbb{R}^1$, then $x$ is univariate. Also notice that since $\mathbb{Z}_{\geq 0} \subset \mathbb{R}$, if $x \in \mathbb{Z}_{\geq 0}^{10}$, then $x$ has dimension 10. Further, if $x \in \mathbb{Z}_{\geq 0}^{a} \times \mathbb{R}_{>0}^{b}$, then the dimension of $x$ is $a + b$. For the purposes of this work, it is sufficient to define a vector as any list of numbers $x = (x_1, x_2, \ldots, x_p)$ satisfying $x \in \mathbb{R}^p$.

Hypercubes

A hypercube is a subset of a real space $\mathbb{R}^p$ which bounds a set of vectors per coordinate. Every hypercube can be defined with two vectors $a = (a_1, a_2, \ldots, a_p)$ and $b = (b_1, b_2, \ldots, b_p)$, such that for each $j \in \{1, 2, \ldots, p\}$, $a_j \leq b_j$. A unique hypercube exists for every unique pair of $a$ and $b$, defined as $[a_1, b_1] \times [a_2, b_2] \times \cdots \times [a_p, b_p]$, where each $[a_j, b_j]$ is a bounded segment of $\mathbb{R}$ from $a_j$ to $b_j$. Notice that in $\mathbb{R}$ every hypercube is a segment, in $\mathbb{R}^2$ a rectangle, and in $\mathbb{R}^3$ a box. Visual analogues break down in higher dimensions, leaving only analytic understanding.

1.5.2 Probability

A more thorough description of these concepts can be found in the work of Klenke [146]. Probability theory is founded on the observation that random trials can be imagined as an experiment with a set of possible outcomes $\Omega$, called the sample space. For example, the sample space for a single die roll is $\{1, 2, 3, 4, 5, 6\}$. Though it is sometimes sufficient to only describe probabilities of individual outcomes, probability theory instead describes the likelihood of outcome events $\mathcal{F}$, which is a special set (sigma algebra) of subsets of $\Omega$. So for any $x \in \mathcal{F}$, it is also true that $x \subset \Omega$.
For example, an outcome event for a single die roll is $\{\text{die} < 3\} = \{1, 2\}$. A probability measure $P$ is a function mapping elements of $\mathcal{F}$ to $[0, 1]$ (written as $P : \mathcal{F} \to [0, 1]$). Even though $P$ is a function of sets (example: $P[\{\text{die} < 3\}] = 2/6 = 1/3$), it is conventional to drop the braces ("{" and "}") (example: $P[\text{die} < 3]$). A probability measure must satisfy the probability axioms, which are not stated here.

Probability models

Mathematical models are essential to this work. For this work, it is sufficient to define a mathematical model as any set of constraints on variables. For example, two variables $(X, Y)$ = (O₂ concentration, Nitrospina abundance) could be constrained through an equation $Y = a + bX$ and real-valued constants $(a, b) \in \mathbb{R}^2$ (read as $(a, b)$ is in the set of two-dimensional real-valued numbers). Constraints can also be much more abstract, leveraging powerful theory to be both more realistic and hold meaningful implications. For example, variable values may be assumed to be drawn from a simple random sample. Simple random sampling holds powerful implications through statistical theory.

Random variables

Probability theory provides a framework for describing the likelihood of observing variable values. Variables defined with a description of likely values are random variables. The admission of imprecise values inherent to random variables is useful to mathematical models, because non-random equations are so easily wrong in application. For example, no microbial abundance will be exactly known through O₂ concentrations, so the earlier equation $Y = a + bX$ is wrong. However, through admission of an error term $\varepsilon$, the model $Y = a + bX + \varepsilon$ cannot be wrong.
Of course, in order for the $Y = a + bX + \varepsilon$ model to be useful, an understanding of $\varepsilon$ is motivated.

Probability distributions

For a random variable $X \in \mathbb{R}$ (read as $X$ is in the set of single-dimensional real-valued numbers) and a non-random variable $x \in \mathbb{R}$, the probability of observable values is defined through the univariate distribution function $F_X(x) = P[X \leq x]$. Through defining the probability of $X$ falling below or equal to $x$, the probability of all other events (technically Borel sets) is implicitly defined. Multi-dimensional random variables $X = (X_1, X_2, \ldots, X_p) \in \mathbb{R}^p$ have a multivariate distribution function $F_X(x) = F_{X_1, X_2, \ldots, X_p}(x_1, x_2, \ldots, x_p) = P[\cap_{j=1}^{p} \{X_j \leq x_j\}]$ (where an intersection of sets $\cap_{j=1}^{p} A_j$ is the set of all elements $a$ satisfying $a \in A_j$ for every $j \in \{1, 2, \ldots, p\}$). It is conventional to write $P[\cap_{j=1}^{p} \{X_j \leq x_j\}]$ as $P[X_1 \leq x_1, X_2 \leq x_2, \ldots, X_p \leq x_p]$.

As the $Y = a + bX + \varepsilon$ example demonstrates, the addition of a random error term $\varepsilon$ allows a model to be valid, but potentially less useful. If it were true that $\varepsilon = 0$ constantly, the model would be perfectly predictive. Practically, $\varepsilon$ has a non-trivial distribution function which describes precisely how wrong the $Y = a + bX + 0$ model is. By understanding the distribution of $\varepsilon$, one may understand how useful a model (such as $Y = a + bX + 0$) is predictively. Two popular concepts for describing a distribution function are location and dispersion.

Measures of location

Define location as a near-typical value for a random variable. The expected value $EX = \int_{\mathbb{R}} x \, dF_X(x)$ of a random variable $X$ is a common description of location, often written as $\mu$. Expected values can be understood through averaging repeated trials. The Strong Law of Large Numbers states that if a random vector $X \in \mathbb{R}^n$ satisfies $F_X(x) = F_{X_1, X_2, \ldots, X_n}(x_1, x_2, \ldots, x_n) = F_{X_1}(x_1) F_{X_1}(x_2) \cdots F_{X_1}(x_n)$ (independence with a common distribution) and $EX_1 = \mu \in \bar{\mathbb{R}} = \mathbb{R} \cup \{\infty\} \cup \{-\infty\}$ (well-defined expected value), then the average $n^{-1} \sum_{i=1}^{n} X_i$ converges to $\mu$ with probability one.
This is written as $P[\lim_{n \to \infty} n^{-1} \sum_{i=1}^{n} X_i = \mu] = 1$ or as $n^{-1} \sum_{i=1}^{n} X_i \to_{a.s.} \mu$, where a.s. stands for almost surely, equivalent to with probability one. Expected values are useful and theoretically well-developed, but because they are defined through an integral, not every random variable can have one. A common description of location that always exists for a random variable $X$ is the median, defined as any value $m$ satisfying $P[X \leq m] \geq 1/2$ and $P[X \geq m] \geq 1/2$. A random variable's median $m$ is near its mean $\mu$, as constrained by $|\mu - m| = |E(X - m)| \leq E|X - m| \leq E|X - \mu| \leq \sqrt{E(X - \mu)^2} = \sigma$ [172], where $\sigma$ is a non-random constant which is defined through integrals (and therefore $\sigma$ may not be well-defined).

Measures of dispersion

Define dispersion as a description of how near a random variable's realized values are to its location. A common description of dispersion is the variance, defined as $\mathrm{Var}X = E(X - EX)^2$ and often written as $\sigma^2$. Drawing inspiration from the law of large numbers, it is clear that $\sigma^2$ might be estimated through the average square deviation from the mean. Since variance is defined through the expectation operator $E$, which is in turn defined through an integral, not all random variables have a well-defined variance. A common description of dispersion that always exists is the median absolute deviation, defined as $\mathrm{median}(|X - \mathrm{median}(X)|)$. This work utilizes some models which permit nonexistence of an expected value or variance, so an understanding of median-based descriptors of location and dispersion is motivated.

Measures of dependence

A central focus of this work is the description of relationships between variables, many of which are modelled by random variables. A popular description of statistical dependence between random variables $X$ and $Y$ is the covariance, defined as $\mathrm{Cov}(X, Y) = E(X - EX)(Y - EY)$. The covariance describes the strength of linear relationships between variables.
To see this, let random variables $X$ and $Y$ satisfy the linear relationship $Y = a + bX$. Then $\mathrm{Cov}(X, Y) = \mathrm{Cov}(X, a + bX) = b\mathrm{Var}X$, so their covariance is related to the slope of the line between them. A concept which better describes linear relationships by building on covariance is correlation, defined as $\mathrm{Cor}(X, Y) = \mathrm{Cov}(X, Y)/\sqrt{(\mathrm{Var}X)(\mathrm{Var}Y)}$. For the $Y = a + bX$ example with $b > 0$, $\mathrm{Cor}(X, Y) = b\mathrm{Var}X/\sqrt{(b\mathrm{Var}X)^2} = 1$, since $Y$ is a deterministic linear function of $X$ (for $b < 0$ the correlation is $-1$). This work frequently estimates covariance structures, motivating a description of a random vector $X$'s covariance matrix, defined as the symmetric matrix $\Sigma = \mathrm{Cov}(X) = \mathrm{Cov}((X_1, X_2, \ldots, X_p)^T)$ satisfying $\Sigma_{ij} = \mathrm{Cov}(X_i, X_j)$, where $\Sigma_{ij}$ is the element of $\Sigma$ on the $i$th row and $j$th column.

1.5.3 Gaussian models

In this work a recurring distribution function is the Gaussian or Normal distribution. In a single dimension, the Gaussian distribution function with mean $\mu$ and variance $\sigma^2$ is $\Phi(x; \mu, \sigma^2) = \int_{-\infty}^{x} \phi(z; \mu, \sigma^2) \, dz$, where $\phi$ is the Gaussian density function, $\phi(x; \mu, \sigma^2) = (2\pi)^{-1/2} \sigma^{-1} \exp[-(x - \mu)^2/(2\sigma^2)]$. The standard Gaussian is a Gaussian distribution with $(\mu, \sigma^2) = (0, 1)$. The Gaussian density function $\phi$ is shaped like a bell, with tails that drop to zero exponentially fast as $x$ is taken away from $\mu$ (see Figure 1.9). The bell shape indicates that a Gaussian random variable $X$ likely takes values near $\mu$, and the exponentially small tails indicate that extreme values quickly become so unlikely as to effectively never occur. For example, a standard Gaussian-distributed random variable is expected to fall below $-10$ only about once in every $10^{23}$ trials.

The multivariate generalization of the Gaussian is the multivariate Gaussian or multivariate Normal distribution of a random vector $X \in \mathbb{R}^p$ with mean $\mu = (\mu_1, \mu_2, \ldots, \mu_p)$ and covariance matrix $\Sigma$, defined as

$\Phi(x; \mu, \Sigma) = \Phi(x_1, x_2, \ldots, x_p; \mu, \Sigma) = \int_{-\infty}^{x_1} \int_{-\infty}^{x_2} \cdots \int_{-\infty}^{x_p} \phi(t_1, t_2, \ldots, t_p; \mu, \Sigma) \, dt_1 \, dt_2 \cdots dt_p,$

where $\phi$ is the multivariate Gaussian density function, $\phi(x; \mu, \Sigma) = (2\pi)^{-p/2} \det(\Sigma)^{-1/2} \exp[-(x - \mu)^T \Sigma^{-1} (x - \mu)/2]$. Samples drawn from a multivariate Gaussian distribution are likely to fall within an elliptical region, as demonstrated in Figure 1.10. If a random vector $X \in \mathbb{R}^p$ is multivariate Gaussian distributed with parameters $(\mu, \Sigma)$, then it is written $X \sim N_p(\mu, \Sigma)$. If $p = 1$, it may be written $X \sim N(\mu, \Sigma)$, in which case $X$ has a univariate Gaussian distribution.

Eigenvectors

Eigenvectors are analytic tools which are abstract and powerful. Their abstraction makes them widely applicable, but also difficult to understand. By adding assumptions in an example, their usefulness is made clearer.

Figure 1.9: The standard Gaussian curve

Figure 1.10: Bivariate Gaussian simulation. Arrows are covariance matrix eigenvectors.

The analytic definition of an eigenvector is a vector $v \neq 0$ such that, for some matrix $M$, there is a scalar (eigenvalue) $\lambda$ satisfying $Mv = \lambda v$. An eigenvalue is the factor by which $M$ scales its corresponding eigenvector. So a matrix may have eigenvector and eigenvalue pairs. For example, a covariance matrix $\Sigma \in \mathbb{R}^{p \times p}$ has $p$ eigenvector-value pairs $\{(v_j, \lambda_j)\}_{j=1}^{p}$ such that $\Sigma = \lambda_1 v_1 v_1^T + \lambda_2 v_2 v_2^T + \cdots + \lambda_p v_p v_p^T$, and $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_p$. It is possible to approximate $\Sigma$ by truncating its sum of eigenvectors, so $\Sigma \approx \lambda_1 v_1 v_1^T + \lambda_2 v_2 v_2^T$. This heuristic is useful for interpreting Factor Analysis models and Principal Component Analyses (PCA).

For a multivariate Gaussian distribution, the eigenvectors of its covariance matrix are the axes of greatest variation. This is shown as arrows in Figure 1.10. Principal Component Analysis (PCA) [210] estimation uses this fact to reduce the dimensionality of data.
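The eigenvector expansion of a covariance matrix can be checked numerically. What follows is a minimal sketch in Python, assuming a hypothetical 2×2 covariance matrix with entries 2, 1.2 and 1 (illustrative values, not taken from this work). The helper names `eigen_2x2_symmetric` and `rank_one_sum` are also illustrative. The sketch uses the closed-form eigenpairs of a symmetric 2×2 matrix, rebuilds the matrix from the full sum of $\lambda_j v_j v_j^T$ terms, and forms the truncated approximation from the leading term only.

```python
import math


def eigen_2x2_symmetric(a, b, d):
    """Eigenpairs of the symmetric matrix [[a, b], [b, d]], largest eigenvalue first.

    Each pair is (eigenvalue, unit-length eigenvector).
    """
    if b == 0.0:
        # Diagonal matrix: the coordinate axes are eigenvectors.
        pairs = [(a, (1.0, 0.0)), (d, (0.0, 1.0))]
        return sorted(pairs, key=lambda pair: pair[0], reverse=True)
    centre = (a + d) / 2.0
    radius = math.hypot((a - d) / 2.0, b)
    pairs = []
    for lam in (centre + radius, centre - radius):
        # (b, lam - a) solves (M - lam * I) v = 0 whenever b != 0.
        vx, vy = b, lam - a
        norm = math.hypot(vx, vy)
        pairs.append((lam, (vx / norm, vy / norm)))
    return pairs


def rank_one_sum(pairs, k):
    """Sum of the first k terms lam_j * v_j * v_j^T, as a nested list."""
    total = [[0.0, 0.0], [0.0, 0.0]]
    for lam, v in pairs[:k]:
        for i in range(2):
            for j in range(2):
                total[i][j] += lam * v[i] * v[j]
    return total


sigma = [[2.0, 1.2], [1.2, 1.0]]  # hypothetical covariance matrix
pairs = eigen_2x2_symmetric(sigma[0][0], sigma[0][1], sigma[1][1])
full = rank_one_sum(pairs, 2)       # reproduces sigma up to rounding
truncated = rank_one_sum(pairs, 1)  # leading-eigenvector approximation
print([lam for lam, _ in pairs])
print(full)
print(truncated)
```

For this matrix the eigenvalues are approximately 2.8 and 0.2, so the leading term alone already carries most of the total variance.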
To derive a PCA representation for some data, calculate the data's correlation matrix, then project the data into the lower-dimensional space provided by the first few (usually 2) eigenvectors.

Factor analysis models

Factor Analysis models are a constrained variant of the multivariate Gaussian, initially developed for Psychology by Spearman. The Factor Analysis model is achieved by constraining a multivariate Gaussian's covariance matrix $\Sigma \in \mathbb{R}^{p \times p}$ so that $\Sigma = LL^T + \Psi$, where $L \in \mathbb{R}^{p \times m}$ with $m \leq p$ and $\Psi \in \mathbb{R}^{p \times p}$ is a diagonal matrix. An important difference is that the Factor Analysis model has fewer free parameters than the general multivariate Gaussian. Instead of $\Sigma$ having $p^2$ parameters, $LL^T + \Psi$ has $p(m + 1)$ parameters. If $m$ is small, then the model has linearly many ($O(p)$) parameters instead of quadratically many ($O(p^2)$). Parameter reductions or constraints are important from a statistical perspective, providing an avenue toward lower-variance estimates. A second important difference between the Factor Analysis model and the general Gaussian is that a random vector $X$ with mean 0 and covariance $LL^T + \Psi$ following the Factor Analysis model is implicitly equivalent to $X = LF + \Psi^{1/2} E$, where $F \in \mathbb{R}^m$ and $E \in \mathbb{R}^p$ are independent standard multivariate Gaussian vectors (having distribution function $\Phi(\cdot; 0, I)$). The implicit decomposition of any Factor Analysis model into standard multivariate Gaussians provides an interpretation similar to PCA.

Gaussian mixture models

A Gaussian Mixture model is used to model data clusters. It is best understood through a stochastic process. If a random vector $X$ follows a Gaussian Mixture model, then it is equivalent to $X = \sum_{j=1}^{m} 1_{Y=j} Z_j$, where $Y \in \{1, 2, \ldots, m\}$ and $Z_j \sim N_p(\mu_j, \Sigma_j)$ independently. The indicator function $1_A$ equals 1 when $A$ is true and zero otherwise. Effectively $Y$ acts like a switch, causing $X \sim N_p(\mu_j, \Sigma_j)$ with probability $P[Y = j]$.
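The switch interpretation translates directly into a sampler. The following is a minimal sketch in Python, assuming univariate components ($p = 1$) for brevity and hypothetical weights, means, and standard deviations. The function name `sample_gaussian_mixture` is illustrative, and component labels are numbered from 0 rather than 1.

```python
import random


def sample_gaussian_mixture(weights, means, sds, n, rng):
    """Draw n values of X = sum_j 1[Y = j] * Z_j with univariate components.

    Returns a list of (label, value) pairs, where label is the realised Y.
    """
    labels = list(range(len(weights)))
    draws = []
    for _ in range(n):
        # Y acts as the switch: it picks which component generates X.
        y = rng.choices(labels, weights=weights, k=1)[0]
        # Given Y = y, X is Gaussian with that component's parameters.
        x = rng.gauss(means[y], sds[y])
        draws.append((y, x))
    return draws


rng = random.Random(0)  # fixed seed so the sketch is repeatable
draws = sample_gaussian_mixture(
    weights=[0.3, 0.7],  # hypothetical P[Y = j]
    means=[-5.0, 5.0],   # hypothetical component means
    sds=[1.0, 1.0],      # hypothetical component standard deviations
    n=5000,
    rng=rng,
)
for j in (0, 1):
    values = [x for y, x in draws if y == j]
    # Share of draws from component j, and their sample mean.
    print(j, len(values) / len(draws), sum(values) / len(values))
```

The printed shares should sit near the chosen weights and the sample means near the chosen component means, which is the clustering behaviour the model is meant to capture.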
Many samples drawn from the same Gaussian Mixture model will fall into $m$ multivariate Gaussians, as demonstrated in Figure 1.11.

Figure 1.11: Data simulated from a Gaussian mixture model

Covariance decomposition

In chapter 3, a method for precise correlation network estimation is described. However, the immediately resulting network has many nodes and edges, and is difficult to interpret. It is tempting to disregard large sections of the network for the sake of simplification, but that would destroy information. Viewing correlation networks more generally as covariance matrices, a solution becomes available. Partial covariance decomposition [16–18] can break a covariance matrix into a simple and a complex part without destroying information. Partial correlations can be imagined as an underlying correlation structure which may generate tertiary correlations. This work uses tertiary correlation to refer to a correlation for variables $(X, Y)$, such that there is a variable $Z$ with $\mathrm{Cor}[X, Y] \neq 0$ and $\mathrm{Cor}[X, Y | Z] =_{a.s.} 0$. The tertiary correlation is no less real, in that the correlation of $(Y, Z)$ from Figure 1.12 is non-zero. However, the tertiary correlations are less foundational, since their multivariate dependencies are entirely constructed by other variables (see Baba [16], theorem 2.1.1).

Figure 1.12: Illustration of two correlations between $(X, Z)$ and $(Y, Z)$, generating a partial correlation $(X, Y | Z)$

The following section will develop a $\sigma_Y = \sigma_{\beta_{Y,X} X} + \Sigma$ decomposition, where $\sigma_Y$ is the initial covariance matrix, $\Sigma$ is the simple, partial covariance matrix, and $\sigma_{\beta_{Y,X} X}$ is a complicated residual structure which may not be studied at all.
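In this way, covariation between the dimensions of $Y$ can be studied while controlling for the covariances with and between the dimensions of $X$. The following well-known theory builds the argument.

Let $Y \in \mathbb{R}^p$, $X \in \mathbb{R}^q$ be random variables with $EX = EY = 0$.
Define $\sigma_{Y,X} = [\mathrm{Cov}(Y_i, X_j)]_{p \times q}$ (asymmetric covariance matrix).
Define $\sigma_Y = \sigma_{Y,Y}$ (covariance matrix).
Define $\beta_{Y,X} = \sigma_{Y,X} \sigma_X^{-1}$ (best linear predictor weights).
Define $\sigma_{Y,X \cdot Z} = \sigma_{(Y - \beta_{Y,Z} Z),(X - \beta_{X,Z} Z)}$ (asymmetric partial covariance).
Define $\sigma_{Y \cdot Z} = \sigma_{Y,Y \cdot Z}$ (partial covariance).
Define $(Y \cdot X) = Y - \beta_{Y,X} X$ (partial variable).
Let $LX = \alpha + BX$ for some non-random $\alpha \in \mathbb{R}^p$, $B \in \mathbb{R}^{p \times q}$.

Theorem 1. $\sigma_{(Y - \beta_{Y,X} X), LX} = 0$.

Proof of Theorem 1.

$\sigma_{(Y - \beta_{Y,X} X), LX} = \sigma_{(Y - \beta_{Y,X} X), X} B^T = (\sigma_{Y,X} + \sigma_{-\beta_{Y,X} X, X}) B^T$
$= (\sigma_{Y,X} - \beta_{Y,X} \sigma_{X,X}) B^T = (\sigma_{Y,X} - \sigma_{Y,X} \sigma_{X,X}^{-1} \sigma_{X,X}) B^T = 0$

Theorem 2. $\mathrm{trace}(\sigma_{Y - \beta_{Y,X} X}) \leq \mathrm{trace}(\sigma_{Y - LX})$, where $\mathrm{trace}(\sigma) = \sum_{j=1}^{p} \sigma_{jj}$.

Proof of Theorem 2.

$\sigma_{Y - LX} = \sigma_{(Y - \beta_{Y,X} X) + (\beta_{Y,X} X - LX)}$
$= \sigma_{Y - \beta_{Y,X} X} + \sigma_{(\beta_{Y,X} X - LX),(Y - \beta_{Y,X} X)} + \sigma_{(Y - \beta_{Y,X} X),(\beta_{Y,X} X - LX)} + \sigma_{\beta_{Y,X} X - LX}$
$= \sigma_{Y - \beta_{Y,X} X} + 0 + \sigma_{\beta_{Y,X} X - LX}$ ; ($\beta_{Y,X} X - LX$ is linear in $X$, apply Theorem 1)
$\Rightarrow \mathrm{trace}(\sigma_{Y - LX}) \geq \mathrm{trace}(\sigma_{Y - \beta_{Y,X} X} + 0)$ ; ($\sigma_{\beta_{Y,X} X - LX}$ is positive semi-definite)

In the case of multivariate Gaussian regression, define a model $Y = \beta X + \Sigma^{1/2} \varepsilon$ imposed on the data $(Y, X)$, where $\beta \in \mathbb{R}^{p \times q}$ (non-random) and $\varepsilon \sim N_p(0, I)$. Because this is multivariate Gaussian regression, $\beta$ is the best possible linear predictor of $Y$ given $X$ (it minimizes $\sum_{j=1}^{p} E([Y - \beta X]_j^2)$), which is $\beta = \beta_{Y,X}$ as given by Theorem 2. Decomposition of the covariance follows.

$\sigma_Y = \sigma_{\beta_{Y,X} X + \Sigma^{1/2} \varepsilon} = \sigma_{(\beta_{Y,X} X) + (Y - \beta_{Y,X} X)}$
$= \sigma_{\beta_{Y,X} X} + \sigma_{(\beta_{Y,X} X),(Y - \beta_{Y,X} X)} + \sigma_{(Y - \beta_{Y,X} X),(\beta_{Y,X} X)} + \sigma_{Y - \beta_{Y,X} X}$
$= \sigma_{\beta_{Y,X} X} + 0 + \sigma_{Y - \beta_{Y,X} X}$ ; (apply Theorem 1)
$= \sigma_{\beta_{Y,X} X} + \sigma_{Y \cdot X}$ ; (note the partial covariance)
$= \sigma_{\beta_{Y,X} X} + \sigma_{\Sigma^{1/2} \varepsilon} = \sigma_{\beta_{Y,X} X} + \Sigma^{1/2} \sigma_{\varepsilon} \Sigma^{1/2} = \sigma_{\beta_{Y,X} X} + \Sigma^{1/2} I \Sigma^{1/2} = \sigma_{\beta_{Y,X} X} + \Sigma$

The covariance has been decomposed into a regressor part and a partial covariance part, $\sigma_Y = \sigma_{\beta_{Y,X} X} + \sigma_{Y \cdot X}$. Because this is made possible by best linear predictors, further decomposition is possible, ultimately resulting in successive partial decomposition.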
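Theorem 1, Theorem 2, and the resulting decomposition can be checked on sample moments. The following is a minimal sketch in Python for the scalar case $p = q = 1$, assuming simulated data with a hypothetical slope of 2 and noise standard deviation of 0.5. Sample covariances (centred, with divisor $n$) stand in for the population quantities, and the helper names are illustrative.

```python
import random


def mean(values):
    return sum(values) / len(values)


def cov(u, v):
    """Sample covariance (divisor n) of two equally long lists."""
    mu_u, mu_v = mean(u), mean(v)
    return sum((a - mu_u) * (b - mu_v) for a, b in zip(u, v)) / len(u)


rng = random.Random(1)
n = 500
# Hypothetical data: X standard Gaussian, Y depends linearly on X plus noise.
x = [rng.gauss(0.0, 1.0) for _ in range(n)]
y = [2.0 * xi + rng.gauss(0.0, 0.5) for xi in x]

beta = cov(y, x) / cov(x, x)                         # best linear predictor weight
residual = [yi - beta * xi for xi, yi in zip(x, y)]  # the partial variable (Y . X)
fitted = [beta * xi for xi in x]

# Theorem 1: the residual has zero covariance with X (up to rounding).
print(cov(residual, x))
# Decomposition: Var(Y) equals Var(beta X) + Var(Y - beta X).
print(cov(y, y), cov(fitted, fitted) + cov(residual, residual))
# Theorem 2: no other slope leaves a smaller residual variance.
for b in (0.0, 1.0, 3.0):
    other = [yi - b * xi for xi, yi in zip(x, y)]
    print(b, cov(other, other) >= cov(residual, residual))
```

In the scalar case the trace in Theorem 2 reduces to a single variance, so the comparison in the final loop is a direct instance of that theorem.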
Define some subset $J \subsetneq \{1, 2, \ldots, p\}$ and let $Z = [Y - \beta_{Y,X} X]_J$ be a sub-vector (projection) of the residuals. Decomposition may then proceed as follows.

$\sigma_Y = \sigma_{\beta_{Y,X} X} + \sigma_{Y \cdot X} = \sigma_{\beta_{Y,X} X} + \sigma_{Y - \beta_{Y,X} X} = \sigma_{\beta_{Y,X} X} + \sigma_{\beta_{(Y - \beta_{Y,X} X), Z} Z} + \sigma_{(Y - \beta_{Y,X} X) \cdot Z}$
$= \sigma_{\beta_{Y,X} X} + \sigma_{\beta_{(Y \cdot X), Z} Z} + \sigma_{(Y \cdot X) \cdot Z} = \sigma_{(Y \cdot X) - Y} + \sigma_{((Y \cdot X) \cdot Z) - (Y \cdot X)} + \sigma_{(Y \cdot X) \cdot Z}$

Thus the initial covariance can be decomposed into a series of covariance matrices, $\sigma_Y = \sigma_{(Y \cdot X) - Y} + \sigma_{((Y \cdot X) \cdot Z) - (Y \cdot X)} + \sigma_{(Y \cdot X) \cdot Z}$. This elegant theoretical construction is a close approximation to what occurs in more complex regression models. Roughly, interpretation can be compartmentalized into separate covariance matrices, $\sigma_Y \approx \sigma_{[\text{Chemical concentrations}]} + \sigma_{[\text{Non-target taxa}]} + \sigma_{[\text{Target nitrogen cycling taxa}]}$.

1.6 Statistics concepts

Further reading on material in this section can be found in texts by Casella and Berger [52] and Murphy [193].

Statistics and Machine Learning methods are applied toward a variety of problems including modelling, descriptive summaries, and visual communication, but this work primarily applies all such techniques toward automated decision making. Surveys of large amounts of data are made possible through automated decision making. In certain situations decisions are known to be right or wrong, and protocols can be evaluated. All major contributions of this thesis are evaluated in such a way, primarily through the use of precision-recall curves (see section 1.6.5).

1.6.1 Estimation

Maximum likelihood estimates

Statistics employs probability models. Recall the example model from section 1.5.2, $Y = a + bX + \varepsilon$, where $(Y, X, \varepsilon)$ is a random vector and $(a, b)$ is a constant vector. Assume further that $Y$ represents Nitrospina abundance, and $X$ represents O₂ concentration. Repeated observations of $(Y, X)$ hint at the likely values of $(a, b)$ and the distribution of $\varepsilon$. Statistical theory provides methods for constructing a likelihood function $f_{Y,X}(y, x; a, b)$ which can describe how likely certain parameters ($(a, b)$ in this case) are, given observed vectors $(y, x)$.
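A likelihood function for the $Y = a + bX + \varepsilon$ example can be written out explicitly once a distribution for $\varepsilon$ is chosen. The following is a minimal sketch in Python under assumptions not made in the text: $\varepsilon$ is taken to be Gaussian with a known standard deviation, $X$ is treated as fixed, and the five data points are made up for illustration. With Gaussian errors, the least-squares coefficients are the parameter values at which this log-likelihood is largest.

```python
import math


def log_likelihood(a, b, xs, ys, sd=1.0):
    """Log-likelihood of (a, b) for Y = a + b X + eps, eps ~ N(0, sd^2)."""
    total = 0.0
    for x, y in zip(xs, ys):
        resid = y - (a + b * x)
        total += -0.5 * math.log(2.0 * math.pi * sd ** 2) - resid ** 2 / (2.0 * sd ** 2)
    return total


def least_squares(xs, ys):
    """Intercept and slope minimising the sum of squared residuals."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = s_xy / s_xx
    return y_bar - slope * x_bar, slope


# Made-up observations standing in for (O2 concentration, abundance) pairs.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 7.1, 8.8]

a_hat, b_hat = least_squares(xs, ys)
print(a_hat, b_hat)
print(log_likelihood(a_hat, b_hat, xs, ys))
# Moving away from the least-squares values lowers the log-likelihood.
print(log_likelihood(a_hat + 0.5, b_hat, xs, ys))
print(log_likelihood(a_hat, b_hat - 0.5, xs, ys))
```

For these made-up points the fitted intercept is about 1.10 and the fitted slope about 1.96.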
In more general and conventional notation, a likelihood function for a random vector $X$ with parameter vector $\theta$ is written $f_X(x; \theta)$. Given a sample vector of observations $X = x$, the value $\hat{\theta}$ which maximizes $f_X(x; \theta)$ is the maximum likelihood estimate (MLE). MLEs are extremely popular in statistics, because they attain asymptotically minimal variance and are eventually unbiased for sufficiently large sample sizes. The MLE's asymptotically minimal variance is said to make it efficient.

$\hat{\theta}_{MLE} = \arg\max_{\theta} f_X(x; \theta)$

Overfit

Small estimator variance is important, because it allows the model to realistically represent the variational structure of the data it is modelling. In Bioinformatics, a common cause of estimator variance is having too few data and too many parameters. For example, if $\dim(\theta) = p > n = \dim(x)$, then the model will likely overfit the data. An overfit model is able to describe the eccentricities of the data, but has lost the big-picture signal that it was meant to capture. Overfit is known to reduce models' predictive capacity.

1.6.2 Regression

The term regression is used in a variety of ways, but generally refers to any statistical process for measuring relationships among variables. Measurement can be interpreted as deciding the existence of relationships or as modelling relationships, and in application both perspectives are often satisfied simultaneously. Through historical precedent, regression analyses tend to model relationships between pairs of variables $(Y, X)$ linearly. So if $\beta$ is a non-random parameter vector and $X$ a matrix, the model constrains $Y = X\beta$. However, $(Y, X)$ are often defined in very flexible ways. For example, $Y$ might be an internal model parameter, or $X$ might be a matrix of transformed variables.
Effectively, a linear constraint is often applied toward modelling some very non-linear relationships.

Univariate regression

This work defines univariate regression as the regressing of a location variable (see section 1.5.2) for a univariate random variable $Y \in \mathbb{R}^1$ against some other variable $X$. For example, least squares regression can be applied to model $Y$ as a Gaussian distributed random variable with conditional expectation $E[Y|X] = X\beta$. For generalization beyond univariate Gaussian distributions, Generalized Linear Models (GLMs) [181] model the conditional expectation of $Y$ through a link function $g$, so $E[Y|X] = g(X\beta)$. It is very common for models to have additional parameters beyond $\beta$, such as variance. A popular GLM applied in Microbial Ecology [168, 185, 228] is the Negative Binomial, satisfying the following.

$Y \in \mathbb{Z}_{\geq 0}$ ; $P[Y = y] = \binom{y + \mu^2/(\sigma^2 - \mu) - 1}{y} \left(\frac{\sigma^2 - \mu}{\sigma^2}\right)^{y} \left(\frac{\mu}{\sigma^2}\right)^{\mu^2/(\sigma^2 - \mu)}$ ; $\sigma^2 > \mu$ ; $\mu = e^{X\beta}$

The Negative Binomial can be further specified as the NEGBIN P [45, 46, 110], taking $P$ in $\mathbb{Z}_{>0}$, and $\mathrm{Var}[Y] = \sigma^2 = \mu + \mu^P/\nu$. The NEGBIN 2 or NB2 configuration is used in this work.

Multivariate regression

This work defines multivariate regression as the regressing of location variables for a random vector $Y \in \mathbb{R}^p$, while simultaneously modelling a covariance structure for the dimensions of $Y$. A popular example is multivariate Gaussian regression, where $Y$ follows a multivariate Gaussian distribution with covariance matrix $\Sigma \in \mathbb{R}^{p \times p}$ and expected values $E[Y_j|X] = X\beta_j$ for $j \in \{1, 2, \ldots, p\}$. Notice that the vectors $\beta_j$ can be arranged into a matrix $\beta$. Multivariate regression should be contrasted with univariate regression surveys, where a vector of expected values $E[Y_j|X]$ is estimated but a covariance structure is not. Univariate regression surveys are popular in Bioinformatics [168, 227, 228].

It is important to note that multivariate regression models tend to have more parameters than univariate regression models.
For example, if $X \in \mathbb{R}^{n \times q}$ describes $n = 100$ samples for $q = 4$ variables, then a univariate Gaussian regression model has $q + 1 = 5$ parameters (+1 for $\sigma^2$). A univariate regression survey of $Y \in \mathbb{R}^{n \times p}$ with $p = 50$ has $p(q + 1) = O(p)$ parameters. A multivariate Gaussian regression model for $Y$ has every parameter in the univariate regression survey plus $p(p - 1)/2$ parameters in $\Sigma$, so it has $p((p - 1)/2 + q + 1) = O(p^2)$ parameters.

1.6.3 Model selection

AIC

Statistical models can be poorly selected. Sometimes data exhibit behaviour that a model necessarily describes as unlikely. Consider the Gaussian distribution, for example: the likelihood of observed values drops off exponentially fast with distance from the expected value. If a Gaussian distribution is fit to data with extreme values, it will likely have an inflated variance estimate, thereby distorting later inference. In this way, poorly selected statistical models can lie. Application of the Akaike Information Criterion (AIC) [2] can alleviate this issue, perhaps by comparing the Gaussian model to a Student-t distribution, which has the ability to model extreme values. AIC is a statistic which can be used to rank the quality of model fits. To use it, competing models are fit to the same data set and evaluated with the AIC statistic; the model with the lowest AIC value likely suffers the least information loss. AIC statistics must be contrasted with goodness-of-fit tests, which are used to accept or reject the hypothesis that the data follow a particular distribution. AIC ranks models (which might not fit at all), whereas a goodness-of-fit test simply describes binary acceptance and rejection.

$AIC_f = 2k - 2 \log f_X(x; \hat{\theta})$ ; $k = \dim(\theta)$

Regularization

Regularization is a model constraint used to reduce overfit. Regularization methods are well-developed in univariate linear model selection (best exemplified by the covariance test in L1 regularization [164]), but also in many other applications [28].
Some regularization methods work by constraining an optimization problem [131], while others work by reducing the number of variables [247]. Regularization for high-dimensional covariance matrix estimation has recently matured [218] but is actually quite narrow in its applicable scope. Most methods are only useful when applied to multivariate Gaussian distributions, which the multivariate counts of SSU rRNA data do not satisfy. This work utilizes a regularized high-dimensional covariance matrix and uses a copula to interface the requisite multivariate Gaussian distribution with univariate count distributions.

A useful way to imagine how regularization methods work is to realize that models with too many parameters fit to too few data will overfit. The many moving parts of the model allow it to conform too well to the data, thereby exhibiting its eccentricities, and ignoring larger, more important signals in the data. Regularization methods always work by constraining the model in some way, thereby making it less flexible. A less flexible model may no longer conform too well, and can describe more general themes in the data. In this work, the covariance matrix is constrained by requiring it to equate with a factor analysis model's covariance structure, $\Sigma = LL^T + \Psi$. Recall that for $p$ dimensions (taxa), $\Sigma$ has $O(p^2)$ correlations, but $LL^T + \Psi$ has only $O(p)$ parameters. By reducing the number of moving parts (parameters) in the covariance structure, it can better highlight the greater themes in the data's covariance structure.

1.6.4 Copula & marginals

A statistical copula is a mathematical modeller's tool, which allows for great conveniences. It allows the modeller to consider the multivariate structure as a separate back end to the model, while also allowing largely independent selection of univariate distributions in the model's front end. Copulas are used like theoretical glue, sticking the multivariate and univariate components together through a deterministic transform.
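The back end, glue, and front end can be made concrete in two dimensions. The following is a minimal sketch in Python, assuming a bivariate Gaussian copula with a hypothetical correlation of 0.8 and Poisson marginals with hypothetical means 3 and 10. The Poisson marginals are chosen only to keep the example short and are not the marginals used in this work; the function names are illustrative.

```python
import math
import random
from statistics import NormalDist


def poisson_quantile(u, mean):
    """Smallest count k whose Poisson(mean) CDF is at least u."""
    u = min(u, 1.0 - 1e-12)  # guard against u rounding to exactly 1
    k = 0
    pmf = math.exp(-mean)
    cdf = pmf
    while cdf < u:
        k += 1
        pmf *= mean / k
        cdf += pmf
    return k


def sample_copula_counts(rho, means, n, rng):
    """Draw n pairs of counts joined by a bivariate Gaussian copula."""
    std_normal = NormalDist()
    pairs = []
    for _ in range(n):
        # Back end: correlated standard Gaussians with correlation rho.
        z1 = rng.gauss(0.0, 1.0)
        z2 = rho * z1 + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
        # Glue: the Gaussian CDF maps each coordinate to Uniform(0, 1).
        u1, u2 = std_normal.cdf(z1), std_normal.cdf(z2)
        # Front end: each uniform passes through a marginal quantile function.
        pairs.append((poisson_quantile(u1, means[0]), poisson_quantile(u2, means[1])))
    return pairs


rng = random.Random(2)
counts = sample_copula_counts(rho=0.8, means=(3.0, 10.0), n=4000, rng=rng)
print(sum(c[0] for c in counts) / len(counts))  # close to the first marginal mean
print(sum(c[1] for c in counts) / len(counts))  # close to the second marginal mean
```

Swapping `poisson_quantile` for another marginal quantile function changes the front end without touching the correlated Gaussian back end, which is the separation the copula provides.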
The convenience of selecting univariate and multivariate structures independently allows for an otherwise unprecedented breadth of models to choose from. For example, this work needs special univariate distributions to allow for sufficient goodness-of-fit (see section 1.7 for why), yet also needs a special regularizing covariance (multivariate) structure. In this way, a copula is a necessary solution.

Mathematically defined, a copula $C$ is a multivariate cumulative distribution function $F_C$ with uniform $U(0, 1)$ marginal distributions.

Multivariate normal (Gaussian) distributions have already been described as convenient by allowing strategic employment of partial correlations and regularization. Unfortunately, the data studied in chapter 3 follow a multivariate count distribution (discrete) which is clearly not Gaussian (continuous), and no transform will ever map between them [49]. Fortunately, the marginal (univariate) and copula (multivariate) parts are guaranteed to be, in a way (see Theorem 3), separable through Sklar's theorem [245]. Theorem 3 is written similarly to the statement found in Joe [138].

Theorem 3. (Sklar's theorem): For a random vector $Y$ with multivariate cumulative distribution function $F_Y(y) = P[Y_1 \leq y_1, Y_2 \leq y_2, \ldots, Y_p \leq y_p]$ and univariate marginal distributions $F_j(y_j) = P[Y_j \leq y_j]$, an associated copula function $C : [0, 1]^p \to [0, 1]$ satisfies the following.

$F_Y(y) = C(F_1(y_1), F_2(y_2), \ldots, F_p(y_p))$

(a) If $F_Y$ is continuous and has quantile functions $F_1^{-1}, F_2^{-1}, \ldots, F_p^{-1}$, then $C$ is uniquely defined as follows.

$C(u) = F_Y(F_1^{-1}(u_1), F_2^{-1}(u_2), \ldots, F_p^{-1}(u_p))$

(b) If $F_Y$ is discrete (or partly discrete), then $C$ is only unique on the following set.

$\mathrm{Range}(F_1) \times \mathrm{Range}(F_2) \times \cdots \times \mathrm{Range}(F_p)$

From the modeller's perspective the copula function $C$ is the multivariate structure, and each marginal distribution describes the random behaviour of each univariate distribution.
For example, in chapter 3, the copula function is a Gaussian copula constrained to have a factor analysis model's covariance structure, while each marginal distribution describes the random behaviour of SSU rRNA counts.

1.6.5 Hypothesis testing & classification

This work applies statistical methods toward automated decision making, and uses hypothesis testing to automate that framework. Hypothesis testing can be thought of as a comparison of two models, the null and alternative model. Imagining the alternative mathematical model as a set of assumptions $\{A_1, A_2, \ldots, A_n\}$, the null model is merely the same model with an additional constraining assumption: the null hypothesis $H_0$. So the null model can be imagined as a slightly larger set of assumptions $\{A_1, A_2, \ldots, A_n, H_0\}$. The null hypothesis $H_0$ is not assumed like each $A_i$; it is a postulate to be tested. Statistical theory allows the models to be compared for likelihood, and should the null model be deemed sufficiently unlikely, the null model $\{A_1, A_2, \ldots, A_n, H_0\}$ must be deemed untrue and at least one of its assumptions must be untrue. Good statistical methodology requires checking each assumption $A_i$, so it is likely that the only untrue assumption is the postulated null hypothesis $H_0$. In this way, the null hypothesis may be rejected.

This abstract framework is well developed both in theory and in application. For example, the extremely broad category of regression software exists with a standardized interface allowing for automated hypothesis testing. Of course, different data and questions require different software. While a little complicated, it could hardly be easier: any simpler and the abstraction would be reduced so far that the tools would be too narrow in scope. Null hypotheses are often formulated as $H_0 : \theta = 0$, which is sufficiently general for many problems.

Plenty of theoretical tools exist for designing hypothesis testing software.
Ultimately, these tests must somehow digest the comparison of models into a single test statistic $t(x; \theta)$ and threshold $\tau$. Tests will be formulated as rules, such as: if $t(x; \theta) < \tau$, reject $H_0$, otherwise do not. In order to design such tests, powerful statistical theory is used to derive the distribution of $t(X; \theta)$ for random data $X$ while assuming $H_0$. Thereby unlikely values of $t(X; \theta)$ can be identified, leading to rejection of the null $H_0$. Because the null-rejecting machinery is derived assuming the null, hypothesis tests have derivable rejection rates $\alpha$. Formally, $\alpha = P[\text{reject } H_0 \mid H_0 \text{ is true}]$.

The common assumptions of independent sampling and asymptotically large samples (the mathematics is derived by taking the sample size large, $n \to \infty$) allow common and powerful theory to be employed. For example, Wilks' theorem [270] says that, under the null hypothesis $H_0$, the test statistic $t(X; \hat{\theta}) = -2 \log \lambda = -2 \log\left( f_{X|H_0}(X; \hat{\theta}_0) / f_{X|H_0^c}(X; \hat{\theta}_c) \right)$ is asymptotically chi-square distributed with $\dim(\theta)$ degrees of freedom. So $-2 \log \lambda \sim \chi^2_{\dim(\theta)}$. Similarly, it is known that MLEs take on (multivariate) Gaussian distributions for sufficiently large sample sizes, $\sqrt{n}(\hat{\theta} - \theta) \sim N_{\dim(\theta)}(0, I_\theta^{-1})$, where $I_\theta$ is the Fisher information matrix of $X$ following parameter $\theta$.

In this work, hypothesis testing is used to survey for non-zero correlations, so many null hypotheses of the form $H_0 : \rho_{ij} = 0$ are tested. Because the greatest challenges in chapter 3 are due to having many more parameters than data, it is unclear if asymptotic assumptions ($n \to \infty$) are valid, thus invalidating the previously described statistical theory. Instead, the bootstrap [83] is used, which is robust to lower sample sizes but requires more computational power to employ.

Employing powerful hypothesis testing machinery toward automated decision making leads to classification. In this work, it is common to classify or assert that $\theta = 0$ if a statistical test cannot reject $H_0 : \theta = 0$, and to classify $\theta \ne 0$ if a statistical test rejects $H_0 : \theta = 0$.
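A bootstrap-based classification of this kind can be sketched as follows. This is a generic percentile bootstrap for a single Pearson correlation, offered only as an illustration: the resampling scheme, the default of 2000 resamples, and the names `pearson` and `bootstrap_correlation_test` are assumptions, not the procedure used in chapter 3.

```python
import math
import random


def pearson(xs, ys):
    """Sample Pearson correlation; returns 0.0 for a degenerate sample."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0.0 or syy == 0.0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def bootstrap_correlation_test(xs, ys, n_boot=2000, alpha=0.05, seed=0):
    """Reject H0: rho = 0 when the percentile interval excludes zero."""
    rng = random.Random(seed)
    n = len(xs)
    stats = []
    for _ in range(n_boot):
        # Resample (x, y) pairs with replacement, keeping pairs intact.
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(pearson([xs[i] for i in idx], [ys[i] for i in idx]))
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    reject = lo > 0.0 or hi < 0.0
    return reject, (lo, hi)
```

The returned `reject` flag is the classifier's decision: `True` corresponds to the claim $\rho \ne 0$, and `False` to the assertion $\rho = 0$.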
In this way, a classifying machine $C$ is defined to take random data and produce decisions. So for some data set $X$, $C(X) = \{\theta \ne 0\}$, and for another $C(X') = \{\theta = 0\}$.

Precision-recall exchanges

Roughly, precision is the probability that one is right when making a claim, and recall is the probability that one makes a claim when it is right. Notice that it is often easy to claim little and be right (low recall, high precision), and easy to claim a lot and often be wrong (high recall, low precision), but usually challenging to precisely claim enough to make an argument. Precision and recall tend to exchange with each other. If one imagines truth as the harvest of scientific endeavour, then recall is our yield, and precision is the efficiency or quality of our harvest. Under this perspective, one might imagine claims to be the unit of our harvest. Usually, an experimental or deductive logical process is required to harvest a single claim, but in modern contexts machines are participating in the harvest. The result is that machines are producing many similarly formed claims. For example, our correlation software makes claims of the form "taxon A is correlated with taxon B", and our binning software claims "contig A belongs to taxon B". In bioinformatics, machine-produced claims tend to be part of a larger argument, and may be used in conjunction with many other machine-produced claims and a few human-produced claims. Inevitably, an entire argument is formed and some degree of confidence is required in the machine-produced claims. Certain arguments require high-quality machine-produced claims. If these machines must be right, they must be precise.

The primary challenge [267] in section 1.3 and an important part of the concerns described in subsection 1.2.4 [55, 175, 240] are due to poor precision-recall exchanges. Precise correlation would empower microbial ecologists to make further and more confident claims pertaining to individual taxa interactions and which taxa have which functions.
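The informal notions of precision and recall used here can be made concrete with a small sketch. The function name `precision_recall` and the example claims about taxa pairs are hypothetical; they only show how a cautious claimant and a liberal claimant trade precision for recall.

```python
def precision_recall(claimed, true):
    """Precision and recall of a set of claims against the set of true claims."""
    claimed, true = set(claimed), set(true)
    correct = claimed & true
    # Precision: fraction of the claims made that are right.
    precision = len(correct) / len(claimed) if claimed else float("nan")
    # Recall: fraction of the true claims that were actually made.
    recall = len(correct) / len(true) if true else float("nan")
    return precision, recall


# Hypothetical truth: four correlated taxa pairs.
TRUE_PAIRS = {("A", "B"), ("A", "C"), ("B", "D"), ("C", "D")}

# Claiming little: high precision, low recall.
cautious = precision_recall({("A", "B")}, TRUE_PAIRS)

# Claiming a lot: high recall, low precision.
liberal_claims = TRUE_PAIRS | {("A", "D"), ("B", "C"), ("A", "E"), ("B", "E")}
liberal = precision_recall(liberal_claims, TRUE_PAIRS)
```

Here `cautious` evaluates to `(1.0, 0.25)` and `liberal` to `(0.5, 1.0)`, the two extremes described in the text.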
Objectively, the contributions delivered by the work are measured through precision-recall exchanges.

Because precision and recall are central topics in this work, a formal interpretation of classifiers will now be developed. Classifiers make claims which may be phrased as "object or phenomenon A has attribute B". For example, one might attribute correlation to a pair of taxa. Define a set of possible attributions $T$ and a set of objects or phenomena $S$ which may be given an attribution. It is known that special pairs $(X, y) \in S \times T$ satisfy $X \in y$, and with the goal to confidently map $S$ to $T$ accordingly, a function $C : S \to T$ is needed. Define precision as $P[X \in y \mid C(X) = y]$. Define recall as $P[C(X) = y \mid X \in y]$.

The efficient precision-recall exchanges in this work come at a cost that is strategically paid when possible. Of the two problems covered by this work, only one (binning) leverages additional data; the other (correlation) is simply a reinterpretation of existing resources. Efficient precision-recall exchanges do not come for free. It is generally true that the strategies employed in this work achieve efficient exchange by increasing the maximum attainable precision and lowering the maximum attainable recall. So by increasing our bound on precision, a bound on recall is lowered. Fortunately, despite a constrained improvement, gains in precision per fixed recall are attained.

Precision myths

Bioinformatic pipelines involving many machines ($C_1, C_2, \ldots$) might suggest the opportunity to overcome imprecision through consensus. Such opportunities do exist, certainly under repeated trials. It is tempting to imagine that combining multiple strategies necessarily improves precision (consider ensemble approaches [87] in light of Weiss et al. [267]). However, it is simply not a ubiquitous truth. Effectively, the question is whether $P[X \in y \mid C_1(X) = y, C_2(X) = y] \ge P[X \in y \mid C_1(X) = y]$. The answer is contextual of course; however, the following mathematical constraint applies generally.

Result 1.
If $P[X \in y \mid C_2(X) = y] \le P[X \in y \mid C_1(X) = y] \, P[C_1(X) = y \mid C_2(X) = y]$, then $P[X \in y \mid C_1(X) = y, C_2(X) = y] \le P[X \in y \mid C_1(X) = y]$. Further require $P[C_1(X) = y, C_2(X) = y] > 0$.

Proof. Let $A_1 = \{C_1(X) = y\}$, $A_2 = \{C_2(X) = y\}$, $B = \{X \in y\}$. Then our hypothesis is

$$P[B \mid A_2] \le P[B \mid A_1] \, P[A_1 \mid A_2]$$
$$\Leftrightarrow \frac{P[B \cap A_2]}{P[A_2]} \le P[B \mid A_1] \, \frac{P[A_1 \cap A_2]}{P[A_2]}$$
$$\Leftrightarrow P[B \cap A_2] \le P[B \mid A_1] \, P[A_1 \cap A_2]$$
$$\Leftrightarrow P[B \mid A_1] \ge \frac{P[B \cap A_2]}{P[A_1 \cap A_2]}.$$

Since $B \cap A_1 \cap A_2 \subseteq B \cap A_2$, it follows that

$$P[B \mid A_1] \ge \frac{P[B \cap A_2]}{P[A_1 \cap A_2]} \ge \frac{P[B \cap A_1 \cap A_2]}{P[A_1 \cap A_2]} = P[B \mid A_1 \cap A_2].$$

This result can be generalized to many machines $C_i$ as follows.

Corollary 1. If $P[X \in y \mid C_2(X) = y, C_3(X) = y, \ldots] \le P[X \in y \mid C_1(X) = y] \, P[C_1(X) = y \mid C_2(X) = y, C_3(X) = y, \ldots]$, then $P[X \in y \mid C_1(X) = y, C_2(X) = y, C_3(X) = y, \ldots] \le P[X \in y \mid C_1(X) = y]$. Also require $P[C_1(X) = y, C_2(X) = y, C_3(X) = y, \ldots] > 0$.

Proof. Take $A_2 = \{C_2(X) = y, C_3(X) = y, \ldots\}$ in the proof of Result 1.

Despite its simplicity, there are meaningful interpretations of Result 1 useful to bioinformaticians. First, notice that if the precision of $C_2$ is sufficiently lower than that of $C_1$, the best precision is obtained by not employing $C_2$. Second, even if agreement is high between machines ($P[C_1(X) = y \mid C_2(X) = y] \cong 1$), if both are sufficiently imprecise then their combined effect is no more precise. Consensus does not bestow truth. A theoretically relaxed interpretation is that no amount of agreement between low quality machines matters unless at least one of them is shown to be precise.

Debunked myths:
1. Imprecise methods can aid precise methods.
2. Consensus is as good as truth.

A third interpretation of Result 1 is that if both $C_1$ and $C_2$ have similar precisions but they disagree ($P[C_1(X) = y \mid C_2(X) = y] \le 1$), the best precision is again obtained by only employing one machine. In the sub-case where precisions are low and they disagree, the only lesson is that better methods are needed. Further, a precision must be bounded if disagreement exists, because only one machine can be right.
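Result 1 can be checked on a small finite example. The joint counts below are invented for illustration, as are the names `COUNTS`, `prob` and `cond`; exact fractions are used so that the comparison involves no rounding.

```python
from fractions import Fraction

# Hypothetical joint counts over (B, A1, A2), where B = {X in y},
# A1 = {C1(X) = y} and A2 = {C2(X) = y}. The numbers are invented.
COUNTS = {
    (1, 1, 1): 8,
    (0, 1, 1): 2,
    (1, 1, 0): 10,
    (0, 1, 0): 0,
    (1, 0, 1): 0,
    (0, 0, 1): 10,
    (1, 0, 0): 20,
    (0, 0, 0): 50,
}
TOTAL = sum(COUNTS.values())


def prob(event):
    """Exact probability of an event given as a predicate on (b, a1, a2)."""
    return Fraction(sum(n for key, n in COUNTS.items() if event(*key)), TOTAL)


def cond(event, given):
    """Exact conditional probability P[event | given]."""
    joint = prob(lambda b, a1, a2: event(b, a1, a2) and given(b, a1, a2))
    return joint / prob(given)


precision_c1 = cond(lambda b, a1, a2: b == 1, lambda b, a1, a2: a1 == 1)    # 9/10
precision_c2 = cond(lambda b, a1, a2: b == 1, lambda b, a1, a2: a2 == 1)    # 2/5
agreement = cond(lambda b, a1, a2: a1 == 1, lambda b, a1, a2: a2 == 1)      # 1/2
precision_both = cond(
    lambda b, a1, a2: b == 1, lambda b, a1, a2: a1 == 1 and a2 == 1
)                                                                           # 4/5

hypothesis_holds = precision_c2 <= precision_c1 * agreement   # 2/5 <= 9/20
conclusion_holds = precision_both <= precision_c1             # 4/5 <= 9/10
```

In this example, requiring consensus with the imprecise machine $C_2$ lowers precision from 9/10 to 4/5, so the best precision is obtained by employing $C_1$ alone.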
Together, these interpretations demonstrate that consensus is not sufficient for precision, although it is necessary.

In constructing data-driven arguments, the reliance on automated decision making motivates objective evaluations of methods. Without some guard against false interpretation, it is easy to make mistakes. For example, section 1.4.3 describes two binning experiments that arrived at contradictory conclusions. The ESOM experiment concluded that the SAR 324 might harbour nitrite reductase, whereas a MetaBAT-generated bin, which was also evaluated with CheckM, suggested otherwise. Automating decision making can lead to mistakes without objective evaluations. This work advocates for the use of precision-recall curves, because the statistics evaluate the exact desired behaviour. Precision is the rate of correct attribution amongst all attributions. Recall is the rate of attributions made amongst correct attributions. Losing objectivity might just let one lose touch with reality.

Further motivation for precision in data-driven argument is developed in Appendix A, where data-driven arguments are analogized to Hidden Markov Models.

1.7 Computation

Further reading on topics in this section can be found in work by Isaacson and Keller [136] and Boyd and Vandenberghe [34].

Bioinformatics is very much a computational science. This presents certain mathematical challenges, particularly in algorithmic theory and numerical analysis. Numerical analysis is the algorithmic theory of numerical approximation. There are many situations where the best methods for calculating on paper and in silicon are very different. Calculation of eigenvectors has examples of this. Constrained, non-linear optimization with Lagrange multipliers does as well. Many popular statistical software packages employ the same linear algebra code libraries, such as LAPACK [11].

Many challenges arise from the use of floating point representation, where computers use pre-defined quantities of memory to store a number. A number $x$ is represented through the stored values $(s, m)$ as $x = s 2^m$.
The finite-memory constraint means that roundoff errors can occur, with $x$ being rounded to one of $\{0, \infty, -\infty\}$. Truncation errors are due to insufficient digits being stored in $s$. Numerical stability is a desired property, describing algorithms which can accurately approximate their target functions. A numerical algorithm lacks stability if it produces large errors.

It might be tempting to imagine that all computational problems can be overcome with sufficient hardware resources, but such conveniences often cannot achieve what better algorithms can. For example, truncation errors might be made slightly less frequent by using 64 bit representation of floating point values (double precision) rather than the typical 32 bit representation (single precision), but the problem is often overcome entirely by computing on log-scale (see subsection 3.2.3 for a particular example of this). In the case of GPU computing (see subsection 1.7.3), avoiding the use of double precision representation can even make software run faster.

1.7.1 Numerical calculus

Numerical approximation of functions is often motivated through the use of calculus. Derivatives and integrals often cause numerical approximation to become necessary. Heavily studied functions (consider hypercubic Gaussian integrals [101], beta ratios [39, 75], and the student-t distribution [129]) often have piecewise approximating solutions, broken into efficient iterations or polynomial approximations.

Derivatives

Derivatives are a common computational goal. While most derivatives are manually calculable in theory, many calculations are made pragmatically feasible through computer aid (consider back-propagation as an example [233]). It is also common for derivatives to be numerically computed (calculated through computer aid) for convenience (this is common in non-linear programming, see subsection 1.7.2). A common numerical approximation is through the stencil.
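Stencil approximations are short to write down in code. The sketch below uses the standard central-difference coefficients, with $(-1, 16, -30, 16, -1)/(12h^2)$ for the second derivative; the function names and default step sizes are illustrative choices, not values taken from this work.

```python
import math


def derivative_two_point(g, x, h=1e-5):
    """Two-point central stencil for g'(x); two evaluations of g."""
    return (g(x + h) - g(x - h)) / (2.0 * h)


def derivative_five_point(g, x, h=1e-3):
    """Five-point stencil for g'(x); the centre weight is zero."""
    return (
        -g(x + 2.0 * h) + 8.0 * g(x + h) - 8.0 * g(x - h) + g(x - 2.0 * h)
    ) / (12.0 * h)


def second_derivative_five_point(g, x, h=1e-3):
    """Five-point stencil for g''(x); five evaluations of g."""
    return (
        -g(x + 2.0 * h)
        + 16.0 * g(x + h)
        - 30.0 * g(x)
        + 16.0 * g(x - h)
        - g(x - 2.0 * h)
    ) / (12.0 * h * h)


# Example: derivatives of sin at x = 1 should approximate cos(1) and -sin(1).
approx_first = derivative_five_point(math.sin, 1.0)
approx_second = second_derivative_five_point(math.sin, 1.0)
```

The step size $h$ is a trade-off: a smaller $h$ reduces the approximation error of the stencil but amplifies floating point roundoff in the subtraction of nearly equal values.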
Stencils are expensive to compute, because they require re-evaluation of the numerically differentiated function. Examples of two and five-point stencils follow.

$$g'(x) \approx \frac{g(x + h) - g(x - h)}{2h}$$

$$g'(x) \approx \frac{-g(x + 2h) + 8g(x + h) + 0 - 8g(x - h) + g(x - 2h)}{12h}$$

$$g''(x) \approx \frac{-g(x + 2h) + 16g(x + h) - 30g(x) + 16g(x - h) - g(x - 2h)}{12h^2}$$

Integrals

With large amounts of probability theory applied in this work, many examples of integrals have already been motivated. It is common for integrals to require numerical approximation, though sometimes models are selected because of their analytically solved integrals. For low-dimensional integrals ($\int_{\mathbb{R}^p} g(x) \, dF(x)$, where $p$ is small, usually $\le 3$), numerical quadrature routines can be used to approximate integrals. A software library for univariate integral approximation is QUADPACK [217]. For higher-dimensional integrals, MC-integration (Monte Carlo integration) is more feasible, though sufficiently many dimensions will make any computational approach infeasible. MC-integrals use random number generation to approximate integrals, and thus are ideal for computing expected values. This is shown as follows.

$$E g(X) = \int g(x) \, dF_X(x) \overset{a.s.}{=} \lim_{n \to \infty} n^{-1} \sum_{i=1}^{n} g(X_i) \, ; \quad X_i \sim F_X$$

If simulation of random variables $X_i$ from distribution function $F_X$ is fast (it often is), this method approximates $E g(X)$. Note that the number of required iterates can be predicted through Chebyshev's inequality as follows.

$$P\left[ \left| n^{-1} \sum_{i=1}^{n} g(X_i) - E g(X) \right| \ge k \sqrt{n^{-1} \mathrm{Var}[g(X)]} \right] \le k^{-2}$$

1.7.2 Non-linear programs

Non-linear programs are optimization problems, where a function $g(x)$ is maximized (or minimized, which is equivalent to maximization of $-g(x)$). Sometimes solutions are analytically known. For example, the parabola $g(x) = -3x^2 + 2x - 5$ can be analytically optimized by setting its derivative to zero: $g'(x) = -6x + 2 = 0 \Rightarrow x = 3^{-1}$. In chapter 2, optimization is achieved through the EM-algorithm [69], iterating analytically solvable sub-optimizations. Optimization programs are commonly described as follows.
The program is non-linear if $g$ is non-linear.

$$\text{maximize } g(x) \text{ subject to } h(x) \le 0 \text{ and } x \in \mathcal{X}$$

Iterative solutions are common and work by iterating some maximizing function $m(x_n) = x_{n+1}$, and the iterative series must start at some $x_0$. Choice of initial guess $x_0$ is often very important. In some situations, the iterative component is only good for tuning less-significant digits of $x$. The process by which an initial guess is produced is often at least as important as the optimizing procedure. Crafting an initialization algorithm requires some domain knowledge. There are no general solutions. This work's initialization algorithms are always a series of estimators, each taking as input the output of the previous. The initial estimators are robust, numerically stable and inaccurate. Later estimators are accurate and more delicate.

Quasi-Newton methods

A well-developed category of numerical algorithms is for optimizing convex, non-linear, differentiable $g$. Quasi-Newton methods take advantage of how easily parabolic systems are optimized, and work by iteratively approximating $g$ with a parabolic system and optimizing it. A multivariate Taylor series approximation of $g$ expanded about $x_0$ is the following.

$$g(x) \approx g(x_0) + (x - x_0)^T \nabla g(x_0) + \tfrac{1}{2} (x - x_0)^T \nabla^2 g(x_0) (x - x_0)$$

$\nabla g$ is the gradient vector of $g$, so $[\nabla g(x)]_i = \frac{\partial}{\partial x_i} g(x)$. $\nabla^2 g$ is the Hessian matrix of $g$, so $[\nabla^2 g(x)]_{ij} = \frac{\partial^2}{\partial x_i \partial x_j} g(x)$. Assuming $g$ is (at least locally) convex ensures that $\nabla^2 g$ is symmetric positive definite (for every non-zero vector $x$, $x^T \nabla^2 g \, x > 0$, and $[\nabla^2 g]^T = \nabla^2 g$). Systems satisfying these conditions have a breadth of well-developed numerical algorithms available to them [42, 92, 105, 145, 205, 206, 213, 238, 282]. These solutions take advantage of the fact that the parabolic equations are solved by differentiating and setting to zero. The derivative of the previous Taylor series approximation is as follows.

$$\nabla g(x) \approx \nabla g(x_0) + \nabla^2 g(x_0) (x - x_0)$$

For MLE computation in chapter 3, having many dimensions makes $\nabla^2 g$ computation expensive.
For this situation, the BFGS or L-BFGS algorithms are ideal, since $\nabla^2 g$ is either calculated implicitly [108] or not stored at all.

1.7.3 GPU supercomputing

Graphics processing units (GPUs) are hardware modules (see Figure 1.13) which were developed to accelerate video rendering, a process which requires calculating values for individual pixels on a screen in parallel. With so many pixels, the problem is massively parallel. With a consistently large enough consumer demand for video acceleration, specialized hardware (the GPU) has been developed precisely for this task. Conveniently, certain non-video algorithms have a similar parallelization structure and can be accelerated with the hardware. GPUs are somewhat over-specified, but capable of massive speed-ups in the right compute scenarios. For example, in chapter 3, a simulation study utilized a GeForce GTX 980 Ti GPU capable of up to 5360 GFLOPS (giga floating point operations per second) and an AMD FX-8320 Eight-core CPU capable of about 40 GFLOPS, suggesting a potential increase of over 100 times. Of course, the specialization of a GPU means that this speed-up is only available for certain problems. Modern GPU technology is employed in tasks that require only the most powerful computing solutions, including the world-class Go-playing AlphaGo [243], self-driving cars [8], and high quality sequence alignment with the Smith-Waterman algorithm [173]. All GPU-accelerated software in this work is written in CUDA [200].

Figure 1.13: The NVidia GeForce 980 Ti GPU

Warp divergence

GPUs' special form of immense parallelization is achieved by making trade-offs at the hardware level which exclude many numerical problems from acceleration. Accommodating these specializations makes algorithmic design and programming more difficult for GPU software. Software is written for blocks of threads (see Figure 1.14), which need to execute very similar instruction sets.
This is because GPUs achieve their acceleration by executing instructions with multiprocessors, which are capable of issuing the same instruction to several threads at a time. Threads are processed in groups called warps.

Warp divergence occurs when threads in a warp receive different instructions. Because the multiprocessor can only issue the same command, it processes the different instruction sets in serial, one after another. Small amounts of warp divergence are common and still allow for acceleration, but if most threads follow different execution paths, then the entire task would be slower than if run in serial on a CPU.

Figure 1.14: Blocks of GPU threads (a grid of blocks (0,0), (1,0), (0,1) and (1,1), with block (0,1) expanded into threads (0,0), (1,0), (0,1) and (1,1))

1.8 Deliverables

This work makes contributions toward two bioinformatic problems relevant to microbial ecology: (1) metagenomic binning, and (2) SSU rRNA correlation, each described in its own chapter. Objective evaluation of these contributions is made through precision-recall curve estimation. Contributed methods are applied toward better understanding the microbial ecology of denitrification in Saanich Inlet. While both chapters will describe inferences that won't be revisited, there is a unifying narrative interpreting results from a network perspective and arguing for the metabolic cooperation between two taxa, SUP05 and Marinimicrobia.

In chapter 2, a method for precise metagenomic binning is made possible by recruiting genetic material to SAGs. The work contributes to a much-desired understanding of how binners succeed and fail [5, 225, 240] by calculating precision-recall curves for different binning strategies under different conditions and from different perspectives. It turns out that metagenomic binning shares a similar error profile to that of bootstrapped phylogenetic trees, informing on how these tools might best be applied in the future.
The work also contributes to the correlation network perspective of the Saanich Inlet denitrifying community by better describing the genomic potential of different taxa in a variety of conditions.

In chapter 3, a solution for overcoming the poor precision-recall exchanges of modern SSU rRNA correlation surveys in microbial ecology [267] is described, objectively validated with precision-recall curves, and applied toward the Saanich Inlet denitrifying community. It is discovered that a form of covariance matrix regularization is important for good precision-recall exchanges in SSU rRNA correlation surveys. Findings in the chapter bolster metabolic syntrophy arguments with correlative support.

In chapter 4, findings from both chapters are summarized and brought together. The technical conclusions of either chapter are important for bioinformatics, though largely independent. Findings are largely unified through the comparison of several network perspectives of the Saanich Inlet denitrifying community. A future direction is imagined for correlation networks in microbial ecology.

Chapter 2

Metagenomic binning

Metagenomic binning methods are often used to extract genomes from metagenomes and are being applied on vast scales to draw important conclusions on a variety of topics. Viewing these methods as machine learning tools, approaches include a variety of strategies and sometimes very little training data. Wide acceptance of binning products hinges on concerns including false taxonomic assignment. Precision, the probability of correct assignment, is therefore desired to be high. Using single-cell amplified genomes (SAGs) as a reference to guide metagenomic binning is shown here to make significant increases in precision without excessive loss of recall. This work introduces a binning software released as SAGEX (SAG EXtrapolator).
Several binning strategies are compared and motivated, and illustrate that precision tends to increase when describing higher-ranked clades, or when more training data is used. This work suggests that evolving genomic standards require binning products to be published with their binning strategy and should encourage the use of precise techniques when possible.

2.1 Introduction

High-throughput sequencing technologies are rapidly uncovering the incredible genomic potential of microbial life. Despite their integral roles in mediating matter and energy transformations [85], the vast majority of microorganisms remain uncultivated [58, 89, 197]. Metagenomics, the cultivation-independent sequencing of nucleotides from an environment, is illuminating this uncultivated "dark matter", opening a taxonomic and functional window into the networks driving microbial community metabolism in natural and engineered ecosystems [225]. A combination of metagenomic binning and assembly methods has been popular for attributing function to taxonomy in these cultivation-independent contexts [93, 234, 261]. While these methods are pragmatic, communicating the degree of genome completion and quality is important [175]; it has been advocated for in genomic standards [55], and modern definitions continue to be proposed [207]. The issue that metagenomic binning might falsely assign genomic sequences to a genome is an acknowledged concern [240] that can be dealt with through a combination of understanding how binning mistakes are made and how to reduce their occurrence. Here, this work suggests that precise binning is a way to reduce such mistakes.

An alternative cultivation-independent method for obtaining genomic information from an environment is single-cell sequencing. SAGs are theoretically capable of unambiguously linking whole genomes to a taxonomy at the pinnacle level of resolution, a single cell. Unfortunately, caveats of this technology persist, such as incomplete recovery of the host's genome [29].
This work demonstrates reductions in false assignment made possible through SAG-guided binning. The greatest value SAGs offer for binning is that they provide a most relevant source of training data for classifying binners, thus allowing more precise binning of novel taxa. Because SAG sequencing tends to result in incomplete genomes, binning is further motivated by increasing genome recovery. Despite training classifying binners with SAGs, this work discovers that species-level precision is not attainable with either the tool presented herein or other popular binning software. Therefore, bins should be considered mixed within a narrow taxonomic range (i.e., species and genus). This taxonomic range is shown to ultimately narrow as binner precision increases.

A concern for reduced false assignment is best satisfied with precision, the probability that each assignment (or classification) is done correctly. Precise metagenomic binning provides confidence in individual assignments of genomic sequences to microbial taxa. While not necessarily essential for all applications, this precision is valuable in the later use of binning to make inferences in an oxygen minimum zone's microbial food web [125]. Because certain metabolic capabilities can be deeply meaningful toward a microbiome's ability to transform the chemical composition of its environment [154], confidence in results is key.

2.1.1 Definitions

Metagenomic binning has been reinterpreted over time [174, 180, 234], so generally applicable definitions will now be provided. It is generally true to define a metagenomic bin as a collection of related genomic sequences from a metagenome, which may or may not be given a taxonomic label. Metagenomic binning or binning is a process resulting in the creation of bins. A binner is a tool which assists or automates binning. A binner which produces bins with a taxonomic label is a classifying binner, otherwise it is a clustering binner.
Examples of binning software are described in subsection 1.2.4.

The prediction that binners are evaluated with synthetic data [180] has held true [7, 102, 182, 186, 276]. A popular and effective method for estimating classification has been to pull genomes from public databases and use them as known-label data. Examples include the Integrated Microbial Genomes (IMG) [176] system and the National Center for Biotechnology Information's (NCBI) [268] RefSeq [254]. Leveraging 368 SAGs, this work takes advantage of a novel opportunity to more rigorously scrutinize a variety of binners. This work evaluates binners with synthetic metagenomes composed of SAGs. Where known-label data derived from public databases may have been sampled from a variety of experiments and surveys, the SAGs were sampled from the same location within a radius of 200 m and on the same day. The important difference between the SAGs and disparate public database entries is that the SAGs have evolved together. For example, Horizontal Gene Transfer (HGT) has had the opportunity to cause these genomes to share DNA. Because binners leverage forms of DNA dissimilarity to separate bins, the increased potential for similarity between SAGs introduces more pitfalls during binner evaluation that would have been otherwise missed.

So that binner behaviour may be exactly described, a formal description of binning and phylogeny is defined. These formalisms allow ambiguity to be avoided in descriptions of phylogeny, precision, genome-bin differentiation, and binner error behaviour. Define $S$ as the set of all sequences to be binned, so $S = \bigcup_{n=0}^{\infty} \{A, T, C, G\}^n$, where exponentiation stands for a Cartesian product. Define a taxonomic label as a set of sequences, for example $\{\text{gammaproteobacteria}\} \subset \{\text{bacteria}\}$. Define $T$ as the set of all taxonomic labels, so $T = \mathcal{P}(S)$. Define a phylogenetic tree $V$ as satisfying $V \subset T$ such that it defines a rooted-tree graph $G_V = (V, E)$ where $E = \{(a, b) : (a, b) \in V^2, a \subset b\}$.
Any $v \in V$ can be viewed as a clade or population genome. For a given phylogenetic tree, define all leaf nodes of $G_V$ as genomes. This work recognizes all genomes to be the nucleotide sequences of individual organisms. Under this framework, a metagenomic bin $b$ satisfies $b \in T$ and there exists $v \in V$ such that $b \subset v$. These definitions allow bins to have genetic material from several genomes and not necessarily have a recognized definition within a phylogenetic tree, a perspective shared by other work [240]. Note that some works use the term genome in cases this work defines specifically as either a genome or a bin [133], but fortunately this is an example of ambiguity relievable by precise language.

To maintain objectivity, evaluation metrics are defined according to section 1.6.5 using the formal definitions of binning and phylogenetics. Precision-recall exchanges are the fundamental metrics of comparison. Define a classifying binner $C$ as any function from $S$ to $T$. Then precision is the probability that the sequence $x$ is in the taxonomic set $y$ given that the classifier $C$ has assigned it to $y$, and is formally written as $P[x \in y \mid C(x) = y]$, $x \in S$, $y \in T$. Precision is a desirable metric, because precise binners produce more phylogenetically homogeneous bins, and would alleviate previously acknowledged binning concerns [240]. Recall is the probability that a sequence is correctly assigned to a taxonomic group given that it belongs to that group, formally written as $P[C(x) = y \mid x \in y]$. It is important to track both precision and recall because they tend to exchange for one another, and a very precise method without recall is useless.

2.1.2 Software

Because a classifying binner designed to train specifically with SAGs doesn't exist yet, this work also introduces SAGEX (SAG EXtrapolator). The software is written entirely in C/C++ and the only library necessary for compilation is POSIX threads.
It accepts two .fasta files as inputs (see an example in Figure 2.1), a SAG to train with and a metagenome to recruit from, and outputs a bin .fasta file containing sequences from the metagenome which should be related to the SAG (Figure 2.2). Define the concatenation of the training SAG and the SAGEX output as an extrapolated SAG or extrapolation. Notice that extrapolations are bins as well. This work uses SAGEX with assembled inputs. While tested on SAGs, note that it is possible to run SAGEX on any .fasta file of several contigs, thus allowing the use of Illumina Tru-Seq synthetic long reads [161], fosmids, or database entries. SAGEX is available from github.com/hallamlab/sagex. Ultimately this work describes a wide variety of precisions between binning strategies, of which SAGEX is a high-precision binning method, and demonstrates precise binning in the context of microbial ecology of an oxygen minimum zone.

Genome completion has historically been evaluated with marker genes [5, 225, 240]. A modern tool for finding and summarizing marker genes within a collection of nucleotide sequences is CheckM [207]. Both estimates of bin completion and binning precision are relevant in a series on genomic standards [90, 91].

    > header 1
    ATCGATGCATGCATCGATG
    > header 2
    GCTATGCATGTCGATCGAA
    > header 3
    TTAGTCATGCAACGCATTA
    ...

Figure 2.1: The first six lines of an example .fasta file

2.2 Methods

2.2.1 SAGs

Single cell sampling and sequencing was carried out as described in [230] and all SAGs of the SUP05 lineage in this study originate from [230]. In brief, samples for single cell sequencing were collected August 2010 at station S3 in Saanich Inlet at 100, 150 and 185 m. Samples were collected directly into 10 ml glass vials and 1 ml was transferred into 143 µl of 48% betaine and frozen on dry ice.
Samples were stored on dry ice in the field and transferred to a −80 °C freezer for storage until thawing at Bigelow Laboratories Single Cell Genomics Center (SCGC; https://scgc.bigelow.org) for sorting by flow cytometry. Cells were sorted and underwent an initial round of multiple displacement amplification (MDA) and PCR amplification of the small subunit rRNA (SSU rRNA) gene as described in [249, 253]. Taxonomy of single cells was determined by direct sequencing of amplicons of the SSU rRNA gene (see section D.5). Clean SSU rRNA sequences were clustered at 99% identity and representative sequences aligned with LAST [144] against the GreenGenes [71] database (2010) to obtain taxonomy. Taxonomic assignments and efficiency of MDA were used to choose cells for additional MDA and subsequent sequencing. Chosen cells were sequenced as described in Roux et al. [230] at the Genome Sciences Centre, Vancouver BC, Canada and assembled using SPAdes [21].

Early SAG sequencing techniques have known contamination issues [257], also detected by CheckM (Table 2.2). Because taxonomically consistent SAGs form the argumentative foundation of this work, sequences with potential for contamination were removed. It is likely that many non-contaminant sequences were removed in the decontamination process. This policy is favourable over working with taxonomically inconsistent SAGs, because it removes ambiguity from the binning evaluation process. The rule for post-assembly contig removal was 100% identity over at least 2 kbp. Because it is fair to expect such alignments to occur as non-contaminants between related SAGs, alignments between SAGs sharing a pre-defined taxonomic range (see section D.6) were not counted as potential contaminants.
Thus perfect alignments over at least 2kbp between contigs from SAGs of sufficient taxonomic distance were removed prior to any binner analyses. For applications which require more recall, ProDeGe [257] is a good alternative.

Figure 2.2: SAGEX pipeline

2.2.2 SAGEX

SAGEX is a classifying binner trained on a single SAG. Its design is inspired by a previous work [76], and automates and refines the essential binning strategy. It accepts a SAG .fasta and a metagenome .fasta file as inputs and outputs a .fasta. The SAG is used as a training data set, each metagenomic contig is evaluated for recruitment, and metagenomic contigs similar to the SAG in taxonomy are recruited. Optionally, tetranucleotide counts or tetranucleotide principal component dimensional reduction data products are available. This work describes a bin from SAGEX as the output .fasta. The extrapolated SAG is the bin concatenated with the training SAG.

The SAGEX pipeline (Figure 2.2) carries out two tests on a metagenomic contig to determine membership to a given SAG-bin. First is the kmer (see subsection 1.2.4 for a definition) signature test, which checks that the contig and SAG have similar kmer signatures. Second is the identity filter, which checks for at least one region (of user-defined length) of DNA with 100% identity between the contig and the SAG. A typical run with a metagenome (∼50MB) takes 140 seconds per SAG.

The kmer signature filter utilizes a Gaussian mixture model fit to the SAG's kmer values after dimensional reduction with Principal Components (PCs). This works by first calculating the tetranucleotide frequencies (4-mers) for a given SAG contig. This puts each contig into 256 = 4^4 dimensions, which is too high for later statistical models, thus motivating dimensional reduction. A correlation matrix is then calculated for the metagenome's kmer points. The correlation matrix eigen-decomposition is used to select three eigenvectors, the principal components.
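The tetranucleotide frequency step described above can be sketched as follows. This is a minimal illustration and not SAGEX's C implementation: the function and variable names are hypothetical, and the handling of ambiguous bases and of reverse complements is an assumption, since the text does not specify either.

```python
from itertools import product

BASES = "ACGT"
# All 4**4 = 256 possible tetranucleotides, in a fixed order.
TETRAMERS = ["".join(p) for p in product(BASES, repeat=4)]
INDEX = {kmer: i for i, kmer in enumerate(TETRAMERS)}


def tetranucleotide_frequencies(contig):
    """Return a 256-long list of 4-mer proportions for one contig.

    Assumption: windows holding characters other than A, C, G, T
    (for example N) are skipped, and strands are not merged.
    """
    seq = contig.upper()
    counts = [0] * len(TETRAMERS)
    for start in range(len(seq) - 3):
        i = INDEX.get(seq[start:start + 4])
        if i is not None:
            counts[i] += 1
    total = sum(counts)
    if total == 0:
        return [0.0] * len(TETRAMERS)
    return [c / total for c in counts]
```

Each contig then becomes one point in 256 dimensions, which is the input to the correlation matrix and principal component steps.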
These eigenvectors define a linear transformation which projects the metagenome's kmer values into the principal components' subspace. The linear transformation is then applied to both the SAG's and the metagenome's kmers. The natural behaviour of the data causes the kmers to take on multivariate Gaussian distributions (see subsection 1.5.3); the overlap of an E. coli K12 MG1655 genome [30] and SAG [106] demonstrates this best (see Figure 2.3). This popular insight [261] then motivates the choice to employ a Gaussian Mixture Model, which is fit to the SAG's kmers utilizing the Expectation Maximization algorithm. The user may set the initial number of clusters and choose to let SAGEX decrement the number of clusters. If SAGEX is allowed to decrement the number of clusters, it does so heuristically by decrementing when round-off errors occur in estimation due to poor-fitting Gaussians. To evaluate whether or not a metagenomic contig has a similar kmer signature to the SAG, it must fall within the model's null region (see Equation 2.1). The radius of the region may be set by the user.

Figure 2.3: Tetranucleotide signatures are illustrated for various SAGs, an E. coli genome, and a 200m Saanich Inlet metagenome. (Axes: principal component 1, principal component 2. Legend: E. coli SAG, E. coli genome, Planctomycetes, OD1, SUP05, Metagenome.)

\[
\begin{aligned}
X &= \sum_{k=1}^{m} \chi_{\{C = k\}} M_k \\
\chi_A &= \begin{cases} 1 & \text{if } A \text{ is true} \\ 0 & \text{otherwise} \end{cases} \\
C &\in \{1, 2, \ldots, m\}, \text{ categorical} \\
M_k &\overset{iid}{\sim} N_3(\mu_k, \Sigma_k), \text{ multivariate normal} \\
\text{null hypothesis } H_0 &: x \overset{D}{=} X
\end{aligned}
\tag{2.1}
\]

where $\overset{D}{=}$ denotes equivalence in distribution.

The identity filter is simple and key to generating high-quality SAG-bins. To pass this test, the metagenomic contig must share a contiguous region of perfect identity with the SAG. The length, S, of this region is S = 25bp by default, but may be set by the user. To accelerate look-ups but not overly consume memory, all of the SAG's S-length contiguous regions are perfectly hashed and stored in a sorted list for log-time look-ups.
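The look-up scheme described above can be sketched in a few lines. This is an illustration under stated assumptions rather than the SAGEX source: the names `encode`, `build_index` and `passes_identity_filter` are hypothetical, the 2-bit-per-base encoding stands in for the perfect hash, and only the forward strand is checked.

```python
from bisect import bisect_left

CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode(kmer):
    """Map a fixed-length ACGT string to a unique integer (2 bits per base).

    Returns None if the string holds any other character.
    """
    value = 0
    for base in kmer:
        digit = CODE.get(base)
        if digit is None:
            return None
        value = value * 4 + digit
    return value


def build_index(sag_contigs, s=25):
    """Sorted list of the encoded S-length substrings of every SAG contig."""
    keys = set()
    for contig in sag_contigs:
        seq = contig.upper()
        for start in range(len(seq) - s + 1):
            key = encode(seq[start:start + s])
            if key is not None:
                keys.add(key)
    return sorted(keys)


def passes_identity_filter(contig, index, s=25):
    """True if the contig shares at least one exact S-length region with the SAG."""
    seq = contig.upper()
    for start in range(len(seq) - s + 1):
        key = encode(seq[start:start + s])
        if key is None:
            continue
        pos = bisect_left(index, key)  # binary search: O(log n) per look-up
        if pos < len(index) and index[pos] == key:
            return True
    return False
```

The sorted list keeps memory use close to one integer per distinct S-mer, while each query costs a single binary search.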
The identity rule is kept this simple because it is both very effective and computationally fast.

SAGEX is written in C as a modular pipeline with a user interface implemented in C++.

2.2.3 Precision-recall comparisons

In order to demonstrate how SAG-guided binning affects quality, a variety of binning strategies are evaluated via precision and recall statistics. Classifying and clustering binners are evaluated with two different methods because they are not directly comparable. Each paradigm is evaluated at three taxonomic levels, representing low (domain), medium (class), and high (strain) taxonomic resolution (see section D.7). All evaluations are done with synthetic metagenomes composed of concatenated assembled SAG genomes. While precision was motivated earlier as requisite to confident binner application, recall is also emphasized because most classifiers can be made arbitrarily precise at the expense of recall, and a classifier without recall is useless. The precision estimator is TP/(TP + FP), and the recall (also called sensitivity) estimator is TP/(TP + FN), where TP is the number of true positives, FP is the number of false positives, and FN is the number of false negatives.

Classifying binners are evaluated on their ability to accurately assign taxonomy to contigs. Because evaluation occurs at three different resolutions of taxonomies (previously defined as low, medium, and high), attributions are counted as correct if they fall at or below a taxonomic designation. For example, if a contig is from Bacteria and is attributed to Gammaproteobacteria, the attribution is considered correct.

Clustering binners have a more complex evaluation because clustering is binning without the attribution of specific taxonomic labels. Contigs with matching taxonomic labels are defined to share a cluster, otherwise they do not share a cluster. Instead of classifying contig taxonomies, relationships between contigs are classified. Specifically, define a graph of vertices and edges (Equation 2.2).
Each contig has a unique vertex. If contigs are in a bin (cluster), they share an edge, otherwise no edge is shared. Thus every cluster has a clique in the graph. Notice that for every binner evaluation there are two graphs, the true graph and the attributed graph. Every attributed edge is then evaluated as a true or false positive according to the true graph. Note that a cluster of n vertices will have n(n − 1)/2 = O(n^2) edges, resulting in quadratically deformed counts. So clustering statistics will not be comparable with classifier evaluation statistics.

\[
\begin{aligned}
V &\subset S, \quad E = \{\{a, b\} : a, b \in V\} \\
\{\{a, b\}, \{b, c\}\} &\subset E \Rightarrow \{a, c\} \in E \\
G &= (V, E)
\end{aligned}
\tag{2.2}
\]

PhylopythiaS [208] was evaluated utilizing its web interface (http://phylopythias.bifo.helmholtz-hzi.de/). While inputs had to be divided into compute jobs, the tool was evaluated as a single classifying binner. The Generic 2013 - 500 Species model was used. It was run once per synthetic metagenome level.

SAGEX was evaluated twice, once as a classifying binner and again as a clustering binner. In either case, because the synthetic metagenomes were made of SAGs and SAGEX is trained on single SAGs, each SAG was not allowed to recruit its own contigs, thus avoiding a clear bias. To evaluate SAGEX as a classifying binner, it was run once per SAG per level, given that the SAG fell within the taxonomic range defined by the level. SAGEX was always run with arguments -C 25 -k 6 -K, and has the default of rejecting all contigs shorter than 2kbp. The taxonomy of the SAG used as training data was used as the taxonomy attributed to any recruited contigs. Because there were many more SAGs than taxonomic categories, classification error statistics were averaged per taxonomic category per level.

To evaluate SAGEX as a clustering binner, each SAG's recruitments were treated as a single bin. Taxonomic attributions were disregarded.
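The edge-based evaluation of clustering binners can be illustrated with a short sketch. It is a simplified reading of the procedure above, not the evaluation code used in this work: the function names and the use of `None` for unbinned contigs are assumptions made for the example.

```python
from itertools import combinations


def cluster_edges(labels):
    """Edges of the graph in which contigs sharing a label form a clique.

    `labels` maps contig id -> cluster label; None marks an unbinned contig.
    """
    groups = {}
    for contig, label in labels.items():
        if label is not None:
            groups.setdefault(label, []).append(contig)
    edges = set()
    for members in groups.values():
        for a, b in combinations(sorted(members), 2):
            edges.add((a, b))
    return edges


def edge_precision_recall(true_labels, attributed_labels):
    """Precision TP/(TP + FP) and recall TP/(TP + FN), counted over edges."""
    true_edges = cluster_edges(true_labels)
    attributed_edges = cluster_edges(attributed_labels)
    tp = len(attributed_edges & true_edges)
    fp = len(attributed_edges - true_edges)
    fn = len(true_edges - attributed_edges)
    precision = tp / (tp + fp) if tp + fp else None  # NA when no edge is attributed
    recall = tp / (tp + fn) if tp + fn else None
    return precision, recall
```

For example, with true clusters {c1, c2, c3} and {c4, c5} and a single attributed bin {c1, c2, c4}, one of the three attributed edges is correct and one of the four true edges is recovered, giving a precision of 1/3 and a recall of 1/4.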
Since each SAG bin tends to be small and the number of edges grows quadratically with cluster size, recall is driven down.

MaxBin2 [276] is a clustering binner. Because it utilizes coverage estimates, metagenome reads were required as an input. However, the synthetic SAG metagenomes do not have informative coverage statistics due to their creation with MDA. Instead the .fastq of a metagenome sampled near the SAGs at 200m was used. The three synthetic metagenomes, one per level, were then clustered.

Table 2.1: Binner precision-recall statistics

Binner        Type      Precision 1  Precision 2  Precision 3
PhylopythiaS  classify  NA*          .36          .78
SAGEX         classify  .82          .92          .98
SAGEX***      cluster   .42          .56          .89
MaxBin2       cluster   .28          .44          .75
ESOM + R      cluster   .33**        .14          .53

Binner        Type      Recall 1     Recall 2     Recall 3
PhylopythiaS  classify  .0007        .08          .39
SAGEX         classify  .04          .03          .02
SAGEX***      cluster   .002         .0002        .0002
MaxBin2       cluster   .05          .04          .03
ESOM + R      cluster   .04          .13          .01

*9 of 10 measurements are NA, **1 of 10 measurements NA, ***Non-standard SAGEX approach

ESOMs were used as a clustering binner. ESOMs are classically a visualization and dimensional reduction tool [147] and the U-matrix data product is used to produce clusters [1]. This work follows a modern design [74] using Databionics ESOM Tools (http://databionic-esom.sourceforge.net). One thousand 2kbp contigs were sampled randomly from each synthetic metagenome, processed through SAGEX to produce tetranucleotide frequency proportions (an optional SAGEX data product). Tetranucleotide frequencies were then loaded into the ESOM software using a 50x50 U-matrix, 100 iterations, 15% k-batch training, and all other parameters default.
The U-matrix .umx and best match .bm were fed into an R script (section D.8) which clusters contigs when they share a valley in the U-matrix: any nodes sharing neighbours below a cut-off of 0.15 in the U-matrix were defined to share the same cluster (section D.9).

While statistics are available per taxonomic category per level (section D.10), statistics are averaged across categories (Table 2.1). Statistics are stated three times, enumerated by their levels. Clustering and classifying binner statistics cannot be directly compared. Note that recall statistics are not required to be large, since so many sequences are evaluated. Instead, binning strategies tend to exchange recall for increased precision.

Various bins and genomes were evaluated with CheckM with the lineage_wf -t 32 -x fasta command. First, the initial SAGs and their decontaminated counterparts were compared. Second, each binner's output bins at level 3 were evaluated. Mean completeness, mean contamination, and a ratio of mean contamination over mean completeness are reported (Table 2.2). The ratio is reported because it makes the ESOM results easier to compare against other evaluations. This is motivated because ESOM usage requires sampling large kmers from metagenomes instead of using contigs. Thus ESOMs are evaluated with related yet different synthetic metagenomes relative to the other methods. The possibility of biases due to SAG self-recruitment does not apply here as in the precision analysis, so entire SAGEX extrapolations (SAG and recruited contigs concatenated) are used as bins. Completion average estimates are low relative to proposed standards [207], suggesting a tendency toward incompletion was typical across binners. This work reports mean CheckM statistics because the same was done for precision and recall estimation. Means are reported because they approximate expected behaviour of the desired statistics.

Table 2.2: CheckM statistics

Data                  Mean Completeness  Mean Contamination  Ratio*
SAGs                  50.91%             5.97%               0.12
Clean SAGs            15.92%             0.93%               0.06
SAGEX Extrapolations  21.19%             9.88%               0.39
ESOM + R**            0.59%              0.20%               0.34
PhylopythiaS          9.64%              26.19%              3.94
MaxBin2               23.87%             13.22%              0.55

*Mean contamination divided by mean completeness, **ESOM protocol modifies input. Only compare ratios.

2.2.4 Saanich Inlet

SAGs from Saanich Inlet (SI) were run through SAGEX against four metagenomes from 2010 along a gradient of depth and decreasing oxygen (O2) at 10, 100, 150 and 200m (samples SI048 S3 10, SI060 S3 100, SI060 S3 150, SI060 S3 200; accessions [127]). Metagenomes from 100, 150 and 200m were collected concomitantly with the SAGs; the 10m metagenome was collected the previous year in August to match environmental conditions. SAGs from all three depths (100m, 150m and 185m) were run against all four metagenomes in order to see the variability of recruitment to SAGs from both corresponding and disparate metagenomes. SAGEX was run with the same settings as in evaluation. Extrapolated SAGs were then run through CheckM [207] in order to estimate genome completeness and contamination (Table 2.2). Extrapolated SAGs were also run through the Metapathways annotation and metabolic pathway finding tool [121, 149, 150] for analysis of metabolic attributes for population genomes. Using reads from the metagenomes, RPKM [191] values were calculated for all extrapolated SAG open reading frames (ORFs).

A similar and larger analysis features the use of 91 SI metagenomes [127]. All SUP05 SAGs are run through SAGEX against all 91 metagenomes. All recruited contigs' ORFs (as determined through MetaPathways) are aligned with LAST (e-value cut-off 10^-3 [211]) to RefSeq-nr in search of denitrifying genes (narG, nirS/K, norCB, nosZ). Events of recruiting one or more denitrification genes are recorded. Logistic regression tests for statistically significant interaction between SUP05 1c recruitment rates and time while controlling for O2 and metagenome size. The regression analysis surveys samples from 2009 to 2014.
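One plausible way to write the regression described above is given here as a sketch. The symbols are illustrative assumptions and do not appear in this work: $Y_{ij}$ indicates whether SAG $i$ recruits the gene of interest from metagenome $j$, $t_j$ is the sampling time, $o_j$ the O2 concentration, and $s_j$ the metagenome size.

```latex
\[
\operatorname{logit} P(Y_{ij} = 1)
  = \log \frac{P(Y_{ij} = 1)}{1 - P(Y_{ij} = 1)}
  = \beta_0 + \beta_1 t_j + \beta_2 o_j + \beta_3 s_j
\]
```

Under this form, the time effect is assessed by testing $H_0 : \beta_1 = 0$, with $\beta_2$ and $\beta_3$ acting as the controls for O2 and metagenome size.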
All statistical tests use a Type-I error rate of α =

2.3 Results

2.3.1 Precision-recall comparisons

A wide variety of precisions are observed (Table 2.1). The distinct difference between the classifying binners PhylopythiaS and SAGEX is the relevance of training data. For low-level taxa, the PhylopythiaS training data becomes irrelevant to this work's metagenomes, while the SAG-based training data is naturally more relevant. This is demonstrated through a drop in recall to 0.07% and majority NA precision estimates for PhylopythiaS at level 1, because the classifier is hardly classifying anything to such a low level. Clustering binners can be ordered by amount of training data as well: ESOMs only use kmer signatures; MaxBin 2.0 uses kmer signatures, depth, and marker genes; and SAGEX uses kmer signatures, alignments, and many SAGs. As with classifiers, the clustering binners also demonstrate that more, and more relevant, training data result in greater precisions. Therefore the pursuit of precision motivates SAG-guided binning.

When constrained to clustering, SAGEX has the least recall, which is not surprising because it is the most precise clustering method. Because exchanges between precision and recall are trivial, it is important to note that SAGEX as a classifier (standard usage) achieves the best precisions while attaining somewhat typical recall. This favourable exchange between precision and recall is the motivating imperative for SAGEX.

CheckM statistics have some important caveats for effective interpretation. First, note that this work reports mean completeness and contamination statistics. The motivating use case for CheckM often involves discarding many highly incomplete bins, so a CheckM-aided binning analysis may have low average completeness but report near-complete bins. This work reports means because they describe typical bin behaviour.
For example, with independent or weakly correlated [171] samples, the mean will converge to the expected marker gene statistic with probability one. Second, note that CheckM's contamination statistics are biased downward when completeness is small. This bias arises because CheckM's contamination statistic is a sum of over-abundant marker genes [207]. Because different applications are evaluated with CheckM, ratios of mean contamination per mean completeness are provided (Table 2.2).

The CheckM analysis must be divided into two primary cases. First, the top two rows (Table 2.2), SAGs and Clean SAGs, represent a comparison of initial SAGs against decontaminated SAGs and show that aggressive decontamination achieved the goal of avoiding low-quality synthetic metagenomes as input to the binners. Thus ambiguity in the results of this analysis is reduced, because input data are less noisy. Second, the last four rows summarize each binner. Note that the ESOM protocol [74] requires modified inputs and thus shifted statistics, hence the ESOM's ratio should be compared against other methods' ratios instead of raw CheckM statistics. When considering ratios, only PhylopythiaS stands out as appearing to have a higher rate of contamination per completeness. This is because PhylopythiaS bins include high level categories such as Archaea that are meant to be interpreted within a hierarchy of related genotypes.

2.3.2 Saanich Inlet

The overarching motivation for binning is the desire for a cultivation-independent tool of discovery for the metabolic capacity of specific organisms within an environmental context. With this in mind, this work explores the metabolic pathways of the resulting population genomes produced by SAGEX using Metapathways [121, 149, 150] as the primary tool for gene annotation and pathway identification. In general, the average SAG extrapolation gained 10% more new pathways, only counting unique MetaCyc pathways [53].
Extrapolations also lost 6% of old pathways because increased genetic information allows Pathologic to specify and exclude pathways. Of particular interest is knowledge about the metabolic capacity of candidate phyla and microbial dark matter [225] as well as attributing functions to specific taxa. Members of these phyla are defined primarily by SSU rRNA genes found in environmental studies and have no cultured representatives and very little, if any, genomic sequence information associated with them.

The OD1 (average CheckM completion: 22.10%, contamination: 4.01%) are one such candidate phylum, found in aquifers [124], meromictic lakes [103] and Saanich Inlet [125]. Two SAGs from Saanich Inlet, collected from 185 m, were found to belong to the OD1 candidate phylum. The metabolic capacity of OD1 from ground water samples following acetate addition shows a fermentation-based metabolism likely producing hydrogen gas (H2) or hydrogen sulfide (H2S) [274]. The genome bins for the two OD1 SAGs (averaging 300Kb) indeed carried genes for fermentation but genes for hydrogen production were not detected. However, while Wrighton et al. [274] described the OD1 as anaerobic, the gene superoxide dismutase for handling oxygen stress in aerobic (or microaerobic) environments was detected in the population genome bin, which likely allows the organism to cope with fluctuations in oxygen commonly found in the Saanich Inlet environment [279].

The Planctomycetes (average CheckM completion: 41.92%, contamination: 21.81%) are found abundantly in low oxygen environments such as waste water treatment and marine oxygen minimum zones, and are responsible for carrying out anaerobic ammonium oxidation (anammox). Four Planctomycetes SAGs identified by small subunit rRNA gene as Kueneniaceae scalindua were found in Saanich Inlet (two at 150m and two at 185m) with an average genome bin size of ∼2Mb. The anammox genes hydroxylamine dehydrogenase and hydrazine hydrolase were, as expected, found in the population genomes.
Additionally, three genes were found to be involved in sulfur reduction/oxidation including dissimilatory sulfite reductase, adenylylsulfate reductase (previously found in Kuenenia stuttgartiensis) and sulfate adenylyltransferase. The potential role of Planctomycetes in sulfur cycling has been previously unrecognised within oxygen minimum zones and these genes may play an important role in protecting the Planctomycetes from harmful effects of reduced sulfur species found in sulfidic basin waters of Saanich Inlet. Carbon monoxide dehydrogenase and a hydrogenase were detected in the population genome, suggesting the Planctomycetes here may be involved in hydrogen production from carbon monoxide (hydrogen production VI MetaCyc pathway). While carbon monoxide dehydrogenase has been previously reported as involved in the Wood-Ljungdahl carbon fixation pathway [125], the detection of the hydrogenase in the population genome points to this potentially new function which feeds directly into co-metabolic pathways proposed to occur within Saanich Inlet, namely hydrogen oxidation by the SUP05 group [125].

Canonical denitrification is one pathway of loss of biologically available nitrogen globally. However, the denitrification pathway is often modular, with different taxa carrying different enzymes, and few of the organisms responsible for denitrification in OMZs are known. Utilizing the SAGs and population genome bins, taxa may be identified as harbouring the various steps of denitrification at three points along the oxygen gradient in Saanich Inlet (see Figure 2.4). The abundance of denitrification genes overall increased with depth, with 150m and 200m being quite similar as both are under anoxic conditions.
The two SUP05 clades show very similar patterns with the exception of nitrous oxide reductase (nosZ), where the SUP05 1a population genome (average CheckM completion: 13.66%, contamination: 6.49%) is seen to have nosZ at all depths but the SUP05 1c population genome (average CheckM completion: 10.21%, contamination: 4.09%) is seen to have nosZ only at the 150 m depth. This is somewhat consistent with the nosZ gene only being found in 10 out of 48 SUP05 1a SAGs and no SUP05 1c SAGs. SAR324 population genomes also have genes for denitrification, though the nitric oxide reductase (norCB) appears to be missing from the population genome bins. Both the unclassified Gammaproteobacteria and Arcobacteraceae have complete or nearly complete denitrification pathways and other taxa have various components of the pathway. Notably, the Marinimicrobia Arctic96B-7 were recently attributed to have nitrate reductase (narG) and nitrite reductase (nir) [126]. The population genome of Marinimicrobia SHBH1141, recently attributed to have nosZ, also contains narG and nir. The SUP05 Arctic96BD-19 and Polaribacter population genomes carry narG, and narG and nir, respectively.

The logistic regression analysis models the effect of time on the probability that a SUP05 1c SAG recruits a denitrification gene. A statistically significant negative interaction exists for nitric oxide reductase (norCB). This means that SUP05 1c's norCB recruitment rates decrease over time.
The estimated probabilities of recruitment per SAG-metagenome pair are shown in Figure 2.5.

Figure 2.4: A SAGEX work flow. Visualization of a metabolic analysis with SAGEX, focussing on denitrification. (A) SAGEX kmer signatures are visualized for SAGs, the 200m metagenome, and recruited metagenomic contigs. Recruitments are always attributed the same taxa as the SAG which recruited them. SAG extrapolations (SAGs and recruited contigs) are output from SAGEX and input into Metapathways. (B) Reads from the 100m, 150m, and 200m metagenomes were aligned to extrapolation ORFs, allowing calculation of RPKM statistics. All shown dots represent cumulative RPKM values from unique ORFs within a taxonomic category. (Panel B columns: narG, nirS/K, norCB and nosZ at 100m (dysoxic), 150m (suboxic) and 200m (anoxic). Rows: SUP05_1a; SUP05_1c; SAR324; Unclassified Gammaproteobacteria; Arcobacteraceae; Marinimicrobia, Arctic96B-7; Bacteroidales, VC21_Bac22; Marinimicrobia, SHBH1141; SUP05, Arctic96BD-19; Polaribacter. Dot size: RPKM, scale 5 to 750.)

Figure 2.5: The probability that SUP05 1c recruits nitric oxide reductase drops off over time. SAGEX was run on all pairs of metagenomes and SUP05 1c SAGs, aligned to RefSeq-nr (e-value cut-off: 10^-3), then tested with logistic regression for statistically significant interactions between time and recruitment of denitrification genes. Models control for effects of O2 concentrations and metagenome size. (Axes: Time, 2010 to 2013; Recruitment probability, 0.000 to 0.008.)

2.4 Discussion

With the advent of next generation sequencing has also come a burst of binning from metagenomes on a large scale [240]. Binning has utilized marker genes for both the binning procedure (MaxBin [275, 276], Phylopythia [182]) and post-hoc testing (CheckM [207]). This work demonstrates that the advent of SAGs brings more precise binning than previous popular binning strategies.
These technologies will work well together into the future of binning.

Existing binning tools were designed out of the need to elucidate the connections between taxonomy and function from the vast metagenomic space filled by next generation sequencing. Because SAG-guided binning has resulted in greater precision, these findings suggest that both fosmids and Illumina Tru-Seq synthetic long reads [161] may be used to guide binning as well.

The greatest application for SAGEX is to exploit the link between taxonomy and function which the SAG provides, expanding it to a population level. While a single SAG is the genome of a single organism (with varying degrees of completion), SAGEX can bin contigs from the metagenome which are representative of that population as a whole, and thereby provide a taxonomic grounding for much more of the metagenome.

2.4.1 Metabolic discovery

SAGEX has shown the ability to confirm existing metabolic capacities such as fermentation in OD1 and carbon monoxide dehydrogenase in Planctomycetes, showing its fidelity. Further, OD1 was shown to have an ecosystem adaptation via superoxide dismutase. Assigning metabolism to taxonomy, such as hydrogen production to Planctomycetes, gives substantial insight into the distributed metabolic coupling which has been proposed for Saanich Inlet [125, 167]. Identifying the Planctomycetes as the likely source of hydrogen for SUP05 oxidation metabolically links these two organisms and likely serves to provide SUP05 with additional energetic substrate for growth and likely carbon fixation. With greater numbers of SUP05, nitrate reduction via partial denitrification may lead to increased nitrite production [237] and feed forward Planctomycetes anammox activity, ultimately increasing nitrogen loss from OMZ systems.
Greater knowledge and taxonomic resolution of the energetic pathways such as hydrogen production and oxidation which fuel major players in these cycles such as SUP05 and Planctomycetes is key to understanding the future global impacts of OMZ expansion and intensification [86] on a warming planet.

Taxonomic distribution of the denitrification pathway sheds light on a recently identified but not taxonomically constrained niche for nitrous oxide reduction within the anoxic waters of Saanich Inlet [167]. Indeed, it appears that several taxa may be capable of filling this niche, specifically SUP05, SAR324, uncultured Gammaproteobacteria, Arcobacteraceae, Bacteroidales VC21 Bac22 and Marinimicrobia SHBH1141. These taxonomic attributions are predominantly novel, with no other references to nosZ in any of these groups other than recently in Marinimicrobia and in the Epsilonproteobacteria Sulfurimonas gotlandica, related to Arcobacteraceae [153]. The identification of nosZ in SUP05 is very interesting as a recently sequenced isolate did not contain the gene [237]. It is possible that only a sub-population of SUP05 contains the nosZ gene, as suggested by nosZ only being found in the SUP05 1a and not the 1c clade. The presence of nosZ in both SUP05 clade population genomes at 150 m is slightly confounding and may be due to one of two possibilities. One is the possibility of mis-assembly in the metagenome between the SUP05 1a and SUP05 1c clades, where high abundance and high similarity between the two clades created chimeric contigs which contained the SUP05 1a nosZ but recruited to the SUP05 1c population genome. The other is the possibility that SAGEX, while highly precise, could not differentiate between the two SUP05 clades; indeed, binning methods may not be suitable to differentiate between such closely related groups.
Attribution of nosZ to Marinimicrobia SHBH1141 is consistent with recent findings, though attribution of narG and nir is not and may again be the result of either mis-assembly or other cross-recruitment between closely related clades. The novelty of SAR324 involvement in the denitrification pathway is highly intriguing as this group has been implicated in other OMZs [271]; the extent to which the denitrification trait exists outside of Saanich Inlet would need to be explored. As several taxonomic groups are seen to carry out various steps of the denitrification pathway, the dynamics of which group is dominant under what conditions remain to be addressed and would require analysis of gene expression data such as metatranscriptomics and metaproteomics.

Application of SAGEX to all SUP05 1c SAGs over all 91 metagenomes from the Saanich Inlet time series made testing for time effects possible. Logistic regression analysis found that SUP05 1c's nitric oxide reductase recruitment probability decreased from 2009 to 2014 (p-value < 0.05). Precise binning makes these results more credible. Indeed, earlier observations of SUP05 [125, 266] did observe nitric oxide reductase, but the later cultivation [237] did not. Combined with the observation of other complete denitrifiers in 2010 (Figure 3), these findings support a hypothesis: SUP05 1c is evolving toward partial denitrification, opening up a niche for another complete denitrifier.

Application of SAGEX may extend beyond coupled SAGs and metagenomes. Differential metabolic pathways present in the genome bins from metagenomes along ecological gradients may indicate that different populations related to the SAG differ under different environmental conditions along gradients such as depth or oxygen concentration. Additionally, the utility of SAGEX with assembled metatranscriptomes further enhances the prospect of exploring differential expression at the level of population genomes.
Thus SAGEX may be useful in binning populations from different environments or over time from the same environment in efforts to explore differences in metabolic capacity or genetic drift along gradients or over time. The extent to which this variation occurs likely depends on the genetic diversity of the group of organisms. Certainly, within Prochlorococcus a substantial amount of diversity exists among sub-populations present in different relative abundances over changing seasons [142]. As long as key genomic characteristics (genomic regions of 100% similarity and kmer signatures) remain distinct, SAGEX remains precise.

2.4.2 Precise binning

Having evaluated both classifying and clustering metagenomic binners using varying amounts of training data, this work has effectively surveyed some binning strategies across the spectrum of supervision. That is, the binners which use more training data are more supervised. With observed precisions ranging from 14% to 98%, it is clear that some automated binning solutions require additional curation. In general, [180] correctly inferred that more, and more relevant, training data results in better bins. ESOMs remain curation aids [1] and resist automation. MaxBin2 performs admirably despite reduced training data. PhylopythiaS performs well with taxa that are relevant to its models (see section D.10). SAGEX tends to get the highest precisions and requires a guiding .fasta to operate (this work focuses on SAGs). Of course, as phylogenetic range narrows all binners make more errors. Formally, for some phylogenetic tree GV and a bin b, the height of v satisfying b ⊂ v ∈ V would correlate positively with precision. This error pattern of confidently describing higher level clades is shared with bootstrapped phylogenetic estimation [84, 88, 132, 204], suggesting a common error profile might exist for all taxonomic estimation methods.
This error pattern is a mixed blessing, because binning is being applied both in strain-level analysis [65, 261] and in studying the Tree of Life [133] (see Figure 1.3). Binning has been applied toward understanding strains, but ultimately it is best applied in understanding the whole tree. The variable quality and pragmatism of binning strategies clearly play a large role in these abstract analyses but have yet to be incorporated in an objective way.

The comparison of precision and contamination statistics demonstrates that binners effect different forms of bin quality. The primary difference is the scope of applicability. Binner precision describes the probability that any recruited sequence is correctly recruited. CheckM's contamination statistic describes single-copy marker genes and can be heuristically generalized to describe all sequences in a bin. In theory precision is less heuristic than the contamination statistic, but practically both provide essential information the other cannot. CheckM's contamination statistic describes a necessary condition toward individual genome recovery: that marker genes are in single multiplicity. It is possible for this condition to be satisfied while contaminant sequences remain in a bin, so also requiring that a binner have good precision amounts toward a sufficient argument. To claim a bin represents a single genome, it should be constructed by a precise binner and also have good marker gene statistics.

The advantage of precise binning is that individual attributions of genetic sequences to a taxon have a higher probability of being correct. Increased probability of correct attributions then reduces the probability of mistakes in bin construction. The result is a higher quality bin that can be more widely trusted, thus addressing a known issue in binning [240]. Precision is the motivation and deliverable of SAGEX.
However, the other binning strategies studied here have merits as well. First, precision generally comes at the cost of recall and thus reduces inferential yield. Second, increasing precision per recall often costs additional training data [180]. If methods of greater recall (consider ESOMs and MaxBin 2.0) are preferred, appropriate statistical methodology can increase precision. For example, special sequences may have a verifiably greater precision than others, but this verification will consume additional data (see Appendix C). So correct usage of imprecise binners can yield precise results, but comes at the cost of large data sets. Consider the work of [133] as a potential example of this. Notice that precision, and therefore confident binning, never comes for free.

It is important to note that this work has the potential for bias due to its SAG-centric lens. A way to improve on this study would be to reproduce it with curated metagenomes composed of known isolate metagenomes paired with SAGs per isolate. This would also do a better job of demonstrating the influence of chimeras, which are a non-trivial issue in metagenomic assembly [35, 240].

2.5 Conclusions

Single-cell amplified genome (SAG)-guided binning is shown to substantially increase precision in metagenomic binning, thereby alleviating the concern that binning may falsely attribute contigs to bins [240]. Precision is studied because it is the probability that attributions are correct. This work's software for SAG-guided binning is released as SAGEX (SAG EXtrapolator). While evaluated with SAGs, SAGEX only requires its guiding (or training) data to be a .fasta file with at least 4 sequences longer than 2 kbp. A thorough comparison was made possible by a collection of 368 SAGs sampled from the Saanich Inlet oxygen minimum zone.
It is observed that binners which utilize a larger volume of more relevant training data obtain better precision per recall. All binners are observed to exchange precision for taxonomic specificity; genome-level bins tend to be the most error-prone, while bins at a higher phylogenetic level will have more precision. All methods are argued to have precise usages at least in theory, given sufficient training data or correct statistical manipulation combined with a sufficient sample size. Precise binning alleviates issues with false recruitment, and is attainable with genomic data of sufficient length with a defined phylogeny.

A motivating application of precise binning is explored in a microbial food web in the Saanich Inlet oxygen minimum zone [125]. Precision is valuable in this application because each interaction has the potential for immense transformative effects on oceanic chemical composition [154]. Used in combination with Metapathways [121, 149, 150], SAGEX was able to recover 10% more pathways per SAG and describe a variety of metabolisms. OD1 was confirmed to support fermentation, and the presence of superoxide dismutase in its population genome, for handling oxygen stress, demonstrates an adaptation to the Saanich Inlet's seasonal anoxia. Discovery of hydrogen production in Planctomycetes supports potential for coupling with SUP05. Using metagenomes from 2010, potential for complete denitrification was observed for taxa including SUP05 1a, SUP05 1c, and Marinimicrobia SHBH1141. Applying this pipeline to the Saanich Inlet time series (2009-2014) and surveying with logistic regression found that SUP05 1c nitric oxide reductase recruitment drops with time, suggesting evolution toward a partial denitrification niche.

As metagenomic binning is often used to recover individual genomes with variable degrees of quality and completion, genomic reporting standards are called for.
The findings of this work suggest that individual genome recovery standards should require both marker gene statistics and the binning strategy to be published with binning products. When confidence in results is desired, the most precise binning method available should be used, thus motivating guided binning when possible.

These findings contextualize agreeably with previous models of Saanich Inlet. The discovery that SUP05 might be evolving toward a partial denitrification niche agrees with the conceptual model (see subsection 1.4.4). Specialization would leave an energetic opportunity unutilized, so it really suggests niche partitioning. Further attribution of nitrous oxide reductase ($\mathrm{N_2O} \to \mathrm{N_2}$) provides candidate partners for metabolic syntrophy. Further contextualizing in the differential model (see subsection 1.4.5) supports this hypothesis, particularly that complete denitrification may be based on a sulfur-driven relationship. Due to known sulfur-processing metabolic capability combined with a nitrous oxide reductase attribution, Marinimicrobia is a prime candidate. In chapter 3, this argument and others will be bolstered through correlative means.

Chapter 3
SSU rRNA correlation

Correlation surveys of small subunit ribosomal RNA gene (SSU rRNA) data produce networks which may be used to bolster ecological arguments with statements on microbial covariation. For example, in chapter 2 many metabolic capabilities were attributed to specific taxa, leaving many syntrophic hypotheses standing. This chapter uses correlative evidence to further constrain these hypotheses. These results agree with and extend previous network perspectives of Saanich Inlet (see subsection 1.4.4 and subsection 1.4.5). Recently, a major challenge in SSU rRNA correlation surveys was demonstrated [267], which effectively brings individual correlative edge attributions into question.
Imprecise correlation networks might be able to inform on general topological characteristics, but descriptions of fine community structure have effectively lost confidence. If unmet, these concerns would make correlative evidence at most a suggestion, if not a burden, in producing ecological arguments. This work meets this concern by constraining the covariance matrix to $\Sigma = LL^T + \Psi$, which reduces parameter complexity from $O(p^2)$ to $O(p)$ for $p$ taxa. The method is shown to be capable of substantial precision-recall improvements. Hence ecological arguments are strengthened through precise correlation.

3.1 Introduction

The denitrifying community of Saanich Inlet includes a variety of taxa, with no strain ever operating in isolation. The process of transforming initially fixed nitrogen in $\mathrm{NH_4^+}$ to largely biologically inaccessible $\mathrm{N_2}$ can span several ecological niches (as in denitrification but not anammox, see subsection 1.4.1). Due to various factors including competitive pressure, these niches might encourage metabolic specialization, as would follow from the Black Queen Hypothesis [190]. Indeed, an abundance of genomic data supports the perspective of a taxonomically diverse denitrification pipeline (see section 1.4). An integral component of oceanic nitrogen loss is entirely ecological.

Perspectives of such ecological machines can be conveyed through a network perspective. The microbial ecology of Saanich Inlet relating to denitrification has been described through conceptual [125] and differential [167] models (described in subsection 1.4.4 and subsection 1.4.5), both of which are abstractly communicated through network representations. As described in subsection 1.1.2, a coherent network abstraction exists to communicate these models. Particularly, these networks have microbial or chemical nodes and interactions described with edges. These networks' abstraction imposes a degree of superficiality, but the abstraction's coherence makes the networks relevant.
While the underlying microbial individuals are no less complex, careful construction of networks simplifies and thereby contributes to conversations about the ecological machine.

Networks are not only created through abstraction, because they are often inferred directly. SSU rRNA correlation networks (see section 1.3) can be estimated directly from specially-processed metagenomic data, producing a survey of statistical dependence (see section 1.5.2) in the microbial community. Upon estimation correlation networks remain abstract, only imposing descriptions of covariation, and invite concretization through further genomic information and prior knowledge. This work does exactly that, by first describing denitrifying taxa with metagenomic binning in chapter 2, then bolstering arguments with correlative information in this chapter.

[Figure 3.1: A simplified depiction of a poor precision-recall exchange, like those observed in Weiss et al. [267]. See Figure D.1 for actual. Axes: recall (horizontal), precision (vertical).]

Correlation networks are often estimated through hypothesis testing (see subsection 1.6.5). For each pair of taxa indexed $(i, j)$ with correlation $\rho_{ij}$, the null hypothesis $H_0 : \rho_{ij} = 0$ is tested. If the null is rejected in favour of $H_0^c : \rho_{ij} \neq 0$, then taxa pair $(i, j)$ is classified as correlated. If the null is not rejected, the pair is classified as uncorrelated. Thus correlation network estimation can be treated as a classification problem. In the language of subsection 1.6.5, the task is to classify pairs of taxa S = {{Cyanobacteria, Planctomycetes}, {Planctomycetes, SUP05}, ...} as correlated or not (so T = {correlated, uncorrelated}) with correlation classifier (network estimation protocol) C. Interpreting correlation network estimation as a classification problem implicitly describes each C with precision-recall exchanges.
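The classification view above can be made concrete with a small sketch. It assumes a reference ("true") set of correlated taxa pairs is available, as in a simulation study; the function name `precision_recall` and the toy pairs below are hypothetical and are not results from this work.

```python
def precision_recall(predicted_pairs, true_pairs):
    """Return (precision, recall) for a set of taxa pairs classified as correlated.

    Pairs are unordered, so each is stored as a frozenset.
    """
    predicted = {frozenset(pair) for pair in predicted_pairs}
    truth = {frozenset(pair) for pair in true_pairs}
    true_positives = len(predicted & truth)
    # Precision: fraction of attributed edges that are correct.
    precision = true_positives / len(predicted) if predicted else float("nan")
    # Recall: fraction of true edges that were recovered.
    recall = true_positives / len(truth) if truth else float("nan")
    return precision, recall


# Made-up toy networks, for illustration only.
true_pairs = [
    ("Planctomycetes", "SUP05"),
    ("SUP05", "Marinimicrobia"),
    ("Thaumarchaeota", "Nitrospina"),
]
predicted_pairs = [
    ("SUP05", "Planctomycetes"),          # correct (order does not matter)
    ("Cyanobacteria", "Planctomycetes"),  # false attribution
]

precision, recall = precision_recall(predicted_pairs, true_pairs)
print(precision, recall)
```

Sweeping the significance threshold of a classifier C and recomputing these two numbers at each threshold traces out a precision-recall curve of the kind sketched in Figure 3.1.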
Correlation network estimation methods can thus be objectively compared and measured.

Unfortunately, it was recently observed that popular correlation survey techniques in Microbial Ecology suffer from poor precision-recall exchanges [267]. This means that precision-recall curves tend to be shaped according to Figure 3.1, implying that networks can be estimated with precision or recall, but rarely both. General graph structure may be preserved in some approximate sense, but individual edge attributions are in doubt. So if one points to an edge and asks "does this correlation really exist?", the answer is merely "maybe". The discovery of poor precision-recall exchanges for correlation network estimation in Microbial Ecology threatens their viable application, because it introduces incoherence into the abstraction.

3.1.1 The overfit hypothesis

SSU rRNA can be counted in almost arbitrarily-many clusters (see subsection 1.2.3), because the resolution is adjustable. Studying few low-resolution clusters from a phylogenetically diverse community will inevitably lead to oversimplifications. So studying many fine-resolution clusters is motivated. From a statistical modelling perspective, this means that for a sample's vector of SSU rRNA counts $Y \in \mathbb{Z}^p_{\geq 0}$, $p$ becomes large, often in the hundreds, thousands, or tens-of-thousands. All correlation network estimation methods, not just those surveyed by Weiss et al. [267], must evaluate every pair of these many dimensions for a statistically significant correlation. Indeed, despite applying some of the best existing theoretical solutions, this work must also reduce $p$ for a feasible solution.

Correlation estimation can be viewed generally as a form of multivariate regression. Through the modelling of dependencies, multivariate regression models tend to have many more parameters than their univariate counterparts (see subsection 1.6.2).
Simple multivariate models tend to have quadratically-many ($O(p^2)$) parameters, as in the following equation. So if $p$ is one thousand, then there are about one million parameters in need of estimation. In subsection 1.6.1, overfit was described as an estimation failure due to having too many parameters. In studying the Saanich Inlet community 112 SSU rRNA samples are modelled. Even a large SSU rRNA data set will never be large enough to adequately describe a one-million-dimensional space. Without more creative modelling, overfit is inevitable. The correlation tests evaluated by Weiss et al. [267] employ no mechanisms to significantly reduce the number of parameters estimated, and are thus prone to overfit. Overfit is known to reduce predictive capacity and is thus a possible cause of poor precision-recall exchanges in correlation network estimation in Microbial Ecology.

\[
\mathrm{Cor}(Y) = [\rho_{ij}]_{p \times p} =
\begin{bmatrix}
1 & \rho_{1,2} & \rho_{1,3} & \cdots & \rho_{1,p} \\
\rho_{1,2} & 1 & \rho_{2,3} & \cdots & \rho_{2,p} \\
\rho_{1,3} & \rho_{2,3} & 1 & \cdots & \rho_{3,p} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
\rho_{1,p} & \rho_{2,p} & \rho_{3,p} & \cdots & 1
\end{bmatrix}
\]

This line of reasoning supports the hypothesis that overfit has caused the high error rates observed by Weiss et al. [267]. Regularization is a broad category of methods for reducing overfit through some form of constraint (see subsection 1.6.3). While a statistical frontier, methods for dealing with high-dimensional dependence structures are available [37, 44, 218]. This work effectively pulls known solutions from the frontiers of statistical theory to implement practically feasible correlation network estimation software. To deal directly with the quadratic explosion ($O(p^2)$) of parameters, a factor model's covariance structure (see subsection 1.5.3) is used to model quadratically-many correlations with only linearly-many ($O(p)$) parameters. So the utilized regularization constraint is $\Sigma = LL^T + \Psi$ (expanded in the following equation), where $L \in \mathbb{R}^{p \times m}$, $\Psi \in \mathrm{diagonal}(\mathbb{R}^{p \times p}_{>0})$, and $m$ is small. This work uses $m = 3$.

\[
\Sigma =
\begin{bmatrix}
\sigma_{1,1} & \sigma_{1,2} & \sigma_{1,3} & \cdots \\
\sigma_{1,2} & \sigma_{2,2} & \sigma_{2,3} & \cdots \\
\sigma_{1,3} & \sigma_{2,3} & \sigma_{3,3} & \cdots \\
\vdots & \vdots & \vdots & \ddots
\end{bmatrix}
= LL^T + \Psi =
\begin{bmatrix}
l_{1,1} & \cdots & l_{1,m} \\
l_{2,1} & \cdots & l_{2,m} \\
l_{3,1} & \cdots & l_{3,m} \\
\vdots & & \vdots
\end{bmatrix}
\begin{bmatrix}
l_{1,1} & \cdots & l_{1,m} \\
l_{2,1} & \cdots & l_{2,m} \\
l_{3,1} & \cdots & l_{3,m} \\
\vdots & & \vdots
\end{bmatrix}^T
+
\begin{bmatrix}
\psi_{1,1} & 0 & 0 & \cdots \\
0 & \psi_{2,2} & 0 & \cdots \\
0 & 0 & \psi_{3,3} & \cdots \\
\vdots & \vdots & \vdots & \ddots
\end{bmatrix}
\]

Of available regularization approaches, the factor structure ($LL^T + \Psi$) has its own strengths and weaknesses relative to other solutions. Constrained optimization or shrinkage approaches are popular regularization methods [44], extending well beyond multivariate dependence modelling [164]. Shrinkage methods are not used in this work because they tend to treat the symptoms of the parameters' quadratic explosion ($O(p^2)$), and with $p$ large, the problem should be dealt with directly. This leaves a single primary alternative to consider, truncated vine copulas [37], which construct multivariate dependencies through a hierarchy of latent bivariate variables. Both truncated vine copulas and factor models can represent $O(p^2)$ correlations through $O(p)$ parameters, but there are some important differences. An advantage of vines over factor models is that they are capable of representing correlations as exactly zero: $\rho_{ij} = 0$ is possible. Factor models cannot achieve this without adverse effects, so for a sufficiently large sample size (well out of reach for most any SSU rRNA study) every test of $H_0 : \rho_{ij} = 0$ will be rejected. An advantage of using a factor model is that it is capable of detecting and modelling unobserved regressors, effectively adding a missed column to the $X$ matrix (see subsection 1.6.2). The factor model is used in this work, because an essential and unobserved variable is long-hypothesized to exist in SSU rRNA and RNA seq studies: sequencing depth. This is advocated through the mixed effect perspective of SSU rRNA correlation analysis (see subsection 1.3.1).
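To illustrate the constraint $\Sigma = LL^T + \Psi$, the following minimal sketch assembles a covariance matrix from a loading matrix and a diagonal, and compares parameter counts. The function names and the toy loadings are hypothetical, and the counts are simple entry counts (free entries of a symmetric matrix versus entries of $L$ plus the diagonal of $\Psi$), not a statement about identifiability.

```python
def factor_covariance(L, psi):
    """Assemble Sigma = L L^T + diag(psi).

    L is a p x m list of lists (factor loadings); psi is a list of p
    positive numbers (the diagonal of Psi).
    """
    p, m = len(L), len(L[0])
    sigma = [
        [sum(L[i][k] * L[j][k] for k in range(m)) for j in range(p)]
        for i in range(p)
    ]
    for j in range(p):
        sigma[j][j] += psi[j]
    return sigma


def parameter_counts(p, m):
    """Entry counts for an unconstrained versus a factor-structured covariance."""
    unconstrained = p * (p + 1) // 2  # free entries of a symmetric p x p matrix
    factor = p * m + p                # entries of L plus the diagonal of Psi
    return unconstrained, factor


# Toy one-factor example (made-up numbers), scaled so every variance is 1.
L_toy = [[0.9], [0.8], [-0.5]]
psi_toy = [0.19, 0.36, 0.75]
sigma_toy = factor_covariance(L_toy, psi_toy)
for row in sigma_toy:
    print([round(value, 3) for value in row])

# With p = 1000 taxa and m = 3 factors, growth is quadratic versus linear in p.
print(parameter_counts(1000, 3))
```

Every off-diagonal entry of the toy matrix is determined by the loadings alone, which is how quadratically-many correlations are described by linearly-many numbers.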
The validity of the mixed effect perspective over the compositional perspective, for this work, is advocated in subsection 3.3.2.

It is important to note that high-dimensional covariance estimation has previously been interpreted as a multiple comparison error problem [283]. False Discovery Rate (FDR) [24, 25] has been successfully applied toward univariate regression surveys [168, 227, 228]. However, multiple comparison corrections have not been designed for high-dimensional correlation surveys. This work does not evaluate them as a potential solution.

3.1.2 Precision-recall comparison

To maintain objectivity, this work's regularized model is compared against existing methods. Having phrased correlation network estimation as a classification problem, precision and recall are the metrics of comparison. The regularized model is compared against the Pearson correlation coefficient (PCC) and SparCC (Sparse Correlations for Compositional data). The Pearson correlation statistic converges to Cor(X, Y) for sufficiently large sample sizes when expectation integrals exist. SparCC is a popular method in Microbial Ecology, designed to accommodate compositional effects [95], and was surveyed by Weiss et al. [267]. To allow for confident precision-recall calculations, the correlation methods are compared in a simulation study.

This simulation study demonstrates that regularized estimation is necessary for increasing precision without loss of recall, thereby alleviating the concerns highlighted by Weiss et al. [267]. This bolsters the applicability of correlation networks in Microbial Ecology. Surveying a community for correlations is only useful if correlations can be attributed with confidence.
Without precision, there is no confidence.

[Figure 3.2: Average abundances of select chemical concentrations and taxa. Panels: O2, NO3, and H2S concentrations, and relative abundance of Thaumarchaeota, Nitrospina, Nitrospira, SUP05, MGA, and Planctomycetes, each plotted against depth.]

3.1.3 Saanich Inlet

Beyond precision-recall comparisons, objectivity is further maintained by applying the regularized correlation method to Saanich Inlet SSU rRNA data and contextualizing the findings in literature. Any discrepancies between existing understandings of Saanich Inlet community structure and the correlation network are explored. First, the findings will be contrasted against the conceptual (see subsection 1.4.4) and differential (see subsection 1.4.5) models. While all three models use environmental data, a significant difference between them is that the correlation network only utilizes SSU rRNA data, whereas the others use proteomic or multi-omic data. An advantage the correlative approach has is that its statistical survey over few data types enables high-throughput analysis, thus providing a description of years of data (2008-2011) instead of days. Second, known community structure (see Figure 3.2) should also be recovered. Third, of all correlations surveyed, the subset relating to SUP05 will be inspected, because observations in chapter 2 suggest the potential for SUP05 to currently be evolving toward a metabolically syntrophic relationship built on sulfur-driven denitrification.

Application of the regularized model to real data also allows an opportunity to test several modelling hypotheses. First, the negative binomial distribution is popular in SSU rRNA regression analyses [168, 185, 228], but there may be better models. Through model selection via AIC testing (see subsection 1.6.3), this work finds such an alternative.
Further, the question of whether SSU rRNA data follow the mixed effect or compositional perspectives will be commented on.

3.2 Methods

3.2.1 Multivariate construction

As stated in subsection 3.1.1, this work tests the ability of a regularization strategy to improve precision-recall exchanges in correlation network estimation. Interpreting regularization as a model constraint meant to reduce overfit, this work constrains the covariance matrix $\Sigma$ to a lower-dimensional structure, $\Sigma = LL^T + \Psi$. Model dimension (the number of free parameters) has long been suspected of causing errors, a concern that reaches across modern statistical theory [2]. For $p$ taxa, the constraint achieves a reduction in parameters from $O(p^2)$ to $O(p)$. In the applied problem, SSU rRNA correlation, $p$ is consistently large, so the reduction is motivated from the perspective of overfit.

However, choosing to constrain an abstract parameter is purely hypothetical without a particularly specified model. Factor models specify covariance matrices for multivariate Gaussian distributions. However, SSU rRNA data are never Gaussian-distributed (in $\mathbb{R}$), because they are count data (in $\mathbb{Z}_{\geq 0}$). So the regularizing solution complicates model selection. To overcome this issue, this work employs statistical copulas (see subsection 1.6.4), a tool which eases multivariate modelling by allowing the multivariate and univariate components to be selected separately. Via the application of copulas, the multivariate structure is now defined without specification of which marginal distributions are used.
So the model is abstractly defined as follows for arbitrary marginal distributions $F_{Y_{ij}}$.

Data:
  $Y \in \mathbb{Z}^{n \times p}_{\geq 0}$, observed counts
  $X \in \mathbb{R}^{n \times q}$, regressors
Parameters:
  $L \in \mathbb{R}^{p \times m}$, factor weights, $m$ small
  $\Psi \in \mathrm{diagonal}(\mathbb{R}^{p \times p}_{>0})$, factor model errors
Implicit parameters:
  $\Sigma \in \mathbb{R}^{p \times p}$, $\Sigma = LL^T + \Psi$, latent covariance matrix
Random variables:
  $Z \in \mathbb{R}^{n \times p}$, $Z_i \sim N_p(0, \Sigma)$, latent Gaussians
  $Y_{ij} \sim F_{Y_{ij}}$, observed marginal counts
Functions:
  $F_{Z_{ij}}(z_{ij}) = \Phi(z_{ij}; 0, [LL^T + \Psi]_{jj})$, marginal Gaussian distributions (see subsection 1.5.3)
  $F^{-1}_{Y_{ij}}(y_{ij})$, marginal counts' generalized inverse distribution function
Constraints:
  $Y_{ij} = F^{-1}_{Y_{ij}}(F_{Z_{ij}}(Z_{ij}))$, copula mechanism

3.2.2 Marginal model selection

Modelling with copulas has conveniently allowed marginal distributions to be dealt with abstractly, but they must be specified prior to model implementation. While copulas allow individual dimensions' (taxa's) marginal distributions to be different, having arbitrarily many dimensions (see subsection 1.2.3) motivates selection of a single model which is flexible enough to model every dimension. This work surveys a selection of models with AIC testing (see subsection 1.6.3).

There are many count models to choose from [46, 63], but an effective way of reducing candidates is to require a certain kind of flexibility: the ability to satisfy both complete under-dispersion and over-dispersion. An over-dispersed model $Y$ satisfies $\mathrm{Var}[Y] > EY$. An under-dispersed model satisfies $\mathrm{Var}[Y] < EY$. Not all count models satisfy both of these properties. For example, the Poisson distribution satisfies $Y \sim \mathrm{Poisson}(\lambda) \Rightarrow \mathrm{Var}[Y] = EY$. By the law of total variance, the large class of Poisson mixtures is never under-dispersed, and thereby the negative binomial and lognormal Poisson [100] are both strictly over-dispersed. Similarly, the binomial and thereby all multinomial marginals are strictly under-dispersed. Requiring no mean-variance constraints not only reduces candidate models, but also requires that models be able to reflect the realities the data are expressing.
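The two dispersion claims above follow from short calculations, sketched here. The mixing variable $\Lambda$, the trial count $N$, and the success probability $\pi$ are symbols introduced only for this illustration; $\Lambda \geq 0$ is assumed to have finite variance, with $Y \mid \Lambda \sim \mathrm{Poisson}(\Lambda)$.

```latex
% Poisson mixture: the law of total variance gives over- or equi-dispersion.
\[
\mathrm{Var}[Y]
= E\big[\mathrm{Var}[Y \mid \Lambda]\big] + \mathrm{Var}\big[E[Y \mid \Lambda]\big]
= E[\Lambda] + \mathrm{Var}[\Lambda]
\;\geq\; E[\Lambda] = E[Y].
\]
% Binomial: for Y ~ Binomial(N, \pi) with 0 < \pi < 1, the variance is below the mean.
\[
\mathrm{Var}[Y] = N\pi(1 - \pi) \;<\; N\pi = E[Y].
\]
```

Equality in the first line holds only when $\Lambda$ is almost surely constant, which recovers the plain Poisson case; any genuinely random $\Lambda$, such as the gamma mixing that yields the negative binomial, is strictly over-dispersed.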
Constraints should come from data, not from models.

Requiring complete mean-variance flexibility implicitly and ironically requires that candidate count models have a single mean-variance constraint. By Result 2, if $Y$ is a count model ($Y \in \mathbb{Z}_{\geq 0}$), then $\mathrm{Var}[Y] \geq (EY - \lfloor EY \rfloor)(\lceil EY \rceil - EY)$. This is the minimum variance bound satisfied by every count model.

Result 2. If $N$ is a count variable (a random variable in $\mathbb{Z}_{\geq 0}$ with probability one) with mean $\mu$ and variance $\sigma^2$, then $\sigma^2 \geq (\mu - \lfloor \mu \rfloor)(\lceil \mu \rceil - \mu)$.

Proof sketch of Result 2. Condition the variance of $N$ on $E[N \mid N < \mu] 1_{N < \mu} + E[N \mid N \geq \mu] 1_{N \geq \mu}$, where $1_X$ is an indicator function.

The negative binomial distribution has been popular in SSU rRNA analysis [168, 185, 228]. It is a good choice because SSU rRNA data tend to be over-dispersed and it is mathematically simple, making computation faster and easier to implement. It is an imperfect choice, because it cannot model under-dispersed data and it cannot model data from distributions with very heavy tails. Not being able to model under-dispersed data denies modellers the ability to accurately predict taxa values when it might actually be possible. Infinite variance occurs when expectation is finite ($EY < \infty$), but variance is not ($EY^2 = \infty$). This work explores the possibility of SSU rRNA models with non-finite and undefined variance. For these reasons, other models are considered in this work. Univariate models are objectively compared with AIC statistics in subsection 3.3.1. The following models are considered.

1. Negative Binomial distribution
   Despite not satisfying the mean-variance freedom requirements postulated by this work, this popular model is a candidate for AIC testing. This work uses the MASS [264] (chapter 7.4) library's implementation of the Negative Binomial, with location parameter $\mu$, dispersion parameter $\nu$, and with the following probability mass function.
   \[
   f_Y(y; \mu, \nu) = \frac{\Gamma(y + \nu)}{\Gamma(\nu)\, y!} \, \frac{\mu^y \nu^\nu}{(\mu + \nu)^{y + \nu}}, \qquad E[Y] = \mu, \quad \mathrm{Var}[Y] = \mu + \mu^2/\nu
   \]

2. Conway-Maxwell-Poisson distribution
   The Conway-Maxwell-Poisson distribution (CMP) [62] was initially derived from a queuing model and has seen recent interest [117, 165, 166, 235, 242] because of its ability to model both under-dispersion ($E[Y] \geq \mathrm{Var}[Y]$) and over-dispersion ($E[Y] \leq \mathrm{Var}[Y]$) in count data. The CMP has the following mass function.
   \[
   P[X = x] = f_{CMP}(x; \lambda, \nu) = \frac{\lambda^x}{(x!)^\nu} \left( \sum_{j=0}^{\infty} \frac{\lambda^j}{(j!)^\nu} \right)^{-1}
   \]
   Notice that the CMP generalizes the Poisson distribution, which is recovered at $\nu = 1$. While the capacity for both over- and under-dispersion suggested the CMP may potentially satisfy our requirement for separate and unbounded location and dispersion parameters, it was not meant to be. In Appendix B, this work proves that a CMP with mean $\mu$ and variance $\sigma^2$ satisfies $\sigma^2 < \mu(\mu + 1)$. Because SSU rRNA data sometimes exhibit extreme variability, a variance upper-bound is inadmissible.

3. Hagmark class
   In Hagmark [118], the following transform is proven to construct count distributions from absolutely continuous, positive random variables' distributions with unconstrained means and variances. Most importantly, the count variable $N$ and continuous, positive variable $Z$ satisfy $E[N] = E[Z]$, and $N$ will have a variance near the variance of $Z$ (see Hagmark [118] for exact details). The transform defining a count model $N$ from a continuous model $Z$ is the following.
   \[
   P[N \leq n] = \int_n^{n+1} P[Z \leq z] \, dz
   \]
   In Hagmark [119], a special Poisson distribution generalization capable of all means and variances is derived.
   This work applies the Hagmark transform to the Gamma and Log Normal distributions, because their expected values are guaranteed to exist. This allows link functions to always link to the expected value.

4. Floor class
   This class of models is motivated through computational pragmatism. To transform any continuous, positive random variable $Z$ into a count variable $N$ with the floor transform, simply take the floor $N = \lfloor Z \rfloor$. Instead of linking to the expected value, $e^{X\beta}$ links to the median $\mathrm{med}(N) = \mathrm{med}(\lfloor Z \rfloor) = \lfloor \mathrm{med}(Z) \rfloor$ by linking directly to $\mathrm{med}(Z)$.
   Further, $N$ is capable of arbitrarily large and small variances. Continuous, positive random variables are selected for their known medians. This work transforms the Log Normal, Log Cauchy, and Log Student t distributions. Despite the Log Student t distribution generalizing both the Log Normal and Log Cauchy, it does so at the cost of an additional parameter, and so may perform worse in AIC testing.

3.2.3 Full model definition

In subsection 3.3.1, it is shown that the Log Student t distribution achieves the lowest AIC values most often. With the multivariate structure defined abstractly, contingent only on specification of marginal distributions, a specific model is defined as follows.

Data:
  $Y \in \mathbb{Z}^{n \times p}_{\geq 0}$, taxa count data
  $X \in \mathbb{R}^{n \times q}$, environmental data, $q < p$
Parameters:
  $\beta \in \mathbb{R}^{q \times p}$, regressor weights
  $L \in \mathbb{R}^{p \times m}$, $\Psi \in \mathrm{diag}(\mathbb{R}^{p \times p})$, factor model parameters
  $\sigma^2 \in \mathbb{R}^p_{>0}$, marginal scale parameters
  $\nu \in \mathbb{R}^p_{>0}$, marginal tail parameters
Implicit parameters:
  $\mu = \exp(X\beta) \in \mathbb{R}^{n \times p}_{>0}$, marginal location parameters
  $\Sigma = LL^T + \Psi \in \mathbb{R}^{p \times p}_{>0}$, covariance matrix
Random variables:
  $T_\nu \in \mathbb{R}^{n \times p}_{>0}$, $[T_\nu]_{ij} \sim \text{Student-t}(\nu)$, marginal variables
  $Z \in \mathbb{R}^{n \times p}$, $[Z]_i \sim_{iid} \mathrm{Normal}(0, \Sigma)$, latent Gaussians
Functions:
  $F_{Z_{ij}}(z) = P[Z_{ij} \leq z]$, a distribution function
  $F_{Y_{ij}}(y) = P[Y_{ij} \leq y]$, a distribution function
  $F^{-1}_{Y_{ij}}(p)$, $F_{Y_{ij}}(y)$'s generalized inverse
Constraints:
  $Y_{ij} = \lfloor (\mu_{ij} e^{[T_{\nu_j}]_{ij}})^{\sigma_j} \rfloor$, marginal model
  $Y_{ij} = F^{-1}_{Y_{ij}}(F_{Z_{ij}}(Z_{ij}))$, copula mechanism

Calculation of the probabilities $P[Y_i = y_i]$ is necessary during estimation protocols, but requires integration. This is because for any observed $Y_i$, $Z_i$ is only constrained to a potentially unbounded hypercubic region in $\mathbb{R}^p$. This is a high-dimensional numerical integral which would be computationally intractable (see subsection 1.7.1) if not for the application of the factor model. Through conditional expectations (see subsection 3.2.4), the $p$-dimensional integral can be broken into $p$ $m$-dimensional integrals.
This work uses $m = 3$, so the integrals are computationally tractable.

3.2.4 Estimation

The probability model is parameterized by $\theta = (\beta, \log \sigma, \log \nu, L, \log \Psi) \in \mathbb{R}^{p(3 + q + m)}$. It is flexible. Within $\theta$ there are $p(3 + q + m)$ knobs which adjust our model when turned. The likelihood function $f_{Y|X}(y; x, \theta) = P_\theta[Y = y \mid X = x]$ is the probability that the entire experimental sample $(y, x)$ is observed under a particular parameterization $\theta$. For most choices of $\theta$ the model is absurd, and for one $\hat\theta$ the model is a most likely representation of the data. $\hat\theta$ is the maximum likelihood estimate (MLE) (see subsection 1.6.1). MLEs are solutions to non-linear programs (see subsection 1.7.2). Practical solving of non-linear programs recognizes that optimization protocols often require a good first guess to work reliably. Therefore MLE calculation is broken into two major steps: (1) a heuristic initialization, followed by (2) running a non-linear program solving protocol.

Initialization

Calculation of the initial guess for the non-linear program, $\hat\theta_0$, is done in a series of layered heuristics. Early layers are numerically stable and inaccurate, while later layers are more numerically delicate and much more accurate. Except for the first layer, every layer needs an initial guess. This layering of methods from robust to accurate is important, because the final stages are delicate. For example, if a probability model is initialized at an absurd initial guess $\theta_0$, then $f_{Y|X}(y; x, \theta_0) \approx 0$ and the computer's floating point representation system will underflow (see section 1.7), deciding the value is exactly zero. Optimization cannot occur if all values near the initial value are so absurd that they decide the data set only occurs with probability zero. The layers of estimates are the following.

1. $\hat\beta_0 = \mathrm{OLS}(\log(Y + 1) \sim X)$
   Calculate $\hat\beta_0$ with an ordinary least squares (OLS) estimation of $\log(Y + 1)$ regressed on $X$. Counts are converted to log scale because the exponential link function is used.
   One is added to $Y$ to avoid non-finite values after the logarithm.

2. $(\hat\sigma_0, \hat\nu_0) = \mathrm{MM}(Y, X; \hat\mu_0 = \exp[X\hat\beta_0])$
   With a first guess for each marginal's regressor weights $\hat\beta_0$, calculate the remaining marginal distribution parameters $(\hat\sigma_0, \hat\nu_0)$ via method of moments (MM) estimators. However, since the Student t($\nu$) doesn't always have a finite expected value, this work uses median-based estimates. With other models, MM estimates are sufficient for this step.

3. $(\hat\beta_1, \hat\sigma_1, \hat\nu_1) = \text{Univariate MLE}(Y, X; \hat\beta_0, \hat\sigma_0, \hat\nu_0)$
   With robust estimates computed, non-linear optimization protocols are employed. However, these solvers are not applied toward the entire system, but to each marginal distribution separately. This produces high-quality initial maximum likelihood estimates for each marginal distribution.

4. $\hat z = [\hat z_j] = [\Phi^{-1}(F_{Y_j}(Y_j, X_j; [\hat\beta_1]_j, [\hat\sigma_1]_j, [\hat\nu_1]_j))]$
   This step does not actually calculate an estimate, but prepares data for the final layer of estimation. Here, approximate samples are generated for the latent multivariate Gaussian $Z$. The approximate sample $\hat z$ will be used to calculate initial estimates for the copula structure, because it's easier to estimate parameters for an observed variable than an unobserved variable. This is done by calculating the distribution functions, $F_Y(Y, X; \hat\beta_1, \hat\sigma_1, \hat\nu_1)$, for each count in $Y$. The result is a matrix in $[0, 1]^{n \times p}$ of probabilities which are then inverted into approximate observations with the inverse of the normal distribution function, $\Phi^{-1}(\cdot) = F^{-1}_{N_p}(\cdot; 0, I)$.

[Figure 3.3: This work's initial correlation estimates compared to a more robust method. Axes: this work's initial correlations; bivariate copula correlations.]

5. $(\hat L_1, \hat\Psi_1) = \text{Eigen-decomposition Factor Estimate}(\mathrm{Cor}(\hat z))$
   To initialize the $(L, \Psi)$ estimate, this work relies on a method described in Johnson and Wichern [139], section 9.3. First, the correlation matrix $R = \mathrm{Cor}(\hat z)$ is calculated. Then, because the model has $m$ factors, calculate the first $m$ eigenvectors $v_k$ and eigenvalues $\lambda_k$.
   Each column of $\hat L_1$ will be $[\hat L_1]_k = \sqrt{\lambda_k} v_k$, and each diagonal element of $\hat\Psi_1$ will be $[\hat\Psi_1]_{jj} = [I - \hat L_1 \hat L_1^T]_{jj}$.

6. $\hat\theta_0 = (\hat\beta_1, \hat\sigma_1, \hat\nu_1, \hat L_1, \hat\Psi_1)$
   The initial value for the entire model's non-linear optimization routine, $\hat\theta_0$, is just a concatenation of all best estimates thus far.

It is important to note that a more robust method exists for calculating $(\hat L_1, \hat\Psi_1)$. Instead, the correlation matrix which generated them could have been generated by calculating a matrix of latent correlations. Each correlation is an MLE for a bivariate Gaussian copula model. The matrix of correlations can then be projected to a nearest covariance matrix, then converted to a correlation matrix. This robust method is compared with the initialization procedure employed by this work in Figure 3.3. The estimates have a correlation of 0.419, and so have an operable similarity in this case. These initial estimates need only be close enough, since their differences would be eliminated after applying a non-linear optimizing protocol. The bivariate copula method is more robust, because it works for small $p$, where the implemented method may not.

Non-linear optimization

With an initial estimate $\hat\theta_0$, non-linear optimization may begin. Ideally, this work would employ the L-BFGS routine (subsection 1.7.2) with high-quality gradient calculations, but for the sake of project brevity, a series of univariate optimizations are performed with simple Newton-Raphson optimizations, and five-point stencils are used to estimate first and second derivatives (see subsection 1.7.1). While this software is interfaced through R [221], the pre-implemented L-BFGS routine cannot be used with GPU acceleration due to inefficient job batching. The protocol requests too few simultaneous jobs to overcome warp divergence (see subsection 1.7.3).

GPU acceleration is used for this project, because of substantial compute requirements. There are three major causes of substantial compute requirements.
First, the hypercubic integrals (see subsection 3.2.3) are calculated with MC-integration (see subsection 1.7.1). Second, statistical significance for correlation tests $H_0 : \rho_{ij} = 0$ is computed through bootstrapping. Third, many networks are generated for the full-factorial simulation study (see subsection 3.2.5) used to evaluate the method. GPU calculation is motivated over grid compute methods such as MapReduce [68] or MPI [112], because the overall floating point operations per second were expected to be higher.

The primary caveat in programming for GPUs is that warp divergence must be minimized. Warp divergence, the issuance of different instructions to the same batch of threads, can cause a GPU compute job to run slower than if run on a CPU. To avoid warp divergence, code execution patterns must be predictable. Divergences must be organized into separate thread blocks. This work achieves predictable log likelihood $\log f_{Y|X}(y|x; \theta)$ calculation primarily through two strategies: (1) $\log f_{Y|X}(y|x; \theta)$ is algebraically broken into summable components, allowing separate calculations to be organized accordingly; and (2) integration over the $m$-dimensional Gaussian hypercubes is calculated with an MC-integral, likening it to the work of Genz [101], rather than an unpredictable quadrature routine [217]. The algebraic decomposition of $\log f_{Y|X}(y|x; \theta)$ into summed components is done as follows.

\[
\begin{aligned}
\log f_{Y|X}(y|x; \theta)
&= \log P[Y = y \mid X = x] \\
&= \log P[Y = y] \quad \text{(suppress conditional notation for brevity)} \\
&= \log \prod_{i=1}^{n} P[Y_i = y_i] = \sum_{i=1}^{n} \log P[Y_i = y_i] \\
&= \sum_{i=1}^{n} \log\Big( P(Z_i \in [a,b]_i) \prod_{j=1}^{p} P[Y_{ij} = y_{ij}] \Big) \\
&= \sum_{i=1}^{n} \Big( \log P(Z_i \in [a,b]_i) + \sum_{j=1}^{p} \log P[Y_{ij} = y_{ij}] \Big) \\
&= \sum_{i=1}^{n} \Big( \log P(L F_i + \Psi^{1/2} E_i \in [a,b]_i) + \sum_{j=1}^{p} \log P[Y_{ij} = y_{ij}] \Big) \\
&= \sum_{i=1}^{n} \Big( \log E\big[ P(\Psi^{1/2} E_i \in [a,b]_i - L F_i \mid F_i) \big] + \sum_{j=1}^{p} \log P[Y_{ij} = y_{ij}] \Big) \\
&\overset{a.s.}{=} \sum_{i=1}^{n} \Big( \log \lim_{K \to \infty} K^{-1} \sum_{k=1}^{K} P(\Psi^{1/2} E_i \in [a,b]_i - L F_i \mid F_i = f_{ik}) + \sum_{j=1}^{p} \log P[Y_{ij} = y_{ij}] \Big) \\
&= \lim_{K \to \infty} \sum_{i=1}^{n} \Big( -\log K + \log \sum_{k=1}^{K} P(\Psi^{1/2} E_i \in [a,b]_i - L F_i \mid F_i = f_{ik}) + \sum_{j=1}^{p} \log P[Y_{ij} = y_{ij}] \Big) \\
&= \lim_{K \to \infty} \sum_{i=1}^{n} \Big( -\log K + \log \sum_{k=1}^{K} \prod_{j=1}^{p} P(\Psi^{1/2}_{jj} E_{ij} \in [a,b]_{ij} - [L F_i]_j \mid F_i = f_{ik}) + \sum_{j=1}^{p} \log P[Y_{ij} = y_{ij}] \Big) \\
&= \lim_{K \to \infty} \sum_{i=1}^{n} \Big( -\log K + [\mathrm{LS}]_{k=1}^{K} \log \prod_{j=1}^{p} P(\Psi^{1/2}_{jj} E_{ij} \in [a,b]_{ij} - [L F_i]_j \mid F_i = f_{ik}) + \sum_{j=1}^{p} \log P[Y_{ij} = y_{ij}] \Big) \\
&= \lim_{K \to \infty} \sum_{i=1}^{n} \Big( -\log K + [\mathrm{LS}]_{k=1}^{K} \sum_{j=1}^{p} \log P(\Psi^{1/2}_{jj} E_{ij} \in [a,b]_{ij} - [L F_i]_j \mid F_i = f_{ik}) + \sum_{j=1}^{p} \log P[Y_{ij} = y_{ij}] \Big)
\end{aligned}
\]

where
$[a,b]_{ij} = F^{-1}_{Z_{ij}}(G_{Y_{ij}}(y_{ij}; \mu_{ij}, \sigma_j, \nu_j); [LL^T + \Psi]_{jj})$,
and $G_{Y_{ij}}(y_{ij}; \mu_{ij}, \sigma_j, \nu_j) = \{u \in [0,1] : y_{ij} = F^{-1}_{Y_{ij}}(u; \mu_{ij}, \sigma_j, \nu_j)\}$,
and $[a,b]_i = \times_{j=1}^{p} [a,b]_{ij}$ is a hypercube,
and $f_{ik} \sim N_m(0, I_m)$ is an MC-simulant,
and "$\overset{a.s.}{=}$" is equivalence with probability one,
and the $\lim_{K \to \infty}$ is applied via the strong law of large numbers,
and $K$ is the number of MC iterates,
and $[\mathrm{LS}]_{k=1}^{K}$ is a log sum.

A log sum is possible by applying the identity $\log(a + b) = \log a + \log(1 + \exp[\log b - \log a])$. Log sums are used to maintain logarithmic scale during integration. Logarithmic scale is important because the high-dimensional integral is over small-measured (potentially unbounded) hypercubes in $\mathbb{R}^p$. The hypercubic probabilities are so small that they tend to cause underflows if not kept on log scale.

An additional detour from popular methods is required to avoid warp divergence. This algorithm must compute many Student $t(\nu)$ distribution functions, which are usually calculated via the following identity for $t \ge 0$ (values for $t < 0$ follow by symmetry).

\[
\int_{-\infty}^{t} f_T(u; \nu)\, du = 1 - \tfrac{1}{2} I_{\frac{\nu}{t^2 + \nu}}\!\left(\tfrac{\nu}{2}, \tfrac{1}{2}\right),
\quad \text{where } I_x(a, b) = \frac{B(x; a, b)}{B(1; a, b)},
\quad B(x; a, b) = \int_0^x s^{a-1} (1 - s)^{b-1}\, ds
\]

The $I_x(a, b)$ function is the incomplete beta ratio function. On CPUs processing in double precision, $I_x(a, b)$ is calculated with the TOMS708 library [39, 75]. This work's software computes in single precision (which is faster for many GPUs), and cannot afford to break the computation into the many cases used by the TOMS708 algorithm, because unpredictable code execution results in warp divergence.
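
The per-sample term of the decomposition derived earlier in this subsection, $-\log K + [\mathrm{LS}]_{k=1}^{K} \sum_{j} \log P(\cdot)$, can be illustrated with a small CPU sketch. This is not this work's CUDA implementation: the function names (`log_sum_exp`, `log_interval_prob`, `mc_log_hypercube_prob`) are hypothetical, the bounds `a` and `b` are assumed to be given, and the interval probability is taken as a plain difference of normal distribution functions, which would itself underflow far in the tails.

```python
import math
import random
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal, used for interval probabilities


def log_sum_exp(log_terms):
    """Return log(sum(exp(t) for t in log_terms)) while staying on log scale."""
    peak = max(log_terms)
    if peak == -math.inf:
        return -math.inf
    return peak + math.log(sum(math.exp(t - peak) for t in log_terms))


def log_interval_prob(lo, hi, shift, psi_jj):
    """log P(lo < shift + sqrt(psi_jj) * E <= hi) for E ~ N(0, 1)."""
    sd = math.sqrt(psi_jj)
    prob = _STD_NORMAL.cdf((hi - shift) / sd) - _STD_NORMAL.cdf((lo - shift) / sd)
    return math.log(prob) if prob > 0.0 else -math.inf


def mc_log_hypercube_prob(a, b, L, psi, K, rng):
    """Estimate log P(L F + Psi^(1/2) E in [a, b]) from K draws of F ~ N_m(0, I).

    a, b : length-p lists of lower and upper bounds (assumed given)
    L    : p x m factor loadings as a list of rows
    psi  : length-p list holding the diagonal of Psi
    """
    p, m = len(L), len(L[0])
    log_terms = []
    for _ in range(K):
        f = [rng.gauss(0.0, 1.0) for _ in range(m)]  # one MC simulant f_ik
        total = 0.0
        for j in range(p):
            shift = sum(L[j][k] * f[k] for k in range(m))  # [L f]_j
            total += log_interval_prob(a[j], b[j], shift, psi[j])
        log_terms.append(total)  # sum over j of log probabilities
    return -math.log(K) + log_sum_exp(log_terms)


estimate = mc_log_hypercube_prob(
    a=[-1.0, -1.0], b=[1.0, 1.0],
    L=[[0.5], [0.5]], psi=[0.75, 0.75],
    K=2000, rng=random.Random(1),
)
```

Every simulant executes the same instructions regardless of the data, which is the property the text relies on to avoid warp divergence; the sketch only mirrors that structure serially.
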
To avoid the many-case structure of TOMS708, $I_x(a, b)$ is broken into the following three easy-to-predict cases, and computed to single-precision accuracy.

1. When $\nu > 10^5$, a normal-distribution approximation is used.
2. When $t^2 < \nu$, the identity $I_x(a, b) = I_x(a + 1, b) + \frac{x^a (1 - x)^b}{a\, B(1; a, b)}$ is iterated.
3. Otherwise an older t-distribution algorithm is applied, namely ACM Algorithm 395 [129].

Statistical significance via bootstrapping

All estimation exists to serve statistical testing, which effectively decides which taxa-pairs are correlated or not. With one test for significance per correlation, $H_0 : \rho_{ij} = 0$, likelihood ratio tests are infeasible, because they would require calculating a different MLE per $\rho_{ij}$. A good alternative might be employing a normal approximation with the observed inverse Fisher Information estimating covariances. However, with so many parameters, it is not obvious that asymptotic assumptions ($n \to \infty$) are truly satisfied. Further, testing for $L$ or $\Psi$ significance is insufficient, because the implicit $\Sigma = LL^T + \Psi$ must be tested per-entry instead. Fisher Information of transformed variables is calculable but assumes the existence of many derivatives, which may not actually be well-defined. With so many uncertainties, a bootstrapping [83] approach is motivated (see subsection 1.6.5). Employing the bootstrap overcomes delicate theory with expensive computational work. The bootstrap works by randomly resampling from the data set and calculating an estimate $\hat\theta$ per resample, thereby empirically reconstructing the sampling distribution.

In subsection 1.3.1, the compositional and mixed effect perspectives of SSU rRNA multivariate structure were described. Both are effects which obfuscate the community's authentic correlation structure. The primary strategy of this chapter is to increase correlation-attribution precision through the factor model's parameter reduction, but the factor model also implicitly agrees with the mixed effect perspective.
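
The bootstrap resampling described in this subsection can be sketched as follows. This is only an illustration under stated assumptions: a Pearson correlation stands in for the full model estimate $\hat\theta$ (this work re-estimates the regularized model per resample), significance is decided with an equal-tailed percentile interval (the exact decision rule is not specified in the text), and all function names are hypothetical.

```python
import math
import random


def pearson(x, y):
    """Sample Pearson correlation; returns 0.0 if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((v - mean_x) ** 2 for v in x)
    syy = sum((v - mean_y) ** 2 for v in y)
    if sxx == 0.0 or syy == 0.0:
        return 0.0
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(x, y))
    return sxy / math.sqrt(sxx * syy)


def bootstrap_estimates(x, y, n_boot, rng):
    """Resample sample indices with replacement and re-estimate per resample."""
    n = len(x)
    estimates = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        estimates.append(pearson([x[i] for i in idx], [y[i] for i in idx]))
    return estimates


def percentile_interval(estimates, alpha):
    """Equal-tailed (1 - alpha) percentile interval of the bootstrap estimates."""
    ordered = sorted(estimates)
    last = len(ordered) - 1
    lower = ordered[math.floor((alpha / 2) * last)]
    upper = ordered[math.ceil((1 - alpha / 2) * last)]
    return lower, upper


def is_significant(estimates, alpha):
    """Reject H0: rho = 0 when the percentile interval excludes zero."""
    lower, upper = percentile_interval(estimates, alpha)
    return lower > 0.0 or upper < 0.0


# Toy data (hypothetical): y depends linearly on x, so rho should test as non-zero.
x = [float(i) for i in range(30)]
y = [2.0 * v + 1.0 for v in x]
boot = bootstrap_estimates(x, y, n_boot=200, rng=random.Random(0))
decision = is_significant(boot, alpha=0.05)
```

The cost structure is the point of the sketch: each bootstrap iterate repeats the whole estimation, which is why 1000 iterates of the full model motivate GPU acceleration.
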
The factor model $Z_i \overset{iid}{\sim} N_p(0, LL^T + \Psi)$ is implicitly equivalent to the sum of two Gaussian random variables $(F_i, E_i) \in \mathbb{R}^m \times \mathbb{R}^p$, via $Z_i = L F_i + \Psi^{1/2} E_i$. Each of the $m$ dimensions of $F_i$ is called a factor. This allows any of the $m$ dimensions of each $F_i$ to act as a mixed effect. In subsection 3.3.2, this work also builds evidence toward the hypothesis that Saanich Inlet SSU rRNA data does indeed follow the mixed effect perspective, and also argues that the first dimension of $F_i$ models the mixed effect of sequencing depth.

Embracing the mixed effect perspective, this work modifies bootstrapped estimates of $\Sigma$ by removing the mixed effect. This is equivalent to separating $L$ into two components $L = [L_1, L_{-1}]$, where $L_1$ is the first column of $L$ and $L_{-1}$ is the matrix of the remaining $m - 1$ columns of $L$. So while $\Sigma = LL^T + \Psi$, this work actually estimates bootstrapped values of $\Sigma_{-1} = L_{-1} L_{-1}^T + \Psi$.

3.2.5 Precision-recall comparison

As described in subsection 3.1.2, a simulation study is used to objectively compare correlation network estimation methods via the metric of precision-recall exchanges. However, many important variables exist beyond the choice of correlation statistic. Further, all three correlation statistics compared are designed for different use cases. To understand the inter-dependencies between these variables, the simulation study is conducted as a full-factorial experiment. Full-factorial experiments follow a special experimental design which enables full testing of all pair-wise dependencies between variables [188]. Recall that each correlation statistic is only a classifier for a specific Type-I error rate $\alpha$ (p-value cutoff). Note that the full-factorial experiment is replicated once per $\alpha \in \{0.001, 0.01, 0.05, 0.1, 0.2, 0.3, 0.4\}$, and precision-recall curves are generated over the $\alpha$ values. The variables of the full-factorial experiment are the following.

1. Statistic: Regularized, SparCC, or Pearson.
2. Number of samples: one of 200 or 1000.
3. Regressor effects: yes or no.
If yes, simulants' marginal location parameters are regressed against environmental parameters. Environmental measurements and $\beta$ weights are randomly selected from actual model fits.
4. Sparsity: one of 0.01 or 0.5. This is the proportion of simulants' correlation structure elements which are non-zero.
5. Simulant model: compositional or mixed effect. Specific probability models are used to generate the random data with known correlations. One model follows the compositional perspective, while the other follows the mixed effect perspective. Model definitions are provided below. Note that neither model agrees perfectly with the specifications of either this work's regularized method or SparCC.

The compositional model is a logistic normal multinomial (LNM) [277]. It is specified as follows.

Data:
  $Y \in \mathbb{Z}_{\ge 0}^{n \times p}$, observed counts
  $X \in \mathbb{R}^{n \times q}$, regressors
Parameters:
  $\beta \in \mathbb{R}^{q \times p}$, regressor weights
  $\Sigma \in \mathbb{R}^{p \times p}$, covariance matrix
Implicit parameters:
  $\mu \in \mathbb{R}^{n \times p}$, $\mu_{ij} = e^{X_i \beta_j}$, regressed means
  $P \in (0, 1)^{n \times p}$, $P_{ij} = e^{Z_{ij}} / (1 + \sum_{j'=1}^{p} e^{Z_{ij'}})$, probabilities
Random variables:
  $Z \in \mathbb{R}^{n \times p}$, $Z_i \overset{iid}{\sim} N_p(\mu_i, \Sigma)$, latent Gaussians
  $N \in \mathbb{Z}_{>0}^{n}$, $N \overset{iid}{\sim}$ empirically sampled
  $Y_i \overset{iid}{\sim} \mathrm{Multinomial}(P_i, N_i)$

The mixed effect model uses a copula mechanism (see subsection 1.6.4) to join negative binomial distributions to a multivariate structure with a mixed effect. This work refers to it as the negative binomial with Gaussian copula (NBGC) (see section 1.6.2 for the negative binomial definition used). It is defined as follows.
Notice that the mixed effect component obfuscates the underlying correlation structure.

Data:
  $Y \in \mathbb{Z}_{\ge 0}^{n \times p}$, observed counts
  $X \in \mathbb{R}^{n \times q}$, regressors
Parameters:
  $\beta \in \mathbb{R}^{q \times p}$, regressor weights
  $\Sigma \in \mathbb{R}^{p \times p}$, covariance matrix
  $\sigma^2 \in \mathbb{R}_{>0}^{p}$, marginal variances
  $L \in \mathbb{R}^{p}$, mixed effect vector
Implicit parameters:
  $\mu \in \mathbb{R}_{>0}^{n \times p}$, $\mu_{ij} = e^{X_i \beta_j}$, marginal expected values
Random variables:
  $Y_{ij} \sim \mathrm{NegativeBinomial}(\mu_{ij}, \sigma_j^2)$, marginal distributions
  $Z \in \mathbb{R}^{n \times p}$, $Z_i \overset{iid}{\sim} N_p(0, LL^T + \Sigma)$, latent Gaussians
Functions:
  $F_{Y_{ij}}(y_{ij}; \mu_{ij}, \sigma_j^2)$, marginal distribution function
  $F^{-1}_{Y_{ij}}(p; \mu_{ij}, \sigma_j^2)$, generalized inverse of a marginal distribution function
  $F_{Z_j}(z_{ij}; 0, [LL^T + \Sigma]_{jj})$, latent Gaussian's marginal distribution function
Constraints:
  $Y_{ij} = F^{-1}_{Y_{ij}}(F_{Z_j}(Z_{ij}; 0, [LL^T + \Sigma]_{jj}); \mu_{ij}, \sigma_j^2)$

3.2.6 Saanich Inlet

As described in subsection 3.1.3, objective evaluation of the regularized correlation method is extended beyond a simulation study and into real-world SSU rRNA data. Using data from Saanich Inlet (see section 1.4), findings can be contextualized amongst current models of the denitrifying community's structure. Further, the data provide meaningful evidence in evaluating which univariate count models best describe SSU rRNA data (see subsection 3.3.1), and also in evaluating whether a compositional or mixed effect perspective better describes the data (see subsection 3.3.2).

A total of 91 SSU rRNA samples are used, sampled from 2007 to 2011, from depths of 10 m to 215 m. Concentrations of O2, NO3−, and H2S are paired with each sample. The SSU rRNA data are converted to multivariate count data with QIIME (see subsection 1.2.3). From a regression perspective, the SSU rRNA data are represented in the dependent (count) variable $Y \in \mathbb{Z}_{\ge 0}^{n \times p}$, and the environmental concentrations are the regressors (independent variables) $X \in \mathbb{R}^{n \times q}$, where $n = 91$, $q = 3$, and $p = 57$. Regressors used are concentrations of O2, NO3−, and H2S, because they are known to be important variables.
Depth is omitted as a regressor, because it is strongly correlated with O2, and thus might decrease statistical power in later testing. Choosing $q = 3$ is justified through compute time restrictions. These data are processed through the estimation pipeline described in subsection 3.2.4, and statistical significance is decided with 1000 bootstrap iterates (see subsection 1.6.5).

Reducing phylogenetic resolution

Choosing $p = 57$ is motivated by the primary contribution of this chapter, precise correlations through parameter reduction (illustrated in subsection 3.1.1). Pragmatically, the $O(p^2) \to O(p)$ reduction still requires $p$ to be at least comparable to $n$. So while estimating a model of the whole community is desirable, it is not yet feasibly estimable. Keeping the number of taxa $p$ small can be achieved through two methods: (1) throwing out clades, and (2) summing up SSU rRNA counts within clades. The advantage of throwing out clades is that high-resolution descriptions of the remaining clades are made possible, but it runs the risk of falsely attributing correlations due to omitted taxa (see section 1.5.3). So discarding information runs the risk of supporting false interpretations which are otherwise avoidable. Instead this work takes the second option, summing counts within clades, because it reduces the chance of false inference only at the cost of a coarser-grained description of the system. Fortunately, clades need not always be selected at the same taxonomic level, allowing greater description of a few target taxa. All clades are summed up to the phylum level, except for SUP05, Nitrospina, and Nitrospira. In total this produces $p = 57$ counted clades. Just as in precision-recall exchanges, the desire for descriptive breadth will always be at odds with confident inference.
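
The second option, summing counts within clades while keeping a few target taxa separate, can be sketched as below. The function name `collapse_to_clades`, the OTU labels, and the toy counts are hypothetical; the actual clade assignments in this work come from the QIIME taxonomy.

```python
from collections import defaultdict


def collapse_to_clades(counts, phylum_of, keep_separate):
    """Sum per-sample counts within clades.

    counts        : {taxon: [count in sample 0, count in sample 1, ...]}
    phylum_of     : {taxon: phylum label}
    keep_separate : {taxon: target clade label} for taxa kept below phylum level
    """
    n_samples = len(next(iter(counts.values())))
    collapsed = defaultdict(lambda: [0] * n_samples)
    for taxon, row in counts.items():
        # Target taxa keep their own clade; everything else falls back to phylum.
        clade = keep_separate.get(taxon, phylum_of[taxon])
        for i, count in enumerate(row):
            collapsed[clade][i] += count
    return dict(collapsed)


# Toy example (hypothetical labels and counts), three samples.
counts = {
    "otu_a": [5, 0, 2],
    "otu_b": [1, 3, 0],
    "otu_c": [0, 7, 4],
    "otu_d": [2, 2, 2],
}
phylum_of = {
    "otu_a": "Proteobacteria",
    "otu_b": "Proteobacteria",
    "otu_c": "Proteobacteria",
    "otu_d": "Nitrospirae",
}
keep_separate = {"otu_c": "SUP05"}
clade_counts = collapse_to_clades(counts, phylum_of, keep_separate)
```

No counts are discarded: the per-sample totals are unchanged, and only the resolution of the description is reduced.
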
In section 4.3, an $O(\log p)$ reduction is explored, but it comes at the cost of even more complex modelling, and thereby increased abstraction in results.

3.3 Results

3.3.1 Marginal model selection

To determine which univariate marginal model best describes observed SSU rRNA, several univariate count distributions were surveyed with AIC statistics (see subsection 3.2.2). Since there are many dimensions of SSU rRNA counts, each model has several AIC statistics. Descriptive statistics are provided in Table 3.1. AIC values behave like a measurement of error, so the best model will have consistently small AIC values. Because all marginal models are joined together in a single multivariate model, a marginal model with only a few very large AIC values (high-variance dimensions) is inadmissible, because it threatens the entire model's ability to fit. Extremely unlikely values can cause a few data points to overpower the model or simply fail estimation through underflow errors (see section 1.7).

Table 3.1: AIC score statistics per marginal model

             Negative Bin.  Hagmark Gam.  Hagmark Log N.  Floor Log N.  Floor Log Cau.  Floor Log t
Min                  13.64         13.81           14.49         13.56           13.36        14.00
1st Quant.           14.23         14.32           14.87         14.12           14.18        14.23
Median               15.09         14.97           15.25         14.71           14.83        15.11
Mean                 41.56         18.53           15.99         15.50           15.71        15.60
3rd Quant.           19.85         16.87           16.18         15.93           16.15        16.29
Max                4043.07        640.69           33.82         33.83           42.95        26.47

[Figure 3.4: Histogram of final estimated values $\hat\nu$. Vertical axis: Frequency.]

The selected marginal model is the Floor Log t distribution, because it consistently achieved the lowest AIC values. This model has one more parameter than the others, which permits it more flexibility. Fortunately, the AIC statistic accounts for the number of parameters, so the low AIC values indicate the additional parameter is motivated. The additional parameter is the Student-t's shape parameter $\nu > 0$, which controls the thickness of the model's tail. A distribution with thick tails has more extreme values.
A Student-t model has more extreme values when $\nu$ is small. This can be seen analytically, because its expected value is undefined for $\nu \le 1$, and its variance is undefined for $\nu \le 2$. With these special values in mind, the histogram of all estimates $\hat\nu$ (see Figure 3.4) meaningfully demonstrates that all distributions exhibit heavy tails, and many do not even have defined variances.

Descriptive statistics

Interpretation of the AIC statistics is easiest when contextualized in descriptive statistics of the actual data, because the different models are best suited to different distributions of data. The data are characterized by strongly skewed count distributions, as illustrated by histograms A, B, and C of Figure 3.5. This skewed shape is supported by all candidate models, but not all are necessarily capable of capturing the extreme skew. Another important quality of the data is that observed variances can be much larger than their corresponding means, as illustrated by the $\hat\sigma/\hat\mu$ ratios in histogram D of Figure 3.5. Having standard deviations 6 or 12 times larger than the mean highlights the importance of considering models capable of infinite variances, particularly Cauchy or Student-t distributions. Another important statistic is that, of all counts, 40% are zero. Thus the selected model must be capable of very large values amongst many zeros. Again, generating count models from infinite-variance-capable distributions makes this quality possible. Finally, histogram A shows that the SUP05 distribution's skew is not so extreme, thereby supporting a model which is also capable of finite variances. Of the models considered, such flexibility is only possible through the Floor Log t distribution, which the AIC statistic favoured.

[Figure 3.5: Descriptive statistics for the SSU rRNA data set. Panels A, B, and C are count histograms (panel titles: SUP05, Cyanobacteria, Halobacteriales); panel D is a histogram of $\hat\sigma/\hat\mu$ ratios. Vertical axes: Frequency.]

3.3.2 Compositional vs. mixed effect perspective

In subsection 1.3.1, two perspectives on SSU rRNA multivariate structure are proposed: the compositional and mixed effect perspectives. Under the compositional perspective, taxa are expected to compete for sequencing depth, and thus have a negative correlative effect induced upon their counts. Under the mixed effect perspective, taxa counts rise and fall together with sequencing depth, and thus have a positive correlative effect induced upon their counts. Both are obfuscating effects that need to be controlled for, if present. Of the $p = 57$ taxa observed, 75% of their observed correlations are positive, suggesting the mixed effect perspective is correct.

Observing a majority of positive correlations does not conclusively support a mixed effect perspective, because the mixed effect is merely an unobserved and linear latent effect, and a compositional mechanism may be overpowered by a highly inter-dependent community. The strongest evidence for the data following a mixed effect perspective is in Figure 3.6. The $L_1$ vector has statistically significant, positive values for nearly every taxon. Because $L_1$ is a factor model weight, the data support the existence of a single, linear effect shared amongst the taxa. Such is the exact design of the mixed effect perspective.
These findings describe Saanich Inlet SSU rRNA multivariate structure as following the mixed effect perspective.

[Figure 3.6: Statistically significant parameter values per taxon, testing equality with zero. Each $\beta_x$ describes the regressor weight against variable $x$. For example, $\beta_1$ is the intercept, $\beta_{O_2}$ is the weight against O2 concentration, and so on. The parameters $\sigma$, $\nu$ and $\Psi$ must always be positive. Majority-positive values for $L_1$ demonstrate the observation of a mixed effect. The lack of significant values for $L_2$ and $L_3$ does not stop their associated covariance matrix $\Sigma_{-1} = L_{-1} L_{-1}^T + \Psi$ from attaining significant values. Rows: the 57 counted clades. Legend: positive correlation, negative correlation.]

3.3.3 Exploring GPU necessity

Setting up a CUDA-enabled GPU is not always easy, so it is worth exploring the necessity of GPU acceleration in estimating the regularized model. The estimation scheme is a two-stage process, with an initialization step and a non-linear optimization step (see subsection 3.2.4). The initialization step is fast and does not require a GPU, whereas the non-linear optimization step is slow and does require a GPU. Avoiding the non-linear optimization process would not just simplify hardware requirements, but also drastically shorten compute times. For example, the two-week-long computational experiment described in subsection 3.2.5 would have taken under one day without non-linear optimization. That said, the initialization produces a merely approximate estimate, whereas applying the GPU allows for optimal estimates. Therefore the GPU is only valuable if MLE optimality is.

Having fully applied the GPU and its associated CUDA software, this study now has bootstraps which describe the sampling distribution produced from Saanich Inlet SSU rRNA data (see subsection 3.2.6). These data are valuable in deciding the necessity of GPU-driven non-linear optimization. By comparing an initialized estimate $\hat\theta_0$ to the sampling distribution of $\hat\theta$, it can be decided whether $\hat\theta_0$ falls within the natural variation of fully optimized estimates. If it does not, then $\hat\theta_0$ certainly has sufficient bias to obfuscate real results.

[Figure 3.7: Testing the necessity of GPU acceleration in estimation. Test A (all parameters) shows GPU acceleration is not necessary for general model parameters. Test B (latent correlations only) shows GPU acceleration is necessary for correctly estimating correlations. Each test has a PCA panel (axes PC1, PC2) and a histogram of z-scores (vertical axis: Frequency).]

The problem is scrutinized through a combination of hypothesis testing and descriptive statistics. Asymptotic statistical theory provides the following approximation, which roughly applies to this problem. The hypothesis test assumes that the bootstraps $\hat\theta$ follow a multivariate Gaussian distribution with empirical mean $\bar\theta$. As shown in PCA A of Figure 3.7, this assumption is imperfectly applied, though it is justifiable in PCA B.

\[
\hat t = (\hat\theta_0 - \bar\theta)^T \hat\Sigma^{-1} (\hat\theta_0 - \bar\theta) \sim \chi^2_p
\]

Two tests are conducted to demonstrate the nuanced effect the GPU provides. In test A, $\theta$ is tested in its $\theta^A = (\beta, \log\sigma, \log\nu, L, \log\Psi)$ form, thereby describing estimation quality for the entire model's structure. In test B, only a subset of the parameters is used, $\theta^B = (L, \log\Psi)$, and these are transformed into latent correlation estimates $\{\hat\rho_{ij}\}_{i \ne j}$, thereby describing estimation quality only for the correlative structure. In both cases, the statistics $\hat\Sigma$ and $\bar\theta$ are estimated from the bootstraps $\hat\theta$, the approximate Gaussian distribution of $\hat\theta$ is scrutinized through PCA, and a robust argument is provided through z-scores $(\hat\theta_0 - \bar\theta)/(\mathrm{diag}(\hat\Sigma))^{1/2}$. The null hypothesis is always $H_0 : \hat\theta_0 \overset{D}{=} \hat\theta$.

The combined approach of hypothesis testing and z-score histograms provides different argumentative qualities, and the two ultimately agree in their results. Hypothesis testing is theoretically powerful but logically delicate and prone to assumption failures, whereas the z-score histograms provide logically robust but ambiguous results. The hypothesis tests are imperfectly applied. In test A, $\hat\theta$ is not Gaussian distributed, as indicated in Figure 3.7 PCA A, where a bimodal distribution is shown and existing outliers are hidden. In test B, the mere 1000 bootstraps are insufficient to estimate $\hat\Sigma$ due to the $p(p - 1)/2 = 1596$ correlations it describes.
Instead, $\hat\Sigma^{-1}$ is constructed through diagonalization, omitting any eigenvalues less than $2.22 \times 10^{-14}$ (which happens to retain the first 999 eigenvectors). Hypothesis test A fails to reject with a p-value rounding off to one (supporting $H_0 : \hat\theta_0^A \overset{D}{=} \hat\theta^A$), and hypothesis test B rejects with a p-value rounding off to zero (rejecting $H_0 : \hat\theta_0^B \overset{D}{=} \hat\theta^B$). Similarly, 0.3% of test A's z-scores fall outside of the univariate Gaussian's 95% confidence interval (supporting $H_0 : \hat\theta_0^A \overset{D}{=} \hat\theta^A$), while 56.2% of test B's z-scores fall outside of the confidence interval (rejecting $H_0 : \hat\theta_0^B \overset{D}{=} \hat\theta^B$). These results support the conclusion that GPU acceleration is necessary for extracting fine correlative structure, while it is unnecessary for describing general model structure.

3.3.4 Precision-recall comparison

The simulation study described in subsection 3.2.5 yields the precision-recall curves illustrated in Figure 3.8 (A). The full-factorial experimental design is well suited to building strong inferential arguments, but it produces enough complicated data that a regression analysis is usually required to correctly interpret findings. In this case, the simulation study produces so many precision-recall curves that, while a few suggested themes can be seen, over-plotting obfuscates clear conclusions, and beta regression is used to clean up the findings in Figure 3.8 (B).

[Figure 3.8: (A) Precision-recall curves, (B) Expected precisions after beta regression. Panel A axes: Recall, Precision. Panel B shows beta regression estimates of precision at Recall = 0.2 for the Pearson, SparCC, and Regularized statistics under the mixed effect and compositional simulant models.]

[Figure 3.9: All statistically significant correlations]

The experiment reveals the following observations.

1. Under the mixed effect model perspective (which is supported by data, see subsection 3.3.2), the regularized model greatly improves precisions. In Figure 3.8 (B), for a recall of 20%, the expected precision is about 90%.
2. Under the compositional perspective, the regularized model performs comparably to the best of the other methods on average. In Figure 3.8 (B), for a recall of 20%, the expected precision drops to about 55% under the compositional perspective. It is worth noting that the Pearson method's precision drops as well. It is possible that the drop may be partially due to simulant model selection. Particularly, the compositional condition simulates from the LNM, which hides correlation structure behind a mixture model, and thereby should add variance, making inference less efficient. However, the regularized model inherently follows the mixed effect perspective, and likely had built-in biases that could cause it to under-perform under a compositional condition.
3. Computing the regularized portion of the experiment took two weeks, whereas the other methods took no longer than two days. This demonstrates that the precision increases are not free, and come at the cost of increased compute requirements. Despite the substantially increased compute time, algorithmic complexity only grows linearly, $O(p)$, with $p$ taxa, so larger jobs should scale well if a GPU grid may be used. The experiment was run over 24 conditions $\times$ 2 replicates per statistic. The software was run on a single terminal with an AMD FX(tm)-8320 Eight-Core Processor, 32GB RAM, and a GeForce GTX 980 Ti GPU, running Ubuntu 14.04.
4. Pearson outperformed SparCC. This is likely due to a combined effect of two factors. First, sample sizes tested were large (200 or 1000), and because Pearson correlations are consistent, they should be fairly accurate. Second, SparCC precisions tended to cluster, which is likely a symptom of its known tendency to over-attribute correlations (see figure 1b of Friedman and Alm [95]).
The precisions of any method which attributes too many correlations will be dictated more by the natural abundance of correlations than by the method itself.

3.3.5 Saanich Inlet

As described in subsection 3.1.3, the Saanich Inlet SSU rRNA data set is run through this work's regularized method so that it may (1) be contextualized against previous descriptions of the Saanich Inlet denitrifying community, (2) be evaluated for its ability to recover observed community structure, and (3) further constrain understanding of how denitrification is occurring via SUP05's partial denitrification. The estimated correlation network is not simple (see Figure 3.9), and requires further processing so that it may be digested. It might be tempting to simply discard nodes from the network that seem uninteresting, but such an approach destroys information. For example, if edges in the remaining network are merely tertiary correlations (see section 1.5.3), then the discarded nodes are actually deciding the observed correlations. So simply discarding taxa risks ignoring the primary drivers of the community's covariational structure.

Network subsetting can incorporate all relevant covariational information by subsetting with partial covariance decompositions (see section 1.5.3). To ensure that only statistically significant edges are observed in the sub-network, bootstraps are calculated for partially decomposed $\Sigma_{-1}$ matrices (see section 3.2.4). The resulting sub-network only includes decided taxa as nodes, but the whole system's information is represented via the exclusion of edges strictly due to discarded taxa. An illustration of the Saanich Inlet denitrifying community's sub-network is shown in Figure 3.10. The representation of community-wide information in the sub-network can be seen through including special nodes. For example, including taxa without known participation in denitrification does not add edges, as was the case for OD1, Chlamydiae, Methanococci Eury, and Acidobacteria.
Further, edges can be added by including groups which superset taxa which ought to be correlated for phylogenetic reasons, as was the case for No.blast.hit, Unclassified Bacteria, and Proteobacteria.

A nuanced caveat for interpreting Figure 3.10 is that the entire analysis is conditioned on the environmental observations (O2, NO3−, H2S). This is due to the model's origination from a multivariate regression paradigm (see section 1.6.2). While the taxa's environmental and partial correlations are estimated simultaneously, the environmental correlations take precedence, because the partial correlations are only able to describe system variation that is not due to environmental variables. The mechanism is very similar to a partial correlation decomposition (see section 1.5.3), although not equivalent, due to the non-Gaussian nature of the count data. This nuanced perspective is important, because it provides deeper insight into postulated ecological dependencies which might be attributed to environmental accumulations of public goods. For example, the hypothesis of sulfur-driven denitrification via SUP05 and Marinimicrobia is strengthened, because the organisms share a partial correlation while not correlating with environmental H2S.

Several observations are relevant to the motivations described in subsection 3.1.3. First, leveraging the method's ability to more precisely survey for correlations, it is notable that SUP05 shares a partial-correlative edge with Marinimicrobia, but not with Planctomycetes. All three are negatively correlated with O2 concentrations, but have no significant interactions with NO3−. Second, the environmental correlations agree with known environmental abundances illustrated in Figure 3.2. Note that all taxa in the network are negatively correlated with O2, despite existing at different oxygenation levels within the OMZ. This is because OMZ O2 levels are only subtly different relative to O2 levels near the surface.
This O2 phenomenon is so strong that only the Cyanobacteria are positively correlated with O2 (see Figure 3.6). A particularly interesting univariate differentiation is that Nitrospira are negatively correlated with NO3− concentrations, while Nitrospina are positively correlated.

[Figure 3.10: Statistically significant partial correlations and regressors superimposed over metabolisms. Metabolic relationships reflect both previous interpretations described in subsection 1.4.3 and observations from chapter 2. Nodes: SUP05, Planctomycetes, Thaumarchaeota, Nitrospira, Nitrospina, Marinimicrobia. Legend: positive correlation, negative correlation.]

3.4 Discussion

3.4.1 Precision-recall comparison

In subsection 3.3.4, a full-factorial experiment demonstrates the ability of model regularization to make precise attribution possible for SSU rRNA correlation surveys, directly meeting the concerns of Weiss et al. [267]. Taking the perspective that regularization is a model constraint used to reduce overfit, the regularizer is a covariance matrix constraint $\Sigma = LL^T + \Psi$, which is effective through reducing the parameter's dimensional complexity from $O(p^2)$ to $O(p)$ for $p$ taxa. Without precise attribution, correlation networks may only be a guide for approximate topologies, perhaps effectively describable through graph statistics [14, 107, 109, 201, 281] such as centrality [32], betweenness centrality [94], connectivity, or power law distributions [3]. Of course, some statistics might not actually converge meaningfully on certain random graphs. With precise attribution, correlation networks become meaningful to the finest level of resolution: the edge.
These findings support the conclusion that the future of SSU rRNA correlation surveys almost certainly requires some form of high-dimensional accommodation [44, 138, 218] such as regularization.

3.4.2 Univariate SSU rRNA models

In subsection 3.3.1, univariate count regression models are surveyed with AIC statistics (see Table 3.1). The conclusion is not that the popular [168, 185, 226, 228] negative binomial is a reliably good fit, but instead that the floor log t distribution is. While the Student t distribution requires an additional parameter, thereby inviting overfit, the AIC statistic accounts for such effects. The Student t shape parameter estimates (see Figure 3.4) indicate that extreme values are commonplace, and therefore make less-robust models, including the negative binomial, prone to poor fits. This observation is meaningful for authors of univariate regression survey software, and suggests that such software might be more robust through application of different models.

3.4.3 Multivariate SSU rRNA models

In subsection 3.3.2, a data-driven argument is built to support the mixed effect perspective for the Saanich Inlet data set. Particularly, 75% of all observed correlations were positive, perhaps cultivated by the common mixed effect variable of sequencing depth. However, it is the mostly-positive L1 factor weight (see Figure 3.6) which provides the strongest support for the perspective, because it observes a mixed effect mechanism. These data support the mixed effect mechanism over the compositional perspective, but do not conclusively decide the question of which is right. The compositional perspective has been a foundational insight for modellers in Microbial Ecology [95, 277], but any such negative correlations may only be due to the use of relative abundances.

Further efforts to model SSU rRNA data would benefit from deciding which perspective is more correct.
However, the idea that SSU rRNA's observed correlations tend to either positivity or negativity could easily be a falsely-imposed dichotomy. The idea that most counts tend to rise and fall with sequencing depth (mixed effect) is no more absurd than the idea that some will compete for depth (compositional). The goal of the modeller should be to allow the data to communicate its underlying message, not to impose idealizations upon it. Instead of attempting to decide which way correlational biases tend, it may be better to acknowledge the correlational continuum these perspectives share. If possible, it would be best to allow models to express both compositional and mixed effects. The challenge is to do this while avoiding over-parameterization. The primary contribution of this chapter is a more succinct representation of a high-dimensional space, but even it only has enough space for the mixed effect perspective. A more elegant solution is motivated.

3.4.4 Saanich Inlet

In Figure 3.10, a partial correlation network is presented. It is the result of applying the regularized correlation framework to Saanich Inlet SSU rRNA data. The environmental correlations (O₂, NO₃⁻, H₂S) illustrate known community structure as observed in Figure 3.2. First, SUP05, Marinimicrobia, and Planctomycetes' only environmental correlations are negative correlations with O₂. Their negative correlation with O₂ is expected, but their metabolisms (see subsection 1.4.3) suggest that a positive correlation with NO₃⁻ might be expected. However, inspecting the sparklines of Figure 3.2 shows that none of their abundances drop off below the nitrate-sulfidic transition. In interpreting this result, it is important to remember the vast genomic variety in each of these clades (see subsection 3.2.6), and that this work's binning experiment did observe the potential for some complete denitrification for both SUP05 and Marinimicrobia (see Figure 2.4).
Second, Thaumarchaeota are known to be highly abundant, and harbour ammonia-oxidizing capabilities. The regression analysis finds Thaumarchaeota negatively correlated with O₂ and positively correlated with NO₃⁻. Knowing that ammonia oxidation requires oxygen might make the negative correlation with O₂ surprising, but observed Thaumarchaeota abundances (see Figure 3.2) support this fact, and demonstrate its known ecological niche as a nitrifier under ammonium-poor conditions [177]. The regression software is merely representing Thaumarchaeota's narrow niche as a linear construction of O₂ and NO₃⁻ concentrations. Third, while Nitrospina and Nitrospira are known anaerobic NO₂⁻ oxidizers (see subsection 1.4.3), they are observed in Saanich Inlet to sit in different niches (see Figure 3.2). The regression software models this fact through positively correlating Nitrospina with NO₃⁻, and negatively correlating Nitrospira with NO₃⁻. Nitrite oxidizing opportunities should be rarest in the sulfidic portion of the OMZ, providing fewer energetic opportunities, and this is reflected in Nitrospira's significantly smaller population than Nitrospina. These observations and known facts agree with the model's representation of the denitrifying community.

In contextualizing the correlation network amongst previous network representations of Saanich Inlet (see subsection 1.4.4 and subsection 1.4.5), the partial correlation components become relevant. The most important differences in comparing these networks pertain to breadth and depth of description. Both the conceptual and differential models represent a deep dive into a few samples (2 depth profiles separated by 5 months and 5 depth profiles over 6 months, respectively), providing an extremely detailed description of ecological mechanisms, and they are informed by far more genomic information than just SSU rRNA data, including proteomics.
In contrast, the correlation network is informed by the 20 depth profiles from 2006 to 2011, but it only leverages SSU rRNA data and environmental measurements. The conceptual and differential networks are focused descriptions of target denitrifiers or ecological roles, whereas the correlation network is a description of all taxa described by the SSU rRNA data. An advantage of surveying the entire community is that sub-networks generated by partial correlation decompositions account for effects due to taxa excluded from the network, protecting later inferences against false attribution. A disadvantage is that correlation networks are inherently abstract and thereby provide superficial descriptions of ecology, and therefore require a combined-methods approach to be made useful. This work not only uses previous literature, but also enhances its correlation network with genomic descriptions produced via metagenomic binning (see chapter 2).

The conceptual, differential, and correlation networks agree with each other, and work together to further describe denitrifier community structure in Saanich Inlet. First, the metabolic interactions of Figure 1.7 do not repeat the correlative structure of Figure 3.10 exactly, but instead represent a more modern understanding. Cultivation refuted [237] the hypothesis posed by Hawley et al. [125] that SUP05 might provide Planctomycetes with NH₄⁺ for anammox. This is reflected in the correlation network as the absence of a partial correlative edge between SUP05 and Planctomycetes. Second, the correlational model is able to differentiate roles for Nitrospina and Nitrospira, not just describing them as exemplar nitrite oxidizers, but also describing Nitrospira as Nitrospina's low-abundance, sulfidic zone counterpart.
Third, the correlational topology agrees strongly with the metabolic topology of the differential network (see Figure 1.8), reflecting strong support for SUP05 to play the role of partial denitrifier to a sulfur-driven complete denitrifying counterpart. Fourth, when further contextualized in the findings of chapter 2, SUP05's correlative edge with Marinimicrobia becomes meaningful, because not only is SUP05 described as tending toward partial denitrification (see Figure 2.5), but Marinimicrobia recruits complete denitrifying genes (see Figure 2.4) while also supporting known sulfur metabolism (see subsection 1.4.3). These findings suggest Marinimicrobia is SUP05's sulfur-driven denitrifying partner. Given the lack of H₂S correlation, it is likely that this process is not contributing to environmental sulfur accumulations, and is thus a cryptic sulfur cycle [48]. Fifth, leveraging the correlation survey's precision, it is important to note that SUP05 sharing an edge with Marinimicrobia instead of Planctomycetes supports the argument that nitrogen loss in Saanich Inlet continues to exist through denitrification instead of SUP05 shunting NO₂⁻ to Planctomycetes for anammox. Limited recall invites the possibility of SUP05 maintaining both relationships. Without precision, this argument becomes substantially weaker, effectively losing correlative support.

3.4.5 Partial correlations and succinct representation

This correlation analysis has two convenient problems. First, in subsection 3.2.6 the summation of SSU rRNA counts within large clades (often up to phylum) was motivated as a necessary dimensional reduction, despite working with a massive parameter reduction from $O(p^2)$ to $O(p)$ for $p$ taxa. The problem is not only that the SSU rRNA counting pipeline (see subsection 1.2.3) allows for arbitrarily high resolution and thus arbitrarily large $p$, but also that there is an immense level of genomic complexity throughout the Saanich Inlet water column.
Unfortunately, $O(p)$ parameters is simply too many when $p$ is effectively infinite. This motivates the usage of fewer parameters.

Second, in subsection 3.3.5 partial correlation decomposition (see section 1.5.3) is used to turn a complex correlation network (see Figure 3.9) into a simple one (see Figure 3.10). This process has the advantage of providing digestible simplicity while still using all information from the complex network, but it also highlights an applied fact: not all information in the network needs to be modelled to a perfect quality, while certain parts do. From a statistical perspective, this means that many expensive parameters are being discarded in application. Using the techniques available in subsection 3.3.5, it would be possible for users to specify which taxa to provide high quality modelling for, but such a method risks only reproducing the modeller's assumptions, instead of allowing the data to guide analyses. This shows that the usage of more parameters is not even desired.

With the recognition that correlation analyses need fewer parameters and that many parameters are not even desired, it is clearly time to search for a more elegant solution. The bioinformatic realities of problems in Microbial Ecology show that brute-force application of large statistical models to SSU rRNA data misdirects precious resources. There will never be enough data to describe the correlation structure of infinitely many taxa. Better models are needed. Ideally parameter complexity should not grow with $p$; however, explorations of multivariate $O(\log p)$ representations would likely yield effective results. In chapter 4 a succinct model is described, which achieves parameter reduction through assuming evolution follows a stochastic process. The problem of SSU rRNA correlation is certainly not yet solved, but does have feasible avenues into the future.

3.5 Conclusions

This work has responded to the concerns of Weiss et al.
[267] by presenting a correlation network estimation paradigm which meets the statistical needs of Microbial Ecology. The essential mechanic is the constraint of the covariance matrix $\Sigma = LL^T + \Psi$, achieving a reduction in parameters from $O(p^2)$ to $O(p)$ for $p$ taxa, while still representing all correlations. The methodology is objectively measured through precision-recall curves and application to a Saanich Inlet SSU rRNA data set. The methodology is shown to be capable of substantial increases in precision, largely only at the cost of increased computational time. In application it allows further understanding of the Saanich Inlet denitrifier community by bolstering claims of a SUP05-Marinimicrobia sulfur-driven denitrification pathway with correlative evidence, but also supports the perspective that nitrogen loss through denitrification continues alongside anammox despite having observed SUP05's shift toward partial denitrification in chapter 2. It is clear that if SSU rRNA correlation surveys are to contribute to any further confident descriptions of fine community structure, some form of regularization must be considered. It would be fruitful to develop models which are able to represent community structure in fewer than $O(p)$ parameters.

Chapter 4
Future directions

This work contributes toward two bioinformatic tasks in Microbial Ecology: metagenomic binning and SSU rRNA correlation. Major contributions are evaluated in an objective manner, primarily through precision-recall curves. Evaluations extend into application by applying contributions toward improving understanding of the Saanich Inlet denitrifying community. Application provided an opportunity both to catch discrepancies in inferences and to further understanding.
The primary conclusions of chapter 2 are that metagenomic binning can be made more precise through use of a good reference (as in assembly), and that taxonomic attributions are more precise nearer to the phylogenetic root (as in bootstrapped phylogenetic estimation). The primary conclusion of chapter 3 is that regularization is necessary for precise estimation of fine correlative structure. In application toward the Saanich Inlet denitrifying community these methods have allowed various inferences, but a single argument has been built upon previous network perspectives (see subsection 1.4.4 and subsection 1.4.5): SUP05 is observed to be taking on a partial denitrifying role, and is likely working with Marinimicrobia through cryptic sulfur cycling to sustain complete denitrification. Despite these contributions, important questions still remain.

4.1 Denitrification in Saanich Inlet

Building on previous network interpretations of the Saanich Inlet denitrifying community (see subsection 1.4.4 and subsection 1.4.5), this work has attributed metabolic capabilities to certain taxa in chapter 2, then constrained interpretation with correlative evidence in chapter 3. At this point it appears that SUP05, a major denitrifier, is moving toward partial denitrification. Metabolic and correlative evidence suggests that SUP05 may continue to play a role in complete denitrification by driving it via a cryptic sulfur-cycling relationship with Marinimicrobia. Despite the evidence behind this hypothesis, others must also be considered. First, Arcobacteraceae is known to be an active complete denitrifier in the sulfidic zone (see subsection 1.4.3), and is observed to be operating in Saanich Inlet (see Figure 2.4). So an alternative hypothesis is that Arcobacteraceae or another complete denitrifier may be taking over the niche. Second, nitrogen loss may be occurring through anammox instead of denitrification.
Since SUP05 is taking on a nitrite producing role, and Planctomycetes is known to harbour anammox capabilities, one would expect a positive correlation to develop between them, though one was not observed (see Figure 3.10). The lack of a correlation is not conclusive, because the result could plausibly be a false negative, or the relationship could exist, but in a non-obligate manner (which is also likely). Third, denitrification may actually be slowing. The ecological consequences of energetic opportunities are not fully understood, and just because energy exists to be taken does not mean it is best to do so.

These hypotheses can be further narrowed bioinformatically with existing data. For example, the third alternative hypothesis can be ruled out with existing chemical concentration data. If nitrous oxide concentrations are non-increasing over time, the hypothesis is unlikely. This could be argued informatically with a univariate regression analysis or simple plotting. The second alternative hypothesis is contingent on Planctomycetes increasing in anammox activity while SUP05 decreases in nitric oxide reduction. This could be argued via a binning experiment similar to the one used to construct Figure 2.5. Similarly, the first alternative hypothesis describes Arcobacteraceae (or another organism) increasing in complete denitrifying behaviour, and could also be refuted through a binning experiment. If taxonomic attributions are not desired, binning isn't even necessary.

Ruling out alternative hypotheses does not necessarily support the initial hypothesis either, because the SUP05-Marinimicrobia metabolic syntrophy argument is built on imperfect statistical inferences. Such methods are important for building hypotheses, but ultimately the verification should be performed with an isolation and rate measurement experiment. If the hypothesis is true, then hydrogen sulfide should be consumed, not build up in the environment, and should accelerate denitrification.
This should not discredit efforts to rule out alternative hypotheses however, because such bioinformatic analyses are ultimately much less expensive than isolation and rate measurement experiments. In this way, these binning and correlation-based arguments are typical bioinformatic precursor work, which helps place a few expensive, high-quality verifications.

4.2 Regularization as reduced parameter complexity

In subsection 1.6.3 regularization is defined as constraining a model to reduce overfit. Theory for regularization via constrained optimization is well-developed [28, 44, 131, 164], but in chapter 3 constraint is implemented through equating a higher dimensional parameter with a lower dimensional representation [218, 247]. It is not obvious why such a method should improve precision-recall exchanges. Indeed, it would be worth directly comparing the higher and lower dimensional models. However, the whole process also highlights an alternative perspective of regularization in general, where the essential mechanic is the reduction of parameter complexity. Model constraints inevitably invite reduced dimensionality, implicit or not. Even theoretically developed $L_1$-regularization is applied toward explicit parameter reduction [164]. Embracing this perspective, it becomes important to ask how smaller models generally improve statistical power. This perspective can be developed theoretically, thereby providing broadly applicable results.

Interpreting classification as a hypothesis testing problem, the Neyman-Pearson Lemma [196] can be leveraged to establish a test statistic: under the right conditions a likelihood ratio test (LRT) statistic $\lambda$ is most-powerful, and so only it will be considered. To gain perspective on a breadth of models this work will invoke a large sample size assumption, thereby making the work of Wilks [270] and Wald [265] applicable.
Under the null hypothesis $H_0 : \theta = \theta_0$, a large sample size, and regularity assumptions, the distribution of the LRT statistic $\lambda$ is known [270] via $-2\lambda \sim_{H_0} \chi^2(p, 0)$, where $p$ is the dimension of the parameter tested ($\theta \in \mathbb{R}^p$), and $\chi^2(p, \delta)$ is the non-central chi-square distribution with degrees of freedom $p$, non-centrality parameter $\delta$, and distribution function $F_{\chi^2(p,\delta)}$. Under the alternative hypothesis $H_1 : \theta = \theta_1$, the LRT distribution is similarly known [67, 265] with $-2\lambda \sim_{H_1} \chi^2(p, \delta)$, where $\delta = \partial\theta^T I_{\theta_1}^{-1} \partial\theta$, and $\partial\theta = \theta_0 - \theta_1$. Then the rejection region for the test statistic $-2\lambda$ with false-rejection rate $\alpha$ is any value greater than $F^{-1}_{\chi^2(p,0)}(1-\alpha)$. Then power is calculable under the alternative as $1-\beta = 1 - F_{\chi^2(p,\delta)}(F^{-1}_{\chi^2(p,0)}(1-\alpha))$.

Figure 4.1: Statistical power decreases as dimension increases for $\alpha = 0.05$.

Regularization by parameter constraint effectively defines a function $\theta(\eta) \in \mathbb{R}^p$ with $\eta \in \mathbb{R}^q$, where $q < p$. It is convenient to impose an unbiased representation assumption of $\theta_0 = \theta(\eta_0)$ and $\theta_1 = \theta(\eta_1)$ for some $\eta_0$ and $\eta_1$ in $\mathbb{R}^q$, so that representations are more directly comparable. Where only biased representations exist, the approximation may still be valuable through variance-bias trade-off considerations. Under the $\theta$-representation in $p$ dimensions, the non-centrality parameter is $\delta_\theta = \partial\theta^T I_{\theta_1}^{-1} \partial\theta$. Under the $\eta$-representation, the non-centrality parameter is $\delta_\eta = \partial\eta^T (J^T I_{\theta_1} J)^{-1} \partial\eta$, where $\partial\eta = \eta_0 - \eta_1$, $J \in \mathbb{R}^{p \times q}$, and $J_{ij} = (\partial\theta_i/\partial\eta_j)(\eta_1)$. Therefore the power of the $\eta$-representation is $1-\beta_\eta = 1 - F_{\chi^2(q,\delta_\eta)}(F^{-1}_{\chi^2(q,0)}(1-\alpha))$.

Having developed this perspective of regularization theoretically, it is now clear that a tersely defined set of functions has broadly applicable implications. Regularization constraints can be generalized to any functions $\theta(\eta)$ which increase statistical power by satisfying the following equation and previously stated assumptions.
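The power expression above can be evaluated numerically. The following is a minimal sketch using only the Python standard library, assuming the test rejects for large values of the statistic (as the power formula implies). The function names (`reg_lower_gamma`, `chi2_cdf`, `ncx2_cdf`, `chi2_quantile`, `lrt_power`) and the truncation at 400 mixture terms are illustrative assumptions, not part of this work's software. The non-central chi-square distribution function is computed as a Poisson-weighted mixture of central chi-square distribution functions.

```python
import math

def reg_lower_gamma(a, x):
    """Regularized lower incomplete gamma P(a, x) via its power series.
    Intended for the moderate arguments used here, not for extreme x."""
    if x <= 0.0:
        return 0.0
    term = 1.0 / a
    total = term
    n = 0
    while abs(term) > 1e-16 * abs(total) and n < 100000:
        n += 1
        term *= x / (a + n)
        total += term
    return min(1.0, total * math.exp(a * math.log(x) - x - math.lgamma(a)))

def chi2_cdf(x, df):
    """Central chi-square distribution function."""
    return reg_lower_gamma(df / 2.0, x / 2.0)

def ncx2_cdf(x, df, delta, terms=400):
    """Non-central chi-square distribution function as a Poisson mixture.
    `terms` is an assumed truncation, adequate for moderate delta."""
    if delta == 0.0:
        return chi2_cdf(x, df)
    half = delta / 2.0
    total = 0.0
    for j in range(terms):
        log_weight = -half + j * math.log(half) - math.lgamma(j + 1)
        total += math.exp(log_weight) * chi2_cdf(x, df + 2 * j)
    return total

def chi2_quantile(prob, df):
    """Inverse central chi-square distribution function by bisection."""
    lo, hi = 0.0, max(1.0, float(df))
    while chi2_cdf(hi, df) < prob:
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if chi2_cdf(mid, df) < prob:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def lrt_power(df, delta, alpha=0.05):
    """Power 1 - beta = 1 - F_{chi2(df, delta)}(F^{-1}_{chi2(df, 0)}(1 - alpha))."""
    critical = chi2_quantile(1.0 - alpha, df)
    return 1.0 - ncx2_cdf(critical, df, delta)

# Power at a fixed (hypothetical) non-centrality for several dimensions.
for dimension in (1, 10, 100, 1000):
    print(dimension, lrt_power(dimension, delta=10.0))
```

Holding the non-centrality fixed while varying the dimension, as in the loop above, is the comparison summarized by Figure 4.1. The value `delta=10.0` is arbitrary and chosen only for illustration.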
$$1-\beta_\theta \le 1-\beta_\eta \iff F_{\chi^2(q,\delta_\eta)}\left(F^{-1}_{\chi^2(q,0)}(1-\alpha)\right) \le F_{\chi^2(p,\delta_\theta)}\left(F^{-1}_{\chi^2(p,0)}(1-\alpha)\right)$$

It's important to note that this is not yet a perfectly posed mathematical problem, because $\theta(\eta)$ likely needs to satisfy certain pragmatic qualities. For example, in chapter 3 the $LL^T + \Psi$ constraint is continuously differentiable everywhere, and is capable of describing a useful breadth of covariance matrices. Even so, developing theory for $\theta(\eta)$ has the potential to guide modelling choices in high-dimensional spaces such as for SSU rRNA data.

A heuristic understanding of these conditions can be produced through numerical approximation, thereby bolstering the claim that reducing parameter complexity (dimension) can improve statistical power. Since $F_{\chi^2(p,\delta)}$ is defined continuously over $p$, a sense of continuity between parameter dimensions can be established. In chapter 3, it is reasonable to imagine that regularization causes far greater changes in model dimension $p$ than in $\delta$. So behaviour of statistical power over model dimension with fixed $\delta$ can be examined through numerical approximation of $\frac{\partial \beta}{\partial p} = \frac{\partial}{\partial p} F_{\chi^2(p,\delta)}\left(F^{-1}_{\chi^2(p,0)}(1-\alpha)\right)$ with a five-point stencil (see subsection 1.7.1). It can be seen in Figure 4.1 that for each test point, $\beta$ increases with model dimension, thereby causing statistical power ($1-\beta$) to decrease. Equivalently, reducing model dimension increases statistical power.

Figure 4.2: Simplified Tree of Life superimposed with a succinct correlation structure. The red line is a cutting line, which separates the entire tree into clades. Each clade's correlation structure is dictated entirely by its own tree and clade parameters. Clade parameters are latent random variables with a complete correlation structure. Correlations are illustrated with black and magenta lines. Tree image credit: [130]

4.3 A more succinct representation

In chapter 3, precise correlation inference was made possible via a model constraint $\Sigma = LL^T + \Psi$ which reduces the parameter complexity from $O(p^2)$ to $O(p)$ for $p$ taxa.
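To make these parameter counts concrete, the following is a minimal sketch that counts free parameters and assembles a small covariance matrix of the constrained form. It assumes $L$ is a $p \times m$ loading matrix with a fixed number of factors $m$ and that $\Psi$ is diagonal. The value of $m$, the toy numbers, and the function names are illustrative assumptions, and identifiability constraints on $L$ are ignored in the count.

```python
def full_covariance_params(p):
    """Free parameters of an unconstrained symmetric p x p covariance matrix."""
    return p * (p + 1) // 2

def factor_covariance_params(p, m):
    """Parameters of Sigma = L L^T + Psi with L of shape p x m and diagonal Psi."""
    return p * m + p

def factor_covariance(loadings, psi):
    """Assemble Sigma = L L^T + Psi from a list-of-rows L and a diagonal psi."""
    p = len(loadings)
    m = len(loadings[0])
    return [
        [
            sum(loadings[i][k] * loadings[j][k] for k in range(m))
            + (psi[i] if i == j else 0.0)
            for j in range(p)
        ]
        for i in range(p)
    ]

# Growth of the two parameter counts with the number of taxa p (m = 3 assumed).
for p in (10, 100, 1000):
    print(p, full_covariance_params(p), factor_covariance_params(p, 3))

# A toy one-factor example with three taxa (hypothetical values).
sigma = factor_covariance([[1.0], [0.5], [-0.5]], [0.5, 0.5, 0.5])
for row in sigma:
    print(row)
```

Every pair of taxa still receives a covariance entry, which matches the statement that the constraint reduces parameters while still representing all correlations.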
Despite the effort, in subsection 3.2.6 dimensional reduction is still employed: many clades' SSU rRNA counts are summed up to the phylum level, and all remaining taxa are sub-optimally descriptive. Dimensional reduction is employed, because for very large $p$, $O(p)$ is still far too many parameters. A more succinct representation is motivated, meaning that a parameter complexity of $O(\log p)$ or $O(1)$ needs to be described (see subsection 3.4.5). This work now proposes one such representation.

Instead of trying to correlate taxa, it might be more pragmatic to only correlate clades. Drawing inspiration from phylogenetic regression (see subsection 1.2.2), evolution is modelled as an unobserved Brownian motion process [224]. Every phylogenetic tree can be described this way, including clades, because they are also trees. If the tree is cut correctly, it is broken into separate clades. For example, cutting along the red line in Figure 4.2 produces many clades. If one was to specify the cut along a bifurcating tree according to node height, there would be $O(\log p)$ clades. The essential mechanic here is that pair-wise correlations are only estimated between clades, while the Brownian motion process defines all correlations within clades.

Formally, for each of $p$ taxa dimensions (not clades) define a marginal distribution function $F_{Y_j}(y; \mu_j, \sigma_j)$, where $\mu_j$ is a location parameter and $\sigma_j$ is a dispersion parameter. Recognizing that $O(p)$ is too many parameters, each $(\mu_j, \sigma_j)$ will not be estimated. Instead $(\mu_j, \sigma_j)$ is random, following a bivariate lognormal distribution. Each $\mu_j$ in clade $C_k$ is dependent according to the Brownian motion process of tree $C_k$, and originates from an initial Gaussian-distributed random value $Z_k$ at the root of its clade, so $E[\mu_j | Z_k] = e^{Z_k}$. Similarly constrain each $\sigma_j$, independently of every $\mu_j$. The random vector $Z = [Z_k]$ controls the location parameters of each clade.
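As an illustration of how a tree alone can fix the correlations within a clade, the sketch below computes the standard Brownian motion covariance between the tips of a small tree, where the covariance of two tips is proportional to the branch length they share from the root. This covers only the Gaussian-scale covariance, not the lognormal construction described above. The tree encoding, the toy branch lengths, and the function names are hypothetical and chosen only for illustration.

```python
def leaf_paths(node, key=(), path=()):
    """Map each leaf name to the edges on its root-to-leaf path.
    A node is (name, branch_length, children); leaf names must be unique.
    Each edge is identified by the child-index key of the node it leads to."""
    name, length, children = node
    path = path + ((key, length),)
    if not children:
        return {name: path}
    paths = {}
    for index, child in enumerate(children):
        paths.update(leaf_paths(child, key + (index,), path))
    return paths

def bm_covariance(tree, sigma2=1.0):
    """Tip covariance under Brownian motion: sigma2 times shared branch length."""
    paths = leaf_paths(tree)
    names = sorted(paths)
    cov = {}
    for a in names:
        for b in names:
            shared = set(paths[a]) & set(paths[b])
            cov[(a, b)] = sigma2 * sum(length for _, length in shared)
    return names, cov

# Toy clade: tips A and B share an internal branch of length 1.0; C diverges at the root.
toy_clade = ("root", 0.0, [
    ("X", 1.0, [("A", 0.5, []), ("B", 0.5, [])]),
    ("C", 1.5, []),
])

names, cov = bm_covariance(toy_clade)
for a in names:
    print(a, [cov[(a, b)] for b in names])
```

In this sketch no pair-wise parameter is estimated within the clade: the tree and a single rate `sigma2` determine every entry, which is the property that lets estimated correlations be reserved for the clade level.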
Since $Z \in \mathbb{R}^{O(\log p)}$ it can describe any regression mechanisms and be given a full pair-wise correlation structure with fewer than $O(p)$ parameters, so $Z \sim N(X\beta, \Sigma)$. So despite being capable of modelling all taxa, this model only suffers a parameter complexity of $O((\log p)^2)$.

This proposed model is succinct in that it has a parameter complexity less than $O(p)$ for $p$ taxa. It is a demonstration of how a high-dimensional model can have a succinct parameterization. It is not designed to have numerically feasible properties however. It is not even uniquely defined, because the marginal distribution functions and $\sigma_j$ distributions are defined abstractly. Much more work is required before such models can be made useful to Microbial Ecologists. For example, pragmatic tree cutting requires exploration. The work in chapter 3 demonstrates that precise correlation networks are possible, but still requires destructive dimensional reduction (such as the summation of many clades up to the phylum level). These succinct models offer the opportunity to have precise correlation networks without the need to resort to dimensional reduction.

Bibliography

[1] T. Abe, H. Sugawara, M. Kinouchi, S. Kanaya, and T. Ikemura. Novel phylogenetic studies of genomic sequence fragments derived from uncultured microbe mixtures in environmental and clinical samples. DNA Research, 12(5):281–290, 2006.
[2] H. Akaike. A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19(6):716–723, 1974.
[3] R. Albert and A.-L. Barabási. Statistical mechanics of complex networks. Reviews of Modern Physics, 74:47–97, 2002.
[4] R. Albert, H. Jeong, and A.-L. Barabási. Error and attack tolerance of complex networks. Nature, 406:378–382, 2000.
[5] M. Albertsen, P. Hugenholtz, A. Skarshewski, K. L. Nielsen, G. W. Tyson, and P. H. Nielsen. Genome sequences of rare, uncultured bacteria obtained by differential coverage binning of multiple metagenomes. Nature Biotechnology, 31:533–538, 2013.
[6] E. Allers, J.
J. Wright, K. M. Konwar, C. G. Howes, E. Beneze, S. J. Hallam, and M. B. Sullivan. Diversity and population structure of marine group a bacteria in the northeast subarctic pacific ocean. The ISME Journal, 7:256–268, 2013.
[7] J. Alneberg, B. S. Bjarnason, I. de Bruijn, M. Schirmer, J. Quick, U. Z. Ijaz, L. Lahti, N. J. Loman, A. F. Andersson, and C. Quince. Binning metagenomic contigs by coverage and composition. Nature Methods, 11:1144–1146, 2014.
[8] D. Altavilla. Nvidia doubles down on self-driving cars with xavier ai chip and a hat tip to next gen volta gpu, 2016. http://www.forbes.com/sites/davealtavilla/2016/09/28/nvidia-doubles-down-on-self-driving-cars-with-xavier-ai-chip-and-a-hat-tip-to-next-gen-volta-gpu; accessed online 30-September-2016.
[9] S. F. Altschul, W. Gish, W. Miller, E. Myers, and D. Lipman. Basic local alignment search tool. Journal of Molecular Biology, 215:403–410, 1990.
[10] K. Anantharaman, C. T. Brown, L. A. Hug, I. Sharon, C. J. Castelle, A. J. Probst, B. C. Thomas, A. Singh, M. J. Wilkins, U. Karaoz, E. L. Brodie, K. H. Williams, S. S. Hubbard, and J. F. Banfield. Thousands of microbial genomes shed light on interconnected biogeochemical processes in an aquifer system. Nature Communications, 7:13219, 2016.
[11] E. Anderson, Z. Bai, C. Bischof, S. Blackford, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, and D. Sorensen. Lapack users' guide. 1999.
[12] J. J. Anderson and A. H. Devol. Deep water renewal in saanich inlet, an intermittently anoxic basin. Estuarine and Coastal Marine Science, 1(1):1–10, 1973.
[13] M. J. Anderson, T. O. Crist, J. M. Chase, M. Vellend, B. D. Inouye, A. L. Freestone, N. J. Sanders, H. V. Cornell, L. S. Comita, K. F. Davies, S. P. Harrison, N. J. B. Kraft, J. C. Stegen, and N. G. Swenson. Navigating the multiple meanings of b diversity: a roadmap for the practicing ecologist. Ecology Letters, 14:19–28, 2011.
[14] M. Arumugam, J. Raes, E. Pelletier, D. L. Paslier, T. Yamada, D. R. Mende, G. R.
Fernandes, J. Tap, T. Bruls, J.-M. Batto, M. Bertalan, N. Borruel, F. Casellas, L. Fernandez, L. Gautier, T. Hansen, M. Hattori, T. Hayashi, M. Kleerebezem, K. Kurokawa, M. Leclerc, F. Levenez, C. Manichanh, H. B. Nielsen, T. Nielsen, N. Pons, J. Poulain, J. Qin, T. Sicheritz-Ponten, S. Tims, D. Torrents, E. Ugarte, E. G. Zoetendal, J. Wang, F. Guarner, O. Pedersen, W. M. de Vos, S. Brunak, J. Doré, MetaHIT Consortium, J. Weissenbach, S. D. Ehrlich, and P. Bork. Enterotypes of the human gut microbiome. Nature, 473:174–180, 2011.
[15] J. R. Ashford and R. R. Snowden. Multivariate probit analysis. Biometrics, 26:535–546, 1970.
[16] K. Baba. Partial, Conditional and Multiplicative Correlation Coefficients. Keio University, 2004.
[17] K. Baba and M. Sibuya. Equivalence of partial and conditional correlation coefficients. Journal of the Japan Statistical Society, 35(1):1–19, 2005.
[18] K. Baba and R. S. M. Sibuya. Partial correlation and conditional correlation as measures of conditional independence. Australian & New Zealand Journal of Statistics, 46(4):657–664, 2004.
[19] A. R. Babbin, R. G. Keil, A. H. Devol, and B. B. Ward. Organic matter stoichiometry, flux, and oxygen control nitrogen loss in the ocean. Science, 344(6182):406–408, 2014.
[20] D. Baetens. Enhanced biological phophorus removal: modelling and experimental design. Ghent University, 2001.
[21] A. Bankevich, S. Nurk, D. Antipov, A. A. Gurevich, M. Dvorkin, A. S. Kulikov, V. M. Lesin, S. I. Nikolenko, S. Pham, A. D. Prjibelski, A. V. Pyshkin, A. V. Sirotkin, N. Vyahhi, G. Tesler, M. A. Alekseyev, and P. A. Pevzner. Spades: A new genome assembly algorithm and its applications to single-cell sequencing. Journal of Computational Biology, 19(5):455–477, 2012.
[22] A.-L. Barabási and Z. N. Oltvai. Network biology: understanding the cell's functional organization. Nature Reviews Genetics, 5:101–113, 2004.
[23] L. E. Baum and T. Petrie.
Statistical inference for probabilistic functions of finite state markov chains. Annals of Mathematical Statistics, 37(6):1554–1563, 1966.
[24] Y. Benjamini and Y. Hochberg. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B, 51(1):289–300, 1995.
[25] Y. Benjamini and D. Yekutieli. The control of the false discovery rate in multiple testing under dependency. Annals of Statistics, 29(4):1165–1188, 2001.
[26] S. Bennett. Solexa ltd. Pharmacogenomics, 5(4):433–438, 2004.
[27] A. A. Berryman. The orgins and evolution of predator-prey theory. Ecology, 73(5):1530–1535, 1992.
[28] P. Bickel, B. Li, A. Tsybakov, S. Geer, B. Yu, T. Valdés, C. Rivero, J. Fan, and A. Vaart. Regularization in statistics. TEST: An Official Journal of the Spanish Society of Statistics and Operations Research, 15(2):271–344, 2006.
[29] P. C. Blainey. The future is now: single-cell genomics of bacteria and archaea. FEMS Microbiology Reviews, 73(3):407–427, 2013.
[30] F. R. Blattner, G. Plunkett, C. A. Bloch, N. T. Perna, V. Burland, M. Riley, J. Collado-Vides, J. D. Glasner, C. K. Rode, G. F. Mayhew, J. Gregor, N. W. Davis, H. A. Kirkpatrick, M. A. Goeden, D. J. Rose, B. Mau, and Y. Shao. The complete genome sequence of escherichia coli k-12. Science, 277(5331):1453–1462, 1997.
[31] N. Bokulich and C. Bamforth. The microbiology of malting and brewing. Microbiology and Molecular Biology Reviews, 77(2):157–172, 2013.
[32] S. P. Borgatti. Centrality and network flow. Social Networks, 27:55–71, 2005.
[33] A. Bowe, T. Onodera, K. Sadakane, and T. Shibuya. Succinct de Bruijn Graphs. Springer, 2012.
[34] S. Boyd and L. Vandenberghe. Convex optimization. 2004.
[35] K. R. Bradnam, J. N. Fass, A. Alexandrov, P. Baranay, M. Bechner, I. Birol, S. Boisvert, J. A. Chapman, G. Chapuis, R. Chikhi, H. Chitsaz, W. Chou, J. Corbeil, C. D. Fabbro, T. R. Docking, R. Durbin, D. Earl, S. Emrich, P. Fedotov, N. A. Fonseca, G. Ganapathy, R. A.
Gibbs, S. Gnerre, Élénie Godzaridis, S. Goldstein, M. Haimel, G. Hall, D. Haussler, J. B. Hiatt, I. Y. Ho, J. Howard, M. Hunt, S. D. Jackman, D. B. Jaffe, E. D. Jarvis, H. Jiang, S. Kazakov, P. J. Kersey, J. O. Kitzman, J. R. Knight, S. Koren, T. Lam, D. Lavenier, F. Laviolette, Y. Li, Z. Li, B. Liu, Y. Liu, R. Luo, I. MacCallum, M. D. MacManes, N. Maillet, S. Melnikov, D. Naquin, Z. Ning, T. D. Otto, B. Paten, O. S. Paulo, A. M. Phillippy, F. Pina-Martins, M. Place, D. Przybylski, X. Qin, C. Qu, F. J. Ribeiro, S. Richards, D. S. Rokhsar, J. G. Ruby, S. Scalabrin, M. C. Schatz, D. C. Schwartz, A. Sergushichev, T. Sharpe, T. I. Shaw, J. Shendure, Y. Shi, J. T. Simpson, H. Song, F. Tsarev, F. Vezzi, R. Vicedomini, B. M. Vieira, J. Wang, K. C. Worley, S. Yin, S.-M. Yiu, J. Yuan, G. Zhang, H. Zhang, S. Zhou, and I. F. Korf. Assemblathon 2: evaluating de novo methods of genome assembly in three vertebrate species. GigaScience, 2:10, 2013.
[36] J. R. Bray and J. T. Curtis. An ordination of upland forest communities of southern wisconsin. Ecological Monographs, 27:325–349, 1957.
[37] E. C. Brechmann and H. Joe. Parsimonious parameterization of correlation matrices using truncated vines and factor analysis. Computational Statistics & Data Analysis, 77:233–251, 2014.
[38] I. Brettar, M. Labrenz, S. Flavier, J. Bötel, H. Kuosa, R. Christen, and M. G. Höfle. Identification of a thiomicrospira denitrificans-like epsilonproteobacterium as a catalyst for autotrophic denitrification in the central baltic sea. Applied and Environmental Microbiology, 72(2):1364–1372, 2006.
[39] B. Brown and L. Levy. Certification of algorithm 708: Significant digit computation of the incomplete beta function ratios. ACM Transactions on Mathematical Software, 20(3):393–397, 1994.
[40] C. T. Brown, L. A. Hug, B. C. Thomas, I. Sharon, C. J. Castelle, A. Singh, M. J. Wilkins, K. C. Wrighton, K. H. Williams, and J. F. Banfield. Unusual biology across a group comprising more than 15% of domain bacteria.
Nature, 523:208–211, 2015.
[41] M. V. Brown and S. P. Donachie. Evidence for tropical endemicity in the deltaproteobacteria marine group b/sar324 bacterioplankton clade. Aquatic Microbial Ecology, 46:107–115, 2007.
[42] C. G. Broyden. The convergence of a class of double-rank minimization algorithms: Part 2. Journal of the Institute of Mathematics and its Applications, 6:222–231, 1970.
[43] M. Burrows and D. Wheeler. A block-sorting lossless data compression algorithm. Digital Equipment Corporation, Technical Report 124, 1994.
[44] T. Cai, W. Liu, and X. Luo. A constrained l1 minimization approach to sparse precision matrix estimation. Journal of the American Statistical Association, 106(494):594–607, 2011.
[45] A. C. Cameron and P. K. Trivedi. Econometric models based on count data: comparisons and applications of some estimators and tests. Journal of Applied Econometrics, 1:29–54, 1986.
[46] A. C. Cameron and P. K. Trivedi. Regression analysis of count data. 2013.
[47] B. J. Campbell, A. S. Engel, M. L. Porter, and K. Takai. The versatile epsilon-proteobacteria: key players in sulphidic habitats. Nature Reviews Microbiology, 4:458–468, 2006.
[48] D. E. Canfield, F. J. Stewart, B. Thamdrup, L. De Brabandere, T. Dalsgaard, E. F. Delong, N. P. Revsbech, and O. Ulloa. A cryptic sulfur cycle in oxygen-minimum-zone waters off the chilean coast. Science, 330(6009):1375–1378, 2010.
[49] G. Cantor. Ueber eine elementare frage der mannigfaltigkeitslehre. Jahresbericht der Deutschen Mathematiker-Vereinigung, 1:75–78, 1892.
[50] D. G. Capone and D. A. Hutchins. Microbial biogeochemistry of coastal upwelling regimes in a changing ocean. Nature Geoscience, 6:711–717, 2013.
[51] J. G. Caporaso, J. Kuczynski, J. Stombaugh, K. Bittinger, F. D. Bushman, E. K. Costello, N. Fierer, A. G. Pena, J. K. Goodrich, J. I. Gordon, G. A. Huttley, S. T. Kelley, D. Knights, J. E. Koenig, R. E. Ley, C. A. Lozupone, D. McDonald, B. D. Muegge, M. Pirrung, J. Reeder, J. R. Sevinsky, P. J. Turnbaugh, W. A. Walters, J. Widmann, T.
Yatsunenko, J. Zaneveld, and R. Knight. Qiime allows analysis of high-throughput community sequencing data. Nature Methods, 7:335–336, 2010.
[52] G. Casella and R. L. Berger. Statistical Inference. Duxbury Thomson Learning, 2002.
[53] R. Caspi, R. Billington, L. Ferrer, H. Foerster, C. A. Fulcher, I. M. Keseler, A. Kothari, M. Krummenacker, M. Latendresse, L. A. Mueller, Q. Ong, S. Paley, P. Subhraveti, D. S. Weaver, and P. D. Karp. The metacyc database of metabolic pathways and enzymes and the biocyc collection of pathway/genome databases. Nucleic Acids Research, 44:D471–D480, 2016.
[54] S. Chaffron, H. Rehrauer, J. Pernthaler, and C. von Mering. A global network of coexisting microbes from environmental and whole-genome sequence data. Genome Research, 20:947–959, 2010.
[55] P. S. G. Chain, D. V. Grafham, R. S. Fulton, M. G. FitzGerald, J. Hostetler, D. Muzny, J. Ali, B. Birren, D. C. Bruce, C. Buhay, J. R. Cole, Y. Ding, S. Dugan, D. Field, G. M. Garrity, R. Gibbs, T. Graves, C. S. Han, S. H. Harrison, S. Highlander, P. Hugenholtz, H. M. Khouri, C. D. Kodira, E. Kolker, N. C. Kyrpides, D. Lang, A. Lapidus, S. A. Malfatti, V. Markowitz, T. Metha, K. E. Nelson, J. Parkhill, S. Pitluck, X. Qin, T. D. Read, J. Schmutz, S. Sozhamannan, P. Sterk, R. L. Strausberg, G. Sutton, N. R. Thomson, J. M. Tiedje, G. Weinstock, A. Wollam, Genomic Standards Consortium, Human Microbiome Project, Jumpstart Consortium, and J. C. Detter. Genome project standards in a new era of sequencing. Science, 326(5950):236–237, 2009.
[56] A. Chao. Nonparametric estimation of the number of classes in a population. Scandinavian Journal of Statistics, 11(4):265–270, 1984.
[57] S. Chib and E. Greenberg. Analysis of multivariate probit models. Biometrika, 85(2):347–361, 1998.
[58] J.-C. Cho and S. J. Giovannoni. Cultivation and growth characteristics of a diverse group of oligotrophic marine gammaproteobacteria. Applied and Environmental Microbiology, 70(1):432–440, 2004.
[59] J. E. Clarridge.
Impact of 16s rrna gene sequence analysis for identification of bacteria on clinical microbiology and infectious diseases. Clinical Microbiology Reviews, 17(4):840–862, 2004.
[60] L. A. Codispoti. The limits to growth. Nature, 387:237, 1997.
[61] L. A. Codispoti, J. A. Brandes, J. P. Christensen, A. H. Devol, S. A. Naqvi, H. W. Paerl, and T. Yoshinari. The oceanic fixed nitrogen and nitrous oxide budgets: Moving targets as we enter the anthropocene? Scientia Marina, 65(2):85–105, 2001.
[62] R. Conway and W. Maxwell. A queuing model with state dependent service rates. Journal of Industrial Engineering, 12:132–136, 1962.
[63] D. R. Cox. Renewal theory. 1970.
[64] F. Crick. Central dogma of molecular biology. Nature, 227:561–563, 1970.
[65] F. D. F. da Silva, A. R. J. Lima, P. H. G. Moraes, A. S. Siqueira, L. T. Dall’Agnol, A. R. F. Baraúna, L. C. Martins, K. G. Oliveira, C. P. S. de Lima, M. R. T. Nunes, J. L. S. G. Vianez-Júnior, and E. C. Gonçalves. Draft genome sequence of limnobacter sp. strain caciam 66h1, a heterotrophic bacterium associated with cyanobacteria. Genome Announcements, 4(3):e00399–16, 2016.
[66] T. Dalsgaard, D. E. Canfield, J. Petersen, B. Thamdrup, and J. Acuña-González. N2 production by the anammox reaction in the anoxic water column of golfo dulce, costa rica. Nature, 422:606–608, 2003.
[67] R. R. Davidson and W. E. Lever. The limiting distribution of the likelihood ratio statistic under a class of local alternatives. Sankhyā: The Indian Journal of Statistics, 32(2):209–224, 1970.
[68] J. Dean and S. Ghemawat. Mapreduce: simplified data processing on large clusters. Communications of the ACM, 51(1):107–113, 2008.
[69] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the em algorithm. Journal of the Royal Statistical Society, Series B, 39(1):1–38, 1977.
[70] Y. Deng, Y.-H. Jiang, Y. Yang, Z. He, F. Luo, and J. Zhou. Molecular ecological network analyses. BMC Bioinformatics, 13:113, 2012.
[71] T. Z. DeSantis, P.
Hugenholtz, N. Larsen, M. Rojas, E. L. Brodie, K. Keller, T. Huber, D. Dalevi, P. Hu, and G. L. Andersen. Greengenes, a chimera-checked 16s rrna gene database and workbench compatible with arb. Applied and Environmental Microbiology, 72:5069–5072, 2006.
[72] D. G. Higgins and P. M. Sharp. Clustal: a package for performing multiple sequence alignment on a microcomputer. Gene, 73(1):237–244, 1988.
[73] A. H. Devol. Nitrogen cycle: Solution to a marine mystery. Nature, 422:575–576, 2003.
[74] G. J. Dick, A. F. Andersson, B. J. Baker, S. L. Simmons, B. C. Thomas, A. P. Yelton, and J. F. Banfield. Community-wide analysis of microbial genome sequence signatures. Genome Biology, 10:R85, 2009.
[75] A. R. Didonato and A. H. Morris. Algorithm 708: Significant digit computation of the incomplete beta function ratios. ACM Transactions on Mathematical Software, 18(3):360–373, 1992.
[76] J. A. Dodsworth, P. C. Blainey, S. K. Murugapiran, W. D. Swingley, C. A. Ross, S. G. Tringe, P. S. G. Chain, M. B. Scholz, C. Lo, J. Raymond, S. R. Quake, and B. P. Hedlund. Single-cell and metagenomic analyses indicate a fermentative and saccharolytic lifestyle for members of the op9 lineage. Nature Communications, 4:1854, 2013.
[77] S. C. Doney. The growing human footprint on coastal and open-ocean biogeochemistry. Science, 328(5985):1512–1516, 2010.
[78] J. C. Doyle, D. L. Alderson, L. Li, S. Low, M. Roughan, S. Shalunov, R. Tanaka, and W. Willinger. The “robust yet fragile” nature of the internet. Proceedings of the National Academy of Sciences, 102(41):14497–14502, 2005.
[79] W. E. Durno, N. W. Hanson, K. M. Konwar, and S. J. Hallam. Expanding the boundaries of local similarity analysis. BMC Genomics, 14(Suppl 1):S13, 2013.
[80] R. C. Edgar. Muscle: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Research, 32(5):1792–1797, 2004.
[81] R. C. Edgar. Local homology recognition and distance measures in linear time using compressed amino acid alphabets.
Nucleic Acids Research, 32(1):380–385, 2004.
[82] R. C. Edgar. Search and clustering orders of magnitude faster than blast. Bioinformatics, 26(19):2460–2461, 2010.
[83] B. Efron. Bootstrap methods: Another look at the jackknife. The Annals of Statistics, 7:1–26, 1979.
[84] B. Efron, E. Halloran, and S. Holmes. Bootstrap confidence levels for phylogenetic trees. Proceedings of the National Academy of Sciences, 93(23):13429–13434, 1996.
[85] P. G. Falkowski, T. Fenchel, and E. F. Delong. The microbial engines that drive earth’s biogeochemical cycles. Science, 320(5879):1034–1039, 2008.
[86] P. G. Falkowski, T. Algeo, L. Codispoti, C. Deutsch, S. Emerson, B. Hales, R. B. Huey, W. J. Jenkins, L. R. Kump, L. A. Levin, T. W. Lyons, N. B. Nelson, O. S. Schofield, R. Summons, L. D. Talley, E. Thomas, F. Whitney, and C. B. Pilcher. Ocean deoxygenation: Past, present, and future. Earth & Space Science News, 92(46):409–410, 2011.
[87] K. Faust and J. Raes. Microbial interactions: from networks to models. Nature Reviews Microbiology, 10:538–550, 2012.
[88] J. Felsenstein. Confidence limits on phylogenies: An approach using the bootstrap. Evolution, 39(4):783–791, 1985.
[89] R. L. Ferguson, E. N. Buckley, and A. V. Palumbo. Response of marine bacterioplankton to differential filtration and confinement. Applied and Environmental Microbiology, 47:49–55, 1984.
[90] D. Field, G. Garrity, T. Gray, N. Morrison, J. Selengut, P. Sterk, T. Tatusova, N. Thomson, M. J. Allen, S. V. Angiuoli, M. Ashburner, N. Axelrod, S. Baldauf, S. Ballard, J. Boore, G. Cochrane, J. Cole, P. Dawyndt, P. D. Vos, C. dePamphilis, R. Edwards, N. Faruque, R. Feldman, J. Gilbert, P. Gilna, F. O. Glöckner, P. Goldstein, R. Guralnick, D. Haft, D. Hancock, H. Hermjakob, C. Hertz-Fowler, P. Hugenholtz, I. Joint, L. Kagan, M. Kane, J. Kennedy, G. Kowalchuk, R. Kottmann, E. Kolker, S. Kravitz, N. Kyrpides, J. Leebens-Mack, S. E. Lewis, K. Li, A. L. Lister, P. Lord, N. Maltsev, V. Markowitz, J. Martiny, B. Methe, I. Mizrachi, R. Moxon, K.
Nelson, J. Parkhill, L. Proctor, O. White, S.-A. Sansone, A. Spiers, R. Stevens, P. Swift, C. Taylor, Y. Tateno, A. Tett, S. Turner, D. Ussery, B. Vaughan, N. Ward, T. Whetzel, I. S. Gil, G. Wilson, and A. Wipat. The minimum information about a genome sequence (migs) specification. Nature Biotechnology, 26:541–547, 2008.
[91] D. Field, L. Amaral-Zettler, G. Cochrane, J. R. Cole, P. Dawyndt, G. M. Garrity, J. Gilbert, F. O. Glöckner, L. Hirschman, I. Karsch-Mizrachi, H. Klenk, R. Knight, R. Kottmann, N. Kyrpides, F. Meyer, I. S. Gil, S.-A. Sansone, L. M. Schriml, P. Sterk, T. Tatusova, D. W. Ussery, O. White, and J. Wooley. The genomic standards consortium. PLoS Biology, 9(6):e1001088, 2011.
[92] R. Fletcher. A new approach to variable metric algorithms. The Computer Journal, 13:317–322, 1970.
[93] J. A. Frank, Y. Pan, A. Tooming-Klunderud, V. G. H. Eijink, A. C. McHardy, A. J. Nederbragt, and P. B. Pope. Improved metagenome assemblies and taxonomic binning using long-read circular consensus sequence data. Scientific Reports, 6:25373, 2016.
[94] L. C. Freeman. A set of measures of centrality based on betweenness. Sociometry, 40(1):35–41, 1977.
[95] J. Friedman and E. J. Alm. Inferring correlation networks from genomic survey data. PLOS Computational Biology, 8(9):e1002687, 2012.
[96] J. A. Fuhrman and A. A. Davis. Widespread archaea and novel bacteria from the deep sea as shown by 16s rrna gene sequences. Marine Ecology Progress Series, 150:275–285, 1997.
[97] J. A. Fuhrman, K. McCallum, and A. A. Davis. Phylogenetic diversity of subsurface marine microbial communities from the atlantic and pacific oceans. Applied and Environmental Microbiology, 59(5):1294–1302, 1993.
[98] J. A. Fuhrman, J. A. Cram, and D. M. Needham. Marine microbial community dynamics and their ecological interpretation. Nature Reviews Microbiology, 13:133–146, 2015.
[99] B. Gallone, J. Steensels, T. Prahl, L. Soriaga, V. Saels, B. Herrera-Malaver, A. Merlevede, M. Roncoroni, K. Voordeckers, L. Miraglia, C. Teiling, B.
Steffy, M. Taylor, A. Schwartz, T. Richardson, C. White, G. Baele, S. Maere, and K. J. Verstrepen. Domestication and divergence of saccharomyces cerevisiae beer yeasts. Cell, 166(6):1397–1410, 2016.
[100] M. Gallopin, A. Rau, and F. Jaffrézic. A hierarchical poisson log-normal model for network inference from rna sequencing data. PLoS ONE, 8(10):e77503, 2013.
[101] A. Genz. Numerical computation of multivariate normal probabilities. Journal of Computational and Graphical Statistics, 1:141–150, 1992.
[102] T. S. Ghosh, M. H. M, and S. S. Mande. Discribinate: a rapid method for accurate taxonomic classification of metagenomic sequences. BMC Bioinformatics, 11(Suppl 7):S14, 2010.
[103] E. A. Gies, K. M. Konwar, J. T. Beatty, and S. J. Hallam. Illuminating microbial dark matter in meromictic sakinaw lake. Applied and Environmental Microbiology, 80(21):6807–6818, 2014.
[104] Gigabyte. Gv-n98twf3oc-6gd, 2017. http://www.gigabyte.com; accessed online 10 March 2017.
[105] D. Goldfarb. A family of variable-metric methods derived by variational means. Mathematics of Computation, 24(109):23–26, 1970.
[106] J. Gole, A. Gore, A. Richards, Y. Chiu, H. Fung, D. Bushman, H. Chiang, J. C. and Yu-Hwa Lo, and K. Zhang. Massively parallel polymerase cloning and genome sequencing of single cells using nanoliter microwells. Nature Biotechnology, 31:1126–1132, 2013.
[107] J. K. Goodrich, J. L. Waters, A. C. Poole, J. L. Sutter, O. Koren, R. Blekhman, M. Beaumont, W. V. Treuren, R. Knight, J. T. Bell, T. D. Spector, A. G. Clark, and R. E. Ley. Human genetics shape the gut microbiome. Cell, 159(4):789–799, 2014.
[108] R. M. Gower and P. Richtárik. Randomized quasi-newton updates are linearly convergent matrix inversion algorithms. arXiv:1602.01768v3, 2016.
[109] S. Greenblum, P. J. Turnbaugh, and E. Borenstein.
Metagenomic systems biology of the human gut microbiome reveals topological shifts associated with obesity and inflammatory bowel disease. Proceedings of the National Academy of Sciences, 109(2):594–599, 2012.
[110] W. Greene. Functional forms for the negative binomial model for count data. Economics Letters, 99:585–590, 2008.
[111] I. Gregor, J. Dröge, M. Schirmer, C. Quince, and A. C. McHardy. Phylopythias+: a self-training method for the rapid reconstruction of low-ranking taxonomic bins from metagenomes. PeerJ, 4:e1603, 2016.
[112] W. Gropp, E. Lusk, N. Doss, and A. Skjellum. A high-performance, portable implementation of the mpi message passing interface standard. Parallel Computing, 22(6):789–828, 1996.
[113] J. Grote, M. Labrenz, B. Pfeiffer, G. Jost, and K. Jürgens. Quantitative distributions of epsilonproteobacteria and a sulfurimonas subgroup in pelagic redoxclines of the central baltic sea. Applied and Environmental Microbiology, 73(22):7155–7161, 2007.
[114] J. Grote, G. Jost, M. Labrenz, G. J. Herndl, and K. Jürgens. Epsilonproteobacteria represent the major portion of chemoautotrophic bacteria in sulfidic waters of pelagic redoxclines of the baltic and black seas. Applied and Environmental Microbiology, 74(24):7546–7551, 2008.
[115] J. Grote, T. Schott, C. G. Bruckner, F. O. Glöckner, G. Jost, H. Teeling, M. Labrenz, and K. Jürgens. Genome and physiology of a model epsilonproteobacterium responsible for sulfide detoxification in marine oxygen depletion zones. Proceedings of the National Academy of Sciences, 109(2):506–510, 2012.
[116] N. Gruber. The marine nitrogen cycle: Overview and challenges. 2008.
[117] S. Guikema and J. Coffelt. A flexible count data regression model for risk analysis. Risk Analysis, 28(1):213–223, 2008.
[118] P.-E. Hagmark. On construction and simulation of count data models. Mathematics and Computers in Simulation, 77:72–80, 2008.
[119] P.-E. Hagmark. An exceptional generalization of the poisson distribution.
Open Journal of Statistics, 2:313–318, 2012.
[120] A. S. Hahn, K. M. Konwar, S. Louca, N. W. Hanson, and S. J. Hallam. The information science of microbial ecology. Current Opinion in Microbiology, 31:209–216, 2016.
[121] N. W. Hanson, K. M. Konwar, A. K. Hawley, T. Altman, P. D. Karp, and S. J. Hallam. Metabolic pathways for the whole community. BMC Genomics, 15:619, 2014.
[122] G. H. Hardy, J. E. Littlewood, and G. Pólya. Inequalities. Cambridge University Press, 1934.
[123] M. F. Haroon, L. R. Thompson, and U. Stingl. Draft genome sequence of uncultured sar324 bacterium lautmerah10, binned from a red sea metagenome. Genome Announcements, 4(1):e01711–15, 2016.
[124] J. K. Harris, S. T. Kelley, and N. R. Pace. New perspective on uncultured bacterial phylogenetic division op11. Applied and Environmental Microbiology, 70(2):845–849, 2004.
[125] A. K. Hawley, H. M. Brewer, A. D. Norbeck, L. Paša-Tolić, and S. J. Hallam. Metaproteomics reveals differential modes of metabolic coupling among ubiquitous oxygen minimum zone microbes. Proceedings of the National Academy of Sciences, 111(31):11395–11400, 2014.
[126] A. K. Hawley, M. K. Nobu, J. J. Wright, W. E. Durno, C. Morgan-Lang, B. Sage, P. Schwientek, B. K. Swan, C. Rinke, M. Torres-Beltrán, K. Mewis, W. Liu, R. Stepanauskas, T. Woyke, and S. J. Hallam. Co-metabolic innovation along eco-thermodynamic gradients. Submitted, 2016.
[127] A. K. Hawley, M. Torres-Beltrán, M. Bhatia, E. Zaikova, D. A. Walsh, A. Mueller, M. Scofield, S. Kheirandish, C. Payne, L. Pakhomova, O. Shevchuk, E. A. Gies, D. Fairle, S. A. Malfatti, A. D. Norbek, H. M. Brewer, L. Pasa-Tolic, T. Glavina del Rio, C. A. Suttle, S. Tringe, and S. J. Hallam. A compendium of water column multi-omic sequence information from a seasonally anoxic fjord saanich inlet. Scientific Data, (Submitted), 2017.
[128] G. J. Herndl and T. Reinthaler. Microbial control of the dark end of the biological pump. Nature Geoscience, 6:718–724, 2013.
[129] G. W. Hill.
Algorithm 395: Student’s t-distribution. Communications of the ACM, 13(10):617–619, 1970.
[130] D. M. Hillis. Hillis laboratory, 2017. http://www.zo.utexas.edu/faculty/antisense/Download.html; accessed online 29 March 2017.
[131] A. E. Hoerl and R. W. Kennard. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 42(1):80–86, 2000.
[132] S. Holmes. Bootstrapping phylogenetic trees: Theory and methods. Statistical Science, 18(2):241–255, 2003.
[133] L. A. Hug, B. J. Baker, K. Anantharaman, C. T. Brown, A. J. Probst, C. J. Castelle, C. N. Butterfield, A. W. Hernsdorf, Y. Amano, K. Ise, Y. Suzuki, N. Dudek, D. A. Relman, K. M. Finstad, R. Amundson, B. C. Thomas, and J. F. Banfield. A new view of the tree of life. Nature Microbiology, 1:16048, 2016.
[134] J. B. Hughes, J. J. Hellmann, T. H. Ricketts, and B. J. M. Bohannan. Counting the uncountable: Statistical approaches to estimating microbial diversity. Applied and Environmental Microbiology, 67(10):4399–4406, 2001.
[135] D. H. Huson, A. F. Auch, J. Qi, and S. C. Schuster. Megan analysis of metagenomic data. Genome Research, 17:377–386, 2007.
[136] E. Isaacson and H. B. Keller. Analysis of numerical methods. 1994.
[137] H. Jeong, B. Tombor, R. Albert, Z. N. Oltvai, and A.-L. Barabási. The large-scale organization of metabolic networks. Nature, 407:651–654, 2000.
[138] H. Joe. Dependence Modeling with Copulas. CRC Press, 2014.
[139] R. A. Johnson and D. W. Wichern. Applied Multivariate Statistical Analysis. Pearson Education Inc., 2007.
[140] D. D. Kang, J. Froula, R. Egan, and Z. Wang. Metabat, an efficient tool for accurately reconstructing single genomes from complex microbial communities. PeerJ, 3:e1165, 2015.
[141] M. B. Karner, E. F. DeLong, and D. M. Karl. Archaeal dominance in the mesopelagic zone of the pacific ocean. Nature, 409:507–510, 2001.
[142] N. Kashtan, S. E. Roggensack, S. Rodrigue, J. W. Thompson, S. J. Biller, A. Coe, H. Ding, P. Marttinen, R. R. Malmstrom, R. Stocker, M. J. Follows, R.
Stepanauskas, and S. W. Chisholm. Single-cell genomics reveals hundreds of coexisting subpopulations in wild prochlorococcus. Science, 344(6182):416–420, 2014.
[143] R. F. Keeling, A. Kortzinger, and N. Gruber. Ocean deoxygenation in a warming world. Annual Review of Marine Science, 2:199–229, 2010.
[144] S. M. Kiełbasa, R. Wan, K. Sato, P. Horton, and M. C. Frith. Adaptive seeds tame genomic sequence comparison. Genome Research, 21(3):487–493, 2011.
[145] D. P. Kingma and J. L. Ba. Adam: A method for stochastic optimization. International Conference on Learning Representations, 2015.
[146] A. Klenke. Probability theory: A comprehensive course. 2013.
[147] T. Kohonen. The self-organizing map. Proceedings of the IEEE, 78(9):1464–1480, 1990.
[148] K. O. Konhauser, E. Pecoits, S. V. Lalonde, D. Papineau, E. G. Nisbet, M. E. Barley, N. T. Arndt, K. Zahnle, and B. S. Kamber. Oceanic nickel depletion and a methanogen famine before the great oxidation event. Nature, 458:750–753, 2009.
[149] K. M. Konwar, N. W. Hanson, A. P. Pagé, and S. J. Hallam. Metapathways: a modular pipeline for constructing pathway/genome databases from environmental sequence information. BMC Bioinformatics, 14(1), 2013.
[150] K. M. Konwar, N. W. Hanson, M. P. Bhatia, D. Kim, S. Wu, A. S. Hahn, C. Morgan-Lang, H. K. Cheung, and S. J. Hallam. Metapathways v2.5: quantitative functional, taxonomic and usability improvements. Proceedings of the 2014 IEEE Conference on Computational Intelligence in Bioinformatics and Computational Biology, 31(20):3345–3347, 2015.
[151] M. M. M. Kuypers, A. O. Sliekers, G. Lavik, M. Schmid, B. B. Jorgensen, J. G. Kuenen, J. S. S. Damsté, M. Strous, and M. S. M. Jetten. Anaerobic ammonium oxidation by anammox bacteria in the black sea. Nature, 422:608–611, 2003.
[152] M. Labrenz, I. Brettar, R. Christen, S. Flavier, J. Bötel, and M. G. Höfle. Development and application of a real-time pcr approach for quantification of uncultured bacteria in the central baltic sea.
Applied and Environmental Microbiology, 70(8):4971–4979, 2004.
[153] M. Labrenz, J. Grote, K. Mammitzsch, H. T. S. Boschker, M. Laue, G. Jost, S. Glaubitz, and K. Jürgens. Sulfurimonas gotlandica sp. nov., a chemoautotrophic and psychrotolerant epsilonproteobacterium isolated from a pelagic redoxcline, and an emended description of the genus sulfurimonas. International Journal of Systematic and Evolutionary Microbiology, 63:4141–4148, 2013.
[154] P. Lam and M. M. Kuypers. Microbial nitrogen cycling processes in oxygen minimum zones. Annual Review of Marine Science, 3:317–345, 2011.
[155] M. G. I. Langille, J. Zaneveld, J. G. Caporaso, D. McDonald, D. Knights, J. A. Reyes, J. C. Clemente, D. E. Burkepile, R. L. V. Thurber, R. Knight, R. G. Beiko, and C. Huttenhower. Predictive functional profiling of microbial communities using 16s rrna marker gene sequences. Nature Biotechnology, 31:814–821, 2013.
[156] G. Lavik, T. Stührmann, V. Brüchert, A. V. der Plas, V. Mohrholz, P. Lam, M. Mußmann, B. M. Fuchs, R. Amann, U. Lass, and M. M. M. Kuypers. Detoxification of sulphidic african shelf waters by blooming chemolithotrophs. Nature, 457:581–584, 2009.
[157] P. Legendre and L. Legendre. Numerical Ecology. Elsevier, 3 edition, 2012.
[158] T. M. Lenton, H. Held, E. Kriegler, J. W. Hall, W. Lucht, S. Rahmstorf, and H. J. Schellnhuber. Tipping elements in the earth’s climate system. Proceedings of the National Academy of Sciences, 105(6):1786–1793, 2008.
[159] D. Li, C. M. Liu, R. Luo, K. Sadakane, and T. W. Lam. Megahit: an ultra-fast single-node solution for large and complex metagenomics assembly via succinct de bruijn graph. Bioinformatics, 31(10):1674–1676, 2015.
[160] H. Li and R. Durbin. Fast and accurate short read alignment with burrows–wheeler transform. Bioinformatics, 25(14):1754–1760, 2009.
[161] R. Li, C.-L. Hsieh, A. Young, Z. Zhang, X. Ren, and Z. Zhao. Illumina synthetic long read sequencing allows recovery of missing sequences even in the finished c. elegans genome.
Scientific Reports, 5:10814, 2015.
[162] G. Lima-Mendez, K. Faust, N. Henry, J. Decelle, S. Colin, F. Carcillo, S. Chaffron, J. C. Ignacio-Espinosa, S. Roux, F. Vincent, L. Bittner, Y. Darzi, J. Wang, S. Audic, L. Berline, G. Bontempi, A. M. Cabello, L. Coppola, F. M. Cornejo-Castillo, F. d’Ovidio, L. D. Meester, I. Ferrera, M.-J. Garet-Delmas, L. Guidi, E. Lara, S. Pesant, M. Royo-Llonch, G. Salazar, P. Sánchez, M. Sebastian, C. Souffreau, C. Dimier, M. Picheral, S. Searson, S. Kandels-Lewis, G. Gorsky, F. Not, H. Ogata, S. Speich, L. Stemmann, J. Weissenbach, P. Wincker, S. G. Acinas, S. Sunagawa, P. Bork, M. B. Sullivan, E. Karsenti, C. Bowler, C. de Vargas, and J. Raes. Determinants of community structure in the global plankton interactome. Science, 348(6237), 2015.
[163] X. Lin, S. G. Wakeham, I. F. Putnam, Y. M. Astor, M. I. Scranton, A. Y. Chistoserdov, and G. T. Taylor. Comparison of vertical distributions of prokaryotic assemblages in the anoxic cariaco basin and black sea by use of fluorescence in situ hybridization. Applied and Environmental Microbiology, 72(4):2679–2690, 2006.
[164] R. Lockhart, J. Taylor, R. J. Tibshirani, and R. Tibshirani. A significance test for the lasso. The Annals of Statistics, 42(2):413–468, 2014.
[165] D. Lord, S. Guikema, and S. Geedipally. Application of the conway-maxwell-poisson generalized linear model for analyzing motor vehicle crashes. Accident Analysis & Prevention, 40(3):1123–1134, 2008.
[166] D. Lord, S. Geedipally, and S. Guikema. Extension of the application of conway-maxwell-poisson models: Analyzing traffic crash data exhibiting under-dispersion. Risk Analysis, 30(8):1268–1276, 2010.
[167] S. Louca, A. K. Hawley, S. Katsev, M. Torres-Beltran, M. P. Bhatia, S. Kheirandish, C. C. Michiels, D. Capelle, G. Lavik, M. Doebeli, S. A. Crowe, and S. J. Hallam. Integrating biogeochemistry with multiomic sequence information in a model oxygen minimum zone. Proceedings of the National Academy of Sciences, 113(40):E5925–E5933, 2016.
[168] M. I.
Love, W. Huber, and S. Anders. Moderated estimation of fold change and dispersion for rna-seq data with deseq2. Genome Biology, 15:550, 2014.
[169] S. Lücker, M. Wagner, F. Maixner, E. Pelletier, H. Koch, B. Vacherie, T. Rattei, J. S. S. Damsté, E. Spieck, D. L. Paslier, and H. Daims. A nitrospira metagenome illuminates the physiology and evolution of globally important nitrite-oxidizing bacteria. Proceedings of the National Academy of Sciences, 107(30):13479–13484, 2010.
[170] S. Lücker, B. Nowka, T. Rattei, E. Spieck, and H. Daims. The genome of nitrospina gracilis illuminates the metabolism and evolution of the major marine nitrite oxidizer. Frontiers in Microbiology, 4:27, 2013.
[171] R. Lyons. Strong laws of large numbers for weakly correlated random variables. The Michigan Mathematical Journal, 35(3):353–359, 1988.
[172] C. Mallows. Another comment on o’cinneide. The American Statistician, 45(3):257, 1991.
[173] S. A. Manavski and G. Valle. Cuda compatible gpu cards as efficient hardware accelerators for smith-waterman sequence alignment. BMC Bioinformatics, 9(Suppl 2):S10, 2008.
[174] S. S. Mande, M. H. Mohammed, and T. S. Ghosh. Classification of metagenomic sequences: methods and challenges. Briefings in Bioinformatics, 13(6):669–681, 2012.
[175] E. Mardis, J. McPherson, R. Martienssen, R. K. Wilson, and W. R. McCombie. What is finished, and why does it matter. Genome Research, 12(5):669–671, 2002.
[176] V. M. Markowitz, I. A. Chen, K. Palaniappan, K. Chu, E. Szeto, M. Pillay, A. Ratner, J. Huang, T. Woyke, M. Huntemann, I. Anderson, K. Billis, N. Varghese, K. Mavromatis, A. Pati, N. N. Ivanova, and N. C. Kyrpides. Img 4 version of the integrated microbial genomes comparative analysis system. Nucleic Acids Research, 42:D560–D567, 2014.
[177] W. Martens-Habbena, P. M. Berube, H. Urakawa, J. R. de la Torre, and D. A. Stahl. Ammonia oxidation kinetics determine niche separation of nitrifying archaea and bacteria. Nature, 461:976–981, 2009.
[178] E. P. Martins and T. F.
Hansen. Phylogenies and the comparative method: A general approach to incorporating phylogenetic information into the analysis of interspecific data. The American Naturalist, 149(4):646–667, 1997.
[179] M. Maurer and M. Boller. Modelling of phosphorus precipitation in wastewater treatment plants with enhanced biological phosphorus removal. Water Science and Technology, 39(1):147–163, 1999.
[180] K. Mavromatis, N. Ivanova, K. Barry, H. Shapiro, E. Goltsman, A. C. McHardy, I. Rigoutsos, A. Salamov, F. Korzeniewski, M. Land, A. Lapidus, I. Grigoriev, P. Richardson, P. Hugenholtz, and N. C. Kyrpides. Use of simulated data sets to evaluate the fidelity of metagenomic processing methods. Nature Methods, 4:495–500, 2007.
[181] P. McCullagh and J. Nelder. Generalized Linear Models. Springer, 1983.
[182] A. C. McHardy, H. G. Martín, A. Tsirigos, P. Hugenholtz, and I. Rigoutsos. Accurate phylogenetic classification of variable-length dna fragments. Nature Methods, 4:63–72, 2007.
[183] R. A. McLean, W. L. Sanders, and W. W. Stroup. A unified approach to mixed linear models. The American Statistician, 45(1):54–64, 1991.
[184] K. D. McMahon and E. K. Read. Microbial contributions to phosphorus cycling in eutrophic lakes and wastewater. Annual Review of Microbiology, 67:199–219, 2013.
[185] P. J. McMurdie and S. Holmes. Waste not, want not: Why rarefying microbiome data is inadmissible. PLOS Computational Biology, 10(4):e1003531, 2014.
[186] M. H. Mohammed, T. S. Ghosh, N. K. Singh, and S. S. Mande. Sphinx–an algorithm for taxonomic binning of metagenomic sequences. Bioinformatics, 27(1):22–30, 2011.
[187] F. Monteiro. Mechanistic models of oceanic nitrogen fixation. Massachusetts Institute of Technology, 2009.
[188] D. C. Montgomery. Design and Analysis of Experiments. John Wiley & Sons, Inc., 7 edition, 2009. ISBN 9780471661597.
[189] B. A. V. Mooy, R. G. Keil, and A. H. Devol.
Impact of suboxia on sinking particulate organic carbon: Enhanced carbon flux and preferential degradation of amino acids via denitrification. Geochimica et Cosmochimica Acta, 66(3):457–465, 2002.
[190] J. J. Morris, R. E. Lenski, and E. R. Zinser. The black queen hypothesis: Evolution of dependencies through adaptive gene loss. mBio, 3(2):e00036–12, 2012.
[191] A. Mortazavi, B. A. Williams, K. McCue, L. Schaeffer, and B. Wold. Mapping and quantifying mammalian transcriptomes by rna-seq. Nature Methods, 5:621–628, 2008.
[192] P. J. Mumby. Statistical power of non-parametric tests: A quick guide for designing sampling strategies. Marine Pollution Bulletin, 44:85–87, 2002.
[193] K. P. Murphy. Machine learning: A probabilistic perspective. 2012.
[194] E. W. Myers, G. G. Sutton, A. L. Delcher, I. M. Dew, D. P. Fasulo, M. J. Flanigan, S. A. Kravitz, C. M. Mobarry, K. H. J. Reinert, K. A. Remington, E. L. Anson, R. A. Bolanos, H.-H. Chou, C. M. Jordan, A. L. Halpern, S. Lonardi, E. M. Beasley, R. C. Brandon, L. Chen, P. J. Dunn, Z. Lai, Y. Liang, D. R. Nusskern, M. Zhan, Q. Zhang, X. Zheng, G. M. Rubin, M. D. Adams, and J. C. Venter. A whole-genome assembly of drosophila. Science, 287(5461):2196–2204, 2000.
[195] N. Nagarajan and M. Pop. Sequence assembly demystified. Nature Reviews Genetics, 14:157–167, 2013.
[196] J. Neyman and E. S. Pearson. On the problem of the most efficient tests of statistical hypotheses. Philosophical Transactions of the Royal Society of London, 231:289–337, 1933.
[197] D. Nichols, N. Cahoon, E. M. Trakhtenberg, L. Pham, A. Mehta, A. Belanger, T. Kanigan, K. Lewis, and S. S. Epstein. Use of ichip for high-throughput in situ cultivation of “uncultivable” microbial species. Applied and Environmental Microbiology, 76(8):2445–2450, 2010.
[198] P. H. Nielsen, A. T. Mielczarek, C. Kragelund, J. L. Nielsen, A. M. Saunders, Y. Kong, A. A. Hansen, and J. Vollertsen. A conceptual ecosystem model of microbial communities in enhanced biological phosphorus removal plants.
Water Research, 44:5070–5088, 2010.
[199] M. K. Nobu, T. Narihiro, C. Rinke, Y. Kamagata, S. G. Tringe, T. Woyke, and W.-T. Liu. Microbial dark matter ecogenomics reveals complex synergistic networks in a methanogenic bioreactor. The ISME Journal, 9:1710–1722, 2015.
[200] NVIDIA. CUDA C programming guide, 2016. docs.nvidia.com/cuda/cuda-c-programming-guide; accessed online 30-September-2016.
[201] B. B. Oakley, C. A. Morales, J. Line, M. E. Berrang, R. J. Meinersmann, G. E. Tillman, M. G. Wise, G. R. Siragusa, K. L. Hiett, and B. S. Seal. The poultry-associated microbiome: Network analysis and farm-to-fork characterizations. PLoS ONE, 8(2):e57190, 2013.
[202] A. Oehmen, A. M. Saunders, M. T. Vives, Z. Yuan, and J. Keller. Competition between polyphosphate and glycogen accumulating organisms in enhanced biological phosphorus removal systems with acetate and propionate as carbon sources. Journal of Biotechnology, 123:22–32, 2006.
[203] S. Off, M. Alawi, and E. Spieck. Enrichment and physiological characterization of a novel Nitrospira-like bacterium obtained from a marine sponge. Applied and Environmental Microbiology, 76(14):4640–4646, 2010.
[204] N. R. Pace. Mapping the tree of life: Progress and prospects. Microbiology and Molecular Biology Reviews, 73(4):565–576, 2009.
[205] J. M. Papakonstantinou. Historical Development of the BFGS Secant Method and Its Characterization Properties. Rice University, 2009.
[206] J. M. Papakonstantinou and R. A. Tapia. Origin and evolution of the secant method in one dimension. The American Mathematical Monthly, 120(6):500–518, 2013.
[207] D. H. Parks, M. Imelfort, C. T. Skennerton, P. Hugenholtz, and G. W. Tyson. CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome Research, 25:1043–1055, 2015.
[208] K. R. Patil, L. Roune, and A. C. McHardy. The PhyloPythiaS web server for taxonomic assignment of metagenome sequences. PLoS ONE, 7(6):e38581, 2012.
[209] A. Paulmier and D. Ruiz-Pino. 
Oxygen minimum zones (OMZs) in the modern ocean. Progress in Oceanography, 80(3-4):113–128, 2009.
[210] K. Pearson. On lines and planes of closest fit to systems of points in space. Philosophical Magazine, 2:559–572, 1901.
[211] W. R. Pearson. An introduction to sequence similarity ("homology") searching. Current Protocols in Bioinformatics, Chapter 3:Unit3.1, 2013.
[212] M. A. Pena and S. J. Bograd. Time series of the northeast Pacific. Progress in Oceanography, 75(2):115–119, 2007.
[213] A. Perry. A class of conjugate gradient algorithms with a two-step variable metric memory. Northwestern University, Center for Mathematical Studies in Economics and Management Science, Evanston, IL, Discussion Paper 269, 1977.
[214] M. Pester, C. Schleper, and M. Wagner. The Thaumarchaeota: an emerging view of their phylogeny and ecophysiology. Current Opinion in Microbiology, 14(3):300–306, 2011.
[215] P. A. Pevzner, H. Tang, and M. S. Waterman. An Eulerian path approach to DNA fragment assembly. Proceedings of the National Academy of Sciences, 98(17):9748–9753, 2001.
[216] P. Lam, G. Lavik, M. M. Jensen, J. van de Vossenberg, M. Schmid, D. Woebken, D. Gutiérrez, R. Amann, M. S. M. Jetten, and M. M. M. Kuypers. Revising the nitrogen cycle in the Peruvian oxygen minimum zone. Proceedings of the National Academy of Sciences, 106(12):4752–4757, 2009.
[217] R. Piessens, E. deDoncker Kapenga, C. Uberhuber, and D. Kahaner. Quadpack: a Subroutine Package for Automatic Integration. Springer, 1983.
[218] M. Pourahmadi. High-Dimensional Covariance Estimation. Wiley, 2013.
[219] J. Qin, R. Li, J. Raes, M. Arumugam, K. S. Burgdorf, C. Manichanh, T. Nielsen, N. Pons, F. Levenez, T. Yamada, D. R. Mende, J. Li, J. Xu, S. Li, D. Li, J. Cao, B. Wang, H. Liang, H. Zheng, Y. Xie, J. Tap, P. Lepage, M. Bertalan, J.-M. Batto, T. Hansen, D. L. Paslier, A. Linneberg, H. B. Nielsen, E. Pelletier, P. Renault, T. Sicheritz-Ponten, K. Turner, H. Zhu, C. Yu, S. Li, M. Jian, Y. Zhou, Y. Li, X. Zhang, S. Li, N. Qin, H. 
Yang, J. Wang, S. Brunak, J. Doré, F. Guarner, K. Kristiansen, O. Pedersen, J. Parkhill, J. Weissenbach, MetaHIT Consortium, P. Bork, S. D. Ehrlich, and J. Wang. A human gut microbial gene catalogue established by metagenomic sequencing. Nature, 464:59–65, 2010.
[220] C. Quast, E. Pruesse, P. Yilmaz, J. Gerken, T. Schweer, P. Yarza, J. Peplies, and F. O. Glöckner. The SILVA ribosomal RNA gene database project: improved data processing and web-based tools. Nucleic Acids Research, 41(D1):D590–D596, 2013.
[221] R Core Team. Writing R extensions, 2016. cran.r-project.org/doc/manuals/r-release/R-exts.html; accessed online 4 Oct 2016.
[222] L. R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257–286, 1989.
[223] D. N. Reshef, Y. A. Reshef, H. K. Finucane, S. R. Grossman, G. McVean, P. J. Turnbaugh, E. S. Lander, M. Mitzenmacher, and P. C. Sabeti. Detecting novel associations in large data sets. Science, 334(6062):1518–1524, 2011.
[224] L. J. Revell. Phylogenetic signal and linear regression on species data. Methods in Ecology and Evolution, 1(4):319–329, 2010.
[225] C. Rinke, P. Schwientek, A. Sczyrba, N. N. Ivanova, I. J. Anderson, J. Cheng, A. Darling, S. Malfatti, B. K. Swan, E. A. Gies, J. A. Dodsworth, B. P. Hedlund, G. Tsiamis, S. M. Sievert, W. Liu, J. A. Eisen, S. J. Hallam, N. C. Kyrpides, R. Stepanauskas, E. M. Rubin, P. Hugenholtz, and T. Woyke. Insights into the phylogeny and coding potential of microbial dark matter. Nature, 499:431–437, 2013.
[226] D. Risso, J. Ngai, T. P. Speed, and S. Dudoit. Normalization of RNA-seq data using factor analysis of control genes or samples. Nature Biotechnology, 32:896–902, 2014.
[227] M. E. Ritchie, B. Phipson, D. Wu, Y. Hu, C. W. Law, W. Shi, and G. K. Smyth. limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Research, 43(7):e47, 2015.
[228] M. D. Robinson, D. J. McCarthy, and G. K. Smyth. 
edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics, 26(1):139–140, 2010.
[229] R. Rohde, R. A. Muller, R. Jacobsen, E. Muller, S. Perlmutter, A. Rosenfeld, J. Wurtele, D. Groom, and C. Wickham. A new estimate of the average Earth surface land temperature spanning 1753 to 2011. Geoinformatics & Geostatistics: An Overview, 1, 2012.
[230] S. Roux, A. K. Hawley, M. T. Beltran, M. Scofield, P. Schwientek, R. Stepanauskas, T. Woyke, S. J. Hallam, and M. B. Sullivan. Ecology and evolution of viruses infecting uncultivated SUP05 bacteria as revealed by single-cell- and meta-genomics. eLife, 3:e03125, 2014.
[231] Q. Ruan, D. Dutta, M. S. Schwalbach, J. A. Steele, J. A. Fuhrman, and F. Sun. Local similarity analysis reveals unique associations among marine bacterioplankton species and environmental factors. Bioinformatics, 22(20):2532–2538, 2006.
[232] W. Rudin. Principles of Mathematical Analysis. McGraw-Hill, 1976.
[233] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning representations by back-propagating errors. Nature, 323:533–536, 1986.
[234] N. Sangwan, F. Xia, and J. A. Gilbert. Recovering complete and draft population genomes from metagenome datasets. Microbiome, 4:8, 2016.
[235] K. Sellers and G. Shmueli. A flexible regression model for count data. Annals of Applied Statistics, 4(2):943–961, 2010.
[236] A. L. Sessions, D. M. Doughty, P. V. Welander, R. E. Summons, and D. K. Newman. The continuing puzzle of the great oxidation event. Current Biology, 19(14):R567–R574, 2009.
[237] V. Shah, B. X. Chang, and R. M. Morris. Cultivation of a chemoautotroph from the SUP05 clade of marine bacteria that produces nitrite and consumes ammonium. The ISME Journal, 2016.
[238] D. F. Shanno. Conditioning of quasi-Newton methods for function minimization. Mathematics of Computation, 24(111):647–656, 1970.
[239] C. E. Shannon. A mathematical theory of communication. 
The Bell System Technical Journal, 27:379–423, 623–656, 1948.
[240] I. Sharon and J. F. Banfield. Genomes from metagenomics. Science, 342(6162):1057–1058, 2013.
[241] C. S. Sheik, S. Jain, and G. J. Dick. Metabolic flexibility of enigmatic SAR324 revealed through metagenomics and metatranscriptomics. Environmental Microbiology, 16(1):304–317, 2013.
[242] G. Shmueli, T. Minka, J. Kadane, S. Borle, and P. Boatwright. A useful distribution for fitting discrete data: revival of the Conway-Maxwell-Poisson distribution. Journal of the Royal Statistical Society: Series C, 54(1):127–142, 2005.
[243] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis. Mastering the game of Go with deep neural networks and tree search. Nature, 529:484–489, 2016.
[244] J. T. Simpson, K. Wong, S. D. Jackman, J. E. Schein, S. J. Jones, and I. Birol. ABySS: A parallel assembler for short read sequence data. Genome Research, 19:1117–1123, 2009.
[245] A. Sklar. Fonctions de répartition à n dimensions et leurs marges. Publications de l’Institut de Statistique de L’Université de Paris, 8:229–231, 1959.
[246] C. Spearman. 'General intelligence', objectively determined and measured. American Journal of Psychology, 15:201–293, 1904.
[247] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15:1929–1958, 2014.
[248] A. Stamatakis. RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics, 30(9):1312–1313, 2014.
[249] R. Stepanauskas and M. E. Sieracki. Matching phylogeny and metabolism in the uncultured marine bacteria, one cell at a time. Proceedings of the National Academy of Sciences, 104(21):9052–9057, 2007.
[250] L. 
Stramma, G. C. Johnson, J. Sprintall, and V. Mohrholz. Expanding oxygen-minimum zones in the tropical oceans. Science, 320(5876):655–658, 2008.
[251] M. Strous, J. A. Fuerst, E. H. M. Kramer, S. Logemann, G. Muyzer, K. T. van de Pas-Schoonen, R. Webb, J. G. Kuenen, and M. S. M. Jetten. Missing lithotroph identified as new planctomycete. Nature, 400:446–449, 1999.
[252] M. Sunamura, Y. Higashi, C. Miyako, J. ichiro Ishibashi, and A. Maruyama. Two bacteria phylotypes are predominant in the Suiyo Seamount hydrothermal plume. Applied and Environmental Microbiology, 70(2):1190–1198, 2004.
[253] B. K. Swan, M. Martinez-Garcia, C. M. Preston, A. Sczyrba, T. Woyke, D. Lamy, T. Reinthaler, N. J. Poulton, D. P. Masland, M. L. Gomez, M. E. Sieracki, E. F. DeLong, G. J. Herndl, and R. Stepanauskas. Potential for chemolithoautotrophy among ubiquitous bacteria lineages in the dark ocean. Science, 333(6047):1296–1300, 2011.
[254] T. Tatusova, S. Ciufo, B. Fedorov, K. O’Neill, and I. Tolstoy. RefSeq microbial genomes database: new representation and annotation strategy. Nucleic Acids Research, 42(1):D553–9, 2014.
[255] H. Teeling, A. Meyerdierks, M. Bauer, R. Amann, and F. O. Glöckner. Application of tetranucleotide frequencies for the assignment of genomic fragments. Environmental Microbiology, 6(9):938–947, 2004.
[256] H. Teeling, J. Waldmann, T. Lombardot, M. Bauer, and F. O. Glöckner. TETRA: a web-service and a stand-alone program for the analysis and comparison of tetranucleotide usage patterns in DNA sequences. BMC Bioinformatics, 5:163, 2004.
[257] K. Tennessen, E. Andersen, S. Clingenpeel, C. Rinke, D. S. Lundberg, J. Han, J. L. Dangl, N. Ivanova, T. Woyke, N. Kyrpides, and A. Pati. ProDeGe: a computational protocol for fully automated decontamination of genomes. The ISME Journal, 10:269–272, 2016.
[258] C. J. F. Ter Braak. Canonical correspondence analysis: a new eigenvector technique for multivariate direct gradient analysis. 
Ecology, 67(5):1167–1179, 1986.
[259] The NIH HMP Working Group, J. Peterson, S. Garges, M. Giovanni, P. McInnes, L. Wang, J. A. Schloss, V. Bonazzi, J. E. McEwen, K. A. Wetterstrand, C. Deal, C. C. Baker, V. D. Francesco, T. K. Howcroft, R. W. Karp, R. D. Lunsford, C. R. Wellington, T. Belachew, M. Wright, C. Giblin, H. David, M. Mills, R. Salomon, C. Mullins, B. Akolkar, L. Begg, C. Davis, L. Grandison, M. Humble, J. Khalsa, A. R. Little, H. Peavy, C. Pontzer, M. Portnoy, M. H. Sayre, P. Starke-Reed, S. Zakhari, J. Read, B. Watson, and M. Guyer. The NIH Human Microbiome Project. Genome Research, 19:2317–2323, 2009.
[260] T. Thomas, J. Gilbert, and F. Meyer. Metagenomics - a guide from sampling to data analysis. Microbial Informatics and Experimentation, 2:3, 2012.
[261] G. W. Tyson, J. Chapman, P. Hugenholtz, E. E. Allen, R. J. Ram, P. M. Richardson, V. V. Solovyev, E. M. Rubin, D. S. Rokhsar, and J. F. Banfield. Community structure and metabolism through reconstruction of microbial genomes from the environment. Nature, 428:37–43, 2004.
[262] O. Ulloa, D. E. Canfield, E. F. DeLong, R. M. Letelier, and F. J. Stewart. Microbial oceanography of anoxic oxygen minimum zones. Proceedings of the National Academy of Sciences, 109(40):15996–16003, 2012.
[263] A. Ultsch and F. Mörchen. ESOM-Maps: tools for clustering, visualization, and classification with emergent SOM. Germany: Data Bionics Research Group, University of Marburg, 2005.
[264] W. N. Venables and B. D. Ripley. Modern Applied Statistics with S. 2002.
[265] A. Wald. Tests of statistical hypotheses concerning several parameters when the number of observations is large. Transactions of the American Mathematical Society, 54(3):426–482, 1943.
[266] D. A. Walsh, E. Zaikova, C. G. Howes, Y. C. Song, J. J. Wright, S. G. Tringe, P. D. Tortell, and S. J. Hallam. Metagenome of a versatile chemolithoautotroph from expanding oceanic dead zones. Science, 326(5952):578–582, 2009.
[267] S. Weiss, W. Van Treuren, C. Lozupone, K. Faust, J. 
Friedman, Y. Deng, L. C. Xia, Z. Z. Xu, L. Ursell, E. J. Alm, A. Birmingham, J. A. Cram, J. A. Fuhrman, J. Raes, F. Sun, J. Zhou, and R. Knight. Correlation detection strategies in microbial data sets vary widely in sensitivity and precision. The ISME Journal, 10:1669–1681, 2016.
[268] D. L. Wheeler, D. M. Church, A. E. Lash, D. D. Leipe, T. L. Madden, J. U. Pontius, G. D. Schuler, L. M. Schriml, T. A. Tatusova, L. Wagner, and B. A. Rapp. Database resources of the National Center for Biotechnology Information. Nucleic Acids Research, 29(1):11–16, 2001.
[269] W. B. Whitman, D. C. Coleman, and W. J. Wiebe. Prokaryotes: The unseen majority. Proceedings of the National Academy of Sciences, 95(12):6578–6583, 1998.
[270] S. S. Wilks. The large-sample distribution of the likelihood ratio for testing composite hypotheses. The Annals of Mathematical Statistics, 9:60–62, 1938.
[271] J. J. Wright, K. M. Konwar, and S. J. Hallam. Microbial ecology of expanding oxygen minimum zones. Nature Reviews Microbiology, 10:381–394, 2012.
[272] J. J. Wright, K. Mewis, N. W. Hanson, K. M. Konwar, K. R. Maas, and S. J. Hallam. Genomic properties of Marine Group A bacteria indicate a role in the marine sulfur cycle. The ISME Journal, 8:455–468, 2014.
[273] T. D. Wright, K. L. Vergin, P. W. Boyd, and S. J. Giovannoni. A novel delta-subdivision proteobacterial lineage from the lower ocean surface layer. Applied and Environmental Microbiology, 63(4):1441–1448, 1997.
[274] K. C. Wrighton, B. C. Thomas, I. Sharon, C. S. Miller, C. J. Castelle, N. C. VerBerkmoes, M. J. Wilkins, R. L. Hettich, M. S. Lipton, K. H. Williams, P. E. Long, and J. F. Banfield. Fermentation, hydrogen, and sulfur metabolism in multiple uncultivated bacterial phyla. Science, 337(6102):1661–1665, 2012.
[275] Y. Wu, Y. Tang, S. G. Tringe, B. A. Simmons, and S. W. Singer. MaxBin: an automated binning method to recover individual genomes from metagenomes using an expectation-maximization algorithm. Microbiome, 2:26, 2014.
[276] Y.-W. Wu, B. A. 
Simmons, and S. W. Singer. MaxBin 2.0: an automated binning algorithm to recover genomes from multiple metagenomic datasets. Bioinformatics, 32(4):605–607, 2015.
[277] F. Xia, J. Chen, W. K. Fung, and H. Li. A logistic normal multinomial regression model for microbiome compositional data analysis. Biometrics, 69:1053–1063, 2013.
[278] L. C. Xia, J. A. Steele, J. A. Cram, Z. G. Cardon, S. L. Simmons, J. J. Vallino, J. A. Fuhrman, and F. Sun. Extended local similarity analysis (eLSA) of microbial community and other time series data with replicates. BMC Systems Biology, 5(Suppl 2):S15, 2011.
[279] E. Zaikova, D. A. Walsh, C. P. Stilwell, W. W. Mohn, P. D. Tortell, and S. J. Hallam. Microbial community dynamics in a seasonally anoxic fjord: Saanich Inlet, British Columbia. Environmental Microbiology, 12(1):172–191, 2009.
[280] D. R. Zerbino and E. Birney. Velvet: Algorithms for de novo short read assembly using de Bruijn graphs. Genome Research, 18:821–829, 2008.
[281] J. Zhou, Y. Deng, F. Luo, Z. He, and Y. Yang. Phylogenetic molecular ecological network of soil microbial communities in response to elevated CO2. mBio, 2(4):e00122–11, 2011.
[282] C. Zhu, R. H. Byrd, P. Lu, and J. Nocedal. Algorithm 778: L-BFGS-B: Fortran subroutines for large-scale bound-constrained optimization. ACM Transactions on Mathematical Software (TOMS), 23(4):550–560, 1997.
[283] Y. Zuo, G. Yu, M. G. Tadesse, and H. W. Ressom. Biological network inference using low order partial correlation. Methods, 69(3):266–273, 2014.

Appendix A

Data-driven argument as a Hidden Markov Model

Recognizing microbial ecology as an information science [120] emphasizes the data-driven nature of our arguments. Modern microbial ecology layers data products, feeding the output of one machine (or scientist) into another. 
For example, to achieve the goal of metabolic pathway prediction from a genome, (1) DNA must be extracted from a sample, then prepared and sent for sequencing; (2) a DNA sequencing machine [26] then translates the DNA into a digital representation as many fragments of As, Ts, Cs, and Gs; (3) fragments of DNA must then be assembled into larger, more useful contiguous sequences (contigs) by assembly software [159]; (4) the assembly may then be translated into metabolic pathway predictions by software [149] which is itself a pipeline of further sub-modules. Another important example for ecology is the counting of microbes (more accurately, their 16S genes), where (1) after DNA extraction, polymerase chain reaction (PCR) is used to amplify target DNA; (2) amplified DNA is sequenced; (3) reads are counted with software such as QIIME [51]; (4) 16S genes must then be aligned to a reference database such as GreenGenes [71] or SILVA [220] with local-alignment software like BLAST [9] or LAST [144]. At the end of these two examples, human interpretation relies heavily on the successful layering of data products. In either example, information is processed in a factory-style assembly line, with information being passed from machine to machine. Because the desired end products differ, assembly lines occasionally share steps but inevitably diverge. Each machine may be viewed as a singular unit, or decomposed into its own assembly lines, and relies on a degree of precision from previous machines to reliably build its data product.

Elaborating with the formalisms defined in section 1.6.5, we may imagine our assembly line of machines $C_i$ attributing $x_i \in x_{i+1}$ whenever $C_i(x_i) = x_{i+1}$. This is where ecological narratives meet predictive power. The reality of microbial ecology is that we often cannot observe true attributions $\{x_i \in x_{i+1}\}$, and we may not know if the narrative or model they constitute is correct, because we only ever observe our own attributions $\{C_i(x_i) = x_{i+1}\}$. 
We indirectly observe. Fortunately we are able to construct precision estimators, which tie our narratives to reality. Without precision, these narratives are irrelevant to reality.

A.1 Theoretical argument

In this section, we emphasize the importance of precision in data-driven argument by demonstrating how it may affect the precision of our entire argument. The conclusion is that precision is required at key points in our arguments and pipelines in order to be confident in our eventual conclusions. This is done by assuming indirect observation is routine, and likening our data analysis pipelines to a Hidden Markov Model (HMM).

To give our argument the sophistication it needs, let a capital letter $X_i$ denote the random variable representation of data product $x_i$. When we require that an $X_i$ equal a particular value $x_i$ via $X_i = x_i$, we constrain our argument. For example, $X_i$ may stand in for a random taxon (which we may attribute via $C_{i-1}(x_{i-1}) = X_i$), but if we require that taxon to be $x_i = \{\text{Thaumarchaeota}\}$, then we have the constraint $X_i = x_i$. Constraining data products allows us to construct data-driven arguments. A thorough example of this process is in section A.2.

Define a narrative to have the form $\{X_1 \in X_2, X_2 \in X_3, \ldots, X_{n-1} \in X_n, X_n \in x_{n+1}\}$, where the final conclusion is non-random, and we may constrain certain data products to be non-random, $X_i = x_i$.

Figure A.1: A linearly dependent Hidden Markov Model analogizing an inferential pipeline.

We actually make a data-driven argument for our narrative via our pipeline, $\{C_1(X_1) = X_2, C_2(X_2) = X_3, \ldots, C_{n-1}(X_{n-1}) = X_n, C_n(X_n) = x_{n+1}\}$, where again the final conclusion is non-random and certain data products may be constrained to be non-random. We observe all pipeline events $\{C_i(X_i) = X_{i+1}\}$, and we observe none of the narrative events $\{X_i \in X_{i+1}\}$. A data-driven argument is of the form $\{X_n \in x_{n+1} \mid C_1(X_1) = X_2, C_2(X_2) = X_3, \ldots, C_{n-1}(X_{n-1}) = X_n, C_n(X_n) = x_{n+1}\}$. 
So we constrain our pipeline to produce a data-driven argument for a narrative.

To leverage pre-existing theory for HMMs [23, 222], we conveniently assume that $\{C_i(X_i) = X_{i+1}\}$ only conditionally depends on $\{X_i \in X_{i+1}\}$, and that $\{X_i \in X_{i+1}\}$ only conditionally depends on $\{X_{i-1} \in X_i\}$, where conditional dependence is a probability concept. This provides the dependency structure illustrated in Figure A.1. Define the data-driven argument's final precision as $P[X_n \in x_{n+1} \mid \cap_{i=1}^{n} \{C_i(X_i) = X_{i+1}\}]$, and the $i$th local precision as $P[X_i \in x_{i+1} \mid C_i(X_i) = x_{i+1}]$. Our goal is to understand the final precision through a series of known local precisions. Because we sometimes must constrain our arguments, the following conditional decomposition is useful. Let $1 < m < n$ and constrain $X_m = x_m$.

\[
\begin{aligned}
& P[X_n \in x_{n+1}, X_{m-1} \in x_m \mid \cap_{i=1}^{n} \{C_i(X_i) = X_{i+1}\}] \\
&\quad = P[X_n \in x_{n+1} \mid X_{m-1} \in x_m, \cap_{i=1}^{n} \{C_i(X_i) = X_{i+1}\}] \times P[X_{m-1} \in x_m \mid \cap_{i=1}^{n} \{C_i(X_i) = X_{i+1}\}] \\
&\quad = P[X_n \in x_{n+1} \mid X_{m-1} \in x_m, \cap_{i=m}^{n} \{C_i(X_i) = X_{i+1}\}] \times P[X_{m-1} \in x_m \mid \cap_{i=1}^{m} \{C_i(X_i) = X_{i+1}\}]
\end{aligned}
\]

We've decomposed our entire final precision, breaking at the constrained point, into two final precisions for sub-pipelines, where that of the later sub-pipeline is conditioned on the other. Unfortunately, $\{X_{m-1} \in x_m\}$ is never observed, so instead we require that the initial precision is near one, $P[X_{m-1} \in x_m \mid \cap_{i=1}^{m} \{C_i(X_i) = X_{i+1}\}] \approx 1$. If this is true, then we may assume that $\{X_{m-1} \in x_m\}$ holds, and thus $P[X_n \in x_{n+1} \mid X_{m-1} \in x_m, \cap_{i=m}^{n} \{C_i(X_i) = X_{i+1}\}] \approx P[X_n \in x_{n+1} \mid \cap_{i=m}^{n} \{C_i(X_i) = X_{i+1}\}]$. So we arrive at the following approximation.

\[
\begin{aligned}
& P[X_n \in x_{n+1}, X_{m-1} \in x_m \mid \cap_{i=1}^{n} \{C_i(X_i) = X_{i+1}\}] \\
&\quad \approx P[X_n \in x_{n+1} \mid \cap_{i=m}^{n} \{C_i(X_i) = X_{i+1}\}] \times P[X_{m-1} \in x_m \mid \cap_{i=1}^{m} \{C_i(X_i) = X_{i+1}\}] \\
&\quad \text{when } P[X_{m-1} \in x_m \mid \cap_{i=1}^{m} \{C_i(X_i) = X_{i+1}\}] \approx 1
\end{aligned}
\]

This approximation gives us a way to understand our final precision in terms of other final precisions of our decomposed pipeline. 
This decomposition is necessary for developing precise and constrained data-driven arguments.

Next we leverage the forward algorithm from HMM theory to realize how important local precision at constrained points is for achieving high final precisions. The forward algorithm is an iterative (and dynamic programming) method for calculating $P[X_n \in x_{n+1}, \cap_{i=1}^{n} \{C_i(X_i) = X_{i+1}\}]$ from $P[X_{n-1} \in x_n, \cap_{i=1}^{n-1} \{C_i(X_i) = X_{i+1}\}]$. We will equivalently rephrase it in terms of precisions as follows.

\[
\begin{aligned}
P[X_n \in x_{n+1} \mid \cap_{i=1}^{n} \{C_i(X_i) = X_{i+1}\}]
&= P[X_n \in x_{n+1} \mid C_n(X_n) = x_{n+1}] \\
&\quad \times P[C_n(X_n) = x_{n+1}] \left( P[X_n \in x_{n+1}] \, P[C_n(X_n) = x_{n+1} \mid \cap_{i=1}^{n} \{C_i(X_i) = X_{i+1}\}] \right)^{-1} \\
&\quad \times \sum_{x_{n-1}} P[X_n \in x_{n+1} \mid x_{n-1} \in X_n] \, P[x_{n-1} \in X_n \mid \cap_{i=1}^{n-1} \{C_i(X_i) = X_{i+1}\}]
\end{aligned}
\]

This puts our final precision into a form equivalent to a product of a local precision and sub-pipeline's final precisions as follows.

\[
[\text{final precision } X_n] = [\text{local precision } X_n] \, a_n \sum_{x_{n-1}} b_n(x_{n-1}) \, [\text{final precision } x_{n-1}]
\]

This shows that a low local precision can lower the final precision. For example, if $C_n(\cdot) = x_{n+1}$ constantly, then our final precision is bounded above by $P[X_n \in x_{n+1}]$. We have shown that pipeline component precision can bound the final precision of our data-driven arguments. Therefore an imprecise pipeline component may reduce confidence in our narrative.

A.2 An example

The HMM representation necessarily admits more complexity than in Figure A.1, which is best communicated through an example. Consider the 16S correlation problem mentioned earlier. Imagine a scientist is using that pipeline among others to infer microbial syntrophy, perhaps cyclical redox of sulfur between two taxa. The pipelines likely share steps (genetic sequencing is popular) and meet at the end where interpretation occurs, but we focus on the correlation pipeline. 
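The effect of one weak component on the final precision can be illustrated numerically. The sketch below is an illustration under simplifying assumptions, not the pipeline or model used in this work: each narrative event is reduced to a true/false hidden state, the transition matrix and the per-step success rates are hypothetical numbers, and `final_precision` is a hypothetical helper name. It runs a normalized forward recursion for the case where every pipeline component reported success, then repeats the calculation with an uninformative final component.

```python
import numpy as np


def final_precision(prior_true, transition, emissions):
    """Posterior probability that the last hidden event is true,
    given that every pipeline step reported success.

    prior_true : assumed P[first narrative event is true]
    transition : 2x2 array, transition[a, b] = P[next event = b | current event = a],
                 with index 0 = false and 1 = true
    emissions  : list of (P[success reported | event false],
                          P[success reported | event true]), one pair per step
    """
    alpha = np.array([1.0 - prior_true, prior_true])
    for step, (p_if_false, p_if_true) in enumerate(emissions):
        if step > 0:
            # Propagate belief from the previous narrative event to this one.
            alpha = alpha @ transition
        # Weight by how likely the observed success is under each hidden state.
        alpha = alpha * np.array([p_if_false, p_if_true])
        # Normalizing at each step keeps alpha a conditional distribution.
        alpha = alpha / alpha.sum()
    return float(alpha[1])


# Hypothetical numbers: events tend to persist, and each component is precise.
transition = np.array([[0.9, 0.1],
                       [0.1, 0.9]])
precise_steps = [(0.05, 0.95)] * 4

# Same pipeline, but the last component reports success regardless of the truth.
weak_last_step = precise_steps[:3] + [(0.5, 0.5)]

p_precise = final_precision(0.5, transition, precise_steps)
p_weak = final_precision(0.5, transition, weak_last_step)
print(f"all components precise: {p_precise:.3f}")
print(f"uninformative last component: {p_weak:.3f}")
```

With an uninformative last component the final observation adds nothing, so under these assumed numbers the final precision falls back to the one-step prediction from the preceding sub-pipeline, which is lower than what a precise last component would give.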
Consider the following final precision.

\[
P[X_6 \in x_7 \mid C_1(X_1) = X_2, C_2(X_2) = x_3, C_3(x_3) = X_4, C_4(X_4) = X_5, C_5(X_5) = X_6, C_6(X_6) = x_7]
\]

Here we might have the following event values.

$\{C_6(x_6) = x_7\}$ = {Scientist asserts syntrophy, perhaps utilizing other pipelines}
$\{x_6 \in x_7\}$ = {Syntrophy is genuine, correlation is not illusory}
$\{C_5(x_5) = x_6\}$ = {taxa correlations are statistically significant}
$\{x_5 \in x_6\}$ = {taxa abundances do covary, inference is not an illusion of natural variation}
$\{C_4(x_4) = x_5\}$ = {16S genes align to database entries}
$\{x_4 \in x_5\}$ = {alignment corresponds to actual source}
$\{C_3(x_3) = x_4\}$ = {QIIME counts clustered 16S reads}
$\{x_3 \in x_4\}$ = {QIIME counts resemble authentic phylogenetic structure}
$\{C_2(x_2) = x_3\}$ = {DNA is sequenced}
$\{x_2 \in x_3\}$ = {sequenced DNA resembles true DNA}
$\{C_1(x_1) = x_2\}$ = {16S genes are amplified}
$\{x_1 \in x_2\}$ = {16S amplification primers are not biased against final taxa}

In this example, imagine a highly precise short-read Illumina platform was used for sequencing. So if we discover target taxa reads via $\{C_2(X_2) = x_3\}$ we basically observe $\{X_2 \in x_3\}$, and therefore we may consider bias against our target taxa irrelevant. Read counts may be low, but their covariation is still detectable. Under this interpretation (and disregarding others), we have Markovian behaviour as follows.

\[
P[X_6 \in x_7 \mid \cap_{i=1}^{n} \{C_i(X_i) = X_{i+1}\}] \approx P[X_6 \in x_7 \mid \cap_{i=3}^{n} \{C_i(X_i) = X_{i+1}\}]
\]

Now, further imagine that our scientific narrative requires that we constrain $X_6 = x_6$, meaning that a particular correlation does truly exist. According to our explorations of the forward algorithm, a low precision $P[X_5 \in x_6 \mid C_5(X_5) = x_6]$ could jeopardize the precision of our narrative.

Appendix B

CMP variance bound proof

In this appendix we prove that a CMP with mean $\mu$ and variance $\sigma^2$ satisfies $\sigma^2 < \mu(\mu + 1)$. Let $\mathbb{R}_{>0} = (0, \infty)$. Our proof strategy will start with strategic preliminary proofs. 
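Then because the CMP is parameterized in terms of parameters $(\lambda, \nu) \in \mathbb{R}_{>0} \times \mathbb{R}_{>0}$, we reparameterize the CMP to have known mean $\mu$ by setting $\lambda = \lambda_{\mu,\nu}$, where $\lambda_{\mu,\nu}$ is a function of both $\mu$ and $\nu$. Notice that the reparameterized CMP thus has a reparameterized variance $\sigma^2_{\mu,\nu}$, which is also a function of $\mu$ and $\nu$. We then study the reparameterized CMP to eventually discover the following two results.

Definition 1. A random variable $X$ is CMP$(\lambda, \nu)$ distributed if $P[X = x] = \left(\lambda^x / x!^{\nu}\right) / \left(\sum_{j=0}^{\infty} \lambda^j / j!^{\nu}\right)$ for $x \in \mathbb{Z}_{\geq 0}$. This is written as $X \sim \text{CMP}(\lambda, \nu)$.

Result 3. If $X$ follows a CMP distribution and has mean $\mu$ and variance $\sigma^2$, then $\sigma^2 < \mu(\mu + 1)$.

Result 4. If $X$ follows a CMP distribution with parameters $\lambda$ and $\nu$, then $X$ is over-dispersed ($\sigma^2_X > \mu_X$) when $\nu < 1$ and under-dispersed ($\sigma^2_X < \mu_X$) when $\nu > 1$.

B.1 Preliminaries

Lemma 1. Let functions $f, g, h$ be from $\mathbb{R}$ to $\mathbb{R}$, $h$ strictly monotone, and a random variable $X$ such that at least one of $E[f(X) \mid h(X)]$ or $E[g(X) \mid h(X)]$ is non-linear in $h(X)$. For brevity, let $f = f(X)$, $g = g(X)$, $h = h(X)$, $\sigma_{A,B} = \mathrm{Cov}(A, B)$, $\rho_{A,B} = \mathrm{Cor}(A, B)$, $\beta_{A,B} = \sigma_{A,B}/\sigma_{B,B}$, $\sigma_{A,B \cdot C}$ be the covariance of $A$ and $B$ partial $C$, and $\sigma_{A,B \mid C}$ be the conditional covariance of $A$ and $B$ conditioned on $C$. Then we have the following.

\[
\rho_{f,g} - \rho_{f,h}\rho_{g,h} \propto \sigma_{f,g \cdot h} \neq 0
\]

Here proportionality $\propto$ indicates scaling by a positive real value.

Proof of Lemma 1.

\[
\begin{aligned}
\rho_{f,g} - \rho_{f,h}\rho_{g,h} &\propto \sigma_{f,g} - \beta_{g,h}\sigma_{f,h} = \sigma_{f,g} - \beta_{g,h}\sigma_{f,h} - \beta_{f,h}\sigma_{g,h} + \beta_{f,h}\beta_{g,h}\sigma_{h,h} \\
&= \sigma_{f - \beta_{f,h}h,\; g - \beta_{g,h}h} = \sigma_{f,g \cdot h} \\
&= E\left[\sigma_{f - \beta_{f,h}h,\; g - \beta_{g,h}h \mid h}\right] + \sigma_{E[f - \beta_{f,h}h \mid h],\; E[g - \beta_{g,h}h \mid h]} && \text{(total covariance)} \\
&= E\left[\sigma_{f,g \mid h}\right] + \sigma_{E[f - \beta_{f,h}h \mid h],\; E[g - \beta_{g,h}h \mid h]} && \text{($h$ constant under $\mathrm{Cov}(\cdot \mid h)$)} \\
&= 0 + \sigma_{E[f - \beta_{f,h}h \mid h],\; E[g - \beta_{g,h}h \mid h]} && \text{($h$ strictly monotone $\Rightarrow$ $f, g$ are known given $h$)}
\end{aligned}
\]

By Theorem 6 in section B.4, $\sigma_{f,g \cdot h} = E[\sigma_{f,g \mid h}] = 0 \Leftrightarrow E[f \mid h]$ and $E[g \mid h]$ are linear in $h$. By assumption this is false, so $\sigma_{f,g \cdot h} = \sigma_{E[f - \beta_{f,h}h \mid h],\; E[g - \beta_{g,h}h \mid h]} \neq 0$. Further, $\mathrm{sign}(\rho_{f,g} - \rho_{f,h}\rho_{g,h}) = \mathrm{sign}(\sigma_{f,g \cdot h})$.

Lemma 2. (a) For each $\nu > 0$, $\sum_{j=0}^{N} \frac{\lambda^j}{j!^{\nu}} \to Z_{\lambda,\nu} \in \mathbb{R}$ uniformly in $\lambda$ as $N \to \infty$. 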
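(b) $\frac{\partial}{\partial\nu} \sum_{j=0}^{\infty} \frac{\lambda^j}{j!^{\nu}} = \sum_{j=0}^{\infty} \frac{\partial}{\partial\nu} \frac{\lambda^j}{j!^{\nu}}$

Proof of Lemma 2 (a). $\sum_{j=0}^{\infty} \frac{\lambda^j}{j!^{\nu}}$ is a power series in $\lambda$ and $\lim_{j\to\infty} \left| \frac{(j+1)!^{-\nu}}{j!^{-\nu}} \right| = 0 \Rightarrow \sum_{j=0}^{\infty} \frac{\lambda^j}{j!^{\nu}}$ is uniformly convergent for all $\lambda$.

Lemma 3. For all $c > 1$, and $\lambda > 0$, $j \in \mathbb{Z}_{\geq 1}$, there exists an $L \in \mathbb{R}_{>0}$ such that $j^c \lambda^j \leq L^j$ and $L > \lambda$.

Proof of Lemma 3. For all $(c, \lambda, j) \in \mathbb{R}^2_{>0} \times \mathbb{Z}_{\geq 0}$, there exists an $L = e^{c + \log\lambda} > \lambda$, which implies $L^j = \exp[j(c + \log\lambda)] \geq \exp[c \log j + j \log\lambda] = j^c \lambda^j$.

The following lemma is implicit in result 2.2 of Shmueli et al. [242]. Let $\overline{\mathbb{R}}_{\geq 0}$ denote $\mathbb{R}_{\geq 0} \cup \{\infty\}$.

Lemma 4. If $X \sim \text{CMP}(\lambda, \nu)$, then $X$ has finite positive integer moments.

Proof of Lemma 4. Let $h_{n,i} = \left(1 - \frac{1}{n}\right) i^m \frac{\lambda^i}{i!^{\nu}}$ for $n \in \mathbb{Z}_{\geq 2}$ $\Rightarrow h_{n,i} < h_{n+1,i}$ and $\lim_{n\to\infty} h_{n,i} = i^m \frac{\lambda^i}{i!^{\nu}}$, then $E[X^m] = \lim_{n\to\infty} \left(\sum_{i=1}^{\infty} h_{n,i}\right) Z^{-1}_{\lambda,\nu} = \left(\sum_{i=1}^{\infty} i^m \frac{\lambda^i}{i!^{\nu}}\right) Z^{-1}_{\lambda,\nu} \in \mathbb{R}_{>0}$ (by the Monotone Convergence Theorem (MCT)). For all $m \in \mathbb{Z}_{\geq 1}$, there exists $L_m$ such that

\[
\begin{aligned}
E[X^m] = \sum_{i=1}^{\infty} i^m \frac{\lambda^i}{i!^{\nu}} Z^{-1}_{\lambda,\nu} &\leq \sum_{i=0}^{\infty} \frac{L_m^i}{i!^{\nu}} Z^{-1}_{\lambda,\nu} && \text{(Lemma 3)} \\
&= Z_{L_m,\nu} Z^{-1}_{\lambda,\nu} < \infty && \text{(Lemma 2 (a))}
\end{aligned}
\]

Let $a \downarrow b$ mean that $a$ decreases as $b$ increases. Let $a \uparrow b$ mean that $a$ increases as $b$ increases.

Proof of Lemma 2 (b). Let $f_n(\nu) = \sum_{i=0}^{n} \frac{\lambda^i}{i!^{\nu}}$, then $\frac{\partial}{\partial\nu} f_n(\nu) = -\sum_{i=0}^{n} \log(i!) \frac{\lambda^i}{i!^{\nu}} \uparrow \nu$ $\forall \nu \in \mathbb{R}_{>0}$ and $\log(i!) \leq i^2$ implies $\frac{\partial}{\partial\nu} f_n(\nu) \to \cdot$ (Lemma 4). Thus $\frac{\partial}{\partial\nu} f_n(\nu)$ is locally uniformly convergent (by the mean value theorem) and $f_n(\nu) \to \cdot$ (Lemma 2 (a)) implies $\forall \nu \in \mathbb{R}_{>0}$, $\lim_{n\to\infty} \frac{\partial}{\partial\nu} f_n(\nu) = \frac{\partial}{\partial\nu} \lim_{n\to\infty} f_n(\nu)$. Apply Rudin's theorem, Theorem 5.

Lemma 5. $\frac{\partial}{\partial\lambda} \log\left(1 + \lambda + \sum_{j=2}^{\infty} \frac{\lambda^j}{j!^{\nu}}\right) > \frac{d}{d\lambda} \log(1 + \lambda)$

Proof of Lemma 5. $\frac{\partial}{\partial\lambda} \log\left(1 + \lambda + \sum_{j=2}^{\infty} \frac{\lambda^j}{j!^{\nu}}\right) > \frac{d}{d\lambda} \log(1 + \lambda)$ if and only if

\[
\frac{1 + \sum_{j=2}^{\infty} j \frac{\lambda^{j-1}}{j!^{\nu}}}{1 + \lambda + \sum_{j=2}^{\infty} \frac{\lambda^j}{j!^{\nu}}} > \frac{1}{1 + \lambda}
\]

if and only if

\[
\left(1 + \sum_{j=2}^{\infty} j \frac{\lambda^{j-1}}{j!^{\nu}}\right)(1 + \lambda) = 1 + \sum_{j=2}^{\infty} j \frac{\lambda^{j-1}}{j!^{\nu}} + \lambda + \sum_{j=2}^{\infty} j \frac{\lambda^j}{j!^{\nu}} > 1 + \lambda + \sum_{j=2}^{\infty} \frac{\lambda^j}{j!^{\nu}}
\]

B.2 Properties of $\lambda_{\mu,\nu}$

In this section we develop $\lambda_{\mu,\nu}$, a tool for fixing the expected value of the CMP distribution while manipulating the variance. It is shown that, given a $\nu$, $\mu$ and $\lambda$ are in one-to-one correspondence through the function $\lambda_{\mu,\nu}$. 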
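Because of this one-to-one correspondence, λ can be recovered from a target mean and a given ν by a one-dimensional root search. The following is a minimal sketch assuming a truncated series and plain bisection; the text does not specify which algorithm is used in practice, so the function names `cmp_moments` and `lambda_for_mean`, the truncation length, and the tolerance are all illustrative assumptions. It also prints the variance next to µ(µ + 1) so the bound stated at the start of this appendix can be inspected for a few moderate parameter values.

```python
import math


def cmp_moments(lam, nu, terms=500):
    """Mean and variance of CMP(lam, nu), approximated by truncating the series.

    Weights lam**j / (j!)**nu are computed in log space to avoid overflow.
    The truncation length `terms` is an assumption suited to moderate means.
    """
    log_w = [j * math.log(lam) - nu * math.lgamma(j + 1) for j in range(terms)]
    shift = max(log_w)
    w = [math.exp(v - shift) for v in log_w]
    z = sum(w)
    mean = sum(j * wj for j, wj in enumerate(w)) / z
    second = sum(j * j * wj for j, wj in enumerate(w)) / z
    return mean, second - mean * mean


def lambda_for_mean(mu, nu, rel_tol=1e-12):
    """Find lam with E[X] = mu for X ~ CMP(lam, nu) by bisection on a log scale.

    Assumes mu is well above the mean at the lower bracket end (about 1e-12).
    """
    lo, hi = 1e-12, 1.0
    while cmp_moments(hi, nu)[0] < mu:   # grow the bracket until it contains mu
        hi *= 2.0
    for _ in range(200):
        mid = math.sqrt(lo * hi)         # geometric midpoint of the bracket
        if cmp_moments(mid, nu)[0] < mu:
            lo = mid
        else:
            hi = mid
        if hi - lo <= rel_tol * hi:
            break
    return math.sqrt(lo * hi)


for mu, nu in [(0.5, 2.0), (3.0, 0.5), (3.0, 1.0), (10.0, 1.5)]:
    lam = lambda_for_mean(mu, nu)
    mean, var = cmp_moments(lam, nu)
    print(f"mu={mu}, nu={nu}: lambda={lam:.6g}, mean={mean:.6g}, "
          f"variance={var:.6g}, mu*(mu+1)={mu * (mu + 1):.6g}")
```

Bisection is chosen here only because it needs nothing beyond monotonicity of the mean in λ; a derivative-based search would also be possible but is not sketched.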
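It should be noted that $\lambda_{\mu,\nu}$ is calculated algorithmically in practice.

Definition 2. For each $(\mu, \nu) \in \mathbb{R}^2_{>0}$, $\lambda_{\mu,\nu} := \lambda$ such that $E[X] = \mu$ and $X \sim \text{CMP}(\lambda, \nu)$.

Defining such a $\lambda$ as above does not guarantee its existence nor uniqueness. This section proves its uniqueness and existence.

Let $\mu_{\lambda,\nu} = E[X]$ for $X \sim \text{CMP}(\lambda, \nu)$. Let $\sigma^2_{\lambda,\nu} = \mathrm{Var}[X]$ for $X \sim \text{CMP}(\lambda, \nu)$.

Lemma 6. For a fixed $\nu \in \mathbb{R}_{>0}$, and for all $\mu \in \mathbb{R}_{>0}$, there exists a unique $\lambda \in \mathbb{R}_{>0}$ such that $\mu_{\lambda,\nu} = \mu$.

Lemma 7. For fixed $\nu \in \mathbb{R}_{>0}$, $\mu_{\lambda,\nu} \to 0$ as $\lambda \to 0^+$.

Proof of Lemma 7.

\[
\begin{aligned}
\lim_{\lambda\to 0^+} \mu_{\lambda,\nu} &= \left(\lim_{\lambda\to 0^+} \sum_{i=1}^{\infty} i \frac{\lambda^i}{i!^{\nu}}\right) \left(\lim_{\lambda\to 0^+} Z_{\lambda,\nu}\right)^{-1} \\
&= \left(\sum_{i=1}^{\infty} \lim_{\lambda\to 0^+} i \frac{\lambda^i}{i!^{\nu}}\right) \left(\sum_{j=0}^{\infty} \lim_{\lambda\to 0^+} \frac{\lambda^j}{j!^{\nu}}\right)^{-1} && \text{(MCT)} \\
&= \left(\sum_{i=1}^{\infty} 0\right) (1 + 0)^{-1} = \frac{0}{1 + 0} = 0
\end{aligned}
\]

Lemma 8. For each $\nu \in \mathbb{R}_{>0}$, $\mu_{\lambda,\nu}$ is strictly increasing in $\lambda$, and $\frac{\partial}{\partial\lambda} \mu_{\lambda,\nu} = \sigma^2_{\lambda,\nu}/\lambda$.

Proof of Lemma 8. $\mu_{\lambda,\nu}$ is a ratio of convergent power series (apply Lemmas 2 (a), 3). Convergent power series are continuous. $Z_{\lambda,\nu} > 0$ for each $(\lambda, \nu) \in \mathbb{R}^2_{>0}$.

\[
\begin{aligned}
\frac{\partial}{\partial\lambda} \mu_{\lambda,\nu} &= \frac{\partial}{\partial\lambda} \left(\sum_{j=1}^{\infty} j \frac{\lambda^j}{j!^{\nu}}\right) Z^{-1}_{\lambda,\nu} \\
&= \left[\left(\sum_{j=1}^{\infty} \frac{\partial}{\partial\lambda} j \frac{\lambda^j}{j!^{\nu}}\right) Z_{\lambda,\nu} - \left(\sum_{j=1}^{\infty} j \frac{\lambda^j}{j!^{\nu}}\right) \left(\sum_{j=0}^{\infty} \frac{\partial}{\partial\lambda} \frac{\lambda^j}{j!^{\nu}}\right)\right] Z^{-2}_{\lambda,\nu} && \text{(Lemma 2 (b))} \\
&= \lambda^{-1} \sigma^2_{\lambda,\nu} > 0
\end{aligned}
\]

Lemma 9. For fixed $\nu \in \mathbb{R}_{>0}$, $\mu_{\lambda,\nu} \to \infty$ as $\lambda \to \infty$.

Let $a \wedge b = \min(\{a, b\})$ and $a \vee b = \max(\{a, b\})$.

Proof of Lemma 9. Assume to the contrary that $\mu_{\lambda,\nu} \not\to \infty$, then by Lemma 8 $\lim_{\lambda\to\infty} \mu_{\lambda,\nu}$ exists and $\lim_{\lambda\to\infty} \mu_{\lambda,\nu} < \infty$ $\Rightarrow$ there is a $c = \lim_{\lambda\to\infty} \mu_{\lambda,\nu} \in \mathbb{R}_{>0}$.

\[
\begin{aligned}
&\Rightarrow 0 \leq c - \mu_{\lambda,\nu} = c - \sum_{i=1}^{\infty} i \frac{\lambda^i}{i!^{\nu}} Z^{-1}_{\lambda,\nu} \\
&\Leftrightarrow 0 \leq c Z_{\lambda,\nu} - \sum_{i=1}^{\infty} i \frac{\lambda^i}{i!^{\nu}} = c + c \sum_{i=1}^{\infty} \frac{\lambda^i}{i!^{\nu}} - \sum_{i=1}^{\infty} i \frac{\lambda^i}{i!^{\nu}} \\
&\quad = c + \sum_{i=1}^{\infty} (c - i) \frac{\lambda^i}{i!^{\nu}} = c + \sum_{i=1}^{\lfloor c \rfloor} (c - i) \frac{\lambda^i}{i!^{\nu}} + \sum_{i=\lfloor c \rfloor + 1}^{\infty} (c - i) \frac{\lambda^i}{i!^{\nu}} \\
&\quad < c + c \sum_{i=1}^{\lfloor c \rfloor} (c - i) \frac{\lambda^i}{i!^{\nu}} + 0 < c \sum_{i=0}^{\lfloor c \rfloor} \lambda^i = c \, \frac{1 - \lambda^{\lfloor c \rfloor + 1}}{1 - \lambda} \\
&\Leftrightarrow 0 < \frac{c}{1 - \lambda} \left(1 - \lambda^{\lfloor c \rfloor + 1}\right) \Leftrightarrow 0 < 1 - \lambda^{\lfloor c \rfloor + 1} && \text{(for $\lambda > 1$)} \\
&\Leftrightarrow 1 > \lambda^{\lfloor c \rfloor + 1} \Leftrightarrow 0 > \log\lambda
\end{aligned}
\]

Contradiction for $\lambda > 1$, which is given with the limit.

Lemma 10. $\mu_{\lambda,\nu}$ is continuously differentiable in $\lambda$.

Proof of Lemma 10. For all $\lambda \in \mathbb{R}_{>0}$, $\sum_{j=1}^{\infty} j \frac{\lambda^j}{j!^{\nu}} \in \mathbb{R}_{>0}$ (Lemma 4). Both $Z_{\lambda,\nu}$ and $\sum_{j=1}^{\infty} j \frac{\lambda^j}{j!^{\nu}}$ are power series in $\lambda$ and are thus continuously differentiable. $\Rightarrow \mu_{\lambda,\nu} = \left(\sum_{j=1}^{\infty} j \frac{\lambda^j}{j!^{\nu}}\right) / Z_{\lambda,\nu}$ is continuously differentiable when $Z_{\lambda,\nu} \neq 0$. $Z_{\lambda,\nu} > 0$ for all $(\lambda, \nu) \in \mathbb{R}^2_{>0}$ since it is an infinite sum of positive terms.

Proof of Lemma 6. Fix $\nu \in \mathbb{R}_{>0}$. Let $g(\lambda) = \mu_{\lambda,\nu}$. 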
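Then, because $g(\lambda)$ is continuous (Lemma 10) and strictly increasing (Lemma 8), it is bijective. Also, its domain and range are $\mathbb{R}_{>0}$ (Lemmas 7 & 9). This implies there exists a unique function $g^{-1}(\mu) = \lambda_{\mu,\nu}$. Thus, given $\nu$ and through $g$, $\lambda$ and $\mu$ are in one-to-one correspondence.

We now have that $\lambda_{\mu,\nu}$ has a valid definition, in that existence and uniqueness are proven.

B.3 Properties of $\sigma^2_{\mu,\nu}$

Let $\sigma^2_{\mu,\nu} = \mathrm{Var}[X_{\mu,\nu}]$ for $X_{\mu,\nu} \sim \mathrm{CMP}(\lambda_{\mu,\nu}, \nu)$. We will now study $\sigma^2_{\mu,\nu}$ so that we may prove $\sigma^2_{\mu,\nu} < \mu(\mu + 1)$.

Lemma 11. $\frac{\partial}{\partial\mu}\lambda_{\mu,\nu} = \lambda_{\mu,\nu}/\sigma^2_{\mu,\nu}$ and $\lambda_{\mu,\nu}$ is continuously differentiable in $\mu$.

Proof of Lemma 11. Apply the inverse function theorem.
\[ \frac{\partial}{\partial\mu}\lambda_{\mu,\nu} = \left.\left(\frac{\partial}{\partial\lambda}\mu_{\lambda,\nu}\right)^{-1}\right|_{\lambda=\lambda_{\mu,\nu}} = \left.\left(\sigma^2_{\lambda,\nu}/\lambda\right)^{-1}\right|_{\lambda=\lambda_{\mu,\nu}} \text{ (by Lemma 8)} = \lambda_{\mu,\nu}/\sigma^2_{\mu,\nu}. \]
Since $\mu_{\lambda,\nu}$ is continuously differentiable (by Lemma 10) and $\frac{\partial}{\partial\mu}\lambda_{\mu,\nu}$ always exists, $\lambda_{\mu,\nu}$ is continuously differentiable in $\mu$.

Lemma 12. If $X \sim \mathrm{CMP}(\lambda, \nu)$, then for each $\lambda \in \mathbb{R}_{>0}$, $\frac{\partial}{\partial\nu}\mu_{\lambda,\nu} = \frac{\partial}{\partial\nu}E[X] = -\mathrm{Cov}[X, \log(X!)] \in \mathbb{R}_{\le 0}$.

Proof of Lemma 12. For $p \in \{0, 1\}$,
\[ \frac{\partial}{\partial\nu}\sum_{j=0}^{\infty} j^p\frac{\lambda^j}{(j!)^{\nu}} = \sum_{j=0}^{\infty}\frac{\partial}{\partial\nu} j^p\frac{\lambda^j}{(j!)^{\nu}} \quad \text{(Lemma 2 (b))} \]
\[ \Rightarrow \frac{\partial}{\partial\nu}E[X] = E[X]E[\log(X!)] - E[X\log(X!)]. \]
$X \in \mathbb{Z}_{\ge 0}$ implies $X$ and $\log(X!)$ are co-increasing; apply Theorem 4. All expectations are bounded by finite moments (Lemma 3).

Lemma 13. $\lambda_{\mu,\nu}$ is continuously differentiable in $\nu$, and for each $\nu$ and $X_{\mu,\nu} \sim \mathrm{CMP}(\lambda_{\mu,\nu}, \nu)$, $\frac{\partial}{\partial\nu}\lambda_{\mu,\nu} = \mathrm{Cov}[X_{\mu,\nu}, \log(X_{\mu,\nu}!)]\,\lambda_{\mu,\nu}/\sigma^2_{\mu,\nu}$.

Proof of Lemma 13. For every $\nu \in \mathbb{R}_{>0}$, $\frac{\partial}{\partial\nu}\mu_{\lambda,\nu}$ exists by Lemma 12, and $\frac{\partial}{\partial\lambda}\mu_{\lambda,\nu} > 0$ exists by Lemma 8. So by the implicit function theorem, $\frac{\partial}{\partial\nu}\lambda_{\mu,\nu}$ exists in an open set of $\mathbb{R}_{>0}$ containing $\nu$. Since this is true for all $\nu \in \mathbb{R}_{>0}$, $\lambda_{\mu,\nu}$ is continuously differentiable in $\nu$.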
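Further let $g(\lambda, \nu) = \mu_{\lambda,\nu}$. Then $g(\lambda_{\mu,\nu}, \nu) = \mu$ by Lemma 6 and the following holds.
\[
\begin{aligned}
&\frac{\partial}{\partial\nu} g(\lambda_{\mu,\nu}, \nu) = \frac{\partial g}{\partial\lambda}(\lambda_{\mu,\nu}, \nu)\left(\frac{\partial}{\partial\nu}\lambda_{\mu,\nu}\right) + \frac{\partial g}{\partial\nu}(\lambda_{\mu,\nu}, \nu) = 0 \\
&\Leftrightarrow \frac{\partial}{\partial\nu}\lambda_{\mu,\nu} = -\left(\frac{\partial g}{\partial\nu}(\lambda_{\mu,\nu}, \nu)\right)\left(\frac{\partial g}{\partial\lambda}(\lambda_{\mu,\nu}, \nu)\right)^{-1} = -\left(-\mathrm{Cov}[X_{\mu,\nu}, \log(X_{\mu,\nu}!)]\right)\left(\sigma^2_{\mu,\nu}/\lambda_{\mu,\nu}\right)^{-1}
\end{aligned}
\]

The following application of the DCT seems odd when the MCT may (eventually) apply, but this lemma exists to avoid a circular argument.

Lemma 14.
(a) If for each $\mu \in \mathbb{R}_{>0}$ there is an $M_\mu \in \mathbb{R}_{>0}$ such that $M_\mu > \lambda_{\mu,\nu}$ for each $\nu \in \mathbb{R}_{>0}$, then $\lim_{\nu\to\infty}\sum_{j=0}^{\infty}\frac{\lambda^j_{\mu,\nu}}{(j!)^{\nu}} = \sum_{j=0}^{\infty}\lim_{\nu\to\infty}\frac{\lambda^j_{\mu,\nu}}{(j!)^{\nu}}$.
(b) If for each $\mu \in \mathbb{R}_{>0}$ there is a $\lim_{\nu\to 0^+}\lambda_{\mu,\nu} \in \mathbb{R}$, then $\lim_{\nu\to 0^+}\sum_{j=0}^{\infty}\frac{\lambda^j_{\mu,\nu}}{(j!)^{\nu}} = \sum_{j=0}^{\infty}\lim_{\nu\to 0^+}\frac{\lambda^j_{\mu,\nu}}{(j!)^{\nu}}$.

Proof of Lemma 14 (a). $\sum_{j=0}^{\infty}\frac{\lambda^j_{\mu,\nu}}{(j!)^{\nu}} \le \sum_{j=0}^{\infty}\frac{M^j_{\mu}}{(j!)^{\nu}} \in \mathbb{R}_{>0}$ (Lemma 4). Apply the dominated convergence theorem.

Proof of Lemma 14 (b). $\lambda_{\mu,\nu} \uparrow \nu$ (Lemma 13) $\Rightarrow \frac{\lambda^j_{\mu,\nu}}{(j!)^{\nu}} \uparrow \nu$. Apply the monotone convergence theorem.

Lemma 15. If $\mu \in (0, 1)$, then $\lambda_{\mu,\infty} := \lim_{\nu\to\infty}\lambda_{\mu,\nu} \in \mathbb{R}_{>0}$.

Proof of Lemma 15. Assume to the contrary that $\lambda_{\mu,\infty} = \infty$. Then for $X_{\mu,\nu} \sim \mathrm{CMP}(\lambda_{\mu,\nu}, \nu)$,
\[
\begin{aligned}
\mu = \lim_{\nu\to\infty}E[X_{\mu,\nu}] &= \lim_{\nu\to\infty}\left.\left[\lambda\frac{d}{d\lambda}\log\left(\sum_{j=0}^{\infty}\frac{\lambda^j}{(j!)^{\nu}}\right)\right]\right|_{\lambda=\lambda_{\mu,\nu}} \\
&\ge \lim_{\nu\to\infty}\left.\left[\lambda\frac{d}{d\lambda}\log(1 + \lambda)\right]\right|_{\lambda=\lambda_{\mu,\nu}} \quad \text{(Lemma 5)} \\
&= \lim_{\nu\to\infty}\left.\left[\frac{\lambda}{1+\lambda}\right]\right|_{\lambda=\lambda_{\mu,\nu}} = \lim_{\nu\to\infty}\frac{\lambda_{\mu,\nu}}{1+\lambda_{\mu,\nu}} = 1,
\end{aligned}
\]
implying $\mu \ge 1$ and $\mu \in (0, 1)$, a contradiction. So $\lambda_{\mu,\infty} \in \mathbb{R}_{>0}$.

Lemma 16. If $\mu \in (0, 1)$, then $\lambda_{\mu,\infty} = \frac{\mu}{1-\mu} \in \mathbb{R}_{>0}$.

Proof of Lemma 16. Let $X_{\mu,\nu} \sim \mathrm{CMP}(\lambda_{\mu,\nu}, \nu)$.
\[
\begin{aligned}
\mu = \lim_{\nu\to\infty}E[X_{\mu,\nu}] &= \lim_{\nu\to\infty}\left(\sum_{j=1}^{\infty} j\frac{\lambda^j_{\mu,\nu}}{(j!)^{\nu}}\right)\left(\sum_{j=0}^{\infty}\frac{\lambda^j_{\mu,\nu}}{(j!)^{\nu}}\right)^{-1} \\
&= \left(\sum_{j=1}^{\infty}\lim_{\nu\to\infty} j\frac{\lambda^j_{\mu,\nu}}{(j!)^{\nu}}\right)\left(\sum_{j=0}^{\infty}\lim_{\nu\to\infty}\frac{\lambda^j_{\mu,\nu}}{(j!)^{\nu}}\right)^{-1} \quad \text{(Lemma 14 (a))} \\
&= \frac{\lambda_{\mu,\infty}}{1+\lambda_{\mu,\infty}} \quad \text{(exists by Lemma 15)} \\
\Rightarrow \lambda_{\mu,\infty} &= \frac{\mu}{1-\mu}
\end{aligned}
\]

Corollaries 2 and 3 are analogous to results in Sellers and Shmueli [235], but are specific to the reparameterized CMP. They describe how the CMP generalizes the Bernoulli and Geometric distributions.

Corollary 2. If $X_{\mu,\nu} \sim \mathrm{CMP}(\lambda_{\mu,\nu}, \nu)$, $\mu \in (0, 1)$ and $Y \sim \mathrm{Bernoulli}(\mu)$, then $X_{\mu,\nu} \to Y$ in distribution as $\nu \to \infty$.

Proof of Corollary 2.
\[
\begin{aligned}
\lim_{\nu\to\infty} f_{\mathrm{CMP}}(x; \lambda_{\mu,\nu}, \nu) &= \lim_{\nu\to\infty}\frac{\lambda^x_{\mu,\nu}}{(x!)^{\nu}}\left(\sum_{j=0}^{\infty}\frac{\lambda^j_{\mu,\nu}}{(j!)^{\nu}}\right)^{-1} \\
&= \left(\lim_{\nu\to\infty}\frac{\lambda^x_{\mu,\nu}}{(x!)^{\nu}}\right)\left(\sum_{j=0}^{\infty}\lim_{\nu\to\infty}\frac{\lambda^j_{\mu,\nu}}{(j!)^{\nu}}\right)^{-1} \quad \text{(Lemma 14 (a))} \\
&= \frac{\lambda^x_{\mu,\infty}}{1+\lambda_{\mu,\infty}}\mathbf{1}_{x<2} \quad \text{(Lemma 16)} \quad = \mu\mathbf{1}_{x=1} + (1-\mu)\mathbf{1}_{x=0}
\end{aligned}
\]

Lemma 17. If $\mu \in \mathbb{R}_{>0}$ and $\lambda = \frac{\mu}{1+\mu}$, then $\mu_{\lambda,\nu} \to \mu$ as $\nu \to 0^+$.

Proof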
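of Lemma 17. If $\lambda \in (0, 1)$, then
\[
\begin{aligned}
\lim_{\nu\to 0^+}\mu_{\lambda,\nu} &= \left(\sum_{i=1}^{\infty}\lim_{\nu\to 0^+} i\frac{\lambda^i}{(i!)^{\nu}}\right)\left(\sum_{j=0}^{\infty}\lim_{\nu\to 0^+}\frac{\lambda^j}{(j!)^{\nu}}\right)^{-1} \quad \text{(MCT)} \\
&= \left(\sum_{i=1}^{\infty} i\lambda^i\right)\left(\sum_{j=0}^{\infty}\lambda^j\right)^{-1} = \frac{\lambda}{1-\lambda} = \frac{\mu/(1+\mu)}{1 - \mu/(1+\mu)} = \mu
\end{aligned}
\]

Lemma 18. If $\mu \in \mathbb{R}_{>0}$, then $\lambda_{\mu,0} := \lim_{\nu\to 0^+}\lambda_{\mu,\nu} = \frac{\mu}{1+\mu}$.

Proof of Lemma 18. Let $X \sim \mathrm{CMP}(\frac{\mu}{1+\mu}, \nu)$ and $Y \sim \mathrm{CMP}(\lambda_{\mu,0}, \nu)$.
$\lambda_{\mu,\nu} \uparrow \mu$ (by Lemma 12) and $\mu = \lim_{\nu\to 0^+}E[X]$ (by Lemma 17) $\Rightarrow \lambda_{\mu,0} \ge \frac{\mu}{1+\mu}$.
Assume to the contrary that $\lambda_{\mu,0} > \frac{\mu}{1+\mu}$. Then
\[ \mu = \lim_{\nu\to 0^+}E[Y] > E[X] \text{ (by Lemma 8)} = \mu \;\Rightarrow\; \mu > \mu, \]
a contradiction. So $\lambda_{\mu,0} = \frac{\mu}{1+\mu}$.

Corollary 3. If $X_{\mu,\nu} \sim \mathrm{CMP}(\lambda_{\mu,\nu}, \nu)$ and $Y \sim \mathrm{Geometric}([1+\mu]^{-1})$, then $X_{\mu,\nu} \to Y$ in distribution as $\nu \to 0^+$.

Proof of Corollary 3.
\[
\begin{aligned}
\lim_{\nu\to 0^+} f_{\mathrm{CMP}}(x; \lambda_{\mu,\nu}, \nu) &= \lim_{\nu\to 0^+}\frac{\lambda^x_{\mu,\nu}}{(x!)^{\nu}}\left(\sum_{j=0}^{\infty}\frac{\lambda^j_{\mu,\nu}}{(j!)^{\nu}}\right)^{-1} \\
&= \left(\frac{\mu}{1+\mu}\right)^x\left(\sum_{j=0}^{\infty}\left(\frac{\mu}{1+\mu}\right)^j\right)^{-1} \quad \text{(by Lemmas 14 (b) & 18)} \\
&= \left(\frac{\mu}{1+\mu}\right)^x\left(1 - \frac{\mu}{1+\mu}\right) = \left(\frac{\mu}{1+\mu}\right)^x\left(\frac{1}{1+\mu}\right)
\end{aligned}
\]

Corollary 4. If $\mu \in (0, 1)$, then $\sigma^2_{\mu,\nu} \to \mu(1-\mu)$ as $\nu \to \infty$.

Proof of Corollary 4. Let $X_{\mu,\nu} \sim \mathrm{CMP}(\lambda_{\mu,\nu}, \nu)$ and $p \in \{1, 2\}$.
\[
\begin{aligned}
\lim_{\nu\to\infty}E[X^p_{\mu,\nu}] &= \left(\sum_{i=1}^{\infty}\lim_{\nu\to\infty} i^p\frac{\lambda^i_{\mu,\nu}}{(i!)^{\nu}}\right)\left(\sum_{j=0}^{\infty}\lim_{\nu\to\infty}\frac{\lambda^j_{\mu,\nu}}{(j!)^{\nu}}\right)^{-1} \quad \text{(by Lemma 14 (a))} \\
&= \lambda_{\mu,\infty}(1 + \lambda_{\mu,\infty})^{-1} \quad \text{(by Lemma 15)} \\
&= \frac{\mu}{1-\mu}\left(1 + \frac{\mu}{1-\mu}\right)^{-1} \quad \text{(by Lemma 16)} \\
&= \mu \\
\Rightarrow \lim_{\nu\to\infty}\sigma^2_{\mu,\nu} &= \lim_{\nu\to\infty}\left(E[X^2_{\mu,\nu}] - (E[X_{\mu,\nu}])^2\right) = \mu(1-\mu)
\end{aligned}
\]

Lemma 19. For fixed $\mu \in \mathbb{R}_{>0}$, $\sigma^2_{\mu,\nu} \to \mu(1+\mu)$ as $\nu \to 0^+$.

Proof of Lemma 19. Let $X_{\mu,\nu} \sim \mathrm{CMP}(\lambda_{\mu,\nu}, \nu)$.
\[
\begin{aligned}
\lim_{\nu\to 0^+}E[X^2_{\mu,\nu}] &= \left(\sum_{i=1}^{\infty} i^2\left(\frac{\mu}{1+\mu}\right)^i\right)\left(\sum_{j=0}^{\infty}\left(\frac{\mu}{1+\mu}\right)^j\right)^{-1} \quad \text{(by Lemmas 14 (b) & 18)} \\
&= \sum_{i=1}^{\infty} i^2\left(\frac{\mu}{1+\mu}\right)^i(1+\mu)^{-1} = \sum_{i=1}^{\infty} i^2\left(1 - \frac{1}{1+\mu}\right)^i\left(\frac{1}{1+\mu}\right) \\
&= \mu(1+\mu) + \mu^2 \quad \text{(Geometric distribution)}
\end{aligned}
\]

We will now put $\frac{\partial}{\partial\nu}\sigma^2_{\mu,\nu}$ into a form useful to Lemma 1.

Lemma 20. $\frac{\partial}{\partial\nu}\sigma^2_{\mu,\nu} \le 0$ if and only if $X_{\mu,\nu} \sim \mathrm{CMP}(\lambda_{\mu,\nu}, \nu)$ and
\[ \mathrm{Cor}[X^2_{\mu,\nu}, \log(X_{\mu,\nu}!)] \ge \mathrm{Cor}[X^2_{\mu,\nu}, X_{\mu,\nu}]\,\mathrm{Cor}[X_{\mu,\nu}, \log(X_{\mu,\nu}!)]. \]

Proof of Lemma 20.
\[
\begin{aligned}
\frac{\partial}{\partial\nu}\sigma^2_{\mu,\nu} &= \frac{\partial}{\partial\nu}\mathrm{Var}[X] = \frac{\partial}{\partial\nu}\left(E[X^2_{\mu,\nu}] - \mu^2\right) \\
&= \frac{\partial\lambda_{\mu,\nu}}{\partial\nu}\frac{\partial E[X^2_{\mu,\nu}]}{\partial\lambda} + \frac{\partial E[X^2_{\mu,\nu}]}{\partial\nu} \quad \text{(multivariate chain rule)} \\
&\le 0 \Leftrightarrow \mathrm{Cov}[X^2_{\mu,\nu}, X_{\mu,\nu}]\,\mathrm{Cov}[X_{\mu,\nu}, \log(X_{\mu,\nu}!)] \le \mathrm{Cov}[X^2_{\mu,\nu}, \log(X_{\mu,\nu}!)]\,\mathrm{Var}[X_{\mu,\nu}]
\end{aligned}
\]

Lemma 21. $\sigma^2_{\mu,\nu}$ decreases as $\nu$ increases.

Proof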
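of Lemma 21. For each $\mu > 0$, $\sigma^2_{\mu,\nu}|_{\nu=1} = \mu$ by Poisson generalization, and $\sigma^2_{\mu,\nu}|_{\nu=0} = \mu(\mu+1) > \mu$, so $\frac{\partial}{\partial\nu}\sigma^2_{\mu,\nu} < 0$ for some $\nu \le 1$.
For every $(\mu, \nu)$, $\mathrm{Cor}[X^2_{\mu,\nu}, \log(X_{\mu,\nu}!)] \ne \mathrm{Cor}[X^2_{\mu,\nu}, X_{\mu,\nu}]\,\mathrm{Cor}[X_{\mu,\nu}, \log(X_{\mu,\nu}!)]$ by Lemma 1.
$\frac{\partial}{\partial\nu}\sigma^2_{\mu,\nu} \le 0 \Leftrightarrow \mathrm{Cor}[X^2_{\mu,\nu}, \log(X_{\mu,\nu}!)] \ge \mathrm{Cor}[X^2_{\mu,\nu}, X_{\mu,\nu}]\,\mathrm{Cor}[X_{\mu,\nu}, \log(X_{\mu,\nu}!)]$ by Lemma 20, and since $\sigma^2_{\mu,\nu}$ is continuously differentiable, $\frac{\partial}{\partial\nu}\sigma^2_{\mu,\nu}$ cannot ever be zero.
But there is some $\nu \le 1$ such that $\frac{\partial}{\partial\nu}\sigma^2_{\mu,\nu} < 0$, so $\frac{\partial}{\partial\nu}\sigma^2_{\mu,\nu}$ is always negative.

Recall that Result 3 states that if $X_{\mu,\nu} \sim \mathrm{CMP}(\lambda, \nu)$, $E[X_{\mu,\nu}] = \mu$ and $\mathrm{Var}[X_{\mu,\nu}] = \sigma^2$, then $\sigma^2 < \mu(\mu+1)$.

Proof of Result 3. Since $\sigma^2_{\mu,\nu} \to \mu(\mu+1)$ as $\nu \to 0^+$ (by Lemma 19), and because $\sigma^2_{\mu,\nu}$ decreases in $\nu$ (by Lemma 21), $\sigma^2_{\mu,\nu} < \mu(\mu+1)$ for each $(\mu, \nu) \in \mathbb{R}^2_{>0}$.

Recall that Result 4 states that if $X_{\mu,\nu} \sim \mathrm{CMP}(\lambda, \nu)$, then $X_{\mu,\nu}$ is over-dispersed when $\nu < 1$ and under-dispersed when $\nu > 1$.

Proof of Result 4. $\sigma^2_{\mu,\nu}$ decreases in $\nu$ (by Lemma 21), and $\sigma^2_{\mu,\nu}|_{\nu=1} = \mu$ by Poisson generalization.

B.4 Borrowed material

Theorem 4. If functions $f, g$ are co-monotone (counter-monotone), then $E[f(X)g(X)] - E[f(X)]E[g(X)] \ge 0$ ($\le 0$).

Proof. Derive from $E(f(X) - f(Y))(g(X) - g(Y)) \ge 0$, with $X \overset{D}{=} Y$ independently, or see Hardy et al. [122].

Theorem 5. For each $n$, let $f_n : \mathbb{R} \to \mathbb{R}$ be differentiable on $[a, b]$. If there exists an $x_0 \in [a, b]$ such that $f_n(x_0) \to \cdot$, and $f'_n \to \cdot$ uniformly on $[a, b]$, then $f_n \to f$ uniformly on $[a, b]$ and $f'(x) = \lim_{n\to\infty} f'_n(x)$ for each $x \in [a, b]$.

Proof. See Rudin [232], theorem 7.17.

Theorem 6. For any random vectors $X = (X_1, X_2, \ldots, X_p)$, $Y = (Y_1, Y_2, \ldots, Y_p)$, the following are equivalent.
1. $E[X|Y] = \alpha + BY$ for a vector $\alpha$ and matrix $B$.
2. $\Sigma_{X\cdot Y} = E(\Sigma_{X|Y})$.
Here $\Sigma_{X\cdot Y}$ is the covariance matrix of $X$ partial $Y$, and $\Sigma_{X|Y}$ is the covariance matrix of $X$ conditioned on $Y$.

Proof.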
See Baba [16] theorem 2.1.1, or Baba and Sibuya [18] theorem 1.

Appendix C
Precision with imprecise binners

Here we develop strategies for improving the precision of our claims with imprecise classifiers, thus providing insight on strategies which may allow us to make confident inferences while constrained by error-prone tools. These strategies are theoretical and untested, but are inspired by current approaches. Our earlier results in section 1.6.5, which describe constraints on imprecise classifiers, hold true and are mathematical facts that cannot be avoided. Instead, we produce two strategies which overcome these caveats by changing either our classifier or our claim. In both strategies it is necessary to perform assumption checking with known-label data. In either strategy we follow the paradigm of section 1.6.5: classifier $C$ attempts attribution of object or phenomenon $X$ to some label $y$. In binning, $X$ is a metagenomic sequence, $C$ is a binner, and $y$ is a taxonomic label. Both of these strategies reduce the applicable scope of their methods, but only one actually consumes additional metagenomic sequences to improve its precision; the other instead has a greatly increased information requirement.

C.1 Marker gene strategy

It has been common to evaluate binning attempts with marker genes [5, 207, 225, 240]. As a post-hoc analysis, marker gene evaluation can be viewed as a modification of the initial binning procedure without loss of generality. So this approach modifies our classifier $C$ to leverage additional information. It is no longer enough to attribute $C(X) = y$; we now also require $X$ to satisfy additional requirements $X \in z$. For marker gene analyses specifically, $z$ would be a marker gene requirement. So while our initial precision $p = P[X \in y \mid C(X) = y]$ may be miserable, the modified process' precision $q = P[X \in y \mid C(X) = y, X \in z]$ can be much better.
Given a known-label data set, these precisions may be estimated as
\[ \hat{p} = \frac{\sum_{i=1}^{n}\mathbf{1}_{X_i \in y,\, C(X_i) = y}}{\sum_{i=1}^{n}\mathbf{1}_{C(X_i) = y}} \quad \text{and} \quad \hat{q} = \frac{\sum_{i=1}^{n}\mathbf{1}_{X_i \in y,\, C(X_i) = y,\, X_i \in z}}{\sum_{i=1}^{n}\mathbf{1}_{C(X_i) = y,\, X_i \in z}} \]
($\mathbf{1}_A$ is an indicator variable, $\mathbf{1}_A = \{1 \text{ if } A;\ 0 \text{ if } A^c\}$). Of course, it is possible that these estimates differ only due to sampling variation, and are not meaningfully different. To test for a statistically significant difference, Fisher's Exact Test may be applied for small samples, whereas result 5 is applicable when samples are large and a most statistically powerful test is desired.

Evaluating the added value of further constraining $X$ by $X \in z$ can only be achieved with known-label data, and thus is only a concern during classifier evaluation. For binning with marker genes, this means that this assumption should be evaluated when the binner itself is evaluated with a synthetic data set [180]. However, in application of the binner, this assumption-checking is no longer a concern (provided the assumption was confidently observed as true in the evaluation). The $X \in z$ requirement will likely reduce the number of sequences $C$ may be applied to, but within its scope of application, the modified binner simply has an increased precision per individual metagenomic sequence.

C.2 Common trait strategy

Taking inspiration from modern applications of binning [10], binning may not actually be used to make claims attributing relationships between taxonomy $y$ and metagenomic sequences $X$, but instead is used to attribute relationships between taxonomy $y$ and common traits $z$. For example, this trait could be the encoding of particular reactions in a biogeochemical pathway. Our focus has been fixed on the search for statistical dependence in $(\mathbf{1}_{C(X)=y}, \mathbf{1}_{X\in y})$, but it might be more pragmatic to search for dependence in $(\mathbf{1}_{X\in y}, \mathbf{1}_{X\in z})$. One possible way to test for such dependence is to search for a correlation between $\mathbf{1}_{X\in y}$ and $\mathbf{1}_{X\in z}$. Unfortunately we cannot observe $\mathbf{1}_{X\in y}$, but can observe $\mathbf{1}_{C(X)=y}$.
So we might use an imprecise binner $C$ to inform on taxonomy $y$. It might be possible to test for dependence within $(\mathbf{1}_{X\in y}, \mathbf{1}_{X\in z})$ by testing for a statistically significant correlation between observations $\mathbf{1}_{C(X_i)=y}$ and $\mathbf{1}_{X_i\in z}$.

Unfortunately it is entirely possible that our classifiers are not only wrong but also biased, and that their classifications are overwhelmed by artifactual constructs. However, if our classifiers are wrong in an unbiased way, we may aggregate their conclusions in a more precise way. We formulate one such concept of bias by assuming that our data $X_i = (\mathbf{1}_{X_i\in y}, \mathbf{1}_{X_i\in z}, \mathbf{1}_{C(X_i)=y})^T$ follow a multivariate probit model [15, 57]. During the classifier evaluation, when $\mathbf{1}_{X_i\in y}$ is observable, it might be decided that the latent partial correlation $\sigma_{C,Z\cdot Y}$ (formally defined in result 6) is bounded within a range $|\sigma_{C,Z\cdot Y}| < b$. If such an assumption is demonstrated to be viable, then there are correlations of the observable $(\mathbf{1}_{C(X_i)=y}, \mathbf{1}_{X_i\in z})$ which imply $(\mathbf{1}_{X\in y}, \mathbf{1}_{X\in z})$ is probably correlated. Of course a smaller bound $b$ results in more confidently inferable correlations between taxonomy and biogeochemical pathways.

As in the previous strategy, a test must be conducted during classifier evaluation to conclude that an essential assumption is satisfied. However, this strategy requires further statistical testing, and we have yet to argue that it may more precisely evaluate a target claim. Given this strategy's assumption is satisfied, application requires discovering a common trait $z$ amongst some metagenomic sequences $X_i \in z$. For example, they might encode reactions participating in the denitrification pathway. Then a Fisher Exact Test could conclude that observations $(\mathbf{1}_{C(X_i)=y}, \mathbf{1}_{X_i\in z})$ are significantly correlated (suitable for $b = 0$), or a likelihood ratio test could be used (suitable for $b > 0$). In this way, we may evaluate a claim that particular taxa are correlated with a particular biogeochemical pathway.
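The b = 0 case just described can be sketched in R with base `fisher.test`. This is a minimal sketch on simulated data, not an analysis from the thesis: the sample size, the probabilities, and every variable name below are hypothetical, and the binner is simulated as erring independently of the trait, which is the unbiasedness assumption the strategy depends on.

```r
# Sketch only: simulated data; all names and parameters are illustrative
# assumptions, not values from the thesis.
set.seed(42)
n <- 400

in_taxon  <- rbinom(n, 1, 0.3)                              # 1{X in y}: unobservable in practice
has_trait <- rbinom(n, 1, ifelse(in_taxon == 1, 0.7, 0.2))  # 1{X in z}: trait depends on taxon
flip      <- rbinom(n, 1, 0.2)                              # binner errors, independent of the trait
binner_y  <- ifelse(flip == 1, 1 - in_taxon, in_taxon)      # 1{C(X) = y}: observable

# 2x2 contingency table of the two observable indicators. Explicit factor
# levels keep the table 2x2 even if one level happens to be absent.
tab <- table(binner = factor(binner_y, levels = 0:1),
             trait  = factor(has_trait, levels = 0:1))

ft <- fisher.test(tab)  # Fisher's Exact Test on the observable pair
ft$p.value
```

A small p-value here would support dependence between the binner's calls and the trait; whether that transfers to the true taxonomy rests on the bound on the latent partial correlation checked during classifier evaluation.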
Notice that this test consumes several metagenomic sequences to evaluate a single claim.

To argue an increase in precision, we observe that we are classifying our claim through hypothesis testing. We make our claim through rejection of a null hypothesis, $H_0 : \mathrm{Cor}[C(X) = y, X \in z] = 0$. We reject the null correctly with probability $1 - \beta = P[\text{reject } H_0 \mid H_1]$; this is the statistical power of our test. We reject our null incorrectly with probability $\alpha = P[\text{reject } H_0 \mid H_0]$; this is the type-1 error rate. Assuming that our hypothesis alternatives make up a true dichotomy of our sample space, $H_0 \,\dot\cup\, H_1 = \Omega$, on our probability space $(\Omega, \mathcal{F}, P)$, we may conclude that the probability of a correct claim is $(1 - \beta)/(1 - \beta + \alpha)$. So to increase our precision (probability of a true claim), we must achieve a statistical power $(1 - \beta)$ which is large relative to our type-1 error rate ($\alpha$). Methods for increasing statistical power can be sophisticated [192], but it is heuristically true that power increases with sample size. For binning, this means consuming more metagenomic sequences with an unbiased binner can produce more precise claims.

C.3 Formal arguments

Result 5. Let $A_i := \mathbf{1}_{\{X_i \in y\}}$ for each of the $X_i$ such that $X_i \in z$. Then we have an independent sample of $n$ Bernoulli random variables $A_i$ such that if $\sum_{i=1}^{n} A_i = k$, we reject the null hypothesis $H_0 : P[A_i = 1] = p$ for each $i$ at significance level $\alpha$ if the following statement holds.
\[ 2k\log\frac{k}{np} + 2(n - k)\log\frac{n - k}{n(1 - p)} > \chi^2_{n-1}(\alpha) \]

Proof. Let $P[A_i = 1] = q$. The likelihood function of the sample $A_{i \in \{1, 2, \ldots, n\}}$ is $f(n, k, q) = q^k(1 - q)^{n-k}$ with maximum likelihood estimate $\hat{q} = k/n$. Then Wilks' $-2\log\Lambda$ statistic is the following.
\[ -2\log\Lambda = -2\log\frac{f(n, k, p)}{f(n, k, \hat{q})} = -2\log\frac{f(n, k, p)}{f(n, k, k/n)} = 2k\log\frac{k}{np} + 2(n - k)\log\frac{n - k}{n(1 - p)} \]
And $-2\log\Lambda$ follows a $\chi^2_{n-1}$ distribution under $H_0$ when $n$ is large, according to Wilks' theorem [270].

Result 6.
Let $X \in \{0, 1\}^3$ be multivariate probit-distributed [15, 57] as follows.
\[
X = \begin{pmatrix} \mathbf{1}_{X\in y} \\ \mathbf{1}_{X\in z} \\ \mathbf{1}_{C(X)=y} \end{pmatrix} = \begin{pmatrix} \mathbf{1}_{Y>0} \\ \mathbf{1}_{Z>0} \\ \mathbf{1}_{C>0} \end{pmatrix}; \qquad
\begin{pmatrix} Y \\ Z \\ C \end{pmatrix} \sim N_3\left(\begin{pmatrix} \mu_Y \\ \mu_Z \\ \mu_C \end{pmatrix}, \begin{pmatrix} \sigma_{Y,Y} & \sigma_{Y,Z} & \sigma_{Y,C} \\ \sigma_{Y,Z} & \sigma_{Z,Z} & \sigma_{Z,C} \\ \sigma_{Y,C} & \sigma_{Z,C} & \sigma_{C,C} \end{pmatrix}\right)
\]
where the covariance notation from section B.1 or section 1.5.3 is used.
If $|\sigma_{C,Z\cdot Y}| \le b$, then any test which accepts $H_0 : |\sigma_{C,Z}| > b$ implies $\sigma_{C,Y} \ne 0$ and $\sigma_{Z,Y} \ne 0$, but also a test which accepts $H_1 : |\sigma_{C,Z}| \le b$ admits the possibility that $\sigma_{C,Y} = 0$ or $\sigma_{Z,Y} = 0$ (without further realized constraints).

Proof. $\sigma_{C,Z} = \sigma_{C,Z\cdot Y} + \sigma_{C,Y}\sigma_{Z,Y}\sigma^{-1}_{Y,Y}$, so
\[
\begin{aligned}
|\sigma_{C,Z}| > b &\Rightarrow |\sigma_{C,Z\cdot Y} + \sigma_{C,Y}\sigma_{Z,Y}\sigma^{-1}_{Y,Y}| \le |\sigma_{C,Z\cdot Y}| + |\sigma_{C,Y}\sigma_{Z,Y}\sigma^{-1}_{Y,Y}| \le b + |\sigma_{C,Y}\sigma_{Z,Y}\sigma^{-1}_{Y,Y}| \\
&\Rightarrow |\sigma_{C,Y}\sigma_{Z,Y}\sigma^{-1}_{Y,Y}| \ge |\sigma_{C,Z}| - b > 0 \\
&\Rightarrow \sigma_{C,Y} \ne 0 \text{ and } \sigma_{Z,Y} \ne 0,
\end{aligned}
\]
giving us our necessary implication.
\[ |\sigma_{C,Z}| < b \Leftarrow \left(\sigma_{C,Z\cdot Y} = \sigma_{C,Z} < b \Rightarrow 0 = \sigma_{C,Y}\sigma_{Z,Y}\sigma^{-1}_{Y,Y} \Rightarrow \sigma_{C,Y} = 0 \text{ or } \sigma_{Z,Y} = 0\right), \]
giving us our admission.

Appendix D
Miscellaneous

D.1 Factorial experiment regression summaries

> summary( lm( ll$precision[,"gpu_0.05"] ~ m$mdl + m$n + m$q + m$s ) )

Call:
lm(formula = ll$precision[, "gpu_0.05"] ~ m$mdl + m$n + m$q +
    m$s)

Residuals:
     Min       1Q   Median       3Q      Max
-0.81194 -0.20880  0.01995  0.21959  0.65136

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   0.9518     0.2210   4.307 0.000262 ***
m$mdlmln     -0.4323     0.1877  -2.303 0.030691 *
m$n1000      -0.1398     0.1521  -0.920 0.367355
m$q4         -0.1709     0.1521  -1.124 0.272809
m$s0.5        0.1233     0.1670   0.738 0.467718
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3735 on 23 degrees of freedom
  (36 observations deleted due to missingness)
Multiple R-squared:  0.2039,  Adjusted R-squared:  0.06545
F-statistic: 1.473 on 4 and 23 DF,  p-value: 0.2429

> summary( lm( ll$precision[,"pearson_0.05"] ~ m$mdl + m$n + m$q + m$s ) )

Call:
lm(formula = ll$precision[, "pearson_0.05"] ~ m$mdl + m$n + m$q +
    m$s)

Residuals:
      Min        1Q    Median        3Q       Max
-0.239552 -0.026302 -0.005363  0.026910  0.206462

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.064861   0.062489   1.038    0.311
m$mdlmln    -0.043095   0.047347  -0.910    0.373
m$n1000      0.018219   0.039277   0.464    0.648
m$q4        -0.009096   0.049585  -0.183    0.856
m$s0.5       0.549366   0.039277  13.987 4.11e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.09917 on 21 degrees of freedom
  (38 observations deleted due to missingness)
Multiple R-squared:  0.904,  Adjusted R-squared:  0.8857
F-statistic: 49.42 on 4 and 21 DF,  p-value: 2.169e-10

> summary( lm( ll$precision[,"sparcc_0.05"] ~ m$mdl + m$n + m$q + m$s ) )

Call:
lm(formula = ll$precision[, "sparcc_0.05"] ~ m$mdl + m$n + m$q +
    m$s)

Residuals:
     Min       1Q   Median       3Q      Max
-0.17990 -0.06180  0.01208  0.06135  0.15689

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.02234    0.02620  -0.853   0.3972
m$mdlmln    -0.03278    0.02343  -1.399   0.1671
m$n1000      0.04249    0.02343   1.813   0.0749 .
m$q4         0.04816    0.02343   2.055   0.0443 *
m$s0.5       0.37918    0.02343  16.182   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.09373 on 59 degrees of freedom
Multiple R-squared:  0.8214,  Adjusted R-squared:  0.8093
F-statistic: 67.83 on 4 and 59 DF,  p-value: < 2.2e-16

D.2 Taxa regressed

The following taxa were regressed in our 16S multivariate regression analysis.

 [1] "Nitrospina"                           "Nitrospira"
 [3] "SUP05"                                "Cyanobacteria"
 [5] "Planctomycetes"                       "Thaumarchaeota"
 [7] "OD1"                                  "MGA"
 [9] "k__Bacteria.p__Proteobacteria"        "k__Bacteria.p__ZB3"
[11] "k__Archaea.p__Unclassified"           "k__Bacteria.p__Verrucomicrobia"
[13] "k__Bacteria.p__OP3"                   "k__Bacteria.p__Caldithrix_KSB1"
[15] "No.blast.hit"                         "k__Archaea.p__pISA1"
[17] "k__Bacteria.p__Bacteroidetes"         "k__Bacteria.p__Actinobacteria"
[19] "k__Bacteria.p__Lentisphaerae"         "k__Bacteria.p__TM6"
[21] "k__Bacteria.p__Chloroflexi"           "k__Bacteria.p__OP11"
[23] "k__Bacteria.p__Firmicutes"            "k__Bacteria.p__Elusimicrobia_TG1"
[25] "k__Archaea.p__pMC2A209"               "k__Bacteria.p__Acidobacteria"
[27] "k__Bacteria.p__TM7"                   "k__Bacteria.p__Spirochaetes"
[29] "k__Archaea.p__Thermoplasmata_Eury"    "k__Archaea.p__Methanobacteria_Eury"
[31] "k__Bacteria.p__Fusobacteria"          "k__Archaea.p__BC07.2A.27"
[33] "k__Archaea.p__Methanomicrobia_Eury"   "k__Bacteria.p__Chlamydiae"
[35] "k__Bacteria.p__Nitrospirae"           "k__Bacteria.p__WS3"
[37] "k__Bacteria.p__GN02"                  "k__Bacteria.p__WS6"
[39] "k__Archaea.p__pMC2A384"               "k__Bacteria.p__VHS.B5.50"
[41] "k__Bacteria.p__ZB2"                   "k__Bacteria.p__Gemmatimonadetes"
[43] "k__Archaea.p__pMC2A15"                "k__Bacteria.p__NKB19"
[45] "k__Bacteria.p__Fibrobacteres"         "k__Archaea.p__Methanococci_Eury"
[47] "k__Bacteria.p__Unclassified"          "k__Archaea.p__Halobacteriales"
[49] "k__Bacteria.p__SM2F11"                "k__Bacteria.p__OP9_JS1"
[51] "k__Bacteria.p__ctg_CGOF"              "k__Archaea.p__DHVE3"
[53] "k__Archaea.p__pSL22"                  "k__Bacteria.p__OP10"
[55] "k__Archaea.p__MSBL1"                  "k__Bacteria.p__OP1"
[57] "k__Archaea.p__NO27FW"

D.3 Marginal regression survey results

In the following R output, +1 indicates a positive and statistically significant correlation, -1 indicates a negative and statistically significant correlation, and 0 indicates a statistically insignificant relationship at the $\alpha = 0.05$ level.
Note that some taxa will correlate with a variable even if their metabolic relationship is with a correlated but unmeasured environmental variable.

Taxon  o2  no3  h2s
Nitrospina  -1   1   0
Nitrospira  -1  -1   0
SUP05  -1   0   0
Cyanobacteria   1   1   1
Planctomycetes  -1   0   0
Thaumarchaeota  -1   1   0
OD1  -1  -1   1
MGA  -1   0   0
k__Bacteria.p__Proteobacteria   0  -1   0
k__Bacteria.p__ZB3  -1   0   1
k__Archaea.p__Unclassified  -1  -1   1
k__Bacteria.p__Verrucomicrobia  -1   0   0
k__Bacteria.p__OP3   0   1   1
k__Bacteria.p__Caldithrix_KSB1  -1  -1   1
No.blast.hit   0   0   0
k__Archaea.p__pISA1  -1  -1   1
k__Bacteria.p__Bacteroidetes   0   0   0
k__Bacteria.p__Actinobacteria   0   0   0
k__Bacteria.p__Lentisphaerae   0  -1   1
k__Bacteria.p__TM6  -1  -1   1
k__Bacteria.p__Chloroflexi  -1   0   1
k__Bacteria.p__OP11  -1  -1   1
k__Bacteria.p__Firmicutes   0   0   1
k__Bacteria.p__Elusimicrobia_TG1   0   0   1
k__Archaea.p__pMC2A209  -1   0   1
k__Bacteria.p__Acidobacteria  -1   0   0
k__Bacteria.p__TM7   0   1   0
k__Bacteria.p__Spirochaetes   0  -1   1
k__Archaea.p__Thermoplasmata_Eury   0   0  -1
k__Archaea.p__Methanobacteria_Eury  -1   0   1
k__Bacteria.p__Fusobacteria   0   0   0
k__Archaea.p__BC07.2A.27  -1   0   1
k__Archaea.p__Methanomicrobia_Eury   0  -1   1
k__Bacteria.p__Chlamydiae  -1   0   1
k__Bacteria.p__Nitrospirae  -1  -1   0
k__Bacteria.p__WS3  -1   0   1
k__Bacteria.p__GN02  -1   0   0
k__Bacteria.p__WS6  -1   0   0
k__Archaea.p__pMC2A384   0   0   1
k__Bacteria.p__VHS.B5.50  -1   1  -1
k__Bacteria.p__ZB2  -1   0   0
k__Bacteria.p__Gemmatimonadetes  -1   0   0
k__Archaea.p__pMC2A15  -1   0   1
k__Bacteria.p__NKB19   0   0   0
k__Bacteria.p__Fibrobacteres   0   0   1
k__Archaea.p__Methanococci_Eury  -1  -1   0
k__Bacteria.p__Unclassified   0   0   1
k__Archaea.p__Halobacteriales   0   0   1
k__Bacteria.p__SM2F11   0   0   0
k__Bacteria.p__OP9_JS1   0   0  -1
k__Bacteria.p__ctg_CGOF  -1  -1   0
k__Archaea.p__DHVE3   0  -1   1
k__Archaea.p__pSL22  -1   0   0
k__Bacteria.p__OP10  -1  -1  -1
k__Archaea.p__MSBL1   0   0   0
k__Bacteria.p__OP1  -1  -1   1
k__Archaea.p__NO27FW  -1   0   0

Figure D.1: Precision recall curves for popular 16S correlation techniques (lines) on several models (plots). Image credit: [267]

D.4 Poor precision-recall exchanges

See Figure D.1.

D.5 SAGs sequenced

See Figure D.2.

[Figure D.2 panels: 100 m, 150 m, 185 m. Legends: Total SAGs, Picked SAGs.]
Figure D.2: SAGs sampled and sequenced (picked). Image credit: Alyse Hawley

D.6 SAG decontamination taxa ranges

Bacteria;Proteobacteria;Gammaproteobacteria;SUP05
Bacteria;Proteobacteria;Gammaproteobacteria;SUP05 (Arctic)
Archaea;Thaumarchaeota;Cenarchaeales;Cenarchaeum
Bacteria;Bacteroidetes;Flavobacteriales;Cytophaga
Bacteria;Verrucomicrobia;Verrucomicrobia subdivision 3

D.7 Evaluation levels

Level 1, the low level
Archaea;Thaumarchaeota;Cenarchaeales;Cenarchaeum;Unclassified;OTU
Bacteria;Bacteroidetes;Flavobacteriales;Cytophaga;Unclassified;OTU
Bacteria;Bacteroidetes;Bacteroidales;VC21 Bac22;OTU
Bacteria;Proteobacteria;Alphaproteobacteria;Consistiales;Pelagibacter;SAR11;Candidatus Pelagibacter ubique;OTU
Bacteria;Proteobacteria;Deltaproteobacteria;Nitrospina;OTU
Bacteria;Proteobacteria;Deltaproteobacteria;Sva0853;SAR324;OTU
Bacteria;Proteobacteria;Epsilonproteobacteria;Arcobacteraceae;Unclassified;OTU
Bacteria;Proteobacteria;Gammaproteobacteria;SUP05;Unclassified;OTU Arctic
Bacteria;Proteobacteria;Gammaproteobacteria;SUP05;Unclassified;OTU SUP05 1a
Bacteria;Proteobacteria;Gammaproteobacteria;SUP05;mussel thioautotrophic gill symbiont MAR1;OTU SUP05 1c

Level 2, the mid level
Archaea
Bacteria;Bacteroidetes;Flavobacteriales
Bacteria;Bacteroidetes;Bacteroidales
Bacteria;Proteobacteria;Alphaproteobacteria
Bacteria;Proteobacteria;Deltaproteobacteria
Bacteria;Proteobacteria;Epsilonproteobacteria
Bacteria;Proteobacteria;Gammaproteobacteria

Level 3, the high level
Archaea
Bacteria

D.8 ESOM R script

# esom_binner.R
library("igraph")
library("RColorBrewer")
library("png")

# U-matrix (grid of heights) and best-match table (columns 2 and 3 hold grid row and column)
umx = as.matrix( read.table("work/fa53_sub.fa.kmer_50x50e100.umx",skip=1) )
bm = as.matrix( read.table("work/fa53_sub.fa.kmer_50x50e100.bm",skip=2) )

# build the node name
nn = function(a,b) paste0( "v" , a , "_" , b )

# clusters are built of sufficiently near nodes
# Distances can be no greater than 'cut' to share a group
build_graph = function(cut=0.2,u=umx){
  nr = nrow(u)
  nc = ncol(u)
  grid = expand.grid( 1:nr , 1:nc )
  # for a grid node at or below 'cut', return (from, to) name pairs
  # linking it to each of its up to eight in-bounds neighbours
  check_neighbours = function(i){
    if( u[ grid[i,1] , grid[i,2] ] <= cut ){
      out = NULL
      if( grid[i,1] > 1 ){ out = c( out ,
        nn(grid[i,1],grid[i,2]) , nn(grid[i,1]-1 , grid[i,2]) ) }
      if( grid[i,1] < nr ){ out = c( out ,
        nn(grid[i,1],grid[i,2]) , nn(grid[i,1]+1 , grid[i,2]) ) }
      if( grid[i,2] > 1 ){ out = c( out ,
        nn(grid[i,1] , grid[i,2]) , nn(grid[i,1],grid[i,2]-1) ) }
      if( grid[i,2] < nc ){ out = c( out ,
        nn(grid[i,1] , grid[i,2]) , nn(grid[i,1] , grid[i,2]+1) ) }
      if( grid[i,1] > 1 & grid[i,2] > 1 ){ out = c( out ,
        nn(grid[i,1] , grid[i,2]) , nn(grid[i,1]-1 , grid[i,2]-1) ) }
      if( grid[i,1] > 1 & grid[i,2] < nc ){ out = c( out ,
        nn(grid[i,1] , grid[i,2]) , nn(grid[i,1]-1 , grid[i,2]+1) ) }
      if( grid[i,1] < nr & grid[i,2] > 1 ){ out = c( out ,
        nn(grid[i,1] , grid[i,2]) , nn(grid[i,1]+1 , grid[i,2]-1) ) }
      if( grid[i,1] < nr & grid[i,2] < nc ){ out = c( out ,
        nn(grid[i,1] , grid[i,2]) , nn(grid[i,1]+1 , grid[i,2]+1) ) }
      return(out)
    }
    NULL
  }
  edges = lapply( 1:(nr*nc) , check_neighbours )
  edges = t( matrix( unlist(edges) , nrow=2 ) )
  edges = data.frame( from=edges[,1] , to=edges[,2] )
  graph_from_data_frame(edges)
}

# assign each best-match row to the connected component (bin) of its grid node
construct_bins = function(cut=0.2,u=umx,b=bm){
  g = build_graph(cut,u)
  cmps = components(g)
  bin = function(i){
    if( nn(b[i,2],b[i,3]) %in% names( cmps$membership ) ){
      return( cmps$membership[ nn(b[i,2],b[i,3]) ] )
    }
    0 # zero indicates no membership
  }
  bins = sapply( 1:nrow(b) , bin )
  cbind( b , bins )
}

# plot best matches, colouring binned points; optionally draw over a background PNG
plot_bins = function(bins,coef=1,bg_path=NA,...){
  plot( bins[,3] , bins[,2] , pch=16 , cex=0.5*coef , ... )
  clr = rainbow( max(bins[,4]) )
  # clr = brewer.pal(max(bins[,4]),"Set1")( max(bins[,4]) ) # too many colours for this palette
  idx = (1:nrow(bins))[ bins[,4] > 0 ]
  points( bins[idx,3] , bins[idx,2] , pch=16 , col=clr[ bins[,4] ] , cex=coef )
  if( ! is.na(bg_path) ){
    bg = readPNG( bg_path )
    lim = par()
    rasterImage(bg, lim$usr[1], lim$usr[3], lim$usr[2], lim$usr[4])
    points( bins[,3] , bins[,2] , pch=16 , cex=0.5*coef )
    points( bins[idx,3] , bins[idx,2] , pch=16 , col=clr[ bins[,4] ] , cex=coef )
  }
}

D.9 ESOM U-matrices and bins

[Figure D.3 panels: Level 1 (esom_bins1), Level 2 (esom_bins2), Level 3 (esom_bins3).]
ESOM U-matrices are shown as height maps. Unique colours within levels are separate bins.
Black dots are not assigned bins.
Figure D.3: ESOM U-matrices and bins

D.10 All binner precision recall statistics

Taxa                                    Level  Precision  Sensitivity

PhylopythiaS
Thaumarchaeota;Cenarchaeum              1      NA         .00
Flavobacteriales;Cytophaga              1      NA         .00
Bacteroidetes;Bacteroidales             1      1.00       .007
Alphaproteobacteria;SAR11               1      NA         .00
Deltaproteobacteria;Nitrospina          1      NA         .00
Deltaproteobacteria;Sva0853;SAR324      1      NA         .00
Epsilonproteobacteria;Arcobacteraceae   1      NA         .00
Gammaproteobacteria;SUP05 Arctic        1      NA         .00
Gammaproteobacteria;SUP05 1a            1      NA         .00
Gammaproteobacteria;SUP05 1c            1      NA         .00
Archaea                                 2      .66        .45
Bacteroidetes;Flavobacteriales          2      .97        .09
Bacteroidetes;Bacteroidales             2      .92        .01
Alphaproteobacteria                     2      .00        .00
Deltaproteobacteria                     2      .00        .00
Epsilonproteobacteria                   2      .00        .00
Gammaproteobacteria                     2      .00        .00
Bacteria                                3      .91        .34
Archaea                                 3      .65        .45

SAGEX (classify)
Thaumarchaeota;Cenarchaeum              1      .99        .04
Flavobacteriales;Cytophaga              1      .77        .008
Bacteroidetes;Bacteroidales             1      .99        .10
Alphaproteobacteria;SAR11               1      .97        .04
Deltaproteobacteria;Nitrospina          1      .93        .04
Deltaproteobacteria;Sva0853;SAR324      1      .95        .10
Epsilonproteobacteria;Arcobacteraceae   1      .90        .012
Gammaproteobacteria;SUP05 Arctic        1      .36        .003
Gammaproteobacteria;SUP05 1a            1      .62        .06
Gammaproteobacteria;SUP05 1c            1      .77        .04
Archaea                                 2      .97        .03
Bacteroidetes;Flavobacteriales          2      .76        .01
Bacteroidetes;Bacteroidales             2      .98        .05
Proteobacteria;Alphaproteobacteria      2      .95        .05
Proteobacteria;Deltaproteobacteria      2      .93        .05
Proteobacteria;Epsilonproteobacteria    2      .90        .01
Proteobacteria;Gammaproteobacteria      2      .93        .01
Archaea                                 3      .97        .03
Bacteria                                3      .997       .004

Table D.1: All PhylopythiaS and SAGEX (classify) precision-recall statistics

Taxa                                    Level  Precision  Sensitivity

SAGEX (cluster)
Thaumarchaeota;Cenarchaeum              1      .85        2.8e-04
Flavobacteriales;Cytophaga              1      .20        2.0e-05
Bacteroidetes;Bacteroidales             1      .77        4.2e-04
Alphaproteobacteria;SAR11               1      .37        1.1e-04
Deltaproteobacteria;Nitrospina          1      .32        4.3e-05
Deltaproteobacteria;Sva0853;SAR324      1      .85        1.2e-05
Epsilonproteobacteria;Arcobacteraceae   1      .28        6.7e-04
Gammaproteobacteria;SUP05 Arctic        1      .02        2.5e-05
Gammaproteobacteria;SUP05 1a            1      .15        9.8e-03
Gammaproteobacteria;SUP05 1c            1      .41        7.5e-03
Archaea                                 2      .78        2.4e-04
Bacteroidetes;Flavobacteriales          2      .32        9.4e-05
Bacteroidetes;Bacteroidales             2      .72        9.5e-05
Proteobacteria;Alphaproteobacteria      2      .34        1.2e-04
Proteobacteria;Deltaproteobacteria      2      .85        1.7e-05
Proteobacteria;Epsilonproteobacteria    2      .27        6.7e-04
Proteobacteria;Gammaproteobacteria      2      .66        3.7e-04
Archaea                                 3      .78        2.4e-04
Bacteria                                3      .999       7.2e-05

MaxBin2
Thaumarchaeota;Cenarchaeum              1      .49        .05
Flavobacteriales;Cytophaga              1      .11        .04
Bacteroidetes;Bacteroidales             1      .36        .04
Alphaproteobacteria;SAR11               1      .34        .03
Deltaproteobacteria;Nitrospina          1      .30        .03
Deltaproteobacteria;Sva0853;SAR324      1      .37        .01
Epsilonproteobacteria;Arcobacteraceae   1      .54        .10
Gammaproteobacteria;SUP05 Arctic        1      .17        .14
Gammaproteobacteria;SUP05 1a            1      .04        .008
Gammaproteobacteria;SUP05 1c            1      .11        .005
Archaea                                 2      .43        .05
Bacteroidetes;Flavobacteriales          2      .21        .02
Bacteroidetes;Bacteroidales             2      .55        .08
Alphaproteobacteria                     2      .38        .01
Deltaproteobacteria                     2      .40        .01
Epsilonproteobacteria                   2      .61        .09
Gammaproteobacteria                     2      .48        .01
Bacteria                                3      .95        .004
Archaea                                 3      .54        .06

Table D.2: All MaxBin2.0 and SAGEX (cluster) precision-recall statistics

Taxa                                    Level  Precision  Sensitivity

ESOM + R
Thaumarchaeota;Cenarchaeum              1      NA         .00
Flavobacteriales;Cytophaga              1      .03        .01
Bacteroidetes;Bacteroidales             1      .22        .06
Alphaproteobacteria;SAR11               1      .71        .01
Deltaproteobacteria;Nitrospina          1      .41        .07
Deltaproteobacteria;Sva0853;SAR324      1      .61        .06
Epsilonproteobacteria;Arcobacteraceae   1      .40        .05
Gammaproteobacteria;SUP05 Arctic        1      .08        .03
Gammaproteobacteria;SUP05 1a            1      .23        .04
Gammaproteobacteria;SUP05 1c            1      .29        .04
Archaea                                 2      .03        .05
Bacteroidetes;Flavobacteriales          2      .09        .12
Bacteroidetes;Bacteroidales             2      .05        .13
Alphaproteobacteria                     2      .10        .10
Deltaproteobacteria                     2      .26        .19
Epsilonproteobacteria                   2      .04        .03
Gammaproteobacteria                     2      .41        .28
Bacteria                                3      .998       .02
Archaea                                 3      .06        .002

ESOM protocol modifies input. Results are indirectly comparable.
Table D.3: All ESOM+R precision-recall statistics
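
The Precision and Sensitivity columns above can be read against the definition used in Appendix C, where precision is the probability that a sequence truly carries a label given that the binner assigned it. The R sketch below shows one way such per-label statistics could be computed from known-label data. It is an assumption-laden illustration, not the thesis's evaluation code: the function name `precision_sensitivity`, the toy vectors, and the choice to report NA when no sequence receives a label are all hypothetical.

```r
# Sketch only: toy data and names are illustrative assumptions, not the
# evaluation code or data used for the tables above.
truth    <- c("Archaea", "Archaea", "Bacteria", "Bacteria", "Bacteria", "Bacteria")
assigned <- c("Archaea", "Bacteria", "Bacteria", "Bacteria", NA, "Archaea")  # NA: left unbinned

# Precision: fraction of sequences assigned `label` that truly carry it.
# Sensitivity: fraction of sequences truly carrying `label` that were assigned it.
precision_sensitivity <- function(truth, assigned, label) {
  called   <- !is.na(assigned) & assigned == label
  is_label <- truth == label
  tp <- sum(called & is_label)
  c(precision   = if (sum(called) > 0) tp / sum(called) else NA_real_,
    sensitivity = tp / sum(is_label))
}

ps_archaea  <- precision_sensitivity(truth, assigned, "Archaea")
ps_bacteria <- precision_sensitivity(truth, assigned, "Bacteria")
```

Under this convention a label that is never assigned has undefined precision and zero sensitivity, which is one plausible reading of rows showing NA alongside .00.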

