UBC Theses and Dissertations

Exploring microbial community structure and resilience through visualization and analysis of microbial… Perez, Sarah Isa Esther 2015

Exploring microbial community structure and resilience through visualization and analysis of microbial co-occurrence networks

by

Sarah Isa Esther Perez

B.Sc. Honours Physics, McGill University, 2012

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF

Master of Science

in

THE FACULTY OF GRADUATE AND POSTDOCTORAL STUDIES

(Bioinformatics)

The University of British Columbia
(Vancouver)

June 2015

© Sarah Isa Esther Perez, 2015

Abstract

Cultivation-independent microbial ecology research relies on high-throughput sequencing technologies and analytical methods to resolve the immense diversity of microbial life on Earth. Microorganisms live in communities driven by genetic and metabolic processes as well as symbiotic relationships. Interconnected communities of microorganisms provide essential functions in natural and human-engineered ecosystems. Modelling the community as an interconnected system can give insight into the community’s functional characteristics related to the biogeochemical processes it performs. Network science relates elements of structure to notions of function in a system and has been successfully applied to the study of microbial communities and other complex biological systems. Microbial co-occurrence networks are inferred from community composition data to resolve structural patterns related to ecological properties such as community resilience to disturbance and the presence of keystone species. However, interpreting global and local network properties from an ecological standpoint remains difficult because of the complexity of these systems, which creates a need for quantitative analytical methods and visualization techniques tailored to co-occurrence networks.

This thesis tackles the visualization and analytical challenges of modelling microbial community structure from a network science approach.
First, Hive Panel Explorer, an interactive visualization tool, is developed to permit data-driven exploration of topological and data association patterns in complex systems. The effectiveness of Hive Panel Explorer is validated by resolving known and novel patterns in a model biological network, the C. elegans connectome. Second, network structural robustness analysis methods are applied to study microbial communities from timber-harvested forest soils sampled in a North American long-term soil productivity study. Analyzing these geographically dispersed soils revealed biogeographic patterns of diversity and enabled the discovery of conserved organizing principles shaping microbial community structure. The capacity of robustness analysis to identify key microbial community members and to model shifts in community structure due to environmental change is demonstrated. Finally, this work provides insight into the relationship between microbes and their ecosystem; characterizing this relationship can help us understand the organization of microbial communities, survey microbial diversity and harness its potential.

Preface

The sections in this work have not yet been published. Chapters 2 and 3 are being prepared for submission to peer-reviewed scientific journals in the coming months.

Chapter 1. Sarah Perez wrote the main text with input from Steven J. Hallam. Aria S. Hahn provided input and feedback overall, and in particular on the soil ecology section.

Chapter 2. Hive Panel Explorer is an exploratory visualization tool built by Sarah E. I. Perez. The design development process was conducted by Sarah E. I. Perez with support from Aria S. Hahn, who provided user-based feedback and insight. Martin Krzywinski, the developer of hive plots (on which this tool is built), provided feedback on the figures, on the structure of the text for improved readability, and on the content of the text. The main text was written by Sarah E. I. Perez with editorial input from Steven J.
Hallam.

Chapter 3. The methodological procedure for the construction and analysis of the microbial community networks was designed by Sarah E. I. Perez with feedback from Aria S. Hahn and Steven J. Hallam. Sarah E. I. Perez wrote the Python scripts to conduct the network analysis of the Long Term Soil Productivity project data. This project is part of a multi-lab effort and data collection was undertaken by numerous scientists (see associated public reports). Network computation and inference was primarily conducted by Aria S. Hahn with guidance from Steven J. Hallam and Karoline Faust and assistance from Sarah E. I. Perez. Networks were built using the CoNet software on high-performance computational resources provided by Compute Canada’s Western Canadian Compute Consortium. Sarah E. I. Perez wrote the main text and prepared the figures with editorial support from Steven J. Hallam.

Throughout this dissertation the word “we” refers to Sarah E. I. Perez unless otherwise stated. None of the work encompassing this dissertation required consultation with the University of British Columbia Research Ethics Board.

Table of Contents

Abstract
Preface
Table of Contents
List of Tables
List of Figures
Glossary
Acknowledgments
Dedication
1 Introduction
  1.1 Networks and complexity
    1.1.1 From an interactive system to networks
    1.1.2 Graph theory
    1.1.3 Biological network complexity
  1.2 Network exploration
    1.2.1 Visualization as a means for data exploration
    1.2.2 Current network visualizations
  1.3 Charting microbial community structure and function
    1.3.1 Soil ecology
    1.3.2 Taxonomic assessment
    1.3.3 Community composition
  1.4 Microbial co-occurrence networks
    1.4.1 Symbiosis and inter-taxa interactions
    1.4.2 Microbial network inference
    1.4.3 Validating network inference models
    1.4.4 Current applications of microbial co-occurrence networks
  1.5 Research questions
  1.6 Research overview
2 Hive Panel Explorer: an interactive visualization tool to explore topological and data association patterns in large networks
  2.1 Introduction
    2.1.1 Network science and visualization
    2.1.2 Current network visualization pitfalls
    2.1.3 Hive plots
    2.1.4 Hive Panel Explorer
  2.2 Methods
    2.2.1 Visualization idiom and design
    2.2.2 Designing a hive panel
    2.2.3 Navigating a hive panel
    2.2.4 HyPE as a web tool
  2.3 Results: the structure of the C. elegans connectome
    2.3.1 The system
    2.3.2 The network
    2.3.3 Constructing the hive panel
    2.3.4 Exploring the C. elegans hive panel
  2.4 Discussion
    2.4.1 Assessing patterns and generating hypotheses
    2.4.2 A flexible and adaptive visualization tool
    2.4.3 A scalable tool
  2.5 Future directions and conclusions
3 Characterizing robustness and centrality in microbial co-occurrence networks from natural and disturbed soil communities
  3.1 Introduction
  3.2 Methods
    3.2.1 LTSP sample collection and processing
    3.2.2 Environmental DNA extraction and sequencing
    3.2.3 Microbial co-occurrence network inference
    3.2.4 Ecological analysis
    3.2.5 Network analysis
    3.2.6 Using HyPE to visualize networks
    3.2.7 Network robustness simulations
  3.3 Results
    3.3.1 Ecological diversity within and between ecozones
    3.3.2 Global network topology
    3.3.3 Visualizing microbial co-occurrence networks with HyPE
    3.3.4 Network robustness simulations
    3.3.5 Characterizing central taxa
  3.4 Discussion
    3.4.1 Soil microbial co-occurrence networks: a complex ecologically driven structure
    3.4.2 Centrality and robustness across biogeoclimatic networks
    3.4.3 Relating treatment effects to robustness analysis
  3.5 Conclusion
4 Conclusion
  4.1 Assumptions and limitations of sequencing approaches
  4.2 HyPE as a community tool
  4.3 Closing: cross-disciplinarity in microbial ecology
Bibliography
A Chapter 3 supporting material

List of Tables

Table 1.1 An overview of different ecological diversity metrics
Table 2.1 HyPE’s visual design idiom
Table 3.1 LTSP sampling sites’ soil data for the SBS, MD and JP ecozones
Table 3.2 Richness of LTSP samples grouped by ecozone and treatment
Table 3.3 Shannon’s entropy of LTSP samples grouped by ecozone and treatment
Table 3.4 The number of nodes and edges in the LTSP networks
Table 3.5 Global clustering coefficient of the LTSP networks
Table 3.6 Size of the largest connected component of the LTSP networks
Table 3.7 Robustness factor of SBS networks per node removal method
Table 3.8 Robustness factor of MD networks per node removal method
Table 3.9 Robustness factor of JP networks per node removal method
Table A.1 Number of sequences recovered for samples in ecozone JP with treatment OM0
Table A.2 Number of sequences recovered for samples in ecozone JP with treatment OM1
Table A.3 Number of sequences recovered for samples in ecozone JP with treatment OM2
Table A.4 Number of sequences recovered for samples in ecozone JP with treatment OM3
Table A.5 Number of sequences recovered for samples in ecozone MD with treatment OM0
Table A.6 Number of sequences recovered for samples in ecozone MD with treatment OM1
Table A.7 Number of sequences recovered for samples in ecozone MD with treatment OM2
Table A.8 Number of sequences recovered for samples in ecozone MD with treatment OM3
Table A.9 Number of sequences recovered for samples in ecozone SBS with treatment OM0
Table A.10 Number of sequences recovered for samples in ecozone SBS with treatment OM1
Table A.11 Number of sequences recovered for samples in ecozone SBS with treatment OM2
Table A.12 Number of sequences recovered for samples in ecozone SBS with treatment OM3
Table A.13 Summary of sample numbers in each ecozone for each treatment level
Table A.14 Shannon’s entropy of LTSP samples grouped by ecozone and treatment
Table A.15 Representation of phyla in central taxa of JP networks
Table A.16 Representation of classes in central taxa of JP networks
Table A.17 Representation of orders in central taxa of JP networks
Table A.18 Representation of phyla in central taxa of MD networks
Table A.19 Representation of classes in central taxa of MD networks
Table A.20 Representation of orders in central taxa of MD networks
Table A.21 Representation of phyla in central taxa of SBS networks
Table A.22 Representation of classes in central taxa of SBS networks
Table A.23 Representation of orders in central taxa of SBS networks

List of Figures

Figure 1.1 An overview of graph types and graph theory metrics
Figure 1.2 An overview of different node centrality measures
Figure 1.3 The modularity of the Karate Club network
Figure 1.4 Complex biological system modelling through networks
Figure 1.5 Overview of tasks accomplished by visualizations
Figure 1.6 Adjacency matrices, a tabular representation of graphs
Figure 1.7 Force-directed layouts, an intuitive and planar visual representation of graphs
Figure 1.8 Hive plots, a circularly organized representation of graphs
Figure 1.9 Overview of LTSP ecozones and treatments
Figure 1.10 An illustration of the specificity and fidelity of species to environmental conditions
Figure 1.11 Overview of different ecological interactions between microbial community members
Figure 1.12 An illustration of microbial network inference through co-occurrence patterns
Figure 1.13 The effect of experimental parameters on co-occurrence network modelling performance on simulated communities
Figure 1.14 The effect of ecological properties on co-occurrence network modelling performance on simulated communities
Figure 1.15 Co-occurrence network visualization and properties for a decade-long time series of bacterioplankton communities in Lake Mendota
Figure 2.1 A comparison of a force-directed layout and hive plot of a social network
Figure 2.2 A schematic layout of single and double axis hive plots
Figure 2.3 An overview of the possible partitions and scales driving node assignment and positioning
Figure 2.4 An overview of HyPE’s interface
Figure 2.5 The C. elegans hive panel
Figure 2.6 A schematic of the filtering procedure used to reveal motor neurons connected by more than 10 synapses
Figure 3.1 Hierarchical clustering of all LTSP samples coloured by ecozone
Figure 3.2 Hierarchical clustering of all LTSP samples coloured by organic matter (OM) treatment
Figure 3.3 Probability distribution function of node degree for all LTSP networks
Figure 3.4 Hive panel of the network from the SBS ecozone with treatment OM0
Figure 3.5 Hive panel of twelve hive plots showing the horizon modularity of the LTSP networks
Figure 3.6 Hive panel of twelve hive plots showing the connectivity and centrality of OTUs’ phyla of the LTSP networks
Figure 3.7 Scatter matrix plot of four centrality measures in the Sub Boreal Spruce (SBS) networks
Figure 3.8 Robustness simulations of twelve LTSP networks driven by different centrality measures
Figure 3.9 Venn diagram of OTUs in ecozone networks
Figure 3.10 Venn diagram of the number of phyla, classes and orders shared across ecozone networks
Figure 3.11 Histograms of the average soil horizon of OTUs with high BC values for all LTSP networks
Figure 3.12 Histograms of the abundance of OTUs with high BC values for all LTSP networks
Figure A.1 Hierarchical clustering of SBS samples coloured by treatment
Figure A.2 Hierarchical clustering of SBS samples coloured by horizon
Figure A.3 Hierarchical clustering of SBS samples coloured by sample site
Figure A.4 Hierarchical clustering of JP samples coloured by treatment
Figure A.5 Hierarchical clustering of JP samples coloured by horizon
Figure A.6 Hierarchical clustering of JP samples coloured by sample site
Figure A.7 Hierarchical clustering of MD samples coloured by treatment
Figure A.8 Hierarchical clustering of MD samples coloured by horizon
Figure A.9 Hierarchical clustering of MD samples coloured by sample site
Figure A.10 Scatter matrix plot of four centrality measures in the MD networks
Figure A.11 Scatter matrix plot of four centrality measures in the JP networks

Glossary

BC: betweenness centrality
DNA: deoxyribonucleic acid, a molecule that encodes genetic information
HyPE: Hive Panel Explorer
JP: Jack Pine zone, an LTSP study ecozone
LCC: largest connected component of a network
LTSP: Long Term Soil Productivity, a study of timber harvesting in North American forest soils
MD: Mediterranean zone, an LTSP study ecozone
OM: organic matter removal, a type of treatment implemented in forest harvesting
OTU: operational taxonomic unit, as defined by prokaryotic rRNA sequence similarity
PPI: protein-protein interaction
RNA: ribonucleic acid, a molecule that encodes genetic information
SBS: Sub Boreal Spruce zone, an LTSP study ecozone
SSU: small subunit of the ribosomal molecule

Acknowledgments

This academic journey has been quite an adventure.
First, I was fortunate enough to receive funding from CIHR and to conduct three research rotations, which gave me exposure to different fields within bioinformatics. After joining the Hallam lab, I was surrounded by supportive colleagues, notably Aria Hahn, who quickly became a collaborator and a mentor. Aria, thank you for those endless discussions and your friendship. I also want to thank my supervisor, Dr. Steven Hallam, for engaging me in all those brainstorming sessions, for guiding me in my research and for ensuring that I had a rich learning experience. I offer my gratitude to my thesis committee, Dr. Anne Condon and Dr. Martin Hirst, for being both attentive and insightful in my committee meetings and for their helpful comments and suggestions. I have thoroughly enjoyed my experience at UBC and in the Bioinformatics training program, and would do it all over again if given the choice.

Finally, I am thankful for the incredible support that my partner, my family and close friends have given me throughout this journey.

Dedication

To my grandparents, Renée and Jo, Gilou and Arié, who taught me the power of perseverance, each in their own way.

Chapter 1

Introduction

With an estimated global abundance approaching 10^30 cells [167], microorganisms represent the invisible majority of life on Earth. From the mesosphere to the lithosphere, microorganisms are adapted to thrive across a wide range of habitats and environmental conditions [167]. Interconnected communities of microorganisms provide essential functions in natural and engineered ecosystems and play integral roles in global-scale biogeochemical processes [113, 114]. Resolving the complexity of these communities can reveal the inner workings of the Earth system, with far-reaching implications for biotechnology development and conservation.
By harnessing the hidden metabolic potential of microbial communities, we can develop sustainable solutions in energy and materials production [6, 140], synthetic biology [111], and medical diagnosis and therapeutics [28, 38] that are more in sync with the natural world.

Despite the impact that microbial communities have on the world around them, charting microbial community metabolism is extremely challenging because less than 1% of microbial diversity has been cultured in laboratory settings [114]. Advances in sequencing technology are beginning to bridge this cultivation gap through plurality sequencing of microbial community deoxyribonucleic acid (DNA) and ribonucleic acid (RNA) directly from the environment. Application of these techniques enables the characterization of microbial taxonomic diversity and metabolic potential. Such environmental surveys have helped discover “who is there” (e.g., through taxonomic assessment based on ribosomal RNA gene abundance) and “what are they doing” (e.g., through metabolic reconstruction based on functional gene and pathway analysis). Thus, sequencing technologies enable the study of microbial communities as structured and dynamic systems.

Small subunit (SSU) rRNA sequencing of environmental samples allows direct evaluation of taxonomic identity, abundance and diversity in communities [29, 141]. These measurements provide knowledge of community structure, which can be modelled to develop biomarkers for environmental factors and processes. For example, studies have sequenced environmental SSU rRNA to evaluate the impact of logging on soil productivity [76, 135, 136] and of oil spills on coastal ecosystems [98].
Different statistical methods, such as cluster analysis of samples and indicator species analysis of taxonomic distributions, model the presence and role of individual members, from rare to abundant taxa, and characterize community composition in relation to environmental parameters.

Microbial communities rely on interconnected genetic and metabolic processes to drive matter and energy transformations. In particular, metagenomic studies have provided evidence that different reactions within a metabolic pathway may be performed by, and distributed across, different community members [29, 73, 171]. The genetic distribution of the community can also be altered through horizontal gene transfer [40, 51]. Furthermore, co-culture experiments have demonstrated that cooperation and competition drive community member dynamics: groups of taxa engage in a variety of positive, neutral and negative interactions [46, 54, 128, 138]. Though the analysis of individual community members can provide valuable insight into specific metabolic processes, a holistic understanding of the ecosystem requires an awareness of the dynamic interconnections between community members. Thus, the “whole is greater than the sum of its parts”: evaluating both community composition and interactions supports more elaborate ecosystem models [54, 73, 171]. Accordingly, microbial communities can be modelled as a dynamic system in which taxonomic, genetic and metabolic distributions are interrelated.

The structure of an interactive system can be modelled by studying its connectivity [118]. Through network abstraction, the connective structure of a system can be expressed using nodes and edges: the nodes of the network represent the members of the system and the edges represent the relationships between members. For example, when a microbial community is modelled as a connected system, individual taxa become nodes and their interactive relationships become edges.
Studies have built community networks by applying co-occurrence analysis to taxonomic abundances obtained from SSU rRNA sequencing data [11, 54]. To construct the microbial co-occurrence network of a community, pairwise co-occurrences are evaluated and the significant positive and negative ones are assigned as edges. Microbial ecologists have recently adopted this network approach to study both the taxonomic distribution and the interconnected structure of microbial communities [47, 54, 89, 105, 129, 169].

Network approaches are widely used to study relationships between the structure and the function of a system in the social, biological and technological sciences. Networks and their structure elucidate functional aspects of the modelled system, such as the wiring efficiency of the C. elegans connectome [34], the vulnerability of the World Wide Web to attack [5], extinction dynamics in food webs [45, 122], and missing annotations in protein-protein interactomes [102]. In particular, structural properties of microbial co-occurrence networks have been characterized to infer biological attributes of the community such as its resilience to disturbance [76]. Despite the power and the promise of graph theory, selecting the appropriate or optimal quantitative method to accurately discern patterns within complex systems remains a challenging enterprise.
For instance, biological network studies have difficulty justifying which of the many network centrality measures should be used to identify nodes that are “important”, “central”, or even “essential” to the structure of the network, and have difficulty interpreting the results in relation to system properties [62, 85]. Along the same lines, quantitative methods adapted from food web studies in macroecology remain difficult to visualize, interpret and validate when applied to microbial co-occurrence networks [55].

The visualization of complex systems can help reveal patterns, motivate analysis and generate hypotheses [125, 147]. In order to go beyond presenting known patterns and reveal new ones, visualizations need to be designed to permit interactive exploration [115, 125, 147]. The discovery of topological features and patterns can help drive a quantitative analysis of a network and formulate hypotheses inferring the modelled system’s function from its underlying structure. Current network visualization techniques are not designed for interactive exploration. Network representations such as adjacency matrices do not provide flexibility in adapting their layout rules to systems [115]. Other representations, such as force-directed layouts, are not suitable for large networks because they are often inconsistent and difficult to interpret. In contrast, rule-based network layouts have been developed to create consistent and coherent network visualizations [96, 115]. In particular, the hive plot is a rule-based network layout whose design attempts to provide a visual query language with which to organize and study networks using system properties [96]. However, these network layouts have been developed to illustrate specific connectivity features, and more flexible visualization designs are required to maximize the exploration of patterns in the network.
Adapting hive plots to develop a versatile network visualization and combining this design with interactive features could allow for the exploration and interpretation of patterns in high-dimensional and complex networks.

This thesis outlines the development and application of a visualization tool and a quantitative modelling approach to study microbial co-occurrence networks constructed from SSU rRNA sequencing data from soil environments; both are extensible to other forms of data. In the remainder of this chapter, I review the state of network science and complexity, describe network visualization and quantitative ecology tools, and explore their application to the study of microbial structure and function in natural and engineered ecosystems. In Chapter 2, I describe and evaluate the development and design of Hive Panel Explorer as an interactive network visualization tool and demonstrate its effectiveness on a known and well-studied biological network: the C. elegans connectome. In Chapter 3, I demonstrate the application of Hive Panel Explorer (HyPE) to soil microbial communities from the Long Term Soil Productivity (LTSP) study, based on samples from varied locations and ecosystems across North America [76, 136]. Soil is among the most diverse microbial environments, so the soil microbiome epitomizes the complexity of microbial communities; methods that successfully characterize its structure and function on local and global scales should be readily adaptable to the study of less diverse microbial communities. Furthermore, analyzing geographically dispersed sample collections may reveal biogeographic patterns of diversity and uncover conserved organizing principles shaping microbial community structure.
Thus, this thesis has the potential to provide insight into the complex relationships between individual microbes, their community and their environment.

1.1 Networks and complexity

As put by science writer Dorion Sagan [143]:

    Nature no more obeys the territorial divisions of scientific academic disciplines than do continents appear from space to be coloured to reflect the national divisions of their human inhabitants. For me, the great scientific satoris, epiphanies, eurekas, and aha! moments are characterized by their ability to connect.

While scientific breakthroughs are often accomplished through cross-disciplinary synthesis, some disciplines’ knowledge and methodologies lend themselves better than others to cross-disciplinary applications. In particular, network science, a mathematical methodology, has been applied in the social, biological and technological sciences to model systems using networks [118]. Network models can be used to capture the connective structure of a system and promote the study of interconnectivity in nature. Connections can be drawn within and across many levels, from fundamental particle interactions to the influence of gravitational fields. Particularly in this Digital Age, our world has become more interconnected: people and knowledge are virtually and globally “hyperlinked” through social media and online databases.

Network science harnesses the potential of this interconnected world by studying its structure. Anchored in graph theory, the field of network science has developed to resolve these dynamic interconnected structures in systems as diverse as social circles, ecosystems, and the World Wide Web [118].

Graph theory is thought to have evolved from a few seminal papers, including Leonhard Euler’s 1736 paper on the Seven Bridges of Königsberg, in which he analyzes the topology of connected bridges to determine whether a path exists that crosses every bridge exactly once [17].
The concepts of system topology first introduced in this paper have since evolved to model relationships between objects such as social interactions, predator-prey relations, and web links between articles on Wikipedia. An ensemble of relationships is called a graph and is denoted by the letter G. The graph G is formed from nodes, the objects, connected by edges, the relationships. In mathematical terminology, we say the graph G = (N,E) is composed of the set of nodes N and the set of edges E [17, 118]. The topology of a graph is its “shape” or connective structure. Certain topologies imply that a graph will have different structural properties [17, 118]. Graph theory encompasses the topological algorithms and metrics applied to graphs.

Whereas the mathematical concept is denoted a graph, once applied to a system with specific objects, the model becomes a network¹. The following section motivates the modelling of a system as a network, provides an overview of graph theory methods and describes topological characteristics of biological networks more specifically.

1.1.1 From an interactive system to networks

Graph theory methods have been used to study the relationship between the structure and the function of complex systems by building social, biological and technological networks [118, 119, 159]. In social networks, nodes are typically people and the edges connecting these people represent an interaction between them [162]. Similarly, in foodwebs, the edges are trophic interactions (i.e. “who eats who”) that connect species, the nodes, in an ecosystem [45, 122]. Graph theory measures such as degree distributions, modularity and connectance can help relate elements of structure to notions of function in the system and within its parts [118].

Graphs come in all shapes and sizes: they can be directed or undirected, weighted or not, connected or not [118], as illustrated in Figure 1.1.
Directed graphs are used to model systems where nodes have relationships with implicit direction, such as the synaptic connections between neurons in a connectome: one neuron, a node, fires a signal to another neuron via a synapse, a directed edge [71]. Weighted graphs model the relationships between nodes quantitatively by assigning a weight to each edge. For example, the strength of a friendship between two individuals in a social network can be encoded in the weight of the edge connecting them. Finally, a connected graph is one where all nodes can be reached by following a path, a sequence of connected edges. A graph is defined as not connected if a node or group of nodes can’t be reached by following a path, and the graph is then said to have more than one connected component.

¹ In this thesis, both terms will be used according to the context: graphs when discussing theory and networks when discussing systems modelled using nodes and edges.

Figure 1.1: An overview of graph types and graph theory metrics. Graphs are composed of nodes and edges, here represented as circles and links between circles, respectively. A) Graphs can be simple, directed, weighted, completely connected or composed of connected components. B) Two graph topologies that differ in their degree distribution are shown: a power law degree distribution and the characteristic binomial degree distribution of randomly generated graphs. C) The topology of nodes is characterized by graph theory metrics including degree, clustering coefficient and centrality measures such as betweenness centrality. The number next to each graph corresponds to the metric value of the coloured-in node.

Structural properties of graphs are used to characterize properties of individual nodes, individual edges, paths, groups of nodes, groups of edges, connected components and the whole graph. Several review papers [17, 118–120] provide an overview of different network analyses and graph theory measures, a few of which are presented in the following section.

1.1.2 Graph theory

Many properties can be calculated to characterize a graph’s topology: degree distribution, global clustering coefficient, diameter, average shortest path length, centrality measures, scale-free index, modularity, etc. [5, 118–120]. Whereas parameters such as the number of nodes and the number of edges provide a quantitative assessment of a graph’s size, other measures such as the diameter of a graph, the longest shortest path between two nodes in the graph, evaluate the topological size of the graph: two graphs with orders of magnitude differences between their numbers of nodes can have the same diameter. Similarly, the global interconnectivity of two graphs of different topological size or with different numbers of nodes and edges can be compared by measuring their connectance, the proportion of realized edges, calculated from:

$$\text{Connectance} = \frac{|E|}{|N|(|N|-1)/2} \qquad (1.1)$$

where |E| is the number of edges and |N| the number of nodes [118]. Just as statistical methods evaluate dependencies and similarities in multivariate datasets, graph theory measures assess repeated structures, connective patterns, partitions, and other complex topological patterns.

Graph theory measures take on different formulations depending on whether they are applied to directed or undirected, weighted or unweighted, connected or disconnected graphs. However, we will present them as applied to unweighted undirected graphs for simplicity, as other formulations are simply derivations of the ones presented here.
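As a minimal sketch of Equation 1.1 and of the diameter described above, the following Python snippet computes both for a small graph. The function names and the toy graph are illustrative assumptions rather than part of the thesis software, and the graph is assumed to be simple, undirected and connected.

```python
from collections import deque


def connectance(nodes, edges):
    """Proportion of realized edges in a simple undirected graph (Equation 1.1)."""
    n = len(nodes)
    possible = n * (n - 1) / 2
    return len(edges) / possible if possible else 0.0


def shortest_path_lengths(adjacency, source):
    """Breadth-first search: number of edges from source to each reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adjacency[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def diameter(nodes, edges):
    """Longest shortest path between any two nodes (graph assumed connected)."""
    adjacency = {n: set() for n in nodes}
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    return max(max(shortest_path_lengths(adjacency, n).values()) for n in nodes)


# Hypothetical toy graph: the path a-b-c-d plus the chord b-d.
toy_nodes = ["a", "b", "c", "d"]
toy_edges = [("a", "b"), ("b", "c"), ("c", "d"), ("b", "d")]
print(connectance(toy_nodes, toy_edges))  # 4 realized edges out of 6 possible
print(diameter(toy_nodes, toy_edges))
```

Breadth-first search is used here because all edges are unweighted; a weighted graph would need a different shortest-path routine.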
Different measures are applied to characterize a system’s structure at global and local scales by analyzing the resulting network [96]. They can be divided into two types: ones that measure properties of individual nodes and edges and ones that evaluate global properties of the graph and connected components. An overview of these measures is illustrated in Figure 1.1.

Node degree

The degree of a node is simply the number of edges connected to it. In a social network, the node degree represents the number of social interactions of that node, which could be used for example to infer that individual’s popularity [162]. This node property can be used to classify nodes by their connectivity. Averaged over all nodes, the mean degree of an undirected graph is:

$$\bar{d} = \frac{\sum_{i=1}^{N}\sum_{j=1}^{N} e_{ij}}{|N|} = \frac{2|E|}{|N|} \qquad (1.2)$$

where |N| is the number of nodes, |E| is the number of edges, and $e_{ij} = 1$ if the i-th and j-th nodes are connected, otherwise $e_{ij} = 0$ [118]. The factor of 2 arises because the double sum counts each undirected edge once from each of its endpoints. To characterize the ensemble of node degrees of a graph we evaluate a graph’s degree distribution, which we describe next.

Degree distribution

The degree distribution of a graph is a global property which communicates the basic connective topology of the graph. Certain characteristic distributions imply structural properties and specific connectivity patterns in the graph. For instance a random graph, one which is constructed progressively by joining nodes by an edge with a certain probability, will have a binomial degree distribution [118, 119]: most nodes will have a degree close to the mean of the distribution and few nodes will have very low or very high degree.

Another characteristic degree distribution is the power law degree distribution where node degrees approximately follow:

$$P(d) = \frac{1}{d^{k}} \qquad (1.3)$$

where d is the degree of a node, P(d) is the frequency of that degree in the graph, and k is a constant that defines the scale of the power law distribution [24, 88, 118]. In this case, the frequency of a node having a certain degree is inversely proportional to that degree raised to the power k.
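To make the notion of a degree distribution concrete, the sketch below tabulates node degrees and the empirical frequency P(d) of each degree from an edge list. It is an illustrative example only: the helper names and the star-shaped toy graph are assumptions, and no attempt is made to fit a binomial or power law model to the result.

```python
from collections import Counter


def node_degrees(nodes, edges):
    """Number of edges attached to each node of a simple undirected graph."""
    degree = {n: 0 for n in nodes}
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return degree


def degree_distribution(nodes, edges):
    """Empirical P(d): fraction of nodes that have degree d."""
    counts = Counter(node_degrees(nodes, edges).values())
    total = len(nodes)
    return {d: count / total for d, count in sorted(counts.items())}


# Hypothetical star graph: one hub connected to four leaves.
star_nodes = ["hub", "l1", "l2", "l3", "l4"]
star_edges = [("hub", leaf) for leaf in star_nodes[1:]]
print(node_degrees(star_nodes, star_edges))
print(degree_distribution(star_nodes, star_edges))
```

In this toy graph a single high degree node coexists with many degree-1 nodes, which is the qualitative signature of a hub.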
Therefore power law graphs tend to have a limited number of highly connected nodes, often called hubs. The proportion of high and low degree nodes is dependent on the value of k and implies certain structural properties [24, 88, 118]. When k is small (k < 2), there are very few hubs that the rest of the graph depends on to remain connected. When k is large (k > 3), there are many high degree nodes and the graph structure is close to that of a random graph [24, 88, 118]. When 2 < k < 3 there tends to be a hierarchical connectivity where the hubs are connected to medium degree nodes which are connected to low degree nodes [24, 88, 118]. From this connective structure, and the characterization of hubs from their connectivity patterns, the functional role of hubs can be interpreted. For example, the C. elegans connectome has a power law degree distribution and most of its hubs are neurons with a particular cell type called interneurons whose role is to connect more specialized cell types, sensory and motor neurons [157]. This example demonstrates how the connective role of each node can be established through the characterization of the degree distribution of a graph.

Triangles, cliques, and clustering coefficients

Many graph theory measures have been developed to evaluate the connective structure of nodes on a local scale. Nodes can be connected in triangles: a set of three nodes all connected to each other. The transitivity of a graph is accordingly the proportion of realized to unrealized triangles [118]. Higher order structures of fully connected nodes are called cliques: a k-clique is one where k nodes have all possible edges between them realized.
Many measures stem from these types of structures such as the number of triangles, the number of k-cliques in the graph, the size of the largest clique in the graph, etc.

In order to measure the local connective behaviour on an individual node basis, the clustering coefficient of a node is calculated as follows:

$$c_i = \frac{\sum_{j}^{N}\sum_{k>j}^{N} e_{ij}\, e_{ik}\, e_{jk}}{d_i(d_i-1)/2} \qquad (1.4)$$

where the numerator of the fraction is the number of triangles through node i, with $e_{ij}\, e_{ik}\, e_{jk} = 1$ if the nodes i, j and k are all connected [118]. The denominator represents the total number of possible triangles connected to node i given that it has a degree of $d_i$. The clustering coefficient expresses the connectivity between neighbours: if all of its neighbours are connected then a node has a clustering coefficient of 1.

The clustering coefficient of an edge can also be measured by evaluating the overlap in neighbours of the two nodes connected by the edge in question. An edge’s clustering coefficient is calculated using:

$$c_{i,j} = \frac{|N_i \cap N_j| + 1}{\min(d_i, d_j)} \qquad (1.5)$$

where $N_i$ and $N_j$ are the sets of neighbours of nodes i and j respectively [103].

The clustering coefficient of a node and its degree are independent properties, though the higher the degree of a node the more connected its neighbours need to be in order to also have a high clustering coefficient. The global clustering coefficient is calculated as the average of the nodes’ clustering coefficients. Along with notions of degrees, triangles and cliques, these graph theory measures characterize the global connectivity of a graph and the local connectivity of each node.

Centrality measures

Centrality measures are used to evaluate the position of a node within the graph to determine its centrality with respect to the other nodes in the graph [62, 118, 127, 176].
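Before going further into centrality, the node and edge clustering coefficients defined above (Equations 1.4 and 1.5) can be illustrated with a short sketch. The function names, the toy graph and the convention of returning 0 for nodes of degree below 2 are assumptions made for this example.

```python
from itertools import combinations


def build_adjacency(edges):
    """Neighbour sets of a simple undirected graph given as (u, v) pairs."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    return adjacency


def node_clustering(adjacency, i):
    """Realized triangles through node i over possible ones (Equation 1.4)."""
    neighbours = adjacency[i]
    d = len(neighbours)
    if d < 2:
        return 0.0  # assumed convention: no triangle is possible
    triangles = sum(1 for j, k in combinations(neighbours, 2) if k in adjacency[j])
    return triangles / (d * (d - 1) / 2)


def edge_clustering(adjacency, i, j):
    """Shared neighbours of i and j, plus one, over the smaller degree (Equation 1.5)."""
    shared = len(adjacency[i] & adjacency[j])
    return (shared + 1) / min(len(adjacency[i]), len(adjacency[j]))


# Hypothetical toy graph: a triangle a-b-c with a pendant node d attached to c.
adj = build_adjacency([("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")])
print(node_clustering(adj, "a"))       # both neighbours of a are connected
print(node_clustering(adj, "c"))       # one of three neighbour pairs is connected
print(edge_clustering(adj, "a", "b"))  # a and b share the neighbour c
```

Averaging `node_clustering` over all nodes would give the global clustering coefficient as described in the text.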
There are many different centrality measures, each of which uses different metrics to evaluate the topological position [19, 62, 123] of a node as illustrated in Figure 1.2.

Each measure estimates the centrality of the position of a node over a certain range. For instance, degree centrality evaluates the position of a node locally by simply taking into account its degree. Betweenness centrality evaluates the position of a node globally by evaluating the importance of that node relative to all shortest paths in the graph, as expressed by the following equation:

$$bc_i = \sum_{j,k,\; j \neq k}^{N} \frac{p_{jk}(i)}{p_{jk}} \qquad (1.6)$$

where $p_{jk}(i)$ is the number of shortest paths between nodes j and k that go through i, while $p_{jk}$ is the total number of shortest paths between j and k [118].

Figure 1.2: An overview of different node centrality measures (panels: closeness, eigenvector, harmonic, betweenness, Katz and degree centrality). Nodes are coloured by their relative centrality, from low to high (adapted from © Tapiocozzo (2015) under CC-SA license).

The centrality of a node may reflect its importance in maintaining the overall structure of a network [5, 85, 118]. The centrality measure that is most appropriate to find the structurally important or essential nodes in a network depends on the range of centrality appropriate for the system under study [5, 85]. For example, in a communications grid network where signals are sent between computers, it is imperative that certain computers don’t go down (as during a power shortage) so that all messages can make it to their destination: computers with high betweenness centrality tend to connect others which would be disconnected in their absence, and thus if they are shut down messages will have no alternate route to follow [5]. Furthermore, the structure of the network can also influence the applicability of a centrality measure.
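As a concrete reference point for this discussion, Equation 1.6 can be sketched with a brute-force computation that counts shortest paths by breadth-first search. This is a minimal illustration rather than an efficient or authoritative implementation: the function names are hypothetical, each unordered pair (j, k) is counted once, and no normalization is applied, so the values are not directly comparable to those shown in Figure 1.1.

```python
from collections import deque
from itertools import combinations


def build_adjacency(edges):
    """Neighbour sets of a simple undirected graph given as (u, v) pairs."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    return adjacency


def bfs_counts(adjacency, source):
    """Distance and number of shortest paths from source to each reachable node."""
    dist = {source: 0}
    paths = {source: 1}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adjacency[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                paths[v] = 0
                queue.append(v)
            if dist[v] == dist[u] + 1:
                paths[v] += paths[u]  # every shortest path to u extends to v
    return dist, paths


def betweenness(adjacency):
    """Unnormalized betweenness: sum over unordered pairs of p_jk(i) / p_jk."""
    nodes = list(adjacency)
    info = {n: bfs_counts(adjacency, n) for n in nodes}
    score = {n: 0.0 for n in nodes}
    for j, k in combinations(nodes, 2):
        dist_j, paths_j = info[j]
        dist_k, paths_k = info[k]
        if k not in dist_j:
            continue  # j and k lie in different connected components
        for i in nodes:
            if i in (j, k) or i not in dist_j:
                continue
            # i lies on a shortest j-k path when the two legs add up exactly
            if dist_j[i] + dist_k[i] == dist_j[k]:
                score[i] += paths_j[i] * paths_k[i] / paths_j[k]
    return score


# Hypothetical toy graphs: a three-node path and a four-node cycle.
path = build_adjacency([("a", "b"), ("b", "c")])
square = build_adjacency([("a", "b"), ("b", "c"), ("c", "d"), ("d", "a")])
print(betweenness(path))
print(betweenness(square))
```

In the path the middle node carries the only shortest path between the two ends, whereas in the cycle the two alternative routes between opposite nodes split the score evenly.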
For instance, in a network with a very high global clustering coefficient, and thus where most nodes are connected to their neighbours, the nodes’ betweenness centrality values would be somewhat evenly distributed and probably not as useful for distinguishing low from high centrality nodes as another measure such as degree centrality. Therefore picking the appropriate centrality measure for a system depends on the structure of the network and the roles played by central nodes in the system.

Modules

The graph theory measures so far presented have focused on global and local topological properties in a graph. Modularity analysis evaluates the sub-global topology of a graph by finding structurally meaningful subgraphs (i.e. subsets of the graph) [120]. A module is defined as a subgraph whose members are more densely connected to each other than to nodes outside that subgraph [103, 120]. Modularity analysis can thus be interpreted as a form of topological clustering on the graph [120]. One famous example of the application of modularity analysis is the Karate Club network [120, 176]: after the two modules in the social network were modelled, the karate club split and its members separated to form two karate clubs following the modules predicted [173] (see Figure 1.3).

Determining the optimal partitions in the graph to find modules depends on the type of connectivity patterns evaluated on the subgraph [120, 176]. As in the case of centrality measures, what is considered an appropriate connectivity pattern depends on the context of the system being modelled. Several modularity algorithms have been developed and each assesses different connectivity patterns.
The type of pattern used influences the interpretation of the modules and the method used to find the patterns determines the computational complexity of a modularity algorithm.

Most modularity algorithms rely on measuring node properties such as degree or clustering coefficient to evaluate the connectivity of subgraphs. Others measure larger structures such as triangles and cliques. In summary, modularity algorithms assess the connectivity of subgraphs by measuring one or a combination of the following [103, 120, 176]:

1. cliques or clusters of cliques
2. minimum edge cut of a graph
3. node density in a subgraph
4. high betweenness centrality nodes between subgraphs
5. total degree within a subgraph

Figure 1.3: The modularity of the Karate Club network. Nodes are coloured by their association to the blue and red modules. The two modules correspond to the actual division of the club into two separate karate clubs (from © Zhao et al. (2014)).

As in clustering methods, modularity algorithms can have a top-down approach, in which case graphs are partitioned iteratively, or a bottom-up approach, where subgraphs are merged iteratively. One advantage of the bottom-up approach is that it doesn’t rely on knowing how many modules there might be in the graph [120, 176]. Some modularity algorithms are NP-hard, others have a complexity as high as $\Omega(E^{2}N)$, and thus many are not applicable to large networks with over tens of thousands of nodes [120, 176]. One low complexity algorithm with a bottom-up approach which has been evaluated on protein-protein interaction (PPI) networks is a fast agglomerative algorithm called FAG-EC [103]. This algorithm measures the modularity of a subgraph by comparing the in-degree of a subgraph to its out-degree [103], where the in-degree corresponds to the number of edges between nodes within the subgraph and the out-degree corresponds to the number of edges with nodes outside the subgraph.
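The in-degree and out-degree of a subgraph can be sketched as follows. This is not the FAG-EC algorithm itself, only an illustration of the quantities it compares. The function names and the multiplicative threshold `lam` are hypothetical, and the totals are taken as sums over the member nodes, which is an interpretive assumption under which each internal edge contributes twice (once per endpoint).

```python
def subgraph_degrees(edges, members):
    """Summed in-degree and out-degree of the nodes in `members`."""
    members = set(members)
    d_in = 0
    d_out = 0
    for u, v in edges:
        u_inside = u in members
        v_inside = v in members
        if u_inside and v_inside:
            d_in += 2  # internal edge, counted from both endpoints
        elif u_inside or v_inside:
            d_out += 1  # edge leaving the subgraph
    return d_in, d_out


def is_module(edges, members, lam=1.0):
    """True when the summed in-degree exceeds `lam` times the summed out-degree."""
    d_in, d_out = subgraph_degrees(edges, members)
    return d_in > lam * d_out


# Hypothetical toy graph: two triangles joined by the single edge c-d.
toy_edges = [
    ("a", "b"), ("b", "c"), ("a", "c"),
    ("d", "e"), ("e", "f"), ("d", "f"),
    ("c", "d"),
]
print(subgraph_degrees(toy_edges, {"a", "b", "c"}))
print(is_module(toy_edges, {"a", "b", "c"}))
print(is_module(toy_edges, {"c", "d"}))
```

Raising `lam` makes the check stricter: with `lam=10.0` the triangle a-b-c no longer qualifies in this toy graph.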
If the in-degree $d^{in}$ is greater than the out-degree $d^{out}$ of a subgraph S by a multiplicative factor λ, then the subgraph is a module:

$$\sum_{i \in S} d_i^{in} > \lambda \sum_{i \in S} d_i^{out} \qquad (1.7)$$

where the value of the parameter λ can be adjusted to obtain a stricter definition of a module [103].

The algorithm builds and evaluates the modularity of subgraphs by starting with singleton subgraphs (i.e. each node is its own subgraph) and merging subgraphs when their ratio of in- to out-degree increases. To reduce the complexity of the algorithm, the order in which subgraphs are evaluated as merge-able relies on the clustering coefficient of an edge (see Equation 1.5) between two nodes, one in each subgraph: the higher the clustering coefficient of an edge, the higher the probability that the two nodes connected by that edge will be in the same module [103]. Thus the edges with the highest clustering coefficient are used to merge subgraphs earlier in the algorithm, giving FAG-EC a complexity of $\Omega(cE)$ where c is a constant, which is relatively low compared to other algorithms [120, 176].

FAG-EC was tested on PPI networks to find groups of proteins that perform specific biological functions in a cell through these network modules [103]. This method and other modularity algorithms have also been used to find functional modules in other systems such as trophic networks [24, 88, 122], and social networks including the Karate Club network described above [162] (see Figure 1.3).

1.1.3 Biological network complexity

Complex biological systems are teeming with interactions at different scales: from molecules to cells to tissues to organs to organisms to species to environment to ecosystems. Networks have been used to model these complex interactions in biological systems at every level (Figure 1.4). For instance, at the molecular level, PPI networks are constructed to model protein-protein interactions; at the ecosystem level, foodwebs are built to model trophic interactions between species.
As with other types of systems, biological networks are far from random and have topological features that have informed researchers on how they function and their dynamics. For example, PPI networks from different organisms have been found to have modules that match cellular and metabolic functional units within that organism [24, 88]. Here we present a few recurring and non-random global and local structural patterns in biological networks followed by the different challenges faced when visualizing and interpreting them.

Figure 1.4: Complex biological system modelling through networks: a variety of systems can be described using networks by abstracting system agents and relationships between agents.

Common structures

As a first assessment of a network model, the topology of biological networks is tested against the topology of random networks. For a biological network with N nodes and E edges, a random network can be built with the same number of nodes and edges [35, 118]. The structure of the two networks can be compared by evaluating their degree distribution, average shortest path length, diameter, modularity, global clustering coefficient, assortativity, etc. [118].

The degree distribution of most biological networks, including protein-protein networks, gene expression networks, and foodwebs, is typically a power law distribution [24, 88]. As described in Section 1.1.2, a power law distribution implies several specific topological properties, two of which are discussed here in the context of biological networks. First, networks with a power law distribution have few high degree nodes commonly called hubs. In PPI networks, different methodologies have been used to characterize the essential proteins, proteins that are important to the proper function of the system, and have found that these proteins are often hubs in the network [24, 88].
While these well-connected nodes can play different roles in biological networks, they consistently display properties that distinguish them from other nodes in the network. Second, power law networks with a scaling factor 2 < k < 3 are called scale-free networks and manifest a hierarchical connective structure: a high degree node is typically connected to medium degree nodes which are connected to low degree nodes [24, 88] (Figure 1.1B). Foodwebs from different habitats and of different species richness, from tens to thousands of species, have a scale-free network structure which renders them robust to disturbances (e.g., change in climate) such as the extinction of species (i.e. the removal of nodes in the network) [45, 92, 122]. Given such a low proportion of high degree species in the foodweb, removing a single species will rarely result in network collapse. The topology of foodwebs confers an adaptive connected structure that is robust to perturbation, a property of scale-free networks that is conserved across systems [118].

Complexity and systems

Biological systems are dynamically interconnected within and between hierarchical levels. This complexity makes them difficult to understand even with the use of graph theory to construct networks. Consider that biological data often face accuracy and reproducibility issues which limit the power of the models used to study them. For instance, brain imaging technology has enabled the measurement of activity of regions of the brain at a macroscopic level, and this data is used to reconstruct brain networks called connectomes despite the fact that the activity in the brain occurs at the level of individual neurons [25].
Therefore the construction and interpretation of the connectome is dependent on and limited by the resolution of the data and the fact that connection patterns typically vary between individuals [25].

Beyond data challenges associated with replication and resolution, biological networks containing tens to tens of thousands of nodes and edges are difficult to navigate. In particular, most network visualization schemes are inappropriate for very large networks as we will see in Section 1.2.2. Moreover, because of the great number of nodes and edges, characterizing local structures in the network such as cliques, modules, and triangles, as well as manually evaluating the construction of the network on an individual node basis is not feasible. One alternative is to find and evaluate local patterns such as repeating connective structures or motifs. Although algorithms exist to identify motifs, these must be specified a priori limiting the discovery of new or unexpected patterns.

In addition to being hierarchical, biological systems are multivariate and thus highly dimensional with respect to intrinsic and extrinsic factors. Each factor and its influence can be encoded in the nodes and edges of the biological networks as quantitative and qualitative properties. For example, individual species in foodwebs have diets, population sizes, seasonal habits, etc. which impact their feeding behavior and are tightly linked to their role in the foodweb. At the same time, environmental parameters such as weather and geography also influence trophic interactions [122]. Taking all of these dimensions into account in building a complete model of the system leads to a multitude of possible association patterns. In order to meaningfully model these patterns, a network must integrate the dimensionality of biological systems.
Accordingly, both quantitative graph theory measures and qualitative methods such as visualization need to accommodate this complexity in their design and implementation.

1.2 Network exploration

Network analysis typically involves quantitative measures and methods as well as visualization. However, most network visualizations such as force-directed layouts are not suitable for visualization tasks when scaled up to large networks because they are inconsistent and difficult to interpret [96, 115], often resembling “hairballs” [96] (Figure 1.7). Biological networks in particular are very large networks. Appropriately visualizing these networks could assist in the interpretation of network properties relevant to system function. Here we motivate visualization as a means for system exploration and present current network visualization procedures.

1.2.1 Visualization as a means for data exploration

Humans are visual creatures with powerful pattern detection capabilities [50, 125, 147, 151]. While some tasks are best accomplished by computational techniques, others are difficult to abstract quantitatively and are well suited for visualization purposes [115]. A visualization tool can thus become a medium to explore a dataset and create a transition between raw data and the formulation of a problem or hypothesis [115, 125]. Visualizations can be designed for different types of data, and different tasks (Figure 1.5). For example, explorative visualizations have been built to identify outliers, global and local patterns, similarities, and in general novel features of the visualized data [115].

Figure 1.5: Overview of tasks accomplished by visualizations. These tasks are composed of visual actions applied to data targets (from © Munzner (2014)).

Interactivity can further enable a user’s exploration experience of a visualization compared to a static graphic [50, 115, 147].
Shneiderman proposed a visualization mantra to optimize a user’s interactive experience: “Overview first, zoom and filter, then details-on-demand” [147]. In sum, a visualization is designed to optimally display the different dimensions of a dataset in a way that is easily navigable by a user in the context of an interactive and explorative visualization. Furthermore, Coleman proposed a set of requirements for a design to produce an aesthetic and comprehensive visualization [37]:

Generality in its application to different datasets
Flexibility in the range of tasks that can be accomplished
Transparency in its layout to ensure its interpretability
Competence in the number and quality of the features it reveals
Speed in its rendering time

Particularly in the case of biological systems, which vary in size, complexity and number of dimensions, explorative visualizations are invaluable tools for the discovery of interpretable biological patterns and the development of hypotheses which can drive subsequent experimental and computational analysis [115].

1.2.2 Current network visualizations

Though a multitude of heuristic and rule-based network visualization methods exist, here we present three different methods to illustrate the challenges in visualizing and exploring networks. The pitfalls of these visualization schemes demonstrate that current network visualizations are not designed to facilitate the exploration and discovery of novel global and local patterns in complex systems [96, 115].

Adjacency matrices

Adjacency matrices are both the linear algebraic formulation and a visual representation of graphs. Nodes are encoded as row and column labels while edges are encoded as entries in the matrix.
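The encoding just described can be sketched in a few lines of Python. The function name, the optional third element used as an edge weight and the `directed` flag are illustrative assumptions for this example, not part of any tool discussed in this thesis.

```python
def adjacency_matrix(nodes, edges, directed=False):
    """Build a matrix whose rows and columns follow the order of `nodes`.

    Each edge is (u, v) or (u, v, weight); a missing weight defaults to 1.
    """
    index = {node: position for position, node in enumerate(nodes)}
    matrix = [[0] * len(nodes) for _ in nodes]
    for u, v, *weight in edges:
        w = weight[0] if weight else 1
        matrix[index[u]][index[v]] = w
        if not directed:
            matrix[index[v]][index[u]] = w  # undirected: mirror the entry
    return matrix


# Hypothetical toy graph: a-b unweighted, b-c with weight 3.
toy_nodes = ["a", "b", "c"]
toy_edges = [("a", "b"), ("b", "c", 3)]
for row in adjacency_matrix(toy_nodes, toy_edges):
    print(row)
```

Changing the order of `nodes` permutes rows and columns together, which is the degree of freedom that a node ordering exploits when the matrix is used as a visualization.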
This representation is suitable for the visualization of directed networks, in which case the matrix is asymmetric, of weighted networks, in which case entries encode the weight of an edge, and of connected components, in which case the matrix is sparse.

This visual representation is particularly useful for showing the components, modules and cliques of a network by applying an appropriate node ordering (Figure 1.6) [115]. However it does not scale to large networks and is not suitable for looking at individual node topologies such as clustering coefficient.

Figure 1.6: Adjacency matrices, a tabular representation of graphs. Rows and columns are labeled by nodes and each entry in the matrix corresponds to the presence or absence of an edge between the corresponding nodes. The two adjacency matrices, one ordered alphabetically, one ordered by clustering node connectivity patterns, lay out the social network of the characters in Victor Hugo’s Les Misérables (from © Bostock (2015)).

Force-directed layouts

Force-directed layouts encode nodes as circles and edges as links between them.
The layouts are obtained by applying physical rules to the nodes and edges to place them on a plane in a way that minimizes overlap and the number of crossing edges [64]. For example, spring-embedded layouts model the edges as springs with different spring coefficients relating to edge weights in the case of weighted networks [64]. The nodes are modelled as particles with repulsive forces to avoid overlap [64].

Figure 1.7 illustrates force-directed layouts of a large network. While these layouts are suitable for showing modules, cliques and triangles in small networks, their heuristic algorithms produce network layouts which are inconsistent and thus do not allow for the comparison of networks.

Figure 1.7: Force-directed layouts, an intuitive and planar visual representation of graphs. This force-directed layout shows the social network of the characters in Victor Hugo’s Les Misérables coloured by cluster (from © Bostock (2015)).

Hive plots

Figure 1.8: Hive plots, a circularly organized representation of graphs. This simple hive plot has 6 nodes arranged on three axes (from © Bostock (2015)).

Hive plots are a consistent and coherent rule-based layout and are an appropriate visualization for comparing and visualizing structural patterns in large networks across different data dimensions. They provide an interpretable visualization while leaving several visualization channels, such as colour, size and rule choice, to encode additional data properties [96]. The hive plot design scales to large networks as it handles visual occlusion and other potential visualization design pitfalls [115] by organizing the layout of nodes and edges given their attributes and network properties [96]. Despite their flexibility, hive plots can be a daunting visualization technique given the large range of options and combinations of layout rules that must be chosen by the user.
In addition, they are not suitable for exploring certain network topological features such as connected components and cliques.

In summary, each network visualization has its strengths and weaknesses. Depending on the topology and size of the networks being studied, some network layouts will be more suitable than others. Overall, these strategies are best used in combination to evaluate and explore networks and the systems they model.

1.3 Charting microbial community structure and function

As alluded to earlier, charting microbial community metabolism is extremely challenging because of the cultivation gap between indigenous microorganisms and laboratory settings [114]. There are several reasons for this cultivation gap, including the inability to reproduce in situ physical, chemical and ecological conditions in a laboratory setting. Thus, culture-based techniques capture only a small fraction of microbial diversity. Plurality sequencing bridges the cultivation gap by providing direct access to the genetic material of indigenous microorganisms. The genetic material stored in nucleic acids (DNA and RNA) contains the necessary information for an organism to grow and reproduce [91]. A genome contains the genes necessary to encode the organism's metabolic functions and reproduction. Accordingly, the collection of genetic material in an environment defines the ensemble of the community's genomes: the metagenome [29]. Metagenomic studies resolve the taxonomic diversity and functional activity of a community by decoding this genetic information.

Next-generation sequencing techniques have been developed to measure the genetic material in environmental samples in a high-throughput manner. Given the seemingly infinite diversity of microbial life and the great variation in genetic encoding, plurality sequencing studies have generated a great abundance of microbial community data from a variety of natural and engineered ecosystems [29].
This surge of information has driven the development of different solutions to store, manage, analyze and present this "big data" [160]. For instance, publicly available databases permit the annotation of nucleotide and amino acid sequences to known genes and gene products as well as the assessment of their phylogeny [22, 29, 32, 67, 137, 144, 160]. Software solutions are used to align, cluster, and manipulate sequences to provide analytical frameworks to characterize the taxonomic, genetic and metabolic potential of a microbial community [27, 29, 75, 83, 93, 160]. In addition, visualization techniques have been developed to present and illustrate ecological findings [78, 83, 89, 95, 99]. Finally, as the quantity of environmental sequence information increases, these solutions must scale in order to effectively study the organization of microbial communities and their relationship with their environment.

1.3.1 Soil ecology

Soil harbors the most diverse microbiome [167]. From boreal forests to arctic sediments, one gram of soil is estimated to contain up to tens of thousands of unique species [167]. Assessing the quality and type of soil involves measuring soil properties called edaphic factors, including soil moisture, porosity, temperature, and acidity, which are affected by abiotic and biotic factors such as agricultural practices, climate, and plant and fungal growth. The distribution of microbes throughout the soil profile is influenced by both edaphic factors and interactions between microbial community members [3, 57, 74, 76, 100, 168] (Figure 1.11). Nevertheless, to date, most research has focused almost exclusively on the role of edaphic factors by relating this measurable information to community composition data using multivariate methods [3, 57, 74, 76, 100, 168]. For example, studies have revealed soil acidity [100] as well as carbon and nitrogen pools and cycling [39] to be strong indicators of soil microbial community structure and diversity.
In addition, decreases in microbial biomass and community diversity with soil depth [108, 113] have both been attributed to concurrent changes in carbon resources and soil acidity throughout the soil profile. Current models of soil communities paint an incomplete picture of this complex system as few studies combine microbial interactions with environmental parameter data [105, 138, 139].

The long term soil productivity project

The LTSP project is a multidisciplinary effort to monitor the impact of forestry practices on North American soil productivity that was initiated by the United States Forest Service 25 years ago [136]. The project spans ten North American ecozones: biogeographic regions manifesting particular temperature ranges, precipitation patterns and tree species. Today, the LTSP study remains one of the world's largest coordinated research networks, including over 110 sampling locations in the United States and Canada [135, 136] (Figure 1.9). Research on LTSP sites is primarily focused on impacts of tree harvesting practices related to soil organic matter (OM) removal and soil compaction [76, 136]. Each LTSP site uses a randomized and replicated factorial design with three levels of OM removal (OM1-OM3) in 40 × 70 m² plots (Figure 1.9). A control plot representing natural reference forest (OM0) is also included at each site. In OM1 plots, tree boles have been removed but tree crowns, felled understory, and forest floor material are retained, resulting in minimal soil OM removal [136]. In OM2 plots, aboveground vegetation is removed but forest floor material is retained, resulting in intermediate soil OM removal [136].
In OM3 plots, all surface organic matter is removed, leaving bare soil exposed [76, 135].

Figure 1.9: Overview of the LTSP ecozones and treatments where microbial communities were sampled. A) The age and geographic location of sites per ecozone are marked. B) Different OM treatments were conducted in these forests; harsher treatments reflect increased biomass removal [31].
[Figure 1.9 labels. Panel A: core sites (more than 15 yrs), core sites (about 15 yrs), affiliated sites; ecozones: Black Spruce, Interior Douglas Fir, Jack Pine, Sub-boreal Spruce, Ponderosa Pine, Loblolly Pine. Panel B, aboveground biomass removed: OM0, 0% (no harvesting); OM1, 50% (stem-only removal); OM2, 70% (whole tree removal); OM3, 100% (whole tree and forest floor removal).]

Recent efforts to study microbial community responses to soil perturbation in the LTSP network have resulted in an archive of samples spanning 5 ecozones [76]. Extant microbial studies have focused on community composition from different soil types, at different depths and at different levels of soil organic matter removal and compaction, as described above. These forestry practices perturb the soil's physical, chemical and ecological conditions, which causes a disturbance in the microbial community with resulting feedback on biogeochemical cycling [76]. Given that soil microbes recycle the carbon content of the soil and produce climate active gases such as carbon dioxide, methane, and nitrous oxide, this study has the potential to help assess the impact of forestry practices on forest ecosystem health and climate change [97]. With samples from a variety of ecozones, this impact can be measured on a large geographic scale.
However, current analytical and visualization methods do not facilitate the interpretation of these large datasets and the microbial communities they represent across ecological scales.

1.3.2 Taxonomic assessment

As previously described, microbial community diversity can be assessed using SSU rRNA gene sequencing [141, 156]. The SSU rRNA gene is a highly conserved gene with sufficient taxonomic resolution. Its hypervariable regions V1-V9 serve as the genetic markers to taxonomically identify each organism while the conserved regions enable primer binding and sequencing [141, 156]. High quality sequences are recovered and can then be clustered at different percent identity thresholds against a curated database, for instance the Greengenes database [22]. Matches between the database entries and clustered sequences create an operational taxonomic unit (OTU) that can be assigned taxonomy at different levels in the taxonomic hierarchy [67].

An identity threshold of 97% clusters SSU rRNA sequences with the sequence variability expected between organisms of the same species [18]. Lower thresholds capture higher taxonomic levels such as phylum, class, order, family, and genus [18]. Given the number of clustered sequences for a given OTU, the relative abundance of that OTU in the community can be estimated and used to infer its abundance pattern across the sample profile.
Studies of large collections of SSU rRNA samples from varying ecosystems (e.g., soil, water, organisms, atmosphere) facilitate the characterization of the Earth's microbiome [67].

1.3.3 Community composition

Quantitative ecology transformed ecological research from a primarily descriptive science to an analytical science with hypothesis formulation and testing [101]. Models are developed and validated using numerical approaches; methods from fields such as algebra, statistics, information theory, and chaos theory have been utilized to resolve ecological and spatio-temporal patterns in complex and high-dimensional datasets. By measuring dependencies, similarities, correlations, and other complex relationships in the ecosystem, these quantitative procedures resolve the influence of abiotic (i.e. non-living) factors such as sunlight and biotic (i.e. living) factors such as plant growth in an ecosystem [79]. Many of these methods have been adapted from macroecology to characterize the composition of microbial communities. Here, three quantitative measurements that are generally used in ecological studies and that evaluate different features of community composition are described: community diversity, clustering of environmental samples, and indicator species analysis.

Community diversity

Community ecology often focuses on determining relationships between species diversity and environmental factors. Particularly in macroecology, measuring diversity over spatial and temporal scales has been used to assess the effect of environmental changes on ecosystems [79, 101]. For example, the BIODEPTH experiment in Europe compared above-ground plant biomass and plant species count, a measure of diversity called species richness, over environmental and spatial gradients and showed a log-linear increase in biomass with species richness [79]. Similar relations have been found in microbial communities inhabiting soil [76, 108, 113].
Thus, diversity appears to be a quantitative indicator of ecosystem perturbation across macro and micro scales.

Because diversity is a key measure of community structure, many different diversity metrics have been developed, four of which are summarized in Table 1.1. Diversity quantifies the distribution of species in a collection of samples. As described by Pierre and Louis Legendre, community diversity is "a measure of species composition, in terms of both the number of species and their relative abundances" [101].

Table 1.1: An overview of different ecological diversity metrics, where $D$ is diversity, $q$ is the total number of unique species, $i$ represents the $i$th species, $p_i$ is the relative abundance of the $i$th species, $f_1$ is the number of singletons, and $f_2$ is the number of doubletons.

Metric name       | Formula                                      | Description
species richness  | $D = q$                                      | number of unique species
Shannon's entropy | $D = -\sum_{i=1}^{q} p_i \log p_i$           | measure of disorder in species distribution
Simpson's index   | $D = 1 - \sum_{i=1}^{q} p_i^2$               | measures concentration of species
Chao 1            | $D = q + \frac{f_1 (f_1 - 1)}{2 (f_2 + 1)}$  | adjusts the species richness by an estimate of the number of unsampled species

It is important to note here that microbial diversity is an operational and probabilistic measure due to the polyphasic nature of the definition of species. Namely, microorganisms of the same species can differ phenotypically and genetically [149] and this differentiation can in turn blur the distinction between species. In practice, the species concept can be defined given a measurement of genetic variation to evaluate the genetic composition of a community and consequently capture its ecological diversity [149]. Here the species concept definition is based on percent similarity in SSU rRNA sequences, as described above.

Species richness measures the count of unique species in a sample collection. This measure is highly affected by rare species and sampling depth (i.e. 
the number of sample units collected); however, this issue is addressed using the rarefaction method, which calculates the number of species given a constant sample unit size [101]. Richness measures and rarefaction curves are used to quantitatively evaluate how well sampling recovers the diversity of an environment [101].

Shannon's entropy measures how evenly species are distributed by taking into account their relative abundance: high values of this measure correspond to most species having similar abundances and low values typically correspond to a few species dominating the sample units [101].

Simpson's index is based on the probability that two randomly chosen organisms belong to the same species; the lower this probability, the higher the overall value of diversity becomes [101]. This measure is highly affected by changes in rare species and is relatively stable with increasing sample unit sizes [101].

The Chao 1 diversity measure differs from the others in that it takes into account f1, the number of singletons (species found only once), and f2, the number of doubletons (species found only twice). In the context of measuring microbial diversity with SSU rRNA sequences, singletons are OTUs for which only one sequencing read is recovered. This measure is based on the idea that rare species can tell us how many species environmental sampling may have missed, and the term added to q, the species richness, estimates this contribution to the diversity [101]. In the case of microbial species diversity measurements, next generation sequencing technologies can produce erroneous sequences and therefore most SSU rRNA sequencing studies do not include singletons in their analysis.

Given that ecological research is often focused on the spatial organization of an ecosystem, diversity measures are applied to partitions of the sample collection to assess the distributions of species through spatial components.
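
The four metrics of Table 1.1 can be illustrated with a short Python sketch. This is a minimal illustration rather than the procedure used in this thesis: the function name `diversity_metrics` and the example read counts are hypothetical, the natural logarithm is assumed for Shannon's entropy, and Chao 1 is written in its commonly used bias-corrected form with denominator 2(f2 + 1).

```python
import math

def diversity_metrics(counts):
    """Compute the Table 1.1 metrics from per-OTU read counts (hypothetical input)."""
    counts = [c for c in counts if c > 0]          # ignore OTUs that were not observed
    total = sum(counts)
    q = len(counts)                                # species richness
    p = [c / total for c in counts]                # relative abundances p_i
    shannon = -sum(pi * math.log(pi) for pi in p)  # natural log is an assumption
    simpson = 1.0 - sum(pi * pi for pi in p)
    f1 = sum(1 for c in counts if c == 1)          # singletons
    f2 = sum(1 for c in counts if c == 2)          # doubletons
    chao1 = q + f1 * (f1 - 1) / (2 * (f2 + 1))     # bias-corrected form (assumed)
    return {"richness": q, "shannon": shannon, "simpson": simpson, "chao1": chao1}

# Hypothetical sample: one dominant OTU, two singletons and one doubleton.
print(diversity_metrics([5, 1, 1, 2]))
```

For a perfectly even sample of four OTUs, Shannon's entropy reaches its maximum, log(4), and Chao 1 equals the observed richness because there are no singletons.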
Whittaker described three spatial levels of diversity: alpha, beta and gamma diversity. Alpha diversity (α) represents the diversity at each sample site, gamma diversity (γ) represents the diversity of the whole sample collection, and beta diversity (β) measures the per sample variation in diversity [101]. The three diversity levels are related through the relation β = γ/α [101]. Different metrics, such as the ones in Table 1.1, can be used to calculate the α, β and γ diversities.

To conclude, these metrics quantify diversity based on different ways of assessing abundance and distributions of species and can be used to evaluate spatial and temporal variations within and between sample units.

Sample clustering

Environmental samples can be grouped into clusters according to their respective species composition. Samples with similar compositions based on similarity metrics are grouped into the same cluster based on an operational threshold. These clusters are characterized to resolve ecological patterns, such as species niches [2, 44, 59] and groups of species that co-occur across geographically distinct sampling sites, and in general to assess the similarities within and between sample units. Clustering analysis is thus a knowledge discovery method.

There are many ways to measure the similarity between samples and many ways to group samples into clusters. Similarity metrics include the Euclidean distance and the Manhattan distance [101]. Each distance will weigh abundant and rare species differently. Here we provide an overview of hierarchical clustering, which is commonly used in ecological studies and can be used with any distance metric. There are two types of hierarchical clustering procedures. Both take as input a distance matrix: the chosen distance metric is used to assess the similarity between all pairs of samples, thus building a sample distance matrix.
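
As a rough illustration of this input, the following Python sketch builds a sample distance matrix with either of the two distances named above. The function names and the three example samples (relative abundances of three OTUs) are hypothetical.

```python
def euclidean(a, b):
    """Euclidean distance between two abundance profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def manhattan(a, b):
    """Manhattan distance between two abundance profiles."""
    return sum(abs(x - y) for x, y in zip(a, b))

def distance_matrix(samples, metric):
    """Pairwise distances between all samples, as a list of rows."""
    n = len(samples)
    return [[metric(samples[i], samples[j]) for j in range(n)] for i in range(n)]

# Hypothetical samples: each row holds the relative abundance of three OTUs.
samples = [
    [0.5, 0.3, 0.2],
    [0.4, 0.4, 0.2],
    [0.0, 0.1, 0.9],
]
for row in distance_matrix(samples, manhattan):
    print(row)
```

The resulting matrix is symmetric with a zero diagonal; here the first two samples are much closer to each other than either is to the third, so a clustering procedure would be expected to group them first.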
Agglomerative clustering iteratively groups pairs of samples to form clusters and then forms clusters between the initial clusters, and so on [101]. Divisive clustering is the equivalent "top down" approach: the ensemble of the samples is partitioned into clusters which are subsequently divided iteratively [101]. Both procedures rely on a greedy algorithm [101]: at each step they pick the best way to merge or divide clusters in order to maximize the similarity within or the distance between new clusters, respectively. The hierarchies of clusters produced by the two methods can differ and are often visualized using a dendrogram.

Ecological analysis of clusters includes the evaluation of common environmental factors such as site locations. This idea is based on the fact that samples related through common environmental factors such as climate will have similar species compositions and will group into the same clusters [101]. Furthermore, hierarchical clustering can be used to assess the quality of an environmental sampling experiment. Finally, hierarchical clustering is one method used to evaluate sample compositionality, which can be used to expand the ecological understanding of an environment.

Indicator species analysis

The composition of a community can be characterized on a species level by evaluating the ecological relevance of species with respect to some environmental factor. One simple example is a particular species which is consistently found in samples of a particular habitat. Significant associations between species abundance profiles and environmental factors can help assign ecological meaning to samples and sample sites [101].

Different methods evaluate the association between species abundance and environmental factor distributions. One way to measure these associations is to use correlation indices. Methods differ by how they handle the variance and distribution of species and samples [101].
Here we provide an overview of an ecologically motivated method called indicator species analysis.

Once a partition of the samples is established, using a predetermined ecological factor or a clustering method, indicator species analysis can be applied to discover which species are indicative of the "condition" of sample partitions. The indicator value of a species for each condition is calculated using:

\[ \mathrm{IndicatorValue}_{ij} = A_{ij} \times B_{ij} \tag{1.8} \]

where $A_{ij}$ is the specificity of species $i$ to the cluster $j$ and $B_{ij}$ is its fidelity [101]. The specificity is calculated by dividing the average abundance of species $i$ in the samples belonging to cluster $j$, $p_{ij}$, by its average abundance in all samples for all clusters $k$, $p_{ik}$, as follows:

\[ A_{ij} = p_{ij} / p_{ik} \tag{1.9} \]

High specificity is obtained when a species is highly abundant in all samples of cluster $j$ and rare in other samples. The fidelity is calculated by dividing the number of samples in cluster $j$ where species $i$ is found by the total number of samples in cluster $j$, as follows:

\[ B_{ij} = \mathrm{samples}_{ij} / \mathrm{samples}_{j} \tag{1.10} \]

High fidelity is obtained when a species is present in all samples within a cluster. Figure 1.10 illustrates how a species can have high fidelity, high specificity or both.

Figure 1.10: An illustration of the specificity and fidelity of species to environmental conditions. The eight samples are partitioned under two different conditions.

Once indicator values are measured for all combinations of species and sample clusters, the significance of the results is evaluated by taking into account the compositional bias in species abundance profiles.
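
Equations 1.8 to 1.10 can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the implementation used in this thesis: the function name and the example data are hypothetical, and the specificity denominator is taken as the sum of the per-cluster average abundances over all clusters k, a common convention for indicator values that keeps the specificity between 0 and 1.

```python
def indicator_value(abundances, clusters, target):
    """Return (specificity, fidelity, indicator value) of one species for `target`.

    abundances: abundance of the species in each sample
    clusters:   cluster label of each sample (same order as abundances)
    """
    # Average abundance of the species within each cluster.
    mean_by_cluster = {}
    for label in set(clusters):
        values = [a for a, c in zip(abundances, clusters) if c == label]
        mean_by_cluster[label] = sum(values) / len(values)

    # Specificity A_ij; the denominator sums the cluster means (assumed convention).
    denominator = sum(mean_by_cluster.values())
    specificity = mean_by_cluster[target] / denominator if denominator > 0 else 0.0

    # Fidelity B_ij: fraction of the target cluster's samples containing the species.
    in_target = [a for a, c in zip(abundances, clusters) if c == target]
    fidelity = sum(1 for a in in_target if a > 0) / len(in_target)

    return specificity, fidelity, specificity * fidelity

# Hypothetical example: eight samples split evenly between conditions A and B.
clusters = ["A"] * 4 + ["B"] * 4
print(indicator_value([5, 3, 4, 6, 0, 0, 0, 0], clusters, "A"))  # specific and faithful to A
print(indicator_value([4, 0, 0, 0, 0, 0, 0, 0], clusters, "A"))  # specific to A, low fidelity
```

The second species occurs only under condition A, so its specificity is maximal, but it is present in a single sample of that condition, which lowers its fidelity and therefore its indicator value.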
One example of compositional bias is the presence of a very abundant species which lowers and skews the relative abundance of other species when they occur in the same samples [101]. The significance is evaluated for each species i by permuting sample counts of other species and recalculating the indicator value of species i. This permutation procedure yields a distribution of indicator values for each species; comparing the observed value with this distribution gives a p-value, which indicates the probability that this value occurred by chance. Typically, p-value thresholds of 0.05 and lower are used to filter out poor values [101].

Finally, for each cluster a number of indicator species is obtained that can be used in environmental surveys of the condition implied by the sample cluster. In particular, hierarchical clustering can be applied to find the sample clusters before applying indicator species analysis. In this case, indicator species and expert knowledge can help assess the ecological properties of the clusters. However, an important consequence of this procedure is that more indicator species will be found than expected by chance since the sample partitioning was conducted using the same species abundance data [101]. The lack of independence between the two methods implies the need for a thorough examination of p-values before ecological interpretation.

1.4 Microbial co-occurrence networks

Microbial ecologists adopt quantitative and qualitative methods from macroecology to analyze, visualize and investigate interaction patterns in microbial communities by constructing microbial community networks [11, 54, 105]. Modelling the community as an inter-connected system can give insight into the community's functional characteristics related to, for instance, the biogeochemical processes it performs [105, 150].
Structural properties of microbial community networks have been visualized and characterized to infer different biological attributes of the community such as its resilience to disturbance [5, 11, 47]. However, the interpretation of global and local network properties from an ecological standpoint, as is done in macroecology particularly with foodwebs [48, 49, 52, 122], remains difficult [54, 134]. Inferring and interpreting microbial community networks faces many challenges, some of which are common to all biological network models and others which are specific to the microcosmos, such as:

1. the selection of a procedure among a multitude of methodologies used to resolve interactions between taxa [11, 54]
2. the statistical obstacles in assessing the significance of inferred interactions [16]
3. the lack of a standard procedure to analyze the constructed network [16, 54]
4. the difficulty in ecologically interpreting resolved global network structural properties [24, 54, 88]
5. the difficulty in relating environmental factors to resolved global network properties [54]
6. the difficulty in ecologically motivating and validating the analysis of local network patterns [18, 54]

In this section, we motivate the construction of microbial co-occurrence networks, describe the possible procedures and pitfalls in their construction and provide an overview of current microbial co-occurrence network studies and their findings.

Figure 1.11: Overview of different ecological interactions between microbial community members. Pairwise interactions can have a positive, negative or neutral impact on the two participants (from © Faust and Raes (2012)).

1.4.1 Symbiosis and inter-taxa interactions

Only recent efforts have begun to investigate community structure through the characterization of inter-taxa interactions. These interactions have primarily been resolved using co-culture studies where the stability of an artificial community is tested against the addition and removal of species [113].
Major findings of these efforts include mutualistic interactions between microorganisms in which a metabolic factor produced by one microorganism is utilized by a second, which then performs a reciprocal service [54, 73, 113, 174]. The different types of interactions, as defined by how they benefit or impair the participating microorganisms, are summarized in Figure 1.11 [54].

These interactions directly influence the composition of a community. Since environmental factors can affect the abundance of individual taxa (and vice versa), changes in an environment can have an impact on these interactions. Therefore a complete model of the system should include interactions between taxa and relationships between taxa and their environment.

1.4.2 Microbial network inference

In ecology, the choice of methods used to resolve relationships relies on the assumption that community dynamics can be either stochastic or non-random [101, 158]. It is important to note here that Hubbell's unified theory of biodiversity proposes that community composition can be explained by random processes affecting the birth, growth and death of taxa [81]. This hypothesis is called "neutral theory" and has been verified in some ecosystems [33, 81] and contradicted in others [101, 104]. Given the evidence of non-random and even causal relationships in microbial communities, we focus on measuring these complex relationships even though we acknowledge that random processes and stochasticity in general also play a role in shaping microbial community structure. Here we use the term relationship to designate association patterns between taxa and their environment while the term interaction denotes the resolved inter-taxa associations.

These relationships can be quantitatively measured to build networks and model the community structure. Network inference relies on different quantitative measurements to detect pairwise or complex inter-taxa interactions.
For instance, correlation measures assess the similarity in the abundance patterns of microbes: positive and negative correlations detect co-occurrence and mutual exclusivity, respectively [11, 16, 54]. Methods such as regression analysis and association rule mining can uncover complex interactions involving 3 or more taxa [54]. Other models have been adapted to model dynamic interactions that can evolve over time [54]. Here we focus on pairwise interactions resolved through correlation measures due to their flexibility in uncovering many different interactions across spatial gradients and because correlation is computationally feasible to apply to large SSU rRNA datasets, unlike methods such as association rule mining [54]. Notably, correlation-based network inference is also used to construct other biological networks such as gene expression networks [69].

Figure 1.12: An illustration of microbial network inference through co-occurrence patterns. A) OTUs whose relative abundances are similar throughout the sample profile are said to co-occur. B) OTUs whose relative abundance profiles are inversely correlated are said to be mutually exclusive. C) From a dataset of community composition, a network can be abstracted by defining OTUs as nodes and co-occurrence and mutual exclusion interactions as edges [54].

Before a network can be inferred and interpreted from resolved pairwise interactions, many potential pitfalls need to be addressed: normalization bias, similarity measure bias, and multiple testing issues [16, 54]. The normalization of abundance patterns on a per sample basis can skew the relative abundance of certain taxa: the presence of a highly abundant taxon in a few samples can cause the relative abundance of other taxa to be significantly lowered in those samples. In order to avoid this compositionality bias and to identify correlations which may have been evaluated as significant because of skewed abundances, a permutation procedure is conducted [16, 53, 54].
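
The general idea of scoring a pairwise correlation against permuted data can be sketched as follows. This is a deliberately simplified illustration and not the procedure implemented in CoNet or used in this thesis: it shuffles one abundance profile rather than permuting and renormalizing the other taxa, and the function names, the example profiles, the number of permutations and the random seed are all hypothetical choices.

```python
import random

def ranks(values):
    """1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result

def pearson(x, y):
    """Pearson's correlation coefficient (0.0 if either profile is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0

def spearman(x, y):
    """Spearman's rank correlation: Pearson's correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def permutation_p_value(x, y, n_perm=999, seed=0):
    """Fraction of shuffles whose |correlation| is at least the observed one."""
    rng = random.Random(seed)
    observed = abs(spearman(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(spearman(x, shuffled)) >= observed - 1e-12:
            hits += 1
    return (hits + 1) / (n_perm + 1)

# Hypothetical abundance profiles of three OTUs across eight samples.
otu_a = [1, 2, 3, 4, 5, 6, 7, 8]
otu_b = [2, 4, 5, 9, 10, 12, 15, 20]   # rises with otu_a: candidate co-occurrence
otu_c = [20, 15, 12, 10, 9, 5, 4, 2]   # falls as otu_a rises: candidate mutual exclusion
print(spearman(otu_a, otu_b), spearman(otu_a, otu_c))
print(permutation_p_value(otu_a, otu_b))
```

Pairs whose correlation remains significant would become edges of the network, labelled as co-occurrence or mutual exclusion according to the sign of the correlation.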
This procedure helps remove spurious correlations based on the assumption that permuting the abundance of other taxa, and thus varying their relative abundance, should not affect the predictive power of significant pairwise correlations.

Similarity measure bias is caused by the use of particular correlation measures which may resolve only specific kinds of interactions. For instance, some nonlinear associations can be detected by Spearman's correlation coefficient, a rank-based correlation, but not by Pearson's [148]. In order to maximize the variety of pairwise interactions measured, we can detect correlations using several distance and correlation measures and combine their results, as is done in the co-occurrence network inference software CoNet [53].

Multiple testing bias is due to the fact that the probability of finding spurious interactions increases with the number of tests and becomes prominent for a very large number of tests [4]. Multiple testing correction controls the number of false-positive interactions and produces a p-value for each interaction whose significance under the null hypothesis can then be assessed [54]. Finally, significant interactions are collected to build microbial community co-occurrence networks where nodes are OTUs and edges represent co-occurrences or mutual exclusions.

1.4.3 Validating network inference models

Though experimental validation is extremely challenging for uncultivated microbes [29], simulation experiments have been used both to validate network inference methods and to propose a sampling procedure that produces datasets appropriate for co-occurrence analysis. Berry and Widder simulated microbial communities using generalized Lotka-Volterra equations, calculated the resulting community composition, and inferred microbial co-occurrence networks from the produced abundance patterns [16].
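
To give a sense of what such a simulation involves, the sketch below integrates a generalized Lotka-Volterra system, dx_i/dt = x_i (r_i + Σ_j a_ij x_j), with a simple forward Euler scheme. It is not the simulation setup of Berry and Widder: the function name, growth rates, interaction matrix, step size and number of steps are hypothetical values chosen only so that the example settles to a stable community.

```python
def simulate_glv(x0, growth, interactions, dt=0.01, steps=5000):
    """Integrate dx_i/dt = x_i * (r_i + sum_j a_ij * x_j) with forward Euler."""
    x = list(x0)
    n = len(x)
    for _ in range(steps):
        rates = [
            x[i] * (growth[i] + sum(interactions[i][j] * x[j] for j in range(n)))
            for i in range(n)
        ]
        # Clamp at zero so numerical error cannot produce negative abundances.
        x = [max(xi + dt * ri, 0.0) for xi, ri in zip(x, rates)]
    return x

# Hypothetical two-taxon community: self-limitation plus mild mutual competition.
growth = [1.0, 1.0]
interactions = [[-1.0, -0.5],
                [-0.5, -1.0]]
final = simulate_glv([0.1, 0.2], growth, interactions)
relative = [xi / sum(final) for xi in final]   # community composition
print(final, relative)
```

Repeating such a simulation under different initial conditions or parameters would yield a table of abundances across simulated samples, from which a co-occurrence network could be inferred and compared with the known interaction matrix.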
By varying both experimental and ecological parameters and measuring the specificity and sensitivity of the inferred network models relative to the simulated communities, they evaluated the conditions under which network inference is an appropriate model to study inter-taxa interactions [16]. Figure 1.13 illustrates the variation in the specificity and sensitivity of the network models when varying the number of samples, the correlation measure used, and the abundance measure used for communities with 100 species [16]. Figure 1.14 demonstrates how ecological parameters can affect the performance of co-occurrence network models. In general, experimental parameters that increase the modelling performance are the use of several samples, a combination of correlation measures, and the use of compositionality bias-corrected relative abundances. In addition, sampling designs can help optimize this performance by ensuring a high species richness and low β diversity.

Finally, these simulations and further theoretical and experimental studies will help provide a standardized sampling procedure and analytical procedure for building accurate community networks that can model known and capture novel inter-taxa interactions.

Figure 1.13: The effect of experimental parameters on co-occurrence network modelling performance on simulated communities, measured using the sensitivity and specificity of the networks.
The experimental parameters varied were A) the number of samples, B) the correlation measure used (MI = mutual information score), and C) the use of absolute abundance (AA), relative abundance (RA), or SparCC-corrected relative abundance (RA-corrected) data, compared for communities with uniformly- or log-normally-distributed species abundances (from © Berry and Widder (2014)).

1.4.4 Current applications of microbial co-occurrence networks

Just as ecological quantitative methods model ecosystems and their relation to their environment, graph theory measures have been applied to microbial co-occurrence models to evaluate the community's inter-connected structure and its relation to its environment. In particular, network analysis methods have been adapted from foodweb studies where interactions between species model trophic relations [89, 105, 138, 150]. The graph theory measures presented in Section 1.1.2 that evaluate global topological structures have been applied to co-occurrence networks; however, interpreting these findings ecologically remains difficult.

Figure 1.14: The effect of ecological properties on co-occurrence network modelling performance on simulated communities, measured using the sensitivity and specificity of the networks. The parameters varied were A) the species richness, B) the species evenness, defined as Shannon's diversity normalized by the maximum entropy of a community, log(q), and C) the β diversity of sampled sites (from © Berry and Widder (2014)).

Node-based measures such as centrality measures have been used to identify keystone species: taxa whose presence in the community is essential to its proper functioning [16, 105, 138]. The current literature which applies methods from macroecology to study keystone microorganisms has not settled on a methodological procedure, in particular which centrality measure to use, to identify keystone species in uncultivated communities through co-occurrence network analysis.
However, network robustness analysis is a method which has shown much promise in finding an appropriate centrality measure to identify OTUs with interesting topological positions in the network that may be keystone OTUs. Network robustness is measured by iteratively removing nodes and evaluating the structure of the network resulting from this removal. The resilience of foodwebs to species extinction is evaluated this way by measuring the number of secondary extinctions from the iterative removal of species in the foodweb [45, 48, 107, 122, 134]. Network robustness simulations have been applied to microbial co-occurrence networks with promising results. For example, network analysis of gut microbiome data has revealed that healthy subjects have more robust networks than diseased subjects [116]. In another co-occurrence network study, robustness simulations were conducted by removing OTUs in order of decreasing centrality to attempt to identify key microbial genera in soils from natural forests and agricultural plantations [105, 138, 150].

Though community resilience to species extinction can be motivated ecologically in microbial communities, there is a lack of evidence to support the use of network robustness simulations on co-occurrence networks to test community resilience to environmental changes. This motivation gap reflects the fact that to date no studies have applied network inference methods to study and compare communities from disturbed and natural environments. However, there is evidence that the community structure captured through network inference may reflect the presence of disturbance: bacterioplankton communities from a freshwater lake demonstrated an increase in species richness and network connectance in the spring compared to the summer and fall [89].
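As an illustration of the node-removal procedure described above, the following Python sketch removes nodes in order of decreasing degree and records the fraction of the original nodes that remain in the largest connected component after each removal. The choice of degree as the centrality measure, the largest-component fraction as the structural measure, and the function names `largest_component_size` and `robustness_curve` are assumptions made for this sketch; the studies cited above use a variety of centrality and structural measures.

```python
from collections import deque


def largest_component_size(adj, removed):
    """Size of the largest connected component among nodes not in `removed`."""
    seen = set(removed)
    best = 0
    for start in adj:
        if start in seen:
            continue
        # Breadth-first search from each unvisited, non-removed node.
        seen.add(start)
        queue = deque([start])
        size = 0
        while queue:
            node = queue.popleft()
            size += 1
            for neighbour in adj[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        best = max(best, size)
    return best


def robustness_curve(edges):
    """Remove nodes by decreasing degree and track network fragmentation.

    Degree is recomputed among the remaining nodes before each removal
    (an assumption of this sketch). Returns a list whose i-th entry is the
    fraction of the original nodes in the largest component after i removals.
    """
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    n = len(adj)
    removed = set()
    curve = [largest_component_size(adj, removed) / n]
    while len(removed) < n:
        # Ties are broken by node name so that the result is reproducible.
        target = max(
            (node for node in adj if node not in removed),
            key=lambda node: (len(adj[node] - removed), str(node)),
        )
        removed.add(target)
        curve.append(largest_component_size(adj, removed) / n)
    return curve
```

For a hypothetical star-shaped network, `robustness_curve([("hub", "a"), ("hub", "b"), ("hub", "c")])` returns `[1.0, 0.25, 0.25, 0.25, 0.0]`: removing the hub first immediately fragments the network into isolated nodes.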
Since increased topological connectance typically indicates an increase in network robustness [45, 85], and since the communities from the spring have endured the harsh conditions of the winter [89], the results of this study suggest that the structure of a microbial co-occurrence network may reflect the impact of environmental pressures on community composition. Such impacts may therefore be measured through network robustness simulations. Further evidence from co-occurrence network studies that environmental changes impact microbial community structure would validate the use of network robustness analysis, which would in turn motivate the identification of keystone species using centrality measures.

Figure 1.15: Co-occurrence network visualization and properties for a decade-long time series of bacterioplankton communities in Lake Mendota in the United States of America. From spring to autumn, the diversity and richness of the community increased while the complexity of the inferred network decreased (from ©Kara et al. (2013)).

1.5 Research questions

The following research questions have driven the development and application of the methods and analytical procedures presented in Chapter 2 and Chapter 3.

1. What visualization design and interactive features are appropriate for exploring biological networks such as microbial co-occurrence networks?

2. What kind of patterns, including associations between network properties and community biotic and abiotic factors, can be resolved in microbial co-occurrence networks? What structural or functional features do these associations imply? At what scales, global or local (i.e. taxonomic), do these patterns and associations occur and are they conserved across different geographical locations or environments?

3. How can the exploration of co-occurrence networks enable the identification of global and local community structures such as microbial keystone species?
What is the inferred or hypothesized relationship between the presence of these structures and the community response to change in environmental factors?

4. Do the characterized global and local patterns resolve both the conserved ecological principles that are shaping the community and the functional roles of the community structures, such as key microbial species, that are driving it?

1.6 Research overview

Understanding the organization of microbial communities is an important step towards gaining predictive power in microbial ecology. This thesis tackles this challenge by studying microbial community structure and stability across different geographic locations using SSU rRNA sequences and network analysis.

Chapter 2 describes the design of Hive Panel Explorer, a data-driven, interactive and explorative network visualization. Its design is formulated to adapt to the high dimensionality, complexity and size of biological networks. Its effectiveness in revealing known and novel topological and data association patterns is tested and demonstrated on the C. elegans connectome.

Chapter 3 attempts to provide a standard ecologically driven analysis of microbial co-occurrence networks which model the inter-connected communities from the LTSP project. In particular, the robustness of networks obtained from soil communities that have undergone different levels of organic removal will be measured to evaluate the applicability of robustness simulations on microbial co-occurrence networks and to assess the effect of disturbance on community structure. Furthermore, robustness simulations can help identify individual taxa with central positions relative to the community’s structure. The soil profile and taxonomy of these taxa will then be characterized to evaluate their role in the community.
In sum, the ensemble of network analyses conducted on the LTSP dataset has the potential to reveal patterns within and between locations and taxonomic groups and give insight into the functional roles of individual taxa.

Finally, Chapter 4 concludes with a discussion of the assumptions and shortcomings of the current visualization and analytical methods outlined in this thesis, and lays out future work and improvements to Hive Panel Explorer and the study of microbial community structure through network analysis.

Chapter 2

Hive Panel Explorer: an interactive visualization tool to explore topological and data association patterns in large networks

Networks are used in a variety of fields to relate topological structure to system dynamics and function. Network analysis is often motivated and complemented by network visualization. However, the visualization of large networks is challenging due in part to the abundance of data needing to be visually organized and the difficulty in finding a suitable network layout to resolve patterns associated with system properties. Hive plots provide a general, consistent and coherent rule-based visualization alternative to force-directed layouts and are appropriate for assessing and comparing structural patterns within and between large networks. Despite their flexibility, hive plots can be a daunting visualization technique given, for example, the great number of possible combinations of layout rules that the user must choose from. Here we present HYPE, a visualization idiom and a D3-based tool consisting of a grid of hive plots, whose design follows the visualization mantra “Overview first, zoom and filter, then details on demand”. HYPE aims to make hive plots accessible to the broader scientific community by expanding on the original design and providing a data-driven procedure to construct hive panels and explore large networks interactively.
HYPE allows the user to discover topological and data specific patterns across several dimensions simultaneously. Here, we evaluate the different features of hive panels and outline the navigation of a system through its network using HYPE’s interactive features by exploring and characterizing the structure of the C. elegans neural connectome. HYPE is available for download on Github under the GNU license: https://github.com/hallamlab/HivePanelExplorer.

2.1 Introduction

2.1.1 Network science and visualization

Network approaches are widely used to study relationships between the structure and the function of a system in the social, biological and technological sciences [118]. Networks are composed of nodes and the relationships between them, called edges. For instance, nodes represent people in social networks [162] and species in foodwebs [45, 122], while the edges represent their social and trophic interactions, respectively. Modelling the relationships of a system, such as a social or ecological community, using networks can describe and characterize the system’s structure and dynamics [68, 117]. Network measures such as degree distributions, modularity and connectance can help relate elements of structure to notions of function in the system [68, 117, 118]. Several review papers provide an overview of different network analysis and graph theory measures [117, 118, 120], some of which are briefly described here and illustrated in Figure 1.1.

Visualization idioms are developed to accomplish different visualization tasks [115, 147]. Current network visualizations are designed to present, summarize, annotate, illustrate, investigate or explore the system modelled by the network (Figure 1.5). From force-directed layouts to circular layouts, each design offers insights into different structural elements of a network and the system it represents by highlighting different topological features.
For example, force-directed layouts give relative positions to the nodes encoded as circular marks and edges encoded as links (Figure 2.1A) to show paths (a set of successive edges connecting two nodes) and modules (a subnetwork whose nodes are more densely connected to each other than to the rest of the network) [64, 115]. In contrast, adjacency matrices are appropriate visualizations for presenting modules and cliques (a fully interconnected group of nodes) [56, 115]. Many other network visualizations exist, such as circular layouts [112] and spectral layouts [65]. These visualizations can help researchers infer system structure and dynamics through network properties. For example, modules can be characterized to assess the biological and functional properties of a group of interacting proteins in protein-protein interaction networks in the context of interactome research [87, 88, 102]. Similarly, cliques are identified and analyzed to gain insight into tightly knit social circles in the context of social networks [162]. Network visualizations such as force-directed layouts and adjacency matrices were designed to present such patterns in the network [65, 115, 147] but they are limited in the types and number of patterns they resolve.

In order to go beyond presenting known patterns and reveal new ones, visualizations need to be designed to permit an exploration of the system. The discovery of topological features and patterns can help drive a quantitative analysis of the network and formulate hypotheses on a system’s structure and function [118, 159]. The discovery process can be accomplished using visualizations that have been designed for data exploration [124–126, 131, 147].
However, the pitfalls of these visualization schemes demonstrate that current network visualizations are not designed to facilitate the exploration and discovery of novel global and local patterns in complex systems [96, 115]. Here, we provide an overview of current visualization pitfalls, introduce hive plots as a flexible and versatile network layout and describe how to expand the potential of hive plots to build a data-driven interactive visualization for network exploration.

2.1.2 Current network visualization pitfalls

As models of complex systems, networks come in all shapes and sizes, from small networks with a few to a hundred nodes to large networks with hundreds to thousands of nodes, with each node or edge having a virtually unlimited number of data properties. When scaling up to large networks and to higher levels of multivariate data complexity, effective network visualization becomes an increasingly challenging task. A suitable visualization must be flexible enough to enable the discovery of both global and local patterns across node and edge properties.

Current approaches do not scale effectively when rendering large networks. Specifically, the amount of data displayed obscures its interpretation, particularly in the case of force-directed layouts, which suffer from data occlusion and a high likelihood of pattern misinterpretation [96, 115]. Furthermore, overlaying additional information about the system by colouring or varying the sizes of nodes and edges according to their data properties is often impossible without further cluttering the display. Beyond scaling, additional visualization pitfalls have been demonstrated. For example, when using hierarchical layouts, finding the location of a deleted node by comparing two visualizations of the network, with and without the missing node, can be a difficult task [96]. In the case of algorithm-based visualizations, these problems stem from the absence of a coordinate system for node positioning.
For instance, in force-directed layouts node positions are simulated based on the arrangement of edges connecting them and node positions can change from one simulation to another. These variations can cause inconsistencies between two layouts of the same network, which can lead to different interpretations of the layout. In addition, since the layout is built using a heuristic algorithm, this visualization is not suitable for the task of visually comparing different networks. These issues make it difficult to explore and interpret network visualizations. Finally, the shortcomings of current network visualizations include a lack of generality (applicability to different types of networks), a lack of flexibility (support for different purposes), and an inability to complement other displays [37]. While layouts based on heuristic algorithms are not suitable for large networks, a rule-based layout includes some of the features necessary for building an interpretable, general, flexible and comparable network visualization.

2.1.3 Hive plots

Krzywinski developed hive plots as an alternative network visualization to force-directed layouts, circular layouts, hierarchical layouts, etc. [96]. Hive plots are a rule-based layout that attempts to provide a coherent and interpretable network visualization [96] while leaving several visualization channels, such as colour, size and rule choice, free to encode additional data properties. The hive plot design scales to large networks as it handles visual occlusion and other potential visualization design pitfalls by positioning the nodes using a coordinate system [96].

The data properties and network measures of the nodes and edges are used to organize them in the coordinate system. A link is drawn between two circular marks if the nodes represented by these marks have an edge connecting them in the network (Figure 2.2). The nodes are placed onto circularly arranged axes and edges are drawn between nodes using Bezier curves (Figure 2.2).
The user defines i) a rule designating a node’s axis assignment and ii) a rule designating a node’s position along an axis, as illustrated in Figure 2.2. These two rules are chosen using node properties. While axes are used to group nodes with similar properties, node positioning along the axis distributes nodes according to node property values. This coordinate system has many similarities with parallel coordinate plots [84, 96], except that in hive plots the axes are organized in a circular fashion and the nodes are not represented on all axes but are each assigned to a particular axis.

Any node property indicating the node’s position in the network (e.g. degree, clustering coefficient, betweenness value, etc.) or the node’s role in the system (e.g. gender in social networks, expression value in gene networks, protein family in protein-protein networks, neuron cell type in connectomes, species diet in foodwebs, etc.) can be used to construct rules designating axis assignment. A visual overview of node network measures is presented in Figure 1.1. In the social network displayed in Figure 2.1, the individuals are assigned to the hive plot’s axes according to their gender (boy, girl or alien) and are positioned along their respective axis by the number of relationships they have in the network, i.e. their degree. This layout allows the viewer to investigate relationships between gender and the degree of an individual in the social community and quickly answer questions such as “Is the individual with the most relationships a boy, girl or alien?”. The hive plot in Figure 2.1 shows that aliens generally have a high degree (more social relationships than boys and girls) and that the individual with the highest degree is an alien called Zans (pattern 1). An even more striking pattern is that boys and girls share no connections (pattern 2) and that all boys who share relationships are enemies (pattern 3).
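The two layout rules can be sketched in a few lines of Python. This is a minimal illustration rather than HYPE’s implementation: it assumes a categorical axis assignment rule, a linear degree-based position rule, evenly spaced axes with the first axis pointing up, and a hypothetical function name `hive_coordinates`. The toy network in the usage note is invented and is not the network of Figure 2.1.

```python
import math


def hive_coordinates(nodes, edges, axis_of, inner=0.2, outer=1.0):
    """Return {node: (x, y)} for a simple hive plot layout.

    Rule i)  axis assignment: each node goes on the axis of its category
             (`axis_of` maps node -> category).
    Rule ii) axis position: distance from the centre grows linearly with
             the node's degree, from `inner` to `outer`.
    """
    degree = {node: 0 for node in nodes}
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1

    # One axis per category, spaced evenly around the circle.
    categories = sorted({axis_of[node] for node in nodes})
    angle = {
        category: 2 * math.pi * i / len(categories)
        for i, category in enumerate(categories)
    }

    max_degree = max(degree.values()) or 1  # avoid dividing by zero
    coords = {}
    for node in nodes:
        radius = inner + (outer - inner) * degree[node] / max_degree
        theta = angle[axis_of[node]]
        # sin/cos are swapped so that the first axis points straight up.
        coords[node] = (radius * math.sin(theta), radius * math.cos(theta))
    return coords
```

With the hypothetical input `nodes = ["A1", "B1", "B2", "G1"]`, `axis_of = {"A1": "alien", "B1": "boy", "B2": "boy", "G1": "girl"}` and `edges = [("A1", "B1"), ("A1", "B2"), ("A1", "G1")]`, node A1 has the highest degree and is placed at the outer end of the first axis, while the three other nodes sit closer to the centre of their own axes.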
Resolving the same patterns in the force-directed layout would involve searching the layout and counting degrees and edge types. Thus, Figure 2.1 demonstrates that certain patterns are often more difficult to discern using the force-directed layout compared to a hive plot, even for such a small network.

Difficulty in pattern resolution increases with network size and system complexity. Hive plots facilitate pattern resolution by projecting the network using node properties to create a consistent, interpretable and flexible layout that is appropriate for interactive network exploration.

2.1.4 Hive Panel Explorer

Despite the fact that hive plots were demonstrated to satisfy the requirements for an effective aesthetic layout (i.e. are general, flexible, reproducible, comparable, etc.) [37] and to be an appropriate visualization for networks [96], they have been criticized for their lack of accessibility. Specifically, though the coordinate system ensures that the plot can be interpreted, users have had difficulties understanding how to harness the versatility of hive plots for their particular networks and research questions. HYPE attempts to address this criticism by providing the user with a data-driven approach to constructing hive plots and interactive methods to explore them.

Hive plots are appropriate for visualizing networks of all types: weighted or unweighted, directed or undirected, complete or not. Several hive plots have been used in the scientific literature to describe gene expression networks [130], splicing patterns in RNA sequencing data [172], and neural connectomes [153]. In each case, the users have chosen the rules to assign and position nodes from a large set of combinations to present the dimensions of the system they were interested in. The layout’s flexibility and adaptability empower the user to explore both network specific and data specific properties.
At the same time the user must determine optimal assignment and positioning rules to produce a single hive plot, potentially discouraging the use of different combinations.

Figure 2.1: A comparison of a force-directed layout and a hive plot of a social network. A) Aliens, boys and girls with relationships are laid out to minimize node overlap and edge crossing. B) A hive plot of the same social network shows 3 friendship patterns. Friends were grouped onto axes by gender and were positioned along axes by degree.

We have developed HYPE to produce a matrix layout of multiple hive plots called a “hive panel” [96]. This design circumvents the need to determine the optimal set of rules and enables the user to explore multiple combinations of network measures and data dimensions. HYPE also provides different data transformations to scale the layout of nodes and edges according to the attribute values used to place them. We argue that visualizing and exploring multiple projections of the network in a coordinated layout facilitates the discovery of topological and association patterns in the network. Interactive features such as colouring, highlighting, look-ups and selective filtering enable the exploration of the system. Additional colour encoding of the circular marks and links, representing the nodes and edges respectively, draws attention to certain nodes or edges to both facilitate the network exploration and produce publication-ready figures.
To our knowledge HYPE is the first idiom and tool to utilize hive panels and to provide interactive features as an integral part of its design to visually present and explore networks.

Here we rationalize HYPE’s visual idiom as an appropriate visualization for network exploration using a general visualization paradigm [115] and we motivate its interactive features using Shneiderman’s visualization mantra and associated methodology [147]. We then provide guidelines for building informative panels, exploring them, and finding known and novel patterns in the network. Finally, we benchmark HYPE on data from the C. elegans connectome to demonstrate its capacity to find meaningful relationships in model systems.

2.2 Methods

2.2.1 Visualization idiom and design

HYPE’s design is based on a matrix of hive plots and was developed with networks in mind, though any dataset with relationships between data items can be appropriately visualized in a hive panel. Table 2.1 outlines the general design of HYPE according to a general visual paradigm and language [115] and describes data transformation, visual network encoding, visualization tasks, and the size of the dataset it is appropriate for.

Hive plots were designed to permit the organization of nodes according to node properties. From here on we denote an attribute as either a network property such as degree (the number of edges a node has) or a node property that is inherited from associated multivariate data. These attributes can be categorical (e.g. source or sink node in directed networks, neuron cell type in connectomes, etc.) or quantitative (e.g. a node’s clustering coefficient, the age of an individual in a social network, etc.). Using these attributes, the nodes can be grouped and sorted before they are organized on the axes.

Typically, a layout of 3 axes is chosen to allow for edges between nodes on any axis to be drawn without crossing over another axis.
When using a layout with 4 or 5 axes, HYPE doesn’t draw edges between non-neighbouring axes as these curves would affect the interpretability of the visualization. In order to view edges between nodes that have been assigned to the same axis, a mirror image of each axis is produced with edges drawn between reflected nodes. Figure 2.2 presents a skeleton of this rule-based visualization with the two layout schemes: single axes (Figure 2.2A) and doubled axes (Figure 2.2B).

Each hive plot visually encodes node attributes and thus by comparing multiple hive plots one can assess associations between pairs of node attributes simultaneously. To facilitate this comparison we construct a hive panel, a set of multiple hive plots organized on a grid. This visualization then becomes a matrix type layout where each column denotes the use of a particular axis assignment rule and each row denotes the use of a particular axis positioning rule (Figure 2.5). Since multiple node attributes are used as layout rules, different visual projections of the network onto the axes are presented and compared.

2.2.2 Designing a hive panel

HYPE enables users to construct hive plots and panels using a data-driven approach. Here we present the different features of HYPE as well as how to best utilize them depending on the type of system properties being explored.

Choosing assignment and position rules.

In a multivariate dataset, nodes can have numerous categorical and quantitative properties that can in turn be used as one of the two plotting rules. Certain node properties are more suitable as axis assignment rules than axis position rules and vice versa, as summarized in Table 2.1.
For example, a categorical attribute with three categories is most suitable to group the nodes by their attribute onto the 3 axes. However, categorical attributes with several categories (greater than 3) can be used as axis positioning rules, in which case they are ordered alphabetically and positioned accordingly along the axis (Figure 2.3B). On the other hand, quantitative attributes can be used for either rule: while an axis assignment rule organizes the nodes into low, medium and high values of this attribute (Figure 2.3A), an axis position rule shows the distribution of nodes given this attribute in more detail (Figure 2.3).

Table 2.1: HYPE’s visual design idiom

Idiom: Hive Panel Explorer

What? The data:
    Networks where nodes and edges have attributes which can include calculated network properties (ordinal, quantitative and categorical properties)

What? Deriving the data:
    1. Calculate network properties (i.e. degree)
    2. Organize nodes by desired attributes into groups
    3. Normalize/scale node attributes

How? Encoding the data (data attribute: mark or channel):
    Nodes: circle
    Edges: link
    Node attributes (mostly quantitative): position on axis
    Node attributes (any type of attribute): grouping on axis
    Node or edge attribute: colour, visibility

Why? Visualization tasks (action: targets):
    Present and summarize: distribution of node properties
    Discover: topology, outliers and patterns
    Explore: characteristics of topology, outliers, and patterns
    Compare: the position of grouped nodes and edges in different hive panels of one network; topologies and distribution of node properties between two networks

Scalability:
    Nodes: dozens to thousands
    Edges: hundreds to a few tens of thousands

Figure 2.2: A schematic layout of single and double axis hive plots. A) Nodes are grouped onto and positioned along three axes. B) Double axes can be used to view edges between nodes grouped on the same axes.
Using these rules of thumb, a panel can rapidly be constructed. If an attribute is not suitable for either plotting rule, or if the user wishes to compare this attribute across all hive plots, then it can be used to colour the nodes. Taken together, any node attribute can be presented in three ways: as an axis assignment rule, as an axis positioning rule or as a colouring rule.

Data transformations through partitions and scales.

In order to adapt to different distributions of quantitative node attributes, the attribute’s values can be mapped to plotting positions using a linear, even or logarithmic partition for axis assignment and a linear, rank or logarithmic scale for axis positioning (Figure 2.3A and 2.3B). If unspecified, a linear partition or scale is used for either plotting rule. An even partition determines the cut-offs to evenly distribute and assign nodes to an axis so that all axes contain the same number of nodes (give or take one node) (Figure 2.3A). Rank-based scales are used for axis positioning to ensure that no two nodes overlap, even if they have the same attribute value: if “degree” is the axis position attribute and three nodes have a degree of 5, then they are placed in succession and in an arbitrary order along the axis (Figure 2.3B). A logarithmic partition or scale is best suited for attributes that are distributed exponentially. If a user is interested in displaying all node attribute values without overlap, then a rank scale or an even partition should be used. On the other hand, a linear or logarithmic partition or scale is appropriate for showing outliers, i.e. nodes whose attribute values differ drastically from the other nodes’ values. The choice of different partitions or scales allows the user to construct panels in a data-driven manner.

Known network topologies are best displayed using specific data transformations.
A linear partition or scale is appropriate when plotting nodes by degree in networks with a binomial degree distribution, such as random networks. A log scale is better suited for networks with an exponential degree distribution, such as power-law networks. Figure 2.3C shows the degree distributions of these two common network types and the optimal choice of partition and scale for drawing these networks in hive plots is illustrated in Figure 2.3D.

Figure 2.3: An overview of the possible partitions and scales driving node assignment and positioning. A) A schematic of the four possible axis assignment partitions. B) A schematic of the four possible axis positioning scales. C) The degree distributions of a random network and a power-law network with the same number of nodes and edges. D) Hive panels of the random and power-law networks showing the efficacy of different partitions and scales used to organize nodes based on their degree.
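The partitions and scales described in this section can be approximated with the short Python sketch below. It is an illustration under stated assumptions, not HYPE’s code: the function names are hypothetical, axes are numbered from 0, the linear and logarithmic partitions use equal-width bins between the minimum and maximum value, and the even partition splits tied values arbitrarily, which may differ from how HYPE chooses its cut-offs.

```python
import math


def linear_partition(values, n_axes=3):
    """Axis index per value, using equal-width bins between min and max."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_axes or 1.0  # all values equal: single bin
    return [min(int((v - lo) / width), n_axes - 1) for v in values]


def log_partition(values, n_axes=3):
    """Equal-width bins on log-transformed values (requires values > 0)."""
    return linear_partition([math.log(v) for v in values], n_axes)


def even_partition(values, n_axes=3):
    """Axis index per value so that axis sizes differ by at most one node."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    axes = [0] * len(values)
    for rank, i in enumerate(order):
        axes[i] = rank * n_axes // len(values)
    return axes


def rank_scale(values):
    """Position in [0, 1] by rank, so nodes with tied values never overlap."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    positions = [0.0] * len(values)
    denominator = max(len(values) - 1, 1)
    for rank, i in enumerate(order):
        positions[i] = rank / denominator
    return positions
```

For the hypothetical degree values `[1, 2, 4, 8, 300]`, the linear partition places every node except the outlier on the first axis, whereas the logarithmic partition moves the node of degree 8 onto the middle axis. Three nodes sharing a degree of 5 receive distinct positions under `rank_scale`.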
Determining the appropriate plotting method can both facilitate the interpretation of node positions and avoid overlapping circular marks and links. When used appropriately, axis assignment partitions and position scales help maximize the total visual space occupied by the circular marks.

Once assignment and position rules have been chosen along with their partition schemes and scales, the user obtains a hive panel with each hive plot providing a different visual projection of the system. Though performing successive refinements of the initial layout can be tempting, preliminary exploration of the network with the first constructed panel is encouraged to guide future iterations.

2.2.3 Navigating a hive panel

The layout and interactive features of HYPE allow the user to explore their data by following Shneiderman’s visualization exploration mantra “Overview first, zoom and filter, then details on demand” [147]. The organization of the layout fulfills the first two components by creating an overview of the network and allowing the user to “zoom in” on a subset of the hive plots. Once the layout is selected, there are five ways to interact with the system using HYPE: colouring, highlighting, searching, selecting and filtering. Colouring and filtering effectively increase and decrease the visual salience of a node or edge, respectively. A user can then visualize “details on demand” in three ways: searching for or highlighting a single node or edge, and selecting multiple nodes or edges. The organization of HYPE’s interface and features is illustrated in Figure 2.4.

Overview. A large panel size such as 4x4 hive plots provides an overview of the network: the 16 unique hive plots present different visual projections of the system against different node attributes.
The user can quickly assess global patterns in the network, such as particular degree distributions apparent from the layout of nodes and edges in a Degree (linear) by Degree (linear) or a Degree (log) by Degree (log) hive plot (Figure 2.3D).

Figure 2.4: An overview of HYPE’s interface.

Zooming. Starting with a large panel, one can reduce the amount of data shown by decreasing the number of hive plots. Since the size of the visualization area doesn’t decrease, one effectively zooms in on the remaining hive plots, which become larger in size.

Colouring. Nodes and edges can be given visual emphasis by colouring them according to their data or network attributes. Simple equality and inequality expressions, such as “equal to”, “greater than”, and “less than”, are used to colour nodes and edges by quantitative attributes (Figure 2.4). The position of the coloured nodes or edges in each hive plot can then be compared. Colouring is especially useful to show or compare the position of particular groups of nodes in each hive plot by assigning a colour to each group. This feature thus permits additional comparisons across network and data properties.

Filtering. While colouring can be used to draw attention to nodes and edges, filtering can be used to hide nodes and edges that are not of interest and that may be cluttering the display by sharing positions in the layout with other nodes. Hiding these nodes and edges may facilitate the resolution of hidden patterns. Given their attributes, nodes and edges can be filtered using two modes: “keep” (hide all but the selected nodes or edges) and “hide” (hide the selected nodes or edges).
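The rule-based colouring and filtering described above can be sketched as follows. The sketch assumes node attributes stored in a dictionary and uses hypothetical names (`COMPARATORS`, `matching`, `apply_filter`); it illustrates the “keep” and “hide” modes and the hiding of edges attached to filtered nodes, and is not taken from HYPE’s source code.

```python
import operator

# The three comparison expressions named in the text.
COMPARATORS = {
    "equal to": operator.eq,
    "greater than": operator.gt,
    "less than": operator.lt,
}


def matching(nodes, attribute, comparator, value):
    """Names of the nodes whose attribute satisfies the rule."""
    test = COMPARATORS[comparator]
    return {
        name for name, attrs in nodes.items() if test(attrs[attribute], value)
    }


def apply_filter(nodes, edges, attribute, comparator, value, mode="keep"):
    """Return the visible nodes and edges after a 'keep' or 'hide' rule."""
    selected = matching(nodes, attribute, comparator, value)
    if mode == "keep":
        visible = selected  # hide all but the selected nodes
    elif mode == "hide":
        visible = set(nodes) - selected  # hide the selected nodes
    else:
        raise ValueError("mode must be 'keep' or 'hide'")
    # An edge stays visible only if both of its end nodes are visible.
    visible_edges = [(u, v) for u, v in edges if u in visible and v in visible]
    return visible, visible_edges
```

A colouring rule can reuse `matching` to obtain the set of nodes that should receive a given colour, while `apply_filter` returns what remains on screen after a filtering rule is applied.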
When nodes are filtered, the links encoding their edges are also hidden.

When nodes or edges are coloured or filtered, the number of objects and the action chosen are shown in the reveal box in the top right of the interface. This information helps the user assess how many nodes or edges have the selected attribute values and have received a particular visual encoding. However, it is important to note that in a double axis hive plot certain nodes and edges are drawn twice, and in a single axis hive plot certain edges are not drawn at all. In other words, the number of nodes or edges coloured or filtered using the colouring or filtering rules may not equal the number of circular and link marks that have been visually encoded. The reveal box reports the number of selected nodes or edges independently of their visual encoding.

Highlighting. Hovering over a node or edge reveals the name of the mark and the values of its attributes used for the layout rules in a tooltip window (Figure 2.4). This feature allows for rapid identification of nodes or edges given their position in a plot. In particular, highlighting can be used to identify outlier nodes and edges.

Selecting. Clicking on a node or edge causes each instance of the corresponding visual mark to increase in size and opacity. This selection creates a “pop out” effect in the entire panel (Figure 2.4), allowing the user to compare the position of a node or edge in different hive plots. In addition to being visually revealed in the panel, the selected node’s or edge’s attribute values are shown in the reveal box (Figure 2.4). Several circular marks and links can be clicked successively to select multiple nodes and edges. Once one of the selected marks is clicked on again, all of the selected marks become “unselected” and return to their normal size and opacity.

Searching. Nodes and edges can also be “searched” using the search box in the top right corner of the interface.
Identified nodes will be selected and “popped out” in the same way as when clicking on the node (Figure 2.4). Searching is useful when the user is interested in locating a particular node whose position in the hive plots is not known.

The application of these seven features to present a network, explore its structure and generate hypotheses is demonstrated on the C. elegans connectome, a well characterized and studied neural network.

2.2.4 HyPE as a web tool

The HYPE tool was built using D3 [20, 21], JavaScript [133] and Python [142]. D3 was used for building and rendering the data-driven interactive graphics. D3 was an ideal candidate to build HYPE because it produces scalable, interactive, data-driven and web-based graphics [20]. Mike Bostock’s hive plot plug-in was used to generate the positioning of the nodes and edges and its license is included in HYPE’s documentation. All scripts, a wiki and tutorials are available on GitHub under the GNU license: https://github.com/hallamlab/HivePanelExplorer.

HYPE takes as input a tabular file in .csv format to facilitate the addition of node and edge properties. While many file formats have been developed to encode networks, these can easily be converted to .csv files using export functions in software such as Gephi [15] and Cytoscape [146].

Once the hive panel is explored and patterns are identified, users may want to use the panel to present their findings. The export functionality allows users to obtain publication-ready figures in SVG format for further editing in vector graphics software. In addition, the set of colouring and filtering rules applied can be exported in text format to help users keep track of the visual encodings they used while permitting figure reproducibility.

2.3 Results: the structure of the C. elegans connectome

2.3.1 The system

C. elegans is a model organism whose nervous system can be serially reconstructed using imaging technologies to provide a comprehensive wiring diagram of neuronal connectivity over developmental time [25, 164, 166]. Moreover, the anatomical location, the developmental history and the functional role of all neurons within the nervous system are recorded in public databases such as the Worm Atlas [8]. The C. elegans nervous system can be modelled with nodes representing neurons and edges representing synaptic connections. The resulting network is commonly called a connectome [25]. The C. elegans connectome is a logical and informative connectomics model that has been extensively studied using both experimental and quantitative modelling approaches [23, 71, 72, 159, 161, 163, 165].

Several studies have applied network visualization and graph theory measures to analyze different properties of the connectome, including wiring efficiency and cost [34], the relation between connectivity and neural development [159], the rich club [157] and small world structure [163]. Here we use HYPE to explore the C. elegans connectome. We demonstrate the application of HYPE’s data-driven design and interactive features to reveal both known and novel properties of the nervous system.

2.3.2 The network

The connectome studied here is that of a hermaphrodite worm with 279 somatic neurons and 2,287 synaptic connections. The initial construction, successive refinements and limitations of this dataset were previously described by [159] and several studies have analyzed the resulting network [157, 159]. The nodes (neurons) and edges (synaptic connections) have categorical and quantitative attributes. Notably, the neurons’ location along the anterior-posterior axis (head to tail) of the worm has been measured. Neurons come in three types: motor neurons connect muscular cells to the nervous system, sensory neurons connect sensory cells to the nervous system, and interneurons connect two neuronal cells.
Synapses come in two types, chemical or electrical, which correspond to the signal being transmitted using either neurotransmitters or an electric potential, respectively [13, 25, 164, 166]. The direction of individual synaptic connections was not encoded in this visualization. The connectome is a connected network (all pairs of nodes are connected by some finite path) and has a small world structure (a combination of a high average clustering coefficient and low average path length) [159, 163].

2.3.3 Constructing the hive panel

The hive panel in Figure 2.5 was constructed using combinations of the neuronal attributes described above and different network properties as axis assignment and positioning rules. Since there are exactly three types of neurons (motor, sensory and inter), neuronal cell type can be used to group neurons onto axes as an axis assignment rule. Somatic position is a quantitative attribute that can be investigated using an axis position rule. Other node attributes, such as cell class and associated neurotransmitter, are categorical attributes with several possible values and thus are best investigated using colouring and filtering rules once the panel is built.

In previous studies, the C. elegans connectome was shown to be a scale-free network with a power-law degree distribution [159, 163]. We therefore opted for a logarithmic scale to position the nodes by degree. This attribute will allow us to evaluate a possible relation between the number of connections of a neuron and its role in the connectome. To compare the degree of nodes to other attributes that will also be used as position rules, we select degree as both an axis position and an axis assignment rule. We use an even partition to distribute the nodes onto axes equally.

There are many other network properties we can explore. In particular, previous studies have found significant numbers of three node cliques, or triangles, in the C. elegans connectome [159, 166].
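As a rough illustration of the two degree rules chosen above (an even partition for axis assignment and a logarithmic scale for axis position), the following Python sketch assumes that an even partition means splitting the nodes into rank-ordered groups of nearly equal size, and that the logarithmic scale maps degree onto the interval [0, 1] along an axis. The function names and the degree values are hypothetical and are not taken from HYPE's code.

```python
import math


def even_partition(values, n_axes=3):
    """Assign each node to one of n_axes groups of (nearly) equal size by rank."""
    ranked = sorted(values, key=values.get)
    return {node: (rank * n_axes) // len(ranked) for rank, node in enumerate(ranked)}


def log_position(value, vmin, vmax):
    """Map a positive value to [0, 1] along an axis using a logarithmic scale."""
    return (math.log(value) - math.log(vmin)) / (math.log(vmax) - math.log(vmin))


# Hypothetical node degrees.
degrees = {"n1": 2, "n2": 3, "n3": 8, "n4": 11, "n5": 16, "n6": 93}

axes = even_partition(degrees)
lowest, highest = min(degrees.values()), max(degrees.values())
positions = {node: log_position(d, lowest, highest) for node, d in degrees.items()}
```

With a long-tailed degree distribution, the logarithmic scale spreads the many low degree nodes along the axis instead of crowding them near its origin, which is the motivation given above for using it.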
To investigate patterns relevant to neuron cell clusters, we include clustering coefficient as a plotting rule. Since we are more interested in whether the clustering coefficient is high or low than in its exact value, we select clustering coefficient as an axis assignment rule. In order to focus on nodes with a high clustering coefficient, we use a logarithmic partition.

Rich club structure [157] and wiring efficiency [34] studies have demonstrated that the C. elegans connectome has a relatively efficient structure: neurons are strategically connected and positioned along the worm’s posterior-anterior axis to minimize the wiring cost (relative to the total number of synapses) of each neuron and synapse [34, 157]. In particular, the position and connectivity of interneurons suggest that they play the role of information highways in the connectome [157, 159]. Accordingly, we expect interneurons and their connections to decrease the path length needed to connect two neurons and therefore to have high betweenness centrality values. Because we have determined three axis assignment rules (cell type, degree, clustering coefficient) and two axis position rules (somatic position and degree), we choose betweenness centrality as our third axis position rule to complete the example.

Using our knowledge of the system, we chose two system properties and three network properties to construct a 3x3 hive panel and explore the network (Figure 2.5). Unless specified, we used a linear partition or scale for the layout rules.

2.3.4 Exploring the C. elegans hive panel

In the following section, bolded key words express the use of different interactive features. Individual hive plots in Figure 2.5 are referred to by their plotting rules, in the form Assignment rule by Position rule. Before initiating exploration of the network, we colour neurons by cell type to compare their distribution in the panel.
Observed patterns are interpreted in the Discussion section.

[Figure 2.5: The C. elegans hive panel with nodes and edges representing neurons and synapses, respectively. Neurons are coloured by cell type and synapses are coloured by synapse type: electrical synapses in black and chemical synapses in grey. Panel title: C. elegans connectome panel, 279 nodes, 3225 edges. Axis position rules: Somatic position (linear), Degree (log), Betweenness (linear). Axis assignment rules: Cell type (categorical: interneuron, motor, sensory), Clustering (log: 0-0.24, 0.25-0.53, 0.54-1), Degree (even: 2-10, 11-16, 17-93).]

Overview of cell types and positions

Looking at the whole panel, one can discern the presence of edges connecting nodes from all axes in each hive plot. For instance, the Cell type by Somatic position hive plot shows that all three types of neurons share connections within and between types. If sensory neurons and motor neurons were not connected through synapses, we would not see any edges between the two axes where each type of neuron is represented.

Looking at the somatic position of the nodes by their position on the axes in the Cell type by Somatic position hive plot, we observe that sensory neurons and interneurons occur at the head, at the tail and at a few discrete positions along the posterior-anterior axis, whereas motor neurons cover the whole length of the worm. Effectively zooming in by only showing Cell type by Degree hive plot, we can get a better look at the distribution of neurons across the length of the worm. Looking at the distribution of synapses in this hive plot, we find a high density of synaptic connections between neurons located in the head and between neurons located in the tail. Furthermore, many synaptic connections between motor neurons start and end at similar somatic positions.
These observations are consistent with a study by Varshney and colleagues, who illustrated the same patterns using an adjacency matrix [159], and point to coordination between adjacent motor neurons to facilitate sinusoidal movement.

In contrast, interneurons connect to each other and to sensory or motor neurons from varying and often opposing somatic positions (i.e. head to tail and tail to head): interneuron-interneuron, interneuron-motor and interneuron-sensory synaptic connections link nodes near the center of the hive plot and nodes near the outer edge. This pattern suggests that interneurons connect physically distant neurons in the connectome. To further characterize the connectivity of interneurons and infer their role in the system, we look at another network measure: betweenness centrality.

Interneuron connectivity

A node’s betweenness centrality measures the importance of its position in the network relative to other nodes. As illustrated in Figure 1.1C, nodes whose connections reduce the paths between other nodes or whose absence would create disconnected subnetworks have a high betweenness centrality value. We can observe the centrality of neuronal cell types in the Cell type by Betweenness Centrality hive plot. We first notice that on average interneurons have higher centralities than sensory and motor neurons. Using the tooltip we can find the maximum betweenness centrality value per cell type: sensory, motor and interneurons have a maximum value of 0.028, 0.036 and 0.103, respectively. Wiring efficiency studies have found that interneurons and their synapses reduce the path length between other neurons [34, 159].
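For readers who want to see how such betweenness values arise, the sketch below computes normalised betweenness centrality for an undirected, unweighted graph with Brandes' algorithm, using only the Python standard library. It is a generic textbook version and is not claimed to be the routine used to compute the values shown in the panel; the toy graph, in which a hypothetical node "x" bridges two triangles, is invented for illustration.

```python
from collections import deque


def betweenness_centrality(adj):
    """Normalised betweenness for an undirected, unweighted graph (Brandes).

    adj: dict mapping node -> set of neighbouring nodes
    """
    score = {v: 0.0 for v in adj}
    for s in adj:
        # Breadth-first search from s, counting shortest paths (sigma).
        order = []
        preds = {v: [] for v in adj}
        sigma = {v: 0 for v in adj}
        dist = {v: -1 for v in adj}
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate path dependencies from the farthest nodes back to s.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    n = len(adj)
    # Summing over every source counts each pair twice, so divide by the
    # number of ordered pairs of other nodes, (n - 1)(n - 2).
    norm = (n - 1) * (n - 2)
    return {v: (score[v] / norm if norm else 0.0) for v in adj}


# Hypothetical graph: two triangles (a, b, c) and (d, e, f) bridged by x.
edge_list = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "x"),
             ("x", "d"), ("d", "e"), ("d", "f"), ("e", "f")]
adj = {}
for u, v in edge_list:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

centrality = betweenness_centrality(adj)
```

In this toy graph every shortest path between the two triangles passes through the bridging node, so it receives the highest betweenness even though its degree is only 2.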
The hive panel resolves this pattern by illustrating how interneurons connect physically (across the worm’s length) and topologically (across the network’s path structure) distant nodes.

Clustering coefficient and connectivity patterns

The clustering coefficient expresses the connectivity between neighbours: if all of its neighbours are connected, then a node has a clustering coefficient of 1. We can observe clustering between neuronal cell types in the Clustering coefficient by Somatic position hive plot. Here, neurons that were grouped on the axis with high values (i.e. whose neighbours are connected at a rate of over 54%) are primarily motor and sensory neurons in the body and the tail. Using the tooltip we can quickly survey their synaptic connections to low and medium clustering coefficient neurons on the other axes. If we look at the Clustering coefficient by Degree hive plot, we notice that these neurons have medium to low degree and share synapses with medium to high degree nodes. For example, using the tooltip we find that the motor neuron DB06 has a degree of 7, a clustering coefficient of 1 and is therefore part of a 7-node clique. These highly connected cliques are characteristic of a small world network [118, 163]: a significant small world coefficient implies that the path between any two nodes is relatively short despite the large number of nodes in the network.

We can colour the neurons with a clustering coefficient of 1 using alternative colours to those used for neuronal type: the reveal box indicates that we have coloured 9 neurons. These high clustering coefficient neurons are thus part of cliques whose other members have high degrees and are very connected in the network. This topological pattern suggests that these motor and sensory neurons might relay signals from sensory cells or body wall muscles to the rest of the connectome.
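The clustering coefficient referred to in this subsection can be written down compactly. The following Python sketch uses the standard local definition (the fraction of pairs of a node's neighbours that are themselves connected) on an undirected graph without multi-edges; whether the values in the panel were computed with exactly this variant is an assumption, and the toy graph is hypothetical.

```python
from itertools import combinations


def clustering_coefficient(adj, node):
    """Fraction of pairs of neighbours of `node` that are themselves linked.

    adj: dict mapping node -> set of neighbouring nodes (undirected graph)
    """
    neighbours = adj[node]
    k = len(neighbours)
    if k < 2:
        return 0.0  # fewer than two neighbours: no pair to test
    links = sum(1 for u, v in combinations(neighbours, 2) if v in adj[u])
    return links / (k * (k - 1) / 2)


# Hypothetical graph: a 4-node clique (a, b, c, d) with one extra node e on d.
adj = {
    "a": {"b", "c", "d"},
    "b": {"a", "c", "d"},
    "c": {"a", "b", "d"},
    "d": {"a", "b", "c", "e"},
    "e": {"d"},
}

cc_a = clustering_coefficient(adj, "a")
cc_d = clustering_coefficient(adj, "d")
cc_e = clustering_coefficient(adj, "e")
```

Node a sits entirely inside the clique and therefore has a coefficient of 1, whereas d, which also links to the peripheral node e, has only 3 of its 6 neighbour pairs connected.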
Evaluating the direction, strength and type of synapses between the members of the clique could further resolve this pattern.

Using filtering to partition the system and study subnetworks

To further understand differences between neuronal connectivity patterns along the anterior-posterior axis, we filter the head neurons (somatic position < 0.2, 140 nodes) and the tail neurons (somatic position > 0.65, 51 nodes). Filtering only the head neurons, we notice that all of the interneurons with degree greater than 43 are missing from the Cell type by Degree hive plot. Filtering only the tail neurons, we observe that a few low and medium degree interneurons and one high degree sensory neuron and motor neuron with high betweenness centrality are missing. We can identify these neurons using the tooltip and then selecting them. The sensory neuron is identified as PQR (degree = 54, betweenness = 0.028), which is involved in several processes including aerotaxis and social feeding [70]. The motor neuron is called DD06 (degree = 50, betweenness = 0.036) and innervates dorsal body wall muscles along with DD1-DD5 neurons [8, 177]. Filtering out both head and tail neurons, we notice that many medium degree nodes with high betweenness centrality remain. These observations suggest that high degree interneurons play important roles relaying signals at the head of the worm, and that a few medium to high degree interneurons along with one key sensory and one key motor neuron permit centralized signalling in the tail.

Characterizing synaptic connections

Next, we explore network edges by colouring them to gain insight into synaptic connections and their distribution in the network. We use two distinct colours to distinguish between electrical and chemical synapses. We notice that two bundles of electrical synapses seem to be connecting several neurons to two interneurons with high degree. Selecting these neurons reveals that they are interneurons AVAL and AVBL.
These are command neurons responsible for backward and forward locomotion, respectively [8, 177]. We can locate these nodes at the head of the worm in the Cell type by Somatic position plot and assess that they have high betweenness centrality from the Cell type by Betweenness centrality hive plot, as expected.

Next we study the weights of synapses, which simply represent the number of synaptic connections between neuron pairs. To do so, we filter out the synapses with low weights (weight ≤ 10) (Figure 2.6). Looking at the information in the reveal box we know that 51 synapses remain. In the Cell type by Somatic position and the Cell type by Degree hive plots we notice that these synapses primarily connect low to medium degree motor neurons (node degree between 2 and 21) (Figure 2.6A). To further assess the connectivity of these motor neurons, we filter out sensory neurons and interneurons and their edges, so that only heavily weighted synapses between pairs of motor neurons are revealed (Figure 2.6B). Looking at the Cell type by Degree hive plot and the Cell type by Somatic position hive plot, we observe that most connections occur between motor neurons that have a degree between 7 and 21 and are located primarily in the body of the worm.

2.4 Discussion

2.4.1 Assessing patterns and generating hypotheses

Using the different interactive features of HYPE we explored known local and global topological patterns in the C. elegans connectome. We focused our exploration on somatic position, neuronal type, degree, betweenness centrality, and clustering coefficient by setting these attributes as plotting rules. We found associations between these attributes that reflect known properties of the system such as its wiring efficiency. Quantitative analysis of these patterns in developmental time or between mutant and wild type strains can be used to evaluate the significance of these patterns and generate testable hypotheses.
For example, we found a few outlier neurons with higher betweenness centrality values than other neurons of the same cell type. To assess if these values are expected in networks with similar topologies, we can compare these values to the maximum betweenness centrality value found in simulated networks with a similar structure to the C. elegans connectome. We find that randomly generated scale-free networks with the same number of nodes as the connectome have much smaller betweenness centrality values: the interneurons AVAL and AVAR, with a betweenness centrality of 0.103 and 0.101 respectively, are much more central than expected (p < 0.05).

[Figure 2.6: A schematic of the filtering procedure used to reveal motor neurons connected by more than 10 synapses. A) From the original panel, edges representing 10 or fewer synapses are filtered out. The numerous synapses left between motor neurons are highlighted by a yellow circle. B) All interneurons and sensory neurons are filtered out. C) A closer look at the connections between these motor neurons reveals all DD neurons and a part of the dorsal muscular inhibitory circuit activated during worm locomotion. Motor neurons are coloured by cell class and the number of synapses of each edge is illustrated. Panels A and B show hive plots with position rules Somatic position (linear) and Degree (log) and assignment rules Cell type (categorical) and Clustering (log: 0-0.24, 0.25-0.53, 0.54-1).]

Whereas many connectome studies focus on circuits within the connectome, HYPE invites the user to investigate the connectivity and roles of individual neurons, as well as compare the connectivity of neurons with the same cell type.
For example, the sensory neuron PQR, which was found to have a betweenness centrality much higher than the other sensory neurons, has a connectivity pattern that suggests it plays a more central role than the other sensory neurons. The PQR neuron has been implicated in different physiological aspects and behavioral phenotypes of the worm, including oxygen sensing, innate immunity modulation, social feeding and locomotion related to feeding [70, 165]. The other individual neuron that HYPE revealed as an outlier in its connectivity pattern was the motor neuron DD6, with a betweenness centrality of 0.036. While this motor neuron shares certain characteristics with other DD motor neurons, it also has a higher degree (50 compared to DD1, which has a degree of 21) and exhibits an alternative gene expression profile [106]. The difference in connectivity pattern and biological associations of these individual neurons compared to other neurons of the same type suggests they play multi-functional roles in the system.

In our exploration we also found that heavily weighted synaptic connections tend to connect pairs of motor neurons. To put this pattern in perspective, considering the six possible ways of connecting three cell types as well as the number of neurons in each cell type, the probability of one of these 51 highly weighted synapses being between two motor neurons is about 0.1. Therefore the large proportion of weighted motor-motor neuron connections (36%, or 18 out of the 51 synapses) is significant considering the null hypothesis that the weighted synapses are distributed evenly between neuron cell types (p < 0.001).

To interpret these results, one must consider functional implications of multiple synapses between pairs of neurons.
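As a quick check of the proportion argument made for the heavily weighted synapses, the Python sketch below evaluates the upper tail of a binomial distribution with the figures quoted in the text (51 synapses, 18 of them motor-motor, null probability of about 0.1). The text does not name the statistical test that was used, so the one-sided binomial test here is an assumption, and the helper name `binomial_upper_tail` is ours.

```python
from math import comb


def binomial_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Figures quoted in the text: 18 of the 51 heavily weighted synapses link two
# motor neurons, against a null probability of about 0.1 per synapse.
p_value = binomial_upper_tail(18, 51, 0.1)
print(f"P(X >= 18 | n = 51, p = 0.1) = {p_value:.2e}")
```

Under these assumptions the tail probability falls well below 0.001, in line with the significance level reported above.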
Though there is a large variation both in the number and the physical size of synapses, the morphology of synaptic connections between pairs of neurons has been found to be related to the functional strength of the interaction between the neurons [13, 14, 86]. The motor neurons connected through multiple synapses resolved by our exploration include DD1-DD6, several VA and VB, one VC and the PDB neuron (Figure 2.6C). The motor circuits involving DD, VA and VB, found in repeating structures along the body of the worm, are responsible for propagating sinusoidal movement [177]. Specifically, these multiple synapses occur in the motor circuits responsible for inhibiting dorsal muscle contraction and innervating ventral muscle contraction [43, 177]. The complementary circuits are composed of the VD, DA and DB motor neurons that inhibit ventral muscle contraction and innervate dorsal muscle contraction [43, 177].

Knowing that both complements of these motor circuits are required for the sinusoidal movement of the worm, why do neurons in one complement have more synapses per connection than the other? To the best of our knowledge this pattern has not yet been characterized; however, biological differences between the neuron cell types involved in these circuits have been studied. First, whereas DD and VD neurons play similar roles as inhibitors of dorsal and ventral muscles respectively, the VD1-VD13 neurons develop post-embryonically whereas the DD1-DD6 neurons develop in the embryo and change their synaptic connections after the birth of VD neurons [72, 161, 164]. Second, these two motor neuron classes differ in their expression of certain genes, including a gene related to acetylcholine receptor subunits [61]. Third, DD motor neurons may regulate the amplitude of the sinusoidal movement, as suggested by UNC-25 and UNC-30 gene mutants [109].
Therefore, the differences in their connectivity pattern motivate the investigation of possible differences in the biological roles of DD and VD neurons, some of which have been presented here, and these functional differences may be associated with their role in locomotion.

Models of C. elegans locomotion propose an asymmetry in the neuromuscular system of the ventral and dorsal sides of the body to explain the initiation of locomotion from any initial worm shape [23, 90]. While one model suggests this asymmetry may be facilitated by unequal numbers of VD (13) and DD (6) neurons [90], the other model suggests it can be facilitated through a lower activation threshold for neuron firing in VD neurons [23]. The difference in the connectivity pattern of VD and DD neurons resolved here suggests that the physiology of their synaptic connections may also play a role in this asymmetry and should be included in locomotion models. Further characterization of these neurons’ synapses may reveal a relationship between the embryonic and post-embryonic development, the varying response to gene expression, and the differing roles in locomotion of the DD and VD motor neurons.

Despite the fact that the C. elegans connectome has been thoroughly characterized in previous studies of the network’s structure, we demonstrate that HYPE can reveal new local and global patterns. Specifically, we suggest a possible association between the multi-functional roles and differential connectivity of individual neurons, and between the asymmetry of locomotion models and the difference in the number of synapses between VD and DD motor neuron circuits. As observed, the patterns resolved through network exploration motivate quantitative analysis, and the results thereby produced can help formulate hypotheses relating the structure and function of the system.

2.4.2 A flexible and adaptive visualization tool

Using the C. elegans connectome we demonstrated how HYPE uses system and network properties to resolve known and novel patterns. Though the use case was an undirected weighted network, we explain how the layout rules can be adapted to a directed network by assessing source, sink, or transit roles of nodes and using this categorical attribute as a plotting rule. The partitions and scales used permit the organization of nodes given a variety of distributions of quantitative and qualitative node attributes. Moreover, additional attributes can be created to enhance the exploration of system properties. For instance, in weighted networks, a node’s average weight can be measured by averaging the weights of its edges and this property can be visually encoded. In addition, modularity analysis can be used to find subnetworks and module membership can be encoded as a node attribute. Similarly to how hive plots are compared to resolve attribute associations, subnetworks can be compared by building two different hive panels. Finally, HYPE’s versatility in its layout rules and its adaptability to different types of system properties and networks distinguish it from other network visualizations.

2.4.3 A scalable tool

HYPE was able to resolve patterns such as outliers, trends, similarities, and distributions of the nodes, edges and their attributes. While HYPE was demonstrated on a network with hundreds of nodes, we argue that this rule-based layout scales to larger networks with thousands of nodes. Though in larger networks there is an increased occurrence of overlap of nodes and edges, the proposed exploration approach based on Shneiderman’s mantra circumvents this issue [147]. Namely, the user can start with an overview of the network using a 4x4 panel, investigate the global trends in the visual signatures provided by 16 hive plots, narrow the number of attributes of interest, and zoom in on the specific hive plots using a 3x3 or 2x2 hive panel.
In this way, patterns can be resolved at all levels, from the whole network topology to subnetworks to individual nodes. In addition, the filtering rules can be used to study specific subnetworks of the network and compare different subnetworks. Therefore, HYPE’s interactive features allow the user to navigate large networks to resolve both local and global patterns in the system.

2.5 Future directions and conclusions

Despite the fact that HYPE is based on an intuitive encoding of nodes and edges using circular marks and links, navigating a network through a hive plot or a panel of hive plots requires a certain familiarity with layout rules and overall set up. For this reason, we have made available the Friends network (Figure 2.1) and the C. elegans panel (Figure 2.5) on our git repository for interested users to interact with. In our experience, users quickly become accustomed to layout rules and can then take full advantage of the tool’s features to investigate patterns of interest. As demonstrated on biological networks with resolved structures, HYPE allows users, who may or may not be experts in the system modelled, to find known and reveal novel topological and data association patterns. In the future, we envision HYPE as a web application in the cloud, with the purpose of enhancing user experience by creating a community of hive panel builders and increasing the accessibility of different network types of varying complexity. To accomplish this it will be necessary to provide embedded settings in the hive panel output and a log file for generating reproducible visualizations.

Chapter 3

Characterizing robustness and centrality in microbial co-occurrence networks from natural and disturbed soil communities

Microbial communities form distributed networks of genetic and metabolite exchange shaped by horizontal gene transfer and symbiotic interactions [40, 51, 73, 82, 169].
These networks can be reconstructed based on co-occurrence patterns and environmental sequence information [55, 93, 110, 175]. The topological properties of microbial co-occurrence networks, including centrality, connectance, and clustering, have the potential to reveal ecological design principles that ultimately drive ecosystem functions. Indeed, network centrality measures have been used to identify important components of systems such as “essential” proteins in protein-protein interaction networks [87], neurons acting as information highways in connectomes [157] and keystone species in food webs [134]. Recently, different centrality measures have been used to identify microbial keystones in marine and soil environments using co-occurrence networks [11, 12, 80, 105, 138, 139, 150, 169]. However, the lack of consistency and methodology in the selection of centrality measures limits the interpretation of the derived results in these networks [16, 54]. Adopting methods from macroecology, we provide a novel way in which to select centrality measures to identify taxa which may play structurally important roles in co-occurrence networks. Our procedure relies on robustness simulations to test network structural integrity and quantify the structural importance of taxa in co-occurrence networks. We demonstrate this approach using clustered SSU rRNA tag sequences sourced from timber harvested forest soils spanning three biogeoclimatic zones within the Long Term Soil Productivity study (LTSP).
We show that robustness analysis reflects the impact of disturbance from the structure of the networks inferred from natural and disturbed communities and that the identified central taxa may play a role in community stability and resilience.

3.1 Introduction

Genetic and metabolic exchanges have been well documented in natural and engineered ecosystems [108, 110, 113, 128, 174, 175, 178] and the absence of taxa involved in these exchanges can impact the community’s dynamics in varying amounts [46, 128]. In this way, some species play key roles by providing essential nutrients to the community or maintaining the appropriate environmental conditions [46, 108, 113, 128, 174, 178]. In macro-ecology, these functionally important species are denoted “keystone species” [134]. Evidence of keystone species is also found in microbial communities despite these being much more diverse systems than food webs [105, 128, 138]. For example, low-abundance but highly active sulfate reducers were found to mediate a major and essential biogeochemical process on which the rest of the community relies [128]. Identifying these keystone species is critical to understanding community genetic and metabolic processes integral to ecosystem functions as well as microbial community response to disturbance in a time of global climate change [145]. However, in highly diverse ecosystems such as those inhabiting soils, resolving keystone connectivity in relation to microbial community structure and function is a challenging enterprise.

High-throughput technologies such as SSU rRNA sequencing enable the characterization of community membership and bridge the cultivation gap of microorganisms [67, 141, 156]. Pairing this high resolution community composition data with network inference analysis has captured potential inter-taxa interactions [105, 129, 138, 139, 150]. Co-occurrence networks are a type of network inference model where nodes represent individual taxa and edges represent correlations between taxa.
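A deliberately naive Python sketch shows how such a network can be derived from an abundance table. It assumes Pearson correlation and a fixed threshold of 0.8, neither of which is prescribed here, and it does not address false positives or compositionality; the taxon names and abundance values are invented for illustration.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length abundance profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def cooccurrence_edges(abundance, threshold=0.8):
    """Link every pair of taxa whose |correlation| reaches the threshold."""
    edges = []
    for a, b in combinations(sorted(abundance), 2):
        r = pearson(abundance[a], abundance[b])
        if abs(r) >= threshold:
            kind = "co-occurrence" if r > 0 else "mutual exclusion"
            edges.append((a, b, kind))
    return edges


# Hypothetical abundances of four taxa across five samples.
abundance = {
    "taxon_A": [10, 20, 30, 40, 50],
    "taxon_B": [12, 19, 33, 41, 48],
    "taxon_C": [50, 40, 30, 20, 10],
    "taxon_D": [5, 30, 5, 30, 5],
}

edges = cooccurrence_edges(abundance)
```

Here taxon_A and taxon_B rise together and are linked by a positive edge, taxon_C falls as they rise and is linked by negative edges, and taxon_D is uncorrelated with the others and remains isolated.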
Co-occurrence between two taxa can be interpreted as mutualism, niche overlap, commensalism, etc., and a mutual exclusion can be interpreted as amensalism, competition, alternative niche preference, etc. [46, 54, 55]. The topological structure of microbial co-occurrence has been related to environmental properties such as seasonal disturbance in lake water communities [89], enterotypes in the human gastrointestinal tract microbiome [10], and the effect of animal feeding activity on soil communities [139]. Co-occurrence networks have also been used to first infer and then to isolate symbiotic microorganisms [46]. These studies suggest that co-occurrence networks combining individual taxonomic composition information and inter-taxa relationships can illuminate ecological design principles organizing microbial community structure and function across ecological scales [10, 54, 89, 105, 150, 169].

In microbial co-occurrence network studies, different node centrality measures have been used to identify keystone species [16, 105, 138, 139, 150, 169] based on the idea that structurally critical taxa play important ecological roles given that their removal leads to network fragmentation [46, 138, 145]. Several network centrality measures have been applied in robustness analysis: degree centrality, betweenness centrality, eigenvector centrality, closeness centrality, etc. (Section 1.1.2). For example, Lupatini and colleagues identified key microbial taxa in soils from natural forests and agricultural plantations using betweenness centrality and closeness centrality [105]. However, the centrality measures used to identify these taxa are inconsistent across studies, as is the underlying reasoning used to determine the appropriate centrality measure. For example, one study rationalizes the use of certain centrality measures by choosing those that agree in the way they rank taxa by centrality value [139]. 
Other studies pick centrality measures arbitrarily [105, 169].

In addition, the experimental design and statistical validation of the network construction methods used in these studies to obtain co-occurrence networks could be improved. For example, two of these studies use fewer than 10 samples to measure correlations [105, 139], and network inference on such low sample numbers could produce many false positive co-occurrences [16]. Favorable sample compositions such as levels of sample heterogeneity have been proposed by Berry and Widder to maximize the sensitivity and specificity of co-occurrence analysis [16]. Moreover, some of these studies do not employ statistical methods to filter false positive co-occurrences [105, 139] despite known sources of error such as compositionality bias. Nonetheless, many software tools have been developed to address compositionality bias and other pitfalls of network construction [53, 54, 63]. Therefore, combining rigorous statistical methods and proper sample compositions can increase the sensitivity and specificity of co-occurrence analysis while creating a standard for future microbial co-occurrence network studies.

Studies in other biological systems have demonstrated that different centrality measures quantify different structural features of node positions relative to the entire network [19, 62, 123]. In addition, the applicability of a centrality measure relies on the network topology and the central characteristics of interest [85]. For example, a centrality measure that captures central characteristics in a local neighbourhood of a node may not be appropriate to compare the centrality of two nodes found in distinct regions of the network. Different methods have been proposed to identify the appropriate centrality measure depending on the global topology of the network and the type of functional role played by central nodes [5, 45, 85]. 
Iyer and colleagues demonstrated that robustness simulations can identify structurally important nodes by measuring the integrity of the network’s structure against the removal of nodes ranked by centrality measures. In this way, the centrality measure whose ranking most decreases the structural integrity of the network is the measure that identifies the nodes required to preserve network structure. It is reasonable to assume that if a network has been sufficiently fragmented by node removal, then a functional process of the system modelled by the network will be less effective in the fragmented network [5, 85]. In food web studies, where nodes are species and edges are trophic interactions, network robustness simulations and quantitative measures of robustness have been shown to quantify ecosystem stability to species extinction [45]. Adopting robustness analysis methods and applying them to microbial co-occurrence networks has the potential to develop more rigorous selection of centrality measures, assess community stability and identify keystone taxa.

Here, we use network inference and robustness simulations to identify central taxa in microbial communities in timber harvested soil from the Long Term Soil Productivity (LTSP) study. Recent efforts to measure microbial community responses to perturbation in the LTSP sites have resulted in an archive of SSU rRNA samples spanning biogeoclimatic ecozones sampled at different soil horizons and in locations with varying levels of timber harvesting [76]. Therefore, the resolved co-occurrence networks from the LTSP study represent variable topologies influenced by ecozone, soil properties and timber harvesting treatment. Given that soil communities are highly diverse, they provide a real world use case to benchmark network robustness analysis in microbial ecology that is extensible to less complex communities.

In the process we ask the following questions:

1. What centrality measures driving robustness simulations are co-occurrence networks least robust to?
2. Are co-occurrence networks consistently more robust to the same centrality measure driving node removal?
3. How do the networks inferred from natural and disturbed communities differ in their topology?
4. How do the networks inferred from natural and disturbed communities differ in their robustness?
5. How do the networks inferred from communities from different biogeoclimatic zones differ in their topology, robustness and central taxa?

We find that, despite differences in community composition between LTSP sites and ecozones, the resolved networks from both natural and disturbed communities have similar topologies and are consistently less robust to the removal of taxa ranked by their betweenness centrality value. Finally, we characterize the identified central taxa to show that the chosen centrality measure captures taxa that could not be identified by their taxonomy or distribution in the soil profile.

3.2 Methods

3.2.1 LTSP sample collection and processing

The LTSP study is a multidisciplinary effort to monitor the impact of forestry practices on North American soil productivity [136]. Our study focuses on three LTSP ecozones in British Columbia, Ontario and California previously described by Hartmann and colleagues [76]. The three British Columbia sites are located in the Sub Boreal Spruce (SBS) biogeoclimatic zone and were harvested 15 years prior to sampling. The three California sites are located in the Mediterranean (MD) biogeoclimatic zone and were harvested 16 years prior to sampling. The three Ontario sites are located in the Jack Pine (JP) biogeoclimatic zone and were harvested 17 years prior to sampling.

At each site, samples were collected at different treatment plots (40 × 70 m): three levels of organic matter (OM) removal and one unharvested control. The plots were arranged in a randomized, full-factorial design. 
The three levels of OM removal were defined as stem-only harvesting (OM1), whole-tree harvesting (OM2) and whole-tree harvesting plus forest floor removal (OM3). These levels correspond to increasing removal of carbon sources [77]. Table A.13 provides a summary of the number of samples recovered for each ecozone and each treatment. Table 3.1 describes the biogeoclimatic properties of individual sites within each ecozone including the tree species planted post-harvest. This data was collected from associated LTSP publications [76, 77, 132, 135, 136].

At each plot, samples were collected randomly from the organic soil horizon (top layer of soil) and the mineral soil horizon (bottom layer) with 3 to 5 replicates per plot. Given the varying depth of the organic horizon (typically between 0 and 20 cm) from one site to another, dimensionless quantities are used to indicate the horizon sampled: 1 for the organic and 2 for the mineral horizon. In the OM3 plots of the SBS ecozone, the forest floor removed during harvesting had not redeveloped 15 years post-harvesting and thus the organic soil horizon could not be sampled. 
This study includes a total of 326 samples.

Table 3.1: LTSP sampling sites’ soil data for the SBS, MD and JP ecozones

Site         Zone  Province or State  Latitude  Longitude  Elevation (m)  Climatic zone (life zones)
Wells        JP    Ontario            46.42     -83.37     228            Cool temperate moist
Superior 3   JP    Ontario            47.57     -82.85     426            Boreal moist
Eddy 3       JP    Ontario            46.75     -82.25     488            Moist
Lowell Hill  MD    California         39.26     -120.78    1270           Warm temperate dry
Blodgett     MD    California         38.88     -120.64    1320           Warm temperate dry
Brandy City  MD    California         39.55     -121.04    1130           Warm temperate dry
Log Lake     SBS   British Columbia   38.88     122.61     780            Wet cool
Topley       SBS   British Columbia   52.32     126.31     1100           Moist cold
Skulow Lake  SBS   British Columbia   52.32     121.92     1050           Dry warm

Site         Climatic zone (Köppen classification)  Forest type        Tree species
Wells        Humid Continental warm summer          Mixed pine         Jack pine, Black spruce, Red pine
Superior 3   Humid Continental warm summer          Jack pine          Jack pine, Black spruce
Eddy 3       Humid Continental warm summer          Mixed conifer      Jack pine, Balsam fir, White birch
Lowell Hill  Mediterranean hot summer               Mixed conifer      Ponderosa pine, Sugar pine, White fir, Giant sequoia
Blodgett     Mediterranean hot summer               Mixed conifer      Ponderosa pine, Sugar pine, White fir, Giant sequoia
Brandy City  Mediterranean hot summer               Mixed conifer      Ponderosa pine, Sugar pine, White fir, Giant sequoia
Log Lake     Boreal cool summer                     Sub-boreal spruce  Subalpine fir, Douglas fir, Interior spruce
Topley       Boreal cool summer                     Sub-boreal spruce  Lodgepole pine, Subalpine fir, Interior spruce
Skulow Lake  Boreal cool summer                     Sub-boreal spruce  Lodgepole pine, Interior spruce

Site         Soil parent       Principal soil classification             Year established  Year sampled
Wells        Glacial outwash   Orthic Humo-Ferric Podzol                 1993-1994         2011
Superior 3   Glacial outwash   Orthic Dystric Brunisol                   1993-1994         2011
Eddy 3       Glacial outwash   NA                                        1993-1994         2011
Lowell Hill  Volcanic mudflow  Mesic Ultic Haploxeralfs                  1995              2011
Blodgett     Volcanic mudflow  Mesic Ultic Haploxeralfs                  1995              2011
Brandy City  Volcanic mudflow  Mesic Ultic Haploxeralfs                  1995              2011
Log Lake     Glacial till      Orthic Humo-Ferric Podzol                 1994              2008
Topley       Glacial till      Orthic Gray Luvisol, Gleyed Gray Luvisol  1994              2008
Skulow Lake  Glacial till      Orthic Gray Luvisol                       1994              2009

3.2.2 Environmental DNA extraction and sequencing

The hypervariable region V1 to V3 of the bacterial SSU rRNA gene was PCR amplified from 50 ng of soil DNA and sequenced using the 454 platform as previously described by Hartmann and colleagues [77]. The resulting reads were processed as previously described by Hartmann and colleagues [77]. In brief, reads were filtered using MOTHUR [144]: reads with ambiguous base calls and average quality scores < 25 were eliminated. Sequences were clustered into operational taxonomic units (OTUs) at 97% sequence identity. Singletons, clusters with only one represented sequence, were not included in the analysis. The number of sequences per sample recovered after quality control is summarized in Tables A.1 to A.12.

3.2.3 Microbial co-occurrence network inference

Samples were grouped by treatment and by ecozone to produce twelve microbial co-occurrence networks using the software CoNet (version 3.0) implemented in Cytoscape (version 3.1.0) [53, 55] (Table A.13). First, the samples’ composition data were combined to produce a matrix of OTU read counts per network. Read counts were filtered so that only OTUs occurring in 25% of samples (within the sample grouping of each network) were kept, and were then normalized by total reads per sample. Pairwise correlations were calculated for each pair of OTUs using two different measures: the Spearman correlation coefficient and the Bray Curtis dissimilarity. Thresholds of an absolute value of 0.6 for Spearman and of 0.4 and 0.6 for Bray Curtis were used to reduce the number of correlations to be evaluated.

Once the initial network is constructed, different procedures are implemented to refine the network. 
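
The initial construction steps described above (prevalence filter, normalization, pairwise correlation, thresholding) can be illustrated with a small pure-Python sketch. This is not CoNet and not the thesis code: the function names, the dictionary input format, the reading of the 25% filter as "at least 25% of samples", the reading of the Spearman threshold as "absolute value of at least 0.6", and the choice to compute sample totals before filtering are all assumptions made for illustration. The Bray Curtis measure is left out for brevity.

```python
from itertools import combinations


def average_ranks(values):
    """Return 1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = mean_rank
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman coefficient: Pearson correlation computed on ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0


def initial_network(counts, min_prevalence=0.25, threshold=0.6):
    """counts maps a (hypothetical) OTU id to its read counts, one per sample.

    Returns {(otu_a, otu_b): rho} for pairs with |rho| >= threshold.
    """
    n_samples = len(next(iter(counts.values())))
    # Assumption: sample totals are taken over all OTUs, before filtering.
    totals = [sum(c[s] for c in counts.values()) for s in range(n_samples)]
    # Keep OTUs detected in at least min_prevalence of the samples.
    kept = {otu: c for otu, c in counts.items()
            if sum(v > 0 for v in c) / n_samples >= min_prevalence}
    # Convert read counts to relative abundances per sample.
    rel = {otu: [v / t if t else 0.0 for v, t in zip(c, totals)]
           for otu, c in kept.items()}
    edges = {}
    for a, b in combinations(sorted(rel), 2):
        rho = spearman(rel[a], rel[b])
        if abs(rho) >= threshold:
            edges[(a, b)] = rho
    return edges
```

In this sketch a positive coefficient stands for a co-occurrence and a negative one for a mutual exclusion; the permutation and bootstrap steps that follow are what turn such raw scores into statistically supported edges.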
To avoid compositionality bias, the network co-occurrences are recomputed for 1000 permutations: for each evaluated co-occurrence, taxonomic abundance profiles are shuffled and the abundance matrix is renormalized. Then the network is recomputed for 1000 bootstrapped matrices: the original matrix is sub-sampled with replacement and all correlation measures are recomputed. This procedure provides a confidence interval around the co-occurrence score which is used to remove all co-occurrences not within the limits of the 95% confidence interval. From the bootstrap distribution and after applying a multiple-test correction, a p-value is calculated per co-occurrence, per measure, and co-occurrences with a p-value greater than 0.05 are removed. Further details on the different statistical validations of each network construction step are available through the documentation of the CoNet software [53, 55].

The co-occurrence analysis and sampling composition of each network follow the recommendations provided by Berry and Widder [16]:

• high resolution of community composition was used and infrequent taxa were removed
• sample heterogeneity was minimized: samples were grouped by ecozone defined by biogeoclimatic conditions
• compositionality bias due to relative abundance data was accounted for and corrected
• Bray Curtis dissimilarity, which is robust to spurious correlations from presence-absence count data, was used
• several correlation coefficients were measured to increase the sensitivity of the inferred networks

Twelve LTSP networks are thus produced where nodes are OTUs and edges are positive co-occurrences and mutual exclusions (Table A.13). These networks and the collection of samples used to produce each of them are referred to using the name Ecozone-OMX in the rest of this analysis.

3.2.4 Ecological analysis

In order to be consistent, the composition data used to assess ecological diversity and clustering patterns is the same data used to compute the networks. 
Hierarchical cluster analysis of samples was conducted with the R package pvclust using the Bray-Curtis dissimilarity metric [152]. All clustering was conducted with a bootstrapping procedure of 100 permutations. Community diversity was measured using richness and Shannon’s entropy (see Table 1.1).

3.2.5 Network analysis

Unless otherwise specified, the network and LTSP sample analysis was conducted using a collection of Python scripts which are publicly available at https://github.com/hallamlab/network-robustness. A summary of the Python packages used is available on the GitHub repository’s front page.

Degree distribution fitting was conducted following the procedure outlined in [35, 36] and using the associated Python package powerlaw [7]. Modularity analysis of networks was conducted using our own implementation of the algorithm FAG-EX [103] in Python. The minimum proportion of in-degrees to out-degrees for a subgraph to be a module is suggested to be in the range (1,3] [103, 120]. A higher value of this modularity factor corresponds to a stricter definition of module and thus modularity [103, 120]. Modularity analysis was conducted on all positive co-occurrences (not mutual exclusions) in the largest connected component (LCC) of each network with a factor of 2.

3.2.6 Using HyPE to visualize networks

Hive Panel Explorer (HYPE) is a network visualization tool which presents and enables the exploration of complex networks in a data driven manner (Chapter 2). In order to construct informative hive panels and compare the topology of the LTSP networks, ecological and network measures were calculated for individual OTUs. Average abundances were computed per network by normalizing read counts by total sample counts. Each OTU’s soil horizon was computed by weighting the sample horizon by the abundance of the OTU in that sample. 
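
As a minimal sketch of this abundance-weighted horizon (an illustration under assumptions, not the thesis scripts), the function below takes one OTU's abundance in each sample together with the horizon code of each sample (1 for organic, 2 for mineral) and returns the weighted mean. The function name and the handling of an OTU with zero total abundance are hypothetical choices.

```python
def average_soil_horizon(abundances, horizons):
    """Abundance-weighted mean horizon of one OTU.

    abundances: abundance of the OTU in each sample
    horizons:   horizon code of each sample (1 = organic, 2 = mineral)
    """
    total = sum(abundances)
    if total == 0:
        # Assumption: an OTU absent from every sample has no defined horizon.
        return None
    weighted = sum(a * h for a, h in zip(abundances, horizons))
    return weighted / total
```

For example, an OTU with abundances 3 and 1 in an organic and a mineral sample has an average horizon of (3·1 + 1·2) / 4 = 1.25, closer to the organic layer.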
Resulting values range between 1 and 2, corresponding to the organic and mineral horizon, respectively. Network measures were computed using the Python networkx package: degree, betweenness centrality and clustering coefficient.

The following six parameters were chosen as layout rules: average soil horizon, abundance, degree, centrality, clustering coefficient and phylum. The average soil horizon of a node reflects where it is predominantly located within the soil. Given the known stratification between organic and mineral horizons, we therefore choose average soil horizon as an axis position rule and node degree as an axis assignment rule. To visualize interactions between phyla, we use an OTU’s phylum to rank and position the nodes along axes. In order to assess the possible associations between the clustering coefficient of nodes and their average abundance, we set these properties as axis positioning and axis assignment rules, respectively. Finally, betweenness centrality is chosen as the third axis assignment rule. While several other centrality measures could have been used, previous studies of microbial co-occurrence networks and other biological networks have found this measure to be more informative than more local centrality measures such as closeness centrality [85].

Given the exponential type degree distribution of these networks (evaluated in Section 3.3), degree and betweenness centrality were plotted using a logarithmic partitioning scheme. Average soil horizon and clustering coefficient were plotted using a linear scale while abundance was plotted using an even partitioning scheme.

3.2.7 Network robustness simulations

Network robustness simulations can be conducted by removing nodes using different rankings [85]. The robustness at each removal step can be measured by assessing the change in different network properties, including the number of nodes disconnected and the diameter of the network [5, 85]. 
Here, network robustness simulations were conducted on the LCC of each network (the largest subgraph in which all nodes are connected by some path) by measuring the relative size of the LCC at each node removal step. Nodes were ranked randomly or by different network centrality measures: degree, betweenness centrality, eigenvector centrality and closeness centrality.

In order to obtain a quantitative measure of resilience to different node removals for each network, a robustness factor R is calculated:

R = r / |Nlcc|    (3.1)

where r is the number of nodes removed from the network such that the size of the LCC has decreased by 50% and |Nlcc| is the total number of nodes in the LCC [45]. Dunn and colleagues proposed this robustness factor to measure the resilience of food webs to species loss by assessing how many extinctions lead to 50% of possible ecosystem extinctions [45]. By adopting this factor from this macro-ecology study, we are assuming that fragmenting the network such that less than 50% of the OTUs take part in the original community will drastically affect the functional processes that rely on core community structure. R has a maximum value of 0.5 and a minimum value of 1/|Nlcc|. This robustness factor is normalized and can thus be compared between networks with LCCs of different sizes.

3.3 Results

3.3.1 Ecological diversity within and between ecozones

We begin our analysis by assessing the ecological similarities and differences within and between ecozones. We expect sample compositions to differ based on biogeoclimatic conditions (Table 3.1). Figure 3.1 shows a hierarchical clustering analysis of all samples coloured by ecozone and demonstrates that dendrogram clusters distinguish ecozones effectively. 
The same hierarchical clustering dendrogram is shown in Figure 3.2 with leaves coloured by treatment to demonstrate that individual samples do not cluster by level of OM removal.

Figure 3.1: Hierarchical clustering of all samples coloured by ecozone (legend: SBS, JP, MD)

Figure 3.2: Hierarchical clustering of all samples coloured by OM treatment level (legend: OM0, OM1, OM2, OM3)

Table 3.2: Richness of LTSP samples grouped by ecozone and treatment

Ecozone  OM0   OM1   OM2   OM3
SBS      370   536   489   615
MD       1337  1178  1375  1418
JP       1135  1014  936   1313

Table 3.3: Shannon’s entropy of LTSP samples grouped by ecozone and treatment

Ecozone  OM0  OM1  OM2  OM3
SBS      4.1  4.3  4.3  4.6
MD       5.8  5.7  6.0  6.1
JP       5.6  5.5  5.5  5.7

Within ecozones, hierarchical clustering analysis indicates that 
samples cluster by soil horizon (Figures A.1-A.9), except for samples from SBS which first cluster by the three sampling sites (Figures A.1-A.9).

The diversity of each sample group is quantified using species richness and Shannon’s entropy and is summarized in Tables 3.2, 3.3 and A.14. Richness and diversity vary more between ecozones than within them: the richness of the SBS communities is roughly two to four times lower than that of MD and JP (Table 3.2). We also find a higher entropy in these two ecozones, indicating a more heterogeneous composition than the communities from the SBS ecozone.

Overall, communities that have undergone OM3 treatments have a relatively higher richness and greater Shannon’s entropy. These findings quantify the compositional differences between ecozones and may reflect the fact that treatment effects vary under different biogeoclimatic conditions [77, 132].

3.3.2 Global network topology

Having found several factors driving ecological differences between sample groups, we begin our investigation of community structure by evaluating the differences and similarities in the global topology of our networks. The number of nodes in each network is of the same order of magnitude as their richness (Table 3.4). Tables 3.5 and 3.6 summarize each network’s average clustering coefficient and the size of the LCC in terms of number of nodes and diameter (longest shortest path) (see Section 1.1.2 for a review of network measures).

Table 3.4: The number of nodes |N| and edges |E| in the LTSP networks

Ecozone  OM0            OM1            OM2            OM3
         |N|    |E|     |N|    |E|     |N|    |E|     |N|    |E|
SBS      278    1469    285    2253    270    1528    114    166
MD       1299   85832   1118   70489   1316   76511   1395   173591
JP       1057   52363   788    67035   762    51441   68     39

Table 3.5: Global clustering coefficient of the LTSP networks

Ecozone  OM0    OM1    OM2    OM3
SBS      0.335  0.462  0.429  0.153
MD       0.466  0.484  0.447  0.553
JP       0.448  0.618  0.581  0.044

Overall we find a highly clustered topology (Table 3.5), as previously found in other microbial co-occurrence networks [11, 105, 139, 150]. 
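
For readers unfamiliar with the measure, an average clustering coefficient can be sketched in a few lines of pure Python. This is an illustrative version only, not the networkx routine used in the analysis, and whether Table 3.5 reports this node-averaged value or a transitivity-style global ratio is an assumption left open here. The network is represented as a hypothetical dictionary mapping each node to the set of its neighbours.

```python
from itertools import combinations


def local_clustering(adj, node):
    """Fraction of a node's neighbour pairs that are themselves connected."""
    neighbours = adj[node]
    k = len(neighbours)
    if k < 2:
        return 0.0  # no neighbour pair, so no triangle can exist
    linked = sum(1 for u, v in combinations(neighbours, 2) if v in adj[u])
    return 2 * linked / (k * (k - 1))


def average_clustering(adj):
    """Mean of the local clustering coefficients over all nodes."""
    return sum(local_clustering(adj, n) for n in adj) / len(adj)
```

Each connected neighbour pair closes a triangle through the node, so a high average value means that many OTUs sit in triangles of mutually connected partners.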
It is important to note here that triangles, groups of three connected nodes, can only be formed between three co-occurring OTUs since two mutually exclusive OTUs cannot by definition co-occur with a third OTU. Therefore the high clustering coefficient reflects the connectivity of co-occurring OTUs. We also note that the number of nodes and edges in the LCC is of the same magnitude as in the entire network for all networks except SBS-OM3 and JP-OM3, suggesting that the clustered topology is found on a global scale rather than on a local scale, in which case there would be several smaller components (disconnected subgraphs).

Table 3.6: Size of the largest connected component of the LTSP networks: |Nlcc| and D correspond to the number of nodes in the LCC and its diameter, respectively.

Ecozone  OM0          OM1          OM2          OM3
         |Nlcc|  D    |Nlcc|  D    |Nlcc|  D    |Nlcc|  D
SBS      249     10   285     8    268     8    78      11
MD       1295    9    1108    9    1285    10   1391    10
JP       1023    10   778     7    744     9    3       2

Figure 3.3: Probability distribution function of node degree for all LTSP networks with stretched exponential fitting. The JP-OM3 network was omitted given its lack of structure.

As explained in Section 1.1.3, many biological networks have a power law degree distribution (Figure 3.3). We fitted the degree distributions of our networks to a power law using the procedure described by Clauset and colleagues and compared the fit to other distributions [35, 36]. We found that all twelve networks follow a stretched exponential distribution rather than a power law, exponential or lognormal degree distribution (Figure 3.3). This type of distribution is also known as a power law with exponential tail and was found in marine microbial time-dependent co-occurrence networks [150]. These kinds of distributions are not scale free and typically contain numerous high degree nodes [118]. 
In addition, the diameter of networks with such distributions scales sub-linearly with increasing network size [118], which explains why the networks have similar diameters (Table 3.6) despite having up to an order of magnitude difference in the number of nodes and edges.

Next, we evaluated the subglobal structure of communities by conducting a modularity analysis on positive co-occurrences and found highly connected clusters of co-occurring OTUs. We find modules in most networks that are stratified between organic and mineral horizons, as illustrated by the hive panel in Figure 3.5.

3.3.3 Visualizing microbial co-occurrence networks with HyPE

We next used HYPE to visualize the modularity of the twelve networks to explore topological and ecological associations. Hive panels of size 3x3 were constructed for each network by using the following six parameters as layout rules: average soil horizon, abundance, degree, centrality, clustering coefficient and phylum. We provide a highlight of the different visually resolved patterns (not yet quantitatively validated) illustrated by the hive panel of SBS-OM0 in Figure 3.4:

• co-occurrences stratify by horizon and mutual exclusions connect OTUs with different average soil horizons
• high degree nodes have average soil horizons near the organic layer or the mineral layer but not in between
• there are many co-occurrences between OTUs of different abundances and from different phyla
• OTUs with lower abundances seem to have higher clustering coefficients
• the organic horizon module contains OTUs with a wider range of average soil horizon than mineral horizon modules
• co-occurrences between the two modules seem to be primarily between low and high degree nodes.

Given the focus of this chapter, we concentrate on two patterns in particular which are shown across all networks in Figures 3.5 and 3.6. Figure 3.5 illustrates the twelve networks’ modular connectivity and Figure 3.6 shows the connectivity and centrality of OTUs categorized by phyla. 
Despite not finding two modules corresponding to the organic horizon and mineral horizon in all twelve networks, we do see a similar connectivity pattern: most co-occurrences occur within horizons and few co-occurrences are found between OTUs from different horizons.

Figure 3.4: Hive Panel of the network from the SBS ecozone with treatment OM0. Nodes are coloured by degree. [Panel axis labels: soil horizon (linear), abundance (even), phylum (categorical), clustering (linear), degree (log), betweenness centrality (log).]

Figure 3.5: Hive panel of twelve hive plots showing the horizon modularity of the LTSP networks. Hive plots were constructed by partitioning node degrees logarithmically onto axes and linearly positioning the nodes by average soil horizon (organic horizon toward the hive plot centers). [Panel columns: SBS, MD, JP; rows: OM0 to OM3; legend: organic module, mineral module, no module.]

Figure 3.6: Hive panel of twelve hive plots showing the connectivity and centrality of OTUs’ phyla of the LTSP networks. Hive plots were constructed by partitioning node betweenness centrality values logarithmically onto axes and linearly positioning the nodes by the alphabetical rank of their phylum. [Panel columns: SBS, MD, JP; rows: OM0 to OM3; phylum legend: Acidobacteria, Actinobacteria, Bacteroidetes, Candidate_division_OP10, Candidate_division_TG-1, Candidate_division_TM6, Candidate_division_TM7, Candidate_division_WS3, Chloroflexi, Cyanobacteria, Fibrobacteres, Firmicutes, Gemmatimonadetes, Nitrospirae, Planctomycetes, Proteobacteria, Verrucomicrobia, WCHB1-60, unclassified.]

Figure 3.6 illustrates that OTUs with high betweenness centrality tend to come from the three most predominant phyla in this sample collection: Proteobacteria, Acidobacteria and Actinobacteria.
However, other phyla also have high centrality OTUs in specific networks, including Bacteroidetes, Chloroflexi, Gemmatimonadetes, Planctomycetes, Firmicutes, and Verrucomicrobia. Given that different centrality measures evaluate different features of node positions and that their applicability depends on the type of network and the task at hand, we use robustness simulations to determine the appropriate centrality measure given the topology of the twelve networks.

3.3.4 Network robustness simulations

Evaluating centrality driven robustness of networks

Robustness analysis tests the integrity of the network’s structure under different types of node failures. Conducting different simulations by removing nodes ranked by their centrality value can identify the nodes playing key structural roles in the network. Figure 3.8 shows the robustness simulations on all twelve networks, where nodes were removed either randomly or by ranked values of degree, closeness centrality, betweenness centrality and eigenvector centrality. We notice that ranking nodes by betweenness centrality consistently fragments the LCC earlier in the simulations. In addition, many simulations show a sharp drop in the relative size of the LCC: the removal of certain nodes disconnects large subgraphs within the LCC. In particular, we notice that this drop occurs most precipitously in the SBS-OM0 and JP-OM0 networks. We then measured the robustness factor for each node removal method and find that overall these networks are least robust to the removal of nodes ranked by betweenness centrality (Tables 3.7, 3.8 and 3.9). Other centralities vary in their effect on robustness and often produce robustness factors similar to those of random node removal. Comparing robustness to betweenness centrality node removal, we observe that the robustness of networks differs more across ecozones than within them: SBS networks are on average less robust than networks from the MD and JP ecozones.
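As a minimal sketch of the node-removal procedure described above (not the code used in this thesis), the following Python example removes nodes in a given order and records the relative size of the LCC after each removal. The graph `toy`, the function names and the removal orders are hypothetical. The robustness factor is assumed here to be the mean relative LCC size over the removal sequence, which may differ from the definition used for Tables 3.7 to 3.9.

```python
from collections import deque


def largest_component_size(adj, removed):
    """Return the number of nodes in the largest connected component,
    ignoring every node listed in `removed`."""
    seen = set(removed)
    best = 0
    for start in adj:
        if start in seen:
            continue
        # Breadth-first search over the component containing `start`.
        seen.add(start)
        queue = deque([start])
        size = 0
        while queue:
            node = queue.popleft()
            size += 1
            for neighbour in adj[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        best = max(best, size)
    return best


def robustness_curve(adj, order):
    """Relative LCC size after each successive removal in `order`."""
    n = len(adj)
    removed = set()
    curve = []
    for node in order:
        removed.add(node)
        curve.append(largest_component_size(adj, removed) / n)
    return curve


def robustness_factor(curve):
    """Assumed definition: mean relative LCC size over all removals."""
    return sum(curve) / len(curve)


# Hypothetical toy network: two triangles joined through a bridge node "d".
toy = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c", "e"},
    "e": {"d", "f", "g"},
    "f": {"e", "g"},
    "g": {"e", "f"},
}

# Removal order starting with the bridge node (a stand-in for a
# betweenness-based ranking), and an order ranked by degree.
bridge_first = ["d", "c", "e", "a", "b", "f", "g"]
by_degree = sorted(toy, key=lambda node: (-len(toy[node]), node))

bridge_curve = robustness_curve(toy, bridge_first)
degree_curve = robustness_curve(toy, by_degree)
```

In this toy graph, removing the low-degree bridge node first leaves an LCC of 3 of 7 nodes, whereas removing the highest-degree node first leaves 4 of 7. Ranking by another centrality measure only changes how the removal order is built; the removal loop stays the same.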
Moreover, SBS-OM0 and JP-OM0 are much less robust than the treatment networks from the same ecozone. This pattern suggests that treatment effects on natural and disturbed communities are reflected in the co-occurrence network structure. However, MD networks vary very little in their robustness factors, as illustrated by the simulations driven by betweenness centrality.

Comparing centrality measures

Figures 3.7, A.10 and A.11 show the relations between each centrality measure and demonstrate that betweenness centrality captures different OTUs than the other centrality measures. In particular, we notice that a high degree, closeness centrality or eigenvector centrality does not ensure a high betweenness centrality value. This trend indicates that, in the case of comparing degree and betweenness centrality, nodes with few co-occurrences can connect paths of multiple co-occurrences (i.e. a chain of co-occurrences). In the absence of these high betweenness centrality taxa, these paths would be longer or nonexistent. Only two centrality measures, degree and eigenvector centrality, seem to have a linear relationship. We also notice that many nodes tend to have high closeness centrality values, which does not facilitate the selection of highly central OTUs. This trend is expected as a node’s closeness centrality is hierarchically calculated from the closeness centrality of other nodes [118].

3.3.5 Characterizing central taxa

Having determined that betweenness centrality (BC) captures certain structural positions related to network robustness, we continue our investigation on the OTUs with the highest BC values. A BC value is not as informative as its rank [85]; therefore we choose a percentile cut-off to capture central OTUs. We select the OTUs in the top 10% of BC values in each network, in combination with a cut-off of 0.005. The highest and lowest BC values among the selected OTUs are 0.41 and 0.007, respectively.
These values express that the corresponding nodes take part in between 0.7% and 41% of all shortest paths in the network. To put this in perspective, a network with |N| nodes has on the order of |N|^2 shortest paths. In this way, we collect central OTUs from all networks despite their different sizes and capture a total of 265 central OTUs, which represents 8% of the total number of OTUs that co-occur in the networks.

Figure 3.7: Scatter matrix plot of four centrality measures in the SBS networks. Histograms of each centrality measure are also shown. The centrality values of OTUs from each treatment network were pooled to produce these plots.

Table 3.7: Robustness factor of SBS networks per node removal method

        Random   Betweenness   Degree       Closeness    Eigenvector
                 centrality    centrality   centrality   centrality
OM0     0.37     0.07          0.24         0.07         0.09
OM1     0.42     0.17          0.19         0.19         0.38
OM2     0.39     0.10          0.17         0.25         0.33
OM3     0.29     0.12          0.12         0.12         0.2

Table 3.8: Robustness factor of MD networks per node removal method

        Random   Betweenness   Degree       Closeness    Eigenvector
                 centrality    centrality   centrality   centrality
OM0     0.48     0.17          0.48         0.46         0.43
OM1     0.47     0.17          0.49         0.45         0.49
OM2     0.47     0.14          0.48         0.46         0.49
OM3     0.48     0.13          0.49         0.46         0.49

Table 3.9: Robustness factor of JP networks per node removal method

        Random   Betweenness   Degree       Closeness    Eigenvector
                 centrality    centrality   centrality   centrality
OM0     0.42     0.16          0.47         0.13         0.45
OM1     0.47     0.39          0.40         0.41         0.42
OM2     0.47     0.41          0.46         0.44         0.46
OM3     NA       NA            NA           NA           NA

Figure 3.8: Robustness simulations of twelve LTSP networks driven by different centrality measures. The relative size of the LCC of each treatment network is plotted against the number of nodes removed. Networks are coloured by associated treatment level.

Figure 3.9: Venn diagram of all OTUs in ecozone networks (A) and all central OTUs (B).

We compare central OTUs to other network members by evaluating their average soil horizon and abundance.
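The selection rule for central OTUs described in this section (top 10% of BC values in a network, combined with an absolute cut-off of 0.005) can be sketched as follows. This is an illustration only: the function name, the OTU identifiers and the BC values in `example_bc` are hypothetical, and treating the cut-off as inclusive (BC >= 0.005) is an assumption, since the text does not state it.

```python
def select_central_otus(bc_values, top_percent=10, min_bc=0.005):
    """Return OTU ids in the top `top_percent` of BC values that also
    meet the absolute cut-off `min_bc`, ordered from highest to lowest BC."""
    ranked = sorted(bc_values.items(), key=lambda item: item[1], reverse=True)
    # Integer ceiling of n * top_percent / 100, keeping at least one candidate.
    k = max(1, (len(ranked) * top_percent + 99) // 100)
    # Assumption: the absolute cut-off is inclusive.
    return [otu for otu, value in ranked[:k] if value >= min_bc]


# Hypothetical network of 20 OTUs with made-up normalized BC values.
example_bc = {"otu_01": 0.41, "otu_02": 0.12, "otu_03": 0.03, "otu_04": 0.007}
example_bc.update({f"otu_{i:02d}": 0.001 for i in range(5, 21)})

central = select_central_otus(example_bc)

# A small, poorly connected network whose best BC value misses the cut-off.
sparse = select_central_otus({"otu_a": 0.004, "otu_b": 0.001})
```

Combining a rank-based rule with an absolute floor keeps the selection comparable across networks of different sizes while avoiding the selection of OTUs from networks in which no node lies on an appreciable share of shortest paths.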
Figure 3.11 shows that central OTUs have a wide range of average soil horizons that are mostly not exclusive to either organic or mineral horizons. In terms of their abundance, most central OTUs are rare (< 0.1%) or of intermediate (> 0.1% and < 1%) abundance [170], with a few exceptions (Figure 3.12).

Next we compare the taxonomic distribution of central and non-central OTUs. Figure 3.10 illustrates the overlap in taxonomic representations in each network at the phylum, order and class level: the overlap in taxonomies found in all ecozone groupings of networks decreases when comparing central OTUs. Taxonomic overlap at lower taxonomic levels was not evaluated as the proportion of unclassified OTUs at those levels rises from 10% to 20%–80%. Using counts of OTUs per taxonomic level per network, we evaluate the possible over-representation of taxonomies in central OTUs given the null hypothesis that central OTUs were randomly selected: no individual phylum, order or class was over-represented in the central OTUs at a significance below p = 0.05 (Tables A.15–A.22). Taxonomic representation was modelled using a hypergeometric distribution of taxonomic counts and over-representation p-values were produced using a Bonferroni correction.

Figure 3.10: Venn diagram of the number of phyla, classes and orders shared across ecozone networks (A) and the number of central taxonomic levels shared (B).

Looking at the overlap in central OTUs between networks grouped by ecozone, we find that few OTUs are central in all ecozones (0.7%), despite the fact that many OTUs are found in all ecozone groupings of networks (7%) (Figure 3.9).
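The over-representation test described in this section (a hypergeometric model of taxonomic counts with a Bonferroni correction) can be sketched as follows. This is not the code used in the thesis: the function names and the counts in `all_counts` and `central_counts` are hypothetical. The sketch assumes a one-sided upper-tail test and a Bonferroni factor equal to the number of taxa tested.

```python
from math import comb


def hypergeom_upper_tail(k, population, successes, draws):
    """P(X >= k) when `draws` items are sampled without replacement from
    `population` items, of which `successes` belong to the taxon."""
    upper = min(successes, draws)
    total = comb(population, draws)
    favourable = sum(
        comb(successes, i) * comb(population - successes, draws - i)
        for i in range(k, upper + 1)
    )
    return favourable / total


def overrepresentation_pvalues(all_counts, central_counts):
    """Bonferroni-adjusted upper-tail p-value for each taxon.

    all_counts:     OTUs per taxon in the whole network
    central_counts: central OTUs per taxon (subset of the above)
    """
    population = sum(all_counts.values())
    draws = sum(central_counts.values())
    n_tests = len(all_counts)  # assumed Bonferroni factor
    adjusted = {}
    for taxon, successes in all_counts.items():
        k = central_counts.get(taxon, 0)
        p = hypergeom_upper_tail(k, population, successes, draws)
        adjusted[taxon] = min(1.0, p * n_tests)
    return adjusted


# Hypothetical counts: 10 OTUs in two phyla, 2 of them central.
all_counts = {"Proteobacteria": 5, "Acidobacteria": 5}
central_counts = {"Proteobacteria": 2}

pvalues = overrepresentation_pvalues(all_counts, central_counts)
```

A taxon would be called over-represented among central OTUs only if its adjusted p-value fell below the chosen significance level, which is the outcome the thesis reports as not occurring for any phylum, class or order.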
Given that functional relations can be associated with bacterial lineages [58], the low overlap of central OTUs between ecozones suggests that the role played by central OTUs is fulfilled at a higher taxonomic level rather than at the species level.

Figure 3.11: Histograms of the average soil horizon of OTUs with high BC values

Figure 3.12: Histograms of the abundance of OTUs with high BC values

3.4 Discussion

Previous studies indicate that community diversity and composition vary by biogeoclimatic conditions in LTSP sampling sites and soil horizons [76, 77]. Among these drivers, soil horizon consistently split samples in all ecozones and OM removal treatments. Despite these differences, the global topology, modularity and outcome of robustness simulations remained similar across all twelve LTSP networks. These results demonstrate that consistent ecological patterns can be resolved through network analysis despite the variability in community composition in forest soils.

3.4.1 Soil microbial co-occurrence networks: a complex ecologically driven structure

Evaluating the global topology of each network, we find properties similar to those reported in other co-occurrence network studies: an exponential type degree distribution [150] and a highly clustered [89, 105, 129, 139, 150, 178], connected [105, 129, 139, 150, 178], and modular structure [150]. Though most biological networks can be fitted to a power law distribution, other real world networks have stretched exponential degree distributions: science collaboration networks [117], certain food webs [122] and power grids [9]. The structural similarity between the LTSP co-occurrence networks and social, biological and technological networks suggests that the mode of network inference used in this study captures non-random relationships.

To illustrate how these topological properties relate to ecological properties, we visualized these patterns using HYPE, a data-driven and versatile network visualization tool that overcomes the difficulties in visualizing large networks.
The hive panels indicated that centrality measures may be used to capture structurally important taxa in co-occurrence networks. Two of the visually resolved patterns were presented for all networks: hive panels were constructed to demonstrate the stratification of co-occurrences between horizons and the possible association between taxonomy and centrality within and between LTSP sites.

3.4.2 Centrality and robustness across biogeoclimatic networks

Given the similarity in topological structures between networks inferred from natural and disturbed communities, we expected similar outcomes in robustness simulations between networks. Indeed, simulations confirmed that our networks were the least robust to node removal ranked by the same centrality measure, betweenness centrality. Certain networks were fragmented after the removal of only 10% of nodes with the highest BC values, as reflected by their robustness factor. To further confirm that high BC values select different structural positions than other centrality measures, we compared the centrality of taxa according to different centrality measures. High BC values were not consistent with high values of any other centrality measure. To put this in perspective, correlations between centrality measures have been recorded in randomly generated power law and exponential networks [85]. The lack of correlation observed here suggests that these co-occurrence networks have a more complex structure than that of randomly generated networks with similar degree distributions. This finding confirms that the mode of network inference used captures non-random relationships between taxa.

Overall, taxa with high BC values were not distributed like other taxa. Their average relative abundance demonstrated that these taxa have rare or intermediate abundance, with a few exceptions. This trend agrees with experimental findings of low abundance keystone taxa in oral biofilms [46] and in fermentative mixed cultures [138].
For instance, Duran-Pinedo and colleagues showed that the addition of a rare taxon to a culture permitted the isolation of a previously uncultured microorganism [46]. Looking at the taxonomic distribution of central taxa, we found that they originate from rare and abundant phyla, classes and orders. However, no taxonomy at the phylum, class or order level was over-represented in the set of central taxa (Tables A.15–A.22). The ecological attributes and topological positions of central taxa suggest that BC selects OTUs which are not artifacts of network construction and that could not have been resolved through ecological measurements: central taxa are predominantly rare or of intermediate abundance, ambivalent in their average horizon, and have mixed centrality values according to other centrality measures.

3.4.3 Relating treatment effects to robustness analysis

Having confirmed that networks from both natural and disturbed communities were the least robust to BC driven node removal, we compared robustness factors within ecozones and between treatments. We found that SBS and JP networks’ robustness factors increased from untreated (OM0) to treatment networks, whereas MD network robustness factors did not vary much between treatments (Tables 3.7, 3.8 and 3.9). Looking at the classification of soils from different sampling sites, we notice that SBS and JP samples have glacial soil parents whereas MD soils originate from volcanic mudflow (Table 3.1). Given that short term (10 years) timber harvesting effects on forest productivity in LTSP sites depended on the susceptibility of different soil types [136], it is not surprising to find different treatment effects associated with robustness simulations between these ecozones.

Surprisingly, we find a counter-intuitive relation between robustness and treatment in SBS and JP networks.
It is unclear why treatment networks from the SBS and JP ecozones exhibited an increased robustness given the evidence of significant treatment effects in both ecozones. Specifically, ecological assessments of soils in the JP ecozone showed a significant disturbance in environmental conditions related to forest productivity [60, 132], and the impact of OM treatment was evident in changes in microbial community composition in soils from the SBS ecozone [76, 77]. The increase in robustness in JP and SBS OM1, OM2 and OM3 (for SBS only) networks compared to the controls (OM0) therefore demonstrates a shift in topology that could reflect a change in community structure. This change may echo either community instability, community resilience, or the achievement of an alternate stable community structure [145]. Given the expected long-term effect of organic matter removal on soil conditions and the microbiome [76, 77, 132, 135], a follow-up study of microbial community structure in the coming decades could resolve whether the apparent shifts in co-occurrence topology capture a state of community adaptation or fragmentation. Moreover, multi-omics studies could help elucidate specific changes in community metabolic potential resulting from changes in microbial interactions [30].

We now turn to several LTSP studies which have measured the impact of organic matter removal on forests to infer why this robustness pattern was not found in the MD ecozone networks and to evaluate whether the results from our robustness analysis match ecological findings from prior studies. The effect of organic matter removal has been evaluated based on several criteria, including pre- and post-harvest biomass measurements (volume of organic matter per area) [94, 135, 136], soil carbon and nitrogen concentration [135, 136, 154], microbial biomass [26], carbon utilization [26], tree survival [60], tree growth [60], soil bulk density [121], and microbial diversity and shifts in community composition [76, 77].
These studies confirm that MD forests, soil conditions and microbial communities were less impacted by organic matter removal than those of SBS and JP. First, Fleming and colleagues showed that despite similar responses in tree survival in five ecozones, including the ones studied here, tree growth severely decreased in SBS conifers and JP black spruce and jack pine while MD giant sequoias had an increase in growth [60]. Second, the treatment effect on total biomass measurements was evaluated statistically and found to be significant in JP but not MD sites [132]. Third, studies that quantified shifts in microbial community composition from these ecozones found significant perturbations in community structure and taxonomic composition in SBS communities using SSU rRNA sequencing [76, 77]. In contrast, measurements of microbial biomass, respiration and carbon utilization did not resolve any treatment effects in communities from the MD ecozone [26]. These results support the view that robustness analysis of co-occurrence networks reflects ecological findings. Therefore the association between robustness and organic matter removal impact demonstrates the sensitivity of co-occurrence relationships in microbial communities to environmental perturbation.

We have shown that central taxa can be captured by centrality measures chosen using robustness simulations and analysis. Given the structural importance of central taxa and their role in maintaining network structural integrity, it is reasonable to infer that central taxa may play important functional roles in microbial communities and that these OTUs could be keystones. In order to assess their functional (i.e., genetic, metabolic, and biogeochemical) importance, further experimental and quantitative analysis is required. Specifically, the functional roles of the central taxa can be assessed using plurality and single-cell genomic sequencing or co-culture experiments using representative isolates [41, 46, 66, 113].
For example, assigning taxonomic information to population genome bins reconstructed from shotgun sequencing can determine the metabolic potential of specific taxa within the co-occurrence network [42, 66]. The resolved functional associations between taxa have the potential to illuminate distributed metabolic pathways linking taxa at the community level.

3.5 Conclusion

Microbial co-occurrence studies have adopted different network analysis methods to find potential keystone taxa. However, the concept of keystone taxa is difficult to tackle given the diversity and complexity of microbial communities [178] and the ambiguity of the species concept, as explained in Section 1.3.3. Understanding how different network measures, including centrality measures, can be interpreted in the context of co-occurrence networks can help identify keystone taxa and determine the impact of disturbance on microbial community structure and function. In the case of LTSP sites, we showed that robustness analysis resolved differential impacts of OM removal on microbial communities across ecozones and determined that these communities were similar in their inferred networks’ topology and distribution of central taxa. Furthermore, we identified central taxa from a variety of taxonomies and characterized their soil profile and abundance. These findings demonstrate the capacity of network inference models in microbial ecology research to provide new insights into microbial interactions, community stability and resilience in forest soil ecosystems, extensible to other natural and engineered ecosystems.

Chapter 4

Conclusion

This thesis described HYPE, an interactive and data driven exploratory tool for biological networks, and an analytical graph theory based approach to modelling microbial communities from environmental sequence information.
This final chapter presents a high-level discussion of the assumptions and limitations of HYPE’s design and of SSU rRNA sequencing data, outlines the future of network visualization, and concludes by presenting the future integrative needs of microbial ecology.

4.1 Assumptions and limitations of sequencing approaches

High throughput sequencing technologies have bridged the cultivation gap and enable the characterization of microbial community composition and genetic potential. However, certain assumptions and limitations must be considered so as to appropriately analyze and interpret the produced data. First, particularly in diverse environments like soil, SSU rRNA sequencing under-samples the community, capturing only the most abundant community members [141]. Similarly, the possibility of sequencing errors in singletons, OTUs for which only one sequence has been recruited, challenges their credibility when in fact a singleton could represent a rare organism. Second, the resolution of OTUs’ taxonomies at the family, genus and species level remains difficult as the Earth’s microbial diversity has not yet been fully documented in public databases. Furthermore, public databases of SSU rRNA genes are biased towards culturable microorganisms. Finally, as the quality and quantity of environmental sequencing increases, microbial ecology research will gain traction in charting microbial diversity on Earth.

4.2 HyPE as a community tool

HYPE enables the exploration of complex systems and drives their quantitative analysis through hypothesis generation. As briefly described in Chapter 2, HYPE has a few usage limitations that need to be acknowledged.
In particular, the efficient exploration of a system must adapt to its size and complexity, and certain patterns may be visually hidden and require a deeper exploration to be uncovered. Fortunately, as users gain experience in exploring their system, they will find a combination of colouring rules, filtering rules and interactive tools that eases navigation of their network. In order to reduce the learning curve of the tool and facilitate this learning process, we envision a platform where a community of HYPE users can share their experience with the tool, their adventures in exploring their system, and the patterns they resolved. This type of social and community based learning approach has proved successful on web platforms such as Stack Overflow [1], where novice and expert users pose and answer statistical, mathematical and computer science related questions. Moreover, such a platform would help resolve recurrent usage patterns that can be analyzed to improve HYPE’s interactive features and develop navigation guidelines and procedures for novice users. Finally, creating an inter-connected HYPE community can encourage inter-disciplinary collaboration and research while helping users make the best of the tool.

GitHub is a collaborative online code host, and its code sharing features have already increased the awareness of HYPE as a novel network visualization tool. As of June 2015, several dozen unique visitors have visited the repository, of which a few have requested features and cloned the repository (created a local copy).
With the development of an online user interface, the publication of the tool in a peer-reviewed journal, the development of online use cases for novice users, and a social community platform for sharing hive panels and patterns, this user base will only increase.

4.3 Closing: cross-disciplinarity in microbial ecology

This thesis has demonstrated that the integration of methods from different disciplines can empower researchers studying complex systems such as the C. elegans connectome and microbial communities. In Chapter 2 we combined concepts and methods from the fields of information visualization, pattern recognition and network science to develop a visualization tool. In Chapter 3 we combined sequencing methods, soil ecology, microbial ecology, macroecology methods and network science to demonstrate the applicability of graph theory methods and robustness analysis to evaluating microbial community stability at a taxonomic and community level. As motivated by Dorion Sagan (see Section 1.1), cross-disciplinary synthesis stimulates scientific research and creates scientific breakthroughs [143]. In environmental genomics in particular, the integration of multi-omic sequencing techniques, statistical methods, network science, complexity modelling, high-performance computing, and other disciplines will enable researchers to understand and harness the potential of the invisible majority of life on Earth [167].

Bibliography

[1] Stack Overflow, 2015.
[2] M. Achtman and M. Wagner. Microbial diversity and the genetic nature of microbial species. Nature Reviews Microbiology, 6(6):431–440, June 2008.
[3] A. Agnelli, J. Ascher, G. Corti, M. T. Ceccherini, P. Nannipieri, and G. Pietramellara. Distribution of microbial communities in a forest soil profile investigated by microbial biomass, soil respiration and DGGE of total and extracellular DNA. Soil Biology and Biochemistry, 36(5):859–868, May 2004.
[4] M. Aickin and H. Gensler.
Adjusting for multiple testing when reporting research results: the Bonferroni vs Holm methods. American Journal of Public Health, 86(5):726–728, May 1996.
[5] R. Albert, H. Jeong, and A.-L. Barabasi. Error and attack tolerance of complex networks. Nature, 406(6794):378–382, July 2000.
[6] M. Alcalde, M. Ferrer, F. J. Plou, and A. Ballesteros. Environmental biocatalysis: from remediation with enzymes to novel green processes. Trends in Biotechnology, 24(6):281–287, Jan. 2006.
[7] J. Alstott, E. Bullmore, and D. Plenz. Powerlaw: a Python package for analysis of heavy-tailed distributions. PLoS ONE, 9(1):e85777, Jan. 2014. arXiv: 1305.0215.
[8] Z. Altun and D. Hall. Wormatlas, 2002.
[9] L. A. N. Amaral, A. Scala, M. Barthélémy, and H. E. Stanley. Classes of small-world networks. Proceedings of the National Academy of Sciences, 97(21):11149–11152, Oct. 2000.
[10] M. Arumugam, J. Raes, E. Pelletier, D. Le Paslier, T. Yamada, D. R. Mende, G. R. Fernandes, J. Tap, T. Bruls, J.-M. Batto, M. Bertalan, N. Borruel, F. Casellas, L. Fernandez, L. Gautier, T. Hansen, M. Hattori, T. Hayashi, M. Kleerebezem, K. Kurokawa, M. Leclerc, F. Levenez, C. Manichanh, H. B. Nielsen, T. Nielsen, N. Pons, J. Poulain, J. Qin, T. Sicheritz-Ponten, S. Tims, D. Torrents, E. Ugarte, E. G. Zoetendal, J. Wang, F. Guarner, O. Pedersen, W. M. de Vos, S. Brunak, J. Doré, MetaHIT Consortium (additional Members), J. Weissenbach, S. D. Ehrlich, and P. Bork. Enterotypes of the human gut microbiome. Nature, 473(7346):174–180, May 2011.
[11] A. Barberán, S. T. Bates, E. O. Casamayor, and N. Fierer. Using network analysis to explore co-occurrence patterns in soil microbial communities. The ISME Journal, 6(2):343–351, Feb. 2012.
[12] A. Barberán, E. O. Casamayor, and N. Fierer. The microbial contribution to macroecology. Evolutionary and Genomic Microbiology, 5:203, 2014.
[13] C. I. Bargmann. Neurobiology of the Caenorhabditis elegans Genome. Science, 282(5396):2028–2033, Dec. 1998.
[14] C. I. Bargmann and E. Marder.
From the connectome to brain function. Nature Methods, 10(6):483–490, June 2013.
[15] M. Bastian, S. Heymann, and M. Jacomy. Gephi: An Open Source Software for Exploring and Manipulating Networks. In Third International AAAI Conference on Weblogs and Social Media, Mar. 2009.
[16] D. Berry and S. Widder. Deciphering microbial interactions and detecting keystone species with co-occurrence networks. Frontiers in Microbiology, 5, May 2014.
[17] N. L. Biggs, E. K. Lloyd, and R. J. Wilson. Graph Theory 1736-1936. Clarendon Press, Dec. 1986. ISBN 978-0-19-853916-2.
[18] M. Blaxter, J. Mann, T. Chapman, F. Thomas, C. Whitton, R. Floyd, and E. Abebe. Defining operational taxonomic units using DNA barcode data. Philosophical Transactions of the Royal Society B: Biological Sciences, 360(1462):1935–1943, Oct. 2005.
[19] S. R. Borrett. Throughflow centrality is a global indicator of the functional importance of species in ecosystems. Ecological Indicators, 32:182–196, Sept. 2013.
[20] M. Bostock. Visualizations with D3, 2015.
[21] M. Bostock, V. Ogievetsky, and J. Heer. D3 Data-Driven Documents. IEEE Transactions on Visualization and Computer Graphics, 17(12):2301–2309, Dec. 2011.
[22] J. L. Bowman, S. K. Floyd, and K. Sakakibara. Green Genes: Comparative Genomics of the Green Branch of Life. Cell, 129(2):229–234, Apr. 2007.
[23] J. H. Boyle, S. Berri, and N. Cohen. Gait Modulation in C. elegans: An Integrated Neuromechanical Model. Frontiers in Computational Neuroscience, 6:10, 2012.
[24] M. Brilli and P. Liò. The Structural Network Properties of Biological Systems. Briefings in Functional Genomics, pages 9–32, 2009.
[25] E. T. Bullmore and D. S. Bassett. Brain Graphs: Graphical Models of the Human Brain Connectome. Annual Review of Clinical Psychology, 7(1):113–140, 2011.
[26] M. D. Busse, S. E. Beattie, R. F. Powers, F. G. Sanchez, and A. E. Tiarks. Microbial community responses in forest mineral soil to compaction, organic matter removal, and vegetation control.
Canadian Journal of Forest Research, 36(3):577–588, Mar. 2006.
[27] J. G. Caporaso, J. Kuczynski, J. Stombaugh, K. Bittinger, F. D. Bushman, E. K. Costello, N. Fierer, A. G. Peña, J. K. Goodrich, J. I. Gordon, G. A. Huttley, S. T. Kelley, D. Knights, J. E. Koenig, R. E. Ley, C. A. Lozupone, D. McDonald, B. D. Muegge, M. Pirrung, J. Reeder, J. R. Sevinsky, P. J. Turnbaugh, W. A. Walters, J. Widmann, T. Yatsunenko, J. Zaneveld, and R. Knight. QIIME allows analysis of high-throughput community sequencing data. Nature Methods, 7(5):335–336, May 2010.
[28] J. G. Caporaso, C. L. Lauber, E. K. Costello, D. Berg-Lyons, A. Gonzalez, J. Stombaugh, D. Knights, P. Gajer, J. Ravel, N. Fierer, J. I. Gordon, and R. Knight. Moving pictures of the human microbiome. Genome Biology, 12(5):R50, 2011.
[29] E. Cardenas and J. M. Tiedje. New tools for discovering and characterizing microbial diversity. Current Opinion in Biotechnology, 19(6):544–549, Dec. 2008.
[30] E. Cardenas, J. M. Kranabetter, G. Hope, K. R. Maas, S. Hallam, and W. W. Mohn. Forest harvesting reduces the soil metagenomic potential for biomass decomposition. The ISME Journal, Apr. 2015.
[31] E. Cardenas, J. M. Kranabetter, G. Hope, K. R. Maas, S. Hallam, and W. W. Mohn. Forest harvesting reduces the soil metagenomic potential for biomass decomposition. The ISME Journal, Apr. 2015.
[32] R. Caspi, H. Foerster, C. A. Fulcher, P. Kaipa, M. Krummenacker, M. Latendresse, S. Paley, S. Y. Rhee, A. G. Shearer, C. Tissier, T. C. Walk, P. Zhang, and P. D. Karp. The MetaCyc Database of metabolic pathways and enzymes and the BioCyc collection of Pathway/Genome Databases. Nucleic Acids Research, 36(suppl 1):D623–D631, Jan. 2008.
[33] J. M. Chase. Stochastic community assembly causes higher biodiversity in more productive environments. Science (New York, N.Y.), 328(5984):1388–1391, June 2010.
[34] B. L. Chen, D. H. Hall, and D. B. Chklovskii. Wiring optimization can relate neuronal structure and function.
Proceedings of the NationalAcademy of Sciences of the United States of America, 103(12):4723–4728,Mar. 2006.[35] A. Clauset, M. E. J. Newman, and C. Moore. Finding community structurein very large networks. Physical Review E, 70(6), Dec. 2004. arXiv:cond-mat/0408187.[36] A. Clauset, C. R. Shalizi, and M. E. J. Newman. Power-law distributions inempirical data. SIAM Review, 51(4):661–703, Nov. 2009. arXiv:0706.1062.[37] M. K. Coleman and D. S. Parker. Aesthetics-based Graph Layout forHuman Consumption. Softw. Pract. Exper., 26(12):1415–1438, Dec. 1996.[38] T. H. M. P. Consortium. Structure, function and diversity of the healthyhuman microbiome. Nature, 486(7402):207–214, June 2012.[39] W. R. Cookson, D. A. Abaye, P. Marschner, D. V. Murphy, E. A. Stockdale,and K. W. T. Goulding. The contribution of soil organic matter fractions tocarbon and nitrogen mineralization and microbial community size andstructure. Soil Biology and Biochemistry, 37(9):1726–1737, Sept. 2005.113[40] U. Dobrindt, B. Hochhut, U. Hentschel, and J. Hacker. Genomic islands inpathogenic and environmental microorganisms. Nature Reviews.Microbiology, 2(5):414–424, May 2004.[41] J. A. Dodsworth, P. C. Blainey, S. K. Murugapiran, W. D. Swingley, C. A.Ross, S. G. Tringe, P. S. G. Chain, M. B. Scholz, C.-C. Lo, J. Raymond,S. R. Quake, and B. P. Hedlund. Single-cell and metagenomic analysesindicate a fermentative and saccharolytic lifestyle for members of the OP9lineage. Nature Communications, 4:1854, 2013.[42] J. Drge and A. C. McHardy. Taxonomic binning of metagenome samplesgenerated by next-generation sequencing technologies. Briefings inBioinformatics, 13(6):646–655, Nov. 2012.[43] M. Driscoll and J. Kaplan. Mechanotransduction. In D. L. Riddle,T. Blumenthal, B. J. Meyer, and J. R. Priess, editors, C. elegans II. ColdSpring Harbor Laboratory Press, Cold Spring Harbor (NY), 2nd edition,1997. ISBN 0879695323.[44] A. J. Dumbrell, M. Nelson, T. Helgason, C. Dytham, and A. H. 
Fitter.Relative roles of niche and neutral processes in structuring a soil microbialcommunity. The ISME Journal, 4(3):337–345, Nov. 2009.[45] J. A. Dunne, R. J. Williams, and N. D. Martinez. Network structure andbiodiversity loss in food webs: robustness increases with connectance.Ecology Letters, 5(4):558–567, 2002.[46] A. E. Duran-Pinedo, B. Paster, R. Teles, and J. Frias-Lopez. CorrelationNetwork Analysis Applied to Complex Biofilm Communities. PLoS ONE,6(12):e28438, Dec. 2011.[47] A. Eiler, F. Heinrich, and S. Bertilsson. Coherent dynamics and associationnetworks among lake bacterioplankton taxa. The ISME Journal, 6(2):330–342, Feb. 2012.[48] E. Estrada. Characterization of topological keystone species: Local, globaland meso-scale centralities in food webs. Ecological Complexity, 4(12):48–57, Mar. 2007.[49] E. Estrada and r. Bodin. Using network centrality measures to managelandscape connectivity. Ecological Applications, 18(7):1810–1825, Sept.2008.114[50] M. Fahle. Human pattern recognition: parallel processing and perceptuallearning. Perception, 23(4):411–427, 1994.[51] P. G. Falkowski, T. Fenchel, and E. F. Delong. The Microbial Engines ThatDrive Earth’s Biogeochemical Cycles. Science, 320(5879):1034–1039,May 2008.[52] S. L. Fann and S. R. Borrett. Environ centrality reveals the tendency ofindirect effects to homogenize the functional importance of species inecosystems. Journal of Theoretical Biology, 294:74–86, Feb. 2012.[53] K. Faust. CoNet - A Cytoscape plugin that detects significant association inpresence/absence and abundance matrices, 2014.[54] K. Faust and J. Raes. Microbial interactions: from networks to models.Nature Reviews Microbiology, 10(8):538–550, Aug. 2012.[55] K. Faust, J. F. Sathirapongsasuti, J. Izard, N. Segata, D. Gevers, J. Raes,and C. Huttenhower. Microbial Co-occurrence Relationships in the HumanMicrobiome. PLoS Comput Biol, 8(7):e1002606, July 2012.[56] J. Fekete. 
Visualizing networks using adjacency matrices: Progresses andchallenges. In 11th IEEE International Conference on Computer-AidedDesign and Computer Graphics, 2009. CAD/Graphics ’09, pages 636–638,Aug. 2009.[57] N. Fierer, J. P. Schimel, and P. A. Holden. Variations in microbialcommunity composition through two soil depth profiles. Soil Biology andBiochemistry, 35(1):167–176, Jan. 2003.[58] N. Fierer, M. A. Bradford, and R. B. Jackson. Toward an ecologicalclassification of soil bacteria. Ecology, 88(6):1354–1364, June 2007.[59] B. J. Finlay, S. C. Maberly, and J. I. Cooper. Microbial Diversity andEcosystem Function. Oikos, 80(2):209–213, Nov. 1997.[60] T. Fleming, S.-C. Chien, P. J. Vanderzalm, M. Dell, M. K. Gavin, W. C.Forrester, and G. Garriga. The role of C. elegans Ena/VASP homologUNC-34 in neuronal polarity and motility. Developmental Biology, 344(1):94–106, Aug. 2010.[61] R. M. Fox, S. E. Von Stetina, S. J. Barlow, C. Shaffer, K. L. Olszewski,J. H. Moore, D. Dupuy, M. Vidal, and D. M. Miller. A gene expressionfingerprint of C. elegans embryonic motor neurons. BMC Genomics, 6:42,Mar. 2005.115[62] L. C. Freeman. Centrality in social networks conceptual clarification.Social Networks, 1(3):215–239, 1978.[63] J. Friedman and E. J. Alm. Inferring Correlation Networks from GenomicSurvey Data. PLoS Comput Biol, 8(9):e1002687, Sept. 2012.[64] T. M. J. Fruchterman and E. M. Reingold. Graph drawing by force-directedplacement. Software: Practice and Experience, 21(11):1129–1164, Nov.1991.[65] H. Gibson, J. Faith, and P. Vickers. A survey of two-dimensional graphlayout techniques for information visualisation. Information Visualization,12(3-4):324–357, July 2013.[66] E. A. Gies, K. M. Konwar, J. T. Beatty, and S. J. Hallam. IlluminatingMicrobial Dark Matter in Meromictic Sakinaw Lake. Applied andEnvironmental Microbiology, pages AEM.01774–14, Aug. 2014.[67] J. A. Gilbert, J. K. Jansson, and R. Knight. The Earth Microbiome project:successes and aspirations. 
BMC Biology, 12(1):69, Aug. 2014.[68] M. Girvan and M. E. J. Newman. Community structure in social andbiological networks. Proceedings of the National Academy of Sciences ofthe United States of America, 99(12):7821–7826, June 2002.[69] K.-I. Goh, M. E. Cusick, D. Valle, B. Childs, M. Vidal, and A.-L. Barabsi.The human disease network. Proceedings of the National Academy ofSciences, 104(21):8685–8690, May 2007.[70] J. M. Gray, D. S. Karow, H. Lu, A. J. Chang, J. S. Chang, R. E. Ellis, M. A.Marletta, and C. I. Bargmann. Oxygen sensation and social feedingmediated by a C. elegans guanylate cyclase homologue. Nature, 430(6997):317–322, July 2004.[71] J. M. Gray, J. J. Hill, and C. I. Bargmann. A circuit for navigation inCaenorhabditis elegans. Proceedings of the National Academy of Sciencesof the United States of America, 102(9):3184–3191, Mar. 2005.[72] S. J. Hallam. Mechanisms of neuronal asymmetry and synaptic remodelingin the nematode Caenorhabditis elegans. PhD thesis, 2000.[73] S. J. Hallam and J. P. McCutcheon. Microbes don’t play solitaire: howcooperation trumps isolation in the microbial world. EnvironmentalMicrobiology Reports, 7(1):26–28, Feb. 2015.116[74] C. M. Hansel, S. Fendorf, P. M. Jardine, and C. A. Francis. Changes inBacterial and Archaeal Community Structure and Functional Diversityalong a Geochemically Variable Soil Profile. Applied and EnvironmentalMicrobiology, 74(5):1620–1633, Mar. 2008.[75] N. Hanson, K. Konwar, S.-J. Wu, and S. Hallam. MetaPathways v2.0: Amaster-worker model for environmental Pathway/Genome Databaseconstruction on grids and clouds. In 2014 IEEE Conference onComputational Intelligence in Bioinformatics and Computational Biology,pages 1–7, May 2014.[76] M. Hartmann, S. Lee, S. J. Hallam, and W. W. Mohn. Bacterial, archaealand eukaryal community structures throughout soil horizons of harvestedand naturally disturbed forest stands. Environmental Microbiology, 11(12):3045–3062, 2009.[77] M. Hartmann, C. G. Howes, D. 
VanInsberghe, H. Yu, D. Bachar,R. Christen, R. Henrik Nilsson, S. J. Hallam, and W. W. Mohn. Significantand persistent impact of timber harvesting on soil microbial communitiesin Northern coniferous forests. The ISME journal, 6(12):2199–2218, Dec.2012.[78] M. Hartmann, P. A. Niklaus, S. Zimmermann, S. Schmutz, J. Kremer,K. Abarenkov, P. Lscher, F. Widmer, and B. Frey. Resistance and resilienceof the forest soil microbiome to logging-associated compaction. The ISMEJournal, 8(1):226–244, Jan. 2014.[79] A. Hector, B. Schmid, C. Beierkuhnlein, M. C. Caldeira, M. Diemer, P. G.Dimitrakopoulos, J. A. Finn, H. Freitas, P. S. Giller, J. Good, R. Harris,P. Hgberg, K. Huss-Danell, J. Joshi, A. Jumpponen, C. Krner, P. W.Leadley, M. Loreau, A. Minns, C. P. H. Mulder, G. O’Donovan, S. J.Otway, J. S. Pereira, A. Prinz, D. J. Read, M. Scherer-Lorenzen, E.-D.Schulze, A.-S. D. Siamantziouras, E. M. Spehn, A. C. Terry, A. Y.Troumbis, F. I. Woodward, S. Yachi, and J. H. Lawton. Plant Diversity andProductivity Experiments in European Grasslands. Science, 286(5442):1123–1127, Nov. 1999.[80] M. C. Horner-Devine, J. M. Silver, M. A. Leibold, B. J. M. Bohannan,R. K. Colwell, J. A. Fuhrman, J. L. Green, C. R. Kuske, J. B. H. Martiny,G. Muyzer, L. Ovres, A.-L. Reysenbach, and V. H. Smith. A comparison oftaxon co-occurrence patterns for macro- and microorganisms. Ecology, 88(6):1345–1353, June 2007.117[81] S. P. Hubbell. The Unified Neutral Theory of Biodiversity andBiogeography (MPB-32). Princeton University Press, Apr. 2001. ISBN0691021287.[82] B. L. Hurwitz, S. J. Hallam, and M. B. Sullivan. Metabolic reprogrammingby viruses in the sunlit and dark ocean. Genome Biology, 14(11):R123,Nov. 2013.[83] D. H. Huson, S. Mitra, H.-J. Ruscheweyh, N. Weber, and S. C. Schuster.Integrative analysis of environmental sequences using MEGAN4. GenomeResearch, 21(9):1552–1560, Sept. 2011.[84] A. Inselberg and B. Dimsdale. Parallel Coordinates: A Tool for VisualizingMulti-dimensional Geometry. 
In Proceedings of the 1st Conference onVisualization ’90, VIS ’90, pages 361–378, Los Alamitos, CA, USA, 1990.IEEE Computer Society Press. ISBN 0-8186-2083-8.[85] S. Iyer, T. Killingback, B. Sundaram, and Z. Wang. Attack Robustness andCentrality of Complex Networks. PLoS ONE, 8(4):e59613, Apr. 2013.[86] T. A. Jarrell, Y. Wang, A. E. Bloniarz, C. A. Brittin, M. Xu, J. N. Thomson,D. G. Albertson, D. H. Hall, and S. W. Emmons. The Connectome of aDecision-Making Neural Network. Science, 337(6093):437–444, July2012.[87] H. Jeong, S. P. Mason, A.-L. Barabsi, and Z. N. Oltvai. Lethality andcentrality in protein networks. Nature, 411(6833):41–42, May 2001.[88] B. H. Junker and F. Schreiber. Analysis of Biological Networks. John Wiley& Sons, Sept. 2011. ISBN 9781118209912.[89] E. L. Kara, P. C. Hanson, Y. H. Hu, L. Winslow, and K. D. McMahon. Adecade of seasonal dynamics and co-occurrences within freshwaterbacterioplankton communities from eutrophic Lake Mendota, WI, USA.The ISME Journal, 7(3):680–684, Mar. 2013.[90] J. Karbowski, G. Schindelman, C. J. Cronin, A. Seah, and P. W. Sternberg.Systems level circuit model of C. elegans undulatory locomotion:mathematical modeling and molecular genetics. Journal of ComputationalNeuroscience, 24(3):253–276, June 2008.[91] P. Khanna. Essentials of Genetics. I. K. International Pvt Ltd, 2010. ISBN9789380026343.118[92] H. Kitano. Computational systems biology. Nature, 420(6912):206–210,Nov. 2002.[93] K. M. Konwar, N. W. Hanson, A. P. Pag, and S. J. Hallam. MetaPathways:a modular pipeline for constructing pathway/genome databases fromenvironmental sequence information. BMC bioinformatics, 14:202, 2013.[94] J. Kranabetter. Site carbon storage along productivity gradients of alate-seral southern boreal forest. Canadian Journal of Forest Research, 39(5):1053–1060, May 2009.[95] M. Krzywinski, J. Schein, n. Birol, J. Connors, R. Gascoyne, D. Horsman,S. J. Jones, and M. A. Marra. Circos: An information aesthetic forcomparative genomics. 
Genome Research, 19(9):1639–1645, Sept. 2009.[96] M. Krzywinski, I. Birol, S. J. Jones, and M. A. Marra. Hive plotsrationalapproach to visualizing networks. Briefings in Bioinformatics, 13(5):627–644, Sept. 2012.[97] R. Lal. Soil Carbon Sequestration Impacts on Global Climate Change andFood Security. Science, 304(5677):1623–1627, June 2004.[98] R. Lamendella, S. Strutt, S. Borglin, R. Chakraborty, N. Tas, O. U. Mason,J. Hultman, E. Prestat, T. C. Hazen, and J. K. Jansson. Assessment of theDeepwater Horizon oil spill impact on Gulf coast microbial communities.Frontiers in Microbiology, 5:130, 2014.[99] M. G. I. Langille and F. S. L. Brinkman. IslandViewer: an integratedinterface for computational identification and visualization of genomicislands. Bioinformatics, 25(5):664–665, Mar. 2009.[100] C. L. Lauber, M. Hamady, R. Knight, and N. Fierer. Pyrosequencing-BasedAssessment of Soil pH as a Predictor of Soil Bacterial CommunityStructure at the Continental Scale. Applied and EnvironmentalMicrobiology, 75(15):5111–5120, Aug. 2009.[101] P. Legendre and L. Legendre. Numerical Ecology. In P. a. L. Legendre,editor, Developments in Environmental Modelling, volume 24 of NumericalEcology, pages 337–424. Elsevier, 2012.[102] C. Lei and J. Ruan. A novel link prediction algorithm for reconstructingprotein-protein interaction networks by topological similarity.Bioinformatics, page bts688, Dec. 2012.119[103] M. Li, J. Wang, and J. Chen. A Fast Agglomerate Algorithm for MiningFunctional Modules in Protein Interaction Networks. In InternationalConference on BioMedical Engineering and Informatics, 2008. BMEI2008, volume 1, pages 3–7, May 2008.[104] C. A. Lozupone and R. Knight. Global patterns in bacterial diversity.Proceedings of the National Academy of Sciences of the United States ofAmerica, 104(27):11436–11440, July 2007.[105] M. Lupatini, A. K. A. Suleiman, R. J. S. Jacques, Z. I. Antoniolli,A. de Siqueira Ferreira, E. E. Kuramae, and L. F. W. Roesch. 
Networktopology reveals high connectance levels and few key microbial generawithin soils. Soil Processes, 2:10, 2014.[106] G. S. Maro, M. P. Klassen, and K. Shen. A -Catenin-Dependent WntPathway Mediates Anteroposterior Axon Guidance in C. elegans MotorNeurons. PLoS ONE, 4(3):e4690, Mar. 2009.[107] A. M. Martn Gonzlez, B. Dalsgaard, and J. M. Olesen. Centrality measuresand the importance of generalist species in pollination networks.Ecological Complexity, 7(1):36–43, Mar. 2010.[108] C. J. Marx. Getting in Touch with Your Friends. Science, 324(5931):1150–1151, May 2009.[109] S. L. McIntire, E. Jorgensen, and H. R. Horvitz. Genes required for GABAfunction in Caenorhabditis elegans. Nature, 364(6435):334–337, July 1993.[110] M. T. Mee, J. J. Collins, G. M. Church, and H. H. Wang. Syntrophicexchange in synthetic microbial communities. Proceedings of the NationalAcademy of Sciences of the United States of America, 111(20):E2149–2156, May 2014.[111] T. Mino and H. Satoh. Wastewater genomics. Nature Biotechnology, 24(10):1229–1230, Oct. 2006.[112] E. Mkinen. On circular layouts. International Journal of ComputerMathematics, 24(1):29–37, Jan. 1988.[113] S. Mller, C. Sternberg, J. B. Andersen, B. B. Christensen, J. L. Ramos,M. Givskov, and S. Molin. In situ gene expression in mixed-culturebiofilms: evidence of metabolic interactions between community members.Applied and Environmental Microbiology, 64(2):721–732, Feb. 1998.120[114] S. Mocali and A. Benedetti. Exploring research frontiers in microbiology:the challenge of metagenomics in soil microbiology. Research inMicrobiology, 161(6):497–505, July 2010.[115] T. Munzner. Visualization Analysis and Design. AK Peters VisualizationSeries. A K Peters/CRC Press, 2014.[116] A. Naqvi, H. Rangwala, A. Keshavarzian, and P. Gillevet. Network-basedmodeling of the human gut microbiome. Chemistry & Biodiversity, 7(5):1040–1050, May 2010.[117] M. E. J. Newman. 
The structure of scientific collaboration networks.Proceedings of the National Academy of Sciences, 98(2):404–409, Jan.2001.[118] M. E. J. Newman. The structure and function of complex networks. SIAMREVIEW, 45:167–256, 2003.[119] M. E. J. Newman. Detecting community structure in networks. TheEuropean Physical Journal B - Condensed Matter and Complex Systems,38(2):321–330, Mar. 2004.[120] M. E. J. Newman. Modularity and community structure in networks.Proceedings of the National Academy of Sciences, 103(23):8577–8582,June 2006.[121] D. S. Page-Dumroese, M. F. Jurgensen, A. E. Tiarks, F. Ponder, F. G.Sanchez, R. L. Fleming, J. M. Kranabetter, R. F. Powers, D. M. Stone, J. D.Elioff, and D. A. Scott. Soil physical property changes at the NorthAmerican Long-Term Soil Productivity study sites: 1 and 5 years aftercompaction. 2006.[122] M. Pascual and J. A. Dunne. Ecological Networks: Linking Structure toDynamics in Food Webs. Oxford University Press, Nov. 2005. ISBN9780199775057.[123] G. C. Pereira, F. F. Santos, and N. F. F. Ebecken. Centrality and NetworkAnalysis in a Natural Perturbed Ecosystem. In R. Menezes, A. Evsukoff,and M. C. Gonzlez, editors, Complex Networks, number 424 in Studies inComputational Intelligence, pages 217–224. Springer Berlin Heidelberg,Jan. 2013. ISBN 978-3-642-30286-2, 978-3-642-30287-9.121[124] A. Perer and B. Shneiderman. Balancing Systematic and FlexibleExploration of Social Networks. IEEE Transactions on Visualization andComputer Graphics, 12(5):693–700, Sept. 2006.[125] A. Perer and B. Shneiderman. Integrating Statistics and Visualization:Case Studies of Gaining Clarity During Exploratory Data Analysis. InProceedings of the SIGCHI Conference on Human Factors in ComputingSystems, CHI ’08, pages 265–274, New York, NY, USA, 2008. ACM.ISBN 978-1-60558-011-1.[126] A. Perer and B. Shneiderman. Systematic Yet Flexible Discovery: GuidingDomain Experts Through Exploratory Data Analysis. 
In Proceedings of the13th International Conference on Intelligent User Interfaces, IUI ’08,pages 109–118, New York, NY, USA, 2008. ACM. ISBN978-1-59593-987-6.[127] N. Perra and S. Fortunato. Spectral centrality measures in complexnetworks. Physical Review E, 78(3):036107, Sept. 2008.[128] M. Pester, K.-H. Knorr, M. W. Friedrich, M. Wagner, and A. Loy.Sulfate-Reducing Microorganisms in Wetlands Fameless Actors in CarbonCycling and Climate Change. Frontiers in Microbiology, 3, Feb. 2012.[129] S. Peura, S. Bertilsson, R. I. Jones, and A. Eiler. Resistant microbialco-occurrence patterns inferred by network topology. Applied andEnvironmental Microbiology, pages AEM.03660–14, Jan. 2015.[130] D. Pils, A. Bachmayr-Heyda, K. Auer, M. Svoboda, V. Auner, G. Hager,E. Obermayr, A. Reiner, A. Reinthaller, P. Speiser, I. Braicu, J. Sehouli,S. Lambrechts, I. Vergote, S. Mahner, A. Berger, D. Cacsire Castillo-Tong,and R. Zeillinger. Cyclin E1 (CCNE1) as independent positive prognosticfactor in advanced stage serous ovarian cancer patients A study of theOVCAD consortium. European Journal of Cancer, 50(1):99–110, Jan.2014.[131] C. Plaisant. The Challenge of Information Visualization Evaluation. InProceedings of the Working Conference on Advanced Visual Interfaces,AVI ’04, pages 109–116, New York, NY, USA, 2004. ACM. ISBN1-58113-867-9.[132] F. Ponder Jr., R. L. Fleming, S. Berch, M. D. Busse, J. D. Elioff, P. W.Hazlett, R. D. Kabzems, J. Marty Kranabetter, D. M. Morris,D. Page-Dumroese, B. J. Palik, R. F. Powers, F. G. Sanchez,122D. Andrew Scott, R. H. Stagg, D. M. Stone, D. H. Young, J. Zhang, K. H.Ludovici, D. W. McKenney, D. S. Mossa, P. T. Sanborn, and R. A.Voldseth. Effects of organic matter removal, soil compaction andvegetation control on 10th year biomass and foliar nutrition: LTSPcontinent-wide comparisons. Forest Ecology and Management, 278:35–54,Aug. 2012.[133] T. Powell, F. Schneider, and N. Maragioglio. JavaScript: The CompleteReference, 2Nd Edition. 
McGraw-Hill, Inc., New York, NY, USA, 2edition, 2004. ISBN 0072253576, 9780072253573.[134] M. E. Power, D. Tilman, J. A. Estes, B. A. Menge, W. J. Bond, L. S. Mills,G. Daily, J. C. Castilla, J. Lubchenco, and R. T. Paine. Challenges in theQuest for Keystones Identifying keystone species is difficultbut essential tounderstanding how loss of species will affect ecosystems. BioScience, 46(8):609–620, Sept. 1996.[135] R. F. Powers. Sustaining site productivity in North American forests:problems and prospects. In S. Gessel, D. Lacate, and G. Weetman, editors,Porceedings from the 7th North American Soil Forests Conference,Vancouver, BC, 1990. Faculty of Forestry, University of British Columbia.[136] R. F. Powers, D. Andrew Scott, F. G. Sanchez, R. A. Voldseth,D. Page-Dumroese, J. D. Elioff, and D. M. Stone. The North Americanlong-term soil productivity experiment: Findings from the first decade ofresearch. Forest Ecology and Management, 220(13):31–50, Dec. 2005.[137] C. Quast, E. Pruesse, P. Yilmaz, J. Gerken, T. Schweer, P. Yarza, J. Peplies,and F. O. Glckner. The SILVA ribosomal RNA gene database project:improved data processing and web-based tools. Nucleic Acids Research,page gks1219, Nov. 2012.[138] Y. Rafrafi, E. Trably, J. Hamelin, E. Latrille, I. Meynial-Salles, S. Benomar,M.-T. Giudici-Orticoni, and J.-P. Steyer. Sub-dominant bacteria askeystone species in microbial communities producing bio-hydrogen.International Journal of Hydrogen Energy, 38(12):4975–4985, Apr. 2013.[139] P. H. Rampelotto, A. D. M. Barboza, A. B. Pereira, E. W. Triplett, C. E.G. R. Schaefer, F. A. de Oliveira Camargo, and L. F. W. Roesch.Distribution and interaction patterns of bacterial communities in anornithogenic soil of Seymour Island, Antarctica. Microbial Ecology, 69(3):684–694, Apr. 2015.123[140] S. Rayu, D. G. Karpouzas, and B. K. Singh. Emerging technologies inbioremediation: constraints and opportunities. Biodegradation, 23(6):917–926, Nov. 2012.[141] L. F. W. Roesch, R. R. 
Fulthorpe, A. Riva, G. Casella, A. K. M. Hadwin,A. D. Kent, S. H. Daroub, F. A. O. Camargo, W. G. Farmerie, and E. W.Triplett. Pyrosequencing enumerates and contrasts soil microbial diversity.The ISME Journal, 1(4):283–290, July 2007.[142] G. Rossum. Python Tutorial. Technical report, CWI (Centre forMathematics and Computer Science), Amsterdam, The Netherlands, TheNetherlands, 1995.[143] D. Sagan. Cosmic Apprentice: Dispatches from the Edges of Science. UnivOf Minnesota Press, Minneapolis ; London, May 2013. ISBN9780816681358.[144] P. D. Schloss, S. L. Westcott, T. Ryabin, J. R. Hall, M. Hartmann, E. B.Hollister, R. A. Lesniewski, B. B. Oakley, D. H. Parks, C. J. Robinson,J. W. Sahl, B. Stres, G. G. Thallinger, D. J. V. Horn, and C. F. Weber.Introducing mothur: Open-Source, Platform-Independent,Community-Supported Software for Describing and Comparing MicrobialCommunities. Applied and Environmental Microbiology, 75(23):7537–7541, Dec. 2009.[145] A. Shade, H. Peter, S. D. Allison, D. L. Baho, M. Berga, H. Brgmann, D. H.Huber, S. Langenheder, J. T. Lennon, J. B. H. Martiny, K. L. Matulich,T. M. Schmidt, and J. Handelsman. Fundamentals of Microbial CommunityResistance and Resilience. Frontiers in Microbiology, 3, Dec. 2012.[146] P. Shannon, A. Markiel, O. Ozier, N. S. Baliga, J. T. Wang, D. Ramage,N. Amin, B. Schwikowski, and T. Ideker. Cytoscape: a softwareenvironment for integrated models of biomolecular interaction networks.Genome Research, 13(11):2498–2504, Nov. 2003.[147] B. Shneiderman. The Eyes Have It: A Task by Data Type Taxonomy forInformation Visualizations. In Proceedings of the 1996 IEEE Symposiumon Visual Languages, VL ’96, pages 336–, Washington, DC, USA, 1996.IEEE Computer Society. ISBN 0-8186-7508-X.[148] R. R. Sokal and F. J. Rohlf. The Comparison of Dendrograms by ObjectiveMethods. Taxon, 11(2):33–40, Feb. 1962.124[149] J. T. Staley. The bacterial species dilemma and the genomicphylogeneticspecies concept. 
Philosophical Transactions of the Royal Society B:Biological Sciences, 361(1475):1899–1909, Nov. 2006.[150] J. A. Steele, P. D. Countway, L. Xia, P. D. Vigil, J. M. Beman, D. Y. Kim,C.-E. T. Chow, R. Sachdeva, A. C. Jones, M. S. Schwalbach, J. M. Rose,I. Hewson, A. Patel, F. Sun, D. A. Caron, and J. A. Fuhrman. Marinebacterial, archaeal and protistan association networks reveal ecologicallinkages. The ISME Journal, 5(9):1414–1425, Sept. 2011.[151] N. S. Sutherland. Outlines of a Theory of Visual Pattern Recognition inAnimals and Man. Proceedings of the Royal Society of London. Series B.Biological Sciences, 171(1024):297–317, Dec. 1968.[152] R. Suzuki and H. Shimodaira. pvclust: Hierarchical Clustering withP-Values via Multiscale Bootstrap Resampling, Dec. 2014.[153] S. D. P. r. B. Tabacof, Tim //r//nAU Larson. Beyond the connectomehairball: Rational visualizations and analysis of the C. elegans connectomeas a network graph using hive plots. Frontiers in Neuroinformatics.[154] X. Tan, S. X. Chang, and R. Kabzems. Soil compaction and forest floorremoval reduced microbial biomass and enzyme activities in a boreal aspenforest soil. Biology and Fertility of Soils, 44(3):471–479, Aug. 2007.[155] Tapiocozzo. Figure of six centrality measureson same graph (adapted bySarah Perez), Apr. 2015. Page Version ID: 657599368.[156] E. Tortoli. Impact of Genotypic Studies on Mycobacterial Taxonomy: theNew Mycobacteria of the 1990s. Clinical Microbiology Reviews, 16(2):319–354, Apr. 2003.[157] E. K. Towlson, P. E. Vrtes, S. E. Ahnert, W. R. Schafer, and E. T. Bullmore.The rich club of the C. elegans neuronal connectome. The Journal ofNeuroscience: The Official Journal of the Society for Neuroscience, 33(15):6380–6387, Apr. 2013.[158] A. Valverde, T. P. Makhalanyane, and D. A. Cowan. Contrasting assemblyprocesses in a bacterial metacommunity along a desiccation gradient.Frontiers in Microbiology, 5, Dec. 2014.[159] L. R. Varshney, B. L. Chen, E. Paniagua, D. H. Hall, and D. B. 
Chklovskii.Structural Properties of the Caenorhabditis elegans Neuronal Network.PLoS Comput Biol, 7(2):e1001066, Feb. 2011.125[160] A. T. Vincent and S. J. Charette. Freedom in bioinformatics. Frontiers inGenetics, 5, July 2014.[161] W. W. Walthall, L. Li, J. A. Plunkett, and C.-Y. Hsu. Changing synapticspecificities in the nervous system of Caenorhabditis elegans:Differentiation of the DD motoneurons. Journal of Neurobiology, 24(12):1589–1599, Dec. 1993.[162] S. Wasserman. Social Network Analysis: Methods and Applications.Cambridge University Press, Nov. 1994. ISBN 9780521387071.[163] D. J. Watts and S. H. Strogatz. Collective dynamics of small-worldnetworks. Nature, 393(6684):440–442, June 1998.[164] J. G. White, D. G. Albertson, and M. a. R. Anness. Connectivity changes ina class of motoneurone during the development of a nematode. Nature, 271(5647):764–766, Feb. 1978.[165] J. G. White, E. Southgate, J. N. Thomson, and S. Brenner. Factors thatdetermine connectivity in the nervous system of Caenorhabditis elegans.Cold Spring Harbor Symposia on Quantitative Biology, 48 Pt 2:633–640,1983.[166] J. G. White, E. Southgate, J. N. Thomson, and S. Brenner. The Structure ofthe Nervous System of the Nematode Caenorhabditis elegans.Philosophical Transactions of the Royal Society of London B: BiologicalSciences, 314(1165):1–340, Nov. 1986.[167] W. B. Whitman, D. C. Coleman, and W. J. Wiebe. Prokaryotes: The unseenmajority. Proceedings of the National Academy of Sciences, 95(12):6578–6583, June 1998.[168] C. Will, A. Thrmer, A. Wollherr, H. Nacke, N. Herold, M. Schrumpf,J. Gutknecht, T. Wubet, F. Buscot, and R. Daniel. Horizon-specificbacterial community composition of German grassland soils, as revealed bypyrosequencing-based analysis of 16s rRNA genes. Applied andEnvironmental Microbiology, 76(20):6751–6759, Oct. 2010.[169] R. J. Williams, A. Howe, and K. S. Hofmockel. Demonstrating microbialco-occurrence pattern analyses within and between ecosystems. 
TerrestrialMicrobiology, 5:358, 2014.126[170] J. J. Wright. Microbial community structure and ecology of Marine GroupA bacteria in the oxygen minimum zone of the Northeast subarctic PacificOcean. 2013.[171] J. J. Wright, K. M. Konwar, and S. J. Hallam. Microbial ecology ofexpanding oxygen minimum zones. Nature Reviews Microbiology, 10(6):381–394, June 2012.[172] E. Wu, T. Nance, and S. B. Montgomery. SplicePlot: a utility forvisualizing splicing quantitative trait loci. Bioinformatics (Oxford,England), 30(7):1025–1026, Apr. 2014.[173] W. Zachary. Information-Flow Model for Conflict and Fission inSmall-Groups. Journal of Anthropological Research, 33(4):452–473, 1977.WOS:A1977FG85500006.[174] A. J. Zehnder and T. D. Brock. Methane formation and methane oxidationby methanogenic bacteria. Journal of Bacteriology, 137(1):420–432, Jan.1979.[175] A. Zelezniak, S. Andrejev, O. Ponomarova, D. R. Mende, P. Bork, andK. R. Patil. Metabolic dependencies drive species co-occurrence in diversemicrobial communities. Proceedings of the National Academy of Sciences,page 201421834, May 2015.[176] J. Zhao, Q. Liu, and X. Wang. Competitive Dynamics on ComplexNetworks. Scientific Reports, 4, July 2014.[177] M. Zhen and A. D. Samuel. C. elegans locomotion: small circuits, complexfunctions. Current Opinion in Neurobiology, 33:117–126, Aug. 2015.[178] J. Zhou, Y. Deng, F. Luo, Z. He, and Y. Yang. Phylogenetic MolecularEcological Network of Soil Microbial Communities in Response toElevated CO2. mBio, 2(4):e00122–11, Sept. 
Appendix A

Chapter 3 supporting material

Table A.1: Number of sequences recovered for samples in ecozone JP with treatment OM0

Sample Id   Number of sequences   Sample Id   Number of sequences
JE122       9437                  JS083       9810
JE123       3711                  JS084       7643
JE124       7607                  JW013       5730
JE125       7943                  JW014       5765
JE126       11042                 JW019       15763
JS079       16247                 JW020       9750
JS080       9802                  JW026       6080

Table A.2: Number of sequences recovered for samples in ecozone JP with treatment OM1

Sample Id   Number of sequences   Sample Id   Number of sequences
JE086       5351                  JS066       6669
JE087       6722                  JS075       11716
JE088       8898                  JS076       3465
JE105       7970                  JS077       7166
JE106       5012                  JS078       10627
JE107       7717                  JW005       6010
JE108       5046                  JW006       10136
JE117       6965                  JW007       7390
JE118       16540                 JW008       5974
JE119       7537                  JW021       8619
JE120       8787                  JW022       20003
JS043       7756                  JW023       14810
JS044       7295                  JW024       3854
JS045       11382                 JW027       10694
JS046       8793                  JW028       10651
JS063       3647                  JW029       7678
JS064       8167                  JW030       7358

Table A.3: Number of sequences recovered for samples in ecozone JP with treatment OM2

Sample Id   Number of sequences   Sample Id   Number of sequences
JE094       1964                  JS060       10059
JE095       12584                 JS062       12919
JE096       8922                  JS068       8416
JE101       8847                  JW015       6425
JE102       6212                  JW016       7745
JE103       4257                  JW017       8166
JE104       4355                  JW018       7403
JE113       6989                  JW031       7802
JE114       5888                  JW032       8521
JE115       7187                  JW033       8590
JE116       8624                  JW034       10237
JS051       4689                  JW035       7383
JS052       6653                  JW036       4180
JS053       16082                 JW037       11295
JS054       10421                 JW038       6181

Table A.4: Number of sequences recovered for samples in ecozone JP with treatment OM3

Sample Id   Number of sequences   Sample Id   Number of sequences
JE092       6223                  JS074       5337
JE098       11849                 JW002       5945
JE100       4294                  JW003       6755
JE110       8403                  JW004       9522
JE112       7751                  JW010       6313
JS048       10142                 JW012       8096
JS050       9501                  JW040       6700
JS056       12999                 JW042       7214
JS058       6284

Table A.5: Number of sequences recovered for samples in ecozone MD with treatment OM0

Sample Id   Number of sequences   Sample Id   Number of sequences
BL044       8237                  BR071       6132
BL045       9041                  BR072       8839
BL046       9110                  LH019       6389
BL047       5374                  LH020       4973
BL048       7757                  LH021       6698
BR067       10270                 LH022       3870
BR068       10720                 LH023       5030
BR069       8476                  LH024       5034

Table A.6: Number of sequences recovered for samples in ecozone MD with treatment OM1

Sample Id   Number of sequences   Sample Id   Number of sequences
BL026       12668                 BR053       8763
BL027       5182                  BR054       3884
BL028       3677                  LH001       10345
BL029       5411                  LH002       6087
BL030       11238                 LH003       6025
BR049       4815                  LH004       7804
BR050       2532                  LH005       6764
BR051       5121                  LH006       6655

Table A.7: Number of sequences recovered for samples in ecozone MD with treatment OM2

Sample Id   Number of sequences   Sample Id   Number of sequences
BL032       3921                  BR059       4798
BL033       5451                  BR060       7948
BL034       3279                  LH007       9937
BL035       3456                  LH008       1883
BL036       12921                 LH009       9625
BR055       10010                 LH010       7537
BR056       9974                  LH011       8605
BR057       6757                  LH012       6000

Table A.8: Number of sequences recovered for samples in ecozone MD with treatment OM3

Sample Id   Number of sequences   Sample Id   Number of sequences
BL038       7839                  BR065       9514
BL039       10362                 BR066       6010
BL040       12021                 LH013       12495
BL041       9506                  LH014       4967
BL042       6956                  LH015       6512
BR061       6097                  LH016       8432
BR062       7366                  LH017       8522
BR063       8168                  LH018       6185

Table A.9: Number of sequences recovered for samples in ecozone SBS with treatment OM0

Sample Id   Number of sequences   Sample Id   Number of sequences
LL056       3716                  SL180       2169
LL057       3025                  TO115       4482
LL058       2052                  TO116       3711
LL059       3518                  TO117       3956
LL060       3699                  TO118       2654
SL175       2899                  TO119       1973
SL176       1722                  TO120       2308
SL177       2213

Table A.10: Number of sequences recovered for samples in ecozone SBS with treatment OM1

Sample Id   Number of sequences   Sample Id   Number of sequences
LL002       4313                  SL131       3430
LL003       3970                  SL132       2267
LL004       4455                  SL133       2516
LL005       3305                  SL134       1797
LL006       4047                  SL135       3095
LL019       4154                  SL136       1992
LL020       3120                  SL137       1721
LL021       4452                  SL138       1853
LL022       3843                  TO061       4232
LL023       5124                  TO062       3376
LL024       2904                  TO063       5373
LL037       4235                  TO064       3065
LL038       4419                  TO065       2151
LL039       3444                  TO066       2541
LL040       3308                  TO079       3779
LL041       3283                  TO080       4407
LL042       3151                  TO081       3676
SL121       2687                  TO082       3101
SL122       3171                  TO083       4309
SL123       4593                  TO084       2730
SL124       1934                  TO097       3543
SL125       2757                  TO098       3677
SL126       2151                  TO099       3669
SL127       2552                  TO100       3187
SL128       2795                  TO101       2945
SL129       2056                  TO102       2771

Table A.11: Number of sequences recovered for samples in ecozone SBS with treatment OM2

Sample Id   Number of sequences   Sample Id   Number of sequences
LL008       3972                  SL149       1366
LL009       2628                  SL150       2017
LL010       2429                  SL151       2583
LL011       3965                  SL152       2776
LL012       4701                  SL153       1551
LL025       3768                  SL154       1832
LL026       3765                  SL155       2773
LL027       3943                  SL156       2016
LL028       2926                  TO067       4564
LL029       3450                  TO068       4706
LL030       4188                  TO069       4850
LL043       2986                  TO070       3494
LL044       3243                  TO071       3367
LL045       4308                  TO072       3623
LL046       4368                  TO085       7649
LL047       3183                  TO086       4748
LL048       2947                  TO087       5291
SL139       2535                  TO089       3925
SL140       1398                  TO090       4111
SL141       1713                  TO103       2660
SL142       1739                  TO104       4959
SL143       1217                  TO105       4536
SL144       1669                  TO106       3515
SL145       1326                  TO107       2696
SL146       1571                  TO108       3407
SL147       2091

Table A.12: Number of sequences recovered for samples in ecozone SBS with treatment OM3

Sample Id   Number of sequences   Sample Id   Number of sequences
LL017       2346                  SL172       4276
LL018       2632                  SL173       2159
LL034       4184                  SL174       2833
LL035       4230                  TO076       2140
LL036       4693                  TO077       3942
LL052       2025                  TO078       4621
LL053       3428                  TO094       4063
LL054       2075                  TO095       3604
SL160       2020                  TO096       4094
SL161       1976                  TO112       4045
SL162       2755                  TO113       2569
SL166       2339                  TO114       4131
SL167       1789

Table A.13: Summary of sample numbers in each ecozone for each treatment level

Treatment         OM0   OM1   OM2   OM3   Ecozone total
SBS               17    54    53    27    151
MD                18    18    18    18    72
JP                16    36    32    19    103
Treatment total   51    108   103   64    326

Table A.14: Shannon's entropy of LTSP samples grouped by ecozone and treatment

       OM0   OM1   OM2   OM3
SBS    4.1   4.3   4.3   4.6
MD     5.8   5.7   6.0   6.1
JP     5.6   5.5   5.5   5.7

[Figure A.1 image: clustering tree of SBS samples with sample IDs as leaves; axis ticks 0.0 to 0.7; legend: OM0, OM1, OM2, OM3]

Figure A.1: Hierarchical clustering of SBS samples colored by
treatment0.00.10.20.30.40.50.60.7lLL055lSL151lSL180lSL149lSL150lTO115lSL148lSL126lSL132lSL156lSL144lSL168lTO076lTO067lSL127lSL137lTO061lSL155lSL143lSL145lSL131lSL138lSL125lSL133lSL167lSL161lSL162lSL174lSL166lSL172lSL124lSL122lSL136lSL152lSL134lSL140lSL128lSL153lSL176lSL177lSL130lSL154lSL147lSL135lSL141lSL142lSL146lTO106lTO116lTO117lLL008lTO086lTO070lLL002lSL123lSL129lSL179lTO104lSL160lTO118lTO099lLL021lLL010lLL040lLL044lLL039lLL026lLL057lLL058lLL038lLL009lLL046lLL004lLL022lTO080lTO063lTO098lLL003lTO068lTO069lTO062lTO081lTO082lTO100lTO064lTO089lTO087lTO105lLL056lLL045lLL027lLL020lLL028lLL054lTO090lTO103lTO083lTO084lTO085lLL036lLL017lLL035lLL052lLL024lLL025lLL043lLL006lLL030lLL042lLL041lSL121lLL060lSL175lLL047lLL023lLL012lLL005lLL019lTO119lTO107lTO120lLL001lTO108lLL016lLL018lLL034lLL048lLL029lLL037lSL139lLL059lLL011lSL173lTO112lTO113l lTO096lTO071lTO078lTO094lTO095lTO066lTO079lTO065lTO072lLL007lTO101lTO102lLL053lTO097lTO077lTO114100100 9693759481 8695 9695 997584 9496 8694 938675 839363 939994 3997937681 90 879887 8997 918696 9991 89 9674 948285 9999 9098 89 86 98 7999 6494 90 9989 949265 809885 6562 78 8199 95959299 97 84 9888 5491 848199 9880898993 8679 85859398 100 9998 92 9088 100 8791 9888 9398 529298 100 88 9268 8977 949895 61 9609297 74 0 6294 9796 95 87818499 1001853 91006495 5660100100100 9982607750 7976 9055 747128 7480 4082 583834 556710 723523 375034385 37 612856 3762 283731 116 35 633 374653 6114 2437 24 57 22 425 6728 40 3125 245248 31317 3918 18 3430 2728542 46 31 1339 131 3546 731258129 3011 1220118 6 1557 23 4442 11 12 252 3622 1144 5 5 1715 51 252 25 31056 47 0 116 512 5 12412 126 80014 61100XXorganic horizonmineral horizonFigure A.2: Hierarchical clustering of SBS samples colored by 
horizon0.00.10.20.30.40.50.60.7lLL055lSL151lSL180lSL149lSL150lTO115lSL148lSL126lSL132lSL156lSL144lSL168lTO076lTO067lSL127lSL137lTO061lSL155lSL143lSL145lSL131lSL138lSL125lSL133lSL167lSL161lSL162lSL174lSL166lSL172lSL124lSL122lSL136lSL152lSL134lSL140lSL128lSL153lSL176lSL177lSL130lSL154lSL147lSL135lSL141lSL142lSL146lTO106lTO116lTO117lLL008lTO086lTO070lLL002lSL123lSL129lSL179lTO104lSL160lTO118lTO099lLL021lLL010lLL040lLL044lLL039lLL026lLL057lLL058lLL038lLL009lLL046lLL004lLL022lTO080lTO063lTO098lLL003lTO068lTO069lTO062lTO081lTO082lTO100lTO064lTO089lTO087lTO105lLL056lLL045lLL027lLL020lLL028lLL054lTO090lTO103lTO083lTO084lTO085lLL036lLL017lLL035lLL052lLL024lLL025lLL043lLL006lLL030lLL042lLL041lSL121lLL060lSL175lLL047lLL023lLL012lLL005lLL019lTO119lTO107lTO120lLL001lTO108lLL016lLL018lLL034lLL048lLL029lLL037lSL139lLL059lLL011lSL173lTO112lTO113l lTO096lTO071lTO078lTO094lTO095lTO066lTO079lTO065lTO072lLL007lTO101lTO102lLL053lTO097lTO077lTO114100100 969375 9481 8695 9695 997584 9496 8694 938675839363 939994 39979376 81 90 879887 8997 918696 9991 89 9674 948285 9999 9098 89 86 987999 6494 90 9989 949265 809885 6562 78 8199 95959299 97 84 9888 5491 848199 9880898993 8679 85859398 1009998 92 9088 100 8791 9888 9398 529298 100 88 92 68 8977 949895 61 9609297 74 0 6294 9796 95 87818499 1001853 91006495 5660100100100 998260 7750 7976 9055 747128 7480 4082 583834556710 723523 37503438 5 37 612856 3762 283731 116 35 633 374653 6114 2437 24 57 224925 6728 40 3125 245248 31317 3918 18 3430 2728542 46 31 1339 131 3546 731258129 3011 1220118 61557 23 4442 11 12 252 3622 1144 5 5 17 15 51 252 25 31056 47 0 116 512 5 12412 126 80014 61100XXXLogLakeTopleySkulowLakeFigure A.3: Hierarchical clustering of SBS samples colored by sample 
site1360.00.10.20.30.40.50.60.7lJE121lJE095lJE110lJS077lJW027lJE087lJS048lJS050lJS058lJE119lJW023lJE123lJS062lJE085lJW019lJW021lJE090lJE125lJW013lJE103lJE115lJE107lJW037lJE117lJE113lJS053lJS065lJS045lJS063lJS059lJS068lJS079lJE101lJS082lJS083lJW005lJW015lJS051lJE105lJS043lJW017lJW031lJW007lJW029lJW026lJS075lJE093lJW033lJW035lJS072lJS056lJS074lJE100lJW010lJW002lJE112lJW040lJW012lJW042l lJE092lJE098lJW003lJW004lJS060lJS064lJS052lJE124lJE104lJE122lJW014lJW020lJW024lJS080lJS054lJS066lJS046lJS076lJE086lJE096lJE106lJE126lJS078lJE108lJE120lJE118lJE088lJE114lJE116lJE094lJE102lJW028lJW038lJS084lJW034lJS044lJW036lJW032lJW008lJW022lJW006lJW016lJW030lJW01897100 9695 97 9667 939595 9393 61 95 988695 959971 9698 469599 95 9693 767996 84 96 947889 67 88 877899 95 6874 8774 95 947366 837783 81 84 84 85 9986 7187 8154 58 8068 5993 7584 6430 97 89 979166779399 4985 90 67 4363 93 678379 99 797231 647196539436721009899 8786 72 8859 808292 7080 70 74 475869 288237 1498 283241 28 2342 353248 68 80 474948 50 7 10458 64 326 2939 12 202458 72624 4 6 69 11 2611 567 829 20 579 282 511 343 68 3 142137173637 757 20 19 853 49 82218 20 18183 42289383455100XXXXOM0OM1OM2OM3Figure A.4: Hierarchical clustering of JP samples colored by treatment1370.00.10.20.30.40.50.60.7lJE121lJE095lJE110lJS077lJW027lJE087lJS048lJS050lJS058lJE119lJW023lJE123lJS062lJE085lJW019lJW021lJE090lJE125lJW013lJE103lJE115lJE107lJW037lJE117lJE113lJS053lJS065lJS045lJS063lJS059lJS068lJS079lJE101lJS082lJS083lJW005lJW015lJS051lJE105lJS043lJW017lJW031lJW007lJW029lJW026lJS075lJE093lJW033lJW035lJS072lJS056lJS074lJE100lJW010lJW002lJE112lJW040lJW012lJW042l lJE092lJE098lJW003lJW004lJS060lJS064lJS052lJE124lJE104lJE122lJW014lJW020lJW024lJS080lJS054lJS066lJS046lJS076lJE086lJE096lJE106lJE126lJS078lJE108lJE120lJE118lJE088lJE114lJE116lJE094lJE102lJW028lJW038lJS084lJW034lJS044lJW036lJW032lJW008lJW022lJW006lJW016lJW030lJW01897100 969597 9667 939595 939361 95 988695 959971 9698 4695 99 95 9693 767996 84 96 947889 67 88 877899 95 6874 877495 947366 
83 7783 81 84 84 85 99867187 8154 58 80685993 7584 6430 97 89 9791 66779399 4985 90 67 436393 678379 99 797231 647196539436721009899 878672 8859 808292 708070 74 475869 288237 1498 2832 41 28 2342 353248 68 80 474948 50 7 10458 64 326 293912 202458 7 2624 4 6 69 11 2611567 829 20 579282 511 343 68 3 1421 37173637 757 20 19 85349 82218 20 18183 42289383455100XXorganic horizonmineral horizonFigure A.5: Hierarchical clustering of JP samples colored by horizon0.00.10.20.30.40.50.60.7lJE121lJE095lJE110lJS077lJW027lJE087lJS048lJS050lJS058lJE119lJW023lJE123lJS062lJE085lJW019lJW021lJE090lJE125lJW013lJE103lJE115lJE107lJW037lJE117lJE113lJS053lJS065lJS045lJS063lJS059lJS068lJS079lJE101lJS082lJS083lJW005lJW015lJS051lJE105lJS043lJW017lJW031lJW007lJW029lJW026lJS075lJE093lJW033lJW035lJS072lJS056lJS074lJE100lJW010lJW002lJE112lJW040lJW012lJW042l lJE092lJE098lJW003lJW004lJS060lJS064lJS052lJE124lJE104lJE122lJW014lJW020lJW024lJS080lJS054lJS066lJS046lJS076lJE086lJE096lJE106lJE126lJS078lJE108lJE120lJE118lJE088lJE114lJE116lJE094lJE102lJW028lJW038lJS084lJW034lJS044lJW036lJW032lJW008lJW022lJW006lJW016lJW030lJW01897100 9695 97 9667 939595 9393 61 95 988695 959971 9698 469599 95 9693 767996 84 96 947889 67 88 877899 95 6874 8774 95 947366 837783 81 84 84 85 9986 7187 8154 58 8068 5993 7584 6430 97 89 979166779399 4985 90 67 4363 93 678379 99 797231 647196539436721009899 8786 72 8859 808292 7080 70 74 475869 288237 1498 283241 28 2342 353248 68 80 474948 50 7 10458 64 326 2939 12 202458 72624 4 6 69 11 2611 567 829 20 579 282 511 343 68 3 142137173637 757 20 19 853 49 82218 20 18183 42289383455100XXXWellsSuperior3Eddy3Figure A.6: Hierarchical clustering of JP samples colored by sample 
site1380.00.10.20.30.40.50.60.7lBL043lBR072lBL048lBL044lBL046lBR068lBR054lBR070lLH024lLH020lLH022lBL026lBR050lBL028lBL030lBL034lBR058lLH002lLH012lLH008lLH010lBR052lLH004lLH006lBR062lBL036lBL032lBR060lBR064lBR066lLH016lLH018lBR056lBL042lLH014lBL038lBL040lBR065lBL041lBR055lBL027lBR049lBR057lLH011lLH007lBR071lLH019lBL039lBR061lBR063lLH013lLH017lLH015l lBR059lBL037lLH021lBL025lBR069lLH003lLH005lBL031lBR067lBL045lBL047lLH009lLH023lBL035lBL029lBL033lBR051lBR053lLH00199 92100 998894 74100 78 9185 9188 98894994986699729998 97 7159 49 7796949349 90 83978096 977579 9994 17999990 99 6393 6480 6898 81 91 93 958399 94988997858795928782 3710099 99100 999584 8195 57 7797 9864 62246471677567206512 30 5756 56 737679141 22 6472853 105242 4647 2195221 56 242 676 2510 40 37 44 2137 14191127291543133085 58100XXXXOM0OM1OM2OM3Figure A.7: Hierarchical clustering of MD samples colored by treatment0.00.10.20.30.40.50.60.7lBL043lBR072lBL048lBL044lBL046lBR068lBR054lBR070lLH024lLH020lLH022lBL026lBR050lBL028lBL030lBL034lBR058lLH002lLH012lLH008lLH010lBR052lLH004lLH006lBR062lBL036lBL032lBR060lBR064lBR066lLH016lLH018lBR056lBL042lLH014lBL038lBL040lBR065lBL041lBR055lBL027lBR049lBR057lLH011lLH007lBR071lLH019lBL039lBR061lBR063lLH013lLH017lLH015l lBR059lBL037lLH021lBL025lBR069lLH003lLH005lBL031lBR067lBL045lBL047lLH009lLH023lBL035lBL029lBL033lBR051lBR053lLH00199 92100 998894 74100 78 9185 9188 988949 94986699 729998 97 7159 49 7796 949349 90 8397 8096 977579 9994 1799 9990 996393 6480 6898 81 91 93 958399 94988997858795928782 3710099 99100 999584 8195 57 7797 9864 622464 71677567 206512 30 5756 56 7376 79141 22 647 2853 105242 4647 219 5221 56242 676 2510 40 37 44 2137 14191127291543133085 58100XXorganic horizonmineral horizonFigure A.8: Hierarchical clustering of MD samples colored by 
horizon1390.00.10.20.30.40.50.60.7lBL043lBR072lBL048lBL044lBL046lBR068lBR054lBR070lLH024lLH020lLH022lBL026lBR050lBL028lBL030lBL034lBR058lLH002lLH012lLH008lLH010lBR052lLH004lLH006lBR062lBL036lBL032lBR060lBR064lBR066lLH016lLH018lBR056lBL042lLH014lBL038lBL040lBR065lBL041lBR055lBL027lBR049lBR057lLH011lLH007lBR071lLH019lBL039lBR061lBR063lLH013lLH017lLH015l lBR059lBL037lLH021lBL025lBR069lLH003lLH005lBL031lBR067lBL045lBL047lLH009lLH023lBL035lBL029lBL033lBR051lBR053lLH00199 92100 998894 74100 78 9185 9188 98894994986699729998 97 7159 49 7796949349 90 83978096 977579 9994 17999990 99 6393 6480 6898 81 91 93 958399 94988997858795928782 3710099 99100 999584 8195 57 7797 9864 62246471677567206512 30 5756 56 737679141 22 6472853 105242 4647 2195221 56 242 676 2510 40 37 44 2137 14191127291543133085 58100XXXLowellHillBlodgettBrandyCityFigure A.9: Hierarchical clustering of MD samples colored by sample site140Figure A.10: Scatter matrix plot of four centrality measures in the MD net-works. Histograms of each centrality measure is also shown. Thedifferent centrality values of OTUs for each treatment network waspooled to produce these plots.141Figure A.11: Scatter matrix plot of four centrality measures in the JP net-works. Histograms of each centrality measure is also shown. 
Thedifferent centrality values of OTUs for each treatment network waspooled to produce these plots.142Table A.15: Representation of phyla in central taxa of JP networksPhylum Number of taxa Number of central taxaAcidobacteria 395 4Actinobacteria 425 8Bacteroidetes 36 1Candidate division OP10 2 0Candidate division TG-1 3 0Candidate division TM6 1 0Candidate division TM7 17 0Cyanobacteria 50 0Firmicutes 26 0Gemmatimonadetes 26 0Planctomycetes 110 0Proteobacteria 1028 17Verrucomicrobia 25 0WCHB1-60 2 0Table A.16: Representation of classes in central taxa of JP networksClass Number of taxa Number of central taxaAcidobacteria 379 4Actinobacteria 425 8Alphaproteobacteria 814 14Bacilli 23 0Betaproteobacteria 63 0Chloroplast 1 0Deltaproteobacteria 77 0Gammaproteobacteria 71 3Gemmatimonadetes 26 0Holophagae 13 0Lineage IV 3 0MLE1-12 7 0Opitutae 19 0Phycisphaerae 25 0Planctomycetacia 76 0Spartobacteria 6 0Sphingobacteria 36 1WD272 42 0143Table A.17: Representation of orders in central taxa of JP networksOrder Number of taxa Number of central taxa32-20 10 0Acidimicrobidae 107 0Acidobacteriales 379 4Actinobacteridae 229 8Bacillales 23 0Burkholderiales 42 0Candidatus Xiphinematobacter 2 0Caulobacterales 37 0GR-WP33-30 27 0Gemmatimonadales 26 0Legionellales 1 0Myxococcales 49 0Nitrosomonadales 7 0Opitutales 19 0Planctomycetales 76 0Rhizobiales 376 10Rhodospirillales 377 4Rubrobacteridae 81 0SC-I-84 9 0Sphingobacteriales 36 1TRA3-20 1 0WD2101 25 0Xanthomonadales 62 3iii1-8 3 0144Table A.18: Representation of phyla in central taxa of MD networksPhylum Number of taxa Number of central taxaAcidobacteria 684 17Actinobacteria 1097 45Bacteroidetes 271 4Candidate division OP10 9 0Candidate division TM7 16 0Candidate division WS3 7 0Chloroflexi 89 2Cyanobacteria 22 2Fibrobacteres 3 0Firmicutes 37 1Gemmatimonadetes 86 5Nitrospirae 4 1Planctomycetes 115 1Proteobacteria 1942 68Verrucomicrobia 51 0WCHB1-60 6 0145Table A.19: Representation of classes in central taxa of MD networksClass Number 
of taxa Number of central taxaAcidobacteria 659 17Actinobacteria 1097 45Alphaproteobacteria 1376 48Anaerolineae 1 1Bacilli 25 1Betaproteobacteria 291 8Chloroflexi 5 0Chloroplast 3 0Clostridia 1 0Deltaproteobacteria 164 6Fibrobacteria 3 0Flavobacteria 1 0Gammaproteobacteria 92 5Gemmatimonadetes 86 5Holophagae 18 0KD4-96 34 1MLE1-12 2 0Nitrospira 4 1OPB35 1 0Opitutae 47 0Phycisphaerae 45 0Planctomycetacia 60 1S085 9 0SHA-109 1 0Spartobacteria 3 0Sphingobacteria 268 4TK10 1 0Thermomicrobia 1 0WD272 16 2146Table A.20: Representation of orders in central taxa of MD networksOrder Number of taxa Number of central taxa32-20 11 0Acidimicrobidae 136 7Acidobacteriales 659 17Actinobacteridae 633 25Anaerolineales 1 1Bacillales 25 1Burkholderiales 164 6Candidatus Xiphinematobacter 1 0Caulobacterales 100 2Chloroflexales 5 0Clostridiales 1 0DA101 2 0Fibrobacterales 3 0Flavobacteriales 1 0GR-WP33-30 24 1Gemmatimonadales 86 5MB-A2-108 2 0Methylophilales 1 0Myxococcales 126 4Nitrosomonadales 50 2Nitrospirales 4 1Opitutales 47 0Planctomycetales 60 1Pseudomonadales 12 0Rhizobiales 704 27Rhodobacterales 3 0Rhodospirillales 494 17Rubrobacteridae 295 12SC-I-84 26 0SJA-36 1 0Sphingobacteriales 268 4Sphingomonadales 41 1TRA3-20 19 0WD2101 39 0Xanthomonadales 76 5iii1-8 4 0147Table A.21: Representation of phyla in central taxa of SBS networksPhylum Number of taxa Number of central taxaAcidobacteria 156 24Actinobacteria 169 36Bacteroidetes 8 1Candidate division TM7 1 0Chloroflexi 35 11Cyanobacteria 5 1Firmicutes 2 0Gemmatimonadetes 10 2Planctomycetes 1 0Proteobacteria 366 54Verrucomicrobia 1 0WCHB1-60 1 0Table A.22: Representation of classes in central taxa of SBS networksClass Number of taxa Number of central taxaAcidobacteria 138 22Actinobacteria 169 36Alphaproteobacteria 261 39Bacilli 2 0Betaproteobacteria 70 10Chloroplast 1 0Deltaproteobacteria 11 3Gammaproteobacteria 23 2Gemmatimonadetes 10 2Holophagae 15 2KD4-96 34 11Opitutae 1 0Phycisphaerae 1 0RB25 3 0Sphingobacteria 8 1WD272 4 
1148Table A.23: Representation of orders in central taxa of SBS networksOrder Number of taxa Number of central taxa32-20 15 2Acidimicrobidae 32 3Acidobacteriales 138 22Actinobacteridae 92 22Bacillales 2 0Burkholderiales 31 5Caulobacterales 12 1Desulfuromonadales 2 0GR-WP33-30 8 3Gemmatimonadales 10 2MB-A2-108 1 1Myxococcales 1 0Nitrosomonadales 12 2Opitutales 1 0Rhizobiales 134 20Rhodospirillales 103 15Rubrobacteridae 42 10SC-I-84 14 3Sphingobacteriales 8 1WD2101 1 0Xanthomonadales 23 2149
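
Table A.14 reports Shannon's entropy for each ecozone and treatment. The following is a minimal sketch of how such a value can be computed from the OTU counts of a single sample, assuming the standard definition H = -sum(p_i * log(p_i)), where p_i is the relative abundance of OTU i. The logarithm base behind Table A.14 is not stated in this appendix, so it is left as a parameter here. The function name and the example counts are illustrative and are not taken from the thesis.

```python
import math
from collections.abc import Iterable


def shannon_entropy(counts: Iterable[int], base: float = math.e) -> float:
    """Shannon entropy H = -sum(p_i * log(p_i)) of a vector of OTU counts.

    `base` is an assumption left to the caller (natural log by default);
    zero counts are ignored because they contribute nothing to the sum.
    """
    positive = [c for c in counts if c > 0]
    total = sum(positive)
    if total == 0:
        raise ValueError("at least one count must be positive")
    return -sum((c / total) * math.log(c / total, base) for c in positive)


# Hypothetical OTU counts for one sample (not data from the thesis).
example_counts = [120, 80, 40, 10, 0, 5]
example_entropy = shannon_entropy(example_counts)
```

A sample dominated by a single OTU gives a value near zero, while a sample with many evenly abundant OTUs gives a larger value, which is the sense in which the MD and JP rows of Table A.14 indicate more even or richer communities than the SBS row.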
