UBC Theses and Dissertations


Precise correlation and metagenomic binning uncovers fine microbial community structure

Durno, W. Evan


Bacteria and Archaea represent the invisible majority of living things on Earth with an estimated numerical abundance exceeding 10^30 cells. This estimate surpasses the number of grains of sand on Earth and stars in the known universe. Interdependent microbial communities drive fluxes of matter and energy underlying biogeochemical processes, and provide essential ecosystem functions and services that help create the operating conditions for life. Despite their abundance and functional imperative, the vast majority of microorganisms remain uncultivated in laboratory settings, and therefore remain extremely difficult to study. Recent advances in high-throughput sequencing are opening a multi-omic (DNA and RNA) window to the structure and function of microbial communities, providing new insights into coupled biogeochemical cycling and the metabolic problem-solving power of otherwise uncultivated microbial dark matter (MDM). These technological advances have created bottlenecks with respect to information processing, and innovative bioinformatics solutions are required to analyze immense biological data sets. This is particularly apparent when dealing with metagenome assembly, population genome binning, and network analysis. This work investigates the combined use of single-cell amplified genomes (SAGs) and metagenomes to more precisely construct population genome bins and evaluates the use of covariance matrix regularization methods to identify putative metabolic interdependencies at the population and community levels of organization. Applying dimensional reduction with principal components and a Gaussian mixture model to k-mer statistics from SAGs and metagenomes is shown to bin more precisely, and has been implemented as a novel pipeline, SAG Extrapolator (SAGEX). Also, correlation networks derived from small subunit ribosomal RNA gene sequences are shown to be more precisely inferred through regularization with factor analysis models applied via a Gaussian copula.
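The binning idea described above (k-mer statistics, principal components, then a Gaussian mixture model) can be illustrated with a minimal sketch. This is not the SAGEX implementation. The choice of tetranucleotides, the two principal components, the two mixture components, the 5% likelihood cutoff, and the function names `kmer_frequencies`, `recruit_contigs`, and `random_contig` are all assumptions made for illustration, and the contigs are synthetic. The sketch relies on NumPy and scikit-learn.

```python
from itertools import product

import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

K = 4  # tetranucleotides (an assumption; the abstract does not state k)
KMERS = ["".join(p) for p in product("ACGT", repeat=K)]
KMER_INDEX = {kmer: i for i, kmer in enumerate(KMERS)}


def kmer_frequencies(seq):
    """Return the relative frequency of every k-mer in one contig."""
    counts = np.zeros(len(KMERS))
    seq = seq.upper()
    for i in range(len(seq) - K + 1):
        j = KMER_INDEX.get(seq[i:i + K])  # k-mers with other letters are skipped
        if j is not None:
            counts[j] += 1
    total = counts.sum()
    return counts / total if total > 0 else counts


def recruit_contigs(sag_contigs, metagenome_contigs,
                    n_pcs=2, n_components=2, quantile=0.05):
    """Flag metagenome contigs whose k-mer profile resembles the SAG's."""
    sag_x = np.array([kmer_frequencies(s) for s in sag_contigs])
    mg_x = np.array([kmer_frequencies(s) for s in metagenome_contigs])
    # Reduce the 256-dimensional profiles to a few principal components.
    pca = PCA(n_components=n_pcs).fit(np.vstack([sag_x, mg_x]))
    sag_pc, mg_pc = pca.transform(sag_x), pca.transform(mg_x)
    # Model the SAG contigs with a Gaussian mixture in PC space.
    gmm = GaussianMixture(n_components=n_components, random_state=0).fit(sag_pc)
    # Recruit contigs at least as likely as the bulk of the SAG contigs.
    threshold = np.quantile(gmm.score_samples(sag_pc), quantile)
    return gmm.score_samples(mg_pc) >= threshold


def random_contig(rng, length, gc):
    """Synthetic contig with a given GC fraction (toy data only)."""
    probs = [(1 - gc) / 2, gc / 2, gc / 2, (1 - gc) / 2]  # A, C, G, T
    return "".join(rng.choice(list("ACGT"), size=length, p=probs))


rng = np.random.default_rng(0)
sag = [random_contig(rng, 2000, 0.3) for _ in range(40)]
similar = [random_contig(rng, 2000, 0.3) for _ in range(10)]
distant = [random_contig(rng, 2000, 0.7) for _ in range(10)]
recruited = recruit_contigs(sag, similar + distant)
print("similar recruited:", recruited[:10].sum(), "of 10")
print("distant recruited:", recruited[10:].sum(), "of 10")
```

In this toy setting the SAG supplies the taxonomic anchor and the metagenome supplies additional contigs, which mirrors the depth-versus-breadth trade-off the abstract describes. Real data would need a minimum contig length and a more careful choice of cutoff.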
SAGEX and regularized correlation are applied to 368 SAGs and 91 metagenomes, postulating populations’ metabolic capabilities via binning, and constraining interpretations via correlation. The application describes coupled biogeochemical cycling in low-oxygen waters. Use of SAGEX leverages SAGs’ deep taxonomic descriptions and metagenomes’ breadth, produces precise population genome bins, and enables metabolic reconstruction and analysis of population dynamics over time. Regularizing correlation networks overcomes a known analytic bottleneck rooted in precision limitations.
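The regularized correlation idea can be sketched in the same spirit. The sketch below assumes one common way of combining the two ingredients named in the abstract: each taxon's abundances are mapped to normal scores through their ranks (a rank-based Gaussian copula transform), a factor analysis model is fitted to those scores, and the correlation matrix is read off the fitted low-rank-plus-diagonal covariance. It is not the thesis's estimator. The number of factors, the function names `normal_scores` and `factor_regularized_correlation`, and the simulated counts are assumptions for illustration. It relies on NumPy, SciPy, and scikit-learn.

```python
import numpy as np
from scipy.stats import norm, rankdata
from sklearn.decomposition import FactorAnalysis


def normal_scores(abundance):
    """Map each taxon's column to normal scores through its ranks."""
    n_samples = abundance.shape[0]
    ranks = rankdata(abundance, axis=0)  # ties receive average ranks
    return norm.ppf(ranks / (n_samples + 1))


def factor_regularized_correlation(abundance, n_factors=2):
    """Correlation matrix implied by a fitted factor analysis model."""
    z = normal_scores(abundance)
    fa = FactorAnalysis(n_components=n_factors, random_state=0).fit(z)
    cov = fa.get_covariance()  # low-rank loadings term plus diagonal noise
    scale = np.sqrt(np.diag(cov))
    return cov / np.outer(scale, scale)


# Toy data: 30 samples by 8 taxa; taxa 0-3 share one latent driver.
rng = np.random.default_rng(1)
driver = rng.normal(size=(30, 1))
log_mean = 3.0 + np.hstack([np.repeat(driver, 4, axis=1), np.zeros((30, 4))])
counts = rng.poisson(np.exp(log_mean + 0.3 * rng.normal(size=(30, 8))))
corr = factor_regularized_correlation(counts)
print(np.round(corr[:2, :2], 2))
```

Constraining the covariance to a few factors plus per-taxon noise reduces the number of free parameters, which is the sense in which such a model regularizes a correlation network estimated from few samples.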



Attribution 4.0 International