UBC Theses and Dissertations

Elastic net regression for the selection of orthogroups predictive of trophic mechanisms in diverse eukaryotes

Bjornson, Saelin

Abstract

The last eukaryotic common ancestor (LECA) was a unicellular heterotrophic organism that preyed upon bacteria through a process called phagocytosis. LECA’s descendants have evolved into many eukaryotic supergroups, and over this time endosymbiotic integration of cyanobacteria and then eukaryotic algae into diverse hosts has led to the establishment of a photosynthetic organelle, the plastid, in multiple distantly related lineages. As a result, microbial eukaryotes display a range of trophic strategies: heterotrophy through feeding, photoautotrophy through photosynthesis, or a combination of both as mixotrophy. Additionally, saprotrophy, the ability to digest nutrients externally and take up products osmotrophically, has evolved convergently in multiple eukaryotic groups. With the continuous development of high-throughput and culture-independent sampling and sequencing technologies, novel eukaryotic lineages are increasingly recovered from genomic data alone. It is therefore useful to develop methods to predict biological traits from genomic data in microbial eukaryotes. Here, we train an elastic net regression model that predicts the probability of a species obtaining nutrients by heterotrophy, autotrophy, mixotrophy or saprotrophy. To do so, we collected proteomes predicted from genomes and transcriptomes of over 250 species, representing all major lineages with sequence data available. These were clustered into over 230,000 orthologous groups (orthogroups), and models were then trained on the presence or absence of these groups and tested for predictive accuracy on 304 validation species. The final model had a balanced accuracy of 91% and a Cohen’s Kappa of 88% on the validation set, and selected around 4,000 orthogroups that are either negatively or positively associated with each trophic class. The majority of these orthogroups are found in multiple eukaryotic phyla, and 34% had no homology to characterized protein databases.
GO term enrichment and KEGG pathway analysis of predictive orthogroups suggest involvement in metabolism of essential compounds, with differential biosynthetic and catabolic abilities associated with each trophic class. By predicting ecological functions in diverse eukaryotes in a culture- and phylogeny-independent manner, this model has the potential to expedite aspects of both evolutionary and ecological research, as well as assist in selecting candidate proteins that play a functional role in the trophic classes they predict.
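
The abstract reports two validation metrics, balanced accuracy and Cohen's kappa. As a rough illustration of how these are computed for a four-class trophic prediction, here is a minimal sketch in plain Python. The labels and predictions below are invented toy data, not results from the thesis, and the function names are hypothetical; the thesis does not state which software computed its metrics.

```python
from collections import Counter

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    recalls = []
    for cls in sorted(set(y_true)):
        total = sum(1 for t in y_true if t == cls)
        hits = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        recalls.append(hits / total)
    return sum(recalls) / len(recalls)

def cohens_kappa(y_true, y_pred):
    """Agreement between truth and prediction, corrected for chance.

    Assumes chance agreement is below 1 (otherwise kappa is undefined).
    """
    n = len(y_true)
    observed = sum(1 for t, p in zip(y_true, y_pred) if t == p) / n
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    # Chance agreement: product of the marginal class frequencies, summed.
    expected = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    return (observed - expected) / (1 - expected)

# Toy validation set (assumed for illustration only): 12 species.
y_true = (["heterotroph"] * 4 + ["autotroph"] * 4
          + ["mixotroph"] * 2 + ["saprotroph"] * 2)
y_pred = (["heterotroph"] * 4
          + ["autotroph"] * 3 + ["mixotroph"]      # one autotroph misclassified
          + ["mixotroph", "heterotroph"]           # one mixotroph misclassified
          + ["saprotroph"] * 2)

print(f"balanced accuracy: {balanced_accuracy(y_true, y_pred):.4f}")  # 0.8125
print(f"Cohen's kappa:     {cohens_kappa(y_true, y_pred):.4f}")       # 0.7692
```

Balanced accuracy averages recall per class, so a rare class such as mixotrophy counts as much as a common one; kappa additionally discounts the agreement expected by chance from the class frequencies alone.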

Rights

Attribution-NonCommercial-NoDerivatives 4.0 International