UBC Theses and Dissertations

Distribution and diversity of aquatic RNA virus assemblages in an environmental context Vlok, Marli 2018

DISTRIBUTION AND DIVERSITY OF AQUATIC RNA VIRUS ASSEMBLAGES IN AN ENVIRONMENTAL CONTEXT

by

MARLI VLOK

B.Sc., Biochemistry and Microbiology, Rhodes University, 2006
B.Sc., Hons, Microbiology, Rhodes University, 2007
M.Sc., Microbiology (with distinction), Rhodes University, 2009

A thesis submitted in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

in

The Faculty of Graduate and Postdoctoral Studies

(Botany)

The University of British Columbia
(Vancouver)

March 2018

© Marli Vlok, 2018

ABSTRACT

RNA viruses are ubiquitous and abundant in aquatic environments but are not well studied; of these, picorna-like viruses are dominant in coastal waters. They are genetically diverse and primarily infect microbial eukaryotes, thereby influencing microbial community composition and nutrient cycling. Few studies have elucidated the diversity of marine and freshwater RNA virus assemblages. In this dissertation, metagenomic data were used to test the hypothesis that aquatic RNA viruses are diverse and broadly distributed, and to examine the abiotic factors and evolutionary pressures shaping these assemblages. Mapping reads from geographically separate sites to six reference genomes showed that these viruses were subject to purifying selection, with synonymous single-nucleotide variations dominating the mutations. These marine RNA viruses exhibited distinct biogeographic patterns, with different quasispecies detectable in different areas.

Marine RNA virus assemblages from polar to tropical environments were taxonomically complex. Viruses in the order Picornavirales were consistently in high relative abundance and dominated marine RNA virus assemblages. Virus families that are thought to occur only in terrestrial systems were detected in oceanic samples, implying that these viruses infect marine organisms and, therefore, likely originated in the ocean.

Freshwater RNA virus assemblages are equally diverse across six sites characterized by different associated land uses. A complex suite of abiotic factors is associated with seasonal changes in viral diversity across all sites over 14 months, with high variability in site-specific RNA virus diversity. Certain viruses were associated with specific abiotic factors and sites, suggesting that some RNA viral taxa may be useful as indicator species. Lastly, a sequence-dependent taxonomic framework was developed to incorporate genomes assembled from metagenomic data into the current taxonomic classification system. This led to a proposed expansion of the Marnaviridae from a single isolate to 20 viruses classified into seven genera.

The data presented here revealed unprecedented diversity in aquatic RNA viruses, and greatly expanded our knowledge of the distribution and dynamics of aquatic RNA viruses, with consequent implications for understanding their role in aquatic ecosystems.

LAY SUMMARY

Aquatic RNA viruses are diverse and abundant. They infect microbial eukaryotes, affecting their diversity and nutrient cycling across ecosystems. Little is known about these important viruses, including which types exist, where they occur, and whether they change with the environment. This dissertation investigated the composition and distribution of aquatic RNA viruses from the poles to the tropics. Aquatic RNA virus assemblages contain many types of viruses that are related to viruses that infect plants, animals and microbes. These include many previously unknown marine RNA viruses that were classified using a new system, so that they can be incorporated into the established taxonomy. Some viruses are more widespread than others. Changes in the composition of aquatic RNA virus assemblages were related to environmental factors, with certain freshwater viruses being found under specific conditions, suggesting they can be used as environmental indicators.

PREFACE

This statement certifies that the work presented in this thesis was conceived and the chapters written by the author, Marli Vlok. As research advisor, Curtis Suttle was involved in the conceptualization of experimental studies, interpretation of results and thesis editing. Hélène Sanfaçon and Patrick Keeling provided recommendations and general guidance. The author collected and processed the samples and analyzed the data, with the exceptions stated below. Many of the samples in this study are part of a large collection of virus concentrates archived in the Suttle lab; samples were collected by current and past members of the Suttle lab, including the author. Statistical consultation was provided by Adrian Jones. In Chapters 2 and 5, Andrew Lang aided in the conceptualization of the studies, interpretation of the results and editing of the chapters. Steven Short contributed to the collection of samples, interpretation of results and editing of Chapter 3. Chapter 4 was part of a collaboration with the GenomeBC watershed health and biomarker project initiative of the British Columbia Centre for Disease Control. The experimental plan and sample processing were generated by Miguel I. Uyaguari-Diaz, Kirby I. Cronin, Michael Chan, Jared R. Slobodan, Matthew J. Nesbitt, Patrick K. C. Tang, Fiona S. L. Brinkman and Natalie A. Prystajecky.

TABLE OF CONTENTS

Abstract
Lay Summary
Preface
Table of Contents
List of Tables
List of Figures
List of Abbreviations
Glossary
Acknowledgements
Dedication
1 INTRODUCTION TO RNA VIRUSES ASSOCIATED WITH SINGLE-CELLED EUKARYOTES
1.1 The significance of marine viruses
1.2 RNA viruses
1.3 Origin and evolution of RNA viruses
1.4 Marine RNA viruses
1.4.1 Enumerating RNA viruses in the ocean
1.4.2 Diversity of marine RNA viruses
1.4.2.1 Culture-based studies
1.4.2.2 Amplicon-based studies
1.4.2.3 Metagenomic studies
1.5 Freshwater RNA viruses
1.5.1 Diversity of freshwater RNA viruses
1.5.1.1 Culture-based studies
1.5.1.2 Metagenomic studies
1.6 The life of an aquatic RNA virus
1.6.1 Capsid structure and genome organization
1.6.2 Virus replication cycle
1.6.3 Environmental niches and host range
1.7 Dynamics of aquatic RNA viruses across space and time
1.7.1 Marine RNA virus dynamics
1.7.2 Freshwater RNA virus dynamics
1.8 Methodological considerations associated with the analysis of environmental RNA virus assemblages
1.8.1 Environmental considerations
1.8.2 Analysis considerations
1.9 Problem statement and research objectives
1.9.1 Overall goals and objectives
1.9.2 Chapter descriptions
1.10 Significance
2 FROM THE ARCTIC TO AFRICA: EXPLORING THE GEOGRAPHIC DISTRIBUTION OF SIX MARINE RNA VIRUSES AND THEIR QUASISPECIES
2.1 Summary
2.2 Introduction
2.3 Materials and methods
2.3.1 Sample collection and preparation
2.3.2 cDNA synthesis and metagenomic library preparation
2.3.3 Metagenomic and genomic analysis
2.3.4 Metagenomic read recruitment
2.3.5 Pairwise similarity and selection pressure analyses
2.4 Results
2.4.1 Three novel picorna-like genomes
2.4.2 Marine RNA viruses exist as detectable quasispecies and exhibit biogeographic differences
2.4.3 Evolutionary pressures acting on the viral quasispecies
2.4.4 Detection of established viral variants
2.5 Discussion
2.5.1 Divergent marine Picornavirales genomes are present at the same location
2.5.2 Marine RNA virus biogeography
2.5.3 Purifying selection on conserved protein domains in marine RNA virus quasispecies
2.5.4 Established viral variants are widely dispersed
2.5.5 Conclusions
3 METAGENOMIC ANALYSIS REVEALS UNPRECEDENTED DIVERSITY IN MARINE RNA VIRUS ASSEMBLAGES FROM POLAR TO TROPICAL SEAS
3.1 Summary
3.2 Introduction
3.3 Materials and methods
3.3.1 Sample collection and preparation
3.3.2 cDNA synthesis and metagenomic library preparation
3.3.3 Metagenomic assemblage composition and genomic analysis
3.3.4 Phylogenetic analysis
3.3.5 Statistics
3.4 Results
3.4.1 Marine RNA virus diversity
3.4.2 Picornavirales genomes and phylogenies
3.4.3 Tombusviridae and RNA-DNA hybrid viruses
3.4.4 Predictors of marine virus diversity
3.5 Discussion
3.5.1 Picornavirales are the most abundant members of marine RNA virus assemblages in all oceanic regions
3.5.2 Assemblage diversity extends beyond Picornavirales
3.5.3 RDHV-like virus capsid sequences are abundant in marine metagenomic data
3.5.4 Salinity and latitude are associated with changes in assemblage compositions
3.5.5 Conclusions
4 SPATIAL AND TEMPORAL DIVERSITY IN FRESHWATER RNA VIRUSES IN THE CONTEXT OF ENVIRONMENTAL VARIABILITY
4.1 Summary
4.2 Introduction
4.3 Materials and methods
4.3.1 Sample collection and preparation
4.3.2 Environmental, physical and chemical data
4.3.3 cDNA synthesis and metagenomic library preparation
4.3.4 Community composition
4.3.5 Statistical analysis
4.3.6 Diversity measures and species indicators
4.4 Results
4.4.1 Variations in the taxonomic diversity of freshwater RNA virus assemblages
4.4.2 RNA viruses as indicator species
4.4.3 Differences in alpha diversity among sites
4.4.4 Changes in alpha diversity across sites
4.5 Discussion
4.5.1 Freshwater RNA virus assemblages are diverse and dynamic
4.5.2 Viral diversity was variable with seasonal trends
4.5.3 Specific viruses are associated with particular environmental factors
4.5.4 Drivers of RNA virus diversity are complex
4.5.5 Conclusions
5 SEQUENCE-BASED TAXONOMIC CLASSIFICATION OF UNCULTURED AND UNCLASSIFIED MARINE SINGLE-STRANDED RNA VIRUSES OF THE ORDER PICORNAVIRALES
5.1 Summary
5.2 Introduction
5.3 Materials and methods
5.3.1 Phylogenetic analysis
5.3.2 Pairwise similarities and diversity metrics
5.3.3 Domain analysis
5.4 Results
5.4.1 Family classifications and proposed genera
5.4.1.1 Marnavirus
5.4.1.2 Labyrnavirus
5.4.1.3 Locarnavirus
5.4.1.4 Kusarnavirus
5.4.1.5 Bacillarnavirus
5.4.1.6 Salisharnavirus
5.4.1.7 Sogarnavirus
5.4.2 RdRp conserved motifs
5.4.3 Species demarcations
5.5 Discussion
5.5.1 Expanding the family Marnaviridae to encompass seven genera
5.5.2 Classifying new species
5.5.3 Conclusions
6 CONCLUSIONS AND FUTURE DIRECTIONS
6.1 Recapitulation and significance of this work
6.1.1 Aquatic Picornavirales – survivors of an ancient world
6.1.2 Metagenomics provides insight into the diversity of aquatic viruses and their environment
6.1.3 Bridging the gap between ecological and classical virology
6.2 Strength and limitations
6.2.1 Ribosome contamination
6.2.2 Pre-filtration and virus concentration
6.2.3 Metagenomic library construction
6.2.4 Bioinformatics
6.3 Future directions
6.4 Significance of this research
Bibliography
A. Supplementary Information to Chapter 2
B. Supplementary Information to Chapter 3
C. Supplementary Information to Chapter 4
D. Supplementary Information to Chapter 5
E. Supplementary Heatmaps

LIST OF TABLES

Table 1.1: Marine RNA virus isolates infecting marine protists.
Table 4.1: Spearman correlation coefficients (rho) between environmental factors and alpha diversity for each site.
Table 4.2: Multiple linear regression analysis of RNA virus alpha diversity and environmental data (all log transformed) across sites.
Table 5.1: Genome and isolation information for proposed members of the Marnaviridae.
Supplementary Table A.1: Environmental descriptions of stations from which viral metagenomes were generated.
Supplementary Table A.2: Stochastic dominance observed in the domain sample pairs, analyzed using the Conover-Iman test.
Supplementary Table B.1: Metadata associated with the sampling sites.
Supplementary Table B.2: Taxonomic assignment of marine RNA virus contigs.
Supplementary Table B.3: Accession numbers of RNA virus genomes used in the analyses of marine RNA virus assemblages.
Supplementary Table C.1: A description of the sampling sites.
Supplementary Table C.2: Taxonomic classification of freshwater RNA viruses.
Supplementary Table D.1: Mean p-value distances between genera.
Supplementary Table D.2: GenBank accessions used in this study.

LIST OF FIGURES

Figure 1.1: Virus taxonomy of ICTV recognized viral species.
Figure 1.2: RNA quasispecies and the relationship of mutation frequency to virus infectivity.
Figure 1.3: Average mutation rates observed in viruses and the factors that influence it.
Figure 1.4: The origin of RNA viruses infecting eukaryotes.
Figure 1.5: Groups of protists that are associated with RNA viruses.
Figure 1.6: The capsid structure of three aquatic RNA viruses.
Figure 1.7: Graphical representation of marine RNA virus genomes of the order Picornavirales.
Figure 1.8: Transmission electron micrographs of single-celled eukaryotes infected with RNA viruses.
Figure 2.1: Sampling locations represented in this study.
Figure 2.2: Three near-complete viral genomes assembled from Jericho Pier.
Figure 2.3: Pairwise nucleotide sequence similarities and neighbour-joining trees of the conserved domains of six marine RNA viruses.
Figure 2.4: Prevalence of sequences mapping to the six marine RNA viruses in environmental viral metagenomic datasets.
Figure 2.5: Prevalence of reads with amino acid sequence identities greater than 85% mapped to the six marine RNA viruses.
Figure 2.6: Distribution of six marine RNA virus genomes.
Figure 2.7: Nucleotide alignments of the viral domains.
Figure 2.8: Selection analyses of marine RNA virus domains in environmental datasets.
Figure 2.9: Detection of marine RNA virus quasispecies in environmental datasets.
Figure 2.10: Established viral variants in different marine RNA virus communities.
Figure 3.1: Sampling sites for this study.
Figure 3.2: A heatmap representing the diversity and relative abundance of taxonomically assigned marine RNA virus assemblages.
Figure 3.3: Seven near-complete picorna-like genomes assembled from marine metagenomic data.
Figure 3.4: Relative abundance of assembled contigs across study locations.
Figure 3.5: Phylogenetic placement of environmental RNA-dependent RNA polymerase reads within a Picornavirales reference tree.
Figure 3.6: Phylogenetic relationship of environmental reads associated with RDHV, Tombusviridae and RDHV structural protein sequences.
Figure 3.7: Linear regression of diversity and species evenness of samples in relation to salinity.
Figure 3.8: Constrained correspondence analysis of RNA virus relative abundances from each site in context of the associated location, salinity and temperature.
Figure 3.9: Response variables associated with marine RNA virus correspondence analysis.
Figure 4.1: Seasonal variation in site-specific beta diversity at the agricultural sites.
Figure 4.2: Viruses as indicators of environmental conditions.
Figure 4.3: Spatial and temporal changes in viral alpha diversity.
Figure 5.1: Maximum likelihood phylogeny of the Picornavirales RNA-dependent RNA polymerase amino acid sequences rooted with sequences from the Potyviridae.
Figure 5.2: Heatmap of amino acid pairwise sequence analysis of the Marnaviridae.
Figure 5.3: Comparative RdRp and capsid maximum-likelihood phylogenies of the family Marnaviridae.
Figure 5.4: Summary of genera in the Marnaviridae and their associated number of species and amino acid diversities.
Figure 5.5: Conserved motifs of the RdRp sequences identified in the family Marnaviridae.
Figure 5.6: Distribution of pairwise amino acid sequence identities for Marnaviridae members.
Figure 6.1: Taxonomy of RNA virus genera recognized by the ICTV.
Supplementary Figure A.1: Pairwise amino acid sequence similarity and neighbour-joining trees of the conserved domains of the six analyzed marine RNA viruses.
Supplementary Figure A.2: Reads mapping to six marine RNA viruses represented by metagenomes.
Supplementary Figure A.3: Comparison of metagenomic library size and number of reads that mapped to the six genomes.
Supplementary Figure B.1: Schematic representation of experimental procedures used to construct the RNA virus metagenomic libraries.
Supplementary Figure B.2: CCA of taxonomically assigned RNA virus assemblages with respect to sampling dates.
Supplementary Figure B.3: Phylogenetic placement of bacteriophage-like environmental RNA-dependent RNA polymerase reads within a Leviviridae reference tree.
Supplementary Figure B.4: Shannon diversity and species evenness of samples in relation to environmental parameters.
Supplementary Figure B.5: Diversity and species evenness of samples in relation to sampling year.
Supplementary Figure C.1: Pearson correlations of Log(X+1) transformed data.
Supplementary Figure C.2: Effective number of species, evenness and richness of viral RNA assemblages observed in the different watersheds.
Supplementary Figure C.3: Spatial and temporal environmental data used in this study.
Supplementary Figure C.4: Seasonal variation in site-specific beta diversity at the Urban and Protected sites.
Supplementary Figure C.5: Non-metric multidimensional scaling plots of watershed RNA virus assemblages.
Supplementary Figure D.1: Pairwise amino acid comparisons of Marnaviridae capsid sequences.
Supplementary Figure D.2: Pairwise amino acid comparisons of Marnaviridae RNA-dependent RNA polymerase sequences.
Supplementary Figure D.3: RNA-dependent RNA polymerase conserved domain alignment of the family Marnaviridae and type members of the order Picornavirales.
Supplementary Figure D.4: Pairwise amino acid comparisons of the RdRp domains of the family Marnaviridae and type members of the order Picornavirales.
Supplementary Figure D.5: Maximum-likelihood phylogeny of Marnaviridae capsid proteins.
Supplementary Figure D.6: Summary of all Marnaviridae genera and their associated number of species and amino acid sequence diversities as well as select Picornaviridae and Potyviridae.
Supplementary Figure E.1: Diversity and relative abundance of taxonomically assigned marine RNA virus assemblages.
Supplementary Figure E.2: Diversity and relative abundance of taxonomically assigned freshwater RNA virus assemblages.

LIST OF ABBREVIATIONS

+ve .... positive/positively
-ve .... negative/negatively
3CPro .... chymotrypsin-like protease
A .... Alanine
aa .... amino acid
ABPV .... Acute bee paralysis virus
ADS .... Agricultural downstream site
AEV .... Avian encephalomyelitis virus
AgRNAV .... Asterionellopsis glacialis RNA virus
AIC .... Akaike Information Criterion
AiV .... Aichi virus
AlliS_10 .... Allison Sound 2010 metagenome
ALPV .... Aphid lethal paralysis virus
ANOVA .... Analysis of variance
APL .... Agricultural affected/polluted site
APLV ....
Antarctic Picorna-like virus Arc .............................................................................................. Arctic composite metagenome AuRNAV ........................................................ Aurantiochytrium single-stranded RNA virus 01 ASV ................................................................................................................ Avian sapelovirus AUP ................................................................................................... Agricultural upstream site BBWV ...................................................................................................... Broad bean wilt virus BC-1 ..................................................................................................... Marine RNA virus BC-1 BC-2 ..................................................................................................... Marine RNA virus BC-2 BC-3 ..................................................................................................... Marine RNA virus BC-3 BelI_10 ....................................................................................... Belize Inlet 2010 metagenome BerS_a08.................................................................................. Bering Sea 2008 metagenome A BerS_b08 ................................................................................. Bering Sea 2008 metagenome B BLAST ............................................................................... Basic Local Alignment Search Tool BlinkB_10 .......................................................................... Blinkenskop Bay 2010 metagenome BLlV .......................................................................................................... Beihai levi-like virus BMV ........................................................................................................ 
Meihai mollusks virus BSWV .......................................................................................... Beihai sipunculid worm virus bp ................................................................................................................................... base pair BPLV ................................................................................................... Beihai picorna-like virus BSCV ................................................................................................ Beihai sesarmid crab virus BSL RDHV ................................................................. Boiling Springs RNA-DNA hybrid virus BVY .............................................................................................................. Blackberry virus Y xvi  BZ13 ...................................................................................................... Escherichia virus BZ13 C ..................................................................................................................................... Cysteine CampR_12 ........................................................................... Campbell River 2012 metagenome CapeP .................................................................................................... Cape Point metagenome CarrB_11............................................................................... Carrington Bay 2011 metagenome CCA .............................................................. Constrained/Canonical Correspondence Analysis CChil_97 .................................................................................. Central Chile 1997 metagenome CcloRNAV ....................................................................... Cylindrotheca closterium RNA virus CCV ................................................................................................... 
Changjiang crawfish virus CDF ......................................................................................... cumulative distribution function ChamH_11 .................................................................... Chameleon Harbour 2011 metagenome ChamH_12 .................................................................... Chameleon Harbour 2012 metagenome CLlV .................................................................................................. Changjiang levi-like virus CLSV ................................................................................................... Cucumber leaf spot virus CMtV ....................................................................................................... Carnation mottle virus CPMV ....................................................................................................... Cowpea mosaic virus CRLV ........................................................................................................ Cherry rasp leaf virus CrPV ........................................................................................................ Cricket paralysis virus CRSV .................................................................................................... Carnation ringspot virus CsfrRNAV ............................................................... Chaetoceros socialis f. radians RNA virus CspRNAV2 ................................................................................. Chaetoceros sp. RNA virus 02 CtenRNAV01 ............................................................... Chaetoceros tenuissimus RNA virus 01 CtenRNAVII .......................................................... Chaetoceros tenuissimus RNA virus type II dN/dS ....................................................... non-synonymous/synonymous substitution rate ratio D ............................................................................................................................ 
Aspartic Acid DCV .............................................................................................................. Drosophila C virus DHBV ..................................................................................................... Duck Hepatitis B virus DIN ................................................................................................ Nitrate and Nitrite combined DMClaHV............................................ Daphnia mendotae-associated RNA-DNA hybrid virus DO.................................................................................................................. Dissolved Oxygen DPI ................................................................................................................ Days post infection DpRNAV ......................................................................................... Delisea pulchra RNA virus DpPV ............................................................................................ Delisea pulchra Picorna virus DiscP_12 ......................................................................... Discovery Passage 2012 metagenome DNA ......................................................................................................... Deoxyribonucleic acid dsRNA ...................................................................................................... double stranded RNA DWV .......................................................................................................... Deformed wing virus E ............................................................................................................................ Glutamic acid e-value..................................................................................................................... Expect value EMCV ............................................................................................. 
Encephalomyocarditis virus ER ........................................................................................................... Endoplasmic reticulum xvii  ERBV ....................................................................................................... Equine rhinitis B virus F ............................................................................................................................ Phenylalanine FI ................................................................................................................. Escherichia virus FI FMDV .......................................................................................... Foot-and-mouth disease virus G ..................................................................................................................................... Glycine GOM_96 ............................................................................... Gulf of Mexico 1996 metagenome GOM_97 ............................................................................... Gulf of Mexico 1997 metagenome GorgH_12 ............................................................................. Gorge Harbour 2012 metagenome H ................................................................................................................................... Histidine HaRNAV .............................................................................. Heterosigma akashiwo RNA virus HaRV .................................................................................................... Havel river tombusvirus HaSV...................................................................................... Helicoverpa armigera stunt virus HcRNAV ................................................................... Heterocapsa circularisquama RNA virus HCV .................................................................................................................. 
Hepatitis C virus HiPV ................................................................................................................. Himetobi P virus HIV .......................................................................................... Human immunodeficiency virus HLV ................................................................................................................. Hubei leach virus HLlV ........................................................................................................... Hubei levi-like virus HoCV-1.................................................................................... Homalodisca coagulate virus -1 HPeV.......................................................................................................... Human parechovirus hpi ................................................................................................................ hours post infection I ................................................................................................................................... Isoleucine IAPV ................................................................................................ Israeli acute paralysis virus ICTV ............................................................... International committee on taxonomy of viruses IGR .................................................................................................................. Intergenic region IFV ....................................................................................................... Infectious flacherie virus IRES ................................................................................................. Internal ribosome entry site JohnS ................................................................................... Johnstone Strait 2011 metagenome JohnS_11 ............................................................................. 
Johnstone Strait 2011 metagenome JP-A ...................................................................................................... Marine RNA virus JP-A JP-B ...................................................................................................... Marine RNA virus JP-B JP13 ........................................................................................... Jericho Pier 2013 metagenome JP14 ........................................................................................... Jericho Pier 2014 metagenome JP_a10 ..................................................................................... Jericho Pier 2010 metagenome A JP_b10 .................................................................................... Jericho Pier 2010 metagenome B JP_a11 ..................................................................................... Jericho Pier 2011 metagenome A JP_b11 .................................................................................... Jericho Pier 2011 metagenome B JRC .................................................................................................................... Jelly-roll capsid K ....................................................................................................................................... Lysine KBV ................................................................................................................ Kashmir bee virus KoS ................................................................................................ Kenton-on-Sea metagenome xviii  kb ................................................................................................................................... kilo base L ...................................................................................................................................... Leucine LagM_94................................................................................ 
Laguna Madre 1994 metagenome LasqI_12 ............................................................................... Lasqueti Island 2012 metagenome LV UC1 ............................................................................................................ Laverivirus UC1 M ............................................................................................................................... Methionine MacS_08 ............................................................................ Mackenzie Shelf 2008 metagenome McDV ..................................................................................................... Mud crab dicistrovirus MNSV ................................................................................................. Melon necrotic spot virus MS2 .................................................................................................. Enterobacterio phage MS2 MT ......................................................................................................................... Mitochondria N ................................................................................................................................ Asparagine NMDS .............................................................................. Non-metric multidimensional scaling nr database ............................................................................................ Non-redundant database NarrI_11 .................................................................................. Narrows Inlet 2011 metagenome NarrI_12 .................................................................................. Narrows Inlet 2012 metagenome Nun_a07 ....................................................................................... Nunavut 2007 metagenome A Nun_b07 ...................................................................................... 
Nunavut 2007 metagenome B Nun_c07 ....................................................................................... Nunavut 2007 metagenome C Nun_d07 ...................................................................................... Nunavut 2007 metagenome D Nun2 ......................................................................................... Nunavut 2007 metagenome 2/B Nun3 ......................................................................................... Nunavut 2007 metagenome 3/C OCSV .................................................................................................... Oat chlorotic stunt virus ORF ............................................................................................................. Open reading frame OTU ................................................................................................. Operational taxonomic unit P ....................................................................................................................................... Proline PCO .............................................................................................. Principle component analysis PCA ............................................................................................. Principle coordinates analysis PCR .................................................................................................... Polymerase chain reaction PendS .................................................................................... Pendrell Sound 2012 metagenome PendS_10 .............................................................................. Pendrell Sound 2010 metagenome PendS_11 .............................................................................. Pendrell Sound 2011 metagenome PendS_12 .............................................................................. 
Pendrell Sound 2012 metagenome Peru_97 .................................................................................................. Peru 1997 metagenome PhV-A ........................................................................................... Plasmopara halstedii virus A PLV ................................................................................................................ Pothos latent virus Poly-A ................................................................................................................ Poly adenylated PortE_11 ................................................................................. Port Elizabeth 2011 metagenome PowR_12 ................................................................................. Powell River 2012 metagenome PRR1 ................................................................................................. Pseudomonas phage PRR1 PSIV ................................................................................................... Plautia stali intestine virus xix  PTV .............................................................................................................. Porcine teschovirus  PUP ................................................................................................................ Pristine watershed PV .............................................................................................................................. Polio virus PVY ...................................................................................................................... Potato virus Y PYFV ................................................................................................. Parsnip yellow fleck virus Q ................................................................................................................................. Glutamine Qbeta .................................................................................................... 
Escherichia phage Qbeta QI_a15 ................................................................................ Quadra Island 2015 metagenome A QI_b15 ................................................................................ Quadra Island 2015 metagenome B QI_c15 ................................................................................ Quadra Island 2015 metagenome C QI_d15 ................................................................................ Quadra Island 2015 metagenome D QI_e15 ................................................................................ Quadra Island 2015 metagenome E QCSound_10 ........................................................... Queen Charlotte Sound 2010 metagenome QCStrait .................................................................... Queen Charlotte Strait 2011 metagenome QCStrait_11 .............................................................. Queen Charlotte Strait 2011 metagenome R ..................................................................................................................................... Arginine R2 ........................................................................................................ Determination coefficient RAxML ............................................................. Randomized Axelerated Maximum Likelihood RCNMV .................................................................................. Red clover necrotic mosaic virus RDHV_likeV SF1 .......................................................................... RNA-DNA hybrid virus SF1 RdRP ..................................................................................... RNA-dependent RNA polymerase RhPV ................................................................................................. Rhopalosiphum padi virus RNA ................................................................................................................ 
Ribonucleic acids RsRNAV .......................................................................................... Rhizosolenia setigera RNA RT .............................................................................................................. Reverse transcriptase RT-qPCR .................................... Reverse transcription quantitative polymerase chain reaction RTSV ................................................................................................ Rice tungro spherical virus S ......................................................................................................................................... Serine S3H ......................................................................................................... Superfamily 3 helicase SaPLV .................................................................................................. Sanxia picorna-like virus SARS .......................................................................... Severe acute respiratory syndrome virus SBV ..................................................................................................................... Sacbrood virus SChil_97 .................................................................................... South Chile 1997 metagenome SDV ............................................................................................................ Satsuma dwarf virus SecI_11 ..................................................................................... Sechelt Inlet 2011 metagenome SF-1 ...................................................................................................... Marine RNA virus SF-1 SF-2 ...................................................................................................... Marine RNA virus SF-2 SF-3 ...................................................................................................... 
Marine RNA virus SF-3 SgCV .......................................................................................................... Saguaro cactus virus ShPLV .................................................................................................. Shahe picorna-like virus SimS_11 ................................................................................. Simoon Sound 2011 metagenome xx  SINV-1 ................................................................................................ Solenopsis invicta virus-1 SLlv............................................................................................................. Shahe levi-like virus SmV-A .................................................................................. Sclerophthora macrospora virus A SNVs .................................................................................................. Single nucleotide variants SP .......................................................................................................... Enterobacteria phage SP SPMMV ...................................................................................... Sweet potato mild mottle virus ssRNA ...................................................................................... single stranded Ribonucleic acid SVV ............................................................................................................. Seneca Valley virus T .................................................................................................................................. Threonine TBSV .................................................................................................. Tomato bushy stunt virus TCV ............................................................................................................. Turnip crinkle virus TeakA_11 .............................................................................. 
Teakerne Arm 2011 metagenome TeakA_12 .............................................................................. Teakerne Arm 2012 metagenome TORV ..................................................................................................... Tobacco ringspot virus TSV ........................................................................................................... Taura syndrome virus TTV............................................................................................................ Tomato torrado virus UDS ........................................................................................................ Urban downstream site UPL ................................................................................................. Urban affected/polluted site UTR ............................................................................................................. Untranslated region V ....................................................................................................................................... Valine W............................................................................................................................... Tryptophan WenzGV ........................................................................................... Wenzhou gastropode virus WenzPLV ....................................................................................... Wenzhou picorna-like virus WenlPLV .......................................................................................... Wenling picorna-like virus WLlV ...................................................................................................... Wenling levi-like virus WzLlV .................................................................................................. Wenzhou levi-like virus Y .................................................................................................................................... 
Tyrosine   xxi  GLOSSARY  α diversity   the species diversity of a local community, assemblage or habitat taking richness and evenness into account β diversity   the degree of community differentiation; often associated with differences in habitat or location (spatial scale) 50% loss of specific infectivity in virology it is defined as the mutation frequency at which 50% of the viral genomes are lethally mutated 5’ cap an altered nucleotide at the 5’ end of transcripts, eg. precursor mRNA, which is essential for the translation of the molecule Abiotic   the non-living components of an entity, such as chemical, geological and physical aspects Analysis of variance in statistics it is a model used to analyze the difference among group means and their associated procedures Assemblage a group of species found together in a specific habitat at a certain time Bacillarnavirus viral genus of ssRNA viruses infecting diatoms. Formerly unclassified within the order Picornavirales, proposed reclassification to the family Marnaviridae. It is an acronym of the host group and RNA virus Bank theory (seed bank theory) in viral ecology it refers to the phenomena where a viral assemblage consists of a few abundant viruses, while most taxa are rare. The rare viruses can rapidly increase in abundance due to abiotic or biotic interactions Biotic   associated or involving living organisms Bipartite in virology it refers to a viral genome that consist of two molecules of nucleic acids xxii  BLAST in bioinformatics, it refers to software that finds regions of similarity between nucleotide and/or protein sequences by comparing them to a reference database. Results are provided with statistical significance Bloom   a population outbreak, or sudden increase, of phytoplankton (microscopic algae) that remains within a specific part of the water column Bootstrap value a statistical method to measure confidence in a result by subsampling. 
In phylogenetics it is used to determine the confidence in the topology of a phylogenetic tree
Burst size: number of viruses produced during the infection of a single cell
Capsid: protein structure/shell of viruses composed of multiple subunits
Coastal: oceanic region on a continental shelf, often referred to as “neritic”
Coevolution: in biology it refers to two or more organisms or species that mutually affect the evolution of each other
Constrained Correspondence Analysis: also called canonical correspondence analysis; an interpretive multivariate method used to relate species abundance to environmental variables
Cumulative distribution function: in statistics it is a method used to describe the distribution of any kind of random variable
Dicistronic: in virology it refers to a genome organization where there are two distinct open reading frames
Dynamics: the changes in the size or composition of a population with respect to time
E-value: in bioinformatics it describes how likely it is that two sequences match by coincidence without taking sequence length into account; similar to p-value
Error catastrophe: in virology it refers to the rapid decline in RNA genome infectivity at levels of mutagenesis only slightly higher than normal
Environmental Placement Algorithm: algorithm that places shorter sequences on a robust reference phylogeny
Evenness: in ecology it refers to how close in numbers each species in a particular environment is, or how equal a community is in numerical terms
Family: in virology it is a group of genera that share common characteristics. Family names are single words ending with the suffix -viridae
GC content: percentage ratio of guanine and cytosine nucleotides in relation to adenine and thymine
Genus: in virology it is a group of related species that share some significant properties and often only differ in host range and virulence.
Genus names are single words ending in the suffix -virus
Heterotroph: an organism that relies on the consumption of organic compounds as food for growth
Homology (homologues): in biology it refers to the existence of a shared ancestry between genes or structures associated with different taxa
Identity: when comparing multiple amino acid sequences, it refers to the amino acids that are the same, i.e. have the same molecular structure; in a BLAST search it is expressed as a percentage and refers to the extent to which two amino acid or nucleotide sequences have the same residues at the same positions in an alignment
Illumina: company that produces high-throughput sequencers including the HiSeq, MiSeq and NextSeq
Inlet: partially enclosed body of water that maintains a direct connection to the ocean
Kruskal-Wallis test: a non-parametric method for testing whether samples originate from the same distribution
Kusrnavirus: a proposed new genus of ssRNA viruses in the family Marnaviridae. It is derived from the Afrikaans word for coastal
Labyrnavirus: viral genus of ssRNA viruses infecting diatoms. Formerly unclassified within the order Picornavirales, proposed to be reclassified to the family Marnaviridae. Acronym of the host group and RNA virus
Latent period: in virology, the time between when a cell is first infected by a virus and the point where the cell is lysed and the produced virus particles are released. It is during this period that the virus replicates
Locarnavirus: a proposed new genus of ssRNA viruses in the family Marnaviridae. Acronym of Locarno beach, from which the first marine RNA virus metagenomes were assembled
Marnaviridae: viral family of ssRNA viruses in the order Picornavirales.
Acronym of marine and RNA virus
Marnavirus: virus genus of ssRNA viruses infecting a Raphidophyte in the family Marnaviridae
Microdiversity: the differences in organisms that have near identical physical and genetic features, which can have important ecological and evolutionary consequences
Maximum-likelihood: a statistical method to find the most likely compromise; often used to build phylogenies based on evolutionary substitution models
Mixed layer: in oceanography it refers to the water layer that is well mixed, the boundary of which is created by a rapid change in density. It is influenced by temperature and salinity and often driven by surface wind
Monocistronic: in virology it refers to a genome organization which consists of one distinct open reading frame, which codes for a polyprotein
Monopartite: in virology it refers to viral genomes that consist of a single molecule of nucleic acid
Monophyletic: in cladistics, it describes a group of species that form a clade. They are all descendants of a common ancestor and frequently share derived characteristics distinguishing one clade from another
Neighbour-joining: a clustering method based on distance matrices, used for the creation of phylogenetic trees
Niche: specific requirements or role of a particular species/population within a larger community
Niche partitioning: in ecology, it refers to the process by which competing species use the environment differently so that they can coexist
NMDS: an ordination which makes few assumptions about the nature of the data. It is used to visualize the variation (distance measure) from multiple variables in two or three dimensions
ORF/Open Reading Frame: a region of genetic information between a start and stop codon, a putative functional gene
Order: in viral classification it refers to families of viruses that are considered to have likely evolved from a common ancestor.
Order names are written in italics and end with the suffix -virales
p-value: in statistics it refers to the probability of obtaining a result equal to or more extreme than what was actually observed, when the null hypothesis is true
Poly-adenylated: a chain of adenine nucleotides at the 3’ end of many RNA virus genomes
Population: a group of individuals of the same species occupying a geographic area with relation to a specific period of time
Principal component analysis (PCA): exploratory multivariate statistic for linearly related data
Principal coordinates analysis (PCO): an ordination technique similar to PCA but more versatile in that any ecological distance can be investigated
Prokaryote: a single-celled organism with no membrane-bound nucleus or other membrane-bound organelles
Protist: a eukaryotic organism, frequently unicellular, that is not an animal, plant or fungus. Protists are not a monophyletic group or clade but are frequently grouped together for convenience
Purifying selection: synonymous with natural, stabilizing or negative selection. It is responsible for the preservation of characteristics of species under constant environmental conditions. It relies on the removal of individuals or genes that deviate from the established norm, resulting in a particular characteristic remaining unchanged despite continuous mutagenesis
Quasispecies: in virology it refers to a group of viruses related by similar mutations. They are continuously competing within a highly mutagenic environment
R²: in linear regression it describes the variation in the dependent variable that can be explained by the independent variable. An adjusted R² considers the number of independent variables in the model
Rarefy/rarefaction of reads: the numbers of reads from different libraries are made similar through sub-sampling.
The number of reads ultimately equals that of the sequencing library with the lowest number of reads
RdRp: an enzyme that catalyzes the replication of RNA from an RNA template. This enzyme is conserved in viruses but also found in eukaryotes as part of the RNA interference pathways
Reassortment: in virology it refers to the swapping of distinct RNA molecules between multipartite virus particles
Recombination: in virology it refers to the process of substituting or combining portions of nucleic acids from a ‘donor’ sequence to an ‘acceptor’ genome
Relative abundance: a quantitative pattern of commonness and rarity among species in a sample or assemblage
Richness: the number of distinct organisms/taxa in a sample without any regard for their abundance
Salisharnavirus: a proposed new genus of ssRNA viruses in the family Marnaviridae. The name pays homage to the Salish Sea, the water masses from which the first marine RNA virus amplicons were amplified
Sequence diversity: in molecular genetics it is a measure of genetic variation (either nucleotide or amino acid) within a population
Similarity: when comparing multiple amino acid sequences, it refers to amino acids that are biochemically similar in function but not identical in molecular structure
Single Nucleotide Variants (SNVs): the variation of a single nucleotide at a specific position in a genome, where the variation is generally greater than 1%
Sogarnavirus: a proposed new genus of ssRNA viruses in the family Marnaviridae. The acronym refers to the Strait of Georgia, a major water body of the Salish Sea where the majority of marine RNA virus research has been conducted
Subfamily: in virology it is a group of genera that share a common characteristic. It is only used to clarify/solve complex hierarchical problems.
Subfamily names are single words ending in the suffix -virinae
Stratified: in oceanography it is a water column that has a well-defined density gradient (pycnocline) with a defined layer of low density water on top. Influenced by temperature and fresh water input
Species (viral species): as defined by the ICTV (2013), it is a monophyletic group of viruses whose properties can be distinguished from those of other species by multiple criteria

ACKNOWLEDGEMENTS

This has been a long haul and there have been many individuals who have contributed to the inevitable completion of this degree. Here I would like to express my sincere gratitude to a few.

I’d like to thank my supervisor, Curtis Suttle, for providing me with this opportunity, his support through the years and for helping me define the questions that resulted in this thesis. To my committee members Hélène Sanfaçon and Patrick Keeling, thanks for keeping me on track. All three of you have been so supportive through the highs and lows and for that I am eternally grateful.

A research laboratory is like a community and this project would not have been possible without the help and support of lab members, past and present. Thanks for being there with sound advice when the lab work wasn’t going as planned, providing inspiration when mine was low, celebrating the small victories with me and just generally being there. I would like to thank (in alphabetical order) Renat Adelshin, Anwar Al-Qattan, Jessica Caleta, Christina Charlesworth, Amy Chan, Caroline Chenard, Anna Cho, Cheryl Chow, Jessie Clasen, Christoph Deeg, Jan Finke, Matthias Fisher, Julia Gustavsen, Ezra Kitson, Jessica Labonte, Gideon Mordecai, Jerome Payet, Jeff Strohm, Emma Shelford, Alvin Tian, Rick White, Danielle Winget, Jennifer Wirth, Kevin Zhong and Matthias Zimmer. A special thank you to Andrew Lang who kindled the enthusiasm when mine was low and for his insistence that the end was in sight.
So many wonderful people have supported and kept me sane throughout this experience, providing distractions when I didn’t realize the necessity for a break, being understanding when I was irritable and for the words of encouragement when I needed them. I’d like to thank Adrian Jones for being supportive through this endeavour, being patient through my frustrations and celebrating with me through the successes. The statistical help was also greatly appreciated.

I am eternally grateful to my furry companions who were a large part of this journey. I suspect they enjoyed the thesis writing experience more than me.

I am very grateful to UBC, BRITE (Biodiversity Research Integrative Training & Education) and the Department of Botany for funding over the many years of this degree.

Last but not least, I’d like to thank my parents, Lezette and Gerrit Vlok, who have always supported me in whatever venture I chose to partake in. I couldn’t have asked for more supportive parents.

DEDICATION

To the munchkins
the providers of perspective and the heart of my home

Chapter 1

INTRODUCTION TO RNA VIRUSES ASSOCIATED WITH SINGLE-CELLED EUKARYOTES

The term ‘virus’ was first used in 1885 when Louis Pasteur unknowingly but correctly described the ‘poison’ in the spinal fluid of a person infected with the rabies virus. He had no idea as to the nature of the causative agent. Virology has come a long way since 1885. It is now widely recognized that viruses are the most abundant living entities on the planet (Desnues et al., 2008); they have a significant effect on health and agriculture and are major drivers of biogeochemical cycles (Brussaard et al., 2008).

Viruses are ‘nucleoprotein-based, supramolecular ensembles that evolved into biological nanomachines capable of self-replication within cells and propagation between cells and organisms’ (Mateu, 2011).
They are the largest reservoirs of genetic potential on the planet (Suttle, 2007), and many can be manipulated to such an extent that genetically modified, reverse engineered viruses have been created. As a result of this largely untapped genetic reservoir, viruses have been teased apart and used for biomedical and biotechnological processes (Wigdorovitz et al., 1999; Dujardin et al., 2003; Lee et al., 2009). To date, the best studied viruses are those that directly affect humanity, with the classical example being poliovirus (PV) (Bienz et al., 1994; Andino et al., 1999; Duintjer Tebbens et al., 2013; Koopman et al., 2017).

With the public focussed on the negative effect of viruses on health and society, their important contribution in shaping environmental processes is easily overlooked. An excellent case study of this is the role of marine viruses in driving biogeochemical cycles in the ocean and shaping host communities.

1.1 The significance of marine viruses

In the late 1980s and early 1990s, it was established that marine viruses are both vastly abundant and directly pathogenic towards various dominant organisms in the ocean (Bergh et al., 1989; Proctor and Fuhrman, 1990; Suttle et al., 1990). It is now widely acknowledged that viruses are an essential part of the marine ecosystem, where they are ubiquitous and abundant, typically present at concentrations of ~10⁷ particles mL⁻¹, and outnumber other forms of marine life by at least an order of magnitude (Maranger and Bird, 1995; Wommack and Colwell, 2000; Fuhrman, 1999; Suttle, 2005). These viral assemblages are both morphologically (Frank and Moebus, 1987; Weinbauer, 2004) and genetically diverse (Edwards and Rohwer, 2005; Brum and Sullivan, 2015). Studies suggest that there may be more than 10,000 viral genotypes in water samples from different marine habitats (Bench et al., 2007), while analysis of four distinct oceanic regions produced more than 50,000 genotypes (Angly et al., 2006).
These estimates have a high level of uncertainty but hint at a significant degree of viral richness. It is this abundance of oceanic viruses that results in approximately 10²⁹ viral infections per day, which in turn results in the release of between 10⁸ and 10⁹ tonnes of carbon per day from the biological pool (Suttle, 2007). Whilst these numbers are informative, they do not encompass all the viral subtypes. Most of the data mentioned above were collected using DNA amplification, metagenomes and flow cytometry. Although these are good methods for analysing DNA viruses, they are not designed or optimised to analyse particles with an RNA genome and are therefore likely to underestimate the importance of RNA viruses.

1.2 RNA viruses

With the exception of smallpox, the first known and studied viruses were RNA viruses (Pasteur virus/Rabies, Tobacco Mosaic Virus and PV). RNA viruses are causative agents of some of the most debilitating emerging infectious diseases (Human immunodeficiency virus (HIV), Hepatitis C virus, Influenza, Severe acute respiratory syndrome (SARS) coronavirus and Zika virus) (Holmes, 2009; Lauring and Andino, 2010; Jones et al., 2008; Musso et al., 2014). This prominence is related to their great genetic diversity (Figure 1.1) and their ability to rapidly evolve, which results in them being more efficient at evading or overcoming host defences than their double stranded (ds)DNA counterparts (Jones et al., 2008).

Figure 1.1: Virus taxonomy of ICTV recognized viral species. The wedges represent the number of virus taxa classified within each family with the colours representing the genome type. Virus particles are adapted from those depicted on the ViralZone website with the hosts of each family represented by a silhouette of a red human (vertebrates), yellow insect (invertebrates), green plant (plants), pink mushroom (fungi), purple bacteria (prokaryotes) and blue flagellate (protists).
RNA viruses have small capsids and genomes (2.3 kb to 32 kb; Holmes and Grenfell, 2009) and large burst sizes. According to the revised Baltimore system, RNA viruses can be classified into five groups according to the orientation of their genome and replication system: positive sense single stranded RNA (+ssRNA), negative sense ssRNA (-ssRNA), double stranded RNA (dsRNA), retroviruses (+ssRNA genome with dsDNA intermediate) and circular single stranded RNA viruses (cssRNA) (Hulo et al., 2011). The type of replication that these groups of RNA viruses use is dependent on the genome type, which determines the highly specific nature of the replication complex. This large and complicated macromolecular complex is composed of a highly sensitive ratio of both viral and host macromolecules and is subject to change depending on the host environment (Nagy and Pogany, 2006). One of the most important aspects of the replication complex is the RNA-dependent RNA polymerase (RdRp). This enzyme not only replicates the virus but is also the driving force of genome variability and evolution (Ahlquist, 2002). RNA virus evolution is described by the quasispecies concept (Figure 1.2), a mutation-selection balance in which a genetically diverse population of variant genomes, all originating from a single progenitor, is ordered around the fittest genome (Holmes, 2009; Ojosnegros et al., 2010). This highly successful population strategy is a result of the high error rate of the RdRp, which lacks the proof-reading apparatus of a DNA polymerase (Chao et al., 2002; Drake, 1993; Raney et al., 2004; Drake and Hwang, 2005), resulting in about 10⁻⁴ changes per nucleotide per replication cycle (Figure 1.3) (Tang et al., 2002; Duffy et al., 2008). This high error rate strategy has led to the most diverse (Koonin et al., 2015; King et al., 2011) and most difficult group of eukaryotic viruses to control.
This is particularly evident in human disease where viruses like PV, measles and HIV have burdened people for decades (Duintjer Tebbens et al., 2013; Bester, 2016; Butler et al., 2016).

In contrast to the step-wise evolution due to the accumulation of point mutations, large-scale evolutionary events occur through mechanisms of genetic exchange (Steinhauer and Holland, 1987; Worobey and Holmes, 1999). The first process, reassortment, only occurs in multipartite viruses, where discrete RNA molecules are swapped between particles, and is a faster means to overcome adaptive host barriers than the accumulation of point mutations (Worobey and Holmes, 1999; McDonald et al., 2016; Vijaykrishna et al., 2015). A good example of reassortment is antigenic shift in influenza viruses, which results in the continuous creation of new virus subtypes often associated with new hosts (Webster and Govorkova, 2014). The second, recombination, is a process of combining or substituting portions of nucleic acids from one sequence to another RNA molecule or genome and relies on a single cell being co-infected by at least two viruses (Worobey and Holmes, 1999). Recombination occurs when the replicating RdRp dissociates from the genome and continues replication upon binding to a second genome, resulting in a chimeric genome. This process is most frequently observed in monopartite viruses, where it drives evolution by accelerating adaptation, enriching beneficial mutations and purging deleterious mutations often accumulated as a result of point mutations (Krupovic et al., 2015; Xiao et al., 2016; Galli and Bukh, 2014; Tromas et al., 2014).

[Figure 1.2, panels A and B. Panel B axes: Mutations/RNA genome; Specific infectivity; LI50 marked.]

Figure 1.2: RNA quasispecies and the relationship of mutation frequency to virus infectivity. A) A virus that has a high mutation rate will generate a diverse progeny over a short time.
The more mutations that are incurred (arrow length), the less likely the particle is to infect the original host (likelihood represented by circles, with white being the most likely and black the least). As the host develops resistance over time, the fitness of the different progeny will change, resulting in a new variant that is most fit. B) The 50% loss of specific infectivity (LI50) is the mutation frequency at which 50% of the viral genomes are lethally mutated. This is rapidly achieved in poliovirus when the levels of mutagenesis are slightly higher than normal.

The consequence of such a rapidly evolving life strategy is that the history and evolution of natural RNA virus assemblages become more obscure with every replication cycle, up to the point of eradication when the number of mutations pushes the virus over the edge of the error catastrophe (Crotty et al., 2001) (Figure 1.2). The elucidation of an evolutionary history is further confounded by the lack of a fossil record. The oldest indisputable sequence of an RNA virus is for influenza A, preserved from the 1918 pandemic (Taubenberger et al., 1997), which is too recent to infer long-term changes in viral evolution. An alternative strategy would be to analyse protein structures and domains, as these tend to be more conserved than gene sequences (Koonin et al., 2006, 2015) and would likely result in more robust evolutionary theories.
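As a rough illustration of why the RdRp error rate matters at the genome scale, the per-nucleotide rate quoted earlier (about 10⁻⁴ changes per nucleotide per replication cycle) can be scaled to the genome sizes given for RNA viruses (2.3 kb to 32 kb). The sketch below is not part of the original analysis: the function names are hypothetical, and treating changes as independent Poisson events is a simplifying assumption made only for this example.

```python
import math

# Assumed inputs, taken from the text: an RdRp error rate of about 1e-4
# changes per nucleotide per replication cycle, and RNA virus genome
# sizes spanning 2.3 kb to 32 kb.
ERROR_RATE_PER_NT = 1e-4


def expected_mutations(genome_length_nt: int, rate: float = ERROR_RATE_PER_NT) -> float:
    """Mean number of changes introduced per genome copy (rate x length)."""
    return rate * genome_length_nt


def fraction_error_free(genome_length_nt: int, rate: float = ERROR_RATE_PER_NT) -> float:
    """Probability that a genome copy carries no changes.

    Assumes changes occur independently along the genome, so the count
    per copy is Poisson distributed (an illustrative assumption).
    """
    return math.exp(-expected_mutations(genome_length_nt, rate))


if __name__ == "__main__":
    for label, length_nt in [("2.3 kb genome", 2_300), ("32 kb genome", 32_000)]:
        mean_changes = expected_mutations(length_nt)
        clean = fraction_error_free(length_nt)
        print(f"{label}: {mean_changes:.2f} changes per copy, "
              f"{clean:.1%} of copies unchanged")
```

Under these assumptions the smallest genomes acquire roughly 0.23 changes per copy and the largest roughly 3.2, so only about 4% of copies of a 32 kb genome would be error free. This is consistent with the picture of a population held close to the error threshold.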
[Figure 1.3, panels A and B. Panel A axes: Mutation rate (log(mutations per site per genome replication)); Genome size (log(kb)); groups plotted: +ssRNA, -ssRNA, retro, dsRNA, dsDNA, ssDNA. Panel B factors, listed as high mutation rate versus low mutation rate: genomic architecture (single stranded, RNA, smaller versus double stranded, DNA, larger); replication speed (faster versus slower); viral enzymes (none versus DNA repair); host enzymes (deamination, oxidation versus none); environmental effects (deamination, oxidation, UV versus none).]

Figure 1.3: Average mutation rates observed in viruses and the factors that influence it. A) Average rates of mutation in viruses with different genome architectures, adjusted to the rate per genome replication. B) Factors in red increase the mutation rate above the neutral rate, while blue factors result in a decrease in mutation rate below the neutral rate. Figure adapted from Duffy et al. (2008).

1.3 Origin and evolution of RNA viruses

The origin and evolution of RNA viruses is a complex topic that consists of interesting theories and hypotheses which, because of a variety of challenges, have yet to be tested. The first distinction to note is the difference between prokaryotic and eukaryotic viruses. It is unusual to find either ss- or dsRNA viruses that infect prokaryotes (the majority of bacterial and archaeal viruses have large dsDNA and small ssDNA genomes), whereas eukaryotic organisms harbour a vast diversity of RNA viruses, including those with retro-transcribing genomes (Koonin et al., 2015).
Although the true cause for this differentiation has yet to be resolved, the current theory implies that compartmentalization and the development of a eukaryotic nucleus resulted in a barrier that most dsDNA virus reproduction could not overcome, while complex cytoplasmic membrane systems provided convenient niches for the reproduction of RNA viruses (Koonin et al., 2015; den Boon and Ahlquist, 2010; Greninger, 2015). This fits with the idea that RNA viruses are frequently more reliant on their hosts, and need a diverse reservoir of material in order to replicate (Nagy and Pogany, 2006; Price et al., 2005). Evidence would suggest that the evolution of RNA viruses is closely tied to the evolution of eukaryotic organisms (eukaryogenesis) (Koonin et al., 2008) and that the majority of these viruses evolved from a common picorna-like virus ancestor with its roots tracing back to prokaryotic protein domains.

Of the three RNA virus superfamilies (Picornavirus-, Alphavirus- and Flavivirus-like) that encompass the majority of +ssRNA viruses infecting eukaryotes, the picornavirus-like superfamily spans the largest diversity of both multi- and unicellular hosts (Le Gall et al., 2008; Koonin et al., 2015). The order Picornavirales is representative of the superfamily and encompasses five families, with unassigned genera and unclassified viruses. These viruses have distinctive pseudo-T = 3 icosahedral capsids and express their genomes by polyprotein processing (Le Gall et al., 2008). Their conserved domains include a superfamily 3 helicase (S3H), a chymotrypsin-like protease (3CPro), an RdRp and a beta-barrel jelly-roll capsid (JRC) protein. These domains are not restricted to members of the Picornavirales and can be found in other +ssRNA viruses. Koonin et al.
(2015) summarized the evolutionary history of these protein domains and suggested that the RdRps evolved from the reverse transcriptases (RTs) of the bacterial group II self-splicing introns while the closest homolog of the 3CPro was the HtrA family proteases associated with bacteria and mitochondria (Koonin et al., 2008; Gorbalenya et al., 1989). The beta-barrel reminiscent of the JRC fold is found in both bacterial carbohydrate-binding proteins (Wong et al., 2000) and non-viral tumor necrosis factors that are known to form virus-like particles despite not being viruses (Liu et al., 2002).  A select few groups of eukaryotic RNA viruses appear to have evolved from RNA phages (Koonin et al., 2015), a poorly represented group of RNA viruses with only two recognised families, the Leviviridae (Bollback and Huelsenbeck, 2001) and the Cystoviridae (Mindich, 2004). Weak protein similarities exist between the RdRp domains of the Picornavirales and RNA phages, but it is unlikely that these viruses are related. Given the rarity of RNA phages, they are most likely not representative of a genetic ancestor to the majority of eukaryotic viruses (Koonin et al., 2015, 2008). 
[Figure 1.4 schematic. Main labels: Prokaryotic cell; Eukaryogenesis; Ancestral picorna-like virus; Picornavirus-like superfamily (including the order Picornavirales); Flavivirus-like superfamily; Alphavirus-like superfamily (including the order Tymovirales); Negative-strand RNA viruses (including the orders Bunyavirales and Mononegavirales); Nidovirales; Leviviridae; Narnaviridae; Ourmiaviridae; Cystoviridae; Reoviridae.]

Figure 1.4: The origin of RNA viruses infecting eukaryotes. This evolutionary schematic is adapted and simplified from Koonin et al. (2015) and is based on the symbiotic events leading to eukaryogenesis. Yellow boxes indicate virus families that have dsRNA genomes. *Indicates proposed precursor prokaryotic virus families to two distinct lineages.

Despite the simple depiction of viral evolution in Figure 1.4, the genetic repertoire of some viral families is the product of many events of protein swapping, loss and acquisition, through recombination and reassortment, that have theoretically occurred between viral families as well as from cellular sources. For example, the dsRNA viruses have polyphyletic origins distinct from +ssRNA virus groups and/or dsRNA bacteriophages (Koonin et al., 2015).

1.4 Marine RNA viruses

RNA viruses in marine environments are both genetically diverse and associated with phylogenetically diverse hosts (Lang et al., 2009).
They have been associated with disease and mortality of marine mammals (Geraci et al., 1982; Jensen et al., 2002; Barrett et al., 1993), fish (Qian et al., 2003; Saksida, 2006; Skall et al., 2005) and crustaceans (Prasad et al., 2016; Bonami et al., 1997; Bonami and Zhang, 2011), and have been implicated in both coral bleaching (Wilson et al., 2001; Levin et al., 2016; Davy et al., 2006; van Oppen et al., 2009) and the die-off of economically viable bivalves (Rasmussen, 1986; Renault and Novoa, 2004). Viruses have also been associated with bloom-forming plankton such as the Raphidophyte Heterosigma akashiwo and the Dinoflagellate Heterocapsa circularisquama. Based on their rapid replication cycles (virus particles are visible 48 hours post infection) and their large burst sizes of 460 – 520 and 3 400 – 21 000 particles per cell, respectively (Tai et al., 2003; Tomaru et al., 2004), these viruses are thought to have a significant effect on both the frequency and duration of plankton blooms.

1.4.1 Enumerating RNA viruses in the ocean

Of the estimated 10⁷ virus particles mL⁻¹ of seawater (Suttle, 2005), it has always been argued that most of the marine viruses had DNA genomes and infected bacteria (Weinbauer, 2004; Steward et al., 1992; Comeau et al., 2010). In general, viral abundance is determined using flow cytometry and epifluorescence or electron microscopy. As stated, these methods are good for detecting larger DNA viruses, but lack the sensitivity and specificity to detect small RNA genomes or, in the case of electron microscopy, to differentiate RNA viruses from small DNA viruses (Tomaru and Nagasaki, 2007). To circumvent the standard methodological issues, Steward et al. (2013) employed fluorometry to estimate the abundance of RNA- and DNA-containing virus particles from the mass of nucleic acid per RNA- and DNA-containing particle. Their study suggested that at least 50% of the virioplankton in the coastal waters of Hawaii have RNA genomes.
If this holds true for other marine environments, it could have huge implications for our understanding of viral production and their contribution to biogeochemical processes, as all numbers would have been underestimated. Considering that the majority of viruses identified in this study were picorna-like viruses, known to infect eukaryotes, we can assume that protists actually contribute more to marine viral dynamics than previously thought (Steward et al., 2013).

1.4.2 Diversity of marine RNA viruses

Most studies on marine RNA viruses have been on those infecting either charismatic megafauna or animals of economic importance (Barrett et al., 1993; Løvoll et al., 2010). Over the last three decades this research has extended to include RNA virus isolates infecting protists as well as RNA metagenomic data, and now includes viruses from four of the five RNA groups with hosts spread across the eukaryotic tree of life (Lang et al., 2009). This includes -ssRNA virus members from the Orthomyxo-, Paramyxo- and Bunyaviridae (infecting various species of fish, marine mammals, crustaceans, and seabirds), dsRNA virus members of the Reo-, Birna- and Retroviridae (infecting bivalves, crustaceans, fish, protists, and seabirds) and +ssRNA viruses belonging to the families Marna-, Dicistro-, Picorna-, Calici-, Noda-, Toga-, Roni- and Flaviviridae (infecting marine mammals and invertebrates, fish, crustaceans, seabirds and protists) (Lang et al., 2009). Of these viruses, those infecting single-celled eukaryotes are most frequently reported as being the representative portion of the free-floating RNA virus assemblage and have been studied using three distinct methods: culture-based, amplicon-based and metagenomic studies.

1.4.2.1 Culture-based studies

Established virus-host systems have the potential to provide highly specific information regarding the virus structure, lifecycle and host interactions (Figure 1.5).
While this approach is not feasible for diversity estimates, the in-depth data are invaluable, especially when so little is known about the group of viruses in question. Heterosigma akashiwo RNA virus (HaRNAV) was the first ssRNA virus reported to infect and lyse a phytoplankton species, the toxic bloom-forming alga Heterosigma akashiwo (Tai et al., 2003). HaRNAV particles are non-enveloped icosahedrons, with an average diameter of 25 nm (Table 1.1). The positive-sense genome is 8.6 kb in size and has distinct domain characteristics that resulted in its classification as the only member of the new family Marnaviridae within the order Picornavirales (Lang et al., 2004; Tai et al., 2003). HaRNAV is strain specific, and particles are visible 48 hours post infection. Like other RNA viruses it has a relatively large burst size of 460 – 520 particles per cell (Tai et al., 2003). Many more icosahedral ssRNA viruses (25 – 32 nm in diameter with genomes ranging from 4.4 to 13.2 kb) have since been isolated and characterized. These include the dinoflagellate-infecting Heterocapsa circularisquama RNA virus (HcRNAV) (Tomaru et al., 2004); Chaetoceros tenuissimus RNA virus type I and II (CtenRNAV01 and CtenRNAVII) (Shirai et al., 2008; Kimura and Tomaru, 2015), Chaetoceros socialis f. radians RNA virus (CsfrRNAV) (Tomaru, Takao, et al., 2009), Chaetoceros sp. RNA virus (CspRNAV) (Tomaru et al., 2013) and Rhizosolenia setigera RNA virus (RsRNAV) (Nagasaki et al., 2004), which infect centric diatoms; the pennate diatom-infecting Asterionellopsis glacialis RNA virus (AglaRNAV) (Tomaru et al., 2012); and Aurantiochytrium RNA virus (AuRNAV), which infects a heterotrophic marine flagellate (Takao et al., 2006). The only known dsRNA virus infecting a marine protist (MpRNAV-01B) infects the photosynthetic Micromonas pusilla (Brussaard et al., 2004). The 11 segments of the 25.5 kb genome are encapsidated by a 65 – 80 nm diameter, non-enveloped capsid.
This strain-specific virus has a latent period of 36 hours, a lytic cycle of 80 hours and a burst size similar to HaRNAV (Brussaard et al., 2004). Although some studies have suggested the existence of other cultured RNA virus representatives, two notable examples being RNA virus-like particles in Ostreococcus sp. (Munn, 2006) and RdRP sequences of two viruses that infect the pennate diatom Cylindrotheca closterium (Culley et al., 2014), the viruses described above remain the only representatives in culture that have been characterised through genome sequencing. Recently, the potential host range of these protist-infecting RNA viruses has been extended to the holobiont of the multicellular protist Delisea pulchra (Lachnit et al., 2016). These viruses included +ssRNA viruses affiliated with the Picornavirales as well as some dsRNA viruses similar to members of the Totiviridae. Double-stranded RNA viruses have been shown to infect multiple parasitic protists, like Giardia, Leishmania, Eimeria, Babesia, Cryptosporidium, Trichomonas, Leptomonas, Entamoeba and Naegleria (Wang and Wang, 1986a; Tarr et al., 1988; Scheffter et al., 1995; Wu et al., 2016; Khramtsov and Upton, 2000; Kraeva et al., 2015; Wang and Wang, 1986b), but this is the first example of such an interaction in a marine host that is not considered a pathogen.

[Figure 1.5 labels: Red algae, Green plants, Alveolates, Stramenopiles, Rhizaria, Fungi, Animals, Dinoflagellates, Apicomplexa, Diatoms, Raphidophytes, Thraustochytrids, Florideophytes, Archamoebae, Kinetoplastids, Prasinophytes, Oomycetes, Ciliates, Charophytes]
Figure 1.5: Groups of protists that are associated with RNA viruses. The global tree of eukaryotes was adapted from Burki (2014) to illustrate the protist groups from which RNA viruses have either been isolated or been identified by transcriptomics. No known viral isolates exist that infect Ciliates, yet metagenomic analysis would suggest that this group has associated RNA viruses (Greninger and DeRisi, 2015a).
Viruses infecting fungi, animals and higher plants were excluded. (Original figure was inspired by a template provided by Y. Eglit.)

Table 1.1: Marine RNA virus isolates infecting marine protists.
RNA virus isolate | Host | Family | Genus | Particle diameter (nm) | Number of genome segments | Genome size (kb) | Latent period (hours) | Burst size (particles per cell) | Reference
HaRNAV: Heterosigma akashiwo RNA virus | Heterosigma akashiwo (Raphidophyte) | Marnaviridae | Marnavirus | 25 | 1 | 8.6 | 48 | 460 – 520 | Lang et al., 2004; Tai et al., 2003
HcRNAV: Heterocapsa circularisquama RNA virus | Heterocapsa circularisquama (Dinoflagellate) | Alvernaviridae | Dinornavirus | 30 | 1 | 4.4 | 24 – 48 | 3.4 x 10^3 – 2.1 x 10^4 | Tomaru et al., 2004
RsRNAV: Rhizosolenia setigera RNA virus | Rhizosolenia setigera (Diatom) | | Bacillarnavirus | 32 | 1 | 11.2 | 48 | 1010 – 3100 | Nagasaki et al., 2004
CtenRNAV01: Chaetoceros tenuissimus RNA virus 01 | Chaetoceros tenuissimus (Diatom) | | Bacillarnavirus | 31 | 1 | 9.4 | <24 | 1.0 x 10^5 | Shirai et al., 2008
CtenRNAVII: Chaetoceros tenuissimus RNA virus type II | Chaetoceros tenuissimus (Diatom) | | | 35 | 1 | 9.5 | 24 – 48 | 287 | Kimura and Tomaru, 2015
CspRNAV: Chaetoceros species RNA virus | Chaetoceros species (Diatom) | | | 32 | 1 | 9.0 – 10.0 | <48 | - | Tomaru et al., 2013
CsfrRNAV: Chaetoceros socialis f. radians RNA virus 01 | Chaetoceros socialis f. radians (Diatom) | | Bacillarnavirus | 22 | 1 | 9.5 | <48 | 66 | Tomaru et al., 2009
AuRNAV: Aurantiochytrium single stranded RNA virus | Aurantiochytrium (Thraustochytrid) | | Labyrnavirus | 25 | 1 | 9.0 | 8 | 5.8 x 10^3 – 6.4 x 10^4 | Takao et al., 2006
AglaRNAV: Asterionellopsis glacialis RNA virus | Asterionellopsis glacialis (Diatom) | | | 31 | 1 | 9.5 | 48 | - | Tomaru et al., 2012
MpRNAV: Micromonas pusilla RNA virus | Micromonas pusilla (Prasinophyte) | Reoviridae | Mimoreovirus | 65 – 80 | 11 | 25.5 | 36 | Similar to HaRNAV | Brussaard et al., 2004

1.4.2.2 Amplicon-based studies
The common approach to detecting similar viruses is to use the polymerase chain reaction (PCR) along with virus-specific primers.
This procedure was used to detect birnavirus (dsRNA) in yellowtail, sea bass, amberjack and pearl oysters (Suzuki et al., 1997; Kitamura et al., 2002, 2003). By using degenerate primers designed to amplify nucleotide sequences coding for proteins with conserved domains, such as the RdRp, whole virus families can be targeted and their relative evolutionary history traced. For example, Culley et al. (2003) used RT-PCR to amplify picorna-like RdRp sequences from 13 of 21 sites in coastal British Columbia. These sequences were phylogenetically distinct from all known isolates apart from those clustering with HaRNAV. With the optimization of the degenerate picorna-like primers and a different filtration method, Culley and Steward (2007) found five new putative genera of picorna-like viruses off the coasts of Hawaii and California, while a study of five coastal samples revealed 145 operational taxonomic units (OTUs) that differed by at least 5% between amino acid sequences, many of which formed yet more new clades (Gustavsen et al., 2014). This addition of new clades indicates that there is a vast diversity of RNA viruses yet to be discovered. While amplicon-based studies are required for in-depth analysis of a specific group of viruses, it is challenging to use this approach to study the entire viral assemblage, because it is difficult to design primers that can detect multiple virus families. Primer design also relies on prior knowledge of the genetic sequences of the target viruses, a luxury which is rarely afforded in the study of unknown viruses.

1.4.2.3 Metagenomic studies
Genomics is largely concerned with the sequencing and analysis of whole genomes, whereas metagenomics applies similar methods to microbial assemblages (Steward and Rappe, 2007). A metagenomic approach allows a less biased analysis of the entire assemblage, as well as the discovery of novel viral genomes, because no previous knowledge of the assemblage is required.
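As an aside on the amplicon-based OTU analyses described above, the idea of grouping sequences that differ by less than 5% at the amino acid level can be illustrated with a minimal sketch. It assumes pre-aligned, equal-length amino acid sequences and a simple greedy seed-based rule; the function names, the greedy strategy and the toy sequences are hypothetical illustrations and are not the clustering method used in the cited studies.

```python
def identity(a, b):
    """Fraction of identical residues between two aligned, equal-length sequences.

    Columns where either sequence has a gap ('-') are ignored.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not compared:
        return 0.0
    return sum(x == y for x, y in compared) / len(compared)


def greedy_otus(seqs, threshold=0.95):
    """Greedy clustering: a sequence joins the first OTU whose seed it matches
    at >= threshold identity; otherwise it becomes the seed of a new OTU."""
    seeds = []
    otus = []
    for s in seqs:
        for i, seed in enumerate(seeds):
            if identity(s, seed) >= threshold:
                otus[i].append(s)
                break
        else:
            seeds.append(s)
            otus.append([s])
    return otus


if __name__ == "__main__":
    # Toy 40-residue sequences (hypothetical, not real RdRp fragments).
    ref = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRV"
    close = "A" + ref[1:]       # 1 of 40 residues differs
    distant = "AAAA" + ref[4:]  # 3 of 40 residues differ
    for otu in greedy_otus([ref, close, distant]):
        print(len(otu), otu[0])
```

The outcome of a greedy scheme like this depends on input order, which is one reason published analyses typically rely on dedicated clustering software.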
Because this approach can result in complex datasets that are hard to disentangle and assemble bioinformatically, it is advised to fractionate samples to isolate the populations of interest and thereby reduce complexity (Steward and Rappe, 2007).

Using randomly amplified, reverse-transcribed whole-genome shotgun sequencing, Culley et al. (2006) provided the first examination of RNA virus assemblages, at two coastal BC sites. Between 63% and 81% of the sequences retrieved had no similarity to viruses in the GenBank database. As more metagenomic datasets are published, it is apparent that there is a vast diversity of marine RNA viruses yet to be discovered. Coastal studies from both Hawaii and Antarctica suggest that the majority of free-floating marine RNA viruses belong to the order Picornavirales and that many of these viruses are likely associated with diatoms (Culley et al., 2014; Miranda et al., 2016). The taxonomic diversity of these assemblages extends beyond members of the Picornavirales, with coastal assemblages including members similar to the Tombusviridae (Culley et al., 2006), Nodaviridae, Retroviridae and viruses from the -ssRNA order Mononegavirales (Zeigler Allen et al., 2017). All these studies resulted in the assembly of novel RNA genomes. The first two that were assembled in this fashion are known as the marine RNA viruses JP-A and -B (Culley et al., 2007). Unlike HaRNAV, which encodes a single polyprotein, the genome architecture of JP-A and -B appears to be dicistronic, much like members of the Dicistroviridae. The conserved S3H, 3CPro and RdRp domains are present on the 5’ open reading frame (ORF), while the VP2, VP3, VP4 and VP1 capsid domains are situated at the 3’ end. The expression of the second ORF is thought to be under translational control of an intergenic region internal ribosome entry site (IGR IRES).
This dicistronic genome architecture was echoed in the genomes assembled from Palmer Station, Antarctica (Miranda et al., 2016), while assembled genomes from Kaneohe Bay, Hawaii (Culley et al., 2014) exhibited both the mono- and dicistronic architectures. Many of these genomes were highly divergent, and these data suggest that there is a large pool of marine viruses yet to be discovered. One of the biggest challenges faced with metagenomic data is the lack of comprehensive genomic and transcriptomic databases for many host organisms. Analyzing the virus metagenome within an organism can partially circumvent this and can significantly increase the known diversity of RNA viruses in relation to a putative host. Examples include the detection of viral genomes in the RNA metagenomes of marine invertebrates (Shi et al., 2016) and the coral symbiont Symbiodinium (Correa et al., 2012). Due to the replication strategy of RNA viruses, transcriptome analysis can also be used to identify these +ssRNA viruses, even if the viruses are not actively replicating (Levin et al., 2016; Moniruzzaman et al., 2017). While single-cell transcriptomes are optimal for determining specific virus-host systems, network analysis can be used on metatranscriptome data to elucidate these systems and interactions (Moniruzzaman et al., 2017). These studies show that the metagenomic approach is versatile and can be modified to suit the study. If host information is not essential, metagenomics is still the most efficient tool to study the diversity and relative abundance of free viruses in the ocean.

1.5 Freshwater RNA viruses
Freshwater environments are intricate ecosystems; they represent an ecological interface between host organisms and humans and are an essential resource to life on earth (Djikeng et al., 2009).
This association with the terrestrial environment results in freshwater ecosystems being frequently plagued by terrestrial contamination, either natural leaching from soil or run-off pollution from human activity (Tang et al., 2005). Whether this terrestrial contamination negatively affects the freshwater environment or not, it adds a layer of complexity to the analysis of a virtually unknown assemblage.

To date, we have very little understanding of RNA viruses in freshwater environments. As in marine environments, most of the research focusses on species associated with disease in economically important organisms. This includes viruses infecting the freshwater prawn Macrobrachium rosenbergii (Arcier et al., 1999), farmed tilapia (Eyngor et al., 2014; Bigarré et al., 2009), crayfish (Edgerton et al., 1994) and snails (Galinier et al., 2017). Unlike marine RNA viruses, there are currently no isolates of RNA viruses known to infect single-celled eukaryotes native to freshwater environments. This probably reflects the sparsity of research into these systems and should not be considered evidence that no such viruses exist.

1.5.1 Diversity of freshwater RNA viruses

1.5.1.1 Culture-based studies
Protist-associated freshwater RNA virus isolates have not been characterized. Electron micrographs of the red alga Sirodotia tenuissima (Holden) revealed the presence of polygonal viral particles in cytoplasmic spherical vacuole-like structures (Lee, 1971). While nucleic acids were never extracted from these specimens, the viral replication centers and crystalline arrays were reminiscent of those found in RNA viruses of higher plants. The second isolate, Chara australis RNA virus (CAV), is associated with the charophyte Chara australis and has genetic resemblances to beny- and tobamoviruses (Gibbs et al., 2011). To date, it is the only representative of its kind and has not been taxonomically classified.
1.5.1.2 Metagenomic studies
Metagenomics has provided us with the first look at the diversity of freshwater RNA virus assemblages, in particular those present in lakes (Djikeng et al., 2009; López-Bueno et al., 2015). Not surprisingly, these datasets suggested a complex RNA virus assemblage, with most of the data showing no similarity to anything in the NCBI database. Like the marine assemblages, the freshwater RNA virus assemblages had substantial representation from members of the Picornavirales. These abundant reads extended beyond viral genomes infecting marine single-celled eukaryotes, such as H. akashiwo, R. setigera and Aurantiochytrium, and also included those associated with the families Dicistroviridae and Secoviridae (Djikeng et al., 2009; López-Bueno et al., 2015). According to current data, freshwater RNA virus assemblages exhibit a higher taxonomic diversity than their marine counterparts. Many reads representing families associated with plants have been detected, including the families Tombus-, Poty-, Tobamo- and Bromoviridae and the floating genus Sobemovirus. In addition to the numerous other plant virus families detected, these assemblages also included members from the families Noda-, Reo- and Tetraviridae (classically associated with fish, insects and arthropods) and Picorna-, Hepe-, Flavi- and Orthomyxoviridae (associated with birds and mammals) (Djikeng et al., 2009; López-Bueno et al., 2015; Rosario, Nilsson, et al., 2009). The presence of human viruses in freshwater environments suggests that many of the identified genomes are not native to these environments. It has been hypothesized that freshwater environments are reservoirs for terrestrial viruses, so while they may be considered part of the virus assemblage, they are likely not integral members of the freshwater ecosystem.

1.6 The life of an aquatic RNA virus
On the surface, viruses are simple biological entities.
Their lifecycle involves recognition of a host cell followed by binding to and entry into the cell via specific receptor moieties. The viruses then undergo translation and genome replication before finally assembling their progeny and escaping from the host cell. The intricacies of the virus lifecycle are different for each system. Because so little is known about the replication cycle of marine RNA viruses, we will discuss these topics in a broad sense, frequently drawing from established systems.

1.6.1 Capsid structure and genome organization
Structurally, RNA viruses come in all shapes and sizes with the exception of tailed-phage morphologies (Figure 1.1). Aquatic RNA viruses have predominantly icosahedral capsids (Figure 1.6). While no capsid structures have been resolved for the abundant marine Picornavirales, many capsids of terrestrial Picornavirales members are available. The cryo-EM reconstructed capsid of HcRNAV is the only structure available for viruses infecting marine single-celled eukaryotes. Although this structure is most likely quite disparate from terrestrial picornaviral capsid structures, it shares many similarities, such as a small capsid (34 nm diameter) consisting of 180 quasi-equivalent monomers and T=3 symmetry (Miller et al., 2011; Le Gall et al., 2008). Each monomer contributes towards an elevated, usually antigenic peptide bundle on the surface of the capsid, which is different from the dimer or trimer capsid combinations found in the Tombusvirus and Nodavirus families (Figure 1.6). Cryo-EM reconstructions predict that, like other +ssRNA virus capsids, the HcRNAV capsid has β-barrel jelly rolls as part of the outer capsid, with a less-defined inner shell structure 12 nm in inner diameter. It stands to reason that this less-defined shell is the ssRNA genome.
This idea is given credibility by the finding that the radius at which the capsid domain and RNA shell meet is similar between HcRNAV and nodaviruses, in particular Grouper nervous necrosis virus (GNNV) (Miller et al., 2011). Nodaviruses are some of the best characterized aquatic viruses, with particles reconstructed for numerous betanodaviruses infecting fish (Xie et al., 2016; Chen et al., 2015; Tang et al., 2002) as well as a freshwater shrimp virus (Ho et al., 2017). Striking differences are observed between the particles of individual virus species, with betanodaviruses assembling into trimer clusters, while Macrobrachium rosenbergii nodavirus (MrNV) particles have distinct dimeric blade-shaped spikes (Ho et al., 2017). Regardless of whether or not some members of the nodaviruses need to be phylogenetically reclassified, this structural diversity within a virus family suggests an extensive particle diversity for aquatic RNA viruses. Unfortunately, there are too few structures available to perform an in-depth comparative study.

Figure 1.6: The capsid structure of three aquatic RNA viruses. Heterocapsa circularisquama RNA virus (HcRNAV) belongs to the genus Dinornavirus of the family Alvernaviridae and infects the marine dinoflagellate Heterocapsa circularisquama (left column) (Tomaru et al., 2004; Miller et al., 2011), while Grouper nervous necrosis virus (GNNV) and Macrobrachium rosenbergii nodavirus (MrNV) belong to the family Nodaviridae, in the genus Betanodavirus and the proposed genus Gammanodavirus, respectively (middle and right columns) (Ho et al., 2017; Chen et al., 2015). MrNV infects the giant freshwater shrimp Macrobrachium rosenbergii, while GNNV infects various moribund grouper larvae. The GNNV particle structure was obtained from VIPERdb (Carrillo-Tripp et al., 2009).

Due to the sheer diversity of RNA virus genomes, this review will only highlight the features of the most abundant group, members of the order Picornavirales.
As previously described (Table 1.1), these viruses have small genomes (~10 kb) with conserved domains, including an S3H, a 3CPro, an RdRP and a JRC protein (Figure 1.7) (Le Gall et al., 2008; Lang et al., 2004). The genome organization can be either mono- or dicistronic. Generally, the non-structural domains are situated at the 5’ end of the genome while the structural domains are situated at the 3’ end. This is true in all of the isolates and assembled genomes with the exception of one: genome N_001 from Narragansett Bay, Rhode Island (Moniruzzaman et al., 2017). In a dicistronic genome organisation, expression of the second ORF is typically controlled by an intergenic region (IGR) IRES. While the activity of these IRES sequences has yet to be experimentally confirmed for marine viruses, short stretches of sequence similarity to other viral IRESes are present (Culley et al., 2007). The same holds true for the 5’ IRES, which is thought to control the translation of the first ORF. The 3’ ends of these genomes have been shown to terminate in a poly-A tail, although this has yet to be confirmed for all assembled genomes (Culley et al., 2014; Miranda et al., 2016) because of technical difficulties in acquiring the end sequences of viral genomes.

[Figure 1.7 labels: predicted ORF, IRES, RNA helicase, 3C protease, RdRP, IGR IRES, VP2, VP4, VP3, VP1, poly(A) tail; dicistronic genome; monocistronic genome]
Figure 1.7: Graphical representation of marine RNA virus genomes of the order Picornavirales.
The recognized isolates and metagenomes have either a monocistronic or dicistronic genome organization, which includes a 5’ internal ribosome entry site (IRES [lime green]), S3 helicase (yellow), 3C protease/peptidase (white), RdRp (red), intergenic region (IGR) IRES (green) (in dicistronic genome organisations), four capsid domains (VP2 [purple], VP4 [blue], VP3 [pink] and VP1 [teal]) and a poly(A) tail.

1.6.2 Virus replication cycle
Virus replication is an essential part of the lifecycle and differs depending on a variety of intrinsic and external factors. Because the abundant marine RNA viruses are picorna-like, this summary will focus on the replication cycle of non-enveloped viruses. Receptor-based particle attachment initiates the virus cycle, with the specificity of this interaction determining host range (Mizumoto et al., 2007). This interaction is irreversible. Entry into the cell is poorly understood, even for model systems like poliovirus (PV), where it is thought to occur through a variation of receptor-mediated endocytosis. The VP4 protein facilitates the unpacking of the viral genome, which is delivered into the cytoplasm of the host where translation occurs. Once the RdRp is successfully translated, synthesis of the viral genome can occur. The genomic RNA serves as a template for the synthesis of a complementary minus-strand, yielding a dsRNA intermediate, which is used as a template for viral genome amplification. Though not proven experimentally, marine RNA viruses of single-celled eukaryotes are thought to rely on internal ribosome entry sites for the efficient translation of their genomes, which typically lack a 5’ cap (Lang et al., 2004), much like PV (Andino et al., 1999). The translation products are either a single polyprotein, as in HaRNAV (Lang et al., 2004), or two distinct structural and non-structural polyproteins, as in RsRNAV (Shirai et al., 2006).
The structural proteins assemble into provirion capsids, into which the newly synthesized +ssRNA genomes are packaged, with structural protein cleavage resulting in particle maturation (Andino et al., 1999; Knipe et al., 1997). These particles frequently form crystalline arrays in the host cytoplasm (Tai et al., 2003) (Figure 1.8). The burst size of these viruses varies dramatically (Table 1.1), from as few as 66 particles per cell for CsfrRNAV (Tomaru, Takao, et al., 2009) to 1.0 x 10^5 for CtenRNAV01 (Shirai et al., 2008).

Figure 1.8: Transmission electron micrographs of single-celled eukaryotes infected with RNA viruses. Micrographs were modified from the relevant publications as described below: A) Heterosigma akashiwo cytoplasm with swollen endoplasmic reticulum (ER) and fibril (F) filled vacuoles (VC) due to HaRNAV infection (Tai et al., 2003). B) A disrupted Chaetoceros socialis f. radians cytoplasm at 48 hpi with CsfrRNAV (Tomaru, Takao, et al., 2009). C and D) Crystalline arrays of assembled virus particles of HaRNAV and CtenRNAV01 in the cytoplasm of H. akashiwo and Chaetoceros tenuissimus 48 hpi (Tai et al., 2003; Shirai et al., 2008). MT=mitochondria.

During viral infection, the host cell undergoes many changes (Figure 1.8). In cells of the raphidophyte H. akashiwo infected with HaRNAV, an increase in fibril-containing cytoplasmic vacuoles and a swollen endoplasmic reticulum (ER) were observed (Tai et al., 2003). Often in the case of infection in protists, the cytoplasm and organelles will completely disintegrate; this was observed with CsfrRNAV infection of Chaetoceros socialis f. radians (Tomaru, Takao, et al., 2009) as well as in HcRNAV-infected Heterocapsa circularisquama cells (Tomaru et al., 2004).
In the above examples of virus replication cycles, ultrastructural changes are recorded 48 hours post infection (hpi), indicating that viral replication is almost complete by this time point. The viral genome of HcRNAV can be detected in infected cells at 24, 48 and 72 hpi. Northern blot analysis would suggest comparable amounts of viral RNAs at each time point, although exact amounts are undetermined (Mizumoto et al., 2007). In contrast, a decrease in -ssRNA was observed at 72 hpi, suggesting that between 48 and 72 hpi, the virus genomes are no longer being replicated. Most of the isolates have a latent period of between 24 and 48 hours (Table 1.1), with the exception of AuRNAV, for which it appears to be less than 8 hours (Takao et al., 2005).

1.6.3 Environmental niches and host range
With such an abundance of virus particles in the marine environment, one might expect some regulatory process or niche partitioning to dictate the success of a specific virus species. This would seem especially important in a system where multiple virus species are capable of infecting a single host. The best example of such a system is that of the diatom Chaetoceros tenuissimus Meunier and its associated viruses (Kimura and Tomaru, 2015; Shirai et al., 2008; Kimura and Tomaru, 2017). Competition studies to determine the effect of salinity and temperature on virus infectivity suggest that both ssDNA and ssRNA viruses exhibit niche partitioning (Kimura and Tomaru, 2017). Specific virus species were found to be more successful at infecting the host in environmental conditions that were not necessarily optimal for host reproduction. This suggests that these viruses rely on both the temperature and salinity of the environment to minimize competition from other viruses.
While the molecular mechanism of this phenomenon remains unclear, one hypothesis is that different temperatures and salinities alter the receptor binding reaction, an essential step in establishing viral infection. Classical niche partitioning in viruses involves the sequence of the capsid gene, which restricts host range. When a new virus is isolated, host range is one of the first factors to be determined. Previous studies have demonstrated that marine RNA viruses infecting single-celled eukaryotes are host specific, often limited to a few strains (Mizumoto et al., 2007; Tai et al., 2003; Nagasaki et al., 2004; Kimura and Tomaru, 2017). In HcRNAV, this intraspecies host specificity has been attributed to high-frequency nucleotide substitutions in the structural ORF, which are predicted to be located on the surface of the virus particle (Mizumoto et al., 2007). The virus-host receptor interaction is often more important than host co-factors in establishing the initial viral infection, since adaptation to a new host can subsequently take place. An example of this was the successful production of infectious HcRNAV particles in an unsuitable host after particle bombardment with the viral genome (Mizumoto et al., 2007). This is not limited to aquatic viruses, as yeast has been used for replication studies of tombusviruses for over a decade (Nagy and Pogany, 2006). The survival of C. tenuissimus within a pool of viral pathogens suggests that there are other regulatory factors affecting successful viral infection (Tomaru et al., 2017). Recent sampling of water off the coast of Japan has yielded consistently high abundances of C. tenuissimus in both the seawater and sediments, despite its associated infectious viruses being detectable in the same geographical regions (Tomaru et al., 2017).
It is likely that the infected cells are sinking down the water column and away from the uninfected population (Lawrence and Suttle, 2004), but this is unlikely to be the only contributor to this phenomenon. Resistance to HcRNAV has been reported in H. circularisquama populations (Tomaru, Mizumoto, et al., 2009). After an initial decrease in host cell number three days post infection (dpi), the cells increased in number despite the high virus titer in the medium. This resistance does not appear to be caused by changes in the virus-host receptor binding interaction but rather by suppression of intracellular viral replication (Tomaru, Mizumoto, et al., 2009). The molecular mechanism responsible for this host resistance to viral infection is unknown.

1.7 Dynamics of aquatic RNA viruses across space and time
The majority of research investigating the spatial and temporal variation of aquatic RNA viruses takes a highly focussed approach, often using amplicon studies to analyze a specific virus or a select group or family of viruses (Gustavsen et al., 2014). As technology has advanced and high-throughput sequencing has become relatively cheap, more studies have focussed on changes in the entire assemblage. These examples are described below.

1.7.1 Marine RNA virus dynamics
Spatial and temporal scales are difficult to assess and disentangle in the marine environment, and some may argue that this is a futile endeavour. Regardless, we can make deductions with respect to the observed changes in the assemblages over time at geographically distinct locations. Very few time series datasets exist for aquatic RNA viruses. Instead, studies have focussed on seasonal changes. The abundance of the dsRNA marine birnavirus (MABV), known to infect fish and shellfish, was analyzed near shellfish farms off the coast of Japan.
MABV genomes were found to have a low abundance during summer, with a detectable increase observed during fall and winter (Kitamura et al., 2000). OTU analysis of marine Picornavirales at a coastal site in British Columbia, Canada, would suggest that a select few OTUs are abundant during both summer and fall, while the remaining OTUs differ distinctly between seasons (Gustavsen et al., 2014). Detectable OTUs change on a yearly basis, with distinct sequences and clades detected one year only to be found absent in the next (Culley et al., 2014). A year-long study indicated a shift over time in the dominant groups of phylogenetically related viruses (Gustavsen, 2016). These OTU observations were echoed by a metagenomic study in coastal Antarctic waters where the observed relative abundance of a particular RdRp sequence changed depending on the month and year of sampling (Miranda et al., 2016). Comparing metagenomes from different locations indicates that although these RNA assemblages are generally dominated by Picornavirales, the relative proportions of the families represented change (Miranda et al., 2016). Due to the patchiness of these assemblages, a noticeable difference between geographically distinct locations is sometimes observed, as was the case with two coastal British Columbian samples (Culley et al., 2006). It is hard to compare the small number of available datasets due to the potential variability of these assemblages.

1.7.2 Freshwater RNA virus dynamics
Two geographically distinct lakes serve as example environments for our understanding of the spatial and temporal changes in RNA virus assemblages. Lake Needwood (Maryland, USA) is a man-made lake which functions as a retention basin for terrestrial runoff and sediment from the surrounding urban, agricultural and forested areas (Djikeng et al., 2009).
Lake Limnopolar (Livingston Island, Antarctica) is a natural lake which frequently experiences ice cover (López-Bueno et al., 2009). Metagenomic analysis of the RNA viruses in the two lakes suggests a more taxonomically diverse assemblage in Lake Needwood, with both lake assemblages experiencing temporal shifts at a yearly scale (Djikeng et al., 2009; López-Bueno et al., 2015). The Lake Needwood RNA virus species richness was considerably lower in June compared to November. While both assemblages were dominated by viruses of single-celled eukaryotes, Lake Limnopolar had far fewer sequences associated with plant and animal viruses. The large proportion of viruses classically associated with multi-cellular organisms in Lake Needwood is likely due to the large terrestrial input, emphasizing the importance of the ecological setting. This change in diversity with respect to ecological setting was also observed in Lake Limnopolar at a much smaller scale. In the Antarctic environment, more single nucleotide variations (SNVs) were observed over time in the lake water than in the water surrounding microbial mats, suggesting that there is either more diversity in the mat-free water or that there is input from another source.

1.8 Methodological considerations associated with the analysis of environmental RNA virus assemblages

Many assumptions are made when constructing and analyzing environmental RNA virus assemblages. It is important to consider these assumptions and, while many of them have been noted above, for the sake of consistency and completeness we summarize them below.

1.8.1 Environmental considerations

Aquatic environments are frequently disturbed by input from terrestrial or other environments. This is often referred to as contamination and can provide a potentially biased view of the analyzed assemblage. In coastal marine environments, salinity is a reasonable indicator of freshwater input.
As such, it is important to differentiate between the viral community (the active viruses that infect members of the aquatic environment) and the virus assemblage (all the virus particles present in an environment, regardless of whether they are naturally occurring and active in that particular environment). Pepper mild mottle virus is a good example of a virus that is ubiquitous in aquatic environments, where it is used as an indicator of agricultural and urban contamination (Rosario, Symonds, et al., 2009). Temporal studies often sample water from a specific site, rather than following a specific water mass over time (Fuhrman et al., 2015). Such studies in marine environments have shown repeatable dynamics and predictable patterns (Gilbert et al., 2012). In an attempt to neutralize the bias associated with the microscale heterogeneity in water masses (Seymour et al., 2006), larger water samples of 20 L to 200 L are collected (Culley et al., 2010). Composite samples are frequently collected down the water column. While this approach likely includes many more microniches, it is a means of analyzing the collective viral assemblage, especially when comparing a stratified versus a mixed environment.

1.8.2 Analysis considerations

Most metagenomic studies make use of the sequence-independent single primer amplification (SISPA) method, which relies on 3’ random primers with a 5’ defined tag sequence (Culley et al., 2010). These primers exhibit bias in their random binding and subsequent amplification. Furthermore, the tag sequence has been shown to result in further amplification bias (Rosseel et al., 2013). It is recommended to use a combination of tags and random primer lengths, with longer random sequences exhibiting less bias (Rosseel et al., 2013).
In our experience, a combination of random primer lengths and tags works well, together with the addition of dimethyl sulfoxide and variation of the magnesium chloride and starting RNA concentrations; these factors have previously been shown to minimize amplification bias (Kang et al., 2005; Park and Kohel, 1994). The sequencing technology used can affect the quality of the data and their interpretation. This is especially important when working with RNA viruses that have high mutation rates and exist as quasispecies. While Illumina technology has a low error rate of ≥0.1% (Glenn, 2011), it is still advised to retain only sequences with a quality score above 20 (a score of 20 corresponds to a 1% probability of error). The probability of error is further decreased by trimming low quality ends and producing libraries that result in paired-end read overlap, especially when performing amplicon studies (Bokulich et al., 2013).

Taxonomic classification relies on a database, one which we know is very limited for viruses. When using the taxonomic approach to analyse environmental virus sequences, it is important to keep in mind that a single taxonomic assignment may represent numerous distinct viruses that are all similar to the same reference genome. On the other hand, taxonomic classification can allow one to make biological inferences, provided these caveats are considered.

1.9 Problem statement and research objectives

With a growing interest in the evolution and emergence of RNA viruses, many have argued that we are hampered by our lack of understanding of the basic diversity associated with these viruses (Holmes, 2009; Kristensen et al., 2010). As described above, we have only scratched the surface with respect to the diversity of aquatic RNA viruses. By studying aquatic RNA virus assemblages, we will gain a better understanding of their genetic diversity and distribution, as well as some insight into the evolution of RNA viruses. As the global climate continues to change, we can expect aquatic virus assemblages to change as well.
Considering how little we truly know about these assemblages, it will be challenging to determine the effects that climate change has had on them. Their close association with primary producers means that these assemblages will continue to have an important role in the environment, perhaps more so as the climate tumbles into turmoil. It is important to study both the taxonomic composition of the assemblages and the environmental factors associated with changes in assemblage diversity so that accurate predictions can be made.

1.9.1 Overall goals and objectives

This dissertation focuses on studying the RNA virus assemblages in both marine and freshwater environments, describing the taxonomic diversity of these viruses at geographically distinct locations, the environmental factors that are associated with changes in the diversity of these assemblages, and the evolutionary pressures resulting in viral quasispecies. Because these viruses are thought to be associated with ecologically important and abundant single-celled eukaryotes like diatoms, their potential effect on the ecosystem is substantial.

The overall objectives of this dissertation were as follows:
1. Biogeographic distribution of specific marine viral genomes
2. Determination of the evolutionary pressures acting on environmental quasispecies
3. Analysis of the marine RNA virus assemblage diversity and the environmental drivers associated with changes in assemblages
4. Determination of the environmental drivers associated with changes in freshwater RNA virus assemblage diversity over time
5. Establishing a taxonomic framework which allows for the inclusion of environmental genomes into virus families recognized by the International Committee on Taxonomy of Viruses (ICTV)

1.9.2 Chapter descriptions

To address the goals described above, the dissertation is structured as follows:

Chapter 2: From the Arctic to Africa: Exploring the geographic distribution of six marine RNA viruses and their quasispecies.
During the last decade, many studies have focussed on global viral diversity and distribution (Breitbart and Rohwer, 2005; Cobián Güemes et al., 2016; Thurber, 2009; Chow and Suttle, 2015; Brum et al., 2013). While this has considerably advanced our understanding of viruses with DNA genomes, those with RNA genomes have been completely overlooked. Chapter two addresses this oversight by analyzing the biogeographic distribution of six marine RNA virus genomes using metagenomic data. The datasets were composed of both newly generated and previously published metagenomes from a range of geographic locations. Recruitment of environmental reads to the six genomes revealed the distribution of the genomes. Analysis of single nucleotide variants and of synonymous versus non-synonymous mutations resulted in the identification of environmental quasispecies of marine RNA viruses, as well as the evolutionary pressures acting on these RNA genomes.

Chapter 3: Metagenomic analysis reveals unprecedented diversity in marine RNA virus assemblages from polar to tropical seas.
Published marine RNA virus metagenomes suggest that marine RNA virus assemblages almost exclusively consist of diverse genomes belonging to the order Picornavirales (Culley et al., 2006, 2014; Miranda et al., 2016; Moniruzzaman et al., 2017). Considering the sheer taxonomic diversity of viruses (King et al., 2011), it would be surprising if the taxonomic diversity in aquatic environments were indeed restricted to a select few families. Chapter three examines the diversity of geographically distinct RNA virus assemblages as well as the environmental factors associated with the observed changes in diversity. Marine RNA virus metagenomes were constructed and assembled from sites that were geographically distinct with different salinity and temperature profiles.
Similarity searches against the NCBI non-redundant database allowed for the taxonomic classification of the assembled contigs, which were used to describe the viral assemblage in terms of the relative abundance of taxa. Taxonomic classification allowed for the identification and subsequent in-depth analysis of virus groups of interest. Linear regression and constrained correspondence analyses were used to analyze the relationship of geographic location, salinity and temperature with the observed taxonomic diversity of the marine RNA virus assemblages.

Chapter 4: Spatial and temporal diversity in freshwater RNA viruses in the context of environmental variability.
Freshwater RNA virus assemblages are poorly understood, with very few available datasets (López-Bueno et al., 2015; Djikeng et al., 2009). Considering the complexity of freshwater environments, it is likely that these RNA virus assemblages will be more complex than those observed in marine environments. Furthermore, the effect of the surrounding land-use on the freshwater viral assemblage is unknown, though it has been implied that land-use affects the viral assemblage (Djikeng et al., 2009). Chapter four explores the temporal diversity of freshwater RNA virus assemblages from spatially distinct streams. Freshwater RNA virus datasets were downloaded from the NCBI GenBank repository; they originate from a dataset generated with the aim of identifying viral biomarkers indicative of surrounding land-use. Seasonal trends in the taxonomic composition of the viral assemblage were analyzed by hierarchical clustering of Bray-Curtis similarity matrices. Due to the effect of land-use, analysis was performed on samples originating from the same area. Keeping the original purpose of the data in mind, indicator species analysis was performed to determine whether specific virus taxa could be used as indicators of specific environmental conditions.
Multiple linear regression was used to determine the environmental factors associated with changes in the alpha-diversity of the viral assemblage between sites, while Spearman correlation was used to analyze factors associated with changes in alpha-diversity within sites.

Chapter 5: Sequence-based taxonomic classification of uncultured and unclassified marine single-stranded RNA viruses of the order Picornavirales.
With an increase in the number of genomes assembled from metagenomic data, incongruence between the taxonomically recognized virus diversity and the actual environmental diversity is increasingly being acknowledged. Recognizing the need to incorporate the latter into the formal classification structure, the ICTV has recently agreed to accept proposals for the addition of assembled genomes (Simmonds et al., 2017), the challenge being a suitable classification framework that will accommodate both assembled genomes and virus isolates. Chapter 5 describes a sequence-based framework for the classification of viral genomes within the family Marnaviridae. Genus classification is described using RdRp phylogenies, while pairwise amino acid identities of capsid gene sequences are used to determine new species within the family.

1.10 Significance

These studies provide a detailed examination of the spatial and temporal dynamics of marine and freshwater RNA virus assemblages, respectively. Significantly, this is the first study to analyse these assemblages at such a large scale, along with the environmental factors associated with changes in the assemblages. This dissertation will advance the field in several important areas, many of which have either never been studied or have been studied only at a much smaller scale.
These include: resolving the distribution and biogeography of marine RNA viruses, expanding the diversity of aquatic RNA virus assemblages and the environmental drivers associated with changes in said diversity, discovering novel viral genomes and developing a sequence-based taxonomic framework for the classification of marine RNA viruses of the order Picornavirales.

Chapter 2

FROM THE ARCTIC TO AFRICA: EXPLORING THE GEOGRAPHIC DISTRIBUTION OF SIX MARINE RNA VIRUSES AND THEIR QUASISPECIES

2.1 Summary

RNA viruses are estimated to constitute half of coastal marine viral assemblages, and genetically diverse picorna-like viruses are the dominant representatives. Geographically limited gene surveys indicate some spatial and temporal patterns may exist for these RNA viruses, yet their global dispersal and quasispecies potentials remain largely unknown. Here we show that specific RNA virus genomes exhibit widespread geographic distributions and that the dominant genotypes are under purifying selection. Three novel picorna-like virus genomes (marine RNA viruses BC-1, -2 and -3) were assembled from a coastal British Columbia site and, along with marine RNA viruses JP-A, JP-B and Heterosigma akashiwo RNA virus, exhibited different biogeographical patterns. This indicates that RNA virus distribution is affected by factors such as host specificity and the virus lifecycle, and not just abiotic processes such as dispersal. Both the reference genomes and observed quasispecies are subject to purifying selection, and this amino acid conservation was evident in geographically distinct genomes where synonymous single nucleotide variations dominated the observed mutations. In contrast, sequences originating from a coastal South African site that mapped to one of the viruses, JP-A, exhibited more non-synonymous mutations, likely representing a related virus genotype that has accumulated amino acid changes over a more prolonged separation from other JP-A-like genomes.
This is the first global biogeographical analysis of marine RNA viruses and evaluation of the selection pressures acting on these viruses. Not only does this study add to the database of known marine RNA virus genomes, but it develops our understanding of the biogeographical distribution of viruses that are likely playing an integral role in the turnover of globally important plankton.

2.2 Introduction

It is widely acknowledged that viruses play central roles in the ecology of and evolution within marine microbial communities. They have direct effects, such as causing species mortality, as well as making indirect contributions in terms of food web dynamics and biogeochemical cycling (Suttle, 2007; Rohwer and Thurber, 2009). Individual marine viruses are widely distributed, with identical or nearly identical genotypes detected in geographically and environmentally separated locations (Short and Suttle, 2002; Labonté, Swan, et al., 2015; Culley and Steward, 2007; Breitbart and Rohwer, 2005). If biogeography is defined as the presence of spatial and temporal patterns associated with a specific genotype or unit of biodiversity, as a result of dispersal and/or environmental conditions (Hanson et al., 2012; Chow and Suttle, 2015; Lomolino et al., 2010), then we can conclude that for at least some marine viral groups, biogeographical patterns have been observed. The underestimation of both the abundance and importance of RNA viruses in marine environments has recently been highlighted (Steward et al., 2013). RNA-dependent RNA polymerase (RdRp) gene surveys have illuminated the previously undetected diversity within the order Picornavirales, the most abundant RNA virus group detected in the oceans (Culley et al., 2003; Culley and Steward, 2007; Gustavsen et al., 2014; Miranda et al., 2016; Lang et al., 2009).
It is believed that the hosts for most of these detected viruses are single-celled eukaryotes (protists) (Miranda et al., 2016; Gustavsen et al., 2014; Lang et al., 2009). Both spatial and temporal differences were observed in the relative abundance of these environmental picorna-like virus RdRps (Gustavsen et al., 2014). These observed biogeographical patterns can be caused by the specific viral life cycle, host traits or abiotic factors (Chow and Suttle, 2015; Suttle, 2016). Differences in viral lifecycles and properties will further influence their global distribution. These include the particle size or dimension, infection type (lytic versus lysogenic), replication and latent periods, burst size and acquisition of host-derived genes (Chow and Suttle, 2015; Suttle, 2016). RNA viruses that infect protists are thought to be small, have large burst sizes and employ a lytic lifecycle, and there has been no evidence that they contain auxiliary metabolic genes (King et al., 2011). If we consider two RNA virus isolates that infect different diatoms, Rhizosolenia setigera RNA virus (RsRNAV) (Nagasaki et al., 2004) and Chaetoceros tenuissimus RNA virus 01 (CtenRNAV01) (Shirai et al., 2008), we observe different lifecycle approaches, and these will affect their distributions. RsRNAV has a smaller burst size (10³ versus 10⁴) and a longer infection period (2 days compared to less than 24 hours), and we would therefore predict that RsRNAV has a lower distribution potential (Chow and Suttle, 2015). Because a virus is an obligate pathogen, the success of a virus genotype, i.e. producing more of itself, is tightly coupled to its host. Host traits, such as abundance, distribution, size and physiological state, will all affect the biogeography of a particular virus genotype (Chow and Suttle, 2015). Furthermore, strain specificity will affect distribution as the more hosts there are to infect, the higher the probability of successful reproduction.
Some marine viruses, such as RsRNAV, CtenRNAV01, Heterocapsa circularisquama RNA virus (HcRNAV) and Heterosigma akashiwo RNA virus (HaRNAV), infect hosts that are widely distributed. However, laboratory studies suggest that these viruses are strain-specific, which limits the number of hosts they can infect and therefore their distribution (Nagasaki et al., 2004; Shirai et al., 2008; Nagasaki et al., 2005; Tai et al., 2003). Currently, the only evidence that these viruses are not limited to a particular location comes from two studies conducted along the coast of British Columbia. The first study detected infectious HaRNAV particles in different coastal sediments (Lawrence and Suttle, 2004), while the second determined the persistent and widespread presence of two assembled genomes (JP-A and JP-B) in multiple locations around coastal British Columbia (Culley et al., 2007). Virus-host coupling and infection specificity result in selective pressures that drive a co-evolutionary arms race (Weitz et al., 2005). RNA viruses are well-adapted for this challenge by having a high mutation rate due to the error-prone RdRp, which results in approximately 10⁻⁴ substitutions per nucleotide per replication cycle (Milo et al., 2010; Holmes, 2009). This generates a genetically diverse virus population from a single progenitor, commonly referred to as a quasispecies, with a distribution of genotypes clustered around the fittest (Domingo et al., 2012; Holmes, 2010; Ojosnegros et al., 2010). Quasispecies are thought to enable the virus population to be more successful by providing built-in potential for flexibility to overcome both host and environmental challenges (Chare and Holmes, 2004; Domingo and Holland, 1997; Moya et al., 2004).
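As a rough illustration of why an error rate of this order produces a mutant cloud, the sketch below computes the expected number of substitutions per genome copy and the chance that a copy is error-free. The genome length of 9,000 nt is an assumed round figure in the size range of picorna-like genomes, and the independence of errors across sites is a simplifying assumption; neither value is a measurement from this work.

```python
def expected_substitutions(mu: float, genome_length: int) -> float:
    """Expected substitutions per genome copy, assuming independent sites."""
    return mu * genome_length


def prob_error_free_copy(mu: float, genome_length: int) -> float:
    """Probability that one genome copy carries no substitution at all."""
    return (1.0 - mu) ** genome_length


MU = 1e-4             # substitutions per nucleotide per replication cycle (from the text)
GENOME_LENGTH = 9000  # nt; assumed round figure, not a measured value

print("expected substitutions per copy:", expected_substitutions(MU, GENOME_LENGTH))
print("probability of an error-free copy:", prob_error_free_copy(MU, GENOME_LENGTH))
```

Under these assumptions, close to one substitution is expected per copied genome, so a large fraction of progeny genomes differ from their template.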
While the quasispecies phenomenon has been well characterized in RNA viruses associated with higher plants and mammals (Lauring and Andino, 2010; Elena and Sanjuán, 2007; Schneider and Roossinck, 2001), environmental quasispecies populations, in particular RNA viruses associated with aquatic protists, remain largely unexplored. Quasispecies analysis of assembled RNA virus genomes in an Antarctic lake suggested that the ecological setting was an important factor in selecting the fittest genotype, as more single-nucleotide variants (SNVs) were observed in the lake water compared to microbial mats (López-Bueno et al., 2015). Whether this variability was due to a greater observed diversity and turnover, or because water from multiple locations mixes in the lake, is unclear. Selection ultimately determines which genome from within the quasispecies cloud is the most successful. Under conditions of neutral selection, a simple relationship exists between the rate at which mutations are generated in a genome and the rate at which they are established within the population (Graur and Li, 2000), and deviations from neutral selection reveal aspects of evolutionary processes such as natural selection (Duffy et al., 2008). For example, arthropod vector-borne RNA viruses exhibit lower rates of non-synonymous (dN) changes relative to those for synonymous changes (dS), indicating an elevated purifying (or negative) selection pressure compared to viruses transmitted by other routes, likely due to the required interactions of the virus capsids with the insect cellular receptors (Chare and Holmes, 2004). In contrast, influenza A viruses break periods of evolutionary stasis with intervals of positive selection, where dN exceeds dS, which rapidly replaces circulating lineages with new dominant ones, allowing the virus to escape immune pressure within a host population (Wolf et al., 2006).
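The distinction between synonymous and non-synonymous changes that underlies dN and dS can be made concrete with a small sketch. The code below classifies single-codon changes under the standard genetic code; the function names and the example codon pairs are hypothetical illustrations, and the raw counts it produces are not a substitute for model-based estimators of dN and dS, which also normalize by the number of synonymous and non-synonymous sites.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (third base varies fastest).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_change(ref_codon: str, alt_codon: str) -> str:
    """Label a codon change as 'synonymous' or 'non-synonymous'."""
    ref = ref_codon.upper().replace("U", "T")
    alt = alt_codon.upper().replace("U", "T")
    if ref == alt:
        raise ValueError("reference and variant codons are identical")
    same_residue = CODON_TABLE[ref] == CODON_TABLE[alt]
    return "synonymous" if same_residue else "non-synonymous"


def count_changes(codon_pairs):
    """Tally synonymous and non-synonymous changes over (ref, alt) codon pairs."""
    counts = {"synonymous": 0, "non-synonymous": 0}
    for ref, alt in codon_pairs:
        counts[classify_change(ref, alt)] += 1
    return counts


# Hypothetical variants, chosen only to exercise both outcomes.
example_pairs = [("GCT", "GCC"), ("GAT", "GAA"), ("CTG", "TTG"), ("AAA", "AGA")]
print(count_changes(example_pairs))
```

A population under purifying selection would be expected to show an excess of the first category relative to the neutral expectation, which is the pattern the dN/dS comparisons above are designed to detect.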
Marine RNA viruses infecting protists have different obstacles to overcome compared to viruses infecting multi-cellular organisms. While purifying selection has been proposed for the RdRp of marine picorna-like viruses (Greenspan et al., 2004), we have very little understanding of the evolutionary pressures facing this group of viruses or how they respond to these pressures. To gain a better understanding of the biogeography of marine RNA viruses, their evolutionary patterns, and their possible existence as quasispecies, we focused on six viruses, either isolated or assembled from metagenomes collected at or near Jericho Pier in Vancouver, Canada. One of these, and the only isolate, is HaRNAV, which infects the toxic bloom-forming raphidophyte Heterosigma akashiwo (Tai et al., 2003). It is the type species of the family Marnaviridae and has a monocistronic genome of approximately 9.1 kb (Lang et al., 2004). JP-A and JP-B are two metagenome-assembled dicistronic viruses of 9.2 kb and 8.8 kb, respectively. Phylogenetic analysis based on RdRp domain sequences places both these genomes in a well-supported clade of marine picorna-like viruses (Culley et al., 2007), and it is believed that these viruses also infect protists (Lang et al., 2009). The other three viruses are genomes assembled from metagenomic analyses in this study. We began by comparing the six viruses via pairwise sequence similarities and phylogenetic analyses of four structural and two non-structural conserved domains. The biogeographic patterns of these six viruses were analyzed by recruiting reads from metagenomic datasets from 15 widely distributed marine locations, including the Arctic, Antarctic, the west coasts of North and South America, the Gulf of Mexico, Hawaii and the southern coast of South Africa, as well as two freshwater lakes and reclaimed water.
Using alignments of the conserved viral domains, we investigated the evolutionary pressures acting on the different domains and how these compared to the observed evolutionary pressures found in data from different locations. Through recruitment analysis we studied single nucleotide variants (SNVs) of the six viruses to confirm the existence of quasispecies within a biogeographical context.

2.3 Materials and methods

2.3.1 Sample collection and preparation

Sampling sites were selected from a variety of geographically distinct coastal environments (Figure 2.1). In total, fourteen samples were processed; these came from four sites along the temperate southwest coast of Canada (Jericho Pier [JP, 49.27; -123.20], Johnstone Strait [JohnS, 50.50; -126.35], Pendrell Sound [PendS, 50.27; -124.70] and Queen Charlotte Strait [QCStrait, 50.70; -127.10]), a composite of 10 samples from the Canadian Arctic [Arc, 72.31; -133.21], two samples from the Nunavut coastal ocean [Nun2, 70.7; -98.6 and Nun3, 71.9; -94.3] representing the northern subpolar regions, one site in the temperate/subpolar Bering Sea [BerS, 56.6; -172.7], one sample each from Laguna Madre, which represents the subtropical Gulf of Mexico [LagM, 27.5; -97.3], tropical coastal Peru [Peru, -8.7; -85.2] and temperate southern Chile [SChil, -50.9; -80.8], and two sites from the subtropical coastal waters of South Africa (Kenton-on-Sea [SAE, -33.69; 26.68] and Cape Town [SAW, -33.91; 18.39]). The samples were collected at different times (Supplementary Table A.1) and JP was sampled in two consecutive years.

Figure 2.1: Sampling locations represented in this study. The samples represent multiple depths and time points from geographically distinct locations and ocean climate zones. Asterisks in the figure mark samples that were concentrated using chemical flocculation.
Fourteen metagenomes were analyzed here (blue) along with published marine data (purple) from Kaneohe Bay, Hawaii (Culley et al., 2014) and Palmer Station, Antarctica (Miranda et al., 2016), as well as freshwater datasets (green) generated from Lake Limnopolar, Antarctica (López-Bueno et al., 2015), Lake Needwood, USA (Djikeng et al., 2009) and reclaimed effluent and nursery water from Florida, USA (Rosario, Nilsson, et al., 2009).

Samples were collected using either Niskin bottles mounted on a CTD rosette (~150 to 200 L) or by bucket from the surface (JP, SAE, SAW) (~2 to 20 L). For each sample, particulate matter was removed by pressure filtering (<17 kPa) the water through glass-fiber filters (MFS GC50, nominal pore size 1.2 µm) and polyvinylidene difluoride filters (Millipore GVWP, pore size 0.22 µm) after which viruses were concentrated using either ultrafiltration (Suttle et al., 1991) through a 30-kDa cut-off cartridge (Amicon S1Y30, Millipore) or chemical flocculation (John et al., 2011) with resuspension in Ascorbate-EDTA buffer. All virus concentrates were stored at 4ºC in the dark until processed for sequencing.

2.3.2 cDNA synthesis and metagenomic library preparation

Viruses from 70 ml of each virus concentrate generated through ultrafiltration were further concentrated by ultracentrifugation (120 000 x g for 5 hours at 8ºC) and the pellets resuspended overnight at 4ºC in 400 µl supernatant liquid. Nucleic acids were extracted from pelleted and flocculated viruses (400 µl) using the PureLink® Viral RNA/DNA mini Kit (ThermoFisher Scientific) and the DNA removed using TURBO DNase (ThermoFisher Scientific).
cDNA was synthesized in five replicates, with varying concentrations of template RNA, by sequence-independent single-primer amplification (SISPA) using both random hexamer (AD88, 5’-CCTGAATTCGGATCCTCCNNNNNN-3’) and nonamer (SMA-p1, 5’-GACATGTATCCGGATGTNNNNNNNNN-3’) primers (Márquez et al., 2007; Pan et al., 2013) and Superscript III Reverse Transcriptase (ThermoFisher Scientific) as per the manufacturer’s recommendation. Amplification of cDNA was carried out in four replicates using primers AD89 (5’-CCTGAATTCGGATCCTCC-3’) and SMA-P2 (5’-GACATGTATCCGGATGT-3’) (Márquez et al., 2007; Pan et al., 2013) and Platinum Taq DNA Polymerase (ThermoFisher Scientific) as per the manufacturer’s protocol with the addition of 5% DMSO (v/v) and varying concentrations of MgCl2. All replicates for a particular sample were pooled, cleaned using the Genomic DNA Clean and Concentrator kit (Zymo), and sheared to ~300 bp using an S220 Focused-ultrasonicator (Covaris). Size selection (>200 bp) was done using Agencourt AMPure XP beads (Beckman Coulter) and libraries were prepared using NEXTflex-96 DNA Barcodes (Bioo Scientific) and the NxSeq DNA sample prep kit (Lucigen) as per the manufacturer’s protocol. Pooled libraries were sequenced on a HiSeq 2000 (Illumina) (2 x 100 bp) at the Biodiversity NextGen Sequencing facility (University of British Columbia, BC, Canada).

2.3.3 Metagenomic and genomic analysis

Primer sequences were removed from each set of sequences, which were then quality-trimmed (PHRED score of 30) using Trimmomatic (Bolger et al., 2014). Paired reads were merged with PEAR (Zhang et al., 2014) and combined with the reads that had no pairs to produce the final datasets. Each metagenome was assembled separately using the default settings of the de novo assembly algorithm in the CLC Genomics Workbench version 7.5 (CLCBio). The contigs, as well as unassembled reads, for each metagenome were combined and re-assembled using the same parameters.
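The quality threshold quoted above for read trimming can be related to a per-base error probability through the standard Phred definition, Q = -10 log10(p). The sketch below shows that conversion together with a deliberately simplified 3’ trimming rule; the function names are hypothetical, Phred+33 encoding of the quality string is an assumption, and the trimming function is an illustrative stand-in rather than a description of how Trimmomatic was configured in this study.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Per-base error probability implied by a Phred quality score."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Phred quality score implied by a per-base error probability."""
    return -10 * math.log10(p)


def trailing_trim_length(quality_string: str, min_q: int = 30, offset: int = 33) -> int:
    """Bases kept after dropping 3' bases whose quality is below min_q.

    Assumes the quality string is Phred+33 encoded (an assumption here).
    """
    keep = len(quality_string)
    while keep > 0 and ord(quality_string[keep - 1]) - offset < min_q:
        keep -= 1
    return keep


print("Q20 error probability:", phred_to_error_prob(20))  # 1 error in 100 bases
print("Q30 error probability:", phred_to_error_prob(30))  # 1 error in 1000 bases
print("bases kept:", trailing_trim_length("IIIIIIII##"))
```

A threshold of 30 therefore tolerates roughly one expected error per thousand bases, ten times stricter than a threshold of 20, which matters when the downstream analysis interprets single nucleotide differences as biological variation.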
Open reading frames (ORFs) of the largest contigs were identified using the CLC Genomics Workbench version 7.5. ORFs smaller than 350 bp were not included in further analyses. The ORFs were further analyzed by comparison to the Conserved Domain Database (CDD) (Marchler-Bauer et al., 2013) (October 2016) using BLASTp and to the Pfam database (December 2015) (Finn, Coggill, et al., 2015) using HMMER v 3.1b2 (Finn, Clements, et al., 2015). All predicted untranslated regions (UTRs) of assembled genomes were analyzed using IRESite (Mokrejš et al., 2009) to identify potential internal ribosomal entry sites (IRES).

2.3.4 Metagenomic read recruitment

Reads were recruited from viral metagenomic datasets (Figure 2.1, Supplementary Table A.1), either generated in this study or previously published (Culley et al., 2014; Miranda et al., 2016; Djikeng et al., 2009; López-Bueno et al., 2015; Rosario, Nilsson, et al., 2009), to the three newly assembled RNA genomes as well as JP-A (NC_009757), JP-B (NC_009758) and Heterosigma akashiwo RNA virus (NC_005281) using tBLASTx with an e value of 10⁻¹⁰. Conflicts were resolved based on the highest bit score, and instances of recruitment of a read to multiple regions were resolved by identifying the region that provided the lowest e value. Translated reads were mapped onto the genomes based on their percent amino acid identities and plotted with an alpha setting of 0.5, so that overlapping reads could be distinguished, using R 3.2.2 (R Core Team, 2015).

2.3.5 Pairwise similarity and selection pressure analyses

Predicted domain nucleotide and amino acid sequences were aligned according to their translated sequences using MUSCLE (Edgar, 2004) and the default parameters. Alignments were viewed and trimmed, by removing 5’ and 3’ ends that did not align, with Aliview version 1.17.1 (Larsson, 2014).
Nucleotide and amino acid pairwise sequence similarities were calculated and neighbour-joining trees were constructed using the CLC Genomics Workbench v7.5 and MEGA7 (Kumar et al., 2016), respectively. The number of inferred synonymous (S) and nonsynonymous (N) substitutions for each codon were estimated using the joint maximum likelihood reconstructions of ancestral states under a Muse-Gaut model of codon substitution (Muse and Gaut, 1994) and the Felsenstein 1981 model of nucleotide substitutions (Felsenstein, 1981), using the HyPhy (Kosakovsky Pond and Frost, 2005) and MEGA7 software packages, respectively. One-way ANOVA and Kruskal-Wallis rank sum tests were performed using base R 3.2.2 (R Core Team, 2015), while the Conover-Iman test of multiple comparisons using rank sums, with Bonferroni corrections, was performed using the conover.test R package (Dinno, 2017). Metagenomic reads were mapped to the six genomes using the CLC Genomics Workbench v7.5 with a stringency of identity and read overlap of 90%. Single nucleotide variation (SNV) analysis was conducted using the Quality-based Variant Detection tool of the CLC Genomics Workbench v7.5 and required a minimum read count of two to be valid. Figures were generated using R 3.2.2 (R Core Team, 2015).

2.4 Results

2.4.1 Three novel picorna-like genomes

Three novel near-complete picorna-like virus genomes were assembled from the Jericho Pier 2014 data. The genomes of these viruses, named BC-1, -2 and -3, were 8638, 8843 and 8496 nt in size, respectively, and were assembled from 11100, 31880 and 156011 reads with coverage of 128.65, 359.36 and 1816.97, respectively. The GC contents of the genomes were calculated to be 41.2, 42.3 and 43.4%, respectively, and their organization matched that of members of the order Picornavirales with dicistronic genomes (Figure 2.2).

Figure 2.2: Three near-complete viral genomes assembled from Jericho Pier. The genomes display quintessential genome composition common to members of the order Picornavirales, with 5’ ORFs encoding non-structural protein domains (RNA helicase [gold], 3C protease [white] and RNA-dependent RNA polymerase [red]) and 3’ ORFs containing the four structural protein domains (stripes) and preceded by predicted IGR IRES elements (black). The four capsid domains are organized similarly to those of HaRNAV, with VP2 (purple) at the 5’ end followed by VP4 (blue), VP3 (violet) and VP1 (cyan). The assembled contigs are 8638 nt (BC-1), 8843 nt (BC-2) and 8496 nt (BC-3).

The first putative open reading frame (ORF) contains domain sequences that resemble those associated with non-structural functions, with RNA helicases (pfam00910) at the 5’ ends, followed by the highly divergent 3C protease domains (pfam00548) and lastly, the RNA-dependent RNA polymerase (pfam00680).
The second ORF contains structural protein domains (VP2, VP4, VP3 and VP1) and appears to be under the control of an intergenic region internal ribosome entry site (IGR IRES) (Figure 2.2). The 3’ end of Marine RNA virus BC-1 and the 5’ end of BC-3 were not complete, as judged by the lack of UTRs (Figure 2.2). Poly-A tails were identified at the 3’ ends of the BC-2 and BC-3 genomes after UTRs of 315 and 273 nts, respectively, indicating these regions were completely sequenced. The 5’ UTRs of BC-1 and BC-2 were 634 and 247 nts, respectively, and contained putative IRES elements. The closest genetic relatives in the NCBI non-redundant database to the three novel viruses were all assembled genomes from metagenomic studies with no host information available. BLASTx analyses of the ORFs indicated that both ORFs of BC-2 and -3 were most similar to marine RNA virus PAL 156 with e-value scores of 0.0 and amino-acid identities of 59% and 53% (non-structural ORFs) and 59% and 60% (structural ORFs), respectively. The BC-1 virus exhibited the highest similarity to marine RNA virus PAL 128 with e-values of 0.0 for both ORFs by BLASTx and amino-acid identities of 51% and 45% for the non-structural and structural ORFs, respectively. Three genetically distinct viruses (JP-A, JP-B and Heterosigma akashiwo RNA virus) have previously been assembled or isolated from similar geographic locations as the three new virus genomes (JP-A and -B from Jericho Pier and HaRNAV from the Fraser River plume in the Strait of Georgia). Pairwise nucleotide sequence similarity comparisons of the helicase, RdRp and four structural protein domains for these six viruses indicated that they were all distantly related, with BC-2 and -3 comparatively more similar to one another while HaRNAV was the most divergent of the six (Figure 2.3; Supplementary Figure A.3). Of the six analysed domains, the VP3 and VP1 structural domains were the least conserved among the genomes. 
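As an illustration of how the two quantities reported in Figure 2.3 relate to one another, the sketch below applies the standard Jukes-Cantor correction, d = -(3/4) ln(1 - (4/3) p), to the proportion of differing sites p. It assumes that p is simply one minus the reported percent identity; the function name is hypothetical, and this is not necessarily how the software used in the study computed the distances. Under that assumption the formula reproduces the reported distances for several pairs (for example, the BC-2/BC-3 and BC-1/BC-2 helicase comparisons), whereas for some of the more divergent pairs the reported distances are smaller than this calculation gives, which would be consistent with gapped positions being handled differently in the two measures.

```python
import math


def jukes_cantor_distance(percent_identity):
    """Jukes-Cantor distance from a pairwise nucleotide percent identity.

    Assumes the proportion of differing sites is 1 - identity/100.
    """
    p = 1.0 - percent_identity / 100.0
    if p >= 0.75:
        # The correction is undefined once sequences are at or beyond saturation.
        raise ValueError("Jukes-Cantor distance is undefined for p >= 0.75")
    return -0.75 * math.log(1.0 - (4.0 / 3.0) * p)


# Helicase domain, percent identities taken from Figure 2.3.
helicase_bc2_bc3 = round(jukes_cantor_distance(62.93), 2)  # reported as 0.51
helicase_bc1_bc2 = round(jukes_cantor_distance(49.53), 2)  # reported as 0.84
```

The guard for p >= 0.75 reflects the point at which two random nucleotide sequences are expected to be equally dissimilar, beyond which the logarithm has no real value.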
Similar trends were observed in pairwise protein sequence similarity analyses (Supplementary Figure A.1).

Figure 2.3: Pairwise nucleotide sequence similarities and neighbour-joining trees of the conserved domains of six marine RNA viruses. The nucleotide sequences of the respective domains were aligned according to the corresponding amino acid sequence alignments, and trimmed at the ends so that similar regions of all viruses were compared. The percent identities and Jukes-Cantor distances are displayed along with neighbour-joining trees for the trimmed helicase (A, B), RdRp (C, D), VP2 (E, F), VP4 (G, H), VP3 (I, J) and VP1 (K, L) domains. The percentage of trees in which the associated taxa clustered together is shown next to the branches (based on 1000 bootstrap replicates). Darker shading indicates more closely related sequences.

Matrices from the panels of Figure 2.3. Values above the diagonal are percent identities; values below the diagonal are Jukes-Cantor distances.

Helicase
          BC-1    BC-2    BC-3    HaRNAV  JP-A    JP-B
BC-1      -       49.53   51.71   38.06   46.73   36.76
BC-2      0.84    -       62.93   31.67   48.60   42.68
BC-3      0.77    0.51    -       33.61   49.22   42.68
HaRNAV    1.08    1.47    1.33    -       33.61   37.50
JP-A      0.93    0.87    0.85    1.33    -       40.81
JP-B      1.39    1.08    1.08    1.11    1.17    -

RdRp
          BC-1    BC-2    BC-3    HaRNAV  JP-A    JP-B
BC-1      -       47.83   47.66   39.53   50.27   42.00
BC-2      0.81    -       60.98   37.27   53.18   46.80
BC-3      0.81    0.55    -       38.76   54.57   46.37
HaRNAV    1.16    1.25    1.17    -       39.95   41.53
JP-A      0.78    0.69    0.65    1.16    -       48.05
JP-B      1.05    0.86    0.87    1.08    0.85    -

VP2
          BC-1    BC-2    BC-3    HaRNAV  JP-A    JP-B
BC-1      -       50.98   52.51   40.22   57.58   47.86
BC-2      0.79    -       61.62   45.45   52.60   44.87
BC-3      0.73    0.53    -       41.77   51.73   45.30
HaRNAV    1.16    0.95    1.09    -       47.53   42.46
JP-A      0.62    0.73    0.75    0.88    -       47.86
JP-B      0.86    0.95    0.93    1.05    0.87    -

VP4
          BC-1    BC-2    BC-3    HaRNAV  JP-A    JP-B
BC-1      -       57.05   54.49   42.69   52.20   46.54
BC-2      0.61    -       71.79   46.78   54.09   57.86
BC-3      0.67    0.35    -       46.20   54.72   55.35
HaRNAV    0.90    0.79    0.80    -       42.69   47.37
JP-A      0.71    0.68    0.67    0.96    -       56.60
JP-B      0.87    0.59    0.65    0.80    0.65    -

VP3
          BC-1    BC-2    BC-3    HaRNAV  JP-A    JP-B
BC-1      -       45.45   45.78   35.15   44.71   41.67
BC-2      0.89    -       57.92   38.72   42.86   40.96
BC-3      0.89    0.59    -       36.63   38.66   45.38
HaRNAV    1.17    0.99    1.10    -       40.04   34.92
JP-A      0.92    0.95    1.14    0.88    -       39.73
JP-B      0.96    1.02    0.89    1.08    1.06    -

VP1
          BC-1    BC-2    BC-3    HaRNAV  JP-A    JP-B
BC-1      -       32.77   33.90   34.09   38.39   31.64
BC-2      1.35    -       51.94   36.01   37.57   33.53
BC-3      1.28    0.70    -       30.32   35.84   38.01
HaRNAV    1.30    1.34    1.65    -       36.82   34.11
JP-A      1.16    1.13    1.22    1.18    -       34.29
JP-B      1.58    1.36    1.10    1.31    1.39    -

2.4.2 Marine RNA viruses exist as detectable quasispecies and exhibit biogeographic differences

To assess the global distribution of the six marine RNA viruses, reads from 17 marine RNA virus metagenomic datasets originating from diverse geographic locations were recruited to the genomes (Figure 2.4). Detection of individual genomes was sporadic in terms of geography and the genomic regions detected, and the percent amino acid identity for each genome varied widely across metagenomic datasets. Detection of reads mapping to the genomes was sporadic even at the site corresponding to the original virus detections, although for the three new genomes (Figure 2.2) that were assembled from the two JP14 samples, there were hits with 100% identity in these samples. In contrast, very few reads corresponding to HaRNAV were detected in these samples (Figures 2.4 and 2.6, Supplementary Figure A.2O and P). The detected JP14 reads corresponding to JP-A and JP-B ranged between 40 and 90% identity for these viruses, but were only found mapping to the RdRp domains. In contrast to the JP14 data, many reads from the JP13 sample corresponded to HaRNAV (80 to 98% amino acid identity), while we detected very few that corresponded to the BC-3 genome in this sample (Figure 2.6 and Supplementary Figure A.2Q).
The JP13 sample had no reads similar to the BC-1 and -2 genomes and very few reads (40 to 80% amino acid identity) corresponding to JP-A and JP-B. Sequences with high levels of identity in the deduced amino acid sequence to the six viruses were detected at numerous sites (Figures 2.4A and 2.5). JP-A was detected at 95 to 100% identity at both Queen Charlotte Strait and Johnstone Strait (Figures 2.5, 2.6 and Supplementary Figure A.2F and N). The coverage spanned more than 3000 nt of the genome (3000 to 7800 nt from Queen Charlotte Strait and 3000 to 6000 nt from Johnstone Strait). A similar observation was made for JP-B at the Queen Charlotte Strait and Johnstone Strait sites. In particular, at the latter site we found 2000 nt of the JP-B genome covered by reads with 100% identity in the deduced amino acid sequence. Both these sites also had reads corresponding to BC-2, BC-3 and HaRNAV, but these were fewer than for the JP viruses and more sporadically distributed across the genomes. In contrast, BC-1 was virtually undetectable at these sites. Reads corresponding to HaRNAV, with identities in the deduced amino acid sequence ranging from 80 to 100%, were detected in the Peru, Nunavut, Bering Sea, Laguna Madre and South Chile samples. Like HaRNAV, reads mapping to JP-A were also detected outside of the Canadian West Coast, in this case in the Kenton-on-Sea metagenome, where reads with 60 to 100% identity (amino acid level) were found (Supplementary Figure A.2L).
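Statements such as "2000 nt of the JP-B genome covered by reads with 100% identity" describe the length of genome covered by recruited reads above an identity threshold. A minimal sketch of one way to compute such a figure is given below: overlapping or adjacent read intervals are merged before their lengths are summed, so that positions covered by several reads are counted once. The function names and the example coordinates are hypothetical and are not taken from the study's data or scripts.

```python
def covered_length(intervals):
    """Number of genome positions covered by at least one interval.

    Intervals are (start, end) pairs with 1-based inclusive coordinates.
    """
    total = 0
    current_start = current_end = None
    for start, end in sorted(intervals):
        if current_end is None or start > current_end + 1:
            # Close the previous merged block, if any, and open a new one.
            if current_end is not None:
                total += current_end - current_start + 1
            current_start, current_end = start, end
        else:
            # Overlapping or adjacent interval: extend the current block.
            current_end = max(current_end, end)
    if current_end is not None:
        total += current_end - current_start + 1
    return total


def coverage_above_identity(hits, min_identity):
    """Covered length using only hits at or above the identity threshold.

    Each hit is a (start, end, percent_identity) tuple.
    """
    selected = [(start, end) for start, end, identity in hits if identity >= min_identity]
    return covered_length(selected)


# Made-up recruited reads on one genome: (start, end, % amino acid identity).
example_hits = [
    (3000, 3100, 98.0),
    (3050, 3200, 100.0),
    (5000, 5099, 96.0),
    (6000, 6100, 70.0),
]
covered_at_95 = coverage_above_identity(example_hits, 95.0)
```

With these made-up reads, the first two intervals merge into positions 3000 to 3200 and the low-identity read is excluded, so 301 positions are covered at the 95% threshold.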
[Figure 2.4 image: read-recruitment plots for BC-1, BC-2, BC-3, JP-A, JP-B and HaRNAV; x-axis, genome position (0 to 9000 bp); y-axis, amino acid identity (0 to 100%). Panel A locations: Antarctica, Arctic, Cape Point*, Jericho Pier 13, Jericho Pier 14, Jericho Pier 14*, Johnstone Strait, Kaneohe Bay, Kenton-on-Sea*, Laguna Madre, Nunavut 2, Nunavut 3, Pendrell Sound, Queen Charlotte Strait, Bering Sea, Peru and South Chile. Panel B locations: Lake Limnopolar 06, Lake Limnopolar 07, Lake Limnopolar 10, Lake Needwood J, Lake Needwood N, RW Effluent and RW Nursery.]

Figure 2.4: Prevalence of sequences mapping to the six marine RNA viruses in environmental viral metagenomic datasets.
Fragments were recruited from (A) marine RNA virus metagenomic datasets that were either generated during this study (Arctic, Cape Point, Jericho Pier 13/14, Johnstone Strait, Kenton-on-Sea, Laguna Madre, Nunavut 2/3, Pendrell Sound, Queen Charlotte Strait, Bering Sea, Peru and South Chile) or from one of the following publicly available metagenomic datasets: Palmer Station (Antarctica), Kaneohe Bay (Hawaii), as well as three freshwater datasets (B) from Lake Limnopolar (Antarctica), Lake Needwood (Maryland, USA) and reclaimed water from Florida (USA). Metagenomic reads were recruited against each of the six virus genomes using tBLASTx with an e-value of 10⁻¹⁰. The position of the coloured bars on the Y-axis indicates the % sequence identity (amino acid level) of the read to the virus genome, while the width of the bar covers the region of the genome where the similarity was observed. The bars are displayed at 0.5 transparency so that overlap can be observed. The genome map of each virus is superimposed on the plot with domains represented as follows: Helicase (H), Protease (P), RNA-dependent RNA polymerase (R), and capsid domains VP1 (1), VP2 (2), VP3 (3) and VP4 (4).

A closer inspection of the sequence data showed that there were many reads in the JP14 sample that mapped with high identity to BC-1, -2 and -3. While there were reads at 100% identity (amino acid level) to the assembled genomes, a wide variation in percentage of identity in the deduced amino acid sequence was observed for all genomes (Figures 2.4A and 2.5). The pattern of high sequence variation can be exemplified by examining the VP1 domain region of BC-2. Many reads mapped to this small region of the genome, and these varied by a 15% range in the levels of identity with the deduced amino acid sequence of BC-2 (Figure 2.5).
A similar pattern was present for the HaRNAV genome in the JP13 sample, where a large number of reads mapped to the VP1 domain and, in this case, there was an even greater range of amino acid sequence identities, with some dropping below 80%. Across all sites, a large proportion of the reads that mapped to the six viruses showed low levels of amino acid sequence identity to the viruses. These ranged from 20 to 80% amino acid sequence identity and corresponded to both structural and non-structural domain regions. In particular, there were many reads that mapped to the RdRp domains within the genomes. Two exceptions to this were BC-1 and HaRNAV, which also had fewer environmental reads that mapped to their genomes overall (Figure 2.4). Of the seven recognizable domains, the protease and VP1 regions had the fewest environmental reads identified.

Read recruitment analysis from five lake and two reclaimed-water RNA virus metagenomes identified many reads mapping to the six marine RNA virus genomes, but mostly with low levels of amino acid sequence identity (Figure 2.4B). These ranged from 20 to 90% identity, although most were between 30 and 60%. As was found for the marine recruitment analysis (Figure 2.4A), most of the reads recruited to regions associated with the helicase, RdRp and VP2, VP3 and VP4 domains.

[Figure 2.5 image: read-recruitment plots for BC-1, BC-2, BC-3, JP-A, JP-B and HaRNAV; x-axis, genome position (0 to 9000 bp); y-axis, amino acid identity (85 to 100%); locations as in Figure 2.4A.]

Figure 2.5: Prevalence of reads with amino acid sequence identities greater than 85% mapped to the six marine RNA viruses.
Fragments were recruited from marine RNA virus metagenomic datasets that were generated during this study (Arctic, Cape Point, Jericho Pier 13/14, Johnstone Strait, Kenton-on-Sea, Laguna Madre, Nunavut 2/3, Pendrell Sound, Queen Charlotte Strait, Bering Sea, Peru and South Chile) and from publicly available metagenomic datasets: Palmer Station (Antarctica) and Kaneohe Bay (Hawaii). Metagenomic reads were recruited against each of the six virus genomes using tBLASTx with an e-value of 10⁻¹⁰. The position of the coloured bars on the Y-axis indicates the percentage amino acid sequence identity of the read to the virus, while the width of the bar covers the region of the genome where the similarity was observed. The bars are displayed at 0.5 transparency to make overlap more evident. The genome map of each virus is superimposed on the plot with domains designated, from left to right: Helicase, Protease, RNA-dependent RNA polymerase, VP2, VP4, VP3 and VP1.

[Figure 2.6 image: charts of reads recruited to each genome at the Bering Sea, Jericho Pier 13, Jericho Pier 14, Pendrell Sound, Queen Charlotte Strait, Johnstone Strait, Kenton-on-Sea, Laguna Madre, Nunavut 3, Nunavut 2, Peru and South Chile sites, each labelled with the total number of reads recruited (n).]

Figure 2.6: Distribution of six marine RNA virus genomes. Charts represent the number of reads recruited to each genome at 95, 85 and 75% protein identity, with the total number of reads recruited shown below each chart. Genomes are represented by the following colours: Marine RNA viruses BC-1 (black), BC-2 (red), BC-3 (green), JP-A (blue), JP-B (orange) and Heterosigma akashiwo RNA virus (purple).

2.4.3 Evolutionary pressures acting on the viral quasispecies

The data allowed the examination of evolutionary patterns in six of the conserved domains (helicase, RdRp, and the four capsid domains). Based on the consensus sequences of these domains (Figure 2.7), all were subject to purifying selection (Figure 2.8A).
Values for dN/dS of <1.0 for all data below the 75th percentile indicated that purifying selection was occurring. No difference was observed among the calculated means for the six domains (ANOVA; p-value: 0.2507), while a difference, or stochastic dominance, was observed in the cumulative distribution functions (CDFs; Kruskal-Wallis rank sum test; p-value: 0.0385). Multiple pairwise comparisons showed that stochastic dominance existed between the VP1-RdRp and VP1-VP2 domains (Conover-Iman test; p-values: 0.0364 and 0.0442, respectively) (Supplementary Table A.2). There were several codons that were subject to neutral evolution and a few more to positive selection. All groups of amino acids (aromatic, acidic, basic, aliphatic and uncharged) appeared to be subjected to the same evolutionary pressure (Supplementary Figure A.4).

[Figure 2.7 image: nucleotide alignments, panels A to F, each showing BC-1, BC-2, BC-3, HaRNAV, JP-A and JP-B.]

Figure 2.7: Nucleotide alignments of the viral domains.
The domains represented include the A) RdRp, B) helicase, C) VP2, D) VP4, E) VP3 and F) VP1 domains of marine RNA viruses BC-1, BC-2, BC-3, JP-A and JP-B, and Heterosigma akashiwo RNA virus (HaRNAV). The nucleotide sequence alignments were constructed based on the corresponding amino acid sequence alignments, with the edges trimmed to remove regions not represented in all viruses.

[Figure 2.8 image: dN/dS plotted by domain (helicase, RdRp, VP1, VP2, VP3, VP4). Panel A is coloured by amino acid classification (aromatic; negatively charged, acidic; nonpolar, aliphatic; polar, uncharged; positively charged, basic). Panel B is coded by genome (BC-1, BC-2, BC-3, HaRNAV, JP-A, JP-B) and location (Bering Sea, Jericho Pier 13, Johnstone Strait, Jericho Pier 14, Kenton-on-Sea, Laguna Madre, Nunavut 2, Nunavut 3, Pendrell Sound, Peru, Queen Charlotte Strait, South Chile).]

Figure 2.8: Selection analyses of marine RNA virus domains in environmental datasets. A) Maximum likelihood analysis of selection pressure based on the codons of marine RNA viruses BC-1, BC-2, BC-3, JP-A, JP-B and HaRNAV. The substitutions were estimated using the joint maximum likelihood reconstructions of ancestral states under a Muse-Gaut model (Muse and Gaut, 1994) of codon substitution and the Felsenstein 1981 model (Felsenstein, 1981) of nucleotide substitutions. Amino acids were classified as follows: Aromatic (Phenylalanine [F], Tryptophan [W], Tyrosine [Y]), Negatively charged (-), acidic (Aspartic Acid [D], Glutamic Acid [E]), Nonpolar, aliphatic (Alanine [A], Glycine [G], Isoleucine [I], Leucine [L], Methionine [M], Proline [P], Valine [V]), Polar, uncharged (Asparagine [N], Cysteine [C], Glutamine [Q], Serine [S], Threonine [T]) and Positively charged (+), basic (Arginine [R], Histidine [H], Lysine [K]). B) Selection pressure acting on the various domains of the six marine RNA viruses at various locations. Single-nucleotide variants (SNVs) were called using the Quality-based Variant Detection tool (CLC Genomics Workbench) from alignments where environmental sequences of 90% sequence identity were mapped to the genomes.
Selection analyses of the environmental data mapped to the respective genomes and domains indicated that purifying selection was acting on these viruses in the environment (Figure 2.8B). The helicase domain was omitted from statistical tests due to insufficient data, but both ANOVA (p-value: 0.05343) and Kruskal-Wallis rank sum (p-value: 0.09858) tests indicated no significant differences among the other domains. The calculated dN/dS ratios were very low apart from a single outlier for the VP1 domain for virus BC-2 in the Queen Charlotte Strait sample, and for the helicase domain for virus JP-A in the Johnstone Strait and Kenton-on-Sea samples. Two additional virus-domain-sample pairings were omitted due to insufficient data (VP1 of both BC-2 and -3 from the Nunavut 2 and Queen Charlotte Strait samples, respectively).

SNV analyses (Figure 2.9) indicated a cloud of diverse variants of the six marine viruses in different geographical samples. Due to the nature of the analysis and the uneven coverage of genomes, individual domains were analysed independently, and some samples with insufficient reads available for analysis were excluded. Sequences mapping to viruses BC-1 and -2 were abundant in the JP14 sample, and many SNVs were distributed across the RdRp, VP2 and VP3 domains, at frequencies ranging from 1% to 50% (Figure 2.9A, B and C). The majority of these variants were synonymous, with the exception of the SNVs in the VP2 domain, where more non-synonymous mutations were detected. In contrast, the SNVs in the same three domains of BC-3 occurred at much lower frequencies (1 to 12%). A similar observation was made for the helicase domains, where many low-frequency SNVs were observed for virus BC-3, while fewer mutations were observed for BC-1 and -2; however, these were observed at higher frequency (Supplementary Figure A.6A).
The VP1 domain of virus BC-3 exhibited a similar trend of low-frequency mutations, except for three variants that occurred at considerably higher frequencies, two of which were non-synonymous. Virus BC-1 did not have enough coverage in the VP1 domain to profile SNVs across the whole domain, but a few high-frequency mutations towards the 5’ end of the domain were observed. The VP1 domain of virus BC-2 had many SNVs distributed across the domain, at frequencies ranging from 1% up to 47%.

[Figure 2.9 image: SNV frequency (0 to 50%) plotted against domain position (bp) for the RdRp (A), VP2 (B), VP3 (C) and VP1 (D) domains, with one track per virus; points are coded by location and by mutation type.]

Figure 2.9: Detection of marine RNA virus quasispecies in environmental datasets. The distribution of single nucleotide variants (SNVs) along the RdRp (A), VP2 (B), VP3 (C) and VP1 (D) domains of marine RNA viruses BC-1, BC-2, BC-3, HaRNAV, JP-A and JP-B. Synonymous mutation variants are indicated by a triangle (▲) while non-synonymous variants are indicated by a circle (●). The sites/samples where they were detected are illustrated in different colours. The SNV analysis was conducted using the Quality-based Variant Detection tool of the CLC Genomics Workbench.

2.4.4 Detection of established viral variants

SNVs at frequencies greater than 99% were abundant and detected at multiple locations (Figure 2.10), and were most commonly identified for the RdRp domains. JP-B variants were detected in Queen Charlotte Strait (8 SNVs) and Johnstone Strait (15 SNVs), while variants of HaRNAV were also found at both these sites (50 and 56 SNVs, respectively) as well as at the Pendrell Sound (20 SNVs), South Chile (18 SNVs), Laguna Madre (28 SNVs), Bering Sea (10 SNVs), JP13 (10 SNVs), Nunavut 2 and 3 (33 and 43 SNVs, respectively) and Peru (7 SNVs) sites (Note: the latter four sites are obscured in Figure 2.8).
In contrast, high-frequency SNVs of JP-A were detected exclusively in the Kenton-on-Sea sample (18 SNVs).

[Figure 2.10: SNV frequency (99 to 100%) plotted against domain position (bp) for the RdRp, VP1, VP2, VP3, VP4 and helicase domains of BC-1, JP-A, JP-B and HaRNAV; symbols distinguish synonymous from non-synonymous mutations and colours distinguish sampling locations. The SNV analysis was conducted using the Quality-based Variant Detection tool of CLC Genomic Workbench.]

Figure 2.10: Established viral variants in different marine RNA virus communities. The distribution of high-frequency (>99%) single nucleotide variants (SNVs) along the respective domains of the BC-1, HaRNAV, JP-A and JP-B viruses. Synonymous mutation variants are indicated by a triangle (▲) while non-synonymous variants are indicated by a circle (●). The sites/samples where they were detected are illustrated by different colours.

Unlike the RdRp domain, fewer high-frequency SNVs were associated with other domains, especially for the helicase, VP4 and VP1 domains. SNVs that were concentrated in the same region of the VP3 domain of HaRNAV were identified in the Bering Sea [3 SNVs], JP13 [2 SNVs], Johnstone Strait [6 SNVs], Laguna Madre [2 SNVs], Nunavut 2 and 3 [6 and 2 SNVs, respectively], Pendrell [2 SNVs], Peru [3 SNVs], Queen Charlotte Strait [14 SNVs] and South Chile [4 SNVs] samples.
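The per-site tallies of established variants reported above amount to filtering SNV calls on a frequency threshold and counting what remains. The following is a minimal sketch of that bookkeeping, assuming the SNV calls have been exported as simple (site, virus, domain, frequency) records. The record layout, the function name `tally_established` and the example values are hypothetical illustrations; they are not the CLC Genomics Workbench output format and not data from this study.

```python
from collections import Counter

# Hypothetical SNV records: (site, virus, domain, frequency in %).
# These values are placeholders for illustration, not thesis data.
snvs = [
    ("SiteA", "VirusX", "RdRp", 99.6),
    ("SiteA", "VirusX", "RdRp", 12.0),
    ("SiteA", "VirusX", "VP3", 100.0),
    ("SiteB", "VirusX", "RdRp", 99.2),
    ("SiteB", "VirusY", "RdRp", 47.0),
]

def tally_established(records, threshold=99.0):
    """Count SNVs whose frequency exceeds the threshold, per (virus, site, domain)."""
    return Counter(
        (virus, site, domain)
        for site, virus, domain, freq in records
        if freq > threshold
    )

established = tally_established(snvs)
for (virus, site, domain), n in sorted(established.items()):
    print(f"{virus}\t{site}\t{domain}\t{n} SNVs")
```

Keeping the domain in the key makes it straightforward to see whether a site carries established variants in more than one domain of the same virus.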
The BC-1 virus was the only one of the three new genomes for which environmental reads were identified. SNVs corresponding to the VP2 domain were detected from the Nunavut 3 location [11 SNVs]. VP2 domain SNVs for HaRNAV were also detected at this location [4 SNVs] along with the Bering Sea [9 SNVs] and Pendrell Sound [4 SNVs] sites. Detection of SNVs for multiple domains of a single virus at specific sites supports the existence of biogeographical variations with established genome variants in these locations.

2.5 Discussion

Metagenomic analysis of marine RNA viruses uncovered three novel genomes from a single coastal sample. The distributions of these and three other marine picorna-like RNA viruses (Lang et al., 2004; Culley et al., 2007) in global RNA virus metagenomic datasets revealed biogeographical patterns, with striking differences in the distributions of these six viruses, even though they were originally isolated or identified from the same, or nearby sites. SNV analysis shows the presence of viral quasispecies, and that purifying selection is the predominant evolutionary process acting on the viruses. These results and their implications are discussed below.

2.5.1 Divergent marine Picornavirales genomes are present at the same location

Three near-complete genomes assembled from the JP14 samples (Figure 2.2) double the number of RNA virus genomes from this location (Lang et al., 2004; Culley et al., 2007). The genome properties indicate similarity to genomes assembled from Palmer Station (Miranda et al., 2016), as well as viruses within the unassigned genus Bacillarnavirus (Shirai et al., 2008; Tomaru, Takao, et al., 2009).
The domain organization of the non-structural ORF is similar to viruses in the order Picornavirales while the structural domains in the second ORF are organized similarly to HaRNAV from the family Marnaviridae (Lang et al., 2004) and viruses of the genus Bacillarnavirus (Shirai et al., 2006, 2008; Tomaru, Takao, et al., 2009). Nucleotide alignments and pairwise similarity analysis of the RdRp, helicase and four capsid domain sequences indicate that the six viruses are only distantly related to one another (Figure 2.3; Figure 2.7). Of these, HaRNAV appears to be the most distantly related to the others, consistent with previous phylogenetic analyses of RdRp that place HaRNAV as one of the more divergent viruses relative to other picorna-like isolates and environmental sequences (Gustavsen et al., 2014). The majority of virus isolates in this group infect diatoms, which are abundant, genetically diverse and globally distributed. In contrast, the host of HaRNAV is from the Raphidophyceae, a family comprised of a less common and diverse group of protists compared to diatoms (Canter-Lund and Lund, 1995); HaRNAV is the only RNA virus isolated for this group. Therefore, the greater genetic distance observed between HaRNAV and the other marine RNA virus genotypes is likely due in part to lack of additional representatives. The protease domain was excluded from analyses due to the high level of divergence. It is unclear why this protein is so divergent across virus species, while the RdRp, for example, is conserved. This could perhaps be due to the different functions of the proteins. While both are essential for viral replication, the function of the RdRp revolves exclusively around interaction with the virus genome, while the protease has been shown to cleave not only viral but also host proteins (Jagdeo et al., 2015, 2018). This coupling with host proteins is an interesting consideration when discussing the divergence of the protease.
Because no host information is available for metagenomic assembled genomes, this hypothesis could not be tested with the current dataset. Among the conserved structural proteins, VP1 was the most divergent. Proteins associated with host receptor recognition and binding would presumably be different for viruses using different receptors and/or infecting different hosts. In enteroviruses, there are deep depressions or “canyons” on the surface of the capsid, surrounding the five-fold axis, where the receptor recognition sites are situated (Rossmann, 1989; Rossmann et al., 1985; Rossmann and Palmenberg, 1988; He et al., 2000). These canyons are formed by VP1 in the “northern” rim and VP1, -2 and -3 in the “southern” rim (Rossmann, 1994). Currently, there is only one structure available for an RNA virus that infects a protist, Heterocapsa circularisquama RNA virus, which infects a dinoflagellate (Miller et al., 2011). No receptor binding information is available for this virus but it appears to lack deep surface canyons, similar to cricket paralysis virus (CrPV) (Tate et al., 1999). One would expect different evolutionary pressures to act on viruses infecting protists and multicellular “higher” organisms, because of differences in immune responses of the hosts. Therefore, the lower identities for the VP1 sequences compared to the other capsid domains (Figure 2.3) might indicate that this domain is important in virus-host interactions.

2.5.2 Marine RNA virus biogeography

Seventeen RNA virus metagenomic datasets, 15 from our work here and two from previous studies, were used to evaluate the widespread distribution patterns for the six viruses. A potential bias with this analysis is that different library sizes could affect the detection of viral genomes because of the amount of data available; however, there was no trend between the number of reads recruited to genomes and the library size (Supplementary Figure A.3), suggesting that this did not affect the results.
The recruitment analysis performed was not competitive, meaning a single read could theoretically be recruited up to six times, once to each genome. Mean read counts for each library were determined to be 1 to 1.5 for most libraries, with the exceptions of Palmer Station, Antarctica and Nunavut 3 (Supplementary Figure A.3), indicating that it was not the same set of reads mapping to all six genomes in each analysis. The six viruses were selected because they all originated from essentially the same location, albeit at different times. Despite the large number of reads from JP14 associated with the three new genomes (BC-1, -2 and -3), no reads were detected for BC-1 and -2 and very few for BC-3 in the sample collected a year earlier (JP13) (Figure 2.4A). In contrast, many more reads were recruited to HaRNAV from the JP13 sample than the JP14 sample. This suggests that the abundances of these viruses are dynamic at a particular site and is consistent with the temporal differences observed in the relative abundance of specific RdRp sequences in coastal environments (Gustavsen et al., 2014; Gustavsen, 2016). Reads mapping to the viral genomes were distributed widely across hemispheres and oceans. Large portions of both the JP-A and JP-B genomes had reads mapping to data from Queen Charlotte Strait and Johnstone Strait (Figures 2.4A and 2.6); whereas, recruitment to data from these sites was sporadic for the BC-2 and BC-3 genomes and did not span large regions of the HaRNAV genome. These sites are in British Columbia coastal waters, and only a few hundred kilometers from Jericho Pier, the site from which the viruses were first identified. Relative proximity and oceanographic connectivity among the sites would facilitate dispersal, but the lack of reads associated with BC-1 at these sites emphasizes that the distribution of viral taxa is affected by more than dispersal.
Given that host and viral traits, as well as abiotic dispersal affect virus distribution (Chow and Suttle, 2015), other factors need to be considered. For example, particle size is small (25 to 35 nm) and burst size large (66 to 105 particles per cell) for protist-infecting picorna-like viruses (Lang et al., 2009; Kimura and Tomaru, 2015; Shirai et al., 2008; Tai et al., 2003), which will enhance their potential for dispersal. If BC-1 is a specialist with a narrow host range, or its host is rare with a localized distribution, the dispersal of its particles would be less than that of viruses infecting hosts with opposite traits. HaRNAV infects a host that, despite being less common compared to diatoms, is still widely distributed in global coastal waters (Chang et al., 1990; Hongo, 1993; Taylor, 1993; Band-Schmidt et al., 2004, 2012), explaining the widespread distribution of similar viral reads. HaRNAV strains exhibit moderate host strain specificity, infecting five of the 15 host strains tested (Tai et al., 2003), which would explain why reads with high identity, spanning large regions of the genome, were not found across all samples. For viruses known only from metagenomic data, the factors affecting their distribution are even less certain given that no host or viral lifecycle information is available. However, given that the particle and burst sizes of protist-infecting picorna-like viruses are likely similar, viruses for which high-identity reads are widely detected, or their close relatives, likely infect hosts that are cosmopolitan in distribution. The widespread presence of low-identity reads in coastal waters across sites indicates that distant relatives of these six viruses are likely globally distributed (Figure 2.4A).
The large number of low-identity reads mapping to the highly conserved RdRp domains also supports the large amount of genetic diversity previously observed in gene surveys targeting this region (Gustavsen et al., 2014; Culley et al., 2003, 2014; Culley and Steward, 2007; Miranda et al., 2016). Our data for full genomes show that there is also a large amount of diversity associated with the helicase and capsid domains that can be detected globally. Although the RdRp domain is the most well-conserved region in general, in many samples there was discordance among domains with respect to the number of sequences detected. For example, for JP-A virus many reads from the JP13 sample map to the helicase region (Supplementary Figure A.2Q); whereas, few associated with the RdRp domain. This could reflect that alternative evolutionary pressures are acting on the two domains or that recombination has occurred, both of which have been observed in picornaviruses (Simmonds, 2006). Regardless of the mechanism of diversification, a large richness of viruses related to the six investigated here was detected across widely separated regions. In contrast to the other genomes, very few low-identity reads mapped to HaRNAV, perhaps because this virus infects a host that has relatively low species richness. If true, this suggests that the viruses known from assembled genomes are associated with hosts with greater richness, thus leading to a larger and more diverse associated virus group. The data from the freshwater metagenomes (Figure 2.4B) help to serve as a guide for the interpretation of reads mapped to the genomes from the marine datasets. The same host species, and therefore the same viruses, are unlikely to be present in both fresh and marine waters. Using Heterosigma akashiwo and HaRNAV as an example, while there are no known freshwater Heterosigma species, there are freshwater members of the larger raphidophyte group.
Therefore, sequences matching to HaRNAV in these data likely represent distant relatives of HaRNAV. The high amino acid identity observed in the marine metagenomes, compared to the relatively low identities in the freshwater metagenomes, suggests that the freshwater reads that fall within the 20 to 70% amino acid identity range likely represent distant relatives that infect different species.

2.5.3 Purifying selection on conserved protein domains in marine RNA virus quasispecies

Little is known about possible quasispecies in marine viruses or the evolutionary pressures acting on their genomes. The data presented here imply that purifying selection is acting on both the structural and non-structural domains of the viruses, with all exhibiting overall dN/dS values <1 (Figure 2.8). These data are consistent with a small dataset for the RdRp domain, which was proposed to experience purifying selection (Greenspan et al., 2004). In purifying (or negative) selection a sequence is constrained with respect to possible amino acid substitutions, and most nucleotide sequence changes do not result in changes to the protein sequence, as is frequently observed in RNA viruses (Holmes, 2003; Edwards et al., 2006; Pybus et al., 2007). Purifying selection is evident in the domain alignments, in which many amino acids that are conserved throughout the six genomes are encoded by alternative codons (Supplementary Figure 2.7). However, the capsid VP1 domain is an exception (Supplementary Table A.2), possibly indicating that this domain is important in host recognition, as discussed above, and would therefore vary based on interaction with the host cell. The existence of quasispecies was validated for the six viruses by examining SNVs in the sequence data from different locations. Not surprisingly, because they were originally assembled from the JP14 samples, the three BC genomes had the greatest genome coverage.
However, the BC-3 genome had the most coverage, approximately four times more data than BC-1 and -2, and fewer SNVs for the RdRp, helicase, VP2 and VP3 domains (Figure 2.9A, B and C and Supplementary Figure A.6A). If a population of viruses consists of a large proportion of the fittest genotype surrounded by quasispecies variants, most genotypes would resemble the original virus that infected a host and replicated. This suggests that BC-3 was recently involved in a large lytic event, resulting in higher abundance and lower diversity. A similar observation was made in the analysis of quasispecies in an Antarctic lake, where one abundant RNA virus genotype, APLV1, exhibited few SNVs, with most occurring in less than 10% of the sequences, despite the virus being ecologically successful (López-Bueno et al., 2015). Many caveats exist with SNV analysis, particularly sequencing error. In the present study, two sequence libraries were prepared from the sample (JP14) that we used as a control. In most cases, the same SNVs were present in both libraries, supporting the notion that they are not the result of sequencing errors. The patterns in the mutations for the sequences in the environmental reads that mapped to the remaining genomes indicated that quasispecies-like populations were present at locations beyond the original site of isolation/assembly (Figure 2.9A). For example, SNVs were detected across the RdRp domain for both HaRNAV and JP-A in the Queen Charlotte Strait and Johnstone Strait samples, as well as Pendrell Sound for HaRNAV. Very few reads from Nunavut, Laguna Madre or Kenton-on-Sea mapped to these genomes with sufficient similarity to call SNVs across the RdRp domain, but the detected variants were all synonymous and exhibited different frequencies, with the Kenton-on-Sea variants being much lower than the two Nunavut and Laguna Madre variants.
While it is tempting to hypothesize that quasispecies are broadly distributed, there were too few reads from Nunavut, Laguna Madre and Kenton-on-Sea to conclusively state that they are indicative of quasispecies. Realistically, it is more likely that these reads correspond to similar viruses, some of which may not even infect the same host. For the BC-3 virus, the VP1 domain is the only one with enough variation to suggest a possible quasispecies (Figure 2.9D). VP1 is likely important for binding to the host receptor; thus the selection pressure could be different for this domain due to virus-host co-evolution. For the VP1 from BC-3, three of the four SNVs detected in both JP14 libraries result in amino acid changes. Similarly, more low-frequency non-synonymous SNVs are found in the VP1 domains of BC-1 and BC-2 (Figure 2.9D). Both functional and structural studies suggest that RNA viruses tolerate restricted numbers and types of mutations (Domingo and Holland, 1997). For example, during large-population passages of foot-and-mouth disease virus (FMDV), numerous non-synonymous mutations occur in the capsid protein with 96% on the outside of the protein surface (Escarmís et al., 1996). It was not feasible to construct reliable structural homology models for the proteins due to lack of appropriate reference structures, and therefore it is not known if the non-synonymous mutations were located on the outside of the virus particle. Overall, the observed differences, along with the higher dN/dS ratio previously discussed, would suggest that VP1 domains experience different degrees of evolutionary pressure compared to the other capsid proteins, likely due to their involvement in host recognition.
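The distinction between synonymous and non-synonymous SNVs that underlies this discussion can be illustrated with a short sketch. It assumes the variant lies in an in-frame coding sequence written in the DNA (cDNA) alphabet and translated with the standard genetic code. The function name `classify_snv` and the nine-base example sequence are hypothetical, and this is not the procedure implemented in CLC Genomics Workbench.

```python
from itertools import product

# Standard genetic code, built from the conventional TCAG codon ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def classify_snv(cds, pos, alt):
    """Classify a single-base substitution in an in-frame coding sequence.

    cds: coding sequence in the DNA alphabet, reading frame starting at index 0
    pos: 0-based position of the variant within cds
    alt: variant base
    """
    start = pos - pos % 3  # first base of the affected codon
    offset = pos % 3       # position of the variant within that codon
    ref_codon = cds[start:start + 3].upper()
    alt_codon = ref_codon[:offset] + alt.upper() + ref_codon[offset + 1:]
    same = CODON_TABLE[ref_codon] == CODON_TABLE[alt_codon]
    return "synonymous" if same else "non-synonymous"

# Hypothetical reading frame (Met-Ala-Lys); not a sequence from this study.
example_cds = "ATGGCTAAA"
calls = [
    classify_snv(example_cds, 5, "C"),  # GCT -> GCC, both alanine
    classify_snv(example_cds, 3, "A"),  # GCT -> ACT, alanine -> threonine
    classify_snv(example_cds, 8, "G"),  # AAA -> AAG, both lysine
]
print(calls)
```

Raw counts of the two classes are not themselves a dN/dS ratio, because dN/dS normalizes each count by the number of available non-synonymous and synonymous sites in the sequence.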
2.5.4 Established viral variants are widely dispersed

If one considers the co-evolutionary arms race between viruses and their hosts, and the competitive advantage the viruses gain by producing quasispecies, some of the low-frequency variants should be selected for and become established as the best fit genotype. This would result in turnovers in the dominant viral genotypes as is frequently observed in viruses that infect multicellular organisms (Hoenen et al., 2015; Moya et al., 2004; Bourlière et al., 2002). To investigate how HaRNAV, JP-A and JP-B have changed since they were first sequenced, the high-frequency SNVs (>99%) detected in the environmental datasets were analyzed. High-frequency SNVs in the RdRp of HaRNAV indicate that many of the established genetic variants occur at geographically distant locations, including the Bering Sea, Jericho Pier, Johnstone Strait, Queen Charlotte Strait, Pendrell Sound, Laguna Madre, Peru, South Chile and Nunavut 2/3 (Figure 2.10). Similarly, high-frequency variants of JP-B domains were also detected at geographically distinct locations (Johnstone and Queen Charlotte Straits), but unlike HaRNAV where location-specific reads frequently mapped to more than one domain, the domains and locations were for the most part distinct for JP-B. A similar discord can be observed for the JP-A virus, where there is only clear coverage for the RdRp domain (Figure 2.10). Despite inconsistent coverage across genomes, geographically distinct established variants were detected for multiple genotypes, indicating that a turnover in the dominant genotype does occur in the marine environment. Most SNVs in the HaRNAV and JP-B genomes were synonymous, indicating pressure to conserve the amino acid sequences. In contrast, the SNVs in the RdRp of JP-A were more equally distributed between synonymous and non-synonymous.
The JP-A RdRp reads originated from the Kenton-on-Sea sample, which is very distant from British Columbia where the genome was assembled; the large proportion of non-synonymous SNVs suggests that the Kenton-on-Sea variant of JP-A is relatively distant evolutionarily from the original genome.

2.5.5 Conclusions

Analysis of marine RNA virus metagenomic data revealed three novel viruses as well as biogeographic patterns and evolutionary pressures associated with quasispecies. This study is the first to investigate the distribution of multiple marine RNA viruses across distant locations. Genetically related viruses were detected across different oceans and hemispheres, although the distribution among viruses varied. This indicates that dispersal varied among genotypes and was affected by virus lifecycle and host traits, rather than solely by abiotic factors. The limited temporal data in this study agree with previous gene surveys, indicating that the relative abundances of virus genotypes are dynamic and likely result from episodic infections (Gustavsen et al., 2014). Numerous sequences with low amino acid identities were detected for the six viruses, indicating that there is a great richness of globally distributed, albeit distantly related viruses. Structural and non-structural protein domains are under purifying selection suggesting that, despite the error-prone RdRp and lack of proof-reading, these sequences are constrained to conserve their function. The VP1 domain was distinct, with a statistically higher dN/dS ratio compared to the other domains, the highest distance scores in pairwise similarity analyses, and the only domain where a ‘cloud’ of high amino acid identity reads was observed. This suggests that VP1 is important in host receptor binding, leading it to be under greater pressure to acquire amino acid changes to keep up with host evolution.
Of the three new viruses, BC-3 had the greatest coverage and the fewest low-frequency SNVs in the metagenomic data, consistent with a recent lytic event that produced many representatives of the dominant genotype. Detection of high-frequency SNVs for HaRNAV, JP-A and JP-B indicates that these sequences are related to the originally identified viruses, but that due to purifying selection, most of the mutations are synonymous. The RdRp sequences of the JP-A-like virus detected in the Kenton-on-Sea sample exhibited many non-synonymous mutations, indicating that distant relatives of this virus are broadly distributed. While this study clearly demonstrates that global biogeographic patterns exist for these marine RNA viruses, further investigation is required to get a better resolution of viral distributions and the factors affecting these distributions. Overall, these data demonstrate that closely related RNA viruses are distributed broadly across oceans and hemispheres, but that the arms race with their hosts leads to strong selection pressure against a backdrop of quasispecies to generate high sequence diversity.

Chapter 3

3 METAGENOMIC ANALYSIS REVEALS UNPRECEDENTED DIVERSITY IN MARINE RNA VIRUS ASSEMBLAGES FROM POLAR TO TROPICAL SEAS

3.1 Summary

There have been few metagenomic investigations of RNA viruses in the sea. Here, marine RNA virus assemblages collected from geographically widespread locations including the Arctic Ocean, Bering Sea, coastal Northeast Pacific, Gulf of Mexico, and the Eastern Pacific from 73.5°N to 50.9°S are shown to be taxonomically complex with at least 27 viral families represented. Members of genus Bacillarnavirus, which infect diatoms, were consistently in high relative abundance, unequivocally establishing viruses in the order Picornavirales as being the dominant constituents of marine RNA virus assemblages.
Detection of virus genera in oceanic samples that are associated with terrestrial systems implies that these viruses infect marine hosts, and likely evolved in the ocean. These include related plant-infecting viruses in the Tombusviridae and invertebrate-associated viruses in the Dicistroviridae; the latter suggesting that RNA viruses not only infect phytoplankton, but are likely pathogens of crustacean zooplankton, as well. In contrast, sequences that could be assigned to RNA bacteriophages were extremely rare. Reads corresponding to capsids of the Boiling Springs Lake RNA-DNA hybrid virus suggest that their RNA ancestors are also of marine origin and that their ancestors still exist in the oceans. This study has revealed unprecedented diversity in marine RNA virus communities and demonstrated that many groups of viruses are distributed from polar environments to the tropics, and that the best predictor of the composition of the viral communities is salinity. The results underscore the likely importance of RNA viruses as mortality agents of planktonic eukaryotes.

3.2 Introduction

Viruses play a central role in the evolution and ecology of marine microbial communities, and have direct effects as mortality agents and indirect contributions by affecting food web dynamics and biogeochemical cycling (Suttle, 2007; Rohwer and Thurber, 2009). However, indirect evidence and methodological bias have led to the assumption that DNA viruses are the most abundant and important members of marine virus assemblages (Steward et al., 1992; Weinbauer and Rassoulzadegan, 2004; Wen et al., 2004; Tomaru and Nagasaki, 2007; Brussaard et al., 2000), reaching approximately 10^7 particles per liter of coastal surface water (Bergh et al., 1989; Wommack and Colwell, 2000). As a result, diversity studies have focussed largely on DNA viruses (Edwards and Rohwer, 2005; Kristensen et al., 2010).
RNA viruses were also known to be present in marine environments, but were thought to be primarily associated with disease in marine mammals and fish (Geraci et al., 1982; Jensen et al., 2002; Skall et al., 2005), as well as organisms of economic importance such as crustaceans and salmonids (Hasson et al., 1995; Bonami and Zhang, 2011; McLoughlin and Graham, 2007). These viruses were associated with many families including Orthomyxoviridae, Paramyxoviridae, Rhabdoviridae, Dicistroviridae, Reoviridae and Togaviridae (Lang et al., 2009), and while they were regarded as important pathogens of higher organisms, there was no evidence associating them with the phyto- and zooplankton in the ocean. Recent molecular studies have expanded our understanding of marine virus diversity, and have shown that these assemblages are genetically diverse and encompass single-stranded (ss) and double-stranded (ds) DNA viruses (Angly et al., 2006; Labonté and Suttle, 2013), as well as ss and dsRNA viruses (Culley et al., 2003; Culley and Steward, 2007; Miranda et al., 2016; Gustavsen et al., 2014). Many of these viruses infect marine single-cell eukaryotes (protists), and range widely in particle and genome sizes from large dsDNA viruses in the Mimiviridae to small ssDNA and ssRNA viruses in the order Picornavirales. Members in the order Picornavirales have small icosahedral capsids (25 to 35 nm diameter) with a conserved genome architecture and small positive sense (+) ssRNA genomes (Le Gall et al., 2008). The non-structural genes consist of type II helicase, C3 protease and RNA-dependent RNA polymerase (RdRp) (Sanfaçon et al., 2009).
Marine Picornavirales isolates infect a variety of protists, including the toxic bloom-forming raphidophyte Heterosigma akashiwo (Tai et al., 2003) (family Marnaviridae), various cosmopolitan centric diatoms including Rhizosolenia setigera (Nagasaki et al., 2004), Chaetoceros socialis (Tomaru, Takao, et al., 2009), Chaetoceros tenuissimus (Shirai et al., 2008) (unassigned genus Bacillarnavirus) and the thraustochytrid Aurantiochytrium sp. (Takao et al., 2006, 2005) (unassigned genus Labyrnavirus), as well as the pennate diatoms Asterionellopsis glacialis (Tomaru et al., 2012) and Cylindrotheca closterium (Culley et al., 2014). RdRp gene surveys (Miranda et al., 2016; Gustavsen et al., 2014; Culley et al., 2014; Culley and Steward, 2007) and metagenomic studies (Culley et al., 2007, 2014; Miranda et al., 2016; Greninger and DeRisi, 2015) indicate that picorna-like viruses are widely distributed and common in coastal environments, and imply that they are also involved in the termination of phytoplankton blooms. The diversity of marine RNA virus assemblages extends beyond the Picornavirales and the abovementioned host organisms to include viruses infecting the dinoflagellate Heterocapsa circularisquama (Tomaru et al., 2004) (family Alvernaviridae) and the prasinophyte Micromonas pusilla (Brussaard et al., 2004) (dsRNA family Reoviridae). Metagenomic and transcriptomic analyses have detected picorna-like viruses associated with Symbiodinium, the dinoflagellate symbiont of corals (Correa et al., 2012; Levin et al., 2016), as well as reovirus-like sequences from coastal Hawaiian waters (Culley et al., 2014). Using an indirect approach, the total amount of viral DNA and RNA in coastal tropical waters was compared, suggesting that at least half of the viruses in these assemblages consisted of RNA-containing viruses thought to infect protists (Steward et al., 2013).
In contrast, RNA viruses appeared relatively less abundant and less diverse in subpolar regions (Miranda et al., 2016; Culley et al., 2014). This suggests that geographically distinct RNA virus assemblages consist of different RNA viruses of varying abundances, likely influenced by host composition and abiotic factors. In this chapter, I describe the analysis of forty-seven taxonomically assigned marine metagenomes from diverse geographic locations to gain a better understanding of the global diversity of marine RNA viruses. To date, this is the largest spatial study of marine RNA virus assemblages and constitutes a significant contribution to the known sequencing space of these viruses. These data also reveal that abiotic factors are correlated with changes in the composition of RNA virus assemblages, suggesting that global change will influence marine RNA virus assemblages.

3.3 Materials and methods

A graphical representation of the general methods is included as Supplementary Figure B.1.

3.3.1 Sample collection and preparation

Forty-seven geographically distinct coastal environments were sampled, including arctic, temperate, subtropical and tropical environments, spanning the Arctic Ocean to the West Coast of Southern Chile (Figure 3.1, Supplementary Table B.1).
These included five subpolar locations (Mackenzie Shelf [MacS_08] and coastal Nunavut [Nun_a-d07]), two temperate/subpolar North Pacific samples (Bering Sea [BerS_a/b08]), 34 sites from the temperate southwest coast of Canada (Gorge Harbour [GorgH_12], Carrington Bay [CarrB_11], Queen Charlotte Sound [QCSound_10], Teakerne Arm [TeakA_11/12], Blenkinsop Bay [BlinB_10], Allison Sound [AlliS_10], Pendrell Sound [PendS_10/11/12], Sechelt Inlet [SecI_11], Port Elizabeth [PortE_11], Simoon Sound [SimS_11], Belize Inlet [BelI_10], Narrows Inlet [NarrI_11/12], Lasqueti Island [LasqI_12], Discovery Passage [DiscP_12], Campbell River [CampR_12], Quadra Island [QI_a-e15], Chameleon Harbour [ChamH_11/12], Jericho Pier [JP_a-b10], Johnstone Strait [JohnS_11] and Queen Charlotte Strait [QCStrait_11]), three subtropical Atlantic samples (Gulf of Mexico [GOM_96/97] and Laguna Madre [LagM_94]), one from the tropical upwelling offshore of Peru (Peru [Peru_97]) and temperate and temperate/subpolar samples off the central and southern coasts of Chile, respectively (Central Chile [CChil_97] and South Chile [SChil_97]). The samples were collected from 1994 to 2015 (Supplementary Table B.1). Samples (~150 to 200 L) were collected using either Niskin bottles mounted on a CTD rosette, or by bucket from the surface (JP) (~20 to 60 L). Salinity and temperature were measured using either a YSI probe (Yellow Springs, Ohio, USA) or a rosette-mounted CTD SBE 25 (Seabird Electronics, Inc., Bellevue, WA). For each sample, particulate matter was removed by pressure filtering (<17 kPa) the water through glass-fiber filters (MFS GC50, nominal pore size 1.2 µm) and polyvinylidene difluoride filters (Millipore GVWP, pore size 0.22 µm), after which viruses were concentrated using ultrafiltration (Suttle et al., 1991) through a 30 kDa cut-off cartridge (Amicon S1Y30, Millipore). All virus concentrates were stored at 4ºC in the dark prior to processing.
Figure 3.1: Sampling sites for this study (panels A and B). Samples were collected from geographically distant locations and climatic zones. Site abbreviations are explained in the methods as well as in Supplementary Table B.1.

3.3.2 cDNA synthesis and metagenomic library preparation

Viruses from 70 ml of each virus concentrate were further concentrated by ultracentrifugation (120,000 × g for 5 h at 8ºC) and the pellets resuspended overnight at 4ºC in 400 µl supernatant. Nucleic acids were extracted from pelleted viruses (400 µl) using the PureLink® Viral RNA/DNA mini Kit (ThermoFisher Scientific) and the DNA removed using TURBO DNase (ThermoFisher Scientific). cDNA was synthesized in five replicates, with varying concentrations of template RNA, by sequence-independent single-primer amplification (SISPA) using both random-hexamer (AD88 - CCTGAATTCGGATCCTCCNNNNNN) and -nonamer (SMA-p1 - GACATGTATCCGGATGTNNNNNNNNN) primers (Márquez et al., 2007; Pan et al., 2013) and Superscript III Reverse Transcriptase (ThermoFisher Scientific) as per manufacturer’s recommendation. Amplification of cDNA was carried out in four replicates using primers AD89 (CCTGAATTCGGATCCTCC) and SMA-P2 (GACATGTATCCGGATGT) (Márquez et al., 2007; Pan et al., 2013) and Platinum Taq DNA Polymerase (ThermoFisher Scientific) as per manufacturer’s protocol with the addition of 5% DMSO (v/v) and varying concentrations of MgCl2. All replicates for a particular sample were pooled, cleaned using the Genomic DNA Clean and Concentrator kit (Zymo) and sheared to ~300 bp using a S220 Focused-ultrasonicator (Covaris).
Size selection (>200 bp) was done using Agencourt AMPure XP beads (Beckman Coulter) and libraries prepared using NEXTflex-96 DNA Barcodes (Bioo Scientific) and the NxSeq DNA sample prep kit (Lucigen) as per manufacturer’s protocol. Pooled libraries were sequenced on a HiSeq 2000 (Illumina) (2x100 bp) at the Biodiversity NextGen Sequencing facility (University of British Columbia, BC, Canada).

3.3.3 Metagenomic assemblage composition and genomic analysis

For each set of sequences, primer sequences were removed and reads were quality trimmed (PHRED score of 30) using Trimmomatic (Bolger et al., 2014). Paired reads were merged with PEAR (Zhang et al., 2014) and combined with the singletons to produce the final datasets. The reads from each metagenomic dataset were assembled separately using the default settings of the de novo assembly algorithm in CLC Genomics Workbench version 7.5 (CLCBio, Cambridge, MA, USA). The contigs, as well as unmapped reads for each metagenomic dataset, were combined and re-assembled using the same parameters. Relative abundances were determined by mapping reads from each dataset to the final contigs, rarefied to the smallest sample size (263,274) using the phyloseq package (McMurdie and Holmes, 2013) in R 3.2.2 (R Core Team, 2015) with 1000 iterations and normalized by number of bases/contig size. The clustering was done using the complete linkage algorithm in Primer 6 (Plymouth Routines In Multivariate Ecological Research version 6.1.16) (Clarke, 1993; Clarke and Warwick, 2001) and was verified by a Simprof test. The latter consisted of 999 simulation permutations with a significance level of 5% based on the Bray-Curtis similarity resemblance measure. Near-complete genomes (> 6.5 kb with recognized domains) were extended using PRICE version 1.2 (Paired-Read Iterative Contig Extension) (Ruby et al., 2013). Open reading frames (ORFs) were identified using the CLC Genomics Workbench version 7.5.
ORFs were called based on any start codon and open-ended sequences, disregarding anything smaller than 350 bp. The ORFs were further analyzed against the Conserved Domain Database (CDD) (Marchler-Bauer et al., 2013) (October 2016) using BLASTp and HMMER v 3.1b2 (Finn, Clements, et al., 2015) using the Pfam database (December 2015) (Finn, Coggill, et al., 2015). All untranslated regions (UTR) of the assembled genomes were analyzed using IRESite (Mokrejš et al., 2009) to identify any potential internal ribosomal entry sites.

3.3.4 Phylogenetic analysis

All sequences were either aligned or added to existing alignments using either MUSCLE v3.8.425 or MAFFT v7, using default parameters (Edgar, 2004; Katoh, 2013), and manually refined with Aliview version 1.17.1 (Larsson, 2014). ProtTest 3.2 was used for amino-acid model selection (Darriba et al. 2011) as part of the Phylemon2 package (Sánchez et al., 2011). Maximum likelihood trees were constructed with RAxML 8.0 rapid bootstrapping and ML search with 1000 bootstraps. Environmental sequences were analyzed using the Environmental Placement Algorithm (EPA) of RAxML 8.0 (Stamatakis, 2014). The resultant tree was edited in Figtree (http://tree.bio.ed.ac.uk/software/figtree/) and iTOL v3 (Letunic and Bork, 2016). RdRP protein sequences from this study (that span the conserved domains), along with picorna-like virus reference sequences (listed below), were added to the alignment of Culley et al. (2014) and analyzed using the LG+I+G+F amino-acid model. GenBank accession numbers for reference sequences added to the alignment are listed in Supplementary Table B.3. Sequences that contained capsid domains similar to either Tombusviridae, Nodaviridae-like oomycete viruses or RNA-DNA hybrid virus sequences were added to a constructed reference tree (using the LG+I+G+F amino-acid model) consisting of sequences listed in Supplementary Table B.3.
The neighbour-joining tree (100 bootstraps) was constructed using MEGA7 (Kumar et al., 2016).

3.3.5 Statistics

Bray-Curtis similarity matrices and hierarchical clustering analysis, using the complete linkage algorithm, were computed using Primer 6 (Plymouth Routines In Multivariate Ecological Research version 6.1.16) (Clarke, 1993; Clarke and Warwick, 2001) and Permanova+ (version 1.0.6). Clustering was verified by a Simprof test (Clarke et al., 2008), which consisted of 999 simulation permutations with a significance level of 5% based on the Bray-Curtis similarity resemblance measure. All other statistical analyses were performed in R 3.2.2 (R Core Team, 2015) using RStudio 0.99.489 (RStudio Team). Linear regression plots were created using ggplot2 (Wickham, 2009) and GGally (Schloerke et al., 2016). The constrained correspondence analysis was performed with the relative abundance table and the corresponding environmental factors, including latitude, longitude, salinity (ppt) and temperature (ºC) (Supplementary Table B.1), using the vegan package (Oksanen et al., 2016). Response surfaces for each variable were overlaid using the mgcv R package (Wood, 2004, 2011).

3.4 Results

Samples used in this study spanned multiple years. To determine the effect of sample age on the analyses, additional analyses were performed. These are addressed in the relevant sections.

3.4.1 Marine RNA virus diversity

Metagenomic analysis of the marine RNA virus assemblages suggests that there is great genetic and taxonomic diversity in the 29,031 assembled viral contigs, which ranged in size from 150 to 8,605 bp. Most of the sequences belonged to double-stranded RNA (dsRNA) and +ssRNA viruses (Figure 3.2). All libraries contained contigs similar to +ssRNA viruses from isolated, unclassified and uncultured members of the Picornavirales, including the family Marnaviridae, the genera Bacillarnavirus and Labyrnavirus, and environmental samples (Figure 3.2 rows 191 – 202 and 379 – 381).
These viruses are associated with a variety of important marine protists including diatoms, raphidophytes and thraustochytrids. Many of these viral taxa were consistently abundant, with relative abundances within the 75th to 98th percentile. These included taxa such as Aurantiochytrium single-stranded RNA virus (AuRNAV) (row 202), Rhizosolenia setigera RNA virus (RsRNAV) (row 201), Heterosigma akashiwo RNA virus (row 190), Chaetoceros sp. RNA virus 2, Asterionellopsis glacialis RNA virus and Chaetoceros tenuissimus RNA virus type-II (rows 379 – 381). Many reference marine RNA virus genomes have been assembled from metagenomic data and thus are not associated with a specific host. Like the abovementioned genomes, environmental genomes such as Antarctic picorna-like viruses, SF-3, and JP-A and JP-B (rows 191 – 194, 195, 196, 198) were consistently abundant, with relative abundance scores within the 98th percentile. Also consistently within the 75th percentile of relative abundance was Delisea pulchra RNA virus (row 207), an unclassified virus within the Picornavirales that infects a multicellular red alga. Contigs representing other families within the Picornavirales were also detected (Figure 3.2 rows 69 – 189). These contigs are assigned to their closest relatives by sequence similarity in the database; the assignment does not imply that they represent the particular virus species listed. Those associated with the Picornaviridae, Secoviridae and Iflaviridae were sporadic in distribution and frequently in low relative abundance (Figure 3.2 rows 96 – 189), although they spanned all of the genera associated with the Iflaviridae and Secoviridae, and 15 genera in the Picornaviridae that are associated with invertebrate, plant and vertebrate hosts.
Dicistroviridae were also well represented, with contigs assigned to the genera Aparavirus, Cripavirus and Triatovirus; they had a higher relative abundance and were often detected in a higher proportion of samples (rows 69 – 85) than the three aforementioned families. Consistently abundant taxa, with relative abundances within at least the 75th percentile, included viruses associated with either insects (e.g. Cricket paralysis virus, row 76), or crustaceans (e.g. Taura syndrome virus, row 73). Contigs similar to the Cricket paralysis virus genome were broadly distributed, including at JP, BerS, LagM, SChil, Nun, GOM and multiple samples along the Canadian west coast (Figure 3.2). Fifteen other +ssRNA and seven dsRNA virus families were detected. Most of the taxa were sporadic in distribution, with the exception of a few taxa associated with insects (Helicoverpa armigera stunt virus and Nodamuravirus, rows 28 and 40), plants (Barley yellow dwarf virus GAV, row 335; Cereal yellow dwarf virus RPV, row 341; Epirus cherry virus, row 238; Southern bean mosaic virus, row 252), bacteria (Pseudomonas phage PP7, row 234) and marine protists (Heterocapsa circularisquama RNA virus, row 376; Micromonas pusilla reovirus, row 18). Micromonas pusilla reovirus infects a widespread green alga and occurred in high relative abundances (98th percentile) in TeakA_11, DiscP_12 and QI_d/e15. Many unclassified (rows 382 – 410) and environmental viruses (rows 412 – 421) were detected; some were sporadic in distribution, while others were consistently detected at relative abundances within the 75th percentile. This included, but was not limited to, taxa from Antarctica that are thought to infect diatoms (PAL_E4, -PAL128, -PAL156, -PAL438 and -PAL473 (rows 398 – 402)) and an enigmatic RNA-DNA hybrid virus (row 404). The latter was included even though these viruses normally have a ssDNA genome, because the contigs are similar to the capsid gene, which resembles that of an RNA virus.
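The relative abundances discussed above were obtained by rarefying mapped reads to a common depth and normalizing by contig length (Section 3.3.3), and samples were then compared using Bray-Curtis similarity. The sketch below illustrates those three steps in plain Python. It is not the phyloseq or Primer 6 implementation used in this study: the function names, the toy contigs and their lengths are hypothetical, averaging over iterations is an assumed way of summarizing repeated rarefaction, and "reads per base of contig" is one reading of the length normalization.

```python
import random
from collections import Counter


def rarefy(read_counts, depth, rng):
    """Draw `depth` reads without replacement from a {contig: reads} table."""
    pool = [contig for contig, n in read_counts.items() for _ in range(n)]
    if depth > len(pool):
        raise ValueError("rarefaction depth exceeds library size")
    return Counter(rng.sample(pool, depth))


def mean_rarefied(read_counts, depth, iterations=1000, seed=0):
    """Average rarefied counts over repeated subsamples (assumed summary)."""
    rng = random.Random(seed)
    totals = Counter()
    for _ in range(iterations):
        totals.update(rarefy(read_counts, depth, rng))
    return {contig: totals[contig] / iterations for contig in read_counts}


def length_normalize(counts, contig_lengths):
    """Express abundance as reads per base of contig (assumed normalization)."""
    return {contig: counts[contig] / contig_lengths[contig] for contig in counts}


def bray_curtis_similarity(a, b):
    """Percent Bray-Curtis similarity between two abundance profiles."""
    keys = set(a) | set(b)
    abs_diff = sum(abs(a.get(k, 0) - b.get(k, 0)) for k in keys)
    total = sum(a.get(k, 0) + b.get(k, 0) for k in keys)
    return 100.0 * (1.0 - abs_diff / total) if total else 0.0


if __name__ == "__main__":
    # Hypothetical toy libraries; contig names, counts and lengths are invented.
    lengths = {"contig_a": 8000, "contig_b": 500, "contig_c": 1500}
    site_1 = {"contig_a": 60, "contig_b": 30, "contig_c": 10}
    site_2 = {"contig_a": 20, "contig_b": 20, "contig_c": 10}
    # Rarefy both libraries to the size of the smaller one.
    depth = min(sum(site_1.values()), sum(site_2.values()))
    profiles = [
        length_normalize(mean_rarefied(s, depth, iterations=100), lengths)
        for s in (site_1, site_2)
    ]
    print(round(bray_curtis_similarity(*profiles), 1))
```

Rarefying before normalizing keeps libraries of different sequencing depth comparable, while dividing by contig length prevents long contigs from appearing more abundant simply because they recruit more reads.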
Figure 3.2: A heatmap representing the diversity and relative abundance of taxonomically assigned marine RNA virus assemblages. The relative abundance values were determined based on reads mapped to the contigs, rarefied to library read size and normalized to contig length. The heatmap is coloured by percentile across all the study sites with white indicating relative abundances within the 5th percentile and black the 98th percentile. Taxonomic classification is ordered by Baltimore group with dsRNA viruses first, followed by –ssRNA, +ssRNA, unassigned and unclassified ssRNA viruses, unclassified viruses and phages, environmental viruses and satellites.
The far left column indicates the family level classification with each row representing a specific virus taxon. A more detailed taxonomic list and heatmap can be found in Supplementary Table B.2 and Supplementary Figure E.1. Columns are clustered based on Bray-Curtis similarity.

Hierarchical clustering of the samples, based on a Bray-Curtis similarity matrix of the relative abundances, showed eight statistically significant groups (black lines, Figure 3.2). The taxonomic similarity between samples was low, with samples rarely sharing more than a fifth of their viral assemblages. Samples that clustered together were frequently spatially and/or temporally related. This included five samples from Jericho Pier (JP-a/b 10/11/14) that shared more than 30% (statistically significant based on a Simprof test) of their viral assemblages and three samples from coastal Nunavut (Nun_a/b/c07) that shared 17% of their viral taxa. Most Quadra Island samples had more than 17% of their viral assemblages in common, with QI_b15, QI_d15 and QI_e15 sharing more than 30% of taxa, and QI_d15 and QI_e15, about 60%. Temporal trends were observed in the coastal British Columbia samples, with some 2011 samples (NarrI_11, TeakA_11, PendS_11 and SecI_11) as well as some 2012 samples (DiscP_12, ChamH_12, TeakA_12 and LasqI_12) clustering together; however, these clusters were not statistically significant.

3.4.2 Picornavirales genomes and phylogenies

Sequences associated with members of the Picornavirales were the most abundant and consistently detected of all the viral groups. From these data, seven near-complete genomes were assembled (Figure 3.3), exhibiting both mono- (BC-9 and BC-6) and dicistronic picorna-like genome architecture. Putative intergenic region internal ribosome entry sites (IGR IRES) were identified in the region between the non-structural and structural ORFs of BC-1, BC-4, BC-3, BC-5 and BC-7. Putative IRES-like sequences were detected in the 5’ UTRs of BC-1, BC-4, BC-5 and BC-9.
BC-3 was the only assembled genome with an identifiable poly-A tail. BC-1 and BC-3 were indistinguishable from the BC-1 and BC-3 genomes previously assembled from two Jericho Pier samples (Chapter 2). The 5’ UTRs of BC-1, BC-4, BC-5 and BC-9 were estimated at 634, 423, 156 and 106 nts, respectively.

Figure 3.3: Seven near-complete picorna-like genomes assembled from marine metagenomic data. The genomes display archetypal Picornavirales monocistronic and dicistronic genome architectures with non-structural protein domains present in the 5’ region of the genome and structural protein domains at the 3’ end. Non-structural domains include an RNA helicase [gold], C3 protease [white] and RNA-dependent RNA polymerase [red]. Structural domains are represented in blue and predicted IGR IRES elements in green. The assembled contigs are 8605 nt (BC-1/contig_18), 8593 nt (BC-4/contig_100), 8487 nt (BC-3/contig_9), 7228 nt (BC-5/contig_6232), 7073 nt (BC-6/contig_6862), 6489 nt (BC-7/contig_5824) and 6434 nt (BC-8/contig_2026). Shaded contigs are identical to the previously assembled viruses BC-1 and BC-3 (Chapter 2).

Assembled contigs were abundant at different sites and often in a particular region (Figure 3.4). Reads mapping to contigs BC-8 and BC-9 were predominantly from Blenkinsop Bay, while reads mapping to contigs BC-1, BC-6 and BC-7 were from multiple Jericho Pier samples.
The majority of reads mapping to BC-3 were from a single sample, while the rest of the reads contributing to BC-3 were from multiple Quadra Island samples. Reads from Quadra Island libraries constituted most of the BC-5 reads. BC-4 had a few reads from Quadra Island, though most were from Queen Charlotte Sound and Discovery Passage.

Figure 3.4: Relative abundance of assembled contigs across study locations. Only locations that had reads which mapped to the genomes were included. Reads were mapped to the contigs, rarefied to library size and normalized to contig length.

Phylogenetic analysis was conducted on the RdRP domains from the above genomes and other members of the Picornavirales. All sequences that spanned the conserved domains were included in the reference tree, while shorter sequences were placed on the tree using RAxML EPA (Figure 3.5). Bootstrap support was weak for the deep branches, but obvious clades were observed for the Picornaviridae, Secoviridae, Dicistroviridae and Iflaviridae, all families within the Picornavirales. Virus isolates known to infect marine protists were phylogenetically distant from the other groups. The majority of contigs grouped within these clades, as did other contigs assembled from aquatic and invertebrate metagenomic data. Some contigs grouped within the Dicistroviridae; many reads were placed within branches at the base of the clade, while fewer reads mapped to branches that were well-supported within the clade.
Irrespective of the above examples, the majority of reads mapped to clades associated with aquatic RNA viruses.

Figure 3.5: Phylogenetic placement of environmental RNA-dependent RNA polymerase reads within a Picornavirales reference tree. The maximum likelihood phylogeny was constructed from a modified 330 nucleotide alignment (Culley et al., 2014) and the short environmental sequences placed using the EPA of RAxML. The thickness of the branches denotes the number of reads placed. Branch bootstrap support is indicated as squares. New sequences generated in this study are in black. Reference viruses are listed in Supplementary Table B.3 and colour coded as follows: Potyviridae (forest green), Picornaviridae (red), Secoviridae (green), Dicistroviridae (orange), Iflaviridae (yellow), Marnaviridae and marine isolates (blue), aquatic assembled genomes (teal) and invertebrate assembled genomes (purple).

3.4.3 Tombusviridae and RNA-DNA hybrid viruses

Figure 3.6: Phylogenetic relationship of environmental reads associated with RDHV, Tombusviridae and RDHV structural protein sequences. The maximum-likelihood phylogeny was inferred using RAxML from a 441 amino-acid alignment generated using MAFFT. One hundred and twelve reads were added to the alignment using MUSCLE and placed on the tree using the RAxML EPA. Branch thickness indicates the number of reads associated with the branch. Bootstrap support is represented by squares. The inset is a neighbour-joining tree of the representative Tombusviridae (green) and Nodaviridae-like (purple) capsids with the 19 environmental reads (black) that were placed between the two clades in the maximum likelihood tree. RDHV-like sequences from aquatic environments and RDHV-like sequences assembled from peat are coloured in blue and teal respectively.

There were 469 contigs that corresponded to genomes in the family Tombusviridae, a group of +ssRNA plant viruses.
The contigs ranged from 153 to 4086 bp (median: 479 bp) with 2- to 998-fold coverage. ORFs were associated with structural and non-structural proteins, with all the structural domains falling in the protein family (PF00729) for viral coat proteins. Many contigs (114), ranging in size from 156 to 1821 bp (median of 425 bp) and in coverage from 4- to 778-fold, corresponded to the unclassified oomycete-infecting viruses Plasmopara halstedii virus A and Sclerophthora macrospora virus A. All structural domains identified were PF00729. Thirty-seven contigs ranging in size from 223 to 1450 bp (median 595 bp), with coverage from 2- to 415-fold (median 17), were similar to structural ORFs of RNA-DNA hybrid virus genomes. ORFs were similar to the PF00729 domains in the Boiling Springs RNA-DNA hybrid virus (BSL RDHV [two contigs]) and the RNA-DNA hybrid virus SF1 (RDHV-likeV SF1). No contigs were associated with non-structural or replication proteins.

A structural protein reference phylogeny was constructed using RDHV-like structural sequences from members of the Tombusviridae, and two Nodaviridae-like oomycete virus capsids. The structural reads described above were placed on the reference phylogeny (Figure 3.6). Five reads were placed on the RDHV branch, with two at the base of the BSL RDHV clade; four were on the BSL RDHV branch, and two and thirteen reads fell on the DMClaHV and GOS 10665 branches, respectively. Twenty-four reads mapped to the clade of environmental CRUV genomes. A few reads mapped to the PhV-A (2) and SmV-A (3) branches, while one to seven reads were assigned to branches associated with members of the Tombusviridae; 19 reads were placed on the branch separating the Tombusviridae from the oomycete viruses. A neighbour-joining tree of these reads along with Tombusviridae and Nodaviridae reads (Figure 3.6 Inset) suggests that these are a separate clade of capsid sequences with no recognized representative.
3.4.4 Predictors of marine virus diversity

Diversity varied across samples, with the Gulf of Mexico (1997) having the lowest Shannon diversity score (0.93) and Allison Sound the highest (3.13), while evenness was lowest in Port Elizabeth (0.19) and highest in Allison Sound (0.62). None of the variables had strong explanatory power over taxonomic diversity or species evenness (Figure 3.7 and Supplementary Figure B.5). Salinity had a significant but weak negative explanatory power over diversity and evenness with R2 values of 0.21 and 0.19, respectively (Figure 3.7), whereas sample age was positively related. However, the variance was very uneven for sample age and the results should not be viewed as significant (Supplementary Figure B.4).

[Figure 3.7 panel statistics: Shannon diversity against salinity, R2 = 0.2153, p = 0.000894; species evenness against salinity, R2 = 0.1901, p = 0.001952; Shannon diversity against latitude, R2 = 0.0733, p = 0.062697; species evenness against latitude, R2 = 0.0708, p = 0.067524.]

Figure 3.7: Linear regression of diversity and species evenness of samples in relation to salinity. Diversity is Shannon alpha diversity calculated from taxonomically classified data.
Grey shading is the 95% confidence interval with R2 and significance (p) indicated for each plot. The coefficients are -0.07047 and -0.01194 for A and B respectively. Sample abbreviations are described in the methods section and Supplementary Table B.1.

To explore the relationship between assemblage composition and environmental variables, constrained correspondence analysis was used to compare assemblages for their taxonomic similarity, covariation with identified taxa, and the effect of salinity, temperature, latitude and longitude (Figure 3.8). The analysis was viewed in two dimensions, with CCA1 explaining 55.2% of the variation and CCA2 describing 27.2%. Viral assemblages were spread out in a similar pattern to that observed by hierarchical clustering (Figure 3.2), with some samples like Blenkinsop Bay (BlinB_10) and Queen Charlotte Sound (QCSound_10) being noticeably different compared to most samples, which grouped in the middle of the plot. Virus species were also clustered densely in the middle of the plot, with a few sparsely distributed on the perimeter. Salinity was the strongest predictor, followed by latitude and temperature.
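The diversity and evenness scores regressed against salinity in this section can be illustrated with a short calculation. The sketch below computes Shannon diversity, an evenness score and an ordinary least-squares fit with its R2. It rests on assumptions that the text does not state: natural logarithms are used, evenness is taken to be Pielou's J (Shannon diversity divided by the log of richness), and the function names are hypothetical. It is not the R code used to produce Figure 3.7.

```python
import math


def shannon_diversity(abundances):
    """Shannon index H' = -sum(p * ln p) over taxa with non-zero abundance."""
    total = sum(abundances)
    props = [x / total for x in abundances if x > 0]
    return -sum(p * math.log(p) for p in props)


def pielou_evenness(abundances):
    """Pielou's J = H' / ln(richness); assumed definition of evenness."""
    richness = sum(1 for x in abundances if x > 0)
    if richness < 2:
        return 0.0  # evenness is undefined for a single taxon; report 0
    return shannon_diversity(abundances) / math.log(richness)


def linear_fit(x, y):
    """Ordinary least-squares slope, intercept and R2 for paired values."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared


if __name__ == "__main__":
    # Hypothetical abundance profiles and salinities, for illustration only.
    profiles = [[40, 30, 20, 10], [70, 20, 5, 5], [90, 5, 3, 2]]
    salinity = [22.0, 28.0, 33.0]
    diversity = [shannon_diversity(p) for p in profiles]
    slope, intercept, r_squared = linear_fit(salinity, diversity)
    print(f"slope={slope:.4f} intercept={intercept:.4f} R2={r_squared:.4f}")
```

A negative slope in such a fit corresponds to diversity decreasing with salinity, and R2 gives the fraction of variance explained, which is how the weak relationships reported above (R2 of roughly 0.2) should be read.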
Figure 3.8: Constrained correspondence analysis of RNA virus relative abundances from each site in context of the associated location, salinity and temperature. The direction and strength of the constraining variables are shown in teal and indicate that salinity and latitude are the strongest predictors. Virus taxa, in grey (+), are grouped in the middle of the plot with a few outliers. Samples are represented as annotated black circles (ο). The axes explain 55.2% (CCA1) and 27.2% (CCA2) of the variation. Abbreviations are explained in the Methods and in Supplementary Table B.1.
[Figure 3.9 panel statistics, all plotted on CCA1 (55.2%) and CCA2 (27.2%): salinity (ppt), deviance explained = 44%, r2(adj) = 0.38, p = 3.79e-05; latitude, deviance explained = 39.5%, r2(adj) = 0.309, p = 0.00299; longitude, deviance explained = 1.45%, r2(adj) = 0.00754, p = 0.314; temperature (ºC), deviance explained = 6.57%, r2(adj) = 0.0444, p = 0.135; Shannon diversity, deviance explained = 74.5%, r2(adj) = 0.698, p = 7.34e-14; species evenness, deviance explained = 74.5%, r2(adj) = 0.698, p = 7.34e-14.]

Figure 3.9: Response variables associated with marine RNA virus correspondence analysis. Response surfaces for each environmental variable are overlaid on the sample constrained correspondence analysis with R2 (adjusted), P-values and the percent of deviance explained reported for each response variable (A – D). Observed Shannon diversity and species evenness were also overlaid to determine how they relate to sample distribution (E – F).

Response surfaces of the environmental variables (Figure 3.9 A – D) showed that both salinity and latitude influenced viral assemblages but were non-linear (p = 3.73e-5 and 0.00299). The response variables for temperature and longitude appear linear but map poorly to the data (p = 0.135 and 0.314). Response surfaces of the calculated diversity measures (Shannon diversity and species evenness) (Figure 3.9 E – F) corroborate the regression analysis, but indicate that the relationship is not linear. There were no concrete patterns observed between the observed virus assemblages and time of sampling (season, month and year) (Supplementary Figure B.2).

3.5 Discussion

Marine RNA viruses influence marine ecosystems by infecting eukaryotes, from microbes to mammals. Studies suggest that free RNA virus particles in seawater primarily belong to the order Picornavirales and are associated with protists (Culley et al., 2014; Miranda et al., 2016). Methodological biases from previous metagenomic studies favoured the amplification of poly-A tailed viruses, resulting in a biased representation of RNA viral assemblages. This study used unbiased approaches to examine the taxonomic diversity of marine RNA virus assemblages in a geographical and environmental context.
These results and their significance are discussed in detail below. Viral assemblages were analyzed using taxonomic classification rather than bioinformatic (kmer) clustering to allow inferences to be made with respect to host range and taxonomic relatedness to other genomes. This has the limitation that environmental RNA viruses are poorly represented in reference databases; therefore, viral diversity will be underrepresented when using these databases to determine taxonomy. Here, all contigs that were similar (below an e-value of 0.0001) to a specific viral taxon were concatenated, so that rather than analyzing the relative abundances of specific genotypes, genotypes were analyzed as cumulative representatives of similar genomes. Despite these limitations, much greater taxonomic diversity was observed than in previous studies (Culley et al., 2006, 2014; Miranda et al., 2016). Foregoing poly-T primers and using two sets of random primers allowed amplification of nucleic acids from a broader diversity of RNA viruses. The stability of most marine RNA viruses is unknown; therefore, it is unclear how long-term storage of marine viral assemblages affects taxonomic composition. The infectivity of human enteroviruses (Picornaviridae) and salmonid viruses (Birnaviridae and Rhabdoviridae) is affected by salinity and temperature (Lo et al., 1976; Toranzo and Hetrick, 1982), although in aquatic environments enteroviruses can remain infective for weeks, and non-infectious particles and/or nucleic acids persist for longer (Wetz et al., 2004; Bae and Schwab, 2008). Regression analysis indicated that salinity had a significant weak negative effect on species diversity and evenness (Figure 3.7), whereas sample age had a positive effect (Supplementary Figure B.5). However, the age-related data were not evenly distributed and were biased by sample location; hence, the results should be viewed cautiously. 
Nonetheless, there is no evidence from regression analysis of the diversity data (Supplementary Figure B.5), cluster analysis of the relative abundance data (Figure 3.2), or CCA plots (Supplementary Figure B.2) that the composition of the RNA viral assemblage was affected by storage time.

3.5.1 Picornavirales are the most abundant members of marine RNA virus assemblages in all oceanic regions
Picornavirales diversity in this study (Figure 3.2) reflected the described taxonomic diversity of the order. The relative proportions of taxa were similar to previous studies (Miranda et al., 2016; Culley et al., 2014) with members of the genera Labyrnavirus and Bacillarnavirus, and uncultured Picornavirales being most abundant. The taxa with the highest relative abundances were either marine RNA virus isolates or genomes assembled from aquatic metagenomic data, and are thought to be viruses infecting protists (Shirai et al., 2008, 2006; Tomaru, Takao, et al., 2009; Kimura and Tomaru, 2015; Tomaru et al., 2012; Nagasaki, 2008; Miranda et al., 2016). Protists encompass a wide diversity of evolutionary groups, including stramenopiles, alveolates, excavates and amoebozoa (Burki, 2014), and in the oceans include major taxa of phytoplankton and microzooplankton. While inferring hosts for viruses assembled from metagenomic data is imperfect, closely related RNA viruses typically infect closely related hosts. Many of the contigs assembled here were similar to viruses infecting diatoms, raphidophytes or a thraustochytrid, so it is reasonable to assume that they were from viruses that infect stramenopiles.

Diatoms occur in coastal and oceanic waters worldwide (Armbrust, 2009), are extremely diverse (Kooistra et al., 2007), and are infected by +ssRNA viruses (e.g. Nagasaki et al., 2004; Shirai et al., 2008). 
Diatom RNA viruses have burst sizes as large as 10^4 infectious units per cell (Shirai et al., 2008; Kimura and Tomaru, 2015), which along with ubiquitous hosts accounts for the high relative abundance of related viruses in the metagenomic data. Not all diatom virus species were detected in every sample, which is consistent with highly dynamic diatom populations with specific ecological niches. For example, diatoms bloom when waters are well-mixed and nutrients are high, such as during spring blooms, and in coastal and upwelling regions or along the sea-ice edge (Morel and Price, 2003). Large dynamic host populations, and viruses with short replication cycles of 24 to 48 h and large burst sizes (Nagasaki et al., 2004; Shirai et al., 2008; Kimura and Tomaru, 2015), would shift the relative abundance of viruses. Abiotic factors such as salinity and temperature can affect the infectivity of diatom viruses (Kimura and Tomaru, 2017), which can create niche partitioning and allow multiple viruses that infect the same host to coexist. Therefore, the presence and abundance of specific viruses will be affected by host-specific and virus-specific factors.

Raphidophytes (host of HaRNAV) are toxic bloom-forming algae; they are widespread in temperate coastal waters (Engesmo et al., 2016) and much less diverse than diatoms. Currently there is only one RNA virus associated with a raphidophyte and no information about possible environmental effects on virus-host interactions. The limited diversity of this host suggests that the diversity of the viruses will also be limited, consistent with the data on relative abundance as well as the RdRp phylogeny (Figure 3.2 and 3.5).

In contrast, the metagenomic contigs JP-A, JP-B and SF-3 have high relative abundances, suggesting that these viruses and their close relatives are important players and likely infect abundant phytoplankton, as suggested previously (Gustavsen et al., 2014; Miranda et al., 2016). 
The diversity of viruses associated with stramenopiles is echoed in the RdRp phylogeny (Figure 3.5), in which the majority of Picornavirales-associated RdRps clustered with isolates, or as new clades within these branches. Many of the shorter reads mapped to deep, unsupported branches, suggesting that the known diversity of marine Picornavirales is incomplete, as also indicated by every newly published RNA virus metagenome and by a meta-analysis of extant metagenomic data (Shi et al., 2016). Dicistro-like viruses were frequently detected at high relative abundances (Figure 3.2), suggesting that invertebrate viruses are abundant in marine RNA virus assemblages. This, along with RdRp data showing that many aquatic invertebrate viruses form a distinct clade within the Dicistroviridae (Figure 3.5), is congruent with a meta-analysis of environmental metagenomic data that uncovered dicistro-like genomes for crawfish, barnacles and a sesarmid crab (Shi et al., 2016). The continuous detection of contigs similar to Dicistroviridae genomes suggests that although many of these virus taxa clustered outside of the marine clade, they were of marine origin; this is supported by the fact that many of these contigs and reads were detected in oceanic samples such as the Bering Sea, Queen Charlotte Sound, and off the coasts of Chile and Peru, suggesting that they may be from pathogens of crustacean zooplankton. Other families in the Picornavirales are rarely, if ever, found in the marine environment. Viruses in the Picornaviridae have been isolated from ringed seals (Kapoor et al., 2008) and detected in fur seal faeces (Kluge et al., 2016). In the metagenomic data herein, Picornaviridae-like contigs were detected sporadically, but more frequently than for the Secoviridae, which encompass a diverse group of plant viruses not common in the marine environment. 
While the detection of these taxa was sporadic, they occurred across different sampling locations in oceanic and coastal regions, often at relative abundances within the 75th percentile, suggesting that they are native to marine assemblages. No seco-like virus isolates have been reported for single-celled protists, but viruses that infect the red macroalga Delisea pulchra (Lachnit et al., 2016) have RdRp sequences that cluster within the marine clade (Figure 3.5); whole genome analysis is required to determine the taxonomy of these viruses, which are unclassified Picornavirales (Delisea pulchra RNA virus) and Picornaviridae (Delisea pulchra picornaviruses IndA and -B). The latter viruses were sporadically detected, having higher relative abundances in some coastal British Columbia samples compared to open ocean samples. Delisea pulchra RNA virus-like reads were detected in all samples but the Gulf of Mexico, often at high relative abundances. The dominance of Picornavirales in marine RNA virus assemblages is supported by the assembly of genomes with similar genome organization and sequence homology. The assembled genomes were detected in multiple samples (Figure 3.4) with the exception of BC-8, which was almost exclusively assembled from the Blenkinsop Bay sample. The distribution of closely related viral genomes across multiple geographically linked sites is consistent with high viral dispersal facilitated by currents and tides (Chow and Suttle, 2015). The Quadra Island site is in a coastal region with strong tidal mixing between the Strait of Georgia and Johnstone Strait, which explains the overlap in viral taxa across these locations. Biogeographical studies of Marine RNA viruses BC-1 and BC-3 suggested that these genomes have a localized distribution (Chapter 2), consistent with the data presented here. So, while physical mixing can transport viruses, factors such as host distribution and abiotic conditions also affect distribution (Chow and Suttle, 2015).   
3.5.2 Assemblage diversity extends beyond Picornavirales
Many +ssRNA virus families and genera were detected in this study that have not been previously reported in marine samples, and they were often detected at high relative abundances across multiple sites. Many of these include families and genera that are associated with insects (Alpha-, Carmo- and Permutotetraviridae and Alphanodaviruses). Terrestrial viruses can be detected in nearshore water, especially after heavy rain or near a river. However, alphatetravirus-like reads were detected in nearly all samples, although at low abundances in the central Chilean oceanic and the Jericho Pier 13 samples. One would not expect reads for Lepidopteran larval viruses in oceanic samples, yet all three alphatetravirus taxa were within the 75th percentile of relative abundance in the sample from Queen Charlotte Sound, a deeply mixed offshore location. As well, a contig related to Helicoverpa armigera stunt virus was within the 75th percentile in the oceanic samples from southern Chile and Peru. Thus, these reads are likely from native marine viruses that infect crustacean zooplankton hosts. An even stronger case can be made that the abundant contigs associated with alphaviruses and unclassified nodaviruses are marine in origin. Nodavirus isolates associated with crustaceans have been described (Qian et al., 2003; Saedi et al., 2012; Tang et al., 2011) and a new species was discovered in 2017 (Zhang et al., 2017). The RdRp sequences of these viruses are within the same clade as other alphanodaviruses, further supporting the hypothesis that these reads originated from zooplankton or crustacean hosts. Unlike crustaceans, marine vertebrates are well documented as virus hosts (Lang et al., 2009). Caliciviruses infect marine mammals and fish (Ganova-Raeva et al., 2004; Smith, 2000), and astroviruses infect sea lions and dolphins (Rivera et al., 2010). 
These viruses were sporadic, with each taxon occurring in fewer than ten of 47 samples. The rarity of these contigs is consistent with the low density of vertebrates in the ocean relative to protists. Moreover, infection of a protist leads to rapid cell lysis and release of viral progeny, whereas in multi-cellular organisms viruses can move from cell to cell (Carrington et al., 1996; Smith and Enquist, 2002) with low release rates of viruses to the environment. Hence, it is not surprising that protistan viruses dominate the water-column assemblage of viruses. Of all the families associated with plants, the Tombusviridae had the largest diversity and relative abundance. Tombusviridae-associated reads were previously observed to constitute 13% of marine RNA virus metagenomic reads from coastal British Columbia (Culley et al., 2006), and led to an assembled genome, SOG (Culley et al., 2007). Although similar viruses have not been reported in subsequent metagenomic studies, reads mapping to the marine RNA virus SOG genome were detected at low relative abundance in one sample in the present study. It has been suggested that these Tombus-like viruses may be terrestrial in origin, especially as Tombusviridae particles are stable in aquatic environments (Koenig et al., 2004). Salinity is generally a good indicator of freshwater input; yet, these reads were detected in all samples, including those with high salinity. This suggests that at least some of these viruses are native to the marine environment. Genetic similarities to members of the Tombusviridae have been detected in other marine viruses (Tang et al., 2002). The assembly of a hybrid genome, Tombunodavirus UC1, from wastewater (Greninger and DeRisi, 2015), as well as RDHV genomes (Stedman, 2015), supports the idea that members or ancestors of the Tombusviridae have been involved in recombination and are present in aquatic environments. 
The placement of Tombusviridae-like structural reads between the Tombusviridae clade and the Nodaviridae-like clade (Figure 3.6 inset) suggests that considerable genomic diversity associated with these viruses is yet to be discovered. Double-stranded RNA viruses are less frequently detected in aquatic RNA virus metagenomic data, and they are less diverse than the +ssRNA viruses. Micromonas spp. and other prasinophytes are among the most abundant and widespread marine primary producers (Marin and Melkonian, 2010; Worden et al., 2004). Since the isolation of the Micromonas pusilla reovirus from Norwegian coastal waters (Brussaard et al., 2004), it has only been reported once, in the tropical waters of Hawaii (Culley et al., 2014). The detection of similar contigs in virtually all our samples suggests that mimoreoviruses are as abundant and globally distributed as prasinophytes and that the particles can be detected after years in storage at 4 °C.

Bacteriophages with RNA genomes are thought to be rare and primarily detected in terrestrial environments where they are frequently associated with coliforms (Koonin et al., 2015). Herein, 13 +ssRNA phages with sequence homology to the family Leviviridae were detected (Figure 3.2 rows 223 to 235). Viruses from this family are associated with coliforms and other terrestrial bacterial species (Bollback and Huelsenbeck, 2001), but occur in coastal marine waters affected by terrestrial contamination or run-off (Brion et al., 2002; Sinton et al., 1999). Detection of similar phages in this study was sporadic across samples and could have resulted from terrestrial input, especially in coastal samples such as Jericho Pier; however, the detection of RNA phage in oceanic samples off Peru and Chile suggests that some of these phages are marine. Only one of the identified taxa (Pseudomonas phage PP7, Figure 3.2 row 234) was consistently abundant (75th percentile) across coastal and oceanic samples. 
Currently, phage strain O6N-58P is the only reported marine RNA phage (Hidaka and Fujimura, 1971; Hidaka and Ichida, 1976), although many RNA phage genomes were recently detected in invertebrates, aquatic crustaceans and microbial sediments (Shi et al., 2016; Krishnamurthy et al., 2016). RdRp phylogenetic analysis indicated that phages from this study are more similar to phage genomes from invertebrates than coliforms (Supplementary Figure B.3) and suggests that marine RNA phages are not as rare as previously thought, albeit rare relative to the rest of the RNA virus assemblage.

3.5.3 RDHV-like virus capsid sequences are abundant in marine metagenomic data
RNA-DNA hybrid or chimera viruses refer to small ssDNA viral genomes where the replication protein resembles that of a ssDNA virus while the capsid is most similar to ssRNA viruses of the Tombusviridae and Nodaviridae. RDHV-like viruses have been found in DNA metagenomic studies from aquatic environments (Diemer and Stedman, 2012; Roux et al., 2013; McDaniel et al., 2014), sewage treatment ponds (Kraberger et al., 2015), aquatic arthropods (Hewson et al., 2013; Dayaram et al., 2016) and animal faeces (Steel et al., 2016). While similar reads have been identified in freshwater RNA virus metagenomic data (Roux et al., 2013), those reads were phylogenetically more similar to viruses from the Tombusviridae and Noda-like oomycete viruses. The contigs detected herein were short and only had one large ORF, providing no information on genome organization; however, evidence suggests these contigs are of RNA origin. First, there was very little DNA virus contamination and, second, all of the contigs that were similar to RDHV-like viruses were similar exclusively to capsid genes; no RDHV-like replication genes were detected. It is tempting to hypothesize that these RNA viruses are ancestors or at least RNA relatives of the RDHV or Cruciviridae. 
Contigs from this study mapped to all of the few representatives, suggesting that multiple recombination events occurred between a group of RNA viruses and ssDNA viruses. This is in contrast to the previous hypothesis, which suggested that a single recombination event resulted in the creation of DNA-RNA hybrid viruses (Quaiser et al., 2016; Roux et al., 2013).

3.5.4 Salinity and latitude are associated with changes in assemblage compositions
Constrained correspondence analysis (CCA) was used to identify the environmental factors associated with changes in assemblage composition. The factors analyzed explained 82.5% of the variation in assemblage composition. While there are no metagenomic studies focused on RNA viruses for comparison, studies on cyanomyoviruses found both weak and moderate correlations between environmental variables and shifts in virus assemblage composition, particularly with respect to temperature, salinity and nitrate (Huang et al., 2015; Finke, 2017). A more comparable study focusing on the distribution of prasinoviruses in the Mediterranean Sea found similar patterns to the phage studies, except that phosphate was the strongest predictor of changes in the viral assemblage (Clerissi et al., 2014). Nutrients may indirectly affect viral assemblages by directly affecting their hosts, and nutrient availability is controlled by stratification of the water column, which is controlled by salinity and temperature. Although temperature and its effect on water-column mixing was found to affect phytoplankton DNA virus dynamics in the North Atlantic Ocean (Mojica et al., 2016), there was no observable change in diversity or species evenness associated with mixing in the present study, as both the lowest and highest diversity and evenness scores were in stratified environments. The relationship between diversity and salinity is poorly understood for viruses; yet, it is an important factor explaining changes in assemblage composition. 
A positive relationship has been reported between salinity and cyanomyovirus diversity (Finke, 2017), which echoes observed increases in picophytoplankton diversity (Estrada et al., 2004). In contrast, a weak but negative correlation was observed between RNA virus diversity and salinity in the present study, much like the observations of Campbell and Kirchman (2012), who studied bacterial diversity and salinity in Delaware Bay. The low salinity sites in the present study are coastal and experience freshwater input; virus particles of terrestrial origin would be expected to increase the diversity of viruses at locations with both freshwater and marine inputs.

3.5.5 Conclusions
Metagenomic analysis revealed the diversity and relative abundance of marine RNA viruses across environments ranging from the arctic to the equator. The taxonomic diversity spanned seven dsRNA and 20 ssRNA families at forty-seven geographically distinct sites. Virus taxa from the order Picornavirales that are associated with protists were consistently detected at all sites, while other taxa were more sporadic in their distribution. Taxa traditionally associated with terrestrial environments were detected in high salinity oceanic samples, suggesting that they are native to the marine environment and indicating that the complexity of these assemblages extends beyond the Picornavirales. This study substantially increased the sequencing space and genetic repertoire for free marine RNA virus particles, and explored the effects of environmental factors on the virus assemblages. A better understanding of these viruses and their interaction with the environment will help to better predict how these assemblages may respond in terms of composition and ecological impact, within a framework of environmental change.
Chapter 4

4 SPATIAL AND TEMPORAL DIVERSITY IN FRESHWATER RNA VIRUSES IN THE CONTEXT OF ENVIRONMENTAL VARIABILITY

4.1 Summary
RNA viruses are abundant, ubiquitous and genetically diverse in marine environments where they have been implicated in microbial eukaryotic community dynamics (Miranda et al., 2016). Sequence data have shown that marine and freshwater RNA assemblages are dominated by members of the order Picornavirales, but freshwater assemblages appear to be more diverse. Limited data from two lakes suggest that seasonal and yearly changes occur in the composition of RNA virus assemblages (Djikeng et al., 2009; López-Bueno et al., 2015); however, spatial and temporal changes in RNA virus diversity in freshwaters remain unexplored. This chapter shows that a complex suite of abiotic factors is associated with seasonal changes in viral diversity at six sites characterised by different land use. Temperature and pH were associated with changes in diversity across all sites, although they only explained 20% of the variation, likely due to the variability in diversity within sites. The assemblages changed seasonally, with a few viruses exhibiting high abundance while the rest were rare, conforming with a seed-bank model. As well, certain viruses were associated with specific abiotic factors, often at particular sites, suggesting that they could potentially be used as indicator species. This study showed that over 14 months, changes in the diversity of RNA virus assemblages were associated with a complex suite of environmental factors. This is the first report to analyze taxonomically assigned freshwater RNA viruses in the context of environmental factors. Besides greatly enhancing the sequencing space for RNA viruses found in freshwaters, this study also provides baseline data for future ecological studies of viruses in freshwaters and for using RNA viruses as potential biomarkers for assessing the “health” of watersheds.
4.2 Introduction
Freshwater environments are complex ecosystems that are important habitats for plants, animals and microbes, and also provide easily accessible water for humans. With human activities altering hydrology and increasing pollution of freshwater environments (Tang et al., 2005), it is important to understand the effects that different land use can have on these water sources and their microbial assemblages. Freshwaters are repositories for viruses (Djikeng et al., 2009) that infect humans (Kitajima et al., 2011; Baron et al., 1982; Sinclair et al., 2009), prokaryotes, algae, crustaceans and plants (Rosario, Nilsson, et al., 2009; Koenig et al., 2004). Much of the research that has focused on viruses in freshwaters has examined either specific virus-host interactions (Qian et al., 2003; Sahul Hameed et al., 2001) or DNA virus assemblages (Clasen and Suttle, 2009; López-Bueno et al., 2009; Ge et al., 2013; Tseng et al., 2013; Roux et al., 2012). Research on the diversity, community dynamics and influence of environmental variables on RNA viruses in freshwaters is lacking.

RNA virus assemblages in marine environments (Chapter 3) (Culley et al., 2003, 2006; Culley and Steward, 2007; Culley et al., 2014; Miranda et al., 2016) and lakes (Djikeng et al., 2009; López-Bueno et al., 2015) are typically dominated by a diverse assemblage of members of the order Picornavirales (picorna-like viruses) that have highly uneven community structures (Gustavsen et al., 2014; Gustavsen, 2016; Djikeng et al., 2009). However, the RNA virus assemblages in Lake Needwood were far more diverse with respect to RNA virus families than has been reported in marine studies, and included both double-stranded (ds) and single-stranded RNA (ssRNA) viruses (Djikeng et al., 2009). The community structure was highly uneven, and dominated by 11 genotypes, with only some being abundant across the two samples investigated. 
This uneven structure of RNA virus assemblages was also found in two samples from coastal British Columbia, Canada, which were dominated by different genotypes with very little assemblage overlap (Culley et al., 2006). As well, comparisons of RNA virus metagenomic data from Palmer Station, Antarctica (Miranda et al., 2016) and Kaneohe Bay, Hawaii (Culley et al., 2014) suggested that both were dominated by picorna-like viruses, with different viral families being present at different abundances.

Changes in species abundance and number are important concepts in ecology. Such changes are frequently quantified as either measures of indicator species or changes in assemblage composition (Washington, 1984). One way of analyzing the assemblage structure, or the numerical abundance of each species in an assemblage, is to use an index of community structure, also referred to as a diversity index (Washington, 1984). Among these, Shannon’s index (Shannon, 1948) has become a staple for analyzing diversity (Spellerberg and Fedor, 2003). Because the Shannon index accounts for both the abundance and evenness of species, it is frequently used to describe the mean species diversity at a site, also referred to as alpha diversity (Whittaker, 1960, 1972). Beta diversity, or true diversity, is another measure commonly used in community ecology, and is defined as the variation in species among sites (Anderson et al., 2011; Whittaker, 1960, 1972). There are many ways to analyze beta diversity; modern beta diversity measures may incorporate relative abundance of species, phylogenetic, taxonomic or functional relationship information (Legendre et al., 2005; Izsak and Price, 2001; Clarke et al., 2006; Graham and Fine, 2008; Swenson et al., 2011).
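As an illustration of the alpha diversity measure described above, the following minimal Python sketch computes the Shannon index, H' = -sum(p_i ln p_i), from a vector of taxon counts. The evenness function uses Pielou's J' = H' / ln(S), which is an assumption made here for illustration; the text does not state which evenness measure was used. The function names and the toy counts are hypothetical and are not taken from the study data.

```python
import math


def shannon_index(counts):
    """Shannon index H' = -sum(p_i * ln(p_i)) over taxa with non-zero counts."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("counts must sum to a positive value")
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def pielou_evenness(counts):
    """Pielou's evenness J' = H' / ln(S), with S the number of observed taxa.

    Assumption: Pielou's J' is used only as a common example of an
    evenness measure.
    """
    richness = sum(1 for c in counts if c > 0)
    if richness < 2:
        return 0.0  # evenness is not informative for fewer than two taxa
    return shannon_index(counts) / math.log(richness)


# Hypothetical assemblages with the same richness but different structure.
even = [25, 25, 25, 25]   # four equally abundant taxa
uneven = [97, 1, 1, 1]    # one dominant taxon and three rare taxa

for name, counts in (("even", even), ("uneven", uneven)):
    print(name, round(shannon_index(counts), 3), round(pielou_evenness(counts), 3))
```

An assemblage dominated by a few genotypes, as described for Lake Needwood, would score low on both measures relative to an assemblage of the same richness with equally abundant taxa.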
Diversity indices have gained popularity for describing diversity and variations in aquatic virus assemblages across spatial and temporal scales (Breitbart et al., 2004; López-Bueno et al., 2015; Finke, 2017; Emerson et al., 2013), including for examining seasonal variations in marine phage assemblages (Chow and Fuhrman, 2012). While in-depth studies have not been done to examine diversity changes in freshwater RNA virus assemblages, seasonal and annual differences were observed in freshwater RNA virus assemblages in lakes from Maryland and Antarctica, respectively (Djikeng et al., 2009; López-Bueno et al., 2015). A similar seasonal shift was observed in the DNA viral assemblages from a subtropical freshwater reservoir, where higher relative abundances and diversity were observed in summer compared to winter (Tseng et al., 2013).

Analysis of DNA virus assemblages in freshwater and marine environments suggests that the factors influencing viral dynamics and abundances are complex and frequently correlated with bacterial abundance and chlorophyll-a (Wilhelm and Matteson, 2008), but that these associations vary depending on the environment (Clasen et al., 2008; Wigington et al., 2016; Finke et al., 2017; Gainer et al., 2017). Considering the terrestrial influence on freshwater environments, freshwater RNA virus assemblages would also be expected to be dynamic and diverse. To investigate changes in the composition of RNA virus assemblages, metagenomic data from three watersheds (two streams and one river) with different land-use effects were collected for 14 months. Correlation and regression analyses were used to determine the environmental factors associated with changes in viral diversity, while taxonomically classified relative abundance counts were used along with hierarchical clustering to determine seasonal shifts in assemblage composition. 
As well, the viability of using virus species or groups of closely related viruses as potential indicators of environmental conditions was investigated.

4.3 Materials and methods
4.3.1 Sample collection and preparation
Samples were collected and processed as described in Uyaguari-Diaz et al. (2016). Briefly, water samples (40 liters) were collected from three watersheds in southwestern British Columbia that were selected based on the surrounding land use (agricultural, urban and protected) (Supplementary Table C.1). Although the precise locations of the sampling cannot be disclosed, the agricultural watershed is approximately 63 km and 132 km from the urban and protected watersheds, respectively, while 101 km separates the latter two watersheds. Sampling within locations was conducted at one to three sites depending on the location (agriculture: upstream [AUP], affected [APL], downstream [ADS]; urban: affected [UPL], downstream [UDS]; protected: [PUP]). APL was 9.14 km downstream from AUP, and ADS a further 2.48 km downstream; UPL and UDS were separated by 1.09 km. Samples were collected monthly, giving 14 samples at the agricultural and urban sites (March 2012 – April 2013) and 13 at the protected site (April 2012 – April 2013). Samples were pre-filtered using a 105-µm mesh polypropylene filter (SpectrumLabs, Rancho Dominguez, CA) followed by a series of filters (1-µm pore-size Envirochek HV; 0.2-µm pore-size, 142-mm diameter Supor-200 membrane disc filter; and a 0.2-µm pore-size Supor Acropak 200 sterile cartridge [Pall Corporation, Ann Arbor, MI]) to capture any remaining particles. Virus-size particles were concentrated by tangential flow filtration (TFF) (Suttle et al., 1991) using a 30 kDa cutoff regenerated cellulose Prep/Scale TFF cartridge (Millipore Corporation, Billerica, MA).
4.3.2 Environmental, physical and chemical data
Water quality parameters (temperature, dissolved oxygen, conductivity, pH and turbidity) were measured in situ using a YSI Professional Plus handheld multiparameter instrument (YSI Inc., Yellow Springs, OH), a VWR turbidity meter model No. 66120-200 (VWR, Radnor, PA) and a Swoffer 3000 current meter (Swoffer Instruments, Seattle, WA). Dissolved chloride (mg/L) and ammonia (mg/L) were determined using automated colorimetric (SM-4500-Cl G) and phenate methods (SM-4500-NH3 G) (Eaton et al., 2005). Orthophosphates, nitrites and nitrates were analyzed colorimetrically using methods described by Murphy and Riley (1962) and Wood et al. (1967).

4.3.3 cDNA synthesis and metagenomic library preparation
Nucleic acids were extracted from viral particles using the NucliSens easyMAG system (bioMérieux, Craponne, France), further concentrated using ethanol precipitation (Uyaguari-Diaz et al., 2016) and DNA removed using Turbo DNase I (Life Technologies, Carlsbad, CA) as per manufacturer’s specifications. The cDNA was generated using a modified sequence-independent single-primer amplification (SISPA) method (Wang et al., 2002; Uyaguari-Diaz et al., 2016) using a random nonamer primer (5’ GTTTCCCACTGGAGGATA-N9 3’) and Superscript III reverse transcriptase (Life Technologies, Carlsbad, CA). Sequenase 2.0 DNA Polymerase (Affymetrix, Santa Clara, CA) was used to generate double-stranded cDNA, which served as the template in subsequent PCR reactions using 5 U KlenTaq LA polymerase, 1x KlenTaq PCR buffer, 0.2 mM nucleotides and 2 µM primer (5’ GTTTCCCACTGGAGGATA 3’) with parameters described by Uyaguari-Diaz et al. (2016). PCR products were cleaned up at a ratio of 1.8x with the Agencourt AMPure XP-PCR purification system (Beckman Coulter Inc., Brea, CA). Primers were removed by a BpmI (4 U) digest followed by a second clean-up with the Agencourt AMPure XP-PCR purification system. 
Samples were end-repaired using 0.2 mM nucleotides, 3 U of T4 DNA polymerase, 5 U of DNA polymerase I Klenow fragment, 10 U of T4 polynucleotide kinase and 1x T4 ligase buffer (New England BioLabs Inc., Ipswich, MA). Sequencing libraries were prepared using the NEXTflex ChIP-Seq kit (BIOO Scientific, Austin, TX) and size selected for 300 to 500 bp using Ranger Technology (Coastal Genomics Inc., Burnaby, BC). Sequencing was conducted on a HiSeq 2500 v3 at the Genome Sciences Centre (Vancouver, BC) using four lanes of a 100 bp paired-end run.

4.3.4 Community composition

Shotgun-sequenced reads were trimmed using Trimmomatic v0.3 (Bolger et al., 2014): at the 3’ end with a sliding window (length 5) and a minimum Phred score of 20, and at the 5’ end by removing leading runs of one or more nucleotides with scores less than 20. Cutadapt (Martin, 2011) was used to remove sequencing adapters and PEAR v10 (Zhang et al., 2014) to merge overlapping paired-end reads. Reads shorter than 100 bp were discarded. Each sample was assembled using the de novo assembly algorithm in the CLC Genomics Workbench version 7.5 (CLCBio, Cambridge, MA, USA), after which the contigs and unmapped reads were re-assembled across all samples. Reads from each dataset were mapped to the resultant contigs to determine the relative abundance of each contig. The data were then rarefied to the smallest sample size (158,376 reads) using the phyloseq package (McMurdie and Holmes, 2013) in R 3.2.2 (R Core Team, 2015) with a thousand iterations. Relative abundance counts were normalized by base pair.

Taxonomic classification was done by a similarity search using the BlastX algorithm (Altschul et al., 1990) against the NCBI non-redundant database (June 2016) with an e-value cutoff of 10^-4. Blast output was analyzed using the MEGAN6 Community Edition software (Huson, 2016). Viral contigs that hit to the same virus species were concatenated for community analysis.
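The rarefaction step described above can be illustrated with a minimal Python sketch. It is not the phyloseq implementation used in this study: the function names `rarefy` and `rarefy_table` are hypothetical, reads are drawn without replacement (phyloseq's behaviour depends on its settings), and only a single draw is made rather than a thousand iterations.

```python
import random
from collections import Counter


def rarefy(counts, depth, seed=None):
    """Draw `depth` reads without replacement from a {contig: read_count} table."""
    total = sum(counts.values())
    if depth > total:
        raise ValueError("depth exceeds the number of reads in the sample")
    rng = random.Random(seed)
    # Expand the table into one label per read, then subsample the labels.
    pool = [contig for contig, n in counts.items() for _ in range(n)]
    return Counter(rng.sample(pool, depth))


def rarefy_table(samples, seed=0):
    """Rarefy every sample to the read depth of the smallest sample."""
    depth = min(sum(counts.values()) for counts in samples.values())
    return {name: rarefy(counts, depth, seed) for name, counts in samples.items()}


# Toy data (not from this study): two samples with unequal sequencing depth.
samples = {
    "A": {"contig1": 50, "contig2": 30, "contig3": 20},
    "B": {"contig1": 10, "contig2": 5},
}
rarefied = rarefy_table(samples)
```

After rarefaction every sample has the same total read count, so relative abundances can be compared without the deeper-sequenced sample appearing richer simply because more reads were drawn from it.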
4.3.5 Statistical analysis

Multiple linear regressions, Pearson correlation coefficients and Spearman rank correlation coefficients (rho) were calculated using R 3.2.2 (R Core Team, 2015), including methods from the psych (Revelle, 2016), PerformanceAnalytics (Peterson and Carl, 2014) and car (Fox and Weisberg, 2011) packages. P-values were approximated using a t-test to calculate the probability of observing a Spearman correlation higher than what was observed under the null hypothesis that there is no trend between the two variables. The approximation is accurate to a p-value of ±0.0005 (Best and Roberts, 1975). Six samples (July samples of AUP, APL, ADS, UPL and UDS, and the January PUP sample) were excluded from the abovementioned analysis due to unavailable environmental data. One-way ANOVA and Kruskal-Wallis rank sum tests were performed using base R 3.2.2 (R Core Team, 2015). Bray-Curtis similarity matrices and hierarchical clustering analyses were computed using Primer 6 (Plymouth Routines In Multivariate Ecological Research version 6.1.16) (Clarke, 1993; Clarke and Warwick, 2001) and Permanova+ (version 1.0.6). Clustering was done using the complete linkage algorithm and was verified by a Simprof test. The latter consisted of 999 simulation permutations with a significance level of 5% based on the Bray-Curtis similarity resemblance measure.

4.3.6 Diversity measures and species indicators

Alpha diversity is represented by the Shannon index and was calculated using the Vegan package (Oksanen et al., 2016) in R 3.2.2. Beta diversity, or the variation in taxa assemblage among sites, is represented by hierarchical clustering (previously described) as well as non-metric multi-dimensional scaling (NMDS) plots. NMDS plots were calculated using Primer 6 (version 6.1.16) (Clarke, 1993; Clarke and Warwick, 2001) and Permanova+ (version 1.0.6) from Bray-Curtis similarity matrices generated from taxonomically classified relative abundance tables.
Parameters included a Kruskal stress formula 1 and a minimum stress of 0.01. Indicator species were analysed using R 3.2.2 and the indicspecies package (De Caceres and Legendre, 2009). The association function used was IndVal.g and the significance level was set at 0.05. The respective groups were based on a principal component analysis (PCA) of log-transformed nitrate and nitrite (DIN) and turbidity measures with base variables indicated in Primer 6 (version 6.1.16) and Permanova+ (version 1.0.6). Selection was confirmed by distance-based hierarchical clustering of a Euclidean distance matrix of the abovementioned data. The same six samples, previously mentioned, were excluded due to insufficient data.

4.4 Results

4.4.1 Variations in the taxonomic diversity of freshwater RNA virus assemblages

Stream RNA virus assemblages are taxonomically diverse and span a wide range of families and genera that included viruses with double-stranded RNA (dsRNA), positive-sense single-stranded RNA (+ssRNA) and negative-sense single-stranded RNA (-ssRNA) genomes. The latter group was represented by two species, the unclassified Wuhan Spider Virus [row 41] and Jonchet virus [row 55] of the Bunyaviridae. There was a great deal of variation in the temporal and spatial distribution among taxonomic groups of viruses. Some taxa were widespread and persistently abundant across environments and seasons. These included viruses associated with protists (e.g. Aurantiochytrium ssRNA virus [row 180], Asterionellopsis glacialis RNA virus [row 189] and Delisea pulchra totivirus IndA [row 19]), insects (e.g. select members from the family Dicistroviridae [rows 63, 64, 67, 72 and 80]), crustaceans (e.g. Penaeid shrimp infectious myonecrosis virus [row 23] of the Totiviridae), plants (e.g. members from the families Alphaflexiviridae [rows 42 – 45], Tombusviridae [rows 147, 149 and 162] and unassigned genera Ourmia- and Sobemoviruses [rows 118 and 120]) and fungi (Plasmopara halstedii virus [row 229] and Sclerophthora macrospora virus A [row 199]). Many viruses that are as yet unassociated with a specific host were also found to be persistently abundant and widespread. These included, but were not limited to, the marine RNA viruses SF-1, -2, -3, JP-B, PAL128 and -473 [rows 35, 88 – 90, 224 and 226], Ciliovirus [row 32], which is thought to be associated with a ciliate host, and the freshwater Antarctic picorna-like viruses 1 and 2 [rows 83 – 84].

In contrast to the above examples, the majority of families and taxa were infrequently detected, with great variation in their relative abundance. These include families associated with protists (e.g. Marnaviridae [row 115]), fungi (e.g. Barnaviridae [row 50] and Deltaflexiviridae [row 182]), bacteria (e.g. Leviviridae [rows 105 – 112]), invertebrates (e.g. Alpha- and Carmotetraviridae [rows 46 – 48 and 57] and Nodaviridae [rows 126 – 133]), plants (e.g. Virgaviridae [rows 206 – 208] and Secoviridae [rows 136 – 141]) and vertebrates (e.g. Picornaviridae [rows 134 – 135] and Caliciviridae [row 56]).

Other groups were more strongly associated with seasons. Seasonal trends were observed among sites within similar land-use areas but not across land-use areas. Based on hierarchical clustering and NMDS analysis of a Bray-Curtis similarity matrix, the viral assemblages at the AUP site were generally more similar to one another than to the assemblages observed at the other agricultural sites (Figure 4.1, Supplementary Figures C.4 and C.5). Notably, unclassified Partitiviridae (rows 3 – 9), Totiviridae (rows 13 – 28), Gammaflexiviridae (row 91), Leviviridae (rows 105 – 112), unclassified Nodaviridae (rows 126 – 133) and Ourmia viruses (rows 116 – 118) were more abundant at the AUP site than at the two downstream sites.
At AUP, the summer-to-fall and winter-to-spring RNA virus assemblages differed significantly, with a similarity of 10% between the two groups. The exceptions to this trend were the November 2012, January 2013, February 2013 and March 2013 samples, which either clustered together (November and January), clustered with the summer and fall assemblages (February), or stood alone (March). At the APL site, the 2013 winter to spring assemblages were clearly similar (~50% similarity); as well, the assemblages spanning the late-spring/summer to fall transition were about 28% similar. There was a similar but shorter transition at the ADS site, where the late-spring to summer assemblages shared 39% of their composition.

As with the agricultural sites, seasonal similarities were observed among the RNA virus assemblages at both the urban and protected sites (Supplementary Figure C.4), with clear similarity among the summer to early fall assemblages (22%) at the UPL site.
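The similarity values quoted above are Bray-Curtis similarities on a 0 to 100 scale. As a minimal sketch of how one such value arises from two relative abundance profiles, the Python function below implements the standard Bray-Curtis formula; the function name and the toy abundance vectors are illustrative assumptions, and the analyses in this chapter were run in Primer 6 rather than with this code.

```python
def bray_curtis_similarity(a, b):
    """Percent Bray-Curtis similarity between two abundance vectors.

    Both vectors must list the same taxa in the same order.
    """
    if len(a) != len(b):
        raise ValueError("abundance vectors must have the same length")
    total = sum(a) + sum(b)
    if total == 0:
        raise ValueError("at least one sample must contain reads")
    # Sum of absolute per-taxon differences, scaled by total abundance.
    difference = sum(abs(x - y) for x, y in zip(a, b))
    return 100.0 * (1.0 - difference / total)


# Toy example (not data from this study): three taxa in two samples.
sample_1 = [10, 0, 5]
sample_2 = [5, 5, 5]
similarity = bray_curtis_similarity(sample_1, sample_2)
```

Two samples with identical profiles score 100 and two samples with no shared taxa score 0, which is why a value near 10 indicates almost entirely different assemblages.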
[Figure 4.1 (image not reproduced): heatmap of relative abundance with a complete-linkage dendrogram (similarity axis, 0 to 100) for the agricultural sites AUP, APL and ADS. Columns are monthly samples from March 2012 to April 2013; rows are virus taxa (row numbers 10 to 230) grouped by family; shading runs from the 5th to the 98th percentile. Family legend: Birnaviridae, Partitiviridae, Picobirnaviridae, Reoviridae, Totiviridae, Unclassified dsRNA viruses, Environmental samples <viruses>, Satellite viruses, Unclassified -ssRNA viruses, Alphaflexiviridae, Alphatetraviridae, Benyviridae, Astroviridae, Barnaviridae, Betaflexiviridae, Bromoviridae, Bunyaviridae, Caliciviridae, Carmotetraviridae, Dicistroviridae, Environmental samples <Picornavirales>, Gammaflexiviridae, Hepeviridae, Iflaviridae, Leviviridae, Luteoviridae, Ourmiavirus, Marnaviridae, Sobemovirus, Nodaviridae, Picornaviridae, Secoviridae, Tombusviridae, Tymoviridae, Deltaflexiviridae, Unclassified Picornavirales, Unassigned Picornavirales, Unclassified +ssRNA viruses, Virgaviridae, Unclassified ssRNA viruses, Unclassified viruses.]

Figure 4.1: Seasonal variation in site-specific beta diversity at the agricultural sites. Complete linkage clustering analysis of a Bray-Curtis similarity matrix based on relative abundance counts indicates a similarity (greater than 40%) for the summer and fall RNA virus communities at the upstream (AUP) and affected (APL) sites, while spring and summer virus assemblages were more similar at downstream sites (ADS). The significance of the clustering was validated by a Simprof test (grey lines are not significant). The heatmap was coloured by percentile across all the study sites with white indicating relative abundances that were within the 5th percentile and black the 98th percentile. Taxonomic classification (families) are indicated on the left of the heatmap. The colours are arbitrary and included for ease of referral.
The first column indicates family classification, and the second column genus classification¹. A more detailed taxonomic list can be found in Supplementary Table C.2 and Supplementary Figure E.2.

4.4.2 RNA viruses as indicator species

Of all the environmental factors, DIN and turbidity were the two that best differentiated samples into statistically significant groups (Figure 4.2 A). Group 1 (Low Nitrogen, Low Turbidity) consisted of only PUP samples, while Group 4 (Low Nitrogen, High Turbidity) was small and made up of four samples from the APL site. Group 3 (High Nitrogen, High Turbidity) consisted of samples from the APL and ADS sites with one from AUP, while Group 2 (High Nitrogen, Low Turbidity) was the most complex of the four, consisting of all of the urban samples (UPL and UDS), the majority of AUP samples and three ADS samples. The two axes of the PCA explained 85.4% and 14.6% of the variation, respectively. Based on indicator species analysis of taxonomically assigned metagenomic data, viral species were identified that were associated with specific conditions (Figure 4.2 B). Although 53 species were identified (17 for Group 1, 4 for Group 2, 20 for Group 3 and 12 for Group 4), applying an indicator value (“stat”) cut-off of 0.800 left 30 species (9, 2, 9 and 10, respectively) (Figure 4.2 B). The viruses in Group 1 are thought to be associated with a wide variety of host organisms, including insects, fungi and protists. Group 2 includes a variety of sites, but only two viruses, both of which are associated with plants, while viruses in Group 3 are generally either associated with plants or were assembled from aquatic metagenomic data and are thought to infect protists. A large number of viruses were indicative of the four samples constituting Group 4, with most of them being associated with insects and a few with plants.
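To make the “stat” values reported for the indicator species more concrete, the following Python sketch computes a group-equalized indicator value for one species as the square root of specificity (A) times fidelity (B). It is an illustration written under the assumption that this matches the IndVal.g definition; the function name and toy data are hypothetical, and it omits the permutation test that the indicspecies package uses to obtain P-values.

```python
from math import sqrt


def indval_g(abundances, groups, target):
    """Group-equalized indicator value, sqrt(A * B), for one species.

    abundances: abundance of the species in each sample
    groups:     group label of each sample (same order as abundances)
    target:     the group whose indicator value is wanted
    """
    by_group = {}
    for value, label in zip(abundances, groups):
        by_group.setdefault(label, []).append(value)
    if target not in by_group:
        raise ValueError("target group has no samples")
    means = {label: sum(v) / len(v) for label, v in by_group.items()}
    total_of_means = sum(means.values())
    if total_of_means == 0:
        return 0.0  # species absent everywhere
    # A: share of the species' (group-averaged) abundance found in the target group.
    specificity = means[target] / total_of_means
    # B: fraction of target-group samples in which the species occurs.
    in_target = by_group[target]
    fidelity = sum(1 for v in in_target if v > 0) / len(in_target)
    return sqrt(specificity * fidelity)


# Toy data (not from this study): one species across six samples in two groups.
abundances = [8, 6, 0, 2, 0, 0]
groups = [1, 1, 1, 2, 2, 2]
stat_group_1 = indval_g(abundances, groups, target=1)
```

A species scores close to 1 only when it is both largely restricted to the target group and present in most samples of that group, which is the rationale for a cut-off such as 0.800.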
¹ Genus classification of the Dicistroviridae changed after this analysis was completed. A third genus, Triatovirus, was included in the Virus Taxonomy: 2016 release (https://talk.ictvonline.org). Five species have been reclassified as triatoviruses: Black queen cell virus, Himetobi P virus, Homalodisca coagulata virus-1, Plautia stali intestine virus and Triatoma virus.

[Figure 4.2 A (image not reproduced): PCA ordination of samples, PC1 (85.4%) against PC2 (14.6%), with samples labelled by group (1 to 4).]

[Figure 4.2 B: viral indicator species by group]

Group 1: Low Nitrogen, Low Turbidity
Viral species                                          Stat    P-value
Chara.australis.virus                                  0.964   0.001 ***
Bat.dicistrovirus                                      0.902   0.016 *
Rice.stripe.necrosis.virus                             0.877   0.001 ***
Anopheline.associated.C.virus                          0.874   0.001 ***
Drosophila.immigrans.Nora.virus                        0.852   0.005 **
Giardia.lamblia.virus                                  0.841   0.003 **
Sclerotinia.sclerotiorum.deltaflexivirus.1             0.827   0.007 **
Macrophomina.phaseolina.double.stranded.RNA.virus.2    0.821   0.004 **
Thika.virus                                            0.819   0.028 *

Group 2: High Nitrogen, Low Turbidity
Viral species                                          Stat    P-value
Cucumber.necrosis.virus                                0.955   0.001 ***
Sitke.waterborne.virus                                 0.903   0.014 *

Group 3: High Nitrogen, High Turbidity
Viral species                                          Stat    P-value
Maize.white.line.mosaic.virus                          0.965   0.028 *
Nectarine.stem.pitting.associated.virus                0.882   0.001 ***
Ryegrass.mottle.virus                                  0.855   0.006 **
Antarctic.picorna.like.virus.2                         0.845   0.015 *
Antheraea.pernyi.iflavirus                             0.842   0.008 **
Cocksfoot.mottle.virus                                 0.828   0.015 *
Marine.RNA.virus.PAL473                                0.828   0.002 **
Cowpea.mottle.virus                                    0.815   0.018 *
Marine.RNA.virus.PAL128                                0.807   0.005 **

Group 4: Low Nitrogen, High Turbidity
Viral species                                          Stat    P-value
Hepelivirus                                            0.931   0.011 *
Pariacoto.virus                                        0.883   0.001 ***
Acute.bee.paralysis.virus                              0.869   0.001 ***
White.clover.mosaic.virus                              0.854   0.043 *
Kashmir.bee.virus                                      0.849   0.037 *
Plasmopara.halstedii.virus.A                           0.827   0.002 **
Drosophila.melanogaster.totivirus.SW.2009a             0.826   0.033 *
Nilaparvata.lugens.C.virus                             0.822   0.002 **
Ancient.Northwest.Territories.cripavirus               0.820   0.033 *
Barley.yellow.dwarf.virus                              0.805   0.038 *

Figure 4.2: Viruses as indicators of environmental conditions. A) Principal component analysis (PCA) clustered study sites into four groups based on their nitrite+nitrate (DIN) and turbidity measures. The vectors represent the variables (DIN and Turbidity) with a correlation of one (circle). Groups were chosen by hierarchical clustering of a Euclidean distance matrix, with sites within a group having a distance value of no greater than 2. The four groups are classified as follows: Low Nitrogen, Low Turbidity (Group 1), High Nitrogen, Low Turbidity (Group 2), High Nitrogen, High Turbidity (Group 3) and Low Nitrogen, High Turbidity (Group 4). B) Viral indicator species, based on taxonomically assigned relative abundance counts, were identified for each group; measures of association, P-values and significance were calculated using the ‘indicspecies’ R package.

4.4.3 Differences in alpha diversity among sites

The average variance in diversity within sites was larger than the variance among sites (Figure 4.3 A), and differences in diversity among sites were not statistically significant (ANOVA, p-value = 0.0489; Kruskal-Wallis, p-value = 0.0636); however, there were differences in the variability of virus diversity among sites (Figure 4.3 A, B), which ranged from 0.2 to 2.5 nats among months. There was much less variability at the urban downstream (UDS) and agricultural sites (AUP, APL and ADS), with ADS varying the least, if an outlier is excluded. The variation at the agriculturally affected site (APL) resulted from a two-unit decrease from November to December, followed by a one-unit increase through to February (Figure 4.3 B). Diversity at the AUP site changed monthly, by no more than 1.5 nats.
At the ADS site, diversity was more consistent from the start of sampling to the end of summer, but this was followed by a decrease from November to December, an increase over the following three months, and a large decrease from March to April. Diversity was noticeably lower (1.5 nats) during the summer months at the urban affected site (UPL), and increased to 3 to 3.5 nats at the end of fall. The same pattern was not observed at the UDS site, where diversity varied monthly. The diversity at the protected site (PUP) was also erratic, but was consistently higher during summer and fall (Figure 4.3 B). This variation was mirrored in the evenness calculations for each site, where PUP and UPL had much greater variation in evenness than the other sites (Supplementary Figure C.2 B).

[Figure 4.3 (image not reproduced): A) box plots of the Shannon Index (nats, axis 1.5 to 4.0) for each site (ADS, APL, AUP, PUP, UDS, UPL), grouped as agricultural (upstream, affected, downstream), urban (affected, downstream) and protected. B) Shannon Index (nats, axis 0 to 4.5) against time (months, December 2011 to May 2013) for UPL, UDS, AUP, APL, ADS and PUP.]

Figure 4.3: Spatial and temporal changes in viral alpha diversity. Alpha diversity measures (Shannon Index) based on taxonomic classification of RNA virus communities at each site for the entire time period (A) and over time (B). In (A), the bold line is the median, while the box represents the 25th and 75th percentiles. The whiskers are 1.5 times the length of the box (interquartile range), provided there are data that extend beyond these margins; otherwise, they extend to the minimum and maximum data points; n = 13. Although there were no significant differences in diversity among sites, AUP, ADS and UDS were less variable over time (A) as demonstrated in (B).

Dissolved oxygen was positively correlated (rho = 0.73), while temperature, conductivity and pH were negatively correlated (rho = -0.76, -0.59 and -0.56, respectively), with changes in RNA virus diversity at the UPL site (Table 4.1). As well, dissolved oxygen was weakly correlated (rho = -0.52) to diversity at the ADS site.

Table 4.1: Spearman correlation coefficients (rho) between environmental factors and alpha diversity for each site. UPL was the only site with significant correlations to temperature and dissolved oxygen with weaker correlations to conductivity and pH.

Spearman correlation with alpha diversity (Shannon Index)
Environmental factor   AUP      APL      ADS      UPL       UDS      PUP
Temperature            0.36     0.15     -0.30    -0.76**   -0.15    0.084
Dissolved O2           -0.38    -0.27    0.52*    0.73**    0.34     -0.26
Conductivity           0.37     0.44     -0.32    -0.59*    0.011    -0.36
pH                     0.14     -0.17    -0.29    -0.56*    0.17     0.29
Turbidity              -0.21    -0.19    0.17     -0.41     0.016    0.32
Cl                     0.05     0.43     0.011    -0.37     0.069    0.31
DIN                    -0.34    0.36     0.033    0.34      0.31     0.31
Rain Avg               0.0055   -0.017   0.033    0.42      -0.055   -0.11
Significance codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

4.4.4 Changes in alpha diversity across sites

To explain differences in RNA virus alpha diversity (based on taxonomically assigned data) across all sites, Pearson correlation coefficients were calculated.
All the data were standardized by log(X+1) transformation. Only turbidity data were not normally distributed after transformation. Weak correlations were observed between alpha diversity (H) and both pH (r=0.25) and nitrate+nitrite (DIN) (r=0.24), while stronger correlations were observed between pH and both temperature (r=0.49) and conductivity (r=0.40). DIN correlated with Cl (r=0.40), turbidity (r=0.66), conductivity (r=0.51) and, to a lesser extent, pH (r=0.25) (Supplementary Figure C.1). Using these correlations as guides, various multiple linear regression models were tested to determine which best described the changes in diversity based on the Akaike Information Criterion (AIC). The best model, with the lowest AIC, indicated that a one-unit increase in log-transformed DIN, temperature and pH was associated with decreases of 0.002 and 0.201, and an increase of 1.028, log units of alpha diversity, respectively (Table 4.2). Only temperature and pH were statistically significant factors affecting diversity (Table 4.2). The model also suggests that location affected the diversity; DIN at the protected site had the greatest effect, with a one-unit increase in log-transformed DIN resulting in an increase of 0.604 log units in diversity. Only about 20% of the change in diversity is explained by this model. ANOVA supported the results of the multiple linear regression analysis.
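Because the regression was fitted to log10(x + 1)-transformed predictors, each coefficient acts as an exponent once the model is back-transformed to the original scale. The Python sketch below evaluates that back-transformed model with the estimates reported in Table 4.2 for the agricultural (intercept) location; the function and dictionary names are illustrative assumptions, and the location and DIN-by-location terms are deliberately left out.

```python
# Coefficient estimates from Table 4.2 (agricultural location as the baseline).
COEFFICIENTS = {
    "intercept": -0.113,
    "DIN": -0.002,
    "temperature": -0.201,
    "pH": 1.028,
}


def predicted_diversity(din, temperature, ph, coef=COEFFICIENTS):
    """Back-transformed model: H = 10^b0 * (DIN+1)^b1 * (T+1)^b2 * (pH+1)^b3."""
    return (
        10 ** coef["intercept"]
        * (din + 1) ** coef["DIN"]
        * (temperature + 1) ** coef["temperature"]
        * (ph + 1) ** coef["pH"]
    )


# Average DIN, temperature and pH across all sites, as given in the text.
h_mean = predicted_diversity(din=3.638, temperature=9.903, ph=6.920)
# Highest and lowest temperatures, holding DIN and pH at their averages.
h_warm = predicted_diversity(din=3.638, temperature=21, ph=6.920)
h_cold = predicted_diversity(din=3.638, temperature=3, ph=6.920)
```

Rounded to one decimal place, these three calls give about 4.0, 3.5 and 4.9 nats, in line with the worked example in this section: predicted diversity falls as temperature rises when pH is held constant.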
Multiple linear regression analysis attempts to model the relationship between multiple explanatory variables and a response variable fitted to a linear equation (simplified below):

\[ Y = \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 \]

For log-transformed data the model is linear only on the log scale, resulting in the following equation:

\[ \log_{10} Y = \beta_1 \log_{10}(x_1 + 1) + \beta_2 \log_{10}(x_2 + 1) + \beta_3 \log_{10}(x_3 + 1) \]

Raising 10 to the power of both sides gives a working equation:

\[ 10^{\log_{10} Y} = 10^{\left[\beta_1 \log_{10}(x_1 + 1) + \beta_2 \log_{10}(x_2 + 1) + \beta_3 \log_{10}(x_3 + 1)\right]} \]
\[ 10^{\log_{10} Y} = 10^{\beta_1 \log_{10}(x_1 + 1)} \, 10^{\beta_2 \log_{10}(x_2 + 1)} \, 10^{\beta_3 \log_{10}(x_3 + 1)} \]
\[ 10^{\log_{10} Y} = 10^{\log_{10}\left((x_1 + 1)^{\beta_1}\right)} \, 10^{\log_{10}\left((x_2 + 1)^{\beta_2}\right)} \, 10^{\log_{10}\left((x_3 + 1)^{\beta_3}\right)} \]
\[ Y = (x_1 + 1)^{\beta_1} (x_2 + 1)^{\beta_2} (x_3 + 1)^{\beta_3} \]

Substitution of the data in Table 4.2, including the intercept, which is representative of the location (agriculture), results in the following equation:

\[ \mathit{Diversity}(H) = 10^{-0.113} (\mathit{DIN} + 1)^{-0.002} (\mathit{Temperature} + 1)^{-0.201} (\mathit{pH} + 1)^{1.028} \]

Because neither location nor DIN was a significant factor, they were held constant by using an average value for DIN (3.638 mg.mL^-1) and the intercept for the location, resulting in a constant and an equation to determine the effects of pH and temperature on the RNA virus diversity:

\[ H = 0.769 \, (\mathit{Temperature} + 1)^{-0.201} (\mathit{pH} + 1)^{1.028} \]

For example, using the average pH and temperature values (6.920 and 9.903, respectively) across all sites yields a diversity score of 4.0 nats. At the highest and lowest temperatures (21 and 3), and assuming pH remains constant, the predicted diversity scores are 3.5 and 4.9 nats, respectively.

Table 4.2: Multiple linear regression analysis of RNA virus alpha diversity and environmental data (all log transformed) across sites. Temperature and pH were significantly associated with changes in RNA virus diversity (Shannon Index) across sites. The model suggests that unknown factors specific to a particular site also affected changes in diversity, including nitrate+nitrite (DIN) at the protected site. This model explains about 20% of the observed changes, as indicated by the R-squared value. The T-value is the coefficient estimate divided by its standard error, and the P-value is derived from the T-value and the degrees of freedom. The F-statistic is the test statistic for the analysis of variance (ANOVA) approach to test the significance of the model or the components of the model. The degrees of freedom are the number of independent values that are free to vary within a calculation or estimate of a parameter.

Multiple linear regression model
Factor                      Est.     Std. Error   T value   Pr(>|t|)   Sig.
(Intercept)                 -0.113   0.353        -0.321    0.750
DIN                         -0.002   0.047        -0.042    0.966
Temp                        -0.201   0.069        -2.923    0.005      **
pH                          1.028    0.410        2.506     0.015      *
Location (Urban)            -0.108   0.098        -1.109    0.271
Location (Protected)        -0.096   0.056        -1.724    0.089      .
DIN:Location (Urban)        0.175    0.162        1.079     0.284
DIN:Location (Protected)    0.604    0.311        1.941     0.056      .

ANOVA
Term            Sum Sq   Df   F value   Pr(>F)   Sig.
DIN             0.003    1    0.434     0.512
Temp            0.054    1    8.543     0.005    **
pH              0.040    1    6.279     0.015    *
Location        0.001    2    0.112     0.894
DIN:Location    0.031    2    2.423     0.096    .
Residuals       0.431    68

Residual standard error: 0.080 on 68 DF
Multiple R-squared: 0.232; Adjusted R-squared: 0.153
F-statistic: 2.93 on 7 and 68 DF; P-value: 0.010
Significance (Sig.) codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

4.5 Discussion

This metagenomic study examined assemblages of RNA viruses and potential host genera in three geographically distinct freshwater environments, and how these assemblages changed with respect to environmental factors. The three watersheds were selected based on surrounding land use, which provided an opportunity to analyze the effects of terrestrial activity on freshwater RNA virus assemblages. To assess both seasonal variation as well as factors associated with changes in the virus assemblages, sampling was done over 13 to 14 months (site dependent).

To capitalise on these samples, a serial filtration approach was used to capture the virus, prokaryotic and eukaryotic size fractions; the cellular fractions were analyzed independently of this study. A substantial portion of the reads in the viral fraction were identified as being cellular, primarily ribosomal. Similar contamination has been reported in numerous viral metagenomic studies (Breitbart et al., 2003; Djikeng et al., 2009). All further analyses were conducted using RNA viruses identified by BLASTx analysis against the NCBI non-redundant (nr) database. Changes in viral diversity were analyzed at spatial and temporal levels and correlated to environmental factors. Seasonal changes in the composition of viral assemblages were observed and specific viral species were shown to be indicative of specific environmental conditions. These results and their inferences are further discussed below.
4.5.1 Freshwater RNA virus assemblages are diverse and dynamic

In order to investigate species richness, contigs were taxonomically assigned with an e-value cutoff of 10^-4 against the nr database, and contigs that were assigned to the same virus species were pooled. Consequently, while the genetic and taxonomic richness of the samples is probably underestimated, the relative abundance of a specific genotype may be over-represented. Therefore, this analysis is not of particular virus genomes, but of related genotypes within a 10^-4 e-value of one another. While other methods exist to analyze metagenomes, taxonomic data can allow additional inferences to be made, such as host prediction or environmental sources. Relative abundances of taxonomically assigned RNA virus contigs indicate that some viruses are consistently abundant while others are more sporadic (Figure 4.1; Supplementary Figure C.4). Given that in a river or stream individual viruses are transient, their persistence or dominance implies that they are being continually produced; whereas, most viruses are rare, and represent a “seed bank” (Breitbart and Rohwer, 2005; Edwards and Rohwer, 2005; Hoffmann et al., 2007). The seed-bank model implies that viruses remain active provided their hosts are present and remain susceptible to infection. Once the host abundance changes due to mutation or environmental conditions, the “active” fraction of the viral assemblage should also change dramatically (Edwards and Rohwer, 2005; Hoffmann et al., 2007).

Numerous viral taxa were at relatively high abundances across all sites, irrespective of land use (Figure 4.1). Whether these viruses were part of the native freshwater community, or from soils or land-use activities, is hard to elucidate.
Eight of these genomes (Marine RNA viruses SF-1, -2, -3 and Eunivirus, Brandmavirus UC1, Bufivirus UC1, Carascovirus SF1 and Tombunodavirus UC1; rows 89, 35, 90, 33, 213 – 215 and 239 respectively) were assembled from metagenomic studies of San Francisco wastewater (Greninger and DeRisi, 2015; Greninger and DeRisi, 2015b; Benson et al., 2009; Sayers et al., 2009), yet were detected at all sites, including the protected watershed. Thus, it is likely that these genomes are from viruses that infect hosts that are closely associated with the watersheds. Plasmopara halstedii virus A (row 229, Figure 4.1) and Sclerophthora macrospora virus A (row 199) were also persistent and abundant. Both are classified as nodaviruses that infect downy mildews of plants (Heller-Dohmen et al., 2011; Yokoi et al., 2003). These viruses have a tombusvirus-like capsid, and many tombusviruses are stable and frequently detected in water (Mehle and Ravnikar, 2012; Pleše et al., 1996; Büttner et al., 1987; Büttner and Nienhaus, 1989; Yi et al., 1992; Koenig et al., 2004); it is therefore likely that they are stable in water and could persist for a long time, regardless of whether they are of terrestrial origin or associated with freshwater oomycetes (Nechwatal et al., 2008; Noga, 1993). This inability to differentiate between viruses of terrestrial or aquatic origin makes inferences to the seed-bank model more complicated in watersheds (Carew et al., 2013; Phillips, 1977; Bongers and Ferris, 1999), not only because of viral decay (Heldal and Bratbak, 1991; Noble and Fuhrman, 1997; Bongiorni et al., 2005), but also because RNA viruses from marine and freshwater environments are poorly represented in public databases.

4.5.2 Viral diversity was variable with seasonal trends

Seasonal trends were observed for viral assemblages within land use regions (Figure 4.1), particularly for the three agricultural sites in which APL and ADS share terrestrial input with the unaffected upstream site, AUP.
Statistically, the viral assemblages in APL and ADS were more similar (Figure 4.1), likely because of the influence of the upstream sites on the downstream sites. There were also similarities between AUP assemblages in summer and fall, and winter and spring, while APL had winter to spring and late spring to summer dynamics; the latter was also observed with the ADS assemblages. Seasonal dynamics have also been observed in marine phage and microbial assemblages (Gilbert et al., 2012; Jiang and Paul, 1994; Parsons et al., 2012; Chow and Fuhrman, 2012). Attributing seasonal trends to virus diversity is more complex than comparing assemblage composition and is influenced by location. Viral diversity was high during summer and fall at PUP (Figure 4.3 B). Similarly, temperate marine picorna-like viruses had higher richness and phylogenetic diversity in summer (Gustavsen, 2016). In contrast, lower diversity (Figure 4.3 B) and lower richness were observed in the summer UPL samples (Supplementary Figure C.4; Figure 4.1). Lower richness was also reported for RNA virus assemblages in a lake during summer compared to the late fall (Djikeng et al., 2009). Both bacterial and phytoplankton communities are generally more diverse in winter (Zingone et al., 2010; Ladau et al., 2013), so perhaps one would observe a correlation between virus and phytoplankton diversities at UPL were those data available. As only one year of data is available, and given the highly dynamic environment and assemblage composition, it is not possible to differentiate between seasonal effects and anomalous changes; however, the data clearly establish that dynamics differ among virus groups and across sites.

4.5.3 Specific viruses are associated with particular environmental factors

This study examined the composition of the viral assemblages using the Shannon Index (alpha diversity), Bray-Curtis similarity matrices, clustering (beta diversity) and heatmaps (relative abundance).
The Shannon Index was selected as the measure of diversity, because it incorporates both composition and relative abundance. While richness provides the number of species observed, relative abundance is important when analyzing environments where there can be substantial contributions from outside sources, for example agricultural runoff. The highly uneven species composition of the viral assemblages (Supplementary Figure C.2; Figure 4.1) emphasizes the need to estimate alpha diversity with an index that incorporates both composition and evenness, as has been used in other studies of viruses (Breitbart et al., 2002, 2003, 2004; López-Bueno et al., 2015). Diversity indices can be considered to be a proxy for diversity (Jost, 2006; Tuomisto, 2010); there were no obvious differences between trends in alpha diversity estimated with the Shannon Index (Figure 4.3A) and the effective number of species (Supplementary Figure C.2). To determine whether specific viruses were associated with particular environments, the data were subjected to indicator species analysis. Numerous multicellular organisms, e.g. nematoceran flies (Chironomidae) and nematodes, have been used as indicator species in environmental monitoring projects (Carew et al., 2013; Phillips, 1977; Bongers and Ferris, 1999) and many studies have suggested the use of both DNA and RNA viruses as indicators of human pollution in water (Kittigul, 2000; Griffin et al., 2008; Metcalf et al., 1995; Springthorpe et al., 1993; Kopecka et al., 1993). However, many of these studies proposed indicators that exhibited both epidemic spikes and seasonal fluctuations (Haramoto et al., 2006; Skraber et al., 2004). Furthermore, the use of viruses as indicator species has been limited to viruses that are associated with humans or enterobacteria; yet, there are many other forms of pollution that affect watershed health.
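The diversity measures named above can be illustrated with a minimal sketch. The function names and the example counts are hypothetical and are not taken from the thesis pipeline; the sketch assumes the Shannon Index is computed in nats from relative abundances, that the effective number of species is its exponential, and that Bray-Curtis similarity is computed from abundances listed in the same taxon order for both samples.

```python
import math


def shannon_index(counts):
    """Shannon Index H' in nats, from the abundances of each taxon in one sample."""
    total = sum(counts)
    proportions = [c / total for c in counts if c > 0]
    return -sum(p * math.log(p) for p in proportions)


def effective_species(counts):
    """Effective number of species: the exponential of the Shannon Index."""
    return math.exp(shannon_index(counts))


def bray_curtis_similarity(sample_a, sample_b):
    """Bray-Curtis similarity between two samples that share the same taxon order."""
    shared = sum(min(a, b) for a, b in zip(sample_a, sample_b))
    return 2 * shared / (sum(sample_a) + sum(sample_b))


# Hypothetical assemblages with equal richness (four taxa) but different evenness.
even_sample = [25, 25, 25, 25]
uneven_sample = [97, 1, 1, 1]

print(shannon_index(even_sample), effective_species(even_sample))
print(shannon_index(uneven_sample), effective_species(uneven_sample))
print(bray_curtis_similarity(even_sample, uneven_sample))
```

Both hypothetical samples have the same richness, but the uneven one yields a much lower Shannon Index, which is the behaviour that motivates using an index sensitive to relative abundance.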
In recent years, the techniques employed in environmental monitoring have included nucleic-acid-based identification/barcoding and next-generation sequencing analysis (Hajibabaei et al., 2011; Baird et al., 2012; Carew et al., 2013), which makes analyzing environmental RNA virus assemblages feasible. Finding indicator species associated with each site was challenging due to dynamics in assemblage composition and environmental conditions. Nonetheless, PCA of DIN and turbidity data was used to cluster sites into four groups (Figure 4.2A). All of the protected sites constituted the low nitrogen and turbidity group (Group 1) (Figure 4.2B); most agriculturally affected and downstream sites, with one AUP sample, clustered into the high nitrogen and turbidity group (Group 3), and four APL sites formed a low nitrogen, high turbidity group (Group 4). The high nitrogen, low turbidity group (Group 2) contained a mix of urban and agricultural sites. Having sites from different land use areas probably accounts for the low number of virus genomes that were identified as indicator species. The indicator value index measures the association between a particular species and a site group. The multiple land-use sites would have led to greater differences in the virus assemblage across sites; hence, there would have been a smaller pool of viral species associated with nitrogen and turbidity only.

The viruses identified as indicators were associated with a variety of hosts, including higher plants (all groups), insects (Groups 1 and 4), fungi (Group 1) and protists (Groups 1 and 3). Free-living protists would occur in all of the environments under the conditions encountered; however, protists occupy specific ecological niches; hence, it is not surprising that certain protists and protist-infecting viruses would be associated with high DIN and turbidity (Group 3) and others with low DIN and turbidity (Group 1).
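To make the indicator value index mentioned above concrete, the following is a minimal sketch of one common formulation, in which the index is the product of specificity (how concentrated a species' mean abundance is in the target group) and fidelity (the fraction of sites in the target group where the species occurs). This formulation is an assumption, as are the function name and the example data; the exact variant used in this study is not stated here, and some implementations report the square root of the product.

```python
def indicator_value(abundances, groups, target):
    """Indicator value of one species for the site group `target`.

    abundances: abundance of the species at each site.
    groups: group label of each site, in the same order as `abundances`.
    """
    by_group = {}
    for value, label in zip(abundances, groups):
        by_group.setdefault(label, []).append(value)

    # Specificity: mean abundance in the target group relative to the
    # sum of the mean abundances over all groups.
    means = {label: sum(vals) / len(vals) for label, vals in by_group.items()}
    total_mean = sum(means.values())
    if total_mean == 0:
        return 0.0
    specificity = means[target] / total_mean

    # Fidelity: fraction of sites in the target group where the species occurs.
    target_sites = by_group[target]
    fidelity = sum(1 for v in target_sites if v > 0) / len(target_sites)

    return specificity * fidelity


# Hypothetical species found only in Group 1 sites, and at every one of them.
site_groups = [1, 1, 3, 3]
print(indicator_value([5, 3, 0, 0], site_groups, target=1))
# Hypothetical species spread across both groups, but patchy within Group 1.
print(indicator_value([4, 0, 2, 2], site_groups, target=1))
```

A species scores highly only when it is both largely restricted to a group and present at most sites of that group, which is why heterogeneous site groups tend to yield few indicator species.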
Although the Group 3 viruses are unknown, they resemble viruses that are known to infect diatoms (Miranda et al., 2016; López-Bueno et al., 2015). In contrast, the dsRNA Giardia lamblia virus from Group 1 is classified within the family Totiviridae, and belongs to a genus of viruses that infect parasitic protists (Wang and Wang, 1986a; Wang and Wang, 1986b; Wu et al., 2016; Kraeva et al., 2015). Giardia lamblia is commonly found in rivers and streams in British Columbia. Perhaps most surprising is that specific protist-infecting viruses are not associated with each of the environmental groups. Also in Group 1 was the Chara australis virus genome. Chara species are native to British Columbia (Warrington, 1994) and contribute to keeping water nutrient-poor and clear (Beltman and Allegrini, 1997; Van Den Berg et al., 2001). Chara survives in these nutrient-poor environments due to nitrogen-fixing epiphytic cyanobacteria on its surface (Ariosa et al., 2004). Chara is unevenly distributed, which makes its detection as an indicator species challenging. In contrast, viruses associated with Chara should be widely dispersed in environments where the plants occur and can be readily detected with molecular techniques. This analysis allowed the best indicator species to be selected based on nucleotide sequences in the entire virus assemblage, instead of only isolates. This is important when analyzing taxonomically assigned data, as most of the reference database is from cultured viruses, which represent a small fraction of the RNA virus diversity (Shi et al., 2016; Culley et al., 2006; Steward et al., 2013; Gustavsen et al., 2014). Therefore, although a particular virus is identified as an indicator species, it is likely a closely related taxon or a group of closely related viruses.
Hence, if a PCR-based approach were to be used for indicator-species detection, it would require degenerate primers to be designed based on the nucleotide data associated with the taxonomic assignments. The feasibility of such an approach would need to be tested for specific environments.

4.5.4 Drivers of RNA virus diversity are complex

To study variability in assemblage diversity across spatial and temporal scales, diversity changes were related to environmental parameters. The mean alpha diversity across all sites was within 0.5 nats; yet, diversity in individual samples ranged from 1.2 to 4.0 nats (Figure 4.3A). To explain these differences across sites, correlation and step-wise multiple linear regression were used with a variety of environmental parameters as explanatory variables. While only pH and DIN were weakly correlated with diversity, temperature was a significant explanatory variable based on regression analysis (Table 4.2). The model (Table 4.2) explained about 20% of the diversity fluctuations. The low explanatory value of the model is consistent with the suggestion that a complex set of environmental factors affects virus diversity (Gaston, 2000), including location-specific factors, of which only some could be incorporated in the model due to the limited sample size.

The distribution and abundance of specific viruses depend on many factors that are dictated by the environment, the viruses and their hosts (Chow and Suttle, 2015; Suttle, 2016); hence, it is not possible to achieve a complete understanding of the factors that lead to a particular viral assemblage. These include environmental factors such as temperature, salinity, pH and nutrients that not only directly affect viruses and their hosts, but also can affect host-virus interactions (Brown et al., 2009; Mojica and Brussaard, 2014; Kendrick et al., 2014).
Moreover, in the systems studied here, environmental factors, such as rain, can greatly affect the influence of the surrounding terrestrial environment on streams and rivers. One would have expected correlations with precipitation if terrestrial inputs of both nutrients and viruses were high; however, previous studies suggest that precipitation data need to be collected over a longer period prior to sampling to detect these inputs (Tseng et al., 2013). This reinforces the point that these environments are complex and highly dynamic, and that it is challenging to tease apart all of the factors that influence the virus assemblage that is detected.

4.5.5 Conclusions

Metagenomic analysis revealed the diversity, dynamics and environmental correlates of the diversity of RNA virus assemblages in six watersheds over 14 months. The analysed stream RNA viral assemblages consisted of species taxonomically similar to five dsRNA and 23 +ssRNA recognized virus families, with many more similar to unclassified and unassigned isolates and environmental metagenomic assemblies. Low seasonal similarity was observed in the viral beta-diversity within sites, and was even less apparent across all sites and locations. Specific RNA virus species were proposed to be good indicators for specific environmental conditions, in particular, turbidity and nitrogen; therefore, they should be considered as potential biomarkers for water quality. Diversity was dynamic, with differences in variability among sites. Changes in the UPL RNA virus diversity correlated with temperature, pH, conductivity and dissolved oxygen; the latter correlation was also observed at ADS; no significant correlations were found for any of the other sites. Multiple linear regression analysis suggested that temperature and pH could be used to predict changes in alpha diversity across all sites, with the caveat that many factors affect viral diversity and the model only explained 20% of the changes.
The dynamic nature of these freshwater RNA virus assemblages, along with their broad taxonomic distribution, suggests that these complex assemblages are affected by many environmental factors and local land-use practices. This study significantly increases our knowledge of RNA virus diversity and dynamics in watersheds, and their relationship to some environmental factors.

Chapter 5

SEQUENCE-BASED TAXONOMIC CLASSIFICATION OF UNCULTURED AND UNCLASSIFIED MARINE SINGLE-STRANDED RNA VIRUSES OF THE ORDER PICORNAVIRALES

5.1 Summary

Metagenomics has altered our understanding of the diversity of RNA viruses, especially those that are abundant in marine environments. Many of these viruses share genetic features with members of the order Picornavirales, yet very few of these viruses have been taxonomically classified. The only recognized family of marine RNA viruses is the Marnaviridae, with its only member species being Heterosigma akashiwo RNA virus. In addition, two genera, Labyrnavirus (1 species) and Bacillarnavirus (3 species), are currently included in the order Picornavirales but not assigned to a family. Here we propose a sequence-based framework for taxonomic classification of twenty marine RNA viruses into the family Marnaviridae. Using RNA-dependent RNA polymerase (RdRp) phylogeny-based analyses, we propose to assign the genera Labyrnavirus and Bacillarnavirus to the family Marnaviridae and to create four new genera in the family with the proposed names: Locarnavirus (n=4), Kusarnavirus (n=1), Salisharnavirus (n=4) and Sogarnavirus (n=6). A capsid amino acid pairwise identity of 75% is proposed as a species demarcation threshold. The family displays RdRp and capsid amino acid sequence diversities of 0.59 and 0.69 d, respectively, and Jukes-Cantor distances ranging from 0.53 to 0.75 and 0.58 to 0.76, respectively, suggesting there are many more taxa to be discovered and classified in this family.
Our proposed taxonomic framework provides a sound classification system for the Marnaviridae. It is based on sequence information alone, allowing for the inclusion of genomes assembled from metagenomic datasets into the taxonomic framework, thereby expanding our database of taxonomically assigned viral species.

5.2 Introduction

The isolation and subsequent genomic analysis of Heterosigma akashiwo RNA virus (HaRNAV) (Tai et al., 2003; Lang et al., 2004), which infects a single-celled eukaryotic alga, altered our view of marine viruses because of its clear evolutionary relationship to viruses infecting higher plants and animals. Moreover, RNA viruses are thought to constitute approximately 50% of marine virus assemblages (Steward et al., 2013), and play an important ecological role through their infection of marine eukaryotic phytoplankton (Gustavsen et al., 2014; Nagasaki et al., 2004; Tomaru, Takao, et al., 2009; Tomaru et al., 2012, 2013; Miranda et al., 2016; Lawrence and Suttle, 2004; Brussaard, 2004). To date, there are ten described RNA virus isolates that infect marine single-celled eukaryotes (Table 1.1) (Lang et al., 2004; Tai et al., 2003; Tomaru et al., 2004; Shirai et al., 2008; Kimura and Tomaru, 2015; Tomaru, Takao, et al., 2009; Tomaru et al., 2013; Nagasaki et al., 2004; Tomaru et al., 2012; Brussaard et al., 2004), with eight of them having features in common with members of the order Picornavirales (Table 5.1). The order Picornavirales comprises positive-sense single-stranded RNA (+ssRNA) viruses in the families Picornaviridae, Dicistroviridae, Iflaviridae, Secoviridae and Marnaviridae (Sanfaçon et al., 2009). Marnaviridae is the family with the fewest members, with HaRNAV being its only member and Marnavirus its only genus.
In addition, two genera of marine RNA viruses (Labyrna- and Bacillarnavirus) are included in the order Picornavirales but are currently unassigned to a family (see http://talk.ictvonline.org/taxonomy/ for the latest taxonomic updates). Despite different genome organizations within the Picornavirales, all members have helicase, 3C-like protease and RNA-dependent RNA polymerase (RdRp) domains, and capsid proteins with related structures (Le Gall et al., 2008). Viral proteins are expressed as one or more polyproteins that are cleaved into individual functional proteins. While all members of the Picornavirales have the above features in common, they can vary in the number of genome segments and open reading frames (ORFs) expressed (Le Gall et al., 2008). HaRNAV has an 8.6-kb polyadenylated genome with a single 7.7-kb ORF encoding one viral polyprotein (Lang et al., 2004). Members of the Labyrna- and Bacillarnavirus genera share sequence similarities with HaRNAV, although their genomes have a dicistronic organization (Nagasaki et al., 2004; Shirai et al., 2008; Tomaru, Takao, et al., 2009; Takao et al., 2006) similar to viruses belonging to the Dicistroviridae (Chen et al., 2011a).

Metagenomic studies have revealed multiple picorna-like genomes that are more similar to marine RNA viruses than to other known viruses, and are considered to be important members of marine virus assemblages (Culley et al., 2007, 2014; Miranda et al., 2016; Shi et al., 2016). Furthermore, studies suggest that marine RNA viruses are abundant and genetically diverse in coastal environments (Culley and Steward, 2007; Culley et al., 2003; Gustavsen et al., 2014; Miranda et al., 2016; Culley et al., 2014); yet the current taxonomic classification neither reflects nor accommodates this diversity.
Classical approaches to virus classification by the International Committee on Taxonomy of Viruses (ICTV) have relied on genetic and biological information such as host range, replication cycle, virus particle structure and properties, and serology (Simmonds et al., 2017). This type of information is not available for viruses identified through metagenomic studies. Considering that the number of assembled marine RNA virus genomes substantially outnumbers those from isolated representatives, these genomes need to be incorporated within the established taxonomic framework. This argument has recently been amplified (Simmonds et al., 2017), and the ICTV is now willing to consider taxonomic proposals based only on genomic data. This led to the Genomoviridae being the first virus family to include metagenomically assembled genome sequences in a taxonomic framework (Varsani and Krupovic, 2017). While that framework was established for single-stranded DNA (ssDNA) viruses in the Genomoviridae, the approach is suitable for accommodating +ssRNA viruses associated with the order Picornavirales. In this chapter, genomic data were used to establish a taxonomic framework for 20 marine RNA viruses (Table 5.1), including eight virus isolates and twelve viruses known only from assembled metagenomic data. An analysis of the amino acid sequences of the entire capsid ORFs and the RdRp domains using phylogeny, pairwise identity and domain analysis yielded a framework for classifying +ssRNA viruses within the family Marnaviridae.

5.3 Materials and methods

5.3.1 Phylogenetic analysis

All sequences were either aligned or added to existing alignments using MUSCLE v3.8.425 with the default parameters (Edgar, 2004), and then manually refined with AliView version 1.17.1 (Larsson, 2014). To corroborate amino acid model selection, both the Smart Model Selection in PhyML (Lefort et al., 2017) and ProtTest 3.2 (Darriba et al.,
2011) as part of the Phylemon2 package (Sánchez et al., 2011) were used. Maximum likelihood trees were constructed with PhyML 3.0 (Guindon et al., 2010) using the aLRT SH-like approach, LG+I+G+F amino acid model and 100 bootstrap replicates. The resultant tree was edited in iTOL v3 (Letunic and Bork, 2016). RdRp protein sequences of the Picornavirales type members, spanning the eight conserved domains (Koonin and Dolja, 1993), were aligned as described above. Structural protein sequences were aligned similarly. GenBank accession numbers of relevant sequences are described in Table 5.1 and Supplementary Table D.2.

5.3.2 Pairwise similarities and diversity metrics

Relevant alignments were analyzed for Jukes-Cantor distances and sequence similarities using the CLC Genomics Workbench v7.5 (CLCBio). Mean distances within and among genera, as well as the number of amino acid differences per site from mean diversity calculations (amino acid sequence diversity) for each population or genus, were calculated for the capsid and RdRp domains using the p-distance method in MEGA7 (Kumar et al., 2016).

5.3.3 Domain analysis

The RdRp conserved domains were identified as per Koonin and Dolja (1993) and illustrated using WebLogo3 (Crooks et al., 2004).

Table 5.1: Genome and isolation information for proposed members of the Marnaviridae. Existing species and genera are shown in italics while proposed new species and genera are not italicized.

| Genus | Species | Accession (nucleotide) | Accession (amino acid) | Genome size (nt) | # ORFs | Source | Country | References |
|---|---|---|---|---|---|---|---|---|
| Marnavirus | Heterosigma akashiwo RNA virus [1] | NC_005281 | NP_944776 | 8587 | One | Heterosigma akashiwo | Canada | Lang et al., 2004; Tai et al., 2003 |
| Labyrnavirus | Aurantiochytrium single stranded RNA virus [1] | AB193726 | BAE47143 | 9035 | Two | Aurantiochytrium sp. | Japan | Takao et al., 2006 |
| Locarnavirus | Marine RNA virus JP-B [2] | NC_009758 | YP_001429583, YP_001429584 | 8926 | Two | Coastal marine | Canada | Culley et al., 2007 |
| | Marine RNA virus SF-2 | KF412901 | AGZ83339, AGZ83340 | 9321 | Two | Coastal wastewater | USA | Greninger and DeRisi, 2015 |
| | Marine RNA virus SF-1 | JN661160 | AFM44930, AFM44929 | 8970 | Two | Coastal wastewater | USA | Greninger and DeRisi, 2015 |
| | Marine RNA virus SF-3 | KF478836 | AHA44480 | 8648 | One | Coastal wastewater | USA | Greninger and DeRisi, 2015 |
| Kusarnavirus | Asterionellopsis glacialis RNA virus [2] | NC_024489 | YP_009047193, YP_009047194 | 8842 | Two | Asterionellopsis glacialis | Japan | Tomaru et al., 2012 |
| Bacillarnavirus | Chaetoceros tenuissimus RNA virus 01 | AB375474 | BAG30951, BAG30952 | 9431 | Two | Chaetoceros tenuissimus | Japan | Shirai et al., 2008 |
| | Rhizosolenia setigera RNA virus [1] | AB243297 | BAE79742, BAE79743 | 8877 | Two | Rhizosolenia setigera | Japan | Nagasaki et al., 2004 |
| | Chaetoceros socialis f. radians RNA virus 01 | NC_012212 | YP_002647032, YP_002647033 | 9467 | Two | Chaetoceros socialis f. radians | Japan | Tomaru et al., 2009 |
| Salisharnavirus | Marine RNA virus BC-4 [2] | | | 8593 | Two | Coastal/oceanic marine | Canada | [3] |
| | Marine RNA virus PAL473 | NC_029309 | YP_009230124, YP_009230125 | 6360 | Two | Coastal marine | USA | Miranda et al., 2016 |
| | Marine RNA virus BC-1 | | | 8638 | Two | Coastal marine | Canada | [4] |
| | Marine RNA virus PAL128 | NC_029306 | YP_009230118, YP_009230119 | 8660 | Two | Coastal marine | USA | Miranda et al., 2016 |
| Sogarnavirus | Marine RNA virus BC-2 | | | 8843 | Two | Coastal marine | Canada | [4] |
| | Marine RNA virus PAL156 | NC_029307 | YP_009230120, YP_009230121 | 7897 | Two | Coastal marine | USA | Miranda et al., 2016 |
| | Marine RNA virus BC-3 | | | 8496 | Two | Coastal marine | Canada | [4] |
| | Marine RNA virus JP-A | NC_009757 | YP_001429581, YP_001429582 | 9236 | Two | Coastal marine | Canada | Culley et al., 2007 |
| | Chaetoceros tenuissimus RNA virus type II [2] | NC_025889 | YP_009111336, YP_009111337 | 9562 | Two | Chaetoceros tenuissimus | Japan | Kimura and Tomaru, 2015 |
| | Chaetoceros species RNA virus 01 | AB639040 | BAK40203, BAK40204 | 9417 | Two | Chaetoceros sp. | Japan | Tomaru et al., 2013 |

[1] Assigned type species in existing genera. [2] Proposed type species in new genera. [3] Genome assembly described in Chapter 3 of this thesis. [4] Genome assembly described in Chapter 2 of this thesis.

5.4 Results

5.4.1 Family classifications and proposed genera

Maximum likelihood phylogenetic analysis of the RdRp domain sequences placed the marine RNA virus sequences in a monophyletic group relative to other sequences (Figure 5.1). HaRNAV and the Aurantiochytrium viruses were most distant (Figure 5.2 and Supplementary Figures D.1 and D.2) and basal within the clade (Figure 5.1). Within the Marnaviridae, mean RdRp and capsid amino acid sequence diversities are 0.59 and 0.69 d, respectively. The analyses place the 20 viruses into seven well-supported clades corresponding to genera within the Marnaviridae, including the classified genus Marnavirus and the unassigned genera Bacillarnavirus and Labyrnavirus. The mean distances between the capsid amino acid sequences range from 0.58 between Kusarnavirus and Sogarnavirus to 0.76 between Marnavirus, Labyrnavirus and Locarnavirus, while the distances between RdRp sequences range from 0.53 and 0.54 for Locarnavirus and Kusarnavirus and for Bacillarnavirus and Sogarnavirus, respectively, to 0.75 for the Marnavirus and Labyrnavirus genera (Supplementary Table D.1).

Figure 5.1: Maximum likelihood phylogeny of the Picornavirales RNA-dependent RNA polymerase amino acid sequences rooted with sequences from the Potyviridae. SH-like branch support values are shown. Branch colours denote virus families: Potyviridae (dark green), Picornaviridae (red), Iflaviridae (yellow), Dicistroviridae (orange), Secoviridae (lime green) and Marnaviridae, Labyrnavirus, Bacillarnavirus and other marine viruses examined (blue). Existing and proposed genera within the Marnaviridae are indicated by coloured boxes.
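The two distance measures named in section 5.3.2 can be sketched as follows. This is an illustration only: the function names are hypothetical, gapped columns are assumed to be skipped, and the Jukes-Cantor correction is assumed to take its standard 20-state (amino acid) form, d = -(19/20) ln(1 - (20/19) p). The exact settings applied by CLC Genomics Workbench and MEGA7 are not specified in the text.

```python
import math

GAP = "-"


def p_distance(seq_a, seq_b):
    """Proportion of differing sites between two aligned sequences.

    Columns with a gap in either sequence are ignored (an assumption).
    """
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != GAP and b != GAP]
    if not pairs:
        raise ValueError("no comparable sites in the alignment")
    differences = sum(1 for a, b in pairs if a != b)
    return differences / len(pairs)


def jukes_cantor_distance(p, states=20):
    """Jukes-Cantor corrected distance for an alphabet of `states` characters."""
    b = (states - 1) / states
    if p >= b:
        return math.inf  # sequences are saturated; the correction is undefined
    return -b * math.log(1 - p / b)


# Toy aligned fragments (hypothetical, not taken from the thesis data).
p = p_distance("GDDAC-EK", "GDDSCWEK")
print(p, jukes_cantor_distance(p))
```

The correction always returns a value at least as large as the raw p-distance, because it accounts for multiple substitutions at the same site; the gap widens as sequences become more divergent.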
Seven genera within family Marnaviridae are proposed based on clusters in the RdRp phylogeny (Figure 5.1). This includes the established genus Marnavirus, as well as the two previously unassigned genera Labyrnavirus and Bacillarnavirus. The proposed genera were also supported by the capsid phylogeny with the exception of sogarnavirus, which is split into two distinct groups in the analysis (Figure 5.3 and Supplementary Figure D.5).

[Figure 5.2 image: pairwise heatmaps with axes "Jukes-Cantor distance" and "Percent amino acid identity" for the 20 Marnaviridae genomes and Cricket paralysis virus (CrPV). Scale ranges: capsid, 13.2 to 59.0% identity and 0.5 to 1.7 distance; RdRp, 22.4 to 77.0% identity and 0.3 to 1.5 distance.]

Figure 5.2: Heatmap of amino acid pairwise sequence analysis of the Marnaviridae. Jukes-Cantor distances and percent amino acid similarities were analyzed for A) the capsid proteins and B) the RdRp domain. Black indicates the highest similarity detected (lowest distance score) and white the lowest similarity (highest distance score).
Colours denote genera: Marnavirus (blue), Labyrnavirus (red), Locarnavirus (purple), Kusarnavirus (yellow), Bacillarnavirus (pink), Salisharnavirus (orange), Sogarnavirus (green) and the Dicistrovirus (Cricket paralysis virus) (white).

5.4.1.1 Marnavirus

The only genus currently classified within the Marnaviridae is Marnavirus, which consists of one representative, HaRNAV, the founding member of the family and the type species for the genus (Table 5.1). It shares between 22.4 and 30.3% RdRp sequence identity with other genera and 18.5 to 26.4% amino acid identity with capsid sequences in other genera (Figure 5.2 and Supplementary Figures D.1 and D.2). It is a divergent taxon in the family in both phylogenetic analyses.

5.4.1.2 Labyrnavirus

This previously unassigned genus consists of one species, AuRNAV (Table 5.1). It shares between 22.4 and 33.6% RdRp sequence identity with other genera and 17.1 to 22.5% amino acid identity with capsid sequences in other genera (Figure 5.2 and Supplementary Figures D.1 and D.2). It is a deep-branching taxon in both phylogenetic analyses.

5.4.1.3 Locarnavirus

This genus consists of four viruses, JP-B, SF-1, SF-2 and SF-3, all originating from metagenomic data (Table 5.1). RdRp and capsid diversities among these four viruses are 0.44 and 0.63 d, respectively (Figure 5.4). The members of this genus share 52.4 to 59.1% RdRp and 32.2 to 38.3% capsid amino acid identities with one another. Members of this genus share 24.0 to 48.2% RdRp and 17.2 to 27.4% capsid amino acid identities with other members of the family (Figure 5.2 and Supplementary Figures D.1 and D.2). The proposed type species is JP-B.

5.4.1.4 Kusarnavirus

The previously unclassified virus isolate AglaRNAV, which infects the pennate diatom Asterionellopsis glacialis, would be the only member of this proposed genus and is therefore the proposed type species (Table 5.1).
It shares 25.4 to 48.2% amino acid identity with RdRp sequences and 21.7 to 47.9% amino acid identity with capsid sequences of members from other Marnaviridae genera. The capsid sequence shares the highest amino acid identity (44.3 to 47.9%) with members of the proposed genus Sogarnavirus (Figure 5.2 and Supplementary Figures D.1 and D.2).

5.4.1.5 Bacillarnavirus

This previously unassigned genus consists of three isolates, CtenRNAV01, RsRNAV and CsfrRNAV (Table 5.1), all of which infect centric diatoms. RdRp and capsid diversities are 0.38 and 0.66 d, respectively (Figure 5.4). The members of this genus share 57.8 to 64.4% amino acid identity in the RdRp domain and 29.1 to 33.4% amino acid identity in the capsid sequence with each other. Members of this genus share 23.8 to 47.8% RdRp amino acid identity and 18.3 to 29.4% capsid amino acid identity with other members of the family (Figure 5.2 and Supplementary Figures D.1 and D.2). The type species is RsRNAV.

5.4.1.6 Salisharnavirus

This proposed genus would include four viruses, BC-4, PAL473, BC-1 and PAL128 (Table 5.1), which were all discovered through metagenomics. RdRp and capsid diversities are 0.51 and 0.60 d, respectively (Figure 5.4). Within this genus, members share RdRp amino acid identities between 35.6 and 63.6% and capsid amino acid identities between 24.9 and 34.4%. Amino acid identities between members of this genus and other members of the Marnaviridae range from 24.3 to 47.2% for the RdRp and 18.2 to 31.5% for the capsid (Figure 5.2 and Supplementary Figures D.1 and D.2). The proposed type species is BC-4.

5.4.1.7 Sogarnavirus

This is the largest proposed genus, which would include six new species, with four derived from metagenomic studies (BC-2, PAL156, BC-3 and JP-A) and two isolates (CtenRNAVtypeII and CspRNAV2) (Table 5.1). RdRp and capsid diversities are 0.38 and 0.56 d, respectively (Figure 5.4).
The members of this genus share 48.8 to 77.0% RdRp amino acid identity and 27.6 to 59.0% capsid amino acid identity with one another. Members of this genus share 24.3 to 47.2% RdRp amino acid identity and 17.1 to 47.9% capsid amino acid identity with other members of the family (Figure 5.2 and Supplementary Figures D.1 and D.2). The proposed type species is CtenRNAVtypeII.

Figure 5.3: Comparative RdRp and capsid maximum-likelihood phylogenies of the family Marnaviridae. SH-like branch support values are shown. Branch colours denote virus genera: Marnavirus (blue), Labyrnavirus (red), Locarnavirus (purple), Kusarnavirus (yellow), Bacillarnavirus (pink), Sogarnavirus (green) and Salisharnavirus (orange).

Amino acid diversity was not directly related to the number of genomes analyzed (Figure 5.4, Supplementary Figure D.6). The mean diversity of the capsid ORFs for each genus was higher than for the RdRp and was not directly tied to the RdRp diversity. For example, the genus Bacillarnavirus has the highest capsid (0.66 d) and lowest RdRp diversities (0.37 d) while the genus Salisharnavirus has the highest RdRp diversity (0.51 d) with the second lowest capsid diversity (0.60 d). On average, the amino acid diversities of the Marnaviridae sit in the middle of the scale of zero to one.

Figure 5.4: Summary of genera in the Marnaviridae and their associated number of species and amino acid diversities. The numbers of viruses are depicted as black circles (●) while diversities are represented as bars: capsid (light grey) and RdRp (dark grey).

5.4.2 RdRp conserved motifs

Members of the Marnaviridae showed conservation of specific amino acids in the RdRp domains but genera were distinguished by unique amino acid substitutions (Figure 5.5). Domain I is characterized by a conserved KDE motif, which is expanded to LKDEP in the genus Locarnavirus, and flanked by additional conserved amino acids in the genus Bacillarnavirus to CTKDEPTK.
Domains II and III show substantially more variation, with a single conserved amino acid in each; R is conserved in the middle of domain II and G at the N-terminal end of domain III in the Marnaviridae. In the genera Locarnavirus, Salisharnavirus and Sogarnavirus, neighbouring conserved amino acids were detected in domain II (VRK/MYLL, RKYF/YLP and RK/E/MYFLP). In the N-terminal end of domain III in the Bacillarnavirus, Salisharnavirus and Sogarnavirus, two conserved amino acids precede the conserved G, resulting in a conserved AVG sequence. This domain also has a conserved W at the C-terminal end with the exception of the genus Labyrnavirus, where the W is substituted by a V. Domain IV has an N-terminal conserved GDYXXXD motif, with the G substituted by a W in two members of Salisharnavirus, and the Y substituted by an F in all members of the Bacillarnavirus. Domain V is long and highly variable. The N-terminal end had a distinct E insertion in salisharnaviruses. This domain also features a conserved SGXXXTV motif, except for the marna- and labyrnaviruses, where the V is replaced by an A. The quintessential GDD motif in domain VI is preceded by a conserved TY in the family. Both domains VII and VIII are poorly conserved among the Marnaviridae, with conserved FLKR and SXXK motifs, respectively. The latter region is highly conserved in the Bacillarnavirus, Salisharnavirus and Sogarnavirus and features an SIFKS motif (with the exception that a Y replaces the F in one member of the salisharnaviruses).
[Figure 5.5 image: sequence logos of RdRp domains I to VIII for all members (n=20) and for each genus: Marnavirus (n=1), Labyrnavirus (n=1), Locarnavirus (n=4), Kusarnavirus (n=1), Bacillarnavirus (n=3), Salisharnavirus (n=4) and Sogarnavirus (n=6).]

Figure 5.5: Conserved motifs of the RdRp sequences identified in the family Marnaviridae. Amino acids are coloured by their chemistry: polar uncharged (green), neutral hydrophilic (purple), basic positively charged (blue), acidic negatively charged (red) and hydrophobic non-polar amino acids (black).

5.4.3 Species demarcations

Pairwise comparisons of the RdRp and capsid amino acid sequences revealed a large range of identity values (Figure 5.6 and Supplementary Figures D.1 and D.2). Most of the capsid sequences analyzed shared 22 to 30% pairwise amino acid identities compared to 26 to 46% for the RdRp sequences. The highest pairwise identity scores were 59% and 77% for the capsid and RdRp, respectively.

[Figure 5.6 image: histograms of the frequency of identity scores against percent amino acid identity for A) capsid and B) RdRp.]

Figure 5.6: Distribution of pairwise amino acid sequence identities for Marnaviridae members. Identities calculated for the entire capsid polyprotein region are in grey (A) and for the RdRp domain in black (B).

5.5 Discussion

Sequence analysis of three marine virus isolates and 12 viruses described from metagenomic studies, all unclassified, as well as members of the unassigned genera Labyrnavirus and Bacillarnavirus and the genus Marnavirus, indicates that the family Marnaviridae should be expanded. Arguments for the expansion of the Marnaviridae to include these taxa are elaborated below.
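The proposed 75% capsid identity threshold for species demarcation can be applied to a pairwise comparison as in the sketch below. The function names are hypothetical, gapped columns are assumed to be excluded from the identity calculation, and treating a value of exactly 75% as "same species" is an assumption, since the text does not state whether the threshold is inclusive.

```python
SPECIES_THRESHOLD = 75.0  # percent capsid amino acid identity, as proposed in this chapter


def percent_identity(seq_a, seq_b):
    """Percent of identical residues between two aligned sequences, ignoring gapped columns."""
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no comparable sites in the alignment")
    matches = sum(1 for a, b in pairs if a == b)
    return 100.0 * matches / len(pairs)


def same_species(capsid_identity, threshold=SPECIES_THRESHOLD):
    """True if two genomes would be assigned to the same species (inclusive cut-off assumed)."""
    return capsid_identity >= threshold


# The highest capsid identity reported among the 20 genomes is 59%.
print(same_species(59.0))
```

Because the highest capsid identity observed among the genomes analyzed (59%) falls below the threshold, every pair is classified as distinct, which is consistent with treating each of the 20 genomes as a separate species.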
5.5.1 Expanding the family Marnaviridae to encompass seven genera

The RdRp phylogeny for the marine viruses analyzed in this study shows a well-supported monophyletic group (Figure 5.1). Because of this relationship, and the basal location of the currently classified representative, we propose that all of the analysed viruses (Table 5.1) belong in the family Marnaviridae. These viruses are more similar to each other than to any other members of the Picornavirales (Supplementary Figures D.3 and D.4). HaRNAV and AuRNAV are at the base of the clade and the most distant from the other viruses. The RdRp phylogeny defines seven new genera in the family (Figure 5.1). Polymerases are sufficiently conserved at the amino acid level that similarities can be used to establish viral classification at the genus level, as has been demonstrated for multiple virus families (Varsani and Krupovic, 2017; Baker and Schroeder, 2008; Nibert et al., 2014). The genera newly assigned to the Marnaviridae were selected based on robust phylogenetic analysis; the supported clades all contain multiple members except for the genera Kusarnavirus and Labyrnavirus. While all genera share some conserved amino acid motifs (Figure 5.5), such as the catalytic GDD (Wang and Gillam, 2001; Kok and McMinn, 2009), genus-specific motifs are also present. Amino acid substitutions that are chemically different would suggest that these positions are not strictly required for the catalytic function of the protein. However, such sites are useful for this level of sequence classification. Genome organization can be useful for comparing virus groups but is not a sufficient marker for either family- or genus-level demarcations (Table 5.1). While the majority of genomes analyzed here have a dicistronic organization, HaRNAV and SF-3 have a single predicted polyprotein encoded in their genomes. HaRNAV is the only representative of the genus Marnavirus, but SF-3 is one of four genomes within the genus Locarnavirus. 
While all of the remaining isolates within the Marnaviridae and the majority of metagenomic viruses have dicistronic genomes, monocistronic genomes are also found for viruses discovered in the coastal waters of Hawaii, which are also likely members of the Marnaviridae (Culley et al., 2014). No accession numbers or sequences were available for these genomes so they were not included in this analysis. Furthermore, the family Secoviridae includes both mono- and bipartite members. Like the proposed additions to the Marnaviridae, the Secoviridae are a monophyletic clade based on the polymerase sequence. This creates further precedent for the inclusion of members with different genomic organization in the family Marnaviridae. With the exception of the sogarnaviruses, all multi-member genera form separate clades on the capsid phylogeny (Supplementary Figure D.5); however, there are incongruences between the capsid and RdRp phylogenies (Figure 5.3). This may be due to recombination within the family. While it is unlikely that recombination occurs within the polymerase domain (Greenspan et al., 2004), recombination has been observed in the genomes of other members of the Picornavirales (Simmonds, 2006; Moore et al., 2011; Elbeaino et al., 2012). Recombination would also explain the similarity between the capsid genes of AglaRNAV and BC-2, BC-3 and Pal156, but not the rest of the genus Sogarnavirus (Figure 5.2). The presence of two groups in the sogarnaviruses suggests a sub-genus classification may be warranted; this may become clearer with the addition of new species. The three established genus names are acronyms based on either the virus hosts (Labyrnavirus and Bacillarnavirus) or where the virus came from (Marnavirus) along with the genome type; for example, Labyrnavirus and Bacillarnavirus refer to the host groups and Marnavirus refers to a marine RNA virus. 
The genus Locarnavirus is an acronym of Locarno beach, from which the first marine RNA virus metagenomes were assembled. JP-B was selected as the type member for this genus, being at the base of the clade and the oldest assembled genome in the clade. Salisharnavirus pays homage to the Salish Sea, the water masses from which the first marine RNA virus amplicons were amplified, while Sogarnavirus refers to the Strait of Georgia, a major water body of the Salish Sea. The type members for these two genera are BC-4 and Chaetoceros tenuissimus RNA virus type-II. BC-4 was selected as it is the only genome in the clade where the genome assembly could be verified, and where it was certain that the ORFs were complete; CtenRNAVII was selected because it was the better studied of the two isolates in the clade. Rhizosolenia setigera RNA virus 01 is the type species of the genus Bacillarnavirus as described by the ICTV. The name Kusarnavirus is derived from the Afrikaans word for coastal, and like the genera Marnavirus and Labyrnavirus has only one member, which was selected as the type species.

5.5.2 Classifying new species

The amino acid divergence of the capsid compared to the RdRp makes it an ideal marker for species demarcation (Figure 5.6 and Supplemental Figures 5.1 and 5.2). Functional differences likely cause the differences between the two regions, as capsid proteins are involved in host recognition and co-evolve with cellular receptors (Tully and Fares, 2006; Nagasaki et al., 2005). A conservative cut-off of 75% pairwise amino acid identity of the capsid polyprotein is suggested for species demarcation. This is 15 percentage points higher than our highest observed identity (59-60% bin) (Figure 5.6). The large genetic distances among the proposed genera, and the substantial estimates of diversity, suggest that there is considerable additional diversity that was not included in our analysis; hence, a more conservative cutoff was selected. 
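The capsid criterion above can be illustrated with a minimal sketch. It assumes that the two capsid polyprotein sequences have already been aligned, that identity is counted only over columns in which both sequences have a residue, and that an identity at or above the cut-off places two genomes in the same species. None of these details is specified in the text, and the function and constant names are hypothetical.

```python
# Hypothetical helper names; the 75% value is the capsid cut-off proposed in the text.
CAPSID_SPECIES_CUTOFF = 75.0  # percent pairwise amino acid identity


def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity between two pre-aligned amino acid sequences.

    Assumption: columns containing a gap ('-') in either sequence are
    ignored, so identity is computed over shared residue columns only.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == "-" or b == "-":
            continue  # skip gap columns
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * matches / compared


def same_species_by_capsid(aligned_a: str, aligned_b: str,
                           cutoff: float = CAPSID_SPECIES_CUTOFF) -> bool:
    """True if the capsid identity meets the cut-off (assumed inclusive)."""
    return percent_identity(aligned_a, aligned_b) >= cutoff


if __name__ == "__main__":
    # Toy alignment fragments, not real capsid sequences.
    print(percent_identity("MKV-LA", "MKVQLT"))       # 4 of 5 compared columns match
    print(same_species_by_capsid("MKVLA", "MRVTA"))   # 3 of 5 match, below the cut-off
```

Under this reading, the highest capsid identity observed among the analysed genomes (59%) falls well below the cut-off, so every analysed genome would be treated as a separate species.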
Furthermore, a 75% capsid identity for species demarcation is in accordance with the parameters required for species classification in the Secoviridae (Sanfaçon, Iwanami et al., 2011; Sanfaçon et al., 2009), but less stringent than the 90% cutoff for species in the families Iflaviridae and Dicistroviridae (Chen et al., 2011b, 2011a). The percent amino acid identity of the RdRp is only recommended for species demarcation if the capsid identity of a potential new species is on the 75% cusp (i.e. ± 1%). The proposed cut-off for RdRp amino acid identity is 90%, which is higher than the observed 77-78% bin. This is stricter than the 80% identity required for the protease-polymerase region of the Secoviridae (Sanfaçon, Iwanami et al., 2011; Sanfaçon et al., 2009). Unlike what is done for the Secoviridae, the protease is not part of the proposed classification, because its higher divergence in the marine RNA viruses makes it difficult to identify. The naming of many viruses has often involved a combination of the host name and the symptom of infection or the orientation of the genome, for example Heterosigma akashiwo RNA virus or Cricket paralysis virus. This approach is not feasible with genomes assembled from the environment; hence, an approach was chosen that uses a description of the environment and an abbreviation of the geographic location where the genome originated, followed by a number to differentiate multiple genomes, for example Marine RNA virus BC-1. There is a history of invoking geography in naming viruses, such as Norwalk virus or the genus Marburgvirus. If genomes are assembled from transcriptomes, the organism of origin can be included provided there is reasonable certainty that the virus is infecting the organism. 
Care should be taken when classifying picorna-like viruses from transcriptomes as the poly-A tail would allow for their amplification during RNA-seq irrespective of whether they are infecting the host or are contaminants from the environment.

5.5.3 Conclusions

In this study, a sequence-based approach was used to expand the family Marnaviridae to include both viruses discovered through metagenomic studies and isolates classified within genera but unassigned to a family. We propose the addition of two previously unclassified genera, Labyrnavirus and Bacillarnavirus, as well as four new genera to the family. Because of differences in the levels of sequence conservation for the RdRp and capsid proteins, we propose that the RdRp be used for genus-level classification and the capsid for species demarcation. This is the second approach to deviate from classic taxonomic classification (Varsani and Krupovic, 2017) and the first focusing on +ssRNA viruses. Although this approach does not take into account any biological properties such as pathology or host range, it does provide a means for creating structure for the large pool of new viruses that are being discovered through metagenomics.

Chapter 6  CONCLUSIONS AND FUTURE DIRECTIONS

6.1 Recapitulation and significance of this work

For decades, RNA viruses have been recognized for their significance in the economic, health and biotechnological sectors, but only recently has their prevalence in the environment, and particularly aquatic environments, been recognized (Culley et al., 2014, 2003; Gustavsen et al., 2014; Miranda et al., 2016; Moniruzzaman et al., 2017; Lang et al., 2009; Djikeng et al., 2009; López-Bueno et al., 2015). However, these studies were restricted to a few locations or time points; furthermore, most studies used an amplification approach that introduces multiple biases. 
As well, knowledge of RNA viruses in aquatic environments is largely limited to coastal marine ecosystems; RNA viruses in oceanic environments have not been reported, while metagenomic surveys focussing on freshwater assemblages are limited. The work described in this dissertation used metagenomic data to expand our knowledge of aquatic RNA viruses. Biogeographic analysis of six RNA virus genomes (Chapter 2) revealed that distantly related genotypes are broadly distributed with specific genotypes present in more localized areas. Further analysis of these genomes suggests that they exist as quasispecies which are subject to purifying selection. Surveys of 47 marine sites indicated that RNA virus assemblages are taxonomically diverse and that salinity and latitude were associated with changes in diversity (Chapter 3). A monthly time series of RNA viruses in streams demonstrated that these assemblages were also taxonomically diverse and highly dynamic; yet, they exhibit seasonal changes in diversity (Chapter 4). Along with the discovery of this new diversity came the assembly of several near-complete genomes; Chapter 5 describes a framework for classifying these environmental genomes into the pre-existing framework established by the International Committee on Taxonomy of Viruses. While each chapter of this dissertation stands on its own, the overarching theme of the thesis is to uncover and characterize the vast diversity of RNA viruses in marine and fresh waters. Below, I explore some of the new insights gained from this body of work.

6.1.1 Aquatic Picornavirales – survivors of an ancient world

This dissertation unequivocally establishes the order Picornavirales, and more specifically the Marnaviridae and similar viruses, as the most consistently present and abundant free viruses in marine and fresh waters (Chapters 3 and 4). Why are these viruses so overrepresented in both marine and freshwater environments compared to terrestrial ecosystems? 
Are they rooted in an ancient aquatic world from which they emerged through co-evolution with their hosts and diversified to infect new supergroups of eukaryotes? It has been argued that an ancestral RNA virus that was remarkably similar to viruses in the order Picornavirales appeared after eukaryogenesis (Koonin et al., 2015), and that, based on molecular clock analysis, there are several groups of protists that are basal members of the eukaryotic tree (Parfrey et al., 2011; Hedges et al., 2004; Baldauf et al., 2000). Taken together, it is tempting to hypothesize that members of the Marnaviridae represent progeny of an ancient lineage of viruses that were some of the first viruses to infect eukaryotes. The diversification of these viruses would then be associated with, and perhaps even pre-date, the diversification of eukaryotic lineages (Koonin et al., 2008). The simplicity of picorna-like viruses fits well with the ancient lineage hypothesis. Their capsids are small icosahedrons, a structure which provides the largest enclosed volume as well as the strongest structure to be assembled using free energy (Zandi et al., 2004). To date, all RNA viruses found to infect aquatic protists have such capsids, whether they have +ssRNA or dsRNA genomes, though the Micromonas pusilla reovirus has a double capsid similar to other Reoviridae (Brussaard et al., 2004). The only high-resolution structure of an RNA virus infecting a protist is for HcRNAV, which, although not a member of the Picornavirales, displays similarity at the protein level and has an icosahedral capsid similar to that observed in transmission electron micrographs of HaRNAV, the type species of the Marnaviridae (Tai et al., 2003; Shirai et al., 2008; Tomaru et al., 2013), a family in the Picornavirales. Furthermore, the majority of small icosahedral RNA viruses have positive-sense single-stranded genomes (Figure 6.1). 
Two dsRNA virus families, Totiviridae and Partitiviridae, are also considered small icosahedral RNA viruses and are more closely related to families within the Picornavirales than other dsRNA families (Koonin et al., 2008).

[Figure 6.1 image: host legend: single-celled eukaryote, nematode, human, non-human vertebrate, plant, invertebrate, bacteria, fungi]

Figure 6.1: Taxonomy of RNA virus genera recognized by the ICTV. The size of the wedges represents the number of virus genera classified within each family with the colours representing the genome type: +ssRNA (teal), -ssRNA (purple), dsRNA (green), ssRNA (yellow) and +/-ssRNA (pink). Virus particles were adapted from those depicted on the ViralZone website with the hosts of each family represented by a silhouette of a red human (humans), red rabbit (non-human vertebrates), yellow insect (invertebrates), green plant (plants), pink mushroom (fungi), purple bacteria (prokaryotes), dark blue nematode (nematode) and light blue flagellate (protists). Families in the order Picornavirales are represented by dark blue and families with domains with similarity to picorna-like viruses are outlined in white.

The replication cycle and host entry of viruses in the proposed expansion of the family Marnaviridae are unexplored but pose interesting questions. Several viruses have been isolated that infect marine diatoms (Shirai et al., 2008; Tomaru et al., 2013; Kimura and Tomaru, 2015; Nagasaki et al., 2004; Nagasaki, 2008), and environmental sequencing suggests that there are many more (Nagasaki, 2008; Gustavsen et al., 2014; Miranda et al., 2016). It seems unlikely that viruses can penetrate the silica frustule of diatoms to enter the cell. 
However, these viruses are all less than 40 nm in diameter, which is much smaller than the punctae (pores) in diatom frustules (Nagasaki et al., 2004; Losic et al., 2007), which may provide an entry point.

It is not known how marnaviruses enter their hosts. If entry is similar to that of other members of the Picornavirales, except for the Secoviridae (Fuchs et al., 2017), then a host endocytic pathway is required. Clathrin-mediated endocytosis is a common route of entry for RNA viruses, including human rhinoviruses (DeTulleo and Kirchhausen, 1998; Daecke et al., 2005; Chu and Ng, 2004; Blanchard et al., 2006). Although forms of endocytosis other than pinocytosis are poorly understood in diatoms, proteomic studies suggest that at least some diatoms express clathrin (Nunn et al., 2009); therefore, at least some Marna-like viruses likely use an endocytic pathway to enter their hosts. In contrast, large dsDNA viruses that infect single-celled eukaryotes use different strategies of cell entry. For example, the Emiliania huxleyi phycodnavirus relies on its surrounding membrane for entry (Mackinder et al., 2009), while the Paramecium bursaria chlorella virus uses a structural modification on the capsid to inject its genome into the cell (Zhang et al., 2011). With neither of these features observed in members of the Marnaviridae, and the hosts having the systems required for endocytosis, it seems likely that this presumably ancient system is used by members of the Marnaviridae to enter their hosts. The host diversity of the expanded Marnaviridae appears to be far greater (Figure 1.5) than for other virus families (Figure 6.1), although this diversity is obscured by single-celled eukaryotes being referred to collectively as protists. In contrast, other +ssRNA virus families are typically limited to a single group of hosts, e.g. plants, vertebrates or fungi. 
Despite virus families typically having relatively narrow host ranges, families in the Picornavirales are associated with a broad range of multi- and unicellular hosts. This is in line with the theory that picorna-like virus diversity is associated with key events of eukaryotic evolution (Koonin et al., 2008).

6.1.2 Metagenomics provides insight into the diversity of aquatic viruses and their environment

We are only scratching the surface of the diversity of aquatic RNA viruses. With increased pressure on aquatic systems, whether through aquaculture, resource exploitation or climate change, the life in these environments will experience pressure to change. RNA viruses play an important role in aquatic environments by infecting phototrophic and heterotrophic phytoplankton (Lang et al., 2009), which in turn affects community composition and nutrient cycling (Wilhelm and Suttle, 1999; Fuhrman, 1999). The first step towards understanding the effect that RNA viruses have in the environment is to know which viruses are present and what they infect.

Aquatic RNA viruses do not only infect protists (Lang et al., 2009); there are also many that infect ecologically and economically important species such as salmonids (Breyta et al., 2016). However, there is little evidence of these viruses or their relatives in metagenomic datasets, even though the existence of these viruses within aquatic hosts is well established (Shi et al., 2016). Using a less biased sequencing approach with no poly-A bias (described in Chapters 2 and 3), the analyses in Chapters 3 and 4 show that free virus particles, associated with diverse virus families, are present, though more erratic in occurrence and generally at lower relative abundance than protistan viruses. These surveyed families include taxa associated with mammals, fish, plants and invertebrates. Low-abundance viruses not only increase the richness of local viral assemblages, but also increase the infection potential of the virus seed bank. 
The seed-bank model theorises that changes in the environment or host assemblage cause shuffling among the rare members, thus resulting in shifts in the members that are abundant. Certainly, niche partitioning has been observed for virus-host systems (Kimura and Tomaru, 2017), and the analysis in Chapter 3 indicates that virus assemblage diversity changes based on the environment. This included a decrease in virus diversity and evenness with an increase in salinity, and an increase in diversity with latitude. These interactions and the factors affecting the virus seed bank are more complicated in freshwater environments (Chapter 4), where the assemblage changes as the result of native factors, but also due to input from the terrestrial environment.

6.1.3 Bridging the gap between ecological and classical virology

Aquatic virology is rooted in oceanography and as a result, most studies have focussed on environmental aspects associated with virus assemblages (Bergh et al., 1989; Proctor and Fuhrman, 1990; Suttle et al., 1990; Needham et al., 2013). Advances in technology and culture systems have contributed to great progress in the field. In particular, they have made it possible to address questions in the environment that were previously restricted to laboratory studies, such as how habitats influence genomic diversity (Kelly et al., 2013). Many of these concepts have not trickled down to studies of RNA viruses. Metagenomics is a powerful tool to perform surveys but has rarely been used to answer questions relating to the biology and ecology of RNA viruses. Chapter 2 bridges this gap by employing metagenomic data to analyze the geographic distribution of specific genotypes. This analysis allowed deductions to be made with respect to virus biology by combining metagenomic data with biological information from distantly related viruses. 
For example, despite all domains analyzed being under purifying selection, the VP1 protein of six marine RNA virus genomes is more prone to experiencing amino acid changes, suggesting that it is involved in host recognition and thus maintains a fine balance between conserving the recognition site and co-evolving with the host receptor. Two of the three BC genomes had many single nucleotide variants (SNVs) while BC-3 had far fewer. This suggests that the latter has recently been involved in lytic events resulting in the dominant genotype being far more abundant in the environment with fewer detectable variants. While it is important to note that phenotypic properties are not necessarily deducible from a genome sequence alone, having an understanding of basic principles along with the information retrievable from metagenomic data allows hypotheses to be formulated with respect to phenotype and function. Many novel genomes have been assembled from metagenomic data; yet, until recently, these genomes were excluded from official taxonomic classification (Simmonds et al., 2017). While a robust taxonomic framework has been established for ssDNA viruses in the Genomoviridae (Varsani and Krupovic, 2017), such a framework did not exist for RNA viruses. With the continuous addition of new marine virus genotypes (Chapters 2 and 3) (Miranda et al., 2016; Moniruzzaman et al., 2017) it has become important to classify these viruses (as described in Chapter 5). Sequence-based frameworks for taxonomic classifications have not been well received by all (van Regenmortel, 2016), because virus classification has also relied on host classification and interaction, serology and other biological properties. It has been argued that many newly discovered viral sequences are not homologous to anything in public databases (Mokili et al., 2012) and hence cannot be classified (van Regenmortel, 2016). 
While this was true when the first surveys of marine picorna-like viruses were conducted (Culley et al., 2006), most newly identified viruses have sufficient domain similarity to sequenced virus isolates that they can be identified. The environmental genomes classified in Chapter 5 are all for viruses which have amino-acid similarity to isolates, as well as comparable genome organizations. Furthermore, many families within the Picornavirales rely on comparisons of amino-acid distance of virus domains or ORFs to establish species and genus demarcation (Sanfaçon, Gorbalenya et al., 2011). As such, the classification system presented in Chapter 5 expands the boundaries used to classify viruses, rather than imposing a new set of rules. While such classifications should be done with care, it is important to advance the field of virus classification to incorporate metagenomic data, the largest source of viral diversity.

6.2 Strengths and limitations

With metagenomic analysis becoming increasingly popular, the importance of appropriate and optimized methodologies has become apparent (Steward and Rappe, 2007). This includes processing of the environmental sample, amplification of the nucleic acids and subsequent bioinformatics, which are discussed below.

6.2.1 Ribosome contamination

Unlike other studies, these marine RNA virus metagenomes had no detectable 16S or 18S ribosomal contamination. A flat-bed pre-filtration system, which may be less likely to rupture cells, was used to remove the cellular fraction (Suttle et al., 1991) rather than cartridge pre-filters (Uyaguari-Diaz et al., 2016; Culley and Steward, 2007). In contrast, the freshwater samples were pre-filtered using cartridge filters and many samples were dominated by ribosomal reads. Ultracentrifugation was used to further concentrate the marine virus samples. The centrifugation speed was calculated to pellet small viruses but not ribosomes (Chapters 2 and 3). 
Pellets were resuspended in a small volume of supernatant, so even if some ribosomes were present, they would likely have been swamped by the viral nucleic acids. Another approach would have been to resuspend the pellets in a sterilized buffer, but some virus capsids undergo conformational changes depending on pH or other environmental factors and might be unstable.

6.2.2 Pre-filtration and virus concentration

Pre-filtration and concentration of virus particles are integral to the work in this dissertation but have caveats. One of the biggest concerns with pre-filtration is biasing the free virus assemblage by particle loss on the filters, whether this is because viruses are attached to particles, too big to fit through the pores or broken. This is of less concern for the small picorna-like viruses than for the rod-, bullet-shaped or filamentous viruses. In both marine and freshwater datasets, filamentous and rod-shaped viruses were detected (Poty-, Gammaflexi- and Virgaviridae and Tymovirales), but not Rhabdoviridae. The latter infect aquatic organisms (Mork et al., 2004; Suttle, 2007; Walker and Winton, 2010; Meyers, 1984; Lang et al., 2009) and single-celled eukaryotes (Bird and McCaul, 1976), yet were not detected in this study, suggesting that rhabdoviruses were either not present, damaged or caught on filters during processing. Tangential flow filtration (TFF) with a 30 kDa cutoff membrane (Suttle et al., 1991) and chemical flocculation (John et al., 2011) are commonly used to concentrate viruses from large volumes of water. At the beginning of this work, there were no data on the efficacy of concentrating RNA viruses by chemical flocculation, which is why TFF was used. Due to logistical constraints, flocculation was used for two samples, from Kenton-on-sea and Cape Point, South Africa (Chapter 2). 
A single sample comparison (JP14) between the two methods produced similar results based on cluster analysis, although the data were not included due to the limited sample size. Furthermore, the three assembled genomes (BC-1, -2, and -3) were dominant in both samples and SNV analysis suggested that the data were comparable. However, RNA viruses were less stable in the resuspension buffer after flocculation, as RdRp amplification was no longer possible after six months in the storage buffer, suggesting that dialysis may be required for long-term storage. Many of the virus concentrates (Chapter 3) were stored at 4°C for substantial periods of time. It is unclear how this storage affects viral assemblages, and different types of viruses likely have different stabilities; however, statistical analysis indicated that taxonomic assemblage composition was not significantly affected by the age of the samples (Chapter 3 appendix).

6.2.3 Metagenomic library construction

As explained in Section 1.8.2, there are many parameters to consider when using the SISPA method. In order to reduce amplification bias, separate RT reactions were done with random hexamers and nonamers using different concentrations of RNA template, as well as separate PCR reactions with DMSO and different concentrations of MgCl2. Because each set of conditions resulted in different banding patterns when amplified samples were analysed using agarose electrophoresis, and there was no way of selecting the best conditions, it was decided to do all of the above and then pool the reactions before library preparation. While PCR errors due to DNA polymerase are always a concern (von Wintzingerode et al., 1997; Polz and Cavanaugh, 1998; Acinas and Sarma-Rupavtarm, 2005), using 32 replicates, eight of which were RT replicates, minimized PCR bias. Carry-over between samples can occur during library preparation. 
To determine whether any significant contamination occurred between samples, hierarchical cluster analysis was performed on similarity matrices after the data were processed. No trends in similarity were observed between samples that were processed next to one another, so there was no evidence of contamination.

6.2.4 Bioinformatics

Illumina sequencing technology was selected as it has one of the lowest error rates, ≥0.1 (Glenn, 2011). All sequences were filtered to eliminate the 1% error, and low quality ends were trimmed. The bioinformatic methods were tested on a control library of HaRNAV. One of the most important steps in a metagenomic bioinformatics pipeline is assembly of reads into contigs. Many algorithms were tested and ultimately the CLC Genomics Workbench’s de novo assembler was selected for its efficient and reliable assembly. As with all assemblers, errors occur. The mapping files of large contigs were manually assessed along with ORF and domain analysis to determine the quality. The methods and parameters used are widely accepted in the field (Miranda et al., 2016; Culley et al., 2014). Taxonomic classification can be challenging due to the inherent genetic nature of RNA viruses. Such classification is further complicated when analysing unexplored virus metagenomes, which frequently consist of small contigs as in Chapters 3 and 4. While relative abundance counts were normalized to account for the disparity in contig length, no adjustment was made to account for the effect of contig length on taxonomic classification. Some of the contigs were small, less than 500 bp in length. As a result, many of the BLAST results had short regions of local similarity. We selected an e-value cut-off of 0.0001, a compromise between retrieving significant matches and obtaining hits for shorter sequences. 
Despite this, the assignment to specific taxa must be interpreted with caution, especially if one considers that horizontal gene transmission is common amongst RNA viruses. While we used taxonomic classification in Chapters 3 and 4 to describe marine RNA virus assemblages, we acknowledge that many of the identified taxa are likely not the specific virus identified but rather a genome that contains a similar region. Microdiversity, or the differences in organisms that have near identical physical and genetic features, can have important ecological and evolutionary consequences, as shown for bacteriophages and bacteria (Needham et al., 2017; Chafee et al., 2017). A consequence of clustering or combining similar sequences is that microdiversity is lost, even though it exists (Chapter 2). In Chapters 3 and 4 contigs that were similar to the reference virus were concatenated, thereby losing microdiversity. This is an unfortunate side effect of taxonomic classification of RNA viruses based on sequence analysis as it relies on a database, which is very limited for aquatic RNA viruses. Nonetheless, this thesis presents a start for incorporating microdiversity into virus-host network analysis, or for defining sequence similarity cutoffs for operational taxonomic units (OTUs).  6.3 Future directions This thesis greatly expands our knowledge of the geographic distribution,