Open Collections

UBC Theses and Dissertations

Diversity and evolution of single-stranded DNA viruses in marine environments Labonté, Jessica 2013

DIVERSITY AND EVOLUTION OF SINGLE-STRANDED DNA VIRUSES IN MARINE ENVIRONMENTS

by

Jessica Labonté

B.A. Microbiology, Université Laval, 2003
M.Sc. Microbiology, Université Laval, 2005

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY in THE FACULTY OF GRADUATE STUDIES (Microbiology and Immunology)

THE UNIVERSITY OF BRITISH COLUMBIA (Vancouver)

June 2013

© Jessica Labonté, 2013

Abstract

Viruses are the most abundant and genetically diverse life forms in the biosphere. By infecting specific subsets of microbial communities, they influence community composition, thereby affecting nutrient and energy cycling. Single-stranded (ss) DNA viruses are major pathogens of plants and animals that have been well studied for years due to their economic and human-health effects. Recent advances in metagenomics show that ssDNA viruses are widespread in marine and freshwater environments, but their diversity and evolutionary relationships remain relatively unexplored. This dissertation focuses on characterizing the diversity and evolutionary relationships among marine ssDNA viruses. First, metagenomic data were gathered and analyzed to assess the genetic diversity of ssDNA viruses, leading to the identification of 129 genetically distinct groups of ssDNA viruses that had no recognizable similarity to each other or to other sequenced viruses. Each group was represented by at least one complete genome, with most falling into 11 well-defined groups. Comparison and phylogenetic analysis of sequences from marine and terrestrial viruses indicate that ssDNA viruses share a common origin and that terrestrial viruses likely co-evolved with their hosts when they transitioned from the ocean to the land. The second part focused on one particular subfamily of viruses, the Gokushovirinae (family Microviridae), to understand their relationship with the environment.
Five complete genomes were assembled and primers were designed to amplify a fragment of the major capsid protein to look at the distribution of gokushoviruses in various marine environments. Phylogenetic analysis revealed that most sequences were distantly related to those from cultured representatives, falling into many new distinct evolutionary groups. Finally, a protocol for fluorescence in situ hybridization was developed to observe cells infected by gokushoviruses. The results presented in this dissertation greatly expand the known sequence space for ssDNA viruses, nearly doubling the number of complete available genomes, and revealing much greater genetic richness than previously thought. The vast diversity of ssDNA viruses in the sea and their similarity with viruses infecting eukaryotes is consistent with their role as significant pathogens of marine phytoplankton and microzooplankton.

Preface

This statement certifies that the work presented in this thesis was conceived and the chapters written by myself. As research advisor, Dr. Curtis Suttle was involved with conceptualization of the experiments, interpretation of the results, as well as with thesis editing and manuscript preparation. My research committee provided recommendations and general guidance. I collected the samples and analyzed the data, except as outlined in the following paragraphs.

The samples used in this study are part of a large collection of virus concentrates archived in the Suttle lab. For the samples from the Strait of Georgia and the Gulf of Mexico, current and past members of the Suttle lab collected the water, and then concentrated the viruses. For the samples from the Saanich Inlet, members of Dr. Steven Hallam’s lab collected the water and filtered it to remove the bacteria, and then past and current members of the Suttle lab, including myself, concentrated the viruses.
In chapter 5, the sample selection was done in collaboration with Danielle Winget (postdoctoral fellow in the Suttle lab). Alvin Tian (graduate student in the Suttle lab) built and maintained the bioinformatic pipeline for the deep-amplicon analysis that was used in part for data analysis.

In chapter 6, the fluorescence in situ hybridization experiments were designed in collaboration with Jan Finke (graduate student in the Suttle lab). Danielle Winget and Ricardo Cruz (visiting graduate student) also helped with the experimental design. Most of the experiments and optimization of the protocol were done in collaboration with Jan Finke. Flow cytometry counts and cell sorting were done with the help of Andy Johnson and Justin Wong at the UBC Flow Cytometry and Cell Sorting (FACS) Facility.

Table of contents

Abstract
Preface
Table of contents
List of tables
List of figures
List of symbols
List of abbreviations
Acknowledgements
Dedication
Chapter 1: Introduction
  1.1  Ecology of marine viruses
  1.2  ssDNA viruses
  1.3  ssDNA viruses in the environment
  1.4  Thesis objectives
Chapter 2: Previously unknown and highly divergent ssDNA viruses populate the oceans
  2.1  Synopsis
  2.2  Material and methods
  2.3  Results and discussion
Chapter 3: Evolutionary relationships between circular ssDNA viruses and their hosts
  3.1  Synopsis
  3.2  Material and methods
  3.3  Results and discussion
  3.4  Conclusions
Chapter 4: Genetic relatedness of gokushoviruses in marine environments
  4.1  Synopsis
  4.2  Material and methods
  4.3  Results and discussion
  4.4  Conclusion
Chapter 5: Amplicon deep sequencing of gokushoviruses in the Saanich Inlet
  5.1  Synopsis
  5.2  Material and methods
  5.3  Results and discussion
  5.4  Conclusions
Chapter 6: Determination of the marine host of gokushoviruses using fluorescence in situ hybridization
  6.1  Synopsis
  6.2  Material and methods
  6.3  Results and discussion
Chapter 7: Overview and outlook
  7.1  Recapitulation and significance of this work
  7.2  Method limitations and outlook
  7.3  Future directions
  7.4  Significance of this research
Bibliography
Appendices

List of tables

Table 1.1  The seven characterized families of ssDNA viruses.
Table 1.2  Sequenced ssDNA virus genomes.
Table 1.3  Conserved Rep motifs for known viral families and phytoplasmal plasmids.
Table 2.1  Representative genomes for each of the eleven new gene families.
Table 3.1  Eukaryotic species with ssDNA replication protein fossil sequences found within their genomes.
Table 5.1  Number of reads and OTUs after clustering at 95% similarity.
Table 5.2  Observed and estimated richness and diversity indices.
Table 5.3  Recovery of RFLP Saanich sequences within the amplicon deep sequencing database.
Table 6.1  Samples and probes used in the FISH experiments.

List of figures

Figure 1.1  Overview of the rolling-circle replication in ssDNA viruses.
Figure 1.2  Amino acid alignment of representative viral Rep sequences.
Figure 1.3  Analysis of the complete nucleotide sequence of CsalDNAV.
Figure 1.4  Virions of HRPV-1 are pleomorphic.
Figure 2.1  Geographic location of the samples used in this study.
Figure 2.2  Multidimensional scaling of the feature frequency profile of ssDNA viral isolates from the NCBI database.
Figure 2.3  Feature frequency profile analyses of ssDNA virus isolates from the NCBI database and genomes from this study.
Figure 2.4  BLAST comparison of all 4995 contigs.
Figure 2.5  BLAST comparison of the contigs from each dataset.
Figure 2.6  Feature frequency profile analyses of ssDNA virus isolates from the NCBI database and assembled genomes from this study.
Figure 2.7  Network representation of the BLAST comparisons of the environmental assembled genomes to the previously known ssDNA viruses and to other environmental genomes.
Figure 2.8  Relative percentage of contigs from each viral group proposed in this study.
Figure 2.9  Alignment of the rolling-circle replication protein.
Figure 2.10  Genetic relatedness of the rolling-circle replication protein.
Figure 2.11  Genetic relatedness of the rolling-circle replication protein (side view with labels).
Figure 2.12  Comparison of the reads from the ssDNA datasets from this study to other environmental datasets.
Figure 2.13  Comparison of the reads from the ssDNA datasets against each other.
Figure 3.1  A) Genetic relatedness of the rolling-circle replication protein used as a reference tree for the B) placement of fossil EVE sequences found inside eukaryotic genomes.
Figure 3.2  Alignment of the inferred amino-acid sequences for replication protein from terrestrial and marine nanoviruses, circoviruses, cycloviruses and caliciviruses.
Figure 3.3  A) Phylogeny and B) conserved motifs of the replication protein of nanovirus-like sequences.
Figure 3.4  A) Phylogeny and B) conserved motifs of the replication protein of circovirus-like sequences.
Figure 3.5  Evolutionary model of the origin of ssDNA viruses similar to nanoviruses and circoviruses.
Figure 4.1  Gokushoviruses share a similar genome organisation.
Figure 4.2  Regions of conservation in environmental gokushovirus genomes.
Figure 4.3  Genetic relatedness of the major capsid protein gene VP1.
Figure 5.1  Map of Saanich Inlet and cross-sectional depth profile.
Figure 5.2  Rank-abundance distribution of OTUs.
Figure 5.3  Clustering analysis from 95% to 60% similarity.
Figure 5.4  Rarefaction species richness curves.
Figure 5.5  Gokushovirus MCP clonal amplicons digested with the AluI restriction enzyme.
Figure 5.6  Genetic relatedness of the gokushovirus MCP from Saanich Inlet.
Figure 6.1  Gene-FISH principle.
Figure 6.2  Melting curve of the phiX174 probe to determine probe hybridization.
Figure 6.3  Lytic life cycle of phiX174 infecting its host E. coli C.
Figure 6.4  Epifluorescence microscopy of the hybridized E. coli cells.
Figure 6.5  Multiple alignment of sequences obtained via amplicon deep sequencing of the conserved region of major capsid protein.
Figure 6.6  Saanich Inlet negative control (no probe) filter as observed through epifluorescence microscopy with DAPI and Alexa488 filters.
Figure 6.7  Saanich Inlet positive control (16S rRNA probe) filter as observed through epifluorescence microscopy with DAPI and Alexa filters.
Figure 6.8  Saanich Inlet gokushovirus probe filter as observed through epifluorescence microscopy with DAPI and Alexa filters.
Figure 6.9  Flow cytometry counts of the sonicated filters.

List of symbols

%  percent
~  approximately
°  degree
°C  degree Celsius

List of abbreviations

aa  amino acid
aLRT  approximate likelihood-ratio test
ARC  Arctic ocean
BC  British Columbia
BLAST  basic local alignment search tool
bp  base pairs
CARD  catalyzed reporter deposition
CDS  coding region sequence
CTD  conductivity, temperature, depth instrument
contig  contiguous sequence
d.  diameter
DAPI  4′,6-diamidino-2-phenylindole
DNA  deoxyribonucleic acid
DIG  digoxigenin
dNTP  deoxyribonucleoside triphosphate
dsDNA  double-stranded DNA
EDTA  ethylenediaminetetraacetic acid
ENV  environmental
EVE  endogenous viral element
FFP  feature frequency profile
FISH  fluorescence in situ hybridization
g  gram
GC50  glass fiber filter grade GC50 (1.2 µm)
GOM  Gulf of Mexico
GVWP  polyvinylidene difluoride filter
HRP  horseradish peroxidase
h  hours
H2O  water
kb  kilobase
kDa  kilodalton
kPa  kilopascal
L  litre
LA  Lakes
LB  Luria Bertani broth
m  meter
M  molar
MCP  major capsid protein
MDA  multiple displacement amplification
MDS  multidimensional scaling
mg  milligram
MgCl2  magnesium chloride
min  minutes
mL  millilitre
mM  millimolar
N  North
Ns  ambiguous bases
NCBI  National Center for Biotechnology Information
ng  nanogram
nm  nanometre
nr  non-redundant
nt  nucleotide
NTP  nucleoside-triphosphate
ORF  open reading frame
OTU  operational taxonomic unit
PBS  phosphate-buffered saline
PCR  polymerase chain reaction
primer F  primer forward
primer R  primer reverse
PSI  pounds per square inch
PVDF  polyvinylidene fluoride filter (0.45 µm)
Rep  rolling-circle replication protein
RFLP  restriction fragment length polymorphism
RNA  ribonucleic acid
rRNA  ribosomal RNA gene
s  second
SAR  Sargasso Sea
SDS  sodium dodecyl sulfate
SI  Saanich Inlet
SOG  Strait of Georgia
ssDNA  single-stranded DNA
SSC  side scatter light
TE  tris EDTA
TEM  transmission electron microscope
TFF  tangential flow filtration
TSA  tyramide signal amplification
V  volt
v/v  volume per volume
VC  viral concentrate
w/v  weight per volume
µL  microlitre
µm  micrometre
µM  micromolar
WGA  whole genome amplification

Acknowledgements

It was a long way to the end, but I am grateful to each and every one who helped and supported me throughout this incredible journey. First of all, I would like to acknowledge my advisor, Dr. Curtis Suttle, who accepted me in his lab to work on this challenging and interesting project. I learned a lot from each of our discussions and I thank him for his guidance, expertise, support and encouragement.

I want to thank my wonderful labmates, past and present, especially Amy Chan, Christina Charlesworth, Caroline Chénard, Jessie Clasen, Ricardo Cruz, Jan Finke, Matthias Fischer, Manuela Gimenes, Julia Gustavsen, Jérôme Payet, Emma Shelford, Alvin Tian, Johan Vande Voorde, Marli Vlok, Danielle Winget, and Jennifer Wirth. They were colleagues with whom I had great scientific discussions, but also my friends. Life in Vancouver would not have been the same without them.

I would like to express my gratitude to my supervisory committee members Dr. François Jean, Dr. Steven Hallam and Dr. Rosie Redfield for their constructive comments and insightful criticism, which made me a better scientist. I am grateful to Curtis Suttle, Christina Charlesworth, Caroline Chénard, Jan Finke, Julia Gustavsen, Emma Shelford, Vera Tai, Danielle Winget, and Jennifer Wirth for proofreading and giving me valuable comments on manuscripts.
Too numerous to list are the people to whom I owe gratitude for various small and big favours, expert advice, and bioinformatic assistance. I want to sincerely thank my former master’s thesis advisor Sylvain Moineau for his continual guidance and advice.

I am deeply grateful to my family, who have always believed in me. They supported me, mentally and financially, through my decisions and allowed me to pursue my dreams, even if it meant flying away across Canada. Merci! Finally, I am infinitely grateful to Dany, who was always there for me, in good and in bad times throughout this adventure. One of many more to come…

During my degree, I received funding from the Natural Sciences and Engineering Research Council of Canada through a postgraduate scholarship and the John Richard Turner fellowship in microbiology.

Dedication

To my father, who inspired me to pursue my dreams, and my mother, who gave me the determination to achieve them.

Chapter 1: Introduction

A virus is an infective agent that typically consists of a nucleic acid molecule in a protein coat (called a capsid), is too small to be seen by light microscopy, and is able to multiply only within the living cells of a host. Viruses likely infect all living organisms, from unicellular organisms such as bacteria, archaea, and protists to complex animals and plants. Viruses infecting bacteria are usually called bacteriophages (or phages) and the term virus is generally used for viruses infecting eukaryotes. However, in the context of this dissertation, the term virus will be used to describe viruses infecting both prokaryotes and eukaryotes, notably when the host of the virus is unknown.

For a long time, viruses were thought to be unimportant and in low concentrations in marine environments. Very few (between 0.1 and 1%) marine microorganisms are culturable (Colwell 1997); therefore, few viruses infecting marine bacteria had been isolated.
By permitting the direct visualization of viruses, electron microscopy allowed for a breakthrough in marine virology, showing that viruses are present in high concentrations in marine environments (Torrella and Morita 1979; Bergh et al. 1989). Viral abundances are typically between 10^7 and 10^8 particles per milliliter of coastal seawater (Suttle 2005; Bergh et al. 1989), are correlated with primary productivity, and usually decrease with distance from shore and increasing depth (Suttle 2005). Viral concentrations are about one order of magnitude higher than those of their hosts, and viruses are the most abundant biological entities on Earth (Suttle 2005).

1.1  Ecology of marine viruses

Viruses have a multitude of different morphologies, from icosahedral to lemon-shaped, with or without a tail, and may or may not have a membrane (King et al. 2012). In marine systems, the most commonly isolated viruses are bacteriophages belonging to the Podoviridae (icosahedral capsid, short non-contractile tail), Siphoviridae (icosahedral capsid, long non-contractile tail) and Myoviridae (icosahedral capsid, contractile tail) (Suttle 2005). But even within the same morphotype, viruses have an immense genetic diversity (Hendrix 2003; Angly et al. 2006; López-Bueno et al. 2009). This section will focus on the importance of viruses in the environment, as well as on techniques that are used to investigate their biological and genetic diversity.

1.1.1  Importance of marine viruses

Viruses are substantial agents of mortality of heterotrophic and autotrophic plankton (Proctor and Fuhrman 1990). In fact, every second, approximately 10^23 viral infections occur in the ocean (Suttle 2007). These infections will affect global geochemical cycles and nutrient availability (Wilhelm and Suttle 1999; Fuhrman 1999), which will in turn affect microbial community composition.
Moreover, during infections, genetic material can be passed on via lateral gene transfer from a host to a virus, and vice versa, which allows viruses to manipulate or alter the evolution of their hosts.

1.1.1.1  Role in microbial community composition and diversity

1.1.1.1.1  Role of viruses in bloom termination

Virus infections can affect microbial communities in different ways. First, viral infections influence host community diversity by selective destruction of microbial strains that are at high concentration and undergoing fast growth: the “kill the winner” hypothesis (Thingstad 2000). Given the right conditions of temperature, light and nutrients, bacteria and protists can manage to escape their grazers to bloom (Falkowski et al. 1998; Field et al. 1998). It has been shown that viruses are implicated in the declines and changes in the clonal composition of microbial blooms. A well-studied example of the role viruses play in bloom termination is that of the coccolithophore Emiliania huxleyi (Jacquet et al. 2002; Sorensen et al. 2009; Martínez Martínez et al. 2007; Bratbak et al. 1993). Characterization of the host and viral communities revealed that a bloom is very dynamic, with a succession of different host and virus genotypes (Martínez Martínez et al. 2012; Schroeder et al. 2003), with only a few virus genotypes eventually killing the bloom completely (Schroeder et al. 2003). Similarly, viruses were also shown to terminate blooms of the algae Heterosigma akashiwo (Tarutani et al. 2000) and Phaeocystis globosa (Baudoux and Brussaard 2005), as well as blooms of bacterioplankton such as cyanobacteria (Middelboe et al. 2001; Fuhrman and Schwalbach 2003; Winter et al. 2004).
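
The “kill the winner” idea described above can be illustrated with a small numerical sketch. The model below is a toy Lotka-Volterra-type system written for this illustration only: it is not the formulation of Thingstad (2000), and every parameter value and name (simulate, r1, r2, K, d, a, c, m) is a hypothetical choice in arbitrary units. Two host strains compete for a shared resource, strain 1 grows faster, and each strain is lysed by its own strain-specific virus.

```python
"""Toy 'kill the winner' sketch: two competing hosts, each with a specific virus.

All parameter values are illustrative assumptions, not measured quantities.
"""


def simulate(with_virus, t_end=1000.0, dt=0.05):
    r1, r2 = 1.0, 0.5   # host growth rates; strain 1 is the fast-growing "winner"
    K = 1.0             # shared carrying capacity (resource competition)
    d = 0.1             # non-viral host losses (e.g. grazing)
    a = 1.0             # host lysis rate per unit of virus
    c = 2.0             # virus production per unit of host encountered
    m = 0.4             # virus decay rate

    def deriv(s):
        h1, h2, v1, v2 = s
        crowd = 1.0 - (h1 + h2) / K  # both strains draw on the same resource
        return (
            h1 * (r1 * crowd - d - a * v1),
            h2 * (r2 * crowd - d - a * v2),
            v1 * (c * h1 - m),       # each virus grows only on its own host
            v2 * (c * h2 - m),
        )

    def step(s):
        # One classical fourth-order Runge-Kutta step.
        k1 = deriv(s)
        k2 = deriv(tuple(x + 0.5 * dt * k for x, k in zip(s, k1)))
        k3 = deriv(tuple(x + 0.5 * dt * k for x, k in zip(s, k2)))
        k4 = deriv(tuple(x + dt * k for x, k in zip(s, k3)))
        return tuple(
            x + dt * (p + 2 * q + 2 * u + w) / 6
            for x, p, q, u, w in zip(s, k1, k2, k3, k4)
        )

    v0 = 0.01 if with_virus else 0.0
    state = (0.01, 0.01, v0, v0)     # (host 1, host 2, virus 1, virus 2)
    for _ in range(int(round(t_end / dt))):
        state = step(state)
    return state


if __name__ == "__main__":
    for label, flag in (("without viruses", False), ("with viruses", True)):
        h1, h2, v1, v2 = simulate(flag)
        print(f"{label}: host1={h1:.3f} host2={h2:.3f} virus1={v1:.3f} virus2={v2:.3f}")
```

In this sketch, the analytic fixed points are easy to read off: without viruses, strain 1 alone settles at K(1 - d/r1) and excludes the slower strain, whereas with viruses each host is held at m/c regardless of its growth rate, and the faster-growing host instead supports the larger virus population. This is only meant to convey the qualitative logic of selective lysis of the dominant strain; real bloom dynamics involve many more processes.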

1.1.1.2  Role in nutrient cycling

Viruses represent about 200 Mt of carbon (Wilhelm and Suttle 1999; Suttle 2005) and they are estimated to cause the lysis of 10 to 20% of bacterioplankton each day (Suttle and Chan 1994), converting the prokaryotic biomass into particulate and dissolved organic matter (POM and DOM) (Fuhrman 1999). Through this process, called the virus shunt, viruses catalyze nutrient cycling by stimulating DOM availability for heterotrophic bacteria (Wilhelm and Suttle 1999), thus limiting the direct transfer of microbial biomass to higher trophic levels (Wilhelm and Suttle 1999; Middelboe 2008; Middelboe et al. 2001; Fuhrman 1999; Middelboe and Lundsgaard 2003; Motegi et al. 2009) and accelerating biogeochemical processes (Danovaro et al. 2008).

Viruses play a major role in nutrient cycling and regeneration. For example, Shelford et al. (2012) showed that viruses play an important role in the marine nitrogen cycle. Virus-replete treatments showed increased ammonium concentrations compared to samples with bacteria only, suggesting that viral lysis supplies a significant portion of the global nitrogen requirements of phytoplankton (Shelford et al. 2012). Moreover, the addition of viruses to phosphorus-limited bacterial cultures increased bacterial growth through nutrient recycling (Middelboe et al. 1996). Iron (Fe) limits primary production in high-nutrient, low-chlorophyll (HNLC) regions of the open ocean (Boyd et al. 2000). The lysis of bacteria via viral infection is critical for Fe recycling, as it releases dissolved and particulate Fe, making it bioavailable to stimulate the growth of surviving hosts (Poorvin et al. 2004; Gobler et al. 1997).

1.1.1.3  Viruses as a reservoir of genetic diversity

It has been estimated based on assembly of contigs from metagenomic data that there are >5000 genetically distinct viruses in 100 L of seawater (Breitbart et al.
2002) and as many as 1 million in 1 kg of marine sediment (Breitbart, Felts, et al. 2004). This enormous pool of genetic diversity can potentially be exchanged with other viruses (Pedulla et al. 2003; Short and Suttle 2005) as well as with bacteria and other organisms (Kenzaka et al. 2010; Fuhrman and Schwalbach 2003; Sharon et al. 2011). These genes can have a direct effect on the virus host, or on the ability of the virus to infect it. The best known cases of horizontal gene transfer by temperate phages are in the context of pathogenesis, in which phages can carry genes encoding toxins and other factors involved in bacterial pathogenesis (Breitbart et al. 2005). For example, Vibrio cholerae, a common near-shore bacterium that is normally harmless, causes cholera after incorporating the phage cholera toxin (CTX) genes (Waldor and Mekalanos 1996).

Viruses also contain remnants of genes that appear to have been horizontally transferred from their hosts (Rohwer et al. 2000; Sullivan et al. 2005, 2009; Weigele et al. 2007). Genes involved in photosynthesis and host metabolism are widespread in marine viruses (Angly et al. 2006; Rohwer et al. 2000; Sullivan et al. 2005). For instance, the genes psbA and psbD, coding for the D1 and D2 proteins, respectively, of photosystem II, are common in photosynthetic organisms, but are also widely distributed among cyanophages (Lindell et al. 2004; Sullivan et al. 2006; Millard et al. 2004). These viral photosynthetic genes can be expressed during infection, and therefore likely have a functional role during the infection cycle (Clokie et al. 2006; Lindell et al. 2005). Another example of a metabolic gene found in marine phages is PhoH. PhoH belongs to the Pho regulon, which regulates phosphate uptake and metabolism under low-phosphate conditions (Hsieh and Wanner 2010), and is found in nearly 40% of sequenced marine phages (Goldsmith et al. 2011).
Primers were designed to look at the diversity of PhoH in marine phage communities (Goldsmith et al. 2011). Phylogenetic analysis showed that viral and bacterial PhoH sequences cluster in distinct clades, and that viral sequences cluster separately depending on whether the host is heterotrophic or autotrophic (Goldsmith et al. 2011). The cases of psbA/psbD and PhoH are examples of co-evolution, in which both the host and phage are functionally and genetically shaped by the acquisition of these genes.

1.1.2  Species diversity and genetic relatedness

Viruses are so diverse that they lack a common genetic marker, such as the genes encoding ribosomal RNA in cellular organisms. To overcome this problem, sequence analysis of representative genes, usually a replication protein or a capsid protein, within a viral group is used to examine the genetic diversity of related viruses. For example, homologues for structural genes present in T4-like enteric coliphages are also found in some marine Myoviridae phages (Fuller et al. 1998; Hambly et al. 2001). Degenerate primers were designed to amplify a fragment of the major capsid protein (g23) based on sequences from isolates (Tetart et al. 2001), and these were used to investigate environmental T4-like phage diversity, revealing five new groups of viruses distantly related to cultured viruses (Filée et al. 2005).

Cyanophages are viruses infecting cyanobacteria, including the globally important marine primary producers Prochlorococcus and Synechococcus (Scanlan et al. 2009; Partensky et al. 1999). Phylogeny of the portal protein (g20) of cyanophages belonging to the family Myoviridae showed that they are ubiquitous in marine and freshwater environments (Short and Suttle 2005; Sullivan et al. 2008). Similarly, primers were designed to look at the prevalence of the photosynthetic genes psbA and psbD in cyanophages (Sullivan et al. 2006; Chénard and Suttle 2008; Sandaa et al. 2008).
These genes are also commonly found, but the g20 and psbA phylogenies are not congruent, consistent with the evolution of phages through recombination and lateral gene transfer rather than mutations (Sobecky and Hazen 2009; Ignacio-Espinoza and Sullivan 2012). Replication proteins are conserved within viral families. For example, DNA polymerase B sequences from bacteriophages in the Podoviridae are widely distributed in the environment, but there are no clear biogeographic patterns in the diversity or genetic relatedness of sequences (Breitbart, et al. 2004; Labonté et al. 2009; Chen et al. 2009; Huang et al. 2010). In contrast, phylogenetic analysis of sequences for the replication protein of Phycodnaviridae viruses, a widely distributed group of viruses infecting eukaryotic phytoplankton (Chen and Suttle 1996; Chen, Suttle, and Short 1996; Short and Suttle 2002; Clasen and Suttle 2009), showed that freshwater phycodnaviruses are genetically distinct from their marine counterparts (Clasen and Suttle 2009; Short and Suttle 2002). Phycodnavirus DNA polymerase sequences cluster based on the host they infect, i.e. viruses infecting Micromonas sp. cluster together, while viruses infecting Cyclotella sp. cluster together, independently of the viruses infecting Micromonas sp. Because viruses cluster based on the host they infect, comparison of environmental sequences to sequences from isolates allowed hypotheses to be made about the hosts of environmental phycodnaviruses (Short et al. 2011; Clasen and Suttle 2009). Each of these studies revealed an immense and previously unknown diversity of marine viruses; however, they remain limited by targeting only small groups of viruses rather than entire viral communities.

1.1.3  Genetic diversity of viral communities

Examining the structure, diversity and evolution of the entire viral community requires a variety of approaches. 
Fingerprinting techniques have the advantage of being an inexpensive way to follow community changes over time. For example, techniques such as randomly amplified polymorphic DNA PCR (RAPD-PCR) (Winget and Wommack 2008), denaturing gradient gel electrophoresis (DGGE) (Clasen and Suttle 2009; Short and Suttle 1999) and pulsed-field gel electrophoresis (PFGE) (Sandaa and Larsen 2006) have all been used to examine seasonal and geographic variations in viral community structure. However, it is the cheaper and more accessible DNA sequencing technologies that have had the largest impact on studies of viral diversity by allowing metagenomic approaches that provide more in-depth and less biased surveys of the molecular diversity of entire viral communities. Metagenomics is the study of the genetic material recovered from an entire community. Virus communities have been sequenced from diverse environments such as marine ecosystems (Angly et al. 2006; Bench et al. 2007; Breitbart et al. 2002), the human gut (Zhang et al. 2006; Breitbart et al. 2008), and modern stromatolites (Desnues et al. 2008), showing that the oceans are dominated by dsDNA bacteriophages, which are more diverse than previously thought. Indeed, 60-90% of the sequences typically have no detectable similarity to sequenced viruses in databases. Furthermore, metagenomics has allowed the study of different subsets of viruses, based on their genetic material. For example, metagenomic libraries were made from RNA viruses from marine water (Culley et al. 2006), the human gut (Zhang et al. 2006) and plants (Roossinck et al. 2010). Hydroxyapatite columns can also be used to separate ssDNA, dsDNA and RNA for more in-depth studies (Andrews-Pfannkoch et al. 2010). Recently, in situ whole genome amplification using the φ29 DNA polymerase (Johne et al. 2009), which is biased towards short circular ssDNA (Angly et al. 2006), and sequencing techniques like 454 pyrosequencing (Margulies et al. 
2005), allowed the identification of ssDNA viruses in the environment (Angly et al. 2006; Desnues et al. 2008; Kim et al. 2008; López-Bueno et al. 2009). This dissertation therefore focuses on the diversity and evolution of ssDNA viruses in marine environments; the next sections briefly review what is currently known about these viruses.

1.2  ssDNA viruses

For the past few decades, ssDNA viruses have been studied because of their impact on the health of humans, other animals and plants. Indeed, ssDNA viruses are major pathogens of plants and animals, and cause diseases such as banana bunchy top disease (Dale 1987), porcine circovirus disease (Finsterbusch and Mankertz 2009), and fifth disease, caused by human parvovirus B19 (Servant-Delmas et al. 2009). Even though sequences with similarity to ssDNA viruses were recently observed in the environment (Angly et al. 2006; Desnues et al. 2008; López-Bueno et al. 2009), little is known about their diversity and the role they play in the oceans. This section gives an overview of the taxonomy, characteristics and lifestyle of ssDNA viruses, with emphasis on those commonly found in marine environments. This includes small circular ssDNA viruses containing replication proteins similar to those found in the Circoviridae, Nanoviridae and Geminiviridae families, as well as bacteriophages in the Microviridae family (see section 1.3 for more information on ssDNA viruses in marine environments).

1.2.1  Seven recognized families of ssDNA viruses

There are seven families of ssDNA viruses recognized by the International Committee on Taxonomy of Viruses (ICTV) (King et al. 2012). This classification is based on the host range and the type of ssDNA (segmented or non-segmented, positive-sense or negative-sense, circular or linear) composing the genome. 
Thus there are two families of ssDNA bacteriophages (Inoviridae and Microviridae) and five families of ssDNA viruses infecting eukaryotes (Nanoviridae and Geminiviridae infecting plants; Circoviridae, Parvoviridae, and Anelloviridae infecting animals), which are presented in Table 1.1. ssDNA viruses have small genomes ranging from 1.7 to 8.5 kb (Table 1.2). To date (May 6, 2013), there are 555 sequenced and annotated ssDNA viral genomes in the NCBI Genbank database. In recent years, many new ssDNA viruses have been discovered and sequenced. At the moment, they are classified based on their closest sequenced relative, but many of them still remain unclassified. Table 1.2 and Appendix A present the number of sequenced genomes as well as the genome organization of the type virus of each genus of each family.

Table 1.1  The seven characterized families of ssDNA viruses.
Data from the International Committee on Taxonomy of Viruses (ICTV) (King et al. 2012).

Microviridae
  Genome characteristics: Circular, positive-sense, 4400-5400 nt
  Morphology: Icosahedral, not enveloped, 26-32 nm d., with small surface projections
  Sub-families and genera: Microvirinae (Microvirus); Gokushovirinae (Spiromicrovirus, Bdellomicrovirus, Chlamydiamicrovirus)
  Host: Enterobacteria (Microvirinae); parasitic bacteria (Gokushovirinae)

Inoviridae
  Genome characteristics: Monopartite, circular, positive-sense, 4400-8500 nt
  Morphology: Filamentous, 7 nm d. and 700-2000 nm in length (Inovirus); rod-shaped, 10-16 nm d. and 70-280 nm (or more) long (Plectrovirus)
  Genera: Inovirus, Plectrovirus
  Host: Bacteria

Geminiviridae
  Genome characteristics: Segmented in 2 fragments (4800-5600 nt), or not segmented (2500-3000 nt), ambisense
  Morphology: Icosahedral, not enveloped, 38 nm long and 22 nm d., 22 capsomeres per nucleocapsid
  Genera: Mastrevirus, Curtovirus, Begomovirus, Topocuvirus
  Host: Plants

Nanoviridae
  Genome characteristics: Multipartite genome with 6 to 11 circular molecules of ~1 kb, positive-sense
  Morphology: Icosahedral, not enveloped, 18-19 nm d.
  Genera: Nanovirus, Babuvirus
  Host: Plants

Circoviridae
  Genome characteristics: Circular, ambisense, 1700-2300 nt
  Morphology: Icosahedral, not enveloped, 12-27 nm d.
  Genera: Circovirus, Gyrovirus
  Host: Birds and mammals

Parvoviridae
  Genome characteristics: Linear, negative-sense, ~4000-6000 nt
  Morphology: Icosahedral, not enveloped, 18-26 nm d., made of 60 units of the capsid protein
  Sub-families and genera: Parvovirinae (Parvovirus, Erythrovirus, Dependovirus, Amdovirus, Bocavirus); Densovirinae (Densovirus, Iteravirus, Brevidensovirus, Pefudensovirus)
  Host: Vertebrates (Parvovirinae); arthropods and other invertebrates (Densovirinae)

Anelloviridae
  Genome characteristics: Circular, positive-sense, ~3800 nt
  Morphology: Icosahedral, not enveloped, 30-32 nm d.
  Genera: Alphatorquevirus, Betatorquevirus, Gammatorquevirus, Deltatorquevirus, Epsilontorquevirus, Etatorquevirus, Iotatorquevirus, Thetatorquevirus, Zetatorquevirus
  Host: Mammals

Table 1.2  Sequenced ssDNA virus genomes.
Data from Genbank as of May 6, 2013. Each entry gives the genus, the number of sequenced genomes and, in parentheses, the type species.

Microviridae
  Microvirinae
    Microvirus: 7 (Enterobacteriophage ϕX174)
  Gokushovirinae
    Spiromicrovirus: 1 (Spiroplasma phage 4)
    Bdellomicrovirus: 1 (Bdellovibrio phage MAC 1)
    Chlamydiamicrovirus: 5 (Chlamydia phage 1)
  Unclassified: 2

Inoviridae
  Inovirus: 23 (Enterobacteriophage M13)
  Plectrovirus: 5 (Acholeplasma phage virus MVL51)
  Unclassified: 4

Geminiviridae
  Mastrevirus: 32 (Maize streak virus)
  Curtovirus: 8 (Beet curly top virus)
  Begomovirus: 292 (Bean golden yellow mosaic virus)
  Topocuvirus: 1 (Tomato pseudo-curly top virus)
  Unclassified: 8

Nanoviridae
  Nanovirus: 4 (Subterranean clover stunt virus)
  Babuvirus: 2 (Banana bunchy top virus)
  Unclassified: 1

Circoviridae
  Circovirus: 19 (Porcine circovirus 1)
  Gyrovirus: 5 (Chicken anemia virus)
  Unclassified: 6

Parvoviridae
  Parvovirinae
    Parvovirus: 13 (Minute virus of mice)
    Erythrovirus: 2 (Human parvovirus B19)
    Dependovirus: 15 (Adeno-associated virus-2)
    Amdovirus: 1 (Aleutian mink disease)
    Bocavirus: 12 (Bovine parvovirus)
    Unclassified: 2
  Densovirinae
    Densovirus: 6 (Junonia coenia densovirus)
    Iteravirus: 5 (Bombyx mori densovirus)
    Brevidensovirus: 6 (Aedes aegypti densovirus)
    Pefudensovirus: 4 (Periplaneta fuliginosa densovirus)
    Unclassified: 3

Anelloviridae
  Alphatorquevirus: 17 (Torque teno virus 1 isolate TA278)
  Betatorquevirus: 9 (Torque teno mini virus 1)
  Gammatorquevirus: 2 (Torque teno midi virus 1)
  Deltatorquevirus: 0 (Torque teno tupaia virus)
  Epsilontorquevirus: 1 (Torque teno tamarin virus)
  Etatorquevirus: 1 (Torque teno felis virus)
  Iotatorquevirus: 2 (Torque teno sus virus 1)
  Thetatorquevirus: 1 (Torque teno canis virus)
  Zetatorquevirus: 1 (Torque teno douroucouli virus)
  Unclassified: 5

Unclassified ssDNA viruses: 21

1.2.1.1  Microviridae

Bacteriophages belonging to the Microviridae consist of a ~30 nm icosahedral capsid containing a positive-sense ssDNA molecule of 4.5 to 5.1 kb that encodes 8 to 11 proteins. 
Based on the phylogeny of the major capsid protein (VP1) from isolates, the Microviridae family is divided into two subfamilies (Brentlinger et al. 2002): the Microvirinae contains phages infecting enterobacteria, like phiX174 and G4 (Escherichia coli) (Godson et al. 1978), while the Gokushovirinae includes those infecting parasitic bacteria such as Chlamydia [Chp1 (Storey et al. 1989), Chp2 (Everson et al. 2002; Liu et al. 2000), Chp3 (Garner et al. 2004)], Bdellovibrio [phiMH2K (Brentlinger et al. 2002)] and Spiroplasma [SpV4 (Chipman et al. 1998)]. This section describes the life cycle of ssDNA phages from the Microviridae. Only sequences similar to those of the Gokushovirinae sub-family have been observed in marine environments, but characterization of the life cycle of these viruses has been limited because their known hosts are parasitic microorganisms. However, there has been extensive research on the ssDNA coliphage phiX174, which is used here as an example to present the life cycle.

1.2.1.1.1  Lifestyle

Microvirinae and Gokushovirinae phages are lytic and their replication requires at least five viral proteins: two coat proteins (VP1 and VP2), a scaffolding protein (VP3), a replication protein (VP4), and a DNA packaging protein (VP5). First, a phage particle binds to a receptor on the host cell. The viral protein involved in host recognition and binding is the minor capsid protein VP2 (Chipman et al. 1998). While the host receptor remains unknown for gokushoviruses (Everson et al. 2003), it has been demonstrated that the microvirus phiX174 binds to the lipopolysaccharide (LPS) of its host, E. coli C (Feige and Stirm 1976). Once inside the cell, Microviridae phages take over the host proteins and resources to replicate. 
Once the viral DNA is injected into the host, the ssDNA genome is converted by the host polymerase into a replicative form (called RF), which consists of a covalently closed dsDNA molecule. The RF serves as a template from which viral genes are transcribed by the host RNA polymerase, with the replication initiator (VP4) being among the early transcribed genes. DNA replication occurs via the rolling-circle mechanism using the replicase of the host and the phage replication initiator VP4 (described in section 1.2.2) (Sait et al. 2011). The late viral genes include those coding for the capsid, which is assembled in the cytoplasm. The capsid consists of 60 copies of the major capsid protein (VP1) and 12 copies of the minor protein (VP2). The scaffolding protein (VP3) is found in the procapsid, but not in the mature virion (Garner et al. 2004; Liu et al. 2000; Salim et al. 2008). Chipman and colleagues (1998) determined the structure of the capsid of the spiromicrovirus SpV4 using cryo-electron microscopy and found 'mushroom-like' protrusions with hydrophobic depressions on its surface. These protrusions, formed by VP2, may be involved in host recognition (Chipman et al. 1998). Finally, VP5 binds to the replication complex, inducing packaging of the newly synthesized DNA into the procapsids (Clark et al. 2004). The mechanism by which only the positive-sense ssDNA is encapsidated is not well known for Microviridae phages. Procapsids mature in the host cytoplasm and the mature virions are released by cell lysis.

1.2.1.2  Inoviridae

Members of the Inoviridae are the only known ssDNA viruses without an icosahedral capsid. Instead, they are rod-shaped (Plectrovirus) or filamentous (Inovirus). They also have the largest ssDNA genomes, varying from 4.4 to 8.5 kb in length. These viruses infect bacteria without causing lysis, and the infected cells continue living while producing viruses (King et al. 2012). 
Inoviruses are also associated with virulence factors. Indeed, the integration of the phage CTX into the genome of its host Vibrio cholerae is associated with the production of the cholera toxin (Davis 2003).

1.2.1.3  Geminiviridae

Viruses from the Geminiviridae, mostly members of the genus Begomovirus, have a considerable agricultural impact. More than half of the sequenced ssDNA virus genomes (292 of 555 genomes; Table 1.2) are from begomoviruses infecting plant crops. Begomoviruses are very diverse, whitefly-transmitted viruses that include some of the most destructive plant viruses in tropical and sub-tropical regions (Seal et al. 2006). For example, potato yellow mosaic virus (PYMV), which was first seen in potato plants in 1986, also causes great damage to tomato crops in Central and South America (Urbino et al. 2004). Most begomoviruses have bipartite genomes, termed DNA A and DNA B. The DNA A component encodes proteins required for viral DNA replication and encapsidation, whereas DNA B encodes two proteins that are essential for systemic movement within the infected plant. A small number of begomoviruses have a monopartite DNA genome that resembles the DNA A of bipartite begomoviruses. This DNA carries all gene functions for replication and pathogenesis (Briddon and Stanley 2006; King et al. 2012). Small circular ssDNA satellites containing a single open reading frame (ORF), termed DNA β, have been found associated with certain monopartite begomovirus infections. Satellites are defined as viruses or nucleic acids that depend on a helper virus for their replication, but lack extensive nucleotide sequence homology to the helper virus and are dispensable for its proliferation (King et al. 2012; Briddon and Stanley 2006).

1.2.1.4  Nanoviridae

Nanoviridae is the other family of plant-infecting ssDNA viruses. 
For example, the banana bunchy top virus (BBTV) ravages banana farms in Southeast Asia, the Philippines, Taiwan, most of the South Pacific islands, and parts of India and Africa. Once established, BBTV is extremely difficult to eradicate or manage (Dale 1987). There are fewer sequenced genomes of Nanoviridae viruses due to their more complicated genome organization. Nanovirus genomes are multipartite and consist of six to eight circular ssDNA molecules of about 1 kb in size, each coding for one gene (Gronenborn 2004). Each viral genome includes a master replication initiator (M-Rep), encoded by the DNA-R molecule. M-Rep interacts with conserved sequences on all genomic DNAs of the virus (Timchenko et al. 1999, 2000; Grigoras et al. 2008), and its action is enhanced by Clink (DNA-C), a virus-encoded cell cycle modulator protein (Aronson et al. 2000). The other genome molecules encode a capsid protein (DNA-S) (Wanitchakorn et al. 1997), a movement protein (DNA-M), a nuclear shuttle protein (DNA-N) (Wanitchakorn et al. 2000), as well as proteins of unknown function (DNA-U1, -U2, -U3 and -U4) (Gronenborn 2004).

1.2.1.5  Circoviridae

Circoviruses primarily infect diverse bird species including parrots, pigeons, gulls, ducks, starlings, ravens, and swans (Niagro et al. 1998; Mankertz et al. 2000; Todd et al. 2007; Hattermann et al. 2003; Johne et al. 2006; Stewart et al. 2006; Halami et al. 2008). To date, only two circoviruses that replicate in mammals have been extensively studied, Porcine circovirus 1 and 2 (PCV1 and PCV2) (Allan and Ellis 2000; Finsterbusch and Mankertz 2009). While PCV1 is non-pathogenic, PCV2 is associated with many diseases, including postweaning multisystemic wasting syndrome (PMWS), which results in significant depletion of lymphocytes and death (Finsterbusch and Mankertz 2009), causing massive economic losses in the pig industry (Finsterbusch and Mankertz 2009; Grau-Roma et al. 2011; Segalés et al. 
2007; Chae 2005; Opriessnig et al. 2007). With genomes of approximately 2 kb in length, viruses from the genus Circovirus have the smallest genomes of viruses infecting eukaryotes. They encode only two inversely arranged major open reading frames, rep (replication initiator) and cap (capsid protein), which are involved in copying and packaging of the viral genome (Appendix A) (Finsterbusch and Mankertz 2009).

1.2.1.6  Parvoviridae

The only ssDNA viruses known to cause disease in humans belong to the Parvoviridae. For example, parvovirus B19 causes fifth disease, which is associated with rashes in children from 5 to 15 years old (Brown 2010). Parvoviridae viruses are the only ssDNA viruses that have a linear genome. The 3' region of parvovirus genomes contains an inverted terminal repeat (ITR), which forms a hairpin structure to initiate DNA replication (Gonçalves 2005) by a slightly modified version of rolling-circle replication called rolling-hairpin replication (Tattersall and Ward 1976). Viruses from the genus Dependovirus have been mostly studied for their use as vaccine vectors, because of their ability to transduce human cells. These adeno-associated viruses have great potential for use in gene therapy (Smith 2008; Büning et al. 2008).

1.2.1.7  Anelloviridae

Previously grouped with the Circoviridae, Anelloviridae viruses have only recently been classified as a separate family. Anelloviruses infect a wide range of animals, from humans to birds (Hino and Miyata 2007). They differ from other viruses in that they are found in the blood of most humans, but are not linked to any signs of pathogenicity (Biagini and de Micco 2008).

1.2.2  Replication of ssDNA viruses

All ssDNA viruses and some bacterial plasmids use rolling-circle replication (RCR) (Timchenko et al. 1999; Gutierrez 1999; Cheung 2011; Delwart and Li 2011; Meehan et al. 1997; del Solar et al. 1998) (Figure 1.1). 
Following infection, a double-stranded DNA genome is generated by a cellular DNA polymerase 1 using an RNA primer synthesized in the host (Saunders et al. 1992). This dsDNA, called RF (for replicative form), is used as a template for replication. The initiation of RCR involves binding of the replication complex to particular sequence elements associated with the replication origin. In circoviruses, these sequence elements consist of stem-loop structures with a conserved 9-nt motif in the loop, called the nonanucleotide motif, located between the 5'-ends of the two main inversely encoded ORFs (King et al. 2012; Rosario et al. 2012). The replication complex (also called RCR initiator) consists of two proteins, Rep and Rep', both derived from the same transcript, but with different carboxy termini (Grigoras et al. 2008). The replication complex binds the stem-loop and introduces a nick in the plus strand. Then, a host-encoded DNA polymerase extends the 3' hydroxyl to copy the complementary negative-strand circle while the replication complex stays covalently attached to the 5' end of the original plus strand (Steinfeldt et al. 2001, 2007; Faurez et al. 2009; Laufs et al. 1995). After one round of polymerization, new binding and nick sites for the replication complex are generated, and termination takes place by cleavage of the newly synthesized strand and simultaneous ligation of the 5'- and 3'-ends of the parental plus strand still linked to the replication complex (Figure 1.1) (Heyraud-Nitschke et al. 1995; Laufs et al. 1995; Londoño et al. 2010). There is a great diversity of RCR initiator proteins found in viruses and plasmids of bacteria and eukaryotes, but most of them share conserved sequence motifs. These motifs, arranged in a characteristic way, define a large superfamily of Rep proteins (Figure 1.2) (Ilyina and Koonin 1992; Koonin and Ilyina 1992; Faurez et al. 2009; Mankertz et al. 1998). 
These RCR initiators have five signature sequences in common (Table 1.3 and Figure 1.2). The functions of motifs I and V are unknown. Motif II is involved in the initiation and termination of rolling-circle DNA replication, possibly by binding Mg or Mn ions that are required for Rep's catalytic activity (Ilyina and Koonin 1992). Motif III contains the active-site tyrosine that attaches covalently to DNA (Ilyina and Koonin 1992; Koonin and Ilyina 1992). Finally, motif IV, a Walker A motif, is a putative NTP-binding site (Mankertz et al. 1998).

Figure 1.1  Overview of the rolling-circle replication in ssDNA viruses.
After virus decapsidation (A), the ssDNA genome is converted into dsDNA by primer extension through host factors (B). During initiation of replication, the viral replication initiator protein (Rep; light green blob) binds to the replicative form near the origin of replication (C), presumably causing partial melting of the dsDNA (D). Partial melting of the replicative form may lead to a cruciform extrusion that exposes the virion strand (thick line) nick site in single-stranded form (D: scissors; E). Rep introduces a nick (F), which produces a primer for leading-strand synthesis on the 3'-OH end by host DNA polymerases, while the 5' end is covalently attached to the Rep (G). During elongation, the old virion strand (thick line) is displaced by the newly synthesized positive strand (dashed line) (H). RCR terminates when Rep catalyzes a joining reaction between the ends of the newly synthesized positive strand and releases the old virion strand (H). Image adapted from Rosario et al. (2012).

Figure 1.2  Amino acid alignment of representative Rep sequences from viruses [M=Microviridae (pink), G=Geminiviridae (blue), C=Circoviridae (red), N=Nanoviridae (green)] and phytoplasmal plasmids [P (yellow)].
The five conserved Rep motifs are boxed in black and numbered I to V.
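The lower-case codes used in the motif patterns of Table 1.3 (x, h and a) map naturally onto regular-expression character classes, so a Rep-like protein can be screened for a given motif with a few lines of code. The following is a minimal Python sketch of that idea, not the procedure used in the cited studies. The function names, the set of residues treated as "hydrophobic" and the toy sequence are assumptions made for illustration only.

```python
import re

# Character classes for the lower-case codes in the Table 1.3 legend.
# The membership of "h" is an assumption: the legend only says
# "non polar, hydrophobic" and does not list the residues.
RESIDUE_CLASSES = {
    "x": ".",          # any amino acid
    "h": "[AVILMFW]",  # non-polar, hydrophobic (assumed set)
    "a": "[DE]",       # polar, acidic
}


def motif_to_regex(motif: str) -> str:
    """Translate a motif such as 'GxxHhQGF' into a regular expression."""
    return "".join(RESIDUE_CLASSES.get(ch, re.escape(ch)) for ch in motif)


def find_motif(motif: str, protein: str) -> list[tuple[int, str]]:
    """Return (0-based start, matched residues) for each non-overlapping hit."""
    pattern = re.compile(motif_to_regex(motif))
    return [(m.start(), m.group()) for m in pattern.finditer(protein.upper())]


# Hypothetical fragment invented for this example; it is not a real Rep sequence.
toy_rep = "MSKRWCFTLNNPTEEGTPHLQGFANFVKAAYCSKEG"

for motif in ("FTLNN", "GxxHhQGF", "YCSK"):
    print(motif, find_motif(motif, toy_rep))
```

Upper-case letters are matched literally, so a fully specified motif such as FTLNN only matches itself, while a degenerate pattern such as GxxHhQGF also matches variants like GTPHLQGF. A real screen would need a more careful choice of residue classes and some tolerance for mismatches.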
Table 1.3  Conserved Rep motifs for known viral families and phytoplasmal plasmids.

Geminiviridae  Motif 1 FTLNN FTINN FTLNN FTLNY FTLNF FTLNN -  Parvoviridae Microviridae  -  Phytoplasmal plasmids  -  Circoviridae Cycloviruses Nanoviridae  Motif 2 GTPHLQGF GTPHLQGY GTPHLQGF GxxHhQGF GxxHhQGY xxxHhQGY GxPHLHAL GxPHLHVL xxxHIHVV xRxHYHLL xRxHWHMI xxxHxHVL  Motif 3 YCSK YChK YCSK YCSK YxxK  Motif 4 GKS  YhaK  GKT  -  -  YICK YIKK  GKT  GKS GKS GKT  Motif 5 VVIDD VhhDD VIIDD NVIDD NIIDD IGFDD NIFDD

h: non polar, hydrophobic; x: any amino acid; a: polar, acidic; -: motif absent; red: conserved in all sequences

1.2.3  Origin and evolution of ssDNA viruses

1.2.3.1  Replication proteins and the origin of ssDNA viruses

Multiple observations suggest that Rep-containing ssDNA viruses share a common ancestor. First, a similar RCR replication strategy is employed by viruses from the Circoviridae, which infect animals, and the Geminiviridae and Nanoviridae, which infect plants, as well as by some bacterial plasmids (Faurez et al. 2009; Krupovic et al. 2009; Koonin and Ilyina 1992). Phylogenetic analyses led to the hypothesis that Circoviridae viruses originated from Nanoviridae viruses (Gibbs and Weiller 1999), as did the Geminiviridae viruses (Briddon et al. 2004; Briddon and Stanley 2006). Second, the sequence similarity between the replication proteins of members of the Geminiviridae family and those of plasmids, notably plasmids in the bacterium Phytoplasma and the red alga Porphyra (Krupovic et al. 2009; Nawaz-ul-rehman and Fauquet 2009; Nash et al. 2011), raises the question of whether Geminiviridae or plasmids came first. Because RCR is much more widespread in bacteria than in eukaryotes, it is hypothesized that eukaryotic viruses evolved from bacterial replicons (Ilyina and Koonin 1992; Koonin and Ilyina 1992). 
Additionally, the phylogenetic closeness of the Reps of plant Geminiviridae viruses and plasmids of wall-less, plant-infecting bacteria led to the proposal that a phytoplasmal plasmid may have acquired a capsid-encoding gene to give rise to the ancestor of the Geminiviridae (Krupovic et al. 2009). On the other hand, Saccardo et al. (2011) propose that phytoplasmal plasmids may have acquired a Rep gene by horizontal transfer from a geminivirus.

1.2.3.2  Horizontal gene transfer

There is substantial evidence of horizontal gene transfer from ssDNA viruses to their hosts. For instance, replication proteins similar to those from ssDNA viruses are found in genomes of eukaryotes and prokaryotes. For example, "fossil" genomes from members of the Parvoviridae and Circoviridae, called endogenous viral elements (EVEs), are integrated into the chromosomes of various animals, yielding estimates that these viruses replicated in animals at least 30 to 50 million years ago (Katzourakis and Gifford 2010; Belyi et al. 2010). Rep genes are also found in the chromosomes of the parasitic protozoa Giardia duodenalis and Entamoeba histolytica (Franzén et al. 2009), suggesting that circovirus-like viruses also replicated in these hosts. Similarly, geminiviral DNA occurs in the genomes of tobacco plants (Bejarano et al. 1996). Recently, Liu et al. (2011) confirmed that ssDNA virus-like sequences in host species were broadly distributed in four of the five supergroups of eukaryotes (Keeling et al. 2005), namely Unikonta (including animals, fungi and Entamoeba), Plantae (including land plants and green algae), Chromalveolata (including diatoms and Blastocystis) and Excavata (including Giardia). The integration of ssDNA virus genes is possibly due to the DNA binding, cutting and ligating activity of the RCR protein (Gibbs et al. 2006). Horizontal gene transfer may also have played a role in the origin of ssDNA viruses. 
It is hypothesized that circoviruses may have originated through recombination between unrelated viruses. This is supported by the fact that the N-terminal region of the Rep protein is most closely related to that encoded by segmented genomes of the Nanoviridae (infecting plants), while the C-terminal region is related to the RNA-binding helicase 2C of positive-strand RNA picorna-like viruses (Gibbs and Weiller 1999).

1.2.3.3  Capsid protein

Capsid proteins are in general less conserved than replication proteins. Even when sequences are very divergent, the structure of the folded proteins can be conserved. Thus, while no sequence similarity was detectable between Geminiviridae capsid proteins and other proteins, their predicted structure was most like that of a capsid protein from a ssRNA plant virus (satellite tobacco necrosis virus) (Krupovic et al. 2009). The complete structures of two geminiviruses, maize streak virus (25 Å) (Zhang et al. 2001) and African cassava mosaic virus (16-19 Å) (Böttcher et al. 2004), and of the circovirus Porcine circovirus 2 (2.3 Å) (Khayat et al. 2011) were determined by cryo-electron microscopy and three-dimensional image reconstruction. The three viruses harbour a T=1 icosahedral capsid structure made from a protein with a canonical viral jelly roll structure (eight-stranded beta-barrel fold). This may also reflect a common evolutionary path.

1.3  ssDNA viruses in the environment

The discovery of ssDNA viruses from environmental samples is recent. Very few viruses have been isolated, and those that have are distinct from those previously described. Most of the knowledge we have about ssDNA viruses in the environment is based on metagenomic data, which often contain viral sequences similar to viruses from the Microviridae and Circoviridae. This section describes our current understanding of environmental ssDNA viruses, with special attention to differences and similarities to previously known viruses.
1.3.1  Isolation of ssDNA viruses

1.3.1.1  Isolation of viruses infecting diatoms

Nagasaki and colleagues (2005) described the first ssDNA virus isolated from a protist: Chaetoceros salsigineum DNA virus (CsalDNAV), which replicates inside the nucleus of its diatom host. CsalDNAV's 6005-nt genome has a partially double-stranded region of 997 nt (Figure 1.3). It is hypothesized that this double-stranded region has a function in the initiation of replication once the DNA is ejected into the cell (Nagasaki et al. 2005). The genome encodes six ORFs, with ORF1 having a low similarity to some circovirus replication proteins (Nagasaki et al. 2005), and ORF3 coding for a structural protein (Park et al. 2009).

Figure 1.3  Analysis of the complete nucleotide sequence of CsalDNAV.
Six ORFs were identified in CsalDNAV. The thick arrow indicates the beginning of the CsalDNAV nucleotide sequence. The gray region in the genome represents the dsDNA. The position and orientation of each ORF is indicated by thin black arrows and numbers [Figure from Park et al. (2009)].

Diatoms are particularly abundant in well-mixed coastal and upwelling regions, but are also found within the photic zone of the open ocean, between the well-lit surface layers and the nutrient-rich deep chlorophyll maximum (Bowler, Vardi, et al. 2010). They constitute the most abundant and diversified group of marine eukaryotic phytoplankton (Kooistra et al. 2007). There are an estimated 200,000 diatom species, which are traditionally classified based on their morphology (Bowler, De Martino, et al. 2010). Diatoms play an important role in the carbon cycle; they are estimated to be responsible for about 40% of the net primary productivity in the ocean (Field et al. 1998). Moreover, they are major contributors to the biological carbon pump, through which carbon is exported to the ocean interior (Jin et al. 2006; Armbrust 2009; Bowler, De Martino, et al. 2010). 
Recently, another ssDNA virus, Chaetoceros lorenzianus DNA virus (ClorDNAV), with a genome organization similar to that of CsalDNAV, was described (Tomaru et al. 2011). These viruses have been assigned to the genus Bacillariodnavirus.

1.3.1.2  Isolation of a ssDNA virus infecting Archaea

Recently, a ssDNA virus, Halorubrum pleomorphic virus 1 (HRPV-1), infecting the halophilic archaeon Halorubrum was isolated from a solar saltern in Sicily (Pietilä et al. 2009). This virus is the only described ssDNA archaeal virus. The genome encodes nine ORFs, with at least three of them encoding part of the virion structure (Pietilä et al. 2009). The virion is pleomorphic and consists of a lipid membrane decorated with glycoprotein spikes (Figure 1.4) (Pietilä et al. 2010). Surprisingly, HRPV-1 is most closely related to HHPV-1 (Haloarcula hispanica pleomorphic virus 1), a double-stranded DNA virus (Roine et al. 2010). While both of these viruses have a pleomorphic shape, the difference resides within the replication proteins, indicating that the type of DNA in the genome depends on the mode of replication of the virus rather than on the shape and structure of the virion.

Figure 1.4  Virions of HRPV-1 are pleomorphic.
A. Micrographs of negatively stained particles observed by transmission EM in a typical '2× purified' virus preparation fixed with 2.5% (v/v) glutaraldehyde and stained with 3% (w/v) ammonium molybdate. Scale bars represent 100 nm in the upper and 50 nm in the lower panel. B. Schematic representation of the HRPV-1 particle [from Pietilä et al. (2009)].

1.3.1.3  Isolation of marine ssDNA viruses infecting bacteria

Holmfeldt et al. (2012) isolated eight ssDNA phages infecting bacteria belonging to the Bacteroidetes phylum. These phages were divided into two groups: those with genomes too large to be accurately measured with a ssDNA ladder (> 10 kb), and those resembling members of the Microviridae, with genomes of ~5 kb. 
The genomes of these phages were visible only when stained with SYBR Gold. Because of their ssDNA nature and generally small genome size, accurate enumeration of ssDNA viruses from natural environments remains a challenge (Holmfeldt et al. 2012). The genome sequences of these new phages, which are not yet available, will bring new information on environmental ssDNA viruses.

1.3.2  Metagenomic analysis of ssDNA viruses

Recently, the polymerase of phage phi29 infecting Bacillus subtilis was used to amplify small quantities of DNA using multiple displacement amplification (MDA) (Dean et al. 2001). This method has been useful for studying viral and bacterial communities. Since this method also amplifies ssDNA, it allowed the discovery of new ssDNA viruses in metagenomic samples, which could not be sequenced using conventional sequencing methods that require a dsDNA template. To date, the observed ssDNA viruses are mostly those with small circular genomes similar to those found in the families Microviridae (mostly subfamily Gokushovirinae) and Circoviridae.

1.3.2.1  Phages from the Gokushovirinae sub-family

Sequences from ssDNA phages have been found in marine metagenomes, with Gokushovirinae viruses being most common. The metagenomic study of viromes from the Strait of Georgia (SOG), the Gulf of Mexico (GOM), the Arctic Ocean (ARC) and the Sargasso Sea (SAR) first revealed the presence of Gokushovirinae sequences in marine systems (Angly et al. 2006). They were particularly abundant in the SAR sample; 6% of the sequences were similar to the phage Chp1 that infects Chlamydia psittaci, which allowed for the assembly of the first two environmental Gokushovirinae genomes, with the help of PCR amplification (Angly et al. 2006; Tucker et al. 2011). A survey in the Atlantic Ocean showed a different depth distribution of the two genomes, suggesting that these viruses infect different hosts. 
Finally, these two genomes were used as templates to design primers that amplified a region of the replication initiator (ORF4 or Rep). Phylogenetic analyses of the amplified fragments revealed that ssDNA phages have different geographic distributions, with the genetic distance of gokushovirus sequences increasing with geographic distance (Tucker et al. 2011), which was also observed through viral metagenomics (Angly et al. 2006). Phages of the Gokushovirinae subfamily have also been found in marine and freshwater stromatolites (Desnues et al. 2008), as well as in an Antarctic lake (López-Bueno et al. 2009). While it is commonly thought that Microviridae phages are strictly lytic (Garner et al. 2004; Liu et al. 2000; Salim et al. 2008), a recent in silico study by Krupovic and Forterre (2011) found remnants of viruses with a genome organization similar to that of gokushoviruses in the genomes of bacteria in the Bacteroidetes. These bacteria are found in the human gut and mouth, which suggests that these phages are associated with these environments. The temperate Microviridae phages have the same genome organization as the gokushoviruses, but are phylogenetically distinct from them (Krupovic and Forterre 2011). Krupovic and Forterre (2011) proposed a new sub-family within the Microviridae: the Alpavirinae.

1.3.2.2  Circovirus-like Rep containing sequences

The only conserved gene among ssDNA viruses encodes the replication initiator protein (Rep). Looking for this gene in environmental datasets (Angly et al. 2006) allowed for the assembly of ten circovirus-like genomes (Rosario, Duffy, et al. 2009). While the organization of most of these genomes is similar to that of circoviruses, with two genes in opposite directions, some are different, with either more genes or two genes in the same direction. Most of these genomes encode the replication initiator and a coat protein. 
The very low similarity of the coat proteins indicates a high diversity and suggests a wide range of hosts for these viruses (Rosario, Duffy, et al. 2009). There are no described aquatic circoviruses, but they most likely infect unicellular protists. For example, the comparison of two viral metagenomic datasets from Lake Limnopolar in Antarctica revealed a predominance of sequences related to eukaryotic viruses, including phycodnaviruses and circoviruses, in the spring when the lake is ice-covered and dominated by photosynthetic and heterotrophic protists. On the other hand, a larger proportion of dsDNA bacteriophages was present in the summer, when the water is open and dominated by bacteria (López-Bueno et al. 2009). More recently, single-cell sequencing of unicellular algae of the Picobiliphyta phylum allowed for the sequencing of a nanovirus-like genome. One out of the three samples was completely dominated by a small circular DNA molecule containing a Rep gene, suggesting that the cell was infected when sequenced. The other two sequenced cells did not show any ssDNA viral sequences, but only unassembled reads similar to bacterial and viral sequences, which indicates the heterotrophic behavior of picobiliphytes (Yoon et al. 2011). While most ssDNA sequences have been detected when DNA was amplified using MDA, circovirus-like sequences were also observed in samples from the Chesapeake Bay, where the ssDNA was separated from the dsDNA using a hydroxyapatite column (Andrews-Pfannkoch et al. 2010). Circovirus sequences have also been found in other environmental samples including freshwater lakes (Roux et al. 2012), sewage (Blinkova et al. 2009), reclaimed water (Rosario, Nilsson, et al. 2009) and rice paddy soil (Kyoung-Ho Kim et al. 2008). Moreover, circovirus-like sequences have been observed in tissues (Li, Victoria, et al. 2010; Li et al. 2011), animal feces (Li, Kapoor, et al. 2010; Ge et al. 2007), fish (Lorincz et al. 
2011), and insects (Ng, Willner, et al. 2011; Ng, Duffy, et al. 2011; Rosario et al. 2011). Therefore, our understanding of the genetic diversity of circoviruses and distantly related families of viruses containing small ssDNA circular genomes harbouring a Rep gene is limited, but rapidly increasing.

1.3.2.2.1  Cycloviruses: a new genus within the Circoviridae family

Cycloviruses are circovirus-like sequences that were found in metagenomic datasets from samples of chimpanzee stools (Li, Kapoor, et al. 2010) and infant feces (Victoria et al. 2009). Cycloviruses are distinguishable from circoviruses based only on a few subtle genomic characteristics. Like circoviruses, cycloviruses have small circular ambisense (may be both positive- and negative-sense) DNA genomes of 1.7–1.9 kb containing two major inversely arranged ORFs, one of them encoding a putative Rep, the other a capsid protein. Relative to circoviruses, the Rep and Cap proteins of cycloviruses are slightly shorter and the 3’-intergenic region between the stop codons of the two major ORFs is either absent or consists of only a few bases (Li, Victoria, et al. 2010; Li et al. 2011). The 5’-intergenic region between the start codons of the rep and cap ORFs of cycloviruses is larger than that of circoviruses and also contains a highly conserved stem-loop structure with a distinct loop nonamer sequence. The Rep N-terminus has several motifs associated with rolling circle replication (FTLNN, TPHLQG and YCSK) and dNTP-binding (GXGSK), with some alterations. Conserved amino acid motifs associated with the 2C helicase function of some picorna-like viruses were also identified in the carboxy-half of Rep. The N-terminal region of the cyclovirus Cap proteins was highly basic and arginine-rich, as is typical for circovirus capsid proteins. Phylogenetically, cycloviruses cluster in a related but separate clade to circoviruses (Delwart and Li 2011).
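As an illustration of how the Rep motifs listed above could be located in a translated ORF, the following is a minimal sketch rather than the procedure used in the cited studies. The function name and the toy protein sequence are hypothetical, and X is treated as "any residue". Because the motifs occur "with some alterations" in real sequences, exact matching as done here would miss divergent variants.

```python
import re

# Motifs quoted in the text: three associated with rolling circle
# replication and one with dNTP binding (X = any amino acid).
REP_MOTIFS = ["FTLNN", "TPHLQG", "YCSK", "GXGSK"]


def find_rep_motifs(protein):
    """Return {motif: [0-based start positions]} for exact motif matches."""
    protein = protein.upper()
    positions = {}
    for motif in REP_MOTIFS:
        # Turn the wildcard X into a regular-expression "any character".
        pattern = motif.replace("X", ".")
        positions[motif] = [m.start() for m in re.finditer(pattern, protein)]
    return positions


# Hypothetical toy sequence, built only to exercise the function.
toy_rep = "MAFTLNNAATPHLQGAAYCSKAAGPGSKAA"
hits = find_rep_motifs(toy_rep)
```

In practice, motif searches of this kind are usually followed by an alignment against reference Rep proteins, since position and spacing of the motifs matter as much as their presence.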
1.3.2.3  ssDNA viruses in marine animals

Using a metagenomic approach, novel ssDNA viruses were discovered in the tissue of sick marine animals. For example, Zalophus californianus anellovirus (ZcAV) was found associated with three dead captive California sea lions. The double-stranded replicative form of the virus was detected in the lung tissue of all three dead animals, but in none of the other animals at the zoo (Ng, Suedmeyer, et al. 2009). A similar virus was also found in the lung tissue of Pacific harbour seals multiple years after their death (Ng, Wheeler, et al. 2011). In addition, a novel ssDNA virus, sea turtle tornovirus 1 (STTV1), was sequenced directly from a fibropapilloma of a Florida green sea turtle (Ng, Manire, et al. 2009). These results show the potential of viral metagenomics for discovering novel viruses that are important to marine health.

1.4  Thesis objectives

There is strong emerging evidence that ssDNA viruses are major players in marine environments, but their diversity is poorly characterized because of their lack of similarity to known viruses. Moreover, the specific role ssDNA viruses play in the marine community remains poorly accessible because very few of these viruses have been isolated from marine environments. The overarching goal of this dissertation is to characterize the diversity and evolution of marine ssDNA viruses. The dissertation has the following objectives:

1. Assessing the genomic diversity of ssDNA viruses
2. Exploring the origin and evolution of ssDNA viruses
3. Looking at the geographic distribution and richness of ssDNA viruses
4. Determining the hosts of marine ssDNA viruses

To address these goals, the dissertation is structured as follows: In chapter 2, I assess the genetic diversity of marine ssDNA viruses through metagenomic analyses, revealing 645 new viral genomes divided into more than 125 new viral genotypes that are distinct from previously known viruses. 
In chapter 3, I use phylogenetic analysis to compare the replication proteins of marine and terrestrial ssDNA viruses in order to speculate about the origin of ssDNA viruses and the putative hosts of marine ssDNA viruses. In chapter 4, I focus on gokushoviruses. A PCR-based approach was used to look at the genetic relatedness of this viral group, showing that gokushovirus sequences are correlated with their geographic origin. In chapter 5, I examine the distribution of gokushoviruses in Saanich Inlet, British Columbia, by performing amplicon deep sequencing of the major capsid protein gene in oxic and anoxic zones. Results suggest that gokushoviruses are major players in the oxic zone of Saanich Inlet. In chapter 6, fluorescence in situ hybridization is used to visualize bacterial cells infected by gokushoviruses in an effort to identify their marine hosts. In chapter 7, I integrate the results and discuss the role of ssDNA viruses in marine microbial communities.

Chapter 2: Previously unknown and highly divergent ssDNA viruses populate the oceans

2.1  Synopsis

This chapter presents the first comprehensive metagenomic study of ssDNA viruses and uncovers the genetic diversity of these viruses in composite samples from temperate (Saanich Inlet, 11 samples; Strait of Georgia, 85 samples) and subtropical marine waters (41 samples, Gulf of Mexico). A total of 645 putative circular genomes of ssDNA viruses were assembled, almost doubling the number of sequenced ssDNA viral genomes. While more than 90% of the genomes had no evident similarity to sequenced viruses, 70% of them were similar to at least one other sequence in our dataset, suggesting that most of the sequence space for ssDNA viruses has been closed for these samples. However, 129 genetically distinct groups of ssDNA viruses were identified that had no recognizable similarity to each other or to other sequenced viruses. 
Each of the 129 groups was represented by at least one complete genome, with most falling into 11 well-defined groups. As well, new phylogenetic groups of sequences encoding putative replication and coat proteins have similarity to sequences from viruses infecting eukaryotes. This study reveals that the oceans harbour hundreds of other previously unknown evolutionary groups of ssDNA viruses that are likely significant pathogens of the phytoplankton and microzooplankton that underlie the marine food webs.

2.2  Material and methods

2.2.1  Collection and preparation of samples

Samples were collected from the following three distinct geographic regions: the coastal waters of British Columbia (82 samples), the Gulf of Mexico (41 samples), and Saanich Inlet (11 samples) (Figure 2.1; Appendix B). Water samples (~20 to ~200 L) were collected using GO-FLO or Niskin bottles either mounted on a CTD rosette (SOG and GOM) or directly on a hydrographic wire (SI). For each sample, the viruses were concentrated ~100-fold (~200 mL final volume) using ultrafiltration (Suttle et al. 1991). Briefly, particulate matter was removed by pressure filtering (<17 kPa) the samples through 142-mm diameter glass-fibre (MFS GC50, nominal pore size 1.2 µm) and polyvinylidene difluoride (Millipore GVWP, pore size 0.22 µm) filters connected in series. The viral size fraction in the filtrate was then concentrated by ultrafiltration through a 30 kDa molecular weight cut-off cartridge (Amicon S1Y30, Millipore), and stored at 4 °C in the dark until processed.

Figure 2.1  Geographic location of the samples used in this study.

In order to integrate variation within a region, numerous virus concentrates (VCs) collected from different locations and at different times within a geographic region were combined into a single mix (Appendix B). Two of these mixes (GOM, SOG) correspond to those used for the Angly et al. 
(2006) viral metagenomic libraries in which marine viral ssDNA sequences were first observed. Two mL from VCs collected in the Strait of Georgia (SOG) and neighboring inlets and bays were pooled into three mixes based on the year of collection (BC1-1999, 23 samples; BC3-2000, 26 samples; BC4-2004, 16 samples) and one mix based on salinity (BC2-Low salinity, 19 samples). Similarly, we made four mixes from the Gulf of Mexico: Eastern GOM (eight samples), Northern GOM (six samples), Western GOM (six samples) and Texas Coast (13 samples). For Saanich Inlet (SI), we used surface samples for the months of April 2007 and January, March, May, July, August and November 2008.

2.2.2  ssDNA preparation

Ten mL of each pooled mix (four mixes from SOG, four from GOM, seven monthly samples for SI) were filtered through a 0.22-µm pore-size syringe filter (PVDF; Millipore) to remove bacteria, and ssDNA was extracted using a QIAprep Spin M13 kit (Qiagen), according to the manufacturer's protocol. To create double-stranded DNA, 5 µL of the ssDNA preparation was subjected to multiple displacement amplification (MDA) using the Repli-g Mini kit (Qiagen). Given the very low concentration of ssDNA (<50 ng per sample), we took advantage of the amplification bias of MDA for short ssDNA (Dean et al. 2001; Lizardi et al. 1998). The DNA was then purified using a QIAamp DNA Mini kit (Qiagen) according to the manufacturer’s recommendations. The purified DNA was resuspended in 100 µL RNAse- and DNAse-free water (Invitrogen). In the Repli-g kit, DNA is denatured with an alkaline solution rather than by heat. Denaturation of dsDNA was reduced during whole-genome amplification (WGA) by adding the stop solution N1 (which brings the pH back to neutral) immediately after the denaturation solution D1 (NaOH solution). Since MDA creates high concentrations of ssDNA, a renaturation step was added by warming the purified DNA to 94 °C followed by slow cooling to 4 °C in 1 °C steps of 30 s each. 
The DNA was kept at 4 °C until further use. The Saanich Inlet (SI) samples were processed as described above, except that 10 mL of individual VCs were used instead of a VC mix.

2.2.3  Genome analysis, binning and assembly

Metagenomic libraries were constructed from ssDNA WGA products from SI, SOG, and GOM. The purified WGA DNA was concentrated using a Millipore YM-30 Microcon centrifugal filter to a final volume of ~50 µL; 3 to 5 µg of dsDNA from each sample was sent for pyrosequencing using the Roche 454 FLX instrumentation with Titanium chemistry at Genome Québec, McGill University (SOG) and the Broad Institute at the Massachusetts Institute of Technology (GOM and SI). The sequences were quality and linker trimmed, and assembled into contiguous sequences (contigs) using the Newbler Assembler (Roche). The individual reads and assembled sequences were compared to the NCBI database, as well as to a subset of the database containing all ssDNA viral genomes, using tBLASTx with an e-value cutoff of 10⁻⁵. The scaffolds were examined with the Consed package (Gordon 2003) and additional analyses (BLAST, circularization of the genomes, annotations, MUSCLE alignments, and phylogeny) were performed within Geneious Pro 5.6.6 (Biomatters). The metagenomic reads from Angly et al. (2006) were downloaded from the CAMERA database (Seshadri et al. 2007). The composite genomes were made from assembled contigs that had identical sequences at the beginning and the end. Only circular genomes with an average of 3-fold coverage were kept for further analyses.

2.2.4  Feature frequency profiles (FFP)

BLAST was used to identify the complete circular genomes within the NCBI database that were similar to the genomes assembled in this study. The similar genomes identified within NCBI belonged to viruses infecting plants, animals and bacteria, as well as to environmental genomes from other metagenomic libraries [ten circovirus-like genomes from an Antarctic lake (López-Bueno et al. 
2009), nine from marine environments (Rosario, Duffy, et al. 2009) and eleven cycloviruses from chimpanzee stools (Li, Kapoor, et al. 2010)]. The FFP analyses were performed as previously described (Sims et al. 2009) using the authors’ scripts. Briefly, the frequency of each n-mer is calculated using a sliding window, the frequencies are compared using the Jensen-Shannon divergence, and the results are written to a PHYLIP-format matrix. To avoid a bias induced by sequences varying more than 4-fold in length (longer sequences contain more polynucleotides, which affects the calculations), we kept only the genomes that were larger than 800 bp, and separated the longer sequences into 700 to 1100 bp fragments (Sims et al. 2009). Because the orientation of the genomes was not always known, calculations were performed on both strands. For example, the FFP of a genome that was 2400 nucleotides in length would be done on six fragments of 800 bp, including three forward and three reverse sequences, while the FFP of a 1000 bp genome would be done on two 1000 bp fragments, one forward and one reverse. Sequences with fewer than four homologues were removed to allow better visualization on a multidimensional scaling (MDS) plot. The feature frequency was calculated for 4-, 5-, 6-, 7-, 8-, and 9-mers, but the 7-mers (heptamers) were best at discriminating known viral families (Figures 2.2 and 2.3). Neighbor-joining analysis was performed with PHYLIP v3.6 (Felsenstein, J. 2005, Department of Genome Sciences, University of Washington, Seattle) and multidimensional scaling analysis with R (R Team, 2011). The tBLASTx results (e-value <10⁻¹⁰) comparing the ssDNA composite genomes to each other, to other environmental genomes and to the isolates were presented as a network using Cytoscape (Shannon et al., 2003). For the network presentation, we linked up to five hits to each node.
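The feature frequency profile comparison described in Section 2.2.4 can be sketched in a few lines. This is a simplified illustration and not the scripts of Sims et al. (2009) that were used in this study: the function names are hypothetical, and the rule used to choose the number of fragments (the smallest number that keeps every fragment at or below 1100 bp) is an assumption that reproduces the two worked examples in the text but can yield fragments shorter than 700 bp for some genome lengths.

```python
from collections import Counter
from math import ceil, log2

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.translate(COMPLEMENT)[::-1]


def fragments_both_strands(seq, longest=1100):
    """Split a genome into near-equal fragments and add the reverse strand.

    Assumed rule: use the fewest fragments that keep each one <= longest.
    """
    n = max(1, ceil(len(seq) / longest))
    bounds = [round(i * len(seq) / n) for i in range(n + 1)]
    forward = [seq[bounds[i]:bounds[i + 1]] for i in range(n)]
    return forward + [reverse_complement(f) for f in forward]


def kmer_profile(seq, k=7):
    """Relative frequency of each k-mer, counted with a sliding window."""
    if len(seq) < k:
        raise ValueError("sequence is shorter than k")
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: c / total for kmer, c in counts.items()}


def jensen_shannon(p, q):
    """Jensen-Shannon divergence (base 2) between two k-mer profiles."""
    divergence = 0.0
    for kmer in set(p) | set(q):
        pi, qi = p.get(kmer, 0.0), q.get(kmer, 0.0)
        mi = 0.5 * (pi + qi)
        if pi:
            divergence += 0.5 * pi * log2(pi / mi)
        if qi:
            divergence += 0.5 * qi * log2(qi / mi)
    return divergence
```

With base-2 logarithms the divergence is 0 for identical profiles and 1 for profiles that share no k-mer, so the pairwise values can be written directly into a distance matrix for neighbor-joining or MDS.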
Figure 2.2  Multidimensional scaling of the feature frequency profile of ssDNA viral isolates from the NCBI database.  A) 4-mers, B) 5-mers, C) 6-mers, D) 7-mers, E) 8-mers, and F) 9-mers. Circoviridae are in red, Cycloviruses in dark red, Geminiviridae in blue, Nanoviridae in green, and Parvoviridae in orange.

Figure 2.3  Feature frequency profile analyses of ssDNA virus isolates from the NCBI database and genomes from this study.  Multidimensional scaling of viral isolates (diamonds) and composite genomes (grey dots) using A) 4-mers, B) 5-mers, C) 6-mers, D) 7-mers, E) 8-mers, and F) 9-mers to show that previously established families of ssDNA viruses resolve better using 7-mers.

2.2.5  Protein and phylogenetic analysis

For each group in the network, the open reading frames were identified and translated using GeneMark (Borodovsky et al. 2003). The proteins were aligned using MUSCLE (Edgar 2004). To limit sequencing errors and avoid potential chimeras from MDA amplification, only the conserved full-length proteins were kept for further analysis. The alignments were submitted to HHpred to determine the putative function of the conserved proteins (Söding et al. 2005). The replication proteins were trimmed to the conserved motifs, aligned using MAFFT with the E-INS-i algorithm (Katoh et al. 2002), and the alignment was manually edited in Geneious. Maximum likelihood analyses were performed using PhyML with the LG model, a gamma distribution, bootstrapping with 100 replicates, and approximate likelihood ratio tests (aLRT statistics) (Guindon and Gascuel 2003; Guindon et al. 2010). ProtTest was used to determine the best evolutionary model (Abascal et al. 2005). Trees were viewed with FigTree (http://tree.bio.ed.ac.uk/software/figtree/).

2.2.6  Nucleotide sequence accession numbers

Raw reads and assembled contigs obtained in this study were submitted to CAMERA. 
The complete assembled genomes are available in Genbank (accession numbers JX904070–JX904677).

2.3  Results and discussion

This comprehensive metagenomic study focused on the discovery of unknown ssDNA viruses in 142 marine samples. In order to capture the diversity in three separate regions, the samples were combined into two composite samples from the temperate Northeast Pacific Ocean (11 samples from Saanich Inlet (SI) and 85 samples from the Strait of Georgia (SOG), Canada) and one from the subtropical Gulf of Mexico (41 samples, GOM).

2.3.1  ssDNA preparation

The method presented here differs from other viral metagenomics analyses because two different methods were used to select and enrich the ssDNA from viral communities. In a first step, ssDNA extraction was performed using a silica column with solutions specific for the binding of ssDNA to the silica. This is possible because the charges of ssDNA and dsDNA are different (Ambia-Garrido et al. 2010). The second step of the DNA preparation involved multiple displacement amplification (MDA) with the phi29 DNA polymerase (Dean et al. 2001). MDA is biased towards small circular ssDNA molecules (Dean et al. 2001; Lizardi et al. 1998; Angly et al. 2006). Although MDA enhances chimera formation, creates random over-amplification, and can be biased towards GC-rich regions (Rodrigue et al. 2009), for low concentrations of ssDNA it produces the greatest amplification with the lowest associated bias (Pinard et al. 2006). It was therefore the most suitable method for this study.

2.3.2  Comparison to databases

Pyrosequencing produced 279,628 sequence reads (95,402 for SOG; 96,950 for SI; and 87,274 for GOM) of ~500 bp in length, with 60 to 86% of the reads from each dataset being assembled into a total of 4,995 contiguous sequences (contigs) (1,260 for SOG; 2,399 for SI; and 1,336 for GOM) ranging from 500 bp to 7,246 bp. 
Comparison of the assembled sequences with ssDNA genomes in Genbank revealed homologues to all of the ssDNA families except the Anelloviridae (Figures 2.4 and 2.5A). Contigs were similar to viruses from the Circoviridae and Nanoviridae (4.9% and 2.0% of the total, respectively), with similarity being almost exclusively to the replication protein (described below). About 1.6% of the total sequences were similar to the genus Gokushovirus from the Microviridae. Previous studies found that viruses in the Circoviridae and Microviridae (Gokushovirus) are widespread in marine and fresh waters (Rosario et al., 2009; Angly et al., 2006). As well, some aquatic circovirus-like sequences appear to encode a capsid protein known previously only in RNA viruses (Diemer and Stedman, 2012). RNA viral sequences were not observed in our data, and since the sequence coding the replication protein from this putative virus contained premature stop codons, similar sequences would have been excluded from the analyses. Sequences similar to those from the Anelloviridae, which occur in insects and mammals including California sea lions and Pacific harbor seals (Ng et al., 2011), were not found in our samples, suggesting that they were rare or too divergent to be assigned to the family. Viruses from the Parvoviridae family have linear genomes that would not be enriched by MDA, and were excluded from this analysis. Only one contig had a low similarity to parvoviruses. The contigs were then compared to the nr Genbank database to check for bacterial contamination. About 6% of contigs were similar to archaeal and bacterial genomes, with hits to hypothetical or phage-like proteins being most common. 
There were also a few sequences with similarity to eukaryotes (0.6%) (Figure 2.5B), although most were to a “circovirus-like replication protein” in the draft genome of the anaerobic protozoan parasite Giardia intestinalis (Franzén et al., 2009), suggesting that related species might be hosts for environmental circoviruses.

Figure 2.4  BLAST comparison of all 4,995 contigs.  BLAST comparison of the contigs against established ssDNA viral families (e-value <10⁻⁵).

Figure 2.5  BLAST comparison of the contigs from each dataset.  BLAST comparison of the 4,995 contigs against A) the ssDNA viral families (e-value <10⁻⁵) and B) the NCBI database (e-value <10⁻³) for the Strait of Georgia (SOG, 1260 contigs), Saanich Inlet (SI, 2399 contigs), and Gulf of Mexico (GOM, 1336 contigs). Most bacterial hits were to hypothetical phage-like proteins.

2.3.3  Comparison and grouping of complete assembled genomes

The composite genomes were compared to other ssDNA viruses and environmental sequences using the feature frequency profile (FFP), and by sequence similarity based on results from tBLASTx. The FFP uses the Jensen-Shannon divergence algorithm to compare the frequency of polynucleotides (here, heptamers) to generate a distance matrix, which is used to perform cluster (neighbor-joining) and multidimensional-scaling (MDS) analyses (Figure 2.6). Only genomes that were similar to at least four other isolate or environmental genomes were used in the analysis to avoid cluttering the diagram with data corresponding to rare genomes. The genomes were divided into five marine FFP clusters. Except for Cluster 3, which overlapped with Nanoviridae viruses, the clusters were distinct from known families of ssDNA viruses. As shown by neighbor-joining and MDS analyses, the FFP discriminated established families of viruses, including Nanoviridae, providing evidence that the FFP clusters are robust (Figures 2.2 and 2.3). 
The polynucleotide frequency does not necessarily represent gene conservation and evolution, but is indicative of host-virus coevolution (Pride et al., 2006), suggesting that viruses within a FFP cluster infect related hosts. Therefore, the overlap of Cluster 3 with Nanoviridae may indicate that viruses within this cluster infect photosynthetic organisms.

Figure 2.6  Feature frequency profile analyses of ssDNA virus isolates from the NCBI database and assembled genomes from this study.  Neighbor-joining tree (left) and multidimensional scaling (right) (Goodness of fit=0.6495) of viral isolates (diamonds) and assembled genomes (dots) demonstrates that FFP is able to resolve evolutionary relationships among ssDNA viruses. The shaded areas define the established families of ssDNA viruses and the new evolutionary clusters identified in this study.

In the second approach, sequence similarity based on results from tBLASTx was shown as a network, where each node is a complete genome (composite or isolate) and each link represents a hit with an e-value <10⁻¹⁰ (Figure 2.7); therefore, each cloud represents a group of sequences sharing a gene homologue. This resolved 129 genetically distinct groups represented by at least one complete genome (Figure 2.7), with related sequences being grouped based on sequence similarity. With the exception of sequences that might be from multipartite genomes, these genetically distinct groups likely represent distant evolutionary lineages. While 94 of the composite genomes had no similarity to other genomes, most sequences fell within 11 major Coding DNA Sequence (CDS) groups.

Figure 2.7  Network representation of the BLAST comparisons of the environmental assembled genomes to the previously known ssDNA viruses (e-value <10⁻⁵) and to other environmental genomes (e-value <10⁻¹⁰). 
Each node represents a complete genome (Circle: Isolate; Triangle: Gulf of Mexico; Diamond: Saanich Inlet; Square: Strait of Georgia) and each link represents a BLAST hit (Black: hit against environmental genome; Grey: hit against isolate). The colour of the outline of the node represents the viral family (Blue: Geminiviridae; Green: Nanoviridae; Dark red: Circoviridae; Orange: Parvoviridae; Olive green: Microviridae) or the colour of the cluster assigned in the FFP analysis (Blue: Cluster 1, Dark blue: Cluster 3, Aqua: Cluster 4, and Purple: Cluster 5). The shaded areas identify those genomes that have a conserved coding sequence (CDS) in common. A solid coloured node means that the genome contains the full-length conserved protein. The shaded grey boxes encompass whole genomes with either no recognizable similarity with other genomes (singlets) or with similarity to one other genome (doublets).

Table 2.1  Representative genomes for each of the 11 new gene families.  The arrows indicate putative open reading frames, and the common gene for each family is colored with its FFP cluster color. Small overlapping genes are represented by thin grey arrows.

Each node in the network was assigned to its FFP cluster. Most of the complete genomes (84%) fell into 11 CDS Groups (Figure 2.7), each comprising >1% of the genomes, with the members of each group sharing a conserved gene (Figure 2.7 and Table 2.1). Only two of these groups contain previously sequenced viruses. CDS Group 6 contains viruses from the Nanoviridae and Circoviridae families while CDS Group 7 comprises viruses from the Microviridae family. The FFP clusters 1 and 4 resolved into more than one CDS Group. This may be because some ssDNA genomes are multipartite. 
For example, nanoviruses can have six to eleven circular ssDNA molecules of ~1 kb, each encoding a gene (Gronenborn 2004), while some begomoviruses are multipartite (Nawaz-ul-Rehman and Fauquet 2009). Consequently, viruses falling in the same FFP cluster but in different CDS Groups may represent different molecules and belong to the same viral family. The similar number of sequences in CDS Groups 1A (86 genomes) and 1B (73 genomes) suggests that these sequences come from multipartite viruses. Furthermore, CDS Groups 4A (13 genomes) and 4B (17 genomes) also have a similar number of sequences and come from the same FFP cluster. Finally, while most groups were evenly distributed in temperate and subtropical waters, sequences in CDS Group 2 occurred more frequently in the subtropical GOM than in the temperate SOG and SI, whereas the opposite was true for Microviridae phages. This is consistent with some groups of viruses being widely distributed in nature (Breitbart and Rohwer 2005; Short and Suttle 2005; Labonté, Reid, and Suttle 2009), while others are more restricted in distribution (Tucker et al. 2011).

Figure 2.8  Relative percentage of contigs from each viral group proposed in this study.  BLAST comparison of the contigs from Strait of Georgia (SOG), Saanich Inlet (SI), and Gulf of Mexico (GOM) against each of the new viral groups (e-value <10⁻¹⁰). “Others” contain contigs that were similar to a genome that did not belong to one of the major CDS groups.

Finally, each contig was assigned to a CDS Group with a tBLASTx e-value <10⁻¹⁰ (Figure 2.8). In each metagenomic dataset the most frequently occurring sequences fell into CDS Groups 1A and 1B, followed by the Circovirus-like group (Figure 2.8). Moreover, nearly 70% of the 4,995 contigs had similarity to at least one composite genome (Figure 2.8), so most contigs could be placed in a genomic context.
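The grouping of genomes in the similarity network (Figure 2.7) can be illustrated with a short sketch. It assumes that tBLASTx results are available as (query, subject, e-value) tuples, keeps at most five hits per query below the e-value threshold as in the network presentation, and reports each connected "cloud" as one group. The function names and the input layout are hypothetical; the network in this study was built and displayed with Cytoscape, not with this code.

```python
from collections import defaultdict


def build_edges(hits, max_evalue=1e-10, max_links=5):
    """Keep up to max_links best hits per query below the e-value cutoff."""
    best = defaultdict(list)
    for query, subject, evalue in hits:
        if query != subject and evalue < max_evalue:
            best[query].append((evalue, subject))
    edges = set()
    for query, candidates in best.items():
        for _, subject in sorted(candidates)[:max_links]:
            edges.add(frozenset((query, subject)))  # undirected link
    return edges


def connected_groups(genomes, edges):
    """Return the connected components; genomes without links are singlets."""
    neighbours = {g: set() for g in genomes}
    for edge in edges:
        a, b = tuple(edge)
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    seen, groups = set(), []
    for start in neighbours:
        if start in seen:
            continue
        stack, group = [start], set()
        while stack:  # depth-first walk over one component
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(neighbours[node] - group)
        seen |= group
        groups.append(group)
    return groups


# Hypothetical example: g3-g4 fails the e-value cutoff, so g4 is a singlet.
example_hits = [("g1", "g2", 1e-30), ("g2", "g3", 1e-12), ("g3", "g4", 1e-3)]
example_groups = connected_groups(
    ["g1", "g2", "g3", "g4"], build_edges(example_hits)
)
```

Groups defined this way share at least one chain of significant hits, which is why a single conserved gene is enough to place otherwise dissimilar genomes in the same cloud.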
2.3.4  Identification of conserved proteins

CDS Group 6 was intriguing because it contained genomes from multiple FFP clusters (Figure 2.7). Genomes within the CDS Group 6 share a rolling-circle replication protein commonly found in Circoviridae and Nanoviridae viruses. A similar replication protein is also found in Geminiviridae viruses (Gronenborn 2004) and in some plasmids (Gibbs et al. 2006). The translated proteins contained all five of the conserved replicase motifs involved in rolling-circle replication (Figure 2.9), including the motif involved in the initiation and termination of rolling-circle DNA replication (motif 2), a DNA-linking tyrosine (motif 3), and the Walker A motif, which is a putative NTP-binding site (motif 4) (Ilyina and Koonin 1992; Mankertz et al. 1998). Phylogenetic analysis of the replication protein sequences revealed ten new clades of environmental ssDNA viruses (Figure 2.10). Phylogenetic analysis of the replication protein is also congruent with the current thinking that nanoviruses, geminiviruses, and circoviruses share a common ancestry (Briddon and Stanley 2006; Gibbs and Weiller 1999; Krupovic et al. 2009) (Figures 2.9 and 2.10). Terrestrial viruses are nested within environmental sequences, suggesting that the former evolved from viruses infecting marine protists.

Figure 2.9  Alignment of the rolling-circle replication protein.

Alignment of the rolling-circle replication protein from representatives of each of the phylogenetic clades with Circoviridae and circovirus-like representatives in red, Nanoviridae and nanovirus-like representatives in green and other environmental sequences in blue. The five conserved motifs are boxed in red.

Figure 2.10  Genetic relatedness of the rolling-circle replication protein.
Unrooted phylogenetic analysis (maximum likelihood; model LG; 100 bootstrap replicates) representing the genetic relatedness of the rolling-circle replication protein of nanoviruses (green), geminiviruses (blue), circoviruses (red), cycloviruses (purple) and the environmental sequences (black dots: this study, grey dots: other studies). The black, dark grey and light grey branches represent >90%, 75-89%, and less than 75% bootstrap support, respectively. Black and grey asterisks at internal nodes represent at least 90% and 75% aLRT support, respectively. Roman numerals represent new deeply-branched phylogenetic groups of viruses that likely infect phytoplankton (green), zooplankton (red) or other protists (blue).

Figure 2.11  Genetic relatedness of the rolling-circle replication protein (side view with labels).

Phylogenetic analysis (maximum likelihood; model LG; 100 bootstrap replicates) representing the genetic relatedness of the rolling-circle replication protein of nanoviruses (green), geminiviruses (blue), circoviruses (red), cycloviruses (purple) and the environmental sequences (black dots: this study, grey dots: other studies). The black, dark grey and light grey branches represent >90%, 75-89%, and less than 75% bootstrap support, respectively. Black and grey asterisks at internal nodes represent at least 90% and 75% aLRT support, respectively. Roman numerals represent new deeply-branched phylogenetic groups of viruses that likely infect phytoplankton (green), zooplankton (red) or other protists (blue).

BLAST comparisons to the GenBank database of the conserved proteins in the ten remaining CDS groups did not reveal significant similarity. To identify the proteins, we used homology detection and structure prediction by HMM-HMM (hidden Markov model) comparison (HHpred). Homology to known protein structures was found for two more proteins.
The conserved protein for CDS Group 4A is similar to the coat protein of tobacco necrosis satellite virus 1 (e-value = 0.00041), while CDS Group 3 has weak similarity to ryegrass mottle virus (e-value = 15). As these viruses infect plants, this suggests that viruses within these CDS Groups may infect phytoplankton. One reason why BLAST analysis may have revealed so few sequences similar to those in our dataset is that extant databases are overrepresented with data from ssDNA viruses infecting terrestrial plants and animals. As well, the very high mutation rates in ssDNA viruses can result in rapid sequence divergence. For example, mutation rates of begomoviruses have been reported to be as high as 10⁻³ to 10⁻⁴ substitutions/site/year (Duffy et al. 2008). Given the use of multiple displacement amplification (MDA), pyrosequencing, and sometimes low coverage, it was not possible to evaluate the mutation rate and genetic variability within consensus genomes. High mutation rates can cause multiple substitutions that lead to proteins with similar functions, but extremely diverse sequences (Duffy and Holmes 2008). An example of this is the conserved jelly roll motif in capsid proteins in which there is no recognizable amino acid homology, but it is argued that the proteins share a common evolutionary history (Bamford 2002). Nonetheless, marine and terrestrial ssDNA viruses are distinct.

2.3.5  Comparison to other viral metagenomic datasets

In order to identify ssDNA sequences in marine metagenomic libraries, we compared our data with data from four metagenomic libraries constructed from total viral DNA from various marine environments (Arctic Ocean, Sargasso Sea, Strait of Georgia and Gulf of Mexico), where the SOG and GOM samples were the same composite samples used in this study (Angly et al. 2006). About 5% to 15% of the sequences in the datasets from Angly et al.
(2006) originating from British Columbia, the Gulf of Mexico and the Sargasso Sea were identified as ssDNA (Figure 2.12), while ssDNA was not found in the Arctic dataset. Finally, to compare ssDNA viruses from temperate and subtropical waters, BLAST comparisons were made among our ssDNA metagenomic datasets. About 50% of the ssDNA sequences from temperate waters and the GOM were similar to each other (Figure 2.13), whereas datasets from the SOG and SI were 72 to 82% similar to each other (Figure 2.13). This result, as well as the observation that ~70% of the sequences in our viral ssDNA datasets had similarity to at least one newly assembled genome, indicates that we have representative genomes for most of the circular ssDNA viruses in these samples. Moreover, as these data are from hundreds of pooled samples encompassing temperate and subtropical environments, this suggests that we have representative sequences for much of the sequence space for ssDNA viruses in the surface waters of the world’s oceans.

Figure 2.12  Comparison of the reads from the ssDNA datasets from this study to other environmental datasets.

Relative percentage of reads from the marine metagenomic datasets from Gulf of Mexico (GOM) (A), British Columbia (BBC) (B), Sargasso Sea (SAR) (C), and the Arctic Ocean (ARC) (D) from Angly et al. (2006) when compared by BLAST against the ssDNA reads from this study (SOG, SI, GOM) (e-value <10⁻¹⁰).

Figure 2.13  Comparison of the reads from the ssDNA datasets against each other.

Relative percentage of reads from the ssDNA marine metagenomic datasets from GOM (A), SOG (B), and SI (C) from this study when compared by BLAST against each other (e-value <10⁻¹⁰).
2.3.6  Conclusions

The direct purification and sequencing of ssDNA from temperate and subtropical marine biomes yielded 608 complete genomes comprising 129 new evolutionary groups, including 94 unique genomes and 11 conserved groups, that have little or no recognizable similarity at the DNA sequence level to previously known viruses. Given that viruses within extant families have significant genetic similarity, this suggests that these new sequence groups belong to previously unknown families of viruses. The high evolutionary divergence of these sequences also suggests that they belong to viruses that infect a wide diversity of organisms. Although some of these sequence groups likely stem from viruses that infect bacteria, the few sequence groups that could be associated with extant viruses belonged to families of viruses that infect eukaryotes, suggesting that some of these new groups comprise viruses that are pathogens of the phytoplankton and zooplankton that underlie marine food webs.

Chapter 3: Evolutionary relationships between circular ssDNA viruses and their hosts

3.1  Synopsis

Most ssDNA viruses, including the Circoviridae, Nanoviridae, and Geminiviridae, replicate their DNA via the rolling-circle mechanism. The similar replication strategy employed by these viruses and some bacterial plasmids, combined with similar replication proteins (also known as replication initiators), raises the possibility of a common evolutionary history (Faurez et al. 2009; Krupovic et al. 2009; Koonin and Ilyina 1992). Moreover, phylogenetic analyses are consistent with the hypothesis that Circoviridae viruses (Gibbs and Weiller 1999) and Geminiviridae viruses (Briddon et al. 2004; Briddon and Stanley 2006) originated from Nanoviridae viruses. In Chapter 2, I found many replication protein homologues within the ssDNA viral metagenomic dataset.
Metagenomics revealed the presence of many sequences similar to circoviruses and nanoviruses in the environment; yet, very few marine ssDNA viruses have been isolated. These include two ssDNA viruses infecting diatoms (Nagasaki et al. 2005; Tomaru et al. 2011) and some evidence of cycloviruses infecting copepods (Dunlap et al. 2013). However, the hosts of most marine ssDNA viruses remain unknown. In this chapter, I describe the comparison of replication protein sequences to understand the ecological relationships of nanoviruses and circoviruses with their hosts. First, I compared viral replication proteins to endogenous viral elements (EVEs) of ssDNA viruses incorporated into the genomes of eukaryotes. Even if viruses were not isolated on those hosts, the presence of EVEs indicates that a ssDNA virus was once replicating inside the cell. EVE comparisons suggested that marine ssDNA viruses are likely significant pathogens of phytoplankton and microzooplankton, and thus play a role in controlling the populations at the base of the marine food web. Second, I constructed a phylogeny of the replication protein from both marine and terrestrial ssDNA viruses, and used it to propose an evolutionary model where terrestrial nanoviruses and circoviruses originated and co-evolved with their hosts as they moved from the ocean to the land.

3.2  Material and methods

3.2.1  Collection and preparation of samples, ssDNA preparation, genome analysis, binning and assembly

See sections 2.2.1, 2.2.2 and 2.2.3.

3.2.2  Sequences from databases

Reference sequences. All complete genomic reference sequences for ssDNA viruses were recovered from GenBank. They included sequences from isolates infecting plants, animals, and bacteria, as well as environmental genomes from other metagenomic libraries (ten circovirus-like genomes from an Antarctic lake (López-Bueno et al. 2009), nine from marine environments (Rosario, Duffy, et al.
2009) and eleven cycloviruses from chimpanzee stools (Li, Kapoor, et al. 2010)). Replication proteins from the genomic sequences were translated using Geneious v5.6 (Biomatters). Calicivirus polyprotein sequences were obtained from NCBI, and the replicase sequences extracted based on predicted homology to circovirus replication proteins.

Marine replication protein sequences. The open reading frames (ORFs) for the contigs from Chapter 2 were called using GeneMark (Borodovsky et al. 2003). The ORFs were compared to the reference sequences using BLASTx (e-value < 10⁻⁵), translated into amino acid sequences, and those that had a significant BLAST hit were aligned with the reference sequences using MUSCLE (Edgar 2004). Only full-length protein-coding sequences with 1) no premature stop codons, 2) no frame-shift mutations likely introduced by sequencing errors, and 3) at least motifs II, III and IV were kept for further analyses.

EVE sequences. The amino-acid sequences of the EVEs in Liu et al. (2011) were obtained directly from the author.

3.2.3  Phylogenetic analysis

The replication proteins were aligned with MAFFT using the E-INS-i algorithm (Katoh et al. 2002) and trimmed to keep the region between the conserved motifs I and V. The alignment was manually edited in Geneious. Maximum likelihood analyses were performed using PhyML (Guindon and Gascuel 2003) with the LG model, a gamma distribution, and bootstrapping with 100 replicates. ProtTest was used to determine the best evolutionary model (Abascal et al. 2005). Trees were viewed with FigTree (http://tree.bio.ed.ac.uk/software/figtree/). Placement of EVE sequences on a fixed reference phylogenetic tree (Figure 2.9) was performed using pplacer (Matsen et al. 2010).
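The sequence-retention criteria described in section 3.2.2 can be sketched as a simple filter. This is a minimal illustration only: the motif patterns below are placeholders chosen for the example (the Walker A pattern follows the widely cited G-x(4)-G-K-[ST] consensus, and the other two are loose stand-ins), they are not the motif definitions used in this study, and frame-shift detection is not covered. All function names are hypothetical.

```python
import re

# Illustrative motif patterns (assumptions, not taken from this study).
EXAMPLE_MOTIFS = [
    ("II", re.compile(r"H.[HQ]")),        # placeholder for motif II
    ("III", re.compile(r"Y..K")),         # placeholder for motif III
    ("IV", re.compile(r"G.{4}GK[ST]")),   # general Walker A consensus
]


def has_internal_stop(protein):
    """True if a stop ('*') occurs anywhere except the final position."""
    body = protein[:-1] if protein.endswith("*") else protein
    return "*" in body


def motifs_in_order(protein, motifs=EXAMPLE_MOTIFS):
    """True if every motif is found, each one after the previous match."""
    pos = 0
    for _name, pattern in motifs:
        match = pattern.search(protein, pos)
        if match is None:
            return False
        pos = match.end()
    return True


def keep_sequence(protein, motifs=EXAMPLE_MOTIFS):
    """Keep a translated ORF with no premature stop and all motifs present."""
    return not has_internal_stop(protein) and motifs_in_order(protein, motifs)
```

For example, the toy sequence "MAHLQGAAYCSKAAGPPGSGKSAA*" passes the filter, whereas the same sequence with a stop inserted after "HLQ" or with the Walker A stretch removed is rejected.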
3.3  Results and discussion

3.3.1  Endogenous ssDNA viral sequences inside eukaryotic genomes indicate a wide range of hosts

Based on metagenomics of environmental samples from marine, soil, and animal feces, ssDNA viruses, mostly members of the Circoviridae (genera Circovirus and Cyclovirus) and Nanoviridae (genus Nanovirus), are an important part of the viral community. However, no virus has been cultured from these environments, so their hosts remain unknown. In order to infer putative marine hosts for ssDNA viruses, we compared the marine replication proteins to endogenous viral elements (EVEs). EVEs are genetic fossils of viral replication proteins inside the genomes of eukaryotes. The discovery of “fossil” remnants of the Parvoviridae and Circoviridae integrated into the chromosomes of various animals led to estimates that these viruses replicated in animals at least 30 to 50 million years ago (Katzourakis and Gifford 2010; Belyi et al. 2010). While ssDNA viruses do not integrate into the genomes of their hosts as a normal part of their replication cycle, genes are likely transferred as the result of the DNA binding, cutting and ligating activity of the Rep protein involved in rolling-circle replication (Gilbert et al. 2006). The insertion of an EVE implies that a ssDNA virus was replicating at the time of insertion, and was infecting an ancestor of that species. Once integrated, the EVE evolves at the rate of the host organism, which is much slower than that of the original virus; however, the lack of selective pressure also allows the accumulation of multiple point mutations, insertions and deletions within the sequences. Multiple lines of evidence show that there has been horizontal gene transfer from ssDNA viruses to their hosts. Recently, Liu et al. (2011) confirmed that EVEs were broadly distributed in four of the five supergroups of eukaryotes (Keeling et al.
2005), namely Unikonta (in animals, fungi and Entamoeba), Plantae (in land plants and green algae), Chromalveolata (in diatoms and Blastocystis) and Excavata (in Giardia) (Table 3.2). Pplacer was used to determine the closest relative to the EVEs among marine and terrestrial viruses (Matsen et al. 2010), by placing EVE sequences on a fixed reference phylogenetic tree of ssDNA viral sequences (replication protein tree from Figures 2.9-2.10). Pplacer places each sequence so as to maximize phylogenetic likelihood or posterior probability according to a reference alignment (the alignment used to create the reference tree). Many EVEs grouped with marine sequences, indicating a wide range of hosts for marine ssDNA viruses.

Table 3.1  Eukaryotic species with ssDNA replication protein fossil sequences found within their genomes.

The coloured boxes represent the closest viral relative to the fossil sequence based on placement with pplacer. Columns for the closest viral relative: Marine nano.; Marine circo.; Other marine viruses; Diatom viruses; Circo; Cyclo; Nano.

Species                      Common name           Phylum
Entamoeba dispar             amoeba                Amoebozoa
Entamoeba histolytica        amoeba                Amoebozoa
Entamoeba invadens           amoeba                Amoebozoa
Micromonas pusilla           alga                  Chlorophyta
Giardia intestinalis         protozoan             Diplomonada
Trichoplax adhaerens         placozoa              Placozoa
Aplysia californica          California sea hare   Mollusca
Brugia malayi                nematode              Nematoda
Loa loa                      eye worm              Nematoda
Onchocerca volvulus          nematode              Nematoda
Wuchereria bancrofti         nematode              Nematoda
Blastocystis hominis         protozoan             Heterokontophyta
Phaeodactylum tricornutum    diatom                Heterokontophyta
Hydra magnipapillata         hydrozoa              Cnidaria
Lepeophtheirus salmonis      sea louse             Arthropoda
Varroa destructor            mite                  Arthropoda
Xenopus tropicalis           frog                  Chordata
Ailuropoda melanoleuca       panda                 Chordata
Canis familiaris             dog                   Chordata
Felis catus                  cat                   Chordata
Monodelphis domestica        opossum               Chordata

Figure 3.1  A) Genetic relatedness of the rolling-circle replication protein used as a reference tree for the B) placement of fossil EVE sequences found inside eukaryotic genomes.

A) Unrooted phylogenetic analysis (maximum likelihood; model LG; 100 bootstrap replicates) representing the genetic relatedness of the rolling-circle replication protein of nanoviruses (green), geminiviruses (blue), circoviruses (red), cycloviruses (purple) and the environmental sequences (black dots: this study, grey dots: other studies). The black, dark grey and light grey branches represent >90%, 75-89%, and less than 75% bootstrap support, respectively. Black and grey asterisks at internal nodes represent at least 90% and 75% aLRT support, respectively. Roman numerals represent new deeply-branched phylogenetic groups of viruses that likely infect phytoplankton (green), zooplankton (red) or other protists (blue). B) Placement of the eukaryotic sequences within the reference tree. The branches were collapsed when the clades were represented only by branches of the same colour. The colour of the branches represents the origin of the sequence (Blue: marine viruses; Green: nanoviruses; Red: Circoviruses; Purple: Cycloviruses; Orange: Bacillariodnaviruses; Dark red and brown: plasmids; Dark blue: Geminiviruses; Black: Endogenous viral elements).

Environmental marine viruses and protists. Sequences that were similar to marine virus sequences were present inside the genomes of unicellular protists from the phyla Amoebozoa (heterotroph), Diplomonada (heterotroph), Chlorophyta (alga), and Heterokontophyta (diatom) (Figure 3.1). There are few data regarding viruses that infect marine protists, although many large dsDNA viruses and a few RNA viruses have been isolated that infect a variety of marine protists (Van Etten 2011; Van Etten et al. 2002; Lang et al. 2009).
The only cultured ssDNA viruses infect diatoms (Nagasaki 2008). Nonetheless, the presence of EVEs similar to sequences in viral marine metagenomic data suggests that descendants of the ssDNA viruses that integrated into eukaryotic genomes still circulate in the sea. Protists are a very important part of the microbial ocean community. Typical abundances of protistan plankton are 10³ cells mL⁻¹ in oligotrophic systems and up to 10⁵ cells mL⁻¹ in coastal and nutrient-rich surface waters (Sanders et al. 2000). However, the densities decrease sharply below the deep chlorophyll maximum. Phototrophic protists along with cyanobacteria belonging to the genera Prochlorococcus and Synechococcus account for about half of the Earth’s primary production, and play a major role in global biogeochemical cycles (Worden 2008). Heterotrophic protists are generally bacterial grazers (Massana 2011). They play a central role in the microbial loop by keeping bacterial stocks stable, transferring dissolved organic matter to higher trophic levels, and recycling nutrients that sustain regenerated primary production (Pomeroy et al. 2007). The presence of EVEs inside the genome of many protists (Liu et al. 2011) suggests that ssDNA viruses play an important role as infectious agents of the marine protists that lie at the base of marine food webs.

Nanovirus-like sequences and photosynthetic organisms. EVEs found in phytoplankton are most similar to marine nanovirus-like sequences, suggesting that marine nanovirus-like sequences likely are from viruses infecting photosynthetic protists. Sequences from the prasinophyte Micromonas pusilla and the diatom Phaeodactylum tricornutum are near the nanoviruses (Figure 3.1); however, Bacillariodnaviruses (ssDNA viruses infecting diatoms) fall into a different clade. This may be because the only known hosts of bacillariodnaviruses are Chaetoceros spp., which are centric diatoms and quite distantly related to P. tricornutum, which is a pennate diatom.
These results suggest that other marine nanovirus-like sequences may belong to viruses infecting pennate diatoms. So far, the only known putative host within the marine nanovirus group is a picobiliphyte. Although picobiliphytes appear to be heterotrophic, evolutionarily they are more closely related to photosynthetic organisms than heterotrophs (Not et al. 2007; Yoon et al. 2011).

Circovirus-like sequences and heterotrophic organisms. In the case of heterotrophic organisms, EVEs may originate from viruses infecting the organisms or from viruses infecting cells the heterotrophs ingested. For example, the heterotrophic protists Giardia intestinalis and Entamoeba spp. contain several EVEs that can be placed within different marine groups (Figure 3.6). Because heterotrophic protists ingest many types of microbes and viruses, they may be hotspots for genetic recombination among distantly related taxa. Similarly, EVEs that are placed among marine sequences are widespread in multicellular eukaryotes including within the phyla Placozoa, Mollusca, Nematoda, Cnidaria and Arthropoda (Figure 3.6). This suggests that EVEs arose from ssDNA viruses that are pathogens of invertebrates, or were acquired from their prey. For example, shotgun genome sequencing revealed at least 57 circovirus-like sequences in the sea louse Lepeophtheirus salmonis and 17 in the freshwater hydroid Hydra magnipapillata (Liu et al. 2011). Crustacean zooplankton such as copepods and ostracods belong to the phylum Arthropoda, as do insects. Sequences found within the genome of the mite Varroa destructor were placed into two groups, one just outside of the animal-infecting circoviruses, and the other just outside of the cycloviruses.

Diatom viruses and other heterokonts. Sequences found within the genomes of Entamoeba spp., Hydra magnipapillata and Blastocystis hominis placed with the diatom-infecting Bacillariodnaviruses (Figure 3.6). H.
magnipapillata feeds on phytoplankton, which could explain the presence of sequences similar to diatom viruses within its genome. Blastocystis hominis is classified within the same phylum as diatoms, Heterokonta (Denoeud et al. 2011), but is an intestinal parasite that feeds on other microbes. Although it is possible that EVEs within B. hominis were acquired from ingesting infected prey, it is also possible that viruses related to Bacillariodnaviruses may infect heterokont lineages other than diatoms.

Circovirus-like sequences and animals. Finally, EVEs found within the phylum Chordata (all vertebrates and tunicates) placed within the terrestrial circoviruses and cycloviruses (Figure 3.6). The only exception is a Xenopus tropicalis (frog) sequence that placed with marine sequences from Group IX, which could have been misplaced due to multiple deletions in the sequence. The placing of all Chordata EVE sequences within animal-infecting groups agrees with the hypothesis that circoviruses co-evolved with their hosts as they transitioned to land.

3.3.2  ssDNA viruses co-evolved with their hosts from the ocean to the land: an evolutionary model

The EVEs found in animal genomes grouped with animal viruses, and the EVEs found in protists clustered with environmental sequences, suggesting a co-evolution of ssDNA viruses with their hosts. Phylogenetic analysis of the replication protein shows that marine sequences are outside of the clade containing sequences from viruses infecting animals and plants (Chapter 2, Figure 2.8). Both of these arguments suggest that ssDNA viruses co-evolved with their hosts from the ocean to the land. To test this hypothesis, we compared marine and terrestrial replication protein sequences.

3.3.2.1  Nanovirus and circovirus replication protein homologues are found in marine environments.

Even though the replication protein (Rep) is quite divergent among viral families, it contains five major motifs that can be aligned.
Motifs I and V do not have any known associated function. Motif II is involved in the initiation and termination of rolling-circle DNA replication, possibly by binding the Mg²⁺ or Mn²⁺ ions that are required for its catalytic activity (Ilyina and Koonin 1992). Motif III contains the active-site tyrosine that attaches covalently to DNA (Ilyina and Koonin 1992; Koonin and Ilyina 1992). Finally, motif IV is a Walker A motif and acts as a putative NTP-binding site (Mankertz et al. 1998). An alignment of the replication protein from terrestrial and marine circoviruses, nanoviruses, and caliciviruses (Figure 3.2) illustrates how circoviruses may have originated through recombination of ancestral nanoviruses and caliciviruses (Gibbs and Weiller 1999). The N-terminal region of the Rep protein of circoviruses containing motifs I-IV is most related to those encoded by segmented genomes of ssDNA nanoviruses, while the C-terminal region containing motif V is related to the RNA-binding helicase 2C of positive-sense ssRNA caliciviruses (Gibbs and Weiller 1999) (Figure 3.2). Although caliciviruses have been isolated that infect California sea lions, fur seals, and an opaleye fish (Smith and Boyt 1990), calicivirus sequences have not been found within marine (Culley et al. 2006) and freshwater (Djikeng et al. 2009; Bolduc et al. 2012) RNA viral metagenomic datasets. However, there are still very limited metagenomic data for marine RNA viruses. Moreover, caliciviruses have extremely high mutation rates. For example, Norwalk virus has about 10⁻² nucleotide substitutions/site/year, which is high even for RNA viruses (Victoria et al. 2009). Consequently, the absence of calicivirus sequences in marine metagenomic data sets is likely a combination of the limited data available, and high mutation rates that make marine caliciviruses difficult to recognize through BLAST searches. Recombination is common among ssDNA viruses (Martin et al.
2011) due to the different activities performed by Rep, which include DNA binding, nicking and ligating (Gilbert et al. 2006). Recombination can occur between ssDNA viruses at various recombination “hot spots”, including the origin of replication, which is defined by a 10–30 nt long highly conserved inverted repeat (nonanucleotide) sequence capable of forming a hairpin structure (Martin et al. 2011). Furthermore, recombination between ssDNA and other unrelated viruses may be more common than previously thought. Recently, Diemer and Stedman (2012) used metagenomics to look at viral communities from a hot acidic lake, and found a sequence that coded for a ssDNA Rep protein, and a capsid similar to that of an RNA tombusvirus.

Figure 3.2  Alignment of the inferred amino-acid sequences for the replication protein from terrestrial and marine nanoviruses, circoviruses, cycloviruses and caliciviruses.

In the alignment, darker colours indicate a higher similarity, with green representing regions of the sequences similar to nanoviruses, and orange representing sequence regions similar to caliciviruses. The vertical dashed line indicates the site of recombination of an ancestral relative of nanoviruses and caliciviruses. Grey boxes represent, in order, the five conserved motifs, with motif V being absent from nanoviruses. Left: The symbols above the name of each virus group represent the different types of sequences that are found, and are used in the evolutionary model.

3.3.2.2  Terrestrial and marine viruses share a common ancestor.

Phylogenetic analyses of the replication protein have led to the hypothesis that circoviruses originated from nanoviruses (Gibbs and Weiller 1999), but an alternative explanation is that Rep-containing ssDNA viruses originated in the ocean and acquired the status of terrestrial viruses as their hosts transitioned from the ocean to the land.
To test support for this hypothesis, phylogenetic analysis was done for each of the major groups of Rep proteins, namely the “nanovirus-like” and the “circovirus-like”, using marine circovirus sequences as the outgroup for nanovirus-like sequences, and vice versa.

Marine nanoviruses. There are two groups of marine nanovirus-like Rep proteins (Figure 3.2): “nanovirus-like Group I” harbours a very similar organization and similar motifs as the nanoviruses, while “nanovirus-like Group II” is missing motif I and contains a deletion between motifs III and IV (Figures 3.2 and 3.3B). Phylogenetic analysis supports the view of a common origin for nanovirus-like Rep proteins. Moreover, sequences from terrestrial nanoviruses infecting plants cluster tightly together, suggesting a common ancestor that codiversified with its host. The marine nanovirus-like Group I, which comprises most of the marine nanovirus-like sequences, contains a deletion between the motifs III and IV (Figure 3.2), when compared to the nanovirus Group II and the circoviruses. The marine nanoviruses Group I consists of a supported clade outside the marine Group I, implying that the deletion between motifs III and IV occurred as a separate event, after the recombination between a calicivirus and a nanovirus to give rise to Group II. Similarly, Group I is divided into two groups, one of them lacking motif I (Figure 3.3). The fact that all of the sequences missing motif I cluster together suggests that the motif was lost only once. Comparison of the highly conserved motif IV in the nanovirus-like group shows divergence between marine and terrestrial nanoviruses, since none of the marine sequences share the same motif IV as the terrestrial nanoviruses. Motif IV is likely specific to the type of host, as it is involved in NTP-binding (Mankertz et al. 1998), and is tightly coupled to the host polymerase.
Motif IV is likely closely linked to its host, which suggests that SI04410 may belong to a virus infecting a picobiliphyte, because of its similarity to MS584-5, the only marine nanovirus sequence that is linked to a host. MS584-5 is from a ssDNA virus that was sequenced along with its probable picobiliphyte host (Yoon et al. 2011). Picobiliphytes are a recently discovered group of marine planktonic protists that were originally thought to be photosynthetic, since they contain picobilins (Not et al. 2007), a pigment found in red algae. However, single-cell sequencing has shown no evidence of plastid DNA or plastid-targeted proteins, which suggests that picobiliphytes are likely heterotrophs (Yoon et al. 2011).

Figure 3.3  A) Phylogeny and B) conserved motifs of the replication protein of nanovirus-like sequences.

A) Rooted maximum-likelihood (LG model, 100 bootstrap replicates) phylogenetic tree of the amino-acid sequences from marine (aqua) and terrestrial nanoviruses (green). The tree is rooted by sequences from the circovirus-like group (purple). B) Terrestrial nanoviruses (green), circoviruses (red), cycloviruses (dark red). Motifs are highlighted in grey if they are also found in the Circoviridae.

Marine circoviruses. The Rep circovirus-like sequences are more divergent than those in the nanovirus-like group, resulting in a phylogeny with many poorly supported branches (Figure 3.4). Therefore, although phylogenetic analysis places most of the marine circovirus-like sequences outside of the terrestrial ones, the poor bootstrap values lend little support for a marine origin of terrestrial circoviruses. Nonetheless, there is strong bootstrap support that the Rep proteins of marine and terrestrial circoviruses have a common origin. As well, the fact that circoviruses group with cycloviruses as a well-supported sister clade (Delwart and Li 2011) provides strong evidence of a single ancestor for terrestrial circoviruses.
Unlike nanoviruses, marine and terrestrial circoviruses share a common motif IV, while most differences are within motif V that originated from RNA caliciviruses. No specific function has been associated with motif V. No circovirus isolates have been described from marine environments, although a cyclovirus has been linked to zooplankton (Dunlap and Breitbart 2010), and another to a dragonfly (Rosario et al. 2011). Given the diversity and wide distribution of marine circovirus-like Rep sequences, it seems likely that they originated from viruses infecting a wide variety of marine eukaryotes, including zooplankton and protists.

Figure 3.4  A) Phylogeny and B) conserved motifs of the replication protein of circovirus-like sequences.
A) Rooted maximum-likelihood (LG model, 100 bootstrap replicates) phylogenetic tree of the amino-acid sequences from marine (purple) and terrestrial circoviruses (red) and cycloviruses (dark red). The tree has been rooted with sequences from the nanovirus-like group (aqua). B) Motifs are in red or dark red if they were found in terrestrial circoviruses or cycloviruses, and green if they were found in nanoviruses. Motifs are highlighted in grey if they are also found in other ssDNA families (i.e. Nanoviridae).

3.3.3  Evolutionary model of the origin of ssDNA viruses

Although most ssDNA viruses replicate via the rolling-circle mechanism and likely share a common ancestor, phylogenetic analyses are complicated by high sequence divergence. The sequences in metagenomic datasets are from recent viral infections; yet, we are using them to try to reconstruct evolutionary history over hundreds of millions of years. Clearly, most of the diversity that existed has been lost. It is particularly difficult to infer the evolution of ssDNA viruses because of their high mutation rates, which can exceed those of RNA viruses. For example, Porcine Circovirus 2 has an estimated mutation rate of 10⁻³ substitutions/site/year (Firth et al. 
2009), while nanovirus populations of Faba Bean Necrotic Stunt Virus have average mutation rates of 104 substitutions/site/year (Grigoras et al. 2010). Moreover, ssDNA viruses generally have high burst sizes. For instance, the ssDNA virus ClorDNAV, infecting the diatom Chaetoceros lorenzianus, has a burst size of 2.2 × 10⁴ virus particles/infected cell (Tomaru et al. 2011). High mutation rates, coupled with very large burst sizes, can explain the wide variety and high divergence of Rep proteins that are observed in the environment. The proposed evolutionary model of a marine origin of nanoviruses and circoviruses, which infect plants and animals, respectively (Figure 3.5), is the most parsimonious explanation for a common origin of Rep-containing ssDNA viruses. It explains the extensive diversity of marine ssDNA viruses, as well as the tight phylogenetic clustering of the terrestrial viruses. A common marine origin would also be consistent with the evolutionary history of life on Earth (Forterre and Gribaldo 2007). I hypothesize that all ssDNA viruses share a common marine ssDNA virus ancestor that gave rise to modern-day marine and terrestrial ssDNA viruses in two separate events. The first event led to the evolution of the two marine nanovirus-like lineages, Nanovirus-like Group I and Group II (with the deletion of a 36-57 aa segment). Subsequently, an ancestor to the Nanovirus-like Group II lineage transitioned with its host from the ocean to the land and gave rise to nanoviruses. In a second independent event, an ancient ssDNA virus recombined with an ancestral calicivirus, resulting in a circovirus-like ancestor that gave rise to the current marine circoviruses, which in turn transitioned to land with their hosts to become the terrestrial circoviruses and cycloviruses.

Figure 3.5  Evolutionary model of the origin of ssDNA viruses similar to nanoviruses and circoviruses.
Rectangles represent replication sequences from viruses (Blue: Marine; Teal: Nanovirus-like; Green: Nanovirus; Purple: Circovirus-like; Red: Circoviruses; Dark red: Cycloviruses; Orange: RNA caliciviruses; Grey: Hypothetical/unknown sequences). Blue and yellow backgrounds represent the ocean and the land, respectively. The black arrow on the right represents time.

3.3.4  Absence of geminiviruses in marine environments: artifact or reality?

Geminiviruses also have replication proteins that are similar, although distantly related, to those of nanoviruses and circoviruses. Geminivirus Rep proteins are most similar to those of phytoplasmal and red algal plasmids (Krupovic et al. 2009; Nawaz-ul-rehman and Fauquet 2009; Nash et al. 2011), and it has been hypothesized that a phytoplasma plasmid acquired a capsid-coding gene to give rise to the ancestor of geminiviruses (Krupovic et al. 2009). Very few sequences similar to those found in geminiviruses were recovered in our ssDNA metagenomic data, which may be due to several reasons. First, the whole genome amplification that was used in the construction of the metagenomic libraries may have favoured the amplification of the smaller genomes of circoviruses (1.8-2.3 kb) and nanoviruses (6-11 molecules of 1 kb per virus) rather than of geminiviruses (4.8-5.6 kb). Secondly, geminivirus genomes are more similar to plasmids found in bacterial parasites of plants (phytoplasma) than they are to nanovirus and circovirus genomes. For example, the plasmid pOYW found in phytoplasma from yellow onions, like geminiviruses, encodes a Rep protein with an N-terminal half (192 aa out of 377 aa) that is related to Rep proteins in the pLS1 family of plasmids, while the C-terminal region (100 aa) is related to the Rep of circoviruses and the helicases of picorna-like viruses (Oshima et al. 2001; Gibbs et al. 2006). 
Related marine plasmid sequences were not found in our metagenomic data, and should not have been since the method is specific for ssDNA. Marine plasmid diversity is still poorly studied, although a recent metagenomic analysis revealed one contig similar to circoviruses, but none to geminiviruses (Ma et al. 2012). Finally, geminiviruses are more closely related to diatom viruses (bacillariodnaviruses) than to circoviruses and nanoviruses. We did not recover bacillariodnavirus-like sequences in our dataset. These viruses are unique in that their genomes have a ~1 kb segment of dsDNA. As the ssDNA was extracted based on its charge, if marine gemini-like viruses also have dsDNA regions, they may not have been recovered. Furthermore, it is also possible that marine gemini-like viruses are present at very low concentrations; to date, there are no estimates of their abundances in natural waters. Geminiviruses are also the result of a recombination event. Sequence similarity between the replication protein of geminiviruses and plasmids in phytoplasma and red algae (Krupovic et al. 2009; Nawaz-ul-rehman and Fauquet 2009; Nash et al. 2011) provides insight into the origin of geminiviruses. There are two contradictory hypotheses on the origin of geminiviruses. First, the phylogenetic closeness of the Rep of geminiviruses and phytoplasmal plasmids led to the proposal that the acquisition of a capsid-encoding gene by a phytoplasmal plasmid may have given rise to the ancestor of geminiviruses (Krupovic et al. 2009). Alternatively, the plasmids may have acquired a Rep gene by horizontal transfer from a geminivirus (Saccardo et al. 2011). The fact that the few marine geminivirus sequences that have been found are more similar to plasmids suggests that geminiviruses come from a phytoplasma plasmid that acquired a capsid-coding gene (Krupovic et al. 2009), and that this acquisition may have happened on land.
3.4  Conclusions

This chapter presents the first in-depth study comparing marine and terrestrial Rep sequences from ssDNA viruses and environmental data, with the goal of examining the evolutionary relationships among marine and terrestrial ssDNA viruses and their hosts. The data support an evolutionary model in which marine viral sequences are ancestral to nanoviruses and circoviruses that infect higher plants and animals, respectively. This supports a model where ssDNA viruses transitioned from the ocean to the land along with their hosts. Also, endogenous viral elements (EVEs) from marine organisms grouped with marine viruses, with EVEs from marine phytoplankton and zooplankton specifically grouping with nanoviruses and circoviruses, respectively. From the work presented here, it was possible to infer an evolutionary history that suggests that ssDNA viruses infecting terrestrial plants and animals arose from viruses infecting marine photosynthetic and heterotrophic protists. Moreover, the analyses suggest that ssDNA viruses infect a wide range of marine hosts and consequently are likely important players influencing phytoplankton and zooplankton that form the base of marine food webs.

Chapter 4: Genetic relatedness of gokushoviruses in marine environments

4.1  Synopsis

In this chapter I describe the use of metagenomic analysis and amplicon sequencing of purified viral ssDNA to show that bacteriophages belonging to the family Microviridae (subfamily Gokushovirinae) are genetically diverse and widespread members of marine microbial communities. Metagenomic analysis of samples from the Gulf of Mexico and the coastal waters of British Columbia, Canada, revealed numerous sequences belonging to gokushoviruses and allowed the assembly of five genomes with an organization similar to chlamydiamicroviruses.  
Fragment recruitment to these genomes from different metagenomic data sets is consistent with dominance of different genotypes of gokushoviruses in different oceanic regions. Conservation among the environmental genomes allowed degenerate primers to be designed that were used to amplify an 800 bp fragment from the gene encoding the major capsid protein from marine samples. Phylogenetic analysis revealed that most marine sequences were distantly related to those from cultured representatives. As well, the sequences fell into at least seven distinct evolutionary groups, most of which are represented by one of the assembled genomes. These results have greatly expanded the known sequence space for the gokushoviruses, revealed much greater genetic richness than was known to exist, implied biogeographic differences among the most abundant gokushovirus genotypes, and opened a new window on another member of the virosphere.

4.2  Material and methods

4.2.1  Collection and Preparation of Samples

Samples were collected from five distinct geographic regions: the coastal waters of British Columbia (82 samples), the Gulf of Mexico (41 samples), Saanich Inlet (seven samples), lakes in British Columbia (14 samples), and the Arctic Ocean (56 samples) (Appendix B). Water samples (~20 to ~200 L) were collected using GO-FLO or Niskin bottles either mounted on a CTD rosette or directly on a hydrographic wire (Saanich Inlet), or by bucket from the surface (lake samples). For each sample, the viruses were concentrated ~100-fold (~200 mL final volume) using ultrafiltration (Suttle et al. 1991). Briefly, particulate matter was removed by pressure filtering (<17 kPa) the samples through 142-mm diameter glass-fiber filters (MFS GC50, nominal pore size 1.2 µm) and polyvinylidene difluoride filters (Millipore GVWP, pore size 0.22 µm) connected in series. 
The viral size fraction in the filtrate was then concentrated by ultrafiltration through a 30 kDa molecular weight cut-off cartridge (Amicon S1Y30, Millipore), and stored at 4 °C in the dark until processed. In order to integrate variation within a region, numerous virus concentrates (VCs) collected from different locations and at different times within a geographic region were combined into a single mix. These correspond to the same mixes that were used for the Angly et al. (2006) viral metagenomic libraries where the first ssDNA viruses in marine metagenomic data were observed. VCs from the Strait of Georgia (SOG) and surrounding inlets and bays were pooled into the following four mixes by combining 2 mL of each VC: BC1-1999 (23 samples), BC3-2000 (26 samples), BC4-2004 (16 samples) and BC2-Low salinity (19 samples). Similarly, we made four mixes from the Gulf of Mexico: Eastern Gulf of Mexico (eight samples), Northern GOM (six samples), Western GOM (six samples) and Texas Coast (13 samples); and three mixes from the Arctic Ocean: Beaufort Sea (20 samples), Chukchi Sea (14 samples) and High Arctic (22 samples). To look at the diversity of gokushoviruses in freshwater, two mixes from British Columbia lakes were made: Chilliwack (six samples) and Cultus (eight samples) Lakes. An extensive description of all the samples that were combined in each mix is presented in Appendix B. Saanich Inlet is a unique anoxic fjord where samples are collected on a monthly basis (Zaikova et al. 2010). For this study, we used the surface samples for the months of April 2007 and January, March, May, July, August and November 2008.

4.2.2  ssDNA preparation

Pooled mixes (10 mL) from each region (BC, GOM, LA, ARC) were filtered using a 0.22-µm pore-size syringe filter (PVDF; Millipore). The ssDNA was extracted using a QIAprep Spin M13 kit (Qiagen), according to the manufacturer's protocol. 
To create double-stranded DNA, 5 µL of the ssDNA preparation was subjected to a whole genome amplification (WGA) reaction (Repli-g Mini kit), and purified using a QIAamp DNA Mini kit. Denaturation of dsDNA was limited during WGA by adding the stop solution N1 immediately after the denaturation solution D1. Since WGA creates high concentrations of ssDNA, a renaturation step was added by warming the purified DNA to 94 °C followed by slow cooling to 4 °C in 1 °C, 30 s steps. The DNA was kept at -20 °C until further use. The Saanich Inlet (SI) samples were processed as described above, except 10 mL of individual VCs were used instead of a VC mix.

4.2.3  Genome analysis, binning and assembly

Metagenomic libraries were constructed from the SI, SOG, and GOM ssDNA whole genome amplification products. The purified WGA DNA was resuspended in 100 µL of RNAse- and DNAse-free water (Invitrogen) and concentrated using a Millipore YM-30 Microcon centrifugal filter to a final volume of ~50 µL; 3 to 5 µg of DNA from each sample was sent for pyrosequencing using the Roche 454 FLX instrumentation with Titanium chemistry at Genome Québec, McGill University (SOG) and the Broad Institute at the Massachusetts Institute of Technology (GOM and SI). The sequences were quality and linker trimmed, and assembled into contiguous sequences (contigs) using the Newbler Assembler. The individual reads and assembled sequences were compared to a database of all available genomes from viruses belonging to the Microviridae using the tBLASTx algorithm with an e-value cutoff of 10⁻⁵. Reads with significant similarity to gokushoviruses were aligned onto the assembled contigs using the add454Reads.perl script and were reassembled into new contigs using the phredPhrap.perl script of the Consed package (Gordon 2003). Additional contig analyses (BLAST, circularization of the genomes, annotations, alignments, and phylogeny) were performed within the Geneious Pro package v5.6 (Biomatters). 
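The similarity screen described above can be illustrated as a filter on tabular BLAST output. This is only a minimal sketch, not the pipeline used in this study: it assumes the hits were exported in the standard 12-column tabular format (`-outfmt 6`), and the function name `filter_hits`, the read names and the example rows are hypothetical.

```python
import csv
import io

# Column order of the standard 12-column BLAST tabular format (-outfmt 6).
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def filter_hits(handle, max_evalue=1e-5):
    """Return the lowest e-value hit per query, keeping hits <= max_evalue."""
    best = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        hit = dict(zip(COLUMNS, row))
        evalue = float(hit["evalue"])
        if evalue > max_evalue:
            continue  # not significant at the chosen cutoff
        query = hit["qseqid"]
        if query not in best or evalue < float(best[query]["evalue"]):
            best[query] = hit
    return best


# Hypothetical example rows (tab-separated); all identifiers are made up.
example = (
    "read1\tChp2\t45.2\t120\t60\t2\t1\t360\t100\t459\t2e-12\t70.1\n"
    "read1\tphiMH2K\t40.0\t110\t62\t3\t1\t330\t90\t419\t3e-07\t52.0\n"
    "read2\tSpV4\t30.1\t80\t50\t4\t5\t244\t10\t249\t0.02\t30.5\n"
)
hits = filter_hits(io.StringIO(example))
for query, hit in hits.items():
    print(query, hit["sseqid"], hit["evalue"])
```

In this toy input, only the first read passes the 10⁻⁵ cutoff, and its lowest e-value hit is retained. Reads kept this way would then be the candidates for re-alignment onto the contigs.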
Metagenomic data from Angly et al. (2006) were downloaded from the CAMERA database (Seshadri et al. 2007) and aligned onto the assembled genomes using the MosaikAssembler (http://bioinformatics.bc.edu/marthlab/Mosaik) with the following parameters: 30 mismatches per 100 nucleotides for short 454 reads (SAR, GOM and BBC assembled genomes), and 120 mismatches per 400 nucleotides average read length for longer 454 read lengths (ssDNA SOG, SI and GOM assembled genomes). Finally, pairwise comparisons of the genomes were performed with the Artemis comparison tool (ACT) using tBLASTx with an e-value < 10⁻³ (Carver et al. 2005).

4.2.4  Primer Design and PCR Amplification

Two forward (MicroVP1-F1, 5'-CGN GCN TAY AAY TTR ATH-3'; MicroVP1-F2, 5'-AGN GCN TAY AAY TTR CTN-3') and two reverse (MicroVP1-R1, 5'-TTY GGN TAY CAR GAR AGN-3'; MicroVP1-R2, 5'-NCT YTC YTG RTA NCC RAA-3') primers with respective degeneracies of 256, 215, 256 and 256 were designed from alignments of the inferred amino acid sequences of the major capsid protein (VP1) of the chlamydiaphages Chp1, Chp2, Chp3, Chp4, phiCPAR31 and phiCPG1, Spiroplasmaphage Sp4, bdellovibriophage phiMH2K and the Sargasso Sea Chp1-like assembled genome. The primers amplify a ~800 bp VP1 gene fragment from the subfamily Gokushovirinae in the Microviridae. Prior to use in PCR reactions, the purified WGA DNA was resuspended in 100 µL of TE, and 10 µL was used as a template in each PCR reaction mixture consisting of Taq DNA polymerase assay buffer [20 mM Tris·HCl (pH 8.4), 50 mM KCl], 1.5 mM MgCl2, 125 µM of each deoxyribonucleoside triphosphate, 1 µM each of the MicroVP1-F1, MicroVP1-F2, MicroVP1-R1 and MicroVP1-R2 primers and 2.5 U of PLATINUM Taq DNA polymerase (Invitrogen). Negative controls contained all reagents except DNA template. 
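The degeneracy of a primer is the number of distinct sequences it encodes, obtained by multiplying the number of bases allowed at each position under the IUPAC nucleotide code. The short sketch below shows that calculation for the four primer sequences as printed above; the helper name `degeneracy` is hypothetical and the code is an illustration rather than the procedure used in this study.

```python
from math import prod

# Number of bases represented by each IUPAC nucleotide code.
IUPAC_COUNTS = {
    "A": 1, "C": 1, "G": 1, "T": 1,
    "R": 2, "Y": 2, "S": 2, "W": 2, "K": 2, "M": 2,
    "B": 3, "D": 3, "H": 3, "V": 3,
    "N": 4,
}


def degeneracy(primer):
    """Count the distinct sequences encoded by a degenerate primer."""
    bases = primer.replace(" ", "").upper()
    return prod(IUPAC_COUNTS[base] for base in bases)


# Primer sequences copied from the text (5' to 3', codon spacing kept).
primers = {
    "MicroVP1-F1": "CGN GCN TAY AAY TTR ATH",
    "MicroVP1-F2": "AGN GCN TAY AAY TTR CTN",
    "MicroVP1-R1": "TTY GGN TAY CAR GAR AGN",
    "MicroVP1-R2": "NCT YTC YTG RTA NCC RAA",
}
for name, seq in primers.items():
    print(name, degeneracy(seq))
```

Counted this way, both reverse primers have a degeneracy of 256, whereas the forward primers as printed give 384 (F1) and 512 (F2), so the quoted forward-primer values are worth checking against the original primer design.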
The samples were denatured at 94 °C for 3 min, followed by 35 cycles of denaturation at 94 °C for 30 s, annealing at 50 °C for 30 s, and elongation at 72 °C for 50 s, with a final elongation step of 72 °C for 5 min.

4.2.5  Clone Library Construction and RFLP Analysis

Restriction fragment length polymorphism (RFLP) was used to select the different VP1 amplicons for sequencing. PCR amplicons were purified with a MinElute PCR purification kit (Qiagen), ligated into pCR2.1-Topo (Invitrogen), and used to transform chemically competent Escherichia coli Top10 cells. For each sample, thirty clones were checked by colony PCR to verify that they contained an insert of the correct size. RFLP analysis was then performed on twenty positive clones. For each RFLP reaction, 15 µL of colony PCR product was digested with AluI (New England BioLabs) in a reaction containing 1 U/µg of DNA and 1X NEBuffer 4 (20 mM Tris-acetate, 50 mM potassium acetate, 10 mM magnesium acetate, 1 mM DTT, pH 7.9) by incubating at 37 °C for 16 h, followed by heat inactivation at 65 °C for 20 min. RFLP products were separated on a 2% agarose gel in 0.5X TBE (9 mM Tris base, 9 mM boric acid, 2 mM EDTA, pH 8.0) running at 110 V for about 2 h. Sequencing of representative clones from each unique restriction pattern confirmed that each pattern could be considered as an operational taxonomic unit (OTU). Forward and reverse sequences (~800 bp) were obtained for each RFLP pattern using Big-Dye Terminator Cycle Sequencing (Applied Biosystems) and ABI 373 Stretch or ABI Prism 377 sequencers (Nucleic Acid Protein Service Unit, UBC).

4.2.6  Phylogenetic Analysis

For the whole-genome phylogeny, non-coding sequences were removed and the five major open reading frames were ordered. The sequences were aligned using MAFFT (Katoh et al. 2002) and maximum likelihood analysis with 100 bootstrap replicates was performed using PhyML (Guindon et al. 2010). 
The nucleotide sequences for the major capsid protein, VP1, from previously sequenced isolates, environmental sequences, and the PCR products from this study were trimmed to the PCR-product length (~800 bp) and aligned using MAFFT (Katoh et al. 2002). The alignment was cured with GBlocks using the less stringent setting (Talavera and Castresana 2007). Bayesian phylogenetic analyses were performed on the cured alignment with MrBayes (Huelsenbeck and Ronquist 2001). In order to examine genotypic variation among locations, phylogenetic analysis was performed on nucleotide sequences because of the additional genetic information provided by the third base of the codons. MrBayes uses a Markov chain Monte Carlo (MCMC) approach to approximate posterior probabilities. Under the HKY85 substitution model with a gamma distribution with invariant sites, two independent analyses of 4 (1 cold and 3 heated) MCMC chains with 20,000,000 cycles were run to reach a standard deviation <0.01, sampled every 1000th cycle. The consensus tree was generated in Geneious with a burn-in of 25%. Trees were viewed in FigTree (http://tree.bio.ed.ac.uk/software/figtree/).

4.2.7  Nucleotide Sequence Accession Numbers

The five complete gokushovirus genomes and the 43 environmental PCR product sequences were submitted to GenBank and are available under the accession numbers KC131021-KC131025 and KC130978-KC131020, respectively.

4.3  Results and discussion

4.3.1  Assembly of complete gokushovirus genomes

Sequence analysis of ssDNA metagenomic libraries from the Strait of Georgia (SOG), Saanich Inlet (SI), and the Gulf of Mexico (GOM) recovered 1733, 374 and 194 sequences, respectively, that were significantly similar to sequences from viruses belonging to the Microviridae family, with >90% of them being most similar to sequences belonging to the chlamydiamicroviruses and other gokushoviruses. 
From these datasets, five complete circular genomes were assembled (two from SOG, two from SI and one from GOM) with at least 3-fold coverage. The genome sizes varied from 4,062 bp to 5,386 bp, and were uniformly shorter than those of previously sequenced isolates (Figure 4.1). Assembly of these genomes represented the accumulation of 95 reads for SOG-1, 58 for SOG-2, 53 for SI-1, 48 for SI-2, and 38 for GOM.

Figure 4.1  Gokushoviruses share a similar genome organisation.
Whole genome phylogeny (Maximum likelihood, 100 bootstrap replicates, HKY85 model) of the coding sequences of gokushoviruses, rooted with the microvirus phiX174 (left) and pairwise comparisons of the five environmental gokushovirus genomes assembled from this study (bold) with the isolates and other environmental genomes. Small overlapping genes of unknown function are represented by short black arrows. Scale bar represents genome length in base pairs. Pairwise comparison of the genomes is represented by grey shadings linking the sequences where darker zones mean stronger similarity.

Even though there was only ~20% similarity at the nucleotide level among the assembled genomes, the chlamydiaphages and bdellovibriophage phiMH2K, the gene organization was remarkably similar, and the inferred proteins encoded were highly conserved (Figure 4.1), implying a common evolutionary origin. At least five viral proteins are required for replication of gokushoviruses: VP1 is the major capsid protein; VP2 is hypothesized to be involved in host recognition (Chipman et al. 1998) and virus attachment (Everson et al. 2003); VP3 is a scaffolding protein that is found in the procapsid only and not in mature virions (Clarke et al. 2004); ORF4 is a replication initiator involved in ssDNA synthesis (Garner et al. 2004; Liu et al. 2000; Salim et al. 2008); ORF5 is involved in DNA packaging (Garner et al. 2004; Liu et al. 2000; Salim et al. 2008). 
The presence of all five essential genes in the assembled genomes strongly suggests they represent complete sequences from extant viruses in the environment. Whole genome phylogeny revealed that the environmental genomes cluster more closely with the bdellovibriophage phiMH2K than with the chlamydiaphages (Figure 4.1). Whole genome pairwise comparisons showed that VP2 and ORF4 are the least conserved genes and that chlamydiamicroviruses are very similar to each other (84-86% similarity among chlamydiaphages, 24-44% similarity among environmental phages) (Figure 4.1). These results suggest that the hosts of these gokushoviruses are more closely related to Bdellovibrio sp. than to Chlamydia sp. There were only two genomes with a noticeable recombination event: phiMH2K, which infects the bacterial parasite Bdellovibrio bacteriovorus, and the environmental genome SAR phi2, in which ORF4 and ORF5 are inverted. These two genomes cluster together, suggesting a common evolutionary history (Figure 4.1). All of the environmental genomes were shorter than those from cultured isolates. Some, such as SI-1 and SOG-1, had multiple overlapping genes of unknown function. It is postulated (Rokyta and Burch 2006) that because of their strictly lytic life cycles, small genome sizes, and low rates of horizontal gene transfer, ssDNA microviruses such as phiX174 and G4, infecting E. coli, evolve differently than dsDNA viruses (Breitbart and Rohwer 2005; Comeau and Buenaventura 2005; Hambly and Suttle 2005). Instead, novel genes were predicted to originate by overprinting. Evaluation of the rates of synonymous and non-synonymous substitution in microviruses yields evidence for overlapping genes under positive selection in one frame and purifying selection in the alternative frame (Pavesi 2006). 
There is selection for specific codon patterns in viruses similar to Bdellovibrio phage phiMH2K (Carbone 2008) and the ORF4 replication initiation protein is evolving faster than the other phage proteins (Read et al. 2000). This codon selection is tightly linked with the appearance of overlapping genes (Pavesi 2006).

4.3.2  Host specificity and geographic distribution of environmental gokushovirus genomes

Isolates in the Gokushovirinae infect parasitic bacteria, such as Chlamydia sp., Bdellovibrio sp. and Spiroplasma sp., with host specificity likely being dictated by variable genomic regions. To investigate conserved and variable motifs, ssDNA viral metagenomic datasets (from this study) and other viral metagenomic datasets from the Sargasso Sea (SAR), Gulf of Mexico (GOM) and British Columbia (BBC) (Angly et al. 2006) were scaffolded against the assembled viral genomes (Figure 4.2). First, coverage was uniform when metagenomic data were scaffolded against a genome assembled from metagenomic data collected in the same region (Figure 4.2). For the genomes SOG-1, SOG-2, and SI-2, very few reads from datasets other than the one they were assembled from scaffolded against the assembled genomes, suggesting that these genomes are not widespread in the environment (Figure 4.2). Very few reads from the GOM and ssDNA GOM metagenomic data aligned with the SOG and SI genomes (Figure 4.2). In contrast, reads from all of the metagenomic data sets aligned on the GOM genome (Figure 4.2), indicating a wider geographic distribution of similar viruses.

Figure 4.2  Regions of conservation in environmental gokushovirus genomes.
Scaffolding of reads from different metagenomic datasets (ssDNA SOG, ssDNA from Strait of Georgia; ssDNA SI, ssDNA from Saanich Inlet; ssDNA GOM, ssDNA from Gulf of Mexico; BBC, virome British Columbia; GOM, virome Gulf of Mexico; SAR, virome Sargasso Sea) onto the assembled gokushovirus genomes A) GOM, B) SI-2, C) SOG-1, D) SOG-2, and E) SI-1. The alignment of reads allowed 30 mismatches for BBC, GOM and SAR and 120 mismatches for ssDNA SOG, ssDNA GOM and ssDNA SI.

For the assembled genomes SI-1 and GOM, the distribution of the reads recruited to the assembled genomes was very uneven, showing regions of higher conservation, such as VP1 and VP3. Very few reads were recruited to the VP2 region, indicating high variability in this gene. Since the VP2 sequences of the assembled genomes differ from those of isolates, and VP2 encodes the minor capsid protein involved in host recognition (Chipman et al. 1998), the environmental sequences are likely not from viruses infecting Chlamydia, Bdellovibrio, or Spiroplasma. Recruitment to ORF4 was limited to the source environment for the assembled genomes (i.e. no reads from British Columbia were recruited to the GOM genome ORF4). Thus, both the pairwise comparison (Figure 4.1) and the scaffolding of metagenomic reads (Figure 4.2) showed that VP2 and ORF4 are less conserved. No sequences similar to gokushoviruses were found in the Arctic Ocean viral metagenome (Angly et al. 2006), which is consistent with our data. We could not amplify gokushoviruses from two lakes in British Columbia. However, some sequences were found in a freshwater Antarctic lake (López-Bueno et al. 2009) and Lake Needwood, MD (Kuzmickas et al., unpublished), indicating that freshwater gokushoviruses are different from marine ones.
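The uneven recruitment described in this section can be summarised by counting the reads that overlap each gene and normalising by gene length. The sketch below is illustrative only and is not the analysis performed in this study: the function name `recruitment_per_gene`, the gene coordinates and the read positions are hypothetical, and real input would come from the read alignments.

```python
def recruitment_per_gene(alignments, genes):
    """Count aligned reads overlapping each gene, expressed per kb of gene.

    alignments: list of (start, end) read positions on the genome,
                1-based and inclusive.
    genes: dict mapping gene name to (start, end), 1-based and inclusive.
    """
    density = {}
    for name, (g_start, g_end) in genes.items():
        # A read is counted if it overlaps the gene by at least one base.
        n_reads = sum(1 for a_start, a_end in alignments
                      if a_start <= g_end and a_end >= g_start)
        gene_kb = (g_end - g_start + 1) / 1000
        density[name] = n_reads / gene_kb
    return density


# Hypothetical gene coordinates and read alignments, for illustration only.
genes = {"VP1": (1, 1500), "VP2": (1501, 2500), "ORF4": (2501, 3400)}
alignments = [(10, 400), (350, 760), (900, 1300), (1450, 1550), (2600, 3000)]
density = recruitment_per_gene(alignments, genes)
for gene, value in density.items():
    print(f"{gene}: {value:.2f} reads per kb")
```

With these made-up coordinates, VP1 recruits more reads per kb than VP2, which is the kind of contrast between conserved and variable genes discussed above.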
4.3.3  Genetic relatedness of the major capsid protein gene

To look more deeply at the genetic richness of gokushoviruses, degenerate primers were designed to amplify a ~800 bp fragment of the gene encoding the major capsid protein (VP1), which has conserved regions interspersed with variable regions. The primer location matches the more conserved regions of VP1 (Figure 4.2). For the assembled genomes, the phylogeny of VP1 and the whole viral genome are congruent, suggesting that the phylogeny of VP1 can be used to infer the phylogeny of the whole genome. PCR amplification was performed on mixed samples from the Strait of Georgia (BC; 4 mixes), Gulf of Mexico (GOM; 4 mixes), Saanich Inlet (SI; 11 samples), Arctic (ARC; 3 mixes) and lakes (LA; 2 mixes) (Supplementary Table 1). No products were amplified from the ARC, LA, BC-Low Salinity, or Eastern GOM mixes. This means that gokushoviruses are either absent from these environments, present at concentrations too low to be detected, or too divergent to be amplified by the primers. Twenty VP1 clones from each sample were digested using AluI to reveal 77 different RFLP patterns. Representative clones from each restriction pattern were sequenced (data not shown). Of these, 43 sequences were different (using a 98% similarity cutoff) at the nucleotide level, and thus identified as unique gokushovirus VP1 sequences. A cutoff of 98% was selected to minimize the impact of errors from PCR amplification and sequencing. Some sequences occurred in more than one sample from the same geographic region; for example, the sequence SI-07 was sequenced from multiple dates in Saanich Inlet (i.e. January, April, May and August), but no sequences occurred in more than one location (Figure 4.3). Some sequences were found in both BC and SI, but no sequences were found in GOM and BC or GOM and SI. The VP1 sequences that contained the primer sequences and translated into a functional protein were kept for further analysis. 
This corresponds to 85% of the sequences. The nucleotide alignment revealed multiple regions of conservation, as well as regions that were confined to specific groups, agreeing with previous observations made during initial primer design.

The amplified VP1 PCR products were compared to the VP1 sequences from the assembled genomes, the chlamydiamicrovirus isolates, as well as GenBank environmental sequences from modern stromatolites (Desnues et al. 2008) and freshwater (Lake Needwood, MD; Kuzmickas et al., unpublished), rice paddy soil (Kim et al. 2008) and marine genomes (Venter et al. 2004) containing the primer sequences. Phylogenetic analysis using both maximum-likelihood and Bayesian algorithms produced similar trees (Figure 4.3). Very few sequences were similar to known isolates, and gokushoviruses can be sub-divided into at least seven new supported groups containing more than two sequences (Figure 4.3). Five of these new clades are represented by an assembled genome or sequenced isolate (Figure 4.3). Many sequences, such as SOG4-04 and SI-18, were too divergent to be assigned to a cluster. Sequences from a given location were usually more closely related to ones from the same location; most GOM sequences clustered within groups ENV6 and ENV7, while ENV2 is represented exclusively by SOG sequences. Sequences found in more than one sample also usually clustered together; for example, the sequence SI-10 was found in Saanich Inlet in January, March and May, and SI-07 was found in Saanich Inlet in January, April, May and August. SI-07 and SI-10 clustered with the assembled genome SI-2, within the ENV5 group. Sequences from multiple environments recruited onto the SI-2 genome, suggesting that the ENV5 group contains sequences that are widespread. Likewise, the GOM assembled genome did not cluster with other GOM sequences, as it occurred in diverse environments (Figures 4.2 and 4.3). 
The sequences from modern stromatolites and marine genomes clustered together as specific phylogenetic groups.

Figure 4.3  Genetic relatedness of the major capsid protein gene VP1

Unrooted Bayesian phylogenetic analysis (20 000 000 MCMC generations with 25% burn-in; HKY85 model) of PCR products from GOM (red), Strait of Georgia (light blue), Saanich Inlet (dark blue), cultured isolates (black) and other environmental sequences (grey). Posterior probabilities of at least 90%, at least 75% and less than 75% are represented by black, dark grey and light grey branches, respectively. Coloured bubbles represent the different supported sub-groups of gokushoviruses with more than two sequences, containing a complete sequenced environmental genome (blue), only isolates (grey) or only PCR products (red). The scale bar represents 0.2 nucleotide changes per site.

Observing such a geographic distribution, with OTUs specific to certain environments, is unusual for marine viruses, as most of the viral families studied have shown a nearly global distribution (Short and Suttle 2005; Fuller et al. 1998; Hambly et al. 2001; Breitbart, Miyake, et al. 2004; Labonté et al. 2009; Chen et al. 1996; Chen and Suttle 1996; Short and Suttle 2002), but it has been observed in gokushoviruses before. Indeed, Tucker et al. (2011) observed differences in the depth distribution of gokushovirus sequences in the North Atlantic Ocean. These patterns in distribution likely reflect the distribution of hosts. Most of the VP1 sequences that were found in more than one sample were also most closely related to each other; these may reflect gokushoviruses that have a broader host range. In contrast, sequences specific to a single location probably infect bacteria that are specific to that environment. Although the dominant bacterial species are found everywhere (Biers et al. 2009; Rusch et al. 2007), the other bacteria vary depending on the habitat (Biers et al. 2009).
It has also been shown that latitude is correlated with species richness (Fuhrman et al. 2008). Gokushoviruses most likely infect bacteria that are not necessarily among the most abundant ones, but that depend on the environment they live in.

4.4  Conclusion

In summary, this study explored the genetic diversity and evolutionary relationships among environmental gokushoviruses. The assembly of five environmental genomes almost doubles the number of gokushovirus genomes in databases. Phylogenetic analysis and genome organization clearly show a common evolutionary history between marine gokushoviruses and viruses that infect the obligate intracellular parasitic bacteria Chlamydia and Bdellovibrio. Phylogenetic analysis of the MCP also reveals that the distribution of some gokushovirus evolutionary groups is environment dependent. Discovering the hosts of these viruses remains a high priority for developing a comprehensive understanding of the roles that these viruses play in marine ecosystems.

Chapter 5: Amplicon deep sequencing of gokushoviruses in the Saanich Inlet

5.1  Synopsis

The hosts of environmental gokushoviruses are unknown, but it is hypothesized that they occupy specialized niches, resulting in a limited geographic range (Angly et al. 2006; Tucker et al. 2011). As well, there is a difference between the genotypes found in surface and deep waters (Tucker et al. 2011). In Chapter 4, I assembled five new gokushovirus genomes and designed primers that amplify an 800-bp fragment from the gene encoding the major capsid protein (MCP). In this chapter, the primer set was used in combination with amplicon deep sequencing to examine the richness and diversity of gokushoviruses in water from the mixed oxic surface (10 m) and deeper anoxic (200 m) layers of Saanich Inlet.
The results show a diverse assemblage of gokushoviruses in Saanich Inlet, with greater richness at 10 m than 200 m, and provide persuasive evidence that the hosts of gokushovirus in the Saanich Inlet are most likely aerobic bacteria living in the surface layer.

5.2  Material and methods

5.2.1  Preparation of marine samples

Saanich Inlet is a steep-sided fjord with restricted circulation due to a shallow glacial sill located at the mouth of the inlet. During spring and summer, high primary productivity in surface waters combined with limited basin circulation contributes to the formation of deep-water anoxia (Anderson 1973). The anoxic zone is characterized by accumulation of CH4, NH3 and H2S (Anderson 1973; Lilley et al. 1982; Ward et al. 1989). Typically, in late summer oxygenated nutrient-rich water from Haro Strait (connecting Saanich Inlet to the Strait of Georgia) cascades over the sill, mixing the oxic and anoxic waters from top to bottom (Anderson 1973). On a monthly basis, ~20 L of water from Station S3 in Saanich Inlet (Figure 5.1) were filtered to remove cells using 0.22-µm Sterivex™ filter units (Millipore), and the viruses were concentrated from the filtrate by tangential flow filtration using a TFF 30 kDa cartridge (Millipore) to a final volume of ~250 mL, and stored at 4 °C until used. Amplicon deep sequencing was done on two composite samples of virus concentrates composed of samples collected from 10 m and 200 m, respectively, during April 2007, and February, March, April, June, August and November (200 m) or December (10 m) 2008.

Figure 5.1  Map of Saanich Inlet (top) and cross-sectional depth profile (bottom).

Station S3 (48° 35′ N, 123° 30′ W) was sampled in the current study. Map from Fisheries and Oceans Canada.

5.2.2  ssDNA purification

For each virus concentrate, viral ssDNA was extracted using a QIAprep Spin M13 kit (Qiagen), according to the manufacturer's protocol.
To create double-stranded DNA, 5 µL of the ssDNA preparation was subjected to multiple displacement amplification (MDA) (Repli-g Mini kit), and purified using a QIAamp DNA Mini kit. Denaturation of dsDNA was limited during MDA by adding the stop solution N1 immediately after the denaturation solution D1.  85  5.2.3  PCR amplification and sequencing  The purified MDA DNA was resuspended in 100 µL of TE, and 10 µL was used as template in each PCR reaction mixture consisting of Taq DNA polymerase assay buffer [20 mM Tris·HCl (ph 8.4), 50 mM KCl], 1.5 mM MgCl2, 125 µM of each deoxyribonucleoside triphosphate, 1 µM of each MicroVP1-F1, VP1-F2 and MicroVP1-R1 primer and 2.5 U of Platinum® Taq DNA polymerase (Invitrogen). Two forward primers were used to reduce the degeneracy of the primers. Negative controls contained all reagents except DNA template. The samples were denatured at 94 °C for 3 min, followed by 35 cycles of denaturation at 94 °C for 30 s, annealing at 50 °C for 30 s, and elongation at 72 °C for 50 s, with a final elongation step of 72 °C for 5 min. PCR amplicons were purified with a MinElute PCR purification kit (Qiagen), pooled into a 10 m or a 200 m mix and concentrated using a Millipore YM-30 Microcon centrifugal filter to a final volume of ~50 µL; amplicons from each sample (10 m and 200 m) were sent for pyrosequencing using Roche 454 FLX instrumentation with Titanium chemistry at the Broad Institute at the Massachusetts Institute of Technology.  5.2.4  Sequence analysis  Sequence binning and clustering. The sequences were screened for quality and length. Reads were removed if they (i) contained one or more ambiguous bases (Ns), (ii) were shorter than 200 nucleotides, or (iii) did not match the priming site at the proximal end. Reads were binned based on the primer sequence using in-house software. Primer sequences were subsequently trimmed. 
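The three read-screening criteria and the primer binning described above can be sketched in Python as follows. This is only an illustration and not the in-house software used in this study: the function names, the exact-match rule for the priming site, and the placeholder primer strings in the usage note are assumptions, and degenerate primer positions are not handled.

```python
MIN_LENGTH = 200  # reads shorter than this are discarded (criterion ii)


def bin_read(read, primers):
    """Return (primer_name, trimmed_read), or None if the read is rejected.

    `primers` maps a (hypothetical) primer name to its sequence.
    Assumption: the priming site must match exactly at the start of the read.
    """
    read = read.upper()
    if "N" in read:                # (i) one or more ambiguous bases
        return None
    if len(read) < MIN_LENGTH:     # (ii) shorter than 200 nucleotides
        return None
    for name, primer in primers.items():
        if read.startswith(primer):          # (iii) priming site at the proximal end
            return name, read[len(primer):]  # trim the primer sequence
    return None                    # no primer matched


def bin_reads(reads, primers):
    """Screen all reads and group the survivors by the primer they start with."""
    bins = {name: [] for name in primers}
    for read in reads:
        hit = bin_read(read, primers)
        if hit is not None:
            name, trimmed = hit
            bins[name].append(trimmed)
    return bins
```

For example, `bin_reads(reads, {"P1": "ACGTAC", "P2": "TTGGCC"})` (placeholder primers, not the primers used here) returns one list of primer-trimmed reads per primer; reads with an N, reads under 200 nucleotides, and reads without a matching priming site are dropped.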
Reads were clustered into operational taxonomic units (OTUs) at 95% identity using CD-HIT (Li and Godzik 2006) to minimize the impact of major sources of pyrosequencing errors such as homopolymers (i.e., stretches of identical nucleotides) (Margulies et al. 2005; Huse et al. 2010; Quince et al. 2011) and to reduce the recruitment of singletons (i.e., unique phylotypes that are observed only once in the read datasets). All OTUs were queried in a BLAST search (NCBI BLAST 2.2.2) using an e-value cut-off of $10^{-5}$ against a manually curated database derived from environmental sequences and sequences from gokushovirus isolates. OTUs that did not have a significant hit were removed. To avoid sequence redundancy and chimeras, a second round of clustering was performed after OTUs were trimmed to 400 bp (average read length) and aligned against reference sequences from isolates and amplicons from Chapter 4. OTUs that did not align properly, which were usually singletons, were removed. The subsequent calculations were made based on the results from the second round of clustering.

Diversity and species richness calculations. For each primer bin, a rank-abundance distribution of phylotypes was generated and subsequently fitted to a power-law function using nonlinear regression. The significance of the fit was assessed by the F-value of the non-linear regression and the coefficient of determination (adjusted $R^2$) for each rank-abundance curve. For each primer bin, rarefaction species richness curves and diversity indices were calculated using the Vegan Ecological Diversity package in R (Team 2011). Rarefaction curves were calculated as follows:

$$ S_n = \sum_{i=1}^{S} (1 - q_i), \quad \text{where} \quad q_i = \binom{N - x_i}{n} \Big/ \binom{N}{n} $$

where $x_i$ is the count of species $i$, $\binom{N}{n}$ is the number of ways we can choose $n$ from $N$, and $q_i$ gives the probability that species $i$ does not occur in a sample of size $n$.
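As a rough illustration of the rarefaction expectation defined above, and of the Chao1 richness estimate and Shannon index that are used alongside it in this section, the calculations can be sketched in plain Python. The analyses in this study were run with the vegan package in R, not with this code; the function names are assumptions made for illustration, and the natural logarithm is assumed for the Shannon index.

```python
from math import comb, log


def rarefied_richness(counts, n):
    """Expected number of OTUs in a random subsample of n reads.

    counts: reads per OTU (x_i); N is their sum.
    Computes S_n = sum_i (1 - q_i), with q_i = C(N - x_i, n) / C(N, n).
    """
    total = sum(counts)
    if not 0 <= n <= total:
        raise ValueError("n must lie between 0 and the total number of reads")
    denom = comb(total, n)
    # comb() returns 0 when N - x_i < n, i.e. the OTU is certain to be sampled
    return sum(1 - comb(total - x, n) / denom for x in counts)


def chao1(counts):
    """Chao1 estimate: S_obs + f1^2 / (2 * f2)."""
    s_obs = sum(1 for x in counts if x > 0)
    f1 = sum(1 for x in counts if x == 1)  # OTUs observed once
    f2 = sum(1 for x in counts if x == 2)  # OTUs observed twice
    if f2 == 0:
        raise ValueError("this form of Chao1 needs at least one doublet")
    return s_obs + f1 ** 2 / (2 * f2)


def shannon(counts):
    """Shannon index H' = -sum_i p_i * ln(p_i) over OTUs with reads."""
    total = sum(counts)
    return -sum((x / total) * log(x / total) for x in counts if x > 0)
```

Calling `rarefied_richness(counts, n)` for a range of subsample sizes n traces out one rarefaction curve for a primer bin. Exact integer binomials from `math.comb` are used so that the ratio stays accurate for bins with thousands of reads.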
Total estimated richness ($S_{\mathrm{Chao1}}$) (Chao 1987) was calculated by the following equation:

$$ S_{\mathrm{Chao1}} = S_o + \frac{f_1^2}{2 f_2} $$

where $S_o$ is the observed number of species, and $f_1$ and $f_2$ are the numbers of species observed once and twice, respectively.

Finally, the Shannon-Weaver ($H'$) diversity index was calculated as follows:

$$ H' = -\sum_{i=1}^{S} p_i \ln p_i $$

where $p_i$ is the proportion of species $i$, and $S$ is the number of species so that $\sum_{i=1}^{S} p_i = 1$.

5.2.5  Phylogenetic analyses

Nucleotide alignments of OTUs were done using MAFFT (Katoh et al. 2002) and MUSCLE (Edgar 2004). The alignment was trimmed to get the conserved region (3’ end for the forward primers and 5’ end for the reverse primer). Bayesian phylogenetic analyses were performed on the curated alignment with MrBayes (Huelsenbeck and Ronquist 2001). MrBayes uses a Markov chain Monte Carlo (MCMC) approach to approximate the posterior probabilities of trees. Under the HKY85 substitution model with invgamma rate variation (a proportion of invariable sites plus gamma-distributed rates), two independent analyses of 4 (1 cold and 3 heated) MCMC chains with 3.5 million cycles were run, sampled every 1000th cycle. The consensus tree was generated in Geneious with a burn-in of 25%. Trees were viewed in FigTree (http://tree.bio.ed.ac.uk/software/figtree/).

5.3  Results and discussion

5.3.1  Amplicon deep sequencing of the MCP from gokushoviruses

MCP amplicons from two pooled mixes (10 m and 200 m) were sequenced using the 454 GS FLX pyrosequencing technology with Titanium chemistry (Margulies et al. 2005). Table 5.1 summarizes the sequencing and clustering results. Briefly, we had between 3622 and 5333 good reads (no N, exact primer match, no chimeras) for the 10 m bins and between 849 and 2108 good reads for the 200 m bins. The yield from PCR amplification of the 200 m samples was lower than from the 10 m samples (data not shown), which can explain the difference in the number of reads, and may reflect a lower concentration of gokushovirus at 200 m. Errors in the sequences could be introduced at each step of the experiment.
First, whole-genome amplification (WGA) generally has an error rate of 1 per 10^6 to 10^7 bp (Dean et al. 2001). Pyrosequencing technology also introduces errors such as homopolymer miscalls, insertions and deletions at a rate of 0.1% per base (Margulies et al. 2005; Huse et al. 2010; Quince et al. 2011). Using a conservative approach, reads from each bin were clustered into operational taxonomic units (OTUs) with more than 95% sequence similarity. To limit redundancy of sequences due to clustering based on sequence length, and to exclude sequences not belonging to gokushoviruses, a second round of clustering was done after the sequences were aligned against reference gokushovirus sequences and amplicons obtained in Chapter 4. OTU sequences that were shorter than 200 bp were removed and all the remaining sequences were cut to 400 bp. Sequences that did not align with the reference sequences were removed. The second round of clustering resulted in fewer OTUs, but is more representative of the diversity of the samples. Primer F2 was not as specific as the other primers and resulted in many OTUs that were not similar to gokushoviruses. Consequently, the number of OTUs in the F2 primer bin dramatically decreased after the second round of clustering (Table 5.1). Diversity indices were calculated based on the second clustering in order to avoid overestimation of the richness and diversity.

Table 5.1  Number of reads and OTUs after clustering at 95% similarity.

The first clustering was performed on all the reads. The second round of clustering was performed on the OTUs that were obtained from the first round. OTUs >200 bp were kept and sequences were trimmed to 400 bp, if they were longer. OTUs that were not significantly similar to gokushovirus reference sequences were discarded.
Sample  Primer  Good reads  OTUs (1st clustering)  OTUs (2nd clustering)  Most abundant (%)  Singletons (%)  Doublets (%)
10 m    F1      4652        389                    366                    23.1               39.9            18.6
10 m    F2      3622        305                    138                    21.7               32.6            15.9
10 m    R1      5333        257                    212                    18.4               42.9            13.7
200 m   F1      2108        72                     65                     17.3               40.0            10.8
200 m   F2      849         83                     30                     23.6               33.3            10.0
200 m   R1      1661        27                     20                     39.0               50.0            10.0

The OTUs were ranked to show the relative abundance of each gokushovirus taxon in our dataset (Figure 5.2). The rank-abundance plots for the 10 m sample display a similar trend for each primer bin, with a dominance of a few genotypes and a long tail of doublets and singletons. The rank-abundance distribution of genotypes was approximated by a power-law function with R^2 values > 0.95. In contrast, there were fewer OTUs in the 200 m samples, and the distribution was less well described by a power-law function. Phages and their hosts usually follow a power-law rank-abundance distribution (Edwards and Rohwer 2005; Suttle 2007; Hoffmann et al. 2007). Two explanations have been proposed for environmental phage genotypes following a power-law distribution (Edwards and Rohwer 2005): either multiple viruses compete for the same hosts, or each virus is specific for one host but the hosts compete for resources. In the first scenario, the most abundant viruses are the most successful at finding and infecting their hosts; hence, every round of infection produces more of that virus. In the second scenario, the host that gets more nutrients divides more rapidly, resulting in more available hosts for the viruses. In this case, the most abundant viruses are the ones that infect the bacteria that are the most abundant.

Figure 5.2  Rank-abundance distribution of OTUs.
Rank-abundance distribution of the most abundant (up to 50) OTUs (95% similarity) for each primer and each sample: A) Forward primer 1, 10 m; B) Forward primer 2, 10 m; C) Reverse primer, 10 m; D) Forward primer 1, 200 m; E) Forward primer 2, 200 m; and F) Reverse primer, 200 m. The solid line shows the power-law curve fitting the rank-abundance distribution of phylotypes. Equations and adjusted R^2 are also shown for each curve.

Figure 5.3  Clustering analysis from 95% to 60% similarity.

The ratio is expressed as the number of OTUs/number of OTUs at 95%.

To compare the similarity of sequences within primer bins, clustering was done using multiple similarity thresholds (Figure 5.3). For each primer, the sequences were more similar to each other in the 200 m bin than in the 10 m bin. This was particularly true for the reverse primer, for which a decrease in the similarity threshold from 95% to 90% resulted in 58% fewer sequences. The end of the PCR product is more conserved than the beginning, which can explain this high similarity between OTUs in the reverse primer bins. For the other primer bins, the decrease in the number of OTUs was steeper when the similarity threshold was below 80%. Rarefaction species richness curves are presented in Figure 5.4. None of the curves reached a plateau (Figure 5.4), consistent with estimates that 57 to 76% of the total richness of gokushoviruses in Saanich Inlet (Table 5.2) was captured. Since the majority of the reads were not singletons (e.g. only 156 singletons out of a total of 4652 reads for F1-10 m), the richness that was not captured was probably represented by rare phylotypes. Usually, environmental samples are dominated by relatively few dominant genotypes and thousands of low-abundance ones that account for most of the genetic diversity.
Moreover, abundant and rare genotypes can be temporally and spatially dynamic with a rare genotype in one environment being dominant in another one, or rare genotypes becoming dominant under the right conditions (Sogin et al. 2006; Huse et al. 2010; Gobet et al. 2012).  91  Care must be taken in interpreting estimates of evenness and diversity calculated from the data. First, evenness can be influenced by the use of MDA, which can unevenly amplify the initial template (Dean et al. 2001), as well as be biased towards small circular ssDNA molecules (Angly et al. 2006) and GC-rich templates (Rodrigue et al. 2009). Second, the PCR products were pooled from multiple months, which could have affected the evenness. For example, if one time point was dominated by a single genotype, it would be highly represented in the sample, even if it was in low abundance in the environment. In contrast, a sample with many genotypes will have a lower concentration of each genotype that will be further diluted in the mix. Despite these caveats, it is evident that richness was 4 to 10-fold higher in the 10 m samples than in the 200 m samples (Table 5.2). Primers F1 and R recovered a similar number of genotypes, while the estimated richness was lower in data generated using primer F2 due to its lower specificity. Similarly, the Shannon and Simpson diversity indices are higher for the 10 m samples (3.64 to 3.87) than for the 200 m samples (1.44 to 2.67) (Table 5.2). This is consistent with observations of lower prokaryotic diversity in the anoxic versus oxic waters of Saanich Inlet, as the Shannon index was lower for archaea (0.87 vs. 1.15) and bacteria (0.39 vs. 3.146) at 215 m than at 10 m, respectively (Zaikova et al. 2010).  92  Table 5.2  Observed and estimated richness and diversity indices.  The richness is expressed in numbers of genotypes.  
Sample  Primer  Observed richness  Estimated richness (SChao1)  Shannon diversity index
10 m    F1      367                520 ± 35                     3.83
10 m    F2      138                181 ± 18                     3.64
10 m    R       212                349 ± 42                     3.87
200 m   F1      65                 106 ± 27                     2.67
200 m   F2      30                 41 ± 15                      2.30
200 m   R       20                 35 ± 24                      1.44

Figure 5.4  Rarefaction species richness curves.

Rarefaction curve analysis to infer the number of observed virus phylotypes present within each gene-specific primer bin. Each curve denotes the number of observed OTUs as a function of the number of sampled reads. Reads were clustered into OTUs based on a 95% similarity threshold.

5.3.2  Comparison to RFLP analysis

In Chapter 4, RFLP analysis was used to select MCP amplicons for sequencing from Saanich Inlet, Strait of Georgia and the Gulf of Mexico (Figure 5.5). For Saanich Inlet, RFLP analysis of 180 clones from PCR products from 9 samples resulted in 16 unique bands. Deep sequencing of amplicons from Saanich Inlet recovered 14 of the 19 sequences associated with the RFLP bands (Table 5.3 and Figure 5.5). Four of the five amplicons that were associated with an RFLP band, but were not recovered in the deep sequencing data, were found only once, and consequently may have been absent from the deep sequencing data by chance. The RFLP band associated with the remaining unrecovered sequence was found in samples from January, July, and May, suggesting that the absence of this sequence could be due to MDA or PCR bias. In contrast to the data from Saanich Inlet, none of the 13 sequences from the Gulf of Mexico mix and only two out of the 12 sequences from the Strait of Georgia mix (SOG3-31 and SOG4-29), which were associated with RFLP bands, were recovered in the amplicon deep sequencing data. The SOG mix was composed of a mixture of 85 virus concentrates from the Strait of Georgia and surrounding inlets, including Saanich Inlet. The fact that only two sequences were recovered in the deep sequencing data suggests that gokushovirus sequences are highly specific to their environment.
In contrast, studies on the portal protein from myoviruses (Short and Suttle 2005; Sullivan et al. 2008), and DNA polymerase B from podoviruses (Breitbart, Miyake, et al. 2004; Labonté et al. 2009; Chen et al. 2009; Huang et al. 2010) have recovered identical virus sequences from very different environments.

Table 5.3  Recovery of RFLP Saanich sequences within the amplicon deep sequencing database.

A total of 14 out of 19 sequences (shown in grey) were sequenced again with the amplicon deep sequencing.

No.  RFLP sequence  OTU    Frequency of OTU  Frequency (%)  Sample
1    SI-07          OXZQ1  4/1563            0.26           F2-10m
                    EP9D9  63/2158           2.92           R-10m
2    SI-12          ESV4E  99/5080           1.95           F1-10m
3    SI-05          DBSZ3  1/5080            0.02           F1-10m
4    SI-13          GTBP0  48/2387           2.00           F1-200m
                    EU4OQ  223/2158          10.33          R-10m
5    SI-10          D0YTA  23/5080           0.45           F1-10m
                    EI6DM  3/1563            0.19           F2-10m
6    SI-18          EKHXK  50/5080           0.98           F1-10m
                    EYQCF  7/2158            0.32           R-10m
7    SI-SOG-19      ECKEG  6/5080            0.10           F1-10m
                    C4GME  4/2158            0.19           R-10m
8    SI-08/SI-09    DV3J0  1/5080            0.02           F1-10m
9    SI-16          D0O8I  1/5080            0.02           F1-10m
10   SI-11          C2OWF  5/2158            0.32           R-10m
11   SI-06          GC1JP  3/5080            0.06           F1-10m
12   SI-14          BOBYV  4/5080            0.08           F1-10m
13   SI-02          DMER9  1/1563            0.06           F2-10m
14   SI-15          -      -                 -              -
15   SI-03          -      -                 -              -
16   SI-04          -      -                 -              -
17   SI-01          -      -                 -              -
18   SI-17          -      -                 -              -

At least in some instances the amplicon deep sequencing data appear to have captured the dominant phylotypes from the RFLP analysis. For example, the most frequent genotype from the RFLP data (SI-07) (Figure 5.5) was also one of the most abundant genotypes in the amplicon deep sequencing, representing 2.92% of the sequences in the R-10 m bin. Moreover, the sequence SI-13, which was found in the month of April at 120 m, was very dominant, representing more than 10% of all sequences found in the R-10 m bin (Table 5.3).

Figure 5.5  Gokushovirus MCP clonal amplicons digested with the AluI restriction enzyme.

PCR products were generated from MDA-amplified ssDNA and cloned into PCR-Topo.
Twenty random clones from each sample were digested with the enzyme AluI, yielding 16 restriction patterns; the associated sequences are shown in the table. The eleven sequences that were recovered by deep sequencing are shaded in grey. An X means that the clone was not sequenced or that its sequence had no similarity to gokushoviruses.

5.3.3  Genetic relatedness of the gokushovirus MCP in Saanich Inlet

The genetic relatedness among major capsid protein sequences from Saanich Inlet was phylogenetically analyzed in the context of other PCR amplicons and MCP sequences from isolates. The sequenced isolates included phages infecting the parasitic bacteria Chlamydia sp., Bdellovibrio bacteriovorus and Spiroplasma melliferum; however, only MH2K, which infects B. bacteriovorus, has marine homologues (Figure 5.6). Bdellovibrio-and-like organisms (BALOs) are commonly found in marine environments and parasitize Vibrio sp. (Martin 2002). Consequently, viruses similar to MH2K could be infecting marine BALOs. Most sequences fell within 12 previously unknown phylogenetic groups of gokushoviruses containing more than four sequences (Figure 5.6); however, many sequences fell outside the major clades, indicating that the diversity of gokushoviruses is very high and that they may infect a wide range of bacteria. Gokushovirus genomes were assembled that were representative of clades I, V, VIII and XII. Phylogenetic analysis of the 3’ and 5’ ends of the major capsid protein gene (Figure 5.6; only 3’ data shown) revealed that most sequences from the 200 m sample fell within a large clade (Group VI) that also contained closely related sequences from the 10 m fraction (> 90% similarity). The five most common sequences from 200 m and two of the five most common ones from 10 m were within Group VI (Figure 5.6).
In contrast, the most common sequence from the 10 m sample fell within clade I, while the other two most frequently occurring sequences fell within groups containing fewer than four genetically distinct OTUs. An abundant sequence suggests the occurrence of a recent lytic event. Assuming that closely related sequences are from viruses infecting closely related hosts, the results suggest that viruses in Clades I and VI were infecting a broad phylogenetically related group of hosts, while in the other cases the viruses were infecting much more closely related groups of bacteria. Sequences within Clade VI were very abundant at both depths, suggesting they were infecting hosts that were abundant in the fall when the water was being deeply mixed.

Figure 5.6  Genetic relatedness of the gokushovirus MCP from Saanich Inlet.

Unrooted Bayesian phylogenetic analysis (3 500 000 MCMC generations with 25% burn-in; HKY85 model) of PCR products from Saanich Inlet (10 m in dark blue, 200 m in light blue), Gulf of Mexico (grey), Strait of Georgia (grey), and complete sequenced genomes (black). Posterior probabilities of at least 90%, at least 75% and less than 75% are represented by black, dark grey and light grey branches, respectively. Shaded phylogenetic groups I through XII represent well-supported clades of gokushoviruses with more than four sequences. The five most abundant sequences from 10 m and 200 m are indicated by the dark- and light-blue arrows respectively, with the ranking indicated by the associated numbers. The scale bar represents 0.2 nucleotide changes per site.

5.3.4  Distribution of gokushovirus populations in Saanich Inlet

This study revealed new insights regarding the distribution of gokushoviruses in Saanich Inlet. First, the richness of gokushoviruses was much higher in oxic than anoxic layers. Second, the distribution of gokushovirus phylotypes at 10 m was better described by a power-law function than at 200 m.
Third, genetically similar viruses were found at 10 m and 200 m. The significance of these observations and comparisons with the bacterial distribution in the Saanich Inlet (Zaikova et al. 2010) are discussed below.

Two scenarios can explain the observation that the richness of gokushoviruses is lower at 200 m than at 10 m (Figures 5.2-5.4, and Tables 5.1-5.2). One possibility is that gokushoviruses infect bacteria that occur at both 10 and 200 m, but that at 10 m many hosts are present in high concentrations, so there are many infection possibilities, while at 200 m the hosts are scarce, so there are fewer infection opportunities. Bacterial diversity follows a similar trend, with fewer bacterial taxa at 200 m than 10 m (Zaikova et al. 2010); hence there would be fewer bacterial taxa to infect, presumably leading to lower viral diversity. At 10 m the opposite would be true. No marine gokushovirus has been isolated, but based on an analysis of 16S rDNA from Saanich Inlet (Zaikova et al. 2010), possible hosts would include δ-proteobacteria (Nitrospina), Actinobacteria (Microthrix) and Verrucomicrobia. An alternative explanation is that gokushoviruses only infect bacteria that live in the oxic surface waters, and the sequences recovered from 200 m were the result of seasonal mixing. Evidence for this can be seen from data collected during November 2008 following the yearly deep-water renewal. At this time Nov-01 was the dominant virus sequence in the surface layer, and was the second most abundant sequence in the 200 m amplicon deep sequencing data. As well, other sequences were found at both depths, with all of the most abundant sequences from 200 m and two of the most abundant sequences from 10 m clustering within the same clade (VI) as SI-12, which dominated the November sample (Figure 5.6).
The fact that the most abundant sequences at 200 m are closely related to the dominant sequence in the surface layer at the time of mixing suggests that the sequences recovered from 200 m may be from viruses that infect aerobic bacteria and were introduced from the surface. Some of the temporal changes in the gokushovirus phylotypes were likely the result of changes in the bacterial community. These results are consistent with a metagenomic study of four aquatic ecosystems, which revealed that dominant viral taxa persist over time while the relative abundances of rare ones constantly change (Rodriguez-Brito et al. 2010). Rodriguez-Brito et al. (2010) suggested a model in which microbial and viral taxa continuously replace each other in a ‘kill-the-winner’ manner, maintaining stable metabolic potential and species composition. Therefore, Group VI sequences could be from viruses that infect the dominant microbial taxa, while viruses infecting more ephemeral taxa belong to phylogenetic groups with fewer representative OTUs. In general, a power-law distribution of viral taxa is indicative of an environment in which viruses are infecting the competitively dominant hosts (Hoffmann et al. 2007). The 200 m viral taxa were not as well described by a power-law function as those from 10 m (Figure 5.2); this could be explained by the scenario in which the viruses found in the 200 m anoxic zone were not actively replicating, but were mixed from the surface waters. In this case the possible hosts of the gokushoviruses would include members of the α-proteobacteria (Rhodobacterales and SAR11), β-proteobacteria (Methylophilales), γ-proteobacteria (Agg-47, Arc96B, SAR86, Arc96B-19) and Bacteroidetes, based on the dominant bacterial genotypes that were observed in the Saanich Inlet (Zaikova et al. 2010).
Recently, Krupovic and Forterre (2011) found remnants of viruses, with a genome organization similar to that of gokushoviruses, in the genomes of bacteria from the phylum Bacteroidetes, suggesting that these bacteria may be hosts for gokushoviruses. Since gokushoviruses were not observed in arctic viral metagenomes (Angly et al. 2006), bacteria from the ARC96BD group are unlikely to be hosts for the viruses in this study. The most abundant bacterium in the anoxic waters of Saanich Inlet is SUP05. SUP05 is a chemoautotrophic uncultured member of the γ-proteobacteria that is found in oxygen minimum zones (Walsh et al. 2009). It contains a wide variety of genes able to mediate autotrophic carbon assimilation, sulfur oxidation, and nitrate respiration. Because it is absent from the surface, and its concentration increases with depth and decreasing oxygen concentration (Zaikova et al. 2010), it is unlikely to be a host of marine gokushoviruses.

5.4  Conclusions

This study has shown that in Saanich Inlet the genetic richness of gokushoviruses was much higher in the oxic (10 m) than in the anoxic (200 m) layer, and that a power-law function was better able to describe the distribution of gokushovirus taxa at 10 m than at 200 m, suggesting that gokushoviruses are replicating in the aerobic zone. Finally, the presence of very similar viruses at 10 m and 200 m could be due to deep-water renewal. These results suggest that in Saanich Inlet gokushoviruses infect aerobic bacteria that are specific to the environment.

Chapter 6: Determination of the marine host of gokushoviruses using fluorescence in situ hybridization

6.1  Synopsis

Analysis of the genetic relatedness among marine gokushoviruses from chapters 4 and 5 suggested that they infect bacteria that are tied to specific environments.
Marine gokushoviruses, and hence their hosts, appear to be absent from the Arctic Ocean, are mostly found in near-surface water (~10 m), are not active in anoxic water, and differ between tropical and temperate waters. Therefore, the marine hosts of gokushoviruses are likely aerobic bacteria living at the surface of temperate and tropical waters. So far, no marine ssDNA gokushovirus has been isolated and cultured, and this is likely to remain a problem given the difficulty of culturing most marine microorganisms (Colwell 1997). Fluorescence in situ hybridization (FISH) is a culture-independent method that uses a fluorescently labeled nucleotide probe to target a specific nucleotide sequence within cells. FISH has been used in marine samples to count and observe cyanobacteria (Worden et al. 2000), examine gene expression in bacteria (Pernthaler and Amann 2004) and enrich methanotroph populations (Kalyuzhnaya et al. 2006), for example. FISH can be combined with catalysed reporter deposition (CARD), a method that increases the fluorescence signal by using multiple molecules of fluorescently-labeled tyramides bound to horseradish peroxidase conjugated to antibodies that recognize the probe (Pernthaler et al. 2002); the combination is known as CARD-FISH. Moraru et al. (2010) presented a modified technique, which they called Gene-FISH, that combines a larger digoxigenin (DIG)-labeled probe with CARD-FISH, and first used it to look for the presence of a specific gene, AmoA, in crenarchaea. Here, a modified Gene-FISH technique is presented that provides the first evidence of gokushovirus-infected cells in Saanich Inlet.

6.2  Material and methods

6.2.1  Control ssDNA virus and E. coli strains

phiX174 amplification.
Escherichia coli strain C cultures were started from a single colony and stored at -80 °C in LB (1% Tryptone, 0.5% Yeast Extract and 1% NaCl) supplemented with 15% glycerol. To propagate the virus, LB medium was inoculated at 1% (by volume) with an overnight culture of E. coli C. After 2 h of incubation, the culture was supplemented with 2 mM CaCl2 (a cofactor necessary for viral infection) and infected with ~10^10 viruses per 10 mL of culture. The incubation was continued until complete lysis (~2 h post-infection). The lysate was filtered through a 0.22-µm filter and stored at 4 °C.

FISH samples. This study used E. coli strain C (negative control), E. coli C infected with phiX174 (positive viral control) and E. coli clone pCC1-phiX174 (positive FISH control), which is E. coli EPI300 transformed with a pCC1 CopyControl vector (Epicentre) that contains a 215-bp fragment of the major capsid protein gene of phage phiX174. The cultures were grown for 2 h at 37 °C and 250 rpm to reach exponential phase in LB medium, with 12.5 µg mL−1 chloramphenicol and 2× Induction Solution (Epicentre) for the pCC1-phiX174 clone. After 2 h, a culture of E. coli C was supplemented with 2 mM CaCl2 and infected with 50 µL of phiX174 lysate, and the incubation was continued for 30 min. The cells were harvested by centrifugation and then fixed in 1% paraformaldehyde in 1× PBS pH 7.4 (137 mM NaCl, 2.7 mM KCl, 8 mM Na2HPO4, and 2 mM KH2PO4) for 1 h at room temperature. The cells were filtered onto a 0.2-µm pore-size polycarbonate membrane, then washed with 10 mL of filtered (0.22 µm) Milli-Q water and 5 mL of 1× PBS.

6.2.2  Preparation of marine samples

Water samples were collected on a monthly basis from Saanich Inlet, British Columbia, in 12-L Go-Flow bottles. The bacteria from 2.5 L of surface water (4 to 10 m depth) were concentrated by tangential flow filtration using a Pellicon XL Filter Mod Durapore 0.22-µm cassette (Millipore) to a final volume of 10 to 13 mL.
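As a quick check on the enrichment achieved by this filtration step, the following sketch computes the concentration factor implied by the volumes above. The function name `concentration_factor` is illustrative only and is not part of the original protocol; the 2.5 L starting volume and the 10 to 13 mL final volume are the only values taken from the text, and cell loss during filtration is assumed to be negligible.

```python
def concentration_factor(initial_ml: float, final_ml: float) -> float:
    """Return the fold-concentration obtained by reducing initial_ml to final_ml."""
    if initial_ml <= 0 or final_ml <= 0:
        raise ValueError("volumes must be positive")
    return initial_ml / final_ml


INITIAL_ML = 2500.0  # 2.5 L of surface water (from the text)

# Final retentate volume ranged from 10 to 13 mL (from the text).
low = concentration_factor(INITIAL_ML, 13.0)
high = concentration_factor(INITIAL_ML, 10.0)
print(f"{low:.0f}X to {high:.0f}X")
```

Under these assumptions the range brackets the nominal 200X bacterial concentrate listed in Table 6.1.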
One mL of bacterial concentrate was kept for the preparation of the 16S rRNA bacterial probe, while the rest was fixed with paraformaldehyde to a final concentration of 1% and kept at 4 °C until used within 2 months. Another ~20 L of water was filtered through 0.22-µm Sterivex™ filter units (Millipore) and the viruses were concentrated by tangential flow filtration using a TFF 30-kDa cartridge (Millipore) to a final volume of ~250 to 500 mL; the concentrates were stored at 4 °C until used. The microscopy pictures were from the February 2012 samples, while the flow cytometry counts were from the April 2012 samples.

6.2.3  Gokushovirus deep-amplicon sequencing and design of the environmental gokushovirus probe

Amplicon deep sequencing of PCR fragments coding for the major capsid protein of gokushoviruses is described in Chapter 5. The obtained genotypes were aligned using MUSCLE (Edgar 2004) and new primers were designed based on sequence similarity. The new degenerate forward primers [F1-215nt (5’-AAN TCN TTY ACN GAR CA-3’), F2-215 (5’-AAN TCN TTY GTN GAR CA-3’), and F3-215 (5’-AAN AGN TTY ACN GAR CA-3’)], combined with the previous reverse primers (R1 and R2), amplify a 215-bp fragment at the end of the 800-bp fragment.

6.2.4  Preparation of the probes

16S rRNA probe template. Amplification of the partial prokaryotic 16S SSU rRNA gene (rDNA) was carried out from DNA extracted from PVDF filters (e.g. 0.45 to 1.2 µm size fraction) using the bacterial 16S rDNA-specific primers 341F (5’-CCT ACG GGA GGC AGC AG-3’) and 907R (5’-CCG TCA ATT C[A/C]T TTG AGT TT-3’), yielding a ~600-bp product (Muyzer and Smalla 1998). One mL of bacterial concentrate was pelleted by centrifugation at 10,000 g for 3 min and resuspended in 20 µL of 10 mM Tris-Cl, pH 8.5. One µL was used as template in a PCR reaction mixture consisting of 1× Platinum Taq DNA polymerase assay buffer, 1.5 mM MgCl2, 0.2 mM of each dNTP, 0.3 µM of each primer, and 1 U of Platinum Taq DNA polymerase.
PCR mixtures were adjusted to a final volume of 25 µL with sterile water. Thermal cycling consisted of an initial denaturation at 95 °C for 90 s, followed by 10 ‘touchdown’ cycles of denaturation at 95 °C for 1 min, annealing at 65 °C (temperature decreasing by 1 °C per cycle) for 1 min, and extension at 72 °C for 2 min. This was followed by another 20 cycles at 95 °C for 1 min, 55 °C for 1 min and 72 °C for 2 min. The final extension step was for 10 min at 72 °C. The PCR product was purified using the MinElute PCR purification kit (Qiagen) and stored at -20 °C.

Gokushovirus probe. Ten µL of whole genome amplified (WGA) template were used in each PCR reaction mixture consisting of Taq DNA polymerase assay buffer [20 mM Tris·HCl (pH 8.4), 50 mM KCl], 1.5 mM MgCl2, 125 µM of each deoxyribonucleoside triphosphate, 1 µM of each primer (MicroVP1-F1, MicroVP1-F2, MicroVP1-R1 and MicroVP1-R2) and 2.5 U of Platinum Taq DNA polymerase (Invitrogen). Negative controls contained all reagents except DNA template. The samples were denatured at 94 °C for 3 min, followed by 35 cycles of denaturation at 94 °C for 30 s, annealing at 50 °C for 30 s, and elongation at 72 °C for 50 s, with a final elongation step of 72 °C for 5 min. The PCR product was purified using the MinElute PCR purification kit (Qiagen) and stored at -20 °C.

DIG probes. The following three probes based on PCR products were used for this experiment (Table 6.1): a phiX174 probe (FISH control), a degenerate 16S rRNA probe (environmental positive control) and a degenerate gokushovirus probe. Probes were produced by incorporating Dig-dUTP into dsDNA via PCR (70 µM Dig-dUTP), using the PCR Dig Probe Synthesis Kit (Roche) according to the manufacturer's instructions. The template for each probe was either 1 µL of phiX174 lysate, 1 µL of purified 16S rRNA PCR product, 1 µL of 800-bp PCR product (using the F1short/F2short/F3short and R1/R2 primers for a product of 215 nt), or water for the negative control.
The PCR products were column purified using the MinElute PCR purification kit (Qiagen). The concentration was determined spectrophotometrically (NanoDrop, Thermo Fisher Scientific). The probes were stored at −20 °C in 10 mM Tris-Cl, pH 8.5.

Table 6.1  Samples and probes used in the FISH experiments.
Blue: negative controls; Green: positive controls; Yellow: assay sample

No.  Sample                                     Probe                    Expected FISH result
1    E. coli C                                  phiX174 (215 nt)         -
2    E. coli pCC1                               phiX174 (215 nt)         +
3    E. coli C infected with phiX174            phiX174 (215 nt)         +
4    Saanich water 200X bacterial concentrate   no probe                 -
5    Saanich water 200X bacterial concentrate   16S (~500 nt) *          + for most cells
6    Saanich water 200X bacterial concentrate   Microvirus (215 nt) **   very few + cells

*  Probe amplified from the bacterial concentrate
** Probe amplified from the virus concentrate

Determination of the phiX174 hybridization temperature. The melting temperature (Tm) of the perfectly matched phiX174 probe–target hybrids was measured in hybridization buffer, in vitro, using real-time PCR (IQ5, Bio-Rad) and SYBR Green I dye (Invitrogen). The hybridization buffer was composed of 3.5 mL of formamide (Sigma), 2.5 mL of 20× SSC (3 M NaCl, 0.3 M sodium citrate), 1 g of dextran sulfate sodium salt (Sigma), 50 µL of 20% sodium dodecyl sulfate (SDS), 0.4 mL of 0.5 M EDTA pH 8.0 and 2.6 mL of autoclaved MilliQ water. SYBR Green I was added to 1.5 mL of hybridization buffer to a final concentration of 10 µM. To 100 µL of the latter mixture, 1000 ng of PCR product (phiX174 probe) was added, and the resulting solution was aliquoted into 25 µL portions per well and used for Tm determinations. The Tm was measured for the probe dsDNA, for target dsDNA and for the hybrid dsDNA.
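One common way to read a Tm from a melting curve of this kind is to take the temperature at which −d(RFU)/dT peaks, which is the quantity plotted in Figure 6.2. Whether the instrument software used exactly this procedure is not stated, so the following is only a sketch under that assumption. The function name `melt_curve_tm` and the fluorescence values are hypothetical: the data are a synthetic sigmoid centred on 68 °C, sampled every 0.2 °C between 50 and 75 °C to mirror the melting ramp used in this study.

```python
import math


def melt_curve_tm(temps, fluorescence):
    """Return the temperature at which -dF/dT peaks (central differences)."""
    if len(temps) != len(fluorescence) or len(temps) < 3:
        raise ValueError("need at least three paired readings")
    best_t, best_slope = None, -math.inf
    for i in range(1, len(temps) - 1):
        # Negative slope of fluorescence: large where the duplex melts.
        slope = -(fluorescence[i + 1] - fluorescence[i - 1]) / (temps[i + 1] - temps[i - 1])
        if slope > best_slope:
            best_t, best_slope = temps[i], slope
    return best_t


# Synthetic readings (assumption, not measured data): signal drops
# sigmoidally around 68 °C as the dsDNA melts and the dye is released.
temps = [50.0 + 0.2 * i for i in range(126)]
signal = [1.0 / (1.0 + math.exp(t - 68.0)) for t in temps]

tm = melt_curve_tm(temps, signal)
print(f"Estimated Tm: {tm:.1f} °C")
```

Real melting curves are noisier than this synthetic example, so smoothing the readings before differentiating is usually advisable.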
The thermal protocol used for the determination in hybridization buffer was as follows: denaturation at 80 °C for 5 min, hybridization at 42 °C for 25 min and melting from 50 °C to 75 °C, +0.2 °C per 1.5 min, minimum ramp rate. The composition of the washing buffer II was 0.1× SSC, 0.1% SDS, 10 µM SYTO 9 dye. The Tm was measured for the probe dsDNA (~1 µg of PCR product). The thermal protocol used for Tm in washing buffer was from 50 °C to 75 °C, +0.2 °C per 1.5 min, minimum ramp rate.

6.2.5  FISH protocol

The FISH protocol was adapted and modified from Moraru et al. (2010). All water used during the protocol was autoclaved 0.22-µm filtered MilliQ water. Unless stated otherwise, the incubations were performed at room temperature. All washing steps were carried out in 15-20 mL volumes.

Sample immobilization. One mL of paraformaldehyde-fixed E. coli or environmental cells was filtered through 0.2-µm 24-mm polycarbonate black filters (GTBP, Millipore). The filters were then washed with 5 mL of 1× PBS and 10 mL of water, and air dried.

Inactivation of endogenous peroxidases. The inactivation was performed by overlaying the filter with 0.01 M HCl for 10 min, followed by washing with 1× PBS for 5 min and then water for 1 min.

Permeabilization. Permeabilization was undertaken in 0.5 mg mL−1 lysozyme (Invitrogen), 1× PBS, 0.1 M Tris-HCl pH 8.0 and 0.05 M EDTA pH 8.0, for 1 h at 37 °C. The filters were then washed three times with 1× PBS.

Hybridization. E. coli C, E. coli pCC1-phiX174 and E. coli C infected with phiX174 samples were hybridized with the phiX174-Coat215 probe, while three Saanich seawater samples were hybridized with no probe (negative control for proper inactivation of endogenous peroxidases), the 16S rRNA degenerate probe (which targets most bacteria), and the gokushovirus probe (which targets bacterial cells infected by a gokushovirus), respectively.
The hybridization took place in hybridization buffer (7 mL of formamide, 5 mL of 20× SSC, 2 g of dextran sulfate, 100 µL of 20% SDS, 0.8 mL of 0.5 M EDTA pH 8.0, 2.2 mL of water, and 2 mL of 10% blocking reagent (Roche)) supplemented with 25 pg µL−1 of probe. Before hybridization, the samples were denatured at 85 °C for 25 min and then transferred to 42 °C overnight. The filters were then washed at 42 °C with shaking, first for 1, 5, and 30 min with buffer I (2× SSC, 0.1% SDS) and then for 2 × 1 min and 45 min with buffer II (0.1× SSC, 0.1% SDS).

Antibody binding. The samples were blocked for 1 h in 1× PBS and 0.5% Western Blocking Reagent (WBR) (Roche). Antibody binding took place for 1.5 h in a solution containing 1× PBS, 1% WBR and 3 U µL−1 anti-DIG HRP-conjugated antibody (Fab fragments) (Roche). The wash was carried out in a 1× PBS, 0.5% WBR solution for 1, 5 and 2 × 10 min.

CARD for gene detection. The samples were equilibrated for 20 min in 1× PBS, followed by incubation for 40 min at 37 °C in amplification buffer containing 1× PBS, 20% dextran sulfate, 0.1% blocking reagent, and 2 M NaCl with 0.0015% H2O2 and 2 µg mL−1 Alexa488-labeled tyramide. The samples were then washed for 1, 5 and 2 × 10 min with 1× PBS at 46 °C with slow agitation. The filters were cut in half, with one half for flow cytometry and the other for microscopy.

Alexa tyramide preparation. To label the tyramides, 1 mg of Alexa488 succinimidyl ester (Invitrogen) was resuspended in 100 µL of dimethylformamide (Fisher), added to 12.5 µL of tyramine HCl (1.1-fold molar excess) and incubated in the dark for 6 to 12 h at room temperature. The reaction mixture was diluted with 100% ethanol to a final concentration of 1 mg of active dye per mL, then frozen at -20 °C in 20 µL aliquots.

Counterstaining and microscopy. The filter was placed on a Hoefer filtration manifold and stained for 20 minutes with 1 µg mL−1 4′,6-diamidino-2-phenylindole (DAPI).
The filters were rinsed with 10 mL of water and embedded in Vectashield mounting medium (Vector Laboratories). Microscopy was performed on a Zeiss Axio-scope II (Jena, Germany) equipped with the following fluorescence filters: DAPI (365/12 nm excitation, 397 LP emission, FT 395 Beam Splitter) and Alexa488 (485/20 excitation, 515-565 emission, FT 510 Beam Splitter). Photomicrographs were taken with a CCD camera.

Cell resuspension and flow cytometry. The cells were removed from the filters by sonication (JSP SuperSonic) for 5 min in ice water. Samples were analyzed using a BD LSR II flow cytometer (Becton Dickinson, San Jose, CA) equipped with a blue laser (488 nm excitation/530 nm emission) and detection of green fluorescence (FITC) versus side scatter. Sorting was done with the BD FACSAria III, also with a blue laser and the FITC detector. Results were analyzed with Cyflogic 1.2.1 (CyFlo Ltd.).

6.3  Results and discussion

6.3.1  Visualizing phiX174, a ssDNA virus model

Development of FISH to detect virus-infected cells required a model system (phage phiX174 and E. coli C) to test the protocol and verify that infected cells could be detected. Phage phiX174 belongs to the family Microviridae; it has a 5,386-nt ssDNA genome encoding 11 proteins, 8 of which are essential for viral replication. There were several advantages to using phiX174 as a model for the development of the method. First, the virus replicates very quickly, with complete lysis of the host within 2 hours of infection. Second, E. coli C is easy to culture. Finally, it is the closest culturable relative of gokushoviruses. Gokushoviruses are harder to work with because they infect parasitic bacteria, such as Chlamydia and Bdellovibrio, which require an additional host in culture. For these reasons, phiX174 was used to optimize the method. The Gene-FISH technique was used to visualize infected cells.
Briefly, a probe up to 500 nt long was obtained by PCR, labeled with DIG-dUTPs (dUTP coupled to digoxigenin) and bound to the target. Then, HRP-conjugated antibodies were bound to the DIG, and Alexa488-labeled tyramides bound to the HRP (Figure 6.1).

Figure 6.1  Gene-FISH principle. A probe (red) is amplified and labeled with DIG (yellow dots). The probe hybridizes to the genomic DNA (blue). HRP-conjugated (blue circles) anti-DIG antibodies bind to the DIG on the probe. Finally, Alexa488-labeled tyramides bind to HRP. Figure modified from Speicher and Carter (2005).

For environmental samples, the ideal probe is 1) short, 2) unique to the targeted group to avoid false positives, and 3) relatively conserved amongst the targeted group so that it is not too degenerate. Typical probes are 20 to 30 nucleotides long, but Gene-FISH uses a longer probe of 200-500 nucleotides (Moraru et al. 2010). The probe used in this study is 215 nt long and targets the gene that encodes the major capsid protein, VP1. When a virus is replicating, there are multiple copies of the viral genome inside the cell, so multiple probes can bind inside a cell, giving a stronger fluorescence signal. Therefore, our probe should have several potential targets in a virally infected cell. Cells infected by ssDNA viruses are ideal candidates for detection by FISH because of the high viral copy numbers, and because microviruses and gokushoviruses accumulate within the cells until lysis occurs. For example, ~500 virions are produced each time the chlamydiaphage Chp2 (a gokushovirus) infects its parasitic host (Salim et al. 2008). For phiX174, the burst size is estimated to be ~185 virions per cell (Delongchamp et al. 2001). Thus, there should be a large number of viral copies that the probe can bind to, increasing the fluorescence signal. Second, after entering the cell, microviruses take over the host proteins and resources to replicate.
Once injected, the ssDNA is converted by the host polymerase into a replicative form (called RF) that consists of a covalently closed dsDNA molecule (Figure 1.1). The RF serves as a template from which viral genes are transcribed by the host RNA polymerase. DNA replication occurs via a rolling-circle mechanism using the host replicase (Sait et al. 2011). The probe can bind to the RF, which during replication is present in about 10 to 20 copies inside the cell (Iwaya et al. 1973). Third, in phiX174, the capsid consists of 60 copies each of the F, G, and J proteins and 12 copies of the H protein (Dokland et al. 1997). The probe could also hybridize with the mRNA used for the production of the major capsid protein F. For gokushoviruses, the capsid consists of 60 copies of the major capsid protein VP1 and 12 copies of the minor protein VP2.

The melting temperature (Tm) of the probe (68 °C) was determined by a melting curve (Figure 6.2). The ideal hybridization temperature for FISH is 25 °C below the Tm. Hence, the hybridization temperature of the phiX174 probe was set at 43 °C.

Figure 6.2  Melting curve of the phiX174 probe to determine probe hybridization. [Plot of -d(RFU)/dT against temperature (°C), from 55 to 75 °C.] Blue: Stained probe; Purple: SYBR control without DNA

We were concerned that viruses inside a cell would be too ephemeral for detection by FISH. To determine the length of time the target was detectable and viruses were present in a cell, the optical density and number of hosts and viruses during the life cycle of phiX174 were measured (Figure 6.3). Lysis started after a latent period of about 40 minutes and complete lysis of E. coli C occurred within 80 minutes of infection. Virus particles were inside infected cells for at least 60 minutes post infection, which corresponds to 66% of the lytic cycle.
This should be adequate for detecting infected cells in environmental samples.

Figure 6.3  Lytic life cycle of phiX174 infecting its host E. coli C. [Plot of VLP/mL (logarithmic axis) and E. coli/mL against time (min), from 0 to 100 min.] The virus-like particles (blue) and bacteria (purple) were estimated by flow cytometry with SYBR-Green staining of the DNA.

Three samples were used for the phiX174 visualization experiment. First, uninfected E. coli C was used as a negative control for probe binding. Second, E. coli pCC1-phiX174, carrying the plasmid pCC1 with an insert corresponding to the sequence of the probe, was used as a positive control. Third, E. coli C infected with phiX174 at various time points was used as a virus control (Figure 6.4). Viruses could be observed inside cells at 20, 40 and 60 minutes (data not shown). Background fluorescence caused by unspecific binding of the Alexa488-tyramides occurred during the experiment, but was greatly reduced by more vigorous washes at the end of the experiment (Figure 6.3).

Figure 6.4  Epifluorescence microscopy of the hybridized E. coli cells. E. coli C is a negative FISH control, E. coli pCC1-phiX174 is a positive FISH control, and "phiX174 infection" shows E. coli C cells 30 minutes after they were infected with phiX174. DAPI stains the DNA in each cell, while Alexa488 stains only the cells where the probe hybridized.

6.3.2  Visualizing gokushovirus infections in samples from Saanich Inlet

Bacterial concentrates (200X) from near-surface water from Saanich Inlet were used to identify cells infected with gokushoviruses. For each sample, three separate, identically treated filters were prepared.
One filter was hybridized with no probe (internal control for proper inactivation of endogenous peroxidases), one with a degenerate 16S rRNA probe (positive environmental control), and one with the gokushovirus probe.

The environmental gokushovirus probe was designed based on an alignment of sequences obtained by amplicon deep sequencing of the major capsid protein gene (Chapter 5). The alignment of the terminal region of the fragments revealed many regions that are highly conserved, allowing for the design of new degenerate primers to amplify the probe. The new degenerate primers were designed to target a region of ~215 bp using the same reverse primers as for the 800-bp PCR products. That region contains multiple highly conserved or identical sites that reduced the degeneracy of the probe (Figure 6.5). The gokushovirus probe was made of mixed sequences obtained by PCR amplification from environmental samples using degenerate primers. Therefore, theoretically each probe in the mix could have a different Tm. However, the same hybridization temperature (43 °C) as for the microvirus phiX174 was used for each sample so that all experiments were conducted under the same conditions.

Figure 6.5  Multiple alignment of sequences obtained via amplicon deep sequencing of the conserved region of the major capsid protein gene. This is the region targeted with the degenerate marine gokushovirus FISH probe. In the consensus sequence, bright green bars represent identical sites, olive green bars represent highly conserved sites, and red bars represent poorly conserved sites.

The first sample filter did not contain any probe during the hybridization step (Figure 6.6). Since the Alexa488-labeled tyramides bind not only to the HRP-conjugated antibodies but also to peroxidases, this control was used to verify that the endogenous peroxidases present in seawater were properly inactivated.
Inactivation of endogenous peroxidases was performed using 0.01 M HCl, which inactivated the majority (Figure 6.6A), but not all peroxidases (Figure 6.6B). To obtain better inactivation, various concentrations of hydrogen peroxide were tested (Pavlekovic et al. 2009), but the treatment resulted in numerous burst cells (data not shown). Background fluorescence due to unspecific binding of the Alexa-labeled tyramides to the filter was also observed (Figure 6.6A-C).

On the second sample filter, a probe targeting the V6 region of the bacterial 16S rRNA was used to verify the presence of cells and overall FISH protocol efficiency. As expected, most of the bacterial cells (~85%) hybridized (Figure 6.7).

Finally, the third sample filter was hybridized with the gokushovirus probe. Gokushovirus infections were observed microscopically in a 10 m bacterial concentrate from Saanich Inlet from February 2012 (Figure 6.8). These observations indicate that gokushoviruses infect free-living bacteria in Saanich Inlet.

Figure 6.6  Saanich Inlet negative control (no probe) filter as observed through epifluorescence microscopy with DAPI (left) and Alexa488 (right) filters. Examples of A) background fluorescence, B) improper inactivation of endogenous peroxidases (noted with yellow arrows), and C) unspecific binding of the Alexa-labeled tyramides solution (noted with yellow arrows).

Figure 6.7  Saanich Inlet positive control (16S rRNA probe) filter as observed through epifluorescence microscopy with DAPI (left) and Alexa (right) filters.

Figure 6.8  Saanich Inlet gokushovirus probe filter as observed through epifluorescence microscopy with DAPI (left) and Alexa (right) filters. Arrows indicate likely infected cells, except for the ones marked “No”, which are examples of unspecific fluorescence.
6.3.3  Sorting of gokushovirus infected environmental sample

Infected cells were counted and separated from other cells and debris by flow cytometry and cell sorting (Figure 6.9). Based on microscopy counts, roughly 1% of the hybridized cells were false positives due to incomplete peroxidase inactivation. In the sample from May 2012, about 7.5% of the cells hybridized with the gokushovirus probe, which is surprisingly high given that the probe was for a single group of viruses. Assuming that the viruses are visible inside the cells for 66% of the lytic cycle (as shown in section 6.3.1), gokushoviruses would lyse about 11.3% of the prokaryotes in Saanich Inlet water each day, suggesting that gokushoviruses are important agents of microbial mortality in Saanich Inlet.

Figure 6.9  Flow cytometry counts of the sonicated filters. SSC, side scatter; FITC-A, 530 nm fluorescence; FSC-A, forward-scatter area. Circles represent where hybridized cells should be. SI, Saanich Inlet. The prediction circles were based on the cell counts in the positive controls.

6.3.4  Methods limitations

FISH has been used to observe and count specific cell types in environmental samples. The signal is greatly enhanced with CARD, making cells easier to differentiate. Here, I presented a FISH method used to observe infected cells, but no cells could be sorted and isolated in order to identify the host of gokushoviruses. The FISH method is very complex, with many pitfalls and limitations that are outlined below.

A major problem was the need to fix cells on filters prior to FISH. Fixation greatly improved the fluorescence signal, as cells were not damaged by the successive rounds of centrifugation needed for the protocol in solution. However, the cells were hard to detach from the filters for analysis by flow cytometry. The cells were detached from the filters by sonication (Manti et al. 2011), as vortexing for up to two hours did not.
However, sonication creates a lot of debris (Figure 6.8), which made cell sorting difficult, as the debris clogged the flow-cytometer capillary. To reduce the debris, we will need to increase the number of cells on each filter by filtering more bacterial concentrate, and then dilute the solution to pass it through the capillary.

Another problem is that although Gene-FISH produces a strong signal (Moraru et al. 2010), the DIG-antibody-HRP-tyramide complex is large and unstable. The samples needed to be processed within 10 minutes post-sonication; otherwise the fluorescence decreased to undetectable levels. Probes using Alexa-labeled dUTP could give a lower but more stable fluorescence signal by reducing the number of steps in the protocol (Kalyuzhnaya et al. 2006).

Prior to FISH, the bacterial cells were fixed with paraformaldehyde, which cross-links free amino groups to preserve cells. Molecular work such as whole genome amplification and PCR for the identification of the marine host of gokushoviruses is inhibited by paraformaldehyde unless the samples are properly treated. In order to properly treat the sample before PCR reactions, we will need to sort more cells for protocol trials.

6.3.5  Conclusions

Here, we developed a method combining fluorescence in situ hybridization and catalysed reporter deposition using a long probe to observe viral infections by marine ssDNA gokushoviruses in Saanich Inlet. We have shown that about 7.5% of the cells are infected and producing gokushoviruses, suggesting that gokushoviruses are very important players in the bacterial communities in surface waters of Saanich Inlet.

Chapter 7: Overview and outlook

7.1  Recapitulation and significance of this work

For decades, ssDNA viruses have been studied because of their economic and health effects, but only recently has their prevalence in the environment been recognized (Angly et al. 2006; Desnues et al. 2008; López-Bueno et al.
2009; Roux et al. 2012). However, the identification of environmental ssDNA viruses has been limited to comparisons of sequence similarity with known viruses; furthermore, most environmental ssDNA viruses have little similarity to isolates. As well, knowledge of ssDNA viruses in the environment is largely obtained from metagenomic surveys encompassing all DNA viruses; surveys focused on ssDNA viruses have not been reported.

The work described in this dissertation used genomic approaches to greatly expand our knowledge of marine environmental ssDNA viruses. Metagenomic analysis of purified ssDNA (Chapter 2) revealed that the most abundant sequences coded for a rolling-circle replication protein commonly found in circular ssDNA viruses. The analysis of these sequences suggests that circular ssDNA viruses infecting a wide variety of eukaryotes likely had a common marine origin (Chapter 3). The second part of the dissertation focuses on the genetic relatedness among marine gokushoviruses in temperate and tropical waters (Chapter 4), their richness and distribution in Saanich Inlet (Chapter 5), and presents a protocol to identify cells in seawater infected with ssDNA viruses (Chapter 6).

7.1.1  Assessing the genomic diversity of ssDNA viruses

The use of multiple displacement amplification (MDA), high-throughput shotgun sequencing and metagenomic analyses has allowed for the detection of ssDNA virus sequences in seawater, with most belonging to viruses with small circular genomes such as gokushoviruses and circoviruses. Although MDA preferentially amplifies small circular ssDNA (Angly et al. 2006), the characterization of ssDNA viruses in previous studies has been limited to comparison with sequences from ssDNA viruses infecting plants and animals.

In chapter 2, I described a protocol that uses a silica column and MDA for extracting and enriching ssDNA from viral concentrates.
I applied this method to sequence composite viral communities from temperate (Saanich Inlet and waters in and around Strait of Georgia) and subtropical (Gulf of Mexico) waters in order to examine the genetic diversity of marine ssDNA viruses. Metagenomic analysis of the datasets yielded 4559 contigs and 645 complete circular genomes; however, as in previous studies (Angly et al. 2006; Bench et al. 2007; López-Bueno et al. 2009; Desnues et al. 2008), more than 90% of the sequences had no evident similarity to sequences in databases. Nonetheless, these data more than double the number of complete ssDNA genomes in the NCBI database. Comparison of the complete genomes using nucleotide frequency analysis and BLAST revealed more than 125 new evolutionary groups that are so genetically divergent that they likely belong to previously unknown families of viruses. Of these, 11 major groups comprised almost 50% of the sequences. Only two of these groups were previously known; one of these includes members of the Nanoviridae and Circoviridae (focus of Chapter 3), while the other includes members of the Microviridae (the focus of Chapters 4, 5, and 6). Each of the new groups shared at least one common gene and had a common genome structure. Analysis of the 11 conserved inferred amino-acid sequences suggests that two encode capsid proteins similar to those found in plant viruses, suggesting that these sequences are from viruses infecting phytoplankton. This is the first reported metagenomic analysis of environmental ssDNA viruses. The absence of contamination from other viral and bacterial DNA means that by comparative analysis these data can be used to identify ssDNA sequences in other viral metagenomic datasets. 
Moreover, the fact that most of the sequences assembled into contigs, that nearly 85% of the sequences were similar between samples from Saanich Inlet and the Strait of Georgia, and that 70% of the sequences were similar to at least one other sequence in the data set, suggests that much of the sequence space for ssDNA viruses in these environments has now been sampled.

7.1.2  Origin and evolution of ssDNA viruses

Chapter 3 presented the first in-depth comparison of Rep-containing marine and terrestrial ssDNA viruses in the context of their evolution and environment. Previous studies focused on terrestrial viruses only (Liu et al. 2011; Delwart and Li 2011). There is a much wider range of sequence diversity in marine ssDNA viruses than in terrestrial nanoviruses and circoviruses. Indeed, Rep-coding sequences are more similar within established families of ssDNA viruses than among environmental sequences. Phylogenetic analysis predicts that Rep sequences from marine environments are ancestral to plant nanoviruses and separate from animal circoviruses. This suggests that terrestrial viruses have a marine origin and transitioned from the ocean to the land alongside their hosts. This is the most parsimonious explanation for the much higher diversity of marine sequences compared to those from terrestrial viruses.

7.1.3  Genetic diversity of gokushoviruses

Sequences from the sub-family Gokushovirinae were the first sequences from ssDNA viruses to be observed in marine metagenomic datasets. Sequences were observed in marine waters (Angly et al. 2006; Tucker et al. 2011), marine and freshwater stromatolites (Desnues et al. 2008), and in a freshwater lake in Antarctica (López-Bueno et al. 2009), suggesting that these phages are widely distributed. Although only 2% of the sequences from the ssDNA viral metagenomic datasets were similar to gokushoviruses, five complete genomes ranging from 4,062 bp to 5,386 bp were assembled.
This almost doubles the number of gokushovirus genomes in the NCBI database. The environmental genomes share a common genome organization with gokushoviruses infecting parasitic bacteria such as Bdellovibrio bacteriovorus and Chlamydia sp., coding for five major putative proteins and having a few small overlapping genes. Phylogenetic analysis and genome organization clearly show a common evolutionary history for marine gokushoviruses and those that infect obligate intracellular parasitic bacteria. Fragment recruitment to these assembled genomes from other viral metagenomic datasets showed that different genotypes of gokushoviruses dominate in different environments. It also showed some regions of conservation within the gene encoding the major capsid protein, and regions of high variability within the gene encoding the minor capsid protein, the protein that is involved in the recognition of the host. Regions of conservation surrounding a region of high variability allowed for the design of degenerate primers that amplify an 800-bp fragment of the gene encoding the major capsid protein of gokushoviruses.

Chapter 4 described the use of these primers for comparing gokushoviruses among environments. Phylogenetic analysis of samples from the subtropical Gulf of Mexico, temperate waters in and around the Strait of Georgia, and multiple samples collected monthly from Saanich Inlet revealed that the distribution of some gokushovirus evolutionary groups was environment dependent. Chapter 5 detailed the use of these primers to examine the total richness of gokushoviruses in two composite samples from the oxic (10 m) and anoxic (200 m) zones of Saanich Inlet. The richness of gokushoviruses was much higher in the oxic than in the anoxic layer. Moreover, the frequency distribution of virus taxonomic bins was better described by a power law for the oxic than for the anoxic zone, suggesting active replication of gokushoviruses in oxic waters. 
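To make the power-law comparison concrete, the sketch below fits a power law to a rank-abundance distribution by ordinary least squares on log-transformed values. This is an assumed, simplified stand-in and not necessarily the fitting procedure used in Chapter 5; the function name `fit_power_law` and the use of R² as the goodness-of-fit measure are illustrative choices.

```python
import math


def fit_power_law(abundances):
    """Fit abundance = a * rank**(-b) by least squares in log-log space.

    Returns (a, b, r_squared). Zero counts are dropped before ranking.
    This log-log regression is an illustrative simplification.
    """
    ranked = sorted((n for n in abundances if n > 0), reverse=True)
    if len(ranked) < 2:
        raise ValueError("need at least two non-zero abundances")
    xs = [math.log(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log(n) for n in ranked]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    r_squared = 1.0 - ss_res / ss_tot if ss_tot > 0 else 1.0
    return math.exp(intercept), -slope, r_squared
```

Applying such a fit to the bin counts from each depth and comparing the R² values is one simple way to ask which zone is better described by a power law.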
Finally, the presence of very similar sequences at 10 m and 200 m could be due to deep-water renewal. The results suggest that ssDNA viruses that infect bacteria are rarer and endemic to specific environments.

7.1.4  Possible role of ssDNA viruses in the environment

Every second, approximately 10^23 viral infections occur in the ocean (Suttle 2007). While we do not know the fraction of infections that are caused by ssDNA viruses, we know that they are prevalent and may play an important role in the community. Very few ssDNA viruses have been isolated, and discovering the hosts of ssDNA viruses remains a high priority for understanding the role they play in marine ecosystems. The results presented in this dissertation deepen our knowledge of ssDNA viruses and allow us to speculate about their prokaryotic and eukaryotic hosts. Relatively few groups of ssDNA viruses infecting marine protists have been identified. Most of the sequences in my dataset were similar to viruses infecting eukaryotes, which suggests that many of the marine ssDNA viruses infect eukaryotes. It was possible to infer a number of possible hosts by looking at the presence of endogenous viral elements (EVEs) within the genomes of eukaryotes. Interestingly, many EVEs grouped with marine viruses; EVEs from phytoplankton grouped with nanoviruses while EVEs from zooplankton grouped with marine circoviruses. These results strongly suggest that marine ssDNA viruses play an important role in the control of autotrophic and heterotrophic protists at the base of the marine food chain. Gokushoviruses likely play an important role in the bacterioplankton community of Saanich Inlet. The amplicon deep sequencing results suggest that the hosts of gokushoviruses in Saanich Inlet are aerobic bacteria that are specific to the environment and are present in relatively low abundance. 
Chapter 6 outlines a method that combines fluorescence in situ hybridization with catalysed reporter deposition in order to observe cells in which gokushoviruses from Saanich Inlet were replicating. Although the host was not identified, the method showed that gokushovirus replication was occurring in about 7.5% of bacteria, suggesting that gokushoviruses are important bacterial mortality agents in surface waters of Saanich Inlet. In the ocean, the most abundant viruses infect the most abundant bacteria, as has been shown for pelagibacter phages infecting bacteria from the SAR11 clade (Zhao et al. 2013). Given the high occurrence of gokushovirus infections in Saanich Inlet, they likely infect bacteria that are present in high concentrations in the surface waters of Saanich Inlet.

7.2  Method limitations and outlook

Metagenomic analysis of ssDNA viral communities allowed for the assembly of multiple new viral genomes, and revealed many new genotypes. However, the metagenomic library preparation was biased and did not capture the whole diversity of marine ssDNA viruses. For example, filamentous ssDNA viruses, such as members of the Inoviridae, would have been lost during pre-filtration through a 0.2-µm pore-size filter to remove residual bacteria. Moreover, since MDA preferentially amplifies short circular ssDNA, ssDNA viruses with linear genomes, such as the members of the Parvoviridae, could not be enriched, likely accounting for the very few sequences of these that were captured. MDA is heavily biased and can result in misrepresentation of the initial template. Its preferential amplification of short circular ssDNA molecules was beneficial in the context of this dissertation. However, MDA can also enhance chimera formation, does not amplify evenly, and is sometimes biased towards GC-rich regions (Rodrigue et al. 2009). 
Given the very low concentration of initial ssDNA (<50 ng per sample), MDA was used to yield the greatest amplification (with enrichment of ssDNA) with the lowest associated bias (Pinard et al. 2006). Although 454-pyrosequencing is an excellent way to get multiple sequences at a relatively low cost, it has many intrinsic biases. It often produces errors in homopolymer regions (Margulies et al. 2005), which may cause richness to be overestimated (Huse et al. 2009; Quince et al. 2011). Moreover, 454-pyrosequencing can create chimeras (Lasken and Stockwell 2007), which can affect the assembly of metagenomic data. These problems were partially overcome by keeping only fully translated proteins for downstream analyses. In the amplicon deep-sequencing experiment, a 95% similarity threshold was used to avoid richness overestimation, and I kept only sequences that aligned to the reference sequences in order to exclude chimeras. PCR-based approaches may also be biased due to differential amplification of DNA sequences, formation of chimeras and heteroduplexes, and DNA polymerase error (von Wintzingerode et al. 1997; Polz and Cavanaugh 1998; Acinas and Sarma-Rupavtarm 2005). Additionally, when sequencing long PCR products, such as with the amplicon deep sequencing in Chapter 5, it was not possible to get full-length sequences, as each sequencing read is only ~400 bp. This limits the resolution of the subsequent phylogenetic analysis. On the other hand, screening of clones by RFLP can underestimate richness, as two identical profiles can still differ at the sequence level. At present, most bioinformatic tools have been developed for analysis of bacterial sequences and few tools are available for viruses. Future research should be directed towards developing algorithms for the taxonomic classification of viral genotypes and for estimating their genetic richness and diversity. 
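The 95% similarity threshold used in the amplicon deep-sequencing analysis can be illustrated with a greedy clustering sketch. This is an assumption-laden simplification, not the software used in Chapter 5: it presumes that sequences are already aligned to equal length, and the names `identity` and `greedy_cluster` are hypothetical.

```python
def identity(a, b):
    """Fraction of identical positions between two pre-aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    if not a:
        return 0.0
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / len(a)


def greedy_cluster(seqs, threshold=0.95):
    """Greedy clustering: each sequence joins the first cluster whose seed it
    matches at or above the threshold, otherwise it seeds a new cluster.

    Returns a list of (seed, members) tuples. The 0.95 default mirrors the
    threshold stated in the text; everything else is illustrative.
    """
    clusters = []
    for s in seqs:
        for seed, members in clusters:
            if identity(seed, s) >= threshold:
                members.append(s)
                break
        else:
            clusters.append((s, [s]))
    return clusters
```

The number of clusters returned gives a conservative richness estimate, because reads that differ only by occasional sequencing errors fall into the same bin instead of being counted as separate genotypes.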
The high divergence of viral sequences complicates phylogenetic analyses, which are often designed for analyzing more conserved sequences. The limitations of the FISH technique are discussed in Chapter 6. FISH is a complicated method in which each step is critical. Fixation of cells on filters and their removal using sonication creates debris. As well, the large fluorescent complex and the use of paraformaldehyde made the identification of the marine gokushovirus impossible.

7.3  Future directions

This dissertation significantly expands our knowledge of ssDNA viruses by looking at their genetic diversity, geographic distribution and possible hosts. However, several important questions remain to be addressed. At this point, little is known about the natural abundance of ssDNA viruses. Quantitative PCR directly from virus concentrates or environmental samples could give insights regarding the exact concentration of specific groups of ssDNA viruses. Additional research also needs to be done to determine the hosts of ssDNA viruses. Viruses have been isolated that infect the diatom Chaetoceros spp., and similar efforts should be directed towards the isolation of ssDNA viruses infecting other organisms. Isolation of new viruses is the best way to characterize virus-host interactions. However, it is not always possible because many marine microorganisms are difficult to culture (Colwell 1997). Single-cell sequencing is a promising technology that can circumvent the need for cultivation, as was demonstrated with the sequencing of picobiliphytes (Yoon et al. 2011). In Chapter 2, I presented many new groups of viruses that are present in marine environments, although much of the dissertation focused on viruses from the Nanoviridae, Circoviridae and Microviridae. Primers could be designed to look at the richness and distribution of the other nine major groups of ssDNA viruses. Moreover, FISH (Chapter 6) could also be used to target viruses from other viral groups. 
This method has the potential to become a tool for observing infected cells, and possibly identifying host cells.

7.4  Significance of this research

This study greatly expanded the knowledge about the diversity and evolution of ssDNA viruses. The number of complete genomes more than doubled, revealing an incredible diversity that was previously unknown. New and innovative methods were used to compare sequences that had no similarity to known sequences. The research presented in this thesis changed the way we see ssDNA viruses, as it showed that they are likely significant players in marine communities as pathogens of the phytoplankton and microzooplankton underlying marine food webs. Moreover, the work presented in this dissertation changed our perception not only of the extent of the genetic diversity of ssDNA viruses in the sea, but also of the evolution of viruses and the Earth.

Bibliography

Abascal F, Zardoya R, Posada D. (2005). ProtTest: selection of best-fit models of protein evolution. Bioinformatics 21:2104–2105.
Acinas S, Sarma-Rupavtarm R. (2005). PCR-induced sequence artifacts and bias: insights from comparison of two 16S rRNA clone libraries constructed from the same sample. Appl Environ Microb 71:8966–8969.
Allan GM, Ellis JA. (2000). Porcine Circoviruses: A Review. J Vet Diagn Invest 12:3–14.
Ambia-Garrido J, Vainrub A, Pettitt B. (2010). A model for structure and thermodynamics of ssDNA and dsDNA near a surface: a coarse and grained approach. Comput Phys Commun 181:2001–2007.
Anderson J. (1973). Deep water renewal in Saanich Inlet, an intermittently anoxic basin. Estuar Coast Mar Sci 16:1–10.
Andrews-Pfannkoch C, Fadrosh DW, Thorpe J, Williamson SJ. (2010). Hydroxyapatite-mediated separation of double-stranded DNA, single-stranded DNA, and RNA genomes from natural viral assemblages. Appl Environ Microb 76:5039–5045.
Angly FE, Felts B, Breitbart M, Salamon P, Edwards RA, Carlson C, et al. (2006). The marine viromes of four oceanic regions. 
PLoS Biol 4:2121–2131.
Armbrust EV. (2009). The life of diatoms in the world’s oceans. Nature 459:185–192.
Aronson MN, Meyer AD, Györgyey J, Katul L, Vetten HJ, Gronenborn B, et al. (2000). Clink, a nanovirus-encoded protein, binds both pRB and SKP1. J Virol 74:2967–2972.
Bamford DH. (2002). Evolution of Viral Structure. Theor Popul Biol 61:461–470.
Baudoux A-C, Brussaard CPD. (2005). Characterization of different viruses infecting the marine harmful algal bloom species Phaeocystis globosa. Virology 341:80–90.
Bejarano ER, Khashoggi A, Witty M, Lichtenstein C. (1996). Integration of multiple repeats of geminiviral DNA into the nuclear genome of tobacco during evolution. Proc Natl Acad Sci U S A 93:759–764.
Belyi VA, Levine AJ, Skalka AM. (2010). Sequences from ancestral single-stranded DNA viruses and vertebrate genomes: the Parvoviridae and Circoviridae are more than 40 to 50 million years old. J Virol 84:12458–12462.
Bench SR, Hanson TE, Williamson KE, Ghosh D, Radosovich M, Wang K, et al. (2007). Metagenomic characterization of Chesapeake Bay virioplankton. Appl Environ Microb 73:7629–7641.
Bergh Ø, Børsheim KY, Bratbak G, Heldal M. (1989). High abundance of viruses found in aquatic environments. Nature 340:467–468.
Biagini P, De Micco P. (2008). Anelloviruses (TTV and variants): state of the art 10 years after their discovery. Transfus Clin Biol 15:406–415.
Biers EJ, Sun S, Howard EC. (2009). Prokaryotic genomes and diversity in surface ocean waters: interrogating the global ocean sampling metagenome. Appl Environ Microb 75:2221–2229.
Blinkova O, Rosario K, Li L, Kapoor A, Slikas B, Bernardin F, et al. (2009). Frequent detection of highly diverse variants of cardiovirus, cosavirus, bocavirus, and circovirus in sewage samples collected in the United States. J Clin Microbiol 47:3507–3513.
Bolduc B, Shaughnessy DP, Wolf YI, Koonin EV, Roberto FF, Young M. (2012). 
Identification of novel positive-strand RNA viruses by metagenomic analysis of archaea-dominated Yellowstone Hot Springs. J Virol 86:5562–5573.
Borodovsky M, Mills R, Besemer J. (2003). Prokaryotic gene prediction using GeneMark and GeneMark.hmm. Curr Protoc Bioinformatics Chapter 4:4.5.1–4.5.16.
Böttcher B, Unseld S, Ceulemans H, Russell RB, Jeske H. (2004). Geminate structures of African Cassava Mosaic Virus. J Virol 78:6758–6765.
Bowler C, De Martino A, Falciatore A. (2010). Diatom cell division in an environmental context. Curr Opin Plant Biol 13:623–630.
Bowler C, Vardi A, Allen AE. (2010). Oceanographic and biogeochemical insights from diatom genomes. Ann Rev Mar Sci 2:333–365.
Boyd PW, Watson AJ, Law CS, Abraham ER, Trull T, Murdoch R, et al. (2000). A mesoscale phytoplankton bloom in the polar Southern Ocean stimulated by iron fertilization. Nature 407:695–702.
Bratbak G, Egge J, Heldal M. (1993). Viral mortality of the marine alga Emiliania huxleyi (Haptophyceae) and termination of algal blooms. Mar Ecol Prog Ser 93:39–48.
Breitbart M, Felts B, Kelley S, Mahaffy JM, Nulton J, Salamon P, et al. (2004). Diversity and population structure of a near-shore marine-sediment viral community. Proc R Soc Lond [Biol] 271:565–574.
Breitbart M, Haynes M, Kelley S, Angly FE, Edwards RA, Felts B, et al. (2008). Viral diversity and dynamics in an infant gut. Res Microbiol 159:367–373.
Breitbart M, Miyake JH, Rohwer F. (2004). Global distribution of nearly identical phage-encoded DNA sequences. FEMS Microbiol Lett 236:249–256.
Breitbart M, Rohwer F. (2005). Here a virus, there a virus, everywhere the same virus? Trends Microbiol 13:278–284.
Breitbart M, Rohwer F, Abedon S. (2005). Phage Ecology and Bacterial Pathogenesis. In: Phages, their role in bacterial pathogenesis, Waldor MK, Freedman DI, Adhya SL (eds). ASM Press: Washington, DC, pp. 66–91.
Breitbart M, Salamon P, Andresen B, Mahaffy JM, Segall AM, Mead D, et al. (2002). 
Genomic analysis of uncultured marine viral communities. Proc Natl Acad Sci U S A 99:14250–14255.
Brentlinger KL, Hafenstein S, Novak CR, Fane BA, Borgon R, McKenna R, et al. (2002). Microviridae, a family divided: isolation, characterization, and genome sequence of phiMH2K, a bacteriophage of the obligate intracellular parasitic bacterium Bdellovibrio bacteriovorus. J Bacteriol 184:1089–1094.
Briddon RW, Bull SE, Amin I, Mansoor S, Bedford ID, Rishi N, et al. (2004). Diversity of DNA 1: a satellite-like molecule associated with monopartite begomovirus-DNA beta complexes. Virology 324:462–474.
Briddon RW, Stanley J. (2006). Subviral agents associated with plant single-stranded DNA viruses. Virology 344:198–210.
Brown KE. (2010). The expanding range of parvoviruses which infect humans. Rev Med Virol 20:231–244.
Büning H, Perabo L, Coutelle O, Quadt-Humme S, Hallek M. (2008). Recent developments in adeno-associated virus vector technology. J Gene Med 10:717–733.
Carbone A. (2008). Codon bias is a major factor explaining phage evolution in translationally biased hosts. J Mol Evol 66:210–223.
Carver TJ, Rutherford KM, Berriman M, Rajandream M-A, Barrell BG, Parkhill J. (2005). ACT: the Artemis Comparison Tool. Bioinformatics 21:3422–3423.
Chae C. (2005). A review of porcine circovirus 2-associated syndromes and diseases. Vet J 169:326–336.
Chao A. (1987). Estimating the population size for capture-recapture data with unequal catchability. Biometrics 43:783–791.
Chen F, Suttle CA. (1996). Evolutionary relationships among large double-stranded DNA viruses that infect microalgae and other organisms as inferred from DNA polymerase genes. Virology 219:170–178.
Chen F, Suttle CA, Short SM. (1996). Genetic diversity in marine algal virus communities as revealed by sequence analysis of DNA polymerase genes. Appl Environ Microb 62:2869–2874.
Chen F, Wang K, Huang S, Cai H, Zhao M, Jiao N, et al. (2009). 
Diverse and dynamic populations of cyanobacterial podoviruses in the Chesapeake Bay unveiled through DNA polymerase gene sequences. Environ Microbiol 11:2884–2892.
Chénard C, Suttle CA. (2008). Phylogenetic diversity of sequences of cyanophage photosynthetic gene psbA in marine and freshwaters. Appl Environ Microb 74:5317–5324.
Cheung AK. (2011). Porcine circovirus: Transcription and DNA replication. Virus Res 6382:1–8.
Chipman PR, Agbandje-McKenna M, Renaudin J, Baker TS, McKenna R. (1998). Structural analysis of the spiroplasma virus, SpV4: implications for evolutionary variation to obtain host diversity among the Microviridae. Structure 6:135–145.
Clarke IN, Cutcliffe LT, Everson JS, Garner SA, Lambden PR, Pead PJ, et al. (2004). Chlamydiaphage Chp2, a skeleton in the phiX174 closet: scaffolding protein and procapsid identification. J Bacteriol 186:7571–7574.
Clasen JL, Suttle CA. (2009). Identification of freshwater Phycodnaviridae and their potential phytoplankton hosts, using DNA pol sequence fragments and a genetic-distance analysis. Appl Environ Microb 75:991–997.
Clokie MRJ, Shan J, Bailey S, Jia Y, Krisch HM, West S, et al. (2006). Transcription of a “photosynthetic” T4-type phage during infection of a marine cyanobacterium. Environ Microbiol 8:827–835.
Colwell RR. (1997). Microbial diversity: the importance of exploration and conservation. J Ind Microbiol Biot 18:302–307.
Comeau AM, Buenaventura E. (2005). A persistent, productive, and seasonally dynamic vibriophage population within Pacific oysters (Crassostrea gigas). Appl Environ Microb 71:5324–5331.
Culley AI, Lang AS, Suttle CA. (2006). Metagenomic analysis of coastal RNA virus communities. Science 312:1795–1798.
Dale JL. (1987). Banana bunchy top: an economically important tropical plant virus disease. Adv Virus Res 33:301–325.
Danovaro R, Dell’Anno A, Corinaldesi C, Magagnini M, Noble R, Tamburini C, et al. (2008). Major viral impact on the functioning of benthic deep-sea ecosystems. 
Nature 454:1084–1087.
Davis B. (2003). Filamentous phages linked to virulence of Vibrio cholerae. Curr Opin Microbiol 6:35–42.
Dean F, Nelson J, Giesler T, Lasken R. (2001). Rapid amplification of plasmid and phage DNA using phi29 DNA polymerase and multiply-primed rolling circle amplification. Genome Res 11:1095–1099.
Delongchamp R, Valentine C, Malling H. (2001). Estimation of the average burst size of phiX174 am3, cs70 for use in mutation assays with transgenic mice. Environ Mol Mutagen 360:356–360.
Delwart E, Li L. (2011). Rapidly expanding genetic diversity and host range of the Circoviridae viral family and other Rep encoding small circular ssDNA genomes. Virus Res 164:114–121.
Denoeud F, Roussel M, Noel B, Wawrzyniak I, Da Silva C, Diogon M, et al. (2011). Genome sequence of the stramenopile Blastocystis, a human anaerobic parasite. Genome Biol 12:R29.
Desnues C, Rodriguez-Brito B, Rayhawk S, Kelley S, Tran T, Haynes M, et al. (2008). Biodiversity and biogeography of phages in modern stromatolites and thrombolites. Nature 452:340–343.
Diemer GS, Stedman KM. (2012). A novel virus genome discovered in an extreme environment suggests recombination between unrelated groups of RNA and DNA viruses. Biology Direct 7:13.
Djikeng A, Kuzmickas R, Anderson NG, Spiro DJ. (2009). Metagenomic analysis of RNA viruses in a fresh water lake. PLoS One 4:e7264.
Dokland T, McKenna R, Ilag LL, Bowman BR, Incardona NL, Fane BA, et al. (1997). Structure of a viral procapsid with molecular scaffolding. Nature 389:308–313.
Duffy S, Holmes EC. (2008). Phylogenetic evidence for rapid rates of molecular evolution in the single-stranded DNA begomovirus tomato yellow leaf curl virus. J Virol 82:957–965.
Duffy S, Shackelton LA, Holmes EC. (2008). Rates of evolutionary change in viruses: patterns and determinants. Nat Rev Genet 9:267–276.
Dunlap D, Ng T, Rosario K, Barbosa J, Greco A, Breitbart M, et al. (2013). Molecular and microscopic evidence of viruses in marine copepods. 
Proc Natl Acad Sci U S A 110:1375–1380.
Edgar RC. (2004). MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res 32:1792–1797.
Edwards RA, Rohwer F. (2005). Viral metagenomics. Nat Rev Microbiol 3:504–510.
Van Etten JL. (2011). Another really, really big virus. Viruses 3:32–46.
Van Etten JL, Graves MV, Müller DG, Boland W, Delaroque N. (2002). Phycodnaviridae – large DNA algal viruses. Arch Virol 147:1479–1516.
Everson JS, Garner SA, Fane BA, Liu BL, Lambden PR, Clarke IN. (2002). Biological properties and cell tropism of Chp2, a bacteriophage of the obligate intracellular bacterium Chlamydophila abortus. J Bacteriol 184:2748–2754.
Everson JS, Garner SA, Lambden PR, Fane BA, Clarke IN. (2003). Host range of chlamydiaphages phiCPAR39 and Chp3. J Bacteriol 185:6490–6492.
Falkowski PG, Barber RT, Smetacek V. (1998). Biogeochemical controls and feedbacks on ocean primary production. Science 281:200–206.
Faurez F, Dory D, Grasland B, Jestin A. (2009). Replication of porcine circoviruses. Virol J 6:60.
Feige U, Stirm S. (1976). On the structure of the Escherichia coli C cell wall lipopolysaccharide core and on its phiX174 receptor region. Biochem Bioph Res Co 71:566–573.
Field CB, Behrenfeld MJ, Randerson JT, Falkowski PG. (1998). Primary production of the biosphere: integrating terrestrial and oceanic components. Science 281:237–240.
Filée J, Tétart F, Suttle CA, Krisch HM. (2005). Marine T4-type bacteriophages, a ubiquitous component of the dark matter of the biosphere. Proc Natl Acad Sci U S A 102:12471–12476.
Finsterbusch T, Mankertz A. (2009). Porcine circoviruses – small but powerful. Virus Res 143:177–183.
Firth C, Charleston MA, Duffy S, Shapiro B, Holmes EC. (2009). Insights into the evolutionary history of an emerging livestock pathogen: porcine circovirus 2. J Virol 83:12813–12821.
Forterre P, Gribaldo S. (2007). The origin of modern terrestrial life. HFSP J 1:156–168. 
Franzén O, Jerlström-Hultqvist J, Castro E, Sherwood E, Ankarklev J, Reiner DS, et al. (2009). Draft genome sequencing of Giardia intestinalis assemblage B isolate GS: is human giardiasis caused by two different species? PLoS Pathog 5:e1000560.
Fuhrman JA. (1999). Marine viruses and their biogeochemical and ecological effects. Nature 399:541–548.
Fuhrman JA, Schwalbach M. (2003). Viral influence on aquatic bacterial communities. Biol Bull 204:192–195.
Fuhrman JA, Steele JA, Hewson I, Schwalbach MS, Brown MV, Green JL, et al. (2008). A latitudinal diversity gradient in planktonic marine bacteria. Proc Natl Acad Sci U S A 105:7774–7778.
Fuller NJ, Wilson WH, Joint IR, Mann NH. (1998). Occurrence of a sequence in marine cyanophages similar to that of T4 g20 and its application to PCR-based detection and quantification techniques. Appl Environ Microb 64:2051–2060.
Garner SA, Everson JS, Lambden PR, Fane BA, Clarke IN. (2004). Isolation, molecular characterisation and genome sequence of a bacteriophage (Chp3) from Chlamydophila pecorum. Virus Genes 28:207–214.
Ge L, Zhang J, Zhou X, Li H. (2007). Genetic structure and population variability of tomato yellow leaf curl China virus. J Virol 81:5902–5907.
Gibbs MJ, Smeianov VV, Steele JL, Upcroft P, Efimov BA. (2006). Two families of Rep-like genes that probably originated by interspecies recombination are represented in viral, plasmid, bacterial, and parasitic protozoan genomes. Mol Biol Evol 23:1097–1100.
Gibbs MJ, Weiller GF. (1999). Evidence that a plant virus switched hosts to infect a vertebrate and then recombined with a vertebrate-infecting virus. Proc Natl Acad Sci U S A 96:8022–8027.
Gilbert JA, Steele JA, Caporaso JG, Steinbrück L, Reeder J, Temperton B, et al. (2006). Two families of rep-like genes that probably originated by interspecies recombination are represented in viral, plasmid, bacterial, and parasitic protozoan genomes. Mol Biol Evol 23:1097–1100. 
Gobet A, Böer SI, Huse SM, Van Beusekom JEE, Quince C, Sogin ML, et al. (2012). Diversity and dynamics of rare and of resident bacterial populations in coastal sands. ISME J 6:542–553.
Gobler C, Hutchins D, Fisher N, Cosper E, Sanudo-Wilhelmy S. (1997). Release and bioavailability of C, N, P, Se, and Fe following viral lysis of a marine chrysophyte. Limnol Oceanogr 42:1492–1504.
Godson GN, Barrell BG, Staden R, Fiddes JC. (1978). Nucleotide sequence of bacteriophage G4 DNA. Nature 276:236–247.
Goldsmith DB, Crosti G, Dwivedi B, McDaniel LD, Varsani A, Suttle CA, et al. (2011). Development of phoH as a novel signature gene for assessing marine phage diversity. Appl Environ Microb 77:7730–7739.
Gonçalves MAFV. (2005). Adeno-associated virus: from defective virus to effective vector. Virol J 2:43.
Gordon D. (2003). Viewing and editing assembled sequences using Consed. Curr Protoc Bioinformatics Chapter 11.
Grau-Roma L, Fraile L, Segalés J. (2011). Recent advances in the epidemiology, diagnosis and control of diseases caused by porcine circovirus type 2. Vet J 187:23–32.
Grigoras I, Timchenko T, Grande-Pérez A, Katul L, Vetten H-J, Gronenborn B. (2010). High variability and rapid evolution of a nanovirus. J Virol 84:9105–9117.
Grigoras I, Timchenko T, Gronenborn B. (2008). Transcripts encoding the nanovirus master replication initiator proteins are terminally redundant. J Gen Virol 89:583–593.
Gronenborn B. (2004). Nanoviruses: genome organisation and protein function. Vet Microbiol 98:103–109.
Guindon S, Dufayard J-F, Lefort V, Anisimova M, Hordijk W, Gascuel O. (2010). New algorithms and methods to estimate maximum-likelihood phylogenies: assessing the performance of PhyML 3.0. Syst Biol 59:307–321.
Guindon S, Gascuel O. (2003). A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood. Syst Biol 52:696–704.
Gutierrez C. (1999). Geminivirus DNA replication. Cell Mol Life Sci 56:313–329.
Halami MY, Nieper H, Müller H, Johne R. (2008). 
Detection of a novel circovirus in mute swans (Cygnus olor) by using nested broad-spectrum PCR. Virus Res 132:208–212.
Hambly E, Suttle CA. (2005). The viriosphere, diversity, and genetic exchange within phage communities. Curr Opin Microbiol 8:444–450.
Hambly E, Tétart F, Desplats C, Wilson WH, Krisch HM, Mann NH. (2001). A conserved genetic module that encodes the major virion components in both the coliphage T4 and the marine cyanophage S-PM2. Proc Natl Acad Sci U S A 98:11411–11416.
Hattermann K, Schmitt C, Soike D, Mankertz A. (2003). Cloning and sequencing of Duck circovirus (DuCV). Arch Virol 148:2471–2480.
Hendrix RW. (2003). Bacteriophage genomics. Curr Opin Microbiol 6:506–511.
Heyraud-Nitschke F, Schumacher S, Laufs J, Schaefer S, Schell J, Gronenborn B. (1995). Determination of the origin cleavage and joining domain of geminivirus Rep proteins. Nucleic Acids Res 23:910–916.
Hino S, Miyata H. (2007). Torque teno virus (TTV): current status. Rev Med Virol 17:45–57.
Hoffmann KH, Rodriguez-Brito B, Breitbart M, Bangor D, Angly FE, Felts B, et al. (2007). Power law rank-abundance models for marine phage communities. FEMS Microbiol Lett 273:224–228.
Holmfeldt K, Odić D, Sullivan MB, Middelboe M, Riemann L. (2012). Cultivated single-stranded DNA phages that infect marine Bacteroidetes prove difficult to detect with DNA-binding stains. Appl Environ Microb 78:892–894.
Hsieh Y, Wanner B. (2010). Global regulation by the seven-component Pi signaling system. Curr Opin Microbiol 13:198–203.
Huang S, Wilhelm SW, Jiao N, Chen F. (2010). Ubiquitous cyanobacterial podoviruses in the global oceans unveiled through viral DNA polymerase gene sequences. ISME J 4:1243–1251.
Huelsenbeck JP, Ronquist F. (2001). MRBAYES: Bayesian inference of phylogenetic trees. Bioinformatics 17:754–755.
Huse SM, Welch DM, Morrison HG, Sogin ML. (2010). Ironing out the wrinkles in the rare biosphere through improved OTU clustering. Environ Microbiol 12:1889–1898. 
Ignacio-Espinoza JC, Sullivan MB. (2012). Phylogenomics of T4 cyanophages: lateral gene transfer in the “core” and origins of host genes. Environ Microbiol 14:2113–2136.
Ilyina TV, Koonin EV. (1992). Conserved sequence motifs in the initiator proteins for rolling circle DNA replication encoded by diverse replicons from eubacteria, eucaryotes and archaebacteria. Nucleic Acids Res 20:3279–3285.
Iwaya M, Eisenberg S, Bartok K, Denhardt DT. (1973). Mechanism of replication of single-stranded PhiX174 DNA. VII. Circularization of the progeny viral strand. J Virol 12:808–818.
Jacquet S, Heldal M, Iglesias-Rodriguez D, Larsen A, Wilson W, Bratbak G. (2002). Flow cytometric analysis of an Emiliana huxleyi bloom terminated by viral infection. Aquat Microb Ecol 27:111–124.
Jin X, Gruber N, Dunne JP, Sarmiento JL, Armstrong RA. (2006). Diagnosing the contribution of phytoplankton functional groups to the production and export of particulate organic carbon, CaCO3, and opal from global nutrient and alkalinity distributions. Global Biogeochem Cy 20:1–17.
Johne R, Fernández-de-Luco D, Höfle U, Müller H. (2006). Genome of a novel circovirus of starlings, amplified by multiply primed rolling-circle amplification. J Gen Virol 87:1189–1195.
Johne R, Müller H, Rector A, Van Ranst M, Stevens H. (2009). Rolling-circle amplification of viral DNA genomes using phi29 polymerase. Trends Microbiol 17:205–211.
Kalyuzhnaya MG, Zabinsky R, Bowerman S, Baker DR, Lidstrom ME, Chistoserdova L. (2006). Fluorescence in situ hybridization-flow cytometry-cell sorting-based method for separation and enrichment of type I and type II methanotroph populations. Appl Environ Microb 72:4293–4301.
Katoh K, Misawa K, Kuma K, Miyata T. (2002). MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res 30:3059–3066.
Katzourakis A, Gifford RJ. (2010). Endogenous viral elements in animal genomes. PLoS Genet 6:e1001191. 
Keeling PJ, Burger G, Durnford DG, Lang BF, Lee RW, Pearlman RE, et al. (2005). The tree of eukaryotes. Trends Ecol Evol 20:670–676.
Kenzaka T, Tani K, Nasu M. (2010). High-frequency phage-mediated gene transfer in freshwater environments determined at single-cell level. ISME J 4:648–659.
Khayat R, Brunn N, Speir JA, Hardham JM, Ankenbauer RG, Schneemann A, et al. (2011). The 2.3-angstrom structure of porcine circovirus 2. J Virol 85:7856–7862.
Kim K-H, Chang H-W, Nam Y-D, Roh SW, Kim M-S, Sung Y, et al. (2008). Amplification of uncultured single-stranded DNA viruses from rice paddy soil. Appl Environ Microb 74:5975–5985.
King A, Adams M, Carstens E, Lefkowitz E. (2012). Virus Taxonomy: Ninth Report of the International Committee on Taxonomy of Viruses. 2nd Ed. Elsevier Academic Press: San Diego, California.
Kooistra W, Gersonde R, Medlin LK, Mann DG. (2007). The origin and evolution of the diatoms: their adaptation to a planktonic existence. In: Evolution of primary producers in the sea, Falkowski PG, Knoll AH (eds). Boston, Elsevier, pp. 207–250.
Koonin EV, Ilyina TV. (1992). Geminivirus replication proteins are related to prokaryotic plasmid rolling circle DNA replication initiator proteins. J Gen Virol 73:2763–2766.
Krupovic M, Forterre P. (2011). Microviridae goes temperate: microvirus-related proviruses reside in the genomes of bacteroidetes. PLoS One 6:e19893.
Krupovic M, Ravantti JJ, Bamford DH. (2009). Geminiviruses: a tale of a plasmid becoming a virus. BMC Evol Biol 21:112.
Labonté JM, Reid KE, Suttle CA. (2009). Phylogenetic analysis indicates evolutionary diversity and environmental segregation of marine podovirus DNA polymerase gene sequences. Appl Environ Microb 75:3634–3640.
Lang AS, Rise ML, Culley AI, Steward GF. (2009). RNA viruses in the sea. FEMS Microbiol Rev 33:295–323.
Lasken RS, Stockwell TB. (2007). Mechanism of chimera formation during the Multiple Displacement Amplification reaction. BMC Biotechnol 7:19. 
Laufs J, Traut W, Heyraud F, Matzeit V, Rogers SG, Schell J, et al. (1995). In vitro cleavage and joining at the viral origin of replication by the replication initiator protein of tomato yellow leaf curl virus. Proc Natl Acad Sci U S A 92:3879–3883.
Li L, Kapoor A, Slikas B, Bamidele OS, Wang C, Shaukat S, et al. (2010). Multiple diverse circoviruses infect farm animals and are commonly found in human and chimpanzee feces. J Virol 84:1674–1682.
Li L, Shan T, Soji OB, Alam MM, Kunz TH, Zaidi SZ, et al. (2011). Possible cross-species transmission of circoviruses and cycloviruses among farm animals. J Gen Virol 92:768–772.
Li L, Victoria JG, Wang C, Jones M, Fellers GM, Kunz TH, et al. (2010). Bat guano virome: predominance of dietary viruses from insects and plants plus novel mammalian viruses. J Virol 84:6955–6965.
Li W, Godzik A. (2006). Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 22:1658–1659.
Lilley M, Baross J, Gordon L. (1982). Dissolved hydrogen and methane in Saanich Inlet, British Columbia. Deep Sea Research Part A 29:1471–1484.
Lindell D, Jaffe JD, Johnson ZI, Church GM, Chisholm SW. (2005). Photosynthesis genes in marine viruses yield proteins during host infection. Nature 438:86–89.
Lindell D, Sullivan MB, Johnson ZI, Tolonen AC, Rohwer F, Chisholm SW. (2004). Transfer of photosynthesis genes to and from Prochlorococcus viruses. Proc Natl Acad Sci U S A 101:11013–11018.
Liu BL, Everson JS, Fane BA, Giannikopoulou P, Vretou E, Lambden PR, et al. (2000). Molecular characterization of a bacteriophage (Chp2) from Chlamydia psittaci. J Virol 74:3464–3469.
Liu H, Fu Y, Li B, Yu X, Xie J, Cheng J, et al. (2011). Widespread horizontal gene transfer from circular single-stranded DNA viruses to eukaryotic genomes. BMC Evol Biol 11:276.
Lizardi PM, Huang X, Zhu Z, Bray-Ward P, Thomas DC, Ward DC. (1998). Mutation detection and single-molecule counting using isothermal rolling-circle amplification. Nat Genet 19:225–232.
Londoño A, Riego-Ruiz L, Argüello-Astorga GR. (2010). DNA-binding specificity determinants of replication proteins encoded by eukaryotic ssDNA viruses are adjacent to widely separated RCR conserved motifs. Arch Virol 155:1033–1046.
López-Bueno A, Tamames J, Velázquez D, Moya A, Quesada A, Alcamí A. (2009). High diversity of the viral community from an Antarctic lake. Science 326:858–861.
Lorincz M, Cságola A, Farkas SL, Székely C, Tuboly T. (2011). First detection and analysis of a fish circovirus. J Gen Virol 92:1817–1821.
Ma Y, Paulsen IT, Palenik B. (2012). Analysis of two marine metagenomes reveals the diversity of plasmids in oceanic environments. Environ Microbiol 14:453–466.
Mankertz A, Hattermann K, Ehlers B, Soike D. (2000). Cloning and sequencing of columbid circovirus (coCV), a new circovirus from pigeons. Arch Virol 145:2469–2479.
Mankertz A, Mankertz J, Wolf K, Buhk HJ. (1998). Identification of a protein essential for replication of porcine circovirus. J Gen Virol 79:381–384.
Manti A, Boi P, Amalfitano S, Puddu A, Papa S. (2011). Experimental improvements in combining CARD-FISH and flow cytometry for bacterial cell quantification. J Microbiol Meth 87:309–315.
Margulies M, Egholm M, Altman WE, Attiya S, Bader JS, Bemben LA, et al. (2005). Genome sequencing in microfabricated high-density picolitre reactors. Nature 437:376–380.
Martin DP, Biagini P, Lefeuvre P, Golden M, Roumagnac P, Varsani A. (2011). Recombination in eukaryotic single stranded DNA viruses. Viruses 3:1699–1738.
Martin MO. (2002). Predatory prokaryotes: an emerging research opportunity. J Mol Microbiol Biotechnol 4:467–477.
Martínez Martínez J, Schroeder D, Larsen A, Bratbak G, Wilson WH. (2007). Molecular dynamics of Emiliania huxleyi and cooccurring viruses during two separate mesocosm studies. Appl Environ Microb 73:554–562.
Martínez Martínez J, Schroeder D, Wilson WH. (2012). Dynamics and genotypic composition of Emiliania huxleyi and their co-occurring viruses during a coccolithophore bloom in the North Sea. FEMS Microbiol Ecol 81:315–323.
Massana R. (2011). Eukaryotic picoplankton in surface oceans. Annu Rev Microbiol 65:91–110.
Matsen FA, Kodner RB, Armbrust EV. (2010). pplacer: linear time maximum-likelihood and Bayesian phylogenetic placement of sequences onto a fixed reference tree. BMC Bioinformatics 11:538.
Meehan BM, Creelan JL, McNulty MS, Todd D. (1997). Sequence of porcine circovirus DNA: affinities with plant circoviruses. J Gen Virol 78:221–227.
Middelboe M. (2008). Microbial disease in the sea: effects of viruses on carbon and nutrient cycling. In: Infectious Disease Ecology: Effects of Ecosystems on Disease and of Disease on Ecosystems, Ostfeld RS, Keesing F, Eviner VT (eds). Princeton University Press: Princeton, NJ, USA, pp. 242–259.
Middelboe M, Hagström A, Blackburn N, Sinn B, Fischer U, Borch NH, et al. (2001). Effects of bacteriophages on the population dynamics of four strains of pelagic marine bacteria. Microbial Ecol 42:395–406.
Middelboe M, Jorgensen N, Kroer N. (1996). Effects of viruses on nutrient turnover and growth efficiency of noninfected marine bacterioplankton. Appl Environ Microb 62:1991–1997.
Middelboe M, Lundsgaard C. (2003). Microbial activity in the Greenland Sea: role of DOC lability, mineral nutrients and temperature. Aquat Microb Ecol 32:151–163.
Millard A, Clokie MRJ, Shub DA, Mann NH. (2004). Genetic organization of the psbAD region in phages infecting marine Synechococcus strains. Proc Natl Acad Sci U S A 101:11007–11012.
Moraru C, Lam P, Fuchs BM, Kuypers MMM, Amann R. (2010). GeneFISH--an in situ technique for linking gene presence and cell identity in environmental microorganisms. Environ Microbiol 12:3057–3073.
Motegi C, Nagata T, Miki T, Weinbauer MG, Legendre L, Rassoulzadegan F. (2009). Viral control of bacterial growth efficiency in marine pelagic environments. Limnol Oceanogr 54:1901–1910.
Muyzer G, Smalla K. (1998). Application of denaturing gradient gel electrophoresis (DGGE) and temperature gradient gel electrophoresis (TGGE) in microbial ecology. Antonie van Leeuwenhoek 73:127–141.
Nagasaki K. (2008). Dinoflagellates, diatoms, and their viruses. J Microbiol (Seoul, Korea) 46:235–243.
Nagasaki K, Tomaru Y, Takao Y, Nishida K, Shirai Y, Suzuki H, et al. (2005). Previously unknown virus infects marine diatom. Appl Environ Microb 71:3528–3535.
Nash TE, Dallas MB, Reyes MI, Buhrman GK, Ascencio-Ibañez JT, Hanley-Bowdoin L. (2011). Functional analysis of a novel motif conserved across geminivirus Rep proteins. J Virol 85:1182–1192.
Nawaz-ul-Rehman MS, Fauquet CM. (2009). Evolution of geminiviruses and their satellites. FEBS Lett 583:1825–1832.
Ng TFF, Duffy S, Polston JE, Bixby E, Vallad GE, Breitbart M. (2011). Exploring the diversity of plant DNA viruses and their satellites using vector-enabled metagenomics on whiteflies. PLoS One 6:e19050.
Ng TFF, Manire C, Borrowman K, Langer T, Ehrhart L, Breitbart M. (2009). Discovery of a novel single-stranded DNA virus from a sea turtle fibropapilloma by using viral metagenomics. J Virol 83:2500–2509.
Ng TFF, Suedmeyer WK, Wheeler E, Gulland F, Breitbart M. (2009). Novel anellovirus discovered from a mortality event of captive California sea lions. J Gen Virol 90:1256–1261.
Ng TFF, Wheeler E, Greig D, Waltzek TB, Gulland F, Breitbart M. (2011). Metagenomic identification of a novel anellovirus in Pacific harbor seal (Phoca vitulina richardsii) lung samples and its detection in samples from multiple years. J Gen Virol 92:1318–1323.
Ng TFF, Willner DL, Lim YW, Schmieder R, Chau B, Nilsson C, et al. (2011). Broad surveys of DNA viral diversity obtained through viral metagenomics of mosquitoes. PLoS One 6:e20579.
Niagro FD, Forsthoefel AN, Lawther RP, Kamalanathan L, Ritchie BW, Latimer KS, et al. (1998). Beak and feather disease virus and porcine circovirus genomes: intermediates between the geminiviruses and plant circoviruses. Arch Virol 143:1723–1744.
Not F, Valentin K, Romari K, Lovejoy C, Massana R, Töbe K, et al. (2007). Picobiliphytes: a marine picoplanktonic algal group with unknown affinities to other eukaryotes. Science 315:253–255.
Opriessnig T, Meng XJ, Halbur PG. (2007). Porcine circovirus type 2-associated disease: update on current terminology, clinical manifestations, pathogenesis, diagnosis, and intervention strategies. J Vet Diagn Invest 19:591–615.
Oshima K, Kakizawa S, Nishigawa H, Kuboyama T, Miyata S, Ugaki M, et al. (2001). A plasmid of phytoplasma encodes a unique replication protein having both plasmid- and virus-like domains: clue to viral ancestry or result of virus/plasmid recombination? Virology 285:270–277.
Park Y, Jung S-E, Tomaru Y, Choi W, Kim Y, Mizumoto H, et al. (2009). Characterization of the Chaetoceros salsugineum nuclear inclusion virus coat protein gene. Virus Res 142:127–133.
Partensky F, Hess WR, Vaulot D. (1999). Prochlorococcus, a marine photosynthetic prokaryote of global significance. Microbiol Mol Biol R 63:106–127.
Pavesi A. (2006). Origin and evolution of overlapping genes in the family Microviridae. J Gen Virol 87:1013–1017.
Pavlekovic M, Schmid MC, Schmider-Poignee N, Spring S, Pilhofer M, Gaul T, et al. (2009). Optimization of three FISH procedures for in situ detection of anaerobic ammonium oxidizing bacteria in biological wastewater treatment. J Microbiol Meth 78:119–126.
Pedulla ML, Ford ME, Houtz JM, Karthikeyan T, Wadsworth C, Lewis JA, et al. (2003). Origins of highly mosaic mycobacteriophage genomes. Cell 113:171–182.
Pernthaler A, Amann R. (2004). Simultaneous fluorescence in situ hybridization of mRNA and rRNA in environmental bacteria. Appl Environ Microb 70:5426–5433.
Pernthaler A, Pernthaler J, Amann R. (2002). Fluorescence in situ hybridization and catalyzed reporter deposition for the identification of marine bacteria. Appl Environ Microb 68:3094–3101.
Pietilä MK, Laurinavicius S, Sund J, Roine E, Bamford DH. (2010). The single-stranded DNA genome of novel archaeal virus Halorubrum pleomorphic virus 1 is enclosed in the envelope decorated with glycoprotein spikes. J Virol 84:788–798.
Pietilä MK, Roine E, Paulin L, Kalkkinen N, Bamford DH. (2009). An ssDNA virus infecting archaea: a new lineage of viruses with a membrane envelope. Mol Microbiol 72:307–319.
Pinard R, De Winter A, Sarkis GJ, Gerstein MB, Tartaro KR, Plant RN, et al. (2006). Assessment of whole genome amplification-induced bias through high-throughput, massively parallel whole genome sequencing. BMC Genomics 7:216.
Polz MF, Cavanaugh CM. (1998). Bias in template-to-product ratios in multitemplate PCR. Appl Environ Microb 64:3724–3730.
Pomeroy LR, Williams J leB, Azam F, Hobbie JE. (2007). The microbial loop. Oceanography 20:28–33.
Poorvin L, Rinta-Kanto JM, Hutchins D, Wilhelm SW. (2004). Viral release of iron and its bioavailability to marine plankton. Limnol Oceanogr 49:1734–1741.
Pride DT, Wassenaar TM, Ghose C, Blaser MJ. (2006). Evidence of host-virus co-evolution in tetranucleotide usage patterns of bacteriophages and eukaryotic viruses. BMC Genomics 7:8.
Proctor LM, Fuhrman JA. (1990). Viral mortality of marine bacteria and cyanobacteria. Nature 343:60–62.
Quince C, Lanzen A, Davenport RJ, Turnbaugh PJ. (2011). Removing noise from pyrosequenced amplicons. BMC Bioinformatics 12:38.
Read TM, Fraser CM, Hsia R-C, Bavoil PM. (2000). Comparative analysis of Chlamydia bacteriophages reveals variation localized to a putative receptor binding domain. Microb Comp Genomics 5:223–231.
Rodrigue S, Malmstrom RR, Berlin AM, Birren BW, Henn MR, Chisholm SW. (2009). Whole genome amplification and de novo assembly of single bacterial cells. PLoS One 4:e6864.
Rodriguez-Brito B, Li L, Wegley L, Furlan M, Angly FE, Breitbart M, et al. (2010). Viral and microbial community dynamics in four aquatic environments. ISME J 4:739–751.
Rohwer F, Segall A, Steward G, Seguritan V, Breitbart M, Wolven F, et al. (2000). The complete genomic sequence of the marine phage Roseophage SIO1 shares homology with nonmarine phages. Limnol Oceanogr 45:408–418.
Roine E, Kukkaro P, Paulin L, Laurinavicius S, Domanska A, Somerharju P, et al. (2010). New, closely related haloarchaeal viral elements with different nucleic acid types. J Virol 84:3682–3689.
Rokyta D, Burch C. (2006). Horizontal gene transfer and the evolution of microvirid coliphage genomes. J Bacteriol 188:1134–1142.
Roossinck MJ, Saha P, Wiley GB, Quan J, White JD, Lai H, et al. (2010). Ecogenomics: using massively parallel pyrosequencing to understand virus ecology. Mol Ecol 19:81–88.
Rosario K, Duffy S, Breitbart M. (2012). A field guide to eukaryotic circular single-stranded DNA viruses: insights gained from metagenomics. Arch Virol 157:1851–1871.
Rosario K, Duffy S, Breitbart M. (2009). Diverse circovirus-like genome architectures revealed by environmental metagenomics. J Gen Virol 90:2418–2424.
Rosario K, Marinov M, Stainton D, Kraberger S, Wiltshire EJ, Collings DA, et al. (2011). Dragonfly cyclovirus, a novel single-stranded DNA virus discovered in dragonflies (Odonata: Anisoptera). J Gen Virol 92:1302–1308.
Rosario K, Nilsson C, Lim YW, Ruan Y, Breitbart M. (2009). Metagenomic analysis of viruses in reclaimed water. Environ Microbiol 11:2806–2820.
Roux S, Enault F, Robin A, Ravet V, Personnic S, Theil S, et al. (2012). Assessing the diversity and specificity of two freshwater viral communities through metagenomics. PLoS One 7:e33641.
Rusch DB, Halpern AL, Sutton G, Heidelberg KB, Williamson S, Yooseph S, et al. (2007). The Sorcerer II Global Ocean Sampling expedition: northwest Atlantic through eastern tropical Pacific. PLoS Biol 5:e77.
Saccardo F, Cettul E, Palmano S, Noris E, Firrao G. (2011). On the alleged origin of geminiviruses from extrachromosomal DNAs of phytoplasmas. BMC Evol Biol 11:185.
Sait M, Livingstone M, Graham R, Inglis NF, Wheelhouse N, Longbottom D. (2011). Identification, sequencing and molecular analysis of Chp4, a novel chlamydiaphage of Chlamydophila abortus belonging to the family Microviridae. J Gen Virol 92:1733–1737.
Salim O, Skilton RJ, Lambden PR, Fane BA, Clarke IN. (2008). Behind the chlamydial cloak: the replication cycle of chlamydiaphage Chp2, revealed. Virology 377:440–445.
Sandaa R-A, Clokie MRJ, Mann NH. (2008). Photosynthetic genes in viral populations with a large genomic size range from Norwegian coastal waters. FEMS Microbiol Ecol 63:2–11.
Sandaa R-A, Larsen A. (2006). Seasonal variations in virus-host populations in Norwegian coastal waters: focusing on the cyanophage community infecting marine Synechococcus spp. Appl Environ Microb 72:4610–4618.
Sanders RW, Berninger U-G, Lim EL, Kemp PF, Caron DA. (2000). Heterotrophic and mixotrophic nanoplankton predation on picoplankton in the Sargasso Sea and on Georges Bank. Mar Ecol Prog Ser 192:103–118.
Saunders K, Lucy A, Stanley J. (1992). RNA-primed complementary-sense DNA synthesis of the geminivirus African cassava mosaic virus. Nucleic Acids Res 20:6311–6315.
Scanlan DJ, Ostrowski M, Mazard S, Dufresne A, Garczarek L, Hess WR, et al. (2009). Ecological genomics of marine picocyanobacteria. Microbiol Mol Biol R 73:249–299.
Schroeder D, Oke J, Hall MJ, Malin G, Wilson WH. (2003). Virus succession observed during an Emiliania huxleyi bloom. Appl Environ Microb 69:2484–2490.
Seal SE, Jeger MJ, Van den Bosch F. (2006). Begomovirus evolution and disease management. Adv Virus Res 67:297–316.
Segalés J, Allan GM, Domingo M. (2007). Porcine circovirus diseases. Anim Health Res Rev 6:119–142.
Servant-Delmas A, Laperche S, Mercier M, Lefrere JJ. (2009). Genetic diversity of human Erythroviruses. Pathol Biol 57:167–174.
Seshadri R, Kravitz SA, Smarr L, Gilna P, Frazier M. (2007). CAMERA: a community resource for metagenomics. PLoS Biol 5:e75.
Shannon P, Markiel A, Ozier O, Baliga NS, Wang JT, Ramage D, et al. (2003). Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res 13:2498–2504.
Sharon I, Battchikova N, Aro E-M, Giglione C, Meinnel T, Glaser F, et al. (2011). Comparative metagenomics of microbial traits within oceanic viral communities. ISME J 5:1178–1190.
Shelford E, Middelboe M, Møller E, Suttle CA. (2012). Virus-driven nitrogen cycling enhances phytoplankton growth. Aquat Microb Ecol 66:41–46.
Short CM, Rusanova O, Short SM. (2011). Quantification of virus genes provides evidence for seed-bank populations of phycodnaviruses in Lake Ontario, Canada. ISME J 5:810–821.
Short CM, Suttle CA. (2005). Nearly identical bacteriophage structural gene sequences are widely distributed in both marine and freshwater environments. Appl Environ Microb 71:480–486.
Short SM, Suttle CA. (2002). Sequence analysis of marine virus communities reveals that groups of related algal viruses are widely distributed in nature. Appl Environ Microb 68:1290–1296.
Short SM, Suttle CA. (1999). Use of the polymerase chain reaction and denaturing gradient gel electrophoresis to study diversity in natural virus communities. Hydrobiologia 401:19–32.
Sims GE, Jun S-R, Wu GA, Kim S-H. (2009). Alignment-free genome comparison with feature frequency profiles (FFP) and optimal resolutions. Proc Natl Acad Sci U S A 106:2677–2682.
Smith AW, Boyt PM. (1990). Caliciviruses of ocean origin: a review. J Zoo Wildlife Med 21:3–23.
Smith RH. (2008). Adeno-associated virus integration: virus versus vector. Gene Ther 15:817–822.
Sobecky PA, Hazen TH. (2009). Horizontal gene transfer and mobile elements in marine systems. Gogarten MB, Gogarten JP, Olendzenski LC (eds). Methods Mol Biol 532:435–453.
Söding J, Biegert A, Lupas AN. (2005). The HHpred interactive server for protein homology detection and structure prediction. Nucleic Acids Res 33:W244–W248.
Sogin ML, Morrison HG, Huber JA, Welch MD, Huse SM, Neal PR, et al. (2006). Microbial diversity in the deep sea and the underexplored “rare biosphere”. Proc Natl Acad Sci U S A 103:12115–12120.
Del Solar G, Giraldo R, Ruiz-Echevarría MJ, Espinosa M, Díaz-Orejas R. (1998). Replication and control of circular bacterial plasmids. Microbiol Mol Biol R 62:434–464.
Sorensen G, Baker AC, Hall MJ, Munn CB, Schroeder D. (2009). Novel virus dynamics in an Emiliania huxleyi bloom. J Plankton Res 31:787–791.
Speicher MR, Carter NP. (2005). The new cytogenetics: blurring the boundaries with molecular biology. Nat Rev Genet 6:782–792.
Steinfeldt T, Finsterbusch T, Mankertz A. (2007). Functional analysis of cis- and trans-acting replication factors of porcine circovirus type 1. J Virol 81:5696–5704.
Steinfeldt T, Finsterbusch T, Mankertz A. (2001). Rep and Rep’ protein of porcine circovirus type 1 bind to the origin of replication in vitro. Virology 291:152–160.
Stewart ME, Perry R, Raidal SR. (2006). Identification of a novel circovirus in Australian ravens (Corvus coronoides) with feather disease. Avian Pathol 35:86–92.
Storey CC, Lusher M, Richmond SJ. (1989). Analysis of the complete nucleotide sequence of Chp1, a phage which infects avian Chlamydia psittaci. J Gen Virol 70:3381–3390.
Sullivan MB, Coleman ML, Quinlivan V, Rosenkrantz JE, Defrancesco AS, Tan G, et al. (2008). Portal protein diversity and phage ecology. Environ Microbiol 10:2810–2823.
Sullivan MB, Coleman ML, Weigele P, Rohwer F, Chisholm SW. (2005). Three Prochlorococcus cyanophage genomes: signature features and ecological interpretations. PLoS Biol 3:e144.
Sullivan MB, Krastins B, Hughes JL, Kelly L, Chase M, Sarracino D, et al. (2009). The genome and structural proteome of an ocean siphovirus: a new window into the cyanobacterial “mobilome”. Environ Microbiol 11:2935–2951.
Sullivan MB, Lindell D, Lee JA, Thompson LR, Bielawski JP, Chisholm SW. (2006). Prevalence and evolution of core photosystem II genes in marine cyanobacterial viruses and their hosts. PLoS Biol 4:e234.
Suttle CA. (2007). Marine viruses--major players in the global ecosystem. Nat Rev Microbiol 5:801–812.
Suttle CA. (2005). Viruses in the sea. Nature 437:356–361.
Suttle CA, Chan AM. (1994). Dynamics and distribution of cyanophages and their effect on marine Synechococcus spp. Appl Environ Microb 60:3167–3174.
Suttle CA, Chan AM, Cottrell MT. (1991). Use of ultrafiltration to isolate viruses from seawater which are pathogens of marine phytoplankton. Appl Environ Microb 57:721–726.
Talavera G, Castresana J. (2007). Improvement of phylogenies after removing divergent and ambiguously aligned blocks from protein sequence alignments. Syst Biol 56:564–577.
Tarutani K, Nagasaki K, Yamaguchi M. (2000). Viral impacts on total abundance and clonal composition of the harmful bloom-forming phytoplankton Heterosigma akashiwo. Appl Environ Microb 66:4916–4920.
Tattersall P, Ward D. (1976). Rolling hairpin model for replication of parvovirus and linear chromosomal DNA. Nature 263:106–109.
Tetart F, Desplats C, Kutateladze M, Monod C, Ackermann HW, Krisch HM. (2001). Phylogeny of the major head and tail genes of the wide-ranging T4-type bacteriophages. J Bacteriol 183:358–366.
Thingstad TF. (2000). Elements of a theory for the mechanisms controlling abundance, diversity, and biogeochemical role of lytic bacterial viruses in aquatic systems. Limnol Oceanogr 45:1320–1328.
Timchenko T, Katul L, Sano Y, De Kouchkovsky F, Vetten HJ, Gronenborn B. (2000). The master rep concept in nanovirus replication: identification of missing genome components and potential for natural genetic reassortment. Virology 274:189–195.
Timchenko T, De Kouchkovsky F, Katul L, David C, Vetten HJ, Gronenborn B. (1999). A single Rep protein initiates replication of multiple genome components of faba bean necrotic yellows virus, a single-stranded DNA virus of plants. J Virol 73:10173–10182.
Todd D, Scott ANJ, Fringuelli E, Shivaprasad HL, Gavier-Widen D, Smyth JA. (2007). Molecular characterization of novel circoviruses from finch and gull. Avian Pathol 36:75–81.
Tomaru Y, Takao Y, Suzuki H, Nagumo T, Koike K, Nagasaki K. (2011). Isolation and characterization of a single-stranded DNA virus infecting Chaetoceros lorenzianus Grunow. Appl Environ Microb 77:5285–5293.
Torrella F, Morita R. (1979). Evidence by electron micrographs for a high incidence of bacteriophage particles in the waters of Yaquina Bay, Oregon: ecological and taxonomical implications. Appl Environ Microb 37:774–778.
Tucker KP, Parsons R, Symonds EM, Breitbart M. (2011). Diversity and distribution of single-stranded DNA phages in the North Atlantic Ocean. ISME J 5:822–830.
Urbino C, Polston JE, Patte CP, Caruana ML. (2004). Characterization and genetic diversity of potato yellow mosaic virus from the Caribbean. Arch Virol 149:417–424.
Venter JC, Remington K, Heidelberg JF, Halpern AL, Rusch D, Eisen JA, et al. (2004). Environmental genome shotgun sequencing of the Sargasso Sea. Science 304:66–74.
Victoria JG, Kapoor A, Li L, Blinkova O, Slikas B, Wang C, et al. (2009). Metagenomic analyses of viruses in stool samples from children with acute flaccid paralysis. J Virol 83:4642–4651.
Victoria M, Pereira Miagostovich M, Leite PG, Colina R, Cristina J, Fioretti JM. (2009). Bayesian coalescent inference reveals high evolutionary rates and expansion of norovirus populations. Infect Genet Evol 9:927–932.
Waldor MK, Mekalanos JJ. (1996). Lysogenic conversion by a filamentous phage encoding cholera toxin. Science 272:1910–1914.
Walsh DA, Zaikova E, Howes C, Young C, Wright J, Tringe S, et al. (2009). Metagenome of a versatile chemolithoautotroph from expanding oceanic dead zones. Science 326:578–582.
Wanitchakorn R, Hafner GJ, Harding RM, Dale JL. (2000). Functional analysis of proteins encoded by banana bunchy top virus DNA-4 to -6. J Gen Virol 81:299–306.
Wanitchakorn R, Harding RM, Dale JL. (1997). Banana bunchy top virus DNA-3 encodes the viral coat protein. Arch Virol 142:1673–1680.
Ward B, Kilpatrick K, Wopat A, Minnich E, Lidstrom M. (1989). Methane oxidation in Saanich Inlet during summer stratification. Continental Shelf Research 9:65–75.
Weigele PR, Pope WH, Pedulla ML, Houtz JM, Smith AL, Conway JF, et al. (2007). Genomic and structural analysis of Syn9, a cyanophage infecting marine Prochlorococcus and Synechococcus. Environ Microbiol 9:1675–1695.
Wilhelm SW, Suttle CA. (1999). Viruses and nutrient cycles in the sea. Bioscience 49:781–788.
Winget DM, Wommack KE. (2008). Randomly amplified polymorphic DNA PCR as a tool for assessment of marine viral richness. Appl Environ Microb 74:2612–2618.
Winter C, Smit A, Herndl GJ, Weinbauer MG. (2004). Impact of virioplankton on archaeal and bacterial community richness as assessed in seawater batch cultures. Appl Environ Microb 70:804–813.
Von Wintzingerode F, Göbel UB, Stackebrandt E. (1997). Determination of microbial diversity in environmental samples: pitfalls of PCR-based rRNA analysis. FEMS Microbiol Rev 21:213–229.
Worden A. (2008). Ecology and diversity of picoeukaryotes. In: Microbial ecology of the oceans, Kirchman DL (ed). John Wiley and Sons, Inc.
Worden A, Chisholm SW, Binder BJ. (2000). In situ hybridization of Prochlorococcus and Synechococcus (marine cyanobacteria) spp. with rRNA-targeted peptide nucleic acid probes. Appl Environ Microb 66:284–289.
Yoon HS, Price DC, Stepanauskas R, Rajah VD, Sieracki ME, Wilson WH, et al. (2011). Single-cell genomics reveals organismal interactions in uncultivated marine protists. Science 332:714–717.
Zaikova E, Walsh DA, Stilwell CP, Mohn WW, Tortell PD, Hallam SJ. (2010).
Microbial community dynamics in a seasonally anoxic fjord: Saanich Inlet, British Columbia. Environ Microbiol 12:172–191.
Zhang T, Breitbart M, Lee WH, Run J-Q, Wei CL, Soh SWL, et al. (2006). RNA viral community in human feces: prevalence of plant pathogenic viruses. PLoS Biol 4:e3.
Zhang W, Olson NH, Baker TS, Faulkner L, Agbandje-McKenna M, Boulton MI, et al. (2001). Structure of the Maize streak virus geminate particle. Virology 279:471–477.
Zhao Y, Temperton B, Thrash JC, Schwalbach MS, Vergin KL, Landry ZC, et al. (2013). Abundant SAR11 viruses in the ocean. Nature 494:357–360.

Appendices

Appendix A Genome organization of the type species of each ssDNA viral genus.

Family         Sub-family      Genus                       Type species
Inoviridae     -               Inovirus                    Enterobacteriophage M13
Inoviridae     -               Plectrovirus                Acholeplasma phage virus MV-L51
Inoviridae     -               Unclassified inoviruses
Microviridae   Microvirinae    Microvirus                  Enterobacteriophage ϕX174
Microviridae   Gokushovirinae  Spiromicrovirus             Spiroplasma phage 4
Microviridae   Gokushovirinae  Bdellomicrovirus            Bdellovibrio phage MAC 1
Microviridae   Gokushovirinae  Chlamydiamicrovirus         Chlamydia phage 1
Geminiviridae  -               Mastrevirus                 Maize streak virus
Geminiviridae  -               Curtovirus                  Beet curly top virus
Geminiviridae  -               Begomovirus                 Bean golden yellow mosaic virus
Geminiviridae  -               Topocuvirus                 Tomato pseudocurly top virus
Geminiviridae  -               Unclassified
Circoviridae   -               Circovirus                  Porcine circovirus 1
Circoviridae   -               Gyrovirus                   Chicken anemia virus
Circoviridae   -               Unclassified circoviruses
Nanoviridae    -               Nanovirus                   Subterranean clover stunt virus
Nanoviridae    -               Babuvirus                   Banana bunchy top virus
Nanoviridae    -               Unassigned nanovirus
Parvoviridae   Parvovirinae    Parvovirus                  Minute virus of mice
Parvoviridae   Parvovirinae    Erythrovirus                Human parvovirus B19
Parvoviridae   Parvovirinae    Dependovirus                Adeno-associated virus-2
Parvoviridae   Parvovirinae    Amdovirus                   Aleutian mink disease
Parvoviridae   Parvovirinae    Bocavirus                   Bovine parvovirus
Parvoviridae   Densovirinae    Densovirus                  Junonia coenia densovirus
Parvoviridae   Densovirinae    Iteravirus                  Bombyx mori densovirus
Parvoviridae   Densovirinae    Brevidensovirus             Aedes aegypti densovirus
Parvoviridae   Densovirinae    Pefudensovirus              Periplaneta fuliginosa densovirus
Anelloviridae  -               Alphatorquevirus            Torque teno virus 1 isolate TA278
Anelloviridae  -               Betatorquevirus             Torque teno mini virus 1
Anelloviridae  -               Gammatorquevirus            Torque teno midi virus 1
Anelloviridae  -               Deltatorquevirus            Torque teno tupaia virus
Anelloviridae  -               Epsilontorquevirus          Torque teno tamarin virus
Anelloviridae  -               Etatorquevirus              Torque teno felis virus
Anelloviridae  -               Iotatorquevirus             Torque teno sus virus 1
Anelloviridae  -               Thetatorquevirus            Torque teno canis virus
Anelloviridae  -               Zetatorquevirus             Torque teno douroucouli virus
Anelloviridae  -               Unclassified anelloviruses
-              -               Unclassified ssDNA viruses

[Type species genome: genome organization diagram for each type species]
* All the pictures are from the ViralZone (http://viralzone.expasy.org)

Appendix B Location and environmental data for each of the samples used in this study.

B.1  Gulf of Mexico samples.
Location                      VC name  Latitude      Longitude     Depth (m)    date collected  temp   sal
Green River plume             1B       nd            nd            nd           01/07/15        20.2   36.5
Green River plume             2A       nd            nd            sfc          01/07/17        29.9   33.5
Green River plume             2C       nd            nd            nd           01/07/17        25.2   36.4
Green River plume             2D       nd            nd            nd           01/07/17        25.00  36.5
Green River plume             2G       nd            nd            nd           01/07/17        20.00  36.7
Green River plume             3A       27°30.388'N   88°24.719'W   sfc pump     01/07/18        28.9   34.2
Green River plume             3B       27°30.042'N   88°24.109'W   110          01/07/18        22.5   36.6
Green River plume             4A       27°58.148'N   87°34.775'W   sfc pump     01/07/19        29.7   31.5
Green River plume             4B       27°58.148'N   87°34.775'W   110          01/07/19        20.4   36.6
Green River plume             5        28°33.7'N     87°24.85'W    sfc          01/07/20        30.00  32
Green River plume             6A       29°00.037'N   87°17.836'W   7            01/07/20        29.5   33.3
Green River plume             6B       29°00.037'N   87°17.836'W   25           01/07/20        28.5   36.4
Green River plume             6C       29°00.037'N   87°17.836'W   90           01/07/20        20.5   36.5
Green River plume             6D       29°00.037'N   87°17.836'W   120          01/07/20        18.7   36.5
North Eastern Gulf of Mexico  7A       26°28.526'N   85°38.478'W   3            01/07/21        30.00  33.5
North Eastern Gulf of Mexico  7B       26°28.526'N   85°38.478'W   20           01/07/21        28.9   35.7
North Eastern Gulf of Mexico  7C       26°28.526'N   85°38.478'W   40           01/07/21        27.5   36.5
North Eastern Gulf of Mexico  7E       26°28.526'N   85°38.478'W   70           01/07/21        23.9   36.5
North Eastern Gulf of Mexico  8A       25°18.781'N   84°43.213'W   sfc (3 m)    01/07/26        n/a    n/a
North Eastern Gulf of Mexico  8C       25°18.781'N   84°43.213'W   40           01/07/26        27.7   36.4
North Eastern Gulf of Mexico  8D       25°18.781'N   84°43.213'W   68           01/07/26        23.1   36.6
North Eastern Gulf of Mexico  8E       25°18.781'N   84°43.213'W   90           01/07/26        21.6   36.6
Western Gulf of Mexico        105      nd            nd            nd           nd              nd     nd
Western Gulf of Mexico        192      nd            nd            78.5 to 86   95/06/25        20.2   36.35
Western Gulf of Mexico        223      25°34.24'N    92°52.97'W    sfc          96/07/25        30.00  36.2
Western Gulf of Mexico        228      25°42.08'N    93°00.76'W    163.8        96/07/26        18.59  36.39
Western Gulf of Mexico        230      26°50.36'N    94°59.94'W    sfc          96/07/28        29.82  36.22
Western Gulf of Mexico        232      27°05.46'N    94°57.73'W    sfc          96/07/28        30.5   36.3
Texas Coast                   202                                  sfc          95/06/27        28.45  30.28
Texas Coast                   236      27°29.35'N    96°16.36'W    sfc          96/07/29        29.00  36.3
Texas Coast                   237      27°29.92'N    96°17.69'W    sfc          96/07/29        29.00  36.3
Texas Coast                   238      27°30.12'N    96°17.96'W    15           96/07/29        29.00  36.3
Texas Coast                   241      27°32.85'N    96°28.40'W    sfc          96/07/30        29.2   36.1
Texas Coast                   243      27°43.982'N   96°50.439'W   sfc          96/07/31        28.6   35.71
Texas Coast                   156      nd            nd            sfc          05/07/07        27.3   34.6
Texas Coast                   113      nd            nd            sfc          94/06/03        29.5   34.2
Texas Coast                   124      nd            nd            sfc          94/08/18        30.5   40.6
Texas Coast                   170      nd            nd            sfc          96/03/14        17.6   31
Texas Coast                   167      nd            nd            sfc          95/11/28        18.9   31.3
Texas Coast                   141      nd            nd            sfc          95/01/23        14.7   26.1
Texas Coast                   161      nd            nd            sfc          95/09/25        28.9   33

B.2  Strait of Georgia samples.
Location  BC-2004  BC-2000  By Powell River Salmon Arm Teakerne ArmHead Malaspina (Wooten Bay) Saanich Inlet  VC name 658 665 660  Latitude  Longitude  date collected 04/07/13 04/07/13 04/07/13  temp  sal  124'26''05W nd 124'51''44W  Depth (m) nd nd 14.4  49'50''97N nd 50'11''14N  14.3 14.2 12.3  27.1 28.1 27.9  663  50'04''80N  124'42''84W  9.6  04/07/16  nd  nd  667  48'36''240 N 49'49''716 N 49'43''64N 49'27''34N  123'30''098W  115  04/07/18  13,00  29.2  04/07/16  13.6  26.5  124'19''38W 123'17''01W  7.7 245  03/07/06 02/08/20  12.8 8.9  26.7 31  561 562 563 566 567 568 570  49'27''34N 49'49''68N  123'17''01W 124'047''78W  24.2 26.4  124'41''31W  17  27  50'11''80N 50'11''40N  124'50''88W 125'08''50W  02/08/20 02/08/21 02/08/21 02/08/22 02/08/22 02/08/23 02/08/24  15.8 14.2  49'58''73N  2 155 2 7 30 7 10  ?
11.5  28.4 28.9  571  48'48''09N  123'24''21W  25  02/08/24  11.9  29.5  426  49°27.30'N  123°16.88'W  238  00/07/31  nd  nd  427  nd  nd  50  00/07/31  9.2  29.8  428  nd  nd  25  00/07/31  9.5  29.3  429  nd  nd  15  00/07/31  10,00  28.8  430  nd  nd  3  00/07/31  14.1  23.4  431  nd  nd  sfc  00/07/31  15.7  433  50°00.274' N 50°00.274' N 50°00.274' N 50°00.202' N 50°00.202' N 50°00.202' N 50°07.318'  124°57.861'W  100  00/08/01  8.5  14,0 0 30.2  124°57.861'W  50  00/08/01  9.1  29.7  124°57.861'W  25  00/08/01  10.6  28.7  124°57.878'W  15  00/08/01  13,00  27.3  124°57.878'W  5  00/08/01  16.7  24.9  124°57.878'W  sfc  00/08/01  18.2  23.4  124°54.264'W  8.4  00/08/01  ca 16  25.7  124°51.142'W  7+10  00/08/01  16.6  23.7  to 19  to  Hotham sound  664  W of Scoth Fir Pt Howe Sound. N of Boyer Howe Sd Hotham sound Hotham sound Malaspina-profile Malaspina-profile Teakerne Arm Quadra-diatom bloom Saltspring Isdiatom bloom North of Boyer Island Howe sound depth profile Howe sound depth profile Howe sound depth profile Howe sound depth profile Howe sound depth profile Baker Sounddepth profile Baker Sounddepth profile Baker Sounddepth profile Baker Sounddepth profile Baker Sounddepth profile Baker Sounddepth profile off Squirrel Cove  608 560  434 435 436 437 438 439  124'04''793W  N Teakerne Arm  440  50°11.343' N  25.5  160  Location Malaspina-depth  VC name 441  profile[lancelot] Malaspina-depth  Latitude  Longitude  date collected 00/08/02  temp  sal  124°42.855'W  Depth (m) 44  50°04.786'  11  28  N 442  nd  nd  15  00/08/02  15  26  443  nd  nd  5  00/08/02  16.8  25  444  nd  nd  sfc  00/08/02  17.6  25.6  445  50°01.693'  124°43.939'W  130  00/08/02  9.1  28.4  123°47.370'W  195  00/08/03  8.9  29.1  profile Malaspina-depth profile Malaspina-depth profile Malaspina-Hole  N Sechelt-Porpoise  447  Bay Sechelt-Porpoise  49°33.475' N  448  nd  nd  15  00/08/03  13  25.6  449  nd  nd  5  00/08/03  15  24  450  nd  nd  sfc  00/08/03  18  22  451  nd  nd  5  
00/08/03  15  24.5  452  49°40.95'N  123°50.79'W  250  00/08/03  8.9  28.2  mouth of Sechelt  453  49°46.44'N  123°58.53'W  235  00/08/03  8.9  30.9  Narrows Inlet  361  49°43.89'N  123°44.27'W  5  99/08/18  13.8  21.3  Salmon Inlet  374  49°36.40'N  123°48.22'W  2.8  99/08/19  16.1  22  Salmon Inlet  375  49°36.40'N  123°48.17'W  3.2  99/08/20  nd  nd  Pendrell  393  50°17.68'N  124°43.64'W  11.6  99/08/21  15.3  24.9  Pendrell  391  50°17.68'N  124°43.64'W  5  99/08/21  19.9  21.7  Pendrell  392  50°17.68'N  124°43.64'W  0.5  99/08/20  22.6  14.6  Malaspina Inlet-  385  50°01.73'N  124°43.83'W  5.7  99/08/19  17.8  21  377  50°04.76'N  124°42.83'W  3.1  99/08/20  nd  nd  384  49°58.50'N  124°40.96'W  6.4  98/07/18  nd  nd  327  50°07.62'N  125°19.94'W  7.5  98/07/18  nd  nd  328  50°6.98'N  125°18.87'W  7.15  98/07/18  nd  nd  Bay Sechelt-Porpoise Bay Sechelt-Porpoise Bay Sechelt-Porpoise Bay Outside Skookumchuck BC-1999  Hole (Stn H7) Malaspina-North Arm Malaspina-south Arm S of Seymour Narrows S of Seymour Narrows  161  Location  Latitude  Longitude  W Cape Mudge  VC name 329  date collected 98/07/18  temp  sal  125°8.29'W  Depth (m) 6.7  49°57.03'N  nd  nd  sub sfc chl max  330  49°57.03'N  125°05.86'W  nd  98/07/20  nd  nd  Howe Sound  338  49°27.31'N  123°17.08'W  7.8  98/07/20  nd  nd  Howe Sound  339  49°34.46'N  123°15.16'W  6.7  98/07/20  nd  nd  Howe Sound-  340  49°39.69'N  123°13.23'W  4.9  98/07/25  nd  nd  April Point  292  50°03.54'N  125°14.733'W  sfc  98/07/25  nd  nd  April Point  293  50°03.54'N  125°14.733'W  profile  98/07/25  nd  nd  Yaculta-quadra  294  50°01.354  125°12.492'W  profile  98/07/25  nd  nd  profile  98/07/25  nd  nd  125°17.524'W  profile  98/07/26  nd  nd  head  1'N Whirlpool  295 296  50°06.568' N  Seymour Narrows  297  50°06.81'N  125°18.518'W  profile  98/07/26  nd  nd  Cape Mudge rip  298  49°58.70'N  125°11.500'W  profile  98/07/26  nd  nd  299  49°58.310'  125°10.729'W  profile  98/07/26  nd  nd  123°30.379'W  210  
96/08/19  nd  nd  123°30.388'W  3  96/08/19  nd  nd  123°30.108'W  7.75  96/08/19  nd  nd  123°53.532'W  5  96/08/19  nd  nd  tide Cape Mudge rip tide BC-Low  Saanich Inlet  N 248  Salinity  48°35.132' N  Saanich Inlet  249  48°35.178' N  Saanich Inlet  250  48°40.156' N  Jervis  256  50°09.749' N  Jervis  258  nd  nd  2.9  96/08/19  nd  nd  Jervis  259  49°43.985'  124°14.758'W  12.7  96/08/19  nd  nd  N FRP [chl max]  262  49°07.77'N  123°28.025'W  14.7  96/08/22  nd  nd  Jericho Pier  271  nd  nd  nd  97/01/10  nd  nd  SOG [chl max]  275  49°19.7'N  123°44.1'W  5  97/03/05  nd  nd  Jericho Pier  268  nd  nd  nd  96/10/23  nd  nd  Jericho Pier  305  nd  nd  sfc  98/07/24  nd  nd  Jericho Beach  341  nd  nd  nd  98/08/18  nd  nd  Jericho Pier  342  nd  nd  nd  99/05/18  nd  nd  Numvkamic Bay  356  nd  nd  8  99/07/12  nd  nd  Grappler Inlet  353  nd  nd  sfc  99/07/05  nd  nd  head  162  Location  Latitude  Longitude  nd  nd  357  nd  nd  Jericho Pier  404  nd  Jericho Pier  483  Jericho Pier  497  Pachina Bay (dino  VC name  Depth (m)  date collected nd  temp  sal  nd  nd  sfc  99/07/15  nd  nd  nd  sfc  00/04/14  nd  nd  nd  nd  sfc  00/04/04  nd  nd  nd  nd  sfc  01/07/17  nd  nd  bloom)  163  B.3  Saanich Inlet samples.  
Location       VC name  Latitude     Longitude     Depth (m)  Date collected  Temp  Sal
Saanich Inlet  809      48°35.30'N   123°30.22'W   200        07/04/24        nd    nd
Saanich Inlet  808      48°35.30'N   123°30.22'W   150        07/04/24        nd    nd
Saanich Inlet  807      48°35.30'N   123°30.22'W   135        07/04/24        nd    nd
Saanich Inlet  806      48°35.30'N   123°30.22'W   100        07/04/24        nd    nd
Saanich Inlet  805      48°35.30'N   123°30.22'W   10         07/04/24        nd    nd
Saanich Inlet  837      48°35.30'N   123°30.22'W   10         08/01/08        nd    nd
Saanich Inlet  843      48°35.30'N   123°30.22'W   10         07/03/18        nd    nd
Saanich Inlet  861      48°35.30'N   123°30.22'W   10         08/05/13        nd    nd
Saanich Inlet  874      48°35.30'N   123°30.22'W   10         08/07/15        nd    nd
Saanich Inlet  880      48°35.30'N   123°30.22'W   10         08/08/10        nd    nd
Saanich Inlet  929      48°35.30'N   123°30.22'W   10         08/11/11        nd    nd

B.4  Arctic samples.  Chukchi Sea  Location  Latitude  Longitude  Depth (m)  date collected  temp  sal  Chukchi Sea Chukchi Sea  71°43.74'N [71-43.93]  155°09.41'W [155-07.37]  60  02/09/08  1.1  31.9  71°43.74'N  155°09.41'W  10  02/09/08  5.4  30.8  72°30.00'N  160°00.00'W  10  02/09/08  1.3  28.4  72°30.00'N  160°00.00'W  25  02/09/08  0.7  32.6  Chukchi Sea Chukchi Sea  72°54.00'N [72-53.99]  158°48.00'W [158-47.88]  10  02/09/08  0.3  27.6  72°54.00'N  158°48.00'W  30  02/09/08  -1.0  31.2  Chukchi Sea Chukchi Sea Chukchi Sea Chukchi Sea Chukchi Sea Chukchi Sea Chukchi Sea Chukchi Sea  72°30.00'N [73-29.35]  157°00.00'W [157-00.90]  1500  02/09/09  -0.4  34.9  72°30.00'N  157°00.00'W  3246  02/09/09  -0.3  35.0  74-35.12  158-29.52  50  02/09/10  -0.6  31.8  10  02/09/10  -0.7  26.8  Chukchi Sea Chukchi Sea  Canadian Arctic  Canadian Arctic Canadian Arctic Canadian Arctic Canadian Arctic Canadian Arctic Canadian Arctic Canadian Arctic Canadian Arctic Canadian Arctic Canadian Arctic Canadian Arctic Canadian Arctic Canadian Arctic  74°15.00'N  162°33.58'W  45  02/09/11  -1.2  31.4  74°15.00'N  162°33.58'W  10  02/09/11  0.3  27.7  73°40.00'N  
164°06.75'W  30  02/09/11  -1.4  31.6  73°40.00'N  164°06.75'W  10  02/09/11  0.5  27.9  72°30.00'N  151°20.0W  10  02/09/14  0.3  26.1  70-50.00'N  141°50.0W  50  02/09/17  -1.3  31.6  10  02/09/17  0.8  26.0  50  02/09/17  -0.9  31.6  10  02/09/17  0.7  26.4  70°52.50'N  133°58.00'w  45  02/09/19  -1.2  31.2  70°52.50'N  133°58.00'w  10  02/09/19  -0.2  26.3  70°22.00'N  133°36.00'w  23  02/09/19  -0.7  29.7  70°22.00'N  133°36.00'w  02/09/19  -0.3  26.8  70°22.00'N  133°36.00'w  10 bottom10  02/09/19  -1.0  31.8  70-45  127°60  15  02/09/21  10  02/09/21  5  02/09/21  1.5  21.4  70-45  127-26's  165  CASES  Location Canadian Arctic Canadian Arctic Canadian Arctic Canadian Arctic Canadian Arctic Canadian Arctic Cape Bathurst  Cape Bathurst  Beaufort Sea Beaufort Sea  MZ MacKenzie shelf MacKenzie shelf MacKenzie shelf MacKenzie shelf MacKenzie shelf MacKenzie shelf Amundsen Gulf Amundsen Gulf  Latitude  69°53.750'N  Longitude  139°30.000'w  Depth (m)  date collected  15  02/09/21  13  temp  sal  02/09/26  -1.4  32.4  10  02/09/26  0.5  25.8  30  02/09/26  -0.9  30.0  69°24.000'N  138°00.000'w  10  02/09/27  70°12.000'N  137°00.000'w  25  02/09/27  -0.8  30.0  70 49.30  127 40.39  139 44 17  02/09/24 02/09/24 02/09/24  -1.5 -1.2 -0.9  33.1 32.0 29.0  70 45.90  127 36.51  6 33 sfc  02/09/24 02/09/24 02/09/24  -0.6 -0.7 0.5  26.7 31.2 22.1  71 27.12  133 46.52  968  02/09/28  -0.1  34.9  71 27.01  133 46.93  70 08.75  133 30.69  sfc (2m) 20 60 90 3  02/09/29 02/09/29 02/09/29 02/09/29 02/10/02  -0.7 -0.7 -1.1 -1.4 -0.3  26.3 27.1 31.9 32.5 26.8  16  02/10/02  -0.7  29.5  31  02/10/02  -1.1  31.4  69 51N  133 23W  sfc  02/10/03  -0.5  25.4  70 51.10  133 38.98  sfc  02/10/04  -0.6  20.3  41  02/10/04  -0.9  31.5  70 51.05  133 39.18  65  02/10/04  -1.3  32.3  70 43.70  124 11.73  480  02/10/10  0.3  34.7  70 43.39  124 11.05  41 18 2  02/10/10 02/10/10 02/10/10  -0.3 0.7 -0.8  31.8 27.3 23.9  166  
