Open Collections

UBC Theses and Dissertations

Metagenomic analyses of two female genital tract diseases : bacterial vaginosis and ovarian cancer Montoya, Vincent Keith 2013

Metagenomic Analyses of Two Female Genital Tract Diseases: Bacterial Vaginosis and Ovarian Cancer

by

Vincent Keith Montoya

B.Sc. Microbiology, The University of Hawai‘i at Mānoa, 2007

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF

MASTER OF SCIENCE

in

The Faculty of Graduate Studies

(Pathology and Laboratory Medicine)

THE UNIVERSITY OF BRITISH COLUMBIA
(Vancouver)

April 2013

© Vincent Keith Montoya, 2013

Abstract

Metagenomics is a rapidly evolving field that has facilitated the expansion of microbiology into new areas of human and environmental health. Metagenomic studies have expanded the phylogenetic tree of life by increasing taxonomic resolution in individual phyla as well as adding entirely new branches of life. This revolution in microbiology has been made possible by the introduction of second-generation high-throughput sequencing, the associated methods for preparing DNA sequencing libraries, and new bioinformatic algorithms for analyzing these new types of data. Because of the novelty of these methods, very few have been systematically tested for their sensitivities and specificities outside of the initial development process. As the interpretation of metagenomic studies utilizing these tools depends greatly upon their efficiencies in both detection and classification, it is essential to determine the performance of each tool. In this study, a variety of novel techniques were utilized and tested for their ability to characterize the microbial populations in two regions of the female genital tract: ovarian cancer tissue and the vaginal microbiome. Although a diverse microbial population was initially observed in the transcriptome sequence data for ovarian cancer using next-generation sequencing, we were unable to recover these microbial sequences through PCR and Sanger sequencing approaches.
Optimized methods were applied to healthy vaginal microbiome samples and tested for their ability to differentiate them from a polymicrobial disease of the vagina, bacterial vaginosis. In addition to correlating highly with a microbial scoring system for bacterial vaginosis, this novel metagenomic pipeline also revealed microorganisms not yet associated with the vaginal microbiome, such as specific Bifidobacterium spp., various bacteriophages, and Debaryomyces. Collectively, both of these studies provide unique insights into each disease and illustrate both the limitations and the potential of the rapidly growing field of metagenomics.

Preface

The bacterial vaginosis research project was funded by the Canadian Institutes of Health Research and Genome British Columbia. Ethical approval for this project was obtained from the University of British Columbia – Children’s & Women’s Health Centre of BC Research Ethics Board. The certificate number for the HIV vaginal microbiome project is CW11-0062 / H11-00119, and the certificate number for the vaginal microbiome project of healthy, reproductive-aged women is H10-02535. The ovarian cancer research project was funded by the BC Cancer Foundation for OvCaRe and was approved by the University of British Columbia - British Columbia Cancer Agency Research Ethics Board. The certificate numbers for the ovarian cancer research are H09-02153, H08-01411, H05-60119.

Table of Contents

Abstract
Preface
List of Tables
List of Figures
List of Abbreviations
Acknowledgements
Chapter 1 Introduction
1.1 Metagenomics
1.2 Metagenomic Methods of Microbial Detection
Cloning
Sequence-Independent Amplification
Filtration
Monomeric Subtraction
Polymeric Subtraction
1.3 Microarrays
1.4 High-Throughput Sequencing
High-Throughput Sequencing Analysis
Comparing the Dominant Methods
Limitations of Each Illumina and Roche System
Sensitivity
1.5 Multivariate Analyses
NMDS
Multivariate Analysis Packages
1.6 The Human Microbiome
The Female Genital Tract Microbiome
The Microbiome in Disease
Study Aim 1: Ovarian Cancer
Hypothesis #1
Study Aim #2: Bacterial Vaginosis
BV Diagnosis
Metagenomics and BV Biology
The Relationship of BV and HIV
Hypothesis #2
Chapter 2 Ovarian Cancer
2.1 Ovarian Cancer Study Methods
Ovarian Cancer Tissue Sampling
Ethics
Ovarian Cancer Transcriptome
PCR Amplification
Computer Programming and Multivariate Analysis
2.2 Ovarian Cancer Study Results
Pan-viral Microarray Analysis
PCR and Sequence Analysis
RNASeq
Virome
Confirmation of Microbiome Findings by PCR
16S Ribosomal DNA Assay
Conclusions
Chapter 3 Bacterial Vaginosis
3.1 Bacterial Vaginosis Study Methods
Data Collection
Sample Collection
Sample Processing
Sequence Library Preparation
KAPA qPCR
MiSeq
3.2 Bacterial Vaginosis Study Results
Nextera XT Sensitivity
Sequence Classification Sensitivity
Measuring Taxonomic Classification
Experimental Bacteria Spike
Discussion
3.3 Bacterial Vaginosis Metagenomics Results
Sample Information
Sequencing
Taxonomic Classification
Multivariate Analyses
Comparison with the Nugent Score
Virome
Mycobiome
Conclusions
Chapter 4 Discussion
Summary
Ovarian Cancer Results
Limitations of the Ovarian Cancer Study
Bacterial Vaginosis Results
Limitations of the Bacterial Vaginosis Study
Conclusion
References
Appendix
Sequence Library Normalization
Python Program for Homopolymeric Sequence Filtration
Custom Lowest Common Ancestor Python Program

List of Tables

Table 1-1 Statistics for Illumina, Roche, and ABI SOLiD
Table 1-2 A-D High-Throughput Sequencing Sensitivities
Table 2-1 Primers for the Virochip
Table 2-2 Round B PCR Conditions for Virochip
Table 2-3 Sample information for the ovarian cancer samples screened
Table 2-4 Primers used to screen ovarian cancer tissue
Table 2-5 Ovarian Cancer PCR Master Mix Screening
Table 2-6 Total microbial sequences generated for each cancer sample
Table 2-7 PID, BV, and Healthy Vaginal Microbial Populations Recovered
Table 2-8 Lactobacillus and Neisseria spp Analysis
Table 3-1 Enterovirus qRT-PCR
Table 3-2 Enterovirus qRT-PCR Conditions
Table 3-3 KAPA qPCR Primers
Table 3-4 KAPA qPCR Conditions
Table 3-5 Web-Based Classification Programs
Table 3-6 Simulated Data Set Totals
Table 3-7 Total Reads for Spiked Data Set
Table 3-8 MG-RAST Precision and Sensitivity for Simulated Data Set
Table 3-9 Galaxy Precision and Sensitivities for Simulated Data Set
Table 3-10 N-mer length comparison for NBC
Table 3-11 MetaBin Precision and Sensitivity Values for Simulated Data Set
Table 3-12 WebCarma Precision and Sensitivity Values for Simulated Data Set
Table 3-13 Misclassified Reads Using NBC
Table 3-14 Precision and Sensitivity Values for Bacterial Spike
Table 3-15 Burkholderiaceae Discrepancies
Table 3-16 Time Analysis for each Classification Program
Table 3-17 BV/HIV patient stratification
Table 3-18 Sequences Generated for Vaginal Microbiome Samples

List of Figures

Figure 1-1 Growth of GenBank and Cost of Sequencing
Figure 2-1 Nested RT-PCR for Gammaretroviruses
Figure 2-2 Alignment of MLV-V Sequences
Figure 2-3 Proportion of Microbial Families in OvCan Tissue
Figure 2-4 Ovarian Cancer Multivariate Analyses
Figure 2-5 PID and BV-Associated Bacteria
Figure 2-6 Cancer virome percent recoveries for each cancer sample
Figure 2-7 16S rDNA Assay
Figure 2-8 A/B 16S rDNA PCR Results
Figure 2-9 Total Sequences Are Not Proportional to Microbial Sequences
Figure 3-1A/B Mean coxsackievirus quantification values
Figure 3-2 Plasma qRT-PCR CB4 RNA recoveries
Figure 3-3 Vaginal Swab qRT-PCR CB4 RNA recoveries
Figure 3-4 Primary MG-RAST precision and sensitivity values at the family level
Figure 3-5 Secondary MG-RAST precision and sensitivity values at the family level
Figure 3-6 Galaxy precision and sensitivities for the simulated data set
Figure 3-7 Simulated Data Accuracy
Figure 3-8 Simulated Data Clustering
Figure 3-9 MG-RAST Optimization for Bacterial Spike
Figure 3-10 Galaxy Bacterial Spike Optimization
Figure 3-11 Bacteria Spike Accuracy
Figure 3-12 Clustering of Classification Programs for Bacteria Spiking
Figure 3-13 WebCarma Burkholderiaceae Sequence Alignments
Figure 3-14 Vaginal Microbiome Metagenomic Pipeline
Figure 3-15 Hierarchical Clustering of Vaginal Microbiomes
Figure 3-16 NMDS Clustering of Vaginal Microbiomes
Figure 3-17 NMDS Clustering of Vaginal Microbiomes According to HIV status
Figure 3-18 NMDS Clustering of Vaginal Microbial Species
Figure 3-19 A/B Hierarchical Clustering of Vaginal Microbial Species
Figure 3-20 BV-Associated Bacteria vs the Nugent Score
Figure 3-21 BV Virome: Preliminary Results
Figure 3-22 BV Virome: Re-alignment
Figure 3-23 BV Mycobiome
Figure 3-24 Hierarchical Clustering of BV Mycobiome

List of Abbreviations

bp: base pair
DNA: Deoxyribonucleic Acid
BLAST: Basic Local Alignment Search Tool
NCBI: National Center for Biotechnology Information
PCR: Polymerase Chain Reaction
qPCR: Quantitative Polymerase Chain Reaction
RT-PCR: Reverse Transcription Polymerase Chain Reaction
qRT-PCR: Quantitative Reverse Transcription Polymerase Chain Reaction
BV: Bacterial vaginosis
BVAB: Bacterial Vaginosis-Associated Bacteria
PID: Pelvic Inflammatory Disease
rRNA: ribosomal Ribonucleic Acid
rDNA: ribosomal Deoxyribonucleic Acid
OC: Ovarian cancer
EOC: Epithelial ovarian cancer
Mu: Mucinous ovarian cancer
En: Endometrioid ovarian cancer
HGS: High-grade serous ovarian cancer
ES: Epithelioid Sarcoma
MG-RAST: Metagenomics Rapid Annotation using Subsystem Technology
gc: Genome copies

Acknowledgements

I would like to thank my supervisor, Patrick Tang, for providing me with support, patience, and guidance throughout my time as a graduate student. I have learned more than I could have ever hoped for. Second, I would like to thank Dr. Deborah Money and Dr. David Huntsman for their generous support and sample distribution. I would also like to thank all of the members of their laboratories for sample collection and processing, as well as the collection of clinical information. Third, I would like to thank my co-workers at the British Columbia Centre for Disease Control for instrument training (Molecular and Virology laboratories), problem solving, and assistance with computer programming (Kevin Jewell). I would also like to thank Mark McCabe for his assistance with a qRT-PCR assay used in this study. This project would not have been possible without the extraordinary insights, assistance, and guidance of these individuals.

Chapter 1 Introduction

1.1 Metagenomics

In 1977, Sanger et al.
sequenced the first complete genome using what was then a novel technique, the chain-termination method. This project, to sequence the 5,386 base pair (bp) genome of the φX174 bacteriophage, took more than one year to complete (Ussery et al., 2009). At this rate, it would have taken roughly 1,000 years to sequence the Escherichia coli genome. The significance of this sequencing endeavour by the Sanger team is highlighted by the fact that the first free-living microorganism would not be sequenced until almost 20 years later, in 1995 (Fleischmann et al., 1995). Although there was a Sanger sequencing explosion in the 1990s, the expansion of this field was limited by an estimated cost of $5,000 per million bp. As efficiencies in existing technologies improved and novel technologies were introduced, the cost of sequencing dropped at a staggering rate, as shown in Figure 1-1 alongside the overall growth of DNA uploaded onto GenBank (Wetterstrand, 2012).

[Figure 1-1. GenBank growth and the cost of sequencing, alongside the price per million bp starting in 2001. Axes: base pairs on GenBank and cost per Mb in US dollars, by year (1982 to 2012).]

The growth of DNA sequencing has revolutionized many scientific fields ranging from evolution to forensic science, and single-handedly provided the foundations for others that would subsequently emerge. One of the most impactful and illuminating breakthroughs in the field of microbiology over the last decade has arguably been the introduction of metagenomics. Metagenomics is the sequence-based characterization of a given sample at the taxonomic or functional level (Gilbert and Dupont, 2011).
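The sequencing-rate estimate given earlier in this section can be checked with a back-of-the-envelope calculation. The sketch below assumes an E. coli genome of roughly 4.6 million bp (a value not stated in the text) and treats the φX174 project as exactly one year of work. Because the project took more than one year, the result is a lower bound, and it falls in the same order of magnitude as the 1,000-year figure.

```python
# Rough check of the sequencing-rate estimate (illustrative only).
PHIX174_GENOME_BP = 5_386      # genome size given in the text
PHIX174_PROJECT_YEARS = 1.0    # assumption: "more than one year" treated as exactly one
ECOLI_GENOME_BP = 4_600_000    # assumption: approximate E. coli genome size, not from the text


def years_to_sequence(genome_bp: float, rate_bp_per_year: float) -> float:
    """Return the years needed to sequence genome_bp at a constant rate."""
    return genome_bp / rate_bp_per_year


rate = PHIX174_GENOME_BP / PHIX174_PROJECT_YEARS
ecoli_years = years_to_sequence(ECOLI_GENOME_BP, rate)
print(f"Estimated time for E. coli at the 1977 rate: about {ecoli_years:.0f} years")
```

A longer assumed project duration scales the estimate proportionally, which is one way to arrive at a figure closer to 1,000 years.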
Although the foundational techniques for metagenomics were established in 1998, it was not until 2004 that methods traditionally used to sequence individual genomes were utilized to transcend the limits of the genome and pave the way into the meta-genome (Venter et al., 2004). Venter et al. demonstrated the power of metagenomics using a clone library from isolates in the Sargasso Sea with traditional Sanger sequencing. Of the 1 billion non-redundant base pairs sequenced, 1,800 unique genomes, 48 novel bacterial phylotypes, and 1.2 million previously unknown genes were discovered. With the introduction of next-generation sequencing (high-throughput sequencing/deep sequencing), the power of Craig Venter's sequencing facility was condensed into a single machine. Laboratories around the world began sequencing human and environmental samples at depths previously assumed impossible. Metagenomics has since dramatically expanded the phylogenetic tree of life and introduced new branches to it (Gill et al., 2006; Kembel et al., 2011). For example, it is solely responsible for establishing that, on average, 1 million bacteria are recovered per millilitre of seawater. Given an average genome size of 2 million bp, the aforementioned groundbreaking Sargasso Sea metagenomics project recovered only 0.05% of the genomic information in just 1 millilitre of seawater (Bohannon, 2007; Nealson and Venter, 2007). Metagenomics has also revolutionized our view of animal biology by identifying and consistently reinforcing the fact that the bacterial cells making up any given microbiome outnumber the cells of their hosts by at least one order of magnitude (Plottel and Blaser, 2011; Qin et al., 2010).
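The 0.05% figure follows directly from the three numbers quoted above. As a minimal sketch, using only those values (the function and variable names are illustrative, not taken from the thesis):

```python
# Fraction of the genomic content of 1 ml of seawater covered by the
# Sargasso Sea project, using the figures quoted in the text.
BACTERIA_PER_ML = 1_000_000            # average cells per millilitre of seawater
AVG_GENOME_BP = 2_000_000              # average bacterial genome size
SARGASSO_BP_SEQUENCED = 1_000_000_000  # non-redundant bp sequenced


def percent_recovered(sequenced_bp: int, cells_per_ml: int, genome_bp: int) -> float:
    """Return sequenced bp as a percentage of the total bp present in 1 ml."""
    total_bp_per_ml = cells_per_ml * genome_bp
    return 100 * sequenced_bp / total_bp_per_ml


pct = percent_recovered(SARGASSO_BP_SEQUENCED, BACTERIA_PER_ML, AVG_GENOME_BP)
print(f"Genomic information recovered from 1 ml: {pct:.2f}%")
```

The calculation treats every cell as carrying a distinct genome, so it is an upper bound on the total sequence present rather than on the unique sequence.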
Furthermore, these sequencing endeavours have demonstrated that, across habitats ranging from the human body to the ocean abyss, fewer than 1% of all microbial species recovered have been cultured (Eckburg et al., 2005; Temperton and Giovannoni, 2012; Tringe and Rubin, 2005). Despite the hundreds of billions of bp that have been sequenced, there still remains an unimaginable amount of diversity yet to be characterized. For instance, multiple studies on human and other mammalian microbiomes have found that 28-68% of the recovered sequences have no significant similarity to the GenBank NT database (E-value > 0.001), while for environmental metagenomics, these percentages rise above 80% (Bench et al., 2007; Breitbart et al., 2003; Brulc et al., 2009; Reyes et al., 2010; Yooseph et al., 2007). Metagenomics has also had significant impacts within the clinical laboratory. For example, using conventional laboratory assays, an estimated 40% of gastrointestinal and 15% of respiratory illnesses fail to be diagnosed (Finkbeiner et al., 2008; Juven et al., 2000). In a clinical laboratory, when conventional tests fail to identify pathogens associated with a particular syndrome, single or multiplexed PCR assays for conserved regions within known microbial families, differentially labeled nucleic acid probes, serology, and direct sequencing can be employed (Mahony et al., 2009; Molenkamp et al., 2007; Raymond et al., 2009; Rohayem et al., 2004). Cell culture, electron microscopy, serology, and PCR can be powerful methods for microbial identification, but only if the respective cell lines, sera, and primers for downstream detection are compatible with the respective microbial target. Immuno-based detection systems also have their limitations, as sera derived from previously infected hosts can be difficult to obtain and antibodies may not be present in sufficient concentrations (Doane, 1987; Roingeard, 2008).
Furthermore, electron microscopy is relatively insensitive, as typically 10^5 to 10^6 particles/ml are needed for viral detection (Roingeard, 2008). Samples that remain undiagnosed with these methods are prime candidates for metagenomic analyses, such as pan-viral microarrays, which simultaneously test for all known, divergent, and novel viruses (Palacios et al., 2007; Wang et al., 2002). If the etiologic agent is undetectable by microarray analysis, the samples can then be subjected to high-throughput sequencing to detect the potentially novel microbial pathogen(s). Metagenomics has successfully identified multiple pathogens including WU/KI polyomavirus (Allander et al., 2007; Gaynor et al., 2007), human cosavirus E1 (Kapoor et al., 2008), a novel arenavirus (Palacios et al., 2008), Merkel cell polyomavirus (Chiu et al., 2008a; Feng et al., 2008), and human TMEV-like cardiovirus (Chiu et al., 2008a). In a representative example, high-throughput sequencing was applied to an acute diarrhea case in which all standard clinical tests had failed to identify an associated pathogen (Nakamura et al., 2011b). Of the 96,941 sequences generated, 156 aligned to Campylobacter jejuni. This prompted a re-examination of the original sample using more specific methods, which subsequently confirmed this organism as responsible for the acute infection. In addition to acute diseases, the etiologies of many clinically significant chronic diseases, such as neoplastic, inflammatory, and certain autoimmune diseases, have been put into question. Metagenomics has successfully recovered incredibly diverse populations of microorganisms in complex diseases such as type 1 diabetes (Vaarala et al., 2008), atherosclerosis (Koren et al., 2011), inflammatory bowel disease (Peterson et al., 2008), various cancers (Castellarin et al., 2012; Plottel and Blaser, 2011), and asthma and chronic pulmonary disease (Huang et al., 2010; Marri et al., 2012).
These microorganisms could have significant impacts upon disease development, progression, or treatment efficacy (Plottel and Blaser, 2011). Furthermore, if these bacteria are native to the region in question, based on our thorough understanding of prokaryotic biology, they could theoretically be used for enhanced drug delivery (Hamady et al., 2011). Given the promise of the combination of next generation sequencing and metagenomics, it is disturbing that the large majority of studies do not attempt to ascertain the limits of the methods used for sequence recovery or taxonomic classification. For example, a PubMed search using 'Illumina' as the query returns 2,443 articles. If 'sensitivity' or 'limit of detection' is coupled to this query anywhere in the article, this number drops to 151. Although not exact, this rough estimate suggests that less than 10% of research projects using Illumina technology even mention the sensitivity of the method. Furthermore, if terms specific to metagenomics and microbiology are included ('virus', 'bacteria', 'microbial', 'microbiome', or 'metagenomic'), the number drops to below 10. The sensitivity section below compares all of the studies that adequately address this question in a quantitative fashion (i.e., via quantitative PCR), which total a dismal four. Of these, only three tested the limit of detection, whereas the fourth study simply quantified those samples with positive results (Bertolini et al., 2012). Finally, there are even fewer studies which test the taxonomic classification tools upon which their results largely depend. Although such results could simply remain in the unpublished realm, if a group were to thoroughly test these methods with a dataset representative of the experiment, the amount of time and computation required would likely be mentioned in order to bolster confidence in all subsequent results.
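The proportion quoted above can be checked with a few lines of arithmetic. The sketch below uses only the two PubMed counts reported in the text (2,443 and 151); the function and constant names are illustrative and not part of any published workflow.

```python
def fraction_reporting(hits_with_term: int, total_hits: int) -> float:
    """Return the percentage of articles that also match the extra search term."""
    if total_hits <= 0:
        raise ValueError("total_hits must be positive")
    return 100.0 * hits_with_term / total_hits


# Counts reported in the text for the 'Illumina' PubMed query.
TOTAL_ILLUMINA = 2443
WITH_SENSITIVITY = 151

share = fraction_reporting(WITH_SENSITIVITY, TOTAL_ILLUMINA)
print(f"{share:.1f}% of 'Illumina' articles mention sensitivity or limit of detection")
```

The result is roughly 6%, which is consistent with the "less than 10%" estimate given in the text.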
Furthermore, whether these methods are reproducible has also been an issue in most of the amplicon sequencing methods applied to microbial characterizations. Initially, rRNA amplicon sequencing was shown to have a surprisingly low level of reproducibility: after removing singleton sequences, the percent overlap among samples with two technical replicates ranged from 15.1 to 30.2% (Zhou et al., 2011). More recently, another study found high reproducibility down to the family level, with recovery from triplicate samples deviating by only ~0.9% (Pilloni et al., 2012). The most recent rRNA reproducibility examination found that between the two samples analyzed, the mean abundances of three groups of bacteria (Firmicutes, Actinobacteria, and Bacteroides) overlapped by only 71.5%, 1.6%, and 25.0%, respectively (Flores et al., 2012). These results suggest that with any method, slight variations can cause dramatic effects in reproducibility. The purpose of this study is to apply a series of tested metagenomic methods to the microbial characterizations of two complex diseases, bacterial vaginosis and ovarian cancer. Illustrating the rapid growth of metagenomics, during this two-year period the bioinformatic tools available to analyze these datasets changed dramatically. As new tools were introduced and refined, we chose a specific class of both library preparation methods and taxonomic classification tools for our metagenomic studies. These results, strengthened by measured sensitivities for both the sequence generation and classification methods, offer unique insights into each application with a level of confidence heretofore unachieved in metagenomics.

1.2 Metagenomic Methods of Microbial Detection

Cloning

Metagenomics initially emerged from studies that relied upon traditional cloning techniques for DNA amplification and subsequent characterization.
Random or 'Shotgun' cloning begins with the ligation of randomly fragmented DNA into vectors (Handelsman et al., 1998; Kimelman et al., 2012; Venter et al., 2004). Each vector is then amplified within transformed bacteria grown on selective media. Using the ends of the vector as primer binding sites, the ligated DNA can be amplified and sequenced. Aside from the large amount of time this method requires, one large bias of this method is the sensitivity of E. coli to environmental sequences (Kimelman et al., 2012). These sequences can either be toxic to E. coli, such as those of phage origin or active components of other bactericidal genes, or they may contain altered methylation patterns that are digested by host restriction enzymes. Other organisms such as Saccharomyces cerevisiae can be used for larger sequences; however, the same limitations still apply, albeit at a lower frequency (Kimelman et al., 2012). Although this method is efficient for small scale sequencing projects, the restrictions of cloning for metagenomics are best illustrated by the difficulties it caused for the human genome project. Although the draft sequence representing 99% of the human genome was published in 2001, due to difficulties in cloning, the sequence was not completed until 2009 (Garber et al., 2009). As genomic sequencing moved beyond the limits of Sanger sequencing, sequence-independent methods of amplification began to replace these cloning procedures. These methods require no bacteria and, in addition to removing biological restrictions to sequence retrieval, remove many potential contaminants from the metagenomic pipeline.

Sequence-Independent Amplification

Many different techniques for sequence-independent amplification have been described, which can be divided roughly into three groups.
The first group, including sequence-independent single primer amplification (SISPA) and linked amplified shotgun libraries (LASL), is based on restriction enzyme digestion or shearing of DNA/cDNA followed by the ligation of adapters. The resulting sequences can then be PCR amplified using the adapter sequence as the primer binding site (Reyes and Kim, 1991). These methods have been successful in identifying novel parvoviruses in human plasma (Jones et al., 2005). The second group includes PCR amplification methods using arbitrary primers at low annealing temperatures or random primers, both of which have been successful in detecting numerous viruses in respiratory secretions and stool, including human metapneumoviruses, bocaviruses, polyomaviruses, cosaviruses, and kobuviruses (Allander et al., 2007; Allander et al., 2005; Gaynor et al., 2007; Greninger et al., 2009; Holtz et al., 2009; Kapoor et al., 2008; Li et al., 2009; van den Hoogen et al., 2001). The final group consists of methods which amplify DNA by displacement, including φ29 DNA polymerase-based amplification and rolling-circle amplification specifically for circular DNA genomes (Rector et al., 2004). A rolling-circle amplification strategy was recently used to identify a novel polyomavirus associated with a rare skin disease, trichodysplasia spinulosa, in an immunocompromised patient (van der Meijden et al., 2010). Alternative procedures known as 'Not So Random' (NSR) amplification methods can also be employed (Filiatrault, 2011). Instead of using the full set of random hexamers for reverse transcription, those hexamers which have significant homology to rRNA are removed. Recently, the Ovation Prokaryotic RNA-Seq System has been introduced where, through the use of 'proprietary primers', mRNA can be enriched and subsequently amplified (Armour et al., 2009; Head et al., 2011).
However, even with NSR applications to sequence-independent amplification, it is critical that unwanted cellular debris and background nucleic acids are first removed in order to enrich samples for maximal recovery of the presumed microorganisms.

Filtration

Microbial sequences can also be concentrated prior to extraction by filtering or concentrating samples according to the size of the microorganisms of interest. Although popular for environmental metagenomic samples which contain particulate matter, filtration and concentration have also been successfully applied to clinical and microbiome characterizations where fecal matter and cellular debris can be prevalent (Loens et al., 2002; Thurber et al., 2009). Recently, virome characterizations have used 0.45 µm filters in order to capture cellular organisms and enable virus concentration (Handley et al., 2012). These studies have successfully recovered novel viruses from various hosts including field mice (Phan et al., 2011), bats (Li et al., 2010), rats (Haange et al., 2012), and humans (Zhang et al., 2006). Virome characterizations from multiple sample types frequently utilize tangential flow filtration for virus concentration (Bench et al., 2007; Breitbart et al., 2003). In standard impact filtration, the flow of material is perpendicular to the filter and therefore the flow-through of the desired substrate may be obstructed by the accumulation of unwanted material (Schoenfeld et al., 2008). In tangential flow filtration, the flow is cycled through the system tangential to the filter, where a back flow, generated by clamping the exiting sample, provides an adequate pressure differential to drive particles through the filter. This tangential flow system effectively minimizes the accumulation of material along the filter. Viral particles larger than the pores in the filter repeatedly cycle through the system until they reach the desired concentration for downstream detection.
For samples with extremely high concentrations of particulate matter, pre-filtering with a coarse pore filter prior to tangential flow filtration is recommended. Ultracentrifugation is typically the next step in virus particle concentration, where virus particles of the desired density can readily be removed for downstream processes. Detailed procedures on these methods have been described elsewhere (Thurber et al., 2009).

Monomeric Subtraction

There are two approaches for nucleic acid subtraction: those which target a single sequence (monomeric), and those targeting multiple and often unknown targets (polymeric). Monomeric subtraction is mostly utilized in analyses examining a particular class of nucleic acid molecules. Since there are sequence-based cellular mechanisms for recognizing and processing nucleic acid molecules with specific functions, these intrinsic sequences can be targeted for subtraction. One branch of metagenomics, interested in either organisms with an RNA genome or the phenotypic potential of the microbial population, is termed metatranscriptomics (Moran et al., 2013). These studies specifically aim to generate cDNA from non-rRNA sequence substrates. As rRNA makes up roughly 80-90% of the RNA in any given cell, probes in these studies are added post-extraction and target a conserved rRNA sequence for subsequent removal. The importance of this subtraction is illustrated in one study characterizing the bacterial transcriptome of a marine environment, where almost 90% of the reads generated through pyrosequencing were specific for rRNA (Hewson et al., 2009; Stewart et al., 2010). Although the manufacturers claim greater than 90% rRNA removal, some metatranscriptomic studies using rRNA subtraction still recover a significant percentage of rRNA sequences (37%) (Poretsky et al., 2009). This is likely due to the lack of complementarity between the rRNA sequences within the sample and the rRNA probe.
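A short calculation helps show why a substantial rRNA fraction can persist even after nominally efficient subtraction. The sketch below is a simplified model, not a description of any kit: it assumes that non-rRNA molecules are untouched by the subtraction and that read counts are proportional to the remaining RNA. The function name and parameters are illustrative.

```python
def residual_rrna_fraction(rrna_fraction: float, removal_efficiency: float) -> float:
    """Fraction of the remaining RNA that is still rRNA after subtraction.

    rrna_fraction: share of total RNA that is rRNA before subtraction (0 to 1).
    removal_efficiency: share of rRNA molecules removed (0 to 1).
    Assumes non-rRNA molecules are not removed at all.
    """
    if not (0.0 <= rrna_fraction <= 1.0 and 0.0 <= removal_efficiency <= 1.0):
        raise ValueError("both arguments must lie between 0 and 1")
    remaining_rrna = rrna_fraction * (1.0 - removal_efficiency)
    non_rrna = 1.0 - rrna_fraction
    total_remaining = remaining_rrna + non_rrna
    if total_remaining == 0.0:
        return 0.0  # nothing left at all
    return remaining_rrna / total_remaining


# rRNA at 80% and 90% of total RNA, with 90% of rRNA removed.
for start in (0.8, 0.9):
    left = residual_rrna_fraction(start, 0.9)
    print(f"start {start:.0%} rRNA, 90% removed -> {left:.1%} rRNA remaining")
```

Under these assumptions, a sample that starts at 90% rRNA still contains about 47% rRNA after 90% removal, so residual values such as the 37% cited above do not necessarily imply that the subtraction failed outright.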
Alternatively, the messenger RNA (mRNA) can be enriched by either of two methods. As bacteria do not have polyadenylated transcripts, mRNA enrichment in bacterial transcriptomics requires an alternative procedure using 5'-phosphate-dependent exonuclease. Since rRNA is post-transcriptionally processed to yield a 5' monophosphate, whereas other RNA transcripts retain the original 5'-triphosphate, this enzyme is utilized to selectively digest rRNA (Filiatrault, 2011). This method seems to produce only a modest increase in mRNA abundance; in one study, less than 25% of the aligned sequencing reads represented transcripts other than rRNA (Birren, 2012). Analyses interested only in the eukaryotic virome, however, might choose to enrich mRNA by targeting polyadenylated RNA molecules. Although viruses from specific families such as Picornaviridae have polyadenylated regions within their genome, many lack such regions, and therefore these methods would only capture a subset of the viral RNA molecules that are destined for protein synthesis. These subtractive technologies used in metatranscriptomics are not yet standardized and several issues remain to be solved, such as the observations of unequal coverage of transcript enrichment (Filiatrault, 2011). Furthermore, relatively few studies have thoroughly compared or tested these kits. In one recent study, however, He et al. compared the rRNA hybridization and exonuclease digestion methods of subtraction using an Illumina-based transcriptome analysis (Yi et al., 2011). These results suggest that the rRNA hybridization method introduced less bias in the relative proportion of mRNA sequences recovered compared to the exonuclease digestion. However, in other cases testing only the rRNA hybridization method, it has also been shown to be nonspecific and could therefore lead to other biases in the transcriptome (Stewart et al., 2010).
Interestingly, recent studies have utilized both subtractive methods and still found rRNA recoveries of 19.9% (Moore et al., 2011).

Polymeric Subtraction

Those studies which aim to simultaneously subtract a population of unrelated nucleic acid molecules too complex for, or too distantly related to, traditional probes require alternative strategies. Each of these methods requires a paired nucleic acid sample that is genetically homogeneous with the experimental sample for hybridization. These studies have largely been applied to human samples with a suspected novel pathogen. In these cases a healthy sample matched to the patient is preferred; however, the same tissue type from a control subject has been shown to be sufficient (Ambrose and Clewley, 2006). For samples with a relatively high concentration of microbial DNA/RNA, some methods such as differential display require a nonspecific amplification step, after which amplicons that appear on gel electrophoresis as unique relative to the uninfected samples are cloned and sequenced (Lu et al., 2004). Subtraction or representational differential analysis (RDA) is an alternative method that is preferred instead for infections with a low nucleic acid content (Muerhoff et al., 1997). This method consists of adding the uninfected sample (driver) in excess of the test sample, with common sequences removed via subtractive hybridization (Sagerstrom et al., 1997). This process may be performed in single or multiple rounds until the resolution of unique single stranded DNA sequences is sufficient for isolation and subsequent characterization. As episomal or integrated viruses cannot be concentrated using basic techniques such as centrifugation or, in some cases, enzymatic digestion, subtractive hybridizations have frequently been used as a solution to this problem in the past. Historically, Bishop and Varmus used this method for the characterization of the first oncogene in an avian sarcoma virus (Stehelin et al., 1976).
More recently, a variation of this method was used to discover human herpesvirus 8 in Kaposi sarcoma lesions (Chang et al., 1994). Despite these successes, as sequencing depth expands this somewhat laborious method is becoming increasingly rare in metagenomic analyses. Recently, however, modifications of these subtractive hybridizations have been specifically designed to complement high-throughput sequencing (Adams et al., 2009). Here, both the healthy and affected samples are first amplified, with biotinylated dUTP, in addition to dTTP, incorporated into the healthy sample. Streptavidin beads are then used to subtract the amplicons common to the healthy and affected samples, and the remaining reads are then sequenced. Through this method, a novel cucumovirus was discovered in infected Gomphrena globosa plants. Similarly, one metatranscriptomic study used customized sample-specific probes derived from aliquots of the experimental samples themselves (Stewart et al., 2010). Here, using universal primers, both 16S and 23S rRNA sequences are first amplified and used for subsequent in vitro transcription. These antisense RNA probes are then biotinylated and used to target and subtract complementary rRNA sequences in the experimental samples from which the probes were originally recovered. This approach was shown to remove up to 58% of the rRNA in the samples analyzed.

1.3 Microarrays

Microarrays have proven to be a powerful metagenomic tool to quantify and characterize environments (Brodie et al., 2007), transcriptomes (Schena et al., 1995; Welsh et al., 2001), and clinical specimens (Chiu et al., 2008b; Palacios et al., 2007; Wang et al., 2002). Diagnostic microarrays hold great promise for metagenomics as they can be simultaneously utilized for detecting and resequencing known, divergent, and novel microorganisms.
For example, shortly after the swine flu pandemic was declared, an influenza microarray platform was utilized to obtain genome-wide sequence information about the emerging reassortant influenza virus (Berthet et al., 2010). In contrast to these targeted viral microarrays, which detect a narrower range of viruses, the detection of a broad range of viruses requires a different type of microarray design. The most widely used pan-viral microarrays, the Virochip and the GreeneChip, have tens of thousands of probes, each with a probe collection covering all virus families known to infect humans (Palacios et al., 2007; Wang et al., 2002). Relative to quantitative microarrays, the probes on these arrays are elongated to increase their tolerance for mismatches between each probe and its complementary sequence, thereby increasing the ability to detect novel viruses. These arrays need to be updated regularly as new viruses are discovered. The Virochip was first introduced in 2002 and shortly afterwards demonstrated its potential in the initial identification and sequence characterization of the SARS coronavirus (Wang et al., 2003). Probe selection for the original Virochip is genome based: all sequenced viral genomes are aligned and a 70-nucleotide window is used to select the five best sequences which contain at least 20 bases in common. Additional probes were chosen for certain genera that are genetically similar but have clinically diverse outcomes so as to further assist in their differentiation (Wang et al., 2002). The current version of the Virochip, updated in October of 2009, contains ~36,000 probes. The sensitivity of the Virochip is superior to direct fluorescent antibody detection and comparable to PCR for diagnosis of viral respiratory infections (Chiu et al., 2008b; Kistler et al., 2007), with detection levels as low as 100 viral genome copies based on testing for rhinoviruses (Wang et al., 2002).
The Virochip is currently being used as a diagnostic tool at the British Columbia Centre for Disease Control as well as the University of California, San Francisco, where it has successfully identified uncommon, divergent, and novel viruses. Other significant findings include a new clade of rhinoviruses (Kistler et al., 2007), a novel human cardiovirus (Chiu et al., 2008a), as well as an uncommon human parainfluenza virus type 4 and a divergent human metapneumovirus infection in patients with severe respiratory infection (Chiu et al., 2007; Chiu et al., 2006). More recently, the Virochip was used to rapidly identify 2009 pandemic influenza as a novel swine influenza variant even though the viral probes were designed prior to the emergence of the pandemic (Greninger et al., 2010). The diagnostic microarray with the widest range in microorganism identification is the GreeneChip. This array is a pan-microbial microarray containing 29,495 probes specific for pathogens ranging from viruses to eukaryotic parasites (Palacios et al., 2007). In contrast to the Virochip, this array design is protein-based: three nucleic acid probes were chosen for each family based on alignments of amino acid sequences. These three regions, when possible, correspond to one nonstructural and two structural regions. For cellular parasites, specific portions of highly conserved ribosomal DNA sequences were selected. Collectively the chip contains 9,477 oligonucleotides for all vertebrate viruses (as of 2007, totaling 1,710 species), 11,479 16S rRNA probes for 135 bacteria, 1,120 18S rRNA probes for 73 fungi, and 848 18S rRNA probes for 63 parasite genera. In addition to these probes, 300 host-immune response genes were included for additional information. Based on real-time PCR analyses using viral RNA isolated from cultured cells, the sensitivity of this array platform reached 1900 RNA copies for RNA viruses and 10,000 RNA copies for DNA viruses.
Illustrating the potential of this array, a patient with symptoms presumably indicative of viral hemorrhagic fever was shown to be positive for Plasmodium falciparum. Extensive subtyping of influenza viruses using a set of specific primers prior to hybridization was also demonstrated (Quan et al., 2007). The crux of all microarrays is their sensitivity as well as their reproducibility. For diagnostic arrays, this is greatly determined by the viral to host ratio of nucleic acids prior to amplification. Variables such as the type of sample and the site of infection greatly influence this ratio. For instance, there is a relatively high success rate of virus identification in acute respiratory and gastrointestinal infections, which tend to have higher virus titers compared to other infections (Chiu et al., 2008b; Kistler et al., 2007). Acellular samples such as those derived from cell culture supernatant or blood serum improve this ratio due to minimal host nucleic acid content (Leski et al., 2010). In tissue samples that have less desirable virus to host nucleic acid ratios, methods such as DNase treatment can be useful prior to reverse transcription, as stated in the previous section. Various methods in reverse transcription and subsequent amplifications also influence the success of virus identification (Leski et al., 2010). The standardized method for amplification using the Virochip consists of a modified sequence-independent amplification (SIA) strategy that incorporates a reverse transcription step. Reverse transcription is directed by random nonamer primers flanked by a unique 5' adapter sequence. Second strand synthesis is carried out, much like SIA (Bohlander et al., 1992), with Sequenase, a T7 DNA polymerase. The cDNA is subsequently amplified prior to enzymatic amino-allyl nucleotide incorporation for downstream dye conjugation, both using a Klenow fragment DNA polymerase.
The above procedure is analogous to that used for the GreeneChip except that dendrimers are employed for fluorescent dye molecule incorporation. Each virus genome will bind in a unique pattern as dictated by the probes available as well as the hybridization conditions used. Given that there are thousands of probes, computer algorithms and statistical methods are essential to ascertain whether an observed pattern is indicative of a true virus genome. One such strategy designed for this purpose is E-Predict, an algorithm that compares an observed hybridization pattern to a pre-constructed theoretical pattern (Urisman et al., 2005). These theoretical hybridization patterns are generated in silico where the free energy of hybridization is calculated for each virus genome when compared to all probes on the microarray. Furthermore, by comparing related microarray data this calculated value can be adjusted in a Bayesian fashion. With reference to novel viruses for which a theoretical pattern is nonexistent, the intensity of each individual probe can also be compared to its intensity in controls as well as related samples hybridized in the past. When taxonomically arranged, these statistically significant oligonucleotides can provide sufficient data to determine the identity of a given virus (Chiu et al., 2006). Other algorithms assessing the significance of a given pathogen in microarray data have also been employed, such as PhyloDetect (Rehrauer et al., 2008) and DetectiV (Watson et al., 2007).

1.4 High-Throughput Sequencing

High-throughput sequencing applied to microbial detection relies on both the average length of the sequences as well as the number of reads that are generated. Using these sequences for taxonomic classification requires that these sequences be long enough to differentiate between relatives within the taxonomic level in question.
However, shorter reads (<50 bases) may still provide taxonomic resolution provided that they align to a relatively unambiguous region of a genome (Luo et al., 2012). All of these requirements also depend on sufficient representation of homologous genomes in the particular database used for alignment. These features are essential in samples where there may be a complex microbial community or high concentrations of host nucleic acid, where detection and differentiation of a potential pathogen can be problematic. Although most evidently characteristic of stool or respiratory samples, such complexities are frequently observed in other body sites traditionally viewed as 'sterile', such as cerebrospinal fluid and sera. Furthermore, although the subtraction methods described above are useful in removing undesirable nucleic acid molecules, these methods frequently do not remove all of the material and can also indirectly bind to alternative nucleic acid molecules. Therefore, the depth of sequencing can be the limiting factor in samples with a complex nucleic acid background.

High-Throughput Sequencing Analysis

Analyzing high-throughput sequencing data for microbial characterization can be a daunting and computationally intensive task (Mande et al., 2012). Standard techniques typically begin by removing homopolymeric reads and other low complexity reads, followed by the classification and removal of host genetic material. At this stage, various assembly algorithms can be utilized to generate larger and more taxonomically informative contigs. The raw reads or assembled contigs are then subjected to homology searches against non-redundant organism sequence databases. Since human subtractions do not have 100% efficiency, these databases must also include mammalian sequences. The values one uses for alignment efficiency are arbitrary, as these depend on the database one is aligning against as well as the size of the generated reads.
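The first step of the analysis described above, removing homopolymeric and other low complexity reads, can be sketched as follows. This is a minimal illustration that scores each read by the Shannon entropy of its base composition; the 1.5-bit threshold and the function names are assumptions chosen for the example, and production pipelines generally rely on dedicated low-complexity filters rather than this simple measure.

```python
import math
from collections import Counter


def base_entropy(read: str) -> float:
    """Shannon entropy (bits) of the base composition of a read.

    A homopolymeric read scores 0; a read with equal A, C, G, T scores 2.
    """
    counts = Counter(read.upper())
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def filter_low_complexity(reads, min_entropy=1.5):
    """Keep only reads whose base-composition entropy is at least min_entropy.

    min_entropy is an illustrative threshold, not a recommended value.
    """
    return [r for r in reads if base_entropy(r) >= min_entropy]


reads = [
    "AAAAAAAAAAAAAAAAAAAA",  # homopolymeric
    "ATATATATATATATATATAT",  # two-base repeat
    "ACGTTGCAAGCTTAGCCGTA",  # mixed composition
]
kept = filter_low_complexity(reads)
print(kept)
```

Note that base composition alone cannot detect every repeat: a read made of repeated "ACGT" units has maximal composition entropy, so measures based on longer k-mers would be needed to flag it.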
In order to avoid misclassifications, multiple alignments are frequently performed: following an initial high stringency alignment, subsequent rounds of decreasing stringency can often provide crucial information about the algorithm used for alignment as well as limitations of the database. Once a sufficient level of confidence has been established in the sequence classifications, depending on the goals of the project, reads for unnecessary organisms can be subtracted in silico. If novel microorganisms are suspected, the remaining reads can be aligned against databases for specific taxa, where homology can be verified either at the nucleotide (BLASTN) or at the amino acid level (BLASTX/tBLASTX) (Altschul et al., 1990). In a recent example, an analysis of an infectious disease of snakes initially recovered reads for a divergent arenavirus; a de novo assembler (PRICE de novo genome assembler) was subsequently applied using these initial reads as seeds, establishing sufficient genome coverage to differentiate two species of virus from the same sample set (Stenglein et al., 2012). Whether for quantitative, validative, or genomic coverage purposes, PCR primers can then be designed from the generated sequences.

Comparing the Dominant Methods

The dominant methods for microbial detection using high-throughput sequencing are Illumina sequencing-by-synthesis, Roche 454 pyrosequencing, and, to a lesser extent, ABI SOLiD. Each of these methods has its benefits and, depending on the aims of each individual research project, one system may be more suitable than another. Shown in Table 1-1 are the general statistics one usually considers when choosing a particular sequencing technology (Caporaso et al., 2012; Eisenstein, 2012; Liu et al., 2012; Luo et al., 2012). The most commonly cited difference between Roche and Illumina is the depth of sequencing.
Roche's GS FLX Titanium XL+ typically outputs 700 megabases (Mb), whereas Illumina's HiSeq instrument has maximum outputs of 600 gigabases (Gb). This statistic has many downstream implications that benefit Illumina users, such as an increased number of assembled contigs, a greater number of unique genes sequenced, and, most importantly, fewer errors as a result of increased depth at any given genomic location. Many metagenomic or microorganism discovery projects may nevertheless choose Roche 454 pyrosequencing based solely upon another crucial difference, the sequence length. Although typical read lengths average 700 bases, Roche 454 pyrosequencing is capable of producing reads as long as 1,000 bases, whereas Illumina's maximum length is currently limited to 300 bases. This difference in length increases the specificity of the gene or taxonomic classification as well as aiding subsequent de novo assemblies (Luo et al., 2012).

Distributor  System                 Amplicon Size (bp)  Output  Time      Reads
Illumina     HiSeq                  36-300              600 Gb  11 days   2.4 x 10^9
Illumina     MiSeq                  36-300              8.5 Gb  39 hours  34 x 10^6
Roche        GS FLX Titanium XL+    1,000               1 Gb    23 hours  1 x 10^6
Roche        GS Junior System       400                 35 Mb   10 hours  1 x 10^5
ABI SOLiD    5500xl 1µm beads       75                  15 Gb   24 hours  2 x 10^8
ABI SOLiD    5500xl 0.75µm beads    50                  20 Gb   24 hours  4 x 10^8

Table 1-1. Statistics for Illumina, Roche, and ABI SOLiD as of 2012. Mb = Megabase, Gb = Gigabase

Limitations of Each Illumina and Roche System

The different sequencing chemistries also influence the errors each produces. By introducing each deoxynucleotide triphosphate (dNTP) separately in time, pyrosequencing relies on a single light signal for detecting each incorporated nucleotide. This presents a problem for detection in sequences with homopolymeric regions as short as three bp. The corresponding accumulated light intensity makes base-calling in these regions variable.
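To make the homopolymer issue concrete, the following sketch locates runs of a single base that are at least three bp long, the length at which the text notes that pyrosequencing base-calling becomes variable. The function name and the example sequence are illustrative; this is a way to flag candidate regions in a sequence, not a model of the instrument's error behaviour.

```python
import re


def homopolymer_runs(seq: str, min_len: int = 3):
    """Return (start, base, length) for each run of one base at least min_len long.

    start is the 0-based position of the first base in the run.
    """
    if min_len < 1:
        raise ValueError("min_len must be at least 1")
    # One base followed by at least (min_len - 1) copies of the same base.
    pattern = re.compile(r"([ACGT])\1{%d,}" % (min_len - 1))
    return [
        (m.start(), m.group(1), len(m.group(0)))
        for m in pattern.finditer(seq.upper())
    ]


example = "ACGGGTTTTAAC"
runs = homopolymer_runs(example)
print(runs)
```

For the example sequence, the GGG and TTTT stretches are reported while the two-base AA stretch is not, since it falls below the three bp cutoff.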
Furthermore, it has also been found that up to 15% of the sequences produced are products of in vitro amplification (Luo et al., 2012). Pyrosequencing using the Roche 454 system also depends upon an emulsion step to precisely allocate one amplicon to one bead for sequencing (Gomez-Alvarez et al., 2009). Incomplete emulsion has been observed, in which a single amplicon attaches to multiple beads, resulting in an inflated recovery of the associated species (Briggs et al., 2007). Although one can account for this inflation by removing replicate reads, this rarely performed step may in turn underrepresent abundant species in amplicon-based studies.

Illumina's sequencing-by-synthesis also introduces errors. Base-calling efficiency differs depending on the spatial location of the amplicon on the sequencing tile to which it is attached (Dolan and Denver, 2008). Furthermore, due to the chemistry involved in sequence adsorption as well as physical limitations of the CCD camera used to monitor fluorescence, there are also inefficiencies at both the 5' and 3' ends of the sequences (Schroder et al., 2010). The majority of the sequence errors, however, are a result of GGC motifs, either alone or adjacent to secondary structures or GC-rich motifs (Nakamura et al., 2011a). This sequence artifact is also strand specific, presumably as a result of interactions between the sequence and the active site of the DNA polymerase itself. Despite these errors, algorithms are currently being developed for both sequencing technologies that aim to account for these sequence artifacts and, as a result, base-calling efficiencies are improving annually.

Nearly all of the initial research concerning sequence error analysis of NGS involves a single genome in isolation, typically the bacteriophage φX174 (Abnizova et al., 2012; Holt and Jones, 2008). This ideal situation is atypical of virtually all known in vivo systems.
Recently, a metagenomics study using each of these technologies compared the errors generated when characterizing a mock community consisting of 18 microbial species (Luo et al., 2012). Not only did Illumina generate significantly fewer sequence errors, it also produced roughly 5 times as much sequence data (2460 Mb) as Roche 454 (502 Mb). The two platforms agreed on 89% of the unassembled reads as well as 90% of the assembled reads. Furthermore, the estimated abundances of both the genes and genomes in the samples were highly similar. Illumina also produced longer and more accurate assembled contigs in reference to the genome sequences of each microorganism. In addition to the greater sequence output, the authors found that Illumina was about 4 times cheaper. Despite these advantages, many groups still prefer Roche 454 as a diagnostic tool, especially for novel microorganism discovery (Briese et al., 2009; Chu et al., 2012). This is because, although Roche 454 may have a higher error rate in homopolymeric regions, its error rates in gene-encoding regions are comparable to those of Illumina. Furthermore, the observation that Illumina produces longer assembled contigs typically applies to cases in which a reference genome exists. When no reference genome exists, the longer read lengths produced by Roche 454 have the potential to produce longer assembled contigs (Diguistini et al., 2009; Kumar and Blaxter, 2010).

Sensitivity

There have been very few studies directly addressing the sensitivity of high-throughput sequencing for microorganism detection. This is unfortunate, as a variety of published studies have failed to recover microbial sequences for the disease in question (Arron et al., 2011). Because these groups cannot quantitatively specify the limitations of the methods used, certain conclusions of such studies may be undermined.
Cheval et al. compared the sensitivities of Illumina and Roche 454 sequencing in their ability to detect spiked viruses in cerebrospinal fluid (CSF) and plasma (Cheval et al., 2011). The results of this study are found in Tables 1-2A and 1-2B, where in all cases Illumina was orders of magnitude more sensitive in detecting viruses with both DNA and RNA genomes. For example, the limit of detection for Roche 454 was 10^3 plaque forming units (PFU) per ml for both DNA and RNA viruses, whereas Illumina was in most cases more sensitive than the PFU assay itself and, for influenza viruses, more sensitive than qRT-PCR (<10^2.4 genome copies). In another study, Moore et al. spiked an RNA virus into colorectal cancer tissue and used Illumina sequencing to determine the limit of detection (Table 1-2C) (Moore et al., 2011). In order to ascertain the detection level, this particular study made two assumptions: that 1 pg of viral RNA equates to 150,000 genome copies (gc) and that the average cell has 20 pg of RNA. From this information, the authors state that the sensitivity of this method is roughly 1 viral sequence per 1,000,000 total sequences. More specifically, a single paired read was recovered for 0.2 pg of spiked viral RNA, and this amount was taken as the limit of detection. From the above information, one can calculate, albeit less precisely, that the sensitivity is limited to roughly 30,000 gc of input viral RNA per sample (0.2 pg x 150,000 gc/pg). In a more recent study, Malboeuf et al. utilized clinical samples rather than spiked samples for evaluating the sensitivity of Illumina sequencing (Table 1-2D) (Malboeuf et al., 2013). Using qRT-PCR analyses for comparison, this team found that their particular method was able to detect 96-100% of the protein-encoding regions of HIV, RSV, and WNV from as low as 100 gc per sample (Table 1-2D).
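The arithmetic behind these detection limits can be checked with a short sketch. The conversion factor (150,000 gc per pg of viral RNA) and the 0.2 pg lowest detected input are the values quoted above for Moore et al.; the exponent 2.4 is the qRT-PCR threshold quoted for Cheval et al. The function and variable names are hypothetical and chosen only for illustration.

```python
# Assumption taken from the text above: 1 pg of viral RNA ~ 150,000 genome copies (gc).
GC_PER_PG = 150_000


def genome_copies(pg_viral_rna, gc_per_pg=GC_PER_PG):
    """Convert a mass of viral RNA (pg) into an estimated number of genome copies."""
    return pg_viral_rna * gc_per_pg


# Lowest spiked amount for which a single paired read was recovered (Moore et al.).
lowest_detected_pg = 0.2
limit_gc = genome_copies(lowest_detected_pg)

# The qRT-PCR threshold quoted for Cheval et al. is given as a power of ten.
cheval_threshold_gc = 10 ** 2.4

print(f"Moore et al. limit of detection: ~{limit_gc:,.0f} gc/sample")
print(f"10^2.4 genome copies: ~{cheval_threshold_gc:.0f}")
```

Because the genome copy number is derived from an assumed conversion factor rather than measured directly, the result is only as reliable as that assumption.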
Table 1-2A. Cheval et al., 2011: TCID50 sensitivities

Sample  TCID50 Sensitivities (PFU)  Roche Limit of Detection (RNA/DNA virus)  Illumina Limit of Detection (RNA/DNA virus)
Plasma  10, 10^3, 10^5              10^5, 10^3                                10, 10
CSF     10, 10^2, 10^3              Out of Range, 10^3                        10, 10

Table 1-2B. Cheval et al., 2011: qPCR sensitivity

Virus  qPCR (gc/µl)  Illumina Reads Detected  Limit of Detection (gc/sample)
H3N8   0.357         10                       37.7
FluB   0.357         4                        37.7

Table 1-2C. Moore et al., 2012: HaRNAV detection, nanodrop quantification

Viral RNA (gc/sample)  Virus Expected Recoveries  Illumina Recoveries
30 x 10^6              2035                       618
3 x 10^6               233                        31
300,000                23                         6
30,000                 2                          1

Table 1-2D. Malboeuf et al., 2012: qPCR sensitivities

Virus  Initial Concentration (gc/sample)  Post Amplification (gc/sample)  % Viral Reads Recovered
HIV    800                                3.30E+08                        1.1
HIV    200                                2.50E+07                        0.4
HIV    200                                1.00E+06                        0.7
HIV    200                                2.60E+06                        1.3
HIV    100                                9.00E+05                        0.7
HIV    100                                9.90E+05                        1.7

Table 1-2 A-D. High-throughput sequencing sensitivities; tables adapted from data in each of the 3 mentioned studies: (A, B) Cheval et al., 2011, (C) Moore et al., and (D) Malboeuf et al., 2012.

The differences in the sensitivities between these studies are significant, as the limit of detection ranges from 37.7 to 30,000 gc/sample. Although there are many differences between the methods used in each study, the most obvious differences lie in the subtraction and amplification stages of each study. For example, in the study displaying the highest sensitivity of Illumina sequencing, Malboeuf et al. used the Nugen-RNASeq library preparation, in which mRNA sequences are enriched and subsequently amplified. The amplification step could bring rare but informative RNA populations up to detectable levels that may otherwise be eliminated in one of the frequent wash steps leading up to sequencing.
This is clearly demonstrated where, although the quantity of HIV was initially found to be 100 gc/sample, the concentration post-amplification, and therefore at the point of Illumina sequencing, was quantified as 1.2 x 10^11 genome copies (Table 1-2D). Although this study recovered the second highest reported sensitivity, of the roughly 15 million 100 bp reads generated per sample, over 9% of the reads were recovered for HIV, indicating that the actual limit of detection is much lower. The second highest sensitivity was demonstrated in Cheval et al., where 4 reads were recovered at the dilution with 10^2.4 (251) gc/ml of Influenza H3N8 (37.7 gc/sample). Although a transcriptome enrichment kit was used, Cheval et al. instead utilized random hexamers for cDNA synthesis. However, amplification of the resulting cDNA was then performed with a standard φ29 DNA polymerase protocol. Unfortunately, this group did not quantify the virus gc number following amplification. Nonetheless, 5 µg of DNA was loaded into each Roche 454 and Illumina library preparation, which generated the lowest number of reads among these three studies, ranging from 5-10 million 76 bp reads per sample. The highest detection limit was in Moore et al., where 30,000 genome copies were recovered at the highest dilution. This is despite the fact that, among the three studies, this study generated the greatest number of reads for each sample (~22 million 36-150 bp). However, as the gc number is an extrapolated value determined by cellular and viral RNA estimates, the detection limit is quite possibly higher. Furthermore, viral RNA concentrations were measured with nanodrop, a method that frequently yields RNA concentrations differing by orders of magnitude from those of more sensitive RNA quantification methods. Nonetheless, at face value, this group demonstrated a relatively high sensitivity of 30,000 gc/sample. This group performed two subtraction steps by enriching mRNA transcripts followed by ribosomal RNA subtraction, and thus no amplification.

Comparing all of these methods, one can conclude that with the subtraction of ribosomal RNA, directly or indirectly through transcriptome enrichment, appreciable sensitivities can be reached in the tens of thousands of genome copies. However, when an amplification step is introduced, the sensitivities dramatically increase to fewer than 100 gc/sample. Therefore, if the primary goal of a given research project is detection, an amplification step is strongly suggested. However, amplification has its costs. First, primer dependencies, enzyme biochemistry, and secondary structures of the template molecule can introduce template preferences and therefore yield a community that is distinct from that which existed in the original sample (Hamady and Knight, 2009; von Wintzingerode et al., 1997). This can prove detrimental to projects seeking relationships between microorganisms or to those investigating complex polymicrobial samples, such as dysbioses or microbiome characterizations. Second, although infrequent, PCR amplification may also produce hybrid amplicons as a result of template switching or strand transfer during template extension (Haas et al., 2011b). Again, for complex analyses such as microbiome characterizations, this may be problematic.

Given that there are tradeoffs between the methods, Nextera XT technology has recently emerged as a promising alternative used in conjunction with Illumina sequencing. The Nextera XT library preparation method has adopted a transposon-based protocol which, by simultaneously fragmenting the DNA and attaching the adapters required by Illumina sequencing in a single tube, reduces the required input from more than 1 microgram of cDNA to 1 nanogram (Caruccio, 2011).
This method has been tested in assessing HCV intrahost genomic diversity, where rare genomic variants were detected with relatively high sensitivity (Lauck et al., 2012). Recently, a mock metagenome was constructed with 9 DNA bacteriophages using the Nextera library preparation kit (Marine et al., 2011). This study found that the phages were recovered in roughly the order predicted from the size of each of their genomes. In another study, which compared a range of next-generation sequencing library platforms in detecting HIV variants, the Nextera library preparation kit was used alongside the 454 pyrosequencing platform (Archer et al., 2012). It was concluded that the Nextera kit performed equally well in detecting the particular HIV variants relative to other library preparation methods. Therefore, given that Nextera satisfies the above requirements of depth versus amplification, and that it is competitive with traditional library procedures, we decided to utilize this method for our study.

1.5 Multivariate Analyses

The amount of variability generated from metagenomic analyses presents an arduous task for microbiologists. The complex communities recovered may fluctuate as a result of time between sampling, the methodology used, stochastic differences, and finally true biological variation. Complicating these studies further are observations that variability often exists when samples are repeatedly taken from individuals at the same location (Caporaso et al., 2011). The Human Microbiome Project and MetaHIT have largely defined the microbial diversity of the human body, albeit in large part using 16S amplicon sequencing (2012; Qin et al., 2010). Because true metagenomics is random in nature, it expectedly generates datasets with higher complexity (Gill et al., 2006; Greenblum et al., 2012).
Therefore, properly defining similar samples and differentiating others is immensely important. Metagenomic studies produce abundance tables in which the number of species exceeds the number of samples, usually by orders of magnitude. Although metagenomics is a relatively new field, these tables are analogous to species-sample matrices in macroecological studies and, therefore, the very same statistical tools can be applied. Multivariate analyses encompass all statistical tools developed to best represent the relationships within datasets containing multiple variables (>2) simultaneously (Ramette, 2007). Most metagenomic analyses use values derived from these variables primarily for differentiating groups based on sample (dis)similarity, identifying trends in data, and identifying the variables which drive these relationships. In order to reach these conclusions, multivariate analyses employ exploratory analyses such as principal component analysis (PCA), various cluster analyses, or non-metric multidimensional scaling (NMDS), as well as hypothesis-driven analyses such as redundancy analysis, canonical correspondence analysis, or multiple ANOVA tests (Ramette, 2007). Given that most metagenomic studies are in the exploratory phase, most multivariate methodologies rely on different types of clustering, PCA, or NMDS (HMPC, 2012; Segata et al., 2012; Turnbaugh et al., 2009a).

The main goal of clustering analyses is to simultaneously minimize intragroup variation and maximize intergroup variation so that groupings begin to emerge. This variation is assessed by first calculating the distance values for all pairs of normalized abundance profiles within a sample matrix. The goal of a distance metric is to best represent the relationship between the variables (abundance values) in a given dataset, such that samples with similar abundance profiles have small distances between them and samples with divergent abundance profiles have large distances.
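To make the distance step concrete, the following minimal sketch computes a pairwise dissimilarity matrix for a small abundance table. It assumes the Bray-Curtis dissimilarity (the measure discussed next) in its standard form, the sum of absolute differences divided by the sum of all counts in both samples. The sample names, counts, and function names are hypothetical and serve only as an illustration, not as the analysis used in this study.

```python
from itertools import combinations


def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two abundance vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("both samples must list the same species in the same order")
    total = sum(a) + sum(b)
    if total == 0:
        return 0.0  # two empty samples: treated as identical (a convention, not a rule)
    return sum(abs(x - y) for x, y in zip(a, b)) / total


def distance_matrix(samples):
    """Return {(sample_i, sample_j): dissimilarity} for every pair of samples."""
    names = sorted(samples)
    dist = {(n, n): 0.0 for n in names}
    for p, q in combinations(names, 2):
        dist[(p, q)] = dist[(q, p)] = bray_curtis(samples[p], samples[q])
    return dist


# Hypothetical counts for four species in three samples (illustrative only).
samples = {
    "S1": [10, 0, 5, 5],
    "S2": [8, 2, 5, 5],
    "S3": [0, 20, 0, 0],
}
dist = distance_matrix(samples)
for pair in [("S1", "S2"), ("S1", "S3"), ("S2", "S3")]:
    print(pair, round(dist[pair], 3))
```

Samples S1 and S2 share most of their counts and receive a small dissimilarity, whereas S1 and S3 share no species and receive the maximum value of 1. Zero counts for a species cause no difficulty because the denominator sums over all counts in both samples.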
One example of a distance measure that is widely used is the Bray-Curtis measure (Rees et al., 2004). For each pair of samples, this value is calculated as the sum of the absolute differences between the species counts divided by the sum of all counts in both samples. This distance measure allows the comparison of samples that have zero values for certain species, whereas other distance measures may place such zero values in the denominator, which may require one to adjust each sample by adding 1 prior to distance computation. Another similarity metric that is commonly used is the Pearson coefficient. The Pearson coefficient is usually described as a measure of how similar two curves representing two samples are to one another (D'Haeseleer, 2005). Variations of this measure are applied in different clustering algorithms, some of which are listed below.

The next step in cluster analysis is to compare the resulting distance matrices so as to visually represent them most efficiently (Ramette, 2007). Here, comparing these matrices determines their linkages and therefore their proximity in a cluster. For hierarchical clustering, linkages have to be determined a priori (Corpet, 1988). The overall distance between two groups of samples may be calculated by selecting the closest neighboring members (closest values) between the two groups (nearest neighbor), the farthest pair (furthest neighbor), or the average. Alternatively, in k-means clustering, samples are assigned to a pre-determined number of clusters based upon the nearest Euclidean distance (as defined by the Pythagorean formula) to the mean of each predefined cluster. The results of these calculations are represented as a tree, with shorter branches, and therefore more proximal clusters, for similar samples and longer branches for those that are increasingly dissimilar. Hierarchical clustering ultimately establishes cluster proximity by grouping the two most closely related clusters (shortest branches) at a branch or node of the tree with a length equal to their distance. Their positions are replaced by this single object and, upon subsequent cyclic repetitions using the new distances produced, a hierarchy of clusters is generated.

NMDS

The goal of NMDS is to determine a spatial configuration for a given number of axes that preserves the order of dissimilarity as closely as possible (Shepard, 1980). This is usually done by applying a stress function, where a monotonic regression attempts to fit a rank-preserving function to the data in question. This function is then used to compare the distances in the new graphical representation with those in the original data. Several iterations are performed in which, by randomly rearranging the points in the plot, the best goodness of fit can be calculated for the chosen number of axes. Finally, confidence ellipses can be used to encircle samples which cluster together at a 95% confidence level.

Multivariate Analysis Packages

There are many packages that facilitate the usage of these tools from a single analytical platform. The Primer-E package is a popular tool that offers users a wide range of multivariate analyses including NMDS, PCA, and SIMPER, a tool that determines which functions or species are possibly driving the observed differences between samples (Clarke, 1993). The R statistics software is one of the most widely used statistical tools available (R.Core.Team, 2012). There are many packages within R which contain virtually all known multivariate tools. In recent years, many of them have been consolidated into one package, namely, the ShotgunFunctionalizeR package. This package provides many tools for assessing statistically significant differences in organisms as well as their functions.
Finally, there are web-based tools now being implemented with various multivariate tools, such as Metastats.

In analyzing these differences, one should be aware that, due to the large number of steps in a typical metagenomic procedure, technical variation can often be mistaken for true biological variation. Furthermore, temporal variation often exists in biological systems, and accounting for this can be exceedingly difficult as sample sizes decrease. Therefore, this variation should be accounted for by beginning with a shallow sequencing analysis on a limited number of samples in order to provisionally assess the variation of the environment in question. Subsequently, an appropriate number of samples can be selected in order to maximize the power of the applied multivariate analyses. Ultimately, the goal of metagenomic analyses is to identify the clinically important microorganisms, as well as the possible links between them, in order to identify the properties that define a syndrome or an environment. As metagenomics ventures outside the exploratory realm, samples can subsequently be manipulated by subtraction, filtration, or possibly by patient treatment stratification in order to parse out potential significance in the microorganisms and their associated functions. Furthermore, the collection and incorporation of metadata into such analyses are as important as the metagenomic data itself, as this information can be utilized for both hypothesis generation and solidification. As more metagenomes accumulate for each sample type, patterns between comparable datasets will undoubtedly emerge. These data can then assist in the development of targeted treatments for complex diseases.

1.6 The Human Microbiome

The adult human microbiome is composed of a population exceeding 100 trillion cells and collectively contains over 300 times more protein-encoding genes than its human host (Turnbaugh et al., 2007).
This ‘metagenome’ colonizes us at the time of birth, and for the remaining portion of our lives we continually shed and acquire Bacteria, Archaea, Eukaryotes, as well as viruses. The human body is a composite of multiple anatomical niches, each with its own unique microbial populations. This microbiome does not travel with us passively, as it provides the human body with various crucial functions in human development, immunity, nutrition, and physiology (Barton et al., 2007; Plottel and Blaser, 2011). These mutualistic interactions are a result of highly evolved and complex host-microbe signalling networks which affect neurological, cell proliferative, and differentiation pathways (Israel et al., 2001; Ley et al., 2008). This population of microorganisms is in a state of flux over the course of our lives and is shaped by each individual in a unique pattern both locally (epithelial interactions) and over long distances (hormonal or bacterial metabolites) (Nicholson et al., 2012; Vijay-Kumar et al., 2010). The members of these populations also regulate each other's growth and metabolism, resulting in a complex web of antagonistic interactions between mutualist, passenger, and pathogen (Faust et al., 2012; Kamada et al., 2012; Lupp et al., 2007). A detailed example of this was demonstrated where nutrient competition over the same carbohydrate substrate exists between the murine intestinal pathogen Citrobacter and other commensal bacteria (Kamada et al., 2012). Given observations that germ-free animals were unable to clear the infection, this may be an evolved mechanism for pathogen removal. Furthermore, these microbial populations interact at the genetic level, where gene exchange amongst them occurs frequently (Smillie et al., 2011). A clinically impactful example of this exchange is antibiotic resistance gene transfer from commensal to pathogen (Sommer et al., 2010; van Reenen and Dicks, 2011).
In a remarkable example of host-microbe coevolution, it has also been shown that genes involved in the metabolism of algal carbohydrates were passed from bacteria native to seaweed to the intestinal bacteria of Japanese consumers (Hehemann et al., 2010). In a harmonious ecological context, this complex synchronized symphony is central to the definition of what it is to be human. International collaborative microbiome projects have for the past decade undertaken the task of sequencing this metagenome in order to identify core microbial species common to all human beings (Qin et al., 2010; Turnbaugh et al., 2007). The majority of studies have found that, although there is conservation at higher taxonomic levels as well as in the protein-encoded functions, inter-person variation at the genus or species level is tremendous (Qin et al., 2010). The largest collaborative effort to date, the Human Microbiome Project, found that among 242 healthy adults there was a surprising diversity of taxa with high inter-person diversity (HMPC, 2012). Samples from each of the body habitats were more similar to each other than to samples from other sites and, although no taxon was universally conserved, there were several families that were highly prevalent across the human body, such as Streptococcaceae, Prevotellaceae, and Pasteurellaceae. Although this project was large in scope, like most other microbiome studies, it did not control for the lifestyles of the participants. Recognizing that these factors can have profound effects on the microbiome, one study controlled the diet of six volunteers and subsequently analyzed the bacterial as well as the viral component of stool samples (Minot et al., 2011). Interestingly, they found an increasing homogeneity of the bacterial and viral populations over the four time points, indicating an adaptation of the microbiome to each subject's diet.
The Female Genital Tract Microbiome

The human vagina, or the lower female genital tract, is a remarkable example of this mutualistic host-microbe balance. Hormonal fluctuations during the menstrual cycle create a dynamic and complex environment (Witkin and Ledger, 2012). Although these radical changes in pH and available nutrients cause microbial growth and abundance to vary throughout the menstrual cycle, the population is ultimately maintained at menses (Gajer et al., 2012; Linhares et al., 2011). The vaginal environment is relatively hypoxic and, as a result, both host epithelial cells and existing bacteria undergo anaerobic glycolysis. High oestrogen levels cause vaginal epithelial cells to produce the substrate for glycolysis in the form of glycogen, which bacteria quickly metabolize to produce high concentrations of lactic acid and the characteristic decrease in pH (3.5-4.5). This acidic environment is not seen in other closely related mammals, possibly indicating a crucial evolutionary requirement specific to human development and reproduction (Mirmonsef et al., 2012).

Lactobacilli have long been seen as a keystone species of the healthy vaginal microbiome. As implied by their name, these bacteria produce much of the lactic acid during glycolysis (Boskey et al., 2001). This lactic acid is one of the protective roles vaginal microbes provide to inhibit the outgrowth of opportunistic or pathogenic microorganisms. In fact, lactic acid seems to be more detrimental to microbial growth than acidity alone (O'Hanlon et al., 2011). Lactobacilli also produce other antimicrobial compounds, including broad-spectrum defenses such as hydrogen peroxide and more specific antimicrobial compounds collectively termed bacteriocins (Aroutcheva et al., 2001). As detection methods and subject populations have expanded, we now understand that the once universal L. acidophilus/L. crispatus dominant population consists of multiple species including, but not limited to, L. iners, L. gasseri, L. vaginalis, and L. jensenii (Antonio et al., 1999). In one of the largest vaginal microbiome studies to date, it was found that the vaginal microbiome differs from all other anatomical sites of the human body, in that 289/396 samples were dominated by this single genus, which frequently accounted for 50% of the total sequences recovered (Ravel et al., 2011). However, the study also revealed that in the remaining subjects, the Lactobacillus genus was more of a minor component of the microbial population (Ravel, 2010). These 107 women instead had a population dominated by Prevotella spp., Gardnerella spp., and Atopobium spp. and were more likely to be of African American or Mexican ancestry. Although only speculative, the authors suggest that these ethnic discrepancies could be determined by genetic differences between hosts, such as immune-mediated differences, ligand polymorphisms, or vaginal secretion differences. Environmental influences such as lifestyle differences were not taken into consideration. These studies challenge the viewpoint that a “normal” and “healthy” vagina is composed of lactobacilli at a specific pH. Other research supporting these results suggests that these non-traditional bacteria could also maintain a healthy vaginal environment, as various bacteria including Atopobium, Streptococcus, and Staphylococcus are capable of lactic acid production and therefore, like the microbiome in the gut, the function of the microorganisms is conserved (Gajer et al., 2012; Rodriguez Jovita et al., 1999). Although these results may just reflect a temporal shift in microbial populations, gastrointestinal microbiome studies have also demonstrated large differences between samples, where those with similar profiles are grouped into 'enterotypes' (Arumugam et al., 2011).
These results may suggest that the vaginal microbiome should be subclassified in a similar manner (Koren et al., 2013).

Very few studies have attempted to identify microorganisms in the upper genital tract of healthy women due to the invasiveness of the procedure. However, various women have undergone tubal ligation as a risk-reducing surgical procedure. Of these samples, none have been analyzed using a sequencing-based approach. Although the prevailing dogma is that the upper genital tract is sterile (Ness et al., 2004; Sweet, 2009), there have been a limited number of cases where bacteria have been isolated from control subjects in studies of infected upper genital tracts (Duff et al., 1983; Heinonen et al., 1985). Furthermore, diagnosing upper genital tract infections is difficult and, depending on the method used, relatively inaccurate (Maleckiene et al., 2009; Simms et al., 2003). This is predominantly because the symptoms associated with upper genital tract infections can be due to unrelated gastrointestinal pathologies (Simms et al., 2003). Interestingly, among the studies isolating bacteria from the upper genital tracts of patients, there are various women who fail to present any of the diagnostic criteria indicative of an infection, yet still harbor a diverse upper genital tract microbial population (Ness et al., 2002; Sweet, 2009). Furthermore, toll-like receptors, innate immune receptors that recognize common bacterial components, are expressed at relatively high levels in the upper genital tract as well as the ovary (Zhou et al., 2009). This may indicate that contact with microorganisms is not a rare occurrence and that the upper genital tract quite possibly contains a microbiome of its own.

The Microbiome in Disease

A key example of the importance of microbiomes to human biology comes from observing the effects of their absence in animal models.
Germ-free mice have long been known to display altered immune development and to have an increased susceptibility to various diseases. Such models provide crucial insights into a variety of questions regarding the microbiome. One such study compared germ-free mice that were later colonized by either a human-derived or a murine-derived microbiota (Chung et al., 2012). Only the mice inoculated with a native microbiota developed normally, whereas the immune development of those with a human-derived microbiota resembled that of a germ-free mouse. This suggests that our microbiomes are a part of our evolution and that each host has selected for a unique population to carry out the crucial functions imparted to us.

Equally important are studies concerning dysbiosis, the compositional alteration of the microbiome. Dysbiotic diseases include antibiotic-associated diarrhea (Young and Schmidt, 2004), celiac disease (De Palma et al., 2010), colorectal cancer (Castellarin et al., 2012; Scanlan et al., 2008), cystic fibrosis (van der Gast et al., 2011), esophageal disease (Pei et al., 2005), inflammatory bowel diseases (IBD) (Frank et al., 2007; Willing et al., 2009), irritable bowel syndrome (Codling et al., 2010), necrotizing enterocolitis (Wang et al., 2009), non-bacterial prostatitis (Krieger and Riley, 2002), obesity (Zhang et al., 2009), and bacterial vaginosis (Fredricks et al., 2005). In many of these diseases, it is not known whether these microbial shifts are a precursor or a result of the disease itself (Frank et al., 2011b). This is of paramount importance, as antibiotic treatment may not be necessary, and may possibly even be harmful, in the latter case. Because most microbiome studies focus on the gut microbiome, the most well studied dysbiotic disease is inflammatory bowel disease (IBD).
IBD is actually a combination of two diseases: Crohn’s disease, which is a general gastrointestinal disorder, and ulcerative colitis, which affects the colonic mucosa (Kaser et al., 2010). Each of these diseases is heavily associated with a shift in the gut microbiome. IBD research is strengthened immensely by the existence of murine models of this disease and, furthermore, by the fact that a mouse microbiome can be humanized (Turnbaugh et al., 2009b). Also, genome-wide association studies complemented by these mouse models have identified genetic risk factors including NOD2 (binds microbial molecules) and ATG16L1 (involved in autophagy) (Frank et al., 2007). Studies integrating these genetic risk factors with each patient's microbiome have found significant associations between these genes and altered commensal populations. However, these studies also show that there were populations of bacteria in these patients that were associated with the disease phenotype but not with the genetic risk factor, indicating that other genetic determinants or microorganisms are involved (Frank et al., 2011a). Studies in gnotobiotic mice have highlighted bacterial species that can induce or prevent IBD (Bloom et al., 2011). Fecal transfer is a technique for transferring an entire microbiome to a naive recipient. Such studies in mice have found that colitis and other dysbiotic symptoms, like metabolic dysfunction, can effectively be transferred to a naive mouse (Garrett et al., 2010; Vijay-Kumar et al., 2010). Conversely, there have been many reports of dysbiotic diseases being treated by fecal transfers (Borody et al., 2004; van Nood et al., 2013). The question was then raised as to why this disease with a genetic component has its onset primarily between 15 and 30 years of age and not in younger age groups (Kaser et al., 2010). Recent evidence suggests that environmental stimuli, such as gastrointestinal infections, also play a significant role.
For instance, it was shown in a mouse model of Crohn’s disease that a viral infection was required alongside the genetic risk factor to develop the disease phenotype (Cadwell et al., 2010). More detailed research has shown that Toxoplasma gondii infections in mice lead to a compromised intestinal epithelium, allowing commensal bacteria to move into peripheral sites including the mesenteric lymph nodes, liver and spleen (Hand et al., 2012). This leads to immune activation by TH1 cells followed by differentiation into memory T cells, both specific to commensal bacterial antigens. Supporting evidence also comes from the identification of antibodies specific to commensal microbiota in healthy human serum (Haas et al., 2011a). The above evidence suggests that dysbiosis is a primary etiologic factor in the pathogenesis of IBD; however, additional prospective or longitudinal studies are required to confirm these observations. Bacterial vaginosis (BV) is a classic example of such a shift in microbial population. Although described more thoroughly below, BV is briefly defined as a loss of H2O2-producing lactobacilli and the establishment of a diverse population of predominantly gram-negative or gram-variable species. Studies of BV have been limited to characterizing the microbial portion of the disease. Although this is vital to the understanding of BV, given that the host has a significant role in microbiome development and tolerance, it would be surprising if no host determinants were involved in the pathogenesis of this disease. Although there have been no GWAS in BV, host susceptibility genes have been suggested, such as Toll-Like Receptor 4 (Genc et al., 2004).
It is intriguing to suggest that a single organism or possible combination of organisms initiates the onset of this dysbiotic disease; however, without GWAS or carefully designed prospective longitudinal studies addressing why these microorganisms are in flux, and whether this flux is a prerequisite for or an aftermath of an unknown variable, the etiology of BV may remain an enigma. Insights from metagenomic studies of the oral dysbiotic disease periodontitis have revealed the importance of low-abundance microorganisms in disease progression (Hajishengallis et al., 2011). One such species, Porphyromonas gingivalis, was shown to drastically alter the oral microbial population as a result of its ability to initiate inflammation through complement-dependent pathways. Interestingly, in addition to the complement pathway, the surrounding commensal microorganisms were also required for disease progression. This phenomenon has also been seen in other gastrointestinal inflammatory diseases, where the commensal species Bacteroides fragilis, which comprises roughly 1-2% of the microbial population, has profound effects on human physiology and cancer development due to host-pathway manipulations (Holton, 2008). These and other findings have given rise to the "keystone-pathogen" hypothesis, which states that there exists a subset of complex diseases that are initiated by low-abundance microorganisms (Hajishengallis et al., 2012). In contrast to other pathogens that become established as the dominant population at the onset of disease, keystone pathogens instead support and stabilize the dysbiotic microbiome while remaining at low concentrations.

Cancer

Cancers are a collection of complex diseases triggered by a series of genetic abnormalities which ultimately lead to uncontrolled cellular proliferation.
These diseases are researched largely independently of the microbiome despite the crucial role these microorganisms play in cell regulatory pathways, immunity, and metabolism. Furthermore, it is estimated that roughly 20% of cancers have an infectious etiology (Javier and Butel, 2008). For cancers like colorectal carcinomas (CRC), which develop within the context of a microbial milieu, it would be surprising if this population had no effect on the carcinogenic potential of the colonic epithelial cells. In fact, research over the past decade has identified a member of our commensal intestinal microbiome, Bacteroides fragilis, which can promote tumorigenesis in mice through the activation of signal transducer and activator of transcription 3 (STAT3), which is essential for the differentiation of T helper type 17 (TH17) cells (Wu et al., 2009). By producing an enterotoxin, B. fragilis can also stimulate the cleavage of E-cadherin, which then leads to MYC expression and persistent proliferation of human colonic epithelial cells (Wu et al., 2003). In 2008, multiple groups began to characterize the microbiomes of CRC patients and contrast them with healthy tissue from the same patient or from CRC (-) patients (Scanlan et al., 2008; Sobhani et al., 2011). Initially, members of taxa such as Coriobacteridae, Roseburia, Fusobacterium, and Faecalibacterium were found to be enriched in CRC patients. However, in 2012, two independent studies using next generation sequencing found that Fusobacterium nucleatum is enriched in CRC tissue relative to neighboring non-cancerous colorectal tissue (Castellarin et al., 2012; Kostic et al., 2012). One group also found F. nucleatum to be capable of cell invasion in a CRC cell line. Because each of these groups used a sequencing-based approach, both also found that a large population of unrelated microorganisms was enriched in CRC in addition to F. nucleatum.
These same results were obtained in an earlier study on the CRC microbiome (Sobhani et al., 2011). Although the increase in the concentration of F. nucleatum is intriguing, it is not grounds for inferring causality in carcinogenesis. More research, including animal models and large sample screenings of microbial recovery and serotyping, could facilitate a deeper understanding of this question. However, in light of the other microbial populations elevated in CRC tissue, alternative theories have emerged. Perhaps this species is simply the most successful in the altered microenvironment created by the tumour. As the microbiome can impact cell growth regulation and ultimately cancer development, this might suggest that a group or individual “driver species” possibly initiated the carcinogenic transformation but was later outcompeted by other invasive “passenger” bacteria better adapted to this new environment (Hajishengallis et al., 2012). As cancer development is cumulative over multiple years, the microbial populations would therefore be expected to reflect the altered environment. During cellular transformation, the immediate as well as the surrounding environment undergo significant changes. As the tumour grows in size it requires an ample supply of nutrients and therefore promotes processes such as angiogenesis. This altered blood flow, together with the increased size of the tumour mass, invariably creates regions with altered oxygen concentrations and may therefore promote the growth of a more anaerobic population (Khan et al., 2012a). The metabolism of a cancer cell also frequently shifts from oxidative phosphorylation to glycolysis. This shift coincides with the release of different secondary metabolites and provides a different nutrient profile on which the surrounding microbial population can feed (Hirayama et al., 2009).
For instance, metabolomics of colorectal cancer samples have revealed differences in lactate, pH, lipids, and fatty acids when compared to a healthy colon (Righi et al., 2009). Microorganisms also require a substrate to bind to in order to remain in the specific niche to which they have evolved. Tumour cells frequently have a different profile of cell surface glycoconjugates and therefore may select for different bacteria (Ogata et al., 1976; Yip et al., 2006). Furthermore, the hormonal fluctuations frequently observed in multiple cancers also accompany a shift in microbial populations, as demonstrated by the altered microbial populations of ovariectomized rats (Bezirtzoglou et al., 2008). Ovarian cancer may in fact be an example of such a disease.

Study Aim #1: Ovarian Cancer

Ovarian cancer is the 16th most prevalent cancer in Canada, with an estimated 2,600 cases diagnosed in 2012 (Canadian Cancer Society, 2012). It is the leading cause of death from gynecologic malignancies and remains the fifth leading cause of cancer-related death among women (Siegel et al., 2012). Due to the relatively small size of the ovary (2-4 cm in diameter), most symptoms have their onset only when the cancer has metastasized to other regions of the body. As a result, roughly 75% of ovarian cancers are diagnosed in these later stages (Cho and Shih Ie, 2009). When ovarian cancer remains confined to the ovary (stage 1), the 5 year survival rate exceeds 90%, whereas when diagnosed at later stages, this rate drops to 30% (Siegel et al., 2012). Genomic and morphological studies over the past decade have reclassified epithelial ovarian cancers (EOC) not as a single disease, but rather as multiple distinct subtypes (Salvador et al., 2009; Shah et al., 2009; Vaughan et al., 2011; Wiegand et al., 2010). The WHO recognizes 8 histological subtypes, each differing in drug response, propensity to metastasize, prognosis, and site of origin (Karst and Drapkin, 2010).
The four most clinically significant subtypes, high-grade serous (HGS), clear-cell (CC), endometrioid (En), and mucinous (Mu) tumours, collectively make up more than 90% of all EOCs (Cho and Shih Ie, 2009). Seventy percent of all EOCs are classified under the HGS histology and account for roughly two-thirds of all EOC-related deaths (Siegel et al., 2012). CC, En, and Mu account for roughly 12%, 5%, and 3-4%, respectively, of EOC (Wiegand et al., 2010). Although largely treated as a single disease, these subtypes have been shown to be entirely separate diseases, arising from distinct mutations in separate cells of origin. Histologists have long noted the resemblance of these EOCs to non-ovarian tissues, but the importance of these observations was not fully appreciated until it was shown that these resemblances were also present in their individual gene expression profiles (Marquez et al., 2005). For instance, the genes expressed in HGS were more closely related to those of the fallopian tube, both CC and En resembled endometrial tissue, and Mu resembled colonic tissue. Since then, CC and En tumours have been shown to be associated with endometriosis, the displacement of the endometrium to distant sites of the body (Wiegand et al., 2010). A significant percentage of Mu tumours have been shown to be metastases deriving from gastrointestinal tumours (Kelemen and Kobel, 2011). Finally, due to common mutations and their co-occurrence with tubal intraepithelial carcinomas, HGS carcinomas are thought to arise as metastases from the fallopian tube (Lee et al., 2007). It was also recently reported that, in a humanized murine model, HGS arose from the fallopian tube (Kim et al., 2012). Both genomic and transcriptomic studies have identified distinct mutations in each of these EOCs. CC and En have both been associated with a loss of BAF250a, a protein involved in chromatin modification (Wiegand et al., 2010).
This analysis found that 34/82 CC and 62/130 En cases were positive for mutations in ARID1A, the gene that encodes BAF250a. Studies of HGS have reported that over 94% of cases have mutations in TP53, indicating that this may occur early in the carcinogenic process (Ahmed et al., 2010). Inflammation has long been recognized as a risk factor in ovarian cancer development (Coward et al., 2011; White et al., 2012). Epidemiologic investigations have uncovered multiple risk factors under the umbrella of inflammation. One of the leading proposed risk factors for the development of ovarian cancer is based upon observations that both oral contraceptive use and multiparity are associated with a decreased risk of EOC (Adami et al., 1994; Weiss, 1988). This is presumably due to the decrease in the frequency of damage to, and subsequent repair of, the ovarian epithelium during ovulation. This cyclic damage to the epithelial cells of ovarian tissue can introduce inflammatory cytokines, microorganisms, and their byproducts, which can ultimately result in chronic inflammation (Modan et al., 2001). Feminine hygiene products using talcum powder have also been shown to be a risk factor due to their pro-inflammatory nature (Merritt et al., 2008). Repeated acute or chronic inflammatory infections are also an aspect of cancer development that warrants more attention (Samaras et al., 2010). The upper genital tract is not immune to infection; infections of the uterus, fallopian tubes, and ovaries are collectively classified as pelvic inflammatory disease (PID). This polymicrobial disease results in an acute as well as chronic inflammatory infection (Wiesenfeld et al., 2002). Repeated increases of inflammation in the immediate environment may lead to the production of cytotoxic substances and therefore increase the likelihood of DNA damage.
Furthermore, vaginal infections contribute to the inflammatory state of the upper genital tract; infections such as BV increase the likelihood of PID and assist in the movement of microbial or host-derived inflammatory substances as a result of retrograde menstruation (Ness et al., 2004; Wiesenfeld et al., 2002). In fact, several studies have found that BV-associated microorganisms account for roughly two-thirds of the bacteria that are responsible for PID (Haggerty et al., 2004; Ness et al., 2002). Various retrospective studies diagnosing PID in asymptomatic women also suggest that subclinical PID is at least as prevalent as clinically observed PID and that these subclinical forms are frequently associated with long-term health effects (Wiesenfeld et al., 2002). Microorganisms associated with PID, including Chlamydia trachomatis and Mycoplasma genitalium, have been associated with and in some cases recovered from ovarian cancer tissue (Chan et al., 1996; Ness et al., 2003). Recently, in the largest study of its kind to date, Lin et al. found an almost two-fold increased risk of ovarian cancer development in women with a history of PID (Lin et al., 2011). This risk was apparent even when the ovarian cancers were not separated into their respective subtypes. However, this study has been met with skepticism, mostly concerning the choice of IOCD code 917, which classifies pelvic pain as a case of PID despite the fact that this pain may be due to other non-infectious pathologies. This is not the first reported relationship between OC and PID: salpingitis, a subcomponent of PID defined as inflammation of the fallopian tubes, was apparent in 53% of ovarian cancers examined (Seidman et al., 2002).

Hypothesis #1

I hypothesize that a metagenomic approach, in which ovarian cancer tissue samples are tested using a panviral microarray and high-throughput sequencing, will identify microorganisms associated with ovarian cancer.
Study Aim #2: Bacterial Vaginosis

Bacterial vaginosis (BV) is traditionally defined by a reduction in the vaginal H2O2-producing lactobacilli populations and an increase in the number of anaerobic bacteria including but not limited to Gardnerella vaginalis, Atopobium spp., Mobiluncus spp., and Prevotella spp. (Ma et al., 2012). BV is the most common vaginal condition in reproductive-aged women, affecting roughly 24-30% of women in sexually transmitted disease clinic populations. According to the Centers for Disease Control and Prevention, the number of women between 14–49 years of age with BV is estimated to be 21.2 million (29.2%) (Centers for Disease Control and Prevention (U.S.), 2004). Factors contributing to the development of BV include high-risk sexual activities, vaginal douching, smoking, and menstrual blood (Ma et al., 2012). Adding to the clinical impact of BV itself, this disease is associated with adverse reproductive sequelae such as preterm birth as well as an increased risk of acquiring sexually transmitted diseases including human immunodeficiency virus (HIV), HPV, and human herpesviruses, and of developing PID (Allsworth et al., 2008; Cherpes et al., 2003; Martin et al., 1999; Ness et al., 2005).

BV Diagnosis

Currently, BV can be diagnosed microscopically with the Nugent score or using a combination of tests collectively called Amsel’s criteria. The Nugent score is based upon Gram stain results from vaginal swab smears, where a value is calculated from the relative abundance of Lactobacillus spp. to other gram-negative rods, gram-variable rods, and coccobacilli. Decreasing counts of the large gram-positive rods representative of Lactobacillus morphotypes are scored as 0 to 4, increasing counts of small gram-variable rods possibly indicating G. vaginalis morphotypes are again scored as 0 to 4, and curved gram-variable rods indicative of Mobiluncus spp. morphotypes are scored as 0 to 2 (Nugent et al., 1991).
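The arithmetic of the Nugent system can be illustrated with a minimal Python sketch. It assumes that the three morphotype scores have already been assigned by the person reading the Gram stain, since the count-to-score mapping is not reproduced in this section, and it applies the interpretation bands of the Nugent system (0-3 healthy, 4-6 intermediate, 7-10 consistent with BV). The names NugentComponents, nugent_total, and interpret_nugent are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NugentComponents:
    """Per-morphotype Gram stain scores, already assigned by the reader."""
    lactobacillus: int  # 0-4, higher means fewer large gram-positive rods
    gardnerella: int    # 0-4, higher means more small gram-variable rods
    mobiluncus: int     # 0-2, higher means more curved gram-variable rods


def nugent_total(scores: NugentComponents) -> int:
    """Range-check the three component scores and return their sum (0-10)."""
    limits = {"lactobacillus": 4, "gardnerella": 4, "mobiluncus": 2}
    for name, upper in limits.items():
        value = getattr(scores, name)
        if not 0 <= value <= upper:
            raise ValueError(f"{name} score {value} is outside 0-{upper}")
    return scores.lactobacillus + scores.gardnerella + scores.mobiluncus


def interpret_nugent(total: int) -> str:
    """Map a 0-10 total onto the three interpretation bands."""
    if not 0 <= total <= 10:
        raise ValueError(f"total {total} is outside 0-10")
    if total >= 7:
        return "consistent with BV"
    if total >= 4:
        return "intermediate"
    return "healthy"


if __name__ == "__main__":
    example = NugentComponents(lactobacillus=3, gardnerella=3, mobiluncus=1)
    total = nugent_total(example)
    print(total, interpret_nugent(total))
```

Keeping the component scores separate from the total makes it visible which morphotype drives a given result, which the summed score alone does not show.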
The component scores are totalled on a 0-10 scale, where 7-10 is consistent with BV, 4-6 is intermediate, and 0-3 is healthy. One of the benefits of this quantitative approach is that it can be utilized in diagnosis to account for differences between patients. Amsel’s criteria encompass a wide range of observations including the presence of homogeneous vaginal discharge, a pH greater than 4.5, microscopic counts of epithelial cells studded with bacteria (clue cells), and a fishy odour upon addition of 10% KOH to a vaginal swab (Amsel et al., 1983). When three out of four conditions are met, a positive diagnosis is made. In contrast to the Nugent score, the Amsel test is not quantitative and is therefore more dependent on the interpretation of the observer and, as such, more subjective. In a literature review, the Amsel test was shown to have 69% sensitivity and 94% specificity, whereas the Nugent score can reach sensitivities above 89% and specificities of 83% (Koumans et al., 2007; Schwebke et al., 1996). Like all diagnostic tests, however, these systems have their flaws. Differentiating between Lactobacillus spp. and BV-associated bacteria (BVAB) can be difficult when relying simply on a Gram stain. Furthermore, it is known that the vaginal microbiome can change significantly depending on the lifestyle of the individual as well as the stage of the menstrual cycle (Jespers et al., 2012; Morison et al., 2005). Therefore, depending on when the sample is taken, the Nugent score may not reflect the patient's "normal" microbial population. For example, a study that sampled women 32 times over 16 weeks showed that all but 9 patients had significant Nugent score fluctuations, which were found to correlate most with the stage of the menstrual cycle, even more so than with sexual activity (Brotman et al., 2010).
The authors concluded that single-sample characterizations of vaginal microbiota may miss this dynamic nature and that women may have sporadic and relatively short episodes of BV without apparent symptoms. In fact, the CDC has found that 84% of women diagnosed with BV report having no symptoms, and this could alter both clinical presentation and Amsel’s criteria-based diagnosis (Koumans et al., 2007). Amine-metabolite-producing bacteria responsible for the malodour have also been found in both BV(-) and BV(+) cohorts (Kubota et al., 1995; Verhelst et al., 2004). This deficiency in diagnosis impacts the populations available for comparison as well as treatment efficacy. For instance, antibiotics used for treating BV, such as metronidazole and clindamycin, can be effective initially, although relapses are common, possibly as a result of biofilm formation and the failure to re-establish a normal microbiome (Bradshaw et al., 2006). Nevertheless, these diagnostic tests suffice and remain the most efficient diagnostic tools available in a clinical setting.

Metagenomics and BV Biology

The Nugent score and Amsel's criteria were established when culture-based microbial identification was the only option for characterizing the vaginal microbiome. Since then, however, there has been an explosion in the types of sequence-based technologies that offer higher sensitivities with broad-ranging specificities. Metagenomics has shown that only 20-60% of the microorganisms identified on and in the human body have been cultured (Peterson et al., 2009). The majority of metagenomic studies characterizing the human microbiome have been "targeted", where a conserved gene common to the vast majority of the organisms in a microbiome, such as 16S ribosomal RNA (rRNA) or Chaperonin60 (cpn60), is sequenced (Rappe and Giovannoni, 2003; Schellenberg et al., 2009).
Each read is then compared to a database of previously sequenced samples, and based on their nucleotide or amino acid similarity, the organisms are taxonomically classified. Various studies also employ "operational taxonomic unit" classification, where the generated amplicons are binned according to a percent similarity cutoff or prior clustering analyses in order to assess species abundances independent of the potential limitations of the reference database (Schloss and Handelsman, 2005). Such studies have identified bacteria associated with BV that have no close relatives in the respective database and are arbitrarily labelled BVAB1, etc. (Smith, 2005). These studies were initially applied to BV due to the aforementioned difficulties in diagnosis and treatment. Complicating the diagnosis is the knowledge that bacteria previously thought to be indicative of BV, such as G. vaginalis and Prevotella spp., are also commonly isolated from healthy women (Hill et al., 2005; Ravel et al., 2011; Tosun et al., 2007). As mentioned above, Ravel et al. have shown that these microorganisms exist in the healthy vagina of specific ethnic groups in relatively high proportions. Since there is only limited knowledge concerning the natural variation of the microbiome in general, this study, like many others, raises questions regarding how a healthy population is defined (Costello et al., 2009). With regard to the vaginal microbiome, these results could possibly be explained by differences in menstrual cycles, ethnicity, or sexual behaviour, or these women could have been in the early stages of BV development (Ma et al., 2012). Due to the enhanced taxonomic resolution provided by metagenomics, recent research has identified multiple subtypes of G. vaginalis within the vagina (Paramel Jayaprakash et al., 2012). It was found that specific subtypes were more associated with BV than others, as determined by the expression of virulence genes such as sialidase and vaginolysin.
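The operational taxonomic unit binning described earlier in this section can be illustrated with a minimal Python sketch. It is not the method of any cited study: it assumes reads that are already aligned to equal length, uses a simple greedy strategy in which each read joins the first bin whose founding read it matches at or above the cutoff, and defaults to a 97% cutoff purely as an illustrative assumption. The function names percent_identity and bin_into_otus are hypothetical.

```python
from typing import List


def percent_identity(a: str, b: str) -> float:
    """Fraction of matching positions between two pre-aligned, equal-length reads."""
    if not a or len(a) != len(b):
        raise ValueError("reads must be non-empty and aligned to the same length")
    matches = sum(x == y for x, y in zip(a, b))
    return matches / len(a)


def bin_into_otus(reads: List[str], cutoff: float = 0.97) -> List[List[str]]:
    """Greedy binning by similarity to each bin's founding (seed) read.

    The 0.97 default is an assumption for illustration, not a value
    taken from the text.
    """
    otus: List[List[str]] = []  # each bin is a list whose first item is its seed
    for read in reads:
        for otu in otus:
            if percent_identity(read, otu[0]) >= cutoff:
                otu.append(read)
                break
        else:
            # No existing seed was similar enough, so this read founds a new bin.
            otus.append([read])
    return otus


if __name__ == "__main__":
    toy_reads = ["AAAAAAAAAA", "AAAAAAAAAT", "TTTTTTTTTT"]
    for i, otu in enumerate(bin_into_otus(toy_reads, cutoff=0.8), start=1):
        print(f"OTU{i}: {len(otu)} read(s)")
```

Because bins are formed from the reads themselves, a sequence with no close relative in a reference database still receives a bin and an abundance, which is how placeholder labels of the BVAB kind can arise. Greedy seeding depends on read order, so dedicated clustering tools would be expected to behave differently.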
These subtypes were initially missed as some grow only in anaerobic conditions and not in 7% CO2, which is the recommended growth condition for G. vaginalis. Therefore, it seems that the mere presence or absence of these microorganisms is not in itself indicative of BV, which may instead depend on the metabolic potential inherent to the microorganisms present. This may complicate microbiome studies because, in addition to the difficulties in strain or genotype differentiation with most amplicon studies, the inflammatory potential of such strains is not truly revealed until cultivation is possible. In one of the largest 16S rRNA profiling studies, of 220 women, 98 were positive for BV according to Amsel's criteria and the Nugent score (Srinivasan et al., 2012). The rRNA profiles were consistent with the clinical diagnoses, as 95% of the sequence reads from women with BV were dominated by BVAB1, BVAB2, Megasphaera spp., L. amnionii, S. sanguinegens, G. vaginalis, and Atopobium vaginae. Only five out of 122 women without BV displayed a microbial profile indicative of BV. The field of microbiome research is relatively young and there is only limited knowledge concerning the influence or the extent of its natural variation, the impact of low-abundance species, and many other variables. Likewise, much of BV research is still in the exploratory phase, where variables influencing its aetiology are still being collected. Although still preliminary, relatively small prospective studies have suggested that BVAB are acquired from extravaginal sources, as women who developed BV frequently had G. vaginalis and other BV-associated bacteria isolated from oral and anal swabs (Marrazzo et al., 2012). As exploratory results are generated from metagenomics, more precise questions and cohorts can be screened with quantitative methods such as qPCR in order to more astutely address the complexities of BV.
All of the hypotheses stated above concerning the relationship between the microbiome and human health will ultimately depend upon carefully designed prospective longitudinal studies large enough to encompass the diversity that is thought to exist. These in turn will require the collection of metadata on each subject's diet, lifestyle, illnesses, prescription drugs, and other variables thought to play roles in BV development.

The Relationship of BV and HIV

HIV infection can significantly affect many other diseases due to its effects on the host immune system. Studies of HIV(+) women in sub-Saharan Africa have revealed that a significantly high percentage of these women are also positive for BV (Myer et al., 2005; Taha et al., 2007). Recent evidence suggests that BV(+) women are more likely to transmit HIV to a new partner as well as to a newborn child (Cohen et al., 2012; Farquhar et al., 2010). One recent study of BV(+) women infected with HIV in sub-Saharan Africa found that Prevotella bivia and the order Clostridiales were significantly elevated (Hummelen et al., 2010). Therefore, HIV(+) women represent an important BV subpopulation to study. Although the progression of HIV(+) to AIDS correlates with the loss of CD4+ T cells, HIV RNA concentration in the blood, and systemic immune activation, the precise mechanism has remained elusive (Forsman and Weiss, 2008; Klatt and Silvestri, 2012). Natural hosts for simian immunodeficiency virus (SIV) such as African green monkeys frequently present with high-level viremia but do not develop AIDS, whereas rhesus monkeys infected with SIV progress with a phenotype similar to that observed in humans (Brenchley et al., 2010). One of the strongest correlates of disease progression and mortality is the degree to which the immune system is activated (Kuller et al., 2008; Salazar-Gonzalez et al., 1998). The extent of this immune activation does not seem to be explained simply by HIV itself.
Although the destabilization of the immune system is a factor in progression to AIDS, an emerging hypothesis posits that the translocation of antigens into the peritoneal cavity as a result of intestinal epithelial damage leads to systemic immune activation, increased HIV replication, and finally the progression to AIDS (Brenchley et al., 2006; Klatt et al., 2013; Marchetti et al., 2008). Several independent groups have observed a significant increase in LPS levels in the plasma of SIV-infected Asian macaques and HIV-infected human patients (Klatt et al., 2013). To explore this concept further, microbiome studies examining 16S rDNA have sought to contrast the bacterial gut microbiome of HIV-infected patients with that of healthy patients, but no significant differences were identified (McKenna et al., 2008). In a recent study, however, the intestinal viromes of healthy and SIV(+) rhesus monkeys were compared instead (Handley et al., 2012). A substantial expansion of the enteric virome was seen in SIV(+) monkeys and, collectively, 32 previously undescribed viruses were recovered. Specifically, novel adenoviruses cultured from these animals were associated with enteritis, possibly indicating associations with AIDS enteropathy and progression. This suggests that the vaginal viral and bacterial microbiome may be similarly affected in patients with HIV. We can also speculate that similar mechanisms may play a role in the development of BV. Although we are beginning to catalogue the bacteria associated with BV, we still do not understand the precise mechanisms that lead to this dysbiosis. It has long been known that the composition of our microbiomes is largely influenced from birth, by breast feeding as well as diet (Gronlund et al., 1999; Harmsen et al., 2000; Koenig et al., 2011). For instance, upon oral delivery of lactobacilli, these bacteria readily colonize the vaginas of both human and animal models within days (Alander et al., 1999).
In fact, probiotics including L. rhamnosus str. GR-1 and L. reuteri RC-14 have been shown to ascend from the rectum to the vagina in healthy women (Reid et al., 2003). Although not seen in every case, this oral inoculation has been shown to decrease the overall frequency of future BV episodes (Morelli et al., 2004; Reid et al., 2003). The ability to colonize the vaginal microbiome is not unique to lactobacilli, as other residents of the gastrointestinal tract, including certain streptococci, also undergo this migration (Timmerman et al., 2004). Interestingly, BVAB such as G. vaginalis and Leptotrichia/Sneathia spp. have been recovered from the oral and gastrointestinal microbiome (Marrazzo et al., 2012). A recent prospective study found that G. vaginalis and Leptotrichia/Sneathia spp. were more commonly recovered from oral and rectal swabs among women who subsequently developed BV, suggesting an extravaginal source for BVAB and BV development. In light of observations that HIV prevalence is associated with an abnormal vaginal microbial ecosystem (Klatt et al., 2013; Mahal, 2013), one could speculate that the changes in the gut microbiome are related to the changes in the vaginal microbiome.

Hypothesis #2

The aims of this component of the study concerning the vaginal microbiome were to simultaneously identify viral agents of both eukaryotic and prokaryotic hosts as well as to characterize potential known and unknown bacteria in healthy as well as BV/HIV cohorts. I hypothesize that optimized metagenomic techniques will differentiate between the healthy vaginal microbiome and the vaginal microbiome associated with bacterial vaginosis.

Chapter 2 Ovarian Cancer

2.1 Ovarian Cancer Study Methods

The OC study was a component of a BC Cancer Agency OvCaRe project, and thus the methods for processing the samples were designed for gene expression analyses. Conversely, the methods used in the BV study were designed specifically to detect microorganisms, including viruses.
Ovarian Cancer Tissue Sampling

Ovarian carcinoma tissue was obtained from the OvCaRe (Ovarian Cancer Research) frozen tumour bank. Patients provided written informed consent for research using the tumour samples in this study. Separate approval was also obtained from the hospital's institutional review board permitting the use of these samples for RNA-sequencing experiments. All tumour samples were independently reviewed by a gynecologic pathologist to confirm the pathologic diagnosis prior to RNA sequencing. In cases where a discrepancy arose between this diagnosis and that of the source institution, the samples were reviewed by another gynecologic pathologist. Both pathologists were blinded to the results of the genomic studies.

Ethics

Patients were approached for consent prior to collection of tissue surplus to diagnostic requirements, in addition to a blood sample, to be used in a research ethics board (REB) approved research protocol. All patients were informed at the time of consent of the potential loss of confidentiality that may arise from the research. They were also informed that none of the study data would ever be entered into the clinical record or reported back to their care physicians. Tissue specimens were released to researchers only with an REB-approved study certificate through anonymized sample distribution. The primary datasets from the transcriptomes generated will not be released into the public domain; however, they could be made available through a tiered access mechanism to specific investigators, who will be required to honor the same ethical and privacy principles as the BCCA investigators.

Ovarian Cancer Transcriptome

Briefly, poly(A)+ RNA was purified from 5–10 μg of DNase I–treated total RNA using the MACS mRNA isolation kit (Miltenyi Biotec) as per the manufacturer's instructions.
Double-stranded cDNA was subsequently synthesized from the purified poly(A)+ RNA using the Superscript Double-Stranded cDNA Synthesis kit (Invitrogen) and random hexamer primers (Invitrogen) at a concentration of 5 μM. The cDNA was fragmented by sonication and a paired-end sequencing library was prepared following the Illumina paired-end library preparation protocol (Illumina).

Primers for the Virochip
Sol-Primer A (40 μM) = 5'-GTTTCCCACTGGAGGATA-N9-3'
Sol-Primer B (100 μM) = 5'-GTTTCCCACTGGAGGATA-3'

Table 2-1. Primers used for the Virochip in Round A and Round B and C protocols

For each sample, 300 ng of RNA was reverse transcribed (Round A). 2 µl of Sol-Primer A (Table 2-1) was added to 2 µl of sample and 6 µl of ddH2O. This mixture was incubated at 72˚C for 4 minutes to remove RNA secondary structures, followed by incubation at room temperature for 5 minutes. 1 µl of Superscript III reverse transcriptase, 2 µl of 12.5 mM dNTP, 2 µl of ddH2O, 4 µl of Superscript III First Strand Synthesis buffer, and 0.1 M dithiothreitol were then added to perform the reverse transcription (RT) reaction. RT was performed at 42˚C for 1 hour, followed by incubation at 94˚C for 2 min and cooling to 10˚C. Second strand synthesis was performed using Sequenase, a T7 bacteriophage DNA polymerase. The Sequenase reaction mix consisted of 7.7 µl ddH2O, 2.0 µl 10X Sequenase buffer, and 0.3 µl of Sequenase enzyme. The Sequenase reaction mix was added to the Round A reaction and incubated by ramping from 10˚C to 37˚C over 8 minutes, followed by holding at 37˚C for 8 min. Following denaturation of the dsDNA at 94˚C for 2 min, the Round A reaction was cooled to 10˚C, and 0.9 µl of diluted Sequenase buffer and 0.3 µl of Sequenase enzyme were added and incubated under the same ramping conditions as described above. The OC double-stranded cDNA was then amplified, targeting Sol-Primer A, with a KlenTaq master mix.
37 µl of ddH2O was added to 10X KlenTaq PCR buffer, 1 µL KlenTaq LA polymerase, 1 µL 12.5 mM dNTP, and 1 µL of 100 µM Sol-Primer B (Table 2-1). Sol-Primer B targets the complementary sequence on the 5' end of the Sol-Primer A adapters found on the OC cDNA. The PCR reaction was carried out using the parameters shown in Table 2-2:

94ºC for 4 min  20 cycles of:  68ºC for 5 min  94ºC for 30 s  50ºC for 1.0 min  68ºC for 1.0 min  68ºC for 2 min

Table 2-2. Round B PCR conditions for the Virochip

Round C was performed using the same master mix, except that an alternative dNTP mix was used in which aminoallyl dUTP was included, in addition to the standard deoxynucleotide triphosphates, for dye conjugation. The same Round B PCR conditions were used for Round C, except that the Round C amplification was run for only 15 cycles. The Round C amplified material was then purified with a Zymo DNA Clean and Concentrator kit (Zymo). Initially, 500 µl of DNA binding buffer was added to each 50 µl sample. The sample was then transferred to a Zymo spin column and centrifuged at 3000 rpm for 150 seconds. The flow-through was reloaded onto the spin column and centrifuged once more. After adding a discard collection tube, 200 µl of wash buffer was used to wash each sample. Samples were then spun at maximum speed for 30 seconds to remove the wash buffer. This step was then repeated. The column was transferred to a microcentrifuge collection tube and the sample was eluted in 9 µl of ddH2O by centrifugation at maximum speed for 30 seconds. For dye coupling, 1 µl of 1 M sodium bicarbonate, pH 10, was added to each sample. The Alexa 555 dye (Invitrogen) was reconstituted in 6 µl of DMSO and 1 µl of the dye was added to each sample. Samples were incubated at room temperature in the dark for 4 hours. In order to map all the spots on the Virochip, a control oligonucleotide (Spike 70) was added to each spot, in addition to each specific viral oligonucleotide, at the time of microarray printing.
When the dye-conjugated complement to Spike 70 (Probe 70) is hybridized to the Virochip, all of the spots can be mapped and identified. Probe 70 was diluted to 1 µg/µl at a volume equal to 0.5 µl x the number of samples. Alexa 647 dye was reconstituted in the same manner as Alexa 555, and 1 µl was added to the diluted Probe 70 for dye conjugation using the same incubation conditions. Each sample was then purified with the Zymo kit using 200 µl of DNA binding buffer. Following this binding step, the samples were purified as previously described. Each sample was eluted in 4 µl of ddH2O, whereas the Probe 70 sample was eluted in a volume equal to 3 x the number of samples plus 2 µl. The Virochip microarrays were printed at the Jack Bell Cancer Research Centre, Vancouver, BC. The oligonucleotides were printed onto epoxysilane-coated slides. Prior to sample hybridization, the arrays must be washed with ethanolamine (ETOH-A). The ETOH-A solution consisted of 50 mM ETOH-A, 1% SDS, and 0.1 M Tris pH 9. This solution was heated to 50˚C and the slides were incubated in it for 15 minutes. The slides were rinsed twice in ddH2O and centrifuged in a swinging bucket rotor at 650 rpm for 5 minutes. The arrays were then placed in a standard microarray hybridization chamber. The samples were denatured prior to hybridization by incubating each sample at 100˚C for at least 2 minutes. 9 µl of 65˚C hybridization solution (Glass Array Hybridization Buffer #1, Invitrogen) was then added to each sample, which was loaded onto a microarray. The hybridization chamber was sealed and each array was incubated in a 65˚C water bath overnight. Following this incubation, the microarrays were washed in a 65˚C 2X Saline-Sodium Citrate (SSC) / 0.2% SDS solution with manual agitation for 60 seconds. The arrays were then washed in a room temperature 2X SSC solution, and finally in a room temperature 0.2X SSC solution. The arrays were centrifuged again at 650 rpm for 5 minutes.
Each array was then scanned on an Agilent G2565CA microarray scanner.

PCR Amplification

Unless stated otherwise, all PCR reactions used the following master mix recipe: 12.5 µL of Quanta Accuprime PCR Master mix was added to 1 µl of 10 µM forward primer, 1 µl of 10 µM reverse primer, 10.5 µl ddH2O, and 1 µl template. The PCR conditions were 95˚C for 5 min; 35 cycles of 95˚C for 15 seconds, 50-65˚C (depending on the primer pair) for 30 seconds, and 68˚C for 1 minute; followed by 68˚C for 2 minutes.

Computer Programming and Multivariate Analysis

Python Programming

All programming was performed using Python. Two of the programs used in this study are a homopolymeric sequence filtration program and a paired-end lowest common ancestor selection program. These programs are shown in the appendix.

Multivariate Analyses

Hierarchical Clustering

Hierarchical clustering was performed using Cluster 3.0, designed by the Eisen laboratory at Stanford University (Eisen et al., 1998). Here, a Pearson correlation (Correlation Centered) was used for both species and samples in each sum normalized dataset. Images were produced using Java Treeview.

Non-Metric Multidimensional Scaling

Non-Metric Multidimensional Scaling (NMDS) was performed using R (R Core Team, 2012). The packages used for this analysis included Vegan, BiodiversityR, and Ecodist. Ecodist was used for Bray-Curtis distance measures and NMDS ordination. Ordiplot was used with both Vegan and BiodiversityR for plot visualizations.

2.2 Ovarian Cancer Study Results

All cancer tissue samples were obtained from the OvCaRe frozen tumour bank at the BC Cancer Agency following subtype diagnosis by a gynecologic pathologist. The samples used for the Virochip and initial virome screening, listed in Table 2-3, were VOA2, 149, 259, 365, 382, and 496.
The samples used for RNA sequencing consisted of five high-grade serous (HGS), five endometrioid (En), two clear-cell (CC), and two mucinous (Mu) ovarian carcinomas. An epithelioid sarcoma (ES) sample, a mesenchymal soft tissue tumour of an extremity, was used as a control from an alternative tissue type because healthy female upper genital tract tissue was not available for RNA sequencing at the time of the study. A lung cancer cell line was also used as a control for comparative purposes. For follow-up ribosomal RNA screening, all of the 60 samples are shown in Table 2-3.

VOA Sample Name  Ovarian Cancer Subtype  RNA conc. (ng/µL)
2  High-Grade Serous  276.36
11  High-Grade Serous  230.11
40  High-Grade Serous  193.01
56  High-Grade Serous  198.68
65  High-Grade Serous  259.46
65  High-Grade Serous  259.46
73  High-Grade Serous  178.16
80  High-Grade Serous  277.94
139  High-Grade Serous  461.53
146  Endometrioid  263
149  High-Grade Serous  209.93
156  Endometrioid  410
158  High-Grade Serous  166.62
190  High-Grade Serous  134.6
201  High-Grade Serous  167.11
209  High-Grade Serous  498.2
255  Endometrioid  206
259  High-Grade Serous  91
260  High-Grade Serous  314.71
286  Low-Grade Serous  60.1
307  High-Grade Serous  380.23
326  Clear cell  577
331  High-Grade Serous  364.25
337  High-Grade Serous  331.42
365  High-Grade Serous  147.08
372  High-Grade Serous  238.56
376  High-Grade Serous  149
379  High-Grade Serous  190.57
382  High-Grade Serous  168.81
382  High-Grade Serous  382
383  High-Grade Serous  269.44
385  High-Grade Serous  285.75
388  High-Grade Serous  235.85
389  High-Grade Serous  384.57
394  High-Grade Serous  198.01
409  High-Grade Serous  204.1
417  Clear cell  423
426  High-Grade Serous  451.39
445  High-Grade Serous  188.83
450  Clear cell  238
453  High-Grade Serous  160.23
480  High-Grade Serous  360.24
481  High-Grade Serous  144.53
483  Clear cell  236
488  High-Grade Serous  302.52
489  Clear cell  139
496  High-Grade Serous  367.26
497  High-Grade Serous  500.62
505  High-Grade Serous  202.34
523  High-Grade Serous  346.32
536  High-Grade Serous  732.9
544  High-Grade Serous  228.8
554  Endometrioid  497
555  Endometrioid  350
620  High-Grade Serous  358.5
626  High-Grade Serous  303.1
656  High-Grade Serous  219.36
682  High-Grade Serous  201.7
716  High-Grade Serous  317.8
729  Low-Grade Serous  54
731  High-Grade Serous  136.7
744  High-Grade Serous  553.72
759  High-Grade Serous  905.2
764  Low-Grade Serous  74
779  High-Grade Serous  429
801  High-Grade Serous  155
859  High-Grade Serous  183.5
860  High-Grade Serous  279.74
875  Low-Grade Serous  113.53
891  High-Grade Serous  215.26
900  High-Grade Serous  276.4
908  High-Grade Serous  121.47
946  High-Grade Serous  160.42
1057  High-Grade Serous  113.22
1067  High-Grade Serous  126.58
1092  High-Grade Serous  305.89

Table 2-3. Sample information for the ovarian cancer samples screened

Pan-viral Microarray Analysis

Initially, the viromes of six HGS samples were characterized using the Virochip. The Virochip is a pan-viral microarray introduced in 2002 that includes oligos from all known virus taxa (Wang et al., 2002). 300 ng of RNA from HGS tissue extracts was reverse transcribed using a modified sequence-independent single primer amplification (SISPA) system. The primer adapter flanking the 5' end of each random nonamer is subsequently used for targeted amplification. In the final rounds of amplification, amino-allyl dUTP molecules are incorporated to allow for fluorescent dye conjugation. These products are then hybridized on the Virochip and subsequently scanned on an Agilent microarray scanner. The hybridization patterns observed for the HGS samples did not clearly match any single viral genome, in that relatively few oligos were significantly elevated above historic hybridization intensities. These included oligos for enteroviruses, polyomaviruses, and specific retroviruses. Only sample 382 was E-predicted as an enterovirus, Coxsackievirus A16.
PCR and Sequence Analysis

Another feature of the Virochip is that PCR confirmation of microarray hits can be performed by designing primers based on selected sequences with elevated intensities for a particular sample. However, for some viral genomes, there may be too few oligos spanning the genome to allow primers to be designed from the microarray oligonucleotide sequences. In these situations, in order to confirm microarray hits for specific groups of viruses, PCR primers were designed using the most conserved regions of each group of viruses. All primers used for both the viral and the subsequent bacterial analyses are shown in Table 2-4. With the exception of the enterovirus primers, the primer sets (polyomavirus and retrovirus) yielded no amplicons in the preliminary screening of 6 HGS samples. A potential enterovirus amplicon was observed as a weak band by gel electrophoresis, but upon cloning and sequencing it was found to be a human sequence non-specifically amplified by the enterovirus primers. In a subsequent nested PCR assay targeting a conserved region of the gag gene in gammaretroviruses (Figures 2-1 and 2-2), the nested retrovirus PCR yielded a sequence with 99% sequence identity to a recently sequenced Murine Leukemia Virus-related virus (Lo et al., 2010).

Primer  Sequence  Targeted Organism
>Neisserian 139616s 386427  CTTTTGTCAGGGAAGAAAAGGCTGT*  Neisseria spp.
>Neisseria 139616s 627-574  CTCTGACACACTCTAGTCACCCAGTTCAGAA*  Neisseria spp.
>Neisseria101423s 5947959520  AGTGCATCAGGTGGATGCCTTGGCGATGATAGGCGACGAAGG*  Neisseria spp.
>Neisseria1014 23s 596949653  GGTTGATTTCTTTTCCTCCGGGTACTTAGATGGTTCAGTTCT*  Neisseria spp.
>Pap148For2  TTTCCAGATCCAAACAAATTTGC*  Human Papillomavirus
>Pap148Rev2  GCAATATCCCAGTGTTCACC*  Human Papillomavirus
>HPV-491-514  AGGATGTAAACCAGCCACTGGTGA  Human Papillomavirus
>HPV-991-968  ACCTTGGGCTCTGTGAAGCCAATA  Human Papillomavirus
>PapE1For1240-1263  ACAGAGGCGGATTGTGTGGATAGT*  Human Papillomavirus
>PapE1Rev1661-1648  GTTGCTTTCACGTTGGAGCTTTGC*  Human Papillomavirus
>NeissRevCpn1410-1387  CAACACTTTGTTCACCACCACGCT*  Neisseria spp.
>NeissForCpn954-977  TCAAGCCAAACGCATCGAAATCGG*  Neisseria spp.
>NeissGlutRNAFor  AACTTGCCGCTACGCTGAACAAAG*  Neisseria spp.
>NeissGlutRNARev  TTAAACTGTTCCACGGCTTTGGCG*  Neisseria spp.
>NeissGlutRNAFor318-341  ACGGCAAATTGGAAATCGTGGTCG*  Neisseria spp.
>NeissGlutRNARev570-547  ATATTGCCGTCGCAAATGTCCAGC*  Neisseria spp.
>NeissGlutRNAFor260-283  TATCCCGACTTGCCGAAAGGCTAT*  Neisseria spp.
>NeissOpcA465-488  AACCTGAACCATCCACACCCAACT*  Neisseria spp.
>NeissOpcAFor772  CGTCATATCCTTCGCCCTGAAACAA*  Neisseria spp.
>NeissOpcARev820  CGTCGGCTTGTCAAACCCTACG*  Neisseria spp.
>NeissOpcFor17  AAGGCTTTGATAATCTGCCTGCCC*  Neisseria spp.
>NeissOpcRev223  TGTGGATGGTTCAGGTTTGTGGGA*  Neisseria spp.
>EnteroVirusFor  ACACGGACACCCAAAGTAGTCGGTTCC  Enteroviruses
>EnteroVirusRev  TCCGGCCCCTGAATGCGGCTAATCC  Enteroviruses
>WuBKFor  TGTTTTTCAAGTATGTTGCATCC  WU/BK Polyomaviruses
>WuBKRev  CACCCAAAAGACACTTAAAAGAAA  WU/BK Polyomaviruses
>BKJVFor  AGTCTTTAGGGTCTTCTACC  BK/JC Polyomaviruses
>BKJVRev  GGTGCCAACCTATGGAACAG  BK/JC Polyomaviruses
>MerkFor  ATCTGCACCTTTTCTAGACTCC  Merkel Polyomavirus
>MerkRev  ATATAGGGGCCTCGTCAACC  Merkel Polyomavirus
>SV40For  TTCCTCTTCCTGCAGTACT  SV40 Polyomavirus
>SV40Rev  AGTTGCAAACCAGACCTCA  SV40 Polyomavirus
>RetroNes1  ATCAGTTAACCTACCCGAGTCGGAC  Gammaretroviruses
>RetroNes2  GCCGCCTCTTCTTCATTGTTCTC  Gammaretroviruses
>RetroGagFor  TCTCGAGATCATGGGACAGA  Gammaretroviruses
>RetroGagRev  AGAGGGTAAGGGCAGGGTAA  Gammaretroviruses
>RiboFor  AGAGTTTGATCATGGCTCAG  Ribosomal DNA (Bacteria)
>RiboRev  GWATTACCGCGGCKGCTG  Ribosomal DNA (Bacteria)

Table 2-4. Primers used to screen ovarian carcinoma tissue, * = developed from sequence retrieval

Figure 2-1. Nested RT-PCR results for gammaretroviruses in ovarian carcinoma tissue (lanes: VOA2, VOA149, VOA259, VOA365, VOA382, VOA496)

Interestingly, the gag sequences from the different HGS samples were not identical to each other. A previous study that identified this same amplicon in samples from chronic fatigue syndrome patients suggested that the sequence heterogeneity, which consisted of both deletions and SNPs, reflected the low replication fidelity typical of retroviruses and therefore a bona fide retroviral infection (Lo et al., 2010). From our HGS samples, most of the same SNPs were recovered and the same 21-base pair deletion was also found (Figure 2-2).

Figure 2-2. (A) Alignment of cloned MLV-V sequences derived from HGS ovarian cancer samples, a polytropic MLV-V mutant (Lo, 2010), and Mus musculus endogenous retrovirus.
(B) Alignment of cloned MLV-V sequences against the Mus musculus endogenous retrovirus

Since murine endogenous retroviruses have not previously been associated with ovarian cancer, we then used the same method to screen a larger set of samples. Upon testing 40 HGS, 10 CC, and 10 En tissue extracts with the same nested retrovirus PCR, all samples were unexpectedly positive, including the negative control. As nested PCR reactions are prone to amplicon contamination, a thorough decontamination of the equipment used in this assay was conducted. However, upon subsequent screening of multiple water, non-human, and unrelated human respiratory samples, the assay produced both positive and negative results in a pattern that was inconsistent between replicates. This prompted an optimization of the assay in order to improve its consistency across the samples analyzed. Upon removing the external primers of the nested PCR assay and using only the internal primers, the assay became more consistent between replicates of the same samples. In fact, subsequent to this modification, all samples analyzed were found to be positive. These positive results were consistent even when different laminar flow biosafety cabinets and pipettors were used. This led to the hypothesis that there may be a contaminant intrinsic to the PCR master mix used, Invitrogen's One-step RT-PCR Master Mix. To test this hypothesis, a panel of different PCR master mixes was tested as shown in Table 2-5. All of these were used as per each manufacturer's protocol, and the sample used in each reaction was PCR-grade water. As shown in Table 2-5, all master mixes were positive for the MLV-V amplicon except Qiagen 1 step RT-PCR, the Fermentas master mixes (Maxima Hot Start PCR MM, DreamTaq PCR MM), and the Applied Biosystems master mixes (AmpliTaq Gold, TaqMan Universal MM, TaqMan Gene Expression MM).
Interestingly, these are the only products that do not use an antibody bound to the Taq polymerase to inhibit PCR amplification at lower temperatures (such polymerases are known as “hot start”). Instead, these kits use a chemical moiety to inhibit their respective Taq polymerases. When the manufacturer was asked about the origin of this antibody, it was found to have been produced in a murine hybridoma cell line.

Company  Kit  Antibody  Chemical Moiety  MLV-V  DNA  RNA
Applied Biosystems  AmpliTaq Gold  N  Y  N  N  N
Applied Biosystems  AmpliTaq Gold 360 MM  N  Y  N  N  N
Applied Biosystems  TaqMan Universal MM 2 w/ UNG  N  Y  N  N  N
Applied Biosystems  TaqMan Gene Expression MM  N  Y  N  N  N
Fermentas  Maxima Hot Start PCR MM  N  Y  N  N  N
Fermentas  DreamTaq PCR MM  N  N  N  N  N
Invitrogen  AccuPrime RT-PCR Supermix  Y  N  Y  Y  Y
Invitrogen  Platinum PCR Supermix  Y  N  Y  Y  N
Invitrogen  Superscript II w/ Platinum Taq HiFi  Y  N  Y  NA  Y
Invitrogen  Superscript III w/ Platinum Taq HiFi  Y  N  Y  NA  Y
Invitrogen  Platinum Taq HiFi  Y  N  Y  NA  Y
Qiagen  1 step RT-PCR MM  N  Y  N  N  N

Table 2-5. Ovarian cancer PCR master mix screening for each of the designated companies

Sequencing of the MLV-V amplicons found in the samples and reagents showed that all had high sequence similarity to a polytropic endogenous retrovirus found in mice (Figure 2-2). Polytropism is defined as the ability of an endogenous retrovirus to exogenize, through mutation or recombination, and subsequently infect multiple types of cells, including those of human origin. A final experiment was run using the Qiagen 1 step RT-PCR master mix reagents, with Invitrogen's One Step RT-PCR enzymes as a positive control and the HGS tissue extracts as the experimental samples. All HGS tissue RT-PCRs were negative, while the Invitrogen enzyme mix was positive for the MLV-V amplicon.
Collectively, these results show that the MLV-V most likely originated from the murine hybridoma cell line used in “hot start” Invitrogen PCR reagents, and thus the MLV-V sequences detected in the ovarian cancer samples were contaminants from that specific manufacturer’s reagents.

RNASeq

All of the steps involved in the library preparation for Illumina sequencing were carried out at the BC Cancer Agency. Briefly, total RNA was treated with DNase, followed by capture of polyadenylated RNA using the MACS mRNA isolation kit (Shah et al., 2009). Samples were then loaded onto an Illumina Genome Analyzer II. Read lengths ranged from 36 to 76 bp. Samples were quality filtered by removing sequences with bases below Q20. The quality-filtered sequence files were transferred to the BCCDC to identify potential microorganisms in the samples. Initially, human sequences were removed by aligning the reads to the human genome using the Burrows-Wheeler Aligner (Li and Durbin, 2009). The sequence files were then subjected to homopolymeric read filtration, in which a custom Python script was used to remove reads with 20 sequential bases of the same nucleotide (see appendix). The sequences were then uploaded onto GenomeQuest for taxonomic classification. GenomeQuest is a commercial bioinformatics server used for various forms of next-generation sequence analysis. Users can upload sequences for custom alignments to a specific genome of choice or, as in our case, use a custom algorithm called MegaSearch (similar to NCBI MegaBlast). Sequences were aligned to Genbank's NT database and the result files were retrieved via FTP. The result files were organized as the top five percent matches for each read along with their taxonomic profile. Since these files consisted of paired-end sequences, a lowest common ancestor algorithm was developed in order to maximize the taxonomic classification potential of each file.
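The homopolymeric read filtration step described above can be illustrated with a minimal Python sketch. This is not the script shown in the appendix: the function names, the representation of reads as (identifier, sequence) pairs, and the reading of the threshold as a run of 20 or more identical bases are assumptions made for illustration only.

```python
import re

# Assumed threshold: a read is discarded if it contains a run of at least
# 20 identical bases (the text specifies "20 sequential bases").
MIN_RUN = 20


def has_homopolymer(seq, min_run=MIN_RUN):
    """Return True if seq contains min_run or more identical consecutive bases."""
    # One captured base followed by at least (min_run - 1) repeats of it.
    pattern = r"([ACGTN])\1{%d,}" % (min_run - 1)
    return re.search(pattern, seq.upper()) is not None


def filter_reads(records, min_run=MIN_RUN):
    """Yield the (read_id, sequence) pairs that pass the homopolymer filter."""
    for read_id, seq in records:
        if not has_homopolymer(seq, min_run):
            yield read_id, seq


# Hypothetical example reads, not data from this study.
reads = [
    ("read1", "ACGT" * 9),                 # 36 bp, no long run: kept
    ("read2", "ACG" + "A" * 20 + "TGCA"),  # contains a 20-base poly-A run: removed
]
kept = list(filter_reads(reads))
```

A regular expression with a back-reference keeps the check to a single pass per read; for paired-end data, one would presumably also need to decide whether to drop the mate of a removed read, which this sketch does not address.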
Using a custom Python script, the top percent matches for each individual read were selected, and the taxonomic tree of each read was then compared with that of its mate pair with the aim of selecting the lowest common ancestor shared between them (appendix). In cases where multiple hits shared the top percent match, sets were constructed from each combination of hits and the lowest common ancestor among them was chosen. Shown in Table 2-6 are the total sequences generated along with the total microbial sequences classified by GenomeQuest.

Cancer  Total Sequences  Microbial Sequences  % Microbial
CC_1  28,084,294  111,105  0.40
CC_2  28,176,128  109,541  0.39
En_1  18,806,908  300,810  1.60
En_3  48,704,601  53,177  0.11
En_4  50,563,522  69,519  0.14
En_5  7,996,210  557,462  6.97
En_6  19,867,212  386,029  1.94
HG_1  39,662,071  18,059  0.04
HG_2  29,174,975  3,342  0.01
HG_3  58,528,054  158,796  0.27
HG_4  69,354,382  3,374  0.004
HG_5  5,609,832  4,556  0.08
HG_6  2,445,692  95,471  3.90
Muc_1  910,234  129,133  14.19
Muc_2  14,100,468  617,267  4.38
ES  2,755,420  122,750  4.45
LungCan  7,391,936  133,598  1.81

Table 2-6. Total microbial sequences generated for each cancer sample, including reads for bacterial, viral, and fungal microorganisms.

Notably, all of the cancer subtypes contained appreciable numbers of microbial (bacterial, viral, and fungal) reads, ranging from 0.004% to 14% of the total reads. In Figure 2-3, the top ten microbial families are shown for each cancer subtype. In the majority of cancer microbiomes, the family Moraxellaceae was found to be the dominant taxon. In all cases, the species responsible for this dominance was found to be Acinetobacter baumannii. The remaining three microbiomes were instead dominated by Pseudomonadaceae. Another dominant set of reads, designated only as 'Other', consisted of sequences that could only be classified down to the phylum level due to lack of representation in the NT database.
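The paired-end lowest common ancestor selection described above can be sketched in Python as follows. This is an illustrative reconstruction, not the appendix script. It assumes that each hit is available as a lineage listed from root to species, and it resolves tied top hits conservatively by taking the deepest taxon shared by every tied hit of both mates, which is only one possible reading of the procedure. The function names and the example lineages are hypothetical.

```python
def common_prefix(lineages):
    """Return the longest root-to-leaf prefix shared by all lineages."""
    prefix = list(lineages[0])
    for lineage in lineages[1:]:
        shared = 0
        for taxon_a, taxon_b in zip(prefix, lineage):
            if taxon_a != taxon_b:
                break
            shared += 1
        prefix = prefix[:shared]
    return prefix


def classify_pair(hits_read1, hits_read2):
    """Assign a read pair to the lowest common ancestor of its top hits.

    Each argument is a list of lineages (root first) tied for the top
    percent match of one mate. Returns the deepest shared taxon, or None
    if either mate has no hit or the mates share no taxon at all.
    """
    if not hits_read1 or not hits_read2:
        return None
    prefix = common_prefix(list(hits_read1) + list(hits_read2))
    return prefix[-1] if prefix else None


# Hypothetical lineages for illustration.
crispatus = ["Bacteria", "Firmicutes", "Lactobacillaceae",
             "Lactobacillus", "Lactobacillus crispatus"]
helveticus = ["Bacteria", "Firmicutes", "Lactobacillaceae",
              "Lactobacillus", "Lactobacillus helveticus"]
phage = ["Viruses", "Microviridae"]

same_species = classify_pair([crispatus], [crispatus])
genus_only = classify_pair([crispatus], [crispatus, helveticus])
unrelated = classify_pair([crispatus], [phage])
```

Under these assumptions, a pair whose mates agree is assigned at the species level, a pair with tied hits to two species of the same genus falls back to the genus, and a pair whose mates disagree at the root is left unclassified.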
[Stacked bar chart, 0-100%; legend: Pasteurellaceae, Streptococcaceae, Nocardiaceae, Vibrionaceae, Burkholderiaceae, Pseudomonadaceae, Propionibacteriaceae, Enterobacteriaceae, Moraxellaceae, other]

Figure 2-3. Proportion of microbial families in ovarian cancer tissue where the top ten microbial families quantitated for each sample are shown as the percentage of the total microbial sequences.

Because the dominant families were similar across samples, additional analyses were used to differentiate the samples. Hierarchical clustering and NMDS were applied but, as shown in Figure 2-4, these methods were largely unable to differentiate the cancers from one another. The NMDS ordination in Figure 2-4A displays samples that cluster together encircled with 95% confidence ellipses. As shown, various cancer subtypes fall within a single cluster, which also includes ES and the lung cancer cell line.

Figure 2-4. (A) Ovarian cancer multivariate analyses: NMDS using a Bray-Curtis distance measure, showing no significant clustering. (B) Hierarchical clustering demonstrating two clusters with no cancer types in isolation.

Previous analyses have identified associations between ovarian cancer and pelvic inflammatory disease (PID). This polymicrobial disease is classified as an ascending genital tract infection, which has traditionally been associated with Neisseria gonorrhoeae and Chlamydia trachomatis. However, as shown in Table 2-7, results from the largest prospective studies suggest that PID is associated with a more complex population consisting of species that are also recovered from both the healthy vaginal microbiome and another polymicrobial disease, bacterial vaginosis (Ness et al., 2004; Wiesenfeld et al., 2002). Shown in blue are bacteria found in the vaginal microbiome, such as Lactobacillaceae and Bifidobacteriaceae, which were also recovered in the RNA sequencing data of the ovarian cancer microbiomes shown in red (this study).
Clinically important female genital tract pathogens for both BV and PID were also recovered, including Neisseria gonorrhoeae, G. vaginalis, Escherichia coli, Vibrio spp., and Mycoplasma spp.

Organism  Normal  BV  PID  Ovarian Cancer Microbiome
Escherichia coli
Neisseria spp.
Chlamydia spp.
Mycoplasma spp.
Gardnerella vaginalis
Prevotella spp.
Peptostreptococcus
Streptococcus spp.
Propionibacterium spp.
Bacteroides spp.
Fusobacterium spp.
Leptotrichia spp.
Acinetobacter spp.
Corynebacterium spp.
Staphylococcus spp.
Vibrio spp.
Megasphaera spp.
Enterococcus spp.
Enterobacter spp.
Lactobacillus spp.
Ureaplasma spp.
Atopobium vaginae

Table 2-7. PID, BV, and healthy vaginal microbial populations recovered; presence or absence of crucial microorganisms that are associated with the normal/healthy vaginal microbiome, BV, or PID.

Shown in Figure 2-5 are the sum normalized species frequencies for each of these bacterial families along with those of ES and the lung cancer cell line for comparison. Oddly, of the 19 families shown here, all but 3 were also recovered in our control samples. The family Bifidobacteriaceae was unique to the ovarian carcinomas. Although other families such as Bacteroidaceae and Neisseriaceae were not unique to the ovarian carcinomas, the number of reads recovered in the lung cancer cell line sample for these families was fewer than 10.

[Bar chart; y-axis: Sum Normalized (%); series: CC, HG, En, Muc, ES, LungCan; x-axis families: Bacteroidaceae, Bifidobacteriaceae, Chlamydiaceae, Coriobacteriaceae, Corynebacteriaceae, Enterobacteriaceae, Enterococcaceae, Fusobacteriaceae, Lactobacillaceae, Leptotrichiaceae, Moraxellaceae, Mycoplasmataceae, Neisseriaceae, Peptostreptococcaceae, Propionibacteriaceae, Staphylococcaceae, Streptococcaceae, Veillonellaceae, Vibrionaceae]

Figure 2-5.
PID and BV-associated bacteria sum normalized counts for microbial families in cancer samples

Because the families Lactobacillaceae and Neisseriaceae include species that are native to the respiratory tract, any argument against contamination at the time of sampling that rests partly on the presence of microorganisms native to the vagina requires a more specific secondary analysis. Although GenomeQuest provides taxonomic resolution at the species level, it employs a MegaBlast-like algorithm for sequence alignment and its species-level assignments may not be reliable. Therefore, all of the sequences in the HGS samples specific to these families were aligned to Genbank's NT database using BLASTN on NCBI. Only reads aligning to lactobacilli specific to the vaginal microbiome or to N. gonorrhoeae with E-values significantly lower than all other subsequent classifications were designated as true positives. As shown in Table 2-8A, of the 48 paired-end reads that were aligned by GenomeQuest to the Lactobacillus genus, 75% aligned to residents of the human vagina, including L. crispatus (8), L. johnsonii (4), and L. helveticus (10). Other lactobacilli included bacteria that are found in the human vagina as well as other environmental sources, including L. delbrueckii (4), L. fermentum (2), and L. reuteri (2) (Naidu et al., 1999; Ravel et al., 2011). All of the lactobacilli produced E-values that were orders of magnitude lower than the next species aligned, indicating that these reads are specific to these genomes. However, Table 2-8B shows that of the 62 reads classified as the genus Neisseria, only 2 of the 8 paired-end reads with N. gonorrhoeae as the top alignment produced E-values that were lower than those of other closely related species. The majority of the sequences had E-values that were equal to or only slightly lower than those produced for N. meningitidis or another relative, Rahnella aquatilis.
Although these reads could still be derived from an N. gonorrhoeae genome, the lack of specificity casts doubt on these taxonomic classifications. This is significant because species of this genus such as N. meningitidis are pathogens that can also exist as commensal bacteria in the respiratory tract and therefore may have originated as contaminants during the biopsy process. For the control ES sample, since there were no reads for Neisseriaceae, the same analysis was done with the lactobacilli sequences only. Of the 32 paired-end reads specific to lactobacilli, only 3 sequences had E-values lower than the next best hit. Two of these reads were specific to L. delbrueckii, which is native to both the vaginal microbiome and the environment, and the third was more specific to an environmental bacterium, L. fermentum. The other sequences had equal alignment scores to various Lactobacillus species, many of which are endogenous to the vagina, including L. vaginalis and L. crispatus. Since all of these reads were specific to 16S rRNA, the E-values were identical for all alignments and the reads could not be resolved at the species level. In contrast to the majority of HGS Lactobacillaceae sequences, which aligned to single species, the Lactobacillaceae reads from the ES sample could only be assigned to the genus Lactobacillus. In the lung cancer cell line, sequences for both Lactobacillaceae (2) and Neisseriaceae (7) were recovered. Two of the Lactobacillaceae sequences were specific to L. casei, while 1 of the Neisseriaceae sequences was unique to Laribacter hongkongensis. Both L. casei and L. hongkongensis have been recovered from the oral, respiratory, and gastrointestinal microbiomes (Cox et al., 2010; Woo et al., 2009).

(2-8A)
HGS Lactobacillus Hits  Paired-reads
L. crispatus  8
L. gasseri  4
L. helveticus  10
L. fermentum  2
L. delbrueckii  4
L. plantarum  2
B. cereus  2
L. reuteri  2
L. casei  2
Uncultured lactobacilli  4
L. salivarius  2
Lactobacillus General  4

(2-8B)
HGS Neisseria Hits  Paired-reads
N. meningitidis  31
N. gonorrhoeae  8
Bradyrhizobium  1
N. lactamica  3
Chromobacterium violaceum  1
Thauera spp.  1
Rahnella aquatilis  1
Uncultured Neisseriaceae  2
N. sicca  2
Uncultured Neisseria spp.  12

Table 2-8. Lactobacillus and Neisseria spp. analysis. Paired-end read counts for (A) Lactobacillaceae and (B) Neisseriaceae re-aligned using BLASTN against Genbank's NT database.

Virome

Shown in Figure 2-6 are the viruses recovered for each sample. Most of these sequences are specific to bacteriophages of Escherichia coli such as Myoviridae (T4 species) and Microviridae (φX174). Interestingly, a variety of papillomaviruses were also recovered. In each individual sequence file, however, there was never more than a single read for each papillomavirus. Most of these sequences were specific to gammapapillomaviruses, including types 60 and 80 and genotype 144, which was newly sequenced at the time. Although there were only singletons specific for papillomaviruses, these were absent in the ES sample, which contained only T4 bacteriophage. The lung cancer cell line contained sequences specific to Mastadenovirus as well as reads for φX174 bacteriophage. For sequencing on the Illumina platform, φX174 genomic libraries are spiked in at the time of sequencing for quality control purposes, and therefore these reads were designated as contaminants (Shah et al., 2009).

[Stacked bar chart, 0-100%; legend: Siphoviridae, Podoviridae, Papillomaviridae, Myoviridae, Microviridae, Adenoviridae; x-axis samples: CC_1, CC_2, En_1, En_2, En_3, En_4, En_5, HG_1, HG_2, HG_3, HG_4, HG_5, HG_6, M_1, M_2, ES, LungCan]

Figure 2-6.
Cancer virome percent recoveries for each cancer sample

Confirmation of Microbiome Findings by PCR

After reviewing the sequences in the transcriptome files and comparing these to our control samples, multiple PCR assays were developed in order to confirm the presence of these microorganisms in our samples. Shown in Table 2-4 are the primers used for each individual microorganism, where an asterisk indicates primers that were designed specifically from the sequences recovered in the transcriptome files. For these assays, multiple genomes of each species deemed significant in the transcriptome were aligned together with closely related species in Geneious (Biomatters), and conserved regions flanking the reads were used for primer development. All of the samples were screened by first reverse transcribing ovarian carcinoma tissue RNA in a similar manner to the Virochip protocol and subsequently using a Quanta PCR Master Mix with each respective set of primers (see methods). Unfortunately, other than for Neisseria gonorrhoeae, we were unable to obtain suitable positive control material to validate these assays. All of the samples were PCR negative for the microorganisms tested.

16S Ribosomal DNA Assay

As a significant percentage of sequences in these datasets were specific to rRNA, a single assay was developed using conserved primers targeting variable regions V1-V3 (Lane et al., 1985). Since a control was available for one of the target species, N. gonorrhoeae, the assay was developed to account for the possibility that these microorganisms may be too dilute for detection within ovarian cancer tissue. This control was diluted to concentrations within the range of those microorganisms recovered in the ovarian cancer sequence files. The theoretical amount of input nucleic acid was estimated to be 528-792 nanograms, since 4 pM of 200-300 bp sequences were loaded onto the Illumina Genome Analyzer II. 
Since 4 x 10^-3% to 14% of the analyzed transcriptome files were of microbial origin, the theoretical starting amount of microbial cDNA comes to 0.02-110.9 ng of DNA. Another variable that has to be considered is the ratio of the target to the background nucleic acid in the reverse transcription reaction, in this case ovarian cancer RNA. Therefore, I diluted my control bacterial sample with ovarian cancer RNA to simulate the microbial sequence composition in the same sample matrix. An important note here is that my N. gonorrhoeae control is a DNA/RNA sample that has been DNased, whereas the above calculations are for DNA. On average ssRNA is roughly 53% of the weight of dsDNA, but this ~2-fold difference is insignificant when considering that the above calculations are for the end point of a transcriptome library preparation, where, theoretically, 5% of the total cDNA remains. The library preparation should therefore remove a large percentage of the microbial sequences in the initial sample. Furthermore, as the reverse transcription reaction is randomly primed, the cDNA should theoretically be proportional to the original sample. The positive control, N. gonorrhoeae, is a species that will also subsequently be targeted in multiple PCR assays. 1 µl of DNase was added to 20 µl of total nucleic acid extract, and the remaining RNA was quantified with a Quant-iT™ high-sensitivity DNA assay kit using a Qubit® fluorometer (Invitrogen, Carlsbad, CA, USA). I then diluted this sample ten-fold in ddH2O, and subsequently used this sample in a 1:4 serial dilution with ovarian cancer RNA. This serial dilution ranged from 0.013 ng to 0.825 ng, all within the context of 427.5 ng of ovarian cancer RNA. The samples were then reverse transcribed using the same procedure as that of the experimental samples, followed by a PCR using primers (RiboFor and RiboRev, Table 2-4) targeting variable regions 1-3 of rDNA with a Quanta PCR master mix. 
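The arithmetic behind these estimates can be retraced with a short sketch. Two assumptions are mine rather than the text's: an average mass of 660 g/mol per base pair of double-stranded DNA, and treating the stated load as 4 pmol of fragments, which is what reproduces the quoted 528-792 ng range. All function and variable names are hypothetical.

```python
AVG_BP_MASS = 660.0  # g/mol per dsDNA base pair (standard approximation, assumed here)


def dsdna_mass_ng(picomoles: float, length_bp: int) -> float:
    """Mass in ng of `picomoles` of dsDNA fragments of `length_bp` base pairs.

    pmol x g/mol gives picograms; dividing by 1000 converts to nanograms.
    """
    return picomoles * length_bp * AVG_BP_MASS / 1000.0


def serial_dilution(start_ng: float, fold: float, points: int) -> list:
    """Amounts at each point of a serial dilution, starting with `start_ng`."""
    return [start_ng / fold ** i for i in range(points)]


# Total loaded nucleic acid for 200 bp and 300 bp fragments.
total_low = dsdna_mass_ng(4, 200)
total_high = dsdna_mass_ng(4, 300)

# Microbial share: 4 x 10^-3 % at the low end, 14 % at the high end.
microbial_low = total_low * (4e-3 / 100)
microbial_high = total_high * (14 / 100)

# 1:4 serial dilution of the control, four points starting at 0.825 ng.
series = serial_dilution(0.825, 4, 4)

# Share of control RNA relative to 427.5 ng of ovarian cancer RNA, in percent.
control_percent = 0.825 / 427.5 * 100

print(total_low, total_high)
print(round(microbial_low, 2), round(microbial_high, 1))
print([round(x, 3) for x in series])
print(round(control_percent, 2))
```

With these assumptions the totals come to 528 and 792 ng, the microbial range to about 0.02-110.9 ng, the dilution series to 0.825, 0.206, 0.052 and 0.013 ng, and the control-to-background proportion at the first point to about 0.19%, matching the values quoted in the text.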
As shown in Figure 2-7, rDNA was detected only at the first serial dilution point, 0.825 ng. Although this starting material is not the lowest seen in the transcriptome files, it is well below the average of the estimated range for these samples, 21.9 ng (0.02-98 ng). Furthermore, the proportion of N. gonorrhoeae to ovarian RNA at this dilution is 0.19%, which is also at the lower end of the estimated range (4 x 10^-3% to 14%, average 3.3%).

Figure 2-7. 16S rDNA assay results. The values indicate the amount of DNA in each reaction at the time of the initial RT reaction

Although the complexity of the human component was captured in these test samples, this is not an identical comparison to our experimental samples, as N. gonorrhoeae made up on average less than 1% of each microbiome in the transcriptome files. Despite this lack of complexity in the control, the primers used in this PCR can bind various genomes with theoretically equal efficiency. Nonetheless, it can be concluded from the results of this assay that rDNA can be detected with a limit of detection of roughly 0.825 ng rDNA in a given ovarian cancer sample. Using this same PCR, 64 HGS ovarian cancers were screened. As shown in Figure 2-8A and B, all samples were negative except for the positive control, N. gonorrhoeae RNA.

[Figure 2-8 images: panels (A) and (B); lane label Pos]

Figure 2-8. 16S rDNA PCR results for HGS samples 1-32 (A) and 33-64 (B). Pos = positive control, N. gonorrhoeae

Conclusions

The ovarian cancer transcriptome initially provided the unexpected recovery of a diverse microbial population. However, because many of the cancers as well as the ES and lung cancer cell line samples shared many of the same dominant family members, cluster analyses were used to assess significant differences that may exist between each cancer type. These results suggested that there were no significant differences between them. 
A closer inspection of the minority sequences which are native to the female genital tract provided mixed results. Although 75% of the sequences classified as the genus Lactobacillus in HGS were assigned to endogenous vaginal microorganisms, only 2/68 of the Neisseriaceae were specific to N. gonorrhoeae. Furthermore, analyzing these same microorganisms from the ES control revealed that many of the reads specific to lactobacilli were indicative of species native to the genital tract but came from highly conserved regions of the genome and therefore could only be classified at the genus level, Lactobacillus. Only three sequences were resolved at the species level, as L. delbrueckii and L. fermentum. Because these species are predominantly recovered from dairy products as well as the oral and vaginal microbiomes, these microorganisms are unlikely sources of laboratory contamination. Coupled with the observation that the remaining sequences are likely native to the female genital tract, and given that these were found in a non-ovarian tissue sample, the utility of using this genus as evidence for a microbiome native to ovarian cancer is questionable. Furthermore, as the dominant families in each cancer microbiome were also recovered from the lung cancer cell line, and are therefore independent of the human microbiome entirely, these families are likely contaminants introduced after tissue sample extraction. The rDNA PCR results have several implications for the characterization of this potential microbiome. Even assuming that this assay is sensitive enough to detect the bacteria in the samples tested, the possibility remains that, because we were unable to obtain the same samples which were sequenced, the observed profiles could be unique to the subset of patients involved in the transcriptome study. However, all of the samples that were sequenced contained microorganisms, and the majority were within the range of this assay. 
Thus it would be unlikely that none of the subsequent samples tested by PCR would contain detectable bacterial DNA. Another complication is that the control used to determine the limit of detection contains only an N. gonorrhoeae 16S rDNA population, whereas the sequenced samples contained a highly diverse rDNA profile. This could impact the sensitivity of the assay if such a population existed in the ovarian tissue RNA extracts, as the primers would not be targeting a single template. Although creating bacterial control material that contains a diverse range of bacterial species is difficult, an attempt to control for the human component (and thus increase the complexity of the sample) was made by diluting the sample in ovarian cancer RNA. The third possibility is that these microbial sequences were contaminants introduced at some point after tissue extraction. Supporting this is the fact that these samples were not extracted and prepared with a microbial analysis in mind. Therefore, if the ovarian cancer tissue was not handled in such a manner as to prevent microorganisms from being introduced to, attaching to, or growing on or within the sample, then at this depth of sequencing one might expect to recover microbial sequences from the laboratory environment. Although such surgical procedures are generally sterile, the process of tissue management in pathology post-extraction may result in the incorporation of contaminating nucleic acid if a microbial analysis is not pre-meditated. Furthermore, sterility does not ensure the absence of microbial nucleic acid contamination from dead microorganisms, such as those found on autoclaved surgical instruments. Complicating this issue is the fact that the dominant families recovered, Moraxellaceae and Pseudomonadaceae, are both known water contaminants and human pathogens (Grahn et al., 2003). 
Furthermore, multivariate analyses such as hierarchical and NMDS clustering could not differentiate the controls from the ovarian cancers, suggesting that the majority of the microbial profiles were derived from contamination, as these samples are from distinct sources. Although the Human Microbiome Project has not analyzed upper genital tract tissue, the conclusions from that project and additional studies suggest that populations from different body sites should be distinct. Contamination seems to be a likely explanation for the findings, as the lung cancer cell line also contained similar abundance profiles relative to the cancer samples. A tissue culture is subjected to multiple antimicrobial compounds in order to prevent microbial contamination. The presence of microorganisms in this sample, independent of its similarity with the associated cancer microbiomes in this study, suggests that the methods used to process and sequence the samples did not ensure microbial nucleic acid sterility. Another possible source of contamination is that these sequences were generated in the library preparation or sequencing process. In this scenario, it is plausible that the number of microbial sequences recovered would be proportional to the number of sequences generated for each sample. However, the graph of the total number of sequences alongside the total microbial sequences for each individual sample in Figure 2-9A demonstrates no such relationship. Nevertheless, Figure 2-9B shows that there are patterns in the proportional recoveries of the top three microbial families. Although there is not a single pattern, each sample falls into one of three patterns characterized by the dominance of one family and similar recovery rates for the others.

[Figure 2-9A image: y-axis ticks from 1 to 100,000,000 by factors of ten; series: Total Seqs, Microbial Seqs; x-axis, samples in the order HG_5, HG_4, En_2, En_4, HG_1, HG_2, CC_1, CC_2, En_5, En_1, M_2, En_3, LungCan, HG_3, ES, HG_6, M_1]

Figure 2-9A. 
Microbial sequences are not proportional to total sequences. All samples shown here are ordered according to the total sequences generated for each sample, from high to low

[Figure 2-9B image: y-axis ticks from 1 to 1,000,000 by factors of ten; series: Pseudomonadaceae, Moraxellaceae, Enterobacteriaceae]

Figure 2-9B. Total sequences of the top three microbial families, Moraxellaceae, Pseudomonadaceae, and Enterobacteriaceae, for each cancer sample.

The limitations of this analysis, such as small sample sizes and the lack of proper negative controls (healthy ovarian tissue and a water blank), limit the confidence in the conclusions of this study. As we had only one ES sample to control for microbial populations, it is possible, however improbable, that by chance this sample contained sequences specific to these microorganisms. However, this logic does not apply to the lung cancer cell line for the reasons stated above. Combining all of the evidence, I conclude that the most probable scenario from the aforementioned experiments is that many of the microbial sequences present in these cancer transcriptome files originated from an unknown source after tissue extraction. Microorganisms such as N. gonorrhoeae and L. crispatus are unlikely contaminants; however, their recoveries suggest that, if they are present, they are below the limit of detection of the methods used in this study for sequence recovery. Perhaps if healthy ovarian tissue and a proper negative control could be obtained, the results could be interpreted with higher confidence. But with similar sequences obtained in distant tissues as well as a cell line, a lack of PCR amplicon recovery, and no significant differences according to multivariate analyses, contamination is the most likely explanation for the majority of the microorganisms identified in this study. 
Future studies investigating a microbial association with ovarian cancer should only include tissue samples that have been obtained with a pre-meditated strategy to ensure sample sterility.

Chapter 3 Bacterial Vaginosis

3.1 Bacterial Vaginosis Study Methods

Study 1A. Metagenomic characterization of the vaginal microbiome of healthy, non-pregnant women.

The vaginal swabs were collected as part of a larger vaginal microbiome study (VOGUE) with 300 healthy non-pregnant women aged 18-49 years. These women were deemed healthy and asymptomatic through clinical assessment by an OB/GYN specialist or by a physician at a family practice or student health clinic. Patients meeting the inclusion and exclusion criteria were invited to participate in the VOGUE study. The inclusion criteria were sufficient comprehension of English to complete informed consent and the assurance that the women were not pregnant and were not planning on becoming pregnant for the duration of the study period. Further requirements were an age of 18 years or older and a regular menstrual cycle (~28 days). Women were excluded from this study if they were unable to provide written informed consent or had used either systemic or topical antimicrobial therapy within the prior month. 300 samples were to be collected for the overall study. Based on age matching and sample availability, 7 of these samples were used in our analysis.

Study 1B. Metagenomic characterization of the vaginal microbiome associated with specific disease states.

The objective of this study was to characterize the vaginal microbiome of 50 HIV-positive non-pregnant women. This study received University of British Columbia Ethics Board approval on April 11, 2011. Vaginal swabs were collected at a single timepoint with the option to expand sampling upon further consent. Women were offered participation in this study when attending the Oak Tree Clinic at BC Women’s Hospital for HIV care. 
Patients meeting the inclusion criteria were invited to participate in the VOGUE study. Additional information regarding health status, including duration of HIV infection, HAART regimen, CD4 count, and viral load, was also collected from the HIV-positive women. The inclusion and exclusion criteria comprised the same conditions specified in study 1A. The aim of this study was to enroll 50 women from whom biological samples would be collected for microbiome characterization. For our study, we chose 22 samples.

Data Collection

A patient interview was conducted for each subject in order to collect basic demographic, health, and reproductive information. This interview was followed by a patient chart review in order to obtain laboratory and antiretroviral medication information. The demographic information included age, height, weight, BMI, marital status, ethnicity, country of birth, year of immigration to Canada, and residential location. The general medical history included current or chronic diseases. Genital infection history included information regarding bacterial vaginosis, urinary tract infections, yeast infections, trichomoniasis, genital herpes, genital warts, chlamydia, gonorrhea, and syphilis. Drug information included antimicrobial use for three months prior to the study visit, and prescription and non-prescription drugs were recorded for two months prior to the study visit. Non-prescription drugs also included any probiotics taken during this period. Reproductive health information was also collected regarding menstruation timing and occurrence, pregnancy history, feminine hygiene product usage, recent vaginal symptoms, contraception usage, and sexual activity. The sexual activity questions covered the frequency of vaginal, oral, and anal sex, number of partners, pain experienced during vaginal intercourse, and sex toy usage. 
Information on substance use, including heroin, cocaine, marijuana, opiates, crystal methamphetamine, benzodiazepines, methadone, alcohol, tobacco, or any other substance reported by the patient, was also collected. Additional information from the HIV cohort was collected regarding the mode of HIV acquisition, the primary HIV-positive test, the lowest CD4 count (nadir), CD4 count at study visit, highest viral load, the HIV clade, and hepatitis B and C immune status. Lastly, antiretroviral medication history was also collected for each subject.

Sample Collection

For each subject enrolled in this vaginal microbiome study, in addition to swab collection, a pelvic exam was performed by an OB/GYN specialist. Vaginal and cervical health was visually assessed for abnormal appearance or discharge. A bimanual examination was performed along with a wet mount if indicated by standardized procedure. Four swabs were collected from each patient: one for a Gram stain, while the other three were reserved for bacterial and viral analyses. The Gram stain swab used for this analysis is the Copan Sterile Transport swab suitable for aerobic and anaerobic bacteria. The swabs used for the viral analysis are the Copan 3 mL Universal Transport Medium Kits for the Collection and Preservation of Virus, Chlamydia spp., Mycoplasma spp., and Ureaplasma spp. Following labeling with a confidential study identification number, the Gram stain swab is sent to the local hospital laboratory for clinical analysis, and the other three swabs are placed in a -80˚C freezer within 4 hours of swab collection at the Children’s and Women’s Hospital Laboratory. These swabs are then archived at -80˚C at the British Columbia Centre for Disease Control (BCCDC).

Sample Processing

The samples for viral analyses were processed from this point on at the BCCDC. After vortexing each tube for 30 seconds, 1 ml was removed for extraction. Because of the large volume, the NucliSENS easyMAG (BioMerieux) was chosen as the extraction method. 
The easyMAG is an automated magnetic bead-based extraction system that is used routinely at the BCCDC for Gram-positive and Gram-negative bacterial and viral nucleic acid extractions. First, 1000 µl of each sample is pipetted into an Axygen deep well plate. The plate is then sealed and placed on an 80˚C hot plate for 10 minutes. 2 ml of lysis buffer is then added into an easyMAG tubestrip. The heated samples are then added to each respective well in the tubestrip. The samples then sit for 10 minutes at room temperature, after which 100 µl of easyMAG magnetic beads are added. The samples are then loaded onto the easyMAG extraction machine and subsequently eluted in 35 µl of elution buffer.

DNA is then removed from the total nucleic acid by treatment with Turbo DNase I (Ambion). 1 µl of DNase was added to 20 µl of the sample along with 2 µl of DNase buffer and, after a 37˚C incubation for 30 min, 2 µl of inactivating reagent was added to inactivate the DNase. 8 µl of each DNased sample (RNA) was then put into our reverse transcription (RT) reaction. 2 µl of 40 µM random hexamers were added to the 8 µl of sample. This was then incubated at 72˚C for 4 minutes to denature secondary structures, followed by room temperature incubation for 5 minutes. 1 µl of Superscript III reverse transcriptase, 2 µl of 12.5 mM dNTP, 2 µl of ddH2O, 4 µl of Superscript III First Strand Synthesis buffer, and 0.1 M of dithiothreitol were then added to the RT reaction. Reverse transcription was carried out at 42˚C for 1 hour followed by a 94˚C incubation for 2 min. After cooling the reaction to 10˚C, second strand synthesis was carried out using Sequenase, a T7 bacteriophage DNA polymerase. This reaction mix consisted of 7.7 µl of ddH2O, 2.0 µl of 10X Sequenase buffer, and 0.3 µl of Sequenase. Following its addition to the RT reaction, the mixture was incubated with a ramp from 10˚C to 37˚C over 8 minutes, followed by holding at 37˚C for 8 min. 
Following the denaturation of cDNA at 94˚C for 2 min, the RT reaction is cooled to 10˚C. 0.9 µl of diluted Sequenase buffer and 0.3 µl of Sequenase were then added, and the reaction was incubated under the same conditions as the Sequenase reaction described above. Samples were then stored at -20˚C until further processing.

Primer/Probe    Direction    5’ -> 3’ sequence                                   Target
ForwardP        F            5’ CCC TGAATGCGGCTAAT 3’                            Poliovirus 5’UTR
ReverseP        R            5’ TGTCACCATAAGCAGCCA 3’                            Poliovirus 5’UTR
Fam Probe       Probe        5’ FAM ACGGACACCCAAAGTAGTCGGTTC IBFQ 3’             Poliovirus 5’UTR

Table 3-1 Enterovirus qRT-PCR primers and probes

The enterovirus qRT-PCR was used to quantitate enterovirus genome copy numbers in a series of enterovirus-spiked samples. An enterovirus Armoured RNA was used as the internal positive control. 500 copies/µl of Asuragen Enterovirus Armoured RNA™ Quant is extracted using the easyMAG and subsequently diluted to 25 copies/µl. Using the above primers and a FAM-labeled Taqman probe, a 100 µl primer-probe master mix is then made by combining 72 µl of IDTE 1X Buffer, 4 µl of FAM Probe, 8 µl of Forward Primer (400 nM), and 16 µl of the reverse primer (800 nM). The probe is specific for a highly conserved 143-nucleotide portion of the 5′ untranslated region (5′UTR) of poliovirus, while the primers flank this region. For each sample, 1 µl of this primer-probe mix was then added to 9 µl of PCR grade water and 5 µl of Taqman Fast Virus 1-Step RT-PCR master mix. 15 µl of this reaction mix is then added to each well of an Applied Biosystems Fast Optical 96-Well Plate. 5 µl of each experimental and control sample was then added to its respective well, along with three negative controls (PCR grade water). Five ten-fold serial dilutions of the extracted armoured RNA, from 10^-1 to 10^-5, were run in triplicate as the reference for the standard curve and subsequent quantification. 
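Quantification against a ten-fold dilution series of a standard is usually done by fitting a line to cycle threshold (Ct) against log10 of the input copies and inverting it for unknowns. The sketch below illustrates that general technique only; it is not the instrument software's method, and the Ct values and function names are hypothetical.

```python
from typing import List, Tuple


def fit_standard_curve(log10_copies: List[float], ct_values: List[float]) -> Tuple[float, float]:
    """Least-squares fit of Ct = slope * log10(copies) + intercept."""
    n = len(log10_copies)
    mean_x = sum(log10_copies) / n
    mean_y = sum(ct_values) / n
    sxx = sum((x - mean_x) ** 2 for x in log10_copies)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(log10_copies, ct_values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def copies_from_ct(ct: float, slope: float, intercept: float) -> float:
    """Invert the standard curve to estimate input copies for an unknown."""
    return 10 ** ((ct - intercept) / slope)


def amplification_efficiency(slope: float) -> float:
    """Per-cycle efficiency implied by the slope (1.0 means perfect doubling)."""
    return 10 ** (-1.0 / slope) - 1.0


# Hypothetical standards: five ten-fold dilutions and their mean Ct values.
standards_log10 = [5.0, 4.0, 3.0, 2.0, 1.0]
standards_ct = [23.5, 26.8, 30.1, 33.4, 36.7]

slope, intercept = fit_standard_curve(standards_log10, standards_ct)
unknown_copies = copies_from_ct(30.1, slope, intercept)

print(slope, intercept)
print(unknown_copies)
print(amplification_efficiency(slope))
```

In practice each standard would be the mean of its triplicate wells, and a slope near -3.3 cycles per ten-fold dilution corresponds to an efficiency close to 100%.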
This plate is then loaded onto a Fast Block enabled TaqMan 7500 (Applied Biosystems) using the following parameters:

Step                    Temperature (°C)    Time (min:sec)
RT                      50                  5:00
HS                      95                  0:20
Cycling (45 cycles)     95                  0:03
                        60                  0:30

Table 3-2 Enterovirus qRT-PCR thermocycler conditions

Sequence Library Preparation

The Nextera XT library preparation consists of fragmentation, adapter ligation, and subsequent sample normalization using a standardized procedure provided by Illumina. Briefly, 5 µl of each 0.2 ng/µl DNA sample is added to a master mix consisting of 10 µl of tagmentation buffer and 5 µl of tagmentation amplicon mix. Tagmentation proceeds at 55˚C for 5 minutes and is subsequently neutralized with 5 µl of the neutralization buffer. At this point a specific DNA sequence is attached (or "tagged") to each end of the fragmented DNA samples ("tagmented"). Adapters and indices are then incorporated by PCR amplification, where 15 µl of Nextera PCR master mix and 5 µl of each indexed primer are added to the same tube as the tagmented DNA. The PCR proceeds at 72˚C for 3 minutes and 95˚C for 30 seconds, followed by 12 cycles of 95˚C for 10 seconds, 55˚C for 30 seconds, and 72˚C for 30 seconds. A final extension is carried out at 72˚C for 5 minutes. The PCR products are then cleaned using AMPure beads and normalized as per standard Illumina procedure. A step-by-step procedure for each is shown in the appendix.

KAPA qPCR

The KAPA qPCR is a proprietary qPCR for the quantitation of sequencing libraries. Initially, 1 ml of the primer mix (sequences below) is added to the 5 ml qPCR Master Mix. We then make a new experimental master mix for each sample. 4 µl of our experimental sample is then added to each respective well of an Applied Biosystems Fast Optical 96-Well Plate. For our control samples, we added 4 µl of the provided library DNA standards 1-6 (20-0.0002 pM) in triplicate.

Primer      Direction    Sequence
PrimerP1    F            5’-AAT GAT ACG GCG ACC ACC GA-3’
PrimerP2    R            5’-CAA GCA GAA GAC GGC ATA CGA-3’

Table 3-3 KAPA qPCR primers

We then loaded the plate into a Fast Block enabled TaqMan 7500 using the following parameters:

Step                    Temperature (°C)    Time (min:sec)
HS                      95                  5:00
Cycling (35 cycles)     95                  0:30
                        60                  0:45

Table 3-4 KAPA qPCR conditions

MiSeq

KAPA qPCR results were used to dilute each sample so that the cDNA libraries were pooled in equimolar amounts and 8 pM was loaded onto the MiSeq. The samples were then sequenced according to the standardized manufacturer's protocol, generating sequences at a maximum length of 151 bp. Quality filtration was performed by removing reads that contained more than 5 nucleotides with quality scores below 15.

3.2 Bacterial Vaginosis Study Results

Nextera XT Sensitivity

Illumina sequencing-by-synthesis begins with the ligation of adapter sequences which contain both the sequence for hybridization to its complementary sequence on a glass surface (flow cell) and the primer site to initiate the sequencing reaction. If both ends of the amplicon are being sequenced, the ideal DNA input length for Illumina sequencing with 151 bp paired-end reads is roughly 500 bp. In order to produce a sequence library at this specific size, the traditional methodology requires sonication, end polishing, and ligation. Each of these steps requires multiple clean-ups, which provide ample opportunity for the introduction of contaminants while significantly reducing yields. Nextera technology utilizes an in vitro transposition reaction for the purposes of DNA shearing and adapter ligation in a single tube. In vitro transposition begins with the integration of transposon DNA into the target DNA by a transposome, a transposase-DNA complex. The transposase is composed of two monomers that bind to the two ends of the DNA in the complex and act with near-random sequence specificity. Due to the co-affinity of each monomer, these two ends then join and cleavage ensues. 
A convenient result of this cleavage is the incorporation of the initial DNA molecules of the transposome complex into the target, where they then serve as binding sites for primers in a low cycle number PCR. This PCR step simultaneously fills in the staggered DNA cleavage sites and incorporates the adapter sequence for Illumina sequencing. Since this reaction occurs within a single tube, the Nextera protocol can be modified to accept a low amount of input material (1 ng of total dsDNA in the Nextera XT kit). An important consideration for any method used to characterize microorganisms is the limit of detection of the designated protocol. The sensitivity of the Nextera XT library preparation was tested by inoculating two sets of samples with Coxsackievirus B4 (CB4) lysate. Initially, this virus lysate was quantified using an optimized enterovirus qRT-PCR assay in which a pre-quantified armoured RNA is used as a standard for quantification. The Asuragen Enterovirus Armoured RNA consists of an encapsidated RNA molecule derived from nucleotides 348 to 1218 of the 5′ untranslated region of the poliovirus Sabin type 1 genome (Gregory et al., 2006). The results are shown in Figure 3-1A, where each dilution is shown in triplicate.

[Figure 3-1A image: y-axis, Genome Copies (per µl), ticks from 1 to 10,000,000 by factors of ten; x-axis, CoxB4 Concentration]

Figure 3-1A. The qRT-PCR results for the Coxsackievirus B4 (CB4). Each dilution was done in triplicate

[Figure 3-1B image: y-axis, Genome Copies (per µl), ticks from 1,000,000 to 10,000,000 in steps of 1,000,000; x-axis, CoxB4 Concentration, 1.E-05 to 1.E+00]

Figure 3-1B. Mean coxsackievirus quantification values in triplicate for each dilution are shown normalized to each respective dilution factor

These results clearly demonstrate the consistency for each dilution as well as the consistency of the randomly primed reverse transcription reaction. 
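Normalizing each measurement to its dilution factor and then averaging across the series, as summarized in Figure 3-1B, can be sketched as follows. The replicate values below are invented for illustration and are not the study's data; the function name and input layout are likewise assumptions.

```python
from statistics import mean, stdev
from typing import Dict, List, Tuple


def stock_concentration(measurements: Dict[float, List[float]]) -> Tuple[float, float]:
    """Estimate the undiluted stock concentration from a dilution series.

    measurements maps a dilution factor (1.0 = undiluted, 0.1 = ten-fold, ...)
    to replicate qRT-PCR values in genome copies per µl.
    Returns (mean, sample standard deviation) of the dilution-adjusted values.
    """
    # Average the replicates, then scale back up by the dilution factor.
    adjusted = [mean(replicates) / dilution for dilution, replicates in measurements.items()]
    return mean(adjusted), stdev(adjusted)


# Hypothetical triplicate measurements at three dilutions.
example = {
    1.0: [8.0e6, 8.4e6, 8.8e6],
    0.1: [8.2e5, 8.4e5, 8.6e5],
    0.01: [8.7e4, 8.7e4, 8.7e4],
}

stock_mean, stock_sd = stock_concentration(example)
print(stock_mean, stock_sd)
```

If the assay is linear across the series, the dilution-adjusted values should agree closely, so a small spread relative to the mean is itself a check on the consistency described above.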
In order to illustrate the concentration of the virus stock, the average of each dilution-adjusted value is shown in Figure 3-1B. From these results, the average across the whole dilution series was taken and used as the absolute concentration of the initial stock. The concentration of CB4 was calculated as 8,398,432.4 +/- 493,307 genome copies (gc) per µl. The experimental detection ranges from Moore et al. were used as a guideline to determine the range of virus concentrations with which to inoculate each set of samples (Moore et al., 2011). More specifically, six plasma samples were inoculated with six different CB4 concentrations ranging from 1.38 to 13,857 gc/sample. In this case, there will be relatively few competing RNA molecules in the RT reaction and the subsequent sequence library preparation. An additional variable that is frequently not considered in similar studies is the nucleic acid complexity of the sample matrix. Accounting for it provides more confidence in the detection efficiency and allows meaningful interpretation of negative results. Therefore it is essential to immerse the control target used for detection within the molecular context of the experimental sample. To this end, a matching sample matrix was created by pooling aliquots of several vaginal swab samples and inoculating this matrix with the CB4 lysate. A slightly different dilution scheme was established because the pooled samples contained roughly 16.1 ng/µl of RNA. Thus the viral RNA would be mixed with billions of other RNA molecules, which may decrease the detection efficiency of the viral sequences. Aliquots of the pooled vaginal swab sample were spiked with six different CB4 concentrations ranging from 2.7 to 27,714 gc/sample. In order to track the concentration of CB4 throughout the library preparation process, aliquots were collected at three points and subsequently quantified with the same enterovirus qRT-PCR assay described above. 
These aliquots were collected post-extraction, post-reverse transcription, and post-Nextera XT library preparation. As seen in Figure 3-2, the plasma inoculation reveals an expected decreasing trend in the concentration of virus RNA recovered as measured in gc per µl. Adjusting for each dilution factor, the expected yield is shown adjacent to each sample throughout the procedure. Moving forward at each stage in the procedure, the difference between the theoretical yield and the actual yield correspondingly increases as a result of the accumulated loss. The percent yield at each step was typically less than 50% of the expected amount. This is most evident in the final expected yield in the post-Nextera XT samples (Nx). However, in this case, the percent yield at this point is ~40%, which is within the range of previous steps.

[Figure 3-2 plot: Genome Copies (log scale) and Expected Yield for each sample; samples are the Ext, RT and Nx aliquots at 13,857, 1,385, 138, 13.8, 2.8 and 1.4 gc/sample]

Figure 3-2. Plasma qRT-PCR CB4 RNA recoveries. Ext = Post-extraction, RT = Post-Reverse Transcription, Nx = Post-Nextera XT.

After multiple alignments using Burrows-Wheeler Aligner against various CB4 genomes, as well as alignments against Genbank's NT database using Galaxy and MG-RAST, the CB4 sequences were quantified. The Illumina sequencing results matched those of the qRT-PCR, where the limit of detection in both cases was the 1,385 gc/sample dilution. Extrapolating from the qRT-PCR results, 274 gc should have been recovered at this dilution, which is similar to the identified 140 reads generated from the MiSeq. Due to the fact that the qRT-PCR assay begins to show stochastic results below 1 genome copy per µl, our theoretical yield should have been around the 0.02 dilution, or 0.1 input gc. Shown in Figure 3-3 is a similar decreasing trend for inoculated pooled vaginal swab samples.
The obvious patterns here are the recovery efficiencies of the reverse transcription reaction and of Nextera XT. For instance, in the lower dilutions at the RT stage, we recovered up to 84% of the viral RNA. This is relatively high when considering that this RT reaction is randomly primed and that there is an additional loss from nucleic acid extraction. In the Nextera XT normalization, there was an abrupt drop in sensitivity as only 0.09% of the theoretical yield was recovered. Although the largest loss of sample occurred at the plasma inoculation, the drop in recovery was less dramatic. For example, in the lowest dilution (27,714 gc/sample) at the RT step we recovered CB4 at 7,520 gc/µl, and after Nextera XT, 0.03 gc/µl. As with the plasma inoculation, the Illumina sequencing results matched those of the qRT-PCR. In this case sequences for CB4 were only recovered at the highest concentration (27,714 gc/sample). Extrapolating from the qRT-PCR results, we were theoretically expecting between 1 and 2 reads, whereas we recovered 7 reads. As stated above, this is not surprising because of stochastic effects below 1 genome copy per µl.

[Figure 3-3 plot: Genome Copies (log scale) and Expected Yield for each sample; samples are the Ext, RT and Nx aliquots at 27,714, 2,771, 277, 27.7, 5.4 and 2.7 gc/sample]

Figure 3-3. Vaginal swab qRT-PCR CB4 RNA recoveries. Ext = Post-extraction, RT = Post-Reverse Transcription, Nx = Post-Nextera XT.

Nextera XT Sensitivity Conclusions

Using the stated library preparation methodology, the limit of detection was found to be 1,385 gc/sample for the plasma inoculation and 27,714 gc/sample for the vaginal swab inoculation. The limit of detection of 1,385 gc/sample for the plasma inoculation is in line with the aforementioned sensitivity studies without amplification (Cheval et al., 2011; Moore et al., 2011). Our assay was more sensitive than Moore et al.
even though we did not physically subtract any nucleic acid. The increased sensitivity for both vaginal swab and plasma samples could possibly be a result of the Nextera XT protocol, where minimal material is lost as a result of fewer wash steps. Nextera XT uses a method of normalizing the concentration of DNA in each sample by using a pre-determined concentration of polyethylene glycol and magnetic beads such that only a specific concentration of DNA is adsorbed. When we contacted Illumina representatives to discuss the predetermined yields, they were unable to provide any specific information about their proprietary method for library normalization. Although the exact dilution factor is not known for Nextera XT normalization, adjusting simply for the input volume did not account for the loss of sensitivity in both of the inoculations. Another possible explanation could be that the proprietary bead-based normalization procedure was biased against the viral reads. Evolutionary pressures exerted upon viral genomes have selected a compact and heavily structured nucleic acid molecule. This structure could possibly translate to biases in primer binding sites and therefore smaller cDNA molecules relative to those originating from bacteria. Magnetic beads often exhibit oligonucleotide preferences based upon size (Lisa, 2010). This same competition would not have been present in the plasma inoculation. Tagmentation, the insertion of a transposome as well as the adapters needed for Illumina sequencing, is also known to reduce the size of the input amplicons. Further research analyzing the flow-through at each stage in the normalization procedure is needed in order to identify where the loss occurred in the library preparation.

Sequence Classification Sensitivity

One of the fundamental problems inherent to metagenomics is how to most efficiently classify each sequence in a data set.
Oddly, there are few metagenomic projects which thoroughly assess the taxonomic classification accuracy of the selected sequence classification program. This is surprising, as the results and conclusions largely depend upon the selected algorithm. A search of the literature shows that, aside from the studies introducing taxonomic classification tools, one group specifically addressed this issue in 2012 (Bazinet and Cummings, 2012). However, in their analysis this group essentially used the recommended setting for each program. This severely limits the potential of each program, as each pipeline has been optimized with a different dataset and, depending upon the goals of each research project, the settings should be adjusted accordingly. Furthermore, because each program has since been updated and new programs have become available, an updated comparison was required. Therefore, for each of the programs below, I aim to first optimize the pipeline for each specific data set and present the results which represent the maximum potential of each classification program. Finally, this optimized pipeline will be used in the taxonomic characterizations of my experimental samples.

Sequence Classification Programs Introduction

There have been various strategies to taxonomically classify metagenomic reads, consisting of three dominant methods: composition-based, homology-based, and phylogenetic-based classification. Sequence composition methodologies are those that compare features of a read to those of a genome in its entirety (Bazinet and Cummings, 2012). A minority of these approaches are based on analyzing broad genomic principles such as GC content or codon biases. However, these have not been successful in delineating reads to lower and meaningful taxon levels. The dominant approaches use N-mer frequency counting (Rosen et al., 2011). An analogy for this type of classification is to identify the genre of a book by the frequency of words found within it.
A microbiology textbook might be classified by the frequency of 'bacteria' or 'virus' within the text. Furthermore, some of these programs such as Naive Bayes Classifier (NBC) utilize a machine learning algorithm. Briefly, the classifier is 'trained' by computing all of the frequencies for all possible overlapping oligonucleotide motifs of length (n) for all available genomes defined by the user. Subsequently, the same is performed for uploaded reads, and these motif frequencies are compared. It has been shown in multiple studies that these frequencies can be signatures for a given genome. To my knowledge, all of these algorithms are restricted to nucleotide comparisons. Sequence composition methods include PhyloPythia, SPHINX, RDP, Naive Bayes Classifier (NBC) and TAxonomic COmposition Analysis method (TACOA) (Diaz et al., 2009; Lan et al., 2012; McHardy et al., 2007; Mohammed et al., 2011; Rosen et al., 2011).

Another method for sequence classification utilizes phylogenetic relationships. These programs attempt to assign the most probable placement of each read on a phylogenetic tree according to various algorithms for evolution including maximum likelihood, Bayesian methods, or neighbor-joining (Bazinet and Cummings, 2012). Certain programs such as FastUniFrac use the length of the branch connecting each read to calculate the amount the given query sequence has evolved (Hamady et al., 2010). Most programs however are only concerned with the placement of each read in this tree. Most of these programs are restricted to marker genes such as 16S rRNA as the reference tree for phylogenetic placement (Hamady et al., 2010; Lan et al., 2012). Certain programs such as Statistical Assignment Package (SAP) however, have developed algorithms to avoid this limitation by incorporating a ClustalW alignment for phylogenetic tree construction of representative species chosen from an initial BLAST alignment (Munch et al., 2008).
The query sequence is classified as the lowest taxonomic group that defines all members of its sister clade on the respective tree. Given the additional steps and taxonomic information incorporated for each analysis, each read can take minutes to be classified, and therefore most high-throughput sequencing projects will likely utilize an alternative program. However, many algorithms that utilize phylogenetic placement for taxonomic classification are becoming faster and increasingly memory efficient (Bazinet and Cummings, 2012).

The most widespread approach among sequence classifying programs is homology-based classification, where an uploaded read is used as a seed to scan the database for direct sequence similarity. These include all methods that utilize Basic Local Alignment Search Tool (BLAST) (Altschul et al., 1990). The BLAST algorithm has been used for over two decades and is considered to be the 'gold standard' for sequence classification. Akin to frequency-based algorithms, BLAST cuts a query sequence into small words (short subsequences) of a pre-determined size that is dependent upon the input sequence length. BLAST then finds matches of these words in the database and extends outwards from these regions of homology. A scoring system is implemented where points and penalties are assigned for matches, mismatches, the formation of gaps, and subsequent gap enlargements. Depending on the limits set by the user, all extensions are explored in the database until mismatches accumulate above a threshold, whereupon the extension reverts to the alignment with the maximum score. One has the option of searching nucleotide databases as well as amino acid sequences. Since proteins are conserved at the amino acid level more so than at the nucleotide level, many programs such as MG-RAST provide users the option to exclusively utilize protein homologies to define metagenomic reads.
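The seed-and-extend idea described above for BLAST can be illustrated with a deliberately simplified sketch. It is not BLAST's implementation: the extension is ungapped, no E-values are computed, and the function names, word size, scores and drop-off threshold (`best_hsp`, `w`, `match`, `mismatch`, `x_drop`) are all assumptions chosen for illustration.

```python
from collections import defaultdict


def index_words(subject, w):
    """Map every length-w word of the subject to its start positions."""
    index = defaultdict(list)
    for i in range(len(subject) - w + 1):
        index[subject[i:i + w]].append(i)
    return index


def extend(query, subject, qi, si, step, match, mismatch, x_drop):
    """Ungapped extension in one direction (step = +1 or -1).

    Returns the best score seen and the extension length that achieved it;
    stops once the running score falls more than x_drop below that best.
    """
    score = best = length = n = 0
    while 0 <= qi < len(query) and 0 <= si < len(subject):
        score += match if query[qi] == subject[si] else mismatch
        n += 1
        if score > best:
            best, length = score, n
        elif best - score > x_drop:
            break
        qi += step
        si += step
    return best, length


def best_hsp(query, subject, w=4, match=1, mismatch=-3, x_drop=5):
    """Return (score, query_start, subject_start, length) of the best ungapped hit."""
    index = index_words(subject, w)
    best = (0, None, None, 0)
    for qi in range(len(query) - w + 1):
        # Every exact word match is a seed; extend it to the right and left.
        for si in index.get(query[qi:qi + w], []):
            r_score, r_len = extend(query, subject, qi, si, 1, match, mismatch, x_drop)
            l_score, l_len = extend(query, subject, qi - 1, si - 1, -1, match, mismatch, x_drop)
            if r_score + l_score > best[0]:
                best = (r_score + l_score, qi - l_len, si - l_len, l_len + r_len)
    return best


hit = best_hsp("TTACGTACGTAA", "GGGACGTACGTCCC")
print(hit)
```

For this toy pair the sketch reports the shared `ACGTACGT` stretch (score 8, starting at query position 2 and subject position 3). The drop-off rule mirrors the behaviour described above: the extension continues through mismatches until the score has fallen too far, then reverts to the highest-scoring alignment seen.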
The relatively high conservation at the amino acid level may further enhance the detection of novel organisms. However, given that the rate of growth of nucleotide databases surpasses that of protein databases at staggering levels, nucleotide databases offer a much more diverse and expansive collection to screen. Furthermore, as there is no method to efficiently sequence or amplify proteins at sensitivities comparable to that at the nucleotide level, metagenomic sampling takes place only at the nucleotide level. As these projects expand in number, comparing these nucleotide sequences amongst themselves may be fruitful even if the sequences do not have a taxonomic origin. Lastly, as protein scans from nucleotide sequences involve translations in multiple reading frames, nucleotide-level alignments are substantially faster. Depending on the nature of the sequences themselves and the specifications of the data set as a whole, one may have a preference for similarity, composition, or phylogenetic algorithms. Sequence composition-based methods are usually restricted to long sequences (>800 bp) given that the metagenomic read is supposed to represent the oligonucleotide frequency count of the entire genome. However, NBC has surprisingly demonstrated an impressive taxonomic resolution with reads as short as 35 bp (Rosen et al., 2011). These methods also require one to train the dataset with those genomes thought to be relevant to the environment from which the sample was taken. Although this may limit the organisms that can be classified, this same limitation is present in all homology-based programs. One caveat to the sensitivity of training a particular database is that the N-mer frequencies of a transcriptome may be significantly different from those of the genome. Therefore, one must train the Bayes Classifier using all known ORFs for each chosen species.
This may complicate RNA-based studies in that not all ORFs are readily apparent from a genome sequence, and certain transcripts will not be recovered from an experiment, which would therefore alter the N-mer frequency. Thus, the ideal solution may require one to first begin with a transcriptome study of the given organism and calculate the N-mer frequency of this data set. Even so, whether this represents the genes expressed by the organism within the environment in question might in fact be a different question.

Each of these classification methodologies is employed by multiple programs, each with its own custom algorithms. Some are novel while others are a medley of various components native to other programs. However, relatively few cloud-based sequence classification programs exist. During the course of this study, I have been fortunate to witness the evolution of a laboratory moving from a relatively small computer network with limited computational power to one with a robust server equipped with custom scripts designed for complex sequence analyses. During the first half of my research, it was virtually impossible to analyze high-throughput datasets with our initial infrastructure. As in most laboratories, merely storing the data from high-throughput sequencing machines was problematic. For example, a typical run on Illumina's MiSeq or on Roche's 454 FLX system generates 35-40 GB of data. With other Illumina machines such as the HiSeq, typical outputs are over ten times as large. These large files owe most of their size to the images produced in each run for both pre- and post-sequence analysis.
One cross-comparison study of various sequence classification programs found that, on a 2.66 GHz MacBook Pro with 8 GB of 1067 MHz RAM and with datasets that were relatively small compared to those normally produced in high-throughput analyses, many sequence classifying programs required up to ~6,000 MB of RAM for upwards of 70,000 minutes (Bazinet and Cummings, 2012). For many laboratories around the world, it would not be possible to complete the analysis of a typical data set with file sizes over 1000 times larger in a suitable time frame. Furthermore, most laboratories are not fortunate enough to have a computational biologist with the ability to programmatically manipulate data for proper uploading, let alone know how to use each program. Aside from troubleshooting, most web-based programs only require a specific type of data, and analysis is simply performed with a few keystrokes. Importantly, these classification programs centralize various tools, enhancing the opportunity for inter-study comparisons. Finally, because sequencing is expanding at such a rapid pace, most projects will likely move to a cloud-based system for proper storage and analysis. For these reasons, I have chosen to compare and analyze online sequence classifier programs. Table 3-5 shows all of the available classification programs offering a web-based component.

Web-based Sequence Classification Program            Classification Method
Naïve Bayes Classifier (NBC)                          Composition
PhyloPythiaS                                          Composition
Ribosomal Database Project Classifier (RDP)           Composition
Evolutionary Placement Algorithm (EPA)                Phylogeny
FastUniFrac                                           Phylogeny
Galaxy                                                Similarity
WebCarma                                              Similarity
MG-RAST                                               Similarity
MetaBin                                               Similarity
SPHINX                                                Similarity/Composition
MLTreeMap                                             Similarity/Phylogeny

Table 3-5. Web-based classification programs available as of December 2012

However, for this analysis many of these were excluded based upon the incompatible requirements between the program and the dataset to be analyzed.
For instance, SPHINX and FastUniFrac only allow 10,000 and 100,000 reads to be analyzed, respectively (Hamady et al., 2010; Mohammed et al., 2011). Although one can upload components of a single file and concatenate each result file, these programs also limit the number of samples each user can run. For unknown reasons MLTreeMap was unavailable at the time of this study. Due to an institute-wide power outage (personal communication), the MetaBin server was shut down. However, I was fortunate enough to upload my experimental dataset onto this website prior to this crash, but a thorough analysis of the performance of this program was not feasible. EPA has only been tested on datasets with single genes or multiple genes derived from a single species, which is not compatible with most metagenomic studies (Berger et al., 2011). Lastly, PhyloPythiaS does not have an algorithm for human sequences and therefore many sequences will be misclassified. Although NBC is restricted to microbial comparisons as well, PhyloPythiaS was removed from this analysis largely because analyses with query sequences of fewer than 1,000 bases are not recommended (McHardy et al., 2007). Finally, RDP was removed for the obvious reason that it is restricted to rRNA analyses. Therefore, for this analysis, the remaining classification programs consisted of Galaxy, MG-RAST, MetaBin, WebCarma, and NBC.

Galaxy

Galaxy is a web-based genomic analysis tool which offers several computational methods from data quality analytics to sequence analysis (Kosakovsky Pond et al., 2009). Not only is Galaxy freely available to any researcher with access to the internet, but one can also download the Galaxy application and customize it to meet individual needs and specifications. The development of Galaxy was initiated with the intention to solve problems with reproducibility of bioinformatics analysis.
Since there are several tools, each unique in its analytic specificity and sensitivity, the Galaxy team brought many of these tools into one workspace in order to enable inter-study comparisons. Furthermore, each step is compatible with another such that they can be easily concatenated and converted to saved workflows, which can then be shared amongst research groups. This also allows for reproducibility of a specific analysis pipeline. The user begins by uploading raw sequence data in fasta or fastq formats. Sequence files take a matter of minutes to hours to upload depending on the file size. Galaxy has over 100 tools for text manipulation, sequence quality filtering, assembly, mutation discovery, and taxonomic classification. Quality filtering and text manipulation programs usually take a matter of minutes; however, assemblies and taxonomic classifications usually take multiple days. Taxonomic classification on the public Galaxy server is accomplished with a homology-based approach using MEGABLAST. This algorithm is traditionally used for large sequence datasets and is on average 10 times faster than other programs. This speed is due to the fact that it concatenates many of the queries being uploaded in order to save time scanning the database (Zhang et al.). Sequences are aligned to either an NT database or the whole genome sequence (WGS) database, each of which is updated regularly. Taxonomic classification is performed by a lowest common ancestor approach where the lowest taxonomic group for those alignments with similar scores is chosen. Another approach called 'Find diagnostic hits' is used where each result from each read is compared and reads with non-complementary taxonomic classifications at a pre-determined taxon level are removed.

MG-RAST

In 2003, a rapidly growing number of genomes were being sequenced with essentially no analytic tools for comparing them.
A prominent and heavily cited sequence annotation model, the SEED framework, was launched for comparative genome analysis. As opposed to other genomic characterization methods that annotate each genome individually, the SEED framework annotated each individual functional subsystem for all genomes simultaneously. This methodology established a more rapid and specific framework for high-throughput genome analysis, as demonstrated by the Rapid Annotation using Subsystems Technology (RAST) server established in 2008 (Meyer et al., 2008). That same year a separate server using the same comparative technology was initiated by the same team but tailored specifically for metagenomics. The metagenomics RAST (MG-RAST) server was born and has since been used in a variety of studies for various culture-independent analyses. After signing up and requesting an account, the user can upload sequence data in compressed (gzip, zip, tar) format or as fasta/fastq files from Sanger, Illumina, 454, and assembled contigs exceeding 40 kilobases. The user can then easily demultiplex or merge overlapping mate pairs. Because uploading files is coupled to sequence analysis, uploading and analysis takes on average more than 24 hours. However, given the amount of data that is being processed and generated, this is surprisingly fast. One limitation to this method is that there are few options for sequence analysis. The user is provided with the option of filtering out the human genome prior to taxonomic classification, a quality score cut-off, and the maximum number of ambiguous bases called or those that are below the specified cut-offs. Although this removes the opportunity to specify the percent similarity cut-off at this stage, in the post-analysis stage one has the option of specifying the desired percentage cut-offs. Directly after submission, MG-RAST first normalizes the data.
It then scans each sequence for protein encoding genes via BLASTX against the SEED non-redundant database. Collectively, the nonredundant component of each of these databases is termed the M5NR database, consisting of 15,945,780 sequences. In parallel, sequences are also screened for ribosomal RNA against GREENGENES, RDP II, and the European 16S RNA database, collectively containing 309,342 unique ribosomal RNA species. Accessory databases such as the mitochondrial database and the ACLAME database of mobile elements are also screened. All of these databases are screened using the search criteria specific to each. For example, although the protein encoding sequences are screened using an expect value (E) cut-off of 0.01, the ribosomal databases are much more stringent, with this value at 1 × 10^-5 across a minimum of 50 bp in length. The results of these analyses are then used for phylogenetic reconstructions using both the SEED nr and ribosomal databases. Functional classifications are also constructed using the SEED results. All of the above results can be filtered by percent mapping and downloaded accordingly. MG-RAST also provides analytical and visualization tools such as hierarchical clustering and heatmaps to illustrate the abundances of each bacterium at each specified taxonomic level. Taxonomic classification can be performed according to the best hit for each read as well as by grouping the top hits and identifying the lowest common ancestor (LCA) for each read.

MetaBin

MetaBin is a relatively new sequence classification program developed in 2011 (Sharma et al., 2012). It is a homology-based classifier that uses a unique alignment approach. Instead of being restricted to traditional BLAST or BLASTX alignments like that of virtually every other sequence classifier using similarity-based algorithms, users have the option of choosing BLAT (BLAST-like alignment tool) on datasets up to 100 MB.
Where BLAST aligns a query sequence onto an un-indexed database, BLAT aligns sequences to a previously indexed database. As opposed to aligning to raw sequences on Genbank, BLAT aligns to a set of non-overlapping 11-mers, each indexed specifically for each genome on Genbank. This simultaneously speeds up the alignment by 1000-fold and provides genomic information regarding localization. MetaBin capitalizes on these abilities by using a more stringent classification procedure that depends on the spatial configuration of the aligned reads. For short reads, MetaBin considers only those reads that match either the amino- or carboxy-terminus of the protein encoded by the given gene, or those that are of a high percentage match elsewhere. All other reads are discarded. The location of a given read can be important: reads that only partially overlap an ORF, or that span a splice junction in a eukaryotic gene, may be incorrectly aligned to a foreign species that has weak homology across the entire read. Although this may yield a lower E value because the portion of the read aligning to the actual ORF is obviously smaller than the entire read, focusing only on the ORF component yields much higher percent matches and therefore correct classifications in MetaBin. This technique has been shown to be more specific down to the genus/species level and minimizes the use of LCA, which can lead to a higher-level taxonomic assignment. LCA is still used in MetaBin for larger reads which span multiple ORFs.

WebCarma

In 2008, CARMA was developed as a metagenomics sequence classification tool specifically for short reads (Krause et al., 2008). In contrast to those that utilize 16S ribosomal RNA or BLASTN for identification, CARMA scans only for known proteins. One year later, this same team introduced WebCARMA, a refined version of CARMA available as a web application (Gerlach et al., 2009). One of the key features of CARMA is a reciprocal BLAST.
Because assigning a taxon to a metagenomic read necessitates careful analysis, a secondary BLAST was developed in which the top hit of the primary BLAST is searched against a new database that now includes the metagenomic read, providing additional information about the evolutionary distance between the similar reads. For instance, a metagenomic read A is aligned to an NCBI NR protein database via BLASTX. If read A was found to have a high percentage relatedness to database read B and slightly less so to C, WebCarma would store this information in order to set up a reciprocal BLAST. The reciprocal BLAST would start with the highest scoring read (B), now as the query of a BLASTp alignment to a new database which includes the BLASTX results (read C) as well as the original query (read A). If read B still shares the highest homology with A, then the taxonomic group of this read will be assigned to the original query A. However, in the alternate case where B is now more closely related to C, the assignment of read A shifts to the LCA of reads B and C. After signing up, files can be uploaded in raw or compressed FASTA format, meaning that quality filtering is not available.

NBC

The Naive Bayes Classifier (NBC) was developed in 2008 as a sequence composition similarity program (Rosen et al., 2008). As opposed to the aforementioned sequence similarity programs that look at sequence homology between the read and the similar reads in the database, the NBC algorithm was designed with the underlying idea of using characteristic features of a given genome for classification. These features are N-mer frequency comparisons, where N is a predefined number of nucleotides. This group calculated the frequency counts of all sequences of lengths 3, 6, 9, 12, and 15 nucleotides in 635 microbial genomes. These frequency counts are then used to train a naive Bayes Classifier to identify which genome a read may have been a part of.
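The principle of N-mer frequency classification can be sketched with a toy naive Bayes classifier. This is an illustration of the general technique only, not the NBC code: the function names (`nmer_counts`, `train`, `classify`), the add-one smoothing, the uniform prior over genomes and the two invented "genomes" are all assumptions made for the example.

```python
import math
from collections import Counter


def nmer_counts(seq, n):
    """Count every overlapping N-mer in a sequence."""
    return Counter(seq[i:i + n] for i in range(len(seq) - n + 1))


def train(genomes, n):
    """Store N-mer counts and a smoothed denominator for each genome."""
    vocabulary = 4 ** n  # number of possible N-mers over A, C, G, T
    model = {}
    for name, seq in genomes.items():
        counts = nmer_counts(seq, n)
        model[name] = (counts, sum(counts.values()) + vocabulary)
    return model


def classify(read, model, n):
    """Return the best-scoring genome and all log-likelihood scores."""
    read_counts = nmer_counts(read, n)
    scores = {
        # Add-one smoothing keeps unseen N-mers from giving log(0).
        name: sum(k * math.log((counts[nmer] + 1) / denom)
                  for nmer, k in read_counts.items())
        for name, (counts, denom) in model.items()
    }
    return max(scores, key=scores.get), scores


# Invented toy "genomes", for illustration only.
toy_genomes = {
    "AT_rich": "ATATATTAATATTTAAATAT" * 5,
    "GC_rich": "GCGCGGCCGCGCCCGGGCGC" * 5,
}
toy_model = train(toy_genomes, 3)
print(classify("ATTATAAT", toy_model, 3)[0])
print(classify("GCCGCGGC", toy_model, 3)[0])
```

Each read is assigned to the genome under which its N-mers are most probable, so no alignment between the read and any particular gene is needed.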
The user has the choice of one of these lengths, and the frequency counts of the submitted sequence files are calculated and compared to those in the database. The database can easily be changed by simply submitting a text file with the chosen name of a given microorganism. Using a Bayes algorithm, this program uses this submitted file to recover the genomes of each organism so that the characteristics of each genome can be compared and assessed for N-mer frequency comparisons. This removes some dependencies on sequence homology, since genomes typically have unique nucleotide frequency patterns. Using this gene-independent similarity search, one is comparing a sequence to, in theory, the whole genome of a given species or genotype. Although N-mer frequency counts typically require large reads, this machine learning algorithm has found that reads as short as 35 bases could be accurately classified. This is surprising given that the desired frequency count of 15 is the theoretically optimal N-mer size, although each frequency count is quantified with an overlapping window in order to maximize N-mer comparisons. Demonstrating the accuracy of this algorithm, the authors randomly spliced one hundred 25 bp fragments from each of the 635 genomes used to train the data set. They found that BLAST could not identify 287 of these reads whereas NBC classified 177. Overall NBC was able to correctly identify more reads (.4%) than BLAST, even though those sequences originally came from the NCBI database. A user can simply submit sequence files in FASTA format at a maximum of 20 MB. Although this file size is quite small, there is no limit on the amount of data over a given period of time, and therefore one can simply split a larger file into multiple 20 MB components and concatenate each result file.

Sequence Classification Program Accuracy Testing Data Sets

In order to properly test the selected web-based sequence classification programs, two approaches were taken.
First, a simulated Illumina dataset was generated using iMESSi (Mende et al., 2012). This simulator allows a user to specify both the number of organisms and the number of reads, with custom quality scoring. Illumina data simulation, however, is only available in the downloaded version of the program. Although the sequences in the experimental data set are 151 bases long, the simulated reads were 75 bases in length. This smaller size was chosen partly because preliminary analyses revealed that longer simulated reads presented relatively little difficulty for the classification tools. Secondly, although the experimental data set consists of read lengths of 151 bases, a significant percentage of reads are shorter and therefore a 75 base homology was set for certain classification measures. Lastly, the short read lengths inherent to Illumina sequencing technology have been met with skepticism regarding their usage in metagenomic analyses. Therefore, thoroughly testing short reads for their taxonomic resolution capability is important. The 11,249 sequences generated derive from 10 genomes from 9 different families. Importantly, two species were chosen from the same genus, Streptococcus pneumoniae and Streptococcus pyogenes. The reads from each genome are randomly distributed, although the number of reads recovered for each genome was disproportionate, as shown in Table 3-6. Although 11,249 reads is only a fraction of the one million reads targeted as the sequence depth of our dataset, this dataset was sufficient to test the taxonomic classification accuracy of each program as well as to account for the number of sequences that would contain too many errors to accurately classify.
Species                         Abundance
Alkaliphilus metalliredigens    478
Bacillus clausii                964
Haemophilus influenzae          462
Pseudomonas stutzeri            525
Rhodococcus jostii              485
Salmonella enterica             498
Staphylococcus aureus           470
Streptococcus pneumoniae        483
Streptococcus pyogenes          486
Thiobacillus denitrificans      5969
Unclassified                    429

Table 3-6. Simulated data set totals for each species

The other data set used to compare the sequence classification programs is an experimental data set provided to the BCCDC by another laboratory at Simon Fraser University (Fiona Brinkman and colleagues). In total, one nanogram of DNA from a combination of two gram-negative species (Pseudomonas aeruginosa PAO1 (family Pseudomonadaceae) and Rhodobacter capsulatus SB1003 (family Rhodobacteraceae)) and two gram-positive species (Nocardioides sp. JS614 (family Nocardioidaceae) and Streptomyces coelicolor A3 (family Streptomycetaceae)) was spiked into the sample in an equimolar fashion. These samples were initially extracted with Sigma's GenElute bacterial genomic DNA kit. As with our experimental samples, the sequence library was prepared with Nextera XT and run on a MiSeq. Therefore, this sample is highly analogous to our method and should provide a representative data set sufficient for testing the potential of the selected sequence classification programs. One data set from this experiment, containing 150,948 sequences, was selected.

Family               Number
Nocardioidaceae      495
Pseudomonadaceae     95,086
Rhodobacteraceae     27,570
Streptomycetaceae    17,346
Burkholderia         2,602 (3,236*)
Xanthomonas          1,948
Unclassified         5,287

Table 3-7. Total reads for each family in the Bacteria spiked data set. * Refined value (see Table 3-15)

For each file, the sequences were first taxonomically characterized using NCBI's BLAST algorithm on a server at the BCCDC. Initially, each file was aligned using a 95% cutoff.
For those sequences which were below this cut-off, an additional alignment was performed at 90%. Shown in Table 3-7 are the classifications for the Bacteria spike. For the simulated data set and the Bacteria spike, 429 and 5,287 of the sequences were unclassifiable, respectively. Where discrepancies arose between a classifier and this initial set of BLAST results, the results were verified with BLASTN on NCBI against GenBank's NT database.

Measuring Taxonomic Classification

In order to assess the accuracy of the sequence classification programs, three measures were used. Precision is defined as the ratio of the total correct alignments to the total reads aligned. This measure compensates for stringency differences between the programs by considering only those reads that are aligned to each respective database. Sensitivity, however, is defined as the ratio of the total correct alignments to the total sequences uploaded onto each program. Sensitivity becomes important when a program is too stringent and reports only those reads with obvious matches in its database. In these cases, the sensitivity measure will highlight those programs that rely disproportionately on the reads they define as classifiable. Furthermore, this measure allows more direct inter-program comparisons, as it is relative to the total number of reads generated. In order to rank the programs by these two values, we used accuracy, which is the average of precision and sensitivity. Each of these values was calculated at three taxonomic levels: species, family, and order. In order to represent the many parameters tested, each will be presented in this abbreviated format: 90P_20A_E03. This translates to an alignment with 90% similarity over 20 units (nucleotides or amino acids), at a maximum E-value of 1x10^-3.

Simulated Illumina Data Set Results

MG-RAST

MG-RAST offers multiple tools to analyze a metagenomic dataset.
For taxonomic classification there are two dominant methods that MG-RAST relies upon: 'Best Diagnostic Hit' (BDH) and 'Lowest Common Ancestor' (LCA). BDH assigns a read to a given taxon if it has the highest similarity amongst the generated hit table, whereas LCA takes a subset of this hit table and finds the shared common ancestor at the lowest taxon possible. Unfortunately, LCA only provides the user with an abundance value for each taxon. This unusual value yields the number of hits for each taxonomic level in the database, as opposed to the number of unique hits among the uploaded sequences. Therefore, the total abundance may be greater than the number of uploaded reads. Since this is not a useful count in this analysis, BDH was selected as the sole method of taxonomic classification. MG-RAST offers an array of stringencies within BDH that a user can customize. As in all classification programs, there was a trade-off between sensitivity and specificity depending on the stringencies applied. For amino-acid similarity, the recommended methodology consists of a minimum alignment length of 15 amino acids at 60 percent similarity. Nucleotide similarity was also compared using the M5RNA database; however, despite adjustments to the stringency settings, a maximum of 0.2% of reads were aligned (Figure 3-5). Therefore, the remaining MG-RAST analyses focus on the M5NR database. Using the recommended settings, 60P_15A_E05, only 1.2% of the total reads were aligned to a genome in the MG-RAST database. Of these, only 74 were correctly aligned, which is reflected in the sensitivity value of just below 0.6%. As seen in Figure 3-4, whether the minimum percentage or the amino acid alignment length was adjusted for each family, no significant differences in the sensitivity or the precision were observed.

[Figure 3-4 plot: y-axis, Percentage Recovery; x-axis, Stringency Settings; series: Prec, Sens]

Figure 3-4.
Primary MG-RAST precision and sensitivity values for the simulated data set at the family level. ##P = percent similarity, ##A = minimum amino acid alignment length, E## = E-value of 1x10^-##

Furthermore, the dominant order Hydrogenophilales was absent in these alignments. Since the alignments did not change when the above features were adjusted, another variable to consider is the E-value. The E-value is a measure of confidence in an alignment, briefly defined as the number of alignments with a score equal to or greater than that of the alignment in question that would be expected by chance when searching a given database. Furthermore, the alignment length is inherent to the calculation of an E-value, and therefore the default E-value used by MG-RAST (1x10^-5) could be too stringent for short reads. After this value was adjusted to 1x10^-3, not only was Hydrogenophilales now present, but it was designated as the most abundant taxon. As expected, relaxing this threshold increased the total number of reads aligned, to roughly 72% of the total. Interestingly, as shown in Figure 3-5, using the remaining standardized settings along with this adjusted E-value, both the precision and the sensitivity also increased significantly. As the percentage match and amino acid length were adjusted, these values increased, but only slightly. The optimal set of values based on the accuracy value in this analysis was found to be an E-value of 1x10^-3 with at least 20 amino acids aligning at 95% similarity (95P_20A_E03). Here the precision for the species, family and order levels ranged from 90.6-96.6%, whereas the sensitivity ranged from 46.9-48.9%. These values can be seen in Table 3-8 below.

[Figure 3-5 plot: y-axis, Percent Recovery; x-axis, Stringency Settings; series: Prec, Sens]

Figure 3-5. Secondary MG-RAST precision and sensitivity values for the simulated data set at the family level.
##P = percent similarity, ##A = minimum amino acid alignment length, E## = E-value of 1x10^-##, M5RNA being the RNA database

               Species     Family      Order
Precision      0.906346    0.965147    0.965929
Sensitivity    0.469731    0.487421    0.488932

Table 3-8. MG-RAST precision and sensitivity results for the simulated data set using the optimized stringency settings for MG-RAST, 95P_20A_E03

Galaxy

The public server on Galaxy provides users with a MEGABLAST option as discussed above. Here one is given options such as a percent similarity cut-off as well as a customizable E-value. The alignment length can subsequently be adjusted post-alignment. Initially, the recommended settings were chosen, which consisted of an E-value of 1x10^-3 and a 90% similarity cut-off. After classifying the taxonomic identity of these reads, the next step is to 'Find the lowest diagnostic rank' (FLD), which is equivalent to the lowest common ancestor algorithm. Although there is another option, 'Find diagnostic hits', which counts the reads with unique taxons, it produces significantly lower precision values (~6 times lower). This difference in precision was enough to choose FLD as the representative endpoint of the taxonomic pipeline of Galaxy. The recommended settings produced modest results, where 18.4% of the reads were aligned with precisions ranging from 69-96% for species to order, respectively. When the similarity cut-off was dropped to 80%, the proportion of reads aligned to a genome increased to 26.9%, while the precisions dropped to 53-86% for the specified taxonomic range. Due to the increased number of total alignments, however, the sensitivity correspondingly increased by roughly 5% for each taxon. Furthermore, upon increasing the percentage similarity above 90%, no significant differences were seen.
Therefore, although there was roughly a 5% loss in sensitivity at the 90% cut-off relative to 80%, this sacrifice in sensitivity was outweighed by a considerably greater gain in precision. As in the MG-RAST analyses at the E-value of 1x10^-5, the dominant species Thiobacillus denitrificans of the family Hydrogenophilaceae was not detected in Galaxy. Galaxy's default, as stated above, is already 1x10^-3, but this value is also relative to each database. Upon adjusting this value to 0.01, just as in MG-RAST, T. denitrificans became the most abundant species, accounting for over 50% of the total mapped reads. The total number of reads aligned increased from just over 2,000 to 10,010. Although there was a slight drop in precision at the family and order levels (~1%), in all other measures both the sensitivity and the precision were significantly higher. As seen in Table 3-9, the sensitivity for these alignments increased by roughly 400% relative to the alignment using the E-value of 0.001.

Galaxy Alignment    90PE02      90PE03
Prec                0.812924    0.692122
Sens                0.723531    0.1273

Table 3-9. Galaxy precision and sensitivities for the simulated data set comparing the recommended settings (90PE03) to the initial adjustment of 90PE02; the alignment length is unrestricted

Further modifications of these values at the family level, shown in Figure 3-6, led to the highest accuracy: a 95% alignment with no restrictions on alignment length and an E-value of 0.01.

[Figure 3-6 plot: y-axis, Percent Recovery; x-axis, Stringency Settings; series: Prec, Sens]

Figure 3-6. Galaxy precision and sensitivities for the simulated data set at the family level

NBC

The Naive Bayes Classifier (NBC) provides users the options of specifying the sequence length (N-mer) of the frequency comparisons as well as customizing the genome lists to compare. Although the option of the N-mer length is provided, Rosen et al. established that the number of accurate sequence identifications increases exponentially with increasing N-mer length, with 15 nucleotides determined as the optimal length (Rosen et al., 2011). Updating genome lists on NBC is remarkably simple, as all that is required is uploading a text file with the names of the genomes one desires. I tested this classifier with N-mer lengths of 12 and 15 on the genome lists provided by NBC for Bacteria and Archaea. After a file was uploaded, the time to completion of the analysis was roughly 24 hours, similar to the other sequence programs. An email is sent to users along with a collection of files containing all of the results along with a summary file of the top hits for each read. Almost 100% of the reads were assigned to a genome, as opposed to MG-RAST and Galaxy, where this could not be achieved even with the lowest stringencies. As shown in Table 3-10, the precision for all taxon levels studied was over 93% for the N-mer length of 15. Because nearly all of the sequences were assigned, the sensitivity values were essentially identical to the precision values. Concerning the differences between the N-mer specifications, as shown in Table 3-10, the 15-mer precision and sensitivities were higher at all taxon levels.

               N-mer 12                            N-mer 15
               Spec        Fam         Ord         Spec        Fam         Ord
Precision      0.572559    0.628767    0.675885    0.937861    0.950129    0.96133
Sensitivity    0.572406    0.628767    0.675705    0.937861    0.950129    0.96133

Table 3-10. N-mer length comparison for NBC

MetaBin

MetaBin provides users with the option of analyzing FASTA files using two different alignment algorithms, BLASTX or BLAT. The parameters which are customizable for MetaBin users include a minimum Bit-score along with a Bit-score range, the minimum number of reads to form a taxonomic bin, and lastly, the option to include eukaryotes in the database. Unfortunately, at the time of this analysis the MetaBin server was down for the latter half of 2012 due to a building-wide power outage.
Although MetaBin is available for download, as stated above, only web-based taxonomic classification programs are being considered in this analysis. Fortunately, before the decision was made to test all of the variables on MetaBin, this file had been tested using the default parameters prior to the server crash. MetaBin aligned 80.6% of the reads in this particular data set. Of these alignments, as shown in Table 3-11, the sensitivities ranged from 64.4-73.0% from species to order, while the precision ranged from 80.0-90.6%, respectively.

MetaBin        Spec        Fam         Ord
Precision      0.799824    0.888717    0.906474
Sensitivity    0.64468     0.71633     0.730643

Table 3-11. MetaBin precision and sensitivity values for the simulated Illumina data set

WebCarma

WebCarma has the fewest options for users to customize. Users have the sole option of uploading a FASTA file for processing. As with other classification programs, roughly 24 hours is required for sequence analysis. WebCarma aligned 1,604 reads, a dismal 14.2%. As shown in Table 3-12, the precision of these alignments ranged from 33.1-86.4% for species to order, respectively, whereas due to the poor alignment percentage the sensitivities ranged from 4.7-14.2%.

WebCarma       Species     Family      Order
Precision      0.331671    0.842204    0.86412
Sensitivity    0.047293    0.119566    0.142057

Table 3-12. WebCarma alignment calculations of the simulated Illumina data set

Simulated Data Set Conclusions

Analyzing this data set was informative because the identity of every read was known a priori, and therefore an accurate assessment of each program could be made. Furthermore, testing shorter sequences was also important because a significant percentage of our experimental data set consisted of reads below 80 bp. For those programs which provided users the option of changing the E-value, it was consistently demonstrated that this value needed to be adjusted. In these cases the same sequences which needed E-value adjusting were realigned using NCBI's BLASTN.
Oddly, the BLASTN E-values were within the range of the original E-value cut-off used in the initial alignment. It is assumed that the poor alignment efficiency is due either to algorithmic differences in alignment scoring or, more likely, to differences in the number of sequences that exist in each database, the latter being a variable used to calculate E-values. Shown in Figure 3-7 are the accuracy values of the classification programs at each respective taxonomic level. NBC, the only program that does not employ a sequence homology-based algorithm, had the highest values at every taxonomic level. However, one large difference from the other sequence databases is that the NBC database (the genome list) is limited to prokaryotic N-mer counts; eukaryotes such as humans are not included. This lack of complexity in the database could therefore enhance the precision and sensitivity of NBC for bacterial classification. It also severely limits mammalian metagenomic analyses, as short Illumina-generated reads frequently have homology to both human and microbial genomes. This is exemplified by an uploaded experimental file (file A-48, BV study) which unexpectedly generated 2,391 hits for human adenoviruses. Upon BLAST confirmation, all of these were found to be of human origin. In order to quantitatively address the issue of similarity, hierarchical clustering was also applied using sequence counts for each method at the family level. Figure 3-8 shows that NBC was most closely related to the actual simulated microbial community, followed by Galaxy and MetaBin. MG-RAST was shown to be slightly divergent as it was further down the tree, but WebCarma was the most distantly related as it was on a separate branch entirely. This order is the same as that predicted by the accuracy values in Figure 3-7. Collectively, these results demonstrate that Illumina sequencing can be a valuable tool for metagenomic analyses.
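The accuracy ranking described above can be recomputed from the tabulated values. The following is a minimal sketch rather than part of the original analysis: it averages the species-level precision and sensitivity values reported in Tables 3-8 to 3-12, following the definition of accuracy used in this chapter. The Galaxy pair is taken from the 90PE02 column of Table 3-9, which is an assumption made here because species-level values for the optimized Galaxy settings are not tabulated; the function and variable names are illustrative only.

```python
# Species-level (precision, sensitivity) pairs copied from Tables 3-8 to 3-12.
# Assumption: Galaxy uses the 90PE02 column of Table 3-9, since the optimized
# Galaxy settings are not tabulated at the species level.
RESULTS = {
    "MG-RAST": (0.906346, 0.469731),
    "Galaxy": (0.812924, 0.723531),
    "NBC": (0.937861, 0.937861),
    "MetaBin": (0.799824, 0.644680),
    "WebCarma": (0.331671, 0.047293),
}


def accuracy(precision, sensitivity):
    """Accuracy as defined in this chapter: the mean of precision and sensitivity."""
    return (precision + sensitivity) / 2.0


def rank_by_accuracy(results):
    """Return (program, accuracy) pairs sorted from highest to lowest accuracy."""
    scored = [(name, accuracy(p, s)) for name, (p, s) in results.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for name, acc in rank_by_accuracy(RESULTS):
        print(f"{name:<9} {acc:.3f}")
```

Under these assumptions the resulting order is NBC, Galaxy, MetaBin, MG-RAST, WebCarma, which is the same order described for the hierarchical clustering.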
[Figure 3-7 plot: y-axis, Percent Recovery; x-axis, Sequence Classification Program (WebCarma, Galaxy, MetaBin, NBC, MG-RAST); series: Species, Family, Order]

Figure 3-7. Simulated data classification accuracy (average of precision and sensitivity values) for each sequence classification program used to analyze the simulated data set.

Figure 3-8. Hierarchical clustering of the sequence reads recovered for each data set, where 'Actual' is a list of the true taxonomic classifications of each sequence.

Experimental Bacteria Spike

Another essential step in assessing the potential of each taxonomic classification program is testing a control sample that has been prepared in an identical or similar manner to the experimental samples. This data set, provided by Fiona Brinkman and colleagues, is derived from sequencing of cultured microorganisms that were inoculated into a water sample. As stated above, 150,948 sequences of length 151 bp were uploaded onto each sequence classification program. In contrast to the simulated data set, the origin of each read is not known, and due to errors in sequencing a certain percentage may become unclassifiable to any taxon. Furthermore, given the high sensitivity and low specificity of high-throughput sequencing, contaminants could have been introduced from the reagents throughout the procedure. In fact, a significant percentage of reads were specific to families not spiked in this procedure, including Xanthomonadaceae and Burkholderiaceae, two families that are known water contaminants. These families were included in the analysis to best illustrate the potential of each program. Therefore, this data set provides many important variables to test that would be impossible to replicate with simulated data. Another variable that became important for this particular data set was the time it took to analyze the data. Unlike the relatively small file from the simulated data, this file was 30.1 MB.
Although this is small relative to most high-throughput sequencing data file sizes, certain programs chosen for this study have size constraints.

NBC

Similar to the simulated data set, NBC was able to assign a taxon to almost every single read (>99.9%). Therefore, the precision was roughly equal to the sensitivity. In contrast to the simulated data set, these values were significantly lower at all three taxon levels, ranging from 58% at the species level to 85% at the order level. Table 3-14 shows these values at the family level. The majority of the sequences that were misclassified belonged to families such as Polyangiaceae (2.6%), Bradyrhizobiaceae (2.5%), Methylobacteriaceae (2.0%), Comamonadaceae (1.8%), and Catenulisporaceae (1.3%), as shown in Table 3-13.

Misclassified Families    Counts    % of Total
Polyangiaceae             3944      2.61282
Bradyrhizobiaceae         3783      2.506161
Methylobacteriaceae       3004      1.990089
Comamonadaceae            2767      1.833082
Catenulisporaceae         2037      1.349471
Rhizobiaceae              1815      1.202401
Streptosporangiaceae      1722      1.14079
Haliangiaceae             1689      1.118928
Myxococcaceae             1147      0.759864
Frankiaceae               1099      0.728065
Actinosynnemataceae       1078      0.714153

Table 3-13. Misclassified reads using NBC totalled for each family adjacent to the percent recoveries

However, there were 9,456 additional hits for Burkholderiaceae when compared to the original taxonomic identifications shown in Table 3-7. Therefore, all of the 11,719 Burkholderiaceae sequences were realigned to the NCBI Bacteria database, this time with a relaxed stringency, and each read was visually assessed in the BLAST alignments. Interestingly, only 2,178 of these reads aligned to Burkholderiaceae, while the majority of the remaining sequences aligned to the other spiked genomes. For instance, 3,639 reads aligned to Rhodobacteraceae instead. These results are also shown in Table 3-15. Similarly, a discrepancy also existed with Xanthomonadaceae, where a similar analysis showed that just 704 out of 2,441 reads aligned to this family.
Despite these misclassifications, NBC was the only sequence classification program to identify a significant percentage of Nocardioides (424/495). The time it took NBC to analyze this data set was not significantly longer than for the simulated data set, as the file was completed in roughly 36 hours.

MetaBin

MetaBin classified 85.9% of the reads in the data set. The precision ranged from 70.8-81.7% from species to order, while the sensitivity ranged from 40.5-46.8%, respectively. Table 3-14 displays these values at the family level. MetaBin was also quite efficient in terms of time, taking roughly 20 hours to complete.

WebCarma

Similar to the NBC results, there was a significant number of reads (12,217) binned in the Xanthomonadaceae family. Upon realigning these reads using BLAST, only 6.5% (798) aligned to this particular family. Another inconsistency was present in the Burkholderiaceae, where 3,326 sequences were binned compared to the 2,602 in the tabulated data set shown in Table 3-7. As shown in Table 3-15, in contrast to the previous inconsistencies, 3,236 of these were confirmed as members of this family, 634 more than previously known. Overall, there was a significant increase in the accuracy of WebCarma relative to the simulated data set, with the sensitivity ranging from 35-64.7% and the precision ranging from 43.1-79.6%. The time it took WebCarma to complete analyzing this file was 30 hours. WebCarma states on its website that 30 MB is the maximum file size allowed, but this particular file was larger than this restriction and it was still aligned without complication.

               WebCarma      Galaxy      MetaBin     NBC         MG-RAST
Precision      0.7859402     0.991616    0.809076    0.691828    0.96162
Sensitivity    0.63873652    0.371386    0.463497    0.691828    0.586

Table 3-14. Precision and sensitivity values for the classification programs tested at the family level

Source      Burkholderiaceae Counts    BLASTN Result
NBC         11719                      2178
WebCarma    3326                       3236
Actual      2602                       2602

Table 3-15.
Burkholderiaceae discrepancies totalled for NBC and WebCarma relative to Table 3-7 (Actual)

MG-RAST

Given the observations in the simulated data set, I extended this analysis to the bacteria spike. As shown in Figure 3-9, similar to the simulated data set, relaxing the E-value cut-off to 1x10^-3 significantly improves the accuracy of MG-RAST. Consistent with the analysis of the simulated data, aligning to the M5RNA database again significantly restricted the accuracy of taxonomic classification, with less than 0.1% of the reads aligned at most despite stringency adjustment.

[Figure 3-9 plot: y-axis, Percent Recovery; x-axis, Stringency Settings; series: Prec, Sens]

Figure 3-9. MG-RAST optimization using sensitivity and precision values for MG-RAST according to each of the stringency settings tested.

Changing the minimum number of aligned amino acids from the default of 15 to 20 did not significantly change the precision values, whereas the sensitivity increased slightly. However, increasing the percentage cut-off corresponded with a significant decrease in sensitivity and an increase in precision. This inverse relationship arises because the total number of sequences aligned begins to decrease, while the reads that do align are more likely to be aligned correctly. Provided that precision and sensitivity are equally important, the accuracy value demonstrated that the increase in precision slightly outweighs the loss in sensitivity. Therefore the optimal setting here is an 80 percent cut-off over a minimum of 20 amino acids, at an E-value of 1x10^-3. Overall, MG-RAST displayed a relatively high accuracy, where the precision for the three taxon levels ranged from 34.7-96.3% while the sensitivity ranged from 22.7-59.1%. MG-RAST took roughly 24 hours to complete the analysis.

Galaxy

Shown in Figure 3-10 are the different parameters tested for Galaxy.
Similar to the conclusions from the simulated data set, the higher E-value of 0.01 increased the sensitivity significantly while only slightly dropping the precision. Furthermore, increasing the percentage cut-off to 95% slightly increased the precision without significantly changing the sensitivity. Changing the minimum alignment length from 70 to 100 nucleotides did not significantly alter either value. Therefore, the optimal stringencies according to the accuracy value under the parameters tested are a 95% similarity over a minimum of 70 nucleotides, at an E-value of 0.01.

[Figure 3-10 plot: y-axis, Percent Recovery; x-axis, Stringency Settings; series: Prec, Sens]

Figure 3-10. Galaxy bacteria spike optimization results using precision and sensitivity measures.

Using these optimized stringencies, Galaxy yielded precisions ranging from 96.5-99.2% and sensitivities from 36.1-37.2% for the three taxonomic levels, respectively. The results at the family level are shown in Table 3-14. Importantly, this file took roughly 72 hours to complete, as processing did not even begin until 24 hours after the file was uploaded. This delay in file analysis was consistently observed for multiple files of different sizes.

Experimental Data Set Conclusion

Each taxonomic classification program used in this analysis displayed different strengths. In terms of the precision and sensitivity values, Galaxy had the highest precision at all taxonomic levels, while NBC produced the highest sensitivity. However, since Galaxy also had the lowest sensitivity at the family and order levels, it is clear that measuring the precision in isolation is insufficient. The accuracy is shown in Figure 3-11 for the three taxonomic levels analyzed. MG-RAST had the highest accuracy at every taxonomic level other than the species level, for which it had the lowest value. Galaxy had the highest accuracy at the species level but was 28.2% lower than MG-RAST at the order level.
[Figure 3-11 plot: y-axis, Accuracy Value; x-axis, Sequence Classification Program (WebCarma, Galaxy, MetaBin, NBC, MG-RAST); series: Species, Family, Order]

Figure 3-11. Bacteria spike classification accuracy measurements for each sequence classification program for all taxons tested.

Hierarchical clustering, shown in Figure 3-12, found that WebCarma and Galaxy were most similar to the known community, followed by MG-RAST and NBC. MetaBin was shown to be the most divergent, as it existed on a separate branch. This is surprising given that Galaxy recovered virtually no sequences from the family Nocardioidaceae, whereas NBC recovered 85.9% of them.

Figure 3-12. Clustering of Classification Programs for Bacteria Spiking.

Discussion

Comparing these two datasets, there are significant differences between them that need to be addressed. In the simulated data set, NBC produced both the highest precision and sensitivity values at each taxon level, whereas in the bacteria spike it was at most second best at the family level. This was surprising, as the accuracy of N-mer frequency comparisons would presumably be proportional to the sequence length, in that each read is representative of the N-mer count for the genome as a whole. Perhaps the actual mutations produced during sequencing affect NBC classification more than they affect sequence homology-based programs. The NBC algorithm is also limited in that it does not provide N-mer frequency comparisons to mammalian genomes, which may effectively exclude it from metagenomic studies of mammalian microbiomes. Although one can remove the human sequences from a given data set prior to uploading onto NBC, human sequences frequently remain in high abundance following this computational subtraction. Furthermore, because the N-mer frequency of an organism's transcriptome may not match that of its genome, the accuracy value will likely be affected when studying RNA or transcriptome data.
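The relationship between read length and the number of N-mer observations can be made concrete with a short example. The following is a minimal sketch of overlapping-window N-mer counting, written for illustration only; it is not the NBC implementation, and the function names and the toy read are hypothetical.

```python
from collections import Counter


def count_nmers(read, n):
    """Count every N-mer in `read` using an overlapping window with a step of 1."""
    read = read.upper()
    return Counter(read[i:i + n] for i in range(len(read) - n + 1))


def window_count(read_length, n):
    """Number of overlapping N-mer windows available in a read of this length."""
    return max(read_length - n + 1, 0)


if __name__ == "__main__":
    toy_read = "ACGTACGTAC"  # hypothetical 10-base read
    print(count_nmers(toy_read, 4))
    # Read lengths mentioned in this chapter, with the N-mer length of 15.
    for length in (35, 75, 151):
        print(length, window_count(length, 15))
```

With an N-mer length of 15, a 35-base read contributes 21 overlapping windows, a 75-base read contributes 61, and a 151-base read contributes 137, which is why longer reads would be expected to represent the genomic N-mer profile more completely.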
Both NBC and WebCarma challenged the original BLASTN-mediated classification by producing higher counts of both Burkholderiaceae and Xanthomonadaceae. Only WebCarma did so successfully, and only for the family Burkholderiaceae, where 634 additional reads were added to the total in the original tabulation. This result was intriguing, especially since the subsequent BLASTN clarification alignment demonstrated that all of these reads aligned to a Burkholderia genome at an average of 91.6% similarity covering a minimum of 123 bases, as shown in Figure 3-13. This is theoretically well within the range of the initial BLASTN alignment. It is currently not known why these reads failed to be classified initially, but it raises the question of whether NCBI's BLASTN should be the gold standard for determining a taxonomic placement. It is attractive in that each result produces an alignment that one can visually assess along with all of the information needed to determine the accuracy of the alignment. However, which particular genomes are not included in these results remains the central problem. More research into the discrepancies between the algorithms is needed, as without a gold standard it is difficult to compare and optimize each additional program. Although this problem was addressed by including a simulated data set, the lack of discrepancy between this data and the original suggests that the simulated data may not have accurately reproduced the error rates observed in our Illumina sequencing data. This stresses the importance of simulating data that reflect actual sequencing errors.

[Figure 3-13 plot: y-axis, Nucleotide Length; x-axis, Quantity of Sequences; series: Aligned, Algnmt Length]

Figure 3-13. WebCarma Burkholderiaceae alignment lengths (Algnmt Length) versus the total nucleotides aligned (Aligned).
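The statement that a 91.6% alignment falls within the range of the initial BLASTN classification can be checked against the two similarity cut-offs described for that classification (95%, followed by 90% for the remaining reads). The following is a minimal sketch that assumes percent similarity alone decides the outcome; the function name and return labels are hypothetical, and any alignment-length or E-value criteria applied in the actual pipeline are not modeled.

```python
def identity_tier(percent_identity, first_cutoff=95.0, second_cutoff=90.0):
    """Return which pass of a two-tier similarity cut-off a hit would satisfy.

    Assumption: only percent identity is considered; the cut-off defaults
    follow the 95% and 90% values described for the initial BLAST runs.
    """
    if percent_identity >= first_cutoff:
        return "first pass"
    if percent_identity >= second_cutoff:
        return "second pass"
    return "unclassified"


if __name__ == "__main__":
    # Average similarity reported for the additional Burkholderia reads.
    print(identity_tier(91.6))
```

Under this simplified model the reads in question would be retained by the second, 90% pass, which is consistent with the observation that their initial failure to classify is unexplained.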
Shown in Table 3-16 is the time for each sequence classification program to complete each analysis, from submission to the acquisition of results. All except MetaBin took 24 hours or longer. Time is a difficult variable to test in that the speed greatly depends upon how many other jobs are being run simultaneously. Furthermore, the computational power at each individual institution is likely not equal, and will of course be upgraded over time. If, for instance, the goal of this analysis was to use these methods as a clinical diagnostic tool, time would be much more important, and a more accurate analysis would be required, perhaps using the average of various uploads during peak and off-peak hours for each location. In that situation, in-house computation would be suggested. However, the observation that Galaxy took roughly 72 hours to complete the analysis of a 30 MB file is not conducive to analyses which usually contain dozens of files at roughly 10 times the size. Multiple files were analyzed and, seemingly independent of their size, most took several hours before analysis began.

Time         Illumina    Bacterial Spike
NBC          24          36
WebCarma     24          30
MetaBin      17          20
MG-RAST      24          24
Galaxy       36          72

Table 3-16. Time analysis for each classification program (in hours) from submission to receipt of results

When considering which stringencies to utilize for our experimental files, although NBC is a promising classification tool, it cannot be used until it is capable of analyzing mammalian N-mer frequencies. Although Galaxy was second in the simulated data set, it did not perform as well in the bacteria spike. Furthermore, given that the experimental files are on average 10 times larger, the time variable begins to play a more significant role in data analysis. Aligning these datasets to the M5RNA database of MG-RAST, which includes the rRNA database, produced a surprisingly low yield, less than 1% of the reads.
Therefore, as MG-RAST produced the highest accuracy values for the family and order levels in the bacteria-spiked sample, stringency settings of an 80 percent cut-off over a minimum of 20 amino acids, at an E-value of 1 x 10^-3, will be used for our experimental samples.

3.3 Bacterial Vaginosis Metagenomics Results

Sample Information

The vaginal swabs were collected as part of a larger vaginal microbiome study (Vogue) analyzing several components of the healthy and unhealthy female genital tract. The HIV(-) cohort came from Vogue Study 1A, while samples from the HIV(+) cohort were collected from Vogue Study 1B (see methods). Patients meeting the inclusion and exclusion criteria (listed in methods) were selected, and a subset of these was chosen for this study. In total, 29 samples were included, 7 HIV(-) and 22 HIV(+). All of these patients were assessed for BV as diagnosed by an obstetrician-gynaecologist using the Nugent score. As shown in Table 3-17, there are a total of 8 BV(-) patients, including 4 HIV(-) and 4 HIV(+). Three BV intermediate cases were included, 1 HIV(-) and 2 HIV(+). Finally, there were 18 BV(+) samples, consisting of 2 HIV(-) and 16 HIV(+) patients. Demographic and clinical data were also collected for each study participant, including but not limited to antimicrobial usage within the 3-month interval prior to sample collection, menstrual cycle, sexual history, Nugent score, HIV viral loads, CD4 counts, and HCV sero-positivity/PCR.

N = 29         HIV(-)   HIV(+)
BV(+)          2        16
Intermediate   1        2
BV(-)          4        4

Table 3-17. BV/HIV patient stratification

Sequencing

All extracted samples were DNased, reverse transcribed, and directly loaded into the Nextera XT library preparation (see methods). Samples were multiplexed such that 15 samples were loaded onto each MiSeq run.
Although the Nextera XT method inherently normalizes each sample, a KAPA qPCR was performed in order to both verify the efficiency of this step and ensure the desired concentration for roughly 1 million reads per sample. This method quantifies the ligated adapter sequence and therefore provides an estimate of the dsDNA concentration. Shown in Table 3-18 is the number of sequences generated for each sample. Although the majority of samples contain more than 800,000 reads, two samples generated fewer than 100,000. For these samples, A-53 and B-39, despite multiple extractions, there was a limited amount of nucleic acid material available as quantified with a Quant-iT™ high-sensitivity DNA assay kit using a Qubit® fluorometer (Invitrogen, Carlsbad, CA, USA). Therefore, to meet the needs of the initial Nextera XT reaction, which requires only 1 nanogram of DNA, the maximum volume possible was loaded. However, even at this volume, the initial concentration was still below that required. For these samples, the input amount into the MiSeq reaction was adjusted upwards (based on the KAPA qPCR quantitation) to account for the lower starting nucleic acid concentration; however, lower numbers of reads were still produced. Nonetheless, these samples were included in the pipeline and analyzed identically to the other samples.
Sample     Total BP       Total Reads
A-47       96,263,708     637,508
B-1        155,280,850    1,028,350
B-11       123,450,050    817,550
B-13       148,509,406    983,506
B-18       129,865,436    860,036
B-21       169,786,212    1,124,412
B-24       171,525,732    1,135,932
B-26       218,252,984    1,445,384
B-28       156,153,328    1,034,128
B-29       131,708,844    872,244
B-3        178,609,746    1,182,846
B-33       95,111,578     629,878
B-39       4,962,464      32,864
B-40       99,438,936     658,536
B-43       133,229,112    882,312
A-48       67,463,780     446,780
A-51       96,306,290     637,790
A-53       2,106,752      13,952
A-54       78,899,916     522,516
A-57       56,990,420     377,420
A-58       32,427,854     214,754
B-15       195,583,354    1,295,254
B-17       154,543,970    1,023,470
B-24-V2    269,063,578    1,781,878
B-31       195,192,868    1,292,668
B-36       128,082,730    848,230
B-44       18,019,434     119,334
B-50       15,890,938     105,238
B-54       228,231,970    1,511,470
B-9        117,612,994    778,894

Table 3-18. Sequences generated for the vaginal microbiome samples, A = HIV(-), B = HIV(+)

Taxonomic Classification

Shown in Figure 3-14 are the four steps used for taxonomic characterization. All of these steps were performed on MG-RAST. Quality filtration was performed by removing reads that contained more than 5 nucleotides with quality scores below 15. Following the sequence classification analysis, the optimized stringencies found for the Nextera XT test sample (Bacteria spike, listed in the sequence classification program section) were used to taxonomically characterize these samples. This was established as an 80% BLASTX similarity across a minimum of 20 amino acids with an E-value of 0.001. MG-RAST automatically classifies non-protein-encoding genes using the same stringencies at the nucleotide level. Although Nextera XT was used in both the spiked sample and these experimental samples, there could still be discrepancies in the average sequence length between these samples, as 151 bp is only the maximum length.
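The two filters described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions and not the MG-RAST implementation: the function names are hypothetical, quality strings are assumed to use Phred+33 encoding, and the thresholds are assumed to be inclusive (a hit at exactly 80% similarity, 20 amino acids, or an E-value of 0.001 is kept).

```python
def decode_phred33(quality_string):
    """Convert a FASTQ quality string to Phred scores (assumes Phred+33 encoding)."""
    return [ord(character) - 33 for character in quality_string]


def passes_quality_filter(phred_scores, min_quality=15, max_low_quality_bases=5):
    """Keep a read unless more than `max_low_quality_bases` bases score below `min_quality`."""
    low_quality_bases = sum(1 for score in phred_scores if score < min_quality)
    return low_quality_bases <= max_low_quality_bases


def passes_stringency(percent_similarity, alignment_length_aa, e_value,
                      min_similarity=80.0, min_length_aa=20, max_e_value=1e-3):
    """Keep a hit that meets all three cut-offs (inclusive bounds are an assumption)."""
    return (percent_similarity >= min_similarity
            and alignment_length_aa >= min_length_aa
            and e_value <= max_e_value)


# A read with exactly 5 low-quality bases is kept; a sixth causes removal.
kept_read = passes_quality_filter([30] * 100 + [10] * 5)
dropped_read = passes_quality_filter([30] * 100 + [10] * 6)
```

Keeping the cut-offs as keyword arguments makes it straightforward to re-run the same check with the other stringency settings compared in the classification analysis.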
As shown in previous analyses, however, adjusting the sequence length has an insignificant impact on the accuracy of taxonomic classification once the parameters for percent alignment and E-value are set.

[Figure 3-14: flow diagram with four steps: Quality Filtration, Human Subtraction, Alignment, Sequence Classification.]

Figure 3-14. Vaginal microbiome metagenomic pipeline for taxonomic characterization.

There is a balance between accuracy and taxonomic resolution when one considers which taxonomic level to use for analysis. Many studies consider only the higher taxonomic levels such as Order, but for BV analyses a lower taxonomic level is needed in order to properly discriminate healthy and unhealthy vaginal microbiomes. Therefore, I chose to analyze these datasets predominantly at the family level. Given that the library preparation process in the Bacteria spike was identical to that of the BV samples, and assuming a direct correlation, all subsequent analyses at the family level are expected to classify reads with a precision of 96.3%. Conversely, an estimated 41.4% of the reads fall through the filter, as the sensitivity was found to be 58.6%.

Multivariate Analyses

Shown in Figure 3-15 are the top 10 families arranged according to a hierarchical clustering analysis with a Pearson correlation coefficient using furthest neighbor linkages. The two distinct clusters produced have patterns that are indicative of BV(-) and BV(+) vaginal microbiomes. The utility of this analysis lies in the observation that the clusters are seemingly produced based on the dominant families in each data set. In the leftmost cluster, the samples are predominantly BV(-) and are in turn dominated by the family Lactobacillaceae. Conversely, in the opposite cluster, 19 out of the 22 samples were BV(+) or BV(I) and had Bifidobacteriaceae as the dominant family recovered. The two BV(I) samples which did not have Bifidobacteriaceae as the dominant family were dominated instead by Prevotellaceae.
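As a rough illustration of this kind of clustering, the sketch below groups samples by a Pearson correlation distance (1 minus the correlation coefficient) with furthest neighbor (complete) linkage. It is a minimal pure-Python sketch and not the software used to produce Figure 3-15; the function names, the sample labels, and the toy abundance profiles are all hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length abundance profiles.

    Assumes neither profile is constant (a constant profile has zero variance).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    variance_x = sum((a - mean_x) ** 2 for a in x)
    variance_y = sum((b - mean_y) ** 2 for b in y)
    return covariance / sqrt(variance_x * variance_y)


def correlation_distance(x, y):
    """Distance used for clustering: 0 for perfectly correlated profiles, 2 for opposite ones."""
    return 1.0 - pearson(x, y)


def complete_linkage(profiles, n_clusters):
    """Agglomerative clustering where cluster distance is the largest pairwise distance."""
    clusters = [[name] for name in profiles]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                distance = max(correlation_distance(profiles[a], profiles[b])
                               for a in clusters[i] for b in clusters[j])
                if best is None or distance < best[0]:
                    best = (distance, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge the closest pair
        del clusters[j]
    return [sorted(cluster) for cluster in clusters]


# Hypothetical relative abundances; columns are
# Lactobacillaceae, Bifidobacteriaceae, Prevotellaceae.
toy_profiles = {
    "S1": [0.90, 0.05, 0.05],
    "S2": [0.80, 0.15, 0.05],
    "S3": [0.10, 0.70, 0.20],
    "S4": [0.05, 0.75, 0.20],
}
clusters = complete_linkage(toy_profiles, n_clusters=2)
```

With these toy profiles the two Lactobacillaceae-dominated samples end up in one cluster and the two Bifidobacteriaceae-dominated samples in the other, which mirrors the way the dominant family appears to drive the clusters in the real data.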
Both Prevotellaceae and Bifidobacteriaceae have been associated with BV, namely various Prevotella spp. and Gardnerella vaginalis. Furthermore, as stated above, although cultural differences may exist, in general Lactobacillus species are definitive of a healthy, BV(-) vaginal microbiome.

Figure 3-15. Hierarchical clustering of vaginal microbiomes shown above the top 10 families for each sample

Another clustering method used, independent of hierarchical clustering, is non-metric multidimensional scaling (NMDS) using a Bray-Curtis distance measure. As applied here, the method draws a confidence ellipse, for a predefined variable, around those samples that are significantly associated with a given cluster. In this case, BV status was used as the differentiating variable, with clusters drawn at a 95% confidence interval. The resulting scatter plot is shown in Figure 3-16. Just as in hierarchical clustering, BV(+) and BV(-) samples produced distinct clusters. According to the confidence ellipses, BV(I) samples also formed their own distinct cluster. Interestingly, of the three BV(+) samples which clustered with the BV(-) samples, two were the same samples which clustered similarly in the hierarchical clustering analysis, B-11 and B-44. Furthermore, the same relationship was seen in one of the BV(I) samples, A-58, where, although encircled into its own cluster, it also fell among the BV(-) samples, just as in the hierarchical clustering. Accordingly, it was also dominated by the Lactobacillaceae family. Examination of the dominant families recovered for the remaining BV(+) sample within the BV(-) cluster, B-3, did not provide any evidence as to why it clustered in this manner, suggesting that other, less prevalent families are driving this relationship.

Figure 3-16. NMDS clustering of vaginal microbiomes using Bray-Curtis similarity, BV variable, 95% CI, made using R

Using this same method but changing the variable to HIV status, Figure 3-17 shows that the confidence ellipses encircle all of the samples together, indicating that the HIV status of each patient is independent of the previous clustering. The same results were seen in the hierarchical clustering, where HIV(+) samples were present in both clusters. However, given that there were 22 HIV(+) and 7 HIV(-) samples, the above relationship is difficult to establish.

Figure 3-17. NMDS clustering of vaginal microbiomes according to HIV status using Bray-Curtis similarity, 95% CI

Interestingly, both of these cluster analyses maintain this specificity down to the species level, as shown in Figure 3-18. In the hierarchical clustering, the same samples are in the two most distinct clusters, except in a different order. In the NMDS plot, the samples seemed to cluster together more efficiently than at the family level. Examining the species in the hierarchical clustering, it is readily apparent that there are certain regions which seemingly define each cluster.

Figure 3-18. NMDS clustering of vaginal microbial species using BV status as the variable, 95% CI

Figure 3-19A. Hierarchical clustering of vaginal microbial species, displaying only the Lactobacillus species region

In Figure 3-19A, L. gasseri, L. johnsonii, and L. delbrueckii are present in relatively high concentrations in the BV(-) samples compared with the BV(+) samples, whereas L. iners is abundant in both. This is intriguing given that L. iners has recently been described as an intermediary species that can readily adapt to both healthy and BV vaginal environments. Further down the cluster, shown in Figure 3-19B, are the species making up the family Bifidobacteriaceae. Although G.
vaginalis is present in high concentrations in the BV(+) samples, it is also recovered in relatively high amounts in the BV(-) cluster. However, species such as B. pseudocatenulatum, B. longum, and, to a lesser extent, B. bifidum are seemingly only abundant in BV(+) samples.

Figure 3-19B. Hierarchical clustering of vaginal microbial species, displaying only the Bifidobacteria/Atopobium region

Comparison with the Nugent Score

By examining the dominant bacterial families in each sample, the recoveries of those families strongly correlated with BV can be compared directly with the Nugent score for each patient. The Pearson correlation coefficient with the Nugent score was maximized by including the dominant BV-associated families seen in Figure 3-15: Bifidobacteriaceae, Prevotellaceae, Clostridiaceae, Veillonellaceae, and Coriobacteriaceae. This value was found to be 0.73, while that for Lactobacillus was -0.69, indicating a high correlation. Shown in Figure 3-20 are the BV-associated bacteria (BVAB) plotted against the Lactobacillus spp. alongside the Nugent score for each patient. Amongst the BV(+) (black), BV(I) (green), and BV(-) (red) regions, there are 5 discrepancies between the BVAB and the Nugent score, shown in yellow. All of these samples were underscored by the Nugent scoring system according to the relative concentrations of BVAB and lactobacilli. Two of these discrepant samples, B-11 and B-21, were classified as having intermediate BV, a category that is largely uncharacterized. In fact, quantitative PCR assays have demonstrated that these samples often reveal a population more closely related to BV than to healthy (Menard et al., 2010). However, the third BV(I) sample had significant recoveries of both BVAB and lactobacilli, and was therefore precisely what we would expect from the Nugent score.
Another sample, A-53, had a mixed microbial population with roughly 52.1% BVAB and 30.5% lactobacilli, which could by chance have been sampled with more Lactobacillus spp. on the initial Gram staining. The other two samples, A-53 and A-48, were the farthest outliers, where the BVAB outnumbered Lactobacillus spp. by 5 and nearly ten-fold, respectively. These two samples need to be examined further in order to explain the large discrepancy between these two methods.

Figure 3-20. BV-associated bacteria (blue) plotted against lactobacilli (red) and the Nugent score (green). The yellow indicates those samples with a high concentration of BVAB and a low Nugent score. The colored regions of the chart indicate Nugent scores for BV(-) (red), BV(I) (green), and BV(+) (black).

Virome

Several families of eukaryotic and prokaryotic viruses were recovered, as shown in Figure 3-21. The abundances of each family, however, were relatively low in the initial MG-RAST alignment, as the largest concentration was 18 paired-end reads belonging to the family Siphoviridae for a BV(+)HIV(-) sample, A-47. All of these reads were specific to an E. coli M13 bacteriophage and are shown in Table 3-15.

[Figure 3-21: bar chart per sample (y-axis scale 0 to 20) with the legend Inoviridae, Iridoviridae, Myoviridae, Papillomaviridae, Podoviridae, Poxviridae, Siphoviridae; sample identifiers on the x-axis.]

Figure 3-21. BV virome, preliminary results directly from MG-RAST

As there were singletons specific to a variety of viruses, including papillomavirus and a picobirnavirus in two HIV(+) samples, B-50 and B-43, respectively, I used an extended procedure for secondary confirmation for each sample. Here, all of the reads were aligned to GenBank's Virus database using a server at the BCCDC with an implemented BLASTN algorithm (BLAST2).
These alignments were then uploaded onto NCBI's BLAST and aligned against the nt database (BLAST3). Consistent with the previous alignments, a lower E-value increased the accuracy of all BLAST2 alignments. With this extended procedure, many of the viral reads were verified, as shown in Figure 3-22. However, for sample A-47, the reads for the M13 bacteriophage could not be confirmed. There were several reads specific for Lactobacillus bacteriophage (Kc5a, φADH) as well as a Lactobacillus prophage Lj771 for a BV(-)HIV(-) sample, A-57. The most consistently recovered bacteriophage, however, was an E. coli T7 bacteriophage, which was present in 10 samples. Interestingly, in sample A-57 a relatively high abundance of lactobacilli was present, and upon further inspection of these species of bacteriophage, indeed all of them have a lysogenic component in their life history (Kilic et al., 2001). In a BV(-)HIV(+) sample, B-36, there were multiple reads specific for Streptococcus phage (Abc2, 5093, 2167, PH15, SM1), and just as in A-57, all of these phage are lysogenic, consistent with the high percentage recovery of Streptococci sequences (Mills, 2011).

[Figure 3-22: bar chart per sample (y-axis scale 0 to 14) with the legend T7 Phage, Streptococcus Phage, Lactobacillus Phage, Lactobacillus Prophage, Enterococcus Phage, Alphapapillomavirus, Picobirnavirus; sample identifiers on the x-axis.]

Figure 3-22. BV virome re-alignment using NCBI

Concerning the functions of the bacteriophage genes present, although the majority of reads were functionally classified as ssDNA binding proteins involved in DNA recombination, 3 reads were classified as anti-repressors. This is interesting in that a recent study of bacteria lysogenized with multiple bacteriophage showed these proteins to be involved in prophage crosstalk, specifically in synchronizing prophage induction (Lemire et al., 2011).
This may be one reason why such a diversity of phages was recovered in both of the samples containing anti-repressors (A-57 and B-36). Furthermore, the metadata reveal that patient B-36 (BV(-)HIV(+)) was taking an antibiotic, nitrofurantoin, for a urinary tract infection (UTI) at the time of sampling. It is possible that both the bacteria and the virome could have adapted to the altered environment. Alternatively, given the gastrointestinal tract origin of most UTI pathogens, the diversity of phage in the patient with the UTI could reflect the introduction of gastrointestinal bacteria and their associated phage into the vaginal tract (Imirzalioglu et al., 2008). In sample B-50 (BV(+)HIV(+)), a singleton was initially recovered for HPV-44. Interestingly, the BLAST3 realignment discovered 6 more paired-end reads specific to HPV-34, HPV-44, HPV-58, and HPV-114. All of these HPV genotypes belong to the largest genus of HPV, Alphapapillomavirus. In the B-43 sample, additional reads for a picobirnavirus were not recovered. Furthermore, using an assembler with an algorithm specifically designed to extend selected paired-end read sequences (Stenglein et al., 2012), no additional reads were recovered for either of these samples, indicating that these viruses, if present, are at concentrations near the limit of detection for this procedure.

Mycobiome

The vaginal mycobiome has only been characterized through culture-based methods, and therefore the results uncovered here are, to my knowledge, the first to utilize a metagenomics approach. Following a realignment using BLASTN against the nt database, the majority of fungal sequences were shown to be specific to highly conserved regions of rRNA, rendering taxonomic resolution past the Order level exceedingly difficult. Shown in Figure 3-23 are those reads for which family-level resolution was possible.
[Figure 3-23: chart per sample (y-axis scale 0 to 0.25) with the legend Actinomycetaceae, Ajellomycetaceae, Debaryomycetaceae, Glycomycetaceae, Lipomycetaceae, Planctomycetaceae, Saccharomycetaceae, Schizosaccharomycetaceae.]

Figure 3-23. BV mycobiome based on the percent of the total reads for each sample

Two families of fungi dominate both the HIV(-) and HIV(+) patients: Saccharomycetaceae and Debaryomycetaceae. Saccharomyces cerevisiae and Candida albicans are both frequently recovered from healthy and infected vaginas, and both of these species dominated the family Saccharomycetaceae. For two BV(+)HIV(+) samples, B-33 and B-54, the Saccharomycetaceae sequences comprised 8.1% and 21.4% of the total reads, respectively. Of the 21.4% of the total reads for sample B-54, the vast majority were specific to C. albicans. However, a minority of other species in this family were also recovered, including C. tropicalis, Spathaspora passalidarum, and C. dubliniensis. All of these species have been isolated from the vagina aside from S. passalidarum, which is typically isolated from insects (Khan et al., 2012b; Richter et al., 2005; Wohlbach et al., 2011). Interestingly, C. tropicalis has recently been associated with another dysbiotic disease, IBD (Iliev et al., 2012).

The sequences for Debaryomycetaceae were again difficult to resolve, as most alignments included hits throughout the Order Saccharomycetales, all with equal alignment scores. Two genera were resolved from these reads, one of which was surprisingly a genus of yeast derived from plants, Scheffersomyces. Given the level of conservation, these reads could be indicative of contamination, but more likely reflect issues with taxonomic resolution or misalignments. The other genus recovered within this family was Debaryomyces. This particular genus is often found as an opportunistic pathogen frequently associated with AIDS patients (Hodgson and Rachanis, 2002).
This genus was significantly associated with BV(+)HIV(+) samples B-33 and B-54, where reads for Debaryomycetaceae were 3.7% and 14.9% for these respective samples. In order to assess any possible associations between HIV or BV and the mycobiome, hierarchical clustering was performed using only the mycobiome. As shown in Figure 3-24, Debaryomycetaceae seems to be more prevalent in the BV(-) or BV(I) samples. However, given that BV(-) samples were also in the opposing cluster, confirming this relationship requires a larger sample set. Since samples B-33 and B-54 had such high concentrations of fungal sequences recovered, relationships possibly explaining these abundances were sought in the associated metadata for each patient. Interestingly, patient B-33 was also being treated with clarithromycin for a throat infection. Despite observations that vulvovaginal candidiasis is associated with changes in the endogenous microbial populations, which may have been altered by this antibiotic, this fungal dominance is not seen in other patients treated with antimicrobials, and these data are therefore not specific enough to establish any further associations (Spinillo et al., 1999). However, patient B-54 was medicated with Nystatin specifically for an oral yeast infection, possibly suggesting oral-vaginal transmission.

Figure 3-24. Hierarchical clustering of BV mycobiome using family level taxonomic classification

Conclusions

In the present study, an optimized procedure was developed specifically for metagenomic characterizations. BV is an ideal disease with which to test such a method in that it is defined as a shift in a microbial population that has been firmly established. In this case, the vaginal microbiomes of 22 HIV(+) and 7 HIV(-) patients were characterized and further stratified by BV status, including 4 BV(-)HIV(-), 4 BV(-)HIV(+), 2 BV(+)HIV(-), 1 BV(I)HIV(-), 2 BV(I)HIV(+), and 16 BV(+)HIV(+) patients.
In addition to viruses, microorganisms from 2 kingdoms of life were recovered and quantified for each of these cohorts. Using NMDS and hierarchical clustering, patterns within the microbiome emerged which were consistent with the prevailing patterns indicative of BV, where lactobacilli dominated the healthy samples and BV-associated bacteria (BVAB) were elevated in the opposing cluster(s). The 'gold standard' of BV diagnosis, the Nugent score, was used as a baseline in establishing the accuracy of this procedure. If the presented technique were to be compared with this diagnostic scoring system, 5/8 of the BV(-) samples would have been correctly classified according to the hierarchical clustering. However, NMDS clustering at the species level would have had 89.7% accuracy (26/29). Furthermore, a plot of the recoveries of BVAB versus the Lactobacillus spp. also showed a high correlation with the corresponding Nugent score (24/29). However, given that the Nugent score, despite being the 'gold standard', is essentially a measure of the relative proportion of lactobacilli in a given sample by means of a Gram stain, the above results are much more sensitive and provide a theoretically higher specificity. Collectively, these results show strong support for our hypothesis that metagenomic analyses can accurately differentiate between a healthy vaginal microbiome and that of BV. Furthermore, although ~800,000 reads were generated for each sample, the vaginal microbiome was correctly classified in samples with just over 10,000 mapped reads. Provided that the cost for 15 samples at roughly 800,000 reads is on average $120, if more research could support this drop in depth, this translates to ~$44 per sample minus labour. As costs for high-throughput sequencing continue to drop, this metagenomic strategy could be used as a clinical diagnostic tool.
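The agreement figures quoted above follow directly from the sample counts, as the short check below shows. It is a worked arithmetic sketch only; the helper name `percent_agreement` and the variable names are hypothetical, and the counts are those reported in the text.

```python
def percent_agreement(n_agree, n_total):
    """Percentage of samples on which two methods agree."""
    return 100.0 * n_agree / n_total


# Counts taken from the comparison with the Nugent score described above.
nmds_species_level = percent_agreement(26, 29)       # NMDS clustering at the species level
bvab_vs_lactobacilli = percent_agreement(24, 29)     # BVAB versus Lactobacillus spp. plot
hierarchical_bv_negative = percent_agreement(5, 8)   # BV(-) samples, hierarchical clustering

for label, value in [("NMDS, species level", nmds_species_level),
                     ("BVAB vs. lactobacilli", bvab_vs_lactobacilli),
                     ("Hierarchical, BV(-) only", hierarchical_bv_negative)]:
    print(f"{label}: {value:.1f}%")
```

Rounded to one decimal place, 26/29 gives the 89.7% stated for NMDS, while 24/29 and 5/8 correspond to roughly 82.8% and 62.5%.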
In addition to meeting the status quo of the presumed BV vaginal microbiome, these results also provide many additional extensions into the biology of both healthy and unhealthy vaginal microbiomes. For instance, in the species-level hierarchical clustering analysis of the vaginal microbiome shown in Figure 3-19B, various Bifidobacteria were recovered and shown to be more associated with BV than bacteria such as G. vaginalis and A. vaginae. Interestingly, all of these Bifidobacteria spp. are associated with vaginal microbiomes; however, the vast majority of research focuses on these bacteria within the gut (Dumonceaux et al., 2009; Korshunov et al., 1999; Matsuki et al., 1999). Interestingly, due to the high genetic conservation between species of this genus, only denaturing gel electrophoresis, as opposed to amplicon-based studies, was able to differentiate B. pseudocatenulatum from other species in the vagina (Turroni et al., 2012). Therefore, it is possible that targeted amplicon sequencing may either not be sensitive enough or simply not provide enough data to resolve this species from its surrounding relatives, as suggested in a gut microbiome analysis (Matsuki et al., 1998). Furthermore, there is virtually nothing known about how the virome influences the establishment and maintenance of microbial populations within the vagina. This analysis revealed that a number of lysogenic bacteriophage were recovered and displayed high homology to species specific to both lactobacilli and Streptococci. Bacteriophage induction is a highly orchestrated process and is often in direct connection with the level of stress with which an individual bacterial cell is confronted. Therefore, analyzing these phage populations may provide insight into a microbial population that is in flux. Interestingly, the bacterial hosts of the phage recovered here were in high abundance in each case.
A possible explanation may lie in a recent finding that a resident of the gut microbiome, Enterococcus faecalis, has evolved a strategy to utilize a lysogenic phage to lyse competing genotypes, thereby conferring a growth advantage (Duerkop et al., 2012). Given the void of research concerning the virome and its influence on the vaginal ecosystem, let alone any system, this evolutionary relationship between phage and their hosts could well be widespread in various environments rich in microorganisms. Concerning the eukaryotic component of the virome, in this study 22 samples were HIV(+) and 8 of these patients were also seropositive for HCV. Neither of these eukaryotic RNA viruses was recovered from the vaginal swabs. Although BV is associated with increased HIV shedding, all of the patients in this study were under antiretroviral treatment, which would most likely result in a virus concentration below the threshold of this procedure. Furthermore, there have been few documented studies that quantitatively address the concentration of HIV within the vagina in terms of differences in gc. Of these studies, the majority quantify HIV from cervicovaginal lavages (CVL) and have recoveries typically ranging from 1,148-1,412 gc/ml for BV(-) and BV(+) cohorts, respectively (Fiore et al., 2002; Sha et al., 2005). In another study using CVL samples, a group found that, as expected, the number of HIV genome copies decreased significantly with HAART treatment, where 1/3 of all analyzed cases produced recoveries of less than 500 gc/ml (Fiore et al., 2002). However, in one study analyzing the HIV concentration obtained from vaginal swabs immersed in 1 ml of liquid and concentrated to 200 µl, the recovery of HIV before treatment with HAART was 1,288 gc/ml, while after treatment this concentration was reduced to 794 gc/ml (Wang et al., 2001).
As these studies did not use volumes greater than 10 ml for each CVL or vaginal swab, this is well outside the limit of detection of this assay, which was found to be 27,714 gc/sample. Evidence for HCV detection from vaginal swabs is lacking; however, it has been recovered via qPCR in the cellular component of CVL samples, and was shown to be associated with menstrual blood (Wang et al., 2011). Although one of the HCV(+) patients included in this study (B-31) was in the menstrual phase at the time of sampling, no HCV sequences were recovered. However, although this patient was seropositive, a qPCR assay for HCV was negative, indicating that perhaps HCV concentrations were below the limit of detection in the plasma, and therefore in the menstrual blood as well. Furthermore, in one of the few studies to quantify HCV from the vagina, HCV was detected in CVL swabs of chronically infected patients without anti-viral treatment at just 50-300 gc/sample (Belec et al., 2003). Again, since all of the women in this study were treated with anti-virals, expected HCV levels would be reduced further. The existence of Debaryomycetaceae in a healthy vagina, identified through the mycobiome analysis, has only rarely been documented (Beigi et al., 2004). As the majority of samples contained this family of fungi, in both HIV(+) and HIV(-) patients, perhaps with larger sample sizes and amplicon sequencing or high-throughput sequencing at a greater depth, this microorganism may be fully appreciated in the vaginal microbiome. Following this analysis, however, a recent publication recovered sequences specific for 5 Candida species (Drell et al., 2013). This suggests that the sequences recovered in this analysis could be indicative of a much more diverse abundance of species, as the majority of sequences were only specific down to the family level.
Furthermore, it was especially interesting that the patient with the highest abundance of fungal sequences recovered (B-54) was also being treated for an oral yeast infection. As this patient was BV(+)HIV(+), it is not uncommon for opportunistic C. albicans infections to occur concomitantly with each of these pathologies. However, this patient was clinically assessed by a gynecologist and, although not tested specifically, did not produce any symptoms associated with candidiasis. It is intriguing to consider the possibility of oral-vaginal transmission of C. albicans, where, although there exists a relationship between vulvovaginal candidiasis and oral sex, it seems only to be significant in recurrent infections (Ballini et al., 2012). Specifically, it is hypothesized that C. albicans seeds each recurrent infection through the gastrointestinal tract. As stated above, HIV is associated with bacterial translocation from the gut, and thus this may be evidence of such a transmission for a fungal microorganism. It should be noted that the HIV status of the women in this study was not specifically controlled. Since 22/29 patients in this study were HIV positive, this sub-variable could have impacted the results of this study. It is not completely understood how HIV or antiretroviral treatment can impact the microbiome. Although BV and HIV are associated infections, there have been no direct correlations between HIV status or treatment and an altered vaginal microbiome (Hummelen et al., 2010). Furthermore, much like the results in this study, others have found that a healthy vaginal microbiome in HIV positive women is essentially identical to that in HIV negative women (Hummelen et al., 2010).
Although there were 29 samples in total and only 8 healthy samples, the preliminary associations from these results could aid larger studies in metagenomic pipeline sensitivity and specificity comparisons as well as in identifying specific taxa associated with the BV vaginal environment. We successfully characterized the vaginal microbiome in these samples, which challenged the precision of the current clinical diagnosis. In conclusion, this assay proved to be an efficient method for characterizing the microbial populations within the vagina. Although we were unable to detect HIV or HCV in women known to have these infections, an alternative method may have been more successful at detecting these RNA viruses. As stated in the above sensitivity assays, if virus detection had been the sole goal of this experiment, then filtration, ribosomal RNA subtraction, or cDNA amplification should have been added to this metagenomic pipeline in order to selectively concentrate these target nucleic acids.

Chapter 4 Discussion

Summary

The work presented in this study is a multifactorial approach for the metagenomic characterization of microorganisms associated with two complex diseases, BV and OC. Both of these applications are novel, in that previous studies have used targeted PCR methods rather than randomly primed shotgun sequencing to find associations between microorganisms and either disease. Metagenomics was used to test for the presence of microorganisms within ovarian cancer tissue, where five OC samples were analyzed with a pan-viral microarray and an additional 15 were subjected to high-throughput sequencing analyses. In the BV study, we explored the accuracy of metagenomics methods to classify the known as well as novel microorganisms within the vaginal microbiome of 29 patients stratified by both their BV and HIV status.

Ovarian Cancer Results

Microorganisms have been recovered in ovarian cancer tissue samples from PCR analyses targeting M.
genitalium, as well as viruses such as HPV and cytomegalovirus (Chan et al., 1996; Ness et al., 2003). However, for every positive association there is at least one negative association (Idahl et al., 2010). Most of these targeted PCR methods can detect only individual pathogens and cannot capture the entire spectrum of possible microorganisms associated with OC. Many of the PCR targets are based on the causes of PID, but the etiologic agents of this disease are also poorly understood, as they have largely been identified through bacterial culture methods. Therefore, a metagenomics analysis was required to address this question more comprehensively. Our pan-viral microarray initially revealed weak patterns specific to certain virus genera; however, subsequent PCR analyses were unable to recover any sequences specific to these viruses. These results indicate that the observed hybridization patterns were most likely false-positive signals generated from the cross-hybridization of host gene transcripts. In the transcriptome component, recovered microbial sequences ranged from 0.004% to 14% of the total generated reads for each cancer sample (Table 2-6). Closer inspection of sequences presumed to be unique to the female genital tract revealed that native species such as Lactobacillus crispatus and Lactobacillus vaginalis were indeed present (Table 2-8). However, following the lack of PCR recovery for all microorganisms targeted, closer inspections using bioinformatics analyses were performed. Multivariate methods failed to differentiate among the ovarian cancer subtypes, as well as between these and transcriptomes derived from alternative tissue sources such as an epithelioid sarcoma and a lung cancer cell line (Figure 2-4). Furthermore, closer inspection of the top three microbial families, Enterobacteriaceae, Pseudomonadaceae, and Moraxellaceae, revealed that similar recovery patterns existed across samples (Figure 2-9).
Finally, a PCR assay was developed to recover 16S rRNA directly from randomly primed ovarian cancer cDNA libraries. This assay was tested specifically for its ability to recover rRNA at the concentrations suggested by the ovarian cancer sequence data. This assay yielded no PCR amplicons (Figure 2-8). These results are supported by the largest targeted PCR amplification study to date (Idahl et al., 2010), in which none of N. gonorrhoeae, M. genitalium, C. trachomatis, high-risk HPV genotypes, or the two polyomaviruses BKV and JCV could be detected in 186 ovarian cancer tissue samples. Collectively, the above results suggest that the microbial sequences that we initially recovered from the transcriptome files were largely a result of contamination post-extraction.

Limitations of the Ovarian Cancer Study

The limitations of the methods used to detect microorganisms in ovarian cancer all revolve around sensitivity. Initially, polyadenylated RNA was isolated for sequencing. It was therefore unexpected to discover such a high percentage of microbial sequences, as these should not be polyadenylated. Instead of designing PCR assays specific to each bacterium observed in the data, a single assay was developed targeting a gene universally conserved amongst bacteria, 16S rRNA. This PCR assay was run in a conventional format, meaning that weak bands may not have been visible. The limit of detection of the assay was determined and correlated with the number of bacterial sequences in each sample. However, the top three families, Moraxellaceae, Enterobacteriaceae, and Pseudomonadaceae, were also observed in our controls, one of which was a cell line. Although we recovered sequences specific to bacteria native to the female genital tract in ovarian cancer tissue, if the top three families are considered contaminants, then the remaining bacteria would be at levels below the empirically determined limit of detection of the 16S PCR assay.
Therefore, our conclusions are limited to the sensitivities of the assays used in this study and do not have sufficient power to completely reject the initial hypothesis concerning the identification of microorganisms associated with ovarian cancer. Given these issues with sensitivity, a different approach, such as deep amplicon sequencing targeting a universally conserved gene, may be more promising. The high sensitivities reported for such assays could yield a conclusion with a higher degree of confidence regarding the potential existence of an ovarian cancer microbiome.

Bacterial Vaginosis Results

All published studies of BV have focused on culturing or targeting specific and/or conserved regions of bacterial genomes with PCR. These amplicon studies have demonstrated poor reproducibility, as measured by comparisons of the taxonomic groups recovered, ranging from 1.6% to 99.1%. However, the largest study targeting rRNA found a high correlation with the clinical diagnosis: only five out of 122 healthy vaginal populations displayed a microbial profile similar to those with BV (Srinivasan et al., 2012). Furthermore, quantitative PCR assays based upon these amplicon studies have observed similar abundances of the targeted microorganisms. These qPCR assays, when compared with the Nugent score or Amsel's criteria, have shown sensitivities and specificities as high as 100% and 93%, respectively (Menard et al., 2010). In this original metagenomics study, there was a strong correlation between BV status and the multivariate analyses, as seen in Figures 3-15, 3-16, and 3-18. These clustering algorithms were able to differentiate between BV(-), BV(I), and BV(+) samples. This is especially significant because BV(I) presents difficulties even for qPCR analyses.
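The sensitivity and specificity figures quoted above for qPCR assays are calculated against a reference diagnosis such as the Nugent score. The following is a minimal sketch of that calculation, not the procedure used in any of the cited studies. The function name `sensitivity_specificity` and the example calls are hypothetical and are not data from this thesis.

```python
def sensitivity_specificity(reference, test):
    """Return (sensitivity, specificity) of `test` calls against `reference` calls.

    Both arguments are equal-length sequences of booleans (True = BV positive).
    """
    if len(reference) != len(test):
        raise ValueError("reference and test must have the same length")
    pairs = list(zip(reference, test))
    tp = sum(1 for r, t in pairs if r and t)          # true positives
    fn = sum(1 for r, t in pairs if r and not t)      # false negatives
    tn = sum(1 for r, t in pairs if not r and not t)  # true negatives
    fp = sum(1 for r, t in pairs if not r and t)      # false positives
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative reference call")
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical calls for illustration only (not data from this study).
nugent_positive = [True, True, True, True, False, False, False, False, False, False]
assay_positive = [True, True, True, True, False, False, False, False, False, True]

sens, spec = sensitivity_specificity(nugent_positive, assay_positive)
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}")
```

In this sketch, intermediate BV(I) samples would have to be assigned to one class or excluded before the calculation, which is one reason such samples are difficult to evaluate with a binary measure.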
Furthermore, a simple plot in Figure 3-20 of the bacterial families significantly associated with BV compared to the family Lactobacillaceae reveals a similar relationship with the Nugent score. Collectively, these results suggest that metagenomics is a promising tool for the analysis of BV and can be successfully used for the characterization of microorganisms associated with a healthy vaginal microbiome and BV. Since a relatively high accuracy of the microbial classification relative to the Nugent score was still observed at a depth of only 10,000 mapped reads, this assay could be multiplexed with 96 samples, dropping the cost per sample to under $50 (not including labor).

Limitations of the Bacterial Vaginosis Study

In our BV analysis, our method produced sensitivities ranging from 1,386 to 27,714 gc/sample depending on the nucleic acid context of the sample matrix. Although we were able to detect the microorganisms known to be associated with BV as well as additional viruses and fungal populations, these sensitivities are low relative to qPCR. If our aim were to improve this sensitivity, we would ideally sequence these samples on an alternative instrument that allows a greater depth of sequencing, such as Illumina's HiSeq. Alternatively, if we were to use a MiSeq, we could have focused on a single variable, such as characterizing the potential existence of the eukaryotic virome. To this end, mRNA could be isolated and subsequently amplified for each sample in order to subtract a large percentage of cellular RNA and bring viruses up to detectable levels, as demonstrated previously (Malboeuf et al., 2013). Alternatively, if we were to focus on the prokaryotic and eukaryotic virome, we could have physically filtered these samples prior to extraction in order to remove cellular debris and thereby reduce nucleic acid competition, as demonstrated previously (Handley et al., 2012).
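The trade-off between multiplexing level, read depth, and cost discussed in this section can be sketched as a simple calculation. This is only an illustration under stated assumptions. The function `per_sample_budget` is hypothetical, and the run yield, run cost, and mapped fraction below are placeholder values rather than figures from this study or from any instrument specification.

```python
def per_sample_budget(run_reads, run_cost, n_samples, mapped_fraction):
    """Estimate mapped reads and sequencing cost per sample for one multiplexed run.

    Assumes reads are split evenly across samples, which real runs only approximate.
    """
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")
    if not 0 < mapped_fraction <= 1:
        raise ValueError("mapped_fraction must be in (0, 1]")
    reads_per_sample = run_reads / n_samples
    mapped_per_sample = reads_per_sample * mapped_fraction
    cost_per_sample = run_cost / n_samples
    return mapped_per_sample, cost_per_sample


# Placeholder values for illustration only (assumptions, not measured figures).
mapped, cost = per_sample_budget(
    run_reads=3_840_000,   # assumed total reads passing filter in one run
    run_cost=960.0,        # assumed sequencing cost of the run, in dollars
    n_samples=96,          # multiplexing level discussed in the text
    mapped_fraction=0.5,   # assumed fraction of reads that map to a reference
)
print(f"mapped reads per sample: {mapped:.0f}, cost per sample: ${cost:.2f}")
```

Under these assumed inputs each sample would still exceed the 10,000 mapped reads mentioned above. With real run yields and costs, the same calculation indicates how far multiplexing can be pushed before depth becomes limiting.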
Furthermore, to confirm the concentrations of the microorganisms recovered in this BV analysis, a qPCR assay for specific taxa should be employed so that these concentrations can be compared with the relative abundances observed with this procedure. Although qPCR assays have been designed for G. vaginalis and A. vaginae in order to contrast with that of Lactobacillus spp., this study found that the Pearson correlation coefficient between BV and associated microbial families increased significantly upon the addition of Prevotellaceae and Veillonellaceae. This indicates that such qPCR assays should be multiplexed further in order to improve their accuracy and capture the complexity of BVAB. Alternatively, for those species not observed in other BV studies, such as the bacteriophage and HPV, targeted sequencing should be performed in order to solidify these findings. Finally, the dominance of fungal populations in healthy, HIV(+), and BV(+) cohorts indicates that targeted amplicon sequencing should be applied in order to delineate the fungal species that were recovered.

Conclusion

As new tools are developed in the field of metagenomics, many of them are applied without adequate testing or inclusion of appropriate controls, which calls the significance of their findings into question. Observations in our ovarian cancer microbiome analyses strongly suggest that such methodologies still need to be tested properly using duplicates as well as suitable controls, especially when applied to relatively unexplored scientific questions. Used properly, however, these metagenomic methods allow researchers to address complex questions in order to comprehend diseases at resolutions previously unobtainable. As metagenomics moves out of the exploratory phases of environmental characterizations, associated variables can be manipulated for the design of subsequent prospective longitudinal studies.
Collectively, these results illustrate that there are still gaps in our understanding of the novel and traditional methods used for sequence recovery and characterization. The results from all the studies utilizing these tools will therefore reflect these limitations. For proper study design in metagenomics-based studies, these variables must be optimized in a quantitative manner so that the results will most accurately represent the true microbial communities within each sample. Although metagenomics is still in its infancy, it stands to revolutionize the field of microbiology and infectious diseases as well as clarify the role of microorganisms in other diseases traditionally thought to be non-infectious.

References

Abnizova, I., Leonard, S., Skelly, T., Brown, A., Jackson, D., Gourtovaia, M., Qi, G., Te Boekhorst, R., Faruque, N., Lewis, K., and Cox, T. (2012). Analysis of context-dependent errors for Illumina sequencing. J Bioinform Comput Biol 10, 1241005. Adami, H.O., Hsieh, C.C., Lambe, M., Trichopoulos, D., Leon, D., Persson, I., Ekbom, A., and Janson, P.O. (1994). Parity, age at first childbirth, and risk of ovarian cancer. Lancet 344, 1250-1254. Adams, I.P., Glover, R.H., Monger, W.A., Mumford, R., Jackeviciene, E., Navalinskiene, M., Samuitiene, M., and Boonham, N. (2009). Next-generation sequencing and metagenomic analysis: a universal diagnostic tool in plant virology. Mol Plant Pathol 10, 537-545. Ahmed, A.A., Etemadmoghadam, D., Temple, J., Lynch, A.G., Riad, M., Sharma, R., Stewart, C., Fereday, S., Caldas, C., Defazio, A., Bowtell, D., and Brenton, J.D. (2010). Driver mutations in TP53 are ubiquitous in high grade serous carcinoma of the ovary. J Pathol 221, 49-56. Alander, M., Satokari, R., Korpela, R., Saxelin, M., Vilpponen-Salmela, T., Mattila-Sandholm, T., and von Wright, A. (1999). Persistence of colonization of human colonic mucosa by a probiotic strain, Lactobacillus rhamnosus GG, after oral consumption.
Appl Environ Microbiol 65, 351-354. Allander, T., Andreasson, K., Gupta, S., Bjerkner, A., Bogdanovic, G., Persson, M.A., Dalianis, T., Ramqvist, T., and Andersson, B. (2007). Identification of a third human polyomavirus. J Virol 81, 4130-4136. Allsworth, J.E., Lewis, V.A., and Peipert, J.F. (2008). Viral sexually transmitted infections and bacterial vaginosis: 2001-2004 National Health and Nutrition Examination Survey data. Sex Transm Dis 35, 791-796. Altschul, S.F., Gish, W., Miller, W., Myers, E.W., and Lipman, D.J. (1990). Basic local alignment search tool. J Mol Biol 215, 403-410. Ambrose, H.E., and Clewley, J.P. (2006). Virus discovery by sequence-independent genome amplification. Rev Med Virol 16, 365-383. American College of Physicians (2003- ) (2009). Practical gynecology : a guide for the primary care physician, 2nd edn (Philadelphia, American College of Physicians). Amsel, R., Totten, P.A., Spiegel, C.A., Chen, K.C., Eschenbach, D., and Holmes, K.K. (1983). Nonspecific vaginitis. Diagnostic criteria and microbial and epidemiologic associations. Am J Med 74, 14-22. Antonio, M.A., Hawes, S.E., and Hillier, S.L. (1999). The identification of vaginal Lactobacillus species and the demographic and microbiologic characteristics of women colonized by these species. J Infect Dis 180, 1950-1956. Archer, J., Weber, J., Henry, K., Winner, D., Gibson, R., Lee, L., Paxinos, E., Arts, E.J., Robertson, D.L., Mimms, L., and Quinones-Mateu, M.E. (2012). Use of four next-generation sequencing platforms to determine HIV-1 coreceptor tropism. PLoS One 7, e49602. Armour, C.D., Castle, J.C., Chen, R., Babak, T., Loerch, P., Jackson, S., Shah, J.K., Dey, J., Rohl, C.A., Johnson, J.M., and Raymond, C.K. (2009). Digital transcriptome profiling using selective hexamer priming for cDNA synthesis. Nat Methods 6, 647-649. Aroutcheva, A., Gariti, D., Simon, M., Shott, S., Faro, J., Simoes, J.A., Gurguis, A., and Faro, S. (2001). Defense factors of vaginal lactobacilli.
Am J Obstet Gynecol 185, 375-379. Arron, S.T., Ruby, J.G., Dybbro, E., Ganem, D., and Derisi, J.L. (2011). Transcriptome sequencing demonstrates that human papillomavirus is not active in cutaneous squamous cell carcinoma. J Invest Dermatol 131, 1745-1753. Arumugam, M., Raes, J., Pelletier, E., Le Paslier, D., Yamada, T., Mende, D.R., Fernandes, G.R., Tap, J., Bruls, T., Batto, J.M., Bertalan, M., Borruel, N., Casellas, F., Fernandez, L., Gautier, L., Hansen, T., Hattori, M., Hayashi, T., Kleerebezem, M., Kurokawa, K., Leclerc, M., Levenez, F., Manichanh, C., Nielsen, H.B., Nielsen, T., Pons, N., Poulain, J., Qin, J., Sicheritz-Ponten, T., Tims, S., Torrents, D., Ugarte, E., Zoetendal, E.G., Wang, J., Guarner, F., Pedersen, O., de Vos, W.M., Brunak, S., Dore, J., Antolin, M., Artiguenave, F., Blottiere, H.M., Almeida, M., Brechot, C., Cara, C., Chervaux, C., Cultrone, A., Delorme, C., Denariaz, G., Dervyn, R., Foerstner, K.U., Friss, C., van de Guchte, M., Guedon, E., Haimet, F., Huber, W., van Hylckama-Vlieg, J., Jamet, A., Juste, C., Kaci, G., Knol, J., Lakhdari, O., Layec, S., Le Roux, K., Maguin, E., Merieux, A., Melo Minardi, R., M'Rini, C., Muller, J., Oozeer, R., Parkhill, J., Renault, P., Rescigno, M., Sanchez, N., Sunagawa, S., Torrejon, A., Turner, K., Vandemeulebrouck, G., Varela, E., Winogradsky, Y., Zeller, G., Weissenbach, J., Ehrlich, S.D., and Bork, P. (2011). Enterotypes of the human gut microbiome. Nature 473, 174-180. Ballini, A., Cantore, S., Fatone, L., Montenegro, V., De Vito, D., Pettini, F., Crincoli, V., Antelmi, A., Romita, P., Rapone, B., Miniello, G., Perillo, L., Grassi, F.R., and Foti, C. (2012). Transmission of nonviral sexually transmitted infections and oral sex. J Sex Med 9, 372-384. Barton, E.S., White, D.W., Cathelyn, J.S., Brett-McClellan, K.A., Engle, M., Diamond, M.S., Miller, V.L., and Virgin, H.W.t. (2007). Herpesvirus latency confers symbiotic protection from bacterial infection. Nature 447, 326-329. 
Bazinet, A.L., and Cummings, M.P. (2012). A comparative evaluation of sequence classification programs. BMC Bioinformatics 13, 92. Beigi, R.H., Meyn, L.A., Moore, D.M., Krohn, M.A., and Hillier, S.L. (2004). Vaginal yeast colonization in nonpregnant women: a longitudinal study. Obstet Gynecol 104, 926-930. Belec, L., Legoff, J., Si-Mohamed, A., Bloch, F., Matta, M., Mbopi-Keou, F.X., and Payan, C. (2003). Cell-associated, non-replicating strand(+) hepatitis C virus-RNA shedding in cervicovaginal secretions from chronically HCV-infected women. J Clin Virol 27, 247-251. Bench, S.R., Hanson, T.E., Williamson, K.E., Ghosh, D., Radosovich, M., Wang, K., and Wommack, K.E. (2007). Metagenomic characterization of Chesapeake Bay virioplankton. Appl Environ Microbiol 73, 7629-7641. Berger, S.A., Krompass, D., and Stamatakis, A. (2011). Performance, accuracy, and Web server for evolutionary placement of short sequence reads under maximum likelihood. Syst Biol 60, 291-302. Berthet, N., Leclercq, I., Dublineau, A., Shigematsu, S., Burguiere, A.M., Filippone, C., Gessain, A., and Manuguerra, J.C. (2010). High-density resequencing DNA microarrays in public health emergencies. Nat Biotechnol 28, 25-27. Bertolini, V., Gandolfi, I., Ambrosini, R., Bestetti, G., Innocente, E., Rampazzo, G., and Franzetti, A. (2012). Temporal variability and effect of environmental variables on airborne bacterial communities in an urban area of Northern Italy. Appl Microbiol Biotechnol. Bezirtzoglou, E., Voidarou, C., Papadaki, A., Tsiotsias, A., Kotsovolou, O., and Konstandi, M. (2008). Hormone therapy alters the composition of the vaginal microflora in ovariectomized rats. Microb Ecol 55, 751-759. Birren, D.C.G.G.A.E.M.F.B. (2012). Evaluation of bacterial ribosomal RNA (rRNA) depletion methods for sequencing microbial community transcriptomes. Bloom, S.M., Bijanki, V.N., Nava, G.M., Sun, L., Malvin, N.P., Donermeyer, D.L., Dunne, W.M., Jr., Allen, P.M., and Stappenbeck, T.S. (2011).
Commensal Bacteroides species induce colitis in host-genotype-specific fashion in a mouse model of inflammatory bowel disease. Cell Host Microbe 9, 390-403. Bohannon, J. (2007). Metagenomics. Ocean study yields a tidal wave of microbial DNA. Science 315, 1486-1487. Bohlander, S.K., Espinosa, R., 3rd, Le Beau, M.M., Rowley, J.D., and Diaz, M.O. (1992). A method for the rapid sequence-independent amplification of microdissected chromosomal material. Genomics 13, 1322-1324. Borody, T.J., Warren, E.F., Leis, S.M., Surace, R., Ashman, O., and Siarakas, S. (2004). Bacteriotherapy using fecal flora: toying with human motions. J Clin Gastroenterol 38, 475-483. Boskey, E.R., Cone, R.A., Whaley, K.J., and Moench, T.R. (2001). Origins of vaginal acidity: high D/L lactate ratio is consistent with bacteria being the primary source. Hum Reprod 16, 1809-1813. Bradshaw, C.S., Morton, A.N., Hocking, J., Garland, S.M., Morris, M.B., Moss, L.M., Horvath, L.B., Kuzevska, I., and Fairley, C.K. (2006). High recurrence rates of bacterial vaginosis over the course of 12 months after oral metronidazole therapy and factors associated with recurrence. J Infect Dis 193, 1478-1486. Breitbart, M., Hewson, I., Felts, B., Mahaffy, J.M., Nulton, J., Salamon, P., and Rohwer, F. (2003). Metagenomic analyses of an uncultured viral community from human feces. J Bacteriol 185, 6220-6223. Brenchley, J.M., Price, D.A., Schacker, T.W., Asher, T.E., Silvestri, G., Rao, S., Kazzaz, Z., Bornstein, E., Lambotte, O., Altmann, D., Blazar, B.R., Rodriguez, B., Teixeira-Johnson, L., Landay, A., Martin, J.N., Hecht, F.M., Picker, L.J., Lederman, M.M., Deeks, S.G., and Douek, D.C. (2006). Microbial translocation is a cause of systemic immune activation in chronic HIV infection. Nat Med 12, 1365-1371. Brenchley, J.M., Silvestri, G., and Douek, D.C. (2010). Nonprogressive and progressive primate immunodeficiency lentivirus infections. Immunity 32, 737-742.
Briese, T., Paweska, J.T., McMullan, L.K., Hutchison, S.K., Street, C., Palacios, G., Khristova, M.L., Weyer, J., Swanepoel, R., Egholm, M., Nichol, S.T., and Lipkin, W.I. (2009). Genetic detection and characterization of Lujo virus, a new hemorrhagic fever-associated arenavirus from southern Africa. PLoS Pathog 5, e1000455. Briggs, A.W., Stenzel, U., Johnson, P.L., Green, R.E., Kelso, J., Prufer, K., Meyer, M., Krause, J., Ronan, M.T., Lachmann, M., and Paabo, S. (2007). Patterns of damage in genomic DNA sequences from a Neandertal. Proc Natl Acad Sci U S A 104, 14616-14621. Brodie, E.L., DeSantis, T.Z., Parker, J.P., Zubietta, I.X., Piceno, Y.M., and Andersen, G.L. (2007). Urban aerosols harbor diverse and dynamic bacterial populations. Proc Natl Acad Sci U S A 104, 299-304. Brotman, R.M., Ravel, J., Cone, R.A., and Zenilman, J.M. (2010). Rapid fluctuation of the vaginal microbiota measured by Gram stain analysis. Sex Transm Infect 86, 297-302. Brulc, J.M., Antonopoulos, D.A., Miller, M.E., Wilson, M.K., Yannarell, A.C., Dinsdale, E.A., Edwards, R.E., Frank, E.D., Emerson, J.B., Wacklin, P., Coutinho, P.M., Henrissat, B., Nelson, K.E., and White, B.A. (2009). Gene-centric metagenomics of the fiber-adherent bovine rumen microbiome reveals forage specific glycoside hydrolases. Proc Natl Acad Sci U S A 106, 1948-1953. Cadwell, K., Patel, K.K., Maloney, N.S., Liu, T.C., Ng, A.C., Storer, C.E., Head, R.D., Xavier, R., Stappenbeck, T.S., and Virgin, H.W. (2010). Virus-plus-susceptibility gene interaction determines Crohn's disease gene Atg16L1 phenotypes in intestine. Cell 141, 1135-1145. Canadian.Cancer.Society (2012). Canadian Cancer Statistics, C.C. Society, ed. (Statistics Canada). Caporaso, J.G., Lauber, C.L., Costello, E.K., Berg-Lyons, D., Gonzalez, A., Stombaugh, J., Knights, D., Gajer, P., Ravel, J., Fierer, N., Gordon, J.I., and Knight, R. (2011). Moving pictures of the human microbiome. Genome Biol 12, R50.
Caporaso, J.G., Lauber, C.L., Walters, W.A., Berg-Lyons, D., Huntley, J., Fierer, N., Owens, S.M., Betley, J., Fraser, L., Bauer, M., Gormley, N., Gilbert, J.A., Smith, G., and Knight, R. (2012). Ultra-high-throughput microbial community analysis on the Illumina HiSeq and MiSeq platforms. ISME J 6, 1621-1624. Caruccio, N. (2011). Preparation of next-generation sequencing libraries using Nextera technology: simultaneous DNA fragmentation and adaptor tagging by in vitro transposition. Methods Mol Biol 733, 241-255. Castellarin, M., Warren, R.L., Freeman, J.D., Dreolini, L., Krzywinski, M., Strauss, J., Barnes, R., Watson, P., Allen-Vercoe, E., Moore, R.A., and Holt, R.A. (2012). Fusobacterium nucleatum infection is prevalent in human colorectal carcinoma. Genome Res 22, 299-306. Centers for Disease Control and Prevention (U.S.) (2004). Trends in reportable sexually transmitted diseases in the United States (Atlanta, GA, Centers for Disease Control and Prevention). Chan, P.J., Seraj, I.M., Kalugdan, T.H., and King, A. (1996). Prevalence of mycoplasma conserved DNA in malignant ovarian cancer detected using sensitive PCR-ELISA. Gynecol Oncol 63, 258-260. Chang, Y., Cesarman, E., Pessin, M.S., Lee, F., Culpepper, J., Knowles, D.M., and Moore, P.S. (1994). Identification of herpesvirus-like DNA sequences in AIDS-associated Kaposi's sarcoma. Science 266, 1865-1869. Cherpes, T.L., Meyn, L.A., Krohn, M.A., Lurie, J.G., and Hillier, S.L. (2003). Association between acquisition of herpes simplex virus type 2 in women and bacterial vaginosis. Clin Infect Dis 37, 319-325. Cheval, J., Sauvage, V., Frangeul, L., Dacheux, L., Guigon, G., Dumey, N., Pariente, K., Rousseaux, C., Dorange, F., Berthet, N., Brisse, S., Moszer, I., Bourhy, H., Manuguerra, C.J., Lecuit, M., Burguiere, A., Caro, V., and Eloit, M. (2011). Evaluation of high-throughput sequencing for identifying known and unknown viruses in biological samples. J Clin Microbiol 49, 3268-3275.
Chiu, C.Y., Alizadeh, A.A., Rouskin, S., Merker, J.D., Yeh, E., Yagi, S., Schnurr, D., Patterson, B.K., Ganem, D., and DeRisi, J.L. (2007). Diagnosis of a critical respiratory illness caused by human metapneumovirus by use of a pan-virus microarray. J Clin Microbiol 45, 2340-2343. Chiu, C.Y., Greninger, A.L., Kanada, K., Kwok, T., Fischer, K.F., Runckel, C., Louie, J.K., Glaser, C.A., Yagi, S., Schnurr, D.P., Haggerty, T.D., Parsonnet, J., Ganem, D., and DeRisi, J.L. (2008a). Identification of cardioviruses related to Theiler's murine encephalomyelitis virus in human infections. Proc Natl Acad Sci U S A 105, 14124-14129. Chiu, C.Y., Rouskin, S., Koshy, A., Urisman, A., Fischer, K., Yagi, S., Schnurr, D., Eckburg, P.B., Tompkins, L.S., Blackburn, B.G., Merker, J.D., Patterson, B.K., Ganem, D., and DeRisi, J.L. (2006). Microarray detection of human parainfluenzavirus 4 infection associated with respiratory failure in an immunocompetent adult. Clin Infect Dis 43, e71-76. Chiu, C.Y., Urisman, A., Greenhow, T.L., Rouskin, S., Yagi, S., Schnurr, D., Wright, C., Drew, W.L., Wang, D., Weintrub, P.S., Derisi, J.L., and Ganem, D. (2008b). Utility of DNA microarrays for detection of viruses in acute respiratory tract infections in children. J Pediatr 153, 76-83. Cho, K.R., and Shih Ie, M. (2009). Ovarian cancer. Annu Rev Pathol 4, 287-313. Chu, D.K., Poon, L.L., Chiu, S.S., Chan, K.H., Ng, E.M., Bauer, I., Cheung, T.K., Ng, I.H., Guan, Y., Wang, D., and Peiris, J.S. (2012). Characterization of a novel gyrovirus in human stool and chicken meat. J Clin Virol 55, 209-213. Chung, H., Pamp, S.J., Hill, J.A., Surana, N.K., Edelman, S.M., Troy, E.B., Reading, N.C., Villablanca, E.J., Wang, S., Mora, J.R., Umesaki, Y., Mathis, D., Benoist, C., Relman, D.A., and Kasper, D.L. (2012). Gut immune maturation depends on colonization with a host-specific microbiota. Cell 149, 1578-1593. Clarke, K.R. (1993). Non-parametric multivariate analyses of changes in community structure. 
Australian Journal of Ecology 18, 117-143. Codling, C., O'Mahony, L., Shanahan, F., Quigley, E.M., and Marchesi, J.R. (2010). A molecular analysis of fecal and mucosal bacterial communities in irritable bowel syndrome. Dig Dis Sci 55, 392-397. Cohen, C.R., Lingappa, J.R., Baeten, J.M., Ngayo, M.O., Spiegel, C.A., Hong, T., Donnell, D., Celum, C., Kapiga, S., Delany, S., and Bukusi, E.A. (2012). Bacterial vaginosis associated with increased risk of female-to-male HIV-1 transmission: a prospective cohort analysis among African couples. PLoS Med 9, e1001251. Corpet, F. (1988). Multiple sequence alignment with hierarchical clustering. Nucleic Acids Res 16, 10881-10890. Costello, E.K., Lauber, C.L., Hamady, M., Fierer, N., Gordon, J.I., and Knight, R. (2009). Bacterial community variation in human body habitats across space and time. Science 326, 1694-1697. Coward, J., Kulbe, H., Chakravarty, P., Leader, D., Vassileva, V., Leinster, D.A., Thompson, R., Schioppa, T., Nemeth, J., Vermeulen, J., Singh, N., Avril, N., Cummings, J., Rexhepaj, E., Jirstrom, K., Gallagher, W.M., Brennan, D.J., McNeish, I.A., and Balkwill, F.R. (2011). Interleukin-6 as a therapeutic target in human ovarian cancer. Clin Cancer Res 17, 6083-6096. Cox, M.J., Huang, Y.J., Fujimura, K.E., Liu, J.T., McKean, M., Boushey, H.A., Segal, M.R., Brodie, E.L., Cabana, M.D., and Lynch, S.V. (2010). Lactobacillus casei abundance is associated with profound shifts in the infant gut microbiome. PLoS One 5, e8745. D'Haeseleer, P. (2005). How does gene expression clustering work? Nat Biotechnol 23, 1499-1501. De Palma, G., Nadal, I., Medina, M., Donat, E., Ribes-Koninckx, C., Calabuig, M., and Sanz, Y. (2010). Intestinal dysbiosis and reduced immunoglobulin-coated bacteria associated with coeliac disease in children. BMC Microbiol 10, 63. Diaz, N.N., Krause, L., Goesmann, A., Niehaus, K., and Nattkemper, T.W. (2009).
TACOA: taxonomic classification of environmental genomic fragments using a kernelized nearest neighbor approach. BMC Bioinformatics 10, 56. Diguistini, S., Liao, N.Y., Platt, D., Robertson, G., Seidel, M., Chan, S.K., Docking, T.R., Birol, I., Holt, R.A., Hirst, M., Mardis, E., Marra, M.A., Hamelin, R.C., Bohlmann, J., Breuil, C., and Jones, S.J. (2009). De novo genome sequence assembly of a filamentous fungus using Sanger, 454 and Illumina sequence data. Genome Biol 10, R94. Doane, F.W. (1987). Immunoelectron microscopy in diagnostic virology. Ultrastruct Pathol 11, 681-685. Dolan, P.C., and Denver, D.R. (2008). TileQC: a system for tile-based quality control of Solexa data. BMC Bioinformatics 9, 250. Drell, T., Lillsaar, T., Tummeleht, L., Simm, J., Aaspollu, A., Vain, E., Saarma, I., Salumets, A., Donders, G.G., and Metsis, M. (2013). Characterization of the vaginal micro- and mycobiome in asymptomatic reproductive-age estonian women. PLoS One 8, e54379. Duerkop, B.A., Clements, C.V., Rollins, D., Rodrigues, J.L., and Hooper, L.V. (2012). A composite bacteriophage alters colonization by an intestinal commensal bacterium. Proc Natl Acad Sci U S A 109, 17621-17626. Duff, P., Gibbs, R.S., Blanco, J.D., and St Clair, P.J. (1983). Endometrial culture techniques in puerperal patients. Obstet Gynecol 61, 217-222. Dumonceaux, T.J., Schellenberg, J., Goleski, V., Hill, J.E., Jaoko, W., Kimani, J., Money, D., Ball, T.B., Plummer, F.A., and Severini, A. (2009). Multiplex detection of bacteria associated with normal microbiota and with bacterial vaginosis in vaginal swabs by use of oligonucleotide-coupled fluorescent microspheres. J Clin Microbiol 47, 4067-4077. Eckburg, P.B., Bik, E.M., Bernstein, C.N., Purdom, E., Dethlefsen, L., Sargent, M., Gill, S.R., Nelson, K.E., and Relman, D.A. (2005). Diversity of the human intestinal microbial flora. Science 308, 1635-1638. Eisen, M.B., Spellman, P.T., Brown, P.O., and Botstein, D. (1998).
Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci U S A 95, 14863-14868. Eisenstein, M. (2012). The battle for sequencing supremacy. Nat Biotechnol 30, 1023-1026. Farquhar, C., Mbori-Ngacha, D., Overbaugh, J., Wamalwa, D., Harris, J., Bosire, R., and John-Stewart, G. (2010). Illness during pregnancy and bacterial vaginosis are associated with in-utero HIV-1 transmission. AIDS 24, 153-155. Faust, K., Sathirapongsasuti, J.F., Izard, J., Segata, N., Gevers, D., Raes, J., and Huttenhower, C. (2012). Microbial co-occurrence relationships in the human microbiome. PLoS Comput Biol 8, e1002606. Feng, H., Shuda, M., Chang, Y., and Moore, P.S. (2008). Clonal integration of a polyomavirus in human Merkel cell carcinoma. Science 319, 1096-1100. Filiatrault, M.J. (2011). Progress in prokaryotic transcriptomics. Curr Opin Microbiol 14, 579-586. Finkbeiner, S.R., Allred, A.F., Tarr, P.I., Klein, E.J., Kirkwood, C.D., and Wang, D. (2008). Metagenomic analysis of human diarrhea: viral detection and discovery. PLoS Pathog 4, e1000011. Fiore, J.R., Suligoi, B., Monno, L., Angarano, G., and Pastore, G. (2002). HIV-1 shedding in genital tract of infected women. Lancet 359, 1525-1526. Fleischmann, R.D., Adams, M.D., White, O., Clayton, R.A., Kirkness, E.F., Kerlavage, A.R., Bult, C.J., Tomb, J.F., Dougherty, B.A., Merrick, J.M., and et al. (1995). Whole-genome random sequencing and assembly of Haemophilus influenzae Rd. Science 269, 496-512. Flores, R., Shi, J., Gail, M.H., Gajer, P., Ravel, J., and Goedert, J.J. (2012). Assessment of the human faecal microbiota: II. Reproducibility and associations of 16S rRNA pyrosequences. Eur J Clin Invest 42, 855-863. Forsman, A., and Weiss, R.A. (2008). Why is HIV a pathogen? Trends Microbiol 16, 555-560. Frank, D.N., Robertson, C.E., Hamm, C.M., Kpadeh, Z., Zhang, T., Chen, H., Zhu, W., Sartor, R.B., Boedeker, E.C., Harpaz, N., Pace, N.R., and Li, E. (2011a).
Disease phenotype and genotype are associated with shifts in intestinal-associated microbiota in inflammatory bowel diseases. Inflamm Bowel Dis 17, 179-184. Frank, D.N., St Amand, A.L., Feldman, R.A., Boedeker, E.C., Harpaz, N., and Pace, N.R. (2007). Molecular-phylogenetic characterization of microbial community imbalances in human inflammatory bowel diseases. Proc Natl Acad Sci U S A 104, 13780-13785. Frank, D.N., Zhu, W., Sartor, R.B., and Li, E. (2011b). Investigating the biological and clinical significance of human dysbioses. Trends Microbiol 19, 427-434. Fredricks, D.N., Fiedler, T.L., and Marrazzo, J.M. (2005). Molecular identification of bacteria associated with bacterial vaginosis. N Engl J Med 353, 1899-1911. Gajer, P., Brotman, R.M., Bai, G., Sakamoto, J., Schutte, U.M., Zhong, X., Koenig, S.S., Fu, L., Ma, Z.S., Zhou, X., Abdo, Z., Forney, L.J., and Ravel, J. (2012). Temporal dynamics of the human vaginal microbiota. Sci Transl Med 4, 132ra152. Garber, M., Zody, M.C., Arachchi, H.M., Berlin, A., Gnerre, S., Green, L.M., Lennon, N., and Nusbaum, C. (2009). Closing gaps in the human genome using sequencing by synthesis. Genome Biol 10, R60. Garrett, W.S., Gallini, C.A., Yatsunenko, T., Michaud, M., DuBois, A., Delaney, M.L., Punit, S., Karlsson, M., Bry, L., Glickman, J.N., Gordon, J.I., Onderdonk, A.B., and Glimcher, L.H. (2010). Enterobacteriaceae act in concert with the gut microbiota to induce spontaneous and maternally transmitted colitis. Cell Host Microbe 8, 292-300. Gaynor, A.M., Nissen, M.D., Whiley, D.M., Mackay, I.M., Lambert, S.B., Wu, G., Brennan, D.C., Storch, G.A., Sloots, T.P., and Wang, D. (2007). Identification of a novel polyomavirus from patients with acute respiratory tract infections. PLoS Pathog 3, e64. Genc, M.R., Vardhana, S., Delaney, M.L., Onderdonk, A., Tuomala, R., Norwitz, E., and Witkin, S.S. (2004).
Relationship between a toll-like receptor-4 gene polymorphism, bacterial vaginosis-related flora and vaginal cytokine responses in pregnant women. Eur J Obstet Gynecol Reprod Biol 116, 152-156. Gerlach, W., Junemann, S., Tille, F., Goesmann, A., and Stoye, J. (2009). WebCARMA: a web application for the functional and taxonomic classification of unassembled metagenomic reads. BMC Bioinformatics 10, 430. Gilbert, J.A., and Dupont, C.L. (2011). Microbial metagenomics: beyond the genome. Ann Rev Mar Sci 3, 347-371. Gill, S.R., Pop, M., Deboy, R.T., Eckburg, P.B., Turnbaugh, P.J., Samuel, B.S., Gordon, J.I., Relman, D.A., Fraser-Liggett, C.M., and Nelson, K.E. (2006). Metagenomic analysis of the human distal gut microbiome. Science 312, 1355-1359. Gomez-Alvarez, V., Teal, T.K., and Schmidt, T.M. (2009). Systematic artifacts in metagenomes from complex microbial communities. ISME J 3, 1314-1317. Grahn, N., Olofsson, M., Ellnebo-Svedlund, K., Monstein, H.J., and Jonasson, J. (2003). Identification of mixed bacterial DNA contamination in broad-range PCR amplification of 16S rDNA V1 and V3 variable regions by pyrosequencing of cloned amplicons. FEMS Microbiol Lett 219, 87-91. Greenblum, S., Turnbaugh, P.J., and Borenstein, E. (2012). Metagenomic systems biology of the human gut microbiome reveals topological shifts associated with obesity and inflammatory bowel disease. Proc Natl Acad Sci U S A 109, 594-599. Gregory, J.B., Litaker, R.W., and Noble, R.T. (2006). Rapid one-step quantitative reverse transcriptase PCR assay with competitive internal positive control for detection of enteroviruses in environmental samples. Appl Environ Microbiol 72, 3960-3967. Greninger, A.L., Chen, E.C., Sittler, T., Scheinerman, A., Roubinian, N., Yu, G., Kim, E., Pillai, D.R., Guyard, C., Mazzulli, T., Isa, P., Arias, C.F., Hackett, J., Schochetman, G., Miller, S., Tang, P., and Chiu, C.Y. (2010). 
A metagenomic analysis of pandemic influenza A (2009 H1N1) infection in patients from North America. PLoS One 5, e13381. Gronlund, M.M., Lehtonen, O.P., Eerola, E., and Kero, P. (1999). Fecal microflora in healthy infants born by different methods of delivery: permanent changes in intestinal flora after cesarean delivery. J Pediatr Gastroenterol Nutr 28, 19-25. Haange, S.B., Oberbach, A., Schlichting, N., Hugenholtz, F., Smidt, H., von Bergen, M., Till, H., and Seifert, J. (2012). Metaproteome analysis and molecular genetics of rat intestinal microbiota reveals section and localization resolved species distribution and enzymatic functionalities. J Proteome Res 11, 5406-5417. Haas, A., Zimmermann, K., Graw, F., Slack, E., Rusert, P., Ledergerber, B., Bossart, W., Weber, R., Thurnheer, M.C., Battegay, M., Hirschel, B., Vernazza, P., Patuto, N., Macpherson, A.J., Gunthard, H.F., and Oxenius, A. (2011a). Systemic antibody responses to gut commensal bacteria during chronic HIV-1 infection. Gut 60, 1506-1519. Haas, B.J., Gevers, D., Earl, A.M., Feldgarden, M., Ward, D.V., Giannoukos, G., Ciulla, D., Tabbaa, D., Highlander, S.K., Sodergren, E., Methe, B., DeSantis, T.Z., Petrosino, J.F., Knight, R., and Birren, B.W. (2011b). Chimeric 16S rRNA sequence formation and detection in Sanger and 454-pyrosequenced PCR amplicons. Genome Res 21, 494-504. Haggerty, C.L., Hillier, S.L., Bass, D.C., and Ness, R.B. (2004). Bacterial vaginosis and anaerobic bacteria are associated with endometritis. Clin Infect Dis 39, 990-995. Hajishengallis, G., Darveau, R.P., and Curtis, M.A. (2012). The keystone-pathogen hypothesis. Nat Rev Microbiol 10, 717-725. Hajishengallis, G., Liang, S., Payne, M.A., Hashim, A., Jotwani, R., Eskan, M.A., McIntosh, M.L., Alsam, A., Kirkwood, K.L., Lambris, J.D., Darveau, R.P., and Curtis, M.A. (2011). Low-abundance biofilm species orchestrates inflammatory periodontal disease through the commensal microbiota and complement. Cell Host Microbe 10, 497-506.
Hamady, M., and Knight, R. (2009). Microbial community profiling for human microbiome projects: Tools, techniques, and challenges. Genome Res 19, 1141-1152. Hamady, M., Lozupone, C., and Knight, R. (2010). Fast UniFrac: facilitating high-throughput phylogenetic analyses of microbial communities including analysis of pyrosequencing and PhyloChip data. ISME J 4, 17-27. Hamady, Z.Z., Scott, N., Farrar, M.D., Wadhwa, M., Dilger, P., Whitehead, T.R., Thorpe, R., Holland, K.T., Lodge, J.P., and Carding, S.R. (2011). Treatment of colitis with a commensal gut bacterium engineered to secrete human TGF-beta1 under the control of dietary xylan 1. Inflamm Bowel Dis 17, 1925-1935. Hand, T.W., Dos Santos, L.M., Bouladoux, N., Molloy, M.J., Pagan, A.J., Pepper, M., Maynard, C.L., Elson, C.O., 3rd, and Belkaid, Y. (2012). Acute gastrointestinal infection induces long-lived microbiota-specific T cell responses. Science 337, 1553-1556. Handelsman, J., Rondon, M.R., Brady, S.F., Clardy, J., and Goodman, R.M. (1998). Molecular biological access to the chemistry of unknown soil microbes: a new frontier for natural products. Chem Biol 5, R245-249. Handley, S.A., Thackray, L.B., Zhao, G., Presti, R., Miller, A.D., Droit, L., Abbink, P., Maxfield, L.F., Kambal, A., Duan, E., Stanley, K., Kramer, J., Macri, S.C., Permar, S.R., Schmitz, J.E., Mansfield, K., Brenchley, J.M., Veazey, R.S., Stappenbeck, T.S., Wang, D., Barouch, D.H., and Virgin, H.W. (2012). Pathogenic simian immunodeficiency virus infection is associated with expansion of the enteric virome. Cell 151, 253-266. Harmsen, H.J., Wildeboer-Veloo, A.C., Raangs, G.C., Wagendorp, A.A., Klijn, N., Bindels, J.G., and Welling, G.W. (2000). Analysis of intestinal flora development in breast-fed and formula-fed infants by using molecular identification and detection methods. J Pediatr Gastroenterol Nutr 30, 61-67. Head, S.R., Komori, H.K., Hart, G.T., Shimashita, J., Schaffer, L., Salomon, D.R., and Ordoukhanian, P.T. (2011).
Method for improved Illumina sequencing library preparation using NuGEN Ovation RNASeq System. Biotechniques 50, 177-180. Hehemann, J.H., Correc, G., Barbeyron, T., Helbert, W., Czjzek, M., and Michel, G. (2010). Transfer of carbohydrate-active enzymes from marine bacteria to Japanese gut microbiota. Nature 464, 908-912. Heinonen, P.K., Teisala, K., Punnonen, R., Miettinen, A., Lehtinen, M., and Paavonen, J. (1985). Anatomic sites of upper genital tract infection. Obstet Gynecol 66, 384-390. Hewson, I., Poretsky, R.S., Dyhrman, S.T., Zielinski, B., White, A.E., Tripp, H.J., Montoya, J.P., and Zehr, J.P. (2009). Microbial community gene expression within colonies of the diazotroph, Trichodesmium, from the Southwest Pacific Ocean. ISME J 3, 1286-1300. Hill, J.E., Goh, S.H., Money, D.M., Doyle, M., Li, A., Crosby, W.L., Links, M., Leung, A., Chan, D., and Hemmingsen, S.M. (2005). Characterization of vaginal microflora of healthy, nonpregnant women by chaperonin-60 sequence-based methods. Am J Obstet Gynecol 193, 682-692. Hirayama, A., Kami, K., Sugimoto, M., Sugawara, M., Toki, N., Onozuka, H., Kinoshita, T., Saito, N., Ochiai, A., Tomita, M., Esumi, H., and Soga, T. (2009). Quantitative metabolome profiling of colon and stomach cancer microenvironment by capillary electrophoresis time-of-flight mass spectrometry. Cancer Res 69, 4918-4925. HMPC (2012). Structure, function and diversity of the healthy human microbiome. Nature 486, 207-214. Hodgson, T.A., and Rachanis, C.C. (2002). Oral fungal and bacterial infections in HIV-infected individuals: an overview in Africa. Oral Dis 8 Suppl 2, 80-87. Holt, R.A., and Jones, S.J. (2008). The new paradigm of flow cell sequencing. Genome Res 18, 839-846. Holton, J. (2008). Enterotoxigenic Bacteroides fragilis. Curr Infect Dis Rep 10, 99-104. Huang, Y.J., Kim, E., Cox, M.J., Brodie, E.L., Brown, R., Wiener-Kronish, J.P., and Lynch, S.V. (2010). 
A persistent and diverse airway microbiota present during chronic obstructive pulmonary disease exacerbations. OMICS 14, 9-59. Hummelen, R., Fernandes, A.D., Macklaim, J.M., Dickson, R.J., Changalucha, J., Gloor, G.B., and Reid, G. (2010). Deep sequencing of the vaginal microbiota of women with HIV. PLoS One 5, e12078. Idahl, A., Lundin, E., Elgh, F., Jurstrand, M., Moller, J.K., Marklund, I., Lindgren, P., and Ottander, U. (2010). Chlamydia trachomatis, Mycoplasma genitalium, Neisseria gonorrhoeae, human papillomavirus, and polyomavirus are not detectable in human tissue with epithelial ovarian cancer, borderline tumor, or benign conditions. Am J Obstet Gynecol 202, 71 e71-76. Iliev, I.D., Funari, V.A., Taylor, K.D., Nguyen, Q., Reyes, C.N., Strom, S.P., Brown, J., Becker, C.A., Fleshner, P.R., Dubinsky, M., Rotter, J.I., Wang, H.L., McGovern, D.P., Brown, G.D., and Underhill, D.M. (2012). Interactions between commensal fungi and the C-type lectin receptor Dectin-1 influence colitis. Science 336, 1314-1317. Imirzalioglu, C., Hain, T., Chakraborty, T., and Domann, E. (2008). Hidden pathogens uncovered: metagenomic analysis of urinary tract infections. Andrologia 40, 66-71. Israel, D.A., Salama, N., Arnold, C.N., Moss, S.F., Ando, T., Wirth, H.P., Tham, K.T., Camorlinga, M., Blaser, M.J., Falkow, S., and Peek, R.M., Jr. (2001). Helicobacter pylori strain-specific differences in genetic content, identified by microarray, influence host inflammatory responses. J Clin Invest 107, 611-620. Javier, R.T., and Butel, J.S. (2008). The history of tumor virology. Cancer Res 68, 7693-7706. Jespers, V., Menten, J., Smet, H., Poradosu, S., Abdellati, S., Verhelst, R., Hardy, L., Buve, A., and Crucitti, T. (2012). Quantification of bacterial species of the vaginal microbiome in different groups of women, using nucleic acid amplification tests. BMC Microbiol 12, 83.
Juven, T., Mertsola, J., Waris, M., Leinonen, M., Meurman, O., Roivainen, M., Eskola, J., Saikku, P., and Ruuskanen, O. (2000). Etiology of community-acquired pneumonia in 254 hospitalized children. Pediatr Infect Dis J 19, 293-298. Kamada, N., Kim, Y.G., Sham, H.P., Vallance, B.A., Puente, J.L., Martens, E.C., and Nunez, G. (2012). Regulated virulence controls the ability of a pathogen to compete with the gut microbiota. Science 336, 1325-1329. Kapoor, A., Victoria, J., Simmonds, P., Slikas, E., Chieochansin, T., Naeem, A., Shaukat, S., Sharif, S., Alam, M.M., Angez, M., Wang, C., Shafer, R.W., Zaidi, S., and Delwart, E. (2008). A highly prevalent and genetically diversified Picornaviridae genus in South Asian children. Proc Natl Acad Sci U S A 105, 20482-20487. Karst, A.M., and Drapkin, R. (2010). Ovarian cancer pathogenesis: a model in evolution. J Oncol 2010, 932371. Kaser, A., Zeissig, S., and Blumberg, R.S. (2010). Inflammatory bowel disease. Annu Rev Immunol 28, 573-621. Kelemen, L.E., and Kobel, M. (2011). Mucinous carcinomas of the ovary and colorectum: different organ, same dilemma. Lancet Oncol 12, 1071-1080. Kembel, S.W., Eisen, J.A., Pollard, K.S., and Green, J.L. (2011). The phylogenetic diversity of metagenomes. PLoS One 6, e23214. Khan, A.A., Shrivastava, A., and Khurshid, M. (2012a). Normal to cancer microbiome transformation and its implication in cancer diagnosis. Biochim Biophys Acta 1826, 331-337. Khan, Z., Ahmad, S., Joseph, L., and Chandy, R. (2012b). Candida dubliniensis: an appraisal of its clinical significance as a bloodstream pathogen. PLoS One 7, e32952. Kilic, A.O., Pavlova, S.I., Alpay, S., Kilic, S.S., and Tao, L. (2001). Comparative study of vaginal Lactobacillus phages isolated from women in the United States and Turkey: prevalence, morphology, host range, and DNA homology. Clin Diagn Lab Immunol 8, 31-39. Kim, J., Coffey, D.M., Creighton, C.J., Yu, Z., Hawkins, S.M., and Matzuk, M.M. (2012). 
High-grade serous ovarian cancer arises from fallopian tube in a mouse model. Proc Natl Acad Sci U S A 109, 3921-3926. Kimelman, A., Levy, A., Sberro, H., Kidron, S., Leavitt, A., Amitai, G., Yoder-Himes, D.R., Wurtzel, O., Zhu, Y., Rubin, E.M., and Sorek, R. (2012). A vast collection of microbial genes that are toxic to bacteria. Genome Res 22, 802-809. Kistler, A., Avila, P.C., Rouskin, S., Wang, D., Ward, T., Yagi, S., Schnurr, D., Ganem, D., DeRisi, J.L., and Boushey, H.A. (2007). Pan-viral screening of respiratory tract infections in adults with and without asthma reveals unexpected human coronavirus and human rhinovirus diversity. J Infect Dis 196, 817-825. Klatt, N.R., Funderburg, N.T., and Brenchley, J.M. (2013). Microbial translocation, immune activation, and HIV disease. Trends Microbiol 21, 6-13. Klatt, N.R., and Silvestri, G. (2012). CD4+ T cells and HIV: A paradoxical Pas de Deux. Sci Transl Med 4, 123ps124. Koenig, J.E., Spor, A., Scalfone, N., Fricker, A.D., Stombaugh, J., Knight, R., Angenent, L.T., and Ley, R.E. (2011). Succession of microbial consortia in the developing infant gut microbiome. Proc Natl Acad Sci U S A 108 Suppl 1, 4578-4585. Koren, O., Knights, D., Gonzalez, A., Waldron, L., Segata, N., Knight, R., Huttenhower, C., and Ley, R.E. (2013). A Guide to Enterotypes across the Human Body: Meta-Analysis of Microbial Community Structures in Human Microbiome Datasets. PLoS Comput Biol 9, e1002863. Koren, O., Spor, A., Felin, J., Fak, F., Stombaugh, J., Tremaroli, V., Behre, C.J., Knight, R., Fagerberg, B., Ley, R.E., and Backhed, F. (2011). Human oral, gut, and plaque microbiota in patients with atherosclerosis. Proc Natl Acad Sci U S A 108 Suppl 1, 4592-4598. Korshunov, V.M., Gudieva, Z.A., Efimov, B.A., Pikina, A.P., Smeianov, V.V., Reid, G., Korshunova, O.V., Tiutiunnik, V.L., and Stepin, II (1999). [The vaginal Bifidobacterium flora in women of reproductive age]. Zh Mikrobiol Epidemiol Immunobiol, 74-78.
Kosakovsky Pond, S., Wadhawan, S., Chiaromonte, F., Ananda, G., Chung, W.Y., Taylor, J., and Nekrutenko, A. (2009). Windshield splatter analysis with the Galaxy metagenomic pipeline. Genome Res 19, 2144-2153. Kostic, A.D., Gevers, D., Pedamallu, C.S., Michaud, M., Duke, F., Earl, A.M., Ojesina, A.I., Jung, J., Bass, A.J., Tabernero, J., Baselga, J., Liu, C., Shivdasani, R.A., Ogino, S., Birren, B.W., Huttenhower, C., Garrett, W.S., and Meyerson, M. (2012). Genomic analysis identifies association of Fusobacterium with colorectal carcinoma. Genome Res 22, 292-298. Koumans, E.H., Sternberg, M., Bruce, C., McQuillan, G., Kendrick, J., Sutton, M., and Markowitz, L.E. (2007). The prevalence of bacterial vaginosis in the United States, 2001-2004; associations with symptoms, sexual behaviors, and reproductive health. Sex Transm Dis 34, 864-869. Krause, L., Diaz, N.N., Goesmann, A., Kelley, S., Nattkemper, T.W., Rohwer, F., Edwards, R.A., and Stoye, J. (2008). Phylogenetic classification of short environmental DNA fragments. Nucleic Acids Res 36, 2230-2239. Krieger, J.N., and Riley, D.E. (2002). Bacteria in the chronic prostatitis-chronic pelvic pain syndrome: molecular approaches to critical research questions. J Urol 167, 2574-2583. Kubota, T., Sakae, U., Takeuchi, H., and Usui, M. (1995). Detection and identification of amines in bacterial vaginosis. J Obstet Gynaecol (Tokyo 1995) 21, 51-55. Kuller, L.H., Tracy, R., Belloso, W., De Wit, S., Drummond, F., Lane, H.C., Ledergerber, B., Lundgren, J., Neuhaus, J., Nixon, D., Paton, N.I., and Neaton, J.D. (2008). Inflammatory and coagulation biomarkers and mortality in patients with HIV infection. PLoS Med 5, e203. Kumar, S., and Blaxter, M.L. (2010). Comparing de novo assemblers for 454 transcriptome data. BMC Genomics 11, 571. Lan, Y., Wang, Q., Cole, J.R., and Rosen, G.L. (2012). Using the RDP classifier to predict taxonomic novelty and reduce the search space for finding novel organisms. PLoS One 7, e32491.
Lane, D.J., Pace, B., Olsen, G.J., Stahl, D.A., Sogin, M.L., and Pace, N.R. (1985). Rapid determination of 16S ribosomal RNA sequences for phylogenetic analyses. Proc Natl Acad Sci U S A 82, 6955-6959. Lauck, M., Alvarado-Mora, M.V., Becker, E.A., Bhattacharya, D., Striker, R., Hughes, A.L., Carrilho, F.J., O'Connor, D.H., and Pinho, J.R. (2012). Analysis of hepatitis C virus intrahost diversity across the coding region by ultradeep pyrosequencing. J Virol 86, 3952-3960. Lee, Y., Miron, A., Drapkin, R., Nucci, M.R., Medeiros, F., Saleemuddin, A., Garber, J., Birch, C., Mou, H., Gordon, R.W., Cramer, D.W., McKeon, F.D., and Crum, C.P. (2007). A candidate precursor to serous carcinoma that originates in the distal fallopian tube. J Pathol 211, 26-35. Lemire, S., Figueroa-Bossi, N., and Bossi, L. (2011). Bacteriophage crosstalk: coordination of prophage induction by trans-acting antirepressors. PLoS Genet 7, e1002149. Leski, T.A., Malanoski, A.P., Stenger, D.A., and Lin, B. (2010). Target amplification for broad spectrum microbial diagnostics and detection. Future Microbiol 5, 191-203. Ley, R.E., Hamady, M., Lozupone, C., Turnbaugh, P.J., Ramey, R.R., Bircher, J.S., Schlegel, M.L., Tucker, T.A., Schrenzel, M.D., Knight, R., and Gordon, J.I. (2008). Evolution of mammals and their gut microbes. Science 320, 1647-1651. Li, H., and Durbin, R. (2009). Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25, 1754-1760. Li, L., Victoria, J.G., Wang, C., Jones, M., Fellers, G.M., Kunz, T.H., and Delwart, E. (2010). Bat guano virome: predominance of dietary viruses from insects and plants plus novel mammalian viruses. J Virol 84, 6955-6965. Lin, H.W., Tu, Y.Y., Lin, S.Y., Su, W.J., Lin, W.L., Lin, W.Z., Wu, S.C., and Lai, Y.L. (2011). Risk of ovarian cancer in women with pelvic inflammatory disease: a population-based study. Lancet Oncol 12, 900-904. Linhares, I.M., Summers, P.R., Larsen, B., Giraldo, P.C., and Witkin, S.S. (2011). 
Contemporary perspectives on vaginal pH and lactobacilli. Am J Obstet Gynecol 204, 120 e121-125. Lisa, D.M. (2010). Exploring Size Selection Methods to Optimize Read Length in 454 Sequencing (Broad Institute). Liu, L., Li, Y., Li, S., Hu, N., He, Y., Pong, R., Lin, D., Lu, L., and Law, M. (2012). Comparison of next-generation sequencing systems. J Biomed Biotechnol 2012, 251364. Lo, S.C., Pripuzova, N., Li, B., Komaroff, A.L., Hung, G.C., Wang, R., and Alter, H.J. (2010). Detection of MLV-related virus gene sequences in blood of patients with chronic fatigue syndrome and healthy blood donors. Proc Natl Acad Sci U S A 107, 15874-15879. Loens, K., Ursi, D., Ieven, M., van Aarle, P., Sillekens, P., Oudshoorn, P., and Goossens, H. (2002). Detection of Mycoplasma pneumoniae in spiked clinical samples by nucleic acid sequence-based amplification. J Clin Microbiol 40, 1339-1345. Luo, C., Tsementzi, D., Kyrpides, N., Read, T., and Konstantinidis, K.T. (2012). Direct comparisons of Illumina vs. Roche 454 sequencing technologies on the same microbial community DNA sample. PLoS One 7, e30087. Lupp, C., Robertson, M.L., Wickham, M.E., Sekirov, I., Champion, O.L., Gaynor, E.C., and Finlay, B.B. (2007). Host-mediated inflammation disrupts the intestinal microbiota and promotes the overgrowth of Enterobacteriaceae. Cell Host Microbe 2, 204. Ma, B., Forney, L.J., and Ravel, J. (2012). Vaginal microbiome: rethinking health and disease. Annu Rev Microbiol 66, 371-389. Mahal, D.C., B; Albert, L; Wagner, E; Hill, J; Hemmingsen, S; Pick, N; Money, D, Vogue Study Group (2013). Metagenomic Characterization of the Vaginal Microbiome in HIV+ Women Using Culture Independent Methods. Paper presented at: Conference on Retroviruses and Opportunistic Infections (Georgia World Congress Center, Atlanta, GA, USA). Mahony, J.B., Hatchette, T., Ojkic, D., Drews, S.J., Gubbay, J., Low, D.E., Petric, M., Tang, P., Chong, S., Luinstra, K., Petrich, A., and Smieja, M. (2009).
Multiplex PCR tests sentinel the appearance of pandemic influenza viruses including H1N1 swine influenza. J Clin Virol 45, 200-202. Malboeuf, C.M., Yang, X., Charlebois, P., Qu, J., Berlin, A.M., Casali, M., Pesko, K.N., Boutwell, C.L., Devincenzo, J.P., Ebel, G.D., Allen, T.M., Zody, M.C., Henn, M.R., and Levin, J.Z. (2013). Complete viral RNA genome sequencing of ultra-low copy samples by sequence-independent amplification. Nucleic Acids Res 41, e13. Maleckiene, L., Kajenas, S., Nadisauskiene, R.J., and Railaite, D.R. (2009). Comparison of clinical and laparoscopic diagnoses of pelvic inflammatory disease. Int J Gynaecol Obstet 104, 74-75. Mande, S.S., Mohammed, M.H., and Ghosh, T.S. (2012). Classification of metagenomic sequences: methods and challenges. Brief Bioinform 13, 669-681. Marchetti, G., Bellistri, G.M., Borghi, E., Tincati, C., Ferramosca, S., La Francesca, M., Morace, G., Gori, A., and Monforte, A.D. (2008). Microbial translocation is associated with sustained failure in CD4+ T-cell reconstitution in HIV-infected patients on long-term highly active antiretroviral therapy. AIDS 22, 2035-2038. Marine, R., Polson, S.W., Ravel, J., Hatfull, G., Russell, D., Sullivan, M., Syed, F., Dumas, M., and Wommack, K.E. (2011). Evaluation of a transposase protocol for rapid generation of shotgun high-throughput sequencing libraries from nanogram quantities of DNA. Appl Environ Microbiol 77, 8071-8079. Marquez, R.T., Baggerly, K.A., Patterson, A.P., Liu, J., Broaddus, R., Frumovitz, M., Atkinson, E.N., Smith, D.I., Hartmann, L., Fishman, D., Berchuck, A., Whitaker, R., Gershenson, D.M., Mills, G.B., Bast, R.C., Jr., and Lu, K.H. (2005). Patterns of gene expression in different histotypes of epithelial ovarian cancer correlate with those in normal fallopian tube, endometrium, and colon. Clin Cancer Res 11, 6116-6126. Marrazzo, J.M., Fiedler, T.L., Srinivasan, S., Thomas, K.K., Liu, C., Ko, D., Xie, H., Saracino, M., and Fredricks, D.N. (2012).
Extravaginal reservoirs of vaginal bacteria as risk factors for incident bacterial vaginosis. J Infect Dis 205, 1580-1588. Marri, P.R., Stern, D.A., Wright, A.L., Billheimer, D., and Martinez, F.D. (2012). Asthma-associated differences in microbial composition of induced sputum. J Allergy Clin Immunol. Martin, H.L., Richardson, B.A., Nyange, P.M., Lavreys, L., Hillier, S.L., Chohan, B., Mandaliya, K., Ndinya-Achola, J.O., Bwayo, J., and Kreiss, J. (1999). Vaginal lactobacilli, microbial flora, and risk of human immunodeficiency virus type 1 and sexually transmitted disease acquisition. J Infect Dis 180, 1863-1868. Matsuki, T., Watanabe, K., Tanaka, R., Fukuda, M., and Oyaizu, H. (1999). Distribution of bifidobacterial species in human intestinal microflora examined with 16S rRNA-gene-targeted species-specific primers. Appl Environ Microbiol 65, 4506-4512. Matsuki, T., Watanabe, K., Tanaka, R., and Oyaizu, H. (1998). Rapid identification of human intestinal bifidobacteria by 16S rRNA-targeted species- and group-specific primers. FEMS Microbiol Lett 167, 113-121. McHardy, A.C., Martin, H.G., Tsirigos, A., Hugenholtz, P., and Rigoutsos, I. (2007). Accurate phylogenetic classification of variable-length DNA fragments. Nat Methods 4, 63-72. McKenna, P., Hoffmann, C., Minkah, N., Aye, P.P., Lackner, A., Liu, Z., Lozupone, C.A., Hamady, M., Knight, R., and Bushman, F.D. (2008). The macaque gut microbiome in health, lentiviral infection, and chronic enterocolitis. PLoS Pathog 4, e20. Menard, J.P., Mazouni, C., Fenollar, F., Raoult, D., Boubli, L., and Bretelle, F. (2010). Diagnostic accuracy of quantitative real-time PCR assay versus clinical and Gram stain identification of bacterial vaginosis. Eur J Clin Microbiol Infect Dis 29, 1547-1552. Mende, D.R., Waller, A.S., Sunagawa, S., Jarvelin, A.I., Chan, M.M., Arumugam, M., Raes, J., and Bork, P. (2012). Assessment of metagenomic assembly using simulated next generation sequencing data. PLoS One 7, e31386.
Merritt, M.A., Green, A.C., Nagle, C.M., and Webb, P.M. (2008). Talcum powder, chronic pelvic inflammation and NSAIDs in relation to risk of epithelial ovarian cancer. Int J Cancer 122, 170-176. Meyer, F., Paarmann, D., D'Souza, M., Olson, R., Glass, E.M., Kubal, M., Paczian, T., Rodriguez, A., Stevens, R., Wilke, A., Wilkening, J., and Edwards, R.A. (2008). The metagenomics RAST server - a public resource for the automatic phylogenetic and functional analysis of metagenomes. BMC Bioinformatics 9, 386. Mills, S., Griffin, C., Ross, R.P. (2011). A new phage on the ‘Mozzarella’ block: Bacteriophage 5093 shares a low level of homology with other Streptococcus thermophilus phages. International Dairy Journal 21, 963-969. Minot, S., Sinha, R., Chen, J., Li, H., Keilbaugh, S.A., Wu, G.D., Lewis, J.D., and Bushman, F.D. (2011). The human gut virome: inter-individual variation and dynamic response to diet. Genome Res 21, 1616-1625. Mirmonsef, P., Gilbert, D., Veazey, R.S., Wang, J., Kendrick, S.R., and Spear, G.T. (2012). A comparison of lower genital tract glycogen and lactic acid levels in women and macaques: implications for HIV and SIV susceptibility. AIDS Res Hum Retroviruses 28, 76-81. Modan, B., Hartge, P., Hirsh-Yechezkel, G., Chetrit, A., Lubin, F., Beller, U., Ben-Baruch, G., Fishman, A., Menczer, J., Struewing, J.P., Tucker, M.A., and Wacholder, S. (2001). Parity, oral contraceptives, and the risk of ovarian cancer among carriers and noncarriers of a BRCA1 or BRCA2 mutation. N Engl J Med 345, 235-240. Mohammed, M.H., Ghosh, T.S., Singh, N.K., and Mande, S.S. (2011). SPHINX--an algorithm for taxonomic binning of metagenomic sequences. Bioinformatics 27, 22-30. Molenkamp, R., van der Ham, A., Schinkel, J., and Beld, M. (2007). Simultaneous detection of five different DNA targets by real-time Taqman PCR using the Roche LightCycler480: Application in viral molecular diagnostics. J Virol Methods 141, 205-211.
Moore, R.A., Warren, R.L., Freeman, J.D., Gustavsen, J.A., Chenard, C., Friedman, J.M., Suttle, C.A., Zhao, Y., and Holt, R.A. (2011). The sensitivity of massively parallel sequencing for detecting candidate infectious agents associated with human tissue. PLoS One 6, e19838. Moran, M.A., Satinsky, B., Gifford, S.M., Luo, H., Rivers, A., Chan, L.K., Meng, J., Durham, B.P., Shen, C., Varaljay, V.A., Smith, C.B., Yager, P.L., and Hopkinson, B.M. (2013). Sizing up metatranscriptomics. ISME J 7, 237-243. Morelli, L., Zonenenschain, D., Del Piano, M., and Cognein, P. (2004). Utilization of the intestinal tract as a delivery system for urogenital probiotics. J Clin Gastroenterol 38, S107-110. Morison, L., Ekpo, G., West, B., Demba, E., Mayaud, P., Coleman, R., Bailey, R., and Walraven, G. (2005). Bacterial vaginosis in relation to menstrual cycle, menstrual protection method, and sexual intercourse in rural Gambian women. Sex Transm Infect 81, 242-247. Muerhoff, A.S., Leary, T.P., Desai, S.M., and Mushahwar, I.K. (1997). Amplification and subtraction methods and their application to the discovery of novel human viruses. J Med Virol 53, 96-103. Munch, K., Boomsma, W., Huelsenbeck, J.P., Willerslev, E., and Nielsen, R. (2008). Statistical assignment of DNA sequences using Bayesian phylogenetics. Syst Biol 57, 750-757. Myer, L., Denny, L., Telerant, R., Souza, M., Wright, T.C., Jr., and Kuhn, L. (2005). Bacterial vaginosis and susceptibility to HIV infection in South African women: a nested case-control study. J Infect Dis 192, 1372-1380. Naidu, A.S., Bidlack, W.R., and Clemens, R.A. (1999). Probiotic spectra of lactic acid bacteria (LAB). Crit Rev Food Sci Nutr 39, 13-126. Nakamura, K., Oshima, T., Morimoto, T., Ikeda, S., Yoshikawa, H., Shiwa, Y., Ishikawa, S., Linak, M.C., Hirai, A., Takahashi, H., Altaf-Ul-Amin, M., Ogasawara, N., and Kanaya, S. (2011a). Sequence-specific error profile of Illumina sequencers. Nucleic Acids Res 39, e90.
Nakamura, S., Nakaya, T., and Iida, T. (2011b). Metagenomic analysis of bacterial infections by means of high-throughput DNA sequencing. Exp Biol Med (Maywood) 236, 968-971. Nealson, K.H., and Venter, J.C. (2007). Metagenomics and the global ocean survey: what's in it for us, and why should we care? ISME J 1, 185-187. Ness, R.B., Goodman, M.T., Shen, C., and Brunham, R.C. (2003). Serologic evidence of past infection with Chlamydia trachomatis, in relation to ovarian cancer. J Infect Dis 187, 1147-1152. Ness, R.B., Hillier, S.L., Kip, K.E., Soper, D.E., Stamm, C.A., McGregor, J.A., Bass, D.C., Sweet, R.L., Rice, P., and Richter, H.E. (2004). Bacterial vaginosis and risk of pelvic inflammatory disease. Obstet Gynecol 104, 761-769. Ness, R.B., Kip, K.E., Hillier, S.L., Soper, D.E., Stamm, C.A., Sweet, R.L., Rice, P., and Richter, H.E. (2005). A cluster analysis of bacterial vaginosis-associated microflora and pelvic inflammatory disease. Am J Epidemiol 162, 585-590. Ness, R.B., Soper, D.E., Holley, R.L., Peipert, J., Randall, H., Sweet, R.L., Sondheimer, S.J., Hendrix, S.L., Amortegui, A., Trucco, G., Songer, T., Lave, J.R., Hillier, S.L., Bass, D.C., and Kelsey, S.F. (2002). Effectiveness of inpatient and outpatient treatment strategies for women with pelvic inflammatory disease: results from the Pelvic Inflammatory Disease Evaluation and Clinical Health (PEACH) Randomized Trial. Am J Obstet Gynecol 186, 929-937. Nicholson, J.K., Holmes, E., Kinross, J., Burcelin, R., Gibson, G., Jia, W., and Pettersson, S. (2012). Host-gut microbiota metabolic interactions. Science 336, 1262-1267. Nugent, R.P., Krohn, M.A., and Hillier, S.L. (1991). Reliability of diagnosing bacterial vaginosis is improved by a standardized method of gram stain interpretation. J Clin Microbiol 29, 297-301. O'Hanlon, D.E., Moench, T.R., and Cone, R.A. (2011). In vaginal fluid, bacteria associated with bacterial vaginosis can be suppressed with lactic acid but not hydrogen peroxide.
BMC Infect Dis 11, 200. Ogata, S.I., Muramatsu, T., and Kobata, A. (1976). New structural characteristic of the large glycopeptides from transformed cells. Nature 259, 580-582. Palacios, G., Druce, J., Du, L., Tran, T., Birch, C., Briese, T., Conlan, S., Quan, P.L., Hui, J., Marshall, J., Simons, J.F., Egholm, M., Paddock, C.D., Shieh, W.J., Goldsmith, C.S., Zaki, S.R., Catton, M., and Lipkin, W.I. (2008). A new arenavirus in a cluster of fatal transplant-associated diseases. N Engl J Med 358, 991-998. Palacios, G., Quan, P.L., Jabado, O.J., Conlan, S., Hirschberg, D.L., Liu, Y., Zhai, J., Renwick, N., Hui, J., Hegyi, H., Grolla, A., Strong, J.E., Towner, J.S., Geisbert, T.W., Jahrling, P.B., Buchen-Osmond, C., Ellerbrok, H., Sanchez-Seco, M.P., Lussier, Y., Formenty, P., Nichol, M.S., Feldmann, H., Briese, T., and Lipkin, W.I. (2007). Panmicrobial oligonucleotide array for diagnosis of infectious diseases. Emerg Infect Dis 13, 73-81. Paramel Jayaprakash, T., Schellenberg, J.J., and Hill, J.E. (2012). Resolution and characterization of distinct cpn60-based subgroups of Gardnerella vaginalis in the vaginal microbiota. PLoS One 7, e43009. Pei, Z., Yang, L., Peek, R.M., Jr., Levine, S.M., Pride, D.T., and Blaser, M.J. (2005). Bacterial biota in reflux esophagitis and Barrett's esophagus. World J Gastroenterol 11, 7277-7283. Peterson, D.A., Frank, D.N., Pace, N.R., and Gordon, J.I. (2008). Metagenomic approaches for defining the pathogenesis of inflammatory bowel diseases. Cell Host Microbe 3, 417-427.
Peterson, J., Garges, S., Giovanni, M., McInnes, P., Wang, L., Schloss, J.A., Bonazzi, V., McEwen, J.E., Wetterstrand, K.A., Deal, C., Baker, C.C., Di Francesco, V., Howcroft, T.K., Karp, R.W., Lunsford, R.D., Wellington, C.R., Belachew, T., Wright, M., Giblin, C., David, H., Mills, M., Salomon, R., Mullins, C., Akolkar, B., Begg, L., Davis, C., Grandison, L., Humble, M., Khalsa, J., Little, A.R., Peavy, H., Pontzer, C., Portnoy, M., Sayre, M.H., Starke-Reed, P., Zakhari, S., Read, J., Watson, B., and Guyer, M. (2009). The NIH Human Microbiome Project. Genome Res 19, 2317-2323. Phan, T.G., Kapusinszky, B., Wang, C., Rose, R.K., Lipton, H.L., and Delwart, E.L. (2011). The fecal viral flora of wild rodents. PLoS Pathog 7, e1002218. Pilloni, G., Granitsiotis, M.S., Engel, M., and Lueders, T. (2012). Testing the limits of 454 pyrotag sequencing: reproducibility, quantitative assessment and comparison to T-RFLP fingerprinting of aquifer microbes. PLoS One 7, e40467. Plottel, C.S., and Blaser, M.J. (2011). Microbiome and malignancy. Cell Host Microbe 10, 324-335. Poretsky, R.S., Hewson, I., Sun, S., Allen, A.E., Zehr, J.P., and Moran, M.A. (2009). Comparative day/night metatranscriptomic analysis of microbial communities in the North Pacific subtropical gyre. Environ Microbiol 11, 1358-1375. Qin, J., Li, R., Raes, J., Arumugam, M., Burgdorf, K.S., Manichanh, C., Nielsen, T., Pons, N., Levenez, F., Yamada, T., Mende, D.R., Li, J., Xu, J., Li, S., Li, D., Cao, J., Wang, B., Liang, H., Zheng, H., Xie, Y., Tap, J., Lepage, P., Bertalan, M., Batto, J.M., Hansen, T., Le Paslier, D., Linneberg, A., Nielsen, H.B., Pelletier, E., Renault, P., Sicheritz-Ponten, T., Turner, K., Zhu, H., Yu, C., Jian, M., Zhou, Y., Li, Y., Zhang, X., Qin, N., Yang, H., Wang, J., Brunak, S., Dore, J., Guarner, F., Kristiansen, K., Pedersen, O., Parkhill, J., Weissenbach, J., Bork, P., and Ehrlich, S.D. (2010). A human gut microbial gene catalogue established by metagenomic sequencing.
Nature 464, 59-65. Quan, P.L., Palacios, G., Jabado, O.J., Conlan, S., Hirschberg, D.L., Pozo, F., Jack, P.J., Cisterna, D., Renwick, N., Hui, J., Drysdale, A., Amos-Ritchie, R., Baumeister, E., Savy, V., Lager, K.M., Richt, J.A., Boyle, D.B., Garcia-Sastre, A., Casas, I., Perez-Brena, P., Briese, T., and Lipkin, W.I. (2007). Detection of respiratory viruses and subtype identification of influenza A viruses by GreeneChipResp oligonucleotide microarray. J Clin Microbiol 45, 2359-2364. R.Core.Team (2012). R: A Language and Environment for Statistical Computing (R Foundation for Statistical Computing). Ramette, A. (2007). Multivariate analyses in microbial ecology. FEMS Microbiol Ecol 62, 142-160. Rappe, M.S., and Giovannoni, S.J. (2003). The uncultured microbial majority. Annu Rev Microbiol 57, 369-394. Ravel, J., Gajer, P., Abdo, Z., Schneider, G.M., Koenig, S.S., McCulle, S.L., Karlebach, S., Gorle, R., Russell, J., Tacket, C.O., Brotman, R.M., Davis, C.C., Ault, K., Peralta, L., and Forney, L.J. (2011). Vaginal microbiome of reproductive-age women. Proc Natl Acad Sci U S A 108 Suppl 1, 4680-4687. Raymond, F., Carbonneau, J., Boucher, N., Robitaille, L., Boisvert, S., Wu, W.K., De Serres, G., Boivin, G., and Corbeil, J. (2009). Comparison of automated microarray detection with real-time PCR assays for detection of respiratory viruses in specimens obtained from children. J Clin Microbiol 47, 743-750.  168  Rees, G.N., Baldwin, D.S., Watson, G.O., Perryman, S., and Nielsen, D.L. (2004). Ordination and significance testing of microbial community composition derived from terminal restriction fragment length polymorphisms: application of multivariate statistics. Antonie Van Leeuwenhoek 86, 339-347. Rehrauer, H., Schonmann, S., Eberl, L., and Schlapbach, R. (2008). PhyloDetect: a likelihood-based strategy for detecting microorganisms with diagnostic microarrays. Bioinformatics 24, i83-89. 
Reid, G., Charbonneau, D., Erb, J., Kochanowski, B., Beuerman, D., Poehner, R., and Bruce, A.W. (2003). Oral use of Lactobacillus rhamnosus GR-1 and L. fermentum RC-14 significantly alters vaginal flora: randomized, placebo-controlled trial in 64 healthy women. FEMS Immunol Med Microbiol 35, 131-134. Reyes, A., Haynes, M., Hanson, N., Angly, F.E., Heath, A.C., Rohwer, F., and Gordon, J.I. (2010). Viruses in the faecal microbiota of monozygotic twins and their mothers. Nature 466, 334-338. Richter, S.S., Galask, R.P., Messer, S.A., Hollis, R.J., Diekema, D.J., and Pfaller, M.A. (2005). Antifungal susceptibilities of Candida species causing vulvovaginitis and epidemiology of recurrent cases. J Clin Microbiol 43, 2155-2162. Righi, V., Durante, C., Cocchi, M., Calabrese, C., Di Febo, G., Lecce, F., Pisi, A., Tugnoli, V., Mucci, A., and Schenetti, L. (2009). Discrimination of healthy and neoplastic human colon tissues by ex vivo HRMAS NMR spectroscopy and chemometric analyses. J Proteome Res 8, 1859-1869. Rodriguez Jovita, M., Collins, M.D., Sjoden, B., and Falsen, E. (1999). Characterization of a novel Atopobium isolate from the human vagina: description of Atopobium vaginae sp. nov. Int J Syst Bacteriol 49 Pt 4, 1573-1576. Rohayem, J., Berger, S., Juretzek, T., Herchenroder, O., Mogel, M., Poppe, M., Henker, J., and Rethwilm, A. (2004). A simple and rapid single-step multiplex RT-PCR to detect Norovirus, Astrovirus and Adenovirus in clinical stool samples. J Virol Methods 118, 49-59. Roingeard, P. (2008). Viral detection by electron microscopy: past, present and future. Biol Cell 100, 491-501. Rosen, G., Garbarine, E., Caseiro, D., Polikar, R., and Sokhansanj, B. (2008). Metagenome fragment classification using N-mer frequency profiles. Adv Bioinformatics 2008, 205969. Rosen, G.L., Reichenberger, E.R., and Rosenfeld, A.M. (2011). NBC: the Naive Bayes Classification tool webserver for taxonomic classification of metagenomic reads. Bioinformatics 27, 127-129. 
Sagerstrom, C.G., Sun, B.I., and Sive, H.L. (1997). Subtractive cloning: past, present, and future. Annu Rev Biochem 66, 751-783. Salazar-Gonzalez, J.F., Martinez-Maza, O., Nishanian, P., Aziz, N., Shen, L.P., Grosser, S., Taylor, J., Detels, R., and Fahey, J.L. (1998). Increased immune activation precedes the inflection point of CD4 T cells and the increased serum virus load in human immunodeficiency virus infection. J Infect Dis 178, 423-430. Salvador, S., Gilks, B., Kobel, M., Huntsman, D., Rosen, B., and Miller, D. (2009). The fallopian tube: primary site of most pelvic high-grade serous carcinomas. Int J Gynecol Cancer 19, 58-64. Samaras, V., Rafailidis, P.I., Mourtzoukou, E.G., Peppas, G., and Falagas, M.E. (2010). Chronic bacterial and parasitic infections and cancer: a review. J Infect Dev Ctries 4, 267-281. 169  Scanlan, P.D., Shanahan, F., Clune, Y., Collins, J.K., O'Sullivan, G.C., O'Riordan, M., Holmes, E., Wang, Y., and Marchesi, J.R. (2008). Culture-independent analysis of the gut microbiota in colorectal cancer and polyposis. Environ Microbiol 10, 789-798. Schellenberg, J., Links, M.G., Hill, J.E., Dumonceaux, T.J., Peters, G.A., Tyler, S., Ball, T.B., Severini, A., and Plummer, F.A. (2009). Pyrosequencing of the chaperonin-60 universal target as a tool for determining microbial community composition. Appl Environ Microbiol 75, 2889-2898. Schena, M., Shalon, D., Davis, R.W., and Brown, P.O. (1995). Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science 270, 467-470. Schloss, P.D., and Handelsman, J. (2005). Introducing DOTUR, a computer program for defining operational taxonomic units and estimating species richness. Appl Environ Microbiol 71, 1501-1506. Schoenfeld, T., Patterson, M., Richardson, P.M., Wommack, K.E., Young, M., and Mead, D. (2008). Assembly of viral metagenomes from yellowstone hot springs. Appl Environ Microbiol 74, 4164-4174. Schroder, J., Bailey, J., Conway, T., and Zobel, J. (2010). 
Reference-free validation of short read data. PLoS One 5, e12681. Schwebke, J.R., Hillier, S.L., Sobel, J.D., McGregor, J.A., and Sweet, R.L. (1996). Validity of the vaginal gram stain for the diagnosis of bacterial vaginosis. Obstet Gynecol 88, 573-576. Segata, N., Haake, S.K., Mannon, P., Lemon, K.P., Waldron, L., Gevers, D., Huttenhower, C., and Izard, J. (2012). Composition of the adult digestive tract bacterial microbiome based on seven mouth surfaces, tonsils, throat and stool samples. Genome Biol 13, R42. Seidman, J.D., Sherman, M.E., Bell, K.A., Katabuchi, H., O'Leary, T.J., and Kurman, R.J. (2002). Salpingitis, salpingoliths, and serous tumors of the ovaries: is there a connection? Int J Gynecol Pathol 21, 101-107. Sha, B.E., Zariffard, M.R., Wang, Q.J., Chen, H.Y., Bremer, J., Cohen, M.H., and Spear, G.T. (2005). Female genital-tract HIV load correlates inversely with Lactobacillus species but positively with bacterial vaginosis and Mycoplasma hominis. J Infect Dis 191, 25-32. Shah, S.P., Kobel, M., Senz, J., Morin, R.D., Clarke, B.A., Wiegand, K.C., Leung, G., Zayed, A., Mehl, E., Kalloger, S.E., Sun, M., Giuliany, R., Yorida, E., Jones, S., Varhol, R., Swenerton, K.D., Miller, D., Clement, P.B., Crane, C., Madore, J., Provencher, D., Leung, P., DeFazio, A., Khattra, J., Turashvili, G., Zhao, Y., Zeng, T., Glover, J.N., Vanderhyden, B., Zhao, C., Parkinson, C.A., Jimenez-Linan, M., Bowtell, D.D., Mes-Masson, A.M., Brenton, J.D., Aparicio, S.A., Boyd, N., Hirst, M., Gilks, C.B., Marra, M., and Huntsman, D.G. (2009). Mutation of FOXL2 in granulosa-cell tumors of the ovary. N Engl J Med 360, 2719-2729. Sharma, V.K., Kumar, N., Prakash, T., and Taylor, T.D. (2012). Fast and accurate taxonomic assignments of metagenomic sequences using MetaBin. PLoS One 7, e34030. Shepard, R.N. (1980). Multidimensional scaling, tree-fitting, and clustering. Science 210, 390-398. Siegel, R., Naishadham, D., and Jemal, A. (2012). Cancer statistics, 2012. 
CA Cancer J Clin 62, 10-29.  170  Simms, I., Warburton, F., and Westrom, L. (2003). Diagnosis of pelvic inflammatory disease: time for a rethink. Sex Transm Infect 79, 491-494. Smillie, C.S., Smith, M.B., Friedman, J., Cordero, O.X., David, L.A., and Alm, E.J. (2011). Ecology drives a global network of gene exchange connecting the human microbiome. Nature 480, 241-244. Smith, A.M.O.C.J., ed. (2005). DNA fingerprinting of microbial communities (New York: Taylor & Francis). Sobhani, I., Tap, J., Roudot-Thoraval, F., Roperch, J.P., Letulle, S., Langella, P., Corthier, G., Tran Van Nhieu, J., and Furet, J.P. (2011). Microbial dysbiosis in colorectal cancer (CRC) patients. PLoS One 6, e16393. Sommer, M.O., Church, G.M., and Dantas, G. (2010). The human microbiome harbors a diverse reservoir of antibiotic resistance genes. Virulence 1, 299-303. Spinillo, A., Capuzzo, E., Acciano, S., De Santolo, A., and Zara, F. (1999). Effect of antibiotic use on the prevalence of symptomatic vulvovaginal candidiasis. Am J Obstet Gynecol 180, 14-17. Srinivasan, S., Hoffman, N.G., Morgan, M.T., Matsen, F.A., Fiedler, T.L., Hall, R.W., Ross, F.J., McCoy, C.O., Bumgarner, R., Marrazzo, J.M., and Fredricks, D.N. (2012). Bacterial communities in women with bacterial vaginosis: high resolution phylogenetic analyses reveal relationships of microbiota to clinical criteria. PLoS One 7, e37818. Stehelin, D., Varmus, H.E., Bishop, J.M., and Vogt, P.K. (1976). DNA related to the transforming gene(s) of avian sarcoma viruses is present in normal avian DNA. Nature 260, 170-173. Stenglein, M.D., Sanders, C., Kistler, A.L., Ruby, J.G., Franco, J.Y., Reavill, D.R., Dunker, F., and Derisi, J.L. (2012). Identification, characterization, and in vitro culture of highly divergent arenaviruses from boa constrictors and annulated tree boas: candidate etiological agents for snake inclusion body disease. MBio 3, e00180-00112. Stewart, F.J., Ottesen, E.A., and DeLong, E.F. (2010). 
Development and quantitative analyses of a universal rRNA-subtraction protocol for microbial metatranscriptomics. ISME J 4, 896-907. Sweet, R.L. (2009). Treatment strategies for pelvic inflammatory disease. Expert Opin Pharmacother 10, 823-837. Taha, T.E., Kumwenda, N.I., Kafulafula, G., Makanani, B., Nkhoma, C., Chen, S., Tsui, A., and Hoover, D.R. (2007). Intermittent intravaginal antibiotic treatment of bacterial vaginosis in HIV-uninfected and infected women: a randomized clinical trial. PLoS Clin Trials 2, e10. Temperton, B., and Giovannoni, S.J. (2012). Metagenomics: microbial diversity through a scratched lens. Curr Opin Microbiol 15, 605-612. Thurber, R.V., Haynes, M., Breitbart, M., Wegley, L., and Rohwer, F. (2009). Laboratory procedures to generate viral metagenomes. Nat Protoc 4, 470-483. Timmerman, H.M., Koning, C.J., Mulder, L., Rombouts, F.M., and Beynen, A.C. (2004). Monostrain, multistrain and multispecies probiotics--A comparison of functionality and efficacy. Int J Food Microbiol 96, 219-233. 171  Tosun, I., Alpay Karaoglu, S., Ciftci, H., Buruk, C.K., Aydin, F., Kilic, A.O., and Erturk, M. (2007). [Biotypes and antibiotic resistance patterns of Gardnerella vaginalis strains isolated from healthy women and women with bacterial vaginosis]. Mikrobiyol Bul 41, 21-27. Tringe, S.G., and Rubin, E.M. (2005). Metagenomics: DNA sequencing of environmental samples. Nat Rev Genet 6, 805-814. Turnbaugh, P.J., Hamady, M., Yatsunenko, T., Cantarel, B.L., Duncan, A., Ley, R.E., Sogin, M.L., Jones, W.J., Roe, B.A., Affourtit, J.P., Egholm, M., Henrissat, B., Heath, A.C., Knight, R., and Gordon, J.I. (2009a). A core gut microbiome in obese and lean twins. Nature 457, 480-484. Turnbaugh, P.J., Ley, R.E., Hamady, M., Fraser-Liggett, C.M., Knight, R., and Gordon, J.I. (2007). The human microbiome project. Nature 449, 804-810. Turnbaugh, P.J., Ridaura, V.K., Faith, J.J., Rey, F.E., Knight, R., and Gordon, J.I. (2009b). 
The effect of diet on the human gut microbiome: a metagenomic analysis in humanized gnotobiotic mice. Sci Transl Med 1, 6ra14. Turroni, F., Peano, C., Pass, D.A., Foroni, E., Severgnini, M., Claesson, M.J., Kerr, C., Hourihane, J., Murray, D., Fuligni, F., Gueimonde, M., Margolles, A., De Bellis, G., O'Toole, P.W., van Sinderen, D., Marchesi, J.R., and Ventura, M. (2012). Diversity of bifidobacteria within the infant gut microbiota. PLoS One 7, e36957. Urisman, A., Fischer, K.F., Chiu, C.Y., Kistler, A.L., Beck, S., Wang, D., and DeRisi, J.L. (2005). EPredict: a computational strategy for species identification based on observed DNA microarray hybridization patterns. Genome Biol 6, R78. Ussery, D.W., Wassenaar, T.M., and Borini, S. (2009). Computing for comparative microbial genomics : bioinformatics for microbiologists (London, Springer). Vaarala, O., Atkinson, M.A., and Neu, J. (2008). The "perfect storm" for type 1 diabetes: the complex interplay between intestinal microbiota, gut permeability, and mucosal immunity. Diabetes 57, 25552562. van der Gast, C.J., Walker, A.W., Stressmann, F.A., Rogers, G.B., Scott, P., Daniels, T.W., Carroll, M.P., Parkhill, J., and Bruce, K.D. (2011). Partitioning core and satellite taxa from within cystic fibrosis lung bacterial communities. ISME J 5, 780-791. van Nood, E., Vrieze, A., Nieuwdorp, M., Fuentes, S., Zoetendal, E.G., de Vos, W.M., Visser, C.E., Kuijper, E.J., Bartelsman, J.F., Tijssen, J.G., Speelman, P., Dijkgraaf, M.G., and Keller, J.J. (2013). Duodenal infusion of donor feces for recurrent Clostridium difficile. N Engl J Med 368, 407-415. van Reenen, C.A., and Dicks, L.M. (2011). Horizontal gene transfer amongst probiotic lactic acid bacteria and other intestinal microbiota: what are the possibilities? A review. Arch Microbiol 193, 157-168. 
Vaughan, S., Coward, J.I., Bast, R.C., Jr., Berchuck, A., Berek, J.S., Brenton, J.D., Coukos, G., Crum, C.C., Drapkin, R., Etemadmoghadam, D., Friedlander, M., Gabra, H., Kaye, S.B., Lord, C.J., Lengyel, E., Levine, D.A., McNeish, I.A., Menon, U., Mills, G.B., Nephew, K.P., Oza, A.M., Sood, A.K., Stronach, E.A., Walczak, H., Bowtell, D.D., and Balkwill, F.R. (2011). Rethinking ovarian cancer: recommendations for improving outcomes. Nat Rev Cancer 11, 719-725.  172  Venter, J.C., Remington, K., Heidelberg, J.F., Halpern, A.L., Rusch, D., Eisen, J.A., Wu, D., Paulsen, I., Nelson, K.E., Nelson, W., Fouts, D.E., Levy, S., Knap, A.H., Lomas, M.W., Nealson, K., White, O., Peterson, J., Hoffman, J., Parsons, R., Baden-Tillson, H., Pfannkoch, C., Rogers, Y.H., and Smith, H.O. (2004). Environmental genome shotgun sequencing of the Sargasso Sea. Science 304, 66-74. Verhelst, R., Verstraelen, H., Claeys, G., Verschraegen, G., Delanghe, J., Van Simaey, L., De Ganck, C., Temmerman, M., and Vaneechoutte, M. (2004). Cloning of 16S rRNA genes amplified from normal and disturbed vaginal microflora suggests a strong association between Atopobium vaginae, Gardnerella vaginalis and bacterial vaginosis. BMC Microbiol 4, 16. Vijay-Kumar, M., Aitken, J.D., Carvalho, F.A., Cullender, T.C., Mwangi, S., Srinivasan, S., Sitaraman, S.V., Knight, R., Ley, R.E., and Gewirtz, A.T. (2010). Metabolic syndrome and altered gut microbiota in mice lacking Toll-like receptor 5. Science 328, 228-231. von Wintzingerode, F., Gobel, U.B., and Stackebrandt, E. (1997). Determination of microbial diversity in environmental samples: pitfalls of PCR-based rRNA analysis. FEMS Microbiol Rev 21, 213-229. Wang, C.C., Cook, L., Tapia, K.A., Holte, S., Krows, M., Bagabag, A., Santos, A., Corey, L., and Jerome, K.R. (2011). Cervicovaginal shedding of hepatitis C viral RNA is associated with the presence of menstrual or other blood in cervicovaginal fluids. J Clin Virol 50, 4-7. 
Wang, C.C., McClelland, R.S., Reilly, M., Overbaugh, J., Emery, S.R., Mandaliya, K., Chohan, B., Ndinya-Achola, J., Bwayo, J., and Kreiss, J.K. (2001). The effect of treatment of vaginal infections on shedding of human immunodeficiency virus type 1. J Infect Dis 183, 1017-1022. Wang, D., Coscoy, L., Zylberberg, M., Avila, P.C., Boushey, H.A., Ganem, D., and DeRisi, J.L. (2002). Microarray-based detection and genotyping of viral pathogens. Proc Natl Acad Sci U S A 99, 1568715692. Wang, D., Urisman, A., Liu, Y.T., Springer, M., Ksiazek, T.G., Erdman, D.D., Mardis, E.R., Hickenbotham, M., Magrini, V., Eldred, J., Latreille, J.P., Wilson, R.K., Ganem, D., and DeRisi, J.L. (2003). Viral discovery and sequence recovery using DNA microarrays. PLoS Biol 1, E2. Wang, Y., Hoenig, J.D., Malin, K.J., Qamar, S., Petrof, E.O., Sun, J., Antonopoulos, D.A., Chang, E.B., and Claud, E.C. (2009). 16S rRNA gene-based analysis of fecal microbiota from preterm infants with and without necrotizing enterocolitis. ISME J 3, 944-954. Watson, M., Dukes, J., Abu-Median, A.B., King, D.P., and Britton, P. (2007). DetectiV: visualization, normalization and significance testing for pathogen-detection microarray data. Genome Biol 8, R190. Weiss, N.S. (1988). Measuring the separate effects of low parity and its antecedents on the incidence of ovarian cancer. Am J Epidemiol 128, 451-455. Welsh, J.B., Zarrinkar, P.P., Sapinoso, L.M., Kern, S.G., Behling, C.A., Monk, B.J., Lockhart, D.J., Burger, R.A., and Hampton, G.M. (2001). Analysis of gene expression profiles in normal and neoplastic ovarian tissue samples identifies candidate molecular markers of epithelial ovarian cancer. Proc Natl Acad Sci U S A 98, 1176-1181. Wetterstrand, K. (2012). DNA Sequencing Costs (National Human Genome Research Institute, NIH).  
173  White, K.L., Schildkraut, J.M., Palmieri, R.T., Iversen, E.S., Jr., Berchuck, A., Vierkant, R.A., Rider, D.N., Charbonneau, B., Cicek, M.S., Sutphen, R., Birrer, M.J., Pharoah, P.P., Song, H., Tyrer, J., Gayther, S.A., Ramus, S.J., Wentzensen, N., Yang, H.P., Garcia-Closas, M., Phelan, C.M., Cunningham, J.M., Fridley, B.L., Sellers, T.A., and Goode, E.L. (2012). Ovarian cancer risk associated with inherited inflammation-related variants. Cancer Res 72, 1064-1069. Wiegand, K.C., Shah, S.P., Al-Agha, O.M., Zhao, Y., Tse, K., Zeng, T., Senz, J., McConechy, M.K., Anglesio, M.S., Kalloger, S.E., Yang, W., Heravi-Moussavi, A., Giuliany, R., Chow, C., Fee, J., Zayed, A., Prentice, L., Melnyk, N., Turashvili, G., Delaney, A.D., Madore, J., Yip, S., McPherson, A.W., Ha, G., Bell, L., Fereday, S., Tam, A., Galletta, L., Tonin, P.N., Provencher, D., Miller, D., Jones, S.J., Moore, R.A., Morin, G.B., Oloumi, A., Boyd, N., Aparicio, S.A., Shih Ie, M., Mes-Masson, A.M., Bowtell, D.D., Hirst, M., Gilks, B., Marra, M.A., and Huntsman, D.G. (2010). ARID1A mutations in endometriosis-associated ovarian carcinomas. N Engl J Med 363, 1532-1543. Wiesenfeld, H.C., Hillier, S.L., Krohn, M.A., Amortegui, A.J., Heine, R.P., Landers, D.V., and Sweet, R.L. (2002). Lower genital tract infection and endometritis: insight into subclinical pelvic inflammatory disease. Obstet Gynecol 100, 456-463. Willing, B., Halfvarson, J., Dicksved, J., Rosenquist, M., Jarnerot, G., Engstrand, L., Tysk, C., and Jansson, J.K. (2009). Twin studies reveal specific imbalances in the mucosa-associated microbiota of patients with ileal Crohn's disease. Inflamm Bowel Dis 15, 653-660. Witkin, S.S., and Ledger, W.J. (2012). Complexities of the uniquely human vagina. Sci Transl Med 4, 132fs111. 
Wohlbach, D.J., Kuo, A., Sato, T.K., Potts, K.M., Salamov, A.A., Labutti, K.M., Sun, H., Clum, A., Pangilinan, J.L., Lindquist, E.A., Lucas, S., Lapidus, A., Jin, M., Gunawan, C., Balan, V., Dale, B.E., Jeffries, T.W., Zinkel, R., Barry, K.W., Grigoriev, I.V., and Gasch, A.P. (2011). Comparative genomics of xylose-fermenting fungi for enhanced biofuel production. Proc Natl Acad Sci U S A 108, 1321213217. Woo, P.C., Lau, S.K., Tse, H., Teng, J.L., Curreem, S.O., Tsang, A.K., Fan, R.Y., Wong, G.K., Huang, Y., Loman, N.J., Snyder, L.A., Cai, J.J., Huang, J.D., Mak, W., Pallen, M.J., Lok, S., and Yuen, K.Y. (2009). The complete genome and proteome of Laribacter hongkongensis reveal potential mechanisms for adaptations to different temperatures and habitats. PLoS Genet 5, e1000416. Wu, S., Morin, P.J., Maouyo, D., and Sears, C.L. (2003). Bacteroides fragilis enterotoxin induces c-Myc expression and cellular proliferation. Gastroenterology 124, 392-400. Wu, S., Rhee, K.J., Albesiano, E., Rabizadeh, S., Wu, X., Yen, H.R., Huso, D.L., Brancati, F.L., Wick, E., McAllister, F., Housseau, F., Pardoll, D.M., and Sears, C.L. (2009). A human colonic commensal promotes colon tumorigenesis via activation of T helper type 17 T cell responses. Nat Med 15, 10161022. Yi, H., Cho, Y.J., Won, S., Lee, J.E., Jin Yu, H., Kim, S., Schroth, G.P., Luo, S., and Chun, J. (2011). Duplex-specific nuclease efficiently removes rRNA for prokaryotic RNA-seq. Nucleic Acids Res 39, e140. Yip, G.W., Smollich, M., and Gotte, M. (2006). Therapeutic value of glycosaminoglycans in cancer. Mol Cancer Ther 5, 2139-2148.  
174  Yooseph, S., Sutton, G., Rusch, D.B., Halpern, A.L., Williamson, S.J., Remington, K., Eisen, J.A., Heidelberg, K.B., Manning, G., Li, W., Jaroszewski, L., Cieplak, P., Miller, C.S., Li, H., Mashiyama, S.T., Joachimiak, M.P., van Belle, C., Chandonia, J.M., Soergel, D.A., Zhai, Y., Natarajan, K., Lee, S., Raphael, B.J., Bafna, V., Friedman, R., Brenner, S.E., Godzik, A., Eisenberg, D., Dixon, J.E., Taylor, S.S., Strausberg, R.L., Frazier, M., and Venter, J.C. (2007). The Sorcerer II Global Ocean Sampling expedition: expanding the universe of protein families. PLoS Biol 5, e16. Young, V.B., and Schmidt, T.M. (2004). Antibiotic-associated diarrhea accompanied by large-scale alterations in the composition of the fecal microbiota. J Clin Microbiol 42, 1203-1206. Zhang, H., DiBaise, J.K., Zuccolo, A., Kudrna, D., Braidotti, M., Yu, Y., Parameswaran, P., Crowell, M.D., Wing, R., Rittmann, B.E., and Krajmalnik-Brown, R. (2009). Human gut microbiota in obesity and after gastric bypass. Proc Natl Acad Sci U S A 106, 2365-2370. Zhang, T., Breitbart, M., Lee, W.H., Run, J.Q., Wei, C.L., Soh, S.W., Hibberd, M.L., Liu, E.T., Rohwer, F., and Ruan, Y. (2006). RNA viral community in human feces: prevalence of plant pathogenic viruses. PLoS Biol 4, e3. Zhou, J., Wu, L., Deng, Y., Zhi, X., Jiang, Y.H., Tu, Q., Xie, J., Van Nostrand, J.D., He, Z., and Yang, Y. (2011). Reproducibility and quantitation of amplicon sequencing-based detection. ISME J 5, 1303-1313. Zhou, M., McFarland-Mancini, M.M., Funk, H.M., Husseinzadeh, N., Mounajjed, T., and Drew, A.F. (2009). Toll-like receptor expression in normal ovary and ovarian tumors. Cancer Immunol Immunother 58, 1375-1385.  
Appendix

Sequence Library Normalization

Components for the sequence library preparation:
Library Normalization Additive 1 (LNA1)
Library Normalization Beads 1 (LNB1)
Library Normalization Wash 1 (LNW1)
Library Normalization Storage Buffer 1 (LNS1)
0.1 N NaOH

#AMPure clean up
1) Bring AMPure XP beads to room temperature
2) Prepare 80% ethanol
3) Centrifuge, then transfer PCR products to a new plate
4) Add beads to each sample at 0.6X the PCR volume (in microlitres)
5) Incubate at room temp for 5 minutes
6) Place plate on magnetic stand for 2 minutes
7) Remove and discard supernatant
8) Add 200 µl of 80% ethanol to each well
9) Incubate for 30 seconds
10) Remove and discard supernatant
11) Repeat steps 8-10
12) Allow beads to air dry for 15 minutes
13) Remove plate from magnetic stand
14) Add 52.5 µl of resuspension buffer to each sample
15) Incubate at room temp for 2 minutes
16) Place plate back on magnetic stand for 2 minutes
17) Transfer 50 µl of supernatant to new microcentrifuge tubes

#Library Normalization
1) Thaw LNA1
2) Bring LNB1 and LNW1 to room temperature
3) Vigorously vortex LNB1 for 1 minute and invert
4) Add 20 µl of purified PCR product to a plate suitable for the magnetic stand
5) For 96 samples, add 4.4 ml of LNA1 to a 15 ml conical tube
6) Add 800 µl of LNB1 to the 15 ml conical tube with LNA1
7) Add 45 µl of the mix (step 6) to each well of the sample plate
8) Seal plate
9) Place on plate shaker at 1800 rpm for 30 minutes
10) Place plate on magnetic stand for 2 minutes or until clear
11) While plate is still on the stand, remove 80 µl and put into hazardous waste
12) Remove plate from magnetic stand
13) Add 45 µl of LNW1 to each sample well
14) Seal plate
15) Shake at 1800 rpm for 5 minutes
16) Place on magnetic stand for 2 minutes
17) Remove and discard supernatant
18) Remove from magnetic stand
19) Repeat steps 16-18
20) Add 30 µl of 0.1 N NaOH to each well
21) Seal plate
22) Shake at 1800 rpm for 5 minutes
23) Add 30 µl of LNS1 to each well of another standard 96 well PCR plate
24) Place samples on magnetic stand for 2 minutes
25) Add 30 µl of sample to the standard PCR plate with LNS1
26) Seal plate
27) Centrifuge at 1000g for 1 minute
28) Store plate at -20 °C

Python Program for Homopolymeric Sequence Filtration

#!/usr/bin/env python
# Prints only those FASTA records whose sequence line has no homopolymer
# run of `size` bases and no run of five Ns.
import sys

if sys.argv[1]:
    infile = sys.argv[1]
f = open(infile, "r")
K = f.readlines()
f.close()
size = 21  # homopolymer run length that causes a record to be dropped
for line in K:
    # As written, these two replacements leave the line unchanged.
    modline = line.replace("n", "n")
    modline = modline.replace("N", "N")
    #modline = line.replace("y","")
    if not line.strip():
        continue
    else:
        if ">" in line:
            id = line  # remember the most recent FASTA header
            continue
        if "A" * size in modline:
            continue
        elif "T" * size in modline:
            continue
        elif "G" * size in modline:
            continue
        elif "C" * size in modline:
            continue
        elif "N" * 5 in modline:
            continue
        # Strip the trailing newline before printing header and sequence.
        print(id[:-1])
        print(line[:-1])

Custom Lowest Common Ancestor Python Program

#!/usr/bin/env python
#!/usr/bin/python2.5
import sys

if sys.argv[1]:
    infile = sys.argv[1]
# The definition of the filter label is missing from the listing. It is
# assumed here to be an optional second argument, empty by default.
filter_tag = sys.argv[2] if len(sys.argv) > 2 else ""
f = open(infile, "r")
K = f.readlines()
f.close()
matchCount = 0
lineCount = 0
dict1 = {}      # read ID -> hit lines for mate 1
dict2 = {}      # read ID -> hit lines for mate 2
histogram = {}  # percent identity -> number of paired hits
tax1 = {}       # read ID -> rank-tagged lineages for mate 1
tax2 = {}       # read ID -> rank-tagged lineages for mate 2
percent1 = {}   # read ID -> percent identities for mate 1
percent2 = {}   # read ID -> percent identities for mate 2
comp1 = {}
comp2 = {}

P = open(infile.rsplit(".", 1)[0] + "_paired" + filter_tag + ".txt", "w")
H = open(infile.rsplit(".", 1)[0] + "_histogram" + filter_tag + ".txt", "w")
T = open(infile.rsplit(".", 1)[0] + "_taxonomy" + filter_tag + ".txt", "w")
print("Started Analysis of " + infile)
for line in K:
    replaced = 0
    if '|' in line:
        # str.replace returns a new string; the result is discarded here,
        # so only the `replaced` flag (taxonomy column 6 vs 5) takes effect.
        line.replace("|", "")
        replaced = 1
    if not filter_tag == "":
        lc_line = line.lower()
    if '_1\t' in line:
        items = line.split("_1\t", 1)
        uid = items[0]
        info = items[1]
        if uid not in dict1:
            dict1[uid] = []
        dict1[uid].append(info)
        mydata = info.split("\t")
        #print mydata[3]
        try:
            if uid not in percent1:
                percent1[uid] = []
            percent1[uid].append(float(mydata[3]))
        except (IndexError, ValueError):
            pass
        if replaced == 0:
            mytax = mydata[5].replace("; ", ";").split(";")
        else:
            mytax = mydata[6].replace("; ", ";").split(";")
        # Drop anything after the first "." in the deepest taxon name.
        mydata = mytax[len(mytax)-1].split(".", 1)
        mytax[len(mytax)-1] = mydata[0]
        if uid not in tax1:
            tax1[uid] = []
        # Tag each taxon with a letter for its depth (a_, b_, c_, ...).
        for index in range(0, len(mytax)):
            mytax[index] = "%s_%s" % (chr(97 + index), mytax[index])
        tax1[uid].append(mytax)
    elif '_2\t' in line:
        items = line.split("_2\t", 1)
        uid = items[0]
        info = items[1]
        if uid not in dict2:
            dict2[uid] = []
        dict2[uid].append(info)
        mydata = info.split("\t")
        #print mydata[3]
        try:
            if uid not in percent2:
                percent2[uid] = []
            percent2[uid].append(float(mydata[3]))
        except (IndexError, ValueError):
            pass
        if replaced == 0:
            mytax = mydata[5].replace("; ", ";").split(";")
        else:
            mytax = mydata[6].replace("; ", ";").split(";")
        mydata = mytax[len(mytax)-1].split(".", 1)
        mytax[len(mytax)-1] = mydata[0]
        if uid not in tax2:
            tax2[uid] = []
        for index in range(0, len(mytax)):
            mytax[index] = "%s_%s" % (chr(97 + index), mytax[index])
        tax2[uid].append(mytax)
print("Completed Analysis of " + infile)
# Write every hit of reads whose two mates both have hits, and tally identities.
for uid in dict1:
    lineCount += 1
    if uid in dict2:
        for info in dict1[uid]:
            matchCount += 1
            P.write(uid + "_1\t")
            P.write(info)
            myline = info.split("\t")
            percentID = float(myline[3])
            if percentID not in histogram:
                histogram[percentID] = 0
            histogram[percentID] += 1
        for info in dict2[uid]:
            matchCount += 1
            P.write(uid + "_2\t")
            P.write(info)
            myline = info.split("\t")
            percentID = float(myline[3])
            if percentID not in histogram:
                histogram[percentID] = 0
            histogram[percentID] += 1
P.close()
print("Created file: %s having a total of %s matched pairs." % (P.name, matchCount))
for percentID in sorted(histogram, reverse=True):
    H.write("%.2f\t%s\n" % (percentID, histogram[percentID]))
H.write("dict1=" + str(len(dict1)) + "\tdict2=" + str(len(dict2)) + "\tTotal=" + str(len(dict1)+len(dict2)))
H.close()
print("Created file: %s." % (H.name))
# For each read pair, keep the taxa shared by the best-scoring pair of hits.
for uid in tax1:
    if uid in tax2:
        maximum = 0
        taxvalue = ""
        for index1 in range(0, len(tax1[uid])):
            for index2 in range(0, len(tax2[uid])):
                if uid not in comp1:
                    comp1[uid] = []
                    comp2[uid] = []
                # Score a hit pair by the product of its percent identities.
                product = percent1[uid][index1] * percent2[uid][index2]
                if product >= maximum:
                    taxonomy = sorted(set(tax1[uid][index1]) & set(tax2[uid][index2]))
                    if product > maximum or len(taxonomy) >= len(taxvalue):
                        taxvalue = taxonomy
                        maximum = product
                        maxvalue = str(percent1[uid][index1]) + "*" + str(percent2[uid][index2])

        comp1[uid].append(taxvalue)
        comp2[uid].append(maxvalue)
        # Replace the stored list with its taxa, rank tags removed.
        mylist = comp1[uid].pop()
        for index in range(0, len(mylist)):
            comp1[uid].append(mylist[index].split("_", 1)[1])
        if len(comp1[uid]) == 0:
            T.write(uid + "\t" + "" + "\t" + "\t".join(comp2[uid]) + "\t" + "\t".join(comp1[uid]) + "\n")
        else:
            T.write(uid + "\t" + comp1[uid][len(comp1[uid])-1] + "\t" + "\t".join(comp2[uid]) + "\t" + "\t".join(comp1[uid]) + "\n")

T.close()
print("Created file: %s." % (T.name))
del dict1
del dict2
del histogram
del tax1
del tax2
del percent1
del percent2
del comp1
del comp2
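
The pair-wise selection at the heart of the lowest common ancestor program above can be illustrated in isolation. The following is a minimal Python 3 sketch, not the thesis code: the function names (`tag_ranks`, `shared_lineage`, `best_pair_lca`) and the example lineages are hypothetical, and hits are assumed to be already parsed into (percent identity, lineage) tuples rather than read from a tab-delimited file.

```python
def tag_ranks(lineage):
    """Prefix each taxon with a letter for its depth: a_, b_, c_, ..."""
    return ["%s_%s" % (chr(97 + depth), name) for depth, name in enumerate(lineage)]


def shared_lineage(lineage_1, lineage_2):
    """Return the taxa found at the same depth in both lineages, root first."""
    shared = sorted(set(tag_ranks(lineage_1)) & set(tag_ranks(lineage_2)))
    return [tagged.split("_", 1)[1] for tagged in shared]


def best_pair_lca(hits_1, hits_2):
    """Choose the hit pair with the highest product of percent identities.

    Each hit is a (percent_identity, lineage) tuple. Ties on the product go
    to the pair that shares at least as many taxa as the current best.
    Returns (product, shared_taxa); the LCA is the last shared taxon.
    """
    best_product, best_shared = 0.0, []
    for pct_1, lin_1 in hits_1:
        for pct_2, lin_2 in hits_2:
            product = pct_1 * pct_2
            shared = shared_lineage(lin_1, lin_2)
            if product > best_product or (
                product == best_product and len(shared) >= len(best_shared)
            ):
                best_product, best_shared = product, shared
    return best_product, best_shared


# Hypothetical hits for the two mates of one read pair.
read_1_hits = [
    (99.0, ["Bacteria", "Firmicutes", "Bacilli", "Lactobacillales",
            "Lactobacillaceae", "Lactobacillus"]),
]
read_2_hits = [
    (98.0, ["Bacteria", "Firmicutes", "Bacilli", "Lactobacillales",
            "Streptococcaceae", "Streptococcus"]),
    (90.0, ["Bacteria", "Actinobacteria"]),
]

product, shared = best_pair_lca(read_1_hits, read_2_hits)
print(product, shared[-1] if shared else "")  # 9702.0 Lactobacillales
```

Tagging each taxon with its depth before intersecting the two sets means a name only counts as shared when it occurs at the same rank in both lineages, and sorting the tagged names restores root-to-tip order so that the last element is the deepest shared taxon.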
