UBC Theses and Dissertations


Profiling the vaginal microbiome in HIV-positive women Mahal, Daljeet Singh 2013

PROFILING THE VAGINAL MICROBIOME IN HIV-POSITIVE WOMEN

by

Daljeet Singh Mahal

B.Sc., The University of British Columbia, 2010

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE in The Faculty of Graduate Studies (Reproductive and Developmental Sciences)

THE UNIVERSITY OF BRITISH COLUMBIA (Vancouver)

April 2013

© Daljeet Singh Mahal, 2013

Abstract

Disruptions or imbalances of the vaginal microbiome can lead to negative reproductive health consequences for women, including an increased risk of sexually transmitted infections, pelvic inflammatory disease, and preterm birth. HIV-positive women may be particularly vulnerable to microbiome disruptions due to the immune dysfunction intrinsic to this disease. The objective of this study was to explore the vaginal microbiome in HIV-positive reproductive-aged women utilizing cpn60 metagenomic profiling and to correlate vaginal bacterial profiles with demographic/clinical variables. Fifty-four HIV-positive women were recruited from the Oak Tree Clinic in Vancouver, BC. Demographic/clinical information was collected and vaginal Gram stains were assessed by Nugent’s scoring. Total DNA was extracted from vaginal swabs and PCR amplified using cpn60-specific universal primers. Cpn60 sequence libraries were generated with 454-GS-FLX Titanium pyrosequencing. Sixty-four unique bacterial phylotypes were classified based on sequence similarity to known bacterial organisms and 10 common vaginal community clusters were generated. Fisher’s exact test and Wilcoxon signed-rank tests were used for statistical analyses. The mean age of enrolled women was 36.6 years (range=22.3–48.8). The mean CD4 count for these women was 484 cells/mm³ (range=90–930 cells/mm³) while the mean viral load was 13,144 copies/mL (range=<40–355,245 copies/mL). Sixty-three percent of women had suppressed viral loads while 85% of women were on antiretroviral therapy.
According to scored vaginal Gram stains, 30% (16/54) of women had bacterial vaginosis and 11% (6/54) had intermediate scores. Twelve of the 54 women were categorized as having abnormal vaginal discharge. Abnormal vaginal discharge significantly correlated with Dialister micraerophilus, Gardnerella vaginalis B, Gardnerella vaginalis D, Porphyromonas uenonis, Prevotella amnii, Prevotella buccalis, and Prevotella timonensis (p<0.04 for all). Normal vaginal discharge significantly correlated with Lactobacillus crispatus (p<0.0006). Viral loads <40 copies/mL significantly correlated with Lactobacillus crispatus and Lactobacillus gasseri while viral loads >40 copies/mL correlated with Atopobium vaginae, Gardnerella vaginalis D, Prevotella amnii, and 4 other potentially novel bacterial species (p<0.03 for all). CD4 counts <350 cells/mm³ significantly correlated with Gardnerella vaginalis B and Lactobacillus iners (p<0.05 for both). Cpn60-based sequence data demonstrated substantial variation in the vaginal microbiome among HIV-positive women, with species-specific differences dependent on vaginal discharge status, immune status and uncontrolled HIV replication.

Preface

This research project received funding and support from the Canadian Institutes of Health Research and Genome British Columbia. Ethical approval for this project was obtained from the University of British Columbia – Children’s & Women’s Health Centre of BC Research Ethics Board (UBC C&W REB) with Certificate Number H11-00119. Preliminary results from Chapters 3 and 4 were presented at Genomics: The Power and the Promise Conference, the 20th Conference on Retroviruses and Opportunistic Infections (CROI), and the 22nd Annual Canadian Conference on HIV/AIDS Research (CAHR) with the following abstract publications:

Mahal, D, Chaban, B, Albert, AYK, Vicol, L, Wagner, E, Hill, J, Money, D, Vogue Research Group. Clustering Analysis of the Vaginal Microbiome in HIV-Infected Women Using Metagenomic Characterization.
Genome. Vol. 55, No. 10. Oct 2012.

Mahal, D, Chaban, B, Albert, AYK, Vicol, L, Wagner, E, Hill, J, Hemmingsen, S, Pick, N, Money, D, Vogue Study Group. Metagenomic Characterization of the Vaginal Microbiome in HIV-Positive Women Using Culture Independent Methods. CROI 2013. March 2013.

Mahal, D, Chaban, B, Albert, AYK, Vicol, L, Wagner, E, Hill, J, Hemmingsen, S, Pick, N, Money, D, Vogue Research Group. Metagenomic Characterization of the Vaginal Microbiome in HIV-Positive Women Using Gene Sequencing Methods. Canadian Journal of Infectious Diseases & Medical Microbiology. Volume 24, Supplement A. Spring 2013.

This research project would not have been possible without the contribution and support of a number of individuals. As mentioned in Section 2.1, vaginal swabs were collected by Bal Dhesi and Laura Vicol from the Oak Tree Clinic. The DNA extraction outlined in Section 2.2 was performed by Vincent Montoya from the BCCDC. While the sample processing and PCR amplification of the first 4 samples was conducted by me, the remaining samples were processed by Dr. Bonnie Chaban at the University of Saskatchewan (Section 2.2). The gene sequencing processing was conducted by Winnie Dong at the BC Centre for Excellence in HIV/AIDS (Section 2.2). My involvement in the DNA extraction and gene sequencing processing stages was limited to shadowing and the documentation of progress. As mentioned in Section 2.3, the initial stage of raw gene sequence assembly was conducted by Matthew Links and Dr. Janet Hill at the University of Saskatchewan.

Table of Contents

ABSTRACT
PREFACE
TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
ACKNOWLEDGEMENTS
DEDICATION
1 BACKGROUND
   1.1 THE HUMAN MICROBIOME
   1.2 GENE SEQUENCING TOOLS
   1.3 REPRODUCTIVE HEALTH AND INFECTIOUS DISEASE
   1.4 CURRENT DIAGNOSTIC AND TREATMENT TECHNIQUES
   1.5 HIV AND REPRODUCTIVE HEALTH
   1.6 VAGINAL MICROBIOME ANALYSES
   1.7 RATIONALE
2 MATERIALS AND METHODS
   2.1 STUDY DETAILS
   2.2 LABORATORY METHODS
   2.3 BIOINFORMATICS
   2.4 DATA ANALYSIS
   2.5 STATISTICS
3 RESULTS
   3.1 VOGUE 1B DATA
4 DISCUSSION
   4.1 OVERALL FINDINGS
   4.2 DEMOGRAPHICS
   4.3 BACTERIAL VAGINOSIS OUTCOMES
   4.4 BACTERIAL TAXA CLASSIFICATION
   4.5 REPRODUCTIVE HEALTH FINDINGS
   4.6 HIV-SPECIFIC CHARACTERISTICS
   4.7 ADDRESSING THE HYPOTHESIS
   4.8 LIMITATIONS
   4.9 FUTURE DIRECTIONS
   4.9.1 CONCLUDING REMARKS
BIBLIOGRAPHY
APPENDIX

List of Tables

Table 1 Studies profiling the vaginal microbiome in HIV-negative women
Table 2 Studies profiling the vaginal microbiome in HIV-positive women
Table 3 Demographic and Clinical Characteristics
Table 4 Clinical HIV-Specific Characteristics
Table 5 Nearest neighbour bacterial species identified by cpnDB
Table 6 Good’s coverage calculation and Diversity Index per sample
Table 7 Bacterial species by normal versus abnormal vaginal discharge
Table 8 Correlations of Nugent scores with vaginal microbial clusters
Table 9 Significant bacterial species by CD4 count groupings
Table 10 Significant bacterial species by viral load groupings

List of Figures

Figure 1 Exposed gel bands visualized by Alpha Innotech AlphaImager Instrument
Figure 2 Ethnic distribution of study subjects
Figure 3 Rarefaction curves of 54 HIV-positive vaginal samples
Figure 4 Heat map of 54 bacterial species clustered into 10 groups
Figure 5 Bacterial species differences by normal and abnormal vaginal discharge
Figure 6 Lactobacillus gasseri prevalence in HIV-positive women by ethnicity
Figure 7 Relationship of Nugent score with total 16S rRNA bacterial quantity
Figure 8 Linear Discriminant Analysis values for CD4 count groupings
Figure 9 Cladogram of bacterial species for CD4 count groupings
Figure 10 Linear Discriminant Analysis values for viral load groupings
Figure 11 Cladogram of bacterial species for viral load groupings

Acknowledgements

There are a large number of people who helped make this project a reality. I would like to thank my supervisor Dr. Deborah Money for taking me on as a student and continually being there for guidance and encouragement.
I would like to thank each of my committee members, Dr. Patrick Tang, Dr. Anthony Cheung, Dr. Gina Ogilvie, Dr. Rajavel Elango, and Dr. Tobias Kollmann for providing me with invaluable advice and constructive criticism. I would like to thank Bal Dhesi and Laura Vicol for aiding in this project’s sample collection stage. I would like to thank Vincent Montoya, Dr. Bonnie Chaban, and Winnie Dong for the respective roles they took on in this project’s sample processing stages. I would also like to thank Matthew Links and Dr. Janet Hill for assisting with the bioinformatics portion of this project. I would especially like to thank Dr. Bonnie Chaban and Dr. Janet Hill for continually tolerating and addressing all of my questions and concerns. I would also like to thank Dr. Arianne Albert for providing advice on statistical analyses and for always being very willing to teach useful analytical techniques. Last and certainly not least, I would like to thank Emily Wagner for going above and beyond in providing endless support and motivation every step of the way. I would also like to acknowledge and thank the Canadian Institutes of Health Research and Genome British Columbia for funding and support.

Dedication

To my past. To the best friends I met along this journey - Sanjeet, Shivinder and Fontayne. To the cartoon-Wagner guide who kept me grounded - Emily. To the most insanely amazing officemate - Nancy Lipsky. To my Baba Ji and Thadi Ji. To my brother - Maninder. To my sister - Aneeta. To my Dadji. To Ma. To my future.

“You must jump off cliffs all the time and build your wings on the way down.” - Ray Bradbury

1 BACKGROUND

1.1 The Human Microbiome

The term “microbiome” was initially proposed by Joshua Lederberg to describe the ecological profile of all of the symbiotic, commensal, or pathogenic microorganisms that inhabit a specific environment.
(Lederberg 2001) The human microbiome is currently defined as the sum total of all of the microorganisms along with their genetic components that reside on the surface of and inside the human body. (Turnbaugh, Ley et al. 2007) These microorganisms include bacteria, viruses, fungi, archaea and all other microscopic organisms. In terms of quantity, it has been estimated that the number of bacteria inhabiting the human body outnumbers our human cells by a factor of 10. (Turnbaugh, Ley et al. 2007) This greatly emphasizes the integrated role that microorganisms have in human health and disease and also brings forward the philosophical question of the human microbiome possibly being a biological extension of the human identity. In order to explore these types of questions and to better understand the relationship of the human microbiome with positive and negative health, the National Institutes of Health (NIH) initiated the Human Microbiome Project in 2007 with plans to investigate the microbiota of the oral and nasal cavities, the skin, the gastrointestinal tract, and the genitourinary tract. (Peterson, Garges et al. 2009) Aligned with these plans, the Canadian Institutes of Health Research (CIHR) developed the Canadian Microbiome Initiative to further develop an understanding of the healthy and diseased states of the human microbiota. Both of these initiatives have prompted research teams around the world to delve deeper into the examination of the human microbiome.

1.2 Gene Sequencing Tools

The opportunity to thoroughly profile the microbiota of major areas of the human body has been highly dependent on the progression and advancement of methodological tools and technologies. Cultivation techniques utilizing selective and non-selective media to isolate specific bacterial organisms were previously the standard method of bacterial identification. (Rogosa, Mitchell et al. 1951; Totten, Amsel et al.
1982) Unfortunately, one of the major limitations of these cultivation methods is the inability to isolate and culture the majority of the organisms present in a given environment. (Donachie, Foster et al. 2007) This limitation is known as the “great plate count anomaly”: the bacteria grown on a culture plate represent only a small fraction of the total number of bacteria present in a sample. (Donachie, Foster et al. 2007) Although cultivation techniques are able to provide useful phenotypic information about isolated organisms, the advent of molecular techniques based on nucleic acid sequencing technology truly revolutionized the ability to detect and identify entire bacterial profiles of samples. (Olsen, Lane et al. 1986; Ward, Weller et al. 1990)

Specifically for the vaginal microbiome, the initial molecular tools used for population-level bacterial identification involved a polymerase chain reaction (PCR) with primers mostly based on the universal bacterial 16S ribosomal RNA (rRNA) gene, along with denaturing gradient gel electrophoresis. (Burton and Reid 2002) The major reason for the use of an rRNA gene sequence is that rRNAs in general are functionally homologous in all organisms and have highly conserved nucleotide sequences as well as distinct regions of variability. (Burton and Reid 2002) While universal primers can target these highly conserved regions, the variable regions can be used to differentiate and classify bacterial taxa. (Olsen, Lane et al. 1986; Lamont, Sobel et al. 2011) More specifically, 16S rRNA was chosen as an appropriate molecular target due to its size being the most suitable for broad phylogenetic analyses. (Burton and Reid 2002) The 16S rRNA gene is about 1540 nucleotides in length and has a total of 9 variable regions that have been designated V1 to V9. (Neefs, Van de Peer et al.
1993; Clarridge 2004) Currently, the most common approach for the molecular identification of bacteria involves PCR amplification of variable regions of the universal 16S rRNA gene target followed by next-generation ultra-high throughput gene sequencing. (Lamont, Sobel et al. 2011; Ravel, Gajer et al. 2011) This approach allows for the generation of a large number of gene sequence reads for multiple samples in a single sequencing run. This methodology has not only allowed for a more thorough understanding of which bacterial organisms are present in different vaginal samples, but it has also allowed for the generation of distinct clusters of common vaginal communities of bacteria. (Fredricks, Fiedler et al. 2005; Ravel, Gajer et al. 2011)

Another universal gene sequencing target that is currently being utilized for next-generation ultra-high throughput gene sequencing is the chaperonin-60 (cpn60) gene target. (Schellenberg, Links et al. 2011) Cpn60 is a protein that is responsible for intracellular folding and polypeptide chain assembly in the plastids, mitochondria, and cytoplasm of most bacteria, eukaryotes, and archaea. (Hill JE 2004; Hill, Goh et al. 2005) While the length of the cpn60 gene spans about 2200 base pairs, the universal target region has a length of approximately 549 to 567 base pairs and corresponds to nucleotides 274 to 828 of the E. coli cpn60 gene sequence. (Goh, Potter et al. 1996) The cpn60 universal gene target is known to provide higher phylogenetic resolution and have greater discriminatory power than the 16S rRNA gene target due to the variable region of the cpn60 gene target being more consistently distributed in comparison with the 16S rRNA target region. (Brousseau, Hill et al.
2001) As the physical structure of the 16S rRNA gene sequence requires its nucleotide distribution to be organized into segments of highly conserved and variable regions, a smaller number of variable region sequences is available for taxonomic differentiation. (Clarridge 2004; Hill, Goh et al. 2005) This is due to the 16S rRNA genes being more functionally constrained than the cpn60 genes: because cpn60 encodes a protein, codon redundancy allows many nucleotide substitutions to leave the amino acid sequence unchanged, whereas the untranslated 16S rRNA has no such tolerance. (Hill, Goh et al. 2005) Although a single nucleotide base change may have a large effect on 16S rRNA function, the same type of change may not have a profound effect on the cpn60 gene product. For these reasons, especially when discriminating between bacterial taxa at the species and strain levels, there is a greater number of variable region nucleotides available to differentiate between closely related taxa in the cpn60 gene target. (Goh, Potter et al. 1996; Hill, Paccagnella et al. 2006) The potential for increased discriminatory power along with a relatively short universal gene target length make the cpn60 gene a highly suitable molecular target for microbial identification. (Schellenberg, Links et al. 2009) Furthermore, as the online chaperonin database located at www.cpndb.ca is a weekly-curated database with a high level of quality control, it serves as a consolidated, reliable, and up-to-date source for the identification of bacterial taxa. (Hill JE 2004) This is in contrast to the multiple and mostly public ribosomal RNA databases available for 16S rRNA gene sequences. (Pruesse, Quast et al. 2007; Cole, Wang et al. 2009)

1.3 Reproductive Health and Infectious Disease

Prior to looking into the application of molecular tools for advancement towards reproductive health, it is essential to understand the key areas of female reproductive health in need of greater exploration.
The infectious condition associated with vaginal burning, discharge, odour, itching, and/or dyspareunia is termed vaginitis. (Sweet and Gibbs 2009) The major identified causes for this condition include trichomoniasis, candidiasis, and bacterial vaginosis. (Hill and Embil 1986) Trichomoniasis is caused by the protozoan Trichomonas vaginalis and often causes vulvar itching and presents with homogenous and malodorous vaginal secretions with a pH above 4.5. (Wolner-Hanssen, Krieger et al. 1989) Candidiasis is caused by the fungus Candida albicans and often leads to symptoms of dysuria, vulvar pruritus, and swelling, along with signs of vulvar edema and thick, curdy vaginal discharge. (Winner HI 1964; Eckert, Hawes et al. 1998) While the detection and diagnosis of trichomoniasis and candidiasis are relatively unambiguous due to their mono-etiological nature, the classification of bacterial vaginosis is more complex.

Bacterial vaginosis was initially termed “non-specific vaginitis” due to the lack of understanding surrounding the causation of this condition. (Gardner and Dukes 1954) “Non-specific vaginitis” was often correlated with symptoms of abnormal vaginal discharge, vaginal malodour and vulvar irritation, not caused by Trichomonas vaginalis or Candida albicans. (Amsel, Totten et al. 1983) It was not until 1955 that Gardner and Dukes replaced the term “non-specific vaginitis” with Hemophilus vaginalis vaginitis. (Gardner and Dukes 1955) This was due to Gardner and Dukes isolating the bacterial species then termed Hemophilus vaginalis, from the vaginal swab samples of women who presented with abnormal, homogenous, and odorous vaginal discharge with a pH between 5.0 and 5.5. (Gardner and Dukes 1955) After confirming that the vaginal swab samples of these women lacked Trichomonas vaginalis and Candida albicans, it appeared that Hemophilus vaginalis was the single bacterial species responsible for this condition.
(Gardner and Dukes 1959) Gardner and Dukes initially termed this bacterial isolate Hemophilus vaginalis due to its appearance as a gram-negative rod and its ability to grow on blood agar. (Gardner and Dukes 1955) This classification was not free from controversy, as it was later discovered that this bacterium did not need the media that the Hemophilus genus requires for growth, specifically media containing hemin and nicotinamide adenine dinucleotide. (Dunkelberg and McVeigh 1969) These findings led other research teams to classify this isolated bacterium as Corynebacterium vaginale due to its ability to stain gram positive as well as gram negative and its resemblance to Corynebacterium microscopic morphology. (Zinnemann 1963; Dunkelberg, Skaggs et al. 1970; Greenwood 1980; Wells and Goei 1981; Catlin 1992) This classification was also later refuted. After applying a set of methodologies to these bacteria, including Adansonian analyses, DNA-DNA hybridization, electron microscopy, and cell envelope biochemical analyses, Greenwood and Pickett proposed a new genus, Gardnerella, for this species, renaming it Gardnerella vaginalis. (Greenwood 1980) This new bacterial genus was proposed due to this organism’s dissimilarity to recognized gram-positive and gram-negative genera. (Greenwood 1980; Piot, van Dyck et al. 1980) This led to the replacement of the term Hemophilus vaginalis vaginitis with Gardnerella vaginalis vaginitis. (Purdon, Hanna et al. 1984; van der Meijden 1984; Sefer and Ionescu 1991)

Further studies exploring the relationship of G. vaginalis with G. vaginalis vaginitis demonstrated a lack of consistent association between G. vaginalis and abnormal vaginal discharge. (McCormack, Hayes et al. 1977) Moreover, it was found that G. vaginalis was recovered from normal control subjects without any signs or symptoms of G. vaginalis vaginitis in addition to those subjects presenting with G. vaginalis vaginitis. (Pheifer, Forsyth et al.
1978) This led to controversy surrounding the use of the terms Gardnerella vaginalis vaginitis and non-specific vaginitis for its description. (van der Meijden 1984) Additionally, the improved understanding that this condition is associated with a shift across a wide spectrum of vaginal bacteria, rather than with a single causative bacterial agent, ultimately led to the current term “bacterial vaginosis.” (Martius, Krohn et al. 1988; Mazzulli, Simor et al. 1990; Nugent, Krohn et al. 1991) Currently, bacterial vaginosis is defined by a shift in vaginal ecology from a Lactobacillus dominated community to one consisting of a mixture of organisms that can include Gardnerella vaginalis, Bacteroides sp, Mobiluncus sp, and Mycoplasma hominis. (Martius, Krohn et al. 1988; Mazzulli, Simor et al. 1990; Nugent, Krohn et al. 1991) It is understood that this transformation of the vaginal environment to a mixed community of bacterial organisms may lead to an increased risk for human immunodeficiency virus (HIV) acquisition, human papilloma virus (HPV) susceptibility, pelvic inflammatory disease, preterm delivery, premature rupture of membranes, and intrauterine infections. (Sweet 1995; Govender, Hoosen et al. 1996; Ugwumadu, Hay et al. 1997; Kimberlin and Andrews 1998; Moodley, Connolly et al. 2002; Watts, Fazzari et al. 2005; Allsworth, Lewis et al. 2008) Although the specific causation for this shift remains elusive, the impact of the vaginal flora on reproductive health is becoming increasingly clear.

1.4 Current Diagnostic and Treatment Techniques

Characterized at the microbial level, bacterial vaginosis presents as an asymptomatic condition in about 50% of affected women. (Amsel, Totten et al. 1983) With regard to systematically diagnosing bacterial vaginosis at the clinical level, specific criteria for assessment were first introduced in 1983. (Amsel, Totten et al.
1983) These clinical criteria, termed Amsel’s criteria, are based on the fulfillment of at least 3 of the following 4 signs: a homogenous grayish-white vaginal discharge, a vaginal pH with a value above 4.5, the presence of clue cells on a saline wet mount, and a fishy or amine odour upon addition of 10% potassium hydroxide solution to vaginal fluid. (Amsel, Totten et al. 1983) While clue cells refer to the heavy coating of bacteria on vaginal epithelial cells evaluated through microscopic examination, vaginal amine odour detection is a more subjective evaluation often referred to as a “positive whiff or sniff test.” (Pheifer, Forsyth et al. 1978; Amsel, Totten et al. 1983; Verstraelen and Verhelst 2009)

While these clinical criteria for bacterial vaginosis assessment remain the preferred diagnostic measures at the clinical level, the current “gold standard” diagnostic method is Nugent’s scoring system. Developed in 1991, Nugent’s diagnostic technique is based on the assessment of Gram-stained vaginal smears under oil immersion. (Nugent, Krohn et al. 1991) This technique is a modification and refinement of the microscopic evaluation initially utilized by Spiegel for bacterial vaginosis diagnosis. (Spiegel, Amsel et al. 1983) Although Spiegel defined a system for scoring bacterial morphotypes, Nugent developed a standardized numerical scale for bacterial vaginosis diagnosis. (Spiegel, Amsel et al. 1983) Nugent’s scoring system evaluates vaginal smears for the presence and absence of specific bacterial morphotypes and assigns a score from 0 to 10. (Nugent, Krohn et al. 1991) A total score of 0 to 3 corresponds with a normal reading or normal vaginal microflora, 4 to 6 corresponds with an intermediate reading, and 7 to 10 corresponds with a diagnosis of bacterial vaginosis. (Nugent, Krohn et al.
1991) This total score is calculated by determining the proportion and abundance of organisms with morphotypes consistent with Lactobacillus sp, Gardnerella sp, Bacteroides sp, and curved gram-variable rods per oil immersion field. (Nugent, Krohn et al. 1991) While a higher abundance of Lactobacillus morphotypes yields a lower overall score, a higher abundance of Gardnerella, Bacteroides, and curved gram-variable rod morphotypes yields a higher overall score. (Nugent, Krohn et al. 1991) The validity of Nugent’s scoring system has been evaluated for sensitivity and specificity through multiple sample assessments, with results supporting its reliability. (Schwebke, Hillier et al. 1996; Tam, Yungbluth et al. 1998; Money 2005) Furthermore, in terms of variability between observers of Gram-stained vaginal smear samples, the inter-observer reproducibility of Nugent scores has been found to be high. (Forsum, Jakobsson et al. 2002)

In contrast, concerns do exist with regard to the specific detailed methodologies utilized for vaginal sampling as well as for Gram stain preparation and interpretation in Nugent scoring. Although Nugent scoring reproducibility is high with the use of consistent equipment, preparation techniques, and interpretation procedures, variation in these specifics has led to decreased reproducibility for this diagnostic technique. (Forsum, Jakobsson et al. 2002; Forsum, Larsson et al. 2008) Factors leading to variation include: inconsistent application of vaginal swab samples to glass slides, fixation and timing differences for Gram staining, lack of standardization of the microscopic field image area, and disagreement about morphotype classification. (Forsum, Jakobsson et al. 2002; Larsson, Carlsson et al. 2004; Forsum, Larsson et al. 2008) The latter factor is one of particular concern as the classification of bacterial rods and cocci can often become unclear and lead to inaccurate bacterial taxa categorization. (Forsum, Larsson et al.
2008) For these reasons, although Nugent’s scoring has remained a relatively reliable diagnostic tool, the availability of sequencing technologies should allow for the eventual development of a more specific molecular diagnostic tool.

With regard to the treatment of bacterial vaginosis, the standard recommendations in both Canada and the USA involve the antibiotics metronidazole and clindamycin. (Sarwal 2008; Workowski 2010) Both the Canadian Guidelines on Sexually Transmitted Infections and the CDC Sexually Transmitted Diseases Treatment Guidelines recommend the following options: metronidazole 500 mg orally twice daily for 7 days, 5 grams of metronidazole gel 0.75% applied intravaginally once daily for 5 days, or 5 grams of clindamycin cream 2% applied intravaginally once at bedtime for 7 days. (Sarwal 2008; Workowski 2010) While the regimen of metronidazole therapy for bacterial vaginosis was first demonstrated to be effective in 1978 by relieving signs and symptoms of bacterial vaginosis in 80 out of 81 treated patients, the clindamycin regimen was confirmed to be equally effective in 1988 when compared with a standard metronidazole regimen. (Pheifer, Forsyth et al. 1978; Greaves, Chungafung et al. 1988) Although introduced decades ago, these two antibiotic therapies continue to be regarded as the standard primary treatment for bacterial vaginosis. Unfortunately, bacterial vaginosis recurrence following treatment has been observed in about 15-50% of diagnosed women. (Ceruti, Piantelli et al. 1994; Joesoef and Schmid 2005; Bradshaw, Morton et al. 2006) While the specific reason for this recurrence remains unclear, it is apparent that the effectiveness of traditional antibiotic therapies in treating bacterial vaginosis has declined. (Joesoef, Schmid et al.
1999) Although this decline in effectiveness may be attributed to increased bacterial resistance, the more likely root cause is that a particular causative agent for bacterial vaginosis has not been identified. (Eschenbach 2007) The lack of a specific identified target may be one of the major factors hindering the establishment of a more effective treatment. (Eschenbach 2007) Although trials have also taken place utilizing antiseptic and acidic products with mixed results (Verstraelen and Verhelst 2009), a potentially more promising approach utilizes probiotic therapies. Probiotics are live microorganisms that, when administered in adequate amounts, may positively alter a microflora and provide health benefits. (Schrezenmeir and de Vrese 2001) Vaginal probiotic therapies involve oral as well as vaginal tablets containing a variety of bacterial strains, including tablets consisting of a combination of Lactobacillus brevis, Lactobacillus salivarius, and Lactobacillus plantarum (Mastromarino, Macchia et al. 2009), and tablets made up of either Lactobacillus acidophilus (Hallen, Jarstrand et al. 1992), Lactobacillus rhamnosus (Rossi, Rossi et al. 2010), or Lactobacillus reuteri (Anukam, Osazuwa et al. 2006) among other strains. Although study results with these specific probiotic strains have provided some evidence in favour of effectiveness in treating or protecting against bacterial vaginosis, trials with a greater number of participants and better-defined treatment outcomes are still necessary. (MacPhee, Hummelen et al. 2010) Furthermore, trials have also taken place looking at the usage of probiotic therapies in conjunction with or following antibiotic treatment. 
While these studies have yielded mixed results depending on the specific probiotic strains, long-term use of probiotics following antibiotic treatment has been demonstrated to significantly reduce bacterial vaginosis recurrence, both with a standard dose of oral metronidazole followed by weekly vaginal tablets containing exclusively Lactobacillus rhamnosus (Marcone, Calzolari et al. 2008) and with capsules containing a combination of Lactobacillus rhamnosus and Lactobacillus gasseri following standard clindamycin treatment (Larsson, Stray-Pedersen et al. 2008). As the recurrence and prevalence of bacterial vaginosis have not been eliminated with traditional antibiotic regimens thus far, the progress of alternative treatment methods is highly encouraging. With an increased understanding and more specific detailing of protective as well as pathogenic bacterial species, the development of more targeted antibiotics and possibly more effective probiotics may lead to more refined treatment regimens and prophylactic therapies for bacterial vaginosis.

1.5 HIV and Reproductive Health

Human immunodeficiency virus (HIV) is a member of the retrovirus family. It is a dense cylindrical virus containing genomic RNA, a reverse transcriptase enzyme and nucleoid-containing core proteins. (Fauci 1988; Cann and Karn 1989; Lapadat-Tapolsky, De Rocquigny et al. 1993) More specifically, HIV is a member of the lentivirus subfamily, as it causes chronic infections with a relatively long latency period between the time of exposure and the appearance of signs and symptoms. (Fauci 1988) HIV initially infects cells that display a CD4 antigen, frequently including helper CD4+ T cells and also monocytes, macrophages, and dendritic cells. 
(Cann and Karn 1989) After binding to a host cell through the interaction of the viral envelope glycoprotein gp120 with a CD4 antigen and a specific chemokine receptor, HIV enters the host cell through membrane fusion. (Cann and Karn 1989) Following the disassembly of its viral capsid within a host cell, the reverse transcriptase enzyme in HIV, RNA-dependent DNA polymerase, allows HIV to transcribe DNA from its viral RNA. (Fauci 1988; Le Rouzic and Benichou 2005; Sweet and Gibbs 2009) This step allows HIV to insert viral DNA into the host genome with the use of integrase. (Fauci 1988; Sweet and Gibbs 2009) Once the viral DNA is inserted, regulatory protein assembly is initiated, further leading to the production of virions that eventually bud from the host cell and infect other cells. (Fauci 1988; Sweet and Gibbs 2009) Although HIV infection may lead to a variety of negative health outcomes, generally HIV infection depletes the immune system, causing the body to become more vulnerable to the onset of opportunistic infections. (Fauci 1988) Clinically, progression of HIV infection and disease has been associated with a decreased CD4+ T cell count, with Acquired Immune Deficiency Syndrome (AIDS) being used to classify HIV-infected individuals with fewer than 200 CD4 cells/µL of blood or a total CD4 cell percentage of less than 14 by US standards, and by the clinical definitions of AIDS-defining illnesses in Canada. (CDC 1992; CIDPC 2002) In terms of genital tract health, HIV has been found to play a substantial role in the increased acquisition of lower genital tract infections. (Cu-Uvin, Hogan et al. 1999) In comparisons of HIV-positive and HIV-negative women, it has been found that the prevalence of human papillomavirus (HPV), candidal vaginitis, Neisseria gonorrhoeae, syphilis, and bacterial vaginosis is greater in women living with HIV. (Plummer, Simonsen et al. 1989; Carpenter, Mayer et al. 1991; Hutchinson, Rompalo et al. 1991; Sun, Ellerbrock et al. 
1995; Sewankambo, Gray et al. 1997; Cu-Uvin, Hogan et al. 1999) Furthermore, it has been found that the quantity of Lactobacillus bacteria, and especially of hydrogen peroxide-producing strains of lactobacilli, is reduced in the vaginal microflora of HIV-positive women in comparison with HIV-negative women. (Knezevic, Stepanovic et al. 2005) As the strains of lactobacilli that are frequent hydrogen peroxide producers are known to have more protective effects against potential genital infections than other strains of Lactobacillus, the presence of less protective Lactobacillus strains may contribute to the already compromised state of immunity in HIV-positive women. (Eschenbach, Davick et al. 1989; Hillier, Krohn et al. 1992; Hillier, Krohn et al. 1993; Antonio, Hawes et al. 1999) Although these observations provide a possible explanation for the increased vulnerability of HIV-positive women to genital tract infections, the specific strains that differ among these women have not been thoroughly identified. On the other hand, a number of studies have gathered evidence supporting the observation that women with bacterial vaginosis have an increased susceptibility to HIV acquisition and an increased risk for HIV transmission. (O'Connor, Kinchington et al. 1995; Sha, Zariffard et al. 2005; Atashili, Poole et al. 2008; Lai, Hida et al. 2009) Some of these studies have explored the ability of the cervicovaginal mucus to inactivate and protect against HIV. (O'Connor, Kinchington et al. 1995; Sha, Zariffard et al. 2005; Lai, Hida et al. 2009) When the acidity of the cervicovaginal mucus is lowered, possibly through a change in the vaginal microflora, the protective ability of this barrier against HIV may decrease and consequently increase HIV acquisition risks. (O'Connor, Kinchington et al. 1995; Sha, Zariffard et al. 2005; Lai, Hida et al. 
2009) Furthermore, it has also been found that HIV-positive women who have been diagnosed with bacterial vaginosis have a higher genital tract viral load than HIV-positive women without bacterial vaginosis. (Cu-Uvin, Hogan et al. 2001) This finding of the genital tract viral load being affected by localized physiological changes as opposed to plasma viral load levels has been further supported by a study that observed decreasing genital viral loads during the periovulatory phase of the menstrual cycle while the plasma viral load remained consistent. (Money, Arikan et al. 2003) In terms of CD4 counts, a significant correlation has been found between low CD4 counts and decreased quantities of lactobacilli. (Mane, Kulkarni et al. 2013) Although these findings have not established exact explanations for the relationship between HIV and lower genital tract infections, it is clear that further exploration of this relationship through an increased understanding of the vaginal bacterial profile is warranted.

1.6 Vaginal Microbiome Analyses

The identification of a number of novel vaginal bacterial taxa has only become possible following the usage of molecular techniques for bacterial characterization (Table 1). Burton and Reid conducted one of the initial molecular-based studies with a group of 20 Caucasian post-menopausal women with the use of 16S rRNA PCR and denaturing gradient gel electrophoresis (DGGE). (Burton and Reid 2002) For this study, specimens were collected from vaginal swabs. (Burton and Reid 2002) While it was found that women with low Nugent scores had vaginal flora mostly dominated by Lactobacillus organisms, women with Nugent scores indicative of abnormal vaginal flora had an increased diversity of organisms including Gardnerella, Prevotella, Bacteroides, Streptococcus, and Lactobacillus. (Burton and Reid 2002) While many of these bacterial organisms had been previously detected and identified through culturing methods, the detection of Lactobacillus iners in the vaginal flora was a novel result. (Burton and Reid 2002) The inability to detect this species prior to this study may be attributed to the difficulty of growing it on media including MRS and Rogosa. (Falsen, Pascual et al. 1999; Burton and Reid 2002)

Table 1 Studies profiling the vaginal microbiome in HIV-negative women
Study | Sample Size | Technique | Normal Vaginal Flora | Abnormal Vaginal Flora
Burton and Reid 2002 | 20 | 16S rRNA PCR, DGGE | Lactobacillus iners, L. crispatus | Gardnerella, Prevotella, Bacteroides, Streptococcus, Lactobacillus
Ferris et al. 2004 | 46 | 16S rRNA PCR, DGGE | L. crispatus, L. jensenii | Atopobium vaginae, Gardnerella vaginalis, Bifidobacterium, Mycoplasma, Prevotella
Verhelst et al. 2004 | 8 | 16S rRNA PCR, ARDRA | L. crispatus, L. gasseri | Atopobium vaginae, Gardnerella vaginalis
Zhou et al. 2004 | 5 | 16S rRNA PCR, Clone Gene Sequencing | Atopobium vaginae, Megasphaera, Leptotrichia | -
Fredricks et al. 2005 | 21 | 16S rDNA PCR, Clone Analysis | L. iners, L. crispatus | Gardnerella vaginalis, Megasphaera, Leptotrichia, Dialister, Atopobium, Lactobacillus iners
Hyman et al. 2005 | 20 | 16S rRNA PCR, Clone Gene Sequencing | Lactobacillus, Bifidobacterium, Gardnerella, Prevotella, Streptococcus | -
Hill et al. 2005 | 16 | Chaperonin-60 sequencing | L. crispatus, L. iners, L. gasseri, L. jensenii, L. buchneri, Gardnerella | -
Fredricks et al. 2007 | 264 | Taxon-Directed 16S rRNA PCR | L. crispatus, L. iners, Peptoniphilus | Leptotrichia, Sneathia, Atopobium vaginae, Megasphaera, Clostridiales
Thies et al. 2007 | 70 | 16S rRNA T-RFLP | L. crispatus, L. iners, L. gasseri | Atopobium vaginae, Megasphaera, Lactobacillus iners, Gardnerella vaginalis, Clostridiales
Zhou et al. 2007 | 144 | 16S rRNA T-RFLP | L. iners, L. crispatus, L. jensenii, L. gasseri, Atopobium | Atopobium, Megasphaera, Dialister, Anaerococcus, Finegoldia, Peptostreptococcus
Zhou et al. 2010 | 73 | 16S rRNA T-RFLP | L. iners, L. crispatus, L. jensenii, L. gasseri, Atopobium vaginae | Streptococcus, Gardnerella vaginalis, Mycoplasma, Prevotella
Ravel et al. 2011 | 396 | 16S rRNA Pyrosequencing | L. iners, L. crispatus, L. gasseri, L. jensenii | Prevotella, Dialister, Atopobium, Gardnerella, Megasphaera, Finegoldia, Mobiluncus

Ferris et al. conducted a study with a group of 46 predominantly African American reproductive-aged women with the use of 16S rRNA PCR and DGGE. (Ferris, Masztal et al. 2004) For this study, specimens were collected through vaginal swabs and a vaginal lavage. (Ferris, Masztal et al. 2004) An interesting finding from this study was that Atopobium vaginae was identified in a large proportion of women with Nugent scores that were indicative of bacterial vaginosis. (Ferris, Masztal et al. 2004) This was one of the initial reports indicating the presence of Atopobium vaginae in an abnormal vaginal flora. (Ferris, Masztal et al. 2004) Furthermore, it was noted that A. vaginae appeared to be highly resistant to metronidazole treatment following a susceptibility test to various antimicrobials. (Ferris, Masztal et al. 2004) The presence of A. vaginae in abnormal vaginal flora was further confirmed by Verhelst et al. 
through a study that was conducted on 8 vaginal samples from women between the ages of 28 and 51 years. (Verhelst, Verstraelen et al. 2004) This study utilized 16S rRNA PCR and Amplified rDNA Restriction Analysis (ARDRA) along with species-specific PCR for Gardnerella vaginalis and Atopobium vaginae. (Verhelst, Verstraelen et al. 2004) It was found that Gardnerella vaginalis and Atopobium vaginae had a significant co-occurrence in vaginal samples that were indicative of bacterial vaginosis. (Verhelst, Verstraelen et al. 2004) Zhou et al. conducted a study on 5 Caucasian reproductive-aged women utilizing 16S rRNA PCR followed by gene sequencing of constructed clones. (Zhou, Bent et al. 2004) The specimens for this analysis were collected through mid-vaginal swabs. (Zhou, Bent et al. 2004) Although vaginal flora status for these women was exclusively based on the self-reporting of symptoms, it was determined that a healthy vaginal flora in this study could contain Atopobium vaginae, Megasphaera species, Leptotrichia species, and Lactobacillus species. (Zhou, Bent et al. 2004) Since a standard assessment technique such as Nugent’s scoring was not administered to confirm whether a vaginal sample was from a normal or healthy vaginal flora, there is a degree of uncertainty with regard to these specific bacterial taxa being present in a healthy vaginal flora. (Zhou, Bent et al. 2004) Fredricks et al. conducted a study on 21 reproductive-aged women using 16S rDNA broad-range PCR and clone analysis. (Fredricks, Fiedler et al. 2005) Vaginal flora status assessment for this study utilized Amsel’s criteria, and specimens were obtained through the collection of vaginal fluid by brushing the lateral vaginal wall with a foam swab. (Fredricks, Fiedler et al. 
2005) It was determined that women without bacterial vaginosis had vaginal flora mostly dominated by Lactobacillus iners and Lactobacillus crispatus, while women with bacterial vaginosis had vaginal flora containing a range of taxa including Gardnerella vaginalis, Megasphaera, Leptotrichia, Dialister, Atopobium, Clostridiales, and Lactobacillus iners. (Fredricks, Fiedler et al. 2005) An interesting finding from this study was that while Lactobacillus iners was detected in women both with and without bacterial vaginosis, Lactobacillus crispatus was only detected in women without bacterial vaginosis. (Fredricks, Fiedler et al. 2005) Hyman et al. conducted a study on 20 reproductive-aged women using 16S rRNA PCR followed by deep gene sequencing of constructed clones. (Hyman, Fukushima et al. 2005) Vaginal specimens for this study were collected with a cryoloop from the vaginal epithelium in the posterior vaginal fornix. (Hyman, Fukushima et al. 2005) Although the vaginal flora of these women was categorized as being normal or healthy, this was based on patients self-reporting an asymptomatic state. (Hyman, Fukushima et al. 2005) Nonetheless, the vaginal samples of these women were either dominated by Lactobacillus species or contained a range of bacterial taxa including Bifidobacterium, Gardnerella, Prevotella, Pseudomonas, or Streptococcus. (Hyman, Fukushima et al. 2005) A larger study, conducted by Fredricks et al., looked at a total of 264 women between the ages of 17 and 55 years. (Fredricks, Fiedler et al. 2007) Specimens for this study were collected with a polyurethane foam swab brushed against the lateral vaginal wall. (Fredricks, Fiedler et al. 2007) It was determined that Leptotrichia, Sneathia, Atopobium vaginae, Megasphaera, and novel species of Clostridiales were all significantly associated with bacterial vaginosis. (Fredricks, Fiedler et al. 
2007) Furthermore, it was found that PCR detection of specific species of Megasphaera and Clostridiales was a reliable predictor of bacterial vaginosis. (Fredricks, Fiedler et al. 2007) Thies et al. conducted a study that looked at 70 reproductive-aged women utilizing vaginal swabs and 16S rRNA gene terminal restriction fragment length polymorphism (T-RFLP). (Thies, Konig et al. 2007) It was determined that the vaginal flora of women with bacterial vaginosis contained a range of bacterial taxa, mainly including Atopobium vaginae, Megasphaera sp., Lactobacillus iners, Gardnerella vaginalis, and three taxa of the Clostridiales order. (Thies, Konig et al. 2007) The vaginal flora of women without bacterial vaginosis was found to contain only lactobacilli species. (Thies, Konig et al. 2007) Hill et al. conducted a study looking at a group of 16 healthy, non-pregnant women utilizing vaginal swabs from the posterior fornix along with chaperonin-60 gene sequencing. (Hill, Goh et al. 2005) It was determined that the healthy vaginal flora of these women was dominated by Lactobacillus crispatus, L. iners, L. gasseri, L. jensenii, and L. buchneri. (Hill, Goh et al. 2005) Given that all of the women in this study had Nugent’s scores below 4, an interesting finding was the detection of Gardnerella vaginalis isolates in the healthy vaginal flora of these women. (Hill, Goh et al. 2005) Furthermore, bacterial taxa with similarity to Porphyromonas and Megasphaera were also detected. (Hill, Goh et al. 2005) These findings suggested that distinct strains or isolates of certain bacterial organisms may be less pathogenic than previously understood. Zhou et al. looked at a group of Caucasian and Black reproductive-aged women using a mid-vaginal swab and 16S rRNA T-RFLP. (Zhou, Brown et al. 
2007) While standardized assessment procedures were not utilized for the diagnosis of bacterial vaginosis or for the determination of a normal vaginal flora, the study’s medical personnel classified all of the women in this study as having a healthy or normal vaginal flora. (Zhou, Brown et al. 2007) An interesting finding in this study was that Black women had a higher prevalence of vaginal communities that were not dominated by Lactobacillus in comparison with Caucasian women. (Zhou, Brown et al. 2007) Furthermore, the vaginal communities of Black women that were not dominated by Lactobacillus were found to contain either Atopobium or specific taxa of the Clostridiales order. (Zhou, Brown et al. 2007) This study was one of the first to demonstrate ethnic differences in the vaginal microbiota of women. Another study conducted by Zhou et al. that explored ethnic differences in vaginal microbiota looked at 73 reproductive-aged Japanese women using mid-vaginal swabs and 16S rRNA T-RFLP. (Zhou, Hansmann et al. 2010) As with the previous study, all of the women in this study were classified as not having any vaginal symptoms or signs of abnormality. (Zhou, Hansmann et al. 2010) Also, as with the previous study, no standardized assessment procedure was applied for this reproductive health status determination. (Zhou, Hansmann et al. 2010) Regardless, it was found that most vaginal communities in this group were dominated by Lactobacillus iners, Lactobacillus crispatus, Lactobacillus jensenii, or Lactobacillus gasseri. (Zhou, Hansmann et al. 2010) A smaller group of vaginal communities in this cohort contained a high amount of Atopobium vaginae along with Lactobacillus, Clostridium, Dialister, Gemella, Lachnospiraceae, Leptotrichia, Megasphaera, Mobiluncus, and Prevotella among other bacterial taxa. (Zhou, Hansmann et al. 
2010) Another group of vaginal communities in this cohort was found to contain an increased diversity of bacterial taxa including Streptococcus, Veillonella, Lachnospira, Gardnerella vaginalis, Anaerococcus, Peptostreptococcus, Megasphaera, Mycoplasma, Prevotella, and Shigella. (Zhou, Hansmann et al. 2010) Overall, it was determined that the bacterial communities detected in this group of Japanese women were similar to those detected in Caucasian and Black women. (Zhou, Hansmann et al. 2010) In contrast, it was found that the frequency of vaginal communities not dominated by lactobacilli was lower in Japanese women in comparison with Black women and higher in Japanese women in comparison with Caucasian women. (Zhou, Hansmann et al. 2010) To date, the largest study on profiling the vaginal microbiome was conducted by Ravel et al. on a total of 396 reproductive-aged women equally distributed into the ethnic groups of Caucasian, Black, Asian, and Hispanic. (Ravel, Gajer et al. 2011) Bacterial taxa were determined through the pyrosequencing of a 16S rRNA PCR amplified target from self-collected vaginal swabs. (Ravel, Gajer et al. 2011) It was found that there were 5 major vaginal communities of bacteria in this group of women, with 4 of these bacterial communities being dominated by Lactobacillus iners, Lactobacillus crispatus, Lactobacillus gasseri, or Lactobacillus jensenii. (Ravel, Gajer et al. 2011) The fifth community cluster contained high proportions of Prevotella, Dialister, Atopobium, Gardnerella, Megasphaera, Finegoldia, Mobiluncus, and other anaerobic bacteria. (Ravel, Gajer et al. 2011) In terms of Nugent scores, it was found that the community with a heterogeneous mixture of bacteria correlated with the highest Nugent scores, while the groups dominated by Lactobacillus iners and Lactobacillus jensenii correlated with the next highest Nugent scores. (Ravel, Gajer et al. 
2011) In terms of ethnic differences, it was determined that Lactobacillus-dominant bacterial communities had a higher prevalence in Caucasian and Asian women and a lower prevalence in Hispanic and Black women. (Ravel, Gajer et al. 2011) Although a number of molecular studies have looked into the vaginal microbiota of healthy women without chronic diseases or infections, the number of studies exploring the vaginal bacterial communities in HIV-positive women is very limited (Table 2). Spear et al. conducted one of the first molecular studies profiling the vaginal microbiota in HIV-positive women, on a total of 11 HIV-positive women and 10 HIV-negative women. (Spear, Sikaroodi et al. 2008) All of the women who participated in this study were of reproductive age, and study results were obtained through the collection of cervicovaginal lavage samples followed by pyrosequencing of the 16S rRNA gene target. (Spear, Sikaroodi et al. 2008) Although a significant difference was not found in the comparison of vaginal microbial diversity between HIV-positive and HIV-negative women, there were 4 bacterial taxa that were only detected in the vaginal microflora of HIV-positive women. (Spear, Sikaroodi et al. 2008) These bacterial taxa included Propionibacterineae, Anaerococcus, Citrobacter, and Acinetobacter. (Spear, Sikaroodi et al. 2008) It was also found that women without bacterial vaginosis had vaginal microflora mostly dominated by Lactobacillus species while women with bacterial vaginosis had a higher proportion of a range of bacterial taxa including Prevotella, Megasphaera, Gardnerella, Coriobacterineae, Lachnospira, and Sneathia. (Spear, Sikaroodi et al. 2008)

Table 2 Studies profiling the vaginal microbiome in HIV-positive women
Study | Sample Size | Technique | Normal Vaginal Flora | Abnormal Vaginal Flora
Spear et al. 2008 | 11 | 16S rRNA Pyrosequencing | Lactobacillus | Prevotella, Megasphaera, Gardnerella, Lachnospira, Sneathia
Hummelen et al. 2010 | 132 | 16S rRNA Gene Sequencing | L. iners, L. crispatus | Prevotella bivia, Lachnospiraceae, Gardnerella vaginalis
Spear et al. 2011 | 36 | 16S rRNA Pyrosequencing | L. iners, L. crispatus | -

Hummelen et al. conducted a larger study looking into 132 Tanzanian HIV-positive women. (Hummelen, Fernandes et al. 2010) All of these women were of reproductive age, and results were obtained through the collection of mid-vaginal wall swab samples followed by Illumina gene sequencing of the V6 region of the 16S rRNA gene. (Hummelen, Fernandes et al. 2010) It was found that Gardnerella vaginalis and Lactobacillus iners were detected in all of the collected vaginal swab samples, suggesting that these 2 bacterial taxa may be core members of the vaginal microbiota in Tanzanian HIV-positive women. (Hummelen, Fernandes et al. 2010) In total, 8 major vaginal community clusters were determined. (Hummelen, Fernandes et al. 2010) Two of these clusters were found to be related to a normal vaginal microflora and were dominated by Lactobacillus iners or Lactobacillus crispatus. (Hummelen, Fernandes et al. 2010) Four of these clusters were found to be related to bacterial vaginosis and were dominated by Prevotella bivia, Lachnospiraceae, Gardnerella vaginalis, or a mixture of different bacterial taxa. (Hummelen, Fernandes et al. 2010) The remaining 2 clusters contained a mixture of Gardnerella vaginalis, Lactobacillus iners and a number of other bacterial taxa. These remaining 2 clusters were unique in that they were not specific to a single Nugent’s score grouping, and instead spanned all 3 categories of normal flora, intermediate flora, and bacterial vaginosis. (Hummelen, Fernandes et al. 2010) Spear et al. conducted another study on HIV-positive women that collected cervicovaginal lavage samples and utilized the V1 and V2 regions of the 16S rRNA gene target for pyrosequencing. (Spear, Gilbert et al. 
2011) This study included a total of 36 reproductive-aged HIV-positive women, of which 30 were African-American. (Spear, Gilbert et al. 2011) The focus of this study was to determine the major species of Lactobacillus present in the vaginal microbiota, and Lactobacillus iners was found to be the most predominant species of Lactobacillus in this group of HIV-positive women. (Spear, Gilbert et al. 2011) While Lactobacillus iners was the dominant Lactobacillus species in 66% of these HIV-positive women, it was found that Lactobacillus crispatus was more often the dominant species of Lactobacillus in women who had microbiota consisting of >50% Lactobacillus. (Spear, Gilbert et al. 2011) While insight has been gained into the makeup of the vaginal microbiota of HIV-positive women at a deeper level, it is clear that there is a need for further investigation in more diverse groups of women and with the use of more advanced gene-sequencing tools. Furthermore, it is important to note that there is no information on the microbiota of Canadian women.

1.7 Rationale

Microbial communities of the female vaginal tract play a profound role in determining reproductive health states in women. It is currently understood that a healthy vaginal environment is dominated by the lactic acid-producing Lactobacillus bacterial species. (Antonio, Hawes et al. 1999) These species are known to produce acidic compounds such as lactic acid, lowering the pH of the vaginal tract and aiding in its protective effect. (Sweet and Gibbs 2009) When these naturally existing bacterial communities of the vagina are disrupted, many negative health consequences can arise. These negative health outcomes include increased chances of sexually transmitted infections, preterm birth, infertility, and early pregnancy loss. (Hillier 2005) These disruptions to the natural vaginal microbiome can be caused by sexual activity, contraceptive use, douching practices, and antibiotic use. 
(Hooton, Hillier et al. 1991) Specifically in HIV-positive women, it has been found that conditions such as bacterial vaginosis and candidiasis are present at significantly higher rates than in women without any history of chronic disease. (Jamieson, Duerr et al. 2001) Bacterial vaginosis is a common genital condition, occurring in about 10–30% of women in the general population. (Allsworth and Peipert 2007) Defined by a shift in the natural ecology of the vagina from Lactobacillus microbes to a diverse mixture of aerobic and anaerobic organisms, bacterial vaginosis is known to be a strong precursor for many negative reproductive health outcomes including pelvic inflammatory disease. (Sweet 1995) The current standards for identifying and diagnosing this condition include Amsel’s criteria and the Nugent scoring system. Although the Nugent scoring system is the current gold standard for bacterial vaginosis diagnosis due to its reproducibility and effectiveness, it is subjective and does not allow for an accurate identification of specific bacterial organisms. (Forsum, Jakobsson et al. 2002; Larsson, Carlsson et al. 2004; Forsum, Larsson et al. 2008) Furthermore, this microscopy-based technique does not always allow for the basic differentiation between protective and disease-related bacteria such as Lactobacillus and Atopobium species. (Burton, Devillard et al. 2004) More importantly, these techniques are highly limited in identifying and characterizing the majority of microbes that are present in a vaginal sample. While culture-based techniques have allowed for the isolation of certain microorganisms into pure cultures, these techniques have been limited in providing a more accurate and in-depth view of microbes. As it is being increasingly understood that specific diseases are caused by a complex interplay of pathogens as opposed to a single organism, there is a strong need to accurately classify organisms at the microbial level. 
(Tibayrenc 2007) The focus of this graduate research project was to explore and define the vaginal microbiome of HIV-positive women. Although much has been learned about reproductive and vaginal diseases at a clinical level, relatively little is known about the deeper microbial level of these negative health outcomes. A large knowledge gap continues to exist with regard to which specific kinds and how many bacterial species exist in the vaginal tract of women in general, and especially in HIV-positive women. The current research project explored the specific identities of the bacterial communities in the vaginal tract of HIV-positive women through the use of novel high-throughput chaperonin 60 (cpn60) gene sequencing methods. While most vaginal microbiome characterization studies to date had utilized 16S rRNA-encoding DNA sequencing methods, this study utilized cpn60 gene sequencing methods because the cpn60 target offers increased species resolution over the 16S rRNA target gene. (Schellenberg, Links et al. 2009) This increased species resolution allows for accurate differentiation beyond the genus level and into the species and subspecies levels. (Schellenberg, Links et al.) These detailed bacterial findings may potentially lead to the exact identification of disease or infection-related species of bacteria within a single bacterial phylum, allowing for the separation of “healthy” and “unhealthy” bacterial species. This may further lead to highly targeted diagnostic techniques for reproductive infections and diseases. An incredible amount of progress has been made in the field of HIV research and clinical care in the past 20 years. While many advances have been made in areas of testing and treatment, a significant need continues to exist for an improved understanding of reproductive and vaginal health outcomes at the deep microbial level. 
This research project was the first to utilize cpn60 gene sequencing techniques to profile the vaginal microbiome in HIV-positive women and was also one of the first studies to explore the vaginal microbiome in women of Aboriginal descent. The major goal of this research project was to provide a new set of data that could contribute to a greater understanding of reproductive health in women with HIV, with the hope of building upon current diagnostic and treatment techniques for specific reproductive conditions.

HYPOTHESIS

HIV-positive reproductive-aged women share a core vaginal microbiome with variations that can be defined and correlated with specific demographic and clinical characteristics.

OBJECTIVES

The primary aim was to characterize and define the bacterial profile of the vaginal microbiome within a cohort of HIV-positive reproductive-aged women, using sequencing of the 60 kDa chaperonin gene. The secondary aim was to understand and determine relationships between vaginal bacterial diversity and abundance and a number of demographic, clinical, and HIV-specific variables. These variables included: ethnicity, BMI, genital infection history, feminine hygiene product use, contraception use, sexual activity, reproductive health at visit, length of HIV infection, CD4 count at visit, and viral load at visit.

2 MATERIALS AND METHODS

2.1 Study Details

This research project, entitled Vogue (Vaginal Microbiome Group Initiative) Study 1B, received ethical approval from the University of British Columbia – Children’s & Women’s Health Centre of BC Research Ethics Board (UBC C&W REB) with Certificate Number H11-00119 on April 11, 2011.

STUDY POPULATION

The subjects who were offered enrolment into the Vogue 1B study were all patients who attended the Oak Tree Clinic located in Vancouver, British Columbia, Canada. 
The Oak Tree Clinic, a program of the B.C. Women’s Hospital and Health Centre, is a tertiary referral outpatient centre providing specialized care for HIV-infected men, women, and children. Women attending the clinic for routine HIV and gynaecologic care were invited to enrol if they were HIV-infected, between the ages of 18 and 49 years, and neither pregnant nor menopausal. The enrolment target for the Vogue 1B study was a total of 50 HIV-infected women. This target was supported by a power calculation for a future comparison of the vaginal microbiota of HIV-infected and uninfected women. As the Vogue 1A study has an enrolment target of 300 HIV-negative women, a comparison of 50 HIV-positive women with 300 healthy women using a two-tailed test with an alpha error level of 0.05 yielded over 80% power for the detection of differences. This power calculation was based on reported differences of the vaginal microbiota by gold standard Gram stain analyses. (Money 2005) Ultimately, a total of 54 women were enrolled in order to account for some women potentially having insufficient samples for analysis.

DATA COLLECTION

Following informed consent (appendix page 110), a patient interview was conducted in order to collect basic demographic, health, and reproductive information. This interview was followed by a patient chart review in order to collect specific laboratory and antiretroviral medication information. The collected demographic information included: age, height, weight, BMI, ethnicity, country of birth, year immigrated to Canada, marital status, education level, and general residential location. Each subject’s general medical history included information on any significant current or chronic diseases. Genital infection and treatment history was recorded for bacterial vaginosis, yeast infections, urinary tract infections, trichomoniasis, genital warts, genital herpes, chlamydia, gonorrhea, and syphilis.
Antimicrobial use was recorded for the 3-month time period prior to the study visit, while other prescription and non-prescription drugs were recorded for a 2-month time period prior to the study visit. Within this non-prescription drug section, information about any supplemental probiotic or herbal remedy use was also recorded. In terms of reproductive health, information was collected on pregnancy history, gynaecologic history, history of other sexually transmitted infections, menstruation timing and occurrence, tampon usage, recent vaginal symptoms, feminine hygiene product usage, contraception usage, and sexual activity. The sexual activity questions encompassed information related to recent vaginal intercourse occurrence, number of recent sexual partners, pain experienced during vaginal intercourse, frequency of receiving oral sex, frequency of anal sex, and sex toy usage. Current and previous substance use history was recorded for heroin, cocaine, crystal meth, marijuana, opiates, benzodiazepines, methadone, alcohol, tobacco, and any other substances reported by the study subject. Specific HIV-related information was collected for the following factors: HIV acquisition mode, timing of first HIV positive test, HIV clade, CD4 nadir, highest viral load ever, CD4 count at study visit, viral load at study visit, hepatitis C immune status, and hepatitis B immune status. Antiretroviral history was also collected for each subject through the recording of all antiretroviral medication combinations taken along with start and stop dates. The data collection form that includes all of this collected information is attached in the appendix (page 118), along with a “Pelvic Exam Findings” form that documents reproductive health for the study visit day.

DATABASE DESIGN

An electronic database was created with the use of the secure web application entitled Research Electronic Data Capture (REDCap).
All data captured on data collection forms through patient interviews and chart reviews were manually entered into this electronic database for more efficient and accessible data storage. This database is securely stored on a server at the Child and Family Research Institute (CFRI) on the campus adjacent to BC Women’s Hospital.

SAMPLE COLLECTION¹

For each subject enrolled into the Vogue 1B study, vaginal study swabs were collected at the same time as the patient’s regularly scheduled pelvic exam. At the time of a speculum examination, in addition to clinical vaginal and cervical samples, four swabs were taken from the posterior fornix and lateral vaginal wall. These swabs were, respectively: one swab for Gram stain analysis and Nugent’s scoring, two swabs in a dry swab container for gene sequencing analyses, and one swab in viral transport media for viral analyses. The collection procedure for each of these swabs was similar, following a detailed set of instructions. At the time of the speculum exam a swab from the posterior fornix and lateral vaginal wall was taken for each sample. After labeling with a confidential study identification number, the Gram stain swab was submitted to the local hospital laboratory for clinical analysis, while the other three swabs were stored in a -80˚ C freezer within 4 hours of swab collection. The specific Gram stain swab used for this analysis was the Copan Sterile Transport Swab Suitable for Aerobes and Anaerobes. The dry swabs used were MicroRheologics Sterile Swab Applicators with Tips Flocked with Nylon fiber. The containers used for these dry swabs were the Empty Copan 1mL Universal Transport Medium tubes. The swabs used for viral analysis were the Copan 3mL Universal Transport Medium Kits for the Collection and Preservation of Virus, Chlamydia, Mycoplasma and Ureaplasma.

¹ Vaginal swabs were collected by Bal Dhesi and Laura Vicol at the Oak Tree Clinic.
The collected viral swabs are planned to undergo processing for the extraction of RNA and DNA at the BCCDC for a separate virome profiling analysis.

2.2 Laboratory Methods

One of the two swabs collected in the dry swab containers remained stored as a backup sample for future analyses. The other swab was processed by MagMAX™ DNA extraction and then shipped to the Hill Laboratory located at the University of Saskatchewan, for gene sequencing processing. The Scientific Industries Digital Vortex Genie 2 was used for all vortexing purposes while the Eppendorf Centrifuge 5430 and the MicroMax Thermo IEC Centrifuges were used for all centrifugation purposes. For the MagMAX™ DNA extraction, the Applied Biosystems MagMAX™ Express was utilized.

DNA EXTRACTION²

For the MagMAX™ DNA extraction at the BCCDC, 300 µL of sterile 1X Phosphate Buffered Saline (PBS) Buffer (pH 7.4) was initially added to the vaginal swab in the dry swab container. After vortexing the swab for 30 seconds, 200 µL of the sample solution was removed from the swab container and placed into a 1.5 mL tube. At this point, the original swab and swab container were discarded. Separately, 235 µL of MagMAX Lysis/Binding Solution Concentrate was added to a prepared tube of zirconia beads in a guanidinium thiocyanate-based solution. 175 µL of the sample solution was then transferred from the 1.5 mL tube and added to the prepared tube of zirconia beads. This tube was then vortexed for 15 minutes and then centrifuged for 3 minutes at 16,000 x g using the Eppendorf Centrifuge 5430. This procedure allowed the zirconia beads to mechanically disrupt the cells, releasing nucleic acid content. (Isola, Pardini et al.) In addition to cell lysis, the guanidinium thiocyanate-based solution inactivated nucleases through enzyme denaturation. (Isola, Pardini et al.)
As the MagMAX™ DNA extraction utilizes a specific plate with 96 wells, arranged in a manner of 12 columns by 8 rows, 12 unique samples can be run on one plate at a time. The Applied Biosystems MagMAX™ Express utilized for these samples was able to run two plates at a time, allowing for a total of 24 unique samples to be processed simultaneously. In order to prepare a plate for MagMAX™ DNA extraction, 10 µL of Lysis Binding Enhancer was added to each well in row A of the plate. This was followed by the addition of 10 µL of Nucleic Acid Binding Beads to each well in row A. Rows B and C of the plate had 150 µL of Wash Solution 1 Concentrate added to each well, while rows D and E of the plate had 150 µL of Wash Solution 2 Concentrate added to each well. Row F of the plate had 50 µL of Elution Buffer added to each well. Following the centrifugation of the sample solutions, the supernatant liquid from the individual centrifuged sample solutions was added to the appropriate wells in row A of the plate, followed by the addition of 65 µL of 100% isopropanol into each well in row A of the plate. This prepared plate was then inserted into the MagMAX™ Express instrument along with a new sterile plastic Tip Comb. The MagMAX™ Express instrument then ran for a period of approximately 20 minutes. This run allowed for the nucleic acid binding beads to bind nucleic acids and sequentially wash these nucleic acids. Wash Solution 1 Concentrate removed proteins and other contaminants, while Wash Solution 2 Concentrate removed residual binding solution. After run completion in the MagMAX™ Express instrument, the plate was taken out of the instrument and eluted sample yields were extracted from row F of the processed plate. These sample solution yields were added to 1.5 mL Eppendorf tubes.

² DNA Extraction was performed by Vincent Montoya at the BCCDC.
For each sample, a volume of 3M sodium acetate (pH 5.2) equal to 1/10 of the sample solution volume was added to the sample’s Eppendorf tube, along with a volume of isopropanol equal to the sample solution volume. The samples were then placed into a -80˚ C freezer for a period of 30 minutes. This process allowed the DNA to precipitate out from the sample solution. Each of the samples was then placed into a MicroMax Thermo IEC Centrifuge set at 20,000 x g and 4˚ C for a period of 30 minutes. After this round of centrifugation, any remaining supernatant inside the Eppendorf tubes was removed through pipetting, leaving a pellet of the sample. These Eppendorf tubes containing the sample pellets were then placed into the Savant SpeedVac SC100 vacuum centrifuge with open lids in order to fully remove all supernatant liquid. This centrifugation process ran for approximately 1 hour, after which the lids on the samples’ Eppendorf tubes were closed. The processed samples were then stored at room temperature and shipped to the Hill Laboratory in Saskatoon, Saskatchewan on the day following this processing.

SAMPLE PROCESSING³

At the Hill Laboratory in Saskatoon, the MagMAX™ purified samples were received and stored at room temperature. The VWR Analog Vortex Mixer was used for all vortexing purposes while the Thermo Scientific Sorvall Legend Micro 17 Centrifuge was used for all centrifugation purposes. Initially, 100 µL of a Pure TE Buffer was added to each of the sample pellets. This Pure TE Buffer (pH 8) consisted of 0.121 grams of tris(hydroxymethyl)aminomethane (Tris) and 0.037 grams of ethylenediaminetetraacetic acid (EDTA) dissolved in approximately 100 mL of water. These samples with added Pure TE Buffer were vortexed for 3 seconds and stored in a -20˚ C freezer.

TOTAL 16S rRNA qPCR

2 µL of each sample was used to run a quantitative real time polymerase chain reaction (qPCR) for the V3 region of the 16S rRNA gene.
The specific 16S rRNA assay utilized for this run was the SRV3-1/3 primer set. (Lee, Zo et al. 1996)

³ Sample processing and PCR amplification was primarily conducted by Dr. Bonnie Chaban at the University of Saskatchewan.

CPN60 PCR AMPLIFICATION / CONCENTRATION

In preparation for polymerase chain reaction (PCR) amplification of the universal cpn60 gene target, a unique master mix solution was created for each individual sample. This master mix consisted of: 477.4 µL of Ultrapure Water, 70 µL of 10 x PCR Buffer (Invitrogen), 35 µL of 50 mM MgCl2 (Invitrogen), 14 µL of 10 mM dNTP, and 5.6 µL of Platinum Taq (Invitrogen). A unique multiplex identification (MID) tagged primer working stock was also prepared for each sample. The MID tag was a unique 10 base pair sequence that was added to each of the samples’ primer sets, allowing for the differentiation of unique samples in future processing steps. The 2 sets of forward and reverse primers used for the primer working stock were:

H279 (5' – GAIIIIGCIGGIGAYGGIACIACIAC – 3')
H280 (5' – YKIYKITCICCRAAICCIGGIGCYTT – 3')
H1612 (5' – GAIIIIGCIGGYGACGGYACSACSAC – 3')
H1613 (5' – CGRCGRTCRCCGAAGCCSGGIGCCTT – 3')

The primer working stock consisted of 3 µL of 100 mM primer H279, 3 µL of 100 mM primer H280, 9 µL of 100 mM primer H1612, 9 µL of 100 mM primer H1613, and 276 µL of Ultrapure water. The primer set of H1612 and H1613 was included because it has proven to have a higher success rate for the binding of guanine and cytosine rich sequences. (Isola, Pardini et al.) For each sample, 70 µL of primer working stock solution was utilized. 13 sterile PCR tubes were then labelled, 12 with the numbers 1 through 12 and the remaining tube with “NTC,” or No Template Control. The purpose of this tube was to test for any potential contamination through progression of study procedures as well as to ensure that reagents were free of contamination.
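The recipe above can be checked with simple arithmetic. The following minimal Python sketch totals the listed volumes and applies C1V1 = C2V2 to estimate reagent concentrations once the 70 µL of primer working stock is combined with the master mix, before any template is added. The variable names and the helper `diluted_concentration` are illustrative only and are not part of the Hill Laboratory protocol; stock concentrations are taken as stated in the text.

```python
# Volumes in microlitres (µL), copied from the recipe in the text.
master_mix_ul = {
    "ultrapure_water": 477.4,
    "pcr_buffer_10x": 70.0,
    "mgcl2_50mM": 35.0,
    "dntp_10mM": 14.0,
    "platinum_taq": 5.6,
}
primer_stock_ul = {
    "H279": 3.0,
    "H280": 3.0,
    "H1612": 9.0,
    "H1613": 9.0,
    "ultrapure_water": 276.0,
}
PRIMER_STOCK_USED_UL = 70.0  # primer working stock added per sample


def diluted_concentration(stock_conc, stock_volume_ul, final_volume_ul):
    """Apply C1*V1 = C2*V2 to get the concentration after dilution."""
    if final_volume_ul <= 0:
        raise ValueError("final volume must be positive")
    return stock_conc * stock_volume_ul / final_volume_ul


master_mix_total = sum(master_mix_ul.values())
primer_stock_total = sum(primer_stock_ul.values())
complete_mix_total = master_mix_total + PRIMER_STOCK_USED_UL

# Concentrations in the complete mix (master mix plus primers, no template).
mgcl2_mM = diluted_concentration(50.0, master_mix_ul["mgcl2_50mM"], complete_mix_total)
dntp_mM = diluted_concentration(10.0, master_mix_ul["dntp_10mM"], complete_mix_total)

# Volume ratio of the GC-rich primer pair to the standard pair.
gc_to_standard_ratio = primer_stock_ul["H1612"] / primer_stock_ul["H279"]

print(f"master mix total: {master_mix_total:.1f} µL")
print(f"primer stock total: {primer_stock_total:.1f} µL")
print(f"MgCl2 in complete mix: {mgcl2_mM:.2f} mM")
print(f"dNTP in complete mix: {dntp_mM:.2f} mM")
```

Under these assumptions the master mix totals 602 µL and the primer working stock 300 µL, with the H1612/H1613 pair present at three times the volume of the H279/H280 pair.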
The master mix solution, the primer working stock solution and the 13 sterile PCR tubes were then placed under an ultraviolet (UV) light in a “Cleanspot” UV cabinet for 10 minutes to allow for the inactivation of any DNA products through the formation of thymine dimers. (Schreier, Schrader et al. 2007) 70 µL of the MID-primer mix was then added to the tube of master mix solution. 48 µL of this complete master mix solution was then added into the NTC tube. 24 µL of the vortexed sample solution was then added to the remaining complete master mix solution. 50 µL of this mixed solution was aliquoted into each of the 12 PCR tubes. These 12 PCR tubes were then added to the top row of the Eppendorf Mastercycler EP Gradient Thermal Cycler over an ascending temperature gradient arranged in the following manner:

Column            1     2     3     4     5     6     7     8     9     10    11    12
Temperature (˚C)  41.9  42.3  43.4  45.1  47.2  49.6  52.0  54.4  56.5  58.3  59.5  60.1

The NTC tube was added to column 12 of the second row of the thermal cycler. The thermal cycler was then run on the following PCR program: 95˚C for 5 minutes; 40 cycles of 95˚C for 30 seconds, the 42-60˚C gradient for 30 seconds, and 72˚C for 30 seconds; 72˚C for 2 minutes; and a hold at 10˚C. After the PCR program was complete, the amplified cpn60 target samples in all 12 of the PCR tubes were pooled together into a single microfuge tube.

In order to ensure there was no contamination during the study procedures, an agarose gel was run for the pooled PCR sample and for the NTC sample. 1 µL of ethidium bromide was added to the agarose gel mixture in order to allow for DNA visualization under UV light. This 1% agarose gel mixture was then poured into a gel tray that was securely placed in a gel tray holder. A small plastic comb was then inserted into this gel mixture in order to create wells. Following the setting of the gel, the plastic comb was removed from the gel and the gel tray was placed into a running tank filled with 0.5 x TBE buffer.
The PCR sample and the NTC sample were added to gel wells by mixing 2 µL of DNA electrophoresis sample buffer (DNA-ESB) with 5 µL of each of the samples. A DNA ladder was also added to a gel well in order to allow for sample size determination. The gel was run in the running tank at 100 volts for 35 minutes. The gel was then placed into the Alpha Innotech AlphaImager instrument. After adjustment of the AlphaImager instrument camera’s exposure time and focus, an image of the exposed gel was captured, printed and analyzed. If the NTC lanes for the gel image were blank with no visible bands, the sample was considered to be uncontaminated and ready for concentration and purification. Furthermore, depending on the concentration of the sample’s bacterial content, the presence of bands at approximately the 550 base pair region of the sample lanes would be indicative of amplified cpn60 gene product. (Figure 1)

Figure 1 Exposed gel bands visualized by Alpha Innotech AlphaImager Instrument

The PCR amplified sample was then concentrated using Amicon Ultra-0.5 Centrifugal Filter Units with Ultracel-30 membranes. The concentration protocol, contained in the “Pyrosequencing Preparation Protocol, Hill Lab, BC Last Edit 2012-0113” was followed. This concentrated PCR product was then purified through the process of gel purification. For this process, a rainbow tracking dye was utilized. This tracking dye consisted of: 0.5 mL of 0.5 M EDTA (pH 8.0), 12 g Sucrose, 0.06 g Bromophenol Blue, 0.07 g Xylene Cyanol FF, 0.06 g Cresol Red, 0.11 g Orange G, and Ultrapure Water to a total volume of 25 mL. After preparing a 1% agarose gel without any added ethidium bromide stain, 5 µL of the rainbow tracking dye was directly added to the ~30 µL of concentrated PCR sample. After placing the prepared agarose gel into a running tank, the sample product was added across two wells of the agarose gel with about 18 µL of sample added into each well.
The sample product was distributed across two wells because of the limited space offered by each well. The gel was run at 100 V for 35 minutes. The gel was taken out of the running tank and a clean scalpel blade was used to cut out the portions of the gel that contained the relevant cpn60 target sequences. In order to ensure the entire relevant sample was contained in the cut out portions, the entire red-coloured band and the top half of the purple-coloured band of the gel were cut out. While the red-coloured band corresponded with the 600 – 900 base pair region, the purple-coloured band corresponded with the 300 – 600 base pair region. For purification purposes, each excised gel fragment was then inserted into a microfuge tube and gel purified with the use of Qiagen’s QIAEX II Gel Extraction Kit (catalogue number 20021). The protocol listed on pages 12 – 14 of the “QIAEX II Handbook 10/2008” for this gel purification step was followed, with the exception of minor modifications listed in the appendix (page 130). This resulted in 2 microfuge tubes each containing ~20 µL of concentrated and purified MID-tagged cpn60 PCR product.

In order to quantify the amount of DNA present in each tube, an Invitrogen Qubit Fluorometer was used. A working solution was created with the use of 1393 µL of Invitrogen Quant-iT Buffer and 7 µL of Invitrogen Quant-iT Reagent. Each standard mixture consisted of 190 µL of this working solution in addition to 10 µL of standard solution. One standard solution consisted of 100 ng/µL dsDNA while the other standard solution consisted of 0 ng/µL dsDNA. This resulted in each standard solution mixture containing 200 µL of solution. The sample solutions consisted of 198 µL of working solution in addition to 2 µL of the purified sample. The purified sample DNA was quantified through inserting the samples into the Fluorometer and obtaining a reading, while the standard solutions were used to calibrate the Fluorometer.
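The dilution arithmetic behind this quantification step can be sketched as follows. This is a minimal illustration, assuming the fluorometer reading refers to the dsDNA concentration in the 200 µL assay tube, which must then be scaled back to the undiluted sample. The function names and the example reading are hypothetical and are not taken from the Qubit software or from the study data.

```python
def working_solution_ratio(buffer_ul, reagent_ul):
    """Return the reagent dilution as total volume divided by reagent volume."""
    if reagent_ul <= 0:
        raise ValueError("reagent volume must be positive")
    return (buffer_ul + reagent_ul) / reagent_ul


def stock_concentration(assay_tube_conc, sample_ul, working_ul):
    """Scale an assay-tube concentration back to the undiluted sample."""
    if sample_ul <= 0:
        raise ValueError("sample volume must be positive")
    dilution_factor = (sample_ul + working_ul) / sample_ul
    return assay_tube_conc * dilution_factor


# Volumes from the text: 1393 µL buffer + 7 µL reagent; 2 µL sample + 198 µL working solution.
reagent_dilution = working_solution_ratio(1393, 7)

# Hypothetical assay-tube reading of 0.5 ng/µL, for illustration only.
example_stock = stock_concentration(0.5, sample_ul=2, working_ul=198)

print(f"reagent diluted 1:{reagent_dilution:.0f}")
print(f"undiluted sample: {example_stock:.1f} ng/µL")
```

With the volumes given in the text, the reagent is diluted 1:200 in buffer and each sample is diluted 100-fold in the assay tube.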
Following DNA quantification of all 54 HIV-positive samples, portions of each sample were pooled into 3 distinct libraries. Library 1 was ~45 µL with a DNA concentration of 76.7 ng/µL, Library 2 was ~50 µL with a DNA concentration of 76.1 ng/µL, and Library 3 was ~30 µL with a DNA concentration of 84.6 ng/µL. Library 1 contained 21 samples, Library 2 contained 18 samples, and Library 3 contained 18 samples. Samples could be pooled for each of the libraries due to the incorporation of MID tags at the PCR amplification stage. This library pooling was followed by the shipment of these samples to the Harrigan Lab located at the BC Centre for Excellence in HIV/AIDS in Vancouver, BC.

GENE SEQUENCING PROCESSING⁴

At the Harrigan Lab, the pooled sample libraries were processed in preparation to be run on the 454 Roche FLX+ Gene Sequencer. The 3 major stages involved in the preparation of these samples included: fragment end repair and adaptor ligation, emulsion PCR and bead enrichment, and PicoTiterPlate preparation. The 3 manuals followed for this processing were Roche methodology manuals. For the fragment end repair and adaptor ligation stage, the Roche Rapid Library Preparation Method Manual (GS FLX Titanium Series – October 2009 Rev. Jan 2010) was followed beginning at step #13G of Section 3.1 on page 3 and ending at step #15 of Section 3.5 on page 4. The thermocycler used for these steps was the Applied Biosystems GeneAmp PCR System 9700. Following ligation, sample library DNA was quantified with the use of the Beckman Coulter DTX 880 Multimode Detector. For the emulsion PCR and bead enrichment stage, the Roche emPCR Method Manual – Lib-L SV (GS FLX Titanium Series – October 2009 Rev. Jan 2010) was followed for the 8 X SVE option beginning at Section 3 on page 5. The 8 X SVE option refers to the division of each sample library into 8 small volume emulsions.
The TissueLyser used for this stage was the Qiagen TissueLyser while the thermocycler used was the Applied Biosystems GeneAmp PCR System 9700. Minor modifications were made to the emPCR Method Manual. Steps 4 – 6 of Section 3.5.1 on page 7 were repeated for a total of 2 isopropanol washes. Steps 15 – 21 of Section 3.5.2 on page 8 were repeated for a total of 4 filter washes.

⁴ Gene sequencing processing was primarily conducted by Winnie Dong at the BC Centre for Excellence in HIV/AIDS.

For the PicoTiterPlate preparation stage, the Roche Sequencing Method Manual (GS FLX Titanium Series – October 2009 Rev. November 2010) was thoroughly followed. This Method Manual was also followed for the loading of the gene sequencing instrument with the prepared PicoTiterPlate and for the initiation of a gene sequencing run.

Generally, the first stage of this gene sequencing processing allowed for the ligation of adaptor sequences onto each of the amplified cpn60 products. These common adaptors then allowed for the binding of cpn60 product sequences to DNA capture beads in the second stage of processing. Each of the individual cpn60 sequences that were attached to DNA capture beads were then amplified through emulsion PCR with the use of Emulsion Oil. Following amplification, the emulsions were broken and Melt Solution was utilized in order to make the DNA single stranded. During the bead enrichment step, Enrichment Primers were added to the single-stranded DNA products. This was followed by the addition of Enrichment Beads that only bound to the DNA products that had attached Enrichment Primers. This step filtered out any DNA Capture beads that did not have DNA attached. The purified beads with attached DNA were then added to the PicoTiterPlate during the third stage of processing along with Enzyme Beads, Packing Beads and PPiase Beads. The prepared PicoTiterPlate was then inserted into the 454 Roche FLX+ Gene Sequencer.
PYROSEQUENCING TECHNOLOGY

As each well of the PicoTiterPlate is able to contain a single purified bead with attached single-stranded DNA, the 454 Roche FLX+ Gene Sequencer is able to produce a single sequence read for each well of the PicoTiterPlate. This allows the 454 Roche FLX+ Gene Sequencer to produce approximately 700,000 amplicon reads per sequencing run. The method that is utilized by this sequencer to generate sequence reads is termed sequencing-by-synthesis, or more specifically massively parallel pyrosequencing. (Ronaghi, Uhlen et al. 1998) This is a process in which nucleotides are sequentially added complementary to the single-stranded DNA strand with DNA polymerase and a sequencing primer. (Ronaghi, Uhlen et al. 1998) With the addition of each nucleotide to the DNA strand, inorganic pyrophosphate is released. (Ronaghi, Uhlen et al. 1998) Each of the nucleotide additions to the DNA strand is detected by the gene sequencer because each released inorganic pyrophosphate is converted to ATP by sulfurylase, which in turn drives the production of visible light by luciferase. (Ronaghi, Uhlen et al. 1998) This process is able to take place because the Enzyme Beads and Packing Beads added to the PicoTiterPlate wells include polymerase, luciferase and ATP sulfurylase. (Huse, Huber et al. 2007) As the gene sequencer is equipped with a charge-coupled device (CCD) camera, the different intensities of light generated by nucleotide additions are recorded by this camera into a flowgram that displays the sequence of nucleotides generated for each DNA strand. (Huse, Huber et al. 2007)

2.3 Bioinformatics

After gene sequencing, a set of files called standard flowgram format (sff) files was generated for each region of the PicoTiterPlate. As the 54 Vogue-1B samples utilized in this project were divided into libraries of 18 samples, 3 different sets of sff files were generated.
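The step from recorded light intensities to a nucleotide sequence, described under PYROSEQUENCING TECHNOLOGY above, can be illustrated with a deliberately simplified Python sketch. It assumes that nucleotides are flowed in a fixed repeating order (taken here to be T, A, C, G, which is an assumption rather than something stated in the text), that each normalized signal is roughly proportional to the number of identical bases incorporated in that flow, and that rounding recovers that number. The function name and the flow values are hypothetical; real base callers model signal noise far more carefully.

```python
FLOW_ORDER = "TACG"  # assumed repeating nucleotide flow order


def flowgram_to_sequence(flow_values, flow_order=FLOW_ORDER):
    """Convert normalized flow signals into a base sequence.

    Each signal is rounded to the nearest whole number, which is read as
    the length of the homopolymer incorporated during that flow.
    """
    bases = []
    for index, signal in enumerate(flow_values):
        if signal < 0:
            raise ValueError("flow signals must be non-negative")
        nucleotide = flow_order[index % len(flow_order)]
        homopolymer_length = int(signal + 0.5)  # round half up
        bases.append(nucleotide * homopolymer_length)
    return "".join(bases)


# Hypothetical normalized signals for eight consecutive flows (T, A, C, G, T, A, C, G).
example_flows = [1.02, 0.05, 0.97, 2.10, 0.03, 1.95, 0.10, 1.04]
print(flowgram_to_sequence(example_flows))
```

For the hypothetical signals above, the sketch yields the sequence TCGGAAG: near-zero signals contribute no base, and signals near 2 contribute a two-base homopolymer.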
The sff files were differentiated based on the unique MID tags that were added to each sample’s primer sets during initial stages of sample processing. The program utilized for this organization of sequence reads was SFF Tools. These organized sequence reads were converted into a text-based file format called FASTA. In order to generate assembled reads from these raw FASTA files of gene sequences, “Section 2 – Assembled Reads” of the “Analyzing Pyrosequencing Data (for Dummies)” manual, last modified March 20, 2012, was followed. With permission from the author, this manual is attached in the appendix (page 131). Section 2 of the manual describes the use of the bioinformatics pipeline called Microbial Profiling Using Metagenomic Assembly (mPUMA), created by the University of Saskatchewan bioinformatician Matthew Links. This pipeline of programs assembles sequence reads into groups of identical isotigs, removes cpn60 primers from each sequence read with the program Seqclean, further combines identical isotigs after primer removal with the program CD-HIT, and then creates input files for different data analysis programs. The data analysis programs that input files are created for include: GeneSpring, MEGAN, Unifrac, and Mothur.

Following mPUMA assembly⁵, “Section 3 – Frequency Tables” of the “Analyzing Pyrosequencing Data (for Dummies)” manual was followed. This section provided instructions on how to determine the best bacterial matches available for sequence reads with the use of the online Chaperonin Database located at www.cpndb.ca. These matches were determined through a Basic Local Alignment Search Tool (BLAST) match followed by a full sequence alignment using the Smith-Waterman algorithm.
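To make the alignment step concrete, the following is a minimal Python sketch of Smith-Waterman local alignment with a simple linear gap penalty, together with a percent identity calculation over the aligned region. The scoring values (match +2, mismatch -1, gap -2) and the function names are assumptions chosen for illustration; the text does not state the scoring scheme used by the Chaperonin Database, and production tools use more elaborate scoring.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best score, aligned a, aligned b) for the best local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the scoring matrix; cells never drop below zero (local alignment).
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(0, diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the highest-scoring cell until a zero is reached.
    i, j = best_pos
    aligned_a, aligned_b = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if score[i][j] == diag:
            aligned_a.append(a[i - 1])
            aligned_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            aligned_a.append(a[i - 1])
            aligned_b.append("-")
            i -= 1
        else:
            aligned_a.append("-")
            aligned_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(aligned_a)), "".join(reversed(aligned_b))


def percent_identity(aligned_a, aligned_b):
    """Percentage of alignment columns holding identical, non-gap characters."""
    if not aligned_a:
        return 0.0
    matches = sum(x == y and x != "-" for x, y in zip(aligned_a, aligned_b))
    return 100.0 * matches / len(aligned_a)


# Hypothetical toy sequences sharing the local region "CGT".
best_score, top, bottom = smith_waterman("AAACGTAAA", "TTTCGTTTT")
print(best_score, top, bottom, percent_identity(top, bottom))
```

A percent identity of this kind, computed against reference sequences, is the sort of value that the identity thresholds discussed later in this chapter are applied to.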
Section 3 of the manual also described how to generate frequency tables in order to understand the proportions of the different bacterial taxa that were present in individual vaginal swab samples. These proportions were determined through the generation of the number of specific DNA reads of each isotig sequence for each individual sample.

⁵ Assembly of the raw gene sequence files using mPUMA was conducted by Matthew Links and Dr. Janet Hill at the University of Saskatchewan.

Following the creation of a consolidated frequency table for the 54 Vogue-1B samples, “Section 4 – Chimera Checking” of the manual was followed. This section outlines how to identify potential chimeras from the dataset and how to remove these chimeras from the frequency table. As chimeras are unwanted DNA sequences that are formed by the combination of 2 different DNA sequences during PCR amplification, this step serves as a quality control filter prior to data analysis. The following modifications were made to the steps outlined on pages 17 – 23 of “Section 4 – Chimera Checking:”

1. Following the removal of chimeras identified by both Bellerophon and the manual method outlined in Section 4, the online program Emboss Needle was utilized for the review of any remaining chimera suspects. This is a program that utilizes the Needleman-Wunsch global alignment algorithm to align 2 sequences together.

2. The full sequences of the suspected chimeras were individually aligned with the full sequences of each of the bacterial species that most closely matched with the suspect chimera sequences. After utilizing the program Needle, sequence alignments were assessed for the distribution of identical and non-identical nucleotides throughout the alignment of the suspect chimera and bacterial species sequences.
If the identical and non-identical nucleotides for the alignments of the suspect chimera sequence and each bacterial species sequence appeared well distributed, then the suspect chimera was kept in the dataset. If the nucleotides were clustered towards either end of the sequence alignments, then the suspect chimera was confirmed to be a chimera and removed from the dataset.

Following these chimera checking steps, another filtering step was applied in which any isotig sequence with less than a 55% nucleotide identity match to the chaperonin database bacterial gene sequences was excluded from the consolidated frequency table. This 55% identity threshold was set because cpn60 is a protein-coding gene, and therefore requires a minimal percentage of gene homology with established cpn60 genes in order to allow for proper functional protein folding. (Gupta 1995) These matched and filtered sequences were then ready for the data analysis phase.

2.4 Data Analysis

Prior to data analysis, data assembly involved the generation of assembled sequence reads at a specific sampling depth. Since the number of sequence reads generated for each vaginal swab sample ranged from 1,081 reads up to 97,381 reads, there were two major options for the comparison of bacterial species between vaginal swabs. One option was that sequence reads could be assembled at maximal sampling depth for each sample, followed by the scaling of raw sequence reads to a common appropriate sampling depth for all samples. The other option was to choose an appropriate subsampling depth and assemble sequences in accordance with the chosen sampling depth. For this analysis, both options were pursued, and data analysis was completed using the second option of choosing an appropriate sampling depth and assembling sequences according to this sampling depth.
For this data analysis, a sampling depth of 1,081 randomly selected reads was chosen as the number of reads for data assembly for all 54 vaginal swab samples. This allowed for a direct comparison of raw sequence reads between all samples without any need for scaling. A sampling depth of 1,081 reads was chosen because the sample with the lowest number of reads at maximal sampling depth generated 1,081 reads. For this reason, all 54 samples could be included for an even comparison without any need for scaling up or down after subsampling. Furthermore, upon comparison of bacterial profiles generated at a maximal sampling depth with profiles generated at a subsampling depth of 1,081 reads, the composition of major bacterial taxa was highly similar. When evaluating the number of unique bacterial taxa generated at the 55% minimum nucleotide identity match for the online Chaperonin Database, it was determined that 74 unique bacterial taxa were generated at a maximal sampling depth while 55 unique bacterial taxa were generated at a sampling depth of 1,081 reads. Although this may suggest that the higher sampling depths aid in the identification of rarer bacterial taxa, all of the additional bacterial taxa identified through a maximal sampling depth contributed to less than 1% of each sample’s bacterial profile.

Another factor that needed to be assessed prior to data analysis was the bacterial identity match percentage cut off for the online Chaperonin Database. A 55% minimum nucleotide identity match to the Chaperonin Database bacterial gene sequences was utilized as part of the bioinformatics filtering steps. This cut off was chosen in order to exclude any gene sequences that could not code for functioning cpn60 proteins. This is a percentage cut off that has been determined by the Hill Laboratory through examination and comparison of gene sequences with both the Chaperonin Database and Genbank.
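The idea of bringing every sample to a common depth of 1,081 reads can be sketched in Python as random subsampling without replacement. This is a simplified illustration, not the study's procedure: in the thesis the reads were subsampled before assembly, whereas the sketch below draws from per-taxon read counts that are assumed to be already available. The function name, the taxon labels, and the counts are hypothetical.

```python
import random
from collections import Counter


def subsample_counts(counts, depth, seed=None):
    """Randomly draw `depth` reads without replacement from per-taxon counts."""
    total = sum(counts.values())
    if depth > total:
        raise ValueError(f"cannot draw {depth} reads from a sample of {total}")
    rng = random.Random(seed)
    # Expand the table into one label per read, then sample without replacement.
    reads = [taxon for taxon, n in counts.items() for _ in range(n)]
    return Counter(rng.sample(reads, depth))


# Hypothetical read counts for one deeply sequenced sample.
example_counts = {"taxon_A": 5000, "taxon_B": 3000, "taxon_C": 50}
subsampled = subsample_counts(example_counts, depth=1081, seed=1)
print(sum(subsampled.values()), dict(subsampled))
```

Because a rare taxon such as the hypothetical `taxon_C` may receive few or no reads in a given draw, a sketch like this also illustrates why fewer unique taxa are recovered at the subsampled depth than at maximal depth.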
In terms of assigning a cut off for the accurate classification of distinct bacterial taxa, a number of different cut offs were attempted. The heat map clustering results for these distinct taxa at the 55% and 78% cut off levels are included in the appendix (page 191). Because the percent identity cut off that differentiates a single bacterial genus or species is specific to each individual taxon, assigning a single match percentage cut off for all bacterial taxa could lead to the misidentification of certain taxa while excluding other taxa that could be accurately identified. For example, when looking at inter-specific percent match identifications for different bacterial genera, an accurate match for the genus Campylobacter can occur in the 70% range while an accurate match for the genus Megasphaera occurs in the 90% range. A thorough assessment of the inter-specific and intra-specific sequence distance ranges for cpn60 gene sequences demonstrates that the range of sequence distances among bacterial taxa is very wide. (Links, Dumonceaux et al. 2012) This supports the ability of the cpn60 universal gene target to provide more discriminatory power than other universal gene targets. (Links, Dumonceaux et al. 2012)

For the purpose of the current analysis, all bacterial taxa with a match percentage above 55% were included, with bacterial classifications assigned only to taxa with a match equal to or greater than 90% nucleotide identity. Bacterial taxa in the 55 – 90% nucleotide identity range were included in the analysis, but they were labelled with their original sequence isotig number followed by the top bacterial taxon hit recovered from the cpn60 database. The 90% cut off for bacterial classification was chosen to ensure the confident labelling of all bacterial taxa at the genus level and of most taxa at the species and subspecies levels.
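The two-threshold labelling rule described above can be summarized in a short Python sketch. The function and parameter names are hypothetical, and the label format for the 55 – 90% range only approximates the "isotig number followed by top hit" convention. The first example uses the isotig00007 value reported in Table 5; the other isotig IDs are placeholders.

```python
def label_isotig(isotig_id, top_hit, identity, include_min=55.0, name_min=90.0):
    """Return the analysis label for an isotig, or None if it is filtered out.

    identity is the percent nucleotide identity to the top cpnDB hit.
    """
    if identity < include_min:
        # Below the inclusion threshold: excluded from the frequency table.
        return None
    if identity >= name_min:
        # Confident match: consolidated under the reference taxon name.
        return top_hit
    # Intermediate match: keep the isotig ID and append the top hit.
    return f"{isotig_id} ({top_hit})"


# isotig00007 matched Clostridium genomosp. BVAB3 at 69.8% identity (Table 5).
print(label_isotig("isotig00007", "Clostridium genomosp. BVAB3", 69.8))
# Placeholder isotig IDs for a confident match and for a filtered sequence.
print(label_isotig("isotigX", "Lactobacillus crispatus", 99.3))
print(label_isotig("isotigY", "unclassified", 54.9))
```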
Although this 90% cut off resulted in fewer bacterial taxa being grouped together under common bacterial labels, it ensured greater confidence in the assignment of bacterial labels at the genus and species levels.

The first stage of data analysis involved the use of the open-source diversity statistics software mothur. This is expandable software initially developed by Dr. Patrick Schloss from the Department of Microbiology & Immunology at The University of Michigan. (Schloss, Westcott et al. 2009) In order to create an input file for this program and generate diversity statistics, “Section 7 – mothur Diversity Statistics and Rarefaction Curves” was followed in the “Analyzing Pyrosequencing Data (for Dummies)” manual. The set of instructions outlined for “Running mothur from haruspex” on page 31 of the manual was followed, with the modification of utilizing the University of Saskatchewan server entitled “eightball.” The initial set of data generated with mothur was the Good’s coverage and the Shannon Diversity Index for each sample. These calculations provided information on the quality of vaginal sampling along with ecological diversity. (Good 1953)

For data analysis, the primary goal was to determine the overall diversity and makeup of the vaginal microbiome for each of the study subjects. This involved comparing the raw sequence reads for each bacterial taxon identified in each individual subject sample. In order to initially compare the bacterial profiles obtained for each subject sample, the statistical software R Version 2.15.1 was utilized. The initial method used for comparing bacterial profiles involved the generation of a heat map in which a colouring scheme was used to differentiate between the least and most abundant bacterial species.
Furthermore, the specific type of heat map that was generated involved clustering vaginal samples together into groups based on the occurrence of common bacterial species. The specific packages that needed to be installed into R for creating this heat map included: lattice Version 0.20-10, vegan Version 2.0-4, BiodiversityR Version 1.6, RColorBrewer Version 1.0-5, and Heatplus Version 2.2.0. This heat map allowed for the detection of visually apparent trends in bacterial diversity across the whole cohort of study subjects. Following the generation of this visual display of bacterial profiles, an in-depth comparison of unique bacterial species within the most prevalent bacterial genera was made in order to determine which specific bacterial species were the most dominant in this cohort. Following this assessment, different demographic and clinical variables were compared to the distribution of bacterial genera and species among the individual subject samples. Relationships of bacterial diversity were explored with variables including: ethnicity, BMI, history of genital tract infections, Nugent’s scores, vaginal discharge findings, feminine hygiene product use, contraception use, sexual activity, substance use history, length of HIV infection, CD4 counts at visit, HIV viral loads at visit, and use of antiretrovirals or other antimicrobials. Furthermore, the different clusters of study subjects with common bacterial distributions were also compared with demographic and clinical variables.
2.5 Statistics

For the initial stage of data analysis, during which diversity statistics were generated with mothur, the Good’s coverage (Good 1953) was calculated by:

\[ C = 1 - \frac{n_1}{N} \]

Where,
n_1 = the number of operational taxonomic units or isotigs represented by one sequence
N = the total number of individuals in the sample

The Good’s coverage equation provides an estimate of the percentage of total species in an environment that are represented in the collected sample. (Good 1953; Bik, Eckburg et al. 2006; Ling, Kong et al. 2010) Furthermore, rarefaction curve values were also generated with mothur in order to better understand the environmental coverage obtained through study sampling.

When creating the clustered heat map for bacterial profiling, the first factor that needed to be determined was the ecological distance between the bacterial species of different subject samples. This was calculated with the Bray-Curtis distance formula (Kindt and Coe 2005):

\[ \text{Distance} = 1 - \frac{2 \sum_{i=1}^{S} \min(a_i, c_i)}{\sum_{i=1}^{S} (a_i + c_i)} \]

Where,
a_i = abundance of species i at site a
c_i = abundance of species i at site c
S = the total number of species

Following this distance calculation, a hierarchical clustering analysis was performed with the vegan package in R. (Oksanen 2012) This was followed by the generation of a dendrogram based on bacterial species frequencies and common subject groupings. After a visual examination of this dendrogram, a common height for cutting the dendrogram into distinct clusters was chosen. This cut-off height was based on the separation of common bacterial groups through the hierarchical clustering analysis. These groupings of clustered samples were then displayed as a heat map with the use of R.

For the first analysis that took demographic data into account, dominant species of the Lactobacillus genus were compared among different ethnic groups.
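The Good's coverage and Bray-Curtis formulas given earlier in this section, together with the Shannon Diversity Index reported alongside them, can be checked numerically with a minimal Python sketch. This is an illustration only: the study computed these values with mothur and R, the function names are hypothetical, the natural logarithm in the Shannon index is an assumption about the convention used, and the count vectors are invented.

```python
import math


def goods_coverage(counts):
    """C = 1 - n1/N, where n1 is the number of taxa seen exactly once and N the total reads."""
    n1 = sum(1 for n in counts if n == 1)
    return 1 - n1 / sum(counts)


def shannon_index(counts):
    """H = -sum(p_i * ln p_i) over taxa with non-zero counts (natural log assumed)."""
    total = sum(counts)
    return -sum((n / total) * math.log(n / total) for n in counts if n > 0)


def bray_curtis(a, c):
    """Distance = 1 - 2 * sum(min(a_i, c_i)) / sum(a_i + c_i) for two aligned count vectors."""
    shared = sum(min(x, y) for x, y in zip(a, c))
    return 1 - 2 * shared / (sum(a) + sum(c))


# Invented count vectors over the same three taxa, each totalling 1,081 reads.
site_a = [1078, 2, 1]
site_c = [500, 580, 1]

print(goods_coverage(site_a))       # one singleton among 1,081 reads
print(shannon_index(site_a))        # low diversity: one taxon dominates
print(bray_curtis(site_a, site_c))  # 0 = identical profiles, 1 = no shared taxa
```

Because both vectors are already at a common depth of 1,081 reads, the Bray-Curtis distance here compares raw counts directly, which mirrors the rationale given for subsampling.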
In order to test for significant differences between ethnic groups, contingency tables were created and Fisher’s exact test was utilized. Fisher’s exact test was used for this set of data because of the small sample sizes for each ethnic group and because the data were categorical. In order to compare the diagnosis of abnormal vaginal discharge or a patient’s self-report of abnormal vaginal discharge with bacterial species presence, the Wilcoxon signed-rank test was utilized. This test was chosen because the bacterial species population of the vaginal microbiome cannot be assumed to be normally distributed. The Wilcoxon signed-rank test was also utilized when comparing bacterial species distributions with HIV-specific clinical parameters, namely to compare the presence of bacterial species among categories of CD4 counts and among groupings of viral loads.

In order to further analyze differences in bacterial species relative abundance based on CD4 count and viral load groupings, the Linear Discriminant Analysis Effect Size (LEfSe) algorithm was utilized. (Segata, Izard et al. 2011) The web-based application entitled Galaxy, developed by the Huttenhower Lab at the Department of Biostatistics, Harvard School of Public Health, was used for the computation of these comparisons. (Giardine, Riemer et al. 2005; Blankenberg, Von Kuster et al. 2010; Goecks, Nekrutenko et al. 2010) This application utilized the Kruskal-Wallis test and the pairwise Wilcoxon test in order to generate a Linear Discriminant Analysis model. (Segata, Izard et al. 2011) This model calculated a Linear Discriminant Analysis value for each significantly different bacterial rank and also visually displayed significantly different bacterial ranks through a cladogram.

In order to understand the correlation between Nugent’s scores and total 16S rRNA quantity, the Spearman’s rank correlation coefficient was calculated.
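Spearman's rank correlation coefficient can be sketched in Python as the Pearson correlation of the ranks, with tied values given the average of the ranks they span. This is an illustrative implementation under those assumptions, not the routine used in the study; the function names and the paired values are hypothetical.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, assigning tied values the mean of the ranks they occupy."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the current run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Pearson correlation computed on the average ranks of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


# Hypothetical paired observations: Nugent score and log bacterial quantity.
nugent = [0, 1, 4, 7, 9, 10]
log_quantity = [8.1, 7.5, 9.0, 10.2, 9.8, 11.4]
print(spearman_rho(nugent, log_quantity))
```

Working on ranks rather than raw values is what makes the coefficient insensitive to the shape of the underlying distributions.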
Spearman’s coefficient is a non-parametric measure of the correlation between two variables. (Zar 2005) For all of the above statistical analyses, a p-value of less than 0.05 (p<0.05) was considered significant.

3 RESULTS

3.1 Vogue 1B Data

The study population was a cohort of women attending an HIV clinic in Vancouver (the Oak Tree Clinic) who were enrolled between April 2011 and November 2011. The 54 HIV-positive women who provided study samples had a mean age of 36.6 years with a range of 22.3 – 48.8 years. Their ethnicity by self-report was: 18 Caucasian, 9 Black, 13 Aboriginal, 4 Asian, 4 South Asian, and 6 mixed ethnicities (Figure 2). The mixed ethnicities group included 3 women of Aboriginal/Caucasian descent, 2 women of Black/Caucasian descent, and 1 woman of Aboriginal/Hispanic descent.
[Figure 2 chart segments: Caucasian 33.3%, Aboriginal 24.1%, Black 16.7%, Mixed Ethnicities 11.1%, Asian 7.4%, South Asian 7.4%]
Figure 2 Ethnic distribution of study subjects

In terms of HIV-specific characteristics, the mean duration of HIV infection was 11.7 years with a range from 1.8 to 24.3 years. The mean CD4 count for these women was 484 cells/mm³, while the mean viral load was 13,144 copies/mL. While 46 women were on combination antiretroviral therapy at the time of their clinical visit, 34 women had a suppressed viral load. Furthermore, while 39 women had a CD4 count >350 cells/mm³, 14 women had a CD4 count <350 cells/mm³ (Table 4).

Table 3 Demographic and clinical characteristics
Categorical variables are reported as either n or % (n) and continuous variables are reported as mean ± 95% CI (range).

Characteristic                                                            N = 54
Mean BMI (kg/m²)                                                          27.2 ± 1.7 (17.8 – 44.2)
Country of Birth
  Angola                                                                  1
  Burma                                                                   1
  Burundi                                                                 2
  Cameroon                                                                1
  Canada                                                                  36
  England                                                                 1
  Ethiopia                                                                3
  Hong Kong                                                               1
  India                                                                   4
  Philippines                                                             1
  Sierra Leone                                                            1
  Swaziland                                                               1
  Thailand                                                                1
Nugent Score
  Normal (0-3)                                                            59% (32)
  Intermediate (4-6)                                                      11% (6)
  Bacterial Vaginosis (7-10)                                              30% (16)
Abnormal Vaginal Discharge                                                12
  Symptomatic                                                             7
  Clinical Report                                                         8
Positive for Chlamydia trachomatis (n=50)                                 1
Positive for Neisseria gonorrhoeae (n=50)                                 0
Douche Product Usage History                                              7
Douche Product Usage within 48 Hours of Visit                             0
Mean Number of Pregnancies in Lifetime                                    3.3 ± 0.6 (0 – 9)
Antimicrobial Use                                                         5
Mean Number of Sexual Partners in Past Year (Data Available for n=52)     1.1 ± 0.2 (0 – 4)

Table 4 Clinical HIV-specific characteristics
Categorical variables are reported as % (n) and continuous variables are reported as mean ± 95% CI (range).

Mean Duration of HIV Infection (Years)     11.7 ± 1.4 (1.8 – 24.3)
Mean CD4 Count (cells/mm³)                 484 ± 57 (90 – 930)
CD4 Count >350 cells/mm³ (n=53)            74% (39)
CD4 Count <350 cells/mm³ (n=53)            26% (14)
Mean Viral Load (copies/mL)                13,144 ± 13,650 (<40 – 355,245)
Viral Load (<40 copies/mL)                 63% (34)
cART                                       85% (46)

In terms of antimicrobial usage excluding antiretroviral medications during the study visits, a total of 5 women were on antibiotic, antifungal or antiviral therapy. Three of these women were taking Septra (trimethoprim and sulfamethoxazole) for prophylactic purposes, 1 woman was taking famciclovir for recurrent genital herpes, and 1 woman was taking fluconazole for a skin rash. Nugent-scored vaginal Gram stains indicated that 41% (22/54) of the study subjects had abnormal results, of which 30% (16/54) had bacterial vaginosis and 11% (6/54) had intermediate scores. In terms of abnormal vaginal discharge, 2 distinct categories were utilized. Women who self-reported symptoms of abnormal vaginal discharge within 48 hours of the clinical visit were classified as “Symptomatic,” while women who were assessed by a Nurse Practitioner to have abnormal vaginal discharge were classified as “Abnormal Vaginal Discharge” by “Clinical Report.” Because some women fell into both classifications, the total number of unique women across the two groupings was reported under “Abnormal Vaginal Discharge” (Table 3).

After performing a gene sequencing assembly at a sampling depth of 1,081 reads, 157 unique isotigs were generated. After filtering out any isotigs with a cpn60 ID match of less than 55% nucleotide identity, a total of 129 isotigs remained. After an additional step of chimera checking and filtering, 3 isotigs were removed and a total of 126 isotigs were used for data analysis.
These 126 isotigs were consolidated into groupings of identical bacterial species based on a minimum 90% cpn60 nucleotide identity match, with the remaining isotigs retaining their original isotig ID label followed by a top hit cpn60 ID match label. This resulted in a total of 64 unique bacterial taxa, of which there were 40 consolidated bacterial groupings and 24 individual isotigs (Table 5). For this data analysis, Gardnerella vaginalis was separated into 4 subtypes, A, B, C, and D, respectively corresponding with the strains Gardnerella vaginalis 409-5, Actinobacteria sp. N153, Gardnerella vaginalis ATCC 14018, and Gardnerella vaginalis 101. (Paramel Jayaprakash, Schellenberg et al. 2012)

In order to estimate the percentage of bacterial species in the natural vaginal environment that were represented by the collected vaginal samples, Good’s coverage estimates were calculated for each sample (Table 6). Furthermore, rarefaction curves were generated in order to determine whether the sampling of each vaginal microbiome was thorough and well represented (Figure 3). All 54 curves approached a plateau at higher numbers of sequence reads, which demonstrated that the sampling of the vaginal environment was thorough. This is because the observed number of unique isotigs in an environment increases with an increasing number of sequence reads only up to a maximum; beyond that point, additional sequence reads do not yield any additional unique isotigs.

After performing a hierarchical clustering analysis for the 64 unique bacterial taxa detected amongst this cohort of 54 HIV-positive women, 10 distinct clusters of bacterial taxa were determined (Figure 4):

Cluster 1 (n=14, Red) was dominated by Gardnerella vaginalis Group A and Gardnerella vaginalis Group B.
Cluster 2 (n=9, Light Blue) contained a mixture of isotig 00018 (Megasphaera sp. UPII), Prevotella timonensis, isotig 00007 (Clostridium genomosp.), Gardnerella vaginalis Group A, and Gardnerella vaginalis Group B.
Cluster 3 (n=3, Pink) was dominated by Gardnerella vaginalis Group C.
Cluster 4 (n=1, Grey) was dominated by Dialister micraerophilus and Prevotella oris.
Cluster 5 (n=1, Black) was dominated by Lactobacillus jensenii.
Cluster 6 (n=6, Dark Green) was dominated by Lactobacillus iners.
Cluster 7 (n=12, Yellow) was dominated by Lactobacillus crispatus.
Cluster 8 (n=4, Orange) was dominated by Bifidobacterium breve.
Cluster 9 (n=3, Light Green) was dominated by Lactobacillus gasseri.
Cluster 10 (n=1, Dark Blue) was dominated by isotig 00030 (Gardnerella vaginalis Group C).

Median Nugent’s scores and median CD4 counts were determined for each of these clusters. Clusters 1 through 10 respectively corresponded with median Nugent’s scores of 4.5, 9, 7, 1, 6, 1, 0.5, 0, 1, and 4, and with median CD4 counts of 540, 210, 370, 620, 240, 580, 545, 540, 510, and 420.

Table 5 Nearest neighbour bacterial species identified at sampling depth of 1081 sequence reads by cpnDB

Nearest Neighbour Species                            Number of Isotigs    cpnDB Reference ID Match (% nucleotide identity)
Gardnerella vaginalis B                              8                    94.6 – 99
Atopobium parvulum                                   1                    91.8
Atopobium vaginae                                    2                    96.2 – 97.1
Bifidobacterium breve                                2                    98.5 – 99.3
Bifidobacterium dentium                              1                    99.5
Bifidobacterium pullorum                             1                    99.3
Campylobacter rectus                                 1                    98.2
Dialister micraerophilus                             1                    99.6
Eubacterium dolichum                                 1                    99.3
Gardnerella vaginalis D                              6                    96.8 – 98.7
Gardnerella vaginalis A                              14                   91.3 – 99.8
Gardnerella vaginalis C                              10                   94.9 – 100
isotig00007 Clostridium genomosp. BVAB3 UPII9-5      1                    69.8
isotig00017 Dialister micraerophilus                 1                    79.8
isotig00018 Megasphaera sp. UPII                     1                    84.2
Nearest Neighbour Species                            Number of Isotigs    cpnDB Reference ID Match (% nucleotide identity)
isotig00030 Gardnerella vaginalis C                  1                    78.7
isotig00064 Campylobacter lari                       1                    69.1
isotig00072 Gardnerella vaginalis D                  1                    88.7
isotig00089 Prevotella loescheii                     1                    87.8
isotig00093 Mobiluncus curtisii                      1                    89.5
isotig00094 Prevotella bergensis                     1                    81.2
isotig00096 Prevotella sp. oral                      1                    79.2
isotig00101 Nocardia cyriacigeorgica                 1                    74
isotig00102 Chlorobium phaeobacteroides              1                    62.5
isotig00103 Lactobacillus reuteri                    1                    88.2
isotig00105 Olsenella uli                            1                    83.4
isotig00106 Corynebacterium xerosis                  1                    81.1
isotig00110 Prevotella buccalis                      1                    87.8
isotig00118 Atopobium parvulum                       1                    82.9
isotig00119 Dethiobacter alkaliphilus                1                    69.1
isotig00121 Tepidanaerobacter sp. Re1                1                    69.5
isotig00124 Prevotella bergensis                     1                    82.3
isotig00128 Aerococcus urinae                        1                    80.7
isotig00136 Gardnerella vaginalis D                  1                    89.5
isotig00140 Eubacterium ventriosum                   1                    81.6
isotig00149 Atopobium vaginae                        1                    83.2
Lactobacillus crispatus                              8                    99.3 – 100
Lactobacillus gasseri                                2                    97.5 – 100
Lactobacillus iners                                  6                    97.5 – 100
Lactobacillus jensenii                               4                    95.1 – 100
Lactobacillus johnsonii                              1                    98.4
Lactobacillus ultunensis                             1                    100
Megasphaera sp. UPII                                 1                    99.6
Mobiluncus curtisii                                  2                    95.5 – 99.6
Mobiluncus mulieris                                  1                    99.8
Pediococcus dextrinicus                              1                    98
Peptoniphilus harei                                  1                    94.9
Porphyromonas uenonis                                3                    98.2 – 99.5
Prevotella amnii                                     1                    100
Prevotella bivia                                     3                    98.3 – 99.1
Prevotella buccae                                    1                    99.3
Prevotella buccalis                                  1                    95.5
Prevotella corporis                                  1                    98.4
Prevotella denticola                                 1                    100
Prevotella disiens                                   1                    99.8
Prevotella melaninogenica                            1                    97.5
Prevotella oris                                      1                    98.2
Prevotella timonensis                                6                    94.7 – 99.1
Selenomonas noxia                                    1                    98.7
Shigella dysenteriae                                 1                    99.1
Staphylococcus lugdunensis                           1                    99.3
Streptococcus mitis                                  1                    93.7
Streptococcus parasanguinis                          1                    96.3
Streptococcus salivarius                             1                    95.7

Table 6 Good’s coverage calculation and Shannon diversity index per sample

Sample ID    Good’s Coverage    Shannon Diversity Index
1            0.996602           1.761832
2            0.993583           0.256213
3            0.997826           0.382858
4            0.997628           1.144529
5            1                  0.284404
6            1                  0.044721
7            1                  0.241181
8            0.997683           0.67601
9            0.995146           2.45781
10           1                  0.22396
11           0.997893           1.745658
12           0.998895           1.324783
13           0.995957           0.858609
14           0.996445           0.265968
15           0.994518           1.922517
16           0.995638           1.655812
17           0.99777            1.531555
18           0.998879           2.262679
19           0.998758           1.119082
20           0.998788           0.040376
21           0.99547            2.146595
22           0.995157           1.319396
23           0.994213           0.712874
24           0.998826           2.082174
25           0.994076           0.093604
26           0.998801           1.580051
27           0.998689           0.036975
28           0.998643           1.245876
29           0.997392           1.761265
30           0.995429           1.727819
31           0.998798           0.619497
32           0.995575           0.345059
33           0.997564           0.612243
34           0.998711           0.77751
35           0.996462           0.78152
36           0.99866            0.820674
37           0.992038           1.908739
38           0.998737           1.14601
39           0.998634           0.059658
40           1                  2.388532
41           0.998605           0.210636
42           0.998645           0.575643
43           1                  1.573615
44           0.997509           2.159456
45           0.998775           0.046834
46           0.998917           1.799685
47           0.991443           1.250193
48           0.997567           1.246441
49           0.991667           0.766262
50           0.996499           1.931227
51           0.995152           2.161621
52           0.998661           0.225151
53           0.998826           0.654889
54           0.997222           0.312889

Figure 3 Rarefaction curves of 54 HIV-positive vaginal samples

Table 7 Significant bacterial species in groups of normal versus abnormal vaginal discharge by Wilcoxon Rank Sum Test

Bacterial Species                              p-value      Significant Discharge Status
Dialister micraerophilus                       0.0020       Abnormal Discharge
Gardnerella vaginalis B                        0.040        Abnormal Discharge
Gardnerella vaginalis D                        0.026        Abnormal Discharge
Isotig 00064 – Campylobacter lari              0.00029      Abnormal Discharge
Isotig 00007 – Clostridium genomosp.           0.00048      Abnormal Discharge
Isotig 00017 – Dialister micraerophilus        0.000013     Abnormal Discharge
Lactobacillus crispatus                        0.00061      Normal Discharge
Porphyromonas uenonis                          0.000018     Abnormal Discharge
Prevotella amnii                               0.000070     Abnormal Discharge
Prevotella buccalis                            0.017        Abnormal Discharge
Prevotella timonensis                          0.020        Abnormal Discharge
[Figure 4 image: heat map. Axes: Unique Bacterial Taxa, Subject ID, Community Groups; colour key: Relative Intensity Scale]

Figure 4 Heat Map of 64 Bacterial Taxa Clustered into 10 Groups Based on Common Taxa Abundance
Along the y-axis, Cluster 1 is red, Cluster 2 is light blue, Cluster 3 is pink, Cluster 4 is grey, Cluster 5 is black, Cluster 6 is dark green, Cluster 7 is yellow, Cluster 8 is orange, Cluster 9 is light green, and Cluster 10 is dark blue. Species abundance is indicated by colour range from black (low abundance) to red (high abundance) in the central heat map grid.

Table 8 Correlations of Nugent scores with vaginal microbial clusters

                                                                           Nugent Score (No. of Samples)
Dominant Taxa Clusters                                  No. of Samples     0 - 3     4 - 6     7 - 10
1. Gardnerella vaginalis A / Gardnerella vaginalis B    14                 6         3         5
2. Mixed Anaerobes                                      9                  0         1         8
3. Gardnerella vaginalis C                              3                  0         0         3
4. Dialister micraerophilus / Prevotella oris           1                  1         0         0
5. Lactobacillus jensenii                               1                  0         1         0
6. Lactobacillus iners                                  6                  6         0         0
7. Lactobacillus crispatus                              12                 12        0         0
8. Bifidobacterium breve                                4                  4         0         0
9. Lactobacillus gasseri                                3                  3         0         0
10. Isotig 00030 (Gardnerella vaginalis C)              1                  0         1         0

On comparison of bacterial species abundance between groups of women with normal versus abnormal vaginal discharge, 11 species were found to be significantly different between these 2 groups (Table 7). While 10 of these species were more abundant in women with abnormal vaginal discharge, Lactobacillus crispatus was the only bacterial species found to be more abundant in women with normal vaginal discharge (Figure 5). The Shannon Diversity Indices of women with normal versus abnormal vaginal discharge were also compared with a type 3 two-tailed t-test (Table 6). The Shannon Diversity Indices were higher for women with abnormal vaginal discharge, and this remained the case with the application of the Bonferroni correction for multiple analyses (p<0.000008).

In terms of HIV-specific characteristics, there were a total of 14 women with a CD4 count below 350 cells/mm³ and 39 women with a CD4 count above 350 cells/mm³. Of the women with a CD4 count below 350 cells/mm³, 4 had abnormal vaginal discharge while 10 had normal vaginal discharge. These findings were not suggestive of any significant trends (p>0.05). Also, there were a total of 33 women with a viral load below 40 copies/mL and a total of 20 women with a viral load above 40 copies/mL.
Of the 12 women in this analysis categorized as having abnormal vaginal discharge, 7 had a viral load above 40 copies/mL while 5 had a viral load below 40 copies/mL. These findings were also not suggestive of any significant trends (p>0.05).

[Figure 5 image: x-axis: Bacterial Species; y-axis: Bacterial Species Abundance (Number of Species); legend: abnormal, normal]

Figure 5 Bacterial species differences between women with normal and abnormal vaginal discharge by Wilcoxon Signed-Rank Test, as presented in Table 7

No significant findings were found when comparing ethnic groups with clinical, reproductive or HIV-specific characteristics. On comparison of specific bacterial species within ethnic groups, significant differences were found in which species of the Lactobacillus genus were the most prevalent among different ethnic groups. Lactobacillus gasseri was the only species of Lactobacillus that was more prevalent in women of either Asian or Mixed ethnic descent when compared to women of White, Black, Aboriginal, or South-Asian descent (Figure 6).

[Figure 6 image: panel title: Lactobacillus gasseri; x-axis: Ethnicity (white, black, aboriginal, asian, south-asian, mixed); y-axis: Number of Subjects; legend: dominant, not dominant]

Figure 6 Lactobacillus gasseri prevalence in HIV-positive women by ethnicity (Fisher Test, p=0.006).

In order to obtain an accurate understanding of bacterial quantity among the 54 vaginal swab samples, a 16S rRNA qPCR was run for all of the samples.
A significant relationship was found between increasing Nugent scores and increasing bacterial quantity, as tested by Spearman’s rank correlation coefficient (Figure 7). Furthermore, Shannon Diversity Indices were higher for women with an abnormal Nugent score compared with women with a normal Nugent score, upon comparison with a type 3 two-tailed t-test (p<0.0000004). This finding remained true with the application of the Bonferroni correction for multiple analyses.

[Figure 7 image: panel title: Total Bacterial Quantity by Nugent Score; x-axis: Log Bacterial Quantity; y-axis: Nugent Score]

Figure 7 Relationship of increasing Nugent score with total 16S rRNA bacterial quantity in HIV-positive women (S=0.4629368, p=0.0004).

Bacterial species abundance was also correlated with HIV-specific characteristics through the use of the Wilcoxon signed-rank test and separately with the Linear Discriminant Analysis Effect Size (LEfSe) algorithm. For these specific analyses, a total of 53 women were included due to the lack of adequate CD4 count or viral load lab results for one subject. On comparison of bacterial species between women with a CD4 count below 350 cells/mm³ (n=14) and women with a CD4 count above 350 cells/mm³ (n=39), 2 bacterial species were significantly more abundant in women with a CD4 count less than 350 cells/mm³ (Table 9). Although both of these bacterial species were more abundant in women with a CD4 count less than 350 cells/mm³, they branched from 2 distinct phyla (Figures 8, 9).
On comparison of bacterial species between women with a viral load below 40 copies/mL (n=33) and women with a viral load above 40 copies/mL (n=20), it was found that there were 5 bacterial species significantly more abundant in women with a viral load greater than 40 copies/mL and 2 bacterial species significantly more abundant in women with a viral load less than 40 copies/mL (Table 10). Beyond significant differences at the bacterial species level, significant differences were also observed at the phylum, class, order, family, and genus levels (Figures 10, 11).

Table 9 Significant bacterial species in women with CD4 < 350 cells/mm³ versus CD4 > 350 cells/mm³ by Wilcoxon Signed-Rank Test.

Bacterial Species            p-value    Significant CD4 Count Category (cells/mm³)
Gardnerella vaginalis B      0.044      <350
Lactobacillus iners          0.047      <350

Figure 8 Linear Discriminant Analysis Values for Lactobacillus iners and Gardnerella vaginalis B in women with CD4 < 350 cells/mm³

Figure 9 Cladogram of bacterial species significantly more abundant in women with CD4<350 cells/mm³ versus women with CD4>350 cells/mm³
Yellow dots represent taxonomic ranks beginning with kingdom near the centre and species at the periphery of the cladogram. Red dots represent significant bacteria.

Table 10 Significant bacterial species in women with viral load < 40 copies/mL versus viral load > 40 copies/mL by Wilcoxon Signed-Rank Test.

Bacterial Species                              p-value    Significant Viral Load Category (copies/mL)
Atopobium vaginae                              0.0049     >40
Gardnerella vaginalis D                        0.023      >40
Isotig 00007 – Clostridium genomosp.           0.0039     >40
Isotig 00017 – Dialister micraerophilus        0.012      >40
Isotig 00018 – Megasphaera sp. UPII            0.028      >40
Isotig 00064 – Campylobacter lari              0.0037     >40
Lactobacillus crispatus                        0.011      <40
Lactobacillus gasseri                          0.0026     <40
Prevotella amnii                               0.011      >40

Figure 10 Linear Discriminant Analysis Values for significant bacteria by viral load groupings of greater than 40 copies/mL versus less than 40 copies/mL

Figure 11 Cladogram of bacterial species significantly more abundant in women with viral load>40 copies/mL versus women with viral load<40 copies/mL.
Yellow dots represent taxonomic ranks beginning with kingdom near the centre and species at the periphery of the cladogram. Red and green dots represent statistically significant taxonomic ranks.

4 DISCUSSION

4.1 Overall Findings

In this cohort of reproductive-aged HIV-positive women, a total of 10 distinct vaginal community clusters were generated, with no core bacterial species common to all women. Community clusters that correlated with low Nugent scores and relatively positive reproductive health were mostly dominated by different species of Lactobacillus or by Bifidobacterium breve, while community clusters with high Nugent scores and relatively poor reproductive health were mostly dominated by different strains of Gardnerella vaginalis or by a mixture of anaerobic bacteria. A comparison of vaginal bacterial profiles between groups of women with normal and abnormal vaginal discharge provided further insight into the specific bacterial species and strains associated with positive and negative reproductive health outcomes. Of note, out of the four strains of Gardnerella vaginalis, subtypes B and D were the only strains found to be significantly related to abnormal vaginal discharge, while Lactobacillus crispatus was the only species of Lactobacillus found to be significantly related to normal vaginal discharge.
These findings provide support for the increased pathogenic and protective potential of these respective bacterial species. Bacterial species-specific relationships were also determined for CD4 count groupings and viral load groupings, indicative of potential immune status effects on vaginal bacterial profiles. Of note, Gardnerella vaginalis B was found to be significantly correlated with CD4 counts below 350 cells/mm³, while Gardnerella vaginalis D, Atopobium vaginae, and Prevotella amnii were found to be significantly correlated with viral loads above 40 copies/mL. Additionally, Lactobacillus crispatus and Lactobacillus gasseri were the only species of Lactobacillus found to be significantly correlated with viral loads below 40 copies/mL. To interpret this study’s overall findings more thoroughly, it is necessary to examine the details of these findings and to compare them with previously established outcomes.

4.2 Demographics

A total of 54 HIV-positive women between the ages of 22.3 and 48.8 years were enrolled into this study. Enrolment was restricted to women of reproductive age because women in this age range, at least among previously studied healthy women, have a consistent reproductive physiology and a structurally and compositionally stable vaginal microbiota. (Yamamoto, Zhou et al. 2009) Of note, the physiological composition of the female reproductive tract has been known to greatly differ prior to the onset of puberty as well as following menopause in comparison with adolescence and adulthood. (Hammerschlag, Alpert et al. 1978; Larsen, Goplerud et al. 1982) These profound differences can mostly be attributed to hormonal changes that lead to an increase in estrogen levels following menarche and a decrease in these levels following menopause. (Yamamoto, Zhou et al.
2009) As an increase in estrogen levels has been associated with an increase in vaginal glycogen levels, it is understood that these hormonal alterations may cause the vaginal environment to become more favourable for the growth of specific bacterial species. (Farage and Maibach 2006; Yamamoto, Zhou et al. 2009) For instance, lactobacilli are a genus of bacteria that are known to produce lactic acid through the metabolism of glycogen. (Boskey, Cone et al. 2001) For this reason, in women with consistent levels of estrogen production, it is common to see an abundance of lactobacilli in the vaginal environment. (Yamamoto, Zhou et al. 2009) In addition to the understanding that the vaginal microbiota of reproductive-aged healthy women is relatively consistent and unique, the major reason for examining women of this age range was concern about reproductive health and negative reproductive health outcomes, features that are exclusively applicable to reproductive-aged women.

4.3 Bacterial Vaginosis Outcomes

In terms of the prevalence of bacterial vaginosis as determined by Nugent’s scoring in this cohort of women, it was found that 41% of women had abnormal vaginal flora while 30% of women had bacterial vaginosis. These prevalence rates are highly comparable to those previously reported for HIV-positive women. (Cu-Uvin, Hogan et al. 1999; Nwadioha, Egah et al. 2011) In contrast to the diagnosis of bacterial vaginosis by Nugent’s scoring, information about abnormal vaginal discharge in this study was collected either through a patient’s self-report of abnormal vaginal discharge symptoms or through assessment by a Nurse Practitioner performing a pelvic examination. As bacterial vaginosis is a condition that has been found to be asymptomatic in approximately 50% of women diagnosed at the microbial level, it was important to differentiate women who presented with abnormal vaginal discharge from those who did not.
(Amsel, Totten et al. 1983) This increased level of differentiation allowed for a deeper understanding of the potential distinctions between the microbiota of these 2 groups of women. Prior to looking into these comparisons, it is essential to understand and justify the procedures used for the classification of vaginal bacteria in this study.

4.4 Bacterial Taxa Classification

As outlined in the Data Analysis section (Section 2.4), data assembly was completed at a sampling depth of 1081 reads for each sample with a minimum 55% nucleotide identity against the Chaperonin Database for data inclusion. Furthermore, for the assignment of bacterial taxa labels at the genus and species level, a 90% Chaperonin Database nucleotide identity match cut off was applied. In order to ensure that 1081 reads provided an appropriate depth of bacterial sampling, the Good’s coverage equation was applied to all 54 sets of bacterial profiles. (Good 1953) As a minimum Good’s coverage of 99% was achieved for all collected samples (Table 6), this provided reasonable evidence supporting a thorough sampling of the vaginal environment. Furthermore, the generation and visualization of rarefaction curves (Figure 5) demonstrated that an adequate number of sequence reads had been assessed. As indicated by the plateauing of each sample’s rarefaction curve, it was apparent that an increase in sampling or sequence reads would not lead to the detection of a considerably greater variety of unique bacterial taxa. In terms of utilizing a 90% Chaperonin Database identity match for a bacterial classification and labelling cut off, the reasoning was more complex.
As mentioned in the Data Analysis section (Section 2.4), the accurate assignment of bacterial genus and species classification labels based on the Chaperonin Database is highly variable across the bacterial taxa range, with accurate assignments of bacterial genera potentially ranging across cut off percentages from the 70% to the 90% range. (Hill JE 2004) More importantly, it is essential to understand the limitations of classifying bacterial species with the sole usage of gene sequencing methodologies. As bacterial classification with gene sequencing fully relies on the specific database available for bacterial labelling, the accuracy of the database utilized and the inclusiveness of bacteria within this database completely dictate the limits of bacterial characterization. Furthermore, as the taxonomic definition of a bacterial species remains highly dependent on the characterization of specific phenotypic properties, utilizing molecular methods for the classification of bacterial species does not guarantee that a bacterial taxon classified at the species level is truly representative of the specifically classified species. (Stackebrandt, Frederiksen et al. 2002) More precisely, this method of classification provides the most confident label that can be assigned with the use of molecular bacterial classification methods. With this in mind, a 90% Chaperonin Database identity match cut off was chosen as it allowed for the confident differentiation of the major bacterial taxa identified in this cohort down to a species level. Although a number of the bacterial taxa identified between the cut offs of 55% and 90% match identity could still be identified down to a bacterial species label, the 90% cut off was retained to ensure increased confidence in the labelling of overall bacterial taxa.
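The two checks discussed in this section, sampling coverage at a fixed read depth and identity-based labelling, can be sketched as follows. The Good’s coverage estimate shown is the standard form, one minus the proportion of reads belonging to taxa observed exactly once. The function names, the data layout, and the wording of the label given to reads falling between the 55% and 90% cut offs are hypothetical and are not taken from the pipeline used in this study.

```python
from collections import Counter

INCLUSION_CUTOFF = 55.0  # minimum % nucleotide identity for a read to be kept
LABEL_CUTOFF = 90.0      # minimum % identity for a genus/species label


def goods_coverage(taxon_per_read):
    """Good's coverage estimate: 1 - (taxa seen exactly once / total reads)."""
    counts = Counter(taxon_per_read)
    total_reads = sum(counts.values())
    if total_reads == 0:
        raise ValueError("at least one read is required")
    singletons = sum(1 for c in counts.values() if c == 1)
    return 1 - singletons / total_reads


def classify_read(percent_identity, best_hit_name):
    """Apply the two identity cut offs to one read's best database hit.

    Returns None when the read is excluded, the hit name when the match
    is confident, and a hedged 'nearest' label otherwise (hypothetical
    wording chosen for this sketch).
    """
    if percent_identity < INCLUSION_CUTOFF:
        return None
    if percent_identity >= LABEL_CUTOFF:
        return best_hit_name
    return f"unlabelled (nearest: {best_hit_name})"
```

For a sample rarefied to 1081 reads in which a single taxon is represented by one read, `goods_coverage` returns 1 - 1/1081, or roughly 99.9%, which is the kind of value that would satisfy the 99% minimum reported above.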
4.5 Reproductive Health Findings

The 10 distinct bacterial clusters identified in this Canadian cohort of 54 HIV-positive women had clear associations with normal and abnormal vaginal flora (Figure 4). While clusters 5, 6, 7, and 9 were dominated by 4 different species of Lactobacillus, cluster 8 was dominated by Bifidobacterium breve. These 5 clusters corresponded with normal healthy vaginal flora as characterized by previous culturing and sequencing studies. (Eschenbach, Davick et al. 1989; Antonio, Hawes et al. 1999; Korshunov, Gudieva et al. 1999; Ravel, Gajer et al. 2011) Lactobacillus jensenii, Lactobacillus iners, Lactobacillus crispatus, Lactobacillus gasseri, and Bifidobacterium breve have all been detected in the healthy normal vaginal flora of women. (Korshunov, Gudieva et al. 1999; Ravel, Gajer et al. 2011) While Lactobacillus organisms are capable of producing lactic acid and other bactericidal agents, Bifidobacterium breve has been found to inhibit a number of potentially pathogenic organisms including Staphylococcus aureus and Enterococcus faecalis. (Korshunov, Gudieva et al. 1999; Boskey, Cone et al. 2001) As anticipated, vaginal communities corresponding to clusters 6, 7, 8, and 9 all had Nugent scores corresponding to a normal vaginal microbiota. (Table 8) Although the single vaginal community corresponding with cluster 5 (Lactobacillus jensenii) received an intermediate Nugent score, this may be attributed to the detection of low proportions of anaerobic bacteria in this sample as well as the limited sample size of one for this particular cluster.

The 5 remaining clusters consisted of vaginal communities dominated by specific strains of Gardnerella vaginalis or a mixture of anaerobic bacteria, including Atopobium vaginae, Megasphaera, Prevotella, and Clostridium. These are all vaginal bacteria that have been previously associated with an abnormal vaginal microbiota. (Thies, Konig et al. 2007; Zhou, Hansmann et al.
2010; Ravel, Gajer et al. 2011) While cluster 1 was mostly dominated by Gardnerella vaginalis A and Gardnerella vaginalis B, cluster 2 mostly contained a mixture of Prevotella timonensis, Atopobium vaginae, and 2 potentially novel bacterial organisms most closely associated with Clostridium genomosp. BVAB3 UPII9-5 and Megasphaera sp. UPII. Interestingly, the vaginal communities associated with cluster 2 consistently had high Nugent scores indicative of bacterial vaginosis, while the vaginal communities associated with cluster 1 had Nugent scores more evenly distributed among the different Nugent score categories. Cluster 3, corresponding to vaginal communities dominated by Gardnerella vaginalis C, received Nugent scores consistently indicative of bacterial vaginosis. Although clusters 4 and 10 each had a single sample, the vaginal community corresponding to cluster 4 mostly consisted of Dialister micraerophilus and Prevotella oris while the vaginal community corresponding to cluster 10 consisted of potentially novel bacterial species most closely associated with Gardnerella vaginalis C.

It is important to note here that 50 of the women in this study were clinically tested by nucleic acid amplification testing (NAAT) for the presence of Chlamydia trachomatis and Neisseria gonorrhoeae. While no one was found to be positive for Neisseria gonorrhoeae, Subject 01-050 was positive for Chlamydia trachomatis. As this woman was categorized into cluster 2 along with 8 other women who were not positive for Chlamydia trachomatis, it does not appear that the presence of Chlamydia trachomatis in the endocervix greatly influenced the overall bacterial composition of this vaginal community. 12 of the women in this study were also tested for the presence of Trichomonas vaginalis, with Subject 01-051 testing positive.
As this woman was found to cluster independently from other women into cluster 4, this suggests that Trichomonas vaginalis may have a specific effect on the overall bacterial composition of the vaginal environment.

Although classically defined normal vaginal microflora in this study mostly corresponded with lower Nugent scores and traditionally abnormal vaginal microflora mostly corresponded with higher Nugent scores, these findings were generally expected and did not provide a particularly new outlook on bacterial relationships with vaginal health. On the other hand, an exploration of bacterial profiles in relation to women with normal and abnormal vaginal discharge did succeed in providing a deeper understanding of the correlations between bacterial organisms and vaginal health outcomes. When compared with the generated vaginal community clusters, it was found that all 12 women categorized as having abnormal vaginal discharge had vaginal microflora corresponding with clusters 1, 2, 3, and 4. This demonstrated that all of the women with abnormal vaginal discharge had vaginal communities associated with abnormal vaginal flora. Although no significant trends of abnormal vaginal discharge unique to these vaginal community clusters were established, significant correlations were established between abnormal vaginal discharge and specific bacterial species. It was determined that Dialister micraerophilus, Gardnerella vaginalis B, Gardnerella vaginalis D, Porphyromonas uenonis, Prevotella amnii, Prevotella buccalis, and Prevotella timonensis all significantly correlated with women who presented with abnormal vaginal discharge. (p<0.04 for all) Three additional bacterial taxa were also found to be significantly more abundant in women with abnormal vaginal discharge than in women with normal vaginal discharge. (p<0.0004) These 3 potentially novel bacterial organisms were most closely associated with Campylobacter lari, Clostridium genomosp.
BVAB3 UPII9-5, and Dialister micraerophilus. These species-specific correlations of bacterial taxa with abnormal vaginal discharge suggest that these bacteria may be more strongly correlated with negative reproductive health conditions in comparison with other bacteria. In terms of future diagnostic techniques, these specific bacterial strains may serve as key indicators for the progression of negative reproductive health conditions.

On the other hand, only a single bacterial species was found to be significantly more prevalent in women with normal vaginal discharge in comparison with women who presented with abnormal vaginal discharge (p<0.0006). This bacterial species was Lactobacillus crispatus. As no other species of Lactobacillus were significantly correlated with a normal vaginal discharge, this may suggest that Lactobacillus crispatus has a greater protective effect in comparison with other species of Lactobacillus. These findings are in agreement with previous studies that have found Lactobacillus crispatus to be a more frequent hydrogen peroxide producer than other Lactobacillus species. (Antonio, Hawes et al. 1999) Hydrogen peroxide production, through the oxidation of hydrocarbons, has been found to have a role in the inhibition of bacterial growth for specific bacterial organisms that lack protective mechanisms such as catalase or peroxidase. (Collins and Aramaki 1980; Strus, Brzychczy-Wloch et al. 2006) This emphasizes the additional microbicidal activity that may be offered by Lactobacillus crispatus alongside its acidic components. These features along with the relationship of this organism with normal vaginal discharge provide support for the potential of Lactobacillus crispatus to be a promising species for probiotic therapy.
With respect to possible effects of antimicrobial use in addition to antiretroviral medications, a total of 5 women were on antibiotic, antiviral, or antifungal therapy at the time of their study visits. These women included subjects 01-003, 01-011, 01-012, 01-019, and 01-021. As HIV-positive individuals are often at a greater risk for specific opportunistic infections, many individuals regularly use antimicrobials for chronic conditions or for prophylactic purposes. Although Septra has been found to alter the vaginal microflora by reducing the prevalence of bacterial species such as Escherichia coli among other gram-positive and gram-negative organisms, 2 of the women taking Septra in this study had vaginal communities consistent with cluster 2 while 1 woman taking Septra had a vaginal community consistent with cluster 1. (Schiffman 1975; Stamey, Condy et al. 1977) The woman in this study taking fluconazole was found to have a vaginal community also consistent with cluster 1, while the woman taking famciclovir had a vaginal community consistent with cluster 3. Although antimicrobial agents may temporarily alter the bacterial composition of the vaginal environment, it is important to note that 4 of the 5 women taking antimicrobial agents in this study were not temporarily utilizing these agents for an acute condition but rather for long-term health maintenance. This suggests that the vaginal microbiota of these particular women could be regarded as their stabilized vaginal flora. Overall, the vaginal communities of these 5 women were found to be associated with bacterial organisms from abnormal vaginal flora and 2 of these 5 women were found to have abnormal vaginal discharge. The 2 women with abnormal vaginal discharge were both taking Septra. As fluconazole is an anti-fungal agent and famciclovir is an anti-viral agent, Septra was the sole antimicrobial agent expected to have a significant effect on bacterial composition.
Although the vaginal communities of all 5 of these women corresponded with vaginal community clusters 1, 2 and 3, no significant trends were established in regards to antimicrobial usage.

Although previous studies have provided evidence for an ethnic component to the vaginal microbiota, no trends were determined for this study on comparison with overall bacterial composition. (Zhou, Hansmann et al. 2010; Ravel, Gajer et al. 2011) In contrast, when performing an analysis looking into the species composition of individual bacterial genera, Lactobacillus gasseri was found to be significantly the most prevalent species of Lactobacillus only in women of either Asian or Mixed ethnic descent on comparison with women of White, Black, Aboriginal, or South-Asian descent. (Figure 6) Although the underlying reasons for ethnic differences in the vaginal microbiota are unclear, it has been speculated that differences in innate and adaptive immunity and vaginal epithelial cell surface binding may play a role. (Dramsi, Trieu-Cuot et al. 2005; Velez, De Keersmaecker et al. 2007) As different bacterial species, such as the unique strains of lactobacilli, express distinct adhesion molecules for the binding of organisms to epithelial cells and mucus, physiologically altered vaginal environments may affect the composition of dominant bacterial organisms. (Velez, De Keersmaecker et al. 2007) These differences in the ability to adhere to epithelial and mucus surfaces may further be affected by the nutritional requirements of distinct bacterial species, which may ultimately be influenced and directed by the diet of an individual. (Falsen, Pascual et al. 1999; Neggers, Nansel et al. 2007) Although genetic factors influencing vaginal immunity and physiology may play a role in the ethnic differentiation of vaginal microbiota, there is greater evidence in favour of local environmental factors influenced by diet and other lifestyle practices. (Neggers, Nansel et al.
2007; Ravel, Gajer et al. 2011) Overall, it was apparent that Lactobacillus gasseri may have a greater prevalence in women of Asian and Mixed ethnic descent, with a potential reason for this difference being distinct vaginal nutrient availability. However, as detailed dietary records were not obtained in this study, we were unable to confirm or refute this hypothesis.

Another analysis that was performed to better understand the complexity of the unique vaginal communities in these women was a quantitative real-time polymerase chain reaction (qPCR). (Fredricks, Fiedler et al. 2009) This qPCR was utilized to determine the total number of bacterial 16S rRNA genes detected in each sample, providing an approximate overview of the bacterial quantity of each sample. For this analysis, the specific 16S assay used was the SRV3-1/3 primer set, targeted at the V3 region of the 16S rRNA gene. (Lee, Zo et al. 1996) It was determined that increasing Nugent scores were significantly related to increasing total bacterial 16S gene quantity. (Figure 7) Previous culture-based and molecular-based studies have demonstrated that abnormal vaginal microflora are often present in relatively high quantities in comparison with normal vaginal flora. (Nugent, Krohn et al. 1991; Fredricks, Fiedler et al. 2009) The clinical diagnosis of bacterial vaginosis also includes the criterion of the presence of vaginal cells covered by a large quantity of bacteria, termed clue cells. (Amsel, Totten et al. 1983) These previous findings coupled with the current molecular findings provide supporting evidence for the relationship between increased bacterial quantity and negative reproductive health outcomes.
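The relationship between Nugent score and total bacterial quantity was assessed with Spearman’s rank correlation coefficient, which is the Pearson correlation computed on ranks. A minimal sketch of that calculation is given below; the function names are hypothetical, tied values are given averaged ranks, and no p-value is computed, so this illustrates the statistic only and not the software used in this study.

```python
import math


def average_ranks(values):
    """Return 1-based ranks; tied values receive the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the full run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have equal length of at least 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mean_x = sum(rx) / len(rx)
    mean_y = sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in rx))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in ry))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (spread_x * spread_y)
```

Here `x` would hold each woman’s Nugent score (0 to 10) and `y` the corresponding log-transformed 16S rRNA gene quantity; because Nugent scores are integers, ties are common and the averaged-rank handling matters.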
The correlations of higher Shannon Diversity Indices in women with higher Nugent scores and in women with abnormal vaginal discharge further support the understanding that increased bacterial diversity, in addition to increased bacterial quantity, is correlated with negative reproductive health outcomes.

Although analyses of vaginal bacterial composition were conducted for factors purported to affect composition including feminine hygiene product use, contraception use, number of sexual partners, history of genital infections, hepatitis C status, and BMI, no significant correlations were found. However, it is important to note that in this study the number of women specific to each subcategory for these variables was limited.

4.6 HIV-Specific Characteristics

For the analyses relating bacterial species composition with immune status based on CD4 count groupings and based on controlled versus uncontrolled viral replication, a total of 53 women were included for analysis. One woman was excluded due to the lack of up-to-date CD4 count and viral load lab results. Immune dysfunction associated with HIV is primarily measured by absolute CD4 cell count. A CD4 count of below 350 cells/mm³ is considered a level at which immune function is impaired and until recently was the level at which antiretroviral therapy aimed at immune reconstitution was initiated. (Hammer, Eron et al. 2008; Thompson, Aberg et al. 2012) It was determined that upon comparison of women with a CD4 count above and below 350 cells/mm³, Gardnerella vaginalis B and Lactobacillus iners were both significantly more abundant in women with a CD4 count below 350 cells/mm³ (p<0.05 for all). (Table 9) Although direct relationships between CD4 counts and vaginal bacterial compositions have not been established, it appears that these factors may be linked by the innate host defense system of the genital tract.
(Cole 2006) As vaginal epithelial cells along with their mucosal secretions constitute the primary barrier against the invasion of microbial and viral pathogenic organisms, the immune components of this initial line of defense may have a vital role in contributing to the formation of the vaginal bacterial profile. (Cole 2006; Witkin, Linhares et al. 2007) It is understood that these vaginal epithelial cells contain a set of Toll-like receptors contributing to innate immunity as well as components for the activation of adaptive immunity, including secretory immunoglobulin A and immunoglobulin G. (Meredith, Raphael et al. 1989; Cole 2006) Furthermore, it has been found that the vaginal subepithelial stroma contains dendritic cells, macrophages and T lymphocytes. (Patterson, Landay et al. 1998) Although no exact mechanism has linked decreased immune status based on low CD4 count levels with decreased vaginal immunity, it is understood that lowered CD4 count levels contribute to an overall depletion of the immune system with increased susceptibility to opportunistic infections. It has also been demonstrated that the reduction in CD4 T-cells during HIV infection largely occurs in the gastrointestinal tract. (Veazey, DeMaria et al. 1998; Guadalupe, Reay et al. 2003; Brenchley, Schacker et al. 2004) Evidence has been provided for a potential break in the gastrointestinal mucosal immune barrier, a barrier that is not greatly distant from the vaginal mucosal barrier. (Brenchley, Price et al. 2006) Although speculative, these findings suggest that an overall lowered immune status may potentially decrease the protective ability of the vaginal immune barrier, allowing for the adherence and growth of specific bacterial flora. It has also been reported that HIV-positive individuals have higher plasma lipopolysaccharide (LPS) levels in comparison with HIV-negative individuals. (Brenchley, Price et al.
2006) As LPS is a component of gram-negative bacterial cell walls, this provides further support for a more greatly compromised immune barrier in HIV-positive individuals, allowing for the presence and eventual translocation of specific bacterial species across anatomical sites such as the gastrointestinal tract and potentially the genital tract. (Brenchley, Price et al. 2006)

Specifically for this study, Gardnerella vaginalis B and Lactobacillus iners were both found to be significantly more abundant in women who were more immune compromised. While Gardnerella vaginalis B may be an expected organism in a compromised vaginal mucosal barrier, the presence of Lactobacillus iners was less expected due to its common correlation with a healthy vaginal flora. Nonetheless, as Lactobacillus iners has been found to be one of the only species of lactobacilli to be present in women with generally abnormal vaginal flora, these findings may provide support for the notion that this species represents a transition species between normal and abnormal vaginal flora. (Verstraelen, Verhelst et al. 2009) Also, as the quantity of hydrogen peroxide-producing species of Lactobacillus specifically has been found to be particularly reduced in HIV-positive women in comparison with HIV-negative women, and as Lactobacillus iners is not a frequent hydrogen peroxide producer, the presence of Lactobacillus iners does not disagree with previous findings in HIV-positive women. (Knezevic, Stepanovic et al. 2005)

Upon comparison of bacterial species in women with a viral load above and below 40 copies/mL, it was determined that Atopobium vaginae, Gardnerella vaginalis D, Prevotella amnii, and 4 other potentially novel bacterial species were significantly more abundant in women with a viral load above 40 copies/mL (p<0.03 for all).
(Table 10) These 4 potentially novel bacterial species were found to be most closely associated with Clostridium genomosp. BVAB3 UPII9-5, Dialister micraerophilus, Megasphaera sp. UP II, and Campylobacter lari. Two bacterial species were also found to be significantly more abundant in women with a viral load below 40 copies/mL. These bacterial species were Lactobacillus crispatus and Lactobacillus gasseri (p<0.02 for all). (Table 10) These results demonstrated that uncontrolled viral replication was correlated with the presence of abnormal vaginal flora, while controlled viral replication was correlated with the presence of normal vaginal flora. Although previous studies have correlated increased genital HIV shedding levels with a decrease in Lactobacillus levels, plasma viral load correlations are more limited. (Mane, Kulkarni et al. 2013) This is likely due to findings that have demonstrated that genital HIV shedding is independent from plasma viral load levels. (Money, Arikan et al. 2003; Cu-Uvin, DeLong et al. 2010) The independence of the genital tract HIV viral load from the plasma viral load supports the understanding that the effect of plasma viral loads on vaginal bacterial composition may be caused by a mechanism involving vaginal immunity as opposed to direct interactions of HIV with the genital tract. Regardless, these results do support the benefits of controlling viral replication and suppressing plasma viral loads.

As viral replication is controlled through combination antiretroviral therapy (cART), a comparison of the bacterial profiles of women on and off cART was also conducted. While 85% of women were on cART, 62% of women were found to have a suppressed viral load below 40 copies/mL.
This discrepancy between antiretroviral treatment and viral suppression was due to a number of women initiating or reinitiating cART close to their research study visit dates, having interruptions in their therapy treatment courses, or struggling with cART compliance. Although the number of women not on cART at the time of their research visit was limited, the Wilcoxon signed-rank test determined that Gardnerella vaginalis D was the only bacterial species significantly more prevalent in women who were not on cART in comparison with women who were on cART (p=0.02). This suggests that Gardnerella vaginalis D might be a bacterial species that is especially influenced by antiretroviral therapy.

4.7 Addressing the Hypothesis

For this study, it was hypothesized that HIV-positive reproductive-aged women share a core vaginal microbiome with variations that can be defined and correlated with specific demographic and clinical characteristics. This hypothesis was based on the understanding that HIV-positive women have a higher prevalence of specific negative reproductive health conditions including HPV, candidal vaginitis, syphilis, and particularly bacterial vaginosis. (Plummer, Simonsen et al. 1989; Carpenter, Mayer et al. 1991; Hutchinson, Rompalo et al. 1991; Sun, Ellerbrock et al. 1995; Sewankambo, Gray et al. 1997; Cu-Uvin, Hogan et al. 1999) For these reasons, it was postulated that an underlying factor such as the vaginal microbiome with common bacterial species might contribute to the occurrence of these conditions in HIV-positive women.

The first part of this hypothesis regarding HIV-positive reproductive-aged women sharing a core vaginal microbiome was not supported by this study’s findings.
After addressing the primary aim of this study by characterizing and profiling the bacterial composition of the vaginal microbiome through cpn60 gene sequencing, it was determined that no bacterial species or taxa were present in the vaginal microbiota of all 54 subjects in this study. Therefore, no core vaginal microbiome or core bacterial species were identified using this taxonomy-based approach. In contrast, distinct bacterial clusters or common vaginal communities were established.

The second part of the hypothesis addressing variations in the vaginal microbiome being defined and correlated with specific demographic and clinical characteristics was supported on certain levels. Although the distinct bacterial clusters in this study were not significantly correlated with demographic or clinical characteristics, significant bacterial species-specific relationships were determined. In addressing the secondary aim of this study, vaginal microbiota compositions were compared based on categorization by demographic, clinical, and HIV-specific variables. Significant findings were not determined for associations of bacterial profiles with BMI groupings, feminine hygiene product usage, contraception usage, sexual activity within a year of the clinical visit, genital infection history, or length of HIV infection. Although significant findings were not determined for all tested variables, significant bacterial species-specific differences were determined for categorizations based on vaginal discharge status, CD4 count level, plasma viral load, combination antiretroviral therapy usage, and ethnicity.

4.8 Limitations

Although careful consideration went into the design and implementation of this study, the challenges encountered and the new findings accumulated during the course of this study did provide new insight and perspective on potential areas of improvement.
A primary limitation of this study was the sample size of 54 HIV-positive women. Although a limited sample size was purposefully chosen because this was an exploratory study intended to generate new hypotheses, small sample sizes limit confidence in results and increase the risk of overinterpreting findings. While sample size was considered and appropriate statistical analyses were conducted, a larger sample size would have been preferred for increased confidence in study trends and findings. Furthermore, for this specific study population, it would have been beneficial to have better representation of the different ethnic groups included in this study. Although there were high proportions of Caucasian, Black and Aboriginal women, the numbers of Asian and South Asian women were very limited. For analyses comparing bacterial profiles between ethnic groups, an even distribution of women across ethnic groups may have been beneficial. However, the ethnic mix in this study is highly reflective of the mix of HIV-positive women in British Columbia. Another limitation of this study involved the criteria used in the clinical assessment of women during pelvic examinations. Although abnormal vaginal findings were recorded with the use of a standardized “Pelvic Exam Findings” data collection form (appendix page 115), vaginal pH was not consistently collected and a saline wet mount was not prepared. In hindsight, it would have been highly beneficial to apply a standardized clinical assessment procedure such as Amsel’s criteria to each woman enrolled in the study. Although Nugent’s scoring is at present the gold standard for clinical assessment of bacterial vaginosis and was the technique utilized, additional clinical measures could have provided more variables for the comparison of vaginal bacterial profiles among different subject groupings.
Other limitations involved the use of gene sequencing technologies for the identification of vaginal bacterial profiles. All cultivation-based and molecular-based techniques have their own sets of limitations, and gene sequencing with the chaperonin-60 gene target is no exception. Although the cpn60 gene target was utilized due to its potential for higher phylogenetic resolution and increased discriminatory power between closely related bacterial taxa compared to other universal gene targets, there are specific bacterial taxa that this target is unable to identify (Brousseau, Hill et al. 2001). A number of species of the genus Mycoplasma have been found to lack a type 1 chaperonin gene (Hill JE 2004). These species include Mycoplasma capricolum, M. mobile, M. mycoides, M. pulmonis, and M. synoviae (Hill JE 2004). Ureaplasma parvum has also been found to lack a type 1 chaperonin gene (Hill JE 2004). Since a separate screening for these specific bacterial taxa was not performed, the bacterial profiling performed in this study was not fully comprehensive. This barrier could have been overcome by conducting a separate PCR-based screening for Mycoplasma and Ureaplasma detection in addition to the gene sequencing analyses. Although this type of analysis would not have generated comparable data on bacterial proportions, it could have provided insight into the presence or absence of these specific bacterial taxa. Another limitation that further demonstrated the potential need for analyses complementary to gene sequencing for bacterial identification was the detection of Chlamydia trachomatis DNA in a vaginal sample. Chlamydia trachomatis was detected in a single sample through targeted species-specific PCR, but was not detected in this same sample through gene sequencing.
Although cpn60 gene sequencing has demonstrated its ability to detect and characterize species of the Chlamydia genus in previous reports (Hill, Goh et al. 2005), it is less sensitive for specific species than a targeted PCR approach. Although the inability to profile these specific bacterial taxa through gene sequencing indicated that their proportion was likely low, it remains a limiting factor. Indeed, it was expected that Chlamydia trachomatis would be present in low proportions compared with the normal flora, and that it would be unlikely to detect these organisms at a sampling depth of 1,081 sequence reads. Regardless, this limitation does highlight the potential need to perform additional screening analyses for specific bacterial species of interest alongside gene sequencing analyses. Furthermore, as gene sequencing and data interpretation have the potential to generate false positive and false negative results, it would be important to perform an additional qPCR for each of the bacteria identified in this sample cohort to confirm their presence. Another option to increase the likelihood of detecting organisms present in low proportions, such as Chlamydia trachomatis, is to perform a deeper sequencing analysis with potentially millions of sequence reads per sample. A further limitation of universal gene sequencing for bacterial identification was the specific match percentage cut off that was chosen for identifying bacteria with the Chaperonin Database. As the cut off that accurately classifies bacterial taxa at the species level varies by genus and species in the Chaperonin Database, there was no single ideal match percentage cut off that could encompass the full range of accurate bacterial identity cut offs.
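The sampling-depth argument above can be made concrete with a simple approximation. Assuming each sequence read is an independent draw, and that an organism's share of reads equals its proportion p in the amplified pool (an idealization that ignores primer bias and sequencing error), the chance of seeing at least one read among N is 1 - (1 - p)^N. The function names and the example proportion below are illustrative assumptions; only the depth of 1,081 reads comes from the text.

```python
import math


def detection_probability(proportion: float, reads: int) -> float:
    """P(at least one read from the organism) under independent sampling."""
    return 1.0 - (1.0 - proportion) ** reads


def reads_needed(proportion: float, confidence: float) -> int:
    """Smallest read count giving at least `confidence` chance of detection."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - proportion))


# Hypothetical organism making up 0.01% of the amplicon pool.
p = 1e-4
print(detection_probability(p, 1081))       # roughly a 10% chance at this depth
print(detection_probability(p, 1_000_000))  # effectively certain with a million reads
print(reads_needed(p, 0.95))                # on the order of 30,000 reads
```

Under these assumptions, an organism at that hypothetical proportion would usually be missed at about a thousand reads, which is consistent with the expectation stated above.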
As an illustration of this cut off problem, for Prevotella buccalis the highest match percentage in the database that coded for a different species was 82.4%, whereas all match percentages in the range of 82.4–100% simply coded for different strains of Prevotella buccalis (Hill JE 2004). For Lactobacillus jensenii, by contrast, the highest match percentage in the database that coded for a different species was 87.1%, whereas all match percentages in the range of 87.1–100% coded for different strains of Lactobacillus jensenii (Hill JE 2004). This demonstrates the inability to choose a single match percentage cut off that is ideal for all bacterial taxa. Although this inability is a limitation, a single match percentage cut off was chosen in this study to ensure confidence in bacterial labelling and differentiation. More general limitations of this deep amplicon sequencing approach included biases from specific laboratory methodologies as well as the innate restrictions imposed by the online Chaperonin Database. Although a combination of two sets of primers was used for the amplification of cpn60 PCR product in this study’s samples, primer design directly influences the preferential amplification of certain bacterial taxa. Although the sets of primers utilized in this study were designed for comprehensive profiling of environmental bacterial taxa (Hill, Town et al. 2006), the potential for biased bacterial amplification in gene sequencing-based analyses is an important factor to be conscious of. Moreover, another factor with a propensity for influencing the results of gene sequencing analyses was the potential for sample contamination during the multiple stages of sample processing.
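The cut off problem described above can be sketched in a few lines. The two species boundaries are the values quoted above from the Chaperonin Database. The function names, the 85% query identity, and the 90% global cut off are hypothetical values chosen for illustration, and are not the settings used in this study.

```python
# Highest percent identity at which a *different* species was matched,
# as quoted in the text for these two species.
SPECIES_BOUNDARY = {
    "Prevotella buccalis": 82.4,
    "Lactobacillus jensenii": 87.1,
}


def label_with_global_cutoff(species: str, identity: float, cutoff: float) -> str:
    """Accept the best-hit species only if identity meets one shared cut off."""
    return species if identity >= cutoff else "unclassified"


def label_with_species_boundary(species: str, identity: float) -> str:
    """Accept the best-hit species only if identity exceeds that species' own boundary."""
    boundary = SPECIES_BOUNDARY.get(species)
    if boundary is None:
        return "unclassified"
    return species if identity > boundary else "unclassified"


# Hypothetical query sequences, each matching its best hit at 85% identity.
for best_hit in SPECIES_BOUNDARY:
    print(best_hit,
          "| global 90%:", label_with_global_cutoff(best_hit, 85.0, 90.0),
          "| per-species:", label_with_species_boundary(best_hit, 85.0))
```

With a global cut off of 90%, an 85% match to Prevotella buccalis is left unclassified even though the database indicates it is still that species. Lowering the global cut off to 85% would instead accept an 85% match to Lactobacillus jensenii that may belong to a different species. A single conservative cut off therefore trades sensitivity for confidence in the labels that are assigned.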
Sources of sample contamination can include laboratory tubes, vaginal swabs, laboratory liquids and washes, and other relevant equipment, and organisms from these unwanted sources can negatively affect study results. In order to reduce contamination risk, consistent sample collection and processing procedures were adhered to throughout the course of this study. These included the careful usage and handling of sterile equipment, the use of ultraviolet lights in “Cleanspot” UV cabinets to reduce laboratory contamination of study tubes, and the use of blank control samples for DNA extraction and PCR amplification to monitor for sample contamination. Prior to gene sequencing, the PCR of samples was also repeated until the blank control samples presented with a clean negative result, indicating that study samples were not contaminated during the PCR stage. Although these precautions were followed, it is important to note that the potential for sample contamination was still present. In terms of restrictions imposed by the online Chaperonin Database, the only organisms that can be characterized through cpn60 gene sequencing are those that have been referenced in the database. Although the number of organisms included in the online database is continually increasing, the current set of referenced organisms limits the ability to fully characterize the profile of organisms present in study samples. For this reason, there may be organisms within study samples that have not yet been referenced in the online database. Further limitations arose in the speculative discussion of study results. For instance, it was postulated that the significant differences observed in the profile of lactobacilli in different ethnic groups could be due to differences in lifestyle or diet choices. This type of speculative consideration could have been supported with evidence had more data been collected on factors including diet choices and other environmental influences.
For this study, this could have included an additional data collection form such as a Food Frequency Questionnaire.

4.9 Future Directions

In order to gain a deeper understanding of the vaginal bacterial composition potentially unique to HIV-positive women, it is necessary to compare this study’s results with a control-matched HIV-negative cohort. As a study of the vaginal microbiome of HIV-negative women, entitled Vogue 1A, is currently nearing completion, a set of control-matched data will become available for comparison purposes. Vogue 1A is a study with an enrolment target of 300 HIV-negative reproductive-aged women. As the study methodologies for Vogue 1A are very similar to those of the current study, this group of women will serve as an ideal control cohort. This type of analysis will aid in determining if the common vaginal community clusters generated in this study also occur in HIV-negative women. Furthermore, this comparison will allow for a more thorough understanding of the differences between the vaginal microbiota of HIV-positive and HIV-negative women in relation to demographic and clinical variables. As vaginal microbiome profiling simply provides an overview of the bacteria that are present in a vaginal environment without an indication of which bacterial genes are functioning, it is important to conduct functional profiling of bacterial organisms through shotgun metagenomics. This approach could look at all of the genes present in samples and potentially provide insight into common bacterial functions across samples. In order to determine which bacterial genes are being actively expressed, it is also important to study gene expression at the RNA level for the microbiome. This would entail transcriptomic analyses that could determine gene expression patterns and provide insight into which specific bacterial genes are being highly expressed.
An understanding of gene expression levels could reveal, for example, that bacterial genes present at low proportions have a greater role in the microbiome than genes present at higher proportions. This could lead to a more accurate identification of markers for specific diseases or reproductive health conditions, by revealing which bacterial species are more heavily involved in the onset of specific conditions. As the current study provided a strong indication of relationships between distinct HIV characteristics and specific bacterial species, further research is warranted to explore these relationships. Although CD4 counts and plasma viral loads were significantly correlated with the presence of certain bacterial species in the vaginal microbiota, the underlying mechanisms that connect these variables are yet to be fully established or understood. Further research into the role of the immune system in vaginal bacterial composition is needed to better understand whether a causal relationship exists between the vaginal microbiota in HIV-positive women and their immune status. While the current study provided evidence for a relationship between HIV immune status characteristics and vaginal microbiota, further studies are required to confirm this relationship and thoroughly detail its potential underlying mechanisms.

4.9.1 Concluding Remarks

The overall intention of this exploratory study was to generate a unique set of data to aid in the understanding of the vaginal microbial profile of reproductive-aged women living with HIV. 54 HIV-positive women were enrolled from the Oak Tree Clinic situated in Vancouver, BC, Canada. This group encompassed previously understudied ethnic backgrounds such as Aboriginal women and also included a diversity of women with and without negative reproductive health conditions. The universal gene target, chaperonin 60, was utilized for the gene sequencing-based profiling of the vaginal bacteria of these women.
This form of molecular analysis for bacterial identification proved to be useful in the differentiation of bacterial taxa at highly granular levels of categorization. Furthermore, these classified bacteria were significantly correlated with a variety of demographic and clinical characteristics, providing evidence for the generation of new hypotheses and a need for greater exploration and study. A total of 10 common vaginal community clusters were generated for the women in this study. While these clusters highlighted specific reproductive health trends, the detailed correlations established between health states and distinct bacterial strains were even more revealing. In addition to identifying the sets of bacterial organisms correlated with positive and negative health outcomes, this study was able to pinpoint specific strains of bacteria. For instance, of the four strains of Gardnerella vaginalis identified in this study, only two were significantly correlated with women who presented with abnormal vaginal discharge, low CD4 counts and high viral loads. This provided evidence that these specific strains have greater pathogenic potential and possibly a unique relationship to the vaginal microbiota of HIV-positive women. Furthermore, specific species of certain bacterial genera were also identified as being strongly related to positive health outcomes, with protective potential against unwanted bacterial taxa. Upon further comparisons of bacterial profiles with HIV-specific characteristics, evidence was also presented in support of probable immunological connections between vaginal microbiota and immune status determinants. The root reason for initiating this research was simply to generate new information that could potentially lead to improvement in the reproductive health of women, and particularly of women living with HIV.
Further exploration of these findings may provide preliminary information eventually leading to the development of improved diagnostic tests for negative reproductive health outcomes as well as better treatment and prophylactic therapies for women. While further study is warranted for a deeper understanding of many of the relationships presented in this report, this study provides unique, new information at the genomic level on the vaginal microbiota in HIV-positive women.

BIBLIOGRAPHY

Allsworth, J. E., V. A. Lewis, et al. (2008). "Viral sexually transmitted infections and bacterial vaginosis: 2001-2004 National Health and Nutrition Examination Survey data." Sex Transm Dis 35(9): 791-796. Allsworth, J. E. and J. F. Peipert (2007). "Prevalence of bacterial vaginosis: 2001-2004 National Health and Nutrition Examination Survey data." Obstet Gynecol 109(1): 114-120. Amsel, R., P. A. Totten, et al. (1983). "Nonspecific vaginitis. Diagnostic criteria and microbial and epidemiologic associations." Am J Med 74(1): 14-22. Antonio, M. A., S. E. Hawes, et al. (1999). "The identification of vaginal Lactobacillus species and the demographic and microbiologic characteristics of women colonized by these species." J Infect Dis 180(6): 1950-1956. Anukam, K. C., E. Osazuwa, et al. (2006). "Clinical study comparing probiotic Lactobacillus GR-1 and RC-14 with metronidazole vaginal gel to treat symptomatic bacterial vaginosis." Microbes Infect 8(12-13): 2772-2776. Atashili, J., C. Poole, et al. (2008). "Bacterial vaginosis and HIV acquisition: a meta-analysis of published studies." AIDS 22(12): 1493-1501. Bik, E. M., P. B. Eckburg, et al. (2006). "Molecular analysis of the bacterial microbiota in the human stomach." Proc Natl Acad Sci U S A 103(3): 732-737. Blankenberg, D., G. Von Kuster, et al. (2010). "Galaxy: a web-based genome analysis tool for experimentalists." Curr Protoc Mol Biol Chapter 19: Unit 19 10 11-21. Boskey, E. R., R. A. Cone, et al.
(2001). "Origins of vaginal acidity: high D/L lactate ratio is consistent with bacteria being the primary source." Hum Reprod 16(9): 1809-1813. Bradshaw, C. S., A. N. Morton, et al. (2006). "High recurrence rates of bacterial vaginosis over the course of 12 months after oral metronidazole therapy and factors associated with recurrence." J Infect Dis 193(11): 1478-1486. Brenchley, J. M., D. A. Price, et al. (2006). "Microbial translocation is a cause of systemic immune activation in chronic HIV infection." Nat Med 12(12): 1365-1371. Brenchley, J. M., T. W. Schacker, et al. (2004). "CD4+ T cell depletion during all stages of HIV disease occurs predominantly in the gastrointestinal tract." J Exp Med 200(6): 749-759. Brousseau, R., J. E. Hill, et al. (2001). "Streptococcus suis serotypes characterized by analysis of chaperonin 60 gene sequences." Appl Environ Microbiol 67(10): 4828-4833. Burton, J. P., E. Devillard, et al. (2004). "Detection of Atopobium vaginae in postmenopausal women by cultivation-independent methods warrants further investigation." J Clin Microbiol 42(4): 1829-1831. Burton, J. P. and G. Reid (2002). "Evaluation of the bacterial vaginal flora of 20 postmenopausal women by direct (Nugent score) and molecular (polymerase chain reaction and denaturing gradient gel electrophoresis) techniques." J Infect Dis 186(12): 1770-1780. Cann, A. J. and J. Karn (1989). "Molecular biology of HIV: new insights into the virus life-cycle." AIDS 3 Suppl 1: S19-34. Carpenter, C. C., K. H. Mayer, et al. (1991). "Human immunodeficiency virus infection in North American women: experience with 200 cases and a review of the literature." Medicine (Baltimore) 70(5): 307-325. Catlin, B. W. (1992). "Gardnerella vaginalis: characteristics, clinical considerations, and controversies." Clin Microbiol Rev 5(3): 213-237. CDC (1992). "1993 revised classification system for HIV infection and expanded surveillance case definition for AIDS among adolescents and adults."
MMWR Recomm Rep 41(RR-17): 1-19. Ceruti, M., G. Piantelli, et al. (1994). "[Bacterial vaginosis. Prevention of recurrence]." Minerva Ginecol 46(12): 657-661. CIDPC, C. a. (2002). "A GUIDE TO HIV/AIDS EPIDEMIOLOGICAL AND SURVEILLANCE TERMS." Health Canada. Clarridge, J. E., 3rd (2004). "Impact of 16S rRNA gene sequence analysis for identification of bacteria on clinical microbiology and infectious diseases." Clin Microbiol Rev 17(4): 840-862, table of contents. Cole, A. M. (2006). "Innate host defense of human vaginal and cervical mucosae." Curr Top Microbiol Immunol 306: 199-230. Cole, J. R., Q. Wang, et al. (2009). "The Ribosomal Database Project: improved alignments and new tools for rRNA analysis." Nucleic Acids Res 37(Database issue): D141-145. Collins, E. B. and K. Aramaki (1980). "Production of Hydrogen peroxide by Lactobacillus acidophilus." J Dairy Sci 63(3): 353-357. Cu-Uvin, S., A. K. DeLong, et al. (2010). "Genital tract HIV-1 RNA shedding among women with below detectable plasma viral load." AIDS 24(16): 2489-2497. Cu-Uvin, S., J. W. Hogan, et al. (2001). "Association between bacterial vaginosis and expression of human immunodeficiency virus type 1 RNA in the female genital tract." Clin Infect Dis 33(6): 894-896. Cu-Uvin, S., J. W. Hogan, et al. (1999). "Prevalence of lower genital tract infections among human immunodeficiency virus (HIV)-seropositive and high-risk HIV-seronegative women. HIV Epidemiology Research Study Group." Clin Infect Dis 29(5): 1145-1150. Donachie, S. P., J. S. Foster, et al. (2007). "Culture clash: challenging the dogma of microbial diversity." ISME J 1(2): 97-99. Dramsi, S., P. Trieu-Cuot, et al. (2005). "Sorting sortases: a nomenclature proposal for the various sortases of Gram-positive bacteria." Res Microbiol 156(3): 289-297. Dunkelberg, W. E., Jr. and I. McVeigh (1969). "Growth requirements of Haemophilus vaginalis." Antonie Van Leeuwenhoek 35(2): 129-145. Dunkelberg, W. E., Jr., R. Skaggs, et al. (1970).
"A study and new description of Corynebacterium vaginale (Haemophilus vaginalis)." Am J Clin Pathol 53(3): 370-377. Eckert, L. O., S. E. Hawes, et al. (1998). "Vulvovaginal candidiasis: clinical manifestations, risk factors, management algorithm." Obstet Gynecol 92(5): 757-765. Eschenbach, D. A. (2007). "Bacterial vaginosis: resistance, recurrence, and/or reinfection?" Clin Infect Dis 44(2): 220-221. Eschenbach, D. A., P. R. Davick, et al. (1989). "Prevalence of hydrogen peroxide-producing Lactobacillus species in normal women and women with bacterial vaginosis." J Clin Microbiol 27(2): 251-256. Falsen, E., C. Pascual, et al. (1999). "Phenotypic and phylogenetic characterization of a novel Lactobacillus species from human sources: description of Lactobacillus iners sp. nov." Int J Syst Bacteriol 49 Pt 1: 217-221. Farage, M. and H. Maibach (2006). "Lifetime changes in the vulva and vagina." Arch Gynecol Obstet 273(4): 195-202. Fauci, A. S. (1988). "The human immunodeficiency virus: infectivity and mechanisms of pathogenesis." Science 239(4840): 617-622. Ferris, M. J., A. Masztal, et al. (2004). "Association of Atopobium vaginae, a recently described metronidazole resistant anaerobe, with bacterial vaginosis." BMC Infect Dis 4: 5. Forsum, U., T. Jakobsson, et al. (2002). "An international study of the interobserver variation between interpretations of vaginal smear criteria of bacterial vaginosis." APMIS 110(11): 811-818. Forsum, U., P. G. Larsson, et al. (2008). "Scoring vaginal fluid smears for diagnosis of bacterial vaginosis: need for quality specifications." APMIS 116(2): 156-159. Fredricks, D. N., T. L. Fiedler, et al. (2005). "Molecular identification of bacteria associated with bacterial vaginosis." N Engl J Med 353(18): 1899-1911. Fredricks, D. N., T. L. Fiedler, et al. (2009). "Changes in vaginal bacterial concentrations with intravaginal metronidazole therapy for bacterial vaginosis as assessed by quantitative PCR." J Clin Microbiol 47(3): 721-726.
Fredricks, D. N., T. L. Fiedler, et al. (2007). "Targeted PCR for detection of vaginal bacteria associated with bacterial vaginosis." J Clin Microbiol 45(10): 3270-3276. Gardner, H. L. and C. D. Dukes (1954). "New etiologic agent in nonspecific bacterial vaginitis." Science 120(3125): 853. Gardner, H. L. and C. D. Dukes (1955). "Haemophilus vaginalis vaginitis: a newly defined specific infection previously classified non-specific vaginitis." Am J Obstet Gynecol 69(5): 962-976. Gardner, H. L. and C. D. Dukes (1959). "Hemophilus vaginalis vaginitis." Ann N Y Acad Sci 83: 280-289. Giardine, B., C. Riemer, et al. (2005). "Galaxy: a platform for interactive large-scale genome analysis." Genome Res 15(10): 1451-1455. Goecks, J., A. Nekrutenko, et al. (2010). "Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences." Genome Biol 11(8): R86. Goh, S. H., S. Potter, et al. (1996). "HSP60 gene sequences as universal targets for microbial species identification: studies with coagulase-negative staphylococci." J Clin Microbiol 34(4): 818-823. Good, I. J. (1953). "The Population Frequencies of Species and the Estimation of Population Parameters." Biometrika 40(3/4): 237-264. Govender, L., A. A. Hoosen, et al. (1996). "Bacterial vaginosis and associated infections in pregnancy." Int J Gynaecol Obstet 55(1): 23-28. Greaves, W. L., J. Chungafung, et al. (1988). "Clindamycin versus metronidazole in the treatment of bacterial vaginosis." Obstet Gynecol 72(5): 799-802. Greenwood, J. R., and Pickett, M.J. (1980). "Transfer of Haemophilus vaginalis Gardner and Dukes to a new genus, Gardnerella: G. vaginalis (Gardner and Dukes) comb. nov." International Journal of Systematic Bacteriology 30(1): 170-178. Guadalupe, M., E. Reay, et al. (2003).
"Severe CD4+ T-cell depletion in gut lymphoid tissue during primary human immunodeficiency virus type 1 infection and substantial delay in restoration following highly active antiretroviral therapy." J Virol 77(21): 11708-11717. Gupta, R. S. (1995). "Evolution of the chaperonin families (Hsp60, Hsp10 and Tcp-1) of proteins and the origin of eukaryotic cells." Mol Microbiol 15(1): 1-11. Hallen, A., C. Jarstrand, et al. (1992). "Treatment of bacterial vaginosis with lactobacilli." Sex Transm Dis 19(3): 146-148. Hammer, S. M., J. J. Eron, Jr., et al. (2008). "Antiretroviral treatment of adult HIV infection: 2008 recommendations of the International AIDS Society-USA panel." JAMA 300(5): 555-570. Hammerschlag, M. R., S. Alpert, et al. (1978). "Anaerobic microflora of the vagina in children." Am J Obstet Gynecol 131(8): 853-856. Hill, J. E., S. H. Goh, et al. (2005). "Characterization of vaginal microflora of healthy, nonpregnant women by chaperonin-60 sequence-based methods." Am J Obstet Gynecol 193(3 Pt 1): 682-692. Hill, J. E., A. Paccagnella, et al. (2006). "Identification of Campylobacter spp. and discrimination from Helicobacter and Arcobacter spp. by direct sequencing of PCR-amplified cpn60 sequences and comparison to cpnDB, a chaperonin reference sequence database." J Med Microbiol 55(Pt 4): 393-399. Hill JE, P. S., Crowell KG, Goh SH and Hemmingsen SM (2004). cpnDB: a chaperonin sequence database. Genome Res. 14: 1669-1675. Hill, J. E., J. R. Town, et al. (2006). "Improved template representation in cpn60 polymerase chain reaction (PCR) product libraries generated from complex templates by application of a specific mixture of PCR primers." Environ Microbiol 8(4): 741-746. Hill, L. V. and J. A. Embil (1986). "Vaginitis: current microbiologic and clinical concepts." CMAJ 134(4): 321-331. Hillier, S. L. (2005). "The complexity of microbial diversity in bacterial vaginosis." N Engl J Med 353(18): 1886-1887. Hillier, S. L., M. A. Krohn, et al. (1992).
"The relationship of hydrogen peroxide-producing lactobacilli to bacterial vaginosis and genital microflora in pregnant women." Obstet Gynecol 79(3): 369-373. Hillier, S. L., M. A. Krohn, et al. (1993). "The normal vaginal flora, H2O2-producing lactobacilli, and bacterial vaginosis in pregnant women." Clin Infect Dis 16 Suppl 4: S273-281. Hooton, T. M., S. Hillier, et al. (1991). "Escherichia coli bacteriuria and contraceptive method." JAMA 265(1): 64-69. Hummelen, R., A. D. Fernandes, et al. (2010). "Deep sequencing of the vaginal microbiota of women with HIV." PLoS One 5(8): e12078. Huntriss, J., K. Woodfine, et al. (2011). "Quantitative analysis of DNA methylation of imprinted genes in single human blastocysts by pyrosequencing." Fertil Steril 95(8): 2564-2567 e2561-2568. Huse, S. M., J. A. Huber, et al. (2007). "Accuracy and quality of massively parallel DNA pyrosequencing." Genome Biol 8(7): R143. Hutchinson, C. M., A. M. Rompalo, et al. (1991). "Characteristics of patients with syphilis attending Baltimore STD clinics. Multiple high-risk subgroups and interactions with human immunodeficiency virus infection." Arch Intern Med 151(3): 511-516. Hyman, R. W., M. Fukushima, et al. (2005). "Microbes on the human vaginal epithelium." Proc Natl Acad Sci U S A 102(22): 7952-7957. Isola, D., M. Pardini, et al. (2005). "A Pyrosequencing assay for rapid recognition of SNPs in Mycobacterium tuberculosis embB306 region." J Microbiol Methods 62(1): 113-120. Jamieson, D. J., A. Duerr, et al. (2001). "Longitudinal analysis of bacterial vaginosis: findings from the HIV epidemiology research study." Obstet Gynecol 98(4): 656-663. Joesoef, M. R. and G. Schmid (2005). "Bacterial vaginosis." Clin Evid(13): 1968-1978. Joesoef, M. R., G. P. Schmid, et al. (1999). "Bacterial vaginosis: review of treatment options and potential clinical indications for therapy." Clin Infect Dis 28 Suppl 1: S57-65. Kimberlin, D. F. and W. W. Andrews (1998).
"Bacterial vaginosis: association with adverse pregnancy outcome." Semin Perinatol 22(4): 242-250. Kindt, R. and R. Coe (2005). Tree Diversity Analysis: A manual and software for common statistical methods for ecological and biodiversity studies. Nairobi, Kenya, World Agroforestry Centre. Knezevic, A., S. Stepanovic, et al. (2005). "Reduced quantity and hydrogen-peroxide production of vaginal lactobacilli in HIV positive women." Biomed Pharmacother 59(9): 521-523. Korshunov, V. M., Z. A. Gudieva, et al. (1999). "[The vaginal Bifidobacterium flora in women of reproductive age]." Zh Mikrobiol Epidemiol Immunobiol(4): 74-78. Lai, S. K., K. Hida, et al. (2009). "Human immunodeficiency virus type 1 is trapped by acidic but not by neutralized human cervicovaginal mucus." J Virol 83(21): 11196-11200. Lamont, R. F., J. D. Sobel, et al. (2011). "The vaginal microbiome: new information about genital tract flora using molecular based techniques." BJOG 118(5): 533-549. Lapadat-Tapolsky, M., H. De Rocquigny, et al. (1993). "Interactions between HIV-1 nucleocapsid protein and viral DNA may have important functions in the viral life cycle." Nucleic Acids Res 21(4): 831-839. Larsen, B., C. P. Goplerud, et al. (1982). "Effect of estrogen treatment on the genital tract flora of postmenopausal women." Obstet Gynecol 60(1): 20-24. Larsson, P. G., B. Carlsson, et al. (2004). "Diagnosis of bacterial vaginosis: need for validation of microscopic image area used for scoring bacterial morphotypes." Sex Transm Infect 80(1): 63-67. Larsson, P. G., B. Stray-Pedersen, et al. (2008). "Human lactobacilli as supplementation of clindamycin to patients with bacterial vaginosis reduce the recurrence rate; a 6-month, double-blind, randomized, placebo-controlled study." BMC Womens Health 8: 3. Le Rouzic, E. and S. Benichou (2005). "The Vpr protein from HIV-1: distinct roles along the viral life cycle." Retrovirology 2: 11. Lederberg, J., McCray, AT (2001).
"'Ome Sweet 'Omics-- A Genealogical Treasury of Words." Scientist 15(8). Lee, D. H., Y. G. Zo, et al. (1996). "Nonradioactive method to study genetic profiles of natural bacterial communities by PCR-single-strand-conformation polymorphism." Appl Environ Microbiol 62(9): 3112-3120. Ling, Z., J. Kong, et al. (2010). "Molecular analysis of the diversity of vaginal microbiota associated with bacterial vaginosis." BMC Genomics 11: 488. Links, M. G., T. J. Dumonceaux, et al. (2012). "The chaperonin-60 universal target is a barcode for bacteria that enables de novo assembly of metagenomic sequence data." PLoS One 7(11): e49755. MacPhee, R. A., R. Hummelen, et al. (2010). "Probiotic strategies for the treatment and prevention of bacterial vaginosis." Expert Opin Pharmacother 11(18): 2985-2995. Mane, A., S. Kulkarni, et al. (2013). "HIV-1 RNA shedding in the female genital tract is associated with reduced quantity of Lactobacilli in clinically asymptomatic HIV-positive women." Diagn Microbiol Infect Dis 75(1): 112-114. Marcone, V., E. Calzolari, et al. (2008). "Effectiveness of vaginal administration of Lactobacillus rhamnosus following conventional metronidazole therapy: how to lower the rate of bacterial vaginosis recurrences." New Microbiol 31(3): 429-433. Martius, J., M. A. Krohn, et al. (1988). "Relationships of vaginal Lactobacillus species, cervical Chlamydia trachomatis, and bacterial vaginosis to preterm birth." Obstet Gynecol 71(1): 89-95. Mastromarino, P., S. Macchia, et al. (2009). "Effectiveness of Lactobacillus-containing vaginal tablets in the treatment of symptomatic bacterial vaginosis." Clin Microbiol Infect 15(1): 67-74. Mazzulli, T., A. E. Simor, et al. (1990). "Reproducibility of interpretation of Gram-stained vaginal smears for the diagnosis of bacterial vaginosis." J Clin Microbiol 28(7): 1506-1508. McCormack, W. M., C. H. Hayes, et al. (1977). "Vaginal colonization with Corynebacterium vaginale (Haemophilus vaginalis)." J Infect Dis 136(6): 740-745.
Meredith, S. D., G. D. Raphael, et al. (1989). "The pathophysiology of rhinitis. III. The control of IgG secretion." J Allergy Clin Immunol 84(6 Pt 1): 920-930. Money, D. (2005). "The laboratory diagnosis of bacterial vaginosis." Can J Infect Dis Med Microbiol 16(2): 77-79. Money, D. M., Y. Y. Arikan, et al. (2003). "Genital tract and plasma human immunodeficiency virus viral load throughout the menstrual cycle in women who are infected with ovulatory human immunodeficiency virus." Am J Obstet Gynecol 188(1): 122-128. Moodley, P., C. Connolly, et al. (2002). "Interrelationships among human immunodeficiency virus type 1 infection, bacterial vaginosis, trichomoniasis, and the presence of yeasts." J Infect Dis 185(1): 69-73. Neefs, J. M., Y. Van de Peer, et al. (1993). "Compilation of small ribosomal subunit RNA structures." Nucleic Acids Res 21(13): 3025-3049.  103  Neggers, Y. H., T. R. Nansel, et al. (2007). "Dietary intake of selected nutrients affects bacterial vaginosis in women." J Nutr 137(9): 2128-2133. Nugent, R. P., M. A. Krohn, et al. (1991). "Reliability of diagnosing bacterial vaginosis is improved by a standardized method of gram stain interpretation." J Clin Microbiol 29(2): 297-301. Nwadioha, S., D. Egah, et al. (2011). "Prevalence of bacterial vaginosis and its risk factors in HIV/AIDS patients with abnormal vaginal discharge." Asian Pac J Trop Med 4(2): 156-158. O'Connor, T. J., D. Kinchington, et al. (1995). "The activity of candidate virucidal agents, low pH and genital secretions against HIV-1 in vitro." Int J STD AIDS 6(4): 267272. Oksanen, J., Kindt, R., Legendre, P., O’Hara, B., Simpson, G. L., Solymos, P., Stevens, H. H., Wagner, H. (2012) "The vegan Package: Community Ecology Package 2.03." Olsen, G. J., D. J. Lane, et al. (1986). "Microbial ecology and evolution: a ribosomal RNA approach." Annu Rev Microbiol 40: 337-365. Paramel Jayaprakash, T., J. J. Schellenberg, et al. (2012). 
"Resolution and characterization of distinct cpn60-based subgroups of Gardnerella vaginalis in the vaginal microbiota." PLoS One 7(8): e43009. Patterson, B. K., A. Landay, et al. (1998). "Repertoire of chemokine receptor expression in the female genital tract: implications for human immunodeficiency virus transmission." Am J Pathol 153(2): 481-490. Peterson, J., S. Garges, et al. (2009). "The NIH Human Microbiome Project." Genome Res 19(12): 2317-2323. Pheifer, T. A., P. S. Forsyth, et al. (1978). "Nonspecific vaginitis: role of Haemophilus vaginalis and treatment with metronidazole." N Engl J Med 298(26): 1429-1434. Piot, P., E. van Dyck, et al. (1980). "A taxonomic study of Gardnerella vaginalis (Haemophilus vaginalis) Gardner and Dukes 1955." J Gen Microbiol 119(2): 373396. Plummer, F. A., J. N. Simonsen, et al. (1989). "Epidemiologic evidence for the development of serovar-specific immunity after gonococcal infection." J Clin Invest 83(5): 1472-1476. Pruesse, E., C. Quast, et al. (2007). "SILVA: a comprehensive online resource for quality checked and aligned ribosomal RNA sequence data compatible with ARB." Nucleic Acids Res 35(21): 7188-7196. Purdon, A., Jr., J. H. Hanna, et al. (1984). "An evaluation of single-dose metronidazole treatment for Gardnerella vaginalis vaginitis." Obstet Gynecol 64(2): 271-274. Ravel, J., P. Gajer, et al. (2011). "Vaginal microbiome of reproductive-age women." Proc Natl Acad Sci U S A 108 Suppl 1: 4680-4687. Rogosa, M., J. A. Mitchell, et al. (1951). "A selective medium for the isolation and enumeration of oral and fecal lactobacilli." J Bacteriol 62(1): 132-133. Ronaghi, M., M. Uhlen, et al. (1998). "A sequencing method based on real-time pyrophosphate." Science 281(5375): 363, 365. Rossi, A., T. Rossi, et al. (2010). "The use of Lactobacillus rhamnosus in the therapy of bacterial vaginosis. Evaluation of clinical efficacy in a population of 40 women treated for 24 months." Arch Gynecol Obstet 281(6): 1065-1069.  104  Sarwal, S. 
(2008). Canadian Guidelines on Sexually Transmitted Infections. P. H. A. o. Canada. Schellenberg, J., M. G. Links, et al. (2009). "Pyrosequencing of the chaperonin-60 universal target as a tool for determining microbial community composition." Appl Environ Microbiol 75(9): 2889-2898. Schellenberg, J., M. G. Links, et al. (2011). "Pyrosequencing of chaperonin-60 (cpn60) amplicons as a means of determining microbial community composition." Methods Mol Biol 733: 143-158. Schiffman, D. O. (1975). "Evaluation of an anti-infective combination. Trimethoprimsulfamethoxazole (Bactrim, Septra)." JAMA 231(6): 635-637. Schloss, P. D., S. L. Westcott, et al. (2009). "Introducing mothur: open-source, platformindependent, community-supported software for describing and comparing microbial communities." Appl Environ Microbiol 75(23): 7537-7541. Schreier, W. J., T. E. Schrader, et al. (2007). "Thymine dimerization in DNA is an ultrafast photoreaction." Science 315(5812): 625-629. Schrezenmeir, J. and M. de Vrese (2001). "Probiotics, prebiotics, and synbiotics-approaching a definition." Am J Clin Nutr 73(2 Suppl): 361S-364S. Schwebke, J. R., S. L. Hillier, et al. (1996). "Validity of the vaginal gram stain for the diagnosis of bacterial vaginosis." Obstet Gynecol 88(4 Pt 1): 573-576. Sefer, M. and D. Ionescu (1991). "[New bacterial microorganisms in the etiology of human infections. The genus Gardnerella]." Bacteriol Virusol Parazitol Epidemiol 36(1): 1-17. Segata, N., J. Izard, et al. (2011). "Metagenomic biomarker discovery and explanation." Genome Biol 12(6): R60. Sewankambo, N., R. H. Gray, et al. (1997). "HIV-1 infection associated with abnormal vaginal flora morphology and bacterial vaginosis." Lancet 350(9077): 546-550. Sha, B. E., M. R. Zariffard, et al. (2005). "Female genital-tract HIV load correlates inversely with Lactobacillus species but positively with bacterial vaginosis and Mycoplasma hominis." J Infect Dis 191(1): 25-32. Spear, G. T., D. Gilbert, et al. (2011). 
"Pyrosequencing of the genital microbiotas of HIV-seropositive and -seronegative women reveals Lactobacillus iners as the predominant Lactobacillus Species." Appl Environ Microbiol 77(1): 378-381. Spear, G. T., M. Sikaroodi, et al. (2008). "Comparison of the diversity of the vaginal microbiota in HIV-infected and HIV-uninfected women with or without bacterial vaginosis." J Infect Dis 198(8): 1131-1140. Spiegel, C. A., R. Amsel, et al. (1983). "Diagnosis of bacterial vaginosis by direct gram stain of vaginal fluid." J Clin Microbiol 18(1): 170-177. Stackebrandt, E., W. Frederiksen, et al. (2002). "Report of the ad hoc committee for the re-evaluation of the species definition in bacteriology." Int J Syst Evol Microbiol 52(Pt 3): 1043-1047. Stamey, T. A., M. Condy, et al. (1977). "Prophylactic efficacy of nitrofurantoin macrocrystals and trimethoprim-sulfamethoxazole in urinary infections. Biologic effects on the vaginal and rectal flora." N Engl J Med 296(14): 780-783. Strus, M., M. Brzychczy-Wloch, et al. (2006). "The in vitro effect of hydrogen peroxide on vaginal microbial communities." FEMS Immunol Med Microbiol 48(1): 56-63.  105  Sun, X. W., T. V. Ellerbrock, et al. (1995). "Human papillomavirus infection in human immunodeficiency virus-seropositive women." Obstet Gynecol 85(5 Pt 1): 680686. Sweet, R. and R. Gibbs (2009). Infectious Diseases of the Female Genital Tract, 5th ed. Baltimore, Lippincott Williams & Wilkins. Sweet, R. L. (1995). "Role of bacterial vaginosis in pelvic inflammatory disease." Clin Infect Dis 20 Suppl 2: S271-275. Tam, M. T., M. Yungbluth, et al. (1998). "Gram stain method shows better sensitivity than clinical criteria for detection of bacterial vaginosis in surveillance of pregnant, low-income women in a clinical setting." Infect Dis Obstet Gynecol 6(5): 204-208. Thies, F. L., W. Konig, et al. (2007). "Rapid characterization of the normal and disturbed vaginal microbiota by application of 16S rRNA gene terminal RFLP fingerprinting." 
J Med Microbiol 56(Pt 6): 755-761. Thompson, M. A., J. A. Aberg, et al. (2012). "Antiretroviral treatment of adult HIV infection: 2012 recommendations of the International Antiviral Society-USA panel." JAMA 308(4): 387-402. Tibayrenc, M. (2007). Encyclopedia of Infectious Diseases: Modern Methodologies. Hoboken, New Jersey, John Wiley & Sons. Totten, P. A., R. Amsel, et al. (1982). "Selective differential human blood bilayer media for isolation of Gardnerella (Haemophilus) vaginalis." J Clin Microbiol 15(1): 141-147. Turnbaugh, P. J., R. E. Ley, et al. (2007). "The human microbiome project." Nature 449(7164): 804-810. Ugwumadu, A., P. Hay, et al. (1997). "HIV-1 infection associated with abnormal vaginal flora morphology and bacterial vaginosis." Lancet 350(9086): 1251. van der Meijden, W. I. (1984). "Clinical aspects of Gardnerella vaginalis-associated vaginitis. A review of the literature." Scand J Urol Nephrol Suppl 86: 135-141. Veazey, R. S., M. DeMaria, et al. (1998). "Gastrointestinal tract as a major site of CD4+ T cell depletion and viral replication in SIV infection." Science 280(5362): 427431. Velez, M. P., S. C. De Keersmaecker, et al. (2007). "Adherence factors of Lactobacillus in the human gastrointestinal tract." FEMS Microbiol Lett 276(2): 140-148. Verhelst, R., H. Verstraelen, et al. (2004). "Cloning of 16S rRNA genes amplified from normal and disturbed vaginal microflora suggests a strong association between Atopobium vaginae, Gardnerella vaginalis and bacterial vaginosis." BMC Microbiol 4: 16. Verstraelen, H. and R. Verhelst (2009). "Bacterial vaginosis: an update on diagnosis and treatment." Expert Rev Anti Infect Ther 7(9): 1109-1124. Verstraelen, H., R. Verhelst, et al. (2009). "Longitudinal analysis of the vaginal microflora in pregnancy suggests that L. crispatus promotes the stability of the normal vaginal microflora and that L. gasseri and/or L. iners are more conducive to the occurrence of abnormal vaginal microflora." BMC Microbiol 9: 116. 
Ward, D. M., R. Weller, et al. (1990). "16S rRNA sequences reveal numerous uncultured microorganisms in a natural community." Nature 345(6270): 63-65.  106  Watts, D. H., M. Fazzari, et al. (2005). "Effects of bacterial vaginosis and other genital infections on the natural history of human papillomavirus infection in HIV-1infected and high-risk HIV-1-uninfected women." J Infect Dis 191(7): 1129-1139. Wells, J. I. and S. H. Goei (1981). "Rapid identification of Corynebacterium vaginale in non-purulent vaginitis." J Clin Pathol 34(8): 917-920. Winner HI, H. R. (1964). Candida albicans. London, J & A Churchill. Witkin, S. S., I. M. Linhares, et al. (2007). "Bacterial flora of the female genital tract: function and immune regulation." Best Pract Res Clin Obstet Gynaecol 21(3): 347-354. Wolner-Hanssen, P., J. N. Krieger, et al. (1989). "Clinical manifestations of vaginal trichomoniasis." JAMA 261(4): 571-576. Workowski, K. a. B., Stuart (2010). "Sexually Transmitted Diseases Treatment Guidelines, 2010." Centers for Disease Control and Prevention, MMWR 59(No. RR-12): 56-58. Yamamoto, T., X. Zhou, et al. (2009). "Bacterial populations in the vaginas of healthy adolescent women." J Pediatr Adolesc Gynecol 22(1): 11-18. Zar, J. H. (2005). Encyclopedia of Biostatistics Second Edition. Chichester, England, John Wiley & Sons, Ltd: 4191-4196. Zhou, X., S. J. Bent, et al. (2004). "Characterization of vaginal microbial communities in adult healthy women using cultivation-independent methods." Microbiology 150(Pt 8): 2565-2573. Zhou, X., C. J. Brown, et al. (2007). "Differences in the composition of vaginal microbial communities found in healthy Caucasian and black women." ISME J 1(2): 121133. Zhou, X., M. A. Hansmann, et al. (2010). "The vaginal bacterial communities of Japanese women resemble those of women in other racial groups." FEMS Immunol Med Microbiol 58(2): 169-181. Zinnemann, K., and Turner, G.C. (1963). 
"The taxonomic position of "Haemophilus vaginalis" (Coynebacterium vaginale)." J. Pathol. Bacteriol.(85): 213-219.  107  APPENDIX  108  APPENDIX INDEX Subject Information and Consent Form…………..………………………………….....110 Pelvic Exam Findings Form………….............................................................................115 Vogue 1B Data Collection Form…………………………………………….…............118 QIAEX II Methodology Modifications…………………………………….....………..130 Analyzing Pyrosequencing Data 2012 Handbook……………………………………...131 Figure S1 Heat Map of Bacterial Taxa at 55% Cut off…………………………………191 Figure S2 Heat Map of Bacterial Taxa at 78% Cut off…………………………………192  109  Vogue (Vaginal Microbiome Group Initiative)  Subject Information and Consent Form Vogue (Vaginal Microbiome Group Initiative) – Study 1B Principal Investigator: Dr. Deborah Money: Professor, UBC Department of Obstetrics and Gynaecology Researcher: Daljeet Mahal  Sponsor(s): Canadian Institutes of Health Research (CIHR) Genome British Columbia (Genome BC) INTRODUCTION You are being invited to take part in this research study because you are a woman living with HIV and are scheduled for a planned pelvic exam (PAP test). BACKGROUND AND PURPOSE OF THIS STUDY Changes to the communities of bacteria that naturally exist inside the vagina can have negative effects for women. These effects include: increased risk of being infected with sexually transmitted infections, preterm birth, problems becoming pregnant, early pregnancy loss, and infections that may significantly impact quality of life. However, researchers and doctors still do not know very much about what kinds of bacteria and how many of these bacteria are normally present in a healthy vagina. 
A better understanding of what types of bacteria are in the vagina is needed in order to identify the subtle imbalances and shifts in bacterial populations that are “healthy” and maintain reproductive health, versus “unhealthy” that have negative effects and increase the chances of disease. As part of our study, we want to study the vaginal bacteria present in a sample of 75 HIV-positive women between the ages of 18 and 49. The purpose of our study is to identify the different types and numbers of bacteria living in the vagina using a new, highly specific DNA-based method. We want to learn which bacteria in the vagina are associated with health and which are associated with disease. We hope the information gained through our study will help us to develop tests and therapy to diagnose and treat abnormal vaginal bacteria before they lead to greater health problems for women.

YOUR PARTICIPATION IS VOLUNTARY
Your participation is entirely voluntary, so it is up to you to decide whether or not to take part in this study. Before you decide, it is important for you to understand why the research is being done and what it will involve. This consent form will tell you about the study and why the research is being done, what it will involve, and the possible benefits, risks and discomforts to help you decide whether or not you wish to take part. If you wish to participate, you will be asked to sign this form. If you do decide to take part in this study, you can choose to withdraw at any time without giving any reasons for your decision.

Subject Consent Form Study 1B Version 3 28MAR2012 REB approval:

If you do not wish to participate, you do not have to provide any reason for your decision not to participate, nor will you lose the benefit of any medical care to which you are entitled or are presently receiving.
Please take time to read the following information carefully and to discuss it with your family and your doctor before you decide.

WHO CAN PARTICIPATE IN THIS STUDY?
Women presenting to a research or medical clinic for a pelvic examination who:
1. Are HIV infected.
2. Are 18 to 49 years of age.
3. Have an adequate comprehension of the English language to sign written informed consent.
4. Are not currently pregnant.

WHO SHOULD NOT PARTICIPATE IN THIS STUDY?
Women who have one or more of the following exclusion criteria SHOULD NOT participate:
1. Are not HIV infected.
2. Are younger than 18 years of age or older than 49 years of age.
3. Are menopausal.
4. Are not able to provide written informed consent.
5. Are currently pregnant.

WHAT DOES THIS STUDY INVOLVE FOR YOU?
If you decide to take part in this study, and sign this consent form, you can expect the following:
•  As part of your scheduled gynaecological examination, you will undergo a speculum exam. A speculum will be inserted into the vagina to open the vaginal canal in order to see the cervix (which is at the end of the vagina). Swabs will then be taken by gently brushing against the surface of the vagina. These samples will be taken as indicated for a pap test or other planned exam to test for infections. These samples are unrelated to the study and will be collected as part of your scheduled examination regardless of whether you participate in the study or not.
•  For women that have consented to the study, we will take four additional swab samples from your vagina. The collection of additional vaginal swabs will be taken at the same time as your planned swabs and will take a minimal amount of time (1-2 minutes). The collection of the additional samples should not add any discomfort to your examination. It is possible that study staff may contact you in the future to request further samples to do more testing directly related to this study.
Whether or not you agree to provide any additional samples in the future will be completely voluntary and you do not have to provide any reason for your decision if you decide not to provide the additional samples. If you do agree to provide additional samples in the future, this will be reviewed with you in a separate consent form.
•  Either prior to or following your exam, a researcher will ask you questions related to your medical and sexual history. You do not have to answer any questions you do not feel comfortable answering. Answering the questions should only take 15-20 minutes of your time. We may also need to collect some information from your medical chart and the Oak Tree Clinic electronic health records, after your visit. This information may include basic demographic and clinical data. We will be collecting your Personal Health Number (PHN) in order to access this pertinent clinical information. It is possible that study staff may need to contact you in the future by telephone or email to request further information regarding your medical or sexual history if it cannot be found in your chart.
•  One of the four vaginal swabs we collect will be transported to the clinical laboratory at the BC Centre for Disease Control for Gram Stain analysis. This involves a lab technician taking your vaginal sample and making a smear on a glass slide using color stains and then viewing your sample using a microscope to examine the different bacteria that are present in the vagina.
•  The remaining three vaginal swabs we collect will be stored at -20°C or frozen at -80°C as required and transported to the Tang laboratory at BC Centre for Disease Control where they can only be accessed by researchers directly involved with our study.
The identities of the subjects from which the swab samples were obtained will be kept strictly confidential and can only be accessed by study researchers. Samples will be processed at the Tang Laboratory and/or shipped on a regular basis to our partner laboratories at the University of Saskatchewan and the University of Western Ontario for study analyses. The individual study participant’s DNA or tissue will not be analyzed in any of these samples as this study’s objective is microbial analysis. If there are any of your unused samples left after all the study testing has been done, we will ask your permission to store your samples at the Tang laboratory or one of our partner laboratory sites for a maximum of 25 years for future research related to this study or similar studies looking at the vaginal microbiome. This will be reviewed with you in a separate consent form.

POTENTIAL RISKS AND BENEFITS
There are no research-related risks associated with this study. The collection of additional swabs for the purpose of this study will require a very small amount of additional time (approximately 1-2 minutes) over your planned examination. In some cases, women experience minor discomfort when vaginal swabs are collected. We will minimize the inconvenience and potential discomfort by collecting all study swabs at the same time your planned swabs are taken. There are no direct benefits to participating in this study but you are possibly benefiting women in the future by helping us determine what types of vaginal bacteria are associated with health and disease in women.

NEW FINDINGS
You will be told of any new information learned during the course of the study that might cause you to change your mind about staying in the study.
At the end of the study, you will be provided with the overall results of the study; however, we will not provide individual results.

WHAT HAPPENS IF YOU DECIDE TO WITHDRAW YOUR CONSENT?
Your participation in this research is entirely voluntary. You are under no obligation to be included in this study. You may withdraw from this study at any time. If you decide to enter the study and to withdraw at any time in the future, there will be no penalty or loss of benefits to which you are otherwise entitled, and your future medical care will not be affected. If you choose to withdraw, your samples and collected data will be destroyed.

COSTS AND REIMBURSEMENTS
The study doctor will not receive any money for your participation in this study. There is no cost to you for participating in this study. You will, however, receive a $20 honorarium for participating in this study.

CONFIDENTIALITY
Your confidentiality will be respected. No information that discloses your identity will be released or published without your specific consent to the disclosure. All information that is obtained will be dealt with in a confidential manner. The information will be entered into a data file. This data will be identified by code-number instead of by your name. Only the researchers will have access to the code. However, research records and medical records identifying you may be inspected in the presence of the Investigator or his/her designate, Health Canada, and the UBC Research Ethics Boards for the purpose of monitoring the research. However, no records that identify you by name or initials will be allowed to leave the Investigators’ offices. Signing this consent form in no way limits your legal rights against the sponsor, investigators, or anyone else.

WHO TO CONTACT IF YOU HAVE QUESTIONS OR CONCERNS ABOUT YOUR RIGHTS AS A SUBJECT DURING THE STUDY
If you have any questions or concerns about this study, please contact Dr. Deborah Money.
If you have any questions or concerns regarding your rights as a research subject, please call the Research Subject Information Line in the University of British Columbia (UBC) Office of Research Services. You may also email your questions or concerns.

Vogue (Vaginal MicrObiome Project TEam)
Please indicate the following by marking the boxes below:
□ I have read and understood the subject information and consent form.
□ I have had sufficient time to consider the information provided and to ask for advice if necessary.
□ I have had the opportunity to ask questions and have had satisfactory responses to my questions.
□ I understand that the information collected will be kept confidential and that the result will only be used for scientific objectives.
□ I understand that my participation in any research study is voluntary and that I am completely free to refuse to participate or to withdraw from this study at any time without changing in any way the quality of care that I receive.
□ I understand that I am not waiving any of my legal rights as a result of signing this consent form.
□ I understand that there is no guarantee that this study will provide any benefits to me.
□ I understand that study staff will be collecting my PHN to access clinical data.
□ I understand study staff may need to collect basic demographic and clinical data from my medical chart and the Oak Tree Clinic electronic health records.
□ I understand study staff may contact me in the future by telephone or email if they have further questions related to my medical or sexual history that cannot be found in my chart.
□ I understand study staff may contact me in the future by telephone or email to request additional vaginal swab samples for further testing directly related to this study.
I understand that providing additional samples in the future is voluntary and I can choose not to provide the samples without giving any reason for my decision.
□ I have read this form and I freely consent to participate in this study.
□ I have been told that I will receive a dated and signed copy of this consent form for my records.
□ I hereby consent to participate in the study.

Name of Subject (Please print)    Signature of Subject    Date
Name of Person conducting consent (Please print)    Signature of Person conducting consent    Date
Name of Principal Investigator/Designated Representative (Please print)    Signature of Principal Investigator/Designated Representative    Date

The person who may be contacted about this research is: Dr. Deborah Money.

Subject Initials: ___ ___ ___  Subject # 01- ___ ___ ___  Date of Visit: ___/___/20 __ __

PELVIC EXAM FINDINGS
a. External genital exam
□ Normal
□ Abnormal →
   □ warts/condylomas
   □ ulcer
   □ evidence of female circumcision
   □ dermatological abnormality: specify if possible: ___________________________
   □ other, specify: ______________________
b. Speculum exam
Vaginal Appearance: □ Normal  □ Abnormal → Please specify ____________________________
Vaginal Discharge: □ Normal  □ Abnormal → Please specify: Colour: _____________________ Consistency: _________________ Volume: _____________________
Cervical Appearance: □ Normal  □ Abnormal (e.g. mucopurulent cervicitis) → Please specify _______________
c. Bimanual exam
□ Normal  □ Abnormal → Please specify: ____________________________
d. Wet Mount: performed □  not performed □
pH: lower □  normal □  elevated □  range (number): __________
e. HPV Positive: Yes □  No □  Unknown □  Type (if available) ______________
HPV vaccine: Yes □  No □  Date: _ _ / _ _ _ / _ _ _
f. Most Recent Pap Results  Date: _ _ / _ _ _ / _ _ _ _  Result (if available) ______________________
g.
Other: __________________________________________________________
Comments: _______________________________________________________
Verbal Consent Obtained Prior to Sample Collection:  □ Yes  □ No
Verbal Consent / Pelvic Performed by ____________________ Research Staff Signature

VOguE Pelvic Exam Source document 11APR2011

CLINICAL CHECKLIST
DATE exam performed: _ _ / _ _ _ / _ _ _ _
TIME exam performed: _ _:_ _

ITEM    YES    NO    Specify Date/Time if different from above (__ __/__ __ __/__ __  _____:_____)
1. External Genital Exam performed    □    □
2. Bimanual Exam performed    □    □
3. Speculum Exam performed    □    □
   A. Gram Stain Study Swab collected    □    □
   B. Study Tubes collected    □    □
      Number of tubes: ________  Samples collected: A □  B □  C □  D □
   C. Chlamydia / Gonorrhoea Swabs collected (not required; not included in study kit)    □    □
   D. Culture Swab collected (not required; not included in study kit)    □    □
   E. Herpes Swab collected (not required; not included in study kit)    □    □
   F. Trichomonas Swab collected (not required; not included in study kit)    □    □
   G. Wet Mount (optional)    □    □

Obtained by ___________________________ Research Staff Signature

LAB RESULTS (please attach any documentation)
Report Date (dd/mm/year): __ __/ __ __/ __ __ __ __
Gonorrhea: Done □  Not Done □  Not Processed □    Result: + □ / - □
Chlamydia: Done □  Not Done □  Not Processed □    Result: + □ / - □
Herpes: Done □  Not Done □  Not Processed □    Result: + □ / - □
Trichomonas: Done □  Not Done □  Not Processed □    Result: + □ / - □
Gram Stain: Nugent’s Score (0-10) ____________
Is the Nugent score: Consistent with BV □  Intermediate BV □  Not Consistent with BV □
Interpretation: _________________________________________________________
Vaginal Swab: Was Culture done: Yes □ / No □
Result: _______________________________________________________________

Obtained by ___________________________ Research Staff Signature

Subject Initials: ___ ___ ___  Subject # 01 - ___ ___ ___  Date of Visit: __ __/__ ___ __/20__

INCLUSION CRITERIA (YES / NO / n/a)
Subject is HIV positive    YES □  NO □
Subject is not menopausal    YES □  NO □
Subject is not currently pregnant    YES □  NO □
Subject is between 18 and 49 years of age    YES □  NO □
Does subject meet all inclusion criteria?  Yes □  No □
Confirmed by: _____________________________ Research Staff Signature

Vaginal Microbiome Group Initiative Study 1B – PI: Dr. D. Money Visit Source Document REB approval: 11APR2011

INFORMED CONSENT PROCESS
YES □  NO □  n/a □
1. Was the study “Informed Consent” form presented in the subject’s language of preference?
If English is not language of preference, list language preference ________________
2. If English is not the subject’s language of preference, was a translator present when the Informed Consent forms were read and discussed?  YES □  NO □
3. Does the subject understand the study procedures and agree to participate in the study by giving written informed consent?  YES □  NO □
4. Was the subject allowed to ask questions medical in nature?  YES □  NO □
5. Have all the subject’s questions about the study been answered?  YES □  NO □
   List questions asked _______________________________________
   If NO, please comment below:
6. Does the subject understand that her participation in this study is voluntary?  YES □  NO □
   If NO, please comment below:
8. Was the subject given a signed copy of the informed consent form?  YES □  NO □

The Informed Consent form has been read in its entirety by the subject. Discussions have been conducted and the subject’s questions have been answered by the Investigator/ RN/ Research staff member. The subject has signed the Informed Consent form prior to having any study procedures performed.  Yes □  No □
Date consent signed __ __/__ __ __/__ __
Time consent signed _________ (24-hour clock)
Performed by: ____________________ Research Staff Signature
Attach additional pages if needed.

DEMOGRAPHICS
DATE OF BIRTH: __ __/__ __ __/__ __ __ __    AGE: _______
Height: ___________ cm □  inches □
Weight: ___________ kg □  lbs □
BMI: ________ kg/m²
Ethnicity:
□ White / Caucasian
□ Black / African Canadian
□ Hispanic
□ Asian
□ South Asian
□ Aboriginal / First Nations / Métis / Inuit _______________________
□ Other (specify mother/father ethnicities) ______________________
Marital Status: □ Single  □ Married/Common Law  □ Other (specify) ___________
Highest Education level attained:
□ Did not complete high school
□ High school diploma
□ Some Post-secondary
□ Post-secondary/Undergraduate Degree (Bachelor’s)
□ Graduate Degree (e.g. Master’s, Ph.D)
□ Other: __________________
First 3 Digits of Postal Code: ___________
Obtained by ___________________________ Research Staff Signature

GENERAL MEDICAL HISTORY
Does participant have any known significant current or chronic disease?  No    Yes, please complete the following:
System    Diagnosis
Respiratory (e.g. asthma)
Gastrointestinal (e.g. celiac, inflammatory or irritable bowel syndrome)
Musculoskeletal (e.g. arthritis)
Genitourinary (not incl. infections)
Allergies / Autoimmune Disorders
Cardiovascular
Other
Collected by: _____________________ Research Staff Signature

GENITAL INFECTION HISTORY
Has the subject ever been diagnosed with one of the conditions listed below?
Uncertain  No  Yes Bacterial Vaginosis  No  Yes, please complete the following list: Not Sure  Number of Infections Treatment for most Past 2 Past 1 Lifetime recent infection: months Year None Prescription Over the counter Natural products  Yeast Infection Candida UTI  Trichomoniasis Genital Warts Condylomas  Genital Herpes  Chlamydia  Gonorrhea  Syphillis  None Prescription Over the counter Natural products None Prescription Over the counter Natural products None Prescription Over the counter Natural products None Prescription Over the counter Natural products None Prescription Over the counter Natural products None Prescription Over the counter Natural products None Prescription Over the counter Natural products None Prescription Over the counter Natural products  Obtained by ___________________________ Research Staff Signature Vaginal Microbiome Group Initiative Study 1B – PI: Dr. D. Money Visit Source Document REB approval: 11APR2011  Page 5 of 12  122  Subject Initials: ___ ___ ___  Subject # 01 - ___ ___ ___  Date of Visit: __ __/__ ___ __/20__  ANTIMICROBIAL USE Apart from the responses above and excluding antiretroviral combinations, has the subject taken any antimicrobials in the past 3 months?  Yes No This includes: oral medication, topical medication, and intravaginal medication If YES, complete the following section: ANTIMICROBIAL USE Drug name  Date started  Date stopped  Dose/freq  Reason for antibiotic treatment  Obtained by ___________________________ Research Staff Signature  PRESCRIPTION/NON-PRESCRIPTION DRUG USE Is this patient currently taking, or has this patient taken any prescription/non-prescription drugs, including probiotic supplements or herbal remedies and excluding antiretroviral combinations, in the past two months?  
Yes No If YES, complete the following section:  PRESCRIPTION/NON-PRESCRIPTION DRUG USE Drug name  Date started  Date stopped  Dose/freq  Reason for drug treatment  Obtained by ___________________________ Research Staff Signature Vaginal Microbiome Group Initiative Study 1B – PI: Dr. D. Money Visit Source Document REB approval: 11APR2011  Page 6 of 12  123  Subject Initials: ___ ___ ___  Subject # 01 - ___ ___ ___  Date of Visit: __ __/__ ___ __/20__  REPRODUCTIVE HEALTH When was your last menstrual period? (1st day of LMP) __ __/__ __ __/__ __ __ __ Do you have a “normal” menstrual cycle? (i.e. period every 3-5 weeks) □ Yes □ No In the past year, how often did you use tampons during your periods? Never Sometimes but not for every period Every period / Part of the time Every period / Exclusively Pregnancy History : G______ T______  P______  SA______  TA______  L_______  Have you noticed any of the following vaginal symptoms? Abnormal discharge Abnormal odor Irritation or discomfort Other (please describe)  Past 48 hours □ Yes □ No □ Yes □ No □ Yes □ No □ Yes □ No  Past 2 weeks □ Yes □ No □ Yes □ No □ Yes □ No □ Yes □ No  USE OF FEMININE HYGIENE PRODUCTS Do you use douche products? Yes No If YES, how often: □ Daily □ Monthly □ A few times per week □ Every few months □ A few times a month If YES, what product(s) do you use? ______________ Have you douched in the past 48 hours? □ Yes □ No Do you use feminine wipes or genital deodorant products? Yes If YES, how often: □ Daily □ Monthly □ A few times per week □ Every few Months □ A few times a month If YES, what product(s) do you use? ______________ Have you used these products in the past 48 hours? □ Yes  No  □ No  Obtained by ___________________________ Research Staff Signature Vaginal Microbiome Group Initiative Study 1B – PI: Dr. D. 
Money Visit Source Document REB approval: 11APR2011  Page 7 of 12  124  Subject Initials: ___ ___ ___  Subject # 01 - ___ ___ ___  Date of Visit: __ __/__ ___ __/20__  SAFER SEX PRACTICES Since your last menstrual period, what method of contraception have you used? * NA – Current partner is a female * NA – not sexually active * None / Withdrawal / Rhythm method * Hormonal (if used, fill out section below with most recent product) * Surgical Sterilization: * Subject * Partner * Barrier: * Male Condom * Female Condom * Copper IUD * Sponge * Spermicide * Diaphragm * Abstinence * Other, specify ____________ Hormonal contraceptive use Progestin only pills Estrogen/Progestin Combination pills Nuvaring vaginal ring Mirena IUD Depo Provera injection Ortho Evra patch Implanon implant Emergency contraceptive pill Other Hormonal, specify _____________  Current Use Total # yrs Used * __________ * __________ * __________ * __________ * __________ * __________ * __________ * __________ * __________  If Not Known, Record Product Name: _________________________ Comments: __________________________________________________________________ __________________________________________________________________ __________________________________________________________________ __________________________________________________________________  Obtained by _______________________Research Staff Signature Vaginal Microbiome Group Initiative Study 1B – PI: Dr. D. Money Visit Source Document REB approval: 11APR2011  Page 8 of 12  125  Subject Initials: ___ ___ ___  Subject # 01 - ___ ___ ___  Date of Visit: __ __/__ ___ __/20__  SEXUAL ACTIVITY Are your sexual partners: □ Male, □ Female, or □ Both? Have you had vaginal intercourse in the past 48 hours? 
□ Yes  □ No  Number of partners you have had vaginal intercourse with in the past year:___________ Number of partners you have had vaginal intercourse with in the past 2 mos:__________ Do you experience any pain or discomfort during vaginal intercourse? □ Yes □ No If yes, how often _________________ (%) How often do you engage (receive) in oral sex? □ Never □ Daily □ Weekly □ Twice Per Month □ Monthly □ Other (Specify) _________________________________ Have you had oral sex in the past 48 hours? □ Yes □ No How often do you engage in anal sex? □ □ □ □  Never Daily Weekly Monthly  □ Other (Specify) ___________________________________  Have you had anal sex in the past 48 hours? □ Yes □ No How often do you use sex toys? □ Never □ Daily □ Weekly □ Monthly □ Other (Specify) ___________________________________ Have you used a sex toy in the past 48 hours? □ Yes □ No What kind of sex toy(s) do you use: __________________________________ Are they penetrative?  □ Yes  □ No  Do you and your partner(s) use the same toys?  □ Yes  □ No  Obtained by ___________________________ Research Staff Signature  Vaginal Microbiome Group Initiative Study 1B – PI: Dr. D. Money Visit Source Document REB approval: 11APR2011  Page 9 of 12  126  Subject Initials: ___ ___ ___  Subject # 01 - ___ ___ ___  Date of Visit: __ __/__ ___ __/20__  HISTORY OF SUBSTANCE USE Drug Use Please assess current and/or historical use of the following substances. Enter number of years used. Amount: 1= Occasionally 2= once or twice a week 3= about once daily 4= more than once daily. 
Currently Using Previously Used Never (within past 3 months) (prior to last 3 months) Substance Years Years Used Amount Amount (duration)  Heroin – Inhaled Heroin – IV Heroin – other Cocaine - inhaled Cocaine – IV Cocaine – other Crack – all methods Crystal meth – inhaled Crystal meth – IV Crystal meth – other THC/Marijuana Opiates/Opioids Benzodiazapines Methadone Other, specify Other, specify  Current Alcohol Use £ None £ Occasional drink day  (duration)  £ £ £ £ £ £ £ £ £ £ £ £ £ £ £ £ £ 2-3 Drinks per week  £ daily à ____ of drinks per  Has subject ever had an alcohol abuse problem? £ No £ Currently £ Historically à _____ # of years, If past when did she stop?__ __/__ __ __/__ __ __ __ Tobacco Use £ Never smoked £ Current Smoker à Average cigarettes per day ______ # of years ______ £ Past Smoker à Average cigarettes per day ______ # of years ______ à Quit how long ago _______ (years, months, or date) Obtained by: _____________________Research Staff Signature Vaginal Microbiome Group Initiative Study 1B – PI: Dr. D. Money Visit Source Document REB approval: 11APR2011  Page 10 of 12  127  Subject Initials: ___ ___ ___  Subject # 01 - ___ ___ ___  Date of Visit: __ __/__ ___ __/20__  HIV HISTORY AND DATA Likely mode of HIV acquisition: (chart review only) £ IV drug use £ Blood products or percutaneous (eg. 
Tattoo) £ Sexual Contact £ Perinatal transmission £ Unknown £ Other, specify ______________ First HIV positive test result: __ __ __/ __ __ __ __ (mon/yyyy)  CD4 Nadir: __________ X10^9/L ___________ % ____________ (dd.mon.yyyy) Highest VL Ever: ______________ copies/mL  ________________ (dd.mon.yyyy)  Baseline HIV Labs: (closest to day of study visit) CD4: __________ X10^9/L ___________ % VL: _______________ copies/mL  _______________ (dd.mon.yyyy)  __________________ (dd.mon.yyyy)  Please obtain most recent copies of results of the following tests: Status Result CD4 Nadir HCV Antibody Done Not Done + / HCV PCR  Done  Not Done  +  / -  HBV sAb  Done  Not Done  +  / -  sAg  Done  Not Done  +  / -  Done Done  Not Done Not Done  +  / _________  cAb HIV Clade  Obtained by ___________________________ Research Staff Signature Vaginal Microbiome Group Initiative Study 1B – PI: Dr. D. Money Visit Source Document REB approval: 11APR2011  Page 11 of 12  128  Subject Initials: ___ ___ ___  Subject # 01 - ___ ___ ___  Date of Visit: __ __/__ ___ __/20__  Antiretroviral History: Combo #1 - Drugs  Start Date dd.mmm.yyyy  Stop Date dd.mmm.yyyy  Combo #2 - Drugs  Start Date dd.mmm.yyyy  Stop Date dd.mmm.yyyy  Combo #3 - Drugs  Start Date dd.mmm.yyyy  Stop Date dd.mmm.yyyy  Combo #4 - Drugs  Start Date dd.mmm.yyyy  Stop Date dd.mmm.yyyy  Combo #5 - Drugs  Start Date dd.mmm.yyyy  Stop Date dd.mmm.yyyy  Obtained by ___________________________ Research Staff Signature Vaginal Microbiome Group Initiative Study 1B – PI: Dr. D. Money Visit Source Document REB approval: 11APR2011  Page 12 of 12  129  Modifications	
to QIAEX II Handbook 10/2008 protocol:
1. At step 3, 20 uL of QIAEX II beads were added to each gel fragment instead of 10 uL of QIAEX II beads.
2. At step 7, the sample in one tube was resuspended with 500 uL of PE wash buffer and added to the beads in the second tube. This combination of sample beads into a single tube occurred during the first wash of step 7.
3. At step 9, DNA was eluted with Pure TE buffer.
4. Step 11 was completed with the eluted samples stored in separate tubes.
2012

Analyzing Pyrosequencing Data (for dummies)
Written by: Bonnie Chaban
Hill lab resource, last modified: 3/20/2012

Table of Contents
Introduction
Section 1 - Unassembled reads
    FASTA files from the run (not assembled)
    Nearest neighbour analysis from unassembled reads
    Watered Blast search of fasta sequences
Section 2 - Assembled reads
    de novo Assemblies
    Files generated in mPUMA pipeline from pyrosequencing data
Section 3 - Frequency Tables
    Frequency table of GeneSpring isotig list
    Adding the best database match ID for each isotig
    Frequency table of GeneSpring isotig list without create_composite_text_columns.pl script
    Generating a frequency table in Excel for unassembled nearest neighbour data (section 1)
Section 4 - Chimera checking
    Do chimera check of your isotigs
    Removing chimeras from your frequency tables
Section 5 - Useful information about your final isotig dataset from cpnDB
    Pulling the taxonomic lineage for the best database match of each isotig from cpnDB
    Pulling the FASTA sequence for the best database match of each isotig from cpnDB
Section 6 - Doing a nearest neighbour analysis of assembled isotigs
Section 7 - mothur diversity statistics and rarefaction curves
    Creating input files for mothur
    Running mothur
Section 8 - (Fast) Unifrac
    Making input files for Unifrac
    Using Unifrac
Section 9 - GeneSpring
    Making GeneSpring input files from your own frequency tables
    Running GeneSpring
Section 10 - MEGAN
Appendix A - Bonnie’s way of making phylogenetic trees
Appendix B - Installing Cygwin X on a PC (for running GeneSpring)
Appendix C - Loading data into GeneSpring and basic functions
Appendix D - Basic starting with WinSCP and Putty
Appendix E - Useful things to know and alternative procedures
    Automating the command line processes
    To pull out a list of FASTA sequences from a large file of FASTA sequences
    Alternative way to make chimera checking table (step 2)
Appendix F - Databases we have
Appendix G - Signature Oligo

Introduction
This is a collection of computational protocols that have been used to analyze 454 pyrosequencing data in the past. Versions of programs will change, as well as our ideas about how these analyses should be done. The goal of this document is to provide some basic instructions on how to do different computational manipulations. It’s a starting point. If you generate a result that looks unexpected or strange, look into it further. You are the smartest analysis device this process has. All the instructions here are just tools for you to build with. Also keep in mind, there are probably better ways to do many of these processes - these instructions are coming from someone who has more Windows and less Unix background, so this is just one person’s gimmicked way of doing it. Good luck.

Section 1 - Unassembled reads
FASTA files from the run (not assembled)
When a sequencing run is done on a 454 Roche machine, the system outputs an .sff file for each region on the machine. The first thing that needs to be done is for the .sff file to be separated into MID-tagged individual sample library files. In our group, this processing step will be done for you before you ever see the data (and most sequencing centers that you would pay to run your pyrosequencing samples would include this processing step in their service for you). So that you are aware, this separation of sequences is done using a collection of programs called sff tools, which includes programs like sffinfo and sfffile to extract the DNA sequences and divide the MID-tagged libraries. Our group also has custom scripts to additionally recover data, such as when we get partial MID sequence degradation.
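To make the idea of "separating a run into MID-tagged libraries" concrete, here is a minimal Python sketch. It is only a toy illustration and is not how sfffile or the group's recovery scripts work: it assumes each read simply begins with its MID tag, and the tag sequences, read names and the `demultiplex` helper are all hypothetical.

```python
def demultiplex(reads, mid_tags):
    """Group reads by the MID tag found at the start of each read.

    reads: {read_name: sequence}
    mid_tags: {sample_name: tag_sequence}  (hypothetical tags, not real MIDs)
    Returns {sample_name: {read_name: sequence_without_tag}}; reads that
    start with none of the tags are collected under "unassigned".
    """
    libraries = {sample: {} for sample in mid_tags}
    libraries["unassigned"] = {}
    for name, seq in reads.items():
        for sample, tag in mid_tags.items():
            if seq.startswith(tag):
                # Strip the tag so only the sample's own sequence remains.
                libraries[sample][name] = seq[len(tag):]
                break
        else:
            # No tag matched: keep the read aside instead of guessing.
            libraries["unassigned"][name] = seq
    return libraries


# Made-up tags and reads, for illustration only.
mid_tags = {"MID1": "AAAAC", "MID2": "CCCCG"}
reads = {
    "read1": "AAAACGGTTT",
    "read2": "CCCCGTTAAC",
    "read3": "GGGGGGGG",
}
libraries = demultiplex(reads, mid_tags)
for sample, members in libraries.items():
    print(sample, members)
```

Reads that end up under "unassigned" correspond loosely to the reads a recovery script would try to rescue when the MID tag is only partially readable; that recovery logic is not attempted here.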
Because of the space the unassembled FASTA files take up on the server, we don’t keep them stored. If you want a copy of the FASTA sequences for each of your samples (which is about as close to having a copy of the raw data as you are going to get, since the .sff files are huge), you can generate one.

Generate the unassembled FASTA files from each sample:
1. Open WinSCP and Putty and find your project in the mPUMA pipeline. You can use WinSCP to navigate around to the right place, but you will need to be within Putty to work. Within the project folder, you will see a folder called sffs - THIS IS NOT THE ONE YOU WANT. Keep looking in this directory until you find a folder with something to do with assembly in the name, then look inside that folder for a folder called sff (there is a difference between sffs and sff). Move into the sff folder and you should see a list of file names of the form plate region-MID#. Each sample should have 2 files - one named just plate region-MID#.sff and one named plate region-MID#recovered.sff. These represent, respectively, the sequence reads that were recognized as having a complete MID tag and the sequence reads recovered from an identifiable partial MID tag.
2. At the command line, while you are in the sff folder, use the sffinfo command.
   a. Type: sffinfo -s filename.sff > place and name of output file.fasta
   b. For example: sffinfo -s G4Z6J7202-MID8.sff > /mnt/usb/chaban/BCCDC_H1N1_Flu/MID8.fasta
   c. This command means: extract the sequences (-s) from the sff file called G4Z6J7202-MID8.sff and put them in a file called MID8.fasta in my home directory (/mnt/usb/chaban/BCCDC_H1N1_Flu).
3. Repeat the command for all the sff files in the directory.
4. Move into your home directory and combine the regular and recovered files for the same sample into one file.
   a. Use the cat (concatenate) command: cat file1.fasta file2.fasta > file_all.fasta

The text file you get for a specific MID-tagged sample should look like:

>GJBSLIE09FOC44 length=197 xy=2210_2806 region=9 run=R_2010_06_22_19_33_08_
CTGCGGTCGCCGAAGCCGGGGCTTAACTGCTGATACATTCAGCACACCGCGGAGTTTATT
CACCACCAAAGTTGTAAGTGCTTCGCCCTCAATATCTTCAGCAATGATAAGAAGTGGTTT
GCCACTTTGCATAGTTGCTTCAAGCAAAGGTAGAATCTCTTTCATATTTGAGATTTTTTT
ATCTGTCAAGGCACACA
>GJBSLIE09FMGZR length=101 xy=2189_0501 region=9 run=R_2010_06_22_19_33_08_
GAGGGGCGGGGAGTGGGACGACGACGGCAACCGTTCTTGCATATAGTATTTACAAAGAGG
GTCTAAGAAATACTATATGCAAGAACGGTTGCCGTCGTGGT
>GJBSLIE09FO6U9 length=419 xy=2220_0371 region=9 run=R_2010_06_22_19_33_08_
GAG…..

Each header line has the sequence name, the length of the sequence, its XY coordinate from the sequencing plate, the region of the plate it was run on and a run identifier (the year, month, day, hour, min, sec of the run). You will get a FASTA file for each sample that was sequenced. Once we get a collection of sequences from a sample, we call it a library.

Nearest neighbour analysis from unassembled reads
A nearest neighbour analysis is a way to go through your data and assign a taxonomic name(s) to each sequence. This can be done on each read individually (the unassembled reads) or it can be done on assembled isotigs (see next section). The advantage to this type of analysis is that you generate a list of genus and species names that indicate the closest match of your library sequences to something in the database, together with the frequency of that match occurring. This method also tends to remove non-cpn60 sequences (possibly junk) from the analysis.
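To make "a list of names and a frequency of each match" concrete, here is a minimal Python sketch of the tally. It assumes the best-match name for each read is already known; the read names, the example assignments and the `name_frequencies` helper are made up for illustration and are not output from any real library.

```python
from collections import Counter


def name_frequencies(best_matches):
    """Tally taxonomic names from (read_name, taxon_name) pairs.

    Returns a list of (taxon_name, count) pairs, most frequent first;
    ties are broken alphabetically so the order is reproducible.
    """
    counts = Counter(name for _, name in best_matches)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical best-match assignments for five reads in one library.
best_matches = [
    ("read1", "Lactobacillus crispatus"),
    ("read2", "Lactobacillus iners"),
    ("read3", "Lactobacillus crispatus"),
    ("read4", "Gardnerella vaginalis"),
    ("read5", "Lactobacillus crispatus"),
]

frequencies = name_frequencies(best_matches)
for name, count in frequencies:
    print(f"{name}\t{count}")
```

Reads that are very different from each other but share a best-match name are counted together here, exactly as they would be in the name list.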
The disadvantage to this analysis is that if your sequence doesn’t have a good match to something in the database, it can either be discarded or assigned a poor name (with cpn60 sequences, a general rule, with lots of exceptions, is that matches less than 95% identical are probably a different species and matches less than 80% identical are probably a different genus). As well, several sequences can have the same species as their best match but be very different from each other - these will still get lumped together under the same name. Finally, depending on how the analysis parameters are set, you can get more than one best match for some sequences. This means that your output frequency of names won’t correspond to the number of sequences inputted. This could skew your dataset towards having more names for closely related taxa.

To generate names and frequencies by nearest neighbour analysis, you need to generate/manipulate 3 files. The first 2 files are generated from command line codes on haruspex (use Putty to open a command prompt and WinSCP to move files between your computer and the server), while the third file is an Excel spreadsheet. To begin, have the text file of your library sequences in a folder on haruspex.

Watered Blast search of fasta sequences
Step 1 - Do a watered-BLAST analysis of your library
What watered-BLAST does is take your FASTA sequences and first do a BLASTn search against whatever database you want. It collects the best BLAST match (or matches, if there is a tie in the score) and then does a full sequence alignment on the best match(es) using the Smith-Waterman algorithm. The command line command is:

/home/pipeline/aped/bin/watered_blast.pl -i fasta.file -d blast_db -p blast_program -v numb_hits > output.file

The first path is the location of the script that does the watered-BLAST search;
-i indicates the input file -> this is your FASTA text file (query);
-d indicates which database to search against. In our case, we usually want the database cpnDB_nr_date (this is a non-redundant version of cpnDB that gets updated periodically - check with Janet what the most recent version is). This variable currently reads: /aped/blast_dbs/cpndb_nr_20110622;
-p tells the program to use blastn (for nucleotide) -> has to match your input file;
-v tells the program how many BLASTn results to report; if any of the reported hits have the same score, all those hits will be carried on to the Smith-Waterman alignment (if you only want one match per sequence, you need to set this variable to 1);
> output.file tells the program to output the result to a file called output.file

So, an actual command that I run looks like:

/home/pipeline/aped/bin/watered_blast.pl -i GJBSLIE09-MID1.combined -d /aped/blast_dbs/cpndb_nr_20110622 -p blastn -v 5 > GJBSLIE09-MID1_wateredBlast_5best_unfiltered

This command must be run from the folder your input file is in and will write the output file to the same folder. This program can take anywhere from several seconds to an hour to run, depending on the size of the input file.

Step 2 - Filter your watered-BLAST results
The results from step 1 could include up to 5 names assigned to each FASTA sequence, or ridiculously poor matches with names because the FASTA sequences were too short or not cpn60 sequences at all. To clean up the list, you filter the output.file based on a set of criteria. The command line command to do this is:

/home/pipeline/aped/bin/watered_BLAST_mapping_freq_best.pl -i min_ID -l length -r difference input.file > output.file

The first path is the location of the script that processes the watered-BLAST file;
-i indicates the minimum percent identity the match needs to have: 55% or higher for the match to be to a cpn60 gene; 80% or higher for the name to actually be a reasonable genus match;
-l is the minimum length the match has to span. To be a reasonable match, you usually want about 100 bp;
-r is the percent difference between the top match and the next top match to filter. If there is a difference greater than the number you give between a match and the next match (say 1%), the lower match is removed. You can set this to zero, but that doesn’t guarantee you will only get one name per sequence: if two names are both 87.9% identical to your sequence, you will carry both names on (if you only want one match per sequence, you need to set the original watered_blast -v variable to 1);
input.file is your watered-BLAST output.file;
> output.file tells the program to output the result to a file called output.file

So, an actual command that I run looks like:

/home/pipeline/aped/bin/watered_BLAST_mapping_freq_best.pl -i 55 -l 100 -r 1 GJBSLIE09-MID1_wateredBlast_5best_unfiltered > GJBSLIE09-MID1_wateredBlast_55id_100leng_1diff_filtered

This now gives you a filtered, tab-delimited file that you can work with in Excel. Transfer this file to your computer from haruspex. To get a filtered watered-BLAST output that doesn’t have N/A rows where results have been filtered out, add |grep -v "N/A" to the command. For example:

/home/pipeline/aped/bin/watered_BLAST_mapping_freq_best.pl -i # -l # -r # unfiltered_watered_Blast_input_file |grep -v "N/A" > output_file_name

Step 3 - Generate a list of species and frequencies for each library
Once you have generated your filtered, watered-BLAST list of identities, you will want to condense them into a list of species present and their frequency in your library. You can do this in Excel. To begin, open Excel and go to File -> Open and choose your filtered file. The program will recognize that the format is not a standard Excel spreadsheet - choose yes, you want to open the file (it is not corrupted). The program will then recognize that the information is in tab-delimited form.
Click Next twice and Finish once to have the program open the file as a 4-column file. The columns are: your original sequence name, the percent identity of the assigned name to your original sequence, a zero or one (zero if the name was filtered out based on step 2; one if the name was kept) and the taxonomic name. To filter the list of names down to the unique names in the library, go to the Data tab and choose Filter -> Advanced. Change the advanced filter action option to “Copy to another location”; click on the right-side box of each range to select “List range” as column D (click on the top of the column to select all), “Criteria range” as column D and “Copy to” as column E; select the box beside “Unique records only” and click OK. Column E now has a list of the species identified in the library. To determine the frequency of each species, start in column F in the cell next to the first name and enter “=COUNTIF(D:D, E2)”, which means count each time E2 appears in column D. A number will appear in the cell indicating the frequency of that name. You can then highlight from that formula cell down column F to the bottom of your list and use the Fill -> Down command from the Home tab.

Finally, to sort the list from highest to lowest frequency, highlight columns E and F and choose Sort and Filter -> Custom sort from the Home tab. Select “Column F” in the Sort by drop-down menu, “Values” in the Sort on drop-down menu and “Largest to smallest” in the Order drop-down menu. Your table is now sorted by frequency, from the most abundant to the least abundant species. You can use this table to make frequency histograms (graphs) for each library.

Section 2 - Assembled reads
de novo Assemblies
When you retrieve the FASTA file from a pyrosequencing run, you can get hundreds to tens of thousands of sequences per library. One way to organize the data is to take each read individually and compare it to things we know (section 1 - nearest neighbour). Another way is to look at the data completely independently and ask “How many different sequences do we have, regardless of whether we know what they are or not?” We would expect to find the same sequence many times for more abundant species, so our list of ten thousand sequences might only represent tens to hundreds of OTUs (operational taxonomic units, i.e. groups of distinct sequences). To look at the data in this way, we use a program called gsAssembler (can be obtained from Roche with permission) to create contigs and isotigs of the sequences. This type of assembly is done on a set of libraries that comprise a complete experiment. Basically, the program does massive alignments, groups same sequences together and generates a consensus sequence for each group (isotig). We then generate files from the newbler output to extract the information that we want to use. From this, we can look at the isotigs, the number of sequences in each isotig and the frequency of each isotig in each library to answer all kinds of questions about the data.

The assembly process and some downstream applications have been compiled together into a bioinformatics pipeline called mPUMA (microbial profiling using metagenomic assembly), developed by Matt Links. In our group, when we receive data from a sequencing run, it is put through the pipeline to create a collection of files for you to examine/manipulate/use as input. The basic pipeline, as it stands now, does an assembly of the data, removes the cpn60 UT primers, collapses isotigs that are identical without primer sequences and generates some downstream output files.
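As a rough picture of "the frequency of each isotig in each library", the Python sketch below builds a small isotig-by-library count table. It assumes that every read has already been assigned to a library and to an isotig; the input format, the names and the `frequency_table` helper are hypothetical and are not taken from mPUMA or gsAssembler output.

```python
from collections import Counter, defaultdict


def frequency_table(assignments):
    """Count reads per isotig per library.

    assignments: iterable of (library, isotig) pairs, one pair per read.
    Returns {isotig: {library: read_count}}.
    """
    table = defaultdict(Counter)
    for library, isotig in assignments:
        table[isotig][library] += 1
    # Convert to plain dicts so the result prints and compares cleanly.
    return {isotig: dict(counts) for isotig, counts in table.items()}


# Hypothetical toy data: five reads from two MID-tagged libraries.
reads = [
    ("MID1", "isotig00001"),
    ("MID1", "isotig00001"),
    ("MID1", "isotig00002"),
    ("MID8", "isotig00001"),
    ("MID8", "isotig00003"),
]

table = frequency_table(reads)
for isotig in sorted(table):
    print(isotig, table[isotig])
```

A library that never contributed a read to an isotig is simply absent from that isotig's row; a real frequency table would normally show a zero there.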
Files generated in mPUMA pipeline from pyrosequencing data Files in the assembly directory: Step 1 - Assembly - files from the assembly process: 454AlignmentInfo.tsv  454AssemblyProject.xml  454IsotigsLayout.txt Page 8 of 60 138  Analyzing Pyrosequencing Data (for dummies) 454NewblerMetrics.txt 454ContigGraph.txt 454TrimStatus.txt 454Isotigs.txt  454RefLink.txt 454Isotigs.qual 454AllContigs.qual 454ReadStatus.txt  454AllContigs.fna 454NewblerProgress.txt 454Isotigs.fna ace (folder)  454AssemblyProject.xml - An XML file describing the parameters used in the assembly (Important: this file contains the record of the overlapMinMatchLength and overlapMinMatchIdentity parameters used to generate the isotigs - you should know this for your dataset. Plus, the file contains the fact that a vector database of the degenerate primer sequences were used in the assembly) 454NewblerMetrics.txt - contains the metrics of the assembly (#s and average lengths of things). (Important: be aware of how many long isotigs (longer than the UT length of 650 with primers) or short isotigs are formed. Those should be some of the first sequences you look at to see if there is some type of artifact / mistake in the assembly of these data.) 454Isotigs.fna - this is a multi-FASTA file of all the isotigs (actual DNA sequences) assembled from your data. The directory called "ace" contains an ACE formatted file for each of the assembled sequences (contigs and isotigs). Here is where you can familiarize yourself with the assemblies, especially to look at things like really long and really short isotigs. To view the ace files you should use Tablet (http://bioinf.scri.ac.uk/tablet/). Tablet is a software program capable of rendering VERY dense alignments and can show you the protein translation in all 6 frames which may prove helpful. Note - at this point, the 454Isotigs.fna file will be used as a DNA file and translated into a protein file. 
Any of the following files with an additional aa in the name are the same file as described below, but in protein form.

Step 2 - Seqclean - Remove cpn60 UT primer sequences from the isotigs
   454Isotigs.fna.cln - the seqclean report output (tab-delimited table reporting which primer sequence was removed from each isotig)
   454Isotigs.fna.seqclean - a FASTA file that has all your isotigs with the primer sequences removed
   454Isotigs.fna.seqclean.blastx - a blast report of the 454Isotigs.fna.seqclean file, to be used by cd-hit in the next step

Step 3 - cd-hit - Collapse isotigs that are now identical after the primers were removed
   454Isotigs.fna.seqclean.cd-hit.clstr - the cd-hit report showing which isotigs were clustered together because they are now identical
   454Isotigs.fna.seqclean.cd-hit.clstr.report - another format for the cd-hit report. Reading from the far left, it gives the isotig kept as the longest representative of a cluster, the length of that representative, how many isotigs were collapsed into the cluster, and then all the isotigs that were collapsed into the representative.
   454Isotigs.fna.seqclean.cd-hit - the FASTA file with the sequences of all the collapsed isotigs
   454Isotigs.fna.seqclean.cd-hit.clstr.fna - exactly the same file as 454Isotigs.fna.seqclean.cd-hit, renamed to follow the same conventions as all the other files. This is the last file common to the whole process: the primer-trimmed, collapsed version of your data. It still contains duplicate isotigs (don’t know why they are created in the assembly, but some sequences get copied into 2 identical isotigs, usually one in the forward orientation and one in the reverse orientation), non-cpn60 isotigs and cpn60 isotigs.
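The forward/reverse duplicates mentioned above can be spotted by comparing each isotig with the reverse complement of the others. The following is a minimal sketch of that idea, not part of mPUMA: the function names and the three example sequences are hypothetical, and only plain A/C/G/T bases are complemented (ambiguity codes are left unchanged).

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A/C/G/T only)."""
    complement = str.maketrans("ACGTacgt", "TGCAtgca")
    return seq.translate(complement)[::-1]


def find_duplicate_isotigs(seqs):
    """Find isotigs that are identical in either orientation.

    seqs maps isotig name -> sequence. Returns a list of
    (kept_name, duplicate_name) pairs, keeping the first name in sorted order.
    """
    seen = {}
    duplicates = []
    for name in sorted(seqs):
        seq = seqs[name].upper()
        # Use whichever orientation sorts first as an orientation-free key.
        key = min(seq, reverse_complement(seq))
        if key in seen:
            duplicates.append((seen[key], name))
        else:
            seen[key] = name
    return duplicates


# Hypothetical isotigs: isotig00002 is the reverse complement of isotig00001.
isotigs = {
    "isotig00001": "ATGGCA",
    "isotig00002": "TGCCAT",
    "isotig00003": "ATGGCC",
}

duplicate_pairs = find_duplicate_isotigs(isotigs)
print(duplicate_pairs)
```

This only catches exact matches; isotigs that differ by even one base are treated as distinct.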
   454Isotigs.fna.seqclean.cd-hit.clstr.fna.wateredBLAST.cpn60_nr_20110622 - a watered blast report of the 454Isotigs.fna.seqclean.cd-hit.clstr.fna sequences. The date tag on the end will match the version of cpndb_nr used.
   454Isotigs.fna.seqclean.cd-hit.clstr.fna.wateredBLAST.cpn60_nr_20110622.png - a graphic of the watered blast results showing the distribution of the percent identity matches.

Step 4A - Strand the isotigs (arrange all the FASTA sequences in the plus strand orientation)
   454Isotigs.fna.seqclean.cd-hit.clstr.fna.stranded - a FASTA file with all the sequences in the plus strand orientation. This is done by looking at the top X watered blast hits for the sequence and determining which orientation the matches are in. If the top hits are in both directions (commonly happens with non-cpn60 random hits), the sequence is removed from the list. So, this list is filtered for most (but not all!!) non-cpn60 sequences, but still contains duplicate isotigs from the assembly process.

Step 4B - RDP classifier
   454Isotigs.fna.seqclean.cd-hit.clstr.fna.classified - the RDP classifier report on the taxonomic placement of the 454Isotigs.fna.seqclean.cd-hit.clstr.fna sequences.

Files in the nucleotide directory (the protein directory has the same thing for the protein translations):

Step 4C - GeneSpring files - the non-duplicate isotigs
GeneSpring files are created as input for the program GeneSpring and can also be used to create frequency tables indicating how many times each isotig was seen in each library. The intermediate files used to make GeneSpring files are not currently put up on the server. At the moment, you have to go out of the assembly directory to the nucleotide directory and into the FILES_FOR_GENESPRING directory.
Any of the GeneSpring text files has a list of isotigs - this list takes the 454Isotigs.fna.seqclean.cd-hit.clstr.fna file and filters out the duplicate isotigs. The GeneSpring isotig list is the non-duplicated isotig list, containing both cpn60 and non-cpn60 isotigs. This total will be different from 454Isotigs.fna.seqclean.cd-hit.clstr.fna.stranded, which has had most non-cpn60 isotigs removed but contains duplicate isotigs. There is currently no FASTA file generated of this list. If you want the sequences of these isotigs, you have to manually extract this list from the 454Isotigs.fna.seqclean.cd-hit.clstr.fna file.

Step 4D - Input files for the other programs - MEGAN, Unifrac, mothur
The other directories in the nucleotide folder contain input files for the programs listed. It is my current understanding that they are generated from the 454Isotigs.fna.seqclean.cd-hit.clstr.fna file. I haven’t done much with them yet, but be aware that, as far as I understand, these files will contain duplicate isotigs and non-cpn60 isotigs.

Section 3 - Frequency Tables

To me, frequency tables are the basic summary tables of your data. They show every isotig or nearest neighbour in your dataset and how often that entity was seen in each library in your experiment. The table can also include the percent identity of each sequence to a database match/name and the taxonomic lineage of that entity. I make frequency tables of both the actual counts and the scaled counts from the libraries. Downstream programs will need either one or the other, and they are NOT interchangeable. There are several ways frequency tables can be generated, depending on what you want to start with and what you want to show.

Frequency table of GeneSpring isotig list

This is usually the first frequency table I make. It gives a list of non-duplicated isotigs, containing both cpn60 and non-cpn60 isotigs.

1. Start by copying the FILES_FOR_GENESPRING folder from your assembly's nucleotide directory on haruspex to your personal folder on haruspex. To do this, open the program Putty and create a new directory in your home directory (FILES_FOR_GENESPRING_mine). In your home directory, use the copy command cp. The format for the command is cp “what you want to copy” “where you want to copy it”. You will need the complete path to your GeneSpring files directory. As well, using *.* matches every file name that contains a dot, which here means everything in the folder. So, an example would be:
      cp /mPUMA/name_of_directory/name_of_subdirectory/nucleotide/FILES_FOR_GENESPRING/*.* FILES_FOR_GENESPRING_mine
2. Move into the FILES_FOR_GENESPRING_mine folder (cd FILES_FOR_GENESPRING_mine).
3. Use the create_composite_text_columns script.
   a. For a scaled frequency table:
      i. /home/pipeline/meta-cpn/bin/create_composite_text_columns.pl -c 2 -h 1 -o scaled_frequency_table.txt [first letters of GeneSpring files]*.txt
      ii. This means: use the perl script found at /home/pipeline/meta-cpn/bin/create_composite_text_columns.pl, take the information in column 2 (which in the GeneSpring files is the scaled frequency), look for the library name (which is the file name) and make it the header for each column, write the output file to scaled_frequency_table.txt, and use all the files in this directory that start with whatever is in [ ].
   b. For an actual frequency table:
      i. /home/pipeline/meta-cpn/bin/create_composite_text_columns.pl -c 3 -h 1 -o actual_frequency_table.txt [first letters of GeneSpring files]*.txt
      ii. Same as above, but telling the program to look at the third column in the GeneSpring files, which is the actual frequency value.
4. Copy the new tables to your computer (using WinSCP) and open them with Excel.
You will probably want to change the library names in the header row, depending on what the GeneSpring files were called (if they were named by sample ID, you are probably fine; if they were named by the 454 region name and MID, you will want a simpler, recognizable name). Note: try to keep names short, with no spaces or special characters (underscore “_” is OK, but no others like -, :, “, /, >, !, ~, etc.). This gives you a basic table with the isotig name in the first column and its frequency in each of your sample libraries next to it.

Adding the best database match ID for each isotig

For my frequency tables, I like to add the best database match ID for each isotig. To do this, you have to pull out just the FASTA sequences for the GeneSpring isotig list and do a watered_BLAST search.

1. Create a fasta file of all the isotigs you want to ID from the GeneSpring list of isotigs. At the moment, we don’t have a FASTA file of just these sequences from mPUMA. To generate it yourself:
   a. Create a .txt file with the isotig names you want and save it on haruspex (call it genespring_list.txt). You can do this by opening any GeneSpring file from your assembly and copying the OTU column into Notepad, or by taking the first column of either of the frequency files.
   b. Copy the file 454Isotigs.fna.seqclean.cd-hit.clstr.fna (this is all the fasta sequences from your assembly, seqclean’ed, cd-hit’ed and collapsed) from the assembly to your computer and open the file with Excel. Use the text-to-columns feature in the Data tab to divide the first column (by spaces), so that the description line only has >isotig0####. Delete the other columns and save as a text file (454Isotigs.fna.seqclean.cd-hit.clstr.fna.simplename.txt).
   c. Copy the file 454Isotigs.fna.seqclean.cd-hit.clstr.fna.simplename.txt to the same directory on haruspex as your genespring_list.txt file.
   d. While you are in that directory, index your large fasta file by typing:
      i. cdbfasta 454Isotigs.fna.seqclean.cd-hit.clstr.fna.simplename.txt -o 454Isotigs.fna.seqclean.cd-hit.clstr.fna.simplename.cidx
   e. Use the cdbyank command to extract your desired list of sequences into a new file, as written (your output file is genespring_otu.fasta):
      i. for i in `cat genespring_list.txt` ; do cdbyank -a $i 454Isotigs.fna.seqclean.cd-hit.clstr.fna.simplename.cidx >> genespring_otu.fasta ; done
      ii. Note: in the above command, the ` around `cat genespring_list.txt` is NOT an apostrophe (‘); it is the backtick, the key up by the number 1 key on my keyboard. As well, there are spaces before and after each ;
2. Do a watered_BLAST search of the genespring_otu.fasta file.
   a. See Section 1, Nearest neighbour analysis from unassembled reads, Step 1 - Do a watered-BLAST analysis of your library, for a detailed description.
   b. Reminder of the command format:
      /home/pipeline/aped/bin/watered_blast.pl -i fasta.file -d blast_db -p blast_program -v numb_hits > output.file
   c. Important: for this application, you MUST set -v 1 (you absolutely only want 1 hit for each sequence).
   d. Copy the output.file to your computer and open it with Excel. You should have columns for Query (isotig number), Description (region from sequencing plate), % ID of match (not to self), Length (of sequence), Closest hit header (best nearest neighbour) and Strandedness (whether the match was in the forward or reverse direction).
   e. I copy the Closest hit header and % ID of match to my frequency tables (insert columns after the isotig name).

Frequency table of GeneSpring isotig list without create_composite_text_columns.pl script

These are the instructions I had from before we had the create_composite_text_columns.pl script. You can do the initial table assembly manually by the following method.
Open each GeneSpring input file in Excel, copy the isotig name column the first time into a new Excel worksheet, and then label each column next to it with your library names. Under the library names, copy the frequency values for that library. This column will be actual counts or scaled counts, depending on which table you are working on. At the end, do a Home -> Replace search to change N/A to 0 in your table (if an isotig wasn’t seen in a library, its frequency is zero).

A de novo assembled table would be:

   Column A     Column B    Column C
   Label        BC1_MID2    BC1_MID3
   isotig0001   345.70      0
   isotig0002   0.54        94.56
   isotig0003   0           45.80
   isotig0004   0           563.89
   isotig0005   7.09        0
   isotig0006   0           453.8
   isotig0007   9.98        6.89

Generating a frequency table in Excel for unassembled nearest neighbour data (section 1)

If you created individual frequencies for each library based on unassembled reads, you can compile them into one composite frequency table with the following instructions. To begin, you will need frequency tables from each individual library (see Nearest neighbour analysis from unassembled reads, section 1). Copy and paste each individual library frequency table into an Excel spreadsheet, each one below the last, so that all the libraries are in columns A and B. Add the library name into column C.
A nearest neighbour table should look like:

   Column A                                                      Column B    Column C
   Label                                                         Frequency   Library
   b1030 AF440238 Bacteroides vulgatus ATCC 8482                 1143        BC1_MID2
   b16458 AB510684 Bacteroides plebeius JCM 12973                304         BC1_MID2
   b6848 AY691287 Prevotella ruminicola ATCC 19189               1           BC1_MID2
   b12644 NZ_ABXH01000007 Collinsella intestinalis DSM 13280     1           BC1_MID2
   b13510 NZ_ABQT01000013 Helicobacter cinaedi CCUG 18818        6907        BC1_MID1
   b7948 DQ059428 Helicobacter hepaticus ATCC 51448              1087        BC1_MID1
   b7904 DQ059472 Campylobacter upsaliensis ATCC 43954           90          BC1_MID1

Next, do an advanced filter of the species names to get a list of each unique name only. To do this, go to the Data tab and choose Filter -> Advanced. Change the advanced filter action option to “Copy to another location”; click on the right-side box of each range to select “List range” as column A (click on the top of the column to select all), “Criteria range” as column A and “Copy to” as column E; select the box beside “Unique records only” and click OK. Column E now has a list of the species identified in the libraries.

You then want to create column headings for each of your library names, starting at F1 and going across G1, H1… until you have a column heading for each library. The library names MUST be identical between the library column beside the frequencies (column C) and the column headings (the column headings will be used as a search parameter).
Your table will now look like:

   Column A                                                  Column B    Column C
   Label                                                     Frequency   Library
   b1030 AF440238 Bacteroides vulgatus ATCC 8482             1143        BC1_MID2
   b16458 AB510684 Bacteroides plebeius JCM 12973            304         BC1_MID2
   b16460 AB510681 Bacteroides massiliensis JCM 13223        300         BC1_MID2
   b12775 NZ_ABWK01000112 Mitsuokella multacida DSM 20544    183         BC1_MID2

   (column D is left empty)

   Column E                                                  Column F    Column G    Column H
   Label                                                     BC1_MID1    BC1_MID2    BC1_MID4
   b1030 AF440238 Bacteroides vulgatus ATCC 8482
   b16458 AB510684 Bacteroides plebeius JCM 12973
   b16460 AB510681 Bacteroides massiliensis JCM 13223
   b12775 NZ_ABWK01000112 Mitsuokella multacida DSM 20544

To fill in the frequencies, use the SUMIFS command. The command looks like =SUMIFS([where to look for the value you want in the cell at the end - frequency column], [first selection criteria column - names in column A], [first selection criteria value - specific species in column E], [second selection criteria column - libraries in column C], [second selection criteria value - specific library for that column]). So, to fill in the first species frequency for BC1_MID1 (column F), the command looks like:

   =SUMIFS($B:$B,$A:$A,$E2,$C:$C,$F$1)

This means: sum up all the values in column B that have the value of E2 (species name) in column A (list of species names) and the value of F1 (library name) in column C (list of library names). Since each species name-library combination should appear only once, you will get the frequency value for that combination. The $ signs are in place so that you can copy the formula with the Fill option and not have it change (you always want to be looking at columns B, A and C). First, you must copy the SUMIFS command across the table (Fill -> Right). Then you will have to manually go in and change the final cell for the library name to the proper column.
This will give you commands that look like this across the row:

   =SUMIFS($B:$B,$A:$A,$E2,$C:$C,$F$1)   /   =SUMIFS($B:$B,$A:$A,$E2,$C:$C,$G$1)   /   =SUMIFS($B:$B,$A:$A,$E2,$C:$C,$H$1)   /   …

Once you have the top row filled in, you can select all the top row commands and Fill -> Down the length of the table. If the commands have been entered correctly, only the value for the name cell will change. For example, down column F, the commands will look like:

   =SUMIFS($B:$B,$A:$A,$E2,$C:$C,$F$1)
   =SUMIFS($B:$B,$A:$A,$E3,$C:$C,$F$1)
   =SUMIFS($B:$B,$A:$A,$E4,$C:$C,$F$1)
   …

This will give you a matrix of all the species detected in all your libraries and the frequency of each species in each library.

Section 4 - Chimera checking

Chimeras are DNA sequences that are hybrids of two templates. They most often occur in a PCR reaction when one PCR product starts extension but the cycle ends before the product is complete. In the next PCR cycle, the half-made product can bind to a different template just well enough to mis-prime and finish its extension with a different sequence. You now have a single DNA molecule that represents two different things. The current scientific dogma is to remove these chimeras because they create artificial diversity in the data. We don’t have a set chimera-checking protocol for cpn60, but this is my personal protocol. The basic idea is that we look at the beginning and end regions of each isotig, make a list of the isotigs that have different database reference matches at the beginning and end, and compare these to the results of an established chimera-checking software program, Bellerophon (designed for 16S rRNA sequences).

Do chimera check of your isotigs

Step 1 - Make a fasta file of all the sequences you want to chimera check

1. If you did not do this for your frequency table, create a fasta file of all the isotigs you want to chimera check.
I like the GeneSpring list of isotigs (these have had their primers removed, identical sequences collapsed and duplicate isotigs removed). However, at the moment, we don’t have a fasta file of just these sequences from mPUMA. To generate it yourself:
   a. Create a .txt file with the isotig names you want and save it on haruspex (call it genespring_list.txt). You can do this by opening any GeneSpring file from your assembly and copying the OTU column into Notepad.
   b. Copy the file 454Isotigs.fna.seqclean.cd-hit.clstr.fna (this is all the fasta sequences from your assembly, seqclean’ed, cd-hit’ed and collapsed) from the assembly to your computer and open the file with Excel. Use the text-to-columns feature in the Data tab to divide the first column (by spaces), so that the description line only has >isotig0####. Delete the other columns and save as a text file (454Isotigs.fna.seqclean.cd-hit.clstr.fna.simplename.txt).
   c. Copy the file 454Isotigs.fna.seqclean.cd-hit.clstr.fna.simplename.txt to the same directory on haruspex as your genespring_list.txt file.
   d. While you are in that directory, index your large fasta file by typing:
      i. cdbfasta 454Isotigs.fna.seqclean.cd-hit.clstr.fna.simplename.txt -o 454Isotigs.fna.seqclean.cd-hit.clstr.fna.simplename.cidx
   e. Use the cdbyank command to extract your desired list of sequences into a new file, as written (your output file is genespring_otu.fasta):
      i. for i in `cat genespring_list.txt` ; do cdbyank -a $i 454Isotigs.fna.seqclean.cd-hit.clstr.fna.simplename.cidx >> genespring_otu.fasta ; done
      ii. Note: in the above command, the ` around `cat genespring_list.txt` is NOT an apostrophe (‘); it is the backtick, the key up by the number 1 key on my keyboard. As well, there are spaces before and after each ;

Note - if the cdbfasta and cdbyank commands don’t work for you (or you don’t understand them), an alternate way of pulling sequences from a multiple FASTA file is presented in Appendix E.

Step 2 - Extract and compare the beginning and end of each sequence

Note: An alternative procedure for this step (not using c2e.pl and c3.pl) can be found in Appendix E.

1. Extract the ends of the OTUs you want to check using C2E (Chaban's Chimera Ends). You can specify the ends either as an explicit length (if it is greater than the sequence length, the full length will be used) or as a percentage of the sequence length (note: this parameter is an integer, not a float). Each sequence identifier gets either a '-5prime' or a '-3prime' suffix in the header of the fasta file.
   a. The command format is:
      path/to/c2e.pl -i input_file_of_all_fasta_sequences_to_screen -o output_file_name -l length_in_bp_of_each_end
      OR
      path/to/c2e.pl -i input_file_of_all_fasta_sequences_to_screen -o output_file_name -p percent_of_sequence_to_take_from_each_end
   b. So, for ends which are 150 bp of the original sequences, at the command prompt in the directory that has your genespring_otu.fasta file, type:
      i. /home/pipeline/meta-cpn/bin/c2e.pl -i genespring_otu.fasta -o genespring_otu_150bp_ends -l 150
   c. Alternatively, for ends which are 25% of the original sequences, use:
      i. /home/pipeline/meta-cpn/bin/c2e.pl -i genespring_otu.fasta -o genespring_otu_25percent_ends -p 25
2. Run a Watered Blast search of the sequence ends against the reference database (asking for only the single best hit). Use the command (updating the cpndb_nr database date as updates are made; the command is described in section 1 of these instructions):
   a. /home/pipeline/aped/bin/watered_blast.pl -i genespring_otu_150bp_ends -d /aped/blast_dbs/cpndb_nr_20110622 -p blastn -v 1 > genespring_otu_150bp_ends_watered_blast
3. Parse the Watered Blast results to see if both ends match the same reference sequence and if those matches are in the correct orientation using C3 (Chaban’s Chimera Checking).
   a. The command format is:
      path/to/c3.pl -i input_file_of_all_fasta_sequences_to_screen -w output_file_from_watered_blast_search > final_table_file_name_to_save
   b. So, the command you type looks like:
      i. /home/pipeline/meta-cpn/bin/c3.pl -i genespring_otu.fasta -w genespring_otu_150bp_ends_watered_blast > compiled_chimera_checking_table
   c. Copy your compiled_chimera_checking_table file to your computer and open it with Excel. It should look like this (shown in two parts, with the hit names cut off by the column width):

      ID            Match Test   Strand Test   % Identity difference   5prime hit                   5prime % identity   5prime strand
      isotig00005   FALSE        TRUE          23.7                    b255 AF036324 Staphyloco     98                  +
      isotig00021   FALSE        TRUE          22.2                    b255 AF036324 Staphyloco     98.7                +
      isotig00025   FALSE        TRUE          1.3                     b255 AF036324 Staphyloco     98.7                +
      isotig00028   FALSE        TRUE          1.3                     b255 AF036324 Staphyloco     98.7                +
      isotig00057   TRUE         TRUE          1.3                     b255 AF036324 Staphyloco     98.7                +
      isotig00081   TRUE         TRUE          0.1                     b255 AF036324 Staphyloco     99.3
      isotig00187   FALSE        TRUE          0.7                     b824 AF245666 Enterococc     99.3
      isotig00274   FALSE        TRUE          28.2                    b19157 NC_015672 Flexisti    46.1                +
      isotig00280   TRUE         TRUE          1.3                     b255 AF036324 Staphyloco     99.3                -

      ID            3prime hit                       3prime % identity   3prime strand
      isotig00005   b10882 NZ_ABBX01000122 Cand      74.3
      isotig00021   b10882 NZ_ABBX01000122 Cand      76.5
      isotig00025   b1123 AF352799 Streptococcus p   100
      isotig00028   b1123 AF352799 Streptococcus p   100
      isotig00057   b255 AF036324 Staphylococcus a   100
      isotig00081   b255 AF036324 Staphylococcus a   99.2                +
      isotig00187   b10197 EF173658 Staphylococcu    100                 +
      isotig00274   b10882 NZ_ABBX01000122 Cand      74.3
      isotig00280   b255 AF036324 Staphylococcus a   98                  +

4. Sort the table by Strand Test, then by Match Test.

Step 3 - Interpreting the chimera checking table

1. My rules for interpreting the results:
   a. FALSE for Strand test (strandedness) and FALSE for Match test (closest header)
      i. Not a cpn60 isotig; % ID values were all generally low; copy these isotigs to a new sheet and label it “not cpn60”.
      ii. You shouldn’t ever get a case of FALSE for strandedness and TRUE for closest header (that would mean you hit the exact same sequence in opposite orientations from either end of your original isotig, which would not make sense).
   b. TRUE for Strand test (strandedness) and TRUE for Match test (closest header)
      i. Good isotigs; % ID values were generally good - both forward and reverse pieces have about the same % ID number (it doesn’t need to be high, as long as the two are about the same as each other); copy these isotigs to a new sheet and label it “Good isotigs”.
      ii. ADDED STEP: Sort the “Good isotigs” by % ID values (do a custom sort by the front % ID). Move all the isotigs with less than 55% ID to your “not cpn60” sheet. Redo the sort with the end % ID and remove any stragglers (less than 55% ID).
   c. TRUE for Strand test (strandedness) and FALSE for Match test (closest header)
      i. These are the chimera suspects; NOT ALL OF THESE ARE CHIMERAS!
      ii. Copy this section to a new sheet and sort based on the % ID columns (do beginning % ID, then add a level and do end % ID). Look at the % IDs less than 55% - in many cases, both the beginning and end matches are below 55% - and move these isotigs to the “not cpn60” sheet.
      iii. Look through each pair and decide if the difference is enough to cause concern (Is a beginning match to b16193 ACEO02000014 Neisseria subflava NJ9703 (95.3% ID) and an end match to b18679 AEQJ01000046 Neisseria meningitidis CU385 (100% ID) too divergent for you?). There are cases where duplicate species will end up in cpndb_nr (in my case, I found a set that had v3131 3372 Corynebacterium pseudodiphtheriticum A7-60 as the beginning match and v3117 3335 Corynebacterium pseudodiptheriticum CMPT M369-2 as the end match - these are just fine as a good isotig).
My rule of thumb today is that isotigs with a beginning and end from different genera get copied to a new sheet called “possible chimera” for further examination. Isotigs with a beginning and end from the same genus (like the Neisseria case above) get copied to the “Good isotigs” list.

Step 4 - Checking your suspected chimeras with the program Bellerophon

1. You now have a reasonable “possible chimera” list to consider. I now take my entire original isotig dataset and use the online chimera-checker Bellerophon to compare what Bellerophon finds to what I have found. Isotigs that fall into both lists get removed. To do this, you need to do the following steps.
2. You want to analyze everything in the “Good isotigs” and “possible chimera” lists (leave out the non-cpn60 isotigs). You will need to do this in 2 parts (Bellerophon will want to make an alignment of all your sequences, so you will want them all in the forward or + strand direction).
   a. Take your “Good isotigs” and “possible chimera” sheets and sort by strandedness (it doesn’t matter which strandedness column you use, they are both the same). Copy all the isotig names from both lists with negative strand matches to a Notepad file called neg_strand_all_cpn60_isotigs_list.txt and all the isotig names from both lists with positive strand matches to pos_strand_all_cpn60_isotigs_list.txt. Move these files to haruspex, into the same directory you were working in before.
   b. Use the cdbyank command to pull these sequences out of the 454Isotigs.fna.seqclean.cd-hit.clstr.fna.simplename.cidx file you made earlier into fasta sequence files:
      i. for i in `cat neg_strand_all_cpn60_isotigs_list.txt` ; do cdbyank -a $i 454Isotigs.fna.seqclean.cd-hit.clstr.fna.simplename.cidx >> neg_strand_all_cpn60_isotigs_revcomp_fasta.txt ; done
      ii. for i in `cat pos_strand_all_cpn60_isotigs_list.txt` ; do cdbyank -a $i 454Isotigs.fna.seqclean.cd-hit.clstr.fna.simplename.cidx >> pos_strand_all_cpn60_isotigs_fasta.txt ; done
   c. Reverse the negative strand file:
      i. seqret -sreverse neg_strand_all_cpn60_isotigs_revcomp_fasta.txt neg_strand_all_cpn60_isotigs_fasta.txt
   d. You now have 2 .txt files with your cpn60 isotig fasta sequences, all in the right orientation. Combine the 2 files into one. To do that, use the concatenate command:
      i. cat *_strand_all_cpn60_isotigs_fasta.txt > all_cpn60_isotigs_fasta_for_chimera_checking.txt
      ii. You should check that you did that correctly by counting the number of sequences in the positive strand file and the negative strand file and making sure the total matches the combined file:
         1. grep \> pos_strand_all_cpn60_isotigs_fasta.txt | wc
         2. grep \> neg_strand_all_cpn60_isotigs_fasta.txt | wc
         3. grep \> all_cpn60_isotigs_fasta_for_chimera_checking.txt | wc
         4. The numbers from 1 + 2 should equal 3 (remember the value from 3 for the next step).
3. Make your Bellerophon input files by dividing the .txt file into files of no more than 300 sequences per file (the online tool can only handle 300 sequences at once). To do this:
   a. Make a file with a list of all the fasta sequence IDs you need to check (removing the > from the beginning of the line):
      i. grep \> all_cpn60_isotigs_fasta_for_chimera_checking.txt | sed 's/>//g' > all_chimera_check_IDs_list.txt
      ii. You should get a list of isotig numbers.
   b. Now divide this list into groups of no more than 300 IDs per file and print each sublist to a new file. For example, if you had 678 isotigs:
      i. sed -n 1,250p all_chimera_check_IDs_list.txt > first250_isotigs_chimera_screen.txt
      ii. sed -n 251,500p all_chimera_check_IDs_list.txt > second250_isotigs_chimera_screen.txt
      iii. sed -n 501,678p all_chimera_check_IDs_list.txt > end250_isotigs_chimera_screen.txt
   c. Now you need to index your complete fasta file by typing:
      i. cdbfasta all_cpn60_isotigs_fasta_for_chimera_checking.txt -o all_cpn60_isotigs_fasta_for_chimera_checking.cidx
   d. Use the cdbyank command to extract each list of sequences into a new file:
      i. for i in `cat first250_isotigs_chimera_screen.txt` ; do cdbyank -a $i all_cpn60_isotigs_fasta_for_chimera_checking.cidx >> first250_fasta_for_chimera_checking.fasta ; done
      ii. Note: in the above command, the ` around `cat first250_isotigs_chimera_screen.txt` is NOT an apostrophe (‘); it is the backtick, the key up by the number 1 key on my keyboard. As well, there are spaces before and after each ;
      iii. Repeat for the rest of your files by changing the input and output names.
      iv. Copy the fasta files to your computer.
4. Go to http://comp-bio.anu.edu.au/bellerophon/bellerophon.pl, enter your email address and a title (if you want) and upload your 300 isotig-max sequence file. The program parameters are on the right: you can use any correction (I leave the default); use a 200 bp window (the screen you just did was with a 150 bp window); you WANT the program to align the sequences (you didn’t give it an alignment); and PCR library is the only option, so leave it. The program will do its thing and email you the results (this can take anywhere from 30 min to 24 hours).
5. Look at the output and decide which isotigs you feel are real chimeras. I look at each possible chimera from Bellerophon and see which of my manual end comparison lists it is in. If it was in the “possible chimera” list, I remove that isotig from the dataset. If it was in the “Good isotigs” list, I tend to keep it. I also keep an eye out for “chimera parents” that repeatedly come up - they themselves might be chimeras. I go back to my fasta file, remove the first round of chimeras detected and run Bellerophon again.
From the results of the second round, I didn’t find many more isotigs that I really felt were chimeras, so I stopped here. I removed the chimeric isotigs from the dataset and continued on with my analysis. This is a pretty aggressive way of identifying chimeras. It might be over-zealous.

Removing chimeras from your frequency tables

Now that you have a list of isotigs you want to exclude, based on either being non-cpn60 or probable chimeras, you will want to remove them from your frequency tables.

To filter a scaled/actual count table to have only rows NOT in a list:
1. In a new worksheet in Excel, label cell A1 “All_plus_remove” and make a list of all the isotigs you want to remove from the table under the label in column A.
2. In the same column (column A), at the bottom of the remove list, paste in the entire isotig list (from your frequency file - in the same order as your frequency file! (very important)).
3. In the next column (column B), label cell B1 “All” and paste the entire isotig list (from your frequency file - in the same order as your frequency file! (very important)) under the label.
4. In the next column (column C), label cell C1 “Selection” and use =COUNTIF(A:A,B2) in cell C2. Highlight and fill down the formula to the bottom of column B. Every isotig from your frequency file list, in order, should now have a number beside it: either a 1 (the isotig appears only once in column A, so keep it) or a 2 (the isotig is also in the remove list, so remove it).
5. Insert a new first column in your frequency table, then copy the “Selection” column and paste special it as values into that new column (your selection number should be right next to the isotig name).
6. In the first empty column after the last column of your table, type Selection in row 1 and, right below it in row 2, type =1
7. Run an Advanced Filter with “Copy to another location”: List range = the entire frequency table including the Selection column, Criteria range = the 2 cells you just made (Selection and =1), Copy to = the column next to the criteria cells, OK.
   a. This should have regenerated the frequency table with only the isotigs you wanted to keep.
8. Here I do a final check for non-cpn60 isotigs that got through the chimera-checking. If you created the frequency table with the best database match ID and % ID (see Frequency table of GeneSpring isotig list, section 3 above), you will have the % ID match for the database match of each isotig. Sort the entire table by % ID match, smallest to largest. I remove any isotigs with a match of less than 55%.
9. This final frequency table gives you a list of non-redundant, cpn60-only, chimera-screened isotigs and their frequencies. This is the final dataset that I use for my analyses and take out to other software programs.

Section 5 - Useful information about your final isotig dataset from cpnDB

cpnDB is our reference database for the cpn60 gene target. The database contains lots of useful information we can use during a pyrosequencing analysis. Two things I commonly do (and therefore have instructions for) are to get the taxonomic lineage for the best database match of each isotig and to pull the full UT sequence of the best database match from cpnDB for each isotig. The taxonomic breakdown is useful when you want to make stacked bar graphs showing the composition of your libraries at the phylum, class, order, family, genus or species level. As well, being able to obtain the full-length cpnDB sequence of the isotig’s best database match can be useful in phylogenetic tree building or PCR assay design (see sections below).

Pulling the taxonomic lineage for the best database match of each isotig from cpnDB

How to pull the taxonomic lineage of cpnDB matches from cpnDB:
1. Make a list of the cpnDB IDs for all the best database matches in your frequency table. To do this, copy the Closest hits column from the frequency table sheet into a new sheet as one column. Important - you want to keep the order and number of IDs exactly the same as the frequency table - this means keeping duplicate IDs. Separate the cpnDB ID from this column by using the text-to-columns feature in the Data tab, using spaces as the delimiters. This will make a separate column with just the cpnDB ID (b#### or v####). Copy the cpnDB IDs into Notepad and save as cpndb_lineages_want_list.txt. Copy this file to haruspex.
2. Use the getLineage.pl perl script to pull the taxonomic lineage of your list from cpnDB. To do this, you first need to make an output directory wherever you are working on haruspex, which needs to be called “output”. Then enter the following command (using Putty):
   a. perl /home/jhill/bin/getLineage.pl
   b. A number of prompts will come up asking for information: your mySQL username is your cpnDB user name; your password is your cpnDB password; your input file is cpndb_lineages_want_list.txt; your output file name is cpndb_lineages_got.txt; you want ut sequences.
   c. Your file cpndb_lineages_got.txt will be created in the output directory you just made.
3. Copy cpndb_lineages_got.txt to your computer and open it with Excel. You want to split the rows on semicolons to get each taxonomic level into its own cell. However, there is no standardized taxonomic lineage pathway at this point, so you will have to manually edit the columns to fall into phylum, class, order, family, genus and species. Finally, since your rows should still all line up with your frequency file, copy and paste your lineages into the frequency table.

NOTE: We are working on adapting the RDP Classifier (used for 16S rRNA) as a taxonomic tool for cpn60.
I haven’t worked with it much yet, but output files from this program can be found with the assembly data outputs (see section 2 - Files generated in mPUMA pipeline from pyrosequencing data, step 4B).

Pulling the FASTA sequence for the best database match of each isotig from cpnDB

How to pull a subset of sequences from cpnDB:
1. Make a list of the cpnDB IDs for all the best database matches in your frequency table. To do this, copy the Closest hits column from the frequency table sheet into a new sheet as one column. Filter this column using the Advanced filter option in the Data tab to generate a unique list of these items. Separate the cpnDB ID from this column by using the text-to-columns feature in the Data tab, using spaces as the delimiters. This will make a separate column with just the cpnDB ID (b#### or v####). Copy the cpnDB IDs into Notepad and save as cpndb_ids_want_list.txt. Copy this file to haruspex.
2. Use the getSeqs perl script to pull the FASTA sequences in your list from cpnDB. To do this, you first need to make an output directory wherever you are working, which needs to be called “output”. Then enter the following command:
   a. perl /home/jhill/bin/getSeqs.pl
   b. A number of prompts will come up asking for information: your mySQL username is your cpnDB user name; your password is your cpnDB password; your input file is cpndb_ids_want_list.txt; your output file name is cpndb_ids_want_fasta.txt; you want ut sequences.
   c. Your file cpndb_ids_want_fasta.txt will be created in the output directory you just made.

Section 6 - Doing a nearest neighbour analysis of assembled isotigs

If you look through your isotig list at the best database matches, more likely than not you will see the same matches repeated throughout the dataset. This implies that you have a “species” that is showing slight sequence variations. This could indicate ecotypes or subspecies, or it could reflect sequencing errors and oversplitting of the data.
To examine this, I compile a nearest neighbour analysis of the assembled isotigs. This is different from the nearest neighbour analysis described in section 1 (which uses individual, unassembled reads). The nearest neighbour analysis at this level has been curated much more extensively and allows you to determine the % ID range of your isotigs to their best match.

To generate a nearest neighbour assembled frequency table:
1. Start with a scaled values isotig frequency table for scaled values nearest neighbours or an actual values isotig frequency table for actual values nearest neighbours.
2. Copy the Closest hit header column to a new worksheet in the isotig frequency table file.
3. Do an advanced filter of the species names to get a list of each unique name only.
   a. To do this, go to the Data tab and choose Filter -> Advanced. Change the advanced filter action option to “Copy to another location”; click on the right-side box of each range to select “List range” as column A (click on the top of the column to select all), “Criteria range” as column A and “Copy to” as column B; select the box beside “Unique records only” and click OK. Column B now has a list of the species identified in the library.
   b. Delete column A so that you just have the unique name column.
4. You then want to create column headings for each of your library names, starting at B1 and going across until you have a column heading for each library. You can copy these directly from the frequency table top row and paste them into this worksheet.
5. Use the SUMIFS command to fill in the table. (NOTE: SUMIFS needs Excel 2007 or later. Older versions of Excel only have SUMIF, which takes its arguments in a different order: =SUMIF(column of closest hit header, cell A2, column of first library). Do not just swap the command name.)
   a. For the first column (first library), use =SUMIFS(column of first library in isotig table, column of closest hit header in isotig table, cell A2 in nearest neighbour table)
      i. Example: =SUMIFS(Sheet1!C:C, Sheet1!$A:$A, Sheet2!$A2) means: sum up all the frequencies in this library where the isotig closest hit header is this unique name.
      ii. You will want $ in front of the column with the closest hit header because we want to be able to fill this formula throughout the table and we don’t want that column to change.
      iii. Highlight and Fill -> Down this formula to fill the entire nearest neighbour table for library one.
   b. Highlight the entire library one formula column and fill right to cover all the libraries in your experiment.
6. Double-check that the numbers generated in the table make sense and that the formula copied correctly.
7. Copy the entire nearest neighbour assembled isotig frequency table and paste special -> values into a new worksheet. This ensures that the values remain constant and aren’t affected by new sorting of the table (which can mess up the formulas).

Section 7 - mothur diversity statistics and rarefaction curves

The most commonly cited program for generating diversity statistics from metagenomic data is mothur. It is a collection of command-line programs that calculate a few statistical parameters for a single community (or a group of communities), such as Chao1 (richness predictor); Sobs (total number of OTUs observed); Good’s coverage estimator; Simpson’s index (evenness) and H’ (Shannon’s combined richness/evenness index). It is also the go-to program for rarefaction curves (plots that give you an idea of how completely you sampled your environment). The mothur package can be downloaded at http://www.mothur.org/wiki/Download_mothur (there is a Wiki here that contains all kinds of documentation about mothur - I wish you all the luck in the world; it’s command-line based and I understand very little of it).
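
To make the statistics listed above a little less abstract, here is a minimal Python sketch that computes them for one library from its actual OTU counts. It uses common textbook formulas (bias-corrected Chao1, Good’s coverage as 1 minus singletons over total reads, Simpson’s index as the chance that two reads drawn without replacement belong to the same OTU, and Shannon’s H’ with the natural log). The function name diversity_stats and the example counts are made up for illustration, and mothur’s own calculators (npshannon in particular) may not give exactly the same numbers.

```python
import math
from collections import Counter


def diversity_stats(counts):
    """Return basic diversity statistics for one library.

    counts: whole-number read counts, one per OTU (zeros are allowed).
    These are textbook formulas, not a re-implementation of mothur.
    """
    observed = [c for c in counts if c > 0]   # OTUs actually seen
    n_reads = sum(observed)                   # total reads in the library
    if n_reads < 2:
        raise ValueError("need at least two reads to compute these statistics")

    sobs = len(observed)                      # observed richness (Sobs)
    freq = Counter(observed)
    singletons = freq[1]                      # OTUs seen exactly once
    doubletons = freq[2]                      # OTUs seen exactly twice

    # Bias-corrected Chao1 richness estimate
    chao1 = sobs + singletons * (singletons - 1) / (2 * (doubletons + 1))
    # Good's coverage: fraction of reads that are not singletons
    coverage = 1 - singletons / n_reads
    # Simpson's index: probability two reads come from the same OTU
    simpson = sum(c * (c - 1) for c in observed) / (n_reads * (n_reads - 1))
    # Shannon's H' (natural log)
    shannon = -sum((c / n_reads) * math.log(c / n_reads) for c in observed)

    return {
        "nseqs": n_reads,
        "sobs": sobs,
        "chao": chao1,
        "coverage": coverage,
        "simpson": simpson,
        "shannon": shannon,
    }
```

For example, diversity_stats([5, 3, 1, 1, 0, 2]) describes a library of 12 reads in 5 observed OTUs, two of which are singletons. Like mothur, the sketch expects whole-number counts rather than scaled values.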
The program downloads as a zipped file, so if you create a folder somewhere (I created a folder in my Program files directory) and unzip it, you will get a single executable file. If you double click on the mothur icon/file, a command shell will open where you can enter commands.

mothur was originally designed to create OTUs out of individual sequence reads and then do stats on them. The program does this by taking full-length sequences (all in the same orientation), doing an alignment, creating a distance matrix and then assigning OTUs based on some distance criterion. This would be great for something like clone library data, where we would generate full-length or near full-length sequence. However, our pyrosequencing data is rarely full-length and we generate sequence information from both ends of the cpn60 UT. I have tried aligning sets of sequences and it is very difficult to get a large number of sequences to overlap over a large enough region to make alignments meaningful (though as the 454 read lengths get longer, this problem will go away). Lucky for us, we don’t have to go through these steps - the isotig assembly process is our in-house equivalent. So, we can simply create an input file for our libraries with isotig-based frequency information and jump into the mothur pipeline at the diversity stats and rarefaction curve steps.

Note - for mothur, you HAVE to use the actual counts from the libraries (whole numbers). If you try to use scaled values or anything with a decimal place, the program will either only return stats for the first library you entered or the entire program will hang up and stop (not return to the command prompt). So, NO DECIMAL NUMBERS.

Creating input files for mothur

To create the input file for mothur from an isotig or nearest neighbour frequency table:

In Excel:
1. Open a new Excel file. In the first column (column A), type NA and copy it down so that there is one row for each library in your dataset. mothur will look for something in this column and will choke if it doesn’t find anything.
2. Open your actual counts frequency file. Copy the library names from the first row and paste special, transpose, the list into column B.
3. Count how many isotigs or nearest neighbours you have in your dataset (you can look at the row number at the bottom of your frequency table and subtract 1 for your header row). Enter that total into column C and copy it down so that it appears in the row for each library in your dataset.
4. Select all the frequency data in the actual frequency table, copy and paste special, transpose, into column D of your new file.
   a. Note: in Excel 2003, you are limited to 256 columns, so if there are more than 256 OTUs, you must use Excel 2007 or later, which has a limit of over 16,000 columns per worksheet.
   b. If you did not recopy your frequency table as values only (to remove the formulas from the cells), when you paste the values, they will either come up as an error or all the values will change. Take a moment and double-check that the values transposed correctly. Library 1 should now be a row (instead of a column).
5. Save this file as tab-delimited text and place it in the same directory as the mothur program (for me, that’s C:\Program files\mothur).

Your final file should look like:

NA    HDS1fall      927    45    0     3      0    256    …..
NA    HDS1spring    927    9     34    0      9    0      …..
NA    DDS8          927    0     0     678    8    34     …..

(Note: I’ve tried putting other things in column A so that the name shows up in the final file - for some reason, it works sometimes and not others. So, if you want to gamble, you can enter a project name into column A, but be prepared for something to crash. Otherwise, you can leave NA and it will work fine. Sorry, not my program.)
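
The Excel steps above boil down to transposing the actual counts table and adding three leading columns. If you would rather script it, the following Python sketch does the same reshaping. It assumes (my assumption, not part of the original workflow) that the frequency table has been saved as tab-delimited text with a header row, the OTU names in the first column and one column of whole-number counts per library. The function name and file names are hypothetical.

```python
import csv


def frequency_table_to_shared(table_path, shared_path, label="NA"):
    """Write a mothur-style input file from a tab-delimited count table.

    Input layout (assumed): header row, then one row per OTU with the
    OTU name first and one whole-number count per library after it.
    Output layout: one row per library: label, library name, number of
    OTUs, then the counts for every OTU in table order.
    """
    with open(table_path, newline="") as handle:
        rows = [row for row in csv.reader(handle, delimiter="\t") if row]
    header, data = rows[0], rows[1:]
    libraries = header[1:]          # library names from the header row
    n_otus = len(data)              # same idea as "row number minus 1"

    with open(shared_path, "w", newline="") as out:
        writer = csv.writer(out, delimiter="\t", lineterminator="\n")
        for col, library in enumerate(libraries, start=1):
            counts = [row[col] for row in data]
            for value in counts:
                # Reject scaled values, blanks and formulas: whole numbers only
                if not value.isdigit():
                    raise ValueError(
                        f"library {library}: {value!r} is not a whole number"
                    )
            writer.writerow([label, library, n_otus] + counts)
```

Calling frequency_table_to_shared("actual_counts.txt", "mothur_input.txt") would then produce a file in the layout shown above. The whole-number check is there because of the NO DECIMAL NUMBERS rule.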
Running mothur

In the folder where you unzipped mothur, you should have the program icon and your input file. The program will create *.logfile files when you run it - don’t worry about those.

IMPORTANT - The commands for running mothur have changed between versions of the software. I don’t know why they did this, but the program only works with the correct commands.

Here are the commands for versions of mothur BEFORE v1.20.1 (OLD COMMANDS):

Double click on the mothur icon to open the shell window. At the command prompt, you want to type in 3 sets of commands:
1. read.otu(shared=*.txt)
   a. At the *, enter the name of the input file you created. When this works, you won’t see anything happen - the command prompt will just return. This step just loads the file into the program.
2. summary.single(calc=nseqs-coverage-npshannon-simpson-sobs-chao)
   a. This command generates some output to the screen, hopefully telling you files have been created. In your folder, two files should have been generated for each library - a .rabund file and a .summary file. Forget the .rabund file and copy the .summary file to wherever you want to save this data on your computer. Open the file in Excel and you will get a single row with each of the diversity stats listed in the command.
3. rarefaction.single()
   a. This command also generates some output to the screen, telling you files have been created. This time you should get one .rarefaction file for each library. Copy these files to wherever you want to save this data on your computer. Open the files with Excel and you get lists of values for a rarefaction curve. Plot your graph from the first column of values. Enjoy.

These instructions are for mothur version v1.20.1 (CURRENT COMMANDS).
If they change mothur again and these commands don’t work, you can try looking up changes on the mothur Wiki (google “mothur” and the Wiki instructions come up):

Double click on the mothur icon to open the shell window. At the command prompt, you want to type in 2 sets of commands:
1. summary.single(shared=*.txt, calc=nseqs-coverage-npshannon-simpson-sobs-chao)
   a. This command generates some output to the screen, hopefully telling you files have been created. In your folder, you will get a .rabund file for each library and one .summary file. Forget the .rabund files (you can delete them) and copy the .summary file to wherever you want to save this data on your computer. Open the file in Excel and you will get a table with a row for each library with the diversity stats listed in the command.
2. rarefaction.single(shared=*.txt)
   a. This command also generates some output to the screen, telling you files have been created. This time you should get one .rarefaction file for each library. Copy these files to wherever you want to save this data on your computer. Open the files with Excel and you get lists of values for a rarefaction curve. Plot your graph from the first column of values. Enjoy.

Running mothur from haruspex:

If you run into command problems, or just didn’t download a copy of mothur for yourself, we have a copy of mothur on haruspex (v1.20.1). To run mothur from haruspex:
1. Login to haruspex from WinSCP
   a. Create a directory in your home directory called mothur_file
   b. Copy your input file from your computer into the mothur_file directory
2. From the command line (in Putty), move into your mothur_file directory
   a. At the command prompt, type: mothur
3. mothur will load as usual. Use the v1.20.1 commands above to generate your statistics and rarefaction files.
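
For a feel of what a rarefaction curve is actually plotting, here is a short Python sketch that computes the expected number of OTUs in a random subsample of reads, using the analytic (hypergeometric) formula for sampling without replacement. This is an illustration under my own assumptions and not necessarily how mothur calculates its .rarefaction files, so do not expect the numbers to match mothur’s output exactly. The function names and the example counts are hypothetical.

```python
from math import comb


def expected_otus(counts, sample_size):
    """Expected number of OTUs seen in a random subsample of reads.

    counts: whole-number read counts per OTU for one library.
    sample_size: number of reads drawn without replacement.
    """
    observed = [c for c in counts if c > 0]
    total = sum(observed)
    if not 0 <= sample_size <= total:
        raise ValueError("sample_size must be between 0 and the total read count")
    all_draws = comb(total, sample_size)
    # An OTU with c reads is missed with probability
    # C(total - c, sample_size) / C(total, sample_size).
    return sum(1 - comb(total - c, sample_size) / all_draws for c in observed)


def rarefaction_curve(counts, step=1):
    """Return (sample size, expected OTUs) pairs from 0 reads up to all reads."""
    total = sum(c for c in counts if c > 0)
    sizes = list(range(0, total + 1, step))
    if sizes[-1] != total:
        sizes.append(total)        # always finish at the full library
    return [(n, expected_otus(counts, n)) for n in sizes]
```

Plotting the pairs from rarefaction_curve for a library gives a curve that flattens out when the library has been sampled deeply. A curve that is still climbing at the right-hand end suggests more OTUs remain unseen.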
Section 8 - (Fast) Unifrac

From the website: http://bmf2.colorado.edu/fastunifrac/

Their words (from the website):

Fast UniFrac is a new version of UniFrac that is specifically designed to handle very large datasets. Like UniFrac, Fast UniFrac provides a suite of tools for the comparison of microbial communities using phylogenetic information. It takes as input a single phylogenetic tree that contains sequences derived from at least three different environmental samples, a file mapping ids used in the tree to a set of unique sample ids (same format as 'environment file' in regular Unifrac), and an (optional) category mapping file describing additional relationships between samples and subcategories for visualizations. For example, in a given set of gut samples, you might define subcategories for different diets, different physical locations/dates, different species, and/or different treatments like antibiotics or high fat. Both the UniFrac distance metric and the P test can be used to make comparisons. Both of these techniques bypass the need to choose operational taxonomic units (OTUs) based on sequence divergence prior to analysis.

Fast UniFrac allows you to:
• Determine if the samples in the input phylogenetic tree have significantly different microbial communities.
• Cluster samples to determine whether there are environmental factors (such as temperature, pH, or salinity) that group communities together.
• Determine whether system under study was sampled sufficiently to support cluster nodes.
• Easily visualize the differences between samples graphically, with support for three dimensional exploration of datasets and with multiple subcategory coloring.

My words: Basically, Unifrac lets you look at your libraries and see if they contain phylogenetically similar organisms and if you can group libraries together based on some criteria (using methods like Principal Coordinates Analysis or clustering).
Unifrac is hosted as an online application, so there is nothing for you to download. You do need to register to use larger datasets (smaller sets can be done on the website by logging in as guest). As the instructions above say, you need 3 files to work with Unifrac: a phylogenetic tree, a sample ID file with all your OTUs and their frequencies, and a category file with info about your libraries.

Making input files for Unifrac

File 1 - Phylogenetic tree

The phylogenetic tree you need is a rooted tree that contains all your OTUs (either as isotigs or as nearest neighbours). With clone library data, you would simply take the full-length sequence of each OTU and build a tree with a distant root (I like a human cpn60 sequence). However, for pyrosequencing data, it is very difficult to build a complete alignment of isotigs, since they are different lengths and from both ends of the cpn60 UT (same problem as with using mothur from the beginning). So, you have 2 choices. You can (A) analyze your data as isotigs by building the best tree you can with all the isotig sequences. This is getting easier as the sequencing technology generates longer sequences. However, you will still have to manually trim all the sequences to the same length after alignment to generate a proper tree. Or (B) you can analyze your data as nearest neighbours, pull the cpnDB UT sequences from the nearest neighbour analysis of assembled sequences and use those as your sequences.

If you want to use isotig sequences (route A), you will need a FASTA file with all your good isotigs in the forward orientation. You can generate this file by taking the FASTA file you made at step 12d of chimera checking above and removing the chimera sequences from it.
Alternatively, it might just be easier to generate the FASTA from scratch using the instructions from step 1 of Adding the best database match ID for each isotig (section 3 - frequency tables) and orienting them to be all forward stranded with the instructions from step 12 of section 4 - chimera checking. Random note - if the cdbfasta and cdbyank commands don’t work for you (or you don’t understand them), an alternate way of pulling sequences from a multiple FASTA file is presented in Appendix E.

If you want to use nearest neighbour reference sequences (route B), you will need a FASTA file with all your nearest neighbour UT sequences. You can generate this file with the instructions Pulling the FASTA sequence for the best database match of each isotig from cpnDB from section 5 - Useful information about your final isotig dataset from cpnDB.

With either route, I recommend adding in a human cpn60 sequence (pick anything you want from cpnDB) to include as an outgroup.

Now you are ready to make a phylogenetic tree. See Appendix A for detailed instructions on how to do that. Do the basic ClustalW alignment, trim with GeneDoc, then use PHYLIP (DNADIST and NEIGHBOR) to make the tree, open it in TreeView to check it out and root it with the human sequence. Finally, save the tree in either Newick or Nexus format.

Fast Unifrac wants all its input files zipped, so you need to zip the final tree file. To do that, copy your text file to haruspex and use the zip command:

zip output_file_name.zip input_file_name.txt

The format is zip *.zip *.txt - keep the extensions (good thing to have).

File 2 - Sample ID mapping file

For this file, what you want is a long text file (tab delimited) with 3 columns: sequence ID (b#/v# for nearest neighbour, isotig### for isotig), library name, and frequency of that OTU in that library. To get this, we have to reformat our frequency table. To make this file, open a new Excel worksheet.
In column A, copy your sequence IDs. You will need to repeat this list in column A for as many libraries as you have. In case you are wondering, this will make a very large file (my last dataset had 227 nearest neighbour OTU x 67 libraries = 15,209 rows in the table). In column B, copy the library name for each list of your sequence IDs. Finally, in column C, copy the frequencies for each library beside their IDs. Your file should look like:

……
v8582     DDS2_norm      0.55
v9415     DDS2_norm      0
b10164    DDS2_enrich    0
b10199    DDS2_enrich    45.89
b10200    DDS2_enrich    0
…….

NOTE: You can use scaled or actual values for the sample ID file, but you must remember which one you used. In Unifrac, there will be the option to normalize the abundance values in different programs. If you have entered actual values, you will need to normalize. If you entered scaled values, you essentially normalized before starting, so you will want to use the non-normalized setting.

Save this worksheet as a text, tab-delimited file. You need to zip it before you can load it into Unifrac. To do that, copy your text file to haruspex and use the zip command:

zip output_file_name.zip input_file_name.txt

File 3 - Category mapping file

This is a table that you need to make up for your dataset. You want to make it in Excel and save it as a text, tab-delimited file. The key features are that the first cell must have #SampleID (must have the #) and the first column must be your library names. After that, you can have all kinds of categories that might be important to your data. Remember that each category must have at least 2 different values in it, or the program will crash. The second line of the table can be a description and must start with a #.
An example is:

#SampleID          Season    HealthState    Age
#Campylobacter in dogs study by Bonnie looking at healthy and diarrheic dogs with normal and enriched libraries
DDS34.norm         Spring    Diarrheic      Under2
DDS51.norm         Spring    Diarrheic      Under2
DDS64.norm         Spring    Diarrheic      Under2
HDS18D.norm        Spring    Healthy        Over10
HDS19.norm         Spring    Healthy        Under2
HDS1fall.norm      Fall      Healthy        Under2
HDS1spring.norm    Spring    Healthy        Under2
HDS2fall.norm      Fall      Healthy        2to10

This file also needs to be zipped to load online, so use the zip command for this file.

Using Unifrac

Once you have all 3 files, login to the program website and load your zipped NEXUS tree, zipped Sample ID file and your zipped Category file. If the program has any problems, it will tell you. A favorite is that it doesn’t like almost all special characters like -, _, ? and so on, so if you can’t use different capitalization for info, try periods. Unifrac also has a reasonably good tutorial, so if you have any problems, try http://bmf.colorado.edu/fastunifrac/tutorial.psp

Things you can do in Unifrac (edited from their tutorial information):

Measuring the overall difference between each pair of samples - In order to generate the raw distances between each pair of samples using the UniFrac metric, choose the Sample Distance Matrix from the Select analysis menu.

Clustering the samples - It can be useful to see how the environments cluster together since there are often patterns in the clustering that could not have been determined from the pattern of significant differences alone. Go to the main screen and choose the Cluster Samples option from the Select analysis menu. To be confident that the clustering results are correct, it is necessary to go back to the main page and select the Jackknife Sample Clusters option, which will sample a smaller number of sequences from each environment and tell you whether the clusters are well-supported.
Perform Principal Coordinates Analysis (PCoA) - The cluster diagrams are useful for showing which environments are most closely related to one another, but it is also important to see if the environments are distributed along any axes of variation that can be interpreted easily (e.g. a pH or temperature gradient). Go to the main screen and choose the PCoA option from the Select analysis menu. If you want to get creative, the PCoA results have a view in 3D option that makes cool figures for presentations and some people’s Nature papers.

Section 9 - GeneSpring

GeneSpring is a software program that can do simple and complex visualization and statistical analysis. The program was originally designed to analyze microarray data, so that must be remembered when loading data and working through the program. We “trick” the program into taking in sequence frequency (in place of array spot intensity) and isotigs or nearest neighbours (in place of spot ID) and then ask the software to compute comparisons and statistics based on differences between libraries.

To input a library into GeneSpring, we use the format:

OTU            Percent    Scaled#    ActualCount    Label
isotig00002    0          0          0              something
isotig00004    0          0          0              something

OTU is the label for the sequence (either an isotig number if the sequences were assembled de novo (below) or the cpnDB ID number of its nearest neighbour).
Percent is the proportion that OTU is of that library (adding up the Percent column for a library should equal 100).
Scaled# is the frequency of that OTU, scaled to the median of all the libraries to be compared. This is needed because the number of sequence reads we get from each library is different and if we want to compare libraries of different sizes, we need to “normalize” them to a common number.
ActualCount is the actual number of times that isotig or nearest neighbour was seen in the library (frequency) (not scaled).
Label is the full name of the nearest neighbour (taxonomic name).

The important thing to remember when putting together files for GeneSpring is that every OTU that is to be considered must appear in each library input file, even if that OTU was not seen in some libraries.

If you want to do this analysis with the de novo assembled data, mPUMA has created the input files you need (the same ones we’ve been using for other purposes). The mPUMA set of GeneSpring files will have the non-duplicated isotig list, containing both cpn60 and non-cpn60 isotigs.

If you want to generate these files yourself, you can. To do this, you will need to have compiled actual counts and scaled counts frequency tables for all the OTUs you want to compare, ideally with their best database match (for isotig frequency tables). Instructions for doing this are in section 3 - frequency tables.

Making GeneSpring input files from your own frequency tables

Note - when you are making your own GeneSpring input files, you don’t need the Percent column. My instructions are written to omit that column.

For each library you want to analyze, you will need a separate tab-delimited text file. You will start by making the table in Excel, then save it in a tab-delimited file format.
1. To start, open an Excel table and put the labels (OTU, Scaled#, ActualCount, Label) across the top in the first row.
2. Open the frequency tables you want to use.
3. To make column 1 (OTU), you want either the isotig## (copy directly from the first column of your frequency table) or just the cpnDB ID. To get just the cpnDB ID from the whole name of the nearest neighbour entry, copy the column of names and paste it into a new Notepad file. Save the Notepad file. Reopen the Notepad file in Excel - when asked, the file is SPACE-delimited.
By doing this, when the file opens, the first column in Excel is the cpnDB ID all by itself (it is the first thing in the name, separated by a space). Copy the first column and paste it into your GeneSpring Excel file. 4. For the frequency columns, copy the scaled values from the scaled frequency table and the actual values from the actual frequency table. 5. For the Label column, paste in the closest header hit/nearest neighbour full name. Spaces and special characters are OK here. 6. Save the file in Excel as both an Excel file and a tab-delineated file. Use the tab-delineated file for GeneSpring. 7. Repeat for all the libraries so that you have a tab-delineated file for each. Final file should look like: OTU Scaled6021 b1030 6 b16458 0 b16460 1.8 b12775 0  ActualCount 10 0 3 0  Label b1030 AF440238 Bacteroides vulgatus ATCC 8482 b16458 AB510684 Bacteroides plebeius JCM 12973 b16460 AB510681 Bacteroides massiliensis JCM 13223 b12775 NZ_ABWK01000112 Mitsuokella multacida  Running GeneSpring GeneSpring is an expensive, licensed software that we renew every year. As such, we have one copy of the program that lives on our server, haruspex. We have created a general user, Charles Darwin or chuckd, that everyone is the group can “become” in order to run the program. Only one person from one computer can be logged in as chuckd at any one time, so please, when you are taking a break, try to remember to log-off. If you forget to close the program and log-off and go home for the day, no one in the group and access the program.  Page 37 of 60 167  Analyzing Pyrosequencing Data (for dummies) You will need a XWin terminal on your computer to run GeneSpring. For a PC, this means you will need to install parts of Cygwin X. Instructions to do this are in the Appendix B. Mac computers have an X-terminal preinstalled. To open GeneSpring on your computer: 1. 
Open a Cygwin-X XWin Server (the program should be in your Start menu; you should get a black X icon with a partial red ring around it in the bottom right tray on your computer). This should also open a Cygwin-X Xterm shell. If the XWin server was already open, just open the Xterm shell.
2. At the command prompt, type: ssh -X chuckd@haruspex.usask.ca
3. When prompted for a password, use: agilent
4. At the command prompt, type: Agilent/GeneSpringGX/GeneSpringGX
5. The program should open.

Mac user note: Connecting to haruspex from a Mac using X11: if you're having trouble getting GeneSpring to launch and display properly, try connecting like this instead: ssh -XY chuckd@haruspex.usask.ca
The -Y option turns on "trusted" X11 forwarding for the connection. It seems to be an issue for systems running older versions of the Mac OS (pre 10.5).

Matt Links has written a wonderful PowerPoint presentation detailing how to load data into GeneSpring and get started with a few basic analyses. These slides have been included in Appendix C.

Section 10 - MEGAN

MEGAN stands for MEtaGenome ANalyser. It is stand-alone software that you need to install on your computer. It can be downloaded at http://www-ab.informatik.uni-tuebingen.de/software/megan. The purpose of MEGAN is to explore a metagenomics data set in a taxonomical context. To use the program, you need a FASTA file of your library sequences (unassembled, which is exactly what you got from your .sff file) and a Blastx results file for your library sequences (do a Blastx search of each sequence in your library and report the best protein matches for each). The program starts by downloading the current NCBI taxonomy database when you open the program. It then looks at the Blastx results for each sequence and “decides” (based on your parameters) how well it can identify that sequence to the species, genus, family, etc. level.
MEGAN then shows you a taxonomic tree with circles indicating where your sequences mapped onto it (NOT a phylogenetic tree).

mPUMA will generate the input files for your dataset, based on the unassembled reads for each library. However, if you want to generate your own input files, all you need is a FASTA file of all the sequences from each library (the files used in Section 1 of these instructions).

We also have a copy of MEGAN loaded on haruspex that you can work from. To open and use MEGAN from haruspex on your computer:
1. Open a Cygwin-X XWin Server (the program should be in your Start menu; you should get a black X icon with a partial red ring around it in the bottom right tray on your computer). This should also open a Cygwin-X Xterm shell. If the XWin server was already open, just open the Xterm shell.
2. At the command prompt, log in to haruspex.
   a. Type: ssh -X your-account-name@haruspex.usask.ca; for example, for me, the command would be ssh -X chaban@haruspex.usask.ca
3. When prompted for a password, use your regular haruspex password.
4. At the command prompt, type: /usr/local/megan/MEGAN &
   a. The "&" keeps your terminal window useable by running MEGAN in the "background"; it's optional.
5. MEGAN should open. Use as usual.
6. When you want to quit, exit MEGAN from the File menu or click on the red X close button at the top right of the window. You will be taken back to your X-terminal. If you don’t have a command prompt, type control+c to get a prompt. Type exit at the prompt to close your connection with haruspex, then close the X-terminal window.

Mac user note: Connecting to haruspex from a Mac using X11: if you're having trouble getting a program like MEGAN to launch and display properly, try connecting like this instead: ssh -XY username@haruspex.usask.ca
The -Y option turns on "trusted" X11 forwarding for the connection.
It seems to be an issue for systems running older versions of the Mac OS (pre 10.5).

To begin, have the text file of each of your FASTA library sequences in a folder on haruspex.

Step 1 - Do a Blastx analysis of your library

Blastx takes your FASTA file (which is DNA sequence), converts each DNA sequence into its 6 possible protein translations and then does a Blastp search of whatever database you tell it to use. It then reports the best matches (however many you ask for) of each sequence. MEGAN will use these matches to determine if the best match(es) are close enough to a known species to give it that assignment, or if the matches come from the same genus but a mix of species (meaning it can only say the sequence is a Lactobacillus but can’t tell you which species) and so on. To run Blastx, you need to be in the directory that has your FASTA file and use the command (with Putty):
>blastall -p blast_prog -d db_to_use -v num_hits -b num_align -i input.file > output.file
blastall is the command to do a BLAST search (it is accessible from any directory on haruspex, so it doesn’t need a complete pathway like other programs);
-p indicates the type of blast to use. In this case, it will always be blastx;
-d indicates the database to search against. In our case, we want the database cpndb_pep_nr_date (this is a non-redundant protein version of cpnDB that gets updated periodically; check with Janet what the most recent version is). This variable currently reads: /aped/blast_dbs/cpndb_pep_nr_20100504;
-v indicates how many best hits to report. In most cases, you won’t need more than 10;
-b indicates how many best alignments to report (MEGAN needs both hits and alignments).
Again, in most cases, 10 is plenty;
-i input.file is your library FASTA file (DNA sequences);
> output.file tells the program to write the results into this output file to use later.

So, an actual command that I run looks like:
>blastall -p blastx -d /aped/blast_dbs/cpndb_pep_nr_20100504 -v 10 -b 10 -i GJBSLIE09-MID1.fas > GJBSLIE09-MID1_blastx_vb10

This program can take some time to run (depending on the size of the FASTA file, a few seconds to 20+ min) and will generate large files. Be patient. As well, unlike the watered_blast program, nothing will output to the screen; the cursor will just sit at a blank line and might blink (or not). When the program is finished running, the command prompt will return. Copy the blastx file from the server to your computer to work with it in MEGAN. Since you will have to run a Blastx search on every FASTA file for each sample in your library, this is a good place to know how to automate the command line process. See Appendix E for instructions.

Step 2 - Use the FASTA file and Blastx results in MEGAN

To generate a MEGAN tree, open the MEGAN software. The program will load the NCBI taxonomy tree as a default upon start-up. Go to File -> Import from BLAST. A box will open to help you through choosing files. The first screen asks for the Blastx file to import (select it from your computer [in the Open box, you will have to change the Files of Type drop-down menu to All files to see most of your files] and click the Next Step button in the bottom corner), then your FASTA DNA sequence file, then where you want to save the MEGAN analysis result. These are the only 3 things you have to do (there are additional settings on the tabs at the top of the box that you can change here or after you do the analysis. I recommend turning off the COG content and GO content analyses, since we already know we have cpn60 sequences).
A critical consideration is the parameters set for lowest common ancestor (LCA) identification. Click the Apply button to do the analysis. To compare more than one library, do individual analyses for each library and then choose File -> Compare. Select the libraries you want to compare (they must be open in the program). You should play around with settings and parameters and displays. Have fun.

Appendix A - Bonnie’s way of making phylogenetic trees

Programs you will need:

ClustalX: http://www.clustal.org/download/current/
Choose whichever file is right for your computer (Mac or PC):
* clustalx-2.0.10-mac.dmg Mac OS X disk image for ClustalX
* clustalx-2.0.10-win.msi Windows installer for ClustalX
Make yourself a shortcut icon to get to the program.

GeneDoc: http://www.nrbsc.org/gfx/genedoc/
Choose the “OK, OK, Enough already. Download a copy. Last update: March 2, 2007” link and download. This program is for PCs only (I think).

PHYLIP: http://evolution.genetics.washington.edu/phylip/getme.html
Choose whichever file is right for your computer (Mac or PC) from the list. Remember where you download this program because you will have to work with it from its folders.

TreeView: http://taxonomy.zoology.gla.ac.uk/rod/treeview.html
Make yourself a shortcut icon to get to the program. This is another PC-only program. Another good program is TreeExplorer. Mac users are directed to the programs NJPlot or Dendroscope, for excellent tree viewing and editing. Another useful tool is CLC Sequence Viewer.
To create a phylogenetic tree with proper branch lengths (standard tree):
1. Start with a saved Notepad file containing all the sequences (in FASTA format) you want in your tree.
   a. FASTA format is >Name[return]ATGCGTAGCT
   b. Each name can have a max of 8 characters and each name must be unique. Some programs truncate names to 8 characters and if the first 8 characters are the same for more than one sequence, programs will crash. At the very end, you can change the names to whatever you want for your final tree.
   c. It is recommended that you save this file on a memory stick and work everything off of it. For these programs, you’ll need to type in the complete pathway to the file, so most C: drives will require C:\My Documents\More stuff\More stuff\file name, whereas on a memory stick, the file is usually E:\file name
2. Open ClustalX and load your sequences into the program by going to File; Load sequences.
3. Check the multiple alignment parameters by going to Alignment; Alignment Parameters; Multiple Alignment Parameters. If you are working with a 16S rRNA sequence, the defaults are fine, so you can skip this step. If you are working with a protein-coding sequence, change the Gap opening penalty to 50 and the gap extension penalty to 5. These are my rules that I’ve found by trial and error. The idea is that protein-coding genes acquire indels (insertions or deletions) that keep the protein in-frame for translation and gaps tend to be fewer, larger and in groups of 3 nucleotides. Starting a gap is a large penalty, but extending the gap is a much smaller penalty. Click the OK button at the top of the box to finish the change.
4. Do a complete alignment by going to Alignment; Do complete alignment. A window with output files will come up; make sure you are getting a *.aln file and click OK.
5.
Check that the alignment looks reasonable. Several things to watch for:
   a. All your sequences look like they go together. A common problem is that some sequences are in the forward direction and others are in the reverse direction. You’ll see this by getting a block of sequences aligning well and a second block of sequences aligning well but the 2 blocks don’t match. Try reverse complementing one block with revseq in the EMBOSS package of programs (http://haruspex.usask.ca/emboss) and try again.
   b. All your sequences are about the same length. If some sequences are significantly longer than others (more than 100 bp difference in length), Clustal has a hard time aligning properly. If you see a sequence that stands out of the alignment at the beginning or end, consider removing that sequence and trying again.
6. When you are happy that your alignment looks reasonable, you can exit ClustalX.
7. Open the program GeneDoc. To load your sequences, go to File; Import. In the prompt box that appears, the Select input device is File and the Select the type of file is Clustal (ALN). Click the Import button and select the Clustal alignment you just made. You can then click Done in the Import box.
8. Next you want to trim all the aligned sequences to the same length and remove any PCR primer sequences or cloning vector sequences if they are present. It is OK to have a few sequences a little shorter than others, but try to avoid this if possible (makes a more accurate tree). To trim from the ends, go to Edit; Select columns. Click on the first, then last column you want to trim to highlight. Then go to Edit; Delete all data.
9. Save your trimmed alignment by going to File; Export. In the Export box, you want Select Sequences - Select All, Export device - File, and File type - Phylip (the file type is very important - need Phylip for next step). Click Export button and save.
You can then click Done in the Export box, save the trimmed alignment as a GeneDoc file if you want and exit the program.
10. Open your folder with PHYLIP in it and go into the exe folder (should have a list of icons for running programs).
11. Check that there is no file called outfile or outtree in this folder. If there is, delete it before you start.
12. Open either DNADIST (if analyzing DNA sequences) or PROTDIST (if analyzing protein sequences). A shell window should open telling you the program couldn’t find “infile” and asks you to enter a file name. Type in the pathway to your GeneDoc PHYLIP aligned file (E:\name.phy).
13. A list of settings will come up. For a single tree distance matrix (which is what you are making), all the defaults are good. Type in “Y” for yes and press Enter.
14. The program will run and tell you that an output file was written to “outfile”. Press Enter to quit the program. Look in your exe folder for PHYLIP and you should have an outfile. Move this file to your E: drive and rename it DNADIST_something or PROTDIST_something. Delete the outfile in the exe folder.
15. Open NEIGHBOR (PHYLIP uses this spelling for the program name). A shell window opens looking for a new file name. Type in the name of your file from DNADIST/PROTDIST (i.e. E:\PROTDIST_something). If you forgot to delete the outfile, the program will ask you if you want to overwrite it; select “R” to replace it.
16. A list of settings comes up. For a basic tree with branch lengths, the defaults are good. Type “Y” to accept and run the program and Enter to quit when done.
17. You will now have an outfile and an outtree file in the exe folder. Copy the outtree file to your memory stick and rename it Neighbour_something, then delete both the outfile and outtree file from the exe folder.
18. At this point, you may want to change the names that will appear on the tree.
Remember, you gave everything an 8 character or less name on the FASTA file (or Clustal truncated the names to 8-10 characters for you for the program to work with it). Most of the time, you will want the full genus/species name to appear on the tree. Go to the instructions “Changing labels on phylogenetic trees” listed below and follow those steps. You should end up with 1 file when you are done: NEIGHBOUR_name_change_something. Otherwise, you can view your tree with the current name tags.
19. Open TreeView (or whichever tree viewing/editing program you want to use; these instructions are for TreeView) and go to File; Open and find your NEIGHBOUR_name_change_something or Neighbour_something file (you will probably have to change the “Files of type” drop-down menu below the File name to “All files” to see your NEIGHBOUR_name_change_something or Neighbour_something file in the folder).
20. To change the default tree that opens to something useful, go to Tree; phylogram to change the tree to a rectangular tree with the branch lengths showing. This view should also have a scale bar at the bottom.
21. To designate an outgroup and separate it from the tree, go to Tree; Define outgroup. From the window that pops up, select your outgroup and move it over to the outgroup box. Click OK when done. Then go to Tree; Root with outgroup. This will rearrange the tree so that your outgroup is separate.
22. You can also play with the view of your tree by selecting Tree; Order; ladderize one way or another.
23. To save your tree when you are done, go to File; Save as. Very important!! When saving, you NEED to change the default setting Save as type: NEXUS to PHYLIP/Newick 8:45 and tick the Root with outgroup and Include edge lengths boxes. If you don’t, you will lose all the branch lengths in the saved tree! You can also save your tree as a picture by going to File; Save as graphic.
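
Step 1b above (names of at most 8 characters, unique within their first 8 characters) is easy to get wrong by hand when a tree has many sequences. The following is a minimal Python sketch of a check you could run on the FASTA text before loading it into ClustalX. It is not part of the original workflow; the function names and the example tags are made up for illustration, and the limit of 8 is simply the number quoted in step 1b.

```python
from collections import defaultdict

MAX_TAG_LENGTH = 8  # limit quoted in step 1b; adjust if your programs differ


def read_fasta_names(fasta_text):
    """Return the tag after '>' on every FASTA header line (up to the first space)."""
    names = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            fields = line[1:].split()
            names.append(fields[0] if fields else "")
    return names


def find_name_problems(names, max_len=MAX_TAG_LENGTH):
    """Report tags that are too long and tags that share their first max_len characters."""
    too_long = [name for name in names if len(name) > max_len]
    by_prefix = defaultdict(list)
    for name in names:
        by_prefix[name[:max_len]].append(name)
    clashes = {prefix: group for prefix, group in by_prefix.items() if len(group) > 1}
    return too_long, clashes


if __name__ == "__main__":
    # Hypothetical example tags, not taken from any real library
    example = ">Lcrisp01\nATGCGTAGCT\n>Lcrisp01b\nATGCGTAGGT\n>Gvag01\nATGCCTAGCT\n"
    too_long, clashes = find_name_problems(read_fasta_names(example))
    print("Longer than 8 characters:", too_long)
    print("Same first 8 characters:", clashes)
```

Any tag reported in either list should be shortened or renamed in the Notepad file before starting the alignment; the full names can be restored at the end with the label-changing instructions.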
To create a phylogenetic tree with proper branch lengths AND bootstrap values: 1. Start with a saved Notepad file containing all the sequences (in FASTA format) you want in your tree. a. FASTA format is >Name[return]ATGCGTAGCT b. Each name can have a max of 8 characters and each name must be unique. Some programs truncate names to 8 characters and if the first 8 characters are the same for more than one sequence, programs will crash. At the very end, you can change the names to whatever you want for your final tree. c. Recommended that you save this file on a memory stick and work everything off of it. For these programs, you’ll need to type in the complete pathway to the file, so most C: drives will require C:\My Documents\More stuff\More stuff\file name, whereas on a memory stick, the file is usually E:\file name 2. Open ClustalX and load your sequences into the program by going to File; Load sequences. 3. Check the multiple alignment parameters by going to Alignment; Alignment Parameters; Multiple Alignment Parameters. If you are working with a 16S rRNA sequence, the defaults are fine, so you can skip this step. If you are working with a protein-coding sequence, change the Gap opening penalty to 50 and the gap extension penalty to 5. These are my rules that I’ve found by trial and error. The idea is that protein-coding genes acquire indels (insertions or deletions) that keep the protein in-frame for translation and gaps tend to be fewer, larger and in groups of 3 nucleotides. Starting a gap is a large penalty, but extending the gap is a much smaller penalty. Click the OK button at the top of the box to finish the change. 4. Do a complete alignment by going to Alignment; Do complete alignment. A window with output files will come up; make sure you are getting a *.aln file and click OK. 5. Check that the alignment looks reasonable. Several things to watch for: a. All your sequences look like they go together. 
A common problem is that some sequences are in the forward direction and others are in the reverse direction. You’ll see this by getting a block of sequences aligning well and a second block of sequences aligning well but the 2 blocks don’t match. Try reverse complementing one block with revseq in the EMBOSS package of programs (http://haruspex.usask.ca/emboss) and try again.
   b. All your sequences are about the same length. If some sequences are significantly longer than others (more than 100 bp difference in length), Clustal has a hard time aligning properly. If you see a sequence that stands out of the alignment at the beginning or end, consider removing that sequence and trying again.
6. When you are happy that your alignment looks reasonable, you can exit ClustalX.
7. Open the program GeneDoc. To load your sequences, go to File; Import. In the prompt box that appears, the Select input device is File and the Select the type of file is Clustal (ALN). Click the Import button and select the Clustal alignment you just made. You can then click Done in the Import box.
8. Next you want to trim all the aligned sequences to the same length and remove any PCR primer sequences or cloning vector sequences if they are present. It is OK to have a few sequences a little shorter than others, but try to avoid this if possible (makes a more accurate tree). To trim from the ends, go to Edit; Select columns. Click on the first, then last column you want to trim to highlight. Then go to Edit; Delete all data.
9. Save your trimmed alignment by going to File; Export. In the Export box, you want Select Sequences - Select All, Export device - File, and File type - Phylip (the file type is very important - need Phylip for next step). Click Export button and save. You can then click Done in the Export box, save the trimmed alignment as a GeneDoc file if you want and exit the program.
10.
Open your folder with PHYLIP in it and go into the exe folder (should have a list of icons for running programs) 11. Check that there is no file called outfile or outtree in this folder. If there is, delete it before you start. 12. Open either DNADIST (if analyzing DNA sequences) or PROTDIST (if analyzing protein sequences). A shell window should open telling you the program couldn’t find “infile” and asks you to enter a file name. Type in the pathway to your GeneDoc PHYLIP aligned file (E:\name.phy) 13. A list of settings will come up. For a single tree distance matrix (which is what you are making), all the defaults are good. Type in “Y” for yes and press Enter. 14. The program will run and tell you that an output file was written to “outfile”. Press Enter to quit the program. Look in your exe folder for PHYLIP and you should have an outfile. Move this file to your E: drive and rename it DNADIST_something_1 or PROTDIST_something_1. This file contains the distance matrix we will need later. Delete the outfile in the exe folder. 15. Open SEQBOOT. A shell window should open telling you the program couldn’t find “infile” and asks you to enter a new file name. Type in the pathway to your GeneDoc PHYLIP aligned file again (E:\name.phy) 16. A list of settings should come up. Check how many replicates you want to make (random variations of your original data file). The default is 100; the other commonly used value is 1000. To change the value, type “R” and then 1000 when prompted. Everything else is good – type “Y” to run the program. The program will ask for a random seed number to start – type in any odd number. 17. The program will run and tell you that an output file was written to “outfile”. Press Enter to quit the program. Look in your exe folder for PHYLIP and you should have an outfile. Move this file to your E: drive and rename it SEQBOOT_something_100 (if you ran 100 replicates). Delete the outfile in the exe folder. 18. 
Open either DNADIST (if analyzing DNA sequences) or PROTDIST (if analyzing protein sequences) again. A shell window should open telling you the program couldn’t find “infile” and asks you to enter a file name. Type in the path to your SEQBOOT file (i.e. E:\SEQBOOT_something_100). A list of settings will come up. We need to create distance matrices for 100 files, so we need to change one setting. Type “M” to change the Analyze multiple data sets setting, type “D” when asked whether you have multiple data sets or multiple weights, and then enter the number of data sets (either 100 or 1000). Then type in “Y” to run the program (this takes some time, especially for the 1000 data sets).
19. The program will run and tell you that an output file was written to “outfile”. Press Enter to quit the program. Look in your exe folder for PHYLIP and you should have an outfile. Move this file to your E: drive and rename it DNADIST_something_100 or PROTDIST_something_100. Delete the outfile in the exe folder.
20. Open NEIGHBOR (PHYLIP uses this spelling for the program name). A shell window opens looking for a new file name. Type in the name of your file from DNADIST/PROTDIST (i.e. E:\PROTDIST_something_100). A list of settings will come up. We need to create trees for 100 files, so we need to change one setting. Type “M” to change the Analyze multiple data sets setting, enter the number of data sets (either 100 or 1000) and then enter a random seed number when prompted. The rest of the defaults are good, so type in “Y” to run the program.
21. You will now have an outfile and an outtree file in the exe folder. Copy the outtree file to your memory stick and rename it Neighbour_something_100, then delete both the outfile and outtree file from the exe folder.
22. Open CONSENSE. A shell window should open telling you the program couldn’t find “infile” and asks you to enter a file name. Type in the path to your Neighbour_something_100 file. A list of settings will come up.
The defaults are good; type “Y” to run the program and Enter when done.
23. You will now have an outfile and an outtree file in the exe folder. Copy the outtree file to your memory stick and rename it CONSENSE_something_100, then delete both the outfile and outtree file from the exe folder. This file now has a consensus tree with bootstrap values but no branch lengths.
   a. After building the consensus tree, for some versions of PHYLIP, on some operating systems (many Macs), the consense.trees file may not be interpretable by FITCH. To solve this problem, use RETREE. Open the consense.trees file with RETREE and then quit, saving the tree as an unrooted tree. The resulting tree file looks identical to the one you started with, but apparently is different in some way that only FITCH can understand. It's strange, but it works.
24. To add branch lengths to the consensus tree, open FITCH. A shell window should open telling you the program couldn’t find “infile” and asks you to enter a file name. Type in the path to your distance matrix from only the original file. This is either DNADIST_something_1 or PROTDIST_something_1. A list of settings will come up. We need to change the Search for best tree setting; type “U” so that the setting reads “No, use user trees in input file”. Everything else is good; type “Y” to run. The program will then say it couldn’t find the intree file and ask for a new file name. Type in the path to your consensus tree: CONSENSE_something_100 (or the RETREE file, if the program gives you trouble).
25. You will now have an outfile and an outtree file in the exe folder. Copy the outtree file to your memory stick and rename it FITCH_something_100, then delete both the outfile and outtree file from the exe folder. This file now has a consensus tree with branch lengths but no bootstrap values.
26. At this point, you may want to change the names that will appear on the tree.
Remember, you gave everything an 8 character or less name on the FASTA file (or Clustal truncated the names to 8 characters for you for the program to work with it). Most of the time, you will want the full genus/species name to appear on the tree. Go to the instructions “Changing labels on phylogenetic trees” listed below and follow those steps. You should end up with 2 files when you are done: FITCH_name_change_something_100 and CONSENSE_name_change_something_100. Otherwise, you can view your tree with the current name tags.
27. Open TreeView (or whichever tree viewing/editing program you want to use; these instructions are for TreeView) and go to File; Open and find your FITCH_name_change_something_100 or FITCH_something_100 file (you will probably have to change the “Files of type” drop-down menu below the File name to “All files” to see your FITCH_name_change_something_100 or FITCH_something_100 file in the folder).
28. To change the default tree that opens to something useful, go to Tree; phylogram to change the tree to a rectangular tree with the branch lengths showing. This view should also have a scale bar at the bottom. This tree is the consensus tree with branch lengths.
29. To designate an outgroup and separate it from the tree, go to Tree; Define outgroup. From the window that pops up, select your outgroup and move it over to the outgroup box. Click OK when done. Then go to Tree; Root with outgroup. This will rearrange the tree so that your outgroup is separate.
30. Once you have your FITCH tree arranged like this, open another TreeView window by going to File; Open and select your CONSENSE_name_change_something_100 or CONSENSE_something_100 file. Rearrange your windows to see both trees at the same time. Repeat steps 28 and 29 to give the tree the same order (the tree will look different because all the branch lengths will be the same length). To see the bootstrap values, go to Tree; Show internal edge labels.
The numbers that appear at the branching points are your bootstraps. They tell you how many times that branching point looked that way in the 100 or 1000 trees you tested. The higher the number, the more confidence you can have in that branching point.
31. You can also play with the view of your tree by selecting Tree; Order; ladderize one way or another.
32. There is no way to merge the two trees at the moment, so to make a presentation/publication tree, you need to copy the FITCH tree into PowerPoint and then use text boxes to insert the bootstrap values at the proper nodes.
33. To save your tree when you are done, go to File; Save as. Very important!! When saving, you NEED to change the default setting Save as type: NEXUS to PHYLIP/Newick 8:45 and tick the Root with outgroup and Include edge lengths boxes. If you don’t, you will lose all the branch lengths in the saved tree! You can also save your tree as a picture by going to File; Save as graphic.

To change the labels on your phylogenetic tree from the 8 character tag to real names

The only way I know of to quickly and easily change the names of labels on phylogenetic trees is to use the UNIX command ‘sed’. To use this, you will need to create a file with the old label (tag name) and the new label (usually the full organism name) in the correct format. You then upload the sed command file, along with your tree file(s), to our server, haruspex.usask.ca. Haruspex is a UNIX-based system, so it contains the sed command. The procedure is:
1. Open a new Excel file. You will use the first 3 columns; do not give the columns headers.
   a. In the first column, enter the name you want to appear on the final tree (the full name; it can have spaces).
   b. In the second column, enter the name that is on the FASTA file (the 10 character or less tag name that comes after the ‘>’ that has no spaces).
This second column has to match what is in your FASTA file exactly or the program will not recognize it.
   c. In the third column, type in =CONCATENATE("s/",B1,"/'",A1,"'/g")
   This has to be entered just like this (for the first row). You can then highlight and “Fill down” the rest of your table so that every full name/tag name pair has a command that looks like:
   s/gi|5763993/'Thermococcus kodakaraensis KOD1'/g
   The sed command structure is s/old name/'new name'/g, which means substitute (‘s’) the text between the first pair of / marks with the text between the second pair, every time the old name is found (g = globally in the file). The single quotes are written into the tree file around the new name; they mark the label as quoted text so that spaces in it are OK.
   d. Once you have typed out the complete list of names you want to change and have copied the concatenate command down the column, highlight the complete third column and copy it.
2. Open a Notepad file. Paste your third column into the Notepad file so that you have a list of lines that look like:
   s/gi|5763993/'Thermococcus kodakaraensis KOD1'/g
   In this file, check that none of your names have special characters like “/”. If they do, you will need to insert the “\” character before the special character (so that the program knows the character immediately after the “\” should be taken as a character and not a special symbol). For example:
   s/gi|2417696/'Sulfolobus solfataricus 98\/2'/g
   Save this Notepad file; this is the file you will use (it is a list of commands for sed to follow). I call these Name_change_something.
3. Open WinSCP and log onto haruspex.usask.ca. Upload your Name_change_something Notepad file to the server. Upload your tree files to be renamed (NEIGHBOUR_something_1 for single trees or FITCH_something_100 and CONSENSE_something_100 for bootstrapped trees). I usually create a folder called Name_change_something to keep things organized.
4. Open Putty and log onto haruspex.usask.ca.
Use the ls command (type ls at the > and press enter) to see the contents of the directory you are in. Use the cd command to move to different folders (type cd Name_change to move down into the Name_change folder). When you use the ls command and your Name_change_something and tree files show up in the list, you are in the right directory.
5. At the >, type
   sed -f Name_change_something NEIGHBOUR_something_1 > NEIGHBOUR_renamed_something_1
and press enter. The format is: sed [single space] -f [use a file of commands] [single space] name of file with sed commands and names [single space] name of tree file to be renamed [single space] > [symbol meaning write the output into a file] [single space] name of new renamed tree file.
6. If you use the ls command again, a new file called NEIGHBOUR_renamed_something_1 should appear. Go back to WinSCP and refresh the window (button at the top of the window). The renamed tree file should appear. Download it to your local computer. Open the renamed file with TreeView as described above (continue with the general instructions).

Appendix B - Installing Cygwin X on a PC (for running GeneSpring)
Instructions to follow.

Appendix C - Loading data into GeneSpring and basic functions
See Powerpoint slides.

Appendix D - Basic starting with WinSCP and Putty
See Powerpoint slides.

Appendix E - Useful things to know and alternative procedures

Automating the command line processes

If you have many libraries to analyze, going through each library and individually typing in the commands for the first two steps of the Nearest Neighbour analysis and the first step of the Megan analysis can take a very long time. If your input FASTA file has thousands of sequences, it can take up to an hour to complete the watered-BLAST and Blastx searches. However, you can write a little program file that tells haruspex to run one command after another, without you having to monitor it.
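As a rough illustration of such a program file, the sketch below writes one watered_blast.pl command per library into a shell command file and marks it executable. It is only a sketch under stated assumptions: the script path, database name and output naming pattern are copied from the examples in this guide, the function names build_command and write_command_file are hypothetical, and the input paths in the demo are placeholders.

```python
import os
import stat
import tempfile

# Script path and database name are copied from the examples in this guide.
WATERED_BLAST = "/home/pipeline/aped/bin/watered_blast.pl"
BLAST_DB = "/aped/blast_dbs/cpndb_nr_20100504"


def build_command(fasta_path, n_hits=5):
    """Return one watered_blast.pl command line for a single FASTA library.

    Assumption: the output file is named after the input file, with the
    '.fas' ending replaced by '_wateredBlast_<n>best_unfiltered'.
    """
    stem = fasta_path[:-len(".fas")] if fasta_path.endswith(".fas") else fasta_path
    output_path = f"{stem}_wateredBlast_{n_hits}best_unfiltered"
    return (f"{WATERED_BLAST} -i {fasta_path} -d {BLAST_DB} "
            f"-p blastn -v {n_hits} > {output_path}")


def write_command_file(fasta_paths, command_file):
    """Write a shell command file (hypothetical helper) and make it executable."""
    lines = ["#!/bin/sh"] + [build_command(path) for path in fasta_paths]
    # newline="\n" keeps UNIX line endings even if this runs on Windows.
    with open(command_file, "w", newline="\n") as handle:
        handle.write("\n".join(lines) + "\n")
    mode = os.stat(command_file).st_mode
    os.chmod(command_file, mode | stat.S_IXUSR)  # owner-executable, like chmod u+x
    return lines


if __name__ == "__main__":
    # Placeholder input paths, written to a throwaway directory for the demo.
    with tempfile.TemporaryDirectory() as workdir:
        target = os.path.join(workdir, "run_all_libraries")
        for line in write_command_file(["BC3/lib_one.fas", "BC4/lib_two.fas"], target):
            print(line)
```

The commands are written in list order, so they would run in that order once the file is copied to the server and started with ./filename.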
All the command files are created using the same idea: use Excel to create a concatenated command structure, copy those commands to a text file (Notepad, for example), add the line to the text file that makes it a command file, transfer the file to haruspex and change the file status to executable.

To make the commands in Excel, identify which part of the command is always the same and which part needs to change. For example, Step 1 of the Nearest Neighbour analysis is the watered-BLAST command:
   /home/pipeline/aped/bin/watered_blast.pl -i fasta.file -d blast_db -p blast_program -v numb_hits > output.file
In this command, the only parts that change between libraries are the input fasta.file and the output.file. So, if you create a column in Excel that lists the input files and a second column that lists the output files, you can use the Concatenate formula to create a list of properly formatted commands. You type in the first concatenate command and then just Fill -> Down the list. This generates the finished command, as shown in the third row of this example:

Row 1
   Column A (input file): /home/chaban/Pyro_fasta/BC3/GJBSLIE11-MID4.combined.fas
   Column B (output file): /home/chaban/Pyro_fasta/BC3/GJBSLIE11-MID4.combined_wateredBlast_5best_unfiltered
   Column C (formula as typed): =CONCATENATE("/home/pipeline/aped/bin/watered_blast.pl -i ", A1, " -d /aped/blast_dbs/cpndb_nr_20100504 -p blastn -v 5 > ", B1)
Row 2
   Column A (input file): /home/chaban/Pyro_fasta/BC3/GJBSLIE11-MID5.combined.fas
   Column B (output file): /home/chaban/Pyro_fasta/BC3/GJBSLIE11-MID5.combined_wateredBlast_5best_unfiltered
   Column C (formula after Fill -> Down): =CONCATENATE("/home/pipeline/aped/bin/watered_blast.pl -i ", A2, " -d /aped/blast_dbs/cpndb_nr_20100504 -p blastn -v 5 > ", B2)
Row 3
   Column A (input file): /home/chaban/Pyro_fasta/BC4/GJBSLIE12-MID10.combined.fas
   Column B (output file): /home/chaban/Pyro_fasta/BC4/GJBSLIE12-MID10.combined_wateredBlast_5best_unfiltered
   Column C (what the cell displays): /home/pipeline/aped/bin/watered_blast.pl -i /home/chaban/Pyro_fasta/BC4/GJBSLIE12-MID10.combined.fas -d /aped/blast_dbs/cpndb_nr_20100504 -p blastn -v 5 > /home/chaban/Pyro_fasta/BC4/GJBSLIE12-MID10.combined_wateredBlast_5best_unfiltered

Note - if you want to sort your files as you create them, you will need to give the complete pathway for filenames. In the example above, the first two files were from region 3 of the sequencing plate, so I have sorted the FASTA files into a folder called BC3, whereas the third file is from region 4, so I have that FASTA file in a folder called BC4. I have also written the output files to go into the matching folders.

Once you are happy with your column of commands, copy that column into a text program (Notepad). At the very top of the file, add the line:
   #! /bin/sh[hard return]
For example, your file should look like:
   #! /bin/sh
   /home/pipeline/aped/bin/watered_blast.pl -i /home/chaban/Pyro_fasta/BC1/GJBSLIE09-MID10.combined.fas -d /aped/blast_dbs/cpndb_nr_20100504 -p blastn -v 5 > /home/chaban/Pyro_fasta/BC1/GJBSLIE09-MID10.combined_wateredBlast_5best_unfiltered
   /home/pipeline/aped/bin/watered_BLAST_mapping_freq_best.pl -i 55 -l 100 -r 1 /home/chaban/Pyro_fasta/BC1/GJBSLIE09-MID10.combined_wateredBlast_5best_unfiltered > /home/chaban/Pyro_fasta/Combined_Nearest_Neighbour_all35datasets/NearestNeighbour_BC1_MID10
   /home/pipeline/aped/bin/watered_blast.pl -i /home/chaban/Pyro_fasta/BC1/GJBSLIE09-MID4.combined.fas -d /aped/blast_dbs/cpndb_nr_20100504 -p blastn -v 5 > /home/chaban/Pyro_fasta/BC1/GJBSLIE09-MID4.combined_wateredBlast_5best_unfiltered
   /home/pipeline/aped/bin/watered_BLAST_mapping_freq_best.pl -i 55 -l 100 -r 1 /home/chaban/Pyro_fasta/BC1/GJBSLIE09-MID4.combined_wateredBlast_5best_unfiltered > /home/chaban/Pyro_fasta/Combined_Nearest_Neighbour_all35datasets/NearestNeighbour_BC1_MID4
Be aware that the commands will run in the order they are listed. Once you have created this text file, save it and copy it to haruspex (you can use WinSCP). Before you can run the command file, you must make it executable. Open a Putty shell and, in the folder where you uploaded your command file, type
   >chmod +x filename
Your file is now ready to run. At the command prompt, type
   >./filename
The list of commands in your command file should run one after the other until the end, and then the command prompt will return. Depending on what programs you put in the command file, this could take several hours. Go and do something else :)

To pull out a list of FASTA sequences from a large file of FASTA sequences

This could be useful if you wanted to pull out only the sequences that had good quality hits, or to put together a file of sequences that are reversed (the match to the cpn60 sequence is on the negative strand - you can see this in the unfiltered watered Blast file) so that you can reverse complement them to do multiple alignments. To do this, you need the text file that has all your FASTA sequences in it.
1. Do a filtered watered_Blast with the grep command to get a list that has only results (no N/A lines).
a. Remember: do the original watered Blast search, then use:
   /home/pipeline/aped/bin/watered_BLAST_mapping_freq_best.pl -i # -l # -r # unfiltered_watered_Blast_input_file | grep -v "N/A" > output_file_name
2. Open the filtered file in Excel.
a. It will have READ, %ID, Number and Hit columns.
3. Create a command file to copy only these FASTA sequences from the main FASTA file (poor hits or really short reads removed).
a. Copy the READ column to a new Excel worksheet in column A.
b. In column B, create a list with all the names in lower case (you will need these file names in lower case). Use the command =LOWER(cell). In cell B1, this looks like =LOWER(A1). Fill this command down the column.
(Note - if your file names are already lower case, skip this step.)
c. In column C, create a command list with CONCATENATE. This will look like:
   =CONCATENATE("cp ",B1,".fasta collection_directory_name/")
and you can fill down the column. You should end up with a string of commands that look something like:
   cp ghder24ks.fasta File_sorting_collection
(copy the file called ghder24ks.fasta to the directory called File_sorting_collection)
d. Copy column C into Notepad, add #!/bin/sh as the top line of the file and save.
4. Go to Haruspex (open a Putty shell and WinSCP).
a. Create a new directory to work in (example: Individual_FASTA_files_project). In this directory put:
i. Your complete FASTA file with all your sequences
ii. The command file you just made
iii. A new directory (within this directory) that is your collection directory (the collection_directory_name you put in your command file)
5. Use the seqret command to break up the complete multiple FASTA file into individual files:
a. Type: seqret -ossingle multiple_FASTA_file_name
b. You should now have a whole list of files called *.fasta in the Individual_FASTA_files_project directory.
6. Run your command file to copy only the FASTA files you want into the collection directory.
a. Make the command file executable by typing: chmod +x command_file_name
b. Run the file by typing: ./command_file_name
7. Move into the collection directory (cd command) and check to see if you got all the files you wanted.
a. Use the command ls | wc to count the files in the directory (the first number reported is the number of files).
8. To put all the sequences back into one multiple FASTA file, concatenate the individual files.
a. Use the command: cat *.fasta > new_file_name
9. Copy the new file wherever you like and work with it.

To pull out only the sequences that are on the plus strand or minus strand, do the following:
1. Take the unfiltered watered_BLAST output file and open it in Excel (it doesn't matter what the settings were - it is OK if you got more than one hit per sequence).
a. This gives you a list of FASTA sequence names (duplicated if there was more than one hit) with the strandedness at the end.
2. Copy the FASTA name column (example: GHEJFI78IFJS) and the strandedness column (- or +) to a new worksheet.
3. Sort the new table by the strandedness column to pull all the same-direction sequences together.
4. Copy the FASTA names of the strand you want into a new column. Do an advanced filter to get only unique names listed in a new column (go to the Data tab and choose Filter -> Advanced. Change the advanced filter action option to "Copy to another location"; click on the right-side box of each range to select "List range" as the starting list (click on the top of the column to select all), "Criteria range" as the starting list and "Copy to" as the column you want the results in; select the box beside "Unique records only" and click OK).
5. Copy this filtered list to a new worksheet and continue with the instructions above at step 3.

Alternative way to make chimera checking table (step 2)

Step 2 - Extract and compare the beginning and end of each sequence
1. Extract the first portion of each isotig sequence for chimera checking. I'm going to try 150 bp from the start.
a. Use the seqret command as written:
i. seqret -sbegin 1 -send 150 genespring_otu.fasta beginning_150bp_chimera_check
2. Extract the last portion of each isotig sequence for chimera checking. To do this, you will have to reverse complement the original file, extract your 150 bp from the beginning, then reverse complement your extracted file.
a. Reverse the original file with seqret as written:
i. seqret -sreverse genespring_otu.fasta genespring_otu_reverse.fasta
b. Extract what are now the first 150 bp of these sequences:
i. seqret -sbegin 1 -send 150 genespring_otu_reverse.fasta end_reverse_comp_150bp_chimera_check
c. Reverse the extracted file so your sequences are in the right orientation:
i. seqret -sreverse end_reverse_comp_150bp_chimera_check end_150bp_chimera_check
3. Take the files beginning_150bp_chimera_check and end_150bp_chimera_check and do a watered_blast search on each. Return just the top, single hit.
a. Run the 2 watered_blast scripts as follows (update the cpndb_nr database date as updates are made):
i. /home/pipeline/aped/bin/watered_blast.pl -i beginning_150bp_chimera_check -d /aped/blast_dbs/cpndb_nr_20110622 -p blastn -v 1 > beginning_150bp_chimera_check_wateredBlast
ii. /home/pipeline/aped/bin/watered_blast.pl -i end_150bp_chimera_check -d /aped/blast_dbs/cpndb_nr_20110622 -p blastn -v 1 > end_150bp_chimera_check_wateredBlast
b. Copy the watered_blast results to your computer and open both with Excel.
4. In a new Excel file, copy the isotig name into the first column (from either file). Copy the % ID of match, Closest hit header and Strandedness columns from the beginning_150bp_chimera_check_wateredBlast file into the next 3 columns, and the same columns from the end_150bp_chimera_check_wateredBlast file into the columns after that. Your file should look like (hit headers are cut off at the column width):

Query | % ID of match | Closest hit header | Strandedness | % ID of match | Closest hit header | Strandedness
isotig00005 | 98 | b255 AF036324 Staphylococcus aureus subsp. aureus A | + | 74.3 | b10882 NZ_ABBX01000122 Candidate division TM7 sp. | +
isotig00021 | 98.7 | b255 AF036324 Staphylococcus aureus subsp. aureus A | + | 76.5 | b10882 NZ_ABBX01000122 Candidate division TM7 sp. | +
isotig00025 | 98.7 | b255 AF036324 Staphylococcus aureus subsp. aureus A | + | 100 | b1123 AF352799 Streptococcus parasanguinis ATCC 15 | +
isotig00028 | 98.7 | b255 AF036324 Staphylococcus aureus subsp. aureus A | + | 100 | b1123 AF352799 Streptococcus parasanguinis ATCC 15 | +
isotig00057 | 98.7 | b255 AF036324 Staphylococcus aureus subsp. aureus A | + | 100 | b255 AF036324 Staphylococcus aureus subsp. aureus A | +
isotig00081 | 99.3 | b255 AF036324 Staphylococcus aureus subsp. aureus A | - | 99.2 | b255 AF036324 Staphylococcus aureus subsp. aureus A | -
isotig00187 | 99.3 | b824 AF245666 Enterococcus hirae ATCC 8043 | - | 100 | b10197 EF173658 Staphylococcus epidermidis ATCC 14 | -

5. In the next column (should be H), add the title "Same closest hit header". In this column, you will use a True/False comparison on the two Closest hit header columns. Add the formula ="cell of beginning header"="cell of end header"
a. It will look like: =C2=F2
b. Fill down the entire column. If the headers are the same, the cell should say TRUE. If the headers are different, it should say FALSE.
6. In the next column (should be I), add the title "Same Strandedness". In this column, you will use a True/False comparison on the two Strandedness columns. Add the formula ="cell of beginning Strandedness"="cell of end Strandedness"
a. It will look like: =D2=G2
7. Fill down the entire column. If the best match was in the same direction on the best hit reference sequence for both ends, the cell should say TRUE. If the directions are different, it should say FALSE.
Your table should now look like (hit headers are cut off at the column width):

Query | % ID of match | Closest hit header | Strand | % ID of match | Closest hit header | Strand | Same closest hit header | Same strandedness
isotig00005 | 98 | b255 AF036324 Staphylococcus aureus su | + | 74.3 | b10882 NZ_ABBX01000122 Candidate div | + | FALSE | TRUE
isotig00021 | 98.7 | b255 AF036324 Staphylococcus aureus su | + | 76.5 | b10882 NZ_ABBX01000122 Candidate div | + | FALSE | TRUE
isotig00025 | 98.7 | b255 AF036324 Staphylococcus aureus su | + | 100 | b1123 AF352799 Streptococcus parasangu | + | FALSE | TRUE
isotig00028 | 98.7 | b255 AF036324 Staphylococcus aureus su | + | 100 | b1123 AF352799 Streptococcus parasangu | + | FALSE | TRUE
isotig00057 | 98.7 | b255 AF036324 Staphylococcus aureus su | + | 100 | b255 AF036324 Staphylococcus aureus su | + | TRUE | TRUE
isotig00081 | 99.3 | b255 AF036324 Staphylococcus aureus su | - | 99.2 | b255 AF036324 Staphylococcus aureus su | - | TRUE | TRUE
isotig00187 | 99.3 | b824 AF245666 Enterococcus hirae ATCC | - | 100 | b10197 EF173658 Staphylococcus epiderm | - | FALSE | TRUE

8. Sort the entire table by Same strandedness, then by Same closest hit header.
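The two TRUE/FALSE columns and the final sort can also be sketched in a few lines of Python, for anyone who prefers a script to a spreadsheet. This is a minimal sketch, not part of the original procedure: the function name compare_ends and the dictionary keys are hypothetical, and it assumes the hit header and strandedness for each end have already been read from the two watered_blast result files.

```python
def compare_ends(rows):
    """Add the two comparison columns and sort like step 8.

    Each row is a dict with the hypothetical keys 'query', 'begin_hit',
    'begin_strand', 'end_hit' and 'end_strand'.
    """
    compared = []
    for row in rows:
        annotated = dict(row)
        # Equivalent of the Excel formula =C2=F2
        annotated["same_hit"] = row["begin_hit"] == row["end_hit"]
        # Equivalent of the Excel formula =D2=G2
        annotated["same_strand"] = row["begin_strand"] == row["end_strand"]
        compared.append(annotated)
    # Sort by same strandedness, then by same closest hit header.
    # False sorts before True, as FALSE does in an ascending Excel sort.
    compared.sort(key=lambda r: (r["same_strand"], r["same_hit"]))
    return compared


if __name__ == "__main__":
    example = [
        {"query": "isotig00057", "begin_hit": "b255", "begin_strand": "+",
         "end_hit": "b255", "end_strand": "+"},
        {"query": "isotig00005", "begin_hit": "b255", "begin_strand": "+",
         "end_hit": "b10882", "end_strand": "+"},
    ]
    for r in compare_ends(example):
        print(r["query"], r["same_hit"], r["same_strand"])
```

Rows where either comparison is False end up at the top of the sorted list; deciding what to do with them is left to Step 3 of the chimera checking instructions.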
Return to the chimera checking instructions and continue with Step 3.

Appendix F - Databases we have

To do watered_Blast (DNA to DNA): /aped/blast_dbs/
• cpndb_nr_20110622 (DNA sequences - all cpn60 UT non-redundant sequences from cpnDB as of June 22nd, 2011)
• tcp_archaea_UT_reference_database (DNA sequences - 166 archaeal type II chaperonin sequences trimmed to my UT target)
• archaea_tcp_20100409 (DNA sequences - 166 full length archaeal type II chaperonin sequences)
• groupii_all_full_seq_4Aug2010 (DNA sequences - all the full length type II chaperonin sequences in cpnDB as of Aug 4th, 2010)
• archaea_16s_20100409 (DNA sequences - 84 full length archaeal 16S reference sequences)
• archaea_mcr_20100409 (DNA sequences - 25 full length methyl-coenzyme M reductase subunit A sequences from archaeal methanogens)
• vagnr_20100816 (going to guess this is a collection of non-redundant cpn60 sequences found in vaginal libraries)
• rdp_isolates_20100705 (going to guess this is a collection of bacterial full length 16S rRNA genes)

To do a Blastx analysis (DNA to protein): /aped/blast_dbs/
• cpndb_nr_pep_20110622 (Protein sequences - all cpn60 UT non-redundant protein sequences as of June 22nd, 2011)
• tcp_prot_reference_database_20100805 (Protein sequences - 166 full length archaeal type II chaperonin protein sequences as of Aug 5th, 2010)

Appendix G - Signature Oligo

Signature Oligo ver. 1.1. To use Sig Oligo to define sequence ranges unique to one collection of sequences and not in another collection of sequences (ie: PCR primer design):
1. Log onto haruspex.
2. Create a new primary directory to work in. Within this primary directory, create 2 new subdirectories (call them these names):
a. target
i. Put in a collection of single FASTA files (not one big multiple FASTA file) of all the sequences you want to include in your region selection (can be 1 sequence or several).
b. other
i. Put in a collection of single FASTA files (not one big multiple FASTA file) of all the sequences you want to exclude from your region selection. These can be a few sequences from organisms closely related to your target group, or it can be all of cpnDB minus your targets. Be aware, the more you compare against, the less likely you are to find regions unique to your target.
c. NOTE: If you accidentally put the same sequence in both folders (like an identical sequence from 2 different strains that have different names), the program will fail to return any regions.
3. Move out of the primary directory and type:
a. /usr/local/sigoli/sigoli-1-1/b/sigoli -operation=ranges -oligo-size=20 -sequence-directory=name_of_primary_directory_you_just_made -diff=yes
b. This command should return a list with three sections:
i. Top section is a label from the target folder
ii. Middle section lists the sequence ranges that are unique to the target sequence(s) compared to the others
iii. Bottom section is a label from the other folder
It looks something like:
   -----------------------
   TR8
   -----------------------
   [ 1 - 38] : 19
   [ 20 - 46] : 33
   [ 33 - 145] : 89
   [ 134 - 164] : 149
   [ 154 - 193] : 173
   [ 177 - 378] : 277
   [ 367 - 548] : 457
   [ 530 - 709] : 619
   -----------------------
   b13620
   -----------------------
4. If an acceptable range is not obtained, try changing the contents of the target and other folders until something suitable is found. You can alter the parameters/options as shown below.
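When the list of ranges is long, picking out the candidates by eye gets tedious. The sketch below reads range lines of the form shown in the example output and keeps only those of a minimum length. It rests on assumptions that are not confirmed by the Sig Oligo documentation: that every range line looks like "[ start - end] : number", that start and end are inclusive positions, and that the number after the colon can simply be carried along (its meaning is not stated here). The names parse_ranges and ranges_at_least are hypothetical.

```python
import re

# Matches lines such as "[ 33 - 145] : 89" (format assumed from the example output).
RANGE_LINE = re.compile(r"\[\s*(\d+)\s*-\s*(\d+)\s*\]\s*:\s*(\d+)")


def parse_ranges(text):
    """Return (start, end, value) tuples for every range found in the text."""
    return [(int(start), int(end), int(value))
            for start, end, value in RANGE_LINE.findall(text)]


def ranges_at_least(ranges, min_length):
    """Keep ranges spanning at least min_length positions (inclusive ends assumed)."""
    return [(start, end, value) for start, end, value in ranges
            if end - start + 1 >= min_length]


if __name__ == "__main__":
    sample = """
    -----------------------
    TR8
    -----------------------
    [ 1 - 38] : 19
    [ 20 - 46] : 33
    [ 33 - 145] : 89
    -----------------------
    b13620
    -----------------------
    """
    for start, end, value in ranges_at_least(parse_ranges(sample), 100):
        print(start, end, value)
```

Label and separator lines contain no bracketed range, so they are skipped without special handling.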
Complete description of the commands from the program's Help option:
--------------------------------------------------
usage: sigoli.exe [command-line-options]
command-line-options: [option-name=option-value]*
-operation=(operation-name); supported operations:
   strings -- writes to the output all oligo strings from all sequences and all groups
   positions -- generates an input file for Array Designer (tab-separated list of oligo sites)
   ranges -- writes a list of all ranges of oligos from each sequence and each group
   ambig -- writes a list of all ambiguous subsequences that have been discarded because of more ambiguities than max-unambiguous-count
-sequence-directory=(relative-directory-name); the location of the directory name where the sequences and directories to be analysed are located
-oligo-size=(oligo-size); will set the size of oligos to be discovered; default=16
-ambiguous=(yes/no); (obsolete) if yes, ambiguous subsequences may be considered oligos
-diff=(yes/no); indicates whether small differences (1 nucleotide) are considered
-crowded=(yes/no); indicates whether for the ranges and positions operations, an oligo range is populated with intermediary sites
-stop-on-error=(yes/no); indicates whether the system will stop when encountering an invalid sequence file; default=no
-first-site-gap=(gap-size); for a crowded display, indicates the size of the gap between the border of the range and the first interior site
-inter-site-gap=(gap-size); for a crowded display, indicates the size of the gap between sites inside an oligo range
-max-unambiguous-count=(count); indicates the maximum number of unambiguous sequences that will be considered in a disambiguation
--------------------------------------------------
example: sigoli -operation=positions -oligo-size=20 -sequence-directory=s -ambiguous=yes -diff=no -crowded=yes -first-site-gap=12 -inter-site-gap=5 -max-unambiguous-count=1000
--------------------------------------------------
SigOli is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License
--------------------------------------------------
author: Manuel Zahariev, mz@alumni.sfu.ca
--------------------------------------------------
This product contains software developed by the University of California and its contributors.
--------------------------------------------------
This product contains output from GNU Bison, used in accordance with the license conditions specified in "Conditions for using Bison".

Figure S1 Heat Map of Bacterial Taxa at 55% Match Identity Cut off. Species abundance is indicated by colour range from black (low abundance) to red (high abundance) in the central heat map grid. (Heat map image; axes: Unique Bacterial Taxa and Subject ID; annotations: Community Groups and a Relative Intensity Scale from 0 to 6.)

Figure S2 Heat Map of Bacterial Taxa at 78% Match Identity Cut off. Species abundance is indicated by colour range from black (low abundance) to red (high abundance) in the central heat map grid. (Heat map image; axes: Unique Bacterial Taxa and Subject ID; annotations: Community Groups and a Relative Intensity Scale from 0 to 6.)
