Nomenclature errors in public 16s rDNA gene databases : strategies to improve the accuracy of sequence annotations

UBC Theses and Dissertations

Nomenclature errors in public 16s rDNA gene databases : strategies to improve the accuracy of sequence annotations Lesack, Kyle

Abstract

Obtaining an accurate representation of the microorganisms present in microbial ecosystems is a considerable challenge. Microbial communities are typically highly complex and may consist of a variety of differentially abundant bacteria, archaea, and microbial eukaryotes. Targeted sequencing of the 16S rDNA gene has become a standard method for profiling the membership and biodiversity of microbial communities, because bacterial and archaeal community members can be profiled directly, without any intermediate culturing steps. These studies rely on specialized 16S rDNA gene reference databases, but little systematic and independent evaluation of the annotations assigned to sequences in these databases has been performed. This project examined the quality of the nomenclature annotations assigned to the 16S rDNA sequences in three public databases: the Ribosomal Database Project, SILVA, and Greengenes. First, three nomenclature resources (the List of Prokaryotic Names with Standing in Nomenclature, the Integrated Taxonomic Information System, and Prokaryotic Nomenclature Up-to-Date) were evaluated to determine their suitability for validating prokaryote nomenclature. A core set of valid, invalid, and synonymous organism names was then collected from these resources and used to identify incorrect nomenclature in the public 16S rDNA databases. To assess the potential impact of misannotated reference sequences on microbial gene survey studies, the misannotations identified in the SILVA database were categorized by sample isolation source. Methods for the detection and prevention of nomenclature errors in reference databases were examined, leading to the proposal of several quality assurance strategies for future biocuration efforts. These included phylogenetic methods for identifying anomalous taxonomic placements, database design principles and technologies for quality control, and opportunities for community-assisted curation.
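
The name-checking step described in the abstract can be illustrated with a minimal sketch. It is not the thesis's actual implementation: the function name, the data structures, and the placeholder organism names below are all assumptions made for illustration. It only shows the general idea of comparing an annotation against a core set of valid, synonymous, and invalid names.

```python
from enum import Enum


class Status(Enum):
    VALID = "valid"
    SYNONYM = "synonym"
    INVALID = "invalid"
    UNKNOWN = "unknown"


def check_name(name, valid_names, synonyms, invalid_names):
    """Classify one organism name against a curated core set.

    valid_names:   set of currently valid names
    synonyms:      dict mapping a synonym to its currently valid name
    invalid_names: set of names known to be invalid
    Returns a (Status, suggested_name_or_None) tuple.
    """
    key = " ".join(name.split())  # collapse stray whitespace
    if key in valid_names:
        return Status.VALID, key
    if key in synonyms:
        return Status.SYNONYM, synonyms[key]
    if key in invalid_names:
        return Status.INVALID, None
    return Status.UNKNOWN, None


# Hypothetical core set; these are placeholder names, not real taxa.
VALID = {"Examplea alpha"}
SYNONYMS = {"Oldgenus alpha": "Examplea alpha"}
INVALID = {"Examplea beta"}

# Hypothetical annotations as they might appear in a reference database.
annotations = ["Examplea  alpha", "Oldgenus alpha", "Examplea beta", "Novel gamma"]
report = {a: check_name(a, VALID, SYNONYMS, INVALID) for a in annotations}
```

In this sketch a synonym is reported together with the valid name it maps to, so a curator could correct the annotation rather than only flag it. Names absent from the core set are kept in a separate "unknown" category instead of being treated as errors.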

Item Metadata

Title	Nomenclature errors in public 16s rDNA gene databases : strategies to improve the accuracy of sequence annotations
Creator	Lesack, Kyle
Publisher	University of British Columbia
Date Issued	2017
Genre	Thesis/Dissertation
Type	Text
Language	eng
Date Available	2017-07-31
Provider	Vancouver : University of British Columbia Library
Rights	Attribution-NonCommercial-NoDerivatives 4.0 International
DOI	10.14288/1.0349132
URI	http://hdl.handle.net/2429/62456
Degree (Theses)	Master of Science - MSc
Program (Theses)	Bioinformatics
Affiliation	Science, Faculty of
Degree Grantor	University of British Columbia
Graduation Date	2017-09
Campus	UBCV
Scholarly Level	Graduate
Rights URI	http://creativecommons.org/licenses/by-nc-nd/4.0/
Aggregated Source Repository	DSpace
