De novo annotation of non-model organisms using whole genome and transcriptome shotgun sequencing

UBC Theses and Dissertations


Khan, Hamza

Abstract

Current genome and transcriptome annotation pipelines depend mostly on reference resources. This restricts their ability to annotate novel species that lack reference resources of their own or for a closely related species. To address these limitations and reduce reliance on reference genomes and existing gene models, we present ChopStitch, a method for finding putative exons and constructing splice graphs using transcriptome assembly and whole genome sequencing data as inputs. We implemented a method that identifies exon-exon boundaries in de novo assembled transcripts with the help of a Bloom filter that represents the k-mer spectrum of genomic reads. We tested our method by characterizing roundworm and human transcriptomes using publicly available RNA-Seq and whole genome shotgun sequencing data. We compared our method with LEMONS, Cufflinks and StringTie and found that ChopStitch outperforms these state-of-the-art methods for finding exon-exon junctions with and without the help of a reference genome. We have also applied our method to annotate the transcriptome of the American Bullfrog. ChopStitch could be used effectively to annotate de novo transcriptome assemblies and to explore alternative mRNA splicing events in non-model organisms, thus opening new loci for functional analysis and enabling the study of genes that were previously inaccessible.

Long non-coding RNAs (lncRNAs) have been shown to contribute to sub-cellular structural organization, function, and evolution of genomes. With a composite reference transcriptome and a draft genome assembly for the American Bullfrog, we developed a pipeline to find putative lncRNAs in its transcriptome. We used a staged subtractive approach with different strategies to remove coding contigs and reduce our set. These include predicting coding potential and open reading frames, running sequence similarity searches against known protein sequences and motifs, and evaluating contigs with support vector machines. We further refined our set by keeping contigs with polyA tails and sequence hexamers. We interrogated our final set for sequences that shared some level of homology with known lncRNAs and amphibian transcriptome assemblies. We selected 7 candidates from our final set for validation through qPCR, of which 6 were amplified.
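
The Bloom filter idea mentioned in the abstract can be illustrated with a minimal sketch. It is not the ChopStitch implementation, whose details are not given in this record. It assumes that k-mers spanning an exon-exon junction in a transcript are missing from the genomic reads, because an intron separates the two exons in the genome, so a run of absent k-mers flanked by present ones suggests a boundary. All names (BloomFilter, build_filter, putative_junctions), the filter size, the hash scheme, k = 5, and the toy sequences are hypothetical choices made for illustration.

```python
import hashlib

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    rc = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, rc)


def kmers(seq, k):
    """Yield every k-mer of seq in order of its start position."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


class BloomFilter:
    """Small Bloom filter; the bit count and hash count are arbitrary assumptions."""

    def __init__(self, num_bits=1 << 20, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        # One salted BLAKE2b digest per hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(
                item.encode(), digest_size=8, salt=bytes([seed])
            ).digest()
            yield int.from_bytes(digest, "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


def build_filter(reads, k):
    """Insert the canonical k-mers of all genomic reads into a Bloom filter."""
    bf = BloomFilter()
    for read in reads:
        for kmer in kmers(read, k):
            bf.add(canonical(kmer))
    return bf


def putative_junctions(transcript, bf, k):
    """Return 0-based transcript positions where a new exon is predicted to start.

    A junction is reported at the start of the first present k-mer that follows
    an interior run of absent k-mers.
    """
    present = [canonical(kmer) in bf for kmer in kmers(transcript, k)]
    junctions = []
    i, n = 0, len(present)
    while i < n:
        if present[i]:
            i += 1
            continue
        start = i
        while i < n and not present[i]:
            i += 1
        # Ignore runs touching either end: they are not flanked on both sides.
        if start > 0 and i < n:
            junctions.append(i)
    return junctions


# Toy locus: two exons separated by an intron (made-up sequences).
exon1 = "ATGACGTACGTTAGC"
intron = "GTAAGTCCCTTTTTCAG"
exon2 = "CATTCACCGGATTGA"

bf = build_filter([exon1 + intron + exon2], k=5)  # one "read" covering the locus
print(putative_junctions(exon1 + exon2, bf, k=5))  # [15], the start of exon2
```

In this toy case the four k-mers that span the junction are absent from the filter, so the boundary is placed at position 15. With real data the picture is less clean: Bloom filter false positives, sequencing errors, low coverage, and bases shared between exon ends and intron ends can all shorten, lengthen, or shift the absent run, so a practical tool would need additional filtering that this sketch omits.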

Item Metadata

Title	De novo annotation of non-model organisms using whole genome and transcriptome shotgun sequencing
Creator	Khan, Hamza
Publisher	University of British Columbia
Date Issued	2016
Genre	Thesis/Dissertation
Type	Text
Language	eng
Date Available	2017-12-31
Provider	Vancouver : University of British Columbia Library
Rights	Attribution-NonCommercial-NoDerivatives 4.0 International
DOI	10.14288/1.0340507
URI	http://hdl.handle.net/2429/60152
Degree (Theses)	Master of Science - MSc
Program (Theses)	Bioinformatics
Affiliation	Science, Faculty of
Degree Grantor	University of British Columbia
Graduation Date	2017-02
Campus	UBCV
Scholarly Level	Graduate
Rights URI	http://creativecommons.org/licenses/by-nc-nd/4.0/
Aggregated Source Repository	DSpace
