Open Collections

UBC Theses and Dissertations

UBC Theses Logo

UBC Theses and Dissertations

De novo annotation of non-model organisms using whole genome and transcriptome shotgun sequencing Khan, Hamza 2016

Your browser doesn't seem to have a PDF viewer, please download the PDF to view this item.

Item Metadata

Download

Media
24-ubc_2017_february_khan_hamza.pdf [ 2.45MB ]
Metadata
JSON: 24-1.0340507.json
JSON-LD: 24-1.0340507-ld.json
RDF/XML (Pretty): 24-1.0340507-rdf.xml
RDF/JSON: 24-1.0340507-rdf.json
Turtle: 24-1.0340507-turtle.txt
N-Triples: 24-1.0340507-rdf-ntriples.txt
Original Record: 24-1.0340507-source.json
Full Text
24-1.0340507-fulltext.txt
Citation
24-1.0340507.ris

Full Text

 DE NOVO ANNOTATION OF NON-MODEL ORGANISMS USING WHOLE GENOME AND TRANSCRIPTOME SHOTGUN SEQUENCING  by  Hamza Khan  B.Tech., VIT University, 2014  A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF  MASTER OF SCIENCE in THE FACULTY OF GRADUATE AND POSTDOCTORAL STUDIES (Bioinformatics)  THE UNIVERSITY OF BRITISH COLUMBIA (Vancouver)   December 2016  © Hamza Khan, 2016 ii  Abstract  Current genome and transcriptome annotation pipelines mostly depend on reference resources. This restricts their annotation capabilities for novel species that might lack reference resources for itself or a closely related species. To address the limitations of these tools and reduce reliance on reference genomes and existing gene models, we present ChopStitch, a method for finding putative exons and constructing splice graphs using transcriptome assembly and whole genome sequencing data as inputs. We implemented a method that identifies exon-exon boundaries in de novo assembled transcripts with the help of a Bloom filter that represents the k-mer spectrum of genomics reads. We have tested our method on characterizing the Caenorhabditis elegans (C. elegans) and Homo sapiens (H. sapiens) transcriptomes, using publicly available RNA-Seq and whole genome shotgun sequencing data. We compared our method with LEMONS, Cufflinks and StringTie and found that Chop-Stitch outperforms these state-of-the-art methods for finding exon-exon junctions with and without the help of a reference genome. We have also applied our method for annotating the transcriptome of Rana (Lithobates) catesbeina. Chop-Stitch could be used effectively to annotate de novo transcriptome assemblies, and explore alternative mRNA splicing events in non-model organisms, thus exploring new loci for functional analysis, and studying genes that were previously inaccessible.  Long non-coding RNA (lncRNA) have shown to contribute towards sub-cellular structural organization, function, and evolution of genomes1. With a composite reference transcriptome and a draft genome assembly for R. catesbeiana, we developed a pipeline to find putative lncRNAs in its transcriptome. We used a staged subtractive approach with different strategies to remove coding contigs and reduce our set. This includes predicting coding potentials and open reading frames; running sequence similarity searches with known coding protein sequences and motifs; evaluating iii  contigs through SVM algorithms such as CPC. We further refined our set by selecting and keeping contigs with PolyA tails and sequence hexamers. We interrogated our final set for sequences that shared some level of homology with known lncRNAs and amphibian transcriptome assemblies. We selected 7 candidates from our final set for validation through qPCR, out of which 6 were amplified.                   iv  Preface  This thesis has been compiled by Hamza Khan with inputs and feedback from Dr. Inanc Birol, Dr. Martin Hirst and Dr. William Hsiao.  Versions of chapter 2 and 3 are in the process of being submitted to peer reviewed scientific journals in the coming months. ntCard, which is a component of Chapter 2 has been accepted for publication in Bioinformatics (Oxford journals) with the title 'ntCard: A streaming algorithm for cardinality estimation in genomics data' (Manuscript ID:BIOINF-2016-1738). ntCard was written by Hamid Mohamadi, while Hamza Khan helped in benchmarking it against other tools, incorporating it as a part of ChopStitch (Section 2.2.3), and writing the ntCard manuscript. A poster presentation based on Chapter 2 was given at the ‘Genome Informatics Conference 2015’ at Cold Spring Harbor laboratory, New York, USA and the ‘geXc Vancouver 2016 symposium’ at the University of British Columbia, Vancouver, Canada on October 29, 2015 and June 15, 2016 respectively.  Stewart Austin Hammond, Muhammet Erdi Kucuk, Rene L. Warren and Bahar Behsaz have contributed in compiling the genome and transcriptome assemblies of the American bullfrog (Rana (Lithobates) catesbeiana), and have constructed the Bullfrog Annotation Resource for the Transcriptome (BART) (Section 3.2.3). Stewart Austin Hammond has also contributed in differential gene expression analysis of putative long non-coding RNA candidates (Section 3.2.13). Dr. Caren Helbing and her lab members at the University of Victoria, Canada carried out the qPCR experiments for Chapter 3 (Section 3.3.3).  v  Table of contents  Abstract ....................................................................................................................................... ii Preface ........................................................................................................................................ iv Table of contents ......................................................................................................................... v List of tables ............................................................................................................................... ix List of figures .............................................................................................................................. x List of abbreviations .................................................................................................................. xi Acknowledgements .................................................................................................................. xiii 1 Introduction ............................................................................................................................. 1 1.1. De novo sequencing .......................................................................................................... 1 1.2. Challenges ........................................................................................................................ 1 1.3. Current approaches ........................................................................................................... 2 1.4. Genome annotation pipelines ........................................................................................... 4 1.4.1. NCBI eukaryotic genome annotation pipeline ........................................................... 4 1.4.2. Ensembl gene annotation process .............................................................................. 5 1.4.3. MAKER and MAKER2 ............................................................................................. 6 1.5. mRNA processing in eukaryotes ...................................................................................... 8 1.5.1 Alternative splicing ..................................................................................................... 9 1.6. Tools for finding putative exons in non-model organisms ............................................. 13 1.7. Splice graphs................................................................................................................... 15 1.8. Non-coding RNA (ncRNA) ............................................................................................ 17 1.8.1. Long noncoding RNA .............................................................................................. 19 1.8.2. Strategies for lncRNA identification ....................................................................... 20 1.9. Research questions ......................................................................................................... 22 vi  2. ChopStitch: Exon annotation and splice graph reconstruction using transcriptome assembly and whole genome sequencing data ........................................................................ 24 2.1. Introduction .................................................................................................................... 24 2.2. Methods .......................................................................................................................... 26 2.2.1 Trans-ABySS ............................................................................................................ 26 2.2.2 ntHash ....................................................................................................................... 27 2.2.3. ntCard ....................................................................................................................... 27 2.2.4. Bloom filter .............................................................................................................. 28 2.3. Implementation ............................................................................................................... 28 2.3.1. Genomic Bloom filters ............................................................................................. 28 2.3.2. Solid k-mers Bloom filter ........................................................................................ 29 2.3.3. Trans-ABySS assembly ........................................................................................... 29 2.3.4. Detecting exon-exon junctions ................................................................................ 30 2.3.5. Detecting and correcting sequencing errors and RNA editing events ..................... 31 2.3.6. Detecting putative exons smaller than k .................................................................. 33 2.3.7. Splice graph construction ......................................................................................... 33 2.4 Dataset ............................................................................................................................. 34 2.4.1. Bullfrog genome dataset .......................................................................................... 34 2.5 Results ............................................................................................................................. 35 2.5.1. Evaluation criteria for exon-exon boundary detection ............................................. 36 2.5.2. ChopStitch outperforms other tools ......................................................................... 37 2.5.3. Effect of FPR for dbf and sbf on ChopStitch’s performance ................................... 38 2.5.4. Assessment of putative splice graphs ...................................................................... 39 2.5.5. ChopStitch results with American bullfrog dataset ................................................. 40 2.6. Discussion and conclusion.............................................................................................. 44 vii  3. Finding long non-coding RNA in the American bullfrog (Rana (Lithobates) catesbeiana) transcriptome ............................................................................................................................ 45 3.1 Introduction ..................................................................................................................... 45 3.1.1. LncRNA studies in non-model organisms ............................................................... 46 3.1.2. The thin line between coding and non-coding RNA worlds .................................... 46 3.2 Methods ........................................................................................................................... 47 3.2.1. Compiling a list of lncRNA candidates ................................................................... 47 3.2.2. Computing resources ............................................................................................... 48 3.2.3. Bullfrog annotation resource for the transcriptome (BART) ................................... 48 3.2.4. Updating BART with shorter contigs ...................................................................... 49 3.2.5. Removing putative coding contigs with TransDecoder ........................................... 49 3.2.6. Redundancy removal with CD-HIT-EST ................................................................ 50 3.2.7. Finding contigs with polyA tail and hexamer evidence........................................... 50 3.2.8. Similarity search with NCBI nr and RefSeq90 databases using Blastx................... 52 3.2.9. Assessing coding potential of contigs ...................................................................... 52 3.2.10. Searching protein family databases for sequence homologs ................................. 53 3.2.11. Finding sequence homologs ................................................................................... 53 3.2.12. Differential expression analysis ............................................................................. 54 3.2.13. Quantitative real-time polymerase (qPCR) validation ........................................... 54 3.3. Results ............................................................................................................................ 55 3.3.1. Compiling a putative ncRNA set ............................................................................. 55 3.3.2. Finalising qPCR candidates ..................................................................................... 59 3.3.3. qPCR validation of putative ncRNA transcripts ...................................................... 63 3.4. Discussion and conclusions ............................................................................................ 64 3.4.1. Exploring the biological significance of lncRNAs .................................................. 64 viii  4. Discussion and conclusions .................................................................................................. 66 4.1 Challenges in establishing new model species as genomic resources ............................. 66 4.2. The easily misinterpreted non-coding RNA ................................................................... 69 5. Closing remarks .................................................................................................................... 72 Bibliography .............................................................................................................................. 73 Appendices ................................................................................................................................ 94 Appendix A – Effect of leniency on ChopStitch’s performance for the C. elegans dataset . 94 Appendix B – qPCR primers for selected 7 candidates ......................................................... 95                   ix  List of tables  Table 2.1. Datasets used to evaluate and compare Chopstitch’s performance…………….34 Table 2.2. Evaluation results for Chopstitch splice graphs...................................................40 Table 3.1: Sequence hexamers considered for detection of cleavage site.............................51 Table 3.2: List of species selected to construct the CATSA database...................................53 Table 3.3: Contigs with positive but low CPC scores............................................................56 Table 3.4. Length statistics for putative lncRNA candidates.................................................58 Table 3.5. Top fifteen Blastx hits against CATSA.................................................................60 Table 3.6. Top eight Blastx hits against known lncRNA database ........................................60 Table 3.7. Top twenty most significant ncRNA candidates....................................................61 Table 3.8. Three other candidates that were differentially expressed in both the low temperature and temperature shifted samples ............................................................................................61 Table 3.9. DE analysis results for CATSA and lncRNA databases hits ................................62 Table 3.10. LncRNA candidates for which qPCR was run..…………………………….......63           x  List of figures  Figure 1.1: NCBI's eukaryotic genome annotation pipeline overflow …………………………4 Figure 1.2: Ensembl gene annotation process..............................................................................6 Figure 1.3: Maker pipeline……………………………….………….………………….…........7 Figure 1.4: Diagram representing splicing of pre-mrna...............................................................9 Figure 1.5: Splicing patterns.......................................................................................................11 Figure 1.6: Different transcripts with the corresponding splice graph………………………...16 Figure 1.7. Number of publications returned by pubmed using the keyword “ncRNA”...........19 Figure 2.1. Chopstitch algorithm workflow……………………………………......................30 Figure 2.2. Detecting putative exons smaller than the size of k-mer........................................32 Figure 2.3. Constructing the splice graph DOT file………………………………………......35 Figure 2.4. Precision and sensitivity comparison between tools...............................................37 Figure 2.5. Time and memory comparison between tools…………………………….………38 Figure 2.6. Chopstitch results for C. elegans on a range of FPR values for dbf and sbf............39 Figure 2.7. Blast results for run1……………………………………………………….............41 Figure 2.8. Blast result comparison between run1 and run2.......................................................42 Figure 2.9.Intersection between nr-Chopstitch, nr-LEMON, nr-MAKER-high confidence set................................................................................................................................................43 Figure 3.1. Lncrna detection pipeline, with number of contigs at each step..............................................................................................................................................57 Figure 3.2. Length distribution for the 7,003 putative lncRNA candidates................................58 Figure 3.3. Finalised qPCR candidates with selection criteria....................................................63 Figure 4.1. Increase in the number of bases in each release of Genbank ……………………...66 xi  List of abbreviations  BART - Bullfrog annotation resource for the transcriptome BCL - B-cell lymphoma BLAT - BLAST-like alignment tool CHART - Capture hybridization analysis of RNA targets CPC - Coding potential calculator ChIRP - Chromatin isolation by RNA purification DAG - Directed acyclic graph ESE - Exonic splicing enhancers ESS - Exonic splicing silencers ESTs - Expressed sequence tags FISH - Fluorescence in situ hybridization GTF - General transfer format HITS-CLIP - High-throughput sequencing cross-linking immunoprecipitation HMM - Hidden markov model HMM - Hidden markov model HOTAIR - HOX transcript antisense RNA ISE - Intronic splicing enhancers ISS - Intronic splicing silencers KCNQ1OT1 - KCNQ1 opposite strand or antisense transcript 1 NAT - Natural antisense transcript ORF - Open reading frame xii  PAR-CLIP - Photoactivatable ribonucleotide-enhanced cross-linking and immunoprecipitation PKM - Pyruvate kinase M RAP - RNA antisense purification RON - Recepteur d'Origine nantais SNP - Single nucleotide polymorphism SR - Serine-arginine rich STAT - Signal transducer and activator of transcription SVM - Support vector machine SYK - Spleen tyrosine kinase TERC - Telomerase RNA component TSA - Transcriptome shotgun assembly TUCP - Transcripts of unknown coding potential UTRs - Untranslated regions VEGF - Vascular endothelial growth factor WGSS - Whole genome shotgun sequencing ZEB2 - Zinc-finger E-box binding homeobox 2 hnRNP - Heterogeneous RNP lncRNA - Long non-coding RNA ncRNAs - Non-coding RNAs qPCR - Quantitative real-time polymerase snRNA - Small nuclear RNAs snoRNA - Small nucleolar RNA  tRNA – Transfer RNA xiii  Acknowledgements  I would like to thank my supervisor, Dr. Inanc Birol, for giving me the opportunity to conduct research in his lab, and for his guidance, irreplaceable constructive criticism and assistance during the project work. I am also grateful to my committee members, Dr. Martin Hirst and Dr. William Hsiao for their encouragement, insightful comments, and important questions. I would also like to thank my fellow lab-mates, in particular, Hamid Mohamadi and Benjamin P. Vanderwalk for helping me in reviewing my code, Stewart Austin Hammond and Rene L. Warren for their help and advice during the course of my research project. In addition, I would like to thank my family for their steadfast support. Finally, I would like to acknowledge the CIHR Bioinformatics Training Program for funding my Master’s degree at the University of British Columbia.             1  1 Introduction  1.1. De novo sequencing  Model organisms, such as Drosophila melanogaster (common fruit-fly), Mus musculus (mouse) or Escherichia coli (E. coli) satisfy the need for genetic uniformity, but might lack many interesting traits and have limited answers for certain biological, evolutionary and ecological questions2. Recent advances in high-throughput sequencing technologies have enabled the sequencing of non-model organisms with reasonable budgets. Such studies often interrogate terra incognita of new genomes and transcriptomes of these species with massive amounts of short sequences. Investigating the information in these datasets requires de novo analysis tools and techniques, when the species or closely-related species lacks a high quality reference resource. Studies involving genome and transcriptome assembly at an early stage, followed by annotation, variant discovery and homolog identification are challenging problems in genomics and transcriptomics studies of non-model organisms.   1.2. Challenges  Investigating de novo sequencing datasets is not an easy process. There are many challenges that need to be addressed in the process of finding pertinent information on genes, ncRNAs, etc. First, it is difficult to attain a genome or transcriptome assembly with a comparable contiguity to manually-improved shotgun assemblies of established model organisms. Second, many annotation challenges are incurred because of the lack of close homology of the recently sequenced non-model organism. Third, developing computational approaches that are independent of a standard reference genome is complicated, not the least because of scalability problems during genome 2  assembly. Lastly, updating and merging such annotation datasets is not straightforward, and requires ascertaining that the new results improve on the original annotation. Merging consensus annotation data from different research groups working on the same non-model organism might also pose difficulties.  1.3. Current approaches   Many genome annotation projects aim at defining structure of transcripts including exon-intron architecture, and provide inferences into their putative functions3. The latter process might also be termed as functional annotation. The contiguity and completeness of a genome assembly determines whether it is ready for annotation. This is evaluated with the help of certain summary statistics such as contig N50, which can be defined as the shortest sequence length for a contig or scaffold at which 50% of the entire assembly is contained4. Information on the average number of gaps per scaffold and a scaffold’s average gap size are also useful for assembly evaluation. Current approaches for genome annotation could be divided into two stages: a pre-processing or computation stage and an annotation stage5.  a) The pre-processing or computation stage starts with the identification and masking of repeats such as ‘low-complexity’ regions, including homopolymeric nucleotide runs and mobile elements, such as long and short interspersed nuclear elements6. The next step involves aligning proteins, expressed sequence tags (ESTs) or RNA-Seq data to a draft genome assembly. These sequences might be in the form of transcriptome assemblies from the same organism or related organisms. RNA-Seq data provides unprecedented advantage for improving the accuracy of gene annotations by providing evidence for identifying exons, alternatively spliced exons, and splice sites. This data can be assembled de novo using tools such as SOAPdenovo-Trans7, Trinity8, TransABySS9, or 3  could be aligned to a reference genome using tools such as TopHat10, GMAP11 and GSNAP12 followed by the assembly of alignments into transcripts by assembly tools. Many annotation pipelines such as MAKER213 are now fairly compatible with RNA-Seq data and its assemblies. However, the best method for using RNA-Seq data is still an open problem, and depends on both genome biology and the contiguity and completeness of the genome assembly. For example, closely spaced genes in the genome might result in Cufflinks merging RNA-Seq reads from neighboring genes. For such as condition, assembling RNA-Seq data de novo might attenuate the difficulty. As de novo genome assembly is an arduous and demanding task, annotation approaches that could avoid this step would be much preferable. Other than that, ab initio gene prediction tools use mathematical models to identify genes in DNA sequence assemblies. However, they suffer from some practical limitations. For example, most ab-initio tools aim at finding the single most likely coding sequence, while not reporting alternatively spliced transcripts or untranslated regions (UTRs). Besides, they also rely on training datasets from model organisms or closely related organisms that are rarely available for freshly sequenced genomes. Certain other ab initio tools such as Augustus14 and SNAP15 provide evidence-based gene and splice variant prediction, and can use external evidence such as ESTs to make better predictions. Tools like UTRscan16 rely on matching biological sequence patterns against existing UTR databases to detect UTRs. b) The annotation stage combines alignment based evidences and ab initio gene predictions to obtain a final set of gene annotations that, at a bare minimum, should contain the transcripts and protein sequences of every gene and information on intron-exon structures, start-stop codons, UTRs and alternative splicing patterns.   4  1.4. Genome annotation pipelines   1.4.1. NCBI eukaryotic genome annotation pipeline  The NCBI eukaryotic genome annotation pipeline17 annotates transcripts, genes and proteins on both complete and incomplete genome assemblies that are submitted to NCBI and are available publically.  Figure 1.1: NCBI's eukaryotic genome annotation pipeline overview. Prepared genome sequences are aligned against transcripts, proteins, short reads and curated genomic sequences. Genes are predicted based on all available alignments, followed by in-house tracking of RefSeq database sequences. Best models are then selected, followed by protein naming. Finally, annotation datasets are formatted for public release. (Adapted from Brown et al. 17)  It is a source of rich content for many NCBI resources such as RefSeq sequence databases18, BLAST databases, Gene19, and the Map Viewer genome browser20. After masking genomic 5  sequences using RepeatMasker21 and WindowMasker22, transcripts, proteins and RNA-Seq reads are aligned to the genome using BLAST23, Splign24 and ProSplign, ranked and filtered. RefSeq genomic sequences are also aligned, if available for the organism being annotated (Figure 1.1). Gene models are then predicted based on transcript and protein alignments. Best models amongst the RefSeq and the predicted models are selected, named, and accessioned. The resulting annotation products are then formatted and released in public resources.  1.4.2. Ensembl gene annotation process   As of January 1, 2016, the Ensembl gene annotation system25 has annotated over 70 different vertebrate species across many genome projects. The first stage aims at preparing the genome for gene annotation by loading the assembly and masking repeats.  The second stage is the Protein-Coding Model Building stage, which involves Similarity, RNA-Seq and Targeted pipelines. This stage helps in generating protein-coding transcript models by sequence to genome alignments, and deducing transcript models with the help of these alignments. The final stage is referred to as the Model Filtering stage. Potential coding transcript models are sorted, and those with insufficient support are filtered out. Annotation of non-coding RNA genes and pseudogenes is also done and a final Ensembl gene set is created, which is then cross-referenced with other sources of data.    6    Figure 1.2: Ensembl gene annotation process: Involves three different protein coding model building approaches: Similarity, Targeted and RNA-Seq. This is followed by preparing transcript consensus, adding UTRs, prioritising models, annotating pseudogenes and lncRNAs (Figure adapted from Aken et al., 201625)    1.4.3. MAKER and MAKER2  The pipeline 13,26 starts with the computation phase, where a number of sequence analysis programs such as RepeatMasker21, BLAST23, Exonerate27 and SNAP28 are run to identify and mask repeats, and to assemble mRNA and protein EST alignments (Figure 1.3). In the filtering phase, marginal predictions and sequence alignments are identified and removed on the basis of some calculated scores, percent identities, etc. This step is followed by clustering the remaining data against the 7  genomic sequence for the identification of overlapping alignments and predictions. During the polishing stage, Exonerate realigns vastly similar and matching mRNAs, ESTs, and proteins to genome sequences for obtaining precise exon boundaries. While using the polished and clustered EST alignments, MAKER synthesises information to produce evidence for annotations. SNAP then modifies its internal Hidden Markov Model (HMM) based on this information, and generates predictions. In the final annotation step, MAKER post-processes SNAP-predictions, and recombines them with evidence to generate comprehensive annotations.  Figure 1.3: MAKER pipeline: Starts with masking repeats using RepeatMasker and does alignments against protein, EST and mRNAs. Filtering step removes marginal predictions.  Clustering and polishing is done by Exonerate followed by annotation. Figure adapted from Cantarel et al., 200826  MAKER213 is an updated version of MAKER, and is specifically designed for de novo sequencing projects using NGS data, to scale to processing large datasets. It has reduced reliance on training datasets, and can utilise mRNA-Seq data. 8  The current annotation pipelines are fully reliant on reference resources, without which they are inefficient. This restricts their annotation capabilities for novel species. Hence, there is a need to develop cogent methods for annotating genes for non-model organism. These methods should be efficient enough to handle large NGS datasets, and effective enough to reduce reliance on reference resources.  1.5. mRNA processing in eukaryotes  In eukaryotic organisms, genetic information generally flows from DNA to mRNA (transcription) to proteins (translation). The process of gene regulation involves a number of modulatory mechanisms and complex networks, including silencing, editing, splicing, regulation, stability and localization of mRNA molecules. A pre-mRNA contains protein coding sequences (exons) and non-coding sequences (UTRs, introns and untranslated regions, flanking coding sequences). Human genes on average consist of eight exons, while having an average exon length of 145 nucleotides29. These exon sequences are separated by introns, typically an order of magnitude longer than exons30,31. A pre-mRNA molecule undergoes the process of removal of introns to form an mRNA molecule, which is subsequently synthesised into a protein. The former process is called mRNA splicing. The spliceosome in the cell nucleus is the primary machinery that removes introns from pre-mRNA. The process involves a number of interacting proteins called small nuclear RNAs (snRNA) that comprise five small ribonuclear proteins: U6, U5, U4, U2, and U132. Spliceosome cleaves an intron in two biochemical steps: 1) cleaving from 5’splice site with subsequent lariat formation; and 2) cleaving from 3’splice site, followed by exon ligation (see Figure 1.4).  9  1.5.1 Alternative splicing  Alternative splicing refers to the process in which exons or segments of exons or noncoding regions in pre-mRNA transcripts are hopped or joined differentially. Consequently, a single gene can form transcripts of different sequence content, called isoforms. More than 90% of human genes produce alternatively spliced mRNA transcripts33. Majority of alternatively spliced proteins are associated with surface regions in protein structures that suggest their importance in functioning of proteins34. Thus, alterative splicing can produce variable proteins that have different functional capacities.   Figure 1.4: Diagram representing splicing of pre-mRNA.  Exons are represented by yellow boxes and introns by black lines. Although there are exceptions to this general pattern, an intron sequence usually starts with GU at 5’ end and finishes with AG at 3’ end. There is a branch point ‘A’ followed by a pyrimidine rich region. BThe spliceosome RNA complexes identify 5’ splice site, and causes lariat formation joining to the branch point through first transesterification reaction. Second transesterification reaction removes the lariat structure, ligating exons around it. Adapted from Douglas et al. 201132. 10   In addition to 5’ and 3’ splice sites, and a branch point sequence, other regulatory sequences are also required to identify exons and splice sites for alternative splicing. These sequences are called intronic and exonic splicing enhancers and silencers35. An enhancer sequence binds to positive regulators, such as serine-arginine rich (SR) protein family, and promotes exon inclusion in the resulting mRNA transcript. In contrast, a silencer sequence bounds to negative regulators, such as heterogeneous RNP (hnRNP) proteins, and promotes exon exclusion. The antagonist function of enhancer and silencer thus determine the level of exon inclusion in the mature mRNA. Interestingly, splicing affected by disease causing mutations, are generally located within enhancer and silencer regions of exon and introns36.  1.5.2. Alternative splicing patterns Alternative splicing generates variation among mRNA transcripts that are produced from individual genes. Variability in mRNA transcripts often results from the selection of promoter of one of the multiple exons on 5’ end rather than splice-site selection by spliceosome. Alternative splicing of internal exons generate different patterns in the mRNA transcripts that include: cassette exons (skipping one or more exons and exon inclusion), mutually exclusive exons, alternative 3’ and 5’ splice sites, intron retention, and alternative poly(A) sites31,37 (See Figure 1.5).  11   Figure 1.5: Splicing patterns. Alternative splicing causing variability in mRNA transcripts using exons, introns, splice-sites, promoters, and polyadenylation sites. Adapted from Ward et al.36  Table 1.1 illustrates different examples of alternative splicing with the gene isoforms, while including patterns in which the splicing has occurred and the biological role rendered by the splice variants. The table has been compiled by a comprehensive survey of related literature.  Isoforms of genes Alternative splicing pattern Role (Reference) RON  ΔRON Cassette exon (skipping of one exon) RON gene codes for a tyrosine kinase receptor for macrophage-stimulating proteins, and is associated with cell mobility and invasion38. The ΔRON isoform is responsible for proteolytic maturation of proteins39, and is found overexpressed in number of cancer types40. BRAFV600E p61BRAFv600E Cassette exon (skipping of multiple exons) A proto-oncogene, BRAF codes for a serine/threonine protein kinase involved in mitogen-activated protein kinase signalling pathway39. The V600E mutation in this protein is often seen in malignant melanoma patients, and is sensitive to the drug, vemurafenib. However, its alternatively spliced isoform lacks an N-terminal RAS-binding domain, resulting in drug resistance41,42. 12  Isoforms of genes Alternative splicing pattern Role (Reference) SYK(L) SYK(S) Cassette exon (exon inclusion) Spleen tyrosine kinase (SYK) acts as a tumour suppressor protein in breast cancer43. However, SYK has also been associated with lymphoma, and head and neck carcinoma44,45. The longer isoform, SYK(L) has been found to promote cell survival and tumour malignancies, whereas shorter SYK(S) isoform helps in inducing apoptosis of ovarian cancerous cells39.  BCL-X(L) BCL-X(S) Alternative 5’ splice-sites The gene BCL2L1 produces two alternatively spliced isoforms: BCL-X(L) and BCL-X(S)39. The longer isoform has antiapoptotic effect, and is overexpressed in various cancers46, while the shorter one is proapoptotic in nature, and is found down-regulated in cancer47.  STAT2 STAT2+I19 Intron retention STAT2, a transcription factor, is a main component of JAK/STAT signalling pathway48. It is responsible for the stimulation of sinterferon (INF) through dimerization with STAT1. INF has an apoptotic effect on cancer cells, and is used as a treatment for hematologic malignancies49. However, a spliced variant of STAT2 containing intron-19 produces INF resistance in cancerous cells50. VEGFxxx VEGFxxxb Alternative 3’ splice-sites VEGF has been reported as a mitogen that induces angiogenesis in tumours51. Two VEGF spliced variants; VEGFxxx and VEGFxxxb work antagonistically, where VEGFxxx acts as proangiogenic factor in various cancer malignancies, and VEGFxxxb, an antiangiogenic, is found down-regulated52. PKM1 PKM2 Mutually exclusive exons Two isoforms of a metabolic enzyme, Pyruvate kinase M(PKM) are PKM1 and PKM239. Studies have found that PKM is associated with switching between glycolysis and oxidative phosphorylation, and is required for the growth of tumour cells. PKM2 is found ubiquitously in tumours, while PKM1 is differentially expressed in muscle and brain tissues. Substitution of PKM2 with PKM1 has been shown to inhibit tumour growth53,54.  Table 1.1: Examples of alternative splice patterns  13  1.6.Tools for finding putative exons in non-model organisms  In the past decade, gene discovery methodology has been revolutionised by a number of tools that can utilise RNA-Seq datasets. These new methods to model genes de novo have superseded previous methods, which relied on the recognition of splice sites, coding region and other signals. These tools could be divided into three categories:  a) Tools that require mapping RNA-Seq reads against a reference genome process using splice-aware alignment algorithms, such as HISAT255, TopHat210. Examples include Cufflinks and StringTie. Cufflinks56 is a program for reference-guided transcriptome assembly and expression quantification. After aligning paired-end RNA Seq reads against a genome, each pair of fragment reads is treated as a single alignment by Cufflinks. This is followed by the assembly of overlapping bundles of fragment alignments, and estimating abundances for the assembled transcripts. Along with other assembly output files, it outputs a GTF (General Transfer Format) file of assembled isoforms57. There is a single record per row, with each record representing either a transcript or an exon within a transcript. StringTie58 is another transcriptome assembly program that applies a network flow algorithm, which was originally a part of the optimization theory59. It incorporates a genome-dependant assembly approach, and also borrows ideas from de novo sequence assembly. The method starts with a set of genome-mapped RNA-Seq reads, just like Cufflinks. Separately, the paired end RNA-Seq reads are assembled de novo, using the MaSuRCA60 genome assembler. The output from this step is termed as ‘super-reads’, and could serve as an optional secondary input to StringTie for building an alternative splice graph. A flow network is then built by creating nodes corresponding 14  to all the nodes of this splicing graph. Two nodes are connected by an edge if an alignment maps from one node to the other, and the number of such alignments control an edge's capacity. This step is followed by computing the maximum flow of a path to estimate abundance. This is done by joining fragments that can explain a maximal number of reads for a given transcript path. This process is repeated until all the reads have been assigned. Similar to Cufflinks, the output includes a GTF file of transcripts and exons. b) Tools that use sequence similarity search against known gene models of model or closely-related species. Examples include LEMONS and CEPiNS. CEPiNS61 is one the first large-scale exon prediction tools that do not necessarily require a reference genome to find putative exons in assembled transcriptomes, and it facilitates the study of gene structures for model and non-model species. Given a set of transcribed genes and the genomic sequence of a model species, and a transcriptome assembly of a novel species in FASTA format, CEPiNS identifies the shared set of orthologous genes for the novel species. It does not recommend a particular model species to use. It starts by removing redundant transcripts in the novel species set that are similar to any other transcript in the same set with 95% or more similarity at the nucleotide level with BLASTN62. It uses SPIDEY63 to align the mRNA sequences of the model organism against its genomic sequences, and finds exon boundaries for the model dataset. It then uses TBLASTX to align assembled transcripts from the non-model species in all six frames against the corresponding exonic sequences of the reference species that were obtained by its transcriptome to genome alignments using SPIDEY. The final output is a file containing putative exon coordinates and a FASTA file of putative exons. Recently, LEMONS64 emerged as an efficient tool for identifying exon-exon junctions in transcriptomes of organisms that lack a reference genome. It uses a database of known and mapped 15  exon-exon junctions for all of its predictions. LEMONS utilizes a BLAST based approach, and leverages the evolutionary conservation of both splice junction recognition signals and exon/intron boundary positions to predict splice-junctions in transcripts. The database in LEMONS includes well annotated genomes and proteomes of Homo sapiens, Anolis carolinensis, Gallus gallus, Xenopus tropicalis, Arabidopsis thaliana, Mus musculus, Caenorhabditis elegans, Danio rerio and Drosophila melanogaster. It uses BLASTX to find orthologous proteins, and predicts exon-exon junctions based on the conserved gene structure. The output files of LEMONS have a similar format as that of the CEPiNS output files. c) Ab initio tools that use mathematical models of biological signals such as coding and non-coding genes, splice sites, along with a set of known gene structures for training such models65. Examples of such programs include GENSCAN29, GENIE66, and AUGUSTUS14. AUGUSTUS14 facilitates ab initio prediction of protein coding genes in eukaryotes with the help of a Hidden Markov model67. Transcript and gene annotations from related species that have been syntenically mapped to a target genome with the help of TRANSMAP68, can also be incorporated as evidence to AUGUSTUS68. It can also utilise evolutionary conservation of retroposed genes, mRNA, EST and DNA of target species. It has been compared with other similar methods such as GENIE66 and GENSCAN29 using drosophila and human real datasets. AUGUSTUS outperforms these methods by predicting the maximum number of genes in all cases.   1.7.Splice graphs  A splicing graph provides a convenient and natural representation of all splicing variants for a transcript. It could be defined as a directed acyclic graph (DAG) where: 16   Vertices or nodes represent the splicing sites for a given gene. These sites are numbered from 1 to n, and are ordered by their position from 5' to 3'.   Edges represent splice paths between splicing sites. They also follow the same 5' to 3' orientation.   Figure 1.6: Different transcripts with the corresponding splice graph  Figure 1.6 demonstrates how different transcript isoforms of a gene can be collapsed into a single splice graph. The boxes denote exons and the lines or arcs denote the connecting splice paths. It can be seen in the figure that the gene has alternate promoters with both exons 1 and 2 splicing to exon 3. Exon 4 represents an alternative exon that can either be included or spliced out. An alternate donor site is present on Exon 5, and is split into sub-exons 5.1 and 5.2. A splice graph offers a concise representation of all transcript isoforms, and lays out splicing patterns, although not all traversals along the graph necessarily correspond to viable transcripts. Most tools available for splice graph construction using RNA-Seq data rely on pre-existing, well-annotated gene models. Tools such as SpliceGrapher69 use existing gene models and enhance them 17  using RNA-Seq and EST data. Other tools such as Bridger70 use splice graphs as an intermediate step in their de novo transcriptome assembly process. TransComb71 is yet another tool that combines splice graph junctions to carry out a genome-guided transcriptome assembly. Recent tools like SDEAP72 use splice graphs for carrying out differential transcript expression analysis in cohort data. In the context of non-model organisms, tools that could create splice graphs without the help of a well-annotated reference genome could be of great value to the genomics research community.  1.8.Non-coding RNA (ncRNA)  With the discovery of tRNAs and rRNAs in the mid-20th century, the biological role of ncRNAs came into limelight. Since then, we are gaining an improved understanding of the functional roles of ncRNAs, including the well-characterized snRNAs73, RNAse P74, and Xist75. It is also found that almost all of the mammalian genome undergoes transcription at some degree76. The interest in this field soared with the conclusions of the ENCODE consortium that claimed that over 80% of the human genome had some biochemical function1, and established the fact that DNA sequences that were thought to be supposedly inert, are actually transcribable. However, the number of different transcript types should not be confused with their abundance and functional importance in a typical cell. Many ncRNA transcripts play no functional role in a cell or in its organisational physiology due to their low expression levels. A study77 found that a ncRNA type’s abundance is crudely correlated with its degree of conservation. Although, the difference between “junk” RNA and functional ncRNAs is vague78, and there have been astonishing discoveries that elaborate the importance of many ncRNAs. Infrastructural ncRNAs, such as tRNAs, rRNAs, spliceosomal uRNAs, snRNAs, and snoRNAs, are well-studied for their regulatory functions in 18  many catalytic and regulatory processes79,80. For example, the U1 snRNA, which normally helps in splicing, is also involved in regulating the initiation of transcription by RNA polymerase II by interacting with transcription initiation factor TFIIH81. ncRNAs such as 75L RNA are associated with central cell biological processes. 75L RNA facilitates the transportation of signal peptides containing nascent proteins to the endoplasmic reticulum by acting as a main component of a signal recognition particle82. ncRNAs also have roles in chromosome maintenance and segregation. In the case of dyskeratosis congenita83, a mutated small RNA with likeness to box H/ACA snoRNAs is known to play a significant biological role 84.  Ribosomes consist of more than 60% ribosomal RNA, which catalyse the translation of nucleotide sequences into protein. Transfer RNAs on the other hand, form an adapter molecule between protein and mRNA85. Small nuclear ribonucleoproteins form a major part of the spliceosome machinery86. Group I and group II catalytic intron are self-splicing in nature, and can facilitate their own excision from mRNA, rRNA and tRNA precursors in a variety of living beings87. ncRNA are found to be heavily involved in gene regulation, which can occur in trans or in cis88. A growing number of studies suggests that there is still an enormous repertoire of ncRNAs with functions still to be discovered89. The seeds of revolution in ncRNA analysis that were sown many decades ago have erupted with the wide-spread use of deep sequencing technologies. There has been a phenomenal increase in the study of these versatile cellular molecules, which regulate a remarkably broad spectrum of cellular processes90. These studies have expanded their focus from human to newly sequenced non-model organisms by comprehensively profiling their transcriptomes. A simple search on PubMed91 using the keyword “ncRNA”, with results binned by the year of publication, shows the trend in which the ncRNA revolution has advanced in the last two decades (Figure 1.7).  19   With the identification of many ncRNAs associated with many facets of gene expression, there has been an increased interest in the scientific community for this field. I expect this to result in more astonishing discoveries soon.  1.8.1. Long noncoding RNA   Long non-coding RNAs are a key class of ncRNAs that are associated with a number of biological activities across different life forms92. These enigmatic players in the complex transcriptional milieu could be somewhat arbitrarily defined as non-coding RNAs of length 200 nucleotides or more. They can be further classified into intergenic ncRNA and long intronic ncRNA93. They are mainly transcribed from pseudogenes, which are non-coding DNA sequences that resemble known genes94. LncRNAs can alter protein-coding gene expression, and function as cis and trans modulators 94. Recent studies suggest that they can regulate transcription through chromatin modulation. This Figure 1.7: Number of publications returned by PubMed using the keyword “ncRNA”. Timeline is shown on the x-axis and the total number of publications in that year is shown on the y-axis. 20  functionality is found in many eukaryotes, suggesting this to be a conserved function in spite of the fact that lncRNAs themselves are often not that conserved94. Many lncRNAs potentially target and partake in histone-modifying activities at specific loci. Examples include HOTAIR95, KCNQ1OT196 and, ANRIL (CDKN2B antisense RNA 1)97. Some studies have also elucidated their role as scaffolds that can influence the activity of complexes that modify histones94. LncRNAs can also regulate mRNA processing. LncRNA genes with an antisense orientation to known protein-coding genes can influence the processing of an mRNA transcribed from a sense strand94. Such lncRNAs are referred to as natural antisense transcripts (NATs)98. In mammalian cells, NATs have been found to influence splicing patterns of mRNAs at the c-ErbAalpha, MYC and ZEB2 (zinc-finger E-box binding homeobox 2) loci in neuroblastoma samples99. LncRNAs have also been implicated in regulating mRNA stability, and playing a significant role at the protein level100. sno-lncRNAs are a type of lncRNAs, which are lined by small nucleolar RNA (snoRNA) sequences101. These sequences have shown to influence splicing patterns in human cell lines by physically interacting with an alternative splicing regulator94. With new studies, the role of lncRNA has been expanding at a great pace, and more is yet to be explored as this burgeoning field holds many surprises.  1.8.2. Strategies for lncRNA identification   With the availability of high-throughput sequencing datasets, a number of lncRNA identification pipelines and tools have been published100,102–104. Most de novo methods for lncRNA identification are based on the features derived from sequence and structure. Statistical approaches to gene finding can exploit fine statistical variations between the sequence content of the coding and non-21  coding elements105. Typically, this involves open reading frame (ORF) predicting algorithms, along with a sequence alignment and homology detection approach106.  lncRScan107 is one such approach to detect novel lncRNA. It starts by the elimination of partial transcripts and artefacts in the transcriptome assembly, followed by steps to identify lncRNAs, and distinguish them from protein-coding mRNAs. The tool incorporates a five-step process to accomplish this task, which includes (1) organizing input transcripts into different categories as per their genomic location in relation to gene transcript annotations; (2) selecting transcripts greater than 200 nucleotides; (3) filtering out sequences with open-reading frames of length more than 300 nucleotides; (4) phylogenetic analysis to filter any protein-coding transcripts, and (5) search the left-over transcripts in the Protein family (Pfam) database, and exclude any with significant hits to protein domains. The authors have evaluated the performance of their tool, and have demonstrated its ability in filtering out mRNAs, while annotating a stringent set of putative lncRNAs. iSeeRNA108 is a Support Vector Machine (SVM) based classifier that can distinguish lncRNAs from protein coding RNAs with the help of sequence conservation, ORF identification, and nucleotide sequences based features. It can be used to build customised SVMs for any species of interest, and can be incorporated into pipelines that analyse transcriptome data.  Another tool, lncRNA2Function109, follows an innovative ontology-driven approach, and assumes that, similarity in expression patterns across multiple conditions correlates with similarity in biological pathways and functions. This tool aims at annotating lncRNAs with functional terms that are significantly associated with the co-expressed protein-coding genes. 22  Functional characterization of lncRNA is a rapidly expanding research field, and much remains to be done for functional analysis and precise characterization of lncRNAs. Further, the identification of long non-coding RNA in non-model organisms poses major challenges due to the lack of established reference genomes, reliable transcript models, and annotations. Thus, there is a crucial need to develop de novo analysis tools and pipelines to investigate this hidden realm of biologically relevant sequences.  1.9. Research questions  The following research questions have driven the development and application of the methods and analytical procedures presented in Chapter 2 and Chapter 3. 1. What approaches would improve the genome and transcriptome annotation process of non-model organisms, such as the American bullfrog (Rana (Lithobates) catesbeiana), which lack a closely related reference resource? How could we reduce the reliance of current annotation methods on such reference resources and strive for de novo annotation methods? 2. Genome assembly is an arduous and demanding task. How can resource-exhaustive and time-consuming genome assemblies be avoided for transcript annotations? What annotation approach will be independent of the contiguity and completeness of a genome assembly?  3. What computational technologies and algorithms could make the annotation process faster and more efficient as compared to the existing approaches?  23  4. What available graphical methods could be used to explore gene structures de novo, and help in exploring new loci for functional analysis, and studying genes that were previously inaccessible? How can splicing patterns in mRNA transcripts be visualised? 5. How the current approaches for lncRNA identification could be improved and applied on non-model species? What would be an acceptable validation for such an approach?                     24  2. ChopStitch: Exon annotation and splice graph reconstruction using transcriptome assembly and whole genome sequencing data  2.1. Introduction  Alternative splicing is a regulated process that gives rise to multiple isoforms from a single gene110. It might result in including or excluding certain exons from the final, processed messenger RNA (mRNA) of a gene110. This process increases the diversity in the transcriptome and proteome of an organism, and has a significant role in the regulation of gene expression and protein function. It is very frequent, and occurs in more than 90% of multi-exon human genes32. Knowledge of novel splice variants and exons, especially in non-model organisms can help in designing primers for PCR amplification, and shed light on genes that were previously inaccessible. Recent advances in high-throughput sequencing technologies have enabled sequencing of non-model organisms with reasonable budgets. Such studies often interrogate the genomes and transcriptomes of these species with massive amounts of short sequences111–113. RNA-Seq is one such powerful sequencing technique for providing transcriptome-wide coverage114. It provides unprecedented opportunities for researchers to explore the transcriptome of non-model organisms with limited genomic information, and has proved helpful in finding novel gene isoforms, measuring transcript abundance, and profiling gene expression115. Interrogating the genomes and transcriptomes using these datasets requires de novo analysis tools and techniques, when the species or closely-related species lacks a high quality reference116,117. For certain applications such as de novo annotation, information on putative exons and alternative splicing variants may be desirable.  25  The vast majority of tools that are currently available for splice or exon-exon junction prediction either use a reference genome to map short sequence reads to it or rely on existing gene models. Splicegrapher69 uses gene models to construct a splice graph that acts as a foundation for expressed sequence tags (EST) and RNA-Seq data to act on. Cufflinks56 predicts putative exons as a part of its assembly process by identifying the smallest set of transcripts that describe the observed data. Another computational method called StringTie58 uses a network flow algorithm59 in combination with optional de novo assembly for finding putative exons and assembling transcripts. Both Cufflinks and StringTie require mapping RNA-Seq reads to the reference genome using splice aware alignment programs such as HISAT255, TopHat210, prior to their execution. Recently, LEMONS64 emerged as an efficient tool for identifying exon-exon junction in transcriptomes of organisms that lack a reference genome. It uses a database of known and mapped exon-exon junctions for all of its predictions. The idea of utilizing evolutionary conserved intron/exon boundary positions and alternative splicing signatures, limits the scope of LEMONS to just shared orthologs.  To address the limitations of these tools and reduce reliance on reference genomes and existing gene models, we present ChopStitch, a method for finding putative exons and constructing splice graphs using transcriptome assembly and whole genome sequencing data as inputs. We implemented a method that identifies exon-exon boundaries in de novo assembled transcripts with the help of a Bloom filter that represents the k-mer (all observed sequences of length k) spectrum of genomics reads. We have tested our method on characterizing the Caenorhabditis elegans (C. elegans) and Homo sapiens (H. sapiens) transcriptomes, using publicly available RNA-Seq and whole genome shotgun sequencing data. We compared our method with LEMONS, Cufflinks and StringTie and found that Chop-Stitch outperforms these state-of-the-art methods for finding exon-26  exon junctions with and without the help of a reference genome. Chop-Stitch could be used effectively to annotate de novo transcriptome assemblies, and explore alternative mRNA splicing events in non-model organisms, thus exploring new loci for functional analysis, and studying genes that were previously inaccessible.   2.2. Methods  ChopStitch is implemented in C++ and python, and consists of three utilities, CreateBloom, FindExons, and MakeSplicegraph for creating a Bloom filter, finding putative exons and constructing splice graphs, respectively. The input to the tool is a FASTA file of transcriptome assembly in Trans-ABySS output format and a FASTA or FASTQ file of whole genome sequencing reads. The output of the tool is a FASTA file of putative exons and a DOT file of splice graphs. ChopStitch is available online at https://github.com/bcgsc/chop_stitch  2.2.1 Trans-ABySS  Trans-ABySS9 is a de novo short-read transcriptome assembly and analysis pipeline, that uses ABySS118, a short-read genome assembler for generating assemblies across a wide range of k values, so as to address variable transcript expression levels. It starts by filtering and merging the multi-k assemblies, followed by a step that generates a set of non-redundant contigs. It also integrates techniques to map assembled contigs to known transcripts, while supporting EXONERATE27 and BLAT119 contig-to-genome aligners. It can identify novel gene-fusion events, splicing events, candidate polyadenylation sites, and estimate the level of gene expression. We have used the latest version of Trans-ABySS with ABySS 1.5.2 for our analysis. 27  2.2.2  ntHash  ChopStitch uses the ntHash120 algorithm at its core to efficiently hash consecutive k-mers in a sequence. This hashing method uses a recursive function, f, in which the hash value of the current k-mer H is derived from the hash value of the previous k-mer:                         H(k-meri) = f(H(k-meri-1), r[i+k-1], r[i-1])    (1) ntHash has a time complexity of O(k+1) as compared to the O(kl) complexity of regular hash functions, where k is the k-mer size and l is the length of sequence being hashed.  2.2.3. ntCard  ntCard (Mohamadi et al., 2016) is a streaming algorithm for estimating k-mers frequencies in genomics datasets. It begins by computing a 64-bit hash value for each k-mer in a provided dataset, while using the ntHash120 algorithm. It selects a desired length of prefix and suffix bits from every computed hash value, and terms it as sample bits (s) and resolution bits (r), respectively. It then uses a sampling rate parameter s for reducing a dataset by 1/2s on average, and avoids overpopulation of the counter space of size 2r.  It then constructs a reduced representation multiplicity table to describe a sample distribution, while using the resolution bits of subsampled hash values. Finally, it uses a statistical model to reconstruct the population distribution from the sample distribution.    28  2.2.4. Bloom filter  A Bloom filter is a space-efficient, probabilistic data structure that is used to test the dynamic set membership of an element121. It has fast look-up times and a manageable false positive rate. It can be described as a bit array of m bits, associated with h hash functions that map elements to positions in the bit array. The number of hash functions that could minimise the false positive probability depends on the size of the Bloom filter m and the number of elements n to be inserted into it122: 𝒉 = (𝒎𝒏) 𝒍𝒏𝟐     (2) The false positive rate in a Bloom filter is often given as123: (𝟏 − [𝟏 − (𝟏𝒎)]𝒉𝒏)𝒉    (3)  2.3. Implementation  2.3.1. Genomic Bloom filters Hash based data structures such as Bloom filters provide an efficient way of counting or cataloguing k-mers in genomic sequences for operations such as indexing and querying. Therefore, we have used ntHash based Bloom filter as the main data structure in our tool. All possible substrings of length k obtained from paired end whole genome reads are hashed using ntHash, and loaded into a first level Bloom filter, which we refer to as the Distinct k-mers Bloom Filter (dbf). The size of this Bloom filter depends on the false positive rate (FPR) and the number of hash functions provided by the user as an input. We then use ntCard to estimate the number of unique 29  k-mers to be inserted into the Bloom filter. Given n as the number of distinct elements and fpr the desired FPR rate, the size of a Bloom filter is governed by the formula:  𝑆𝑖𝑧𝑒 𝑜𝑓 𝑡ℎ𝑒 𝑑𝑏𝑓 = (−𝑛ln2(2) )  x ln(𝑓𝑝𝑟)      (4)   2.3.2. Solid k-mers Bloom filter   While utilizing the k-mer histogram output from ntCard, we discard all k-mers with counts less than two, so as to eliminate the majority of k-mers that result from sequencing errors. Although not all k-mers with multiplicity of one would be erroneous k-mers, and there would be erroneous k-mers with multiplicity of two or more, this cascading approach is very useful, and has been demonstrated in many methods123, 124, 125. The resultant k-mers with counts two or more are stored in a new Bloom filter, which we refer to as the Solid k-mers Bloom Filter (sbf). Thus, regions of the transcriptome with a read coverage of less than two would be omitted from our analyses.   2.3.3. Trans-ABySS assembly  We used Trans-ABySS v1.5.59 to assemble paired-end RNA-Seq reads with k-mer sizes of 25, 35 and 45, using default parameters. Contigs generated from these three runs were merged together using the transabyss-merge utility in the package. Contigs shorter than 100 base pairs were excluded from the final contig set. We ran RNA-Quast126 on the assembled contigs for assembly evaluation.  30  2.3.4. Detecting exon-exon junctions  The k-mer content of assembled transcripts is interrogated in the solid Bloom filter to detect exon-exon junctions. We hypothesize that all k-mers internal to exons would ideally be present in the solid Bloom filter (sbf), while k-mers overlapping an exon-exon junction will be absent (Fig. 2.1).                         Figure 2.1: ChopStitch algorithm workflow: After constructing the genomic bloom filter, ChopStitch interrogates transcript sequences to find putative exons. It then finds exons with overlapping edges and constructs a splicegraph in DOT format. Graphviz ccomps is used to find sub-graphs.  In an ideal situation, the stretch of absentee k-mers at an exon-exon junction should exactly be equal to k-1. However, considering the FPR of the sbf, this value can be adjusted by a certain number of k-mers, which in ChopStitch, is governed by a leniency parameter. We define leniency 31  as the number of k-mers that can be ignored while counting a k-1 absentee k-mer stretch at a given position. Given the FPR and k, this can be calculated as:                                    𝒍𝒆𝒏𝒊𝒆𝒏𝒄𝒚 = 𝑭𝑷𝑹 𝐱 𝒌 𝐱 𝒍𝒆𝒏𝒊𝒆𝒏𝒄𝒚 𝒇𝒂𝒄𝒕𝒐𝒓                     (5)  In the above stated formula, the leniency factor dictates the level of leniency, and can be provided by the user as an input (default = 10).   2.3.5. Detecting and correcting sequencing errors and RNA editing events  The fidelity of a high-throughput sequencing dataset can be affected by several factors including sequencing errors, which would result in single or multiple base differences between the genomic and transcriptomic counterparts of an individual. This would increase false positive predictions for ChopStitch as a single or multiple base difference between the transcripts, and the corresponding genome would return a k-1 or larger stretch of absentee k-mers. A similar situation would be encountered in the case of a single nucleotide polymorphism (SNPs) or RNA editing event. Therefore, it was crucial to include reliable approaches for monitoring and reducing false positives while detecting exon-exon junctions.  32   Figure 2.2: Detecting putative exons smaller than the size of k-mer: The stretch of absentee k-mers is greater than k-1. The 3-sided arrows represent the scrutiny process towards the beginning and end of the absentee-k-mer stretch  In order to detect such events, the first k-mer found to be absent in the sbf after a series of present k-mers is scrutinized by substituting its last nucleotide with all the other three possible nucleotide bases and testing the presence of the resulting k-mers in the sbf. One out of the three resulting k-mers should ideally be present in the sbf, if the original k-mer had a sequencing error or a SNP or an RNA editing event at its end. Owing to the false positive rate of a Bloom filter, this step is repeated ‘leniency’ number of times for the next few k-mers in order to ascertain a true prediction. Although this step significantly reduces the number of false positives for our tool, it is advisable 33  to use ChopStitch on the genome and transcriptome of the same individual so as to minimize single nucleotide differences due to SNPs  2.3.6. Detecting putative exons smaller than k  Exons smaller than the k-mer size would return a stretch of absentee k-mers equal to a length of [(k-1) + size of the small exon]. This helps in identifying smaller exons that would otherwise have had been missed (Figure 2.2). A true exon-exon junction followed by one or more single base differences due to an RNA-editing event or a SNP or a sequencing error at a distance of less than k-1 would also return a similar profile of absentee k-mers. As soon as a k-mer is found present after a stretch of absentee k-mers of length greater than k-1, 'leniency' number of k-mers before this k-mer are scrutinized for detecting and correcting single base differences as explained in section 2.3.5. The only difference is that while the former case, substitutions started at the last base of the scrutinized k-mer, substitutions start from the first base in the latter case. If the scrutinized junctions in the absentee k-mer stretch return to be true, then the smaller exon is outputted as a substring of the absentee k-mer stretch ranging from its start to (length of the stretch - (k-1).   2.3.7. Splice graph construction  k-mers are retrieved from the edges of putative exons for a length equal to the leniency factor. These k-mers are cross-examined against each other by a Python script to find overlapping edges between putative exons belonging to different transcript isoforms (Figure 2.3). This information is used to collect exons in a single node, and generate splice graphs for the complete set of transcripts, which is then stored as a single graph in DOT format. Graphviz ccomps127 is then used to output subgraphs from this file. These subgraphs clearly depict putative splice variants for a 34  transcript, while offering a fair representation of all transcript isoforms and beautifully laying out their splicing patterns. These can be visualized using DOT visualization tools such as Graphviz127 and Gephi128.   2.4 Dataset  We have used real datasets of H. sapiens and C. elegans (Table 2.1) to test and compare the performance of ChopStitch, and benchmark our results. The H. sapiens datasets are from a 1000 Genomes Project individual, NA19238, and the C. elegans datasets were obtained from its N2 strain. Organism Library strategy Accession ID Read length (bp) Size (GB)  H. sapiens WGS SRA:ERR309932 250 58.20  C. elegans WGS SRA:DRR008444 110 4.30  H. sapiens RNA-Seq SRA:ERR356374 50 1.20  C. elegans RNA-Seq SRA: SRR2537190 50 0.52 Table 2.1: Datasets used to evaluate and compare ChopStitch’s performance  2.4.1. Bullfrog genome dataset  We recently sequenced and assembled the American bullfrog genome. The MAKER2 genome annotation pipeline was used to predict protein coding genes in the draft bullfrog genome assembly. The set of predicted genes were refined by identifying a high confidence set for the purposes of downstream biological analyses, based on a high degree of sequence similarity to human and amphibian Swiss-Prot proteins (Downloaded 02/16/16), or to contigs from the Bullfrog Annotation Resource for the Transcriptome (BART). A detailed description of BART can be found 35  in section 3.2.3. The final assembly with annotated MAKER2 gene predictions is available at NCBI-Genbank under accession (LIAG00000000, BioProject PRJNA285814). We aligned BART against the draft genome assembly and performed error-correction to reduce single base differences in the aligned segments.      Figure. 2.3: Constructing the splice graph DOT file: Ti and Ei represent the transcript and exon ids belonging to different protein coding transcript isoforms. k-mers are shown as colored lines towards the edge of the putative exon. Putative exons that share the same k-mer spectra are shown by the same color and are represented as a single node in the DOT file with their headers concatenated with the string ‘_OR_’.   2.5 Results  We compared our method with LEMONS, Cufflinks and StringTie. While LEMONS is a tool for identifying splice junctions in transcriptomes of organisms which lack a reference genome,         digraph {          T1E1_OR_T2E1_OR_T3E1_OR_T4E1 -> T1E2_OR_T3E2_OR_T4E2;          T1E1_OR_T2E1_OR_T3E1_OR_T4E1 -> T1E3_OR_T2E2_OR_T4E3;          T1E2_OR_T3E2_OR_T4E2 -> T1E3_OR_T2E2_OR_T4E3;          T1E2_OR_T3E2_OR_T4E2 -> T1E4_OR_T2E3_OR_T3E3;          T1E3_OR_T2E2_OR_T4E3 -> T1E4_OR_T2E3_OR_T3E3;          T1E4_OR_T2E3_OR_T3E3 -> T1E5_OR_T2E4_OR_T3E4_T4E5;          T1E4_OR_T2E3_OR_T3E3 -> T1E5_OR_T2E4_OR_T3E4_T4E5;           } 36  Cufflinks and StringTie being non de novo transcriptome assemblers require a reference genome for predicting putative exons. LEMONS utilizes a BLAST based approach, and leverages the evolutionary conservation of both splice junction recognition signals and exon/intron boundary positions to predict splice-junctions in transcripts in the absence of a reference genome. On the other hand, Cufflinks and StringTie use a BAM alignment file of RNA-Seq reads aligned against a reference genome as an input to assemble different transcript isoforms for a gene. All tools were run with 32 threads and default parameters on an isolated machine with 2.5 TB RAM and 128 Xeon E7-8867 CPU cores.  2.5.1. Evaluation criteria for exon-exon boundary detection  Putative exons from all these tools were assessed by aligning them against reference exons from Ensembl using BLAST with 95 percent query coverage, 95 percent sequence identity, and a length difference of five base pairs between the subject and the query hits. Putative exons that returned a hit to a reference exon while following these criteria were considered to be True Positives (TP), while putative exons that did not return a hit were considered as False Positives (FP). Precision was calculated as the ratio of the number of true positives to the total number of putative exons, (𝑇𝑃/(𝑇𝑃 + 𝐹𝑃)). To define the sensitivity of these predictions, we calculated the ratio between the number of true positives and the total number of reference exons present in the Ensembl dataset for that species, (𝑇𝑃/(𝑇𝑃 + 𝐹𝑁)). As RNA-Seq captures the presence and quantity of RNA in a sample at a given moment in time, the number of transcripts assembled from such datasets will ideally be less as compared to the comprehensive set of gene models available in reference databases. Thus, the sensitivity would be low for all tools, but the metric would still be relevant to compare the relative performances of alternative tools. 37  2.5.2. ChopStitch outperforms other tools  Chop-Stitch outperforms the state-of-the-art methods available for finding putative exons with and without the help of a reference genome. From Figure 2.4 we can see that in H. sapiens at k = 50 nucleotides (nt), dbf FPR = 0.16, sbf FPR = 0.01 and leniency = 10, ChopStitch finds putative exons at a precision and sensitivity of 0.91 and 0.20 respectively, and performs better than the available methods. Similar results were obtained in the case of C. elegans. For each combination of data set and tool, we measured running time and peak memory usage.   Figure 2.4: Precision and sensitivity comparison between tools: A) Hits to Ensembl exons, B) Precision C) Sensitivity. Chop-Stitch’s is almost two-fold better at predicting exon-exon junction as compared to its counterparts.  In figure 2.5, we observe that StringTie was the fastest tool with the lowest peak memory usage. While ChopStitch showed competitive run times and memory usage, LEMONS was notably A B  A C 38  slower with higher peak memory. ChopStitch’s peak memory requirement is directly proportional to the size of the dbf, which is governed by the approximate number of distinct k-mers outputted by ntCard. Effect of leniency on ChopStitch’s performance has been discussed in Appendix A.    Figure 2.5: Time and memory comparison between tools for A) C. elegans and B) H. sapiens dataset. StringTie was the best in terms of time and memory benchmarks, while ChopStitch was comparable.  2.5.3. Effect of FPR for dbf and sbf on ChopStitch’s performance  We evaluated ChopStitch’s performance with different false positive rates for dbf and sbf in order to optimize our tool. We measured the performance based on F1-scores129, which can be defined as: 𝑭𝟏 𝒔𝒄𝒐𝒓𝒆 = (2 𝐱 𝑝𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛 𝐱 𝑟𝑒𝑐𝑎𝑙𝑙)/(𝑝𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛 + 𝑟𝑒𝑐𝑎𝑙𝑙 A B 39  F1 score can be used to measure the accuracy of a test in the statistical analysis of a binary classification. It incorporates both precision and recall of a test and can be inferred as a weighted average of both. In other words, it is the harmonic mean of recall and precision. Its best value is a 1 and worst is at 0. Figure 2.6 shows ChopStitch results for C. elegans on a range of FPR values for dbf and sbf. F1-score was calculated while increasing sbfFPR by 0.005 and dbfFPR by 0.05 at each measurement. ChopStitch’s performance seems to deteriorate with a small increase in sbfFPR as compared to a much larger increase in dbfFPR. This is due to the fact that the size of sbf is significantly smaller than dbf, and many false positives k-mers are removed during this dbf to sbf reduction step. Besides, it is dbf that is used to interrogate k-mers in ChopStitch, and has a direct role in exon-exon junction prediction. Figure 2.6: ChopStitch results for C. elegans on a range of FPR values for dbf and sbf. ChopStitch’s performance seems to deteriorate with a small increase in sbfFPR as compared to a much larger increase in dbfFPR   2.5.4. Assessment of putative splice graphs  With ChopStitch being a de novo tool, we found it relevant to compare the obtained splice graph by both a de novo and a reference based strategy.  40  De novo evaluation For this, we hypothesized that all putative exons that are a part of a single node in a splice graph should be substrings of the largest putative exon. As seen in Table 2.2, we found that 97% of 31,302 nodes and 95% of 11,264 nodes, present in the splice graphs for H. sapiens and C. elegans, followed this criteria.  Reference based evaluation We aligned all putative exons outputted for H. sapiens and C. elegans against their reference genome, and hypothesized that all putative exons that are a part of a single node in a splice graph should align to approximately the same coordinates (+/-5 nucleotides). We found that, 94% and 96% of the total nodes present in the splice graphs for H. sapiens and C. elegans respectively, followed this criteria. The exons that failed to align might be misassemblies with common sequences towards their boundaries. This percentage increases as we relax the cut-off. Organism Evaluation method  Precision  H. sapiens de novo 0.97  C. elegans de novo 0.95  H. sapiens Reference-based 0.94  C. elegans Reference-based 0.96 Table 2.2. Evaluation results for ChopStitch splice graphs  2.5.5. ChopStitch results with American bullfrog (Rana (Lithobates) catesbeina) dataset  We used the R. catesbeina draft assembly and error-corrected BART transcriptome assemblies (Refer section 3.2.3) as inputs to ChopStitch, along with parameters, k=40 nt, sbfFPR = 0.01 and 41  dbfFPR=0.01. We reduced redundancy in our resulting set with the help of cd-hit-est with parameters, M=10000, T=20, d=40, s=0.9, aL=0.9, B=1, p=1, n=9 (NR-CST). Similarly, we reduced redundancy in the predicted exons from the high-confidence MAKER set, the BART set, and also the putative exons that were obtained after running LEMONS on BART. The resulting four non-redundant sets were called NR-CST, NR-MAKER, NR-BART, and NR-LEMONS, respectively. Thereafter, we did the following BLASTN runs: RUN1: Query = NR-MAKER, Subject = NR-BART RUN2: Query = NR-MAKER, Subject = NR-CST RUN3: Query = NR-MAKER, Subject = NR-LEMONS Figure 2.7: Blast results for RUN1. Query coverage on x-axis and percentage identity on y-axis   42  Figure 2.7 shows the BLAST results for RUN1. Each cell in the matrix has two numbers. The first number denotes the total number of hits between the query and the subject at a given percentage identity and query coverage. The second number in decimals is calculated by dividing the latter with the total number of non-redundant MAKER exons, which is 1,153,798. It can be seen that the level of coherence between NR-MAKER and NR-BART is extremely low with only a 15% agreement at 95% sequence identity and query coverage. With MAKER as the only robust source of in-silico prediction available to us for R. catesbeina, we tried to find the intersection between NR-CST, NR-MAKER, NR-BART, and NR-LEMONS.    Figure 2.8: Blast result comparison between RUN1 and RUN2. The best results can be seen at an identity of 99% and a query coverage of 95% 43  Figure 2.8 compares results for RUN1 and RUN2. The number in each cell is obtained by dividing the number of hits from RUN1 with the number of hits that were obtained for RUN2 at the same sequence identity and query coverage. In order to introduce stringency in our evaluation, it was made sure that while counting hits in the case of RUN2, the subject and query sequences differed by a maximum of five nucleotides. The best results (74.58 %) are at a query coverage of 95% and a sequence identity of 99%. We tried to find the degree of intersection between our resulting non-redundant datasets. 20,809 putative exons were found common in LEMONS-NR and MAKER-NR, while 28,554 were common in CST-NR and MAKER-NR. Interestingly, 18,426 putative exons were common between CST-NR and LEMONS-NR. Therefore, CST-NR had 10,128 extra exons common in MAKER-NR, as compared to LEMONS-NR. Figure 2.9: Intersection between results for nr-ChopStitch, nr-LEMON, nr-MAKER-high confidence set  44  2.6. Discussion and conclusion  Chop-Stitch is a new method for finding putative exons and constructing splice graphs using an assembled transcriptome and whole genome shotgun sequencing (WGSS) data. It enables the identification of exon-exon boundaries in de novo assembled RNA-Seq data with the help of a Bloom filter that represents the k-mer spectrum of WGSS reads. The algorithm also detects base substitutions in transcript sequences corresponding to sequencing or assembly errors, haplotype variations, or putative RNA editing events. It has incorporated the state-of-the-art technologies such as ntHash and ntCard for a fast and efficient workflow. The primary output of our tool is a FASTA file containing putative exons. Further, k-mer content of the edges of putative exons are interrogated against other exons to detect transcript isoforms that are reported as splice graphs in dot output format. As genome assembly is an arduous and demanding task, annotation approaches that could avoid this step would be much preferable. ChopStitch's ability to use unassembled genomic datasets, reduces its reliance on the contiguity and completeness of a genome assembly for the annotation process. Chop-stitch enables the accurate identification of exon-exon junctions in transcript isoforms. Its ability to function de novo, without a reference genome makes it stand out from other tools. Figure 2.9 demonstrates ChopStitch's ability in finding putative exons in non-model organisms. Compared with other leading methods on real data, ChopStitch was substantially accurate at finding putative exons. This method could be used effectively to annotate de novo transcriptome assemblies, find novel-exons in coding as well as non-coding mRNA transcripts, and search alternative mRNA splicing events in non-model organisms, thus exploring new loci for functional analysis and studying genes that were previously inaccessible.   45  3. Finding long non-coding RNA in the American bullfrog (Rana (Lithobates) catesbeiana) transcriptome  3.1 Introduction  Recent advancements in transcriptome sequencing have helped in exploring the role of long non-coding RNAs (lncRNAs) in numerous biological processes, and have established the fact that these molecules that were previously mistaken to be transcriptional noise, are way more important130. The discovery and analysis of lncRNAs represents a new frontier in molecular genetics, and is of major relevance to the biology behind the functional and largely unexplored component of the transcriptome. This loosely classified group of non-coding RNA-transcripts with length greater than 200 nucleotides is found to be present in many branches of life. They show lack of conservation, but that does not imbue a lack of function. Surprisingly, a study131 by Guttman et al., (2009) reveals that lncRNA promoters have a higher degree of conservation as compared to lncRNA exons, and the conservation level is almost similar to those in protein-coding genes. Besides, their level of conservation is distinctly more than ancient repeats with a high degree of conservation in exons, splice sites and small domains bounded by less constrained biological sequences1,132. Many lncRNAs in primate and rodent species have also shown recent evolutionary origin1. In a study133 by Necsulea et al., (2014)  reconstruction of homologous lncRNA families was carried out based on sequence similarities, and it was found that only 3% of the families had originated more than 300 million years ago134. These families were reconstructed on the basis of similarities in DNA sequences with the help of single-link clustering, and evolutionary ages of lncRNA were estimated on the basis of phylogenetic distribution of species. Despite being non-coding with relatively low level of sequence conservation, some lncRNAs have shown to 46  contribute towards structural organization, function, and evolution of genomes1. Examples include classical lncRNAs, such as XIST, HOTAIR, telomerase RNA component (TERC), and many more with roles in imprinting genomic loci135, translation136, splicing137, transcription138, nuclear factor trafficking139, chromatin modification, determining the conformation of chromosomes, and allosterically regulating enzymatic activity90,140. Coupled with the scientific evidence, these studies paint a compelling view of the hidden role of lncRNAs in regulating major processes in living organisms.   3.1.1. LncRNA studies in non-model organisms  Current lncRNA databases are mostly populated with sequences that were derived from human and mouse141–143. Recent studies involving bovine144–146, chicken147, pig148, diamondback moth149, and goat150 have further contributed in enriching animal lncRNA datasets. A few databases151 exploring interactions of ncRNAs with proteins, RNAs, and viruses have also been developed, but mainly describing interactions in human.  3.1.2. The thin line between coding and non-coding RNA worlds  Most bioinformatics methods for identifying lncRNAs, negatively select and filter protein-coding features in mRNA152. They look for conserved open reading frames (ORFs), as coding sequences favour non-random codon usage and synonymous base changes. On the other hand, lncRNAs have a tendency to have more random substitutions153. Further, mRNA ORFs share sequence similarities to other proteins or have known protein domains, which is not the case with lncRNAs152,154. Although, these bioinformatics approaches help identify lncRNAs to a great 47  extent, studies have shown some discrepancy, where certain conserved lncRNAs were misannotated as coding transcripts132,155. A study104 by Jia et al., (2010) revealed that about two-thirds of human genes that were previously thought to be hypothetical protein coding genes were later found to be non-coding genes. Thus, there is a need to optimize bioinformatics pipelines for lncRNA detection, complemented by wet-lab validation techniques. We hereby present a subtractive pipeline for lncRNA detection in non-model organisms and discuss relevant wet-lab approaches to validate our findings. We have applied our pipeline to annotate the R. catesbeiana transcriptome.   3.2 Methods  3.2.1. Compiling a list of lncRNA candidates   A computational pipeline was developed to predict putative lncRNAs from the transcriptome of the R. catesbeiana (Figure 3.1). The assembled transcriptome has been submitted to NCBI (Submission ID: SUB968516, Bio Project ID: PRJNA285814). The major strategy that we employed was to find putative transcripts that were potentially coding, and eliminate them at different steps of our pipeline. We have used different methodologies to evaluate and elucidate the coding potential of our transcripts of interest. Our employed techniques range from finding open reading frames in transcripts, through evaluating coding potential by support vector machine algorithms, such as Coding Potential Calculator (CPC)156, to similarity searches with well-known protein sequence and family databases. Cabili et al., (2016)157 had previously suggested the ‘Transcripts of unknown coding potential’ (TUCP) classification, which refers to long RNAs that 48  have an in silico evidence of being coding. Our subtractive approach of finding and removing transcripts of unknown and known coding potentials has been advocated in previous studies149.  3.2.2. Computing resources  Computational analysis was performed using computing nodes with dual Intel Xeon X5650 2.66GHz CPUs with 12 cores, 48GB of RAM and Linux CentOS 5.4 as the operating system.  3.2.3. Bullfrog annotation resource for the transcriptome (BART)  De novo assembly of RNA-Seq data is a cogent method for constructing and annotating transcriptomes of non-model organisms. De novo liver transcriptome assemblies of R. catesbeiana have previously been published in Birol et al., (2015)113. Taking a step further, we built the Bullfrog Annotation Resource for the Transcriptome (BART), which is a composite R. catesbeiana reference transcriptome. It was constructed by combining Trans-ABySS116 transcriptome assemblies from 32 R. catesbeiana tadpole samples representing five tissues under different chemical and temperature exposure conditions. Transcripts from independently assembled tissues were filtered for length (500 nucleotides, or longer), and aligned to each other using the BLAST-like Alignment Tool (BLAT)119. Parallelized BLAT (pblat; github.com/icebert/pblat) was used to identify highly similar sequences, where only the longest example of each set of similar sequences was retained. This process produced 794,291 transcripts that were then annotated via BLAST62 alignment to the NCBI non-redundant (nr) database, the Swiss-Prot Homo sapiens protein database (Uniprot Consortium, 2015), and a set of amphibian transcript sequences obtained from the NCBI nucleotide database. 49  3.2.4. Updating BART with shorter contigs  BART contained contigs of length greater than 500 nucleotides, and had to be updated so as to include shorter contigs for our lncRNA analysis. We followed a BART-like construction approach to annotate contigs with lengths between 200 and 499 nucleotides, and referred to the resulting set as BART Junior (BART Jr.). Consequently, the original BART set came to be known as BART Senior (BART Sr.).  BLAT alignments with 95% or more sequence identity and 40% or more query coverage were considered as a cut-off for BART Sr. On the other hand, an identity and coverage cut-off of 95% and 60% was considered for BART Jr., respectively. This was done while taking into account the narrower contig size range of the latter set. We avoided using contigs shorter than 200 nucleotides, as they were not long enough to be categorized as lncRNAs. In order to reduce redundancy, we aligned all BART Jr. contigs against BART Sr. contigs, and discarded those BART Jr. contigs that returned hits with 25% query coverage and 99% identity.  3.2.5. Removing putative coding contigs with TransDecoder  An open reading frame (ORF) is a stretch of RNA that has a continuous stretch of codons without a stop codon if transcribed. TransDecoder (transdecoder.github.io) is a tool to find such candidate coding regions in transcript sequences based on a minimum length ORF and a log-likelihood score. If one ORF is found within the boundaries of another ORF, TransDecoder reports the longer ORF. It also reports multiple ORFs for a single transcript. For our analysis we started with 794,291 BART Sr. contigs and 1,341,707 BART Jr. contigs respectively. We used TransDecoder v3.0.0 with the default parameters, and filtered out contigs that returned complete (with both start and stop codons) or incomplete (missing a start or stop codon) ORF predictions.  50  3.2.6. Redundancy removal with CD-HIT-EST   Removing redundancy in sequence datasets helps reduce storage space, computational time and noise, while submitting and archiving in biological databases such as UniProt158, GenBank159. It also helps in reducing biases in sequence analysis such as calculating GC-content of a nucleotide dataset or amino-acid composition of a protein dataset. CD-HIT160,161 is a standalone application that uses a greedy incremental algorithm to cluster biological sequences, and reduce sequence redundancy. It starts by choosing the longest input sequences as the representative sequence, followed by comparisons and classification of the remaining sequences as representative or redundant sequences based on their similarities to representative sequences. Originally, CD-HIT was developed for clustering protein sequences162, but was later extended to cluster nucleotide sequences161, and came to be known as CD-HIT-EST160 (v4.6.6). Using this, we clustered contigs while applying a sequence identity threshold of 0.99. This threshold is known as CD-HIT’s ‘global sequence identity’, and is determined by dividing the number of identical amino acids in an alignment by the shorter sequence's full length.  3.2.7. Finding contigs with polyA tail and hexamer evidence  Almost all mRNAs have a stretch of adenylate residues at their 3'-end, which is known as the poly(A)-tail. This tail is formed with the help of poly(A) polymerase, which adds 150-200 adenylate residues after the pre-mRNA gets cleaved at a defined site. Sequence hexamers such as ‘AAUAAA’ have been shown to be absolutely required as polyadenylation signals (PAS) of eukaryotic protein-coding genes135.   51  AATAAA AAAAAG AATATA TATAAA AGTAAA AAGAAA CATAAA AATACA AATAGA ATTAAA ACTAAA GATAAA AATGAA TTTAAA AAAACA GGGGCT'  Table 3.1: Sequence hexamers considered for detection of cleavage site163 These sequence signals direct the binding of protein factors to the pre-mRNA. We exploit this knowledge in our filtering process. We aligned our set of transcripts against the R. catesbeiana draft genome (Refer section 2.4) using GMAP v2016-05-0111 to obtain a BAM alignment file. GMAP is a standalone program for integrated genomic mapping and alignment of cDNA. The alignment file was fed to an in-house polyA tail detection pipeline. The pipeline starts by looking for the presence of perfect polyA tails in assembled transcript sequences (or polyT heads in reverse-complement sequences, when using non-strand specific RNA-Seq protocols) without any 52  non-AT characters. It then searches for a cleavage site by interrogating a 50bp window upstream of it to look for a hexamer evidence (Table 3.1).  In the case of a polyA tail, it looks for hexamer on the positive strand of the contig, while in the case of a polyT head, the reverse-complemented sequence is interrogated. Contigs containing a polyA/T tail/head and a hexamer motif corresponding to a cleavage site were selected for further analysis.  3.2.8. Similarity search with NCBI nr and RefSeq90 databases using Blastx   We did a six-frame translation of our nucleotide sequences, and queried them against Uniref90164 and NCBI's RefSeq databases using the Blastx62 program from NCBI’s BLAST+ (v2.4.0) software package. We discarded all contigs that returned a hit to any sequence in these databases at an e-value less than 10-5.  3.2.9. Assessing coding potential of contigs   The Coding Potential Calculator (CPC)156 is a Support Vector Machine (SVM) to evaluate the protein-coding potential of a transcript. It based on six biologically meaningful sequence features, and can efficiently discriminate between coding and noncoding transcripts, while returning a positive CPC score for coding contigs, and a negative CPC score for non-coding contigs. We assessed the coding potential of our contigs with CPC v0.9-r2, and filtered out any contig that returned a positive CPC score of greater than 1.   53  3.2.10. Searching protein family databases for sequence homologs  We used HMMScan v3.1b2 from the HMMER package165 to search our contigs of interest against Pfam166, which is a curated protein families database, each of which is represented by hidden Markov models (HMMs) and multiple sequence alignments. We did a 3-frame translation on our contigs of interest prior to this step and later on removed all contigs that returned a hit against known protein domains irrespective of a threshold.   3.2.11. Finding sequence homologs   We constructed a comprehensive amphibian transcriptome shotgun assembly database (CATSA) by downloading and combining nucleotide sequences for 14 amphibian species (Table 3.2) from the NCBI- Genbank-Transcriptome Shotgun Assembly Sequence (TSA) database159. We created a BLAST database from this set of sequences to carry out sequence similarity searches.   Species  TSA size (MB) 1 Ambystoma mexicanum 4.2 2 Hynobius chinensis  97 3 Microhyla fissipes 85 4 Pseudacris regilla 36 5 Rhacophorus omeimontis 39 6 Pseudacris regilla 36 7 Hynobius retardus 445 8 Odorrana margaretae 41 9 Rana clamitans 37 10 R pipiens 886 11 Leptobrachium boringii 45 54  Table 3.2: List of species selected to construct the CATSA database We interrogated our putative lncRNA contig set against this CATSA database for homologs that could add confidence to our set. We also did a similarity search against lncRNA sequences present in lncRNA databases such as lncRNADB141 and LNCipedia142.  3.2.12. Differential expression analysis   As an example of the utility of the draft genome assembly and high confidence gene predictions, 100 bp paired-end RNA-Seq reads from six pre-metamorphic R. catesbeiana tadpoles exposed to 10 nM 3,3',5-triiodo-L-thyronine (T3) or dilute sodium hydroxide vehicle control for 48 h (three animals per condition) were used to characterize the T3-induced metamorphic gene expression program in the back skin. The reads were aligned to the draft genome using STAR167, and read counts per transcript were quantified using HTSeq168. Differential expression in response to T3 treatment was assessed using the DESeq2 v1.14 software package169. 3.2.13. Quantitative real-time polymerase (qPCR) validation  qPCR is a sensitive gene quantification method, based on the polymerase chain reaction (PCR). It has been demonstrated as a method to amplify and validate putative lncRNA170. We validated  Species  TSA size (MB) 12 Pelophylax nigromaculatus 47 13 Tylototriton wenxianensis 87 14 Bufotes viridis 45 15 Megophrys  45 16 Polypedates megacephalus 53 17 Rhacophorus dennysi 53 55  our shortlisted candidates using qPCR, while following the technique described in Veldhoen et al., (2014)171.   3.3. Results  3.3.1. Compiling a putative ncRNA set  As shown in Figure 3.1, we started with 794,291 BART Sr. and 1,341,707 BART Jr. contigs. We filtered out 183,264 contigs from BART Sr. that returned complete or incomplete ORF predictions with Transdecoder. Redundant contigs in BART Sr. and BART Jr. were removed using CD-HIT-EST with a similarity cut-off of 0.99. This reduced the sets to 495,543 and 638,289 contigs in BART Sr. and BART Jr., respectively. These sets were combined into the ‘Merged Set’. Contigs from this merged set were aligned against the draft genome using GMAP, and a BAM alignment file was obtained. A custom polyA tail detection pipeline used this BAM file to detect contigs with polyA tails and hexamer evidence. This step drastically reduced the number of contigs to 9,624. These contigs were aligned against protein sequences from nr and RefSeq90 databases, and 2,080 contigs returning significant hits at an e-value less than 10-5 were discarded. The presence of ORFs in contigs belonging to BART Jr. in the remaining set was assessed using Transdecoder, and four contigs with any predicted ORFs were omitted. A 3-frame translation was done on the remaining contigs, followed by a run of HMM-Scan, which removed 537 contigs with known Pfam protein domains. This final set of putative ncRNA had 7,003 contigs. We evaluated this final set with Coding Potential Calculator and found that all except eight contigs showed no coding potential. The authors of CPC describe sequences that return a score between -1 and 1 as ‘weak noncoding’ 56  or ‘weak coding’. Taking that into consideration, these eight contigs were not discarded as they returned pretty low coding potential scores, and could be possibly non-coding. Contig ID Length Prediction Score CCH-0005-C_R2757339 808 coding 0.01 CCH-0008-C_R2736307 2721 coding 0.16 CCH-0008-T_S4181866 1057 coding 0.66 CCH-0010-C_J2948646 3036 coding 0.28 CCH-0010-C_R2934329 2151 coding 0.32 CCH-0011-T_S662707 2603 coding 0.18 CCH-BS01-R14407931 1541 coding 0.51 CCH-0011-C_R5449921 459 coding 0.02  Table 3.3: Contigs with positive but low CPC scores 57          Figure 3.1: LncRNA detection pipeline, with number of contigs at each step 58  A FASTA file of putative lncRNA candidates is available on ftp.bcgsc.ca/supplementary/bullfrog. As seen in Table 3.4, the putative ncRNA contigs have a mean length of 1,362 base pairs, while the minimum and maximum contig size was 200 and 10,978 respectively. Figure 3.2 shows the length distribution for the putative lncRNA contigs. The first peak depicts putative lncRNAs from BART Jr., while the second peak depicts putative lncRNAs from BART Sr. Statistic Min. 1st Quartile Median Mean 3rd Quartile Max. Length (bp) 200 612 956 1362 1688 10978  Table 3.4: Length statistics for putative lncRNA candidates  Figure 3.2: Length distribution for the 7,003 putative lncRNA candidates. The first peak represents sequences from BART Jr., while the second peak represents sequences from BART Sr.  59  3.3.2. Finalising qPCR candidates  The top 15 hits against CATSA and top 8 hits against lncRNA databases are shown in tables 3.5 and 3.6 respectively. Most hits against CATSA were from Rana clamitans and Rana pipiens, while those against lncRNA databases were from human. All hits were obtained at an identity threshold of 90% or more and a query coverage of 95% or more. Candidates with counts greater than zero in at least one sample were evaluated for differential expression. Of the 15 candidate ncRNAs that gave hits against CATSA, 11 had sufficient expression to be evaluated for differential expression in the replicate back skin set with DESeq2. Of these 11, five were differentially expressed with an adjusted p-value (padj) score of less than 0.05 (Table 3.9). None of the contigs that returned hits against lncRNA databases had sufficient expression to be evaluated for differential expression in this sample set. For additional targets, the top 20 most significant ncRNA candidates (in the back skin replicate set) were identified (Table 3.7), and transcripts with read counts more than 100 in every sample were prioritized to maximize the likelihood of detection by qPCR. It is to be noted that read counts are considered to be directly proportional to the level of expression. Additionally, three candidates (Table 3.8) were differentially expressed in both the low temperature and temperature shifted samples, and were also included in the validation set.   60  Table 3.5: Top fifteen Blastx hits against CATSA. QLEN is query length, SLEN is subject length, P-IDENT is percentage identity between subject and query and QCOV is query coverage. CCH-0008-C_S2790810 Lnc-AL669831.1-12:1 686 694 0 1194 98.10 100 CCH-BS01-C-R16152797 Lnc-RNF125-5:1  293 286 1.75E-117 422 93.66 96 CCH-BS01-C-R16155137 lnc-RNF125-5:1 289 286 3.70E-119 427 94.01 97 CCH-BS01-C-R16164401 Lnc-AC006946.15.1-9:1  290 302 1.75E-107 388 91.25 99 CCH-BS01-C-R16170468 Lnc-TTC26-5:1  226 244 2.99E-79 294 90.66 99 CCH-BS01-C-R16203240 Lnc-AC006946.15.1-9:1  296 302 8.35E-106 383 90.60 98 CCH-BS01-C-R16236888 lnc-ISX-10:1 202 204 5.78E-66 250 90.52 94 CCH-BS01-C-R16342124 lnc-KCNC2-8:1 276 263 5.98E-107 387 93.46 94 Table 3.6: Top eight Blastx hits against known lncRNA databaseQUERY SUBJECT QLEN SLEN E-VALUE BIT-SCORE P-IDENT QCOVS CCH-0004-C_S800061 Rana_pipiens_Transcript_327343 571 575 0 1026 99.64 98 CCH-0004-C_S800061 Rana_pipiens_Transcript_327342 571 580 0 1026 99.64 98 CCH-0007-C_R6298938 Rana_pipiens_Transcript_247695 1581 1585 0 2161 91.64 99 CCH-0008-T_S2649095 gb|GEGH01040424.1| 744 736 0 1090 93.64 98 CCH-0010-C_S1300224 Rana_pipiens_Transcript_245790 525 521 0 675 91.19 96 CCH-OB01-C_S786189 gi|451248444|gb|GAEG01013685.1| 699 709 0 1203 98.12 99 CCH-BS01-J16253803 gi|451247309|gb|GAEG01014253.1| 702 696 0 1227 98.70 98 CCH-BS01-R14437108 gi|451246446|gb|GAEG01014684.1| 665 685 0 1074 96.09 98 CCH-BS01-S16233932 gi|451257896|gb|GAEG01008979.1| 861 863 0 1343 95.01 99 CCH-BS01-S16305978 gb|GEGH01061427.1| 753 769 0 1068 92.35 99 CCH-BS01-S16500118 gb|GEGJ01020568.1| 933 917 0 1406 94.66 98 CCH-0004-T_S3276190 gb|GEGJ01017879.1| 496 491 0 719 93.13 99 CCH-0007-C_S6361122 Rana_pipiens_Transcript_490761 289 295 3.28E-106 388 92.39 94 CCH-0007-C_S6361122 Rana_pipiens_Transcript_490760 289 288 1.18E-105 387 92.33 94 CCH-0007-C_S6448949 gi|451214853|gb|GAEG01030472.1| 489 471 2.42E-179 632 91.13 96 CCH-0008-C_S2481118 Rana_pipiens_Transcript_302324 278 266 3.14E-106 388 93.28 95 CCH-0011-C_S5238300 Rana_pipiens_Transcript_194512 366 370 1.45E-140 503 91.55 100 CCH-0011-C_S5238300 Rana_pipiens_Transcript_194516 366 353 8.81E-133 477 91.42 95 61                      Table 3.7: Top twenty most significant ncRNA candidates (in the back skin replicate set), transcripts with read counts greater than 100 in every sample is bolded. Fold change between conditions and adjusted p-value can also be seen for each candidate. log2 fold change between the groups shows the change in expression. For example, a log2 fold change of 2 means that the expression has increased 4-folds. TranscriptID Fold Change  lowpadj lowFold Change, shift padj, shift CCH-0012-T_R4358319 4.78 0.00 7.53 7.74e-07 CCH-0011-T_J6410177 4.36 4.55e-07 7.54 1.17e-08 CCH-0007-T_R6862455 3.84 0.00 3.60 7.74e-06 Table 3.8: Three other candidates that were differentially expressed in both the low temperature and temperature shifted samples Transcript ID Fold Change padj CCH-0012-C_R4911685 1138.05 4.13E-121 CCH-0008-T_R4049217 30.87 5.56E-119 CCH-0012-T_R4358319 36.91 3.27E-52 CCH-0010-T_S3072019 8.22 6.03E-43 CCH-0013-C_R4276172 6.30 1.05E-42 CCH-0010-T_S3113827 19.12 8.69E-41 CCH-0010-C_R2979324 9.09 9.80E-40 CCH-BS01-R14415949 5.42 1.42E-35 CCH-0011-C_J5528159 4.88 3.91E-35 CCH-BS01-S14820675 19.91 4.09E-35 CCH-0005-T_J5938149 0.20 2.61E-34 CCH-0008-T_R4204331 8.70 6.41E-34 CCH-0012-T_S4563862 9.84 6.84E-34 CCH-0008-C_S2758172 0.26 5.57E-32 CCH-0007-C_R6676911 3.12 6.94E-31 CCH-0008-C_S2772034 5.87 4.99E-29 CCH-0009-C_S3618443 9.15 3.32E-26 CCH-0008-C_S2862211 0.20 1.35E-24 CCH-BS01-S16297543 0.05 4.47E-24 CCH-BS01-R14422264 0.19 7.36E-23 62  TranscriptID FoldChange pvalue padj CCH-BS01-J16253803 2.60 3.30e-16 2.03e-14 CCH-0008-C_S2481118 0.95 0.91 0.95 CCH-BS01-S16233932 1.28 0.39 NA CCH-0004-T_S3276190 0.75 0.53 0.69 CCH-0007-C_S6448949 2.19 0.01 0.02 CCH-BS01-S16500118 3.50 3.30e-21 4.06e-19 CCH-0010-C_S1300224 0.82 0.67 NA CCH-0008-T_S2649095 2.49 0.00 0.01 CCH-BS01-R14437108 0.57 2.06e-05 0.00 CCH-0007-C_R6298938 1.12 0.53 0.07 CCH-0004-C_S800061 1.03 0.89 0.94 Table 3.9: Differential expression analysis results for CATSA and lncRNA databases hits  In summary, the set of 20 putative ncRNA that were selected as candidates for qPCR validation are (Figure 3.3):  Five differentially expressed (DE) ncRNAs with hits to the amphibian TSAs  Ten ncRNAs bolded in Table 3.7 of ncRNAs with “most significant p-value”  Two ncNRAs that were consistently DE in the temperature shift samples   Three other ncRNAs from the “most significant p-value” table, but with less robust counts  63  3.3.3. qPCR validation of putative ncRNA transcripts  Out of these 20 candidates, our collaborators at the University of Victoria ran qPCR on 7 candidates and found that 6 out of 7 candidates showed amplification. (Table 3.10). Details of qPCR primers can be found in Appendix B. CCH-0005-T_J5938149 CCH-BS01-R14422264 CCH-0012-C_R4911685 CCH-0008-T_R4049217 CCH-0012-T_R4358319 CCH-0007-T_R6862455 CCH-0011-T_J6410177  Table 3.10: lncRNA candidates for which qPCR was run. The bolded one did not amplify.  Figure 3.3: Finalised qPCR candidates with selection criteria  64  3.4. Discussion and conclusions  The identification of non-coding RNA in non-model organisms poses a major challenge due to the lack of reliable transcript models and annotations. We developed a computational pipeline to predict putative lncRNAs in the transcriptome of R. catesbeiana. We used a staged subtractive approach with different strategies to remove coding contigs and reduce our set. This includes predicting coding potentials and open reading frames; running sequence similarity searches with known coding protein sequences and motifs; evaluating contigs through SVM algorithms such as CPC. We further refined our set by selecting contigs with PolyA tails and sequence hexamers. We interrogated our final set for sequences that shared some level of homology with known lncRNAs and amphibian transcriptome assemblies. We selected 20 candidates from our final set for validation through qPCR. Out of these 20 candidates, we ran qPCR on 7 candidates and found that 6 out of 7 candidates showed amplification.  3.4.1. Exploring the biological significance of lncRNAs  Although many studies have worked on developing and optimising lncRNA identification pipelines using RNA-Seq datasets, more research is required to infer their biological and pathological relevance. In our study, we have tried comparative transcriptomics as a method to detect sequence homology, and have prioritised lncRNA candidates for future elucidation of their functionality. Here, we discuss recent approaches that have proved useful in shedding light on the biological function of lncRNA. A recent study106 by Kashi et al., (2016) explores ways to gain biological insights into novel putative lncRNA candidates. A simple approach to study the physiological role of lncRNAs in the cell is by perturbing the expression of target RNAs, either by 65  knockdown or by overexpression172. Fluorescence in situ hybridization (FISH) can help in visualizing the subcellular localisation of a given lncRNA, and can help in exploring potential functions. A recent study173 detected Kcnq1ot1, Kcnq1, and Xist in E12.4 mouse placental sections through RNA-FISH, and showed that the epigenetic inactivation of the inside genes was caused due to a domain formed by Kcnq1ot1. Some studies have shown the role of lncRNAs in regulating genes and have unravelled their interactions with proteins140. Certain snoRNAs have been shown to form regulatory protein complexes known as ribonucleoprotein particles88, so it would also be worthwhile to look for lncRNA–protein interactions. These interactions could be identified through techniques such as HITS-CLIP and PAR-CLIP, which have been demonstrated in previous studies174, 175. LncRNAs that are thought to regulate gene expression are mainly present on the chromatin176. This could be either indirectly through a protein intermediate or directly through hybridisation and triple helix formation177,178. Technologies such as RAP179, ChIRP and CHART180 can help find associations between chromatin and an RNA of interest181. With rapidly increasing appreciation of the biological roles of lncRNAs and advances in lncRNA function validation techniques, our subtractive pipeline and compiled list of putative R. catesbeiana lncRNAs available on the FTP site will facilitate future studies to explore the role of lncRNAs.      66  4. Discussion and conclusions  4.1 Challenges in establishing new model species as genomic resources  We have entered an age where virtually any living being can go genomic, owing to which there has been a phenomenal increase in the sequencing of non-model organisms in the last few decades. Figure 4.1 demonstrates this increase by showing the number of bases that have been released by GENBANK for the last decade.   Figure 4.1: Increase in the number of bases in each release of GENBANK   NGS allows for profiling variation in nucleotides and discovering genetic markers on a very large scale, that could help in unravelling information regarding proteins, metabolites, disease,  inheritance and other regulatory molecules113. Normally, a good genome assembly is required before taking on such ventures. Assembling genomes of non-model organisms that lack a closely related genome is a real conundrum. It is difficult to attain a genome or transcriptome assembly 67  with a contiguity of the classical shotgun assemblies of established model organisms. Besides, extracting information from a sea of A-T-G-Cs through de novo annotation is not a trivial task. Taking this into consideration, many recent studies are increasingly aiming to analyse NGS datasets without attempting a full genome assembly. For example a study on the mammoth genome followed this approach182, in which researchers used Sanger sequencing data and genome sequencing data of the African Elephant (Loxodonta africana) to annotate and analyse sequence reads. In another study183, investigators sequenced the Great tit (Parus major) genome, and used it for assembling 2.5% of its genome, while using a strategy that involved a reduced representation library for increasing coverage of the sequenced portion. This was also used for downstream SNP discovery. A more usual application of NGS in non-model organisms is transcriptome characterisation and annotation. Most bioinformatics applications available for such task either rely on draft genome assemblies of the same or closely related species, for mapping reads. Recently, de novo assembly methods such as Trans-ABySS, Trinity have been successful to a great extent in assembling the transcriptome of novel species. However, there is a crucial need to develop de novo transcriptome annotation tools and pipelines which can reduce reliance on reference resources and can serve as an initiating point for thorough functional characterization. We developed ChopStitch, a method for finding putative exons and constructing splice graphs using transcriptome assembly and whole genome sequencing data as inputs. The tool identifies exon-exon boundaries in de novo assembled transcripts with the help of a Bloom filter that represents the k-mer spectrum of genomics reads. The tool has been developed while considering the biology behind alternative splicing, RNA-editing, SNPs, and sequencing reads and does not rely on any type of genomic reference resources such as reference genome assemblies, transcript 68  models. We have tested our method on characterizing the Caenorhabditis elegans (C. elegans) and Homo sapiens (H. sapiens) transcriptome, using publicly available RNA-Seq and whole genome shotgun sequencing data. We compared our method with LEMONS, Cufflinks and StringTie and found that Chop-Stitch outperforms these state-of-the-art methods for finding exon-exon junctions with and without the help of a reference genome. ChopStitch detected putative exons at almost double precision and sensitivity as compared to its competitors. The memory and running time were also comparable. StringTie was the fastest with the lowest memory consumption, while Lemons was the slowest with the highest memory consumption. Trailing k-mers from either ends of a putative exon are interrogated with similar k-mers from other exons to find overlaps, based on which they are clubbed together in a single node, and a splice graph is constructed. We have applied our method to annotate the transcriptome of R. catesbeina. We have used the R. catesbeina draft assembly and error-corrected BART transcriptome assemblies as inputs to ChopStitch. This was done so as to minimise the single base differences due to SNPs, RNA editing events or sequencing errors. We evaluated our obtained set of putative exons against MAKER2 annotations. MAKER2 had annotated 1,318,334 putative exons with the help of available evidences such as BART. Further downstream analysis as described in section 2.4 reduced the MAKER2 putative exon set to 77,624 and later to a non-redundant set of 72,327 putative exons. 39.4 % of these exons were similar to the exons predicted by ChopStitch and 28.7% were similar to those predicted by LEMONS. Interestingly, as seen in figure 2.9, a total of 46,372 putative exons were common between LEMONS and ChopStitch exons, out of which 27,946 were not annotated by MAKER. ChopStitch used 403,149 BART contigs and outputted 617,626 exons in a total of 402,853 BART contigs. It constructed 80,143 splice graphs from these exons, where each splice graph represents the different transcript isoforms. This also suggests, that a big portion of 69  putative exons were left unpredicted by the MAKER2 pipeline, which could be further explored, amplified and validated through wet-lab techniques such as qPCR.  LEMONS predicted a total of 337,352 putative exons which is close to half what ChopStitch predicted. LEMONS predictions are limited to shared orthologs, as it uses a sequence similarity search by BLASTX against a database of known and mapped exon-exon junctions for all of its predictions. On the other hand, ChopStitch predictions are fully independent of known genomic or transcriptomic resources. Thus, ChopStitch helps in exploring new loci for functional analysis and helps in studying novel genes. With plummeting costs of oligonucleotide synthesis, ChopStitch will expedite the design of custom-made probes for exome capture in non-model organisms.  4.2. The easily misinterpreted non-coding RNA  In the last two decades, as genome sequencing gained pace, the search for novel genes, exons, introns, splice sites, ncRNAs, ORFs, etc. has given rise to a number of computational methods. Present computational strategies for predicting lncRNAs in non-model organisms are not foolproof. For instance, many lncRNA identification studies rely on finding and removing ORF containing transcripts. As the chance for finding ORFs in a sequence increases with the length of a sequence, many ORF finding algorithms have strict length cut-offs (~300 nt). A previous study184 discovered that a 679 nucleotide RNA, which was previously thought to be a lncRNA was found to be protein coding, and had short ORFs. After close observation, they found that two ORFs did code for two small peptides of 24 and 12 amino acids respectively. In an another study185 on Drosophila, another mRNA that was thought to be a lncRNA had ORFs. Many other studies have found previously reported lncRNAs to be coding. Studies186 have found a high number of mis-70  annotated lncRNAs in different model organisms. In Zebra fish, the steroid receptor RNA activator 1 (sra1) gene was previously regarded as non-coding187 , but later studies revealed that it codes for the protein steroid receptor RNA activator protein (Srap)188. Thus, there is a crucial need to optimise lncRNA identification pipelines for novel species by amalgamating different strategies. The emergence of lncRNAs as important regulators of gene expression has instigated novel discoveries99,106. With the current research pace of this field, future studies will undoubtedly uncover additional and novel characteristics and functionalities of these biomolecules.  Though it is unusual for de novo sequencing studies to have their genomes annotated for lncRNA, recent studies147, 148, 149, 150 have contributed in enriching animal lncRNA datasets.  With BART as a composite R. catesbeiana reference transcriptome and the available draft genome assembly of the same species, we developed a pipeline to find putative lncRNA candidates in R. catesbeiana. We used a staged subtractive approach with diverse methods to remove coding contigs and reduce our set. This included predicting coding potentials and open reading frames; running sequence similarity searches with known coding protein sequences and motifs; evaluating contigs through SVM algorithms such as CPC; selecting contigs with PolyA tails and sequence hexamers; and interrogating our final set for sequences that shared some level of homology with known lncRNAs and amphibian transcriptome assemblies. We ran qPCR on 7 candidates and 6 out of 7 candidates showed positive results. Although our computational pipeline incorporates strict filtering steps for putative lncRNA selection, it is still desirable to test experimentally whether these candidates are indeed noncoding or whether they can be translated into proteins. One approach will be to test whether the transcript yields peptides when translation is carried out in vitro189, 185. Another approach would be to 71  synthesise predicted peptides and use them for antibody production that can be used for detecting putative peptides by techniques such as western blotting or immunohistochemistry. Ribosome profiling190, a technique that helps in capturing RNA regions associated with translating ribosomes could also be used to distinguish between coding and non-coding transcripts. Its application in mice191, zebrafish155, plants192, and yeast193, has shown that a good fraction of expressed lncRNAs are connected with translating ribosomes. Mass spectrometry is yet another way to detect peptides from novel genes and has been of significance in previous studies. Although we have multiple methods at our hands, there is still a scope of uncertainty with lncRNA validation. For instance in the case of in vitro translation verification, a sequence may be translated in vitro but not in vivo and vice versa. Besides, methods like immunohistochemistry or western blot may also suffer from limited sensitivity194. Thus, the toolbox for lncRNA identification is considerable and quickly growing, with many novel approaches at hand. However, it is essential to manually inspect transcripts, whether they are predicted as coding or non-coding, and validate through wet-lab techniques. This should serve as a fruitful direction of further research.         72  5. Closing remarks  The broad underlying objective of this thesis was to attempt to construct methods for de novo annotation of non-model organisms. We have developed a method for de novo identification of exon-exon junctions and splicegraph construction from RNA-Seq assembled transcripts and whole genome sequencing data. While characterising the transcriptome of H. sapiens, C. elegans, and R. catesbeina, we found that our method outperforms the available alternative methods. We have also developed a pipeline to find putative lncRNAs in the transcriptome of R. catesbeina. Our work could be further refined by validating our putative lncRNA set using the necessary wet-lab techniques as described in section 3.4 and 4.2. With rapidly increasing appreciation for de novo annotation techniques for non-model organisms and advances in wet-lab validation techniques, our research will facilitate future studies to explore the genome and transcriptome of novel species.               73  Bibliography  1. Derrien, T. et al. The GENCODE v7 catalogue of human long non-coding RNAs : Analysis of their structure , evolution and expression. Genome Res. 22, 1775–1789 (2012). 2. Nawy, T. Non–model organisms. Nat. Methods 9, 37–37 (2011). 3. Mudge, J. M. & Harrow, J. The state of play in higher eukaryote gene annotation. Nat. Publ. Gr. (2016). doi:10.1038/nrg.2016.119 4. Mäkinen, V., Salmela, L. & Ylinen, J. Normalized N50 assembly metric using gap-restricted co-linear chaining. BMC Bioinformatics 13, 255 (2012). 5. Ekblom, R. & Wolf, J. B. W. A field guide to whole-genome sequencing, assembly and annotation. Evolutionary Applications 7, 1026–1042 (2014). 6. Singer, M. F. SINEs and LINEs: Highly repeated short and long interspersed sequences in mammalian genomes. Cell 28, 433–434 (1982). 7. Xie, Y. et al. SOAPdenovo-Trans: De novo transcriptome assembly with short RNA-Seq reads. Bioinformatics 30, 1660–1666 (2014). 8. Haas, B. J. et al. De novo transcript sequence reconstruction from RNA-seq using the Trinity platform for reference generation and analysis. Nat. Protoc. 8, 1494–1512 (2013). 9. Robertson, G. et al. De novo assembly and analysis of RNA-seq data. Nat. Methods 7, 909–12 (2010). 10. Kim, D. et al. TopHat2: accurate alignment of transcriptomes in the presence of 74  insertions, deletions and gene fusions. Genome Biol. 14, R36 (2013). 11. Wu, T. D. & Watanabe, C. K. GMAP: A genomic mapping and alignment program for mRNA and EST sequences. Bioinformatics 21, 1859–1875 (2005). 12. Wu, T. D., Reeder, J., Lawrence, M., Becker, G. & Brauer, M. J. in Methods in Molecular Biology 1418, 283–334 (2016). 13. Holt, C. & Yandell, M. MAKER2: an annotation pipeline and genome-database management tool for second-generation genome projects. BMC Bioinformatics 12, 491 (2011). 14. Stanke, M. et al. AUGUSTUS: A b initio prediction of alternative transcripts. Nucleic Acids Res. 34, (2006). 15. Bromberg, Y. & Rost, B. SNAP: Predict effect of non-synonymous polymorphisms on function. Nucleic Acids Res. 35, 3823–3835 (2007). 16. Grillo, G. et al. UTRdb and UTRsite (RELEASE 2010): A collection of sequences and regulatory motifs of the untranslated regions of eukaryotic mRNAs. Nucleic Acids Res. 38, (2009). 17. Thibaud-Nissen, F., Souvorov, A., Murphy, T., DiCuccio, M. & Kitts, P. Eukaryotic Genome Annotation Pipeline. NCBI Handb. 11, 1–24 (2013). 18. Pruitt, K., Brown, G., Tatusova, T. & Maglott, D. The Reference Sequence ( RefSeq ) Database. NCBI Handb. 1–24 (2002). 19. Brown, G. R. et al. Gene: A gene-centered information resource at NCBI. Nucleic Acids Res. 43, D36–D42 (2015). 75  20. Wolfsberg, T. G. Using the NCBI map viewer to browse genomic sequence data. Current Protocols in Bioinformatics 151–1525 (2010). doi:10.1002/0471250953.bi0105s16 21. Smit, A., Hubley, R. & Green, P. RepeatMasker Open-3.0. RepeatMasker Open-3.0 www.repeatmasker.org (1996). 22. Morgulis, A., Gertz, E. M., Schäffer, A. A. & Agarwala, R. WindowMasker: Window-based masker for sequenced genomes. Bioinformatics 22, 134–141 (2006). 23. Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. Basic local alignment search tool. J. Mol. Biol. 215, 403–10 (1990). 24. Kapustin, Y., Souvorov, A., Tatusova, T. & Lipman, D. Splign: algorithms for computing spliced alignments with identification of paralogs. Biol. Direct 3, 20 (2008). 25. Aken, B. L. et al. The Ensembl Gene Annotation System. Database (Oxford). 2016, baw093 (2016). 26. Cantarel, B. L. et al. MAKER: An easy-to-use annotation pipeline designed for emerging model organism genomes. Genome Res. 18, 188–196 (2008). 27. Slater, G. S. C. & Birney, E. Automated generation of heuristics for biological sequence comparison. BMC Bioinformatics 6, 31 (2005). 28. Korf, I. Gene finding in novel genomes. BMC Bioinformatics 5, 59 (2004). 29. Burge, C. & Karlin, S. Prediction of complete gene structures in human genomic DNA. J. Mol. Biol. 268, 78–94 (1997). 30. Lander, E. S. et al. Initial sequencing and analysis of the human genome. Nature 409, 860–921 (2001). 76  31. Faustino, N. A., Cooper, T. a & Andre, N. Pre-mRNA splicing and human disease. Genes Dev. 17, 419–437 (2003). 32. Douglas, A. G. L. & Wood, M. J. A. RNA splicing: Disease and therapy. Brief. Funct. Genomics 10, 151–164 (2011). 33. Pan, Q., Shai, O., Lee, L. J., Frey, B. J. & Blencowe, B. J. Deep surveying of alternative splicing complexity in the human transcriptome by high-throughput sequencing. Nat. Genet. 40, 1413–5 (2008). 34. Romero, P. R. et al. Alternative splicing in concert with protein intrinsic disorder enables increased functional diversity in multicellular organisms. Proc. Natl. Acad. Sci. U. S. A. 103, 8390–5 (2006). 35. Matos, L. et al. Functional analysis of splicing mutations in the IDS gene and the use of antisense oligonucleotides to exploit an alternative therapy for MPS II. Biochim. Biophys. Acta - Mol. Basis Dis. 1852, 2712–2721 (2015). 36. Ward, A. J. & Cooper, T. a. The pathobiology of splicing. J. Pathol. 220, n/a-n/a (2009). 37. Proudfoot, N. J., Furger, A. & Dye, M. J. Integrating mRNA processing with transcription. Cell 108, 501–512 (2002). 38. Wagh, P. K., Peace, B. E. & Waltz, S. E. Met-Related Receptor Tyrosine Kinase Ron in Tumor Growth and Metastasis. Adv. Cancer Res. 100, 1–33 (2008). 39. Zhang, J. & Manley, J. L. Misregulation of pre-mRNA alternative splicing in cancer. Cancer Discov. 3, 1228–1237 (2013). 40. Ghigna, C. et al. Cell motility is controlled by SF2/ASF through alternative splicing of the 77  Ron protooncogene. Mol. Cell 20, 881–890 (2005). 41. Davies, H. et al. Mutations of the BRAF gene in human cancer. Nature 417, 949–954 (2002). 42. Poulikakos, P. I. et al. RAF inhibitor resistance is mediated by dimerization of aberrantly spliced BRAF(V600E). Nature 480, 387–90 (2011). 43. Coopman, P. J. et al. The Syk tyrosine kinase suppresses malignant growth of human breast cancer cells. Nature 406, 742–747 (2000). 44. Buchner, M. et al. Spleen tyrosine kinase is overexpressed and represents a potential therapeutic target in chronic lymphocytic leukemia. Cancer Res. 69, 5424–5432 (2009). 45. Luangdilok, S. et al. Syk tyrosine kinase is linked to cell motility and progression in squamous cell carcinomas of the head and neck. Cancer Res. 67, 7907–7916 (2007). 46. Xerri, L. et al. Predominant expression of the long isoform of Bcl-x (Bcl-xL) in human lymphomas. Br. J. Haematol. 92, 900–6 (1996). 47. Ma, X., Zhao, Y., Li, Y., Lu, H. & He, Y. Relevance of Bcl-x expression in different types of endometrial tissues. J. Exp. Clin. Cancer Res. 29, 14 (2010). 48. Gough, D. J., Levy, D. E., Johnstone, R. W. & Clarke, C. J. IFN?? signaling-Does it mean JAK-STAT? Cytokine and Growth Factor Reviews 19, 383–394 (2008). 49. Goldstein, D. & Laszlo, J. The role of interferon in cancer therapy: a current perspective. CA. Cancer J. Clin. 38, 258–77 (1957). 50. Du, Z. et al. Interferon-resistant Daudi cell line with a stat2 defect is resistant to apoptosis induced by chemotherapeutic agents. J. Biol. Chem. 284, 27808–27815 (2009). 78  51. Keck, P. J. et al. Vascular permeability factor, an endothelial cell mitogen related to PDGF. Science (80-. ). 246, 1309–1312 (1989). 52. Biselli-Chicote, P. M., Oliveira, A. R. C. P., Pavarino, E. C. & Goloni-Bertollo, E. M. VEGF gene alternative splicing: Pro- and anti-angiogenic isoforms in cancer. J. Cancer Res. Clin. Oncol. 138, 363–370 (2012). 53. David, C. J., Chen, M., Assanah, M., Canoll, P. & Manley, J. L. HnRNP proteins controlled by c-Myc deregulate pyruvate kinase mRNA splicing in cancer. Nature 463, 364–8 (2010). 54. Christofk, H. R. et al. The M2 splice isoform of pyruvate kinase is important for cancer metabolism and tumour growth. Nature 452, 230–233 (2008). 55. Kim, D., Langmead, B. & Salzberg, S. L. HISAT: a fast spliced aligner with low memory requirements. Nat. Methods 12, 357–60 (2015). 56. Trapnell, C. et al. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat. Biotechnol. 28, 511–515 (2010). 57. Stein, L. Generic Feature Format, Version 3. Seq. Ontol. Proj. 1–18 (2010). 58. Pertea, M. et al. StringTie enables improved reconstruction of a transcriptome from RNA-seq reads. Nat. Biotechnol. 33, 290–5 (2015). 59. Ford, L. R,  and D. R. F. Flows in networks. Journal of the Franklin Institute 275, 152 (1963). 60. Zimin, A. V. et al. The MaSuRCA genome assembler. Bioinformatics 29, 2669–2677 79  (2013). 61. Hasan, S. & Wheat, C. W. CEPiNS: Conserved Exon Prediction in Novel Species. Bioinformation 9, 210–1 (2013). 62. Camacho, C. et al. BLAST plus: architecture and applications. BMC Bioinformatics 10, 1 (2009). 63. Wheelan, S. J., Church, D. M. & Ostell, J. M. Spidey: A tool for mRNA-to-genomic alignments. Genome Res. 11, 1952–1957 (2001). 64. Levin, L. et al. LEMONS - A tool for the identification of splice junctions in transcriptomes of organisms lacking reference genomes. PLoS One 10, 1–15 (2015). 65. Stanke, M. & Waack, S. Gene prediction with a hidden Markov model and a new intron submodel. Bioinformatics 19, ii215-ii225 (2003). 66. Reese, M. G., Kulp, D., Tammana, H. & Haussler, D. Genie--gene finding in Drosophila melanogaster. Genome Res. 10, 529–38 (2000). 67. Eddy, S. R. What is a hidden Markov model? Nat. Biotechnol. 22, 1315–1316 (2004). 68. Stanke, M., Diekhans, M., Baertsch, R. & Haussler, D. Using native and syntenically mapped cDNA alignments to improve de novo gene finding. Bioinformatics 24, 637–644 (2008). 69. Rogers, M. F., Thomas, J., Reddy, A. S. & Ben-Hur, A. SpliceGrapher: detecting patterns of alternative splicing from RNA-Seq data in the context of gene models and EST data. Genome Biol. 13, R4 (2012). 70. Chang, Z. et al. Bridger: a new framework for de novo transcriptome assembly using 80  RNA-seq data. Genome Biol. 16, 30 (2015). 71. Liu, J., Yu, T., Jiang, T. & Li, G. TransComb: genome-guided transcriptome assembly via combing junctions in splicing graphs. Genome Biol. 17, 213 (2016). 72. Yang, E.-W. & Jiang, T. SDEAP: A Splice Graph Based Differential Transcript Expression Analysis Tool for Population Data. Bioinformatics btw513 (2016). doi:10.1093/bioinformatics/btw513 73. Yang, V. W., Lerner, M. R., Steitz, J. A. & Flint, S. J. A small nuclear ribonucleoprotein is required for splicing of adenoviral early RNA sequences. Proc. Natl. Acad. Sci. U. S. A. 78, 1371–5 (1981). 74. Stark, B. C., Kole, R., Bowman, E. J. & Altman, S. Ribonuclease P: an enzyme with an essential RNA component. Proc. Natl. Acad. Sci. U. S. A. 75, 3717–21 (1978). 75. Brockdorff, N. et al. The product of the mouse Xist gene is a 15 kb inactive X-specific transcript containing no conserved ORF and located in the nucleus. Cell 71, 515–526 (1992). 76. ENCODE Consortium. Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature 447, 799–816 (2007). 77. Managadze, D., Rogozin, I. B., Chernikova, D., Shabalina, S. A. & Koonin, E. V. Negative correlation between expression level and evolutionary rate of long intergenic noncoding RNAs. Genome Biol Evol 3, 1390–1404 (2011). 78. Palazzo, A. F. & Lee, E. S. Non-coding RNA: What is functional and what is junk? Front. Genet. 5, (2015). 81  79. Butcher, S. E. & Brow, D. a. Towards understanding the catalytic core structure of the spliceosome. Biochem. Soc. Trans. 33, 447–449 (2005). 80. Steitz, T. A. & Moore, P. B. RNA, the first macromolecular catalyst: The ribosome is a ribozyme. Trends in Biochemical Sciences 28, 411–418 (2003). 81. Kwek, K. Y. et al. U1 snRNA associates with TFIIH and regulates transcriptional initiation. Nat. Struct. Biol. 9, 800–805 (2002). 82. Nagai, K. et al. Structure, function and evolution of the signal recognition particle. EMBO J. 22, 3479–3485 (2003). 83. Vulliamy, T. et al. The RNA component of telomerase is mutated in autosomal dominant dyskeratosis congenita. Nature 413, 432–5 (2001). 84. Meier, U. T. The many facets of H/ACA ribonucleoproteins. Chromosoma 114, 1–14 (2005). 85. Goodenbour, J. M. & Pan, T. Diversity of tRNA genes in eukaryotes. Nucleic Acids Res. 34, 6137–6146 (2006). 86. Will, C. L. & Lührmann, R. Spliceosome structure and function. Cold Spring Harb. Perspect. Biol. 3, 1–2 (2011). 87. Zimmerly, S. et al. A group II intron RNA is a catalytic component of a DNA endonuclease involved in intron mobility. Cell 83, 529–538 (1995). 88. Eddy, S. R. Non-Coding Rna Genes and the Modern Rna World. Nat. Rev. Genet. 2, 919–929 (2001). 89. Mattick, J. S. & Makunin, I. V. Non-coding RNA. Human molecular genetics 15 Spec 82  No, (2006). 90. Ponting, C. P., Oliver, P. L. & Reik, W. Evolution and Functions of Long Noncoding RNAs. Cell 136, 629–641 (2009). 91. Canese, K. & Weis, S. PubMed: The bibliographic database. NCBI Handb. (2013). 92. Quinn, J. J. & Chang, H. Y. Unique features of long non-coding RNA biogenesis and function. Nat. Rev. Genet. 17, 47–62 (2015). 93. Li, J., Xuan, Z. & Liu, C. Long non-coding RNAs and complex human diseases. International Journal of Molecular Sciences 14, 18790–18808 (2013). 94. Geisler, S. & Coller, J. RNA in unexpected places: long non-coding RNA functions in diverse cellular contexts. Nat. Rev. Mol. Cell Biol. 14, 699–712 (2013). 95. Gupta, R. a et al. Long non-coding RNA HOTAIR reprograms chromatin state to promote cancer metastasis. Nature 464, 1071–6 (2010). 96. Kanduri, C. Kcnq1ot1: A chromatin regulatory RNA. Seminars in Cell and Developmental Biology 22, 343–350 (2011). 97. Pasmant, E., Sabbagh, A., Vidaud, M. & Bièche, I. ANRIL, a long, noncoding RNA, is an unexpected major hotspot in GWAS. FASEB J. 25, 444–448 (2011). 98. Faghihi, M. A. & Wahlestedt, C. Regulatory roles of natural antisense transcripts. Nat. Rev. Mol. Cell Biol. 10, 637–43 (2009). 99. Szcześniak, M. W. & Makałowska, I. lncRNA-RNA Interactions across the Human Transcriptome. PLoS One 11, e0150353 (2016). 83  100. Guo, X. et al. Advances in long noncoding RNAs: Identification, structure prediction and function annotation. Briefings in Functional Genomics 15, 38–46 (2016). 101. Yin, Q. F. et al. Long Noncoding RNAs with snoRNA Ends. Mol. Cell 48, 219–230 (2012). 102. Iwakiri, J., Hamada, M. & Asai, K. Bioinformatics tools for lncRNA research. Biochimica et Biophysica Acta - Gene Regulatory Mechanisms 1859, 23–30 (2016). 103. Achawanantakun, R., Chen, J., Sun, Y. & Zhang, Y. LncRNA-ID: Long non-coding RNA IDentification using balanced random forests. Bioinformatics 31, 3897–3905 (2015). 104. Jia, H. et al. Genome-wide computational identification and manual annotation of human long noncoding RNA genes. RNA 16, 1478–1487 (2010). 105. Yotsukura, S., duVerle, D., Hancock, T., Natsume-Kitatani, Y. & Mamitsuka, H. Computational recognition for long non-coding RNA (lncRNA): Software and databases. Brief. Bioinform. bbv114 (2016). doi:10.1093/bib/bbv114 106. Kashi, K., Henderson, L., Bonetti, A. & Carninci, P. Discovery and functional analysis of lncRNAs: Methodologies to investigate an uncharacterized transcriptome. Biochim. Biophys. Acta - Gene Regul. Mech. 1859, 3–15 (2016). 107. Sun, L., Liu, H., Zhang, L. & Meng, J. lncRScan-SVM: A Tool for Predicting Long Non-Coding RNAs Using Support Vector Machine. PLoS One 10, e0139654 (2015). 108. Sun, K. et al. iSeeRNA: identification of long intergenic non-coding RNA transcripts from transcriptome sequencing data. BMC Genomics 14 Suppl 2, S7 (2013). 109. Jiang, Q. et al. LncRNA2Function: a comprehensive resource for functional investigation 84  of human lncRNAs based on RNA-seq data. BMC Genomics 16, S2 (2015). 110. Kornblihtt, A. R. et al. Alternative splicing: a pivotal step between eukaryotic transcription and translation. Nat. Rev. Mol. Cell Biol. 14, 153–65 (2013). 111. Ellegren, H. Genome sequencing and population genomics in non-model organisms. Trends Ecol. Evol. 29, 51–63 (2014). 112. da Fonseca, R. R. et al. Next-generation biology: Sequencing and data analysis approaches for non-model organisms. Mar. Genomics (2016). doi:10.1016/j.margen.2016.04.012 113. Birol, I. et al. De novo transcriptome assemblies of Rana (Lithobates) catesbeiana and Xenopus laevis tadpole livers for comparative genomics without reference genomes. PLoS One 10, 1–18 (2015). 114. Wang, Z., Gerstein, M. & Snyder, M. RNA-Seq: a revolutionary tool for transcriptomics. Nat. Rev. Genet. 10, 57–63 (2009). 115. Conesa, A. et al. A survey of best practices for RNA-seq data analysis. Genome Biol. 17, 13 (2016). 116. Birol, I. et al. De novo transcriptome assembly with ABySS. Bioinformatics 25, 2872–2877 (2009). 117. Lin, Y. et al. Comparative studies of de novo assembly tools for next-generation sequencing technologies. Bioinformatics 27, 2031–2037 (2011). 118. Simpson, J. T., Wong, K., Jackman, S. D., Schein, J. E. & Jones, S. J. M. ABySS : A parallel assembler for short read sequence data ABySS : A parallel assembler for short 85  read sequence data. Genome Res. 19, 1117–1123 (2009). 119. Kent, W. J. BLAT - The BLAST-like alignment tool. Genome Res. 12, 656–664 (2002). 120. Mohamadi, H., Chu, J., Vandervalk, B. P. & Birol, I. ntHash: recursive nucleotide hashing. Bioinformatics btw397 (2016). doi:10.1093/bioinformatics/btw397 121. Salikhov, K., Sacomoto, G. & Kucherov, G. Using cascading bloom filters to improve the memory usage for de Brujin graphs. in Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 8126 LNBI, 364–376 (2013). 122. Mutaf, P. & Castelluccia, C. Compact neighbor discovery (A bandwidth defense through bandwidth optimization). in Proceedings - IEEE INFOCOM 4, 2711–2719 (2005). 123. Jackman, S. D. et al. ABySS 2.0: Resource-Efficient Assembly of Large Genomes using a Bloom Filter. bioRxiv 50, 68338 (2016). 124. Vandervalk, B. P. et al. Konnector v2.0: pseudo-long reads from paired-end sequencing data. From IEEE Int. Conf. Bioinforma. Biomed. 8, 2–5 (2014). 125. Paulino, D. et al. Sealer: a scalable gap-closing application for finishing draft genomes. BMC Bioinformatics 16, 230 (2015). 126. Gurevich, A., Saveliev, V., Vyahhi, N. & Tesler, G. QUAST: Quality assessment tool for genome assemblies. Bioinformatics 29, 1072–1075 (2013). 127. Ellson, J., Gansner, E. R., Koutsofios, E., North, S. C. & Woodhull, G. Graphviz and Dynagraph -- Static and Dynamic Graph Drawing Tools. Graph Draw. Softw. 127–148 (2004). doi:10.1007/978-3-642-18638-7_6 86  128. Bastian, M., Heymann, S. & Jacomy, M. Gephi: An Open Source Software for Exploring and Manipulating Networks. Third Int. AAAI Conf. Weblogs Soc. Media 361–362 (2009). doi:10.1136/qshc.2004.010033 129. Sasaki, Y. The truth of the F-measure. Teach Tutor mater 1–5 (2007). 130. Zhu, Q. H. & Wang, M. B. Molecular functions of long non-coding RNAs in plants. Genes 3, 176–190 (2012). 131. Guttman, M. et al. Chromatin signature reveals over a thousand highly conserved large non-coding RNAs in mammals. Nature 458, 223–7 (2009). 132. Ørom, U. A. et al. Long noncoding RNAs with enhancer-like function in human cells. Cell 143, 46–58 (2010). 133. Necsulea, A. et al. The evolution of lncRNA repertoires and expression patterns in tetrapods. Nature 505, 635–40 (2014). 134. Aprea, J. & Calegari, F. Long non-coding RNAs in corticogenesis: deciphering the non-coding code of the brain. EMBO J. 34, 2865–2884 (2015). 135. Thakur, N. et al. An antisense RNA regulates the bidirectional silencing property of the Kcnq1 imprinting control region. Mol Cell Biol 24, 7855–7862 (2004). 136. Wang, H. et al. Dendritic BC1 RNA in translational control mechanisms. J. Cell Biol. 171, 811–821 (2005). 137. Yan, M. et al. Identification and characterization of a novel gene Saf transcribed from the opposite strand of Fas. 14, 1465–1474 (2005). 138. Feng, J. et al. The Evf-2 noncoding RNA is transcribed from the Dlx-5/6 ultraconserved 87  region and functions as a Dlx-2 transcriptional coactivator. Genes Dev. 20, 1470–1484 (2006). 139. Willingham, A. T. A Strategy for Probing the Function of Noncoding RNAs Finds a Repressor of NFAT. Science (80-. ). 309, 1570–1573 (2005). 140. Rinn, J. L. & Chang, H. Y. Genome regulation by long noncoding RNAs. Annu. Rev. Biochem. 81, 145–166 (2012). 141. Quek, X. C. et al. lncRNAdb v2.0: Expanding the reference database for functional long noncoding RNAs. Nucleic Acids Res. 43, D168–D173 (2015). 142. Volders, P. J. et al. An update on LNCipedia: A database for annotated human lncRNA sequences. Nucleic Acids Res. 43, D174–D180 (2015). 143. Bu, D. et al. NONCODE v3.0: Integrative annotation of long noncoding RNAs. Nucleic Acids Res. 40, 210–215 (2012). 144. Huang, W., Long, N. & Khatib, H. Genome-wide identification and initial characterization of bovine long non-coding RNAs from EST data. Anim. Genet. 43, 674–682 (2012). 145. Billerey, C. et al. Identification of large intergenic non-coding RNAs in bovine muscle using next-generation transcriptomic sequencing. BMC Genomics 15, 499 (2014). 146. Weikard, R., Hadlich, F. & Kuehn, C. Identification of novel transcripts and noncoding RNAs in bovine skin by deep next generation sequencing. BMC Genomics 14, 789 (2013). 147. Li, T. et al. Identification of long non-protein coding RNAs in chicken skeletal muscle using next generation sequencing. Genomics 99, 292–298 (2012). 148. Zhao, W. et al. Systematic identification and characterization of long intergenic non-88  coding RNAs in fetal porcine skeletal muscle development. Sci. Rep. 5, 8957 (2015). 149. Etebari, K., Furlong, M. J. & Asgari, S. Genome wide discovery of long intergenic non-coding RNAs in Diamondback moth (Plutella xylostella) and their expression in insecticide resistant strains. Sci. Rep. 5, 14642 (2015). 150. Ren, H. et al. Genome-wide analysis of long non-coding RNAs at early stage of skin pigmentation in goats (Capra hircus). BMC Genomics 17, 67 (2016). 151. Zhang, X. et al. RAID: a comprehensive resource for human RNA-associated (RNA-RNA/RNA-protein) interaction. RNA 20, 989–93 (2014). 152. Ulitsky, I. & P. Bartel, D. lincRNAs: Genomics, Evolution, and Mechanisms Igor. Cell 154, 26–46 (2013). 153. Ilott, N. E. & Ponting, C. P. Predicting long non-coding RNAs using RNA sequencing. Methods 63, 50–59 (2013). 154. Dinger, M. E., Pang, K. C., Mercer, T. R. & Mattick, J. S. Differentiating protein-coding and noncoding RNA: Challenges and ambiguities. PLoS Comput. Biol. 4, (2008). 155. Bazzini, A. A. et al. Identification of small ORFs in vertebrates using ribosome footprinting and evolutionary conservation. EMBO J. 33, 981–993 (2014). 156. Kong, L. et al. CPC: Assess the protein-coding potential of transcripts using sequence features and support vector machine. Nucleic Acids Res. 35, 345–349 (2007). 157. Moran N. Cabili. Integrative annotation of human large intergenic noncoding RNAs reveals global properties and specific subclasses. Genes Dev. 25, 1915–1927 (2016). 158. Bairoch, A. et al. The Universal Protein Resource (UniProt). Nucleic Acids Res. 33, 154–89  159 (2005). 159. Clark, K., Karsch-Mizrachi, I., Lipman, D. J., Ostell, J. & Sayers, E. W. GenBank. Nucleic Acids Res. 44, D67–D72 (2016). 160. Fu, L., Niu, B., Zhu, Z., Wu, S. & Li, W. CD-HIT: Accelerated for clustering the next-generation sequencing data. Bioinformatics 28, 3150–3152 (2012). 161. Li, W. & Godzik, A. Cd-hit: A fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 22, 1658–1659 (2006). 162. Weizhong Li, L. J. and A. G. Clustering of highly homologous sequences to reduce the size of large protein databases. Bioinformatics 17, 282–283 (2000). 163. Kleppe, L. et al. Maternal 3’UTRs: from egg to onset of zygotic transcription in Atlantic cod. BMC Genomics 13, 443 (2012). 164. Suzek, B. E., Wang, Y., Huang, H., McGarvey, P. B. & Wu, C. H. UniRef clusters: A comprehensive and scalable alternative for improving sequence similarity searches. Bioinformatics 31, 926–932 (2015). 165. Finn, R. D. et al. HMMER web server: 2015 Update. Nucleic Acids Res. 43, W30–W38 (2015). 166. Finn, R. D. et al. Pfam: The protein families database. Nucleic Acids Res. 42, 222–230 (2014). 167. Dobin, A. et al. STAR: Ultrafast universal RNA-seq aligner. Bioinformatics 29, 15–21 (2013). 168. Anders, S., Pyl, P. T. & Huber, W. HTSeq-A Python framework to work with high-90  throughput sequencing data. Bioinformatics 31, 166–169 (2015). 169. Love, M. I., Huber, W. & Anders, S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 15, 550 (2014). 170. Skreka, K. et al. Identification of differentially expressed non-coding RNAs in embryonic stem cell neural differentiation. Nucleic Acids Res. 40, 6001–15 (2012). 171. Veldhoen, N., Propper, C. R. & Helbing, C. C. Enabling comparative gene expression studies of thyroid hormone action through the development of a flexible real-time quantitative PCR assay for use across multiple anuran indicator and sentinel species. Aquat. Toxicol. 148, 162–173 (2014). 172. Mattick, J. S. The genetic signatures of noncoding RNAs. PLoS Genet. 5, (2009). 173. Redrup, L. et al. The long noncoding RNA Kcnq1ot1 organises a lineage-specific nuclear domain for epigenetic gene silencing. Development 136, 525–30 (2009). 174. Guil, S. et al. Intronic RNAs mediate EZH2 regulation of epigenetic targets. Nat. Struct. Mol. Biol. 19, 664–670 (2012). 175. Spitzer, J. et al. PAR-CLIP (Photoactivatable Ribonucleoside-Enhanced Crosslinking and Immunoprecipitation): A Step-By-Step Protocol to the Transcriptome-Wide Identification of Binding Sites of RNA-Binding Proteins. Methods in Enzymology 539, (Elsevier Inc., 2014). 176. Jeon, Y. & Lee, J. T. YY1 Tethers Xist RNA to the inactive X nucleation center. Cell 146, 119–133 (2011). 177. Martianov, I., Ramadass, A., Serra Barros, A., Chow, N. & Akoulitchev, A. Repression of 91  the human dihydrofolate reductase gene by a non-coding interfering transcript. Nature 445, 666–670 (2007). 178. Schmitz, K. M., Mayer, C., Postepska, A. & Grummt, I. Interaction of noncoding RNA with the rDNA promoter mediates recruitment of DNMT3b and silencing of rRNA genes. Genes Dev. 24, 2264–2269 (2010). 179. Engreitz, J. M. et al. The Xist lncRNA exploits three-dimensional genome architecture to spread across the X chromosome. Science (80-. ). 341, 1237973 (2013). 180. Simon, M. D. et al. The genomic binding sites of a noncoding RNA. Proc. Natl. Acad. Sci. U. S. A. 108, 20497–502 (2011). 181. Chu, C., Qu, K., Zhong, F. L., Artandi, S. E. & Chang, H. Y. Genomic Maps of Long Noncoding RNA Occupancy Reveal Principles of RNA-Chromatin Interactions. Mol. Cell 44, 667–678 (2011). 182. Miller, W. et al. Sequencing the nuclear genome of the extinct woolly mammoth. Nature 456, 387–390 (2008). 183. Van Bers, N. E. M. et al. Genome-wide SNP detection in the great tit Parus major using high throughput sequencing. Mol. Ecol. 19, 89–99 (2010). 184. Rohrig, H., Schmidt, J., Miklashevichs, E., Schell, J. & John, M. Soybean ENOD40 encodes two peptides that bind to sucrose synthase. Proc. Natl. Acad. Sci. U. S. A. 99, 1915–1920 (2002). 185. Galindo, M. I., Pueyo, J. I., Fouix, S., Bishop, S. A. & Couso, J. P. Peptides encoded by short ORFs control development and define a new eukaryotic gene family. PLoS Biol. 5, 92  1052–1062 (2007). 186. Kelkar, D. S. et al. Annotation of the zebrafish genome through an integrated transcriptomic and proteomic analysis. Mol. Cell. Proteomics M114.038299- (2014). doi:10.1074/mcp.M114.038299 187. Koufariotis, L. T., Chen, Y. P. P., Chamberlain, A., Jagt, C. Vander & Hayes, B. J. A catalogue of novel bovine long noncoding RNA across 18 tissues. PLoS One 10, (2015). 188. Novikova, I. V., Hennelly, S. P. & Sanbonmatsu, K. Y. Structural architecture of the human long non-coding RNA, steroid receptor RNA activator. Nucleic Acids Res. 40, 5034–5051 (2012). 189. Lanz, R. B. et al. A steroid receptor coactivator, SRA, functions as an RNA and is present in an SRC-1 complex. Cell 97, 17–27 (1999). 190. Ingolia, N. T., Ghaemmaghami, S., Newman, J. R. S. & Weissman, J. S. Genome-wide analysis in vivo of translation with nucleotide resolution using ribosome profiling. Science 324, 218–23 (2009). 191. Ingolia, N. T., Lareau, L. F. & Weissman, J. S. Ribosome profiling of mouse embryonic stem cells reveals the complexity and dynamics of mammalian proteomes. Cell 147, 789–802 (2011). 192. Juntawong, P., Girke, T., Bazin, J. & Bailey-Serres, J. Translational dynamics revealed by genome-wide profiling of ribosome footprints in Arabidopsis. Proc. Natl. Acad. Sci. U. S. A. 111, E203-12 (2014). 193. Smith, J. E. et al. Translation of Small Open Reading Frames within Unannotated RNA 93  Transcripts in Saccharomyces cerevisiae. Cell Rep. 7, 1858–1866 (2014). 194. Panzitt, K. et al. Characterization of HULC, a Novel Gene With Striking Up-Regulation in Hepatocellular Carcinoma, as Noncoding RNA. Gastroenterology 132, 330–342 (2007).               94  Appendices  Appendix A – Effect of leniency on ChopStitch’s performance for the C. elegans dataset  Leniency Total Exons Exons with hits to Ensembl exons Precision Recall 0 92835 84435 0.910 0.408 5 92222 83765 0.908 0.406 10 92835 84435 0.910 0.408 15 92873 84464 0.909 0.408 20 92912 84471 0.909 0.409 25 93243 84418 0.905 0.408 30 93531 84368 0.902 0.408 35 93861 84283 0.898 0.408 40 94105 84235 0.895 0.408  There is not a marked difference between ChopStitch results at different leniency values. This could be explained by the fact that for this C. elegans dataset both the transcriptomic and genomic datasets belonged to the same strain (N2) and would have low number of SNPs. The variation in results for different leniency values is bound to increase with unmatched datasets where there will be more chances of single nucleotide differences between the genome and transcriptome of the organism under study.    95  Appendix B – qPCR primers for selected 7 candidates  Transcript ID  Scaffold ID Scaffold Position NCR Primer Set Name Primer Sequence Amplicon length CCH-0005-T_J5938149  +Rc-01r160223s0000008 365928-366106 ncr1 160149 ACCGAGCCTTAGTCACAAT 434   +Rc-01r160223s0001138 282345-283112   160150 GGATTACCGATAACCATAGAGA 434       ncr2 160135 GCTCCGACTTAGAAAAAGGT 347         160136 GACTGTTTTGCTGACCCATA 347 CCH-BS01-R14422264  -Rc-01r160223s0013649 11065-7828 ncr3 160151 TGACAAGGACACCGCATT 216         160152 GGCTGAACTGAACTGACATTA 216 CCH-0011-T_J6410177 -Rc-01r160223s0003820 89822-89246, 86238-85744 ncr4 160153 GGCGAGTAGAAGATGTTGTC 202         160154 ATCATTCTGGAGGCATCAATAG 202 CCH-0007-T_R6862455 +Rc-01r160223s0006754 100737-102262 ncr5 160155 GATCAGTTCCATCTTCTCAGTT 389   +Rc-01r160223s0003056 127508-128654   160156 CCAGTAGTTCGTATCCACATT 389       ncr6 160137 AGATAACAGCCTTCCTCCTC 435         160138 ACAATGATGGACATGATTCC 435 CCH-0008-T_R4049217 -Rc-01r160223s0000006 9180-9133   ncr7 160157 GTTCATCAAGTAGGTCTCCAAT 254   -Rc-01r160223s0019462 17973-13463   160158 TATCACCAGTCAGAGCCATAA 254      ncr8 160139 CAGTTTTTGACAATTCCTTT 189         160140 ATTAAAATTAGGGATGAGCC 189 CCH-0012-T_R4358319 +Rc-01r160223s0114416 610-997, 7796-7855  ncr9  160159 CGTCTTCCTCTACTTCAACAC 202 96  Transcript ID  Scaffold ID Scaffold Position NCR Primer Set Name Primer Sequence Amplicon length   +Rc-01r160223s0025320 7178-7255, 13165-16879   160160 CTACTTCCTCTCCACCTCAG 202      ncr10 160141 ACAAGTAAGGACAGGGAGTGG 244         160142 GGAGTCAGGGTTCTGTAGGC 244 CCH-0012-C_R4911685 +Rc-01r160223s0026511 4790-5281 ncr11 160161 CCATAACGGACATCACATACG 215   -Rc-01r160223s0170941 3024-2525   160162 TCTACATGAGCAGCAAGGAT 215  The table shows all the 7 transcripts that were analysed through qPCR. Those that successfully amplified have been bolded.         

Cite

Citation Scheme:

        

Citations by CSL (citeproc-js)

Usage Statistics

Share

Embed

Customize your widget with the following options, then copy and paste the code below into the HTML of your page to embed this item in your website.
                        
                            <div id="ubcOpenCollectionsWidgetDisplay">
                            <script id="ubcOpenCollectionsWidget"
                            src="{[{embed.src}]}"
                            data-item="{[{embed.item}]}"
                            data-collection="{[{embed.collection}]}"
                            data-metadata="{[{embed.showMetadata}]}"
                            data-width="{[{embed.width}]}"
                            async >
                            </script>
                            </div>
                        
                    
IIIF logo Our image viewer uses the IIIF 2.0 standard. To load this item in other compatible viewers, use this url:
http://iiif.library.ubc.ca/presentation/dsp.24.1-0340507/manifest

Comment

Related Items