UBC Faculty Research and Publications

RNA STRAND: The RNA Secondary Structure and Statistical Analysis Database
Andronescu, Mirela; Bereg, Vera; Hoos, Holger H; Condon, Anne
Aug 13, 2008

Full Text

BioMed Central
BMC Bioinformatics
Open Access
Database

RNA STRAND: The RNA Secondary Structure and Statistical Analysis Database

Mirela Andronescu, Vera Bereg, Holger H Hoos* and Anne Condon

Address: Department of Computer Science, University of British Columbia, 2366 Main Mall, Vancouver, B.C., Canada

Email: Mirela Andronescu - andrones@cs.ubc.ca; Vera Bereg - verab@cs.ubc.ca; Holger H Hoos* - hoos@cs.ubc.ca; Anne Condon - condon@cs.ubc.ca

* Corresponding author

Abstract

Background: The ability to access, search and analyse secondary structures of a large set of known RNA molecules is very important for deriving improved RNA energy models, for evaluating computational predictions of RNA secondary structures and for a better understanding of RNA folding. Currently there is no database that can easily provide these capabilities for almost all RNA molecules with known secondary structures.

Results: In this paper we describe RNA STRAND – the RNA secondary STRucture and statistical ANalysis Database, a curated database containing known secondary structures of any type and organism. Our new database provides a wide collection of known RNA secondary structures drawn from public databases, searchable and downloadable in a common format. Comprehensive statistical information on the secondary structures in our database is provided using the RNA Secondary Structure Analyser, a new tool we have developed to analyse RNA secondary structures. The information thus obtained is valuable for understanding to which extent and with which probability certain structural motifs can appear. We outline several ways in which the data provided in RNA STRAND can facilitate research on RNA structure, including the improvement of RNA energy models and evaluation of secondary structure prediction programs.
In order to keep up-to-date with new RNA secondary structure experiments, we offer the necessary tools to add solved RNA secondary structures to our database and invite researchers to contribute to RNA STRAND.

Conclusion: RNA STRAND is a carefully assembled database of trusted RNA secondary structures, with easy on-line tools for searching, analyzing and downloading user selected entries, and is publicly available at http://www.rnasoft.ca/strand.

Published: 13 August 2008
BMC Bioinformatics 2008, 9:340 doi:10.1186/1471-2105-9-340
Received: 15 May 2008
Accepted: 13 August 2008
This article is available from: http://www.biomedcentral.com/1471-2105/9/340
© 2008 Andronescu et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Background

The number of solved RNA secondary structures has increased dramatically in the past decade, and several databases are available to search and download specific classes of RNA secondary structures [1-5]. However, for purposes such as improving RNA energy models [6,7], evaluating RNA secondary structure prediction software, obtaining distributions of naturally occurring structural features, or searching RNA molecules with specific motifs, researchers need to easily access a much larger set of known RNA secondary structures, ideally all known RNA secondary structures. RNA STRAND aims to provide this capability, in addition to easy search, analysis and download features.
Figure 1 shows an example of an RNA secondary structure and highlights some of its structural features.

Previous RNA databases provide secondary structure information, but are specialised in a different direction or follow different goals. The Rfam Database [5] contains a large collection of non-coding RNA families; however, many of the corresponding secondary structures are computationally predicted. The Comparative RNA Web Site [1] specialises in ribosomal RNA and intron RNA molecules. The Sprinzl tRNA database [2] specialises in tRNA molecules, the RNase P database [3] specialises in RNase P RNA molecules, and the SRP and tmRNA databases [4] specialise in SRP RNA and tmRNA molecules, respectively. Pseudobase [8] contains short RNA fragments that have pseudoknots. The RAG (RNA-As-Graphs) Database [9] classifies and analyses RNA secondary structures according to their topological characteristics based on the description of RNAs as graphs, but its collection of structures is very limited.

A number of previous databases contain three-dimensional (3D) RNA structures; however, as opposed to proteins, the number of solved RNA 3D structures is much smaller than the number of solved RNA secondary structures. (Only 18% of all RNA molecules we collected have known 3D structures.) As such, none of these databases includes molecules whose secondary structures are known but whose 3D structures are unknown; examples include the RCSB Protein Data Bank [10], the Nucleic Acids Database [11], the RNA Structure Database [12] and the Structural Classification of RNA (SCOR) database [13]. NCIR [14] contains non-canonical base pairs in 3D RNA molecules. FR3D [15] provides a collection of 3D RNA structural motifs found in the RCSB Protein Data Bank.
Finally, there are other RNA databases that provide RNA sequences, but no experimental structural information, such as the SubViral RNA Database [16], which contains a collection of over 2600 sequences of viroids, the hepatitis delta virus and satellite RNAs, but only mfold-predicted secondary structures.

RNA STRAND spans a more comprehensive range of RNA secondary structures than do previous databases. It currently provides highly accurate secondary structures for 4666 RNA molecules. Since some users of RNA STRAND will likely develop new thermodynamic models, prediction tools or statistical analyses, our data is exclusively determined by carefully conducted comparative sequence analysis [1], or by experimental methods such as NMR or X-ray crystallography [10]. All information has been obtained from publicly available RNA databases. Our goal in creating this database is to provide comprehensive information on structural features – such as types and sizes for stems and loops, pseudoknot complexity and base pair types – that can be interactively analysed or downloaded within and across functional classes of molecules. Such information could be used, for example, to understand what type of structural motifs are common in a specific set of RNA molecules; to estimate the accuracy of RNA secondary structure computational prediction methods; or to improve current thermodynamic models for RNA secondary structure prediction.

Figure 1: RNA secondary structure example. Schematic representation of the secondary structure for the RNase P RNA molecule of Methanococcus maripaludis from the RNase P Database; the RNA STRAND ID for this molecule is ASE_00199. Solid grey lines represent the ribose-phosphate backbone. Dotted grey lines represent missing nucleotides. Solid circles mark base pairs. Dashed boxes mark structural features. We define an RNA secondary structure as a set of base pairs [22]. In this work, we consider all C-G, A-U and G-U base pairs as canonical, and all other base pairs as non-canonical. However, we note that from the point of view of the planar edge-to-edge hydrogen bonding interaction [42], there are C-G, A-U and G-U base pairs that do not interact via Watson-Crick edges, and vice-versa [14,42]. Comparative sequence analysis tools do not currently describe bond types. A number of structural motifs can be identified in a secondary structure: A stem is composed of one or more consecutive base pairs. A hairpin loop contains one closing base pair, and all the bases between the paired bases are unpaired. An internal loop is a loop with two closing base pairs, and all bases between them are unpaired. A bulge loop can be seen as a variant of an internal loop in which there are no unpaired bases on one side. A multi-loop is a loop which has at least three closing base pairs; stems emanating from these base pairs are called multi-loop branches. A pseudoknot is a structural motif that involves non-nested, crossing base pairs.

Construction and Content

Figure 2 describes the four main modules that comprise RNA STRAND. To create the database, we first collected the data from various external sources, then we processed the data and prepared it for a MySQL relational database. Next, we installed and populated the database, and finally we prepared dynamic web pages that interact with the database.
In what follows we describe in detail the construction and content of each module.

External sources

The current release v2.0 of RNA STRAND contains a total of 4666 entries (RNA sequences and secondary structures) of the following provenance:

• RCSB Protein Data Bank (PDB) [10]: 1059 entries, obtained from three dimensional NMR and X-ray atomic structures containing RNA molecules only, or RNA molecules and proteins (only the RNAs were included in RNA STRAND), in PDB format. These include ribozymes, ribosomal RNAs, transfer RNAs, synthetic structures, and complexes containing more than one RNA molecule. Out of the 1059 entries, 575 contain at least two RNA molecules; these are easily searchable from the RNA STRAND web site. The RNA secondary structures were generated from the tertiary structures using RNAView [17], which is also used for secondary structure visualisation in the Nucleic Acid Database [11].

• Comparative RNA Web Site, version 2 [1]: 1056 entries of ribosomal and intronic RNA molecules obtained by covariance-based comparative sequence analysis.

• tmRNA database [4]: 726 entries of transfer messenger RNA sequences and secondary structures determined by comparative sequence analysis.

• Sprinzl tRNA Database (September 2007 edition) [2]: 622 transfer RNA sequences and secondary structures obtained by comparative sequence analysis from the tRNA sequences data set.
The genomic tRNA and tRNA gene sets from the Sprinzl tRNA database contain genomic sequences, and thus we think they are not as relevant for understanding function and folding of functional RNA molecules.

• RNase P Database [3]: 454 Ribonuclease P RNA sequences and secondary structures obtained by comparative sequence analysis.

• SRP Database [4]: 383 entries of Signal Recognition Particle RNA sequences and secondary structures determined by comparative sequence analysis.

[Figure 2: Database schema. Construction of RNA STRAND, from data collection to data presentation via dynamic web pages.]

• Rfam Database, version 8.1 [5]: 313 entries from 19 Rfam families, including hammerhead ribozymes, telomerase RNAs, RNase MRP RNAs and RNase E 5' UTR elements (only the seeds have been used). Of the 607 Rfam families in version 8.1, 172 have the secondary structure flag "published", while the remaining 435 families have been predicted using Pfold [5]. For several reasons, we decided to include only 19 of the 172 "published" families: (1) some of these families come from other databases that we have included directly, such as structures from the RNase P Database or SRP Database; (2) most of the secondary structures are actually predicted computationally and then published in the papers cited by Rfam, such as families RF00013, RF00035, RF00161 or RF00625. Since the Rfam database provides only very limited information about the reliability of the structures it contains, we have studied all 172 families and decided which families to include based on the cited papers.
The details regarding the decision for each family are described in Supplementary Material 1, accessible from the main page of the RNA STRAND web site.

• Nucleic Acid Database (NDB) [11]: 53 entries which occur in NDB and not in PDB (note that NDB and PDB have a large overlap of RNA structures); these include transfer RNAs and synthetic RNAs obtained by X-ray crystallography.

Table 1 provides some additional information on these RNAs; information and statistics on the current database contents are also available from the main page of the RNA STRAND site.

In the future, we intend to regularly check the aforementioned databases for new entries. With our current tools, keeping the database up-to-date will be relatively easy.

Data processing

Unique IDs

We created a unique and stable identifier for each entry in the RNA STRAND database. Future releases will keep all previous IDs unchanged.

Conversion scripts

One of the challenging tasks in collecting the RNA STRAND data arose from the fact that the external sources offer data in various formats. We have built tools to convert from all these formats to the CT format, which we use to store all structures internally, and to RNAML, BPSEQ, dot-parentheses and FASTA formats when requested by a user. The format descriptions are accessible on the "Help" page online.

Validation

All external databases we have used in the current version of RNA STRAND, except Rfam, contain highly curated RNA secondary or tertiary structures, therefore we trust the curation methods of these sources. For Rfam we selected a set of reliable structures based on the cited papers, as described in the previous section. Once we converted all the secondary structure external files into the CT format, we checked all files in order to make sure the secondary structures are valid (i.e., one base is paired with at most one other base, and if the base at position i is paired with the base at position j, then the base at position j is also paired with the base at position i).
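The two validity conditions stated above can be illustrated with a minimal sketch. It is not the authors' validation tool; it assumes the structure has already been read into a 1-based partner table of the kind found in the pairing column of a CT file, where 0 marks an unpaired base. The function name `validate_pairing` and the wording of the messages are hypothetical.

```python
def validate_pairing(partner):
    """Return a list of problems found in a partner table.

    partner[i - 1] holds the position paired with base i, or 0 if base i
    is unpaired (assumed input convention, positions are 1-based).
    """
    n = len(partner)
    problems = []
    for i, j in enumerate(partner, start=1):
        if j == 0:
            continue  # unpaired bases are always valid
        if not 1 <= j <= n:
            problems.append(f"base {i}: partner {j} is out of range")
        elif j == i:
            problems.append(f"base {i} is paired with itself")
        elif partner[j - 1] != i:
            # Asymmetry: i claims j, but j claims someone else (or nobody).
            # This also catches a base that is claimed by two partners.
            problems.append(
                f"base {i} pairs with {j}, but base {j} pairs with {partner[j - 1]}"
            )
    return problems


if __name__ == "__main__":
    print(validate_pairing([4, 3, 2, 1]))  # symmetric pairing, no problems
    print(validate_pairing([3, 3, 1]))     # base 3 is claimed by bases 1 and 2
```

Checking symmetry is enough to enforce both conditions, because a table in which two bases name the same partner can be symmetric for at most one of them.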
When performed on our present data, this validation step revealed several inconsistencies in some of the external files, which we brought to the attention of the respective database owners.

RNA Secondary Structure Analyser

The structural statistics that form the core part of RNA STRAND were generated using the RNA Secondary Structure Analyser, which takes as input an RNA secondary structure description, for example in CT format, and outputs a wide range of secondary structure information. While many of these features, such as the number and composition of stems, are rather straightforward to determine, in some cases, more advanced algorithmic techniques have to be applied – as is the case, for example, for the minimal number of base pairs that need to be removed to render a structure pseudoknot free. For this specific task, we implemented a dynamic programming algorithm that removes the minimum number of base pairs [18]; however, more sophisticated approaches could be used, such as those recently described by Smit et al. [19]. The complete output of the analyser run for each individual database entry can be accessed easily from the RNA STRAND web interface, and a description of the output can be found in the online Supplementary Material 2.

Table 1: The main RNA types included in RNA STRAND v2.0.

RNA type                  Main source(s)             # entries  Length mean  Length std  % PKBP mean  % PKBP std
Transfer messenger RNA    tmRDB [4]                  726        368          86          21.0         6.1
16S ribosomal RNA         CRW [1], PDB [10]          723        1529         286         1.8          0.5
Transfer RNA              Sprinzl DB [2], PDB [10]   707        76           21          0.1          2.3
Ribonuclease P RNA        RNase P DB [3]             470        323          71          5.7          3.2
Signal rec. particle RNA  SRPDB [4], PDB [10]        394        220          111         0.0          0.0
23S ribosomal RNA         CRW [1], PDB [10]          205        2699         716         2.4          1.1
5S ribosomal RNA          CRW [1], PDB [10]          161        115          21          0.0          0.0
Group I intron            CRW [1], PDB [10]          152        563          412         5.8          2.2
Hammerhead ribozyme       Rfam [5], PDB [10]         146        61           24          0.0          0.0
Group II intron           CRW [1], PDB [10]          42         1298         829         1.4          3.5
All molecules             All of the above           4666       527          722         5.3          9.1

Overview of the main RNA types in version 2.0 of the RNA STRAND database, their provenance, the number of RNAs, the mean length and standard deviation for each type. % PKBP denotes the percentage of the base pairs that need to be removed in order to render the structure pseudoknot-free. Most of the major RNA types are represented by a large number of molecules.

MySQL database

All the data obtained from the RNA Secondary Structure Analyser were inserted into a relational database implemented in MySQL (version 5.0.26). The main table is MOLECULE, with one row per RNA entry in the database. This table contains as primary key the unique RNA STRAND ID of the entry and further comprises various descriptive fields, including: organism, reference, length, RNA type, external source, external ID, sequence, three levels of abstract shapes using the RNAshapes representation [20], the method of secondary structure determination, and a link to the respective CT file. (Since RNAshapes version 2.1.5 cannot obtain the abstract shape of pseudoknotted secondary structures, we first removed a minimum number of base pairs to render the structure pseudoknot-free.) Furthermore, there is one table per secondary structure feature, where the table MOLECULE is connected to each of these tables in a one-to-many relationship. For example, the table STEM contains information such as the number of base pairs and the estimated free energy change for that stem, using parameters by Xia et al. [21]. Accurately estimating the free energy change of entire structures is currently challenging, due to structural motifs for which current energy models are incomplete, such as pseudoknots, non-canonical base pairs, and modified nucleotides.
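The one-to-many relationship between MOLECULE and a feature table such as STEM can be illustrated with a small sketch. It uses Python's built-in sqlite3 module as a stand-in for MySQL, and every column name, the entry ID `DEMO_00001` and all stored values are hypothetical placeholders rather than the actual RNA STRAND schema or data.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a MOLECULE and a STEM table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE molecule (
            strand_id  TEXT PRIMARY KEY,   -- unique, stable entry ID
            organism   TEXT,
            rna_type   TEXT,
            length     INTEGER
        );
        CREATE TABLE stem (
            stem_id        INTEGER PRIMARY KEY,
            strand_id      TEXT NOT NULL REFERENCES molecule(strand_id),
            num_base_pairs INTEGER,
            free_energy    REAL            -- estimated free energy change
        );
        """
    )
    # Placeholder rows: one molecule that owns two stems.
    conn.execute(
        "INSERT INTO molecule VALUES (?, ?, ?, ?)",
        ("DEMO_00001", "hypothetical organism", "demo RNA", 30),
    )
    conn.executemany(
        "INSERT INTO stem (strand_id, num_base_pairs, free_energy) VALUES (?, ?, ?)",
        [("DEMO_00001", 3, -2.5), ("DEMO_00001", 5, -6.0)],
    )
    return conn


def stems_per_molecule(conn):
    """Return (strand_id, number of stems, mean stem length) per molecule."""
    return conn.execute(
        "SELECT m.strand_id, COUNT(s.stem_id), AVG(s.num_base_pairs) "
        "FROM molecule AS m LEFT JOIN stem AS s ON s.strand_id = m.strand_id "
        "GROUP BY m.strand_id ORDER BY m.strand_id"
    ).fetchall()


if __name__ == "__main__":
    print(stems_per_molecule(build_demo_db()))
```

Keeping one table per structural feature, each keyed back to the molecule, is what lets a per-molecule criterion such as "exactly three hairpins" be expressed as a join with a count.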
Other similar tables include HAIRPIN_LOOP, MULTI_LOOP and PSEUDOKNOT. An additional table TMP_MOLECULE is used to temporarily hold new submissions received via the web interface; for these, we manually check the submission information by checking the cited paper, after which, if the submission is accepted, all further steps required to permanently add the respective RNA(s) to the database are performed automatically.

Web interface

The web interface to RNA STRAND has been created using a set of PHP scripts (version 5.1.2). The main functions of the web interface are searching, browsing, analysis, downloading and uploading.

Searching and browsing

The user specifies one or more search criteria in a web-based form. The general criteria include RNA type (e.g., 16S Ribosomal RNA), organism of origin (e.g., E. coli), external source (e.g., RCSB Protein Data Bank), length (in bases), the number of molecules in the complex, whether it is a fragment, a sequence pattern using the standard IUPAC nucleic acid codes, an abstract structure or fragment using the RNAshapes representation [20] and whether to restrict the results to non-redundant sequences.

We define a set of entries to be non-redundant if their sequences are pairwise distinct. On a search page the user can request a non-redundant set that satisfies some search criteria. In this case, if two entries have identical RNA sequences, one of them will be selected arbitrarily. In the remainder of this paper, when we refer to a number of non-redundant entries matching some criteria, we mean a largest non-redundant set of entries satisfying the specified criteria. Currently there are 4104 non-redundant entries out of the 4666 entries in RNA STRAND v2.0.

Advanced searches are supported based on 21 additional search criteria on secondary structure elements, such as selection of RNA molecules having at least one pseudoknot, or hairpin loops with a specific sequence – for example GNRA hairpin loops. The set of database entries that match all of the specified criteria simultaneously is returned in the form of a table.

Using advanced search criteria, users can search for entries with various structural motifs. For example, when looking for a Y shape with an additional hairpin, one would search for entries that have exactly one multi-loop, three multi-loop branches, three hairpins, one molecule in the complex, and no pseudoknots. This search returns 31 entries, most of which are ciliate telomerase RNAs from Rfam. If pseudoknots are allowed, then vertebrate telomerase RNAs from Rfam are also included, yielding 36 search results. An equivalent pseudoknot-free search can be obtained by typing in the abstract shape [ [] [] ] [] (where matching brackets represent one interrupted or uninterrupted stem). Pseudoknots are currently not permitted in the abstract shape representation [20].

Support for inspecting large fractions of the database contents is provided via searches with no or very general criteria. For example, it is easy to obtain a list of all RNase P RNA structures contained in the database.

Details on individual entries from the result list of any search can be displayed by clicking on an RNA STRAND ID link of the results table. This single entry display comprises general information about the entry, links to the original database entry for this molecule, a secondary structure diagram, details of its secondary structure elements and features, links to other RNA STRAND entries with the same sequence (i.e., redundant entries), links to the sequence and secondary structure specification in five formats (CT, RNAML, BPSEQ, dot-parentheses and FASTA), and a link to the complete output of the RNA Secondary Structure Analyser.

Analysis

In addition to the aforementioned analysis information for individual entries, RNA STRAND also provides histograms or cumulative distribution functions of various molecule characteristics (such as number of pseudoknots per molecule) or structural features (such as number of branches per multi-loop) for all structures in the database or for user-selected subsets, as obtained from the search page. In addition, correlations between various molecule characteristics and molecule length can be obtained. For an unbiased analysis, the user has the option of normalising the data by RNA type (such as tRNA), in which case for each particular RNA type, one data point is obtained by averaging over all the data for molecules of that type. Finally, the user can choose to remove the outliers of the distributions. We use a common definition, according to which a data point is an outlier if, and only if, it is smaller than Q1 - 1.5·(Q3 - Q1) or greater than Q3 + 1.5·(Q3 - Q1), where Q1 and Q3 are the first and third quartiles, respectively. Such analyses may guide research pertaining to understanding structural features in naturally occurring RNA molecules, as we outline in the "Utility and discussion" section.

Downloading

The set of molecules selected via the search page can be downloaded in one of five supported formats: CT, RNAML, BPSEQ, dot-parentheses and FASTA. Thus, researchers can use specifically selected structures locally.

Uploading

RNA STRAND supports public submission of RNA secondary structures to the database via its web interface.
The structure file can be in any of the four supported secondary structure formats (CT, RNAML, BPSEQ and dot-parentheses) or in the PDB tertiary structure format. Since RNA STRAND is a curated database, newly submitted structures are checked for accuracy and completeness by one of the database administrators before they are added to the database. New additions to the public databases that constitute our external sources will be added to RNA STRAND regularly. This is complemented by the public submission option, which is intended for submission of structures that do not yet belong to any of these databases.

Utility and discussion

RNA STRAND v2.0 contains 4666 RNA molecules or interacting complexes of various types, and an abundance of RNA structural motifs (see also Table 1). This represents a considerable amount of data from which to draw significant statistics and trends about RNA secondary structures. In what follows we illustrate how the information in RNA STRAND can be used for various purposes.

Obtaining statistics of naturally occurring RNA structural features

We performed statistical analyses using the RNA STRAND web interface. Our first observation concerns the number and complexity of pseudoknots. According to the current data from RNA STRAND v2.0, pseudoknots occur rather commonly, especially in longer molecules: 74% of all (non-redundant) entries with 100 or more nucleotides contain pseudoknots. We compared the stem length (i.e., the number of base pairs in uninterrupted stems) with the minimal number of base pairs that need to be removed per pseudoknot to render the structure pseudoknot free (we denote this number by # PKBP; note that for over 95% of the pseudoknots, the bases counted in determining # PKBP form one uninterrupted stem; also, there is no overlap between the base pairs counted in determining the stem length and the base pairs counted in determining # PKBP).
Table 2: Statistics on the complexity of pseudoknots in RNA STRAND v2.0.

RNA type                         # entries  Stem length              # PKBP
                                            median  mean   std       median  mean   std
16S ribosomal RNA                644        4.00    4.30   2.50      3.00    2.50   0.68
23S ribosomal RNA                93         4.00    4.14   2.39      2.00    3.75   3.12
Transfer messenger RNA           657        4.00    4.11   2.24      5.00    5.51   1.00
Ribonuclease P RNA               433        4.00    4.45   2.51      4.00    5.18   1.36
All, non-redundant               4104       4.00    4.35   2.44      4.00    4.14   1.86
All, non-redundant & normalised  4104       4.96    5.05   0.58      4.65    4.95   1.78

The columns represent the RNA type, the number of entries for each type, the median, mean and standard deviation of the stem length (i.e., number of adjacent base pairs) and the minimum number of base pairs to break in order to open pseudoknots (# PKBP). For each row, a non-redundant set was selected, and outliers were removed.

Table 2 shows that when considering all RNA types in the database, the median, mean and standard deviation of the two measures, stem length and # PKBP, are very similar, even when we normalise by RNA type. (For normalised analysis, instead of using one data point per molecule or per structural feature, we use one data point for each RNA type, where this point is determined by averaging all data points for the respective class of RNAs. This way, the user can avoid biasing the analysis when there are substantially more structures for some RNA types than for others.) However, for 16S and 23S ribosomal RNA molecules the stem length tends to be significantly larger than # PKBP, whereas for transfer messenger RNA molecules in particular and ribonuclease P RNA molecules to some extent, # PKBP is larger than the stem length.
This observation is interesting in the context of computational approaches for RNA secondary structure prediction which ignore pseudoknots [22], add pseudoknots hierarchically in a second stage [23], or simultaneously add stems in pseudoknotted and non-pseudoknotted regions [24,25].

Our second observation concerns the abundance of non-canonical base pairs and the pairing type of their immediate neighbours. (We consider all C-G, A-U and G-U pairs to be canonical base pairs, and all other base pairs to be non-canonical.) Figure 3 shows a histogram for the 729 non-redundant entries whose structures were determined by all-atom methods (these include structures from the Protein Data Bank and the Nucleic Acid Database). For this data set, non-canonical A-G base pairs are the most abundant, representing 55% of all non-canonical base pairs, and G-G pairs are the least abundant, representing only 4% of all non-canonical base pairs. The plot also shows that a relatively small fraction of non-canonical base pairs have as immediate neighbours canonical base pairs. Interestingly, for all seven types of non-canonical base pairs, more pairs are adjacent to at least one other non-canonical base pair than surrounded by two canonical base pairs. For example, 55% of all A-A pairs are adjacent to at least one other non-canonical base pair. This may suggest that non-canonical base pairs are sufficiently stable energetically to form several consecutive base pairs.

Finally, we found rather strong linear correlations between the number of nucleotides of the RNAs in our database and the number of stems, hairpin loops, bulges, internal loops and multi-loops; the Pearson correlation coefficients are r = 0.95, 0.95, 0.92, 0.91 and 0.92, respectively.
This is consistent with the idea that the local formation of these secondary structure elements is relatively independent of the overall size of the molecule and in agreement with the current thermodynamic energy models of RNA secondary structure, which assume additive and independent energy contributions for these structural elements. Interestingly, the correlation between the RNA length and the number of pseudoknots is significantly weaker (r = 0.64), suggesting that pseudoknots may not follow the same linearity principle.

Figure 3: Histogram of the occurrence of non-canonical base pairs. Histogram of non-canonical base pairs in the 729 non-redundant entries whose structures were determined by NMR or X-ray crystallography. [Plot: frequency (axis from 0 to 4500) of the non-canonical base pair types AG, AC, AA, UU, CU, CC and GG, grouped by nearest neighbours: at least one non-canonically paired neighbour, any neighbours, two canonically paired neighbours.]

Evaluating energy-based secondary structure prediction programs

The RNA STRAND database can be used to evaluate the prediction accuracy of energy-based RNA secondary structure prediction software. RNA STRAND v2.0 contains 3704 non-redundant entries containing one molecule that can be used to evaluate software such as CONTRAfold [7] or mfold [26], 403 non-redundant entries containing complexes of two or more molecules that can be used to evaluate software for interacting molecules [27,28], and 1957 non-redundant single-molecules with pseudoknots that can be used to evaluate secondary structure prediction programs with pseudoknots [23-25,29].

We have selected 2518 structures out of the 3704 non-redundant entries containing one molecule, after we eliminated the entries with unknown nucleotides and overly large loops. (Specifically, entries having hairpin loops, bulges, internal loops or multi-loops with more than 50, 50, 50 and 100 unpaired bases, respectively, were removed.) In addition, we have removed all non-canonical base pairs and the minimum number of base pairs needed to render the structures pseudoknot-free. The resulting structures are used as ground-truth reference structures. We evaluated the sensitivity and positive predictive value (PPV) of CONTRAfold [7] and SimFold [30] with various free energy parameter sets, see Figure 4. Sensitivity is the number of correctly predicted base pairs divided by the number of base pairs in the reference structure, and PPV is the number of correctly predicted base pairs divided by the number of predicted base pairs. "SimFold Turner99" in Figure 4 refers to SimFold using the free energy parameters described by Mathews et al. [22], and is essentially equivalent to mfold 3.1 [26]. On this large set, the average sensitivity of prediction is 0.63, while the average PPV is 0.57.

"CONTRAfold 1.1 151Rfam" is the CONTRAfold software version 1.1, as reported by Do et al. [7]. The CONTRAfold prediction program uses a trade-off parameter γ between sensitivity and PPV, and thus we report predictions for γ ranging from 2 to 20. When the target of one measure is fixed to the value obtained with "SimFold Turner99", the other is similar as well, showing that on this data set, CONTRAfold 1.1 gives similar average prediction accuracy as "SimFold Turner99".
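The two accuracy measures defined above are simple set computations. The sketch below follows the definitions in the text; the function name `sensitivity_ppv`, the representation of a structure as a collection of (i, j) position pairs, and the choice to return 0.0 when a denominator is empty are assumptions made for illustration, not part of the evaluation code used in the paper.

```python
def sensitivity_ppv(reference_pairs, predicted_pairs):
    """Compute (sensitivity, PPV) for one predicted structure.

    Both arguments are iterables of (i, j) base-pair positions; the order of
    i and j within a pair does not matter.
    """
    ref = {tuple(sorted(p)) for p in reference_pairs}
    pred = {tuple(sorted(p)) for p in predicted_pairs}
    correct = len(ref & pred)  # correctly predicted base pairs
    # Assumed convention: an empty denominator yields 0.0 instead of an error.
    sensitivity = correct / len(ref) if ref else 0.0
    ppv = correct / len(pred) if pred else 0.0
    return sensitivity, ppv


if __name__ == "__main__":
    reference = [(1, 10), (2, 9), (3, 8), (4, 7)]
    predicted = [(10, 1), (2, 9), (3, 7)]
    # Two of four reference pairs are recovered; two of three predictions are right.
    print(sensitivity_ppv(reference, predicted))
```

Averaging these per-structure values over a test set gives figures comparable in kind to the average sensitivity and PPV reported in the text.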
The remaining points of Figure 4 are described in the following section.

Improving RNA energy models

More importantly, RNA STRAND can facilitate approaches for improving the free energy models underlying energy-based RNA secondary structure prediction software [6,7]. In this context, it can be very useful to exploit training data consisting of RNA sequences with known secondary structures, and the size and variety of such data are key for obtaining good results.

Figure 4 shows the average sensitivity and PPV of various programs measured on the 2518 structures mentioned in the previous section, and trained on various training sets. "CONTRAfold 1.1 151Rfam" was trained on a small set of 151 structures from various Rfam families [7], while "CONTRAfold 2.0 STRAND1.3" was trained on 3427 pre-processed structures (i.e., split and restricted) of average length 178 nucleotides from version 1.3 of the RNA STRAND database, as used by Andronescu et al. [6]. The figure shows that using the much larger set in the latter case gives an improvement of roughly 7% in prediction accuracy.

To demonstrate even further the importance of using a large and comprehensive set of known RNA secondary structures for obtaining high-quality free energy parameters, we have used the current version of RNA STRAND, v2.0, to obtain a new training set of 2246 structures of average length 246 nucleotides. Using the Maximum Likelihood parameter estimation method described by Andronescu et al. [6], which is similar to CONTRAfold [7], we have improved the average accuracy of prediction even further, as shown by the data point labelled "SimFold STRAND2.0" in Figure 4. This gives an improvement of 8% in average sensitivity and 10% in average PPV compared to the Turner99 parameters, when measured on our test set of 2518 structures. (Note that, since CONTRAfold and SimFold use different energy models and prediction algorithms, it is more appropriate to make comparisons between different versions of each than it is to compare CONTRAfold versus SimFold.)

These results provide clear evidence for the key role of large and carefully assembled sets of RNA secondary structures, such as provided by RNA STRAND, in the context of determining RNA free energy models. In the future, we are planning to use the RNA STRAND data to train free energy parameters for pseudoknotted structures. Existing energy models for RNA secondary structure prediction methods with pseudoknots are often ad hoc [25,29], and we believe that by using data-driven methods in conjunction with the 1957 non-redundant RNA STRAND entries representing RNAs with pseudoknots, it will be possible to obtain improved energy models for pseudoknotted structure prediction.

Figure 4. Prediction accuracy achieved by various energy models. Sensitivity vs. positive predictive value (PPV) of various secondary structure prediction methods. Sensitivity is the number of correctly predicted base pairs divided by the number of base pairs in the reference structure, PPV is the number of correctly predicted base pairs divided by the number of predicted base pairs. Higher prediction accuracy is achieved when the free energy parameters are obtained by training on a larger set of structures. The CONTRAfold prediction program uses a trade-off parameter γ between sensitivity and PPV, and thus we report predictions for γ ranging from 2 to 20. [Plot: sensitivity and PPV axes, each from 0.54 to 0.74; series: SimFold Turner99, CONTRAfold 1.1 151Rfam, CONTRAfold 2.0 STRAND1.3, SimFold STRAND2.0.]

Other uses of RNA STRAND

The numerous search criteria supported by the RNA STRAND web interface allow users to select and study molecules with specific structural features. For example, Tyagi and Mathews [31] studied the computational prediction accuracy of helical coaxial stacking in multi-loops. RNA STRAND v2.0 conveniently allows the selection and download of 189 non-redundant entries with all-atom structures that have at least one multi-loop. Other examples include the use of naturally occurring pseudoknotted structures to evaluate computational methods that render a pseudoknotted RNA secondary structure pseudoknot-free [19], or to evaluate RNA secondary structure visualisation tools [32].

In recent work on the role of RNA structure in splicing, Rogic et al. [33] needed to identify thermodynamically stable stems that maximally shorten the distance between mRNA donor sites and branchpoint sequences. Since the optimal free energy of such stems is unknown, Rogic et al. wished to determine the most probable ranges of possible free energies for uninterrupted stems. By selecting all molecules on the RNA STRAND web site, they obtained distributions of estimated stem free energies (obtained with parameters by Xia et al. [21]), which were used to support a new model for the role of RNA secondary structure in mRNA splicing.

In addition, RNA STRAND can facilitate the design of optical melting experiments [21], whose goal is to better understand the thermodynamics of RNA structure formation, and to improve RNA secondary structure prediction accuracy.
When designing optical melting experiments, usually a set of known RNA secondary structures is first assembled to determine which types of structural motifs that have not been previously studied appear frequently in naturally occurring RNAs [34,35]. The RNA STRAND web interface, as well as the abundance of reliable RNA structures in the RNA STRAND database, can be very useful in this context. For example, a significant number of multi-loops (16% in all non-redundant RNA STRAND entries) have five or more branches, but, to the best of our knowledge, optical melting experiments only exist for multi-loops with up to four branches [36,37]. Moreover, 30% of the internal loops in all non-redundant RNA STRAND entries have seven or more unpaired bases, and 13% have an absolute asymmetry (i.e., absolute difference between the number of unpaired bases on each side) of at least three, while only limited optical melting experiments exist to cover these cases [38,39].

Conclusion

In this paper, we presented RNA STRAND, a new database for RNA secondary structure data that provides access to detailed information on known secondary structures as well as statistical analyses of structural aspects of various types of RNAs. We believe that such information will be useful in the context of understanding RNA structure and function; in particular, we expect it to further facilitate the development and evaluation of energy models for secondary structure prediction. Our database is flexible and extensible; it provides a convenient web interface to its major functions and supports searches according to many criteria, including properties of secondary structure elements. The database is publicly accessible and supports the submission of new RNA structures by the research community. We are committed to keeping RNA STRAND up-to-date with new structures that are added to the eight databases of provenance, and we invite submissions of all types of RNA secondary structures, which will help to further expand the database and increase its usefulness.

In the future, we intend to add RNA secondary structures obtained from the SHAPE technique [40,41], and also to provide further search options such as searches by specific structural motifs.

Availability and requirements

RNA STRAND is publicly available at http://www.rnasoft.ca/strand. The RNA Secondary Structure Analyser, as well as the database tables, are available upon request from the authors.

Authors' contributions

MA collected the data, implemented conversion and validation scripts, implemented the MySQL database and part of the PHP scripts, performed the statistical analyses and helped to draft the manuscript. VB implemented the vast majority of the PHP scripts and most of the RNA Secondary Structure Analyser. HHH and AC conceived the project, specified the design of the RNA Secondary Structure Analyser, supervised MA's and VB's work, and helped to write the manuscript. All authors checked the accuracy of the database and web interface, and read and approved the final manuscript.

Acknowledgements

We thank Baharak Rastegari, Yinglei S. Zhao, Mohammad Safari and Jack Jia, who provided an efficient computer program for pseudoknotted structure parsing; Robin Gutell, Christian Zwieb, James Brown and Mathias Sprinzl for providing data and help with the CRW, SRPDB & tmRDB, RNase P DB and Sprinzl tRNA DB, respectively; Simon Moxon and Jennifer Daub for help using the Rfam database; Robert Giegerich and David Mathews for useful discussions; Alex Brown for help with the web interface; and Farheen Rawji for her work on an early version of the RNA Secondary Structure Analyser.

Funding for this work was provided by: Mathematics of Information Technology and Complex Systems Network of Centres of Excellence (to AC and HH); Natural Sciences and Engineering Research Council of Canada Discovery Grant Program (238788 to HH and 217192 to AC).

References

1. Cannone J, Subramanian S, Schnare M, Collett J, D'Souza L, Du Y, Feng B, Lin N, Madabusi L, Müller K, Pande N, Shang Z, Yu N, Gutell R: The comparative RNA web (CRW) site: an online database of comparative sequence and structure information for ribosomal, intron, and other RNAs. BMC Bioinformatics 2002, 3:2. [Correction: BMC Bioinformatics 3:15]
2. Sprinzl M, Vassilenko K: Compilation of tRNA sequences and sequences of tRNA genes. Nucleic Acids Res 2005, 33(Database issue):D139-40.
3. Brown J: The Ribonuclease P Database. Nucleic Acids Res 1999, 27:314.
4. Andersen ES, Rosenblad MA, Larsen N, Westergaard JC, Burks J, Wower IK, Wower J, Gorodkin J, Samuelsson T, Zwieb C: The tmRDB and SRPDB resources. Nucleic Acids Res 2006, 34(Database issue):D163-8.
5. Griffiths-Jones S, Moxon S, Marshall M, Khanna A, Eddy S, Bateman A: Rfam: annotating non-coding RNAs in complete genomes. Nucleic Acids Res 2005, 33(Database issue):D121-4.
6. Andronescu M, Condon A, Hoos HH, Mathews DH, Murphy KP: Efficient parameter estimation for RNA secondary structure prediction. Bioinformatics 2007, 23(13):i19-i28.
7. Do CB, Woods DA, Batzoglou S: CONTRAfold: RNA secondary structure prediction without physics-based models. Bioinformatics 2006, 22(14):e90-e98.
8. van Batenburg FH, Gultyaev AP, Pleij CW: PseudoBase: structural information on RNA pseudoknots. Nucleic Acids Res 2001, 29:194-195.
9. Gan HH, Fera D, Zorn J, Shiffeldrim N, Tang M, Laserson U, Kim N, Schlick T: RAG: RNA-As-Graphs database-concepts, analysis, and features. Bioinformatics 2004, 20(8):1285-1291.
10. Westbrook J, Feng Z, Chen L, Yang H, Berman H: The Protein Data Bank and structural genomics. Nucleic Acids Res 2003, 31:489-491.
11. Berman HM, Olson WK, Beveridge DL, Westbrook J, Gelbin A, Demeny T, Hsieh SH, Srinivasan AR, Schneider B: The nucleic acid database. A comprehensive relational database of three-dimensional structures of nucleic acids. Biophys J 1992, 63(3):751-759.
12. Murthy VL, Rose GD: RNABase: an annotated database of RNA structures. Nucleic Acids Res 2003, 31:502-504.
13. Tamura M, Hendrix DK, Klosterman PS, Schimmelman NR, Brenner SE, Holbrook SR: SCOR: Structural Classification of RNA, version 2.0. Nucleic Acids Res 2004, 32(Database issue):D182-4.
14. Nagaswamy U, Larios-Sanz M, Hury J, Collins S, Zhang Z, Zhao Q, Fox GE: NCIR: a database of non-canonical interactions in known RNA structures. Nucleic Acids Res 2002, 30:395-397.
15. Sarver M, Zirbel CL, Stombaugh J, Mokdad A, Leontis NB: FR3D: finding local and composite recurrent structural motifs in RNA 3D structures. J Math Biol 2008, 56(1–2):215-252.
16. Rocheleau L, Pelchat M: The Subviral RNA Database: a toolbox for viroids, the hepatitis delta virus and satellite RNAs research. BMC Microbiol 2006, 6:24.
17. Yang H, Jossinet F, Leontis N, Chen L, Westbrook J, Berman H, Westhof E: Tools for the automatic identification and classification of RNA base pairs. Nucleic Acids Res 2003, 31(13):3450-3460.
18. Apostolico A, Atallah MJ, Hambrusch SE: New clique and independent set algorithms for circle graphs. Discrete Applied Mathematics 1996, 32:1-24.
19. Smit S, Rother K, Heringa J, Knight R: From knotted to nested RNA structures: a variety of computational methods for pseudoknot removal. RNA 2008, 14(3):410-416.
20. Steffen P, Voss B, Rehmsmeier M, Reeder J, Giegerich R: RNAshapes: an integrated RNA analysis package based on abstract shapes. Bioinformatics 2006, 22(4):500-503.
21. Xia T, SantaLucia J, Burkard M, Kierzek R, Schroeder S, Jiao X, Cox C, Turner D: Thermodynamic parameters for an expanded nearest-neighbor model for formation of RNA duplexes with Watson-Crick base pairs. Biochemistry 1998, 37(42):14719-14735.
22. Mathews D, Sabina J, Zuker M, Turner D: Expanded sequence dependence of thermodynamic parameters improves prediction of RNA secondary structure. J Mol Biol 1999, 288(5):911-940.
23. Jabbari H, Condon A, Pop A, Pop C, Zhao Y: HFold: RNA Pseudoknotted Secondary Structure Prediction Using Hierarchical Folding. Workshop on Algorithms in Bioinformatics 2007:323-334.
24. Ren J, Rastegari B, Condon A, Hoos HH: HotKnots: heuristic prediction of RNA secondary structures including pseudoknots. RNA 2005, 11(10):1494-1504.
25. Rivas E, Eddy SR: A dynamic programming algorithm for RNA structure prediction including pseudoknots. J Mol Biol 1999, 285(5):2053-2068.
26. Zuker M: Mfold web server for nucleic acid folding and hybridization prediction. Nucleic Acids Res 2003, 31:3406-3415.
27. Andronescu M, Zhang ZC, Condon A: Secondary structure prediction of interacting RNA molecules. J Mol Biol 2005, 345:987-1001.
28. Dirks R, Bois J, Schaeffer J, Winfree E, Pierce N: Thermodynamic analysis of interacting nucleic acid strands. SIAM Rev 2007, 49:65-88.
29. Dirks RM, Pierce NA: A partition function algorithm for nucleic acid secondary structure including pseudoknots. J Comput Chem 2003, 24(13):1664-1677.
30. Andronescu M: Algorithms for predicting the Secondary Structure of pairs and combinatorial sets of nucleic acid strands. In Master's thesis Dept. of Computer Science, University of British Columbia; 2003.
31. Tyagi R, Mathews DH: Predicting helical coaxial stacking in RNA multibranch loops. RNA 2007, 13(7):939-951.
32. Byun Y, Han K: PseudoViewer: web application and web service for visualizing RNA pseudoknots and secondary structures. Nucleic Acids Res 2006, 34(Web Server issue):W416-422.
33. Rogic S, Montpetit B, Hoos HH, Mackworth AK, Ouellette FB, Hieter P: Correlation between the secondary structure of pre-mRNA introns and the efficiency of splicing in Saccharomyces cerevisiae. BMC Genomics 2008, 9:355.
34. Badhwar J, Karri S, Cass CK, Wunderlich EL, Znosko BM: Thermodynamic characterization of RNA duplexes containing naturally occurring 1 × 2 nucleotide internal loops. Biochemistry 2007, 46(50):14715-14724.
35. Davis AR, Znosko BM: Thermodynamic characterization of single mismatches found in naturally occurring RNA. Biochemistry 2007, 46(46):13425-13436.
36. Diamond J, Turner D, Mathews D: Thermodynamics of three-way multibranch loops in RNA. Biochemistry 2001, 40(23):6971-6981.
37. Mathews D, Turner D: Experimentally derived nearest-neighbor parameters for the stability of RNA three- and four-way multibranch loops. Biochemistry 2002, 41(3):869-880.
38. Peritz A, Kierzek R, Sugimoto N, Turner D: Thermodynamic study of internal loops in oligoribonucleotides: symmetric loops are more stable than asymmetric loops. Biochemistry 1991, 30(26):6428-6436.
39. Chen G, Turner DH: Consecutive GA pairs stabilize medium-size RNA internal loops. Biochemistry 2006, 45(12):4025-4043.
40. Merino EJ, Wilkinson KA, Coughlan JL, Weeks KM: RNA structure analysis at single nucleotide resolution by selective 2'-hydroxyl acylation and primer extension (SHAPE). J Am Chem Soc 2005, 127(12):4223-4231.
41. Wilkinson KA, Merino EJ, Weeks KM: Selective 2'-hydroxyl acylation analyzed by primer extension (SHAPE): quantitative RNA structure analysis at single nucleotide resolution. Nat Protoc 2006, 1(3):1610-1616.
42. Leontis NB, Westhof E: Geometric nomenclature and classification of RNA base pairs. RNA 2001, 7(4):499-512.
