Computational analysis of ribonucleic acid basepairs in RNA structure and RNA-RNA interactions Lai, Daniel 2016

Computational analysis of ribonucleic acid basepairs in RNA structure and RNA-RNA interactions

by

Daniel Lai

B.Sc., The University of British Columbia, 2010

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF

DOCTOR OF PHILOSOPHY

in

The Faculty of Graduate and Postdoctoral Studies (Bioinformatics)

The University of British Columbia (Vancouver)

April 2016

© Daniel Lai, 2016

Abstract

Ribonucleic acids (RNA) are an essential part of cellular function, transcribed from DNA and translated into protein. Rather than a passive informational medium, RNA can also be highly functional and regulatory. Certain RNAs fold into specific structures that give them enzymatic properties, while others bind to specific targets to guide regulatory processes. With the advent of next-generation sequencing, a large number of novel non-coding RNAs have been discovered through whole-transcriptome sequencing. Many efforts have been made to study the structure and binding partners of these novel RNAs, in order to determine their function and roles.

This work begins with a description of my R package R4RNA for manipulating RNA basepair data, the building blocks of RNA structure and RNA binding. The package deals with the input/output and manipulation of RNA basepair and sequence data, along with statistical and visualization methods for evaluation, interpretation and presentation. We also describe R-CHIE, a visualization tool and web server built on R4RNA that visualizes complex RNA basepairs in conjunction with sequence alignments. We then conduct the largest known evaluation of RNA-RNA interaction methods to date, running state-of-the-art tools on curated, experimentally validated datasets. We end with a review of cotranscriptional RNA basepair formation, summarizing biological, theoretical and computational methods for the process, and future directions for improving classical methods in RNA structure prediction.

All content chapters of this thesis have been peer-reviewed and published.
The work on R4RNA has led to two publications, with the package used to great visual effect by various publications and also adopted by the RNA structure database RFAM. My assessment of RNA-RNA interaction is at present the only published evaluation of its kind, and will hopefully become a benchmark for future tool development and a guide to selecting appropriate tools and algorithms. Our published review on RNA cotranscriptional folding has been well received, being the first review specifically on its topic.

Preface

A version of Chapter 2 has already been published: Lai, D., Proctor, J.R., Zhu, J.Y.A. and Meyer, I.M. (2012) R-CHIE: a web server and R package for visualizing RNA secondary structures. Nucleic Acids Research, 40 (12), e95. doi:10.1093/nar/gks241.

I was the main developer of the project, with fellow graduate students Jeff Proctor and Alice Jing-Yun Zhu as the two other main test users of the package before public release. Their feedback was subsequently used to improve and debug the package, with code contributions mainly by myself and Jeff Proctor. Dr. Irmtraud Meyer provided the idea for the package, and the funds and resources required to achieve the end product.

A version of Chapter 3 has already been published: Lai, D. and Meyer, I.M. (2015) A comprehensive comparison of general RNA-RNA interaction prediction methods. Nucleic Acids Research, first published online December 15, 2015. doi:10.1093/nar/gkv1477.

The corresponding analysis and manuscript were done by myself, with ideas for the project, supervision and manuscript editing provided by Dr. Irmtraud Meyer.

A version of Chapter 4 has already been published: Lai, D., Proctor, J.R. and Meyer, I.M. (2013) On the importance of cotranscriptional RNA structure formation. RNA, 19(11), 1461-73. doi:10.1261/rna.037390.112.

Dr. Irmtraud Meyer completed the first draft.
Fellow graduate student Jeff Proctor and I addressed reviewer feedback with revisions to the text and figures of the manuscript, sharing first authorship on the final publication.

Table of Contents

Abstract
Preface
Table of Contents
List of Tables
List of Figures
Acknowledgments
1 Introduction
  1.1 RNA biology
    1.1.1 The role of RNA
    1.1.2 RNA chemistry and structure
  1.2 Functional RNA structures
    1.2.1 Structured non-protein coding RNA
    1.2.2 Structured protein-coding RNA
    1.2.3 RNA-RNA interactions
  1.3 Solving RNA structure experimentally
    1.3.1 Primary structure
    1.3.2 Secondary structure
    1.3.3 Tertiary structure
    1.3.4 Caveats
  1.4 Computational prediction of RNA basepairing
    1.4.1 RNA secondary structure prediction
    1.4.2 Energy-based predictions
    1.4.3 Evolutionary-based methods
  1.5 Objectives and contributions
2 Visualization of RNA Basepairs and Alignments
  2.1 Visualizing RNA structure
    2.1.1 Primary structure
    2.1.2 Secondary structure
    2.1.3 Tertiary structure
  2.2 Types of RNA secondary structure visualization
    2.2.1 Dot plots
    2.2.2 Circle and linear diagrams
    2.2.3 Stem-loop diagrams
    2.2.4 Other diagrams
    2.2.5 Conserved RNA structure
  2.3 Materials and methods
    2.3.1 The R-CHIE web server
    2.3.2 Input
    2.3.3 Output
    2.3.4 The R4RNA R package
  2.4 Results and discussion
3 Assessment of RNA-RNA Interaction Prediction Methods
  3.1 Introduction
  3.2 Algorithm strategies
  3.3 Methods
    3.3.1 Energy-based RNA-RNA interaction prediction programs
    3.3.2 Comparative RNA-RNA interaction prediction tools
    3.3.3 Multiple sequence alignment generation programs
    3.3.4 Related works and tools
  3.4 Datasets
    3.4.1 Bacterial sRNA
    3.4.2 Fungal snoRNA
    3.4.3 Multiple sequence alignments
    3.4.4 Performance measures
  3.5 Results
    3.5.1 Minimum free energy results on sRNA dataset
    3.5.2 Suboptimal interaction results on sRNA dataset
    3.5.3 Effect of increasing target sequence size on sRNA dataset
    3.5.4 MFE energy-based results on snoRNA dataset
    3.5.5 Suboptimal energy-based results on snoRNA dataset
    3.5.6 Comparative predictions for sRNA dataset
    3.5.7 Comparative predictions for snoRNA dataset
    3.5.8 Effect of different aligners on predictive performance
    3.5.9 Combining energy-based and comparative results
    3.5.10 Basepair covariation in datasets and background
  3.6 Discussion
    3.6.1 Settings and overfitting
    3.6.2 Performance effects of conservation
    3.6.3 Target size and interaction search space
    3.6.4 Runtime and memory performance
  3.7 Conclusion
4 Cotranscriptional RNA Folding and Prediction
  4.1 Introduction and motivation
  4.2 Experimental and theoretical evidence for cotranscriptional folding
    4.2.1 Directionality of transcription
    4.2.2 Transcription, transcription speed and variations thereof
    4.2.3 Self-interactions including transient RNA structures
    4.2.4 Interactions with other molecules
    4.2.5 Summary
  4.3 Capturing cotranscriptional folding in methods for RNA secondary structure prediction
    4.3.1 Existing methods for RNA secondary structure prediction
    4.3.2 Existing methods for predicting RNA folding pathways
    4.3.3 Ideas for capturing cotranscriptional folding in methods for RNA secondary structure prediction
  4.4 Summary
5 Conclusion
Bibliography
A Supporting Materials
  A.1 Chapter 1 appendix material
    A.1.1 R4RNA function manual
    A.1.2 R4RNA vignette

List of Tables

Table 3.1 RNA-RNA interaction methods used in assessment
Table 3.2 Energy-based MFE sRNA results: default vs optimal settings
Table 3.3 Energy-based sRNA results: MFE vs suboptimal
Table 3.4 Energy-based MFE snoRNA result: short vs full-length input
Table 3.5 Energy-based snoRNA results: MFE vs suboptimal
Table 3.6 Comparative suboptimal sRNA results as a function of %ID
Table 3.7 Comparative suboptimal snoRNA results as a function of %ID
Table 3.8 Energy-based suboptimal snoRNA results: short vs full-length input
Table 3.9 Comparative suboptimal snoRNA result: short vs full-length input
Table 3.10 Combining energy-based and comparative results
Table 3.11 Dataset covariation and conservation scores

List of Figures

Figure 1.1 Ribonucleic acid monomer chemical structure
Figure 1.2 Ribonucleoside chemical structures
Figure 1.3 RNA Watson-Crick and wobble basepairs
Figure 1.4 Yeast tRNA-Phe structure
Figure 1.5 Ribosomal RNA catalytic action
Figure 1.6 Riboswitch translation and transcriptional control
Figure 1.7 RNA-RNA interaction examples: snoRNA and miRNA
Figure 1.8 Bacterial small RNA RyhB binding sodB mRNA
Figure 2.1 Dot plot of RNAFOLD predicted Cripavirus structure
Figure 2.2 Circular diagram of Cripavirus IRES
Figure 2.3 Linear diagram of Cripavirus IRES
Figure 2.4 Stem-loop diagram of Cripavirus IRES
Figure 2.5 Variant stem-loop diagram of Cripavirus IRES
Figure 2.6 Mountain plot of RNAFOLD predicted Cripavirus IRES
Figure 2.7 RALEE plot showing conserved alignment basepairs
Figure 2.8 R-chie single structure arc diagram
Figure 2.9 R-chie double structure arc diagram
Figure 2.10 R-chie overlapping structure arc diagram
Figure 2.11 R-chie single structure covariation plot
Figure 2.12 R-chie single structure lettered covariation plot
Figure 2.13 R-chie double structure covariation plot
Figure 2.14 R-chie overlapping structure covariation plot
Figure 2.15 SAM riboswitch alternative structure by Zhu et al.
Figure 2.16 RMPR structure determined by SHAPE chemical probing by Rogler et al.
Figure 2.17 Representative RB1 wild-type and mutant structures by Kutchko et al.
Figure 3.1 MCC heatmap of MFE energy-based sRNA results: default vs optimal options
Figure 3.2 Clustering energy-based sRNA results
Figure 3.3 MCC heatmap of energy-based sRNA results: MFE vs suboptimal
Figure 3.4 MCC distribution of MFE energy-based sRNA results
Figure 3.5 MCC distribution of suboptimal energy-based sRNA results
Figure 3.6 Energy-based sRNA results as a function of input length
Figure 3.7 MCC heatmap of MFE energy-based MFE snoRNA results: short vs full-length
Figure 3.8 MCC distribution of MFE energy-based snoRNA results
Figure 3.9 MCC distribution of suboptimal energy-based snoRNA results
Figure 3.10 MCC heatmap of suboptimal energy-based snoRNA results: short vs full-length
Figure 3.11 Clustering energy-based snoRNA results
Figure 3.12 Comparative sRNA results as a function of minimum %ID
Figure 3.13 TPR vs PPV density plot for energy-based sRNA results
Figure 3.14 Comparative snoRNA results as a function of %ID
Figure 3.15 Comparative MCC performance across different aligners
Figure 3.16 Comparative TPR performance across different aligners
Figure 3.17 Comparative PPV performance across different aligners
Figure 3.18 RNA-RNA interaction dataset basepair covariation
Figure 3.19 CPU runtimes as a function of input length
Figure 3.20 Memory usage as a function of input length
Figure 4.1 Alternative transient RNA structures in the hok-sok system
Figure 4.2 Hypothetical cotranscriptional folding pathways

Acknowledgments

This work would not have been possible without the time and support of those around me.
Firstly, I’d like to thank my supervisor Irmtraud Meyer for guiding me throughout the degree, and also co-supervisor Paul Pavlidis for his guidance in the final year. I’d also like to thank my supervisory committee members Steven Jones and Martin Hirst for their guidance and feedback through the duration of my degree.

I also extend my thanks to my lab mates in the Meyer and Pavlidis labs, whom I’ve had the pleasure of working alongside throughout these years. Special thanks go to the Meyer Lab members Rodrigo Goya, Jeff Proctor, Alborz Mazloomian, Evan Gatev, Yang Shu, Alice Zhu and Ian Wood.

I’d like to thank the University of British Columbia, the CIHR/MSFHR Bioinformatics Training Program and the Natural Sciences and Engineering Research Council of Canada for their generous funding enabling this research.

Finally, I’d like to thank my family for their encouragement and support.

Chapter 1

Introduction

We start by introducing the concept of basepairing in ribonucleic acids (RNA), its history, biology and significance to molecular biology. We then provide a general history and summary of computational methods developed thus far to predict basepairs, establishing the state of the art in the field.

1.1 RNA biology

1.1.1 The role of RNA

The central dogma of molecular biology proposed by Francis Crick in 1958 hypothesized a flow of information in all living things from DNA to RNA to protein [1, 2], sidelining ribonucleic acids as a transient intermediate state.
By 1961, experimental evidence published for DNA replication [3], RNA transcription [4–6] and protein translation [7] seemed to support this casting of RNA as an “unstable intermediate” [8], although much remained to be discovered regarding RNA’s potential in regulating the steps of the dogma.

In 1961, whilst speculating on the mechanisms of protein induction and repression, François Jacob and Jacques Monod suggested the existence of messenger RNAs (mRNA) transcribed from DNA, which could be under the influence of “regulator genes”, ultimately affecting the observed rates of protein synthesis [9]. At the time of writing, however, they were unsure whether the functional product of regulator genes would be protein or RNA, and whether this product acted upon the DNA gene or the mRNA transcript.

In the following decades of research, the general conclusion seemed to be that proteins were the functional product of regulator genes, producing DNA-binding proteins controlling transcription (e.g. transcription factors) [10], or RNA-binding proteins affecting mRNA splicing and stability [11]. The main role of RNA was slightly expanded beyond messenger RNAs to include ribosomal RNAs (rRNA) [8] and transfer RNAs (tRNA) [1], seen simply as scaffolding and adaptors, respectively, for the protein translation machinery. While cases of RNA-mediated gene regulation were found, they seemed to be the exception rather than the rule, with only eleven naturally occurring cases observed in prokaryotes and none in eukaryotes by 1988 [12].

By 1998, however, the role of RNA in gene regulation would begin to take a huge turn, starting with the publication by Fire and Mello showing that double-stranded RNA could effectively affect gene expression in Caenorhabditis elegans through RNA interference (RNAi) [13]. This was followed in 2001 by demonstrations from the Tuschl group that the same technique could be applied to human cells [14].
RNA-mediated gene regulation is now considered a critical regulatory process in gene expression, with implications in disease etiology, diagnosis, prognosis, and therapy [15–17].

While the Human Genome Project released its first draft of DNA sequences in early 2001 [18, 19], large-scale efforts to characterize all RNA transcripts in mammals were also in full swing. Early results by the FANTOM consortium for mouse utilized expressed sequence tag (EST) and full-length cDNA (complementary DNA, generated from RNA by reverse transcriptase) sequences along with tiling arrays, initially focusing on identifying protein-coding transcripts [20, 21]. The unexpected discovery of a large number of novel non-coding RNAs led to the use of unbiased transcript identification techniques, such as serial analysis of gene expression (SAGE) [22] and cap analysis gene expression (CAGE) [23]. Major findings of these studies include the pervasive transcription of the entire mammalian genome, with the majority of these transcripts being non-coding and arising from intronic and intergenic regions [24].

With the advent of next-generation sequencing (NGS) starting in 2005, DNA could be sequenced at a fraction of the time and cost of previous Sanger-based methods [25–27]. Protocols for RNA-seq soon followed, applying next-generation sequencing technologies to sequencing cDNA from whole transcriptomes, allowing for the high-throughput generation of RNA data [28, 29]. Studies such as the ENcyclopedia of DNA Elements (ENCODE) Project have since shown that while only ∼2% of the genome is protein-coding, >75% of the genome is cumulatively transcribed throughout the human body [30–33], likely containing novel ncRNAs [34–37]. Besides post-transcriptional gene regulation via RNA interference, non-coding RNAs have been shown to be involved in chromatin structure [32, 38], epigenetic regulation [39, 40], RNA splicing [41, 42], and catalysis [43, 44].
With the exact mechanisms, classes and number of RNAs remaining to be determined, the study of RNA shows much promise in yielding novel insights into the function of living organisms neglected for over half a century.

1.1.2 RNA chemistry and structure

An RNA molecule, or ribonucleic acid, is a biomolecule built from monomers that each consist of a phosphate connected to a ribose sugar from which extends a nitrogenous base (Figure 1.1). Multiple RNA monomers can form a linear polymer through covalent phosphodiester bonds between the phosphate of one monomer and the sugar of another, forming a sugar-phosphate backbone off which hang the various bases. The four most commonly used bases in RNA, with their abbreviations, are adenine (A), guanine (G), cytosine (C) and uracil (U) (Figure 1.2). Chemically, C and U are pyrimidines, containing a hexagonal heterocyclic ring made of four carbon and two nitrogen atoms. A and G, on the other hand, are purines, containing the hexagonal pyrimidine ring fused to a pentagonal imidazole ring, which makes them physically larger than the pyrimidines. Differentiating one base from another are the various nitrogen-containing amines and oxygen-containing carbonyls that surround the heterocyclic rings. The series of bases in this linear polymer, often described as a string of letters consisting of A, C, G and U, is known as the primary structure or sequence of the RNA.

When two complementary bases are placed facing each other in the correct orientation, non-covalent hydrogen bonds form between them. Specifically, amines can act as hydrogen bond donors with a relatively positive charge, while carbonyls act as acceptors with a relatively negative charge. In addition to having the right hydrogen bond donors and acceptors, the physical location and orientation of these molecular groups have to be correct, resulting in the very specific complementary Watson-Crick basepairs of A with U, and G with C (Figure 1.3).
In both pairings, we have a smaller pyrimidine pairing with a larger purine, with G-C pairs forming three hydrogen bonds and A-U pairs forming two. Following these same steric and electrochemical considerations, a third wobble basepair of G-U, which forms two hydrogen bonds, is also valid, bringing the list of canonical basepairs to three [45] (Figure 1.3). A description of the primary structure, together with which nucleotides are basepaired and which are unpaired, describes the secondary structure, which can often be visualized as paired stems and unpaired loops on a two-dimensional figure.

Figure 1.1: Two ribonucleic acid monomers, each shown with a ribose sugar (a furanose ring consisting of four carbons and one oxygen atom) from which a nucleotide base extends from the 1’ carbon (guanine and ‘R’), connected by phosphates between the 3’ carbon of one ribose and the 5’ carbon of another. © Warraich Sahib. Retrieved October 09, 2015 from Wikimedia Commons. Used under Creative Commons Attribution-ShareAlike 3.0 Unported License.

While basepaired stems are represented as ladder-like shapes on a two-dimensional figure, in reality they assume a double helix structure much like double-stranded DNA [46, 47]. In the absence of cations, the negatively charged phosphate backbone of RNA helices repels other parts of the RNA, preventing the formation of a compact structure [48]. With the addition of cations such as Mg2+, these positively charged ions bind to specific locations on the RNA, allowing for the formation, via non-covalent bonds, of stable compact three-dimensional RNA structures, or the tertiary structure (Figure 1.4). In the majority of cases, the hydrogen bonds formed as part of the secondary structure are stable enough to remain unchanged upon tertiary structure formation.
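The pairing rules and the notion of secondary structure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical and are not part of R4RNA or any tool discussed in this thesis, and the dot-bracket string used to encode paired and unpaired positions is a common convention in the field rather than something defined in the text.

```python
# Illustrative sketch only; names are hypothetical, not from R4RNA.
# Hydrogen bond counts follow the text: G-C forms three, A-U and G-U form two.
HYDROGEN_BONDS = {
    frozenset("GC"): 3,  # Watson-Crick
    frozenset("AU"): 2,  # Watson-Crick
    frozenset("GU"): 2,  # wobble
}


def hydrogen_bonds(a, b):
    """Return the hydrogen bond count of a canonical basepair, or 0 otherwise."""
    return HYDROGEN_BONDS.get(frozenset((a.upper(), b.upper())), 0)


def is_canonical(a, b):
    """True if bases a and b form one of the three canonical basepairs."""
    return hydrogen_bonds(a, b) > 0


def pairs_from_dot_bracket(structure):
    """Convert a dot-bracket string (assumed notation) into sorted 0-based pairs.

    '(' and ')' mark the two partners of a basepair, '.' an unpaired base.
    """
    stack, pairs = [], []
    for i, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(i)
        elif symbol == ")":
            if not stack:
                raise ValueError("unbalanced ')' at position %d" % i)
            pairs.append((stack.pop(), i))
        elif symbol != ".":
            raise ValueError("unexpected symbol %r" % symbol)
    if stack:
        raise ValueError("unbalanced '(' in structure")
    return sorted(pairs)


# A made-up hairpin: primary structure plus a matching secondary structure.
sequence = "GGGAAAUCC"
structure = "(((...)))"
for i, j in pairs_from_dot_bracket(structure):
    print(i, j, sequence[i], sequence[j], hydrogen_bonds(sequence[i], sequence[j]))
```

In this toy hairpin the three stem positions are two G-C pairs and one G-U wobble pair, while the three A bases form the unpaired loop.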
A fully solved tertiary structure describes the three-dimensional coordinates of each atom in an RNA molecule relative to each other, with varying degrees of resolution (atomic accuracy) depending on the experiment and its purpose.

Having described primary, secondary, and tertiary structure, it is hopefully clear why RNA folding has been described as hierarchical, with each sequential step contingent on the correct formation of the previous one in the majority of known cases [49]. The actual formation of these structures is also highly dependent on the immediate environment of the molecule, with temperature, pH, and ionic concentration of the solution having large effects on primary, secondary and tertiary structure stability. Adding to the complexity in vivo, RNA folding is a highly kinetic process, occurring whilst the RNA is being transcribed, and potentially being influenced intentionally and indirectly by other macromolecules in the cell.

Figure 1.2: The four basic RNA nucleotides, shown here as nucleosides (attached to ribose): (a) Adenosine: Adenine + Ribose; (b) Guanosine: Guanine + Ribose; (c) Cytidine: Cytosine + Ribose; (d) Uridine: Uracil + Ribose. Uracil and cytosine are pyrimidines (six-membered heterocyclic ring containing two nitrogens), while adenine and guanine are purines (fused pyrimidine-imidazole ring system). By NEUROtiker. Retrieved October 09, 2015 from Wikimedia Commons. Public domain image.

1.2 Functional RNA structures

While adopting a structure does not necessarily mean that the RNA will be functional, many functional examples of RNA are due to a highly specific structure. Broadly, we can classify functional structured RNAs into coding and non-coding RNA.
For non-coding structured RNAs, the most well-known examples are most likely tRNA and rRNAs, with other key examples being catalytic ribozymes. For coding RNAs, RNA structure often plays a more regulatory role and is typically found in the untranslated regions of the transcript, but also within the coding region [50]. Structure on coding RNAs has been found to directly influence translation efficiency, RNA localization, and RNA stability [51].

Figure 1.3: Watson-Crick and wobble basepairing formed between pyrimidine-purine pairs: (a) Watson-Crick basepair between Guanine and Cytosine; (b) Watson-Crick basepair between Adenine and Uracil; (c) Wobble basepair between Guanine and Uracil. The three basepairs are shown in order of strength, with the strongest GC pair forming three hydrogen bonds. By Yikrazuul. Retrieved October 09, 2015 from Wikimedia Commons. Public domain image.

Figure 1.4: RNA structure of the yeast phenylalanine tRNA, showing an example of hierarchical primary, secondary and tertiary structure: (a) Primary structure (with blue modified bases) and secondary structure; (b) Tertiary structure with corresponding secondary structure. By Yikrazuul. Retrieved October 09, 2015 from Wikimedia Commons. Used under Creative Commons Attribution-ShareAlike 3.0 Unported License.

1.2.1 Structured non-protein coding RNA

There exists a relatively small number of functional non-protein coding structured RNAs, varying in size and roles.
Some of these are completely independent, some exist in larger ribonucleoprotein complexes, while others are commonly recurring motifs and structures found in a large number of RNA sequences.

Transfer RNA and ribosomal RNA

The earliest studied example of ncRNA is transfer RNA, whose cloverleaf secondary structure was proposed in 1965 [52], paving the way to the full structure that was determined a decade later [53, 54] (Figure 1.4).

Unlike tRNA, which carries out its function independently as a single molecule, ribosomal RNA (rRNA) is an example of a structured RNA that functions within a ribonucleoprotein (RNP) complex. rRNA was initially assumed to be structural while ribosomal protein defined ribosome function [55], but the current understanding is that rRNA is a ribozyme, or catalytic RNA, and is responsible for the chemical reactions that define translation [56] (Figure 1.5). Due to the size and complexity of the ribosome, initially only the secondary structure could be studied, by phylogenetic comparison of primary sequences in the early 1980s [55]. It was not until the year 2000 that high-resolution crystal structures resolved the tertiary structure, which also confirmed the ribosome as a ribozyme [57–59].

Ribozymes

The first discovery of a ribozyme was actually made much earlier, in 1982 [43, 61], when the intron in the rRNA of the protozoan ciliate Tetrahymena thermophila was found to auto-excise in the absence of proteins and other small RNAs. In time, it was shown that the splicing was intrinsic to the RNA structure, and that this structure was found in a wide range of viruses, bacteria and eukaryotic organelles; these are known today as self-splicing Group I introns [62, 63].
Many of the catalytic sites and mechanisms of the structure had been determined and explained using secondary structures [62] within a decade of its discovery, and only 15 years after the discovery was a tertiary structure crystallized and solved [64].

A similar but mechanistically distinct group of self-splicing introns was also discovered in 1982, known today as the Group II introns [65]. These were shown to be self-splicing by 1986, and are excised in a manner similar to eukaryotic nuclear pre-mRNA introns [62, 65]. So far these self-splicing structures have only been found in bacteria and the organelles (mitochondria and chloroplast) of fungi, plants, and protists [66]. As of today, no Group II intron has been found in the nuclear genome of eukaryotes [66]. Crystallized in 2008, the catalytic core of Group II introns has been found to be structurally and functionally similar to certain small nuclear RNAs in the eukaryotic nuclear spliceosome, implying an evolutionary relationship [67].

Figure 1.5: Current understanding of ribosomal peptide-bond formation [60]. (a) Chemical reaction binding two peptides (R1, R2) at the end of tRNAs. (b) Binding of the aminoacyl-tRNA (aa-tRNA: amino acid bound tRNA) to the A site and peptidyl-tRNA (protein bound tRNA) to the P site. (c) Positioning of peptidyl-tRNA A72 2’OH and rRNA A2451 N3 around the α-amino nucleophile. (d, e) Current model of catalysis involving the shuffling of a series of protons around peptidyl-tRNA A72 2’OH. From Schmeing & Ramakrishnan (2009) What recent ribosome structures have revealed about the mechanism of translation. Nature 461:1234-1242. Retrieved October 09, 2015 from Nature Publishing Group. doi:10.1038/nature08403. Used with permission from Nature Publishing Group.

Very recently, it was shown that yeast small nuclear RNAs (snRNAs) within the eukaryotic spliceosome RNP complex are also ribozymes [68], functioning in a manner analogous to Group II self-splicing enzymes.
As with ribosomes, further research and understanding of the molecular chemistry has shown the RNA moieties to be the catalytic component, while proteins are found to be more structural in nature than previously assumed. Given the catalytic potential of RNA, it has thus been hypothesized that these are remnants of an ancient “RNA world” (discussed in detail later) devoid of DNA and protein [69].

Following soon after the discovery of self-splicing introns in 1982, the RNA moiety of the Ribonuclease P (RNaseP) RNP was determined to be the active component and another example of a structured ribozyme [44, 70]. Homologs were found to be ubiquitous in all branches of life, responsible for processing the 5’ leader sequence of precursor tRNA [71], although only those in Bacteria have been shown to work without protein subunits in vitro. The tertiary structure of RNaseP was crystallized and solved nearly two decades later in 2005 [72, 73].

The RNaseP and splicing ribozymes described thus far are specifically metalloenzymes, requiring the presence of specific metal ions that are positioned in specific locations in the catalytic site with direct roles in catalysis [61]. Thus far these have been the most common and abundant types of ribozymes, and it was thought that all ribozymes had to be metalloenzymes.

This was until the final abundant class of ribozymes currently known was discovered in 1986 [74, 75], as an autolytic structural motif found in plant viruses. Now known as the hammerhead ribozyme, these self-cleaving ribozymes can cleave their own phosphodiester backbone in the absence of metal ions under correct conditions.
As with the examples discussed earlier, the three-dimensional structure was unknown until crystallization efforts were successful in the mid 1990s [76, 77], making it the first ribozyme to be successfully crystallized [61].

Riboswitches

Discovered in the early 2000s and characterized by 2002 [78], riboswitches are structures in the 5’ untranslated region of mRNAs that control the expression of the downstream transcript [79] (Figure 1.6). Found predominantly in eubacteria, these structures undergo structural changes when bound to specific metabolites, inhibiting the expression of the gene. The gene expresses a product that is usually part of the biosynthesis or transport pathway of the bound metabolite, thus creating a self-regulating feedback loop to control metabolite levels [80].

A growing list of a dozen or so metabolite-specific classes exists, with a few crystallized examples giving us a clear view of their mechanisms [81]. Generally, a riboswitch has two distinct structural conformations: one allows for translation or transcription of the downstream product, while the other inhibits it, commonly by blocking the ribosome or causing a premature release of the polymerase [82]. Depending on the riboswitch, structures can range from a small pseudoknot to complex multi-stem structures [81].

Figure 1.6: General control mechanisms for riboswitches. For transcriptional control, the unbound state allows the formation of the anti-terminator structure, which prevents the formation of the terminator stem. When bound, the terminator stem prevents the full-length transcription of the mRNA. In translational control, the binding of the metabolite makes the ribosome binding site (RBS) inaccessible, preventing translation of the mRNA. From Tucker & Breaker (2005) Riboswitches as versatile gene control elements. Current opinion in structural biology 15 (3):342-8. Retrieved October 09, 2015 from Elsevier. doi:10.1016/j.sbi.2005.05.003.
Used with permission from Elsevier.

Long non-coding RNAs

Out of the > 60,000 recently sequenced long non-coding RNAs (lncRNAs) in human [83], there exist several well-characterized examples where structural motifs have been strongly suggested to play a part in their function. The most well-studied is the Xist RNA, discovered in the early 90s, and found to play an essential role in X-chromosome inactivation [84]. As the RNA is over 16kb long, its full structure remains to be solved, but studies have shown that secondary structures in specific regions are involved in the recruitment of histone modification proteins leading to gene silencing [85–87]. Other gene silencing secondary structures have been found on other lncRNAs, such as HOTAIR [88], a 2148nt long RNA that silences the HoxD locus involved in epidermal tissue development [89], and ANRIL, which when mutated has been found to be associated with heart disease [90, 91] and cancer [92].

1.2.2 Structured protein-coding RNA

Even for protein-coding messenger RNAs, there have been examples where RNA structures are required for the correct processing of the transcript. Many examples have been reviewed by other works [50, 51], and we highlight a few examples below to demonstrate the role of RNA structure.

Splicing

In an example demonstrated on the cardiac troponin T (cTNT) gene, a stem-loop structure 3’ of intron 4 is targeted by the protein MBNL1, which when bound represses the inclusion of exon 5. In patients with the genetic disorder myotonic dystrophy type 1, (CTG)n repeat expansions in the 3’UTR of the DMPK gene result in CUG repeats in its transcript that form stable RNA stem-loops that improperly sequester MBNL1. The depletion of MBNL1 allows splicing factor U2AF65 to bind the 3’ region of intron 4 on cTNT in the absence of the stem-loop, causing the inclusion of exon 5.
Stabilization of the stem-loop structure has been shown to block U2AF65 binding [93].

In a simpler example, the SMN2 gene has been shown to normally skip exon 7 due to a stem-loop structure 5’ of the exon preventing the recruitment of splicing factor U1. In point mutation experiments, it has been shown that weakening this stem-loop promotes the inclusion of exon 7, and that compensatory mutations restoring the stem-loop restore the skipping [94].

In general, RNA structures have been found to prevent spliceosome assembly by hiding single-stranded splice sites and enhancer binding sites. Alternatively, the same stem-loops can also hide splicing repressor sites, promoting splicing [95]. Genome-wide scans have shown an association between conserved RNA secondary structures and splice-site selection, suggesting many more uncharacterized examples in the human genome [96].

Localization

In yeast, when the protein folding capacity of the endoplasmic reticulum is compromised, the transmembrane Ire1 protein forms foci that recruit the HAC1 mRNA to activate the unfolded protein response. It has been shown that a stem-loop structure in the 3’ UTR containing a conserved bipartite sequence element is necessary and sufficient for the localization of the HAC1 mRNA to these foci, provided the transcript remains untranslated. To ensure the lack of translation, the single intron in HAC1 forms a structure that binds to the 5’UTR, resulting in ribosomal stalling. Once the mRNA is recruited, Ire1 initiates the non-conventional splicing of the HAC1 intron, releasing the ribosomal stalling and allowing the translation of HAC1, which activates the unfolded protein response involving 7-8% of the yeast genome [97].

In a more general case, “RNA zipcodes” located on the 3’UTR are known to target specific mRNAs to subcellular regions in eukaryotic organisms via anchoring and transport.
Among the known zipcodes, a wide variety of lengths, sequences, structures and trans-acting factors are involved, making further generalization difficult [51, 98].

Translation

The presence of RNA structure can often physically stall the progress of ribosomes, preventing translation from completing, as with the intron structure in the HAC1 example above.

Another example is a metabolite-independent “RNA switch” in the human VEGFA 3’UTR that is highly analogous to bacterial riboswitches [99]. Specifically, during normoxia, the cell signaling protein interferon-gamma (IFN-γ) represses the translation of several proteins via the IFN-γ-activated inhibitor of translation (GAIT) complex, and additionally induces the proteasomal degradation of heterogeneous nuclear ribonucleoprotein L (HNRNPL). In these conditions, GAIT binds to the VEGFA switch, causing a conformational change that inhibits translation. In hypoxic conditions, IFN-γ-mediated proteasomal degradation of HNRNPL is blocked by the proteasome inhibitor MG132, increasing the levels of HNRNPL. HNRNPL competes with GAIT to bind the VEGFA switch, out-competing GAIT and changing the switch to a more stable structure that allows VEGFA to undergo translation. Like a standard riboswitch, then, the switch can assume two structures, one conducive to translation and the other not. Instead of the presence of a metabolite causing the conformational change, however, it is the mutually exclusive, stimulus-dependent binding of either GAIT or HNRNPL [99].

Instead of blocking translation, RNA structures are also often used to initiate non-conventional translation by forming Internal Ribosome Entry Sites (IRES), characterized in 1988 [100].
Whereas normal translation begins with the recognition and binding of initiation factors at the 5’-end cap of the mRNA transcript, many mRNAs in viruses and eukaryotes have been found to contain highly structured IRES in the 5’ UTR that allow these transcripts to bypass cap-dependent translation [101]. IRES are highly variable in structure, sequence and size. Depending on the specific structure, which may be unique to the family, species, or even transcript, the ribosome can dock with the IRES without initiation factors or a start codon, and the IRES may even mimic the initiating tRNA [101, 102]. In a very recent work, it was shown that bacterial ribosomes were capable of translating eukaryotic mRNAs using IRES, despite billions of years of divergence since the last universal ancestor [103].

Yet another mechanism seen in viruses to initiate non-conventional translation is ribosome shunting, characterized in the early 90s [104]. While a ribosome scans the transcript to start translation, it encounters a specific secondary structure causing it to dissociate and continue downstream of the structure. The region bypassed is often inhibitory to scanning due to strong secondary structure or multiple start codons [105].

Degradation

RNA structures have also been found to be required for the targeted degradation of specific transcripts. In an example found in yeast, the transcript of the RPS28B gene forms a conserved hairpin structure on its 3’UTR. In the presence of abundant Rps28b protein, a self-regulating loop occurs when Rps28b binds the hairpin loop, causing the recruitment of decapping machinery that leads to the degradation of the transcript [106].

In humans, many examples of single nucleotide polymorphisms (SNPs) causing a change in RNA secondary structure that results in a decrease of mRNA stability have been found [50, 51]. A specific example is observed in the catechol-O-methyltransferase gene, where three haplotypes exhibit three different structures, correlating with different levels of pain sensitivity.
It has been shown that the more stable the resulting structure is, the lower the expression of the gene, correlating with higher pain sensitivity [107].

Conclusion and the RNA World

It has been hypothesized that ribozymes and riboswitches may be ancient remnants of an RNA World [108], where RNA was the sole propagator of genetic information independent of DNA and proteins. First proposed by Carl Woese in 1967 in The Genetic Code along with publications by Crick [109] and Orgel [110], it has been hypothesized since the 60s that modern life may have roots in a world where RNA was both the carrier and propagator of genetic information. The theory postulates that over time, DNA took over the role of information storage, being more chemically stable, and proteins came to dominate the catalysis of reactions and regulation of systems [108, 111]. Whether RNA spontaneously started life (RNA-first) or whether some alternative system led up to RNA remains to be determined [112, 113].

Using the basic building blocks of basepairs and tertiary interactions, RNAs can form structures varying from simple hairpins to complex multidomain structures. Functional structures are found in all branches of life across both coding and non-coding transcripts. At times, they simply serve as passive but specific binding targets for protein complexes to initiate other processes such as degradation and localization. RNA structures can also be highly active, such as riboswitches regulating gene expression and ribozymes actively ligating, cleaving and splicing.
A growing list of synonymous and untranslated mutations associated with diseases is also being explained by misfolded RNA, giving further motivation to determine and study RNA structures.

Whole-transcriptome probes have shown an abundance of previously unstudied structures, promising that the study of RNA structures will be an active field of research in years to come [50]. While solving the tertiary structure remains the gold standard for RNA structure determination, the study of secondary structure can be done in a much more cost- and time-efficient manner. At the level of secondary structures, computational predictions can greatly aid and speed up the search for structures, highlighting the importance of and need for fast and accurate basepair prediction algorithms.

1.2.3 RNA-RNA interactions

While we have so far focused only on RNA structure, where basepairing occurs intramolecularly, there also exist functional instances of intermolecular basepairing, or RNA-RNA interactions (RRI). As in RNA structure, basepairs are the building blocks of RRI, serving as the stabilizing force in some examples while only serving as sequence-specific adaptors in others. A small but highly common example of an RRI is the specific binding of tRNA anti-codons to mRNA codons in the translation process. The consistency of Watson-Crick basepairs allows the unambiguous binding of codons and anti-codons in the first two bases, while the flexibility of the third wobble base allows for a single tRNA anti-codon to bind to multiple codons. Like RNA structures, RRI can vary greatly in sequence, size, structure and function, and we present a few known examples below.

snRNA

Previously mentioned as structured catalytic non-coding RNAs, small nuclear RNAs (snRNAs) involved in splicing also undergo RRI within the spliceosome in eukaryotes. Multiple snRNAs contained within ribonucleoproteins bind specifically to the donor site, branch point and acceptor sites on the mRNA sequence [114].
Multiple interactions between the snRNAs also form a complex structure that enables the splicing to occur, consisting of short intermolecular basepairs [115].

snoRNA

While only a few dozen snRNAs exist in the cell, more than 200 unique small nucleolar RNAs (snoRNAs) are present in cells, and are one of the largest groups of trans-acting ncRNAs currently known [116]. Split into C/D box and H/ACA box RNAs, they bind to specific sites on the unprocessed rRNA transcript to guide RNA methylation (C/D box) or pseudouridylation (H/ACA box) events (Figure 1.7). Found in all branches of life, snoRNAs are essential for the proper functioning of the ribosome [116].

C/D box and H/ACA box snoRNAs each have short consensus sequence motifs and specific RNA structures, recognized by proteins and contained within RNPs. The unique sequences of the snoRNAs act as adaptors, allowing catalytic RNP complexes to be targeted to specific rRNA modification sites [116, 117].

Small ncRNAs

Perhaps the most recent and well-known examples of functional RRI found in eukaryotes are microRNAs and small interfering RNAs involved in RNA interference (Figure 1.7, bottom). MicroRNAs (miRNAs) are short RNAs roughly 22 nucleotides in length that act as adaptors to target protein complexes to specific mRNA sites [119]. Transcribed as a long hairpin, the pre-miRNA hairpin is cleaved by RNase proteins Drosha and Dicer in the nucleus and cytoplasm respectively, and eventually loaded into an RNP complex known as the RNA-induced silencing complex (RISC).
The RISC can then specifically basepair with complementary binding sites on mRNAs, causing translation repression or degradation [119, 120].

Small interfering RNAs (siRNAs) are a highly similar system, also incorporated into the RISC, but with RNA sequences originally from exogenous sources, such as viruses in the cytoplasm. Dicer processes foreign transcripts into short sequences of roughly 21 bases, which are loaded into RISC, bind to complementary sequences and trigger RNA cleavage and degradation [119, 121]. A third class of small ncRNAs is the Piwi-associated RNAs (piRNAs), which are 24 to 30 bases long and go through a related but unique set of processing enzymes to target and suppress transposable elements in germline cells [119].

In all three types of these small ncRNAs, the RNA acts as an adaptor to specifically bind intended targets while the protein complex carries out the function. In miRNA and siRNA, the first seven or so nucleotides have been found to be seed regions which are required to perfectly bind with the target, although a sequence having a perfectly complementary seed region is not necessarily a target [122]. Plants have similar pathways for miRNA and siRNA, but these involve different processing proteins and functions as reviewed in [123].

Figure 1.7: Three common RNA-RNA interactions from a review by Meyer [118]. As part of a ribonucleoprotein complex, C/D box snoRNAs and H/ACA snoRNAs bind to specific rRNA sites to enable highly specific RNA base modifications. The snoRNAs adopt highly specific intramolecular structure in addition to forming intermolecular interactions with rRNA. In contrast, miRNA does not adopt any complex structure, and forms highly complementary binding with target sites on mRNA. From Meyer (2008) Predicting novel RNA-RNA interactions. Current Opinion in Structural Biology 18 (3):387-393. Retrieved October 09, 2015 from Elsevier. doi:10.1016/j.sbi.2008.03.006. Used with permission from Elsevier.

Small RNAs

Small RNAs, or sRNAs, are non-coding regulatory RNAs found in bacteria, shown to bind to the translational start sites of mRNAs, controlling the stability and translation of their targets (Figure 1.8). At 40 to 400 nucleotides in length [124], sRNAs do not simply bind in zipper-like fashion to the mRNA across its entire length like the small ncRNAs described above. Instead, sRNA-mRNA interactions vary significantly in stability and length, modulated by existing RNA secondary structures on both the sRNA and mRNA strands. Thus, the identification of the functionally relevant interaction serves as a challenging and relevant problem in RNA-RNA interaction prediction.

Figure 1.8: Example RNA-RNA interaction between bacterial small RNA RyhB and sodB mRNA modulated by the Hfq chaperone protein. The sodB mRNA start codon (green) is normally protected in a small stem-loop (left), but becomes exposed when Hfq binds the RNA, causing a structural change. The exposure of the sodB start codon allows the sRNA RyhB to bind, stopping translation. The new RNA-RNA complex becomes translationally inactive and susceptible to degradation. From Geissmann & Touati (2004) Hfq, a new chaperoning role: binding to messenger RNA determines access for small RNA regulator. The EMBO Journal 23:396-405. Retrieved October 09, 2015 from John Wiley and Sons. doi:10.1038/sj.emboj.7600058. Used with permission from John Wiley and Sons.

1.3 Solving RNA structure experimentally

1.3.1 Primary structure

The primary structure or sequence of the RNA can be experimentally determined by the sequencing of complementary DNA using classical Sanger or modern high-throughput technologies [27, 125]. As RNA is typically too unstable to sequence directly, it is common practice to obtain the complementary DNA through the usage of reverse transcriptase.
With the DNA sequence of interest, di-deoxy or Sanger sequencing [126] can be done manually on a gel for small-scale experiments. For larger experiments, various commercial technologies and protocols are available for fast and cost-effective large-scale experiments [127].

1.3.2 Secondary structure

Rough secondary structure information can be obtained via RNA probing and footprinting, involving chemical modification and enzymatic cleavage of the structure, followed by subsequent identification of affected bases via sequencing of the treated RNA [50, 128]. A concrete example is the usage of the RNase V1 ribonuclease (discovered in Caspian cobra venom [129]), which specifically targets and cleaves helical basepaired structures. After a controlled incomplete V1 digestion of a homogeneous solution of the structure of interest, analyzing the products and determining cleavage sites can give a rough estimation of which bases were paired, but not their basepairing partners [50, 130]. Enzymes that target unpaired bases include RNases I, T1 and A, and S1 nuclease; chemicals that do so include DMS (targets adenosine and cytosine), CMCT (uridine), kethoxal (guanine), NMIA (unbiased) and Pb2+ (unbiased cleaving). The nucleases cleave RNA sites as expected, while the chemicals typically modify specific unpaired nucleotides in addition to reaching locations sterically inaccessible by the larger enzyme complexes. For reagents that target unpaired structures, the absence of a reaction can mean that the region is paired, but the region may simply be inaccessible due to structural location or the presence of other macromolecules.
Depending on reagents and protocol, RNA structure probing can be done in vitro and in vivo (DMS, Pb2+), and coupled with high-throughput sequencing techniques for whole-transcriptome RNA structure information [50].

1.3.3 Tertiary structure

The gold standard for structure determination is solving the tertiary structure, with the most common methods for RNA being X-ray crystallography [131] and nuclear magnetic resonance (NMR) spectroscopy [132]. X-ray crystallography requires the non-trivial task of crystallizing a purified sample of the RNA/RNP of interest, but once done, enables atomic resolution of large structures such as the ribosome [57–59]. In NMR spectroscopy, no crystal is required, and a homogeneous liquid solution of the RNA of interest is sufficient, although the resolution (atomic clarity) of the structure decreases as the input sequence length increases, with high resolution structures limited to sequences of 50 bases or less [133]. Other methods include small-angle X-ray scattering (SAXS) [134], which uses X-ray scattering like X-ray crystallography and does not require a crystal, but does require access to one of the limited number of synchrotron beam facilities globally [135]. Cryo-electron microscopy (cryo-EM) has also been used to determine external shapes of RNA structures, done by the rapid freezing of a thin film of homogeneous molecules in solution, followed by visualization via an electron microscope and image processing [136, 137]. Finally, a recent alternative to the standard NMR process called solid-state NMR (ssNMR) has been demonstrated to be applicable to solving RNA structure; it does not require a crystalline sample and can address molecules far beyond the size possible for standard solution-based NMR [138].

1.3.4 Caveats

Depending on the technique, the RNA may need to be modified, such as removing highly unstable regions to help crystal formation, or isolating a subunit of a larger structure to fit within experimental limitations.
In addition, the need to extract, purify, and amplify the RNA of interest prior to structure determination means that the structure may not be representative of the functional structure in vivo. This is especially true if the native structure of the RNA depends on specific RNA-binding proteins or ATP-dependent helicases [139], or is in reality one of many possible structures assumed in the cell, as in the case of riboswitches. In vitro folding often occurs with fully synthesized molecules, purposely purified, amplified, and made to fold in a controlled solvent medium. In contrast, in vivo folding typically occurs during transcription (i.e. cotranscriptionally), whilst the nascent transcript emerges from the polymerase into the cellular milieu [140–142]. The incremental addition of bases during transcription means that local structures can form on the emerging 5’-end, which may induce or prevent the formation of specific basepairs with the 3’-end when it later emerges [141–143]. Other biomolecules like proteins and RNAs in the cell can also bind and process RNAs cotranscriptionally, adding another layer of complexity to RNA folding in vivo [144].

1.4 Computational prediction of RNA basepairing

1.4.1 RNA secondary structure prediction

While solving the tertiary structure remains the gold standard for studying RNA structure, in practice studying novel functional RNAs on the level of secondary structure basepairs is often a highly productive alternative to waiting for a tertiary solution. In many of the examples listed for functional RNAs previously, the tertiary structure remains unsolved, with many of those that have solutions coming decades after initial discoveries. Due to the hierarchical folding of RNA, functionality and structures solved on the basepair level are almost always present and applicable to the tertiary structure, in contrast to proteins [49].
As basepairing rules are consistent and the main stabilizing force in many functional RNA structures, the prediction of RNA secondary structure through basepairing rules remains an important tool for experimentalists.

Due to the time and cost associated with determining the secondary and tertiary structure of RNAs experimentally, efforts have been made since the 70s to develop computational approaches that predict RNA structures in silico given a primary sequence. Due to the hierarchical nature of RNA structure formation, the accurate prediction of secondary structure is an active area of research required for the accurate prediction of tertiary structure [49]. Generally, computational approaches to predicting RNA secondary structure (henceforth referred to simply as structure) can be classified as those that are energy-based and those that are comparative or evolutionary-based, as discussed below.

1.4.2 Energy-based predictions

Conceptualized in the early 70s, energy-based methods operate on the assumption that the most stable structure taken by an RNA sequence at thermodynamic equilibrium is the unique functionally relevant one [145]. The RNA structure is imagined to be built from individual components such as stabilizing basepairs and destabilizing unpaired loops, bulges, and ends, each having a unique energy contribution which is summed to obtain the overall energy. The two challenges are then defining the set of components and their energy contributions (the energy model), and obtaining an efficient algorithm to explore all relevant structures.

Energy models have gone through various revisions, differing due to experimental setup and structural elements, with the most commonly used set referred to as the Turner model [146]. Most importantly, it was found that using the energy contribution of basepair stacking (two basepairs stacked together) gave much more accurate predictions than energy contributions from single basepairs.
Recently, another set of parameters was also determined by Andronescu et al., by using machine learning to reverse engineer energy contributions for structural elements given solved structures with known overall energy levels [147, 148].

Early developments in finding an algorithm to efficiently explore the structure search space began with brute-force combinatorial approaches starting in 1975 [149], which do not scale well past several hundred basepairs. A breakthrough came a few years later with the usage of recursive dynamic programming algorithms [150], which can easily determine the single best structure for sequences up to kilobases in length. When combined with stacking energy models by the late 80s [151], the modern RNA folding “Zuker algorithm” came into being, and has been the dominant algorithm used for RNA secondary structure prediction.

The most popular energy-based methods, such as MFOLD [152] and RNAFOLD [153] from the Vienna RNA Package, are so-called minimum free energy (MFE) algorithms that utilize dynamic programming and are guaranteed to determine the structure with the lowest energy in a time- and space-efficient manner [154, 155]. Such algorithms take advantage of the fact that a stable overall structure is composed of stable sub-structures, and thus the algorithm can safely ignore highly unstable sub-structures to save on computation and re-use calculations for sub-structures reoccurring in multiple candidate solutions.

While MFE methods are efficient and effective, in practice the accuracy of predicted MFE structures falls as the length of predicted sequences increases beyond a hundred bases, which has been attributed to a few reasons [156]. Pseudoknotted basepairs are simply ignored, as including pseudoknots means that a structure cannot be split into independent discrete substructures, violating a required property for fast dynamic programming algorithms.
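The substructure decomposition described above can be sketched with a deliberately simplified stand-in for MFE folding. The example below maximizes the number of nested basepairs (a Nussinov-style recursion) instead of minimizing free energy; it is not the Zuker algorithm and uses no Turner parameters. The function name max_nested_pairs and the min_loop parameter are illustrative assumptions rather than part of any cited tool.

```python
def max_nested_pairs(seq, min_loop=3):
    """Maximum number of nested (pseudoknot-free) basepairs in seq.

    Simplified illustration only: every pair scores 1, whereas real MFE
    methods sum experimentally derived stacking and loop energies.
    min_loop is the minimum number of unpaired bases enclosed by a pair.
    """
    # Watson-Crick pairs plus the G-U wobble pair.
    valid = {("G", "C"), ("C", "G"), ("A", "U"), ("U", "A"),
             ("G", "U"), ("U", "G")}
    n = len(seq)
    if n == 0:
        return 0
    # best[i][j] = optimal score for the subsequence seq[i..j].
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            # Case 1: position j is unpaired.
            score = best[i][j - 1]
            # Case 2: j pairs with k, splitting the problem into two
            # independent subproblems (left of k, and enclosed by k..j).
            # This independence is what pseudoknots would violate.
            for k in range(i, j - min_loop):
                if (seq[k], seq[j]) in valid:
                    left = best[i][k - 1] if k > i else 0
                    score = max(score, left + 1 + best[k + 1][j - 1])
            best[i][j] = score
    return best[0][n - 1]


print(max_nested_pairs("GGGAAAUCC"))
```

The three nested loops over span, i and k give cubic time, and the table gives quadratic memory, in the sequence length. Swapping the unit score for an energy model would change what is optimized but not the shape of the recursion.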
Discussed in more detail in the following chapter, pseudoknots are a prevalent structural motif formed by non-nested basepairs, found in nearly every organism with many examples found in functional RNAs [157]. Algorithmically, the classical pseudoknot-free MFE methods run in O(n^3) time and O(n^2) memory, where n is the sequence length [152]. Without resorting to heuristics, pseudoknot MFE methods increase the complexity to O(n^6) time and O(n^4) memory, quickly becoming intractable for longer sequences of interest [158].

Current energy models fail to account for the complexity of folding in vivo, effectively simulating folding in vitro. Given the potential kinetics and interactions during folding, the theoretical MFE structure may simply not form, due to molecules being trapped in metastable structures during folding or as a result of in vivo interactions [156]. In attempts to alleviate this, methods that explicitly simulate kinetic folding [159–161] and incorporate cotranscriptional effects into the standard MFE energy model [162] have been developed.

1.4.3 Evolutionary-based methods

The comparative approach to structure prediction is based on the observation that if an RNA and its structure are important for some biological function, they are likely to be conserved, with tRNA and rRNA being the classical examples [163]. Thus, given a set of aligned homologous RNA sequences (typically determined by primary sequence conservation), we should be able to identify their shared structure using signals of conservation [163]. The key signal of basepair conservation is so-called covarying bases that are the result of compensatory mutations. These bases appear as two positions in the multiple sequence alignment where the bases form valid basepairs in all species, but the primary sequence differs between species. This is hypothesized to arise from a mutation on one side that negatively impacts the structure, followed by a compensatory mutation on the other side to restore the basepair [164].
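The covariation signal described above can be illustrated with a small sketch. Mutual information between two alignment columns is used here as one common covariation measure; this choice is an assumption for illustration, not necessarily the score used by the cited tools. The function names mutual_information and all_pairs_valid are hypothetical, and gaps are not handled.

```python
from collections import Counter
from math import log2


def mutual_information(col_i, col_j):
    """Mutual information (in bits) between two alignment columns.

    col_i and col_j hold one base per species, in the same species order.
    Conserved or independent columns score 0; columns whose bases change
    together score higher. Phylogenetic distance is ignored.
    """
    n = len(col_i)
    freq_i = Counter(col_i)
    freq_j = Counter(col_j)
    freq_ij = Counter(zip(col_i, col_j))
    total = 0.0
    for (a, b), count in freq_ij.items():
        p_ab = count / n
        total += p_ab * log2(p_ab / ((freq_i[a] / n) * (freq_j[b] / n)))
    return total


def all_pairs_valid(col_i, col_j):
    """True if every species has a Watson-Crick or G-U wobble pair."""
    valid = {("G", "C"), ("C", "G"), ("A", "U"), ("U", "A"),
             ("G", "U"), ("U", "G")}
    return all((a, b) in valid for a, b in zip(col_i, col_j))


# Toy columns from six hypothetical species: the pair changes from
# G-C to A-U to C-G, so the sequence differs but the basepair is kept.
left, right = "GGAACC", "CCUUGG"
print(mutual_information(left, right), all_pairs_valid(left, right))
```

Scoring every pair of columns this way yields a list of candidate basepairs, which may still conflict with one another and must be resolved into a single valid structure separately.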
Computationally, then, the idea is to calculate the correlated evolution (e.g. covariation) between all potentially basepaired alignment columns, and return the set of basepaired positions that show significant non-random covariation.

Early comparative algorithms implement the idea just as described, computing covariation scores for all possible position pairs, but ignore the phylogenetic distance between the species from which the RNA sequences are derived, and return a set of individual basepairs. These, however, may contain conflicting (i.e. mutually incompatible) basepairs that cannot be combined into a valid RNA secondary structure that could form in real life [165]. Modern implementations utilizing phylogenetic stochastic context-free grammars (phylo-SCFG) resolve both problems by scoring mutations between alignment columns as a function of phylogenetic distance, and by using dynamic programming to determine a pseudoknot-free structure that optimizes the overall conservation score [166]. Scanning versions of phylo-SCFGs [167] have been successfully applied to discover conserved ncRNAs in vertebrate [168] and insect [169] whole-genome alignments.

Given a multiple sequence alignment of sufficient quality and divergence, comparative (and hybrid) methods have been shown to significantly outperform MFE methods in terms of sensitivity and specificity for known basepairs [170]. However, alignments that fail to properly align conserved structures severely impact the performance of comparative RNA prediction tools, which in practice limits the applicability of these tools depending on the sequencing, homologs, and conservation available for a sequence of interest. Described as a “chicken-and-egg” problem, it is difficult to predict a structure without a good structure-based alignment, but obtaining a good structure-based alignment is difficult without knowing the structure beforehand, as basepair covariation often leads to primary sequence divergence [171].
Certain tools that simultaneously fold and align structures to alleviate this problem have been developed [171–174], but they rely heavily on heuristics to remain tractable as input lengths increase.

1.5 Objectives and contributions

The remainder of my thesis consists of three chapters, primarily composed of four published manuscripts. Having described RNA structure determination and prediction in this chapter, the following Chapter 2 focuses on the visualization of RNA secondary structures, which is essential to explaining and analyzing RNA structures. In addition to discussing common forms of RNA visualization, we make our own contributions in the form of the R package R4RNA for visualizing secondary structures with sequence alignment information. In Chapter 3, we discuss the prediction of RNA-RNA interactions, which is highly related to the RNA structure prediction algorithms just described. We summarize the current state of the art in RRI prediction tools, and evaluate the performance of select tools on the largest experimentally validated dataset collected to date. Finally, Chapter 4 conducts an in-depth review of some outstanding issues in computational RNA structure prediction, especially those relating to in vivo effects and cotranscriptional folding touched upon in this chapter.

Chapter 2

Visualization of RNA Basepairs and Alignments

This chapter outlines the foundational software code base that was developed to facilitate the entirety of the research I conducted. Termed R4RNA, this package consists of a collection of functions in the programming language R. The package itself can roughly be split into three function sets. The first set of functions deals with reading, writing and converting the multiple text formats for RNA basepairs, which is essential for reading the unique output formats of various tools. The second set of functions deals with the statistical analysis of basepairs, serving as the basis for much of the numerical analysis in my research.
The final set of functions is responsible for the visualization of basepairs, both for publication-quality figures and for ad hoc graphs used in sanity checks, debugging, interpretation, and analysis.

This work has been published as one of the only RNA analysis packages in R. We also host a web server front-end for the visualization component, also published as part of a larger collection of Meyer Lab web servers:

Lai,D., Proctor,J.R., Zhu,J.Y.A. and Meyer,I.M. (2012) R-CHIE: a web server and R package for visualizing RNA secondary structures. Nucleic Acids Research, 40 (12), e95. doi:10.1093/nar/gks241

Lai,D. and Meyer,I.M. (2014) e-RNA: a collection of web servers for comparative RNA structure prediction and visualisation. Nucleic Acids Research, 42 (W1): W373-W376. doi:10.1093/nar/gku292

While the format input/output and statistical parts of the package are essential and highly useful, they are not of particular intellectual novelty. The manual listing function names and details, and the vignette containing example usage, both released with the R package, are available as part of the Appendix of this thesis.

The programming language R was selected due to its high adoption in the bioinformatics community, driven by the presence of the Bioconductor repository for open-source computational biology and bioinformatics packages [175]. The open-source nature of R also allows users free and easy access to the environment required to run our package and accompanying tools. Specific to this project, the interactive console native to R allows for quick, responsive results when computing RNA structure and visualizing figures for ad hoc analyses.
R's powerful graphing abilities allow users to easily combine our plots with existing R graphs, as seen in a recent work combining our graphs with chemical probing data [176].

The remainder of this chapter focuses on the contributions that this package brings to the field of RNA visualization.

2.1 Visualizing RNA structure

As detailed in the introduction, RNA structure can be described as primary, secondary, and tertiary at the molecular level. As direct observation of RNA molecules is not feasible with current technologies, the next best option is to create graphical representations of RNA molecules for study and analysis. Rather than scaled-up three-dimensional models of RNA, however, these figures are often more abstract, conveying the defining features of primary, secondary, or tertiary structures.

2.1.1 Primary structure

Molecularly, the primary structure describes the linear sequence of covalently bonded nucleotides. When studying and visualizing at this level of abstraction, the relative spatial coordinates and identity of individual atoms are ignored. Since the four common RNA nucleotides can be unambiguously identified, primary structure can be represented as a string of text characters, typically consisting of A, C, G and U, corresponding to the four RNA nucleotides. Also referred to simply as the RNA sequence, this string of nucleotides does not need to be drawn in a straight line, and often is not in secondary structure diagrams.

2.1.2 Secondary structure

The secondary structure of RNA describes everything in the primary structure in addition to the hydrogen-bonded basepairs forming between the nucleotides. Again, the relative spatial coordinates of atoms are not considered at this level of abstraction, thus the only new information is the identity of the pairs. In its most abstract form, the secondary structure can be described as a primary structure sequence, accompanied by a list of paired positions or tuples.
Each tuple is an unordered pair of numbers, each identifying a specific nucleotide by its numerical position in the sequence, and the tuple as a whole indicating that the two nucleotides basepair.

Given a tuple or basepair, based solely on the identity of the paired nucleotides, it can be termed canonical if it forms a stable Watson-Crick or wobble basepair, or non-canonical for any other combination of nucleotides. Given two basepairs, depending on their positions, they can be described as nested basepairs if both positions of one pair fall completely within the range of the other. Consequently, an unnested basepair is any basepair that is not well nested, with only one of its two positions falling within the positional range of another basepair. Physically, a series of well-nested basepairs results in a straight ladder-like stack of basepairs that often forms a helical structure or helix. The helix, also referred to as a stem, ends in an unpaired segment we refer to as the loop, which exists due to limitations in RNA flexibility. Unnested basepairs form pseudoknots, which physically occur when the loop of a helix binds with an unpaired region on the same molecule, at a position beyond the stem that forms it. Technically, a pseudoknot can be defined as two pairs with positions i : j and i′ : j′ whose relative positions are in the order i < i′ < j < j′. Finally, depending on how the list of basepair tuples was constructed, other anomalies can also occur, such as incompatible basepairs, which describe two basepairs that overlap at one or more positions, making them mutually exclusive in reality, as a single base cannot form a canonical basepair with more than one other nucleotide. A specific type of incompatible basepair is the duplicating basepair, where both positions are identical.

For some programs, some of these basepair types are a persistent challenge in secondary structure prediction and planar visualization [177].
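The definitions above translate directly into a few comparisons on position tuples. The following is a minimal sketch rather than code from R4RNA: the function names, the 1-based positions, and the label "disjoint" for two pairs lying side by side are assumptions introduced for illustration.

```python
CANONICAL = {"AU", "UA", "GC", "CG", "GU", "UG"}  # Watson-Crick and wobble pairs


def is_canonical(seq, pair):
    """True if the two 1-based positions in `pair` hold a canonical basepair."""
    i, j = pair
    return seq[i - 1] + seq[j - 1] in CANONICAL


def relation(p, q):
    """Classify the relationship between two basepairs given as position tuples."""
    i, j = sorted(p)
    k, l = sorted(q)
    if (i, j) == (k, l):
        return "duplicate"        # both positions identical
    if {i, j} & {k, l}:
        return "incompatible"     # the pairs share one position
    if i > k:
        (i, j), (k, l) = (k, l), (i, j)  # order the pairs so that i < k
    if k < j < l:
        return "pseudoknot"       # i < i' < j < j'
    if l < j:
        return "nested"           # i < i' < j' < j
    return "disjoint"             # i < j < i' < j'
```

For instance, `relation((1, 5), (3, 8))` is a pseudoknot because position 3 falls inside the first pair while position 8 falls outside it, whereas `relation((1, 10), (2, 9))` is nested.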
In the classical “stem-loop” diagrams that display secondary structures on a plane, pseudoknots often result in undesired overlapping elements (Figure 2.4), which require stretching and manual adjustment of stems to display [178]. In contrast to primary structure, many different and unique methods of displaying secondary structure information exist in various publications.

2.1.3 Tertiary structure

The tertiary structure of RNA describes the full three-dimensional structure of the molecule, specifically the relative spatial coordinates of all the atoms. Visually, this is often represented as a three-dimensional model, with different levels of abstraction: from figures showing balls of electron cloud for each atom, to more abstract figures showing just the backbone and secondary bonds in three-dimensional space. In reality, this is still highly abstract, as it fails to capture the environment and dynamics experienced by the molecule, which may change as a function of time and conditions in vivo.

2.2 Types of RNA secondary structure visualization

Due to experimental and computational limitations, it is highly unlikely for accurate tertiary structure information to be available for the majority of RNAs, especially those newly discovered. The next best option is to work at the level of secondary structure, knowledge of which will typically be directly applicable to the full tertiary structure if it is ever determined. Thus, a significant amount of research has been put into the generation and display of RNA secondary structure.

2.2.1 Dot plots

As secondary structure is defined as a list of tuples, after a textual list of numbers, the next simplest form would arguably be a dot plot [179, 180]. A dot plot consists of an N×N grid for an RNA sequence of length N, and a filled-in cell or “dot” at x,y denotes the existence of a basepair between positions x and y (Figure 2.1).
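As a small illustration of the grid just described, the sketch below renders a list of basepair tuples as a text dot plot. The function name and the use of "#" for a filled cell are assumptions for illustration; it plots discrete pairs only, with no scores attached to the dots.

```python
def dot_plot(n, basepairs):
    """Return an n-by-n text grid with '#' at (x, y) for every basepair (1-based)."""
    grid = [["." for _ in range(n)] for _ in range(n)]
    for x, y in basepairs:
        grid[x - 1][y - 1] = "#"
        grid[y - 1][x - 1] = "#"  # tuples are unordered, so mirror across the diagonal
    return ["".join(row) for row in grid]


# Two pairs that share position 1 can both be drawn on the same grid
for row in dot_plot(4, [(1, 4), (1, 3)]):
    print(row)
```

Because every cell is independent, pairs that share a position each keep their own dot.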
While not referred to as dot plots, examples are seen as early as 1971 [145], used to predict secondary structures in RNA. While not always the most visually aesthetic, the dot plot does have the benefit of being able to show mutually exclusive basepairs that would not be able to form simultaneously in reality. In popular packages such as the Vienna RNA Package [181], dot plots are often used to display the results of suboptimal predictions, with each dot displaying the probability that a basepair will form instead of a discrete all-or-nothing solution.

2.2.2 Circle and linear diagrams

Another early method of visualizing RNA secondary structure is the circular format, seen in one of the earliest works of structure prediction in 1978 [183]. In a circle diagram, the primary sequence is laid out in a circular pattern, and a chord is drawn between two positions to represent a basepair between them (Figure 2.2).

A highly related format is the linear diagram, which is effectively a circular diagram that has been unrolled, resulting in the primary sequence being laid out in a straight horizontal line (Figure 2.3). Whereas basepairs are mathematical chords in the circular format, here they are often drawn as rectangular arcs [184] or elliptical arcs [185, 186].
Regardless of exact shape, the arc represents a single basepair, with its two ends connected to the two bases on the primary sequence that pair in the structure.

For both circular and linear plots, well-nested and unnested basepairs are easily distinguishable by the presence of non-overlapping and overlapping arcs/chords, respectively. Conflicting basepairs can also be easily displayed, with two arcs connecting to a single base.

Figure 2.1: Dot plot (left) output by the RNAFOLD web server [181] for the predicted minimum free energy structure of the Cripavirus Internal Ribosomal Entry Site (RFAM RF00458 [182]) shown on the right. Dots at position x,y indicate a base at position x basepairing with a base at position y. Below the diagonal (top left to bottom right) are dots representing the minimum free energy structure, while those above the diagonal indicate those in the thermodynamic ensemble of structures (e.g. suboptimal structures), with dot size proportional to basepairing likelihood. Minimum free energy structure at right coloured by probability from 0 (purple) to 1 (red).

2.2.3 Stem-loop diagrams

The most common computationally generated visualizations of secondary structure are stem-loop diagrams, showing non-pseudoknotted structures in a planar format. Basepaired regions are visualized as ladder-like stems, and unpaired regions are loops, bulges, and loose ends enclosed by, within, or flanking these stems, respectively (Figures 2.4 and 2.5). Manually created stem-loops are seen in the earliest works in RNA prediction [187] in the early 1960s, with automated visualization methods emerging two decades later [188]. Modern implementations of stem-loop diagrams are highly customizable, require minimal user adjustment, and can even visualize pseudoknots with aesthetically pleasing results [178, 185, 186]. Conflicting and duplicating basepairs, however, cannot be displayed in a single diagram.

Figure 2.2: Circular diagram of the Cripavirus IRES by VARNA [186]. The primary structure is seen as a circular line, running 5’ to 3’, starting at the bottom and going clockwise, with each basepair displayed as a chord connecting two positions. Pseudoknots are easy to identify by the intersecting chords.

Figure 2.3: Linear diagram of the Cripavirus IRES by VARNA [186]. The primary structure is seen as a horizontal line, running 5’ to 3’, left to right, with each basepair displayed as an arc connecting two positions. Pseudoknots are easy to identify by the intersecting arcs.

Figure 2.4: Stem-loop diagram of the Cripavirus IRES by VARNA [186], showing basepaired stems and unpaired loops, bulges, and ends. Basepairs connecting loop to loop across the structure are unnested pseudoknotted basepairs, which can be confusing to display in this format.

2.2.4 Other diagrams

Besides the four types of visualization methods listed, many other formats exist, some more common than others.
The dot-bracket or Vienna format is an extremely common method of digitally storing basepair information [153], and can also be easily read to determine secondary structure. In the dot-bracket format, a string of text characters the length of the primary sequence is written, with the period character (dot) used at positions of unpaired bases, and pairs of matching brackets used at basepaired positions (Figure 2.7, SS_cons row at alignment bottom). Unless multiple types of bracket characters are used, it can be impossible to distinguish between nested and unnested basepairs.

Figure 2.5: Variant of the Cripavirus IRES stem-loop diagram in Figure 2.4 by VARNA [186], trading aesthetic quality for informational clarity.

Another format output by the popular Vienna web server is the mountain plot [181], initially defined in 1984 [190] (Figure 2.6). Mountain plots show the primary sequence in a linear format on the x-axis, while the y-axis displays the number of enclosing basepairs (i.e. how deeply nested a basepair is). Starting at the first position and moving along the x-axis, the height on the y-axis increases whenever a basepair starts, decreases when a basepair ends, and remains the same when unpaired bases are encountered.
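Both formats can be sketched in a few lines. The example below assumes a dot-bracket string that uses only round brackets, so pseudoknots are not handled, and it adopts the convention that a paired base is drawn at the height of its own pair; the function names and this height convention are assumptions for illustration, not the behaviour of any particular tool.

```python
def parse_dot_bracket(structure):
    """Return sorted 1-based (i, j) basepairs from a '(', ')' and '.' string."""
    stack, pairs = [], []
    for pos, char in enumerate(structure, start=1):
        if char == "(":
            stack.append(pos)          # remember where a basepair starts
        elif char == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at position {pos}")
            pairs.append((stack.pop(), pos))
    if stack:
        raise ValueError(f"unmatched '(' at position {stack[-1]}")
    return sorted(pairs)


def mountain_heights(structure):
    """Return the mountain plot height (y-axis value) at every position."""
    heights, level = [], 0
    for char in structure:
        if char == "(":
            level += 1                 # climb when a basepair starts
            heights.append(level)
        elif char == ")":
            heights.append(level)
            level -= 1                 # descend when a basepair ends
        else:
            heights.append(level)      # unpaired bases stay on a plateau
    return heights
```

With these definitions, `parse_dot_bracket("((..))")` gives the pairs (1, 6) and (2, 5), and `mountain_heights("((..))")` gives a symmetric climb, plateau and descent.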
In a mountain plot, matching slopes at the same height therefore correspond to helices, and plateaus correspond to unpaired loops at the end of helices or bulges within helices. More abstract formats exist, such as trees [191] and graphs [192] in the strict mathematical sense, but have not seen wide adoption.

Figure 2.6: Mountain plot (left) output by the RNAFOLD web server [181] for the predicted centroid [189] structure of the Cripavirus IRES sequence (right). The three lines correspond to the minimum free energy structure (red), the thermodynamic ensemble of RNA structures (green), and the centroid structure (blue). The entropy graph below shows how different the ensemble structure is from the MFE, indicating how likely it is that the MFE basepair will not form at a given position. Colours on the stem-loop structure indicate basepairing probability from 0 (purple) to 1 (red).

2.2.5 Conserved RNA structure

For any experimentally determined or theoretically predicted RNA structure, one method of evaluation is to analyze the degree of structure conservation between homologous sequences [193]. Strong evidence for RNA structure conservation comes from pairs of compensatory mutations that retain the basepairing ability but change the basepairing nucleotides (covariation). A quick method of visually surveying the quality of a given multiple sequence alignment and a corresponding secondary structure prediction is to highlight covarying pairs of alignment columns [194] (Figure 2.7). Finally, as in standard gene prediction, identifying regions of high primary sequence conservation can also be useful in understanding the potential functional roles of different parts of the RNA sequence.

During the development and evaluation of the comparative helix prediction method TRANSAT [195] in our research group, we had to develop a new method for RNA secondary structure visualization that was able to do the following:

1. Show conflicting basepairs (i.e. basepairs involving the same sequence position)
2. Display primary sequence conservation and basepair covariation with respect to the multiple sequence alignment
3. Simultaneously show two different structures (e.g. from different sources of annotation or prediction)
4. Be aesthetically pleasing and intuitive to grasp

Figure 2.7: A screenshot of RALEE, the RNA ALignment Editor in Emacs [194]. Columns highlighted in the same shade represent stem-forming basepaired regions, with gaps in the highlighting showing mutations that cause loss of basepairing. The structure is seen at the bottom, shown with matching brackets as basepairs and dots as unpaired regions. From Sam Griffiths-Jones (2004) RALEE–RNA ALignment Editor in Emacs. Bioinformatics, 21(2):257-259. Retrieved October 13, 2015 from Oxford University Press. doi:10.1093/bioinformatics/bth489. Used with permission from Oxford University Press.

Our requirement to visualize conflicting basepairs rules out all formats except circle, linear and dot plots. The additional need to simultaneously show several structures in conjunction with a corresponding multiple sequence alignment led us to choose a format where the RNA sequence of interest is shown along a straight line.

While various powerful visualization programs already exist, only a few are actively supported and visualize RNA secondary structure in a linear fashion [185, 186]. Those that do lack the features that we require, as they were not designed to handle conflicting basepairs nor to display multiple sequence alignments simultaneously with a structure.
Finally, the need to create such diagrams in a high-throughput and scripted manner, rather than being restricted by a graphical user interface, made their adoption and modification difficult [185, 186].

In the following, we present a new, highly modified method employing a linear format, which we call arc diagrams [196], to fulfill our four requirements above. In addition, we provide a web server, R-CHIE, which accepts four common secondary structure formats for the quick visualization of data with our method to generate publication-quality figures. For further customization and local use, we also make a corresponding R package [197] called R4RNA available at www.e-rna.org, which leverages the graphical and computational framework of the interactive and easily scriptable language R.

2.3 Materials and methods

2.3.1 The R-CHIE web server

Part of the larger Meyer Lab collection of web server tools and methods at www.e-rna.org, the R-CHIE web server is located at http://www.e-rna.org/r-chie. It provides a simple interface for automatically generating six different types of arc diagrams, with instructions and examples, and is accessible by all major browsers. Descriptions and usage of the six different types of diagrams are as follows:

Single Structure

This is the most basic type of arc diagram, essentially identical to the typical linear diagrams observed in other publications, with the exception of much more powerful graphical options (Figure 2.8). This arc diagram shows the RNA sequence of interest drawn as a horizontal line from 5’ to 3’, left to right, with arcs drawn above the horizontal line. Each arc depicts a basepair of the RNA structure and connects the respective sequence positions involved in that basepair.

For predicted structures, it is not uncommon for individual structural features such as helices or basepairs to be assigned individual scores such as energetic contributions [180] or statistical significance [195].
In order to retain and visualize this valuable information, our method can assign different colours to individual arcs according to their corresponding scores, using palettes obtained from ColourBrewer [198], or those specified by the user. Alternatively, arcs can be coloured completely manually, independent of value, e.g. when certain basepairs or structural features such as pseudoknots are to be highlighted. In addition, basepairs can be filtered by their scores by imposing a lower or upper threshold value.

Figure 2.8: An example of a single structure arc diagram, showing the solved structure of the Cripavirus Internal Ribosomal Entry Site (family RF00458 from the RFAM database [182]). The horizontal line represents the nucleotide sequence, with positions numbered in the background. Each arc represents a basepair between two positions.

Double Structure

A double structure arc diagram is obtained by starting with a single structure diagram and drawing a second structure below the horizontal sequence line (Figure 2.9). Any colouring and filtering options can be applied to the top and the bottom structure jointly or separately.

This type of arc diagram is especially useful when comparing two (perhaps radically different) alternative structures for the same sequence. It is also useful for comparing two similar structures, e.g. derived from two different structure prediction methods, or for comparing a predicted to an experimentally validated structure.

Figure 2.9: An example of a double structure arc diagram, showing the solved structure of the Cripavirus IRES (bottom) and TRANSAT predicted basepairs (top). Predicted basepairs are coloured by P-value.

Overlapping Structure

The type of figure seen in the TRANSAT paper [195] allows us to quickly and visually evaluate the performance of a structure prediction method by comparing the predicted to the known structure (Figure 2.10). To do this best, we aim to simultaneously visualize the sensitivity and the positive predictive value, i.e. specificity, of the prediction method.

Similar to the double structure plot, arcs are seen both above and below the horizontal sequence line. Instead of showing all arcs corresponding to one structure above the horizontal line and all arcs corresponding to the other structure below, however, the first is interpreted as a predicted structure, the second as a reference, i.e. known, structure, and the two structures are overlapped. For this, the algorithm identifies all predicted basepairs that overlap with those of the known structure (i.e. true positives in the performance evaluation), and draws the corresponding arcs above the line, coloured by the score of the basepair in the predicted structure. Any predicted basepair that is not part of the known structure (i.e. a false positive) is drawn below the sequence line, also coloured by its score. Any known basepair that was not predicted (i.e. a false negative) is drawn above the sequence line in black.
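The partition of basepairs described above amounts to a few set operations. The sketch below is an illustration under stated assumptions, not the R-CHIE implementation: it treats two basepairs as overlapping only when both positions are identical, and the function name and dictionary keys are hypothetical.

```python
def overlap_structures(predicted, known):
    """Partition basepairs for an overlapping structure diagram.

    Both arguments are iterables of (i, j) position tuples.
    """
    pred = {tuple(sorted(p)) for p in predicted}
    ref = {tuple(sorted(p)) for p in known}
    true_pos = sorted(pred & ref)    # above the line, coloured by score
    false_pos = sorted(pred - ref)   # below the line, coloured by score
    false_neg = sorted(ref - pred)   # above the line, in black
    return {
        "true_positives": true_pos,
        "false_positives": false_pos,
        "false_negatives": false_neg,
        # fraction of known basepairs that were predicted
        "sensitivity": len(true_pos) / len(ref) if ref else 0.0,
        # fraction of predicted basepairs that are known
        "ppv": len(true_pos) / len(pred) if pred else 0.0,
    }
```

In the resulting figure, sensitivity corresponds to the proportion of arcs above the line that are coloured, and the positive predictive value to the proportion of coloured arcs that lie above the line.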
Basepairs that are part of neither the known nor the predicted structure (i.e. true negatives) are not shown at all.

With a single glance, this type of diagram shows both the sensitivity and specificity of the structure prediction and readily highlights new basepairs that are not part of the known structure and may warrant further investigation.

Creating two overlapping structure diagrams from the predictions of two different algorithms against the same known structure and juxtaposing the resulting diagrams is also an interesting method of comparing and highlighting the differences between two algorithms, as was done extensively in the TRANSAT paper [195].

Figure 2.10: An example of an overlapping structure arc diagram, overlapping the TRANSAT predicted structure and the solved structure of the Cripavirus IRES. The structure shown above the horizontal sequence line is the known structure, drawn in black, or coloured by P-value where correctly predicted by TRANSAT. The arcs below the line represent novel basepairs predicted by TRANSAT that are not found in the known structure. Such a diagram can give a qualitative description of a predicted structure’s performance, where high sensitivity would result in a high proportion of the top helices being coloured, and high specificity would result in a majority of helices being above the line.

Single Structure Covariation

Adding a multiple sequence alignment beneath a single structure arc diagram provides a powerful means of displaying both the secondary structure and the corresponding evidence for basepair conservation and covariation (Figures 2.11 and 2.12). As comparative methods have been shown to be the state of the art when it comes to predictive accuracy, sequence alignments are often used in RNA structure research.
This type of arc diagram is especially useful for evaluating structures given an alignment and vice versa.

For this type of arc diagram, the arcs are drawn on top of the sequence line as usual, while the multiple sequence alignment is shown below as a block of parallel black lines, each representing one sequence of nucleotides (with grey for gaps) from the multiple sequence alignment. Two alignment columns at the base of a single arc represent the two columns of basepairing nucleotides. The corresponding nucleotides are shown in green if they represent a valid canonical basepair, and red if they do not. For bases coloured in green, if there is a compensatory mutation that differs from the most commonly observed basepair in that pair of alignment columns, it is highlighted in blue (dark blue for a double-sided mutation, light blue for a single-sided mutation), similar to existing programs [167, 194, 199].

Figure 2.11: An example of a single structure covariation diagram, showing the solved structure of the Cripavirus IRES projected onto a multiple sequence alignment. Basepaired columns are coloured according to conservation status as indicated by the legend, where Conservation is the absence of mutations, (Double-Sided) Covariation is a basepair where both sides have changed but remain valid, (One-Sided) Covariation is where only one side mutates but the pair remains valid, Invalid basepairs have mutated to non-pairing nucleotides, and Unpaired and Gap are unpaired and gapped positions, respectively.

Figure 2.12: A variation of the plot shown in Figure 2.11, where species and basepairs have been labeled.

Given a structure and a corresponding multiple sequence alignment, the web server automatically applies this colouring, allowing for a quick evaluation of how well (or poorly) the different structural features are supported by covariation and gap patterns.

Given two different ways of aligning the same set of sequences, two arc diagrams of this type can also be used to highlight the effect that alignment quality has on the corresponding structure prediction.

One small caveat is the technical inability of the covariation colouring to be displayed simultaneously for conflicting helices. When faced with conflicting helices, our algorithm makes a greedy decision to select and colour the first helix that it observes in the input. A user can therefore simply rank conflicting helices or basepairs in the input file to ensure that the most dominant features are coloured (see dashed arcs in Figure 2.13).

Additional web server appearance options that users may adjust include displaying sequences as blocks instead of lines, including the nucleotide base on the block, and including the sequence descriptions left of each sequence.
The specific colours used to highlight basepairs in the multiple sequence alignment can also be customized, and one may even completely ignore basepairs and colour the alignment based on nucleotides.

While the structure conservation for one basepair can be summarized in a single numerical value as done for some figures in the RFAM database [182], a coloured multiple sequence alignment as in these arc diagrams retains more detailed information. If desired, however, we also offer the ability to colour arcs according to structure conservation, covariation, and percent canonical basepair.

Double Structure Covariation

Two multiple sequence alignment blocks, one for each structure, are inserted between the top and bottom arcs of a double arc diagram, highlighting the conservation and covariation for each structure (Figure 2.13).

[Figure 2.13 image: double arc diagram over alignment positions 0 to 200. Legend: Conservation, Covariation, One-sided, Invalid, Unpaired, Gap. P-value bins: [0,1e-06], (1e-06,1e-05], (1e-05,0.0001], (0.0001,0.001].]

Figure 2.13: An example of a double structure covariation plot, showing TRANSAT predictions (top) and the known Cripavirus IRES structure (bottom). An extension of the single covariation plot, the only difference is the dashed basepairs, which indicate basepairs that conflict with other more highly ranked (according to P-value) basepairs and are not displayed on the alignment.

An extension of the double structure arc diagram, the double structure covariation diagrams show not only the differences between two structures, but also allow the evaluation of the different basepairs in light of evolutionary evidence. The input multiple sequence alignment is duplicated into two identical blocks.
The top alignment and its covariation annotation refer to the top RNA structure and the bottom alignment to the bottom RNA structure.

To ensure that the colouring remains consistent, the determination of a covarying basepair is relative to the most commonly observed basepair in a given pair of basepaired alignment columns, as mentioned before. In the case where there is a tie for the most common basepair, priority is given to the basepair more commonly observed according to Structure Statistics from the Comparative RNA Web Site [200].

Overlapping Structure Covariation

A natural extension of the arc diagrams mentioned so far is to combine overlapping structure arc diagrams with covariation plots. These are similar to double structure covariation plots, but are drawn according to the same rules as overlapping structure arc plots (Figure 2.14).

[Figure 2.14 image: overlapping arc diagram over alignment positions 0 to 200. Legend: Conservation, Covariation, One-sided, Invalid, Unpaired, Gap. P-value bins: [0,1e-06], (1e-06,1e-05], (1e-05,0.0001], (0.0001,0.001].]

Figure 2.14: An example of an overlapping structure covariation plot, combining the covariation plot with the overlapping structure diagram.

This type of diagram is of great use when evaluating new helices (drawn below the sequence line in overlapping structure arc plots) by providing evolutionary evidence (or the lack thereof) for their existence. The same rules for resolving conflicting helices apply as outlined above for single structure covariation plots, as do the rules for ensuring consistent covariation plot colours.

2.3.2 Input

There are two main types of text input to the web server: those specifying secondary structures and those specifying multiple sequence alignments.

For RNA secondary structure, our method accepts most common output formats: dot bracket or Vienna format [201], MFOLD’s connect format [202], and Gutell’s bpseq format [200]. Additionally, we also accept a variant of Shapiro’s original region table [188] that we refer to as the helix format.
The helix format describes each helix on one line containing the following fields: start position of outer basepair, end position of outer basepair, helix length (in terms of number of basepairs), and helix score. Differing from Shapiro’s definition, we add a header line which includes the length of the sequence, along with any other comments one may want to retain, e.g. the primary sequence. This helix format provides an extremely compact means of representing complex RNA structures, and also allows for unambiguous specification of conflicting, pseudoknotted and overlapping basepairs.

For multiple sequence alignment, our program accepts standard FASTA format [203].

To control the appearance of the resulting figure showing the desired type of arc diagram, a standard options panel is available to automate and fine-tune colouring and filtering of basepairs on the fly.

2.3.3 Output

After jobs are submitted, they are queued in a dedicated computer cluster with up to 100 concurrent jobs processed from all our web servers. After generating the figure, which typically takes on the order of seconds, a static web page rendering the figure is displayed from which the figure can be downloaded in .png or .pdf format. To ensure reproducibility and help with customization, we also output the command sequence required to replicate the figure using a locally installed version of our program.

2.3.4 The R4RNA R package

The plotting functionality of the web server is driven by an R script built on top of an R package called R4RNA, which we make available for offline and local use and which can be downloaded from http://www.e-rna.org/r-chie, released for public use under the GPLv3 license.

Written in R [197] (which is freely downloadable at http://www.r-project.org/ for all major operating systems), the package is capable of producing the same plots as the server with a few interactive function calls, and allows for even more fine-tuned control, automation, customized diagrams and output formats.
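As an aside on the helix format of Section 2.3.2, the following is a minimal Python sketch of how one helix line can be expanded into its individual basepairs. It is independent of R4RNA and is not the package's implementation. It assumes whitespace-separated fields in the order given in the text (start, end, length, score), 1-based positions, and that header lines are handled elsewhere; the names `Helix`, `parse_helix_line` and `expand_helix` are hypothetical.

```python
from typing import List, NamedTuple, Tuple


class Helix(NamedTuple):
    start: int    # position of the 5' base of the outermost basepair (assumed 1-based)
    end: int      # position of the 3' base of the outermost basepair
    length: int   # number of stacked basepairs in the helix
    score: float  # helix score, e.g. an energy or a p-value


def parse_helix_line(line: str) -> Helix:
    """Parse one data line; whitespace-separated fields are an assumption."""
    start, end, length, score = line.split()[:4]
    return Helix(int(start), int(end), int(length), float(score))


def expand_helix(helix: Helix) -> List[Tuple[int, int]]:
    """List every basepair of the helix, moving inwards from the outer pair."""
    return [(helix.start + k, helix.end - k) for k in range(helix.length)]


helix = parse_helix_line("3 20 4 -5.2")
pairs = expand_helix(helix)
print(pairs)
```

Because every helix is stored independently, two lines may describe conflicting or pseudoknotted helices without any ambiguity, which is the property the text highlights.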
The package is especially convenient for bioinformatics research groups that can call functions of our R package within existing programs and analysis pipelines. Our software is well-documented, including a comprehensive manual and instructional examples in the vignette document, both of which are in the Appendix section of this thesis.

2.4 Results and discussion

Here we present a new computational method for visualizing RNA secondary structures in conjunction with corresponding multiple sequence alignments, which can be used either via the web server R-CHIE or offline and locally via the corresponding R package R4RNA. Our method readily creates six different types of arc diagrams which cover numerous useful applications. These range from visualizing the evolutionary evidence for a given RNA secondary structure to comparisons of two RNA secondary structures and performance evaluations of RNA structure prediction methods. The key feature of all six types of arc diagrams is that details that are typically lost in a numerical evaluation are highlighted and can be visually interpreted in a straightforward and intuitive way.

Our method makes several major improvements with respect to existing methods that depict RNA secondary structures in a linear way. These include the colouring of structural features according to their score (e.g. free energy, p-value, log-likelihood), the joint display of an RNA structure with a corresponding multiple sequence alignment which highlights the evolutionary patterns that support the different structural features, and comparison plots which allow the quick visual inspection of sensitivity and specificity of a predicted structure with respect to a reference structure. In addition, all types of arc plots can display structural features that are mutually exclusive or would render the overall RNA structure pseudoknotted.

Our R-CHIE web server and the corresponding R4RNA R package can be freely accessed and downloaded from http://www.e-rna.org/r-chie.
In addition to several examples, we have also generated single structure covariation arc diagrams of all seed alignments in the RFAM database [182], which can be downloaded from our web page. Since our publication, our plots have been used to generate many published figures, in addition to being adopted by the latest version of RFAM as a visualization method for conservation information [204]. A few highlights include the display of riboswitch conformation covariation [205] (Figure 2.15), comparing two structures in light of chemical probing evidence [176] (Figure 2.16), and comparing multiple wild-type and mutant structures [206] (Figure 2.17).

Figure 2.15: The double covariation plot used by Zhu et al. [205] to display the structure and conservation of the two conformations of the SAM riboswitch. From Jing Yun A Zhu & Irmtraud M Meyer (2015) Four RNA families with functional transient structures. RNA Biology, 12(1):5-20. Retrieved October 09, 2015 from Taylor & Francis Online. doi:10.1080/15476286.2015.1008373. Used under Creative Commons Attribution-Non-Commercial 3.0 Unported License.

Figure 2.16: Double structure plot used by Rogler et al. [176] to show the structure of the RNA endoribonuclease RMRP as predicted by SHAPE chemical probing (blue) and previously published covariation-based results (orange). In the ideal scenario, SHAPE values correlate strongly with unpaired regions and weakly with paired regions, which the authors believe strongly suggests the existence of the top alternative structure. From Rogler et al. (2013) Small RNAs derived from lncRNA RNase MRP have gene-silencing activity relevant to human cartilage-hair hypoplasia. Human Molecular Genetics, 23(2): 368-382. Retrieved October 09, 2015 from Oxford University Press. doi:10.1093/hmg/ddt427.
Used with permission from Oxford University Press.

Figure 2.17: Single structure plots used in conjunction with a principal component decomposition scatter plot of suboptimal structures and stem-loop figures to show the results of mutants (purple and yellow) versus the wild-type structure (blue) of the human Retinoblastoma 1 mRNA 5’ UTR. From Kutchko et al. (2015) Multiple conformations are a conserved and regulatory feature of the RB1 5’ UTR. RNA, 21: 1274-1285. Retrieved October 09, 2015 from The RNA Society. doi:10.1261/rna.049221.114. Used under Creative Commons Attribution-NonCommercial 4.0 International.

Chapter 3

Assessment of RNA-RNA Interaction Prediction Methods

In this chapter, I present a comprehensive assessment of the prediction accuracy of currently available RNA-RNA interaction prediction methods. The chapter starts by describing the state of the art in computational RNA-RNA interaction prediction and defining the specific problem of interest. I then proceed to evaluate the largest collection of existing tools against one of the largest collections of experimentally validated interaction data. The work described here has been published as one of the only evaluations and benchmarks of its kind:

Lai,D., and Meyer,I.M. (2015) A comprehensive comparison of general RNA-RNA interaction prediction methods. Nucleic Acids Research, First published online December 15, 2015. doi:10.1093/nar/gkv1477.

3.1 Introduction

A large percentage of the mammalian genome is transcribed into non-coding RNA (ncRNA) [31]. As these ncRNAs may play important regulatory roles in the cell, efforts have been made to functionally annotate these transcripts [32, 207]. Previous research on ncRNAs such as sRNA [208] and miRNA [209] has shown that the identification of RNA-RNA interactions (RRI) between candidate ncRNAs and their targets is a key step to understanding the role of the RNA. Identifying and validating these interactions experimentally, however, can be slow and costly.
To aid the identification of RNA-RNA interactions, a range of in silico methods have been proposed [118].

The prediction of RRIs can be viewed as a direct extension of RNA secondary structure prediction, employing similar theories and algorithms. In both settings, solutions are obtained by determining the set of Watson-Crick and wobble basepairs that correspond to the functionally relevant structure/interaction. Specifically, given two RNA sequences consisting of nucleotides adenine (A), cytosine (C), guanine (G), and uracil (U), determine the optimal set of intermolecular hydrogen bond basepairs between the two sequences.

More complex versions of the problem exist, such as those solving the joint structure, consisting of both the intramolecular basepairs within a single sequence and the intermolecular basepairs. There is also the highly related RNA-RNA target prediction problem, where given a single query RNA sequence and a set of potential target RNA sequences, the task is to find the correct pairing target for the query RNA. Tools solving the basic RRI prediction problem are much more common than those tackling these variations, and correct prediction of the complex variations often relies on correctly predicting the RRI problem first. As such, we will focus on the basic intermolecular RNA-RNA interaction problem given two sequences. In contrast to many previous evaluations, we impose no restriction on the specific type or length of the input RNA, aiming to evaluate RRI tools in a general de novo scenario.

3.2 Algorithm strategies

We compare the predictive performance of 14 published computational methods (11 distinct program binaries) designed to predict interacting basepairs given two input RNA sequences.
To better understand and compare these, we subdivide the RRI prediction algorithms based on their strategies into four types similar to those in other works [208, 210].

The first type concerns itself only with intermolecular basepairs, both during computation and also for the final predicted result. Such algorithms are typically the fastest, having no need to predict intramolecular basepairs that could interfere and restrict certain intermolecular interactions. Ignoring restrictions and interferences, however, is exactly why these tools may incorrectly predict certain interactions where the existing RNA secondary structure needs to be taken into account. Algorithmically, these types of tools usually derive the set of interacting basepairs that maximizes a certain value, commonly the stability of the entire interaction complex as quantified by the overall Gibbs free energy (∆G) of stacking basepairs. We refer to these as “interaction-only” methods; RNADUPLEX [181], RNAPLEX-c [211], RISEARCH [212], and GUUGLE [213] fall into this category. GUUGLE is unique amongst these tools, being the only one that does not compute Gibbs free energies to score optimal interactions, but instead returns all ungapped interactions above a user-specified length, which we include as an absolute baseline for predictive performance.

The second type of method predicts only intermolecular basepairs, but factors in intramolecular interactions during computation, addressing the weakness of the first type. These algorithms utilize the McCaskill partition function algorithm [214, 215] on the single input sequences to predict the pairing likelihood of nucleotides at each position. Thus, the stability of the intermolecular interaction at a specific position is now affected by both the predicted stability of the stacking basepairs, and also how likely the position will be made inaccessible by existing intramolecular basepairs.
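The trade-off between hybridization stability and accessibility can be illustrated numerically. The sketch below is a hedged illustration only, not the scoring code of any evaluated tool: it follows the summation described for RNAPLEX-a in Section 3.3.1 (hybridization energy plus the opening cost on each sequence) and assumes the standard thermodynamic conversion of an unpaired probability into an opening energy, ∆G_open = -RT ln(P_unpaired). All function names and numeric inputs are hypothetical.

```python
import math

GAS_CONSTANT = 0.0019872  # kcal / (mol * K)
TEMPERATURE = 310.15      # 37 degrees Celsius expressed in Kelvin (assumed)


def opening_energy(p_unpaired: float) -> float:
    """Energy (kcal/mol) needed to make a region single-stranded.

    p_unpaired is the probability that the region is already unpaired,
    e.g. as estimated from a partition function calculation.
    """
    if not 0.0 < p_unpaired <= 1.0:
        raise ValueError("p_unpaired must lie in (0, 1]")
    return -GAS_CONSTANT * TEMPERATURE * math.log(p_unpaired)


def total_interaction_energy(hybridization: float,
                             p_unpaired_query: float,
                             p_unpaired_target: float) -> float:
    """Hybridization energy plus the opening cost on both sequences."""
    return (hybridization
            + opening_energy(p_unpaired_query)
            + opening_energy(p_unpaired_target))


# A fully accessible site pays no penalty; a buried site is penalized.
open_site = total_interaction_energy(-10.0, 1.0, 1.0)
buried_site = total_interaction_energy(-10.0, 0.5, 0.001)
print(open_site, buried_site)
```

Under these assumptions, a strongly complementary site can still score worse than a weaker one if it is rarely unpaired, which is the situation interaction-only methods cannot capture.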
We refer to methods of this second type as “accessibility-based” methods, which comprise RNAUP [216], INTARNA [217], and RNAPLEX-a [218].

The third type considers both inter- and intramolecular basepairs with restrictions during both computation and results, outputting a joint structure. The most basic of these are termed “concatenation-based” algorithms, literally concatenating the two input sequences and running the result through classical RNA secondary structure prediction algorithms such as MFOLD [152] and RNAFOLD [153]. The main shortcoming of these methods stems from the classical RNA secondary structure algorithm’s inability to predict unnested basepairs or pseudoknots, which translates to the inability to predict interactions that form on interior loops in the joint structure. PAIRFOLD [219] and RNACOFOLD [181] fall into this category.

The fourth and final type is less well-defined and encompasses all non-concatenation methods that solve the joint structure, with little to no restrictions on interactions. The removal of restrictions often comes at the great expense of runtime performance, so these tools are typically restricted to relatively short input sequences. In this class, we have the program RACTIP [220], made tractable for use on longer sequences by utilizing the technique of integer programming to optimize for runtime performance.

In addition to falling into one of the four categories, tools may optionally take multiple sequence alignments as input for each of the two input sequences. Based on successful RNA secondary structure prediction tools like PFOLD [221] and RNAALIFOLD [222], the addition of well-aligned and sufficiently divergent homologs provides additional information when predicting evolutionarily conserved basepairs. In theory, basepairs that are fully conserved or undergo compensatory mutations to retain the basepaired structure (i.e. covariation) are likely to be more functionally important than unconserved basepairs.
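To make the notions of conservation and covariation concrete, here is a minimal Python sketch that labels each sequence in a pair of basepaired alignment columns. The categories follow the covariation diagram legend from Chapter 2 (conservation, covariation, one-sided, invalid, gap). This is an illustrative simplification and not the R4RNA implementation or the scoring used by any evaluated tool: ties for the most common basepair are broken by first occurrence rather than by external basepair statistics, only "-" is treated as a gap, and the function name is hypothetical.

```python
from collections import Counter
from typing import List

# Watson-Crick and G-U wobble pairs, written as 5' base followed by 3' base.
CANONICAL = {"AU", "UA", "GC", "CG", "GU", "UG"}


def classify_basepair_columns(left: str, right: str) -> List[str]:
    """Label each sequence's basepair for two paired alignment columns.

    left and right hold one character per aligned sequence, read down
    the two columns that are proposed to pair with each other.
    """
    pairs = [(a + b).upper() for a, b in zip(left, right)]
    valid = [p for p in pairs if p in CANONICAL]
    # Reference pair: the most common valid basepair in these columns.
    # Ties are broken by first occurrence (a simplifying assumption).
    reference = Counter(valid).most_common(1)[0][0] if valid else None

    labels = []
    for pair in pairs:
        if "-" in pair:
            labels.append("gap")
        elif pair not in CANONICAL:
            labels.append("invalid")
        elif pair == reference:
            labels.append("conservation")
        elif pair[0] != reference[0] and pair[1] != reference[1]:
            labels.append("covariation")  # both sides changed, still valid
        else:
            labels.append("one-sided")    # one side changed, still valid
    return labels


labels = classify_basepair_columns("GGAGG-", "CCUUAC")
print(labels)
```

In this toy column pair, G-C is the reference, A-U counts as double-sided covariation, G-U as a one-sided change, G-A as invalid, and the last sequence as gapped.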
RNAALIDUPLEX [181] is the multiple sequence alignment version of RNADUPLEX [181], classified as the first interaction-only type. The interaction-only and accessibility-based versions of RNAPLEX can optionally take multiple sequence alignments as input, which we will denote as RNAPLEX-cA and RNAPLEX-aA, respectively. PETCOFOLD [210] belongs to the final complex joint structure category, given two multiple sequence alignments.

Tool | Strategy | Suboptimal | Conservation | Interaction Length | Reference
GUUGLE | Interaction only | Yes | No | Short Local | Gerlach & Giegerich (2006) [213]
RNAPLEX-c | Interaction only | Yes | No | Short Local | Tafer et al. (2008) [211]
RISEARCH | Interaction only | Yes | No | Short Local | Wenzel et al. (2012) [212]
RNADUPLEX | Interaction only | Yes | No | Long Local | Lorenz et al. (2011) [181]
RNAPLEX-cA | Interaction only | Yes | Yes | Short Local | Tafer et al. (2011) [218]
RNAALIDUPLEX | Interaction only | Yes | Yes | Long Local | Lorenz et al. (2011) [181]
PAIRFOLD | Concatenation | No | No | Short Global | Andronescu et al. (2005) [219]
RNACOFOLD | Concatenation | No | No | Short Global | Bernhart et al. (2006) [223]
INTARNA | Accessibility | Yes | No | Short Local | Busch et al. (2008) [217]
RNAPLEX-a | Accessibility | Yes | No | Short Local | Tafer et al. (2011) [218]
RNAUP | Accessibility | No | No | Short Local | Mückstein et al. (2006) [216]
RNAPLEX-aA | Accessibility | Yes | Yes | Short Local | Tafer et al. (2011) [218]
RACTIP | Complex joint | No | No | Long Global | Kato et al. (2010) [220]
PETCOFOLD | Complex joint | No | Yes | Short Global | Seemann et al. (2011) [210]

Table 3.1: RNA-RNA interaction tools evaluated with categories and features. Strategy indicates the broad strategy of the algorithm in terms of prediction and output, described in the Introduction. Suboptimal indicates whether the tool can return suboptimal results in addition to the minimum-free-energy result. Conservation indicates whether it takes alignments as input. Interaction Length roughly describes the style of helices output, with short helices typically not surpassing a dozen or so basepairs and long helices reaching up to several times that in total basepair count. Local interactions are a single interaction with gaps and bulges typically no longer than a few basepairs, while global predictions may span the entire sequence, containing multiple instances of local interactions separated by long regions lacking intermolecular basepairs.

3.3 Methods

3.3.1 Energy-based RNA-RNA interaction prediction programs

The programs used, introduced above, are summarized in Table 3.1, with more algorithmic details and exact settings to follow. In the table, in addition to splitting the tools into the four categories according to their strategy and usage of conservation, we also summarize whether they can output suboptimal results (instead of only an MFE result) and the style of output they give.

Unless stated otherwise, the inputs target.fa and query.fa are two standard FASTA files, each containing one header line followed by an RNA sequence.

Maximal interaction length for the sRNA-mRNA dataset was 60 and is set as such for many tools. This value is set to 25 for the snoRNA-rRNA dataset.

GUUGLE

GUUGLE [213], while not specifically designed to predict RNA-RNA interactions, can very rapidly recover all continuous helices even for very large input sequences. It creates a suffix tree out of the target sequence, which allows for the rapid lookup of complementary pairing locations when given a query sequence, with novel strategies to allow for the G-U wobble base.
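The kind of result GUUGLE reports, namely ungapped helices of at least a minimum length that allow G-U wobble pairs, can be sketched with a brute-force scan. The Python code below is deliberately not GUUGLE's suffix-tree algorithm; it is a quadratic-time illustration of the output only, assuming uppercase RNA sequences, 0-based positions, and antiparallel pairing. The function name and return layout are hypothetical.

```python
from typing import List, Tuple

# Watson-Crick pairs plus the G-U wobble pair, as (query base, target base).
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def ungapped_helices(query: str, target: str, min_len: int) -> List[Tuple[int, int, int]]:
    """Return maximal ungapped helices as (query_start, target_end, length).

    Query position query_start + k pairs with target position
    target_end - k for k in range(length) (antiparallel, 0-based).
    """
    hits = []
    n, m = len(query), len(target)
    for i in range(n):
        for j in range(m):
            # Skip positions that merely continue a helix started further out.
            if i > 0 and j < m - 1 and (query[i - 1], target[j + 1]) in PAIRS:
                continue
            k = 0
            while i + k < n and j - k >= 0 and (query[i + k], target[j - k]) in PAIRS:
                k += 1
            if k >= min_len:
                hits.append((i, j, k))
    return hits


perfect = ungapped_helices("GGGAAA", "UUUCCC", 6)
wobble = ungapped_helices("GG", "UU", 2)
print(perfect, wobble)
```

A scan like this has no notion of bulges, interior loops or energies, which matches the role of such a method as a simple baseline.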
GUUGLE’s inability to recover helices with bulges and loops and its lack of an energy calculation make it ill-equipped to find complex interactions, but we use it to serve as an absolute baseline for accuracy performance.

Version 1.2 obtained from https://bibiserv2.cebitec.uni-bielefeld.de/guugle?id=guugle_view_download

    guugle -d 6 target.fa query.fa > output.txt

    -d  Obligatory option, minimum interaction length to be output

RNAPLEX-c

RNAPLEX-c is also part of the ViennaRNA Package 2.0, designed to be even faster than RNADUPLEX at predicting RNA-RNA interactions, ideally suited for genome-wide applications [211]. Algorithmically, this gain in speed is achieved by further simplifying the Zuker algorithm [152], restricting more loop types and loop sizes, and approximating bulge and loop energies.

Part of Version 2.1.9 of the ViennaRNA package, obtained from http://www.tbi.univie.ac.at/RNA/index.html#download

    RNAplex -q query.fa -t target.fa -c 30 -l 60 -e 0 > output.txt

    -c  Per nucleotide extension cost, recommended settings from publication
    -l  Maximal length of interaction. 60 is minimum rounded known interaction length.
    -e  (Suboptimal toggle) Enables suboptimal mode, returning all results ∆G ≤ 0

RISEARCH

RISEARCH [212] is one of the most recent programs and has been shown to run a few times faster than RNAPLEX. Algorithmically, it reduces basepair prediction into a Smith-Waterman-like [224] local sequence alignment problem, with RNA secondary structure stacking energy calculation replaced by a dinucleotide scoring matrix lookup.

Version 1.0 obtained from http://rth.dk/resources/risearch/.
The latest Version 1.1 allows for thresholding by energy instead of score, which does not affect the results.

    RIsearch -d 30 -s 0 -q query.fa -t target.fa > output.txt

    -d  Per nucleotide extension cost, recommended settings from publication
    -s  (Suboptimal toggle) Enables suboptimal mode, returning all results with a score less than 0

RNADUPLEX

RNADUPLEX is part of the ViennaRNA Package 2.0 [181], designed for the rapid prediction of RNA-RNA interactions. Algorithmically, it is a modification of Zuker’s classical RNA secondary structure algorithm [152], simplified to ignore intramolecular basepairs and branching structures, much like the miRNA-specific RNAHYBRID algorithm [225].

Part of Version 2.1.9 of the ViennaRNA package, obtained from http://www.tbi.univie.ac.at/RNA/index.html#download

    cat query.fa target.fa | RNAduplex -e 50 > output.txt

    -e  (Suboptimal toggle) Enables suboptimal mode, returning all results within 50 kcal/mol of the minimum-free-energy solution

PAIRFOLD

PAIRFOLD [219] is one of the earliest general tools for predicting the joint RNA secondary structure, which concatenates two input sequences and employs a slightly modified version of the Zuker algorithm [152] found in MFOLD. Special attention is paid to the concatenation site, which is contained within a “special loop” that is penalized for intermolecular interactions.
Because it uses the Zuker algorithm, PAIRFOLD is unable to predict pseudoknotted basepairs for the concatenated sequence.

Part of the MULTIRNAFOLD-2.0 package downloaded at http://www.rnasoft.ca/download.html

The manuscript states that suboptimal results are available, but no settings to enable them were found in the exposed user interface.

    pairfold query.fa target.fa > output.txt

Query and target sequences are not given as files, but as strings on the command line.

RNACOFOLD

RNACOFOLD [223] is part of the ViennaRNA Package 2.0. It takes two input sequences, concatenates the two, then runs a slightly modified version of RNAFOLD on the concatenated sequence, returning one joint structure with an overall stability score. Theoretically identical to PAIRFOLD, it too cannot predict pseudoknotted basepairs for the concatenated sequence.

Part of Version 2.1.9 of the ViennaRNA package, obtained from http://www.tbi.univie.ac.at/RNA/index.html#download

    cat concatenated_query_ampersand_target.fa | RNAcofold > output.txt

The input was required to be a single FASTA file with a single sequence consisting of the two input sequences concatenated and separated by an ampersand symbol.

INTARNA

INTARNA [217] can be seen as a direct improvement over RNAUP, reducing the runtime complexity with various optimizations, and introducing the concept of a seed region. Seed regions require interactions to contain a strongly complementary region, with user-defined lengths and mismatch allowance.
INTARNA was designed specifically for predicting interactions between short sRNA queries and longer mRNAs in bacterial systems; accessibility is computed over the entire query sequence using RNAUP and in a sliding window with RNAPLFOLD [153] over the target sequence.

Version 1.2.5 obtained from http://www.bioinf.uni-freiburg.de/Software/#IntaRNA-download

    IntaRNA -w 140 -L 70 -l 60 -m query.fa -t target.fa -o > output.txt

    -w  Average the pair probabilities over windows of given size.
    -L  Set the maximum allowed separation of a base pair to span.
    -l  Max length of hybridized region, mainly for efficient computation
    -o  Enables detailed output

Window and length values were set to be identical to the RNAPLFOLD and RNAPLEX values seen above.

RNAPLEX-a

The updated version of RNAPLEX-c [218], which we term RNAPLEX-a, can optionally take in an externally computed accessibility file output by RNAPLFOLD [181]. Accessibility profiles are computed for sliding windows by giving RNAPLFOLD FASTA files for both target and query sequences, and are converted to position- and length-specific opening energies. Predicted interactions have a stability value formed by the summation of the hybridization energy, the opening cost on the query sequence, and the opening cost on the target sequence.

RNAPLEX is from Version 2.0.7 of the ViennaRNA package, while RNAPLFOLD is from the ViennaRNA package obtained from http://www.tbi.univie.ac.at/RNA/index.html#download. Version 2.1.9 of RNAPLEX appears to have a bug that causes a lot of errors in the output file when accessibility is used, but is otherwise fine.

    cat query.fa target.fa | RNAplfold -b -O -u 60 -W 140 -L 70
    mkdir accessibility_dir; mv -f *_dp.ps *_openen* accessibility_dir;
    RNAplex -q query.fa -t target.fa -a accessibility_dir -b -l 60 -e 0 > output.txt

    -b  Output/input accessibility profiles in binary format
    -O  Switch output from probabilities to their logarithms (approximate mean opening energy)
    -u  Compute the mean probability that regions of length 1 to a given length are unpaired. 60 is minimum rounded known interaction length.
    -W  Average the pair probabilities over windows of given size.
    -L  Set the maximum allowed separation of a base pair to span.
    -l  Maximal length of interaction. 60 is minimum rounded known interaction length.
    -e  (Suboptimal toggle) Enables suboptimal mode, returning all results ∆G ≤ 0

Values for -W and -L were set relative to -u using ratios that were obtained from the published recommended settings. In practice, FASTA sequence names had to be renamed from species names, which were identical in the query and target file, to two unique names so that they would not clash in the FASTA-header-derived output filenames generated by RNAPLFOLD.

RNAUP

RNAUP [216] is one of the first accessibility-based methods, which treats RNA-RNA interaction formation as a two-step process. The first step is computing the probability that a region on the target sequence is unpaired (i.e. the accessibility profile), from which the energy required to open the site can be derived. This is followed by the hybridization step with the unpaired target site, potentially taking a penalty to stability if an inaccessible region needs to be opened.

Part of Version 2.1.9 of the ViennaRNA package, obtained from http://www.tbi.univie.ac.at/RNA/index.html#download

    cat query.fa target.fa | RNAup -b --interaction_pairwise > output.txt

    -b  Include the probability of unpaired regions in both RNAs
    --interaction_pairwise  Activate pairwise interaction mode

A maximal interaction length setting -w was available, but setting it to the expected value (maximal known interaction length) actually decreased performance and thus it was not used.

RACTIP

RACTIP [220] has been shown to be the fastest method for simultaneously predicting the entire joint structure. A model of interactions and constraints is defined for general RNA-RNA interactions, and an optimal solution is found using integer programming.
Like PETCOFOLD, internal models allow it to predict complex structures such as kissing hairpins, which is not possible with concatenation-based methods.

Version 0.0.2 obtained from http://www.ncrna.org/software/ractip/

ractip query.fa target.fa > output.txt

No recommended settings were found, and no alternative settings were attempted. Three probability settings and one accessibility model toggle were available.

3.3.2 Comparative RNA-RNA interaction prediction tools

Input files query_alignment.fa and target_alignment.fa are paired FASTA files, each containing the same number of FASTA sequences, with matching ordering of species.

RNAPLEX-cA

An updated version of RNAPLEX-c [218] can additionally take alignments as input and, like RNAALIDUPLEX, incorporates a covariation score into the stability of the predicted duplex in the style of RNAALIFOLD. We shall refer to this as RNAPLEX-cA.

Part of the Version 2.1.9 ViennaRNA package obtained from http://www.tbi.univie.ac.at/RNA/index.html#download

RNAplex -q query_alignment.fa -t target_alignment.fa -c 30 -l 60 -e 0 -A > output.txt

-c Per nucleotide extension cost, recommended settings from publication
-l Maximal length of interaction. 60 is minimum rounded known interaction length.
-e (Suboptimal toggle) Enables suboptimal mode, returning all results ∆G ≤ 0
-A Tells tool to compute interactions based on alignments

RNAALIDUPLEX

RNAALIDUPLEX is an alignment-based version of RNADUPLEX, also part of the same ViennaRNA Package 2.0. Using a technique similar to that of RNAALIFOLD [222], RNAALIDUPLEX factors basepair conservation information into energy-based predictions. For each basepair, in addition to the stability derived from the stacking energies, a basepair covariation score computed from the conservation of the basepair is added to the energy stability.
Note that input files had to be converted from FASTA to CLUSTAL (CLUSTALW) format.

Part of the Version 2.1.9 ViennaRNA package obtained from http://www.tbi.univie.ac.at/RNA/index.html#download

RNAaliduplex query_alignment.aln target_alignment.aln -e 50 > output.txt

-e (Suboptimal toggle) Enables suboptimal mode, returning all results within 50 kcal/mol of the minimum-free-energy solution

RNAPLEX-aA

RNAPLEX-a can additionally take input alignments, resulting in RNAPLEX-aA, which uses energy, accessibility, and conservation information in the style of RNAALIFOLD to make predictions.

RNAPLEX is from Version 2.0.7 of the ViennaRNA package, and RNAPLFOLD is also from the ViennaRNA package, obtained from http://www.tbi.univie.ac.at/RNA/index.html#download. Version 2.1.9 of RNAPLEX appears to have a bug that causes a lot of errors in the output file when accessibility is used, but is otherwise fine.

cat query_alignment.fa target_alignment.fa | RNAplfold -b -O -u 60 -W 140 -L 70
mkdir accessibility_dir; mv -f *_dp.ps *_openen* accessibility_dir;
RNAplex -q query.fa -t target.fa -a accessibility_dir -b -l 60 -e 0 -A > output.txt

-b Output/input accessibility profiles in binary format
-O Switch output from probabilities to their logarithms (approximate mean opening energy)
-u Compute the mean probability that regions of length 1 to a given length are unpaired. 60 is minimum rounded known interaction length.
-W Average the pair probabilities over windows of given size.
-L Set the maximum allowed separation of a base pair to span.
-l Maximal length of interaction. 60 is minimum rounded known interaction length.
-e (Suboptimal toggle) Enables suboptimal mode, returning all results ∆G ≤ 0
-A Tells tool to compute interactions based on alignments

Values for -W and -L were set relative to -u using ratios that were obtained from the published recommended settings.
In practice, FASTA sequence names had to be renamed from species names, which were identical in the query and target files, to two unique names so that they would not clash in the FASTA-header-derived output filenames generated by RNAPLFOLD.

PETCOFOLD

PETCOFOLD [210] is a hierarchical joint structure prediction tool that takes two FASTA alignments as input. In a two-step pipeline, the program first predicts structures individually for each of the two input alignments, thereby determining highly stable, conserved, and thus likely functional, intramolecular basepairs using the PETFOLD [226] algorithm. This underlying PETFOLD algorithm has been described as the classical PFOLD [221] RNA secondary structure algorithm with a full evolutionary model [227], to which it adds a weighted thermodynamic component generated via the McCaskill algorithm using RNAFOLD with option -p, not unlike the “accessibility” score. The two alignments are then concatenated and likely intermolecular basepairs are determined, with the likely basepairs from step one being prevented from binding in step two. This process allows for intermolecular basepairs in loops (i.e. kissing hairpins), a structure that would involve pseudoknotted basepairs that concatenation-based algorithms cannot handle.

Version 3.2 obtained from http://rth.dk/resources/petcofold/download.php

PETcofold -f query_alignment.fa -f target_alignment.fa --intermol --war --extstem > output.txt

• --intermol Structure output of intermolecular base pairs
• --war FASTA format output
• --extstem Constrained stems get extended by inner and outer base pairs

3.3.3 Multiple sequence alignment generation programs

A subset of our tools are comparative and require high-quality multiple sequence alignments to perform optimally. These tools take multiple sequence alignments as input, with the objective of using evolutionary information in the alignments to improve the accuracy of the algorithm.
The quality of the alignment is a large limiting factor to the performance of these tools, so we evaluate the performance of the tools as a function of the alignments’ minimum percent identity. Specifically, we start with the full unfiltered alignment, then remove all sequences with a percent identity (relative to the reference species) lower than the minimum threshold, and run the selected algorithms on the resulting alignments. No filtering was done using minimum sequence count or total tree length.

Our initial selection of aligners was based on recent assessments of multiple sequence aligners [228, 229], where MAFFT (in “accurate mode” or L-INS-i) [230] and ProbConsRNA [231] were selected. While these tools have been shown to perform well at aligning homologs with conserved sequences, it is unknown if they can correctly align homologous conserved basepairing structure which may exhibit covariation and thus lose sequence conservation. To address this, we also examine alignments from two structure-aware aligners, LOCARNA [232] and SPARSE [233]. The latest version of MAFFT also included two structure-aware alignment modes, Q-INS-i and X-INS-i, both of which we test.
We conducted a very brief assessment of predictive performance on alignments created by the listed aligners, and made our final selection to use MAFFT Q-INS-i based on the balance between accuracy and runtime performance, detailed in the Results section.

GOTOHSCAN

Version 2.0α was obtained from http://www.bioinf.uni-leipzig.de/software.html

GotohScan2a -d complete_genomes.fa -q reference_sequence.fa -o 0 -e 0.01 --quiet > output.csv

-d FASTA file containing completed genomes as separate entries
-o Output format selector; BLAST tabular format was chosen
-e Sets the E-value; with downstream filtering in mind, a lenient value was chosen to maximize sensitivity
--quiet Don’t print status output of current query

For the bacterial dataset, the FASTA file containing 244 genomes is 1.1GB, making the search sufficiently swift if parallelized. For the fungal dataset, the 68 genomes are 4.2GB, which causes a significant slowdown during searching. In practice, we split the genome file into 500 separate files of roughly equal size (taking care not to truncate any FASTA entries), then query each reference sequence against each one and take the union of the results.

BLAST was used in early trials and was superior in terms of speed, but its inferior alignment quality had a non-negligible negative impact on tool performance, and it was thus abandoned.

MAFFT

Version 7.215 was obtained from http://mafft.cbrc.jp/alignment/software/.

mafft-qinsi input.fa > output.fa
mafft-xinsi input.fa > output.fa
mafft-linsi input.fa > output.fa

Three modes were run during the testing phase of the aligner selection, and ultimately Q-INS-i was chosen as the only structure-aware mode that could successfully align the largest sequences used in the study.

LOCARNA and SPARSE

Version 1.8.1 of LOCARNA was obtained from http://www.bioinf.uni-freiburg.de/Software/LocARNA/

mlocarna --quiet --keep-sequence-order --tgtdir output_directory input.fa
mlocarna --sparse --tgtdir output_directory input.fa

SPARSE was distributed as an alignment mode in LOCARNA and was run from the same binary.

ProbConsRNA

Version 1.1 of PROBCONSRNA was downloaded from http://probcons.stanford.edu/download.html

probcons input.fa > output.fa

Note that PROBCONSRNA was used instead of the protein-sequence aligner PROBCONS.

3.3.4 Related works and tools

Note that we focus on tools predicting general non-biology-specific RNA-RNA interactions, thus excluding the large collection of tools focused on predicting miRNA interactions and targets. While the general ideas of hybridization stability and accessibility apply to both the general and miRNA cases, modern miRNA prediction increasingly relies on miRNA-specific features that make these tools unsuited for predicting interactions outside of those for miRNAs. Reviews [209] and evaluations [234, 235] of miRNA tools have been covered extensively by other works.

Two notable related tools solving the interaction target prediction problem are RNAPREDATOR [236], which utilizes RNAPLEX to predict the target partner of small bacterial sRNAs, and COPRARNA [237], which uses INTARNA to tackle the same sRNA target prediction problem. A recent assessment of target prediction tools for sRNA was done by Pain et al. [238], showing COPRARNA as the best tool currently available for the task.

Finally, there were methods that fit the criteria of our evaluation but were excluded because we could not obtain or run them, for availability or practical reasons. These include GRNAS [239] (unavailable publicly), INTERNA [240] (algorithmically impractical), RIP [241] (unavailable publicly) and RIPALIGN [242] (algorithmically impractical).

3.4 Datasets

The evaluation of any new computational tool requires the compilation of a set of experimentally verified results. While RNA secondary structure tools have long benefited from curated and compiled datasets such as RNA STRAND [243] and RFAM [244], RNA-RNA interaction tool evaluations have so far only relied on ad hoc and varying datasets.
Generally, tools have aimed to obtain a set of biologically functional interactions, consisting mostly of miRNA, bacterial small RNAs (sRNA), and snoRNAs. In this paper, we aggregate the sRNA and snoRNA interaction pairs that have been used in various papers, and here present what we believe is one of the largest, freely accessible and digitized collections of such RNA-RNA interactions. In contrast to previous datasets, we attempt to alleviate the biological bias of focusing on a single type of data, gathering both snoRNA and sRNA. Additionally, we also eliminate any bias on input length, taking the full length of target rRNA and mRNA sequences. We make the dataset available in tab-delimited format at doi:10.1093/nar/gkv1477, containing the full sequences and basepair information for future tool development and benchmarking efforts.

3.4.1 Bacterial sRNA

Small RNAs, or sRNAs, are non-coding regulatory RNAs found in bacteria, shown to bind to the translational start sites of mRNAs, controlling the stability and translation of their targets. At 40 to 400 nucleotides in length [124], sRNAs do not simply bind in zipper-like fashion to the mRNA across their entire length like the majority of miRNAs. Instead, sRNA-mRNA interactions vary significantly in stability and length, modulated by existing RNA secondary structures on both the sRNA and mRNA strands. Thus, the identification of the functionally relevant interaction serves as a challenging and relevant problem in RNA-RNA interaction prediction.

Functionally relevant sRNA-mRNA pairs are obtained from previously published experimental works, mostly derived from biochemical mapping experiments in Escherichia coli and Salmonella enterica. In earlier works a set of 18 interactions collected for INTARNA [217] was used by several works [212, 245].
An expanded set tripling the interaction count was used for analysis of sRNA target binding regions [246, 247], stated to be equivalent to sRNATarBase [248] published in parallel. Finally, the most recent and comprehensive set of over 100 interactions was compiled to evaluate the partner prediction problem in COPRARNA [237], a direct expansion of the aforementioned set. Our sRNA dataset is a curated and digitized version of the interactions presented by COPRARNA, recovered from the supplementary material files and at times graphical figures in cited experimental publications to obtain the exact basepairing. The end product is a set of 109 sRNA-mRNA interactions (64 E. coli, 45 S. enterica) from 18 sRNAs against 82 mRNA targets. sRNA lengths range from 72 to 237 nucleotides, with a mean length of 123nt. The majority of pairs involve only one interaction site, but some pairs involve interactions at two disjoint sites. For these pairs (OxyS-fhlA and RprA-csgD in E. coli, GcvB-cycA, GcvB-tppB, and MicF-lpxR in S. enterica), each continuous segment is counted as one unique interaction for performance evaluation, resulting in these pairs having two solutions each. While it is technically possible to combine both sites into a long interaction containing a lengthy unpaired region in the middle, splitting them both gives better predictive performance for the tools evaluated and allows us to limit the maximum interaction length.

We used RefSeq [249] genomes for Escherichia coli str. K-12 substr. MG1655 (NC_000913.3) and Salmonella enterica subsp. enterica serovar Typhimurium str. LT2 (NC_003197) as our reference sequences. Sequences for sRNA and mRNA targets were extracted from genomes using gene names and associated GFF annotation files, along with 300 bases upstream of the translation start site.
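The extraction step just described can be illustrated with a minimal Python sketch. It is not the pipeline used in this work (which was scripted in Perl and R); the function names, the 1-based inclusive coordinate convention, and the assumption that the annotated feature begins at the translation start site are all hypothetical choices made for illustration. Circular genomes are not handled: the upstream region is simply clipped at the sequence ends.

```python
# Hypothetical helper names; coordinates follow the GFF convention
# (1-based, inclusive) and are assumed to describe a feature whose 5' end
# is the translation start site.
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def extract_with_upstream(genome, start, end, strand, upstream=300):
    """Extract a feature plus `upstream` bases 5' of it, read 5'->3'.

    genome: full chromosome as a string
    start, end: 1-based inclusive feature coordinates
    strand: '+' or '-'
    """
    if strand == "+":
        lo = max(1, start - upstream)          # clip at the sequence start
        return genome[lo - 1:end].upper()
    if strand == "-":
        hi = min(len(genome), end + upstream)  # upstream lies to the right
        return reverse_complement(genome[start - 1:hi])
    raise ValueError("strand must be '+' or '-'")


if __name__ == "__main__":
    toy_genome = "AAACCCGGGTTT"
    # Feature at positions 7-9 on the plus strand with 3 upstream bases.
    print(extract_with_upstream(toy_genome, 7, 9, "+", upstream=3))
    # Feature at positions 1-3 on the minus strand with 3 upstream bases.
    print(extract_with_upstream(toy_genome, 1, 3, "-", upstream=3))
```

For minus-strand features the upstream region lies at higher genomic coordinates, which is why the slice is extended to the right before reverse-complementing.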
Once sequences were extracted, interactions from supplementary materials and manuscript figures were mapped onto the sequences and all interactions were confirmed to correspond to valid basepairs. The output data is stored in a computer-parsable CSV file, each line containing the full sRNA and mRNA sequences, along with the exact binding location and the basepair formation, available as Supplementary Files.

3.4.2 Fungal snoRNA

Small nucleolar RNAs, or snoRNAs, are non-coding RNAs found in eukaryotes and archaea, shown to stably bind to rRNAs, guiding essential chemical modifications at specific positions [117]. These RNAs are generally classified into C/D box snoRNAs that guide methylation and H/ACA snoRNAs that guide pseudouridylation. We focus on C/D box snoRNAs out of necessity, as H/ACA snoRNA interactions heavily depend on correctly folding intramolecular hairpins, making them great for evaluating joint structures, but placing them outside the scope of this work. For those interested in H/ACA predictions, we refer readers to the recent work RNASNOOP [250], which does a small comparison between three H/ACA prediction tools. A single C/D box snoRNA typically has one or two binding sites, ranging from 10 to 21 nt in length, forming highly complementary interactions with its target rRNA site. Despite the highly complementary interactions, the possible existence of multiple binding sites on the snoRNA and the length of the rRNA targets (up to thousands of nucleotides) present a very different problem than that posed by the sRNA set.

We chose to use yeast snoRNA-rRNA interactions from Saccharomyces cerevisiae, due to the completeness of the dataset and annotations available. Our dataset consists of 52 C/D box interactions obtained from the Methylation Guide snoRNA Database [251], with additional interactions from the UMASS Amherst Yeast snoRNA Database [252]. The 52 interactions are made by 43 unique snoRNAs and two rRNA targets.
SnoRNA lengths range from 78 to 255 nucleotides with a mean length of 104nt; the full rRNA sequences are the 1800nt 18S rRNA and the 3396nt 25S rRNA.

We use the Saccharomyces cerevisiae S288c genome from the Saccharomyces Genome Database [253], and extract snoRNA and rRNA sequences by gene name using the associated GFF genome annotation file. Interaction data from the Methylation Guide snoRNA Database [251] was reformatted and mapped onto the sequences, confirming the correctness of the basepairs. The final output is a computer-parsable CSV file with each line representing an interaction, containing the full snoRNA and rRNA sequence, as well as the exact interaction sites and basepairing formation.

3.4.3 Multiple sequence alignments

Whereas minimum-free energy (MFE) methods only require a single sequence from each of the target and query RNAs, comparative methods require multiple sequence alignments (MSA) of homologs.

For the sRNA database, we obtained the list of 244 completed Enterobacteriaceae genomes from KEGG [254] and complete genomes listed from RefSeq. We used GOTOHSCAN [255] to find homologs for input reference sequences, and generated FASTA files of unaligned hits from the output. FASTA files were then aligned with MAFFT [230] in structurally-aware Q-INS-i mode with default settings. Finally, using percent identity (% ID) to the reference species, we kept the top hit for any species that had more than one homolog hit. The above process resulted in MSAs of sRNA and mRNA sequences, which were then used to make pairs of alignments containing the same species. For each interaction, the corresponding sRNA and mRNA alignments were taken, and the intersection of species was kept. The number of species after these intersections ranges from 42 to 216, with a mean species count of 170.
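The top-hit selection and species intersection described above can be sketched as follows. This is an illustrative Python sketch only, under the assumption that each homolog hit is available as a (species, sequence identifier, percent identity) tuple; the function names and the optional `min_pid` cutoff (mirroring the minimum percent identity filtering of Section 3.3.3) are hypothetical and not taken from the original scripts.

```python
def best_hit_per_species(hits):
    """Keep the hit with the highest percent identity for each species.

    hits: iterable of (species, seq_id, percent_identity) tuples, where
    percent identity is measured against the reference species.
    """
    best = {}
    for species, seq_id, pid in hits:
        if species not in best or pid > best[species][1]:
            best[species] = (seq_id, pid)
    return best


def paired_species(srna_hits, mrna_hits, min_pid=0.0):
    """Return (species, sRNA id, mRNA id) for species present in both sets.

    Hits below `min_pid` are dropped before intersecting. The result is
    sorted by species so both alignments share the same ordering.
    """
    srna = {s: v for s, v in best_hit_per_species(srna_hits).items()
            if v[1] >= min_pid}
    mrna = {s: v for s, v in best_hit_per_species(mrna_hits).items()
            if v[1] >= min_pid}
    shared = sorted(set(srna) & set(mrna))
    return [(s, srna[s][0], mrna[s][0]) for s in shared]


if __name__ == "__main__":
    srna_hits = [("eco", "s1", 100.0), ("sen", "s2", 90.0),
                 ("sen", "s3", 95.0), ("kpn", "s4", 70.0)]
    mrna_hits = [("eco", "m1", 100.0), ("sen", "m2", 88.0),
                 ("ype", "m3", 60.0)]
    print(paired_species(srna_hits, mrna_hits))
    print(paired_species(srna_hits, mrna_hits, min_pid=89.0))
```

Sorting the shared species is one simple way to satisfy the requirement that the paired alignment files list species in matching order.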
Pipeline scripting was done in Perl and R, with FASTA sequence and RNA structure manipulation done using the R4RNA package [256].

For the snoRNA database, we obtained the entirety of the RefSeq release 68 Fungi sequences (Nov 2014), containing just under 3000 species. We followed the same procedure as described, also using MAFFT Q-INS-i on the full length rRNAs despite significantly longer runtimes. Species counts for finalized alignments range from 5 to 44 with a mean count of

Performance measures

We use the True Positive Rate (TPR, also called sensitivity) and Positive Predictive Value (PPV, also called selectivity [170]) to measure predictive performance. We only consider intermolecular basepairs, ignoring all intramolecular predictions. Given a set of predicted basepairs and a set of experimentally validated basepairs, each predicted basepair is either a True Positive (TP) if it also appears in the known set, or a False Positive (FP) otherwise. All basepairs in the known set that are not predicted are False Negatives (FN), i.e. the prediction algorithm incorrectly predicts the basepair as non-pairing. Hence:

$$TPR := \frac{TP}{TP + FN} \qquad PPV := \frac{TP}{TP + FP}$$

True negative basepairs (TN) have traditionally been of little practical use for RNA secondary structure prediction evaluation, with the same applying in this work. Regardless, we describe their computation as it is required in some of our statistical analysis. We compute the number of TN basepairs as the “total” number of possible basepairs minus the number of TP basepairs as defined above. We estimate the total number of possible basepairs as $\frac{n(n-1)}{2}$, where n is the length of the concatenated sequence, as estimated by [181]. This value is typically several orders of magnitude larger than the other values, and typically makes the specificity measure (also known as the true negative rate, $TNR := \frac{TN}{FP + TN}$) largely meaningless, as it is effectively 1 for all tools evaluated.
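The definitions above can be made concrete with a small Python sketch. It is an illustration rather than the evaluation code used in this work: the function names are hypothetical, basepairs are assumed to be (i, j) position tuples on the concatenated sequence, and TN is computed exactly as stated in the text (total possible basepairs minus TP).

```python
def confusion_counts(predicted, known, n):
    """Count TP, FP, FN and TN for intermolecular basepairs.

    predicted, known: iterables of (i, j) basepair tuples
    n: length of the concatenated query and target sequence
    """
    predicted, known = set(predicted), set(known)
    tp = len(predicted & known)   # predicted and experimentally validated
    fp = len(predicted - known)   # predicted but not validated
    fn = len(known - predicted)   # validated but not predicted
    total = n * (n - 1) // 2      # estimated number of possible basepairs
    tn = total - tp               # as defined in the text
    return tp, fp, fn, tn


def rates(tp, fp, fn, tn):
    """Return (TPR, PPV, TNR); a rate with a zero denominator is 0.0."""
    tpr = tp / (tp + fn) if tp + fn else 0.0
    ppv = tp / (tp + fp) if tp + fp else 0.0
    tnr = tn / (fp + tn) if fp + tn else 0.0
    return tpr, ppv, tnr


if __name__ == "__main__":
    known = {(1, 50), (2, 49), (3, 48), (4, 47)}
    predicted = {(1, 50), (2, 49), (10, 40)}
    counts = confusion_counts(predicted, known, n=100)
    print(counts)
    print(rates(*counts))
```

Even in this toy case, with only half of the known basepairs recovered, the TNR is above 0.999, which illustrates why specificity carries little information here.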
For this reason, we use TPR and PPV, which are independent of the TN count.

Finally, we also use the Matthews Correlation Coefficient (MCC) [257] as a rough summary of both TPR and PPV, defined as:

$$MCC := \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$

MCC ranges from 1 for predictions with maximum TPR and PPV, to -1 for very poor predictions, although in practice, the physical constraints of RNA basepairs result in a range between 0 and 1 for non-random predictions, and MCC has been shown to be an approximation of the geometric mean of TPR and PPV [258].

Considerations were made to subtract compatible basepairs as defined in [170] from the False Positive count, but doing so was found to be too lenient for tools which took a shotgun approach to predicting interactions.

3.5 Results

3.5.1 Minimum free energy results on sRNA dataset

We run the 10 energy-based tools against 109 sRNA-mRNA pairs, using the full length sRNA against truncated mRNA targets. Knowing the biology of sRNAs, it is not unreasonable for users to focus on a region around the translation start site. Previous studies have used similar windows, such as -150 and +100 bps relative to the start codon [247]. With all interactions falling within -130 and +104 bps, we begin with a conservative window of -150 and +150 bps. We run the analysis under two sets of options, the first being the most basic use case scenario of all default settings, followed by runs using optimal recommended settings when available.

Results for 109 pairs run on 10 tools are visualized as a heatmap in Figure 3.1, created using ggplot [259] in R, with default options on the left, optimal options in the middle and differences shown on the right. Hierarchical clustering of the results run with optimal settings clarifies an otherwise confusing set of results, and provides some immediate insight into tool similarity.
As expected, results for accessibility-based INTARNA and RNAPLEX-a, interaction-only RISEARCH and RNAPLEX-c, and concatenation-based PAIRFOLD and RNACOFOLD show highly similar performance profiles (Figure 3.2, left). RNADUPLEX and RACTIP also cluster together, which was not apparent from their algorithmic strategies.

The mean of performance results for the sRNA dataset is shown in Table 3.2, for runs with default options, runs with optimal options, and the difference between the two where applicable. Setting correct options for RISEARCH and RNAPLEX-c is essential for obtaining competitive results, whereas all other tools gain only a small increase in performance (Figure 3.1 middle). For these two tools, the optimal setting involves setting the per nucleotide extension penalty to 0.3 kcal mol−1, the average duplex energy between two random RNA bases [211]. For the two accessibility-based tools, optimal settings involve restricting the length of interactions to 60 nt, just larger than the longest interaction in the dataset.

According to mean MCC (Table 3.2), the best performing tool on the dataset is INTARNA (0.62), followed closely by RNAPLEX-a (0.58). The simple inclusion of accessibility information may not completely explain this advantage over other tools, given that RNAUP too uses accessibility, yet only achieves a mean MCC of 0.39.
Tools that perform poorly according to MCC, such as RNADUPLEX, appear to be severely penalized for predicting a large number of interacting basepairs, resulting in a poor PPV.

[Figure 3.1 image: three heatmap panels; rows are the 109 sRNA-mRNA pairs, columns are the programs (IntaRNA, RNAplex-a, GUUGle, Pairfold, RNAcofold, RactIP, RNAduplex, RIsearch, RNAplex-c, RNAup); colour scale MCC from 0.00 to 1.00 in the left and middle panels, ∆MCC in the right panel.]

Figure 3.1: Predictive accuracy measured by Matthews Correlation Coefficient for energy-based interaction prediction tools on the sRNA-mRNA dataset. Results for tools run on default (left) versus optimal (middle) options shown with differences (right), pairs and tools clustered hierarchically according to optimal results to group like results.

[Figure 3.2 image: two dendrograms of the ten tools, height axis from 0 to 5. Leaf order, left: IntaRNA, RNAplex-a, GUUGle, Pairfold, RNAcofold, RactIP, RNAduplex, RIsearch, RNAplex-c, RNAup. Leaf order, right: GUUGle, RIsearch, RNAplex-c, Pairfold, RNAcofold, RactIP, RNAduplex, IntaRNA, RNAplex-a, RNAup.]

Figure 3.2: Clustering of tools according to energy-based MFE (left) and suboptimal (right) results on the sRNA-mRNA dataset.

                        IntaRNA  RNAplex-a     GUUGle   Pairfold  RNAcofold     RactIP  RNAduplex   RIsearch  RNAplex-c      RNAup
MCC (Optimal)              0.62       0.58        n/a        n/a        n/a        n/a        n/a       0.35       0.33        n/a
MCC (Default)              0.61       0.54       0.44       0.38       0.42       0.23       0.22       0.03       0.13       0.39
MCC (∆)                    0.01       0.04        n/a        n/a        n/a        n/a        n/a       0.32       0.20        n/a
TPR (Optimal)              0.66       0.63        n/a        n/a        n/a        n/a        n/a       0.34       0.32        n/a
TPR (Default)              0.62       0.57       0.67       0.52       0.56       0.38       0.46       0.04       0.17       0.39
TPR (∆)                    0.04       0.06        n/a        n/a        n/a        n/a        n/a       0.30       0.16        n/a
PPV (Optimal)              0.61       0.56        n/a        n/a        n/a        n/a        n/a       0.39       0.35        n/a
PPV (Default)              0.64       0.52       0.39       0.29       0.33       0.15       0.11       0.03       0.11       0.48
PPV (∆)                   -0.03       0.04        n/a        n/a        n/a        n/a        n/a       0.36       0.24        n/a
∆G (Optimal)             -13.03     -14.45        n/a        n/a        n/a        n/a        n/a     -19.79     -25.11        n/a
∆G (Default)             -10.73     -14.11      9.90¹   -130.47²   -126.94²        n/a     -64.26     -16.85     -21.11     -16.74
∆G (∆)                    -2.30      -0.34        n/a        n/a        n/a        n/a        n/a      -2.94      -3.99        n/a
Basepairs (Optimal)       18.76      20.04        n/a        n/a        n/a        n/a        n/a      14.67      16.85        n/a
Basepairs (Default)       16.87      19.19     104.50      33.16      32.85      50.34      79.50      21.51      23.50      13.90
Basepairs (∆)              1.89       0.84        n/a        n/a        n/a        n/a        n/a      -6.84      -6.65        n/a

Table 3.2: Mean accuracy performance for MFE predictions of energy-based tools on 109 sRNA-mRNA interactions. For each metric, we have results computed using optimal recommended settings, default settings, and the difference. Basepairs is the mean number of predicted intermolecular basepairs. Five tools lacked optimal settings, and RACTIP and GUUGLE did not compute ∆G free energies. GUUGLE outputs multiple results with no energy, so its results are more strictly “suboptimal” predictions.
¹ Minimal basepair threshold.
² Joint structure energies.

3.5.2 Suboptimal interaction results on sRNA dataset

While all the energy-based tools used produce a single minimum free-energy secondary structure by default, a majority of tools also allow the prediction of suboptimal results. In practice, this allows both for increased sensitivity and for the ability to correctly predict pairs of interacting sequences with multiple binding sites. We determine the increases obtained by turning on suboptimal results for tools that have this option.

For this test, we use the same sRNA dataset and optimal options as used previously, except that the suboptimal results option is enabled and set to return all suboptimal structures within reason. Specifically, when given the choice to set an energy threshold, we have the tools return all interactions with a predicted ∆G stability of ≤ 0 kcal/mol. This is high enough to include the minimum free-energy structures and any suboptimal results that would be of practical interest, and low enough to keep output file sizes under control. It should be emphasized that this energy cutoff is not the one used for the performance evaluation that follows. The definitions given for MFE performance evaluation persist, but the predicted basepairs are now the union of intermolecular basepairs (i.e. no duplicates) from all suboptimal structures below a specific energy threshold. This specific energy threshold is unique to each tool for each interaction prediction, and is chosen to maximize the MCC value for that specific run.
While knowing the energy threshold that maximizes the MCC up-front is impossible in the typical use case where the interactions are not known beforehand, we aim to derive the theoretical maximum benefit of including suboptimal results, and thereby make potential recommendations on how to set such a threshold for future de novo runs.

Suboptimal results are visualized in Figure 3.3 (middle), retaining the same column and row ordering for easy comparison with the optimal MFE results in Figure 3.3 (left). Visually, the tools with the largest changes are RNADUPLEX, RISEARCH and RNAPLEX-c, with differences highlighted in Figure 3.3 (right). INTARNA and RNAPLEX-a have relatively smaller gains in performance. As summarized in Table 3.3, the three former tools roughly double their TPR, but see little change in PPV. This increase in TPR likely stems from the increased total number of predicted basepairs, roughly three to four times as many as in the MFE predictions for these three tools.

Measured by MCC, INTARNA (0.69) and RNAPLEX-a (0.69) remain the two top performing tools, while RNAPLEX-c (0.52) and RISEARCH (0.50) jump ahead to take third and fourth spot. Surprisingly, INTARNA has minimal change in the number of predicted basepairs even with suboptimal results enabled, meaning that the validated interactions are often its MFE prediction already. GUUGLE, which simply returns all valid ungapped interactions, does surprisingly well with an MCC of 0.44 if we simply take all predictions greater than 9 basepairs (mean 9.90). It obtains a TPR equal to RNAPLEX-a, but suffers from inadequate PPV, suggesting that Gibbs free energy serves as a good means to distinguish functional interactions from random ones. It is noted that while there is a shift in performance, the clustering of tools remains similar to that obtained with only MFE results for this dataset (Figure 3.2, right).

In addition to a clustering of tools, the results of each tool also cluster to some extent.
In Figures 3.4 and 3.5, we see MCC performance distributions of the energy-based tools on sRNA data returning MFE and suboptimal results (when applicable), respectively. We see a large number of results at 0 MCC when MFE results are returned, since predictions are either a hit or miss. With suboptimal options enabled, this peak at 0 mostly disappears for tools with the option. In Figure 3.13, we show the TPR and PPV for each tool on each prediction, with a two dimensional density plot showing the rough clustering of results for each tool. Ignoring the large majority of points that end up with no predictions, we see a large concentration of accessibility-based predictions with a PPV of 1 and a range of TPR values. In contrast, tools without accessibility have a strong concentration of results with a TPR of 1, but a range of PPV values.

The Gibbs free energy (∆G) in Table 3.3 denotes the energy threshold used, below which basepairs are considered to be positively predicted. In practice, these could serve as guidelines for thresholds. The variance in energies between tools, however, makes setting clear guidelines difficult. The Rank denotes the average number of suboptimal results kept for each interaction to obtain the performances seen (i.e. MFE results effectively have a Rank of 1).
A larger Rank value doesn’t necessarily mean worse performance, since it might simply reflect a tool that successfully predicts the entire interaction via several small separate interactions.

[Figure 3.3 image: three heatmap panels with one row per sRNA/mRNA interaction, labelled by dataset identifier and sRNA/target name (for example "(E015) FnrS/folE"), and one column per program (IntaRNA, RNAplex-a, GUUGle, Pairfold, RNAcofold, RactIP, RNAduplex, RIsearch, RNAplex-c, RNAup). Color scale: MCC from 0.00 to 1.00 (left and middle panels) and ∆MCC (right panel).]

Figure 3.3: Predictive accuracy measured by Matthews Correlation Coefficient for energy-based interaction prediction tools on the sRNA-mRNA dataset.
MFE (left) versus suboptimal (middle) results shown with differences (right), pairs and tools clustered hierarchically according to MFE results to group like results (Figure 3.1, middle). Suboptimal results are only available in INTARNA, RISEARCH, RNADUPLEX, and both versions of RNAPLEX.

[Figure 3.4 image: one histogram panel per program (IntaRNA, RNAplex-a, GUUGle, Pairfold, RNAcofold, RactIP, RNAduplex, RIsearch, RNAplex-c, RNAup); x-axis MCC from 0.00 to 1.00, y-axis Count from 0 to 60.]

Figure 3.4: MCC performance distribution of MFE results for energy-based tools using optimal settings (where applicable) on the sRNA-mRNA dataset.

[Figure 3.5 image: one histogram panel per program (IntaRNA, RNAplex-a, GUUGle, Pairfold, RNAcofold, RactIP, RNAduplex, RIsearch, RNAplex-c, RNAup); x-axis MCC from 0.00 to 1.00, y-axis Count from 0 to 60.]

Figure 3.5: MCC performance distribution of suboptimal results for energy-based tools using optimal settings (where applicable) on the sRNA-mRNA dataset.

                     IntaRNA    RNAplex-a  GUUGle     Pairfold   RNAcofold  RactIP     RNAduplex  RIsearch   RNAplex-c  RNAup
MCC (Suboptimal)     0.69       0.69       0.44       -          -          -          0.30       0.50       0.52       -
MCC (MFE)            0.62       0.58       -          0.38       0.42       0.23       0.22       0.35       0.33       0.39
MCC (∆)              0.07       0.11       -          -          -          -          0.08       0.14       0.19       -
TPR (Suboptimal)     0.76       0.86       0.67       -          -          -          0.86       0.75       0.77       -
TPR (MFE)            0.66       0.63       -          0.52       0.56       0.38       0.46       0.34       0.32       0.39
TPR (∆)              0.10       0.23       -          -          -          -          0.40       0.42       0.45       -
PPV (Suboptimal)     0.72       0.62       0.39       -          -          -          0.13       0.39       0.43       -
PPV (MFE)            0.61       0.56       -          0.29       0.33       0.15       0.11       0.39       0.35       0.48
PPV (∆)              0.11       0.06       -          -          -          -          0.02       0.00       0.08       -
∆G (Suboptimal)      -13.34     -13.77     9.90¹      -          -          -          -56.38     -16.67     -22.71     -
∆G (MFE)             -13.03     -14.45     -          -130.47²   -126.94²   -          -64.26     -19.79     -25.11     -16.74
∆G (∆)               -0.31      0.68       -          -          -          -          7.88       3.11       2.40       -
Bps (Suboptimal)     19.49      35.05      104.50     -          -          -          233.37     62.38      56.28      -
Bps (MFE)            18.76      20.04      -          33.16      32.85      50.34      79.50      14.67      16.85      13.90
Bps (∆)              0.72       15.01      -          -          -          -          153.86     47.71      39.42      -
Rank (Suboptimal)    1.12       3.05       13.42      -          -          -          10.75      5.30       3.67       -

Table 3.3: Mean accuracy performance for suboptimal predictions of energy-based tools on 109 sRNA-mRNA interactions.
For each metric, we have results computed with suboptimal results enabled, disabled (MFE) and the difference. Basepairs (Bps) is the mean number of predicted intermolecular basepairs. Five tools lacked the suboptimal option, and RACTIP and GUUGLE did not compute ∆G free energies. ¹ Minimal basepair threshold. ² Joint structure energies.

3.5.3 Effect of increasing target sequence size on sRNA dataset

Using optimal options and suboptimal results where applicable, we test the accuracy performance of energy-based tools on the sRNA dataset, but this time increasing the length of the target mRNA sequence. Previously, we had used the full-length sRNA sequence against a 300nt window around the translation start site, 150nt upstream and 150nt downstream. Here, we keep the 150nt upstream the same, but gradually increase the length of the coding sequence (CDS) downstream by increments of 100nt, until we are 1150nt downstream, giving target sequences up to 1300nt, roughly the length of the average bacterial gene.

The resulting performance as measured by MCC is seen in Figure 3.6, showing a monotonically decreasing trend for all tools. All tools have difficulties maintaining PPV, resulting in an overall decrease in MCC as the search space increases. This is alleviated somewhat by tools that produce suboptimal results, as they are able to maintain a high TPR, which is often untrue for tools that only return MFE results. The rate of decrease varies between tools, with a few such as INTARNA and RNAPLEX-a having relatively linear trends, while RISEARCH and RNAPLEX-c have fast asymptotic trends.
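The series of target windows described in this section can be written down explicitly. The sketch below is only an illustration: `target_windows`, its parameter names, the 0-based half-open coordinates, and the clipping to the sequence ends are assumptions made for the example, not a description of the scripts used for the evaluation.

```python
def target_windows(start_pos, seq_len, upstream=150, first_downstream=150,
                   last_downstream=1150, step=100):
    """Return half-open (begin, end) windows around a translation start site.

    The upstream flank is fixed while the downstream (CDS) flank grows from
    first_downstream to last_downstream in increments of step.
    Assumption: windows are clipped to the sequence boundaries.
    """
    windows = []
    for downstream in range(first_downstream, last_downstream + 1, step):
        begin = max(0, start_pos - upstream)
        end = min(seq_len, start_pos + downstream)
        windows.append((begin, end))
    return windows


# Hypothetical transcript of 3000 nt with the start codon at position 500.
windows = target_windows(start_pos=500, seq_len=3000)
lengths = [end - begin for begin, end in windows]
```

With the default arguments this yields eleven windows whose lengths run from 300nt to 1300nt in steps of 100nt, matching the range of target sizes described in the text.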
Extrapolating, it is likely that increasing the length even further will result in a further decrease in performance for all tools, which would have worrying implications for full-transcriptome searches.

3.5.4 MFE energy-based results on snoRNA dataset

In order to examine whether the performances observed in the sRNA dataset are generalizable to interactions of other types, we evaluate the performance of all tools on our second dataset, consisting of 52 C/D snoRNA-rRNA interactions. We proceed straight to evaluating the performance using optimal settings, adjusting the maximal interaction length to 25, just large enough to capture all known interactions in the dataset. We start with MFE results, followed by suboptimal results.

We run the energy-based tools on two versions of the dataset using the same 52 snoRNA and rRNA pairs. We first determine the performance in an ideal scenario, knowing the general binding region of the snoRNA on the rRNA sequence, having the full length snoRNA interact with a 300nt subsequence of the target rRNA centered on the midpoint of the binding site. We then test a more realistic de novo scenario, interacting the full snoRNA against the entirety of the target rRNA.

MFE results on the short and long datasets are seen in Table 3.4 and Figure 3.7, which repeat our observation that increasing the search space results in a significant drop in performance for all tools. In contrast to suboptimal results, the drop in MCC is caused by decreasing performance in both TPR and PPV.

[Figure 3.6 image: three line-plot panels with one line per program (GUUGle, IntaRNA, Pairfold, RactIP, RIsearch, RNAcofold, RNAduplex, RNAplex-a, RNAplex-c, RNAup); x-axis CDS Length (ticks at 300, 600, 900, 1200), y-axis Value.]

Figure 3.6: MCC (top), TPR (middle) and PPV (bottom) performance of energy-based tools as a function of length. Optimal options and suboptimal results are used where applicable.

[Figure 3.7 image: three heatmap panels with one row per snoRNA/rRNA interaction, labelled by dataset identifier and snoRNA/target name (for example "(S001) U14/18S"), and one column per program (IntaRNA, RNAplex-a, GUUGle, Pairfold, RNAcofold, RactIP, RNAduplex, RIsearch, RNAplex-c, RNAup). Color scale: MCC from 0.00 to 1.00 (left and middle panels) and ∆MCC (right panel).]

Figure 3.7: Predictive accuracy measured by Matthews Correlation Coefficient for energy-based minimum free-energy prediction tools on the snoRNA-rRNA dataset.
Results on the short dataset (left) versus the long dataset (middle) with differences (right).

                     IntaRNA    RNAplex-a  GUUGle     Pairfold   RNAcofold  RactIP     RNAduplex  RIsearch   RNAplex-c  RNAup
MCC (Short)          0.72       0.75       0.86       0.45       0.50       0.31       0.31       0.82       0.82       0.56
MCC (Long)           0.56       0.52       0.62       0.18       0.23       0.18       0.15       0.58       0.58       0.45
MCC (∆)              0.16       0.23       0.24       0.26       0.27       0.13       0.16       0.24       0.24       0.11
TPR (Short)          0.77       0.82       0.97       0.72       0.75       0.61       0.70       0.87       0.84       0.63
TPR (Long)           0.61       0.56       0.96       0.31       0.40       0.36       0.32       0.62       0.60       0.51
TPR (∆)              0.16       0.25       0.01       0.41       0.35       0.25       0.37       0.25       0.24       0.12
PPV (Short)          0.68       0.70       0.81       0.28       0.35       0.16       0.14       0.78       0.80       0.60
PPV (Long)           0.53       0.49       0.49       0.11       0.13       0.09       0.07       0.55       0.56       0.47
PPV (∆)              0.16       0.21       0.32       0.18       0.22       0.07       0.07       0.23       0.24       0.13
∆G (Short)           -13.34     -13.99     12.64¹     -1042.80²  -1006.99²  -          -56.46     -22.43     -24.16     -18.17
∆G (Long)            -12.79     -14.12     12.58¹     -125.35²   -119.58²   -          -50.39     -21.44     -23.29     -17.10
∆G (∆)               -0.55      0.13       0.06¹      -917.46²   -887.41²   -          -6.07      -0.99      -0.87      -1.07
Bps (Short)          14.77      15.42      34.67      34.62      29.79      49.94      66.98      15.31      14.10      13.96
Bps (Long)           15.38      15.58      174.58     38.04      43.85      57.02      72.69      15.96      14.63      15.31
Bps (∆)              -0.62      -0.15      -139.90    -3.42      -14.06     -7.08      -5.71      -0.65      -0.54      -1.35

Table 3.4: Mean accuracy performance measures for MFE predictions of energy-based tools on 52 snoRNA-rRNA interactions. For each metric, we evaluate predictions using the full length snoRNA against the full length rRNA (long), or an optimal 300nt window around the binding site on the rRNA (short). Basepairs (Bps) is the mean number of predicted intermolecular basepairs. ¹ Minimal basepair threshold. ² Joint structure energies.

3.5.5 Suboptimal energy-based results on snoRNA dataset

In Table 3.5, we see a numerical summary of the effects of enabling the suboptimal results option when available.
The MCC gains obtained by enabling suboptimal results on this snoRNA dataset (0.03 to 0.08) are much smaller than those seen with sRNA (Table 3.3, 0.07 to 0.19). This is due to the MFE results often being the known interaction, making the additional predictions gained from suboptimal results mostly unnecessary. As seen in the TPR and PPV rows, enabling suboptimal options results in a trade-off, increasing TPR at the expense of PPV, ultimately resulting in a higher MCC similar to the sRNA dataset. In Figures 3.8 and 3.9, we see MCC performance distributions of the energy-based tools on snoRNA data returning MFE and suboptimal results (when applicable).

The results of using the short ideal window versus full rRNA are shown in Figure 3.10 and summarized in Table 3.8, with average results for each tool across the dataset and metrics. As seen in the sRNA dataset, tools form pairs and cluster closely together (Figure 3.11 compared to Figure 3.2).

For the short ideal case (Table 3.8 (Short) rows), the average MCC performance is higher for all tools in comparison to the sRNA dataset, likely due to the simpler and more uniform interactions in this dataset. The simpler interactions are reflected in a much higher TPR for all tools, with six out of the ten energy-based tools achieving a TPR of ≥ 0.91, detecting almost all known interactions. With the exception of GUUGLE, RISEARCH, and RNAPLEX-c, which double their PPV from around 0.40 to 0.80, most tools only see relatively small improvements in PPV.

When we extend the target to full length rRNAs (Table 3.8 (Long) rows), we see a significant decrease in performance for all tools as the number of positive predictions increases, the majority of them false. Based on MCC, INTARNA, RNAUP, RNADUPLEX, RACTIP and RNAPLEX-a suffer a relatively smaller drop in MCC (-0.10 to -0.14), while the remaining tools suffer a larger decrease (-0.24 to -0.29).
For the latter tools, interaction-only tools with suboptimal options (GUUGLE, RISEARCH and RNAPLEX-c) see little change in TPR, but experience significant decreases in PPV, explaining the MCC drop. Concatenation tools PAIRFOLD and RNACOFOLD see significant drops in TPR, PPV and MCC.

Based on the Gibbs free energy cutoffs and the predicted number of basepairs (Table 3.8 Bps rows), increasing the target search space does not significantly change the energy threshold, but the number of basepairs that pass this threshold increases. With the exception of RNAUP, tools that do not compute suboptimal results do extremely poorly with the increased search space. Of the remaining tools that do compute suboptimal results, accessibility seems to be key to preventing huge losses in PPV, perhaps explaining why RNAUP remains competitive.

[Figure 3.8 image: one histogram panel per program (IntaRNA, RNAplex-a, GUUGle, Pairfold, RNAcofold, RactIP, RNAduplex, RIsearch, RNAplex-c, RNAup); x-axis MCC from 0.00 to 1.00, y-axis Count from 0 to 25.]

Figure 3.8: MCC performance distribution of MFE results for energy-based tools using optimal settings (where applicable) on the snoRNA-rRNA dataset.

[Figure 3.9 image: one histogram panel per program (IntaRNA, RNAplex-a, GUUGle, Pairfold, RNAcofold, RactIP, RNAduplex, RIsearch, RNAplex-c, RNAup); x-axis MCC from 0.00 to 1.00, y-axis Count from 0 to 20.]

Figure 3.9: MCC performance distribution of suboptimal results for energy-based tools using optimal settings (where applicable) on the snoRNA-rRNA dataset.

[Figure 3.10 image: three heatmap panels with one row per snoRNA/rRNA interaction, labelled by dataset identifier and snoRNA/target name (for example "(S004) snR40/18S"), and one column per program (IntaRNA, RNAplex-a, GUUGle, Pairfold, RNAcofold, RactIP, RNAduplex, RIsearch, RNAplex-c, RNAup). Color scale: MCC from 0.00 to 1.00 (left and middle panels) and ∆MCC (right panel).]

Figure 3.10: Predictive accuracy measured by Matthews Correlation Coefficient for energy-based interaction prediction tools on the snoRNA-rRNA dataset. Results on the short truncated target dataset (left) versus the long full-length target dataset (middle) with differences (right).

[Figure 3.11 image: two dendrograms with a height axis from 0 to 3. Leaf order, left panel: GUUGle, RIsearch, RNAplex-c, IntaRNA, RNAplex-a, Pairfold, RNAcofold, RactIP, RNAduplex, RNAup. Leaf order, right panel: GUUGle, IntaRNA, RIsearch, RNAplex-c, RNAplex-a, Pairfold, RNAcofold, RactIP, RNAduplex, RNAup.]

Figure 3.11: Clustering of tools according to energy-based MFE (left) and suboptimal (right) results on the snoRNA-rRNA dataset.

                     IntaRNA    RNAplex-a  GUUGle     Pairfold   RNAcofold  RactIP     RNAduplex  RIsearch   RNAplex-c  RNAup
MCC (Suboptimal)     0.80       0.83       0.86       -          -          -          0.38       0.87       0.84       -
MCC (MFE)            0.72       0.75       -          0.45       0.50       0.31       0.31       0.82       0.82       0.56
MCC (∆)              0.08       0.08       -          -          -          -          0.07       0.06       0.03       -
TPR (Suboptimal)     0.91       0.95       0.97       -          -          -          0.97       0.98       0.97       -
TPR (MFE)            0.77       0.82       -          0.72       0.75       0.61       0.70       0.87       0.84       0.63
TPR (∆)              0.13       0.14       -          -          -          -          0.27       0.11       0.13       -
PPV (Suboptimal)     0.75       0.76       0.81       -          -          -          0.16       0.81       0.76       -
PPV (MFE)            0.68       0.70       -          0.28       0.35       0.16       0.14       0.78       0.80       0.60
PPV (∆)              0.07       0.06       -          -          -          -          0.02       0.03       -0.03      -
∆G (Suboptimal)      -12.79     -14.12     12.58¹     -          -          -          -50.39     -21.44     -23.29     -
∆G (MFE)             -12.39     -14.46     -          -125.35²   -119.58²   -          -47.18     -20.45     -22.17     -17.10
∆G (∆)               -0.40      0.35       -          -          -          -          -3.20      -0.99      -1.13      -
Bps (Suboptimal)     18.23      17.75      34.67      -          -          -          118.94     17.90      20.27      -
Bps (MFE)            14.77      15.42      -          34.62      29.79      49.94      66.98      15.31      14.10      13.96
Bps (∆)              3.46       2.33       -          -          -          -          51.96      2.60       6.17       -
Rank (Suboptimal)    1.31       1.38       3.88       -          -          -          4.19       1.19       1.58       -

Table 3.5: Mean accuracy performance measures for suboptimal predictions of energy-based tools on 52 snoRNA-rRNA interactions.
For each metric, we have results computed with suboptimal results enabled, disabled (MFE) and the difference. Basepairs (Bps) is the mean number of predicted intermolecular basepairs. Five tools lacked the suboptimal option, and RACTIP and GUUGLE did not compute ∆G free energies. ¹ Minimal basepair threshold. ² Joint structure energies.

3.5.6 Comparative predictions for sRNA dataset

As seen in Table 3.6 and Figure 3.12, the minimum percent identity (% ID) has a large effect on the MCC, TPR and PPV values of the four tools evaluated. We observe a monotonically increasing trend for TPR as the minimum percent identity increases, suggesting that the experimentally determined basepairs are not extremely well conserved, and are only detected when a majority of divergent homologs are filtered out. PPV values fluctuate depending on the tool, with RNAPLEX-cA and RNAALIDUPLEX seeing a slight decreasing trend. With the exception of PETCOFOLD, there appears to be a clear trade-off between TPR and PPV, with TPR increasing while PPV decreases as the minimum percent identity of the alignment increases.

Maximal MCC performances for tools are obtained at 70% ID for RNAALIDUPLEX (0.31), 80% ID for PETCOFOLD (0.43), 90% ID for RNAPLEX-cA (0.56) and 100% ID for RNAPLEX-aA (0.69). For the three tools that have direct energy-based counterparts, these MCC values are greater than or equal to the performance values seen in Table 3.3, with an increase in performance of 0.01 for RNAALIDUPLEX, 0.04 for RNAPLEX-cA, and 0.00 for RNAPLEX-aA.
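The maxima quoted above can be re-derived from the MCC columns of Table 3.6. The following Python sketch simply transcribes those columns and picks, for each tool, the minimum percent identity with the highest MCC; the function name and the tie-breaking rule (lowest threshold wins) are choices made for this illustration only.

```python
# MCC columns transcribed from Table 3.6 (sRNA dataset).
percent_id = [45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100]
mcc_by_tool = {
    "PETcofold":    [0.30, 0.30, 0.29, 0.35, 0.33, 0.34, 0.41, 0.43, 0.39, 0.40, 0.42, 0.41],
    "RNAaliduplex": [0.20, 0.22, 0.23, 0.26, 0.30, 0.31, 0.28, 0.29, 0.29, 0.30, 0.30, 0.30],
    "RNAplex-aA":   [0.19, 0.20, 0.24, 0.31, 0.37, 0.42, 0.45, 0.50, 0.50, 0.57, 0.57, 0.69],
    "RNAplex-cA":   [0.23, 0.24, 0.27, 0.31, 0.41, 0.45, 0.47, 0.50, 0.50, 0.56, 0.54, 0.52],
}


def best_identity_per_tool(thresholds, scores_by_tool):
    """Map each tool to (threshold, score) at its maximal score.

    On ties, list.index returns the first match, i.e. the lowest threshold.
    """
    best = {}
    for tool, scores in scores_by_tool.items():
        top = max(scores)
        best[tool] = (thresholds[scores.index(top)], top)
    return best


best = best_identity_per_tool(percent_id, mcc_by_tool)
```

Here `best` maps RNAaliduplex to 70% ID, PETcofold to 80% ID, RNAplex-cA to 90% ID and RNAplex-aA to 100% ID, in agreement with the values stated in the text.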
Take note that optimal comparative MCC values were obtained at different minimum percent identity thresholds, with the RNAPLEX-aA threshold of 100% effectively being the energy-based method, as no divergent information was present in the alignment.

        PETcofold           RNAaliduplex        RNAplex-aA          RNAplex-cA
% ID    MCC   TPR   PPV     MCC   TPR   PPV     MCC   TPR   PPV     MCC   TPR   PPV
45      0.30  0.34  0.30    0.20  0.30  0.22    0.19  0.20  0.55    0.23  0.21  0.57
50      0.30  0.33  0.29    0.22  0.32  0.23    0.20  0.21  0.55    0.24  0.22  0.65
55      0.29  0.33  0.29    0.23  0.37  0.23    0.24  0.26  0.58    0.27  0.25  0.67
60      0.35  0.39  0.33    0.26  0.43  0.23    0.31  0.32  0.57    0.31  0.29  0.71
65      0.33  0.39  0.31    0.30  0.53  0.24    0.37  0.42  0.53    0.41  0.39  0.72
70      0.34  0.43  0.31    0.31  0.61  0.22    0.42  0.46  0.60    0.45  0.48  0.63
75      0.41  0.50  0.37    0.28  0.64  0.16    0.45  0.50  0.58    0.47  0.51  0.60
80      0.43  0.53  0.37    0.29  0.71  0.15    0.50  0.57  0.57    0.50  0.58  0.57
85      0.39  0.49  0.34    0.29  0.76  0.14    0.50  0.62  0.54    0.50  0.61  0.57
90      0.40  0.53  0.32    0.30  0.82  0.13    0.57  0.71  0.58    0.56  0.72  0.53
95      0.42  0.56  0.33    0.30  0.85  0.13    0.57  0.71  0.57    0.54  0.75  0.48
100     0.41  0.57  0.31    0.30  0.84  0.13    0.69  0.86  0.61    0.52  0.77  0.43

Table 3.6: Accuracy performance measures for comparative tools as a function of changing input multiple sequence alignment minimum percent identity on the sRNA dataset. Visually represented in Figure 3.12.

[Figure 3.12 image: three line-plot panels (MCC, PPV, TPR) with one line per program (PETcofold, RNAaliduplex, RNAplex-aA, RNAplex-cA); x-axis Input Alignment Minimum Percent Identity (ticks at 60, 80, 100), y-axis Performance Metric Value (ticks at 0.25, 0.50, 0.75).]

Figure 3.12: Performance of comparative tools using alignments for the sRNA dataset, alignment sequences filtered by minimum percent identity.

3.5.7 Comparative predictions for snoRNA dataset

We test the effects of multiple sequence alignment inputs on the snoRNA dataset with truncated rRNA targets, again filtering alignments by minimum percent identity.
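As a rough illustration of what filtering an alignment by minimum percent identity involves, the Python sketch below keeps only the homologs that are sufficiently similar to a reference sequence. Several details are assumptions made for the example rather than a description of the procedure used here: identity is measured against the first sequence of the alignment, only columns where neither sequence has a gap are compared, and the function names and toy sequences are hypothetical.

```python
def percent_identity(ref, seq, gap="-"):
    """Percent of compared columns with identical residues.

    Assumption: columns where either sequence has a gap are skipped.
    """
    if len(ref) != len(seq):
        raise ValueError("aligned sequences must have equal length")
    compared = matches = 0
    for a, b in zip(ref.upper(), seq.upper()):
        if a == gap or b == gap:
            continue
        compared += 1
        matches += a == b
    return 100.0 * matches / compared if compared else 0.0


def filter_alignment(alignment, min_identity):
    """Keep the reference plus every homolog at or above min_identity.

    alignment: list of (name, aligned_sequence); the first entry is taken
    as the reference (the experimentally studied sequence).
    """
    ref_name, ref_seq = alignment[0]
    kept = [(ref_name, ref_seq)]
    for name, seq in alignment[1:]:
        if percent_identity(ref_seq, seq) >= min_identity:
            kept.append((name, seq))
    return kept


# Toy alignment (hypothetical sequences).
alignment = [
    ("ref",  "GCAUCGAU"),
    ("hom1", "GCAUCGAU"),
    ("hom2", "GCAUCGUU"),
    ("hom3", "GGUU-GCA"),
]
kept_80 = [name for name, _ in filter_alignment(alignment, 80)]
kept_100 = [name for name, _ in filter_alignment(alignment, 100)]
```

At a threshold of 100% only sequences identical to the reference survive under these assumptions, so the alignment carries no divergent information, which is why such a run behaves like its single-sequence energy-based counterpart.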
As seen in Figure 3.14 and Table 3.7, this time we see that increasing percent identity actually results in a drop in MCC performance for three of the four MSA-based tools. Again, while we see a trade-off between increasing TPR and decreasing PPV as the minimum percent identity increases, the gains in TPR are quite minor, with the loss of PPV fairly significant as the minimum percent identity increases. These results seem to suggest that in this dataset, the benefits of an increase in PPV outweigh the penalties of decreased sensitivity, leading to an overall MCC that is superior to energy-based tools.

Maximal MCC performances for tools are obtained at 75% ID for RNAALIDUPLEX (0.67), 90% ID for PETCOFOLD (0.58), 80% ID for RNAPLEX-cA (0.93) and 80% ID for RNAPLEX-aA (0.82). For the three tools that have direct energy-based counterparts, these MCC values are greater than or roughly equal to the performance values seen in Table 3.8, with an increase in performance of 0.29 for RNAALIDUPLEX, 0.09 for RNAPLEX-cA, and -0.01 for RNAPLEX-aA. In contrast to the sRNA results, the increase in performance can be directly attributed to the conservation information, with an alignment of identical sequences (i.e. 100% minimum ID) resulting in inferior results.

Surprisingly, when comparing the two versions of RNAPLEX, the simpler RNAPLEX-cA that does not take in accessibility profiles does better than its accessibility-based comparative counterpart. With an optimal MCC value of 0.93, its accuracy is superior to all energy-based methods on the short snoRNA-rRNA dataset as seen in Table 3.8 (MCC (Short) row).

We follow up this evaluation by testing the same comparative tools on the snoRNA-rRNA dataset, but use the full-length rRNAs instead of the windowed binding sites. Simulating a de novo use case, we use no minimum percent identity filter, obtaining the results seen in Table 3.9. As in the case of energy-based algorithms in Table 3.8 (MCC (Long) row), we see a drop in overall performance as measured by MCC.
However, the use of conservation information maintains a relatively higher PPV value. This is most evident in RNAPLEX-cA, where we observe a TPR of 0.94 and a PPV of 0.81, resulting in an MCC of 0.85, eclipsing all other tools, whether conservation-based or energy-based, on the untruncated snoRNA-rRNA dataset.

[Figure 3.13 plot. Two scatter panels of PPV (x-axis, 0.00 to 1.00) against TPR (y-axis, 0.00 to 1.00). Programs: IntaRNA, RNAplex-a, GUUGle, Pairfold, RNAcofold, RactIP, RNAduplex, RIsearch, RNAplex-c, RNAup.]
Figure 3.13: Performance of MFE (left) and suboptimal results when applicable (right) for individual runs on the sRNA-mRNA dataset using energy-based tools.

        PETcofold           RNAaliduplex        RNAplex-aA          RNAplex-cA
% ID    MCC   TPR   PPV     MCC   TPR   PPV     MCC   TPR   PPV     MCC   TPR   PPV
 65     0.54  0.67  0.48    0.67  0.94  0.51    0.82  0.86  0.85    0.93  0.94  0.92
 70     0.54  0.67  0.48    0.67  0.94  0.51    0.82  0.86  0.85    0.93  0.94  0.92
 75     0.55  0.68  0.48    0.67  0.94  0.51    0.82  0.86  0.85    0.93  0.95  0.92
 80     0.55  0.68  0.50    0.66  0.95  0.50    0.82  0.87  0.84    0.93  0.95  0.92
 85     0.52  0.69  0.48    0.62  0.95  0.44    0.80  0.86  0.83    0.92  0.95  0.90
 90     0.58  0.77  0.47    0.53  0.96  0.33    0.81  0.89  0.81    0.92  0.96  0.89
 95     0.54  0.80  0.37    0.40  0.97  0.18    0.82  0.95  0.75    0.85  0.94  0.82
100     0.54  0.84  0.35    0.38  0.97  0.16    0.81  0.95  0.72    0.81  0.94  0.75

Table 3.7: Accuracy performance measures for comparative tools as a function of changing input multiple sequence alignment minimum percent identity on the snoRNA dataset. Visually represented in Figure 3.14.

[Figure 3.14 plot. Panels: MCC, PPV, TPR. x-axis: Input Alignment Minimum Percent Identity. y-axis: Performance Metric Value. Programs: PETcofold, RNAaliduplex, RNAplex-aA, RNAplex-cA.]
Figure 3.14: Performance of comparative tools using alignments for the snoRNA dataset, with alignment sequences filtered by minimum percent identity.

Metric        IntaRNA    RNAplex-a  GUUGle     Pairfold   RNAcofold  RactIP     RNAduplex  RIsearch   RNAplex-c  RNAup
MCC (Long)    0.70       0.71       0.62       0.18       0.23       0.18       0.24       0.58       0.60       0.45
MCC (Short)   0.80       0.83       0.86       0.45       0.50       0.31       0.38       0.87       0.84       0.56
MCC (∆)       -0.10      -0.12      -0.24      -0.26      -0.27      -0.13      -0.14      -0.29      -0.24      -0.11
TPR (Long)    0.90       0.95       0.96       0.31       0.40       0.36       0.97       0.98       0.97       0.51
TPR (Short)   0.91       0.95       0.97       0.72       0.75       0.61       0.97       0.98       0.97       0.63
TPR (∆)       -0.00      -0.00      -0.01      -0.41      -0.35      -0.25      0.00       0.00       0.00       -0.12
PPV (Long)    0.62       0.58       0.49       0.11       0.13       0.09       0.08       0.42       0.47       0.47
PPV (Short)   0.75       0.76       0.81       0.28       0.35       0.16       0.16       0.81       0.76       0.60
PPV (∆)       -0.13      -0.17      -0.32      -0.18      -0.22      -0.07      -0.08      -0.39      -0.30      -0.13
Bps (Long)    52.44      34.87      174.58     38.04      43.85      57.02      723.35     75.50      97.27      15.31
Bps (Short)   18.23      17.75      34.67      34.62      29.79      49.94      118.94     17.90      20.27      13.96
Bps (∆)       34.21      17.12      139.90     3.42       14.06      7.08       604.40     57.60      77.00      1.35

∆G and Rank are reported only for a subset of the tools; the values below are listed in the source's left-to-right order, and the empty columns are not recoverable from the extracted text.

∆G (Long)     -12.39  -14.46  12.58¹  -125.35²   -119.58²   -47.18  -20.45  -22.17  -17.10
∆G (Short)    -11.80  -13.26  12.64¹  -1042.80²  -1006.99²  -49.31  -20.38  -22.12  -18.17
∆G (∆)        -0.60   -1.21   -0.06¹  917.46²    887.41²    2.12    -0.07   -0.05   1.07
Rank (Long)   4.50    3.98    21.42   32.21      4.54       7.19
Rank (Short)  1.31    1.38    3.88    4.19       1.19       1.58
Rank (∆)      3.19    2.60    17.54   28.02      3.35       5.62

Table 3.8: Mean accuracy performance measures for suboptimal predictions of energy-based tools on 52 snoRNA-rRNA interactions. For each metric, we evaluate predictions using the full length snoRNA against the full length rRNA (long), or an optimal 300 nt window around the binding site on the rRNA (short). Bps is the mean number of predicted intermolecular basepairs. ¹Minimal basepair threshold. ²Joint structure energies.

Metric        PETcofold  RNAaliduplex  RNAplex-cA  RNAplex-aA
MCC (Long)    0.25       0.62          0.85        0.73
MCC (Short)   0.54       0.67          0.93        0.79
MCC (∆)       -0.29      -0.05         -0.07       -0.06
TPR (Long)    0.29       0.94          0.94        0.83
TPR (Short)   0.67       0.94          0.94        0.82
TPR (∆)       -0.38      0.00          -0.00       0.01
PPV (Long)    0.47       0.47          0.81        0.74
PPV (Short)   0.49       0.51          0.92        0.86
PPV (∆)       -0.03      -0.05         -0.11       -0.12
Bps (Long)    7.44       51.90         19.10       29.50
Bps (Short)   18.06      33.44         13.56       12.94
Bps (∆)       -10.62     18.46         5.54        16.56

∆G and Rank are reported for three of the four tools; the values below are listed in the source's left-to-right order, and the empty column is not recoverable from the extracted text.

∆G (Long)     -23.96  -19.22  -12.53
∆G (Short)    -24.06  -19.27  -11.08
∆G (∆)        0.09    0.05    -1.45
Rank (Long)   2.48    1.50    3.25
Rank (Short)  1.48    1.12    1.17
Rank (∆)      1.00    0.38    2.08

Table 3.9: Comparative results on the snoRNA dataset, evaluated by running the full length snoRNA query against the full length rRNA sequence (long) or a 300 nt window around the binding site (short).

3.5.8 Effect of different aligners on predictive performance

Using the snoRNA-rRNA dataset, we run the sequence-based aligners ProbConsRNA [231] and MAFFT (L-INS-i mode) [230], and the structurally-aware aligners LOCARNA [232], SPARSE [233], and two modes of MAFFT (Q-INS-i and X-INS-i). Each aligner is given the full-length, unfiltered, unaligned snoRNA and rRNA sequences as input. After obtaining the alignments, we trim the target rRNA to a window surrounding the known interaction site and ensure that each pair of alignments contains the same species. Finally, we progressively filter the alignments by minimum percent identity in steps of 10%.
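The progressive filtering step just described can be sketched as follows. This is a minimal illustration, not the procedure used in this work: it assumes that percent identity is measured between each aligned sequence and a chosen reference sequence, counting only columns where both are ungapped. The text does not define the measure, and the names `percent_identity` and `filter_alignment` as well as the toy alignment are hypothetical.

```python
from typing import Dict

GAP_CHARS = "-."  # characters treated as alignment gaps (assumption)


def percent_identity(ref: str, seq: str) -> float:
    """Percent of identical characters over columns where both rows are ungapped."""
    if len(ref) != len(seq):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(ref.upper(), seq.upper()):
        if a in GAP_CHARS or b in GAP_CHARS:
            continue  # skip columns with a gap in either sequence
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def filter_alignment(alignment: Dict[str, str], ref_name: str,
                     min_pid: float) -> Dict[str, str]:
    """Keep the reference and every row at least min_pid percent identical to it."""
    ref = alignment[ref_name]
    return {
        name: seq
        for name, seq in alignment.items()
        if name == ref_name or percent_identity(ref, seq) >= min_pid
    }


# Hypothetical toy alignment; "ref" stands in for the validated species.
toy_alignment = {
    "ref":       "GGCAUCGAUC",
    "homolog_a": "GGCAUCGAUC",
    "homolog_b": "GGCAUCGUUC",
    "homolog_c": "GACUU-GUUC",
}

# Filter in steps of 10%, as in the text.
for threshold in range(50, 101, 10):
    kept = filter_alignment(toy_alignment, "ref", threshold)
    print(threshold, sorted(kept))
```

Raising the threshold removes the more divergent rows first, so the 100% alignment contains only sequences identical to the reference and carries no covariation signal.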
We then run the comparative interaction prediction tools on the resulting alignments and evaluate performance in terms of MCC, TPR and PPV.

All tools successfully aligned the snoRNAs within a reasonable amount of time, but only PROBCONSRNA and two modes of MAFFT (L-INS-i and Q-INS-i) successfully completed the two rRNA alignments (18S and 25S) within a week of continuous runtime. For the tools that failed their rRNA alignments, the MAFFT Q-INS-i rRNA alignments were used. This test was to be repeated on the sRNA-mRNA dataset, but multiple tools were unable to complete the mRNA alignments within a week of runtime.

The accuracy of the four comparative tools on the different alignments is measured in MCC (Figure 3.15), TPR (Figure 3.16) and PPV (Figure 3.17). According to MCC, no aligner seems clearly superior to all the others, while SPARSE and PROBCONSRNA clearly produce inferior results with specific tools. The three MAFFT modes perform extremely similarly. Considering both accuracy and runtime, MAFFT Q-INS-i is arguably the best choice among the alignment algorithms tested.

[Figure 3.15 plot. Top row: one panel per aligner (LocARNA, MAFFT L-INS-i, MAFFT Q-INS-i, MAFFT X-INS-i, ProbConsRNA, SPARSE), lines per program. Bottom row: one panel per program (PETcofold, RNAaliduplex, RNAplex-aA, RNAplex-cA), lines per aligner. x-axis: Alignment Minimum Percent Identity (50 to 100). y-axis: Matthews Correlation Coefficient.]
Figure 3.15: Accuracy performance as measured by MCC of comparative interaction prediction tools on alignments constructed by different alignment algorithms and subsequently filtered by minimum percent identity.

[Figure 3.16 plot. Same panel layout as Figure 3.15. x-axis: Alignment Minimum Percent Identity (50 to 100). y-axis: True Positive Rate.]
Figure 3.16: Accuracy performance as measured by TPR of comparative interaction prediction tools on alignments constructed by different alignment algorithms and subsequently filtered by minimum percent identity.

[Figure 3.17 plot. Same panel layout as Figure 3.15. x-axis: Alignment Minimum Percent Identity (50 to 100). y-axis: Positive Predictive Value.]
Figure 3.17: Accuracy performance as measured by PPV of comparative interaction prediction tools on alignments constructed by different alignment algorithms and subsequently filtered by minimum percent identity.

3.5.9 Combining energy-based and comparative results

A common way to increase TPR or PPV is to combine the results of multiple tools. Given the number of tools evaluated, the number of potential combinations is too large to explore fully. We therefore test this technique only on the three energy-based tools with comparative counterparts.

In Table 3.10, we show the predictive performance of RNADUPLEX, RNAPLEX-c and RNAPLEX-a paired with their comparative counterparts RNAALIDUPLEX, RNAPLEX-cA and RNAPLEX-aA. For each pair of tools, we show the performance of the MFE algorithm, the MSA algorithm (MFE + alignment input), the union of results, and the intersection of results.

As expected, the union of results increases the TPR to values greater than those of the MSA or MFE results individually, but results in a large decrease in PPV. The intersection in most cases increases the PPV to values greater than those of the MSA or MFE results individually, and results in a decrease in TPR. Based on the MCC results, taking the union or intersection can be highly beneficial at times, resulting in values greater than those of the MFE or MSA results.
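The combination scheme above can be sketched with plain set operations on predicted basepairs. This is a minimal illustration under stated assumptions: basepairs are represented as (query position, target position) tuples, the function names and toy sets are hypothetical, and MCC is estimated as the geometric mean of TPR and PPV, following the caption of Table 3.10.

```python
from math import sqrt
from typing import Set, Tuple

Basepair = Tuple[int, int]  # (position in query, position in target); assumed encoding


def rates(tp: int, fn: int, fp: int) -> Tuple[float, float, float]:
    """Return (TPR, PPV, MCC estimate), with MCC approximated by sqrt(TPR * PPV)."""
    tpr = tp / (tp + fn) if tp + fn else 0.0
    ppv = tp / (tp + fp) if tp + fp else 0.0
    return tpr, ppv, sqrt(tpr * ppv)


def evaluate(predicted: Set[Basepair], known: Set[Basepair]) -> Tuple[float, float, float]:
    """Score a set of predicted basepairs against the known basepairs."""
    tp = len(predicted & known)   # predicted and known
    fn = len(known - predicted)   # known but missed
    fp = len(predicted - known)   # predicted but not known
    return rates(tp, fn, fp)


# Hypothetical toy data: known interaction and two predictions for it.
known = {(1, 20), (2, 19), (3, 18), (4, 17)}
mfe = {(1, 20), (2, 19), (7, 30), (8, 29)}   # energy-based prediction
msa = {(2, 19), (3, 18), (9, 40)}            # comparative prediction

union = mfe | msa
intersection = mfe & msa

for label, prediction in [("MFE", mfe), ("MSA", msa),
                          ("Union", union), ("Intersect", intersection)]:
    tpr, ppv, mcc = evaluate(prediction, known)
    print(f"{label:10s} TPR={tpr:.2f} PPV={ppv:.2f} MCC={mcc:.2f}")
```

Applied to the counts in the first row of Table 3.10 (TP 590, FN 74, FP 904), `rates` gives 0.89, 0.39 and 0.59 after rounding, matching the tabulated TPR, PPV and MCC. In the toy data the union has the highest TPR and the intersection the highest PPV, which is the trade-off described in the text.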
However, the results are inconsistent and it is hard to recommend a single setting for specific tools.

That said, if TPR or PPV is of particular interest in predicting basepair interactions, these results suggest that taking the union or intersection of results, respectively, is a reasonable approach.

3.5.10 Basepair covariation in datasets and background

To better understand the different effects that alignments had on the sRNA and snoRNA datasets, we computed the basepair covariation score and basepair conservation for the known inter- and intramolecular basepairs in the datasets. Figure 3.18 shows the binned distribution of basepairs according to their covariation scores (where -2 is unconserved, 0 is perfectly conserved, and 2 is covarying with compensatory mutations, as defined by [260]) and conservation (1 is perfect conservation of the nucleotide, 0 is no conservation). Specifically, we take the multiple sequence alignment as previously described for each dataset, and compute the two scores for every intermolecular basepair between the snoRNA and rRNA and between the sRNA and mRNA, and for the intramolecular rRNA-rRNA basepairs derived from the solved structures from the Comparative RNA Website [200]. We also repeat the same scoring with random helices predicted on the same rRNA alignment to obtain a background, and with random helices predicted on a shuffled (with MULTIPERM [261]) rRNA alignment as a fully random control. We filter all alignments to 80% minimum percent identity in both datasets and controls, which is where most tools have been shown to perform best across both datasets.

The results from the plots are summarized in Table 3.11.
Program    Data    Prediction  TP    FN    FP    TPR   PPV   MCC
RNAduplex  snoRNA  MSA         590   74    904   0.89  0.39  0.59
RNAduplex  snoRNA  MFE         471   193   3014  0.71  0.14  0.31
RNAduplex  snoRNA  Union       622   42    3640  0.94  0.15  0.37
RNAduplex  snoRNA  Intersect   439   225   278   0.66  0.61  0.64
RNAplex-c  snoRNA  MSA         588   76    80    0.89  0.88  0.88
RNAplex-c  snoRNA  MFE         489   175   285   0.74  0.63  0.68
RNAplex-c  snoRNA  Union       605   59    288   0.91  0.68  0.79
RNAplex-c  snoRNA  Intersect   472   192   77    0.71  0.86  0.78
RNAplex-a  snoRNA  MSA         492   172   150   0.74  0.77  0.75
RNAplex-a  snoRNA  MFE         569   95    257   0.86  0.69  0.77
RNAplex-a  snoRNA  Union       609   55    328   0.92  0.65  0.77
RNAplex-a  snoRNA  Intersect   452   212   79    0.68  0.85  0.76
RNAduplex  sRNA    MSA         271   1608  1628  0.14  0.14  0.14
RNAduplex  sRNA    MFE         874   1005  7962  0.47  0.10  0.21
RNAduplex  sRNA    Union       978   901   9354  0.52  0.09  0.22
RNAduplex  sRNA    Intersect   167   1712  236   0.09  0.41  0.19
RNAplex-c  sRNA    MSA         249   1630  484   0.13  0.34  0.21
RNAplex-c  sRNA    MFE         512   1367  2004  0.27  0.20  0.24
RNAplex-c  sRNA    Union       633   1246  2472  0.34  0.20  0.26
RNAplex-c  sRNA    Intersect   128   1751  16    0.07  0.89  0.25
RNAplex-a  sRNA    MSA         229   1650  723   0.12  0.24  0.17
RNAplex-a  sRNA    MFE         1180  699   1208  0.63  0.49  0.56
RNAplex-a  sRNA    Union       1210  669   1909  0.64  0.39  0.50
RNAplex-a  sRNA    Intersect   199   1680  22    0.11  0.90  0.31

Table 3.10: Predictive performance of RNADUPLEX, RNAPLEX-c and RNAPLEX-a paired with their comparative counterparts RNAALIDUPLEX, RNAPLEX-cA and RNAPLEX-aA. For each pair of tools, we show the performance of the MFE algorithm, the MSA algorithm (MFE + alignment input), the union of results, and the intersection of results. TP, FN, FP, TPR, PPV and MCC are defined in the main text, where MCC here is estimated by the geometric mean of the TPR and PPV.

From these plots, we clearly see that the majority of basepairs show strong conservation (1) and no covariation (0). Overall, we see a much stronger positive covariation score in the intramolecular rRNA-rRNA results, with the intermolecular basepairs in snoRNA-rRNA and sRNA-mRNA showing less covariation.
The intermolecular interactions show surprisingly similar covariation and conservation scores, both with slightly negative covariation scores and extremely high conservation scores.

Dataset                    Covariation   Conservation  Basepairs
rRNA-rRNA                  0.06±0.34     0.87±0.19     1504
snoRNA-rRNA                -0.03±0.13    0.98±0.06     664
sRNA-mRNA                  -0.03±0.14    0.97±0.08     1879
Non-functional rRNA-rRNA   -0.22±0.22    0.82±0.13     1500
Shuffled rRNA-rRNA         -0.60±0.21    0.64±0.13     1500

Table 3.11: Covariation and conservation scores (mean ± standard deviation) for known and control inter- and intramolecular basepairs.

3.6 Discussion

3.6.1 Settings and overfitting

As shown in Table 3.2, the effect of differing settings is significant for multiple tools on various datasets. Without prior information or guidance, determining optimal settings is a non-trivial task. Even with settings recommended by the authors, we noticed that settings that were optimal for one dataset cannot be assumed to work well on another. While the tools evaluated can technically be applied to any RNA sequence, the need for biologically-specific settings adds an extra layer of challenge for users.

In theory, it would be possible to perform an exhaustive search through the multi-variable setting space to find the values that maximize accuracy for each tool on each dataset. It is debatable, however, whether such dataset-specific settings would still perform well on other datasets and de novo user data, as such parameters would potentially be extremely overfit and largely meaningless in other research settings.

We showed that enabling the suboptimal option for tools that support it consistently increases the overall predictive performance (Table 3.3). As touched upon previously, however, in practice the user would then have to deal with a ranked list of results instead of a single output.
We have shown that the number of ranked results to use for optimal performance and the optimal Gibbs energy threshold vary greatly depending on both the tool and the dataset in question, making specific suggestions difficult.

[Figure 3.18 plot. One panel per dataset: rRNA-rRNA, snoRNA-rRNA, sRNA-mRNA, Non-functional rRNA-rRNA, Shuffled rRNA-rRNA. x-axis: Conservation (0.00 to 1.00). y-axis: Covariation. Legend: Basepairs (1, 8, 64, 512).]
Figure 3.18: Covariation and conservation of basepairs in manuscript datasets compared to positive and negative backgrounds. Mean values ± the standard deviation for covariation and conservation are shown. rRNA-rRNA (positive control) shows the basepair distribution seen in conserved intramolecular basepairs. snoRNA-rRNA and sRNA-mRNA show the basepair distributions of the experimentally validated (from a single species) intermolecular basepairs used in the dataset. Non-functional and Shuffled rRNA-rRNA (negative controls) show the basepair distributions for valid intramolecular basepairs on rRNA that are not in the known functional structure, and for valid intramolecular basepairs on the rRNA predicted after shuffling the sequence. The number of basepairs for the negative controls was selected to be roughly equal to that of the other sets.

3.6.2 Performance effects of conservation

According to benchmark evaluations in RNA secondary structure prediction, compensatory mutations in multiple sequence alignments can greatly aid the accurate prediction of basepairs [170].

For RNA-RNA interaction prediction, the inclusion of conservation information via alignments seems to bring mixed results depending on the tool and dataset. For interaction-only tools like RNAPLEX-c, the addition of conservation information increases the specificity, resulting in an overall increase in MCC. When used in conjunction with accessibility-based methods (e.g.
RNAPLEX-a), however, additional alignment information does not seem to significantly increase the performance, and may even decrease it, likely due to alignments of questionable quality. Out of all the tools evaluated, the highest MCC (0.93) was achieved by RNAPLEX-cA on the snoRNA dataset with a relatively divergent input alignment (65% ID), showing that in the ideal case, conservation information can provide the best predictions. The number of variables that need to be correctly determined for this optimal result (i.e. alignment settings, percent identity threshold, tool settings, suboptimal results to keep) could make it impractical in a de novo setting.

Previous studies have observed that sRNA binding sites exhibit high sequence conservation but low basepair conservation, further stating that covariation can only help a subset of interactions [247]. However, other studies, also on sRNA, have shown that under ideal circumstances [246] and with sophisticated alignment methods [218], conservation can serve as a beneficial feature.

In addition to issues with the alignments due to the biology of the sequences, the tools used to obtain homologs and align them can greatly affect the predictions, as shown through our use of six different alignment methods. Choosing a proper aligner is a non-trivial task, with a need to consider computational constraints, the RRI prediction tool, and whether an algorithmically more complex method is actually worth a potential increase in performance.
While our alignments could undoubtedly be improved with expert knowledge and manual curation, our work hopefully shows the issues with a high-throughput comparative de novo RNA-RNA interaction screen.

Compared to what is known from RNA secondary structure studies, the performance observed for comparative tools on the datasets at different minimum percent identity settings is perplexing. Possible explanations include factors attributable to the user, such as alignment input and tool settings, of which there are already a non-trivial number to control. Additionally, it is known that many of the evolutionary models and scores employed in these intermolecular basepair prediction algorithms were trained on intramolecular basepairs [210], which we have suggested may be under different selective pressures and evolve differently. All comparative algorithms used combine thermodynamic and evolutionary components, with tunable weights for each component trained on specific datasets. These weights undoubtedly play a role in how tools react to changes in alignment quality, but require a non-trivial amount of work to optimize for specific datasets.

Taken together, the benefits of conservation information depend highly on the type of interaction (and possibly even the specific transcript pair) in question, complicated by the significant effect that homolog and alignment quality have on the results. Even under ideal circumstances, however, previous work has shown that accessibility aids correct prediction more than conservation [246] and that using both results in only slightly higher performance. In practice, the effort and curation required to generate high quality alignments is non-trivial, making comparative methods potentially less appealing than energy-based methods.

3.6.3 Target size and interaction search space

For all tools and across all datasets tested, we consistently observe that increasing the length of the input sequences leads to a decrease in predictive performance.
For tools that enable multiple suboptimal results, there is little change to TPR, but difficulties in maintaining a high PPV result in an overall low MCC value. For tools that only produce an MFE result, both TPR and PPV suffer.

The inability of these tools to scale properly as input size increases is most problematic when applying them to predict potential interactions on a transcriptome-wide scale. These observations agree with existing interaction target prediction tools for sRNAs such as COPRARNA and RNAPREDATOR, shown to have PPV values of 44% and 28%, and TPR values of 23% and 32%, respectively [237, 238].

While we demonstrate on the untruncated snoRNA dataset that conservation may be a strong feature to include for an increased PPV, the difficulties discussed likely affect its usefulness in practice. COPRARNA uses homolog information in its computations, and it is uncertain whether it is the limitations of the algorithm, the alignment quality, or the biology that limits its performance.

3.6.4 Runtime and memory performance

We show CPU runtimes and physical memory usage for the energy-based tools outputting suboptimal results (where applicable) when running on the sRNA-mRNA dataset with increasingly long mRNA sequences in Figures 3.19 and 3.20, respectively.

With the exception of RNAUP, all tools ran in a few seconds to under a minute. Notably, GUUGLE, RISEARCH and RNADUPLEX returned results effectively immediately up to the maximal input length of 1150 basepairs. INTARNA, RNAPLEX-c, RNAPLEX-a and RNACOFOLD saw roughly linearly increasing runtimes up to roughly 10 seconds. PAIRFOLD and RACTIP had polynomial runtimes, with the longest jobs finishing in under a minute. RNAUP had runtimes several times larger than all other tools, with longer jobs taking several minutes.

For physical memory, most tools could comfortably run in under a hundred or so megabytes, with RNAUP being the exception, taking several times more.
GUUGLE, RISEARCH and RNADUPLEX again used negligible amounts of memory. Interestingly, both versions of RNAPLEX showed constant memory usage, and it is unclear whether this is a consequence of the scanning-like algorithm, or a pre-allocation of memory whose limit we have yet to encounter. RNACOFOLD, PAIRFOLD, RACTIP and INTARNA use progressively more memory relative to each other, and the memory usage of each increases with input length.

Extrapolating from these performances, it is likely that several tools would see little problem when applied to larger genome-wide searches, while others would need to be modified or perhaps used in a secondary pass in a larger pipeline.

3.7 Conclusion

RNA-RNA interaction prediction has increasingly become a field of intense interest, driven by advances in sequencing technology that are uncovering a vast number of novel non-coding RNAs. Using RNA-RNA interactions to identify ncRNA targets may help us determine their networks and functions in the cell.

Fast and accurate full-genome computational RNA interaction target searches are a sought-after goal, which we believe starts with a strong foundation of being able to accurately predict interaction sites given two transcripts. In this work, we have conducted the most comprehensive assessment of general RNA-RNA interaction prediction tools to date. For this, we have compiled a comprehensive benchmark dataset, consisting of two biologically different types of functional and experimentally confirmed RRIs: bacterial sRNA-mRNA interactions that regulate translation, and yeast snoRNA-rRNA interactions that guide nucleotide modifications. Instead of artificially truncating input sequences around their known interaction sites, we provide the full query and target transcript sequences to simulate a realistic setting for de novo discoveries.
We make our dataset publicly available for future research and development of new tools.

Evaluating all tools against all interactions in our dataset, we test not only the predictive accuracy of tools, but also the effects of various common settings seen in multiple tools. Of the four increasingly complex prediction strategies we grouped our tools into, those that made accessibility-based predictions generally fared the best, with INTARNA consistently performing well across all datasets, RNAPLEX-a performing closely on many occasions, and RNAUP being an exception to this observation.

[Figure 3.19 plot. Boxplots per tool: GUUGle, IntaRNA, Pairfold, RactIP, RIsearch, RNAcofold, RNAduplex, RNAplex-a, RNAplex-c, RNAup. y-axis: CPU Time (s). Legend: Length (bps), 150 to 1150.]
Figure 3.19: Boxplot of CPU runtime of tools on increasing input target mRNA sequences (query sRNA sequence length kept constant), using energy-based tools outputting suboptimal results where applicable.

[Figure 3.20 plot. Boxplots per tool: GUUGle, IntaRNA, Pairfold, RactIP, RIsearch, RNAcofold, RNAduplex, RNAplex-a, RNAplex-c, RNAup. y-axis: Physical Memory (MB). Legend: Length (bps), 150 to 1150.]
Figure 3.20: Boxplot of physical memory usage of tools on increasing input target mRNA sequences (query sRNA sequence length kept constant), using energy-based tools outputting suboptimal results where applicable.

The effect of adding evolutionary conservation information to predictions is highly mixed, ranging from detrimental effects on the sRNA dataset to impressive gains in performance on the untruncated snoRNA dataset. We further observe that the addition of conservation information to accessibility information often results in an overall decrease in performance.
This is unexpected given that the best methods for predicting RNA secondary structure work in a comparative way (by harnessing information on the evolutionary conservation of basepairs), and should also be seen as a warning, as many of the current methods for predicting general RNA-RNA interactions deliberately employ similar ideas.

With the field's goal of applying RNA-RNA interaction prediction to full-genome searches for RNA targets, we conduct a controlled experiment by increasing the target sequence length. As expected, we observe large drops in prediction accuracy, which has implications for large-scale searches.

The comparatively new field of general RNA-RNA interaction prediction thus needs a range of novel ideas compared to RNA secondary structure prediction to address the challenges shown by our benchmark tests.

Current prediction accuracies pale in comparison to those of general RNA secondary structure prediction algorithms. It may be that we have reached the theoretical limits that a generalized, non-biology-specific prediction algorithm can achieve, and that further performance gains can only be achieved by developing tools specific to a biological class of interactions. Various existing tools for miRNA prediction already pursue this, taking advantage of binding motifs and highly specific interaction lengths. There are also tools such as PLEXY [262], which can accurately predict C/D snoRNA binding sites by using nucleotide sequence motifs to narrow down the window of prediction. Alternatively, recent advancements have been made in high-throughput RNA structure and interaction probing [263]. The enzymatic probing data and pairing constraints these methods produce have the potential to greatly assist predictions, replacing pure in silico accessibility profiles with experimental binding evidence.
Regardless, we hope that our assessment and the accompanying dataset will help improve the current state of the art in RNA-RNA interaction prediction.

Chapter 4
Cotranscriptional RNA Folding and Prediction

In this chapter, we examine the biological in vivo process of RNA folding, and contrast it with the in vitro process currently captured in computational predictions. Specifically, we discuss basepair formation beginning from the very moment the RNA polymer forms during transcription, i.e. cotranscriptionally. We review the experimental and theoretical evidence for cotranscriptional folding, then discuss existing and possible methods for capturing this to improve the computational prediction of basepairs.

This chapter has been published as a review:
Lai, D., Proctor, J.R. and Meyer, I.M. (2013) On the importance of cotranscriptional RNA structure formation. RNA, 19(11), 1461-73. doi:10.1261/rna.037390.112.

4.1 Introduction and motivation

The primary products of all DNA genomes are RNA transcripts consisting of linear sequences of four different types of ribonucleotides (abbreviated A, C, G and U and chemically different from the similarly abbreviated DNA building blocks A, C, G and T). When a gene of the genome is activated, a corresponding transcript is synthesized in a linear fashion with its 5'-end emerging first and its 3'-end emerging last. Primary transcripts vary greatly in length from a few nucleotides (nt) to 10^4 nt and longer. They may be processed in a number of ways, e.g. splicing and RNA editing, which may happen while the transcript is being made. The functional role of some transcripts is exerted by RNA structure, which is formed when pairs of complementary nucleotides of the RNA sequence (C-G, A-U, G-U) form basepairs.
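The pairing rule just stated (C-G, A-U and G-U) can be illustrated with a short sketch. The dot-bracket notation used here to write down basepairs is an assumption for illustration only and is not introduced in the text; the function names and the example sequence are hypothetical. Positions are 0-based.

```python
from typing import List, Tuple

# Complementary nucleotide pairs named in the text, in both orientations.
CANONICAL = {("C", "G"), ("G", "C"), ("A", "U"), ("U", "A"), ("G", "U"), ("U", "G")}


def basepairs_from_dot_bracket(structure: str) -> List[Tuple[int, int]]:
    """Convert a dot-bracket string into a sorted list of (i, j) basepairs, i < j."""
    stack: List[int] = []
    pairs: List[Tuple[int, int]] = []
    for i, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(i)                  # opening position waits for its partner
        elif symbol == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at position {i}")
            pairs.append((stack.pop(), i))   # pair with the most recent opening
        elif symbol != ".":
            raise ValueError(f"unexpected symbol {symbol!r} at position {i}")
    if stack:
        raise ValueError(f"unmatched '(' at position {stack[-1]}")
    return sorted(pairs)


def non_canonical_pairs(sequence: str,
                        pairs: List[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """Return the basepairs whose nucleotides are not C-G, A-U or G-U."""
    seq = sequence.upper()
    return [(i, j) for i, j in pairs if (seq[i], seq[j]) not in CANONICAL]


# Hypothetical hairpin: three basepairs closing a three-nucleotide loop.
sequence = "GGGAAAUCC"
structure = "(((...)))"
pairs = basepairs_from_dot_bracket(structure)
print(pairs)
print(non_canonical_pairs(sequence, pairs))
```

Representing a structure as a list of position pairs in this way is all that is needed to compare a predicted set of basepairs with a known one.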
In contrast to proteins, where we typically need to know the three-dimensional structure in order to study potential functional roles, for RNA it often suffices to know only the secondary structure. This RNA secondary structure is defined by the pairs of basepaired sequence positions in the RNA. RNA structure can either be global, i.e. span most of the transcript, or more local, i.e. be confined to a sub-sequence of the transcript. During its life in the cell, a single transcript may assume more than one functionally relevant RNA structure, e.g. riboswitches, which can assume two mutually exclusive structures that are both functional.

Many computational methods for RNA structure prediction, in particular earlier and non-comparative methods, implicitly focus on predicting global RNA structures only. They are typically applied to analyze the non-coding portion of a given transcriptome, as this is where globally structured RNA genes are suspected. RNA structural features, however, are also known to play important functional roles in regulating protein-coding transcripts (e.g. splicing, localization, degradation, translation initiation), yet this typically involves only local RNA structures, which only some of the computational methods for RNA secondary structure prediction can adequately model [264, 265].

Recent advances in nucleotide sequencing technologies have enabled the routine sequencing of entire transcriptomes, with methods such as strand-specific RNA-seq enabling the discovery of novel transcripts en masse. Experimental methods for RNA structure determination such as X-ray crystallography and NMR can provide atomic-resolution three-dimensional solutions, but remain relatively costly and comparatively slow.
Computational methods for predicting RNA secondary structures based on RNA sequence information alone are therefore key to assigning potential functional roles to the transcriptome and identifying worthwhile targets for experimental validation.

When available, computational structure prediction can be aided by results from RNA footprinting experiments. Such experiments can estimate the pairing status of individual nucleotide positions in a single sequence with chemical probes, but cannot identify the pairing partner involved in a basepair. Such methods, when paired with next-generation sequencing technologies in protocols such as Frag-seq, PARS, and SHAPE-seq, show great promise in generating high-throughput RNA secondary structure probe maps [50]. Nonetheless, footprinting results still require algorithms to derive the overall most likely solution, again emphasizing the need for reliable and efficient computational methods.

There exists by now ample experimental evidence that RNA structure formation starts cotranscriptionally, while the RNA is transcribed from the genome. The process of cotranscriptional structure formation is key to determining the resulting functional RNA structure(s) in vivo, and this process can be influenced by a range of intrinsic as well as extrinsic factors. Yet, nearly all state-of-the-art methods for computational RNA secondary structure prediction ignore the structure formation process and focus exclusively on the end result, i.e. a single, final RNA structure. There already exist a few computational methods that aim to explicitly simulate the cotranscriptional folding pathway by capturing key features of the folding environment in vivo.
As their prediction accuracy has so far been evaluated on only a few select sequences of typically short length, however, they are currently viewed as folding pathway prediction methods rather than RNA secondary structure prediction methods.

We argue that ignoring the formation process often yields decent structure predictions, especially for short and globally structured transcripts (< 200 nt), but that in order to increase the prediction accuracy for longer transcripts and to reach a conceptually better understanding, we ought to take some effects of cotranscriptional folding into account.

In the following, we first review the variety of mechanisms that have been shown to influence cotranscriptional folding in vivo. This summarizes primarily experimental, but also some theoretical evidence for cotranscriptional folding. We then provide an overview of the currently existing methods for RNA secondary structure prediction. This part of the review is not aimed at providing a detailed description of every existing method for RNA secondary structure prediction, but rather at highlighting the different underlying concepts employed by these methods. At this point, we also cover methods for predicting RNA folding pathways which already capture some effects of cotranscriptional folding. To conclude, we propose a range of ideas for how cotranscriptional folding could be captured in computational methods for RNA secondary structure prediction in order to further improve their prediction accuracy.

4.2 Experimental and theoretical evidence for cotranscriptional folding

4.2.1 Directionality of transcription

One of the most obvious differences between the in vivo and the typical in vitro setting is that RNA transcripts in vivo emerge sequentially starting with the 5’-end, whereas in vitro experiments start with an already synthesized molecule.
The directionality of the molecule’s synthesis in vivo may thus lead to structural asymmetries during its cotranscriptional folding which may in turn influence the resulting functional RNA structure(s).

4.2.2 Transcription, transcription speed and variations thereof

Whether or not folding can happen during synthesis depends, amongst other things, on how the timescale of RNA synthesis compares to that of RNA structure formation. The speed of transcription depends not only on the underlying organism, but also on the polymerase responsible for generating the transcript in question. It ranges from 200 nucleotides per second (nt s⁻¹) in phages, to 20–80 nt s⁻¹ in bacteria and 10–20 nt s⁻¹ for human polymerase II [141]. On the other hand, RNA folding is known to occur on a wide range of timescales; some RNAs fold in 10–100 ms [266], whereas kinetically trapped conformations can persist for minutes or hours [266–268]. Experiments in the early 1980s showed that RNA structure formation can happen during transcription [269, 270], i.e. cotranscriptionally, and that folding in vivo can happen on the same timescale as RNA synthesis [271]. The latter was first shown for the cotranscriptional and structure-dependent self-splicing of the Tetrahymena group I intron [271].

Since then, several in vitro experiments have confirmed that RNA folding can happen cotranscriptionally and that the speed of transcription affects not only the overall folding rate, but also transient structures as well as the final structure [272–274]. Lewicki et al. [275] and Chao et al. [276] showed that altering the natural speed of transcription can yield misfolded and functionally inactive transcripts.
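To put the timescales quoted in this section side by side, the following minimal sketch computes how long the synthesis of a transcript takes at the transcription speeds given above and compares it to the 10–100 ms folding times. The function name, the dictionary layout and the example length of 1000 nt are illustrative assumptions and are not taken from the cited studies.

```python
# Transcription speeds (nt per second) and fast folding times (seconds) as
# quoted in the text. Names and data layout are illustrative assumptions.
SPEED_NT_PER_S = {
    "phage": (200, 200),
    "bacteria": (20, 80),
    "human Pol II": (10, 20),
}
FAST_FOLDING_S = (0.010, 0.100)  # 10-100 ms


def synthesis_time_range(length_nt, speed_range):
    """Return (shortest, longest) time in seconds to transcribe length_nt."""
    slowest, fastest = speed_range
    return (length_nt / fastest, length_nt / slowest)


if __name__ == "__main__":
    length_nt = 1000  # hypothetical transcript length
    for name, speeds in SPEED_NT_PER_S.items():
        shortest, longest = synthesis_time_range(length_nt, speeds)
        # Ratio of the shortest synthesis time to the 100 ms folding time.
        ratio = shortest / FAST_FOLDING_S[1]
        print(f"{name}: {shortest:.1f}-{longest:.1f} s to synthesize, "
              f"at least {ratio:.0f}x the 100 ms folding time")
```

For a 1000 nt transcript, even the fastest polymerase listed needs 5 s, i.e. 50 times the upper end of the 10–100 ms folding range, so fast-folding local structure has ample time to form before the 3’-end emerges. Kinetically trapped conformations that persist for minutes or hours, in contrast, can outlast synthesis altogether.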
Experimental studies of the Tetrahymena self-splicing intron are consistent with the view that a set of identical RNA molecules partitions into an active and an inactive pool, and that this partitioning is highly influenced by the cotranscriptional folding environment, including the RNA transcription rate [277].

For a given transcript, the speed of transcription is not necessarily constant. Transcriptional pausing can serve as an additional mechanism for fine-tuning cotranscriptional folding [278–280]. This pausing happens at specific transcript positions and for well-defined time intervals (ranging from 10⁻⁶ s to 10 s). In bacteria, pausing can be due to interactions between the emerging RNA and the polymerase and/or polymerase-associated protein factors [281–283]. The flavin mononucleotide (FMN) dependent riboswitch in Bacillus subtilis [279] is a beautiful example of how these features can be combined into a cotranscriptional feedback loop where the binding of a metabolite selects one of two possible cotranscriptional folding pathways whose resulting RNA structure determines whether transcription is terminated or not.

4.2.3 Self-interactions including transient RNA structures

One of the key features of any RNA sequence is that it can interact with itself via basepairs between complementary nucleotides to form RNA structure. During cotranscriptional folding, already formed structures can un-pair and yield to other structures, in which case we refer to them as transient structures. In other cases, it is energetically unfavourable for an existing structure to yield to a new conformation, thereby forming a kinetic trap. Transient structural features thus have the potential to significantly influence the cotranscriptional folding pathway and the resulting functional RNA structure(s), see Figure 4.2.
Most of our current knowledge of transient structures, which we also refer to as cis RNA-RNA interactions, stems from dedicated experimental studies of select folding pathways which explore how RNA structure changes as a function of time.

Folding pathways of RNA transcripts in vitro have been the subject of intense study for a long time. Initial experiments primarily studied how already synthesized and fully denatured RNA molecules fold, whereas more recent studies examine cotranscriptional folding pathways in vitro and, most recently, also in vivo [284, 285]. As all of these experiments are technically sophisticated, our current view derives from a few well-studied test cases such as the hairpin ribozyme [286–290] and the Tetrahymena intron [277, 291]. These ribozymes are comparatively easy to study in vivo as their cleavage relies on distinct structural features and yields products that are easier to detect than the corresponding functional structures.

Cotranscriptional folding, whether in vitro or in vivo, tends to happen sequentially [288, 290] as basepairs at the 5’-end of the RNA can form first, whereas basepairs involving the 3’-end can only form once transcription is complete. This folding often involves transient RNA structure elements, i.e. structural features that are only present for a specific time-span [270, 292]. These can direct the structure formation via one or several folding pathways towards the desired structural configuration(s). These transient features may also play distinct functional roles. They may, for example, be required for template activity during (+)-strand synthesis in some viruses [292] or may serve as protein-binding sites during transcription [293].
These examples once again illustrate that any given RNA transcript may have more than a single functionally relevant RNA structure during its lifetime in the cell.

Cotranscriptional folding and other reaction rates in vivo typically differ from those in vitro, with folding rates in vivo being typically [288, 290], but not always [286], higher than in vitro. One example is the cotranscriptional folding of the Tetrahymena ribozyme in vitro, which is twice as fast as the refolding of the fully synthesized and denatured molecule, but slower than the cotranscriptional folding in vivo [273]. Cotranscriptional folding pathways in vivo need not be unique [291] and tertiary interactions can determine which of several possible folding pathways is chosen [294]. Factors such as transcription speed and flanking sequences can also influence which pathway dominates [277]. One of the few existing in vivo studies of cotranscriptional folding pathways [295] indirectly examined the structural folding intermediates of the Tetrahymena ribozyme at 10⁻⁵ s time resolution using X-ray synchrotron radiation footprinting and chemical accessibility probing and found folding intermediates that are similar to those in vitro.

The tryptophan (trp) operon is a group of genes found in bacteria that act in the biosynthesis pathway of the amino acid tryptophan. The trp operon leader encodes a short peptide that is rich in tryptophan codons near the 5’-end of the RNA [296]. Regulation of the trp operon is carried out in part by the trp operon leader through a mechanism that relies on the simultaneous transcription of a DNA gene and translation of the resulting RNA in bacteria. The trp operon leader is a riboswitch that assumes two mutually exclusive structural configurations that form cotranscriptionally: the attenuator, which prevents further transcription of the trp operon, and the anti-terminator, which permits transcription [296].
When tryptophan levels are high, the ribosome proceeds rapidly through the operon leader and interferes with the anti-terminator hairpin. When tryptophan levels are low, the ribosome stalls while translating the leader peptide and allows the anti-terminator hairpin to form, and thus the trp operon is activated.

In addition to these experimental results, the bioinformatics community has conducted a range of computational studies to investigate cotranscriptional structure formation. Computational simulations of cotranscriptional folding pathways, e.g. [297], show that the basic features of cotranscriptional folding and their beneficial effects on RNA structure formation can be investigated in silico. Using a kinetic Markov chain Monte Carlo (MCMC) simulation to study the folding of the hepatitis delta virus ribozyme (87 nt length), Isambert and Siggia [297] show that cotranscriptional folding at the natural transcription speed of 50 nt s⁻¹ is significantly more efficient than when starting with a fully denatured sequence or when using the increased transcription speed of 1000 nt s⁻¹ that is typically used in in vitro experiments. By combining computational simulations of RNA folding pathways with phylogenetic structure analyses, Schoemaker and Gultyaev [298] investigate the effect of sRNA binding on ribosomal RNA (rRNA) structure formation during cotranscriptional folding and find that it significantly facilitates structure formation.

A bioinformatics analysis of 361 structural RNA genes [140] showed that these genes not only encode information on their known functional structure, but also on transient features of their respective cotranscriptional RNA folding pathways. For this, Meyer and Miklós examined helices (defined as contiguous stretches of adjacent basepairs) that could potentially outcompete helices of the known structure. They found statistically significant 5’-to-3’ asymmetries between these competing helices and the respective helices of the known structure.
More specifically, they identified two different types of transient structures: those that can yield to the functional structure and help its cotranscriptional formation and those that are more likely to act as kinetic traps during cotranscriptional folding. They showed that the former are preferentially encoded in the underlying RNA sequences, whereas the latter are suppressed.

More recently, Zhu et al. [143] conducted a computational study of six RNA families with known transient and alternative structures in order to test whether evolutionarily related sequences not only assume similar final structures, but also share common transient structures during their respective cotranscriptional folding pathways. They find that some transient structures have been evolutionarily conserved to a degree similar to that of the final structure. Moreover, they find that evolutionarily related sequences encounter similar transient structure features during their respective, predicted cotranscriptional folding pathways and that these features often coincide with known transient features.

To conclude, naturally occurring transcripts not only encode their functional RNA structure, but also information on how to get there via transient features that help define the corresponding cotranscriptional folding pathway.

4.2.4 Interactions with other molecules

One key difference between the in vivo and in vitro setting is that the cellular environment typically contains a wealth of additional molecules. In vivo, these may interact with the RNA transcript and thereby influence its structure formation and the resulting RNA structure, see Figure 4.2C. These molecules may comprise proteins, RNA transcripts, metabolites, ligands and different types of ions. Any intermolecular interaction between two distinct RNA molecules, i.e. any trans RNA-RNA interaction, has the potential to prevent the thus bound RNA nucleotides from engaging in other interactions including RNA structure (i.e.
cis RNA-RNA interactions). This may either stabilize or destabilize existing RNA structure features which may in turn influence the cotranscriptional folding pathway and the resulting RNA structures.

Due to the methodological challenges of investigating RNA folding in vivo and in real time, we currently have only limited insight into folding pathways in vivo [273, 291, 294, 295]. Numerous recent in vitro experiments that replicate specific aspects of the complex in vivo environment and rapid progress regarding in vivo methodologies [284, 299] are likely to change this.

So, which interactions between RNA transcripts and other molecules have been experimentally confirmed to be functionally important for RNA structure formation?

Ligand-RNA interactions

Among the most obvious examples where RNA structure formation is influenced by trans interactions are so-called riboswitches. The change from one distinct RNA structure to another is usually triggered by the binding of a metabolite or ion, but can also be induced by a temperature change, at least in bacteria (thermoswitches) [300–302]. The two distinct structural conformations of a riboswitch are typically located in the 5’-UTRs of messenger RNAs (mRNAs) and are mutually exclusive as they engage two overlapping sub-sequences. The structural change triggers a change of the gene’s expression by altering either its transcription, translation or splicing [81, 303]. Nechooshtan et al. [304] identified a pH-responsive riboregulator upstream of the alx ORF. At high pH, the translationally active RNA structure is formed during transcription, which involves two well-defined transcriptional pausing sites. Frieda and Block [305] succeeded in directly observing the cotranscriptional folding of the pbuE adenine riboswitch. Using an optical assay which allowed them to monitor folding transitions in individual transcripts in real time, they showed that the transcriptional outcome of the riboswitch is kinetically controlled.
Perdrizet et al. [306] present strong evidence that the btuB riboswitch in Escherichia coli depends on the precise transcriptional pausing of its polymerase to guide its folding into its native structure.

Protein-RNA interactions

In order for many large RNAs to fold in vitro into their functional structure without any other trans-acting molecules (apart from water), it is necessary to raise the concentration of metal ions (e.g. of Mg²⁺) significantly above normal levels in vivo [307, 308]. Several in vitro experiments have shown that the ion concentration can be lowered if specific proteins are added that stabilize the RNA structure [309–313] and that can bind folding intermediates [310]. This has also been confirmed by several in vivo experiments [314–316].

RNA-binding proteins often play different functional roles depending on the binding interface they use to interact with different partners. One example is Cyt 18 in Neurospora crassa which not only aids RNA folding, but also acts as a splicing factor and an aminoacyl-tRNA synthetase [314, 317]. Most of these proteins bind an RNA in a sequence- or structure-specific way [310, 311, 318–325]. There are also proteins, however, that interact with RNAs in a less specific way such as RNA helicases which help anneal and unwind RNAs while requiring ATP [326–330] and hnRNP proteins which bind single-stranded stretches of pre-mRNAs and thereby aid splicing [331]. Some protein-RNA interactions are required to happen at very specific times. One key example is the ribosomal RNAs, which are modified and processed, and the corresponding ribosomes pre-assembled, cotranscriptionally and in a tightly co-regulated way, as shown in several in vivo experiments [332–335].
There is also recent experimental evidence that cotranscriptional splicing is coupled to transcriptional pausing in yeast [336] and that, interestingly, cotranscriptional splicing can also be coupled to translation as shown in vivo for the thymidylate synthase intron of the T4 phage [337]. RNA-binding proteins involved in splicing may thus act cotranscriptionally.

Chaperone-RNA interactions

Chaperones are molecules, usually proteins, that assist another molecule’s correct folding by refolding misfolded structure features. Based on this definition, the trans-interaction partners of a given RNA transcript described above are not chaperones as they guide the correct co-folding pathway rather than help already misfolded RNA transcripts refold correctly. Many detailed experiments have shown that RNA transcripts can misfold in vitro and that it takes these molecules minutes to many hours or longer to escape these structural traps [338–340]. This may be attributed to several alternative folding pathways of the in vitro folding landscape which tends to be more rugged than the cotranscriptional folding landscape in vivo [341–343], but can also be due to individual RNA structure elements that keep the structure trapped.

There is some evidence that RNA structures can also misfold in vivo [291, 337] and that there exist dedicated cellular mechanisms for dealing with misfolded RNA structures, e.g. by sequestering and degrading them as shown for the Tetrahymena intron [291]. Most RNA chaperones identified so far are proteins that resolve misfolded RNA structures by binding stretches of double-stranded RNA with low affinity and in a sequence-unspecific way.
Other RNA chaperones bind single-stranded RNA and facilitate the transition from the incorrect to the correct structural conformation by lowering specific kinetic barriers [344].

Chaperone-assisted folding has been extensively studied for proteins, whereas comparatively little is known about the extent and mechanisms underlying chaperone-assisted RNA folding. What we know is that most of these proteins play a wide range of other functional roles in addition to being RNA chaperones and that they share no obvious similarities in terms of sequence and structure motifs [285]. Additionally, unlike protein chaperones, RNA chaperones typically do not require any ATP to encourage refolding [312, 344–346].

Trans RNA-RNA interactions, i.e. interactions with other transcripts

Trans RNA-RNA interactions involve the same elementary building blocks as RNA structure or cis RNA-RNA interactions, namely basepairs between pairs of complementary nucleotides. This implies that trans RNA-RNA interactions involve two single-stranded stretches of RNAs. They differ in that regard from protein-RNA interactions which may involve single-stranded or double-stranded RNA (and may happen in a sequence-specific or unspecific way).

If a single-stranded stretch of RNA sequence is to be bound in a sequence-specific way, it should be much easier in terms of evolution to come up with a corresponding, near-complementary RNA sequence than to devise an RNA-binding protein that would bind in an equally sequence-specific way.
One would therefore expect trans RNA-RNA interactions to be much more abundant than sequence-specific protein-RNA interactions with single-stranded RNAs [118, 347].

Functionally important trans RNA-RNA interactions include the well-known class of microRNA-mRNA interactions which alter gene expression at the mRNA level [348], interactions between snoRNAs and ribosomal RNAs which edit rRNAs before ribosome assembly [117] and snRNA-mRNA interactions which are key during mRNA splicing [115]. Both mRNA splicing and ribosome assembly can occur cotranscriptionally.

Large-scale transcriptome studies of higher organisms such as mouse and human show that a large fraction of the transcriptome does not encode any proteins, e.g. [349]. These non-coding transcripts are diverse with regard to length, expression patterns and levels, and functional roles, if known. This has given rise to a wealth of different names for these transcripts which are commonly collectively referred to as non-coding RNAs (ncRNA).

One well-studied ncRNA example is the short DsrA sRNA (small RNA) in Escherichia coli which alters the structure of the rpoS mRNA upon binding, thereby enabling its translation. In order for this trans RNA-RNA interaction to happen, the structure of the ncRNA DsrA first needs to be destabilized by binding the Sm-like protein Hfq [350–353]. Several other examples of structure-mediated translation regulation via trans RNA-RNA interactions between a short ncRNA and an mRNA have been found, primarily in bacteria [354, 355]. The short ncRNA is often an anti-sense transcript of the corresponding mRNA, the trans RNA-RNA interaction typically involves a short stretch of near-complementarity and a protein is often required as third ingredient for the regulatory mechanism to be functional.
Yet another example of a functionally relevant trans RNA-RNA interaction is the formation of the 30S ribosomal subunit in bacteria which requires the transient interaction with the leader sequence of the rRNA operons [356].

Another well-studied example is the hok/sok toxin-antitoxin system in Escherichia coli which provides a mechanism for preservation of the R1 plasmid after cell division [357], see Figure 4.1. This system consists of three overlapping genes. The host-killing hok gene induces cell death upon translation of its protein. The mok (modulation of killing) gene overlaps hok on the same mRNA transcript, and translation of the mok reading frame must occur in order for translation of hok to occur. The sok (suppression of killing) gene encodes a short anti-sense RNA that binds and prevents translation of mok, and thus indirectly, also the translation of hok. In cells that possess the R1 plasmid, the unstable sok RNA is produced in high quantities, and prevents cell death caused by the longer lived hok RNAs. Following cell division, the sok RNA is rapidly degraded in any daughter cells that lack the R1 plasmid, allowing the hok gene to induce cell death. The mechanism of the hok/sok system depends on several structural features of the hok mRNA. Alternative structural configurations reduce the degradation rate of the hok mRNA, and several transient hairpins at the 5’-end prevent binding of sok RNA during transcription [357].

4.2.5 Summary

The overall view that emerges is that the cotranscriptional folding pathways are determined both by intrinsic features encoded in the RNA sequence itself such as transient and final structural features, and by extrinsic features such as the speed of the transcribing polymerase and trans interaction partners (e.g. proteins, ligands and RNA transcripts).
In vivo, both types of features are combined in the appropriate cellular context and determine the functional RNA structure(s) being formed.

A range of experimental evidence supports the notion of fairly well-defined co-folding pathways in vivo. These pathways are on the one hand robust enough to guide the formation of the correct functional RNA structure under typical cellular conditions, but are, if required, flexible enough to yield different structural and functional outcomes if the cellular environment changes significantly [279].

4.3 Capturing cotranscriptional folding in methods for RNA secondary structure prediction

4.3.1 Existing methods for RNA secondary structure prediction

A wide variety of computational methods already exist for predicting RNA structural features. Most RNA structure prediction methods that can technically handle long, naturally occurring transcripts such as rRNAs only aim to capture the RNA secondary structure rather than its tertiary structure. Fortunately, many functional features can already be studied on this level of abstraction. In the following, we therefore focus on methods for RNA secondary structure prediction (rather than also covering methods for predicting tertiary RNA structure which are currently limited to sequences of around 100 nt length).

Existing methods for predicting RNA secondary structure can be broadly grouped into two categories: those that take a single RNA sequence as input and those that work in a comparative way by taking a set of homologous RNA sequences as input. There also exists a different class of prediction methods that explicitly predict cotranscriptional folding pathways in terms of RNA secondary structure changes over time. They aim to capture the structure formation process in vivo and are typically limited to analyzing transcripts of a few hundred nucleotides length.
These methods are currently viewed as folding pathway prediction methods rather than RNA secondary structure prediction methods.

Comparative methods for RNA secondary structure prediction currently provide the state of the art in terms of prediction accuracy, in particular for long RNA sequences. Apart from the recently introduced method COFOLD [162], however, none of the currently existing non-comparative or comparative methods for predicting RNA secondary structures explicitly captures cotranscriptional folding or its overall effects.

In the following, we review the existing methods and propose ways of capturing some effects of cotranscriptional folding explicitly in order to further improve their prediction accuracy.

Non-comparative, MFE methods for RNA secondary structure prediction

Historically, non-comparative methods which take a single RNA sequence as input came first. These employ the so-called minimum free energy (MFE) approach which aims to identify the (usually pseudoknot-free) RNA secondary structure that minimizes the overall Gibbs free energy of the transcript. They include well-known methods such as MFOLD, RNAFOLD and related programs [152, 153, 202, 358]. These methods mirror the in vitro setting, where a fully synthesized RNA has infinite time to settle into its thermodynamically most favorable configuration. They implicitly assume that the functionally relevant secondary structure is the thermodynamically most stable one. Predictions are generated by efficiently searching the space of all possible (usually pseudoknot-free) RNA secondary structures for the structure with the lowest overall free energy. This is typically done using a dynamic programming algorithm.

Several methods based on the suboptimal folding algorithm introduced by Wuchty et al. [359] have been developed which explicitly consider an ensemble of RNA secondary structures close to the minimum free energy.
RNASUBOPT, a program included in the VIENNARNA package [153, 201], provides a list of low-energy secondary structures within a user-defined energy range above the minimum free energy. SFOLD [189, 360, 361] employs a statistical approach to sample RNA secondary structures from the ensemble of RNA secondary structures at thermodynamic equilibrium, where the probability that the algorithm picks a particular structure is proportional to the structure’s probability in the structural ensemble. While these methods consider structures that differ from the MFE configuration, they still assume that the RNAs are in thermodynamic equilibrium. Moreover, they ignore the kinetic nature of cotranscriptional formation, and the effect it may have on the resulting structure or ensemble of structures.

In 1996, Morgan and Higgs [156] investigated a set of long RNAs (comprising 16S rRNAs, 23S rRNAs and RNase P) and found significant discrepancies between the evolutionarily conserved RNA structure features and the respective predicted MFE structures. They concluded that these differences “cannot simply be put down to errors in the free energy parameters used in the model” [156] and hypothesized that these may be due to effects of kinetic folding in vivo.

In order to test this hypothesis, Proctor and Meyer recently introduced the new RNA secondary structure prediction method COFOLD [162] which is the first to combine thermodynamic with kinetic considerations. They incorporate one overall effect of kinetic folding into a minimum free energy prediction method: the reachability of potential pairing partners during cotranscriptional folding. COFOLD demonstrates a significant performance improvement over minimum free energy methods alone, particularly for longer RNA sequences of more than 1000 nt for which one usually observes a marked decrease in prediction accuracy.
Capturing this overall effect of cotranscriptional folding yields RNA secondary structures with similar, but slightly higher free energies compared to the MFE structure. These results suggest that there may be great value in accounting for other effects of cotranscriptional folding to improve non-comparative methods for RNA secondary structure prediction.

Comparative methods for RNA secondary structure prediction

Rapidly increasing amounts of genome sequencing data for a variety of organisms have given rise to a conceptually new approach to RNA secondary structure prediction that takes as input a set of homologous RNA sequences rather than a single RNA sequence of interest, e.g. [166, 221, 260, 264, 265, 362–372]. Even though these comparative methods differ considerably regarding their underlying algorithms, they all aim to identify the consensus RNA secondary structure that has been conserved during evolution. The underlying working hypothesis is that RNA structures that are functionally relevant should also be conserved. This assumption usually holds as RNA structures tend to be more conserved than the underlying primary sequences. Depending on the evolutionary distances among the input sequences, however, this approach may fail to detect species-specific structure features that have only developed recently.

Overall, comparative methods for RNA secondary structure prediction currently provide the state of the art in terms of prediction accuracy. They tend to significantly outperform non-comparative methods [170], but typically require a high-quality input alignment provided by the user to reach their optimal performance (see, e.g. [364, 366, 367, 370, 372] for methods that do not require a fixed input alignment).

All of these methods generate predictions by first identifying pairs of covarying alignment columns to detect conserved basepairs and then combining these into a single (and, usually, global) consensus RNA secondary structure.
For this, they either employ (1) a modified MFE framework which also accounts for conservation of basepairs and aims for overall energy minimization, (2) a probabilistic framework such as stochastic context-free grammars (SCFGs) combined with likelihood maximization, (3) a non-deterministic, yet probabilistic approach such as Bayesian Markov chain Monte Carlo (MCMC) that samples from a posterior distribution and is subsequently combined with a post-processing step to extract a consensus structure, or (4) a combination of heuristic, ad-hoc procedures.

4.3.2 Existing methods for predicting RNA folding pathways

In parallel to the development of the RNA secondary structure prediction methods, several methods have been developed that aim to explicitly simulate cotranscriptional structure formation as a function of time. All of these methods, e.g. RNAKINETICS [160, 373, 374], KINFOLD [375], KINEFOLD [159, 297, 376] and KINWALKER [161], take as input a single RNA sequence and employ a range of different statistical models, approximations and heuristics to arrive at their predictions. Typically, they utilize stochastic simulation that extends the input RNA sequence at regular intervals, and simulates helix formation and disruption events over a simulated time scale. The probability that each event occurs is proportional to its theoretical chemical rate of change. They have, however, conceptual difficulties dealing with long sequences (over a few hundred nt), and until recently [143] their performance had been benchmarked only for a few select sequences. They are thus currently viewed as folding pathway prediction methods rather than RNA secondary structure prediction methods.

The recent study by Zhu et al.
[143] utilizes three of these existing methods to show that evolutionarily related RNA sequences share common transient structural features during their predicted folding pathways, and that these features often coincide with known transient structures. The authors propose an analysis pipeline that applies several folding pathway prediction methods in a comparative manner by combining folding predictions across evolutionarily related RNA sequences. Moreover, this study provides solid evidence that some transient helices have been conserved during evolution.

4.3.3 Ideas for capturing cotranscriptional folding in methods for RNA secondary structure prediction

The key effect of cotranscriptional folding is to make the formation of the final structure depend on its wider context, both along the sequence and in terms of time.

The key feature common to all existing non-comparative and comparative methods for RNA secondary structure prediction is that they search the space of all possible (typically pseudoknot-free) RNA secondary structures for the optimal structure without having any notion of a folding pathway or a time-wise ordering of events, see Figure 4.2. The recently introduced method COFOLD [162] is an exception, yet it currently only models a single overall effect, namely the reachability of basepairing partners during cotranscriptional folding, which effectively amounts to a re-weighting of different regions of the structure search space. The search of the structure space usually involves a scoring function whose overall value is being optimized during the search. The overall score for any candidate RNA structure is typically expressed as the sum or product of scores for individual structural building blocks that, taken together, cover the entire sequence.
These elementary scores and the way in which they are combined by the scoring function during optimization, however, depend only on the local building blocks of the sub-sequence under consideration, and neither on their location within the sequence nor on the RNA structure context of the surrounding sequence, see Figure 4.2. Most optimization algorithms are dynamic programming algorithms that combine optimal structures for adjacent sub-sequences into one optimal structure for the resulting merged sub-sequence. The order of these steps, however, does not replicate the events during cotranscriptional folding. In particular, no region of the theoretical structural search space is marked as unlikely, even if the corresponding structure feature could not readily form cotranscriptionally in vivo, see Figure 4.2.

Among the intrinsic features known to influence the formation of RNA structure in vivo are transient structures, as discussed earlier. As these features are encoded in the RNA sequence itself, they could in principle be detected by any method for RNA secondary structure prediction and subsequently used to bias the optimization process yielding the final RNA structure. Their detection could be implemented via a straightforward dynamic programming procedure that swiftly identifies all candidate helices (of some minimum length or stability) in the given input RNA sequence [140]. The conceptual problem is that these helices would naturally comprise both candidate transient helices and candidate helices of the final RNA secondary structure. These helices could be used in the optimization procedure in order to influence the local decision making (how to combine optimal structures for two sub-sequences into a single optimal structure for the merged sub-sequence). This would be one conceptual way of taking the wider structure context into account during the optimization procedure yielding the predicted final RNA structure.
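A candidate-helix scan of this kind can be sketched in a few lines of Python. The sketch below is an assumption about what such a procedure could look like, not the procedure of [140]: it considers only canonical and G-U pairs, measures helices by length rather than stability, and reports every maximal run of stacked basepairs. The function name `candidate_helices` and the parameters `min_len` and `min_loop` are hypothetical.

```python
# Watson-Crick and G-U wobble pairs (an assumption of this sketch).
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def candidate_helices(seq, min_len=3, min_loop=3):
    """Return maximal helices as (i, j, length) tuples with 0-based indices.

    A helix starting at the outermost pair (i, j) consists of the stacked
    pairs (i, j), (i+1, j-1), ..., with at least `min_loop` unpaired
    positions enclosed by its innermost pair.
    """
    n = len(seq)
    # run[i][j]: number of consecutive stacked pairs going inward from (i, j).
    run = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):  # span = j - i, shortest first
        for i in range(n - span):
            j = i + span
            if (seq[i], seq[j]) in PAIRS:
                run[i][j] = 1 + run[i + 1][j - 1]
    helices = []
    for i in range(n):
        for j in range(i + 1, n):
            if run[i][j] < min_len:
                continue
            # Skip pairs that are merely the interior of a longer helix.
            extends_outward = i > 0 and j < n - 1 and run[i - 1][j + 1] > 0
            if not extends_outward:
                helices.append((i, j, run[i][j]))
    return helices


if __name__ == "__main__":
    print(candidate_helices("GGGAAAACCC"))
```

The table is filled in quadratic time and memory in the sequence length. The resulting list deliberately mixes helices that could be transient with helices that could belong to the final structure; telling the two apart is the conceptual problem noted above.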
In the spirit of [140], these modifications could for example penalize any candidate structure that has strong competing transient helices upstream which could jeopardize its cotranscriptional formation.

Whereas the identification of candidate helices and relevant competing helices for a single sequence may be complicated due to the relatively large search space, comparative methods may generate a more accurate and smaller set of evolutionarily conserved competing helices to consider, such as those output by the comparative helix finding algorithm TRANSAT [195]. However, if transient RNA structural features turn out to be evolutionarily conserved on a similar level to those of the final RNA structure, which is what recent results by Zhu et al. [143] indicate, this may actually lower the prediction accuracy of comparative RNA secondary structure prediction methods, as they may erroneously incorporate these conserved transient helices into the predicted final RNA secondary structure. Whether or not this is the case and a cause for concern remains to be shown.

In addition to the ideas employed by COFOLD [162] discussed above, the directionality of transcription could also be captured by rendering the scores assigned to the structural building blocks dependent on their position within the transcript, that is, on whether they are nearer to the 5’-end or the 3’-end.

It is less obvious how one should account for the speed of transcription, let alone variations of transcription speed and transcriptional pausing. At least for now, there is too little experimental information to hope to identify transcriptional pausing sites computationally. A change in overall transcription speed alters the ratio between the speed of transcript synthesis and the rate of structure formation. This has been experimentally shown to influence cotranscriptional folding pathways and their structural outcome.
On the structure prediction side, the speed of transcription could be captured by altering the effective distances between structural features. This is exactly what the free parameter in COFOLD [162] is for. By changing its value, one can effectively account for different (yet constant) transcription rates and thereby optimize the program’s performance for different species. If the transcription speed is high with respect to the rate of structure formation, the emerging transcript has less time and hence fewer opportunities to explore the surrounding structure space. This has the overall effect of enlarging effective distances, whereas a low transcription speed should have the overall effect of reducing effective distances.

Trans interactions comprise a biologically diverse set of interactions between the transcript and various other molecules. All of the existing methods for predicting RNA secondary structure, including methods for folding pathway prediction, assume an isolated RNA sequence as input and ignore any potential trans interaction partners (the bulk effects of water and some ions are taken into account by most folding pathway prediction methods). Whether and how these trans interactions influence the cotranscriptional structure formation depends not only on the type of interaction (RNA-RNA, RNA-protein etc.), but also very much on the timing of the interaction with respect to the structure formation. For example, a protein that binds the emerging transcript early on and for a short time has a very different influence on structure formation than a protein that binds the final RNA structure only.

Early and persistent types of trans interactions could be captured in RNA secondary structure prediction methods by preventing the bound sub-sequence from engaging in other interactions, in particular other RNA structural features.
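As a toy illustration of this blocking idea, the Python sketch below forbids a set of bound positions from pairing during structure optimization. It deliberately swaps in simple basepair maximization (a Nussinov-style recursion) for a full free-energy model, so it shows only the mechanics of the constraint. The function name `max_pairs` and the parameters `blocked` and `min_loop` are hypothetical.

```python
# Watson-Crick and G-U wobble pairs (an assumption of this sketch).
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def max_pairs(seq, blocked=frozenset(), min_loop=3):
    """Maximum number of nested basepairs when `blocked` positions stay unpaired.

    `blocked` holds 0-based positions assumed to be occupied by a trans
    interaction partner (e.g. a bound protein) and therefore single-stranded.
    """
    n = len(seq)
    if n == 0:
        return 0
    # dp[i][j]: best score for the sub-sequence seq[i..j] (inclusive).
    dp = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            best = dp[i + 1][j]  # case 1: position i stays unpaired
            if i not in blocked:
                # case 2: position i pairs with some allowed position k
                for k in range(i + min_loop + 1, j + 1):
                    if k in blocked or (seq[i], seq[k]) not in PAIRS:
                        continue
                    inside = dp[i + 1][k - 1]
                    outside = dp[k + 1][j] if k < j else 0
                    best = max(best, 1 + inside + outside)
            dp[i][j] = best
    return dp[0][n - 1]


if __name__ == "__main__":
    seq = "GGGAAAACCC"
    print(max_pairs(seq))                 # unconstrained
    print(max_pairs(seq, blocked={0}))    # first G occupied
```

Forbidding a pair outright has the same effect as giving it a prohibitively large penalty; a constraint that instead keeps a site double-stranded would need a penalty on structures that leave the site unpaired.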
Technically, this is fairly easy to achieve via a slight modification of the default optimization procedure, by assigning a large penalty to all structure solutions that do not keep the bound sub-sequence single-stranded or double-stranded, as required by the interaction. This feature is already implemented by all RNA secondary structure prediction methods that allow known RNA structural features to be taken into account, e.g. [152, 221, 265]. This assumes, however, that details about the interaction site (sub-sequence, ssRNA versus dsRNA) are known up-front, which is often not the case.

Any trans interactions of a more transient nature, however, are hard to capture computationally by any of the existing methods for RNA secondary structure prediction, as this would require them to have some notion of time-ordered steps, which they currently do not have.

Suggestions for further improving methods for folding pathway prediction

The existing folding pathway prediction methods already mimic the in vivo folding as they fold the RNA sequence cotranscriptionally at a constant transcription speed (which needs to be specified by the user). This is, however, only a first approximation of the complex in vivo situation. As these methods explicitly predict folding pathways, they already model cis RNA-RNA interactions and in particular transient RNA structural features. At least for now, these methods do not predict variations of transcription speed and do not capture potential trans interactions with other molecules from the in vivo environment.

If details about trans interactions are known up-front (timing, binding site, ssRNA versus dsRNA), these could be fairly easily captured by preventing the known binding site from engaging in other interactions.
This has already been done for select examples and allowed us to computationally investigate the effect of trans interactions on cotranscriptional RNA structure formation [298].

4.4 Summary

With 90% of the human genome being transcribed [30–32], the investigation of transcriptomes and how they are regulated has never been more important. RNA structure is one important feature by which transcripts can influence their fate in the cell. There is by now ample experimental and solid theoretical evidence that RNA structure formation already starts during transcription and that events during the cotranscriptional folding determine which functional RNA structure(s) are being formed. Yet, as of now, the process of structure formation is completely ignored by almost all state-of-the-art methods for RNA secondary structure prediction. We argue that capturing some aspects of the structure formation process in predictive models could significantly improve these methods and provide evidence for this in the form of a new method [162]. These initial results are very encouraging as they show that a significant improvement in prediction accuracy can already be gained by modeling a single overall effect of cotranscriptional folding and without making the underlying prediction algorithm much more complex. Beyond this, we propose detailed ideas of how different aspects of cotranscriptional folding in vivo could also be captured in silico.

One of the simplest and most encouraging messages from the mounting (and sometimes dauntingly complex) experimental results is certainly the realization that the transcript in the cell does not explore all of the structure search space.

Figure 4.1: RNA structure features for the reference sequence from Escherichia coli plasmid R1 encoding the hok and mok proteins.
The horizontal line depicts the plasmid’s sequence with its nucleotides color-coded according to the legend on the top left. Underneath the sequence line, black arrows indicate the protein-coding regions of the hok and mok proteins. The grey arrow shows the sequence region that is complementary to the sok antisense RNA which is part of a different transcript. Each arc above the horizontal line represents a basepair between the two corresponding positions along the sequence and is color-coded according to the structure conformation to which it belongs (active, inactive or transient, see the legend on the top right). Below the horizontal sequence line, black lines indicate the location of known sequence motifs (tac, translational activator element; ucb, upstream complementary box; dcb, downstream complementary box; mok SD, mok Shine-Dalgarno sequence; hok SD, hok Shine-Dalgarno sequence; fbi, fold-back inhibitory element). This arc-diagram was first published by Steif and Meyer [357] and generated using the R-CHIE web-server [256].

From Adi Steif & Irmtraud M Meyer (2012) The hok mRNA. RNA Biology, 9(12):1399-1404. Retrieved October 13, 2015 from Taylor & Francis Online. doi:10.1080/15476286.2015.1008373. Used under Creative Commons Attribution-Non-Commercial 3.0 Unported License.

Figure 4.2: Examples of cis and trans interactions during cotranscriptional folding. A: Hypothetical RNA sequence, capable of forming helices h1 to h4, at sites A to E. B: Transcription of the sequence across time points t1 to t5, with the sequential lengthening of the 3’-end. The transcription process limits the available sites for helix formation, imposing an order on helix formation. If an early formed helix is stable, it can serve to block the formation of subsequent helices by occupying specific sites. C: Sites may also be occupied due to interactions with other molecules; in this case, a protein binding site (PBS) occupies site A, leading to a very different result.
D: If early helices are relatively unstable, they can be seen as transient helices that yield to new helices. This mechanism can aid the robust formation of desired structure features. Note that some of the conformations shown above correspond to the ones introduced and defined by Meyer and Miklós [140]. These are: in B, h1 (iī) and h3 (ic) are 3’ Trans, where h1 is stable, preventing the formation of h3, and h1 (īi) and h2 (ic) are 3’ Cis, where h1 is stable, preventing the formation of h2; in D, h1 (ci) and h2 (iī) are 5’ Cis, where h1 is an intermediate for h2, and h2 (ci) and h3 (iī) are 5’ Cis, where h2 is an intermediate for h3.

Chapter 5
Conclusion

Advances in high-throughput technologies have revealed an abundance of novel RNA species, and solving the structure of these species may aid researchers in characterizing and understanding them. With computational algorithms, we can swiftly predict basepairing structures, partners and interactions for these RNAs, highlighting candidates for further study and validation. In this thesis, I make contributions to the computational aspect of this problem.

In Chapter 2, I present a novel R package for RNA basepair processing, allowing for the easy manipulation of RNA sequence and basepair data. While algorithmically trivial, the availability of such a package can greatly aid the often tedious and mundane chore of writing ad hoc parsers and converters for the growing number of RNA basepair formats unique to every new prediction tool. Built upon the foundation of the R package R4RNA, I have already created the tool R-CHIE, capable of visualizing complex RNA basepair data in arc diagram formats, creating intuitive comparison plots not possible with typical methods of RNA secondary structure visualization. The linear format of the arc diagram also makes it ideal for overlaying on top of multiple sequence alignments, allowing users to see basepair conservation for each species at each position.
Since its publication, the work has been well received, seeing adoption by the largest RNA structure database for visualizing basepair conservation status [204], displaying alternative structures [205, 357], comparing multiple structures [206], and contrasting two unique structures relative to chemical probing data [176].

In Chapter 3, I conduct an assessment of the current state of the art for computational RNA-RNA interaction basepair prediction. To our knowledge, it is the largest and most comprehensive of its kind at the time of writing. With growing interest in determining ncRNA functionality, we feel it is important to establish the predictive accuracy of RNA-RNA interaction predictions given sequences with known interacting basepairs. For predicting short intermolecular RNA-RNA interactions, we have shown that tools considering standard stacking energy stabilities in addition to accessibility score penalties consistently perform the best. Contrary to RNA secondary structure prediction, the usage of conservation information does not result in a sufficient gain in predictive performance compared to the aforementioned method, and may even decrease the performance if used with accessibility. The inability of all tools to maintain a high performance as input length increases continues to be a major issue, with worrisome implications when applying these methods to whole-genome interaction partner searches. Overall, we believe that in their current states, computational RNA-RNA interaction prediction methods can serve as rough tools whose predictions are better than random guesses, but they fall short of replacing experimental methods.

In Chapter 4, we present a review on cotranscriptional folding in RNA basepair prediction, highlighting the biological and computational status of this process.
We present the first review outlining the experimental and statistical evidence the field has for cotranscriptional folding, the implications it has on RNA basepair formation, and how computational methods have used and can use this information. Biologically, we provide known examples of both cis- and trans-interactions with the RNA as it folds during transcription, including metabolites, ions, proteins, chaperones, upstream nucleotides, and nucleotides of other RNA sequences. Computationally, we review the current methods for RNA secondary structure prediction, starting with classical energy-based methods, comparative and hybrid methods taking sequence alignments, and the small set of tools simulating dynamic RNA folding pathways. Finally, we conclude with some ideas for incorporating RNA cotranscriptional knowledge into RNA prediction algorithms, with the main example being COFOLD [162]. Created by co-author Jeff Proctor, COFOLD incorporates some overall effects of cotranscriptional folding directly into the structure prediction process, thereby significantly increasing the prediction accuracy, especially for long RNA sequences.

As a whole, the thesis positions itself at the junction between what is known and what we can do in the field of computational RNA basepair prediction. It is often foolish to proceed down a path of research without knowing the current state of affairs, and my thesis hopes to serve as a beacon for future works to come.

Bibliography

[1] Crick, F.H., 1958. On protein synthesis. Symp Soc Exp Biol 12: 138–63. ISSN 0081-1386. → pages 1, 2
[2] Crick, F.H., 1970. Central dogma of molecular biology. Nature 227(5258): 561–3. ISSN 0028-0836. → pages 1
[3] Lehman, I., Bessman, M.J., Simms, E.S., & Kornberg, A., 1958. Enzymatic synthesis of deoxyribonucleic acid. I. Preparation of substrates and partial purification of an enzyme from Escherichia coli. J Biol Chem 233(1): 163–70. ISSN 0021-9258. → pages 1
[4] Hurwitz, J., Bresler, A., & Diringer, R., 1960.
The enzymic incorporation of ribonucleotides into polyribonucleotides and the effect of DNA. Biochem Biophys Res Commun 3(1): 15–19. ISSN 0006291X. doi:10.1016/0006-291X(60)90094-2. → pages 1
[5] Stevens, A., 1960. Incorporation of the adenine ribonucleotide into RNA by cell fractions from E. coli B. Biochem Biophys Res Commun 3(1): 92–96. ISSN 0006291X. doi:10.1016/0006-291X(60)90110-8. → pages
[6] Weiss, S.B., 1960. Enzymatic Incorporation of Ribonucleoside Triphosphates into the Interpolynucleotide Linkages of Ribonucleic Acid. Proc Natl Acad Sci U S A 46(8): 1020–30. ISSN 0027-8424. → pages 1
[7] Nirenberg, M. & Matthaei, J.H., 1961. The dependence of cell-free protein synthesis in E. coli upon naturally occurring or synthetic polyribonucleotides. Proc Natl Acad Sci U S A 47: 1588–602. ISSN 0027-8424. → pages 1
[8] Brenner, S.E., Jacob, F., & Meselson, M., 1961. An Unstable Intermediate Carrying Information from Genes to Ribosomes for Protein Synthesis. Nature 190(4776): 576–581. ISSN 0028-0836. doi:10.1038/190576a0. → pages 1, 2
[9] Jacob, F. & Monod, J., 1961. Genetic regulatory mechanisms in the synthesis of proteins. J Mol Biol 3(3): 318–356. ISSN 00222836. doi:10.1016/S0022-2836(61)80072-7. → pages 1
[10] Pabo, C.O. & Sauer, R.T., 1992. Transcription factors: structural families and principles of DNA recognition. Annu Rev Biochem 61: 1053–95. ISSN 0066-4154. doi:10.1146/annurev.bi.61.070192.005201. → pages 2
[11] Day, D.A. & Tuite, M.F., 1998. Post-transcriptional gene regulatory mechanisms in eukaryotes: an overview. J Endocrinol 157(3): 361–71. ISSN 0022-0795. → pages 2
[12] Inouye, M., 1988. Antisense RNA: its functions and applications in gene regulation–a review. Gene 72(1-2): 25–34. ISSN 0378-1119. → pages 2
[13] Fire, A., Xu, S., Montgomery, M.K., Kostas, S.A., Driver, S.E., & Mello, C.C., 1998. Potent and specific genetic interference by double-stranded RNA in Caenorhabditis elegans. Nature 391(6669): 806–11. ISSN 0028-0836. doi:10.1038/35888.
→ pages 2
[14] Elbashir, S.M., Harborth, J., Lendeckel, W., Yalcin, A., Weber, K., & Tuschl, T., 2001. Duplexes of 21-nucleotide RNAs mediate RNA interference in cultured mammalian cells. Nature 411(6836): 494–8. ISSN 0028-0836. doi:10.1038/35078107. → pages 2
[15] Esquela-Kerscher, A. & Slack, F.J., 2006. Oncomirs - microRNAs with a role in cancer. Nat Rev Cancer 6(4): 259–69. ISSN 1474-175X. doi:10.1038/nrc1840. → pages 2
[16] Calin, G.A. & Croce, C.M., 2006. MicroRNA signatures in human cancers. Nat Rev Cancer 6(11): 857–66. ISSN 1474-175X. doi:10.1038/nrc1997. → pages
[17] Haussecker, D., 2008. The business of RNAi therapeutics. Hum Gene Ther 19(5): 451–62. ISSN 1557-7422. doi:10.1089/hum.2008.007. → pages 2
[18] Lander, E.S., Linton, L.M., Birren, B., Nusbaum, C., Zody, M.C., Baldwin, J., Devon, K., Dewar, K., Doyle, M., FitzHugh, W. et al., 2001. Initial sequencing and analysis of the human genome. Nature 409(6822): 860–921. ISSN 0028-0836. doi:10.1038/35057062. → pages 2
[19] Venter, J.C., Adams, M.D., Myers, E.W., Li, P.W., Mural, R.J., Sutton, G.G., Smith, H.O., Yandell, M., Evans, C.A., Holt, R.A. et al., 2001. The sequence of the human genome. Science 291(5507): 1304–51. ISSN 0036-8075. doi:10.1126/science.1058040. → pages 2
[20] Kawai, J., Shinagawa, A., Shibata, K., Yoshino, M., Itoh, M., Ishii, Y., Arakawa, T., Hara, A., Fukunishi, Y., Konno, H. et al., 2001. Functional annotation of a full-length mouse cDNA collection. Nature 409(6821): 685–90. ISSN 0028-0836. doi:10.1038/35055500. → pages 2
[21] Okazaki, Y., Furuno, M., Kasukawa, T., Adachi, J., Bono, H., Kondo, S., Nikaido, I., Osato, N., Saito, R., Suzuki, H. et al., 2002. Analysis of the mouse transcriptome based on functional annotation of 60,770 full-length cDNAs. Nature 420(6915): 563–73. ISSN 0028-0836. doi:10.1038/nature01266. → pages 2
[22] Velculescu, V.E., Zhang, L., Vogelstein, B., & Kinzler, K.W., 1995. Serial Analysis of Gene Expression. Science 270(5235): 484–487. ISSN 0036-8075.
doi:10.1126/science.270.5235.484. → pages 2
[23] Shiraki, T., Kondo, S., Katayama, S., Waki, K., Kasukawa, T., Kawaji, H., Kodzius, R., Watahiki, A., Nakamura, M., Arakawa, T. et al., 2003. Cap analysis gene expression for high-throughput analysis of transcriptional starting point and identification of promoter usage. Proc Natl Acad Sci U S A 100(26): 15776–81. ISSN 0027-8424. doi:10.1073/pnas.2136655100. → pages 2
[24] Carninci, P., Kasukawa, T., Katayama, S., Gough, J., Frith, M.C., Maeda, N., Oyama, R., Ravasi, T., Lenhard, B., Wells, C. et al., 2005. The transcriptional landscape of the mammalian genome. Science 309(5740): 1559–63. ISSN 1095-9203. doi:10.1126/science.1112014. → pages 2
[25] Margulies, M., Egholm, M., Altman, W.E., Attiya, S., Bader, J.S., Bemben, L.A., Berka, J., Braverman, M.S., Chen, Y.J., Chen, Z. et al., 2005. Genome sequencing in microfabricated high-density picolitre reactors. Nature 437(7057): 376–380. ISSN 0028-0836. doi:10.1038/nature04726. → pages 2
[26] Bentley, D.R., Balasubramanian, S., Swerdlow, H.P., Smith, G.P., Milton, J., Brown, C.G., Hall, K.P., Evers, D.J., Barnes, C.L., Bignell, H.R. et al., 2008. Accurate whole human genome sequencing using reversible terminator chemistry. Nature 456(7218): 53–9. ISSN 1476-4687. doi:10.1038/nature07517. → pages
[27] Metzker, M.L., 2010. Sequencing technologies - the next generation. Nat Rev Genet 11(1): 31–46. ISSN 1471-0064. doi:10.1038/nrg2626. → pages 2, 19
[28] Mortazavi, A., Williams, B.A., McCue, K., Schaeffer, L., & Wold, B., 2008. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods 5(7): 621–8. ISSN 1548-7105. doi:10.1038/nmeth.1226. → pages 2
[29] Nagalakshmi, U., Wang, Z., Waern, K., Shou, C., Raha, D., Gerstein, M., & Snyder, M., 2008. The transcriptional landscape of the yeast genome defined by RNA sequencing. Science 320(5881): 1344–9. ISSN 1095-9203. doi:10.1126/science.1158441.
→ pages 2
[30] Birney, E., Stamatoyannopoulos, J.A., Dutta, A., Guigó, R., Gingeras, T.R., Margulies, E.H., Weng, Z., Snyder, M., Dermitzakis, E.T., Thurman, R.E. et al., 2007. Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature 447(7146): 799–816. ISSN 1476-4687. doi:10.1038/nature05874. → pages 2, 126
[31] The ENCODE Project Consortium, 2012. An integrated encyclopedia of DNA elements in the human genome. Nature 489(7414): 57–74. ISSN 0028-0836. doi:10.1038/nature11247. → pages 48
[32] Amaral, P.P., Dinger, M.E., Mercer, T.R., & Mattick, J.S., 2008. The eukaryotic genome as an RNA machine. Science 319(5871): 1787–9. ISSN 1095-9203. doi:10.1126/science.1155472. → pages 2, 48, 126
[33] Djebali, S., Davis, C.A., Merkel, A., Dobin, A., Lassmann, T., Mortazavi, A., Tanzer, A., Lagarde, J., Lin, W., Schlesinger, F. et al., 2012. Landscape of transcription in human cells. Nature 489(7414): 101–8. ISSN 1476-4687. doi:10.1038/nature11233. → pages 2
[34] Mattick, J.S., 2004. RNA regulation: a new genetics? Nat Rev Genet 5(4): 316–23. ISSN 1471-0056. doi:10.1038/nrg1321. → pages 2
[35] Backofen, R., Bernhart, S.H., Flamm, C., Fried, C., Fritzsch, G., Hackermüller, J., Hertel, J., Hofacker, I.L., Missal, K., Mosig, A. et al., 2007. RNAs everywhere: genome-wide annotation of structured RNAs. J Exp Zool B Mol Dev Evol 308(1): 1–25. ISSN 1552-5007. doi:10.1002/jez.b.21130. → pages
[36] Beiter, T., Reich, E., Williams, R.W., & Simon, P., 2009. Antisense transcription: a critical look in both directions. Cell Mol Life Sci 66(1): 94–112. ISSN 1420-9071. doi:10.1007/s00018-008-8381-y. → pages
[37] Clark, M.B., Amaral, P.P., Schlesinger, F.J., Dinger, M.E., Taft, R.J., Rinn, J.L., Ponting, C.P., Stadler, P.F., Morris, K.V., Morillon, A. et al., 2011. The reality of pervasive transcription. PLoS Biol 9(7): e1000625; discussion e1001102. ISSN 1545-7885. doi:10.1371/journal.pbio.1000625.
→ pages 2
[38] Lippman, Z., Gendrel, A.V., Black, M., Vaughn, M.W., Dedhia, N., McCombie, W.R., Lavine, K., Mittal, V., May, B., Kasschau, K.D. et al., 2004. Role of transposable elements in heterochromatin and epigenetic control. Nature 430(6998): 471–6. ISSN 1476-4687. doi:10.1038/nature02651. → pages 2
[39] Rinn, J.L., Kertesz, M., Wang, J.K., Squazzo, S.L., Xu, X., Brugmann, S.A., Goodnough, L.H., Helms, J.A., Farnham, P.J., Segal, E. et al., 2007. Functional demarcation of active and silent chromatin domains in human HOX loci by noncoding RNAs. Cell 129(7): 1311–23. ISSN 0092-8674. doi:10.1016/j.cell.2007.05.022. → pages 2
[40] Costa, F.F., 2008. Non-coding RNAs, epigenetics and complexity. Gene 410(1): 9–17. ISSN 0378-1119. doi:10.1016/j.gene.2007.12.008. → pages 2
[41] Rogers, J. & Wall, R., 1980. A mechanism for RNA splicing. Proc Natl Acad Sci U S A 77(4): 1877–9. ISSN 0027-8424. → pages 2
[42] Mattick, J.S., 2007. A new paradigm for developmental biology. J Exp Biol 210(Pt 9): 1526–47. ISSN 0022-0949. doi:10.1242/jeb.005017. → pages 2
[43] Kruger, K., Grabowski, P.J., Zaug, A.J., Sands, J., Gottschling, D.E., & Cech, T.R., 1982. Self-splicing RNA: autoexcision and autocyclization of the ribosomal RNA intervening sequence of Tetrahymena. Cell 31(1): 147–57. ISSN 0092-8674. → pages 2, 8
[44] Guerrier-Takada, C., Gardiner, K., Marsh, T., Pace, N., & Altman, S., 1983. The RNA moiety of ribonuclease P is the catalytic subunit of the enzyme. Cell 35(3 Pt 2): 849–57. ISSN 0092-8674. → pages 2, 10
[45] Crick, F.H., 1966. Codon–anticodon pairing: the wobble hypothesis. J Mol Biol 19(2): 548–55. ISSN 0022-2836. → pages 3
[46] Rich, A. & Davies, D.R., 1956. A new two stranded helical structure: polyadenylic acid and polyuridylic acid. J Am Chem Soc 78(14): 3548–3549. ISSN 0002-7863. doi:10.1021/ja01595a086. → pages 3
[47] Varshavsky, A., 2006. Discovering the RNA Double Helix and Hybridization. Cell 127(7): 1295–1297. ISSN 00928674. doi:10.1016/j.cell.2006.12.008.
→ pages 3
[48] Draper, D.E., Grilley, D., & Soto, A.M., 2005. Ions and RNA folding. Annu Rev Biophys Biomol Struct 34: 221–243. ISSN 1056-8700. doi:10.1146/annurev.biophys.34.040204.144511. → pages 3
[49] Tinoco, I. & Bustamante, C., 1999. How RNA folds. J Mol Biol 293(2): 271–81. ISSN 0022-2836. doi:10.1006/jmbi.1999.3001. → pages 4, 21
[50] Wan, Y., Kertesz, M., Spitale, R.C., Segal, E., & Chang, H.Y., 2011. Understanding the transcriptome through RNA structure. Nat Rev Genet 12(9): 641–55. ISSN 1471-0064. doi:10.1038/nrg3049. → pages 6, 12, 14, 15, 19, 110
[51] Mortimer, S.A., Kidwell, M.A., & Doudna, J.A., 2014. Insights into RNA structure and function from genome-wide studies. Nat Rev Genet 15(7): 469–79. ISSN 1471-0064. doi:10.1038/nrg3681. → pages 7, 12, 13, 14
[52] Holley, R.W., Apgar, J., Everett, G.A., Madison, J.T., Marquisee, M., Merrill, S.H., Penswick, J.R., & Zamir, A., 1965. Structure of a Ribonucleic Acid. Science 147(3664): 1462–5. ISSN 0036-8075. → pages 8
[53] Kim, S.H., Suddath, F.L., Quigley, G.J., McPherson, A., Sussman, J.L., Wang, A.H.J., Seeman, N.C., & Rich, A., 1974. Three-Dimensional Tertiary Structure of Yeast Phenylalanine Transfer RNA. Science 185(4149): 435–440. ISSN 0036-8075. doi:10.1126/science.185.4149.435. → pages 8
[54] Ladner, J.E., Jack, A., Robertus, J.D., Brown, R.S., Rhodes, D., Clark, B.F., & Klug, A., 1975. Structure of yeast phenylalanine transfer RNA at 2.5 A resolution. Proc Natl Acad Sci U S A 72(11): 4414–4418. ISSN 0027-8424. doi:10.1073/pnas.72.11.4414. → pages 8
[55] Woese, C.R., Gutell, R., Gupta, R., & Noller, H.F., 1983. Detailed analysis of the higher-order structure of 16S-like ribosomal ribonucleic acids. Microbiol Rev 47(4): 621–69. ISSN 0146-0749. → pages 8
[56] Cech, T.R., 2000. STRUCTURAL BIOLOGY: Enhanced: The Ribosome Is a Ribozyme. Science 289(5481): 878–879. ISSN 00368075. doi:10.1126/science.289.5481.878. → pages 8
[57] Nissen, P., Hansen, J., Ban, N., Moore, P.B., & Steitz, T.A., 2000.
The structural basis of ribosome activity in peptide bond synthesis. Science 289(5481): 920–30. ISSN 0036-8075. doi:10.1126/science.289.5481.920. → pages 8, 20
[58] Schluenzen, F., Tocilj, A., Zarivach, R., Harms, J., Gluehmann, M., Janell, D., Bashan, A., Bartels, H., Agmon, I., Yonath, A. et al., 2000. Small Ribosomal Subunit Resolution at 3.3 A. Structure 102: 615–623. → pages
[59] Wimberly, B.T., Brodersen, D.E., Clemons Jr, W.M., Morgan-Warren, R.J., Carter, A.P., Vonrhein, C., Hartsch, T., & Ramakrishnan, V., 2000. Structure of the 30S ribosomal subunit. Nature 407: 327–339. ISSN 0028-0836. doi:10.1038/35030006. → pages 8, 20
[60] Schmeing, T.M. & Ramakrishnan, V., 2009. What recent ribosome structures have revealed about the mechanism of translation. Nature 461(7268): 1234–1242. ISSN 0028-0836. doi:10.1038/nature08403. → pages 9
[61] Doudna, J.A. & Cech, T.R., 2002. The chemical repertoire of natural ribozymes. Nature 418(6894): 222–228. ISSN 0028-0836. doi:10.1038/418222a. → pages 8, 10
[62] Cech, T.R., 1990. Self-Splicing of Group I Introns. Annu Rev Biochem 59(1): 543–568. ISSN 0066-4154. doi:10.1146/annurev.bi.59.070190.002551. → pages 8
[63] Nielsen, H. & Johansen, S.D., 2009. Group I introns: Moving in new directions. RNA Biol 6(4): 375–383. ISSN 1547-6286. doi:10.4161/rna.6.4.9334. → pages 8
[64] Cate, J.H., Gooding, A.R., Podell, E., Zhou, K., Golden, B.L., Kundrot, C.E., Cech, T.R., & Doudna, J.A., 1996. Crystal structure of a group I ribozyme domain: principles of RNA packing. Science 273(5282): 1678–1685. ISSN 0036-8075. doi:10.1126/science.273.5282.1678. → pages 8
[65] Michel, F., Umesono, K., & Ozeki, H., 1989. Comparative and functional anatomy of group II catalytic introns - a review. Gene 82(1): 5–30. ISSN 03781119. doi:10.1016/0378-1119(89)90026-7. → pages 8
[66] Lambowitz, A.M. & Zimmerly, S., 2011. Group II introns: Mobile ribozymes that invade DNA. Cold Spring Harb Perspect Biol 3(8): 1–19. ISSN 19430264. doi:10.1101/cshperspect.a003616.
→ pages 8
[67] Toor, N., Keating, K.S., Taylor, S.D., & Pyle, A.M., 2008. Crystal Structure of a Self-Spliced Group II Intron. Science 320(5872): 77–82. ISSN 0036-8075. → pages 8
[68] Fica, S.M., Tuttle, N., Novak, T., Li, N.S., Lu, J., Koodathingal, P., Dai, Q., Staley, J.P., & Piccirilli, J.A., 2013. RNA catalyses nuclear pre-mRNA splicing. Nature ISSN 0028-0836. doi:10.1038/nature12734. → pages 8
[69] Joyce, G.F., 2002. The antiquity of RNA-based evolution. Nature 418(6894): 214–221. ISSN 0028-0836. doi:10.1038/418214a. → pages 10
[70] Guerrier-Takada, C. & Altman, S., 1984. Catalytic activity of an RNA molecule prepared by transcription in vitro. Science 223(4633): 285–286. ISSN 0036-8075. doi:10.1126/science.6199841. → pages 10
[71] Evans, D., Marquez, S.M., & Pace, N.R., 2006. RNase P: interface of the RNA and protein worlds. Trends Biochem Sci 31(6): 333–41. ISSN 0968-0004. doi:10.1016/j.tibs.2006.04.007. → pages 10
[72] Kazantsev, A.V., Krivenko, A.A., Harrington, D.J., Holbrook, S.R., Adams, P.D., & Pace, N.R., 2005. Crystal structure of a bacterial ribonuclease P RNA. Proc Natl Acad Sci U S A 102(38): 13392–13397. ISSN 0027-8424. doi:10.1073/pnas.0506662102. → pages 10
[73] Torres-Larios, A., Swinger, K.K., Krasilnikov, A.S., Pan, T., & Mondragón, A., 2005. Crystal structure of the RNA component of bacterial ribonuclease P. Nature 437(7058): 584–587. ISSN 0028-0836. doi:10.1038/nature04074. → pages 10
[74] Prody, G.A., Bakos, J.T., Buzayan, J.M., Schneider, I.R., & Bruening, G., 1986. Autolytic processing of dimeric plant virus satellite RNA. Science 231(4745): 1577–1580. ISSN 0036-8075. doi:10.1126/science.231.4745.1577. → pages 10
[75] Hammann, C., Luptak, A., Perreault, J., & de la Pena, M., 2012. The ubiquitous hammerhead ribozyme. RNA 18(5): 871–885. ISSN 1355-8382. doi:10.1261/rna.031401.111. → pages 10
[76] Pley, H.W., Flaherty, K.M., & McKay, D.B., 1994. Three-dimensional structure of a hammerhead ribozyme.
Nature 372(6501): 68–74. ISSN 0028-0836. doi:10.1038/372068a0. → pages 10
[77] Scott, W.G., Finch, J.T., & Klug, A., 1995. The crystal structure of an all-RNA hammerhead ribozyme: A proposed mechanism for RNA catalytic cleavage. Cell 81(7): 991–1002. ISSN 00928674. doi:10.1016/S0092-8674(05)80004-2. → pages 10
[78] Nahvi, A., Sudarsan, N., Ebert, M.S., Zou, X., Brown, K.L., & Breaker, R.R., 2002. Genetic control by a metabolite binding mRNA. Chem Biol 9(9): 1043. ISSN 1074-5521. → pages 10
[79] Tucker, B.J. & Breaker, R.R., 2005. Riboswitches as versatile gene control elements. Curr Opin Struct Biol 15(3): 342–8. ISSN 0959-440X. doi:10.1016/j.sbi.2005.05.003. → pages 10
[80] Barrick, J.E. & Breaker, R.R., 2007. The distributions, mechanisms, and structures of metabolite-binding riboswitches. Genome Biol 8(11): R239. ISSN 1465-6906. doi:10.1186/gb-2007-8-11-r239. → pages 10
[81] Serganov, A., 2009. The long and the short of riboswitches. Curr Opin Struct Biol 19(3): 251–259. ISSN 0959440X. doi:10.1016/j.sbi.2009.02.002. → pages 10, 11, 116
[82] Kim, J.N. & Breaker, R.R., 2008. Purine sensing by riboswitches. Biol Cell 100(1): 1–11. ISSN 02484900. doi:10.1042/BC20070088. → pages 11
[83] Volders, P.J., Verheggen, K., Menschaert, G., Vandepoele, K., Martens, L., Vandesompele, J., & Mestdagh, P., 2014. An update on LNCipedia: a database for annotated human lncRNA sequences. Nucleic Acids Res 43(D1): D174–D180. ISSN 0305-1048. doi:10.1093/nar/gku1060. → pages 11
[84] Brown, C.J., Ballabio, A., Rupert, J.L., Lafreniere, R.G., Grompe, M., Tonlorenzi, R., & Willard, H.F., 1991. A gene from the region of the human X inactivation centre is expressed exclusively from the inactive X chromosome. Nature 349(6304): 38–44. ISSN 0028-0836. doi:10.1038/349038a0. → pages 11
[85] Maenner, S., Blaud, M., Fouillen, L., Savoye, A., Marchand, V., Dubois, A., Sanglier-Cianférani, S., Van Dorsselaer, A., Clerc, P., Avner, P. et al., 2010.
2-D structure of the A region of Xist RNA and its implication for PRC2 association. PLoS Biol 8(1): 1–16. ISSN 15449173. doi:10.1371/journal.pbio.1000276. → pages 12
[86] Royce-Tolland, M.E., Andersen, A.A., Koyfman, H.R., Talbot, D.J., Wutz, A., Tonks, I.D., Kay, G.F., & Panning, B., 2010. The A-repeat links ASF/SF2-dependent Xist RNA processing with random choice during X inactivation. Nat Struct Mol Biol 17(8): 948–954. ISSN 1545-9993. doi:10.1038/nsmb.1877. → pages
[87] Rinn, J.L. & Chang, H.Y., 2012. Genome Regulation by Long Noncoding RNAs. Annu Rev Biochem 81(1): 145–166. ISSN 0066-4154. doi:10.1146/annurev-biochem-051410-092902. → pages 12
[88] Tsai, M.C., Manor, O., Wan, Y., Mosammaparast, N., Wang, J.K., Lan, F., Shi, Y., Segal, E., & Chang, H.Y., 2010. Long Noncoding RNA as Modular Scaffold of Histone Modification Complexes. Science 329(5992): 689–693. ISSN 0036-8075. doi:10.1126/science.1192002. → pages 12
[89] Somarowthu, S., Legiewicz, M., Chillón, I., Marcia, M., Liu, F., & Pyle, A., 2015. HOTAIR Forms an Intricate and Modular Secondary Structure. Mol Cell pages 353–361. ISSN 10972765. doi:10.1016/j.molcel.2015.03.006. → pages 12
[90] McPherson, R., Pertsemlidis, A., Kavaslar, N., Stewart, A., Roberts, R., Cox, D.R., Hinds, D.A., Pennacchio, L.A., Tybjaerg-Hansen, A., Folsom, A.R. et al., 2007. A Common Allele on Chromosome 9 Associated with Coronary Heart Disease. Science 316(5830): 1488–1491. ISSN 0036-8075. doi:10.1126/science.1142447. → pages 12
[91] Helgadottir, A., Thorleifsson, G., Manolescu, A., Gretarsdottir, S., Blondal, T., Jonasdottir, A., Jonasdottir, A., Sigurdsson, A., Baker, A., Palsson, A. et al., 2007. A Common Variant on Chromosome 9p21 Affects the Risk of Myocardial Infarction. Science 316(5830): 1491–1493. ISSN 0036-8075. doi:10.1126/science.1142842. → pages 12
[92] Yap, K.L., Li, S., Muñoz-Cabello, A.M., Raguz, S., Zeng, L., Mujtaba, S., Gil, J., Walsh, M.J., & Zhou, M.M., 2010.
Molecular Interplay of the Noncoding RNA ANRIL and Methylated Histone H3 Lysine 27 by Polycomb CBX7 in Transcriptional Silencing of INK4a. Mol Cell 38(5): 662–674. ISSN 10972765. doi:10.1016/j.molcel.2010.03.021. → pages 12
[93] Warf, M.B., Diegel, J.V., von Hippel, P.H., & Berglund, J.A., 2009. The protein factors MBNL1 and U2AF65 bind alternative RNA structures to regulate splicing. Proc Natl Acad Sci U S A 106(23): 9203–9208. ISSN 0027-8424. doi:10.1073/pnas.0900342106. → pages 12
[94] Singh, N.N., Singh, R.N., & Androphy, E.J., 2007. Modulating role of RNA structure in alternative splicing of a critical exon in the spinal muscular atrophy genes. Nucleic Acids Res 35(2): 371–389. ISSN 03051048. doi:10.1093/nar/gkl1050. → pages 12
[95] Hertel, K.J., 2008. Combinatorial control of exon recognition. J Biol Chem 283(3): 1211–1215. ISSN 00219258. doi:10.1074/jbc.R700035200. → pages 12
[96] Shepard, P.J. & Hertel, K.J., 2008. Conserved RNA secondary structures promote alternative splicing. RNA 14(8): 1463–1469. ISSN 1355-8382. doi:10.1261/rna.1069408. → pages 12
[97] Aragón, T., van Anken, E., Pincus, D., Serafimova, I.M., Korennykh, A.V., Rubio, C.A., & Walter, P., 2009. Messenger RNA targeting to endoplasmic reticulum stress signalling sites. Nature 457(7230): 736–740. ISSN 0028-0836. doi:10.1038/nature07641. → pages 13
[98] Chabanon, H., Mickleburgh, I., & Hesketh, J., 2004. Zipcodes and postage stamps: mRNA localisation signals and their trans-acting binding proteins. Brief Funct Genomic Proteomic 3(3): 240–256. ISSN 1473-9550. doi:10.1093/bfgp/3.3.240. → pages 13
[99] Ray, P.S., Jia, J., Yao, P., Majumder, M., Hatzoglou, M., & Fox, P.L., 2009. A stress-responsive RNA switch regulates VEGFA expression. Nature 457(7231): 915–919. ISSN 0028-0836. doi:10.1038/nature07598. → pages 13
[100] Pelletier, J. & Sonenberg, N., 1988. Internal initiation of translation of eukaryotic mRNA directed by a sequence derived from poliovirus RNA. Nature 334(6180): 320–325.
ISSN 0028-0836. doi:10.1038/334320a0. → pages 13
[101] Hellen, C.U., 2001. Internal ribosome entry sites in eukaryotic mRNA molecules. Genes Dev 15(13): 1593–1612. ISSN 08909369. doi:10.1101/gad.891101. → pages 13, 14
[102] Baird, S.D., Turcotte, M., Korneluk, R.G., & Holcik, M., 2006. Searching for IRES. RNA 12(10): 1755–1785. ISSN 1355-8382. doi:10.1261/rna.157806. → pages 14
[103] Colussi, T.M., Costantino, D.A., Zhu, J., Donohue, J.P., Korostelev, A.A., Jaafar, Z.A., Plank, T.d.M., Noller, H.F., & Kieft, J.S., 2015. Initiation of translation in bacteria by a structured eukaryotic IRES RNA. Nature 519(7541): 110–113. ISSN 0028-0836. doi:10.1038/nature14219. → pages 14
[104] Fütterer, J., Kiss-László, Z., & Hohn, T., 1993. Nonlinear ribosome migration on cauliflower mosaic virus 35S RNA. Cell 73(4): 789–802. ISSN 00928674. doi:10.1016/0092-8674(93)90257-Q. → pages 14
[105] Ryabova, L.A., Pooggin, M.M., & Hohn, T., 2006. Translation reinitiation and leaky scanning in plant viruses. Virus Res 119(1): 52–62. ISSN 01681702. doi:10.1016/j.virusres.2005.10.017. → pages 14
[106] Badis, G., Saveanu, C., Fromont-Racine, M., & Jacquier, A., 2004. Targeted mRNA degradation by deadenylation-independent decapping. Mol Cell 15(1): 5–15. ISSN 10972765. doi:10.1016/j.molcel.2004.06.028. → pages 14
[107] Nackley, A.G., Shabalina, S.A., Tchivileva, I.E., Satterfield, K., Korchynskyi, O., Makarov, S.S., Maixner, W., & Diatchenko, L., 2006. Human Catechol-O-Methyltransferase Haplotypes Modulate Protein Expression by Altering mRNA Secondary Structure. Science 314(5807): 1930–1933. ISSN 0036-8075. doi:10.1126/science.1131262. → pages 14
[108] Gilbert, W., 1986. Origin of life: The RNA world. Nature 319(6055): 618–618. ISSN 0028-0836. doi:10.1038/319618a0. → pages 14
[109] Crick, F., 1968. The origin of the genetic code. J Mol Biol 38(3): 367–379. ISSN 00222836. doi:10.1016/0022-2836(68)90392-6. → pages 14
[110] Orgel, L.E., 1968. Evolution of the genetic apparatus. J Mol Biol 38(3): 381–393.
ISSN 00222836. doi:10.1016/0022-2836(68)90393-8. → pages 14
[111] Breaker, R.R., 2012. Riboswitches and the RNA World. Cold Spring Harb Perspect Biol 4(2): a003566–a003566. ISSN 1943-0264. doi:10.1101/cshperspect.a003566. → pages 14
[112] Robertson, M.P. & Joyce, G.F., 2012. The Origins of the RNA World. Cold Spring Harb Perspect Biol 4(5): a003608–a003608. ISSN 1943-0264. doi:10.1101/cshperspect.a003608. → pages 14
[113] Bernhardt, H.S., 2012. The RNA world hypothesis: the worst theory of the early evolution of life (except for all the others). Biol Direct 7(1): 23. ISSN 1745-6150. doi:10.1186/1745-6150-7-23. → pages 14
[114] Madhani, H.D. & Guthrie, C., 1994. Dynamic RNA-RNA interactions in the spliceosome. Annu Rev Genet 28: 1–26. ISSN 0066-4197. doi:10.1146/annurev.ge.28.120194.000245. → pages 15
[115] Horowitz, D.S., 2012. The mechanism of the second step of pre-mRNA splicing. Wiley Interdiscip Rev RNA 3(3): 331–50. ISSN 1757-7012. doi:10.1002/wrna.112. → pages 15, 118
[116] Matera, A.G., Terns, R.M., & Terns, M.P., 2007. Non-coding RNAs: lessons from the small nuclear and small nucleolar RNAs. Nat Rev Mol Cell Biol 8(3): 209–20. ISSN 1471-0072. doi:10.1038/nrm2124. → pages 16
[117] Bachellerie, J.P., Cavaillé, J., & Hüttenhofer, A., 2002. The expanding snoRNA world. Biochimie 84(8): 775–90. ISSN 0300-9084. → pages 16, 63, 118
[118] Meyer, I.M., 2008. Predicting novel RNA-RNA interactions. Curr Opin Struct Biol 18(3): 387–93. ISSN 0959-440X. doi:10.1016/j.sbi.2008.03.006. → pages 17, 48, 118
[119] Ha, M. & Kim, V.N., 2014. Regulation of microRNA biogenesis. Nat Rev Mol Cell Biol 15(8): 509–524. ISSN 1471-0080. doi:10.1038/nrm3838. → pages 16
[120] Meister, G. & Tuschl, T., 2004. Mechanisms of gene silencing by double-stranded RNA. Nature 431(7006): 343–9. ISSN 1476-4687. doi:10.1038/nature02873. → pages 16
[121] Carthew, R.W. & Sontheimer, E.J., 2009. Origins and Mechanisms of miRNAs and siRNAs. Cell 136(4): 642–55. ISSN 1097-4172. doi:10.1016/j.cell.2009.01.035.
→ pages 16
[122] Jackson, A.L., Burchard, J., Schelter, J., Chau, B.N., Cleary, M., Lim, L., & Linsley, P.S., 2006. Widespread siRNA "off-target" transcript silencing mediated by seed region sequence complementarity. RNA 12(7): 1179–1187. ISSN 1355-8382. doi:10.1261/rna.25706. → pages 16
[123] Axtell, M.J., 2013. Classification and comparison of small RNAs from plants. Annu Rev Plant Biol 64: 137–59. ISSN 1545-2123. doi:10.1146/annurev-arplant-050312-120043. → pages 17
[124] Storz, G., Opdyke, J.A., & Zhang, A., 2004. Controlling mRNA stability and translation with small, noncoding RNAs. Curr Opin Microbiol 7: 140–144. ISSN 13695274. doi:10.1016/j.mib.2004.02.015. → pages 18, 62
[125] Wang, Z., Gerstein, M., & Snyder, M., 2009. RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet 10(1): 57–63. ISSN 1471-0064. doi:10.1038/nrg2484. → pages 19
[126] Sanger, F. & Coulson, A.R., 1975. A rapid method for determining sequences in DNA by primed synthesis with DNA polymerase. J Mol Biol 94(3): 441–448. ISSN 00222836. doi:10.1016/0022-2836(75)90213-2. → pages 19
[127] Van Dijk, E.L., Auger, H., Jaszczyszyn, Y., & Thermes, C., 2014. Ten years of next-generation sequencing technology. Trends Genet 30(9). ISSN 01689525. doi:10.1016/j.tig.2014.07.001. → pages 19
[128] Antal, M., Boros, E., Solymosy, F., & Kiss, T., 2002. Analysis of the structure of human telomerase RNA in vivo. Nucleic Acids Res 30(4): 912–20. ISSN 1362-4962. → pages 19
[129] Favorova, O.O., Fasiolo, F., Keith, G., Vassilenko, S.K., & Ebel, J.P., 1981. Partial digestion of tRNA–aminoacyl-tRNA synthetase complexes with cobra venom ribonuclease. Biochemistry 20(4): 1006–1011. ISSN 00062960. → pages 19
[130] Lowman, H.B. & Draper, D.E., 1986. On the recognition of helical RNA by cobra venom V1 nuclease. J Biol Chem 261(12): 5396–5403. ISSN 00219258. → pages 19
[131] Ke, A. & Doudna, J.A., 2004. Crystallization of RNA and RNA-protein complexes. Methods 34(3): 408–414. ISSN 10462023. doi:10.1016/j.ymeth.2004.03.027.
→ pages 19
[132] Parisien, M. & Major, F., 2008. The MC-Fold and MC-Sym pipeline infers RNA structure from sequence data. Nature 452(7183): 51–5. ISSN 1476-4687. doi:10.1038/nature06684. → pages 19
[133] Fürtig, B., Richter, C., Wöhnert, J., & Schwalbe, H., 2003. NMR spectroscopy of RNA. ChemBioChem 4(10): 936–962. ISSN 14394227. doi:10.1002/cbic.200300700. → pages 20
[134] Fang, X., Stagno, J.R., Bhandari, Y.R., Zuo, X., & Wang, Y.X., 2015. Small-angle X-ray scattering: a bridge between RNA secondary structures and three-dimensional topological structures. Curr Opin Struct Biol 30: 147–160. ISSN 0959440X. doi:10.1016/j.sbi.2015.02.010. → pages 20
[135] Grochulski, P., 2007. Status and the future of structural biology at the Canadian Light Source. Acta Cryst 121(4): 866–870. → pages 20
[136] Bai, X.C., McMullan, G., & Scheres, S.H., 2015. How cryo-EM is revolutionizing structural biology. Trends Biochem Sci 40(1): 49–57. ISSN 09680004. doi:10.1016/j.tibs.2014.10.005. → pages 20
[137] Garmann, R.F., Gopal, A., Athavale, S.S., Knobler, C.M., Gelbart, W.M., & Harvey, S.C., 2015. Visualizing the global secondary structure of a viral RNA genome with cryo-electron microscopy. RNA 21(5): 877–886. ISSN 1355-8382. doi:10.1261/rna.047506.114. → pages 20
[138] Marchanka, A., Simon, B., Althoff-Ospelt, G., & Carlomagno, T., 2015. RNA structure determination by solid-state NMR spectroscopy. Nat Commun 6(May): 7024. ISSN 2041-1723. doi:10.1038/ncomms8024. → pages 20
[139] Rouskin, S., Zubradt, M., Washietl, S., Kellis, M., & Weissman, J.S., 2014. Genome-wide probing of RNA structure reveals active unfolding of mRNA structures in vivo. Nature 505(7485): 701–5. ISSN 1476-4687. doi:10.1038/nature12894. → pages 20
[140] Meyer, I.M. & Miklós, I., 2004. Co-transcriptional folding is encoded within RNA genes. BMC Mol Biol 5: 10. ISSN 1471-2199. doi:10.1186/1471-2199-5-10. → pages 20, 114, 123, 124, 128
[141] Pan, T. & Sosnick, T., 2006. RNA folding during transcription.
Annu Rev Biophys Biomol Struct 35: 161–75. ISSN 1056-8700. doi:10.1146/annurev.biophys.35.040405.102053. → pages 20, 112
[142] Lai, D., Proctor, J.R., & Meyer, I.M., 2013. On the importance of cotranscriptional RNA structure formation. RNA 19: 1461–1473. doi:10.1261/rna.037390.112. → pages 20
[143] Zhu, J.Y.A., Steif, A., Proctor, J.R., & Meyer, I.M., 2013. Transient RNA structure features are evolutionarily conserved and can be computationally predicted. Nucleic Acids Res 41(12): 6273–6285. ISSN 0305-1048. doi:10.1093/nar/gkt319. → pages 20, 115, 122, 124
[144] Neugebauer, K.M., 2002. On the importance of being co-transcriptional. J Cell Sci 115(20): 3865–3871. ISSN 00219533. doi:10.1242/jcs.00073. → pages 20
[145] Tinoco, I., Uhlenbeck, O.C., & Levine, M.D., 1971. Estimation of secondary structure in ribonucleic acids. Nature 230(5293): 362–7. ISSN 0028-0836. → pages 21, 28
[146] Mathews, D.H., Sabina, J., Zuker, M., & Turner, D.H., 1999. Expanded sequence dependence of thermodynamic parameters improves prediction of RNA secondary structure. J Mol Biol 288(5): 911–40. ISSN 0022-2836. doi:10.1006/jmbi.1999.2700. → pages 21
[147] Andronescu, M., Condon, A., Hoos, H.H., Mathews, D.H., & Murphy, K.P., 2007. Efficient parameter estimation for RNA secondary structure prediction. Bioinformatics 23(13): i19–28. ISSN 1367-4811. doi:10.1093/bioinformatics/btm223. → pages 22
[148] Andronescu, M.S., Pop, C., & Condon, A.E., 2010. Improved free energy parameters for RNA pseudoknotted secondary structure prediction. RNA 16(1): 26–42. ISSN 1469-9001. doi:10.1261/rna.1689910. → pages 22
[149] Pipas, J.M. & McMahon, J.E., 1975. Method for predicting RNA secondary structure. Proc Natl Acad Sci U S A 72(6): 2017–21. ISSN 0027-8424. → pages 22
[150] Nussinov, R. & Jacobson, A.B., 1980. Fast algorithm for predicting the secondary structure of single-stranded RNA. Proc Natl Acad Sci U S A 77(11): 6309–13. ISSN 0027-8424. → pages 22
[151] Zuker, M., 1989.
On finding all suboptimal foldings of an RNA molecule. Science 244(4900): 48–52. ISSN 0036-8075. doi:10.1126/science.2468181. → pages 22
[152] Zuker, M. & Stiegler, P., 1981. Optimal computer folding of large RNA sequences using thermodynamics and auxiliary information. Nucleic Acids Res 9(1): 133–48. ISSN 0305-1048. → pages 22, 50, 52, 53, 120, 125
[153] Hofacker, I.L., Fontana, W., Stadler, P.F., Bonhoeffer, L.S., Tacker, M., & Schuster, P., 1994. Fast folding and comparison of RNA secondary structures. Monatshefte für Chemie Chem Mon 125(2): 167–188. ISSN 0026-9247. doi:10.1007/BF00818163. → pages 22, 31, 50, 54, 120
[154] Eddy, S.R., 2004. How do RNA folding algorithms work? Nat Biotechnol 22(11): 1457–8. ISSN 1087-0156. doi:10.1038/nbt1104-1457. → pages 22
[155] Eddy, S.R., 2004. What is dynamic programming? Nat Biotechnol 22(7): 909–10. ISSN 1087-0156. doi:10.1038/nbt0704-909. → pages 22
[156] Morgan, S.R. & Higgs, P.G., 1996. Evidence for kinetic effects in the folding of large RNA molecules. J Chem Phys 105(16): 7152. ISSN 00219606. doi:10.1063/1.472517. → pages 22, 121
[157] Staple, D.W. & Butcher, S.E., 2005. Pseudoknots: RNA structures with diverse functions. PLoS Biol 3(6): e213. ISSN 1545-7885. doi:10.1371/journal.pbio.0030213. → pages 22
[158] Rivas, E. & Eddy, S.R., 1999. A dynamic programming algorithm for RNA structure prediction including pseudoknots. J Mol Biol 285(5): 2053–68. ISSN 0022-2836. doi:10.1006/jmbi.1998.2436. → pages 22
[159] Xayaphoummine, A., Bucher, T., Thalmann, F., & Isambert, H., 2003. Prediction and statistics of pseudoknots in RNA structures using exactly clustered stochastic simulations. Proc Natl Acad Sci U S A 100(26): 15310–15315. ISSN 0027-8424. doi:10.1073/pnas.2536430100. → pages 22, 122
[160] Danilova, L.V., Pervouchine, D.D., Favorov, A.V., & Mironov, A.A., 2006. RNAKinetics: a web server that models secondary structure kinetics of an elongating RNA. J Bioinform Comput Biol 4(2): 589–596. ISSN 0219-7200. doi:10.1142/S0219720006001904.
→ pages 122
[161] Geis, M., Flamm, C., Wolfinger, M.T., Tanzer, A., Hofacker, I.L., Middendorf, M., Mandl, C., Stadler, P.F., & Thurner, C., 2008. Folding Kinetics of Large RNAs. J Mol Biol 379(1): 160–173. ISSN 00222836. doi:10.1016/j.jmb.2008.02.064. → pages 22, 122
[162] Proctor, J.R. & Meyer, I.M., 2013. CoFold: An RNA secondary structure prediction method that takes co-transcriptional folding into account. Nucleic Acids Res 41(9): 1–11. ISSN 03051048. doi:10.1093/nar/gkt174. → pages 22, 120, 121, 123, 124, 126, 130
[163] Fox, G.E. & Woese, C.R., 1975. 5S RNA secondary structure. Nature 256(5517): 505–507. ISSN 0028-0836. doi:10.1038/256505a0. → pages 23
[164] Higgs, P.G., 1998. Compensatory neutral mutations and the evolution of RNA. Genetica 102-103(1-6): 91–101. ISSN 0016-6707. → pages 23
[165] Gutell, R.R., Power, A., Hertz, G.Z., Putz, E.J., & Stormo, G.D., 1992. Identifying constraints on the higher-order structure of RNA: continued development and application of comparative sequence analysis methods. Nucleic Acids Res 20(21): 5785–95. ISSN 0305-1048. → pages 23
[166] Knudsen, B. & Hein, J., 1999. RNA secondary structure prediction using stochastic context-free grammars and evolutionary history. Bioinformatics 15(6): 446–454. ISSN 1367-4803. doi:10.1093/bioinformatics/15.6.446. → pages 23, 121
[167] Pedersen, J.S., Bejerano, G., Siepel, A., Rosenbloom, K., Lindblad-Toh, K., Lander, E.S., Kent, J., Miller, W., & Haussler, D., 2006. Identification and classification of conserved RNA secondary structures in the human genome. PLoS Comput Biol 2(4): e33. ISSN 1553-7358. doi:10.1371/journal.pcbi.0020033. → pages 23, 40
[168] Parker, B.J., Moltke, I., Roth, A., Washietl, S., Wen, J., Kellis, M., Breaker, R., & Pedersen, J.S., 2011. New families of human regulatory RNA structures identified by comparative analysis of vertebrate genomes. Genome Res 21(11): 1929–43. ISSN 1549-5469.
doi:10.1101/gr.112516.110. → pages 23
[169] Stark, A., Lin, M.F., Kheradpour, P., Pedersen, J.S., Parts, L., Carlson, J.W., Crosby, M.A., Rasmussen, M.D., Roy, S., Deoras, A.N. et al., 2007. Discovery of functional elements in 12 Drosophila genomes using evolutionary signatures. Nature 450(7167): 219–32. ISSN 1476-4687. doi:10.1038/nature06340. → pages 23
[170] Gardner, P.P. & Giegerich, R., 2004. A comprehensive comparison of comparative RNA structure prediction approaches. BMC Bioinformatics 5(1): 140. ISSN 1471-2105. doi:10.1186/1471-2105-5-140. → pages 23, 65, 66, 103, 121
[171] Meyer, I.M. & Miklós, I., 2007. SimulFold: simultaneously inferring RNA structures including pseudoknots, alignments, and trees using a Bayesian MCMC framework. PLoS Comput Biol 3(8): e149. ISSN 1553-7358. doi:10.1371/journal.pcbi.0030149. → pages 23
[172] Sankoff, D., 1985. Simultaneous Solution of the RNA Folding, Alignment and Protosequence Problems. SIAM J Appl Math 45(5): 810. ISSN 00361399. doi:10.1137/0145048. → pages
[173] Havgaard, J.H., Lyngsø, R.B., & Gorodkin, J., 2005. The FOLDALIGN web server for pairwise structural RNA alignment and mutual motif search. Nucleic Acids Res 33(Web Server issue): W650–3. ISSN 1362-4962. doi:10.1093/nar/gki473. → pages
[174] Sorescu, D.A., Möhl, M., Mann, M., Backofen, R., & Will, S., 2012. CARNA–alignment of RNA structure ensembles. Nucleic Acids Res 40(Web Server issue): W49–53. ISSN 1362-4962. doi:10.1093/nar/gks491. → pages 23
[175] Gentleman, R.C., Carey, V.J., Bates, D.M., Bolstad, B., Dettling, M., Dudoit, S., Ellis, B., Gautier, L., Ge, Y., Gentry, J. et al., 2004. Bioconductor: open software development for computational biology and bioinformatics. Genome Biol 5(10): R80. ISSN 1465-6914. doi:10.1186/gb-2004-5-10-r80. → pages 26
[176] Rogler, L.E., Kosmyna, B., Moskowitz, D., Bebawee, R., Rahimzadeh, J., Kutchko, K., Laederach, A., Notarangelo, L.D., Giliani, S., Bouhassira, E. et al., 2014.
Small RNAs derived from lncRNA RNase MRP have gene-silencing activity relevant to human cartilage-hair hypoplasia. Hum Mol Genet 23(2): 368–82. ISSN 1460-2083. doi:10.1093/hmg/ddt427. → pages 26, 44, 46, 129
[177] Lyngsø, R.B. & Pedersen, C.N.S., 2000. Pseudoknots in RNA secondary structures. Proc Fourth Annu Int Conf Comput Mol Biol 13(6): 201–209. doi:10.1145/332306.332551. → pages 27
[178] Byun, Y. & Han, K., 2009. PseudoViewer3: generating planar drawings of large-scale RNA structures with pseudoknots. Bioinformatics 25(11): 1435–7. ISSN 1367-4811. doi:10.1093/bioinformatics/btp252. → pages 27, 31
[179] Jacobson, A.B. & Zuker, M., 1993. Structural analysis by energy dot plot of a large mRNA. J Mol Biol 233(2): 261–9. ISSN 0022-2836. doi:10.1006/jmbi.1993.1504. → pages 28
[180] Zuker, M. & Jacobson, A.B., 1998. Using reliability information to annotate RNA secondary structures. RNA 4(6): 669–79. ISSN 1355-8382. → pages 28, 35
[181] Lorenz, R., Bernhart, S.H., Höner zu Siederdissen, C., Tafer, H., Flamm, C., Stadler, P.F., & Hofacker, I.L., 2011. ViennaRNA Package 2.0. Algorithms Mol Biol 6: 26. ISSN 1748-7188. doi:10.1186/1748-7188-6-26. → pages 28, 29, 32, 33, 49, 50, 51, 53, 55, 65
[182] Griffiths-Jones, S., 2003. Rfam: an RNA family database. Nucleic Acids Res 31(1): 439–441. ISSN 13624962. doi:10.1093/nar/gkg006. → pages 29, 36, 40, 44
[183] Nussinov, R., Pieczenik, G., Griggs, J.R., & Kleitman, D.J., 1978. Algorithms for Loop Matchings. SIAM J Appl Math 35(1): 68. ISSN 00361399. doi:10.1137/0135006. → pages 28
[184] Abrahams, J.P., van den Berg, M., van Batenburg, E., & Pleij, C., 1990. Prediction of RNA secondary structure, including pseudoknotting, by computer simulation. Nucleic Acids Res 18(10): 3035–44. ISSN 0305-1048. → pages 28
[185] Wiese, K., Glen, E., & Vasudevan, A., 2005. jViz.Rna - A Java tool for RNA secondary structure visualization. IEEE Trans Nanobioscience 4(3): 212–218. ISSN 1536-1241. → pages 28, 31, 34, 35
[186] Darty, K., Denise, A., & Ponty, Y., 2009.
VARNA: Interactive drawing and editing of the RNA secondary structure. Bioinformatics 25(15): 1974–5. ISSN 1367-4811. doi:10.1093/bioinformatics/btp250. → pages 28, 30, 31, 32, 34, 35
[187] Fresco, J.R., Alberts, B.M., & Doty, P., 1960. Some Molecular Details of the Secondary Structure of Ribonucleic Acid. Nature 188(4745): 98–101. ISSN 0028-0836. doi:10.1038/188098a0. → pages 29
[188] Shapiro, B.A., Lipkin, L.E., & Maizel, J., 1982. An interactive technique for the display of nucleic acid secondary structure. Nucleic Acids Res 10(21): 7041–52. ISSN 0305-1048. doi:10.1093/nar/gkn942. → pages 29, 43
[189] Ding, Y. & Lawrence, C.E., 2003. A statistical sampling algorithm for RNA secondary structure prediction. Nucleic Acids Res 31(24): 7280–7301. ISSN 03051048. doi:10.1093/nar/gkg938. → pages 33, 120
[190] Hogeweg, P. & Hesper, B., 1984. Energy directed folding of RNA sequences. Nucleic Acids Res 12(1Part1): 67–74. ISSN 0305-1048. doi:10.1093/nar/12.1Part1.67. → pages 32
[191] Le, S.Y., Nussinov, R., & Maizel, J.V., 1989. Tree graphs of RNA secondary structures and their comparisons. Comput Biomed Res 22(5): 461–73. ISSN 0010-4809. doi:10.1016/0010-4809(89)90039-6. → pages 32
[192] Gan, H.H., 2003. Exploring the repertoire of RNA secondary motifs using graph theory; implications for RNA design. Nucleic Acids Res 31(11): 2926–2943. ISSN 13624962. doi:10.1093/nar/gkg365. → pages 32
[193] Pace, N., Thomas, B., & Woese, C., 1999. Probing RNA structure, function, and history by comparative analysis. In R.F. Gesteland, T.R. Cech, & J.F. Atkins, editors, RNA World, chapter 4, pages 113–142. Cold Spring Harbor, 2nd edition. ISBN 0879695617. doi:10.1101/087969589.37.i. → pages 32
[194] Griffiths-Jones, S., 2005. RALEE–RNA ALignment editor in Emacs. Bioinformatics 21(2): 257–9. ISSN 1367-4803. doi:10.1093/bioinformatics/bth489. → pages 32, 34, 40
[195] Wiebe, N.J.P. & Meyer, I.M., 2010.
TRANSAT - method for detecting the conserved helices of functional RNA structures, including transient, pseudo-knotted and alternative structures. PLoS Comput Biol 6(6): e1000823. ISSN 1553-7358. doi:10.1371/journal.pcbi.1000823. → pages 33, 35, 36, 37, 124
[196] Wattenberg, M., 2002. Arc diagrams: visualizing structure in strings. In IEEE Symp. Inf. Vis. 2002. INFOVIS 2002., volume 2002, pages 110–116. IEEE Comput. Soc. ISBN 0-7695-1751-X. doi:10.1109/INFVIS.2002.1173155. → pages 35
[197] R Development Core Team, 2008. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing. → pages 35, 43
[198] Harrower, M. & Brewer, C.A., 2003. ColorBrewer.org: An Online Tool for Selecting Colour Schemes for Maps. Cartogr J 40(1): 27–37. ISSN 00000000. doi:10.1179/000870403235002042. → pages 35
[199] Bendaña, Y.R. & Holmes, I.H., 2008. Colorstock, SScolor, Ratón: RNA alignment visualization tools. Bioinformatics 24(4): 579–80. ISSN 1367-4811. doi:10.1093/bioinformatics/btm635. → pages 40
[200] Cannone, J.J., Subramanian, S., Schnare, M.N., Collett, J.R., D’Souza, L.M., Du, Y., Feng, B., Lin, N., Madabusi, L.V., Müller, K.M. et al., 2002. The comparative RNA web (CRW) site: an online database of comparative sequence and structure information for ribosomal, intron, and other RNAs. BMC Bioinformatics 3: 2. ISSN 1471-2105. → pages 41, 43, 99
[201] Hofacker, I.L., 2003. Vienna RNA secondary structure server. Nucleic Acids Res 31(13): 3429–3431. ISSN 03051048. doi:10.1093/nar/gkg599. → pages 43, 120
[202] Zuker, M., 2003. Mfold web server for nucleic acid folding and hybridization prediction. Nucleic Acids Res 31(13): 3406–3415. ISSN 1362-4962. doi:10.1093/nar/gkg595. → pages 43, 120
[203] Pearson, W.R. & Lipman, D.J., 1988. Improved tools for biological sequence comparison. Proc Natl Acad Sci U S A 85(8): 2444–8. ISSN 0027-8424. → pages 43
[204] Daub, J., Eberhardt, R.Y., Tate, J.G., & Burge, S.W., 2015. Rfam: Annotating Families of Non-Coding RNA Sequences. In E.
Picardi, editor, RNA Bioinforma., volume 1269, pages 349–363. Springer New York. ISBN 978-1-4939-2290-1. doi:10.1007/978-1-4939-2291-8_22. → pages 44, 129
[205] Zhu, J.Y.A. & Meyer, I.M., 2015. Four RNA families with functional transient structures. RNA Biol 12(1): 5–20. ISSN 1547-6286. doi:10.1080/15476286.2015.1008373. → pages 44, 45, 129
[206] Kutchko, K.M., Sanders, W.E.S., Ziehr, B.E.N., Phillips, G., Solem, A., Halvorsen, M., Weeks, K.M., Moorman, N., & Laederach, A., 2015. Multiple conformations are a conserved and regulatory feature of the RB1 5′ UTR. RNA 21(7): 1–12. doi:10.1261/rna.049221.114. → pages 44, 129
[207] Mercer, T.R. & Mattick, J.S., 2013. Structure and function of long noncoding RNAs in epigenetic regulation. Nat Struct Mol Biol 20(3): 300–7. ISSN 1545-9985. doi:10.1038/nsmb.2480. → pages 48
[208] Backofen, R. & Hess, W.R., 2010. Computational prediction of sRNAs and their targets in bacteria. RNA Biol 7(1): 33–42. ISSN 1555-8584. → pages 48, 49
[209] Peterson, S.M., Thompson, J.A., Ufkin, M.L., Sathyanarayana, P., Liaw, L., & Congdon, C.B., 2014. Common features of microRNA target prediction tools. Front Genet 5(February): 1–10. ISSN 16648021. doi:10.3389/fgene.2014.00023. → pages 48, 61
[210] Seemann, S.E., Richter, A.S., Gesell, T., Backofen, R., & Gorodkin, J., 2011. PETcofold: predicting conserved interactions and structures of two multiple alignments of RNA sequences. Bioinformatics 27(2): 211–9. ISSN 1367-4811. doi:10.1093/bioinformatics/btq634. → pages 49, 50, 51, 58, 103
[211] Tafer, H. & Hofacker, I.L., 2008. RNAplex: a fast tool for RNA-RNA interaction search. Bioinformatics 24(22): 2657–63. ISSN 1367-4811. doi:10.1093/bioinformatics/btn193. → pages 49, 51, 52, 66
[212] Wenzel, A., Akbasli, E., & Gorodkin, J., 2012. RIsearch: fast RNA-RNA interaction search using a simplified nearest-neighbor energy model. Bioinformatics 28(21): 2738–46. ISSN 1367-4811. doi:10.1093/bioinformatics/bts519. → pages 49, 51, 52, 63
[213] Gerlach, W. & Giegerich, R., 2006.
GUUGle: a utility for fast exact matching under RNAcomplementary rules including G-U base pairing. Bioinformatics 22(6): 762–4. ISSN 1367-4803.doi:10.1093/bioinformatics/btk041. → pages 49, 51, 52[214] McCaskill, J.S., 1990. The equilibrium partition function and base pair binding probabilities forRNA secondary structure. Biopolymers 29(6-7): 1105–19. ISSN 0006-3525.doi:10.1002/bip.360290621. → pages 50[215] Bernhart, S.H., Mu¨ckstein, U., & Hofacker, I.L., 2011. RNA Accessibility in cubic time. AlgorithmsMol Biol 6(1): 3. ISSN 1748-7188. doi:10.1186/1748-7188-6-3. → pages 50[216] Mu¨ckstein, U., Tafer, H., Hackermu¨ller, J., Bernhart, S.H., Stadler, P.F., & Hofacker, I.L., 2006.Thermodynamics of RNA-RNA binding. Bioinformatics 22(10): 1177–82. ISSN 1367-4803.doi:10.1093/bioinformatics/btl024. → pages 50, 51, 56[217] Busch, A., Richter, A.S., & Backofen, R., 2008. IntaRNA: efficient prediction of bacterial sRNAtargets incorporating target site accessibility and seed regions. Bioinformatics 24(24): 2849–56.ISSN 1367-4811. doi:10.1093/bioinformatics/btn544. → pages 50, 51, 54, 63[218] Tafer, H., Amman, F., Eggenhofer, F., Stadler, P.F., & Hofacker, I.L., 2011. Fast Accessibility-BasedPrediction of RNA-RNA Interactions. Bioinformatics 27(14): 1934–1940. ISSN 1367-4811.doi:10.1093/bioinformatics/btr281. → pages 50, 51, 55, 57, 103[219] Andronescu, M., Zhang, Z.C., & Condon, A., 2005. Secondary structure prediction of interactingRNA molecules. J Mol Biol 345(5): 987–1001. ISSN 0022-2836. doi:10.1016/j.jmb.2004.10.082.→ pages 50, 51, 53[220] Kato, Y., Sato, K., Hamada, M., Watanabe, Y., Asai, K., & Akutsu, T., 2010. RactIP: fast andaccurate prediction of RNA-RNA interaction using integer programming. Bioinformatics 26(18):i460–6. ISSN 1367-4811. doi:10.1093/bioinformatics/btq372. → pages 50, 51, 56[221] Knudsen, B., 2003. Pfold: RNA secondary structure prediction using stochastic context-freegrammars. Nucleic Acids Res 31(13): 3423–3428. ISSN 1362-4962. 
doi:10.1093/nar/gkg614. →pages 50, 59, 121, 125[222] Bernhart, S.H., Hofacker, I.L., Will, S., Gruber, A.R., & Stadler, P.F., 2008. RNAalifold: improvedconsensus structure prediction for RNA alignments. BMC Bioinformatics 9: 474. ISSN 1471-2105.doi:10.1186/1471-2105-9-474. → pages 50, 57[223] Bernhart, S.H., Tafer, H., Mu¨ckstein, U., Flamm, C., Stadler, P.F., & Hofacker, I.L., 2006. Partitionfunction and base pairing probabilities of RNA heterodimers. Algorithms Mol Biol 1(1): 3. ISSN1748-7188. doi:10.1186/1748-7188-1-3. → pages 51, 54[224] Smith, T.F. & Waterman, M.S., 1981. Identification of common molecular subsequences. J Mol Biol147: 195–197. ISSN 00222836. doi:10.1016/0022-2836(81)90087-5. → pages 52147[225] Rehmsmeier, M., Steffen, P., Hochsmann, M., & Giegerich, R., 2004. Fast and effective predictionof microRNA/target duplexes. RNA 10(10): 1507–17. ISSN 1355-8382. doi:10.1261/rna.5248604.→ pages 53[226] Seemann, S.E., Gorodkin, J., & Backofen, R., 2008. Unifying evolutionary and thermodynamicinformation for RNA folding of multiple alignments. Nucleic Acids Res 36(20): 6355–62. ISSN1362-4962. doi:10.1093/nar/gkn544. → pages 59[227] Felsenstein, J., 1981. Evolutionary trees from DNA sequences: a maximum likelihood approach. JMol Evol 17(6): 368–76. ISSN 0022-2844. → pages 59[228] Pervez, Babar, M., Nadeem, A., Aslam, M., Awan, A., Aslam, N., Hussain, T., Naveed, N., Qadri, S.,Waheed, U. et al., 2014. Evaluating the Accuracy and Efficiency of Multiple Sequence AlignmentMethods. Evol Bioinforma page 205. ISSN 1176-9343. doi:10.4137/EBO.S19199. → pages 59[229] Pais, F.S.M., Ruy, P.D.C., Oliveira, G., & Coimbra, R.S., 2014. Assessing the efficiency of multiplesequence alignment programs. Algorithms Mol Biol 9: 4. ISSN 1748-7188.doi:10.1186/1748-7188-9-4. → pages 59[230] Katoh, K. & Standley, D.M., 2013. MAFFT multiple sequence alignment software version 7:improvements in performance and usability. Mol Biol Evol 30(4): 772–80. 
ISSN 1537-1719.doi:10.1093/molbev/mst010. → pages 59, 64, 94[231] Do, C.B., 2005. ProbCons: Probabilistic consistency-based multiple sequence alignment. GenomeRes 15(2): 330–340. ISSN 1088-9051. doi:10.1101/gr.2821705. → pages 59, 94[232] Will, S., Reiche, K., Hofacker, I.L., Stadler, P.F., & Backofen, R., 2007. Inferring noncoding RNAfamilies and classes by means of genome-scale structure-based clustering. PLoS Comput Biol 3(4):e65. ISSN 1553-7358. doi:10.1371/journal.pcbi.0030065. → pages 60, 94[233] Will, S., Otto, C., Miladi, M., Mo¨hl, M., & Backofen, R., 2015. SPARSE: quadratic timesimultaneous alignment and folding of RNAs without sequence-based heuristics. Bioinformatics31(15): 2489–2496. ISSN 1367-4803. doi:10.1093/bioinformatics/btv185. → pages 60, 94[234] Baek, D., Ville´n, J., Shin, C., Camargo, F.D., Gygi, S.P., & Bartel, D.P., 2008. The impact ofmicroRNAs on protein output. Nature 455(September): 64–71. ISSN 0028-0836.doi:10.1038/nature07242. → pages 61[235] Alexiou, P., Maragkakis, M., Papadopoulos, G.L., Reczko, M., & Hatzigeorgiou, A.G., 2009. Lost intranslation: An assessment and perspective for computational microrna target identification.Bioinformatics 25(23): 3049–3055. ISSN 13674803. doi:10.1093/bioinformatics/btp565. → pages61[236] Eggenhofer, F., Tafer, H., Stadler, P.F., & Hofacker, I.L., 2011. RNApredator: fastaccessibility-based prediction of sRNA targets. Nucleic Acids Res 39(Web Server issue): W149–54.ISSN 1362-4962. doi:10.1093/nar/gkr467. → pages 61[237] Wright, P.R., Richter, A.S., Papenfort, K., Mann, M., Vogel, J., Hess, W.R., Backofen, R., & Georg,J., 2013. Comparative genomics boosts target prediction for bacterial small RNAs. Proc Natl AcadSci U S A 110(37): E3487–96. ISSN 1091-6490. doi:10.1073/pnas.1303248110. → pages 62, 63,104148[238] Pain, A., Ott, A., Amine, H., Rochat, T., Bouloc, P., & Gautheret, D., 2015. An assessment ofbacterial small RNA target prediction programs. RNA Biol 12(5): 509–513. 
ISSN 1547-6286.doi:10.1080/15476286.2015.1020269. → pages 62, 104[239] Montaseri, S., Zare-Mirakabad, F., & Moghadam-Charkari, N., 2014. RNA-RNA interactionprediction using genetic algorithm. Algorithms Mol Biol 9(1): 17. ISSN 1748-7188.doi:10.1186/1748-7188-9-17. → pages 62[240] Salari, R., Mathias, M., & Will, S., 2010. Time and Space Efficient RNA-RNA InteractionPrediction via Sparse Folding. Res Comput Mol Biol Lect Notes Comput Sci 6044: 473–490.doi:10.1007/978-3-642-12683-3{\ }31. → pages 62[241] Huang, F.W.D., Qin, J., Reidys, C.M., & Stadler, P.F., 2009. Partition function and base pairingprobabilities for RNA-RNA interaction prediction. Bioinformatics 25(20): 2646–54. ISSN1367-4811. doi:10.1093/bioinformatics/btp481. → pages 62[242] Li, A.X., Marz, M., Qin, J., & Reidys, C.M., 2011. RNA-RNA interaction prediction based onmultiple sequence alignments. Bioinformatics 27(4): 456–63. ISSN 1367-4811.doi:10.1093/bioinformatics/btq659. → pages 62[243] Andronescu, M., Bereg, V., Hoos, H.H., & Condon, A., 2008. RNA STRAND: the RNA secondarystructure and statistical analysis database. BMC Bioinformatics 9: 340. ISSN 1471-2105.doi:10.1186/1471-2105-9-340. → pages 62[244] Burge, S.W., Daub, J., Eberhardt, R., Tate, J., Barquist, L., Nawrocki, E.P., Eddy, S.R., Gardner, P.P.,& Bateman, A., 2013. Rfam 11.0: 10 years of RNA families. Nucleic Acids Res 41(Database issue):D226–32. ISSN 1362-4962. doi:10.1093/nar/gks1005. → pages 62[245] Chitsaz, H., Salari, R., Sahinalp, S.C., & Backofen, R., 2009. A partition function algorithm forinteracting nucleic acid strands. Bioinformatics 25(12): i365–73. ISSN 1367-4811.doi:10.1093/bioinformatics/btp212. → pages 63[246] Peer, A. & Margalit, H., 2011. Accessibility and evolutionary conservation mark bacterial small-rnatarget-binding regions. J Bacteriol 193(7): 1690–701. ISSN 1098-5530. doi:10.1128/JB.01419-10.→ pages 63, 103, 104[247] Richter, A.S. & Backofen, R., 2012. 
Accessibility and conservation: General features of bacterialsmall RNA-mRNA interactions? RNA Biol 9(7): 954–65. ISSN 1555-8584. doi:10.4161/rna.20294.→ pages 63, 66, 103[248] Cao, Y., Wu, J., Liu, Q., Zhao, Y., Ying, X., Cha, L., Wang, L., & Li, W., 2010. sRNATarBase: acomprehensive database of bacterial sRNA targets verified by experiments. RNA 16(11): 2051–7.ISSN 1469-9001. doi:10.1261/rna.2193110. → pages 63[249] Tatusova, T., Ciufo, S., Federhen, S., Fedorov, B., McVeigh, R., O’Neill, K., Tolstoy, I., &Zaslavsky, L., 2014. Update on RefSeq microbial genomes resources. Nucleic Acids Res43(December 2014): D599–D605. ISSN 0305-1048. doi:10.1093/nar/gku1062. → pages 63149[250] Tafer, H., Kehr, S., Hertel, J., Hofacker, I.L., & Stadler, P.F., 2010. RNAsnoop: efficient targetprediction for H/ACA snoRNAs. Bioinformatics 26(5): 610–6. ISSN 1367-4811.doi:10.1093/bioinformatics/btp680. → pages 64[251] Lowe, T.M. & Eddy, S.R., 1999. A Computational Screen for Methylation Guide snoRNAs in Yeast.Science 283(5405): 1168–1171. ISSN 00368075. doi:10.1126/science.283.5405.1168. → pages 64[252] Piekna-Przybylska, D., Decatur, W.A., & Fournier, M.J., 2007. New bioinformatic tools for analysisof nucleotide modifications in eukaryotic rRNA. RNA 13(3): 305–12. ISSN 1355-8382.doi:10.1261/rna.373107. → pages 64[253] Cherry, J.M., Hong, E.L., Amundsen, C., Balakrishnan, R., Binkley, G., Chan, E.T., Christie, K.R.,Costanzo, M.C., Dwight, S.S., Engel, S.R. et al., 2012. Saccharomyces Genome Database: thegenomics resource of budding yeast. Nucleic Acids Res 40(Database issue): D700–5. ISSN1362-4962. doi:10.1093/nar/gkr1029. → pages 64[254] Kanehisa, M., Goto, S., Sato, Y., Furumichi, M., & Tanabe, M., 2012. KEGG for integration andinterpretation of large-scale molecular data sets. Nucleic Acids Res 40(Database issue): D109–14.ISSN 1362-4962. doi:10.1093/nar/gkr988. → pages 64[255] Hertel, J., de Jong, D., Marz, M., Rose, D., Tafer, H., Tanzer, A., Schierwater, B., & Stadler, P.F.,2009. 
Non-coding RNA annotation of the genome of Trichoplax adhaerens. Nucleic Acids Res37(5): 1602–1615. ISSN 03051048. doi:10.1093/nar/gkn1084. → pages 64[256] Lai, D., Proctor, J.R., Zhu, J.Y.A., & Meyer, I.M., 2012. R-CHIE: a web server and R package forvisualizing RNA secondary structures. Nucleic Acids Res 40(12): e95. ISSN 1362-4962.doi:10.1093/nar/gks241. → pages 65, 127[257] Matthews, B.W., 1975. Comparison of the predicted and observed secondary structure of T4 phagelysozyme. Biochim Biophys Acta 405(2): 442–451. ISSN 00052795.doi:10.1016/0005-2795(75)90109-9. → pages 65[258] Gorodkin, J., Stricklin, S.L., & Stormo, G.D., 2001. Discovering common stem-loop motifs inunaligned RNA sequences. Nucleic Acids Res 29(10): 2135–44. ISSN 1362-4962. → pages 66[259] Wickham, H., 2011. ggplot2. Wiley Interdiscip Rev Comput Stat 3: 180–185. ISSN 19395108.doi:10.1002/wics.147. → pages 66[260] Hofacker, I.L., Fekete, M., & Stadler, P.F., 2002. Secondary structure prediction for aligned RNAsequences. J Mol Biol 319(5): 1059–66. ISSN 0022-2836. doi:10.1016/S0022-2836(02)00308-X.→ pages 99, 121[261] Anandam, P., Torarinsson, E., & Ruzzo, W.L., 2009. Multiperm: shuffling multiple sequencealignments while approximately preserving dinucleotide frequencies. Bioinformatics 25(5): 668–9.ISSN 1367-4811. doi:10.1093/bioinformatics/btp006. → pages 99[262] Kehr, S., Bartschat, S., Stadler, P.F., & Tafer, H., 2011. PLEXY: efficient target prediction for boxC/D snoRNAs. Bioinformatics 27(2): 279–80. ISSN 1367-4811. doi:10.1093/bioinformatics/btq642.→ pages 108150[263] Helwak, A. & Tollervey, D., 2014. Mapping the miRNA interactome by cross-linking ligation andsequencing of hybrids (CLASH). Nat Protoc 9(3): 711–28. ISSN 1750-2799.doi:10.1038/nprot.2014.043. → pages 108[264] Pedersen, J.S., Forsberg, R., Meyer, I.M., & Hein, J., 2004. An evolutionary model forprotein-coding regions with conserved RNA structure. Mol Biol Evol 21(10): 1913–22. ISSN0737-4038. doi:10.1093/molbev/msh199. 
→ pages 110, 121[265] Pedersen, J.S., Meyer, I.M., Forsberg, R., Simmonds, P., & Hein, J., 2004. A comparative methodfor finding and folding RNA secondary structures within protein-coding regions. Nucleic Acids Res32(16): 4925–36. ISSN 1362-4962. doi:10.1093/nar/gkh839. → pages 110, 121, 125[266] Al-Hashimi, H.M. & Walter, N.G., 2008. RNA dynamics: it is about time. Curr Opin Struct Biol18(3): 321–329. ISSN 0959440X. doi:10.1016/j.sbi.2008.04.004. → pages 112[267] Sosnick, T.R. & Pan, T., 2003. RNA folding: Models and perspectives.doi:10.1016/S0959-440X(03)00066-6. → pages[268] Thirumalai, D. & Hyeon, C., 2005. RNA and Protein Folding: Common Themes and Variations.Biochemistry 44(13): 4957–4970. doi:10.1021/bi047314+. → pages 112[269] Boyle, J., Robillard, G.T., & Kim, S.H., 1980. Sequential folding of transfer RNA. J Mol Biol139(4): 601–625. ISSN 00222836. doi:10.1016/0022-2836(80)90051-0. → pages 112[270] Kramer, F.R. & Mills, D.R., 1981. Secondary structure formation during RNA synthesis. NucleicAcids Res 9(19): 5109–5124. ISSN 03051048. doi:10.1093/nar/9.19.5109. → pages 112, 113[271] Brehm, S.L. & Cech, T.R., 1983. Fate of an intervening sequence ribonucleic acid: excision andcyclization of the Tetrahymena ribosomal ribonucleic acid intervening sequence in vivo.Biochemistry 22(10): 2390–2397. ISSN 0006-2960. → pages 112[272] Pan, T., Fang, X., & Sosnick, T., 1999. Pathway modulation, circular permutation and rapid RNAfolding under kinetic control. J Mol Biol 286(3): 721–731. ISSN 0022-2836.doi:10.1006/jmbi.1998.2516. → pages 112[273] Heilman-Miller, S.L. & Woodson, S.A., 2003. Effect of transcription on folding of the Tetrahymenaribozyme. RNA 9(6): 722–733. ISSN 13558382. doi:10.1261/rna.5200903. → pages 113, 115[274] Heilman-Miller, S.L. & Woodson, S.a., 2003. Perturbed folding kinetics of circularly permutedRNAs with altered topology. J Mol Biol 328(2): 385–394. ISSN 00222836.doi:10.1016/S0022-2836(03)00304-8. 
→ pages 112[275] Lewicki, B.T., Margus, T., Remme, J., & Nierhaus, K.H., 1993. Coupling of rRNA Transcriptionand Ribosomal Assembly in Vivo. J Mol Biol 231(3): 581–593. ISSN 00222836.doi:10.1006/jmbi.1993.1311. → pages 112[276] Chao, M.Y., Kan, M.C., & Lin-Chao, S., 1995. RNAII transcribed by IPTG-induced T7 RNApolymerase is non-functional as a replication primer for ColE1-type plasmids in Escherichia coli.Nucleic Acids Res 23(10): 1691–1695. ISSN 0305-1048. doi:5w0030[pii]. → pages 112151[277] Koduvayur, S.P. & Woodson, S.A., 2004. Intracellular folding of the Tetrahymena group I introndepends on exon sequence and promoter choice. RNA 10(10): 1526–1532. ISSN 1355-8382.doi:10.1261/rna.7880404. → pages 112, 113[278] Toulme´, F., Mosrin-Huaman, C., Artsimovitch, I., & Rahmouni, A.R., 2005. Transcriptional pausingin Vivo: A nascent RNA hairpin restricts lateral movements of RNA polymerase in both forward andreverse directions. J Mol Biol 351(1): 39–51. ISSN 00222836. doi:10.1016/j.jmb.2005.05.052. →pages 112[279] Wickiser, J.K., Winkler, W.C., Breaker, R.R., & Crothers, D.M., 2005. The speed of RNAtranscription and metabolite binding kinetics operate an FMN riboswitch. Mol Cell 18(1): 49–60.ISSN 10972765. doi:10.1016/j.molcel.2005.02.032. → pages 112, 119[280] Wong, T.N., Sosnick, T.R., & Pan, T., 2007. Folding of noncoding RNAs during transcriptionfacilitated by pausing-induced nonnative structures. Proc Natl Acad Sci U S A 104(46):17995–18000. ISSN 0027-8424. doi:10.1073/pnas.0705038104. → pages 112[281] Liu, K., Zhang, Y., Severinov, K., Das, A., & Hanna, M.M., 1996. Role of Escherichia coli RNApolymerase alpha subunit in modulation of pausing, termination and anti-termination by thetranscription elongation factor NusA. EMBO J 15(1): 150–161. ISSN 0261-4189. → pages 112[282] Landick, R., 1997. RNA polymerase slides home: Pause and termination site recognition. Cell88(6): 741–744. ISSN 00928674. doi:10.1016/S0092-8674(00)81919-4. 
→ pages[283] Mooney, R.A., Artsimovitch, I., & Landick, R., 1998. Information processing by RNA polymerase:Recognition of regulatory signals during RNA chain elongation. → pages 112[284] Adilakshmi, T., Soper, S.F.C., & Woodson, S.A., 2009. Structural analysis of RNA in living cells byin vivo synchrotron X-ray footprinting. Methods Enzymol 468(09): 239–258. ISSN 1557-7988.doi:10.1016/S0076-6879(09)68012-5. → pages 113, 115[285] Woodson, S.A., 2010. Compact intermediates in RNA folding. Annu Rev Biophys 39: 61–77. ISSN1936-1238. doi:10.1146/annurev.biophys.093008.131334. → pages 113, 117[286] Donahue, C.P., Yadava, R.S., Nesbitt, S.M., & Fedor, M.J., 2000. The kinetic mechanism of thehairpin ribozyme in vivo: influence of RNA helix stability on intracellular cleavage kinetics. J MolBiol 295(3): 693–707. ISSN 0022-2836. doi:10.1006/jmbi.1999.3380. → pages 113[287] Fedor, M.J., 2002. The catalytic mechanism of the hairpin ribozyme. Biochem Soc Trans 30(Pt 6):1109–1115. ISSN 03005127. doi:10.1042/BST0301109. → pages[288] Mahen, E.M., Harger, J.W., Calderon, E.M., & Fedor, M.J., 2005. Kinetics and thermodynamicsmake different contributions to RNA folding in vitro and in yeast. Mol Cell 19(1): 27–37. ISSN10972765. doi:10.1016/j.molcel.2005.05.025. → pages 113[289] Fedor, M.J., 2009. Comparative enzymology and structural biology of RNA self-cleavage. Annu RevBiophys 38: 271–299. ISSN 1936-122X. doi:10.1146/annurev.biophys.050708.133710. → pages[290] Mahen, E.M., Watson, P.Y., Cottrell, J.W., & Fedor, M.J., 2010. mRNA secondary structures foldsequentially but exchange rapidly in vivo. PLoS Biol 8(2). ISSN 15449173.doi:10.1371/journal.pbio.1000307. → pages 113152[291] Jackson, S.A., Koduvayur, S., & Woodson, S.A., 2006. Self-splicing of a group I intron revealspartitioning of native and misfolded RNA populations in yeast. RNA 12(12): 2149–2159. ISSN1355-8382. doi:10.1261/rna.184206. 
→ pages 113, 115, 117[292] Repsilber, D., Wiese, S., Rachen, M., Schro¨der, A.W., Riesner, D., & Steger, G., 1999. Formation ofmetastable RNA structures by sequential folding during transcription: time-resolved structuralanalysis of potato spindle tuber viroid (-)-stranded RNA by temperature-gradient gel electrophoresis.RNA 5(4): 574–584. ISSN 13558382. doi:10.1017/S1355838299982018. → pages 113[293] Ro-Choi, T.S. & Choi, Y.C., 2003. Structural elements of dynamic RNA strings. Mol Cells 16(2):201–210. ISSN 10168478. → pages 113[294] Chauhan, S. & Woodson, S.a., 2008. Tertiary interactions determine the accuracy of RNA folding. JAm Chem Soc 130(4): 1296–1303. ISSN 00027863. doi:10.1021/ja076166i. → pages 113, 115[295] Sclavi, B., Sullivan, M., Chance, M.R., Brenowitz, M., & Woodson, S.A., 1998. RNA folding atmillisecond intervals by synchrotron hydroxyl radical footprinting. Science 279(5358): 1940–1943.ISSN 00368075. doi:10.1126/science.279.5358.1940. → pages 113, 115[296] Yanofsky, C., 1981. Attenuation in the control of expression of bacterial operons. Nature 289(5800):751–758. ISSN 0028-0836. doi:10.1038/289751a0. → pages 114[297] Isambert, H. & Siggia, E.D., 2000. Modeling RNA folding paths with pseudoknots: application tohepatitis delta virus ribozyme. Proc Natl Acad Sci U S A 97(12): 6515–6520. ISSN 0027-8424.doi:10.1073/pnas.110533697. → pages 114, 122[298] Schoemaker, R.J.W. & Gultyaev, A.P., 2006. Computer simulation of chaperone effects of ArchaealC/D box sRNA binding on rRNA folding. Nucleic Acids Res 34(7): 2015–26. ISSN 1362-4962.doi:10.1093/nar/gkl154. → pages 114, 125[299] Alexander, R.D., Barrass, J.D., Dichtl, B., Kos, M., Obtulowicz, T., Robert, M.C., Koper, M.,Karkusiewicz, I., Mariconti, L., Tollervey, D. et al., 2010. RiboSys, a high-resolution, quantitativeapproach to measure the in vivo kinetics of pre-mRNA splicing and 3’-end processing inSaccharomyces cerevisiae. RNA 16(12): 2570–2580. ISSN 1355-8382. 
doi:10.1261/rna.2162610.→ pages 115[300] Johansson, J., Mandin, P., Renzoni, A., Chiaruttini, C., Springer, M., & Cossart, P., 2002. An RNAthermosensor controls expression of virulence genes in Listeria monocytogenes. Cell 110(5):551–561. ISSN 0092-8674. doi:10.1016/S0092-8674(02)00905-4. → pages 116[301] Narberhaus, F., 2010. Translational control of bacterial heat shock and virulence genes bytemperature-sensing mRNAs. RNA Biol 7(1): 84–89. ISSN 1547-6286. doi:10.4161/rna.7.1.10501.→ pages[302] Giuliodori, A.M., Di Pietro, F., Marzi, S., Masquida, B., Wagner, R., Romby, P., Gualerzi, C.O., &Pon, C.L., 2010. The cspA mRNA Is a Thermosensor that Modulates Translation of the Cold-ShockProtein CspA. Mol Cell 37(1): 21–33. ISSN 10972765. doi:10.1016/j.molcel.2009.11.033. → pages116153[303] Roth, A. & Breaker, R.R., 2009. The structural and functional diversity of metabolite-bindingriboswitches. Annu Rev Biochem 78: 305–334. ISSN 0066-4154.doi:10.1146/annurev.biochem.78.070507.135656. → pages 116[304] Nechooshtan, G., Elgrably-weiss, M., Sheaffer, A., Westhof, E., & Altuvia, S., 2009. ApH-responsive riboregulator. Genes Dev 23: 2650–2662. doi:10.1101/gad.552209.terminator. →pages 116[305] Frieda, K.L. & Block, S.M., 2012. Direct Observation of Cotranscriptional Folding in an AdenineRiboswitch. doi:10.1126/science.1225722. → pages 116[306] Perdrizet, G.a., Artsimovitch, I., Furman, R., Sosnick, T.R., & Pan, T., 2012. Transcriptional pausingcoordinates folding of the aptamer domain and the expression platform of a riboswitch. Proc NatlAcad Sci U S A 109(9): 3323–3328. ISSN 0027-8424. doi:10.1073/pnas.1113086109. → pages 116[307] Gregan, J., Kolisek, M., & Schweyen, R.J., 2001. Mitochondrial Mg2+ homeostasis is critical forgroup II intron splicing in vivo. Genes Dev 15(17): 2229–2237. ISSN 08909369.doi:10.1101/gad.201301. → pages 116[308] Fedorova, O., Julie Su, L., & Pyle, A.M., 2002. 
Group II introns: Highly specific endonucleaseswith modular structures and diverse catalytic functions. Methods 28(3): 323–335. ISSN 10462023.doi:10.1016/S1046-2023(02)00239-6. → pages 116[309] Gampel, A. & Cech, T.R., 1991. Binding of the CBP2 protein to a yeast mitochondrial group Iintron requires the catalytic core of the RNA. Genes Dev 5(10): 1870–1880. ISSN 08909369.doi:10.1101/gad.5.10.1870. → pages 116[310] Caprara, M.G., Mohr, G., & Lambowitz, A.M., 1996. A tyrosyl-tRNA synthetase protein inducestertiary folding of the group I intron catalytic core. J Mol Biol 257(3): 512–531. ISSN 0022-2836.doi:10.1006/jmbi.1996.0182. → pages 116[311] Matsuura, M., Saldanha, R., Ma, H., Wank, H., Yang, J., Mohr, G., Cavanagh, S., Dunny, G.M.,Belfort, M., & Lambowitz, A.M., 1997. A bacterial group II intron encoding reverse transcriptase,maturase, and DNA endonuclease activities: Biochemical demonstration of maturase activity andinsertion of new genetic information within the intron. Genes Dev 11(21): 2910–2924. ISSN08909369. doi:10.1101/gad.11.21.2910. → pages 116[312] Weeks, K.M., 1997. Protein-facilitated RNA folding. Curr Opin Struct Biol 7(3): 336–342. ISSN0959440X. doi:10.1016/S0959-440X(97)80048-6. → pages 117[313] Ostersetzer, O., Cooke, A.M., Watkins, K.P., & Barkan, A., 2005. CRS1, a chloroplast group IIintron splicing factor, promotes intron folding through specific interactions with two intron domains.Plant Cell 17(1): 241–255. ISSN 1040-4651. doi:10.1105/tpc.104.027516. → pages 116[314] Mohr, G., Zhang, A., Gianelos, J.a., Belfort, M., & Lambowitz, A.M., 1992. The NeurosporaCYT-18 protein suppresses defects in the phage T4 td intron by stabilizing the catalytically activestructure of the intron core. Cell 69(3): 483–494. ISSN 00928674.doi:10.1016/0092-8674(92)90449-M. → pages 116154[315] Waldsich, C., Grossberger, R., & Schroeder, R., 2002. RNA chaperone StpA loosens interactions ofthe tertiary structure in the td group I intron in vivo. 
Genes Dev 16(17): 2300–2312. ISSN08909369. doi:10.1101/gad.231302. → pages[316] Waldsich, C., Masquida, B., Westhof, E., & Schroeder, R., 2002. Monitoring intermediate foldingstates of the td group I intron in vivo. EMBO J 21(19): 5281–5291. ISSN 02614189.doi:10.1093/emboj/cdf504. → pages 116[317] Mohr, G., Caprara, M.G., Guo, Q., & Lambowitz, A.M., 1994. A tyrosyl-tRNA synthetase canfunction similarly to an RNA structure in the Tetrahymena ribozyme. Nature 370(6485): 147–150.ISSN 0028-0836. doi:10.1038/370147a0. → pages 116[318] Weeks, K.M. & Cech, T.R., 1996. Assembly of a ribonucleoprotein catalyst by tertiary structurecapture. Science 271(5247): 345–348. ISSN 0036-8075. doi:10.1126/science.271.5247.345. →pages 116[319] Webb, A.E. & Weeks, K.M., 2001. A collapsed state functions to self-chaperone RNA folding into anative ribonucleoprotein complex. Nat Struct Biol 8(2): 135–140. ISSN 1072-8368.doi:10.1038/84124. → pages[320] Bassi, G.S., de Oliveira, D.M., White, M.F., & Weeks, K.M., 2002. Recruitment of intron-encodedand co-opted proteins in splicing of the bI3 group I intron RNA. Proc Natl Acad Sci U S A 99(1):128–133. ISSN 00278424. doi:10.1073/pnas.012579299. → pages[321] Paukstelis, P.J., Coon, R., Madabusi, L., Nowakowski, J., Monzingo, A., Robertus, J., & Lambowitz,A.M., 2005. A tyrosyl-tRNA synthetase adapted to function in group I intron splicing by acquiring anew RNA binding surface. Mol Cell 17(3): 417–428. ISSN 10972765.doi:10.1016/j.molcel.2004.12.026. → pages[322] Talkington, M.W.T., Siuzdak, G., & Williamson, J.R., 2005. An assembly landscape for the 30Sribosomal subunit. Nature 438(7068): 628–632. ISSN 0028-0836. doi:10.1038/nature04261. →pages[323] Paukstelis, P.J., Chen, J.H., Chase, E., Lambowitz, A.M., & Golden, B.L., 2008. Structure of atyrosyl-tRNA synthetase splicing factor bound to a group I intron RNA. Nature 451(7174): 94–97.ISSN 0028-0836. doi:10.1038/nature06413. → pages[324] Adilakshmi, T., Bellur, D.L., & Woodson, S.A., 2008. 
Concurrent nucleation of 16S folding andinduced fit in 30S ribosome assembly. Nature 455(7217): 1268–1272. ISSN 0028-0836.doi:10.1038/nature07298. → pages[325] Dai, L., Chai, D., Gu, S.Q., Gabel, J., Noskov, S.Y., Blocker, F.J.H., Lambowitz, A.M., & Zimmerly,S., 2008. A three-dimensional model of a group II intron RNA and its interaction with theintron-encoded reverse transcriptase. Mol Cell 30(4): 472–485. ISSN 10972765.doi:10.1016/j.molcel.2008.04.001. → pages 116[326] Hickman, A.B. & Dyda, F., 2005. Binding and unwinding: SF3 viral helicases. Curr Opin StructBiol 15(1 SPEC. ISS.): 77–85. ISSN 0959440X. doi:10.1016/j.sbi.2004.12.001. → pages 116155[327] Halls, C., Mohr, S., Del Campo, M., Yang, Q., Jankowsky, E., & Lambowitz, A.M., 2007.Involvement of DEAD-box Proteins in Group I and Group II Intron Splicing. BiochemicalCharacterization of Mss116p, ATP Hydrolysis-dependent and -independent Mechanisms, andGeneral RNA Chaperone Activity. J Mol Biol 365(3): 835–855. ISSN 00222836.doi:10.1016/j.jmb.2006.09.083. → pages[328] Bleichert, F. & Baserga, S.J., 2007. The Long Unwinding Road of RNA Helicases. Mol Cell 27(3):339–352. ISSN 10972765. doi:10.1016/j.molcel.2007.07.014. → pages[329] Pyle, A.M., 2008. Translocation and unwinding mechanisms of RNA and DNA helicases. Annu RevBiophys 37: 317–336. ISSN 1936-122X. doi:10.1146/annurev.biophys.37.032807.125908. → pages[330] Fairman-Williams, M.E., Guenther, U.P., & Jankowsky, E., 2010. SF1 and SF2 helicases: Familymatters. doi:10.1016/j.sbi.2010.03.011. → pages 116[331] Farina, K.L. & Singer, R.H., 2002. The nuclear connection in RNA transport and localization.Trends Cell Biol 12(10): 466–472. ISSN 09628924. doi:10.1016/S0962-8924(02)02357-7. →pages 116[332] Granneman, S. & Baserga, S.J., 2005. Crosstalk in gene expression: Coupling and co-regulation ofrDNA transcription, pre-ribosome assembly and pre-rRNA processing. Curr Opin Cell Biol 17(3):281–286. ISSN 09550674. doi:10.1016/j.ceb.2005.04.001. 
→ pages 116[333] Udem, S.A. & Warner, R., 1973. The Cytoplasmic Maturation Precursor Ribonucleic Acid of aRibosomal in Yeast *. J Biol Chem 248(4): 1412–1416. → pages[334] Oakes, M., Nogi, Y., Clark, M.W., & Nomura, M., 1993. Structural alterations of the nucleolus inmutants of Saccharomyces cerevisiae defective in RNA polymerase I. Mol Cell Biol 13(4):2441–2455. ISSN 0270-7306. doi:10.1128/MCB.13.4.2441.Updated. → pages[335] Kosˇ, M. & Tollervey, D., 2010. Yeast Pre-rRNA Processing and Modification OccurCotranscriptionally. Mol Cell 37(6): 809–820. ISSN 10972765. doi:10.1016/j.molcel.2010.02.024.→ pages 116[336] Alexander, R.D., Innocente, S.a., Barrass, J.D., & Beggs, J.D., 2010. Splicing-Dependent RNApolymerase pausing in yeast. Mol Cell 40(4): 582–593. ISSN 10972765.doi:10.1016/j.molcel.2010.11.005. → pages 116[337] Semrad, K. & Schroeder, R., 1998. A ribosomal function is necessary for efficient splicing of the T4phage thymidylate synthase intron in vivo. Genes Dev 12(9): 1327–1337. ISSN 08909369.doi:10.1101/gad.12.9.1327. → pages 116, 117[338] Treiber, D.K. & Williamson, J.R., 2001. Beyond kinetic traps in RNA folding. Curr Opin StructBiol 11(3): 309–314. ISSN 0959440X. doi:10.1016/S0959-440X(00)00206-2. → pages 117[339] Baird, N.J., Fang, X.W., Srividya, N., Pan, T., & Sosnick, T.R., 2007. Folding of a universalribozyme: the ribonuclease P RNA. Q Rev Biophys 40(2): 113–161. ISSN 0033-5835.doi:10.1017/S0033583507004623. → pages156[340] Shcherbakova, I., Mitra, S., Laederach, A., & Brenowitz, M., 2008. Energy barriers, pathways, anddynamics during folding of large, multidomain RNAs. Curr Opin Chem Biol 12(6): 655–666. ISSN13675931. doi:10.1016/j.cbpa.2008.09.017. → pages 117[341] Nikolcheva, T. & Woodson, S.A., 1999. Facilitation of group I splicing in vivo: misfolding of theTetrahymena IVS and the role of ribosomal RNA exons. J Mol Biol 292(3): 557–567. ISSN0022-2836. doi:10.1006/jmbi.1999.3083. 
Appendix A
Supporting Materials

A.1 Chapter 1 appendix material

A.1.1 R4RNA function manual

Package ‘R4RNA’
October 13, 2015

Type Package
Title An R package for RNA visualization and analysis
Version 0.99.2
Date 2015-10-13
Author Daniel Lai, Irmtraud Meyer
Maintainer Daniel Lai <redacted@example.com>
Depends R (>= 3.2.0)
Description Plots arc diagrams for RNA secondary structure and alignments
License GPL-3
biocViews Alignment, MultipleSequenceAlignment, Preprocessing, Visualization, DataImport, DataRepresentation
URL http://www.e-rna.org/r-chie/

R topics documented:
R4RNA-package
Alignment Statistics
Basepair Frequency
Basepair/Helix Conversion
Coerce to Helix
Colour Helices
Convert Helix Formats
Covariation Plots
Create Blank Plot
Example Data
Find Unknotted Groups
Helix Type Filters
Log10 Space Operations
Plot Helix Structures
Read FASTA
Read Structure File
Structure Mismatch Score
Write FASTA
Write Helix
Index

R4RNA-package    An R package for RNA visualization and analysis

Description

An R package for RNA visualization and analysis

Details

Package: R4RNA
Type: Package
Version: 0.99.2
Date: 2015-10-13
License: GPL-3

Author(s)

Maintainer: Daniel Lai, Irmtraud Meyer

Examples

# Read input data
predicted <- readHelix(system.file("extdata", "helix.txt", package = "R4RNA"))
known <- readVienna(system.file("extdata", "vienna.txt", package = "R4RNA"))
sequence <- readFasta(system.file("extdata", "fasta.txt", package = "R4RNA"))

plotHelix(predicted)

pval.coloured <- colourByValue(predicted, log = TRUE, get = TRUE)
plotDoubleHelix(pval.coloured, known, scale = FALSE)
plotOverlapHelix(pval.coloured, known)

cov.coloured <- colourByCovariation(known, sequence, get = TRUE)
plotCovariance(sequence, cov.coloured)
plotDoubleCovariance(cov.coloured, pval.coloured, sequence,
    conflict.filter = "grey")
plotOverlapCovariance(pval.coloured, known, sequence, grid = TRUE,
    conflict.filter = "grey", legend = FALSE, any = TRUE)

# List of all functions
ls("package:R4RNA")

# use example() and help() for more details on each function

Alignment Statistics    Compute statistics for a multiple sequence alignment

Description

Functions to compute covariation, percent identity conservation, and percent canonical basepairs given a multiple sequence alignment and optionally a secondary structure.
Statistics can be computed for a single base, basepair, helix or entire alignment.

Usage

baseConservation(msa, pos)
basepairConservation(msa, pos.5p, pos.3p)
basepairCovariation(msa, pos.5p, pos.3p)
basepairCanonical(msa, pos.5p, pos.3p)
helixConservation(helix, msa)
helixCovariation(helix, msa)
helixCanonical(helix, msa)
alignmentConservation(msa)
alignmentCovariation(msa, helix)
alignmentCanonical(msa, helix)
alignmentPercentGaps(msa)

Arguments

helix    A helix data.frame
msa    A multiple sequence alignment, such as those returned by readFasta
pos, pos.5p, pos.3p    Positions of bases or basepairs for which statistics shall be calculated.

Details

Conservation values have a range of [0, 1], where 0 is the absence of primary sequence conservation (all bases different), and 1 is full primary sequence conservation (all bases identical).

Canonical values have a range of [0, 1], where 0 is a complete lack of basepair potential, and 1 indicates that all basepairs are valid.

Covariation values have a range of [-2, 2], where -2 is a complete lack of basepair potential and sequence conservation, 0 is complete sequence conservation regardless of basepairing potential, and 2 is a complete lack of sequence conservation but maintaining full basepair potential.

Helix values are averages of base/basepair values, and the alignment values are averages of helices or all columns depending on whether the helix argument is required.

alignmentPercentGaps simply returns the percentage of nucleotides that are gaps in a sequence for each sequence of the alignment.

Value

baseConservation, basepairConservation, basepairCovariation, basepairCanonical, alignmentConservation, alignmentCovariation, and alignmentCanonical return a single decimal value.

helixConservation, helixCovariation, helixCanonical return a list of values whose length equals the number of rows in helix.

alignmentPercentGaps returns a list of values whose length equals the number of sequences in the multiple sequence alignment.

Author(s)

Jeff Proctor, Daniel Lai

Examples

data(helix)

baseConservation(fasta, 9)
basepairConservation(fasta, 9, 18)
basepairCovariation(fasta, 9, 18)
basepairCanonical(fasta, 9, 18)

helixConservation(helix, fasta)
helixCovariation(helix, fasta)
helixCanonical(helix, fasta)

alignmentConservation(fasta)
alignmentCovariation(fasta, helix)
alignmentCanonical(fasta, helix)
alignmentPercentGaps(fasta)

Basepair Frequency    Calculates the frequency of each basepair

Description

Calculates the frequency of each basepair in a given helix structure. Internally, breaks helices into basepairs, and returns a structure of unique basepairs, where the value is its frequency, regardless of original value.

Usage

basepairFrequency(helix)

Arguments

helix    A helix data.frame

Value

A helix data.frame of unique basepairs of length 1, with the frequency of appearance as its value, sorted by decreasing value.

Author(s)

Daniel Lai

See Also

colourByBasepairFrequency

Examples

data(helix)
basepairFrequency(helix)

Basepair/Helix Conversion    Expand or collapse helices to and from basepairs

Description

Given a helix data frame, expands a helix of arbitrary length into helices of length 1 (i.e. basepairs). Also does the reverse operation of clustering consecutive basepairs (or helices), and merging/collapsing them into a single helix.

Usage

expandHelix(helix)
collapseHelix(helix, number = FALSE)

Arguments

helix    A helix data frame.
number    Indicates presence of a column in the helix data frame titled exactly ’number’, which will be used to uniquely identify basepairs belonging to the same helix. Only basepairs from the same helix as identified by the number will be collapsed together.

Details

During the expansion, basepairs expanded from a single helix will all be assigned the value of the originating helix (the same goes for all other columns besides i, j, and length). During collapsing, only helices/basepairs of equal value will be grouped together. The ordering of collapsed helices returned will be sorted by value (increasing order).
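The expansion step described above can be sketched in base R as follows. This is only an illustration and not the package's own code: expand_sketch is a hypothetical helper, and it assumes a plain data frame with columns i, j, length and value in which basepair k of a helix sits at positions (i + k, j - k).

```r
# Hypothetical helper, not part of R4RNA: expand helices into basepairs.
# Assumes basepair k of a helix is (i + k, j - k) for k = 0 .. length - 1.
expand_sketch <- function(helix) {
  # Repeat each row once per basepair it contains
  rows <- rep(seq_len(nrow(helix)), times = helix$length)
  # Offsets 0, 1, ..., length - 1 within each helix
  offset <- sequence(helix$length) - 1
  out <- helix[rows, , drop = FALSE]
  out$i <- out$i + offset
  out$j <- out$j - offset
  out$length <- 1
  rownames(out) <- NULL
  out  # every other column (e.g. value) is inherited from the parent helix
}

helix <- data.frame(i = c(2, 5), j = c(8, 15), length = c(3, 4),
                    value = c(0.5, -0.5))
expanded <- expand_sketch(helix)
print(expanded)
```

Each of the seven resulting rows keeps the value of the helix it came from, which is the behaviour the Details paragraph describes for expandHelix.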
For any other columns besides i, j, length and value, values will be obtained from the corresponding columns of the outermost basepair.

Value

Returns a helix data frame.

Author(s)

Daniel Lai

Examples

# Create helix data frame
helix <- data.frame(2, 8, 3, 0.5)
helix[2, ] <- c(5, 15, 4, -0.5)
helix <- as.helix(helix)
helix$colour <- c("red", "blue")

# Before expansion
print(helix)
# After expansion
print(expanded <- expandHelix(helix))
# Collapse back (sorted by value)
print(collapseHelix(expanded))

Coerce to Helix    Coerce to a Helix Data Frame

Description

Functions to coerce a structure into a helix data frame, and to check whether a structure is a valid helix data frame. A helix data frame is a data frame, so any structure coercible into a data.frame can become a helix data frame.

Usage

as.helix(x, length)
is.helix(x)

Arguments

x    Structure to coerce. Should be a structure coercible into a standard R data.frame structure for as.helix. Should be a string for parseBracket. May be anything for is.helix.
length    The length of the RNA sequence containing the helices.

Details

as.helix takes in a data.frame and coerces it into a helix data frame acceptable by other R4RNA functions.
This mainly involves setting specific column names and casting to specific types.

Value

is.helix returns a boolean.
as.helix returns a helix data frame given valid input.

Author(s)

Daniel Lai

Examples

# Not a valid helix data frame
helix <- data.frame(c(1, 2, 3), seq(10, 20, length.out = 3), 5, runif(3))
is.helix(helix)
warnings()
# Formatted into a helix data frame
helix <- as.helix(helix)
is.helix(helix)

Colour Helices    Assign colours to helices

Description

Functions to generate colours for helices by various rules, including integer counts, value ranges, percent identity covariation, conservation, percentage canonical basepair, basepair frequency, and non-pseudoknotted groups.

Usage

colourByCount(helix, cols, counts, get = FALSE)
colourByValue(helix, cols, breaks, get = FALSE,
    log = FALSE, include.lowest = TRUE, ...)
colourByBasepairFrequency(helix, cols, get = TRUE)
colourByUnknottedGroups(helix, cols, get = TRUE)
colourByCovariation(helix, msa, cols, get = FALSE)
colourByConservation(helix, msa, cols, get = FALSE)
colourByCanonical(helix, msa, cols, get = FALSE)
defaultPalette()

Arguments

helix    A helix data frame to be coloured.
cols    An array of characters (or numbers) representing a set of colours to colour helix with. When missing, a default set of colours from defaultPalette() will be used. Valid inputs include hex codes, colour names from the colours function, and integer numbers. The colours will be interpreted as being from best to worst.
counts    An array of integers the same length as cols, dictating the number of times each corresponding colour should be used. When missing, the function will divide the number of helices evenly over each of the colours available.
breaks    An integer number of intervals to break the ‘value’ column of helix into, or a list of numbers defining the interval breaks.
If missing, the range of ‘helix$value’ will automatically be split evenly into intervals for each colour available.
get    If TRUE, returns the input helix with a col column, else simply returns an array of colours the same length as the number of rows in helix. The exceptions are colourByBasepairFrequency and colourByUnknottedGroups which will return a different helix if TRUE, and a list of colours that will not match the input helix if FALSE.
log    If TRUE, will break values into even log10 space intervals, useful when values are p-values.
include.lowest    Whether the lowest interval should include the lowest value, passed to cut.
...    Additional arguments passed to cut, potentially useful ones include right (whether intervals should be inclusive on the right or left) and dig.lab (number of digits in interval labels).
msa    A multiple sequence alignment, such as those returned by readFasta

Details

colourByCount assigns colours independent of the helix input’s value column, and instead operates over the number of helices (i.e. rows).

colourByValue uses cut to assign each of the helices to an interval based on its value.

colourByCovariation, colourByConservation, and colourByCanonical colour helices according to compensatory mutations (or covariation), percentage identity conservation, and percentage canonical basepair respectively, relative to the multiple sequence alignment provided.

colourByBasepairFrequency colours each basepair according to the number of times it appears in the input, regardless of its value.

colourByUnknottedGroups greedily partitions the basepairs into non-pseudoknotted groups, and assigns a colour to each.

Value

All “colourBy” functions return a list of colours when get = FALSE, and a helix with a col column if get = TRUE.
In both cases, the returned object has attributes “legend” and “fill”, showing the mapping between interval (in legend) and colour (in fill), which can be passed as the eponymous arguments to legend.

defaultPalette returns the default list of colours.

Author(s)

Daniel Lai

See Also

plotHelix
logseq
basepairFrequency
unknottedGroups

Examples

data(helix)

known$col <- colourByCount(known)
plotHelix(known)

plotHelix(colourByValue(helix, log = TRUE, get = TRUE))

cov <- colourByCovariation(known, fasta, get = TRUE)
plotCovariance(fasta, cov)
legend("topleft", legend = attr(cov, "legend"),
    fill = attr(cov, "fill"), title = "Covariation")

Convert Helix Formats    Convert helix structures to and from other formats

Description

Converts dot bracket vienna format to and from helix format. It should be noted that the structures allowed in vienna format are a subset of those allowed in the helix format. Thus, conversion from vienna to helix will yield the identical structure, while conversion from helix to vienna may result in the loss of certain basepairs (mainly those that are conflicting). Pseudoknots are supported in both directions of conversion with limitations.

Usage

viennaToHelix(vienna, value = NA, palette = NA)
helixToVienna(helix)
helixToConnect(helix)
helixToBpseq(helix)

Arguments

vienna    A string containing only a vienna dot bracket structure, with balanced brackets. Allowable brackets are (, <, [, {, A, B, C, and D (where upper-case letters are paired with lower-case letters).
value    A numerical value to assign to all helices.
palette    A list of colour names for up to 8 colours that will be used to colour brackets of type (, <, [, {, A, B, C, and D, respectively.
helix    A helix data.frame.

Details

viennaToHelix will ignore any non dot-bracket characters prior to parsing, so the resultant length will be shorter than expected if invalid characters are included.

If the colour palette is shorter than the number of supported brackets, it will simply cycle through the list.
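To make the eight bracket types listed under Arguments concrete, the following base R sketch parses a dot-bracket string into basepairs using one stack of open positions per bracket type. It is an assumption about how such a parser could work, not the implementation of viennaToHelix; parse_brackets_sketch is a hypothetical name, and unlike viennaToHelix it does not strip invalid characters, assign values or colours, or merge basepairs into helices.

```r
# Hypothetical helper, not part of R4RNA: parse a dot-bracket string into
# basepairs, keeping a separate stack of open positions per bracket type.
parse_brackets_sketch <- function(vienna) {
  open  <- c("(", "<", "[", "{", "A", "B", "C", "D")
  close <- c(")", ">", "]", "}", "a", "b", "c", "d")
  chars <- strsplit(vienna, "")[[1]]
  stacks <- vector("list", length(open))
  i_out <- integer(0)
  j_out <- integer(0)
  type_out <- character(0)
  for (pos in seq_along(chars)) {
    ch <- chars[pos]
    if (ch %in% open) {
      k <- match(ch, open)
      stacks[[k]] <- c(stacks[[k]], pos)        # push opening position
    } else if (ch %in% close) {
      k <- match(ch, close)
      n <- length(stacks[[k]])
      if (n == 0) stop("unbalanced closing bracket at position ", pos)
      i_out <- c(i_out, stacks[[k]][n])         # pop the matching opener
      j_out <- c(j_out, pos)
      type_out <- c(type_out, open[k])
      stacks[[k]] <- stacks[[k]][-n]
    }
  }
  if (any(lengths(stacks) > 0)) stop("unbalanced opening bracket")
  pairs <- data.frame(i = i_out, j = j_out, type = type_out,
                      stringsAsFactors = FALSE)
  pairs[order(pairs$i), ]
}

# A small pseudoknot: the [ ] pairs cross the ( ) pairs
pairs <- parse_brackets_sketch(".((..[[..))..]].")
print(pairs)
```

Because each bracket type has its own stack, crossing (pseudoknotted) pairs are recovered as long as they use different bracket types, which is the reason the format needs more than one bracket type in the first place.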
To explicitly prevent the colouring/display of a specific bracket type, colour it “NA”.

For helixToVienna, pseudoknotted basepairs will be assigned different bracket types. As there are only 8 supported bracket types, any basepair pseudoknotted deeper than 8 levels will be excluded from the output. Additionally, vienna format is unable to represent conflicting basepairs, so conflicting basepairs will also be excluded. For both types of exclusion, those at the bottom of the helix data.frame will always be excluded in favour of keeping helices higher on the data.frame table.

helixToConnect and helixToBpseq will convert a non-conflicting helix data.frame into connect or bpseq format respectively, provided the helix structure has a “sequence” attribute containing a single nucleotide sequence of the structure.

Value

viennaToHelix returns a helix data.frame. helixToVienna returns a character string of basepairs in the Vienna helix format. helixToConnect and helixToBpseq return data.frames in the connect and bpseq formats, respectively.

Author(s)

Daniel Lai

Examples

# viennaToHelix demonstrating ALL valid bracket symbols
dot_bracket <- ".....(<[{.....ABCD.....}]>).....dcba....."
parsed <- viennaToHelix(dot_bracket, -31.5)
print(parsed)

vienna <- helixToVienna(parsed)
print(vienna)

# Colouring the brackets by bracket type
colour <- c("red", "orange", "yellow", "green", "lightblue", "blue", "purple", "black")
double.rainbow <- viennaToHelix(dot_bracket, 0, colour)
plotHelix(double.rainbow)

Covariation Plots    Plot nucleotide sequence coloured by covariance

Description

Given a multiple sequence alignment and a corresponding secondary structure, nucleotides in the sequence alignment will be coloured according to the basepairing and conservation status, where green is the most commonly observed valid basepair in the column, dark blue is valid covariation (i.e.
mutation into another valid basepair), cyan is one-sided mutation that retains the basepair, and red is a mutation where the basepair has been lost.

Usage

plotCovariance(msa, helix, arcs = TRUE, add = FALSE, grid = FALSE, text = FALSE,
    legend = TRUE, species = 0, base.colour = FALSE, palette = NA, flip = FALSE,
    grid.col = "white", grid.lwd = 0, text.cex = 0.5, text.col = "white",
    text.font = 2, text.family = "sans", species.cex = 0.5, species.col = "black",
    species.font = 2, species.family = "mono", shape = "circle",
    conflict.cutoff = 0.01, conflict.lty = 2, conflict.col = NA,
    pad = c(0, 0, 0, 0), y = 0, x = 0, ...)
plotDoubleCovariance(top.helix, bot.helix, top.msa, bot.msa = top.msa,
    add = FALSE, grid = FALSE, species = 0, legend = TRUE,
    pad = c(0, 0, 0, 0), ...)
plotOverlapCovariance(predict.helix, known.helix, msa, bot.msa = TRUE,
    overlap.cutoff = 1, miss = "black", add = FALSE, grid = FALSE, species = 0,
    legend = TRUE, pad = c(0, 0, 0, 0), ...)

Arguments

msa, top.msa, bot.msa    Multiple sequence alignment as an array of named characters, all of equal length. Typically output of readFasta. top.msa and bot.msa are specific to top.helix and bot.helix respectively, and may be set to NA to have no multiple sequence alignment at all.
helix, top.helix, bot.helix, predict.helix, known.helix    A helix data.frame with a structure corresponding to msa. See plotDoubleHelix and plotOverlapHelix for detailed explanations of top.helix, bot.helix, predict.helix, and known.helix.
arcs    TRUE if the structure should be plotted as arcs. Arcs may be styled with styling columns, see example and plotHelix for details.
add    TRUE if graphical elements are to be added to an existing device, else a new plotting device is created with blankPlot.
grid    TRUE if the multiple sequence alignment is to be drawn as a grid of bases, else the multiple sequence alignment is drawn as equidistant horizontal lines.
text    Only applicable when grid is TRUE.
TRUE if the grid is to be filled with nucleotide characters.
legend    TRUE if legends are to be shown.
species    If a number greater than 0 is given, then species names for the multiple sequence alignment will be printed along the left side. This name is typically the entire header line of FASTA entries from readFasta, and can be manually manipulated using the names function. The number specifies the start position relative to the left edge of the multiple sequence alignment.
base.colour    TRUE if bases are to be coloured by nucleotide instead of basepair conservation.
palette    A list of colour names to override the default colour palette. When base.colour is TRUE, the first 6 colours will be used for colouring bases A, U, G, C, - (gap), and ? (everything else), respectively. When base.colour is FALSE, the first 7 colours will be used for colouring conserved basepairs, covarying basepairs, one-sided conserved basepairs, invalid basepairs, unpaired bases, gaps, and bases/pairs with ambiguous bases, respectively. If the palette is shorter than the expected length, the palette will simply cycle. “NA” is a valid colour that will effectively plot nothing.
flip    If TRUE, the entire plot will be flipped upside down. Note that this is not a perfect mirror image about the horizon.
grid.col, grid.lwd    The colour and line width of the borders displayed when grid is TRUE.
text.cex, text.col, text.font, text.family    cex, col, family and font for the text displayed via the text option.
Use help("par") for more information on the parameters.
species.cex, species.col, species.font, species.family    cex, col, family and font for the species text displayed via the species option. Use help("par") for more information on the parameters.
shape    One of "circle", "triangle", or "square", specifying the shape of the arcs.
conflict.lty, conflict.col, conflict.cutoff    Determines the line type (style) and colour to be used for conflicting basepairs. By default, conflicting helices are drawn as dotted lines (lty = 2) and whatever colour was originally assigned to it (col = NA). Conflicting helices may be coloured by setting conflict.col to some R-compatible colour name. If both arguments are set to NA, then no attempt to exclude conflicting helices will be made when colouring covariance plot columns, which in most cases will render the plot nonsensical. When the input has helices with multiple basepairs, and only part of the helix is conflicting, the conflict.cutoff determines above what percentage of basepairs have to be conflicting before a helix is considered conflicting, with the default set at 1% conflicting.
miss    The colour for unpredicted arcs in overlapping diagrams, see plotOverlapHelix for more information.
overlap.cutoff    Decimal between 0 and 1 indicating the percentage of basepairs within a helix that have to be overlapping for the entire helix to count as overlapping. Default is 1, or 100%.
pad    A four integer array passed to blankPlot, specifies the number of pixels to pad the bottom, left, top and right sides of the figure with, respectively.
x, y    Coordinates for the left bottom corner of the plot. Useful for manually positioning and overlapping figure elements.
...    In plotCovariance, these are additional arguments passed to blankPlot, useful arguments include ‘lwd’, ‘col’, ‘cex’ for line width, line colour, and text size, respectively.
See help(par) for more. For plotDoubleCovariance and plotOverlapCovariance, these are additional arguments passed to plotCovariance (and thus indirectly also to blankPlot).

Value

Not intended to return a value, will plot to GUI or file if specified.

Author(s)

Daniel Lai

See Also

plotHelix
plotDoubleHelix
plotOverlapHelix
colourByCovariation
colourByConservation
colourByCanonical

Examples

data(helix)

# Basic covariance plot
plotCovariance(fasta, known, cex = 0.8, lwd = 1.5)

# Grid mode
plotCovariance(fasta, known, grid = TRUE, text = FALSE, cex = 0.8)

# Global style and nucleotide colouring
plotCovariance(fasta, known, grid = TRUE, text = FALSE, base.colour = TRUE)

# Styling individual helices with styling columns
known$col <- c("red", "blue")
plotCovariance(fasta, known, lwd = 2, cex = 0.8)

# Use in combination with colourBy functions
cov <- colourByCovariation(known, fasta, get = TRUE)
plotCovariance(fasta, cov)
legend("topleft", legend = attr(cov, "legend"),
    fill = attr(cov, "fill"), title = "Covariation")

Create Blank Plot    Create a blank plotting canvas

Description

Creates a blank plotting canvas with the given dimensions, along with functions to find best values for the canvas dimensions.

Usage

blankPlot(width, top, bottom, pad = c(0, 0, 0, 0), scale = TRUE,
    scale.lwd = 1, scale.col = "#DDDDDD", scale.cex = 1, debug = FALSE,
    png = NA, pdf = NA, factor = ifelse(!is.na(png), 8, 1/9),
    no.par = FALSE, asp = 1, ...)
maxHeight(helix)

Arguments

width    A number indicating the horizontal width of the blank plot.
top, bottom    The maximum and minimum values vertically to be displayed in the plot.
pad    An array of 4 integers, specifying the pixels of whitespace to pad beyond the dimensions given by top, bottom, and width. The four numbers correspond to padding on the bottom, left, top and right, respectively.
Default is c(0, 0, 0, 0).
scale    If TRUE, inserts a scale on the plot.
scale.lwd, scale.col, scale.cex    Allows manual modification of the scale’s line width and colour, respectively.
png, pdf    If one or the other is set to a filename, a file in png or pdf format will be produced respectively. If both are set to non-NA values, png will have priority.
factor    The scaling factor used to produce plots of png or pdf format. Should be set so that, after multiplication of the top, bottom, etc. arguments, good document dimensions are produced, in pixels for png and inches for pdf.
debug    If TRUE, frames the boundaries of the intended plotting space in red, used to determine if inputs produce expected output area. Also outputs to STDIN dimensions of the plot.
no.par    Suppresses the internal call to par in the function if set to TRUE, useful for using par arguments such as mfrow, etc.
asp    Controls the aspect ratio of the plot, set to 1 by default; set to NA to disable completely.
...    Additional arguments passed to par when no.par is FALSE, common ones include ‘lwd’, ‘col’, ‘cex’ for line width, line colour, and text size, respectively. See help(par) for more. When no.par is set to TRUE, this option does nothing, and manually calling par is required prior to the calling of this function.
helix    A helix data.frame

Details

blankPlot creates a blank plot with the given dimensions, with minimal margins around the plot and no axis or labels.
If more control is required, using plot directly would be more efficient.

maxHeight returns the height that the highest helix would require, and can be used to determine top and bottom for blankPlot.

Value

maxHeight returns a numeric integer.

Author(s)

Daniel Lai

See Also

plotHelix

Examples

# Create helix and obtain height
helix <- as.helix(data.frame(1, 37, 12, 0.5))
height <- maxHeight(helix)
print(height)

# Use height to create properly sized plot
width <- attr(helix, "length")
blankPlot(width, height, 0)

# Add helix to plot
plotHelix(helix, add = TRUE)

Example Data    Helices predicted by TRANSAT with p-values

Description

This data set contains two sets of helices and a multiple sequence alignment. The two sets of helices are helix and known, which are helices predicted to occur for RNA sequence RF00458 by the program TRANSAT, and the experimentally proposed structure of the same sequence, respectively. fasta is the seed homologues for the multiple sequence alignment obtained from the RFAM database.

Usage

data(helix)

Format

helix and known are 4 column data frames, where columns i and j denote the left-most and right-most basepairs, the length is the number of consecutive basepairs the helix contains, and the value is assigned to each helix on a row.

fasta is an array of named characters of length 7.

References

Wiebe NJ, Meyer IM. (2010) TRANSAT– method for detecting the conserved helices of functional RNA structures, including transient, pseudo-knotted and alternative structures. PLoS Comput Biol. 6(6):e1000823.

Gardner PP, Daub J, Tate J, Moore BL, Osuch IH, Griffiths-Jones S, Finn RD, Nawrocki EP, Kolbe DL, Eddy SR, Bateman A. (2011) Rfam: Wikipedia, clans and the “decimal” release. Nucleic Acids Res.
39(Database issue):D141-5.

Find Unknotted Groups    Partition basepairs into unknotted groups

Description

Breaks down input helices into basepairs, and assigns each basepair to a numbered group such that basepairs in each group are non-pseudoknotted relative to all other basepairs within the same group. The algorithm is greedy and thus will not necessarily find the best combination of basepairs to minimize the number of groups.

Usage

unknottedGroups(helix)

Arguments

helix    A helix data.frame.

Value

An array of integers dictating the groups of each helix. Will only correspond to the input helix structure if the input had helices of length 1 (e.g. output of expandHelix).

Author(s)

Daniel Lai

See Also

colourByUnknottedGroups
expandHelix

Examples

data(helix)
known$group <- unknottedGroups(known)
print(known)

Helix Type Filters    Logical filters of helix by type

Description

Given a helix data frame, checks if helices are conflicting, duplicating, or overlapping, and returns an array of numeric values, where 0 is FALSE and 1 is TRUE. Values in between 0 and 1 occur when a single helix has multiple basepairs with different values; the number observed in this case is the mean of the basepair values within the helix.
See details for exact definition of the three types of events.

Usage

isConflictingHelix(helix)
isDuplicatingHelix(helix)
isOverlappingHelix(helix, query)

Arguments

helix    A helix data frame
query    For isOverlappingHelix, a helix data structure against which helix will be checked for overlap.

Details

Helices of length greater than 1 are internally expanded into basepairs of length 1, after which the following conditions are evaluated:

A conflicting basepair is one where at least one of its two positions is used by either end of another basepair.

A duplicating basepair is one where both of its positions are used by both ends of another basepair.

An overlapping basepair is one in helix where both of its positions are used by both ends of another basepair in the query structure.

In the case of conflicting and duplicating basepairs, for a set of basepairs that satisfies this condition, the basepair situated highest on the data frame will be exempt from the condition. For example, if 5 basepairs are all duplicates of each other, the top 1 will return FALSE, while the bottom 4 will return TRUE. This assumes some significant meaning to the ordering of rows prior to using this function.
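The row-order rule for conflicting basepairs described above can be illustrated with a short base R sketch. It is a simplified reading of that rule rather than the package's implementation: is_conflicting_sketch is a hypothetical helper that only handles basepairs (helices of length 1), assumes columns i and j, and assumes that positions of already-flagged rows still count as used.

```r
# Hypothetical helper, not part of R4RNA: flag basepairs that share a
# position with any basepair higher up in the table, so that the top-most
# member of a conflicting set is left unflagged.
is_conflicting_sketch <- function(bp) {
  used <- integer(0)              # positions claimed by rows seen so far
  flag <- logical(nrow(bp))
  for (r in seq_len(nrow(bp))) {
    ends <- c(bp$i[r], bp$j[r])
    flag[r] <- any(ends %in% used)
    used <- c(used, ends)
  }
  flag
}

bp <- data.frame(i = c(1, 2, 2, 5), j = c(10, 9, 12, 8))
flag <- is_conflicting_sketch(bp)
print(flag)

# Keep only the basepairs that are free of conflicts
print(bp[!flag, ])
```

Here rows 2 and 3 both use position 2; row 2 sits higher in the table and is kept, while row 3 is flagged, so reordering the rows changes which basepair survives.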
The output is intended to be used with which to filter out basepairs that satisfy these conditions, leaving a set of basepairs free of these events.

If the original input had helices greater than length 1, then after applying all of the above, TRUE is treated as 1, FALSE as 0, and the average of values from each basepair is taken as the value for the helix in question.

Value

Returns an array of numerics corresponding to each row of helix, giving the average conditional status of the helix, where 0 signifies that all basepairs are FALSE, and 1 that all basepairs are TRUE.

Author(s)

Daniel Lai

Examples

    data(helix)
    conflicting <- isConflictingHelix(helix)
    duplicating <- isDuplicatingHelix(helix)
    # Nonsensical covariation plot
    plotCovariance(fasta, helix)
    # Plot nonconflicting helices
    plotCovariance(fasta, helix[(!conflicting & !duplicating), ])
    # Similar result
    plotCovariance(fasta, helix, conflict.col = "lightgrey")

Log10 Space Operations    Log base 10 sequence, floor and ceiling

Description

Sequence, floor and ceiling operations in log 10 space.

Usage

    logseq(from, to, length.out)
    logfloor(x)
    logceiling(x)

Arguments

from, to    Positive non-zero values to start and end sequence, respectively.
length.out    The number of elements the resulting sequence should contain.
If absent, the function will attempt to generate numbers factors of 10 apart.
x    A value to round.

Value

logseq returns an array of numbers evenly spaced in log10-space.
logfloor and logceiling return a value that is 10 raised to an integer number.

Author(s)

Daniel Lai

Examples

    logseq(1e-10, 1e3)
    logseq(1e-10, 1e3, length.out = 10)
    logceiling(2.13e-6)
    logfloor(2.13e-6)

Plot Helix Structures    Plots helices in arc diagram

Description

Plots a helix data frame as an arc diagram, with styling possible with properly named additional columns on the data frame.

Usage

    plotHelix(helix, x = 0, y = 0, flip = FALSE, line = FALSE, arrow = FALSE,
        add = FALSE, shape = "circle", ...)
    plotDoubleHelix(top, bot, line = TRUE, arrow = FALSE, add = FALSE, ...)
    plotOverlapHelix(predict, known, miss = "black", line = TRUE,
        arrow = FALSE, add = FALSE, overlap.cutoff = 1, ...)
    plotArcs(i, j, length, x = 0, y = 0, flip = FALSE, shape = "circle", ...)
    plotArc(i, j, x = 0, y = 0, flip = FALSE, shape = "circle", ...)

Arguments

helix, top, bot, predict, known    Helix data.frames, with the four mandatory columns. Any other column will be considered a styling column, and will be used for styling the helix. See example for styling usage.
See Details for exact usage of each helix.
x, y    The coordinate of the bottom-left corner of the plot, useful for manually positioning figure elements.
flip    If TRUE, flips the arcs upside down about the y-axis.
line    If TRUE, a horizontal line representing the sequence is plotted.
arrow    If TRUE, an arrow is placed on the right end of the line.
add    If TRUE, graphical elements are added to the active plot device, else a new plot device is created for the plot.
shape    One of "circle", "triangle", or "square", specifying the shape of the arcs.
miss    The colour for unpredicted arcs in overlapping diagrams, see Details for more information.
overlap.cutoff    Decimal between 0 and 1 indicating the fraction of basepairs within a helix that have to be overlapping for the entire helix to count as overlapping. Default is 1, or 100%.
i, j    The starting and ending position of the arc along the x-axis.
length    The total number of arcs to draw by incrementing i and decrementing j. Used to draw helices.
...    Any additional parameters passed to par.

Details

plotHelix creates an arc diagram with all arcs on top, plotDoubleHelix creates a diagram with arcs on the top and bottom.
plotOverlapHelix is slightly trickier: given two structures predict and known, it plots the predicted helices that are known on top, predicted helices that are not known on the bottom, and finally plots unpredicted helices on top in the colour defined by miss.

plotArc and plotArcs are the core functions that make everything work, and may be used for extreme fine-tuning and customization.

Value

Not intended to return a value; will plot to GUI or file if specified.

Author(s)

Daniel Lai

See Also

colourByCount

Examples

    data(helix)
    # Plot helix plain
    plotHelix(known)
    # Apply global appearance options
    plotHelix(known, line = TRUE, arrow = TRUE, col = "blue", lwd = 1.5)
    # Add extra column with styling options
    known$lty <- 1:4
    known$lwd <- 1:2
    known$col <- c(rgb(1, 0, 0), "orange", "yellow", "#00FF00", 4, "purple")
    plotHelix(known)
    # Manually colour helices according to value
    helix$col <- "red"
    helix$col[which(helix$value < 1e-3)] <- "orange"
    helix$col[which(helix$value < 1e-4)] <- "green"
    helix$col[which(helix$value < 1e-5)] <- "blue"
    plotHelix(helix)
    # Automatically creating a similar plot with legend
    coloured <- colourByValue(helix, log = TRUE, get = TRUE)
    plotHelix(coloured, line = TRUE, arrow = TRUE)
    legend("topleft", legend = attr(coloured, "legend"),
        fill = attr(coloured, "fill"), title = "P-value", text.col = "black")
    # Plot both helices with styles
    plotDoubleHelix(helix, known)
    # Overlap helix
    plotOverlapHelix(helix, known)

Read FASTA    Read FASTA format multiple sequence alignment files

Description

Reads in FASTA format multiple sequence text files into a list of named characters, with names derived from the description line of each FASTA entry.

Usage

    readFasta(file, filter = FALSE)

Arguments

file    FASTA format file containing at least one sequence, where each sequence has a description line beginning with the > character.
filter    When TRUE, filters out any sequences with any other characters besides: A, C, G, T, U, and - for gaps.
Also converts all letters to uppercase, all T’s to U’s, and N’s to -’s.

Value

Returns an array of named characters, each element a sequence read, with the description as its name.

Author(s)

Daniel Lai

Examples

    file <- system.file("extdata", "fasta.txt", package = "R4RNA")
    fasta <- readFasta(file)
    head(fasta)

Read Structure File    Read secondary structure file

Description

Reads in secondary structure text files into a helix data frame.

Usage

    readHelix(file)
    readConnect(file)
    readVienna(file, palette = NA)
    readBpseq(file)

Arguments

file    A text file in the format expected by the function used (helix, connect, Vienna, or bpseq); see Details for format specifications.
palette    Used to colour basepairs by bracket type. See viennaToHelix for more details.

Details

Helix: Files start with a header line beginning with # followed by the sequence length, followed by a four-column tab-delimited table (with column names), where each row corresponds to a helix in the structure. The four columns are i and j for the left-most and right-most basepair positions respectively, the length of the helix (converging inwards from i and j), and finally an arbitrary value assigned to the helix.

Vienna: Dot-bracket notation from Vienna package programs, where each structure consists of matched brackets for basepairs and periods for unpaired bases. Valid brackets are (, {, [, <, A, B, C, D matched with ), }, ], >, a, b, c, d, respectively. An energy value can be appended to the end of any dot-bracket structure. The function will accept slight variations of the format, including those with FASTA-like headers (in which case line breaks are allowed), and those without FASTA-like headers (in which case line breaks are NOT allowed), with both types allowing for a preceding (NOT following) nucleotide sequence for the structure.
Multiple entries of the same length may be in a single file, which will be returned as a single helix structure, with their respective energy values (if specified).

Connect: Output from mfold and other programs, this format is expected to be a text file beginning with a header line that starts with the sequence length, with an optional Energy/dG value, followed by a six-column tab-delimited table where columns 1 and 5 denote the positions that are basepaired (unpaired when column 5 is 0). Other columns are ignored, but for completeness, column 2 is the nucleotide, columns 3 and 4 are the positions of the bases left and right of the base specified in column 1 respectively (with 0 denoting non-existence), and column 6 is a copy of column 1. Multiple entries of the same length may be in a single file, which will be returned as a single helix structure. All helices will be assigned the energy value extracted from their respective structure header lines.

Bpseq: Format used by the Gutell Lab’s Comparative RNA Website. The file may optionally begin with several header lines (e.g. Filename, Organism, Accession, etc.), followed by a 3-column tab-delimited table for the structure, where column 1 is the base position, column 2 is the nucleotide base, and column 3 is the paired position (0 if unpaired). Certain pieces of header information will be parsed and returned as attributes of the output data frame.
Multiple structures can be within a single file, returned as a single helix data frame, with attributes set to those of the first entry.

Value

Returns a helix format data frame.

Author(s)

Daniel Lai, Jeff Proctor

Examples

    file <- system.file("extdata", "helix.txt", package = "R4RNA")
    helix <- readHelix(file)
    head(helix)
    file <- system.file("extdata", "connect.txt", package = "R4RNA")
    connect <- readConnect(file)
    head(connect)
    message("Note connect data assigns structure energy level to all basepairs")
    file <- system.file("extdata", "vienna.txt", package = "R4RNA")
    vienna <- readVienna(file)
    head(vienna)
    message("Note vienna data assigns structure energy level to all basepairs")
    file <- system.file("extdata", "bpseq.txt", package = "R4RNA")
    bpseq <- readBpseq(file)
    head(bpseq)
    message("Note bpseq data has no value assigned to basepairs")

Structure Mismatch Score    Scores how a basepair structure fits a sequence

Description

Calculates a score that indicates how badly a set of basepairs (i.e. a secondary structure) fits with a sequence. A perfect fit is a structure where all basepairs form valid basepairs (A:U, G:C, G:U, and equivalents) and has a score of 0.
Each basepair that forms a non-canonical pairing or pairs to gaps increases the score by 1, and each basepair with a single-sided gap increases the score by 2.

Usage

    structureMismatchScore(msa, helix, one.gap.penalty = 2, two.gap.penalty = 2,
        invalid.penalty = 1)

Arguments

msa    An array of strings representing sequences of interest, typically the output from readFasta.
helix    A helix data.frame.
one.gap.penalty    Penalty score for basepairs with one of the bases being a gap.
two.gap.penalty    Penalty score for basepairs with both bases being gaps.
invalid.penalty    Penalty score for non-canonical basepairs.

Value

Returns an array of mismatch scores.

Author(s)

Jeff Proctor, Daniel Lai

Examples

    data(helix)
    mismatch <- structureMismatchScore(fasta, known)
    # Sort by increasing mismatch
    sorted_fasta <- fasta[order(mismatch)]

Write FASTA    Writes out a FASTA format file from a list of named characters

Description

Writes out a FASTA format file from a list of named characters, where the sequences are from the elements, and the descriptions are from the names. Unless wrap is set, does not attempt to break the sequence into multiple lines.

Usage

    writeFasta(msa, file = stdout(), wrap = NA)

Arguments

msa    A list of characters representing each sequence, with names for each element containing the description of each sequence.
file    A character string pointing to the path of a file, or a connection. Defaults to the console.
wrap    An integer to determine the number of characters in each row before the line wraps/breaks. If NA, then no wrapping will occur.

Value

No value returned.
Will write to STDOUT or file if specified.

Author(s)

Daniel Lai

Examples

    fasta <- c(sequence = "AAAAACCCCCUUUUU", structure = "(((((.....)))))")
    writeFasta(fasta)

Write Helix    Write out a helix data frame into a text file

Description

Writes out a helix data frame to a text file in the four-column tab-delimited format with proper header and column names.

Usage

    writeHelix(helix, file = stdout())

Arguments

helix    A helix data frame.
file    A character string pointing to a file path, or a file connection. Defaults to the console.

Value

No value returned; will write to STDOUT or the specified file location.

Author(s)

Daniel Lai

Examples

    # Create helix data frame
    helix <- data.frame(2, 8, 3, 0.5)
    helix[2, ] <- c(5, 15, 4, -0.5)
    helix <- as.helix(helix)
    writeHelix(helix)

A.1.2 R4RNA vignette

R4RNA: A R package for RNA visualization and analysis

Daniel Lai
March 21, 2012

Contents

1 R4RNA
  1.1 Reading Input
  1.2 Basic Arc Diagram
  1.3 Multiple Structures
  1.4 Filtering Helices
  1.5 Colouring Structures
  1.6 Overlapping Multiple Structures
  1.7 Visualizing Multiple Sequence Alignments
  1.8 Multiple Sequence Alignments with Annotated Arcs
  1.9 Additional Colouring Methods
    1.9.1 Colour By Covariation (with alignment as blocks)
    1.9.2 Colour By Conservation (with custom alignment colours)
    1.9.3 Colour By Percentage Canonical Basepairs (with custom arc colours)
    1.9.4 Colour Pseudoknots (with CLUSTALX-style alignment)
2 Session Information

1 R4RNA

The R4RNA package aims to be a general framework for the analysis of RNA secondary structure and comparative analysis in R, the language so chosen due to its native support for publication-quality graphics, portability across all major operating systems, and interactive power with large datasets.

To demonstrate the ease of creating complex arc diagrams, a short example is as follows.

1.1 Reading Input

Currently, supported input formats include dot-bracket, connect, bpseq, and a custom “helix” format. Below, we read in a structure predicted by TRANSAT, and the known structure obtained from the RFAM database.

> library(R4RNA)
> message("TRANSAT prediction in helix format")
> transat_file <- system.file("extdata", "helix.txt", package = "R4RNA")
> transat <- readHelix(transat_file)
> message("RFAM structure in dot bracket format")
> known_file <- system.file("extdata", "vienna.txt", package = "R4RNA")
> known <- readVienna(known_file)
> message("Work with basepairs instead of helices for more flexibility")
> message("Breaks all helices into helices of length 1")
> transat <- expandHelix(transat)
> known <- expandHelix(known)

1.2 Basic Arc Diagram

In the standard arc diagram, the nucleotide sequence is the horizontal line running left to right from 5’ to 3’ at the bottom of the diagram.
Any two bases that basepair in a secondary structure are connected with an arc.

> plotHelix(known, line = TRUE, arrow = TRUE)
> mtext("Known Structure", side = 3, line = -2, adj = 0)

[Figure: arc diagram labelled "Known Structure"; position axis from 0 to 200.]

1.3 Multiple Structures

Two structures for the same sequence can be visualized simultaneously, allowing one to compare and contrast the two structures.

> plotDoubleHelix(transat, known, line = TRUE, arrow = TRUE)
> mtext("TRANSAT\nPredicted\nStructure", side = 3, line = -5, adj = 0)
> mtext("Known Structure", side = 1, line = -2, adj = 0)

[Figure: double arc diagram with the TRANSAT predicted structure above the line and the known structure below; position axis from 0 to 200.]

1.4 Filtering Helices

Basepairs can be associated with a value, such as energy stability or statistical probability, and we can easily filter out basepairs according to such values.

> message("Filter out helices above a certain p-value")
> transat <- transat[which(transat$value <= 0.001), ]

1.5 Colouring Structures

We can also assign colour to the structure according to basepair values.

> message("Assign colour to basepairs according to p-value")
> transat$col <- col <- colourByValue(transat, log = TRUE)
> message("Coloured encoded in 'col' column of transat structure")
> plotDoubleHelix(transat, known, line = TRUE, arrow = TRUE)
> legend("topright", legend = attr(col, "legend"), fill = attr(col,
+     "fill"), inset = 0.05, bty = "n", border = NA, cex = 0.75,
+     title = "TRANSAT P-values")

[Figure: double arc diagram with TRANSAT arcs coloured by p-value; position axis from 0 to 200; legend "TRANSAT P-values": [0,1e-05], (1e-05,0.0001], (0.0001,0.001].]

1.6 Overlapping Multiple Structures

A neat way of visualizing the concordance between two structures is an overlapping structure diagram, which we can use to overlap the predicted TRANSAT structure and the known RFAM structure. Predicted basepairs that exist in the known structure are drawn above the line, and those predicted that are not known to exist are drawn below.
Those known but unpredicted are shown in black above the line.

> plotOverlapHelix(transat, known, line = TRUE, arrow = TRUE, scale = FALSE)

1.7 Visualizing Multiple Sequence Alignments

In addition to visualizing the structure alone, we can also visualize a secondary structure along with aligned nucleotide sequences. In the following, we will read in a multiple sequence alignment obtained from RFAM, and visualize the known structure on top of it.

We can also annotate the alignment colours according to their agreement with the known structure. If a sequence can form a basepair as dictated by the structure, the basepair is coloured green, else red. For green basepairs, if a mutation has occurred but basepairing potential is retained, it is coloured in blue (dark for mutations in both bases, light for single-sided mutation). Unpaired bases are in black and gaps are in grey.

> message("Multiple sequence alignment of interest")
> fasta_file <- system.file("extdata", "fasta.txt", package = "R4RNA")
> fasta <- readFasta(fasta_file)
> message("Plot covariance in alignment")
> plotCovariance(fasta, known, cex = 0.5)

[Figure: known structure drawn as arcs above the coloured alignment; position axis from 0 to 200; legend: Conservation, Covariation, One-sided, Invalid, Unpaired, Gap.]

1.8 Multiple Sequence Alignments with Annotated Arcs

Arcs can be coloured as usual.
It should be noted that structures with conflicting basepairs (arcs sharing a base) cannot be visualized properly on a multiple sequence alignment, and are typically filtered out (e.g. drawn in grey here).

> plotCovariance(fasta, transat, cex = 0.5, conflict.filter = "grey")

[Figure: TRANSAT arcs above the coloured alignment, with conflicting arcs in grey; position axis from 0 to 200; legend: Conservation, Covariation, One-sided, Invalid, Unpaired, Gap.]

1.9 Additional Colouring Methods

Various other methods of colouring arcs exist, along with many options to control appearances:

1.9.1 Colour By Covariation (with alignment as blocks)

> col <- colourByCovariation(known, fasta, get = TRUE)
> plotCovariance(fasta, col, grid = TRUE, legend = FALSE)
> legend("topright", legend = attr(col, "legend"), fill = attr(col,
+     "fill"), inset = 0.1, bty = "n", border = NA, cex = 0.37,
+     title = "Covariation")

[Figure: arcs coloured by covariation above the alignment drawn as blocks; position axis from 0 to 200; legend "Covariation": [-2,-1.5], (-1.5,-1], (-1,-0.5], (-0.5,0], (0,0.5], (0.5,1], (1,1.5], (1.5,2].]

1.9.2 Colour By Conservation (with custom alignment colours)

> custom_colours <- c("green", "blue", "cyan", "red", "black",
+     "grey")
> plotCovariance(fasta, col <- colourByConservation(known, fasta,
+     get = TRUE), palette = custom_colours, cex = 0.5)
> legend("topright", legend = attr(col, "legend"), fill = attr(col,
+     "fill"), inset = 0.15, bty = "n", border = NA, cex = 0.75,
+     title = "Conservation")

[Figure: arcs coloured by conservation above the alignment in custom colours; position axis from 0 to 200; alignment legend: Conservation, Covariation, One-sided, Invalid, Unpaired, Gap; arc legend "Conservation": [0,0.125], (0.125,0.25], (0.25,0.375], (0.375,0.5], (0.5,0.625], (0.625,0.75], (0.75,0.875], (0.875,1].]

1.9.3 Colour By Percentage Canonical Basepairs (with custom arc colours)

> col <- colourByCanonical(known, fasta, custom_colours, get = TRUE)
> plotCovariance(fasta, col, base.colour = TRUE, cex = 0.5)
> legend("topright", legend = attr(col, "legend"), fill = attr(col,
+     "fill"), inset = 0.15, bty = "n", border = NA, cex = 0.75,
+     title = "% Canonical")

[Figure: arcs coloured by percentage of canonical basepairs above the alignment coloured by base; position axis from 0 to 200; alignment legend: A, U, G, C, -; arc legend "% Canonical": [0,0.167], (0.167,0.333], (0.333,0.5], (0.5,0.667], (0.667,0.833], (0.833,1].]

1.9.4 Colour Pseudoknots (with CLUSTALX-style alignment)

> col <- colourByUnknottedGroups(known, c("red", "blue"), get = TRUE)
> plotCovariance(fasta, col, base.colour = TRUE, legend = FALSE,
+     species = 23, grid = TRUE, text = TRUE, text.cex = 0.2, cex = 0.5)

[Figure: arcs coloured by unknotted group above the alignment drawn with nucleotide text; position axis from 0 to 200. Sequences shown: AF183905.1/5647-5848, AF218039.1/6028-6228, AB017037.1/6286-6484, AB006531.1/6003-6204, AF014388.1/6078-6278, AF022937.1/6935-7121, AF178440.1/5925-6123.]

2 Session Information

The version number of R and packages loaded for generating the vignette were:

• R version 2.13.2 (2011-09-30), x86_64-unknown-linux-gnu
• Locale: LC_CTYPE=en_US.UTF-8, LC_NUMERIC=C, LC_TIME=en_US.UTF-8, LC_COLLATE=C, LC_MONETARY=C, LC_MESSAGES=en_US.UTF-8, LC_PAPER=en_US.UTF-8, LC_NAME=C, LC_ADDRESS=C, LC_TELEPHONE=C, LC_MEASUREMENT=en_US.UTF-8, LC_IDENTIFICATION=C
• Base packages: base, datasets, grDevices, graphics, methods, stats, utils
• Other packages: R4RNA 0.1.4
• Loaded via a namespace (and not attached): tools 2.13.2
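As a supplement to Section 1.9.4 and the Find Unknotted Groups manual entry, the greedy grouping that pseudoknot colouring relies on can be sketched in base R. This is a minimal sketch, assuming basepairs are supplied as position vectors i and j with i < j. The names crosses and greedy_unknotted_groups are hypothetical and are not R4RNA functions; the package's own implementation may visit basepairs in a different order or break ties differently.

```r
# Hypothetical helper (not part of R4RNA): TRUE if basepairs (i1, j1) and
# (i2, j2) cross, i.e. exactly one end of one pair lies strictly inside
# the other pair.
crosses <- function(i1, j1, i2, j2) {
  (i1 < i2 & i2 < j1 & j1 < j2) | (i2 < i1 & i1 < j2 & j2 < j1)
}

# Greedy sketch: visit basepairs in row order and place each one in the
# lowest-numbered group whose current members it does not cross.
greedy_unknotted_groups <- function(i, j) {
  stopifnot(length(i) == length(j), all(i < j))
  group <- integer(length(i))
  for (k in seq_along(i)) {
    g <- 1L
    repeat {
      members <- which(group == g)
      if (!any(crosses(i[k], j[k], i[members], j[members]))) break
      g <- g + 1L
    }
    group[k] <- g
  }
  group
}

# Two nested basepairs, followed by one that crosses both (a pseudoknot)
bp <- data.frame(i = c(1, 2, 5), j = c(10, 9, 15))
bp$group <- greedy_unknotted_groups(bp$i, bp$j)
print(bp)
```

Each basepair is placed without revisiting earlier decisions, which is why, as the manual entry notes, a greedy pass of this kind is not guaranteed to minimize the number of groups.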

