Open Collections

UBC Theses and Dissertations

UBC Theses Logo

UBC Theses and Dissertations

Beyond the one-sequence-one-structure dogma : predicting and analysing transient and alternative RNA… Zhu, Jing Yun Alice 2015

Your browser doesn't seem to have a PDF viewer, please download the PDF to view this item.

Item Metadata


24-ubc_2015_november_zhu_jingyunalice.pdf [ 13.9MB ]
JSON: 24-1.0220703.json
JSON-LD: 24-1.0220703-ld.json
RDF/XML (Pretty): 24-1.0220703-rdf.xml
RDF/JSON: 24-1.0220703-rdf.json
Turtle: 24-1.0220703-turtle.txt
N-Triples: 24-1.0220703-rdf-ntriples.txt
Original Record: 24-1.0220703-source.json
Full Text

Full Text

Beyond the one-sequence-one-structure dogma: predicting and analysingtransient and alternative RNA secondary structures that areevolutionarily conservedbyJing Yun Alice ZhuBSc, The University of British Columbia, 2012A THESIS SUBMITTED IN PARTIAL FULFILLMENT OFTHE REQUIREMENTS FOR THE DEGREE OFMASTER OF SCIENCEinTHE FACULTY OF GRADUATE AND POSTDOCTORAL STUDIES(Bioinformatics)THE UNIVERSITY OF BRITISH COLUMBIA(Vancouver)August, 2015c© Jing Yun Alice Zhu, 2015AbstractState-of-the-art methods in RNA secondary structure prediction focus on predicting the final, func-tional structure. However, ample experimental and statistical evidence indicates that structure formationstarts immediately during transcription and this co-transcriptional folding influences the resultant finalRNA structure. Thus, identifying the transient structures that are formed co-transcriptionally may bringinsight into understanding how co-transcriptional folding leads to the final conformation in vivo. As RNAsecondary structures are currently best predicted by comparative approaches, we therefore investigatedwhether homologous RNA genes not only assume the same final structure, but also share structural fea-tures during the co-transcriptional folding in vivo. For this, we compiled a non-redundant data set of 32transcripts deriving from six different RNA families which constitutes the most comprehensive data setwith experimentally confirmed transient and alternative RNA structures so far. We present statisticalevidence that homologous RNA genes from related organisms fold co-transcriptionally in a similar way.In particular, we show that some transient structures are highly conserved with levels similar to those ofthe final, functional structure. Moreover, we find that the predicted co-transcriptional folding pathwaysof homologous sequences encounter similar transient structure features, which often coincide with knowntransient features. We thus also predict candidates for these evolutionarily conserved transient featuresof co-transcriptional folding pathways in silico.We further expand 4 alignments from the aforementioned dataset by searching via covariance modeland manual curation in order to share them with the RNA community. These alignments either updatethe existing Rfam datasets with annotation of transient structures, or introduce new RNA family: (1)Trp operon leader, where alternative structures are coordinated to regulate the operon transcription inresponse to tryptophan abundance (2) HDV ribozyme, where the self-cleavage activity is modulated viatransient structures involving the extended 5’ flanking sequence (3) 5’ UTR of Levivirus maturationprotein, where a transient structure temporarily postpones the formation of the final structure thatinhibits the translation of maturation protein (4) SAM riboswitch, where the downstream gene expressionis regulated by alternative structures upon binding of SAM.iiPrefaceAs part of my MSc research, I was involved in the entire research phases (from data collection, dataanalysing to visualization) and preparation of the following research manuscripts:Published:• J. Y. A. Zhu, A. Steif, J. R. Proctor, and I. M. Meyer. Transient RNA structure features areevolutionarily conserved and can be computationally predicted. Nucleic Acids Research, 41: 6273-6285, 2013.• J.Y. A. Zhu, I. M. Meyer. Four RNA families with functional transient structures. RNA Biology,12(1):5-20, 2015.I was also involved in the initial phase of website construction of the following project:Published:• D. Lai, J. R. Proctor, J. Y. Zhu, and I. M. Meyer. R-CHIE: a web server and R package forvisualizing RNA secondary structures. Nucleic Acids Research, 40(12):e95, 2012.iiiTable of ContentsAbstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iiPreface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iiiTable of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ivList of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viiiList of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ixAcknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.1 The Functional Roles Of RNA Sequences . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.1.1 The Central Dogma . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.1.2 Transcriptional Pervasiveness Of Noncoding RNA . . . . . . . . . . . . . . . . . . 11.1.3 The Obsolete Central Dogma Does Not Contain The Noncoding RNA . . . . . . . 21.1.4 RNA Has Various Functional Roles In Organisms . . . . . . . . . . . . . . . . . . . 31.2 RNA Secondary Structure And Structure Formation . . . . . . . . . . . . . . . . . . . . . 51.2.1 Three Layers Of RNA Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51.2.2 RNA Secondary Structure Has Various Functional Roles . . . . . . . . . . . . . . . 61.2.3 Structural Building Blocks Of RNA Secondary Structure . . . . . . . . . . . . . . 81.2.4 Elements Of RNA Tertiary Interaction . . . . . . . . . . . . . . . . . . . . . . . . . 91.2.5 RNA Structural Formaion In Vivo . . . . . . . . . . . . . . . . . . . . . . . . . . . 121.3 Transient And Alternative RNA Structure Features . . . . . . . . . . . . . . . . . . . . . . 131.4 Experimental Methods For Identifying RNA Structural Features . . . . . . . . . . . . . . 141.4.1 The Importance Of Identifying RNA Secondary Structure . . . . . . . . . . . . . . 141.4.2 Low-Throughput Method: Biochemical Probing . . . . . . . . . . . . . . . . . . . . 151.4.3 High-Throughput Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171.4.4 High-Resolution But Low-Throughput Techniques . . . . . . . . . . . . . . . . . . 171.5 Computational Methods For RNA Structure Prediction . . . . . . . . . . . . . . . . . . . 211.5.1 Non-Comparative Methods: Minimum Free Energy (MFE)-Based . . . . . . . . . . 211.5.2 Non-Comparative Methods: Probability-Based Methods . . . . . . . . . . . . . . . 221.5.3 Comparative Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221.5.4 Kinetic Folding Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231.6 Hypothesis And Goal Of This Thesis: Prediction Of Transient And Altenative RNA Struc-ture Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25iv2 Analysis Pipeline For Identifying Evolutionarily Conserved And Kinetically FeasibleTransient RNA Structure Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 272.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 272.1.1 The Importance Of Functional Transient Structure On RNA Folding Pathway . . 272.1.2 Hypothesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 272.2 Strategy Overview: Flowchart . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 282.3 Strategy Overview I: Data Assembly . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 282.3.1 Procedures Of Alignment Assembly . . . . . . . . . . . . . . . . . . . . . . . . . . 302.3.2 Background And Details Of The Alignments Assembled . . . . . . . . . . . . . . . 322.3.3 Alignment Quality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 372.4 Strategy Overview II: Selection Of Representative Sequences From Each Alignment . . . . 382.5 Strategy Overview III: Folding Pathway Simulation With Existing Kinetic Methods . . . 392.5.1 Three Kinetic Programs Employed In Our Pipeline, And Their Input Parameters . 392.5.2 Kinefold . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 402.5.3 RNAKinetics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 422.5.4 Kinwalker . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 432.6 Strategy Overview IV: Post-processing Of Simulation Predictions . . . . . . . . . . . . . . 442.6.1 Kinefold Results Post-processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . 452.6.2 RNAKinetics Results Post-processing . . . . . . . . . . . . . . . . . . . . . . . . . 452.6.3 Kinwalker Results Post-processing . . . . . . . . . . . . . . . . . . . . . . . . . . . 452.7 Extended Pipeline For Detecting Conserved Transient Structure . . . . . . . . . . . . . . 462.7.1 Transat . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 462.7.2 Combine Kinetic Programs And Transat . . . . . . . . . . . . . . . . . . . . . . . 472.8 Visualization Of Pipeline Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 472.8.1 Visualization Of The Proposed Putative Functional Transient/Alternative Struc-tures In Relation To The Known Structures . . . . . . . . . . . . . . . . . . . . . . 482.8.2 Visualization Of The Pipeline Performance Evaluated Against Reference StructuralFeatures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 552.9 Performance Statistical Measures And Results . . . . . . . . . . . . . . . . . . . . . . . . . 572.9.1 Known Transient Features Of Folding Pathways Are Evolutionarily Conserved. . . 572.9.2 Known Transient Features Can Be Predicted Computationally Using Folding Path-way Prediction Programs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 752.9.3 Combining Prediction Programs Improves The Prediction Accuracy For KnownTransient And Final Structural Features. . . . . . . . . . . . . . . . . . . . . . . . 762.9.4 There Is Evolutionary Evidence For Additional Transient Heelices. . . . . . . . . . 802.9.5 These Additional Transient Helices Can Be Predicted Computationally. . . . . . . 813 Four RNA Families With Functional Transient RNA Structures . . . . . . . . . . . . 833.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83v3.2 Trp Operon Leader . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 843.2.1 Transcriptional Control Of The Tryptophan Operon . . . . . . . . . . . . . . . . . 843.2.2 Terminator Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 853.2.3 Anti-terminator Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 863.2.4 Half-life Of The Structures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 873.3 5’UTR Of Leviviridae Levivirus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 883.3.1 Translational Control Of Maturation Protein In Leviviridae Levivirus . . . . . . . 883.3.2 Final Inhibitory Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 903.3.3 Transient Permissive Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 913.3.4 Half-life Of The Structures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 923.4 HDV Ribozyme . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 933.4.1 Regulation Of The HDV Ribozyme Self-cleaving Activity. . . . . . . . . . . . . . . 933.4.2 Active Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 933.4.3 Structure Preventing Self-cleavage . . . . . . . . . . . . . . . . . . . . . . . . . . . 943.4.4 Structure Permitting Self-cleavage . . . . . . . . . . . . . . . . . . . . . . . . . . . 953.4.5 Half-life Of The Structures. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 953.5 SAM-responsive Riboswitch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 973.5.1 Regulation Of Gene Expression Via A Riboswitch. . . . . . . . . . . . . . . . . . . 973.5.2 Sam-unbound Structure. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 973.5.3 Sam-bound Structure. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 973.6 Method Of Analysis Pipeline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 993.6.1 Flowchart For Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 993.6.2 Primary Covariation Model: Initial Small Alignment . . . . . . . . . . . . . . . . . 993.6.3 Secondary Covariation Model: Pre-assembled Alignment + Unique Rfam SeedAlignment Sequences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 993.6.4 Expanded Alignment: Select Sequences To Add Into The Original Small MSA . . 1023.6.5 Reference Sequence And Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . 1043.7 Results For New Rfam Families . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1053.7.1 Alignments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1053.7.2 Mapping Structures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1063.7.3 Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1063.7.4 Figures Of The Arc-plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1074 Conclusion, Discussion and Outlook . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1124.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1124.1.1 The Pipeline For Comparative Prediction Of Transient Structural Features . . . . 1124.1.2 The Pipeline For Annotating An Alignment With Both Final And Transient Struc-tures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1134.2 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113vi4.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1144.3.1 The Transient Structures Have Slightly Lower Covariation . . . . . . . . . . . . . . 1144.3.2 Predicting Tangible Amount Of Functional Transient Features . . . . . . . . . . . 1154.3.3 Annotation Of Known And Functional Transient/Alternative Features . . . . . . . 1154.3.4 Some Odd But Interesting Cases Of Conserved Transient Structures . . . . . . . . 1164.4 Outlook On Future Research Direction Of Transient Structural Features . . . . . . . . . . 1164.4.1 Expanding The Pipeline For Efficient Prediction Of Transient Structures . . . . . 1164.4.2 Utilizing The Transient Structures Predicted From Our Pipelines Part I: BetterAlignment And More Efficient Prediction Of Final Functional Structure . . . . . . 1174.4.3 Utilizing The Transient Structures Predicted From Our Pipelines Part II: DecodeThe Pivotal Point In Evolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1184.4.4 Utilizing The Transient Structures Predicted From Our Pipelines Part III: PredictTranscriptional Pausing Site . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1184.4.5 Utilizing The Transient Structures Predicted From Our Pipelines Part IV: Ri-boSNitches And Disease . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122viiList of Tables1 Alignment information of Bacterial ribonuclease P type A ribozyme (RF00010) . . . . . . 332 Alignment information of Bacterial Signal Recognition Particle 4.5S RNA (RF00169) . . . 343 Alignment information of Tryptophan Operon Leader (RF00513) . . . . . . . . . . . . . . 354 Alignment information of Hepatitis Delta Virus Ribozyme (RF00094) . . . . . . . . . . . 365 Alignment information of Levivirus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 366 Alignment information of Samriboswitch . . . . . . . . . . . . . . . . . . . . . . . . . . . . 377 Alignment quality measures for the merged known structures . . . . . . . . . . . . . . . . 388 Alignment quality measures that are not related to structures . . . . . . . . . . . . . . . . 389 Alignment-specific cutoff for pairwise sequence identity . . . . . . . . . . . . . . . . . . . . 3910 Quality measures for known transient features and known final features for all alignments 7511 TPR for known final and known transient structural features as predicted by Kinwalker,Kinefold, and RNAKinetics using the respective MCC-derived cutoff . . . . . . . . . . 7612 Detailed performance measures for each alignment using kinetic folding programs at MCC-derived cutoff values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7713 TPR and PPV for all classes of structures using kinetic folding programs at MCC-derivedcutoff values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8014 Quality measures of transient structural features for all alignments . . . . . . . . . . . . . 8115 TPR for novel transient structural features for kinetic folding prediction methods usingthe MCC-derived cutoff optimized over known features . . . . . . . . . . . . . . . . . . . . 8316 The source literatures where the transient/alternative and dominant structures are derived 10517 The basic alignment statistics for our new alignments and the corresponding seed align-ments in Rfam . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10618 The alignment quality measures for the transient structural features for our alignments . . 107viiiList of Figures1 The elements of RNA secondary structure . . . . . . . . . . . . . . . . . . . . . . . . . . . 102 The co-transcriptional folding of RNA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133 Analysis pipeline for the comparative post-processing of the kinetic simulation of the align-ment sequences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 294 The reference structural features of Levivirus . . . . . . . . . . . . . . . . . . . . . . . . . 505 The reference structural features of Bacterial ribonuclease P type A ribozyme . . . . . . . 516 The reference structural features of Hepatitis Delta Virus Ribozyme . . . . . . . . . . . . 527 The reference structural features of Bacterial Signal Recognition Particle 4.5S RNA . . . . 538 The reference structural features of Tryptophan Operon Leader . . . . . . . . . . . . . . . 549 The reference structural features of Samriboswitch . . . . . . . . . . . . . . . . . . . . . . 5510 Kinwalker prediction performance on Levivirus . . . . . . . . . . . . . . . . . . . . . . . 5811 Kinefold prediction performance on Levivirus . . . . . . . . . . . . . . . . . . . . . . . . 5912 RNAKinetics prediction performance on Levivirus . . . . . . . . . . . . . . . . . . . . . 6013 Kinwalker prediction performance on RF00010, Bacterial ribonuclease P type A ribozyme 6114 Kinefold prediciton performance on RF00010, Bacterial ribonuclease P type A ribozyme. 6215 Kinwalker prediction performance on RF00094, HDV ribozyme . . . . . . . . . . . . . . 6316 Kinefold prediction performance on RF00094, HDV ribozyme . . . . . . . . . . . . . . . 6417 RNAKinetics prediction performance on RF00094, HDV ribozyme . . . . . . . . . . . . 6518 Kinwalker prediction performance on RF00169, Bacterial Signal Recognition Particle4.5S RNA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6619 Kinefold prediction performance on RF00169, Bacterial Signal Recognition Particle 4.5SRNA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6720 RNAKinetics prediction performance on RF00169, Bacterial Signal Recognition Particle4.5S RNA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6821 Kinwalker prediction performance on RF00513, Tryptophan Operon Leader . . . . . . . 6922 Kinefold prediction performance on RF00513, Tryptophan Operon Leader . . . . . . . . 7023 RNAKinetics prediction performance on RF00513, Tryptophan Operon Leader . . . . . 7124 Kinwalker prediction performance on SamRiboswitch . . . . . . . . . . . . . . . . . . . 7225 Kinefold prediction performance on SamRiboswitch . . . . . . . . . . . . . . . . . . . . 7326 RNAKinetics prediction performance on SamRiboswitch . . . . . . . . . . . . . . . . . . 7427 MCC plotted over a series of cutoff values for known transient and final structure . . . . . 7828 ROC curves indicating performance for known transient and final features using kineticfolding programs at a broad range of cutoff values . . . . . . . . . . . . . . . . . . . . . . 7929 ROC curve illustrating predictive performance for novel transient features using kineticfolding methods at a broad range of cutoff values . . . . . . . . . . . . . . . . . . . . . . . 82ix30 Schematic drawing for Tryptophan operon leader . . . . . . . . . . . . . . . . . . . . . . . 8831 Schematic drawing for Levivirus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9232 Schematic drawing for HDV ribozyme . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9633 Schematic drawing for SAM riboswitch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9834 Flowchart part I for alignment compilation using Infernal . . . . . . . . . . . . . . . . . 10035 Flowchart part II for alignment compilation using Infernal . . . . . . . . . . . . . . . . 10136 Arc-plot of Tryptophan operon leader made using the visualisation program R-chie . . . 10837 Arc-plot for the Levivirus alignment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10938 Arc-plot for the HDV ribozyme . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11039 Arc-plot of SAM riboswitch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11140 Odd cases of transient/alternative structures . . . . . . . . . . . . . . . . . . . . . . . . . 117xAcknowledgementsI would like to thank my supervisor, Dr. Irmtraud Meyer, for giving me the precious opportunity toenter the research field back in the undergraduate years, taking me as a graduate student, and providingme with all the support and guidance. She has always been a great mentor to me, guiding me withpatience and encouragement. She invested valuable time in training me and generously funded me toconferences to get inspired. No matter how busy her personal schedule is, she constantly pays closeattention to my project and is always there to discuss whenever I have questions. She also kindly set ahigh bar for me, clearly specifying what I need to achieve in each stage, so I can gain adequate academiccredentials to move forward on my research path. Moreover, she encourages me whenever I need to buildup my confidence, and happily accommodates my request when I wish to try various opportunities. Forall the past years, I’m very grateful for all the help from Dr.Meyer!I would also like to thank my committee members, Dr. Joerg Gsponer and Dr. Nobuhiko Tokuriki,for their valuable suggestions and kind support. They are very approachable and gave me valuable adviceon both research and career plan. I would like to thank all the members of the Meyer group as well fortheir advice and contribution. I would also like to thank Adi Steif and Jeff Proctor particularly for theircontribution to the project detailed in Chapter 2. In addition, I’m grateful for the continuous supportfrom my parents all the time.xi1 Introduction1.1 The Functional Roles Of RNA Sequences1.1.1 The Central DogmaIn the central dogma, the inherited information in the genome is amplified in a cascade whereby DNAis transcribed onto RNA transcripts and RNA transcripts are subsequently translated into proteins.That is, RNA is merely considered as a mediator conveying the inherited information from DNA toprotein. This dogma is obsolete because it now transpires that RNA assumes many regulatory rolesconcerning gene expression in the physiological process [1]. For instance, some RNA transcripts evenpossess enzymatic properties or participate in the post-transcriptional processing, such as splicing andlocalization [1]. Such expansion in the functional roles are usually achieved by the ability of RNA to foldinto a defined secondary and tertiary structure [1].1.1.2 Transcriptional Pervasiveness Of Noncoding RNAPervasive transcription In mammals, the protein-coding genes only comprise approximately 1.2%of the genome [2]. However, 5%-20% of segments from the human genome is estimated to be underevolutionary constraint by comparative analysis against the mouse genome [3–5]. These evidences suggestthat there are other types of evolutionary selection imposed on the genome. Despite not coding codons,are these conserved regions functional? If so, how do they deliver their function while their roles are notmanifested in protein products?Moreover, transcripts pertaining to these untranslated regions have been captured – “pervasive tran-scription” is empirically observed on most of the genome (63% in mouse genome via cDNA sequencing,or 74.7%-93% in human genome via RNA-seq and whole-chromosome microarrray) [2,6–8]. More specif-ically, these studies discover more than 30K noncoding transcripts in mouse genome [2, 6, 7], and morethan 40K novel transcriptional start sites in human genome [3, 9]. In comparison, even a conservativeestimation utilizing the processed EST data – mostly derived from polyadenylated RNA, albeit that non-polyadenylated RNA occupies more than 40% of the ribosomal-RNA-free pool – yields a 57.09% coverageof the human genome [2]. The current known genome annotation is therefore only the tip of the iceberg.Noncoding RNA is important The “pervasive transcription” and the ensuing studies reveal theimportance of such noncoding RNA transcripts. Firstly, it indicates the complexity of an organism: theproportional ratio of noncoding RNA related to protein-coding RNA reflects the biological complexity oforganisms [10,11]. For example, intron occupies more than one third of the human genome whereas it isgreatly reduced in Takifugu [2]. Moreover, a large composition of non-coding RNA can serve as a versatilerepertoire for future evolutionary opportunities, with sufficient plasticity to evolve into functional featuresfor the cellular regulatory or immune system [10].11.1.3 The Obsolete Central Dogma Does Not Contain The Noncoding RNAMany regulatory RNA molecules, modules and networks need to be inserted into this central dogmato expand it both vertically and horizontally.Long noncoding RNA (lncRNA) and other elements that are not included in the extantcentral dogma The central dogma misses many noncoding RNA elements that are essential to theregulation of gene expression. Moreover, such elements have versatile forms, such as the emergent dis-coveries of RNA functioning as self-contained ribozyme. For another instance, lncRNA also has diverseregulatory roles – e.g. promoting the nucleation of nucleus structure, sensor, guiding, scaffold, which aremediated by structural motif – spanning throughout each stage of the cascade model for gene expression,i.e. epigenetic, transcription, and translation [12]. Mounting experimental evidence shows that lncRNA issynthesized in the transcription stage, along with the protein-coding mRNA [12,13]. Resembling mRNA,lncRNA is also capped at the 5’ end and mostly polyadenylated at the 3’ end, which undergoes splicingas well [12]; in addition to that conventional type, RNA polymerase II/III, or the biogenesis machineryfor small nucleolar RNA (snoRNA) can generate lncRNA without a poly(A) tail [14, 15] or bound bysnoRNA [12,16].Other than the aforementioned caveats, the border between mRNA and these elements could be am-biguous. As an example, lncRNA is usually distinguished from mRNA by its absence of an ORF [12,17];in this case, lncRNA is not subject to the evolutionary constraint that imposes different substitution rateon the codon bases in ORF [12]. In fact, some RNA is equipped with dual identities [12, 17] wherebylncRNA and mRNA dynamically interchange their function along the road of evolution; e.g. Xist geneis derived from a gene that used to encode protein [12]. Moreover, transcriptional complexity arises dueto alternative splicing, resulting in the lncRNA and mRNA regions being juxtaposed together [12]. Ad-ditionally, these RNA classes can be interconverted. For instance, lncRNA can be post-transcriptionallyprocessed into small RNAs [8,12]. This further complicates the extant central dogma.Interactions among noncoding RNA are not depicted in the extant central dogma Theobsolete central dogma does not capture the interaction between these different classes (or within thesame class) of noncoding RNA. For instance, lncRNA can function as sensor to interact with otherlncRNA, microRNA (or miRNA, about 21-24 nucleotides long), or mRNA to gauge their expressionlevel [12]. Such important role played by RNA-RNA interaction triggers mass interest in identifyingfunctional ncRNA, such as miRNAs and their targets in the past two decades [10].The function of RNA interaction also permeates all three stages of the central dogma. RNA canrepress the gene expression by directly operating on the transcript level via RNA-mediated gene silencing:lncRNA or miRNA can base pair with their target mRNA, which is subsequently recognized by proteinfor further modification or degradation [10, 12, 18]. Alternatively, RNA can repress the gene expressionby operating on the epigenetic level by affecting DNA methylation: lncRNA can recruit proteins, bycoupling with other cellular machinery, to modify the epigenetic layer and alter local chromatin structure,2modulating the gene expression of adjacent regions, either in cis (e.g. Airn) or in trans (e.g. HOTAIR) [10,12,18,19].Certain dimensions for the RNA interaction are not included in the central dogma, such as the statusof the complementary strand. Noncoding RNA genes are prevalently observed to be transcribed on thecomplementary strand of a protein-coding gene in a sense-antisense manner, which happens for more than50% in mammalian protein-coding genome [6], e.g. more than 66% in mouse genome [2, 7]. Moreover,even within one protein-coding gene, noncoding RNA gene could be embedded in the intron or in thereverse direction [12,20]. Such overlapping transcription – on both or the same strand(s) – is previouslyoverlooked in terms of magnitude and prevalence [2, 7]. Hence, the central dogma is further entangledafter introducing all the interaction and the new dimensions above.Regulatory networks of RNAs As discussed above, the elements and interactions that the currentcentral dogma misses are mostly associated with RNA. Indeed, it is envisioned that RNA can construct aself-contained regulatory network and also serve as a signal carrier traversing different regulatory modulesin this network [8, 12]. Consistent with this vision, lncRNA can harbour various domains that eachspecifically senses a distinct signal, which can subsequently be centralized and processed in a mechanismsimilar to an logical gate employing Boolean-function [12,21].Given the observation that lncRNA distribution is highly restricted to specific tissue with accompaniedhigh evolution rate, a “scale-free topology” is proposed wherein such diverse lncRNAs and their residingregulatory networks are connected via highly conserved proteins that are capable of binding multipletargets [12, 22]. Such design delivers stability to the global network even when certain low-connectivitynodes – e.g. the less conserved and highly tissue-specific lncRNA – are altered, as evidenced in lncRNAknockout experiments [12, 22]. Thus, we can image that once the central dogma is updated, differentstages and elements will be closely connected through many entangled interactions.1.1.4 RNA Has Various Functional Roles In OrganismsLow expression of noncoding RNA As introduced in Section 1.1.3, the so-called “junk region” (i.e.intronic, intergenic region) can still code for functional noncoding RNAs. For instance, DNA sequenc-ing and genome-wide microarray reveal that lncRNA is abundant (reaching 4,662–15,512 transcripts percatalog, compared to 25,000 protein-coding genes in human), yet diverse with little inter-catalog redun-dancy [12, 23]. Such vast amount of noncoding RNA transcripts can engage in the regulation of thehuman genome, and some of them function via interacting with other regulators [3]. Such function ismainly mediated by structural feature, which condones variation in the primary sequence as long as thecanonical base-pairs are maintained via compensatory mutation [3].Despite their importance and abundance, the expression level of noncoding RNA is elusive especiallycontrasted with mRNA, contributing to their difficulty for detection [12,23]. For instance, given averagelyonly 0.3 transcript/cell, HOTTIP RNA delivers critical regulatory function on the epigenetic level togovern the activation of some HOXA genes [2, 24]. Mercer & Mattick (2013) suggest that such low3expression of these noncoding RNA is perhaps compensated by their functional redundancy [25](e.g. 1miRNA can target many mRNA agonist due to imperfect complementarity), or low requirement on thetarget, or incomplete repression preferred by the regulatory system [12]. Furthermore, the RNA-inducingsilencing complex (RISC) can be recycled to process mRNA for many times; the short RNAs can also beamplified by RNA-dependent RNA polymerase (RdRP) to target more mRNAs [26].The remaining of this section covers some typical but not exhaustive examples wherein organisms useRNA as regulatory means:Functional roles of RNA sequences in viruses RNA virus does not have the privilege of a tran-scription layer to regulate gene expression. Nevertheless, it is compensated by the ability of RNA to carryout multiple functions, such as initiating the -1 ribosomal frameshifting in HIV via structural mRNAfeatures (pseudoknot) [27], or functioning as enzyme – with a defined secondary structure – as exhib-ited in HDV ribozyme [28]. Even beyond the ribozyme regions, some viral RNAs are highly enrichedwith structural features [1, 29]. For instance, HIV-1 RNA genome has its 5’ and 3’ UTR enriched withmore structural features than the coding region; moreover, a periodic switch between structural andnonstructural region is observed in the coding region wherein the structures reside in the inter-proteinlinker zone [1,29]. This observation is suggestive of a regulatory role for the structured RNA in elicitingribosome pausing to facilitate correct folding of protein domains [1, 29].Functional roles of RNA sequences in prokaryotes In addition to maintain plasmid replication,RNA also plays important roles in the gene expression of prokaryotes, such as in regulating transcription(e.g. 6S RNA is used by E. coli to cope with stress in stationary phase) and translation (e.g. riboswitchand operon attenuation) [1,26]. One well-known example for regulating translation initiation is bacteriathermometer, which is essentially a structural feature embedded in the 5’ UTR [1, 30]. Normally, thethermometer is in the folded conformation, gate-keeping the ribosomal binding site (RBS); it respondsto destructive higher temperature by unfolding to uncover the RBS, so the downstream genes can beexpressed to help the cell cope with high temperature [1, 30].Moreover, bacterial small RNA (sRNA) – a RNA sequence of 80-110 nucleotides that are directlytranscribed from sRNA genes, and could encode small peptides – can degrade or inactivate target mRNAvia imperfect complementary base-pairing (aided by Hfq protein); this is achieved by binding the SDsequence or the start codon of the mRNA with a conserved domain on the sRNA [25, 26]. Some sRNAcan even activate the gene translation by sequestering a region in mRNA that blocks RBS [26]. Suchregulatory RNAs are diverse and specific to the niche of the organism [25].Functional roles of RNA sequences in eukaryotes In eukaryotes, noncoding RNAs have lengthvarying from 20 to 100,000 nucleotides, and have versatile forms, such as intron (which can organize nu-cleosome and chromatin), ribozyme, miRNA, small interfering RNAs (siRNAs), PIWI-interacting RNAs(piRNAs) and lncRNA [31–33]. Their diverse functions include but are not limited to:(i) epigenetics and transcriptional regulation:4They can serve as scaffold to recruit protein complex to promoter to suppress local gene expression,e.g. intergenic noncoding RNA [32]. This is also used in dosage compensation, such as Xist which isparadoxically transcribed on the X-chromosome to be inactivated and then recruits histone-modifyingcomplex [26]. Similarly in imprinting, ncRNA is expressed around the clustered imprinted genes to silencethem, such as Igf2/H19 (competing for enhancer), Kcnq1ot1 or Air (modifying local chromatin) [32].Moreover, antisense transcripts can negatively regulate the transcription of the overlapped genes, or formsense-antisense pairs to alter its local chromatin status via RNA interference (RNAi) [32]. That is, RNAi– comprised of short RNAs, i.e. miRNA, siRNA, and piRNAs, which couple with the Argonaute proteinsto form RISC, and generate Guide RNA for target specificity – can recruit chromatin-modifying proteinto modify the histones around the promoter so as to modulate its activity [26]. Unlike their prokaryoticcounterparts sRNA, siRNA/miRNA only has 21-23 nucleotides, and piRNA has 26-33 nucleotides [26].(ii) post-transcriptional processing:They are involved in alternative splicing, RNA editing, modification (e.g. snoRNAs facilitates the chemicalmodification in rRNA), localization and degradation [32]. Antisense noncoding RNA can also occlude thefunctional domain of the sense transcript via sense-antisense pairing to block its access to other moleculesthat are required for post-transcriptional processing/transportation, degrading, or translation [32].(iii) translational regulationFor instance of in cis regulation, thermometer-like structures are also found in the 5’ UTR of S. cerevisiae,enclosing the RBS [1]. For in trans regulation, RISC (with miRNA, siRNA) can either degrade the mRNAor block the translation machinery to access the target mRNA; whether to degrade or block is determinedby the degree of sequence complementarity between the short RNAs and the mRNAs, and the specificArgonaute protein incorporated in RISC [26]. They are important for regulating organism development,providing protection against virus/transposons/repetitive sequence to ensure genome integrity [26, 32].For instance, mRNA is related to many human disease and cancer [10,26].1.2 RNA Secondary Structure And Structure FormationAs shown above, the majority of those versatile RNA functions are mediated by the virtue of RNAto assume structures [3]. Unlike protein, the function of a RNA molecule is primarily determined by itssecondary structure [3, 34].1.2.1 Three Layers Of RNA StructureThe structural hierarchy of RNA has the following 3 levels: primary sequence, secondary structure,and tertiary structure.Primary Sequence The primary sequence of RNA is made of four standard bases – A (adenine), C(cytosine), G (guanine), U (uracil). In organisms with DNA genome, the primary sequence is the RNAtranscript synthesized from the complementary DNA template. In viruses with RNA genome, the genomeitself can serve as RNA transcript for translation and replication. However, the in vivo sequence may5be different from the one directly transcribed from DNA due to post-transcriptional modification, e.g.methylation, splicing [34]. For instance, in vivo chemical modification can yield greater than 100 differentnucleotides, which has potential ability to influence the sequence function [12, 35]. Given the prevalenceof chemical modification as reported by high-throughput sequencing [12, 36], combinatorial approachesinvolving sequencing and mass spectrometry are required to identify the true RNA primary sequence [34].Secondary Structure The RNA secondary structure is essentially a collection of canonical base-pairs:either Watson-Crick, or Wobble base-pairs, with comparable thermodynamic stability [34]. It is formedinside a RNA molecule, which illustrates the 2D planar layout of this RNA polymer chain in terms ofwhich base pairs with which base. The secondary structure could be local or global.The molecular function of RNA is mainly manifested via the secondary structure. Hence, if its molec-ular function is vital for the organisms’ survival, the secondary structure tends to be selected amongrelated species in a purifying process, namely evolutionary conservation [3]. Such conservation is notrestricted to the entire primary sequence; instead, it’s the base-pairing regions being required to be main-tained [3]. Given the faster base substitution rate of RNA compared to DNA or protein, compensatorymutations accumulated in the base-paring regions – not disrupting the base-pairing potential – are calledcovariation [3].In this thesis, unless exceptionally explained, structure refers to secondary structure.Tertiary Structure Built upon the previous 2 structural layers, tertiary structure entails interactionformed among various elements of secondary structure, and a schematic representation of the secondarystructure can be modified by including extra connection among nucleotides far apart to depict the tertiaryinteraction [34]. The tertiary interaction is the primary driving force for a RNA molecule to assume thefunctional global fold [34]. Due to such critical importance, nucleotides involved in a tertiary interactionare highly conserved, e.g. tRNAphe [34].1.2.2 RNA Secondary Structure Has Various Functional RolesMany functions of the noncoding RNA introduced in Section 1.1.4 are mediated by the structuredregion of RNA. For instance, such structured regions can constitute more than 13.6% of the humangenome as predicted by Smith et. al (2013) even under stringent cutoff [3]. It is the secondary structureof RNA, rather than its primary sequence, that determines the function of a RNA molecule [12]. Thus, theRNA structures also serve as evolutionary markers – a more sensitive marker than the primary sequencesimilarity – for reconstructing the phylogenetic linkage among homologous sequences according to theirstructural homology [10].RNA secondary structure associated with gene expression [1](i) Efficiency of transcription or translation initiation:6Structure in transcripts is prevalently leveraged by organisms to control the initiation – a rate-limitingstep for gene expression – or the termination of transcription or translation, such as by riboswitch andattenuation [1]. In general, there is a negative correlation between the amount of structures embedded inthe initial segment of a transcript and the corresponding protein yield, which is observed in both naturallyoccurring and synthetic transcripts [1,37]. For instance, the start codon region (or the entire 5’/3’ UTR inSaccharomyces cerevisiae and Arabidopsis thaliana) of most eukaryotic transcripts is less involved in RNAstructure, presumably to facilitate translation initiation [1]. Especially in certain prokaryotic mRNAswithout SD sequence, the start codon and its vicinity are found to be unstructured to accentuate itselfagainst the structured internal start codon, in order to compete for ribosome binding [1,38]. Such negativecorrelation is also consistent with the selection of codon nucleotides observed in the initial segment ofbacterial transcripts: the first several codons favour AU-rich codon which forms less stable base-pairsthan GC, thus avoiding hindering the translation initiation [1, 39]. Similar pattern is found in humanhaplotype study of COMT gene: single-nucleotide polymorphism stabilizing mRNA structure results inless protein yield [40].(ii)Post-transcriptional processing: splicing, transportationRNA secondary structure is involved in the alternative splicing of eukaryotes [41]. For instance, stablestructural features are detected at both 5’ and 3’ splicing sites of A. thaliana, whereas opposite trendis found in human [1]. Moreover, it is demonstrated that, in the mammalian alternative splicing, the 5’splicing site is more enriched in local structure, and the weak 3’ splicing site is more related to long-rangeRNA interaction [41]. RNA structure is suggested to affect alternative splicing either (1) directly, wherethe structure purposely exposes or blocks a functional domain (e.g. splicing site, branch point) on thepre-mRNA, by placing the domain in a single-stranded region, or hairpin, respectively; (2) indirectly,where the structural change can loop out the intron in between or bring together two distant splicingsites [41].After transcription, the transcripts are transported to various cellular compartments for translation orfurther modification. The information for cellular localization is encoded in either a “zipcode” sequenceor a characteristic structural feature, which is presented to the recognition protein [1].(iii)Ribosomal PausingEmergent evidences suggest an association between ribosomal pausing and secondary structure inthe transcript: e.g. a 3-nucleotide periodicity – represent the tendency of each nucleotide being part ofa secondary structure – is observed in the coding region across 3,000 transcripts of S. cerevisiae, withamplitude in accordance with the density of ribosome occupancy [1]. Moreover, structural features inthe coding region result in higher protein production [1], which leads to hypothesis that these structuralfeatures modulate the translation speed, with interleaved pausing, to assist protein co-translational foldingso as to produce correctly folded protein in the functional conformation [1].RNA secondary structure associated with regulatory RNAs On one hand, the functional fea-tures of regulatory lncRNA mainly depend on its structure (e.g. tRNA, ribozyme), and it is necessary to7know its structure in order to characterize its function [12]. On the other hand, for the short RNAs inRNAi, RNA secondary structure plays essential roles for the biogenesis of miRNA and the mechanismby which they interact with their target mRNA. Firstly, miRNA is generated from a double-strandedmiRNA precursor hairpin that undergoes two cleaving process: Drosha cleaves pri-miRNA to producepre-miRNA which is subsequently cleaved by Dicer to produce mature miRNA [1,26]. Regardless of thesequence, both processes require the substrate RNA to be presented in a specific secondary structure, inorder to be recognized by the enzymes [26]. Secondly, miRNA represses the gene expression of its targetmRNA by imperfect pairing, forming a paired region, in the middle of which the RISC exosome cleavesthe target mRNA [26]. However, in both prokaryote and eukaryotes, the targeted mRNAs can evade thedigestion by RISC via folding their targeted 3’ UTR into secondary structure [1].RNA secondary structure associated with RNA-binding protein (RBP) or DNA bindingRNA secondary structure is important for the interaction between RNA and RBP. For instance, ADARprotein need to recognize the double stranded region of the transcript in order to initiate the editing [32]This brings insight into decoding the regulatory networks of gene expression since a large number ofproteins are found to bind RNA, the function of which spans from degradation, localization to trans-lation [12]. X-ray crystallography and NMR spectroscopy reveal that the interface of such RBP-RNAbinding requires highly structural RNA region; thus, the protruding structures (i.e. protein structureor single-stranded RNA nucleotides) and the concave surface (i.e. binding pocket in protein beta-sheet,groove in RNA double-stranded helix) can fittingly lock with each other [12]. Such structural fittingstrategy allows RBP to come up with sufficient amount of specificity to address the overwhelming di-versity of RNA sequence and structures. To achieve so, RBP subtly combines distinct modular domains– each specifically targets a characteristic RNA motif – and therefore collectively assembles the desiredspecificity for the whole RNA molecule [12]. On the other hand, from the perspective of RNA, for in-stance, lncRNA can thus serve as the scaffold to bring multiple proteins together via distinct structuralfeatures. In addition, RNA also binds DNA by folding its domain into a structurally complementarypocket to encompass the DNA [12].Half life of RNA transcript associated with secondary structure The GC content of a transcriptis positively correlated with the half-life of the RNA transcript and it is perhaps due to the propensity ofGC-rich sequence to render itself in secondary structure; such stability is desired for functional role [3].However, the folding energy of mRNA is not necessarily correlated with its half-life [1].1.2.3 Structural Building Blocks Of RNA Secondary StructureCanonical Base Pairs RNA (e.g. lncRNA) prefers to render itself in a secondary structure via hy-drogen bonding between faces – Watson-Crick, Hoogsteen, or ribose edge – on a pair of the constituentnucleotides, in order to achieve thermodynamic stabilization [42]. More specifically, a hydrogen bond isnot a covalent bond, and it is formed between a purine (A, G) and a pyrimidine(C, U), giving rise to a8base-pair. In RNA, the canonical (or consensus) base-pairs refer to AU, CG, GU. Among them, AU andCG are the Watson-Crick base-pairs, and GU is the Wobble base-pair. Both AU and GU hold 2 hydro-gen bonds, and CG holds 3 hydrogen bonds. Such Watson-Crick pairing geometry confers isostericityon all canonical Watson-Crick base-pairs (AU, UA, CG, GC), which can substitute for each other whilekeeping the 3D helix structure intact; this accounts for covariation and results in the A-form RNA doublehelix [42].In addition to hydrogen bonds, base stacking (see Section 1.2.4) further stabilizes the secondarystructure, contributing to the inter-nucleotide interaction as well [12, 42]. Together these two types ofinteraction shape the folding process of RNA – hydrogen bonding guides where the folding happens,whereas base stacking is the main contributor for promoting the RNA folding [42]Non-Watson-Crick base-pairs In addition to Watson-Crick base-pairs, there are 11 alternative in-teractions regarding which two faces are involved since both purine and pyrimidine offer 3 edges – i.e.Watson-Crick, Hoogsteen, and Sugar, additionally there are cis or trans options for the pairing bases –to form Hydrogen-bond [42]. This diversity of interaction, primarily the non-Watson-Crick ones, shapesthe tertiary structure (see Section 1.2.4) [42].Units of Secondary Structure As shown in Figure 1, the basic elements constructing complicatedsecondary structure are: single-stranded region, double-stranded hairpin stem (attributable to the signa-ture double helix), hairpin loop, bulge, internal loop, multi-branched loop, pseudoknots (strictly speaking,pseudoknots belong to tertiary interaction, see Section 1.2.4) [43]. The helices in such structural units ofsecondary structure are often interconnected by tertiary interaction, resulting in being coaxially stackedin parallel/orthogonal orientation [12,43].1.2.4 Elements Of RNA Tertiary InteractionFunctions of tertiary interaction The tertiary interaction – mostly formed by non-Watson-Crickinteraction – is built upon the preformed domains of secondary-structure, resulting in many structuralmotifs that are fairly independent of the surrounding sequences [12]. Moreover, the tertiary interactioncan deliver the equal effect, regardless of whether the interacting partner is provided in trans or in cis– experiments demonstrate that complementing a mutant ribozyme (an interacting domain is knockedout) with a domain in trans can restore the ribozyme activity [34, 45]. In addition to coordinatingthe inter-domain interaction within the same RNA molecule, tertiary interaction also enables the targetrecognition of RNA molecules, such as tRNA precursor by RNase P [34]. The tertiary interaction hasimportant functions especially when RNA takes on enzymatic role or poises as a scaffold for variousproteins: the complicated 3D structure of RNA, shaped by both secondary and tertiary interactions,allows its constituent molecules to be correctly positioned for the interacting metabolite/protein [34].9HairpinLoop (single stranded)Stem (double stranded)BulgeInternal loopMulti-branched loop PseudoknotFigure 1. The elements of RNA secondary structure. Hairpin, bulge, internal loop, multi-branchedloop and pseudoknot are presented here (drawn using VARNA [44]), with their bases numbered startingfrom the 5’ end. The hairpin is composed of a double-stranded stem region and a single-stranded loopregion. The bulge is similar to the hairpin, except that it has >1 unpaired base(s) in only one strand ofthe stem region. The internal loop has >1 unpaired and mismatched base(s) on both strands of thestem region. The multi-branched loop has >2 stems connected by a conjunction where coaxial stackingcan occur. In the pseudoknot, a tertiary interaction is formed between the loop region of the hairpinand the unpaired region on the 3’ end. The six bases involved in this interaction are shaded in grey.10Cations facilitate tertiary structure formation The structural folding of RNA is hierarchical –the secondary structure folds autonomously in prior to the formation of the tertiary interaction [12].The ensuing transition from purely secondary to tertiary structure requires metal cation binding forstabilization, e.g. the folding of Th-intron is facilitated by Mg2+ [34]. The positive charge of metal cationneutralizes the negative charge accumulated from the RNA backbones that are closely packed by tertiaryinteraction, thus stabilizing the tertiary structure [34].Examples of tertiary interaction Apart from stacking and pseudoknot explained below, other exam-ples include adenosine platform, 2’-hydroxy-mediated helical interactions, base triples, tetraloop, ribosezipper, and etc. [34]. Another typical example is loop-loop interaction – a helix produced by the basepairing between the complementary nucleotides in 2 hairpin loops [34].Base Stacking and Coaxial Stacking Base stacking occurs when the adjacent nucleotides havetheir pi-electron system of the aromatic rings overlaid with each other [34]. In addition to hydrogenbonds between base-pairs, base stacking is the main contributor to the thermodynamic stability of RNAstructure; indeed, stacking interaction is prevalently observed in the bases of many large RNAs, e.g. >90% of the bases in tRNA (compared to canonical base-pairs that account for 60%-70% of bases in RNAson average) [34, 42]. As a result of base stacking, coaxial stacking can be formed between helices at thenick: the neighbouring duplexes stack upon each other to generate a contiguous and coaxial double helixwherein the junction is stabilized by tertiary interaction and metal cations [34].Because of the dependency of base stacking strength on dinucleotide, the dinucleotide compositionneeds to be maintained on the same level when the bases of sequence (or columns of alignment) arerandomly reshuffled in RNA structure prediction programs to serve as negative control [10]. Otherwise,the prediction specificity tends to be wrongly raised [3].Pseudoknotted RNA structures For the purpose of forming and stabilizing global fold, RNAmolecule also employs pseudoknot for the backbone topology strategy [34]. As shown in Figure 1, apseudoknot arises when the loop region of a hairpin forms consecutive canonical base-pairs with a com-plementary sequence located beyond the stem of this hairpin on either end of the same strand. Suchknotted base-pairing, between the loop of the 1st hairpin and an adjacent unpaired sequence, creates a2nd hairpin; the stems of both hairpins are stabilized by coaxial stacking [34,46]. Moreover, the loops ofboth hairpins are embedded in the major/minor grooves of the coaxially stacked helices, where extensiveinteraction between the loop nucleotides and the bases of hairpin stems further stabilizes the pseudoknot;this entangled topology delivers such extensive stabilization that HDV ribozyme even does not needcations to stabilize its global fold [34,47].111.2.5 RNA Structural Formaion In VivoThe hierarchical folding There has been a debate regarding whether the hierarchical folding of RNAinto higher-order structure is a sequential or landscape process [34, 48, 49]. During hierarchical folding,RNA will have most of its native secondary structures autonomously formed as a prerequisite – unit byunit, with each modular unit of the secondary structure corresponding to a functional domain – beforethe assembly of tertiary interactions that drives the global fold [34]. Such hierarchical structural foldingis supported by the order of structural melting observed in thermal denaturation [34, 50]. Such foldinghappens on a time-scale of milliseconds, e.g. tRNA is capable of assembling its structure in around 100ms [34].On the other hand, the folding process is depicted as a landscape of RNA molecules folding alongparallel pathways, and the aforementioned sequential hierarchical folding only describes the most occupiedlineage inside this landscape (not necessarily leading to the native final conformation) [34,48,49]. Apartfrom this lineage, a lot other concomitant folding pathways are stuck in kinetic trap formed by certainnonnative base-pairing [34,51]. The nonnative base-pairing is also important for maintaining the foldinglandscape, and could become part of the new popular lineage upon change in experimental condition [34,51,52].Co-transcriptional folding pathway of a RNA transcript Similar to the “Levinthal paradox” forprotein folding, how does a RNA molecule distinguish the correct/optimal folding pathway from count-less possible structural conformations and folding trajectories, in order to fold into its native structurecorrectly and swiftly on time-scale of milliseconds [34]? The co-transcriptional folding observed bothexperimentally and statistically may enable such accurate and fast folding among all the alternativeroutes.We know by now that RNA structural formation starts immediately as the transcript is synthesizedfrom its DNA template by the RNA polymerase [53–55], as shown in Figure 2. This is in contrary to theassumption held by thermal denaturation – the structural formation starts after the transcript is fullytranscribed. Nevertheless, it is feasible since structural formation (on time-scale of milliseconds) occursmore quickly than transcription (20-200 nt/s depending on organism) [54, 56]. Hence, the formation ofthe RNA structural features proceeds from 5’ towards 3’ end along the elongating transcript, which incis imposes constraints on the availability of the corresponding sequence segments in forming a putativestructural feature. Indeed, structural reorganization has been experimentally supported – many mutually-exclusive structures occur during the elongation process of transcript [54]. The less stable structure formedearlier breaks down to make room for the ensuing more stable structure – overlapped with the earlier one– as transcript is elongated [54]. This implies a regulatory role for this dynamic nature of interchangingstructures during transcription. For instance, a 5’ hairpin formed preceding a long range interaction(LRI)can locally sequester the 5’ portion of the LRI and render other nonnative structures infeasible beforethe correct 3’ pairing portion is formed later on [57,58].In addition to the transcriptional directionality, many factors further complicate this co-transcriptional12Figure 2. The co-transcriptional folding of RNA. RNA Structure formation starts immediately in vivoas the RNA transcript emerges during transcription. Some structures only exist transiently (redbrackets), which will be replaced by the mutually exclusive final structure (blue brackets). Each pair ofbrackets represents a base-pair, and the dots represent the unpaired nucleotides in this sequence. Thesequence and dot-bracket structures are originally from a figure published by Geis et al. (2008) [70].folding process, such as transcriptional speed, site-specific pausing site, and trans-interaction with molecules(e.g.metabolite [59–61], protein [62], miRNA [63] and etc.). There have been ample experimental evidencedemonstrating that when such factors are altered the corresponding final structure changes, resultingin an inactive transcript [64–68]. Moreover, statistical evidence indicates that transient structural fea-tures deriving from co-transcriptional folding are encoded in the sequence to facilitate the proper foldingtrajectory leading to the functional structure whereas suppressing the competitive ones [69].1.3 Transient And Alternative RNA Structure FeaturesCo-transcriptional folding produces functional transient structures As described in Sec-tion 1.2.5, RNA secondary structure starts forming immediately as the RNA is transcribed. The structuralfeatures formed temporarily along the folding pathway could be either the components of the native finalfunctional structure or some transient structures that are mutually exclusive with the final structure;whether these structures will persist is subject to their half-life, stability, and competition with alter-native structures [54, 71]. In fact, some intermediate structures are essential in most RNA molecules,especially large RNAs, which must be incorporated into the simulation of folding pathway in order topredict the native final structure that is phylogenetically conserved [72–74]. Such simulation via geneticalgorithm [72,74] permits the co-transcriptional structural reorganization as observed earlier by Kramer& Mills (1981) [54], and not all the structural features in the final structure are part of the global Min-imum Free Energy (MFE) conformation [72]. The simulation well predicts the native final structure inspite of not considering the in trans binding of molecules [72], which demonstrates the guiding role offunctional transient structures in shaping the final structural product.Despite being short-lived, these transient structures could be functional. They can aid the correctformation of LRI, as seen in Bacterial RNase P Type A RNA and Bacterial SRP 4.5S RNA where13the transient structures are formed preceding a transcriptional pausing site to sequester the 5’ portionof a LRI until the cognate 3’ portion is synthesized [75]. Moreover, such transient structures can beemployed to regulate gene expression via either translational control as exemplified in Levivirus [76, 77],or transcriptional mechanism as exhibited by Tryptophan operon leader [78,79] and SAM riboswitch [80].Last but not the least, transient structures incorporating the 5’ flanking sequence are involved in adjustingthe self-cleavage activity of HDV ribozyme, CPEB3 ribozyme and group I intron [28,33].Taken together all these evidences, the transient structures could be crucial for directing the foldingpathway of the RNA molecule to the desired final conformation, which is required for its underlying func-tional role. The recent statistical evidence suggests that transient structures are evolutionarily conservedacross homologous sequences, which further implies their functional importance [81]. A thorough under-standing about how structures (i.e. both transient and final ones) emerge during transcription and therelative timing of their appearance will bring insight into decoding the repercussion brought by geneticvariation that causes structural alternation. Indeed, Mortimer et al. (2014) also calls for future researchfocusing on RNA structures in co-transcriptional folding process [1].Structural rearrangement produces alternative structures in response to ligand/metabolitebinding Riboswitch is a quintessential example showing how the structure of a RNA molecule can beinterconverted into another conformation – mutually exclusive with the former structure with competingstability – upon the binding of metabolites [82–84]. Another example is Tryptophan operon leader,where the binding of a stalled ribosome results in the formation of alternative structure in the leadersequence. Though non-coding RNA is able to assume a defined secondary structures with minimumfree energy [1, 85], RNA structures in vivo is found to be more dynamic; that is, they spend more timebeing less structured [1]. This suggests rearrangement among multiple structural ensembles (collectionof possible structures) in vivo [1]. In particular, riboSNitches, SNP causing structural change, has beeninvestigated in the context of structural ensembles, which reports that alternative structures triggered byriboSNitches result in the aberrant expression of the adjacent genes [86]. Thus, in contrary to the obsoleteone-sequence-one-structure dogma, one RNA sequence is able to take on multiple functional structuresto give play to its regulatory mechanism.1.4 Experimental Methods For Identifying RNA Structural Features1.4.1 The Importance Of Identifying RNA Secondary StructureIdentifying the structure of a RNA molecule is critical for extrapolating its function since its functioncould be assigned by finding the most closely-matching RNA species – in terms of structural homology –that is already annotated with known function [3,87]. Moreover, this structure also helps us to understandthe mechanism through which this RNA molecule manifests its function [34]. However, it could belabour-intensive to experimentally determine the structure, and computationally expensive to explorethe possible structures. For instance, Zuker and sankoff (1984) [88] reports that the folding space of a14sequence with n nucleotides is 1.8n. This is an intangible number to analyse especially for long RNAs,yet this folding space could be significantly reduced by experimental probing, comparative analysis of thestructures, and etc. [10]. Here, some typical experimental methods for RNA structure determination aresummarized below, but it is still important to combine their results – complementary to each other – toreach a consistent conclusion, in order to support the computationally-predicted structural model [10].1.4.2 Low-Throughput Method: Biochemical ProbingEnzymatic/Chemical reagents react with backbones or bases in the probing experimentsBiochemical probing is often used to validate the computationally predicted structure. The biochemicalprobing methods exam the RNA structure in solution rather than in crystallized form; these methodseither enzymatically cleave the exposed nucleotide in the RNA structure via ribonuclease (RNase) ofvarious specificity, or chemically modify the exposed target sites of specific nucleotide [34]. Such exposedsites could come from the ribose-phosphate backbone of nucleotides at single-stranded (ss) region (ex-cept RNase V1 for helical conformation of the backbone, i.e. in double-stranded (ds) region) [34, 89, 90]since nucleotides embedded in the base-pairs of structured region are inaccessible for the cleavage ormodification [1]. Some reagents could also directly target the elements on the Watson-Crick edge [42]to test whether this nucleotide is double-stranded (ds) [34, 89]. Generally speaking, if a nucleotide iscleaved by ss-specific RNase or modified by chemical reagents targeting Watson-Crick positions, it is amore convincing evidence to show that this nucleotide is not in a base-pair [89]. Otherwise, absence ofsensitivity to these reactions does not necessarily rule out its possibility of being in a base-pair; e.g. itcould be buried deep inside the structured RNA, or in a Non-Watson-Crick base-pair (see Section 1.2.3),or blocked by interaction with solvent [89].The enzymatically cleaved or chemically modified sites are detected through two approachesThe nucleotides got cleaved or modified could be detected on single-nucleotide resolution in the subsequentRT-PCR and gel electrophoresis [1, 34]. In both approaches, a shorter DNA fragment is generated,extending from a labelled (e.g. by 32P) end to the cleaved or modified nucleotide [89]; the labelled end isthe 5’/3’ end of the original full-length RNA in Approach 1, or the DNA primer in Approach 2 [34, 89].Approach 1 is common for probing short RNA sequences under 200 nucleotides [89]. The labelledRNA molecule (on 5’ or 3’ end) is allowed to fold before being cleaved by biochemical reagents, such asRNase [89]. The shorter RNA sequence, due to cleaving, is subsequently identified on denaturing gels,so its linear size can be estimated to locate the nucleotides where the cleavage occurs [89]. However, thisapproach has restriction on RNA size, and also requires purification and labelling of the RNA moleculein prior to the cleavage [89]. Moreover, refolding in vitro after labelling may not produce the biologicallyfunctioning molecule [89]. If the label is large, it may also interfere with the RNA folding.Approach 2 is more commonly used for longer RNA sequences, and no purification or labelling forthe RNA molecule is involved, e.g. suitable for complex of RNA-RBP [89]. Once the exposed nucleotideis modified/cleaved, it can be identified in the following sequencing or primer extension [34, 89] where15reverse transcriptase will disassociate at the modified/cleaved nucleotide [10]. This reverse transcriptionis initiated by labelled primers that are designed to target the RNA template approximately every n(e.g. n=200) nucleotides apart [91]. Running the RT-PCR reads in gel electrophoresis will reveal bandswith shorter length, which is subsequently used to locate the cleaved/modified site [10]. However, thisapproach is not feasible for RNA sequence with structural features in its very 3’ end; co-transcriptionalpausing by the reverse transcriptase, or termination due to in vivo RNA chemical modification also yieldfalse signals [89].Biochemical probing has some disadvantage in spite of their single-nucleotide resolutionFirstly, the biochemical probing experiment entails numerous independent reactions, so as to cumulativelymap the structural results of the entire sequence. Thus, multiple copies of the RNA molecules are requiredto distribute them individually to the reactions where each reaction contains one enzyme or chemicalreagent [89]. It is therefore laborious to prepare the samples. Secondly, these methods cannot distinguishprotein-bound region from double-stranded region. This hinders the result interpretation, especially forprobing of in vivo content or unprocessed cell lysate [89]. Thirdly, many reagents are infeasible (e.g. sizeor buffer requirement) to conduct in vivo mapping of the biologically functioning RNA structure insidethe cell [10]. Lastly, while the biochemical approach can probe the silhouette of the RNA structure, itcan not identify the pairing partner of the nucleotides interrogated [10].Examples of reagents used in enzymatic probing Enzymatic probing has lower resolution due tothe bulky size of RNase [89]. For instance, the common RNase includes [89]:(i) RNase A cleaves at the 3’ side of ss C and U(ii) RNase T1 cleaves at the 3’ side of G.(iii) RNase T2 universally cleaves at the 3’ side of ss nucleotide.(iii) In contrast, RNase V1 cleaves at ds nucleotides [90]Examples of reagents used in chemical probing Chemical probing has better resolution thanenzymatic probing given the smaller size of chemical reagent. The common chemical reagents include[89]:(i) Dimethyl sulfate (DMS) targets A (N-1) and C (N-3).(ii) Kethoxal (KE) targets G (N-1 and 2-NH2)(iii) l-cyclohexyl-3-(2-morpholinoethyl)carbodiimide metho-p-toluene sulfonate (CMCT) targets U (N-3)(i)-(iii) methylates their targeted site if these sites are not engaging in a Watson-Crick base-pair ortertiary interaction; that is, base-pair in non-Watson-Crick geometry could still evade the detection byDMS/KE/CMCT [1,89]. Among them, DMS is special in that it can permeate the cell membrane for invivo probing of the biologically functional structure of RNA [89,92].In addition, similar to the RNase, some chemical reagents cleave the ribose-phosphate backbone of ssregion as well, such as Pb2+ [93], hydrolysis [94].16More approaches for identifying secondary structure (i) Oligonucleotide hybridization: addshort oligonucleotides that are complementary to certain stretch of the RNA template. Thus, the ssregions are supposed to bind the oligonucleotides, and the corresponding change in the RNA function ismonitored [34,95].(ii) selective 2’-hydroxyl acylation and primer extension (SHAPE): in comparison to the biochemicalmethods which produce a discrete output (either modified/cleaved or not), SHAPE gives relative flexi-bility of each nucleotide via the reactivity of its ribose group [96]. Here, flexible region is inferred to bethe unpaired portion of the RNA molecule, whereas the rigid region is base-paired or participating intertiary interaction [10,96].Approaches available for detecting tertiary interaction The tertiary interaction in a RNAmolecule can be probed as well, such as by: (i) Free radicals (e.g. generated by FeII-EDTA, H2O2):it cleaves the ribose-phosphate backbone of RNA, regardless of whether this backbone resides in a ss ords region [97]. Thus, only tertiary interaction (or protein binding) insulates the RNA backbone from freeradical attack [34,97].(ii) Interference in the formation of tertiary interaction through mutation, or nucleotide substitution byanalog that impedes tertiary interaction: they can prevent the formation of functional tertiary interaction,and the resultant inactive RNA molecule could be isolated to locate the putative nucleotide positionscritical for tertiary interaction [34,98,99].1.4.3 High-Throughput MethodThe low-throughput biochemical probing methods introduced in Section 1.4.2 are not applicable to thelarge quantity and diversity of structured RNA encountered in whole-genome transcriptome studies [1,100]. Thereafter, the biochemical reagents are usually combined in a complementary manner (e.g. includeone RNase that cleaves ss and another one cleaving ds region to yield a more detailed structural mapping)to cleave/modify RNA sequences in parallel, the downstream product of which is sequenced via next-generation sequencing [1]. This gives rise to the state-of-art RNA structure-determination techniques,such as the ones utilizing enzymatic probing (e.g. PARS [100, 101], ds/ssRNA-seq [102]), and the onesintegrating with chemical probing (e.g. SHAPE-seq [103], DMS chemical probing [104]) [1]. Such high-throughput capacity can reveal the “genome-wide RNA structurome”, so their structural landscape androles played by single nucleotide variation in different gene structure could be assessed [100]. However,these methods can only assess how structured a RNA sequence is, and cannot identify the base-pairingpartners in the structured region due to the limitation of the biochemical probing.1.4.4 High-Resolution But Low-Throughput TechniquesNuclear magnetic resonance (NMR) NMR utilizes a magnetic field to gauge the electromagneticradiation absorbed and remitted by a selected group of nuclei in the RNA sample that are labelled byisotopes. The nuclei appropriate for NMR have odd mass number and dipole (i.e. spin number I = 1/2,17and thus spin state = +/- 1/2), such as 1H, 13C, 15N, 31P in RNA [105]. Such positively charged nucleispin and thus generate a magnetic field like “a small bar magnet”, with their dipoles aligned along theirrotational axis, and with their strength corresponding to the gyromagnetic ratio γ of the nuclei, peaking at1H [105]. These spinning nuclei can be oriented in the applied magnetic field B with their small magneticfield aligned with, or against B; the former orientation is the favoured lower energy state, whereas thelatter one is the higher energy state [105]. NMR is operated based on the resonance of these spinningnuclei upon absorption of energy – gaining energy allows the nuclei to flip (bidirectionally) between thelower and higher energy states [105]. The energy absorption is proportional to the characteristic γ ofthe selected nuclei, and is also affected by the chemical environment of the nuclei; therefore, the wavefrequency, in the radio frequency (RF) range, must be tuned to produce the suitable radiation energy, inorder to generate the resonance [105]. Such energy absorption induces a voltage, which is subsequentlydetected as free induction decay (FID) [105].By various NMR experimental set-up (e.g. 1D and 2D NMR), many restraining parameters can bederived to construct the RNA structure, such as J couplings, residual dipolar couplings, cross-correlatedrelaxation rates, and NOE contacts [106]. The NMR results for the resonances in the RNA moleculeare displayed in a NMR spectrum plot where the x-axis is the resonating RF. On such spectrum plot,the nuclei responsible for the resonance can be assigned according to the characteristic RF range for theenergy absorption by different nuclei; in the respective resonance range of each nucleus, the chemical shiftis used to identify the functional group where this nucleus is embedded in, the corresponding absorptionpeak reveals the number of resonating nuclei in this functional group, and the splitting pattern (i.e.J coupling) gives cue for the number of interacting nuclei. The chemical shift for the imino protonsof G and U are especially important In 1D NMR, where the G and U engaging in base-pairing areidentified with distinctive chemical shift depending on whether the base-pair is canonical or not, thusrevealing total number of base-pairs in the RNA structure [106]. In addition, the chemical shift can beused to characterize the RNA structure; for instance, 1H NMR chemical shifts are dependent on theirsurrounding sequence and secondary structures [107], and 31P NMR chemical shift can further revealuncanonical conformational change in the ribose-phosphate backbone [106]. However, RNA consists ofonly 4 bases and is in A-form helix, which results in limited dispersion of chemical shift [106]. Hence.2D NMR is required to address larger RNA molecule, which reveals more information about the nucleithat are closely positioned by bond (via COSY and HSQC) or by the global spatial orientation of theRNA structure (such as within 5 A˚ in NOESY) [106]. The 2D COSY and HSQC use the J couplingin heterogeneous or homogeneous nuclei (e.g. between 15N and 1H) to differentiate the canonical base-pairs [106]. The NOESY is based on the NOE contacts between the protons (e.g. amino or iminoprotons) that are brought closely together by the base-pairing nucleotides, so the resultant chemical shiftand the correlation between the participant nucleotides can be leveraged to distinguish GC, AU and GU;moreover, the NOE contacts between sequential base-pairs can be used to identify the stem region of theRNA structure [106].NMR can characterize both the local and global RNA structure in solution, which is more likely to18be the in vivo functional form. The structural dynamics of RNA can be probed by NMR as well, suchas the temporary RNA conformation [25]. Moreover, it can capture the subtle difference among manystructural motifs, e.g. hairpin and bulges [106]. With its ability to correlate chemical shift or cross-peakto the corresponding base-pair, NMR can identify both canonical and noncanonical base-pairing partnersin the structured region [10]. It also provides information on tertiary interactions, such as coaxial stacking(e.g. in pseudoknot), and interhelical organization [106]. In addition, NMR can map the interaction sitewith protein, ions, solvent, and other small ligands on the RNA sample [106]. However, NMR still hassome disadvantages: (i) large sample quantity is required (especially if more than proton NMR is used).The experimental condition – e.g. concentration of solvents and temperature– must be tuned to ensurethe functional folding of the RNA sample; (ii) multidimensional NMR is needed to characterize largeRNA. But 13C is much rarer than 1H, thus not sensitive; (iii) size limit for RNA sample is around 100nucleotides long, or even more stringent down to 50 if higher resolution is imposed [34,106].X-ray crystallography The range of X-ray wavelength used in X-ray crystallography – on the scaleof A˚ – coincides with the size of chemical bonds and atoms, providing atomic resolution for RNA, ineven small crystal [108]. In X-ray crystallography, the RNA, prepared in crystal form, is irradiated witha beam of monochromatic X-rays which is diffracted by the molecules positioned in arrays inside theRNA crystal [108]. As a result of the spatial periodicity of the RNA molecules inside a crystal, theX-rays, diffracted by the RNA molecules in different planes/arrays in the crystal, can be concentratedinto reflection spots on a 2D reflection which is mapped from the 3D RNA crystal [108]. The reflectiondots and the corresponding angle and intensities are recorded for this particular orientation of the crystalfacing the X-rays. Subsequently, the RNA crystal is rotated around a fixed axis, and the new 2D imageof reflection is recorded upon X-ray irradiation. This process is repeated for about 180◦, after which thetotal collection of 2D reflection images are combined mathematically to reconstruct a 3D model for theRNA molecule [108].Many structured RNAs have been solved by this approach, such as yeast tRNAPhe (which is the firstRNA that is solved by this approach), group I intron ribozyme, and hammerhead ribozyme [108]. Thesolved structure provides high-resolution information on both secondary and tertiary structure for theRNA sample, including both canonical and noncanonical base-pairs [108]. The structural dynamics of theRNA molecule could be additionally studied by the occupancy parameters of atoms [108]. Furthermore,many complexes of RNA-protein – e.g. the internal loops of RNAs can bind protein – have been solvedby X-ray crystallography since the protein may shape the RNA conformation significantly which deviatesfrom the conformation free of protein [108]. This is critical to study the functional role of RNA invivo [108]. In addition, the solved RNA sample can include ions, and water molecules that are boundat various sites, e.g. backbone, major/minor grooves, or base-pairs [108]. Moreover, where and howthe individual RNA molecules are interconnected – e.g. interhelical stacking and extensive H-bonds –inside the RNA crystal could bring insights into the intermolecular interaction of RNA in biologicalcontext [108].19In contrast to NMR, X-ray crystallography can handle large RNA molecules [108]. However, thepreparation of large quantity of RNA, with satisfying purity and in crystal form, is the main road-blockerfor many RNAs [108]. The production of large quantity of RNA is achieved by chemical and enzymaticsynthesis [108]. The chemical synthesis works for sequences under 25 nucleotides, whereas the enzymaticsynthesis via T7-RNA polymerase works for the longer ones [108]. The latter approach is complicatedby the random insertion of nucleotides at both 5’ and 3’ end by T7-RNA polymerase, which is solved byfusing the RNA of interest with other cis- or trans- cleaved ribozyme [108]. The synthesized productsneed to be further purified, either by denaturing approach or native method, subject to the quantityand the characteristics of the RNA [108]. The subsequent crystallization step is difficult for RNA dueto the flexibility of RNA especially in the absence of protein [108]. The concentration of RNA andsalt during crystallization is also subtle, which can dictate whether double helices or stem-loops formin the crystal [108]. Moreover, it still requires complementary experiments sometimes, such as SHAPE,to characterize the RNA structure [1]. The RNA conformation solved from the crystal form may notcorrespond to the biological form; in addition, X-ray crystallography derives its results from static samplelike taking a snapshot, thus not accounting for the dynamic structural ensembles observed in vivo (seeSection 1.3).cryo-electron microscopy (cryo-EM) The cryo-EM, essentially a version of transmission electronmicroscopy, exams the RNA sample at cryogenic temperature; the cryo-electron tomography (CET) canfurther construct a 3D model of the RNA sample [109]. It can characterize the RNA structure in aresolution on sub-nanometre scale (3-12 A˚), which bridges between the atomic resolution provided byX-ray crystallography and the cellular resolution provided by conventional confocal microscopy [109,110].Resembling X-ray crystallography, the sample is required to be homogeneous, which should be puri-fied in prior to be inspected by cryo-EM [109]. The aqueous sample, containing 2.5 to 15 µg of the RNAmolecules, is subsequently coated on a porous grid on the EM, and undergoes flash-freezing, producing thefrozen, hydrated and functional state – like a layer of “vitreous ice” in appearance – of the sample [109].In order to minimize the damage to the biologically relevant form of the RNA molecule, the electron doseis usually lower than 25e-/A˚2; such dose of electrons transmit through the homogeneous RNA moleculesthat are suspended in the vitreous ice on the grid, projecting the molecular silhouette and the interiorcomponents of the RNA molecule onto a 2D image [109]. The RNA molecules are presented at diverseperspectives in the vitreous ice layer, which are reflected in the projected 2D images, accordingly [109].Among them, the images of RNA molecules with similar perspectives are aligned and averaged to com-pensate for the low dose of electrons used in cryo-EM; the various and complementary perspectives canbe collected to reconstruct the 3D structure of the RNA molecule [109].In contrast to X-ray crystallography and NMR, the sample RNA can be characterized by cryo-EM in itsin vivo form while being frozen-hydrated, and it thus does not need the isotope labelling or crystallizationstep required by the other two methods, nor is staining or fixation required; this helps to preserve theRNA in the in vivo functional conformation [109]. In addition, the low dose of incident electrons and low20temperature minimize the damage to the biological sample in comparison to the other methods [109].Moreover, the large quantity of RNA molecules in the vitreous ice can give cues for structural dynamics;concomitantly, the presence of various conformation, however, can result in heterogeneous sample thatimpedes the observation [109]. The disadvantage of cryo-EM also includes lower resolution, and difficultiesin charactering small molecule [109]1.5 Computational Methods For RNA Structure PredictionAs it is labour-intensive to experimentally characterize the structure of RNA, computational predictionhas become an efficient approach to obtain the RNA structure. In general, these computational predictionmethods can be divided into non-comparative and comparative methods.1.5.1 Non-Comparative Methods: Minimum Free Energy (MFE)-BasedThe MFE-based method – for example, Mfold [111], RNAfold [112,113] – seeks for the secondarystructure with the lowest free energy among all structures that could possibly be formed by one RNAmolecule [34]. Such MFE structure thus generates the largest decrease in free energy – i.e. the lowest ∆G◦– for the RNA comparing its structured and unstructured conformation [10]. The optimum MFE structurecan be predicted via dynamic programming algorithm if it is pseudoknot-free, such as by Nussinovalgorithm [114] which recursively searches for the conformation with the largest number of base-pairsby treating all canonical base-pairs equally. Thereafter, the Zuker-Stiegler algorithm [115] is developed,which can be employed to predict the MFE structure by calculating the Gibbs free energy which includesterms to account for both enthalpy and entropy. Similarly, the Zuker-Stiegler algorithm utilizes dynamicprogramming to iterate through all possible (pseudoknot-free) secondary structures of an input sequenceby recursively calculating and incorporating the MFE structure and energy of subsequences. The energy ofthe structure pertaining to the subsequence is made feasible by the approximation of ∆G◦ from its nearestneighbour model with experimentally derived parameters [10]. The optimal structure corresponding tothe minimum free energy could be retrieved accordingly through a backtracking algorithm.The thermodynamic constraint is important, especially for identifying ncRNAs which generally favourstructures of more negative free energy [10]. However, the MFE conformation is not always the biologi-cally functional one because there could be other factors existing in vivo to stabilize certain alternativeconformation of higher free energy [89]. For instance, RNAs can bind proteins or other small ligands toassume a functional structure which is otherwise infeasible given the sequence alone. For this reason,thermodynamic-based prediction is reported to work best for sequences under 700 nucleotides [10] whereaslonger RNA sequences can assume some suboptimal conformation instead of the MFE conformation [72].Moreover, the complete function of a RNA molecule may involve multiple structures which inter-convertinto each other so as to transfer the signal detected from the cellular environment to the action-launchingdomain on the structured RNA. Hence, identifying the MFE structure alone is not enough to interpretthe complete function. To address this problem, suboptimal folding is also introduced by Wuchty et21al. (1999) [116]. Lastly, the dynamic programming makes the prediction of pseudoknotted structureinfeasible due to the time complexity.1.5.2 Non-Comparative Methods: Probability-Based MethodsIn addition to MFE-based methods, other methods that are based on partition function and stochasticsampling are introduced to address the probability of the predicted structure, such as Sfold [117–119]and Contrafold [120]. Both approaches of non-comparative methods hold the assumption that thesystem is at equilibrium.The partition function Q (or denoted as Z(T ) in some publications) – derived from the Gibbs freeenergy of the structural ensemble, which are weighed and summed – is considered in this approach: Q =∑i=ni=1 Ki, whereKi iterates through the equilibrium constants of all possible secondary structures [10,121].Here, the formula for equilibrium constant Ki for structure i is calculated as: Ki = [Structure i in foldedconformation]/[Structure i in unfolded conformation] = e−∆G◦/RT , where [ ] = the concentration ofthe RNA in folded/unfolded conformation, R = gas constant, T = Temp (Kelvins) [10]. Thus, thepartition function Q describes the structural ensemble at equilibrium. Ki/Q then represents the fractionof a particular structure in the structural ensemble at equilibrium, which essentially corresponds to theprobability of observing such structure [10]. According to the Ki/Q, the MFE structure is elusive inthis ensemble, but its constituent base-pairs are largely shared among structures of relatively low freeenergy [10]. Hence, the base-pairs can be quantitatively annotated in the predicted structure in terms oftheir base-pairing probability, based on which a structure composed of base-pairs with high probabilitycan be drawn by (i) imposing a high threshold to avoid ambiguous assignment of the pairing partner, (ii)adopting “maximum expected accuracy structure”, as seen in Contrafold [120] [10].Furthermore, the Ki/Q probability of encountering a structure and the concomitant Boltzmann en-semble of structures for one input RNA are utilized by Sfold, as proposed by Ding and Lawrence(2003) [117], to stochastically sample the possible conformation at equilibrium; therefore, a list of struc-tures with low free energy can be returned with probability matching the Boltzmann distribution of thestructures. The sampled structures can be further grouped into clusters to reveal the centroid structurefor the cluster [10,119].1.5.3 Comparative MethodsThe comparative approach takes Multiple Sequence Alignment (MSA) of the homologous sequencesas input, and the structure prediction is based on the assumption that functional structure should be con-served across related species [10]. Programs in this category include Pfold [122], RNA-Decoder [123]and RNAalifold [124].Compared to the MFE approach, the comparative approach is more accurate by searching for thecommon structure that is conserved among phylogenetically related species [34], which can reach morethan 95% accuracy [10]. As maintaining the biologically functional RNA molecule requires the conser-vation of its structure in the evolving related species, such structural conservation is manifested in the22form of covariation in spite of the diversity in primary sequence. The input MSA is used to detect co-variation signal. Covariation refers to the one-sided or double-sided mutation in two nucleotides that arefar apart in the primary sequence. This coordination of mutation is due to the base-pairing constraintof the secondary structure; if the structural region is embedded in the protein-coding region, there aredual evolutionary constraint: one layer from the structure and another layer from the amino acid. Thus,this allows them to retain their base-pairing potential (observable as compensatory mutation, e.g. GCmutates to GU, AU, CG), which suggests that these two nucleotides could be paired in the functionalconformation. As a result of the compensatory mutation, the structure is not perturbed due to theisostericity of canonical base-pairs, so the structure-dependent function is maintained.Diverse algorithms are involved in the comparative approach, such as covariation model which istrained from a carefully curated seed alignment with the corresponding structure [125]. This covariationmodel is used to search and align sequences against the seed alignment in terms of both primary sequenceand structural fit. Subsequently, the stochastic context free grammar (SCFG) that entails many produc-tion rules can be used to model the base-pairing pattern of a structure. Cocke-Younger-Kasami (CYK)algorithm [126] can then be used to search for the structure giving rise to the largest probability.Unlike MFE, the comparative approach can yield finding including pseudoknotted structure, andeven tertiary structural motif [10]. In addition, the comparative methods do not need to model thecellular environment of the RNA sequences as well, such as ligand/protein binding or salt concentration.However, the comparative approach has certain disadvantages. Firstly, it requires a high-quality MSAas input, but it requires structural information to curate such structural alignment. In order to produceadequate amount of covariation signal, a certain number of alignments sequences are required. Moreover,the sequence similarity of the MSA need to be controlled within an optimal range, i.e. not too conserved,and yet not too varying. To address this chicken-and-egg problem, there are some methods – for instance,Dynalign [127] and Foldalign [128] – that simultaneously fold and align the homologous sequences.Compared to other approaches, this simultaneous approach is less restrained by the quality of the inputMSA [10]. In addition, RNAcase first folds each sequence individually, and the structural informationis incorporated into the subsequent aligning. Secondly, there are not enough diverse dataset to train theprograms for the structural purifying signals of evolutionary selection [3]. For instance, benchmarkingof many comparative methods are measured on curated alignments derived from Rfam [129]; this doesnot reflect the realistic cases [3]. Lastly, for long sequences, sliding window approach is commonly usedto save the time-complexity, but the potential structure may exceed the boundary used by the slidingwindow [3].1.5.4 Kinetic Folding SimulationThe equilibrium-based methods for RNA structure prediction, as introduced above, may not holdtrue since the RNA could degrade before the equilibrium state is reached [130]. Moreover, these methodsdo not account for the constraints accompanying the co-transcriptional folding of the elongating RNAmolecule. This missing property is explicitly addressed by the kinetic folding approach wherein various23methods have been devised to directly simulate the folding pathway of a RNA molecule. In comparisonto the comparative methods, these kinetic folding methods only accept one single RNA sequence as inputdespite the variation in their methods in modelling the folding process. During the simulation, thesemethods extend the RNA sequence, based on an user-specified transcription rate, at regular intervalsover an appropriate time scale. Their output is typically a list of structures that are formed consecutivelyalong the predicted folding trajectory.Among all the existing kinetic folding methods, most rely on stochastic simulation (e.g. RNAKinet-ics [131–133], Kinfold [134], Kinefold [135–137], [72]). Such stochastic methods are typically modelledthrough an irreducible continuous-time Markov process involving, firstly, a structure space collecting allfeasible structures (helices) for the input RNA sequence [130]. Such methods have various rules governinghow extensible the helix is – e.g. some requires the helix to be extended to the maximum length whereassome allows the helix to be only partially formed – especially in the presence of an overlapped helix, whichthus affects the size of the structure space [130]. Secondly, they involve a rule specifying the minimumgranularity – i.e. helix for the maximum granularity or base-pair for the minimum granularity – of thestructural transition, which in turn determines the allowable transition among the structures in the struc-ture space [130]. The modelling granularity of the RNA folding process by the existing computationalmethods can be divided into molecular level in nanoseconds, base-pair level in milliseconds (kinfold [134]),helix level in seconds (RNAKinetics, Kinefold, but these two methods do not allow bulge or inter-nal loop) [133]. Lastly, the corresponding set of transition rates are also required, which must satisfy:dPx(t)dt =∑y 6=x[Py(t)kxy − Px(t)kyx], where kxy denotes the transition rate from structure y to structurex, and Px(t) is the probability of the RNA molecule in structure x at time t [130]. The transition ratesmust also satisfy the detailed balance rule – piykxy = pixkyx, where pix is the Boltzmann distribution ofstructure x as given by exp(−∆G(x)/RT )∑x exp(−∆G(x)/RT ) , with the denominator corresponding to the partition function– to ensure the structural convergence into the equilibrium structure if given long enough time [130]. Dueto the large folding complexity for long sequence, the Markov process is usually simulated by Monte-Carlo methods where the transition rates are devised carefully to hold the detailed balance rule; e.g. theirtransition rate can be assigned based on the difference between the end and the starting structural states,or between the transition and the starting structural state, or calculating the change in stacking energyor loop energy [130]. For instance, Kinefold (see Section 2.5.2) and RNAKinetics (see Section 2.5.3)fall in the Markov Chain Monte Carlo domain, which allow random helix formation and disassociationwith a probability dependent on the transition rate of this randomized chemical process. Kinefold canpredict pseudoknotted structure while RNAKinetics does not.Some of the remaining methods, such as Kinwalker [138](see Section 2.5.4), employ a deterministicalgorithm that utilizes both free energy minimization and a heuristic which only permits a structuraltransition if the corresponding transcript is transcribed and it is kinetically feasible within the giventranscription rate. In prior to simulating transcription, Kinwalker employs the dynamic programmingto predict the minimum free energy (MFE) structure for all subsequences, which are then successivelymerged starting from the 5’ end. The transition time corresponding to the merging process, as estimated24from a heuristic procedure, must be within the time interval allowed between two consecutive transcriptionevents.There are also some methods that involve explicitly calculating the energy landscapes of all subse-quences, and then constructing the pathway connecting the energetically favourable conformation in theconsecutively elongating subsequences [130].There are some limitation for such kinetic folding methods. Firstly, their assumption concerning thecellular environment is much more simplified than the condition in vivo. For instance, the transcriptionrate – assumed to be constant in the kinetic folding methods – is uneven with interleaved pausing alongthe transcript [75, 139, 140]. In addition, they cannot model the effects delivered by various interactionpartners (e.g. proteins, RNA, or small ligands). Secondly, the length limitations (about 200 nt forKinefold and RNAKinetics, and 1000 nt for Kinwalker) makes these methods inapplicable to manylncRNA. Thirdly, the output of kinetic folding methods is hard to interpret or follow up experimentallysince many methods sample a large number of trajectories with each trajectory corresponding to adetailed list of structures. Statistical sampling also requires sampling more than than 100 trajectorieswhen sequence is longer than 100 nucleotides long, which can take hours. Fourthly, the kinetic methodspredict pseudoknotted structures purely statistically in the absence of thermodynamic parameters; hence,many of such hypothetical pseudoknot are unlikely to exist [130]. Fifthly, the time assigned artificiallyin the simulation may not correspond linearly to the actual time in in vivo folding.1.6 Hypothesis And Goal Of This Thesis: Prediction Of Transient And Al-tenative RNA Structure FeaturesAs discussed in this introduction, apart from the predominant final structural conformation, the alter-native and transient structures formed along the co-transcriptional folding pathway are both functional.The formation of both alternative/transient structures and the final structure are a prerequisite for theRNA molecule to manifest its regulatory function. However, some of the state-of-art methods ignore suchalternative/transient structures by focusing on predicting the final structure, without considering the var-ious constraints imposed by co-transcriptional folding. The others of these methods (i.e. kinetic methods)predict exponentially-increasing number of folding pathways for a single input RNA sequence. Further-more, each folding pathway is a collection of numerous secondary structures along the transcriptionaltimeline. Thus, it is not feasible to efficiently analyse these mass amount of prediction.To address this problem, this thesis has the following goal:(i) Hypothesis: For homologous RNA sequences with similar function and the corresponding conservedfinal structural conformation, if the transient structures formed along the co-transcriptional folding path-way and the alternative structures are functional, they would be evolutionarily conserved to a certainextent among these homologous sequences as well.(ii) To investigate this hypothesis, we compiled a non-redundant data set of 32 transcripts derivingfrom six different RNA families. These six families constitute the most comprehensive data set withexperimentally confirmed transient/alternative RNA structures thus far. We then constructed a compu-25tational pipeline to integrate the prediction from kinetic methods in a comparative approach to searchfor conserved tranisent/alternative structure.(iii) To further provide a high-quality curated set of structural alignments based on both tran-sient/alternative structure and final/predominant structure, we constructed a pipeline involving In-fernal [141] for alignment assembly and curation. Four alignments with typical examples of tran-sient/alternative structures were curated carefully according to current experimental evidences, whichwill be shared with the RNA community via Rfam database.262 Analysis Pipeline For Identifying Evolutionarily ConservedAnd Kinetically Feasible Transient RNA Structure FeaturesThis chapter corresponds to my publication in [81], the content of which is mostly summarized and rewritten, but partiallypasted in-verbatim here.2.1 Motivation2.1.1 The Importance Of Functional Transient Structure On RNA Folding PathwayAs introduced in Section 1.2.5 and Section 1.3, many factors can alter the co-transcriptional foldingpathway, such as transcriptional speed and pausing where RNA polymerase stalls at specific locationfor specific duration. Interaction with molecular partners elicits structural changes to guide the foldingprocess or meet certain metabolic demands [59–61]; this includes cis-interaction with flanking sequencesas a result of transcriptional directionality [142], and trans-interaction with molecules, such as chaperoneproteins [62], sequence- or structure-specific RBP [143, 144], miRNA [63], snRNA [145], snoRNA [146].All of these factors affect the kinetic folding pathway of RNA molecule, and in turn determine the finalstructural product. Along the RNA folding pathway, abundant transient structures can exist for onlya brief window of time or under particular metabolic condition. Some of these transient structuresusher the folding pathway to the functional structure, whereas some interfere with the formation of thefunctional structure, resulting in mis-folded final product. Mounting statistical evidence shows that thelatter ones are under-represented, implying that RNA transcripts may encode the folding pathway withinthe sequence itself [69]. Such encoded folding pathway could manifest itself via a series of transientstructures appearing along the transcriptional timeline in a designated order. Despite of their ephemeralexistence, such transient structures could be functional as well by serving as landmarks for the foldingpathway leading to the functional product. Hence, we seek to identify functional transient structure.2.1.2 HypothesisIn this chapter, we explore the hypothesis that conserved RNA structures from homologous non-codingRNA genes not only fold into similar final (or the dominant structure in riboswitch) functional structures,but that their co-transcriptional folding pathways also share common transient structural features.We aim to (1) exam whether the known transient and final functional features are conserved in relatedsequences, (2) investigate if the known transient/final features can be detected using comparative analysisof kinetic folding methods, (3) propose an analysis pipeline for prediction of novel transient features, and(4) evaluate the ability of kinetic folding methods to predict novel conserved transient features.So far, evaluation of kinetic folding methods has been mostly restricted to a small number of testcases, and we propose the first comprehensive performance evaluation of these methods. For this, wecompile a non-redundant data set of 32 sequences with known transient and final RNA secondary struc-tures – extensively experimentally verified – and devise a dedicated computational analysis pipeline to27comparatively post-process the predicted kinetic folding pathways.However, some assumptions and the accompanying caveats need be acknowledged for this work.Firstly, it is assumed that functionally important structure must be evolutionarily conserved amonghomologs. Thus, conservation is implied, and our approach in this chapter is not feasible if such transientstructure is a feature specific to its species. Secondly, if the aforementioned assumption holds, yet theevolutionary timescale where these transient structural features are conserved is still unclear. That is,the sequences of an alignment need to be chosen case by case to ensure that the evolutionary distancesamong them are appropriate for the timescale.2.2 Strategy Overview: FlowchartFigure 3 summarizes the procedures of how the input alignments of homolog RNAs are processed andanalysed in the pipeline.The core analysis section is illustrated in solid arrows in Figure 3, which is a comparative analysispipeline for kinetic methods that simulate RNA folding pathway. Section 2.3 covers the first step forthe compilation of the input alignments, which explains how multiple structures (transient/alternativeand final/dominant) are mapped onto each alignment. Section 2.4 explains the criteria about how thenon-redundant representative sequences are selected from each alignment for the downstream kineticsimulation. Section 2.5 introduces the 3 kinetic programs for folding pathway simulation of the represen-tative sequences. Section 2.6 explains how the raw output from each kinetic program is post-processedto summarize a score for each helix predicted for the entire alignment. The high scoring helices of eachalignment are visualized in Section 2.8.The optional analysis incorporating Transat is illustrated in dotted arrows, which is used to predictputative novel transient helices. Section 2.7 and Section 2.8 introduce how the Transat prediction couldbe prioritized and contrasted to the prediction from the core analysis section.The results in Section 2.9 present the comprehensive evaluation of the analysis pipeline.2.3 Strategy Overview I: Data AssemblyTransient/alternative structures are not as stable as the final/dominant structure and hard to character-ize experimentally, so there are only few RNA sequences with experimentally validated transient/alternativestructures. Furthermore, database usually does not contain the complete annotation of the known struc-tures, especially the transient/alternative ones. Hence, a set of high-quality MSA of non-coding RNAswith structural annotation for both final/dominant and transient/alternative structures must be com-piled for this work. As introduced in Section 2.2, the dataset will be used to: (i) feed the non-redundantalignment sequences into Kinetic folding programs that only accept single sequence as input, (ii) feedeach alignment into the comparative program Transat.Previous analysis suggests that the performance of Transat peaks when the input alignment featuresten or more sequences with an average pairwise percent-identity of 70-80% [148]. Thus, this is used as part28Figure 3. Analysis pipeline. Solid black arrows represent the core analysis involving kinetic foldingprograms. For each non-coding RNA molecule examined, a reference sequence and known structuralfeatures were used to construct a multiple-sequence alignment (MSA). Non-redundant sequences wereextracted from the MSA, and provided as input to three kinetic folding programs: Kinwalker, Kinfoldand RNAKinetics. The raw output, consisting of simulated folding pathways, was analysed, andhigh-scoring helices were extracted. These predicted helices were mapped back onto the MSA. Acomparison across programs then yielded helices with high scores from multiple programs. The dottedarrows represent an optional analysis involving the comparative helix prediction program Transat. Aphylogenetic tree is required as input along with an MSA. FastTree2 is employed in our pipeline toestimate the phylogenetic tree based on the concatenated unpaired regions [147]. The program detectsand scores conserved helices, and outputs a ranked list of conserved helices with P-values. These aremapped onto the alignment along with the output of the three kinetic folding programs.29of the criteria for curating the six alignments in our dataset, and the final/dominant structure is mappedfrom the experimentally interrogated reference sequence to all the homologous sequences in the alignment.Only sequences well fit the final/dominant structures are selected, which is to optimize the alignment forthe final structure (or one of two alternative structures, if both are functional). Transient/alternativestructures are mapped onto the alignment following sequence selection so as not to bias Transat for theprediction of known transient/alternative features.2.3.1 Procedures Of Alignment Assembly1. Data collection: source, reference sequence and structures Six alignments are assembled inthis study, among which four are sourced from the Rfam database [129]. The bacterial ribonuclease(RNase) P type A RNA (Rfam ID: RF00010), the bacterial signal recognition particle (SRP)4.5S RNA (Rfam ID: RF00169), and the tryptophan (trp) operon leader (Rfam ID: RF00513)are extracted from the seed alignments of the Rfam database [129] (RF00010 seed has 306 sequences;RF00169 seed has 271 sequences; RF00513 seed has 24 sequences). One exception is the hepatitis deltavirus (HDV) ribozyme (Rfam ID: RF00094), which is extracted from the full Rfam alignment instead(RF00094 full alignment has 598 sequences) because the seed alignment presents inadequate variationwith its 33 sequences.The Levivirus alignment is sourced from previous literature supplemented with Blast search. TheSamriboswitch alignment is initially extracted from the published raw alignment of a previous literature(see Table 16). Both of these two alignments are further expanded and aligned by Infernal [149].The raw alignments (sequence + structures) are passed to next step.2. Final structure mapping via reference sequence For each of these raw alignments obtainedfrom Step 1, an reference sequence (in ungapped Fasta form) and its corresponding secondary structure(in dot-bracket form) are chosen from the source literature based on the availability of experimental val-idation. This reference sequence serves as the mediator to manually map the ungapped secondary struc-tures from literature onto the raw alignment. There exists several differences comparing the structuresmapped from literature against the structures featured in Rfam. Since the literature-derived structure issupported by experimental evidence (denoted as known structure), the conflicted or improperly alignedregions in the raw data are adjusted to fit the known structure. The corrected raw alignments (+ theircorresponding gapped structure) and reference sequences (+ their corresponding ungapped structures)are passed to next step.3. Ranking Sequence according to structural fit In order to choose homolog sequences residingwithin appropriate timescale of evolution to the reference sequence (discussed in Section 2.1.2), sequencesfrom the raw data are scored via Perl scripts, and subsequently ranked according to their fit to thefinal/dominant structure. For a sequence, the pair of nucleotides occupying the positions of each knownbase-pair from the final/dominant structure are scored as following:30(i) they form consensus base-pair ({A,U}, {G,C}, or {G,U}): + 1 point,(ii) they cannot base pair: - 1 point,(iii) they have an one-sided gap that weakens the thermodynamic stability of a helix by introducing‘bulges’ [148]: - 1 point,(iv) they have double-sided gaps, presumably due to two successive mutation where one evolutionarilyunfavoured one-sided deletion is followed by an favourable second deletion, yielding a shorter helix freeof ‘bulges’ [148]: neutral, no points awarded/deducted.After ranking their overall scores based on all the known base-pairs from the final/dominant structure,species with appropriate evolutionary timescale are retrieved from the top ranked 10-25 sequences. Afterfiltering out sequences with poor fitting or excessive insertions, the resultant all-gap columns are removedfrom the alignment. This selected subset of sequences is passed to the next step.4. Extension to transcription start site (TSS) To predict conserved helices that form co-transcriptionally(see Section 1.3), we need to simulate the folding starting from the TSS. Thus, the alignment sequencesfrom Step 3 are extended reversibly towards the 5’ end to the TSS, if feasible. Due to inadequate an-notation available for TSS, all sequences are assumed to share the same TSS as the reference sequence.Subsequently, an ungapped alignment is assembled from the subsequences corresponding to the insertedsegment extended in the upstream region of these alignment sequences. This alignment is then alignedvia standard multiple-sequence alignment program MUSCLE [150], outputting a gapped alignment thatis concatenated to the existing alignment from Step 3.For the Levivirus alignment, all sequences are extended to the TSS of the reference sequence. For theRNase P alignment, the TSS is 3 nucleotides upstream of the 5’ end from the one in Rfam, and therebyextended [151]. For the SRP 4.5S RNA alignment, TSS is 32 nucleotides upstream of the 5’ end from theone in Rfam, and extended [152]. For both the Trp operon leader and SAM riboswitch, the 5’ end of theRfam alignment is several nucleotides downstream of the TSS, but not extended due to the abundantdiversity of the alignment sequences [79]. Finally, for the HDV ribozyme, the alignment is extended by52 nucleotides upstream so as to include the 2 alternative structures [28].5. Realignment of unstructured regions It is necessary to properly align the unstructured regions– region where two or more columns that are not involved in base-pairs of the final/dominant structure– because transient/alternative helices could be embedded there. Therefore, a proper alignment enablesthe conservation signal of such helices to be clearly presented to the downstream analysis.However, there is no prior knowledge of transient/alternative helix available to guide the alignment.Moreover, some gaps are not inserted in these unstructured regions in a systematic way in Step 4. Forthis reason, the unstructured regions are extracted from the alignment and grouped into sub-alignmentsseparately, which are aligned by MUSCLE, respectively. MUSCLE removes all gaps and realigns theinput according to only primary sequence conservation [150]. This realigning process improves gap31placement and facilitates the subsequent detection of structural features. The realigned sub-alignmentsare then inserted back to their original locations along the alignment, and the concatenated alignmentsare then outputted to the next step.6. Tree estimation The optional section of the analysis pipeline (Section 2.7) utilizes the predictionfrom Transat. This program requires both a MSA and a phylogenetic tree as input, and the evolution-ary distance among alignment sequences can be estimated by the tree. However, most tree estimationprograms assume that each column in the alignment evolves independently. Clearly, this assumptiondoes not apply to our alignment from Step 5, which contains columns involved in base-pairing. Foreach alignment, the unpaired columns regarding the final/dominant structure are therefore extracted andconcatenated together, which is subsequently used to generate the phylogenetic tree via the FastTreeprogram [147].It should be noted that the extracted columns may still subject to evolutionary constraints by partici-pating in transient/alternative structures. Nevertheless, without prior knowledge the unpaired alignmentcolumns evolve more freely, at least than those paired ones.2.3.2 Background And Details Of The Alignments AssembledBacterial RNase P Type A RNA This RNA molecule (known as 10Sb or M1 RNA) is the essentialcatalytic moiety of bacterial RNase P – an endoribonuclease comprised of an RNA molecule and apolypeptide – that cleaves phosphodiester bond located within a substrate RNA chain, such as trimmingthe tRNA or 4.5S SRP RNA precursor to generate the mature ones [151] [153]. The functional structureof RNase P has been solved by X-ray crystallography [154], which contains several long-range helices thatcould be misfolded [75]. A transcriptional pausing site is previously observed near the 5’ portion of theRNA transcript, and a transient structure formed upstream of this pausing site is discovered by Wonget al.(2007). This transient structure temporarily sequesters the 5’ nucleotides of six long range helices,and releases them once their corresponding downstream pairing partners are transcribed, facilitating thecorrection formation of the LRIs [75].In this alignment (Table 1), the reference sequence is retrieved from E. coli (accession numberCP001509.3).Bacterial SRP 4.5S RNA The bacterial Signal Recognition Particle (SRP) is an ribonucleoproteinwhich constitutes a SRP-dependent machinery that targets and cotranslationally translocates secretaryprotein for inner membrane insertion [155,156]. The 4.5S RNA is the RNA moiety of this complex, whichis essential for the binding between the protein moiety of SRP and the membrane receptor FtsY [157].The final 4.5S RNA structure, as determined by chemical probing [158] and X-ray crystallography [159],has helices formed by LRI, and the structure is critical for its role in the formation of SRP-Receptorcomplex [157]. Similar to RNase P Type A RNA, Wong et al. (2007) identified a transient nonnativestructure upstream of a transcriptional pausing site [75]. This transient structure is proposed to sequester32Rfam ID seq. accession code in representative sequences rangeRF00010CP001509.3 Yes 3136788-3136410ABCQ01000029.1 No 38221-38598CP000544.1 Yes 2306475-2306833AAOF01000005.1 No 51335-50979M33657.1 No 1-383AAPG01000076.1 Yes 1037-1399CP000323.1 No 441079-441444CP000155.1 No 6015384-6015017CP000127.1 No 3251640-3251285CP000285.1 No 2476041-2475686CP000282.1 No 1090693-1091052CP000444.1 No 286631-286986AE017340.1 No 438669-439020CP000514.1 No 2749727-2749375AANE01000023.1 No 29686-29331AAOE01000014.1 Yes 15831-15485AP008229.1 No 4089960-4089609CP000453.1 No 2510110-2509745M19024.1 No 1-354AAKW01000074.1 No 145448-145799CP000513.1 Yes 836440-836105CP000608.1 Yes 759201-759541CR543861.1 No 2832239-2831882AAVL02000025.1 Yes 21848-22196Table 1. Alignment information of Bacterial ribonuclease P type A ribozyme, derived from RF00010.Reference sequence (CP001509.3) is in bold. “seq. accession code” refers to the code retrieved fromNCBI website. “in representative sequences” indicates whether each alignment sequence is selected as arepresentative sequence that is inputted into kinetic folding programs to simulate folding pathways.“range” gives the start and end coordinates of each sequence.33Rfam ID seq. accession code in representative sequences rangeRF00169X01074.1 Yes 138-275AE014073.1 No 412756-412893CP000712.1 Yes 1786664-1786801CP001138.1 No 505944-506081CP001127.1 No 532527-532664CP000075.1 Yes 2101578-2101715CP000076.1 No 2115289-2115426CP000020.2 Yes 1087938-1087802X14404.1 No 95-232CP000720.1 Yes 3460816-3460682Table 2. Alignment information of Bacterial Signal Recognition Particle 4.5S RNA, derived fromRF00169. Reference sequence (X01074.1) is in bold.the 5’ portion of the LRI, facilitating the formation of LRI [75].The reference sequence of this alignment (Table 2) is derived from E. coli (accession number X01074.1).Trp operon leader The trp operon encodes genes responsible for the biosynthesis of the amino acidtryptophan [79]. The 5’ leader region of this operon regulates its gene expression in response to the abun-dance of tryptophan. The abundance of the tryptophan in the environment is gauged by the translationof a leader peptide – encoded in this leader region – which contains a tryptophan tandem [79]. Alongthe transcript of this leader region, two alternative structures (namely, terminator and anti-terminator),formed co-transcriptionally, were identified by Yanofsky (1981) [79]. When the supply of tryptophan isdeficient, the ribosome, stalling on the tryptophan tandem, promotes the formation of the anti-terminatorstructure [79]. As a result, transcriptional machinery is able to proceed to transcribe the downstreamoperon genes (trpE, D, etc) [79]. When tryptophan is plentiful, the ribosome is no longer hindered bythe tryptophan tandem, and the RNA terminator structure – mutually exclusive to the anti-terminatorstructure – is able to form [79]. This terminator structure is recognized by the RNA polymerase toterminate the transcription of the operon genes [79].The reference sequence of the trp operon leader alignment (Table 3) is extracted from E.coli (acces-sion number AE005174.2). The Rfam-annotated 5’ helix of the terminator structure deviates from theliterature annotation [79], and is corrected for the consistency with the literature. For the initial mapping(Step 2 in Section 2.3), the terminator structure was arbitrarily chosen.HDV ribozyme The Hepatitis Delta Virus (HDV) ribozyme is required for the replication of HDV ofwhich the genome is comprised of single-stranded(ss) RNA [28]. In fact, this HDV ribozyme is encodedby the HDV genome, and this ribozyme has catalytic activity regardless of whether it is formed by the34Rfam ID seq. accession code in representative sequences rangeRF00513AE005174.2 Yes 2263095-2263188AY027546.1 Yes 150-252AALF01000034.1 Yes 7496-7395AM286415.1 No 2419797-2419898CP000720.1 No 2235898-2235999AALD01000032.1 No 7530-7429J01557.1 Yes 117-211BX950851.1 Yes 2602822-2602926J01739.1 Yes 99-194AACY020572468.1 No 905-810Table 3. Alignment information of Tryptophan Operon Leader, derived from RF00513. Referencesequence (AE005174.2) is in bold.genomic-sense sequence or antigenomic-sense sequence [160]. It self-cleaves the RNA transcript at the 5’end of this ribozyme [28]. The antigenomic and genomic versions of this ribozyme are not complementaryto each other, but rather adjoin each other with only 2 nucleotide positions overlapped; each spans nt -1to 84 relative to the cleavage site [161]. The spontaneous self-cleaving activity of this ribozyme can bemoderated by the formation of another 2 alternative structures [28]. Chadalavada et al (2000) identifiedthese two alternative structures in the genomic RNA which forms co-transcriptionally: the first onesuppresses ribozyme self-cleavage, while a second, alternative structure sequesters the upstream portionof the aforementioned inhibitory structure, restoring ribozyme self-cleavage [28].The reference sequence of the HDV alignment (Table 4) is extracted from human patient (accessionnumber M28267.1). As experimental evidence for the two alternative structures solely applies to thegenomic sequence, only genomic sequences from the seed/full Rfam alignments are incorporated into thisalignment in spite of the inclusion of both genomic and anti-genomic sequences in Rfam alignments [129].Due to the limited diversity displayed by the seed alignment, genomic sequences from the full Rfamalignment provides invaluable information on evolutionary constrain.5’ UTR of Levivirus maturation gene Bacteriophages in Genus Levivirus have positive ssRNAgenome encoding 4 genes where the maturation gene adjoins the 5’UTR [162]. The Shine-Dalgarnosequence, embedded in the 5’ UTR, is masked by an inhibitory secondary structure, rendering the mat-uration gene untranslatable at equilibrium [76]. However, for a transient window of time during thereplication of this ssRNA genome, the maturation gene on the newly synthesized positive strand couldbe accessible to the ribosome for translation [76]. This is enabled by a small transient helix, as identifiedby Van Meerten et al.(2001), that postpones the formation of the final inhibitory structure [76].The Rfam database does not contain an entry for the 5’ UTR of Levivirus maturation protein, so a35Rfam ID seq. accession code in representative sequences rangeRF00094M28267.1 Yes 635-775AM902165.1 No 633-773AM902181.1 No 630-771AM779584.1 No 632-773M92448.1 No 627-769AB088679.1 Yes 641-780AB118845.1 No 635-775AB118841.1 No 635-775AB037947.1 Yes 634-775AB037948.1 No 634-775Table 4. Alignment information of Hepatitis Delta Virus Ribozyme, derived from RF00094. Referencesequence (M28267.1) is in bold.Alignment seq. accession code in representative sequences rangeLevivirusGQ153927.1 Yes 1-132AF227250.1 Yes 1-139AF195778.1 No 1-132EF107159.1 No 1-133X15031.1 No 1-131NC001426.1 Yes 1-138FJ483839.1 Yes 1-131Table 5. Alignment information of Levivirus. Reference sequence (GQ153927.1) is in alignment is manually assembled with sequences aligned by MUSCLE [150], and curation done on4SALE [163] (Table 5). The reference sequence (accession number GQ153927.1) is from bacteriophageMS2, and its corresponding final and transient structures are extracted from van Meerten et al. (2001) [76].This alignment includes the 5’ UTR from seven MS2-like phages retrieved from NCBI Genome [164].Similar to the Rfam-derived alignments above, Step 5 from Section 2.3.1 is applied to the unstructuredalignment regions in the inhibitory final structure.SAM riboswitch The SAM riboswitch governs the gene expression of 26 genes in B. subtilis andrelated species [80]. The ligand-binding domain of SAM riboswitch has high and selective affinity tothe coenzyme S-adenosylmethionine (SAM); without the ligand bound, the riboswitch assumes an anti-terminator structure to permit the transcription of the downstream genes [80]. Immediately upon bindingSAM, the riboswitch structurally rearranges and a terminator hairpin forms [80]. This terminator hairpinthus promotes the dissociation of the RNA Polymerase, turning off the gene expression.36Alignment seq. accession code in representative sequencesSamRibos.AL009126.3 YesSAM Oi13 YesSAM Lm07 YesSAM Bc10 YesSAM Oi09 NoSAM Bh04 NoSAM Bh05 NoSAM Bc17 YesSAM Bc11 NoSAM Bh02 NoSAM Oi07 YesSAM Oi05 YesSAM Ca07 NoSAM Oi10 NoSAM Oi02 NoTable 6. Alignment information of Samriboswitch. The alignment sequences are derived from thesupplementary file of [80]. The reference sequence (AL009126.3 and range = 1258276-1258464) is inbold.A new alignment (Table 6) is assembled with sequences derived from Winkler et al.(2003) where analignment for SAM riboswitch is presented with annotation of the terminator-containing structure [80].On Rfam, the entry RF00162 only features part of the terminator-containing structure, lacking theterminator. Moreover, the Rfam alignment is truncated at the 3’ end and the anti-terminator structure ismissing [129]. The reference sequence (accession number AL009126.3) is derived from B. subtilis, and thecorresponding experimentally validated alternative structures are derived from Winkler et al.(2003) [80].As explained in Section 2.3.1, the terminator-containing structure is arbitrarily chosen for mapping thereference-specific structure onto the alignment. Then they are processed similarly to those Rfam-derivedalignments.2.3.3 Alignment QualityFor the six alignments compiled, Table 7 gives the alignment statistics related to the structural regions.Table 8 gives the general alignment statistics not related to structures.37Alignment Covariation No. BPs Frac. Canonical BPsLevivirus 0.302 54 0.939RF00010 0.441 122 0.966RF00094 0.00692 61 0.941RF00169 0.34 38 0.976RF00513 0.195 29 0.969SamRibos. 0.276 66 0.895Table 7. Alignment quality measures for the merged known structures. Covariation, No. BPs (numberof base-pairs) and Frac. Canonical BPs (fraction of canonical base-pairs) are all calculated only acrossthe sequence region corresponding to the merged known structures. For each alignment, the mergedknown structures include all the transient/alternative and final/dominant structures.Alignment primary seq. conservation gappiness number of seqs Align length Tree lengthLevivirus 0.645 0.154 7 158 1.181RF00010 0.698 0.0839 24 391 4.279RF00094 0.819 0.0697 10 152 0.499RF00169 0.726 0.0241 10 141 1.240RF00513 0.694 0.0682 10 107 1.253SamRibos. 0.532 0.227 15 215 4.310Table 8. Alignment quality measures that are not related to structures. Primary sequenceconservation and gappiness are calculated on the entire primary sequence. Tree length refers to thetotal branch length summed from the newick format tree.2.4 Strategy Overview II: Selection Of Representative Sequences From EachAlignmentEach of the alignments assembled has 7 to 24 sequences, but the kinetic folding program accepts asingle sequence only. Therefore, a systematic procedure needs be constructed to select representativesequences – they aim to summarize the sequence composition of this alignment by eliminating sequenceswith high similarity – from each alignment. In contrast to individually inputting all the sequences of analignment into the kinetic folding program, simulating the co-transcriptional folding pathways of onlythese representative sequences avoids assigning extra weight to certain structural features folded by asubset of highly redundant sequences.Firstly, the alignment sequences are ranked by their overall fit to the final/dominant structure of thereference sequence (e.g. having few invalid base-pairs is preferred). Secondly, the structural best-fitting38sequence is added into the set of representative sequences, and the next best-fitting-sequence is added if itdoes not share pairwise sequence identity greater than an alignment-specific threshold (Table 9) with anysequence inside the current set of representative sequences. Then, the 2nd step is repeated to successivelyadd qualified representative sequences until all alignment sequences are assessed, ensuring that the set ofrepresentative sequences is not redundant.Alignment cutoff for pairwise sequence identityLevivirus 0.8RF00010 0.7RF00094 0.85RF00169 0.8RF00513 0.8SamRibos. 0.55Table 9. Alignment-specific cutoff for pairwise sequence identity. This cutoff is applied to filter outredundant sequences when selecting representative sequences.2.5 Strategy Overview III: Folding Pathway Simulation With Existing Ki-netic Methods2.5.1 Three Kinetic Programs Employed In Our Pipeline, And Their Input ParametersThe three programs employed and their input sequence size In this analysis pipeline, threesimulation methods – Kinefold [135–137], Kinwalker [138], and RNAKinetics [131–133] – are in-corporated to model the RNA folding pathway of the representative sequences selected in Section 2.4.These methods are non-comparative, so the dynamic structural profile of each elongating transcript alongthe synthesizing timeline could be predicted. The results are subsequently analysed comparatively. Thisselection of methods generates results of various perspectives by depicting the diversity of existing fold-ing simulation methods with significantly distinct simulation algorithms. The length limitation includesroughly 200 nucleotides for the stochastic simulation methods, and 1000 nucleotides for Kinwalker.Transcriptional rate by Polymerase or synthesis rate by Replicase All three kinetic methodsrequire an input parameter of a constant transcription rate. A cognate transcription rate should be usedfor each input sequence as there is mounting evidence showing that the resultant structure of a RNAmolecule is significantly impacted by the transcription process. For instance, when the transcription rateis modulated (e.g. using a non-host polymerase), not only the overall folding rate of the whole RNAmolecule but also the conformation of the transient or final structure can be different [64–68]. Therefore,for each of the alignment sequence, a transcription rate is applied according to the evolutionary domain39of the organism and the underlying mechanism of its RNA synthesis: 200 nt/s for viruses (30 nt/sec isused for Levivirus, which is the rate of coliphage replicase [76]), 20–80 nt/s for bacteria, and 10–20 nt/sfor eukaryotes [56].Trial number The stochastic methods randomly simulate the folding trajectory in each run, so multipletrials are necessary to determine the statistical significance of the prediction. In our pipeline, the numberof trials required for both RNAKinetics and Kinefold is approximated as a quadratic function of thesequence length, based on recommendation for these programs.Simulation time For both RNAKinetics and Kinefold the total simulation time allowed, t, is setto be twice the transcription time: t = 2 · L/r, where L is the sequence length in nucleotides, and r isthe transcription rate in nt/s.2.5.2 KinefoldInput Kinefold is a stochastic method for predicting RNA structure, including pseudoknot, by mod-elling its co-transcriptional folding process. Its input parameters include transcription rate, simulationtime, number of runs, random seeds, and etc.Algorithm In Kinefold, the pseudokottd structures (nonnested) and the pseudoknot-free structures(nested) are modelled similarly via a polymer model [135]. That is, the structure is visualized to becomposed of helices as rigid rods, and single-stranded regions as flexible polymer springs (Gaussianchains) [135]. The single-stranded polymer springs can then be grouped locally into closed or opennets by internally encompassing no more than two rods (i.e. helices) in each net. Some nets are thecharacteristics of a pseudoknotted structure, e.g. a net with base-pairing between the hairpin loop ofone helix and the flanking unpaired region. Such independent nets can collectively represent the globalRNA structure through a “crosslinked gel”, resembling a mesh with each vertex representing a localnet, and each inter-net linking line representing a polymer spring or rod [136]. From this “crosslinkedgel”, the entropy can be summed up in order to model the entropy of the global RNA structure. Thethermodynamic free energy of the constituent helices can be summed up based on the parameters from theTurner group, which is combined with the entropy contributed by the global organization of these helices(calculated using the “crosslinked gel” and the structural dimension measurements) [135]. Such leverageof a polymer model to represent the structure confers on Kinefold the ability to efficiently predictpseudoknoted structure despite that there is no analytical energy parameters for pseudoknots [135].The co-transcriptional folding process is modelled by Markov Chain Monte Carlo (MCMC), withhelix – a helix is defined as having > 3 base-pairs in Kinefold – formation/disassociation comprisingthe steps [135]. A transition rate k for the helix formation and disassociation is calculated as k◦ ×exp(−∆G/kT ), i.e. the arrhenius-like rates [135]. ∆G is the difference in energy when comparing thetransition configuration and the current configuration, which is based on the “crosslinked gel”; k◦ '40108s−1 here, which is experimentally determined to account for base stacking [135]. After each helixformation, the existing and the incoming helices can undergo change in helix length in order to minimizethe energy [136]. When the incoming helix is locally overlapped with the existing structure, the sizeof the conflicting helices can be compromised to set a minimally costly division between them in termsof free energy, without disrupting their 3-bp-long helix core; this entails selecting an optimum positionto initiate the nucleation core of the new helix (minimum size is 3 bp) [135]. The same adjustment foroverlapping helices is applied to the transition configuration, in order to curate the transition rate for helixformation/disassociation [135]. For instance, when an existing helix is disassociating, the global structurecan undergo rearrangement as well especially in the remaining competing helices that were perhapsshrunken earlier when this helix was first added in; the transition rate is then calculated accordingly forthe helix disassociation [135]. After the transition rate kji (from structural state i to j) is updated, theprobability of the transition is calculated as pji = kjiti, where the lifetime of state i ti =1∑j kji[136].The simulation proceeds stochastically with a random choice of transition at each step, i.e. add or removea helix with pji, which is probabilistically dependent on and positively correlated to the transition ratekji [135]. After the transition to structural state j, the corresponding lifetime of the previous state, ti,is appended to the total time elapsed since the transcription starting point; this cycle of steps is thenrepeated until reaching a stable configuration distribution within the given simulation time. Kinefold isnon-deterministic, so tens or hundreds of trajectories – proportional to the sequence length – are sampledin order to derive the statistics.The simulation, however, can become stuck in a kinetic trap – a local minima for free energy – madeof some repeatedly revisiting conformations, predominantly a trap of two conformations [135]. Ligandbinding could in vivo aid the RNA molecule to evade the kinetic traps. In Kinefold, a transition rateis arbitrarily assigned – while satisfying the detailed balance rule – to the trap-causing conformations toescape the trapping local free energy minima [135]. This trapping problem has been further acceleratedby an Exactly clustered stochastic (ECS) algorithm introduced in the second paper of Kinefold [136].In the updated algorithm, the n structures that are recently visited are clustered into a set A [136]. Thesimulated RNA molecule in the state i inside set A can then exit the set A from state j to reach a statek outside A [136]. A weight is assigned to any possible pathway inside the set A, from state i to state j,where both states are part of set A. In this way, from any state i, the probability and the correspondingtime to escape the set A from a pathway ending at j can be calculated cumulatively. The stochasticsimulation is thus modelled at the granularity of cluster set A and the cluster A is updated every time anew structure is visited.Output The output ofKinefold contains a detailed list of structural configurations over the simulationtime for a single sampled trajectory. The data for all the sampled trajectories must be processed in orderto derive the folding statistics.412.5.3 RNAKineticsInput RNAKinetics is a stochastic algorithm that predicts RNA structure along the co-transcriptionaltimeline, not including pseudoknots. It takes a RNA sequence of length n (up to 250 nucleotides), chainelongation rate, total simulation time T, nucleation constant κ, and total number of sampled runs Nrunsas the input [133].Algorithm RNAKinetics models the RNA co-transcriptional folding through representing the cur-rent folding status by a tuple of 3 variables:(i) t for the total time elapsed since the start of transcription(ii) l for the corresponding transcribed length of the transcript(iii) current fold for the collection of all helices – a candidate helix, as defined in RNAKinetics, doesnot contain bulge – existing at the current t and accessible for the length l so far [133]. For the currentfold, only compatible helices are allowed in the Standard model, but overlapping or mutually exclusivehelices are allowed in the Advanced model, given fast interconversion rate [133].At any t, the current structural configuration is subject to spontaneous structural rearrangement, whichincludes two events that follow the detailed balance equilibrium rule [130,133]:(1) a helix existing in current fold can decay, and the corresponding decay kinetic constant is posi-tively related to the free energy (∆Ghelix, derived from both stacking and H-bond) and length (Nh =number of basepairs in this helix− 1) of the helix [133]:kdis = κc ·Nh · e∆GhelixkT where κc = 106...108s−1, a kinetic constant for one marginal base pair,A threshold of kdis < 103s−1 is imposed on all the possible candidate helices, in order to filter out theextremely unstable ones. The retained helices are then numbered by 1, ... M [133].(2) a new helix, currently not included in current fold but accessible to the transcribed length l,can form. The formation kinetic constant for any candidate helix is also positively related to the helixlength, but negatively related to the energy difference pertaining to the loop region as a result of thehelix formation [133]:kform = κc ·Nh · e−∆GloopskTIn the case that this new helix is not compatible with another helix currently in the current fold, theAdvanced model uses an effective transition constant to model the formation of this new helix by takinginto consideration the decaying of the mutually exclusive helix [133]:1kform= 1kdis +1kiniwhere kdis takes care of the competitive base-pairs obstructing theformation of this new helix. The kini represents the formation of the new helix by initiating from onlyone base-pair [133].When simulating the co-transcriptional folding, Nruns trials are conducted to sample the time-basedprobabilistic distribution of RNA structures [133]. In each trial, for each of the M qualified helices (filteredby kdis, as explained in (1)), the decay kinetic constant kdis and the formation kinetic constant kform42can be calculated, respectively [133]. In a pool of these M helices, ki (i=1 to M) is calculated for eachof these M helices using the kdis or kform, individually, which depends on whether this helix is part ofcurrent fold thus far. Additionally, a kM+1 is assigned to the elongation constant kE if the transcriptionis in progress, or 0 if l = full length n [133]. Among these k1,...,kM+1, a transition is randomly selectedand the corresponding time taken for this event is appended to t [133]. This repeats until the allowableT is reached [133].Output The output does not return the raw predicted folding pathway for the individual trial. Instead,it returns a summary of the helices that are encountered in the predicted pathway across the Nrunsrandomized trials, such as free energy and length. By aggregating the predictions across all trials, thesummary also gives the probability values of each helix over the simulation time.2.5.4 KinwalkerInput Kinwalker is a deterministic algorithm that predicts only pseudoknot-free RNA structure. Ittakes a RNA sequence of length n (up to 1000 nucleotides) and the transcription rate as the input [138].Algorithm Kinwalker models the progressive nature of co-transcriptional folding by simulating fold-ing events with interleaving transcription events [138]. Transcription event emits 1 base at a time andtranspires at a regular time interval that is calculated from transcription rate: 1transcription rate [138]. Oncethe sequence is elongated by a transcription event, the sequence undergoes a folding event to address theelicited change in the energy landscape [138]. Folding event transpires during the regular time intervalbetween consecutive transcription events, or after the last transcription event [138]. Unlike transcriptionevent, the length of folding event varies, which is estimated from the energy barrier of transforming thestart structure into the end structure:t(∆G) = 10(811 ∆G−7) for ∆G > 0 where t(∆G) is appended to the current time elapsed to calculatethe remaining time allowed in the regular time interval [138].Prior to transcription or folding, for subsequences bound by i and j (1 ≤ i ≤ j ≤ n, Xi paired withXj), a matrix Ci,j is constructed to store the energy values of the MFE structure predicted by dynamicprogramming [138]. Such subsequences – corresponding to the MFE structures embedded in Ci,j – arethen stored in L, an ordered list. In L, these subsequences are ranked based on a hierarchical criteria:(i) the shortest sequences come first as local structure is preferred (ii) subsequence closer to either the5’ or 3’ end is placed first as it can promptly engage in structural formation without the interference offlanking structured sequences (iii) smaller i comes first as imposed by co-transcriptional folding [138].When folding event transpires, a subsequence (i, j) within the transcribed region is selected from Land its MFE structure is incorporated into the current structure to reduce the total free energy, givenfeasible energy barrier [138]. All subsequences of this added sequence (i, j) are defined as covered insidethe current structure and will not be contemplated in the future iterations [138]. This added sequence (i,43j) is non-dominated when no (k, l) exists in the current structure with k ≥ i, and l ≥ j [138]. The frontrepresents the collection aggregating all such non-dominated covered (i, j) subsequence [138]. However,conflict arises when this incoming subsequence (i, j) (structure S2) and an subsequence contained inthe current structure S1 hold different optimal base-pairs in the overlapped region [138]. This can beresolved by first merging the compatible base-pairs from S2 and S1 into a hybrid structure S’. Secondly,inside S’, the base-pair pertaining to S1 in the conflicting position is iteratively replaced by the onefrom S2 until exhausting all the conflicting base-pairs; each iterative step results in an intermediateS’. Finally, the optimal merged structure with the lowest free energy is chosen among all the S’ and theinitial/end structure [138]. This optimal merged structure must be reachable from the initial structure S1with an energy barrier below Emax which can be obtained by inversing the formula above to estimate themaximum ∆G, given the remaining time till the next transcription event. The energy barrier is measuredby the saddle point height of the optimal folding path estimated by Morgan-Higgs heuristic [165]; thesaddle point corresponds to the largest free energy along the optimal path, but the lowest among allthe maximum energy barriers pertaining to the alternative paths [138]. This optimal folding path isderived by concatenating all the optimal subpaths wherein a subpath is constructed once a permutationof the base-pairs exclusive to the target structure is iteratively added to the current structure [138]. Theoptimal subpath, with the lowest height of saddle point, connects an initial structure to a merged structurecontaining the initial structure – the base-pairs conflicted with the added permutation are removed – andthe inserted permutation [138]. Such optimal subpath is constructed stepwise until reaching the targetstructure.During folding event, Emax continuously decreases as time elapses. Once the next transcription occurs,Emax is reset to the initial value corresponding to the full time interval between transcriptions. Afterthe entire transcript is generated, more folding is penalized and the folding stops when front engulfs theentire transcript (i.e. (i, j) = (1, n)) [138].Output The output simply contains a series of secondary structures along the elongating transcriptlength over the simulated time.2.6 Strategy Overview IV: Post-processing Of Simulation PredictionsAs the aforementioned kinetic methods are non-comparative, each of the representative sequencesfrom one alignment need to be simulated by all three kinetic methods to generate the prediction of thefolding trajectory, respectively. The predicted folding trajectories of all representative sequences, from onealignment, are then combined in a comparative way. This comparative combination and post-processingallow us to assign a statistical significance to each predicted structural feature (e.g. helix or base-pair).However, the three programs differ in terms of the underlying algorithms and output format. Hence, theresult of each program is processed in a distinct way:(i) For the deterministic method, Kinwalker, the predicted folding pathway of all representativesequences is aggregated. The score assigned to each base-pair indicates the fraction of predictions in which44this base-pair is found. (ii) For the stochastic method, the predicted folding pathway of all simulationtrials for one representative sequence is aggregated, and then such aggregated pathway of all representativesequences is aggregated again. The Kinefold score represents the fraction of predictions where the base-pair occurs. The score of RNAKinetics refers to the peak probability averaged across all representativesequences, which is on a different scale than the other two programs.The goal of this program-specific post-processing is to establish a metric for each kinetic foldingmethod, which allows us to compare the statistical significance of one predicted structural feature acrossall three programs. The score, assigned to the predicted structural feature, is on the alignment level sinceit summarizes the prediction of all representative sequences.2.6.1 Kinefold Results Post-processingScore Calculation for each base-pair Kinefold, a stochastic method, simulates multiple trials foreach input representative sequence, and the number of trials correlates quadratically with the sequencelength. For each trial, the results include the detailed folding trajectory (i.e. secondary structure configu-ration) of the simulated RNA molecule over time. Thus, we aggregate the secondary structure predictedin all trials for one input sequence, and a score is initialized to any base-pair (by breaking down thepredicted helix into base-pairs) predicted to occur at any time in at least one trial. The score equals thefraction of trials in which this base-pair is found. This score is then averaged across all representativesequences for this alignment, and the average is set to be the score for this base-pair.2.6.2 RNAKinetics Results Post-processingScore Calculation for each base-pair In contrast to Kinefold, RNAKinetics already aggregatesthe prediction of all trials and generates a summary of the predicted helices with probability values of eachhelix over the simulation time. Thus, we simulate each representative sequence through RNAKineticsto obtain this summary list of helices, and retrieve the maximum probability over time for each predictedhelix. The helix is then broken down into base-pairs. After repeating this for all representative sequencesof one alignment, the average maximum probability is calculated for each base-pair to assign it as thescore.2.6.3 Kinwalker Results Post-processingScore Calculation for each base-pair As no randomization is involved inKinwalker, the predictionof this deterministic method for each input sequence is simply a series of secondary structures in dot-bracket form along the transcript synthesizing time. Thus, a score is assigned to any base-pair that can beformed at any time in the predicted folding pathway of any representative sequence from one alignment.The score is calculated to be the fraction of representative sequences in which the base-pair occurs.452.7 Extended Pipeline For Detecting Conserved Transient Structure2.7.1 TransatTransat is a comparative program that predicts evolutionarily conserved helices, taking a multiplesequence alignment and a phylogenetic tree – this tree quantitatively summarizes the evolutionary re-lationship among the homologs of the input alignment – as input [148]. Transat does not formulateassumptions regarding the biochemical cellular environment where this RNA sequence folding occurs.Instead, Transat relies on the evolutionary signal encoded in the input alignment to infer conservedstructural features, which implies their functional importance. It predicts individual helices instead ofone global structure, which allows the detection of structurally incompatible features (e.g. mutually ex-clusive structures due to overlapping, or pseudoknots) [148]. This feature is critical for the searching oftransient helices formed co-transcriptionally since such labile helices are usually mutually exclusive withthe known functional structure. Transat is also fairly robust to alignment errors because it does notrule out structural possibilities that only work for a subset of the sequences [148].Identifying and Scoring Feasible HelicesIn Transat, a helix is defined to consist of four or more consecutive base-pairs. First, Transatdetects all feasible helices (h) for each individual ungapped sequence from the input alignment. Thenit takes each helix h, maps it back onto the gapped input alignment, and computes its log-likelihoodaccording to [148]:Λ(h) = log2(P (h|ΘpairedP (h|Θunpaired )1Lwhere Θpaired is the hypothesis that the columns of h are paired, Θunpaired is the hypothesis that they arenot, and L is a normalization factor, essentially the number of base-pairs in h. Transat uses Felsenstein’salgorithm on the alignment columns of h to compute P (h|Θpaired) and P (h|Θunpaired).Estimating P-values for the Predicted HelicesGiven different length and nucleotide composition, sequences vary in their overall propensity to formhelices [148]. Sequences with high helix-forming propensity tend to produce lots of random helices withhigh log-likelihood scores, but only some of these are actually important [148]. In order to normalize thelog-likelihood scores based on the intrinsic characteristics of the input alignment:(i) the sequences are realigned based on nucleotide identity only, which removes the influence of thestructural information embedded in the alignment [148];(ii) the alignment columns are randomly shuffled 500 times according to certain constraints. For eachof these randomized alignments, helices are detected and scored as described above. The log-likelihoodscores for all the helices from all these alignments are combined into a histogram which is used to deriveP-values for the helices from the original alignment. The P-value indicates the probability that thepredicted helix is observed merely due to chance.46These randomization steps confer advantage on Transat : a null distribution of possible structureswith specific statistical score (i.e. P-value) is assigned to each predicted helix.2.7.2 Combine Kinetic Programs And TransatAs shown in the optional pipeline of the Flowchart (Section 2.2), the comparative program Transatcan be combined with the Kinetic programs to search for evolutionarily conserved non-native structuresthat are also permitted to form along the folding trajectory of homologous RNA sequences. Thesestructures are deemed the potential transient structural features. Though Transat operates on thealignment level and Kinetic programs operate on the level of single RNA sequence, the post-processedsimulation results from Kinetic programs (as described in Section 2.6) can be used to compare to theprediction from Transat on the alignment level.For this, Transat is employed to predict conserved helices that do not belong to the known functionalstructure. These novel helices are regarded as potentially transient structural features. In each alignmentassembled in Section 2.3, we retrieve the top six helices – with the lowest P-values, and has less than 50%of its constituent base-pairs overlapped with the known structure – predicted by Transat. Despite notbeing part of the known structure, these novel helices are supported by significant evolutionary evidencewith no P-values greater than 0.03. We treat these helices as putative novel transient helices, and inputthem into the downstream analysis of this optional pipeline. That is, we compare the post-processedsimulation results form the three Kinetic programs against the putative novel transient helices to see:(i) whether such putative novel transient helices are also predicted by the Kinetic programs; (ii) ifthey do get predicted by any Kinetic program, what is the corresponding score, in the post-processedresults, for the constituent base-pairs of these helices? If the score is high, it predicts that these conservedhelices are also capable of being formed along the folding pathway given all the constrains imposed bythe co-transcriptional folding.2.8 Visualization Of Pipeline PerformanceIn order to evaluate the performance of the core and the optional Transat section of the analy-sis pipeline, we compare the prediction metrics post-processed from the Kinetic programs (Section 2.6)against several benchmarks: (i) known final/dominant functional structure, (ii) known transient/alternativestructures, and (iii) putative novel transient structures proposed by Transat. For this, each alignmentin our data set is annotated with known final features, known transient features (in the case of alternativestructures we consider all base-pairs transient), and putative novel transient features. This allows us toinvestigate the ability of the comparative approach – that integrates the predictions from kinetic RNAfolding methods – to predict these classes of structural features.472.8.1 Visualization Of The Proposed Putative Functional Transient/Alternative Struc-tures In Relation To The Known StructuresKnown features (final/dominant + transient/alternative) vs. Transat prediction (top 6helices) The following figures (Figure 4 to Figure 9) illustrate the reference structural features forthe six alignments. The reference structural features comprise known final/dominant structure, knowntransient/alternative structure and putative novel transient helices, against which the prediction fromthe Kinetic folding pathway simulation programs is evaluated. The known structures are derived fromprevious research literatures. The putative novel transient helices refer to the top six helices predictedby Transat with the lowest P-values. They are selected from the Transat prediction after removingany helix that overlaps with the known base-pairs ≥ 50%.Arc plot The arc plots for the reference structural features of the alignments in Figure 4 to Figure 9are adapted from the R-chie plot [166] and its annotation convention. The components of the arc plotsare described below.Horizontal lines In each arc plot, the alignment sequences are represented by an array of horizontallines. Each line is composed of a series of grids where each grid represents one nucleotide in this sequence.There are two arrays of horizontal lines, with the second array positioned right below and aligned with thefirst one. They identically represent the alignment sequences in the same order, except one color-codedfor the top arcs and the other one for the bottom arcs. The vertical grey bars, placed in the background,divide the alignment into segments of equal length of nucleotides for measurement convenience.Arcs Arcs are placed above and below the alignment lines, and each arc portrays a base-pair fromthe reference structural features. The two columns of grids connected by one arc represent the pairingbases for this base-pair, with one pair for each sequence. On each alignment sequence line, this pair ofgrids (representing nucleotides) are coloured according to their pairing covariation quality, i.e. canonical(namely, AU, CG, and GU) or not. Two green grids connected by an arc refer to a canonical base-pairmatching the most abundant nucleotide composition within the corresponding pair of columns. Cyangrids refer to a canonical base-pair that differs from the most prevalent base-pair (i.e. green grids) byone side mutation. Blue grids specify a canonical base-pair that has mutation on both sides. Red gridmeans a non-canonical base-pair. If the grid resides in a gap, then it is coloured grey. The unstructuredregion, i.e. the columns of grids that are not connected by arcs, is coloured in black.Arcs above the horizontal alignment linesThe base-pairs of known transient/alternative and known final/dominant structures are shown asarcs above the alignment. The top arcs of structures are color-coded based on their structure type ( according to the ”Known Structure” legend located at the bottom left corner.If the alignment has final functional and co-transcriptionally-formed transient structures, the covariationquality depicted in the coloured grids on the top alignment is specific to the final structure. On the other48hand, if the alignment has alternative structures, then the top alignment is specific to the alternativestructure 2.Arcs below the horizontal alignment linesThe putative novel transient helices are shown via arcs below the alignment lines. The bottom arcsare coloured based on their P-values according to the legend at the bottom right corner. As Transat iscapable of predicting mutually-exclusive helices, conflict between competing helices arises when we wantto represent their covariation on the same plot [148]. To resolve this issue, the covariation colors showncorrespond to the connecting arcs that are drawn with solid lines only.490 20 40 60 80 100 120 140Known StructureTransientFinal0 − 5e−045e−04 − 0.0010.001 − 0.0050.005 − 0.010.01 − 0.03Figure 4. The reference structural features of Levivirus500 50 100 150 200 250 300 350Known StructureTransientFinal0 − 5e−045e−04 − 0.0010.001 − 0.0050.005 − 0.010.01 − 0.03Figure 5. The reference structural features of RF00010, Bacterial ribonuclease P type A ribozyme.510 20 40 60 80 100 120 140Known StructureTransientFinal0 − 5e−045e−04 − 0.0010.001 − 0.0050.005 − 0.010.01 − 0.03Figure 6. The reference structural features of RF00094, Hepatitis Delta Virus Ribozyme.520 10 20 30 40 50 60 70 80 90 100 110 120 130 140Known StructureTransientFinal0 − 5e−045e−04 − 0.0010.001 − 0.0050.005 − 0.010.01 − 0.03Figure 7. The reference structural features of RF00169, Bacterial Signal Recognition Particle 4.5SRNA530 10 20 30 40 50 60 70 80 90 100Known StructureAlternative 1Alternative 20 − 5e−045e−04 − 0.0010.001 − 0.0050.005 − 0.010.01 − 0.03Figure 8. The reference structural features of RF00513, Tryptophan Operon Leader540 20 40 60 80 100 120 140 160 180 200Known StructureAlternative 1Alternative 20 − 5e−045e−04 − 0.0010.001 − 0.0050.005 − 0.010.01 − 0.03Figure 9. The reference structural features of Samriboswitch. Alternative 1 is the sam-unboundconformation and Alternative 2 is the sam-bound conformation.2.8.2 Visualization Of The Pipeline Performance Evaluated Against Reference StructuralFeaturesreference structural features vs. Kinetic prediction The following figures (Figure 10 to Figure 26)visualize the performance evaluation of the three Kinetic folding pathway simulation programs for the55six alignments, respectively. The kinetic predictions are evaluated against the different categories in thereference structural features, i.e. final, transient/alternative or the top 6 helices predicted by Transat.Arc plot The arc plots for the program-specific performance for the alignments in Figure 10 to Figure 26are adapted from the R-chie plot [166] and its annotation convention. The components of the arc plotsand their representations are described below. The basic features of the plots are similar to the ones seenin Figure 4 to Figure 9, e.g. covariation, arcs and alignments.Horizontal lines Unlike the previous plots, there is only one alignment in these performance plots,which is represented by the block of lines positioned in the center.Arcs above the horizontal lines In each of the following performance plots, the reference struc-tural features – known transient/alternative, final functional structures, and top six helices predicted byTransat – are all placed above the alignment. Both known transient/alternative and final structures areplotted using arcs with solid lines. Arcs for known transient structure and known alternative structure 1are coloured in red. Known final and known alternative structure 2 are represented by black arcs. Thecovariation shown on the grids of the alignment is specific only to the known final structure or alternativestructure 2, i.e. the black arcs. The novel transient helices predicted by Transat are plotted in arcswith dotted lines above the alignment, and the arcs are color-coded to the Transat P-values based onthe ”Transat” legend located at the top right corner.Arcs below the horizontal lines The structural prediction from a kinetic folding pathway simulationprogram is first filtered with the cutoffs derived from the optimal MCC as shown in Figure 27. Then theretaining base-pairs are plotted as arcs below the alignment. If the predicted base-pair matches any ofthe reference structural features (i.e. the arcs above the alignment for known transient/alternative/finalstructure), it is then color coded in terms of its post-processed prediction score (see Section 2.6) accordingto the ”TP” legend at the bottom right corner. Otherwise, it is color coded based on the ”FP” legendat the bottom left corner. Moreover, if the predicted base-pair matches the known final or transientstructures, its arc is plotted with solid line. If it matches the novel transient helices, its arc is plottedwith dotted line instead.Definition of the post-processed prediction score in the three Kinetic programs For eachof the pathway prediction programs, the predicted base-pair is assigned a score value after analysingthe homologous sequences comparatively. The value of the predicted base-pair has different meaningdepending on what program is employed due to the distinct output formats of Kinwalker, Kinefoldand RNAKinetics.Kinwalker The prediction from Kinwalker for all representative sequences of an alignment is anal-ysed to aggregate all the predicted base-pairs. For each base-pair, the fraction of representative sequences56bearing this base-pair in their prediction is calculated. This fraction corresponds to the score value of thearc, which is coloured according to the aforementioned ”TP” or ”FP” legend. The cutoff 0.43 is appliedto filter out any base-pair predicted with lower score than that.Kinefold For each representative sequence, there are n runs sampled in Kinefold. Thus, the predic-tion for each run is analysed and the fraction of runs bearing a base-pair is calculated for all the base-pairsthat ever get predicted for one representative sequence. This fraction calculated for a base-pair is sub-sequently averaged across all representative sequences of an alignment. Thus, the mean fraction of runscontaining this base-pair in the prediction corresponds to the score value shown in the arc. The cutoff0.755 is applied to filter out any base-pair predicted with lower score than that.RNAKinetics Since RNAKinetics already summarizes all the prediction from the sampled runs intoa list of probability values along equally-spaced simulation time points, we choose to use the maximumprobability to represent the probability of seeing this helix. Then the max probability of a base-pair isaveraged across all representative sequences. This averaged max probability is the score value shown inthe arc. The cutoff 0.0082 is applied to filter out any base-pair predicted with lower score than that.2.9 Performance Statistical Measures And Results2.9.1 Known Transient Features Of Folding Pathways Are Evolutionarily Conserved.Among related species, functionally equivalent RNA sequences tend to be evolutionarily conserved interms of secondary structure [148], which can be measured by the base-pairing potential and covariation ofthe relevant structural features. The structure quality measures of the alignments are shown in Table 10for the known final structure and known transient structure, respectively. Any base pair that is shared byboth known transient structure and known final structure is removed from the transient structure. Foreach structural category, the constituent base pairs are subjected to the calculation of canonical base pairpercentage, covariation, primary sequence conservation, and sequence gappiness with the aforementionedcanonical base pairs, i.e. AU, CG, and GU.As shown in the average row of Table 10, both known transient and final structures well maintain thecanonical base pairing potential while known final structure is only slightly higher by 5%. The knowntransient structure has a high canon. bp of 0.909 and a positive 0.0987 covariation. Noticeably, thecovariation level of known transient structure is much lower than that of known final structure. This lackof variation pertaining to the transient structure is also reflected by the 0.09 higher primary sequenceconservation.Nonetheless, Table 10 indicates that the transient features are evolutionarily conserved with approx-imately the same level as final features based on the base pairing potential.Alignment statistics Summary statistics for alignment characteristics are found in Table 7 and Ta-ble 8. The average number of sequences per alignment is 12, and the average number of characters in570 20 40 60 80 100 120 140Transat( 0 − 5e−04 ]( 5e−04 − 0.005 ]( 0.005 − 0.05 ]Known StructureTransientFinalTP( 0.4 − 0.55 ]( 0.55 − 0.7 ]( 0.7 − 0.85 ]( 0.85 − 1 ]FP( 0.4 − 0.55 ]( 0.55 − 0.7 ]( 0.7 − 0.85 ]( 0.85 − 1 ]Figure 10. Kinwalker prediction performance on Levivirus.580 20 40 60 80 100 120 140Transat( 0 − 5e−04 ]( 5e−04 − 0.005 ]( 0.005 − 0.05 ]Known StructureTransientFinalTP( 0.75 − 0.8 ]( 0.8 − 0.85 ]( 0.85 − 0.9 ]( 0.9 − 1 ]FP( 0.75 − 0.8 ]( 0.8 − 0.85 ]( 0.85 − 0.9 ]( 0.9 − 1 ]Figure 11. Kinefold prediction performance on Levivirus.590 20 40 60 80 100 120 140Transat( 0 − 5e−04 ]( 5e−04 − 0.005 ]( 0.005 − 0.05 ]Known StructureTransientFinalTP( 0.008 − 0.02 ]( 0.02 − 0.05 ]( 0.05 − 0.1 ]( 0.1 − 1 ]FP( 0.008 − 0.02 ]( 0.02 − 0.05 ]( 0.05 − 0.1 ]( 0.1 − 1 ]Figure 12. RNAKinetics prediction performance on Levivirus.600 50 100 150 200 250 300 350Transat( 0 − 5e−04 ]( 5e−04 − 0.005 ]( 0.005 − 0.05 ]Known StructureTransientFinalTP( 0.4 − 0.55 ]( 0.55 − 0.7 ]( 0.7 − 0.85 ]( 0.85 − 1 ]FP( 0.4 − 0.55 ]( 0.55 − 0.7 ]( 0.7 − 0.85 ]( 0.85 − 1 ]Figure 13. Kinwalker prediction performance on RF00010, Bacterial ribonuclease P type Aribozyme.610 50 100 150 200 250 300 350Transat( 0 − 5e−04 ]( 5e−04 − 0.005 ]( 0.005 − 0.05 ]Known StructureTransientFinalTP( 0.75 − 0.8 ]( 0.8 − 0.85 ]( 0.85 − 0.9 ]( 0.9 − 1 ]FP( 0.75 − 0.8 ]( 0.8 − 0.85 ]( 0.85 − 0.9 ]( 0.9 − 1 ]Figure 14. Kinefold prediciton performance on RF00010, Bacterial ribonuclease P type A ribozyme.620 20 40 60 80 100 120 140Transat( 0 − 5e−04 ]( 5e−04 − 0.005 ]( 0.005 − 0.05 ]Known StructureTransientFinalTP( 0.4 − 0.55 ]( 0.55 − 0.7 ]( 0.7 − 0.85 ]( 0.85 − 1 ]FP( 0.4 − 0.55 ]( 0.55 − 0.7 ]( 0.7 − 0.85 ]( 0.85 − 1 ]Figure 15. Kinwalker prediction performance on RF00094, HDV ribozyme.630 20 40 60 80 100 120 140Transat( 0 − 5e−04 ]( 5e−04 − 0.005 ]( 0.005 − 0.05 ]Known StructureTransientFinalTP( 0.75 − 0.8 ]( 0.8 − 0.85 ]( 0.85 − 0.9 ]( 0.9 − 1 ]FP( 0.75 − 0.8 ]( 0.8 − 0.85 ]( 0.85 − 0.9 ]( 0.9 − 1 ]Figure 16. Kinefold prediction performance on RF00094, HDV ribozyme.640 20 40 60 80 100 120 140Transat( 0 − 5e−04 ]( 5e−04 − 0.005 ]( 0.005 − 0.05 ]Known StructureTransientFinalTP( 0.008 − 0.02 ]( 0.02 − 0.05 ]( 0.05 − 0.1 ]( 0.1 − 1 ]FP( 0.008 − 0.02 ]( 0.02 − 0.05 ]( 0.05 − 0.1 ]( 0.1 − 1 ]Figure 17. RNAKinetics prediction performance on RF00094, HDV ribozyme.650 10 20 30 40 50 60 70 80 90 100 110 120 130 140Transat( 0 − 5e−04 ]( 5e−04 − 0.005 ]( 0.005 − 0.05 ]Known StructureTransientFinalTP( 0.4 − 0.55 ]( 0.55 − 0.7 ]( 0.7 − 0.85 ]( 0.85 − 1 ]FP( 0.4 − 0.55 ]( 0.55 − 0.7 ]( 0.7 − 0.85 ]( 0.85 − 1 ]Figure 18. Kinwalker prediction performance on RF00169, Bacterial Signal Recognition Particle4.5S RNA.660 10 20 30 40 50 60 70 80 90 100 110 120 130 140Transat( 0 − 5e−04 ]( 5e−04 − 0.005 ]( 0.005 − 0.05 ]Known StructureTransientFinalTP( 0.75 − 0.8 ]( 0.8 − 0.85 ]( 0.85 − 0.9 ]( 0.9 − 1 ]FP( 0.75 − 0.8 ]( 0.8 − 0.85 ]( 0.85 − 0.9 ]( 0.9 − 1 ]Figure 19. Kinefold prediction performance on RF00169, Bacterial Signal Recognition Particle 4.5SRNA.670 10 20 30 40 50 60 70 80 90 100 110 120 130 140Transat( 0 − 5e−04 ]( 5e−04 − 0.005 ]( 0.005 − 0.05 ]Known StructureTransientFinalTP( 0.008 − 0.02 ]( 0.02 − 0.05 ]( 0.05 − 0.1 ]( 0.1 − 1 ]FP( 0.008 − 0.02 ]( 0.02 − 0.05 ]( 0.05 − 0.1 ]( 0.1 − 1 ]Figure 20. RNAKinetics prediction performance on RF00169, Bacterial Signal Recognition Particle4.5S RNA.680 10 20 30 40 50 60 70 80 90 100Transat( 0 − 5e−04 ]( 5e−04 − 0.005 ]( 0.005 − 0.05 ]Known StructureAlternative 1Alternative 2TP( 0.4 − 0.55 ]( 0.55 − 0.7 ]( 0.7 − 0.85 ]( 0.85 − 1 ]FP( 0.4 − 0.55 ]( 0.55 − 0.7 ]( 0.7 − 0.85 ]( 0.85 − 1 ]Figure 21. Kinwalker prediction performance on RF00513, Tryptophan Operon Leader.690 10 20 30 40 50 60 70 80 90 100Transat( 0 − 5e−04 ]( 5e−04 − 0.005 ]( 0.005 − 0.05 ]Known StructureAlternative 1Alternative 2TP( 0.75 − 0.8 ]( 0.8 − 0.85 ]( 0.85 − 0.9 ]( 0.9 − 1 ]FP( 0.75 − 0.8 ]( 0.8 − 0.85 ]( 0.85 − 0.9 ]( 0.9 − 1 ]Figure 22. Kinefold prediction performance on RF00513, Tryptophan Operon Leader.700 10 20 30 40 50 60 70 80 90 100Transat( 0 − 5e−04 ]( 5e−04 − 0.005 ]( 0.005 − 0.05 ]Known StructureAlternative 1Alternative 2TP( 0.008 − 0.02 ]( 0.02 − 0.05 ]( 0.05 − 0.1 ]( 0.1 − 1 ]FP( 0.008 − 0.02 ]( 0.02 − 0.05 ]( 0.05 − 0.1 ]( 0.1 − 1 ]Figure 23. RNAKinetics prediction performance on RF00513, Tryptophan Operon Leader.710 20 40 60 80 100 120 140 160 180 200Transat( 0 − 5e−04 ]( 5e−04 − 0.005 ]( 0.005 − 0.05 ]Known StructureAlternative 1Alternative 2TP( 0.4 − 0.55 ]( 0.55 − 0.7 ]( 0.7 − 0.85 ]( 0.85 − 1 ]FP( 0.4 − 0.55 ]( 0.55 − 0.7 ]( 0.7 − 0.85 ]( 0.85 − 1 ]Figure 24. Kinwalker prediction performance on SamRiboswitch.720 20 40 60 80 100 120 140 160 180 200Transat( 0 − 5e−04 ]( 5e−04 − 0.005 ]( 0.005 − 0.05 ]Known StructureAlternative 1Alternative 2TP( 0.75 − 0.8 ]( 0.8 − 0.85 ]( 0.85 − 0.9 ]( 0.9 − 1 ]FP( 0.75 − 0.8 ]( 0.8 − 0.85 ]( 0.85 − 0.9 ]( 0.9 − 1 ]Figure 25. Kinefold prediction performance on SamRiboswitch.730 20 40 60 80 100 120 140 160 180 200Transat( 0 − 5e−04 ]( 5e−04 − 0.005 ]( 0.005 − 0.05 ]Known StructureAlternative 1Alternative 2TP( 0.008 − 0.02 ]( 0.02 − 0.05 ]( 0.05 − 0.1 ]( 0.1 − 1 ]FP( 0.008 − 0.02 ]( 0.02 − 0.05 ]( 0.05 − 0.1 ]( 0.1 − 1 ]Figure 26. RNAKinetics prediction performance on SamRiboswitch.74alignment known transient known finalcanon. bp covar. cons. gap. canon. bp covar. cons. gap.Levivirus 0.881 0.00794 0.639 0.0119 0.946 0.338 0.69 0.0223RF00010 0.936 0.132 0.811 0 0.969 0.478 0.685 0.0245RF00094 0.884 -0.13 0.884 0.0339 1 0.149 0.926 0RF00169 0.886 0.111 0.786 0.0143 0.997 0.391 0.794 0RF00513 0.969 0.195 0.836 0.00690 0.969 0.195 0.836 0.00690SamRibos. 0.895 0.276 0.648 0.0717 0.895 0.276 0.648 0.0717average 0.909 0.0987 0.767 0.0231 0.963 0.305 0.758 0.0209Table 10. Quality measures for known transient features and known final features for all alignments.Percent canonical base pair (canon. bp) indicates the proportion of base pairs across all alignments thatcontain one of the three canonical pairs (A-U, G-U, G-C). Covariation (covar.) measures the relativefrequency of compensatory mutations that retain the base pairing potential, and indicates base pairsthat are functionally important. Covariation ranges from -2 to +2, and it is 0 when the paired columnshave no variation, negative when they contain many invalid pairs, and positive when they containcompensatory mutations. Conservation (cons.) indicates the mean pairwise percent identity betweenhomologous sequence positions at paired columns. Percent gaps (gap.) indicates the proportion ofpaired sequence positions that contain gaps.the alignments is 194. Compared to alignments found in the Rfam database, the alignments we compiledhere feature lower covariation (0.260 as opposed to 0.324), and higher conservation (0.686 as opposed to0.556), with a lower average number of sequences (12 as opposed to 158) [167] [168] [169].2.9.2 Known Transient Features Can Be Predicted Computationally Using Folding Path-way Prediction Programs.In order to assess the ability of kinetic folding programs to predict known transient features in acomparative manner, we define a value for each program that is assigned to each predicted base pairby combining predictions across representative homologous sequences. The known transient and finalstructures are merged into a non-redundant set, against which the predictions from each kinetic foldingprogram are evaluated on base pair level by imposing cutoffs spanning from 0 to 1. The curves in Figure 27exhibit the variance of MCC values for different cutoffs for the three kinetic folding programs.Kinwalker, Kinefold and RNAKinetics have optimal MCC values of 0.676, 0.656 and 0.263, re-spectively. Kinwalker and Kinefold have roughly the same optimal MCC, whereas RNAKinetics hasthe lowest one. This difference is further analyzed in the corresponding ROC curves in Figure 28, whichshows that Kinwalker and Kinefold perform similarly. Figure 28 demonstrates that Kinwalker andKinefold reach similarly high TPR at 0.907 and 0.968, which outscores the maximum TPR of 0.830achieved by RNAKinetics. Overall, RNAKinetics accumulates more false positives at low cutoff val-75ues, resulting in low specificity even with stringent cutoffs. These characteristics of RNAKinetics onthe dataset contributes to its generally lower MCC values in Figure 27.According to the optimal MCC in Figure 27, Kinwalker, Kinefold, and RNAKinetics have MCCvalues peaking at cutoffs of 0.43, 0.755 and 0.0082, respectively. The program-specific cutoffs are appliedto their corresponding predictions to filter out lower quality predictions, and the resultant TPR valuesare shown in Table 11. These cutoffs are derived by optimizing MCC over combined structure comprisingknown transient and known final features, and thus they do not exclusively optimize over the transientor the final structure. Using these cutoffs, 44.4% of the transient base pairs can be detected by the threekinetic folding programs on average. As expected, the TPR for the transient structures is 22.0% lowerthan that of final structure. Specifically, both Kinwalker and Kinefold perform better on TPR forfinal structure than transient structure by 33.4% and 40.3%, respectively. Conversely, RNAKineticsyields approximately similar TPR for transient and final structures with TPR for transient structurebeing only 7.0% higher.This evidence supports that the known transient features can still be predicted with a TPR comparableto that of the known final structures even after subjecting the predictions to the same cutoff. Moreover,this cutoff is the optimal value chosen based on the averaged performance of known transient and knownfinal structures.TPRprogram known transient known finalKinwalker 0.428 0.762Kinefold 0.183 0.586RNAKinetics 0.722 0.652average 0.444 0.667Table 11. True positive rate (TPR) for known final and known transient structural features aspredicted by Kinwalker, Kinefold, and RNAKinetics using the respective MCC-derived cutoff.TPR is a measurement of sensitivity on base pair level, and is defined as TPR = TP/(TP + FN). Theprogram-specific MCC-derived cutoffs are applied to filter the predictions (see Figure 27 for details).2.9.3 Combining Prediction Programs Improves The Prediction Accuracy For KnownTransient And Final Structural Features.To further investigate the performance of the three programs, Table 13 summarizes the TPR andPPV for the 4 structural categories using different combination of the programs. The predictions by thethree programs are first filtered using the cutoff derived from the optimal MCC.Additionally, we explore the performance using the intersection of predictions generated by two pro-grams, resulting in more strict filtering. For this, a base pair is considered a positive when predictedby both programs at the respective MCC-derived cutoffs. Table 13 shows performance for all possible76known transient known final all known novel transientprogram alignment cutoff TPR PPV TPR PPV TPR PPV MCC TPR PPVKinwalker Levivirus 0.43 0 0 0.875 0.792 0.778 0.792 0.78 0 0RF00010 0.43 0.538 0.167 0.679 0.679 0.664 0.698 0.68 0.16 0.114RF00094 0.43 0.258 0.267 0.633 0.463 0.443 0.551 0.49 0.172 0.227RF00169 0.43 0.286 0.0833 0.903 0.56 0.789 0.577 0.67 0 0RF00513 0.43 0.621 0.474 0.621 0.474 0.621 0.474 0.54 0.16 0.2SamRibos. 0.43 0.864 0.919 0.864 0.919 0.864 0.919 0.89 0 0Kinefold Levivirus 0.755 0 NaN 0.396 1 0.352 1 0.59 0 NaNRF00010 0.755 0 0 0.615 0.838 0.549 0.838 0.68 0 0RF00094 0.755 0.258 0.444 0.7 0.677 0.475 0.744 0.59 0 0RF00169 0.755 0 0 0.968 0.909 0.789 0.909 0.85 0 0RF00513 0.755 0.31 0.9 0.31 0.9 0.31 0.9 0.53 0 0SamRibos. 0.755 0.53 0.921 0.53 0.921 0.53 0.921 0.70 0 0RNAKinetics Levivirus 0.0082 0.667 0.0179 0.354 0.072 0.389 0.0875 0.18 0.154 0.018RF00010 NA NA NA NA NA NA NA NA NA NARF00094 0.0082 0.839 0.0875 0.567 0.059 0.705 0.137 0.30 0.379 0.040RF00169 0.0082 0.571 0.0144 0.806 0.0839 0.763 0.096 0.26 0.36 0.032RF00513 0.0082 0.759 0.154 0.759 0.154 0.759 0.154 0.34 0.68 0.14SamRibos. 0.0082 0.773 0.52 0.773 0.52 0.773 0.52 0.63 0.44 0.234Table 12. Detailed performance measures for each alignment using kinetic folding programs atMCC-derived cutoff values. The table includes the cutoff value, TPR defined asTPR = TP/(TP + FN), PPV defined as PPV = TP/(TP + FP ), and MCC defined asMCC = (TP · TN − FP · FN)/√(TP + FP ) · (TP + FN) · (TN + FP ) · (TN + FN). Data are notshown for RF00010 using RNAKinetics because the program was unable to complete.770.0 0.2 0.4 0.6 0.8 cutoff valueMCC0 0.01 0.02 0.03 0.04 0.05RNAKinetics cutoff valueFigure 27. Matthews’ correlation coefficient (MCC) plotted over a series of cutoff values for knowntransient and final structure. A line is shown for Kinefold (blue), RNAKinetics (green), andKinwalker (red). For each coloured line, the additional two lines with the lighter color are theminimum (below) and maximum (above) values of MCC derived from the six-fold cross-evaluation. Thehorizontal axis indicates the cutoff value for the kinetic folding programs, and its scale is shown at thebottom for Kinefold and Kinwalker and at the top for RNAKineticsM˙CC is a measure of bothsensitivity and specificity and is defined asMCC = (TP · TN − FP · FN)/√(TP + FP ) · (TP + FN) · (TN + FP ) · (TN + FN). Optimal MCCvalues are produced at the following cutoff values: Kinefold (0.755), Kinwalker (0.43), andRNAKinetics (0.0082).intersections. Similarly, in the row of ”any two programs”, a TP refers to a base pair that is predictedby at least 2 kinetic folding programs. As one alignment exceeds the limits of RNAKinetics, only theremaining 5 alignments serve as input for this combinative analysis.As shown in Table 13, TPR of each individual program is higher than the TPR of any combinationin which this program is involved, as this results in stricter filtering. Kinefold is an exception, which isdue to the fact that one alignment is removed from the combinatorial analysis.However, the PPV is greatly improved in all 4 structural categories, especially in known transient andfinal structures. For instance, RNAKinetics suffers from low PPV while performing well in terms ofTPR (See Table 13). When RNAKinetics is coupled with Kinwalker or Kinefold the PPV increasesfrom 0.191 to 0.511/0.432 for known transient structures. Moreover, the PPV significantly increases from0.210 to 0.932/0.714 for known final structures. The choice of any two programs tends to maximize TPR780.0 0.2 0.4 0.6 0.8 0.02 0.04 0.06 0.08 28. ROC curves indicating performance for known transient and final features using kineticfolding programs at a broad range of cutoff values. In both plots, vertical axis indicate true positiverate (TPR = TP/(TP + FN). In the left plot, horizontal axis indicates the false discovery rate, i.e. theproportion of predictions which are false (FDR = 1− PPV = FP/(TP + FP ). In the right plot, thehorizontal axis indicates the false positive rate i.e. the proportion of all potential negatives that arepredicted (FPR = FP/(FP + TN). Kinefold is shown in blue, RNAKinetics in green, andKinwalker in red.79programs known transient known final all known novel transientTPR PPV TPR PPV TPR PPV TPR PPVKinwalker 0.428 0.318 0.762 0.648 0.693 0.667 0.0871 0.09Kinefold 0.183 0.378 0.586 0.874 0.501 0.885 0 NARNAKinetics 0.722 0.191 0.652 0.210 0.678 0.231 0.322 0.0769Kinwalker & Kinefold 0.202 0.513 0.53 0.924 0.453 0.934 0 0Kinefold & RNAKinetics 0.138 0.511 0.39 0.932 0.334 0.945 0 0Kinwalker & RNAKinetics 0.284 0.432 0.483 0.714 0.438 0.736 0.032 0.133any two programs 0.347 0.455 0.648 0.760 0.577 0.777 0.032 0.114Table 13. Mean true positive rate (TPR) and positive predictive value (PPV) for all classes ofstructures using kinetic folding programs at MCC-derived cutoff values. Performance measures areshown for Kinwalker, Kinefold, and RNAKinetics alone, in addition to the intersection of all pairsof programs. For this, we consider the set of predicted base pairs as the intersection of the base pairspredicted by each program at its optimal cutoff. TPR is a measure of sensitivity and is defined asTPR = TP/(TP + FN). PPV is a measure of specificity and is defined as PPV = TP/(TP + FP ).while sacrificing certain amount of PPV. This choice generates TPR comparable to that of the individualprogram and such TPR is higher than any of the 2-program combination. In addition, the PPV is higherthan that calculated using individual program.In general, Table 13 demonstrates the efficiency of combining prediction programs in enhancing thePPV while only slightly compromising TPR.2.9.4 There Is Evolutionary Evidence For Additional Transient Heelices.In addition to exploring known transient and final features, we used Transat to identify additionalpotential transient helices that are conserved through evolution. Transat output includes a list ofconserved helices, each with a p-value indicating its statistical support. Helices with 60% or greateroverlap with the known structure are removed, and six helices with the most significant p-values areextracted from the resulting list to produce the set of novel transient helices.Table 14 shows quality measures for the novel transient helices. They contain few invalid base pairs(95.4% canonical base pairs vs. 90.9% for known trans.), and are highly conserved (91.5% percent identityvs. 76.7% for known trans.). However, they show less variation, indicated by higher sequence conservationand significantly lower covariation measure (0.036 vs 0.0987 for known trans.). Covariation indicates therelative frequency of compensatory mutations that retain base pairing potential, and is an indicator forconserved functional structures. Additionally, the novel transient features contain fewer gaps (0.9%)compared to the known transient (2.31%). Thus the novel transient helices predicted by Transat aresupported by significant evolutionary evidence, although they show less covariation than known transientand known final features on average.80alignment canon. bp covar. cons. gap.Levivirus 0.907 -0.0311 0.823 0.022RF00010 0.992 0.033 0.968 0RF00094 0.993 0.0559 0.96 0RF00169 0.996 0.0622 0.961 0RF00513 0.964 0.141 0.87 0SamRibos. 0.872 -0.046 0.905 0.0293average 0.954 0.0358 0.915 0.0085average (known trans.) 0.909 0.0987 0.767 0.0231average (known final) 0.963 0.305 0.758 0.0209Table 14. Quality measures of transient structural features for all alignments. Percent canonical basepair (canon. bp) indicates the proportion of base pairs across all alignments that contain one of thethree canonical pairs (A-U, G-U, G-C). Covariation (covar.) measures the relative frequency ofcompensatory mutations that retain the base pairing potential, and indicates base pairs that arefunctionally important. Covariation ranges from -2 to +2, and it is 0 when the paired columns have novariation, negative when they contain many invalid pairs, and positive when they contain manycompensatory mutations. Conservation (cons.) indicates the mean pairwise percent identity betweenhomologous sequence positions at paired columns. Percent gaps (gap.) indicates the proportion ofpaired sequence positions that contain gaps. Novel transient features comprise six helices predicted byTransat that have less than 60% overlap with the known structure and the most significant p-values.2.9.5 These Additional Transient Helices Can Be Predicted Computationally.Using the same MCC-derived cutoffs established for the known transient and final features, we evaluatethe ability of the kinetic folding methods to identify the novel transient features. Table 15 shows thetrue positive rate for the three folding programs against the novel transient features. RNAKineticsperforms best for this purpose, as it is able to identify 32.2% of the base pairs (i.e. the true positiverate). Kinwalker predicts 8.7% of the features, while Kinefold predicts none of them. On average,they achieve a true positive rate of 13.6%, compared to 44.4% for known transient features, and 66.7%for known final features.Figure 29 shows the performance for novel transient helices on a broader range of cutoff values, asthe MCC-derived cutoffs discussed above are stringent. Kinefold (blue) and RNAKinetics (green)performance is roughly comparable, where RNAKinetics detects 76.5% of novel transient features with7% false positive rate, and Kinefold detects 67.0% of the features with 7% false positive rate. Overall,RNAKinetics achieves higher true positive rate while sacrificing fewer false positives. Kinwalker (red)is unable to detect a large percentage of the novel transient features at any cutoff value. It finds 28.1% ofknown features with 4.0% false positive rate. Although the cutoffs derived from known features generatea small number of high-confidance predictions (table 15), the kinetic folding programs are able to detect810.00 0.05 0.10 0.075 0.125Figure 29. ROC curve illustrating predictive performance for novel transient features using kineticfolding methods at a broad range of cutoff values. Vertical axis indicates true positive rate, a measureof sensitivity (TPR = TP/(TP + FN)), and the horizontal axis indicates false positive rate, a measureof specificity (FPR = FP/(FP + TN)). Note that the axes are plotted on different scales. Each lineindicates performance of one program across many cutoff values (Kinefold in blue, RNAKinetics ingreen, and Kinwalker in red). Novel transient features comprise six helices predicted by Transatthat have less than 60% overlap with the known structure and the most significant p-values.a significant proportion of these features by relaxing the cutoffs and accepting a higher false positive rate.82TPRprogram novel transientKinwalker 0.0871Kinefold 0RNAKinetics 0.322average 0.136average (known trans.) 0.444average (known final) 0.667Table 15. True positive rate (TPR) for novel transient structural features for kinetic folding predictionmethods using the MCC-derived cutoff optimized over known features. TPR is evaluated on base pairlevel, and is an indication of sensitivity: TPR = TP/(TP + FN). Average TPR for known transientand known final features is also shown at the bottom. Novel transient features comprise six helicespredicted by Transat that have less than 60% overlap with the known structure and the mostsignificant p-values.3 Four RNA Families With Functional Transient RNA Struc-turesThis chapter corresponds to my publication in [170], the content of which is mostly pasted in-verbatim here.3.1 MotivationWe know by now that RNA structure formation in vivo involves transient RNA structure elementswhich can not only help to define co-transcriptional folding pathways, but can also play distinct functionalroles of their own [171]. Recent statistical evidence suggests that some transient structures are evolution-arily conserved across homologous sequences thus confirming their potential functional importance [81].It is thus possible to provide entries in Rfam with a more complete structural annotation which shouldin turn allow us to gain a better understanding of the underlying regulatory mechanisms.For each of the four alignments introduced in this study, evidences from previous research (as citedin each section below) show that the formation of the experimentally confirmed transient/alternativestructures is critical to confer the RNA molecule the ability to modulate gene expression or regulateribozyme activity. Moreover, our previous research [81] comparatively integrates the folding pathwaysof RNA homologs including these four families; these folding pathways are predicted by state-of-the-artkinetic folding programs which only take input of one single RNA sequence. This previous research showsthat many RNA homologs encounter similar transient or alternative structures. In spite of being transientor alternative rather than being the dominant conformation, these structures tend to be evolutionarilyconserved. Furthermore, they have implication in constraining the folding pathway for future structuralprediction programs, which calls for more annotation of such transient/alternative structures. However,83only the dominant structure is annotated in Rfam [172]. The Rfam database of RNA families [172]features three of them – Trp operon leader (RF00513), HDV ribozyme (RF00094), SAM riboswitch(RF00162) – but lacks the annotation and alignment for the alternative structures. More specifically,Rfam features the terminator structure of Trp operon leader, but misses the anti-terminator structure;for HDV ribozyme, the active self-cleavage structure is featured whereas the repressive and permissivestructures involving the 5’ flanking sequence of the cleavage site are absent; for the SAM riboswitch,though most of the SAM-bound structure is included the corresponding alignment and annotation forthe terminator hairpin and the SAM-unbound structure are lacked. The Levivirus alignment is new andnot part of Rfam. In order to fill these missing structural annotations, we set up a pipeline involvingInfernal program [141] to structurally align sequences for multiple structures. We here show that itis possible to go beyond the one-sequence-one-structure dogma by providing carefully curated alignmentannotated by both transient/alternative and dominant structures for the four RNA alignments.3.2 Trp Operon Leader3.2.1 Transcriptional Control Of The Tryptophan OperonLayout of the functional domains in trp operon The E.coli trp operon spans approximately7000 nucleotides (nt) which consecutively encode the promoter containing an operator [173, 174], thetranscribed leader region, and structural genes essential for the biosynthesis of tryptophan(trp), i.e.E,D,C,B,A [175]. The leader transcript refers to the 162 nt long untranslated region (UTR) preceding thestructural genes [79]. Along this leader transcript, the ribosomal binding site resides at the 5’ end [176],and an internal transcription termination signal is located distally before the first structural gene E [79,177]. This termination signal, lying within the attenuator, can be recognised by RNA polymerase toproduce a transcript of about 140 nt (137-141) length [78,178]. This shorter transcript generates a leaderpeptide consisting of 14 residues whose distal end has a tandem of trp residues [79]. This leader peptideis involved in the attenuation regulatory strategy where the Trp operon leader is utilised for adaptationto the metabolic condition concerning the biosynthesis of trp [78].Attenuation in Tryptophan operon leader. The trp-activated repressor protein(trpR) is stimulatedupon trp binding to compete against RNA polymerase [179], and consequently switches off the initiationof trp operon transcription [180]. An additional repression mechanism is postulated to operate directlyon the progressing transcription along the starting segment of the operon [181, 182]. Further deletionmutagenesis studies narrow down the regulatory region onto the leader region of the trp operon [183,184]wherein transcription stops at a distal transcription termination site [177, 185]. The transcription andtranslation of this short leader transcript have been demonstrated in vivo and in vitro [78, 186, 187].The corresponding leader peptide is the byproduct of the attenuation mechanism which is essentially aninternal transcription termination signal that is modulated in response to the abundance of metabolitesrelevant to the products of this operon [183]. Consequently, the ongoing transcription of the downstream84operon genes could be regulated accordingly and promptly in an “economical” way [78,183].The attenuation mechanism requires the translation of the leader peptide as shown by the alteredtermination frequency observed in mutants with deficient components or targets of the translation ma-chinery, e.g. tRNA, tryptophanyl-tRNA synthetase and start codon [79, 176, 188, 189]. Whether theribosome stalls during translation and where this occurs is likely to influence this regulatory attenuationmechanism because it depends on the abundance of the corresponding loaded tRNA; given the trailing trptandem observed in this trp operon, the trp codon is found to be the codon responsible for this [79,190].Where the ribosome stalls dictates which alternative RNA structure the leader transcript forms: eitherthe transcription termination hairpin forms or it is disrupted to permit transcriptional read-through [188].In essence, the choice between terminator and anti-terminator structures bridges the communication be-tween translation of the leader peptide and the transcription of operon genes in exchanging the messagefor the abundance of trp [79,191].Dual regulatory systems for the operon expression. When trp is abundant, the cell uses a repres-sive system to promote the synthesis of other amino acids in starvation; here, the trpR repressor operateson the transcription initiation whereas the attenuation operates on the progressing transcription [79].The repression system targets the intracellular trp concentration which depends on the influx trp, thenewly synthesised trp, and finally the consumption of trp for protein synthesis [79]. In contrast, theattenuation measures the concentration of charged tRNATrp, which is contributed by cellular capabilityof protein synthesis, trp concentration, tryptophanyl-tRNA synthetase, and tRNATrp [79]. The intracel-lular trp concentration does not always correlate with the concentration of charged tRNATrp [78]. Thisdual system thus acts in concert to tune the biosynthesis level of trp on a wide spectrum [79,192].3.2.2 Terminator StructureThe terminator structure consists of two hairpins. The 5’ portion of the first hairpin encompassesthe region encoding the trailing residues of the leader peptide: Trp, Trp, Arg, Thr, Ser [79] (Figure 30).Other than the trp tandem required for sensing trp deficiency, base pairs embedded in this region alsoimpose constraints on the sequence composition; indeed, conservation is observed in these five successivecodons and mutating the upstream codons does not alter the operon expression [78, 79, 193–195]. Thesecond hairpin, enriched in G/C and immediately followed by several uracil residues, comprises thetermination signal that attenuates the operon transcription [79,178].The terminator structure forms when no ribosome stalls in the vicinity of the Trp tandem (i.e. Trp orArg codon); that is, either the leader peptide is not translated or the translation proceeds smoothly alongthe 5’ portion of the first hairpin with abundant charged tRNATrp [78,79]. More specifically, the ribosomeis proposed to sterically mask about 10 nts downstream, thus ribosome stalling in either the upstream Glyor further downstream Thr does not disrupt the formation of the termination hairpin [78,79]. Thereafter,co-transcriptional folding only allows the two hairpins to form sequentially; the first hairpin forms rightafter its pairing portions are transcribed, rendering the 3’ portion of the first hairpin unavailable to pair85with the newly synthesised 5’ half of the second termination hairpin [79].Experimental evidence. In the past, the terminator structure responsible for producing the 140 nt-long attenuated leader transcript has been investigated via experimental approaches. Lee and Yanofsky(1977) concluded that the termination efficiency at the attenuator is correlated with the stability of anembedded secondary structure which is conserved between E. coli and Salmonnella typhimurium; theproposed structures agree with results of a partial RNase T1 digestion that exhibit digestion resistancein the distal portion of this transcript, i.e. where the structural features are located [190]. Thereafter,Oxender et al. (1979) conducted structural probing with RNase T1 partial digestion followed by isolationof the co-migrating pairing regions in a non-denaturing gel electrophoresis; the base-pairing regions weresubsequently identified via denaturing gel electrophoresis and fingerprinting, based on which the twohairpins comprising the terminator structure were drawn [191]. Later on, the secondary structure of DNAtemplate was ruled out as a contributor for this termination signal so the RNA structural features are theone causing the termination [78,196,197]. Indeed, the functional importance of the second hairpin for thetranscriptional termination is illustrated by the reduced transcription termination frequency observed inexperiments destabilising the central G+C pairing of this hairpin, such as by in vivo mutational analysis orin vitro substitution of G-C bond by I-C bond [190,198–200]. Moreover, mutational analysis progressivelydisrupting the first hairpin still preserves the production of the attenuated transcript, suggesting thatthe second hairpin itself is sufficient for the termination [78,200].3.2.3 Anti-terminator StructureAnti-terminator structure disrupts the two terminator hairpins. The anti-terminator hairpinis formed by the pairing between the 3’ portion of the first hairpin and the 5’ portion of the secondhairpin from the terminator structure [79].This structure occurs when charged tRNATrp (or tRNAArg) is starving, and the ribosome is impededaround the tandem Trp codon where the 5’ portion of the first terminator hairpin resides [79]. Thisstalling ribosome spans approximately 10 nucleotides downstream and thereby prevents the formationof the first terminator hairpin [79]. Instead, it promotes the base-pairing between the 3’ half of thefirst terminator hairpin and the 5’ half of the second terminator hairpin once they are transcribed [79].Co-transcriptionally, this removes the pairing option held by the 3’ of the second terminator hairpin,rendering it single-stranded [79]. Since the transcriptional termination hairpin is sequestered under thiscircumstance, the progressing RNA polymerase no longer dissociates at the attenuation site, and themRNA encoding the trp operon poly-peptides gets fully transcribed [78, 79]. Nonetheless, the anti-terminator hairpin is speculated to still form occasionally after the ribosome dissociates from the leadertranscript even in the absence of ribosome stalling [78].Experimental evidence. Both the translation of the leader peptide and ribosomal stalling are nec-essary for inhibiting the transcription termination [189]. Moreover, mutational analysis destabilising or86disrupting the base-pairing of the anti-terminator hairpin, e.g. trpL75 mutant, demonstrates increasedtermination of several folds; consistent with the attenuation model, this mutation prevents the relief ofattenuation even with Trp starvation [78, 189]. In contrast, complementary oligonucleotides targetingthe 5’ portion of the first terminator hairpin increase the operon expression, presumably promoting theanti-terminator formation [78, 201]. However, there is no direct experimental evidence confirming thebase-pairing of the anti-terminator hairpin (i.e. structural probing) due to the co-transcriptional nature,i.e. the other two terminator hairpins render the anti-terminator formation infeasible [78].3.2.4 Half-life Of The StructuresThe formation of the alternative structures is determined by whether or not the translating ribo-some is impeded, which must be concomitantly captured by the transcribing polymerase [78]. The timescale of the structural modulation must thus be comparable to that of the transcribing polymerase [79].Evidence supporting this requirement is a transcriptional pausing site located at the end of the firstterminator hairpin which allows time to put the ribosome in sync with the RNA polymerase along thesame transcript [79]. The amount of trp abundance can thus be measured in a timely manner via theproper formation of the alternative structure. Subsequently, either attenuation or read-through occursaccordingly [79].87Promoter          Trp Operon Leader Region                  trpE, trpD, trpC, trpB, trpA Operator MKAIFVLKGWWRTS                Anti-terminator  Structure Terminator  Structure termination hairpin abundant tRNATrp                                                                                                starving tRNATrp  operator Figure 30. Schematic drawing for Tryptophan operon leader. The Trp codon (W) tandem ishighlighted in green color in both the gene structure (the leader peptide sequence is shown) and theRNA secondary structures. The stem of the termination hairpin is colored in red in both terminatorand anti-terminator structures. In the anti-terminator structure, the ribosome, stalling on the Trptandem, impedes the formation of the first hairpin. The 5’ portion (red) of the termination hairpin isthen sequestered in an anti-terminator hairpin. The RNA secondary structures in Figure 30 toFigure 33 are drawn via VARNA [44].3.3 5’UTR Of Leviviridae Levivirus3.3.1 Translational Control Of Maturation Protein In Leviviridae LevivirusPhylogeny and host specificity of family Leviviridae. The family Leviviridae, a prevalent familytargeting gram-negative bacteria, comprises positive single-stranded (ss) RNA bacteriophages with oneof the smallest genome sizes (around 3500 to 4200 nt) [162,202–204]. Most of the family members exhibitinfection specificity for E. coli bearing F pilus receptors; moreover, the family members exploit similar88mechanisms and host factors for replication and translational regulation [162]. This family was proposedto be a monophyletic group with main constituent genera Levivirus and Allolevivirus based on maximumlikelihood and Bayesian estimation using the coat and replicase genes of nine species [162]. Levivirus andAllolevivirus have difference in the genes encoded and the orientation of the open reading frame (ORF),albeit both having four genes [162]. Each of these two genera is further sub-divided into two groupsaccording to their serological cross-reactivity and the characteristics of the virion [162, 202, 205, 206].Among them, MS2 and GA are the typical species of Group I, II of the Levivirus, respectively; Qβ andSP are the representative ones of Group III and IV of the Allolevivirus, respectively [162,207].Structure of 5’ UTR of maturation gene solves the competition between translation andreplication on ssRNA template. MS2, a model organism for Group I coli-phage of Levivirus, hasa sense RNA genome encoding four proteins (starting from the 5’ end): maturation, coat, lysis, andreplicase [70,77]. Both maturation and coat proteins serve as the structural component of the icosahedralvirion, presenting a per-virion ratio of 1:180 [77]. The lysis protein lyses the host cell; the replicase andhost factors comprise a holoenzyme responsible for the replication of strands in both polarities [77]. Asa result of the RNA genome, transcriptional regulatory tools are no longer available and the viral geneexpression is thus regulated translationally to achieve the desired quantity and timing pertaining to theseproteins [77].As Levivirus has a ss RNA genome serving as the template for both translation and replication, theribosome and the replicase tend to compete in binding the template [77]. To solve this conflict, thethree distal genes share a single ribosomal entry site; moreover, the ribosome and replicase share thesame binding site around the start codon of the coat gene [76, 77, 208, 209]. The translation of the threedistal genes is therefore coupled as the binding of replicase and of ribosome is mutually exclusive [76].Nevertheless, ribosome bound to the ORF of the maturation gene could potentially dislodge the replicasetravelling to the 5’ end [76]. To prevent this, transcript folding can exert a translational control by gate-keeping the ribosomal binding site(RBS) in a long-distance interaction (LDI) via an inhibitory upstreamcomplementary sequence (UCS) [77]. This structure prevents the binding of the ribosome and yieldsway to the replicase [77, 210]. The translation of maturation protein is thus usually suppressed. Whenthe formation of the secondary structure is prevented, translation increases [77, 211]. Expression of thematuration protein is key for the viral infection process as it can proteolytically trigger the releasingof viral genome through contacting the F-pili of male; this only requires low copies of the maturationprotein Escherichia coli [77,212]. The virus employs a co-transcriptional folding strategy to postpone theformation of this inhibitory LDI via sequestering the UCS in a metastable hairpin, enabling a transienttranslation of maturation protein [76]. Moreover, studying this translational control of the maturationprotein brings deeper insight into the evolutionary divergence of Levivirus and Allolevivirus in the FamilyLeviviridae [162]. Gene expansion is proposed to have occurred in Allolevivirus, which postpones theformation of the inhibitory LDI and therefore up-regulates the maturation protein; subsequent mutationsaccumulate and restore the virus fitness, resulting in the difference between these two genera [162,213].893.3.2 Final Inhibitory StructureThe final cloverleaf-like structure consists of four hairpins. The cloverleaf-like structure assumedby the 5’ UTR of equilibrated RNA from MS2 was first proposed by Groeneveld et. al. (1995), consistingof a 5’ hairpin, and the downstream West (W), South (S) and East (E) arms [76, 77] (Figure 31). Aninhibitory UCS, immediately 3’ to the 5’ hairpin, pairs with the 7 nt Shine-Dalgarno(SD, AGGAGGU)sequence, forming a LDI [77]. This cloverleaf-like structure, even including the bulge in the bottom ofthe S arm, is evolutionarily conserved between MS2 and KU1 which is a Group II Levivirus that variesfrom MS2 in terms of primary sequence [77].The cloverleaf-like and inactive structure forms once the first 123 nt of the plus-strand is synthe-sised [76]. Meanwhile, the start codon is not replicated yet, and the translation initiation complexrequires the transcript up to the first 145 nt [76]. The maturation protein can therefore not be trans-lated [76] . The RBS of the maturation protein spanning from nucleotide 110 to 145 can also pair witha stretch of downstream sequence to further sequester the RNA molecule in an inactive form [77]. Thisensures that the RBS is not accessible to the ribosome once the inactive cloverleaf structure is formed,otherwise the ribosome could be there to dislodge the progressing replicase [77].Experimental evidence. Phylogenetic analysis reveals the conservation of the cloverleaf structure inthe 5’ UTR sequences of the maturation gene among Levivirus members: Groeneveld et. al. (1995)assembled an alignment consisting of four group I phages (fr, M12, JP501, MS2), and two group IIphages (KU1, GA) [77]. They noticed that the structural features are generally preserved, albeit somevariations in the W arm comparing the two groups and bulge shifting in the S arm comparing MS2 andM12 [77, 214, 215]. They also reported covariation being observed in all arms and, most importantly, inthe LDI; particularly, the amount of covariations present in Group I and II are similar [77]. In addition,they were able to employ the secondary-structure prediction program Mfold (GCG sofware, GeneticsComputer Group, Madison, Wisconsin) to predict the cloverleaf structure [77].Biochemical probing of the structure from a reference MS2 sequence is initialised as well using acombination of DMS, DEP, CMCT, and RNase T1/T2/VT [77]. The resultant probing pattern is consis-tent with the proposed structure in terms of sensitivity to modification or cleavage, including the bulgeregion [77].Functional analyses have been conducted on the components of this cloverleaf structure through aseries of mutants with deletions in variable portions of the arms [77]. Any changes in the synthesis of thematuration protein were thus attributed to the structural feature being mutated because the stability ofthese mutant RNAs was confirmed to be unchanged [77]. Only the base-pairing potential rather thanthe primary sequence composition of the W arm is required for maturation protein translation [77].Moreover, deleting the entire UCS sequence or destabilising the LDI via mutations significantly enhancesthe maturation protein synthesis, which is further evidence for the negative regulatory role of the UCSstrand [77].903.3.3 Transient Permissive StructureThe transient structure consists of a metastable hairpin. After Groeneveld et. al. (1995) andPoot et. al. (1997) proposed a kinetic model to explain the brief translation of maturation protein, aseries of MS2 mutants were designed by Van Meerten et. al. (2001) to progressively locate the kinetic trapthat contributes to the slow folding of the cloverleaf-like structure [76,216]. This temporary kinetic trap,essentially a transient structure, is located in the 5’ UTR [76]. The precise position was further exploredby replacing the W, S and E arms of MS2 arm by arm by the cognate arm from KU1 [76]. Finally,nt 37-45 of MS2, residing in the 5’ segment of the W arm, was identified as the functional sequencecorresponding to the 3’ portion of a metastable hairpin which is conserved among MS2, KU1 and fr [76].This transient hairpin encompasses 4 nts from the 3’ portion of the 5’ hairpin, the UCS, and 7 nts fromthe 5’ portion of the West arm originally in the cloverleaf-like structure [76]. Therefore, it disrupts theinhibitory LDI and thereafter frees the SD sequence, temporarily permitting the non-equilibrated RNAto be captured by the translation machinery for a brief translation of the maturation protein during thelimited time window, i.e. after the synthesis of the RBS but before the LDI forms [76]. Moreover, thistruncates the 5’ hairpin and exposes the G’s at the start of the 5’UTR as ss, which is stipulated by theviral replication; maturation protein complemented in trans is not enough to rescue a mutant with nometastable hairpin, which agrees with this additional role [76].This structure forms only co-transcriptionally during the synthesis of the positive strand from anantisense template, and will be eventually replaced by the mutually exclusive LDI [76]. This requires theribosome to bind the RBS fast enough compared to the formation of the LDI [77], which is demonstratedfeasible [217].Experimental evidence. Firstly, Groeneveld et. al. indirectly tested the kinetic model by delineatingthe maturation protein synthesis contributed from the equilibrium model; the latter model states thatRBS is occasionally freed from UCS during breathing if the fully-bound LDI forms faster than ribosomebinding [77]. Through modulating the stability of the LDI via mutations, they eliminated the possibil-ity of maturation protein being mostly synthesised by the equilibrated cloverleaf conformation [76, 77].Secondly, they directly assessed the kinetic model via adjusting the co-transcriptional time delayed forthe LDI formation, and concluded that this duration is positively correlated with the yield of matura-tion protein [77]. Moreover, computational simulation of the co-transcriptional folding trajectory usingKinwalker does predict both the transient kinetic trap and the cloverleaf structure [70].Mutational analysis also provides evidence supporting the functional importance of this metastablehairpin – the kinetic trap for this kinetic model – for the translation of maturation protein. Mutantswith the metastable hairpin disrupted produce no plaque whereas compensatory double mutation is ableto rescue the fitness of the phage [76]. On the other hand, mutation stabilising the metastable structurevia replacing the bulge by a base pair increases the translation of maturation [76]. Furthermore, bulkevolution of those mutants with no metastable hairpin eventually leads to the restoration of the metastablehairpin [76]. Taken together, they imply the necessity of the metastable hairpin for the synthesis of91maturation protein and thus the infection fitness of the phage.3.3.4 Half-life Of The StructuresIn vitro, the cloverleaf structure requires several minutes to fully fold, whereas a tRNA with compa-rable size and conformation folds only on a millisecond time scale [76] [216]. This further suggests thatthe MS2 5’ UTR folding is delayed by being kinetically trapped in a non-native structure [76]. Giventhat coli-phage replicase proceeds at 30 nt per second (sec), the ribosome can stably bind the maturationprotein start codon as long as the cloverleaf structure is postponed to form by about 1 sec [76] [218].Hence, the proposed translational control is convincing in terms of time.Final Inhibitory Structure Transient Permissive Structure Translation is ON            Translation is OFF  UCS SD South Arm 5’ hairpin East Arm West Arm LDI Figure 31. Schematic drawing for Levivirus (MS2). The UCS is highlighted in red color, the SDsequence is in yellow color, and the start codon is colored in green. In the left structure, the SDsequence is accessible to ribosome for translation. In the right structure, the SD pairs with the UCS,forming the LDI to impede ribosome binding.923.4 HDV Ribozyme3.4.1 Regulation Of The HDV Ribozyme Self-cleaving Activity.HDV. HDV is an RNA satellite virus depending on hepatitis B virus (HBV), and aggravates thevirulency of HBV-causing hepatitis [219]. HDV has a circular and 1700 nt long ssRNA genome, encodingthe delta antigen protein [28,160]. The high self-complementarity of the genome enables it to assumes arod-like structure [220–223]. A double-rolling-circle mechanism is exploited by HDV to replicate via thehost RNA polymerase II, yielding linear multimers of both genomic and antigenomic senses [160,223].Themultimers are subsequently self-cleaved into monomers via a trans-esterification reaction catalyzed by theHDV ribozyme in cis [160,223]. The linear monomers are then ligated into a circular genome via a hostfactor, which harbours the ribozyme-targeting cleavage site again [28,33]. The ribozyme activity is turnedoff in the ligated RNA by interacting a downstream attenuator in order to serve as a template during theupcoming replication cycles [28,224] .HDV ribozyme catalyses the self-cleavage activity. This HDV ribozyme has a fast reaction ratewhich only depends on 85 nucleotides of either genomic or antigenomic RNAs requiring one nucleotidelocated 5’ of the cleavage site; limited variation is observed in this ribozyme [160, 223, 225, 226]. Bothgenomic and antigenomic ribozymes are enriched in G and fold into similar secondary structures, as con-cluded by aligning the genomic and antigenomic sequences in search of sequence and structural similarity,secondary-structure prediction via minimum-free-energy minimisation, and ribonuclease digestion [161].Flanking sequence participates in the regulation of HDV ribozyme self-cleavage. Non-catalytic sequences neighbouring the group I intron of Tetrahymena thermophila can modulate the ri-bozyme self-processing activity through base-pairing a functional portion of the ribozyme sequence [28,227–229]. Consistent with this regulatory model, flanking sequences originated from virus and vector havebeen shown to affect the HDV ribozyme self-cleaving activity [226,230]. The upstream sequence 5’ to theself-cleavage site could thus be involved in regulating the self-cleavage activity of the HDV ribozyme whenthe self-cleavage activity is not desired, but the downstream attenuator not readily available. That is,alternative structures formed co-transcriptionally and being mutually exclusive to the active conforma-tion could potentially temporarily adjust the HDV ribozyme activity [28]. This would require a thoroughinvestigation of the co-transcriptional folding kinetics of the HDV ribozyme incorporating the 5’ upstreamsequence [33]. Here, we summarise the findings in one inhibitory and one permissive structure for theHDV ribozyme self-cleaving activity [28].3.4.2 Active StructureActive conformation consists of 5 stems: P1, P1.1, P2, P3, P4 [160] As shown in Figure 32,theactive double-pseudo-knotted structure assumes a nested structure folding the active site inside a cat-alytic cleft to shield it against the solvents [160]. Coaxial stacking occurs among P1, P1.1 and P4; P293and P3 share another stack parallel to the aforementioned stack [160]. Both pseudo-knots are requiredfor cleavage [160]. The composition of helix P1 can be modulated without affecting the activity as longas the length, base-pairing potential and the G1*U37 wobble base pair are intact [160]. Helix P1.1consists of only two base pairs, but it is critical for rendering the ribozyme into its correct 3D confor-mation, especially the active site responsible for cleavage activity [160]. Mutations breaking the P1.1stem result in a significantly reduced ribozyme activity [160, 231, 232]. The active conformation of thegenomic HDV ribozyme has been examed by X-ray crystallography, ribonuclease probing and site-specificmutagenesis [160].3.4.3 Structure Preventing Self-cleavageThe inhibitory structure consists of Alt1, Alt2 and Alt3. Due to the rapid self-cleavage of theHDV ribozyme, the ribonuclease experiments were performed on the 3’ self-cleavage product rather thanthe precursor [161]. An extended transcript extending from 30 nt upstream of the cleavage site to 15 ntdownstream of the 3’-end, denoted as -30/99 RNA, was found to have extremely diminished activity [28].The flanking sequence kinetically traps the ribozyme during transcription and results in a slow reactionrate, which can be improved by the addition of heat and denaturants to facilitate the formation of theactive conformation [28,161,226].Alt1, Alt2 and Alt3 disrupt the P1.1, P2 and P3 stems in the active conformation, as revealed bythe biochemical, computational, and mutational studies conducted by Chadalavada et. al. (2000); incontrast, P1 and P4 remain the native conformation [28]. P2 is proposed to form prior to the remainingHDV ribozyme and activate both genomic and antigenomic ribozyme, which may explain the resultantinactive conformation [28,233,234]. Alt1 is a 10-bp LDI formed between an upstream inhibitory stretch(nt -25/-15 related to the cleavage site) and the downstream stretch (nt 76/86) [28, 235, 236]. Alt2 is aninteraction between upstream flanking sequence and the ribozyme, and Alt3 is a non-native ribozyme-ribozyme interaction [28].Experimental evidence. Three experimental approaches provide evidence for the inhibitory sec-ondary structure [28]. Firstly, this extended transcript was directly probed via ribonucleases due toits slow self-cleaving rate, and the cleavage results were used to constrain the structural prediction byMfold 3.0, yielding the structure shown in Figure 32 [28]. Secondly, a series of DNA oligomers were usedto rescue the ribozyme activity of this inactive transcript [28]. Among the oligomers, AS1 anneals to theentire upstream inhibitory stretch of Alt1, raising the reactivity rate by 2700- to 20,000- fold; AS2 onlypartially disrupts Alt1, nevertheless, it still accelerates the reactivity by 14-fold [28]. These two oligomershave no additive effect since sequestering either portion of Alt1 is sufficient for disrupting this inhibitoryLDI [28]. Thirdly, mutations were introduced outside the ribozyme to ensure that the observed ribozymeactivity is caused by the stability of Alt1 [28]. Single mutations, destabilising Alt1, exhibit 150-foldincrease in the reactivity rate or even co-transcriptional self-cleavage; double compensatory mutation,restoring Alt1, has similar reactivity to the inactive RNA transcript [28]. Upon addition of AS1, these94mutants show similar reactivity rate to the active ribozyme [28]. Lastly, Alt1 and Alt 3 are conservedamong 21 genomic isolates [28]. These non-native structures are conserved perhaps due to their potentialrole in facilitating the formation of the genome rod-like structure which is required for replication andpackaging [28].3.4.4 Structure Permitting Self-cleavageSelf-cleavage-permissive structure is an upstream hairpin. The permissive structure for the self-cleavage of the HDV ribozyme is mapped to the nt -54/-18 of the RNA transcript using the secondary-structure prediction program Mfold [28]. This structure sequesters the inhibitory stretch of nt -24/-15from Alt1 in a hairpin P(-1) located upstream of the cleavage site [28, 235, 236]. This extended RNAtranscript is demonstrated experimentally to cleave co-transcriptionally [28]. P(-1) has bulges allowingG migrating, resulting in a more stable conformation due to the increased structural entropy [28, 237].Hairpin P(-1) does not appear to interact with the ribozyme domain as the ribozyme has similar self-cleaving activity regardless of whether it is activated in trans (AS1 or AS2) or in cis [237]. However,the P(-1) motif is not found in the antigenomic sequences [28], so only genomic sequences are assembledin our updated alignment for HDV ribozyme.Experimental evidence. Firstly, structural mapping via ribonuclease was used to probe the nt -54/-1fragment instead of the whole precursor transcript due to the fast-cleaving nature of this structure, whichreveals a local hairpin P(-1) pairing nt -54/-40 with -18/-30 [28]. This structure is consistent with thess-count values calculated using Mfold which represent the propensity for a nucleotide position to besingle-stranded [28]. Secondly, evolutionary conservation is found in hairpin P(-1) and the linking regionbetween P(-1) and P1 among 21 genomic HDV RNA isolates [28]. Minimal sequence variation is alsoobserved in this linking region (nt -17/-1) which is pyrimidine-rich and suspected to melt the annealingbetween the nascent transcript and the template, facilitating the subsequent ribozyme folding [28].3.4.5 Half-life Of The Structures.No direct data exists regarding the half-life, but the mechanism proposed for the HDV ribozymeresembles the one utilised by the group I intron [28]. Moreover, in the human HDV-like CPEB3 ribozyme,a similar regulatory mechanism involving the flanking sequence was discovered [33]. As a follow-up study,an equilibrium model is proposed comprising two intermediate and the native fold, which is confirmedby mutagenesis and kinetic characterisation [238]. In this model, the 5’ portion of P2 can base-pair witheither the native 3’ portion or non-native ribozyme sequence made of nucleotides from P1, P3 and single-stranded regions [238]. Given that P2 has a driving role in the correct folding of the HDV ribozyme, thedirection of the shifting of this equilibrium may explain the resultant ribozyme activity [238].95Active Structure Permissive Alternative Structure 1 3 -54                      -30                                0 : self-cleavage site Inhibitory Alternative Structure 2 U 1 2 3 P4 P2 P1 P3 P1.1 P(-1) Alt 1 Alt 3 Alt 2 Figure 32. Schematic drawing for HDV ribozyme. The starting point of each structure on the viralgenome in relation to the self-cleavage site (this site is annotated by an orange arrow) is labeled by thecorresponding number. The 5 stems of the active structure are colored using similar color schemeemployed by Chadalavada et al. (2000) [28]: P1 in blue, P2 in green, P3 in yellow, P4 in dark red, P1.1in purple. The stem P(-1) of the permissive alternative structure is colored in pink. In the inhibitoryalternative structure, Alt 1, 2, and 3 disrupt the native stems of the active structure except P1 and P4.963.5 SAM-responsive Riboswitch3.5.1 Regulation Of Gene Expression Via A Riboswitch.Riboswitches are proposed to be regulatory mechanisms that derive from the RNA world [239, 240].They correspond to non-coding RNA structure elements located in the leader sequence of mRNA strands,which selectively bind certain metabolites to regulate the synthesis of downstream products relevant tothis metabolite [80, 82–84, 241]. This regulatory response is achieved by the coordination between theembedded metabolite-binding aptamer domain and the expression platform which switches between al-ternative structures upon sensing a change in the aptamer domain when a metabolite binds [80, 83, 84].The riboswitch can generally assume two mutually-exclusive structures – one metabolite-bound, and onemetabolite-unbound. This ligand recognition via aptamer can sequentially affect the interaction betweenthe mRNA and the translational or transcriptional apparatus [242]. The structure of the aptamer is typ-ically evolutionarily conserved; for instance, the S-adenosylmethionine(SAM) riboswitch discussed in thispaper is well conserved among bacteria species [80]. In SAM riboswitch, S box is the aptamer where thecoenzyme SAM binds with high affinity, which often precedes a putative transcription terminator hairpin;the binding of SAM triggers an allosteric change that subsequently terminates the transcription [80,243].Overall, a riboswitch assigns the same mRNA both a sensory and an action role without requiring anintermediate [244]. This amounts to a speedy and sensitive responsive mechanism that is able to sensethe conditions of the cellular environment with an accuracy comparable to mechanisms involving proteinfactors [80,244].3.5.2 Sam-unbound Structure.If the SAM is unbound, the anti-terminator sequence can sequester the terminator sequence to pre-vent the formation of terminator and the polymerase can progress through the downstream gene [80](Figure 33). The structure without binding of SAM was derived from in vitro transcription and in-lineprobing using the first 251 nt of yitJ gene in B. subtilis as the model [80,243].3.5.3 Sam-bound Structure.Once the SAM is bound to the conserved core of the aptamer, the anti-terminator is sequestered byan anti-anti-terminator; the terminator sequence is then freed to assume a terminator hairpin to end thetranscription [80,245].Experimental evidence The SAM-bound structure was originally derived from the phylogenetic con-servation observed in an alignment [80,243], and subsequently verified with disruptive and compensatorymutations using a 124-nt long construct (from leader mRNA of yitJ gene in B. subtilis) fused with areporter gene [80], and later determined by X-ray crystallography [246].The proposed transcriptional termination mechanism upon SAM addition was tested in vitro usingthe transcription of 11 DNA templates harbouring the S box [80, 240]. The percentage of transcrip-97SAM SAM-Bound Structure SAM-Unbound Structure terminator  anti-anti-terminator anti-terminator Figure 33. Schematic drawing for SAM riboswitch. For both RNA structures, the stem of theterminator is colored in red, and the stem of the anti-anti-terminator is colored in blue. In theSAM-unbound structure, the anti-terminator is formed by the 3’ portion of the anti-anti-terminator andthe 5’ portion of the terminator hairpin; thus, the terminator hair can no longer form. The pseudoknotis not shown in this drawing.tion termination was compared between presence and absence of SAM, which demonstrated increasingtranscription termination upon addition of SAM [80]. Moreover, all structural features involved in thetranscription termination mechanism (i.e. the anti-terminator, terminator, and anti-anti-terminator) weredirectly confirmed using disruptive and compensatory mutations [80]. Such disruptive mutations desta-bilise terminator and anti-terminator individually or both; in comparison, the compensatory mutationsrestore terminator and anti-terminator individually or simultaneously [80]. The corresponding percentageof SAM-induced transcription termination was compared among these mutants [80]. The results showthat terminator is required for the mechanism to respond to SAM and anti-terminator is critical forrelieving the transcriptional termination [80]. Hence, such mutational analysis supports the functionalroles of those proposed structural features participating in the SAM-induced transcriptional terminationmechanism [80].983.6 Method Of Analysis Pipeline3.6.1 Flowchart For OverviewFigure 34 and Figure 35 summarize the workflow for the assembling and curation of the four alignmentsin this chapter.3.6.2 Primary Covariation Model: Initial Small AlignmentA primary covariation model(cm) is constructed using a small good-quality alignment assembledpreviously by our group [81]; repeat the CM compilation (i.e. build and calibrate in Infernal-1.1rc2)for each of the alternative structures. If the structure contains pseudo-knots, it must be split intonon-pseudo-knotted substructures in order to process it in Infernal; if possible, compatible helices ofsuch non-pseudo-knotted substructures and the remaining alternative structures are combined into onestructure so as to simplify the subsequent curation (e.g. in HDV, a non-pseudo-knotted substructure ofthe active conformation is added to the alt2 structure). The aforementioned pre-assembled alignmentsof Trp operon leader, Levivirus, HDV and SAM riboswitch have 10, 7, 10 and 15 sequences, respectively.In this step, two functions from Infernal [141] are used to make cm files.• cmbuild: we use the default setting, and run the function as :cmbuild -o 〈stockholm file storing the annotated alignment returned by cmbuild〉 -F 〈cm file for theoutput Covariation Model〉 〈stockholm file for the input alignment〉• cmcalibrate: we use the default setting, and run the function as:cmcalibrate 〈cm file for the Covariation Model to be calibrated〉3.6.3 Secondary Covariation Model: Pre-assembled Alignment + Unique Rfam Seed Align-ment SequencesFirstly, the primary CM is used to align the sequences from the Rfam seed alignment (or only partof it, e.g. the alternative structures of HDV are only valid for the genomic sequences, so only genomicsequences are aligned here). This aligning procedure is repeated for the primary CM built from each ofthe alternative structures. Secondly, redundant sequences are removed from this expanded alignment(i.e.sequence from pre-assembled alignment + Rfam seed alignment), which is subsequently manually curatedto optimise the covariation for all alternative structures. Curation is initiated on the alignment generatedby the CM with the structure with the largest number of base pairs, and the other alternative structuralfeatures are mapped onto the same alignment. During curation, Ralee [247] is used to visually comparethe resultant alignments generated by the covariation models pertaining to each alternative structure,and structural overlapping regions are identified and then manually curated. Thirdly, a secondary CM isthus built and calibrated based on this expanded alignment.99 Pipeline in R and Perl 1 Small MSA (ref seq included) curated for all alternative structures  Primary CM: Covariance model for the MSA for struct 1 (cm1), struct 2 (cm2)… struct n (cmn), respectively  Infernal: cmbuild, cmcalibrate Rfam seed alignment sequences, i.e. more representative sequences •Repeat cmalign using cm1, cm2…cmn •Remove duplicated sequences •Curate in Ralee to make one alignment optimized for all structs (according to the cmalign results).  Secondary CM: Covariance model for the Expanded alignment for struct 1 (cm1), struct 2 (cm2)… struct n (cmb), respectively  Expanded alignment = MSA + Rfam seed seqs Seq dataset: Rfam Full alignment sequences (ungapped), or NCBI Taxonomy browser  cmsearch using cm1, cm2…cmn, respectively  Overlapped hits (score>0, span full-length) among the cmsearch results of all structs,  Find common hits Cmalign the overlapped hits using cm1, cm2…cmn Infernal: cmbuild, cmcalibrate 1 For each cm, there is an alignment = MSA + seed seqs + overlapped hits Remove any duplicated seqs from the overlapped hits For each cm, retain new seqs Clustering by primary sequence composition via Ucluster N clusters Figure 34. Flowchart part I for alignment compilation.100N clusters For each struct i, choose the best-fit seq for each cluster from the cmaligned alignment using cmi Struct 1 Struct 2 Struct n Common seqs overlapped among the aligned best-fit from the clusters Rank by base-pairing potential (+ contribution to covariation), insertion relative to reference seq Ranked Common Seqs for each cm •Incrementally calculate the qc measures of the alignment of the ranked common seqs •Choose the best k seqs to add to MSA + Seed seqs, and iteratively curate, trim, cleanup New Alignment  Figure 35. Flowchart part II for alignment compilation.1013.6.4 Expanded Alignment: Select Sequences To Add Into The Original Small MSAOverlapped hits Firstly, the primary CM for each alternative structure is used to search the Rfamfull alignment via Infernal (alternatively, for Levivirus, NCBI Taxonomy browser is used by down-loading the branch of the tree of life in which this non-coding RNA resides; for SAM riboswitch, we startour curation from an alignment provided by Winkler et. al. (2003) [80] and the SAM-bound structureis also annotated by them). A sequence is retained only if the searches using the cm of each individualalternative structure all return this sequence as a hit, and this sequence spans the full length of all thecm without truncated ends. A hit is defined as a candidate sequence that yields log-odds score greaterthan 0 in the search result. Secondly, the secondary CM is used to align the overlapped hits; repeat thealigning step for all secondary CM corresponding to each of the alternative structures. Since the align-ment constructing the secondary CM is optimised for all alternative structure, aligning the hits using thesecondary CM facilitates the following curation among all alternative structures.The search for homologs to add is only conducted on the Rfam full alignment rather than searchingthe NCBI from scratch, which is due to the fact that the qualified sequence to add must satisfy therequirement for all alternative structures. Thus, it would be more efficient but without loss of generalityto start from the candidates fitting well with at least one of the alternative structures, i.e. the Rfam fullalignment.In this step, two functions from Infernal [141] are used to search for and align new related sequencesso as to add them into the high-quality MSA.• cmsearch: we use the default setting, and run the function as :cmsearch –tblout 〈hits returned by the search in a tab format, with score, E-value, and etc.〉 -o〈detailed information about the hits, and the hits are ranked in this file〉 -A 〈alignment of thesearched sequences〉 –verbose 〈cm file for the Covariation Model〉 〈fasta file for database to search〉• cmalign: we use the default setting, and run the function as:cmalign -o 〈stockholm alignment of all the sequences using the given Covariation Model〉 –mapali〈cm file used to align the new sequences〉 〈fasta file storing the new sequences〉Cluster the aligned overlapped hits In order to reduce the sequence redundancy and increasethe diversity, the alignment sequences comprised of the overlapped hits are clustered based on primarysequence conservation via USEARCH [248]; different percentage identity cutoffs are tested to obtainaround 50 to 100 clusters. For each alternative structure, a best-fit sequence is chosen for each clusterusing home-made scripts scoring the covariation. Best-fit sequences overlapped among all of the structuresare identified as common hits. These common hits are then ranked and filtered based on the fit of thesequence to a structure and the extent of insertions relative to the reference sequence, which gives rise toa MSA for each structure.Two functions from USEARCH [248] are used to cluster sequences based on their primary sequence:102• Sort by sequence length:usearch –sort 〈input fasta alignment〉 –output 〈output sorted alignment〉• Cluster according to (ungapped) primary sequence:usearch –cluster 〈fasta file of the sorted alignment〉 –uc 〈the clustered results, with all the cladesand the subordinate sequences〉 –id 〈Cutoff for the primary sequence similarity〉For each RNA family, we tried cutoffs from about 90% to 98% until the clustering step gener-ates about 50 to 100 clusters which is a tangible number of clusters to work on in the subsequentselection step.Criteria for selecting sequences to add We use some home-made perl scripts to rank the alignmentsequences based on their structural fitting in accordance with the reference sequence, contribution to thecovariation, gappiness, and etc. There are some score numbers that we used in the scripts during evalu-ating and ranking the covariation score of sequences. These scores are assigned to one-sided covariation,double-sided covariation, mismatch, gappiness etc. However, due to the difference in the phylogeneticdiversity of alignments, there is no universal set of scores working for them all. Instead, we had toadjust these scores frequently until the resultant alignment is satisfactory, as visually contemplated inRalee [247]. As a starting point for such ranking process, the following set of score could be used ini-tially (and might be substantially adjusted later). Users can then recursively adjust these scores basedon observation using Ralee [247]:• nucleotides of base pair match the reference sequence: +1• nucleotides of base pair has one-sided covariation: +1.5• nucleotides of base pair has double-sided covariation: +2• nucleotides of base pair has (gap-free) non-canonical base pair: -1• nucleotides of base pair has one-sided gap: -1.5• nucleotides of base pair has double-sided gap: -1The structural measures (e.g. covariation, gappiness) of these ranked common hit sequences in thealignment are then incrementally calculated starting from the top sequence for each alternative struc-ture, respectively. Thus, a set of sequences could be chosen systematically to be added to the originalsmall alignment to enhance the alignment quality. Sequences added should maintain a positive overallcovariation score of this alignment, and have few invalid base-pairs, fewest insertions introducing gaps,and no significant redundancy in terms of primary sequence among them.103Curation The cmalign and cmsearch functions generate alignment of which the quality need to beimproved via manual curation in order to serve the subsequent step. The manual curation is operatedon Ralee [247] for each alignment sequence. Visual inspection and manual comparison are conductedto improve the alignment quality in terms of structural fitting.During the manual curation, the goal is to optimally align the sequences against the structure andcorrect any obvious misalignment. This arduous process thus requires judgements based on experimentalevidence derived from the literature consulted for each family. The steps of manual curation are asfollowing:(i) convert the raw alignment generated by cmalign/cmsearch to stockholm format, and display it inRalee [247] which colors the alignment sequences based on their structural fitting against the structure.(ii) zoom into the mismatching nucleotides in each sequence. If it’s obvious that they are misaligned andshould be aligned to the adjoining region, then push them to the neighbouring position. Homologousregions must be aligned based on their primary sequence. They cannot be shuffled around merely tosatisfy the reference structure if they are obvious to be homologous to a neighbouring region and not tothe reference helix region. (iii) cmalign/cmsearch is run using cm files built from each of the alternativestructures for the family, respectively. Thus, each alignment is specific to only one structure. For eachRNA family, repeat (i) and (ii) for all alternative structures, respectively. (iv) for each RNA family,compare the curated alignments (i.e. after (ii)) pertaining to each alternative structure, and identifysequence regions where both of the alternative structures involve. We call such region structural overlap-ping regions. Structural overlapping regions could be manually improved to make them simultaneouslyfit both structures better. This is achieved by correcting obvious misaligned nucleotides, and insertinggap columns if necessary to make this region more flexible to accommodate both structures. (v) oncethe manual curation is done for all the structural regions of the alignment sequences as explained above,we use home-made perl scripts to splice out the unstructured (based on the reference structure) regionof sequences in an alignment, and realign them via MUSCLE [150]. Then the realigned unstructuredregions and the curated structural regions are stitched together. The all-gap columns are subsequentlyremoved.For accuracy checking, the alignment sequences are ungapped and matched against NCBI database tomake sure no base is altered during this curation process. This checking process also ensures the accuracyof the starting/ending coordinates of the sequence on the genome with the given accession number.3.6.5 Reference Sequence And StructureThe calculation of covariation and structure mapping involved in the aforementioned work-flow use thefollowing reference sequence: (1) Trp operon leader: AE005174.2/2263095-2263188, from E.coli O157:H7strain EDL933. (2) Levivirus: GQ153927.1/1-132, from Enterobacterio phage MS2. (3) HDV ribozyme:M28267.1/635-775, isolated from patient with acute delta-hepatitis. (4) Sam: AL009126.3/1258276-1258464, from Bacillus subtilis subspecies. subtilis strain 168.In this study, the structures and the corresponding ungapped reference sequences are directly read off104figures from literature consulted. Such structures are then mapped onto our alignment based on the ref-erence sequence. The consulted papers and the corresponding figures wherein the dot-bracket structuresare extracted are shown in the table below, Table 16:Structure Source literatureTrp operon leader (terminator) Figure 2/4 in Yanofsky (1981) [79]; Figure 2/3 in Kolter and Yanofsky (1982) [78]Trp operon leader (anti-terminator) Figure 2/4 in Yanofsky (1981) [79]; Figure 2/3 in Kolter and Yanofsky (1982) [78]Levivirus (final inactive) Figure 2 in Groeneveld et al. (1995) [77]Levivirus (transient) Figure 1 in Meerten et al. (2001) [76]HDV (active) Figure 1, 2 in Ferre-D’Amare et al.(1998) [160]HDV (Alternative 1) Figure 1 in Chadalavada et al. (2000) [28]Chadalavada2002 HDV (Alternative 2) Figure 6, 7 in Chadalavada et al. (2000) [28]SAM (SAM-bound) Figure 5 in Winkler et al. (2003) [80] and Rfam [172]SAM (SAM-unbound) Figure 5 in Winkler et al. (2003) [80] and Rfam [172]Table 16. The source literatures where the transient/alternative and dominant structures are derived.For each literature, the exact figure number where the dot-bracket structure and the representativesequence are read off is clarified in this table.3.7 Results For New Rfam Families3.7.1 AlignmentsWe summarise the key features of our alignments in Table 17 and Table 7. A small multiple sequencealignment (MSA) from a previous project published by us [81] was used to build a primary covariancemodel (CM) using the Infernal program [141] to search the full alignments in Rfam [172] and theNCBI Taxonomy Browser [249]. In our previous research [81], these small MSA were compiled withevolutionarily related sequences for these four RNA families, respectively. These small MSAs are high-quality with positive covariation, and few gaps, which made them a convenient starting point to build thisprimary CM. For each of our four alignments, we aligned the sequences of the seed alignment in Rfamwith the primary CM first and then curated this against all experimentally verified structures which werefirst mapped to a reference sequence. We then build a secondary CM incorporating sequences from theRfam seed alignment which we used to align the hits returned by the search with the primary CM. Thesehits were clustered to reduce primary sequence redundancy, choosing a representative sequence for eachcluster based on structural fit and covariation. Hits passing this selection strategy were retained onlyif they contained all alternative structures. We then calculated quality measures for the resulting hitsto determine which ones to add to the existing small MSA. The resulting alignment was then curatedrecursively according to the aligning results returned by Infernal using the secondary CM for each of105the alternative structures.3.7.2 Mapping StructuresEach of the alternative structures is mapped to the curated alignments via the respective referencesequence, see Table 17 and Table 7 for the structure-specific quality measures. All quality measures forall new alignments show an improvement with respect to the corresponding Rfam alignments, if theyexist, see Table 17. As the numbers in Table 7 show, the evolutionary evidence supporting the alternativestructures is strong and comparable to those supporting the nominal structural features, see Table 17.This is also illustrated in the arc-plots shown in Figure 36 to Figure TablesStructure Covariation No. BPs Frac. Canonical BPs No. Seqs align. lengthTrp operon leader (Rfam) 0.0653 21 0.8766 22 127Trp operon leader (new) 0.2830 23 0.9505 29 131HDV (Rfam-genomics) 0.0557 31 0.9821 18 115HDV (Rfam) 0.3432 31 0.9677 33 115HDV (new) 0.0868 30 0.996 25 155Levivirus (new) 0.2799 48 0.9318 11 169SAM (Rfam) 0.2980 27 0.9421 433 231SAM (new) 0.2417 54 0.8627 85 425Table 17. The basic alignment statistics for our new alignments and the corresponding seedalignments in Rfam, if it exists. The structural quality measures allow a swift comparison between thestructural annotation of our new alignments and that of the corresponding seed alignments in Rfam.For the Trp operon leader, the quality measures are calculated for the terminator structure; for HDVribozyme, the quality measures are shown for the active structure, and provided for both the originalRfam seed alignment and the extracted genomic sequences from the seed alignment (Rfam-genomics);for the SAM riboswitch, the structure included is the SAM-bound structure. Since the levivirus is anew RNA family introduced here, only the statistics for our new Levivirus alignment is shown. Valuesfor the covariation range from -2 to 2 and measure the relative frequency of compensatory mutationsmaintaining the base-pairing potential. A positive covariation implies the presence of compensatorymutations. No. BPs refers to the number of base pairs in the corresponding structure. Frac. CanonicalBPs is the fraction of canonical base pairs in the alignment for the aforementioned structure. No. seqsis the number of sequences in the alignment. Align. length is the length of the gapped alignment innucleotides. The alignment statistics are calculated using R-chie [166].106Structure Covariation No. BPs Frac. Canonical BPsTrp operon leader (anti-terminator) 0.0635 10 0.9069HDV (Alternative 1) 0.0437 37 0.9459HDV (Alternative 2) -0.0692 13 0.8677Levivirus (transient) 0.1805 14 0.9286SAM (SAM-unbound) 0.1449 40 0.8165Table 18. The alignment quality measures for the transient structural features for our alignments. ForTrp, the numbers refer to the anti-terminator structure. For HDV ribozyme, Alternative 1 refers to theself-cleavage-inhibitory alternative structure and Alternative 2 to the self-cleavage-permissivealternative structure. For the Levivirus alignment, transient refers to the metastable hairpinspermitting the temporary translation of maturation protein. For SAM riboswitch, the structure is theSAM-unbound structure. Please see the caption of Table ?? for other definitions. The basic alignmentstatistics, such as the alignment size or length, are also part of Table ?? as all the alternative structuresshare the same alignment.3.7.4 Figures Of The Arc-plots1070 10 20 30 40 50 60 70 80 90 100 110 120 130( 0 − 0.25 ]( 0.25 − 0.5 ]( 0.5 − 0.75 ]( 0.75 − 1 ]ConservationCovariationOne−sidedInvalidUnpairedGapAmbiguousFigure 36. Arc-plot of Tryptophan operon leader made using the visualisation program R-chie [166].The left legend specifies the percentage of canonical base-pairs in the paired alignment columns, i.e.those connected by an arc. The right legend specifies the evolutionary support (e.g. covariation etc.) foreach position in the alignment. The alignment and arcs at the top show the information for theterminator structure, whereas the bottom ones correspond to the anti-terminator structure for the sameunderlying alignment. The lines in each alignment correspond to the respective sequences with everybox representing either a nucleotide or a gap in the respective sequence. Every arc represents abase-pair involving the respective two alignment columns. The arcs are colour-coded according to theirpercentage of canonical base pairs, whereas the evolutionary information supporting each base pair isencoded in the colouring of the underlying nucleotides in the base paired alignment columns, see theright legend for details. Two green blocks connected by an arc implies that this is a canonical base pair(i.e. GC, AU or GU) which corresponds to the most abundant type of base pair for this pair ofalignment columns. Cyan means this is a canonical base pair but that it has an one-sided mutationwith respect to the most abundant (green) base-pair. Blue refers to a canonical base-pair which differson both sides from the most abundant (green) base-pair. Red means this is a non-canonical base pair.Unpaired nucleotides are shown in black and gaps in grey.1080 20 40 60 80 100 120 140 160( 0 − 0.25 ]( 0.25 − 0.5 ]( 0.5 − 0.75 ]( 0.75 − 1 ]ConservationCovariationOne−sidedInvalidUnpairedGapAmbiguousFigure 37. Arc-plot for the Levivirus alignment. The alignment and arcs at the top correspond to thefinal inhibitory structure, whereas the bottom ones correspond to the transient structure permissive formaturation protein translation. See the caption of Figure 36 for more information on arc-plots and twolegends.1090 20 40 60 80 100 120 140( 0 − 0.25 ]( 0.25 − 0.5 ]( 0.5 − 0.75 ]( 0.75 − 1 ]ConservationCovariationOne−sidedInvalidUnpairedGapAmbiguousFigure 38. Arc-plot for the HDV ribozyme. The alignment and arcs at the top show the permissivealternative structure 2 and the active structure. The alternative structure 2, P(-1), is the 13-bp hairpinon the left side on top of the arc diagram. The bottom arcs correspond to the inhibitory alternativestructure 1. See the caption of Figure 36 for more information on arc-plots and two legends.1100 50 100 150 200 250 300 350 400( 0 − 0.25 ]( 0.25 − 0.5 ]( 0.5 − 0.75 ]( 0.75 − 1 ]ConservationCovariationOne−sidedInvalidUnpairedGapAmbiguousFigure 39. Arc-plot of SAM riboswitch. The top covariation and arcs correspond to the SAM-boundconformation and the bottom ones refer to the SAM-unbound conformation.See the caption ofFigure 36 for more information on arc-plots and two legends.1114 Conclusion, Discussion and OutlookIn this work, we proposed a methodology to extract the co-transcriptionally formed, and evolution-arily conserved transient structures from the structural prediction – kinetic and comparative – of analignment of homologous sequences. The program-specific cutoffs are also derived to prioritize such pre-dicted transient features to guide further experimental step. These efforts are to address the graduallyprominent role played by transient structures as revealed in recent research on RNA co-transcriptionalfolding, aided by state-of-art technologies [71]. For homologous sequences, we propose that certain tran-sient structures tend to be conserved to channel the RNA transcript down the similar folding pathway.Such co-transcriptional folding pathway regulates the the gene expression in various ways. For instance,the transient structure for the 5’ UTR of the Levivirus, formed co-transcriptionally, temporarily enablesthe RNA transcript to initiate translation before the more thermodynamically-favoured structure inacti-vates the translation [76]. In another instance, as seen in riboswitches, the transcript assumes alternativeconformation once the metabolite binds the aptamer; thus, the metabolite-binding pocket, i.e. the ap-tamer, must be correctly pre-formed for this mechanism to function. However, as exemplified in adenineriboswitch, the stable formation of such aptamer structure strongly relies on the co-transcriptional fold-ing to compete against the alternative(terminator) conformation; without the co-transcriptional folding,only the terminator conformation forms due to thermodynamic favouritism [250]. Moreover, transientstructures formed along the transcritpional timeline is critical for the profile of the co-transcriptionalpausing dynamics of RNA polymerase [251].4.1 Summary4.1.1 The Pipeline For Comparative Prediction Of Transient Structural FeaturesA dataset composed of six alignments and a pipeline are introduced in Chapter 2.Dataset The dataset compiled in this work constitutes currently the largest set of alignments anno-tated with both known – experimentally verified – transient/alternative and final/dominant structures.These structural alignments are curated to optimize the final/dominant structure while being obliviousof the transient/alternative structure (except the obvious misaligning errors, which are fixed for thetransient/alternative structures).Pipeline Subsequently, we devised a pipeline combining the three Kinetic co-transcriptional foldingpathway simulation programs – namely, Kinwalker, Kinefold and RNAKinetics– and Transatto comparatively analyse the homologous sequences from the aforementioned dataset, in order to eval-uate the prediction of the known structural features and to propose novel transient helices. Thereafterprogram-specific cutoffs are derived from the prediction performance of these programs, based on bench-marking against all the known structural features. Thus, the predicting sensitivity is tuned towards anintermediate level between known transient and known final structures. These cutoffs are then applied112to filter the prediction from the three folding simulation programs, purposing to evaluate the detectabil-ity of the known transient, known final and putative novel transient structures proposed by Transat,respectively.4.1.2 The Pipeline For Annotating An Alignment With Both Final And Transient Struc-turesIn Chapter 3, four alignments out of the six alignments from Chapter 2 are significantly expandedand further curated to match up to the quality standard of Rfam.Dataset The updated four RNA families consist of the Trp operon leader, the Levivirus 5’ UTR, theHDV ribozyme, and the Sam riboswitch, which all have transient structures conveying important func-tions vital to the regulation of gene expression in their organisms. The diverse functional roles of thesetransient structures comprise transcription regulation (Trp operon leader and SAM riboswitch), trans-lation regulation (Levivirus 5’UTR), and self-cleavage (HDV ribozyme). These alignment are thereforeannotated by both known transient/alternative and final/dominant structures, which are shown to beevolutionarily conserved. Our four alignments and structural annotations either significantly update andextend the existing Rfam family or introduce a new RNA family.Pipeline To aid the procedures for expansion and annotation, a pipeline integrating the Infernalprogram [141] is also set up to structurally align sequences for multiple structures.4.2 ConclusionThe known transient/alternative structures are evolutionarily conserved in terms of their compatibil-ity in consensus base-pairing and their feasibility to occur along the co-transcriptional folding pathway,as revealed by the structural quality measures across the entire dataset and the evaluation on the de-tectability of such structures through our pipeline. Generally, these transient/alternative structures arepreserved with similar strength compared to the known functional final/dominant structures, but withless covariation. Moreover, according to the detectability evaluation of these labile structures in ourpipeline, a significant proportion of the known transient structures are still retained in the predictioneven after filtering through our program-specific cutoffs. In comparison, the novel transient structuresare evolutionarily conserved, but show less covariation. In contrast to the known transient/alternativestructures, the novel transient structures are predicted with a lower sensitivity in our pipeline when fil-tered by the same program-specific cutoffs. Nevertheless, this sensitivity could be tuned by relaxing thecutoffs to suit the quality of the alignment interrogated.Moreover, the performance arc plots constructed in this study could provide insights into the searchof novel transient structures that deliver critical functions to the organism. For instance, as shown inFigure 23, four novel transient helices (arcs with dotted lines above the alignment) predicted by Transatare also significantly predicted by RNAKinetics (see the bottom arcs with dotted lines, in mirror image113to the top four arcs). These four helices all have mutually-exclusive base-pairs disrupting the formationof the Alternative 2 conformation (top black arcs). Especially the novel transient helix predicted byTransat with the lowest p-value (the top purple helix, in dotted line) is able to sequester the 5’ portionsof the two helices in the Alternative 2 conformation; since the Alternative 2 conformation contains thetermination hairpin (see Figure 30), this purple helix could be a putative alternative structure in additionto the known anti-terminator structure (i.e. Alternative 1 conformation). Interestingly, this purple helixis also predicted by RNAKinetics with the highest averaged probability among all the kinetic prediction.Hence, this agreement between Transat and RNAKinetics suggests that this evolutionarily conservedhelix predicted by Transat is also kinetically feasible to form along the folding pathway given all theco-transcriptional constraint. This purple helix, furthermore, impedes the formation of the Alternative1 conformation (the top red helix in Figure 23). The 5’ portion of this purple helix encompasses the trptandem where ribosome will stall in case of insufficient charged tRNATrp. Taking all this context intointerpreting the role of this purple helix, we can suggest: (i) the purple helix is mutually-exclusive withthe Alternative 1/2 conformation, and can thus transiently preclude the leader transcript from beingtrapped in either Alternative 1 or 2 conformation in prior to the arrival of the translating ribosome, (ii)once the ribosome – proceeding towards the 3’ end – arrives, it breaks apart the labile purple helix, (iii)now the transcript is unfolded temporarily, so both the terminator structure (Alternative 2 conformation)and anti-terminator structure (Alternative 1 conformation) can have a fair chance to initiate, respectively,(iv) thereafter which conformation forms will entirely depend on whether the ribosome stalls at the trptandem.However, this is merely our hypothesis proposed by such arc plot, pending further experimentalverification.4.3 Discussion4.3.1 The Transient Structures Have Slightly Lower CovariationThe transient structures, including both the known ones verified by experiments and the putative novelones predicted by Transat, exist only during or as a result of the co-transcriptional folding pathway.They are conserved evolutionarily but with less covariation. This lack of covariation could be attributedto the fact that the nucleotide positions of the known transient/alternative structures are frequently co-used by the known functional final/dominant structures. These overlapped positions reflect the guidingpurpose of the transient structures: e.g., some transient helices sequester the 5’ portion of one finalfunctional helix, which protects it from being trapped in other non-native helices until releasing it uponthe synthesis of the correct 3’ portion. The transient structures are only folded co-transcriptionally andare eventually replaced by the mutually-exclusive final structures, which imposes extra constraints on thenucleotide composition of the positions overlapped by both structures. Thus, mutation in these positionsis non-functional unless it satisfies the canonical base pairing requirement for both transient and finalstructure, which results in the lower covariation observed.1144.3.2 Predicting Tangible Amount Of Functional Transient FeaturesApart from the annotated known transient structures, perhaps there are a lot other transient butfunctional structural features remaining unidentified. Hence, the Kinetic folding pathway simulationprograms are usually employed to suggest other transient structures. However, such simulation programs,especially the stochastic ones, often yield vast amount of predicted structural features, which is unrealisticto follow up through labour-intensive experiments. It is thus of interest to prioritize the predictedstructural features. which is addressed by our pipeline. Our pipeline simulates the homologous sequencesin parallel via Kinwalker, Kinefold and RNAKinetics, the prediction of which is averaged across allhomologous sequences to accentuate the evolutionarily conserved structures occurring along the foldingpathway. This approach enhances the probing specificity for transient structural features, and narrowsthe predictions down to a tangible number to facilitate the downstream experimental work. The probingspecificity can be further improved via combining the three Kinetic folding pathway simulation programsby retaining only the intersections of the program-specific predictions. The specificity thereby increasesby several fold while the sensitivity is preserved at a comparable level. This combinatorial step facilitatesthe discovery of putative structures when given an alignment without any known structural features.4.3.3 Annotation Of Known And Functional Transient/Alternative FeaturesIn accordance with the current schema in Rfam, we provide structural annotation dedicated for eachstructure of distinct functional role for four RNA families (see Chapter 3). We hope to make that thecase that the structural annotation of any RNA family ought to naturally also comprise functional tran-sient/alternative RNA structures. Before being incorporated into Rfam, such transient/alternative RNAstructures require further experimental evidence supporting their structure and functional roles. However,as shown in the cited literature for the four families, explicitly translating experimental evidence intoa structural annotation of an RNA family is not always a straightforward task for transient/alternativestructures since such ephemeral structures are hard to obtain experimentally. For another instance,ribonuclease probing on nucleotides shared by alternative structures may display only minor hits, ren-dering the interpretation equivocal [161]. Moreover, the manual curation process, as shown here, islabour-intensive, which requires expertise in the families.Including functional transient/alternative structures for one sequence also entails some adjustments inRfam as it breaches the one-sequence-one-structure schema used by Rfam to extend seed alignments intofull alignments; i.e. the current computational analysis pipeline of Rfam only takes one covariance modelfor searching homologous sequences and making structural alignment. Now it will be more difficult as morethan one covariance models are required to capture both the final/dominant and transient/alternativeRNA structures of a given RNA family. A concerted effort using a variety of experimental techniques aswell as more sophisticated computational analysis pipelines will thus be required to arrive at completestructural annotations containing functional transient/alternative structures.1154.3.4 Some Odd But Interesting Cases Of Conserved Transient StructuresWhile curating the alignments for transient/alternative structures, it is observed that the helices canshrink or expand in homologous sequences, and sometimes their position shifts along the sequence; theodd cases for the shifting phenomenon are illustrated in Figure 40. These odd cases provoke questions onhow evolutionarily conserved the transient/alternative structures are, in terms of the helices themselvesand how the helices are positioned in relation to each other. In the top plot of Figure 40, the circled regionon the RNA sequence must satisfy the base-pairing requirement for both of the 2 alternative structures(i.e. the 5’ blue helix, and the green helix). In the bottom plot of Figure 40, the circled region is no longerrequired to simultaneously provide pairing partner for both alternative structures since the green helixshifts towards the 3’ end. Such subtle changes won’t affect the regulation mechanism as the formation ofthe termination hairpin is still mutually-exclusive to the alternative helix. The change in relative positionof the alternative structures seems specific to the nature of its sequence: this could be due to a changein the Trp tandem pattern in the alignment of Trp Operon Leader, or other unknown reasons in theSAM riboswitch. For instance, in Trp Operon Leader alignment, most sequences from the Rfam seedalignment fit the given structural annotation except some odd cases from the species in genus Vibrio. Infact, these odd cases have a different pattern regarding the Trp tandem, i.e. they have two Trp tandemsseparated by an interval whereas the other cases (conforming with the structural annotation) have onlyone Trp tandem (2 consecutive trp codons). After searching Pfam [252] for the leader peptide of the TrpOperon Leader, three types of leader peptide are found:(i) the leader peptide corresponding to our reference sequence (one trp tandem composed of 2 trpcodons) is in PF08255(ii) there are another 2 types of trp leader peptide, one has 3 trp codons in one Trp tandem, inPF08055(iii) and another one has 2 trp codons in two separated Trp tandems in PF08056, similar to the oddcase. This may explain the difference observed in the layout of the 2 alternative structures.Thus, for some odd cases, the helices structure is conserved (with abundant covariation), but the helixlocation in relation to each other undergoes changes. Such conservation can not be captured in alignmentrepresentation since alignment representation stipulates that not only the structural components but alsothe relative position of the multiple structures need to be conserved.4.4 Outlook On Future Research Direction Of Transient Structural Features4.4.1 Expanding The Pipeline For Efficient Prediction Of Transient StructuresThere are other Kinetic folding pathway simulation programs with distinctive underlying algorithms,e.g. Kinfold [134, 253]. Therefore, our pipeline could be further extended by incorporating the predic-tions from these programs. This will enable analysis of the predicted helices from different perspectives,such as zooming into the set of helices that satisfy each constraint imposed by the corresponding program,e.g. thermodynamically or stochastically. Looking at the intersect of predictions essentially filters out116Figure 40. Odd cases of transient/alternative structures. All the black lines represent the same RNAsequence. The arcs represent the helices. The bottom green arcs are the transient/alternative helices inrelation to the final/dominant helices represented by the top blue arcs. The 3’ blue helix could be atermination hairpin.more FP and enhances the specificity. Alternatively, we can look at the union of the predicted helices withsignificant score, and compare them against the prediction from Transat to search for more putativenovel transient helices.Finally, it is important to train the pipeline using a larger dataset with more annotated structuralfeatures. This will allow us to compile a set of program-specific cutoffs that are more representative ofsuch data in general, which could avoid truncating the sensitivity when an inappropriately high cutoffis imposed. The improvement in both specificity and sensitivity could more accurately guide the futurediscovery in novel transient structures.4.4.2 Utilizing The Transient Structures Predicted From Our Pipelines Part I: BetterAlignment And More Efficient Prediction Of Final Functional StructureRfam alignments are the primary source used by structural prediction programs to benchmark theirperformance, from which program parameters can be trained and insightful extrapolation could bedrawn [3]. However, the alignments in Rfam are only curated according to the final/dominant functionalstructure. Given the importance of transient or alternative structures, we could also align homologoussequences based on alternative structural features. All structural features – transient, alternative, andfinal/dominant functional structures – can be combined to help juxtapose the homologous sequencesmore accurately, in contrast to aligning them merely with the final structure. The new set of alignmentscurated to all known structures, provided in this study, are thus helpful for future prediction program intheir benchmarking.Each nucleotide can base pair with alternative partners due to the multiple forms of canonical base-pair, so one RNA sequence can theoretically fold into many distinct structures and engage in countless117folding trajectories. Different folding trajectories may end up with different final structures. How doesone RNA molecule reach its correct final structure efficiently? The transient structures may serve as thelamp posts lighting up the favoured folding pathway that leads to the correct final structure. Thus, it iscritical to predict some of these functional transient structures in order to search for the correspondingfolding pathway(s). This can help reduce the computation space for RNA structure prediction by puttingconstraints on the folding pathway, so there will be less resultant final structures to select from.4.4.3 Utilizing The Transient Structures Predicted From Our Pipelines Part II: DecodeThe Pivotal Point In EvolutionThe transient/alternative structures will also bring insights into understanding evolution since theselabile structures reflect the distinct selection criteria imposed by a structured RNA during different stagesof evolution [10]. As proposed by Mathews (2010), in early stage, thermodynamic constraint plays thedeterminant role and the cell invests multiple approaches to stabilize the newly emergent functionalstructure [10]. These approaches – e.g. features along the folding kinetics pathway or chaperon protein– help the RNA molecule to efficiently achieve its final functional structure [10]. Thus, detecting the co-transcriptional transient structure aiding the folding kinetics could help decode the evolutionary processof RNA.A fixed RNA sequence (genotype) can give rise to multiple structures (phenotype) as a result ofalternative base-pairing groups, namely, phenotypic plasticity [10, 254]. In comparison, one particularRNA secondary structure could also be formed by multiple distinct RNA sequences, called neutrality(Fontana 2002 in [10]). Speaking about neutrality in the context of evolution, the accumulated mutationswill not disrupt the function of this RNA molecule so long as the structure is conserved [10]. Thisphenomenon of multiple genotype sharing the same phenotypes is called neutral drift [10]. In the longrun of evolution, the combination of neutral drift and phenotypic plasticity enables the RNA to transformamong alternative structures while retaining the original functional structure [10]. Thus, this evolvingRNA can try out different genotypes before it innovatively and accumulatively mutates into a genotypethat is capable of folding into not only the current structure, but also a more optimal structure [10,255]. Moreover, such structural innovation with steady neutral drift occurring at intervals can facilitateribozyme or riboswitch design to equip one RNA molecule with dual functions [10, 256, 257]. Suchphenomenon and application highlight the importance of finding transient/alternative structures in agiven RNA sequence, which could be achieved using our approach introduced in this study.4.4.4 Utilizing The Transient Structures Predicted From Our Pipelines Part III: PredictTranscriptional Pausing SiteRNAP pausing influences RNA structural formation The RNA polymerase(RNAP) pausing,an ubiquitous phenomenon in both prokaryotic and eukaryotic transcription, renders the transcriptionprocess non-continuous [258]. This pausing behaviour of RNAP could be explained from both structuraland energetic perspectives, e.g. the rearrangement of RNAP interacting sites, or the interaction associated118with the DNA template, or the binding of trans molecules, or barrier such as nucleosomes [258]. Thepaused RNAP may be the target of various regulatory process, and the elongation is halted so as tocoordinate various co-transcriptional events [258]. For instance, in prokaryotes, pausing synchronizestranscription and translation, which allows regulatory factors to fit in [258]. The RNAP pausing alsochaperones the RNA structural formation, as exemplified by its role in preserving the 5’ portion of along-range-interaction(LRI) from a functional RNA molecule in a non-native transient structure [75].Moreover, strategic pausing site can facilitate one structure to form whereas blocking the other, whichis utilized by riboswitch in the kinetic control for the decision making step; a pausing site, at the end ofan aptamer, favours the formation of a structure that does not involve nucleotides beyond the pausingsite [259]. In riboswitch, pausing also synchronizes the formation of the aptamer domain (involvingtertiary structural folding ) and the downstream expression platform (involving only secondary structuralformation) [259]. The aptamer domain is more time-consuming to fold, so pause sites are frequently foundimmediately downstream of the aptamer domain to allow extra time for tertiary structural folding; thisalso avoids the interference from the downstream transcript in the folding of the upstream structure [259].There are two classes of RNAP pausing Class I of RNAP Pausing (mainly in enteric bacteria) isinduced by the secondary structure of the upstream nascent RNA where a hairpin forms and interactswith the RNAP [251]. Such interaction between RNAP and the upstream RNA structure results in thedisplacement of the 3’ OH of the nascent RNA from the catalytic center, thus preventing the addition ofNTP [251]. This depends on whether this hairpin is in the immediate vicinity –around 10-12 nucleotides–of the very 3’ terminal of the nascent RNA; if so, the paused conformation can be reinforced so theelongation is interrupted [260]Class II of RNAP Pausing (mainly in eukaryotic cells) occurs as a result of the RNAP backtrackingin which RNAP is threaded backwards along the upstream template DNA and RNA for about 10 or morenucleotides [251]. Such backward movement blocks the active site on RNAP with the newly-synthesized3’ end from the nascent RNA [251].The secondary structure of the nascent RNA not only participates in inducing the RNAPpause, but also helps to attenuate the pausing to resume elongation [261] In Class I, thehairpin formed by the upstream nascent transcript is required to initiate the RNAP pausing. In contrast,the nascent RNA structure is critical in class II to recover the RNAP from the pause by serving as akinetic barrier sitting upstream to block the backtracking RNAP [262]. For instance, in region with in-creasing GC composition, the pausing frequency and duration decrease accordingly since stronger kineticbarrier can form to stop the backtracking RNAP [261]. When the nascent transcript is degraded byRNase, the pausing recovery facilitated by the GC-rich nascent transcript is removed and the pausingfrequency/duration resumes [261]. Moreover, the energy barrier profile calculated by Kinefold matcheswell with the energy barrier profile constructed by the observed frequency and duration of RNAP paus-ing [261].119RNAP pausing prediction programs Two main programs have been published thus far for the pre-diction of pausing sites and pausing profile, namely, Bai (2004) [263] and Tadigotla (2006) [262]. Thesetwo programs are both based on the general thermal ratchet model for a molecular motor with proces-sivity: as a result of the Brownian motion, RNAP undergoes reversible translocation among differentstates along the DNA template, and the RNA polymerization reaction favours the forward motion uponthe formation of phosphodiester bond [264]. These two programs analyse the processive dynamics ofthe transcription elongation complex (TEC) state by state. These translocation states are labelled as-1, 0, 1, according to the position of the RNAP catalytic center in relation to the 3’ end of the RNAtranscript [264] [262]. To achieve the processive transcription dynamics, the TEC consists of(i) a transcription bubble with around 12-14 nts unwound in the DNA strands, which is stablized bythe bound RNAP,(ii) the 8-to-9-nt-long RNA-DNA hybrid wherein the DNA is from the template strand,(iii) the paired DNA duplex at either ends,(iv) and four channels for the entry of the DNA duplex, the exit of the DNA strands, the entry of thenew NTP, and the exit of the newly synthesized RNA, respectively [262].Are RNAP pausing sites phylogenetically conserved? Due to the regulatory roles of co-transcriptionalpausing, there might be some mechanism involved to phylogenetically preserve the pattern of RNAP paus-ing regarding location and duration. The RNA secondary structures – upstream hairpin – are known tobe important for the initiation/termination of RNAP pausing, which suggests that the transient RNA sec-ondary structures along the co-transcriptional folding pathway could be conserved to a certain extent inorder to preserve the pausing dynamics, e.g. the evolutionarily conserved pausing site observed by Wonget al. (2007) [75]. Therefore, our pipeline could be used to comparatively predict the kinetically feasiblehairpin for the upstream nascent RNA, which could serve as a prerequisite for the immediately down-stream RNAP pausing site. Given a MSA of homologous sequences, such conserved upstream hairpinscould be incorporated into Bai’s or Tadigotla’s methods to enhance their predictive power. For instance,the free energy term for RNA-RNA interaction in the state free energy from Tadigotla (2006) [262] iscalculated based on the MFE conformation. However, the MFE conformation may not always be thefunctional one responsible for the RNAP pausing; in this case, the comparative approach may give struc-tural prediction that is closer to the conformation in vivo. Thus, we could replace the MFE term by theenergy of a conserved structure in the old formula. Moreover, consensus nucleotides are found near thepausing sites [258], which could be combined with the predicted conserved upstream hairpin to betterpredict the pausing site.A better prediction of pausing site can bring insights into the dynamics of DNA-RNA-RNAP TEC,which will allow us to infer the regulatory function related to the pausing site. Resembling the concept ofYin-Yang, the structure of the nascent RNA transcript and the RNAP pausing sites are interconnectedand mutually influence each other. Hence, incorporating the pausing site into kinetic simulation programmay help better model the in vivo co-transcriptional folding because pausing at certain site will induce120structural rearrangement in the transcribed portion of the transcript [75]. This could provide a newperspective to understand the malfunction of RNA folding related to certain diseases.4.4.5 Utilizing The Transient Structures Predicted From Our Pipelines Part IV: RiboSNitchesAnd DiseaseFirstly, our study may bring insight into disease related to wrongly formed transient/alternativestructure. The riboSNitches – mutation that causes structural change – are found to account for manydisease phenotype [1, 86]. Thus, it would be informative to probe for any possible riboSNitch residingin functional structures, which implies that such riboSNitch could be responsible for certain phenotypicaberration. However, such studies concerning riboSNitches are focusing on functional final/dominantstructures thus far. This calls for similar studies conducted specially for transient/alternative structures.This entails us to compile a collection of highly possible transient/alternative conformations occurring asa result of co-transcriptional folding, which could be achieved through our studies.Secondly, identifying the highly possible and functional transient/alternative structures can constrainthe RNA folding pathway to the most feasible one(s), which enhances our computational ability to modelthe final functional RNA secondary structure (or structural dynamics) shaped by its true folding pathway.This improves the design and manufacture of therapeutic RNA molecule (e.g. siRNA) with appropriateconformation for its target [10,265].121References1. Mortimer, S. A., Kidwell, M. A., and Doudna, J. A. (2014) Insights into RNA structure andfunction from genome-wide studies. Nat Rev Genet, 15(7), 469–79.2. Clark, M. B., Amaral, P. P., Schlesinger, F. J., Dinger, M. E., Taft, R. J., Rinn, J. L., Ponting,C. P., Stadler, P. F., Morris, K. V., Morillon, A., Rozowsky, J. S., Gerstein, M. B., Wahlestedt, C.,Hayashizaki, Y., Carninci, P., Gingeras, T. R., and Mattick, J. S. (2011) The reality of pervasivetranscription. PLoS Biol, 9(7), e1000625.3. Smith, M. A., Gesell, T., Stadler, P. F., and Mattick, J. S. (2013) Widespread purifying selectionon RNA structure in mammals. Nucleic Acids Res, 41(17), 8220–36.4. Chinwalla, A. T. and et al. (12, 2002) Initial sequencing and comparative analysis of the mousegenome. Nature, 420(6915), 520–562.5. Pheasant, M. and Mattick, J. S. (2007) Raising the estimate of functional human sequences.Genome Research, 17(9), 1245–1253.6. Carninci, P., Kasukawa, T., Katayama, S., Gough, J., and et al. (2005) The TranscriptionalLandscape of the Mammalian Genome. Science, 309(5740), pp. 1559–1563.7. RIKEN Genome Exploration Research Group and Genome Science Group (Genome NetworkProject Core Group) and the FANTOM Consortium, Katayama, S., Tomaru, Y., Kasukawa, T.,Waki, K., Nakanishi, M., Nakamura, M., Nishida, H., Yap, C. C., Suzuki, M., Kawai, J., Suzuki,H., Carninci, P., Hayashizaki, Y., Wells, C., Frith, M., Ravasi, T., Pang, K. C., Hallinan, J.,Mattick, J., Hume, D. A., Lipovich, L., Batalov, S., Engstrm, P. G., Mizuno, Y., Faghihi, M. A.,Sandelin, A., Chalk, A. M., Mottagui-Tabar, S., Liang, Z., Lenhard, B., Wahlestedt, C., and FAN-TOM Consortium and RIKEN Genome Exploration Research Group and Genome Science Group(Genome Network Project Core Group) and RIKEN Genome Exploration Research Group andGenome Science Group (Genome Network Project Core Group) and the FANTOM Consortium(2005) Antisense Transcription in the Mammalian Transcriptome. Science, 309(5740), 1564–1566.8. Djebali, S., Davis, C. A., Merkel, A., Dobin, A., and et al. (09, 2012) Landscape of transcriptionin human cells. Nature, 489(7414), 101–108.9. Thurman, R. E., Rynes, E., Humbert, R., Vierstra, J., Maurano, M. T., Haugen, E., Sheffield,N. C., Stergachis, A. B., Wang, H., Vernot, B., Garg, K., John, S., Sandstrom, R., Bates, D.,Boatman, L., Canfield, T. K., Diegel, M., Dunn, D., Ebersol, A. K., Frum, T., Giste, E., Johnson,A. K., Johnson, E. M., Kutyavin, T., Lajoie, B., Lee, B.-K., Lee, K., London, D., Lotakis, D.,Neph, S., Neri, F., Nguyen, E. D., Qu, H., Reynolds, A. P., Roach, V., Safi, A., Sanchez, M. E.,122Sanyal, A., Shafer, A., Simon, J. M., Song, L., Vong, S., Weaver, M., Yan, Y., Zhang, Z., Zhang,Z., Lenhard, B., Tewari, M., Dorschner, M. O., Hansen, R. S., Navas, P. A., Stamatoyannopoulos,G., Iyer, V. R., Lieb, J. D., Sunyaev, S. R., Akey, J. M., Sabo, P. J., Kaul, R., Furey, T. S., Dekker,J., Crawford, G. E., and Stamatoyannopoulos, J. A. (09, 2012) The accessible chromatin landscapeof the human genome. Nature, 489(7414), 75–82.10. Mathews, D. H., Moss, W. N., and Turner, D. H. (2010) Folding and finding RNA secondarystructure. Cold Spring Harb Perspect Biol, 2(12), a003665.11. Taft, R. and Mattick, J. (2003) Increasing biological complexity is positively correlated with therelative genome-wide expansion of non-protein-coding DNA sequences. Genome biology, 5(1),P1–P1.12. Mercer, T. R. and Mattick, J. S. (2013) Structure and function of long noncoding RNAs inepigenetic regulation. Nature structural & molecular biology, 20(3), 300–7.13. Guttman, M., Amit, I., Garber, M., French, C., Lin, M. F., Feldser, D., Huarte, M., Zuk, O.,Carey, B. W., Cassady, J. P., Cabili, M. N., Jaenisch, R., Mikkelsen, T. S., Jacks, T., Hacohen,N., Bernstein, B. E., Kellis, M., Regev, A., Rinn, J. L., and Lander, E. S. (2009) Chromatinsignature reveals over a thousand highly conserved large non-coding RNAs in mammals. Nature,458(7235), 223–227.14. Yang, L., Duff, M., Graveley, B., Carmichael, G., and Chen, L.-L. (2011) Genomewide character-ization of non-polyadenylated RNAs. Genome Biology, 12(2), R16.15. Wilusz, J. E., Freier, S. M., and Spector, D. L. 2015/06/21 3’ End Processing of a Long Nuclear-Retained Noncoding RNA Yields a tRNA-like Cytoplasmic RNA. Cell, 135(5), 919–932.16. Yin, Q.-F., Yang, L., Zhang, Y., Xiang, J.-F., Wu, Y.-W., Carmichael, G. G., and Chen, L.-L.2015/06/21 Long Noncoding RNAs with snoRNA Ends. Molecular Cell, 48(2), 219–230.17. Dinger, M. E., Gascoigne, D. K., and Mattick, J. S. (2011) The evolution of {RNAs} with multiplefunctions. Biochimie, 93(11), 2013 – 2018.18. Hawkins, P. G. and Morris, K. V. (MAR 1, 2008) RNA and transcriptional modulation of geneexpression. CELL CYCLE, 7(5), 602–607.19. Guttman, M., Donaghey, J., Carey, B. W., Garber, M., Grenier, J. K., Munson, G., Young, G.,Lucas, A. B., Ach, R., Bruhn, L., Yang, X., Amit, I., Meissner, A., Regev, A., Rinn, J. L., Root,D. E., and Lander, E. S. (SEP 15, 2011) lincRNAs act in the circuitry controlling pluripotencyand differentiation. NATURE, 477(7364), 295–U60.20. P, K., J, D., J, C., J, L., G, H., S, D., and TR, G. (2005) Examples of the complex architectureof the human transcriptome revealed by RACE and high-density tiling arrays. Genome Research,15(7), 987–997.12321. Win, M. N. and Smolke, C. D. (OCT 17, 2008) Higher-order cellular information processing withsynthetic RNA devices. SCIENCE, 322(5900), 456–460.22. Barabasi, A. and Oltvai, Z. (FEB, 2004) Network biology: Understanding the cell’s functionalorganization. NATURE REVIEWS GENETICS, 5(2), 101–U15.23. Derrien, T., Johnson, R., Bussotti, G., Tanzer, A., Djebali, S., Tilgner, H., Guernec, G., Martin,D., Merkel, A., Knowles, D. G., Lagarde, J., Veeravalli, L., Ruan, X., Ruan, Y., Lassmann, T.,Carninci, P., Brown, J. B., Lipovich, L., Gonzalez, J. M., Thomas, M., Davis, C. A., Shiekhattar,R., Gingeras, T. R., Hubbard, T. J., Notredame, C., Harrow, J., and Guig, R. (2012) The GEN-CODE v7 catalog of human long noncoding RNAs: Analysis of their gene structure, evolution,and expression. Genome Research, 22(9), 1775–1789.24. Wang, K. C., Yang, Y. W., Liu, B., Sanyal, A., Corces-Zimmerman, R., Chen, Y., Lajoie, B. R.,Protacio, A., Flynn, R. A., Gupta, R. A., Wysocka, J., Lei, M., Dekker, J., Helms, J. A.,and Chang, H. Y. (04, 2011) A long noncoding RNA maintains active chromatin to coordinatehomeotic gene expression. Nature, 472(7341), 120–124.25. Narberhaus, F. and Vogel, J. (10, 2009) Regulatory RNAs in prokaryotes: here, there and every-where. Mol Microbiol, 74(2), 261–9.26. Watson, J. D., Baker, T. A., Bell, S. P., Gann, A., Levine, M., and Losick, R. (2013) MolecularBiology of the Gene (7th Edition) , Benjamin Cummings, .27. Namy, O., Moran, S. J., Stuart, D. I., Gilbert, R. J. C., and Brierley, I. (05, 2006) A mechan-ical explanation of RNA pseudoknot function in programmed ribosomal frameshifting. Nature,441(7090), 244–247.28. Chadalavada, D. M., Knudsen, S. M., Nakano, S., and Bevilacqua, P. C. (2000) A role for upstreamRNA structure in facilitating the catalytic fold of the genomic hepatitis delta virus ribozyme.Journal of Molecular Biology, 301(2), 349–367.29. Sun, B., Deaton, A., and Lee, J. (MAR 3, 2006) A transient heterochromatic state in Xist preemptsX inactivation choice without RNA stabilization. MOLECULAR CELL, 21(5), 617–628.30. Chowdhury, S., Maris, C., Allain, F. H.-T., and Narberhaus, F. (06, 2006) Molecular basis fortemperature sensing by an RNA thermometer. The EMBO Journal, 25(11), 2487–2497.31. Esteller, M. (12, 2011) Non-coding RNAs in human disease. Nat Rev Genet, 12(12), 861–874.32. Prasanth, K. V. and Spector, D. L. (1, 2007) Eukaryotic regulatory RNAs: an answer to the’genome complexity’ conundrum.. Genes Dev, 21(1), 11–42.33. Chadalavada, D. M., Gratton, E. A., and Bevilacqua, P. C. (2010) The human HDV-like CPEB3ribozyme is intrinsically fast-reacting. Biochemistry, 49(25), 5321–30.12434. Batey, R. T., Rambo, R. P., and Doudna, J. A. (1999) Tertiary Motifs in RNA Structure andFolding. Angew Chem Int Ed Engl, 38(16), 2326–2343.35. Cantara, W. A., Crain, P. F., Rozenski, J., McCloskey, J. A., Harris, K. A., Zhang, X., Vendeix,F. A. P., Fabris, D., and Agris, P. F. (01, 2011) The RNA modification database, RNAMDB:2011 update. Nucleic Acids Research, 39(Database issue), D195–D201.36. Kellner, S., Burhenne, J., and Helm, M. (2010) Detection of RNA modifications. RNA Biology,7(2), 237–247.37. Kudla, G., Murray, A. W., Tollervey, D., and Plotkin, J. B. (2009) Coding-sequence determinantsof gene expression in Escherichia coli.. Science, 25(324), 255–258.38. Scharff, L. B., Childs, L., Walther, D., and Bock, R. (06, 2011) Local Absence of SecondaryStructure Permits Translation of mRNAs that Lack Ribosome-Binding Sites. PLoS Genet, 7(6),e1002155.39. Bentele, K., Saffert, P., Rauscher, R., Ignatova, Z., and Blu¨thgen, N. (2013) Efficient translationinitiation dictates codon usage at gene start. Molecular Systems Biology, 9, 675–675.40. Nackley, A. G., Shabalina, S. A., Tchivileva, I. E., Satterfield, K., Korchynskyi, O., Makarov,S. S., Maixner, W., and Diatchenko, L. (2006) Human Catechol-O-Methyltransferase HaplotypesModulate Protein Expression by Altering mRNA Secondary Structure. Science, 314(5807), 1930–1933.41. Pervouchine, D. D., Khrameeva, E. E., Pichugina, M. Y., Nikolaienko, O. V., Gelfand, M. S.,Rubtsov, P. M., and Mironov, A. A. (2012) Evidence for widespread association of mammaliansplicing and conserved long-range RNA structures. RNA, 18(1), 1–15.42. Leontis, N. B. and Westhof, E. (2002) The annotation of RNA motifs. Comparative and functionalgenomics, 3(6), 518–24.43. Tian, B., Bevilacqua, P. C., Diegelman-Parente, A., and Mathews, M. B. (12, 2004) The double-stranded-RNA-binding motif: interference and much more. Nat Rev Mol Cell Biol, 5(12), 1013–1023.44. Darty, K., Denise, A., and Ponty, Y. (2009) VARNA: Interactive drawing and editing of the RNAsecondary structure. Bioinformatics, 25(15), 1974–1975.45. Doudna, J. A. and Cech, T. R. (1995) Self-assembly of a group I intron active site from itscomponent tertiary structural domains. RNA, 1(1), 36–45.46. Puglisi, J. D., Wyatt, J. R., and Jr, I. T. (1990) Conformation of an {RNA} pseudoknot. Journalof Molecular Biology, 214(2), 437 – 453.12547. Kolk, M. H., van der Graaf, M., Wijmenga, S. S., Pleij, C. W., Heus, H. A., and Hilbers, C. W.(Apr, 1998) NMR structure of a classical pseudoknot: interplay of single- and double-strandedRNA.. Science, 280(5362), 434–438.48. Thirumalai, A. and Woodson, S. (1996) Kinetics of Folding of Proteins and RNA. Accounts ofChemical Research, 29, 433–439.49. Dill, K. A. and Chan, H. S. (Jan, 1997) From Levinthal to pathways to funnels.. Nat Struct Biol,4, 10–19.50. Draper, D. E. (May, 1996) Parallel worlds. Nat Struct Biol, 3(5), 397–400.51. Pan, J., Thirumalai, D., and Woodson, S. A. (1997) Folding of {RNA} involves parallel pathways.Journal of Molecular Biology, 273(1), 7 – 13.52. Pan, J. and Woodson, S. A. (1998) Folding intermediates of a self-splicing RNA: mispairing ofthe catalytic core. Journal of Molecular Biology, 280(4), 597 – 609.53. Boyle, J., Robillard, G., and Kim, S. (1980) Sequential folding of transfer RNA. A nuclear mag-netic resonance study of successively longer tRNA fragments with a common 5’ end. Journal ofMolecular Biology, 139, 601–625.54. FR, K. and DR, M. (1981) Secondary structure formation during RNA synthesis. Nucleic AcidsRes, 9(19), 5109–5124.55. Brehm, S. and Cech, T. (1983) Fate of an intevening sequence ribonucleic-acid — excision andcyclization of the Tetrahymena ribosomal ribonucleic-acid intervening sequence in vivo. Biochem-istry, 22(10), 2390–2397.56. Pan, T. and Sosnick, T. (2006) RNA folding during transcription. Annual Review of Biophysicsand Biomolecular Structure, 35, 161–175.57. Mahen, E., Harger, J., Calderon, E., and Fedor, M. (2005) Kinetics and thermodynamics makedifferent contributions to RNA folding in vitro and in yeast. Molecular Cell, 19(1), 27–37.58. Mahen, E., Watson, P., Cottrell, J., and Fedor, M. (2010) mRNA secondary structures foldsequentially but exchange rapidly in vivo. PLoS Biology, 8(2), e1000307.59. Johansson, J., Mandin, P., Renzoni, A., Chiaruttini, C., Springer, M., and Cossart, P. (2002) AnRNA thermosensor controls expression of virulence genes in Listeria monocytogenes. Cell, 110(5),551–561.60. Narberhaus, F. (2010) Translational control of bacterial heat shock and virulence genes bytemperature-sensing mRNAs. RNA Biology, 7(1), 84–89.12661. Giuliodori, A., Pietro, F. D., Marzi, S., Masquida, B., Wagner, R., Romby, P., Gualerzi, C., andPon, C. (2010) The cspA mRNA is a thermosensor that modulates translation of the cold-shockprotein CspA. Molecular Cell, 37(1), 21–33.62. Weeks, K. M. (1997) Protein-facilitated RNA folding. Current Opinion in Structural Biology,7(3), 336–342.63. Lagos-Quintana, M., Rauhut, R., Lendeckel, W., and Tuschl, T. (2001) Identification of novelgenes coding for small expressed RNAs. Science, 294(5543), 853–858.64. Pan, T., Fang, X., and Sosnick, T. (1999) Pathway modulation, circular permutation and rapidRNA folding under kinetic control. Journal of Molecular Biology, 286(13), 721–731.65. Heilman-Miller, S. and Woodson, S. (2003) Effect of transcription on folding of the Tetrahymenaribozyme. RNA, 9(6), 722–733.66. Heilman-Miller, S. and Woodson, S. (2003) Perturbed folding kinetics of circularly permutedRNAs with altered topology. Journal of Molecular Biology, 328(2), 385–394.67. Lewicki, B., Margus, T., Remme, J., and Nierhaus, K. (1993) Coupling of rRNA transcription andribosomal assembly in vivo – formation of active ribosomal-subunits in Escherichia coli requirestranscription of RNA genes by host RNA polymerase which cannot be replaced by T7 RNApolymerase. Journal of Molecular Biology, 231, 581–593.68. Chao, M. Y., Kan, M., and Lin-Chao, S. (1995) RNAII transcribed by IPTG-induced T7 RNApolymerase is non-functional as a replication primer for ColE1-type plasmids in Escherichia coli .Nucleic Acids Research, 23, 1691–1695.69. Meyer, I. M. and Miklo´s, I. (2005) Statistical evidence for conserved, local secondary structurein the coding regions of eukaryotic mRNAs and pre-mRNAs. Nucleic Acids Research, 33(19),6338–6348.70. Geis, M., Flamm, C., Wolfinger, M., Tanzer, A., Hofacker, I., Middendorf, M., Mandl, C., Stadler,P., and Thurner, C. (2008) Folding kinetics of large RNAs. Journal of Molecular Biology, 379(1),160–173.71. Meyer, I. M. and Miklo´s, I. (2004) Co-transcriptional folding is encoded within RNA genes. BMCMolecular Biology, 10, 5.72. Gultyaev, A., von Batenburg, F., and Pleij, C. (1995) The computer-simulation of RNA foldingpathways using a genetic algorithm. Journal of Molecular Biology, 250(1), 37–51.73. Jacobson, A. B. and Zuker, M. (1993) Structural analysis by energy dot plot of a large mRNA.Journal of Molecular Biology, 233, 261–269.12774. Shapiro, B. A., Bengali, D., Kasprzak, W., and Wu, J. C. (2001) {RNA} folding pathway func-tional intermediates: their prediction and analysis1. Journal of Molecular Biology, 312(1), 27 –44.75. Wong, T., Sosnick, T., and Pan, T. (2007) Folding of noncoding RNAs during transcription facil-itated by pausing-induced nonnative structures. Proceedings of the National Academy of Scienceof the USA, 104(46), 17995–18000.76. Van Meerten, D., Girard, G., and Van Duin, J. (2001) Translational control by delayed RNAfolding: Identification of the kinetic trap. RNA, 7(3), 483–494.77. Groeneveld, H., Thimon, K., and van Duin, J. (1995) Translational control of maturation-proteinsynthesis in phage MS2: a role for the kinetics of RNA folding?. RNA, 1(1), 79–88.78. Kolter, R. and Yanofsky, C. (1982) Attenuation in amino acid biosynthetic operons. Annu RevGenet, 16, 113–34.79. Yanofsky, C. (1981) Attenuation in the control of expression of bacterial operons. Nature, 289,751–758.80. Winkler, W. C., Nahvi, A., Sudarsan, N., Barrick, J. E., and Breaker, R. R. (2003) An mRNAstructure that controls gene expression by binding S-adenosylmethionine. Nature Structural Biol-ogy, 10(9), 701–707.81. Zhu, J., Steif, A., Proctor, J., and Meyer, I. (2013) Transient RNA structure features are evo-lutionarily conserved and can be computationally predicted. Nucleic Acids Research, 41(12),6273–6285.82. Nahvi, A., Sudarsan, N., Ebert, M. S., Zou, X., Brown, K. L., and Breaker, R. R. (2002) Geneticcontrol by a metabolite binding mRNA. Chem. Biol., 9, 1043–1049.83. Winkler, W., Nahvi, A., and Breaker, R. (2002) Thiamine derivatives bind messenger RNAsdirectly to regulate bacterial gene expression. Nature, 419, 952–956.84. Winkler, W., Cohen-Chalamish, S., and Breaker, R. (2002) An mRNA structure that controlsgene expression by binding FMN. Proceedings of the National Academy of Science of the USA,99, 15908–15913.85. Clote, P., Ferre, F., Kranakis, E., and Krizanc, D. (MAY, 2005) Structural RNA has lower foldingenergy than random RNA of the same dinucleotide frequency. RNA-A PUBLICATION OF THERNA SOCIETY, 11(5), 578–591.86. Halvorsen, M., Martin, J. S., Broadaway, S., and Laederach, A. (Aug 19, 2010) Disease-associatedmutations that alter the RNA structural ensemble. PLoS genetics, 6(8), e1001074.12887. Will, S., Reiche, K., Hofacker, I. L., Stadler, P. F., and Backofen, R. (04/13, 2007) InferringNoncoding RNA Families and Classes by Means of Genome-Scale Structure-Based Clustering.PLoS Comput Biol, 3(4), e65.88. Zuker, M. and Sankoff, D. (1984) RNA secondary structures and their prediction. Bul. Math.Biol., 46, 591–621.89. Ziehler, W. A. and Engelke, D. R. (05/, 2001) Probing RNA Structure with Chemical Reagentsand Enzymes. Current protocols in nucleic acid chemistry / edited by Serge L.Beaucage ...[et al.],0 6, Unit–6.1.90. Ying, S. Y., Chang DC FAU Miller, J. D., Miller JD FAU Lin, S.-L., and Lin, S. L. MicroRNAprotocols. Perspectives Number 1064-3745 (Print); 1064-3745 (Linking) Humana Press (2006).91. Stern, S., Moazed, D., and Noller, H. F. (1988) Structural analysis of RNA using chemical andenzymatic probing monitored by primer extension. In Harry F. Noller, Jr., K. M., (ed.), Ribosomes,Vol. 164 of Methods in Enzymology, pp. 481 – 489 Academic Press.92. Harris, K. A., Crothers, D. M., and Ullu, E. (1995) In vivo structural analysis of spliced leaderRNAs in Trypanosoma brucei and Leptomonas collosoma: a flexible structure that is independentof cap4 methylations.. RNA, 1(4), 351–62.93. Lindell, M., Romby, P., and Wagner, E. G. H. (04/, 2002) Lead(II) as a probe for investigatingRNA structure in vivo.. RNA, 8(4), 534–541.94. Li, Y. and Breaker, R. R. (1999) Kinetics of RNA Degradation by Specific Base Catalysis ofTransesterification Involving the 2-Hydroxyl Group. Journal of the American Chemical Society,121(23), 5364–5372.95. Zarrinkar, P. P. and Williamson, J. R. (1996) The kinetic folding pathway of the Tetrahymenaribozyme reveals possible similarities between RNA and protein folding.. Nat Struct Biol., 3(5),432–8.96. Merino, E. J., Wilkinson, K. A., Coughlan, J. L., and Weeks, K. M. (Mar 30, 2005) RNA structureanalysis at single nucleotide resolution by selective 2’-hydroxyl acylation and primer extension(SHAPE). Journal of the American Chemical Society, 127(12), 4223–4231.97. Karaduman, R., Fabrizio, P., Hartmuth, K., Urlaub, H., and Lhrmann, R. (2006) {RNA} Struc-ture and RNAProtein Interactions in Purified Yeast {U6} snRNPs. Journal of Molecular Biology,356(5), 1248 – 1262.98. Strobel, S. A. and Shetty, K. (Apr, 1997) Defining the chemical groups essential for Tetrahymenagroup I intron function by nucleotide analog interference mapping.. Proc Natl Acad Sci U S A,94(7), 2903–2908.12999. Murphy, F. L. and Cech, T. R. (Feb, 1994) GAAA tetraloop and conserved bulge stabilize tertiarystructure of a group I intron domain.. J Mol Biol, 236(1), 49–63.100. Wan, Y., Qu, K., Zhang, Q. C., Flynn, R. A., Manor, O., Ouyang, Z., Zhang, J., Spitale, R. C.,Snyder, M. P., Segal, E., and Chang, H. Y. (01, 2014) Landscape and variation of RNA secondarystructure across the human transcriptome. Nature, 505(7485), 706–709.101. Kertesz, M., Wan, Y., Mazor, E., Rinn, J. L., Nutter, R. C., Chang, H. Y., and Segal, E. (09, 2010)Genome-wide measurement of RNA secondary structure in yeast. Nature, 467(7311), 103–107.102. Zheng, Q., Ryvkin, P., Li, F., Dragomir, I., Valladares, O., Yang, J., Cao, K., Wang, L.-S.,and Gregory, B. D. (09/30, 2010) Genome-Wide Double-Stranded RNA Sequencing Reveals theFunctional Significance of Base-Paired RNAs in Arabidopsis. PLoS Genet, 6(9), e1001141.103. Lucks, J. B., Mortimer, S. A., Trapnell, C., Luo, S., Aviran, S., Schroth, G. P., Pachter, L.,Doudna, J. A., and Arkin, A. P. (2011) Multiplexed RNA structure characterization with selective2’-hydroxyl acylation analyzed by primer extension sequencing (SHAPE-Seq). Proceedings of theNational Academy of Sciences, 108(27), 11063–11068.104. Ding, Y., Tang, Y., Kwok, C. K., Zhang, Y., Bevilacqua, P. C., and Assmann, S. M. (01, 2014) Invivo genome-wide profiling of RNA secondary structure reveals novel regulatory features. Nature,505(7485), 696–700.105. James, T. L. (2000) Biophysics textbook online (Chapter 1: Fundamentals of NMR), BiophysicalSociety, ID: 53955010.106. Frtig, B., Richter, C., Whnert, J., and Schwalbe, H. (2003) NMR Spectroscopy of RNA. Chem-BioChem, 4(10), 936–962.107. Barton, S., Heng, X., Johnson, B. A., and Summers, M. F. (01, 2013) Database proton NMRchemical shifts for RNA signal assignment and validation. Journal of Biomolecular NMR, 55(1),33–46.108. Holbrook, S. R. and Kim, S.-H. (1997) RNA crystallography. Biopolymers, 44(1), 3–21.109. Klostermeier, D. and Hammann, C. (2013) RNA Structure and Folding: Biophysical Techniquesand Prediction Methods, De Gruyter, .110. Clare, D. K. and Orlova, E. V. (2010) 4.6 Cryo-EM reconstruction of tobacco mosaic virus fromimages recorded at 300 keV on a 4k 4k {CCD} camera. Journal of Structural Biology, 171(3),303 – 308.111. Zuker, M. (July, 2003) Mfold web server for nucleic acid folding and hybridization prediction.Nucleic Acids Research, 31(13), 3406–3415.130112. Hofacker, I., Fontana, W., Stadler, P., Bonhoeffer, S., Tacker, M., and Schuster, P. (1994)Fast Folding and Comparison of RNA Secondary Strutures. Monatshefte fu¨r Chemie (Chemi-cal Monthly), 125, 167–188.113. Lorenz, R., Bernhart, S. H., Siederdissen, C. H. Z., Tafer, H., Flamm, C., Stadler, P. F., andHofacker, I. L. (Nov 24, 2011) ViennaRNA Package 2.0. Algorithms for Molecular Biology, 6.114. Nussinov, R. and Jacobson, A. (1980) Fast algorithm for predicting the secondary structure ofsingle-stranded RNA. Proc. Natl. Acad. Sci. USA, 77, 6309–6313.115. Zuker, M. and Stiegler, P. (1981) Optimal computer folding of large RNA sequences using ther-modynamic and auxiliary information. Nucleic Acids Research, 9, 133–148.116. Wuchty, S., Fontana, W., Hofacker, I., and Schuster, P. (1999) Complete Suboptimal Folding ofRNA and the Stability of Secondary Structures. Biopolymers, 49, 145–165.117. Ding, Y. and Lawrence, C. (Dec, 2003) A statistical sampling algorithm for RNA secondarystructure prediction. Nucleic Acids Research, 31(24), 7280–7301.118. Ding, Y., Chan, C., and Lawrence, C. (Jul, 2004) Sfold web server for statistical folding andrational design of nucleic acids. Nucleic Acids Research, 32(2), W135–W141.119. Chan, C., Lawrence, C., and Ding, Y. (Oct, 2005) Structure clustering features on the Sfold Webserver. Bioinformatics, 21(20), 3926–3928.120. Do, C. B., Woods, D. A., and Batzoglou, S. (Jul, 2006) CONTRAfold: RNA secondary structureprediction without physics-based models. Bioinformatics, 22(14), E90–E98.121. McCaskill, J. S. (1990) The Equilibrium Partition Function and Base Pair Binding Probabilitiesfor RNA Secondary Structure. Biopolymers, 29, 1105–1119.122. Knudsen, B. and Hein, J. (Jun, 1999) RNA secondary structure prediction using stochasticcontext-free grammars and evolutionary history. Bioinformatics, 15(6), 446–454.123. Pedersen, J., Meyer, I., Forsberg, R., Simmonds, P., and Hein, J. (2004) A comparative methodfor finding and folding RNA secondary structures within protein-coding regions. Nucleic AcidsRes., 32(16), 4925 – 4936.124. Hofacker, I., Fekete, M., and Stadler, P. (2002) Secondary Structure Prediction for Aligned RNASequences. Journal of Molecular Biology, 319, 1059–1066.125. Eddy, S. R. and Durbin, R. (1994) RNA sequence analysis using covariance models. Nucleic AcidsResearch, 22(11), 2079–2088.126. Durbin, R., Eddy, S., Krogh, A., and Mitchison, G. (1998) Biological sequence analysis, CambridgeUniversity Press, .131127. Mathews, D. H. and Turner, D. H. (2002) Dynalign: an algorithm for finding the secondarystructure common to two RNA sequences. Journal of Molecular Biology, 317(2), 191–203.128. Havgaard, J. H., Lyngsø R. B., and Gorodkin, J. (2005) The FOLDALIGN web server for pairwisestructural RNA alignment and mutual motif search. Nucleic Acids Research, 33, W650–W653.129. Griffiths-Jones, S., Moxon, S., Marshall, M., Khanna, A., Eddy, S. R., and Bateman, A. (2005)Rfam: annotating non-coding RNAs in complete genomes. Nucleic Acids Research, 33, D121–D124.130. Flamm, C. and Hofacker, I. (2008) Beyond energy minimization: approaches to the kinetic foldingof RNA. Monatshefte fu¨r Chemie, 139(4), 447–457.131. Mironov, A., Dyakonova, L., and Kister, A. (1985) A kinetic approach to the prediction of RNAsecondary structures. Journal of Biomolecular Structure & Dynamics, 2(5), 953–962.132. Mironov, A. and Lebedev, V. (1993) A kinetic model of RNA folding. Biosystems, 30(1-3), 49–56.133. Danilova, L., Pervouchine, D., Favorov, A., and Mironov, A. (2006) RNAkinetics: a web serverthat models secondary structure kinetics of an elongating RNA. Journal of Bioinformatics andComputational Biology, 4(2), 589–596.134. Flamm, C., Fontana, W., Hofacker, I. L., and Schuster, P. (Mar, 2000) RNA folding at elementarystep resolution. RNA, 6(3), 325–38.135. Isambert, H. and Siggia, E. D. (Jun, 2000) Modeling RNA folding paths with pseudoknots: ap-plication to hepatitis delta virus ribozyme. Proceedings of the National Academy of Science of theUSA, 97(12), 6515–20.136. Xayaphoummine, A., Bucher, T., Thalmann, F., and Isambert, H. (Dec, 2003) Prediction andstatistics of pseudoknots in RNA structures using exactly clustered stochastic simulations. Pro-ceedings of the National Academy of Science of the USA, 100(26), 15310–5.137. Xayaphoummine, A., Bucher, T., and Isambert, H. (Jul, 2005) Kinefold web server for RNA/DNAfolding path and structure prediction including pseudoknots and knots.. Nucleic Acids Res,33(Web Server issue), W605–10.138. Geis, M., Flamm, C., Wolfinger, M. T., Tanzer, A., Hofacker, I. L., Middendorf, M., Mandl, C.,Stadler, P. F., and Thurner, C. (May, 2008) Folding kinetics of large RNAs.. Journal of MolecularBiology, 379(1), 160–173.139. Toulme, F., Mosrin-Huaman, C., Artsimovitch, I., and Rahmouni, A. (2005) Transcriptionalpausing in vivo: A nascent RNA hairpin restricts lateral movements of RNA polymerase in bothforward and reverse directions. Journal of Molecular Biology, 351(1), 39–51.132140. Wickiser, J., Winkler, W., Breaker, R., and Crothers, D. (2005) The speed of RNA transcriptionand metabolite binding kinetics operate an FMN riboswitch. Molecular Cell, 18(1), 49–60.141. Nawrocki, E. P. and Eddy, S. R. (2013) Infernal 1.1: 100-fold faster RNA homology searches.Bioinformatics, 29, 2933–2935.142. Koduvayur, S. and Woodson, S. (2004) Intracellular folding of the Tetrahymena group I introndepends on exon sequence and promoter choice. RNA, 10(10), 1526–1532.143. Mohr, G.and Zhang, A., Gianelos, J., Belfort, M., and Lambowitz, A. (1992) The neurosporaCYT-18 protein suppresses defects in the phage T4 td intron by stabilizing the catalyticallyactive structure of the intron core. Cell, 69(3), 483–494.144. Mohr, G., Caprara, M., Guo, Q., and Lambowitz, A. (1994) A tyrosyl-transfer-RNA synthetasecan function similarly to an RNA structure in the Tetrahymena ribozyme. Nature, 370(6485),147–150.145. Horowitz, D. S. (2012) The mechanism of the second step of pre-mRNA splicing. Wiley interdis-ciplinary reviews RNA, 3, 331–350.146. Bachellerie, J., Cavaille, J., and Hu¨ttenhofer, A. (2002) The expanding snoRNA world. Biochimie,84(8), 775–790.147. Price, M., Dehal, P., and Arkin, A. (2010) FastTree2 - Approximately maximum-likelihood treesfor large alignments. PLoS ONE, 5(3), e9490.148. Wiebe, N. J. P. and Meyer, I. M. (2010) Transat — method for detecting the conserved helices offunctional RNA structures, including transient, pseudo-knotted and alternative structures. PLoSComputational Biology, 6(6), e1000823.149. Nawrocki, E., Kolbe, D., and Eddy, S. (2009) Infernal 1.0: Inference of RNA alignments. Bioin-formatics, 25, 1335–1337.150. Edgar, R. (2004) MUSCLE: Multiple sequence alignment with high accuracy and high throughput.Nucleic Acids Research, 32(5), 1792–1797.151. Gurevitz, M., Swatantra, K., and Apirion, D. (1983) Identification of a precursor molecule for theRNA moiety of the processing enzyme RNase P. Proceedings of the National Academy of Sciencesof the United States of America, 80, 4450–4454.152. Hsu, L., Zagorski, J., and Fournier, M. (1984) Cloning and sequence analysis of the Escherichiacoli 4.5 S RNA gene. Journal of Molecular Biology, 25(3), 509–531.153. Frank, D. N. and Pace, N. R. (1998) RIBONUCLEASE P: Unity and Diversity in a tRNAProcessing Ribozyme. Annual Review of Biochemistry, 67(1), 153–180.133154. Torres-Larios, A., Swinger, K. K., Krasilnikov, A. S., Pan, T., and Mondragon, A. (2005) Crystalstructure of the RNA component of bacterial ribonuclease P. Nature, 437(7058), 584–587.155. Doudna, J. and Batey, R. (2004) Structural insights into the signal recognition particle. AnnualReview of Biochemistry, 73, 539–557.156. Ulbrandt, N. D., Newitt, J. A., and Bernstein, H. D. (1997) The E. coli Signal Recognition ParticleIs Required for the Insertion of a Subset of Inner Membrane Proteins. Cell, 88(2), 187–196.157. Jagath, J. R., Matassova, N. B., de Leeuw, E., Warnecke, J. M., Lentzen, G., Rodnina, M. V.,Luirink, J., and Wintermeyer, W. (2001) Important role of the tetraloop region of 4.5S RNA inSRP binding to its receptor FtsY.. RNA, 7(2), 293–301.158. Lentzen, G., Moine, H., Ehresmann, C., Ehresmann, B., and Wintermeyer, W. (1996) Structureof 4.5S RNA in the signal recognition particle of Escherichia coli as studied by enzymatic andchemical probing. RNA, 2, 244–253.159. Ataide, S. F., Schmitz, N., Shen, K., Ke, A., Shan, S., Doudna, J. A., and Ban, N. (2011) Thecrystal structure of the signal recognition particle in complex with its receptor. Science, 331(6019),881–886.160. Ferre-D’Amare, A. R., Zhou, K., and Doudna, J. A. (1998) Crystal structure of a hepatitis deltavirus ribozyme. Nature, 395(6702), 567–74.161. Rosenstein, S. P. and Been, M. D. (1991) Evidence that genomic and antigenomic RNA self-cleaving elements from hepatitis delta virus have similar secondary structures. Nucleic Acids Res,19(19), 5409–16.162. Bollback, J. P. and Huelsenbeck, J. P. (2001) Phylogeny, genome evolution, and host specificity ofsingle-stranded RNA bacteriophage (family Leviviridae). Journal of Molecular Evolution, 52(2),117–128.163. Seibel, P., Mu¨ller, T., Dandekar, T., Schultz, J., and Wolf, M. (2006) 4SALE: A tool for syn-chronous RNA sequence and secondary structure alignment and editing. BMC Bioinformatics, 7,498.164. Sayers, E. W., Barrett, T., Benson, D. A., Bolton, E., Bryant, S. H., Canese, K., Chetvernin, V.,Church, D. M., DiCuccio, M., Federhen, S., Feolo, M., Fingerman, I. M., Geer, L. Y., Helmberg,W., Kapustin, Y., Krasnov, S., Landsman, D., Lipman, D. J., Lu, Z., Madden, T. L., Madej, T.,Maglott, D. R., Marchler-Bauer, A., Miller, V., Karsch-Mizrachi, I., Ostell, J., Panchenko, A.,Phan, L., Pruitt, K. D., Schuler, G. D., Sequeira, E., Sherry, S. T., Shumway, M., Sirotkin, K.,Slotta, D., Souvorov, A., Starchenko, G., Tatusova, T. A., Wagner, L., Wang, Y., Wilbur, W. J.,Yaschenko, E., and Ye, J. (2012) Database resources of the National Center for BiotechnologyInformation. Nucleic Acids Research, 40(D1), D13–D25.134165. Morgan, S. R. and Higgs, P. G. (1998) Barrier heights between ground states in a model of RNAsecondary structure. Journal of Physics A: Mathematical and General, 31(14), 3153.166. Lai, D., Proctor, J., Zhu, J., and Meyer, I. (2012) R-chie: a web server and R package forvisualizing RNA secondary structures. Nucleic Acids Research, 40(12), e95.167. Daub, J., Gardner, P., Tate, J., Ramskld, D., Manske, M., Scott, W., Weinberg, Z., Griffiths-Jones, S., and Bateman, A. (Dec, 2008) The RNA WikiProject: Community annotation of RNAfamilies.. RNA, 14(12), 2462–2464.168. Griffiths-Jones, S., Moxon, S., Marshall, M., Khanna, A., Eddy, S., and Bateman, A. (2005)Rfam: annotating non-coding RNAs in complete genomes. Nucleic Acids Res, 33, D121–D124.169. Griffiths-Jones, S., Bateman, A., Marshall, M., Khanna, A., and Eddy, S. (2003) Rfam: an RNAfamily database.. Nucleic Acids Res, 31, 439–441.170. Zhu, J. A. and Meyer, I. M. (2015) Four RNA families with functional transient structures. RNABiology, 12(1), 5–20.171. Lai, D., Proctor, J., and Meyer, I. (2013) On the importance of cotranscriptional RNA structureformation. RNA, 19, 1461–1473.172. Burge, S. W., Daub, J., Eberhardt, R., Tate, J., Barquist, L., Nawrocki, E. P., Eddy, S. R.,Gardner, P. P., and Bateman, A. (2013) Rfam 11.0: 10 years of RNA families. Nucleic Acids Res,41, D226–32.173. Rose, J. K. and Yanofsky, C. (1974) Interaction of the operator of the tryptophan operon withrepressor. Proceedings of the National Academy of Science of the USA, 71, 3134–38.174. Bennett, G. N. and Yanofsky, C. (1978) Sequence analysis of operator constitutive mutants of thetryptophan operon of Escherichia coli. Journal of Molecular Biology, 121, 179–192.175. Yanofsky, C. (1971) Tryptophan biosynthesis in Escherichia coli. Genetic determination of theproteins involved. J. Am. med. Assoc, 218, 1026–1035.176. Stroynowski, I., van Cleemput, M., and Yanofsky, C. (1982) Superattenuation in the tryptophanoperon of Serratia marcescens. Nature, 298, 38 – 41.177. Squires, C., Lee, F., Bertrand, K., Squires, C. L., Bronson, M. J., and Yanofsky, C. (1976)Nucleotide sequence of the 5’ end of tryptophan messenger RNA of Escherichia coli. Journal ofMolecular Biology, 103, 351–81.178. Bertrand, K., Korn, L. J., Lee, F., and Yanofsky, C. (1977) The attenuator of the tryptophanoperon of Escherichia coli : Heterogeneous 3-OH termini in vivo and deletion mapping of functions.Journal of Molecular Biology, 117, 227–47.135179. Oppenheim, D., Bennett, G., and Yanofsky, C. (1980) Escherichia coli RNA polymerase and trprepressor interaction with the promoter-operator region of the tryptophan operon of Salmonellatyphimurium. Journal of Molecular Biology, 144(2), 133–42.180. Rose, J. K., Squires, C. L., Yanofsky, C., Yang, H. L., and Zubay, G. (1973) Regulation of in vitrotranscription of the tryptophan operon by purified RNA polymerase in the presence of partiallypurified repressor and tryptophan. Nature new Biol, 245,, 133–37.181. Imamoto, F. (1968) Immediate cessation of transcription of the operator-proximal region of thetryptophan operon in Escherichia coli after repression of the operon. Nature, 220, 31–34.182. Hiraga, S. and Yanofsky, C. (1973) Inhibition of the progress of transcription on the tryptophanoperon of Escherichia coli. Journal of Molecular Biology, 79, 339–49.183. Jackson, E. and Yanofsky, C. (1973) Region between the operator and first structural gene ofthe tryptophan operon of Escherichia coli may have a regulatory function. Journal of MolecularBiology, 76, 89–101.184. Bertrand, K., Squires, C., and Yanofsky, C. (1976) Transcription termination in vivo in the leaderregion of the tryptophan operon of Escherichia coli. Journal of Molecular Biology, 103, 319–37.185. Lee, F., Squires, C. L., Squires, C., and Yanofsky, C. (1976) Termination of transcription invitro in the Escherichia coli tryptophan operon leader region. Journal of Molecular Biology, 103,383–393.186. Bronson, M. J., Squires, C., and Yanofsky, C. (1973) Nucleotide sequences from tryptophanmessenger RNA of Escherichia coli : the sequence corresponding to the amino-terminal region ofthe first polypeptide specified by the operon. Proceedings of the National Academy of Science ofthe USA, 70(8), 2335–9.187. Miozzari, G. F. and Yanofsky, C. (1978) Translation of the leader region of the Escherichia colitryptophan operon. J. Bact, 133, 1457–1466.188. Yanofsky, C. and Soil, L. (1977) Mutations affecting tRNATrp and its charging and their effecton regulation of transcription termination at the attenuator of the tryptophan operon. Journal ofMolecular Biology, 113, 663–677.189. Zurawski, G., Elseviers, D., Stauffer, G. V., and Yanofsky, C. (1978) Translational control oftranscription termination at the attenuator of the Escherichia coli tryptophan operon. Proceedingsof the National Academy of Science of the USA, 75(12), 5988–92.190. Lee, F. and Yanofsky, C. (1977) Transcription termination at the trp operon attenuators of Es-cherichia coli and Salmonella typhimurium: RNA secondary structure and regulation of termi-nation. Proceedings of the National Academy of Science of the USA, 74(10), 4365–9.136191. Oxender, D., Zurawski, G., and Yanofsky, C. (1979) Attenuation in the Escherichia coli trypto-phan operon: role of RNA secondary structure involving the tryptophan codon region. Proceedingsof the National Academy of Science of the USA, 76(11), 5524–5528.192. Bertrand, K. and Yanofsky, C. (1976) Regulation of transcription termination in the leader regionof the tryptophan operon of Escherichia coli involves tryptophan or its metabolic product. Journalof Molecular Biology, 103(2), 339–49.193. Lee, F., Bertrand, K., Bennett, G., and Yanofsky, C. (1978) Comparison of the nucleotide se-quences of the initial transcribed regions of the tryptophan operons of Escherichia coli andSalmonella typhimurium. Journal of Molecular Biology, 121(2), 193–217.194. Miozzari, G. F. and Yanofsky, C. (1978) The regulatory region of the trp operon of Serratiamarcescens. Nature, 276(5689), 684–9.195. Miozzari, G. F. and Yanofsky, C. (1978) Naturally occurring promoter down mutation: Nucleotidesequence of the trp promoter/operator/leader region of Shigella dysenteriae 16. Proceedings of theNational Academy of Science of the USA, 75(11), 5580–4.196. Farnham, P. J. and Platt, T. (1980) A model for transcription termination suggested by studieson the trp attenuator in vitro using base analogs. Cell, 20(3), 739–48.197. Farnham, P. J. and Platt, T. (1982) Effects of DNA base analogs on transcription terminationat the tryptophan operon attenuator of Escherichia coli. Proceedings of the National Academy ofScience of the USA, 79(4), 998–1002.198. Zurawski, G. and Yanofsky, C. (1980) Escherichia coli tryptophan operon leader mutations, whichrelieve transcription termination, are cis-dominant to trp leader mutations, which increase tran-scription termination. Journal of Molecular Biology, 142(1), 123–9.199. Stauffer, G. V., Zurawski, G., and Yanofsky, C. (1978) Single base-pair alterations in the Es-cherichia coli trp operon leader region that relieve transcription termination at the trp attenuator.Proceedings of the National Academy of Science of the USA, 75(10), 4833–7.200. Stroynowski, I. and Yanofsky, C. (1982) Transcript secondary structures regulate transcriptiontermination at the attenuator of S. marcescens tryptophan operon. Nature, 298, 34–38.201. Winkler, M. E., Mullis, K., Barnett, J., Stroynowski, I., and Yanofsky, C. (1982) Transcriptiontermination at the tryptophan operon attenuator is decreased in vitro by an oligomer comple-mentary to a segment of the leader transcript. Proceedings of the National Academy of Science ofthe USA, 79(7), 2181–5.202. Furuse, K. (1987) Distribution of coliphages in the environment: general considerations. In: GoyalSM (ed) Phage ecology, Wiley, New York.137203. Crawford, E. M. and Gesteland, R. F. (1964) The adsorption of bacteriophage R17. Virology, 22,165–167.204. Bradley, D. E. (1972) Shortening of Pseudomonas aeruginosa pili after RNA-phage adsorption..J Gen Microbiol, 72, 303–319.205. Shapiro, L. and Bendis, I. (1975) RNA phages of bacteria other than E. coli. In: N Zinder (ed)RNA phages, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY.206. Miyake, T., Haruna, I., Shiba, T., Itoh, Y. H., Yamane, K., and Watanabe, I. (1971) Groupingof RNA phages based on template specificity of their RNA replicases. Proceedings of the NationalAcademy of Science of the USA, 68, 2022–2024.207. Murphy, F. A., Fauquet, C. M., Bishop, D. H. L., Ghabrial, S. A., Jarvis, A. W., Martelli, G. P.,Mayo, M. A., and Summers, M. D. (1995) Virus taxonomy: The classification and nomenclatureof viruses. The sixth report of the International Committee on Taxonomy of Viruses, Springer-Verlag, Vienna.208. van Himbergen, J., van Geffen, B., and van Duin, J. (1993) Translational control by a long rangeRNA-RNA interaction; a basepair substitution analysis. Nucleic Acids Res, 21(8), 1713–7.209. Kolakofsky, D. and Weissmann, C. (1971) Possible mechanism for transition of viral RNA frompolysome to replication complex. Nature (London) New Biol, 231, 42–46.210. Klovins, J., Tsareva, N. A., de Smit, M. H., Berzins, V., and van Duin, J. (1997) Rapid evolutionof translational control mechanisms in RNA genomes. J Mol Biol, 265(4), 372–84.211. Robertson, H. D. and Lodish, H. F. (1970) Messenger characteristics of nascent bacteriophageRNA. Proceedings of the National Academy of Science of the USA, 67(2), 710–716.212. van Duin, J. (1988) Single-stranded RNA bacteriophage. In: Calendar R, ed. The bacteriophages,vol 1, Plenum Press, New York.213. Beekwilder, J., Nieuwenhuizen, R., Poot, R., and van Duin, J. (1996) Secondary structure modelfor the first three domains of Qbelta RNA. Control of A-protein synthesis. J Mol Biol, 256, 8–19.214. Skripkin, E. A., Adhin, M. R., de Smit, M. H., and van Duin, J. (1990) Secondary structure ofthe central region of bacteriophage MS2 RNA. Conservation and biological significance. Journalof Molecular Biology, 211(2), 447–63.215. Olsthoorn, R. C., Licis, N., and van Duin, J. (1994) Leeway and constraints in the forced evolutionof a regulatory RNA helix. EMBO J, 13(11), 2660–8.216. Poot, R. A., Tsareva, N. V., Boni, I. V., and van Duin, J. (1997) RNA folding kinetics regulatestranslation of phage MS2 maturation gene. Proc Natl Acad Sci U S A, 94(19), 10110–5.138217. de Smit, M. H. and van Duin, J. (1990) Secondary structure of the ribosome binding site deter-mines translational efficiency: A quantitative analysis. Proceedings of the National Academy ofScience of the USA, 87, 7668–7672.218. Kondo, M. (1975) Structure and function of Qβ replicase. Arch Int Physiol Bioch, 83, 909–948.219. Lai, M. M. (1995) The molecular biology of hepatitis delta virus. Annu. Rev. Biochem, 64, 259–286.220. Wang, K. S., Choo, Q. L., Weiner, A. J., Ou, J. H., Najarian, R. C., Thayer, R. M., Mullenbach,G. T., Denniston, K. J., Gerin, J. L., and Houghton, M. (1986) Structure, sequence and expressionof the hepatitis delta (delta) viral genome. Nature, 323(6088), 508–514.221. Kos, A., Dijkema, R., Arnberg, A. C., van der Meide, P. H., and Schellekens, H. (1986) Thehepatitis delta (delta) virus possesses a circular RNA. Nature, 323(6088), 558–560.222. Makino, S., Chang, M. F., Shieh, C. K., Kamahora, T., Vannier, D. M., Govindarajan, S., andLai, M. M. (1987) Molecular cloning and sequencing of a human hepatitis delta (delta) virus RNA.Nature, 329(6137), 343–346.223. Kuo, M. Y., Goldberg, J., Coates, L., Mason, W., Gerin, J., and Taylor, J. (1988) Molecularcloning of hepatitis delta virus RNA from an infected woodchuck liver: sequence, structure, andapplications. J Virol, 62(6), 1855–1861.224. Lazinski, D. and Taylor, J. (1995) Regulation of the hepatitis delta virus ribozymes: to cleave ornot to cleave?. RNA, 1, 225–233.225. Been, M. D. and Wickham, G. S. (1997) Self-cleaving ribozymes of hepatitis delta virus RNA.Eur. J. Biochem, 247, 741–753.226. Perrotta, A. T. and Been, M. D. (1990) The self-cleaving domain from the genomic RNA ofhepatitis delta virus: sequence requirements and the effects of denaturant. Nucleic Acids Res, 18,6821–6827.227. Woodson, S. and Cech, T. (1991) Alternative secondary structures in the 5 exon affect bothforward and reverse self-splicing of the Tetrahymena intervening sequence RNA. Biochemistry,pp. 2042–2050.228. Woodson, S. (1992) Exon sequences distant from the splice junction are required for efficientself-splicing of the Tetrahymena IVS. Nucl. Acids Res, 20, 4027–4032.229. Nikolcheva, T. and Woodson, S. (1999) Facilitation of group 1 splicing in vivo: Misfolding of theTetrahymena IVS and the role of ribosomal RNA exons. Journal of Molecular Biology, 292(3),557–567.139230. Perrotta, A. and Been, M. (Apr, 1991) A pseudoknot-like structure required for efficient self-cleavage of hepatitis delta-virus RNA. Nature, 350(6317), 434–436.231. Tanner, N. K., Schaff, S., Thill, G., Petit-Koskas, E., Crain-Denoyelle, A. M., and Westhof, E.(1994) A three-dimensional model of hepatitis delta virus ribozyme based on biochemical andmutational analyses. Curr. Biol, 4(6), 488–498.232. Perrotta, A. T. and Been, M. D. (1996) Core sequences and a cleavage site wobble pair requiredfor HDV antigenomic ribozyme self-cleavage. Nucleic Acids Res, 24, 1314–1321.233. Matysiak, M., Wrzesinski, J., and Ciesiolka, J. (1999) Sequential folding of the genomic ribozymeof the hepatitis delta virus: structural analysis of RNA transcription intermediates. J. Mol. Biol.,291, 283–294.234. Perrotta, A., Nikiforova, O., and Been, M. (1999) A conserved bulged adenosine in a peripheralduplex of the antigenomic HDV self-cleaving RNA reduces kinetic trapping of inactive conforma-tions. Nucl. Acids Res., 27, 795–802.235. Mathews, D., Sabina, J., Zuker, M., and Turner, D. (1999) Expanded sequence dependence ofthermodynamic parameters improves prediction of RNA secondary structure. J. Mol. Biol., 288,911–940.236. Zuker, M., Mathews, D., and Turner, D. (1999) Algorithms and thermodynamics for RNA sec-ondary structure prediction practical guide In RNA Biochemistry and Biotechnology, J.B.B.R.C.Clark (Ed.), NATO ASI Series, Kluwer Academic Publishers, Dordrecht, the Netherlands.237. Zhu, J. and Wartell, R. (1999) The effect of base sequence on the stability of RNA and DNAsingle base bulges. Biochemistry, 38, 15986–15993.238. Chadalavada, D. M., Senchak, S. E., and Bevilacqua, P. C. (2002) The folding pathway of thegenomic hepatitis delta virus ribozyme is dominated by slow folding of pseudoknots. Journal ofMolecular Biology, 317, 559–575.239. Benner, S., Ellington, A., and Tauer, A. (1989) Modern metabolism as a palimpsest of the RNAworld. Proceedings of the National Academy of Science of the USA, 86, 7054–7058.240. Joyce, G. (2002) The antiquity of RNA-based evolution. Nature, 418, 214–221.241. Mironov, A., Gusarov, I., Rafikov, R., Lopez, L., Shatalin, K., Kreneva, R., Perumov, D., andNudler, E. (2002) Sensing small molecules by nascent RNA: a mechanism to control transcriptionin bacteria. Cell, 111, 747–756.242. Haller, A., Rieder, U., Aigner, M., Blanchard, S. C., and Micura, R. (2011) Conformationalcapture of the SAM-II riboswitch. Nat Chem Biol., 7(6), 393–400.140243. Grundy, F. and Henkin, T. (1998) The S box regulon: a new global transcription terminationcontrol system for methionine and cysteine biosynthesis genes in Gram-positive bacteria. Mol.Microbiol., 30, 737–749.244. Nudler, E. and Mironov, A. (2004) The riboswitch control of bacterial metabolism. TrendsBiochem. Sci., 29(1), 11–17.245. Epshtein, V., Mironov, A., and Nudler, E. (2003) The riboswitch-mediated control of sulfurmetabolism in bacteria. Proceedings of the National Academy of Science of the USA, 100, 5052–5056.246. Montange, R. K. and Batey, R. T. (2006) Structure of the S- adenosylmethionine riboswitchregulatory mRNA element. Nature, 441, 1172–1175.247. Griffiths-Jones, S. (2005) RALEE–RNA ALignment editor in Emacs. Bioinformatics, 21(2), 257–259.248. Edgar, R. C. (2010) Search and clustering orders of magnitude faster than BLAST. Bioinformatics,26(19), 2460–2461.249. Benson, D. A., Karsch-Mizrachi, I., Lipman, D. J., Ostell, J., and Sayers, E. W. (2009) GenBank.Nucleic Acids Res, 37, D26–31.250. Frieda, K. L. and Block, S. M. (2012) Direct observation of cotranscriptional folding in an adenineriboswitch. Science, 338(6105), 397–400.251. Artsimovitch, I. and Landick, R. (2000) Pausing by bacterial RNA polymerase is mediated bymechanistically distinct classes of signals. Proceedings of the National Academy of Science of theUSA, 97(13), 7090–7095.252. Finn, R. D., Bateman, A., Clements, J., Coggill, P., Eberhardt, R. Y., Eddy, S. R., Heger, A.,Hetherington, K., Holm, L., Mistry, J., Sonnhammer, E. L. L., Tate, J., and Punta, M. (2014)Pfam: the protein families database. Nucleic Acids Research, 42(D1), D222–D230.253. Aviram, I., Veltman, I., Churkin, A., and Barash, D. (2012) Efficient procedures for the numericalsimulation of mid-size RNA kinetics. Algorithms for Molecular Biology, 7(24).254. Fontana, W. (2002) Modelling ‘evo-devo’ with RNA. BioEssays, 24(12), 1164–1177.255. Schuster, P. (2001) Evolution in silico and in vitro: The RNA model. Biological Chemistry,382(9), 1301–1314.256. Schultes, E. A. and Bartel, D. P. (2000) One Sequence, Two Ribozymes: Implications for theEmergence of New Ribozyme Folds. Science, 289(5478), 448–452.141257. Held, D., Travis Greathouse, S., Agrawal, A., and Burke, D. (2003) Evolutionary Landscapes forthe Acquisition of New Ligand Recognition by RNA Aptamers. Journal of Molecular Evolution,57(3), 299–308.258. Churchman, L. and Weissman, J. S. (JAN, 2011) Nascent transcript sequencing visualizes tran-scription at nucleotide resolution. Nature, 469, 368–373.259. Perdrizet, George A., I., Artsimovitch, I., Furman, R., and et al. (2012) Transcriptional pausingcoordinates folding of the aptamer domain and the expression platform of a riboswitch. Proceedingsof the National Academy of Science of the USA, 109(9), 3323–3328.260. Artsimovitch, I. and Landick1, R. (1998) Interaction of a nascent RNA structure with RNApolymerase is required for hairpin-dependent transcriptional pausing but not for transcript release.Genes Dev., 12(19), 3110?122.261. Zamft, B., Bintu, L., Ishibashi, T., and et al. (JUN, 2012) Nascent RNA structure modulates thetranscriptional dynamics of RNA polymerases. Proceedings of the National Academy of Scienceof the USA, 109(23), 8948–8953.262. Tadigotla, V., O’Maoileidigh, D., Sengupta, A., and et al. (2006) Thermodynamic and kineticmodeling of transcriptional pausing. Proceedings of the National Academy of Science of the USA,103(18), 7198–7198.263. Bai, L., Shundrovsky, A., and Wang, M. D. (2008) Sequence-dependent kinetic model for tran-scription elongation by RNA polymerase. JOURNAL OF MOLECULAR BIOLOGY, 381(4),1088–1088.264. Bai, L. and Wang, M. D. (2010) Comparison of pause predictions of two sequence-dependenttranscription models. JOURNAL OF STATISTICAL MECHANICS-THEORY AND EXPERI-MENT,.265. Tafer, H., Ameres, S. L., Obernosterer, G., Gebeshuber, C. A., Schroeder, R., Martinez, J., andHofacker, I. L. (05, 2008) The impact of target site accessibility on the design of effective siRNAs.Nat Biotech, 26(5), 578–583.142


Citation Scheme:


Citations by CSL (citeproc-js)

Usage Statistics



Customize your widget with the following options, then copy and paste the code below into the HTML of your page to embed this item in your website.
                            <div id="ubcOpenCollectionsWidgetDisplay">
                            <script id="ubcOpenCollectionsWidget"
                            async >
IIIF logo Our image viewer uses the IIIF 2.0 standard. To load this item in other compatible viewers, use this url:


Related Items