R N A Secondary Structure Prediction Using Hierarchical Folding . b y Hosna Jabbari B.Sc, The University of Victoria, 2005 A THESIS SUBMITTED IN PARTIAL FULFILMENT OF T H E REQUIREMENTS FOR T H E D E G R E E OF Master of Science in The Faculty of Graduate Studies (Computer Science) The University Of British Columbia August, 2007 © Hosna Jabbari 2007 A b s t r a c t Algorithms for prediction of RNA secondary structure— the set of base pairs that form when an RNA molecule folds— are valuable to biologists who aim to understand RNA structure and function. Improving the accuracy and efficiency of prediction methods'is an ongoing challenge, particularly for pseudoknotted secondary structures, in which base pairs overlap. This challenge is biologically important, since pseudoknotted structures play essential roles in functions of many RNA molecules, such as splicing and ribosomal frameshifting. State-of-the-art methods, which are based on free energy minimization, have high run-time complexity (typically 0(n5) or worse), and can handle (minimize over) only limited types of pseudoknotted structures. We analyze a new approach for prediction of pseudoknotted structures, mo-tivated by the hypothesis that RNA structures fold hierarchically, with pseudo-knot free (non-overlapping) base pairs forming first, and pseudoknots forming later so as to minimize energy relative to the folded pseudoknot free structure. Our HFold algorithm, based on work of S. Zhao, uses two-phase energy min-imization to predict hierarchically-formed secondary structures in 0(n*) time, matching the complexity of the best algorithms for pseudoknot free secondary structure prediction via energy minimization. Our algorithm can handle a wide range of biological structures, including kissing hairpins and nested kissing hair-pins, which have previously required 0(vi6) time. We also report on the experimental evaluations of HFold and present thor-ough analyses of the results. We show that if the input structure to the algo-rithm is correct, running the algorithm results in 16% accuracy improvement on average over the accuracy of the true pseudoknot free structures. However if the input structure is not correct, the accuracy improvement is not.significant. Tf the first 10 suboptimal foldings are given as input to our algorithm instead of just the minimum free energy structure (MFE), the prediction accuracy im-proves significantly over the accuracy of the MFE structures. This improvement is even more when the number of suboptimal foldings as input to our algorithm increases. The comparison of the energy of the structures predicted by HFold on the true pseudoknot free structures with the energy of the true structures calculated using a different method with the same energy model shows that the energy model may be the cause for the cases for which HFold predicts struc-tures far from the true structures. Our experimental result provides some ways in which the hierarchical folding hypothesis might need to be refined. iii C o n t e n t s A b s t r a c t ii Con t en t s iii L i s t o f Tables v L i s t o f F i g u r e s vi A c k n o w l e d g e m e n t s x 1 I n t r o d u c t i o n 1 1.1 R N A Secondary Structure 1 2 B a c k g r o u n d on R N A Seconda ry S t r u c t u r e 5 2.1 Notation ' 5 2.2 Region and Loop Classification and Related Definitions 6 2.3 Bi-secondary and Density-2 Structures 9 2.4 Energy Model '. 11 2.4.1 Pseudoknot free energy model 11 2.4.2 Pseudoknotted energy model 12 3 A l g o r i t h m 14 3.1 A Useful Property of Density-2 Structures . . . '. 14 3.2 High Level Description of HFold 18 4 R e s u l t s 23 4.1 Accuracy ' 24 4.2 Dataset 24 4.3 Accuracy when input is the true pseudoknot free structure . . . . 24 4.4 Accuracy when input is the M F E pseudoknot free structure . . . 29 4.5 Best accuracy, taken over suboptimal folds 33 ' 4.6 Accuracy of the lowest energy structure, taken over suboptimal folds 36 4.7 Peeling Techniques 37 4.7.1 Accuracy when input is the M F E pseudoknot free struc-ture with 1-3 base pairs removed from inside or outside the stems that close hairpin loops 37 Contents ' iv 4.7.2 Accuracy when input is the M F E pseudoknot free struc-ture with 1-3 base pairs removed from both inside and outside each stem that closes a hairpin loop 39 4.7.3 Accuracy when input is the M F E pseudoknot free struc-ture with stems of size 1-3 removed 44 4.7.4 Accuracy when input is the M F E pseudoknot free struc-ture with loosened loops 44 4.8 Base Pair Confidence . . . . : . .' 45 4.9 Energy Comparison 45 5 C o n c l u s i o n a n d F u t u r e W o r k 52 B i b l i o g r a p h y 54 A R e c u r r e n c e s 57 A . l W,,.; 57 A.2 Wlij ' . ' 58 A.3 Wfrl , '• • • 60 A.4 VP-j 61 A.5 VPi.i".j 64 A.6 V,/ 65 A.7 VBkj 65 A.8 VM,/ . . . : : 66 A.9 WMij 66 A. 10 WMBi,j 67 A . 11 WMB'-'.l 68 A.12 BE[iJ,j',j) ' 69 List o f Tables V 2.1 Energy parameters 13 4.1 Pseudoknot free sequences used for secondary structure predic-tion. In order, the columns provide (1) sequence ID, as found in the database or paper from which we obtained the sequence, and their length in parenthesis; (2) type of sequence: and (3) the reference • from which the sequence was obtained 25 4.2 Sequences with H-type pseudoknots used for secondary structure prediction. In order, the columns provide (1) sequence ID, as found in the database or paper from which we obtained the sequence, and their length in parenthesis: (2) type of sequence: (3) structure of sequence (H-type pseudoknot = H, H-type pseudoknots with nested structures = H*): and (4) the reference from which the sequence was obtained 50 4.3 Sequences with kissing interactions and the density-3 sequence used for secondary structure prediction. In order, the columns provide (1) sequence ID, as found in the database or paper from which we obtained the sequence, and their length in parenthesis; (2) type of sequence; (3) Structure of the sequence (kissing interactions = K, density-3 structure = D3); and (4) the reference from which the sequence was obtained. 51 vi List of Figures 1.1 A pseudoknot free structure (top), an H-type pseudoknotted struc-ture (center) and a kissing hairpin (bottom). Figures were generated by PseudoViewer [10] 4 2.1 An H-type pseudoknotted structure (left) and a pseudoknot free structure (right) in graphical (top) and arc diagram (bottom) formats 6 2.2 Pseudoknot 7 2.3 Density-2 structures 10 2.4 Bi-secondary structure that is not density-2 10 2.5 The figure shows a density-2 secondary structure for a sequence of length n. /?.;.,• is the structure restricted to the region (for i < j < n) and dj is that part of /?,,;, above the horizontal line. The circle dots show positions of j where R;J is a prefix of a density-2 pseudoloop with respect to G, and the crosses show positions of j where Rij is not a prefix of a density-2 pseudoloop with respect to G 11 3.1 Illustration of band borders in Lemma 3 and Lemma 4 17 3.2 Illustration of how the WMB recurrence unwinds, to calculate WMBij. Arcs above the horizontal line from i to j represent base pairs of C,;,, and arcs below the line represent base pairs of G'rr Case (a) of the WMB recurrence handles the overall structure whose energy is WMBij, with I = l\. with terms to account for energies of the right upper band (BE) and right lower closed subregion ( W](Ll + \).(in,c(i,'., )-i)) a s w e ' l a s the remaining prefix (WMB/ih). The term WMBliM is handled by case (a) of the WMB1 recurrence, with / = l-i and terms to account for the lower right substructure labeled VPi2iil, the upper left band (BE), and the remaining "prefix" of the overall structure (WMB'i^li_l-j). WMBl,l^_l-} is then handled by case (b) of the WMB1 recurrence, with I = l-i, and terms to account for WI^3+l)^2_i^ and WMBlihi. Finally, the WMB'lh term is handled by end case (c) of the WMB1 recurrence 20 3.3 Visual illustration of recurrences in HFold 22 List of Figures vii 4.1 Accuracy of HFold when the input is the true pseudoknot free secondary structure (G/,iS), versus the accuracy of GyUJ struc-tures. The horizontal axis represents the accuracy of structures predicted by HFold while the vertical line represents the accu-racy of the input structures. Pseudoknotted and pseudoknot free structures are presented by 'x' and 'O', respectively 26 4.2 A pseudoknot free structure given as input to HFold (top) and the corresponding HFold pseudoknot free result (bottom). The overall structure remains pseudoknot free after HFold, but more base pairs are added to the loops. For example, the large internal loop starting at 1 and ending at 164 on the top figure is broken into two smaller loops in the bottom figure. Similarly the loops of the branch starting at 51 and ending at index 90 are broken into smaller loops 27 4.3 CH,, structure given as input to HFold (top), the correspond-ing HFold result (center) and the true pseudoknotted structure (bottom). HFold adds 3 pseudoknot free stacked base pairs at 2.40, 3.39, and 4.38, while the true pseudoknotted structure has pseudoknotted base pairs at 18.36, 19.35, and 20.34 28 4.4 Predicted pseudoknotted structure (top) and the true pseudo-knotted structure (bottom). HFold predicts interleaved bands, while the true pseudoknotted structure has 5 separate H-type pseudoknots 30 4.5 Accuracy of HFold when the input is the true pseudoknot free sec-ondary structure (G.,,,-,„,(j) (horizontal axis), versus the accuracy of Gsm„u structures (vertical axis). Pseudoknotted and pseudo-knot free structures are presented by 'x ' and 'O', respectively. . . . 31 4.6 Accuracy of HFold when the input is the MFE structure (hori-zontal axis), versus the accuracy of the MFE structures (vertical axis). Pseudoknotted and pseudoknot free structures are pre-sented by [ \ ' and ; 0 ' . respectively 32 4.7 MFE structure (top), true pseudoknotted structure (bottom). Tire true pseudoknotted structure has 4 stacking base pairs at 1.14, 2.13, 3.12, 4.11 and another stem of size 4 at 6.26, 7.25, 8.24, and 9.23, while in the MFE structure the first stem is a shift of the true structure (3.15, 4.14, 5.13, and 6.12). In this cases HFold does not add any base pairs to the input structure. 33 4.8 Accuracy of HFold on the best-of-25 (horizontal axis), versus the accuracy of the MFE structures (vertical axis). Pseudoknotted and pseudoknot free structures are presented by 'x' and 'O', re-spectively 34 4.9 Accuracy of HFold on the best-of-25, versus the accuracy of the best-of-10. Pseudoknotted and pseudoknot free structures are presented by ;x : and ! 0 ' , respectively 35 List of Figures viii 4.10 Accuracy of HFold on the minimum energy structure over the first 25 suboptimals, versus the accuracy of HFold on the M F E structure. Pseudoknotted and pseudoknot free structures are pre-sented by ;x' and ' O ' , respectively 36 4.11 The pseudoknotted structure predicted by HFold (top), the cor-responding M F E structure (center) and the true structure (bot-tom). Although HFold correctly predicts parts of the pseudoknot in the structure on the top, the accuracy of the structure pre-dicted by HFold is lower than the accuracy of the corresponding M F E structure, because there are some extra pseudoknotted base pairs predicted by HFold. Note that the M F E structure has all the correct pseudoknot free sub-structures (except the stem clos-ing at 1.114). High pseudoknot initiation penalty, and low band penalty are the main causes for such predictions by HFold. . . . 38 4.12 Accuracy of the M F E structures with 2 base pairs removed from inside each stem that closes a hairpin loop (horizontal axis), ver-sus the accuracy of the M F E structures (vertical axis). Pseudo-knotted and pseudoknot free structures are presented by 'x ! and l O \ respectively • 40 4.13 Accuracy of HFold when input is the M F E structures with 2 base pairs removed from inside each stem that closes a hairpin loop (horizontal axis), versus the accuracy of HFold when input structure is M F E (vertical axis). Pseudoknotted and pseudoknot free structures are presented by 'x' and ' O ' , respectively 41 4.14 Accuracy of the M F E structures with 1 base pair removed from both ends of each stem (horizontal axis), versus the accuracy of the M F E structures (vertical axis). Pseudoknotted and pseudo-knot free structures are presented by 'x' and ' O ' , respectively. . . 42 4.15 Accuracy of HFold when input is the M F E structures with 1 base pair removed from both ends of each stem (horizontal axis), versus the. accuracy of HFold when input structure is M F E (ver-tical axis). Pseudoknotted and pseudoknot free structures are presented by 'x ! and ' O ' , respectively 43 4.16 Accuracy of HFold when the input is the pseudoknot free struc-tures with base pair confidence above 75% (horizontal axis), ver-sus the accuracy of the M F E structures (vertical axis). Pseudo-knotted and pseudoknot free structures are presented by 'x' and ' O ' , respectively 46 4.17 Comparison of HFold energy and true structure energy. The hori-zontal axis represents the energy of structures predicted by HFold when the input is G/,;,,, while the vertical axis represents the en-ergy of the "true" structures. Pseudoknotted and pseudoknot free structures are presented by 'x ' and ' O ' , respectively 48 4.18 Comparison of energy of G/„:fl (horizontal axis) and the energy of the M F E structure (vertical axis). Pseudoknotted and pseudo-knot free structures are presented by 'x' and ! 0 ; , respectively. . . 49 List of Figures ix A.l Illustration of Cases for Wij. 58 A.2 Illustration of Cases for Wlij.- 59 A.3 Illustration of cases for Wt .j. In case (5), the plotted structure from i to j could contain more than 2 bands (not illustrated) . . 60 A.4 Illustration of Cases for VPitj -. . 63 A.5 Illustration of Cases for VP\j '65 A.6 Illustration of Cases for VMi:j 66 A.7 Illustration of Cases for WMBij 67 A.8 Illustration of Cases for WMB'.i.j 69 A.9 Illustration of Cases for BE-,.-,;',,<,,• 71 X Acknowledgements I would like to thank my supervisor, professor Anne Condon, for her endless sup-port and encouragements. She introduced me to a new and interesting branch of science that I did not know about before. I cannot thank her enough for all she has done for me. I would also like to thank Dr. Will Evans for being my second reader and for his helpful comments. Many thanks go to people of Beta lab for making Beta such a lovely place to work. Last, but not least, 1 would like to thank my parents, my sisters and my wonderful husband for believing in me.through all stages of my studies. Without their love and support 1 could not have done anything. Hosna Jabbari. To my best friend and wonderful husband, Majid. 1 Chapter 1 Introduction 1.1 R N A Secondary Structure R N A molecules aid in translation and replication of the genetic code, catalyze cellular processes, and regulate the expression level of genes [7]. Structure is key to the function of R N A molecules, and so methods for.predicting R N A structure from the base sequence are of great value. Currently, prediction methods focus on secondary structure - the set of base pairs that form when the R N A molecule folds. There has been significant success in prediction of pseudoknot, free sec-ondary structures, which have no crossing base pairs (see F ig . 1.1). State-of-the-art prediction algorithms, such as Mfold [18] or RNAfo ld [11] find the structure with minimum free energy ( M F E ) from the set of all possible pseudoknot free secondary structures. The energy of a structure is estimated as the sum of en-ergies of loops that form when the molecule folds, where the loop energy values are'provided by biologists. While many small R N A secondary structures are pseudoknot free, pseu-doknots do arise frequently in biologically-important R N A molecules, both in the cell [24, 27], and in viral R N A [6]. Examples include simple H-type pseu-doknots, with two interleaved stems, which are essential for certain catalytic functions and for ribosomal frameshifting [2], as well as kissing hairpins, which are essential for replication in the coxsackie B virus [19]. Unfortunately, M F E pseudoknotted secondary structure prediction is N P -hard [1, 13, 14], even for a simple energy model that depends on base pairs but not on unpaired bases. Polynomial-time MFE-based approaches to pseudo-knotted structure prediction have been proposed [1, 8, 21, 22, 26], with respect to various sum-of-loops energy models for pseudoknotted structures, which find the M F E structure for a given input sequence, from a. restricted class of struc-tures. A class of structures can be defined by specifying allowable patterns of interleaving among base pairs. For example, Mfold and RNAfo ld handle the class of pseudoknot free secondary structures; we provide more examples later. We say that a structure' R can be handled by a given algorithm if R is in the class of structures over which the algorithm optimizes. Algorithms for M F E pseudoknotted secondary structure prediction trade off run-time complexity and generality - the class of structures handled, that is, the class of structures over which the algorithms optimize. For example, kissing hairpins are not in the class of structures handled by the 0 ( n 5 ) algorithms of Akutsu [1] and Dirks and Pierce [8] but are in the class handled by the Q(n6) algorithm of Rivas and Eddy [22]. (We note that, even when the true Chapter 1. Introduction 2 structure R for a'sequence is handled by an algorithm, the algorithm still may not correctly predict R, because correctness depends not only on the generality of the algorithm but also on the energy model and energy parameters.) Our work is motivated by two limitations of MFE-based algorithms for pseu-doknotted secondary structure prediction: they have high time, complexity, and ignore the folding pathway from unfolded sequence to stable structure. Sev-eral experts have provided evidence for, and support, the hierarchical folding hypothesis [16, 25], which is succinctly stated by Tinoco and Bustamante as follows: "An RNA molecule [has] a. hierarchical structure in which the primary sequence determines the secondary structure which, in turn, determines its ter-tiary folding, whose formation alters only minimally the secondary structure" [25]. (These and other authors consider the initially-formed secondary structure to be pseudoknot free, and refer to base pairs that form pseudoknots as part of the tertiary structure. However, here we refer to all canonical base pairs, namely A-U, C-G, and G-U, as secondary structure.) We note that while the hierarchical folding hypothesis is a common assumption, some counter examples have been reported, notably formation of the structure of a subdomain of the Tetrahyinena thermophila group I intron ribozyme [29]. However, even in this case, 15 of the 19 base pairs in the initially-formed pseudoknot free secondary structure are retained upon formation of tertiary structure, and the 4 missing base pairs lie at the ends of stems. This work is a continuation of the M.Sc. work of Zhao [30], in which she presented a novel and efficient algorithm to predict RNA secondary structures, in a manner consistent with a natural formalization of the hierarchical folding hypothesis. She defined the problem of predicting the secondary structure as follows: given a sequence S and a pseudoknot free secondary structure G (a set of base pairings), find a pseudoknot free secondary structure G' (a set of base pairings disjoint from G) for S, such that the free energy of G U G' is less than or equal to the free energy of GUG" for all pseudoknot free structures G" / G'. As with algorithms for MFE pseudoknotted secondary structure prediction, algorithms,for Hierarchical-MFE secondary structure prediction may handle a restricted class of structures. That is, the type of structure formed by G U G" may have restricted patterns of interleaving among base pairs. Since both G and G" are pseudoknot free, the most general class of structures that could be handled by an algorithm for hierarchical-MFE secondary structure prediction would be the hi-secondary structures of Witwer et al. [28] - those structures which can be partitioned into two pseudoknot free secondary structures G and G'. There is no known way to solve the hierarchical-MFE prediction for the class of bi-secondary structures. Instead, Zhao suggested a solution with respect to a subclass of the bi-secondary structures, which she called density-2 structures, explained in Section 2. This is quite a genera) class, including H-type pseudoknots and kissing hair-pins, as well as structures containing nested instances of these structural motifs. The only known algorithm for predicting MFE nested kissing hairpins, that of Rivas and Eddy, requires Q(nb) time. Rastegari and Condon [20] showed that, out of a set of over 1,100 biological structures, all but nine are density-2 (when Chapter 1. Introduction 3 isolated base pairs are removed), and six of these nine are also not in the class handled by Rivas and Eddy's algorithm. The main contributions of this work are: • refining and fixing the original recurrences presented in the work of Zhao [30], • defining and proving four lemmas, that help establish corrections of Zhao's recurrences, • implementing the HFold algorithm, and • throughly analyzing its performance on a dataset of 70 strands. We published a preliminarily version of this work in [12]. In Chapter 2, we present some useful background information and notations pertaining to RNA structure prediction. In Chapter 3, we summarize HFold, a. dynamic programming algorithm that solves the Hierarchical-MFE secondary structure prediction problem for the class of density-2 secondary structures in 0(n3) time and 0(n2) space, and present our lemmas. In Chapter 4, we describe the results of our experimental analysis. Our ex-perimental evaluation of HFold shows that, when provided with the true pseu-doknot free substructure for the input sequence, HFold adds pseudoknots which, on average, improve the accuracy (measured as fraction of correctly predicted bases) by 16%. However, HFold does not significantly improve accuracy when given as input a computational prediction of the MFE pseudoknot free secondary structure G, since HFold cannot correct errors in G. Chapter 1. Introduction 4 50 40 U-C--A •U-G P-6' 20 A . J " - G \ 60 C A' ,65 A - U 10 A - G - G - U - G - G - C r f-.-A-U-U-G r ""A-G-G U .u-G-G-G;-6 -G (j—-45 C G C" C "C G' U "& I • C "A J G T, iv "A; f A' "C' A' U" V c C "A' C U "G A" *G~-A C *A' G' V G' "G"' G' t ' G'" T ,U' t' G" t" A 'A G *•: U V G' c A" Figure 1.1: A pseudoknot free structure (top), an H-type pseudoknotted struc-ture (center) and a kissing hairpin (bottom). Figures were generated by PseudoViewer [10]. " 5 Chapter 2 Background on R N A Secondary Structure A n R N A molecule is a sequence of nucleotides, or bases, of which there are four types: Adenine (A), Guanine (G), Cytosine (C), and Uraci l (U). The molecule has chemically distinct ends, called the 5' and 3' ends. We model an R N A molecule as a sequence over the alphabet {A,C, G, £ /} , with the left end of the sequence being the 5' end. Throughout, n denotes the length of an R N A sequence. We index the bases consecutively from the 5' end starting from 1, and refer to a base by its index. When an R N A molecule folds, bonds may form between certain pairs of bases, where each base may pair with at most one other base. A secondary structure R is a set of pairs < i < j < n, such that no index occurs in more than one pair. The pair i.j denotes that the base indexed i is paired with the base indexed j. The canonical base pairs, which form the secondary structure, are the Watson-Crick pairs A-U and C-G, as well as the wobble pair G-U (see Fig . 2.1). 2.1 Notation We use the following notations when describing our algorithms. Throughout, we use R to denote a secondary structure. • bpni'i')'- We let bpa(i) denote the index of the base that is paired with base i in R. if any; otherwise bpp.(i) = 0 . • paired(.R, i ) : true if and only if i is paired in the structure R. • cross: if i.j, i'.j', and i < i' < j < j', we say that pair i.j crosses pair ?/.;/' (and i'.j' crosses i.j). • p s e u d o k n o t t e d base pai r : We say that i.j is a pseudoknotted base pair if for some other base pair i!.j' in structure R, i.j crosses i ' . j ' . We also refer to i and j as pseudoknotted base indices. • p seudokno t free secondary s t ruc ture : If there are no pseudoknotted base pairs in the given structure, it is called pseudoknot free secondary structure. Chapter 2. Background on RNA Secondary Structure 6 NOT Weakly closed Figure 2.1: An H-type pseudoknotted structure (left) and a pseudoknot free structure (right) in graphical (top) and arc diagram (bottom) formats. • cover: Let G be a pseudoknot free secondary structure. Base pair i.j covers base fc if i < k < j and there is no other base pair i'.j' £ G with i < i' < k < j' < j. In this case, we denote i.j by cover (G.k). Otherwise cnver{G,k) = (-1,-1). • isCovered(G, fe): true if and only if some base pair of G covers fe. • region Sequence of indices between i and j inclusive. • disjoint regions: Two regions and ['«',/] are disjoint if no index is in both regions, i.e. j < i! or j ' < i. 2.2 Region and Loop Classification and Related Definitions Following is the terminology used to refer to loops and other standard elements in a, secondary structure. These definitions are mostly taken from and illus-trated in the work of Rastega.i'i and Condon [20]. Throughout, definitions are with respect to a. fixed secondary structure R. Generally we use R to refer to a structure that may be pseudoknotted (that is, contains at least one pseudoknot-ted base pair), and use G to refer-to a structure that we know to be pseudoknot free. Chapter 2. Background on RNA Secondary Structure 7 1 3 9 II 15 18 23 27 34 39 Figure 2.2: Pseudoknot • empty(/?.. [i,j\) with respect to secondary structure R: true if region contains no base pair in R. Formally, V7c,i < k < j,paired(R,k). • w e a k l y c losed reg ion: A region is weakly closed if no base pair connects a base in the region to a base outside the region. Formally, is weakly closed if and only if for all fc 6 either bpn(k) £ [i,j] or bpn{k) = 0. Weakly closed(R, is true if and only if is a weakly closed region oi R. .• c losed region: A weakly closed region with at least two bases, is closed if it cannot be partitioned into two smaller weakly closed regions. Formally, is closed if and only if i < j, \i,j) is weakly closed, and for all I £ — 1], neither [i,L] nor [/ + is weakly closed. Note that if [i.j] is closed then both i and j must be paired (although not necessarily with each other) [20]. • p seudokno t free c losed r eg ion : A'closed region that does not contain any pseudoknotted base pairs. • p s e u d o k n o t t e d closed reg ion : A closed region of a structure R, such that i.bpn(i) and bpn(j).j are pseudoknotted base pairs. We refer to • indices i. and j as the left and right borders of the pseudoknotted region M -• d i r e c t l y banded in : For a pseudoknotted base pair i.j, we say i.j is directly banded in base pair i'.j' and write i.j ^ i'.j' if: (1) ,:' < i < j <j\ and (2) [i' + 1,7 — 1] and [j + 1, j ' — 1] are weakly closed regions. "Directly banded in" is transitive: i.j -< i'.j' ^ implies i.j ^ i".j". • band : Let us consider a maximal chain of ^. The minimum (maximum) base pair in the maximal chain is the band's inner (outer) closing pair. If i.j is the outer and i'.j' the inner closing pair of a band, then •'•']• a n c ' [j'.j] are the band's regions. For example there are 3 bands in Fig. 2.2: [1,2] U [22,23], [18.19] U [33, 34] and [27, 27] U [39, 39]. Chapter 2. Background on RNA Secondary Structure 8 i and i' are the borders of band region, j and j' are the borders of [j'.j] band region. We. refer to i and j as the left and the right border of the band respectively. • ins ide a band : A region [i.j] is inside a band [ii,«'i] U if either i\ < i < j < %i or j[ < i < j < j\ is true. • b a n d associa ted w i t h c losed region: We say that band \i, i'] U [j1, j] is associated with closed region [i",j"[ if [''',''•'], and:thus [j',j], are subregions of [/", j"] but are not subregions of any closed child of [i", j" j . For example, in F ig 2.2, the three bands [1, 2] U [22, 23], [18,19] U [33, 34] and [27,27] U [39,39] are associated with closed region [1,39]. • u n p a i r e d bases associa ted w i t h c losed r e g i o n [i,j]: are the unpaired bases in [i,j] but not in any closed region or band region which are sub-regions of [i,j]. For example, in the structure of F ig . 2.2, the unpaired bases associated with closed region [1,39] are 17, 20, 21, 24-26, 28-32, and 35-38. • base pa i r ings associa ted w i t h c losed r eg ion [i,j]: are the base pair-ings in [i.j] but not in any closed region or band region which are subre-gions of [i.,j]. • h a i r p i n loop (or h a i r p i n ): Formally, the tuple defines a hairpin loop in a secondary structure if i and j are paired, and [i + 1, j — 1] is an empty region, i.j is called the closing base pair of the hairpin loop. The hairpin marked in Fig . 2.1 contains 4 unpaired bases. • i n t e rna l loop: A n internal loop, sometimes called interior loop, contains two closing base pairs, and all bases between them are unpaired. The tuple (i,i',j',j), with i + 1 < i! < j' < j — 1, defines an internal loop if i.j and i'.j', and [i + — 1] and [j' + l,j — 1] are empty regions. • s tacked loop: A stacked loop, also called stacked pair, contains two consecutive base pairs. The tuple (i,j) defines a stacked pair if i.j and (i + l ) . ( j — 1) are in R. A stem or helix is made of consecutive stacked loops. Note that, in fact, a stacked loop is also a special case of an internal loop, with no unpaired bases on either side. • bu lge loop: A bulge loop, or simply bulge, is a special case of an internal loop, which has no unpaired base on one side, and at least one unpaired base on the other side. • spans a band: There are two types of internal loops, stacked loops and bulge loops: those for which the closing base pair, i.j, is not pseudoknotted and those for which i.j is pseudoknotted. In the'latter case, we say that the loop spans a band. Chapter 2. Background on RNA Secondary Structure 9 • m u l t i - b r a n c h e d loop: There are two types of multi-branched loops, or multiloops, depending on whether or not they span a band: (1) Let [i.j] be a closed region which is not pseudoknotted, and has at least two (closed region) children, or a pseudoknotted child. Then the unpaired bases and base pairs associated with [i,j] form a multiloop. , (2) Let i.j be a pseudoknotted base pair and i'.j' ^ i.j, where at least one of the (weakly closed) regions [i + l.i' — 1] and [/ -1- 1 ,j — 1] is not empty. Then the unpaired bases and base pairs associated with band region [i,i'\ U [j',j] comprise a multiloop that spans a band. For both types of multiloop, we say that i.j is the closing base pair of the multiloop. For example, in the structure of Fig. 2.2, [2,22] shows a multiloop that spans a band. Note that each closed subregion of [i,j] is called a branch of the corresponding multiloop. • pseudoloop: Let [i,j] be a pseudoknotted closed region. Then the un-paired bases and base pairs associated with [i.j], together with the closing base pairs of the bands associated with \i,j], are members of a pseudoloop. The base pairs i.bp(i) and bp(j).j are the closing base pairs of the pseu-doloop. The pseudoloop is an exterior pseudoloop if region [i.j] is not subregion of any other region. • c losed r eg ion associa ted w i t h pseudoloop: We say that closed re-gion [*',/] is associated with, pseudoloop [i,j], if [i',j'] is a closed proper subregion of [i,j] but not a subregion of any closed subregion of [i,j]. For example, in Fig. 2.2, closed regions [3,9] and [10,16] are associated with pseudoloop [1,39] but closed region [11,15] is not associated with pseudoloop [1, 39]. 2.3 Bi-secondary and Density-2 Structures • Bi-secondary s t ruc tures : Witwer et al. [28] introduced a definition of ;:bi-secondary structure", which is a union of two disjoint pseudoknot free secondary structures. The pseudoknotted secondary structures we can handle in our algorithm are a subset of the bi-secondary structures. • dens i ty : We define density as follows: Let L be a pseudoloop and i.bp(i) and bp{j).j be the closing base pairs of L. We say a band 'U [Ji,ji] crosses k if i\ < k < ji. Let #B(L, k) be the number of bands associated with L that cross k. Then the density of L is: density(L) = max {#B{L: k)) ' (2.1) i<k<j The density of a structure, R, is maximum density of L over all pseu-doloops L of R. We say R is a density-2 structure if the density of R is at Chapter 2. Background on RNA Secondary Structure 10 Figure 2.4: Bi-secondary structure that is not dehsity-2. Chapter 2. Background on RNA Secondary Structure • 11 Figure 2.5: The figure shows a density-2 secondary structure for a sequence of length n. Rij is the structure restricted to the region [i,j] (for i < j < n) and Gi-j is that part of /?.,, above the horizontal line. The circle dots show positions of j where R.j is a prefix of a density-2 pseudoloop with respect to G, and the crosses show positions of j where R,,., is not a prefix of a density-2 pseudoloop with respect to G. most 2. Figure 2.3 illustrates density-2 secondary structures. Figure 2.4 shows a bi-secondary structure that is not a density-2 structure. • prefix: Let G,.( be a pseudoknot free structure over region Let R;,j be a density-2 structure over region [i, j] containing G-,,j. We say that R^ is a prefix of a density-2 pseudoloop with respect to G,,, if i starts the first (leftmost) band associated with a pseudoloop of R,j , and j is either 1. the right border of a closed region associated with the pseudoloop, 2. the right border of the pseudoloop starting at i, 3. the rightmost border of any band associated with /?,;,- — except the first band , or 4. an unpaired base associated with the pseudoloop that is not inside the first two bauds (and not inside any closed subregion). 2.4 Energy Model Computational methods for predicting the secondary structure of an RNA or DNA molecule are based on models of the free energy of loops. The parameters of these models are driven in part by current understanding of experimentally determined free energies, and in part lay what can be incorporated into an effi-cient algorithm. The free energy of a loop depends on temperature; throughout we assume that the temperature is fixed. 2.4.1 P s e u d o k n o t free e n e r g y m o d e l We first summarize the notation used to refer to the free energy of pseudoknot free loops, along with some standard assumptions that are incorporated into loop Chapter 2. Background on RNA Secondary Structure 12 free energy models. We refer to a model that satisfies all of our assumptions as a standard free energy model. This model is somewhat simpler than that underlying MFold and Simfold, but our algorithm can be extended to their more detailed model. • e[-[(i,j): gives the free energy of a hairpin loop closed by i.j; we assume that for all but a small number of cases, e#(i, j) depends only on the length of the loop, and the two paired bases i and j on the loop. • es(*, j)'- gives the free energy of a. stacked pair that consists of i.j and • (* + i).rj-i). • e,;nt('i, i'. j', j): gives the free energy of an internal loop or bulge with exterior pair i.j and interior pair i'.j'. The free energy of a. multiloop with k branches and u unpaired bases is a + bk + cu, where a, b, c are constants. The free energy of a sequence S with respect to a fixed secondary structure R. is the sum of the free energies of the loops of R. Sometimes when the strand S is fixed, it is convenient to refer simply to the free energy of the structure R. We define the free energy of a strand S to be the minimum free energy of the strand, with respect to all structures R. 2.4.2 Pseudoknotted energy model • BE(i, i',j', j): The total energy of band [i,i'\ U [j',j] is the sum of the energies of its loops. If a band has no loops, i.e. consists of just one base paii', we define its energy to beO. • eSLp(i.j): defines the energy of stacked pairs in a band, and its value is es{i,j) * 0.83. • e-uit,p(i,r,r'}j): defines the internal loop that spans a band, and its value is eu,,,(i,r,r',j) * 0.83. We define energy of multiloops that span a band to be similar to pseudoknot free multiloops. The energy of an exterior pseudoloop is calculated as: energy of bands plus Pi,*m.+Pv„*k, + Pup*u+P,, where m is the number of the bands, k is the number of closed subregions, u is the number of unpaired bases. If the pseudoknot is inside a, multiloop or a pseudoloop, P„ is replaced-by Psm or Psp respectively. Let /?..,;,, be a prefix of a pseudoloop. The energy of Ritj is the sum of the energies of all loops within R. t plus a penalty for each band and each unpaired base in [i.j] associated with the pseudoloop of which RtJ is a prefix. Table 2.1 summarizes the energy constants and functions used in our energy model for pseudoknotted structure. Chapter 2. Background on RNA Secondary Structure 13 Table 2.1: Energy parameters. Name Description Value (Kcal/mol) exterior pseudoloop initiation penalty 9.6 Ps,n penalty for introducing pseudoknot 15.0 inside a multiloop P,„ penalty for introducing pseudoknot 15.0 inside a pseudoloop Pi, band penalty 0.2 PUP , penalty for unpaired base in 0.1 a pseudoloop Pps penalty for closed subregion 0.1 inside a pseudoloop energy of a hairpin loop closed by i.j es(i,j) energy of stacked pair closed by i.j e«tp(i..j) energy of stacked pair es(i,j) x 0.83 that spans a band ehu(i,r,r'.j) energy of a pseudoknot free internal loop emi.p('i.,r,r' ,j) energy of internal loop that emt{i,r,r',j) x 0.83 spans a band a • multiloop initiation penalty 3.4 b multiloop base pair penalty • 0.4 c penalty for unpaired base in a 0 multiloop a' penalty for introducing a multiloop 3.4 that spans a band b' base pair penalty for a multiloop 0.4 that spans a band c> penalty for unpaired base in a 0 multiloop that spans a band 14 Chapter 3 Algori thm 3.1 A Useful Property of Density-2 Structures As will become clearer later, the reason that the HFold algorithm works for density-2 structures is because of the following lemmas, which are key for effi-cient decomposition of energies in the recurrences. Roughly, the lemmas show how to calculate the band borders for a given region. L e m m a 1 LetG andG' be disjoint, pseudoknot free, secondary structures, such that Gl)G' is a density-2 secondary structure, and let i, j be the start and end of a pseudoloop of G U G'. Let, I, £ [i + l,j] be such that J. I is paired in G' (but, not in G), 2. I is not in any closed proper subregion of [i,j] 'with respect to GU G', and 3. 3r : i < r < I. < bpc(r) < j. Let &(,: () = niin{fc|?' < k < I < bpc(k)}, and 6'(u) = im\x{k\i < k < I < bpG(k)}. Then, the structure G U C ' contains a band with outer base pair b(n).bpc(b(i.i)) and inner base pair b'^ - ^.bpc^b'^ (j) . Note that restriction (3) ensures that b^^ and b'^ ^ are well defined. The bot-tom left part of Fig. 2.1 illustrates Lemma 1, showing the borders of the band whose arcs cross the base pair involving base I = 14. If [i,j] is the region [2, 30], then b(2,i4) = 2 and &'(2.14) = 6. Proof Throughout the proof we assume that I < bpc(l), but the proof can easily be modified to handle the case when bpc(l) < I by substituting bpa>(l).l instead of l.bpc(l) everywhere in the proof. Since i is the start of a pseudoloop and j is the end of the pseudoloop, [i.j] must be a pseudoknotted closed region of G U G'. Restriction (1) implies hpG>(l) € since if it is not, then [i.j] is not a closed region of G U G'. We claim that for any r satisfying restriction (3), l.bpc(l) crosses r.bpc(r). Chapter 3. Algorithm 15 ULbpcH) does not cross r.bpc(r), then it must be that i < r < I < bpc{l) < bpc(r). Also, we must have one of the following cases: • l.-bpc(l) does not cross another base pair. Then region [l.bpc(l)[ is a closed subregion of [i.j] with respect to GUG', and thus, / is in a closed subregion of with respect to G U G', which is a contradiction. • l.bpc'(l) crosses another base pair, say m.bpc(m). Then I.bpc (I) is in a band associated with a pseudoloop. We claim that, this pseudoloop is a subregion of [r. bpc('»')]. Otherwise, it must also be that r.bpc(r) is in a band associated with the pseudoloop. Moreover, the bands containing r.bpc{r). m.bpc{m) and l.bpc(l) must all be distinct, based on the defini-tion of band and the fact: that; I.bpc (I.) is in [r,6pc? (»•)]> and m,.bpc(in) G G and thus, cannot cross r.bpc{r). A line drawn vertically at position m or at position / must cross all three bands. This is because m.bpc(m) crosses l.bpc(l) which is in [r, bpc(r)]. Therefore, G U G' must have density 3, contradiction. Therefore, our assumption is incorrect and I.bpc {I) crosses r.bpc (''')• Let b\.bpo{b\), and &2-^ Pc(&2) be the outer and the inner base pairs of the band containing r.bpc^r) that l.bpc{l) crosses. We have i < b\ < b-2 < I < bpG(b2) < bpc{b,) < j. Now we prove that b(u) = b\. Since i < b\ < I < bpc (I) it must be that 6(v,j) < bi, by the definition of If &(,;;) < b\, then we have &(.,;() < 61 < bpc{b-\) < bpc(b(i,t)), since C is pseudoknot free. By same argument as used above, we conclude that 6(io-&Pc(&(».!)) crosses I.bpc (I). Thus, &(u).&pc(fy»,n) is the outer base pair of the band, which is a contradiction. Thus our assumption of &(,-,/) / 61 does not hold, and b^i) = bi. Similarly we can show that b'^ ^ = I>2- • L e m m a 2 LetG a.ndG' be disjoint, pseudoknot free, secondary structures, such that, G U G' is a density-2 secondary structure. Let I be paired in G' (but not in G) and let [i,j] be a region such that I G [i.j] and bpc (I) £ [i,j]- Let mm{k,\i, < k <l < bpG(k)} U {co}, max{/c|i < k < I < bpG{k)} U {-1}, nvAx{bpG{k)\k < I < bpa{k) < j} U {-1}, and mm{bpG(k)\k < I < bpG(k) < j} U {00}. Then, if [i,j] is a weakly closed region of G, then either all four of these quantities have finite, positive values, in which case the structure G U G' con-tains a band with, outer base pair and inner base pair b'^. ^.B'a (i.e. bp(b(.;j-)) = -5((,7) and bp(b'^ /)) = .-^), or none of the four of these quantities have finite, positive values, in which case I is not covered by a base pair' of G y . b(i,i) = bkD = Bd,:i) = B(l,i) = Chapter 3. Algorithm 16 Note that each term may be either infinity or -1 to account for the cases when there is no such band border. . For region [i,j] = [1,23], which is weakly closed in G, in Fig-. 2.2, we have 6(1,23) = 1, 6(1,23)' = 2 a l K l S ( l ,23) = 2 2 a I K l B ( l ,23) = 2 3 ' P r o o f We first show that if [i,j] is a weakly closed region of G, then either all four of these quantities have finite, positive values, or none of the four of these quantities have finite, positive values. Throughout the proof we assume that I < bpc{l), but the proof can easily be modified to handle the case when bpc(l) < L Let us consider Cover(G,/) = (r.bpc('»•))• Note that Cover(G,/) may be ( — 1,-1) if I is not covered in G. Based on the definition of weakly closed region, either Cover(GJ) e [i,j] or Cover(G,/) £ Therefore, we either have r < i < I < j < bpc{r), in which case, none of the four quantities have finite positive values, or we have i < r < I < bpci'i) < j , in which case r.bpc(r) crosses l.bpc(l), and therefore all four quantities have finite positive values. Now we claim when is weakly closed, Cover(G,Z) = (r,bpc(r)), and r.bpc(r) € the structure G U G' contains a band with outer base pair b(i.i)-B(i,i) a n d hmer base pair b'^^.B'^.y To prove this claim we first prove that &pG(6(»,n) = B{ij), and bpc(b'(iJ)) = B [ u ) . Assume 6pG(6(ij)) + B<i,j)- Then, since both 6(M).&pG(&(.M)) and bpc{B(l:j)).B{ij) are in G, then they cannot cross, and.one must be nested in the other. If b{i,i).bpc(b{i,i)) is in [bpc{B{id)).B{Lj)\, then we have i < bpc{B{ltj)) < b{iJ) < I. which is contradiction to the definition of &(;,()• Therefore b(LiybpG(b(ij)) can-not be in [bpc(B(i j)), By.,,-j]. With a similar argument we can show that bpc{B(i,j)).Ba,j) cannot be in [6(,;./.), bpc;(&(,:.;))]• Thus, we must have 6pc(6(i,i)) = Bn.j)- The proof to show that bpc(b'^ n ) = B'n .j is similar to what came above. Now we prove that a band containing b^jyByj^, also contains b'^^.B'^^. Assume the bands containing b^^.B^j) and b'^^.B'^ ^ are distinct. Then, a vertical line drawn, at / crosses 3 distinct bands. Therefore the structure of G U G' has density 3, contradiction. Thus, one band contains both b^jyB^^ mdb'(i.,i)-B'(i,:i)-Based on the definition of 6'^ n , there is no other base pair m.bpcirn) in G, such that 6'(;J) < m < I < bpG(m) < B'n j y Therefore, b'u,jyB[i.j) i s t n e i n n e r base pair of the band. " We can easily show that if b^j).B(i.j) is not the outer base pair of the band, then we must have a base pair m.bpc(m) € such that m < 6(.,;|() < J5(I,7) < bpc(m). It is clear that m.bpc{m) must be contained in the same band as 6(ii).'B(i.,-), Thus, contradiction. Therefore 6(»,j).B(j,,-) is the outer base pair of Chapter 3. Algorithm 17 Figure 3.1: Illustration of band borders in Lemma 3 and Lemma 4 the band. • • . Lemma 3 Let.G and G' be disjoint, pseudoknot free, secondary structures, such that G U G' is a density-2 secondary structure. Let I be paired in G' (but. not in G) and let [i.l] be a region such, that I, 6 [i.l] and bpc{l) tfz Let, 6(.,;j) = min{fc|i < k < / < bpc{k)} U {oo}, and b'{ii) = max{A:|z < k < I < bpc{k)} U {-1}. Then, either both, or neither of 6(*j) and b'^. ^ have finite positive values. For example, for region [17,19] in Fig. 2.2, we have 6(17,19) = 18,. 6JJ7 = 19. P r o o f Let us consider Cover(G,/) = (r. bpc(?'))• Note that Cover(GJ) may be ( — 1,-1) if / is not covered in G. If r < i, then there must be no base pair k such that i < k < I. < bpc(k). Otherwise, since both r.bpc(r) and k.bpc(k) are in G, we must have r < k < I, < bpc{k) < bpc{r). Thus,' we must have Cover(G,/) = (k,bpc(k,)) for one such k, which is contradiction. Therefore, £>(,:,() = 00 and 6|.; ^ = — 1. Let us assume i < r, Cover(G,/) = {r,bpc{r)), I = j and bpc{l) = i — 1. Then, r.bpa(r) must cross l.bpc>{l), since i < r and based on the definition of cover, we must have r < I < bpc(r). We claim that both b(,; () and 6|.; ^ have finite positive values. Moreover, b'^ ^ = r. Assume 6'^ '^ ^ r. Then, 6|.; ^.bpcib^ <)) must be in [r,bpc{r)} (based on the definition of b'^^). Therefore we have r < b'^ ^ < I < bpc(b'^ / < bpG{r), contradiction. Thus, 6{4 n = r. Now, clearly b(.,;,j) ^ 00. We claim that if Chapter 3. Algorithm 18 '-'(•;,() 6(,;()) then b'^ ^.bpcib'^ tj) is in [b^jybpcib^^)}. Otherwise, the bands containing &(•;./)-VG(&(U))> b'^ ^ .bpdb'^ Q) and l.bpc(l) must be distinct. A vertical line drawn vertically at position I crosses 3 bands, thus G U G" has density 3, contradiction. • Lemma 4 LetG and G' be disjoint, pseudoknot free, secondary structures, such that G U G" is a density-2 secondary structure. Let I be paired in G' (but not, in G) and, let [l,j] be a region such that I £ [l,j] and bpo>(l) Let £((.,•) = nvA\{bpc(k)\k < I < bpG{k) < j) U {-1}, and B{Lj) = mm{bpc(k)\k<l<bpc(k,)<j}U{oo}. Then, eitlier both or neither of 2?(f j) and By .j have finite positive values. For example, for region [9,11] in Fig. 2.2, we have £?|g n j = S(g.ii) = 9. Proof Proof is very similar to.proof of Lemma 3. • 3.2 High Level Description of HFold HFold is a method for prediction of pseudoknotted RNA secondary structure that integrates MFE-based prediction with folding pathway considerations in a novel way. The method is motivated by the hypothesis that pseudoknotted RNA secondary structures form in a hierarchical fashion, with a pseudoknot free structure forming first and additional pseudoknot-forming base pairs that are added later (possibly with minor rearrangements of the initial pseudoknot free structure) [15, 25]. HFold works by.taking as input a sequence of bases, S, and a pseudoknot free secondary structure G, and finding a second pseudoknot free structure G' which minimizes the energy of G U G' (i.e. HFold(5, G) = G U G"). Like MFE methods, HFold handles only a restricted class of structures, but this class is quite general (density-2 structures). The method has two poten-tial advantages over MFE-based secondary structure predication. First, HFold's hierarchical folding principle may model biological folding just as well, or bet-ter, than does the MFE structure formation hypothesis, at least on biological structures. Second, HFold has 0(n3) running time, making it significantly more efficient than MFE-based methods that require fi(n5) time or more to predict biologically-important pseudoknotted structures. Here we outline our hierarchical fold algorithm. We first briefly review key ideas of dynamic programming algorithm which predicts the energy of the MFE pseudoknot free secondary structure for a fixed sequence S = s iS2. . . s„ [18], Let W, j be the energy of the MFE pseudoknot free secondary structure for the subsequence .s,.s,:+i ...Sj. If * > j, Wi.j = 0, since the subsequence is empty. Otherwise, it must either be that i.j is a base pair in the MFE structure for Chapter 3. Algorithm 19 Si.. .s.j. or that the MFE structure can be decomposed into two independent subparts. These two cases correspond to the first two rows of the recurrence for Wij below. Thus, Wij is given by the following recurrence: Wi, = min { ^ V ' ' ... TT. where Viti is the free energy of the MFE structure for s; . . . s.j that contains i.j. The recurrence for V; , can in turn be expressed in terms of the energies of the hairpin loop {en{i.j)).. an internal loop, or a multiloop closed by i.j. and multiloops. If * > j, V.L.j is set to be oo. Otherwise, i.j closes a hairpin, an internal loop, or a multiloop in the MFE structure for ... .s.j. Thus V^j can be expressed as the minimum of the free energies attainable in three cases: f eH(i,j), V(i,j) = min < min,.,./ e»„e(i,r,r /»j) + Vrr', { VM'itj where en{i,j) and e,: r l t(v'\ r', j) are as given in Table 2.1, and VMij is the energy of a MFE structure for ,s,;. . . s.j in which i.j closes a multiloop. We extend the definition of Witi for the hierarchical folding algorithm as follows. Let G be a given pseudoknot free structure for S. If some arc of G covers i or j, then Witj = oo. If i > j, then Wtj = 0. Otherwise we define Wij to be the energy of the MFE secondary structure G,;, U GC for the strand Si... s.j, taken over all choices of G'^ which is pseudoknot free, disjoint from G,; ( , and such that G,;.; U G'- is density-2. In this case, W^j satisfies the following recurrence: Wij = mm < mmi<r<jWitr+ Wfr+i)j, [ WMi). : - / where the first two cases are the same as for pseudoknot free cases and the last case is specific to pseudoknotted structures. Ps is the pseudoknot initiation penalty, given in Table 2.1. The third row of this recurrence accounts for the case when the optimal secondary structure G,;J U (?<• includes pseudoknotted base pairs and cannot be partitioned into two independent substructures for two regions [i, r] and [r+1, j], for some r. Such a structure must contain a chain of two or more successively-overlapping bands, which must alternate between G,;,- and G^-, possibly with nested substructures interspersed throughout. Figure 3.2 provides an example, and shows how the recurrence for WMB, given below, unwinds when the example structure is the MFE structure. In order to calculate the energies of substructures in such a structure' in the recurrences, we use three additional terms: BE, VP, and WI. Roughly, these account for energies of bands spanned by base pairs of G-j, regions enclosed by pseudoknotted base pairs of GC (excluding part of those regions that are within Chapter 3. Algorithm 20 Figure 3.2: Illustration of how the WMB recurrence unwinds, to calculate WMB^-j. Arcs above the horizontal line from i to j represent base pairs of Gi-j, and arcs below the line represent base pairs of G\y Case (a) of the WMB recurrence handles the overall.structure whose energy is WMB^.j, with I = l\. with terms to account for energies of the right upper band {BE) and right lower closed subregion ( WI^1 + ])t{i,vc(b'., )-l)) a s well a s the remaining prefix (WMB1,,^). The term WMB'lh is handled by case (a) of the WMB' recur-rence, with / = l,o and terms to account for the lower right substructure labeled VP[2 ( l , the upper left band {BE), and the remaining "prefix" of the overall structure ( WMB'; (h_l}). WMB'i(h_x) is then handled by case (b) of the WMB' recurrence, with I = and terms to account for WI(l3+i)and WMB'il3. Finally, the WMB',^ term is handled by end case (c) of the WMB1 recurrence. a band of <?»,•), and weakly closed subregions, respectively. Further details on different recurrences can be found in Appendix A. We now give the recurrence for WMBi,j. As the base case, we set WMB-,^ = +oo if v. > j, since if i > j the structure is empty, and thus cannot be pseudo-knotted. Otherwise, there are two cases, depending on whether j is paired in G or not. In case (a), j is paired in G. Then, in the MFE structure, some base / with bp(j) < I, < j must be paired in G', causing bp(j).j to be pseudoknotted. We minimize the energy over all possible choices of / (note that / must be un-paired in G, since it will be paired in G', which is disjoint from G). By Lemma 1, once / is fixed, the inner base pair of the band whose outer base pair is bp(j).j is also determined. The Pi, + BE term in case (a) of.the recurrence accounts for the energy of the band, a IWterm accounts for a weakly closed region that is in the band, and the remaining energy is represented by the WMB' term. In case (b), j is not paired in C, and, the recurrence is unwound by moving directly to a WMB1 term. Thus, we have: Chapter 3. Algorithm 21 I + ^ i l + H ' / ( 1 + 1 ) , ( „ W ! ( i i i | i ) ) . 1 ) ) , ,(6) WMB',.! if 0 < ftp(j) < j Complementing case (a) of the WMB recurrence, WMB' handles the case that the rightmost band is.not in G, but is part of the structure G'. In the recurrence for WMB'. case (a) is the complex case, accounting for the energy of the region spanned by the rightmost two bands using the 1P\,, VP, and BE terms, and recursively calling WMB1. The band borders in WMB' cases are determined using Lemmas 3 and 4. Case (b) is called when one iteration of WMBi.j or WMB'; • case (a) is done and the rightmost substructure of the over-all "prefix" up to position j is a weakly closed region. Note that Wlij — -f oo when cover (i) ^ cover (j), which avoids entering case (b) as the first iteration of WMB'. Cases (c) and (d) are end cases, where only one or two bands need to be accounted for, respectively and so no recursive call to WMB1 is made. Thus we have: WMB',., = min ((a)2P„ + jnm^ (BEHu)M{i ;»Cu.i.:rv:,I(Ci;J-,l) + WMB'L{l_l)+ VPLj), i{bpc(j) = 0 if bpG(j) < j (b) min {WMB'iii+ WIu+1)tj), i<l<:i if 0 = bpcij) < bpG{i) Figure 3.3 shows how all the recurrences call each other Figure 3.3: Visual illustration of recurrences in HFold. 23 Chapter 4 Results The goals of our analyses were sevenfold: • first, a baseline test to see if HFold finds pseudoknots when presented with the true pseudoknot free secondary structure for a sequence; • second, to assess the accuracy of HFold, when using the predicted MFE pseudoknot free structure for a sequence as input; • third, to see whether by taking the best HFold output, taken over several runs when suboptimal pseudoknot free structures are used as input, it is possible to obtain significant improvements in accuracy over simply running HFold with MFE structure as input; • fourth, to see whether by taking the minimum free energy HFold output, taken over several runs when suboptimal pseudoknot free structures are used as input, it is possible to obtain significant improvements in accuracy over simply running'HFold with MFE structure as input; .. • fifth, to see whether by taking a substructure of the MFE structure in which bases are paired with high confidence we can gain any significant improvement in accuracy; • sixth, to see whether by .taking the MFE structure and peeling away the stems from inside and/or outside stems that close hairpin loops it is pos-sible to gain better prediction accuracy than with MFE alone; • and finally to compare the energy of the true structure, calculated with a separate method using the same energy model, with the energy of the structure output by HFold, when it is presented with the true pseudoknot free structure. We implemented the HFold algorithm in C++, using the energy model de-scribed in Section 2.4. Our implementation builds on the SimFold algorithm of Andronescu [3]. All experiments were run on a SUSE Linux 10.1 with two Intel 2.40 GHz processors and 1 GB of RAM. We performed several analyses on a large dataset. In the following sections we present our thorough analyses. Chapter 4. Results 24 4.1 Accuracy In our analyses, we measure the accuracy of a predicted structure R for sequence S with true structure T as follows. Each base (position) i of S gets a score of 1 if bpa(i) = bpri't), and gets a score of 0 otherwise. The accuracy is the total score over all bases of S, divided by the length of the sequence. Thus, the accuracy lies between 0 and I, with 1 indicating perfect accuracy. 4.2 Dataset Our dataset of 70 sequences includes 7 with pseudoknot free and 63 with pseu-doknotted structures. These 63 structures include 6 sequences with kissing interactions, 6 with H-type pseudoknots with nested structures and 1 sequence with density-3 structure. Tables 4.1, 4.2 and 4.3 show the sequences used in this work. The length of these sequences varies from 26 to 214 bases. The sequences include the following types: viral ribosomal frame shifting, viral ribo-somal readthrough, sequences with high affinity to HIV-l-RT, mRNA, tmRNA, viral 3; UTR, ribozoines, .viral RNA, signal recognition particle RNA, small nuclear RNA and tRNA [5, 6, 9, 23, 27]. Since there are no known energy pa-rameters for non-canonical base pairs arid loops with less than 3 unpaired bases inside in our energy model, we removed the non-canonical base pairings and the loops with less than 3 unpaired bases from the structures of our dataset. We partitioned each pseudoknotted structure in our dataset into two pseudoknot free structures G/,;,rand Gsiiu,u- Structure G|„;H was created by removing the minimum number of base pairs needed to get a pseudoknot free structure from the input pseudoknotted structure, and structure Gsmu,u consists of the removed base pairs. We call each of these pseudoknot free structures a true pseudoknot free secondary structure for the corresponding sequence. 4.3 Accuracy when input is the true pseudoknot free structure We first test the accuracy of HFold, when presented with a sequence S whose true structure is pseudoknotted, and the corresponding true pseudoknot free secondary structures Gug- The average accuracy of the Gbia structures obtained from pseudoknotted structures in our dataset is 77%. Figure 4.1 plots the accuracy of HFold(5, Gf„;a) versus the accuracy of the G|,;a structures, for each sequence in our dataset. The data points in the lower triangle of Fig. .4.1 show the cases that HFold improved the accuracy of the original structure and the points on the upper triangle show the cases in which HFold added incorrect base pairings and thus decreased the accuracy of the final structure in' comparison to the accuracy of the original structure. The average accuracy of the structure obtained after running HFold, namely HFold(5, G|„;fl), is 77% averaged over 63 Chapter 4. Results 25 Sequence IDs and lengths Sequence type Ref. Homo sapiens(164). snRNA [9] Arabidopsis thaliana(164) RR4640(55), RK5280(67), tRNA [23] RM8530(77), RG6241(66), RA1662(76) Table 4.1: Pseudoknot free sequences used for secondary structure prediction. In order, the columns provide (1) sequence ID, as found in the database or paper from which we obtained the sequence, and their length in parenthesis; (2) type of sequence; and (3) the reference from which the sequence was obtained. structures (not including the pseudoknot free structures), with 6 cases achieving perfect accuracy. Since G'in;l of the pseudoknot free structures is the pseudoknot free structures themselves, the input structures to HFold include 7 cases with perfect accuracy (presented in Fig. 4.1 with circle); from those cases 2 remain the same after run-ning HFold (i.e. HFold(S, C;,;,,) = G|,;H) but some extra base pairings get added to the rest of G;„;s for the pseudoknot free structures, such that the average accuracy after running HFold on the pseudoknot free structures is 92%. Note that HFold does not add any pseudoknots to the pseudoknot free structures. Figure 4.2 shows an example of before and after structures for a pseudoknot free structure. There are 40 cases presented in Fig. 4.1 marked with 'x: whose accuracy is between 44% and 80%. Of those cases, 34 are the result of the high penalty for introducing a pseudoknot. In all of these 34 cases, the expected output structures need addition of stems of size less than or equal to 7 bases, and the pseudoknot initiation penalty is much higher than the free energy of the stem. That is why HFold does not add these stems and thus the accuracy is not perfect. In 16 of these cases HFold adds extra base pairings to the input structure but in the wrong places and in the rest of the cases it does not add anything. Figure 4.3 shows an example of the before and after structures for one of these cases. There are 3 cases among the above mentioned 40 cases for which we removed non-canonical base pairs and loops with less than 3 unpaired bases inside; in 2 of those cases, HFold achieves higher accuracy (in one case it achieves perfect accuracy) when presented with the original G/„,, (that is, the Gi,i;l structure with non-canonical base pairings and small loops in place), while in the 3rd case the accuracy of the HFold prediction does not change when given the original G^g Chapter 4. Results 26 0.4 0.5 0.6 0.7 0.8 0.9 1 Accuracy after HFold Figure 4.1: Accuracy of HFold when the input is the true pseudoknot free sec-ondary structure (GuB), versus the accuracy of Gult structures. The horizontal axis represents the accuracy of structures predicted by HFold while the ver-tical line represents the accuracy of the input structures. Pseudoknotted and pseudoknot free structures are presented by V and 'O', respectively. Chapter 4. Results 27 c s. •y ' . G - G , 60 "C-G . . -C-C 'uU'A'G."f S-i-ti-t-6"' C - G - U - - C G'-C'- A" -G e-u-G !-J5 i . u ..t-u-'G.c A.u •~G.U Figure 4.2: A pseudoknot free structure given as input to HFold (top) and the corresponding HFold pseudoknot free result (bottom). The overall structure remains pseudoknot free alter HFold, but more base pairs are added to the loops. For example, the large internal loop starting at 1 and ending at 164 on the top figure is broken into two smaller loops in the bottom figure. Similarly the loops of the branch starting at 51 and ending at index 90 are broken into smaller loops. Chapter 4. Results 28 20 V-A-u cH / 1 0 . .u-A N 40 42 10 ?42 9 U-G-a c-c-S-G G"' A u' ,20 40 -C • A-30 10, U •g \ '~"A"A. r; G ^ c G ' J G' ,G'" \ 40 42 Figure 4.3: G'/,,;,y structure given as input to HFold (top), the corresponding HFold result (center) and the true pseudoknotted structure (bottom). HFold adds 3 pseudoknot free stacked base pairs at 2.40, 3.39, and 4.38, while the true pseudoknotted structure has pseudoknotted base pairs at 18.36, 19.35, and 20.34. Chapter 4. Results 29 as input. In the remaining 3 cases of the above mentioned 40 cases, HFold's prediction is pseudoknotted but the pseudoknotted structure either has more crossing base pairs than the original structure or it has pseudoknots in different places. High pseudoknot initiation penalty may be the cause behind this problem except when HFold is adding pseudoknot free base pairs inside or outside the loops (similar to the structure of Fig. 4.2). For example if the true structure includes several H-type pseudoknots, HFold predicts interleaved bands instead. This is because for every separate pseudoknot, the energy value is penalized by one pseudoknot initiation penalty, while for a big interleaved pseudoloop it is only penalized once. In addition, extra bands stabilize the energy value. Figure 4.4 shows an example of before and after structures for one of these cases. In all the cases where HFold's prediction is different from the true structure, the minimum free energy of HFold's result is lower than the energy of the true structure. We also ran HFold, when presented with a sequence S and the corresponding true pseudoknot free secondary structure C7.,„,„H as its input. Figure 4.5 plots the accuracy of HFold(S,c7.,„,„«) versus the accuracy of the G.,„,.„/j structures. In this case, the average accuracy of the Gmtu,u structures not including the pseudoknot free structures is 64%. The average accuracy of the HFold output, HFold(5, GSI„„.(/), not including the pseudoknot free structures is 80%, with 19 cases achieving perfect accuracy. Interestingly, when the better of the two accuracies obtained by using HFold on Giyiij and Gsln„u is taken for each sequence, the average accuracy, over all sequences is 88% including the pseudoknot free structures and 87% not including them. 4.4 Accuracy when input is the M F E pseudoknot free structure More realistically, we need to be able to predict the secondary structure of a sequence without knowing the true pseudoknot free secondary structure for the sequence. A natural approach is to run HFold on the predicted MFE secondary structure for the sequence. We use SimFold [3] to produce the pseudoknot free secondary structure G, which becomes the input to HFold. Figure 4.6 plots the accuracy for each data sequence versus the accuracy of the input structures. The average accuracy is 59%, which is exactly the same as the average accuracy of the MFE pseudoknot free structures. Figure 4.6 shows 14 points with accuracy between 18% to 31%. In all of these cases, the MFE pseudoknot free structures bear little resemblance to the true structures, and HFold does not add any pseudoknotted base pairs to them. In 4 out of these 14 cases, the low accuracy of the MFE pseudoknot free structures is explained by stems that are "shifts" of stems in the true structure. In all those cases HFold does not add any base pairs to the input structure. Figure Chapter 4. Results 30 - • - - A U ' u C " 0 ) A G^.UCG ,.' A ' 13: 7. r-u. f; . A " QS- A ' r v u ru i ~~\u r.< / r'' A >-<- r- 'nu' ' v >' u' G G" TC C. t' . ..5 IJ C "G •no 'C u " A " ' u. A ' A U " U 6-*' " • A u. s-c C-G... V, •' 130 •'A.u 214 C-G Figure 4.4: Predicted pseudoknotted structure (top) and the true pseudoknotted structure (bottom). HFold predicts interleaved bands, while the true pseudo-knotted structure has 5 separate H-type pseudoknots. Chapter 4. Results 31 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Accuracy of HFold on G Figure 4.5: Accuracy of HFold when the input is the true pseudoknot free secondary structure (Gml,,.u) (horizontal axis), versus the accuracy of Gsm,M structures (vertical axis). Pseudoknotted and pseudoknot free structures are presented by 'x' and 'O', respectively. Chapter 4. Results 32 Accuracy after HFold Figure 4.6: Accuracy of HFold when the input is the MFE structure (horizontal axis), versus the accuracy of the MFE structures (vertical axis). Pseudoknotted and pseudoknot free structures are presented by 'x' and 'O', respectively. Chapter 4. Results 33 20 10 .A-C-A*. 'c 'C-G-C-C C G C '.6 c: 26 ~"26 Figure 4.7: MFE structure (top), true pseudoknotted structure (bottom). Tire true pseudoknotted structure has 4 stacking base pairs at 1.14, 2.13, 3.12, 4.11 and another stem of size 4 at 6.26, 7.25, 8.24, and 9.23, while in the MFE structure the first stem is a shift of the true structure (3.15, 4.14, 5.13, and. 6.12). In this cases HFold does not add any base pairs to the input structure. 4.7 shows an example of one of those cases. There are"only two data points presented in Fig. 4.6 that show significant change of accuracy before and after running HFold. In both cases, HFold pre-dicts pseudoknots. In one case the predicted pseudoknotted structure is very similar to the true structure since the MFE structure for the corresponding sequence is very similar to the G\,ig structure of the sequence. In the other case, the true pseudoknotted structure includes one H-type pseudoknot with only 3 pseudoknotted base pairs, but because of the high pseudoknot initiation penalty, HFold predicts one pseudoloop of 3 interleaved bands, thus lowering the accuracy. 4.5 Best accuracy, taken over suboptimal folds Since HFold does not improve accuracy, on average, when using the MFE pseu-doknot free predictions as the input, we also measured the best accuracy ob-tainable using HFold with suboptimal structures as input. For- each sequence in our dataset, we used SimFold to calculate the 25 lowest-energy structures for that sequence. (We chose the 25 lowest-energy structures because, when we compared the 1000 lowest-energy structures from SimFold with the true pseudoknot free structures for each sequence in our dataset, in c Chapter 4. Results 34 Best accuracy of HFold over the first 25 suboptimals Figure 4.8: Accuracy of HFold on the best-of-25 (horizontal axis), versus the accuracy of the MFE structures (vertical axis). Pseudoknotted and pseudoknot free structures are presented by 'x' and :0', respectively. only 39 cases was the true structure found among these 1000 structures and in 29 of the 39 cases the true structure was among the 25 lowest-energy structures.) We'then ran HFold on all 25 structures, and calculated the accuracy of the HFold output on each run. We did the same with the 10 lowest energy structures. Then we took the best accuracy among the 10 lowest-energy structures, and the best accuracy among the 25 lowest-energy structures. When averaged over all sequences, the best-of-10 accuracy is 70% with 4 cases with perfect accuracy and the best-of-25 is 74% with 8 cases with perfect accuracy. These accuracies are a significant improvement over the 59% average accuracy obtained using the MFE structures. Figure 4.8 shows the accuracy results of the best-of-25 versus the accuracy of the MFE structures. As shown in Fig. 4.8, 48 data points are in the lower triangle, showing improvement over the MFE structures. In many cases the improvement is simply due to the fact that the suboptimal structure is close to G (,;.,,. In 20 of the 48 cases, HFold adds nothing; in 15 out of 48 cases, it adds 1 or 2 pseudoknot Chapter 4. Results 35 Best accuracy of HFold over'the first 25 suboptimals Figure 4.9: Accuracy of HFold on the best-of-25, versus the accuracy of the best-of-10. Pseudoknotted and pseudoknot free structures are presented by 'x' and l O ; , respectively. free base pairs; but in the remaining 13 cases it perfectly identifies G„„lau, from which in 5 cases HFold adds extra base pairings, thus, not achieving 100% accuracy. Figure 4.9 shows the improvements achieved using best-of-25 versus best-of-10. As shown in Fig. 4.9, there are only 10 data points in the lower triangle that show significant improvement on best-of-25 versus best-of-10, and the rest of the cases are largely similar. Since the difference between the average best accuracies of best-of-25 and best-of-10 is only 4%, but the number of structures is considerably lower and the improvement over accuracy of the MFE structures is significant even for best-of-10, biologists may prefer to obtain 10 lowest energy structures instead of just the MFE structure to get better prediction. Chapter 4. R.esults 36 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 • Accuracy of HFold on min energy over the first 25 suboptimals Figure 4.10: Accuracy of HFold on the minimum energy structure over the first 25 suboptimals;. versus the accuracy of HFold on the MFE structure. Pseudo-knotted and pseudoknot free structures are presented by 'x' and ' O ' , respectively. 4.6 Accuracy of the lowest energy structure, taken over suboptimal folds In this section, similar to Section 4.5, for each sequence we get the first 25 suboptimal structures, and then we run HFold on those structures. However, in this case we choose the final structure with the lowest energy and compare its accuracy with the accuracy of HFold on the MFE structure for the same sequence. Figure 4.10 shows the results for this case. There are 9 data points in Fig. 4.10 not on the diagonal line. One of these data points is in the lower triangle, showing significant improvement over the accuracy of the MFE structure. In this case, the input structure is very similar to the corresponding Gnu structure but has one extra loop, thus HFold correctly identifies the corresponding Gs„,„u structure and improves the accuracy of the overall structure. For 5 of the 8 data points in the upper triangle, HFold correctly predicts the Chapter 4. Results 37 corresponding Gsinau structure but adds more pseudoknotted or pseudoknot free base pairs to lower the energy. An example of one of those cases is presented in Fig. 4.11. There are some stems in some of the input structures that are "shifts" of stems in the true structure; these shifts are also causes of low accuracies. In those cases HFold does not add any base pairs to the structure. There is not much difference in the "energy" values of the predicted struc-tures with lowest energy over the first 25 suboptimals versus the energy of the predicted structure when input structure is MFE, but some of the structures in the latter case are pseudoknotted, and thus have more similar structure to the true structure than MFE. 4.7 Peeling Techniques One of the main problems with the MFE structures is that they typically contain long stems and short loops, making it difficult for HFold to add enough pseu-doknotted base pairs into the small regions. One possible solution to obtain better input structures is to peel away the steins. This approach is also moti-vated by the process of hierarchical folding, where it has been shown that, upon formation of a pseudoknot, some base pairs at the ends of stems in the initially formed pseudoknot free structures may break [25]. For our testing purposes, we tried the following cases: 1. removing 1-3 base pairs from inside the stems that close hairpin loops (3 cases); 2. removing 1-3 base pairs from outside the stems that close hairpin loops (3 cases); 3. removing 1-3 base pairs from both inside and outside the stems that close hairpin loops (3 cases); 4. removing stems of size m when 1 < m < 3 (3 cases), and 5. peeling enough.base pairs from inside hairpin loops so that there are at least 6 unpaired bases inside each loop (1 case); 4.7.1 Accuracy when input is the M F E pseudoknot free structure with 1-3 base pairs removed from inside or outside the stems that close hairpin loops For our first peeling technique, we removed 1-3 base pairs from inside each stem that closes a hairpin loop of the MFE structures and ran HFold on the resulting structures. Figures 4.12 shows the accuracy of the MFE structures when 2 base pairs are removed from inside each stem that closes a hairpin loop versus the accuracy of the MFE structures themselves, while Fig. 4.13 shows the accuracy of the predicted structure when the input structure is peeled versus the accuracy Chapter 4. Results 38 y ft -„ ^ • ,.<- yK £, -V u - A ,C - c ^ . t 6 A A l j 'A, t i L A ..U ,.b A-110 b-c-c-e-c-A-" ' ' i j - U - UA G-A 'U.IJ. l 4- A ii • u • G U A A -1 C-C-C-y-l-Ht.y-G-c-u' fl.-G •i;-A-U~^--A-^-A-U-^-G.C-A-.. 'A *• H - H - M - U .C .0 i-A J , '-. no' •U-A-G-J.C-C'A-i-G-C-i-A • U - C - C - G - C - A U -t5 C G - G " ~ U -A-<3 -fi-C-d • ' U.A il.A Li . A Figure 4.11: The pseudoknotted structure predicted by HFold (top), the corre-sponding MFE structure (center) and the true structure (bottom). Although HFold correctly predicts parts of the pseudoknot in the structure on the top, the accuracy of the structure predicted by HFold is lower than the accuracy of the corresponding MFE structure, because there are some extra pseudoknotted base pairs predicted by HFold. Note that the MFE structure has all the correct pseudoknot free sub-structures (except the stem closing at 1.114). High pseu-doknot initiation penalty, and low band penalty are the main causes for such predictions by HFold. Chapter 4. Results 39 of the MFE structures. Since the rest of the figures are similar we only present a detailed discussion on this case. Generally speaking, as shown in Fig. 4.12 when the accuracy of the MFE structures is lower than 55%, peeling increases the accuracy of the structure. This is because the MFE structure in these cases has little resemblance to the true structure. However, as shown in Fig. 4.13, in all but one case, HFold adds the peeled base pairs back and creates the. original MFE structure. There is only one data, point on the lower triangle showing improvement over the MFE structure. In this case the peeled structure is similar to Gug, thus HFold iden-tifies the corresponding G„)mii structure correctly, but since it adds more base pairs to lower the energy of the whole structure the accuracy of the predicted structure is not perfect. There are some data points shown in Fig. 4.13 whose accuracies do not change after running HFold. This is because the stems in their corresponding MFE structure are shifts of the stems of their Gua structure, thus removing a base pair from inside each stem that closes a hairpin loop does not change the accuracy of the whole structure. By removing more base pairs from inside the structures, we obtain more cases where the accuracies before and after running HFold are different. For example, when we remove only 1 base pair from inside each stem that closes a hairpin loop we find 10 data points with the same accuracies before and after HFold, while this number decreases to 6 when we remove 2 base pairs from inside each stem that closes a loop (presented in Fig. 4.13). Similar results are achieved when 1, 2, or 3 base pairs are removed from outside each stem that closes a hairpin loop of the MFE structures and HFold is run'given these structures as input. . • 4.7.2 Accuracy when input is the M F E pseudoknot free structure with 1-3 base pairs removed from both inside and outside each stem that closes a hairpin loop We peeled away parts of stems from both ends of each hairpin loop. When we peeled only 1 base pair from each end of the stems, we ended up with 5 cases whose accuracies did not change before and after peeling (see Fig. 4.14). In those cases the MFE structure is very different from the true structure so that peeling one base from both ends of each loop does not change the accuracy. In all •those cases HFold adds the peeled base pairs back and thus, the final predicted structure is the MFE structure. As shown in Fig. 4.14, generally speaking when the accuracy of the MFE structures is lower than 55%, peeling increases the accuracy of the structure. This is because the MFE structure in these cases has little resemblance to the true structure. Because of the same reason, HFold adds the peeled base pairs back to the structure to lower the energy of the whole structure (see Fig. 4.15). There is only one case shown in Fig. 4.15 for which running HFold results Chapter 4. Results 40 Accuracy of the peeled structures Figure 4.12: Accuracy of the MFE structures with 2 base pairs removed from inside each stem that closes a.hairpin loop (horizontal axis), versus the accu-racy of the MFE structures (vertical axis). Pseudoknotted and pseudoknot free structures are presented by 'x' and 'O', respectively. Chapter 4. Results 41 Accuracy of HFold on in 2 Figure 4.13: Accuracy of HFold when input is the MFE structures with 2 base pairs removed from inside each stem that closes a hairpin loop (horizontal axis), versus the accuracy of HFold when input structure is MFE (vertical axis). Pseu-doknotted and pseudoknot free structures are presented by 'x' and 'O', respec-tively. Chapter 4. Results 42 Accuracy of the peeled structures Figure 4.14: Accuracy of the MFE structures with f base pair removed from both ends of each stem (horizontal axis), versus the accuracy of the MFE struc-tures (vertical axis). Pseudoknotted and pseudoknot free structures are pre-sented by 'x' and '0', respectively. Chapter 4. Results 43 Accuracy of HFold on both 1 Figure 4.15: Accuracy of HFold when input is the MFE structures with 1 base paii' removed from both ends of each stem (horizontal axis), versus the accu-racy of HFold when input structure is MFE (vertical axis). Pseudoknotted and pseudoknot free structures are presented by 'x' and 'O', respectively. Chapter 4. Results 4 4 in improvement in accuracy; in this case the peeled structure is similar to Gin,,, thus, HFold identifies the corresponding G.snu,ii structure correctly but adds more base pairs to lower the energy of the structure, resulting in a not perfect accuracy. , Removing more than 1 base pair from both ends of the stems results in fewer cases with the same accuracy before and after peeling; that is because the peeled structures have fewer base pairs. But since the MFE structures in general have little resemblance to the true structures, HFold adds the peeled base pairs back and thus, the average accuracy of the predicted structures is the same as the average accuracy of the MFE structures. 4.7.3 Accu racy when input is the M F E pseudoknot free structure w i t h stems of size 1-3 removed Since the MFE structures typically have long stems and small loops, we decided to remove the stems of size 1-3 and then give those structures as input to HFold. Since none of the MFE structures have any stems of size 1, peeling away stems of size 1 does not change any of the structures and thus, the final predicted structures are the MFE results. When removing stems of size 2 or 3, there is little change in the structure of some of our data points. However since HFold cannot fix the errors of the input structure, it does not find any structure with lower energy than the MFE structure and thus puts the removed base pairs back. In only 1 case, the input structure is similar to G/„9 and thus HFold finds a. pseudoknotted structure which is very similar to the true structure but with some extra base pairs. 4.7.4 Accu racy when input is the M F E pseudoknot free structure w i t h loosened loops Finally, we decided to loosen up the loops of the MFE structures so that HFold could possibly add extra base pairings to the structure and create pseudoknotted structures. Since we noticed from the HFold results in Section 4.4 that if the true pseudoknotted structure needs less than a stack of size 6 (that, is 6 consecutive base pairs) the structure typically will not form, due to the large pseudoknot initiation penalty, we decided to make room for at least 6 bases inside the loops. After removing enough base pairs from inside the loops of the MFE structures, we ran HFold on them and obtained the accuracy of the results. It is interesting to note that the average accuracy in both cases (before and after running HFold) is almost the same (59% and 58.5% respectively) and is also the same as the average accuracy of the MFE structures. In only 1 case HFold predicts the pseudoknot as it is in the true structure but it adds some extra base pairings. In this case the MFE structure is very similar to the corresponding G/„9 structure but has overcrowded loops; that is why loosening the loops helps HFold find the true pseudoknotted structure. In the rest of the cases, HFold adds the peeled base pairs back and results in the MFE structure. Chapter 4. Results 45 4.8 Base Pair Confidence Structure formation can be viewed as a probabilistic process, with the likelihood of a given secondary structure (for a fixed sequence) being a function of the energy of that structure. The lower the free energy, the more likely the structure: thus, the MFE structure is the most likely to form. Energy-based methods have also been developed to predict base pairing probabilities of pseudoknot free secondary structures at thermodynamic equi-librium. Mathews [15] found that, roughly speaking, the higher the probability that a base pair is predicted to be in the equilibrium structure, the more likely it is to occur in the true structure. One possible way to improve the accuracy may be to use base pair probability as a confidence measure - that is, base pairs are kept only if their probability is higher than a threshold value. Based on the results presented in Mathews [15], we chose different thresholds starting from base pair probabilities of 75%, and increased the threshold by steps of 5% to the maximum of 95% threshold. In each case, we chose only the base pairs whose probabilities are higher than the given threshold, and gave these structures as input to HFold. Figure 4.16 shows the accuracy comparison plot for predicted structures when base pair confidence is at least 75% and the MFE structures. As shown in Fig. 4.16 there are only two points that are not on the diagonal line. Both of these two cases show improvement over accuracy of the MFE structures. It is worth mentioning that in both of these cases the structure predicted by HFold when the input structure is base pairs with confidence above 75% achieves higher accuracy than the structure predicted by HFold when the input structure is the MFE structure. In one of these cases the improvement is 11% and in the other case it is 8%. The results are similar to Fig. 4.16 when the input structure has base pair confidence of at least 80% and 85%. When the confidence level reaches 90% all the data points except one are on the diagonal line. This one case is one of the two cases we discussed when the threshold was 75%. It is clear that the higher the threshold, the fewer base pairs are chosen for the input structure. For example, when the threshold is 90%, there are no base pairs chosen in 23 cases out of 70 cases, thus, the input structure to HFold is empty and HFold gives the MFE structure for those cases. For all the different threshold values the average accuracy is 59%, .which is exactly the same as the HFold accuracy when the input structures are the MFE structures. 4.9 Energy Comparison Using a separate method, with the same energy model, we calculated the en-ergy of the "true" pseudoknotted structures, and compared those values with the energies of the structures predicted by HFold when the input was Gug. Figure 4.17 shows the result of our comparison. All of the data points, except * one case, are either on the diagonal line, or in the upper triangle. As expected, Chapter 4. Results 46 Accuracy of HFold on base pairs with confidence >= 0.75 Figure 4.16: Accuracy of HFold when the input is the pseudoknot free structures with base pair confidence above 75% (horizontal axis), versus the accuracy of the MFE structures (vertical axis). Pseudoknotted and pseudoknot free structures are presented by ' x ' and :0', respectively. Chapter 4. Results 47 HFold's results have the same energy as the "true" structure or the "true" struc-tures have higher energy than the HFold's results using our energy model. The only case that HFold results in a structure with higher energy value than the true structure is when the true structure is a density-3 structure which cannot he handled by HFold. Figure 4.18 shows the comparison of the energy of G\,itj structure for a se-quence versus the energy of the MFE structure for the same sequence. In many cases, the energy differences are large, indicating the energy model is poor. Ideally, when the energy model is good, all the data points should be on the' diagonal line or very close to it, such that running HFold on the MFE prediction would result in the true structures. It is interesting to note that for two of the pseudoknot free structures (pre-sented by circles in the figure) the MFE energy value is much lower than the energy of the corresponding Guu structures; this can be an indication of the fact that even the pseudoknot free energy model is far from being correct. This is because large loops are broken into small loops to lower the energy in the MFE structures (see Fig. 4.2). This is the main reason for not getting the perfect accuracy for all cases when the true pseudoknot free structures are given to HFold as input structures. Chapter 4. Results 48 Figure 4.17: Comparison of HFold energy and true structure energy. The hor-izontal axis represents the energy of structures predicted by HFold when the input is Gin,) while the vertical axis represents the energy of the "true" struc-tures. Pseudoknotted and pseudoknot free structures are presented by 'x' and 'O', respectively. Chapter 4. Results 49 Figure 4.18: Comparison of energy of G\,\.a (horizontal axis) and the energy of the MFE structure (vertical axis). Pseudoknotted and pseudoknot free structures are presented by 'x' and ' O ' , respectively. Chapter 4. Results 50 Sequence IDs and lengths Sequence type Str. Ref. BLV(27), BYDV-NY-RPV(27), viral ribosomal H [27] CABYV(27), FIV(35), frame shifting H PLRV-S(26), PRRSV-LV(59), H SRVlgag/pro(37), H BChV(26), BWYV(26), EIAV(35), H PEMV(28), PLRV-W(26), H MMTVgag/pro(34), H PRRSV-16244B(58), H SARS-CoV(69) H* AKV-MuLV(50), GaLV(49), viral ribosomal H [27] Cas-Br-E-MuLv(50), SNV(50) readthrough H BaEV(50), Mo-MuLV(50), FeLV(50) H 1.1(37), 2.9(34), 2.7a(36), with high affinity H [5] 1.7(37), 2.1b(39), 2.10(37), to HIV-l-RT H •1.6(37), 2.12(37), 2.6b(42), H 1.3a(37), 2.4a(37), 2.11(37), H 1.17(37), 1.8(39), 1.9b(37), H 2.5a(41), 2.2b(42), 2.3a(45) H Ec-S15(67), Hs-PrP(45), mRNA H [27] Bt-PrP(45), T4-gene32(28) H Ec-PK1(30), Bp-PK2(80) tmRNA H [27] LP-PK1(30), Ec-PK4(52) H Tt-LSU-P3/P7(65), Ribozymes H* [27] HDV-It_ag(89) H* BVDVJRES(73), HCVJres(56) viral RNA H* [27] TMV-L(38), CSFV.IRES(76), H turnip yellow mosaic virus(86), H tobacco mosaic virus(214) H* Table 4.2: Sequences with H-type pseudoknots used for secondary structure prediction. In order, the columns provide (1) sequence ID, as found in the database or paper from which we obtained the sequence, and their length in parenthesis; (2) type of sequence; (3) structure of sequence (H-type pseudoknot = H, H-type pseudoknots with nested structures = H*); and (4) the reference from which the sequence was obtained. Chapter 4. Results 51 Sequence IDs and lengths Sequence type Str. Ref. I-ICV.229E(61) viral ribosomal K [27] CoxB3(68) Viral 3' UTR K [27] satRPV(72), NLVS(45), HDV-It_g(88) Ribozymes K D3 [27] Coxsackie(114) Viral and Phage RNA K [6] Hs-SRP-pkn(47) SRP RNA K [27] Table 4.3: Sequences with kissing interactions and the density-3 sequence used lor secondary structure prediction. In order, the columns provide (1) sequence ID, as found in the database or paper from which we obtained the sequence, and their length in parenthesis; (2) type of sequence; (3) Structure of the sequence (kissing interactions = K, density-3 structure = D3); and (4) the reference from which the sequence was obtained. 52 Chapter 5 Conclusion and Future Work In this work, we presented HFold, a fast, new dynamic programming algorithm that efficiently predicts RNA secondary structure including pseudoknots, based on the hierarchical folding hypothesis. HFold can predict kissing hairpins and pseudoloops with arbitrary number of bands. The algorithm had been proposed by Zhao [30]; this thesis focuses on an empirical analysis of its performance. Our analysis shows that, when presented with the true pseudoknot free struc-ture, Gin,) no significant improvement is obtained over the accuracy of the pseu-doknot free structure alone. However when HFold is given Gsm„u as input structure, 16% accuracy improvement is obtained on average over the accuracy of the pseudoknot free structure alone. This study also shows that using Gsln„u as the input to HFold results in 80% accuracy with 19 cases of perfect accu-racy. High pseudoknot initiation penalty and stem shifts are main causes of not achieving perfect accuracy for the rest of the cases. Based on the analyses presented in this work, using only the MFE structures as input to HFold does not result in high accuracies as the MFE structures are usually overcrowded with long stems and small loops and thus, running HFold does not result in the addition of any extra base pairings. We showed here that since the MFE structures usually bear little resemblance to the "true" structures, peeling the stems away from inside and/or outside the loops does not improve the average accuracy. We took the peeling one step further and showed that neither removing stems of size m when 1 < m < 3, nor loosening the loops to have at least 6 unpaired bases inside, changes the average accuracy of the results. Since the energy model using which the MFE structures are being formed does not consider pseudoknots, even using base pair probabilities is of no help in improving the accuracy of the results. We also showed that using the best of 10 suboptimal structures improves the accuracy of prediction by 11%. The accuracy increases by another 4% if the first 25 suboptimals are used. In many cases the accuracy of the MFE structure is greater than or equal to the accuracy of the .lowest energy structure predicted by HFold on the first 25 suboptimal structures. However, in numerous cases of the latter structures, HFold identifies the correct pseudoknotted base pairs. Note that adding more base, pairs to lower the energy causes the accuracy of HFold result to be lower than the corresponding MFE structure. This study showed that since pseudoknot free energy model predicts too Chapter 5. Conclusion and Future Work 53 many base pairs in the structure, peeling some base pairs away from the ends of stems that close hairpin loops in many cases improved the accuracy of the structure. However high pseudoknot initiation penalty in the pseudoknotted energy model made HFold avoid predicting pseudoknots by adding pseudoknot free base pairs when possible. We also found that because of the high pseudoknot initiation penalty and low band penalty, more bands are added to pseudoknotted structures to compensate for the energy. The results of this study provide more insight to the hierarchical folding hypothesis. Our results suggest that with respect to the current energy model minor rearrangements of the pseudoknot free MFE structures do not often lead to the true pseudoknotted structures. We also showed that in some cases the pseudoknot free MFE structure contains shifted stems of the true structure; therefore without shifting the stems it is not possible for the MFE structure to fold into the true pseudoknotted structure. Throughout this work, we used an index by index comparison of the base pairs to find the accuracy of prediction. However, in many examples the overall structure of the predicted structure was similar to the true structure, but more base pairs or shifts lowered the accuracy. We believe that using other notions of accuracy that consider structural similarities are more accurate that using an index by index comparison. Thorough analyses of the data based on different commonly used accuracy measures such as sensitivity, specificity, or positive predictive is underway. Our findings raise three main questions: 1. whether the structures identified by the experimental methods are reliable, 2. whether the current energy model is precise enough, and 3. whether the hierarchical folding hypothesis is precise enough. One of our future goals is to tune the parameters of the current energy model to improve the accuracy of the prediction for a wider sets of structures. One possible approach is using Andronescu's tuning method [4], Another future work is to use a better energy model for pseudoknotted structures, such as that of Cao and Chen [5], and obtain better energy parameters. This is also of great interest to us to evaluate the hierarchical folding hypothesis using computational methods in future. We are not yet able to do a sound comparison of the prediction accuracy of HFold with MFE-based methods, since it would be important to ensure that the same energy model is used by both methods. Therefore one of the main goals for our future work is to compare hierarchical and MFE algorithms implemented using the same energy model, at least for H-type pseudoknots. Finally, we plan to incorporate other techniques to produce better input structures to HFold, such as information obtained from chemical modification data [17]. 54 Bibliography [1] T. Akutsu. Dynamic programming algorithms for RNA secondary structure prediction with pseudoknots. Disc. App. Math., 104(l-3):45-62, Aug 2000. [2] S. L. Alain, J. F. Atkins, and R. F. Gesteland. Programmed ribosomal frameshifting: Much ado about knotting! PNAS, 96(25):14177-14179, Dec 1999. [3] M. Andronescu. R,. Aguirre-Hernandez, A. Condon, and H. H. Hoos. RNA-soft: A suite of RNA secondary structure prediction and design software tools. Nucleic Acids Res, 31(13):3416-3422, Jul 2003. [4] M. Andronescu, A. Condon, H.H. Hoos, D.H. Mathews, and K.P. Murphy. Efficient parameter estimation for RNA secondary structure prediction. BMC Bioinformatics, page to appear, July 2007. [5] S. Cao and S. Chen. Predicting RNA pseudoknot folding thermodynamics. Nucleic Acids Res, 34(9):2634-2652, 2006. [6] B. A. L. M. Deiman and C. W. A. Pleij. Pseudoknots: A vital feature in viral RNA. Seminars in Virol, 8(3):166-175, 1997. [7] C. Dennis. The brave new world of RNA. Nature, 418(6894):122-124, Jul 2002. [8] R. M. Dirks and N. A. Pierce. A partition function algorithm for nu-cleic acid secondary structure including pseudoknots. J. Cornput. Chem., 24(13):1664-1677, Oct 2003. [9] S. Griffiths-Jones, S. Moxon, M. Marshall, A. Khanna, S. R. Eddy, and A. Bateman. Rfain: annotating non-coding RNAs in complete genomes. Nucleic Acids Res, 33(Database issue), Jan 2005. [10] K. Han, Y. Lee, and W. Kim. Pseudoviewer: automatic visualization of RNA pseudoknots. Bioinformatics, 18 Suppl 1, 2002. [11] I. L. Hofacker, W. Fontana, P. F. Stadler, L'. S. Bonhoeffer, M. Tacker, and P. Schuster. Fast folding and comparison of RNA secondary structures. Monatshefte filr Chenue / Chemical Monthly; 125(2):167-188, Feb 1994. Bibliography 55 [12] H. Jabbari, A. Condon, A. Pop, C. Pop, and Y. Zhao. HFold: RNA pseudo-knotted secondary structure prediction using hierarchical folding. In Raf-faele Giancarlo and JSridhar Hannenhalli, editors, LNBI, Lecture Notes in Computer Science, page to appear. Springer, 2007. [13] R. B. Lyngs0. Complexity df pseudoknot prediction in simple models. In Josep Diaz, Juhani Karhumaki, Arto Lepisto, and Donald Sannella, editors, ICA LP, volume 3142 of Lecture Notes in Computer Science, pages 919-931. Springer, 2004. [14] R. B. Lyngso and C. N. Pedersen. RNA pseudoknot prediction in energy-based models. ./. Cornput. Biol, 7(3-4):409-427. 2000. [15] D. H. Mathews. Using an RNA secondary structure partition function to determine confidence in base pairs predicted by free energy minimization. RNA, 10(8):1178-1190, Aug 2004. [16] D. H. Mathews. Predicting RNA secondary structure by free energy mini-mization. Th.eor. Chem. Ace: Theory, Computation, and Modeling (The-oretica Clnmica Acta), pages 1-9, May 2006. [17] D. H. Mathews, M. D. Disney, J. L. Childs, S. J. Schroeder, M. Zuker, and D. H. Turner. Incorporating chemical modification constraints into a dy-namic programming algorithm for prediction of RNA secondary structure. Proc Nail Acad Set U S A, 101(19):7287-7292, May 2004. [18] D. H. Mathews, J. Sabina, M. Zuker, and D. H. Turner. Expanded se-quence dependence of thermodynamic parameters improves prediction of RNA secondary structure,. .7. of Mol. Biol, 288(5):911-940, May 1999. [19] W.J. Melchers, J.G. Hoenderop, H.J. Bruins Slot, C.W. Pleij, E.V. Pilipenko, V.l. Agol, and J.M. Galama. Kissing of the two predominant hairpin loops in the coxsackie B virus 3' untranslated region is the essential structural feature of the origin of replication required for negative-strand RNA synthesis. .7. Virol, 71(l):686-696, Jan 1997. [20] B. Rastegari and A. Condon. Parsing nucleic acid pseudoknotted secondary structure: algorithm and applications. J. Cornput. Biol, 14(l):16-32, 2007. [21] J. Reeder and R. Giegerich. Design, implementation and evaluation of a practical pseudoknot folding algorithm based on thermodynamics. BMC Bioinformatics, 5, Aug 2004. [22] E. Rivas and S.R. Eddy. A dynamic programming algorithm for RNA structure prediction including pseudoknots, Jul 1998. [23] M. Sprinzl, C. Horn, M. Brown, A. Ioudovitch, and S. Steinberg. Compila-tion of tRNA sequences and sequences of tRNA genes. Nucleic Acids Res, 26(1):148-153, Jan 1998. Bibliography 56 [24] D. W. Staple and S. E. Butcher. Pseudoknots: RNA structures with diverse functions. PLoS Biol, 3(6), Jun 2005. [25] I. Tinoco and C. Bustamante. How RNA folds. J. Mol. Biol, 293(2):271-281, Oct 1999. ' [26] Y. Ueniura, A. Hasegawa, S. Kbbayashi, and T. Yokomori. Tree adjoining grammars for RNA structure prediction. Theor. Cornput. Sci., 210(2):277-303, Jan 1999. [27] F. H. van Batenburg, A. P. Gultyaev, and C. W. Pleij. Pseudoba.se: struc-tural information on RNA pseudoknots. Nucleic Acids Res, 29(1):194-195, Jan 2001. [28] C. W'itwer, Ivo L. Hofacker, and P.F. Stadler. Prediction of consensus RNA secondary structures including pseudoknots. IEEE/ACM Trans. Cornput. Biol. Bioin.forina.tics, l(2):66-77, Apr 2004. [29] M. Wu and I. Tinoco. RNA folding causes secondary structure rearrange-ment. Proc. Natl. Acad. Sci. USA, 95(20):11555-11560, Sep 1998. [30) Y. S. Zhao. Efficient algorithm for RNA pseudoknotted secondary structure prediction given a pseudoknot free structure. Master's thesis, University of British Columbia, Department of Computer Science, September 2005. [31] M. Zuker and P. Stiegler. Optimal computer folding of large RNA se-quences using thermodynamics and auxiliary information. Nucleic Acids Res, 9(1):133-148, Jan 1981. 57 Appendix A Recurrences In this chapter, we present the details of HFold recurrences. Throughout this work, we will use the following notation: • G: a pseudoknot free structure. • G': a pseudoknot free structure that we add to G. • R. the complete pseudoknotted structure: R=GL)G'. Rhj refers to the subset of R whose bases are in the region [i,j]. Let Gij = G l~l {i.,j} x i.e. the set of base pairs of G contained in region Let /?.,., be a minimum free energy (MFE) secondary structure for [i,j], given a pseudoknot free secondary structure G;,j. The energy value of each substructure type, for a given input sequence S = $i$o...s.n and the given pseudoknot free secondary structure G, is stored in an array. In the next subsections, we describe how each is calculated. We illustrate each case with a figure, where we use the following notations in our figures: • G is presented by the portion of the figure above the base line. • G' is presented by the portion of the figure under the base line. • The normal black lines can be any arcs in Rij. • The solid lines are for base pairs. • The dotted lines connect bases that don't have to be paired. • The clear shade within the arcs indicate that there are no additional base pairs within the arc. • The.gray shade within the arcs are unknown structures. A . l Wij W^.j is the MFE of all valid structures Rij over region if i and j are not covered in G, i.e. i.sCovered(G,i) &nd isCovered(G, j). Otherwise, W^-is+00. The base cases are as follows: W.,j = 0, if i > j, since then Rij is empty; and Wij = +00, if isCovered(G.i), or isCovered(G,j). Appendix A. Recurrences 5 8 (1) ( 2 ) r r + 1 (3) 7, I-\ Figure A. 1: Illustration of Cases for Wij. Otherwise, Wi,j is given by the following recurrence: WhJ = m i n ^ ^ ( ^ + W ^ (A.l) . (3) WMBi,j + Ps The base cases indicate that W is being used only when the structure is an exterior structure and that there is no penalty for having unpaired bases at either end of the structure. Case (1) handles the case that i pairs with j, i.e. bpRi%1 (i) = j. Case (2) handles the cases that 3r, * < r < j, 6p/?.;,,• (?-) < r (i.e. t is either unpaired or paired with another base inside region [i,r]), and bpp.. .(j) > r or bp a 0) = 0 (i.e. 7 is either unpaired or paired with a base inside region [r + h:i\). If R-i.j does not fall into case (1) or (2), it must be that is a pseudoknot-ted closed region. This is an exterior pseudoknot because of the premise that i and j are not covered in G. In this case, we add a Ps penalty for introducing an exterior pseudoknot. A.2 Wlj W]i:.j is the minimum free energy of all valid structures Rij, given that.[i, j] is weakly closed, and /?.,;.,- is inside a pseudoknot. Otherwise, Wlij is +00. Appendix A. Recurrences 59 The base cases are as follows: Wli.-j = +00, if cover(i) / cover(j), since is not weakly closed. Wh,j = PUp, if * = j and bpG(i) = 0, since [i,j] is an empty region, thus we give it the value for an unpaired base in a pseudoloop. 117,., = 0, if?; > j. Otherwise, WI^j is given by the following recurrence: II'/.., mm < (1) Vij+PP, (2) min ( WI, , i<t<3 {(3) WMBi if i.j e G, or (bpc(i) and bpc(j) = 0) WI, (t+DJ) (A.2) /'.,, + P„, Similar to W-U.j, case (1) handles the case that i pairs with j, i.e. 6po (t) = j or the case that neither is paired in G but are paired in G'. Case (2) handles the cases that 31, i <t < j, bpa, , (i) < t, and bp a, ,0) > t OYbp(Rij,j) = 0. If R;.j does not fall into case (1) or (2), it must be that paired(Rij,i) and paired(Ri,j,j), and bpa,,.,•(*) > frpy?.; where [i, j] is a pseudoknotted closed region in Ri.j. In this case, R,.j will be covered by case (3). Since WI, j is the structure inside a pseudoknot, but not inside a band, we add a Pv,s penalty to case (1) and (3), and a Ps,n penalty to case (3) for introducing a new pseudoknot inside a pseudoloop. Appendix A. Recurrences 60 Figure A.3: Illustration of cases for Wtt ... In case (5), the plotted structure from i to j could contain more than 2 bands (not illustrated) A . 3 Wl':/ Wf- • is the minimum free energy of all valid nonempty structures RIJ, if [i-j\ is weakly closed, given that RLi is inside a band. Otherwise, Wftj is +oo. The base cases are as follows: Wl1- , = +co, if [i.j] is not weakly closed. Wf.L.- = +oo, if i > j, since empty(iJjj, Otherwise, Wfi • is given by the following recurrence: f ( l ) ^ - + & ' Wl , = min < (2) WL + c' if i.j € G, or (bpG(i) = 0 and bpG(j) = 0) if frpG(i) = 0 if frpc(j) = 0 (A.3) (3) +c' (4) mm.(IV/iit+ [(5)H/iV/B,:,,+P5m +6' Cases (2) and (3) handle free bases on each side of the sequence. The rest of Appendix A. Recurrences 61 the cases are similar to W/with the only difference being that here in cases (1) and (5) we use b' as the penalty for introducing a new base pair in the structure instead of PpH, since the base pair is not inside a pseudoloop, but rather in a band. A . 4 VP,,, VPj,j is the minimum free energy of all valid structures Rij, in which bpc(i) = bpa(j) = 0, bases i and j are paired in G', i.e. bpa>{i) — j, and i.j crosses a base pair of G. Here the energy of Rij is the energy of all loops within Rij that are not inside a. band. Otherwise, VPij is +oo. The base case is as follows: (i>j, VPij = +oo , if < i.j does not cross any base pair of G, (A.4) [bpG{i) > 0, or bpcij) > 0 Otherwise, VP,j is given by the following recurrences: Appendix A. Recurrences 62 17'.., = min ' ( i ) w / ( i + 1 ) i ( B ( i j ) _ 0 WI, (3) W^i+i)^^^.,-!) + (''<••,/)-') + ^ V ( V , ( ) + l ) , ( . 7 - l ) (4) e.stp(i,i 1 ,i) + KPji+D^-D (5) min (e,„,,p(i,r,r' ,j) •;<7-<-m•;•!,( 0(';..,•,,''(;..,•)) •»»^('''(.'.,-)^(.-..-;))<'''<.v + V P , , , , ) (6) mjn (7) min if isCovered(G. i), and isCovered(G, j) if isCovered(G,i), and isCovered(G,j) if isCovered(G, i), and i,sCovered(G,j) if (6pG(i + l) = 0,. and 6P G(j - 1) = 0) (A.5) cover(G.i) = cover{G ,r) and cover(G,j) = cover (G,r') and empty (G, [i + 1, r - 1]) and empty(G,[r' +1,7-1]) + ^, (,--,) +a' + 26') + H/ / ' ( . „ + 1 ) , { y _ 1 ) + a ' + 26') Cases (1), (2), and (3) handle the cases that there are no other base pairs in [i,j] that cross the same band(s) that i.j crosses. In these cases we compute the energy between band borders. Case (4) handles the case that base pairs i.j and (i+ l).(j — 1) form a stacked pair in R,,;. Case (5) handles the case that i.j and r.r' close an internal loop of P,;j. Cases (6) and (7) handle the similar condition to case (5) except that case (6) allows closed regions in the gap region [i,r — 1] and case (7) allows closed regions in the gap region [»••+ In those cases i.j does not close an internal loop; but rather close a multiloop that spans a band. Appendix A. Recurrences 63 Figure A.4: Illustration of Cases for VP^j Appendix A. Recurrences 64 Cases (6) and (7) can he combined into the following case: min ( , , ( , . _ , ) + VP.,,,, + W^ ( T . , + 1 ) i b-_ 1 ) + P„« + a' + 2b') i<r<inm(B{.,.;),'>(;..,•)') <„ur.,:{h\. .n,B(i.i))<r'<j (A.6) Since the minimization is done over two parameters r and r', we should limit the size of region [r,r'] to keep the complexity of our algorithm to 0(nJ). If Rij does not fall into any case from (1) to (7), then there must exist r.r' in G", with i < r < r' < j, and one base r (or r') inside the band region [B1^. .y B(i,j)] (or [b(i<.j),bC -j]) and the other base r' (or r) outside the band region [B1^ .yB^j)] (or [by,.j),b^. .^]). Then G U G' must have density at least 3, which is not allowed in our algorithm. A.5 V P ^ VP-,j is the minimum free energy of all valid structures Rij over region [i.j], such that for some r, i < r < j, either bpc(i) = r or bpc>(j) = r, and either ».r or r.j crosses a base pair of'G. Otherwise, VP; • is +oo. The base case is as follows: VPij = +oo, if / > j (A.7) Otherwise, VP1- • is given by the following recurrences: VP'ij = min f(l) min (VP.,,.+ JW ( r + 1 ) i i ) • (2) min, (!<(,._!,+ » < - ' - < " » " ( B ( j i , . ) : J ) ( A g . (3) min (VP^ + c'iJ-r)) if empty(G, [r + 1, j]) „»,.».•(»:,//(...))<,•<:/ (4) min (c'(r - i) + VPrj) if empty(G, [i, r - 1]) ' i<r<in.in(B'f. ,y) In both cases (1) and (2), the energy of Rij is the energy of all loops within Rij. In case (1), we have two components: the energy given by base pair i.r which is covered by VP, and the energy given by the structure from base r + 1 to base j, which is covered by Wf. Since only VPij uses VPij, the structure from r + 1 to j is within a band (see case (6) of VPij) and so is covered by WT. Case (2) can be reasoned similarly, with reference to case (7) of VPij for use of WF. Cases (3) and (4) are similar to cases (1) and (2) with the only difference that there is no base pairs in regions [r + l,j] and [i,r — 1] respectively. Appendix A. Recurrences 65 (1) (2) Figure A.5: Illustration of Cases for VR-^. A . 6 V,, Vij is the minimum free energy of all valid structures R,j over region [i.j], if [i.j] is weakly closed or empty and i.j forms a base pair of Rij. Otherwise, V)j is +00. The base bases are as follows: V.Lj = +oo, if < [i.j\ is not weakly closed \bvii; ,•(*) > 0 and bpri. ;{j) > 0 but bpR, .(i) ^ bpR: .^j) (A.9) Otherwise, V.Lj is given by the following recurrence: Vij = mm{eH(i,3),es(iJ) + V i ) , ( i-U> VBJi.^ vMij} (A.10) This recurrence is identical to that used in pseudoknot free algorithms, so we omit the proof here and for VBI in the next section. A . 7 VBI; j VBk,j is the minimum free energy of all valid structures Ri:j over region if [i,j] is weakly closed or empty, assuming i.j closes a bulge or internal loop of Rij. Otherwise, I />'/... is +oo. Appendix A. Recurrences 66 The base case is as follows: VBIij — +00, if j — i < 1 Otherwise, VBl,,= min (eillt(i,i'j',j) + W,-')- (A.11) i<»' <,)'<:; where all bases between [i,i.']-and [j'.j] are unpaired. ei.nt.{i; i'j'j) gives the free energy of an internal.loop or bulge with exterior pair i.j and interior pair i.'.j'. The complexity for computing VBlj can be done in 0(n3) using [31]. A . 8 VM,., VMij is the minimum free energy of all valid structures Rij over region [i.j], if [i.j] is weakly closed or empty and i.j closes a multiloop of Rij. Otherwise, VMij is +00. We can obtain a recurrence that calculates the loop cost as the sum of two subparts. VM.-,.-. = min< -+i<h<j-i (A.12) [ (2) iyAf5 ( i + 1 ) ( i_i) + a + Psm + b Case (1) is similar to recurrence for Wij , and case (2) handles the case that there is one or more pseudoknotted loops in the multiloop. A . 9 WM,, WMj is the minimum free energy of all valid structures Rij, if [i,j] is weakly closed, not empty, and i and j are on a multibranched loop. The base case is as follows: WMij = +00 , if i > j Otherwise, WMij is given by the following recurrences: • Appendix A. Recurrences 67 (2) Figure A.7: Illustration of Cases for WMB;.,. WMi.-j = m i n { {2)WMli+1)j+c • bpc(i) = 0 (3) W % . 1 ) + C bpG(j)=0 (A.13) (4) min {WMi.t + WM{t+l)/l) [ (5) IfiVm,:,+ Psm + b Cases (1) to (4) are the same as in a pseudoknot free structure, and case (5) handles the case of a pseudoknotted loop in the multiloop. A . 1 0 WMB, , WMB^.j is the minimum free energy of all valid structures Rij over region [i,j], where /?.,;., is a density-2 pseudoloop. Otherwise, WMB is +oo. The base case is as follows: II-'.',//?= +oo , if / :••./ Otherwise, WMBij = min (1) P,,+ min (BEibpcljlbpGiBlu^Bl.jyj) U bpcU) > 0 + WMB'hl+ WI{l+1)!(B[iirlj), (A.14) (2) WMB'it.j In case (1), j is paired in G. Then, in the MFE structure, some base I with bpc(j) < I < j must be paired in G', causing 6p/?„ Aj).j to be pseudoknotted. We minimize the energy over all possible choices of / (note that / must be unpaired in G, since it will be paired in G', which is disjoint from G). By Lemma 1, once / is fixed, the inner base pair of the band whose outer base pair Appendix A. Recurrences 68 is bpii; ,{j).j is also determined. The P),+ BE term in case (1) of the recurrence accounts for the energy of the band, a WI term accounts for a weakly closed region in the band, .and the remaining energy is represented by the WMB' term. In case (2), j is not paired in G, and the recurrence is unwound by moving directly to a WMB1 term. A . 11 WMll, , WMB] j is the minimum free energy of all valid structures RL..j over region where R,., is a proper prefix of a density-2 pseudoloop. Otherwise, WMB is +oo. The base case is as follows: WMB'.ivJ = +oo , if i > j Otherwise, WMB';^ = min '{1)2P,,+ min i<l<,n,.n(;,.b{iJ)) isCoVKral{C,.hl) (BEibpciBuj^MciB^B'^Buj)) if 6pG0) =0 + W M 0 I N _ L ) + VP,.,). < (2) min. (WMBfiil+WI[l+1)J), if bpG(j) < j ( A J 5 ) c„vcr(l)=covur(;)) lVa(l.)=() (3) Ph + VP., (4) 2P,, + mm (BE(i, b'(i 0 , 6pG(6'(l- 0 ) , bpG(i)) if bpG(j) = 0 t<l<hpa(i) _ + WI\h,: 0 +i), (;-i) + VPi,j) and bpc(i) > 0 Complementing case (1) of the WMB recurrence, WMB' handles the case that the rightmost band is not in G, but is part of the MFE structure G' (with respect to G). In the recurrence for yVMB1, case (1) is the complex case, accounting for the energy of the region spanned by the rightmost two bands using the 2P\,. VP, and BE terms, and recursively calling WMB1. Case (2) is called when one iteration of WMBij case (2) or WMB/i.i case (1) is done and we need to compute the energy of the remaining part of the structure inside an arc. Note that Wlij = +00 when cover (i) — cover (j) = (—1, —1) such that entering case (2) as the first iteration of WMB' is not possible. For example, the total energy for the structure presented in Fig. A.11(2) is calculated in the following way: the energy of the rightmost band covering base j is calculated using WMB case (1); the energy of the remaining structure from i to j accounts for the energy of the closed subregion of [i,j], [1 + 1, j], and the remaining prefix (WMB'-j), which is handled by case (2) of the WMB! recurrence. In cases (3) and (4), only one iteration of WMB1 yields the result. Case (3) handles the case that i pairs with j in G', and i.j is the left-most base pair in Appendix A. Recurrences 69 (1) (2) Figure A.8: Illustration of Cases for WMB.,j. the structure. Since this is part of a pseudoknotted structure, it will be covered by VP. Case (4) handles the case that there are only two bands in Rij, and the right band of the pseudoknotted loop is not in-G. A . 1 2 BEij. /.'. f. j) . BE(i, is the minimum free energy of the band [?', i'] U [j'j], if i < i' < j' < j, in which bpc{i) = j and bpG(i') - j'-The base cases are as follows: BE(i,i',j',j) = +oo, if it is not the case that i<i' < j' <j BE(i,i,j,j) = 0, if i < j Appendix A. Recurrences 70 Otherwise, BE(i,i'j',j) = min + BE(i + l,i',j',:J-l) (2)eintP{i,l,bpc(l),j) +BE(l.,i'f,bpG(l)) + BE(l,i',j'bpc(l)) + WJ{lll>o{l)+iU:l_l)+a'+2b' + BE(l./i/,f,bpc(l)) if 6pc(i + l) = (j - 1) if bPG(l) > 0, (empty(C7, [i +1,1-1}), and empty(C7, [bpc(l) + l,j - 1})), (i<l< i') and 0" < bpc{l) < J) UbpG(l) > 0 (weakly closed(G, [i + 1,1 - 1]) and weakly closed(G, [bpc(l) + l,j - 1])) and (i<l< i') and W < bpa(l) < j) (A.16) if bpG{l) > 0. (weakly closed(G, [i + 1,1 - 1}) +a' + 2b''+ c'(j - bp{G, I) + 1) and empty(G,[6pG(0+ 1,7-1])) and {i<l<i>) , . and (f<bPG(l)<j) if bpG{l) > 0 (weakly closed(G, [bpG(l) + l,j - 1]), and empty(G, [?:+ 1,/ - 1])) and [i < I, < i') and (f < bPG(l) < j) Case (1) handles the case that base pairs i.j and (i + l).(j — 1) of G form a stacked loop in the band. Case (2) handles the case that i.j and l.bpG(l) are the base pairs of an internal loop of G. (b)a' + 2b' + c'{L-i + l) + BE(l,i',j',bPG(l)) + WJ,{b,,c(i)+i),{j-i) Appendix A. Recurrences 71 Appendix A. Recurrences 72 Case (3) handles a similar situation as in case (2) except that there are other closed regions in both of the regions [i.l] and [6pc(0>il-Case (4) handles the case that the region [&7->G(0IJ]> is empty, and so we must pay the unpaired base penalty c' for each unpaired base. In this case, the left side, [i.l] must not be empty. Case (5) is the same as case (4) except the left side is empty and the right is not empty.
- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Theses and Dissertations /
- RNA secondary structure prediction using hierarchical...
Open Collections
UBC Theses and Dissertations
Featured Collection
UBC Theses and Dissertations
RNA secondary structure prediction using hierarchical folding Hosna, Hosna 2007
pdf
Notice for Google Chrome users:
If you are having trouble viewing or searching the PDF with Google Chrome, please download it here instead.
If you are having trouble viewing or searching the PDF with Google Chrome, please download it here instead.
Page Metadata
Item Metadata
Title | RNA secondary structure prediction using hierarchical folding |
Creator |
Hosna, Hosna |
Publisher | University of British Columbia |
Date Issued | 2007 |
Description | Algorithms for prediction of RNA secondary structure— the set of base pairs that form when an RNA molecule folds— are valuable to biologists who aim to understand RNA structure and function. Improving the accuracy and efficiency of prediction methods'is an ongoing challenge, particularly for pseudoknotted secondary structures, in which base pairs overlap. This challenge is biologically important, since pseudoknotted structures play essential roles in functions of many RNA molecules, such as splicing and ribosomal frameshifting. State-of-the- art methods, which are based on free energy minimization, have high runtime complexity (typically θ(n⁵) or worse), and can handle (minimize over) only limited types of pseudoknotted structures. We analyze a new approach for prediction of pseudoknotted structures, motivated by the hypothesis that RNA structures fold hierarchically, with pseudoknot free (non-overlapping) base pairs forming first, and pseudoknots forming later so as to minimize energy relative to the folded pseudoknot free structure. Our HFold algorithm, based on work of S. Zhao, uses two-phase energy minimization to predict hierarchically-formed secondary structures in 0(n³) time, matching the complexity of the best algorithms for pseudoknot free secondary structure prediction via energy minimization. Our algorithm can handle a wide range of biological structures, including kissing hairpins and nested kissing hairpins, which have previously required θ(n⁶) time. We also report on the experimental evaluations of HFold and present thorough analyses of the results. We show that if the input structure to the algorithm is correct, running the algorithm results in 16% accuracy improvement on average over the accuracy of the true pseudoknot free structures. However if the input structure is not correct, the accuracy improvement is not significant. If the first 10 suboptimal foldings are given as input to our algorithm instead of just the minimum free energy structure (MFE), the prediction accuracy improves significantly over the accuracy of the MFE structures. This improvement is even more when the number of suboptimal foldings as input to our algorithm increases. The comparison of the energy of the structures predicted by HFold on the true pseudoknot free structures with the energy of the true structures calculated using a different method with the same energy model shows that the energy model may be the cause for the cases for which HFold predicts structures far from the true structures. Our experimental result provides some ways in which the hierarchical folding hypothesis might need to be refined. |
Genre |
Thesis/Dissertation |
Type |
Text |
Language | eng |
Date Available | 2011-03-03 |
Provider | Vancouver : University of British Columbia Library |
Rights | For non-commercial purposes only, such as research, private study and education. Additional conditions apply, see Terms of Use https://open.library.ubc.ca/terms_of_use. |
DOI | 10.14288/1.0052065 |
URI | http://hdl.handle.net/2429/32001 |
Degree |
Master of Science - MSc |
Program |
Computer Science |
Affiliation |
Science, Faculty of Computer Science, Department of |
Degree Grantor | University of British Columbia |
Campus |
UBCV |
Scholarly Level | Graduate |
AggregatedSourceRepository | DSpace |
Download
- Media
- 831-ubc_2007-0443.pdf [ 2.9MB ]
- Metadata
- JSON: 831-1.0052065.json
- JSON-LD: 831-1.0052065-ld.json
- RDF/XML (Pretty): 831-1.0052065-rdf.xml
- RDF/JSON: 831-1.0052065-rdf.json
- Turtle: 831-1.0052065-turtle.txt
- N-Triples: 831-1.0052065-rdf-ntriples.txt
- Original Record: 831-1.0052065-source.json
- Full Text
- 831-1.0052065-fulltext.txt
- Citation
- 831-1.0052065.ris
Full Text
Cite
Citation Scheme:
Usage Statistics
Share
Embed
Customize your widget with the following options, then copy and paste the code below into the HTML
of your page to embed this item in your website.
<div id="ubcOpenCollectionsWidgetDisplay">
<script id="ubcOpenCollectionsWidget"
src="{[{embed.src}]}"
data-item="{[{embed.item}]}"
data-collection="{[{embed.collection}]}"
data-metadata="{[{embed.showMetadata}]}"
data-width="{[{embed.width}]}"
data-media="{[{embed.selectedMedia}]}"
async >
</script>
</div>
Our image viewer uses the IIIF 2.0 standard.
To load this item in other compatible viewers, use this url:
https://iiif.library.ubc.ca/presentation/dsp.831.1-0052065/manifest