R N A Secondary Structure Prediction Using Hierarchical Folding .by Hosna Jabbari B.Sc, The University of Victoria, 2005 A THESIS S U B M I T T E D IN PARTIAL F U L F I L M E N T O F T H E REQUIREMENTS FOR T H E D E G R E E OF Master of Science in The Faculty of Graduate Studies (Computer Science) The University Of British Columbia August, 2007 © Hosna Jabbari 2007 A b s t r a c t Algorithms for prediction of RNA secondary structure— the set of base pairs that form when an RNA molecule folds— are valuable to biologists who aim to understand RNA structure and function. Improving the accuracy and efficiency of prediction methods'is an ongoing challenge, particularly for pseudoknotted secondary structures, in which base pairs overlap. This challenge is biologically important, since pseudoknotted structures play essential roles in functions of many RNA molecules, such as splicing and ribosomal frameshifting. State-ofthe-art methods, which are based on free energy minimization, have high runtime complexity (typically 0(n ) or worse), and can handle (minimize over) only limited types of pseudoknotted structures. We analyze a new approach for prediction of pseudoknotted structures, motivated by the hypothesis that RNA structures fold hierarchically, with pseudoknot free (non-overlapping) base pairs forming first, and pseudoknots forming later so as to minimize energy relative to the folded pseudoknot free structure. Our HFold algorithm, based on work of S. Zhao, uses two-phase energy minimization to predict hierarchically-formed secondary structures in 0(n*) time, matching the complexity of the best algorithms for pseudoknot free secondary structure prediction via energy minimization. Our algorithm can handle a wide range of biological structures, including kissing hairpins and nested kissing hairpins, which have previously required 0(vi ) time. We also report on the experimental evaluations of HFold and present thorough analyses of the results. We show that if the input structure to the algorithm is correct, running the algorithm results in 16% accuracy improvement on average over the accuracy of the true pseudoknot free structures. However if the input structure is not correct, the accuracy improvement is not.significant. Tf the first 10 suboptimal foldings are given as input to our algorithm instead of just the minimum free energy structure (MFE), the prediction accuracy improves significantly over the accuracy of the MFE structures. This improvement is even more when the number of suboptimal foldings as input to our algorithm increases. The comparison of the energy of the structures predicted by HFold on the true pseudoknot free structures with the energy of the true structures calculated using a different method with the same energy model shows that the energy model may be the cause for the cases for which HFold predicts structures far from the true structures. Our experimental result provides some ways in which the hierarchical folding hypothesis might need to be refined. 5 6 iii C o n t e n t s Abstract ii Contents iii List of Tables List of Figures v vi Acknowledgements x 1 Introduction 1.1 R N A Secondary Structure 1 1 2 Background on R N A Secondary Structure 2.1 Notation ' 2.2 Region and Loop Classification and Related Definitions 2.3 Bi-secondary and Density-2 Structures 2.4 Energy Model '. 2.4.1 Pseudoknot free energy model 2.4.2 Pseudoknotted energy model 5 5 6 9 11 11 12 3 Algorithm 3.1 A Useful Property of Density-2 Structures . . . '. 3.2 High Level Description of HFold 14 14 18 4 Results 4.1 Accuracy ' 4.2 Dataset 4.3 Accuracy when input is the true pseudoknot free structure . . . . 4.4 Accuracy when input is the M F E pseudoknot free structure . . . 4.5 Best accuracy, taken over suboptimal folds ' 4.6 Accuracy of the lowest energy structure, taken over suboptimal folds 4.7 Peeling Techniques 4.7.1 Accuracy when input is the M F E pseudoknot free structure with 1-3 base pairs removed from inside or outside the stems that close hairpin loops 23 24 24 24 29 33 36 37 37 Contents ' iv 4.7.2 4.8 4.9 5 Accuracy when input is the M F E pseudoknot free structure with 1-3 base pairs removed from both inside and outside each stem that closes a hairpin loop 4.7.3 Accuracy when input is the M F E pseudoknot free structure with stems of size 1-3 removed 4.7.4 Accuracy when input is the M F E pseudoknot free structure with loosened loops Base Pair Confidence . . . . : . .' Energy Comparison Conclusion and Future W o r k 39 44 44 45 45 52 Bibliography 54 A 57 Recurrences A.l A.2 A.3 A.4 A.5 A.6 A.7 A.8 A.9 A. 10 A . 11 A.12 W,,.; Wlij Wf VP-j VP . ". V,/ VBkj VM,/ . . . : WMij WMBi,j WMB'-'. BE[iJ,j',j) rl ' .' , '• • • i i j : l ' 57 58 60 61 64 65 65 66 66 67 68 69 V List o f Tables 2.1 4.1 Energy parameters 13 Pseudoknot free sequences used for secondary structure prediction. In order, the columns provide (1) sequence ID, as found in the database or paper from which we obtained the sequence, and their length in parenthesis; (2) type of sequence: and (3) the reference • from which the sequence was obtained 25 4.2 Sequences with H-type pseudoknots used for secondary structure prediction. In order, the columns provide (1) sequence ID, as found in the database or paper from which we obtained the sequence, and their length in parenthesis: (2) type of sequence: (3) structure of sequence (H-type pseudoknot = H, H-type pseudoknots with nested structures = H*): and (4) the reference from which the sequence was obtained 50 4.3 Sequences with kissing interactions and the density-3 sequence used for secondary structure prediction. In order, the columns provide (1) sequence ID, as found in the database or paper from which we obtained the sequence, and their length in parenthesis; (2) type of sequence; (3) Structure of the sequence (kissing interactions = K, density-3 structure = D3); and (4) the reference from which the sequence was obtained. 51 vi List of Figures 1.1 2.1 2.2 2.3 2.4 2.5 3.1 3.2 A pseudoknot free structure (top), an H-type pseudoknotted structure (center) and a kissing hairpin (bottom). Figures were generated by PseudoViewer [10] An H-type pseudoknotted structure (left) and a pseudoknot free structure (right) in graphical (top) and arc diagram (bottom) formats Pseudoknot Density-2 structures Bi-secondary structure that is not density-2 The figure shows a density-2 secondary structure for a sequence of length n. /?.;.,• is the structure restricted to the region (for i < j < n) and dj is that part of /?,,;, above the horizontal line. The circle dots show positions of j where R;J is a prefix of a density-2 pseudoloop with respect to G, and the crosses show positions of j where Rij is not a prefix of a density-2 pseudoloop with respect to G 4 6 7 10 10 11 Illustration of band borders in Lemma 3 and Lemma 4 Illustration of how the WMB recurrence unwinds, to calculate WMBij. Arcs above the horizontal line from i to j represent base pairs of C,;,, and arcs below the line represent base pairs of G' Case (a) of the WMB recurrence handles the overall structure whose energy is WMBij, with I = l\. with terms to account for energies of the right upper band (BE) and right lower closed subregion ( W]( \).(i , (i,'., )-i)) ' l the remaining prefix (WMB/ ). The term WMBl is handled by case (a) of the WMB recurrence, with / = l-i and terms to account for the lower right substructure labeled VPi i , the upper left band (BE), and the remaining "prefix" of the overall structure (WMB' ^ _ - ). WMB , ^_ is then handled by case (b) of the WMB recurrence, with I = l-i, and terms to account for WI^ )^ _i^ and WMBl . Finally, the WMB' term is handled by end case (c) of the WMB recurrence 17 Visual illustration of recurrences in HFold 22 rr a s Ll + w e a s n c ih iM 1 2i l i l li l j 1 l l } 3+l 2 ihi 1 lh 3.3 20 List of Figures 4.1 Accuracy of HFold when the input is the true pseudoknot free secondary structure (G/,i ), versus the accuracy of Gy structures. The horizontal axis represents the accuracy of structures predicted by HFold while the vertical line represents the accuracy of the input structures. Pseudoknotted and pseudoknot free structures are presented by 'x' and 'O', respectively A pseudoknot free structure given as input to HFold (top) and the corresponding HFold pseudoknot free result (bottom). The overall structure remains pseudoknot free after HFold, but more base pairs are added to the loops. For example, the large internal loop starting at 1 and ending at 164 on the top figure is broken into two smaller loops in the bottom figure. Similarly the loops of the branch starting at 51 and ending at index 90 are broken into smaller loops CH,, structure given as input to HFold (top), the corresponding HFold result (center) and the true pseudoknotted structure (bottom). HFold adds 3 pseudoknot free stacked base pairs at 2.40, 3.39, and 4.38, while the true pseudoknotted structure has pseudoknotted base pairs at 18.36, 19.35, and 20.34 Predicted pseudoknotted structure (top) and the true pseudoknotted structure (bottom). HFold predicts interleaved bands, while the true pseudoknotted structure has 5 separate H-type pseudoknots Accuracy of HFold when the input is the true pseudoknot free secondary structure (G.,,,-,„,(j) (horizontal axis), versus the accuracy of G „u structures (vertical axis). Pseudoknotted and pseudoknot free structures are presented by 'x' and 'O', respectively. ... Accuracy of HFold when the input is the M F E structure (horizontal axis), versus the accuracy of the M F E structures (vertical axis). Pseudoknotted and pseudoknot free structures are presented by \ ' and 0 ' . respectively M F E structure (top), true pseudoknotted structure (bottom). Tire true pseudoknotted structure has 4 stacking base pairs at 1.14, 2.13, 3.12, 4.11 and another stem of size 4 at 6.26, 7.25, 8.24, and 9.23, while in the M F E structure the first stem is a shift of the true structure (3.15, 4.14, 5.13, and 6.12). In this cases HFold does not add any base pairs to the input structure. Accuracy of HFold on the best-of-25 (horizontal axis), versus the accuracy of the M F E structures (vertical axis). Pseudoknotted and pseudoknot free structures are presented by 'x' and 'O', respectively Accuracy of HFold on the best-of-25, versus the accuracy of the best-of-10. Pseudoknotted and pseudoknot free structures are presented by x and 0 ' , respectively S 4.2 4.3 4.4 4.5 vii UJ 26 27 28 30 sm 4.6 [ 4.7 4.8 4.9 ; ; : ! 31 32 33 34 35 List of Figures viii 4.10 Accuracy of H F o l d on the minimum energy structure over the first 25 suboptimals, versus the accuracy of H F o l d on the M F E structure. Pseudoknotted and pseudoknot free structures are presented by x' and ' O ' , respectively 36 ; 4.11 T h e pseudoknotted structure predicted by H F o l d (top), the corresponding M F E structure (center) and the true structure (bottom). Although H F o l d correctly predicts parts of the pseudoknot in the structure on the top, the accuracy of the structure predicted by HFold is lower than the accuracy of the corresponding M F E structure, because there are some extra pseudoknotted base pairs predicted by HFold. Note that the M F E structure has all the correct pseudoknot free sub-structures (except the stem closing at 1.114). High pseudoknot initiation penalty, and low band penalty are the main causes for such predictions by H F o l d . . . . 38 4.12 Accuracy of the M F E structures with 2 base pairs removed from inside each stem that closes a hairpin loop (horizontal axis), versus the accuracy of the M F E structures (vertical axis). Pseudoknotted and pseudoknot free structures are presented by 'x and ! l O \ respectively • 40 4.13 Accuracy of HFold when input is the M F E structures with 2 base pairs removed from inside each stem that closes a hairpin loop (horizontal axis), versus the accuracy of H F o l d when input structure is M F E (vertical axis). Pseudoknotted and pseudoknot free structures are presented by 'x' and ' O ' , respectively 41 4.14 Accuracy of the M F E structures with 1 base pair removed from both ends of each stem (horizontal axis), versus the accuracy of the M F E structures (vertical axis). Pseudoknotted and pseudoknot free structures are presented by 'x' and ' O ' , respectively. . . 42 4.15 Accuracy of HFold when input is the M F E structures with 1 base pair removed from both ends of each stem (horizontal axis), versus the. accuracy of HFold when input structure is M F E (vertical axis). Pseudoknotted and pseudoknot free structures are presented by 'x and ' O ' , respectively 43 ! 4.16 Accuracy of H F o l d when the input is the pseudoknot free structures with base pair confidence above 75% (horizontal axis), versus the accuracy of the M F E structures (vertical axis). Pseudoknotted and pseudoknot free structures are presented by 'x' and ' O ' , respectively 46 4.17 Comparison of H F o l d energy and true structure energy. T h e horizontal axis represents the energy of structures predicted by H F o l d when the input is G/,;,,, while the vertical axis represents the energy of the "true" structures. Pseudoknotted and pseudoknot free structures are presented b y ' x ' and ' O ' , respectively 4.18 48 Comparison of energy of G/„: (horizontal axis) and the energy of fl the M F E structure (vertical axis). Pseudoknotted and pseudo- knot free structures are presented by 'x' and 0 , respectively. . . ! ; 49 List of Figures A.l Illustration of Cases for Wij. A.2 Illustration of Cases for Wlij.A.3 Illustration of cases for Wt .j. In case (5), the plotted structure from i to j could contain more than 2 bands (not illustrated) . . A.4 Illustration of Cases for VP -. . A.5 Illustration of Cases for VP\j A.6 Illustration of Cases for VM A.7 Illustration of Cases for WMBij A.8 Illustration of Cases for WMB'. . A.9 Illustration of Cases for BE-,.-,;',,<,,• itj i:j i j ix 58 59 60 63 '65 66 67 69 71 X Acknowledgements I would like to thank my supervisor, professor Anne Condon, for her endless support and encouragements. She introduced me to a new and interesting branch of science that I did not know about before. I cannot thank her enough for all she has done for me. I would also like to thank Dr. Will Evans for being my second reader and for his helpful comments. Many thanks go to people of Beta lab for making Beta such a lovely place to work. Last, but not least, 1 would like to thank my parents, my sisters and my wonderful husband for believing in me.through all stages of my studies. Without their love and support 1 could not have done anything. Hosna Jabbari. To my best friend and wonderful husband, Majid. 1 Chapter 1 Introduction 1.1 R N A Secondary Structure R N A molecules aid in translation and replication of the genetic code, catalyze cellular processes, and regulate the expression level of genes [7]. Structure is key to the function of R N A molecules, and so methods for.predicting R N A structure from the base sequence are of great value. Currently, prediction methods focus on secondary structure - the set of base pairs that form when the R N A molecule folds. There has been significant success in prediction of pseudoknot, free secondary structures, which have no crossing base pairs (see F i g . 1.1). State-of-theart prediction algorithms, such as Mfold [18] or R N A f o l d [11] find the structure with minimum free energy ( M F E ) from the set of all possible pseudoknot free secondary structures. The energy of a structure is estimated as the sum of energies of loops that form when the molecule folds, where the loop energy values are'provided by biologists. While many small R N A secondary structures are pseudoknot free, pseudoknots do arise frequently in biologically-important R N A molecules, both in the cell [24, 27], and in viral R N A [6]. Examples include simple H-type pseudoknots, with two interleaved stems, which are essential for certain catalytic functions and for ribosomal frameshifting [2], as well as kissing hairpins, which are essential for replication in the coxsackie B virus [19]. Unfortunately, M F E pseudoknotted secondary structure prediction is N P hard [1, 13, 14], even for a simple energy model that depends on base pairs but not on unpaired bases. Polynomial-time M F E - b a s e d approaches to pseudoknotted structure prediction have been proposed [1, 8, 21, 22, 26], with respect to various sum-of-loops energy models for pseudoknotted structures, which find the M F E structure for a given input sequence, from a. restricted class of structures. A class of structures can be defined by specifying allowable patterns of interleaving among base pairs. For example, Mfold and R N A f o l d handle the class of pseudoknot free secondary structures; we provide more examples later. We say that a structure' R can be handled by a given algorithm if R is in the class of structures over which the algorithm optimizes. Algorithms for M F E pseudoknotted secondary structure prediction trade off run-time complexity and generality - the class of structures handled, that is, the class of structures over which the algorithms optimize. For example, kissing hairpins are not in the class of structures handled by the 0 ( n ) algorithms of Akutsu [1] and Dirks and Pierce [8] but are in the class handled by the Q(n ) algorithm of Rivas and E d d y [22]. (We note that, even when the true 5 6 Chapter 1. Introduction 2 structure R for a'sequence is handled by an algorithm, the algorithm still may not correctly predict R, because correctness depends not only on the generality of the algorithm but also on the energy model and energy parameters.) Our work is motivated by two limitations of MFE-based algorithms for pseudoknotted secondary structure prediction: they have high time, complexity, and ignore the folding pathway from unfolded sequence to stable structure. Several experts have provided evidence for, and support, the hierarchical folding hypothesis [16, 25], which is succinctly stated by Tinoco and Bustamante as follows: "An RNA molecule [has] a. hierarchical structure in which the primary sequence determines the secondary structure which, in turn, determines its tertiary folding, whose formation alters only minimally the secondary structure" [25]. (These and other authors consider the initially-formed secondary structure to be pseudoknot free, and refer to base pairs that form pseudoknots as part of the tertiary structure. However, here we refer to all canonical base pairs, namely A-U, C-G, and G-U, as secondary structure.) We note that while the hierarchical folding hypothesis is a common assumption, some counter examples have been reported, notably formation of the structure of a subdomain of the Tetrahyinena thermophila group I intron ribozyme [29]. However, even in this case, 15 of the 19 base pairs in the initially-formed pseudoknot free secondary structure are retained upon formation of tertiary structure, and the 4 missing base pairs lie at the ends of stems. This work is a continuation of the M.Sc. work of Zhao [30], in which she presented a novel and efficient algorithm to predict RNA secondary structures, in a manner consistent with a natural formalization of the hierarchical folding hypothesis. She defined the problem of predicting the secondary structure as follows: given a sequence S and a pseudoknot free secondary structure G (a set of base pairings), find a pseudoknot free secondary structure G' (a set of base pairings disjoint from G) for S, such that the free energy of G U G' is less than or equal to the free energy of GUG" for all pseudoknot free structures G" / G'. As with algorithms for MFE pseudoknotted secondary structure prediction, algorithms,for Hierarchical-MFE secondary structure prediction may handle a restricted class of structures. That is, the type of structure formed by G U G" may have restricted patterns of interleaving among base pairs. Since both G and G" are pseudoknot free, the most general class of structures that could be handled by an algorithm for hierarchical-MFE secondary structure prediction would be the hi-secondary structures of Witwer et al. [28] - those structures which can be partitioned into two pseudoknot free secondary structures G and G'. There is no known way to solve the hierarchical-MFE prediction for the class of bi-secondary structures. Instead, Zhao suggested a solution with respect to a subclass of the bi-secondary structures, which she called density-2 structures, explained in Section 2. This is quite a genera) class, including H-type pseudoknots and kissing hairpins, as well as structures containing nested instances of these structural motifs. The only known algorithm for predicting MFE nested kissing hairpins, that of Rivas and Eddy, requires Q(n ) time. Rastegari and Condon [20] showed that, out of a set of over 1,100 biological structures, all but nine are density-2 (when b Chapter 1. Introduction 3 isolated base pairs are removed), and six of these nine are also not in the class handled by Rivas and Eddy's algorithm. The main contributions of this work are: • refining andfixingthe original recurrences presented in the work of Zhao [30], • defining and proving four lemmas, that help establish corrections of Zhao's recurrences, • implementing the HFold algorithm, and • throughly analyzing its performance on a dataset of 70 strands. We published a preliminarily version of this work in [12]. In Chapter 2, we present some useful background information and notations pertaining to RNA structure prediction. In Chapter 3, we summarize HFold, a. dynamic programming algorithm that solves the Hierarchical-MFE secondary structure prediction problem for the class of density-2 secondary structures in 0(n ) time and 0(n ) space, and present our lemmas. In Chapter 4, we describe the results of our experimental analysis. Our experimental evaluation of HFold shows that, when provided with the true pseudoknot free substructure for the input sequence, HFold adds pseudoknots which, on average, improve the accuracy (measured as fraction of correctly predicted bases) by 16%. However, HFold does not significantly improve accuracy when given as input a computational prediction of the MFE pseudoknot free secondary structure G, since HFold cannot correct errors in G. 3 2 Chapter 1. 4 Introduction 50 P-6' 40 U-C--A •U-G 60 -G\ A.J" 20 A' C 10 ,65 A-U A - G - G - U - G - G - C .u-G-G-G;- r f-.-A-U-U-G -G ""A-G-G U 6 r (j—-45 C G C" C "C G' U "& I • C "A J G T, iv "A; f A' "C' A' U" V c C "A' C U "G A" *G~-A C *A' G' V G' "G"' G' t ' G'" T ,U' t' G" t" A 'A G *•: U V G' c A" Figure 1.1: A pseudoknot free structure (top), an H-type pseudoknotted structure (center) and a kissing hairpin (bottom). Figures were generated by PseudoViewer [10]. " 5 Chapter 2 Background on R N A Secondary Structure A n R N A molecule is a sequence of nucleotides, or bases, of which there are four types: Adenine ( A ) , Guanine (G), Cytosine (C), and Uracil (U). The molecule has chemically distinct ends, called the 5' and 3' ends. We model an R N A molecule as a sequence over the alphabet {A,C, G, £/}, with the left end of the sequence being the 5' end. Throughout, n denotes the length of an R N A sequence. We index the bases consecutively from the 5' end starting from 1, and refer to a base by its index. When an R N A molecule folds, bonds may form between certain pairs of bases, where each base may pair with at most one other base. A secondary structure R is a set of pairs < i < j < n, such that no index occurs in more than one pair. The pair i.j denotes that the base indexed i is paired with the base indexed j. The canonical base pairs, which form the secondary structure, are the Watson-Crick pairs A-U and C-G, as well as the wobble pair G-U (see F i g . 2.1). 2.1 Notation We use the following notations when describing our algorithms. we use R to denote a secondary structure. Throughout, • bpni'i')'- We let bpa(i) denote the index of the base that is paired with base i in R. if any; otherwise bpp.(i) = 0 . • paired(.R, i): true if and only if i is paired in the structure R. • cross: if i.j, i'.j', and i < i' < j < j', we say that pair i.j crosses pair ?/.;/' (and i'.j' crosses i.j). • p s e u d o k n o t t e d base p a i r : We say that i.j is a pseudoknotted base pair if for some other base pair i!.j' in structure R, i.j crosses i ' . j ' . We also refer to i and j as pseudoknotted base indices. • p s e u d o k n o t free s e c o n d a r y s t r u c t u r e : If there are no pseudoknotted base pairs in the given structure, it is called pseudoknot free secondary structure. Chapter 2. Background on RNA Secondary Structure 6 NOT Weakly closed Figure 2.1: An H-type pseudoknotted structure (left) and a pseudoknot free structure (right) in graphical (top) and arc diagram (bottom) formats. • cover: Let G be a pseudoknot free secondary structure. Base pair i.j covers base fc if i < k < j and there is no other base pair i'.j' £ G with i < i' < k < j' < j. In this case, we denote i.j by cover (G.k). Otherwise cnver{G,k) = (-1,-1). • isCovered(G, fe): true if and only if some base pair of G covers fe. • region Sequence of indices between i and j inclusive. • disjoint regions: Two regions and ['«',/] are disjoint if no index is in both regions, i.e. j < i! or j ' < i. 2.2 Region and Loop Classification and Related Definitions Following is the terminology used to refer to loops and other standard elements in a, secondary structure. These definitions are mostly taken from and illustrated in the work of Rastega.i'i and Condon [20]. Throughout, definitions are with respect to a.fixedsecondary structure R. Generally we use R to refer to a structure that may be pseudoknotted (that is, contains at least one pseudoknotted base pair), and use G to refer-to a structure that we know to be pseudoknot free. 7 Chapter 2. Background on RNA Secondary Structure 13 9 II 15 18 23 27 34 39 Figure 2.2: Pseudoknot • empty(/?.. [i,j\) with respect to secondary structure R: true if region contains no base pair in R. Formally, V7c,i < k < j,paired(R,k). • w e a k l y closed r e g i o n : A region is weakly closed if no base pair connects a base in the region to a base outside the region. Formally, is weakly closed if and only if for allfc6 either bpn(k) £ [i,j] or bpn{k) = 0. Weakly closed(R, is true if and only if is a weakly closed region oi R. .• closed region: A weakly closed region with at least two bases, is closed if it cannot be partitioned into two smaller weakly closed regions. Formally, is closed if and only if i < j, \i,j) is weakly closed, and for all I £ — 1], neither [i,L] nor [/ + is weakly closed. Note that if [i.j] is closed then both i and j must be paired (although not necessarily with each other) [20]. • p s e u d o k n o t free closed r e g i o n : A'closed region contain any pseudoknotted base pairs. that does not • p s e u d o k n o t t e d closed r e g i o n : A closed region of a structure R, such that i.bpn(i) and bpn(j).j are pseudoknotted base pairs. We refer to • indices i. and j as the left and right borders of the pseudoknotted region M• d i r e c t l y b a n d e d i n : For a pseudoknotted base pair i.j, we say i.j is directly banded in base pair i'.j' and write i.j ^ i'.j' if: (1) ,:' < i < j <j\ and (2) [i' + 1,7 — 1] and [j + 1, j ' — 1] are weakly closed regions. "Directly banded in" is transitive: i.j -< i'.j' ^ implies i.j ^ i".j". • b a n d : Let us consider a maximal chain of ^. The minimum (maximum) base pair in the maximal chain is the band's inner (outer) closing pair. If i.j is the outer and i'.j' the inner closing pair of a band, then •'•']• ' [j'.j] are the band's regions. For example there are 3 bands in Fig. 2.2: [1,2] U [22,23], [18.19] U [33, 34] and [27, 27] U [39, 39]. a n c Chapter 2. Background on RNA Secondary 8 Structure i and i' are the borders of band region, j and j' are the borders of [j'.j] band region. We. refer to i and j as the left and the right border of the band respectively. • i n s i d e a b a n d : A region [i.j] is inside a band [ii,«'i] U i\ < i < j < %i or j[ < i < j < j\ is true. if either • b a n d a s s o c i a t e d w i t h c l o s e d r e g i o n : We say that band \i, i'] U [j , j] is associated with closed region [i",j"[ if [''',''•'], and:thus [j',j], are subregions of [/", j"] but are not subregions of any closed child of [i", j" j . For example, in F i g 2.2, the three bands [1, 2] U [22, 23], [18,19] U [33, 34] and [27,27] U [39,39] are associated with closed region [1,39]. 1 • u n p a i r e d bases a s s o c i a t e d w i t h c l o s e d r e g i o n [i,j]: are the unpaired bases in [i,j] but not in any closed region or band region which are subregions of [i,j]. For example, in the structure of F i g . 2.2, the unpaired bases associated with closed region [1,39] are 17, 20, 21, 24-26, 28-32, and 35-38. • base p a i r i n g s a s s o c i a t e d w i t h c l o s e d r e g i o n [i,j]: are the base pairings in [i.j] but not in any closed region or band region which are subregions of [i.,j]. • h a i r p i n l o o p (or h a i r p i n ): Formally, the tuple defines a hairpin loop in a secondary structure if i and j are paired, and [i + 1, j — 1] is an empty region, i.j is called the closing base pair of the hairpin loop. The hairpin marked in F i g . 2.1 contains 4 unpaired bases. • i n t e r n a l l o o p : A n internal loop, sometimes called interior loop, contains two closing base pairs, and all bases between them are unpaired. The tuple (i,i',j',j), with i + 1 < i! < j' < j — 1, defines an internal loop if i.j and i'.j', and [i + — 1] and [j' + l,j — 1] are empty regions. • s t a c k e d l o o p : A stacked loop, also called stacked pair, contains two consecutive base pairs. The tuple (i,j) defines a stacked pair if i.j and (i + l ) . ( j — 1) are in R. A stem or helix is made of consecutive stacked loops. Note that, in fact, a stacked loop is also a special case of an internal loop, with no unpaired bases on either side. • b u l g e l o o p : A bulge loop, or simply bulge, is a special case of an internal loop, which has no unpaired base on one side, and at least one unpaired base on the other side. • s p a n s a b a n d : There are two types of internal loops, stacked loops and bulge loops: those for which the closing base pair, i.j, is not pseudoknotted and those for which i.j is pseudoknotted. In the'latter case, we say that the loop spans a band. Chapter 2. Background on RNA Secondary Structure 9 • m u l t i - b r a n c h e d l o o p : There are two types of multi-branched loops, or multiloops, depending on whether or not they span a band: (1) Let [i.j] be a closed region which is not pseudoknotted, and has at least two (closed region) children, or a pseudoknotted child. Then the unpaired bases and base pairs associated with [i,j] form a multiloop. , (2) Let i.j be a pseudoknotted base pair and i'.j' ^ i.j, where at least one of the (weakly closed) regions [i + l.i' — 1] and [/ -1- 1 ,j — 1] is not empty. Then the unpaired bases and base pairs associated with band region [i,i'\ U [j',j] comprise a multiloop that spans a band. For both types of multiloop, we say that i.j is the closing base pair of the multiloop. For example, in the structure of Fig. 2.2, [2,22] shows a multiloop that spans a band. Note that each closed subregion of [i,j] is called a branch of the corresponding multiloop. • p s e u d o l o o p : Let [i,j] be a pseudoknotted closed region. Then the unpaired bases and base pairs associated with [i.j], together with the closing base pairs of the bands associated with \i,j], are members of a pseudoloop. The base pairs i.bp(i) and bp(j).j are the closing base pairs of the pseudoloop. The pseudoloop is an exterior pseudoloop if region [i.j] is not subregion of any other region. • c l o s e d r e g i o n a s s o c i a t e d w i t h p s e u d o l o o p : We say that closed re- gion [*',/] is associated with, pseudoloop [i,j], if [i',j'] is a closed proper subregion of [i,j] but not a subregion of any closed subregion of [i,j]. For example, in Fig. 2.2, closed regions [3,9] and [10,16] are associated with pseudoloop [1,39] but closed region [11,15] is not associated with pseudoloop [1, 39]. 2.3 Bi-secondary and Density-2 Structures • Bi-secondary s t r u c t u r e s : Witwer et al. [28] introduced a definition of bi-secondary structure", which is a union of two disjoint pseudoknot free secondary structures. The pseudoknotted secondary structures we can handle in our algorithm are a subset of the bi-secondary structures. ;: • d e n s i t y : We define density as follows: Let L be a pseudoloop and i.bp(i) and bp{j).j be the closing base pairs of L. We say a band 'U [Ji,ji] crosses k if i\ < k < ji. Let #B(L, k) be the number of bands associated with L that cross k. Then the density of L is: density(L) = max {#B{L k)) ' i<k<j : (2.1) The density of a structure, R, is maximum density of L over all pseudoloops L of R. We say R is a density-2 structure if the density of R is at Chapter 2. Background on RNA Secondary Structure Figure 2.4: Bi-secondary structure that is not dehsity-2. 10 Chapter 2. Background on RNA Secondary Structure • 11 Figure 2.5: Thefigureshows a density-2 secondary structure for a sequence of length n. Rij is the structure restricted to the region [i,j] (for i < j < n) and Gi-j is that part of /?.,, above the horizontal line. The circle dots show positions of j where R.j is a prefix of a density-2 pseudoloop with respect to G, and the crosses show positions of j where R,,., is not a prefix of a density-2 pseudoloop with respect to G. most 2. Figure 2.3 illustrates density-2 secondary structures. Figure 2.4 shows a bi-secondary structure that is not a density-2 structure. • prefix: Let G,. be a pseudoknot free structure over region Let R;,j be a density-2 structure over region [i, j] containing G-,,j. We say that R^ is a prefix of a density-2 pseudoloop with respect to G,,, if i starts the first (leftmost) band associated with a pseudoloop of R,j , and j is either ( 1. the right border of a closed region associated with the pseudoloop, 2. the right border of the pseudoloop starting at i, 3. the rightmost border of any band associated with /?,;,- — except the first band , or 4. an unpaired base associated with the pseudoloop that is not inside thefirsttwo bauds (and not inside any closed subregion). 2.4 Energy M o d e l Computational methods for predicting the secondary structure of an RNA or DNA molecule are based on models of the free energy of loops. The parameters of these models are driven in part by current understanding of experimentally determined free energies, and in part lay what can be incorporated into an efficient algorithm. The free energy of a loop depends on temperature; throughout we assume that the temperature is fixed. 2.4.1 P s e u d o k n o t free e n e r g y m o d e l Wefirstsummarize the notation used to refer to the free energy of pseudoknot free loops, along with some standard assumptions that are incorporated into loop Chapter 2. Background on RNA Secondary Structure 12 free energy models. We refer to a model that satisfies all of our assumptions as a standard free energy model. This model is somewhat simpler than that underlying MFold and Simfold, but our algorithm can be extended to their more detailed model. • e[-[(i,j): gives the free energy of a hairpin loop closed by i.j; we assume that for all but a small number of cases, e#(i, j) depends only on the length of the loop, and the two paired bases i and j on the loop. • es(*, j)'- gives the free energy of a. stacked pair that consists of i.j and • (* + i).rj-i). • e,; ('i, i'. j', j): gives the free energy of an internal loop or bulge with exterior pair i.j and interior pair i'.j'. nt The free energy of a. multiloop with k branches and u unpaired bases is a + bk + cu, where a, b, c are constants. The free energy of a sequence S with respect to a fixed secondary structure R. is the sum of the free energies of the loops of R. Sometimes when the strand S is fixed, it is convenient to refer simply to the free energy of the structure R. We define the free energy of a strand S to be the minimum free energy of the strand, with respect to all structures R. 2.4.2 Pseudoknotted energy model • BE(i, i',j', j): The total energy of band [i,i'\ U [j',j] is the sum of the energies of its loops. If a band has no loops, i.e. consists of just one base paii', we define its energy to beO. • e Lp(i.j): defines the energy of stacked pairs in a band, and its value is S e {i,j) * 0.83. s • e-ui ,p(i,r,r' j): t } is e ,,,(i,r,r',j) u defines the internal loop that spans a band, and its value * 0.83. We define energy of multiloops that span a band to be similar to pseudoknot free multiloops. The energy of an exterior pseudoloop is calculated as: energy of bands plus Pi,*m.+P „*k, + P *u+P,, where m is the number of the bands, k is the number of closed subregions, u is the number of unpaired bases. If the pseudoknot is inside a, multiloop or a pseudoloop, P„ is replaced-by P or P respectively. Let /?..,;,, be a prefix of a pseudoloop. The energy of R j is the sum of the energies of all loops within R. plus a penalty for each band and each unpaired base in [i.j] associated with the pseudoloop of which R is a prefix. Table 2.1 summarizes the energy constants and functions used in our energy model for pseudoknotted structure. v up sm sp it t tJ 13 Chapter 2. Background on RNA Secondary Structure Name Table 2.1: Energy parameters. Description exterior pseudoloop initiation penalty penalty for introducing pseudoknot inside a multiloop penalty for introducing pseudoknot P,„ inside a pseudoloop band penalty Pi, penalty for unpaired base in P , a pseudoloop penalty for closed subregion Ps inside a pseudoloop energy of a hairpin loop closed by i.j energy of stacked pair closed by i.j es(i,j) energy of stacked pair e«tp(i..j) that spans a band energy of a pseudoknot free ehu(i,r,r'.j) internal loop e i.p('i.,r,r' ,j) energy of internal loop that spans a band • multiloop initiation penalty a multiloop base pair penalty • b penalty for unpaired base in a c multiloop penalty for introducing a multiloop a' that spans a band base pair penalty for a multiloop b' that spans a band penalty for unpaired base in a c> multiloop that spans a band Ps,n UP p m Value (Kcal/mol) 9.6 15.0 15.0 0.2 0.1 0.1 e (i,j) x 0.83 s e t{i,r,r',j) m 3.4 0.4 0 3.4 0.4 0 x 0.83 14 Chapter 3 Algorithm 3.1 A Useful Property of Density-2 Structures As will become clearer later, the reason that the HFold algorithm works for density-2 structures is because of the following lemmas, which are key for efficient decomposition of energies in the recurrences. Roughly, the lemmas show how to calculate the band borders for a given region. L e m m a 1 LetG andG' be disjoint, pseudoknot free, secondary structures, such that Gl)G' is a density-2 secondary structure, and let i, j be the start and end of a pseudoloop of G U G'. Let, I, £ [i + l,j] be such that J. I is paired in G' (but, not in G), 2. I is not in any closed proper subregion of [i,j] 'with respect to GU G', and 3. 3r : i < r < I. < bp (r) c < j. Let &(,: () 6' (u) = = niin{fc|?' < k < I < bpc(k)}, and im\x{k\i < k < I < bp (k)}. G Then, the structure G U C ' contains a band with outer base pair b(n).bpc(b(i.i)) and inner base pair b'^ - ^.bpc^b'^ j) . ( Note that restriction (3) ensures that b^^ and b'^ ^ are well defined. The bottom left part of Fig. 2.1 illustrates Lemma 1, showing the borders of the band whose arcs cross the base pair involving base I = 14. If [i,j] is the region [2, 30], then b(2,i4) = 2 and &' . ) = 6. (2 14 Proof Throughout the proof we assume that I < bpc(l), but the proof can easily be modified to handle the case when bpc(l) < I by substituting bpa>(l).l instead of l.bpc(l) everywhere in the proof. Since i is the start of a pseudoloop and j is the end of the pseudoloop, [i.j] must be a pseudoknotted closed region of G U G'. Restriction (1) implies hpG>(l) € since if it is not, then [i.j] is not a closed region of G U G'. We claim that for any r satisfying restriction (3), l.bpc(l) crosses r.bpc(r). Chapter 3. Algorithm 15 ULbpcH) does not cross r.bp (r), then it must be that i < r < I < bpc{l) < bpc(r). Also, we must have one of thefollowingcases: c • l.-bpc(l) does not cross another base pair. Then region [l.bpc(l)[ is a closed subregion of [i.j] with respect to GUG', and thus, / is in a closed subregion of with respect to G U G', which is a contradiction. • l.bpc'(l) crosses another base pair, say m.bpc(m). Then I.bpc (I) is in a band associated with a pseudoloop. We claim that, this pseudoloop is a subregion of [r. bpc('»')]. Otherwise, it must also be that r.bpc(r) is in a band associated with the pseudoloop. Moreover, the bands containing r.bpc{r). m.bpc{m) and l.bpc(l) must all be distinct, based on the definition of band and the fact: that; I.bpc (I.) is in [r,6pc? (»•)]> and m,.bpc(in) G G and thus, cannot cross r.bpc{r). A line drawn vertically at position m or at position / must cross all three bands. This is because m.bpc(m) crosses l.bpc(l) which is in [r, bpc(r)]. Therefore, G U G' must have density 3, contradiction. Therefore, our assumption is incorrect and I.bpc {I) crosses r.bpc (''')• Let b\.bpo{b\), and &2-^Pc(&2) be the outer and the inner base pairs of the band containing r.bpc^r) that l.bpc{l) crosses. We have i < b\ < b-2 < I < bp (b ) < bp {b,) < j. Now we prove that b(u) = b\. Since i < b\ < I < bpc (I) it must be that 6(v,j) < bi, by the definition of If &(,;;) < b\, then we have &(.,;() < 61 < bpc{b-\) < bpc(b(i,t)), since C is pseudoknot free. By same argument as used above, we conclude that 6(io-&Pc(&(».!)) crosses I.bpc (I). Thus, &(u).&pc(fy»,n) is the outer base pair of the band, which is a contradiction. Thus our assumption of &(,-,/) / 61 does not hold, and b^i) = bi. Similarly we can show that b'^ ^ = I>2- • G 2 c L e m m a 2 LetG a.ndG' be disjoint, pseudoknot free, secondary structures, such that, G U G' is a density-2 secondary structure. Let I be paired in G' (but not in G) and let [i,j] be a region such that I G [i.j] and bpc (I) £ [i,j]- Let b(i,i) = kD = d,:i) = (l,i) = b B B mm{k,\i, < k <l < bp (k)} U {co}, G max{/c|i < k < I < bp {k)} U {-1}, G nvAx{bp {k)\k < I < bp {k) < j} U {-1}, and G a mm{bp (k)\k < I < bp (k) < j} U {00}. G G Then, if [i,j] is a weakly closed region of G, then either all four of these quantities have finite, positive values, in which case the structure G U G' contains a band with, outer base pair and inner base pair b'^. ^.B' (i.e. bp(b(.;j-)) = -5((, ) and bp(b'^ /)) = .-^), or none of the four of these quantities have finite, positive values, in which case I is not covered by a base pair' of G y . a 7 16 Chapter 3. Algorithm Note that each term may be either infinity or -1 to account for the cases when there is no such band border. . For region [i,j] = [1,23], which is weakly closed in G, in Fig-. 2.2, we have 6(1,23) = 1, 6(1, )' = 23 2 a l K l S (l,23) = 2 2 a I K l B (l,23) = 2 3 ' Proof We first show that if [i,j] is a weakly closed region of G, then either all four of these quantities havefinite,positive values, or none of the four of these quantities have finite, positive values. Throughout the proof we assume that I < bpc{l), but the proof can easily be modified to handle the case when bpc(l) < L Let us consider Cover(G,/) = (r.bpc('»•))• Note that Cover(G,/) may be ( — 1,-1) if I is not covered in G. Based on the definition of weakly closed region, either Cover(GJ) e [i,j] or Cover(G,/) £ Therefore, we either have r < i < I < j < bpc{r), in which case, none of the four quantities have finite positive values, or we have i < r < I < bpci'i) < j , in which case r.bpc(r) crosses l.bpc(l), and therefore all four quantities have finite positive values. Now we claim when is weakly closed, Cover(G,Z) = (r,bpc(r)), and r.bpc(r) € the structure G U G' contains a band with outer base pair b(i.i)-B(i,i) hmer base pair b'^^.B'^.y To prove this claim wefirstprove that &p (6(»,n) = B ij), and bp (b' ) =B[ . Assume 6p (6(ij)) + B<i,j)- Then, since both 6 ).&p (& . ) and bp {B( )).B j) are in G, then they cannot cross, and.one must be nested in the other. If a n d G { c (iJ) u ) G (M b{i,i).bpc(b{i,i)) is in [bpc{B i )).B \, { d G ( M) c then we have i < bp {B ) {Lj) c l:j <b {ltj) {i < I. {iJ) which is contradiction to the definition of &(;,()• Therefore b( iybpG(b(ij)) cannot be in [bpc(B(i j)), By.,,-j]. With a similar argument we can show that bpc{B(i,j)).Ba,j) cannot be in [6(,;./.), bpc;(&(,:.;))]• Thus, we must have 6pc(6(i,i)) = Bn.j)- The proof to show that bpc(b'^ ) = B' .j is similar to what came above. L n n Now we prove that a band containing b^jyByj^, also contains b'^^.B'^^. Assume the bands containing b^^.B^j) and b'^^.B'^ ^ are distinct. Then, a vertical line drawn, at / crosses 3 distinct bands. Therefore the structure of G U G' has density 3, contradiction. Thus, one band contains both b^jyB^^ '(i.,i)- '(i,:i)- mdb B Based on the definition of 6'^ , there is no other base pair m.bpcirn) such that 6' < m < I < bp (m) < B' Therefore, b'u,jyB[i.j) base pair of the band. in G, n i (;J) G n j s t n e i n n e r y " We can easily show that if b^j).B(i.j) is not the outer base pair of the band, then we must have a base pair m.bpc(m) € such that m < 6(.,;) < J5(I, ) < bpc(m). It is clear that m.bpc{m) must be contained in the same band as 6(ii).'B(i.,-), Thus, contradiction. Therefore 6(»,j).B(j,,-) is the outer base pair of (| 7 Chapter 3. Algorithm 17 Figure 3.1: Illustration of band borders in Lemma 3 and Lemma 4 the band. • • . Lemma 3 Let.G and G' be disjoint, pseudoknotfree,secondary structures, such that G U G' is a density-2 secondary structure. Let I be paired in G' (but. not in G) and let [i.l] be a region such, that I, 6 [i.l] and bpc{l) tfz Let, 6(.,;j) = b' = {ii) min{fc|i < k < / < bpc{k)} U {oo}, and max{A:|z < k < I < bp {k)} U {-1}. c Then, either both, or neither of 6(*j) and b'^. ^ have finite positive values. For example, for region [17,19] in Fig. 2.2, we have 6(17,19) = 18,. 6J J7 = 19. Proof Let us consider Cover(G,/) = (r. bp (?'))• Note that Cover(GJ) may be ( — 1,-1) if / is not covered in G. If r < i, then there must be no base pair k such that i < k < I. < bpc(k). Otherwise, since both r.bpc(r) and k.bpc(k) are in G, we must have r < k < I, < bpc{k) < bpc{r). Thus,' we must have Cover(G,/) = (k,bpc(k,)) for one such k, which is contradiction. Therefore, c £>(,:,() = 00 and 6|. ^ = — 1. ; Let us assume i < r, Cover(G,/) = {r,bpc{r)), I = j and bpc{l) = i — 1. Then, r.bpa(r) must cross l.bpc>{l), since i < r and based on the definition of cover, we must have r < I < bpc(r). We claim that both b(, ) and 6|. ^ have finite positive values. Moreover, b'^ ^ = r. Assume 6'^^' ^ r. Then, 6|. ^.bpcib^ <)) must be in [r,bpc{r)} (based on the definition of b'^^). Therefore we have r < b'^ ^ < I < bpc(b'^ / < bp {r), contradiction. Thus, 6{ = r. Now, clearly b(.,;,j) ^ 00. We claim that if ;( ; ; G 4n 18 Chapter 3. Algorithm '-'(•;,() 6(,;()) then b'^ ^.bpcib'^ j) is in [b^jybpcib^^)}. Otherwise, the bands containing &(•;./)-VG(&(U))> b'^ ^.bpdb'^ Q) and l.bpc(l) must be distinct. A vertical line drawn vertically at position I crosses 3 bands, thus G U G" has density 3, contradiction. • t Lemma 4 LetG and G' be disjoint, pseudoknot free, secondary structures, such that G U G" is a density-2 secondary structure. Let I be paired in G' (but not, in G) and, let [l,j] be a region such that I £ [l,j] and bpo>(l) Let £((.,•) B{ Lj) = nvA\{bpc(k)\k < I < bp {k) < j) U {-1}, and = mm{bpc(k)\k<l<bpc(k,)<j}U{oo}. G Then, eitlier both or neither of 2?(f j) and By .j havefinitepositive values. For example, for region [9,11] in Fig. 2.2, we have £?| j = S(g.ii) = 9. gn Proof Proof is very similar to.proof of Lemma 3. 3.2 • High Level Description of HFold HFold is a method for prediction of pseudoknotted RNA secondary structure that integrates MFE-based prediction withfoldingpathway considerations in a novel way. The method is motivated by the hypothesis that pseudoknotted RNA secondary structures form in a hierarchical fashion, with a pseudoknot free structureformingfirst and additional pseudoknot-forming base pairs that are added later (possibly with minor rearrangements of the initial pseudoknot free structure) [15, 25]. HFold works by.taking as input a sequence of bases, S, and a pseudoknot free secondary structure G, and finding a second pseudoknot free structure G' which minimizes the energy of G U G' (i.e. HFold(5, G) = G U G"). Like MFE methods, HFold handles only a restricted class of structures, but this class is quite general (density-2 structures). The method has two potential advantages over MFE-based secondary structure predication. First, HFold's hierarchical folding principle may model biological folding just as well, or better, than does the MFE structure formation hypothesis, at least on biological structures. Second, HFold has 0(n ) running time, making it significantly more efficient than MFE-based methods that require fi(n ) time or more to predict biologically-important pseudoknotted structures. Here we outline our hierarchical fold algorithm. We first briefly review key ideas of dynamic programming algorithm which predicts the energy of the MFE pseudoknot free secondary structure for a fixed sequence S = s i S 2 . . . s „ [18], Let W, j be the energy of the MFE pseudoknot free secondary structure for the subsequence .s,.s,: i ...Sj. If * > j, Wi.j = 0, since the subsequence is empty. Otherwise, it must either be that i.j is a base pair in the MFE structure for 3 5 + Chapter 3. Algorithm 19 Si.. .s.j. or that the MFE structure can be decomposed into two independent subparts. These two cases correspond to thefirsttwo rows of the recurrence for Wij below. Thus, Wij is given by the following recurrence: Wi, = min { ^ V ' ' ... . TT where V is the free energy of the MFE structure for s; . . . s.j that contains i.j. The recurrence for V; , can in turn be expressed in terms of the energies of the hairpin loop {en{i.j)).. an internal loop, or a multiloop closed by i.j. and multiloops. If * > j, V. .j is set to be oo. Otherwise, i.j closes a hairpin, an internal loop, or a multiloop in the MFE structure for ... .s.j. Thus V^j can be expressed as the minimum of the free energies attainable in three cases: iti L f e (i,j), V(i,j) = min < min,.,./ e»„e(i,r,r »j) + V ', { VM' H / rr itj e,: (v'\ where en{i,j) and r', j) are as given in Table 2.1, and VMij is the rlt energy of a MFE structure for ,s,. . . s.j in which i.j closes a multiloop. We extend the definition of W for the hierarchical folding algorithm as follows. Let G be a given pseudoknot free structure for S. If some arc of G covers i or j, then W j = oo. If i > j, then W j = 0. Otherwise we define Wij to be the energy of the MFE secondary structure G,;, U GC for the strand Si... s.j, taken over all choices of G'^ which is pseudoknot free, disjoint from G,; , and such that G,;.; U G'- is density-2. In this case, W^j satisfies the following recurrence: ; iti it t ( Wij = mm < mmi< W + [ WMi). : - / r<j W i)j, itr fr+ where the first two cases are the same as for pseudoknot free cases and the last case is specific to pseudoknotted structures. P is the pseudoknot initiation penalty, given in Table 2.1. The third row of this recurrence accounts for the case when the optimal secondary structure G,;J U (?<• includes pseudoknotted base pairs and cannot be partitioned into two independent substructures for two regions [i, r] and [r+1, j], for some r. Such a structure must contain a chain of two or more successivelyoverlapping bands, which must alternate between G,;,- and G^-, possibly with nested substructures interspersed throughout. Figure 3.2 provides an example, and shows how the recurrence for WMB, given below, unwinds when the example structure is the MFE structure. In order to calculate the energies of substructures in such a structure' in the recurrences, we use three additional terms: BE, VP, and WI. Roughly, these account for energies of bands spanned by base pairs of G-j, regions enclosed by pseudoknotted base pairs of GC (excluding part of those regions that are within s Chapter 3. 20 Algorithm Figure 3.2: Illustration of how the WMB recurrence unwinds, to calculate WMB^-j. Arcs above the horizontal line from i to j represent base pairs of Gi-j, and arcs below the line represent base pairs of G\y Case (a) of the WMB recurrence handles the overall.structure whose energy is WMB^.j, with I = l\. with terms to account for energies of the right upper band {BE) and right lower closed subregion ( WI^ ) {i, (b'., )-l)) well the remaining prefix (WMB ,,^). The term WMB' is handled by case (a) of the WMB' recurrence, with / = l,o and terms to account for the lower right substructure labeled VP[ , the upper left band {BE), and the remaining "prefix" of the overall structure ( WMB' _ ). WMB' _ is then handled by case (b) of the WMB' recurrence, with I = and terms to account for WI( i)and WMB' . Finally, the WMB',^ term is handled by end case (c) of the WMB recurrence. a s 1 + ] t a s vc 1 lh 2 ( l ; (h l} i(h x) l3+ il3 1 a band of <?»,•), and weakly closed subregions, respectively. Further details on different recurrences can be found in Appendix A. We now give the recurrence for WMBi,j. As the base case, we set WMB-,^ = +oo if v. > j, since if i > j the structure is empty, and thus cannot be pseudoknotted. Otherwise, there are two cases, depending on whether j is paired in G or not. In case (a), j is paired in G. Then, in the MFE structure, some base / with bp(j) < I, < j must be paired in G', causing bp(j).j to be pseudoknotted. We minimize the energy over all possible choices of / (note that / must be unpaired in G, since it will be paired in G', which is disjoint from G). By Lemma 1, once / is fixed, the inner base pair of the band whose outer base pair is bp(j).j is also determined. The Pi, + BE term in case (a) of.the recurrence accounts for the energy of the band, a IWterm accounts for a weakly closed region that is in the band, and the remaining energy is represented by the WMB' term. In case (b), j is not paired in C, and, the recurrence is unwound by moving directly to a WMB term. Thus, we have: 1 21 Chapter 3. Algorithm I + ^ +H'/ i l , „ ( 1 + 1 ) ( W ! ( i i i | i ) ). ), 1 ) if 0 <ftp(j)< j ,(6) WMB',.! Complementing case (a) of the WMB recurrence, WMB' handles the case that the rightmost band is.not in G, but is part of the structure G'. In the recurrence for WMB'. case (a) is the complex case, accounting for the energy of the region spanned by therightmosttwo bands using the 1P\,, VP, and BE terms, and recursively calling WMB . The band borders in WMB' cases are determined using Lemmas 3 and 4. Case (b) is called when one iteration of WMBi.j or WMB'; • case (a) is done and the rightmost substructure of the overall "prefix" up to position j is a weakly closed region. Note that Wlij — -f oo when cover (i) ^ cover (j), which avoids entering case (b) as the first iteration of WMB'. Cases (c) and (d) are end cases, where only one or two bands need to be accounted for, respectively and so no recursive call to WMB is made. Thus we have: 1 1 WMB',., = min ((a)2P„ + jnm^ (BE Hu)M{i ;»Cu.i.:rv:,I(C -,l) i;J + WMB' _ + L{l (b) min i<l<:i VP ), l) i{bpc(j) = 0 Lj {WMB' i+ ii WI ), u+1)tj if bp (j) < j G if 0 = bpcij) < bp {i) G Figure 3.3 shows how all the recurrences call each other Figure 3.3: Visual illustration of recurrences in HFold. 23 Chapter 4 Results The goals of our analyses were sevenfold: • first, a baseline test to see if HFold finds pseudoknots when presented with the true pseudoknot free secondary structure for a sequence; • second, to assess the accuracy of HFold, when using the predicted MFE pseudoknot free structure for a sequence as input; • third, to see whether by taking the best HFold output, taken over several runs when suboptimal pseudoknot free structures are used as input, it is possible to obtain significant improvements in accuracy over simply running HFold with MFE structure as input; • fourth, to see whether by taking the minimum free energy HFold output, taken over several runs when suboptimal pseudoknot free structures are used as input, it is possible to obtain significant improvements in accuracy over simply running'HFold with MFE structure as input; .. • fifth, to see whether by taking a substructure of the MFE structure in which bases are paired with high confidence we can gain any significant improvement in accuracy; • sixth, to see whether by .taking the MFE structure and peeling away the stems from inside and/or outside stems that close hairpin loops it is possible to gain better prediction accuracy than with MFE alone; • and finally to compare the energy of the true structure, calculated with a separate method using the same energy model, with the energy of the structure output by HFold, when it is presented with the true pseudoknot free structure. We implemented the HFold algorithm in C++, using the energy model described in Section 2.4. Our implementation builds on the SimFold algorithm of Andronescu [3]. All experiments were run on a SUSE Linux 10.1 with two Intel 2.40 GHz processors and 1 GB of RAM. We performed several analyses on a large dataset. In the following sections we present our thorough analyses. 24 Chapter 4. Results 4.1 Accuracy In our analyses, we measure the accuracy of a predicted structure R for sequence S with true structure T as follows. Each base (position) i of S gets a score of 1 if bpa(i) = bpri't), and gets a score of 0 otherwise. The accuracy is the total score over all bases of S, divided by the length of the sequence. Thus, the accuracy lies between 0 and I, with 1 indicating perfect accuracy. 4.2 Dataset Our dataset of 70 sequences includes 7 with pseudoknot free and 63 with pseudoknotted structures. These 63 structures include 6 sequences with kissing interactions, 6 with H-type pseudoknots with nested structures and 1 sequence with density-3 structure. Tables 4.1, 4.2 and 4.3 show the sequences used in this work. The length of these sequences varies from 26 to 214 bases. The sequences include the following types: viral ribosomal frame shifting, viral ribosomal readthrough, sequences with high affinity to HIV-l-RT, mRNA, tmRNA, viral 3 UTR, ribozoines, .viral RNA, signal recognition particle RNA, small nuclear RNA and tRNA [5, 6, 9, 23, 27]. Since there are no known energy parameters for non-canonical base pairs arid loops with less than 3 unpaired bases inside in our energy model, we removed the non-canonical base pairings and the loops with less than 3 unpaired bases from the structures of our dataset. We partitioned each pseudoknotted structure in our dataset into two pseudoknot free structures G/,;, and G ,u- Structure G|„; was created by removing the minimum number of base pairs needed to get a pseudoknot free structure from the input pseudoknotted structure, and structure G ,u consists of the removed base pairs. We call each of these pseudoknot free structures a true pseudoknot free secondary structure for the corresponding sequence. ; r siiu H smu 4.3 Accuracy when input is the true pseudoknot free structure We first test the accuracy of HFold, when presented with a sequence S whose true structure is pseudoknotted, and the corresponding true pseudoknot free secondary structures Gug- The average accuracy of the Gbi structures obtained from pseudoknotted structures in our dataset is 77%. Figure 4.1 plots the accuracy of HFold(5, Gf„; ) versus the accuracy of the G|,; structures, for each sequence in our dataset. The data points in the lower triangle of Fig. .4.1 show the cases that HFold improved the accuracy of the original structure and the points on the upper triangle show the cases in which HFold added incorrect base pairings and thus decreased the accuracy of the final structure in' comparison to the accuracy of the original structure. The average accuracy of the structure obtained after running HFold, namely HFold(5, G|„; ), is 77% averaged over 63 a a a fl 25 Chapter 4. Results Sequence IDs and lengths Sequence type Ref. Homo sapiens(164). Arabidopsis thaliana(164) snRNA [9] RR4640(55), RK5280(67), RM8530(77), RG6241(66), RA1662(76) tRNA [23] Table 4.1: Pseudoknot free sequences used for secondary structure prediction. In order, the columns provide (1) sequence ID, as found in the database or paper from which we obtained the sequence, and their length in parenthesis; (2) type of sequence; and (3) the reference from which the sequence was obtained. structures (not including the pseudoknot free structures), with 6 cases achieving perfect accuracy. Since G'i of the pseudoknot free structures is the pseudoknot free structures themselves, the input structures to HFold include 7 cases with perfect accuracy (presented in Fig. 4.1 with circle); from those cases 2 remain the same after running HFold (i.e. HFold(S, C;,;,,) = G|,; ) but some extra base pairings get added to the rest of G;„; for the pseudoknot free structures, such that the average accuracy after running HFold on the pseudoknot free structures is 92%. Note that HFold does not add any pseudoknots to the pseudoknot free structures. Figure 4.2 shows an example of before and after structures for a pseudoknot free structure. There are 40 cases presented in Fig. 4.1 marked with 'x whose accuracy is between 44% and 80%. Of those cases, 34 are the result of the high penalty for introducing a pseudoknot. In all of these 34 cases, the expected output structures need addition of stems of size less than or equal to 7 bases, and the pseudoknot initiation penalty is much higher than the free energy of the stem. That is why HFold does not add these stems and thus the accuracy is not perfect. In 16 of these cases HFold adds extra base pairings to the input structure but in the wrong places and in the rest of the cases it does not add anything. Figure 4.3 shows an example of the before and after structures for one of these cases. There are 3 cases among the above mentioned 40 cases for which we removed non-canonical base pairs and loops with less than 3 unpaired bases inside; in 2 of those cases, HFold achieves higher accuracy (in one case it achieves perfect accuracy) when presented with the original G/„,, (that is, the Gi, structure with non-canonical base pairings and small loops in place), while in the 3rd case the accuracy of the HFold prediction does not change when given the original G^ n;l H s : i;l g 26 Chapter 4. Results 0.4 0.5 0.6 0.7 0.8 Accuracy after HFold 0.9 1 Figure 4.1: Accuracy of HFold when the input is the true pseudoknot free secondary structure (Gu ), versus the accuracy of Gu structures. The horizontal axis represents the accuracy of structures predicted by HFold while the vertical line represents the accuracy of the input structures. Pseudoknotted and pseudoknot free structures are presented by V and 'O', respectively. B lt Chapter 4. 27 Results c s. •y '.G-G, "C-G ..-C-C ..t- u C-G-U--C G'-C'- A" -G 'uU' ' ."f A G S-i-ti-t-"' !-J5 i .u e-u-G 6 60 'G.c A.u •~G.U Figure 4.2: A pseudoknot free structure given as input to HFold (top) and the corresponding HFold pseudoknot free result (bottom). The overall structure remains pseudoknot free alter HFold, but more base pairs are added to the loops. For example, the large internal loop starting at 1 and ending at 164 on the top figure is broken into two smaller loops in the bottom figure. Similarly the loops of the branch starting at 51 and ending at index 90 are broken into smaller loops. 28 Chapter 4. Results 20 V- -u A cH 1 0 . / N .u- A 40 42 10 G"' c-c- ?42 ,20 A S-G u' 9 U-G-a 40 30 -C • A- 10, U •g \ '~"A"A. r; G ^ c G ' J G' ,G'" \ 40 42 Figure 4.3: G'/,,;, structure given as input to HFold (top), the corresponding HFold result (center) and the true pseudoknotted structure (bottom). HFold adds 3 pseudoknot free stacked base pairs at 2.40, 3.39, and 4.38, while the true pseudoknotted structure has pseudoknotted base pairs at 18.36, 19.35, and 20.34. y Chapter 4. Results 29 as input. In the remaining 3 cases of the above mentioned 40 cases, HFold's prediction is pseudoknotted but the pseudoknotted structure either has more crossing base pairs than the original structure or it has pseudoknots in different places. High pseudoknot initiation penalty may be the cause behind this problem except when HFold is adding pseudoknot free base pairs inside or outside the loops (similar to the structure of Fig. 4.2). For example if the true structure includes several H-type pseudoknots, HFold predicts interleaved bands instead. This is because for every separate pseudoknot, the energy value is penalized by one pseudoknot initiation penalty, while for a big interleaved pseudoloop it is only penalized once. In addition, extra bands stabilize the energy value. Figure 4.4 shows an example of before and after structures for one of these cases. In all the cases where HFold's prediction is different from the true structure, the minimum free energy of HFold's result is lower than the energy of the true structure. We also ran HFold, when presented with a sequence S and the corresponding true pseudoknot free secondary structure C7.,„,„H as its input. Figure 4.5 plots the accuracy of HFold(S,c7.,„,„«) versus the accuracy of the G.,„,.„/j structures. In this case, the average accuracy of the G ,u structures not including the pseudoknot free structures is 64%. The average accuracy of the HFold output, HFold(5, G „„.(/), not including the pseudoknot free structures is 80%, with 19 cases achieving perfect accuracy. Interestingly, when the better of the two accuracies obtained by using HFold on Giyiij and G „u is taken for each sequence, the average accuracy, over all sequences is 88% including the pseudoknot free structures and 87% not including them. mtu SI sln 4.4 Accuracy when input is the M F E pseudoknot free structure More realistically, we need to be able to predict the secondary structure of a sequence without knowing the true pseudoknot free secondary structure for the sequence. A natural approach is to run HFold on the predicted MFE secondary structure for the sequence. We use SimFold [3] to produce the pseudoknot free secondary structure G, which becomes the input to HFold. Figure 4.6 plots the accuracy for each data sequence versus the accuracy of the input structures. The average accuracy is 59%, which is exactly the same as the average accuracy of the MFE pseudoknot free structures. Figure 4.6 shows 14 points with accuracy between 18% to 31%. In all of these cases, the MFE pseudoknot free structures bear little resemblance to the true structures, and HFold does not add any pseudoknotted base pairs to them. In 4 out of these 14 cases, the low accuracy of the MFE pseudoknot free structures is explained by stems that are "shifts" of stems in the true structure. In all those cases HFold does not add any base pairs to the input structure. Figure 30 Chapter 4. Results -•--A U' C A G^.UCG ,.' A ' 13: u. f; . A " QS- ' v u G G" TC C. t' . ..5 IJ A u " 0 ) 7. r r v >' u' •no C "G i ~~\u r.< / r'' A >-<- r- 'n ' ' u r 'C u u "A" ' u. A' A U" U 6-*' "•A u. s-c -G... C V, •' 130 •'A.u 214 C -G Figure 4.4: Predicted pseudoknotted structure (top) and the true pseudoknotted structure (bottom). HFold predicts interleaved bands, while the true pseudoknotted structure has 5 separate H-type pseudoknots. Chapter 0.2 0.3 0.4 4. 31 Results 0.5 0.6 0.7 Accuracy of HFold on G 0.8 0.9 1 Figure 4.5: Accuracy of HFold when the input is the true pseudoknot free secondary structure (G ,,.u) (horizontal axis), versus the accuracy of G ,M structures (vertical axis). Pseudoknotted and pseudoknot free structures are presented by 'x' and 'O', respectively. ml sm Chapter 4. Results 32 Accuracy after HFold Figure 4.6: Accuracy of HFold when the input is the MFE structure (horizontal axis), versus the accuracy of the MFE structures (vertical axis). Pseudoknotted and pseudoknot free structures are presented by 'x' and 'O', respectively. 33 Chapter 4. Results 20 .A-C-A*. 10 'C-G-C-C C '.6 'c C G c: 26 ~"26 Figure 4.7: MFE structure (top), true pseudoknotted structure (bottom). Tire true pseudoknotted structure has 4 stacking base pairs at 1.14, 2.13, 3.12, 4.11 and another stem of size 4 at 6.26, 7.25, 8.24, and 9.23, while in the MFE structure the first stem is a shift of the true structure (3.15, 4.14, 5.13, and. 6.12). In this cases HFold does not add any base pairs to the input structure. 4.7 shows an example of one of those cases. There are"only two data points presented in Fig. 4.6 that show significant change of accuracy before and after running HFold. In both cases, HFold predicts pseudoknots. In one case the predicted pseudoknotted structure is very similar to the true structure since the MFE structure for the corresponding sequence is very similar to the G\, structure of the sequence. In the other case, the true pseudoknotted structure includes one H-type pseudoknot with only 3 pseudoknotted base pairs, but because of the high pseudoknot initiation penalty, HFold predicts one pseudoloop of 3 interleaved bands, thus lowering the accuracy. ig 4.5 Best accuracy, taken over suboptimal folds Since HFold does not improve accuracy, on average, when using the MFE pseudoknot free predictions as the input, we also measured the best accuracy obtainable using HFold with suboptimal structures as input. For- each sequence in our dataset, we used SimFold to calculate the 25 lowestenergy structures for that sequence. (We chose the 25 lowest-energy structures because, when we compared the 1000 lowest-energy structures from SimFold with the true pseudoknot free structures for each sequence in our dataset, in c Chapter 4. Results 34 Best accuracy of HFold over the first 25 suboptimals Figure 4.8: Accuracy of HFold on the best-of-25 (horizontal axis), versus the accuracy of the MFE structures (vertical axis). Pseudoknotted and pseudoknot free structures are presented by 'x' and 0', respectively. : only 39 cases was the true structure found among these 1000 structures and in 29 of the 39 cases the true structure was among the 25 lowest-energy structures.) We'then ran HFold on all 25 structures, and calculated the accuracy of the HFold output on each run. We did the same with the 10 lowest energy structures. Then we took the best accuracy among the 10 lowest-energy structures, and the best accuracy among the 25 lowest-energy structures. When averaged over all sequences, the best-of-10 accuracy is 70% with 4 cases with perfect accuracy and the best-of-25 is 74% with 8 cases with perfect accuracy. These accuracies are a significant improvement over the 59% average accuracy obtained using the MFE structures. Figure 4.8 shows the accuracy results of the best-of-25 versus the accuracy of the MFE structures. As shown in Fig. 4.8, 48 data points are in the lower triangle, showing improvement over the MFE structures. In many cases the improvement is simply due to the fact that the suboptimal structure is close to G (,;.,,. In 20 of the 48 cases, HFold adds nothing; in 15 out of 48 cases, it adds 1 or 2 pseudoknot Chapter 4. Results 35 Best accuracy of HFold over'the first 25 suboptimals Figure 4.9: Accuracy of HFold on the best-of-25, versus the accuracy of the best-of-10. Pseudoknotted and pseudoknot free structures are presented by 'x' and O , respectively. l ; free base pairs; but in the remaining 13 cases it perfectly identifies G„„ u, from which in 5 cases HFold adds extra base pairings, thus, not achieving 100% accuracy. Figure 4.9 shows the improvements achieved using best-of-25 versus best-of10. As shown in Fig. 4.9, there are only 10 data points in the lower triangle that show significant improvement on best-of-25 versus best-of-10, and the rest of the cases are largely similar. Since the difference between the average best accuracies of best-of-25 and best-of-10 is only 4%, but the number of structures is considerably lower and the improvement over accuracy of the MFE structures is significant even for best-of-10, biologists may prefer to obtain 10 lowest energy structures instead of just the MFE structure to get better prediction. la Chapter 0.1 4. R.esults 36 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 • Accuracy of HFold on min energy over the first 25 suboptimals Figure 4.10: Accuracy of HFold on the minimum energy structure over the first 25 suboptimals;. versus the accuracy of HFold on the MFE structure. Pseudoknotted and pseudoknot free structures are presented by 'x' and ' O ' , respectively. 4.6 Accuracy of the lowest energy structure, taken over suboptimal folds In this section, similar to Section 4.5, for each sequence we get the first 25 suboptimal structures, and then we run HFold on those structures. However, in this case we choose the final structure with the lowest energy and compare its accuracy with the accuracy of HFold on the MFE structure for the same sequence. Figure 4.10 shows the results for this case. There are 9 data points in Fig. 4.10 not on the diagonal line. One of these data points is in the lower triangle, showing significant improvement over the accuracy of the MFE structure. In this case, the input structure is very similar to the corresponding Gn structure but has one extra loop, thus HFold correctly identifies the corresponding G „,„u structure and improves the accuracy of the overall structure. For 5 of the 8 data points in the upper triangle, HFold correctly predicts the u s Chapter 4. Results 37 corresponding G u structure but adds more pseudoknotted or pseudoknot free base pairs to lower the energy. An example of one of those cases is presented in Fig. 4.11. There are some stems in some of the input structures that are "shifts" of stems in the true structure; these shifts are also causes of low accuracies. In those cases HFold does not add any base pairs to the structure. There is not much difference in the "energy" values of the predicted structures with lowest energy over the first 25 suboptimals versus the energy of the predicted structure when input structure is MFE, but some of the structures in the latter case are pseudoknotted, and thus have more similar structure to the true structure than MFE. sina 4.7 Peeling Techniques One of the main problems with the MFE structures is that they typically contain long stems and short loops, making it difficult for HFold to add enough pseudoknotted base pairs into the small regions. One possible solution to obtain better input structures is to peel away the steins. This approach is also motivated by the process of hierarchical folding, where it has been shown that, upon formation of a pseudoknot, some base pairs at the ends of stems in the initially formed pseudoknot free structures may break [25]. For our testing purposes, we tried the following cases: 1. removing 1-3 base pairs from inside the stems that close hairpin loops (3 cases); 2. removing 1-3 base pairs from outside the stems that close hairpin loops (3 cases); 3. removing 1-3 base pairs from both inside and outside the stems that close hairpin loops (3 cases); 4. removing stems of size m when 1 < m < 3 (3 cases), and 5. peeling enough.base pairs from inside hairpin loops so that there are at least 6 unpaired bases inside each loop (1 case); 4.7.1 Accuracy when input is the M F E pseudoknot free structure with 1-3 base pairs removed from inside or outside the stems that close hairpin loops For our first peeling technique, we removed 1-3 base pairs from inside each stem that closes a hairpin loop of the MFE structures and ran HFold on the resulting structures. Figures 4.12 shows the accuracy of the MFE structures when 2 base pairs are removed from inside each stem that closes a hairpin loop versus the accuracy of the MFE structures themselves, while Fig. 4.13 shows the accuracy of the predicted structure when the input structure is peeled versus the accuracy 38 Chapter 4. Results y ft -„ ^ • ,.<- y K £,-V - ,C - c ^ . t A A l j 'A, t i A ..U ,. A-110 u 6 A L b b-c-c-e-c-A-" ' ' i j - U - UA 4- A ii • G-A'U.IJ.l u• G U A A-1 C-C-C-y-l-Ht.y-G-c-u' fl.-G •i;-A-U~^--A-^-A-U-^-G.C-A-.. •U-A-G-J.C-C'A-i-G-C-i-A 'A *• H - H - M - U .C • U - C - C - G - C - A U -t5 C G - G " ~ U i-A -A-<3 -fi-C-d • ' , '-. no' J .0 U.A il.A Li . A Figure 4.11: The pseudoknotted structure predicted by HFold (top), the corresponding MFE structure (center) and the true structure (bottom). Although HFold correctly predicts parts of the pseudoknot in the structure on the top, the accuracy of the structure predicted by HFold is lower than the accuracy of the corresponding MFE structure, because there are some extra pseudoknotted base pairs predicted by HFold. Note that the MFE structure has all the correct pseudoknot free sub-structures (except the stem closing at 1.114). High pseudoknot initiation penalty, and low band penalty are the main causes for such predictions by HFold. Chapter 4. 39 Results of the MFE structures. Since the rest of the figures are similar we only present a detailed discussion on this case. Generally speaking, as shown in Fig. 4.12 when the accuracy of the MFE structures is lower than 55%, peeling increases the accuracy of the structure. This is because the MFE structure in these cases has little resemblance to the true structure. However, as shown in Fig. 4.13, in all but one case, HFold adds the peeled base pairs back and creates the. original MFE structure. There is only one data, point on the lower triangle showing improvement over the MFE structure. In this case the peeled structure is similar to Gu , thus HFold identifies the corresponding G„ ii structure correctly, but since it adds more base pairs to lower the energy of the whole structure the accuracy of the predicted structure is not perfect. There are some data points shown in Fig. 4.13 whose accuracies do not change after running HFold. This is because the stems in their corresponding MFE structure are shifts of the stems of their Gu structure, thus removing a base pair from inside each stem that closes a hairpin loop does not change the accuracy of the whole structure. By removing more base pairs from inside the structures, we obtain more cases where the accuracies before and after running HFold are different. For example, when we remove only 1 base pair from inside each stem that closes a hairpin loop we find 10 data points with the same accuracies before and after HFold, while this number decreases to 6 when we remove 2 base pairs from inside each stem that closes a loop (presented in Fig. 4.13). Similar results are achieved when 1, 2, or 3 base pairs are removed from outside each stem that closes a hairpin loop of the MFE structures and HFold is run'given these structures as input. . • g )m a 4.7.2 Accuracy when input is the M F E pseudoknot free structure with 1-3 base pairs removed from both inside and outside each stem that closes a hairpin loop We peeled away parts of stems from both ends of each hairpin loop. When we peeled only 1 base pair from each end of the stems, we ended up with 5 cases whose accuracies did not change before and after peeling (see Fig. 4.14). In those cases the MFE structure is very different from the true structure so that peeling one base from both ends of each loop does not change the accuracy. In all •those cases HFold adds the peeled base pairs back and thus, the final predicted structure is the MFE structure. As shown in Fig. 4.14, generally speaking when the accuracy of the MFE structures is lower than 55%, peeling increases the accuracy of the structure. This is because the MFE structure in these cases has little resemblance to the true structure. Because of the same reason, HFold adds the peeled base pairs back to the structure to lower the energy of the whole structure (see Fig. 4.15). There is only one case shown in Fig. 4.15 for which running HFold results Chapter 4. Results 40 Accuracy of the peeled structures Figure 4.12: Accuracy of the MFE structures with 2 base pairs removed from inside each stem that closes a.hairpin loop (horizontal axis), versus the accuracy of the MFE structures (vertical axis). Pseudoknotted and pseudoknot free structures are presented by 'x' and 'O', respectively. Chapter 4. Results 41 Accuracy of HFold on in 2 Figure 4.13: Accuracy of HFold when input is the MFE structures with 2 base pairs removed from inside each stem that closes a hairpin loop (horizontal axis), versus the accuracy of HFold when input structure is MFE (vertical axis). Pseudoknotted and pseudoknot free structures are presented by 'x' and 'O', respectively. Chapter 4. Results 42 A c c u r a c y of the peeled structures Figure 4.14: Accuracy of the MFE structures with f base pair removed from both ends of each stem (horizontal axis), versus the accuracy of the MFE structures (vertical axis). Pseudoknotted and pseudoknot free structures are presented by 'x' and '0', respectively. Chapter 4. Results 43 Accuracy of HFold on both 1 Figure 4.15: Accuracy of HFold when input is the MFE structures with 1 base paii' removed from both ends of each stem (horizontal axis), versus the accuracy of HFold when input structure is MFE (vertical axis). Pseudoknotted and pseudoknot free structures are presented by 'x' and 'O', respectively. Chapter 4. Results 44 in improvement in accuracy; in this case the peeled structure is similar to Gin,,, thus, HFold identifies the corresponding G. ,ii structure correctly but adds more base pairs to lower the energy of the structure, resulting in a not perfect accuracy. , Removing more than 1 base pair from both ends of the stems results in fewer cases with the same accuracy before and after peeling; that is because the peeled structures have fewer base pairs. But since the MFE structures in general have little resemblance to the true structures, HFold adds the peeled base pairs back and thus, the average accuracy of the predicted structures is the same as the average accuracy of the MFE structures. snu 4.7.3 A c c u r a c y w h e n i n p u t is the M F E pseudoknot free s t r u c t u r e w i t h stems of size 1-3 r e m o v e d Since the MFE structures typically have long stems and small loops, we decided to remove the stems of size 1-3 and then give those structures as input to HFold. Since none of the MFE structures have any stems of size 1, peeling away stems of size 1 does not change any of the structures and thus, the final predicted structures are the MFE results. When removing stems of size 2 or 3, there is little change in the structure of some of our data points. However since HFold cannot fix the errors of the input structure, it does not find any structure with lower energy than the MFE structure and thus puts the removed base pairs back. In only 1 case, the input structure is similar to G/„ and thus HFold finds a. pseudoknotted structure which is very similar to the true structure but with some extra base pairs. 9 4.7.4 A c c u r a c y w h e n i n p u t is the M F E pseudoknot free s t r u c t u r e w i t h loosened loops Finally, we decided to loosen up the loops of the MFE structures so that HFold could possibly add extra base pairings to the structure and create pseudoknotted structures. Since we noticed from the HFold results in Section 4.4 that if the true pseudoknotted structure needs less than a stack of size 6 (that, is 6 consecutive base pairs) the structure typically will not form, due to the large pseudoknot initiation penalty, we decided to make room for at least 6 bases inside the loops. After removing enough base pairs from inside the loops of the MFE structures, we ran HFold on them and obtained the accuracy of the results. It is interesting to note that the average accuracy in both cases (before and after running HFold) is almost the same (59% and 58.5% respectively) and is also the same as the average accuracy of the MFE structures. In only 1 case HFold predicts the pseudoknot as it is in the true structure but it adds some extra base pairings. In this case the MFE structure is very similar to the corresponding G/„ structure but has overcrowded loops; that is why loosening the loops helps HFold find the true pseudoknotted structure. In the rest of the cases, HFold adds the peeled base pairs back and results in the MFE structure. 9 Chapter 4. Results 4.8 45 Base Pair Confidence Structure formation can be viewed as a probabilistic process, with the likelihood of a given secondary structure (for a fixed sequence) being a function of the energy of that structure. The lower the free energy, the more likely the structure: thus, the MFE structure is the most likely to form. Energy-based methods have also been developed to predict base pairing probabilities of pseudoknot free secondary structures at thermodynamic equilibrium. Mathews [15] found that, roughly speaking, the higher the probability that a base pair is predicted to be in the equilibrium structure, the more likely it is to occur in the true structure. One possible way to improve the accuracy may be to use base pair probability as a confidence measure - that is, base pairs are kept only if their probability is higher than a threshold value. Based on the results presented in Mathews [15], we chose different thresholds starting from base pair probabilities of 75%, and increased the threshold by steps of 5% to the maximum of 95% threshold. In each case, we chose only the base pairs whose probabilities are higher than the given threshold, and gave these structures as input to HFold. Figure 4.16 shows the accuracy comparison plot for predicted structures when base pair confidence is at least 75% and the MFE structures. As shown in Fig. 4.16 there are only two points that are not on the diagonal line. Both of these two cases show improvement over accuracy of the MFE structures. It is worth mentioning that in both of these cases the structure predicted by HFold when the input structure is base pairs with confidence above 75% achieves higher accuracy than the structure predicted by HFold when the input structure is the MFE structure. In one of these cases the improvement is 11% and in the other case it is 8%. The results are similar to Fig. 4.16 when the input structure has base pair confidence of at least 80% and 85%. When the confidence level reaches 90% all the data points except one are on the diagonal line. This one case is one of the two cases we discussed when the threshold was 75%. It is clear that the higher the threshold, the fewer base pairs are chosen for the input structure. For example, when the threshold is 90%, there are no base pairs chosen in 23 cases out of 70 cases, thus, the input structure to HFold is empty and HFold gives the MFE structure for those cases. For all the different threshold values the average accuracy is 59%, .which is exactly the same as the HFold accuracy when the input structures are the MFE structures. 4.9 Energy Comparison Using a separate method, with the same energy model, we calculated the energy of the "true" pseudoknotted structures, and compared those values with the energies of the structures predicted by HFold when the input was Gu . Figure 4.17 shows the result of our comparison. All of the data points, except * one case, are either on the diagonal line, or in the upper triangle. As expected, g Chapter 4. Results 46 Accuracy of HFold on base pairs with confidence >= 0.75 Figure 4.16: Accuracy of HFold when the input is the pseudoknot free structures with base pair confidence above 75% (horizontal axis), versus the accuracy of the MFE structures (vertical axis). Pseudoknotted and pseudoknot free structures are presented by ' x ' and 0', respectively. : Chapter 4. Results 47 HFold's results have the same energy as the "true" structure or the "true" structures have higher energy than the HFold's results using our energy model. The only case that HFold results in a structure with higher energy value than the true structure is when the true structure is a density-3 structure which cannot he handled by HFold. Figure 4.18 shows the comparison of the energy of G\,i structure for a sequence versus the energy of the MFE structure for the same sequence. In many cases, the energy differences are large, indicating the energy model is poor. Ideally, when the energy model is good, all the data points should be on the' diagonal line or very close to it, such that running HFold on the MFE prediction would result in the true structures. It is interesting to note that for two of the pseudoknot free structures (presented by circles in thefigure)the MFE energy value is much lower than the energy of the corresponding Gu structures; this can be an indication of the fact that even the pseudoknot free energy model is far from being correct. This is because large loops are broken into small loops to lower the energy in the MFE structures (see Fig. 4.2). This is the main reason for not getting the perfect accuracy for all cases when the true pseudoknot free structures are given to HFold as input structures. tj u Chapter 4. Results 48 Figure 4.17: Comparison of HFold energy and true structure energy. The horizontal axis represents the energy of structures predicted by HFold when the input is Gin,) while the vertical axis represents the energy of the "true" structures. Pseudoknotted and pseudoknot free structures are presented by 'x' and 'O', respectively. Chapter 4. Results 49 Figure 4.18: Comparison of energy of G\,\. (horizontal axis) and the energy of the MFE structure (vertical axis). Pseudoknotted and pseudoknot free structures are presented by 'x' and ' O ' , respectively. a Chapter 4. 50 Results Sequence IDs and lengths Sequence type Str. Ref. BLV(27), BYDV-NY-RPV(27), CABYV(27), FIV(35), PLRV-S(26), PRRSV-LV(59), SRVlgag/pro(37), BChV(26), BWYV(26), EIAV(35), PEMV(28), PLRV-W(26), MMTVgag/pro(34), PRRSV-16244B(58), SARS-CoV(69) viral ribosomal frame shifting H H H H H H H H H* [27] AKV-MuLV(50), GaLV(49), Cas-Br-E-MuLv(50), SNV(50) BaEV(50), Mo-MuLV(50), FeLV(50) viral ribosomal readthrough H H H [27] 1.1(37), 2.9(34), 2.7a(36), 1.7(37), 2.1b(39), 2.10(37), •1.6(37), 2.12(37), 2.6b(42), 1.3a(37), 2.4a(37), 2.11(37), 1.17(37), 1.8(39), 1.9b(37), 2.5a(41), 2.2b(42), 2.3a(45) with high affinity to HIV-l-RT H H H H H H [5] Ec-S15(67), Hs-PrP(45), Bt-PrP(45), T4-gene32(28) mRNA H H [27] Ec-PK1(30), Bp-PK2(80) LP-PK1(30), Ec-PK4(52) tmRNA H H [27] Tt-LSU-P3/P7(65), HDV-It_ag(89) Ribozymes H* H* [27] BVDVJRES(73), HCVJres(56) TMV-L(38), CSFV.IRES(76), turnip yellow mosaic virus(86), tobacco mosaic virus(214) viral RNA H* H H H* [27] Table 4.2: Sequences with H-type pseudoknots used for secondary structure prediction. In order, the columns provide (1) sequence ID, as found in the database or paper from which we obtained the sequence, and their length in parenthesis; (2) type of sequence; (3) structure of sequence (H-type pseudoknot = H, H-type pseudoknots with nested structures = H*); and (4) the reference from which the sequence was obtained. 51 Chapter 4. Results Sequence IDs and lengths Sequence type Str. Ref. I-ICV.229E(61) viral ribosomal K [27] CoxB3(68) Viral 3' UTR K [27] satRPV(72), NLVS(45), HDV-It_g(88) Ribozymes K D3 [27] Coxsackie(114) Viral and Phage RNA K [6] Hs-SRP-pkn(47) SRP RNA K [27] Table 4.3: Sequences with kissing interactions and the density-3 sequence used lor secondary structure prediction. In order, the columns provide (1) sequence ID, as found in the database or paper from which we obtained the sequence, and their length in parenthesis; (2) type of sequence; (3) Structure of the sequence (kissing interactions = K, density-3 structure = D3); and (4) the reference from which the sequence was obtained. 52 Chapter 5 Conclusion and Future Work In this work, we presented HFold, a fast, new dynamic programming algorithm that efficiently predicts RNA secondary structure including pseudoknots, based on the hierarchical folding hypothesis. HFold can predict kissing hairpins and pseudoloops with arbitrary number of bands. The algorithm had been proposed by Zhao [30]; this thesis focuses on an empirical analysis of its performance. Our analysis shows that, when presented with the true pseudoknot free structure, Gin,) no significant improvement is obtained over the accuracy of the pseudoknot free structure alone. However when HFold is given G „u as input structure, 16% accuracy improvement is obtained on average over the accuracy of the pseudoknot free structure alone. This study also shows that using G „u as the input to HFold results in 80% accuracy with 19 cases of perfect accuracy. High pseudoknot initiation penalty and stem shifts are main causes of not achieving perfect accuracy for the rest of the cases. Based on the analyses presented in this work, using only the MFE structures as input to HFold does not result in high accuracies as the MFE structures are usually overcrowded with long stems and small loops and thus, running HFold does not result in the addition of any extra base pairings. We showed here that since the MFE structures usually bear little resemblance to the "true" structures, peeling the stems away from inside and/or outside the loops does not improve the average accuracy. We took the peeling one step further and showed that neither removing stems of size m when 1 < m < 3, nor loosening the loops to have at least 6 unpaired bases inside, changes the average accuracy of the results. Since the energy model using which the MFE structures are being formed does not consider pseudoknots, even using base pair probabilities is of no help in improving the accuracy of the results. We also showed that using the best of 10 suboptimal structures improves the accuracy of prediction by 11%. The accuracy increases by another 4% if the first 25 suboptimals are used. In many cases the accuracy of the MFE structure is greater than or equal to the accuracy of the .lowest energy structure predicted by HFold on the first 25 suboptimal structures. However, in numerous cases of the latter structures, HFold identifies the correct pseudoknotted base pairs. Note that adding more base, pairs to lower the energy causes the accuracy of HFold result to be lower than the corresponding MFE structure. This study showed that since pseudoknot free energy model predicts too sm sln Chapter 5. Conclusion and Future Work 53 many base pairs in the structure, peeling some base pairs away from the ends of stems that close hairpin loops in many cases improved the accuracy of the structure. However high pseudoknot initiation penalty in the pseudoknotted energy model made HFold avoid predicting pseudoknots by adding pseudoknot free base pairs when possible. We also found that because of the high pseudoknot initiation penalty and low band penalty, more bands are added to pseudoknotted structures to compensate for the energy. The results of this study provide more insight to the hierarchical folding hypothesis. Our results suggest that with respect to the current energy model minor rearrangements of the pseudoknot free MFE structures do not often lead to the true pseudoknotted structures. We also showed that in some cases the pseudoknot free MFE structure contains shifted stems of the true structure; therefore without shifting the stems it is not possible for the MFE structure to fold into the true pseudoknotted structure. Throughout this work, we used an index by index comparison of the base pairs to find the accuracy of prediction. However, in many examples the overall structure of the predicted structure was similar to the true structure, but more base pairs or shifts lowered the accuracy. We believe that using other notions of accuracy that consider structural similarities are more accurate that using an index by index comparison. Thorough analyses of the data based on different commonly used accuracy measures such as sensitivity, specificity, or positive predictive is underway. Our findings raise three main questions: 1. whether the structures identified by the experimental methods are reliable, 2. whether the current energy model is precise enough, and 3. whether the hierarchical folding hypothesis is precise enough. One of our future goals is to tune the parameters of the current energy model to improve the accuracy of the prediction for a wider sets of structures. One possible approach is using Andronescu's tuning method [4], Another future work is to use a better energy model for pseudoknotted structures, such as that of Cao and Chen [5], and obtain better energy parameters. This is also of great interest to us to evaluate the hierarchical folding hypothesis using computational methods in future. We are not yet able to do a sound comparison of the prediction accuracy of HFold with MFE-based methods, since it would be important to ensure that the same energy model is used by both methods. Therefore one of the main goals for our future work is to compare hierarchical and MFE algorithms implemented using the same energy model, at least for H-type pseudoknots. Finally, we plan to incorporate other techniques to produce better input structures to HFold, such as information obtained from chemical modification data [17]. 54 Bibliography [1] T. Akutsu. Dynamic programming algorithms for RNA secondary structure prediction with pseudoknots. Disc. App. Math., 104(l-3):45-62, Aug 2000. [2] S. L. Alain, J. F. Atkins, and R. F. Gesteland. Programmed ribosomal frameshifting: Much ado about knotting! PNAS, 96(25):14177-14179, Dec 1999. [3] M. Andronescu. R,. Aguirre-Hernandez, A. Condon, and H. H. Hoos. RNAsoft: A suite of RNA secondary structure prediction and design software tools. Nucleic Acids Res, 31(13):3416-3422, Jul 2003. [4] M. Andronescu, A. Condon, H.H. Hoos, D.H. Mathews, and K.P. Murphy. Efficient parameter estimation for RNA secondary structure prediction. BMC Bioinformatics, page to appear, July 2007. [5] S. Cao and S. Chen. Predicting RNA pseudoknot folding thermodynamics. Nucleic Acids Res, 34(9):2634-2652, 2006. [6] B. A. L. M. Deiman and C. W. A. Pleij. Pseudoknots: A vital feature in viral RNA. Seminars in Virol, 8(3):166-175, 1997. [7] C. Dennis. The brave new world of RNA. Nature, 418(6894):122-124, Jul 2002. [8] R. M. Dirks and N. A. Pierce. A partition function algorithm for nucleic acid secondary structure including pseudoknots. J. Cornput. Chem., 24(13):1664-1677, Oct 2003. [9] S. Griffiths-Jones, S. Moxon, M. Marshall, A. Khanna, S. R. Eddy, and A. Bateman. Rfain: annotating non-coding RNAs in complete genomes. Nucleic Acids Res, 33(Database issue), Jan 2005. [10] K. Han, Y. Lee, and W. Kim. Pseudoviewer: automatic visualization of RNA pseudoknots. Bioinformatics, 18 Suppl 1, 2002. [11] I. L. Hofacker, W. Fontana, P. F. Stadler, L'. S. Bonhoeffer, M. Tacker, and P. Schuster. Fast folding and comparison of RNA secondary structures. Monatshefte filr Chenue / Chemical Monthly; 125(2):167-188, Feb 1994. Bibliography 55 [12] H. Jabbari, A. Condon, A. Pop, C. Pop, and Y. Zhao. HFold: RNA pseudoknotted secondary structure prediction using hierarchical folding. In Raffaele Giancarlo and JSridhar Hannenhalli, editors, LNBI, Lecture Notes in Computer Science, page to appear. Springer, 2007. [13] R. B. Lyngs0. Complexity df pseudoknot prediction in simple models. In Josep Diaz, Juhani Karhumaki, Arto Lepisto, and Donald Sannella, editors, ICA LP, volume 3142 of Lecture Notes in Computer Science, pages 919-931. Springer, 2004. [14] R. B. Lyngso and C. N. Pedersen. RNA pseudoknot prediction in energybased models. ./. Cornput. Biol, 7(3-4):409-427. 2000. [15] D. H. Mathews. Using an RNA secondary structure partition function to determine confidence in base pairs predicted by free energy minimization. RNA, 10(8):1178-1190, Aug 2004. [16] D. H. Mathews. Predicting RNA secondary structure by free energy minimization. Th.eor. Chem. Ace: Theory, Computation, and Modeling (Theoretica Clnmica Acta), pages 1-9, May 2006. [17] D. H. Mathews, M. D. Disney, J. L. Childs, S. J. Schroeder, M. Zuker, and D. H. Turner. Incorporating chemical modification constraints into a dynamic programming algorithm for prediction of RNA secondary structure. Proc Nail Acad Set U S A, 101(19):7287-7292, May 2004. [18] D. H. Mathews, J. Sabina, M. Zuker, and D. H. Turner. Expanded sequence dependence of thermodynamic parameters improves prediction of RNA secondary structure,. .7. of Mol. Biol, 288(5):911-940, May 1999. [19] W.J. Melchers, J.G. Hoenderop, H.J. Bruins Slot, C.W. Pleij, E.V. Pilipenko, V.l. Agol, and J.M. Galama. Kissing of the two predominant hairpin loops in the coxsackie B virus 3' untranslated region is the essential structural feature of the origin of replication required for negative-strand RNA synthesis. .7. Virol, 71(l):686-696, Jan 1997. [20] B. Rastegari and A. Condon. Parsing nucleic acid pseudoknotted secondary structure: algorithm and applications. J. Cornput. Biol, 14(l):16-32, 2007. [21] J. Reeder and R. Giegerich. Design, implementation and evaluation of a practical pseudoknot folding algorithm based on thermodynamics. BMC Bioinformatics, 5, Aug 2004. [22] E. Rivas and S.R. Eddy. A dynamic programming algorithm for RNA structure prediction including pseudoknots, Jul 1998. [23] M. Sprinzl, C. Horn, M. Brown, A. Ioudovitch, and S. Steinberg. Compilation of tRNA sequences and sequences of tRNA genes. Nucleic Acids Res, 26(1):148-153, Jan 1998. Bibliography 56 [24] D. W. Staple and S. E. Butcher. Pseudoknots: RNA structures with diverse functions. PLoS Biol, 3(6), Jun 2005. [25] I. Tinoco and C. Bustamante. How RNA folds. J. Mol. Biol, 293(2):271281, Oct 1999. ' [26] Y. Ueniura, A. Hasegawa, S. Kbbayashi, and T. Yokomori. Tree adjoining grammars for RNA structure prediction. Theor. Cornput. Sci., 210(2):277303, Jan 1999. [27] F. H. van Batenburg, A. P. Gultyaev, and C. W. Pleij. Pseudoba.se: structural information on RNA pseudoknots. Nucleic Acids Res, 29(1):194-195, Jan 2001. [28] C. W'itwer, Ivo L. Hofacker, and P.F. Stadler. Prediction of consensus RNA secondary structures including pseudoknots. IEEE/ACM Trans. Cornput. Biol. Bioin.forina.tics, l(2):66-77, Apr 2004. [29] M. Wu and I. Tinoco. RNA folding causes secondary structure rearrangement. Proc. Natl. Acad. Sci. USA, 95(20):11555-11560, Sep 1998. [30) Y. S. Zhao. Efficient algorithm for RNA pseudoknotted secondary structure prediction given a pseudoknot free structure. Master's thesis, University of British Columbia, Department of Computer Science, September 2005. [31] M. Zuker and P. Stiegler. Optimal computer folding of large RNA sequences using thermodynamics and auxiliary information. Nucleic Acids Res, 9(1):133-148, Jan 1981. 57 Appendix A Recurrences In this chapter, we present the details of HFold recurrences. Throughout this work, we will use the following notation: • G: a pseudoknot free structure. • G': a pseudoknot free structure that we add to G. • R. the complete pseudoknotted structure: R=GL)G'. R j refers to the subset of R whose bases are in the region [i,j]. Let Gij = G l~l {i.,j} x i.e. the set of base pairs of G contained in region Let /?.,., be a minimum free energy (MFE) secondary structure for [i,j], given a pseudoknot free secondary structure G;,j. The energy value of each substructure type, for a given input sequence S = $i$o...s. and the given pseudoknot free secondary structure G, is stored in an array. In the next subsections, we describe how each is calculated. We illustrate each case with a figure, where we use the following notations in our figures: h n • G is presented by the portion of the figure above the base line. • G' is presented by the portion of the figure under the base line. • The normal black lines can be any arcs in Rij. • The solid lines are for base pairs. • The dotted lines connect bases that don't have to be paired. • The clear shade within the arcs indicate that there are no additional base pairs within the arc. • The.gray shade within the arcs are unknown structures. A.l Wij W^.j is the MFE of all valid structures Rij over region if i and j are not covered in G, i.e. i.sCovered(G,i) &nd isCovered(G, j). Otherwise, W^-is+00. The base cases are as follows: W.,j = 0, if i > j, since then Rij is empty; and Wij = +00, if isCovered(G.i), or isCovered(G,j). Appendix A. (1) Recurrences 58 ( ) 2 r r+1 (3) 7, I\ Figure A. 1: Illustration of Cases for Wij. Otherwise, Wi,j is given by the following recurrence: W hJ = m i n ^ ^ ^ ( + ^ W (A.l) . (3) WMBi,j + P s The base cases indicate that W is being used only when the structure is an exterior structure and that there is no penalty for having unpaired bases at either end of the structure. Case (1) handles the case that i pairs with j, i.e. bpR (i) = j. Case (2) handles the cases that 3r, * < r < j, 6p/?.;,,• (?-) < r (i.e. t is either unpaired or paired with another base inside region [i,r]), and bpp.. .(j) > r or bp a 0) = 0 (i.e. 7 is either unpaired or paired with a base inside region i%1 [r + h:i\). If R-i.j does not fall into case (1) or (2), it must be that is a pseudoknotted closed region. This is an exterior pseudoknot because of the premise that i and j are not covered in G. In this case, we add a P penalty for introducing an exterior pseudoknot. s A.2 Wlj W] .j is the minimum free energy of all valid structures Rij, given that.[i, j] is weakly closed, and /?.,;.,- is inside a pseudoknot. Otherwise, Wlij is +00. i: 59 Appendix A. Recurrences The base cases are as follows: Wli.-j = +00, if cover(i) / cover(j), since is not weakly closed. Wh,j = P p, if * = j and bp (i) = 0, since [i,j] is an empty region, thus we give it the value for an unpaired base in a pseudoloop. U G 117,., = 0, if?; > j. Otherwise, WI^j is given by the following recurrence: (1) if i.j e G, or Vij+P , P (bp (i) and bpc(j) = 0) mm < (2) min ( WI, , II'/.., i<t<3 {(3) WMBi /'.,, WI,(t+DJ) c (A.2) + P„, Similar to W- .j, case (1) handles the case that i pairs with j, i.e. 6po (t) = j or the case that neither is paired in G but are paired in G'. Case (2) handles the cases that 31, i <t < j, bpa, , (i) < t, and bp a, ,0) > t OYbp(Rij,j) = 0. If R;.j does not fall into case (1) or (2), it must be that paired(Rij,i) and paired(Ri,j,j), and bpa,,.,•(*) > frpy?.; where [i, j] is a pseudoknotted closed region in Ri.j. In this case, R,.j will be covered by case (3). Since WI, j is the structure inside a pseudoknot, but not inside a band, we add a P , penalty to case (1) and (3), and a P , penalty to case (3) for introducing a new pseudoknot inside a pseudoloop. U v s s n Appendix A. Recurrences 60 Figure A.3: Illustration of cases for Wt ... In case (5), the plotted structure from i to j could contain more than 2 bands (not illustrated) t Wl' A.3 :/ Wf- • is the minimum free energy of all valid nonempty structures RIJ, if [i-j\ is weakly closed, given that R is inside a band. Otherwise, Wf j is +oo. The base cases are as follows: Wl - , = +co, if [i.j] is not weakly closed. Wf. .- = +oo, if i > j, since empty(iJjj, Li t 1 L Otherwise, Wf • is given by the following recurrence: i if i.j € G, or (bp (i) = 0 and bp (j) = 0) if frp (i) = 0 (A.3) iffrpc(j)= 0 f(l)^-+&' G G Wl , = min < + c' (2) WL (3) +c' (4) mm.(IV/ + G iit [(5)H/iV/B, ,,+P : 5m +6' Cases (2) and (3) handle free bases on each side of the sequence. The rest of Appendix A. Recurrences 61 the cases are similar to W/with the only difference being that here in cases (1) and (5) we use b' as the penalty for introducing a new base pair in the structure instead of P , since the base pair is not inside a pseudoloop, but rather in a band. pH A.4 VP,,, VPj,j is the minimum free energy of all valid structures Rij, in which bpc(i) = bpa(j) = 0, bases i and j are paired in G', i.e. bpa>{i) — j, and i.j crosses a base pair of G. Here the energy of Rij is the energy of all loops within Rij that are not inside a. band. Otherwise, VPij is +oo. The base case is as follows: (i>j, VPij = +oo , if < i.j does not cross any base pair of G, [bp {i) G > 0, or bpcij) > 0 Otherwise, VP,j is given by the following recurrences: (A.4) 62 Appendix A. Recurrences 17'.., = min '(i)w/ ( i + 1 ) i ( B ( i j ) _ WI, 0 if isCovered(G. i), and isCovered(G, if j) isCovered(G,i), and isCovered(G,j) (3) W^i+i)^^^.,-!) + + ^ V ( V (''<••,/)-') if isCovered(G, , + l),(.7-l) i), ( ) and i,sCovered(G,j) (4) e. p(i,i 1 ,i) + KPji+D^-D st if (6p (i + l) = 0,. and 6 (j - 1) = 0) G PG (5) min •;<7-<-m•;•!,( (e,„,,p(i,r,r' ,j) (A.5) 0'..,•,,''(;..,•)) ( ; •»»^('''(.'.,-)^(.-..-;))<'''<.v + cover(G.i) VP,,,,) = cover{G ,r) and cover(G,j) = cover (G,r') and empty (G, [i + 1, r - 1]) and empty(G,[r' +1,7-1]) (6) mjn (7) min + ^, ,--,) +a' + 26') ( + H / / ' . „ , _ a ' + 26') ( +1) {y 1)+ Cases (1), (2), and (3) handle the cases that there are no other base pairs in [i,j] that cross the same band(s) that i.j crosses. In these cases we compute the energy between band borders. Case (4) handles the case that base pairs i.j and (i+ l).(j — 1) form a stacked pair in R,,;. Case (5) handles the case that i.j and r.r' close an internal loop of P,;j. Cases (6) and (7) handle the similar condition to case (5) except that case (6) allows closed regions in the gap region [i,r — 1] and case (7) allows closed regions in the gap region [»••+ In those cases i.j does not close an internal loop; but rather close a multiloop that spans a band. Appendix A. Recurrences Figure A.4: Illustration of Cases for 63 VP^j 64 Appendix A. Recurrences Cases (6) and (7) can he combined into the following case: min ( , , ( , . _ , ) + VP.,,,, + W^ ., (T +1)ib -_ + P„« + a' + 2b') 1) i<r<inm(B .,.;),'>(;..,•)') { . ,B .i))<r'<j <„ur.,:{h\. n (i (A.6) Since the minimization is done over two parameters r and r', we should limit the size of region [r,r'] to keep the complexity of our algorithm to 0(n ). If Rij does not fall into any case from (1) to (7), then there must exist r.r' in G", with i < r < r' < j, and one base r (or r') inside the band region [B ^. .y B(i,j)] (or [b(i<.j),bC -j]) and the other base r' (or r) outside the band region [B ^ .yB^j)] (or [by,.j),b^. .^]). Then G U G' must have density at least 3, which is not allowed in our algorithm. J 1 1 A.5 VP^ VP-,j is the minimum free energy of all valid structures Rij over region [i.j], such that for some r, i < r < j, either bpc(i) = r or bpc>(j) = r, and either ».r or r.j crosses a base pair of'G. Otherwise, VP • is +oo. The base case is as follows: ; VPij = +oo, if / > j (A.7) Otherwise, VP - • is given by the following recurrences: VP'ij = min 1 f(l) (2) (3) (4) min (VP.,,.+ J W min, »<-'-<"»"(B ( j i ,. ) : J ) (r+1)ii min f • (!<(,._!,+ ( min (VP^ + c'iJ-r)) „»,.».•(»:,//(...))<,•<:/ i<r<in.in(B' . ) (c'(r - i) + VP j) r A g . if empty(G, [r + 1, j]) if empty(G, [i, r - 1]) ' ,y) In both cases (1) and (2), the energy of Rij is the energy of all loops within Rij. In case (1), we have two components: the energy given by base pair i.r which is covered by VP, and the energy given by the structure from base r + 1 to base j, which is covered by Wf. Since only VPij uses VPij, the structure from r + 1 to j is within a band (see case (6) of VPij) and so is covered by WT. Case (2) can be reasoned similarly, with reference to case (7) of VPij for use of WF. Cases (3) and (4) are similar to cases (1) and (2) with the only difference that there is no base pairs in regions [r + l,j] and [i,r — 1] respectively. 65 Appendix A. Recurrences (1) (2) Figure A.5: Illustration of Cases for VR-^. V,, A.6 Vij is the minimum free energy of all valid structures R,j over region [i.j], if [i.j] is weakly closed or empty and i.j forms a base pair of Rij. Otherwise, V)j is +00. The base bases are as follows: V. j = +oo, if < [i.j\ is not weakly closed \bvii; ,•(*) > and bp . {j) > 0 but bp , .(i) ^ bp .^j) (A.9) Otherwise, V. j is given by the following recurrence: L 0 ri ; R R: L Vij = mm{e (i, ),e (iJ) H 3 s + Vi),(i-U> i.^ Mij} VBJ v (A.10) This recurrence is identical to that used in pseudoknot free algorithms, so we omit the proof here and for VBI in the next section. A.7 VBI; j VBk,j is the minimum free energy of all valid structures R j over region if [i,j] is weakly closed or empty, assuming i.j closes a bulge or internal loop of Rij. Otherwise, I />'/... is +oo. i: Appendix A. Recurrences 66 The base case is as follows: VBIij — +00, if j — i < 1 Otherwise, VBl,,= + W,-')- min (e (i,i'j',j) i<»' <,)'<:; illt (A.11) where all bases between [i,i.']-and [j'.j] are unpaired. i.nt.{i; i'j'j) gives the free energy of an internal.loop or bulge with exterior pair i.j and interior pair i.'.j'. The complexity for computing VBlj can be done in 0(n ) using [31]. e 3 VM,., A.8 VMij is the minimum free energy of all valid structures Rij over region [i.j], if [i.j] is weakly closed or empty and i.j closes a multiloop of Rij. Otherwise, VMij is +00. We can obtain a recurrence that calculates the loop cost as the sum of two subparts. VM.-,.-. = min< -+i<h<j-i [ (2) iyAf5 _i) + a + P (i+1)(i sm +b (A.12) Case (1) is similar to recurrence for Wij , and case (2) handles the case that there is one or more pseudoknotted loops in the multiloop. A.9 WM,, WMj is the minimum free energy of all valid structures Rij, if [i,j] is weakly closed, not empty, and i and j are on a multibranched loop. The base case is as follows: WMij = +00 , if i > j Otherwise, WMij is given by the following recurrences: • 67 Appendix A. Recurrences (2) Figure A.7: Illustration of Cases for WMB;.,. {2)WM j+c WMi.-j = i n { (3) W % . 1 bp (i) = 0 • li+1) m + ) c bp (j)=0 G C (4) min {WMi.t + [ (5) IfiVm,:,+ P WM (A.13) ) {t+l)/l +b sm Cases (1) to (4) are the same as in a pseudoknot free structure, and case (5) handles the case of a pseudoknotted loop in the multiloop. A.10 WMB, , WMB^.j is the minimum free energy of all valid structures Rij over region [i,j], where /?.,;., is a density-2 pseudoloop. Otherwise, WMB is +oo. The base case is as follows: II-'.',//?= +oo , if / :••./ Otherwise, WMBij = min (1) P,,+ min (BEibpcljlbpGiBlu^Bl.jyj) + WMB' + WI hl U bpcU) > 0 j), {l+1)!(B[iirl (A.14) (2) WMB' . it j In case (1), j is paired in G. Then, in the MFE structure, some base I with bpc(j) < I < j must be paired in G', causing 6p/?„ Aj).j to be pseudoknotted. We minimize the energy over all possible choices of / (note that / must be unpaired in G, since it will be paired in G', which is disjoint from G). By Lemma 1, once / is fixed, the inner base pair of the band whose outer base pair 68 Appendix A. Recurrences is bpii; ,{j).j is also determined. The P),+ BE term in case (1) of the recurrence accounts for the energy of the band, a WI term accounts for a weakly closed region in the band, .and the remaining energy is represented by the WMB' term. In case (2), j is not paired in G, and the recurrence is unwound by moving directly to a WMB term. 1 WMll, A . 11 , WMB] j is the minimum free energy of all valid structures R ..j over region where R,., is a proper prefix of a density-2 pseudoloop. Otherwise, WMB is +oo. The base case is as follows: WMB'. = +oo , if i > j L ivJ Otherwise, WMB';^ = min '{1)2P,,+ min i<l<,n,.n(;,.b ) {iJ) isCoVKral{C,. l) h (BEibpciBuj^MciB^B'^Buj)) + W M 0 _ + VP,.,). (2) min. (WMBf +WI I < N L if 6p 0) =0 G ) iil ), if bp (j) < j [l+1)J G ( A J 5 ) c„vcr(l)=covur(;)) l (l.)=() Va (3) P + VP., (4) 2P,, + mm h (BE(i, b' , 6p (6'- ) , bp (i)) (i 0 G (l 0 G if bp (j) = 0 G t<l<hpa(i) _ + WI\ , i), ;-i) + VPi,j) h : 0+ and bp (i) > 0 ( c Complementing case (1) of the WMB recurrence, WMB' handles the case that the rightmost band is not in G, but is part of the MFE structure G' (with respect to G). In the recurrence for yVMB , case (1) is the complex case, accounting for the energy of the region spanned by the rightmost two bands using the 2P\,. VP, and BE terms, and recursively calling WMB . Case (2) is called when one iteration of WMBij case (2) or WMB . case (1) is done and we need to compute the energy of the remaining part of the structure inside an arc. Note that Wlij = +00 when cover (i) — cover (j) = (—1, —1) such that entering case (2) as the first iteration of WMB' is not possible. For example, the total energy for the structure presented in Fig. A.11(2) is calculated in the following way: the energy of the rightmost band covering base j is calculated using WMB case (1); the energy of the remaining structure from i to j accounts for the energy of the closed subregion of [i,j], [1 + 1, j], and the remaining prefix (WMB'-j), which is handled by case (2) of the WMB! recurrence. In cases (3) and (4), only one iteration of WMB yields the result. Case (3) handles the case that i pairs with j in G', and i.j is the left-most base pair in 1 1 / i i 1 Appendix A. Recurrences (1) 69 (2) Figure A.8: Illustration of Cases for WMB.,j. the structure. Since this is part of a pseudoknotted structure, it will be covered by VP. Case (4) handles the case that there are only two bands in Rij, and the right band of the pseudoknotted loop is not in-G. A.12 BEij. /.'. f. j) . BE(i, is the minimum free energy of the band [?', i'] U [j'j], if i < i' < j' < j, in which bp {i) = j and bp (i') - j'The base cases are as follows: BE(i,i',j',j) = +oo, if it is not the case that i<i' < j' <j BE(i,i,j,j) = 0, if i < j c G 70 Appendix A. Recurrences Otherwise, BE(i,i'j',j) = min + BE(i + l,i',j',:J-l) (2)e {i,l,bp (l),j) +BE(l.,i'f,bp (l)) intP if 6p (i + l) = (j - 1) if b (l) > 0, (empty(C7, [i +1,1-1}), and empty(C7, [bp (l) + l,j - 1})), c PG c G c (i<l< i') and 0" < bpc{l) < J) Ubp (l) > 0 G + BE(l,i',j'bpc(l)) + WJ{ _ a'+2b' lll>o{l)+iU:l l)+ (weakly closed(G, [i + 1,1 - 1]) and weakly closed(G, [bp (l) + l,j - 1])) and (i<l< i') and c W < bpa(l) < j) if bp {l) > 0. (weakly closed(G, [i + 1,1 - 1}) (A.16) G + BE(l. i/,f,bp (l)) +a' + 2b''+ c'(j - bp{G, I) + 1) and / c empty(G,[6p (0+ 1,7-1])) and {i<l<i>) , . and G (f<b (l)<j) if bp {l) > 0 (weakly closed(G, [bp (l) + l,j - 1]), and PG (b)a' + 2b' + c'{L-i + l) + BE(l,i',j',b (l)) PG + {b,,c(i)+i),{j-i) WJ, G G empty(G, [?:+ 1,/ - 1])) and [i < I, < i') and (f < b (l) < j) PG Case (1) handles the case that base pairs i.j and (i + l).(j — 1) of G form a stacked loop in the band. Case (2) handles the case that i.j and l.bp (l) are the base pairs of an internal loop of G. G Appendix A. Recurrences 71 Appendix A. Recurrences 72 Case (3) handles a similar situation as in case (2) except that there are other closed regions in both of the regions [i.l] and [6pc(0>ilCase (4) handles the case that the region [&7->G(0IJ]> is empty, and so we must pay the unpaired base penalty c'foreach unpaired base. In this case, the left side, [i.l] must not be empty. Case (5) is the same as case (4) except the left side is empty and the right is not empty.
- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Theses and Dissertations /
- RNA secondary structure prediction using hierarchical...
Open Collections
UBC Theses and Dissertations
Featured Collection
UBC Theses and Dissertations
RNA secondary structure prediction using hierarchical folding Hosna, Hosna 2007
pdf
Page Metadata
Item Metadata
Title | RNA secondary structure prediction using hierarchical folding |
Creator |
Hosna, Hosna |
Publisher | University of British Columbia |
Date Issued | 2007 |
Description | Algorithms for prediction of RNA secondary structure— the set of base pairs that form when an RNA molecule folds— are valuable to biologists who aim to understand RNA structure and function. Improving the accuracy and efficiency of prediction methods'is an ongoing challenge, particularly for pseudoknotted secondary structures, in which base pairs overlap. This challenge is biologically important, since pseudoknotted structures play essential roles in functions of many RNA molecules, such as splicing and ribosomal frameshifting. State-of-the- art methods, which are based on free energy minimization, have high runtime complexity (typically θ(n⁵) or worse), and can handle (minimize over) only limited types of pseudoknotted structures. We analyze a new approach for prediction of pseudoknotted structures, motivated by the hypothesis that RNA structures fold hierarchically, with pseudoknot free (non-overlapping) base pairs forming first, and pseudoknots forming later so as to minimize energy relative to the folded pseudoknot free structure. Our HFold algorithm, based on work of S. Zhao, uses two-phase energy minimization to predict hierarchically-formed secondary structures in 0(n³) time, matching the complexity of the best algorithms for pseudoknot free secondary structure prediction via energy minimization. Our algorithm can handle a wide range of biological structures, including kissing hairpins and nested kissing hairpins, which have previously required θ(n⁶) time. We also report on the experimental evaluations of HFold and present thorough analyses of the results. We show that if the input structure to the algorithm is correct, running the algorithm results in 16% accuracy improvement on average over the accuracy of the true pseudoknot free structures. However if the input structure is not correct, the accuracy improvement is not significant. If the first 10 suboptimal foldings are given as input to our algorithm instead of just the minimum free energy structure (MFE), the prediction accuracy improves significantly over the accuracy of the MFE structures. This improvement is even more when the number of suboptimal foldings as input to our algorithm increases. The comparison of the energy of the structures predicted by HFold on the true pseudoknot free structures with the energy of the true structures calculated using a different method with the same energy model shows that the energy model may be the cause for the cases for which HFold predicts structures far from the true structures. Our experimental result provides some ways in which the hierarchical folding hypothesis might need to be refined. |
Genre |
Thesis/Dissertation |
Type |
Text |
Language | eng |
Date Available | 2011-03-03 |
Provider | Vancouver : University of British Columbia Library |
Rights | For non-commercial purposes only, such as research, private study and education. Additional conditions apply, see Terms of Use https://open.library.ubc.ca/terms_of_use. |
DOI | 10.14288/1.0052065 |
URI | http://hdl.handle.net/2429/32001 |
Degree |
Master of Science - MSc |
Program |
Computer Science |
Affiliation |
Science, Faculty of Computer Science, Department of |
Degree Grantor | University of British Columbia |
Campus |
UBCV |
Scholarly Level | Graduate |
Aggregated Source Repository | DSpace |
Download
- Media
- 831-ubc_2007-0443.pdf [ 2.9MB ]
- Metadata
- JSON: 831-1.0052065.json
- JSON-LD: 831-1.0052065-ld.json
- RDF/XML (Pretty): 831-1.0052065-rdf.xml
- RDF/JSON: 831-1.0052065-rdf.json
- Turtle: 831-1.0052065-turtle.txt
- N-Triples: 831-1.0052065-rdf-ntriples.txt
- Original Record: 831-1.0052065-source.json
- Full Text
- 831-1.0052065-fulltext.txt
- Citation
- 831-1.0052065.ris
Full Text
Cite
Citation Scheme:
Usage Statistics
Share
Embed
Customize your widget with the following options, then copy and paste the code below into the HTML
of your page to embed this item in your website.
<div id="ubcOpenCollectionsWidgetDisplay">
<script id="ubcOpenCollectionsWidget"
src="{[{embed.src}]}"
data-item="{[{embed.item}]}"
data-collection="{[{embed.collection}]}"
data-metadata="{[{embed.showMetadata}]}"
data-width="{[{embed.width}]}"
async >
</script>
</div>
Our image viewer uses the IIIF 2.0 standard.
To load this item in other compatible viewers, use this url:
http://iiif.library.ubc.ca/presentation/dsp.831.1-0052065/manifest