R N A Secondary Structure Prediction Using Hierarchical Folding . b y Hosna Jabbari B.Sc, The University of Victoria, 2005 A THESIS SUBMITTED IN PARTIAL FULFILMENT OF T H E REQUIREMENTS FOR T H E D E G R E E OF Master of Science in The Faculty of Graduate Studies (Computer Science) The University Of British Columbia August, 2007 © Hosna Jabbari 2007 A b s t r a c t Algorithms for prediction of RNA secondary structure— the set of base pairs that form when an RNA molecule folds— are valuable to biologists who aim to understand RNA structure and function. Improving the accuracy and efficiency of prediction methods'is an ongoing challenge, particularly for pseudoknotted secondary structures, in which base pairs overlap. This challenge is biologically important, since pseudoknotted structures play essential roles in functions of many RNA molecules, such as splicing and ribosomal frameshifting. State-of-the-art methods, which are based on free energy minimization, have high run-time complexity (typically 0(n5) or worse), and can handle (minimize over) only limited types of pseudoknotted structures. We analyze a new approach for prediction of pseudoknotted structures, mo-tivated by the hypothesis that RNA structures fold hierarchically, with pseudo-knot free (non-overlapping) base pairs forming first, and pseudoknots forming later so as to minimize energy relative to the folded pseudoknot free structure. Our HFold algorithm, based on work of S. Zhao, uses two-phase energy min-imization to predict hierarchically-formed secondary structures in 0(n*) time, matching the complexity of the best algorithms for pseudoknot free secondary structure prediction via energy minimization. Our algorithm can handle a wide range of biological structures, including kissing hairpins and nested kissing hair-pins, which have previously required 0(vi6) time. We also report on the experimental evaluations of HFold and present thor-ough analyses of the results. We show that if the input structure to the algo-rithm is correct, running the algorithm results in 16% accuracy improvement on average over the accuracy of the true pseudoknot free structures. However if the input structure is not correct, the accuracy improvement is not.significant. Tf the first 10 suboptimal foldings are given as input to our algorithm instead of just the minimum free energy structure (MFE), the prediction accuracy im-proves significantly over the accuracy of the MFE structures. This improvement is even more when the number of suboptimal foldings as input to our algorithm increases. The comparison of the energy of the structures predicted by HFold on the true pseudoknot free structures with the energy of the true structures calculated using a different method with the same energy model shows that the energy model may be the cause for the cases for which HFold predicts struc-tures far from the true structures. Our experimental result provides some ways in which the hierarchical folding hypothesis might need to be refined. iii C o n t e n t s A b s t r a c t ii Con t en t s iii L i s t o f Tables v L i s t o f F i g u r e s vi A c k n o w l e d g e m e n t s x 1 I n t r o d u c t i o n 1 1.1 R N A Secondary Structure 1 2 B a c k g r o u n d on R N A Seconda ry S t r u c t u r e 5 2.1 Notation ' 5 2.2 Region and Loop Classification and Related Definitions 6 2.3 Bi-secondary and Density-2 Structures 9 2.4 Energy Model '. 11 2.4.1 Pseudoknot free energy model 11 2.4.2 Pseudoknotted energy model 12 3 A l g o r i t h m 14 3.1 A Useful Property of Density-2 Structures . . . '. 14 3.2 High Level Description of HFold 18 4 R e s u l t s 23 4.1 Accuracy ' 24 4.2 Dataset 24 4.3 Accuracy when input is the true pseudoknot free structure . . . . 24 4.4 Accuracy when input is the M F E pseudoknot free structure . . . 29 4.5 Best accuracy, taken over suboptimal folds 33 ' 4.6 Accuracy of the lowest energy structure, taken over suboptimal folds 36 4.7 Peeling Techniques 37 4.7.1 Accuracy when input is the M F E pseudoknot free struc-ture with 1-3 base pairs removed from inside or outside the stems that close hairpin loops 37 Contents ' iv 4.7.2 Accuracy when input is the M F E pseudoknot free struc-ture with 1-3 base pairs removed from both inside and outside each stem that closes a hairpin loop 39 4.7.3 Accuracy when input is the M F E pseudoknot free struc-ture with stems of size 1-3 removed 44 4.7.4 Accuracy when input is the M F E pseudoknot free struc-ture with loosened loops 44 4.8 Base Pair Confidence . . . . : . .' 45 4.9 Energy Comparison 45 5 C o n c l u s i o n a n d F u t u r e W o r k 52 B i b l i o g r a p h y 54 A R e c u r r e n c e s 57 A . l W,,.; 57 A.2 Wlij ' . ' 58 A.3 Wfrl , '• • • 60 A.4 VP-j 61 A.5 VPi.i".j 64 A.6 V,/ 65 A.7 VBkj 65 A.8 VM,/ . . . : : 66 A.9 WMij 66 A. 10 WMBi,j 67 A . 11 WMB'-'.l 68 A.12 BE[iJ,j',j) ' 69 List o f Tables V 2.1 Energy parameters 13 4.1 Pseudoknot free sequences used for secondary structure predic-tion. In order, the columns provide (1) sequence ID, as found in the database or paper from which we obtained the sequence, and their length in parenthesis; (2) type of sequence: and (3) the reference • from which the sequence was obtained 25 4.2 Sequences with H-type pseudoknots used for secondary structure prediction. In order, the columns provide (1) sequence ID, as found in the database or paper from which we obtained the sequence, and their length in parenthesis: (2) type of sequence: (3) structure of sequence (H-type pseudoknot = H, H-type pseudoknots with nested structures = H*): and (4) the reference from which the sequence was obtained 50 4.3 Sequences with kissing interactions and the density-3 sequence used for secondary structure prediction. In order, the columns provide (1) sequence ID, as found in the database or paper from which we obtained the sequence, and their length in parenthesis; (2) type of sequence; (3) Structure of the sequence (kissing interactions = K, density-3 structure = D3); and (4) the reference from which the sequence was obtained. 51 vi List of Figures 1.1 A pseudoknot free structure (top), an H-type pseudoknotted struc-ture (center) and a kissing hairpin (bottom). Figures were generated by PseudoViewer [10] 4 2.1 An H-type pseudoknotted structure (left) and a pseudoknot free structure (right) in graphical (top) and arc diagram (bottom) formats 6 2.2 Pseudoknot 7 2.3 Density-2 structures 10 2.4 Bi-secondary structure that is not density-2 10 2.5 The figure shows a density-2 secondary structure for a sequence of length n. /?.;.,• is the structure restricted to the region (for i < j < n) and dj is that part of /?,,;, above the horizontal line. The circle dots show positions of j where R;J is a prefix of a density-2 pseudoloop with respect to G, and the crosses show positions of j where Rij is not a prefix of a density-2 pseudoloop with respect to G 11 3.1 Illustration of band borders in Lemma 3 and Lemma 4 17 3.2 Illustration of how the WMB recurrence unwinds, to calculate WMBij. Arcs above the horizontal line from i to j represent base pairs of C,;,, and arcs below the line represent base pairs of G'rr Case (a) of the WMB recurrence handles the overall structure whose energy is WMBij, with I = l\. with terms to account for energies of the right upper band (BE) and right lower closed subregion ( W](Ll + \).(in,c(i,'., )-i)) a s w e ' l a s the remaining prefix (WMB/ih). The term WMBliM is handled by case (a) of the WMB1 recurrence, with / = l-i and terms to account for the lower right substructure labeled VPi2iil, the upper left band (BE), and the remaining "prefix" of the overall structure (WMB'i^li_l-j). WMBl,l^_l-} is then handled by case (b) of the WMB1 recurrence, with I = l-i, and terms to account for WI^3+l)^2_i^ and WMBlihi. Finally, the WMB'lh term is handled by end case (c) of the WMB1 recurrence 20 3.3 Visual illustration of recurrences in HFold 22 List of Figures vii 4.1 Accuracy of HFold when the input is the true pseudoknot free secondary structure (G/,iS), versus the accuracy of GyUJ struc-tures. The horizontal axis represents the accuracy of structures predicted by HFold while the vertical line represents the accu-racy of the input structures. Pseudoknotted and pseudoknot free structures are presented by 'x' and 'O', respectively 26 4.2 A pseudoknot free structure given as input to HFold (top) and the corresponding HFold pseudoknot free result (bottom). The overall structure remains pseudoknot free after HFold, but more base pairs are added to the loops. For example, the large internal loop starting at 1 and ending at 164 on the top figure is broken into two smaller loops in the bottom figure. Similarly the loops of the branch starting at 51 and ending at index 90 are broken into smaller loops 27 4.3 CH,, structure given as input to HFold (top), the correspond-ing HFold result (center) and the true pseudoknotted structure (bottom). HFold adds 3 pseudoknot free stacked base pairs at 2.40, 3.39, and 4.38, while the true pseudoknotted structure has pseudoknotted base pairs at 18.36, 19.35, and 20.34 28 4.4 Predicted pseudoknotted structure (top) and the true pseudo-knotted structure (bottom). HFold predicts interleaved bands, while the true pseudoknotted structure has 5 separate H-type pseudoknots 30 4.5 Accuracy of HFold when the input is the true pseudoknot free sec-ondary structure (G.,,,-,„,(j) (horizontal axis), versus the accuracy of Gsm„u structures (vertical axis). Pseudoknotted and pseudo-knot free structures are presented by 'x ' and 'O', respectively. . . . 31 4.6 Accuracy of HFold when the input is the MFE structure (hori-zontal axis), versus the accuracy of the MFE structures (vertical axis). Pseudoknotted and pseudoknot free structures are pre-sented by [ \ ' and ; 0 ' . respectively 32 4.7 MFE structure (top), true pseudoknotted structure (bottom). Tire true pseudoknotted structure has 4 stacking base pairs at 1.14, 2.13, 3.12, 4.11 and another stem of size 4 at 6.26, 7.25, 8.24, and 9.23, while in the MFE structure the first stem is a shift of the true structure (3.15, 4.14, 5.13, and 6.12). In this cases HFold does not add any base pairs to the input structure. 33 4.8 Accuracy of HFold on the best-of-25 (horizontal axis), versus the accuracy of the MFE structures (vertical axis). Pseudoknotted and pseudoknot free structures are presented by 'x' and 'O', re-spectively 34 4.9 Accuracy of HFold on the best-of-25, versus the accuracy of the best-of-10. Pseudoknotted and pseudoknot free structures are presented by ;x : and ! 0 ' , respectively 35 List of Figures viii 4.10 Accuracy of HFold on the minimum energy structure over the first 25 suboptimals, versus the accuracy of HFold on the M F E structure. Pseudoknotted and pseudoknot free structures are pre-sented by ;x' and ' O ' , respectively 36 4.11 The pseudoknotted structure predicted by HFold (top), the cor-responding M F E structure (center) and the true structure (bot-tom). Although HFold correctly predicts parts of the pseudoknot in the structure on the top, the accuracy of the structure pre-dicted by HFold is lower than the accuracy of the corresponding M F E structure, because there are some extra pseudoknotted base pairs predicted by HFold. Note that the M F E structure has all the correct pseudoknot free sub-structures (except the stem clos-ing at 1.114). High pseudoknot initiation penalty, and low band penalty are the main causes for such predictions by HFold. . . . 38 4.12 Accuracy of the M F E structures with 2 base pairs removed from inside each stem that closes a hairpin loop (horizontal axis), ver-sus the accuracy of the M F E structures (vertical axis). Pseudo-knotted and pseudoknot free structures are presented by 'x ! and l O \ respectively • 40 4.13 Accuracy of HFold when input is the M F E structures with 2 base pairs removed from inside each stem that closes a hairpin loop (horizontal axis), versus the accuracy of HFold when input structure is M F E (vertical axis). Pseudoknotted and pseudoknot free structures are presented by 'x' and ' O ' , respectively 41 4.14 Accuracy of the M F E structures with 1 base pair removed from both ends of each stem (horizontal axis), versus the accuracy of the M F E structures (vertical axis). Pseudoknotted and pseudo-knot free structures are presented by 'x' and ' O ' , respectively. . . 42 4.15 Accuracy of HFold when input is the M F E structures with 1 base pair removed from both ends of each stem (horizontal axis), versus the. accuracy of HFold when input structure is M F E (ver-tical axis). Pseudoknotted and pseudoknot free structures are presented by 'x ! and ' O ' , respectively 43 4.16 Accuracy of HFold when the input is the pseudoknot free struc-tures with base pair confidence above 75% (horizontal axis), ver-sus the accuracy of the M F E structures (vertical axis). Pseudo-knotted and pseudoknot free structures are presented by 'x' and ' O ' , respectively 46 4.17 Comparison of HFold energy and true structure energy. The hori-zontal axis represents the energy of structures predicted by HFold when the input is G/,;,,, while the vertical axis represents the en-ergy of the "true" structures. Pseudoknotted and pseudoknot free structures are presented by 'x ' and ' O ' , respectively 48 4.18 Comparison of energy of G/„:fl (horizontal axis) and the energy of the M F E structure (vertical axis). Pseudoknotted and pseudo-knot free structures are presented by 'x' and ! 0 ; , respectively. . . 49 List of Figures ix A.l Illustration of Cases for Wij. 58 A.2 Illustration of Cases for Wlij.- 59 A.3 Illustration of cases for Wt .j. In case (5), the plotted structure from i to j could contain more than 2 bands (not illustrated) . . 60 A.4 Illustration of Cases for VPitj -. . 63 A.5 Illustration of Cases for VP\j '65 A.6 Illustration of Cases for VMi:j 66 A.7 Illustration of Cases for WMBij 67 A.8 Illustration of Cases for WMB'.i.j 69 A.9 Illustration of Cases for BE-,.-,;',,<,,• 71 X Acknowledgements I would like to thank my supervisor, professor Anne Condon, for her endless sup-port and encouragements. She introduced me to a new and interesting branch of science that I did not know about before. I cannot thank her enough for all she has done for me. I would also like to thank Dr. Will Evans for being my second reader and for his helpful comments. Many thanks go to people of Beta lab for making Beta such a lovely place to work. Last, but not least, 1 would like to thank my parents, my sisters and my wonderful husband for believing in me.through all stages of my studies. Without their love and support 1 could not have done anything. Hosna Jabbari. To my best friend and wonderful husband, Majid. 1 Chapter 1 Introduction 1.1 R N A Secondary Structure R N A molecules aid in translation and replication of the genetic code, catalyze cellular processes, and regulate the expression level of genes [7]. Structure is key to the function of R N A molecules, and so methods for.predicting R N A structure from the base sequence are of great value. Currently, prediction methods focus on secondary structure - the set of base pairs that form when the R N A molecule folds. There has been significant success in prediction of pseudoknot, free sec-ondary structures, which have no crossing base pairs (see F ig . 1.1). State-of-the-art prediction algorithms, such as Mfold [18] or RNAfo ld [11] find the structure with minimum free energy ( M F E ) from the set of all possible pseudoknot free secondary structures. The energy of a structure is estimated as the sum of en-ergies of loops that form when the molecule folds, where the loop energy values are'provided by biologists. While many small R N A secondary structures are pseudoknot free, pseu-doknots do arise frequently in biologically-important R N A molecules, both in the cell [24, 27], and in viral R N A [6]. Examples include simple H-type pseu-doknots, with two interleaved stems, which are essential for certain catalytic functions and for ribosomal frameshifting [2], as well as kissing hairpins, which are essential for replication in the coxsackie B virus [19]. Unfortunately, M F E pseudoknotted secondary structure prediction is N P -hard [1, 13, 14], even for a simple energy model that depends on base pairs but not on unpaired bases. Polynomial-time MFE-based approaches to pseudo-knotted structure prediction have been proposed [1, 8, 21, 22, 26], with respect to various sum-of-loops energy models for pseudoknotted structures, which find the M F E structure for a given input sequence, from a. restricted class of struc-tures. A class of structures can be defined by specifying allowable patterns of interleaving among base pairs. For example, Mfold and RNAfo ld handle the class of pseudoknot free secondary structures; we provide more examples later. We say that a structure' R can be handled by a given algorithm if R is in the class of structures over which the algorithm optimizes. Algorithms for M F E pseudoknotted secondary structure prediction trade off run-time complexity and generality - the class of structures handled, that is, the class of structures over which the algorithms optimize. For example, kissing hairpins are not in the class of structures handled by the 0 ( n 5 ) algorithms of Akutsu [1] and Dirks and Pierce [8] but are in the class handled by the Q(n6) algorithm of Rivas and Eddy [22]. (We note that, even when the true Chapter 1. Introduction 2 structure R for a'sequence is handled by an algorithm, the algorithm still may not correctly predict R, because correctness depends not only on the generality of the algorithm but also on the energy model and energy parameters.) Our work is motivated by two limitations of MFE-based algorithms for pseu-doknotted secondary structure prediction: they have high time, complexity, and ignore the folding pathway from unfolded sequence to stable structure. Sev-eral experts have provided evidence for, and support, the hierarchical folding hypothesis [16, 25], which is succinctly stated by Tinoco and Bustamante as follows: "An RNA molecule [has] a. hierarchical structure in which the primary sequence determines the secondary structure which, in turn, determines its ter-tiary folding, whose formation alters only minimally the secondary structure" [25]. (These and other authors consider the initially-formed secondary structure to be pseudoknot free, and refer to base pairs that form pseudoknots as part of the tertiary structure. However, here we refer to all canonical base pairs, namely A-U, C-G, and G-U, as secondary structure.) We note that while the hierarchical folding hypothesis is a common assumption, some counter examples have been reported, notably formation of the structure of a subdomain of the Tetrahyinena thermophila group I intron ribozyme [29]. However, even in this case, 15 of the 19 base pairs in the initially-formed pseudoknot free secondary structure are retained upon formation of tertiary structure, and the 4 missing base pairs lie at the ends of stems. This work is a continuation of the M.Sc. work of Zhao [30], in which she presented a novel and efficient algorithm to predict RNA secondary structures, in a manner consistent with a natural formalization of the hierarchical folding hypothesis. She defined the problem of predicting the secondary structure as follows: given a sequence S and a pseudoknot free secondary structure G (a set of base pairings), find a pseudoknot free secondary structure G' (a set of base pairings disjoint from G) for S, such that the free energy of G U G' is less than or equal to the free energy of GUG" for all pseudoknot free structures G" / G'. As with algorithms for MFE pseudoknotted secondary structure prediction, algorithms,for Hierarchical-MFE secondary structure prediction may handle a restricted class of structures. That is, the type of structure formed by G U G" may have restricted patterns of interleaving among base pairs. Since both G and G" are pseudoknot free, the most general class of structures that could be handled by an algorithm for hierarchical-MFE secondary structure prediction would be the hi-secondary structures of Witwer et al. [28] - those structures which can be partitioned into two pseudoknot free secondary structures G and G'. There is no known way to solve the hierarchical-MFE prediction for the class of bi-secondary structures. Instead, Zhao suggested a solution with respect to a subclass of the bi-secondary structures, which she called density-2 structures, explained in Section 2. This is quite a genera) class, including H-type pseudoknots and kissing hair-pins, as well as structures containing nested instances of these structural motifs. The only known algorithm for predicting MFE nested kissing hairpins, that of Rivas and Eddy, requires Q(nb) time. Rastegari and Condon [20] showed that, out of a set of over 1,100 biological structures, all but nine are density-2 (when Chapter 1. Introduction 3 isolated base pairs are removed), and six of these nine are also not in the class handled by Rivas and Eddy's algorithm. The main contributions of this work are: • refining and fixing the original recurrences presented in the work of Zhao [30], • defining and proving four lemmas, that help establish corrections of Zhao's recurrences, • implementing the HFold algorithm, and • throughly analyzing its performance on a dataset of 70 strands. We published a preliminarily version of this work in [12]. In Chapter 2, we present some useful background information and notations pertaining to RNA structure prediction. In Chapter 3, we summarize HFold, a. dynamic programming algorithm that solves the Hierarchical-MFE secondary structure prediction problem for the class of density-2 secondary structures in 0(n3) time and 0(n2) space, and present our lemmas. In Chapter 4, we describe the results of our experimental analysis. Our ex-perimental evaluation of HFold shows that, when provided with the true pseu-doknot free substructure for the input sequence, HFold adds pseudoknots which, on average, improve the accuracy (measured as fraction of correctly predicted bases) by 16%. However, HFold does not significantly improve accuracy when given as input a computational prediction of the MFE pseudoknot free secondary structure G, since HFold cannot correct errors in G. Chapter 1. Introduction 4 50 40 U-C--A •U-G P-6' 20 A . J " - G \ 60 C A' ,65 A - U 10 A - G - G - U - G - G - C r f-.-A-U-U-G r ""A-G-G U .u-G-G-G;-6 -G (j—-45 C G C" C "C G' U "& I • C "A J G T, iv "A; f A' "C' A' U" V c C "A' C U "G A" *G~-A C *A' G' V G' "G"' G' t ' G'" T ,U' t' G" t" A 'A G *•: U V G' c A" Figure 1.1: A pseudoknot free structure (top), an H-type pseudoknotted struc-ture (center) and a kissing hairpin (bottom). Figures were generated by PseudoViewer [10]. " 5 Chapter 2 Background on R N A Secondary Structure A n R N A molecule is a sequence of nucleotides, or bases, of which there are four types: Adenine (A), Guanine (G), Cytosine (C), and Uraci l (U). The molecule has chemically distinct ends, called the 5' and 3' ends. We model an R N A molecule as a sequence over the alphabet {A,C, G, £ /} , with the left end of the sequence being the 5' end. Throughout, n denotes the length of an R N A sequence. We index the bases consecutively from the 5' end starting from 1, and refer to a base by its index. When an R N A molecule folds, bonds may form between certain pairs of bases, where each base may pair with at most one other base. A secondary structure R is a set of pairs < i < j < n, such that no index occurs in more than one pair. The pair i.j denotes that the base indexed i is paired with the base indexed j. The canonical base pairs, which form the secondary structure, are the Watson-Crick pairs A-U and C-G, as well as the wobble pair G-U (see Fig . 2.1). 2.1 Notation We use the following notations when describing our algorithms. Throughout, we use R to denote a secondary structure. • bpni'i')'- We let bpa(i) denote the index of the base that is paired with base i in R. if any; otherwise bpp.(i) = 0 . • paired(.R, i ) : true if and only if i is paired in the structure R. • cross: if i.j, i'.j', and i < i' < j < j', we say that pair i.j crosses pair ?/.;/' (and i'.j' crosses i.j). • p s e u d o k n o t t e d base pai r : We say that i.j is a pseudoknotted base pair if for some other base pair i!.j' in structure R, i.j crosses i ' . j ' . We also refer to i and j as pseudoknotted base indices. • p seudokno t free secondary s t ruc ture : If there are no pseudoknotted base pairs in the given structure, it is called pseudoknot free secondary structure. Chapter 2. Background on RNA Secondary Structure 6 NOT Weakly closed Figure 2.1: An H-type pseudoknotted structure (left) and a pseudoknot free structure (right) in graphical (top) and arc diagram (bottom) formats. • cover: Let G be a pseudoknot free secondary structure. Base pair i.j covers base fc if i < k < j and there is no other base pair i'.j' £ G with i < i' < k < j' < j. In this case, we denote i.j by cover (G.k). Otherwise cnver{G,k) = (-1,-1). • isCovered(G, fe): true if and only if some base pair of G covers fe. • region Sequence of indices between i and j inclusive. • disjoint regions: Two regions and ['«',/] are disjoint if no index is in both regions, i.e. j < i! or j ' < i. 2.2 Region and Loop Classification and Related Definitions Following is the terminology used to refer to loops and other standard elements in a, secondary structure. These definitions are mostly taken from and illus-trated in the work of Rastega.i'i and Condon [20]. Throughout, definitions are with respect to a. fixed secondary structure R. Generally we use R to refer to a structure that may be pseudoknotted (that is, contains at least one pseudoknot-ted base pair), and use G to refer-to a structure that we know to be pseudoknot free. Chapter 2. Background on RNA Secondary Structure 7 1 3 9 II 15 18 23 27 34 39 Figure 2.2: Pseudoknot • empty(/?.. [i,j\) with respect to secondary structure R: true if region contains no base pair in R. Formally, V7c,i < k < j,paired(R,k). • w e a k l y c losed reg ion: A region is weakly closed if no base pair connects a base in the region to a base outside the region. Formally, is weakly closed if and only if for all fc 6 either bpn(k) £ [i,j] or bpn{k) = 0. Weakly closed(R, is true if and only if is a weakly closed region oi R. .• c losed region: A weakly closed region with at least two bases, is closed if it cannot be partitioned into two smaller weakly closed regions. Formally, is closed if and only if i < j, \i,j) is weakly closed, and for all I £ — 1], neither [i,L] nor [/ + is weakly closed. Note that if [i.j] is closed then both i and j must be paired (although not necessarily with each other) [20]. • p seudokno t free c losed r eg ion : A'closed region that does not contain any pseudoknotted base pairs. • p s e u d o k n o t t e d closed reg ion : A closed region of a structure R, such that i.bpn(i) and bpn(j).j are pseudoknotted base pairs. We refer to • indices i. and j as the left and right borders of the pseudoknotted region M -• d i r e c t l y banded in : For a pseudoknotted base pair i.j, we say i.j is directly banded in base pair i'.j' and write i.j ^ i'.j' if: (1) ,:' < i < j penalty for unpaired base in a 0 multiloop that spans a band 14 Chapter 3 Algori thm 3.1 A Useful Property of Density-2 Structures As will become clearer later, the reason that the HFold algorithm works for density-2 structures is because of the following lemmas, which are key for effi-cient decomposition of energies in the recurrences. Roughly, the lemmas show how to calculate the band borders for a given region. L e m m a 1 LetG andG' be disjoint, pseudoknot free, secondary structures, such that Gl)G' is a density-2 secondary structure, and let i, j be the start and end of a pseudoloop of G U G'. Let, I, £ [i + l,j] be such that J. I is paired in G' (but, not in G), 2. I is not in any closed proper subregion of [i,j] 'with respect to GU G', and 3. 3r : i < r < I. < bpc(r) < j. Let &(,: () = niin{fc|?' < k < I < bpc(k)}, and 6'(u) = im\x{k\i < k < I < bpG(k)}. Then, the structure G U C ' contains a band with outer base pair b(n).bpc(b(i.i)) and inner base pair b'^ - ^.bpc^b'^ (j) . Note that restriction (3) ensures that b^^ and b'^ ^ are well defined. The bot-tom left part of Fig. 2.1 illustrates Lemma 1, showing the borders of the band whose arcs cross the base pair involving base I = 14. If [i,j] is the region [2, 30], then b(2,i4) = 2 and &'(2.14) = 6. Proof Throughout the proof we assume that I < bpc(l), but the proof can easily be modified to handle the case when bpc(l) < I by substituting bpa>(l).l instead of l.bpc(l) everywhere in the proof. Since i is the start of a pseudoloop and j is the end of the pseudoloop, [i.j] must be a pseudoknotted closed region of G U G'. Restriction (1) implies hpG>(l) € since if it is not, then [i.j] is not a closed region of G U G'. We claim that for any r satisfying restriction (3), l.bpc(l) crosses r.bpc(r). Chapter 3. Algorithm 15 ULbpcH) does not cross r.bpc(r), then it must be that i < r < I < bpc{l) < bpc(r). Also, we must have one of the following cases: • l.-bpc(l) does not cross another base pair. Then region [l.bpc(l)[ is a closed subregion of [i.j] with respect to GUG', and thus, / is in a closed subregion of with respect to G U G', which is a contradiction. • l.bpc'(l) crosses another base pair, say m.bpc(m). Then I.bpc (I) is in a band associated with a pseudoloop. We claim that, this pseudoloop is a subregion of [r. bpc('»')]. Otherwise, it must also be that r.bpc(r) is in a band associated with the pseudoloop. Moreover, the bands containing r.bpc{r). m.bpc{m) and l.bpc(l) must all be distinct, based on the defini-tion of band and the fact: that; I.bpc (I.) is in [r,6pc? (»•)]> and m,.bpc(in) G G and thus, cannot cross r.bpc{r). A line drawn vertically at position m or at position / must cross all three bands. This is because m.bpc(m) crosses l.bpc(l) which is in [r, bpc(r)]. Therefore, G U G' must have density 3, contradiction. Therefore, our assumption is incorrect and I.bpc {I) crosses r.bpc (''')• Let b\.bpo{b\), and &2-^ Pc(&2) be the outer and the inner base pairs of the band containing r.bpc^r) that l.bpc{l) crosses. We have i < b\ < b-2 < I < bpG(b2) < bpc{b,) < j. Now we prove that b(u) = b\. Since i < b\ < I < bpc (I) it must be that 6(v,j) < bi, by the definition of If &(,;;) < b\, then we have &(.,;() < 61 < bpc{b-\) < bpc(b(i,t)), since C is pseudoknot free. By same argument as used above, we conclude that 6(io-&Pc(&(».!)) crosses I.bpc (I). Thus, &(u).&pc(fy»,n) is the outer base pair of the band, which is a contradiction. Thus our assumption of &(,-,/) / 61 does not hold, and b^i) = bi. Similarly we can show that b'^ ^ = I>2- • L e m m a 2 LetG a.ndG' be disjoint, pseudoknot free, secondary structures, such that, G U G' is a density-2 secondary structure. Let I be paired in G' (but not in G) and let [i,j] be a region such that I G [i.j] and bpc (I) £ [i,j]- Let mm{k,\i, < k (,:,() = 00 and 6|.; ^ = — 1. Let us assume i < r, Cover(G,/) = {r,bpc{r)), I = j and bpc{l) = i — 1. Then, r.bpa(r) must cross l.bpc>{l), since i < r and based on the definition of cover, we must have r < I < bpc(r). We claim that both b(,; () and 6|.; ^ have finite positive values. Moreover, b'^ ^ = r. Assume 6'^ '^ ^ r. Then, 6|.; ^.bpcib^ <)) must be in [r,bpc{r)} (based on the definition of b'^^). Therefore we have r < b'^ ^ < I < bpc(b'^ / < bpG{r), contradiction. Thus, 6{4 n = r. Now, clearly b(.,;,j) ^ 00. We claim that if Chapter 3. Algorithm 18 '-'(•;,() 6(,;()) then b'^ ^.bpcib'^ tj) is in [b^jybpcib^^)}. Otherwise, the bands containing &(•;./)-VG(&(U))> b'^ ^ .bpdb'^ Q) and l.bpc(l) must be distinct. A vertical line drawn vertically at position I crosses 3 bands, thus G U G" has density 3, contradiction. • Lemma 4 LetG and G' be disjoint, pseudoknot free, secondary structures, such that G U G" is a density-2 secondary structure. Let I be paired in G' (but not, in G) and, let [l,j] be a region such that I £ [l,j] and bpo>(l) Let £((.,•) = nvA\{bpc(k)\k < I < bpG{k) < j) U {-1}, and B{Lj) = mm{bpc(k)\k j, Wi.j = 0, since the subsequence is empty. Otherwise, it must either be that i.j is a base pair in the MFE structure for Chapter 3. Algorithm 19 Si.. .s.j. or that the MFE structure can be decomposed into two independent subparts. These two cases correspond to the first two rows of the recurrence for Wij below. Thus, Wij is given by the following recurrence: Wi, = min { ^ V ' ' ... TT. where Viti is the free energy of the MFE structure for s; . . . s.j that contains i.j. The recurrence for V; , can in turn be expressed in terms of the energies of the hairpin loop {en{i.j)).. an internal loop, or a multiloop closed by i.j. and multiloops. If * > j, V.L.j is set to be oo. Otherwise, i.j closes a hairpin, an internal loop, or a multiloop in the MFE structure for ... .s.j. Thus V^j can be expressed as the minimum of the free energies attainable in three cases: f eH(i,j), V(i,j) = min < min,.,./ e»„e(i,r,r /»j) + Vrr', { VM'itj where en{i,j) and e,: r l t(v'\ r', j) are as given in Table 2.1, and VMij is the energy of a MFE structure for ,s,;. . . s.j in which i.j closes a multiloop. We extend the definition of Witi for the hierarchical folding algorithm as follows. Let G be a given pseudoknot free structure for S. If some arc of G covers i or j, then Witj = oo. If i > j, then Wtj = 0. Otherwise we define Wij to be the energy of the MFE secondary structure G,;, U GC for the strand Si... s.j, taken over all choices of G'^ which is pseudoknot free, disjoint from G,; ( , and such that G,;.; U G'- is density-2. In this case, W^j satisfies the following recurrence: Wij = mm < mmi j, since if i > j the structure is empty, and thus cannot be pseudo-knotted. Otherwise, there are two cases, depending on whether j is paired in G or not. In case (a), j is paired in G. Then, in the MFE structure, some base / with bp(j) < I, < j must be paired in G', causing bp(j).j to be pseudoknotted. We minimize the energy over all possible choices of / (note that / must be un-paired in G, since it will be paired in G', which is disjoint from G). By Lemma 1, once / is fixed, the inner base pair of the band whose outer base pair is bp(j).j is also determined. The Pi, + BE term in case (a) of.the recurrence accounts for the energy of the band, a IWterm accounts for a weakly closed region that is in the band, and the remaining energy is represented by the WMB' term. In case (b), j is not paired in C, and, the recurrence is unwound by moving directly to a WMB1 term. Thus, we have: Chapter 3. Algorithm 21 I + ^ i l + H ' / ( 1 + 1 ) , ( „ W ! ( i i i | i ) ) . 1 ) ) , ,(6) WMB',.! if 0 < ftp(j) < j Complementing case (a) of the WMB recurrence, WMB' handles the case that the rightmost band is.not in G, but is part of the structure G'. In the recurrence for WMB'. case (a) is the complex case, accounting for the energy of the region spanned by the rightmost two bands using the 1P\,, VP, and BE terms, and recursively calling WMB1. The band borders in WMB' cases are determined using Lemmas 3 and 4. Case (b) is called when one iteration of WMBi.j or WMB'; • case (a) is done and the rightmost substructure of the over-all "prefix" up to position j is a weakly closed region. Note that Wlij — -f oo when cover (i) ^ cover (j), which avoids entering case (b) as the first iteration of WMB'. Cases (c) and (d) are end cases, where only one or two bands need to be accounted for, respectively and so no recursive call to WMB1 is made. Thus we have: WMB',., = min ((a)2P„ + jnm^ (BEHu)M{i ;»Cu.i.:rv:,I(Ci;J-,l) + WMB'L{l_l)+ VPLj), i{bpc(j) = 0 if bpG(j) < j (b) min {WMB'iii+ WIu+1)tj), i-<- r- 'nu' ' v >' u' G G" TC C. t' . ..5 IJ C "G •no 'C u " A " ' u. A ' A U " U 6-*' " • A u. s-c C-G... V, •' 130 •'A.u 214 C-G Figure 4.4: Predicted pseudoknotted structure (top) and the true pseudoknotted structure (bottom). HFold predicts interleaved bands, while the true pseudo-knotted structure has 5 separate H-type pseudoknots. Chapter 4. Results 31 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Accuracy of HFold on G Figure 4.5: Accuracy of HFold when the input is the true pseudoknot free secondary structure (Gml,,.u) (horizontal axis), versus the accuracy of Gsm,M structures (vertical axis). Pseudoknotted and pseudoknot free structures are presented by 'x' and 'O', respectively. Chapter 4. Results 32 Accuracy after HFold Figure 4.6: Accuracy of HFold when the input is the MFE structure (horizontal axis), versus the accuracy of the MFE structures (vertical axis). Pseudoknotted and pseudoknot free structures are presented by 'x' and 'O', respectively. Chapter 4. Results 33 20 10 .A-C-A*. 'c 'C-G-C-C C G C '.6 c: 26 ~"26 Figure 4.7: MFE structure (top), true pseudoknotted structure (bottom). Tire true pseudoknotted structure has 4 stacking base pairs at 1.14, 2.13, 3.12, 4.11 and another stem of size 4 at 6.26, 7.25, 8.24, and 9.23, while in the MFE structure the first stem is a shift of the true structure (3.15, 4.14, 5.13, and. 6.12). In this cases HFold does not add any base pairs to the input structure. 4.7 shows an example of one of those cases. There are"only two data points presented in Fig. 4.6 that show significant change of accuracy before and after running HFold. In both cases, HFold pre-dicts pseudoknots. In one case the predicted pseudoknotted structure is very similar to the true structure since the MFE structure for the corresponding sequence is very similar to the G\,ig structure of the sequence. In the other case, the true pseudoknotted structure includes one H-type pseudoknot with only 3 pseudoknotted base pairs, but because of the high pseudoknot initiation penalty, HFold predicts one pseudoloop of 3 interleaved bands, thus lowering the accuracy. 4.5 Best accuracy, taken over suboptimal folds Since HFold does not improve accuracy, on average, when using the MFE pseu-doknot free predictions as the input, we also measured the best accuracy ob-tainable using HFold with suboptimal structures as input. For- each sequence in our dataset, we used SimFold to calculate the 25 lowest-energy structures for that sequence. (We chose the 25 lowest-energy structures because, when we compared the 1000 lowest-energy structures from SimFold with the true pseudoknot free structures for each sequence in our dataset, in c Chapter 4. Results 34 Best accuracy of HFold over the first 25 suboptimals Figure 4.8: Accuracy of HFold on the best-of-25 (horizontal axis), versus the accuracy of the MFE structures (vertical axis). Pseudoknotted and pseudoknot free structures are presented by 'x' and :0', respectively. only 39 cases was the true structure found among these 1000 structures and in 29 of the 39 cases the true structure was among the 25 lowest-energy structures.) We'then ran HFold on all 25 structures, and calculated the accuracy of the HFold output on each run. We did the same with the 10 lowest energy structures. Then we took the best accuracy among the 10 lowest-energy structures, and the best accuracy among the 25 lowest-energy structures. When averaged over all sequences, the best-of-10 accuracy is 70% with 4 cases with perfect accuracy and the best-of-25 is 74% with 8 cases with perfect accuracy. These accuracies are a significant improvement over the 59% average accuracy obtained using the MFE structures. Figure 4.8 shows the accuracy results of the best-of-25 versus the accuracy of the MFE structures. As shown in Fig. 4.8, 48 data points are in the lower triangle, showing improvement over the MFE structures. In many cases the improvement is simply due to the fact that the suboptimal structure is close to G (,;.,,. In 20 of the 48 cases, HFold adds nothing; in 15 out of 48 cases, it adds 1 or 2 pseudoknot Chapter 4. Results 35 Best accuracy of HFold over'the first 25 suboptimals Figure 4.9: Accuracy of HFold on the best-of-25, versus the accuracy of the best-of-10. Pseudoknotted and pseudoknot free structures are presented by 'x' and l O ; , respectively. free base pairs; but in the remaining 13 cases it perfectly identifies G„„lau, from which in 5 cases HFold adds extra base pairings, thus, not achieving 100% accuracy. Figure 4.9 shows the improvements achieved using best-of-25 versus best-of-10. As shown in Fig. 4.9, there are only 10 data points in the lower triangle that show significant improvement on best-of-25 versus best-of-10, and the rest of the cases are largely similar. Since the difference between the average best accuracies of best-of-25 and best-of-10 is only 4%, but the number of structures is considerably lower and the improvement over accuracy of the MFE structures is significant even for best-of-10, biologists may prefer to obtain 10 lowest energy structures instead of just the MFE structure to get better prediction. Chapter 4. R.esults 36 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 • Accuracy of HFold on min energy over the first 25 suboptimals Figure 4.10: Accuracy of HFold on the minimum energy structure over the first 25 suboptimals;. versus the accuracy of HFold on the MFE structure. Pseudo-knotted and pseudoknot free structures are presented by 'x' and ' O ' , respectively. 4.6 Accuracy of the lowest energy structure, taken over suboptimal folds In this section, similar to Section 4.5, for each sequence we get the first 25 suboptimal structures, and then we run HFold on those structures. However, in this case we choose the final structure with the lowest energy and compare its accuracy with the accuracy of HFold on the MFE structure for the same sequence. Figure 4.10 shows the results for this case. There are 9 data points in Fig. 4.10 not on the diagonal line. One of these data points is in the lower triangle, showing significant improvement over the accuracy of the MFE structure. In this case, the input structure is very similar to the corresponding Gnu structure but has one extra loop, thus HFold correctly identifies the corresponding Gs„,„u structure and improves the accuracy of the overall structure. For 5 of the 8 data points in the upper triangle, HFold correctly predicts the Chapter 4. Results 37 corresponding Gsinau structure but adds more pseudoknotted or pseudoknot free base pairs to lower the energy. An example of one of those cases is presented in Fig. 4.11. There are some stems in some of the input structures that are "shifts" of stems in the true structure; these shifts are also causes of low accuracies. In those cases HFold does not add any base pairs to the structure. There is not much difference in the "energy" values of the predicted struc-tures with lowest energy over the first 25 suboptimals versus the energy of the predicted structure when input structure is MFE, but some of the structures in the latter case are pseudoknotted, and thus have more similar structure to the true structure than MFE. 4.7 Peeling Techniques One of the main problems with the MFE structures is that they typically contain long stems and short loops, making it difficult for HFold to add enough pseu-doknotted base pairs into the small regions. One possible solution to obtain better input structures is to peel away the steins. This approach is also moti-vated by the process of hierarchical folding, where it has been shown that, upon formation of a pseudoknot, some base pairs at the ends of stems in the initially formed pseudoknot free structures may break [25]. For our testing purposes, we tried the following cases: 1. removing 1-3 base pairs from inside the stems that close hairpin loops (3 cases); 2. removing 1-3 base pairs from outside the stems that close hairpin loops (3 cases); 3. removing 1-3 base pairs from both inside and outside the stems that close hairpin loops (3 cases); 4. removing stems of size m when 1 < m < 3 (3 cases), and 5. peeling enough.base pairs from inside hairpin loops so that there are at least 6 unpaired bases inside each loop (1 case); 4.7.1 Accuracy when input is the M F E pseudoknot free structure with 1-3 base pairs removed from inside or outside the stems that close hairpin loops For our first peeling technique, we removed 1-3 base pairs from inside each stem that closes a hairpin loop of the MFE structures and ran HFold on the resulting structures. Figures 4.12 shows the accuracy of the MFE structures when 2 base pairs are removed from inside each stem that closes a hairpin loop versus the accuracy of the MFE structures themselves, while Fig. 4.13 shows the accuracy of the predicted structure when the input structure is peeled versus the accuracy Chapter 4. Results 38 y ft -„ ^ • ,.<- yK £, -V u - A ,C - c ^ . t 6 A A l j 'A, t i L A ..U ,.b A-110 b-c-c-e-c-A-" ' ' i j - U - UA G-A 'U.IJ. l 4- A ii • u • G U A A -1 C-C-C-y-l-Ht.y-G-c-u' fl.-G •i;-A-U~^--A-^-A-U-^-G.C-A-.. 'A *• H - H - M - U .C .0 i-A J , '-. no' •U-A-G-J.C-C'A-i-G-C-i-A • U - C - C - G - C - A U -t5 C G - G " ~ U -A-<3 -fi-C-d • ' U.A il.A Li . A Figure 4.11: The pseudoknotted structure predicted by HFold (top), the corre-sponding MFE structure (center) and the true structure (bottom). Although HFold correctly predicts parts of the pseudoknot in the structure on the top, the accuracy of the structure predicted by HFold is lower than the accuracy of the corresponding MFE structure, because there are some extra pseudoknotted base pairs predicted by HFold. Note that the MFE structure has all the correct pseudoknot free sub-structures (except the stem closing at 1.114). High pseu-doknot initiation penalty, and low band penalty are the main causes for such predictions by HFold. Chapter 4. Results 39 of the MFE structures. Since the rest of the figures are similar we only present a detailed discussion on this case. Generally speaking, as shown in Fig. 4.12 when the accuracy of the MFE structures is lower than 55%, peeling increases the accuracy of the structure. This is because the MFE structure in these cases has little resemblance to the true structure. However, as shown in Fig. 4.13, in all but one case, HFold adds the peeled base pairs back and creates the. original MFE structure. There is only one data, point on the lower triangle showing improvement over the MFE structure. In this case the peeled structure is similar to Gug, thus HFold iden-tifies the corresponding G„)mii structure correctly, but since it adds more base pairs to lower the energy of the whole structure the accuracy of the predicted structure is not perfect. There are some data points shown in Fig. 4.13 whose accuracies do not change after running HFold. This is because the stems in their corresponding MFE structure are shifts of the stems of their Gua structure, thus removing a base pair from inside each stem that closes a hairpin loop does not change the accuracy of the whole structure. By removing more base pairs from inside the structures, we obtain more cases where the accuracies before and after running HFold are different. For example, when we remove only 1 base pair from inside each stem that closes a hairpin loop we find 10 data points with the same accuracies before and after HFold, while this number decreases to 6 when we remove 2 base pairs from inside each stem that closes a loop (presented in Fig. 4.13). Similar results are achieved when 1, 2, or 3 base pairs are removed from outside each stem that closes a hairpin loop of the MFE structures and HFold is run'given these structures as input. . • 4.7.2 Accuracy when input is the M F E pseudoknot free structure with 1-3 base pairs removed from both inside and outside each stem that closes a hairpin loop We peeled away parts of stems from both ends of each hairpin loop. When we peeled only 1 base pair from each end of the stems, we ended up with 5 cases whose accuracies did not change before and after peeling (see Fig. 4.14). In those cases the MFE structure is very different from the true structure so that peeling one base from both ends of each loop does not change the accuracy. In all •those cases HFold adds the peeled base pairs back and thus, the final predicted structure is the MFE structure. As shown in Fig. 4.14, generally speaking when the accuracy of the MFE structures is lower than 55%, peeling increases the accuracy of the structure. This is because the MFE structure in these cases has little resemblance to the true structure. Because of the same reason, HFold adds the peeled base pairs back to the structure to lower the energy of the whole structure (see Fig. 4.15). There is only one case shown in Fig. 4.15 for which running HFold results Chapter 4. Results 40 Accuracy of the peeled structures Figure 4.12: Accuracy of the MFE structures with 2 base pairs removed from inside each stem that closes a.hairpin loop (horizontal axis), versus the accu-racy of the MFE structures (vertical axis). Pseudoknotted and pseudoknot free structures are presented by 'x' and 'O', respectively. Chapter 4. Results 41 Accuracy of HFold on in 2 Figure 4.13: Accuracy of HFold when input is the MFE structures with 2 base pairs removed from inside each stem that closes a hairpin loop (horizontal axis), versus the accuracy of HFold when input structure is MFE (vertical axis). Pseu-doknotted and pseudoknot free structures are presented by 'x' and 'O', respec-tively. Chapter 4. Results 42 Accuracy of the peeled structures Figure 4.14: Accuracy of the MFE structures with f base pair removed from both ends of each stem (horizontal axis), versus the accuracy of the MFE struc-tures (vertical axis). Pseudoknotted and pseudoknot free structures are pre-sented by 'x' and '0', respectively. Chapter 4. Results 43 Accuracy of HFold on both 1 Figure 4.15: Accuracy of HFold when input is the MFE structures with 1 base paii' removed from both ends of each stem (horizontal axis), versus the accu-racy of HFold when input structure is MFE (vertical axis). Pseudoknotted and pseudoknot free structures are presented by 'x' and 'O', respectively. Chapter 4. Results 4 4 in improvement in accuracy; in this case the peeled structure is similar to Gin,,, thus, HFold identifies the corresponding G.snu,ii structure correctly but adds more base pairs to lower the energy of the structure, resulting in a not perfect accuracy. , Removing more than 1 base pair from both ends of the stems results in fewer cases with the same accuracy before and after peeling; that is because the peeled structures have fewer base pairs. But since the MFE structures in general have little resemblance to the true structures, HFold adds the peeled base pairs back and thus, the average accuracy of the predicted structures is the same as the average accuracy of the MFE structures. 4.7.3 Accu racy when input is the M F E pseudoknot free structure w i t h stems of size 1-3 removed Since the MFE structures typically have long stems and small loops, we decided to remove the stems of size 1-3 and then give those structures as input to HFold. Since none of the MFE structures have any stems of size 1, peeling away stems of size 1 does not change any of the structures and thus, the final predicted structures are the MFE results. When removing stems of size 2 or 3, there is little change in the structure of some of our data points. However since HFold cannot fix the errors of the input structure, it does not find any structure with lower energy than the MFE structure and thus puts the removed base pairs back. In only 1 case, the input structure is similar to G/„9 and thus HFold finds a. pseudoknotted structure which is very similar to the true structure but with some extra base pairs. 4.7.4 Accu racy when input is the M F E pseudoknot free structure w i t h loosened loops Finally, we decided to loosen up the loops of the MFE structures so that HFold could possibly add extra base pairings to the structure and create pseudoknotted structures. Since we noticed from the HFold results in Section 4.4 that if the true pseudoknotted structure needs less than a stack of size 6 (that, is 6 consecutive base pairs) the structure typically will not form, due to the large pseudoknot initiation penalty, we decided to make room for at least 6 bases inside the loops. After removing enough base pairs from inside the loops of the MFE structures, we ran HFold on them and obtained the accuracy of the results. It is interesting to note that the average accuracy in both cases (before and after running HFold) is almost the same (59% and 58.5% respectively) and is also the same as the average accuracy of the MFE structures. In only 1 case HFold predicts the pseudoknot as it is in the true structure but it adds some extra base pairings. In this case the MFE structure is very similar to the corresponding G/„9 structure but has overcrowded loops; that is why loosening the loops helps HFold find the true pseudoknotted structure. In the rest of the cases, HFold adds the peeled base pairs back and results in the MFE structure. Chapter 4. Results 45 4.8 Base Pair Confidence Structure formation can be viewed as a probabilistic process, with the likelihood of a given secondary structure (for a fixed sequence) being a function of the energy of that structure. The lower the free energy, the more likely the structure: thus, the MFE structure is the most likely to form. Energy-based methods have also been developed to predict base pairing probabilities of pseudoknot free secondary structures at thermodynamic equi-librium. Mathews [15] found that, roughly speaking, the higher the probability that a base pair is predicted to be in the equilibrium structure, the more likely it is to occur in the true structure. One possible way to improve the accuracy may be to use base pair probability as a confidence measure - that is, base pairs are kept only if their probability is higher than a threshold value. Based on the results presented in Mathews [15], we chose different thresholds starting from base pair probabilities of 75%, and increased the threshold by steps of 5% to the maximum of 95% threshold. In each case, we chose only the base pairs whose probabilities are higher than the given threshold, and gave these structures as input to HFold. Figure 4.16 shows the accuracy comparison plot for predicted structures when base pair confidence is at least 75% and the MFE structures. As shown in Fig. 4.16 there are only two points that are not on the diagonal line. Both of these two cases show improvement over accuracy of the MFE structures. It is worth mentioning that in both of these cases the structure predicted by HFold when the input structure is base pairs with confidence above 75% achieves higher accuracy than the structure predicted by HFold when the input structure is the MFE structure. In one of these cases the improvement is 11% and in the other case it is 8%. The results are similar to Fig. 4.16 when the input structure has base pair confidence of at least 80% and 85%. When the confidence level reaches 90% all the data points except one are on the diagonal line. This one case is one of the two cases we discussed when the threshold was 75%. It is clear that the higher the threshold, the fewer base pairs are chosen for the input structure. For example, when the threshold is 90%, there are no base pairs chosen in 23 cases out of 70 cases, thus, the input structure to HFold is empty and HFold gives the MFE structure for those cases. For all the different threshold values the average accuracy is 59%, .which is exactly the same as the HFold accuracy when the input structures are the MFE structures. 4.9 Energy Comparison Using a separate method, with the same energy model, we calculated the en-ergy of the "true" pseudoknotted structures, and compared those values with the energies of the structures predicted by HFold when the input was Gug. Figure 4.17 shows the result of our comparison. All of the data points, except * one case, are either on the diagonal line, or in the upper triangle. As expected, Chapter 4. Results 46 Accuracy of HFold on base pairs with confidence >= 0.75 Figure 4.16: Accuracy of HFold when the input is the pseudoknot free structures with base pair confidence above 75% (horizontal axis), versus the accuracy of the MFE structures (vertical axis). Pseudoknotted and pseudoknot free structures are presented by ' x ' and :0', respectively. Chapter 4. Results 47 HFold's results have the same energy as the "true" structure or the "true" struc-tures have higher energy than the HFold's results using our energy model. The only case that HFold results in a structure with higher energy value than the true structure is when the true structure is a density-3 structure which cannot he handled by HFold. Figure 4.18 shows the comparison of the energy of G\,itj structure for a se-quence versus the energy of the MFE structure for the same sequence. In many cases, the energy differences are large, indicating the energy model is poor. Ideally, when the energy model is good, all the data points should be on the' diagonal line or very close to it, such that running HFold on the MFE prediction would result in the true structures. It is interesting to note that for two of the pseudoknot free structures (pre-sented by circles in the figure) the MFE energy value is much lower than the energy of the corresponding Guu structures; this can be an indication of the fact that even the pseudoknot free energy model is far from being correct. This is because large loops are broken into small loops to lower the energy in the MFE structures (see Fig. 4.2). This is the main reason for not getting the perfect accuracy for all cases when the true pseudoknot free structures are given to HFold as input structures. Chapter 4. Results 48 Figure 4.17: Comparison of HFold energy and true structure energy. The hor-izontal axis represents the energy of structures predicted by HFold when the input is Gin,) while the vertical axis represents the energy of the "true" struc-tures. Pseudoknotted and pseudoknot free structures are presented by 'x' and 'O', respectively. Chapter 4. Results 49 Figure 4.18: Comparison of energy of G\,\.a (horizontal axis) and the energy of the MFE structure (vertical axis). Pseudoknotted and pseudoknot free structures are presented by 'x' and ' O ' , respectively. Chapter 4. Results 50 Sequence IDs and lengths Sequence type Str. Ref. BLV(27), BYDV-NY-RPV(27), viral ribosomal H [27] CABYV(27), FIV(35), frame shifting H PLRV-S(26), PRRSV-LV(59), H SRVlgag/pro(37), H BChV(26), BWYV(26), EIAV(35), H PEMV(28), PLRV-W(26), H MMTVgag/pro(34), H PRRSV-16244B(58), H SARS-CoV(69) H* AKV-MuLV(50), GaLV(49), viral ribosomal H [27] Cas-Br-E-MuLv(50), SNV(50) readthrough H BaEV(50), Mo-MuLV(50), FeLV(50) H 1.1(37), 2.9(34), 2.7a(36), with high affinity H [5] 1.7(37), 2.1b(39), 2.10(37), to HIV-l-RT H •1.6(37), 2.12(37), 2.6b(42), H 1.3a(37), 2.4a(37), 2.11(37), H 1.17(37), 1.8(39), 1.9b(37), H 2.5a(41), 2.2b(42), 2.3a(45) H Ec-S15(67), Hs-PrP(45), mRNA H [27] Bt-PrP(45), T4-gene32(28) H Ec-PK1(30), Bp-PK2(80) tmRNA H [27] LP-PK1(30), Ec-PK4(52) H Tt-LSU-P3/P7(65), Ribozymes H* [27] HDV-It_ag(89) H* BVDVJRES(73), HCVJres(56) viral RNA H* [27] TMV-L(38), CSFV.IRES(76), H turnip yellow mosaic virus(86), H tobacco mosaic virus(214) H* Table 4.2: Sequences with H-type pseudoknots used for secondary structure prediction. In order, the columns provide (1) sequence ID, as found in the database or paper from which we obtained the sequence, and their length in parenthesis; (2) type of sequence; (3) structure of sequence (H-type pseudoknot = H, H-type pseudoknots with nested structures = H*); and (4) the reference from which the sequence was obtained. Chapter 4. Results 51 Sequence IDs and lengths Sequence type Str. Ref. I-ICV.229E(61) viral ribosomal K [27] CoxB3(68) Viral 3' UTR K [27] satRPV(72), NLVS(45), HDV-It_g(88) Ribozymes K D3 [27] Coxsackie(114) Viral and Phage RNA K [6] Hs-SRP-pkn(47) SRP RNA K [27] Table 4.3: Sequences with kissing interactions and the density-3 sequence used lor secondary structure prediction. In order, the columns provide (1) sequence ID, as found in the database or paper from which we obtained the sequence, and their length in parenthesis; (2) type of sequence; (3) Structure of the sequence (kissing interactions = K, density-3 structure = D3); and (4) the reference from which the sequence was obtained. 52 Chapter 5 Conclusion and Future Work In this work, we presented HFold, a fast, new dynamic programming algorithm that efficiently predicts RNA secondary structure including pseudoknots, based on the hierarchical folding hypothesis. HFold can predict kissing hairpins and pseudoloops with arbitrary number of bands. The algorithm had been proposed by Zhao [30]; this thesis focuses on an empirical analysis of its performance. Our analysis shows that, when presented with the true pseudoknot free struc-ture, Gin,) no significant improvement is obtained over the accuracy of the pseu-doknot free structure alone. However when HFold is given Gsm„u as input structure, 16% accuracy improvement is obtained on average over the accuracy of the pseudoknot free structure alone. This study also shows that using Gsln„u as the input to HFold results in 80% accuracy with 19 cases of perfect accu-racy. High pseudoknot initiation penalty and stem shifts are main causes of not achieving perfect accuracy for the rest of the cases. Based on the analyses presented in this work, using only the MFE structures as input to HFold does not result in high accuracies as the MFE structures are usually overcrowded with long stems and small loops and thus, running HFold does not result in the addition of any extra base pairings. We showed here that since the MFE structures usually bear little resemblance to the "true" structures, peeling the stems away from inside and/or outside the loops does not improve the average accuracy. We took the peeling one step further and showed that neither removing stems of size m when 1 < m < 3, nor loosening the loops to have at least 6 unpaired bases inside, changes the average accuracy of the results. Since the energy model using which the MFE structures are being formed does not consider pseudoknots, even using base pair probabilities is of no help in improving the accuracy of the results. We also showed that using the best of 10 suboptimal structures improves the accuracy of prediction by 11%. The accuracy increases by another 4% if the first 25 suboptimals are used. In many cases the accuracy of the MFE structure is greater than or equal to the accuracy of the .lowest energy structure predicted by HFold on the first 25 suboptimal structures. However, in numerous cases of the latter structures, HFold identifies the correct pseudoknotted base pairs. Note that adding more base, pairs to lower the energy causes the accuracy of HFold result to be lower than the corresponding MFE structure. This study showed that since pseudoknot free energy model predicts too Chapter 5. Conclusion and Future Work 53 many base pairs in the structure, peeling some base pairs away from the ends of stems that close hairpin loops in many cases improved the accuracy of the structure. However high pseudoknot initiation penalty in the pseudoknotted energy model made HFold avoid predicting pseudoknots by adding pseudoknot free base pairs when possible. We also found that because of the high pseudoknot initiation penalty and low band penalty, more bands are added to pseudoknotted structures to compensate for the energy. The results of this study provide more insight to the hierarchical folding hypothesis. Our results suggest that with respect to the current energy model minor rearrangements of the pseudoknot free MFE structures do not often lead to the true pseudoknotted structures. We also showed that in some cases the pseudoknot free MFE structure contains shifted stems of the true structure; therefore without shifting the stems it is not possible for the MFE structure to fold into the true pseudoknotted structure. Throughout this work, we used an index by index comparison of the base pairs to find the accuracy of prediction. However, in many examples the overall structure of the predicted structure was similar to the true structure, but more base pairs or shifts lowered the accuracy. We believe that using other notions of accuracy that consider structural similarities are more accurate that using an index by index comparison. Thorough analyses of the data based on different commonly used accuracy measures such as sensitivity, specificity, or positive predictive is underway. Our findings raise three main questions: 1. whether the structures identified by the experimental methods are reliable, 2. whether the current energy model is precise enough, and 3. whether the hierarchical folding hypothesis is precise enough. One of our future goals is to tune the parameters of the current energy model to improve the accuracy of the prediction for a wider sets of structures. One possible approach is using Andronescu's tuning method [4], Another future work is to use a better energy model for pseudoknotted structures, such as that of Cao and Chen [5], and obtain better energy parameters. This is also of great interest to us to evaluate the hierarchical folding hypothesis using computational methods in future. We are not yet able to do a sound comparison of the prediction accuracy of HFold with MFE-based methods, since it would be important to ensure that the same energy model is used by both methods. Therefore one of the main goals for our future work is to compare hierarchical and MFE algorithms implemented using the same energy model, at least for H-type pseudoknots. Finally, we plan to incorporate other techniques to produce better input structures to HFold, such as information obtained from chemical modification data [17]. 54 Bibliography [1] T. Akutsu. Dynamic programming algorithms for RNA secondary structure prediction with pseudoknots. Disc. App. Math., 104(l-3):45-62, Aug 2000. [2] S. L. Alain, J. F. Atkins, and R. F. Gesteland. Programmed ribosomal frameshifting: Much ado about knotting! PNAS, 96(25):14177-14179, Dec 1999. [3] M. Andronescu. R,. Aguirre-Hernandez, A. Condon, and H. H. Hoos. RNA-soft: A suite of RNA secondary structure prediction and design software tools. Nucleic Acids Res, 31(13):3416-3422, Jul 2003. [4] M. Andronescu, A. Condon, H.H. Hoos, D.H. Mathews, and K.P. Murphy. Efficient parameter estimation for RNA secondary structure prediction. BMC Bioinformatics, page to appear, July 2007. [5] S. Cao and S. Chen. Predicting RNA pseudoknot folding thermodynamics. Nucleic Acids Res, 34(9):2634-2652, 2006. [6] B. A. L. M. Deiman and C. W. A. Pleij. Pseudoknots: A vital feature in viral RNA. Seminars in Virol, 8(3):166-175, 1997. [7] C. Dennis. The brave new world of RNA. Nature, 418(6894):122-124, Jul 2002. [8] R. M. Dirks and N. A. Pierce. A partition function algorithm for nu-cleic acid secondary structure including pseudoknots. J. Cornput. Chem., 24(13):1664-1677, Oct 2003. [9] S. Griffiths-Jones, S. Moxon, M. Marshall, A. Khanna, S. R. Eddy, and A. Bateman. Rfain: annotating non-coding RNAs in complete genomes. Nucleic Acids Res, 33(Database issue), Jan 2005. [10] K. Han, Y. Lee, and W. Kim. Pseudoviewer: automatic visualization of RNA pseudoknots. Bioinformatics, 18 Suppl 1, 2002. [11] I. L. Hofacker, W. Fontana, P. F. Stadler, L'. S. Bonhoeffer, M. Tacker, and P. Schuster. Fast folding and comparison of RNA secondary structures. Monatshefte filr Chenue / Chemical Monthly; 125(2):167-188, Feb 1994. Bibliography 55 [12] H. Jabbari, A. Condon, A. Pop, C. Pop, and Y. Zhao. HFold: RNA pseudo-knotted secondary structure prediction using hierarchical folding. In Raf-faele Giancarlo and JSridhar Hannenhalli, editors, LNBI, Lecture Notes in Computer Science, page to appear. Springer, 2007. [13] R. B. Lyngs0. Complexity df pseudoknot prediction in simple models. In Josep Diaz, Juhani Karhumaki, Arto Lepisto, and Donald Sannella, editors, ICA LP, volume 3142 of Lecture Notes in Computer Science, pages 919-931. Springer, 2004. [14] R. B. Lyngso and C. N. Pedersen. RNA pseudoknot prediction in energy-based models. ./. Cornput. Biol, 7(3-4):409-427. 2000. [15] D. H. Mathews. Using an RNA secondary structure partition function to determine confidence in base pairs predicted by free energy minimization. RNA, 10(8):1178-1190, Aug 2004. [16] D. H. Mathews. Predicting RNA secondary structure by free energy mini-mization. Th.eor. Chem. Ace: Theory, Computation, and Modeling (The-oretica Clnmica Acta), pages 1-9, May 2006. [17] D. H. Mathews, M. D. Disney, J. L. Childs, S. J. Schroeder, M. Zuker, and D. H. Turner. Incorporating chemical modification constraints into a dy-namic programming algorithm for prediction of RNA secondary structure. Proc Nail Acad Set U S A, 101(19):7287-7292, May 2004. [18] D. H. Mathews, J. Sabina, M. Zuker, and D. H. Turner. Expanded se-quence dependence of thermodynamic parameters improves prediction of RNA secondary structure,. .7. of Mol. Biol, 288(5):911-940, May 1999. [19] W.J. Melchers, J.G. Hoenderop, H.J. Bruins Slot, C.W. Pleij, E.V. Pilipenko, V.l. Agol, and J.M. Galama. Kissing of the two predominant hairpin loops in the coxsackie B virus 3' untranslated region is the essential structural feature of the origin of replication required for negative-strand RNA synthesis. .7. Virol, 71(l):686-696, Jan 1997. [20] B. Rastegari and A. Condon. Parsing nucleic acid pseudoknotted secondary structure: algorithm and applications. J. Cornput. Biol, 14(l):16-32, 2007. [21] J. Reeder and R. Giegerich. Design, implementation and evaluation of a practical pseudoknot folding algorithm based on thermodynamics. BMC Bioinformatics, 5, Aug 2004. [22] E. Rivas and S.R. Eddy. A dynamic programming algorithm for RNA structure prediction including pseudoknots, Jul 1998. [23] M. Sprinzl, C. Horn, M. Brown, A. Ioudovitch, and S. Steinberg. Compila-tion of tRNA sequences and sequences of tRNA genes. Nucleic Acids Res, 26(1):148-153, Jan 1998. Bibliography 56 [24] D. W. Staple and S. E. Butcher. Pseudoknots: RNA structures with diverse functions. PLoS Biol, 3(6), Jun 2005. [25] I. Tinoco and C. Bustamante. How RNA folds. J. Mol. Biol, 293(2):271-281, Oct 1999. ' [26] Y. Ueniura, A. Hasegawa, S. Kbbayashi, and T. Yokomori. Tree adjoining grammars for RNA structure prediction. Theor. Cornput. Sci., 210(2):277-303, Jan 1999. [27] F. H. van Batenburg, A. P. Gultyaev, and C. W. Pleij. Pseudoba.se: struc-tural information on RNA pseudoknots. Nucleic Acids Res, 29(1):194-195, Jan 2001. [28] C. W'itwer, Ivo L. Hofacker, and P.F. Stadler. Prediction of consensus RNA secondary structures including pseudoknots. IEEE/ACM Trans. Cornput. Biol. Bioin.forina.tics, l(2):66-77, Apr 2004. [29] M. Wu and I. Tinoco. RNA folding causes secondary structure rearrange-ment. Proc. Natl. Acad. Sci. USA, 95(20):11555-11560, Sep 1998. [30) Y. S. Zhao. Efficient algorithm for RNA pseudoknotted secondary structure prediction given a pseudoknot free structure. Master's thesis, University of British Columbia, Department of Computer Science, September 2005. [31] M. Zuker and P. Stiegler. Optimal computer folding of large RNA se-quences using thermodynamics and auxiliary information. Nucleic Acids Res, 9(1):133-148, Jan 1981. 57 Appendix A Recurrences In this chapter, we present the details of HFold recurrences. Throughout this work, we will use the following notation: • G: a pseudoknot free structure. • G': a pseudoknot free structure that we add to G. • R. the complete pseudoknotted structure: R=GL)G'. Rhj refers to the subset of R whose bases are in the region [i,j]. Let Gij = G l~l {i.,j} x i.e. the set of base pairs of G contained in region Let /?.,., be a minimum free energy (MFE) secondary structure for [i,j], given a pseudoknot free secondary structure G;,j. The energy value of each substructure type, for a given input sequence S = $i$o...s.n and the given pseudoknot free secondary structure G, is stored in an array. In the next subsections, we describe how each is calculated. We illustrate each case with a figure, where we use the following notations in our figures: • G is presented by the portion of the figure above the base line. • G' is presented by the portion of the figure under the base line. • The normal black lines can be any arcs in Rij. • The solid lines are for base pairs. • The dotted lines connect bases that don't have to be paired. • The clear shade within the arcs indicate that there are no additional base pairs within the arc. • The.gray shade within the arcs are unknown structures. A . l Wij W^.j is the MFE of all valid structures Rij over region if i and j are not covered in G, i.e. i.sCovered(G,i) &nd isCovered(G, j). Otherwise, W^-is+00. The base cases are as follows: W.,j = 0, if i > j, since then Rij is empty; and Wij = +00, if isCovered(G.i), or isCovered(G,j). Appendix A. Recurrences 5 8 (1) ( 2 ) r r + 1 (3) 7, I-\ Figure A. 1: Illustration of Cases for Wij. Otherwise, Wi,j is given by the following recurrence: WhJ = m i n ^ ^ ( ^ + W ^ (A.l) . (3) WMBi,j + Ps The base cases indicate that W is being used only when the structure is an exterior structure and that there is no penalty for having unpaired bases at either end of the structure. Case (1) handles the case that i pairs with j, i.e. bpRi%1 (i) = j. Case (2) handles the cases that 3r, * < r < j, 6p/?.;,,• (?-) < r (i.e. t is either unpaired or paired with another base inside region [i,r]), and bpp.. .(j) > r or bp a 0) = 0 (i.e. 7 is either unpaired or paired with a base inside region [r + h:i\). If R-i.j does not fall into case (1) or (2), it must be that is a pseudoknot-ted closed region. This is an exterior pseudoknot because of the premise that i and j are not covered in G. In this case, we add a Ps penalty for introducing an exterior pseudoknot. A.2 Wlj W]i:.j is the minimum free energy of all valid structures Rij, given that.[i, j] is weakly closed, and /?.,;.,- is inside a pseudoknot. Otherwise, Wlij is +00. Appendix A. Recurrences 59 The base cases are as follows: Wli.-j = +00, if cover(i) / cover(j), since is not weakly closed. Wh,j = PUp, if * = j and bpG(i) = 0, since [i,j] is an empty region, thus we give it the value for an unpaired base in a pseudoloop. 117,., = 0, if?; > j. Otherwise, WI^j is given by the following recurrence: II'/.., mm < (1) Vij+PP, (2) min ( WI, , i t OYbp(Rij,j) = 0. If R;.j does not fall into case (1) or (2), it must be that paired(Rij,i) and paired(Ri,j,j), and bpa,,.,•(*) > frpy?.; where [i, j] is a pseudoknotted closed region in Ri.j. In this case, R,.j will be covered by case (3). Since WI, j is the structure inside a pseudoknot, but not inside a band, we add a Pv,s penalty to case (1) and (3), and a Ps,n penalty to case (3) for introducing a new pseudoknot inside a pseudoloop. Appendix A. Recurrences 60 Figure A.3: Illustration of cases for Wtt ... In case (5), the plotted structure from i to j could contain more than 2 bands (not illustrated) A . 3 Wl':/ Wf- • is the minimum free energy of all valid nonempty structures RIJ, if [i-j\ is weakly closed, given that RLi is inside a band. Otherwise, Wftj is +oo. The base cases are as follows: Wl1- , = +co, if [i.j] is not weakly closed. Wf.L.- = +oo, if i > j, since empty(iJjj, Otherwise, Wfi • is given by the following recurrence: f ( l ) ^ - + & ' Wl , = min < (2) WL + c' if i.j € G, or (bpG(i) = 0 and bpG(j) = 0) if frpG(i) = 0 if frpc(j) = 0 (A.3) (3) +c' (4) mm.(IV/iit+ [(5)H/iV/B,:,,+P5m +6' Cases (2) and (3) handle free bases on each side of the sequence. The rest of Appendix A. Recurrences 61 the cases are similar to W/with the only difference being that here in cases (1) and (5) we use b' as the penalty for introducing a new base pair in the structure instead of PpH, since the base pair is not inside a pseudoloop, but rather in a band. A . 4 VP,,, VPj,j is the minimum free energy of all valid structures Rij, in which bpc(i) = bpa(j) = 0, bases i and j are paired in G', i.e. bpa>{i) — j, and i.j crosses a base pair of G. Here the energy of Rij is the energy of all loops within Rij that are not inside a. band. Otherwise, VPij is +oo. The base case is as follows: (i>j, VPij = +oo , if < i.j does not cross any base pair of G, (A.4) [bpG{i) > 0, or bpcij) > 0 Otherwise, VP,j is given by the following recurrences: Appendix A. Recurrences 62 17'.., = min ' ( i ) w / ( i + 1 ) i ( B ( i j ) _ 0 WI, (3) W^i+i)^^^.,-!) + (''<••,/)-') + ^ V ( V , ( ) + l ) , ( . 7 - l ) (4) e.stp(i,i 1 ,i) + KPji+D^-D (5) min (e,„,,p(i,r,r' ,j) •;<7-<-m•;•!,( 0(';..,•,,''(;..,•)) •»»^('''(.'.,-)^(.-..-;))<'''<.v + V P , , , , ) (6) mjn (7) min if isCovered(G. i), and isCovered(G, j) if isCovered(G,i), and isCovered(G,j) if isCovered(G, i), and i,sCovered(G,j) if (6pG(i + l) = 0,. and 6P G(j - 1) = 0) (A.5) cover(G.i) = cover{G ,r) and cover(G,j) = cover (G,r') and empty (G, [i + 1, r - 1]) and empty(G,[r' +1,7-1]) + ^, (,--,) +a' + 26') + H/ / ' ( . „ + 1 ) , { y _ 1 ) + a ' + 26') Cases (1), (2), and (3) handle the cases that there are no other base pairs in [i,j] that cross the same band(s) that i.j crosses. In these cases we compute the energy between band borders. Case (4) handles the case that base pairs i.j and (i+ l).(j — 1) form a stacked pair in R,,;. Case (5) handles the case that i.j and r.r' close an internal loop of P,;j. Cases (6) and (7) handle the similar condition to case (5) except that case (6) allows closed regions in the gap region [i,r — 1] and case (7) allows closed regions in the gap region [»••+ In those cases i.j does not close an internal loop; but rather close a multiloop that spans a band. Appendix A. Recurrences 63 Figure A.4: Illustration of Cases for VP^j Appendix A. Recurrences 64 Cases (6) and (7) can he combined into the following case: min ( , , ( , . _ , ) + VP.,,,, + W^ ( T . , + 1 ) i b-_ 1 ) + P„« + a' + 2b') i(;..,•)') <„ur.,:{h\. .n,B(i.i))(j) = r, and either ».r or r.j crosses a base pair of'G. Otherwise, VP; • is +oo. The base case is as follows: VPij = +oo, if / > j (A.7) Otherwise, VP1- • is given by the following recurrences: VP'ij = min f(l) min (VP.,,.+ JW ( r + 1 ) i i ) • (2) min, (!<(,._!,+ » < - ' - < " » " ( B ( j i , . ) : J ) ( A g . (3) min (VP^ + c'iJ-r)) if empty(G, [r + 1, j]) „»,.».•(»:,//(...))<,•<:/ (4) min (c'(r - i) + VPrj) if empty(G, [i, r - 1]) ' i 0 and bpri. ;{j) > 0 but bpR, .(i) ^ bpR: .^j) (A.9) Otherwise, V.Lj is given by the following recurrence: Vij = mm{eH(i,3),es(iJ) + V i ) , ( i-U> VBJi.^ vMij} (A.10) This recurrence is identical to that used in pseudoknot free algorithms, so we omit the proof here and for VBI in the next section. A . 7 VBI; j VBk,j is the minimum free energy of all valid structures Ri:j over region if [i,j] is weakly closed or empty, assuming i.j closes a bulge or internal loop of Rij. Otherwise, I />'/... is +oo. Appendix A. Recurrences 66 The base case is as follows: VBIij — +00, if j — i < 1 Otherwise, VBl,,= min (eillt(i,i'j',j) + W,-')- (A.11) i<»' <,)'<:; where all bases between [i,i.']-and [j'.j] are unpaired. ei.nt.{i; i'j'j) gives the free energy of an internal.loop or bulge with exterior pair i.j and interior pair i.'.j'. The complexity for computing VBlj can be done in 0(n3) using [31]. A . 8 VM,., VMij is the minimum free energy of all valid structures Rij over region [i.j], if [i.j] is weakly closed or empty and i.j closes a multiloop of Rij. Otherwise, VMij is +00. We can obtain a recurrence that calculates the loop cost as the sum of two subparts. VM.-,.-. = min< -+i j Otherwise, WMij is given by the following recurrences: • Appendix A. Recurrences 67 (2) Figure A.7: Illustration of Cases for WMB;.,. WMi.-j = m i n { {2)WMli+1)j+c • bpc(i) = 0 (3) W % . 1 ) + C bpG(j)=0 (A.13) (4) min {WMi.t + WM{t+l)/l) [ (5) IfiVm,:,+ Psm + b Cases (1) to (4) are the same as in a pseudoknot free structure, and case (5) handles the case of a pseudoknotted loop in the multiloop. A . 1 0 WMB, , WMB^.j is the minimum free energy of all valid structures Rij over region [i,j], where /?.,;., is a density-2 pseudoloop. Otherwise, WMB is +oo. The base case is as follows: II-'.',//?= +oo , if / :••./ Otherwise, WMBij = min (1) P,,+ min (BEibpcljlbpGiBlu^Bl.jyj) U bpcU) > 0 + WMB'hl+ WI{l+1)!(B[iirlj), (A.14) (2) WMB'it.j In case (1), j is paired in G. Then, in the MFE structure, some base I with bpc(j) < I < j must be paired in G', causing 6p/?„ Aj).j to be pseudoknotted. We minimize the energy over all possible choices of / (note that / must be unpaired in G, since it will be paired in G', which is disjoint from G). By Lemma 1, once / is fixed, the inner base pair of the band whose outer base pair Appendix A. Recurrences 68 is bpii; ,{j).j is also determined. The P),+ BE term in case (1) of the recurrence accounts for the energy of the band, a WI term accounts for a weakly closed region in the band, .and the remaining energy is represented by the WMB' term. In case (2), j is not paired in G, and the recurrence is unwound by moving directly to a WMB1 term. A . 11 WMll, , WMB] j is the minimum free energy of all valid structures RL..j over region where R,., is a proper prefix of a density-2 pseudoloop. Otherwise, WMB is +oo. The base case is as follows: WMB'.ivJ = +oo , if i > j Otherwise, WMB';^ = min '{1)2P,,+ min i 0 Complementing case (1) of the WMB recurrence, WMB' handles the case that the rightmost band is not in G, but is part of the MFE structure G' (with respect to G). In the recurrence for yVMB1, case (1) is the complex case, accounting for the energy of the region spanned by the rightmost two bands using the 2P\,. VP, and BE terms, and recursively calling WMB1. Case (2) is called when one iteration of WMBij case (2) or WMB/i.i case (1) is done and we need to compute the energy of the remaining part of the structure inside an arc. Note that Wlij = +00 when cover (i) — cover (j) = (—1, —1) such that entering case (2) as the first iteration of WMB' is not possible. For example, the total energy for the structure presented in Fig. A.11(2) is calculated in the following way: the energy of the rightmost band covering base j is calculated using WMB case (1); the energy of the remaining structure from i to j accounts for the energy of the closed subregion of [i,j], [1 + 1, j], and the remaining prefix (WMB'-j), which is handled by case (2) of the WMB! recurrence. In cases (3) and (4), only one iteration of WMB1 yields the result. Case (3) handles the case that i pairs with j in G', and i.j is the left-most base pair in Appendix A. Recurrences 69 (1) (2) Figure A.8: Illustration of Cases for WMB.,j. the structure. Since this is part of a pseudoknotted structure, it will be covered by VP. Case (4) handles the case that there are only two bands in Rij, and the right band of the pseudoknotted loop is not in-G. A . 1 2 BEij. /.'. f. j) . BE(i, is the minimum free energy of the band [?', i'] U [j'j], if i < i' < j' < j, in which bpc{i) = j and bpG(i') - j'-The base cases are as follows: BE(i,i',j',j) = +oo, if it is not the case that i*o{l)+iU:l_l)+a'+2b' + BE(l./i/,f,bpc(l)) if 6pc(i + l) = (j - 1) if bPG(l) > 0, (empty(C7, [i +1,1-1}), and empty(C7, [bpc(l) + l,j - 1})), (i 0 (weakly closed(G, [i + 1,1 - 1]) and weakly closed(G, [bpc(l) + l,j - 1])) and (i 0. (weakly closed(G, [i + 1,1 - 1}) +a' + 2b''+ c'(j - bp{G, I) + 1) and empty(G,[6pG(0+ 1,7-1])) and {i) , . and (f 0 (weakly closed(G, [bpG(l) + l,j - 1]), and empty(G, [?:+ 1,/ - 1])) and [i < I, < i') and (f < bPG(l) < j) Case (1) handles the case that base pairs i.j and (i + l).(j — 1) of G form a stacked loop in the band. Case (2) handles the case that i.j and l.bpG(l) are the base pairs of an internal loop of G. (b)a' + 2b' + c'{L-i + l) + BE(l,i',j',bPG(l)) + WJ,{b,,c(i)+i),{j-i) Appendix A. Recurrences 71 Appendix A. Recurrences 72 Case (3) handles a similar situation as in case (2) except that there are other closed regions in both of the regions [i.l] and [6pc(0>il-Case (4) handles the case that the region [&7->G(0IJ]> is empty, and so we must pay the unpaired base penalty c' for each unpaired base. In this case, the left side, [i.l] must not be empty. Case (5) is the same as case (4) except the left side is empty and the right is not empty. *