Algorithms for prediction of RNApseudoknotted secondary structuresbyHosna JabbariB.Sc., The University of Victoria, 2005M.Sc., The University of British Columbia, 2007A THESIS SUBMITTED IN PARTIAL FULFILLMENT OFTHE REQUIREMENTS FOR THE DEGREE OFDOCTOR OF PHILOSOPHYinThe Faculty of Graduate and Postdoctoral Studies(Computer Science)THE UNIVERSITY OF BRITISH COLUMBIA(Vancouver)March 2015© Hosna Jabbari 2015AbstractRNA molecules are crucial in different levels of cellular function, and theirfunctions largely depend on their structures. Since experimental methodsfor determining RNA structure are expensive, computational methods forprediction of RNA secondary structure (the set of base pairs) are valuable.One such approach is based on the Minimum Free Energy (MFE) foldinghypothesis, which states an RNA molecule folds into the structure with theminimum free energy. Another approach is based on the hierarchical foldinghypothesis, which posits that an RNA molecule first folds into a psuedoknot-free structure (i.e., no crossing base pairs), then additional base pairs areadded that may form pseudoknots (structures with crossing base pairs) tolower the structure’s free energy.The overarching goal of this thesis is to overcome some limitations of theprevious methods, namely (1) slow running time, (2) poor prediction accu-racy (i.e., low number of correctly predicted base pairs), (3) limited in typesof pseudoknots they can predict, and (4) limited opportunity to providestructural information that can guide prediction. We propose two algo-rithms and study different variations of each method. We propose a relaxedversion of the hierarchical folding hypothesis and present Iterative HFold,which is based on this hypothesis. Furthermore, we propose the CCJ al-gorithm, an MFE-based algorithm that significantly expands the structureshandled in O(n5) time. Employing a sparsification technique, we propose asparse version of CCJ that improves its space usage.While CCJ’s and Iterative HFold’s performance are not significantly dif-ferent on our large data set, Iterative HFold is considerably less expensive torun than CCJ. In addition, Iterative HFold provides the user with ability toincorporate structural information as input constraint. Both CCJ and Iter-ative HFold predict significantly more accurate structures on key data setsthan two of the best performing computational methods currently available(i.e., HotKnots V2.0 and IPknot). Of the folding hypotheses that we havestudied, it seems that the relaxed hierarchical folding hypothesis has betterpotential to predict RNA secondary structure, at least with respect to theenergy model used in this work.iiPrefaceThe candidate contributed to all major ideas and writing of the publishedand unpublished manuscripts that are the basis of this thesis. The candi-date was the lead author in all published and unpublished manuscripts andin no instance was a co-author a student. The candidate collaborated withthe candidate’s supervisor and mentors in all aspects of research includingdesign, development and evaluation of the algorithms. The introductorychapters, Chapters 1–3, were written by the candidate, but used selectedcontent from publications that she co-authored [22, 24, 52, 53]. Chapter 4 isa collaboration between the candidate and the candidate’s supervisor, Dr.Anne Condon. A version of Chapter 4 appeared in BMC Bioinformatics[53]. Chapter 5 is a collaboration between the candidate, the candidate’ssupervisor, Dr. Anne Condon, and Dr. Ho-Lin Chen. A version of Chapter5 appeared in the Journal of Computational Biology [22]. Most of the resultspresented in Chapter 6, stem from collaborative work between the candi-date and the candidate’s supervisor, which appeared in BMC Bioinformatics[53]. Chapter 7 is a collaboration between the candidate, the candidate’ssupervisor, Dr. Anne Condon, and Dr. Sebastian Will.iiiTable of ContentsAbstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iiPreface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iiiTable of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . ivList of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viiiList of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xAcknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . xvDedication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xvii1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.1 Background on RNA secondary structure . . . . . . . . . . . 51.2 Accuracy measures . . . . . . . . . . . . . . . . . . . . . . . 61.3 RNA secondary structure prediction problem and thesis ob-jectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91.4 Thesis contributions . . . . . . . . . . . . . . . . . . . . . . . 101.5 Thesis outline . . . . . . . . . . . . . . . . . . . . . . . . . . 142 Background and related work . . . . . . . . . . . . . . . . . . 152.1 Computational RNA secondary structure prediction includ-ing pseudoknots . . . . . . . . . . . . . . . . . . . . . . . . . 152.1.1 Minimum-free-energy-based methods . . . . . . . . . 162.1.2 Heuristic methods . . . . . . . . . . . . . . . . . . . . 172.1.3 Methods based on maximum expected accuracy . . . 192.1.4 Methods that incorporate input constraints . . . . . . 202.1.5 Methods based on the hierarchical folding hypothesis 212.2 Data sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232.3 Bootstrap percentile confidence intervals . . . . . . . . . . . 242.4 Permutation test . . . . . . . . . . . . . . . . . . . . . . . . . 25ivTable of Contents2.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 253 Energy model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 273.1 RNA secondary structure . . . . . . . . . . . . . . . . . . . . 273.2 Energy model . . . . . . . . . . . . . . . . . . . . . . . . . . 304 Hierarchical folding-based pseudoknotted secondary struc-ture prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . 354.1 SimFold . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 354.2 HFold . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 364.3 HFold-PKonly . . . . . . . . . . . . . . . . . . . . . . . . . . 374.4 Iterative HFold . . . . . . . . . . . . . . . . . . . . . . . . . . 394.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 475 MFE-based RNA pseudoknotted secondary structure pre-diction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 495.1 CCJ energy model . . . . . . . . . . . . . . . . . . . . . . . . 495.2 The CCJ class of structures . . . . . . . . . . . . . . . . . . . 515.3 Recurrences . . . . . . . . . . . . . . . . . . . . . . . . . . . 545.3.1 Pseudoknots . . . . . . . . . . . . . . . . . . . . . . . 555.3.2 Transitioning between band groups in pseudoknots . 565.3.3 Nested substructures . . . . . . . . . . . . . . . . . . 605.3.4 Internal loops and multiloops . . . . . . . . . . . . . . 615.4 Time and space complexity . . . . . . . . . . . . . . . . . . . 655.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 656 Comparison of Iterative HFold and CCJ . . . . . . . . . . . 666.1 Experimental settings . . . . . . . . . . . . . . . . . . . . . . 676.1.1 Definition of Gbig and Gsmall . . . . . . . . . . . . . . 676.1.2 Robustness test . . . . . . . . . . . . . . . . . . . . . 676.1.3 Accuracy comparison tests . . . . . . . . . . . . . . . 676.1.4 Running time . . . . . . . . . . . . . . . . . . . . . . 706.1.5 Memory usage . . . . . . . . . . . . . . . . . . . . . . 706.2 Robustness comparison . . . . . . . . . . . . . . . . . . . . . 706.3 Accuracy comparison of different versions of HFold . . . . . 736.4 Iterative HFold with SimFold’s suboptimal structures . . . . 766.5 Accuracy comparison with existing methods . . . . . . . . . 766.6 Running time comparison . . . . . . . . . . . . . . . . . . . . 786.7 Memory usage comparison . . . . . . . . . . . . . . . . . . . 796.8 Comparison with ShapeKnots . . . . . . . . . . . . . . . . . 80vTable of Contents6.9 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 836.9.1 HotKnots . . . . . . . . . . . . . . . . . . . . . . . . . 836.9.2 IPknot . . . . . . . . . . . . . . . . . . . . . . . . . . 836.9.3 Comparison of CCJ, HFold and Iterative HFold . . . 846.9.4 Energy model . . . . . . . . . . . . . . . . . . . . . . 906.10 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 927 Sparsification of the CCJ algorithm for space reduction . 947.1 Background on sparsification . . . . . . . . . . . . . . . . . . 957.1.1 Sparsification of the MFE pseudoknot-free algorithm 967.1.2 Sparsification of pseudoknotted algorithms . . . . . . 987.2 Changes to the CCJ recurrences . . . . . . . . . . . . . . . . 1017.3 Sparsification of the CCJ algorithm . . . . . . . . . . . . . . 1107.4 Space complexity analysis . . . . . . . . . . . . . . . . . . . . 1157.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1158 Conclusions and future work . . . . . . . . . . . . . . . . . . . 1168.1 Accuracy measures . . . . . . . . . . . . . . . . . . . . . . . 1178.2 Robustness comparison of HFold and Iterative HFold . . . . 1178.3 Choosing best input structure . . . . . . . . . . . . . . . . . 1188.4 Accuracy comparison . . . . . . . . . . . . . . . . . . . . . . 1188.5 Time and memory usage comparison . . . . . . . . . . . . . 1188.6 Merits and shortcomings of each method . . . . . . . . . . . 1198.6.1 HotKnots . . . . . . . . . . . . . . . . . . . . . . . . . 1198.6.2 IPknot . . . . . . . . . . . . . . . . . . . . . . . . . . 1208.6.3 HFold . . . . . . . . . . . . . . . . . . . . . . . . . . . 1208.6.4 Iterative HFold . . . . . . . . . . . . . . . . . . . . . 1208.6.5 CCJ . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1208.7 Comparing the folding hypotheses . . . . . . . . . . . . . . . 1218.8 Energy model limitation . . . . . . . . . . . . . . . . . . . . 1218.9 Data and accuracy measure limitation . . . . . . . . . . . . . 1228.10 Possible future directions . . . . . . . . . . . . . . . . . . . . 122Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124AppendicesA IPknot detailed performance . . . . . . . . . . . . . . . . . . 136viTable of ContentsB Iterative HFold’s detailed performance comparison . . . . 137B.1 Robustness comparison . . . . . . . . . . . . . . . . . . . . . 137B.2 Comparison of correlation between false positive rates . . . . 137C Time and memory comparison . . . . . . . . . . . . . . . . . 141C.1 Time comparison . . . . . . . . . . . . . . . . . . . . . . . . 141C.2 Memory comparison . . . . . . . . . . . . . . . . . . . . . . . 141D Pseudocode for the sparse CCJ algorithm . . . . . . . . . . 148viiList of Tables2.1 Summary of the methods with which we will later comparethe performance our algorithms. . . . . . . . . . . . . . . . . 162.2 Summary of data sets used in this thesis. . . . . . . . . . . . 243.1 Energy parameters. This table provides the names, descrip-tion and values of the energy parameters and functions thatwe used in our methods. The names and definitions are thesame as in our original HFold [52], and the values were up-dated based on the work of Andronescu et al. [11]. Theseparameters were derived for a temperature of 37 ◦C and 1 Msalt (NaCl) concentration. . . . . . . . . . . . . . . . . . . . . 345.1 CCJ energy functions that are not explicitly in the DP09model. The values of such functions typically depend on thebases in positions i, l, i+ 1 and l− 1 of S, and whether i < l.We have set values of these functions to 0, in order to makethe models as similar as possible. . . . . . . . . . . . . . . . . 506.1 Comparison of bootstrap 95% percentile confidence intervalsof average F-measure of different versions of HFold when givenSimFold structure as input vs. when given HotKnots hotspotsstructures as input. . . . . . . . . . . . . . . . . . . . . . . . . 746.2 Comparison of bootstrap 95% percentile confidence intervalsof average F-measure with existing methods. . . . . . . . . . 746.3 Comparison of bootstrap 95% percentile confidence intervalsof average F-measure with existing methods on DK-pk16,DK-pk16* and IP-pk168 data sets. . . . . . . . . . . . . . . . 756.4 Comparison of average F-measure, sensitivity and PPV withexisting methods on HK-PK, HK-PKfree, DK-pk16, DK-pk16*and IP-pk168 data sets. The values outside parentheses showthe average F-measure while the values inside parenthesesshow average sensitivity and PPV respectively. . . . . . . . . 75viiiList of Tables6.5 Comparison of Iterative HFold F-measure with ShapeKnotson SHAPE data . . . . . . . . . . . . . . . . . . . . . . . . . . 826.6 Comparison of the number of times each of the proposed al-gorithms (i.e. CCJ, Iterative HFold and HFold) predicted thestructure with the minimum free energy, or the structure withhighest F -measure in each of the four data sets used in thisthesis. The table also provides the number of times IterativeHFold and HFold predicted structures with higher accuracythan the structure predicted by the CCJ algorithm. . . . . . 846.7 Comparison of bootstrap 95% percentile confidence interval ofaverage F-measure between the minimum energy structuresand the maximum accuracy structures of HK-PK and HK-PK-free data sets. . . . . . . . . . . . . . . . . . . . . . . . . 91A.1 Comparison of bootstrap 95% percentile confidence intervalof average F-measure of IPknot with different default settingsand no refinement. . . . . . . . . . . . . . . . . . . . . . . . . 136A.2 Comparison of bootstrap 95% percentile confidence intervalof average F-measure of IPknot with different default settingsand 1 iteration refinement. . . . . . . . . . . . . . . . . . . . . 136B.1 Comparison of robustness of different versions of HFold onthe HK-PK data set. . . . . . . . . . . . . . . . . . . . . . . . 138B.2 Comparison of robustness of different versions of HFold onthe HK-PK-free data set. . . . . . . . . . . . . . . . . . . . . 139B.3 Comparison of robustness of different versions of HFold overall structures of the HK-PK and HK-PK-free data sets. . . . 140B.4 Comparison of correlation between false positive rates in in-put structure and output structures. . . . . . . . . . . . . . . 140C.1 Running time comparison . . . . . . . . . . . . . . . . . . . . 142C.2 Running time comparison - Continued . . . . . . . . . . . . . 143C.3 Running time comparison - Continued . . . . . . . . . . . . . 144C.4 Memory usage comparison . . . . . . . . . . . . . . . . . . . . 145C.5 Memory usage comparison - Continued . . . . . . . . . . . . . 146C.6 Memory usage comparison - Continued . . . . . . . . . . . . . 147ixList of Figures1.1 A pseudoknot-free structure in both graphical (a) and arcdiagram (b) formats. In (a) some of the loops are annotatedby their type: loops with one emanating branch are hairpinloops, with two emanating branches are either internal orbulge loops; and with three or more emanating branches aremulti-loops. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71.2 Arc diagram representation of H-type pseudoknot (a), kissinghairpin (b) and a chain of four interleaved stems (c). Dotsalong the horizontal line represent bases from the 5’ (left) tothe 3’ (right) ends. Arcs represent base pairs. Some arcscross, thereby introducing pseudoknots. . . . . . . . . . . . . 82.1 Arc diagram representation of a simplified version of the struc-ture of the aptamer core of a SAM-IV riboswitch [98], butpreserving patterns of crossing base pairs. . . . . . . . . . . 172.2 Two examples of density-2 structures. The top figure shows apseudoknotted structure with a chain of arbitrary number ofbands, and the bottom figure shows an H-type pseudoknot,with an arbitrary depth of nested H-type pseudoknots. . . . 222.3 A bi-secondary structure that is not density-2. This structureis an example of a level 2 structure based on the definitionprovided by Sato et al. [83] . . . . . . . . . . . . . . . . . . . 22xList of Figures3.1 An H-type pseudoknotted structure (left) and a pseudoknot-free structure (right) in graphical (top) and arc diagram (bot-tom) formats. In the top left structure, the start and end ofthe pseudoloop, unpaired bases of a pseudoknotted structureand one of the two bands are labelled. In the top right struc-ture, some of the loops are annotated by their type: loopswith one emanating branch are hairpin loops, with two ema-nating branches are either internal or bulge loops; and withthree or more emanating branches are multi-loops (not shownhere). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 283.2 (a) Arc diagram representation of a simplified version of thestructure of the aptamer core of a SAM-IV riboswitch [98],but preserving patterns of crossing base pairs. Below both theleft and right endpoints of each stem is a letter which identifiesthat stem; concatenating the letters from left to right yieldsthe motif for this structure, namely ABCBDEDCAFFE. (b)A chain of four overlapping stems. This structure has motifABACBDCD. If either the leftmost or rightmost stem wereremoved from this structure, the result would be a kissinghairpin structure, with motif ABACBC. . . . . . . . . . . . . 31xiList of Figures4.1 Illustration of how the WMB recurrence unwinds, to calculateWMBi,j . Arcs above the horizontal line from i to j representbase pairs of Gij , and arcs below the line represent base pairsof G′ij . Note that Gi,j is the part of the given pseudoknot-freestructure G whose base pairs lie between positions i and j,assuming that no arc of G covers i or j. G′i.j is the part ofthe given pseudoknot-free structure G′ added by HFold whosebase pairs lie between positions i and j such that Gij ∪ G′ijis the MFE structure for strand si . . . sj given Gi,j . (1) Theoverall structure is handled by WMBi,j recurrence. (2) Acase of the WMB recurrence handles the overall structurewhose energy is WMBi,j , with terms to account for energiesof the right upper band (BE) and right lower closed subre-gion (WI) as well as the remaining structure (WMB′i,l1). (3)The term WMB′i,l1 is handled by a case of the WMB′ recur-rence, with terms to account for the lower right substructurelabelled VPl2,l1 , the upper left band (BE), and the remainingstructure (WMB′i,(l2−1)). (4) WMB′i,(l2−1) is then handled bya case of the WMB′ recurrence, with terms to account forWI(l3+1),(l2−1) and WMB′i,l3 . (5) Finally, the WMB′i,l3 term ishandled by VPi,l3 . . . . . . . . . . . . . . . . . . . . . . . . . 385.1 TGB (three-groups-of-bands) and CCJ pseudoknotted sec-ondary structures. (a) In this TGB structure for a gappedregion, each shaded arc represents a band – a set of pseudo-knotted base pairs which span the arc. The left and rightgroups each have two bands, and the middle group has fourbands. (The outermost middle “band” does not appear to bea band, since it does not cross any other bands, but as laterparts of the figure show, once this TGB structure is overlaidwith another TGB structure, all bands do indeed cross otherbands.) (b) A structure with two groups of bands, and withpseudoknotted structures nested within one of these bands.In general, not all bands illustrated in part (a) need be presentin a TGB structure, as long as at least one band is on the mid-dle group (labeled with star in part (c)). (c) A CCJ structureis obtained by overlaying two TGB structures. This exam-ple is obtained by overlaying the structures of parts (a) and(b). Embedded in this structure is a chain of four overlappingstems, which are labeled by four stars. . . . . . . . . . . . . 52xiiList of Figures5.2 Illustration of how the W recurrence unwinds, when the MFEstructure for a sequence of length 12 is the simple kissing hair-pin structure illustrated at the top of the figure. Since theoverall structure is pseudoknotted, it is handled by the sec-ond case of the W recurrence. Here we have W (1, 0) = 0,therefore the energy is accounted for by P(1, 12) +Ps. In theP recurrence, the structure is divided into two TGB struc-tures, namely PK(1, 2, 5, 6) and PK(3, 4, 7, 12). The termPK(1, 2, 5, 6) takes care of the internal loop of the gappedregion [1, 2] ∪ [5, 6], and PK(3, 4, 7, 12) calculates the energyof the rest of the structure by transitioning between the PR,PR,iloop, PfromR, PM and PM,iloop recurrences. . . . . . . . . 575.3 Illustration of how the W recurrence unwinds, when the MFEstructure for a sequence of length 20 is the structure illus-trated at the top of the figure. Note that this structure hasfive interleaved bands. As in Figure 5.2, the overall struc-ture is pseudoknotted, thus is handled by P(1, 20) +Ps. HereP(1, 20) is divided into PK(1, 6, 9, 14), and PK(7, 8, 15, 20).The TGB structure of PK(1, 6, 9, 14) is similar to Figure 5.2,and is handled similarly, with the only difference being thatthe leftmost band is handled by PL recurrence instead of a PKrecurrence. The TGB structure of PK(7, 8, 15, 20) is similarto the right part of Figure 5.2. . . . . . . . . . . . . . . . . . 586.1 H-type pseudoknot. The blue base pairs (i.e., the stems of1.32 to 12.21, and 40.61 to 46.55) belong to the Gbig structureand the green base pairs (i.e., the band of 16.53 to 20.49)belong to the Gsmall structure, as defined in Section 6.1.1.This figure was produced using the VARNA software [26]. . 686.2 Comparison of robustness of HFold and Iterative HFold. Ro-bustness results for pseudoknotted structures of the HK-PKdata set (a), pseudoknot-free structures of the HK-PK-freedata set (b) and all structures (c). The X axes show the per-centage of the Gbig structure that is provided as input, andthe Y axes show bootstrap 95% percentile confidence inter-vals for average F-measure. Dashed lines show the borders ofthe bootstrap 95% percentile for average F-measure and solidlines show the average F-measure itself. . . . . . . . . . . . . 72xiiiList of Figures6.3 Time comparison. Comparison of running times of Itera-tive HFold and HotKnots in a log plot. The X axis showslog10(time) for HotKnots data points and the Y axis showslog10(time) for Iterative HFold. . . . . . . . . . . . . . . . . 796.4 Example structure predicted by the CCJ algorithm vs. Iter-ative HFold. . . . . . . . . . . . . . . . . . . . . . . . . . . . 866.5 An example where Iterative HFold finds a structure withlower energy than the structure predicted by HFold. . . . . . 876.6 Example structure predicted by the CCJ algorithm vs. Iter-ative HFold. . . . . . . . . . . . . . . . . . . . . . . . . . . . 886.7 An example of overcrowded input structure provided to Iter-ative HFold. . . . . . . . . . . . . . . . . . . . . . . . . . . . 897.1 Examples of recurrence types in the CCJ algorithm. . . . . . 99xivAcknowledgementsI would like to express my sincere gratitude to my supervisor, Anne Condon,for her endless support through my graduate studies. I have been blessed tohave Anne as my supervisor for both my master’s and Ph.D. Due to familyneeds I only spent my first year of Ph.D. at UBC, after which I workedremotely on my thesis. I have two wonderful kids and three moves in myPh.D. resume. I cannot imagine a more supportive and accommodatingsupervisor. Anne is more than a supervisor to me; she is my mentor, myidol and my role model. She has guided me countless times throughout mylife with both research and life decisions.I would like to thank my supervisory committee, Holger Hoos and DavidKirkpatrick, for their support and insightful feedback. I would also like tothank my dissertation committee members, Inanc Birol, Joel Friedman andEric Jan, and my external examiner Ivo Hofacker for the time and effort theyput into reading my thesis and for their insightful and valuable commentsand feedback. I am thankful to Peter Clote for his hospitality and supportduring my stay in Boston. I would also like to thank Eleni Stroulia, RussGreiner and David Wishart for welcoming me to their research groups at theUniversity of Alberta. Many thanks to the support staff at the ComputerScience Department at UBC, in particular Joyce Poon, Holly Kwan, KathImhiran, Lara Hall, Hermie Lam, and Michele Ng. I am also thankful toall anonymous reviewers who reviewed my work and provided valuable andconstructive feedback. Special thanks to my collaborators, Ho-Lin Chenand Sebastian Will. Thank you for the helpful discussions and for yoursuggestions which have benefited my research greatly.I would like to thank my friends and colleagues in the BETA lab. Thankyou for being part of my life. I miss our conversations and gatherings greatly!More specifically, I would like to thank Dan Tulplan, Sanja Rogic, Mirela An-dronescu, Rosalia Aguirre-Hernandez, Baharak Rastegari, Chris Thachuk,Monir Hajiaghayi, Dave Tompkins, Viann Chan and Frank Hutter.I am thankful to Frances Ricks, Micaela Serra, Ali Shoja, Toni Garretand my undergraduate supervisor, Jon Muzio, without their help I couldnot have finished my undergraduate studies. I would also like to thank JonxvAcknowledgementsMuzio for his continuous support as my mentor throughout these years.I would like to thank Frances and Jim Ricks, for their priceless friendshipand their kindness towards my family and I. They opened the doors to theirhearts and their house to us and accepted us as part of their own family.They helped us feel amongst family members while we were hundreds ofmiles away from our own families.I owe big part of this work to my family. My parents suffered beingapart while my mom helped take care of my babies, when I was busy. Mysister, Farzaneh, agreed being apart from her mom, so I could benefit fromher help. I am mostly in debt to my parents, and Farzaneh for my mostrecent paper [53], which may not have been concluded if it had not been fortheir support. I am thankful to my sister, Zohreh, for checking on me everysingle day and filling my days with rays of hope and sunshine.Mostly, I am thankful to my fabulous husband, Majid. He has been mybest friend for the past 15 years! He is my definition of joyful life! He hasbeen there for me in the gloomiest times, when I had no clue what fate wasplanning, never judging and never doubting my abilities. He has alwaysoffered me his love, advice, and his shoulder to cry on. He has gone toextreme lengths to help me succeed. Honey, I can’t imagine my life withoutyou!Last but not least, I am thankful to my amazing kids, Viyana andKayson. I am thankful to them for understanding my passion for my work.The sparkles in their eyes and smiles on their faces mean the world to me!xviDedicationTo my daughter, Viyana, and my son, Kayson, without whom this thesiswould have been completed at least 2 years earlier!And to my love, Majid.xviiChapter 1IntroductionRNA molecules are crucial in different levels of cellular function, rangingfrom translation and regulation of genes to coding for proteins [25, 29, 30, 42,50, 65]. Apart from naturally-occurring RNAs, other RNA molecules havebeen designed which have novel catalytic functions not found in nature [54].Additionally, DNA and RNA sequences are designed to form novel structuresfor a wide range of applications, such as algorithmic DNA self-assembly[46, 78], detection of low concentrations of other molecules of interest [34] orto exhibit motion [85], and help detect circulating cancer cells in a patient’sblood [43]. Understanding the structure of an RNA molecule is important ininferring its function [2, 17, 56, 82]. Determining RNA structure also helpswith designing RNA molecules for applications in biotechnology [79].Since experimental methods for determining RNA structure, such as X-ray, crystallography and NMR, are time consuming, expensive and in somecases infeasible [37], computational methods for prediction of RNA structureare valuable. Currently computational RNA structure prediction methodsmainly focus on predicting RNA secondary structure - the set of base pairsthat form when RNA molecules fold (see Section 1.1 and Figure 3.1). RNA’sthree-dimensional structure, as defined by the atoms’ coordinates, is calledthe tertiary structure. Although very challenging, there are approachesfor prediction of RNA tertiary structure [20, 71, 104]. Prediction of RNAsecondary structure is easier than its tertiary structure and RNA secondarystructure sheds light on the tertiary structure of RNA [94]. The focus ofthis thesis is on RNA secondary structure.When multiple homologous RNA sequences are available, structure ofthe target RNA can be determined by comparative approaches utilizing thecommon secondary structure of the homologous sequences. Examples ofcomparative-based approaches include RNAalifold [16], Pfold [55], PETfold[84], RSpredict [90], Multilign [102], CetroidAlifold [44] , TurboFold [45] andLocARNA [101]. To predict correct secondary structure comparative-basedmethods require reliable alignment of RNA sequences, which may not beavailable in all cases.There is evidence that RNA molecules in their natural environments1Chapter 1. Introductionusually fold to their minimum free energy structure [94]. This is the basisfor many existing algorithms that aim to predict the minimum free energy(MFE) RNA secondary structure from the set of all possible structures, wheneach structure feature is assigned a free energy value and the energy of astructure is calculated as the sum of the features’ energies. Current energy-based algorithms take as input a single RNA sequence, and find a struc-ture (or few structures) optimized based on their underlying energy model.We note that not all energy-based approaches are MFE-based. There aremany non-MFE but energy-based algorithms that, guided by their under-lying energy model, predict secondary structure of RNA molecules. Thesealgorithms include heuristic approaches [11, 40, 51, 76, 86, 87] as well as ap-proaches that rely on finding base pairs with maximum expected accuracyfor the given RNA sequence [15, 35, 57, 83] (See Section 2.1.3).While energy-based approaches can be less accurate than comparativeapproaches, especially for long RNA sequences, energy-based approachesare applicable in cases of novel RNAs such as the many RNAs of unknownfunction recently reported by the ENCODE consortium [1]. Energy-basedapproaches can also be easier to apply to predict the structure of interactingRNA molecules, for example, in gene knockdown studies. In this thesis wefocus on energy-based methods.There has been significant success in prediction of pseudoknot-free sec-ondary structures (structures with no crossing base pairs) [4, 32, 35, 38,48, 49, 57, 63, 70, 72, 92, 103]. Figure 1.1 shows a pseudoknot-free RNAsecondary structure and its corresponding arc diagrams, in which each arcdenotes a base pair.While many small RNA secondary structures are pseudoknot-free, manybiologically important RNA structures, both in the cell [91, 96], and in viralRNA [28] are pseudoknotted (See Figure 1.2). Examples include simpleH-type pseudoknots (See Figure 1.2a), with two interleaved stems, whichare essential for certain catalytic functions and for ribosomal frameshifting[6], as well as kissing hairpins (See Figure 1.2b), which are essential forreplication in the coxsackie B virus [66].Unfortunately, MFE pseudoknotted secondary structure prediction isNP-hard [5, 58, 59], even for a simple energy model that depends on basepairs but not on unpaired bases. Polynomial-time MFE-based approachesto pseudoknotted structure prediction have been proposed [5, 33, 75, 77,95], with respect to various sum-of-loops energy models for pseudoknottedstructures, which find the MFE structure for a given input sequence, froma restricted class of structures. A class of structures can be defined byspecifying allowable patterns of interleaving among base pairs.2Chapter 1. IntroductionExisting MFE-based methods for prediction of RNA pseudoknotted sec-ondary structures are expensive in terms of both time and memory. Forexample, the algorithm of Rivas and Eddy [77], that runs in O(n6) time andO(n4) space where n is the length of RNA sequence, and handles the mostgeneral classes of pseudoknots, (that includes kissing hairpins and H-typepseudoknots), is not practical for prediction of secondary structure RNAmolecules longer than 106 nucleotides [76]. While heuristic methods do notguarantee prediction of the minimum free energy structure, they are usu-ally faster than MFE-based methods. For example, HotKnots V2.0 [11, 76],that is among the best heuristic methods for prediction of RNA pseudo-knotted secondary structures, requires about 1.7 hours to predict structureof an RNA sequence of length 400 [53]. Other methods for prediction ofpseudoknotted structures, referred to as methods based on maximum ex-pected accuracy (MEA), such as the IPknot method of Sato et al. [83], aremotivated by the finding of Mathews [61] that base pairs with high base pair-ing probabilities are more likely to be in the known structure. MEA-basedmethods by nature do not determine the energy of the predicted structureand have accuracy comparable to the MFE-based algorithms [41]. Incorpo-rating known structural information can improve the accuracy of structureprediction [64]. However, there is both limited reliable structural informa-tion available to the user and limited opportunity for the user to providestructural information, or constraints, to guide prediction.Motivated by two limitations of the MFE-based algorithms for pseudo-knotted secondary structure prediction, their high time complexity, and thepotential for improving predictions by accounting for the folding pathwayfrom unfolded sequence to stable structure, we previously presented a novelmethod, called HFold [52], based on the hierarchical folding hypothesis. Thishypothesis posits that an RNA molecule first folds into a pseudoknot-freestructure; then additional base pairs are added that may form pseudoknotswith the first structure so as to lower the structure’s free energy [94]. HFoldhandles a restricted class of structures, but this class is quite general, andincludes H-type pseudoknots and kissing hairpins with arbitrarily nestedsubstructures. Rastegari and Condon [74] showed that, out of a set of over1100 biological structures, all but nine are density-2 (when isolated basepairs are removed), and six of these nine are also not in the class handledby Rivas and Eddy’s algorithm. Given a pseudoknot-free structure as in-put, HFold predicts a possibly pseudoknotted structure that contains thegiven input structure and, relative to that constraint, has minimum free en-ergy. HFold’s running time is O(n3), significantly faster than other methodsfor predicting pseudoknotted structures, and matches that of MFE-based3Chapter 1. Introductionmethods for predicting pseudoknot-free structures.While HFold is fast, our earlier implementation of HFold had its ownshortcomings. First due to a high pseudoknot initiation penalty and lowband penalty in its underlying energy model, many of its predicted structuresdid not have pseudoknots. Second if the first structure input to HFoldcontains base pairs that are not in the true pseudoknot-free structure for thegiven RNA sequence or is not the complete pseudoknot-free structure (i.e. itdoes not include all the base pairs in the pseudoknot-free structure), HFoldis often unable to predict the known pseudoknotted structure as output.Overall existing energy-based approaches for prediction of pseudoknottedstructures, suffer from one or more of the following shortcomings:1. slow running time,2. poor prediction accuracy (particularly for long RNA sequences),3. handling very limited class of pseudoknots,4. limited opportunity for the user to provide structural information, orconstraints, that can guide prediction.In cases of a prediction method that incorporates user-defined constraints,it is also useful to understand the degree to which the method’s accuracypersists as the input information degrades. We use the term robustness withrespect to partial information or robustness to refer to this property of amethod. (We note that in our definition of robustness we do not mean robustwith respect to noise.) When investigating a method’s robustness, we aimto understand the degree to which the method provides a reliable predictioneven when limited information about the true pseudoknot-free structureis available. To the best of our knowledge, the concept of robustness insecondary structure prediction methods has not been studied before.The main goal of this doctoral thesis is to present novel energy-basedmethods to alleviate the above mentioned shortcomings for prediction ofpseudoknotted RNA secondary structures based on thermodynamic energymodel. We note that assessing merits of prediction methods based on theiraccuracy is a challenging problem, as difference in accuracy may not bereliable when size of data sets is small. To alleviate this problem we usestatistically sound methods (see Sections 2.3 and 2.4) in comparing perfor-mance of our methods with themselves and some of the already existingmethods.41.1. Background on RNA secondary structureIn the remainder of this chapter we first provide useful background infor-mation pertaining to RNA secondary structure prediction, explain the accu-racy measures, formulate the RNA secondary structure prediction problem,outline our contributions and describe the organization of this thesis.1.1 Background on RNA secondary structureWe start this section by explaining how to model RNA secondary structureand how to represent the structure using an arc diagram. Then we providea simplified definition of some of the features of RNA secondary structures.We refer the readers to Chapter 3 for detailed description of loops in RNAsecondary structure and explanation of the energy model.An RNA molecule is a sequence of nucleotides, or bases, of which thereare four types: Adenine (A), Guanine (G), Cytosine (C), and Uracil (U). Themolecule has chemically distinct ends, called the 5′ and 3′ ends. We modelan RNA molecule as a sequence over the alphabet {A,C,G,U}, with theleft end of the sequence being the 5′ end. We index the bases consecutivelyfrom the 5′ end starting from 1, and refer to a base by its index. The basesA and U are said to be complementary, as are the bases C and G. The pairof bases U and G is called a wobble pair.When an RNA molecule folds, bonds may form between complementaryand wobble pairs of bases. From now on we expand our definition of com-plementary pairs to include wobble pairs as well. The resulting structuredepends on environmental conditions, such as temperature and salt concen-tration of the solution in which the molecule resides. In what follows, let Sbe an RNA sequence of length n. A base pair for S is an ordered pair i.lwith 1 ≤ i < l ≤ n, such that ith and lth bases of S are complementary. Wecall i and l the endpoints of the base pair. The span of base pair i.l is l− i.If i.l and d.e are two base pairs with i ≤ d, then one of the following casesapplies: (i) i < d < e < l, in which case d.e is nested in i.l; (ii) i < l < d < e,in which case i.l and d.e are disjoint, (iii) i < d < l < e, in which the basepairs cross, or (iv) either i = d, d = l, or l = e, in which case the base pairscollide.A secondary structure R for S is a set of base pairs for S, none of whichcollide. We can illustrate the secondary structure of an RNA molecule byan arc diagram, in which the bases of the RNA molecule are presented on ahorizontal line and the base pairs are represented by arcs connecting the twopairing bases. With respect to R, base pair i.l is pseudoknotted if it crossessome base pair of R. R is pseudoknotted if it contains pseudoknotted base51.2. Accuracy measurespairs, and is pseudoknot-free otherwise. A pseudoknotted secondary struc-ture has crossing arcs in the arc diagram. Figure 1.1 shows a pseudoknot-freesecondary structure both in graphical and arc diagram format. Figure 1.2shows three different pseudoknotted structures in arc diagram format.Base pairs of secondary structure R partition the unpaired bases of se-quence S into loops [74]. Loops, and their associated unpaired bases andclosing base pairs, are defined as follows. Some of the loops are illustratedin Figure 1.1.• A hairpin loop has one closing base pair, and so one emanating branch.All other bases inside a hairpin loop are unpaired.• An internal loop has two emanating branches, thus two closing basepairs. All bases between the two closing base pairs are unpaired. Bulgeloops and stacked pairs (or simply stacks) are two special cases ofinternal loops. In a bulge loop there are no unpaired bases in one sideof the loop, while in a stacked pair there are no unpaired bases.• A multiloop has three or more emanating branches. A special case ofmultiloop is when one of these branches is a pseudoloop, in which casemultiloop may have only two emanating branches.• A pseudoloop is associated with each pseudoknot. Nested base pairs(i.e., branches) belong to the pseudoloop if they are not part of anynested substructure.A stem is a maximal sequence of stacked pairs, where successive base pairsare the closing base pairs of an internal loop. In Figure 1.1, base pairs 16.48,17.47, 18.46, 19.45, 20.44, and 21.43 represent a stem.A band is a maximal set of pseudoknotted base pairs, all of which crossexactly the same set of base pairs. In Figure 1.2a, base pairs 1.25, 2.24,3.23, 4.22, 5.20, 6.19, 7.18 and 8.17 represent a band.1.2 Accuracy measuresFollowing common practice [3, 11, 41], we measure the accuracy of a pre-dicted RNA secondary structure relative to a reference secondary structureby F -measure, which is the harmonic mean of sensitivity (also called truepositive rate or recall) and positive predictive value (PPV ) (also referred toas precision). We define these values as follows:61.2. Accuracy measuresStackInternal LoopHairpin LoopMulti LoopBulgeU A C C U G GACGGGGUCAACUUGUGAUCAA UA A GA C G A G U G G CC UAGGCUAGUG ACCUCCAUUGCAC AU A A CG G A GG G GU G C UUAGCUU A A G GUC UC CC C A AGUGGGAGAG 1 102030405060708090100110120130U U A C U(a) A pseudoknot-free secondary structure in graphical format.130U A C U U A C C U G G A C G G G G U C A A C U U G U G A U C A A U A A G A C G A G U G G C C U A G G C U A G U G A C C U C C A U U G C A C A U A A C G G A G G G G U G C U U A G C U U A A G G U C U C C C C A A G U G G G A G A G 1 10 20 30 40 50 60 70 80 90 100 110 120U(b) A pseudoknot-free structure in arc diagram format.Figure 1.1: A pseudoknot-free structure in both graphical (a) and arc di-agram (b) formats. In (a) some of the loops are annotated by their type:loops with one emanating branch are hairpin loops, with two emanatingbranches are either internal or bulge loops; and with three or more emanat-ing branches are multi-loops.71.2. Accuracy measures50G G U C A G G A G C C C C C C C C U G A A C C C A G G A U A A C C C U C A C U G U C G G G G G G C1 10 20 30 40G(a) An H-type pseudoknot.73C C C U A C U G U G C U A A C C G A A C C A G A U A A C G G U A C A G U A G G G G U A A A U U C U C C G C A U U C G G U G C G G A A A A A A A A1 10 20 30 40 50 60 70A(b) A kissing hairpin.81C C C U A C U G U G C U A A C C G A A C C A G A U A A C G G U A C A G U A G G G G U A A A U U C U C C G C A U U C G G U C C C C G C G G A A A A A A A A G G G G1 10 20 30 40 50 60 70 80A(c) A chain of four interleaved stems.Figure 1.2: Arc diagram representation of H-type pseudoknot (a), kissinghairpin (b) and a chain of four interleaved stems (c). Dots along the hor-izontal line represent bases from the 5’ (left) to the 3’ (right) ends. Arcsrepresent base pairs. Some arcs cross, thereby introducing pseudoknots.81.3. RNA secondary structure prediction problem and thesis objectivesSensitivity = Number of correctly predicted base pairsNumber of base pairs in the reference structurePPV = Number of correctly predicted base pairsNumber of predicted base pairsandF -measure = 2× sensitivity× PPVsensitivity + PPVWe also define these values as 0 when their denominators are 0. Whena prediction agrees with the reference structure, the value of F -measure isequal to 1 (so are the values of sensitivity and PPV). When the values ofsensitivity or PPV is equal to 0, the predicted structure does not have anybase pairs in common with the reference structure.We note that there are other accuracy measures, including true nega-tive rate or specificity, negative predictive value, false positive rate, falsediscovery rate and Fβ-measure (in which, β determines the important ra-tio of recall to precision) that are used based on specific applications [36].However, in evaluating prediction accuracies of methods for RNA secondarystructure prediction sensitivity, PPV and F -measure are widely used.1.3 RNA secondary structure prediction problemand thesis objectivesThe overarching goal of this thesis is to overcome some limitations of theprevious methods for prediction of RNA pseudoknotted secondary structure,namely (1) slow running time, (2) poor prediction accuracy, (3) limited intypes of pseudoknots they can predict, and (4) limited opportunity for theuser to provide structural information that can guide prediction. To this endwe propose two main methods (Iterative HFold and CCJ) and compare themwith some of the best existing methods (HotKnots V2.0, IPknot, and HFold)both in terms of time and memory efficiency and in terms of predictionaccuracy. We focus on methods that are based on variants of the hierarchicalfolding hypothesis and on minimum free energy prediction. We use theenergy model of HotKnots V2.0, which has already yielded good predictionsnot only for the HotKnots method but also for methods based on maximumexpected accuracy [83].91.4. Thesis contributionsWith respect to methods based on the hierarchical folding hypothesis,our goals are first, to improve the efficiency and accuracy of our previousimplementation of HFold, and second, to investigate variants of the hypoth-esis. The simplest version of the hypothesis, as implemented by HFold, isthat a sequence first folds into an MFE pseudoknot-free structure, and subse-quently pseudoknotted base pairs are added. In this thesis we consider othervariants of the hypothesis: that the initially-formed pseudoknot-free struc-ture is not necessarily the MFE structure, and/or that some base pairs fromthis initial structure are removed as pseudoknotted base pairs are added.These variants are consistent with findings of Wilkinson et al. [100] andDing et al. [31] that allowing for disruption of some base pairs in the ini-tially formed pseudoknot-free secondary structure can improve predictionaccuracy.Since HFold and the new variants that we introduce in this thesis takeas input an initial pseudoknot-free structure, another objective of this thesisis to investigate how prediction accuracy of methods based on hierarchicalfolding improve, as a function of the percentage of the reference structurethat is provided as input.With respect to MFE-based methods, our goal is simply to obtain moreefficient and general methods for prediction of MFE pseudoknotted struc-tures.In addition to the above, another objective of this thesis is to compareour two approaches to RNA secondary structure prediction, based on hier-archical folding and MFE prediction, and describe merits and shortcomingsof each approach. We also compare our approaches with approaches alreadyin the literature.1.4 Thesis contributionsThe contributions of this thesis are as follows:1. We present an implementation for HFold with new energy parameters(see Chapter 4). We previously published the theoretical contribu-tion of our HFold algorithm that is based on the hierarchical foldinghypothesis [52]. However, the implementation of the algorithm wasnot publicly available due to its limited accuracy. Here we present anupdated version of HFold with energy parameters of HotKnots V2.0[11]. We used basic optimization techniques on the HFold source codeto make it use memory more efficiently.101.4. Thesis contributions2. We propose the HFold-PKonly algorithm, which is a version of HFoldalgorithm that only predicts pseudoknotted base pairs to lower the en-ergy of input constraints (see Chapter 4). HFold-PKonly is especiallyuseful for cases when user has either complete information about thetrue pseudoknot-free structure or wants to check whether a single stemof the input structure can be part a pseudoknot since, if the inputstructure only has the specific stem in question, the output structureof HFold-PKonly will determine if the given stem can be part of apseudoknot.3. Motivated by the literature on the hierarchical folding hypothesis anddue to over prediction of base pairs in the minimum-free-energy-basedprediction methods and methods based on strict hierarchical folding,we define a new folding hypothesis, namely “relaxed hierarchical fold-ing”. We propose the Iterative HFold algorithm, which uses structuremodification consistent with the relaxed hierarchical hypothesis whilestays in the O(n3) time and O(n2) space complexity. Iterative HFoldalgorithm takes as input a pseudoknot-free structure, and produces apossibly pseudoknotted structure whose energy is at least as low asthat of any (density-2) pseudoknotted structure containing the inputstructure (see Chapter 4). Iterative HFold leverages strengths of ear-lier methods, namely the fast running time of HFold, a method thatis based on the hierarchical folding hypothesis, and the energy param-eters of HotKnots V2.0 [11]. Iterative HFold has additional featuresthat result in improved prediction accuracy over HFold: in particular,not all base pairs of the input structure need be part of the outputstructure. Iterative HFold can predict a wide range of structures, in-cluding H-type pseudoknots and kissing hairpins. Our experimentalevaluation on a large data set shows that Iterative HFold is robustwith respect to partial information, with average accuracy on pseudo-knotted structures steadily increasing from roughly 54% to 79% as theuser provides up to 40% of the base pairs of the input structure, andwith more modest but still positive improvement in accuracy whenfurther structural information is provided.4. We propose CCJ, a novel MFE-based algorithm which significantlyexpands the class of structures that can be handled in O(n5) time (seeChapter 5). We describe a more general method of formulating the dy-namic programming recurrences for prediction of pseudoknotted RNAsecondary structures that cover gapped regions. We introduce two new111.4. Thesis contributionsideas into the dynamic programming recurrences that improve the timecomplexity to O(n5): (i) a new class of structures called TGB struc-tures, with at most three groups of bands, and (ii) recurrences thathandle TGB structures by transferring to the left, right, middle or theouter bands. Our algorithm can handle H-type pseudoknotted struc-tures, kissing hairpins, and chains of four interleaved stems, as well asnested substructures of these types. There are several algorithms forpredicting MFE pseudoknotted secondary structures that run in Θ(n5)time and Θ(n4) space [59, 95]. All of these algorithms handle a morelimited class than does the Rivas and Eddy algorithm, a very generalMFE-based algorithm with running time of Θ(n6) and space Θ(n4).All can handle H-type pseudoknots, and some can handle kissing hair-pin structures when these do not have arbitrary nested substructures[95]. There are also some algorithms that run in Θ(n4) time [59, 75];these handle classes of structures that are even more restricted thanthe Θ(n5) algorithms. However, none of the algorithms to date handlekissing hairpin structures with arbitrary nested substructures, and thisis a real limitation given the biological importance of such structures(see e.g., [14, 21, 28]). For example, neither of the structures of Figure3.2 can be handled by any of the algorithms with O(n5) running time.5. We perform a thorough empirical evaluation of CCJ, HFold, HFold-PKonly and Iterative HFold on large data sets (see Chapter 6). Weuse four data sets, namely HK-PK with 88 pseudoknotted structures,HK-PK-free with 337 pseudoknot-free structures, IP-pk168 with 168pseudoknotted structures, and DK-pk16 with 16 pseudoknotted struc-tures with strong experimental support (see Section 2.2). We alsothoroughly compare performance of our algorithms with some of thebest existing methods for prediction of pseudoknotted structures (Hot-Knots V2.0 and IPknot). We also compare performance of our Itera-tive HFold algorithm with that of ShapeKnots, a method that incor-porates SHAPE reactivity scores in its prediction algorithm, on thedata set of Hajdin et al. [40].Our experimental evaluation shows that Iterative HFold is much fasterthan HotKnots V2.0, while having comparable accuracy on three ofour four data sets, namely HK-PK, HK-pkfree and DK-pk16. Iter-ative HFold has significantly better accuracy than HotKnots on theIP-pk168 data set and than IPknot on the HK-PK and IP-pk168 datasets. Moreover, Iterative HFold is more accurate than ShapeKnots ona small test set that was previously used to evaluate ShapeKnots, even121.4. Thesis contributionsthough Iterative HFold does not use SHAPE reactivity scores.Due to its MFE-based nature, the CCJ algorithm is the slowest anduses the most memory. However, its prediction accuracy is comparableto that of Iterative HFold on all data sets. Similar to Iterative HFold,CCJ’s prediction accuracy is significantly better than HotKnots onthe IP-pk168 data set and than IPknot on the HK-PK* and IP-pk168data sets.6. Since space complexity is a bottleneck for usability of the CCJ al-gorithm, we describe a new way of using candidate lists to reducethe space complexity of the CCJ algorithm to O(n2Z) instead ofO(n4), and provide the recurrences for sparse CCJ algorithm (seeChapter 7). Previously published methods using the sparsificationtechnique [12, 67, 81, 99] either focused on pseudoknot-free methods(due to complexity of pseudoknotted methods), or used pseudoknot-ted methods with simplified energy models (base pair maximizationonly), and they mainly aimed to reduce the time complexity of themethods. To the best of our knowledge we are the first researchers toapply the sparsification technique to a pseudoknotted method with arealistic energy model and parameters, and focusing on reducing itsspace complexity.7. Based on our experimental analysis of the HFold, Iterative HFold andCCJ algorithms, we present merits and shortcomings of hierarchicalfolding hypothesis and minimum free energy folding hypothesis (seeChapter 8). While we did not observe any significant difference in theperformance of HFold, Iterative HFold, and CCJ algorithms, we foundthat (i) Iterative HFold better identifies the pseudoknotted base pairsin reference structures than HFold, (ii) predicting the structure withthe minimum free energy does not seem to be directly correlated withfinding the structure with the highest accuracy.The work presented in this thesis benefits the general RNA communityby providing them with state-of-the-art algorithms for predicting RNA sec-ondary structure including pseudoknots, which improves the accuracy ofstructure prediction (key in determining RNA function), and open up thepossibility of predicting structure of larger RNA molecules, previously im-possible cases.Published results of this work has attracted significant attention fromthe community. The recently published Iterative HFold algorithm has beenaccessed more than 2400 times as of January 7, 2015, based on the BMC131.5. Thesis outlineArticle Metrics. The publication outlining the CCJ algorithm has been cited16 times as of January 7, 2015, based on Google Scholar Citation.1.5 Thesis outlineThe remainder of this thesis is organized as follows. In Chapter 2, we pro-vide a review of the most relevant thermodynamic-based algorithms for RNApseudoknotted secondary structure prediction, describe the best performingexisting algorithms, and describe our data sets and statistical analysis meth-ods used to evaluate performance of the algorithms in this work. In Chapter3, we provide formal definition of secondary structure features and details ofthe energy model used in this thesis. In Chapter 4, we provide an overviewof the algorithms that the Iterative HFold algorithm builds on, and thenexplain the details of our HFold-PKonly and Iterative HFold algorithms. InChapter 5, we explain our CCJ algorithm and the class of structures it canhandle. In Chapter 6, we present a thorough comparison of the proposedalgorithms, Iterative HFold, and CCJ, with some of the best existing meth-ods for prediction of RNA secondary structure. In Chapter 7, we provide anoverview of a sparsification technique that was previously used to improvetime and space complexity of MFE-based RNA pseudoknot-free structureprediction algorithms and then explain how we used candidate lists to reducespace complexity of the CCJ algorithm. Finally in Chapter 8, we provideconclusions of this work and present directions for future work.14Chapter 2Background and related workIn this chapter, we first review the most relevant thermodynamic-based al-gorithms for RNA pseudoknotted secondary structure prediction. Then wedescribe our data sets and statistical analysis methods used to evaluate per-formance of the algorithms in this work. Finally we present a summary ofthe chapter.2.1 Computational RNA secondary structureprediction including pseudoknotsIn this section, we first give an overview of computational methods for pre-dicting the minimum free energy secondary structure for an RNA sequence.Then, we briefly discuss heuristic approaches for secondary structure predic-tion, and in particular, we give an overview of HotKnots. We then explainthe basics of prediction methods based on maximum expected accuracy, andprovide a brief overview of IPknot. In Section 2.1.4 we discuss methodsthat accept input constraints for prediction of pseudoknotted structures; wethen give a brief explanation of ShapeKnots as currently the best methodfor including SHAPE reactivity scores to structure prediction. Finally inSection 2.1.5, we discuss the hierarchical folding hypothesis and introduceHFold.HFold has two potential advantages over MFE-based secondary structurepredication. First, HFold’s hierarchical folding principle may model biolog-ical folding just as well, or better, than does the MFE structure formationhypothesis, at least on biological structures. Second, HFold has O(n3) run-ning time, where n is the length of RNA sequence, making it significantlymore efficient than MFE-based methods that require Ω(n5) time or more topredict biologically-important pseudoknotted structures.Another potential advantage of HFold over heuristic methods such asHotKnots or ShapeKnots is that unlike these methods, HFold minimizes thefree energy of the possibly pseudoknotted output structure relative to thegiven input structure. Therefore HFold’s method of adding pseudoknotted152.1. Computational RNA secondary structure prediction including pseudoknotsstems is better motivated energetically than HotKnots or ShapeKnots.The reason for highlighting these four methods (summarized in Ta-ble 2.1) is that later we compare performance of our algorithms with thesemethods.Table 2.1: Summary of the methods with which we will later compare theperformance our algorithms.Name Type Structures handled ReferenceHotKnots V2.0 heuristic general pseudoknots [11, 76]IPknot MEA up to level 3 [83]ShapeKnots heuristic general pseudoknots [40]HFold hierarchical folding density-2 [52]2.1.1 Minimum-free-energy-based methodsAlgorithms for MFE secondary structure prediction of pseudoknot-free sec-ondary structures exploit two useful properties. First, a pseudoknot-freesecondary structure for a sequence is either closed by a base pair connect-ing the first and last base in the sequence, or can be broken down into twoindependent substructures on a prefix and suffix of the sequence. Second,the total energy of a structure which is composed of two independent sub-structures is the sum of the energies of the loops of the substructures. As aresult, a divide-and-conquer approach, based on dynamic programming, canbe used to find the MFE pseudoknot-free secondary structure; the run-timeof such algorithms is Θ(n3) for standard energy loop models [69].Since MFE pseudoknotted secondary structure prediction is NP-hard [5,58, 59], algorithms for MFE pseudoknotted secondary structure predictiontrade off run-time complexity and generality. We say that a structure Rcan be handled by a given algorithm if R is in the class of structures overwhich the algorithm optimizes. We note that, even when the true structureR for a sequence is handled by an algorithm, the algorithm still may notcorrectly predict R, because correctness depends not only on the generalityof the algorithm but also on the energy model and energy parameters.The most general MFE-based algorithm for prediction of pseudoknottedstructures was proposed by Rivas and Eddy and is called Pknots [77]. Pknotsis a complex dynamic programming algorithm with the time complexity ofΘ(n6), and space complexity of Θ(n4). There are algorithms of Uemura et162.1. Computational RNA secondary structure prediction including pseudoknotsal. [95] and Lyngso and Pedersen [59] for predicting MFE pseudoknottedsecondary structures that run in Θ(n5) time and Θ(n4) space. Both of thesealgorithms handle a more limited class than does the Rivas and Eddy al-gorithm. All can handle H-type pseudoknots, and some can handle kissinghairpin structures when these do not have arbitrary nested substructures[95]. There are also some algorithms that run in Θ(n4) time [59, 75]; thesehandle classes of structures that are even more restricted than the Θ(n5)algorithms. However, none of these algorithms handle kissing hairpin struc-tures with arbitrary nested substructures, and this is a real limitation giventhe biological importance of such structures. For example, neither of thestructures of Figures 2.1 nor 1.2c can be handled by any of the algorithmswith O(n5) running time.74C C C A A C C C C C C A A A A G G G A A A A C C C A C C C C A A G G G A A A G G G A A A A G G G G A A A A C C C A A A A A G G G A A A A A G G G G1 10 20 30 40 50 60 70CFigure 2.1: Arc diagram representation of a simplified version of the struc-ture of the aptamer core of a SAM-IV riboswitch [98], but preserving pat-terns of crossing base pairs.2.1.2 Heuristic methodsHigh running time and/or space usage of MFE-based methods for predic-tion of pseudoknotted structures of large RNA sequences has been the mainreason for development of heuristic methods [11, 51, 76, 86–88]. Similarto the dynamic programming approaches, heuristic methods are guided byenergy models and output one or more low energy structures, but not nec-essarily the MFE structure. Heuristic methods usually run faster than theMFE-based methods that handle the same class of structures. For example,HotKnots V2.0 [11, 76] is a heuristic approach that is guided by energy min-imization and can handle kissing hairpin structures. However, HotKnots isstill slow on long sequences.Recent heuristic procedures include iterated loop matching (ILM) [80],HotKnots [11, 76], HPknotter [51], KnotSeeker [86], and DotKnot [87, 88].ILM uses a dynamic programming algorithm for prediction of pseudoknot-172.1. Computational RNA secondary structure prediction including pseudoknotsfree structures to identify a promising helix, then adds this helix to the struc-ture, removes the bases forming this helix from the sequence and iterates tofind additional helices.Pseudoknot detection approaches, such as HPknotter and KnotSeeker,work based on this assumption that if part of RNA sequence pertainingpseudoknot is detected with high accuracy, the remaining sequence can beefficiently folded using the state-of-the-art MFE secondary structure predic-tion programs.HPknotter is a heuristic approach for detecting H-type pseudoknots inan RNA sequence by incorporating several existing algorithms: RNAMotif[60], Pknots [77], Nupack [33] and pknotsRG [75], where RNAMotif is anRNA structural motif search tool and is used in HPknotter to filter outthe subsequences that do not have the requirements to fold to an H-typepseudoknot to avoid the long running time of the dynamic programmingcomponents (i.e., Pknots, Nupack or pknotsRG). Huang et al. use the fol-lowing steps to predict a pseudoknotted structure. First, running RNAMo-tif’s structural matcher on a given RNA sequence, they find a great numberof possible pseudoknot fragments (called hits). Then using NUPACK en-ergy calculation tool they remove the previously found possible pseudoknotfragments that can be folded to lower energy pseudoknot-free structures.Second, they run pknots, NUPACK, or pknotsRG on a filtered hit (a shortRNA sequence) and remove the hit from the list if it does not fold into apseudoknotted structure. Finally they produce a minimum weight indepen-dent set of the filtered hits, where the weight corresponds to the free energyvalue, and return a mutually disjoint pseudoknot set as the result.KnotSeeker has a similar heuristic approach to that of HPknotter. UsingGUUGle tool [39] and RNAeval (Vienna package 1.7) [48], they find stablestructures including hairpin and bulge loops with free energy less than +2.0kcal/mol. Given the list of filtered structures from step one, and restrictingthe size of possible pseudoloops to be less than 90 bases and free energy ofless than −2.5 kcal/mol, they find possible H-type pseudoknots, and verifythem using pknotsRG [75]. Then minimum weight independent set of allstructures found in steps one and two are made (this includes nested struc-tures), and the final result is the structure with the minimum free energy.DotKnot is the extension of KnotSeeker with two main difference: 1) apseudoknot-free partition function is used to find a set of promising structureelements, and 2) no dynamic programming algorithm is used for pseudoknotcandidate verification. Given an RNA sequence, DotKnot returns only the(possibly empty) set of detected pseudoknots. Structure of the rest of thesequence, which is pseudoknot-free can be then found using a pseudoknot-182.1. Computational RNA secondary structure prediction including pseudoknotsfree MFE-based method. This way DotKnot can predict H-type pseudoknotsand kissing hairpins.HotKnotsHotKnots [11, 76] is a heuristic program that given an RNA sequence, firstfinds up to 20 lowest energy stems (from the set of all stems for the givenRNA sequence), called hotspots. We note that for short RNA sequencesnumber of hotspots are fewer than 20, but for long RNA sequences, inde-pendent of their length, 20 hotspots will be generated. Then keeping allthese stems, it adds other non-overlapping low energy stems to the stemsfound in the first step, so as to minimize the energy of the overall structure,eventually producing up to 20 output structures.We note that most recent advances on heuristic methods have focusedon achieving high accuracy and thus have focused on specific classes of pseu-doknotted structures. However, Andronescu et al. [11] estimated the en-ergy parameters of HotKnots using a large set of pseudoknotted structures(with over 400 RNA molecules) and their fine machine learning techniques,and constraint generation program, that resulted in average prediction ac-curacy improvement of 11% over its previous energy model (that of Dirksand Pierce [33]). Therefore, considering the accuracy measures and the sizeof the tested data set, HotKnots V2.0 is yet among the best performingheuristic approaches for general classes of pseudoknotted structures. Welater compare performance of our algorithms with HotKnots.2.1.3 Methods based on maximum expected accuracyMethods based on maximum expected accuracy for prediction of pseudo-knotted structures, such as the IPknot method of Sato et al. [83], are moti-vated by the finding of Mathews [61] that base pairs with high base pairingprobabilities in the thermodynamic ensemble are more likely to be in theknown structure.IPknotIPknot is a secondary structure prediction method based on Maximum Ex-pected Accuracy (MEA) of base pairs, employing the free energy parametersof HotKnots V2.0 [11]. In addition to an RNA sequence, IPknot gets sev-eral parameters as input. We next describe each of these parameters andsettings briefly.192.1. Computational RNA secondary structure prediction including pseudoknots• level: If structure G can be decomposed into k disjoint pseudoknot-free structures, G1, G2, ..., Gk, such that every base pair in Gi crossesthe base pairs of Gj , 1 ≤ i ≤ j ≤ k, Sato et al. say that structureG has k levels. For example, a pseudoknot-free structure has level 1,and an H-type pseudoknot has level 2, and so does a kissing hairpin.In another example, when representing the secondary structure in dotbracket format, the number of different brackets used to represent thestructure is the level of the structure. IPknot can handle structuresup to level 3.• scoring model: The energy model used to produce posterior proba-bilities for each base pair is called “scoring model”. IPknot has 3different scoring models, namely “CONTRAfold”, “McCaskill” and“NUPACK”.• refining parameters: The procedure of recalculating the base pair prob-abilities based on the original prediction results is referred to as “re-fining parameters”.• base pair weights for each level: Positive numbers representing therate of true base pairs in each level.In a comprehensive comparison performed by Puton et al. [73] on theperformance of publicly available non-comparative RNA secondary structureprediction methods that can handle pseudoknotted structures, IPknot ranksfirst when considering both short and long RNA sequences.We later compare performance of our algorithms with IPknot.2.1.4 Methods that incorporate input constraintsIncorporating known structural information can improve the accuracy ofstructure prediction. For example, Mathews et al. [64] used SHAPE reac-tivity data to improve the prediction accuracy from 26.3% to 86.8% for the5S rRNA of E. coli. Roughly, the larger the SHAPE reactivity value fora given nucleotide, the more likely it is that the nucleotide is unpaired inthe structure. However, limited SHAPE reactivity data is available, and thedata does not unambiguously determine whether a base is paired or not or,if it is paired, to what other nucleotide.202.1. Computational RNA secondary structure prediction including pseudoknotsShapeKnotsDeigan et al. [27] created pseudo energy terms from SHAPE reactivity data,as a means of integrating such data into prediction software. They reportedprediction accuracy of 96% to 100% for three moderate-sized RNAs (< 200nucleotides) and for 16S rRNA ( 1500 nucleotides). ShapeKnots [40] isa new method for incorporating SHAPE reactivity data for pseudoknottedstructures that incorporates the pseudo energy terms into a heuristic methodsimilar to HotKnots [76].2.1.5 Methods based on the hierarchical folding hypothesisWe previously presented HFold [52], an approach for prediction of pseudo-knotted structures, motivated by two goals, namely to avoid the high run-ning time complexity of other methods for pseudoknotted secondary struc-ture prediction and to leverage the hierarchical folding hypothesis. Given apseudoknot-free structure as input, HFold predicts a possibly pseudoknot-ted structure from a broad class that contains the given input structure and,relative to that constraint, has minimum free energy. HFold’s running timeis O(n3), significantly faster than other methods for predicting pseudoknot-ted structures. Several experts have provided evidence for the hierarchicalfolding hypothesis [13, 23, 62, 94].The class of structures that HFold can handle, density-2 structures, in-cludes many important pseudoknots including H-type pseudoknots, kissinghairpins and infinite chains of interleaved bands, with arbitrary nested (pseu-doknotted) substructures.A density-2 secondary structure is a bi-molecular secondary structure R(i.e., a secondary structure than can be partitioned into two pseudoknot-freesecondary structures, or a level 2 structure based on the definition of levelprovided by Sato et al. [83]) with an additional constraint, which is easy todescribe intuitively in terms of the structure’s arc diagram. Take any region[i, j], and remove all proper nested substructures. Choose any base l ∈ [i, j].Draw a vertical line through base l in the arc diagram. Then, the verticalline should intersect at most two bands. We refer the readers to Chapter 3for formal definition of density-2 structures. Figure 2.2 illustrates density-2secondary structures. Figure 2.3 shows a structure that is a bi-secondarystructure but not a density-2 structure.212.1. Computational RNA secondary structure prediction including pseudoknots(a) Arbitrary Number of Bands(b) Arbitrary Depth of BandsFigure 2.2: Two examples of density-2 structures. The top figure shows apseudoknotted structure with a chain of arbitrary number of bands, andthe bottom figure shows an H-type pseudoknot, with an arbitrary depth ofnested H-type pseudoknots.Figure 2.3: A bi-secondary structure that is not density-2. This structure isan example of a level 2 structure based on the definition provided by Satoet al. [83]222.2. Data setsHFold secondary structure prediction problemHFold considers the problem of predicting the secondary structure as follows:given a sequence S and a pseudoknot-free secondary structure G (a set ofbase pairs), find a pseudoknot-free secondary structure G′ (a set of basepairs disjoint from G) for S, such that the free energy of G ∪ G′ is lessthan or equal to the free energy of G∪G′′ for all pseudoknot-free structuresG′′ 6= G′.2.2 Data setsWe use three data sets to analyze performance of our algorithms. Our firstdata set is the test data set of Andronescu et al. [11], that was used toevaluate the performance of HotKnots V2.0. This data set contains 446distinct RNA sequences and their reference structures, of which 348 arepseudoknot-free and 98 are pseudoknotted, and was compiled from the RNASTRAND database [7].There are eight cases in this data set for which the original sequenceand structure were shortened to accommodate restrictions in length. Weremoved them from our data set. From now on we use “HK-PK” to referto the pseudoknotted structures in this set (with 88 structures) and “HK-PK-free” to refer to the pseudoknot-free structures in this set (with 337structures). RNA sequences in HK-PK and HK-PK-free have length between10 and 400 nucleotides. We note that structures in the HK-PK data set areall density-2 structures.Our second data set is the pk168 data set of Sato et al. [83]. This setcontains 168 pseudoknotted structures from 16 categories of pseudoknots,and was compiled from the PseudoBase database [96]. The sequences in thisset have at most 85% similarity and have length of at most 140 nucleotides.We refer to this data set as “IP-pk168”. We note that there is no overlapbetween the IP-pk168 and the HK-PK data sets.Our third data set is the test data set of Sperschneider et al. [89]. Thisset contains 16 pseudoknotted structures with strong experimental support.RNA sequences in this set have length between 34 and 377 nucleotides. Werefer to this data set as “DK-pk16”.One of our proposed algorithms, CCJ, has high memory requirementsand so we were unable to run it on RNA strands of length > 195 bases.To accommodate CCJ’s length restriction we compiled two extra data sets,“HK-PK*”, and “DK-pk16*” respectively corresponding to the HK-PK andthe DK-pk16 data sets without sequences longer than 195 bases. Table 2.2232.3. Bootstrap percentile confidence intervalssummarizes the data sets used in this thesis.Table 2.2: Summary of data sets used in this thesis.Name # of sequences sequence lengths ReferenceHK-PK 88 26 – 400 [11]HK-PK* 85 26 – 195 [11]HK-PK-free 337 10 – 194 [11]IP-pk168 168 21 – 137 [83]DK-pk16 16 34 – 377 [89]DK-pk16* 13 34 – 194 [89]2.3 Bootstrap percentile confidence intervalsTo formally assess the dependency of measured prediction accuracy of resultsof a method on a given set of RNA we use bootstrap confidence intervals,a well-known statistical resampling technique [47, 97]. Following the recentwork of Aghaeepour and Hoos [3] and Hajiaghayi et al. [41] we calculatethe bootstrap 95% percentile confidence interval of average F -measure asfollows. For each vector f of F -measures (where, for example, f may bethe F-measures of predictions obtained by Iterative HFold on pseudoknot-ted structures) we first take 104 resamples with replacement, where theresamples have the same length as the original sample vector f (i.e., |f |),and then calculate their average F -measures. These 104 calculated averageF -measures represent the bootstrap distribution for the vector f . We thenreport the 2.5th and 97.5th percentile of this distribution (i.e., the bootstrapdistribution of the 104 average F -measures calculated above) as the lowerand upper bounds of the confidence interval respectively, and call it the boot-strap 95% percentile confidence interval. By reporting the bootstrap 95%percentile confidence interval for average F -measure of a method, A, on adata set, D, we say that we are 95% confident that the average F -measure ofmethod A on the data set D is in the reported interval. All calculations areperformed using the “boot” package of the R statistics software environment[93].242.4. Permutation test2.4 Permutation testFollowing the recent work of Hajiaghayi et al. [41], we use a two-sidedpermutation test to assess the statistical significance of the observed per-formance differences between two methods. The test proceeds as follows,given a data set and two structure prediction procedures, A and B. First,we calculate the difference mean(fA)−mean(fB) in means between sets ofF -measure values obtained by A and B. Then we combine the two sets fAand fB and record the difference in sample means for 104 randomly chosenways of choosing two sets with the same size as |fA| and |fB| from the com-bined set. The p-value is the proportion of the sampled permutations wherethe absolute difference was greater than or equal to that of absolute differ-ence of the means of sets fA and fB. Then, if the p-value of this test is lessthan the 5% significance level, we reject the null hypothesis that methodsA and B have equal accuracy and thus accept the alternative hypothesisthat the difference in accuracy of method A and B is significant. Otherwise,we cannot reject the null hypothesis. All calculations are performed usingthe “permTS” method in the“perm” package of the R statistics softwareenvironment.2.5 SummaryIn this chapter we reviewed many thermodynamics-based methods for pre-diction of pseudoknotted RNA secondary structures. We explained thatMFE pseudoknotted RNA secondary structure prediction is NP-hard andtherefore MFE-based algorithms handle restricted classes of pseudoknottedstructures. While MFE-based algorithms are guaranteed to find the mini-mum free energy secondary structure from the class considered, they usuallyhave large space and time complexities that limit their usability. Moreover,the accuracy of prediction of MFE-based algorithms is poor for long RNAsequences.We then provided examples of heuristic methods for RNA pseudoknottedsecondary structure prediction. Heuristic algorithms are not guaranteed tofind the optimal structure including pseudoknots, but generally run fasterthan the MFE-based algorithms that handle the same class of structures.However, HotKnots is still slow on long sequences.Methods based on maximum expected accuracy, such as IPknot, predictthe base pairings with maximum pairing probabilities in the thermodynamicensemble. A disadvantage of methods based on maximum expected accuracy252.5. Summaryis that they do not provide energy of the predicted structure. Note that thisis easy to calculate, however.We explained that incorporating input constraints in prediction algo-rithms improves prediction accuracy and we discussed ShapeKnots as anexample of a method that incorporates SHAPE reactivity scores as inputconstraints.Furthermore, we provided a high level description of HFold, a methodbased on the hierarchical folding hypothesis. HFold’s output structure isthe minimum free energy density-2 structure conditioned on the given inputstructure.In this chapter we also described the data sets we used in this thesis,and explained the statistical methods used in this work to compare perfor-mance of our algorithms with themselves and with some of the best existingmethods for prediction of RNA pseudoknotted structures.To summarize, existing methods for prediction of pseudoknotted struc-tures suffer from one or more of the following shortcomings: 1) slow runningtime, 2) poor prediction accuracy, or 3) limited classes of pseudoknots han-dled. Moreover there is limited opportunity for the user to provide structuralinformation, or constraints, that can guide prediction. In cases of a predic-tion method that incorporates user-defined constraints, it is also useful tounderstand the degree to which the method’s accuracy persists as the inputinformation degrades.The contributions described in this thesis aim to address the above men-tioned shortcomings pertaining to improved pseudoknotted RNA secondarystructure algorithms.26Chapter 3Energy modelIn this chapter we provide formal definitions of secondary structure featuresand details of the energy model used in this thesis.3.1 RNA secondary structureLet S be an RNA sequence of length n. When i ≤ l, a region [i, l] is the setof indices between i and l, inclusive and when i > l, the region [i, l] is theempty set. A gapped region is the union of two regions [i, j] and [k, l] withi < j + 1 < k ≤ l. Base pair d.e spans gapped region [i, j] ∪ [k, l] if d ∈ [i, j]and e ∈ [k, l]. With respect to secondary structure R, we say that region[i, l] is structure-free if no base in the region is paired in R. Region [i, l] isweakly closed if no base pair connects a base in the region to a base outsidethe region.A weakly closed region [i, j], with at least two bases, is closed if it can-not be partitioned into two smaller weakly closed regions. Region [i, l] is apseudoknot if it is weakly closed, both i and l are paired but not to eachother, and the region cannot be partitioned into distinct weakly closed re-gions. Base pairs of secondary structure R partition the unpaired bases ofsequence S into loops [74]. Loops, and their associated unpaired bases andclosing base pairs, are defined as follows. Some of the loops are illustratedin Figure 3.1.Loops in pseudoknot-free structures are comprised of regions of unpairedbases, separated by “closing” base pairs, from which stems of base pairsemanate.• A hairpin loop has one closing base pair i.l; all bases in [i + 1, l − 1]are unpaired and belong to the loop.• An internal loop has an external closing base pair i.l and a secondclosing base pair d.e, which is nested in i.l (i.e., i < d < e < l). Allbases in regions [i+ 1, d− 1] and [e+ 1, l− 1] are unpaired and belongto the loop. In a bulge loop either [i + 1, d − 1] or [e + 1, l − 1] is an273.1. RNA secondary structure3′G GG G GC CC C CAA A AAAAA A A A A AA AAA AAAAA AG G G G G GC C C C C CStacked pairInternal loopA Hairpin loopBulgeWeakly closedGC CA AGNOT Weakly closed3 8 15A2 6 14 20 28 32 20Band1 15′3′3′5′UnpairedStart of pseudoloopEnd of pseudoloopUnpaired3′5′5′Figure 3.1: An H-type pseudoknotted structure (left) and a pseudoknot-freestructure (right) in graphical (top) and arc diagram (bottom) formats. Inthe top left structure, the start and end of the pseudoloop, unpaired bases ofa pseudoknotted structure and one of the two bands are labelled. In the topright structure, some of the loops are annotated by their type: loops withone emanating branch are hairpin loops, with two emanating branches areeither internal or bulge loops; and with three or more emanating branchesare multi-loops (not shown here).283.1. RNA secondary structureempty region, but not both. A special case is a stacked pair, in whichd = i+ 1 and e = l − 1.• A multiloop with external base pair i.l arises in two cases. In thefirst, some base pair d.e is nested in i.l; also regions [i+ 1, d− 1] and[e+ 1, l − 1] are weakly closed but are not both structure-free. In thesecond, [d, e] is a pseudoknot and regions [i+ 1, d− 1] and [e+ 1, l− 1]are weakly closed (and may both be structure-free). An unpaired baseu in [i, l] belongs to the multiloop if u can see the base pair i.l - thatis, there is no base pair x.y in R with i < x < u < y < l. The closingbase pairs of the multiloop are i.l, the base pair d.e if in the first case,plus any non-pseudoknotted base pair x.y in [i+1, d−1]∪ [e+1, l−1],where both x and y can see i.l.• A pseudoloop is associated with each pseudoknot [i, l]. Its closing basepairs are of two types. Base pairs of the first type are the borders of thepseudoknot’s bands. Here, a band is a maximal set of pseudoknottedbase pairs, all of which cross exactly the same set of base pairs, andits borders are the base pair(s) with the largest and smallest spans.We refer to the border with the largest span as the external border. Aband belongs to pseudoknot [i, l] if its base pairs are within [i, l] andare not within any pseudoknot nested within [i, l]. Closing base pairsof the second type are any pseudoknot-free base pairs whose endpointscan see at least two bands of the pseudoknot. Finally, unpaired basesof the pseudoloop are those unpaired bases in [i, l] that do not belongto any other loop within [i, l].• An external loop contains all of those unpaired bases in [1, n] that arenot in any other loop.A stem is a maximal sequence bp1, bp2, ..., bpk of base pairs, where suc-cessive base pairs bpi and bpi+1 are the closing base pairs of an internalloop. Note that a band may be composed of several stems, separated bymultiloops.We define density of a structure as follows: Let L be a pseudoloop andi.i′ and j′.j be the closing base pairs of L. We say a band [i1, i′1] ∪ [j′1, j1]crosses k if i1 ≤ k ≤ j1. Let #B(L, k) be the number of bands associatedwith L that cross k. Then the density of L is:density(L) = max{#B(L, k)|i, j define borders of L; i ≤ k ≤ j}293.2. Energy modelThe density of a structure, R, is the maximum density of L over allpseudoloops L of R. We say R is a density-2 structure if the density of R isat most 2.We can obtain a motif for a secondary structure, which describes thepattern of overlaps of its stems, in the following way. Label each stem witha distinct symbol, write each symbol under both the left and right endsof the stem in the arc diagram for the structure (so each symbol appearstwice), and concatenate the symbols in order from left to right. A structureis an H-type pseudoknot if its motif is ABAB, a kissing hairpin structureif its motif is ABACBC, and a chain of four overlapping stems if its motifis ABACBDCD (some of these types are illustrated in Figure 3.2). If themotif for a structure has substring AxByAzB, where A and B are symbolsand x, y, and z are arbitrary strings, not all of which are empty, then wesay that the structure has an H-type pseudoknot with nested substructures.Similarly, a structure contains a kissing hairpin with nested substructures ifits motif has a substring of the form AvBwAxCyBzC, where A, B and Care symbols and v, w, x, y, and z are arbitrary strings, not all of which areempty.3.2 Energy modelMany computational methods for predicting the secondary structure of anRNA (or DNA) molecule are based on models of the free energy of loops[11, 33, 48, 52, 63, 72, 75, 77]. Roughly speaking, base pairs tend to stabilizean RNA structure, whereas unpaired bases form destabilizing loops. In athermodynamics-based energy model the free energy of a sequence S withrespect to a fixed secondary structure R is the sum of the free energies ofthe loops of R. We define the free energy of a strand S to be the minimumfree energy of the strand, with respect to all structures R, which are usuallytaken over some well-defined class of structures, such as pseudoknot-freestructures. The parameters of this model are driven in part by currentunderstanding of experimentally determined free energies, and in part bywhat can be incorporated into an efficient algorithm. The free energy of aloop depends on temperature; throughout we assume that the temperatureis fixed. In this work we assume the temperature is fixed at 37 ◦C.Pseudoknot-free energy modelWe first summarize the notation used to refer to the free energy of pseudoknot-free loops, along with some standard assumptions that are incorporated into303.2. Energy modelA AB C D E FB CD EFA AB BC CD D(a)(b)3′5′ 3′5′Figure 3.2: (a) Arc diagram representation of a simplified version of thestructure of the aptamer core of a SAM-IV riboswitch [98], but preservingpatterns of crossing base pairs. Below both the left and right endpointsof each stem is a letter which identifies that stem; concatenating the lettersfrom left to right yields the motif for this structure, namely ABCBDED-CAFFE. (b) A chain of four overlapping stems. This structure has motifABACBDCD. If either the leftmost or rightmost stem were removed fromthis structure, the result would be a kissing hairpin structure, with motifABACBC.313.2. Energy modelpseudoknot-free energy models. We refer to a model that satisfies all of ourassumptions as a standard free energy model.• eH(i, j): gives the free energy of a hairpin loop closed by i.j. Weassume that for all but a small number of cases, eH(i, j) depends onlyon the length of a the loop, and the two paired bases i and j.• eS(i, j): gives the free energy of a stacked pair that consists of i.j and(i+ 1).(j − 1).• eint(i, d, e, j): gives the free energy of an internal loop or bulge withexterior pair i.j and interior pair d.e.• The free energy of a multiloop with k branches and u unpaired basesis a+ bk + cu, where a, b and c are constants.Pseudoknotted energy model• The total energy of a band is the sum of the energies of its loops. Ifa band has no loops, i.e. consists of just one base pair, we define itsenergy to be 0.• estP (i, j): defines the energy of a stacked pair in a band.• eintP (i, d, e, j): defines the energy of internal loop that spans a band.• The free energy of a multiloop that spans a band with k branches andu unpaired bases is a′ + b′k + c′u, where a′, b′ and c′ are constants.The energy of an exterior pseudoloop is calculated as: energy of bands plusPb ×m+ Pps × k + Pup × u+ Ps, where m is the number of the bands, k isthe number of closed subregions and u is the number of unpaired bases. Ifthe pseudoknot is inside a multiloop or a pseudoloop, Ps is replaced by Psmor Psp respectively. To illustrate the model, we calculate the free energy ofthe pseudoknotted structure of region [1, 32] of Figure 3.1 (left) as follows:Ps + 2× Pb + 14× Pup + estP (2, 3, 19, 20) + eintP (3, 5, 18, 19) +estP (5, 6, 17, 18) + estP (14, 15, 29, 30) + estP (15, 16, 28, 29).Table 3.1 summarizes the energy constants and functions used in ourenergy model for pseudoknotted structures. The values of these energyparameters are those of the DP09 parameter set of Andronescu et al. [11],used by the HotKnots V2.0 prediction software.323.2. Energy modelWe note that there are minor differences between energy models usedby the algorithms presented in this thesis. In particular, in HFold, IterativeHFold and CCJ, some specific functions, including eint and eintP (and sim-ilarly in calculation of multiloop energy) are calculated in slightly differentways in the pseudoknot-free and pseudoknotted energy models because onetakes dangling ends into account while the other does not. HotKnots V2.0is consistent in including dangling penalties in calculating energy for bothpseudoknot-free and pseudoknotted structures.333.2.EnergymodelTable 3.1: Energy parameters. This table provides the names, description and values of the energy parametersand functions that we used in our methods. The names and definitions are the same as in our original HFold [52],and the values were updated based on the work of Andronescu et al. [11]. These parameters were derived for atemperature of 37 ◦C and 1 M salt (NaCl) concentration.Name Description Value (Kcal/mol)Ps exterior pseudoloop initiation penalty −1.38Psm penalty for initiation of pseudoloop in a multiloop 10.07Psp penalty for initiation of pseudoloop in a pseudoloop 15.00Pb penalty for initiating a band 2.46Pup penalty for unpaired base of a pseudoloop 0.06Pps penalty for closed subregion 0.96inside a pseudoloopeH(i, j) energy of a hairpin loop closed by i.jeS(i, j) energy of stacked pair closed by i.jestP (i, j) energy of stacked pair that spans a band 0.89× eS(i, j)eint(i, d, e, j) energy of a pseudoknot-free internal loopeintP (i, d, e, j) energy of internal loop that spans a band 0.74× eint(i, d, e, j)a penalty for initiation of an ordinary multiloop 3.39b multiloop base pair penalty 0.03c penalty for unpaired base of an ordinary multiloop 0.02a′ penalty for initiation of a multiloop 3.41that spans a bandb′ branch penalty in a multiloop 0.56that spans a bandc′ penalty for unpaired base in a multiloop 0.12that spans a band34Chapter 4Hierarchical folding-basedpseudoknotted secondarystructure predictionIn this chapter we describe our Iterative HFold algorithm for RNA pseudo-knotted structure prediction, given as guidance an initial pseudoknot-freeinput structure. Motivated by the literature on the hierarchical foldinghypothesis and to overcome some of the shortcomings of the MFE-basedmethods and those based on the hierarchical folding (see Chapter 2), wedefine a new folding hypothesis, namely “relaxed hierarchical folding”, anddescribe the Iterative HFold algorithm, which uses structure modificationconsistent with this hypothesis, while stays in the O(n3) time and O(n2)space complexity. Iterative HFold incorporates four different methods andreports the structure with the lowest energy, among all structures producedby these methods, as its final structure. In the first two subsections we givea high-level description of two earlier methods that Iterative HFold buildson, namely SimFold (Section 4.1) and HFold (Section 4.2). In Section 4.3we explain a third component of Iterative HFold, namely our HFold-PKonlyalgorithm, and discuss its differences with HFold. We then present the de-tails of Iterative HFold in Section 4.4, along with examples and motivationfor this approach. In Section 4.5 we present the summary of this chapter.4.1 SimFoldIterative HFold uses the SimFold RNA secondary structure prediction method[8, 9], which predicts the minimum free energy pseudoknot-free secondarystructure for a given RNA sequence. SimFold uses a dynamic programmingmethod similar to Zuker’s MFold method [105], but achieves higher predic-tion accuracy than MFold. The pseudoknot-free energy parameters proposedby Andronescu at al. [9] were estimated using their constraint generationprogram based on the current values of the standard energy model [63], tak-354.2. HFolding into account most of the underlying experimental data values and a largeset of known structures with various topologies. Andronescu et al. achievedon average about 7% prediction accuracy improvement over the standardenergy model. In addition to an RNA sequence S, SimFold can also takea pseudoknot-free secondary structure, G, as input and predict the MFEpseudoknot-free secondary structure that contains all base pairs of G.Following we briefly review key ideas of the dynamic programming al-gorithm, which predicts the energy of the MFE pseudoknot-free secondarystructure for a fixed sequence S = s1s2 . . . sn [63]. Let Wi,j be the en-ergy of the MFE pseudoknot-free secondary structure for the subsequencesisi+1 . . . sj . If i ≥ j, Wi,j = 0, since the subsequence is empty. Otherwise,it must either be that i.j is a base pair in the MFE structure for si . . . sj , orthat the MFE structure can be decomposed into two independent subparts.These two cases correspond to the two rows of the following recurrence forWi,j .Wi,j = min{Vi,j ,mini≤r<jWi,r +W(r+1),j ,where Vi,j is the free energy of the MFE structure for si . . . sj that containsi.j. If i ≥ j, Vi,j is set to be ∞. Otherwise, i.j closes a hairpin, an internalloop, or a multiloop in the MFE structure for si . . . sj .We note that in mini≤r<jWi,r +W(r+1),j , i and j are fixed, and we takethe minimum over all values of r, such that i ≤ r < j.4.2 HFoldHere, we provide a high level description of the HFold algorithm [52].Similar to the definition of Wi,j for the pseudoknot-free case, here wedefine Wi,j as the energy of the MFE pseudoknotted secondary structure forthe subsequence sisi+1 . . . sj , where the structure is constrained as follows.Let G be a given pseudoknot-free structure for a given sequence S, and Gijbe substructure of G for the strand si . . . sj . Let G′ be the pseudoknot-freestructure for the sequence S, such that G 6= G′ and G ∪ G′ is the lowestenergy structure for S given G. We define G′ij as the substructure of G′for the strand si . . . sj . If some arc of G covers i or j, then Wi,j = ∞. Ifi ≥ j, then Wi,j = 0. Otherwise we define Wi,j to be the energy of theMFE secondary structure Gij ∪G′ij for the strand si . . . sj . In this case, Wi,j364.3. HFold-PKonlysatisfies the following recurrence:Wi,j = minVi,j ,mini≤r<jWi,r +W(r+1),j ,WMBi,j + Pswhere the first two cases are the same as for pseudoknot-free cases andthe last case is specific to pseudoknotted structures. Ps is the pseudoknotinitiation penalty, given in the energy table presented in Section 3.2.The third row of this recurrence accounts for the case when the optimalsecondary structure Gij ∪G′ij includes pseudoknotted base pairs and cannotbe partitioned into two independent substructures for two regions [i, r] and[r + 1, j], for some r. Such a structure must contain a chain of two ormore successively-overlapping bands, which must alternate between Gij andG′ij , possibly with nested substructures interspersed throughout. Figure 4.1provides an example, and shows how the recurrence for WMB unwinds whenthe example structure is the MFE structure.In order to calculate the energies of substructures in such a structure inthe recurrences, we use three additional terms: BE, VP, and WI. Roughly,these account for energies of bands spanned by base pairs of Gij , regions en-closed by pseudoknotted base pairs of G′ij (excluding part of those regionsthat are within a band of Gij), and weakly closed subregions, respectively.WMB′ complements a special case of WMB recurrence in which the right-most band is not in G, but is part of the structure G′ .We refer the reader to our HFold paper [52] for further details of thealgorithm.4.3 HFold-PKonlyThe third method on which Iterative HFold builds, called HFold-PKonly,is identical to HFold except that G′ may only contain base pairs that crossbase pairs in G. Similar to HFold, HFold-PKonly is limited to density-2 structures. The prediction provided by HFold-PKonly can be useful incases where HFold does not produce a pseudoknotted structure.HFold-PKonly solves the following problem: given an RNA sequence S,and a pseudoknot-free input structure G, find a pseudoknot-free structureG′ such that every base pair in G′ crosses some base pair of G and such thatG ∪ G′ is the lowest energy structure that contains G among all such G′s.Note that G′ may contain no base pairs.374.3. HFold-PKonly(1) (2)(3) (4)(5)iji jl1i WIWMBi,jl2 jV Pl2,l1l1i l2 jl3WIl1WMB′i,l3iWMB′i,(l2−1)BEBEWMB′i,l1l2 jl3 l1V Pi,l3Figure 4.1: Illustration of how the WMB recurrence unwinds, to calculateWMBi,j . Arcs above the horizontal line from i to j represent base pairsof Gij , and arcs below the line represent base pairs of G′ij . Note that Gi,jis the part of the given pseudoknot-free structure G whose base pairs liebetween positions i and j, assuming that no arc of G covers i or j. G′i.j isthe part of the given pseudoknot-free structure G′ added by HFold whosebase pairs lie between positions i and j such that Gij ∪ G′ij is the MFEstructure for strand si . . . sj given Gi,j . (1) The overall structure is handledby WMBi,j recurrence. (2) A case of the WMB recurrence handles the overallstructure whose energy is WMBi,j , with terms to account for energies of theright upper band (BE) and right lower closed subregion (WI) as well asthe remaining structure (WMB′i,l1). (3) The term WMB′i,l1 is handled bya case of the WMB′ recurrence, with terms to account for the lower rightsubstructure labelled VPl2,l1 , the upper left band (BE), and the remainingstructure (WMB′i,(l2−1)). (4) WMB′i,(l2−1) is then handled by a case of theWMB′ recurrence, with terms to account for WI(l3+1),(l2−1) and WMB′i,l3 .(5) Finally, the WMB′i,l3 term is handled by VPi,l3 .384.4. Iterative HFold4.4 Iterative HFoldIterative HFold is distinguished from the above three methods, namely Sim-Fold, HFold and HFold-PKonly, in two important ways. First, the outputof the HFold, HFold-PKonly and SimFold methods must contain the givenpseudoknot-free input structure G, whereas Iterative HFold may modify theinput structure. This can be useful when the given input structure is nota high-accuracy estimate of the true pseudoknot-free substructure of thereference structure, Gbig. Given a pseudoknotted structure, G, Gbig is thepseudoknot-free structure obtained from G when minimum number of pseu-doknotted base pairs were removed. We refer to the removed base pairsfrom G as Gsmall structure. Blue arcs in Figure 1.2 represent base pairs inGbig structure and green arcs represent the base pairs in Gsmall structure.Second, while HFold and HFold-PKonly can add base pairs that crossthose in G, they cannot add base pairs that cross each other, and neithercan SimFold. In contrast, Iterative HFold can add base pairs that cross eachother. This is particularly useful when the input structure contains limitedinformation about Gbig, and so it is necessary both to predict base pairs inGbig and in Gsmall in order to get a good prediction.Iterative HFold is comprised of four different iterative methods. Follow-ing the description of each method, we motivate why we chose to includeit as part of our overall algorithm. Iterative HFold takes as input both anRNA sequence S and a pseudoknot-free secondary structure G; later weshow that structure G can be produced by computational methods, for ex-ample, HotKnots hotspots or SimFold suboptimal structures, when only thesequence S is initially available.Iterative HFold considers the problem of predicting the secondary struc-ture as follows: Given an RNA sequence S, and a pseudoknot-free inputstructure G, run the following four methods and pick the structure with thelowest free energy among these four as the output structure. We refer to thisfolding hypothesis as the “relaxed” hierarchical folding hypothesis. IterativeHFold runs in O(n3) time, as it runs four methods sequentially, when eachone is O(n3).Method 1: Run HFold on S and G, and store the resulting G ∪G′.Motivation: This is the core HFold method, motivated by thehierarchical folding hypothesis.Method 2: First run HFold-PKonly on S and G. If HFold-PKonly results394.4. Iterative HFoldin a structure G ∪ G′ such that G′ is not the empty structure,then run HFold with sequence S and structure G′, and storethe result. Otherwise, simply store G as the result. See thefollowing example. (We note that running HFold with S and G′results in a structure G′ ∪ G′′, where it may be the case thatG′′ 6= G (i.e., G may not be part of the result of method 2).)Motivation: When input structure G does not agree with thereference Gbig structure, it may still be the case that HFold-PKonly finds the pseudoknotted structure Gsmall (or a goodapproximation to Gsmall). A call to HFold with input Gsmallmay then find a better approximation to Gbig.404.4. Iterative HFoldExample 1: Example of results of method 1 and method 2 of Iterative HFold.S = GGGCUCUGGAGCCCCCCCCGAGCCCCGGCUAACUCUAUCUGUAGGGGGGCG = ...........(((((...........................)))))..Method 1: HFold on S and G[[[[[[.[[..(((((]].]]]]]]..................))))).. -12.75Method 2:HFold Pkonly on S and G[[[[[[.[[..(((((]].]]]]]]..................)))))..G1[[[[[[.[[.......]].]]]]]].........................HFold on S and G1[[[[[[.[[.((((((]].]]]]]]...................)))))) -14.67In this example, method 2 of Iterative HFold outperforms method 1:although both HFold and HFold PKonly produce the same result on se-quence S and input structure G, namely the structure G ∪ G′ , the addi-tional iteration in method 2, in which HFold is run with S and G′ , finds astructure with lower energy than that of G ∪ G′ .Method 3: First run SimFold on S andG to obtain resultG′ — a pseudoknot-free structure that contains G. Then let Gupdated be the sec-ondary structure of S containing the relaxed stems of G′ thatinclude the base pairs of G. By a relaxed stem, we mean a sec-ondary structure containing stacked base pairs, bulges of size 1and internal loops of maximum size of 3 (i.e., either the symmet-ric loop of 1× 1 or the non-symmetric loop of 1× 2 or 2× 1 butno other loop types; this is motivated by common practice [15]).Then run method 2 on S and Gupdated, and store the result. SeeExample 2.Motivation: This method can work well when the given in-put structure has a small number of base pairs from Gbig, be-cause Gupdated contains stems that includes these base pairs,but avoids “overcrowding” with further base pairs that mightprevent HFold-PKonly from finding pseudoknotted stems.414.4. Iterative HFoldExample 2: Example of result of method 3 compared to all four meth-ods of Iterative HFold.S = GUUUGUUAGUGGCGUGUCCGUCCGCAGCUGGCAAGCGAAUGUAAAGACUGACG = ............(..........)............................Method 1: HFold on S and G((((((((((.(((........))).))))))))))................ -11.01Method 2 and 4:(((.......))).[[[[.[[.(((.]].]]]].)))............... -5.08Method 3:Simfold on S and G((((((((((.(((........))).))))))))))................Gupdated((((((((((.(((........))).))))))))))................Method 2 on S and Gupdated((((((((((.(((.[[[.[[[))).)))))))))).........]]].]]] -11.26In this example, method 3 of Iterative HFold outperforms the other methods.Because the input structureG consists of just one base pair, method 1 (HFold)outputs a pseudoknot-free structure containing G. The output of both meth-ods 2 and 4 are pseudoknotted but do not contain the base pair of the in-put structure G.In contrast, method 3 first adds base pairs to G, resulting in the pseudoknot-free structure Gupdated, and then adds additional pseudoknotted base pairsvia method 2.Method 4: Let S1 be the subsequence of S obtained by removing bases thatare external unpaired bases with respect to input structure G.Run SimFold on S1 and G1 (with base indices renumbered toagree with S1), to obtain pseudoknot-free structure G′. Thencontinue exactly as in method 3. See Example 3.Motivation: This method is very similar to method 3, but fur-ther constrains G′ since the base pairs in G′ cannot involve basesthat are removed from S to obtain S1. This potentially increasesthe possibilities for pseudoknotted base pairs to be added bymethod 2.424.4. Iterative HFoldExample 3: Example of result of method 4 compared to all four meth-ods of Iterative HFold.S = CCGAGCUCUGUAGCGAGUGCUUGUAACCCGAGCGGGGGCG = .(..................................)..Method 1 and 3:((..((((.((.((((....)))).))..))))...)). -5.10Method 2:.(..................................).. +3.04Method 4:S1 and G1S1 = CGAGCUCUGUAGCGAGUGCUUGUAACCCGAGCGGGGG1 = (..................................)Simfold on S1 and G1:(..((((.((.((((....)))).))..))))...)Gupdated.(..((((.((.((((....)))).))..))))...)..Method 2 on S and Gupdated((..((((.((.((((..[[)))).))..))))..))]] -6.05In this example, method 4 of Iterative HFold outperforms the other methods.The input structure G has a high energy value and neither method 1 (HFold)nor method 2 (HFold-PKonly) can expand the pseudoknot-freestructure to add the pseudoknotted stem.Also, by adding too many pseudoknot-free base pairs, method 3 fails to find thepseudoknotted base pairs. Thus, method 4 performs better than meth-ods 1, 2 and 3.434.4. Iterative HFoldFollowing is the pseudocode of Iterative HFold’s algorithm.Algorithm 4.4.1: Iterative HFold (Sequence S, Structure G)input: sequence S and structure Goutput: structure Gfinal and its energy Efinalcomment: Run methods 1-4 and choose the lowest energy structure(G1, E1)← method1(S,G)Gfinal ← G1Efinal ← E1(G2, E2)← method2(S,G)if (E2 < Efinal)then{Gfinal ← G2Efinal ← E2(G3, E3)← method3(S,G)if (E3 < Efinal)then{Gfinal ← G3Efinal ← E3(G4, E4)← method1(S,G)if (E4 < Efinal)then{Gfinal ← G4Efinal ← E4return (Gfinal, Efinal)Algorithm 4.4.2: method1(Sequence S,Structure G)input: sequence S and structure Goutput: structure Gmethod1 and its energy Emethod1comment: Run HFold and return resultsreturn (HFold(S,G))444.4. Iterative HFoldAlgorithm 4.4.3: method2(Sequence S,Structure G)input: sequence S and structure Goutput: structure Gmethod2 and its energy Emethod2(G1, E1)← HFold-PKonly(S,G)if (G1 is empty structure)then return (G)else{G′ ← (G1 −G)return (method1(S,G′))454.4. Iterative HFoldAlgorithm 4.4.4: method3(Sequence S,Structure G)input: sequence S and structure Goutput: structure Gmethod3 and its energy Emethod3procedure ObtainRelaxedStems(G1, G2)Gresult ← G1for each (i.j ∈ G2)doif i.j 6∈ G1thencomment: extend stems of G1if ((i− 1).(j + 1) or (i+ 1).(j − 1)) ∈ G1then Gresult ← Gresult ∪ i.jcomment: include bulges of size 1else if ((i− 2).(j + 1) or (i− 1).(j + 2)or (i+ 1).(j − 2) or (i+ 2).(j − 1)) ∈ G1then Gresult ← Gresult ∪ i.jcomment: include loops of size 1×1else if ((i− 2).(j + 2) or (i+ 2).(j − 2)) ∈ G1then Gresult ← Gresult ∪ i.jcomment: include loops of size 1×2 or 2×1else if ((i− 3).(j + 2) or (i− 2).(j + 3)or (i+ 2).(j − 3) or (i+ 3).(j − 2)) ∈ G1then Gresult ← Gresult ∪ i.jreturn (Gresult)mainG′ ← SimFold(S,G)Gupdated ← ObtainRelaxedStems(G,G′)return (method2(S,Gupdated))464.5. SummaryAlgorithm 4.4.5: method4(Sequence S,Structure G)input: sequence S and structure Goutput: structure Gmethod4 and its energy Emethod4k ← 1Gupdated ← Gfor each disjoint substructure Gk in GdoSk ← subsequence of S corresponding to GkG′k ← SimFold(Sk, Gk)G′k,updated ← ObtainRelaxedStems(Gk, G′k)Gupdated ← Gupdated ∪G′k,updatedreturn (method2(S,Gupdated))4.5 SummaryIn this chapter we presented a new method that addresses shortcomingsof pseudoknotted RNA secondary structure prediction methods discussedin Section 2.5, namely, slow running time, generality problem, and limitedopportunity for providing input constraint. Iterative HFold has running timecomplexity of O(n3) and space complexity of O(n2). It is because IterativeHFold runs four methods sequentially where each method has O(n3) timeand O(n2) space. Iterative HFold handles a general class of pseudoknottedstructures, namely density-2 structures. Later in Chapter 6, we show thatIterative HFold has better prediction accuracy on two key data sets thanIPknot and is significantly faster than HotKnots V2.0 and has significantlybetter accuracy than HotKnots V2.0 on the IP-pk168 data set.Our method, Iterative HFold, takes a pseudoknot-free input structureand produces a possibly pseudoknotted structure whose energy is at least aslow as that of any density-2 pseudoknotted structure containing the inputstructure. Iterative HFold incorporates four different methods and reports asits final structure the structure with the lowest energy, among all structuresproduced by these methods. While one of its underlying methods, HFold,strictly adheres to the hierarchical folding hypothesis, the other three useiterations to extend or remove the base pairs of input structure, with thegoal of finding a structure that has lower energy than the structure found byHFold. Thus, unlike HFold, iterative HFold is able to modify the input struc-474.5. Summaryture (while the class of structures handled by both methods is the same).This is valuable since 1) computationally produced input structures may notbe completely accurate and 2) while the hierarchical folding hypothesis isa useful guiding principle, there is evidence that allowing for disruption ofsome base pairs in the initially formed pseudoknot-free secondary structurecan improve prediction [31, 100].All of Iterative HFold’s underlying methods use the energy model ofHotKnots V2.0 DP09 [11]; with this model, HFold obtained predictionswith higher accuracy than those obtained with our earlier implementationof HFold.48Chapter 5MFE-based RNApseudoknotted secondarystructure predictionIn this chapter we discuss our MFE-based algorithm for prediction of RNApseudoknotted secondary structures that significantly expands the class ofstructures that can be predicted in O(n5) time and O(n4) space. The CCJalgorithm can handle a more general class of motifs than any other O(n5)MFE-based algorithm, including some biologically important structures suchas kissing hairpin and chains of four overlapping stems (see Figures 1.2 and2.1). CCJ can also handle arbitrary nested substructures of these types.We start with explaining the energy model in Section 5.1 followed by theclass of structures handled by the CCJ algorithm in Section 5.2. Then wediscuss CCJ recurrences in Section 5.3. In Section 5.4 we discuss time andspace complexity of the CCJ algorithm. Finally in Section 5.5 we provide asummary of this chapter.5.1 CCJ energy modelThe energy model used in the CCJ algorithm is a loop-based energy model,in which the energy of a secondary structure is calculated as sum of theenergy of the structure’s loops. Several parameters and functions informthe energy model. We used the DP09 energy model of HotKnots V2.0 [11],summarized in Table 3.1. The published version of the CCJ algorithm [22]uses a slightly more general model, but in this thesis we use the DP09 energymodel, so that the comparison of accuracy of Iterative HFold and CCJ isbased on a consistent energy model. There are few energy functions in theCCJ energy model that are not explicitly in the DP09 model. The valuesof such functions typically depend on the bases in positions i, l, i + 1 andl − 1 of S, and whether i < l. We have set values of these functions to 0,in order to make the models as similar as possible. Table 5.1 summarizes495.1. CCJ energy modelthese functions. We assume that any function specified in the energy modelof the CCJ algorithm can be calculated in constant time, given sequence S.Name Descriptionβ2(i, l) penalty for closing pair i.l or l.i of an ordinary multiloopβp2(i, l) penalty for closing pair i.l or l.i of a multiloop that spans a bandγ2(i, l) penalty for closing pair i.l or l.i of a pseudoloopTable 5.1: CCJ energy functions that are not explicitly in the DP09 model.The values of such functions typically depend on the bases in positionsi, l, i + 1 and l − 1 of S, and whether i < l. We have set values of thesefunctions to 0, in order to make the models as similar as possible.As explained in Section 3.2 the energy of an internal loop or multiloopdepends on whether or not the external base pair of the loop is pseudoknot-ted. If it is not, we call the loop ordinary, and otherwise say that the loopspans a band. If the external base pair of an ordinary internal loop, notincluding stacks, is i.l and the other closing base pair is d.e, then similar tothe DP09 energy model, the energy of the internal loop is calculated by acall to the function eint(i, d, e, l). If the internal loop spans a band we callthe function eintP (i, d, e, l) instead.The energy associated with an ordinary multiloop which has U unpairedbases, external base pair i.l, and set of other closing base pairs C isa+ c× U + β2(l, i) +∑d.e∈Cβ2(d, e).Note that here, the order of the parameters to the function β2 differs,depending on whether or not the parameters identify the external base pair.That is, for the external base pair i.l, the larger index, l, is the first parameterto β2 whereas for the other closing base pair d.e, the smaller index, d, is thefirst parameter to β2. This is because in some energy models, the penaltyfor the closing base pair of an ordinary multiloop may also depend on baseswithin the loop that are adjacent to the closing base pair. If the closingbase pair is the external base pair i.l, then the adjacent bases are i+ 1 andl− 1, and if the closing base pair d.e is not the external base pair, then theadjacent bases are d−1 and e+1. In either case, the order of the parametersto β2(x, y) ensures that the base adjacent to the first parameter, x, is x− 1and the base adjacent to the second parameter, y, is y + 1. The energyassociated with a multiloop which spans a band is similar, with βp2 replacingβ2.505.2. The CCJ class of structuresAs shown next, the energy of a pseudoloop with U unpaired bases andset of closing base pairs C is also similar, with the γ parameters replacing theβ parameters. Just as the previous formulas distinguish between externalclosing pairs and other closing pairs, in the next formulas we distinguishbetween external band borders and other closing pairs. Let CE be the setof closing pairs of a pseudoloop which are external band borders, and let Cbe the set of all other closing pairs of a pseudoloop, plus the band bordersof bands which have only one base pair. For an external pseudoloop — thatis, a pseudoloop which is not nested in any other type of loop — the energyisPs + PupU +∑i.l∈CEγ2(l, i) +∑i.l∈Cγ2(i, l).If the pseudoloop is within a multiloop, Ps is replaced by Psm, and if itis within another pseudoloop, Ps is replaced by Psp. Finally, the energyassociated with an external loop is 0.To compare the performance of CCJ algorithm with that of IterativeHFold, we assigned values to this energy model based on DP09 energy modelof HotKnots V2.0 [11]. We refer the reader to Section 3.2, for a detaileddescription of the energy model and of the differences between energy modelof the CCJ algorithm versus DP09.5.2 The CCJ class of structuresWe call the class of structures that can be handled by our algorithm CCJstructures (using the initials of the last names of the co-authors of thisalgorithm). To explain what CCJ structures are, we also introduce TGB(three-groups-of-bands) structures.Figure 5.1 illustrates these structures. Part (a) shows a TGB structurefor a gapped region, which is comprised of three groups of bands. bands onthe middle group cross bands on the left and right groups according to aregular pattern. More generally, a TGB structure always has at most threegroups of bands. Also, a TGB structure always has at least one band onthe middle group which spans the gap, and for each base pair i.l in thismiddle band, the base at position i is always nested in every left band ofthe group (if any), and the base at position l is always nested in every rightband of the group (if any). A TGB structure may have nested substruc-tures, either within a band (as illustrated in part (b)), or between bands. ACCJ pseudoknot is an “overlay” of two disjoint TGB structures, as shownin part (c). An H-type pseudoknot is a CCJ pseudoknot, as is a kissing515.2. The CCJ class of structures(a)(b)(c)Figure 5.1: TGB (three-groups-of-bands) and CCJ pseudoknotted secondarystructures. (a) In this TGB structure for a gapped region, each shaded arcrepresents a band – a set of pseudoknotted base pairs which span the arc.The left and right groups each have two bands, and the middle group hasfour bands. (The outermost middle “band” does not appear to be a band,since it does not cross any other bands, but as later parts of the figure show,once this TGB structure is overlaid with another TGB structure, all bandsdo indeed cross other bands.) (b) A structure with two groups of bands, andwith pseudoknotted structures nested within one of these bands. In general,not all bands illustrated in part (a) need be present in a TGB structure, aslong as at least one band is on the middle group (labeled with star in part(c)). (c) A CCJ structure is obtained by overlaying two TGB structures.This example is obtained by overlaying the structures of parts (a) and (b).Embedded in this structure is a chain of four overlapping stems, which arelabeled by four stars.525.2. The CCJ class of structuresand even a chain of four overlapping stems. Finally, CCJ structures canbe pseudoknot-free, can have CCJ pseudoknots, and can have nested CCJstructures interspersed in arbitrary places.In the remainder of this section, we define CCJ and TGB structuresinductively. If R is a secondary structure for a sequence of length n, letR[i,l] be the subset of base pairs of R whose endpoints are in region [i, l],and let R[i,j]∪[k,l] be the subset of R whose endpoints are in [i, j] ∪ [k.l].(Thus, R = R[1,n]).1. R[i,j]∪[k,l] is a TGB structure (three-groups-of bands structure) if either(a) [i, j] ∪ [k, l] is a gapped region, R[i,j]∪[k,l] contains base pair(s) i.land j.k, any base pair of R[i,j]∪[k,l] which spans the gapped region[i, j] ∪ [k, l] does not cross any other base pair of R[i,j]∪[k,l], andany nested substructures R[i,j]∪[k,l], in weakly closed regions [i, j]or [k, l] are CCJ structures.(b) R[i,j]∪[k,l] can be decomposed into a CCJ structure and a TGBstructure in one of the following ways, for some d:range of d CCJ structure TGB structurei < d ≤ j R[i,d−1] R[d,j]∪[k,l]i ≤ d < j R[d+1,j] R[i,d]∪[k,l]k < d ≤ l R[k,d−1] R[i,j]∪[d,l]k ≤ d < l R[d+1,l] R[i,j]∪[k,d](c) R[i,j]∪[k,l] can be decomposed into a band and a TGB structurein one of the following ways, for some d and e:range of d range of e band borders TGB structurei ≤ d < j i < e ≤ j i.j d.e R[d+1,e−1]∪[k,l]i ≤ d < j k < e ≤ l i.l d.e R[d+1,j]∪[k,e−1]i < d ≤ j k ≤ e < l d.e j.k R[i,d−1]∪[e+1,l]k ≤ d < l k < e ≤ l k.l d.e R[i,j]∪[d+1,e−1]2. R[i,l] is a CCJ pseudoknot if it is the union of two TGB structuresR[i,j]∪[d+1,k] and R[j+1,d]∪[k+1,l].3. Finally, R[i,l] is a CCJ structure if(a) R[i,l] is empty, or535.3. Recurrences(b) i.l ∈ R[i,l] and R[i+1,l−1] is a CCJ structure, or(c) for some k ∈ [i, l − 1], R[i,l] = R[i,k] ∪R[k+1,l] and both R[i,k] andR[k+1,l] are CCJ structures, or(d) R[i,l] is a CCJ pseudoknot.5.3 RecurrencesOur algorithm finds the minimum free energy CCJ structure for a giveninput sequence S. We express energy values for various MFE substruc-ture types as recurrences. A dynamic programming algorithm can com-pute these energies and store them in arrays, and a standard backtrackingapproach can then deduce R from the arrays. In what follows, we focuson providing the needed recurrences. Here, we extend the recurrences ofpseudoknot-free MFE-based algorithm as explained in Section 4.2 to includea case for calculation of minimum free energy pseudoknotted structures. Wedescribe a more general method of formulating the dynamic programmingrecurrences for prediction of pseudoknotted RNA secondary structures thatcover gapped regions. We set W (i, l) = 0 if i ≥ l. Otherwise,W (i, l) = minW (i, l − 1)mini<d≤lW (i, d− 1) +W (d, l)V (i, l)P(i, l) + PsV (i, l) = minVhairpin(i, l),Vstacked(i, l),Viloop(i, l),Vmloop(i+ 1, l − 1) + b+ β2(l, i) + aIn the recurrence for W (i, l), the first case arises when l is unpaired. Insecond case of W (i, l) recurrence the overall structure consists of two sub-structures, one in the region [i, d− 1] and one in region [d, l] (both handledby the W recurrence). The third case arises when i.l closes a pseudoknot-free loop handled by the V recurrence. The fourth case arises when theregion [i, l] is a pseudoknot (handled by the P recurrence). This is an exter-nal pseudoknot, therefore we add a Ps penalty. The recurrence for V (i, l)minimizes over three terms, which handle hairpin loops, internal loops, andmultiloops, respectively. Details for these types of loops are given in a later545.3. Recurrencessection; the term which handles multiloops takes into account the possibilitythat pseudoknots are nested in base pair i.l.5.3.1 PseudoknotsP(i, l) is the minimum free energy of a CCJ pseudoknot for region [i, l], notcounting the initiation penalty. If i ≥ l, P(i, l) = +∞. Otherwise, thefollowing recurrence expresses P(i, l) using three intermediate points j, d,and k. These points, together with i and l, define two gapped regions,namely [i, j − 1] ∪ [d + 1, k − 1] and [j, d] ∪ [k, l]. P(i, l) is the sum ofcontributions from two gapped regions:P(i, l) = mini<j<d<k<lPK(i, j − 1, d+ 1, k − 1) + PK(j, d, k, l).PK(i, j, k, l) is the minimum free energy of a TGB structure R[i,j]∪[k,l]for the gapped region [i, j] ∪ [k, l], given that both i and l are paired (notnecessarily to each other) and the pairs involving i and l are not part ofnested substructures, assuming also that some base pair (which is not inR[i,j]∪[k,l]) crosses the gap [j + 1, k − 1].The recurrence for PK uses terms PL, PM , PO, and PR. Informally, PLand PR handle bands on the left and right groups of the TGB structure,respectively. Both PM and PO are needed to handle bands on the middlegroup. More precisely, PL(i, j, k, l) is the minimum free energy of a TGBstructure in gapped region [i, j]∪ [k, l], excluding from this energy the termγ2(i, j) (which is accounted for in PK), given that i is paired with j, k ispaired and the pair involving k is not in a nested substructure, and l is paired(not necessarily to k) and the pair involving l is not in a nested substructure.PR(i, j, k, l), PO(i, j, k, l), and PM (i, j, k, l) are defined similarly over gappedregion [i, j] ∪ [k, l]. For PR(i, j, k, l), k must pair with l, and i and j arepaired (not necessarily to each other) and are not in nested substructures.For PO(i, j, k, l), i must pair with l, and j and k are paired (not necessarilyto each other) and are not in nested substructures. Finally, for PM (i, j, k, l),j must pair with k, and i and l are paired (not necessarily to each other)and are not in nested substructures.Note that PK(i, j, k, l) only requires that i and l are paired, whereasPL, PM , PO, and PR require that all four indices are paired. The firsttwo lines of the recurrence below for PK allow the indices j and k to beshifted to a position at which they are paired. The WP terms handle nestedsubstructures in a pseudoloop. The remaining lines handle bands on theleft, right, or middle groups of the MFE structure. These three lines have a555.3. Recurrencesγ2 term, to account for the energy contribution of a border of a band (whichis a closing pair of a pseudoloop).If it is not the case that i ≤ j < k − 1 < l, thenPK(i, j, k, l) = PL(i, j, k, l) = PM (i, j, k, l) = PR(i, j, k, l) = PO(i, j, k, l) = +∞.Otherwise,PK(i, j, k, l) = minmini≤d<j PK(i, d, k, l) + WP(d+ 1, j)mink<d≤l PK(i, j, d, l) + WP(k, d− 1)PL(i, j, k, l) + γ2(j, i) + PbPM (i, j, k, l) + γ2(j, k) + PbPR(i, j, k, l) + γ2(l, k) + PbPO(i, j, k, l) + γ2(l, i) + PbPL(i, j, k, l) = minPL,iloop(i, j, k, l)PL,mloop(i, j, k, l) + b′PfromL(i+ 1, j − 1, k, l) + γ2(j, i)PR(i, j, k, l) = minPR,iloop(i, j, k, l)PR,mloop(i, j, k, l) + b′PfromR(i, j, k + 1, l − 1) + γ2(l, k)PM (i, j, k, l) = minPM,iloop(i, j, k, l)PM,mloop(i, j, k, l) + b′PfromM (i, j − 1, k + 1, l) + γ2(j, k)γ2(i, l), if i = j and k = lPO(i, j, k, l) = minPO,iloop(i, j, k, l)PO,mloop(i, j, k, l) + b′PfromO(i+ 1, j, k, l − 1) + γ2(l, i)The first two rows of the PL recurrence handle the cases where i.l is theouter closing base pair of an internal loop or multiloop that span a band.The third row handles the case where i.l is the inner border of a band.Figure 5.2 shows how these recurrences unwind to calculate the energyof a kissing hairpin structure. Figure 5.3 illustrates a more general case,which includes a chain of four bands.5.3.2 Transitioning between band groups in pseudoknotsIn the above recurrences for PL, PR, PM , and PO, terms PfromL etc. areused, to handle transitions from base pairs in one group to base pairs in565.3. RecurrencesPM,iloop(3, 4, 9, 10) = PM(3, 3, 10, 10) + estP (3, 10)PK(1, 2, 5, 6) = PM(1, 2, 5, 6) + γ2(5, 2)PM(1, 2, 5, 6) = PM,iloop(1, 2, 5, 6)PM(1, 1, 6, 6) = γ2(1, 6)P (1, 12) = PK(1, 2, 5, 6) + PK(3, 4, 7, 12)1 2 3 4 5 6 7 8 9 10 11 12W (1, 12) = W (1, 0) + P (1, 12) + PsPM,iloop(1, 2, 5, 6) = PM(1, 1, 6, 6) + estP (1, 6)1 2 3 4 5 6 7 8 9 10 11 12 1 2 3 4 5 6 7 8 9 10 11 12PK(3, 4, 7, 12) = PR(3, 4, 7, 12) + γ2(7, 12)PR(3, 4, 7, 12) = PR,iloop(3, 4, 7, 12)PR(3, 4, 8, 11) = PfromR(3, 4, 9, 10) + γ2(11, 8)PfromR(3, 4, 9, 10) = PM(3, 4, 9, 10) + γ2(9, 4)1 2 3 4 5 6 7 8 9 10 11 12PM(3, 4, 9, 10) = PM,iloop(3, 4, 9, 10)PM(3, 3, 10, 10) = γ2(3, 10)PR,iloop(3, 4, 7, 12) = PR(3, 4, 8, 11) + estP (7, 12)Figure 5.2: Illustration of how the W recurrence unwinds, when the MFEstructure for a sequence of length 12 is the simple kissing hairpin structureillustrated at the top of the figure. Since the overall structure is pseudo-knotted, it is handled by the second case of the W recurrence. Here wehave W (1, 0) = 0, therefore the energy is accounted for by P(1, 12) +Ps. Inthe P recurrence, the structure is divided into two TGB structures, namelyPK(1, 2, 5, 6) and PK(3, 4, 7, 12). The term PK(1, 2, 5, 6) takes care of theinternal loop of the gapped region [1, 2]∪ [5, 6], and PK(3, 4, 7, 12) calculatesthe energy of the rest of the structure by transitioning between the PR,PR,iloop, PfromR, PM and PM,iloop recurrences.575.3. RecurrencesW (1, 20) = W (1, 0) + P (1, 20) + Ps1 2 3 4 5 6 7 8 9 10 11 14 15 16 17 18 19 2012 13P (1, 20) = PK(1, 6, 9, 14) + PK(7, 8, 15, 20)1 2 3 4 5 6 7 8 9 10 11 14 15 16 17 18 19 2012 13 1 2 3 4 5 6 7 8 9 10 11 14 15 16 17 18 19 2012 13PK(1, 6, 9, 14) = PL(1, 6, 9, 14) + γ2(1, 6) PK(7, 8, 15, 20) = PR(7, 8, 15, 20) + γ2(15, 20)Figure 5.3: Illustration of how the W recurrence unwinds, when the MFEstructure for a sequence of length 20 is the structure illustrated at thetop of the figure. Note that this structure has five interleaved bands.As in Figure 5.2, the overall structure is pseudoknotted, thus is han-dled by P(1, 20) + Ps. Here P(1, 20) is divided into PK(1, 6, 9, 14), andPK(7, 8, 15, 20). The TGB structure of PK(1, 6, 9, 14) is similar to Fig-ure 5.2, and is handled similarly, with the only difference being that theleftmost band is handled by PL recurrence instead of a PK recurrence. TheTGB structure of PK(7, 8, 15, 20) is similar to the right part of Figure 5.2.585.3. Recurrencesanother group. If transitioning from PL via PfromL(i, j, k, l), then we needto allow for nested substructures either to the left of index j or to the rightof index i, or both. The first two lines of the next recurrence allow for thesepossibilities, and ensure that when PR(i, j, k, l), PM (i, j, k, l), or PO(i, j, k, l)are called, i and j are at positions where they are paired (not necessarilywith each other) and the pairs are not part of nested substructures.PfromL(i, j, k, l) = minmini<d≤j WP(i, d− 1) + PfromL(d, j, k, l)mini≤d<j PfromL(i, d, k, l) + WP(d+ 1, j)PR(i, j, k, l) + γ2(l, k) + PbPM (i, j, k, l) + γ2(j, k) + PbPO(i, j, k, l) + γ2(l, i) + PbPfromR(i, j, k, l) = minmink<d≤l PfromR(i, j, d, l) + WP(k, d− 1)mink≤d<l PfromR(i, j, k, d) + WP(d+ 1, l)PM (i, j, k, l) + γ2(j, k) + PbPO(i, j, k, l) + γ2(l, i) + PbPfromM (i, j, k, l) = minmini≤d<j PfromM (i, d, k, l) + WP(d+ 1, j)mink<d≤l PfromM (i, j, d, l) + WP(k, d− 1)PL(i, j, k, l) + γ2(j, i) + PbPR(i, j, k, l) + γ2(l, k) + PbPfromO(i, j, k, l) = minmini<d≤j WP(i, d− 1) + PfromO(d, j, k, l)mink≤d<l PfromO(i, j, k, d) + WP(d+ 1, l)PL(i, j, k, l) + γ2(j, i) + PbPR(i, j, k, l) + γ2(l, k) + PbNote that in PfromR, it is not possible to transition to PL. This is be-cause the recurrences are designed so that bands are handled in rounds.Within a round, bands in the left are handled first, if any, then those on theright, if any, and then those on the middle, with bands handled by PM (ifany) handled before those handled by PO. A middle band must be handledin each round; otherwise, for example, two “bands” on the left group, addedin different rounds, would collapse into one, causing the recurrences to in-correctly add γ2 terms for band “borders” that are not actually borders.For this reason, no row in PfromR has a PL term, and so a band on the leftgroup cannot be handled directly after a band on the right group. Also,PfromO does not have a row with a PM term, to ensure that PM cannot beused twice on the same round.595.3. RecurrencesWe need a base case, when i = j and k = l. In this case, we need toprovide a way to exit the recurrences, with the last added base pair beingi.l (= j.k). If the recurrences are exited from PM via PfromM , then energycosts associated with the last band (which includes the base pair i.l) havealready been accounted for. Therefore, PfromM (i, j, k, l) = 0 when i = j andk = l. Similarly, PfromO(i, j, k, l) = 0. If the recurrences are exited from PLor PR (via PfromL or PfromR), we need to add energy costs for the base pairi.l, which forms a band of its own. We do this as follows: when i = j andk = l,PfromR(i, j, k, l) = PfromL(i, j, k, l) = γ2(j, k) + γ2(k, j).5.3.3 Nested substructuresSubstructures can be nested in different types of loops. We let WM(i, l),WB(i, l), and WP(i, l) denote the minimum free energies of all structuresRi,l for region [i, l] that are nested in a multiloop that does not span a band,a multiloop that spans a band, and a pseudoloop, respectively. WM′(i, l)and WB′(i, l) are similar to WM(i, l) and WB(i, l), respectively, except thatthe region [i, l] must contain at least one base pair. If i > l thenWM(i, l) = WB(i, l) = WP(i, l) = 0,andWM′(i, l) = WB′(i, l) = WP′(i, l) = +∞.605.3. RecurrencesOtherwise,WM(i, l) = min{WM′(i, l), c× (l − i+ 1)}WM′(i, l) = minWM′(i, l − 1) + c× (1)mini<d≤l WM(i, d− 1) + WM′(d, l)V (i, l) + β2(i, l) + bP(i, l) + Psm + bWB(i, l) = min{WB′(i, l), c′ × (l − i+ 1)}WB′(i, l) = minWB′(i, l − 1) + c′ × (1)mini<d≤l WB(i, d− 1) + WB′(d, l)V (i, l) + βp2(i, l) + b′ + PpsP(i, l) + Psm + b′ + PpsWP(i, l) = min{WP′(i, l), Pup × (l − i+ 1)}WP′(i, l) = minWP′(i, l − 1) + Pup × (1)mini<d≤l WP(i, d− 1) + WP′(d, l)V (i, l) + γ2(i, l) + PpsP(i, l) + Psp + PpsIn the above recurrences, the terms involving c, c′, or Pup account forunpaired bases in ordinary multiloops, multiloops that span a band, andpseudoloops, respectively. The terms involving β2 and βp2 account for closingbase pairs in multiloops, and the Psm and Psp initiation terms account fornew pseudoknots that are nested in a multiloop or pseudoloop, respectively.5.3.4 Internal loops and multiloopsFirst, we give the recurrences for hairpins, and for internal loops and multi-loops that do not span a band. If i > l− 3 then no such loop can be withinbase pair i.l, and soVhairpin(i, l) = Viloop(i, l) = Vmloop(i, l) = +∞615.3. RecurrencesOtherwise,Vhairpin(i, l) = eH(i, l)Vstacked(i, l) = eS(i, l)Viloop(i, l) = mini<d<e<l( eint(i, d, e, l) + V (d, e))Vmloop(i, l) = minVmloop(i, l − 1) + c× (1)Vmloop(i+ 1, l) + c× (1)P(i, l) + Psm + bmini<d≤l WM(i, d− 1) + WM′(d, l)The first and second rows for Vmloop handle the case where base l andbased i are unpaired respectively. The third row handles the case whereregion [i, l] is a pseudoloop, and fourth row handles the case that at point dthe multiloop structure can be split to two.Next are recurrences for internal loops and multiloops that span a band.Only the recurrences associated with the left group are explained. Theexplanation for recurrences for the middle, outer and right groups is similar.PL,iloop(i, j, k, l) is the minimum free energy of a TGB structure in gappedregion [i, j] ∪ [k, l] (excluding the term γ2(i, j)), given that i.j is the closingbase pair of an internal loop that spans a band. The first row of the re-currence handles the case that this internal loop is a stacked pair, and thesecond row handles all other types of internal loops.Similarly, PL,mloop(i, j, k, l) is the minimum free energy of a TGB struc-ture in gapped region [i, j]∪ [k, l] (excluding the term γ2(i, j)), given that i.jis the closing base pair of a multiloop that spans a band. Suppose d is thebase with the smallest index, such that i < d and d.e spans the same bandas i.j for some e. Then, the first row of the recurrence for PL,mloop handlesthe case that there is a nested substructure in region [i, d]. The second caseallows for the possibility that there is no nested substructure in region [i, d],in which case there must be a nested substructure in region [e, j].625.3.RecurrencesPL,iloop(i, j, k, l) = min{PL(i+ 1, j − 1, k, l) + estP (i, j)mini<d<e<j(eintP (i, d, e, j) + PL(d, e, k, l))PL,mloop(i, j, k, l) = mini<d<j−1{PL,mloop0(d, j − 1, k, l) + WB′(i+ 1, d− 1) + a′ + βp2(j, i)PL,mloop1(d, j − 1, k, l) + WB(i+ 1, d− 1) + a′ + βp2(j, i)PL,mloop0(i, j, k, l) = mini<d<j (PL(i, d, k, l) + WB(d+ 1, j) + βp2(d, i))PL,mloop1(i, j, k, l) = mini<d<j (PL(i, d, k, l) + WB′(d+ 1, j) + βp2(d, i))PR,iloop(i, j, k, l) = min{PR(i, j, k + 1, l − 1) + estP (k, l)mink<d<e<l(eintP (k, d, e, l) + PR(i, j, d, e))PR,mloop(i, j, k, l) = mink<d<l−1{PR,mloop0(i, j, d, l − 1) + WB′(k + 1, d− 1) + a′ + βp2(l, k)PR,mloop1(i, j, d, l − 1) + WB(k + 1, d− 1) + a′ + βp2(l, k)PR,mloop0(i, j, k, l) = mink<d<l (PR(i, j, k, d) + WB(d+ 1, l) + βp2(k, d))PR,mloop1(i, j, k, l) = mink<d<l (PR(i, j, k, d) + WB′(d+ 1, l) + βp2(k, d))635.3.RecurrencesPM,iloop(i, j, k, l) = min{PM (i, j − 1, k + 1, l) + estP (j − 1, k + 1)mini<d<j<k<e<l(eintP (d, j, k, e) + PM (i, d, e, l))PM,mloop(i, j, k, l) = mini<d<j−1{PM,mloop0(i, d, k + 1, l) + WB′(d+ 1, j − 1) + a′ + βp2(j, k)PM,mloop1(i, d, k + 1, l) + WB(d+ 1, j − 1) + a′ + βp2(j, k)PM,mloop0(i, j, k, l) = mink<d<l(PM (i, j, d, l) + WB(k, d− 1) + βp2(j, d))PM,mloop1(i, j, k, l) = mink<d<l(PM (i, j, d, l) + WB′(k, d− 1) + βp2(j, d))PO,iloop(i, j, k, l) = min{PO(i+ 1, j, k, l − 1) + estP (i, l)mini<d<j<k<e<l(eintP (i, d, e, l) + PO(d, j, e, k))PO,mloop(i, j, k, l) = mini<d<j−1{PO,mloop0(d, j, k, l − 1) + WB′(i+ 1, d− 1) + a′ + βp2(l, i)PO,mloop1(d, j, k, l − 1) + WB(i+ 1, d− 1) + a′ + βp2(l, i)PO,mloop0(i, j, k, l) = mink<d<l(PO(i, j, k, d) + WB(d+ 1, l) + βp2(d, i))PO,mloop1(i, j, k, l) = mink<d<l(PO(i, j, k, d) + WB′(d+ 1, l) + βp2(d, i))645.4. Time and space complexity5.4 Time and space complexityIn this section we present the pseudocode for a dynamic programming al-gorithm to compute the solutions to the above recurrences, and based onthis pseudocode explain time and space complexity of our algorithm. First,we assume all energies which are base cases are calculated in constant time.Then, energies are computed according to the following schedule. For eachvalue of T , T = 2, 3, . . . , n, where n is the length of the input RNA sequence,• compute all energies over regions [i, j] with j − i+ 1 = T ;• compute all energies over gapped regions [i, j] ∪ [k, l] with j − i+ l −k + 2 = T ;The time requirement is O(n5) since we set the maximum loop size for aninternal loop to be 30. All energies computed are saved as described in theschedule, resulting in a space requirement of Θ(n4).5.5 SummaryIn this chapter we proposed a novel MFE-based method for prediction ofRNA pseudoknotted secondary structure, that expands the class of struc-tures handled in O(n5) time and O(n4) space. We described class of struc-ture our algorithm can handle and provided the recurrences for our algorithmbased on the DP09 energy model of HotKnots V2.0 [11].65Chapter 6Comparison of IterativeHFold and CCJIn this chapter we present a thorough comparison of our proposed algo-rithms, Iterative HFold, and CCJ, with the best existing methods for predic-tion of RNA secondary structure including pseudoknots, namely HotKnotsV2.0, IPknot and HFold. As mentioned in Chapter 2, in the literatureon hierarchical folding, there are reports of counter examples to the hier-archical folding hypothesis, where base pairs that are initially part of thepseudoknot-free structure for a molecule later change as the pseudoknotforms. This motivates a comparison of HFold versus Iterative HFold, in or-der to see how a method that sticks strictly with the hypothesis (i.e., HFold)compares with a method that allows for some base pair changes (i.e., Iter-ative HFold). By comparing the CCJ algorithm, a method based on MFE,with HFold and Iterative HFold, we can further evaluate the merit of thehypothesis that sequences fold into their MFE structures.We start by describing the experimental settings for each of our compu-tational experiments. In Section 6.2 we compare the robustness of HFoldand Iterative HFold with respect to partial information; that is, the degree towhich they provide accurate predictions as a function of how much informa-tion about Gbig, the true pseudoknot-free secondary structure, is providedas input.Then in Section 6.3 we compare HFold, HFold-PKonly and IterativeHFold when a (possibly inaccurate) computational prediction of Gbig is pro-vided as input. In Section 6.5 we compare Iterative HFold and CCJ withtwo of the best existing methods for pseudoknotted secondary structure pre-diction, namely HotKnots V2.0 and IPknot. Sections 6.6 and 6.7 report onthe running time and memory usage of our methods. In Section 6.8 we com-pare Iterative HFold’s accuracy with that of ShapeKnots. In Section 6.9 wediscuss merits and differences of each method. Finally in Section 6.10 weprovide a summary of this chapter.666.1. Experimental settings6.1 Experimental settingsIn this section we explain details of our computational experiments.6.1.1 Definition of Gbig and GsmallTo test the robustness of HFold, HFold-PKonly and Iterative HFold on agiven RNA sequence, we need to provide partial information about the truepseudoknot-free structure as input structure for that sequence. To obtainthe true pseudoknot-free structure, Gbig, we remove the minimum number ofpseudoknotted base pairs from the reference structure to make the referencestructure pseudoknot-free. If the reference structure is pseudoknot-free, thenGbig is the same as the reference structure itself. We call the removed basepairs from the reference structure Gsmall. Blue base pairs in Figure 6.1(i.e., the stems of 1.32 to 12.21, and 40.61 to 46.55) represent base pairsof the Gbig structure and green base pairs (i.e., the band of 16.53 to 20.49)represent the Gsmall structure.6.1.2 Robustness testOne of our goals is to understand the degree to which our methods are robustwith respect to partial information, that is, provide a reliable predictioneven when limited information about the true pseudoknot-free structure,Gbig, is available. For this purpose we generate subset structures of thecorresponding Gbig, for each RNA sequence in the HK-PK and HK-PK-freedata sets. For each α, 0.05 ≤ α ≤ 0.95 with 0.05 steps, we choose each basepair of Gbig structure with probability α. We also generate 1% informationand 99% information about the Gbig structure (i.e., α = 0.01 and α = 0.99).We repeat this step 100 times to generate 100 substructures of Gbig for eachvalue of α for each RNA sequence in our data sets. We then run our methodson all 100 substructures for each RNA sequence in our data sets and α valueand calculate the bootstrap 95% percentile confidence interval for averageF-measure of these 100 cases as the accuracy interval for each method andeach RNA sequence and α value in our data set.We also compare our methods when the true pseudoknot-free structureGbig is provided.6.1.3 Accuracy comparison testsWe compare the accuracy of HFold, HFold-PKonly and Iterative HFold witheach other on different input structures, and with other methods, namely676.1. Experimental settings20GGGGCUCUAGUGCC G C UCGACUAGAGCCCUG U A A C A G U UAUGGACCAC G A GCAGUCCAUG1103040606150CFigure 6.1: H-type pseudoknot. The blue base pairs (i.e., the stems of 1.32to 12.21, and 40.61 to 46.55) belong to the Gbig structure and the greenbase pairs (i.e., the band of 16.53 to 20.49) belong to the Gsmall structure,as defined in Section 6.1.1. This figure was produced using the VARNAsoftware [26].686.1. Experimental settingsCCJ, SimFold [8], HotKnots V2.0 [11, 76] and IPknot [83]. We first describethe settings we choose for the latter two methods in our experiments. Wethen describe the ways in which we choose input structures for HFold andits variants.HotKnotsAs mentioned in Section 2.1.2, HotKnots produces up to 20 output struc-tures. In our experiments, we choose the structure with the lowest energyvalue among the 20 output structures as the final structure predicted byHotKnots. When reporting prediction accuracy for HotKnots, we reportthe bootstrap 95% percentile confidence interval for the average F-measureof the lowest energy structure for all RNA sequences in our data set.IPknotWe run IPknot using the provided source code and the default parametersfor scoring model and level (i.e., scoring model = McCaskill and level = 2).The default values provided for base pair weights are not the same on theIPknot website (i.e., γ1 = 2 and γ2 = 16), its source code (i.e., for somecases γ1 = 2 and γ2 = 4 and for others γ1 = 1 and γ2 = 1) and the providedperl script (i.e., γ1 = 4 and γ2 = 8). We run IPknot with all of thesevalues with and without refinement and provide IPknot’s bootstrap 95%confidence intervals for average F-measures for all of our data sets as a tablein Appendix A. Based on its performance we present IPknot’s results withdefault settings (i.e., no refinement, scoring model = McCaskill and level= 2) and γ1 = 4 and γ2 = 8, for comparison with other methods.Different versions of HFoldWe compare the average accuracy of HFold, HFold-PKonly and IterativeHFold with different input structures.To determine which input structures are good to use when Gbig is notknown, we compare three different options. Since HFold (HFold-PKonlyand Iterative HFold) cannot accept pseudoknotted input structures we usethe following methods to produce pseudoknot-free input structures to HFold(HFold-PKonly and Iterative HFold). First, we use HotKnots hotspots [11],i.e., the 20 lowest energy pseudoknot-free stems produced in the first phaseof HotKnots. We choose the lowest free energy structure predicted by eachof our methods as their final prediction given these hotspots. Second, weuse SimFold’s MFE structure [8] where the energy parameters of SimFold696.2. Robustness comparisonare changed to match that of HotKnots V2.0. Third, we use SimFold’s first50 suboptimal structures.6.1.4 Running timeWe ran all methods, except the CCJ algorithm, on the same platform (Mac-book pro. OS X 10.5.8 with 2.53 GHz Intel Core 2 Duo processor and 4GB 1067 MHz DDR3 RAM). We use the time command to measure therunning time of our methods on each sequence (while no other task wasrunning on the platform), and record the wall clock time. Due to largememory requirement of the CCJ algorithm especially for long RNA strands(> 60 nucleotides), we used a more powerful shared server platform (open-SUSE 13.1 with 64 Intel(R) Xeon(R) CPU E7- 4830 2.13GHz processorsand 128 GB memory) to run the CCJ algorithm. Therefore we are unableto compare CCJ’s running time with other methods in this work.6.1.5 Memory usageTo find the memory usage of the programs, we use the Valgrind package [68]and record the total heap usage as memory usage of each program. IPknotand HotKnots are completely written in C/C++ and so we can easily findtheir memory usage by running Valgrind. However, Iterative HFold programis a perl script that runs a few C/C++ programs (HFold, HFold-pkonlyand SimFold) sequentially. So we find the memory usage of each C/C++component using Valgrind and assign the maximum as the memory usageof Iterative HFold.6.2 Robustness comparisonOne of our goals is to learn what is the accuracy of each of HFold, HFold-PKonly, and Iterative HFold, when partial information about Gbig is avail-able. Figure 6.2 shows the results of this robustness evaluation, for pseu-doknotted structures of the HK-PK data set (see Section 2.2) (Figure 6.2(a), pseudoknot-free structures of the HK-PK-free data set (Figure 6.2 (b)and the overall results (Figure 6.2 (c)). Since HFold-PKonly cannot addpseudoknot-free base pairs to the given input structure, we do not compareits performance here with HFold and Iterative HFold. However we providedetailed performance of all versions of HFold including HFold-PKonly inAppendix B.706.2. Robustness comparisonAs shown in Figure 6.2 (a), which pertains to pseudoknotted structuresof the HK-PK data set, when provided with ≈ 1% of the Gbig structure asinput, Iterative HFold’s bootstrap 95% percentile confidence interval of av-erage F-measures has higher accuracy than those of HFold. Iterative HFoldcontinues to be significantly superior to HFold until approximately 90% ofGbig is available, after which HFold is more accurate. Iterative HFold is mostsuccessful when little information about Gbig is known because it can addboth pseudoknot-free and pseudoknotted base pairs. In particular, usingmethods 3 and 4 (see Section 4.4) Iterative HFold first finds a low energypseudoknot-free structure that includes the given input structure (by ex-tending the stems of the given structure), and then adds pseudoknottedbase pairs to further lower the energy of the overall structure. However,when the vast majority of base pairs of Gbig are provided as input, HFolddominates as it keeps the base pairs of the input structure, thereby oftenadding base pairs of Gsmall. When 100% of Gbig is provided as input, HFold’sbootstrap 95% percentile confidence interval is (85.74%, 91.87%), comparedwith (79.36%, 87.41%) for Iterative HFold.As shown in Figure 6.2 (a), Iterative HFold’s average accuracy on pseu-doknotted structures steadily increases from about 54% to 79% as the userprovides 1% to 40% of the input structure. This improvement in accuracyslows down but still persists when further structural information is provided.If we compare the slope of the curve for Iterative HFold’s average accuracy tothat of HFold in Figure 6.2 (a), we can see that HFold’s slope is steeper thanthat of Iterative HFold, making Iterative HFold more robust than HFold.For pseudoknot-free structures of the HK-PK-free data set, as shown inFigure 6.2 (b), HFold performs better than Iterative HFold. Even with 1%information about Gbig, HFold results in (79.76%, 84.32%) 95% bootstrapconfidence interval in comparison with (79.14%, 83.84%) for Iterative HFoldwith the same inputs. Roughly, HFold’s success for pseudoknot-free struc-tures is because it often adds base pairs that do not cross those provided aspart of the input, and thus are likely to be in Gbig.When 100% of Gbig is provided as input, the overall bootstrap 95% confi-dence interval for HFold is (96.11%, 97.24%) compared with (93.85%, 96.07%)for Iterative HFold.716.3. Accuracy comparison of different versions of HFold++++ ++ ++ + + ++ + ++ ++ ++ +++0 20 40 60 80 100405060708090100(a) pseudoknottedBootstrap 95% confidence interval+++ ++ + ++ + + ++ + ++ ++ +++ ++o+ Iterative HFoldHFold+++++ ++ ++ + ++ + + + ++ + + + ++0 20 40 60 80 1007580859095100(b) pseudoknot−freeBootstrap 95% confidence interval+++++ ++ ++ + + ++ + + + + + + + +++++++ ++ ++ + ++ + + ++ + ++ + ++0 20 40 60 80 100707580859095100(c) combinedBootstrap 95% confidence interval+++++ ++ ++ + + ++ + + ++ + ++ ++Figure 6.2: Comparison of robustness of HFold and Iterative HFold. Ro-bustness results for pseudoknotted structures of the HK-PK data set (a),pseudoknot-free structures of the HK-PK-free data set (b) and all structures(c). The X axes show the percentage of the Gbig structure that is providedas input, and the Y axes show bootstrap 95% percentile confidence intervalsfor average F-measure. Dashed lines show the borders of the bootstrap 95%percentile for average F-measure and solid lines show the average F-measureitself.726.3. Accuracy comparison of different versions of HFold6.3 Accuracy comparison of different versions ofHFoldOften, partial information about Gbig is not available; this is the case formany RNAs of unknown function reported by the ENCODE consortium[1]. Therefore, we next compare the quality of results obtained by HFold,HFold-PKonly and Iterative HFold when given a pseudoknot-free input Gthat is predicted by existing computational methods. One way to producean input structure is to use an MFE pseudoknot-free structure predictionmethod, such as MFold. We chose SimFold as it is an implementation ofMFold and, because of its energy parameters, gives more accurate predic-tions than MFold [10]. If some base pairs of a structure are known, forexample, through comparative analysis with related structures, the user caninput such information as a pseudoknot-free structure to Iterative HFold andexpect a better prediction result. Here we compare two methods for pre-dicting G, namely SimFold and the hotspots produced by HotKnots V2.0.In the following section we also evaluate using SimFold’s suboptimal struc-tures as input to HFold and Iterative HFold. Table 6.1 reports the bootstrap95% percentile confidence intervals of average F-measures. The accuracy ofHFold-PKonly is significantly worse than that of HFold and Iterative HFold,both with the output of SimFold, and with the HotKnots hotspots as input,so we do not discuss HFold-PKonly further.For pseudoknotted structures, using HotKnots hotspots as input is farsuperior to using SimFold as input, for both HFold and Iterative HFold.This appears to be because MFE structures predicted by SimFold tend tohave more base pairs than the true pseudoknot-free structure Gbig, so thatHFold and Iterative HFold are unlikely to add pseudoknotted base pairs tothe input structure. For pseudoknot-free structures, using SimFold as inputis somewhat better than using HotKnots hotspots, but the permutationtest (with significance p-value of < 0.05) indicates that the difference is notsignificant.For pseudoknotted structures, the confidence intervals for HFold andIterative HFold with HotKnots hotspots are respectively (73.35%, 83.53%)and (72.83%, 83.37%) and on pseudoknot-free structures they are respec-tively (75.53%, 80.79%) and (74.93%, 80.26%). Again, based on the resultof the permutation test, the difference in the results of HFold and IterativeHFold on pseudoknotted and pseudoknot-free structures is not significant.Similarly, the permutation test shows that the difference in prediction accu-racy of HFold and Iterative HFold on SimFold input is not significant.736.3.AccuracycomparisonofdifferentversionsofHFoldTable 6.1: Comparison of bootstrap 95% percentile confidence intervals of average F-measure of different versionsof HFold when given SimFold structure as input vs. when given HotKnots hotspots structures as input.Input Hotspots SimFold (MFE)PKonly HFold Iter. HFold PKonly HFold Iter. HFoldHK-PK (55.54, 71.06) (73.35, 83.53) (72.83, 83.37) (50.57, 63.53) (50.69, 63.54) (51.42, 64.39)HK-PK-free (31.37, 38.52) (75.53, 80.79) (74.93, 80.26) (78.42, 83.21) (78.33, 83.27) (78.31, 83.17)Table 6.2: Comparison of bootstrap 95% percentile confidence intervals of average F-measure with existing meth-ods.input Iter. HFold HotKnots SimFold IPknot CCJ(hotspots) (default)HK-PK (72.83, 83.37) (73.60, 83.35) (45.34, 57.73) (54.56, 66.25)HK-PK* (73.19, 83.88) (74.01, 83.96) (44.65, 57.40) (53.72, 66.01) (74.38, 84.78)HK-PK-free (74.93, 80.26) (76.74, 81.95) (78.78, 83.55) (77.31, 81.79) (74.75, 80.13)746.3.AccuracycomparisonofdifferentversionsofHFoldTable 6.3: Comparison of bootstrap 95% percentile confidence intervals of average F-measure with existing methodson DK-pk16, DK-pk16* and IP-pk168 data sets.input Iter. HFold HotKnots IPknot CCJ(hotspots) (default)DK-pk16 (68.05, 81.85) (69.11, 83.81) (65.42, 75.81)DK-pk16* (69.99, 84.48) (72.10, 87.18) (64.68, 77.11) (66.42, 81.76)IP-pk168 (72.65, 79.86) (65.51, 72.96) (58.20, 66.09) (72.45, 79.68)Table 6.4: Comparison of average F-measure, sensitivity and PPV with existing methods on HK-PK, HK-PKfree,DK-pk16, DK-pk16* and IP-pk168 data sets. The values outside parentheses show the average F-measure whilethe values inside parentheses show average sensitivity and PPV respectively.input Iter. HFold HotKnots IPknot CCJ(hotspots) (default)F (sen, PPV) F (sen, PPV) F (sen, PPV) F (sen, PPV)HK-PK 78.27 (80.16, 77.31) 78.63 (78.88, 79.18) 60.05 (58.75, 64.56)HK-PK* 78.71 (80.71, 77.69) 79.14 (79.45, 79.66) 60.04 (58.29, 64.16) 79.79 (82.55, 78.04)HK-PK-free 77.57 (80.29, 75.72) 79.33 (81.29, 77.99) 79.62 (83.58, 76.66) 77.43 (80.54, 75.20)DK-pk16 75.02 (77.78, 73.40) 76.61 (77.64, 76.47) 70.71 (69.42, 73.18)DK-pk16* 77.30 (80.70, 75.30) 79.88 (81.27, 79.59) 71.04 (69.43, 74.10) 74.18 (80.11, 70.25)IP-pk168 76.29 (80.82, 73.67) 69.30 (69.01, 72.68) 62.21 (59.62, 68.10) 76.15 (81.02, 73.33)756.4. Iterative HFold with SimFold’s suboptimal structures6.4 Iterative HFold with SimFold’s suboptimalstructuresTo further investigate which input structures are good to use whenGbig is notknown, we use the first 50 suboptimal structures produced by SimFold (in-cluding the MFE structure). Then for each RNA sequence we run our meth-ods on all 50 suboptimal structures and choose the one with the lowest freeenergy as the final result for that RNA sequence. With this approach, thebootstrap 95% percentile confidence intervals of average F-measure of HFoldand Iterative HFold are (61.80%, 80.63%) and (67.70%, 79.57%) respectivelyfor pseudoknotted structures and (77.17%, 82.35%) and (76.27%, 81.46%)respectively for pseudoknot-free structures. The permutation test indicatesthat the difference between these results and the corresponding results wheninput structures are hotspots is not significant. We also test the significanceof results of HFold with first 50 suboptimal structures versus Iterative HFoldwith the same input structures and Iterative HFold with hotspots for bothpseudoknotted and pseudoknot-free structures. Although the bootstrap 95%percentile confidence intervals for average F-measures are slightly different,the permutation test indicates that the difference is not significant. Simi-larly, results of Iterative HFold with the first 50 suboptimal structures arenot significantly better or worse than the result of HFold with hotspots asinput structures for both pseudoknotted and pseudoknot-free structures.6.5 Accuracy comparison with existing methodsFor comparisons with other methods already in the literature, we choose touse our Iterative HFold method with HotKnots hotspots as input structure,based on its overall good accuracies in Section 6.3. We compare this methodwith two of the best-performing methods [73] for prediction of pseudoknot-ted structures, namely HotKnots V2.0 [11], a MFE-based heuristic method,and IPknot [83], a method that is based on maximum expected accuracy.Prepared by Puton et al. [73], CompaRNA, is the website for continuouscomparison of RNA secondary structure methods on both PDB data setand RNA strand. We chose IPknot because it was the best-performingnon-comparative pseudoknot prediction method that can handle long RNAsequences, based on the ranking on their website as of March 25, 2014(http://genesilico.pl/comparna/). We also noticed that Puton et al.used HotKnots V1 for their comparison, and not the more recently avail-able and better performing HotKnots V2.0. Therefore we chose to include766.5. Accuracy comparison with existing methodsHotKnots in our comparisons as well. Since the focus of this thesis is onprediction of pseudoknotted structures, we do not compare our results withCo-Fold [72] or other methods for prediction of pseudoknot-free structures.Table 6.2 presents the bootstrap 95% percentile confidence intervals ofaverage F-measure for Iterative HFold with hotspots as input, HotKnotsV2.0, SimFold, IPknot with default setting (see Section 6.1.3), and CCJ onthe HK-PK and the HK-PK-free data sets. Since CCJ does not run for RNAsequences of length > 195 (i.e., crashes with memory error), we removed 3RNA sequences with length > 195 (the removed sequences are 196, 214and 400 bases long) from our HK-PK set and named the new set HK-PK*.We present the bootstrap 95% percentile confidence intervals of average F-measure for all our methods on this data set as well. For pseudoknottedstructures in the HK-PK data set, our permutation tests show that thedifference in accuracy of Iterative HFold and HotKnots is not significant.However, the superior accuracy of Iterative HFold compared with SimFoldand IPknot is significant. Similarly on the HK-PK* data set, the differencein accuracy of Iterative HFold, HotKnots and CCJ is not significant, butthey all achieve significantly better accuracies than SimFold and IPknot.For pseudoknot-free structures, the difference in accuracy between IPknot,Iterative HFold, HotKnots, CCJ and SimFold is not significant.Table 6.3 presents the bootstrap 95% percentile confidence intervals foraverage F-measure for Iterative HFold (with hotspots as input), HotKnots,IPknot (with default setting) and CCJ on the IP-pk168 and the DK-pk16data sets. The DK-pk168* data set is the same as the DK-pk16 data setwith 3 RNA sequences of length> 195 removed. Our permutation tests showthat the difference in accuracy of Iterative HFold, HotKnots and IPknot onthe DK-pk16 data set is not significant. Similarly the difference in accuracyof Iterative HFold, HotKnots, IPknot and CCJ on the DK-pk16* data set isnot significant. However, the superior accuracy of Iterative HFold and CCJcompared with HotKnots and IPknot on the IP-pk168 data set is significant,while the difference in accuracy of Iterative HFold and CCJ is not significant.Table 6.4 presents average F-measure, sensitivity and PPV for IterativeHFold (with hotspots as input), HotKnots, IPknot (with default setting)and CCJ on all of our four data sets. We note that both Iterative HFoldand CCJ methods, have higher average sensitivity than average PPV on allof our data sets, which suggests that they both predict more base pairs thanin reference structures. IPknot, on the other hand, has higher average PPVthan sensitivity on all of the data sets with the exception of HK-PK-free.776.6. Running time comparison6.6 Running time comparisonSince prediction of pseudoknotted structures are of interest to us, we onlyreport running time comparison on pseudoknotted structures of the HK-PKdata set. Figure 6.3 presents result of time comparison between IterativeHFold and HotKnots in a log plot (when log is in base 10). The X axis showslog(time) for HotKnots data points and the Y axis shows log(time) for Iter-ative HFold on the HK-PK data set. RNA sequences in this data set are be-tween 26 and 400 bases long. HFold runs significantly faster than HotKnotsand finishes under 1.5 seconds for even the longest RNA sequence in ourdata set (400 bases). HotKnots is faster than Iterative HFold on sequencesof up to 47 bases, while Iterative HFold starts being faster than HotKnots.Iterative HFold runs in less than 8.3 seconds for all RNA sequences in thisdata set whereas HotKnots runs for over 6000 seconds (about 1.7 hours) onthe longest RNA sequence in our data set. The running time of both HFoldand Iterative HFold grows with sequence length, whereas HotKnots’ runningtime is not directly correlated with RNA length. For example, HotKnotsruns for 1665.94 seconds for one RNA sequence of length 195 (RNase PRNA, Wastewater SL-C, molecule ID: ASE-00360), while it runs for 203.12seconds for another RNA sequence of length 195 (RNase P RNA, volunteerESH26-4, molecule ID: ASE-00131).When reporting on time of Iterative HFold and HotKnots on the HK-PKdata set, we did not include the time required to get the input structuresto Iterative HFold. Since we only run HotKnots V2.0 partially to producehotspots it does not take as long as running HotKnots. For example, for thelongest RNA sequence in our data set (RNase P RNA, Agrobacterium tume-faciens, strain A348, molecule ID: A.tum.RNaseP, 400 nucleotides long), itonly takes 0.41 seconds time to produce the hotspots. Average time neededto produce hotspots for all RNA sequences in this data set is 0.20 seconds.(The time required to get the hotspots for all RNA sequences in this dataset is provided in Appendix C.) Therefore, even adding time requirementsof calculating hotspots to the running time of Iterative HFold, we still findIterative HFold to be faster than HotKnots.IPknot is significantly faster than both HFold and Iterative HFold. Forall sequences in this data set, IPknot produces output in less than 0.8 sec-onds. Due to extensive memory requirement of the CCJ algorithm, its runtime is slower than Iterative HFold, HotKnots and IPknot. For detailedinformation about performance of each method see Appendix C.786.7. Memory usage comparison−2 0 2 4 6 8−202468log10(time) HotKnotslog 10(time) Iterative HFoldFigure 6.3: Time comparison. Comparison of running times of IterativeHFold and HotKnots in a log plot. The X axis shows log10(time) for Hot-Knots data points and the Y axis shows log10(time) for Iterative HFold.6.7 Memory usage comparisonHere we present memory usage of HFold, Iterative HFold and HotKnots onour HK-PK pseudoknotted structures. Since HotKnots predicts and keepsabout 20 structures in memory, its memory consumption can vary signifi-cantly from one sequence to another, and is not predictable. Up until 47bases, HotKnots some times uses less memory than HFold or Iterative HFold,796.8. Comparison with ShapeKnotsbut for RNA sequences with 47 bases or longer, HotKnots uses much morememory than HFold and Iterative HFold. Iterative HFold’s memory usageis very similar to HFold’s and increases at a very low rate by the length ofthe RNA sequence. It starts from 48.69 MB for RNA sequences of length26 and increases to 61.33 MB for the longest RNA sequence in this dataset (400 bases long). HotKnots, however, uses as little as 16.53 MB foran RNA of length 30 bases (tmRNA, Legionella pneumophila, molecule ID:LP-PK1) and as much as 91.23 GB (i.e., 93419 MB) for the longest RNAsequence in this data set. When reporting on memory consumption of It-erative HFold and HotKnots on the HK-PK data set, we did not includethe memory required to get the input structures to Iterative HFold. Sincewe only run HotKnots V2.0 partially to produce hotspots it does not con-sume as much memory. For example, for the longest RNA sequence in ourdata set (RNase P RNA, Agrobacterium tumefaciens, strain A348, moleculeID: A.tum.RNaseP, 400 nucleotides long), it only takes 4 MB of memoryto produce the hotspots. We also note that since calculating hotspots andrunning Iterative HFold are done sequentially, the memory consumption iscalculated as the maximum of the two, so the memory consumption of It-erative HFold for this sequence is still the same even including the memoryneeded for calculating hotspots.IPknot uses much less memory than all other methods. For the longestRNA sequence in this data set, IPknot uses less than 5.5 MB of memoryin comparison to 61.33 MB of HFold and Iterative HFold and 91.23 GBof HotKnots. CCJ on the other hand uses 119.72 MB of memory for theshortest sequence and increases so much that Valgrind runs out of memoryand is unable to provide the heap usage for the RNA sequence of length 195.For detailed information about memory usage of each method see AppendixC.6.8 Comparison with ShapeKnotsSimilar to HotKnots, the ShapeKnots method of Hajdin et al. [40] is a heuris-tic algorithm for prediction of pseudoknotted structures. This method incor-porates SHAPE reactivity data as a pseudo energy term into the predictionmethod. SHAPE reactivity data is only available for a limited number ofRNA sequences, so we cannot compare Iterative HFold with ShapeKnots onour data set. Therefore, we use data set of Hajdin et al. to compare thesetwo methods. In their data set Hajdin et al. have 18 RNA sequences intheir training set and 6 RNA sequences in their test set. We run Iterative806.8. Comparison with ShapeKnotsHFold with hotspots for each RNA sequence and choose the lowest energystructure as the final output of our program. For Shapeknots, we use thesensitivity and positive predictive values reported in the work of Hajdin etal. [40] to compare with Iterative HFold. Table 6.5 shows the results ofthis comparison. In all but one sequence of the test set, Iterative HFoldobtains higher accuracy than ShapeKnots. The exception is the HIV-1 5’pseudoknot domain; Hajdin et al. note that the accepted structure of HIV-15’ pseudoknot domain is based on a SHAPE directed prediction and thusan accuracy comparison between ShapeKnots and Iterative HFold may bebiased towards ShapeKnots. In the training set, however, Iterative HFolddoes not perform as well as ShapeKnots. This might be because parame-ters of ShapeKnots were tuned on the training set to achieve the highestpossible accuracy. Iterative HFold’s average accuracy on the training andtest data sets of Hajdin et al. [40] is 75.91% and 87.54% respectively com-pared to 89.13% and 92.02% of ShapeKnots. Since both the training andtest data sets are small, we cannot make more general statements about thesignificance of the differences in accuracy between the two methods.816.8.ComparisonwithShapeKnotsTable 6.5: Comparison of Iterative HFold F-measure with ShapeKnots on SHAPE dataTraining set Len PK Iter. HFold ShapeKnotssen ppv F sen ppv FPre-Q1 riboswitch, B. subtilis 34 1 62.5 100 76.9 100 100 100Telomerase pseudoknot, human 47 1 100 100 100 100 100 100tRNA(asp), yeast 75 0 81.0 100 89.5 95.2 95.2 95.2TPP riboswitch, E. coli 79 0 46.5 47.6 47.1 95.4 87.5 91.3SARS corona virus pseudoknot 82 1 69.2 86.3 69.2 84.6 88.0 86.3cyclic-di-GMP riboswitch, V. cholerae 97 0 85.5 81.0 83.2 89.3 86.2 87.7SAM I riboswitch, T. tengcongenis 118 1 79.5 91.2 84.9 92.3 97.3 94.7M-Box riboswitch, B. subtilis 154 0 87.5 91.3 89.4 87.5 91.3 89.3P546 domain, bI3 group I intron 155 0 55.4 57.4 56.4 94.6 96.4 95.5Lysine riboswitch, T. maritima 174 1 85.7 94.7 90.0 87.3 88.7 88.0Group I intron, Azoarcus sp. 214 1 52.4 54.1 53.2 92.1 95.1 93.5Signal recognition particle RNA, human 301 0 70.0 73.7 71.8 55.0 53.9 54.4Hepatitis C virus IRES domain 336 1 71.2 74.0 72.5 92.3 96.0 94.1RNase P, B. subtilis 405 1 55.7 59.3 57.4 75.6 79.8 77.7Group II intron, O. iheyensis 412 1 87.9 95.9 91.7 93.2 97.6 95.3Group I intron, T. thermophila 425 1 83.2 85.2 84.2 93.9 91.2 92.55’ domain of 23S rRNA, E. coli 511 0 84.0 72.5 77.8 92.4 76.4 83.65’ domain of 16S rRNA, E. coli 530 0 73.6 69.0 71.2 89.9 80.6 84.9Test set Len PK Iter. HFold ShapeKnotssen ppv F sen ppv FFluoride riboswitch, P. syringae 66 1 100 100 100 93.7 93.7 93.7Adenine riboswitch, V. vulnificus 71 0 100 100 100 100 100 100tRNA(phe), E. coli 76 0 100 100 100 100 84.0 91.35S rRNA, E. coli 120 0 91.4 91.4 91.4 85.7 76.9 81.15’ domain of 16S rRNA, H. volcanii 473 0 90.3 82.3 86.1 89.6 82.7 86.0HIV-1 5’ pseudoknot domain 500 1 45.4 50.4 47.7 100 100 100826.9. Discussion6.9 DiscussionIn Section 6.9.1 and 6.9.2 we provide more insight on the differences andmerits of Iterative HFold, CCJ, HotKnots and IPknot respectively. Sec-tion 6.9.4 provides more insight into the energy model used in this work.6.9.1 HotKnotsIn this section we focus on comparing Iterative HFold, when given HotKnotshotspots as input, with Hotknots V2.0, and also compare CCJ with HotknotsV2.0.The differences in accuracies of Iterative HFold, CCJ and HotKnots V2.0are insignificant on three of our data sets, namely the HK-PK (HK-PK*when comparing with CCJ), HK-PK-free and DK-pk16 (DK-pk16* whencomparing with CCJ), but Iterative HFold and CCJ are significantly betterthan HotKnots V2.0 on the fourth data set, namely the IP-pk168 data set.6.9.2 IPknotIn this section we focus on comparing Iterative HFold, when given HotKnotshotspots as input, with IPknot, and also compare CCJ with IPknot.The differences in accuracies of Iterative HFold, CCJ and IPknot areinsignificant on two of our data sets, namely the HK-PK-free and DK-pk16(DK-pk16* when comparing with CCJ), but Iterative HFold and CCJ aresignificantly better than IPknot on the HK-PK (HK-PK* when comparingwith CCJ) and IP-pk168 data set.We note that Sato et al. [83] find the performance of IPknot with pre-dictions using “NUPACK” superior to all versions of IPknot. However dueto large time and space complexity when calculating exact pseudoknottedbase pair probabilities using the “NUPACK” scoring model (O(n5) timeand O(n4) space), IPknot is not practical for RNA sequences of length > 80nucleotides. Therefore, we did not compare our results with this version ofIPknot. Appendix A provides detailed performance of IPknot with differ-ent settings. In this thesis we used IPknot with γ1 = 4 and γ2 = 8, thatproduces the best results on all our data sets.A disadvantage of IPknot over Iterative HFold and CCJ is that to getthe best prediction, user needs to provide some guidance as to what type ofstructure to predict for the given sequence, e.g., whether pseudoknot-free orpseudoknotted.836.9. Discussion6.9.3 Comparison of CCJ, HFold and Iterative HFoldIn this section we evaluate the merits of the different hypotheses about RNAfolding, by comparing performance of CCJ, HFold and Iterative HFold. Morespecifically we would like to see how a method that is purely based on MFEperforms versus a method that strictly adheres to the hierarchical foldinghypothesis and a method that is based on the “relaxed” hierarchical foldinghypothesis. We did not find the difference in accuracies of CCJ, HFold andIterative HFold significant on any of our data sets. However, we observedsome interesting findings by looking at the results more carefully. Table 6.6summarizes these findings.Table 6.6: Comparison of the number of times each of the proposed algo-rithms (i.e. CCJ, Iterative HFold and HFold) predicted the structure withthe minimum free energy, or the structure with highest F -measure in each ofthe four data sets used in this thesis. The table also provides the number oftimes Iterative HFold and HFold predicted structures with higher accuracythan the structure predicted by the CCJ algorithm.DK-pk16* CCJ Iter. HFold HFold# min energy 12 6 6# highest F 1 5 6# F > CCJ 0 8 10IP-pk168 CCJ Iter. HFold HFold# min energy 168 155 143# highest F 112 117 109# F > CCJ 0 7 11HK-PK* CCJ Iter. HFold HFold# min energy 83 64 59# highest F 50 47 49# F > CCJ 0 8 10HK-PK-free CCJ Iter. HFold HFold# min energy 333 294 278# highest F 199 212 214# F > CCJ 0 33 43As seen in Table 6.6, predicting the structure with the minimum freeenergy does not seem to be directly correlated with finding the structure846.9. Discussionwith the highest accuracy among all methods. More specifically in the DK-pk16* data set CCJ produces the lowest energy structure in all cases butonly in one case produces the highest accuracy. This offers some support forthe relaxed hierarchical folding hypothesis. We should note that this maybe dependent also on the energy model used.Figure 6.4 shows an example of the structure produced by the CCJ algo-rithm, compared with the structure predicted by the Iterative HFold algo-rithm. The accuracy of the CCJ structure is 64.4% while Iterative HFold’sstructure has accuracy of 88.1%. As seen in Figure 6.4, both Iterative HFoldand CCJ predicted the bands in the reference structure. However, the struc-ture predicted by CCJ missed the nested substructure and instead addedanother band.We also notice that by following the relaxed hierarchical folding hypoth-esis, hence allowing changes to the input structure, Iterative HFold moreoften finds structures with lower energy than those predicted by HFold inall of our data sets. However this does not result in significant improvementin prediction accuracy compared to HFold. Figure 6.5 shows an example ofone such case, in which finding a lower energy structure results in predictingthe reference structure.In many cases, especially for longer RNA strands in the HK-PK-free dataset, the MFE structures predicted by the CCJ algorithm are pseudoknotted.This suggests that the energy parameters encourage addition of pseudoknotsand so parameter estimation specifically for CCJ would be useful. Figure6.6 shows an example of structures predicted by CCJ vs. Iterative HFold.Addition of pseudoknots in HFold’s predictions is the least among HFold,CCJ and Iterative HFold. This is mainly due to the fact that the hotspotsinput to HFold must also be part of the structure output by HFold, reducingHFold’s flexibility in adding pseudoknots (compared with CCJ and IterativeHFold). Iterative HFold’s predictions are mostly similar to those of HFoldon the RNA sequences of the HK-PK-free data set. However, in a fewcases, especially in longer RNA strands Iterative HFold’s predictions arealso pseudoknotted.We also found that on shorter RNA sequences, Iterative HFold is notable to modify the given structure, when the input structure is overcrowdedwith base pairs. This is because on the shorter RNA sequences, IterativeHFold is unable to modify the input structure. To see why, recall that themethod within Iterative HFold that enables structure modification is method2, which first runs HFold-PKonly on sequence S and given input structureG. If HFold-PKonly results in a structure G ∪ G′ such that G′ is not theempty structure, then runs HFold with sequence S and structure G′, and856.9. Discussion91U G G C C G G C A U G G C C C C A G C C U C C U C G C U G G C G C C G G C U G G G C A A C G A U C C G A G G G G A C U G U C C C U C U C G A G A A U C G G C A A A U G G G G C C C C1 10 20 30 40 50 60 70 80 90G(a) Reference structure for RFA 00632 of the HK-PK data set.91U G G C C G G C A U G G C C C C A G C C U C C U C G C U G G C G C C G G C U G G G C A A C G A U C C G A G G G G A C U G U C C C U C U C G A G A A U C G G C A A A U G G G G C C C C1 10 20 30 40 50 60 70 80 90G(b) Structure predicted by the CCJ algorithm for RFA 00632.91U G G C C G G C A U G G C C C C A G C C U C C U C G C U G G C G C C G G C U G G G C A A C G A U C C G A G G G G A C U G U C C C U C U C G A G A A U C G G C A A A U G G G G C C C C1 10 20 30 40 50 60 70 80 90G(c) Structure predicted by Iterative HFold for RFA 00632.Figure 6.4: Example structure predicted by the CCJ algorithm vs. IterativeHFold.866.9. Discussion52U U U G U U A G U G G C G U G U C C G U C C G C A G C U G G C A A G C G A A U G U A A A G A C U G A C1 10 20 30 40 50G(a) Reference structure for EC PK4 of the HK-PK data set.52U U U G U U A G U G G C G U G U C C G U C C G C A G C U G G C A A G C G A A U G U A A A G A C U G A C1 10 20 30 40 50G(b) Hotspot for EC PK4 provided as input to both HFold and Iterative HFold.52U U U G U U A G U G G C G U G U C C G U C C G C A G C U G G C A A G C G A A U G U A A A G A C U G A C1 10 20 30 40 50G(c) Structure predicted by the HFold for EC PK4.52U U U G U U A G U G G C G U G U C C G U C C G C A G C U G G C A A G C G A A U G U A A A G A C U G A C1 10 20 30 40 50G(d) Structure predicted by Iterative HFold for EC PK4.Figure 6.5: An example where Iterative HFold finds a structure with lowerenergy than the structure predicted by HFold.876.9. Discussion32A A C A C U U G C A U G U U U U G C C G C A G C A A C U C U G1 10 20 30A(a) Reference structure for CRW 00628 of the HK-PK-free data set.32A A C A C U U G C A U G U U U U G C C G C A G C A A C U C U G1 10 20 30A(b) Structure predicted by the CCJ algorithm for CRW 00628.32A A C A C U U G C A U G U U U U G C C G C A G C A A C U C U G1 10 20 30A(c) Structure predicted by Iterative HFold for CRW 00628.Figure 6.6: Example structure predicted by the CCJ algorithm vs. IterativeHFold.886.9. Discussion15G G C U G U A A U G G U C A1 10C(a) Reference structure for SRP 00265 of the HK-PK-free data set.15G G C U G U A A U G G U C A1 10C(b) Hotspot for SRP 00265 provided as input to Iterative HFold.15G G C U G U A A U G G U C A1 10C(c) Structure predicted by Iterative HFold for SRP 00265.Figure 6.7: An example of overcrowded input structure provided to IterativeHFold.896.9. Discussionstores the result. Otherwise, simply stores G as the result. When a shortRNA sequence is overcrowded with base pairs, HFold-PKonly is unable toadd pseudoknotted base pairs. Therefore, the result of method 2 is the sameas the input structure. Figure 6.7 shows one such example.6.9.4 Energy modelIn this thesis we use the HotKnots V2.0 DP09 [11] energy parameters in ourimplementation of Iterative HFold. To investigate the degree to which theenergy model may be causing mis-predictions by HotKnots V2.0 or IterativeHFold, we considered the degree to which the maximum accuracy structuresproduced by these methods, i.e., the structures with the highest F-measure,are better than the minimum free energy structures. Table 6.7 presentsthe difference in bootstrap 95% percentile confidence intervals of average F-measure. For example, the last two rows of the table shows that knowing thereference structure, if we choose the maximum accuracy structure amongthe 20 output structures predicted by HotKnots for each RNA sequence,the bootstrap 95% percentile confidence intervals of average F-measure ofHotKnots will increase to (84.50%, 91.48%) for pseudoknotted structures ofthe HK-PK data set (vs. (73.6%, 83.35%) when choosing the lowest energystructure) and (88.32%, 91.08%) for pseudoknot-free structures of the HK-PK-free data set (vs. (76.74%, 81.95%) when choosing the lowest energystructure).Similarly, if we compare the maximum accuracy structure output by It-erative HFold with the minimum free energy structure, whether given Hot-Knots hotspots or the first 50 suboptimal structures to Iterative HFold as in-put, the bootstrap 95% percentile confidence intervals of average F-measurealso show improvement - see Table 6.7. The difference in improvements issignificant in all but one case, namely Iterative HFold on hotspots structuresas input for pseudoknotted structures of the HK-PK data set. We concludethat improvements on the energy parameter values for pseudoknotted struc-tures may further improve accuracy of both HotKnots and Iterative HFold.906.9.DiscussionTable 6.7: Comparison of bootstrap 95% percentile confidence interval of average F-measure between the minimumenergy structures and the maximum accuracy structures of HK-PK and HK-PK-free data sets.Method Data set Min energy Max accuracy Permutation testIter. HFold - hotspots HK-PK (72.83, 83.37) (78.56, 87.05) Not significantIter. HFold - hotspots HK-PK-free (74.93, 80.26) (87.70, 90.57) SignificantIter. HFold - 50 suboptimals HK-PK (67.70, 79.57) (80.41, 88.14) SignificantIter. HFold - 50 suboptimals HK-PK-free (76.27, 81.46) (90.05, 93.00) SignificantHotKnots HK-PK (73.60, 83.35) (84.50, 91.48) SignificantHotKnots HK-PK-free (76.74, 81.95) (88.32, 91.08) Significant916.10. Summary6.10 SummaryIn this chapter we presented a thorough comparison of our proposed algo-rithms, Iterative HFold and CCJ, with the best existing methods for predic-tion of RNA secondary structure including pseudoknots, namely HotKnotsV2.0, IPknot and HFold. Both Iterative HFold and CCJ are significantlymore accurate than IPknot while matching the accuracy of HotKnots on theHK-PK data set (respectively the HK-PK* data set when comparing withCCJ). Iterative HFold and CCJ are superior to both IPknot and HotKnotson the IP-pk168 data set. Moreover both Iterative HFold and IPknot useless memory and run much faster than HotKnots on long sequences.Iterative HFold also has a lower rate of accuracy deterioration than HFoldwith loss of information about the true pseudoknot-free structure, so it ismore robust than HFold. This is particularly helpful when the given in-put structure may be unreliable and/or limited information about the truepseudoknot-free structure is available.We compared three different ways to generate pseudoknot-free inputstructures for input to Iterative HFold, namely the MFE structures and thefirst 50 suboptimal structures produced by SimFold, and HotKnots hotspots.On the HK-PK and HK-PK-free data sets, accuracy of Iterative HFold is notsignificantly different on each of the first 50 suboptimal structures producedby SimFold, and HotKnots hotspots, while it is significantly better on thosethan on the MFE structures.We used CCJ, HFold and Iterative HFold to compare performance of apure MFE-based algorithm with that of a method that strictly adheres withthe hierarchical folding hypothesis and a method that is based on the relaxedhierarchical folding hypothesis. We found that predicting the MFE structuredoes not necessarily translate to predicting the highest accuracy structure.On long RNA sequences, CCJ tends to add more pseudoknotted base pairs.On short RNA sequences, when the given input structure is overcrowded bybase pairs, Iterative HFold is unable to modify the input structure to predictthe reference structure. Since HFold is unable to add both pseudoknottedand pseudoknot-free base pairs to a given input structure, in many cases itdoes not predict the reference structure. While the differences in accuraciesof HFold, Iterative HFold and CCJ on all our data sets are not significant,based on the overall “shape” of the predicted structures, we believe IterativeHFold is the winner.Comparing accuracy of the minimum free energy structures with themaximum accuracy structures in this work, we found that, on average, theminimum free energy structure has significantly poorer F-measure than the926.10. Summarymaximum accuracy structure. This suggests that an improved energy modelfor pseudoknotted structure prediction may improve accuracy of predictionalgorithms for pseudoknotted structures.93Chapter 7Sparsification of the CCJalgorithm for space reductionIn this chapter, we describe a method called sparsification, and show howsparsification can be useful in reducing the space complexity of the CCJalgorithm.As mentioned in Section 6.7, the CCJ algorithm is memory-intensiveand this limits its practicality. Thus, our goal is to reduce CCJ’s memoryrequirements without further restricting the class of structures handled bythe algorithm. We describe a new way of using candidate lists to reducethe space complexity of the CCJ algorithm to O(n2Z) instead of O(n4),and provide the recurrences for sparse CCJ algorithm. We note that pre-viously published cases using the sparsification technique either focused onpseudoknot-free methods (due to complexity of pseudoknotted methods), orused pseudoknotted methods with simplified energy models (base pair max-imization only), and they mainly aimed to reduce running time complexityof the methods. To the best of our knowledge this is the first time that thesparsification technique is applied to a complex pseudoknotted method witha realistic energy model and parameters, when focusing on reducing spacecomplexity.We start this chapter by recalling recurrences for the MFE pseudoknot-free structure prediction for an RNA sequence S. In Section 7.1.1 we describethe concept of sparsification, and how it can improve the time complexityof MFE structure prediction for single stranded RNA. In Section 7.2 weexplain changes to CCJ recurrences needed for successful sparsification. Wepresent sparsified recurrences of the CCJ algorithm in Section 7.3, and inSection 7.4 we discuss space complexity of the sparse version of the CCJalgorithm. Finally in Section 7.5 we provide a summary of this chapter.947.1. Background on sparsification7.1 Background on sparsificationTo introduce sparsification, we first consider pseudoknot-free MFE secondarystructure prediction. We first recall the MFE recurrences for RNA pseudoknot-free structure prediction. As explained in Section 4.2, we define the W andV recurrences as follows. Let Wi,j be the energy of the MFE pseudoknotfree secondary structure for the subsequence sisi+1 . . . sj . If i ≥ j, Wi,j = 0,since the subsequence is empty. Otherwise, it must either be that i.j is abase pair in the MFE structure for si . . . sj , or that the MFE structure canbe decomposed into two independent subparts. These two cases correspondto the two rows of the following recurrence for Wi,j .Wi,j = min{Vi,j ,mini≤r<jWi,r +W(r+1),jwhere Vi,j is the free energy of the MFE structure for si . . . sj that containsi.j. If i ≥ j, Vi,j is set to be ∞. Otherwise, i.j closes a hairpin, an internalloop, or a multiloop in the MFE structure for si . . . sj .Focusing on the second case of the W recurrence we can see that the timecomplexity bottleneck in Wi,j is due to considering O(n) possible points rin the computation of Wi,r + W(r+1),j . The sparsification idea originatedfrom the work of Wexler et al. [99] to reduce the time complexity of MFEcalculation for single RNA molecules. Their idea was to not store all thevalues in the structure matrix W = {Wi,j}, but instead just keep some valuescalled candidates. Wexler et al. showed that sparsification can reduce theaverage time complexity from Θ(n3) to O(n2ψ(n)), where ψ(n) is maximumsize of the candidate list.Backofen et al. [12] modified the sparsification idea of Wexler et al. toachieve a better space complexity. Their modification was based on theobservation that some of the stored values are not necessary throughout thecomplete run of the algorithm. Backofen et al. further reduced the timeand space complexity of the MFE algorithm for prediction of pseudoknot-free structures. Later, the sparsification idea was used to reduce the timeand space complexity of an algorithm for predicting the joint structure ofinteracting RNAs [81], and to reduce the time and space complexity ofsome pseudoknot prediction algorithms. We next describe the RNA MFEsparsification method, following the definitions of Backofen et al. [12].957.1. Background on sparsification7.1.1 Sparsification of the MFE pseudoknot-free algorithmThe main idea for the sparsification technique presented in the work ofBackofen et al. [12] and Wexler et al. [99] is to rewrite the MFE predic-tion recurrences as co-terminus and partitionable. We provide necessarydefinitions based on the work of Backofen et al. [12].A secondary structure Ri,j admits a split point q, i < q ≤ j, if forno base pair k.l in Ri,j , i ≤ k < l ≤ j, we have k < q ≤ l. In otherwords, a secondary structure R admits a spilt point q if q partitions R intotwo secondary structures R1 and R2 that correspond to the weakly closedregions [i, q − 1] and [q, j] respectively. Let Qi,j be the set of all possiblesplit points with respect to Ri,j , i.e., Qi,j = {q|i < q ≤ j}. A secondarystructure R is called partitionable with respect to Ri,j if it admits at leastone split point in the set Qi,j ; otherwise it is called co-terminus with respectto Ri,j . Note that an empty structure, i.e., when i = j, is co-terminus bydefinition, as Qi,j = ∅.Let’s rewrite the MFE prediction recurrences as follows.Li,j = min{ Lci,jLpi,jwhereLci,j = f(i, j)Lpi,j = minq∈Qi,j{Li,q−1 + Lq,j}and Lpi,j = −∞ if i = j or i > j or Qi,j = ∅.Here Li,j is the previously mentioned Wi,j recurrence, if Lci,j is Vi,j andQi,j is {i+1, ..., j}. We refer to Li,j as a solution to the MFE folding problemfor Si,j , and to Lci,j and Lpi,j as co-terminus and partitionable solutions forSi,j .The main property, which yields to the time and space complexity im-provements, is the triangle inequality property stated below.Triangle inequality: A recurrence relation sustains the triangle in-equality property if for every Ri,j and for every split point q ∈ Qi,j , Li,j ≤Li,q−1 + Lq,j .It is clear that Wi,j recurrence obeys this inequality.A secondary structure R for sequence S is an optimal secondary structureif the energy associated with R is equal to W1,n, where n is the length of S.We note that there might be more than one optimal secondary structure fora given sequence. Similarly a secondary structure over region [i, j], Ri,j , isoptimal if energy of the structure is equal to Wi,j .967.1. Background on sparsificationWe define a region [i, j] as optimally co-terminus (OCT) if every optimalsecondary structure Ri,j is co-terminus. A split point q ∈ Qi,j is called anoptimal split point of Ri,j if Lpi,j = Li,q−1 + Lq,j .Wexler et al. observed that it is sufficient to examine only a subset ofthe split points to compute Lpi,j . Specifically, they observed that for everyregion [i, j], there is an optimal split point q ∈ Qi,j such that region [q, j] isan OCT.Let’s define Qocti,j = {q ∈ Qi,j |[q, j] is an OCT}. Then the split pointsconsidered by Lp(i, j) can be restricted as follows:Lpi,j = minq∈Qocti,j{Li,q−1 + Lq,j}.Now let’s define the sparsity parameter, Z(S), for a region [1, n] corre-sponding to the string S, as the number of subregions in S1,n that are OCT.We use Z in place of Z(S) when the context is clear.Since every region of size 1 is by definition OCT, Z ≥ n. In a regularMFE calculation all the values in the upper triangle of an n× n matrix arecalculated and kept in memory. Hence, the regular MFE algorithm requiresΘ(n2) space. As mentioned at the start of this section, the bottleneck in cal-culating Wi,j comes from the second row of the recurrence, that correspondsto calculation of Lpi,j . With no sparsification, Θ(n) values for possible splitpoints are checked in the calculation of Lpi,j , thus, the total time complexityis Θ(n3). In the sparse case, instead of checking Θ(n) possible split pointswe only check the O(|Qoct|) split points, and the total time complexity isreduced to O(nZ), as follows:n−1∑i=1n∑j=i+1|Qocti,j | ≤n−1∑i=1n∑j=i+1|Qoct1,j | ≤n−1∑i=1Z < nZ.For the MFE pseudoknot-free secondary structure prediction problem,Wexler et al. [99], estimated the expected value of Z and found it to bevery small (< 6 for their test set of 50000 mRNA sequences from the NCBIdatabases with an average length of 1992 nucleotides), significantly smallerthan O(n2). Their estimation was based on a probabilistic model of polymerfolding and measured by simulation.Backofen et al. [12] presented an algorithm for the MFE pseudoknot-free structure prediction problem, that using the sparsification technique,reduced the space complexity to O(Z). Their algorithm is based on theobservation that some of the values stored by the algorithm of Wexler et al.[99] are not needed throughout the complete run of the algorithm. In other977.1. Background on sparsificationwords, they found that for calculation of Li,j , one only needs to examineLa,b for strict subregion [a, b] of [i, j], such that either a = i, a = i + 1, orregion [a, b] is an OCT.7.1.2 Sparsification of pseudoknotted algorithmsMo¨hl et al. [67] modified the idea of an OCT list to work with gappedregions to sparsify a pseudoknotted algorithm (the base pair maximizationversion of the Rivas and Eddy algorithm [77]). We next provide necessarydefinitions for sparsification of pseudoknotted algorithms following the def-initions provided by Mo¨hl et al. [67]. A fragment over a sequence S isdefined as a subregion of [1, n]. We note that a fragment may correspond toan ungapped region as well as a gapped region.A split of a fragment F , into two fragments F1 and F2, is defined asa tuple (F1, F2) such that F = F1 ∪ F2 and F1 ∩ F2 = ∅. A recurrencecase (i.e., a line of a possibly multi-line recurrence) describes how to decom-pose calculation of an energy value on a given fragment into calculations onsmaller fragments, effectively splitting the given fragment or reducing thesize of the fragment. The type of a recurrence case is a short motif that de-scribes roughly how the given fragment is split. (The type can equivalentlybe associated with the fragment split.) The type of a recurrence case can beobtained as follows: in the arc diagram, label each subregion of a fragmentFi with number i, label a gap with “G”, and concatenate the characters inorder from left to right.Figure 7.1 shows examples of recurrence types in the CCJ algorithm. Inthis figure, we haveP(i, l) = mini<j<d<k<lPK(i, j − 1, d+ 1, k − 1) + PK(j, d, k, l)that corresponds to a 1212 recurrence type, and we have the first two casesof the PK recurrence, that span over the gapped region [i, j] ∪ [k, l],mini≤d<jPK(i, d, k, l) + WP(d+ 1, j)andmink<d≤lPK(i, j, d, l) + WP(k, d− 1)correspond to a 12G1 recurrence type and a 1G21 recurrence type respec-tively.Fix a recurrence that can associate an energy with a given fragment Fand fix a case of type T of this recurrence. F is optimally decomposable by987.1. Background on sparsificationl1 2 1 2i j d k(a) Recurrence type 1212 of P(i, l) corresponding tomini<j<d<k<l PK(i, j − 1, d+ 1, k − 1) + PK(j, d, k, l).j21 1Gk li d(b) Recurrence type 12G1 of P(i, l) corresponding tomini≤d<j PK(i, d, k, l) + WP(d+ 1, j).l1G 21i j k d(c) Recurrence type 1G21 of P(i, l) corresponding tomink<d≤l PK(i, j, d, l) + WP(k, d− 1).Figure 7.1: Examples of recurrence types in the CCJ algorithm.997.1. Background on sparsificationa split of type T (T-OD) if a case of type T can split F into (F1, F2) andthe energy associated with F (by the fixed recurrence) is equal to that ofF1 + F2.A fragment F is optimally decomposable with respect to a set of split typesT (T -OD) iff F is T-OD for some T ∈ T .For example a fragment [i, j] is 12-OD with respect to the recurrence W -12 iff W (i, l) = mini<d≤lW (i, d− 1) +W (d, l). Similarly a fragment [i, j] ∪[k, l] is 12G1-OD with respect to the recurrence PK-12G1 iff PK(i, j, k, l) =mini≤d<j PK(i, d, k, l) + WP(d+ 1, j).Mo¨hl et al. [67] showed that in each recursion case, certain optimallydecomposable fragments do not have to be considered for computing an op-timal solution, because each decomposition using these fragments can bereplaced by a decomposition using a smaller fragment. For example, inrecurrence case 12G2 of PfromL(i, j, k, l), which splits the fragment [i, j] ∪[k, l] into F1 = [i, d − 1] and F2 = [d, j] ∪ [k, l], if F2 is 12G2-OD withrespect to the recurrence case 12G2 of PfromL(d, j, k, l), then ∃e > d :PfromL(d, j, k, l) = WP(d, e − 1) + PfromL(e, j, k, l). Therefore, we canreplace every split of PfromL(i, j, k, l) into WP(i, d− 1) + PfromL(d, j, k, l)by WP(i, d − 1) + WP(d, e − 1) + PfromL(e, j, k, l) which is as good asWP(i, d− 1) + PfromL(d, j, k, l). We note that this argument is split typespecific and cannot be used for all split types.Considering this observation, Mo¨hl et al. defined sets of recurrence typesin a specific algorithm such that a recurrence case is optimally decomposableif it is optimally decomposable with respect to at least one of the types withinthe set. We later provide such sets of recurrence type for the CCJ algorithmin Section 7.2. For example, we have the following set of recurrence typescorresponding to PfromL case 12G2:T CCJfromL12G2 = {12G2, 12G1}.To see how this set of recurrence types is used, consider recurrence case12G2 of PfromL(i, j, k, l), which splits the fragment [i, j] ∪ [k, l] into F1 =[i, d− 1] and F2 = [d, j] ∪ [k, l]. Suppose that the second fragment, F2, canbe optimally decomposed by a split of a type in TCCJfromL12G2 , say for example,12G2, such that PfromL(d, j, k, l) = WP(d, e−1) + PfromL(e, j, k, l) then wecould have split the fragment [i, j]∪ [k, l] into F ′1 = [i, e−1] and F ′2 = [e, j]∪[k, l], and still have a solution that is as good as that obtained by the (F1, F2)split. Continuing this idea recursively, we can reformulate recurrences in away that the second fragments are not optimally decomposable with respectto the sets of recurrence types within the algorithm.1007.2. Changes to the CCJ recurrencesSuch non-optimally decomposable fragments correspond to entries ofcandidate lists. Note that separate candidate lists are to be maintainedfor each sparsified recursion case. A fragment is added to a candidate listfor recursion case T iff it is not optimally decomposable in the sets of splittypes within the algorithm. When pushing a candidate into its correspond-ing candidate list, Mo¨hl et al. used notationLT (<fixed boundaries>, T )(<remaining boundaries>,w),such that the first parentheses hold the tuple of fixed boundaries togetherwith the case type, T , and the second parentheses hold the tuple of theremaining boundaries and the corresponding energy value, w. Fixed bound-aries correspond to the boundaries that are the same in F and F2. Forexample, in recurrence case 12G2 of PfromL(i, j, k, l) = mini<d≤j WP(i, d−1) + PfromL(d, j, k, l), when we find a fragment [d, j] ∪ [k, l] that is not12G2-OD, we push it into the candidate list as:LfromL12G2(j, k, l, 12G2), (d, PfromL(d, j, k, l)).Here boundaries of fragment F are i, j, k, l and boundaries of F2 are d, j, k, l,therefore fixed boundaries are j, k, l and the remaining boundary going tothe second tuple is d. Note that PfromL(d, j, k, l) holds the energy value inthe PfromL energy matrix.Similar to the sparsification of the pseudoknot-free MFE algorithm, thealgorithm recurrences for pseudoknotted structures then can be re-written toonly consider the candidates when partitioning the fragments. For example,the sparse version of case 12G2 of PfromL(i, j, k, l) will be as follows:PfromL(i, j, k, l) = min(d,w)∈LfromL12G2 (j,k,l,12G2)WP(i, d− 1) + w.7.2 Changes to the CCJ recurrencesWe describe a new way of using candidate lists to reduce the space com-plexity of the CCJ algorithm to O(n2Z) instead of O(n4), and provide therecurrences for sparse CCJ algorithm. We modify the CCJ recurrences to fa-cilitate sparsification while the class of structures handled is the same. Sinceour focus here is to reduce CCJ’s space complexity, we only use the sparsifi-cation technique on the 4-dimensional matrices. We provide the modified re-currences in this section and introduce the split types for every partitionablerecurrence case in the CCJ algorithm (see Chapter 5). Then in Section 7.3,we provide the sparsification of CCJ with the proof of correctness.1017.2. Changes to the CCJ recurrencesThe modified recurrences are of the form PX,mloop, and PX,mloopab, whereX ∈ {L,R,M,O} and a, b ∈ {0, 1}. All other cases are essentially the sameas those provided in Chapter 5, except for the following notational additionsin some cases:1. here case types are highlighted when a recurrence case is partitionable;2. each case has a superscript corresponding to what will be kept inmemory for the corresponding energy matrix.In the following recurrences, when a recurrence has a ∗ as superscript, itmeans its value will not be stored in memory and instead will be calculatedas needed; when a recurrence has C as superscript, it means we will only keepits candidates and not store the rest; when a recurrence has i as superscript,it means we will only keep a copy of its ith slice (i.e., a 3D matrix of j, k, l).Similarly i..i+K means we will store slices i to i+K (i.e., K× 3D matrices,where K is a constant). Here MLS is Maximum Loop Size for an internalloop (i.e., 30 in this work).1027.2.ChangestotheCCJrecurrencesW (i, l) = minW (i, l − 1)(12′)mini<d≤lW (i, d− 1) +W (d, l)(12)V (i, l)P(i, l) + PsV (i, l) = minVhairpin(i, l)Vstacked(i, l)(1′21′)Viloop(i, l)(1′′21′′)Vmloop(i+ 1, l − 1) + b+ β2(l, i) + a}P(i, l) = mini<j<d<k<l (PK(i, j − 1, d+ 1, k − 1) + PK(j, d, k, l)) (1212)PKC,i(i, j, k, l) = minmini≤d<j PK(i, d, k, l) + WP(d+ 1, j)(12G1)mink<d≤l PK(i, j, d, l) + WP(k, d− 1)(1G21)PL(i, j, k, l) + γ2(j, i) + PbPR(i, j, k, l) + γ2(l, k) + PbPM (i, j, k, l) + γ2(j, k) + PbPO(i, j, k, l) + γ2(l, i) + PbPC,i..i+MLSL (i, j, k, l) = minPL,iloop(i, j, k, l)PL,mloop(i, j, k, l) + b′PfromL(i+ 1, j − 1, k, l) + γ2(j, i)(1′21′G2)1037.2.ChangestotheCCJrecurrencesP iR(i, j, k, l) = minPR,iloop(i, j, k, l)PR,mloop(i, j, k, l) + b′PfromR(i, j, k + 1, l − 1) + γ2(l, k)(1G2′12′)P iM (i, j, k, l) = minPM,iloop(i, j, k, l)PM,mloop(i, j, k, l) + b′PfromM (i, j − 1, k + 1, l) + γ2(j, k)(12′G2′1)γ2(i, l), if i = j and k = lPC,i..i+MLSO (i, j, k, l) = minPO,iloop(i, j, k, l)PO,mloop(i, j, k, l) + b′PfromO(i+ 1, j, k, l − 1) + γ2(l, i)(1′2G21′)PC,i..i+1fromL (i, j, k, l) = minmini<d≤jWP(i, d− 1) + PfromL(d, j, k, l)(12G2)mini≤d<j PfromL(i, d, k, l) + WP(d+ 1, j)(12G1)PR(i, j, k, l) + γ2(l, k) + PbPM (i, j, k, l) + γ2(j, k) + PbPO(i, j, k, l) + γ2(l, i) + PbP ifromR(i, j, k, l) = minmink<d≤l PfromR(i, j, d, l) + WP(k, d− 1)(1G21)mink≤d<l PfromR(i, j, k, d) + WP(d+ 1, l)(1G12)PM (i, j, k, l) + γ2(j, k) + PbPO(i, j, k, l) + γ2(l, i) + Pb1047.2.ChangestotheCCJrecurrencesP ifromM (i, j, k, l) = minmini≤d<j PfromM (i, d, k, l) + WP(d+ 1, j)(12G1)mink<d≤l PfromM (i, j, d, l) + WP(k, d− 1)(1G21)PL(i, j, k, l) + γ2(j, i) + PbPR(i, j, k, l) + γ2(l, k) + PbPC,i..i+1fromO (i, j, k, l) = minmini<d≤jWP(i, d− 1) + PfromO(d, j, k, l)(12G2)mink≤d<l PfromO(i, j, k, d) + WP(d+ 1, l)(1G12)PL(i, j, k, l) + γ2(j, i) + PbPR(i, j, k, l) + γ2(l, k) + PbV ∗hairpin(i, l) = eH(i, l)V ∗stacked(i, l) = eS(i, l)V ∗iloop(i, l) = mini<d<e<l( eint(i, d, e, l) + V (d, e))(121)Vmloop(i, l) = minVmloop(i, l − 1) + c× (1)(12′)Vmloop(i+ 1, l) + c× (1)(1′2)P(i, l) + Psm + bmini<d≤lWM(i, d− 1) +WM′(d, l)(12)1057.2.ChangestotheCCJrecurrencesWM∗(i, l) = min{WM′(i, l), c× (l − i+ 1)}WM′(i, l) = minWM′(i, l − 1) + c× (1)(12′)mini<d≤l WM(i, d− 1) + WM′(d, l)(12)V (i, l) + β2(i, l) + bP(i, l) + Psm + bWB∗(i, l) = min{WB′(i, l), c′ × (l − i+ 1)}WB′(i, l) = minWB′(i, l − 1) + c′ × (1)(12′)mini<d≤l WB(i, d− 1) + WB′(d, l)(12)V (i, l) + βp2(i, l) + b′ + PpsP(i, l) + Psm + b′ + PpsWP ∗(i, l) = min{WP′(i, l), Pup × (l − i+ 1)}WP′(i, l) = minWP′(i, l − 1) + Pup × (1)(12′)mini<d≤l WP(i, d− 1) + WP′(d, l)(12)V (i, l) + γ2(i, l) + PpsP(i, l) + Psp + Pps1067.2.ChangestotheCCJrecurrencesP ∗L,iloop(i, j, k, l) = min{PL(i+ 1, j − 1, k, l) + estP (i, j)(1′21′G2)mini<d<d′<j(eintP (i, d, d′, j) + PL(d, d′, k, l))(1′′21′′G2)P ∗L,mloop(i, j, k, l) = min{PL,mloop10(i+ 1, j − 1, k, l) + a′ + βp2(j, i)(1′21′G2)PL,mloop01(i+ 1, j − 1, k, l) + a′ + βp2(j, i)(1′21′G2)P i,i+1L,mloop10(i, j, k, l) = min{mini<d≤jWB′(i, d− 1) + PL,mloop00(d, j, k, l)(12G2))mini<d<j PL,mloop10(i, d, k, l) +WB(d+ 1, j)(12G1)P i,i+1L,mloop01(i, j, k, l) = mini≤d<j (PL,mloop00(i, d, k, l) +WB′(d+ 1, j)(12G1))PC,iL,mloop00(i, j, k, l) = minmini<d≤jWB(i, d− 1) + PL,mloop00(d, j, k, l)(12G2))mini≤d<j PL,mloop00(i, d, k, l) +WB(d+ 1, j)(12G1)PL(i, j, k, l) + βp2(j, i)P ∗R,iloop(i, j, k, l) = min{PR(i, j, k + 1, l − 1) + estP (k, l)(1G2′12′)mink<d<d′<l(eintP (k, d, d′, l) + PR(i, j, d, d′))(1G2′′12′′)1077.2.ChangestotheCCJrecurrencesP ∗R,mloop(i, j, k, l) = min{PR,mloop10(i, j, k + 1, l − 1) + a′ + βp2(l, k)(1G2′12′)PR,mloop01(i, j, k + 1, l − 1) + a′ + βp2(l, k)(1G2′12′)P iR,mloop10(i, j, k, l) = min{PR,mloop10(i, j, k + 1, l) + c′(1G2′1)mink<d≤lWB′(k, d− 1) + PR,mloop00(i, j, d, l)(1G21))P iR,mloop01(i, j, k, l) = min{PR,mloop01(i, j, k, l − 1) + c′(1G12′)mink≤d<l PR,mloop00(i, j, k, d) +WB′(d+ 1, l)(1G12)P iR,mloop00(i, j, k, l) = minmink<d≤lWB(k, d− 1) + PR,mloop00(i, j, d, l)(1G21)mink≤d<l PR,mloop00(i, j, k, d) +WB(d+ 1, l)(1G12)PR(i, j, k, l) + βp2(l, k)P ∗M,iloop(i, j, k, l) = min{PM (i, j − 1, k + 1, l) + estP (j − 1, k + 1)(12′G2′1)mini<d<j<k<d′<l(eintP (d, j, k, d′) + PM (i, d, d′, l))(12′′G2′′1)P ∗M,mloop(i, j, k, l) = min{PM,mloop10(i, j − 1, k + 1, l) + a′ + βp2(j, k)(12′G2′1)PM,mloop01(i, j − 1, k + 1, l) + a′ + βp2(j, k)(12′G2′1)P iM,mloop10(i, j, k, l) = min{PM,mloop10(i, j − 1, k, l) + c′(12′G1)mini<d<jWB′(d, j) + PM,mloop00(i, d− 1, k, l)(12G1)1087.2.ChangestotheCCJrecurrencesP iM,mloop01(i, j, k, l) = min{PM,mloop01(i, j, k + 1, l) + c′(1G2′1)mink<d≤l PM,mloop00(i, j, d, l) +WB′(k, d− 1)(1G21)P iM,mloop00(i, j, k, l) = minmini≤d<jWB(d+ 1, j) + PM,mloop00(i, d, k, l)(12G1)mink<d≤l PM,mloop00(i, j, d, l) +WB(k, d− 1)(1G21)PM (i, j, k, l) + βp2(j, k)P ∗O,iloop(i, j, k, l) = min{PO(i+ 1, j, k, l − 1) + estP (i, l)(1′2G21′)mini<d<j<k<d′<l(eintP (i, d, d′, l) + PO(d, j, d′, k))(1′′2G21′′)P ∗O,mloop(i, j, k, l) = min{PO,mloop10(i+ 1, j, k, l − 1) + a′ + βp2(l, i)(1′2G21′)PO,mloop01(i+ 1, j, k, l − 1) + a′ + βp2(l, i)(1′2G21′)P i,i+1O,mloop10(i, j, k, l) = min{mini<d≤jWB′(i, d− 1) + PO,mloop00(d, j, k, l)(12G2)mink<d<l PO,mloop10(i, j, k, d) +WB(d+ 1, l)(1G12)P i,i+1O,mloop01(i, j, k, l) = mink≤d<l (PO,mloop00(i, j, k, d) +WB′(d+ 1, l)) (1G12)PC,iO,mloop00(i, j, k, l) = minmini<d≤jWB(i, d− 1) + PO,mloop00(d, j, k, l)(12G2)mink≤d<l PO,mloop00(i, j, k, d) +WB(d+ 1, l)(1G12)PO(i, j, k, l) + βp2(l, i)1097.3. Sparsification of the CCJ algorithmA complete list of fragment types in the CCJ algorithm is as follows:T CCJ12′ , T CCJ12 , T CCJ1′21′ , T CCJ1′′21′′ , T CCJ1212 , T CCJ1′21′G2, T CCJ1G2′12′ , T CCJ12′G2′1, T CCJ1′2G21′ , T CCJ12G2 ,T CCJ12G1 , T CCJ1G21 , T CCJ1G12 , T CCJ121 , T CCJ1′21′G2, T CCJ1′′21′′G2, T CCJ1′2G2, T CCJ12′G1, T CCJ1G2′′12′′ , T CCJ1G2′1,T CCJ1G12′ , T CCJ1G21 , T CCJ12′G2′1, T CCJ12′′G2′′1, T CCJ1′2G21′ , T CCJ1′′2G21′′ , T CCJ1G12′In these fragment types, 1′ shows a single base (similarly 2′) and 1′′ showsan empty region belonging to an internal loop (similarly 2′′).Since our main goal here is to use sparsification for space optimization,we will only consider cases in which the second fragment spans the gappedregion (i.e., would need a 4D matrix). We defer the sparsification for timereduction as a future work. We will need the following sets of recurrencetypes:T CCJ1212 = {12G1, 1G21}T CCJfromL12G2 = {12G2, 12G1}T CCJfromO12G2 = {12G2, 1G12}T CCJLmloop1012G2 = {12G2, 12G1}T CCJLmloop0012G2 = {12G2, 12G1}T CCJOmloop1012G2 = {12G2, 1G12}T CCJOmloop0012G2 = {12G2, 1G12}To obtain the sparsified recursion equations from the original recur-rences, we need to modify each non-constant recurrence case that maximizesover split fragment (F1, F2) for respective splits of fragment F , to minimizeonly over fragments F2 that are not T CCJT − OD. Such non-T CCJT − ODfragments correspond to the entries of candidate lists. Separate candidatelists are maintained for each sparsified recursion case. A fragment is addedto a candidate list for recursion case T iff it is not T CCJT −OD. Pseudocodeof the sparsified CCJ algorithm is shown in Appendix D. Note that some ofour CCJ recurrences can use the same candidate list, as their F2 fragmentsare similar.7.3 Sparsification of the CCJ algorithmFor each recurrence X that we plan to sparsify, we introduce its correspond-ing sparsified version Xs.From the definition of each recurrence X of CCJ, we derive the defini-tion of the corresponding sparsified recurrence Xs by changing all occurring1107.3. Sparsification of the CCJ algorithmrecurrence Y in X to the corresponding sparsified recurrence Y s. More-over, we introduce the following changes to the sparsified recurrences byrestricting certain minimizations to run only over “candidates”.1. In case 12G2 of PfromL:mini<d≤jWP(i, d− 1) + PfromL(d, j, k, l)add the condition d ∈ cands(PfromL, 12G2, j, k, l) that holds if6 ∃e > d : P sfromL(d, j, k, l) = WP(d, e− 1) + P sfromL(e, j, k, l).2. In case 12G2 of PfromO:mini<d≤jWP(i, d− 1) + PfromO(d, j, k, l)add the condition d ∈ cands(PfromO, 12G2, j, k, l) that holds if6 ∃e > d : P sfromO(d, j, k, l) = WP(d, e− 1) + P sfromO(e, j, k, l).3. In case 1212 of P :mini<j<d<k<lPK(i, j − 1, d+ 1, k − 1) + PK(j, d, k, l)add the condition d ∈ cands(P, 1212, j, k, l) that holds if6 ∃e(j ≤ e < d) : PKs(j, d, k, l) = PKs(j, e, k, l) + WP(e+ 1, d).4. In case 12G2 of PL,mloop10:mini<d≤jWB′(i, d− 1) + PL,mloop00(d, j, k, l)add the condition d ∈ cands(PL,mloop10, 12G2, j, k, l) that holds if6 ∃e : P sL,mloop00(d, j, k, l) = WB(d, e− 1) + P sL,mloop00(e, j, k, l)and6 ∃e : P sL,mloop00(d, j, k, l) = P sL,mloop00(d, e, k, l) + WB(e+ 1, j).1117.3. Sparsification of the CCJ algorithm5. In case 12G2 of PL,mloop00:mini<d≤jWB(i, d− 1) + PL,mloop00(d, j, k, l)add the condition d ∈ cands(PL,mloop00, 12G2, j, k, l) that holds iffd ∈ cands(PL,mloop10, 12G2, j, k, l).6. In case 12G2 of PO,mloop10:mini<d≤jWB′(i, d− 1) + PO,mloop00(d, j, k, l)add the condition d ∈ cands(PO,mloop10, 12G2, j, k, l) that holds if6 ∃e : P sO,mloop00(d, j, k, l) = WB(d, e− 1) + P sO,mloop00(e, j, k, l)and6 ∃e : P sO,mloop00(d, j, k, l) = P sO,mloop00(d, j, k, e) + WB(e+ 1, l).7. In case 12G2 of PO,mloop00:mini<d≤jWB(i, d− 1) + PO,mloop00(d, j, k, l)add the condition d ∈ cands(PO,mloop00, 12G2, j, k, l) that holds iff d ∈cands(PO,mloop10, 12G2, j, k, l).We note that the candidate condition6 ∃e > d : P sfromL(d, j, k, l) = WP(d, e− 1) + P sfromL(e, j, k, l).is equivalent toP sfromL(d, j, k, l) < mine>dWP(d, e− 1) + P sfromL(e, j, k, l),where the minimum can be computed by running only over candidates (as wewill show later). In this form, the check does not cause sparsification over-head since both sides have to be calculated by the algorithm to evaluate therecurrences. The other candidate checks are performed analogously. More-over, since the candidate conditions d ∈ cands(PL,mloop10, 12G2, j, k, l) andd ∈ cands(PL,mloop00, 12G2, j, k, l) are identical, these cases can share thesame candidate list. Similarly we use one candidate list for both PO,mloop10and PO,mloop00.1127.3. Sparsification of the CCJ algorithmTheorem 7.3.1 The sparsified version of each CCJ recurrence is equivalentto the non-sparsified recurrence.Note that for the correctness proof, we rely on certain (inverse) triangleinequalities; for example in case of W we have ∀x < y ≤ z : W (x, z) ≤W (x, y − 1) + W (y, z), which follows from the definition of W . Analogousinequalities hold for WP, WB, and WB.Proof We show for fragments i, l and i, j, k, l by simultaneous induction onthe fragment size (respectively, l − i and j − i + l − k) that all changes inthe definition of sparse CCJ (from the original to the sparsified recurrencesof CCJ) are equivalent, in particular the values of the CCJ-recursions andtheir corresponding sparsified versions are identical for each fragment.It suffices to show the equivalence of the changes of the minimizationcases explicitly. Moreover in each case, it suffices to show that there existsa minimum that is a candidate. By the induction hypothesis, the sparsifiedrecursion for all smaller fragments do not have to be distinguished from thenon-sparsified ones.1. Choose the largest d, i < d ≤ j, s.t.WP(i, d− 1) + PfromL(d, j, k, l)is minimal. We show that d ∈ cands(PfromL, 12G2, j, k, l). Assumingthe opposite, choose e (e > d) such thatPfromL(d, j, k, l) = WP(d, e− 1) + PfromL(e, j, k, l). Now,WP(i, d− 1) + PfromL(d, j, k, l) =WP(i, d− 1) + WP(d, e− 1) + PfromL(e, j, k, l) ≥WP(i, e− 1) + PfromL(e, j, k, l).Finally, this contradicts the choice of d, since e > d andWP(i, e− 1) + PfromL(e, j, k, l) ≤WP(i, d− 1) + PfromL(d, j, k, l).Consequently, d is a candidate.2. Analogous to the last case. Choose the largest d, i < d ≤ j, s.t.WP(i, d− 1) + PfromO(d, j, k, l)is minimal. Assume d 6∈ cands(PfromO, 12G2, j, k, l); then there is ane > d such thatWP(i, d− 1) + PfromO(d, j, k, l) = WP(i, d− 1) +WP(d, e− 1) + PfromO(e, j, k, l) ≥WP(i, e− 1) + PfromO(e, j, k, l),contradicting the choice of d.3. For fixed i < j < k < l, choose the smallest d (j < d < k) s.t.PK(i, j − 1, d+ 1, k− 1) + PK(j, d, k, l) is minimal. Assume (j, d, k) 6∈1137.3. Sparsification of the CCJ algorithmcands(P, 1212, i, l), then for some e, there holds thatPK(i, j − 1, d+ 1, k − 1) + PK(j, d, k, l) =PK(i, j − 1, d+ 1, k − 1) + PK(j, e, k, l) + WP(e+ 1, d) ≥ (∗∗)PK(i, j − 1, e+ 1, k − 1) + PK(j, e, k, l),which contradicts the choice of d. For inequation (**), see thatPK(i, j − 1, d+ 1, k − 1) + WP(e+ 1, d) ≤ PK(i, j − 1, e+ 1, k − 1)by definition of PK.4. Choose the largest d s.t. WB′(i, d−1)+PL,mloop00(d, j, k, l) is minimal.Assume d 6∈ cands(PL,mloop10, 12G2, j, k, l), then there are two cases:• there exists some e s.t.WB′(i, d− 1) + PL,mloop00(d, j, k, l) =WB′(i, d− 1) + WB(d, e− 1) + PL,mloop00(e, j, k, l) ≥WB′(i, e− 1) + PL,mloop00(e, j, k, l);the latter inequation holds due to the definition of WB′. Contra-diction!• there exists some e s.t.PL,mloop00(d, j, k, l) = PL,mloop00(d, e, k, l) + WB(e+ 1, j).In this case, the case 12G1 of PL,mloop10(i, j, k, l) yields a smalleror equal value, sinceWB′(i, d− 1) + PL,mloop00(d, j, k, l) =WB′(i, d− 1) + PL,mloop00(d, e, k, l) + WB(e+ 1, j) ≥PL,mloop10(i, e, k, l) + WB(e+ 1, j).5. Choose the largest d s.t. WB(i, d−1)+PL,mloop00(d, j, k, l) is minimal.Assume d 6∈ cands(PL,mloop00, 12G2, j, k, l), then there are two cases:• there exists some e s.t.WB(i, d− 1) + PL,mloop00(d, j, k, l) =WB(i, d− 1) + WB(d, e− 1) + PL,mloop00(e, j, k, l) ≥WB(i, e− 1) + PL,mloop00(e, j, k, l).Contradiction!• there exists some e s.t.PL,mloop00(d, j, k, l) = PL,mloop00(d, e, k, l) + WB(e+ 1, j).In this case, the case 12G1 of PL,mloop00(i, j, k, l) yields a smalleror equal value, sinceWB(i, d− 1) + PL,mloop00(d, j, k, l) =WB(i, d− 1) + PL,mloop00(d, e, k, l) + WB(e+ 1, j) ≥PL,mloop00(i, e, k, l) + WB(e+ 1, j).1147.4. Space complexity analysis6. The last two cases are analogous to the proofs for PO,mloop10 andPO,mloop00.7.4 Space complexity analysisIn the previous section, we aimed to reduce the space complexity of theCCJ algorithm by sparsifying the 4-dimensional matrices only. The sparseversion of the CCJ will be more space efficient than the original algorithmon a sequence of length n if only a small fraction of all Θ(n4) substrings ofS are OCTs.In the sparse CCJ algorithm we reduced the space requirement of all 4-dimensional matrices in the original CCJ algorithm to one of the followingcases.1. Not saving the whole matrix and calculating its corresponding valuein constant time on the fly. These matrices are presented using a ∗superscript.2. Only saving a few 3-dimensional slices of the matrix. These matricesare presented with i, i, i+1 or i, i+1, · · · , i+MLS as superscript, whereMLS corresponds to the maximum loop size (here 30). Thereforemaximum space size of these matrices are O(n3).3. Only saving a candidate list of possible substrings of the 4-dimensionalmatrices, in which case the space complexity is O(n2×Z), where Z isthe number of OCT substrings of S and n ≤ Z ≤ n2.Therefore in no recurrence of the CCJ algorithm we use more than O(n2Z)space.7.5 SummaryIn this chapter we presented theoretical results on applying sparsificationtechnique to the CCJ algorithm reducing its space complexity. We provideda pseudocode for the sparse version of CCJ and prove that the sparse CCJis identical to the original CCJ algorithm while using less space. We notethat using the same technique we can reduce CCJ’s time complexity and weplan to pursue this as future work.115Chapter 8Conclusions and future workThe main goal of this doctoral thesis is to present novel energy-based meth-ods to alleviate the major shortcomings for prediction of pseudoknottedRNA secondary structures based on the thermodynamic energy model. Theseshortcomings include slow running time, poor prediction accuracy, handlinglimited class of pseudoknots, and limited opportunity for providing struc-tural information as input to the prediction method as prediction guide.To this end, we focus on developing new methods for pseudoknottedsecondary structure prediction, and on comparing these new methods withexisting methods, both in terms of time and memory efficiency and in termsof accuracy. We focus on methods that are based on variants of the hierar-chical folding hypothesis and on minimum free energy prediction. We usethe energy model of HotKnots V2.0, which has already yielded good pre-dictions not only for the HotKnots method but also for methods based onmaximum expected accuracy [83].Since difference in accuracy may not be reliable when size of data setsis small, we used statistically sound methods in comparing performance ofour methods with themselves and some of the already existing methods.We proposed several methods, in particular, Iterative HFold, that isbased on the “relaxed” hierarchical folding hypothesis, and CCJ an MFE-based method for prediction of RNA pseudoknotted secondary structure. Inthe case of the Iterative HFold algorithm, which takes into account an RNAsequence and a pseudoknot-free secondary structure, we considered severalmethods of producing the best input structure. For the CCJ algorithm, weobserved that its practicality is limited by its memory usage. To alleviatethis problem we used a sparsification technique to reduce CCJ’s memoryrequirement.In the following sections, we first provide a summary of our experimentsand results, then discuss possible limitations in our evaluations, and finallypropose possible directions for future research.1168.1. Accuracy measures8.1 Accuracy measuresTo compare performance of our proposed algorithms with themselves andtwo of the best existing methods, we used F-measure, that is harmonicmean of sensitivity and positive predictive value (PPV). Instead of usingaverage F-measure, we used bootstrap 95% percentile confidence interval ofaverage F-measures, that includes more information about the distributionof accuracies. To ensure the significance of the differences in accuracies, weused a two-sided permutation test with < 0.05 p-value significance level.8.2 Robustness comparison of HFold andIterative HFoldTo learn what is the accuracy of HFold and Iterative HFold when partialinformation about the true pseudoknot-free structure, Gbig, is available, wegenerated subsets of Gbig structures for each RNA sequence in the HK-PKand HK-PK-free data sets. Since the slope of the curve is one number thatcaptures robustness, we compared slope of the curve for Iterative HFold’saverage accuracy to that of HFold on all subsets of Gbig.On pseudoknotted structures of the HK-PK data set, when provided with≈ 1% up to ≈ 90% of the Gbig structure as input, Iterative HFold’s boot-strap 95% percentile confidence interval of average F-measures has higheraccuracy than those of HFold. Iterative HFold is most successful when littleinformation about Gbig is known because it can add both pseudoknot-freeand pseudoknotted base pairs.However, when the vast majority of base pairs of Gbig are provided asinput, HFold dominates as it keeps the base pairs of the input structure,thereby often adding base pairs of Gsmall.Comparing the slope of the curve for Iterative HFold’s average accuracyto that of HFold, we observed that HFold’s slope is steeper than that ofIterative HFold, making Iterative HFold more robust than HFold.For pseudoknot-free structures of the HK-PK-free data set, HFold per-forms better than Iterative HFold because it often adds base pairs that donot cross those provided as part of the input, and thus are likely to be inGbig.1178.3. Choosing best input structure8.3 Choosing best input structureWe considered different ways of choosing input structure for the HFold andIterative HFold algorithms when partial information about Gbig is not avail-able.For pseudoknotted structures, we found that HotKnots hotspots as in-put is far superior to the MFE structure produced by SimFold, for bothHFold and Iterative HFold. This appears to be because the MFE structurespredicted by SimFold tend to have more base pairs than the true pseudoknot-free structure Gbig, so that HFold and Iterative HFold are unlikely to addpseudoknotted base pairs to the input structure. For pseudoknot-free struc-tures, differences in accuracies of using SimFold as input and HotKnotshotspots is not significant.We further investigated using the first 50 suboptimal structures pro-duced by SimFold (including the MFE structure) versus HotKnots hotspotsas input to both HFold and Iterative HFold. Although the bootstrap 95%percentile confidence intervals for average F-measures of HFold and Itera-tive HFold with first 50 suboptimal structures versus hotspots are slightlydifferent, the permutation test indicates that the difference is not significant.8.4 Accuracy comparisonWe compared accuracy of Iterative HFold with hotspots as input, with thatof CCJ, HotKnots V2.0 and IPknot. The differences in accuracies of IterativeHFold, CCJ and HotKnots V2.0 are insignificant on three of our data sets,namely the HK-PK (HK-PK* when comparing with CCJ), HK-PK-free andDK-pk16 (DK-pk16* when comparing with CCJ), but Iterative HFold andCCJ are significantly better than HotKnots V2.0 on the fourth data set,namely the IP-pk168 data set.The differences in accuracies of Iterative HFold, CCJ and IPknot areinsignificant on two of our data sets, namely the HK-PK-free and DK-pk16(DK-pk16* when comparing with CCJ), but Iterative HFold and CCJ aresignificantly better than IPknot on the HK-PK (HK-PK* when comparingwith CCJ) and IP-pk168 data sets.8.5 Time and memory usage comparisonWe compared running time of Iterative HFold with that of HotKnots V2.0and IPknot. Due to extensive memory requirement of the CCJ algorithm,1188.6. Merits and shortcomings of each methodits run time is slower than that of Iterative HFold, HotKnots and IPknot.HotKnots is faster than Iterative HFold on short sequences (up to 47bases), while Iterative HFold starts being faster than HotKnots. IterativeHFold runs in less than 8.3 seconds for all RNA sequences in this dataset whereas HotKnots runs for over 6000 seconds (about 1.7 hours) on thelongest RNA sequence in our data set. IPknot is significantly faster thanboth HFold and Iterative HFold. For all sequences in this data set, IPknotproduces output in less than 0.8 seconds.HotKnots’ memory consumption can vary significantly from one sequenceto another, and is not predictable. It uses as little as 16.53 MB for an RNAof length 30 bases (LP-PK1) and as much as 91.23 GB (or 93419 MB) forthe longest RNA sequence in the HK-PK data set. Iterative HFold’s mem-ory usage is very similar to HFold’s and increases at a very low rate by thelength of the RNA sequence. It starts from 48.69 MB for RNA sequencesof length 26 and increases to 61.33 MB for the longest RNA sequence inthis data set (400 bases long). IPknot uses much less memory than all othermethods. For the longest RNA sequence in this data set, IPknot uses lessthan 5.5 MB of memory in comparison to 61.33 MB of HFold and IterativeHFold and 91.23 GB of HotKnots. CCJ on the other hand uses 119.72 MBof memory for the shortest sequence and increases so much that the softwarewe used to find the heap usage, Valgrind, runs out of memory and is unableto provide the heap usage for the RNA sequence of length 195.8.6 Merits and shortcomings of each methodIn this section we highlight merits and shortcomings of each method basedon the results presented in this work.8.6.1 HotKnotsOne of the advantages of CCJ over HotKnots is that CCJ guarantees theminimum free energy structure as its output structure, while being a heuris-tic HotKnots is unable to guarantee this.One potential advantage of HFold, and Iterative HFold over HotKnotsis that HFold and Iterative HFold minimize the free energy of the possiblypseudoknotted output structure relative to the given input structure. There-fore HFold and Iterative HFold’s method of adding pseudoknotted stems isbetter motivated energetically than HotKnots.HotKnots advantage over both CCJ and Iterative HFold is its energymodel and parameters. However, HotKnots accuracy is comparable to that1198.6. Merits and shortcomings of each methodof Iterative HFold and CCJ on three of our data sets, namely DK-pk16(DK-pk16*), HK-PK (HK-PK*) and HK-PK-free.8.6.2 IPknotA disadvantage of IPknot over Iterative HFold and CCJ is that to get thebest prediction, the user needs to provide some guidance as to what type ofstructure to predict for the given sequence, e.g., whether pseudoknot-free orpseudoknotted.IPknot’s advantage over CCJ and Iterative HFold is that it uses expectedbase pair probabilities to find the structure with the maximum expectedprobability for a given RNA molecule. IPknot, is much faster and uses lessmemory than all other methods compared in this work.8.6.3 HFoldA disadvantage of HFold over Iterative HFold and CCJ, is that if the giveninput structure is not accurate, HFold is unable to predict the correct outputstructure. However, HFold is faster and uses less memory than both IterativeHFold and CCJ, and has comparable accuracy to that of Iterative HFold andCCJ on all of our data sets.8.6.4 Iterative HFoldWhile CCJ and Iterative HFold’s performance is not significantly differenton our data sets, Iterative HFold is considerably less expensive to run thanCCJ. While the density-2 class of structures is quite general, it does notinclude some of the biologically important motifs that CCJ can handle.For example, Iterative HFold cannot handle the ABCBDEDCAFFE motif(simplified motif of the SAM-IV riboswitch).8.6.5 CCJOne advantage of CCJ over Iterative HFold is that it predicts the minimumfree energy structure for a given RNA molecule within the general class ofTGB structures. However, by predicting the structure with the minimumfree energy, the CCJ algorithm does not outperform Iterative HFold.1208.7. Comparing the folding hypotheses8.7 Comparing the folding hypothesesThrough our extensive comparisons, we aimed to investigate how a methodthat is purely based on MFE performs versus a method that strictly adheresto the hierarchical folding hypothesis and a method that is based on the“relaxed” hierarchical folding hypothesis.We observed that predicting the structure with the minimum free energydoes not seem to be directly correlated with finding the structure with thehighest accuracy among all methods.We also noticed that by following the relaxed hierarchical folding hy-pothesis, hence allowing changes to the input structure, Iterative HFoldmore often finds structures with lower energy than HFold in all of our datasets. However this does not result in significant improvement in predictionaccuracy compared to HFold.On long RNA sequences in the HK-PK-free data set, the MFE structurespredicted by the CCJ algorithm were pseudoknotted, suggesting that theenergy parameters encourage addition of pseudoknots. Therefore, parameterestimation specifically for CCJ would be useful. Addition of pseudoknots inHFold’s predictions is the least among HFold, CCJ and Iterative HFold. Thisis mainly due to the fact that hotspots are part of the MFE structure andHFold is incapable of adding both pseudoknotted and pseudoknot-free basepairs. Iterative HFold’s predictions are mostly similar to that of HFold’son the RNA sequences of the HK-PK-free data set. However, in a fewcases, especially in longer RNA strands Iterative HFold’s predictions arealso pseudoknotted.Although Iterative HFold can generally modify the input structure, itbecomes unable to modify the input structure, in cases where the inputstructure is overcrowded with base pairs.Finally we found that on our data sets the relaxed hierarchical foldinghypothesis, in which the given input structure only guides the prediction ofthe final structure and may not necessarily be part of the final structure,as the hypothesis with most potential for predicting RNA pseudoknottedsecondary structure.8.8 Energy model limitationComparing accuracy of the minimum free energy structures with the maxi-mum accuracy structures in this work, we found that, on average, the mini-mum free energy structure has significantly poorer F-measure than the max-1218.9. Data and accuracy measure limitationimum accuracy structure. This suggests that an improved energy model forpseudoknotted structure prediction may improve accuracy of prediction al-gorithms.8.9 Data and accuracy measure limitationTo calculate F-measure we rely heavily on correctness of known RNA struc-tures in the publicly available data sets. However, we observed in somecases that the known structure contained long regions of unpaired bases,while based on our prediction methods, the region may fold into low en-ergy structure, thus, stabilizing the overall structure. On the other hand,we know that biological RNA molecules do not fold in isolation, which mayaffect availability of these unpaired regions. In our prediction methods, how-ever, we assumed that RNA molecule folds in isolation without interactionwith other molecules.8.10 Possible future directionsOne possible direction for future research is estimating energy parametersof Iterative HFold and CCJ, and comparing the two hypotheses once againwhen each method performs its best. Andronescu et al. achieved on averageabout 7% prediction accuracy improvement over the standard energy modelfor pseudoknot-free structures using their constraint generation method [9].They further used their constraint generation program to estimate pseu-doknotted energy parameters of HotKnots V2.0, and achieved an averageof 11% accuracy improvement over the energy model of Dirks and Pierce[33]. We therefore believe improved prediction accuracy may be achievedby estimating the energy parameters for Iterative HFold (i.e., HFold), andCCJ algorithms using the constraint generation program of Andronescu etal. [9–11]. One possible direction for energy parameter estimation is to usethe pseudoknot-free energy parameters of Andronescu et al. [9, 10], andto focus on estimating the pseudoknotted parameters of the energy modelsonly.Zakov et al. [103] used machine learning techniques and rich parameter-ization of the energy model and achieved about 19% accuracy improvementover the energy model of Andronescu et al. [10]. They based their struc-tural features at the features in the standard energy model (i.e., Turner99 energy model), but taking into account portions of larger structuraland sequential context, decomposed the score of each structural element1228.10. Possible future directionsto many fine-grained local features, resulting in overall about 70000 param-eters. Therefore another possible direction is to explore the effect of suchrich energy parameterization on the performance of our Iterative HFold andCCJ methods.Alternatively, we could incorporate the pseudoknotted energy parame-ters of a better energy model, for example the energy model of Cao andChen [18, 19] to obtain better results.An alternative approach for computationally producing input structuresfor Iterative HFold that may be worth exploring for future research would beto use the most highly probable base pairs, as calculated using the partitionfunction [61]. Even better may be to calculate base pair probabilities forbase pairs of pseudoknotted RNA structures; however this requires Θ(n5)time. Since HFold finds minimum free energy structure in O(n3) time, con-ditional on the given input structure, another direction for future researchcan be investigating ways to develop a O(n3)-time partition function versionof HFold that can produce pseudoknotted base pair probabilities that areconditional on the given input structure.Following the recent work of Aghaeepour and Hoos [3], one can investi-gate ensemble based prediction of RNA pseudoknotted secondary structures.Aghaeepour and Hoos introduced a generic RNA secondary structure predic-tion procedure, called AveRNA, that takes an RNA sequence as input, andproduces one final structure by using an ensemble of existing RNA secondarystructure prediction methods. AveRNA first runs its underlying methodsto obtain a set of structures, then combines these structures on a per-base-pair-basis to produce the final structure. Aghaeepour and Hoos evaluatedtheir method for prediction of RNA pseudoknot-free secondary structuresand achieved about 1.3% accuracy improvement on the RNA STRANDdataset [7] over the best energy model of Andronescu et al. [10], namelyBL-FR. Currently, the underlying ensemble of AveRNA consists of RNApseudoknot-free secondary structure prediction methods. It would be worthexploring to use different ensembles of RNA pseudoknotted secondary struc-ture prediction methods (including Iterative HFold and CCJ), and possiblya mixture of both pseudoknot-free and pseudoknotted prediction methodsand evaluate performance of AveRNA with that of best existing methodson a large data set of RNA structures such as RNA STRAND.123Bibliography[1] Identification and analysis of functional elements in 1% of the humangenome by the ENCODE pilot project. Nature, 447(7146):799–816,June 2007.[2] Nimo M. Abdi and Kurt Fredrick. Contribution of 16S rRNA nu-cleotides forming the 30S subunit a and p sites to translation in es-cherichia coli. RNA, 11(11):1624–1632, November 2005.[3] Nima Aghaeepour and Holger Hoos. Ensemble-based prediction ofRNA secondary structures. BMC Bioinformatics, 14(1):139+, April2013.[4] Cagri Aksay, Raheleh Salari, Emre Karakoc, Can Alkan, and S. CenkSahinalp. taveRNA: a web suite for RNA algorithms and applications.Nucleic acids research, 35(Web Server issue), July 2007.[5] T. Akutsu. Dynamic programming algorithms for RNA secondarystructure prediction with pseudoknots. Disc. App. Math., 104(1-3):45–62, 2000.[6] S. L. Alam, J. F. Atkins, and R. F. Gesteland. Programmed ribosomalframeshifting: Much ado about knotting! PNAS, 96(25):14177–14179,1999.[7] Mirela Andronescu, Vera Bereg, Holger H. Hoos, and Anne Condon.RNA STRAND: The RNA secondary structure and statistical analysisdatabase. BMC Bioinformatics, 9(1):340+, August 2008.[8] Mirela Andronescu, Zhi Chuan, and Anne Condon. Secondary struc-ture prediction of interacting RNA molecules. Journal of molecularbiology, 345(5):987–1001, February 2005.[9] Mirela Andronescu, Anne Condon, Holger H. Hoos, David H. Math-ews, and Kevin P. Murphy. Efficient parameter estimation for RNAsecondary structure prediction. Bioinformatics, 23(13):i19–i28, July2007.124Bibliography[10] Mirela Andronescu, Anne Condon, Holger H. Hoos, David H. Math-ews, and Kevin P. Murphy. Computational approaches for RNA energyparameter estimation. RNA, 16(12):2304–2318, December 2010.[11] Mirela S. Andronescu, Cristina Pop, and Anne E. Condon. Improvedfree energy parameters for RNA pseudoknotted secondary structureprediction. RNA (New York, N.Y.), 16(1):26–42, January 2010.[12] Rolf Backofen, Dekel Tsur, Shay Zakov, and Michal Ziv-Ukelson.Sparse RNA folding: Time and space efficient algorithms. Journalof Discrete Algorithms, 9(1):12–31, March 2011.[13] Maximillian H. Bailor, Xiaoyan Sun, and Hashim M. Al-Hashimi.Topology links RNA secondary structure with global conformation,dynamics, and adaptation. Science, 327(5962):202–206, January 2010.[14] Mini Balakrishnan, Philip J. Fay, and Robert A. Bambara. The kissinghairpin sequence promotes recombination within the HIV-i 5’ leaderregion. J. Biol. Chem., 276(39):36482–36492, September 2001.[15] Stanislav Bellaousov and David H. Mathews. ProbKnot: fast predic-tion of RNA secondary structure including pseudoknots. RNA (NewYork, N.Y.), 16(10):1870–1880, October 2010.[16] Stephan Bernhart, Ivo Hofacker, Sebastian Will, Andreas Gruber,and Peter Stadler. RNAalifold: improved consensus structure predic-tion for RNA alignments. BMC Bioinformatics, 9(1):474+, November2008.[17] Deepika Calidas, Hiram Lyon, and Gloria M. Culver. The n-terminalextension of s12 influences small ribosomal subunit assembly in es-cherichia coli. RNA, January 2014.[18] Song Cao and Shi-Jie Chen. Predicting RNA pseudoknot folding ther-modynamics. Nucleic Acids Research, 34(9):2634–2652, 2006.[19] Song Cao and Shi-Jie Chen. Predicting structures and stabilities forh-type pseudoknots with interhelix loops. RNA, 15(4):696–706, April2009.[20] Song Cao and Shi-Jie Chen. Physics-Based De Novo Prediction ofRNA 3D Structures. J. Phys. Chem. B, 115(14):4216–4226, March2011.125Bibliography[21] Kung-Yao Chang and Ignacio Tinoco. The structure of an RNA kissinghairpin complex of the HIV TAR hairpin loop and its complement.Journal of Molecular Biology, 269(1):52–66, May 1997.[22] Ho-Lin L. Chen, Anne Condon, and Hosna Jabbari. An o(n(5)) algo-rithm for MFE prediction of kissing hairpins and 4-chains in nucleicacids. Journal of computational biology : a journal of computationalmolecular cell biology, 16(6):803–815, June 2009.[23] Samuel S. Cho, David L. Pincus, and D. Thirumalai. Assembly mech-anisms of RNA pseudoknots are determined by the stabilities of con-stituent secondary structures. Proceedings of the National Academy ofSciences, 106(41):17349–17354, October 2009.[24] A. Condon and H. Jabbari. Computational prediction of nucleic acidsecondary structure: Methods, applications, and challenges. Theoret-ical Computer Science, 410(4-5):294–301, February 2009.[25] The FANTOM Consortium, P. Carninci, T. Kasukawa, S. Katayama,J. Gough, M. C. Frith, N. Maeda, R. Oyama, T. Ravasi, B. Lenhard,C. Wells, R. Kodzius, K. Shimokawa, V. B. Bajic, S. E. Brenner,S. Batalov, A. R. R. Forrest, M. Zavolan, M. J. Davis, L. G. Wilming,V. Aidinis, J. E. Allen, A. Ambesi-Impiombato, R. Apweiler, R. N. At-uraliya, T. L. Bailey, M. Bansal, L. Baxter, K. W. Beisel, T. Bersano,H. Bono, A. M. Chalk, K. P. Chiu, V. Choudhary, A. Christof-fels, D. R. Clutterbuck, M. L. Crowe, E. Dalla, B. P. Dalrymple,B. de Bono, G. Della Gatta, D. di Bernardo, T. Down, P. Engstrom,M. Fagiolini, G. Faulkner, C. F. Fletcher, T. Fukushima, M. Furuno,S. Futaki, M. Gariboldi, P. Georgii-Hemming, T. R. Gingeras, T. Go-jobori, R. E. Green, S. Gustincich, M. Harbers, Y. Hayashi, T. K.Hensch, N. Hirokawa, D. Hill, L. Huminiecki, M. Iacono, K. Ikeo,A. Iwama, T. Ishikawa, M. Jakt, A. Kanapin, M. Katoh, Y. Kawa-sawa, J. Kelso, H. Kitamura, H. Kitano, G. Kollias, S. P. T. Krish-nan, A. Kruger, S. K. Kummerfeld, I. V. Kurochkin, L. F. Lareau,D. Lazarevic, L. Lipovich, J. Liu, S. Liuni, S. McWilliam, Madan M.Babu, M. Madera, L. Marchionni, H. Matsuda, S. Matsuzawa, H. Miki,F. Mignone, S. Miyake, K. Morris, S. Mottagui-Tabar, N. Mul-der, N. Nakano, H. Nakauchi, P. Ng, R. Nilsson, S. Nishiguchi,S. Nishikawa, F. Nori, O. Ohara, Y. Okazaki, V. Orlando, K. C. Pang,W. J. Pavan, G. Pavesi, G. Pesole, N. Petrovsky, S. Piazza, J. Reed,J. F. Reid, B. Z. Ring, M. Ringwald, B. Rost, Y. Ruan, S. L. Salzberg,126BibliographyA. Sandelin, C. Schneider, C. Scho¨nbach, K. Sekiguchi, C. A. M. Sem-ple, S. Seno, L. Sessa, Y. Sheng, Y. Shibata, H. Shimada, K. Shimada,D. Silva, B. Sinclair, S. Sperling, E. Stupka, K. Sugiura, R. Sultana,Y. Takenaka, K. Taki, K. Tammoja, S. L. Tan, S. Tang, M. S. Taylor,J. Tegner, S. A. Teichmann, H. R. Ueda, E. van Nimwegen, R. Ve-rardo, C. L. Wei, K. Yagi, H. Yamanishi, E. Zabarovsky, S. Zhu,A. Zimmer, W. Hide, C. Bult, S. M. Grimmond, R. D. Teasdale, E. T.Liu, V. Brusic, J. Quackenbush, C. Wahlestedt, J. S. Mattick, D. A.Hume, RIKEN Genome Exploration Research Group, Genome Sci-ence Group Genome Network Project Core Group, C. Kai, D. Sasaki,Y. Tomaru, S. Fukuda, M. Kanamori-Katayama, M. Suzuki, J. Aoki,T. Arakawa, J. Iida, K. Imamura, M. Itoh, T. Kato, H. Kawaji,N. Kawagashira, T. Kawashima, M. Kojima, S. Kondo, H. Konno,K. Nakano, N. Ninomiya, T. Nishio, M. Okada, C. Plessy, K. Shibata,T. Shiraki, S. Suzuki, M. Tagami, K. Waki, A. Watahiki, Y. Okamura-Oho, H. Suzuki, J. Kawai, and Y. Hayashizaki. The transcriptionallandscape of the mammalian genome. Science, 309(5740):1559–1563,September 2005.[26] Ke´vin Darty, Alain Denise, and Yann Ponty. VARNA: Interactivedrawing and editing of the RNA secondary structure. Bioinformatics,25(15):1974–1975, August 2009.[27] Katherine E. Deigan, Tian W. Li, David H. Mathews, and Kevin M.Weeks. Accurate SHAPE-directed RNA structure determination. Pro-ceedings of the National Academy of Sciences, 106(1):97–102, January2009.[28] B. A. L. M. Deiman and C. W. A. Pleij. Pseudoknots: A vital featurein viral RNA. Seminars in Virol., 8(3):166–175, 1997.[29] C. Dennis. The brave new world of RNA. Nature, 418(6894):122–124,July 2002.[30] Svetlana Deryusheva and Joseph G. Gall. Novel small cajal-body-specific RNAs identified in drosophila: probing guide RNA function.RNA, 19(12):1802–1814, December 2013.[31] Feng Ding, Shantanu Sharma, Poornima Chalasani, Vadim V. Demi-dov, Natalia E. Broude, and Nikolay V. Dokholyan. Ab initio RNAfolding by discrete molecular dynamics: From structure prediction tofolding mechanisms. RNA, 14(6):1164–1173, June 2008.127Bibliography[32] Ye Ding, Chi Yu, and Charles E. Lawrence. Sfold web server forstatistical folding and rational design of nucleic acids. Nucleic acidsresearch, 32(Web Server issue), July 2004.[33] R. M. Dirks and N. A. Pierce. A partition function algorithm fornucleic acid secondary structure including pseudoknots. J. Comput.Chem., 24(13):1664–1677, October 2003.[34] Robert M. Dirks and Niles A. Pierce. Triggered amplification by hy-bridization chain reaction. Proceedings of the National Academy ofSciences of the United States of America, 101(43):15275–15278, Octo-ber 2004.[35] Chuong B. Do, Daniel A. Woods, and Serafim Batzoglou. CON-TRAfold: RNA secondary structure prediction without physics-basedmodels. Bioinformatics, 22(14):e90–e98, July 2006.[36] Tom Fawcett. An introduction to ROC analysis. Pattern RecognitionLetters, 27(8):861–874, June 2006.[37] Brice Felden. RNA structure: experimental analysis. Current Opinionin Microbiology, 10(3):286–291, June 2007.[38] David P. Gardner, Pengyu Ren, Stuart Ozer, and Robin R. Gutell.Statistical potentials for hairpin and internal loops improve the accu-racy of the predicted RNA structure. Journal of molecular biology,413(2):473–483, October 2011.[39] Wolfgang Gerlach and Robert Giegerich. GUUGle: a utility for fastexact matching under RNA complementary rules including GU basepairing. Bioinformatics, 22(6):762–764, March 2006.[40] Christine E. Hajdin, Stanislav Bellaousov, Wayne Huggins, Christo-pher W. Leonard, David H. Mathews, and Kevin M. Weeks. Accu-rate SHAPE-directed RNA secondary structure modeling, includingpseudoknots. Proceedings of the National Academy of Sciences of theUnited States of America, 110(14):5498–5503, April 2013.[41] Monir Hajiaghayi, Anne Condon, and Holger Hoos. Analysis of energy-based algorithms for RNA secondary structure prediction. BMCBioinformatics, 13(1):22+, February 2012.128Bibliography[42] Benjamin J. Hale, Cai-Xia Yang, and Jason W. Ross. Small RNAregulation of reproductive function. Mol. Reprod. Dev., 81(2):148–159, February 2014.[43] Tiffany L. Halo, Kaylin M. McMahon, Nicholas L. Angeloni, YilinXu, Wei Wang, Alyssa B. Chinen, Dmitry Malin, Elena Strekalova,Vincent L. Cryns, Chonghui Cheng, Chad A. Mirkin, and C. ShadThaxton. NanoFlares for the detection, isolation, and culture of livetumor cells from human blood. Proceedings of the National Academyof Sciences, pages 201418637+, November 2014.[44] Michiaki Hamada, Kengo Sato, and Kiyoshi Asai. Improving the ac-curacy of predicting secondary structure for aligned RNA sequences.Nucleic acids research, 39(2):393–402, January 2011.[45] Arif O. Harmanci, Gaurav Sharma, and David H. Mathews. Tur-boFold: iterative probabilistic estimation of secondary structures formultiple RNA sequences. BMC bioinformatics, 12:108+, April 2011.[46] Y. He, T. Ye, M. Su, C. Zhang, A. Ribbe, W. Jiang, and C. Mao. Hi-erarchical self-assembly of DNA into symmetric supramolecular poly-hedra. Nature, 452:198–201, 2008.[47] T. Hesterberg, S. Monaghan, D. S. Moore, A. Cipson, and R. Epstein.Bootstrap methods and permutation tests, chapter 18. W. H. Freemanand Company, New York, 2005.[48] I. L. Hofacker, W. Fontana, P. F. Stadler, L. S. Bonhoeffer, M. Tacker,and P. Schuster. Fast folding and comparison of rna secondary struc-tures. Monatshefte fu¨r Chemie / Chemical Monthly, 125(2):167–188–188, February 1994.[49] Ivo L. Hofacker. RNA secondary structure analysis using the ViennaRNA package. Current protocols in bioinformatics / editoral board,Andreas D. Baxevanis ... [et al.], Chapter 12, February 2004.[50] Christine E. Holt and Erin M. Schuman. The central dogma decen-tralized: New perspectives on RNA function and local translation inneurons. Neuron, 80(3):648–657, October 2013.[51] Chun-Hsiang Huang, Chin L. Lu, and Hsien-Tai Chiu. A heuris-tic approach for detecting RNA h-type pseudoknots. Bioinformatics,21(17):3501–3508, September 2005.129Bibliography[52] H. Jabbari, A. Condon, and S. Zhao. Novel and efficient RNA sec-ondary structure prediction using hierarchical folding. J Comput Biol,15(2):139–163, March 2008.[53] Hosna Jabbari and Anne Condon. A fast and robust iterative al-gorithm for prediction of RNA pseudoknotted secondary structures.BMC Bioinformatics, 15(1):147+, 2014.[54] G. F. Joyce. Directed evolution of nucleic acid enzymes. Annu RevBiochem, 73:791–836, 2004.[55] Bjarne Knudsen and Jotun Hein. Pfold: RNA secondary structureprediction using stochastic context-free grammars. Nucleic acids re-search, 31(13):3423–3428, July 2003.[56] K. Lee, S. Varma, J. Santalucia, and P. R. Cunningham. In vivodetermination of RNA structure-function relationships: analysis ofthe 790 loop in ribosomal RNA. J. Mol. Biol., 269(5):732–743, 1997.[57] Zhi John J. Lu, Jason W. Gloor, and David H. Mathews. ImprovedRNA secondary structure prediction by maximizing expected pair ac-curacy. RNA (New York, N.Y.), 15(10):1805–1813, October 2009.[58] R. B. Lyngsø. Complexity of pseudoknot prediction in simple models.In Josep Dı´az, Juhani Karhuma¨ki, Arto Lepisto¨, and Donald Sannella,editors, ICALP, volume 3142 of Lecture Notes in Computer Science,pages 919–931. Springer, 2004.[59] R. B. Lyngsø and C. N. Pedersen. RNA pseudoknot prediction inenergy-based models. J. Comput. Biol., 7(3-4):409–427, 2000.[60] Thomas J. Macke, David J. Ecker, Robin R. Gutell, Daniel Gautheret,David A. Case, and Rangarajan Sampath. RNAMotif, an RNA sec-ondary structure definition and search algorithm. Nucleic Acids Re-search, 29(22):4724–4735, November 2001.[61] D. H. Mathews. Using an RNA secondary structure partition func-tion to determine confidence in base pairs predicted by free energyminimization. RNA, 10(8):1178–1190, 2004.[62] D. H. Mathews. Predicting RNA secondary structure by free energyminimization. Theor. Chem. Acc.: Theory, Computation, and Model-ing (Theoretica Chimica Acta), pages 1–9, 2006.130Bibliography[63] D. H. Mathews, J. Sabina, M. Zuker, and D. H. Turner. Expanded se-quence dependence of thermodynamic parameters improves predictionof RNA secondary structure,. J. Mol. Biol., 288(5):911–940, 1999.[64] David H. Mathews, Matthew D. Disney, Jessica L. Childs, Susan J.Schroeder, Michael Zuker, and Douglas H. Turner. Incorporatingchemical modification constraints into a dynamic programming al-gorithm for prediction of RNA secondary structure. Proceedings ofthe National Academy of Sciences of the United States of America,101(19):7287–7292, May 2004.[65] John S. Mattick and Igor V. Makunin. Non-coding RNA. HumanMolecular Genetics, 15(suppl 1):R17–R29, April 2006.[66] W.J. Melchers, J.G. Hoenderop, H.J. Bruins Slot, C.W. Pleij, E.V.Pilipenko, V.I. Agol, and J.M. Galama. Kissing of the two predomi-nant hairpin loops in the coxsackie B virus 3’ untranslated region isthe essential structural feature of the origin of replication required fornegative-strand RNA synthesis. J. Virol., 71(1):686–696, 1997.[67] Mathias Mohl, Raheleh Salari, Sebastian Will, Rolf Backofen, andS. Cenk Sahinalp. Sparsification of RNA structure prediction includingpseudoknots. Algorithms for Molecular Biology, 5(1):39+, December2010.[68] Nicholas Nethercote and Julian Seward. Valgrind: a frameworkfor heavyweight dynamic binary instrumentation. SIGPLAN Not.,42(6):89–100, June 2007.[69] R. Nussinov and A. B. Jacobson. Fast algorithm for predicting the sec-ondary structure of single-stranded RNA. Proceedings of the NationalAcademy of Sciences of the United States of America, 77(11):6309–6313, November 1980.[70] Aleksey Y. Ogurtsov, Svetlana A. Shabalina, Alexey S. Kondrashov,and Mikhail A. Roytberg. Analysis of internal loops within theRNA secondary structure in almost quadratic time. Bioinformatics,22(11):1317–1324, June 2006.[71] Marc Parisien and Francois Major. The MC-Fold and MC-Sym pipeline infers RNA structure from sequence data. Nature,452(7183):51–55, March 2008.131Bibliography[72] Jeff R. Proctor and Irmtraud M. Meyer. CoFold: an RNA secondarystructure prediction method that takes co-transcriptional folding intoaccount. Nucleic Acids Research, 41(9):e102, March 2013.[73] Tomasz Puton, Lukasz P. Kozlowski, Kristian M. Rother, andJanusz M. Bujnicki. CompaRNA: a server for continuous benchmark-ing of automated methods for RNA secondary structure prediction.Nucleic Acids Research, 41(7):4307–4323, April 2013.[74] B. Rastegari and A. Condon. Parsing nucleic acid pseudoknottedsecondary structure: algorithm and applications. J Comput Biol,14(1):16–32, 2007.[75] J. Reeder and R. Giegerich. Design, implementation and evaluation ofa practical pseudoknot folding algorithm based on thermodynamics.BMC Bioinformatics, 5, 2004.[76] Jihong Ren, Baharak Rastegari, Anne Condon, and Holger H. Hoos.Hotknots: Heuristic prediction of rna secondary structures includingpseudoknots. RNA, 11(10):1494–1504, October 2005.[77] E. Rivas and S. R. Eddy. A dynamic programming algorithm forRNA structure prediction including pseudoknots. J. Mol. Biol.,285(5):2053–2068, 1999.[78] P. W. Rothemund, N Papadakis, and E. Winfree. Algorithmic self-assembly of DNA Sierpinski triangles. PLoS Biology, 2(12):e431, 2004.[79] Paul W. K. Rothemund. Folding DNA to create nanoscale shapes andpatterns. Nature, 440(7082):297–302, March 2006.[80] Jianhua Ruan, Gary D. Stormo, and Weixiong Zhang. An Iteratedloop matching approach to the prediction of RNA secondary structureswith pseudoknots. Bioinformatics, 20(1):58–66, January 2004.[81] Raheleh Salari, Mathias Mo¨hl, Sebastian Will, S. Sahinalp, and RolfBackofen. Time and Space Efficient RNA-RNA Interaction Predic-tion via Sparse Folding. In Bonnie Berger, editor, Research in Com-putational Molecular Biology, volume 6044 of Lecture Notes in Com-puter Science, chapter 31, pages 473–490. Springer Berlin / Heidelberg,Berlin, Heidelberg, 2010.132Bibliography[82] Ashesh A. Saraiya, Tek N. Lamichhane, Christine S. Chow, John San-taLucia, and Philip R. Cunningham. Identification and role of func-tionally important motifs in the 970 loop of escherichia coli 16S ribo-somal RNA. Journal of Molecular Biology, 376(3):645–657, February2008.[83] Kengo Sato, Yuki Kato, Michiaki Hamada, Tatsuya Akutsu, andKiyoshi Asai. IPknot: fast and accurate prediction of RNA secondarystructures with pseudoknots using integer programming. Bioinformat-ics, 27(13):i85–i93, July 2011.[84] Stefan E. Seemann, Jan Gorodkin, and Rolf Backofen. Unifying evolu-tionary and thermodynamic information for RNA folding of multiplealignments. Nucleic acids research, 36(20):6355–6362, November 2008.[85] F. C. Simmel and W. U. Dittmer. DNA nanodevices. Small, 1(3):284–299, 2005.[86] Jana Sperschneider and Amitava Datta. KnotSeeker: Heuristic pseu-doknot detection in long RNA sequences. RNA, 14(4):630–640, April2008.[87] Jana Sperschneider and Amitava Datta. DotKnot: pseudoknot pre-diction using the probability dot plot under a refined energy model.Nucleic Acids Research, 38(7):e103+, April 2010.[88] Jana Sperschneider, Amitava Datta, and Michael J. Wise. HeuristicRNA pseudoknot prediction including intramolecular kissing hairpins.RNA, 17(1):27–38, January 2011.[89] Jana Sperschneider, Amitava Datta, and Michael J. Wise. Predictingpseudoknotted structures across two RNA sequences. Bioinformatics(Oxford, England), 28(23):3058–3065, December 2012.[90] Junilda Spirollari, Jason T. Wang, Kaizhong Zhang, Vivian Bellofatto,Yongkyu Park, and Bruce A. Shapiro. Predicting consensus structuresfor RNA alignments via pseudo-energy minimization. Bioinformaticsand biology insights, 3:51–69, 2009.[91] D. W. Staple and S. E. Butcher. Pseudoknots: RNA structures withdiverse functions. PLoS Biol., 3(6), 2005.133Bibliography[92] Peter Steffen, Bjo¨rn Voß, Marc Rehmsmeier, Jens Reeder, and RobertGiegerich. RNAshapes: an integrated RNA analysis package based onabstract shapes. Bioinformatics, 22(4):500–503, February 2006.[93] RDC Team. The r project for statistical computing. Vienna, Austria:R Foundation for, 309, 2011.[94] I. Tinoco and C. Bustamante. How RNA folds. J. Mol. Biol.,293(2):271–281, 1999.[95] Y. Uemura, A. Hasegawa, S. Kobayashi, and T. Yokomori. Tree ad-joining grammars for RNA structure prediction. Theor. Comput. Sci.,210(2):277–303, 1999.[96] F. H. D. van Batenburg, A. P. Gultyaev, and C. W. A. Pleij. Pseu-doBase: structural information on RNA pseudoknots. Nucleic AcidsResearch, 29(1):194–195, January 2001.[97] H. Varian. Bootstrap tutorial. Mathematica Journal, 9(4):768–775,2005.[98] Z. Weinberg, E. E. Regulski, M. C. Hammond, J. E. Barrick, Z. Yao,W. L. Ruzzo, and R. B. Breaker. The aptamer core of SAM-IVriboswitches mimics the ligand-binding site of SAM-I riboswitches.RNA, 14:822–828, 2008.[99] Ydo Wexler, Chaya Zilberstein, and Michal Ziv-Ukelson. A studyof accessible motifs and RNA folding complexity. Journal of compu-tational biology : a journal of computational molecular cell biology,14(6):856–872, 2007.[100] Kevin A. Wilkinson, Edward J. Merino, and Kevin M. Weeks. RNASHAPE chemistry reveals nonhierarchical interactions dominate equi-librium structural transitions in tRNAasp transcripts. Journal of theAmerican Chemical Society, 127(13):4659–4667, April 2005.[101] Sebastian Will, Tejal Joshi, Ivo L. Hofacker, Peter F. Stadler, and RolfBackofen. LocARNA-p: Accurate boundary prediction and improveddetection of structural RNAs. RNA, 18(5):900–914, May 2012.[102] Zhenjiang Xu and David H. Mathews. Multilign: an algorithm topredict secondary structures conserved in multiple RNA sequences.Bioinformatics, 27(5):626–632, March 2011.134[103] Shay Zakov, Yoav Goldberg, Michael Elhadad, and Michal Ziv-Ukelson. Rich parameterization improves RNA structure prediction.Journal of computational biology : a journal of computational molec-ular cell biology, 18(11):1525–1542, November 2011.[104] Jian Zhang, Yu-Jie Zhang, and Wei Wang. An RNA Base DiscreteState Model toward Tertiary Structure Prediction. Chinese PhysicsLetters, 27(11):118702+, November 2010.[105] M. Zuker. Mfold web server for nucleic acid folding and hybridizationprediction. Nucleic Acids Research, 31:3406–3415, 2003.135Appendix AIPknot detailed performanceAmong all different versions of IPknot we tested, we found all but γ1 = 1and γ2 = 1 setting producing similar confidence intervals for all but HK-PK-free data sets, for which γ1 = 4 and γ2 = 8 produces the best result (seebelow). While running parameter refinement with one iteration improvedthe confidence intervals in the HK-PK data set, it did not result in anyimprovement in accuracy in the rest of our data sets as in many cases IPknotfailed to produce results. We note that our results on the IP-pk168 data setwith different weight parameters perfectly match the results of Sato et al.[83].Table A.1: Comparison of bootstrap 95% percentile confidence interval ofaverage F-measure of IPknot with different default settings and no refine-ment.input γ1 = 1, γ1 = 2, γ1 = 2, γ1 = 4data set γ2 = 1 γ2 = 4 γ2 = 16 γ2 = 8HK - PKed (40.70, 54.82) (53.27, 65.91) (57.31, 69.41) (54.67, 66.07)HK - PK free (38.04, 40.42) (39.35, 41.54) (38.88, 40.97) (77.31, 81.72)pk168 (47.00, 56.66) (56.86, 64.81) (60.73, 68.87) (58.17, 66.16)DK-pk16 (64.90, 74.50) (66.22, 77.80) (66.70, 79.15) (65.44, 75.74)Table A.2: Comparison of bootstrap 95% percentile confidence interval ofaverage F-measure of IPknot with different default settings and 1 iterationrefinement.input γ1 = 1, γ1 = 2, γ1 = 2, γ1 = 4,data set γ2 = 1 γ2 = 4 γ2 = 16 γ2 = 8HK - PKed (50.63, 66.80) (57.88, 71.53) (59.06, 72.02) (56.52, 69.26)HK - PK free (32.52, 36.34) (37.42, 39.95) (36.19, 38.60) (37.18, 39.49)pk168 (58.53, 68.93) (66.31, 75.39) (65.98, 75.00) (64.10, 73.20)DK-pk16 (51.57, 79.59) (55.99, 78.76) (55.22, 77.84) (62.16, 76.88)136Appendix BIterative HFold’s detailedperformance comparisonB.1 Robustness comparisonTables B.1, B.2 and B.3 provide complete presenting robustness comparisonof HFold-PKonly, HFold, and Iterative HFold, when provided with differentpercentage of Gbig information of the HK-PK and HK-PK-free data sets.Note that the reported interval in each case is the bootstrap 95% confidenceinterval for F-measure of the 100 structures with 1 ≤ α ≤ 99 percent infor-mation about the Gbig structure. The 100% information is the bootstrap95% confidence interval for F-measure, where input structure is Gbig.B.2 Comparison of correlation between falsepositive ratesTable B.4 provides Pearson correlation coefficient of HFold and IterativeHFold with HotKnots hotspots, SimFold MFE and SimFold first 50 subop-timal structures. Here FP represents false positive rate and F represents theF-measure as accuracy measure. We used R statistical tools to calculate thecorrelation coefficients. We used all 20 hotspots and all 50 suboptimals foreach RNA sequence in the HK-PK data set for this calculation.As seen in Table 3, there is a negative correlation between false positiverates and prediction accuracy in every case except for hotspots false positiverate in Iterative HFold, in which case the correlation is close to 0, indicatingthat there is little correlation. Not surprisingly, the negative correlation isquite strong for the MFE structures, because there are many more falsepositives in these structures.137B.2. Comparison of correlation between false positive ratesTable B.1: Comparison of robustness of different versions of HFold on theHK-PK data set.% Gbig info. PKonly HFold IterativeHK-PK 1% (2.35%, 3.04%) (47.58%, 59.19%) (47.95%, 59.62%)5% (11.09%, 13.42%) (52.50%, 62.24%) (54.51%, 64.56%)10% (20.41%, 24.43%) (56.79%, 65.01%) (60.64%, 69.68%)15% (27.97%, 32.95%) (60.18%, 67.36%) (65.08%, 73.36%)20% (35.19%, 40.57%) (62.77%, 69.19%) (68.51%, 76.54%)25% (41.07%, 46.93%) (64.50%, 70.36%) (70.77%, 78.79%)30% (46.53%, 52.78%) (65.87%, 71.54%) (72.54%, 80.85%)35% (51.60%, 57.78%) (67.65%, 72.96%) (74.18%, 82.30%)40% (55.68%, 61.87%) (68.72%, 73.76%) (75.30%, 83.27%)45% (59.68%, 65.82%) (69.39%, 74.51%) (75.85%, 83.74%)50% (63.32%, 69.33%) (70.25%, 75.17%) (76.50%, 84.64%)55% (66.71%, 72.65%) (71.18%, 75.96%) (77.06%, 85.12%)60% (69.81%, 75.65%) (72.04%, 76.75%) (77.71%, 85.53%)65% (72.73%, 78.66%) (73.10%, 77.78%) (78.03%, 85.93%)70% (75.28%, 81.25%) (74.21%, 78.89%) (78.12%, 86.06%)75% (78.07%, 83.87%) (75.76%, 80.43%) (78.41%, 86.34%)80% (80.57%, 86.39%) (77.19%, 81.82%) (78.60%, 86.54%)85% (82.73%, 88.52%) (78.63%, 83.41%) (78.70%, 86.68%)90% (84.91%, 90.63%) (80.62%, 85.73%) (79.01%, 86.97%)95% (87.19%, 92.89%) (83.18%, 88.73%) (79.19%, 87.09%)99% (88.70%, 94.42%) (85.17%, 91.21%) (79.31%, 87.34%)100% (89.03%, 94.83%) (85.74%, 91.87%) (79.36%, 87.41%)138B.2. Comparison of correlation between false positive ratesTable B.2: Comparison of robustness of different versions of HFold on theHK-PK-free data set.% Gbig info. PKonly HFold IterativeHK-PK-free 1% (1.43%, 1.51%) (79.76%, 84.32%) (79.14%, 83.84%)5% (7.01%, 7.21%) (83.26%, 86.76%) (80.98%, 85.14%)10% (14.08%, 14.40%) (86.13%, 88.99%) (82.86%, 86.43%)15% (21.12%, 21.56%) (88.37%, 90.78%) (84.59%, 87.91%)20% (28.02%, 28.60%) (89.97%, 92.01%) (85.67%, 88.86%)25% (34.52%, 35.17%) (91.10%, 92.97%) (86.80%, 89.73%)30% (40.70%, 41.44%) (91.98%, 93.68%) (87.63%, 90.45%)35% (46.70%, 47.46%) (92.78%, 94.34%) (88.52%, 91.20%)40% (52.20%, 52.98%) (93.34%, 94.85%) (89.01%, 91.71%)45% (57.39%, 58.19%) (93.82%, 95.18%) (89.63%, 92.22%)50% (62.37%, 63.16%) (94.31%, 95.62%) (89.99%, 92.58%)55% (66.87%, 67.66%) (94.59%, 95.87%) (90.32%, 92.83%)60% (71.29%, 72.05%) (94.90%, 96.17%) (90.77%, 93.26%)65% (75.45%, 76.17%) (95.17%, 96.37%) (91.08%, 93.54%)70% (79.28%, 80.00%) (95.40%, 96.59%) (91.44%, 93.85%)75% (83.01%, 83.71%) (95.57%, 96.74%) (91.76%, 94.16%)80% (86.56%, 87.21%) (95.75%, 96.88%) (92.02%, 94.37%)85% (89.86%, 90.50%) (95.87%, 97.00%) (92.60%, 94.82%)90% (93.11%, 93.70%) (95.98%, 97.12%) (92.99%, 95.22%)95% (96.08%, 96.68%) (96.06%, 97.19%) (93.46%, 95.64%)99% (98.41%, 98.99%) (96.09%, 97.23%) (93.76%, 95.98%)100% (98.96%, 99.54%) (96.11%, 97.24%) (93.85%, 96.07%)139B.2. Comparison of correlation between false positive ratesTable B.3: Comparison of robustness of different versions of HFold over allstructures of the HK-PK and HK-PK-free data sets.% Gbig info. PKonly HFold Iterative1% (1.63%, 1.81%) (73.76%, 78.56%) (73.33%, 78.22%)5% (7.87%, 8.50%) (77.22%, 81.32%) (76.05%, 80.32%)10% (15.41%, 16.48%) (80.33%, 83.80%) (78.75%, 82.51%)15% (22.61%, 23.89%) (82.72%, 85.81%) (81.05%, 84.44%)20% (29.61%, 31.03%) (84.41%, 87.26%) (82.64%, 85.85%)25% (36.01%, 37.50%) (85.58%, 88.33%) (83.97%, 87.01%)30% (42.07%, 43.64%) (86.56%, 89.11%) (85.06%, 88.03%)35% (47.90%, 49.46%) (87.52%, 89.98%) (86.09%, 88.89%)40% (53.14%, 54.65%) (88.18%, 90.53%) (86.65%, 89.51%)45% (58.12%, 59.57%) (88.64%, 90.97%) (87.29%, 90.04%)50% (62.81%, 64.23%) (89.25%, 91.48%) (87.68%, 90.48%)55% (67.10%, 68.45%) (89.67%, 91.85%) (88.07%, 90.80%)60% (71.23%, 72.59%) (90.08%, 92.24%) (88.54%, 91.25%)65% (75.11%, 76.48%) (90.54%, 92.58%) (88.86%, 91.53%)70% (78.72%, 80.05%) (90.97%, 92.98%) (89.17%, 91.81%)75% (82.19%, 83.56%) (91.48%, 93.39%) (89.43%, 92.11%)80% (85.51%, 86.85%) (91.91%, 93.75%) (89.70%, 92.33%)85% (88.58%, 89.93%) (92.35%, 94.14%) (90.15%, 92.72%)90% (91.57%, 92.92%) (92.94%, 94.65%) (90.58%, 93.10%)95% (94.38%, 95.75%) (93.60%, 95.28%) (90.93%, 93.49%)99% (96.54%, 97.91%) (94.09%, 95.75%) (91.25%, 93.80%)100% (97.03%, 98.44%) (94.22%, 95.91%) (91.31%, 93.90%)Table B.4: Comparison of correlation between false positive rates in inputstructure and output structures.Iter. HFold HFoldF FHotspots - FP 0.04 −0.16MFE - FP −0.78 −0.79Sub50 - FP −0.35 −0.36140Appendix CTime and memorycomparisonC.1 Time comparisonTables C.1, C.2 and C.3 provide complete data presenting running timecomparison of HFold, Iterative HFold, HotKnots V2.0 and IPknot on theHK-PK data set. The tables also show required time to get Hotspots as inputstructure to HFold and Iterative HFold. Timing is presented in seconds.C.2 Memory comparisonTables C.4, C.5 and C.6 provide complete data presenting memory (totalheap usage) comparison of HFold, Iterative HFold, HotKnots V2.0 and IP-knot on the HK-PK data set. Memory usage is presented in Mega Bytes.141C.2. Memory comparisonTable C.1: Running time comparisonName Len. Time (S)HFold Iter. HotKnots Hotspots IPknotA.tum.RNaseP 400 1.47 5.81 6041.40 0.41 0.72tobacco-mosaic-virus 214 0.40 5.57 3404.73 0.23 0.19telo.human 210 0.63 4.43 181.68 0.22 0.20TMR-00009 196 0.40 8.27 233.82 0.22 0.16ASE-00131 195 0.30 1.21 203.12 0.22 0.14ASE-00360 195 0.33 8.22 1665.94 0.22 0.15ASE-00429 189 0.29 3.18 567.86 0.22 0.13CRW-00659 170 0.27 4.68 72.71 0.21 0.12CRW-00641 168 0.24 5.26 53.05 0.21 0.12CRW-00611 167 0.24 4.39 532.22 0.21 0.09CRW-00687 153 0.20 3.83 36.05 0.21 0.09TMR-00047 130 0.27 2.05 31.39 0.20 0.07Coxsackie 114 0.15 0.64 7.84 0.20 0.05TMV.R 105 0.15 1.68 19.73 0.20 0.04HDV-anti 91 0.15 1.19 11.50 0.20 0.04RFA-00632 91 0.15 0.81 10.74 0.20 0.03RFA-00636 90 0.15 0.81 4.03 0.20 0.03HDV 87 0.14 0.79 4.98 0.20 0.03TYMV 86 0.14 1.20 9.95 0.20 0.03TMV.L 84 0.15 1.10 10.53 0.19 0.03CSFV-IRES 76 0.14 1.08 2.64 0.20 0.02BVDV-IRES 73 0.13 1.07 1.42 0.19 0.02CoxB3 73 0.13 1.03 1.22 0.19 0.02satRPV 73 0.13 1.03 6.67 0.19 0.02PDB-01009 71 0.13 0.56 2.43 0.19 0.03SARS-CoV 69 0.13 1.01 1.93 0.20 0.02PDB-01021 68 0.13 0.56 1.02 0.20 0.02PDB-01023 68 0.13 0.55 1.24 0.20 0.02EC-S15 67 0.14 1.01 2.87 0.20 0.02PDB-00944 65 0.13 0.55 1.27 0.20 0.02Tt-LSU-P3P7 65 0.18 0.56 1.88 0.20 0.02HCV-229E 61 0.13 0.96 1.05 0.19 0.02PRRSV-16244B 58 0.13 0.97 2.29 0.19 0.02PRRSV-LV 58 0.13 0.95 1.44 0.19 0.01HCV-Ires 56 0.12 0.54 1.47 0.19 0.01142C.2. Memory comparisonTable C.2: Running time comparison - ContinuedName Len. Time (S)HFold Iter. HotKnots Hotspots IPknotEc-PK4 52 0.12 0.93 0.89 0.20 0.01AKV-MuLV 50 0.18 1.22 1.54 0.21 0.01BaEV 50 0.19 0.93 1.37 0.19 0.01Cas-Br-E-MuLv 50 0.12 0.92 1.47 0.19 0.01FeLV 50 0.13 0.91 1.57 0.19 0.01SNV 50 0.13 0.67 1.04 0.19 0.01GaLV 49 0.13 0.92 1.44 0.19 0.01Hs-SRP-pkn 47 0.12 0.53 2.80 0.20 0.01Bt-PrP 45 0.13 0.91 1.21 0.19 0.01HIV-1-RT-2-3a 45 0.12 0.66 0.34 0.19 0.01Hs-Prp 45 0.13 0.79 0.76 0.20 0.01minimalIBV 45 0.12 0.91 0.70 0.20 0.01HIV-1-RT-2-2b 42 0.13 0.90 0.26 0.19 0.01HIV-1-RT-2-6b 42 0.12 0.90 0.29 0.19 0.01Ni-VS 42 0.12 0.91 0.47 0.19 0.01HIV-1-RT-2-5a 41 0.12 0.90 0.42 0.19 0.01HIV-1-RT-1-8 39 0.12 0.78 0.24 0.19 0.01HIV-1-RT-2-1b 39 0.12 0.90 0.26 0.19 0.01SRV-1 38 0.12 0.90 0.86 0.19 0.01TMV-L 38 0.13 0.78 1.32 0.19 0.01HIV-1-RT-1-1 37 0.13 0.90 0.33 0.19 0.01HIV-1-RT-1-17 37 0.12 0.90 0.33 0.19 0.01HIV-1-RT-1-3a 37 0.12 0.90 0.36 0.20 0.01HIV-1-RT-1-6 37 0.12 0.90 0.34 0.19 0.01HIV-1-RT-1-7 37 0.12 0.90 0.28 0.19 0.01HIV-1-RT-1-9b 37 0.12 0.90 0.39 0.19 0.01HIV-1-RT-2-10 37 0.12 0.90 0.35 0.19 0.01HIV-1-RT-2-11 37 0.12 0.90 0.27 0.19 0.01HIV-1-RT-2-12 37 0.12 0.78 0.27 0.19 0.01HIV-1-RT-2-4a 37 0.12 0.89 0.40 0.20 0.01HIV-1-RT-2-7a 36 0.12 0.89 0.30 0.19 0.01pKA-A 36 0.12 0.65 0.42 0.19 0.01143C.2. Memory comparisonTable C.3: Running time comparison - ContinuedName Len. Time (S)HFold Iter. HotKnots Hotspots IPknotEIAV 35 0.12 0.89 0.37 0.19 0.01FIV 35 0.12 0.89 0.30 0.19 0.01HIVRT32 35 0.12 0.89 0.31 0.20 0.01HIVRT322 35 0.12 0.89 0.40 0.20 0.01HIVRT33 35 0.12 0.89 0.32 0.19 0.01HIV-1-RT-2-9 34 0.12 0.89 0.27 0.19 0.01MMTV 34 0.12 0.89 0.71 0.19 0.01MMTV-vpk 34 0.12 0.89 0.34 0.19 0.01MMTVgag-pro 34 0.12 0.89 0.95 0.19 0.01T2-gene32 33 0.12 0.65 0.26 0.19 0.01Ec-PK1 30 0.12 0.89 0.31 0.20 0.01LP-PK1 30 0.12 0.89 0.24 0.19 0.01BWYV 28 0.12 0.88 0.31 0.19 0.01PEMV 28 0.12 0.89 0.25 0.19 0.01T4-gene32 28 0.12 0.89 0.26 0.19 0.01BLV 27 0.12 0.52 0.31 0.19 0.01BYDV-NY-RPV 27 0.12 0.88 0.30 0.19 0.01CABYV 27 0.12 0.88 0.27 0.19 0.01BChV 26 0.12 0.88 0.27 0.19 0.01PLRV-S 26 0.12 0.88 0.29 0.19 0.01PLRV-W 26 0.12 0.88 0.29 0.19 0.01144C.2. Memory comparisonTable C.4: Memory usage comparisonName Len. Memory (MB)HFold Iter. HotKnots IPknotA.tum.RNaseP 400 61.33 61.33 93419.00 5.49tobacco-mosaic-virus 214 53.85 53.85 44241.76 2.10telo.human 210 53.71 53.71 8355.85 2.05TMR-00009 196 53.25 53.25 11833.91 1.81ASE-00131 195 53.21 53.21 12805.66 1.76ASE-00360 195 53.21 53.21 55708.86 1.94ASE-00429 189 53.02 53.02 28543.33 1.77CRW-00659 170 52.42 52.42 4175.79 1.44CRW-00641 168 52.36 52.36 3503.01 1.44CRW-00611 167 52.33 52.33 26824.57 1.42CRW-00687 153 51.91 51.91 3387.13 1.26TMR-00047 130 51.24 51.24 2730.81 1.01Coxsackie 114 50.80 50.80 969.34 0.84TMV.R 105 50.56 50.56 2354.81 0.78HDV-anti 91 50.20 50.20 1707.54 0.90RFA-00632 91 50.20 50.20 1528.68 0.78RFA-00636 90 50.17 50.17 634.52 0.70HDV 87 50.10 50.10 755.96 0.68TYMV 86 50.07 50.07 1478.82 0.62TMV.L 84 50.02 50.02 1455.98 0.57CSFV-IRES 76 49.82 49.82 457.97 0.52BVDV-IRES 73 49.75 49.75 252.15 0.50CoxB3 73 49.75 49.75 245.71 0.48satRPV 73 49.75 49.75 1141.53 0.49PDB-01009 71 49.70 49.70 431.24 1.01SARS-CoV 69 49.66 49.66 357.78 0.47PDB-01021 68 49.63 49.63 197.48 0.45PDB-01023 68 49.63 49.63 268.22 0.45EC-S15 67 49.61 49.61 534.31 0.44PDB-00944 65 49.56 49.56 261.16 0.44Tt-LSU-P3P7 65 49.56 49.56 418.68 0.41HCV-229E 61 49.47 49.47 213.69 0.41PRRSV-16244B 58 49.40 49.40 462.97 0.39PRRSV-LV 58 49.40 49.40 322.98 0.39HCV-Ires 56 49.35 49.35 329.64 0.38145C.2. Memory comparisonTable C.5: Memory usage comparison - ContinuedName Len. Memory (MB)HFold Iter. HotKnots IPknotEc-PK4 52 49.26 49.26 213.75 0.33AKV-MuLV 50 49.21 49.21 335.54 0.4BaEV 50 49.21 49.21 285.20 0.36Cas-Br-E-MuLv 50 49.21 49.21 328.05 0.40FeLV 50 49.21 49.21 359.76 0.34SNV 50 49.21 49.21 221.59 0.34GaLV 49 49.19 49.19 281.26 0.30Hs-SRP-pkn 47 49.15 49.15 681.41 0.31Bt-PrP 45 49.10 49.10 303.82 0.59HIV-1-RT-2-3a 45 49.10 49.10 39.97 0.33Hs-Prp 45 49.10 49.10 186.98 0.31minimalIBV 45 49.10 49.10 162.06 0.30HIV-1-RT-2-2b 42 49.04 49.04 20.13 0.26HIV-1-RT-2-6b 42 49.04 49.04 29.99 0.27Ni-VS 42 49.04 49.04 79.84 0.28HIV-1-RT-2-5a 41 49.01 49.01 66.23 0.27HIV-1-RT-1-8 39 48.97 48.97 16.75 0.24HIV-1-RT-2-1b 39 48.97 48.97 23.33 0.26SRV-1 38 48.95 48.95 220.73 0.34TMV-L 38 48.95 48.95 368.74 0.25HIV-1-RT-1-1 37 48.93 48.93 46.26 0.26HIV-1-RT-1-17 37 48.93 48.93 49.33 0.24HIV-1-RT-1-3a 37 48.93 48.93 56.45 0.23HIV-1-RT-1-6 37 48.93 48.93 49.32 0.25HIV-1-RT-1-7 37 48.93 48.93 29.82 0.24HIV-1-RT-1-9b 37 48.93 48.93 62.54 0.32HIV-1-RT-2-10 37 48.93 48.93 49.32 0.29HIV-1-RT-2-11 37 48.93 48.93 26.47 0.25HIV-1-RT-2-12 37 48.93 48.93 26.55 0.24HIV-1-RT-2-4a 37 48.93 48.93 66.74 0.23HIV-1-RT-2-7a 36 48.90 48.90 36.30 0.25pKA-A 36 48.90 48.90 76.08 0.32146C.2. Memory comparisonTable C.6: Memory usage comparison - ContinuedName Len. Memory (MB)HFold Iter. HotKnots IPknotEIAV 35 48.88 48.88 65.99 0.23FIV 35 48.88 48.88 37.03 0.21HIVRT32 35 48.88 48.88 42.99 0.21HIVRT322 35 48.88 48.88 69.10 0.25HIVRT33 35 48.88 48.88 43.22 0.21HIV-1-RT-2-9 34 48.86 48.86 29.66 0.23MMTV 34 48.86 48.86 176.54 0.23MMTV-vpk 34 48.86 48.86 52.92 0.25MMTVgag-pro 34 48.86 48.86 252.39 0.23T2-gene32 33 48.84 48.84 23.32 0.22Ec-PK1 30 48.78 48.78 42.94 0.23LP-PK1 30 48.78 48.78 16.53 0.19BWYV 28 48.73 48.73 42.48 0.18PEMV 28 48.73 48.73 20.09 0.18T4-gene32 28 48.73 48.73 26.47 0.19BLV 27 48.71 48.71 42.56 0.18BYDV-NY-RPV 27 48.71 48.71 42.25 0.18CABYV 27 48.71 48.71 32.66 0.23BChV 26 48.69 48.69 29.43 0.17PLRV-S 26 48.69 48.69 36.06 0.18PLRV-W 26 48.69 48.69 35.89 0.17147Appendix DPseudocode for the sparseCCJ algorithm148Appendix D. Pseudocode for the sparse CCJ algorithmAlgorithm 1 Pseudocode for the sparse CCJ algorithm.1: initialize all candidate lists as empty2: initialize all 2D and 3D matrices to their initial values3: for i := n to 1 do4: W (i, i− 1) := 05: P(i, i) := +∞6: WM′(i, i) = WB′(i, i) = WP′(i, i) := +∞7: if !(i ≤ j < k − 1 < l) then8: PKi(j, k, l) = P iL(j, k, l) = P iM (j, k, l) =9: P iR(j, k, l) = P iO(j, k, l) := +∞10: P ifromM (i, l, l) = P ifromO(i, l, l) := 0 ;11: P ifromL(i, l, l) = P ifromR(i, l, l) := γ2(i, l) + γ2(l, i)12: for l := i to n do13: . —— V recurrence ——14: Vhairpin := eH(i, l)15: Vstacked := eS(i, l)16: Viloop := min i<d<d′<ld−i+l−d′≤MLS(eint(i, d, d′, l) + V (d, d′))17: Vmloop := Vmloop(i+ 1, l − 1) + β2(l, i) + a+ b18: V (i, l) := min{VhairpinVstackedViloopVmloop19: . —— Vmloop recurrence ——20: Vmloop12 := mini<d<l WM′(i, d− 1) + WM′(d, l)21: VmloopP := mini≤d<e≤l P(d, e) + Ps + b+ c× (d− i+ l − e)22: Vmloop(i, l) := min{V mloop12V mloopP23: . —— P recurrence ——24: P(i, l) := min(j,d,k,w)∈LPK(l,1212) PKi(j − 1, d+ 1, k − 1) + w25: . —— W recurrence ——26: W12 := mini<d≤lW (i, d− 1) +W (d, l)27: W (i, l) := min{W (i,l−1)V (i,l)W12P(i,l)+Ps28: . —— WM ′ recurrence ——29: WM′12′ := WM′(i, l − 1) + c× (1)30: WM′12 := mini<d<l WM(i, d− 1) + WM′(d, l)31: WM′(i, l) := minWM′12′WM′12V (i,l)+β2(i,l)+bP(i,l)+Psm+b149Appendix D. Pseudocode for the sparse CCJ algorithmAlgorithm 2 CCJ Alg Part 232: . —— WB′ recurrence ——33: WB′12′ := WB′(i, l − 1) + c× (1)34: WB′12 := mini<d<l WB(i, d− 1) + WB′(d, l)35: WB′ := minWB′12′WB′12V (i,j)+βp2 (i,l)+b′P(i,l)+Psm+b′36: WB′(i, l) := WB′37: . —— WP′ recurrence ——38: WP′12′ := WP′(i, l − 1) + Pup × (1)39: WP′12 := mini<d<l WP(i, d− 1) + WP′(d, l)40: WP′(i, l) := minWP′12′WP′12V (i,l)+γ2(i,l)P(i,l)+Psp41: . —— recurrences for [i, j] ∪ [k, l]——42: for j := l − 2 to i do43: for k := j + 2 to l do44: . —— PL,mloop10 recurrence ——45: PL,mloop1012G2 := min(d,w)∈LPL,m0012G2 (j,k,l,12G2) WB′(i, d −1) + w46: PL,mloop1012G1 := mini≤d<j P iL,mloop10(d, k, l)+WB(d+1, j)47: P iL,mloop10(j, k, l) := min{PL,mloop1012G2 , PL,mloop1012G1}48: P i+1L,mloop10(j, k, l) := P iL,mloop10(j, k, l)49: . —— PL,mloop01 recurrence ——50: P iL,mloop01(j, k, l) := mini≤d<j P iL,mloop00(d, k, l) + WB′(d+1, j)51: P i+1L,mloop01(j, k, l) := P iL,mloop01(j, k, l)52: . —— PL,mloop00 recurrence ——53: PL,mloop0012G2 := min(d,w)∈LPL,m0012G2 (j,k,l,12G2) WB(i, d −1) + w54: PL,mloop0012G1 := mini≤d<j P iL,mloop00(d, k, l)+WB(d+1, j)55: P iL,mloop00(j, k, l) := min{P iL(j,k,l)+βp2 (j,i)PL,mloop0012G2PL,mloop0012G156: . —— Pushing the candidates into the lists ——57: if{(P iL,mloop00(j,k,l)<min(d,w)∈LPL,m0012G2d>iWB(i,d−1)+w)&&(P iL,mloop00(j,k,l)<mini<d≤j P iL,mloop00(d−1,k,l)+WB(d,j))}then58: push LPL,m0012G2 (j, k, l, 12G2), (i, PiL,mloop00(j, k, l))150Appendix D. Pseudocode for the sparse CCJ algorithmAlgorithm 3 CCJ Alg Part 359: . —— PR,mloop10 recurrence ——60: PR,mloop101G2′1 := P iR,mloop10(j, k + 1, l) + c′61: PR,mloop101G21 := mink<d≤l P iR,mloop00(j, d, l) + WB′(k, d −1)62: P iR,mloop10(j, k, l) := min{PR,mloop101G2′1PR,mloop101G2163: . —— PR,mloop01 recurrence ——64: PR,mloop011G12′ := P iR,mloop01(j, k, l − 1) + c′65: PR,mloop011G12 := mink≤d<l P iR,mloop00(j, k, d) + WB′(d +1, l)66: P iR,mloop01(j, k, l) := min{PR,mloop011G12′PR,mloop011G1267: . —— PR,mloop00 recurrence ——68: PR,mloop001G21 := mink<d≤l P iR,mloop00(j, d, l) + WB(k, d −1))69: PR,mloop001G12 := mink≤d<l P iR,mloop00(j, k, d)+WB(d+1, l)70: P iR,mloop00(j, k, l) := min{P iR(j,k,l)+ βp2 (l,k)PR,mloop001G21PR,mloop001G1271: . —— PM,mloop10 recurrence ——72: PM,mloop1012′G1 := P iM,mloop10(j − 1, k, l) + c′73: PM,mloop1012G1 := mini<d<j P iM,mloop00(d − 1, k, l) +WB′(d, j)74: P iM,mloop10(j, k, l) := min{PM,mloop1012′G1PM,mloop1012G175: . —— PM,mloop01 recurrence ——76: PM,mloop011G2′1 := P iM,mloop01(j, k + 1, l) + c′77: PM,mloop011G21 := mink<d≤l P iM,mloop00(j, d, l) + WB′(k, d−1)78: P iM,mloop01(j, k, l) := min{PM,mloop011G2′1PM,mloop011G2179: . —— PM,mloop00 recurrence ——80: PM,mloop0012G1 := mini≤d<j P iM,mloop00(d, k, l) + WB(d +1, j)81: PM,mloop001G21 := mink<d≤l P iM,mloop00(j, d, l) + WB(k, d−1)82: P iM,mloop00(j, k, l) := min{P iM (j,k,l)+ βp2 (j,k)PM,mloop0012G1PM,mloop001G21151Appendix D. Pseudocode for the sparse CCJ algorithmAlgorithm 4 CCJ Alg Part 483: . —— PO,mloop10 recurrence ——84: PO,mloop1012G2 := min(d,w)∈LPO,m0012G2 (j,k,l,12G2) WB′(i, d −1) + w85: PO,mloop101G12 := mink≤d<l P iO,mloop10(j, k, d)+WB(d+1, l)86: P iO,mloop10(j, k, l) := min{PO,mloop1012G2PO,mloop101G1287: P i+1O,mloop10(j, k, l) := P iO,mloop10(j, k, l)88: . —— PO,mloop01 recurrence ——89: P iO,mloop01(j, k, l) := mink≤d<l P iO,mloop00(j, k, d)+WB′(d+1, l)90: P i+1O,mloop01(j, k, l) := P iO,mloop01(j, k, l)91: . —— PO,mloop00 recurrence ——92: PO,mloop0012G2 := min(d,w)∈LPO,m0012G2 (j,k,l,12G2) WB(i, d −1) + w93: PO,mloop001G12 := mink≤d<l P iO,mloop00(j, k, d)+WB(d+1, l)94: P iO,mloop00 := min{P iO(j,k,l)+ βp2 (l,i)PO,mloop0012G2PO,mloop001G1295: . —— Pushing the candidates into the lists ——96: if(P iO,mloop00(j,k,l)<min(d,w)∈LPO,m0012G2d>iWB(i,d−1)+w)AND(P iO,mloop00(j,k,l)<mink≤d<l P iO,mloop00(j,k,d)+WB(d+1,l))then97: push LPO,m0012G2 (j, k, l, 12G2), (i, PiO,mloop00(j, k, l))98: . PL recurrence: To handle i+MLS slices of PL we userotation using mod99: Define rot(x) := x mod MLS100: PL,iloop := min{P rot(i+1)L (j−1,k,l)+estP (i,j)min i<d<d′<ld−i+j−d′≤MLS(eintP (i,d,d′,j)+P rot(d)L (d′,k,l))101: PL,mloop := min{P i+1L,mloop10(j−1,k,l)+a′+βp2 (j,i)P i+1L,mloop01(j−1,k,l)+a′+βp2 (j,i)102: P rot(i)L (j, k, l) := min{PL,iloopPL,mloopP i+1fromL(j−1,k,l)+γ2(j,i)103: P rot(i+1)L (j, k, l) := Prot(i)L (j, k, l)152Appendix D. Pseudocode for the sparse CCJ algorithmAlgorithm 5 CCJ Alg Part 5104: . —— PR recurrence ——105: PR,iloop := min{P iR(j,k+1,l−1)+estP (k,l)min k<d<d′<ld−k+l−d′≤MLS(eintP (i,d,d′,j)+P iR(j,d,d′))106: PR,mloop := min{P iR,mloop10(j,k+1,l−1)+a′+βp2 (l,k)P iR,mloop01(j,k+1,l−1)+a′+βp2 (l,k)107: P iR(j, k, l) := min{PR,iloopPR,mloop+b′P ifromR(j,k+1,l−1)+γ2(l,k)108: . —— PM recurrence ——109: if i = j AND k = l then110: P iM (j, k, l) := γ2(i, l)111: else112: PM,iloop := min{P iM (j−1,k+1,l)+estP (j−1,k+1)min i<d<j<k<d′<lj−d+d′−k≤MLS(eintP (d,j,k,d′)+P iM (d,d′,l))113: PM,mloop := min{P iM,mloop10(j−1,k+1,l)+a′+βp2 (j,k)P iM,mloop01(j−1,k+1,l)+a′+βp2 (j,k)114: P iM (j, k, l) := min{PM,iloopPM,mloop+b′P ifromM (j−1,k+1,l)+γ2(j,k)115: . —— PO recurrence ——116: PO,iloop := min{P rot(i+1)O (j,k,l−1)+estP (i,l)min i<d<j<k<d′<ld−i+l−d′≤MLS(eintP (i,d,d′,l)+P rot(d)O (j,d′,k))117: PO,mloop := min{P i+1LO,mloop10(j,k,l−1)+a′+βp2 (l,i)P i+1O,mloop01(j,k,l−1)+a′+βp2 (l,i)118: P rot(i)O (j, k, l) := min{PO,iloopPO,mloopP i+1fromO(j,k,l−1)+γ2(l,i)119: P rot(i+1)O (j, k, l) := Prot(i)O (j, k, l)120: . —— PfromL recurrence ——121: PfromL12G2 := min(d,w)∈LfromL12G2 (j,k,l,12G2) WP(i, d−1)+w122: PfromL12G1 := mini≤d<j P ifromL(d, k, l) + WP(d+ 1, j)123: P ifromL(j, k, l) := minPfromL12G2PfromL12G1P iR(j,k,l)+γ2(l,k)+PbP iM (j,k,l)+γ2(j,k)+PbP iO(j,k,l)+γ2(l,i)+Pb124: P i+1fromL(j, k, l) := P ifromL(i, j, k)153Appendix D. Pseudocode for the sparse CCJ algorithmAlgorithm 6 CCJ Alg Part 6125: . —— Pushing the candidates into the lists ——126: if P ifromL(j, k, l) < PfromL12G2 then127: push LfromL12G2(j, k, l, 12G2), (i, P ifromL(j, k, l))128: . —— PfromR recurrence ——129: PfromR1G21 := mink<d≤l P ifromR(j, d, l) + WP(k, d− 1)130: PfromR1G12 := mink≤d<l P ifromR(j, k, d) + WP(d+ 1, l)131: P ifromR(j, k, l) := minPfromR1G21PfromR1G12P iM (j,k,l)+γ2(j,k)+PbP iO(j,k,l)+γ2(l,i)+Pb132: . —— PfromM recurrence ——133: PfromM12G1 := mini≤d<j P ifromM (d, k, l) + WP(d+ 1, j)134: PfromM1G21 := mink<d≤l P ifromM (j, d, l) + WP(k, d− 1)135: P ifromM (j, k, l) := minPfromM12G1PfromM1G21P iL(j,k,l)+γ2(j,i)+PbP iR(j,k,l)+γ2(l,k)+Pb136: . —— PfromO recurrence ——137: PfromO12G2 := min(d,w)∈LfromO12G2 (j,k,l,12G2) WP(i, d−1)+w138: PfromO1G12 := mink≤d<l P ifromO(j, k, d) + WP(d+ 1, l)139: P ifromO(j, k, l) := minPfromO12G2Pfromo1G12P iL(j,k,l)+γ2(j,i)+PbP iR(j,k,l)+γ2(l,k)140: P i+1fromO(j, k, l) := P ifromO(j, k, l)141: . —— Pushing the candidates into the lists ——142: if P ifromO(j, k, l) < PfromO12G2 then143: push LfromO12G2(j, k, l, 12G2), (i, P ifromO(j, k, l))144: PK12G1 = mini≤d<j PKi(d, k, l) + WP(d+ 1, j)145: PK1G21 = mink<d≤l PKi(j, d, l) + WP(k, d− 1)146: PKi(j, k, l) = minPK12G1PK1G21P iL(j,k,l)+γ2(j,i)+PbP iM (j,k,l)+γ2(j,k)+PbP iR(j,k,l)+γ2(l,k)+PbP iO(j,k,l)+γ2(l,i)+Pb147: if PKi(j, k, l) < PK12G1 then148: push LPK(l, 1212)(i, j, k,PKi(j, k, l))154
- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Theses and Dissertations /
- Algorithms for prediction of RNA pseudoknotted secondary...
Open Collections
UBC Theses and Dissertations
Featured Collection
UBC Theses and Dissertations
Algorithms for prediction of RNA pseudoknotted secondary structures Jabbari, Hosna 2015
pdf
Notice for Google Chrome users:
If you are having trouble viewing or searching the PDF with Google Chrome, please download it here instead.
If you are having trouble viewing or searching the PDF with Google Chrome, please download it here instead.
Page Metadata
Item Metadata
Title | Algorithms for prediction of RNA pseudoknotted secondary structures |
Creator |
Jabbari, Hosna |
Publisher | University of British Columbia |
Date Issued | 2015 |
Description | RNA molecules are crucial in different levels of cellular function, and their functions largely depend on their structures. Since experimental methods for determining RNA structure are expensive, computational methods for prediction of RNA secondary structure (the set of base pairs) are valuable. One such approach is based on the Minimum Free Energy (MFE) folding hypothesis, which states an RNA molecule folds into the structure with the minimum free energy. Another approach is based on the hierarchical folding hypothesis, which posits that an RNA molecule first folds into a psuedoknot- free structure (i.e., no crossing base pairs), then additional base pairs are added that may form pseudoknots (structures with crossing base pairs) to lower the structure’s free energy. The overarching goal of this thesis is to overcome some limitations of the previous methods, namely (1) slow running time, (2) poor prediction accuracy (i.e., low number of correctly predicted base pairs), (3) limited in types of pseudoknots they can predict, and (4) limited opportunity to provide structural information that can guide prediction. We propose two algorithms and study different variations of each method. We propose a relaxed version of the hierarchical folding hypothesis and present Iterative HFold, which is based on this hypothesis. Furthermore, we propose the CCJ algorithm, an MFE-based algorithm that significantly expands the structures handled in O(n⁵) time. Employing a sparsification technique, we propose a sparse version of CCJ that improves its space usage. While CCJ’s and Iterative HFold’s performance are not significantly different on our large data set, Iterative HFold is considerably less expensive to run than CCJ. In addition, Iterative HFold provides the user with ability to incorporate structural information as input constraint. Both CCJ and Iterative HFold predict significantly more accurate structures on key data sets than two of the best performing computational methods currently available (i.e., HotKnots V2.0 and IPknot). Of the folding hypotheses that we have studied, it seems that the relaxed hierarchical folding hypothesis has better potential to predict RNA secondary structure, at least with respect to the energy model used in this work. |
Genre |
Thesis/Dissertation |
Type |
Text |
Language | eng |
Date Available | 2015-03-16 |
Provider | Vancouver : University of British Columbia Library |
Rights | Attribution-NonCommercial-NoDerivs 2.5 Canada |
DOI | 10.14288/1.0167140 |
URI | http://hdl.handle.net/2429/52396 |
Degree |
Doctor of Philosophy - PhD |
Program |
Computer Science |
Affiliation |
Science, Faculty of Computer Science, Department of |
Degree Grantor | University of British Columbia |
GraduationDate | 2015-05 |
Campus |
UBCV |
Scholarly Level | Graduate |
Rights URI | http://creativecommons.org/licenses/by-nc-nd/2.5/ca/ |
AggregatedSourceRepository | DSpace |
Download
- Media
- 24-ubc_2015_may_jabbari_hosna.pdf [ 1.37MB ]
- Metadata
- JSON: 24-1.0167140.json
- JSON-LD: 24-1.0167140-ld.json
- RDF/XML (Pretty): 24-1.0167140-rdf.xml
- RDF/JSON: 24-1.0167140-rdf.json
- Turtle: 24-1.0167140-turtle.txt
- N-Triples: 24-1.0167140-rdf-ntriples.txt
- Original Record: 24-1.0167140-source.json
- Full Text
- 24-1.0167140-fulltext.txt
- Citation
- 24-1.0167140.ris
Full Text
Cite
Citation Scheme:
Usage Statistics
Share
Embed
Customize your widget with the following options, then copy and paste the code below into the HTML
of your page to embed this item in your website.
<div id="ubcOpenCollectionsWidgetDisplay">
<script id="ubcOpenCollectionsWidget"
src="{[{embed.src}]}"
data-item="{[{embed.item}]}"
data-collection="{[{embed.collection}]}"
data-metadata="{[{embed.showMetadata}]}"
data-width="{[{embed.width}]}"
data-media="{[{embed.selectedMedia}]}"
async >
</script>
</div>
Our image viewer uses the IIIF 2.0 standard.
To load this item in other compatible viewers, use this url:
https://iiif.library.ubc.ca/presentation/dsp.24.1-0167140/manifest