"Science, Faculty of"@en . "Computer Science, Department of"@en . "DSpace"@en . "UBCV"@en . "Andronescu, Mirela S\u00CC\u00A7tefania"@en . "2009-11-02T20:24:53Z"@en . "2003"@en . "Master of Science - MSc"@en . "University of British Columbia"@en . "Secondary structure prediction of nucleic acid molecules is a very important problem in computational molecular biology. In this thesis we introduce two new algorithms for: (1) secondary structure prediction of pairs of nucleic acid molecules (PairFold), and (2) finding which sequences, formed from a combinatorial set of nucleic acid strands, have the most stable secondary structures (CombFold). Our algorithms run in polynomial time in the sequences lengths and are extensions of the free energy minimization algorithm [72] for secondary structure prediction without pseudoknots, using the nearest neighbour thermodynamic model. Predicting hybridization of pairs of molecules is motivated by important applications such as ribozyme - mRNA target duplexes, primer binding prediction and DNA code design. Finding the most stable concatenations in combinatorial sets of strands is useful for SELEX experiments and for testing whether sets in DNA computing or tag libraries concatenate without secondary structure. Our results for PairFold predictions show over 80% accuracy for sequences of up to 100 nucleotides. The performance goes down as the sequences increase in length and as the number of non-canonical base pairs, pseudoknots and tertiary interactions, none of these considered here, increases. The accuracy of CombFold is similar to that of the free energy minimization algorithm for single strands, being just a polynomial method for structure prediction of a combinatorial set of strands. We show that although complex, CombFold can quickly predict large concatenations of sets drawn from the literature. In the future, these two algorithms can be combined to predict the most stable duplexes formed by two combinatorial sets."@en . "https://circle.library.ubc.ca/rest/handle/2429/14551?expand=metadata"@en . "6901485 bytes"@en . "application/pdf"@en . "Algor i thms for predicting the secondary structure of pairs and combinatorial sets of nucleic acid strands by Mirela \u00C2\u00A7tefania Andronescu M.Sc , Academy of Economic Studies, Bucharest, Romania, 2000 A THESIS S U B M I T T E D IN PARTIAL F U L F I L L M E N T OF T H E R E Q U I R E M E N T S FOR T H E D E G R E E OF Master of Science in T H E F A C U L T Y OF G R A D U A T E STUDIES (Department of Computer Science) We accept this thesis as conforming to the required standard The Universi ty of B r i t i sh Columbia August 2003 \u00C2\u00A9 Mirela Sjtefania Andronescu, 2003 In p r e s e n t i n g t h i s t h e s i s i n p a r t i a l f u l f i l m e n t o f the r e q u i r e m e n t s f o r an a d v a n c e d d e g r e e a t the U n i v e r s i t y o f B r i t i s h C o l u m b i a , I a g r e e t h a t the L i b r a r y s h a l l make i t f r e e l y a v a i l a b l e f o r r e f e r e n c e and s t u d y . I f u r t h e r a g r e e t h a t p e r m i s s i o n f o r e x t e n s i v e c o p y i n g o f t h i s t h e s i s f o r s c h o l a r l y p u r p o s e s may be g r a n t e d by t h e h e a d o f my d e p a r t m e n t o r by h i s o r h e r r e p r e s e n t a t i v e s . I t i s u n d e r s t o o d t h a t c o p y i n g o r p u b l i c a t i o n o f t h i s t h e s i s f o r f i n a n c i a l g a i n s h a l l n o t be a l l o w e d w i t h o u t my w r i t t e n p e r m i s s i o n . ) Department o f COHpaTiT^ SC-< gWCg* The U n i v e r s i t y o f B r i t i s h C o l u m b i a V a n c o u v e r , Canada Date Au$tL*t , ZQO 3 \ Abstract Secondary structure prediction of nucleic acid molecules is a very important prob-lem in computational molecular biology. In this thesis we introduce two new algorithms for: (1) secondary structure prediction of pairs of nucleic acid molecules (PairFold), and (2) finding which sequences, formed from a combinatorial set of nucleic acid strands, have the most stable secondary structures (CombFold). Our algorithms run in polynomial time in the sequences lengths and are extensions of the free energy minimization algorithm [72] for secondary structure prediction without pseudoknots, using the nearest neighbour thermo-dynamic model. Predicting hybridization of pairs of molecules is motivated by important applications such as ribozyme - mRNA target duplexes, primer binding prediction and DNA code design. Finding the most stable concatenations in combinatorial sets of strands is use-ful for SELEX experiments and for testing whether sets in DNA computing or tag libraries concatenate without secondary structure. Our results for PairFold predictions show over 80% accuracy for sequences of up to 100 nucleotides. The performance goes down as the sequences increase in length and as the number of non-canonical base pairs, pseudoknots and tertiary interactions, none of these considered here, increases. The accuracy of CombFold is similar to that of the free energy minimization algorithm for single strands, being just a polynomial method for structure prediction of a combinatorial set of strands. We show that although complex, CombFold can quickly predict large concatenations of sets drawn from the literature. In the future, these two algorithms can be combined to predict the most stable duplexes formed by two combinatorial sets. Contents Abstract ii Contents iii List of Tables v List of Figures vii Acknowledgements ix Dedication x 1 Introduction 1 1.1 Background 1 1.2 Problems and motivations 8 1.3 Contributions 9 1.4 Notations and conventions 10 1.5 Thesis outline 10 2 Related W o r k 12 2.1 Secondary structure prediction for single molecules 13 2.2 Secondary structure prediction of pairs of molecules 16 2.3 Secondary structure prediction of combinatorial sets 17 3 T h e Nearest Neighbour Thermodynamic M o d e l 19 3.1 Background for N N T M 20 3.2 Thermodynamic parameters \u00E2\u0080\u00A2 22 3.3 Free energy calculation of a secondary structure 31 3.4 Example of free energy calculation 38 3.5 Free energy calculation for pairs of molecules 40 3.6 Discussion and limitations of the current model 43 iii 4 SimFold : 45 4.1 Dynamic programming algorithm 46 4.2 Implementation 49 4.3 Time and space complexity - theoretical analysis 51 4.4 Performance evaluation 52 4.4.1 Comparison of secondary structure predictions 53 4.4.2 C P U times 55 4.4.3 SimFold accuracy on biological structures 56 5 PairFold 60 5.1 Dynamic programming algorithm 60 5.2 Implementation 65 5.3 Time and space complexity 66 5.4 PairFold symmetry 67 5.5 Applications and performance evaluation 68 5.5.1 Prediction of DNA or R N A duplex formation ^ 68 5.5.2 Predicting ribozyme-mRNA target hybridization 69 5.5.3 Analysis of a combinatorial library of hairpin ribozymes 70 ( 5.5.4 Prediction of primer-target binding 73 5.5.5 PairFold in DNA word design 73 6 CombFold 75 6.1 A dynamic programming algorithm for the optimal M F E combination problem 77 6.2 An algorithm for the fc-suboptimal M F E combinations problem 84 6.3 Implementation 87 6.4 Time and space complexity 91 6.5 Applications and experimental results 95 6.5.1 Directed mutagenesis and S E L E X experiments 95 6.5.2 D N A computing and tag - anti-tag libraries 96 7 Conclusions and Future Work 99 Bibliography 101 Appendix A SimFold analysis 106 Appendix B PairFold Analysis 111 iv List of Tables 3.1 Miscellaneous free energy rules 31 4.1 Differences between SimFold and mfold on a set of 7 RNA sequences of length 100 nucleotides. The model used in mfold on the MFE structure returned by SimFold gives the same free energy as SimFold, which is lower than the MFE returned by mfold. This happens because mfold does not allow isolated base pairs in teh mini-mization algorithm 54 4.2 Comparison of structure prediction between SimFold and RNAfold as a function of sequence length 55 4.3 Comparison of CPU time between SimFold and RNAfold as a function of sequence length 56 4.4 Summary of SimFold accuracy on different types of RNA sequences. 59 5.1 Comparison between PairFold predictions, experimental data and other predic-tions [37, 67], on short DNA and RNA duplexes. The values represent average enthalpies, entropies, free energies and melting temperatures 68 5.2 Measurement of PairFold accuracy on a set of mRNA target - ribozyme pairs. . . 70 5.3 Measurement of PairFold accuracy on a library of hairpin ribozymes 72 5.4 PairFold predictions on a set of designed primers 74 6.1 Example of a combinatorial set of short RNA sequences 77 6.2 Example of choices for b values for a particular situation 81 6.3 Output of CombFold on some published combinatorial sets for structure free-ness 97 6.4 The optimal MFE combinations (sequences and structures), as predicted by Comb-Fold, for the Input-Sets in Table 6.5.2 97 A . l 7 artificially created sequences, on which SimFold and mfold predictions differ. 106 A.2 Analysis of SimFold accuracy on a small subset of the tRNA genes database [51]. 107 A.3 Analysis of SimFold accuracy on Group I Intron sequences from Gutell database [21]. 107 A.4 Analysis of SimFold accuracy on Group II Intron sequences from Gutell database [21] 107 v A . 6 Analysis of SimFold accuracy on 16S r R N A sequences from Gutell database [21]. . 109 A . 7 Analysis of SimFold accuracy on 23S r R N A sequences from Gutell database [21]. . 110 B . l Measurement of PairFold accuracy on a set of D N A duplexes 112 B .2 Measurement of PairFold accuracy on a set of R N A duplexes 114 B.3 Measurement of PairFold accuracy on a set of m R N A target - ribozyme pairs. . . 115 vi List of Figures 1.1 Central dogma in molecular biology for eukaryotes 2 1.2 Example of a pseudoknot-free secondary structure containing all elementary structures. 4 1.3 Structure of a hairpin ribozyme bound to an mRNA target 5 1.4 A simple pseudoknotted structure 5 1.5 Secondary structure of a 16S rRNA from Gutell Database [21]. The structure has been determined by comparative sequence analysis 7 3.1 Secondary structure of 5S Ribosomal R N A of Staphylococcus aureus [21]. This secondary structure, determined by comparative sequence analysis, con-tains \"odd-pairs\", marked by boxes: (CU), (A.G), (G.G) and (CC). . . . 23 3.2 Secondary structure of an arbitrarily chosen RNA sequence. It contains all the elementary structures, marked by different labels 24 3.3 (a) Example of a free energy table for stacked loops of the type shown in (b). (c) One particular value from the table, where x = G and y = C 25 3.4 (a) Partial snapshot of the table showing the free energy penalties up to the size of the hairpin loop or internal loop, (b) An example of an internal loop of length 4. . 25 3.5 (a) A free energy table for hairpin loops of the type shown in (b). An example, where x = G and y = A, is shown in (c) 26 3.6 (a) Examples of bonus values for hairpin loops of size 4. (b) An example of such hairpin, along with its associated bonus 27 3.7 (a) Table for internal loop terminal mismatches whose closing pair is (C,G). This type of internal loop is shown in (b). (c) An example of an internal loop, where the two terminal mismatches use values from the table in (a): x = G, y = A and x = A, y = G 27 3.8 (a) Free energies for symmetric internal loops of size 2, of the type shown in (b). (c) shows an example of such an internal loop, where x = G and y = A 28 3.9 (a) Table with free energy parameters for asymmetric internal loops of size 3, for the type shown in (b). (c) shows an example of such an internal loop, where x = C and y \u00E2\u0080\u0094 A, along with its free energy, (d) Table with free energy parameters for asymmetric internal loops of size 3, for the type shown in (e). (f) shows an example of such an internal loop, where x = A and y = C 29 vii 3.10 (a) Partial snapshot of the table showing the free energy for symmetric internal loops of size 4, of type shown in (b). (c) shows an example, where v = A, w = A, x \u00E2\u0080\u0094 G and y = G. . . . . \u00E2\u0080\u00A2 30 3.11 (a) Free energies for 3' dangling ends, of the type shown in (b). (c) shows an example of a 5' dangling end, where x = U. (d) Free energies for 5' dangling ends, of type shown in (e). (f) shows an example of a 5' dangling end, where x = C 31 3.12 Dangling bases between two branches of a multi-loop or multi-domain structure. . 32 3.13 A general stacked loop structure 33 3.14 A general hairpin loop 33 3.15 Example of a general internal-loop structure 35 3.16 Example of a general multi-loop structure, with k + 1 branches * 37 3.17 Example of a general multidomain structure, with k domains. . . . : 37 3.18 Sequence and secondary structure, which will be used to show step by step how the free energy change is calculated 39 3.19 Simple example of a pair of RNA molecules and the secondary structure of folding. 41 3.20 The boundary between molecules creates special types of structures 41 3.21 Example of a special stacked loop structure in a duplex of molecules 42 3.22 Example of a special hairpin loop structure in a duplex of molecules 42 3.23 Example of a special internal loop structure in a duplex of molecules 43 3.24 Example of a special multi-loop structure in a duplex of molecules 43 4.1 Evaluation of SimFold on tRNA genes: Distribution of the predicted structures after the level of accuracy, on intervals of 0.1 57 4.2 Evaluation of SimFold on tRNA genes: Correlation between the level of accuracy and the difference between the predicted minimum free energy and the free energy of the real structure, considering our model 58 5.1 Comparison between PairFold and SimFold speed (left); PairFold speed on a se-quence with different splitting points 66 5.2 Typical structure of a hairpin ribozyme bound to an mRNA target 71 6.1 The algorithm for finding the fc-suboptimal M F E combinations of a combinatorial set. 86 6.2 Performance of CombFold with k \u00E2\u0080\u0094 1,2,3,10 and ExhaustS, on sets with different characteristics: (a) 19 instances with s ranging from 1 to 19, and the same g = 2 and I = 10; (b) 12 instances with s ranging from 1 to 12, and the same g = 3 and I = 10; (c) 13 instances with g ranging from 1 to 13, and the same s = 6 and I = 10; (d) 10 instances with I ranging from 10 to 100, and the same s = 8 and g \u00E2\u0080\u0094 2; (e) 50 instances with the same characteristics: s = 10, g = 3,1 \u00E2\u0080\u0094 5; (f) 48 instances with the same characteristics: s \u00E2\u0080\u0094 8, g = 8,1 \u00E2\u0080\u0094 4 94 v i i i Acknowledgements First of all, I would like to thank my supervisor Anne Condon, who has guided me through research in bioinformatics since I did not even know what it meant. I am deeply grateful to her encouragements and her enthusiasm in dealing with challenging problems, her support, great research ideas she shared with me, and her responsiveness and open-mindedness to my opinions. I would like to thank Raymond Ng for being my second reader and for useful com-ments. Thanks to Holger Ffoos for wonderful discussions on various bioinformatics topics. Many thanks go to Dan Tulpan, for helping me when I was stuck, his friendship and be-ing the third (informal) reader of my thesis. I want to thank the people in Beta Lab and other collaborators who helped me throughout: Sanja Rogic, Viann Chan, Rosalia Aguirre-Hernandez, Jeremy Barbay, Greg Lakatos. I am grateful to Jeremy Barbay for his friendship, for giving me joy and happiness. Last but not the least, I want to thank my parents, my aunt Adi and my brother Calm for their support and trust. M l R E L A \u00C2\u00A7TEFANIA A N D R O N E S C U The University of British Columbia August 2003 ix \ To my grandpa. x Chapter 1 Introduction One of the most important problems in computational molecular biology is structure pre-diction of nucleic acids. R N A molecules' play a crucial role in cellular processes and their function is directly related to their folding into complex tertiary structures [48]. D N A and R N A strands are also used in biomolecular computations and in self-assembly of nanostruc-tures [9, 10]. Researchers have studied the problem of nucleic acids structure prediction for more than two decades. In this thesis, we start from the state-of-the-art algorithms for nu-cleic acid secondary structure prediction and extend them to the problems of (1) predicting secondary structure of a pair of nucleic acids and (2) finding, out of a combinatorial set of R N A or D N A short strands, which combination has the most stable secondary structure. In this chapter, we first give background on RNA, D N A and secondary structures. Then, in Section 2, we define the problems that we deal with in this thesis and we give motivations to support why this work is important. A summary of contributions is given in Section 3, and some notations and conventions that we will use throughout this thesis are enumerated in Section 4. The last section gives a brief outline of this thesis. 1.1 Background The central dogma in molecular biology (Figure 1.1) is that, in an organism, the genes, which constitute the genetic code made of DNA (deoxyribonucleic acid), are first replicated, and then transcribed into R N A (ribonucleic acid). This is premature messenger RNA (pre-mRNA), which in higher organisms (i.e. eukaryotes) are first processed to eliminate some non-coding sequences, called introns, and they become mature messenger RNA (mRNA). In simpler organisms (i.e. prokaryotes), there are no introns, and the genetic code is directly transcribed into mature mRNA. This is translated into proteins, which have well defined functions [19]. D N A and R N A molecules (also known as nucleic acids) are composed of sequences of four types of nucleotides or bases: Adenine (A), Cytosine (C), Guanine (G) and Thymine (T) for D N A or Uracil (U) for RNA. In organisms, the genetic material is usually double-stranded D N A and the R N A is single-stranded. For this reason, R N A is more flexible 1 Exon Intron Exon Gene (DNA) Transcription prermRNA Processing mRNA Translation protein ^ 0 < 9 \u00C2\u00BB t \u00C2\u00BB > 9 H \u00C2\u00BB 8 Figure 1.1: Central dogma in molecular biology for eukaryotes. and can form a much greater variety of complex three-dimensional structures than double-stranded DNA (dsDNA). However, single-stranded DNA (ssDNA), used in in vitro experi-ments or in DNA computing, can also form complex structures. The linear sequence of an RNA or DNA strand constitutes the primary structure or sequence. The set of base pairs that form when a nucleic acid sequence folds is called secondary structure. These pairings arise from the Hydrogen-bonding forces between pairs of bases. Thus, in an RNA or ssDNA secondary structure, each base can be either free (not bonded with any other base) or Hydrogen-bonded to another base. The tertiary structure is the three-dimensional geometry of the arrangement of bases in space. Much research has been done on understanding secondary structures, while the information we currently have about tertiary structures is relatively sparse. It is believed that once secondary structures are known, they can provide useful information about tertiary structures as well. Throughout this thesis, we focus on predicting secondary structures from primary structures of RNA or ssDNA molecules. Although there are other factors that influence secondary structure formation, it is believed that the sequence has the greatest contribution. The most common Hydrogen-bonds which will lead to secondary structure formation are between C and G, and between A and T (or U for RNA). They are called Watson-Crick (W-C) bonds. (C.G) pairs are more stable, because they involve three Hydrogen connections, as opposed to (A. T) or (A. U) pairs, which involve two Hydrogen connections. The sequence orientation of ssDNA and RNA strands is defined by two different strands,called the .5' end and the 3' end. Consecutive base pairs can be formed only if the sequences have opposite orientation. The following is a double stranded DNA primary 2 sequence of 30 nucleotides: 5'-ATGCGCGCTAGCATCGCTCGGCTAGCTGAT-3' 3 '-TACGCGCGATCGTAGCGAGCCGATCGACTA-5' Each base in the second strand is the W-C complement of the corresponding base in the first strand. They can bind and form a double stranded D N A because all the corresponding bases are complementary and because the strands are in opposite orientation. The same rule is available for R N A or ssDNA. Consider the following simple R N A sequence: 5'-CCCCCCCCCCAAAAAGGGGGGGGGG-3' It contains 10 C s , 5 A's and 10 G's. The fragment 5 ' - C C C C C C C C C C - 3 ' can bind to the fragment 3 ' -GGGGGGGGGG-5' (read from the 3' end to the 5' end), but it cannot bind to the fragment 5'-GGGGGGGGGG-3' (read from the 5' end to the 3' end), because it does not have the right orientation. Thus, most probably, the first C will bind to the last G, the second C will bind to the G before the last base and so on. The first step in understanding R N A or ssDNA secondary structures is to identify the substructures of which they are composed. We call elementary structure any substructure which cannot be decomposed in any other substructure. Figure 1.2 shows an example of a complex secondary structure, which contains all elementary structures that we consider in this work. The bullets represent bases, the thin gray line indicates the backbone that holds all the nucleotides together in the molecule, and the thick black lines indicate paired bases. The 5' and 3' ends are indicated, and a numbering for the base positions is provided, with the first base considered to be at the 5' end. Each base can only participate in at most one base pair. The elementary structures are marked by rectangles and their names are added aside. The elementary structures we consider here follow: \u00E2\u0080\u00A2 A hairpin loop or hairpin contains one closing base pair and all the bases between the paired bases are unpaired. The hairpin marked in the figure contains 5 free bases. Formally, the tuple defines a hairpin loop in a given secondary structure if i and j are paired, and k is a free base, \/k,i < k < j; \u00E2\u0080\u00A2 A stacked loop, also called stacked pair, contains two consecutive base pairs. The tuple defines a stacked pair if i and j are paired and i + 1 and j \u00E2\u0080\u0094 1 are paired. A stem or helix is made of a consecutive number of stacked loops. The helix marked in the figure has 5 stacked loops; \u00E2\u0080\u00A2 An internal loop, sometimes called interior loop, is a loop having two closing base pairs, and all bases between them are free. The tuple i',f), with i +1 < i! < j' < j \u00E2\u0080\u0094 1, defines an internal loop if i and j are paired, i' and j' are paired and A; is a free base, V 7 c , i < k < i! and j' < k < j. The asymmetric internal loop marked in the figure has 3 free bases on one side and 4 free bases on the other side. 3 hairpin loop Figure 1.2: Example of a pseudoknot-free secondary structure containing all elementary structures. A bulge loop, or simply bulge, is a special case of internal loop, which has no free base on one side, and at least one free base on the other side. Note that, in fact, a stacked loop is also a special case of internal loop, with no free bases on both sides. In this work, we will consider stacked loops and internal loops to be distinct elementary structures, but we include bulges in the internal loop case, unless otherwise specified; \u00E2\u0080\u00A2 A multi-branched loop or multi-loop is a loop which has at least three closing base pairs. The tuple (i,j,h,ji,..-im,jm), with m > 2,i < ii < ji < ... < im < jm < j defines a multi-loop with m + 1 branches if i pairs with j, i\ pairs with j i , . . . ,im pairs with jm and k is a free base, Vfc, i < k < i\, j\ < k < i2, ..., jm < k < j. The multi-loop marked in the figure has 3 closing base pairs and 6 free bases; \u00E2\u0080\u00A2 A multi-domain loop, or simply multi-domain, is a loop with at least one closing pair. The tuple [i,j,i\,J2, \u00E2\u0080\u00A2 \u00E2\u0080\u00A2 -,id,jd), with d > 0, i < j < ix < j2 < \u00E2\u0080\u00A2.. < id < 3d defines a multi-domain if i pairs with j, i\ pairs with j i , . . . , id pairs with jd and k is a free base VA;, 1 < k < i, j < k < i\, ..., jd < k < n, where n is the length of the sequence. We call domains the whole secondary structures which are closed by the closing pairs of a multi-domain. In other words, the domain closed by is the set of all base pairs whose indices are in the interval Note that if we virtually make the sequence 4 pseudoknot Figure 1.3: Structure of a hairpin ri-bozyme bound to an mRNA target. Figure 1.4: A simple pseudoknotted struc-ture. circular by merging the 5' and 3' ends together, we obtain a hairpin, internal loop or multi-loop. The multi-domain marked in the figure contains two branches and 7 free bases. These free bases are called external bases, because they are not inside any domain, but they are between domains or between a domain and a molecule end. Note that, if the whole structure contains only one domain and no external bases, then there is one multi-domain, and the pair (l.n) constitutes its only closing pair. The bases which are neighbours of a closing pair of a multi-domain or multi-loop are called dangling bases. The dangling bases neighbouring to the multi-domain marked in the figure have positions 2, 98, 101 and 123. The ones neighbouring to the multi-loop have positions 6, 57, 59, 93 and 94. When we refer to secondary structures, we assume that at least one base pair exists. In some situations, it is possible that no pairing between any two bases exist. In this case we say that the molecule is structure free. All these elementary structures have been used before with these names [24, 33, 71], except the multi-domain structure, whose exact name we introduce here for convenience later. Another way to understand a secondary structure is to think of it as a set of helices, connected to each other by internal loops, bulges, multi-loops or multi-domains, and some of them ended in hairpins. Note that the secondary structure represented in Figure 1.2 is just a graphical, convenient way to visualize the set of base pairs of the folded molecule. In other words, the geometrical positions of the base pairs and free bases do not have any meaning and do not matter other than for visualization purposes. In our representation of secondary structures, if the backbone location is straightforward, we omit it. The structure used in Figure 1.2 will be used again in Chapter 3, together with its R N A sequence (Figure 3.2). 5 Extending the notion of secondary structure for a single molecule, the secondary structure for a pair of molecules S\ and \u00C2\u00A32 is a set of base pairs, with each base of \u00C2\u00A31 and \u00C2\u00A32 occuring in at most one pair. Figure 1.3 shows an example of a duplex structure, which represents the typical structure of a hairpin ribozyme (the bottom structure), binding to its mRNA target (the top structure)-. Having this structure, the ribozyme executes its function by cleaving the mRNA target at the site indicated by the red arrow. The loops that we just enumerated are all elementary structures, which are cur-rently considered by computational programs for secondary structure prediction without peudoknots. Pseudoknots are very important structures, being sometimes crucial for the R N A function, but making prediction more complicated. The tuple defines a pseudoknot if i and j are pairs, i! and j' are pairs and i < i' < j < j'. Pseudoknots are not considered in this work. Figure 1.4 shows an example of a simple pseudoknot, marked by a rectangle. It contains two helices, and if we take any pair from the first stem and any pair from the second stem, they satisfy the inequality mentioned above. Inside the pseudoknot and/or outside the pseudoknot, the structure can have any elementary structures discussed before, or even other pseudoknots. Tertiary interactions between a base which is already paired and a free base, or between two already paired bases, are possible. These and other interactions, such as Hy-drogen bonding between a base and the backbone, between backbone and backbone, binding of metal ions, water interactions, all contribute to the stability of R N A structure. These are part of tertiary structures and are not considered in secondary structure estimations. An example of a long, complicated structure determined by comparative sequences analysis [22] in Gutell Lab [21] is given in Figure 1.5. It contains all elementary structures, three pseudoknots and two tertiary interactions. In the analysis of the algorithms we propose, we will refer to different types of RNAs. Thus, more background describing the most important R N A classes is necessary. Depending on its type, R N A molecules can have different functions. RNAs can be grouped in two general classes [19]: (1) The informational RNA class is formed of messenger R N A (mRNA and pre-mRNA), the intermediary between the genes and the proteins, which are illustrated in Figure 1.1. (2) Functional RNAs, which are not translated into proteins, but are active as RNA. Although the genes that encode functional RNAs are relatively few, the active RNAs count for a large percentage of the R N A in the cell. They are more stable than informational RNAs, and they need abundance to carry out their function. Three very important classes of functional RNAs that we will refer to later in the thesis are: (1) transfer RNAs (tRNAs) are short R N A molecules (less than 100 bases long), which have an important role in mRNA translation into proteins; (2) ribosomal RNAs (rRNAs), some of them exceeding 4000 bases in length, which also have a central role in the translation machinery. They are part of ribosomes, which are large macromolecular assemblies composed of several types of rRNAs and about 100 different proteins. (3) ribozymes are R N A molecules which act as enzymes and catalyze a specific biological reaction. Recent studies of last ten years reveal that 6 Secondary Structure: small subunit ribosomal RNA c A * f t u u c CUO.OQCUO a \u00E2\u0080\u0094 c C \u00E2\u0080\u00A2 A C \u00E2\u0080\u00A2 A C-0 COCCOAOCCC004COAAACUUQUOAOAUUAAAAA40UUUCAUUUO C A , , / / , * U U Q A v V c \u00C2\u00B0 l . l a O U n n * c * CQA o ccu a uA u uoQ M M I i i i i i i i i i U C U C . Q Q A C A U A A C AA U A , ?f u*Cc Oc C \u00E2\u0080\u00A2 U A \u00E2\u0080\u00A2. ,c 0 \u00E2\u0080\u0094 c A A Q O Q C I M I I uau u u c c c o U uo U U U C C O U A O O u ' I I I I I ' I I I 1 Q A A O O C O U C C A' ; C , - C * 0 Q . \ ' , ,A f ? M I M l CcCCOC CQC \u00E2\u0080\u009E G O C . U C C Q A \"fcuuc I \u00E2\u0080\u00A2 I I O U A C flCUGUOOU, COAOAUCu u c - o c - o Q - C U \u00E2\u0080\u00A2 0 o - c Homo sapiens (K03432) 1. Eukaryota 2. eukaryote crown group 3. Fungi/Metazoa group 4. Metazoa 5. Eumetazoa 6. Bilateria 7. Coelomata 8. Deuterostomia 9. Chordata 10. Craniata 11. Vertebrate 12. Gnathostomata 13. bony vertebrates 14. lobe-finned fish and tetrapod clade 15. Tetrapoda 16. Amniota 17. Mammalia 18. Theria 19. Eutheria 20. Primates 21. Catarrhini 22. Hominidae 23. Homo January 2000 Citation and related Information available at http://www.rna.lcmb.utexas.edu Figure 1.5: Secondary structure of a 16S rRNA from Gutell Database [21]. The structure has been determined by comparative sequence analysis. 7 functional RNAs have a much more important role than originally thought, when proteins were considered to do nearly all functional tasks in the cell. 1.2 P r o b l e m s a n d m o t i v a t i o n s In this thesis, we try to solve two main problems: (1) secondary structure prediction of a pair of nucleic acid molecules and (2) secondary structure prediction of combinations formed from a combinatorial set of nucleic acid molecules. In this section, we briefly explain the problem statements and motivate their study. Secondary structure prediction of a pair of molecules We define the basic problem of secondary structure prediction of a pair of nucleic acid molecules as follows: given two R N A sequences Si and S2 and a thermodynamic model M , find the pseudoknot-free secondary structure R with the smallest free energy change under the model M, in which Si and S2 can fold. Note that the solution R may include base pairing between the two molecules, as well as base pairing within each molecule. The algorithm we propose for solving this problem is called PairFold and is presented in detail in Chapter 5. It is motivated by predicting interactions between any two D N A or R N A molecules. Examples include: (1) a ribozyme and an R N A target [25, 27, 43, 47, 53, 60, 68]; (2) a probe or primer and a target R N A molecule [68]; (3) pairs of D N A strands in D N A code design [55]; (4) pairs of strands in biomolecular nanostructures [39, 69]; (5) molecular tags in a polymer library [10]. Concrete examples where PairFold is useful will be given in Section 5.5. Secondary structure prediction of a combinatorial set of molecules Consider the following sets of short R N A strands (also called words): Si S2 5*3 5*4 5*5 1 UAGCGA CAGCGUAAUAU AUGCG AUAGCGGUA AUCG 2 AUAGAU AGAUGCGCGGU GAGCGCAAG CUGC 3 UAGGCUAGCGU GCGA If we take one word Wij from line j of each set Si and concatenate them together, we obtain what we call combinations. For example, the sequence UAGCGA CAGCGUAAUAU AUGCG AUAGCGGUA AUCG is the combination w\ 11021 wz\u>4iu>5i- Note that the number of possible combinations is exponential in the number of sets which contain at least two words. The input set of sets will be called Input-Set, and the set of all possible combina-tions will be called Combinatorial-Set. We define the basic problem of secondary structure prediction of a combinatorial set of nucleic acid molecules as follows: given an Input-Set 8 IS and a thermodynamic model M , predict which combination, out of all elements of the Combinatorial-Set CS formed from IS, folds to a pseudoknot-free secondary structure with the lowest minimum free energy. The algorithm we propose is called CombFold, and is described in detail in Chapter 6. Applications where CombFold is useful include: (1) biochemical experiments altering a consensus nucleic acid sequence, such as directed mutagenesis or S E L E X experiments [38]; (2) information storage in DNA computations, where all combinations of a combinatorial set are used. It is important that none of the strands fold on themselves in the temperature range at which they are used [9, 10, 14]; (3) combinatorial sets are also used as molecular bar codes in applications such as massive parallel signature sequencing [10]. 1.3 Contributions This section describes the main contributions of this thesis in the order they are discussed: 1. The Nearest Neighbour Thermodynamic Model (NNTM) for R N A secondary struc-ture prediction has been described before by Zuker et al. [71]. However, their report misses some cases when compared to their implementation of mfold and the similar program from the Vienna R N A package [24, 61]. Thus, a new implementation of the R N A secondary structure prediction algorithm by Zuker and Stiegler [72] would be difficult without consulting previous implementations. The description of N N T M that we give in Chapter 3 is meant to be a complete description of this model. A thorough understanding of it is necessary for new implementations or extensions of the secondary structure prediction algorithm; 2. A thorough analysis of the accuracy of our program SimFold and of two popular secondary structure prediction programs (mfold [35, 70] and RNAfold [24, 61]) is the first one showing more characteristics about the data sets and also pointing out that the performance on some tRNA sequences, in terms of percentage of base pairs found, appears to be poorer than reported earlier by Mathews et al. [33] (Chapter 4); 3. An algorithm similar to our method underlying PairFold, to predict secondary struc-tures of pairs of nucleic acid molecules, has been briefly described by Mathews et al. [32], but no analysis on real biological data has been performed. We give a thor-ough analysis of PairFold on sets of ribozymes found in the literature and on primer binding prediction. We also discuss several other useful applications of PairFold, such as prediction of D N A or R N A duplex formation and D N A code design. Moreover, we created the RNAsoft web site [41], which offers online access to PairFold. Information about this web site have been published in a special issue on Web based software of Nucleic Acids Research journal [5]; 4. Our algorithm underlying CombFold has been developed in parallel by Cohen and Skiena [12], but for a different problem. We offer an analysis of data sets from the 9 D N A computing literature, and give other useful applications, such as for S E L E X experiments. Part of this work has been presented at the D N A 8 Conference (Eight International Meeting on DNA Based Computers, Hokkaido, Japan, 2002) and ap-peared in the conference proceedings [6]. A full version of this paper has been ac-cepted to appear in the journal Natural Computing [6]. Online access to CombFold is also available on the RNAsoft web site [41], and information about this software have been published in the Nucleic Acids Research Journal [5]; 5. Finally, our extension to CombFold for calculating suboptimal combinations is the first one that we are aware of. There are at least two reasons for which this is useful information: (1) for both DNA computation and S E L E X experiments, where knowing the next most stable combinations is important; and (2) returning the k most stable combinations can cover some of the impreciseness of the energy model we are using. We also provide two heuristics to evaluate the possible combinations and give a quick answer, which is not guaranteed to be the optimal one, but most probably is close to the optimum. 1.4 Notations and conventions Throughout this thesis, the following notations and conventions will be used: \u00E2\u0080\u00A2 Given a sequence S, its default orientation is from the 5' end (left) to the 3' end (right), unless otherwise stated; \u00E2\u0080\u00A2 To represent secondary structures, sometimes we use the dot-parenthesis (or dot-brackets) format. This contains three characters: \"(\", \")'.' and A left bracket corresponds to a base which is paired to a base upstream, a right bracket denotes a base paired to a base downstream, and a dot denotes a free base. For example, the secondary structure of the simple R N A sequence given in Section 1.1 can be depicted as follows: CCCCCCCCCCAAAAAGGGGGGGGGG (((((((((( )))))))))) Note that this representation is only valid for pseudoknot-free structures. \u00E2\u0080\u00A2 Two nucleotides between round brackets and separated by a point (e.g. (A.U)) signify a base pair. 1.5 Thesis outline The most important work related to the research reported in this thesis is first presented in Chapter 2. The basis of the secondary structure prediction calculations considered in 10 this work resides in the standard Nearest Neighbour Thermodynamic Model, which will be described in great detail in Chapter 3. Our two algorithms: PairFold and CombFold, are extensions of the free energy minimization algorithm [72]. Full understanding of this method is necessary for understanding our proposed solutions. We call this basic algorithm SimFold, and we explain its underlying equations in Chapter 4. A discussion on complexity and a thorough performance evaluation are given. Chapter 5 explains our algorithm for structure prediction of pairs of molecules, PairFold, and gives several applications and per-formance evaluations on biological data. Chapter 6 proposes a polynomial time algorithm for prediction of the most stable combination of a combinatorial set, CombFold. An exten-sion, to predict the k most stable combinations, is included. Two heuristic approaches are discussed, along with practical applications. Finally, we conclude the thesis in Chapter 7 and we present ideas of how this work can be continued in several ways. Some details additional to data analysis discussed in Chapters 4 and 5 are given in Appendices A and B. Because the two algorithms we propose are based on the Nearest Neighbour Thermo-dynamic Model, which is complex and not very well explained in the literature, we dedicate the whole Chapter 3 to detailed explanations of this model, although this is not our work (other than collecting the proper information). Similarly, since our two solutions are in fact extensions of the Zuker and Stiegler's [72] free energy minimization algorithm for secondary structure prediction, we explain it in detail in Section 4.1. Our work starts with showing that our implementation of Zuker and Stiegler's algorithm is reliable (Sections 4.2-4.4) and continues with our proposed solutions, in Chapters 5 and 6. 11 Chapter 2 Related Work R N A folding and secondary structure are very important for R N A function. Experimental studies on R N A secondary and tertiary structures have been carried out continuously after the Watson-Crick binding discovery in 1953. In the last 25 years, applications of computa-tional methods to problems in molecular biology, including R N A folding, became more and more prominent. The quantity of information about R N A structures is rapidly increasing every year. However, depending on the length of the given R N A sequence, the number of possible sec-ondary structures can be very high. For instance, for a 16S rRNA of 1500 nucleotides, there are approximately 15,000 possible helices (less than 100 will be in the final structure). The maximum number of combinatorial arrangements of all possible helices, which will eventually lead to different structures, is about 4.3 x 10 3 9 3 [22]. In vitro experiments for some specific RNAs such as hairpin ribozymes show that they behave very similarly to the experiments in vivo [48]. However, there are studies which show that in vitro R N A folding experiments cannot reliably simulate the complex intracellular environments existing in the cell [48]. Thus, there is still a lot of unknown information about what happens in the cell. The computational methods that currently exist try to capture as much information as possible from the existing knowledge. Still, this knowledge is rudimentary with respect to some rules and contributions of other factors that participate to the folding of RNA. More-over, in order to ensure that the computational methods are efficient (and thus practical), thermodynamic methods are often highly simplified. Section 1 of this chapter describes previous work related to the secondary structure prediction for single R N A or D N A molecules. This work is related to all three algorithms described in this thesis: SimFold, PairFold and CombFold. Section 2 presents work for predicting secondary structures for pairs of molecules, and this is relevant to our PairFold algorithm. Section 3 gives information about algorithms for predicting secondary structures of combinatorial sets, which is directly related to CombFold. 12 2.1 Secondary structure prediction for single molecules In this section, several different methods are explained for predicting secondary structures without pseudoknots, and two algorithms for predicting pseudoknotted structures are sum-marized. Free energy minimization algorithm A very popular algorithm for finding the minimum free energy (MFE) secondary structure without pseudoknots of an R N A molecule is Zuker and Stiegler algorithm [72]. This method is the basis of the algorithms proposed in this thesis, and will be described in great detail in Chapter 4. Its input is the primary R N A sequence, and it uses a dynamic programming algorithm to find the secondary structure with the minimum free energy. The basis of this method is the Nearest Neighbour Thermodynamic Model (NNTM), which will be described in Chapter 3. N N T M contains rules about base pairing formation and tabulated free energy and enthalpy parameters. Briefly, the main idea of the free energy minimization algorithm is that the bases of the R N A molecule are numbered from 1 to n, starting from the 5' end and finishing at the 3' end. Then, for each i and j with 1 < i < j < n, the problem is to determine which of the four elementary structures (hairpin loop, stacked loop, internal loop or multi-branched loop), with the exterior pair ( i j ) , has the lowest free energy. Recurrence relations are applied and a two dimensional matrix with all minimum free energies for each i and j is filled. As for the standard dynamic programming algorithm, backtracking is necessary to build the path (i.e. the set of base pairs) that gives the M F E secondary structure. The complexity of Zuker and Stiegler's algorithm is <2(n4) for time and <9(n2) for space. It has been reduced to 0(n 3 ) for time by Lyngs0 et al. [31], but the space required increased to 0(n3). Other approaches [24] assume that the free bases on both sides of internal loops are bounded by a constant c (e.g. c = 30). This reduces the time complexity of Zuker and Stiegler's algorithm to 0(n 3 ) with no penalty on the space. Wuchty et al. [66] extended the M F E secondary structure prediction algorithm to generate all suboptimal secondary structures between the M F E and an upper limit. Gener-ating suboptimal structures is important for at least two reasons [66]: (1) The energy model on which the minimization algorithm relies is imprecise. Also, there are unknown biological constraints, which are not taken into consideration by the energy model. Thus, the true M F E structure might be one of the suboptimal structures with respect to the parameters used. (2) Under physiological conditions, R N A molecules might fold to alternative struc-tures, whose energy difference is small. Also, it is speculated that specific folding pathways capture molecules in local minima [20]. Mathews et al. [33] show that, on average, the accuracy of the prediction algorithm increases by more than 20% when 750 suboptimal structures are generated, as opposed to generating the M F E structure only. Different implementations of the free energy minimization algorithm exist. The 13 program mfold [70, 71] was the first one to implement Zuker and Stiegler's algorithm and is available online [35]. It also incorporates Wuchty et al.'s [66] extension for generating suboptimal structures. The Vienna R N A Package [24] implements Zuker and Stiegler's algorithm, together with the partition function calculation, which will be discussed later in this section. It is available online and is free open source software [61]. Partition function algorithm McCaskill [34] proposed another dynamic programming algorithm for pseudoknot-free fold-ing of an R N A molecule, which permits calculation of probabilities of various structures. This involves calculation of the partition function: s from statistical mechanics, where the sum goes over all possible structures in which the R N A molecule can fold. Although this sum has a number of terms that is exponential in the molecule length n, the partition function calculation can be done in time 0(n 3 ) . Once the partition function Q is calculated, we can calculate the probability of a given structure S: P(S) = ^e-&G(S)/RT_ fjowever, what is more relevant for biological function of R N A structures is an ensemble of related structures (also called kinetically clustered objects) interchanging more or less rapidly between each other [34]. The focus is on the equilibrium probabilities of substructures common to an ensemble or class of related structures. These substructure classes are very important because they allow the display of the most signif-icant features of the ensemble. Finally, the equilibrium probability of occurrence for each possible base pair can be calculated, and a mirror image including the base probabilities (one triangle) and the optimal structure (another triangle) can be drawn for good visual-ization. McCaskill [34] evaluated his method on four biological R N A sequences with known structures. He showed that the real base pairs have been predicted with high probability, although not always the highest probability. The partition function algorithm has been incorporated in the Vienna R N A Package [24, 61]. There are two major drawbacks of both the free energy minimization algorithm and the partition function algorithm: (1) they cannot predict pseudoknotted structures; (2) they heavily rely on the simplified thermodynamic model. Comparative analysis methods Comparative analysis methods predict secondary structures and early stages of tertiary structures of evolutionary related R N A molecules. They overcome the two major drawbacks of the aforementioned methods: pseudoknotted structures and imprecise thermodynamic model. Gutell Laboratory [11, 21] has started determination of the 16S and 23S rRNA secondary structures since early 1980s, when only two molecules of each class were available. 14 The comparative analysis method is based on two simple and profound principles [22]: (1) \"different R N A sequences can fold into the same secondary and tertiary structures\"; (2) \"the unique structure and function of an R N A molecule is maintained through the evolutionary process of mutation and selection\". In 1999, 7000 homologous 16S and 1050 23S aligned rRNA sequences where used in covariation-based structure models [22] and the result was compared to the experimentally determined high-resolution crystal structures of the 30S and 50S ribosomal units (which include 16S and 23S rRNAs, respectively). Several methods constitute the class of comparative analysis: (1) covariation analysis predicted 97-98% of the base pairs which are present in the 16S and 23S rRNA crystal structures; (2) tentative covariation-based method predicted about 45% of the base pairs; and (3) motif-based method predicted 70% of the base pairs. In conclusion, these methods predicted nearly all of the standard secondary structure base pairings and helices in the 16S and 23S crystal structures, and they have also identified tertiary base-base interactions. The major drawback of this method is that a large number of evolutionary related sequences is necessary for good accuracy. Combinations of using the thermodynamic model as well as comparative analysis have been tried. Hofacker et al. [23] presented a method for computing the secondary struc-ture of a set of aligned R N A sequences, using both thermodynamic stability and sequence covariation. They show that only 5 rRNA related sequences and an automatically generated alignment were necessary to correctly .predict over 80% of the base pairs. Their program is implemented under the name of RNAalifold, which is available online from the Vienna R N A Package web site [61]. Pseudoknotted secondary structure prediction Predicting R N A secondary structures including pseudoknots from the primary sequence of a molecule and using a thermodynamic model is a great challenge. Firstly, the pseudoknots can be very complex; secondly, the forces that drive the formation of pseudoknots are not well understood, thus the model being a rough approximation of what is believed to happen. Lastly, and most importantly, it has been proved that finding a minimum energy structure among all possible pseudoknots is NP-hard, although the energy function used in the proof is highly idealized [30]. Rivas and Eddy [40] proposed a free energy minimization dynamic programming al-gorithm which, apart from the elementary structures considered by Zuker and Stiegler [72], also includes pseudoknots. This was considered to be the first algorithm to optimally deter-mine a large class of pseudoknots. The algorithm is complex and its worst case complexity is 0(n 6 ) for time and 0(n 4 ) for space. They show that their algorithm does not predict spurious pseudoknots for a set of tRNA. They also predict 54 out of 63 SELEX-selected HIV-l-RT-ligant simple pseudoknots, and most pseudoknots in short viral RNAs. The recent work of Dirks and Pierce [13] introduces a partition function algorithm for 15 nucleic acids secondary structures which contain the most physically relevant pseudoknots. The algorithm has a complexity of 0(n 5 ) for time and 0(nA) for space. Although the class of predicted pseudoknots is more restricted than in [40], this algorithm has the advantage of permitting study of conformational ensembles of most relevant structures. On the class of pseudoknots they can predict, they show their results are slightly worse for the false positives and slightly better for the true positives when compared to Rivas and Eddy's algorithm. The thermodynamic parameters for the pseudoknots have been generated individually by each group. 2.2 Secondary structure predic t ion of pairs of molecules The closest related work to PairFold for predicting pseudoknot-free secondary structures for pairs of nucleic acids is by Mathews et al. [32]. They predict equilibrium affinity of complementary D N A or R N A oligonucleotides to an R N A target. Briefly, given a structured long R N A molecule S = S1S2S3 \u00E2\u0080\u00A2 \u00E2\u0080\u00A2 \u00E2\u0080\u00A2 sn and the oligomer length I, their first program, called OligoWalk, generates the oligos which are Watson-Crick complements of every window of length I from S. There are n \u00E2\u0080\u0094 l + l such possible oligos, for example the first one being 3' \u00E2\u0080\u0094 s~\s~2 . . . s~i \u00E2\u0080\u0094 5', where Si is the Watson-Crick complement of S j . The target polymer can be in the folded or unfolded state, and the same for the oligomer, which can also be unimolecular or bimolecular. Assuming that the oligo will disrupt preexisting structure in the region of complementarity and will form a helix, OligoWalk calculates the standard free energy change of the duplex, which they call A G 3 . Then, they calculate \"the overall free energy change of binding\", AG\u00C2\u00B0verall, which takes into consideration the stability of the newly formed helix and the total concentration of the oligonucleotides. Performing a walk of n \u00E2\u0080\u0094 l + l steps along the target RNA, they try to find correlations between A G 3 or AG\u00C2\u00B0verall and three types of experiments drawn from the literature. They find correlations suggesting that OligoWalk can be useful for designing oligomers capable of binding to targets. The drawback of this method is that there is no guarantee whether or not the oligomer will disrupt the local structure of the target RNA. Even if the free energy of the duplex is low, thus the duplex being considered to be stable, there might be another region in the target R N A to which the target site would prefer to bind. In primer design and other oligomer-target binding problems, one would like to predict whether a given oligomer will be able to bind to that specific target site, and eventually use this information for primer design. In the same paper, Mathews et al. [32] report a second program, an extension of mfold [70], but for bimolecular secondary structure prediction. Given two nucleic acids Si and 52, a third sequence S is created by concatenating the two sequences and by adding a 3-nucleotide \"molecular linker\" between them. The three nucleotides will be restricted not to bind to any other bases in the two given sequences, and the free energy of the loop thus created will be an intermolecular initiation penalty plus the dangling ends free energies. 16 The intermolecular linker can appear in hairpin loops, bulges or internal loops, multi-loops or external loops. The Zuker and Stiegler's [72] algorithm is modified to deal with these situations. The two programs: OligoWalk and the bimolecular secondary structure prediction program (which we refer to as BSSP) are incorporated in the RNAstructure package [42], a software package for Windows. However, only OligoWalk out of the two seems to be functional at the date of this thesis. The algorithm behind the BSSP program seems to lead to the same results as what we propose as PairFold. However, PairFold is a more elegant extension of the Zuker and Stiegler's algorithm. Since, to our best knowledge, no evaluation of the BSSP program has been performed, it is not clear whether the results of the two programs are always the same. A software tool close to OligoWalk is ProbeSelect by Li and Stormo [29], which designs probe oligomers for D N A microarrays. The best probes are selected based on the most favorable free energy, and also on maximization of the difference between this free energy and the free energy of the probe binding to any other mismatched target. To calculate this free energy quickly, they created a heuristic which is assuming the perfect alignment of alternatives. They expect their program to provide a good approximation to the optimal probes set for a complete genome. Another, simpler extension of the mfold program is the \"2-state hybridization server\" [70] which is available online on Mfold web page [35]. It also adds a linker between the two in-put sequences, and it forces the bases in the linker not to pair. To our understanding, only the hairpin loop special case is covered. The free energy, enthalpy, entropy and melting temperatures are returned, but no information about the secondary structure is output by the program. HyTher software tool, available online at [26], takes as input two R N A / R N A , R N A / D N A or D N A / D N A sequences of equal length. It only calculates the free energy of stacked pairs or mismatches at the corresponding positions in the two input sequences. No minimiza-tion algorithm for finding the most favorable duplex structure is performed, the assumption being that the two input sequences will have matches or mismatches at corresponding po-sitions. Thus, the free energy returned with this assumption may be greater than the free energy returned by a minimization algorithm (such as PairFold). Also, input sequences of different length are not accepted. 2.3 Secondary structure prediction of combinatorial sets An 0(n 3 ) algorithm for finding the combination with the smallest M F E structure out of all possible combinations in a combinatorial set has been developed by Cohen and Skiena [12]. The problem they tried to solve was determining, among all R N A sequences coding for a specific protein, which has the most stable secondary structure. They were interested in 17 finding whether the real mRNA sequence was close to the R N A sequence they found. Their underlying algorithm is very similar with our CombFold algorithm, it uses the nearest neig-bour thermodynamic model and is based on Zuker and Stiegler's [72] algorithm. Applying their algorithm on 200 short microbial R N A sequences, they found that a minimized se-quence has, on average, a predicted M F E of 2.657 times lower than the naturally occurring sequence. 18 ) Chapter 3 The Nearest Neighbour Thermodynamic Mode l All living organisms consume energy continuously. Their metabolism transforms energy to heat, which is dissipated to the environment. Thermodynamics originates from the Greek words therme (heat) and dynamis (power) and is the science which describes the relationship between different forms of energy. In thermodynamics, the part of interest (e.g. an organism) is seen as a system, and the rest of its universe is defined as the surroundings. Whether or not a system can exchange energy with its surroundings, it is said to be open or closed. All living entities are open systems. The algorithms described in this thesis use the Nearest Neighbour Thermodynamic Model (NNTM). This model assumes that the stability of a specific base pair depends on the neighbouring bases. Base pair stability is measured by the standard free energy change AG\u00C2\u00B0. It is believed the most stable secondary structure of an R N A molecule or single-stranded D N A molecule is the one which has the lowest possible free energy change. In other words, the lower the free energy change, the more stable the secondary structure is. First, we have to mention that none of what is explained in this chapter is our work. Instead, it is a thorough explanation of the NNTM model, which is heavily used in the algorithms we propose in this thesis. Section 1 describes the background on which the NNTM model is based. Section 2 explains the thermodynamic parameters determined by Turner's Lab [57]. The way these parameters are actually used for calculating the free energy of a secondary structure is explained in Section 3. An example of such calculation is given in Section 4. The model and parameters described in this chapter can also be used for pairs of R N A / R N A or D N A / D N A sequences. This particular situation is described in Section 5. We conclude the chapter with a discussion and some limitations of this model. 19 3.1 Background for N N T M Calculation of standard free energy, enthalpy and entropy changes, equilibrium constant of the reaction and melting temperature are very important for determining characteristics of folded R N A or DNA. This section gives the background necessary to understand the role and the meaning of these parameters and the mathematical relations that connect them to each other. It is important to mention that throughout this thesis we do not consider the dy-namics of the R N A or D N A folding process. We only look at the R N A or D N A in the folded state, i.e. after the folding process reached an equilibrium and its state does not change any more (or the changes are insignificant). Free energy, enthalpy and entropy Free energy indicates the direction of a spontaneous change. It was introduced by J . W. Gibbs in 1878, and it is abbreviated G. AG represents the work done by a system at constant temperature and pressure when undergoing a reversible process. This system will spontaneously evolve in the direction that minimizes the Gibbs function: A G = AH - T \u00E2\u0080\u00A2 AS, (3.1) where G is the free energy, H is the enthalpy, T is the absolute temperature (in degrees Kelvin (K)) and S is the entropy. The quantity A G (measured in kcal/mol) is negative for \"energy-releasing\" pro-cesses, and positive for \"energy-consuming\" reactions. Sometimes we use the notation AGT to denote the free energy change at a specific temperature T. Enthalpy (H) is a measure of the heat flow that occurs in a process. The enthalpy change (AH) for an exothermic reaction (i.e. the heat flows from the system to the sur-roundings) is negative. The enthalpy change for an endothermic reaction (i.e. the heat flows from the surroundings to the system) is positive. The enthalpy (or enthalpy change) is measured in kcal/mol. Entropy (S) is a thermodynamic function which measures the disorder of a sys-tem [54]. Thus, the entropy change AS measures the change in the degree of disorder. If A 5 is positive, it means there was an increase in the level of disorder. A negative value indicates a decrease in disorder. The entropy (or entropy change) is measured in kcal/(mol-K) or entropy units (leu = leal/(mol- K)). It is worth mentioning that enthalpy and entropy do depend on the temperature, but the dependency is very small, at least for short molecules [46], and they are not taken into consideration, but considered fixed for any temperature between 0\u00C2\u00B0C and 100\u00C2\u00B0C. Also, the standard free energy and entropy changes depend on the salt concentration [45], which in our study is assumed to be 1 M NaCl, and cannot be changed. 20 A G \u00C2\u00B0 , AH0 and AS\u00C2\u00B0 denote standard free energy, enthalpy and entropy changes, i.e. measured at standard conditions such as pressure (1 atm), and the temperature of interest. For nucleic acids, it is very important to determine thermodynamic parameters at the human body temperature [46] as accurately as possible. It has been shown the measurements for free energy change are more accurate than for enthalpy and entropy change (a standard error of 2-5% for AG\u00C2\u00B037, versus 5-8% for AH0 and AS\u00C2\u00B0) [46]. Since we look at the R N A or DNA secondary structure only when the folding process reached its equilibrium, we measure the standard free energy, enthalpy and entropy changes. If, at some temperature, enough energy was consumed by the process, then interactions between the nucleotides of the molecule can happen. In this case, we say the molecule has secondary structure. To get to this state, the process was energy-releasing and exothermic (hence AG\u00C2\u00B0 < 0 and AH\u00C2\u00B0 < 0) and it tended to get ordered (hence AS\u00C2\u00B0 < 0). If not enough energy was consumed, then there are no bondings between the bases. In this situation, we say that the molecule is structure free, i.e it does not have secondary structure. The standard free energy, enthalpy and entropy changes will be 0. Equilibrium constant Consider the general reaction scheme aA + bB ^ cC + dD. Let CA, CB, Cc and CD denote the concentrations of the reactants A , B and the products C, D, measured in mol/l (or cc _c^ Molar (M)). k = J*' ^ j s known as the equilibrium constant of the reaction. The free energy change is given by: A G = AG\u00C2\u00B0 + R-T \u00E2\u0080\u00A2 ln(k), where R is the gas constant (1.98717 cal/(mol- K)), and T is the absolute temperature. If the system is at equilibrium, A G = 0. Hence, A G \u00C2\u00B0 can be calculated if the concentration of the reactants and products is known: A G 0 = -R-T -ln(k) (3.2) Depending on the reactants' type and their concentration, the equlibrium constant k can get one of the following values: \u00E2\u0080\u00A2 k = 1/(G^ + CB) for self-complementary oligonucleotides1; \u00E2\u0080\u00A2 k = 4/(CA + CB) for non-self complementary molecules if CA = CB', \u00E2\u0080\u00A2 k = 1/(CA + CB/2) for non-self complementary molecules if CA > CB-1 Short sequences (~ 20 bases or less), which perfectly fold to themselves at the midpoint. For example, the D N A oligonucleotide CGATAATCG is self-complementary since the first base can pair with the last base, the second base can pair with the base before the last base etc. 21 Melting temperature From equations 3.1 and 3.2, we can determine the melting temperature Tm of different types of molecules: (1) The melting temperature of double-stranded DNA or an RNA molecule bound to a complementary molecule is defined as the temperature at which 50% of the strands are in the double-helical state and 50% are in the unfolded state [45]; (2) The melting temperature of a long RNA or single-stranded DNA molecule is the temperature at which 50% of the base pairs have been denatured, leading to molecules containing alternating stems and denatured regions (loops) [64]. A H\u00C2\u00B0 273.15 (3.3) AS\u00C2\u00B0-R- ln{k) This formula gives the melting temperature in degrees Celsius and assumes an ionic concentration \Na+] of 1M. Hence, it does not show the dependence of the melting temper-ature on the ionic concentration. The following formula takes it into consideration [64]: T\" = AS' - T l\u00E2\u0080\u009E(k) + 1 6 6 - *<*\u00E2\u0080\u00A2 1.0 i%L+] ~ 269-3 <3-4' The standard free energy changes at 37\u00C2\u00B0C (AG^) and the standard enthalpy changes AH0 were experimentally determined by Turner's Lab for RNA [49], and by SantaLucia Lab for DNA [45]. Parameters for RNA have been refined by Mathews et al. [33] mainly by knowledge based methods. Using equation 3.1, one can determine the standard free energy change at any temperature between 0\u00C2\u00B0C and 100\u00C2\u00B0C. Also, using equations 3.3 and 3.4 input concentrations, and reactants type, one can determine the melting temperature of the given molecule(s). Note that in the following sections, when we refer to free energy or energy, we mean standard free energy change (AG0). Similarly, when we refer to enthalpy or entropy, we mean standard enthalpy change AH\u00C2\u00B0 and standard entropy change AS\u00C2\u00B0. 3.2 Thermodynamic parameters Thermodynamic parameters for RNA and DNA folding have been determined by differ-ent methods such as optical melting methods [46, 67], absorbance melting curves, mi-crocalorimetry [46] and knowledge-based methods using databases of known structures [33]. All parameters for RNA that we use have been published [16, 33, 49, 57, 58, 59, 65]. For DNA, only parameters for stacked loops [45], single mismathes [1, 2, 3, 4, 37] and dangling ends [7] have been published. However, for DNA we use the same model as for RNA, our base of parameters containing unpublished results obtained by communication with John SantaLucia Jr. [44]. In this section, we present all types of RNA parameters that we use and their format. Other parameters exist, such as coaxial stacking2 [62, 63, 71], but are not included in our model yet. 2Stacking interactions between adjacent helices. 22 I A C G-G-A-C-C-G-U-\u00E2\u0080\u00A2 u-A - U U H P A G U C A - , GC AAGGAG II I I I I I I CG.UUCCUC A G ^ / T G - G C - G G - U C - C U - A U - A G - C C U A U CAC I I I U C C C , CUGU I I I I UG GACA AA C A A G C U G C Figure 3.1: Secondary structure of 5S Ribosomal R N A of Staphylococcus aureus [21]. This secondary structure, determined by comparative sequence analysis, contains \"odd-pairs\", marked by boxes: (CU), (A.G), (G.G) and (CC). Watson-Crick base pairs, i.e. (C.G) and (A.U), are very.important in R N A sec-ondary structures, and thus, they have been studied extensively. The next most common base pairs are wobble pairs, i.e. (CU). This section presents the thermodynamic param-eters determined by Turner's Laboratory [57] and refined by Mathews et al. [33], which considers Watson-Crick pairs and wobble pairs. NNTM assigns free energy changes to loops rather than to base pairs [71]. The orientation of the base pairs matters, (C.G) be-ing different from (G.C). Thus, six different base pairs are possible: (C.G), (CC), (A.U), (U.A), (CU) and (U.G). So-called \"odd pairs\"3, between (A.A), (CC), (G.G), (U.U), (A.C), (A.C) and (U.C), exist (see Figure 3.2). A database of such pairs in known R N A structures is available [36]; however, currently no or very few parameters are available to predict them [18]. The free energy of a specific secondary structure is based on the N N T M , which simply sums up the contributions of elementary motifs of the structure. Figure 3.2 shows a sequence S and its predicted minimum free energy secondary structure R. Several elementary motifs 3Sometimes, these base pairs are called \"non-canonical pairs\". But wobble pairs are also con-sidered non-canonical pairs. Thus, to not create confusion, we call all possible pairs which are not Watson-Crick, nor wobble, \"odd pairs\". 23 A C C U A - U C - G A 6 G G - C A - U U - G G - C C - G 1 0 U 1 1 C C - G C - 5 ' C - G C - G G A G l - C A ^ \u00E2\u0080\u009E A - U \" G - C G - C \u00E2\u0080\u009E A 8 \u00C2\u00A7 C - G \u00C2\u00B0 C - G G - C A Q A G G C - G C - G G - C A 3 G A G A G G C G G G C A C A G G G A c u A U G A G . U G 11 1 1 5 1 1 1 7 1 M i l 1 2 1 1 \u00E2\u0080\u00A2 4 A C C G . C A U G A G A C C c c u A c C U Figure 3.2: Secondary structure of an arbitrarily chosen RNA sequence. It contains all the elemen-tary structures, marked by different labels. are marked by different labels and will be used as examples, for each particular type of free energy. In the following, we present the thermodynamic parameters for R N A , as determined by Turner's Laboratory [49, 57] and refined by Mathews et al. [33]. The thermodynamic parameters discussed in this section also contain the standard enthalpy changes for all the situations which will be described in this section. Having the standard enthalpy change and the standard free energy change at 37\u00C2\u00B0C for a particular secondary structure, it is straightforward that, using Equation 3.1, we can calculate the standard entropy change, the standard free energy change at any given temperature between 0\u00C2\u00B0C and 100\u00C2\u00B0 C and the melting temperature. Free energies for stacked loops The table in Figure 3.3 shows the free energies of stacked loops whose closing pair is (G.C). AG-Stack(a,b,x,y) denotes the general value for a stacked loop, where a,b,x,y e {A, C, G, U}, and (a.b), (x.y) form pairs. The table shows AG-Stack(G,C,x,y), and the en-ergies for stacked loops are always favourable (negative energies). Note that the values are 24 duplicated in these tables, since AG-Stack(a,b,x,y) = AG-Stack(y,x,b,a). Because there are six possible base pairs for each pair, there are 6 x 6 = 36 possible stacked loops (with duplicated values). \" O d d pairs\" energies are denoted by dashes, meaning that no pair ing between the corresponding bases is possible, hence the free energy of b inding is considered infinite. A n example, showing an elementary structure of Figure 3.2, is given in (c). y x A C G U A - - - -2.40 C - - -3.40 -G - - 3 . 3 0 - -1.50 U -2.20 - -2.50 -(a) G x I I \u00E2\u0080\u00A2C y (b) - - G G - -11 I - - C C \u00E2\u0080\u0094 -3.30 kcal/mol (c) Figure 3.3: (a) Example of a free energy table for stacked loops of the type shown in (b). (c) One particular value from the table, where x = G and y = C. Destabilizing energies by loop size A free energy penalty (i.e. a positive free energy change) is associated wi th each hairpin loop, internal loop or bulge, depending on the length of the loop (i.e. the number of free bases between the closing pairs). The table in Figure 3.4 par t ia l ly shows these parameters. The functions are called AG-Length-Internal (I), where I is the length (or size) of the loop, AG-Length-Bulge(l) and AG - Length-Hairpin(l). Tabulated values exist for I < 30. For longer loops, a function described in section 3.3 is used. T h e dashes signify that no loop formation of the corresponding type is possible, and thus, we consider their free energy to be infinite. Size Internal I I : 2 3 4 1 . 7 0 .5 1.80 30 3.70 (a) Figure 3.4: (a) Partial snapshot of the table showing the free energy penalties up to the size of the hairpin loop or internal loop, (b) A n example of an internal loop of length 4. Bulge Hairpin 3.80 2 80 - , _ _ n C U A r 3.20 5.70 5 '~ | 2 | \"~\"3 3.60 ' 5.60 3 ' -\u00E2\u0080\u0094 A Q C - - - 5 ' 4 - 0 0 5 - 6 0 1.70 kcal/mol i i (b) 6.10 7.70 25 Free energies for general hairpin loops 5 - _ C 3 ' \u00E2\u0080\u0094 G y (b) 3' 5' i i i i i i G - C A 3 G A G A -2.20 kcal/mol (c) Figure 3.5: (a) A free energy table for hairpin loops of the type shown in (b). An example, where x = G and y = A, is shown in (c). The free energy of a hairpin loop depends on the closing pair and the neighbouring free bases. They are sometimes called terminal mismatch free energies for hairpin loops. We call this function AG-Hairpin-n(a,b,x,y), with (a.b). The table in Figure 3.5 shows such energies for hairpins whose closing pair is (C. G), that is, the function is AG-Hairpin-n(C,G,x,y). Note that they are negative energies, regardless of the free bases. There are 6 x 16 = 96 different hairpins in this case. Free energies for hairpin loops of length 4 It is believed that hairpins of size 4 are particularly stable. For this reason, bonus values have been determined for such hairpins, as a function of the base pair and all the free bases between them. These bonus values will be added to other thermodynamic parameters for hairpins (see section 3.3). The function is called AG-Hairpin-4(a,b,c,d,e,f), with (a.f), and a few examples are illustrated in Figure 3.6. Similar bonus values for hairpin loops of size 3 have been determined for DNA [45] (AG-Hairpin-3(a,b,c,d,e), with (a.e)). For hairpins of size 4, 6 x 4 4 = 1536 values are possible, but only 30 are included in the current version of the parameters. For the hairpins not included in the table, no bonus is added. Free energies for general internal loops For internal loops, terminal mismatch free energies have been determined. They are a function of a closing base pair and the neighbouring free bases: AG-Internal-n(a,b,x,y), with (a.b). The function is applied to both exterior and interior base pairs. Note that for the interior base pair, the order of the variables is reversed. Figure 3.7 shows the table for AG-Internal-n(C,G,x,y), a general internal loop of this type and an example. There are y A C G U X A -1.50 -1.50 -1.40 -1.80 C -1.00 -0.90 -2.90 -0.80 G -2.20 -2.00 -1.60 -1.10 U -1.70 -1.40 -1.80 -2.00 (a) 26 Sequence Energy GGGGAC 31)0\" GGUGAC -3.00 C G A A G G -2.50 5 ' \" \" \" < f / | , G CUACGG -2.50 3 . _ _ _ C ^ A : : -1.50 kcal/mol (b) CGAGAG -2.00 G U G A A C -1.50 U G G A A A -1.50 (a) Figure 3.6: (a) Examples of bonus values for hairpin loops of size 4. (b) An example of such hairpin, along with its associated bonus. 6 x 16 = 96 values in these tables. Particular tabulated values exist for internal loops of size 2, 3 and 4, and will be detailed below. A C G U G U -0.00 -0.00 -1.10 -0.00 -0.00 -0.00 -0.00 -0.00 -1.10 -0.00 -0.00 -0.00 -0.00 -0.00 -0.00 -0.70 (a) 5 ' - -3 ' - -CX\u00E2\u0080\u0094yG I I G C y - - - x (b) - - 3 ' - - 5 ' r : c s A : G N G 5; I .c 3' 5' V - U G \u00E2\u0080\u0094 -1.10-1.10 kcal/mol (c) Figure 3.7: (a) Table for internal loop terminal mismatches whose closing pair is (C, G). This type of internal loop is shown in (b). (c) An example of an internal loop, where the two terminal mismatches use values from the table in (a): x = G, y = A and x = A, y = G. Free energies for symmetric internal loops of size 2 Symmetric internal loops of size 2 (one free base on each side) have been particularly studied, and parameters have been determined for them. The function giving these values is A G -Internal-2(a,b,m,n,x,y), which depends on the two base pairs (a.b), (m.n) and the two free bases x and y. The table in Figure 3.8 shows the parameters for internal loops whose closing pairs are (C,G) and (C,G), i.e. AG-Internal-2(C,G,C,G,x,y). This type and an 27 A C G U G 0.40 -0.40 0.40 0.40 0.30 0.50 0.40 0.50 (a) -0.10 0.40 -1.70 0.40 U 0.40 0.00 0.40 -0.30 5 ' - -3 ' - -c x c -I I G Y G -- - 3 ' - - 5 ' (b) G - C A 6 G G - C i i 5' 3' -0.10 kcal/mol (c) Figure 3.8: (a) Free energies for symmetric internal loops of size 2, of the type shown in (b). (c) shows an example of such an internal loop, where x = G and y = A. example are also shown. There are 6 x 6 x 16 = 576 parameters for this case. Note that these values are duplicated, since AG-Internal-2 (a, b, m, n, x, y) = AG-Internal-2(n, m, b, a, y, x). Free energies for asymmetric internal loops of size 3 Asymmetric internal loops of size 3 are internal loops which have one free base on one side and two free bases on the other side. The two tables in Figure 3.9 (a) and (d) show these values for internal loops of types illustrated in part (b) and (e), respectively. The general function is called AG-Internal-3(a,b,m,n,x,y) z), with (a.b) and (m.n). Note that these values are also applicable for the reverse situation, in which the two free bases appear closer to the 5' end. The figure shows two particular cases, with example for normal case and the reverse case: (1) AG-Internal-3(A,U, A,U,x,y,C). The example (c) uses x = C and y = A. (2) AG-Internal-3(C,G,G,C,x,y,G). The example (c) uses x = A and y = C. There are 6 x 6 x 4 3 = 2304 parameters for this case. Free energies for symmetric internal loops of size 4 Another special case of internal loops for which the thermodynamic parameters were tabu-lated are symmetric internal loops of size 4, having two free bases on each side. Figure 3.10 shows the function which gives these values, AG-Internal-4(a,b,m,n,v,w,x,y), with (a.b) and (m.n). The table shows a partial snapshot of AG'-Internal-4 (C, G , G, G, v, w, x, y). Part (c) shows an example, where v = A, w \u00E2\u0080\u0094 A, x = G and y = G. Note that the values in the table are duplicated, and for the given example, both AG-Internal-4 (G, G, G, G, A, A, G , G) and AG-Internal-4 (C, G, G, G, G, G, A, A) are valid. There are 6 x 6 x 4 4 = 9216 parameters for symmetric internal loops of size 4. 28 A C G U c u 3.60 3.20 3.10 5.50 3.70 4.00 5.50 3.70 5.50 5.50 5.50 5.50 5.50 3.70 5.50 2.80 (a) 3 ' - -3 ' - -- A X A -I I -u _u-y C (b) - A C A -I 7 I - u A J J -AC 3.70 kcal/mol (c) - - 3 ' - - 5 ' A C G U G U 1.00 0.60 0.40 4.00 4.00 4.00 4.00 4.00 0.80 4.00 2.20 4.00 4.00 4.00 4.00 4.00 (d) - C G- -I I yG (e) 3' 5' i i i i i i G - C A 8 9> C - G \u00C2\u00B0 i i 5' 3' 0.60 kcal/mol - - 3 ' - - 5 ' ( 0 Figure 3.9: (a) Table with free energy parameters for asymmetric internal loops of size 3, for the type shown in (b). (c) shows an example of such an internal loop, where x = C and y = A, along with its free energy, (d) Table with free energy parameters for asymmetric internal loops of size 3, for the type shown in (e). (f) shows an example of such an internal loop, where x = A and y = C. Free energies for dangling ends Dangling bases are free bases located in the immediate vicinity of a stem. They may have a contribution to the stability of the structure. Figure 3.12 shows tables for AG-Bangle-s'(a, b, x), with (a.b), (the free dangling base is close to the 3' end), where a = G and b = C (part (a)), and AG-Dangle-5'(a,b,x), with (a.b) (the dangling base is close to the 5' end), where a = C and b = G (part (b)). The figure contains two examples, one for the 3' end: x = U, and one for the 5' end: x = C. There are 6 x 4 x 2 = 48 parameters for dangling ends. Miscellaneous free energy rules Miscellaneous other parameters are used for multiloops, asymmetric internal loops, special cases of hairpin loops, etc. Table 3.1 gives these parameters, together with their role and a 29 5 xy vw AA AC AG AU CA . . . GG GU UA UC UG UU AA 1 . 3 0 1 . 2 0 0 . 3 0 2 . 0 0 1 . 6 0 . . . 1.00 - 0 . 4 0 2 . 0 0 1 . 9 0 1 . 1 0 1 . 4 0 AC 1 . 6 0 1 . 5 0 0 . 6 0 2 . 0 0 2 . 0 0 . . . 1 . 4 0 - 1 . 1 0 2 . 0 0 1 . 7 0 0 . 4 0 1 . 8 0 AG 0 . 3 0 0 . 2 0 - 0 . 7 0 2 . 0 0 0 . 6 0 . . . 0 . 0 0 - 0 . 6 0 2 . 0 0 1 . 3 0 0 . 9 0 1 . 3 0 AU 2 . 0 0 2 . 0 0 2 . 0 0 2 . 0 0 2 . 0 0 . . . 2 . 0 0 2 . 0 0 2 . 0 0 2 . 0 0 2 . 0 0 2 . 0 0 CA 1 . 2 0 1 . 1 0 0 . 2 0 2 . 0 0 1 . 5 0 . . . 0 . 9 0 - 1 . 5 0 2 . 0 0 1 . 2 0 0 . 0 0 0 . 3 0 GG 1.00 0 . 9 0 0 . 0 0 2 . 0 0 1 . 4 0 . . . 0 . 8 0 - 0 . 7 0 2 . 0 0 1 . 7 0 0 . 9 0 1 . 2 0 GU 1 . 1 0 0 . 0 0 0 . 9 0 2 . 0 0 0 . 4 0 . . . 0 . 9 0 - 2 . 6 0 2 . 0 0 1 . 1 0 - 1 . 1 0 1 . 1 0 UA 2 . 0 0 2 . 0 0 2 . 0 0 2 . 0 0 2 . 0 0 . . . 2 . 0 0 2 . 0 0 2 . 0 0 2 . 0 0 2 . 0 0 2 . 0 0 UC 1 . 9 0 1 . 2 0 1 . 3 0 2 . 0 0 1 . 7 0 . . . 1 . 7 0 - 0 . 4 0 2 . 0 0 1 . 4 0 1 . 1 0 0 . 5 0 UG - 0 . 4 0 - 1 . 5 0 - 0 . 6 0 2 . 0 0 - 1 . 1 0 . . . - 0 . 7 0 - 4 . 2 0 2 . 0 0 - 0 . 4 0 - 2 . 6 0 - 0 . 5 0 u u 1 . 4 0 0 . 3 0 1 . 3 0 2 . 0 0 0 . 8 0 . . . 1 . 2 0 - 0 . 5 0 2 . 0 0 0 . 5 0 1 . 1 0 - 0 . 4 0 (a) 3' I 5' i 1 I G i I - c 5 - - - C v x Q 3' A 9 A 1 1 G a G 3 - - - G 5' C - G (b) b 3' 1.00 kcal/mol (c) Figure 3.10: (a) Partial snapshot of the table showing the free energy for symmetric internal loops of size 4, of type shown in (b). (c) shows an example, where v = A, w = A, x = G and y = G. conventional name. Two special types of hairpin loops need to be described here: 1. A GGG hairpin loop is a hairpin loop closed by Si and Sj (i < j) with s;_2 = S j_ i = Si = G and Sj = U; 2. A Poly-C hairpin loop is a hairpin loop where all the free bases are C: S j+ i = . . . = Sj-i = C. Briefly, the parameters in Table 3.1 are used in the following situations: (1) is used for long hairpins, internal loops or bulges; (2,3) are used for asymmetric internal loops; (4,5,6) are used for multibranched loops; (7) is a penalty for stems which end in a base pair different from (GG); (8) is used for GGG hairpin loops; (9,10,11) give a bonus for Poly-C hairpins; (12) is a penalty added to the secondary structure prediction of a pair of molecules, rather than of one single molecule (see Section 3.5) and (13) decides on the calculation of grossly asymmetric internal loops, with one side of length 1. More details about the cases in which each of these parameters is used are given in Sections 3.3 and 3.5. 30 X A C G U -1.10 -0.40 -1.30 -0.60 5 ' - - - G I C (b) 3' C - G 10 U i i -0.60 kcal/mol (c) X A C G u -0.20 -0.30 -0.00 -0.00 (d) 3'-^ X (e) 11c C - G i i i i i i 5' 3' -0.30 kcal/mol (f) Figure 3.11: (a) Free energies for 3' dangling ends, of the type shown in (b). (c) shows an example of a 5' dangling end, where x = U. (d) Free energies for 5' dangling ends, of type shown in (e). (f) shows an example of a 5' dangling end, where x = C. No Role Name Value 1 Extrapolation for large loops for internal loops, bulges or hairpin loops greater than 30 Len-Par 1.079 2 Asymmetric internal loops: the maximum cor-rection Asym-Max 3.00 . 3 Asymmetric internal loops: the Ninio array Asym-Par .50 .50 .50 .50 4 Multibranched loops - offset Multi-a 3.40 5 Multibranched loops - helix penalty Multi-b 0.40 6 Multibranched loops - free base penalty Multi-c 0.00 7 Penalty for non-GC terminal non- G C-terminal 0.50 8 Bonus for GGG hairpin bonusGGG -2.20 9 Poly-C hairpin slope C-Hairpin-1 0.30 10 Poly-C hairpin intercept C-Hairpin-2 1.60 11 Poly-C hairpin of 3 C-Hairpin-3 1.40 12 Intermolecular initiation free energy Intermol 4.10 13 GAIL Rule (Grossly Asymmetric Interior Loop Rule) (on=l, off=0) Gail-Rule 1 Table 3.1: Miscellaneous free energy rules. 3.3 F r e e e n e r g y c a l c u l a t i o n o f a s e c o n d a r y s t r u c t u r e This section describes how to calculate the free energy of a secondary structure, using the parameters described in the previous section. We must recall the reader the equations 31 presented here are not our work, but are a detailed revision from the literature on free energy calculation. General functions In the following, some general functions, used by calculations for different types of struc-tures, are described. Studies have shown the helices whose exterior pairs are not (GG) are less stable. The value Non-GC-terminal is meant to add a penalty to capture this destabilization. In some references, the nomenclature of AU terminal penalty is used. Note that this penalty is also added for wobble pairs. Thus, to avoid confusion, we call it Non-GC terminal penalty. The function Non-GC-Penalty(a,b) is used, and it is calculated as follows: \u00E2\u0080\u009E \u00E2\u0080\u009E _ \u00E2\u0080\u009E , w [ 0 , if (a.b) is (CG)ox (G.C) Non-GC-Penalty(a,b) = { \u00E2\u0080\u009E m I Non-GC-terminal, otherwise Penalties corresponding to the size of hairpin loops and internal loops (including bulges), are considered. They are calculated as follows: A G L th H(l) = i A G - L e n S t h - H a i r P i n ( l ) > 1 ^ 3 0 6 n g ' U \ AG-Length-Hairpin(30) + Len-Par x log(l/30), I > 30 I denotes the length of the hairpin, i.e. the number of free bases. The length penalties for bulges and internal loops can be calculated similarly: ' AG-Length-Bulge{\) , Z < 30 AG-Length-Bulge(30) + Len-Par x log(l/30), I > 30 AG-Length-B{l) = AG-Length-I(l) = { AG-Length-Internal{\) , I < 30 AG-Length-Internal(30) 4- Len-Par x log(l/30), I > 30 Dangling bases can add some stabilization, up to the neighbours involved. A function which we call AG-Dangle (S, i\,j\, 12,32) is used mainly in multi-loops and in multi-domain structures and is calculated as follows (see Figure 3.12): Figure 3.12: Dangling bases between two branches of a multi-loop or multi-domain structure. AG-Dangle(S,ii,ji,i2,j2) = AG-Dangle-3'(sjl,sil,Sjl+i) + AG-Dangle-5'(sj2,si2,si2-i) , .i\ + 1 < h ~ 1 rnin{AG-Dangle-3\sjl,sil,SjlJri),AG-Dangle-5'(sj2,Si2,Si2-i)), i\ + 1 = i2 - 1 0 , i\ + 1 > n - 1 32 Free energy calculation for stacked loops Given the sequence S, the free energy of a stacked loop 5' \u00E2\u0080\u0094 S i S ; + i . . . Sj-\Sj \u00E2\u0080\u0094 3', with (si.Sj), (si+i,Sj-i) (Figure 3.13), is given by AG-S(S,i,j) = AG-Stack(si, Sj, si+i, S j - i ) . I I 3' - - - Sj SH --- ' Figure 3.13: A general stacked loop structure. Free energy calculation for hairpin loops Procedure 1 describes the calculation of the hairpin loop free energy for sequence 5, where the hairpin closing pair is (si,Sj). (Figure 3.14). I ) Si 1 Sj.i Figure 3.14: A general hairpin loop. The function is called AG-H(S,i,j). Hairpins having length shorter than three are not accepted by this model. The free energy of the hairpins of size greater than or equal to three are made of four quantities: A G i , A G 2 , A G 3 and A G 4 : \u00E2\u0080\u00A2 A G i corresponds to the penalty associated to the hairpin length. \u00E2\u0080\u00A2 For hairpins of size 3, A G 2 contains the value of the non-GC penalty function. For hairpins of size greater than 3, A G 2 is the terminal mismatch value, which includes the non-GC penalty (and thus, it does not have to be added separately). \u00E2\u0080\u00A2 If the hairpin is of size 3 or 4, and a bonus for it was tabulated (AG-Hairpin-3 or AG-Hairpin-4), A G 3 contains this bonus. \u00E2\u0080\u00A2 If the hairpin is a special case, such as GGG hairpin or Poly-C hairpin, then A G 4 contains this value. For Poly-C hairpins, this is a function of the hairpin length. Free energy calculation for internal loops As mentioned before, we consider that bulge loops are a special case of internal loops. The length of an internal loop is given by the number of free bases between the two closing base pairs, which we denote with (i.j) and (i'-f). Let us call l\ the length of one side of the 33 Hairpin-Loop-Free-Energy Procedure input: sequence S, i, j; output: free energy AG; procedure Compute AG-H A d := 0; A G 2 := 0; A G 3 := 0; A G 4 := 0; l:=j-i-l; i f (I < 3) A G := infinity; else A G i := AG-Length-H(l); i f (I = 3) AG2 := Non-GC-penalty(si,Sj); A G 3 := AG-Hairpin-3(si, Si+i, si+2, Sj-i, Sj); else AG2 := AG-Hairpin-n(si,Sj,Si+i,Sj-i); i f (I = 4) A G 3 : = AG-Hairpin-4(si,Si+i,Si+2,Sj-2,Sj-i,Sj); endif; endif; i f (si-2 = Si-i = Si =' G' and Sj =' U1) A G 4 := bonusGGG; else i f (si+i = . . . = Sj-i \u00E2\u0080\u0094 C') i f (I = 3) A G 4 := C-Hairpin-3; else A G 4 := C-Hairpin-2 + C-Hairpin-lxl; endif; endif; A G := A G i + A G 2 + A G 3 + A G 4 ; endif; r e t u rn A G ; end p rocedure AG-H. Procedure 1: Outline of the calculation for hairpin loop free energy. 34 loop, i.e. l\ = i' \u00E2\u0080\u0094 i \u00E2\u0080\u0094 1. Then, l2 will be the length of the other side: h = j \u00E2\u0080\u0094 j' \u00E2\u0080\u0094 1- The length of the loop will be I = l\ + I2. If l\ ^ I2, then we say the internal loop is asymmetric. Studies have shown that asym-metric internal loops are less stable than symmetric loops. The function AG-Asymmetry (li, I2) gives this penalty: I Asym-Max AG-Asymmetry(l\, lo) = mm < ,, , . A _ . . ,n . .. , \u00E2\u0080\u009E. y tfV ' ; [ |Zi - l2\ x Asym-Par[min(2,min(l1, l2)) - 1] Procedure 2 shows the calculation of the function AG-I(S, i, j, i',f), which gives the free energy of an internal loop or bulge closed by fa, Sj) and (si>, Sji) (see Figure 3.15). I I 3' S i Sv 5' ' S H - - - Sy+1 I Figure 3.15: Example of a general internal-loop structure. If it is a bulge of size 1, it is considered as a stacked loop. If the size is greater than 1, than only the non-GC penalties for both pairs are added. The function of the length of the bulge is added, and this concludes the bulge case. If it is an internal loop of special case (i.e. of size 2,3 or 4), for which there are tabulated values, then its free energy is given by the corresponding function alone. For any other type of internal loop, four quantities: A G i , A G 2 , A G 3 and A G 4 are added: \u00E2\u0080\u00A2 A G i is the penalty dependant of length. \u00E2\u0080\u00A2 AG2 and A G 3 are the terminal mismatches corresponding to each of the two closing pairs. \u00E2\u0080\u00A2 A G 4 is the assymetry penalty, calculated as described above. Free energy calculation for multibranched loops Consider a multibranched loop with k + 1 branches, and whose closing pairs are fa,Sj), fa^SjJ, ... fak,sjk) (see figure 3.16). The multiloops are calculated using the following formula: AG-M(S,i,j,h,ji, ...,ik,jk) = Multi-a + Multi-b x (k + 1) + 35 Internal-Loop-Free-Energy Procedure 1 input: sequence S, i, j, i!, j'\ output: free energy A G ; procedure Compute AG-I AGi := 0; AG2 := 0; A G 3 := 0; A G 4 := 0; li'\u00E2\u0080\u0094i'-i- 1; l2 ;= j - f - 1; I = lx + l2 if (h = 0 or l2 = 0) A G := AG-Length-B(l); if ( l 1 + l 2 = l) AG := A G + AG-Stack(si,sj,si',sr) else A G := A G + Non-GC-Penalty(si, Sj) + Non-GC-Penalty(sii, Sj>); endif; else if (Ii \u00E2\u0080\u0094 1 and l2 = 1) A G := AG-Internal-2(si, Sj, Si', Sji, Si+i, Sj. -1); else if (li = 1 and l2 = 2) A G := AG-Internal-3(si, Sj, Si>, Sj', Si+i, Sj. -1, Sj '+i); else if (li = 2 and l2 = 1) A G := AG-Internal-3(sji, Sii, Sj, Si, Sj~i, Si* -1, St+i); else if (h = 2 and i 2 = 2) A G := AG-Internal-4(si,Sj,Sii,Sj',Si+i,Sj -1, S i ' _ i , Sj/+i); else A G i := AG-Length-I(l); if = l or l2 = 1) and Gail-Rule = 1) A G 2 := AG-Internal-n(si,Sj,' A',' A'); A G 3 := AG - /n iema^n(s i ' ,S j ' , 'A ' , ' j4 / ) ; else A G 2 := AG-Internal-n(si, Sj, Si+i, Sj-i) A G 3 := AG-Internal-n(sj>,Si>,Sji+i,Si>-. 1); endif; A G 4 := AG-Asymmetry (h,l2); AG := A G i + A G 2 + A G 3 + A G 4 ; endif; return A G ; end procedure AG-I. Procedure 2: Outline of the calculation for internal loop free energy. 36 Figure 3.16: Example of a general multi-loop structure, with k + 1 branches. fc-l Multi-c x ( ( H - i - 1) + ^ ( i f t + i - J / i - 1) + 0 ' - J'fc - 1)) + h=l k Non-GC-penalty(si, Sj) + ^ Non-GC-penalty(sih, Sjh) + h=l k-i AG-Dangle(S,j,i,ii,ji) + ^ AG-Dangle(S,ih,jh,ih+iJh+i) + h=i AG-Dangle(S,ik,jk,J,i); First, a penalty for starting a new multiloop is added (Multi-a). For each branch of the multiloop, including the exterior pair, the value Multi-b is added. Also, for each free base, the value Multi-c is added. For each closing pair, the non-GC penalty is considered, as well as the dangling base contribution, given by the function AG-Dangle defined above. Free energy calculation for multi-domain structures For multi-domain structures, the dangling base energies are considered. The following formula shows the contribution of dangling bases for k domains, where (si1.Sj1), ... , (sik.s3k) are the closing pairs of each domain. The dangling bases located between the domains are calculated in a similar way with the multiloops. If the domain closest to the 5' end has a dangling base, than its contribution is added. Similar addition is performed if the domain closest to the 3' end has a dangling base. I I I I I I i i 5 - - - s , r , s / ( - Sjlsj1+1\u00E2\u0080\u0094 s , 2 . , S / 2 - sJ2sh+1 s/,-/*'*\" sJ*sl*+i ~--3' Figure 3.17: Example of a general multidomain structure, with k domains. AG-D(S,ii,jx, ...,ik,jk) = 5' ---s, I S/-7 37 k Non-GC-penalty(sik, Sjh) + h=i k - i /S.G-Dangle-5'(sjl,sil)Sil-i) + ^ AG-Dangle(S,ih, jh,ih+i, jh+i) + h=l AG-Dangle-3'(sjk,Sik,Sjk+i); Note that if is the first base of the sequence, then the dangling energy term AG-Dangle-5'(sj1, , s^- i ) is replaced by 0. Similarly, if Sjk is the last base of the sequence, the term AG-Dangle-3'(sjk, Sik, S j f c + i ) is replaced by 0. Once the free energy of the elementary structures are calculated, the free energy of a sequence S with a given secondary structure R, AG(S,R), is calculated by the simple addition of the free energies of all elementary structures. The following section gives an example of such calculation. 3.4 Example of free energy calculation Figure 3.18 contains the same sequence S and secondary structure R as in Figure 3.2. The figure contains the base positions (starting from 1), and 6 different substructures are delimited. The free energy calculation for each section marked in the figure is calculated as follows: AG(S, R) = (1) AG-\u00C2\u00A3>(S, 3,97,102,122) + (2) AG-S ,(S,3,97) +AG-S(S,4,96) + (3) AG-M(S,5,95,7,56,60,92) + (4) AG-S(S, 7,56) + AG-S(S, 8,55) + AG-S(S, 9,54) + AG-I(S, 10,53,14,48) + . . . + AG-H(S, 29,34) + (5) AG-S(S, 60,92) + AG-I{S, 61,91,62,89) + AG-S(S, 62,89) + AG-I{S, 63,88,66,86) + AG-5(5,66,86) + AG-S(S, 67,85) + AG-/(5,68,84,71,81) + AG-S{S, 71,81) + AG-S(S, 72,80) + AG-H{S, 73,79) + (6) AG-S(S, 102,122) + . . . + AG-H{S, 109,114) = (1) [Non-GC-Penalty(G, C) + Non-GC-Penalty(C, G) + AG-Dangle-5'{C, G, G) + AG-Dangle-3\"(C, G, A) + AG-Dangle-5'(G, C, C) + AG - Dangle-3\"(G, C, U)] + (2) \AG-Stack(G, C, G, G)] + [AG-Stack(G, G, G, C)\ + 38 A C oC U A - U C - G n A G G - C o s A - U U-G12I G - C C - G cc C - G 9SC\u00E2\u0080\u0094G A G ^ 25 10Q G is c 2\u00C2\u00B0 C U A ?\u00C2\u00B0| G G C G C A A G G G A U G A G U I I I I I I I I I I I I I I I , C G U A J J C C C U A R C U C A * A C 10 0 35 H G C C G , 5 5 J J G G - C A C - G 1 |s5C\u00E2\u0080\u0094G G - C A A G G C - G IsoC\u00E2\u0080\u0094G G - C A G A G AH Figure 3.18: Sequence and secondary structure, which will be used to show step by step how the free energy change is calculated. (3) [Multi-a + 3 \u00E2\u0080\u00A2 Multi-b + 6 \u00E2\u0080\u00A2 Multi-c + Non-GC-Penalty(G, C) + Non-GC-Penalty(G, C) + Non-GC-Penalty(C, G) + min(AG-Dangle-3'(G, C, G),AG-Dangle-5'(C, G,G))+ AG-Dangle-3'(C,G,A) + AG-Dangle-5'(G,C,A) + AG-Dangle-3'(G,C,G) + AG-Dangle-5'(G,C,A)} + (4) [AG-Stack(G, C, G, C)) + [AG-Stack(G: C, G, C)] + [AG-Stack(G, C, C, G)} + [AG-Length-Internal(7) + AG-Internal-n(C, G, G, A) + AG-Internal-n(C, G, A, G) + min(Asym-Max, |3 \u00E2\u0080\u0094 4| x Asym-Par[min(2, mm(3,4)) \u00E2\u0080\u0094 1])] + . . . + [AG-Length-Hairpin(A) + AG-Hairpin-n(G, C, U,A) + AG-Hairpin-4(G, U, G, A, A, C)) + (5) [AG-Stack(C, G, U, A)} + [AG-Length-Bulge(l) + AG-Stack(U, A, C, G)) + [AG-Stack(C, G,C, G)] + [AG-Internal- 3(C, G, G, G, A, C, G)] + [AG-Stack(G, C, G, G)] + [AG-Stack(G, C, C, G)] + 39 . [AG-Internal-4{C, G, G, C, A, A, G, G)] + [AG-Stack(G, C, G, C)} + [AG-Stack{G, C, C, G)] + [AG-Length-Hairpin(5) + AG-Hairpin-n(C, G, G, A)} + (6) [AG-Stack(C, G,G,C)] + ... + [AG-Length-Hairpin(4) + AG-Hairpin-n(A, U, C, U) + AG-Hairpin-4(A, C, A, C, U, U)\ = (1) [0 + 0 - 0.30 - 1.70 - 0.30 - 0.60] + (2) [-3.30] + [-3.30] + (3) [3.4 + 3 \u00E2\u0080\u00A2 0.40 + 6 \u00E2\u0080\u00A2 0.00 + 0 + 0 + 0 - 1.30 - 0.00 - 1.70 - 0.50 - 1.30 - 0.50] + (4) [-3.30] + [-3.30] + [-3.40] + [2.20 - 1.10 - 1.10 + 0.50] + . . . + [5.60 - 1.90 - 1.50] + (5) [-2.10] + [3.80 - 2.40] + [-3.30] + [0.60] + [-3.30] + [-3.40] + [1.00] + [-3.30] + [-3.40] + [5.60 - 2.20] + (6) [-2.40] +'...+' [5.60 - 0.20 - 0.00] = (1) -2.90 + (2) -3.30 - 3.30 + (3) -0.70 + (4) -3.30 - 3.30 - 3.40 + 0.50 + . . . + 2.20 + (5) -2.10 + 1.40 - 3.30 + 0.60 - 3.30 - 3.40 + 1.00 - 3.30 - 3.40 + 3.40 + (6) - 2 . 4 0 + . . . + 5.40 = -45.50 kcal/mol The online version of the mfold program [35], which calculates the minimum free energy secondary structure of an R N A sequence, can be used to verify that the free energy values for each elementary structures in this example are consistent with mfold. 3.5 Free energy calculat ion for pairs of molecules The binding free energy calculation for a pair of R N A or D N A molecules is very similar to the free energy calculation for one single molecule. A penalty for Intermolecular initiation is added, i.e. the value Intermol from Table 3.1. Figure 3.19 shows a simple example of two short R N A molecules which bind together. Note that when talking about pairs of R N A or DNA molecules, three situations are possible: (1) both sequences are R N A molecules; (2) both sequences are D N A molecules and (3) one sequence is R N A and one is DNA. Currently, in our algorithms we cover only the first two cases that we use for folding of single molecules, with the same parameters for 40 3 ' G A G G U G A s G C G A G I I I I I U G C U C C G G U C - G A ^ U , A A A 5' Figure 3.19: Simple example of a pair of RNA molecules and the secondary structure of folding. R N A and DNA, respectively, Parameters for case 3 exist [52] and they can be incorporated in the same model. The elementary structures that can be formed are the same as for single molecules, with the difference that stacked loops, hairpin loops, internal loops and multi-loops may have some special cases. Let Si and S2 denote the two input R N A or D N A sequences, and let R denote the secondary structure of their binding. Let P = pyp2 \u00E2\u0080\u00A2 \u00E2\u0080\u00A2 \u00E2\u0080\u00A2 pn denote the sequence obtained by concatenating Si and 52, where n = length(Si) + length(S2). Let b be the index of the last nucleotide in sequence Si. b equals length(Sx) (if we start from index 1), and we say that b is the boundary between Si and 52. Figure 3.20 shows the equivalent of Figure 3.12 for a pair of sequences, where the boundary falls between the two pairs. 5' - - - S;, - Sy, Sjl + 1 - - - 3' 5' - - - Sfe., S , 2 - Sy., - - - 3' Figure 3.20: The boundary between molecules creates special types of structures. The dangling energy formula for pairs becomes: AG-Danglep(P,ii,ji,i2,J2,b) = AG-Dangle-3,(pjl,pil,pj1+i)+ AG-Dangle-5'(pj2,pi2,pi2-i) , if (b > ii and b + 1 < i2) min(AG-Dangle-3'(pj1, p^, Pj1+i), < AG-Dangle-5'(pj2,pi2,pi2-i)) , if (ii + 1 = i2 - 1 and b > i2) AG-DangleSXpj^pi^pj^i) , if(f>+l = i 2) AG-Dangle-5'(pj2,Pi2,Pi2-i) > (b = h) 0 , if (ii + 1 > i2 - 1) Note that if Pj1+i does not exist or is not a free base, the third term will be replaced by 0. The same about the fourth term, if pi2-i does not exist or is not a free base. AG-Dangle? will be used for the special types of structures, i.e. when the boundary b breaks the elementary structure and basically transforms it into a multi-domain. 41 Figure 3.21 shows the special type of stacked loop, the equivalent of Figure 3.13 for pairs, where the boundary breaks the loop. 5' S ; Sj+f\u00E2\u0080\u0094 .( I I 31 - - - Sj Sj.-i - - - ' Figure 3.21: Example of a special stacked loop structure in a duplex of molecules. Thus, the general formula for calculating the stacked loop free energy becomes: AG SP(P iib)-I Intermo1 ' if 6 = ior b + 1 = jAGS (F,i,j,b) - j A G , 5 ( P ) 2 ) j ) 5 o t h e r w i s e Similarly, Figure 3.22 shows a special type of hairpin loop, broken by the boundary between the two molecules. Si+1 - - - 3 ' I \u00C2\u00B0y-i 5' - - - S, I Sj.-, - - - 5 ' Figure 3.22: Example of a special hairpin loop structure in a duplex of molecules. If the break between the two molecules is between i and j, then we consider this is a special type of hairpin loop. Thus, the free energy of the hairpin loops for duplexes is calculated as follows: . AG-HP(P,i,j,b) = AG-Danglep(P,i,j,i,j,l) + Intermol+ Non-GC-Penalty(pi,pj) , if i < b < j AG-H(P,i,j) , otherwise The special type of internal loop is shown in Figure 3.23, the equivalent of Fig-ure 3.15. The general formula for internal loops for duplexes becomes: AG-P(P,i,J,i',f,b) = ' AG-Danglep(P,i,j,i',j',b)+ AG-Danglep(P, f, i', j, i, b) + Intermol+ , if (i < b < i') Non-GC-Penalty(pi,pj) + Non-GC-Penalty(pi>,pj>) or (j < b < j') AG-I(P,i,j,i',j') , otherwise 42 Figure 3.23: Example of a special internal loop structure in a duplex of molecules. Finally, multi-loops can also be broken by the boundary. Figure 3.24 shows this special type, and the formula to calculate multi-loops for duplexes follows. l' Figure 3.24: Example of a special multi-loop structure in a duplex of molecules. AG-MP (P, i,j,ii,ji,...,ik,3k,b) = AG-Danglep{P,j,i,ii,ji,b)+ Ili 0 46 dangling energies and the non-GC-penalties are added. Thus, a complete version of the recurrence relation for W(j), case j > 0 is: W(j) = min l < i < j W(j - 1) V(i,j) + Non-GC-penalty(si, Sj) + W(i - 1), V(i + 1, j ) + Non-GC-penalty(si+i, Sj) + AG - Dangle-3'(SJ, si+\, s * ) + W(i - 1), V(i,j - 1) + Non-GC-penalty(si, Sj-i) + AG-Dangle-5'(sj-i, si: Sj) + W(i - 1), V(i + 1, j \u00E2\u0080\u0094 1) + Non-GC-penalty (si+i, S j _ i ) + AG-Dangle-3'(sj-i, S j + i , S j ) + AG-Dangle-5'(SJ-i, si+i,Sj) + W(z \u00E2\u0080\u0094 1) The optimal free energy for s ; . . . S j , V(i , j ) , is given by the most favourable struc-ture amongst hairpin loop, stacked loop, internal loop and multi-loop. The calculation is performed using the following formula: VC ) = I + \u00C2\u00B0 \u00C2\u00B0 , for z > j { h J ) \ mm(H(i,j),S(i,j),VBI(i,j),VM(i,j)), for i < j To make connection to the equations described in Chapter 3, first note that the following two equations are true: H(iJ)=AG-H(S,i,j) S(iJ) = AG-S(S,i,j) + V(i + l,j-l) The equation for calculating the free energy of an internal loop closed by the external pair (si.Sj) must find the optimal internal pair (s^.Sj'), by searching all possible internal pairs: VBI(i,j)= min (AG-I(S,i,j,i',j') + V(i',j')) i 0 and Wds(i).numJbisniches > 0) (id,3d) := W d s (i) . last-domain; i i := SimFold-Backtrack (id, 3d, Vds, Wds); i := Wds (i)-next-domain J ; end while; return (AG, R); end procedure SimFold. Procedure 3: Pseudocode for the SimFold algorithm. structures, now we can backtrack and build the minimum free energy secondary structure R. This is done in the procedure Backtrack, which is detailed in Procedure 4. Procedure SimFold-Backtrack is a recursive function which advances one elementary structure in each step, using the information stored in V^s and W^s data structures. It starts from the closing pair of a domain, e.g. (si.Sj), and stops when j > i. It checks which type of elementary structure the pair (s{.Sj) is closing, it saves the partial secondary structure in R, and then, it recursively calls itself on the internal branches of the structure. If it was a hairpin loop, than there is no internal branch, and the function returns. The SimFold program is implemented in C++ and a library that contains the func-tion SimFold is provided. The input is the given nucleic acid sequence and the M F E sec-ondary structure, in dot-parenthesis format, is returned, together with the minimum free energy in kcal/mol. We have also implemented a function which calculates the free energy of a given sequence S, folded in a specific secondary structure R. This function, called FreeEnergy(S, R, M), where M is the model under consideration, is important for evaluat-ing the stability of a sequence under some conformation or compare the stability of different conformations. Also, functions to calculate enthalpy, entropy and melting temperature are 50 SimFold-Backtrack Procedure input: Indexes i, j, data structures V^, W^', output: Partially filled secondary structure R; procedure SimFold-Backtrack if (i > j) return (0); else if (Vds(i,j).type = HAIRPIN-LOOP) Save-Structure (R); return (R); else if {Vds(i,j).type = STACKED_LOOP) Save-Structure (R); return SimFold-Backtrack (i + 1, j \u00E2\u0080\u0094 1, Vds, Wds); else if (Vds{i, j).type = INTERNAL-LOOP) Save-Structure (R); return SimFold-Backtrack (Vds(i, ,Vds(i, 3)-f,Vds,Wds); else if (Vds(i,j).type = MULTI-LOOP) Save-Structure (R); for each branch B return SimFold-Backtrack (B.i, B.j, Vds, Wds); end for; end if; end procedure SimFold-Backtrack. Procedure 4: Pseudocode for the SimFold backtracking algorithm. provided. 4.3 Time and space complexity - theoretical analysis Time complexity The running time to calculate the values in each array can be determined as follows: \u00E2\u0080\u00A2 W: 0(n2), because for each j , we minimize over i; \u00E2\u0080\u00A2 V: 0 (n 2 ) , since for each i and j, we minimize over 4 terms; \u00E2\u0080\u00A2 H and S: 0 (n 2 ) ; \u00E2\u0080\u00A2 VBL. 0 (n 4 ) , because for each i and j , we have to find the best i' and f; \u00E2\u0080\u00A2 WM: 0 (n 3 ) , since for each i and j, we look for the best h; \u00E2\u0080\u00A2 VM: 0 (n 2 ) : we do a constant number of comparisons for each i and j . 51 The most expensive task is to calculate the free energy of internal loops. Two solutions have been adopted to reduce time complexity of internal loops from 0(n 4 ) to 0(n3): 1. Hofacker et al. [24] adopted an easy to implement solution: they look at the i'-s and j'-s which are at most c bases distance away from i and j, respectively. Thus, the time complexity becomes proportional to c 2 x n 2 , which is considered to be 0(n 3 ) . This solution will miss the internal loops having at least one side longer than c, but it seems this case is very unlikely. The number 30 is considered an appropriate value for c. 2. Lyngs0 et al. [31] gave a more accurate solution, which would find internal loops of any size in time 0(n 3 ) , but is harder to implement, and requires more space. Briefly, instead of using a two-dimensional array VBI(i, j), they use a three-dimensional array VBI(i,j,l), where/ denotes the size of the loop, and ranges from 1 toj \u00E2\u0080\u0094 i \u00E2\u0080\u0094 1. Besides the free energy, the entry VBI(i,j,l) will store the best interior pair (si'.Sji). They prove that, under some thermodynamic assumptions, if for the exterior pair (si.Sj), the interior pair (si'.Sji) is better than (si\u00C2\u00BB.Sjn), than the interior pair (si'.Sji), it is also better for the exterior pair (si-\.Sj+\). Thus, the interior pair (si>.Sj>) stored in VBI(i,j,l), is also the best interior pair for VBI(i \u00E2\u0080\u0094 1, j + 1,1 + 2), with only two possible exceptions, which we will not detail further here. In our implementation of SimFold, we used the first method, which has a lower space complexity and is easier to implement. Space complexity Note that out of the seven aforementioned arrays, only three of them need to be stored: W, V and WM. H, S, VBI and VM need to be calculated only once. The minimum value, for each i and j, will be stored in the V array. Hence, the space complexity for each of the three arrays W, V and WM is 0(n 2 ) , since we store a value (or a data structure of values) for each i and j. 4.4 Performance evaluation We implemented the algorithm described in this chapter under the name SimFold, an im-plementation of Zuker and Stiegler algorithm, very similar to mfold [71] and RNAfold [24] from Vienna R N A package. In this section, first we compare SimFold against RNAfold and mfold and then we run SimFold and RNAfold on biological R N A structures. We refer to our current best version of SimFold as 1.1. For RNAfold, we used the latest version, 1.4, which is free software, downloadable from the Vienna R N A package web page [61]. For mfold, we also 52 used the latest version, which is 1.3. A web page interface running mfold is available [35], but we could not obtain a working version of mfold for local runs. Thus, we were able to perform many comparisons between SimFold and RNAfold, while comparisons between SimFold and mfold were time-consuming, and hence sparse. SimFold and mfold can take as input either R N A or DNA, whereas RNAfold can only fold RNA. Here, all the tests have been performed on RNA. We compare the three programs in Subsection 1, then an analysis of SimFold and RNAfold computation time is performed in Subsection 2. Finally, a study of the SimFold accuracy on six sets of biological R N A is discussed in Subsection 3. 4.4.1 Comparison of secondary structure predictions For RNA, RNAfold uses the set of thermodynamic parameters from SantaLucia Labora-tory [44], as opposed to SimFold and mfold, which use the parameters from Turner Labo-ratory [57], refined by Mathews et al. [33]. Some of these parameters differ, and we believe this constitutes one main reason for which the results are sometimes different. Another reason is that RNAfold does not consider some special types of structures, such as pohj-C hairpins or GGG hairpins. It is unclear whether any of the two sets is better, and a com-parison between them is beyond the scope of this thesis. Given that SimFold uses the same parameters and exactly the same model as mfold11, the predictions made by SimFold are expected to be the same as the predictions made by mfold with respect to the minimum free energy secondary structure (mfold also predicts suboptimal structures). Still, occasionally, the prediction made by mfold has a higher free energy than the prediction made by SimFold, because mfold does not allow for isolated base pairs. SimFold does not incorporate this restriction, allowing everything that the model permits. To have a good comparison of the three implementations, first we created a set of 100 randomly generated R N A sequences S of length 100 nucleotides. Let AGs and Rs denote the M F E and M F E structure returned by SimFold, AGy and Ry denote the M F E and M F E structure returned by RNAfold from Vienna package, and AGM and RM be the M F E and M F E structure returned by mfold. The model used by SimFold will be referred to as Ms, the model used by mfold will be called MM3, and the model used by RNAfold will be denoted by M y . \u00E2\u0080\u00A2 In 82 cases, the structure prediction of SimFold was identical to the structure predic-tion of RNAfold, i.e. Rs = Rv; 2 We believe the model we are using in SimFold, which was described in great detail in Chapter 3, is identical to the model used in mfold. A l l our comparisons (counting tens of instances) with the online version of mfold confirmed this. However, since we could only access the web version of mfold, and since full details about the model used in mfold are not publically available, we cannot guarantee that this is true. 3 We believe MS = MM-53 No Str. diff. A G s A G M FreeEnergy(S, Rs, MM) 1 substantial -21.80 -21.70 -21.80 2 slight -24.50 -23.90 -24.50 3 substantial -32.60 -31.80 -32.60 4 substantial -17.50 -16.80 -17.50 5 slight -21.40 -20.60 -21.40 6 substantial -28.00 -27.90 -28.00 7 substantial -18.80 -17.90 -18.80 Table 4.1: Differences between SimFold and mfold on a set of 7 RNA sequences of length 100 nucleotides. The model used in mfold on the M F E structure returned by SimFold gives the same free energy as SimFold; which is lower than the M F E returned by mfold. This happens because mfold does not allow isolated base pairs in teh minimization algorithm. \u00E2\u0080\u00A2 In 7 cases, the structures Rs and Ry were slightly different, but the free energy of Ry, when using model Ms, was equal to the minimum free energy returned by SimFold: FreeEnergy(S, Ry, Ms) = AGs- This shows that more than one structure gave the same minimum free energy, and SimFold chose another one than RNAfold. The distances between the two structures range between 2 and 38, where by distance between structure R\ and structure R2 we mean the number of bases that have a different bonding status in R\ when compared to R2. This observation clearly shows that predicting suboptimal structures is very important, since another structure (even with the same MFE) might be much closer to the real one, and the distance between the two predicted structures might be high; \u00E2\u0080\u00A2 For all the remaining 11 cases, the free energy FreeEnergy(S, Ry, Ms) was greater than A G s , hence SimFold considered the structure Ry as being a suboptimal structure. The distance between Rs and Ry in these cases ranged between 2 and 75. At this point, a comparison with the prediction made by mfold was performed. \u00E2\u0080\u0094 In 4 cases, the prediction made by SimFold was exactly the same as the prediction made by mfold: Rs = RM and A G s \u00E2\u0080\u0094 A G j ^ . This means that RNAfold made the prediction of a suboptimal structure, according to the model Ms or M M ' , \u00E2\u0080\u0094 In the 7 other cases, AGM > A G s , and the structures RM and Rs were slightly different in 2 cases and substantially different in 5 cases. Mfold web server [35] contains a program called \"Free Energy Determination\", which we will refer to as FreeEnergy(S* ,R*,MM), where S* and R* are any R N A sequence and structure. Running this program with the structure Rs predicted by SimFold, the free energy obtained equals the M F E returned by SimFold and is lower than the M F E returned by mfold: FreeEnergy(S,RS,MM) = A G s < A G M - The equality confirms our supposition that M s = MM- The inequality happens 54 Length % Identical % Equal F E % Lower F E 100-500 0.52 0.10 (2-31) 0.38 (2-227) 600-1000 0.18 0.06 (23-319) 0.76 (2-538) 1100-1500 0.12 0.04 (2-3) 0.84 (2-766) 1600-2000 0.00 0.00 (-) 1.00 (2-1021) Table 4.2: Comparison of structure prediction between SimFold and RNAfold as a function of sequence length. because, as we mentioned before, mfold does not allow for isolated base pairs. In all these cases, SimFold predictions include isolated base pairs. Table 4.1 shows the free energies AGs (column 3), AGM (column 4) and FreeEnergy(S, Rs, MM) (column 5). Table A . l in Appendix A gives the 7 sequences and the structures predicted by SimFold, in dot-parenthesis format. The second set we created to compare predictions of SimFold and RNAfold contains 10 sequences of length 200, 10 sequences of length 300, . . . , 10 sequences of length 2000. As sequences grow longer, the percentage of identical structures Rs and Ry gets lower. Ta-ble 4.4 shows the percentage of the cases where Rs and Ry are identical (second column), the percentage of the cases where FreeEnergy(S, Rs, Ms) = FreeEnergy(S, Ry, Ms) (third col-umn) and the percentage of the cases where FreeEnergy(S, Rs, Ms) < FreeEnergy(S, Ry, Ms) (fourth column). Thus, the third column corresponds to alternative structures with the same free energy, and the fourth column corresponds to M F E structure versus suboptimal struc-ture, as predicted by SimFold and model Ms- The numbers in the parenthesis represent the minimum and the maximum distance between the two structures. 4.4.2 C P U times The RNAfold program has a very efficient implementation, and is used by other programs, such as R N A Designer from RNAsoft suite [5]. SimFold was implemented such that the code could be easily extended to accommodate the necessary differences for PairFold and CombFold. However, SimFold is comparable with RNAfold in terms of speed, being only between approximately 10% to 120% slower on sequences shorter than 2000 nucleotides. The computational experiments have been performed on PCs with dual 2GHz Pentium III processors, 512 K B cache and 2GB R A M using Linux 2.4.20. On the set of length 100 nucleotides, the speed of SimFold relative to RNAfold is 1.28 on average: it took 0.08 C P U seconds for RNAfold and 0.06 C P U seconds for SimFold for each sequence. As the length of the sequences increases, the speed of SimFold relative to RNAfold increases as well. Table 4.3 shows the average C P U time in seconds performed by SimFold (second column) and RNAfold (third column), as well as the relative time: (Time SimFold) j (TimeRNAfold). 55 Length Time SimFold Time RNAfold Relative time 100-500 0.71 0.46 1.48 600-1000 6.14 3.63 1.67 1100-1500 20.71 10.67 1.92 1600-2000 50.07 22.44 2.22 Table 4.3: Comparison of CPU time between SimFold and RNAfold as a function of sequence length. Since the mfold program predicts suboptimal structures in addition to M F E struc-tures, and since its public availability is restricted, we did not compare it against SimFold in terms of speed. 4.4.3 SimFold accuracy on biological structures In order to evaluate the accuracy of SimFold prediction, we assembled six sets of R N A sequences with known secondary structures. The first set contains tRNA genes, whose structures where experimentally determined [51] and the other sets contain 5S rRNA, 16S rRNA, 23S rRNA, Group I Intron and Group II Intron sequences determined by comparative sequence analysis [11, 21]. Mathews et al. [33] performed a thorough analysis of mfold program (MFE foldings and suboptimal foldings) on biological R N A sequences with known secondary structure. Their sets overlap our sets, still it is unclear whether perfect matches between their sets and ours exist. Konings and Gutell [28] and Fields and Gutell [15] have also analyzed mfold program on 16S and 23S rRNA sequences, respectively, which are known to be long and hard to predict. To measure the accuracy level of SimFold prediction, we evaluated four parameters: 1. Qi is the level of accuracy (0 for completely wrong prediction and 1 for perfect pre-diction) calculated by dividing the number of correctly predicted base pairs to the number of base pairs in the real structure. We think the same parameter is used in Mathews et al. [33] measurements of accuracy. Note that with Q\, the free bases are not taken into consideration at all. Thus, a perfect prediction of the correct base pairs, which also predicts base pairs where the bases are actually free bases, would yield a value of 1; 2. Q2 also measures the level of accuracy, but by taking the base pairs as well as the free bases into consideration. Q2 is calculated as follows: first, the distance d between the predicted structure Rs and the real structure RR is calculated, by counting the number of bases whose pairing status is different between Rs and RR. A similar distance function is used in R N A Designer from the RNAsoft suite of programs [5]. Then Q2 = 1 \u00E2\u0080\u0094 d/l, where I is the length of the given sequence S; 56 2 0.3 0.4 O.S 0.6 0/ 08 09 Range of accuracy Q1 0.2 0.3 0.4 OS 0.6 0. Range of accuracy Q2 Figure 4.1: Evaluation of SimFold on tRNA genes: Distribution of the predicted structures after the level of accuracy, on intervals of 0.1. 3. The percentage of \"odd pairs\" (or non-canonical pairs) in the real structure RR is calculated. This parameter is important because, as our model does not consider odd pairs, we do not expect to have a perfect prediction for structures with odd pairs; 4. The percentage of pseudoknots in .RR is even more important, since Zuker and Stiegler algorithm, which is the basis of SimFold, can only predict pseudoknot-free secondary structures. The pseudoknot percentage was determined by dividing the number of pairs which are within a pseudoknot by the total number of pairs in RR. A pair (i.j) is considered to be within a pseudoknot if a pair (i'.j') exists such that i < i' < j < j' or i! < i < j' < j. The tRNA genes set contains a very small fraction of odd pairs and no pseudoknots, and the sequences are shorter than 100 nucleotides. However, other structures that are not considered by our model exist: 7 sequences have hairpins of size 0: (), 16 sequences have hairpins of size 1: (.) and 38 sequences have hairpins of size 2: (. .); also, tertiary inter-actions known to exist in tRNA are not reflected in our analysis. We performed a thorough analysis on 3514 sequences and calculated Q\, Q2, the free energy returned by SimFold and the free energy of the real structure RR calculated with our model: FreeEnergy(S, RR, MS). While some structures have been predicted with more than 0.90 accuracy, others had only 0.20-0.50 accuracy. Figure 4.1 shows the distribution of the predictions on intervals of level of accuracy. The left image shows the distribution when measuring the accuracy with Qi} and the right image corresponds to 0/2- The vertical dashed line in each image shows the average accuracy of the corresponding parameter. 57 Figure 4.2: Evaluation of SimFold on tRNA genes: Correlation between the level of accuracy and the difference between the predicted minimum free energy and the free energy of the real structure, considering our model. For all the predictions whose accuracy was lower than 1.00, the free energy calculated for the real structure AGPR was higher than the predicted minimum free energy AGs-Obviously, this is another convincing hint that the calculation of suboptimal structures is important. It is interesting to look at the difference AGPR \u00E2\u0080\u0094 AGs and at the level of accuracy. Figure 4.2 shows the correlation between these two values for Q\ (left) and Q2 (right). The general trend is that the greater the difference in free energy, the lower the level of accuracy. Still, in many cases, e.g. for accuracy 0.35, the free energy difference is close to 0. Note that for the Q\ parameter, there are quite a few cases where the level of accuracy is 0.00, and the free energy difference ranges in (0,16]. Table 4.4 shows the summary of the results predicted by SimFold and RNAfold on the six data sets. Column 1 shows the data set type and column 2 gives the number of instances for each type that was considered in this study. Column 3 shows the range length for each type and columns 4 and 5 give the average percentage of odd pairs and pseudoknots. Column 6 presents the average accuracy of SimFold, measured with Qi and Q2, respectively, and the last column shows the average performance of RNAfold. The table shows that the prediction of SimFold is better on short sequences, without or with a very small fraction of odd pairs and pseudoknots. As the percentage of odd pairs or pseudoknots increases, the accuracy goes down. Although for some sequences, the prediction of SimFold differs from the prediction of RNAfold, note that on the average, their prediction is very close. For the tRNA genes set, RNAfold gives better accuracy than SimFold in 118 cases, and SimFold gives better accuracy in 150 cases. Table A.2 in the Appendix A shows a 58 R N A type No. inst. Len % Odd % Pk SimFold Qi (Q 2) RNAfold Qi (Q 2) tRNA genes \u00E2\u0080\u00A23514 51-93 0.00 0.00 0.63 (0.67) 0.63 (0.66) Group I Intron 8 394-1031 0.04 0.10 0.48 (0.52) 0.48 (0.52) Group II Intron 2 905-2520 0.00 0.45 0.40 (0.43) 0.50 (0.47) 5S r R N A 9 117-124 0.11 0.00 0.57 (0.62) 0.50 (0.56) 16S r R N A 91 697-2147 0.06 0.06 0.42 (0.50) 0.42 (0.49) 23S r R N A 59 953-4381 0.06 0.16 0.41 (0.50) 0.40 (0.49) Table 4.4: Summary of SimFold accuracy on different types of RNA sequences. set of 30 examples of tRNA gene sequences, together with the parameters discussed above. Complete tables of the measurement of accuracy for the remaining five sets of introns and rRNAs are given in Appendix A. Comparing Table 4.4 with Table 1 in [33], where Mathews et al. report a thorough analysis of mfold, we notice the accuracy measured with Q i is from 8% for 16S rRNA to 20% for tRNA lower than reported there. We consider that the reasons for these differences include: \u00E2\u0080\u00A2 In the study performed by Mathews et al. [33], they use the known real structures to refine the thermodynamic parameters used. Then, the accuracy evaluation is per-formed on the same set. It is not clear whether further tests on a different set would give the same level of accuracy; \u00E2\u0080\u00A2 The sets that we used seem to overlap with the sets that Mathews et al. used, but not to coincide. Also, we are not sure whether the method of measuring the level of accuracy is identical; \u00E2\u0080\u00A2 Finally, our supposition that the model we use is identical with the model mfold uses, might not be completely true. However, in the future, SimFold and its extensions, PairFold and CombFold, can incorporate more parameters. Although it is hard to make a very correct estimate, we believe that SimFold is as good as RNAfold and mfold regarding the prediction of minimum free energy and structure. The scope of implementing SimFold was not to create a better prediction program, but to be able to extend Zuker and Stiegler's [72] basic algorithm of secondary structure prediction towards prediction of pairs and combinatorial sets of molecules, which will be presented in great detail in the next two chapters. 59 Chapter 5 PairFold The algorithm for predicting the secondary structure of a pair of R N A or D N A molecules, which we call PairFold, is described in this chapter. We define the minimum free energy (MFE) problem of secondary structure prediction of a pair of RNA molecules as follows: given two R N A sequences Si and 52 and a thermodynamic model M, find the pseudoknot-free secondary structure R with the smallest free energy change under the model M, in which Si and 5 2 can fold. Note that this problem includes interactions between the two molecules, as well as interactions within each molecule. The same as in case of secondary structure prediction of single molecules, for pair of molecules, the problems of predicting pseudoknotted secondary structures, other inter-actions or suboptimal structures (i.e. which have a free energy greater than the MFE) are important and challenging. In this chapter, only the M F E problem will be considered. PairFold algorithm is a simple extension of Zuker and Steigler [72] algorithm, de-scribed in Chapter 4. As discussed in Section 3.5, consider that we concatenate the two input sequences, Si and S2, into sequence P and that we denote the linkage location with 6 (b equals the last position in Si). With some modifications, which will be described in this chapter, we can apply the SimFold algorithm on 5'i5 2. Briefly, the elementary structures are the same as for a single molecule, with the exception that they can be \"broken\" by the molecular link b between them. In these situations, the free energy is calculated in a way similar to the calculation of multi-domain structures, and an Intermodular initiation penalty, i.e. Intermol from Table 3.1 is added. 5.1 Dynamic programming algorithm The arrays to calculate minimum free energies for different secondary structures are called Wp(j), VP(i,j), HP(i,j), SP(i,j), VBP(i,j), VMP(i,j), VM'(i,j), WM?(i,j) and WM'{i,j) They correspond to the arrays with the same names, but without the subscript p from Sec-tion 4.1. Note the new arrays VM'(i,j) and WM'(i,j). The same as for SimFold, the computation of multi-loops is done by the computation of the partial multi-loop structures WMp(i, j). If b is inside a multi-loop structure (i.e. 6 the intermolecular linkage is between 60 two branches, yielding a special multi-loop), then the multi-loop will be calculated as a multi-domain structure. Otherwise, it will be a regular multi-loop. However, at the mo-ment of the calculation of WMp(i, j), since we are calculating partial structures, we do not know whether the remaining part of the multi-loop contains the b. Thus, the calculation of the two arrays: WMp(i,j) and WM'(i,j) for the regular and special partial multi-loops, respectively, is needed. Consequently, there will be two arrays for calculating complete multi-loops too, one for regular multi-loops, one for special multi-loops. The complete recurrence relation for Wp(j) when j > 0 follows. This is an extension of the equation for W(j) from Section 4.1. Depending on the position of b, dangling ends are added or not. This will be shown in the terms T\, T2 and T 3 , described below: Wp(j)= min \j-i,pi+i,pj), otherwise The optimal free energy for pi.. .pj, Vp(i,j), is given by the most favourable struc-ture amongst the regular or special types of hairpin loop, stacked loop, internal loop and multi-loop. The calculation is performed using the following formula: +00 , for i > j ram(Hp(i,j),Sp(i,j),VBIP(i,j),VMp(i,j)), for i < j Integrating the model for duplexes, described in Section 3.5, each individual type of elementary structure will be calculated as follows: Hp(i,j) = AG-IF(P,i,j,b) Sp(i,j) = AG-?{P,i,3,b) + Vp(i + 1) 61 VBF(i;j) = \u00E2\u0080\u00A2 min (AG-F(P, i, j, i!, j1, b) + VP(i',j')) . i W M ' ( i + 2, j - 2) + AG-Dangle-3XPhpj,pi+i)+ AG-Dangle-5'(Pi,Pj,Pj-i) + 2 x Mu/ii-c , if b \u00C2\u00A3 {i, i + 1, j - l,j The multi-loop specific penalties are finally added: VM'(i,j) = VM'(i,j) + Multi-a+ Multi-b + Non-GC-penalty(pi,pj). VMp(i,j) = min ' WMP(i + 1, j - 1) + Non-GC-Penalty(pi,pj) + T i , W M p ( i + 2, j - 1) + Non-GC-Penalty{pi,pj) + T 2 ) I T\u00C2\u00A5MP(i + 1, j - 2) + Non-GC-Penalty(pi,pj) + T 3 , [ l f M P ( i + 2, j - 2) + Non-GC-Penalty{pi,pj) + T 4 Intermol, if 6 \u00E2\u0082\u00AC {i, j \u00E2\u0080\u0094 1} 0 , otherwise { Intermol \u00E2\u0080\u00A2 , if b \u00E2\u0080\u0094 i AG-Dangle-3'(Pu Pj, Pi+i) + Intermol, if 6 6 {i + 1, j - 1} AG-Dangle-3'(pi,Pj,Pi+i) , otherwise ( Intermol , if b = j - I T 3 = < AG-Dangle-5'{pi,Pj,Pj-i) + Intermol, if b e {i , j --2} I AG-Dangle-5'(pl,pj,Pj-i) , otherwise AG-Dangle-5'(pi, pj, pj-i) + Intermol , if b = i AG-Dangle-3'(pi,pj,pi+i) + Intermol , if b = j \u00E2\u0080\u0094 1 T 4 = < AG-Dangle-3'(pi,pj,pi+i) + AG-Dangle-5'(pi,pj,pj-i)-r, Intermol , if b \u00E2\u0082\u00AC {i + 1, j AG-Dangle-3'(pi,Pj,Pi+i) + AG-Dangle-5'(pi,pj,pj-i) , otherwise Note that no other additions to VMP are necessary at this point. 64 PairFold Procedure input: R N A or D N A sequences Si and S 2 of length n\ and n 2 ; output: minimum free energy A G , secondary structure R; procedure PairFold S = S1S2; n = nx + n 2 ; for (j = 1 to n) for (i = j \u00E2\u0080\u0094 1 down to 1) WMds(i,j) = Compute-WM'(i,j); WMpds(i,i) = Compute-WMP(i,j); end for; for (i :\u00E2\u0080\u0094 1 to j \u00E2\u0080\u0094 1) Vjs(i,j) := Compute-W(ij); end for; end for; for (j :\u00E2\u0080\u0094 2 to n) Ks(j) \u00E2\u0080\u00A2\u00E2\u0080\u00A2= Compute-WP(j); end for; AG := M /Js(n).free_energy; i := n; while [i > 0 and ^^(^ .num-branches > 0) {id,3d) := W^(z).last-domain; R := PairFold-Backtrack (id, jd, V\u00C2\u00A3 , W J J ; i := Wj s(i).next_domain_i; end while; return (AG, i?); end procedure PairFold. Procedure 5: Pseudocode for the PairFold algorithm. 5.2 Implementation The implementation of PairFold is very similar to SimFold implementation. The main differences are: (1) the calculation of two arrays for WMP and WM', (2) the checks for b in each particular situation. Procedure 5 shows a pseudocode of the implementation, which highlights point (1). Point (2) was described in detail in the previous section. The same as in SimFold implementation, we implemented a function that calculates the free energy of a duplex, when the structure is known: FreeEnergy(S\, S2,R). In the same way, once the M F E secondary structure is found, we can compute the enthalpy of the folded molecule, using the enthalpy thermodynamic parameters. Then, the calculation of the entropy and the melting temperature, when the reactants concentrations are given, is straightforward using the equations explained in Section 3.1. The M F E secondary structure, M F E , enthalpy and entropy changes, and the melting temperature, for the sequences (Si, S 2 ) , (Si, Si) and (S 2 , S 2 ) are provided in the online ver-65 6001 Sequence length n (nt) Splitting point Figure 5.1: Comparison between PairFold and SimFold speed (left); PairFold speed on a sequence with different splitting points. sion of PairFold [5, 41]. The secondary structure of (S2, S{) does not have to be calculated separately, because the answer will be the same as for (Si, S2). Section ?? gives an informal proof that, given the thermodynamic model we are using, PairFold is symmetric. 5.3 T i m e a n d space c o m p l e x i t y Being a very simple extension of the SimFold algorithm, the time and space complexity are of the same theoretical order as for SimFold: 0(n 3 ) for time and 0(n 2 ) for space, where n is the sum of the two input sequences lengths. In practice, the time will be longer, due to the additional checking for the location of b and due to supplementary calculations for multi-loops. The space required to store the necessary arrays will be larger too, since instead of storing three arrays, as in the case of SimFold: W, V and WM, we now store four arrays: WP,VP, WMP and WM'. The left graph in Figure 5.1 shows the speed of PairFold compared to the speed of SimFold. 296 sequences have been randomly generated, of length 50, 60, 70, . . . 3000 nucleotides and SimFold has been run on them. Then, each sequence has been split in half (yielding b = n/2) and PairFold has been run on each such pair. The dashed line shows the C P U time in seconds, that SimFold needed to predict the M F E secondary structures. The solid line shows the speed of PairFold. The dotted and dash-dot lines have been manually fit to match the SimFold and PairFold lines, respectively. The dotted line is the function (n/370)3, where n is the sequence length (or the sum of the two sequences lengths, in case of PairFold), and the dash-dot line is the function 1.3(n/370)3. These manual fits show that both SimFold and PairFold run in time proportional to 0(n 3 ) , and PairFold is about 1.3 times slower than SimFold. The right graph in Figure 5.1 shows the speed of PairFold on a sequence of length 66 1000 nucleotides, which has been split at different points: b = {10,20,... ,990}. The plot shows that PairFold is slightly slower when the two sequences are roughly equal to each other, and slightly faster when one sequence is short and the other is long. Note that the difference is not big, and this can be explained by the fact that the two arrays that help multi-loop calculations: WMP and WM', are fully computed, no matter where the boundary b is. 5.4 P a i r F o l d s y m m e t r y sec:pairf'old-symmetry In this section we show that the minimum free energy A G 1 2 for the pair of molecules , 5*2) under the symmetric model M equals the minimum free energy A G 2 1 of the pair (5*2) under the same model. First, we show that given Si, S2 and a secondary structure R12 with free energy A G 1 2 , in which the pair ( 5 i , 5 s ) could fold, the secondary structure R2i which contains the same base pairing as R12, but for the pair (52, Si), has the same free energy: A G 2 1 = A G i 2 . In Section 3.3 we showed that A G 1 2 = Yle A G e , where e is an elementary structure, and the sum goes over all elementary structures in the structure R12. Then, R21 will contain the \"mirrored\" elementary structures. For example, a multi-domain structure in R12 will be a special multi-loop structure in R21, but their free energies will be the same. Given that the Nearest Neighbour Thermodynamic Model is symmetric, all the \"mirrored\" elementary structures will have the same free energy as the original elementary structures. Thus, A G 2 i = EeA Ge, hence A G i 2 = A G 2 i . Now suppose that A G 1 2 and A G 2 1 differ and A G 1 2 < A G 2 1 . If the M F E structures R12 and R21 are \"mirrored\", then from the proof above, A G 1 2 = A G 2 1 . Suppose R12 and R21 are different. Let Ry\ denote the \"mirrored\" structure of -R12, i.e. the structure containing the same base pairs as R12, but for (52, Si), Then, from the proof above, A G 1 2 = AG-j^. But A G 1 2 < A G 2 1 , so A G 1 2 < A G 2 1 . This means that there exists a secondary structure, namely R7^, for ( 5 2 , 5 i ) , whose corresponding free energy A G j ^ is lower than the minimum free energy A G 2 1 . This is a contradiction, which proves that the hypothesis we made is false. We showed that if Rx2 and R21 are identical, then A G 1 2 = A G 2 1 . The reverse implication is not always true, because it is possible for two different secondary structures to have the same free energy. When several M F E secondary structures exist, our algorithms will select the first one (hence always the same one), out of the set of possible secondary structures. For this reason, we think it might be possible for PairFold to select different secondary structures for (Si, Si) and (52, Si) if they exist.' Further analysis is needed to support this supposition. 67 5.5 A p p l i c a t i o n s a n d p e r f o r m a n c e e v a l u a t i o n This section presents several applications of PairFold and gives a thorough analysis of its performance on different biological sets. First, we show its performance on short D N A and R N A duplexes, giving also other parameters, such as enthalpy, entropy and melting temperature. Then, we show a very important and promising application of PairFold, on ribozyme-target hybridization and on a ribozyme combinatorial library. PairFold can also be used for primer/probe design and testing. Finally, we show that it can be useful for including thermodynamic models into D N A word design. 5.5.1 P red ic t ion of D N A or R N A duplex formation Free energy, enthalpy, entropy and melting temperature of D N A or R N A duplexes which are complementary or have few mismatches have been experimentally determined. The information obtained was mainly used for finding thermodynamic parameters for stacking energies and simple mismatches, and for testing of prediction programs. Knowing these parameters is useful for any secondary structure prediction and for primer design [37]. -Peyret et al. [37] have used a set of 51 D N A duplexes which contain A - A , C-C, G G and T-T mismatches, to find thermodynamic parameters for these mismatches, by linear regression. Then, they compared the predicted free energy, enthalpy, entropy and melting temperature with the ones determined experimentally. In a similar way, X i a et al. [67] have used a set of 119 perfectly complementary R N A duplexes, to determine or test stacking parameters. In this subsection we run PairFold on these duplexes and compare our predictions with the experimental measurements and predictions reported in [37] and [67]. NA Prediction AH\u00C2\u00B0 (ayg) AS\u00C2\u00B0 (avg) AG\u00C2\u00B0 (avg) Tm (avg) (kcal/mol) (eu) (kcal/mol) (\u00C2\u00B0C) PairFold prediction -59.06 -168.62 -6.76 41.76 DNA Experimental values [37] -59.79 -170.94 -6.78 41.75 Peyret et al. prediction [37] -59.57 -170.32 -6.74 41.50 PairFold prediction -58.13 -161.27 -8.12 48.59 RNA Experimental values [67] -58.80 -163.32 -8.15 48.31 Xia et al. prediction [67] -57.78 -160.08 -8.13 48.49 Table 5.1: Comparison between PairFold predictions, experimental data and other predictions [37, 67], on short DNA and RNA duplexes. The values represent average enthalpies, entropies, free energies and melting temperatures. Table 5.1 shows average enthalpies, entropies, free energies and melting temperatures of PairFold prediction (first line), the experimental values reported in [37] (second line), and the predictions reported in [37] (third line), on D N A duplexes. The same type of measurements have been performed for the R N A duplexes reported by X i a et al. [67], and 68 their experimental and predicted values. The results show that our predictions are nearly as good as the predictions reported in [37] and [67], both of them being close to the experimental values. Tables with all values for each duplex are provided in Tables B . l and B.2 from Appendix B. The small differences in predictions are explained by the following reasons: \u00E2\u0080\u00A2 The predictions performed by Peyret et al. and Xia et al. assume that the secondary structures are known (mismatches for the underlined nucleotides in Table B . l and perfect matches for the remaining D N A bases, and perfect matches for RNA), and the free energy and enthalpy are calculated for these specific structures. On the contrary, PairFold runs the free-energy minimization algorithm. For one duplex out of the 51 D N A duplexes, the predicted structure is substantially different than the expected structure. Also, for 15 out of the 109 R N A duplexes, some of the ends are predicted to be dangling ends rather than base pairs. Details are given in Appendix B. \u00E2\u0080\u00A2 For the D N A duplexes, the thermodynamic parameters used by Peyret et al. and PairFold are from SantaLucia [45], but some of them might be different versions; \u00E2\u0080\u00A2 For the R N A duplexes, we use Turner parameters [58, 59], which we think are some-what different from the parameters used by Xia et al. [67]. However, the overall prediction accuracy is fairly close. 5.5.2 Predicting ribozyme-mRNA target hybridization Many experiments on ribozyme-target duplexes have been performed in the last years. PairFold can be used to predict secondary structure of a ribozyme-target duplex. Although currently not 100% accurate, it can give some useful information about potential interactions in the duplex. It can also help in ribozyme design, where a pool of many closely related ribozymes are tried until good ones are found. PairFold can help eliminate some of these ribozymes, which are unlikely to perform well. Table 5.2 shows the performance of PairFold on 11 ribozyme - m R N A target du-plexes drawn from the literature. The secondary structures of their binding are found experimentally and are reported in the referenced papers. Column 2 indicates the reference and the figure showing the sequences and the secondary structures. Columns 3, 4 and 5 give the length of the two sequences and the percentage of odd pairs existing in the structure. Column 6 shows the number of correctly predicted base pairs out of the total number of base pairs in the real structure. Column 7 gives two accuracy parameters: Q i and Q2, cal-culated as described in Section 4.4.3. Column 8 gives the minimum free energy, as predicted by PairFold, and column 9 shows the free energy of the real structure, calculated with our model. Because our model does not consider odd pairs, their contribution to the total free energy was ignored (they were considered 0). The structures do not have pseudoknots. For 5 duplexes, both parameters Qi and Q2 gave accuracy greater than 0.90, and for all cases, the accuracy was greater than 0.70. Table B.3 in Appendix B shows the sequences, 69 No. Reference h h %Odd #bp Qi (Q2) AGP 1 [27] Fig. lA 11 32 0.00 14/14 1.00 (1.00) -26.90 -26.90 2 [60] Fig. 2 16 36 0.00 18/19 0.95 (0.88) -22.60 -21.10 3 [60] Fig. 3 17 38 0.06 17/18 0.94 (0.96) -20.00 -19.00 4 [43] Fig. lb 21 92 0.22 33/46 0.72 (0.74) -58.70 -42.50 5 [53] Fig. l a 14 59 0.00 23/23 1.00 (0.97) -34.20 -33.70 6 [53] Fig. 4a 14 59 0.00 19/23 0.83 (0.79) -35.90 -34.70 7 [47] Fig. lc 14 65 0.00 19/19 1.00 (0.85) -21.10 -18.80 8 [47] Fig. Id 14 55 0.00 15/18 0.83 (0.83) -16.30 -13.50 9 [47] Fig. le 34 120 0.00 40/43 0.93 (0.84) -58.20 -53.10 10 [25] Fig. 1 left 25 46 0.06 28/31 0.90 (0.92) -45.80 -43.40 11 [25] Fig. 1 right 70 100 0.03 63/69 0.91' (0.85) -142.10 -135.90 Table 5.2: Measurement of PairFold accuracy on a set of mRNA target - ribozyme pairs. the predicted and the real secondary structures, and highlights the wrong predictions and the odd pairs. 5.5.3 Analysis of a combinatorial library of hairpin ribozymes Some ribozymes, such as hammerhead ribozymes and hairpin ribozymes, are short R N A sequences with catalytic abilities, and they have a specific shape: a part folds to itself and another part binds to an R N A target and it cleaves it at the middle. Figure 5.2 (the same structure has been shown earlier in Figure 1.3) shows the typical structure of a hairpin ribozyme [68]. The top sequence represents the R N A target, also called substrate, which is cleaved at the site indicated by the arrow. The bottom sequence is a hairpin ribozyme, which performs the cleavage. The part of the ribozyme which folds to itself is typically a fixed sequence (this core sequence can be found in [68]), but the other part of the ribozyme, also called substrate binding domain, varies. Yu et al. [68] created a combinatorial library of ribozymes and tested their cleavage ability on a \"highly structured viral mRNA, the 26 S subgenomic R N A of Sindbis virus\" [68], GenBank accession number: U38305. The target R N A was originally of length 4.2 kilo-bases (kb). It has been cut in 4 sequences, of length roughly l.lkb, and the library of ribozymes has been tested on these fragments. In the presence of ions of Magnesium, 15 sites that were cleaved with high efficiency, called \"activity-selected sites\", have been identified. Other 23 sites have been predicted to be cleaved, but the cleavage efficiency was low. These are called \"sequence-selected sites\". Yu et al. have also performed experiments on oligonucleotide substrates, as opposed to targeting the long sequences. The reported results show that the cleavage efficiencies are much higher on the short substrates, even for those which were not cleaved while being part of the long sequences. We used PairFold to predict the binding interactions between the hairpin ribozymes which cleaved R N A and their target. Table 5.3 shows our predictions, in the cases when 70 helix J | 3 ' - \u00E2\u0080\u00A2 \u00E2\u0080\u00A2 \u00E2\u0080\u00A2 \u00E2\u0080\u00A2 \u00E2\u0080\u00A2 \u00E2\u0080\u00A2 I I I I I I 5'\u00E2\u0080\u0094 @-@-@-@-0-@ helix 2 , \u00C2\u00AE \u00E2\u0080\u00943' o\u00E2\u0080\u00946 Figure 5.2: Typical structure of a hairpin ribozyme bound to an mRNA target. the target is an oligo or a long sequence. The first column represents an identification number that we assigned to each case. 'A' stands for \"activity-selected site\" and 'S' stands for \"sequence-selected site\". They are followed by a number. The experiments performed by Yu et al. [68] show which was the substrate, but in some cases, one nucleotide of the corresponding ribozyme is unknown, three different bases being possible. In these cases, we performed tests for all three different alternatives, and we chose the best one. The letter at the end of some identification numbers shows which base was chosen in these cases. It is also underlined in the ribozyme substrate binding domain (column 4). Column 2 shows the cleavage site, which is the number of the base in the GenBank nomenclature. Column 3 gives the experimental cleavage efficiencies for the long R N A targets. 3 means strong cleavage efficiency, 2 means moderate, 1 is poor and 0 is undetectable. Columns 5 to 8 give PairFold prediction on oligos as target. Column 5 gives the number of correctly predicted base pairs out of the total number of base pairs existing in the real structure. Column 6 gives the two accuracy parameters that we used before: Q\ and Q2. Column 7 shows the minimum free energy obtained by PairFold, and column 8 gives the standard free energy change of the real structure, using the. model underlying PairFold. The last column shows the results obtained when running PairFold on the ribozymes and the long l . lkb sequences. If the binding of the ribozyme to the right site was predicted with an accuracy higher than 0.60, then a \"+\" is indicated. If the secondary structure obtained was very different from the typical ribozyme-substrate duplex secondary structure, then a \"-\" is indicated. If binding to a site different than the right site has been predicted (i.e. a false positive), the predicted target site is given. While the predictions between the ribozymes and the oligos show over 0.85 accuracy for most cases and at least 0.70 for all cases, only 4 out of 15 bindings similar to the typical structure are predicted for long sequences. This shows that PairFold performs well on short sequences, but as the sequences length increases, the accuracy goes down. For the 71 Id Cleav. site Cl . eff. Ribozyme fragment #bp Oligo Qi(Q 2 ) AGP Long A1U 8242 \u00E2\u0080\u00A2 2 UGUCUCUGAAGCCU 22/23 0.96 (0.86) -32.90 -29.30 + A2 8581 3 UUUUACCGAAGCCG \u00E2\u0080\u00A2 22/23 0.96 (0.89) -25.90 -25.10 -A3 \u00E2\u0080\u00A2 8658 2 GCACGGCGAAGUAU 23/23 1.00 (0.89) -31.30 -27.80 -A4A 8663 3 CUAAAGAGAAGUUC 23/23 1.00 (0.93) -24.90 -24.20 -A5 9333 2 CGUGUGAGAAGCGU 23/23 1.00 (0.93) -29.50 -28.80 9152 A6 9413 2 ACUACGCGAAGCGC 23/23 1.00 (0.93) -29.90 -29.20 + A7 9516 2 GAUCCACGAAGUGG 23/23 1.00 (0.89) -32.00 -28.50 -A8A 9841 3 ACCUAAAGAAGCAC 23/23 1.00 (0.89) -31.20 -27.70 -A9 9861 2 GAAUGUCGAAGCAU 23/23 1.00 (0.89) -27.90 -24.40 -A10A 9926 2 GGUAUAAGAAGCUG 23/23 1.00 (0.89) -31.00 -27.50 -A11C 10115 2 ACAGUACGAAGCAA 19/23 0.83 (0.79) -23.20 -21.50 -A12 10122 2 GGACAUAGAAGUAA 22/23 0.96 (0.79) -28.30 -24.60 -A13A 10603 3 UCAUCGAGAAGUAU 23/23 1.00 (0.93) -27.20 -26.50 + A14 11160 3 UUUGCACGAAGCAU 22/23 0.96 (0.89) -25.40 -24.60 -A15U 11178 3 GAUAUGUGAAGCUG 23/23 1.00 (0.89) -30.10 -26.60 + SI 7552 0 AUUUAGAGAAGCCG 23/23 1.00 (0.86) -27.30 -24.70 -S2 7562 0 UAUGCUAGAAGUUU 19/23 0.83 (0.70) -24.80 -21.10 -S3 7752 1 GGCACUAGAAGCUG 23/23 1.00 (0.86) -32.30 -30.30 -S4 8084 0 UAUGCUAGAAGCUU 23/23 1.00 (0.86) -25.40 -23.40 -S5 8254 0 UCGGACAGAAGCUG 23/23 1.00 (0.86) -31.30 -29.60 -S6 8257 0 UAAUCGAGAAGCCA 22/23 0.96 (0.82) -27.20 -24.60 -S7 8286 0 UAUCGCAGAAGCCC 23/23 1.00 (0.89) -31.50 -28.80 -S8 8295 0 UCCGAGAGAAGUCG 23/23 1.00 (0.86) -30.40 -27.90 -S9 8340 0 CCAGGUAGAAGCCG 23/23 1.00 (0.86) -32.10 -30.10 -SIO 8399 0 GCAGCAAGAAGCUC 23/23 1.00 (0.86) -32.20 -30.10 -S l l 8647 0 UAUGGUAGAAGUAC 23/23 1.00 (0.82) -28.40 -24.80 -S12 8814 0 UUCUUUAGAAGUAU 22/23 0.96 (0.79) -24.10 -20.40 -S13 9061 0 CUUUCAAGAAGUCG 23/23 1.00 (0.86) -27.00 -25.00 -S14 9393 0 AACAGGAGAAGUGC 23/23 1.00 (0.86) -30.30 -27.70 -S15 9552 0 UCGGUCAGAAGUGA 23/23 1.00 (0.86) -30.30 -28.70 -S16 9702 0 UGAUGCAGAAGCUA 22/23 0.96 (0.82) -28.90 -27.20 -S17 9893 0 CUGUUCAGAAGUAA 22/23 0.96 (0.82) -25.30 -23.50 -S18 9945 0 AACGACAGAAGCGG 23/23 1.00 (0.89) -32.50 -29.00 -S19 9948 0 UAGAACAGAAGCAG 23/23 1.00 (0.86) -26.30 -24.70 -S20 10188 0 GGAGGGAGAAGCAG 23/23 1.00 (0.86) -34.40 -31.80 -S21 10355 0 UCUACUAGAAGUUC 23/23 1.00 (0.82) -26.00 -23.20 -S22 10812 0 CGGAUUAGAAGCAA 22/23 0.96 (0.82) -27.70 -25.60 -S23 10920 0 ACAUUUAGAAGUUG 23/23 1.00 (0.86) -23.50 -21.50 -Table 5.3: Measurement of PairFold accuracy on a library of hairpin ribozymes. 72 \"sequence-selected sites\", we note that the accuracy parameter Q2 is consistently lower than the same parameter for the \"activity-selected sites\". The predictions with the long targets did not find any false positives. Note that in Yu et al.'s experiments, cleavage on the long targets was detected only in the presence of Magnesium ions. These might have an important role in stabilizing the hairpin structure and in facilitating the binding to the substrates. Our model does not consider the influence of ions on the structure. Including this in secondary structure prediction models is an open question. 5.5.4 P red ic t ion of primer-target b inding Primers are designed to bind to specific targets such that they will perform some function. They are perfect complements of fragments of a longer R N A sequence. Most probably these fragments form some structure with the rest of the sequence. Thus, primers are designed to bind to the target fragment, so that the target can be amplified using the polymerase enzyme. PairFold can be used at predicting the minimum free energy site binding between the primer and the target fragment. Yu et al. [68] designed 24 primers in order to identify the cleavage sites of the ribozymes used in their experiments, and in order to read the nucleotides which compose the four 1.1 kb sequences they used. Table 5.4 shows the results we obtained when running PairFold on the designed primers and the long target sequences. Columns 1 and 2 show the identification number used in Yu et al. [68] and the primer length. Column 3 shows the target site for which the primers were designed, and which, according to Yu et al., the primers bind perfectly. Column 4 shows the binding site predicted by PairFold. In 22 out of the 24 cases, PairFold predicts the sites exactly. For 2 duplexes, shown in bold, the prediction made by PairFold is that the primers form an internal loop with two different locations. 5.5.5 P a i r F o l d i n D N A word design D N A codes are designed for information storage in D N A computation or for molecular bar-codes in chemical libraries [50, 55, 56]. Designing good words is important for mini-mizing errors due to unwanted hybridization between non-complementary words. Different restrictions between words in the set, such as: hamming distance, G C content and reverse complement hamming distance, have been used for D N A word design [17, 55, 56]. Includ-ing thermodynamic constraints in addition to other constraints has been proposed as future work by Tulpan et al. [56]. Currently, they use PairFold in their stochastic search meth-ods for designing D N A codes. Whether or not using such constraints proves to be a good solution is an open problem. Also, PairFold has been recently used' by Shortreed et al. [50] for generating a combinatorial library of oligonucleotides for large scale D N A computing projects. The final 73 Id Len. Binding site [68] Predicted site PAO 21 7483 - 7503 7484 - 7503 P A 1 1 7 7 6 7 2 - 7 6 8 8 7 5 1 8 - 7 5 2 0 8 2 2 7 - 8 2 3 9 PA2 121 7852 - 7972 7852 - 7972 PA3 19 8106 - 8124 8106 - 8124 PA4 21 8358 - 8378 8358 - 8378 PA5 18 8532 - 8549 8532 - 8549 PBO 20 8503 - 8522 8503 - 8522 P B l 19 8734 - 8752 8734 - 8752 PB2 20 8963 - 8982 8963 - 8982 PB3 19 9184 - 9202 9184 - 9202 PB4 20 9405 - 9424 9405 - 9424 PB5 20 9590 - 9609 9590 - 9609 PCO 20 9514 - 9533 9514 - 9533 PCI 20 9719 - 9738 9719 - 9738 PC2 20 9941 - 9960 9941 - 9960 P C 3 2 1 1 0 1 6 3 - 1 0 1 8 3 9 5 1 4 - 9 5 1 6 1 0 6 1 3 - 1 0 6 1 6 PC4 18 10394- 10411 10394 - 10411 PC5 21 10629 - 10649 10629 - 10649 PDO 19 10576 - 10594 10577 - 10594 PD1 18 10770 - 10787 10770 - 10787 PD2 20 10975 - 10994 10975 - 10994 PD3 20 11210 - 11229 11210 - 11229 PD4 17 11433 - 11449 11433 - 11449 PD5 26 11638 - 11663 11638 - 11663 Table 5.4: PairFold predictions on a set of designed primers. purpose of their work is to solve a large-scale satisfiability problem using D N A computing. 74 Chapter 6 CombFold This chapter discusses our second extension of the dynamic programming algorithm for nucleic acids secondary structure prediction. This extension is applied to a set of R N A or D N A sequences rather than to a single molecule or a pair of molecules. Consider the following definitions and notation: \u00E2\u0080\u00A2 Let word denote an R N A or D N A sequence w = v\v2 \u00E2\u0080\u00A2 \u00E2\u0080\u00A2 \u00E2\u0080\u00A2 v\, where i>j 6 {A, C, G, U} for R N A and e {A, C, G, T} for DNA. The orientation of the strand is from 5' to 3', unless otherwise stated. For example, ACGCUAGGCA is an R N A word of length 10. \u00E2\u0080\u00A2 Let set denote a set of g words of the same length I, formally S \u00E2\u0080\u0094 {wi, w2, \u00E2\u0080\u00A2 \u00E2\u0080\u00A2 \u00E2\u0080\u00A2, wg | length(wi) \u00E2\u0080\u0094 length(wj),Vi,j \u00E2\u0082\u00AC {I,..., g},i ^ j}. The following set is formed of 4 words of length 5: AUACG UAGCG GCCGA CUGCG The words order in a set does not matter, but for convenience later, we assume that the words in S are ranked by their index. \u00E2\u0080\u00A2 Let Input-Set denote a sequence of s sets, IS \u00E2\u0080\u0094 S\, S2,..., Ss, e.g. the following is an Input-Set of 5 sets: UAGCGA CAGCGUAAUAU AUGCG AUAGCGGUA AUCG AUAGAU AGAUGCGCGGU GAGCGCAAG CUGC UAGGCUAGCGU GCGA Note that the number of words in each set can differ, the same for the length of the words across sets. 75 U>12 W21 W22 WS1 Ws2 W l g i W 2 g 2 w sga An Input-Set can also be written in terms of words rather than sets: IS = {wij, 1 < i < s, 1 < J < 9i}, where gi is the number of words in the set i. Thus, an Input-Set IS is characterized by s sets, where each set Si has. gi words, of length li. In what follows, when IS is fixed, we consider all its characteristics: s, Wij, 9%, k, Vi, j , 1 < i < s, 1 < j < gi, to be known. \u00E2\u0080\u00A2 Let Combination denote an R N A / D N A sequence, formed by concatenating one word Wij of each set Si from IS, starting at Si and finishing at Ss. For example C = W11W21 \u00E2\u0080\u00A2 \u00E2\u0080\u00A2 .wsi is a combination formed by concatenating the first word of each set together. Generally, a combination is of the form C = wib1W2b2 \u00E2\u0080\u00A2 \u00E2\u0080\u00A2 - w s b s , where 1 < bi < gi. Here, bi denotes the word rank within the set Si. A combination has the length n = 2~Zf=i U- If we think of a combination as a sequence of nucleotides rather than a concatenation of words, we can denote it as C = C1C2 \u00E2\u0080\u00A2 \u00E2\u0080\u00A2 \u00E2\u0080\u00A2 cn. \u00E2\u0080\u00A2 Given an Input-Set IS, the set of all possible combinations forms the Combina-torial-Set: CS = {wn>1W2b2 \u00E2\u0080\u00A2\u00E2\u0080\u00A2\u00E2\u0080\u00A2wsb3 I 1 \u00C2\u00A3 h < gi}. Note that all combinations have the same length: n \u00E2\u0080\u0094 ]fJi=i s h a n d that CS has gi x #2 x . . . x gs elements. If gi > 1, Vi, then the number of elements in CS is exponential in s. Considering the aforementioned notations, we define the optimal MFE combina-tion problem of secondary structure prediction of a combinatorial set of RNA molecules as follows: given an RNA Input-Set IS and a thermodynamic model M, predict which combination, out of all elements of the Combinatorial-Set CS formed from IS, folds to a pseudoknot-free secondary structure with the lowest minimum free energy. An extension of the optimal MFE combination problem is to find the k best M F E combinations, rather than the optimal one only. We define the k-suboptimal MFE combi-nations problem of secondary structure prediction of a combinatorial set of RNA molecules as follows: given an RNA Input-Set IS and a thermodynamic model M, predict which k different combinations, out of all elements of the Combinatorial-Set CS formed from IS, fold to pseudoknot-free secondary structures with the lowest minimum free energies. The optimal and fc-suboptimal M F E combination problem are discussed in this chap-ter. Note that in our algorithms, given a combination C, we look at the minimum free energy MFE only. Extensions of these problems would be to find suboptimal structures (i.e. whose free energy is greater than the M F E ) , or to consider pseudoknots, interactions with ions or other tertiary interactions. 76 123 i j n bj 1 UAGCGA CAGCGUAAUAU AUGCG AUAGCGGUA AUCG bi 2 AUAGAU AGAUGCGCGGU GAGCGCAAG CUGC 3 UAGGCUAGCGU GCGA Table 6.1: Example of a combinatorial set of short RNA sequences. The remainder of this chapter will first give a dynamic programming algorithm which runs in polynomial time, for solving the optimal MFE combination problem. Then, in Sec-tion 2, an algorithm for the k-suboptimal MFE combinations problem will be presented. Implementation details of these two algorithms into the program CombFold will be given in Section 3. A thorough theoretical and empirical analysis of the optimal and k-suboptimal MFE combinations problem in Section 4 show that, although they are complex algorithms, they run in polynomial time. We conclude this chapter by presenting applications to bio-chemical experiments and D N A computing in Section 5. 6.1 A dynamic programming algorithm for the optimal M F E combination problem The main idea of the CombFold algorithm is based on similar arrays and recurrence relations as SimFold, discussed in Section 4.1. One method to solve this problem is to create all possible combinations and then to run SimFold or an equivalent program on each of them. However, depending on the characteristics of the Input-Set, the number of combinations may be very big. If gi \u00E2\u0080\u0094 g > l ,Vi , then there are gs combinations, and SimFold complexity is 0(n 3 ) , as we show in Section 4.3. This leads to a complexity of 0(gs \u00E2\u0080\u00A2 n3), where n is the length of the combinations. More generally, the number of combinations is exponential in the number of sets which have at least two words. We have implemented this exhaustive search approach under the name of ExhaustS, which will be discussed in Section 6.4 later in this chapter. This section describes a dynamic programming algorithm to solve the basic problem of secondary structure prediction of a combinatorial set of RNA molecules in polynomial time. The main idea of the algorithm is to extend the Zuker and Stiegler [71] algorithm, in which at each position i, there might be several possible nucleotides, since there might be several words in the set of i. In the description that follows, we will use indices i and j for the nucleotide position, i.e. columns in Table 6.1. s(i) and s(j) will be the set in which i and j are positioned, respectively, bi and bj will be the index of the word within the set s(i) and s(j), i.e. the rows in Table 6.1. Given a set S, g(S) will return the number of words in S. Hence, bi can take g(s(i)) values. In order to get the base at the column i and row bi of the Input-Set IS, we will use the function: C j = Nucleotide(IS,bi,i). Table 6.1 shows the nucleotides Ci and Cj. 77 Arrays Instead of two-dimensional arrays, as we showed that SimFold needed (Section 4.1), now we deal with four-dimensional arrays, since for the nucleotide at the position i, there might be several possible words, and similarly for the nucleotide at the position j. The names of the arrays for CombFold correspond to the same names for SimFold, and a c (from CombFold) is added as subscript. Note that, having i and fixing the value of bi, the word index within the set of i, we can easily determine the nucleotide. In what follows, we assume we use the function Nucleotide to find the base Ci whenever 75, 6j and i are known, without explicitly stating so. In the following arrays, the values for each bi, bj, i and j, will be calculated: \u00E2\u0080\u00A2 W'C?) denotes the free energy change of the first j nucleotides of a combination c\C2 .. .Cj, where the word index within the set of j, bj, was chosen such that the free energy is minimized. Consequently, W'{n) contains the minimum free energy change over the entire Combinatorial-Set corresponding to the Input-Set IS; \u00E2\u0080\u00A2 Wc(bj, j) denotes the free energy change of the combination in which the index of the set of j is bj , therefore it is fixed; \u00E2\u0080\u00A2 Vc(bi,bj,i,j) is the minimum free energy of the combination in which bi and bj are fixed, assuming that (ci.Cj) is a base pair; \u00E2\u0080\u00A2 Hc(bi,bj,i, j) is the free energy of the combination in which bi and bj are fixed, assuming that (ci.Cj) closes a hairpin loop; \u00E2\u0080\u00A2 Sc(bi,bj,i,j) is the free energy of the combination in which bi and bj are fixed, as-suming that (ci.Cj) closes a stacked loop; \u00E2\u0080\u00A2 VBIc(bi,bj,i,j) is the free energy of the combination in which bi and bj are fixed, assuming that (ci.Cj) closes an internal loop; \u00E2\u0080\u00A2 VMc(bi,bj,i,j) is the free energy of the combination in which bi and bj are fixed, assuming that (ci.Cj) closes a multi-branched loop; \u00E2\u0080\u00A2 WMc(bi,bj,i,j) is used to compute the array VM. Recurrence relations The same as for SimFold (Section 4.1) and PairFold (Section 5.1), the aforementioned arrays are constructed by recurrence relations. W'(j) is calculated by minimizing the values Wc(bj, j) over all possible bj: W'(j)=minbjWc(bj,j) The equation for Wc(bj, j) is an extension of the relation for W(j) from Section 4.1: 78 Wc(bj,j)= min. \ j mm(Hc(bi, bj,i, j), Sc(bh bj,i,j), VBIc(bi, bj,i,j) VMc{h,b3,i,j)) , for*,bjf,i',j'))) C o m b F o l d Hairp in-Loop-Free-Energy Procedure input: IS,bi,bj,i,j; output: free energy A G ; procedure Compute AG-HC AGi := 0; AG2 := 0; A G 3 := 0; I \u00E2\u0080\u00A2\u00E2\u0080\u00A2= j - i - i ; if (/ < 3) A G := infinity; else A G i := AG-Length-H{l); . . . if G = 3) AG2 := Non-GC-penalty(ci, Cj); A G 3 := min;, 6i+2]63._iex({bi,6j,t,j},{i+i,t+2,j-i}) AG-Hairpin-3 (ei, c i + i , c i + 2 , c J _ 1 , c,-); else A G 2 := m i n ^ ^ g ^ i , - , ^ , ^ if (I = 4) A G 3 := min 6 i + 1 )6 i + 2 i6 i_ 3 i6 J._iex({6i,b j,i,j},{i+i,i+2,j-2,i-i}) AG'-Hairpin-4 (ci,ci+\, c i + 2 , C j _ 2 , C j _ i , c.,-); endif; endif; A G := A G i + A G 2 + A G 3 ; endif; save best b's; return A G ; end procedure AG-H\u00C2\u00B0. Procedure 7: Outline of the calculation for hairpin loop free energy for CombFold. The free energy for multi-loops adds the minimization over the necessary b's as well. The equations for WMC and VMC follow: WMc(bi,bj,i,j) = min 82 CombFold Internal-Loop-Free-Energy Procedure input: IS,bi,bj,bi>,by,i',f; output: free energy AG; procedure Compute AG-IC Ad := 0; A G 2 := 0; A G 3 := 0; h:=i'-i- 1; h := j -j'-l;l = h+ h if {h = 0 or Z2 = 0) AG := AG-Length-B(l); i f ( i i + i 2 = l ) AG := AG + AG-Stack(d,Cj,Cii,Cj') else AG := AG + Non-GC-Penalty(ci,Cj) + Non-GC-Penalty(ci>,cy); endif; else if (l\ = 1 and l2 = 1) AG := mmb.+l ,Cj>, Ci+i , Cj-\); else if (Ii = 1 and l2 = 2) AG := min 6. + 1 ] 6._ l i b j., + i ex({6,:,6 ; j,i,i},{ l+ij-i,i' + i } ) AG-Internal-3(ci, Cj,Ci<, cy, Cj+i, C j _ i , C j / + i ) ; else if (Zi = 2 and l2 = 1) AG := minb ;._ l i 6. /_ i i b i + i ex({6,;,6 i,i,j},{j--i,i'-i,i-r-i}) AG-Internal-3(cy, Ci>, C j , C j , C j _ i , C i ' _ i , c \u00C2\u00BB + i ) ; else if (li = 2 and l2 = 2) AG := min b . + l i 6 j ._ 1 ) 6 . ( _ l i b j . / + i e x(^ AG'-Internal-4 (CJ, Cj,cv,cy, c i + i , C j _ i , C j / _ i , Cj'+r); else , -AGi := AG-Length-I(l); if ((^ = 1 or Z2 = 1) and Gail-Rule = 1) A G 2 := AG-Internal-n(ci,Cj,' A',' A') + AG-Internal-n(cv, cy,' A',' A'); else A G 2 := min b . + l i f c i _ 1 ) b . ( _ i i 6 i ) + i e x ( {b i ,b j , t j } , { i+i j - i , i ' - i , i ' + i } ) AG-Internal-n(ci, Cj,Ci+i,Cj-i) + AG-Internal-n(cy, Ci>, cy+i, Ci> _ i ) ; endif; A G 3 := AG-Asymmetry (h,l2); AG := AGX + A G 2 + A G 3 ; endif; save best b's; return AG; end procedure AG-IC. Procedure 8: Outline of the calculation for internal loop free energy for CombFold. 83 Vc(bi,bj,i, j) + Non-GC-penalty(ci,Cj) + Multi-b, min 6.+ i e X({b i,6 j,i,j},{i+i})(^ c(^+i.^^ + l.J') + Non-GC-penalty(ci+l,Cj)+ AG-Dangle-3'(cj,ci+i,ci) + Multi-b + Multi-c), mmb._ieXi{bubj^hy-1})(Vc(bi,bj\u00E2\u0080\u009Ei,i,j - 1) + Non-GC-penalty(ci,cj^1)+ AG-Dangle-5\c^\,Ci,Cj) + Multi-b + Multi-c), < ^bi+1,bj^eX({bubj,i,j},{i+ij-i})(vc(bi+i,bj^1,i + l,j - 1)+ Non-GC-penalty(ci+i, Cj-i) + AG-Dangle-3'(cj-i,ci+i,Ci)+ AG-Dangle-5'(cj-i,ci+i,Cj) + Multi-b+ 2 x Multi-c), m i n h . + i e X ( { b i i b . i i J } , { l + i } ) ( W M c ( b i + i , ^ , i + 1, j) + Multi-c), min 6 . _ i e X ( { b i , 6 . i i j } , { , _ 1 } ) (^M c (6 i , 6 , -_ i ) i , j - 1) + Mufet-c), . m i n ^ ^ ^ e ^ ^ + WM c (6 h + 1 ,6 , - ,h + 1,j)) VMc(bi,bj,i,j) = min m i n b . + l i b . _ i e X ( { b . ) b j ] i J } ] { i + l j _ 1 } ) WM c (6 i + i ,6 j_ i ,\u00C2\u00AB + 1, j - 1), minbi+1,bi+2,bj-ieX'{bi,bj,ij},{i+i,i+^ 2, j - 1)+ AG - Dangle-3'(ci, Cj, Cj+i) + Multi-c), < m i n b i + 1 ] b . _ 2 i b . _ i e X ( { b . ^ 1, j - 2)+ AG-Dangle-5'(ci, Cj, c,_i) + Multi-c), min b i + 1 ] b i + 2 , b . _ 2 , b . _ l \u00C2\u00A3 ^^ 2 . J - 2)+ AG-Dangle-3'(ci, Cj,Ci+{) + AG-Dangle-5'(CJ, C j , C j _ i ) + 2 x Multi-c) In this section we described the arrays and recurrence equations for predicting the optimal M F E combination of an Input-Set. The algorithm, as presented here, will find only the best combination only, but it can be used to find the next k best combinations. An algorithm for finding the k best combinations is described in the following section. 6.2 A n algorithm for the /c-suboptimal M F E combinations problem The algorithm for the optimal MFE combination problem, just described in the previous section, returns only the combination which has the smallest MFE. The algorithm can be extended to return the k combinations that have the lowest MFE. Consider the Input-Set IS contains s sets Si, each having gi words. We will add the superscript \"(1)\" to the notation of our sets to denote that first we are looking for the optimal combinations. The superscripts for the next combinations will be \"(2)\" and so on. Thus, IS\u00E2\u0084\u00A2 = { 5 { 1 ) ) 5 ^ 1 ) , . . . , 5 i 1 ) } will have the Combinatorial-Set CS& associated and will contain the following words: First, we find the optimal M F E combination using the method described in the previous section. Let the combination G ^ = {itf^ \u00E2\u0080\u00A2 \u00E2\u0080\u00A2 \u00E2\u0080\u00A2 w^c3} denote the optimal M F E 84 o(l) c(l) c(l) o<2 - - - O s Wu W21 ... Wsi W12 W22 \u00E2\u0080\u00A2 \u00E2\u0080\u00A2 \u00E2\u0080\u00A2 Ws2 Wlgi W2g2 . . . Wsgs combination, where Q denotes the index of the word in the set S\ , which belongs to the optimum combination. The Input-Set IS^ contains all the possible combinations of the original set IS. To find the next best combinations, first we partition the set IS^ into s sets which do not contain C ^ : 7 5 0 ) 1 = { 5 W _ { U , W } > 5 \u00C2\u00AB . . . tsP } / S ( 2 ) 2 = { & . . . ,SP } (2)7 For convenience later, we denote the newly created sets with S; , where 1 < i, j < s, i denotes the set index within the Input-Set, as in the previous notations, and j denotes the index of the newly created Input-Set: 15(2)1 = {5i2 ) 1,42 ) 1,...,5l2 ) 1} /5(2)2 = {5(2)2)5(2)2) __)5(2)2} is^s = {s[2)s,si2)s,...,si2)s} The Input-Sets I S ^ J S ^ 2 , I S ^ S have the following properties: \u00E2\u0080\u00A2 i C5( 2 ) m ,Vm,l is not included in any of the new Input-Sets created by par-titioning, (2) the new input sets do not have any combinations in common and (3) the whole space of combinations in CS^ is covered by the new input sets plus the optimal combination found. This leads to finding the optimal combinations for each of IS^1, IS^2,..., IS^S, followed by choosing the one with the smallest MFE. Thus, the standard free energy change 85 IS -,(1) to (1) W , c , (1) W 2C2-. w (1) \"- ^ A G ( I ) ^ A G ( I ) IS W I sl2)' Sf1 ... sf1 j S U ) , sf)l. .. s!2,i.. . s f I S> 2 ) S g(2)s g(2)s Si -w 1 C , si\" wig, sT'-wS', si\" w i C ] S\u00E2\u0080\u009E - w l C , (2)1 W i c , (2)1 W 2C 2 . . . (2)1 W s C s / s / / (2) W I C (2)i ^2 (2)1 w s C \ (2)s (2)s W , C | . . . W s C A G ( 2 ) 1 / \u00E2\u0080\u00A2 A G( 2 ) s \ A G < 2 , S S C<2>^ min ( A G ( 2 , ' . . . ,AG ( 2 i ! . . . ,A&*) = A G ' :(2)i I S W 1 sj3)1 si*... sf,s s (2) (2)1 sf c(2)i (2)i (3)1 W ic , (3)1 . . W s C (3)s W ic . (3)s . . W s C A G (3)1 A G (3)s C W - > min ( A G 2 ! 1 . . . . A G ^ - . ' A G ^ V . . . , A G ( 2 , S , A G ( 3 ) , . . . , A G ( 3 ) S J Figure 6.1: The algorithm for finding the fc-suboptimal M F E combinations of a combinatorial set. of the second combination will be AG^ = min(AG( 2 ) 1 , AG^2,... AG^2>), where AG^ is the M F E of the IS\u00C2\u00AE*. Let i be such that AG<2) = AG^ is the smallest M F E and let C^2) = {wwlw^c2 \u00E2\u0080\u00A2\u00E2\u0080\u00A2\u00E2\u0080\u00A2w^cl} denote the second best combination. The next step is to partition ISM\ in the same way we partitioned IS^. We will obtain the Input-Sets IS^X,IS^2,... IS\u00C2\u00AE8. Now, note that the following are true: \u00E2\u0080\u00A2 C^ and \u00C2\u00A3 CS^m and CS^n,\/m,n,i S). Procedure 6.1 shows the steps just described. Recursively continuing in the same way, we can find the best k combinations. However, note that the tree of partitioned Input-Sets will grow proportionally with fc, more exactly, it will grow by at most k \u00E2\u0080\u00A2 s, which implies increase in run time and space. Section 6.3 will discuss how to implement this extension in an efficient way. 86 It is important to note that when creating the new Input-Set, the ones with a lower index will typically have a bigger solution space (i.e. number of possible combi-nations) than the ones with a higher index. Thus, if after we found the second combi-nation, AG( 2 ) = AG^S, the third combination will be found much more quickly than if AG^ \u00E2\u0080\u0094 AG^1. Also, it is possible that the Input-Set which has the next best combi-nations will be partitioned in less than s partitions (or even no partitions at all), since the other partitions are empty. In this case, only the optimal M F E combinations of the non-empty partitions will be considered. Examples of the running time on some problem instances are discussed in Section 6.4. 6.3 Implementation In this section we describe implementation details of CombFold, as a logical extension of the implementation details described for SimFold, in Section 4.2. Implementation for the optimal M F E combination problem First, the pseudocode of the algorithm for the optimal M F E combination problem is de-scribed. The arrays W, W\u00C2\u00B0, Vc and WMC will store information necessary to backtrack and extract the combination with the smallest M F E , and its secondary structure and M F E value. Let W'ds, W$s, Vds and WMds denote the data structures associated with each of the arrays W, Wc, Vc and WMC. Wds(j) will contain information about the free energy of the best combination up to index j, as well as the index bj within the set. Wds(bj,j) will contain the free energy of the best combination up to index (bj,j), the number of branches of the multi-domain, the indexes of the last domain, and the right index (including its b) of the next domain. Vds(bi, bj,i,j) contains the free energy, the optimal type (HAIRPIN_LOOP, S T A C K E D _ L O O P , INTERNAL_LOOP or M U L T L L O O P ) for the pair (a.Cj) and details about where the internal branches for internal loops and multi-loops are located, in case (ci.Cj) closes such an elementary structure. Finally, WMds(bi,bj,i,j) contains the free en-ergy of the multi-loop fragment and other information needed to reconstitute the multi-loop branches completely. Note that the arrays W$s and V\u00C2\u00A3s, as opposed to the corresponding arrays W^s and for SimFold, will also contain information about the neighbouring indexes, such as bi+\,bj-i,i + l,j \u00E2\u0080\u0094 1, that have been chosen while constructing the path of the best combination. Using all these dependencies will permit finding the best combination in the backtracking procedure. Procedure 9 gives a pseudocode of the CombFold algorithm for determining the optimal M F E combination. The main idea is the same as in the case of SimFold, but now we have to traverse all the words in each set, in addition to the combinations indexed from 1 to n. The function Compute- WM\u00C2\u00B0 fills the WM^ array with information regarding partial multi-loops, using the equation for WMC given in Section 6.1. The function Compute-87 CombFold-optimal Procedure input: R N A or D N A Input-Set IS; output: minimum free energy A G , secondary structure R, combination C; procedure CombFold-optimal for (j := 1 to n) for {bj := 1 to g{s{j))) for (i := j \u00E2\u0080\u0094 1 down to 1) for (k&X({bj,j},{i})) WMcds{bubhi,j) := Compute-WMc(bi,bhi,j); end for; end for; for (i := 1 to j \u00E2\u0080\u0094 1) for (bieX({bj:j},{i})) V\u00C2\u00A3s(bi,bj,i,j) := Compute-Vc(bi,bj,i,j); end for; end for; end for; end for; for (j :\u00E2\u0080\u0094 2 to n) for (bj := 1 to g(s(j))) Wds(hi,J) \u00E2\u0080\u00A2= Compute-Wc(bj,j); end for; W'ds(j).free-energy := minbj(W%s(bj, j).free.energy); W^s(j).6 := bj that minimizes W^(6j, j).free.energy; end for; A G := W^s(n).free.energy; i := n; h := W^(z).6; while (z > 0 and W^(6 i ; i).num.branches > 0) (bid,bjd,id,3d) \u00E2\u0080\u00A2= W^s(fbi,i).last-domain; (R,C) := CombFold-Backtrack(bid,bjd,id,jd,Vl,W^); (bi,i) := W^s(6j,i).next_domain_i; end while; return A G , R, C; end procedure CombFold-optimal. Procedure 9: Pseudocode for the CombFold algorithm, when only the optimal MFE combination is searched for. Vc minimises over the four possible elementary structures, and also stores dependency information in the array V^s. Once this array is filled for all b{,bj,i,j, it can be used to find the multi-domains for each potential combination ending in bj,j. The function Compute-W2 fills the array W\u00C2\u00A3s, which can then be straightforwardly used to create the array W'ds. The value W s^.free_energy will be the lowest M F E of the given Input-Set. The combination C and the M F E structure R can be extracted by backtracking through the V^s array. Procedure 10 gives the resursive backtracking procedure, which is very close to the 88 CombFold-Backtrack Procedure input: Indexes bi,bj,i,j, data structures V\u00C2\u00A3s, Wds; output: Partially filled secondary structure R, partial combination C ; procedure CombFold-Backtrack if (j > j) return (0,0); else if (V^s{bubhi,j).type = HAIRPIN-LOOP) Save-Structure (R); Save-Combination (C); return R, C; else if (Vl{bi,bj,i,j).type = STACKED .LOOP) Save-Structure (R); Save-Combination (C); return CombFold-Backtrack ( 6 i + i , 6 j _ i , i + l,j - i . v&> else if (V&fbub^i, j).type = INTERNAL-LOOP) Save-Structure (R); Save-Combination (C); return CombFold-Backtrack (Vdcs(i, j).bi>,V^s(i J).br,Vd%(i,j).i\ Vd%(i,j).j>, Vd%WZs); else if (V^(bi,bj,i,j).type = MULTI-LOOP) Save-Structure (R); Save-Combination (C); for each branch B return CombFold-Backtrack (B.bi,B.bj,B.i B-j, Vd%, ); end for; end if; end procedure CombFold-Backtrack. Procedure 1 0 : Pseudocode for the backtracking algorithm for CombFold-optimal. corresponding procedure for SimFold, given in Procedure 4 in Chapter 4. Implementation of the fc-suboptimal M F E combination problem Procedure 11 shows the pseudocode for the k-suboptimal MFE combination problem which was described in Section 6.2. The input to the procedure Comb Fold-k-suboptimal is the Input-Set and k. The output is comprised of three vectors of size k, for the best k free energies, along with their associated secondary structures and combinations. To find the optimal combination, first the Comb fold-optimal procedure is called. To find the next k \u00E2\u0080\u0094 1 combinations, three stacks for candidate free energies and their associated structures and combinations are used. First, they are empty, and a resursive procedure, called Subopt-recursion, is called. Its pseudocode is described in Procedure 12. The input to the Subopt-recursion procedure is an Input-Set IS to partition, the remaining number of combinations k, the parent combination Cp, which will be used for partitioning of IS, and the three stacks which keep the candidate solutions. The output 89 CombFold-k-suboptimal Procedure input: R N A or D N A Input-Set IS, number of suboptimal combinations fc; output: vectors AGi:k, R\-k and C\\u00C2\u00B1 with the k best free energies, secondary structures and combinations; procedure CombFold-k-suboptimal ( A G i , i ? i , C i ) := CombFold-optimal (IS); AG-stack := 0; iJ-stack := 0; G-stack := 0; (AG2:fc,R2.k, C^-.k) Subopt-recursion (IS, k \u00E2\u0080\u0094 1,C\, AG-stack, i?-stack, G-stack); return AG1:k, Ri-.k, Gi-k; end procedure CombFold-k-suboptimal. Procedure 11: Pseudocode for the k-suboptimal MFE combinations problem. is three vectors containing the best combinations found, and their M F E and structures. The recursive procedure stops when there are no more suboptimal combinations to look for (k = 0), in which case it returns empty sets. .If > 0, first we partition the input set IS using Cp, as described in Section 6.2. A number of s new Input-Sets IS[,... ,IS'S will be created, and the optimal M F E combination will be found. The free energies, structures and combinations are added to the corresponding stacks. The smallest M F E out of all MFEs in AG-stack will be the the M F E of the next best combination. The corresponding structure and combination are stored, and then the procedure is called recursively, continuing from the last best combination found. Note that, in total, at most s \u00E2\u0080\u00A2 fc-calls to CombFold-optimal will be performed, and at most (s \u00E2\u0080\u0094 1) \u00E2\u0080\u00A2 fc elements will exist on each of the three stacks. However, some of the already calculated 4-dimensional arrays Vds and WMds can be used for the children. Consider the parent Input-Set IS^ = {s[l) ... S^} and consider that V\u00C2\u00A3s and WM% are calculated for this Input-Set. Consider the child partition IS^1 = {{w^l },... ,S\u00C2\u00B1 \u00E2\u0080\u0094 {wiCih Si+v \u00E2\u0080\u00A2 \u00E2\u0080\u00A2 \u00E2\u0080\u00A2 > T h e Vds a n d W M d s v a l u e s f o r Si+v \u00E2\u0080\u00A2 \u00E2\u0080\u00A2 \u00E2\u0080\u00A2. w i l 1 b e t h e same> h e n c e they can be reused for the children. This is an important save on running time and space, which is implemented in the current version of CombFold. In order to reuse parts of the already calculated data structures, the data structures corresponding to all nodes of the tree must be kept in memory. This implies need for much space, which increases with fc. A trick for saving some of this space, which is not implemented yet, would be to keep only the fc lowest M F E values in the stack AG-stack. There is no need to keep the next values, since they will never become one of the best fc combinations. The nodes in the tree corresponding to the values fc + 1 and beyond can be released from memory. 90 Subopt-recursion Procedure input: R N A or D N A Input-Set IS, number of suboptimal combinations k, the parent combination Cp, stacks AG-stack, i?-stack, C-stack; output: vectors AGi-k, R\-k,Ci-k with the k best free energies, secondary structures and combinations; procedure Subopt-recursion if (k = 0) return (0, 0, 0); end if; (IS[,IS'a) \u00E2\u0080\u0094 Partition-input-set (IS, Cp); for (i := 1 to s) (AG^R^CA := CombFold-optimal(ISl); end for; add A G ' 1 : s to AG-stack; add R[.s to fi-stack; add C[.s to G-stack; A G i := mini AG-stackij pos := get position of A G i in AG-stack; R\ '\u00E2\u0080\u00A2 R\u00E2\u0080\u0094stackpo^; C\ :\u00E2\u0080\u0094 G-stackpos; remove AGi from AG-stack; remove Ri from i?-stack; remove C\ from G-stack; (AG2:fc, i?2:fc, G2:fc) := Subopt-recursion (IS'pos, k \u00E2\u0080\u0094 1, C\, AG-stack, i?-stack, G-stack); return AG1:k, Rv.k, Ci:k; end procedure Subopt-recursion. Procedure 12: Pseudocode for the recursive function to find the next best combination in the k-suboptimal MFE combinations problem. 6.4 Time and space complexity Theoretical analysis Extending the 0 (n 3 ) algorithm for secondary structure prediction of single nucleic acid molecules, the optimal MFE combination algorithm traverses the Input-Set in the same way, but for each position i and j, several possibilities might exist. We consider that the number of words gi in each set Si is limited by a constant bound gmax, and we measure the complexity in terms of the combinations length: n = l\ + l2 + . \u00E2\u0080\u00A2 \u00E2\u0080\u00A2 + ls. Also, we consider that the ranges returned by the X function is bounded by a constant and will be omitted from the theoretical analysis. In practice, the number of words in each set, the number of sets, the length of the words in each set, as well as the nucleotides composing the set, all have an impact on the run time. First we give an analysis of the theoretical complexity, and later in this section we will analyse the CombFold implementation on several specific 91 Input-Sets. The theoretical time complexity of calculating each array described in Section 6.1 in the worst case follows: \u00E2\u0080\u00A2 W : 0(g max ' n ) j because for each j calculated in Wc, we minimize over all possible words of j: gmax, and j = l..n; \u00E2\u0080\u00A2 Wc: 0(gfnax \u00E2\u0080\u00A2 n 2 ) , because for each j, 1 < j < n there are at most gmax possibilities, and we minimize over i. The maximum number of options for the 6's of i and i,j's neighbours is 4 (third line in the equations for Wc) but bi and can only be in different words if the length of the word if l(s(i)) is 1. If this is true, then g(s(i)) is maximum 4 (because there are 4 different nucleotides), no matter what the values of gmax is- We consider this case as being constant; \u00E2\u0080\u00A2 Vc: 0(gn\ax \u00E2\u0080\u00A2 n 2 ) , because for each i and j, we minimize over a constant number of terms, and for each i and j there are at most gmax possibilities; \u00E2\u0080\u00A2 Sc: 0 ( < ? m a x -n-2), because for each i, j and their corresponding bi and bj, we minimize over potential different values for bi+\ and bj-\\ \u00E2\u0080\u00A2 H\u00C2\u00B0: 0(g^ax \u00E2\u0080\u00A2 n2), because for each i, j and their corresponding bi and bj, the term which has the greatest complexity has minimization over 4 terms, but 2 of them happen only if the word length is 1, so they are reduced to constant times; \u00E2\u0080\u00A2 VBIC\- 0(<7^jax \u00E2\u0080\u00A2 n4), but the same as for the previous algorithms, we assume the internal loops do not have more than c bases on each side between the branches (i. e. c = 30), and thus the complexity for internal loops becomes 0(gfnax \u00E2\u0080\u00A2 n2). The power of 8 comes from the most general case of internal loops; \u00E2\u0080\u00A2 WMC: 0 ( < ? m a x \" n 3 ) i because the most costly branch of the WMC calculation for each i and j is to find the best h for multi-loop partitioning. Each of i, j and h are in at most gmax words; \u00E2\u0080\u00A2 VMC: 0 ( 3 ^ 3 . \u00E2\u0080\u00A2 n 2), because for each i, j = l..n, which go over at most gmax possible words, we do a constant number of comparisons. Thus, if we consider both gmax and n in our analysis, the worst case time complexity is 0(g^ax \u00E2\u0080\u00A2 n 3 + gmax \u00E2\u0080\u00A2 n 2), because depending on the input, any of the two terms may be greater. In practice, gmax is often considered a constant, which leads to complexity proportional to n 3 . The arrays W, W\u00C2\u00B0, Vc and WMC need to be stored in memory. The space complexity is 0 ( 5 m a x \u00E2\u0080\u00A2 n 2), or 0 ( n 2 ) if we consider gmax a constant. The worst theoretical time complexity of the fc-suboptimal M F E combinations prob-lem is 0(s \u00E2\u0080\u00A2 k \u00E2\u0080\u00A2 gmax \u00E2\u0080\u00A2 n 3 + s \u00E2\u0080\u00A2 k \u00E2\u0080\u00A2 gmax \u00E2\u0080\u00A2 n2) and the worst space complexity is 0(s \u00E2\u0080\u00A2 k \u00E2\u0080\u00A2 b^^ \u00E2\u0080\u00A2 n 2). However, in practice, some of the Input-Sets after partitioning become empty. Also, as described earlier in Section 6.3, parts of the arrays and space can be reused. 92 Empirical analysis We compared the running time performance of CombFold with that of ExhaustS, a simple (exponential time) exhaustive search algorithm, which creates all possible combinations and runs SimFold on each of them. For Input-Sets with a small number of combinations, it is expected that CombFold takes more time and space than ExhaustS, because of the increased complexity. However, although the space is not a problem for ExhaustS, the running time quickly grows and becomes impractical. Figure 6.2 gives the run time performance of CombFold with k = 1,2,3,10 and ExhaustS on randomly generated Input-Sets of different characteristics. All the tests have been performed on machines with C P U Pentium III 733 MHz, memory cache 256 K B and R A M memory 1GB, running Linux 2.4.20. All graphs show the C P U time in seconds, presented on a log scale, versus variation of different characteristics of the Input-Sets. To simplify the analysis, we chose g\ = ... = gs = g and l\ = ... = ls = I, and we took variations of s, g and I. Having all set sizes equal and all set lengths equal, the number of combinations will be gs, and the length of the combinations will be I \u00E2\u0080\u00A2 s. The graph in (a) shows a comparison between the running time of CombFold with k = 1,2,3,10 and ExhaustS, on a set of 19 instances having g and / fixed at 2 and 10, respectively. The number of sets s varies from 1 to 19, yielding 2 1 = 1 combination of length 10 to 2 1 9 \u00C2\u00AB 0.5 \u00E2\u0080\u00A2 106 combinations of length 190. CombFold with k = 1 becomes faster than ExhaustS at s = 8, with k \u00E2\u0080\u0094 2 and 3 becomes faster at s = 10, and CombFold with k \u00E2\u0080\u0094 10 becomes faster at s = 12. Note that the slope of the curves suggest that CombFold grows polynomially, while ExhaustS grows exponentially. The graph in (b) shows a similar situation as in graph (a), but when g is fixed at 3 rather then 2. I = 10 and s takes values in the range 1 to 12, leading to 3 1 = 3 combinations of length 10 to 3 1 2 \u00C2\u00AB 0.5 \u00E2\u0080\u00A2 106 combinations of length 120. The number of combinations being bigger for the same s, CombFold with k \u00E2\u0080\u0094 1 outperforms ExhaustS when s = 6, with k = 2 and 3 when s = 7, and with k = 10 when s = 8. Graph (c) shows a comparison when s and I are fixed to 6 and 10 respectively, but g varies from 1 to 13. These yield l 6 = 1 to 136 \u00C2\u00AB 4.8 \u00E2\u0080\u00A2 106 combinations of length 60. Note that in this case ExhaustS grows polynomially, as opposed to cases (a) and (b). This happens because g is increasing and g is the base of the formula which gives the number of combinations: gs. However, ExhaustS grows more quickly than CombFold. Indeed, the graph shows that CombFold with k = 1 becomes faster than the ExhaustS when 5 = 3, with k = 2 and 3 when g = 4 and with k = 10 when g = 5. Graph (d) gives the comparison when s and g are fixed to 8 and 2, respectively, leading to a fixed number of 2 8 = 256 combinations. However, the length of the words vary from 10 to 100, yielding combinations of length 80 to 800. Again, ExhaustS grows more quickly, but still polynomially, only the length of the combinations being changed. ExhaustS is outperformed by CombFold(k = 1) at I = 10 and by CombFold(k \u00E2\u0080\u0094 2) at I = 50. On the instances we tested, ExhaustS outperforms CombFold with k = 10, and becomes roughly 93 Figure 6.2: Performance of CombFold with k = 1,2,3,10 and ExhaustS, on sets with different characteristics: (a) 19 instances with s ranging from 1 to 19, and the same g = 2 and I = 10; (b) 12 instances with s ranging from 1 to 12, and the same g \u00E2\u0080\u0094 3 and I = 10; (c) 13 instances with g ranging from 1 to 13, and the same s = 6 and I = 10; (d) 10 instances with I ranging from 10 to 100, and the same s \u00E2\u0080\u0094 8 and g = 2; (e) 50 instances with the same characteristics: s = 10,5 = 3, Z = 5; (f) 48 instances with the same characteristics: s = 8, g \u00E2\u0080\u0094 8,1 = 4. 94 the same speed as CombFold with k = 3 when I = 100. On all these four graphs, we note that CombFold with k = 1 and 2, and ExhaustS are nicely curved, while CombFold with k = 3 and 10 have \"hills\" and \"valleys\". To see how the curves look like, we created two sets of 50 instances of Input-Sets with exactly the same characteristics: graph (e) with s = 10, g = 3,/ = 5 and graph (f) with s = 8, g = 8, / = 4. The results comfirm the explanation we gave earlier in Section 6.2: When k = 1, CombFold fills all the arrays, a small variation happening due to the distribution of the nucleotides in the words. When k = 2, the arrays for s more sets are always calculated, no matter what the optimal combination is. However, depending on which the second best combination is, the size of the next Input-Sets that partition the solutions space can differ substantially. This influence propagates on to the next best combinations, such that when k = 10, the differences in time between different instances can vary substantially. Also, note that for some instances, the time for k = 3, and even for k = 10, is very close or equal to the time for k \u00E2\u0080\u0094 2. This means that the second best combination was part of a very small Input-Set, which was partitioned in fewer (or even 0) non-empy Input-Sets. The graphs also show the run time of the exponential algorithm. For graph (e) there are 3 1 0 & 60,000 combinations of length 50, and ExhaustS is more than one order of magnitude slower than CombFold with k = 1, and 5-6 times slower than CombFold with k = 10. For graph (f), where the number of combinations is 8 8 16.8 \u00E2\u0080\u00A2 106 of length 32, the exponential algorithm is substantially slower, being about two orders of magnitude slower than CombFold(k = 1), and more than one order of magnitude slower than CombFold(k = 10). 6.5 Applications and experimental results CombFold can be used for at least two important applications: (1) directed mutagenesis and S E L E X experiments [38] and (2) for testing whether D N A words used in D N A computations fold without secondary structures [8, 9, 10]. 6.5.1 Directed mutagenesis and S E L E X experiments Directed mutagenesis and S E L E X experiments are biochemical experiments where varia-tions of an R N A or D N A sequence are used to perform analysis on a library of sequences and study whether simple mutations of them have some better properties than others. The input sequences, which we call IUPAC-Input, are typically regular expressions, composed of characters (DNA or R N A nucleotides) and wild cards, which can be replaced by several different characters. A conventional coding for the wild cards- is to use 'TUPAC\" format (International Union of Pure and Applied Chemistry), which contains the following codes: R = G A (purine) Y = T C (pyrimidine) K = G T (keto) M = A C (amino) S = G C B = G T C D = G A T V = G C A N = A G C T (any) H = A C T 95 W = A T Thus, if the input sequence for a S E L E X experiment is AURCAAUGCSNAUGCSNAUGCAC, then we can convert it into an Input-Set in straightforward way: for each different wild card, we create a different set: Si s2 53 s5 Se s7 Ss s9 IUPAC-Input AU R CAAUGC s N AUGC s N AUGCAC Input-Set AU G CAAUGC G A AUGC G A AUGCAC A C G C G C C U U Thus, we can very easily transform the IUPAC-Input format into an Input-Set and use CombFold to predict which nucleotides replacing the wild cards would determine the sequences to fold into the most stable configurations. Collaboration with researchers doing practical directed mutagenesis or S E L E X experiments would be very valuable for testing and improving our tool. 6.5.2 D N A computing and tag - anti-tag libraries In search-and-prune D N A or R N A computations [8, 9, 14] or in tag - anti-tag libraries [10], sets of short D N A or R N A words are used to create many long D N A strands. These sets are carefully designed using computational or information-theoretic methods [56, 55], such that the resulting long strands behave well in computation. The main property for \"good behaviour\" of these long strands is that they do not form secondary structure (in other words, they are structure free), thus begin available to hibridyze with probes. We have collected five Input-Sets from the D N A computing literature and we used CombFold to test whether they are indeed structure free: 1. Braich et al. [8] used an Input-Set of 6 sets of 2 15-mer D N A words each, to solve a 6-variable 11-clause 3-SAT problem with a gel-based D N A computer. Note that there are 2 6 = 64 combinations of length 15 \u00E2\u0080\u00A2 6 = 90. In our tests, we call this set IS-Braich-2001; 2. Braich et al. [9] used an Input-Set of 20 sets of 2 15-mer D N A words each, to solve a 20 variable 24-clause 3-SAT problem with a DNA computer. There are 2 2 0 \u00C2\u00AB 106 combinations, of length 300. In our tests, we call this set IS-Braich-2002. 3. Brenner et al. [10] used an Input-Set of 8 sets of 8 4-mer D N A words each, to clone D N A molecules onto microbead surfaces. There are 8 8 \u00C2\u00AB 16.7 \u00E2\u0080\u00A2 106 combinations of length 4 \u00E2\u0080\u00A2 8 = 32 generated by this set. We call this set IS-Brenner-2000; 96 No Input-Set NA optimal M F E C P U time CPU time (source) (kcal/mol) CombFold (s) ExhaustS (s) 1 IS-Braich-2001 [8] DNA -0.21 3.58 1.13 2 IS-Braich-2002[9] DNA 0 63.68 \u00C2\u00AB 5 days 3 IS-Brenner-2000[10] DNA -3.19 187.05 15,520.08 4 IS-Faulhammer-2000[U] RNA -2.90 18.97 157.03 5 IS-Frutos-1997-r2[n] DNA -12.26 414.65 16.68 Table 6.3: Output of CombFold on some published combinatorial sets for structure freeness. No Sequence - structure 1 TATTCTCACCCATAAACACTATCAACATCACCTTTACCTCAATAAATCTTTAAATACCCCTCCATTT ( ( ( ( ( CTCCATATTTTCTTCCATCACAT ) ) ) ) ) 3 CAAAAATCCTTTTTACCAAAAATCCTTTTTAC . ( ( ( ( ) ) ) ) . . . ( ( ( ( ) ) ) ) . -4 CTCTTACTCAATTCTTCTACCATATCAACATCTTAATAACATCCTCCACTTCACACTTAATTAAAAT ( ( ( ( ( ( ( CTTCCCTCTTTACACCTTACTTTCCATATACAAGTACATTCTCCCTACTCCTTCATAATCTTATATT ( ( ( ( ( ) ) ) ) ) ( ( ( ( ( ( . CTCAATATAATCACATACTTCTCCAACATTCCTTATCCCACACACATTTTAAATTTCACAA . . . . ) ) ) ) ) ) ) ) ) ) ) ) ) 5 AACGTACGTCCTGCAATTCGATGCAGGACGTT ( ( ( ( ( ( ( ( ( ( ) ) ) ) ) ) ) ) ) ) \u00E2\u0080\u00A2 Table 6.4: The optimal M F E combinations (sequences and structures), as predicted by CombFold, for the Input-Sets in Table 6.5.2. 4. Faulhammer et al. [14] used an Input-Set of 19 sets. 10 of these have 2 15 mer words and other 9 sets having 1 word each of length 5 are used as spacers between the 10 2-word sets. They used this set for solving a variant of a \"Knight problem\" which finds in which configurations of knights placed on a 3 x 3 chess board no knight attacks any other knight. We call this set IS-Faulhammer-2000; 5. Frutos et al. [17] proposed a set of 108 8-mer D N A words to be used on D N A computing on surfaces. Several such sets of 108 words can be separated by different 4-mer spacers, in the configuration ((spacer) (108-word set) (spacer))r, where r denotes the number of times the three sets are repeated (note that the spacers are different for each repeat, and the 108-word set is always the same). We tested the set for r = 2 (1082 = 11,664 16-mer combinations) and we call it IS-Frutos-1997-r2. Table 6.5.2 shows the results of our tests on the five described sets. Column 3 shows the nucleic acid type (DNA or RNA) that was used by the authors. Our runs were performed on the same type. Column 4 shows the optimal M F E obtained for each of the sets. Columns 97 4 and 5 show the C P U time in seconds that CombFold with k= l and ExhaustS, respectively, spent to solve each problem instance. The parameter used for the temperature was set to 37\u00C2\u00B0C. The tests were performed on machines with C P U Pentium III 1GHz, 256 K B cache and 1GB R A M . Our predictions show that the set IS-Braich-2001 has a slightly negative free energy and the set IS-Braich-2002 is indeed structure free. Note that, while for the smaller set ExhaustS is faster, for the larger set, it is substantially slower than CombFold ( \u00C2\u00AB 5 days versus \u00C2\u00AB 1 minute). The remaining three sets are predicted to have structure for the combinations indi-cated in Table 6.4. For the set from Brenner et al., the C P U time of ExhaustS is substan-tially slower than the time spent by CombFold. For the last two sequences, the times are not substantially different. In conclusion, the predictions obtained with CombFold show that all of the sets except IS-Frutos-1997-r2 have free energy close to 0, indicating the sets are well designed. 98 Chapter 7 Conclusions and Future W o r k In this thesis we have introduced algorithms for secondary structure prediction of pairs of nucleic acid molecules and for finding the best combinations out of a combinatorial set of strands. First, a thorough description of the model used by these algorithms was provided in Chapter 3. Second, understanding the free energy minimization algorithm for secondary structure prediction of single nucleic acid molecules was necessary in order to follow the extensions that we propose. We gave a detailed description about this algorithm and our implementation, SimFold, and we analyse it in comparison to other implementations of the same algorithm and on biological data, in Chapter 4. Then, in Chapters 5 and 6, we described in detail our algorithms: PairFold and CombFold, and we analysed their complexity, performance and accuracy on biological data. We showed concrete examples on which our algorithms can provide useful information. Our work can be continued in several directions: \u00E2\u0080\u00A2 First of all, the free energy minimization algorithm for secondary structure predictions is based on a simplified model and set of thermodynamic parameters. We have shown that the prediction accuracy goes down with the sequence length. Improving the free energy model to give better prediction accuracy would be, in our opinion, the first and most important step for future work; \u00E2\u0080\u00A2 It has been shown that predicting suboptimal structures for single molecules improved the overall accuracy of the free energy minimization algorithm [33, 66]. Incorporat-ing suboptimal folding into our algorithms would provide more insight on nucleic acid folding; Also, partition function calculation for base pairing probabilities [34], as well as algorithms for pseudoknotted structures [40] could be incorporated into our algorithms; \u00E2\u0080\u00A2 Another important way to-extend our work is to combine PairFold and CombFold in one application, call PairCombFold,. to predict the most stable duplexes of two combinatorial sets of strands. This would be useful for at least three reasons: 99 \u00E2\u0080\u0094 predicting interactions between a pair of molecules, when at least one molecule is part of a library obtained by mutagenesis. Examples include libaries of ri-bozymes, where fragments of the ribozyme sequences (and sometimes the targets also) are combinatorial sets [68]. Predicting which of the randomized pairs bind more strongly to each other can provide insight on ribozyme behaviour; \u00E2\u0080\u0094 predicting binding site of a probe with a combinatorial set of molecules used in D N A computing [8, 9, 10]. We showed that CombFold can be used to test whether sets of D N A words concatenate without secondary structures. The next step for problems such as 3-SAT solvers using DNA computing is to make sure that the probes bind to the right target. A combination of PairFold and CombFold would predict which is the most stable binding and which are are next k ones; \u00E2\u0080\u0094 better designing probes to bind with targets. The OligoWalk program [32] finds the best probe to bind with a target, but only considers probes that are comple-mentary to windows of the target. With PairCombfold, one could consider the whole combinatorial set of probes. \u00E2\u0080\u00A2 We have performed analysis of PairFold on biological pairs of molecules. However, the literature of the last ten years includes many more such examples, especially on ribozyme - mRNA target complexes, on which PairFold can be tested. Moreover, creating a database of R N A pairs of molecules with known structures and using it to improve the prediction would be a very interesting future step; \u00E2\u0080\u00A2 We have shown that our algorithm CombFold has polynomial time and space com-plexity. However, in practice, it can become impractical for large sets of inputs. Op-timizing the algorithm and its implementation further would make it more efficient. A few optimization tricks have been proposed by Cohen and Skiena [12] and have not been applied here yet. Other optimizations, for more efficient data structures, can be useful; \u00E2\u0080\u00A2 We reported CombFold run times on several types of instances. However, predicting the CombFold run time on a given Input-Set is not trivial, especially when the number of words in each set and the length of the words across sets vary. Predicting this time for CombFold with k = 1 and k = 2 would be valuable. In conclusion, we contributed with two important algorithms and practical tools to the scientific community of computational and molecular biology. Further collaboration with researchers doing work related to what we propose will be crucial in finding benefit in their applicability and in improving their accuracy. 100 Bibl iography H. T. Allawi and J. SantaLucia Jr., Thermodyamies and NMR of Internal G-T Mismatches in DNA, Biochemistry (1997) 36, 10581-10594. H. T. Allawi and J. SantaLucia Jr., Thermodynamics of internal C-T mismatches in DNA, Nucl. Acids. Res. (1998), Vol. 26, No. 11. H. T. Allawi and J. SantaLucia Jr., Nearest-Neighbor Thermodynamics of Internal A-C Mis-matches in DNA: Sequence Dependence and pH Effects, Biochemistry (1998), 37, 9435-9444. H. T. Allawi and J. SantaLucia Jr., Nearest Neighbor Thermodynamic Parameters for Internal G-A Mismatches in DNA, Biochemistry (1998), 37, 2170-2179 M . Andronescu, R. Aguirre-Hernandez, A. Condon, H. Hoos, RNAsoft: a suite of RNA sec-ondary structure prediction and design software tools, Nucl. Acids. Res. (2003) 31: 3416-3422. M . Andronescu, D. Dees, L. Slaybaugh, Y . Zhao, B. Cohen, A. Condon, and S. Skiena, Algo-rithms for testing that sets of DNA words concatenate without secondary structure, Proc. 8th International Workshop on DNA Based Computers; LNCS (2003) 2568: 182-195. A journal version will appear in Natural Computing (in press). Fifth edition, McGraw-Hill Inc (1988). S. Bommarito, N. Peyret and J. SantaLucia Jr., Thermodynamic parameters for DNA sequences with dangling ends, Nucl. Acids. Res. (2000) 28: 1929-1934. R. S. Braich, C. Johnson, P. W. K. Rothemund D. Hwuang, N . Chelyapov and L. M . Adleman, Solution of a satisfiability problem on a gel-based DNA computer, Proc. of 6th Intl. Conf. on DNA Computation; Springer-Verlag LNCS (2001), 2045: 27-41. R. S. Braich, N . Chelyapov, C. Johnson, P. W. Rothemund and L. Adleman, Solution of a 20-variable 3-SAT problem on a DNA computer, Science (2002); 296 (5567): 499-502. S. Brenner, S. R. Williams, E. H. Vermaas, T. Storck, K. Moon, C. McCollum, J. I. Mao, S. Luo S, J. J. Kirchner, S. Eletr, R. B. DuBridge, T. Burcham and G. Albrecht, In vitro cloning of complex mixtures of DNA on microbeads: physical separation of differentially expressed cDNAs, Proc. Natl. Acad. Sci. U S A . (2000); 97 (4): 1665-1670. J. J. Cannone, S. Subramanian, M . N. Schnare, J. R. Collett, L. M . D'Souza, Y . Du, B. Feng, N . Lin, L. V. Madabusi, K. M . Muller, N . Pande, Z. Shang, N . Yu and R. R. Gutell, The Comparative RNA Web (CRW) Site: An Online Database of Comparative Sequence and Structure Information for Ribosomal, Intron, and other RNAs., BioMed Central Bioinformatics (2002), 3:2. [Correction: BioMed Central Bioinformatics. 3:15.] 101 [12] B. Cohen and S. Skiena, Designing RNA sequences: natural and artificial selection, Proc. 6th Int. Conf. Computational Molecular Biology (RECOMB) (2002), pp. 109-116. [13] R. M . Dirks and N . A. Pierce, A partition function algorithm, for nucleic acid secondary structure including pseudoknots, J. Comput. Chem. (2003); 24 (13): 1664-1677. [14] D. Faulhammer, A. R. Cukras, R. J. Lipton and L. F. Landweber, Molecular computation: RNA solutions to chess problems, Proc. Natl. Acad. Sci. (2000); 97 (4): 1385-1389. [15] D. S. Fields and R. R. Gutell, An analysis of large rRNA sequences folded by a thermodynamic method, Folding & Design (1996) 1: 419-430. [16] S. M . Freier, R. Kierzek, J. A. Jaeger, N . Sugimoto, M . H. Caruthers, T. Neilson and D. H. Turner, Improved free-energy parameters for predictions of RNA duplex stability, Proc. Natl. Acad. Sci. U S A . (1986), 83 (24): 9373-9377. [17] A. G. Frutos, Q. Liu, A. J. Thiel, A. M . Sanner, A. E. Condon, L. M . Smith and R. M . Corn, Demonstration of a word design strategy for DNA computing on surfaces, Nucl. Acids. Res. (1997); 25 (23): 4748-4757. [18] M . Gouy, Secondary structure prediction of RNA, Nucleic Acid and protein sequence analysis, Bishop M J & Rawlings CJ, IRL Press, Washington (1987). [19] A. J. F. Griffiths, W. M . Gelbart, R. C. Lewontin and J. H. Miller, Modern genetic analysis: integrating genes and genomes, 2nd edition, W. H. Freeman and Company, 2002. [20] A. P. Gultyaev, F. H. D. van Batenburg and C. W. A. Pleij, The computer simulation of RNA folding pathways using a genetic algorithm, J. Mol. Biol. (1995) 250, 37-51. [21] Gutell Lab Comparative RNA Web Site, http://www.rna. icmb.utexas.edu. [22] R. R. Gutell, J. C. Lee and J. J. Cannone, The accuracy of ribosomal RNA comparative structure models, Current Opinion in Structural Biology 2002, 12:301-310. [23] I. L. Hofacker, M . Fekete, P. F. Stadler, Secondary Structure Prediction for Aligned RNA Sequences, J.Mol.Biol. (2002) 319, 1059-1066. [24] I. L. Hofacker, W. Fontana, P. F. Stadler, L. S. Bonhoeffer, M . Tacker and P. Schuster, Fast Folding and Comparison of RNA Secondary Structures, Chemical Monthly (1994) 125: 167-188. [25] R. Hormes, M . Homann, I. Oelze, P. Marschall, M . Tabler, F. Eckstein and G. Sczakiel, The subcellular localization and length of hammerhead ribozymes determine efficacy in human cells, Nucl. Acids. Res. (1997) 25: 769-775. [26] HyTher - Hybridization Thermodynamics Web Page, http://ozone2.chem. wayne.edu/Hyther/hythermenu.html. [27] Y . Kasai, H. Shizuku, Y. Takagi, M . Warashina and K. Taira, Measurements of weak interac-tions between truncated substrates and a hammerhead ribozyme by competitive kinetic analyses: implications for the design of new and efficient ribozymes with high sequence specificity, Nucl. Acids. Res. (2002) 30: 2383-2389. [28] D. A. Konings and R. R. Gutell, A comparison of thermodynamic foldings with comparatively derived structures of 16S and 16S-like rRNAs, RNA (1995), l(6):559-74 102 [29] F. L i , G. D. Stormo, Selection of optimal DNA oligos for gene expression arrays, Bioinformatics (2001); 17 (11): 1067-76. [30] R. B. Lyngs0 and C. N. Pedersen, RNA pseudoknot prediction in energy-based models, J Comput Biol. (2000); 7 (3-4): 409-27. [31] R. B. Lyngs0, M . Zuker, C. N . Pedersen, Fast evaluation of internal loops in RNA secondary structure prediction, Bioinformatics (1999), 15 (6): 440-5. [32] D. H. Mathews, M . E. Burkard, S. M . Freier, J. R. Wyatt and D. H. Turner, Predicting oligonucleotide affinity to nucleic acid targets, RNA (1999), 5: 1458-1469. [33] D. H. Mathews, J . Sabina, M Zuker and D. H. Turner, Expanded Sequence Dependence of Thermodynamic Parameters Improves Prediction of RNA Secondary Structure, J. Mol. Biol. (1999) 288, 911-940. [34] J . S. McCaskill, The equilibrium partition function and base pair binding probabilities for RNA secondary structure, Biopolymers (1990), Vol 29, 1105-1119. [35] Mfold Web Server, h t tp : / /www.b io in fo . rp i . edu /app l i ca t ions /mfo ld . [36] U. Nagaswamy, M . Larios-Sanz, J. Hury, S. Collins, Z. Zhang and G. E. Fox, NCIR: a database of non-canonical interactions found in known RNA structures, Nucl. Acids. Res. (2002) 30: 395-397. [37] N. Peyret, A. P. Seneviratne, H. T. Allawi and J. SantaLucia Jr., Nearest-Neighbor Thermo-dynamics and NMR of DNA Sequences with Internal A-A, G-C, G-G and T-T Mismatches, Biochemistry (1999), 38: 3468-3477. [38] J. V. Ponomarenko, G. V. Orlova, A. S. Frolov, M . S. Gelfand and M . P. Ponomarenko, SE-LEX-DB: a database on in vitro selected oligomers adapted for recognizing natural sites and for analyzing both SNPs and site-directed mutagenesis data, Nucl. Acids. Res. (2002); 30 (1): 195-199. [39] J. H. Reif, T. H. LaBean and N. C. Seeman, Challenges and Applications for Self-Assembled DNA Nanostructures, Proc. 6th International Workshop on DNA Based Computers; LNCS (2001) 2054: 173-198. . [40] E. Rivas and S. R. Eddy, A dynamic programming algorithm for RNA structure prediction including pseudoknots, J. Mol. Biol. (1999) 285, 2053-2068. [41] RNAsoft - Software for R N A / D N A secondary structure prediction and design, http://www.RNAsoft.ca. [42] RNAstructure Web Site, h t tp : //128.151.176.70/RNAstructure.html. [43] P. B. Rupert and A. R. Ferre-d'Amare, Crystal structure of a hairpin ribozyme - inhibitor complex with implications for catalysis, Nature (2001) 410. [44] SantaLucia Laboratory, h t tp: / /ozone. chem. wayne. edu/. [45] J. SantaLucia Jr., A unified view of polymer, dumbbell, and oligonucleotide DNA nearest-neighbor thermodynamics, Proc. Natl. Acad. Sci. USA, Vol. 95, pp 1460-1465; Biochemistry (1998). 103 [46] J. SantaLucia Jr. and D. H. Turner, Measuring the Thermodynamics of RNA Secondary Struc-ture Formation, John Wiley & Sons, Inc. Biopoly (1997) 44: 309-319. [47] C. Schmidt, R. Welz and S. Mller, RNA double cleavage by a hairpin-derived twin ribozyme, Nucl. Acids. Res. (2000) 28: 886-894. [48] R. Schroeder, R. Grossberger, A. Pichler and C. Waldsich, RNA folding in vivo, Current Opin-ion in Structural Biology (2002), 12:296-300. [49] M . J. Serra, D. H. Turner and S. M . Freier, Predicting thermodynamic properties of RNA, Meth. Enzymol. (1995) 259, 243-261. [50] M . R.Shortreed, S. B. Chang, M . Andronescu, D. C. Tulpan, H. Hoos, A. Condon and L. M . Smith, An algorithm for designing structure-free concatenated DNA word sets, Proc. 9th International Workshop on DNA Based Computers (2003). [51] M . Sprinzl, C. Horn, M . Brown, A. Ioudovitch and S. Steinberg, Compilation of tRNA sequences and sequences of tRNA genes, Nucl. Acids. Res. (1998) 26: 148-153. [52] N . Sugimoto, S. Nakano, M . Katoh, A. Matsumura, H. Nakamuta, T. Ohmichi, M . Yoneyama and M . Sasaki, Thermodynamic parameters to predict stability of RNA/DNA hybrid duplexes Biochemistry (1995); 34 (35): 11211-11216. [53] J. Tang and R. R. Breaker, Mechanism for allosteric inhibition of an ATP-sensitive ribozyme, Nucl. Acids. Res. (1998) 26: 4214-4221. [54] P. A. Tipler, Physics For Scientists and Engineers, Third Edition, Worth Publishers (1991). [55] D. C. Tulpan and H. H. Hoos, Hybrid Randomised Neighbourhoods Improve Stochastic Local Search for DNA Code Design, Canadian Conference on A l (2003), 418-433. [56] D. C. Tulpan, H. H. Hoos and Anne Condon, Stochastic Local Search Algorithms for DNA Word Design, Proc. 8th International Workshop on DNA Based Computers (2002): 229-241. [57] Turner Group Web Page, http: //128.151.176.70/. [58] D. H. Turner, N . Sugimoto, J. A. Jaeger, C. E. Longfellow, S. M . Freier and R. Kierzek, Improved parameters for prediction of RNA structure, Cold Spring Harb. Symp. Quant. Biol. (1987) ; 52: 123-33. [59] D. H. Turner and N . Sugimoto, RNA structure prediction, Annu. Rev. Biophys. Biophys. Chem. (1988) ; 17: 167-92 [60] N . K. Vaish, A. R. Kore and F: Eckstein, Recent developments in the hammerhead ribozyme field, Nucl. Acids. Res. (1998) 26: 5237-5242. [61] Vienna RNA Package, RNA Secondary Structure Prediction and Compaiison http: / /www.tbi .univie.ac.at / ivo/RNA/ [62] A. E. Walter amd D. H. Turner, Sequence dependence of stability for coaxial stacking of RNA-Helixes with Watson-Crick base paired interfaces, Biochemistry (1994), 33, 12715-12719. [63] A. E. Walter, D. H. Turner, J. Kim, M . H. Lyttle, P. Muller, D. H. Mathews and M . Zuker, Coaxial stacking of helixes enhances binding of oligoribonucleotides and improves predictions of RNA folding, Proc. Natl. Acad. Sci. U S A . (1994); 91 (20): 9218-9222. 104 [64] J. G. Wetmur, DNA Probes: Applications of the Principles of Nucleic Acid Hybridization, Critical Reviews in Biochemistry and Molecular Biology (1991), 26 (3/4): 227-259. [65] M . Wu, J. A. McDowell, D. H. Turner, A Periodic Table of Tandem Mismatches in RNA Biochemistry (1995); 34 (10); 3204-3211. [66] S. Wuchty, W. Fontana, I. L. Hofacker and P. Schuster, Complete suboptimal folding of RNA and the stability of secondary structures, Biopolymers (1999); 49 (2): 145-65. [67] T. Xia, J. SantaLucia Jr., M . E. Burkard, R. Kierzek, S. J. Schroeder, X. Jiao, C. Cox, and D. H. Turner, Thermodynamic Parameters for an Expanded Nearest-Neighbor Model for Formation of RNA Duplexes with Watson-Crick Base Pairs, Biochemistry (1998), 37: 14719-14735. [68] Q. Yu, D. B. Pecchia, S. L. Kingsley, J. E. Heckman and H.M. Burke, Cleavage of highly struc-tured viral RNA molecules by combinatorial libraries of hairpin ribozymes, Journal of Biological Chemistry (1998), 273 (36): 23524-23533. [69] B. Yurke, A. J. Turberfield, A. P. Mills Jr, F. C. Simmel and J.L. Neumann, A DNA-fuelled molecular machine made of DNA, Nature (2000); 406 (6796): 605-608. [70] M . Zuker, Mfold web server for nucleic acid folding and hybridization prediction, Nucl. Acids. Res. (2003) 31: 3406-3415. [71] A. M . Zuker, B. D.H. Mathews and C. D.H. Turner, Algorithms and Thermodynamics for RNA Secondary Structure Prediction: a Practical Guide. [72] M . Zuker and P. Stiegler, Optimal computer folding of large RNA sequences using thermody-namics and auxiliary information, Nucl. Acids. Res. (1981) 9: 133-148. 105 Appendix A SimFold analysis Table A . l shows 7 artificially created sequences, on which SimFold and mfold predictions differ, as well as the M F E structures predicted by SimFold. Details are discussed in Sec-tion 4.4.1. No Sequence and MFE structure returned by SimFold 1 CUCCUCUUGAAACGACAGAUCCAUAAGGUCACUCAGUGAUGAACCUGCGCACUACUUGAUGAGCGCUAUACCUAUAAUCAGCGAGCCCUCGAGUCGAUCC \u00C2\u00AB < ( . . . ( ( \u00C2\u00AB . . . ( ( \u00C2\u00AB . ( ( \u00C2\u00AB ( \u00C2\u00BB \u00C2\u00BB > \u00C2\u00BB \u00C2\u00BB (..(((( ))))..)..)))))))).... 2 UCGGCCGAGUuACUCGAAGGGAUUGUUGGGACCCAUGCCGAACAAUAUAUCGUCGAAAAAAGGGUCUAGV/UAGCGCGGAACACUUGCCGGCCCGAUUGUU (CCCCCC <(((....<<((((.(< )).)))))) ))))...(((..(((.(....)..))>...)))..))).)))) 3 GUGCCUGGCUGGGUCUUCAGCCCUUAGGUAUGAUCUUACGGAAAAUCGUAAAGGUUCCUCGUCAGUGACAUAUGUCGCUCGACGCCAGCUUCUGGACGGU ((((((((((((...J)))))...)))))) CC(((..( )..)))>> <(((((C((C....>>)))).)>>) \u00C2\u00BB<(. . . ' )))>. . ' . . . 4 AUCGACCUGCAAUGCAAUUAAUGCGGAUGGUUUACUACCUGAUCGGUCACGCUCGCUUGGUGAUGACAUGACCCCUAUCCCCCGACUAUAAGGUAUCUAA ....(CCC ((( )))((((((....( )...C(CCC.((((((...)>)).)\u00E2\u0080\u00A2).)>))>\u00E2\u0080\u00A2>>>>>> )))) 5 AUGCCUACAUGUAACCUUGCUAGGUGCCACUUUUUGGUAUAAACAGCUCUUUGACGUACUACAUUACCCCUAGAGAGAGCGACGCGGACAGCUUCGCCGC (((( (((..(((((( ) )>>\u00C2\u00BB . . . . ) ) ) )))) (((( >>)>..((((.( ))))) 6 CAGCAGGCAUCGUUUCUGUCGGCGUAUACCUGCGAAAAUUUCACAUGGCGUGCGGCGCAACACCGAGCACGGCUACCAAGCCCAAAGCGGCCCUUGUUCG ..(((((.((<(<( )))).)).))))) ( (<( .<((\u00E2\u0080\u00A2\u00C2\u00AB( . . ( > . . . \u00C2\u00AB \u00C2\u00AB . . . . ) \u00C2\u00BB ) . . . . ) \u00C2\u00BB . \u00C2\u00BB > . \u00C2\u00BB \u00C2\u00BB . . 7 GCUAAGUAGAGUUCAGCAGAGGUUAGUUACAGAGCUAAGUGAUGUGGCGUAUGCAGUUAUGGAGCGAUAGUCAAGAACGUGCAGGAUAACGAAUUUAGUu ( ( ( ( ( ( ( . . .CCC.( ( ( . . ( . ( ( ( ( ( ( . . . . ) )>>) ) .> . .> )\u00C2\u00BB )> . . . (\u00C2\u00AB ( ( \u00C2\u00AB . . . ( ( . ( . . . . ) \u00C2\u00BB . . . ) ) ) . ) ) ) ) ))))))). Tabic A.l: 7 artificially created sequences, on which SimFold and mfold predictions differ. Table A.2 presents comparative analysis between real structures, SimFold prediction and RNAfold prediction, on the first 30 R N A sequences in the t R N A genes database [51]. The entries are in the alphabetical order of the identification number used in the database (second column). Columns 1 and 2 show the organism and identification number and column 3 shows the length of the sequence. Column 4 gives the minimum free energy, measured in kcal/mol, returned by SimFold. Column 5 gives the free energy of the real structure, using the model used in SimFold. Columns 6 and 8 report the number of pairs predicted by SimFold and RNAfold, respectively, out of the total number of pairs in the real structure. Columns 7 and 9 give the level of accuracy performed by SimFold and RNAfold, measured with two parameters. More information is given in Section 4.4.3. Organism Number Len A G S A G ' ft SimFold RNAfold #bp Ql ( ) ) ) ( ( ( ( ( . ( ( (CC )>))) ( ( ( ( . . . . ) ) ) ) . . . ) ) > ) ) 2 pred . real U G C A G A U C A U G A G G A U A U C O U U G A U G G C A U G C A C U A U G C G C G A U G A U C U G C A . ( ( ( ( \u00C2\u00AB ( ( ( ( ( ( ( ( ( ) ) ) ) ) ) . . . . ( ( ( ( ( . . . ) ) ) ) ) . . . > \u00C2\u00BB \u00C2\u00BB \u00C2\u00BB > ) , . ( ( (((((((( . (CCCC ))))) (((( ) ) ) ) . . . ) ) ) ) ) ) ) ) ) ) 3 pred. real A A U A A A C U C A A C G G A G G C C U G C G U U C U G A U G A G U C C G U G A G G A C G A A A G U U U A C C . . ( ( ( ( ( ( . ( ( ( ( . ( ( ( ) ) ) . ) ) ) ) ( ( ( ( . . . . ) ) ) ) . . . ) ) ) ) ) ) . . . . ( \u00C2\u00AB ( ( ( . ( ( < ( < ( ( ( ) ) ) ) ) ) ) ) ( ( ( ( . . . . ) ) ) ) . . . ) ) ) ) ) ) . . 4 pred. real G G C C A C C U G A C A G U C C U C U C C G G A G A G A G A A G U C A A C C A G A G A A A C A C A C C A A C C C A U U G C A C U C C G G G U U G G U G G U A U A U U A C C U G G U A C G G G G G A A A C U U C G U G G U G G C C G ( \u00C2\u00AB ( ( ( ( . ( ( ( . .CCC(CCCC ) ) ) ) ) ) . ) ) . ) ) ) . ( ( ( ( ( . . ( ( ( ( . ( ( ( ( ( ( ( ( ( ) ) ) ) ) ) ) ) ) ) ) . . . ) ) . . ) ) ) ) ) ( ( ( ( ( ( . . . . ) ) ) ) ) > ) ) ) ) ) ) ) . ( \u00C2\u00AB ( ( ( ( ( ( ( ( ( . ((CCCCCC \u00C2\u00BB \u00C2\u00BB ) ) ) ) ) ) \u00C2\u00BB > . < \u00C2\u00AB \u00C2\u00AB \u00C2\u00AB C U . ( ( ( ( ( ( ( ( ( ( ) ) ) ) ) ) ) ) ) ) . ) . ? ) . ) ) ) ) ) ) ) ( \u00C2\u00AB ( ( ( . . . . ) ) ) ) ) ) ) ) ) ) ) ) ) . 5 pred. real G C C G U A G G U U G C C C G G G C G A C C C U G A U G A G U U G G G A A G A A A C U G U G G C A C U U C G G U G C C A G C A A C G A A A C G G U ( ( ( ( ( . ( ( ( ( (CCC ) ) ) ) ) ) ) ) ( ( ( ( . ( ) . . C C ( ( ( C . . . ) ) ) > ) ) . ) ) ) ) . . . ) ) ) ) ) ( ( ( ( ( . (CCCC((( ) ) ) ) ) ) ) ) (((( ( ( ( ( ( ( . . . J ) ) ) ) ) . ) ) ) ) . . . ) ) ) ) ) 6 pred . real G C C G U A G G U U G C C C G G G C G A C C C U G A U G A G U U G G C G G C A C U U C G G U G C C G G G A A G A A A C U G C A A C G A A A C G G U (((((.(((CCCCC ) )>)>)) ) . . . . ( ( (C(( . . ( C C C C C . . . ) ) ) ) ) ) )))) .)) > ) \u00C2\u00BB > (((((.CCCCCC(( ) ) ) ) ) ) ) ) \u00C2\u00AB \u00C2\u00AB . ( \u00C2\u00AB ( \u00C2\u00AB . . . . \u00C2\u00BB ) ) \u00C2\u00BB ) ) ) ) . . . ) ) ) ) ) 7 pred. real U C A C A G U C C U C U U U G G G A G A C G U G G U A U A U U A C C U G G U U U C G A C C A G A G A A A C A C A C G A A A A A A A A A G A G A G A A G U G A A ( ( ( ( . . (CC( ( ( ( ( ( ( ( ( ( ( . . . ( ( . . ( ( ( ( ( . . . . ) ) ) ) ) . . ) ) ) ) . ) ) ) ) ) ) ) ) ) ) . ) ) . \u00C2\u00BB ) ) . ( ( ( ( . . . . ( ( ( ( ( ( (((( ( ( ( ( ( . . . . ) ) ) ) ) )))) ) ) ) ) ) ) . . . . ) ) > ) . 8 pred. real A G A C A G U C C A G A A A G G G A G U U U C U G A G A A G U C U A C C A G A G A A A C A C A C G U U G U G G U A U A U U A C C U G G U A ((((..(((((((( )))))).)).))))((((( (CC )>) ))\u00C2\u00BB>\u00E2\u0080\u00A2 CCCC. . . ( ( (< (C ) ) \u00C2\u00BB ) ) . . . . ) \u00C2\u00BB > < ( ( \u00C2\u00AB ( ( ( . . . ) ) ) ) ) ) ) ) \u00E2\u0080\u00A2 9 pred. real A G A C A G U C C A G A A A U C U C C C U C A C A G U C C U C U U U G G G A G A C G U G G U A U A U U A C C U G G U U U C G A C C A G A G A A A C A C A C G A A A A A A A A A G A G A G A A G U G A G G G A G A U U U C U G A G A ( ( ( ( . . ( ( ( ( ( ( ( < C C C ( ( ( ( ( ( C . ( ( ( ( ( ( ( ( ( ( ( ( ( ( . . . ( ( . . ( ( ( ( ( . . . . ) ) ) ) ) . . ) ) ) ) .)))) ) ) ) ) ) ) . ) ) . ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) . ) ) ( ( ( ( . . . . (<)) ) ) ) ) ) ) . . . . ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) . . . A G U C U A C C A G A G A A A C A C A C G U U G U G G U A U A U U A C C U G G U A . ) ) \u00C2\u00BB ( ( ( \u00C2\u00BB ' ( ( ( ) ) ) ) ) ) ) ) . .))))C(CCC ( ( ( . . . ) ) ) ) ) ) ) ) . 1 0 pred. real U C A G A A G G C U G U A G A C A A A U A C U G G C C A G U A U U U G U C C U G A U G A G G C C U C G A G C C C C A A A C A G C C U U C U G A (CCCCCC((((( .(((CC((((((( ))))>))))))) ( ( ( ( . . . . ) ) ) ) . . . ) ) ) ) ) ) ) ) ) ) ) ) ( ( (CCC(( ( ( ( ( . ( ( ( (C(( ( ( ( ( ( ) ) ) ) ) ) ) ) ) ) ) ) . . . . ( U ( ( ( ( . . . . ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) > 114 U U A G G C A U C U C C U A U G G C A G G A A G A A G C G G A G A C A G C G A C G A A G A C C U C C U C A A G G C A G U C A G A C U C A U C G G G A A C A A A A G C U U A U C U C U G A U G A G G C C U C G A G G C C G A A A C U . \u00C2\u00BB ( ( ( ( ( ( (\u00C2\u00BB ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( (\u00C2\u00BB ( ( ( ( ( ( ( ( ( ( ( ( ( (\u00C2\u00BB ( ( ( ( ( ( ( ( ( ( ( \u00E2\u0080\u00A2 : \u00E2\u0080\u00A2 . , \u00E2\u0080\u00A2((((( (C(( )))) . ) ) ) \u00C2\u00BB ( ( ( ( . . . . ) ) ) ) . . . ) \u00C2\u00BB <(((\u00C2\u00AB(<((((<(<<((<<<(<((((((((\u00C2\u00AB(((((<<(((<((<<(((<<<<((<((.(<< ))) \u00C2\u00AB<(<(....))>>\u00C2\u00AB.))) G C C U U G A G G A G G U C U X I C G U C G C U G U C U C C G C U U C U U C C U G C C A U A G G A G A U G C C U A A ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) . ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) > ) ) ) ^ ))))))))))))))))))))))))))))))))))))))))))))))))))))))))) Table B.3: Measurement of PairFold accuracy on a set of mRNA target - ribozyme pairs. 115 "@en . "Thesis/Dissertation"@en . "2003-11"@en . "10.14288/1.0051269"@en . "eng"@en . "Computer Science"@en . "Vancouver : University of British Columbia Library"@en . "University of British Columbia"@en . "For non-commercial purposes only, such as research, private study and education. Additional conditions apply, see Terms of Use https://open.library.ubc.ca/terms_of_use."@en . "Graduate"@en . "Algorithms for predicting the secondary structure of pairs and combinatorial sets of nucleic acid strands"@en . "Text"@en . "http://hdl.handle.net/2429/14551"@en .