Computational R N A Secondary Structure Design: Empirical Complexity and Improved Methods by Rosalia Aguirre-Hernandez B . S c , Universidad Nacional Autonoma de Mexico, 1996 M . Eng., Universidad Nacional Autonoma de Mexico, 1999 A THESIS S U B M I T T E D IN P A R T I A L F U L F I L M E N T O F T H E R E Q U I R E M E N T S F O R T H E D E G R E E O F Doctor of Philosophy in The Faculty of Graduate Studies (Mathematics) The University Of British Columbia Apri l , 2007 © Rosalia Aguirre-Hernandez 2007 A b s t r a c t Ribonucleic acids play fundamental roles in cellular processes and their func-tion is directly related to their structure. The research reported in this thesis is focused on the design of R N A strands that are predicted to fold to a given secondary structure, according to a standard thermodynamic model. The design of R N A structures is important for applications in therapeutics and nanotechnology. This work also applies to D N A with the appropriate ther-modynamic model for D N A molecules. The overall goal of this research is to improve the performance and scope of algorithmic methods for R N A secondary structure design. First, we in-vestigate the hardness of this problem, since its theoretical complexity is unknown. A scaling analysis on random and biologically generated struc-tures supports the hypothesis that the running time of the R N A Secondary Structure Designer (RNA-SSD) algorithm, one of the state of the art algo-rithms for designing secondary structures, scales polynomially with the size of the structure. We found that structures with small stems separated by loops are difficult to design. Our improvements to the RNA-SSD algorithm include the support for primary structure constraints, where bases or base types are fixed in cer-tain positions of the sequence. Such constraints are important, for example, when designing R N A s such as ribozymes or tRNAs , where certain base po-sitions must be fixed in order to permit interaction with other molecules. We investigate the correlation between the number and the location of the primary structure constraints and the performance of R N A - S S D . In the second part of our research, we have extended the R N A - S S D algorithm to design for stability, rather than minimum free energy folding. We measure stability according to several criteria such as high probability of observing the minimum free energy structure, and low average number of incorrectly paired nucleotides in the ensemble of structures for the designed sequence. The design of complexes of R N A molecules, that is R N A molecules that interact with each other, is relevant for many applications. We describe several ways to design stable structures and complexes, and we also discuss the advantages and limitations of each approach. i i T a b l e o f C o n t e n t s A b s t r a c t i i Table of Contents i i i L i s t of Tables v L i s t of F igures ix Acknowledgements xix D e d i c a t i o n xx 1 I n t roduc t ion 1 1.1 Motivation 4 1.2 Research goals and contributions 6 1.3 Thesis outline 9 2 R N A Secondary S t ruc ture Des ign P r o b l e m 10 2.1 Secondary structure of an R N A molecule 10 2.2 Free energy of a single R N A strand 11 2.3 Free energy of a duplex 14 2.4 Partition function 15 2.5 Stability 17 3 P rev ious W o r k 21 3.1 Protein design 21 3.2 Nucleic acid design 22 3.2.1 Biochemical methods 22 3.2.2 Computational methods 23 4 E m p i r i c a l C o m p l e x i t y of R N A - S S D A l g o r i t h m 30 4.1 Improvement of the RNA-SSD algorithm 30 4.2 Experiments 35 i i i Table of Contents 4.3 Results 40 4.3.1 Analysis of RNA-SSD and RNAinverse on secondary structures without constraints 40 4.3.2 Analysis of RNA-SSD on secondary structures with constraints 50 4.3.3 Performance of RNA-SSD with different number and locations of primary base constraints 53 4.4 Summary 56 5 Interaction of Two R N A Molecules 57 5.1 Description of the algorithms 59 5.1.1 "Internal" algorithm 60 5.1.2 "Linker" algorithm 62 5.1.3 "Interface" algorithm 64 5.2 Experiments 66 5.3 Results 69 5.3.1 Performance of Linker and Interface 69 5.3.2 Performance of Linker and Internal 75 5.3.3 Performance of Interface and Internal 78 5.3.4 Hardness of duplex design 85 5.4 Summary 86 6 Stability 87 6.1 The RNA-SSD-stability algorithm 88 6.2 Experiments 91 6.3 Results 94 6.3.1 Comparison of RNA-SSD-stability and I N F O - R N A . . 94 6.3.2 Comparison of RNA-SSD-stability and adaptive walk 105 6.4 Summary 120 7 Conclusions and Future Work 122 Bibliography . . 128 Appendix A : Undesignable structures 135 L i s t o f T a b l e s 3.1 List of algorithms that will be constantly mention in this the-sis for predicting and designing R N A secondary structures. . 24 4.1 I U P A C nomenclature for nucleic acids 34 4.2 Set of structures generated by folding random sequences with the RNAfold function from the Vienna R N A package. Nu-cleotides in the sequence are assigned uniformly at random. The sets of structures longer than 75 bases are smaller because of the amount of time required for designing these structures. 36 4.3 Biological structures obtained from the literature and used by Andronescu et al. [4]. Structures marked with an asterisk (*) were obtained from original, pseudoknotted structures by eliminating 8 base pairs in each case to remove the pseudoknot. 36 4.4 Properties of the structures from Table 4.3; the intervals spec-ify the minimum and maximum values observed for the re-spective features. These parameters were used to generate structures with biological properties. * These values denote the minimum and maximum ratio of bulges to base pairs in the stems 37 4.5 Sets of structures generated with the R N A structure genera-tor, using the parameters from Table 4.4 37 4.6 Set of structures used to study the correlation between the primary structure constraints and the performance of R N A -SSD. Structures with similar characteristics (such as size, number of multiloops, etc.) appear in the same group. The structure Bio-150-nl2 was also selected for the study because is relatively easy to design 53 v List of Tables 5.1 Pseudoknot-free ribozyme-substrate duplexes obtained from Andronescu et al. [5]. These duplexes have been selected arbitrarily from the biological literature. The length of the ribozyme and the target structure is specified by l\ and fo, respectively 67 5.2 Biologically motivated duplexes (BIOMD). The strands used to generate the duplexes have statistical features derived from biological structures (see Table 4.5). B IOMD-I refers to du-plexes generated by splitting a single strand in a random po-sition as explained in Section 5.2. BIOMD-II are generated by removing a hairpin from a single strand 68 5.3 Performance results for Linker, Interface and Internal on sets of biological duplexes. The columns SR(-) show the fraction of structures for which the respective algorithm found solutions in all of the runs (SR stands for success rate). R* is the target duplex; S* is the structure that results from concatenating the molecules D\ and D2 of the duplex; D\ and D2 are designed independently by RNA-SSD in the Interface algorithm. . . . 80 5.4 Performance results for Linker, Interface and Internal on the set of artificially generated structures BIOMD-50 81 5.5 Performance results for Linker, Interface and Internal on the set of artificially generated structures BIOMD-100 82 5.6 Performance results for Linker, Interface and Internal on the set of artificially generated structures BIOMD-200 83 5.7 Performance results for Linker, Interface and Internal on the set of artificially generated structures BIOMD-500 84 6.1 The first two structures are artificial R N A s generated by Dirks et al. [25]. Artificial 1-multiloop has one multiloop with four branches. Artificial 3-multiloop has three multi-loops, each of them with three branches. The rest are biolog-ical structures that were obtained arbitrarily from the biolog-ical literature; the same structures were used by Andronescu etal. [4] 92 vi List of Tables 6.2 Cutoff times for the adaptive walk and RNA-SSD-stability. The "running time" columns represent the time spent by the adaptive walk, in C P U seconds, to find a local minimum for each structure. The adaptive walk was run only one time for each stability criteria n(5*) and p(S*). The "cutoff time" columns represents the maximum time that we allow the adaptive walk and the RNA-SSD-stability to find a solution. 94 6.3 Statistics for the stability of the sequences designed by R N A -SSD-stability and I N F O - R N A when n(S*) is optimized. . . . 100 6.4 Statistics for the stability of the sequences designed by R N A -SSD-stability and I N F O - R N A when p(S*) is optimized. . . . 101 6.5 Probability of the shrep with the lowest free energy for the best and the median sequences designed by RNA-SSD-stability and I N F O - R N A . Sequences from the upper and lower Table were obtained by optimizing n(5*) and p(S*) respectively. We do not report any value for (*) because the target struc-ture was not found in the shapes generated by RNAshape. For this structure we generate up to 10 shapes with free energy 10% above the M F E . This suggests that the shape probability of this sequence is low 105 6.6 Performance results for RNA-SSD-stability and adaptive walk on artificial and biological R N A structures. The stability cri-teria to design the R N A structures are n(S*) and p(S*). The "cutoff time" columns indicate the running time in C P U sec-onds for both algorithms on our reference machine. The suc-cess rates show the fraction of runs for which the respective algorithm found a sequence that fold correctly into the target structure 106 6.7 Statistics for the stability of the sequences designed by R N A -SSD-stability and adaptive walk when n(S*) is optimized. The best, median and coefficient of variation, c^, are com-puted for the values of n(S*) and p(S*) of the sequences de-signed for a given structure. The correlation, pHtP, between n(S*) and p(S*) of all the runs of a given structure is also included 113 6.8 Statistics for the stability of the sequences designed by R N A -SSD-stability and adaptive walk when p(S*) is optimized. . 114 6.9 Stability of biological sequences with respect to the metrics n(S*) and p(S*) 114 List of Tables 6.10 Stability of sequences designed by RNA-SSD-stability and adaptive walk on biological R N A structures when n(S*) is optimized (upper Table) and when p(S*) is optimized (lower Table). We run the adaptive walk one time until it finds a local minima. The running time in C P U seconds for the re-spective structure is indicated in the second column. In all cases j a. valid sequence was found 119 6.11 Stability of the sequence designed by RNA-SSD-stability for the biological structure B-5. We performed one run of R N A -SSD-stability, where p(S*) is optimized, with a cutoff time of 205 135 C P U seconds and different values of pa and nr. The default values of pa and nr are 0.2 and 5 respectively. The success rate is one if the sequence fold correctly. The metrics n(S"*) and p(S*) are used to evaluate the stability of the sequences 120 1 Free energy parameters for internal loops. 136 L i s t o f F i g u r e s 1.1 Primary structure (a) and secondary structure (b) of an R N A molecule 2 1.2 Hairpin ribozyme-substrate complex. The arrow indicates the cleavage site in the substrate. Sequence requirements are specified, where N represents any nucleotide; Y represents C or U ; R represents A or G; B represents C, G or U ; V represents A , C or G . Illustration adapted from E . Puerta-Fernandez et al. [46] 5 2.1 Secondary structure motifs, (a) Hairpin, (b) Interior loop, (c) Bulge, (d) Multiloop, (e) Helix or stem, (f) Pseudoknot. (g) External loop 12 2.2 Free energy calculation of a secondary structure. Stems are denoted by S, hairpins by H, multiloops by M , internal loops by I, bulges by B and dangling ends by D. The free energy of this structure is A G ( S , X) = AG(Sl) + ••• + AG(S5) + AG(H1)+AG{H2)+AG(M1)+AG(I1)+AG{B1)+AG(D1) = -36.4 + 8.1 + 2.9 + 3.3 + 0.5 - 1.7 = -23.3 kcal/mol 13 2.3 Matrix of equilibrium base pair binding probabilities between all nucleotides in the sequence of Figure 2.2. The lower tri-angle represents the optimal structure. Illustration obtained with the program MFold from Zuker et al. [71] 17 2.4 Positive and negative design. The x-axis represents the struc-ture space S(X) of a sequence X and the y-axis the free en-ergy A G of each structure. Illustration adapted from Dirks et al. [25] 19 3.1 Tectosquare designed by Chworos et al. [20]. Panel (a) shows the R N A structure that can self-assembly to form the square of panel (b). Panel (c) shows the three-dimensional represen-tation of the tectosquare 23 i x List of Figures 3.2 Pseudocode for designing structures with the R N A - S S D al-gorithm 26 4.1 (a) Randomly generated structure of length 75 (RND-75-n62) with loops separated by short stems. The line represents the location where the structure is split into two substructures. Parts (b) and (c) show the corresponding substructures with static cap structure and dangling ends, respectively 31 4.2 Substructures from Figure l a with dynamic cap structure (a) and dynamic dangling ends (b) 33 4.3 Ribosomal R N A : Leptospira interrogans strain 94-7997013. The structure motif that consists of two internal loops sepa-rated by one base pair is hard to design. We verified this ex-perimentally by adding another base pair between the internal loops, whereupon the expected time to design the structure decreases by a factor of three 33 4.4 Scaling analysis of the expected run-time (y-axis) using R N A -SSD of structures of lengths 50, 75, 100, 125, 150, 200, 450 and 500 (x-axis). A logarithmic scale is used in the y-axis. The solid (dotted) line corresponds to best fits of the data, for structures with lengths 50 to 150, using a polynomial (expo-nential) that is specified in each case. The expected run-times for structures longer than 150 appear closer to the polynomial than the exponential fit line, (a) Median (Q50) of expected run-time of random structures, (b) 0.9-quantile (Q90) of ex-pected run-time of random structures, (c) Median (Q50) of expected run-time of biologically motivated structures, (d) 0.9-quantile (Q90) of expected run-time of biologically moti-vated structures 41 x List of Figures 4.5 Scaling analysis of the expected run-time (y-axis) of struc-tures of lengths 50, 75, 100, 125, 150, 200 and 450 (x-axis). A logarithmic scale is used on both axes. The lines correspond to best fits of the data, for structures with lengths 50 to 150, using a polynomial that is specified in each case. The ex-pected run-time for structures longer than 150 appear close to the corresponding fit line, (a) Median (Q50) of expected run-time of biological structures and median, 0.1-quantile (Q10) and 0.9-quantile (Q90) of expected run-time using R N A - S S D for biologically motivated structures, (b) Median of expected run-time of random and biologically motivated structures us-ing R N A - S S D and RNAinverse. Structures of length 200 are the largest structures of our data set that we designed with RNAinverse 43 4.6 Cost distribution of RNA-SSD. Distribution of expected run-time of RNA-SSD on (a) random structures and (b) biologi-cally motivated structures. For each point, the x-value gives an expected run time and the y-value gives the fraction of structures whose run-time is at most the x-value 44 4.7 Distribution of expected run-time of RNAinverse on (a) ran-dom structures and (b) biologically motivated structures. . . 44 4.8 Examples of structures not designed by RNA-SSD. Structures not designed by RNA-SSD have short stems separated by loops, indicated by arrows in the Figure, (a) Random struc-ture of length 450 (RND-450-n84). This is the only random structure in our data set that RNA-SSD did not design. Note that it has two internal loops separated only by one base pair, (b) Biologically motivated structure of length 74 (BIOM-50-n262) 45 4.9 Undesignable motifs. Two structure motifs of our data set that are not compatible with the thermodynamic model. Bold lines represent base pairs, (a) Motif B : bulges separated by one base pair, (b) Motif 21: internal loops separated by one base pair i 47 4.10 Secondary structure of small subunit ribosomal R N A of Acan-thamoeba castellanii. Biological secondary structure with three bulges, each separated by only one base pair. This structure was obtained from The Comparative R N A Web (CRW) Site [17, 18] 48 List of Figures 4.11 Secondary structure of small subunit ribosomal R N A of Es-cherichia coli. Biological secondary structure with two in-ternal loops separated by one base pair. This structure was obtained from The Comparative R N A Web (CRW) Site [17, 18, 28] 49 4.12 Distribution of expected run-time of RNA-SSD on three struc-tures of approximately 150 bases: RND-150-n85, BIOM-150-n89 and VS Ribozyme from Neurospora mitochondria. The structures were designed with two sets of primary base con-straints: one where the bases are fixed at random positions and the other where the bases are fixed on stems for each structure. Both sets have the same range [a, b] of constrained bases after propagation, where a and b are smallest and largest number of bases constrained if 50% of stems are fixed in a given structure. We fixed 50% plus one stems if a structure has an odd number of stems 51 4.13 Scaling analysis using RNA-SSD for the median expected run-time of biologically motivated structures with no primary base constraints and with bases constrained in fifty percent of random positions and fifty percent of stems. The lines repre-sent the polynomial that best fits the data. The experiment with primary structure constraints is computationally expen-sive, and for this reason, fewer structures of each length were used. The best fit for structures with constraints was calcu-lated based on those of lengths 50, 75 and 100. Note that the run-times for constrained structures longer than 100 appear below the corresponding fit line 52 List of Figures 4.14 Hardness of RNA-SSD with base constraints. Correlation be-tween the fraction of bases constrained in a particular struc-ture (x-axis) and the median expected run-time for designing the structure with RNA-SSD (y-axis). We report the frac-tion of constrained bases after propagation for constraints on randomly chosen base positions. This fraction, for both ran-domly chosen bases and stems, corresponds to the median fraction of bases constrained in a set of 50 constraints that were generated by fixing a given percentage of bases or stems. There are two curves in each graph, one for designing struc-tures with base constraints located in random positions and the other for constraints located in stems, (a) VS Ribozyme from Neurospora mitochondria; (b) Bio-150-n38; (c) Bio-150-nl4; (d) Group II intron ribozyme D135 from Saccharomyces; (e) Bio-200-nl9; (f) Bio-150-nl2. 54 4.15 Biologically motivated structure Bio-150-nl4 with ten stems. When constraining the bases in stems 7 and 8, this structure is hard to design. The structure motif formed by these stems, which are short and separated by a bulge, is unstable 55 5.1 (a) Secondary structure of hammerhead ribozyme in satel-lite D N A from Dolichopoda schiavazzi (cricket), (b) Trans-cleaving hammerhead ribozyme. This picture shows the sec-ondary structure model of the hammerhead-substrate duplex where the specified bases are important for catalytic activity. Dots represent any nucleotide. In both pictures the arrow indicates the cleavage site 58 5.2 R N A complex with kissing loop interactions. This complex designed by Chworos et al. [20] is a self assembling square made of four similar R N A strands, called tectoRNAs, that can have applications in nanobiotechnology 59 5.3 Pseudocode for designing duplexes with the "Internal" algo-rithm 61 5.4 Panel (a): Duplex structure (R*,b), where b = 14 in this case. Panel (b): a single structure S* is formed by concatenating the strands D\ and D2 of the duplex (R*,1A). Panel (c): Duplex structure (R*, 12). Panel (d): a linker of five bases is added between the strands TJi and D2 from (R*, 12) to obtain a structure S* allowed by the thermodynamic model 62 5.5 Pseudocode for designing duplexes with the Linker algorithm. 63 List of Figures 5.6 "Interface" algorithm to design duplexes, (a) Interface / : bases in the intermolecular helix are fixed, (b) Design of structures D\ and D2 independently with R N A - S S D . . . . . . 65 5.7 Pseudocode for designing duplexes with the "Interface" algo-rithm 65 5.8 Artificially generated duplex, (a) Biologically motivated struc-ture generated with R N A Structure Generator, (b) The arrow in (a) indicates the place where the structure is cut to obtain the duplex shown in (b) or the hairpin that is removed to obtain the duplex in (c) 68 5.9 Correlation between the expected running times in C P U sec-onds of the Linker (x-axis) and the Interface (y-axis) ap-proaches to design the biological duplexes from Table 5.1. . . 69 5.10 Correlation between the expected running times of the Linker and the Interface algorithms to design artificially generated duplexes. We arbitrarily (but unambiguously) report the run-ning time for structures that Linker or Interface is unable to design as 10 6 C P U seconds. Structures that are designed by none of the approaches are excluded, (a) BIOMD-50; (b) BIOMD-100; (c) BIOMD-200; (d) BIOMD-500 70 5.11 (a) Duplex number 10 from the data set BIOMD-I-200, see Table 5.6. (b) The Linker algorithm concatenates both molecules S* = D1D2 and design a sequence X = X1X2 that folds correctly into S*. (c) Secondary structure of X\ and X2 pre-dicted by PairFold. Note that X\ and X2 do not fold correctly into the target duplex R* 71 5.12 (a) Biological duplex BIOD-14. The corresponding sequences were designed by the Interface algorithm. The shaded bases belong to the intermolecular helix. The sequences X\ and X2 from (b) and (c) are designed independently by R N A -SSD. The shaded bases in this case correspond to nucleotides constrained in the interface. Note that X\ and X2 do not fold correctly into D\ and D2 respectively. 73 5.13 (a) Target structure BIOD-7. (b) Duplex formed by the se-quences X\ and X2 designed by Interface. There are two bases that are incorrectly paired that form an internal loop instead of the two consecutive bulges of Figure (a), (c) The sequences X\ and X2 fold correctly into D\ and D2 74 xiv List of Figures 5.14 Correlation between the expected running times in C P U sec-onds of the Linker (x-axis) and the Internal (y-axis) approaches to design the biological duplexes from Table 5.1 76 5.15 Correlation between the expected running times of the Linker and the Internal algorithms to design artificially generated duplexes. We arbitrarily (but unambiguously) report the run-ning time for structures that Linker or Internal is unable to design as 106 C P U seconds. Structures that are designed by none of the approaches are excluded, (a) BIOMD-50; (b) BIOMD-100; (c) BIOMD-200; (d) BIOMD-500 77 5.16 Correlation between the expected running times in C P U sec-onds of the Internal (x-axis) and the Interface (y-axis) ap-proaches to design the biological duplexes from Table 5.1. . . 78 5.17 Correlation between the expected running times of the In-ternal and the Interface algorithms to design artificially gen-erated duplexes. We arbitrarily (but unambiguously) report the running time for structures that Interface or Internal is unable to design as 106 C P U seconds. Structures that are not designed by none of the approaches are excluded, (a) BIOMD-50; (b) BIOMD-100; (c) BIOMD-200; (d) B I O M D -500 79 5.18 Scaling analysis of the median (Q50) of expected run-time (y-axis) of artificially generated duplex B I O M D of lengths 50, 100, 200 and 500 and artificially generated structures B I O M of lengths 50, 75, 100, 125, 150, 200 and 500 (x-axis). Duplexes are designed with Linker, Interface and Internal in Figure (a), (b) and (c) respectively. In all cases, single struc-tures from B I O M are designed with RNA-SSD. The line corre-sponds to the best fit of the B I O M data (obtained in Chapter 4) for structures with lengths 50 to 150 using a polynomial of degree three 85 6.1 Pseudocode for designing stable structures with R N A - S S D -stability. 89 6.2 Pseudocode for SLS-stable 91 xv List of Figures 6.3 Correlation between the average number of incorrect nucleotides and the probability of the target in the ensemble when se-quences are designed with RNA-SSD-stability and I N F O - R N A . Logarithmic values are used in both axes with n(S*) on the x-axis and 1 — p(S*) on the y-axis. A structure is stable if n(S*) and 1 — p(S*) are small. Panels (a) and (b) correspond to structure A - l and panels (c) and (d) to structure A-2. The stability metric n(S*) is optimized in panels (a) and (c) whereas 1 — p(S*) is optimized in panels (b) and (d) 96 6.4 Correlation between the stability metrics n(S*) and p(S*). Panels (a) and (b) correspond to structure B - l and panels (c) and (d) to structure B-2. The values of n(S*) and 1 - p{S*) of the corresponding biological sequence are indicated in each panel 97 6.5 Correlation between the stability metrics n(S*) and p{S*). Panels (a) and (b) correspond to structure B-3 and panels (c) and (d) to structure B-4. The values of n(S*) and 1 - p(S*) of the corresponding biological sequence are indicated in each panel 98 6.6 Correlation between the stability metrics n(S*) and p(S*). Panels (a) and (b) correspond to structure B-5 and panels (c) and (d) to structure B-6. The values of n(S*) and 1 - p(S*) of the corresponding biological sequence are indicated in each panel. 99 6.7 Solution quality distribution of RNA-SSD-stability and INFO-R N A with respect to the probability of shapes. Panels (a) and (b) correspond to structure A - l and panels (c) and (d) to structure A-2. The stability metric n(S*) is optimized in panels (a) and (c) whereas p{S*) is optimized in panels (b) and (d) 103 6.8 Solution quality distribution of RNA-SSD-stability and INFO-R N A with respect to the probability shapes. Panels (a) and (b) correspond to structure B - l and panels (c) and (d) to structure B-2. The stability metric n(S*) is optimized in panels (a) and (c) whereas p{S*) is optimized in panels (b) and (d) 104 xvi List of Figures 6.9 Correlation between the stability metrics n(S*) and p(S*) of the sequences designed for the artificial structure A - l . In panels (a) and (b) the sequences are designed by the adaptive walk and RNA-SSD-stability respectively using n(S*) as the stability measure. Panels (c) and (d) show the quality of the sequences designed by adaptive walk and RNA-SSD-stability respectively when 1 — p(S*) is optimized 107 6.10 Correlation between the stability metrics n(S*) and p(S*) of the sequences designed for the artificial structure A-2 (panels (a) and (b)) and the biological structure B - l (panels (c) and (d)) using the adaptive walk and the RNA-SSD-stability al-gorithm, respectively. The values of n(S*) and p(S*) for the biological sequence corresponding to B - l are also included. . . 108 6.11 Correlation between the stability metrics n(S*) and p(S*) of the sequences designed for the biological structure B-2. The values of n(S*) and p(S*) for the biological sequence corresponding to B-2 are also included in each panel 109 6.12 Correlation between the stability metrics n(S*) and p(S*) of the sequences designed for B-3 (panels (a) and (b)) and B-4 (panels (c) and (d)). The stability of the corresponding biological sequence is indicated by an arrow 110 6.13 Correlation between the stability metrics n(S*) and p(S*) of the sequences designed for the biological structure B-5. The stability of the corresponding biological sequence is indicated by an arrow I l l 6.14 Correlation between the stability metrics n(S*) and p(S*) of the sequences designed for B-6. The stability of the corre-sponding biological sequence is indicated by an arrow 112 6.15 Solution quality distribution for RNA-SSD-stability and adap-tive walk. The x-axis shows the probability of the shape that contains the target structure for every sequence designed by RNA-SSD-stability and the adaptive walk. The y-axis gives the fraction of structures whose probability shape is at most the x-value. Panels (a) and (b) show the stability of the sequences designed for structure A - l . Panels (c) and (d) cor-respond to structure A-2. The stability metric n(S*) is op-timized in panels (a) and (c) whereas p(S*) is optimized in panels (b) and (d) 116 xvii List of Figures 6.16 Solution quality distribution of RNA-SSD-stability and adap-tive walk with respect to the probability shapes. Panels (a) and (b) correspond to structure B - l and panels (c) and (d) to structure B-2. The stability metric n(S*) is optimized in panels (a) and (c) whereas p(S*) is optimized in panels (b) and (d) 117 1 (a) Motif I: internal loop formed by breaking the base pair Xi+i • X j _ 3 from motif B . (b) Motif II: internal loop formed by breaking the base pair X i + $ • £ j -4 from motif 21 135 A c k n o w l e d g e m e n t s M y deepest gratitude to Anne Condon and Holger Hoos for their guidance, advice and enthusiasm as my research supervisors. They provided me with countless hours of discussions and financial support. Their keen observa-tions and thorough revisions helped me to considerably improve my work. Certainly without their enormous patience and constant encouragement this thesis would not be completed. I also have to acknowledge Jenny Bryan, David Kirkpatrick, Lawrence Mcintosh and Paul Higgs who kindly agreed to serve as members of my committee and provided me with valuable feedback and recommendations on this project. Many thanks to Mirela Andronescu for her collaboration, enthusiasm and help when the technical aspects of my thesis became a burden. I would also like to thank Roman Baranowski for introducing me to the I A M cluster, and Dan Tulpan and Rachel MacKay for wonderful discussions. Thanks to Sanja Rogic for keeping me calm when I was finishing my thesis. And of course to the people in the beta lab and the I A M that made my stay in U B C enjoyable. I am grateful for the financial support from C O N A C y T and from the Government of Canada in the first years of the program. Special thanks to my parents and siblings, always supportive and encour-aging. Finally, but most importantly, my deepest gratitude to my husband Alberto for his sacrifices, endless patience, love and help in editing this doc-ument. xix To my lovely Tessa xx Chapter 1 Introduction Ribonucleic acids (RNA) are macromolecules that play fundamental roles in many biological processes. RNAs , such as ribosomal, transfer and messen-ger RNAs , have well-established roles in protein biosynthesis. Other R N A s called ribozymes, catalyze several essential biological processes such as R N A splicing, R N A processing and peptide bond formation during translation [46]. R N A s are composed of a single strand of four different nucleotides: ade-nine (A), cytosine (C), guanine (G) and uracil (U). Each strand has two chemically distinct ends, known as 5' and 3' ends; when written as a base sequence, the first base is considered to represent the 5' end. The unique sequence that describes a particular R N A molecule is known as the primary structure of that molecule (Figure 1.1 a). The R N A single-stranded chain can fold on itself, forming hydrogen bonds between pairs of bases. The most common pairs occur between the Watson-Crick complementary bases: G bonds with C, A bonds with U (and vice-versa), but there are also G - U wobble base pairs and non-canonical pairs formed by other bases. Folded R N A can be characterised at the secondary and tertiary structure level. A n R N A secondary structure is the set of Watson-Crick and wobble base pairs that form when the molecule folds. The base pairs produce helical regions, also called stems, and single stranded regions or loops (Figure 1.1 b). The tertiary structure is the relative location of atoms in the molecule in three-dimensional space. It includes the precise geometry of the base-pairing interactions as well as the spatial arrangement of any interaction between secondary and tertiary structure elements. The structure of most R N A s is essential for their biological function. Replication control in single-stranded R N A viruses [35] is governed by its secondary structure. Mechanisms for m R N A synthesis also rely on spe-cific R N A structures. For example, in bacteria, transcription termination is caused by the R N A polymerase responding to structural perturbations in the D N A template or by hairpin loops in the transcript [60]. In eukaryotes, m R N A contains introns whose splicing is catalyzed by a ribonucleoprotein complex known as spliceosome. Conserved secondary structure motifs of 1 Chapter 1. Introduction (a)5' G U C U U U U C C C A A U A A A A G G G A U G G G A U A A G G G U G G G U U U A A C C C C C C C C C C A U A A G A C A 3' A 3 ' Figure 1.1: Primary structure (a) and secondary structure (b) of an R N A molecule. introns can play an important role in intron recognition by spliceosomes. Furthermore, particular m R N A conformations can be recognized by regula-tory proteins that modulate translation initiation, and mRNAs encode sig-nals that modulate translation and regulate gene expression [22, 23]. These signals include sequences and structures that serve as targets for transla-tional repressors [60]. The structure of ribozymes also plays an important role regarding their catalytic activity. For example, allosteric ribozymes have an effector-binding site that is separate from their active site. When the effector binds to the ribozyme it induces a conformational change in the ribozyme that enhances or inhibits catalytic function; effectors can be proteins, oligonucleotides, metal ions, etc. [15] Computational approaches for prediction of R N A secondary structures are based on a thermodynamic model that associates a free energy value with each possible secondary structure for a strand at fixed conditions such as temperature and ion concentration. Thermodynamic parameters for R N A folding have been determined by different methods such as optical melt-2 Chapter 1. Introduction ing, microcalorimetry [52] and knowledge-based methods using databases of known structures [41]. The secondary structure with the lowest possible free energy value, the minimum free energy (MFE) structure, is predicted to be the most stable secondary structure for the strand and is known as ground state. There are dynamic programming algorithms that, given an R N A strand, find the secondary structure with M F E . The MFold server of Zuker [71] and RNAfold from the Vienna Package [29] are widely used pro-grams for predicting structures without pseudoknots (see Chapter 2 for the definition of pseudoknot-free structures) using the R N A energy parameters of the Turner group [41]. Both algorithms have an 8 (n 3 ) running time, where n is the length of the input R N A sequence. There are also several dynamic programming algorithms for predicting R N A secondary structures with pseudoknots, but most of these handle a restricted class of pseudo-knotted structures [47]. The general problem of predicting R N A secondary structures with pseudoknots is NP-hard but Rivas and Eddy [49] proposed an algorithm with a time complexity of 0 (n 6 ) that can handle a large class of structures [47]. We are interested in the inverse problem, the R N A Secondary Struc-ture Design Problem. Thus, in this thesis, we focus on the design of R N A strands that are predicted to fold to a given secondary structure, according to a standard thermodynamic model such as that of the Turner group [41]. It is not known whether this problem has a polynomial-time algorithm. A n algorithm for solving this problem takes as input a secondary structure of an R N A molecule (without specific bases assigned to the sequence positions), and outputs an R N A sequence that is predicted to fold into that structure. Current algorithms design for M F E , that is, they find a sequence with M F E structure as the target structure. Once the sequence is known, it is pos-sible to create the R N A molecules using established laboratory techniques. One of the best algorithms for solving this problem is a stochastic local search algorithm known as R N A Secondary Structure Designer (RNA-SSD) of Andronescu et al. [4]. Stochastic local search algorithms initialise the search process at a randomly chosen point of the given search space (here: the space of R N A sequences of a given length) and then proceed by iter-atively moving from the present point to a neighbouring point, where the decision on each search step is based on local knowledge only, as well as randomization [32]. One of the reasons why RNA-SSD has a good perfor-mance is because it decomposes the input secondary structure at multiloops. Different from other algorithms, RNA-SSD recursively splits the structure into two substructures in each decomposition step; thus obtaining a binary decomposition tree whose root is formed by the full target structure and 3 Chapter 1. Introduction whose leaves correspond to small substructures. Each non-leaf node of the tree represents a substructure of that can be obtained by merging the two substructures corresponding to its children. The SLS algorithm is only ap-plied to the smallest substructures, and the corresponding partial solutions are combined into candidate solutions for larger subproblems guided by a decomposition tree. 1.1 Motivation R N A Secondary Structure Design is important, both because it facilitates the characterization of biological R N A s by their function, as well as the design of new ribozymes that potentially can be used as therapeutic agents [14, 46]. There are also applications in nanobiotechnology in the context of building self-assembling, stable structures and devices from small R N A molecules [58]. There are several aspects that play an important role in the design of R N A structures for practical applications. For example, when designing ribozymes we need to be able to design duplexes, because ribozymes cat-alyze highly sequence-specific reactions determined by R N A - R N A interac-tions between the ribozyme and its substrate molecule. Substrate recogni-tion and binding to the substrate are essentially governed by Watson-Crick interactions. Figure 1.2 shows the sequence requirements for the hairpin ribozyme-substrate complex. In particular, primary structure constraints play a fundamental role in the design of the complex. Most of the nu-cleotides important for catalytic activity are contained within the unpaired regions. There are almost no sequence restrictions for bases involved in the formation of the intermolecular helices as long as base pairing is achieved. Substrate regions that interact with the ribozyme do not show high sequence requirements. However, a stable association between the two molecules is required, especially close to the cleavage site. The catalytic performance of the hairpin ribozyme can be improved by stabilisation of helix 4 (H4) [46]. Consequently, it is also important to design for stability. It is also possible to design D N A sequences for construction of D N A -based geometrical objects and nanomechanical devices. The key to using D N A for this purpose is the design of stable branched molecules, that can interact with other nucleic acid molecules [57]. A n important aspect in the study of the R N A Secondary Structure De-sign problem is to understand better the factors that render R N A structures hard to design. Such understanding provides the basis for improving the 4 Chapter 1. Introduction substrate 3' n n H2 5' r n n 3' 5' N GA GYNN A N v V ribozyme A A A G C N N Figure 1.2: Hairpin ribozyme-substrate complex. The arrow indicates the cleavage site in the substrate. Sequence requirements are specified, where N represents any nucleotide; Y represents C or U ; R represents A or G; B represents C, G or U ; V represents A , C or G. Illustration adapted from E . Puerta-Fernandez et al. [46] performance of computational approaches for solving this problem and for characterising its limitations. To our knowledge, it has not been determined whether there is a polynomial-time algorithm for R N A secondary structure design. Schuster et al. [53] performed experiments with the RNAinverse al-gorithm by Hofacker et al. [31] on few small random sequences and a simple t R N A to support the hypothesis that there is no need to search huge por-tions of the sequence space to find a particular structure by mutation and selection. Based on these experiments, they argue that sequences sharing the same structure are distributed randomly over sequence space and that common structures, that is, structures that have many sequences that fold into them, can be accessed from an arbitrary sequence compatible with the target structure by a number of mutations much smaller than the sequence length. These results are based on small sequences and therefore they do not give insight into the computational complexity of the design problem. On the other hand, Andronescu et al. [4] found evidence that some ribosomal R N A structures are difficult to design and that the correlation between the 5 Chapter 1. Introduction size and hardness is not very strong. The goals of our work in this thesis are: 1. Understand the empirical complexity of the R N A Secondary Structure Design problem, that is, the scaling of the typical difficulty of the design task for various classes of R N A structures as the size of the target structure is increased. Closely related is the identification of factors that make R N A structures hard to design. 2. Develop methods for R N A secondary structure design with primary structure constraints; in this case, the input comprises fixed bases or base types in certain positions of the sequence to be designed. 3. Develop methods for designing duplexes of R N A molecules with spe-cific secondary structure. 4. Develop methods for designing stable R N A molecules, since there are several R N A sequences that can fold into a given structure. Finding a sequence with low M F E structure is one approach to design for sta-bility. However, in Chapter 2 we will describe other stability criteria, discussed by Dirks et al. [25], that can be use to design more stable structures than the M F E approach. 1.2 Research goals and contributions In this section we explain our contributions to the rational design of R N A secondary structures. This includes building efficient algorithms to design duplexes and stable structures. To the best of our knowledge, there are no algorithms for designing duplexes in the literature and the only algorithm that design explicitly stable structures is the adaptive walk from Dirks et al, that we will describe in Chapter 3. However, this algorithm performs well only on short structures. To achieve our first goal, namely to understand the empirical complexity of the R N A Secondary Structure Design Problem, we present an empirical analysis of the performance of two algorithms for the problem. One is the R N A - S S D algorithm of Andronescu et al. [4] and the other is the R N A i n -verse algorithm of Hofacker et al. [31] from the Vienna R N A Package [29]. R N A - S S D is one of the best algorithms available to design R N A molecules in terms of the time spent designing a given structure and the number of structures that is able to design. We used an improved version of R N A - S S D that supports primary structure constraints; an online version is available 6 Chapter 1. Introduction at http://www.RNAsoft.ca. For the analysis we consider randomly gener-ated structures, obtained by folding a randomly generated sequence with the RNAfold function from Vienna Package; and also structures that are generated to have random features of biological structures, which we refer to as biologically generated structures. The use of random structures allows us to test the algorithms for types of structures that rarely occur in nature but that can have applications in nanotechnology. Furthermore, the per-formance of RNA-SSD on biologically generated structures is relevant for biological applications like molecular therapeutics. The scaling analysis on random and biologically motivated structures supports the hypothesis that the running time of both algorithms scales polynomially with the size of the structure. We also found that the algorithms are in general faster when con-straints are placed only on paired bases in the structure. When comparing both algorithms, the RNA-SSD algorithm performs better than RNAinverse since it requires less time to design a given structure and also because it is able to design more structures. Furthermore, we prove that, according to the standard thermodynamic model, for some structures that the R N A - S S D algorithm was unable to design, there exists no sequence whose minimum free energy structure is the target structure. Our next goal is to extend the RNA-SSD algorithm to be able to de-sign better structures that work in practice. Here we consider the design of complexes of two R N A molecules that can work as ribozymes or for appli-cations in nanostructure design. We describe three different approaches for R N A duplex design. These are the first algorithms for designing duplexes since to the best of our knowledge, there are no algorithms for designing complexes in the literature. Each of our algorithms takes as its input a complex of molecules that adopt a particular secondary structure. Depend-ing on the problem of interest, there may be input base constraints located in some positions of the complex. The outputs are two sequences that are predicted to form the desired complex according to the PairFold function of Andronescu et al. [5]. The Linker approach concatenates two molecules to generate a single structure that is designed with R N A - S S D . The designed sequence is then separated into the corresponding strands. The Interface approach assigns bases in the positions of intermolecular helices of the com-plex. Then both structures are designed independently by R N A - S S D where the bases located in the regions where both structures interact remain fixed in R N A - S S D . The Internal approach modifies the R N A - S S D algorithm to design duplexes inside the algorithm where the candidate solutions are eval-uated with PairFold. A scaling analysis on these algorithms show that there is no evidence that the running time of Linker and Internal scale exponen-7 Chapter 1. Introduction tially with the length of the duplex. When comparing these approaches, we found that Interface has the lowest running time and designs less du-plexes in our data set than Linker and Internal. Frequently, Interface has the problem that components of the designed sequences form undesirable interactions even though each sequence fold correctly into the structure of the corresponding molecule. We also found that Linker has the best running time but Internal designs more complexes than the other algorithms. Inter-nal uses the PairFold function to evaluate the designs inside the algorithm. The theoretical running time of PairFold is 0 ( n 3 ) where n is the sum of the two sequence lengths. However in practice the running time is longer than RNAfold due to additional checking for the location of the intermolecular linkage in the calculation of the free energy of the duplex [3]. That is, in practice the running time of PairFold is a constant times more expensive than RNAfold that evaluates designs for single sequences. Therefore, there is a trade-off between using PairFold internally or not since in the former case is more expensive even though it designs more' duplexes. Finally, we extend the RNA-SSD algorithm to design stable structures since to the best of our knowledge there are no algorithms reported in the literature that design stable structures explicitly. We design for stability inside R N A - S S D at the lowest level of the algorithm where the smallest substructures of the decomposition tree are designed. A n SLS algorithm is used to find a stable substructure with respect to some stability criteron like the probability of observing the M F E structure, or the average number of incorrectly paired nucleotides in the ensemble of structures of the designed sequence. The performance of the extended RNA-SSD, called R N A - S S D -stability, is evaluated by comparing the stability of the sequences designed with this algorithm and other approaches like the adaptive walk procedure of Dirks et al. [25] and the I N F O - R N A algorithm of Busch et al. [16]. RNA-SSD typically does not achieve the performance of I N F O - R N A when designing M F E structures. However, RNA-SSD-stability performs better than I N F O - R N A which does not use any stability measure to design struc-tures. When RNA-SSD-stability is compared with the adaptive walk, we find that our algorithm designs more stable sequences in less time, especially for structures that are difficult to design. However, for very long run-times, the adaptive walk performs better than RNA-SSD. A n insight gained from our comparison is that RNA-SSD-stability designs stable structures by merging stable substructures. Therefore, we believe that further improvements can be obtained if stable sequences are designed at every level of the decomposi-tion tree, possibly by keeping several candidate subsequences to concatenate, and selecting the most stable. 8 Chapter 1. Introduction 1.3 Thesis outline The remainder of this thesis is structured as follows. In Chapter 2, we give an overview of existing related work from the literature. Computa-tional approaches to the design of R N A secondary structures are discussed in Chapter 3 including heuristic algorithms available to solve this problem such as RNAinverse, I N F O - R N A and RNA-SSD. In Chapter 4 we present an extension of RNA-SSD that supports primary structure constraints and study the empirical complexity of the R N A Secondary Structure Design Problem. Chapter 5 deals with the design of R N A duplexes, and Chapter 6 with the design of stable R N A molecules. The conclusions of this thesis and the future directions in which this work can be extended are discussed in Chapter 7. A proof that some structural motifs are impossible to design with the current thermodynamic model of the Turner group [41] is provided in the Appendix. 9 Chapter 2 R N A S e c o n d a r y S t r u c t u r e D e s i g n P r o b l e m In this chapter we give some background to define the R N A Secondary Structure Design Problem. 2.1 Secondary structure of an RNA molecule R N A secondary structure is characterized as the set of base pairs induc-ing a structure like the one in Figure 1.1 (b). Consider a sequence X = xi,X2,X3, ••• xn, where Xi G {A, C, G, U} Vi = 1, • • • , n . For 1 < i < j < n, let i • j denote the pairing of base Xi with Xj. A secondary structure 5 on a sequence X is a set of base pairs P such that two base pairs in S, i • j and i! • j' are either identical, or else i ^ i' and j ^ j'. This means that a base is paired with at most one other base. Figure 2.1 shows a classification of the loops and the base pairs in a secondary structure. We can classify loops formed by the bonding of base pairs in a structure according to the number of base pairs that they contain. A hairpin loop contains exactly one base pair (Figure 2.1 a). A n internal loop contains exactly two base pairs (Figure 2.1 b). A bulge is an internal loop with one base from each of its two base pairs adjacent (Figure 2.1 c). Furthermore, a stacked pair is defined as two consecutive base pairs (i • j) and (i + 1 • j — 1) (Figure 2.1 e). A stem is formed by one or several stacked pairs. A multibranched loop or multiloop is a loop that contains more than two base pairs (Figure 2.1 d). A n exterior or closing pair is the base pair in a loop closest to the ends of R N A strand. More precisely, the exterior pair is the one that maximizes j — i over all pairs i • j in the loop. A l l other pairs are interior. Dangling bases are free bases located in the immediate vicinity of a stem. A n R N A secondary structure includes a pseudoknot if there exist two base pairs i • j , i! • j' in the structure with i < i! < j < j' (Figure 2.1 f) otherwise it is pseudoknot free. The bases (i,3,h,ji, • • • ,id,jd) with d > 0,i < j < n < ji < • • • < id < jd define an 10 Chapter 2. RNA Secondary Structure Design Problem external loop if i pairs with j, i\ pairs with j i , • ••,id pairs with jd and k is a free base Vfc, 1 < k < i,j < k < i i , • • • ,jd < k < n where n is the length of the sequence (Figure 2.1 g). A domain is a substructure which is closed by a base pair of an external loop, that is, a domain closed by i • j is the set of all base pairs whose indices are in the interval The external loop shown in Figure 2.1 (g) contains two domains and seven free bases. The free bases are called external bases because they are not inside any domain, but they are between domains or between a domain and one of the ends of the strand. 2.2 Free energy of a single R N A strand Associated with a secondary structure of a strand is its free energy. The free energy of a secondary structure measures (in kcal/mol) the specificity of the sequence for a secondary structure at fixed temperature. The bases that are bonded tend to stabilize the R N A , whereas unpaired bases form destabilizing loops. The sequence is most likely to fold into the structure with lowest free energy. For pseudoknot-free secondary structures, the free energy is calculated with the nearest neighbour thermodynamic model described by Zuker et al. [74] and Mathews et al. [41]. This model calculates the free energy of a structure as the sum of free energies of each stacked pair and each loop using experimentally obtained thermodynamic data. A n example is given in Figure 2.2. Let AG(X, S) denote the free energy of an R N A sequence X when folded into a secondary structure S. Furthermore, let $ denote a function that assigns to each R N A sequence X a secondary structure S* that minimizes free energy AG(X, S) over all possible secondary structures S of X. The R N A Secondary Structure Prediction Problem can be formulated as follows. Given an R N A sequence X, determine $(X). Zuker and Stiegler developed a dynamic programming algorithm for finding the minimum free energy (MFE) secondary structure S* without pseudoknots of an R N A molecule X. They use the nearest neighbour thermodynamic model to evaluate the energy AG(X, S) of a sequence X folded into the secondary structure S. The running time of Zuker and Stiegler's algorithm is 0 (n 4 ) [75] and it has been reduced to 6 (n 3 ) by Lyngs0 et al. [39]. The program mfold [72] was the first implementation of Zuker and Stiegler's algorithm and is available online [73]. The Vienna R N A Package [31] also implements Zuker and Stiegler's algorithm. It is available online and is free open source software [30]. 11 Chapter 2. RNA Secondary Structure Design Problem i+1 external bases (g) Figure 2.1: Secondary structure motifs, (a) Hairpin, (b) Interior loop, (c) Bulge, (d) Multiloop, (e) Helix or stem, (f) Pseudoknot. (g) External loop. 12 Chapter 2. RNA Secondary Structure Design Problem Figure 2.2: Free energy calculation of a secondary structure. Stems are denoted by S, hairpins by H, multiloops by M, internal loops by / , bulges by B and dangling ends by D. The free energy of this structure is AG(S, X) = AG(S1) + ••• + AG(S5) + AG(H1) + AG(H2) + AG (Ml) + AG(I1) + AG(B1) + AG(D1) = -36.4 + 8.1 + 2.9 + 3.3 + 0.5 - 1.7 = -23.3 kcal/mol. 13 Chapter 2. RNA Secondary Structure Design Problem A n abstract secondary structure S of length n, (n, S), is defined as a set of integer pairs (i, j) 1 < i < j < n, such that each i is contained in at most one pair and no two base pairs (i, j) and (i',f) of 5 cross, that is, it is not the case that i < i' < j < f or that i' < i < j' < j. The R N A Secondary Structure Design Problem can be stated as follows. Given an abstract R N A secondary structure (n, S*), find a sequence X* such that <&(X*) = S*. 2.3 Free energy of a duplex In some applications, it is desirable to predict the secondary structure of two or more interacting RNAs . Such predictions aid in understanding mecha-nisms for ribozyme function [69] and in designing novel ribozymes [7] or nanostructures [8]. The free energy calculation for a pair of R N A molecules is very similar to the free energy calculation for one molecule. The free energy of a duplex of R N A s is calculated with the nearest neighbour thermodynamic model as the sum of the energies of its loops. It is possible to distinguish regular and special loops. Let Y = yv\)2 • • • y n be the sequence obtained by concatenating two R N A sequences X\ and X2 and let b denotes the number of nucleotides in X\, that is, b is the linkage location between X\ and Xi- A loop is "special" if the linkage location b lies within it; otherwise it is "regular". The free energy of the special structures is calculated in the same way as for regular structures, except that an inter-molecular initiation penalty I = 4.1 kcal/mol [5] is added if the special structure is a hairpin, an internal loop, a multiloop or a stacked pair. Andronescu et al. developed an efficient algorithm, PairFold [5], that predicts the M F E secondary structure that can be formed by two interacting nucleic acid molecules. This algorithm takes as input a pair of R N A strands Xi and X2, and extends the dynamic programming algorithm by Zuker and Stiegler [75] for single molecules. The worst-case time and space complexity for PairFold when calculating the M F E structure are 6 ( n 3 ) for time and 0 (n 2 ) for space, where n is the sum of the two input sequence lengths. The structure of a duplex is specified as (n, R, b) where R is the secondary structure of length n of the interacting strands X\ of length b and X2 of length n — b. We refer to b as the linkage location. A n abstract duplex R of length n, (n,R,b), is defined as a set of integers pairs 1 < i < j < n such that each i is contained in at most one pair and no two base pairs and (i',f) of R cross, that is, it is not the case that i < i' < j < f or that i' < i < j' < j. The R N A Secondary Structure Design Problem for 14 Chapter 2. RNA Secondary Structure Design Problem Duplexes can be formulated as follows. Given an abstract duplex secondary structure (n, R, b), find a pair of sequences X{ of length b and X% of length n-b such that fc^f, X J ) = (n, R, b). 2.4 Partition function Although free energy models for secondary structure loops have been refined over time to achieve a better characterization of folding thermodynamics, the energy parameters are still inaccurate [41]. In the previous section we described how the free energy of a structure is computed by adding the free energy of each loop and stacked pair. A slight deviation in the free energy parameters can lead to substantial differences in the computed M F E struc-ture. Hence, the M F E structure derived from a folding algorithm may not be a true structure, i.e. the structure into which the molecule folds. Another consideration that has to be made is that R N A molecules are not in the (true) M F E structure all the time because base pairs are constantly forming and breaking. For example, co-transcriptional folding leads to the forma-tion of temporary secondary structure elements [36, 48] that have biological functions, e.g. as initial sites for protein anchoring during pre-mRNA tran-scription [50]. Furthermore, R N A conformational switching is fundamental in translational regulation, protein synthesis, and m R N A splicing [38]. How-ever, the M F E structure is the most likely structure. These suggest that it is beneficial to characterise the ensemble of all secondary structures. The probability P(S) of sampling a secondary structure S in the ensem-ble of structures of a given R N A molecule X with free energy AG(X, S) is proportional to e - A G ( x > s ) / R T where R is the gas constant and T is the absolute temperature. The partition function, from statistical mechanics, is a normalising constant that allows estimation of the probabilities from the free energy values and is given by where S(X) is the set of all possible secondary structures of a strand X. Therefore, the probability of observing a structure S in the ensemble of a given sequence X is (2-1) SeS(x) e-AG(X,S)/RT (2.2) To simplify notation we will use P(S) from now on. 15 Chapter 2. RNA Secondary Structure Design Problem Note that the partition function (see Equation 2.1) is a weighted sum over all admissible secondary structures of a given R N A sequence. A n admissible secondary structure has a set of base pairs that can be formed from the R N A sequence. The lower the free energy of a structure, the higher its weighting, that is, the contribution to the sum. The computation time for calculating the partition function by explicitly summing all terms grows exponentially with the length n of the sequence. But McCaskil l [42] derived a dynamic programming algorithm to calculate the partition function of pseudoknot-free structures in time G(n 3 ) . McCaskill also introduced a matrix of equilibrium base pair probabilities between all nucleotides in the R N A sequence. The values displayed in the upper triangle of the matrix (sometimes graphically shown as differently sized dots or boxes) represent the sum of the probabilities associated with all the structures in which the chosen base pair occurs. The lower triangle of the matrix is often used to illustrate the M F E structure. This matrix summarizes the features of the global ensemble of structures at equilibrium. Figure 2.3 shows the matrix for the sequence of Figure 2.2. The pairing probability for every possible pair can be calculated effi-ciently using a dynamic programming algorithm that is described by Mc-Caskill [42]. Let pij denote the probability of forming a base pair i • j for 1 < i , j < , where n is the sequence length and let Pi,n+i be the probability that i is unpaired for 1 < i < n. Let ^ Q _ f 1 if i • j is a base pair of the structure Sa for 1 < i, j < n li I 0 otherwise (2.3) 1 if i is an unpaired base of the structure Sa for 1 < i < n 0 otherwise (2.4) then saes(x) where P(Sa) is the probability of observing Sa in the ensemble of structures S(X). 16 Chapter 2. RNA Secondary Structure Design Problem 1 10 2?1 £3 (0 *a 53 Optimal Energy = -23.3 kcal/mole -23.0 < Energy <= -22.6 kcal/mole -23.3 < Energy <= -23.0 kcal/mole -22.6 < Energy <= -22.3 kcal/mole Figure 2.3: Matrix of equilibrium base pair binding probabilities between all nucleotides in the sequence of Figure 2.2. The lower triangle represents the optimal structure. Illustration obtained with the program MFold from Zuker et al. [71]. 2.5 Stability Given a secondary structure S*, there are typically several sequences that have M F E structure S*. If we want to design stable structures then one approach is to find the sequence with the lowest M F E S*. Dirks et al. [25] described two paradigms for designing a structure. A positive design opti-mizes sequence affinity for the target structure. A negative design optimizes sequence specificity to the target structure. Sequences with high affinity have admissible structures similar to the target structure. Sequences with high specificity have the target structure as their M F E structure. When designing a structure, it is desirable to achieve both high affinity and high specificity. Figure 2.4 shows the structure space for three sequences. The target structure is indicated in the picture. Sequence C has a higher affin-17 Chapter 2. RNA Secondary Structure Design Problem ity to the target structure than sequences A and B since it has the lowest free energy when folding into the desired structure. Moreover, sequences A and B have a higher specificity to the target structure since S* is the M F E structure of A and B. Overall we will prefer sequence B since it has high affinity and high specificity and therefore its M F E structure is the most thermodynamically stable. Dirks et al. [25] define several criteria to evalu-ate the specificity and the affinity of a structure. Energy minimization only evaluates a sequence in terms of its affinity for a target structure. The lower the energy of the sequence when folding into the target structure, the higher the affinity to the target structure. But this condition is not sufficient to have high specificity to the target structure (see sequence C in Figure 2.4). Furthermore, the M F E criterion only evaluates the sequence in terms of its specificity to the target structure but does not ensure high affinity (see sequence A in Figure 2.4). Wi th the partition function it is possible to calculate the probability P(S*) for a sequence to fold into S* (see Equation 2.2). If P(S*) is close to one then the sequence achieves high affinity and specificity. However, requiring that P(S*) ~ 1 is a very strict design evaluation criterion since it requires that all nucleotides match the target exactly. A n alternative criterion is the weighted average number of incorrect positions n(X,S*), with respect to S*, over the equilibrium ensemble of secondary structures of a given sequence X [25]. To simplify notation we will use n(S*) from now on. A sequence has high affinity and specificity to the target structure if n(S*) is close to zero. The weighted average number of incorrect bases can be computed from the matrix of base pair probabilities of the partition function. If S* is the M F E structure of X and 5 is defined as in Equations 2.3 and 2.4, then the weighted average number of incorrect nucleotides is: n(S*) = n lZ lZ PiJ6iJ l<i<n l<j<n+l expected number of correct nucleotides n n JZ lZ PiJSij- zZ Pi.n+l^.n+l l<i<n l<j<n l<i<n 18 Chapter 2. RNA Secondary Structure Design Problem Figure 2.4: Positive and negative design. The x-axis represents the structure space S(X) of a sequence X and the y-axis the free energy A G of each structure. Illustration adapted from Dirks et al. [25] = U ~ E E P i & j - E I 1 " E Pij)6in+l l<i<nl<j<n l<i<n \ l<j<n J since J2i<j<nPij + £>i,n+i = 1 for every 1 < i < n. Therefore, n(S*) = n- E E Stn+i+ E E PiJ6tn+v (2-5) l<i<n l<j<n l<i<n l<i<nl<j<n The indicator that we use to identify stable structures depends on the context. For example, in long structures it is difficult to have a probability P(S*) close to one because the number of structures in the ensemble is very large. In this case we can use the first energy gap [67] to identify the most stable structure. The first energy gap is a negative design criterion that is defined as the difference between the energy of the M F E structure. 19 Chapter 2. RNA Secondary Structure Design Problem and the second best structure. In some situations it is better to design a sequence with big first energy gap. Suppose that we have two sequences X\ and X2 whose M F E structure is the target structure S*. Assume that X\ has an ensemble with several structures very different from each other and all of them with probabilities close to P(S*). Then we will prefer to choose sequence X2 if P(S*) is smaller in this new ensemble but has a bigger first energy gap. Note that if we choose X\ then there is a significant chance to get kinetically trapped in a very stable structure different from the target structure. Moreover, if the energetically favourable structures in the ensemble of X2 are similar to each other then it is better to choose this sequence because it has a higher probability of folding into the desired target than sequence X\. From the previous example, it is useful to determine whether there is some family of structures in the ensemble that is similar, distinct from the rest, and dominates the probabilities of all other families. Vofi et al. [64] introduced an algorithm, called RNAshapes, to compute the accumulated probabilities of all structures that share the same shape. A n R N A shape is a representation of an R N A secondary structure that abstracts loop and stem lengths. Consider the following sequence and two secondary structures from its folding space: AUCGGCGCACAGGACAUCCUAGGUACAAGGCCGCCCGUU . . ( ( ( . ( ( . . ( ( ( . . . . ) ) ) . ( ( ( ) ) ) ) ) ) ) ) . . . . ( ( ( ( ( ( . . . . ) ) ) • ( ( ( ) ) ) . . ) ) ) . . There are several levels of abstraction. In the least abstract shape, the unpaired regions are represented by an underscore and the stacking regions by a pair of squared brackets. The following are the shapes of the previous structures. The most abstract level excludes unpaired bases and combines nested helices. In this case, both structures have the same shape: [ [ ] [ ] ] The accumulated probabilities of all the structures that share the same shape can be used as another stability measure. If the M F E structure belongs to a shape of probability close to one, then the corresponding sequence achieves high affinity and specificity. 20 Chapter 3 P r e v i o u s W o r k The most important work related to structure design is presented in this chapter. We give an overview on the design of proteins and nucleic acids. We also describe biochemical and computational methods including heuristic algorithms for R N A structure design that have been empirically shown to achieve good performance. 3.1 Protein design Proteins have a big range of natural function and hence they represent a fertile medium for the design of new medical and industrial products. The ultimate goal of protein design is the creation of novel proteins that perform specified tasks. A necessary requirement for meeting this goal is the ability to identify sequences that fold with sufficient stability into a target structure. Computational procedures for protein design consists of starting from a given protein three-dimensional structure, usually a known structure from the Protein Data Bank (PDB) [9, 10], and searching for the amino acid sequence or sequences that are compatible with this structure. Pierce and Winfree show that the protein design protein is NP-hard [44]. However, this is a reflection of worst-case behavior but in practice, it is possible for an exponential-time algorithm to perform well or for an approximate stochastic method to prove capable of finding excellent solutions to NP-hard problems. Stochastic methods based on Monte Carlo, simulated annealing or genetic algorithms have performed with some success on small protein design prob-lems [24]. Proteins can be designed computationally by using positive strategies that maximize the stability of the desired structure or by negative strate-gies that seek to destabilize competing states. Design efforts have focused mainly on positive strategies that maximize favorable interactions in the target conformation. This approach has given good results including the introduction of catalytic activity into a previously inert protein [13] and the creation of a novel protein fold [37]. Negative design, by contrast, maximizes unfavorable interactions in competing states and requires modeling of each 21 Chapter 3. Previous Work unwanted conformation [70]. However, one of the challenges in negative de-sign is to model accurately the energetic effects of destabilizing mutations in competing states. 3.2 Nuc le i c acid design In this section we discuss biochemical and computational methods that have been used to design nucleic acids. Nucleic acids have the advantage that they are easy to synthesize and that structure formation is mainly based on secondary structure, that is, base-pairing interactions within a strand. Nucleic acids are versatile building materials. In nature, D N A and R N A ' s , such as m R N A , rRNAs and tRNAs, axe involved in making proteins. In nanotechnology, D N A s and R N A s have several applications. Assembly and folding principles of natural R N A are used to build potentially functional artificial structures at the nano-scale [20, 51]. This concept, called R N A tectonics, led to the synthesis of R N A grids with various patterns such as the one in Figure 3.1. Nucleic acids have been also used to build nanomechanical devices [59]. In this case, the combination of single and double stranded sections of D N A yield structures which can be thought of as a network of stiff and flexible elements. The deliberate formation or destruction of double-stranded sections in such a network induces conformational changes which result in nanoscale motions such as rotational motion, pulling and stretching, or even unidirectional motion. Other applications include engineering logic circuits [54] and simple computers [6]. 3.2.1 Biochemical methods In practice, R N A design is mostly done using biochemical methods. These procedures allow us to find R N A structures similar to those retained by natural evolution, and also to identify alternative conformations that can perform the same function. Structures of interest can be characterized by X-ray crystallography and N M R spectroscopy. By using phylogenetic analy-sis and sequence alignment of several R N A sequences it is possible to identify conserved primary and secondary structural features that can be related to function [17]. Wang and Unrau [65] use random recombination and selection to isolate the core functional elements of an R N A where phylogeny is lacking or is limited. In vitro mutagenesis and selection have been performed on sev-eral ribozymes and substrates to determine the overall secondary structure and to identify which elements are essential for activity [2, 46]. Minimal 22 Chapter 3. Previous Work ( a ) ( b ) ( c ) Figure 3.1: Tectosquare designed by Chworos et al. [20]. Panel (a) shows the R N A structure that can self-assembly to form the square of panel (b). Panel (c) shows the three-dimensional representation of the tectosquare. motifs that support catalytic activity or modified structures with an im-proved catalytic activity can be determined by this method. Breaker et al. [15] also used in vitro selection to design allosteric ribozymes as biosensor components. A n allosteric ribozyme induces or inhibits catalytic function in the presence of an effector molecule that binds to a receptor site distinct from that of the enzyme's active site. Another approach searches in the GeneBank database for potential struc-tural motifs that might have functional significance. Ferbeyre et al. [27] mutate versions of the hammerhead self-cleaving R N A s to find alternative structures with similar function or with an increase catalytic activity. 3.2.2 Computational methods There has been also progress with computational approaches. A determin-istic approach is given by Seeman [56]. He uses a sequence-symmetry mini-mization algorithm where bases are selected to minimize similarities between segments of the molecule. In this way the sequence adopts the desire con-formation and is less likely to fold into an alternative structure. Designed structures are validated with gel electrophoresis where the particular R N A molecule is identified by the band patterns it yields in gel electrophoresis after being cut with various restriction enzymes. The R N A Secondary Structure Design Problem can be seen as a discrete constraint satisfaction problem [32], where the constraint variables are the 23 Chapter 3. Previous Work positions in the desired R N A strand, the values assigned to these variables correspond to the bases at the respective positions, and the constraints cap-ture the base-pairings that define the given secondary structure. For this problem, evaluating the quality of candidate solutions is computationally expensive, since it uses an implementation of the algorithm of Zuker and Stiegler [75] for prediction of the M F E secondary structure for a sequence. Zuker's algorithm has time complexity © ( n 3 ) , where n is the length of the given sequence. Unfortunately, a single local reassignment of a base in the se-quence can result in a completely different M F E secondary structure. Hence, a heuristic approach is appropriate. One solution to the R N A Secondary Structure Design Problem is pro-vided by Hofacker et al. [53], the implementation for which is included in the Vienna R N A Secondary Structure Package and is called RNAinverse. A stochastic local search (SLS) implementation which has been developed by Andronescu et al. [4], shows better performance. This algorithm is called the R N A Secondary Structure Designer (RNA-SSD). Very recently, Busch and Backofen [16] introduced an improved SLS for the inverse R N A folding problem (INFO-RNA) that in most cases performs better than R N A -SSD. These computational approaches find any sequence that has the target structure as its M F E structure. Dirks et al. [25] have a different approach of several design methods that consist of a criterion score for designing stable structures and a heuristic for optimizing that score. The most important algorithms described in this chapter are summarized in Table 3.1. Algorithm Authors Problem Approach mfold Zuker and Stiegler [75] RNA secondary structure prediction dynamic ' programming RNAinverse RNA-SSD INFO-RNA adaptive walk Hofacker et al. [53] Andronescu et al. [4] Busch and Backofen [16] Dirks et al. [25] RNA secondary structure design heuristic Table 3.1: List of algorithms that will be constantly mention for predicting and designing R N A secondary structures. in this thesis 24 Chapter 3. Previous Work The RNAinverse algorithm RNAinverse algorithm uses an iterative improvement. In each step, the iterative improvement will attempt a random mutation and accept it if and only if the cost function decreases. RNAinverse mutates a base located in a random position, and accepts the new sequence if and only if its M F E structure is closer to the desired structure in terms of its Hamming distance. The position to be mutated and the new base are selected uniformly at random. If the candidate solution is a local minimum and therefore no advantageous mutation can be found, the procedure stops and restarts again with a new initial sequence. This procedure has to evaluate many candidate solutions and therefore it needs many executions of the RNAfold function from the Vienna Pack-age which requires G(n 3 ) time. To reduce the number of foldings of full length, small substructures are designed independently and the correspond-ing subsequences are concatenated to find a sequence to the full structure. A substructure consists of a hairpin with its adjacent stem. However, merging these substructures does not always give the desired structure. For more details on the algorithm see Hofacker et al. [31]. The R N A - S S D algorithm We provide a brief overview of the RNA-SSD algorithm, which is described in detail by Andronescu et al. [4]. RNA-SSD is a stochastic local search (SLS) procedure that iteratively modifies single unpaired bases or base pairs of a candidate strand in order to obtain a strand that folds into the target structure. The RNAfold function from Vienna Package is used to evaluate the quality of candidate solutions. Since evaluating the quality of candidate solutions is computationally expensive, the RNA-SSD (similar to R N A i n -verse) decomposes the input secondary structure at multiloops. However, different from RNAinverse, RNA-SSD recursively splits the structure into two substructures in each decomposition step; thus obtaining a binary de-composition tree whose root is formed by the full target structure S* and whose leaves correspond to small substructures of S*. Each non-leaf node of the tree represents a substructure of S* that can be obtained by merging the two substructures corresponding to its children. The SLS algorithm is only applied to the smallest substructures, and the corresponding partial so-lutions are combined into candidate solutions for larger subproblems guided by a decomposition tree. Since the smaller subproblems are not indepen-25 Chapter 3. Previous Work procedure R N A - S S D input: target RNA secondary structure S* and structure length n output: RNA sequence X initialise sequence X (X of length that matches S*) and hierarchically decompose S* and X [4]; for each leaf node k, use the SLS algorithm [4] to design a sequence Xk for the substructure at node k; end for for each non-leaf k of the decomposition tree Xk *— Xk\Xk2\ obtain the M F E structure Sk of Xk with RNAfold, if Sk is different from the target structure SI of node k repeat redesign Ski, the substructure with the higher relative fraction of conflicting bases among Ski and Sk2\ Xk Xk\Xk2\ obtain the M F E structure Sk of Xk with RNAfold; until Sk is identical to SI or a maximal of « M attempts has been performed or a cutoff time is reached end if end for return X end R N A - S S D . Figure 3.2: Pseudocode for designing structures with the R N A - S S D algo-rithm. dent, this does not always result in valid designs for the corresponding larger substructure. Consequently, multiple attempts (involving additional calls to the core SLS procedure) are often required before partial solutions can be combined successfully, bf A maximum number TIM = 5 of attempts are per-formed to merge subsequences into a valid sequence. Figure 3.2 shows the R N A - S S D algorithm. 26 Chapter 3. Previous Work Andronescu et al. designed a more complex SLS algorithm than Ho-facker et al. which has important components for efficient design of sec-ondary structures. For instance, the SLS algorithm in RNAinverse starts with a random sequence whereas RNA-SSD has a method for generating a good initial design for the R N A strand that increases the chances of cor 1 rect folding. First, the assignment is done in such a way that paired bases in S are assigned complementary bases to ensure that X can fold into the desired structure S. However, there might be an alternate structure with lower free energy and therefore, they make sure that the unpaired sequence positions of a hairpin loop directly following helix regions are assigned non-complementary bases. This prevents energetically favourable but undesired helix extensions. Another characteristic of the initialisation procedure is that it assigns bases probabilistically to the strand, using different proba-bilistic models for base positions that are paired and unpaired in the target structure. The algorithm also ensures that complementary stretches of bases are avoided across the design, except where desired along two sides of a stem. The single-base modifications at the level of smallest substructures are determined based on a randomised first-improvement strategy: with a fixed probability 0.2, an arbitrary base position in the given subsequence is se-lected uniformly at random; otherwise, this uniform random choice is re-stricted to base positions that are currently in conflict with the target struc-ture. A n alternate base assignment is then generated for the selected posi-tion using the same probabilistic model for that position as in the initialisa-tion procedure. Different from RNAinverse, Andronescu et al. introduced a noise parameter in the SLS algorithm to overcome local optima. That is, a new candidate solution is accepted with probability 0.2 if the number of conflicting bases is more than the previous solution. The search terminates when it finds a sequence with M F E structure as the target structure or when a maximum of = 5000 base modifications have been performed without finding a sequence. The search is restarted from a different initial sequence if no solution is found after 1000 base modifications. The performance of RNA-SSD was evaluated empirically by Andronescu et al. [4] and compared against that of RNAinverse using artificially gen-erated R N A structures as well as biological data. They report that R N A -SSD shows substantially improved performance over RNAinverse. R N A -SSD solved all but one of the artificial structure, while RNAinverse failed to find any solutions for 42 of 420 structures within the given time limits. On the remaining instances, RNA-SSD found solutions up to 10,000 times faster than RNAinverse. Furthermore, RNAinverse was found to be slightly faster on only eight of 420 tested structures. R N A - S S D performed much 27 Chapter 3. Previous Work better than RNAinverse on a biological set of ribosomal R N A sequences. While R N A - S S D found solutions for all 24 structures, RNAinverse failed to solve 17 structures. In particular, RNAinverse did not find a correct solu-tion for any structure larger than 500 bases, and for the seven structures it did solve correctly, its performance in terms of success rate and run-time is substantially inferior to that of RNA-SSD. The I N F O - R N A algorithm The I N F O - R N A algorithm [16] consists of two parts: a dynamic program-ming method for constructing good initial sequences and a stochastic local search for finding a sequence with M F E structure as the target structure. Different from the initialisation method in RNA-SSD, the dynamic program-ming method finds a sequence that among all sequences adopts the given structure with the lowest possible energy. The M F E structure of the initial sequence can be different from the target structure, therefore a stochastic local search is used to find a sequence that folds correctly. Small substruc-tures are optimized first and then proceed to larger ones as done by Hofacker et al. [31] in the RNAinverse algorithm. The search strategy is similar to an iterative improvement which moves to the first found neighbour of the sequence that has an M F E structure with a lower structure distance to the target structure than the current one. Like RNA-SSD, they also introduce noise and accept a worst neighbor with probability 0.1 to overcome local optima that often happen in the iterative improvement. The best found sequence is stored during the search and finally returned. The search ter-minates after a fixed number of steps that is set as ten times the length of the structure. The selection of neighbors during the local search is different from the arbitrary order used in RNAinverse and RNA-SSD. During the search, not all sequence neighbours are candidates for mutation. Only positions that do not paired correctly and positions adjacent to those are tested. The order in which the neighbours are tested depends on a look-ahead of one selection step. Here, the energy of each candidate sequence folded into the target structure is calculated. Then, the resulting energy difference to the current one is evaluated. The.higher the difference is, the earlier the neighbor is examined. They claim that this procedure is not expensive since structural elements contribute additively to the energy of the whole sequence. Therefore only structural elements that are closed by the mutated pair or include a mutated free base have to be re-evaluated. To evaluate the folding energy they use functions from the Vienna R N A Package. 28 Chapter 3. Previous Work Busch and Backofen tested I N F O - R N A on artificial as well as on biologi-cal data and compared the results to the ones of RNA-SSD and RNAinverse [16]. They generated two test sets of 300 structures each. The first set consists of short structures up to a length of 200, while the second test set includes larger structures of sizes between 300 and 700. They found that I N F O - R N A was always successful for all 300 structures of both data sets. For small structures, RNA-SSD needed twice as long and RNAinverse 400 times as long as I N F O - R N A did. For the test set which includes larger struc-tures, R N A - S S D also performed only a little worse than I N F O - R N A since it was able to design 294 structures but was five times slower than I N F O - R N A . When comparing I N F O - R N A and RNA-SSD on biological data, Busch and Backofen found that in all but one case, it was faster than R N A - S S D . It succeeded for more structures than RNA-SSD and unsuccessful runs termi-nated with better approximate solutions. In general, I N F O - R N A outper-forms R N A - S S D and it performs substantially better than RNAinverse. T h e a lgo r i t hm by D i r k s et al. Dirks et al. used an heuristic algorithm to search for the most stable se-quence according to several design criteria [25]. Independent simulated an-nealing runs with different random initial sequences are used to identify sequences with low free energy on the target structure. A n iterative im-provement is used to search a sequence for which the target structure is the lowest energy structure. This heuristic search method is also used to search for the sequence with the highest probability of sampling the target structure based on the partition function and to obtain a sequence with low average number of incorrect nucleotides. This method was used by Dirks et al. to compare the stability of the sequences obtained with different design criteria rather than to compare the performance of design algorithms. Therefore, we show in Chapter 6 that often the M F E structure of the sequences designed by this method is not always the target structure. 29 Chapter 4 Empirical Complexity of R N A - S S D Algorithm In this chapter we will describe some improvements to the R N A - S S D al-gorithm including the support for primary structure constraints. Then we discuss results on the scaling of run-time for our new R N A - S S D algorithm and RNAinverse on design problems without primary structure constraints; results regarding the undesignability of certain structures; and results on the impact of primary structure constraints on the relative difficulty and scaling of run-time for our new RNA-SSD algorithm. 1 We worked with RNA-SSD since it was the state of the art algorithm for designing R N A secondary structures when we conducted our study. INFO-R N A , that in most cases performs better than RNA-SSD, was developed at the same time and published only relatively shortly before we finished this work. 4.1 Improvement of the R N A - S S D a lgor i thm We found that there are some structures that are difficult to design by using the hierarchical decomposition method of Andronescu et al. [4]. This is the case, for example, for structures that have two loops separated by a very short stem (see Figure 4.1(a)). Wi th this approach, after splitting a structure (which is always done at a multiloop), it is necessary to connect the two free ends created by the split such that both resulting substructures have exactly two free ends. To create structural boundary conditions at the split points that are similar to those of the original structure, this connection *A version of this chapter has been published in two papers: Rosalia Aguirre-Hernandez, Holger Hoos and Anne Condon. Computational RNA Sec-ondary Structure Design: Empirical Complexity and Improved Methods. BMC Bioinfor-matics, 8:34, 2007. Mirela Andronescu, Rosalia Aguirre-Hernandez, Anne Condon and Holger Hoos. RNA-soft: a suite of RNA secondary structure prediction and design software tools. Nucleic Acids Research, 31:13, 2003. 30 Chapter 4. Empirical Complexity of RNA-SSD Algorithm • 3' (c) Figure 4.1: (a) Randomly generated structure of length 75 (RND-75-n62) with loops separated by short stems. The line represents the location where the structure is split into two substructures. Parts (b) and (c) show the corresponding substructures with static cap structure and dangling ends, respectively. is achieved by merging the free ends of one fragment with those of a static cap structure, which is a small hairpin loop of size four (four unpaired bases and five paired bases, see Figure 4.1b); furthermore, two unpaired bases are added to the free ends of the other fragment if it contains a bulge directly after the first base pair. (Note that the example structure in Figure 4.1a does not contain bulges; therefore, no unpaired bases are added after splitting, resulting in the fragment shown in Figure 4.1c). Static cap structures do not reproduce the needed structure, and so de-sign may fail after merging. This mechanism can be improved by introducing a dynamic cap structure and dynamic dangling ends in order to create struc-31 Chapter 4. Empirical Complexity of RNA-SSD Algorithm tural boundary conditions that are exactly identical to the original structure in terms of the number of paired and unpaired bases adjacent to the split point. In our new mechanism, the number of paired bases in the cap struc-ture added to one fragment depends on the number of paired bases at the beginning of the other substructure (Figure 4.2a). Furthermore, we add one unpaired base to the 5' (3') end of a substructure if, and only if, its adjacent base in the original structure is a free base (Figure 4.2b). It is enough to add one unpaired base to the end of the structure since dangling ends of more than one base do not contribute to the free energy of the structure [41]. We observed marked improvements in the running time and success rates in many cases. For instance, the expected time of 50 runs to design the struc-ture RND-75-n62 from Figure 4.1a is reduced from 88446.79 C P U seconds to 27 C P U seconds when using the new dynamic mechanism. (All run-times were measured on a reference machine using an Intel Xeon 2.4 GHz C P U . ) . In another example, Andronescu et al. [4] found a particular ribosomal R N A structure obtained from the Ribosomal Database Project, Leptospira interrogans strain 94-7997013 (Figure 4.3) to be particularly hard to design even though it has only 289 bases. They reported an expected run-time of 2 517.57 C P U seconds to design this structure with a success rate of six in fifty runs when using a cutoff time of 3600 C P U seconds. Using our improved version of RNA-SSD, the expected time to solve this structure is 1170.23 C P U seconds with a success rate of forty four in fifty runs using the same machine and cutoff time. This structure is hard to design because it contains two internal loops separated by a single base pair only; one of these is a symmetric internal loop of size four and the other is slightly asymmetrical with size five. B y introducing an additional base pair between these internal loops, the structure becomes much easier to design, requiring our new version of RNA-SSD only an expected run-time of 415.89 C P U seconds with a success rate of forty nine in fifty runs. More extensive testing on sets of random (RND-75), biologically moti-vated (BIOM-200) and biological structures have shown that the average run time of the new version of R N A - S S D is very close to that of the old version on our set of biological structures, but about 5 times lower for the random structures and 16 times lower for the biologically motivated structures. We also extended RNA-SSD to support primary structure constraints, that is, constraints on the bases that occur in certain sequence positions. The additional sequence constraints limit certain sequence positions to specific bases or sets of bases. For this purpose, the standard I U P A C symbols [21] listed in Table 4.1 are supported. Primary structure constraints facilitate the design of structures useful in certain applications. For instance, when 32 Chapter 4. Empirical Complexity of RNA-SSD Algorithm Figure 4.2: Substructures from Figure l a with dynamic cap structure (a) and dynamic dangling ends (b). Figure 4.3: Ribosomal R N A : Leptospira interrogans strain 94-7997013. The structure motif that consists of two internal loops separated by one base pair is hard to design. We verified this experimentally by adding another base pair between the internal loops, whereupon the expected time to design the structure decreases by a factor of three. 33 Chapter 4. Empirical Complexity of RNA-SSD Algorithm Symbol Meaning Origin of designation G G Guanine A A Adenine U U Thymine C C Cytosine R G or A puRine Y U o r C pYrimidine M A or C aMino K G or U Ketone S G or C Strong interaction (3 H bonds) w A or U Weak interaction (2 H bonds) H A or C or TJ not-G, H follows G in the alphabet B G or U or C not-A, B follows A V G or C or A not-U (not-U), V follows U D G or A or U not-C, D follows C N G or A or U or C aNy Table 4.1: IUPAC nomenclature for nucleic acids. a ribozyme is re-engineered to make it more stable, certain bases of the molecule can be modified while others are constrained because they define the cleavage site [46]. Sequence constraints are also needed in the design of nanostructure components with "sticky ends" [40, 66]. Our extended version of RNA-SSD supports primary structure con-straints as follows. Given a sequence specification using the I U P A C symbols listed in Table 4.1, first, all pairs of constrained bases are checked for fea-sibility, that is, for whether there are Watson-Crick-complementary bases or wobble pairs that satisfy the given primary structure constraints. Then, base constraints are propagated across all paired positions, that is, the set of allowed bases for each paired sequence position is adjusted to take into account constraints on the other base involved in the pairing. This improves the efficiency of the subsequent search process by restricting the number of bases that have to be potentially considered for the respective sequence po-sitions. When initialising the search process, we ensure that all bases are consistent with the given primary structure constraints (after propagation). Furthermore, whenever we modify a base assigned to a sequence position during the search, we ensure that the respective primary structure con-straints (if any) remain satisfied; in other words, we ensure that at all times during the search process, all primary structure constraints are satisfied. 34 Chapter 4. Empirical Complexity of RNA-SSD Algorithm 4.2 Exper imen t s We now report results from our analysis of the empirical complexity of solv-ing R N A secondary structure design problems with the improved version of R N A - S S D and with the RNAinverse algorithm. Experiments were per-formed on random and biologically motivated structures of different lengths. The study of the behaviour of the algorithm on biological structures is rele-vant since it will have an impact in biological applications such as ribozyme design. We generated random structures by folding, with the RNAfold function from the Vienna package, a set of randomly generated sequences with a uniform distribution of nucleotides (see Table 4.2). Because of the limited availability of true biological structures, we generated structures with bio-logical characteristics with the help of an R N A structure generator [4]. Wi th this program it is possible to directly control salient properties of the struc-tures being generated, including the overall size as well as the number and size of bulges, internal, and multiloops, and the length of stems. This allows us to evaluate RNA-SSD (and RNAinverse) on a large number of structures and helps us to reduce the chance of drawing erroneous conclusions from a small set of atypical results. The R N A structure generator takes as input the number and size of dif-ferent loops and stems. A n example of these parameters is given in Table 4.4. The output of the algorithm is a structure S. To generate a structure, the R N A structure generator chooses numbers / and M uniformly at random from the ranges [ A , ^ ] (number of internal loops) and [Mi,M2] (number of multiloops), respectively. Then the structure is built up in rounds, from the outside in (where the outside corresponds to the free ends), with either an internal loop or a multiloop and associated stems (that include bulges) in-serted per round, until the structure has I internal loops and M multiloops. The hairpins are added at the end, closing the stems. In order to determine the parameters used by the R N A structure gen-erator, we selected from the biological literature ten structures that are consistent with experimental evidence and empirical data, ranging from 60 to 600 bases in length (see Table 4.3). Average values of each of the features captured in the parameters of the R N A structure generator over our set of structures were used to roughly summarise the structural properties of nat-urally occurring R N A s (see Table 4.4). Using the R N A structure generator with these parameter values, several sets of biologically motivated structures were generated (see Table 4.5). For the experiment in which RNA-SSD was used to design structures 35 Chapter 4. Empirical Complexity of RNA-SSD Algorithm Structure name Size (bases) Number of structures generated RND-50 50 1000 RND-75 75 1000 RND-100 100 100 RND-125 125 100 RND-150 150 100 RND-200 200 100 RND-450 450 100 Table 4.2: Set of structures generated by folding random sequences with the RNAfold function from the Vienna R N A package. Nucleotides in the se-quence are assigned uniformly at random. The sets of structures longer than 75 bases are smaller because of the amount of time required for designing these structures. No. Description Size (bases) 1 Minimal catalytic domains of the hairpin ribozyme satelite RNA of the Tobacco ringspot virus 65 2 U3 snoRNA 5' domain from Chlamydomonas reinhardtii, in vivo probing 79 3 H. marismortui 5 S rRNA 122 4 VS Ribozyme from Neurospora mitochondria 167 • 5 R180 ribozyme 178 6* X S l Ribozyme, Bacillus subtilus P RNA based ribozyme 314 7* Homo Sapiens RiboNuclease P RNA 342 8 S20 mRNA from E. coli 372 9 Halobacterium cutirubrum RNAse P RNA 375 10 Group II intron ribozyme D135 from Saccharomyces cerevisiae mitochondria 583 Table 4.3: Biological structures obtained from the literature and used by Andronescu et al. [4]. Structures marked with an asterisk (*) were obtained from original, pseudoknotted structures by eliminating 8 base pairs in each case to remove the pseudoknot. with primary structure constraints, we utilised only biologically motivated structures. This experiment was computationally expensive because it re-quired the design of a given structure with several constraints. For this reason, we chose subsets of the previously described sets of biologically mo-tivated structures by means of random sampling (without replacement). 36 Chapter 4. Empirical Complexity of RNA-SSD Algorithm Hairpins Stems internal loops Multiloops Bulges Size [4,8] [3,12] [4,H] [6,17] [1,3] Number — — [1,8] [0,5] [0,0.17]* Branches — — — [3,4] — Table 4.4: Properties of the structures from Table 4.3; the intervals spec-ify the minimum and maximum values observed for the respective features. These parameters were used to generate structures with biological proper-ties. * These values denote the minimum and maximum ratio of bulges to base pairs in the stems. Structure name Size (bases) Number of structures generated BIOM-50 [50,75) 1000 BIOM-75 [75,100) 1000 BIOM-100 [100,125) 100 BIOM-125 [125,150) 100 BIOM-150 [150,175) 100 BIOM-200 [200,225) 100 BIOM-500 [500,525) 100 Table 4.5: Sets of structures generated with the R N A structure generator, using the parameters from Table 4.4. These subsets consist of 50 structures of the data sets BIOM-50 and B I O M -75; 45 structures of the data set BIOM-100; and 10 structures of the data sets BIOM-125, BIOM-150, BIOM-200 and BIOM-500, respectively. The primary base constraints were generated in the following way. For each structure, we used RNA-SSD to obtain 100 sequences that are compu-tationally predicted to fold into it. Of these, we selected the sequence that gave the most stable M F E structure and used it for generating base con-straints for certain positions using two different methods. In one method, we sampled 50% of the sequence positions uniformly at random (without replacement). Additionally, when generating a constraint for a paired base, we also generated a constraint for the base to which it is paired to be fixed to the correct Watson-Crick complementary base; consequently, more than 50% of the bases may be fixed in the resulting design problem. In the other method, we sampled 50% of the stems in the given structure uniformly at random (without replacement) and fixed all bases occurring in these stems. To control for the variation in run-time of the design algorithms due 37 Chapter 4. Empirical Complexity of RNA-SSD Algorithm to the choice of constrained bases, we generated all of the possible sets of constraints in cases where this number was found to be less than 50, and random samples of size 50 otherwise. Thus, for each structure in a test set, we consider up to 50 possible sets of constraints obtained by each of the two generation methods. For structures of length 500, which are computationally expensive to design, we used only 10 instead of 50 constraint sets (also obtained by random sampling without replacement). Running time of R N A - S S D and RNAinverse Both RNA-SSD and RNAinverse are stochastic algorithms: when applied to the same structure multiple times, the time for finding a solution may vary substantially. (Note, however, that by using the same random seed, any run of RNA-SSD can be perfectly reproduced.) Therefore, it is necessary to perform sufficiently many runs on each problem instance in order to get reasonably stable statistics on run-time. We are interested in computing the expected running time E(T) of the R N A - S S D (and RNAinverse) algorithm for designing one structure with no cutoff time. However, in practice we have to fix a cutoff c. It is reasonable to assume that the running time T has an exponential distribution f(t) = Exp(X), see Hoos et al. [33]. Therefore, the expected running time of the R N A - S S D for designing a structure S is E(T) = j . The parameter A can be estimated using the maximum likelihood estimator as follows. Let ti,---,tk be successful runs and • • • unsuccessful runs. Let t be vector [t\, t2, • • •, tpc]• Then k K £(A;t ) n/ (*»)• n p(Ti>c) i=l i=k+l k K =»£(A; t ) ii^-xti n « -i=l i=k+l k =>log£(A; t ) k)Xc i=l then dlogC dX 0 => X k Ei=iti + (K-k)c 38 Chapter 4. Empirical Complexity of RNA-SSD Algorithm Therefore, Note that Equation 4.1 is equivalent to the following equation used by Hoos el. al [32]: E(T) = Es + (j- - l ) £7U (4.2) where Es and E u denote the average time for successful and unsuccessful runs, respectively, and fs is the fraction of successful runs. That is, Es Eu and fs In our experiments, unsuccessful runs of RNAinverse were terminated after 1 800 C P U seconds. Unsuccessful runs were terminated after 1800 C P U seconds for structures that are decomposed by RNA-SSD. For structures not subject to decomposition, the algorithm terminates in less time, determined by the maximal number nj, of base modifications performed by the SLS procedure without finding a solution. In our experiments, we used TIL = 1000. For the unconstrained experiment we performed 50 runs on a given struc-ture to estimate the expected C P U time. Then we compute the median of the expected running time over all the structures of a given length. For the experiments with primary structure constraints, 50 runs were performed for each structure and set of primary structure constraints. The expected C P U time required for designing a structure with a given constraint was estimated from these runs using the same formula 4.2 as in the unconstrained case, and the median over the 50 sets of constraints per structure was used for all analyses. A l l computational experiments were carried out on PCs with dual In-tel(R) Xeon(TM) 2.40GHz processors (only one processor was used in our experiments), 512 K B cache, and 1GB R A M running Red Hat Linux, kernel version 2.6.5-1.358smp. k ' J2i=k+1 ti K-(k + l) + C k_ K' 39 Chapter 4. Empirical Complexity of RNA-SSD Algorithm 4.3 Resul ts In this section we discuss the scaling analysis on random and biologically motivated structures using an improved version of the R N A - S S D algorithm, and also the RNAinverse algorithm from the Vienna package. 4.3.1 Analysis of RNA-SSD and RNAinverse on secondary structures without constraints In earlier work by Andronescu et al. [4], no clear correlation has been detected between the size of a given R N A structure and the performance of R N A secondary structure design algorithms such as R N A - S S D . Here, we use a bigger set of structures to investigate the empirical complexity of R N A design and found a clear correlation between the size of the structure and the performance of the two algorithms we studied, R N A - S S D and RNAinverse. First, we investigate whether the median expected run-time for designing structures scales polynomially with the length of the structure. Figure 4.4 shows the scaling analysis for RNA-SSD of the median expected run-time for different structure lengths, as well as the 90% quantile of the expected run-time. Linear least squared regression was used to fit a polynomial and an exponential function to the data that corresponds to structures of length 50, 75, 100, 125 and 150. Note that the median and the 0.9-quantile expected running time of longer structures (200 and 450 for random structures and 200 and 500 for biologically generated structures) are closer to the polyno-mial line and therefore RNA-SSD scales polynomially with the length of the structure. Figure 4.5a shows the median expected run-time for different structure lengths, as well as the expected run-time for the structure at the 10% and the 90% quantile for the biologically motivated structures. We also include the median expected run-time for a set of real biological structures reported in the literature (Table 4.3). Notice that the empirical complexity for designing these real structures fits well within the range of complexity observed for our biologically motivated sets of structures, which provides some evidence that the probabilistic model underlying these sets is reasonably plausible for the purposes of this study. Furthermore, the data in Figure 4.5a indicate that the expected run-time of RNA-SSD scales polynomially with structure size for median difficulty as well as for the 10% and 90% quantiles of these structure distributions, where the degree of the polynomial is higher for higher quantiles. As can be seen from Figure 4.5b, we obtained similar results for random 40 Chapter 4. Empirical Complexity of RNA-SSD Algorithm Figure 4.4: Scaling analysis of the expected run-time (y-axis) using R N A -SSD of structures of lengths 50, 75, 100, 125, 150, 200, 450 and 500 (x-axis). A logarithmic scale is used in the y-axis. The solid (dotted) line corresponds to best fits of the data, for structures with lengths 50 to 150, using a polyno-mial (exponential) that is specified in each case. The expected run-times for structures longer than 150 appear closer to the polynomial than the expo-nential fit line, (a) Median (Q50) of expected run-time of random structures, (b) 0.9-quantile (Q90) of expected run-time of random structures, (c) Me-dian (Q50) of expected run-time of biologically motivated structures, (d) 0.9-quantile (Q90) of expected run-time of biologically motivated structures. structures as well as when using RNAinverse. Notice that overall, R N A -SSD performs substantially better than RNAinverse. Also, random struc-41 Chapter 4. Empirical Complexity of RNA-SSD Algorithm tures tend to be somewhat more difficult to design than biological structures probably because the latter contain more structural motifs that are easy to design. Distributions of expected run-time for RNA-SSD over our sets of random and biologically motivated structures of various sizes are shown in Figure 4.6 and Figure 4.7. Note that there is a large variation in difficulty between structures from the same set. Also, there are some structures that R N A -SSD is unable to design (the same holds for RNAinverse, as will be explained later). The random structures are designable by construction since they were obtained by folding a set of random sequences with the RNAfold function from the Vienna package. R N A - S S D was able to design all random struc-tures except one of length 450 (Figure 4.6a). This structure has several short stems separated by loops (Figure 4.8a). In particular, it has two inter-nal loops next to each other; one of these is slightly asymmetric with seven unpaired bases, while the other is symmetric with four unpaired bases. A l -though allowed by the thermodynamic model, this motif is hard to design. (This will be discussed in more detail in the later section on undesignable 42 Chapter 4. Empirical Complexity of RNA-SSD Algorithm z> O 1e+06 100000 10000 1000 100 10 1 0.1 0.01 RNA-SSD BIO Q5Q + (5.4* 10"8)x3-48 RNA-SSD BIO OIQ * (2.33* 10"6)x2'338 RW-SSDB(OQ9Q • (2.57* 10"n)x5 real biological structures • 10 100 structure length (a) 1000 1e+06 100000 7j 10000 Q . O 1000 100 \r 10 1\r 0.1 0.01 RNA-SSD Rand Q5Q + (6.47* 10"°) x 3- 6 2 4 RNA-SSD Bio 050 * (5.4* 10"8)x RNAinverse Rand Q50 o <8.23* 10 I j x 5 1 3 4 RNAinverse Bio Q50 * (4.64*10"1V^4 4 10 100 structure length (b) 1000 Figure 4.5: Scaling analysis of the expected run-time (y-axis) of structures of lengths 50, 75, 100, 125, 150, 200 and 450 (x-axis). A logarithmic scale is used on both axes. The lines correspond to best fits of the data, for struc-tures with lengths 50 to 150, using a polynomial that is specified in each case. The expected run-time for structures longer than 150 appear close to the corresponding fit line, (a) Median (Q50) of expected run-time of bio-logical structures and median, 0.1-quantile (Q10) and 0.9-quantile (Q90) of expected run-time using RNA-SSD for biologically motivated structures, (b) Median of expected run-time of random and biologically motivated struc-tures using RNA-SSD and RNAinverse. Structures of length 200 are the largest structures of our data set that we designed with RNAinverse. 43 Chapter 4. Empirical Complexity of RNA-SSD Algorithm 1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 0.01 RND-50 RND-100 RND-200 RND-450 100 10000 1e+06 RNA-Designer expected CPU'time [sec] (a) 0.01 10000 1e+06 RNA-Designer expected CPU time [sec] (b) Figure 4.6: Cost distribution of RNA-SSD. Distribution of expected run-time of R N A - S S D on (a) random structures and (b) biologically motivated structures. For each point, the x-value gives an expected run time and the y-value gives the fraction of structures whose run-time is at most the x-value. RND-50 RfJD-100 .•R'ND-200 100 10000 1e+06 expected CPU time [sec] (a) 1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 .•• BIOM-50 BIOM-100 BIOM-200 0.01 1 100 10000 1e+06 RNAinverse expected CPU time [sec] (b) Figure 4.7: Distribution of expected run-time of RNAinverse on (a) random structures and (b) biologically motivated structures. structures.) RNAinverse failed to design 1.16% (i.e., 28/2400) of the random structures of length 200 or less and was not evaluated on larger structures. 44 Chapter 4. Empirical Complexity of RNA-SSD Algorithm Figure 4.8: Examples of structures not designed by R N A - S S D . Structures not designed by RNA-SSD have short stems separated by loops, indicated by arrows in the Figure, (a) Random structure of length 450 (RND-450-n84). This is the only random structure in our data set that R N A - S S D did not design. Note that it has two internal loops separated only by one base pair, (b) Biologically motivated structure of length 74 (BIOM-50-n262). As can be seen from Figure 4.6b, there are biologically motivated struc-tures of every length that RNA-SSD was unable to design; none of these structures could be designed by RNAinverse, which also did not succeed to design a number of other structures. Overall, 6.83% and 2.21% (i.e., 164/2400 and 53/2400) of the biologically motivated structures could not be designed by RNAinverse and RNA-SSD, respectively; this indicates that biologically motivated structures are more difficult to design with these al-gorithms than randomly generated structures. 45 Chapter 4. Empirical Complexity of RNA-SSD Algorithm To further explore RNA-SSD's ability to design larger structures, we evaluated its performance on two additional sets, containing random struc-tures of length 450 and biologically motivated structures of length 500, re-spectively. In these experiments, we found that R N A - S S D designed 99.78% of the randomly generated structures and 94.4% of the biologically moti-vated ones within a cutoff time of 30 C P U minutes. Undesignable structures When examining the structures that appeared to be undesignable by the R N A - S S D algorithm, we found that they typically have short stems sepa-rated by loops, as shown in Figure 4.8, that are not stable enough to com-pensate for the penalty associated with the adjacent loops. Further analysis revealed two kinds of motifs that are impossible for any algorithm based on the current thermodynamic model. One of these consists of two bulges next to each other, separated only by one base pair (see Figure 4.9a); the other one is formed by two internal loops, separated also by a single base pair (Figure 4.9b). For internal loops of size bigger than four and for asym-metric internal loops of size four, these motifs are predicted to be unstable by the current thermodynamic model, and are hence undesignable (a formal proof is given in the appendix). However, it is possible to design two internal loops separated by one base pair if at least one of them is symmetric and has four unpaired bases. Unstable motifs were found in several biologically motivated structures, and they also seem to appear in nature. For example, according to the Comparative R N A Web (CRW) Site, which provides R N A secondary struc-tures based on comparative sequence analysis [17, 18], the small subunit ribosomal R N A of Acanthamoeba castellanii [17, 18] (Figure 4.10) has three adjacent bulges of size two, one and three, respectively (these bulges are lo-cated in positions 1578-1583 and 1841-1848). Similarly, the C R W structure for the small subunit ribosomal R N A of Escherichia coli [17, 18, 28] (Figure 4.11) has two adjacent internal loops of size seven and nine, respectively (this motif is found in positions 1963-1972 and 1994-2005). Although these motifs occur in nature, they are not allowed by the thermodynamic model. This could be due to inaccuracies of thermodynamic model commonly used for R N A secondary structure, tertiary structure effects, or interaction of the R N A with other molecules, which prevent it from folding into its "true" M F E conformation. 46 Chapter 4. Empirical Complexity of RNA-SSD Algorithm (a) (b) Figure 4.9: Undesignable motifs. Two structure motifs of our data set that are not compatible with the thermodynamic model. Bold lines represent base pairs, (a) Motif B : bulges separated by one base pair, (b) Motif 21: internal loops separated by one base pair. 47 Chapter 4. Empirical Complexity of RNA-SSD Algorithm Secondary Structure: small subunit ribosomal RNA Figure 4.10: Secondary structure of small subunit ribosomal R N A of Acan-thamoeba castellanii. Biological secondary structure with three bulges, each separated by only one base pair. This structure was obtained from The Comparative R N A Web (CRW) Site [17, 18]. 48 Chapter 4. Empirical Complexity of RNA-SSD Algorithm Secondary Structure of (eu)Bacterial 16S Ribosomal RNA A A U G C Q • • , U G G C G A — Q A A | | A O . U G G C Q . » A C G G A R G A G C llllll j U G C U C Q C C U U A U C C U U U B U U U C C C O G U C I I - • I llllll) I II III c A G G G G U A G O A A A C A A G O G C C G C C V Q A U G A G A A U U A A C C U G G G U G C A llllll III I I • I • I I I U C G G G C C C G U G U A G A C U G A U G A C U G G C A A G C U G U U U G C U — U G U C O A C U U V V U I I I I • I I C Q C A G C U G A , - , G II G O G A G U A C G G C C G C C U C * U U G G A A _ 9 0 ° Ill i A / C uV*G \ C I G U A A A HJ^ A A G " G G G U U G U A C y mi nil U U C C G G U A U G U %C I A , G 40fJ~.C G L N A G A * 1 A ^ % G A J U G A C Q G G G C C C G C A C A A I I I I - • I I t I I I A C A J O U U C C G G G C S V G C A A G C C U O " U G C A G C G G G U A C G U A , . C • A — C - G U l U A A C C G U A G G G • I I l l t l t l l Q G U U G G C O U C C A , E U C A A , G G u C G U C A C U G U C A G G A A G A A G C I • I I t ' l l III ' l l l l l l y C Q G U G . C A G U C U U U C U U C B V % A U ' 0 C V t f G G G C C U C U U U f. I l l l ' l l ' u C C G G G G A G A C U A C U G G A R G A U G G C A * Escherichia coli Figure 4.11: Secondary structure of small subunit ribosomal R N A of Es-cherichia coli. Biological secondary structure with two internal loops sepa-rated by one base pair. This structure was obtained from The Comparative R N A Web (CRW) Site [17, 18, 28]. 49 Chapter 4. Empirical Complexity of RNA-SSD Algorithm 4.3.2 Analysis of RNA-SSD on secondary structures with constraints From the previous experiments we learned that the empirical expected run-time complexity of the R N A design problem is polynomial for random and biologically motivated structures. Now, we will investigate the hardness of the problem when primary structure constraints are imposed on the design of the biologically motivated structures that we used for the unconstrained case. The hardness of an instance of this constrained secondary structure de-sign problem not only depends on the given secondary structure, but also on the set of primary structure constraints. To capture the impact of the primary structure constraints on the performance of R N A - S S D , we used ev-ery secondary structure with a number of different sets of primary structure constraints; furthermore, because of the stochastic nature of R N A - S S D , we performed multiple runs of our algorithm for each such problem instance. The expected C P U time required to design a structure with a given set of primary structure constraints was estimated from these runs. Most of our analysis is based on the median expected run-time of R N A - S S D over ail sets of constraints for a given structure. Because of the computational burden incurred by the large number of runs per secondary structure required by this protocol, we performed these experiments on smaller sets of biologically motivated structures; these sets were obtained by uniform random sampling (without replacement) from the respective sets used for our empirical anal-ysis of the unconstrained case. Two different methods were used to create sets of primary structure con-straints. One of these essentially selects the base positions to be fixed within the given structure at random, while the other fixes the base assignments of entire stems. In both cases, the bases in the selected positions are fixed according to a sequence that folds stably into the given structure. As can be seen in Figure 4.12, the hardness of a constrained design problem varies significantly depending on the given set of constrained bases. In particular, constraining entire stems rather than randomly selected bases tends to result in slightly easier problems. Figure 4.13 illustrates the scaling of search cost for solving our sets of biologically motivated R N A secondary structure design problems with differ-ent types of primary structure constraints compared with the unconstrained case. Our empirical results indicate that the median expected run-time scales polynomially with the size of the structures for the unconstrained case and for constraints located in random positions or in stems; in all three 50 Chapter 4. Empirical Complexity of RNA-SSD Algorithm I /' VS Fip^zyrrie from Neurospora mitochondria, stems - VS Riljo/zyme from Neurospora mitochondria, random positions RND-150-n85, stems 0 '— 0.1 RND-150-n85, random positions BIOM-150-n89, stems BIOM-150-n89, random positions 1000 expected run-time [CPU sec] Figure 4.12: Distribution of expected run-time of R N A - S S D on three struc-tures of approximately 150 bases: RND-150-n85, BIOM-150-n89 and V S Ribozyme from Neurospora mitochondria. The structures were designed with two sets of primary base constraints: one where the bases are fixed at random positions and the other where the bases are fixed on stems for each structure. Both sets have the same range [a, b] of constrained bases after propagation, where a and b are smallest and largest number of bases constrained if 50% of stems are fixed in a given structure. We fixed 50% plus one stems if a structure has an odd number of stems. cases the median run-time is approximated by a polynomial of degree close to three. There is some indication that in the case of constrained stems (which play a major role in stabilising R N A secondary structure) better scaling behaviour is observed than for base constraints in randomly chosen positions or no base constraints at all. 51 Chapter 4. Empirical Complexity of RNA-SSD Algorithm 1000 10 100 1000 structure length Figure 4.13: Scaling analysis using RNA-SSD for the median expected run-time of biologically motivated structures with no primary base constraints and with bases constrained in fifty percent of random positions and fifty percent of stems. The lines represent the polynomial that best fits the data. The experiment with primary structure constraints is computationally expensive, and for this reason, fewer structures of each length were used. The best fit for structures with constraints was calculated based on those of lengths 50, 75 and 100. Note that the run-times for constrained structures longer than 100 appear below the corresponding fit line. 52 Chapter 4. Empirical Complexity of RNA-SSD Algorithm No. Description (source) Size expected number of number (bases) run-time multiloops of stems [CPU sec] 1 VS Ribozyme from Neurospora mitochondria 167 0.6375 2 11 2 Bio-150-n38 172 0.529 1 9 3 Bio-150-nl4 167 12.94 2 10 4 Group II intron ribozyme D135 from Saccharomyces cerevisiae mitochondria 584 11.54 5 32 5 Bio-200-nl9 208 7.618334 3 12 6 Bio-150-nl2 150 0.15633 1 6 Table 4.6: Set of structures used to study the correlation between the pri-mary structure constraints and the performance of R N A - S S D . Structures with similar characteristics (such as size, number of multiloops, etc.) ap-pear in the same group. The structure Bio-150-nl2 was also selected for the study because is relatively easy to design. 4.3.3 Performance of RNA-SSD with different number and locations of primary base constraints In a second series of experiments, we studied the correlation between the number of bases constrained and the performance of the R N A - S S D algo-rithm. The experiments were conducted using some biological structures from Table 4.3 as well as biologically motivated generated structures. Table 4.6 shows some features of these structures. The two biological structures chosen for this experiment are the VS Ribozyme from Neurospora mitochon-dria and the Group II intron ribozyme D135 from Saccharomyces mitochon-dria. The biologically motivated structures were chosen according to various criteria. Bio-150-nl4 has the same size and number of multiloops as the VS Ribozyme; for Bio-150-n38, the run-time required by R N A - S S D (without primary structure constraints) to design the structure is very similar to that required by VS Ribozyme; Bio-200-nl9 is a longer structure that has several multiloops as Group II intron ribozyme; and Bio-150-nl2 is particularly easy to design. In every case, the primary structure constraints are based on se-quences that are computationally predicted to fold into the target structure; therefore, the respective design problems are solvable by construction. Figure 4.14 shows how the hardness of the design problem depends on the fraction of constrained bases for randomly located base constraints and 53 Chapter 4. Empirical Complexity of RNA-SSD Algorithm 1 o 1 ! 1 1 [CPUS scted run-time 8-OJ c T5 1 0.1 random positions stems 0 0.2 0.4 0.6 0.8 fraction of constrained bases (a) 0.2 0.4 0.6 0.8 fraction of constrained bases (b) • • -random positions V stems ^ 0.2 0.4 0.6 0.8 traction of constrained bases (c) r 1 1 1 random positions stems 0.2 0.4 0.6 0.8 fraction of constrained bases (d) fraction ot constrained bases fraction of constrained bases (e) (f) Figure 4.14: Hardness of RNA-SSD with base constraints. Correlation be-tween the fraction of bases constrained in a particular structure (x-axis) and the median expected run-time for designing the structure with R N A -SSD (y-axis). We report the fraction of constrained bases after propagation for constraints on randomly chosen base positions. This fraction, for both randomly chosen bases and stems, corresponds to the median fraction of bases constrained in a set of 50 constraints that were generated by fixing a given percentage of bases or stems. There are two curves in each graph, one for designing structures with base constraints located in random positions and the other for constraints located in stems, (a) V S Ribozyme from Neu-rospora mitochondria; (b) Bio-150-n38; (c) Bio-150-nl4; (d) Group II intron ribozyme D135 from Saccharomyces; (e) Bio-200-nl9; (f) Bio-150-nl2. 54 Chapter 4. Empirical Complexity of RNA-SSD Algorithm 5' 3' Figure 4.15: Biologically motivated structure Bio-150-nl4 with ten stems. When constraining the bases in stems 7 and 8, this structure is hard to design. The structure motif formed by these stems, which are short and separated by a bulge, is unstable. for constrained stems. As can be seen from these results, there are some cases in which base constraints of either type render a secondary design problem easier, while in other cases, we observe a substantial increase in hardness as a critical number of bases is constrained. The performance of RNA-SSD suggests that it is generally easier to design a structure when the stems are constrained. This is intuitively plau-sible, given that generally, stems represent the most stable parts of R N A secondary structures. However, there are exceptions: structure Bio-150-nl4 (shown in Figure 4.15) was found to be difficult when more than seven stems of its ten stems were constrained. When we further analyzed the correla-tion between the constrained stems and the expected run-time required for designing this structure, we found that constraining two particular stems, labelled 7 and 8 in Figure 4.15, made the design problem significantly more difficult. These short stems are separated by a bulge of size one, and they are not stable enough to compensate the penalty of the bulge. Figure 4.15 shows a difficult problem instance where these two stems are constrained. The structure is easy to design if the base pairs A - U in stems 7 and 8 are replaced by C G , which is a more stable interaction. We also observed that for structures with similar characteristics (same number of bases, multiloops or stems, or same difficulty to design without constraints), the behaviour of RNA-SSD algorithm shows significant qualita-tive variation. Structures such as VS Ribozyme from Neurospora mitochon-55 Chapter 4. Empirical Complexity of RNA-SSD Algorithm dria and Bio-200-nl9 are easier to design when the number of constrained bases increases (Figure 4.14a and 4.14e); the same holds for structure Bio-150-n38 when the constrained bases are located within stems (Figure 4.14b). In these cases, the base constraints reduce the size of the search space and additionally may help to steer the search process towards solution sequences. However, the problem can also get harder as the number of constrained bases increases and then becomes easier again, as approximately 80% or more of the bases are constrained or when all the stems are constrained. This is observed for structures Bio-150-n38, Bio-nl50-nl4, Group II intron ribozyme D135 from Saccharomyces mitochondria and Bio-150-nl2 when using random base constraints (Figure 4.14b, c, d and f), as well as for Bio-150-nl4 and Group II intron ribozyme D135 from Saccharomyces mi-tochondria when the stems are constrained (Figure 4.14c and d). In these cases, the reduction in the size of the search space caused by the base con-straints is counteracted up to a point by factors that make finding solutions within these smaller spaces harder. One such factor is solution density, which can be substantially reduced by adding base constraints. Beyond a certain number of primary structure constraints, the advantages from reduction in search space size outweigh these factors, such that the problem becomes eas-ier again. This is not surprising, since the design problem becomes trivial in the extreme case in which all base positions are constrained. Somewhat surprisingly, as can be observed for structure Bio-150-nl2 (Figure 12f), there are cases where constraining all stems, leaving only un-paired base positions to be assigned by the algorithm, renders the design problem harder than constraining a smaller number of stems. 4.4 S u m m a r y The scaling analysis on random and biologically motivated structures sup-ports the hypothesis that the running time of both algorithms scales poly-nomially with the size of the structure. We also found that the algorithms are in general faster when constraints are placed only on paired bases in the structure. When comparing both algorithms, the R N A - S S D algorithm performs better than RNAinverse since it requires less time to design a given structure and also because it is able to design more structures. Furthermore, we prove that, according to the standard thermodynamic model, for some structures that the RNA-SSD algorithm was unable to design, there exists no sequence whose minimum free energy structure is the target structure. 56 Chapter 5 Interaction of Two R N A Molecules This chapter deals with the problem of designing R N A duplexes where two R N A molecules interact with each other. We describe three different ap-proaches for R N A duplex design. These are the first algorithms for design-ing duplexes since to the best of our knowledge, there are no algorithms for designing complexes in the literature. Each of our algorithms takes as its input a pair of molecules that adopt a particular structure and the outputs are two sequences that are predicted to form the desired complex. These ex-tensions support base constraints located in some positions of the complex, although they do not design for stability. The design of duplexes can be useful in engineering novel ribozymes as therapeutic agents. In-trans ribozymes that cleave external substrates can be designed based on the secondary structures of a self-cleavage ribozyme (in-cis). A molecule may be described as cis-acting when there is an in-tramolecular reaction mechanism, whereas cleavage in-trans is governed by an interaction of two molecules. For example, Figure 5.1 (a) shows the secondary structure of a hammerhead ribozyme found in nature [11]. A model for the trans-cleaving hammerhead ribozyme determined by chem-ical analysis [45, 63] is shown in Figure 5.1 (b) where the bases that are important for the catalytic activity are specified. R N A - S S D helps to ob-tain all R N A molecules that bind to a specific target with the conformation shown in Figure 5.1 (a). Then it is possible to test in vitro which of these sequences have the highest catalytic activity. This will have the advantage of reducing the number of. in vitro mutations needed to determine the best ribozyme. It is necessary to determine the catalytic rate of a ribozyme in vitro since the secondary structure of the ribozyme-substrate complex is not sufficient to guarantee a high cleavage rate. A stable secondary structure ensures an efficient association with the target but may be detrimental with respect to multiple turnover cleavage of several target R N A molecules. Also, tertiary interactions that cannot be determined with the current thermody-namic model and secondary structure prediction algorithms may affect the 57 Chapter 5. Interaction of Two RNA Molecules Figure 5.1: (a) Secondary structure of hammerhead ribozyme in satellite D N A from Dolichopoda schiavazzi (cricket), (b) Trans-cleaving hammer-head ribozyme. This picture shows the secondary structure model of the hammerhead-substrate duplex where the specified bases are important for catalytic activity. Dots represent any nucleotide. In both pictures the arrow indicates the cleavage site. cleavage rate of a ribozyme. R N A molecules can interact in different ways. Two complementary re-gions of each strand can hybridize to form a helix. This kind of interactions is common in ribozymes. It is also possible to have kissing loop interactions like the one in Figure 5.2 that involve Watson-Crick base pairing between complementary R N A loops. This is the case for p R N A molecules found in a DNA-packing motor of the bacterial virus phi29. p R N A molecules form an hexameric ring which is a crucial part of the motor. Shu et al. [58] redesigned this p R N A to form a variety of structures and shapes, such as tetramers and triangles that can serve as parts in nanostructure designs. We describe three different approaches for R N A duplex design. Each of our algorithms takes as its input a complex of molecules that adopt a particu-lar secondary structure. The outputs are two sequences that are predicted to form the desired complex according to the PairFold function of Andronescu et al. [5]. The first approach, called "Linker", concatenates two molecules to generate a single structure that is designed with R N A - S S D . The designed sequence is then separated into the corresponding strands. The "Interface" approach assigns bases in the positions of intermolecular helices of the com-plex. Then both structures are designed independently by R N A - S S D where the bases located in the regions where both structures interact remain fixed in R N A - S S D . The "Internal" approach modifies the R N A - S S D algorithm to design duplexes inside the algorithm where the candidate solutions are 58 Chapter 5. Interaction of Two RNA Molecules Figure 5.2: R N A complex with kissing loop interactions. This complex de-signed by Chworos et al. [20] is a self assembling square made of four similar R N A strands, called tectoRNAs, that can have applications in nanobiotech-nology. evaluated with PairFold. A scaling analysis on these algorithms show that there is no evidence that the running time of Linker and Internal scale expo-nentially with the length of the duplex. When comparing these approaches, we found that Interface has the lowest running time and designs less du-plexes in our data set than Linker and Internal. Frequently, Interface has the problem that components of the designed sequences form undesirable interactions even though each sequence fold correctly into the structure of the corresponding molecule. We also found that Linker has the best run-ning time but Internal designs more complexes than the other algorithms. Internal uses the PairFold function to evaluate the designs inside the algo-rithm which in practice is a constant times more expensive than RNAfold that evaluates designs for single sequences. Therefore, there is a trade-off between using PairFold internally or not since in the former case is more expensive even though it designs more duplexes. 5.1 Desc r ip t ion of the algori thms Currently, no algorithms for designing complexes of R N A molecules are available. In this section we describe three different approaches to design du-plexes, building on the RNA-SSD algorithm described in Chapter 3. For all of these, candidate sequences are evaluated with the PairFold algorithm de-signed by Andronescu et al. [5]. This program predicts the secondary struc-ture of two interacting R N A sequences by concatenating two R N A strands and performing a secondary structure prediction as if there were only one strand, using a modified version of the RNAfold algorithm from the Vienna Package (for folding a single strand) that considers an intermolecular initi-59 Chapter 5. Interaction of Two RNA Molecules ation penalty. Since the RNAfold algorithm avoids pseudoknots, PairFold cannot predict kissing loop interactions. Therefore, we will restrict ourselves to the cases where the strands form an intermolecular helix. 5.1.1 "Internal" algorithm The first approach modifies RNA-SSD to design duplexes inside the algo-rithm. The input of the algorithm is (R*,b) where R* is the secondary structure of the interacting strands X\ and X2 and b is the linkage location between the strands (6 equals the last position of X\). R N A - S S D hierarchi-cally decomposes R* into small substructures. The smallest substructures R* of the decomposition tree are designed by the SLS algorithm described by Andronescu et al. [4]. However, in this case a substructure might be composed of a single molecule or by two molecules. Therefore, the SLS al-gorithm is modified to evaluate the candidate solutions. If a substructure R* of the decomposition tree has intermolecular bonds, the candidate solution Xi is separated at the linkage location b into two strands that are then eval-uated with PairFold. Otherwise, we use the RNAfold function from Vienna Package. A candidate sequence X^ for R^, the substructure associated with a non-leaf node k, is then determined by merging the two subsequences Xki and Xk2 corresponding to the children of node k of the decomposition tree. Next, the M F E structure R^ of X^ is compared with the target structure Rk. If Rk and i?£ are identical, the recursive step has been successfully completed. Otherwise, the conflicting bases in X^i or X^ a r e memorised and the SLS procedure is recursively called on the subsequence with the higher relative fraction of conflicting bases, resulting in a new sequence Xk\ or Xk2- This process is iterated until either a valid sequence for R*k has been found, or a cutoff time is reached, or a maximal number riu of attempts to concatenate subsequences X^\ and X^2 into a valid sequence for Rf, have been performed. For our experiments, we used a cutoff time of 1800 C P U seconds and HM = 10000. This approach gives rise to the "Internal" algorithm shown in Figure 5.3. 60 Chapter 5. Interaction of Two RNA Molecules procedure RNA-SSD-duplex-internal input: target RNA secondary structure duplex R*, duplex length n and linkage location b (n, R*, b) output: RNA sequences Xi and X2 initialise sequence X (X of length that matches R*) and hierarchically decompose R* and X [4]; for each leaf node k, use the SLS algorithm [4] to design a sequence Xk for the substructure at node k; end for for each non-leaf k of the decomposition tree Xk <— Xk\Xk2\ if linkage location falls within Xk obtain the M F E structure Rk of (Xk, b') with Pairfold, where b' is the position of the linkage location within Xk\ d s 6 obtain the M F E structure Rk of Xk with RNAfold; end if if Rk is different from the target structure R*k of node k repeat redesign Ru, the substructure with the higher relative fraction of conflicting bases among Rk\ and Rk2\ Xk <— Xk\Xk2\ if linkage location falls within Xk obtain the M F E structure Rk of (Xk, b') with Pairfold; GISG obtain the M F E structure Rk of Xk with RNAfold; end if until Rk is identical to R*k or a maximal of UM attempts has been performed or a cutoff time is reached end if end for return X\ and X2 end RNA-SSD-duplex-internal . Figure 5.3: Pseudocode for designing duplexes with the "Internal" algo-rithm. 61 Chapter 5. Interaction of Two RNA Molecules (R*,14) 3'»-Off O O O O O b=14 p o o o o c r -.0 3' ••••• ' • 5 ' S*=D 1D 2 • -o / ' o • o ? ? ? ? o ° ° . oV •D, ( a ) ( b ) ( R M 2 ) 3 '«-05 ' • -O • -o b=12 • -o , ®* ° ^ O O O O O O 3 ' • ••••• ( c ) S*=D 1 LD 2 D, 3 » - 0 5 ' • !foi • - O ; • - O \ \ o \ P i linker L £ ••••• ( d ) Figure 5.4: Panel (a): Duplex structure (R*,b), where b = 14 in this case. Panel (b): a single structure 5* is formed by concatenating the strands D\ and D2 of the duplex (R*, 14). Panel (c): Duplex structure (R*, 12). Panel (d): a linker of five bases is added between the strands D\ and D2 from (R*, 12) to obtain a structure S* allowed by the thermodynamic model. 5.1.2 " L i n k e r " a lgor i thm A simple approach to design R N A duplexes is to concatenate the two in-teracting strands D\ and D2 of the target duplex R* to generate a single structure S* = D1D2 that is designed with RNA-SSD. This is illustrated in Figures 5.4 (a) and (b). The designed sequence X is then separated into the corresponding strands X\ and X2 at the linkage location b and evaluated with PairFold. In some cases the resulting structure S* includes a hairpin with less than three unpaired bases which would not be allowed by the thermodynamic model, see Figure 5.4 (c). We can obtain a valid structure by introducing a proper linker L between the strands D\ and D2 such that the resulting structure S* = D1LD2 has a free energy similar to the duplex R*. The free energy calculation for a pair of R N A molecules is very similar to the free 62 Chapter 5. Interaction of Two RNA Molecules procedure RNA-SSD-duplex- l inker input: target RNA secondary structure duplex R*, duplex length n linkage location b (n, R*, b) output: RNA sequences X\ and X2 Let S* = D1D2 where D\ and D2 are the two strands that form the duplex R*; i f 5* = D\D2 contains an invalid hairpin of size 0, 1 or 2 introduce a loop L of size five: S* = D1LD2; end if initialise sequence X (X of length that matches S*) and hierarchically decompose 5* and X [4]; for each leaf node k, use the SLS algorithm [4] to design a sequence Xk for the substructure at node k; end for for each non-leaf k of the decomposition tree Xk <— Xk\Xk2\ obtain the M F E structure Sk of Xk with RNAfold; if Sk is different from the target structure SI of node k repeat redesign Ski, the substructure with the higher relative fraction of conflicting bases among Ski and 5^2; Xk <— XkiXk2', obtain the M F E structure Sk of Xk with RNAfold; until Sk is identical to S£ or a maximal of UM attempts has been performed or a cutoff time is reached end if end for split X at the linkage location b and if S* = DiLD2, remove the bases corresponding to the linkage L to obtain Xi and X2', evaluate the candidate solution X1X2 with PairFold; return X1X2 end RNA-SSD-duplex-l inker . Figure 5.5: Pseudocode for designing duplexes with the Linker algorithm. 63 Chapter 5. Interaction of Two RNA Molecules energy calculation for one single molecule. The main difference is that a penalty for intermolecular initiation is added [3]. Therefore we can add a linker L of five unpaired bases to form a hairpin of size five, see Figure 5.4 (d). This has turned out to be the best value for our experiments since hairpins of size four are very stable and longer hairpins have a penalty in the calculation of the free energy that increases with the number of unpaired bases. This approach gives rise to the Linker algorithm shown in Figure 5.5. 5.1.3 " In te r face" a l g o r i t h m The last approach to design duplexes fixes the interface I between the two molecules D\ and D% at the beginning of the algorithm. Here we assign bases in the positions of intermolecular helices as in Figure 5.6 (a). Below we discus how these values are chosen. Then we design both structures D\ and £>2 independently using the version of RNA-SSD that supports primary structure constraints since we fix the bases located in the regions where both structures interact, see Figure 5.6 (b). In a final step we evaluate the quality of the candidate solution using the PairFold algorithm. This procedure is repeated until the PairFold test does not fail or the algorithm reaches a cutoff time. Figure 5.7 shows the "Interface" algorithm. The assignment of bases in the interface is done with the goal of avoiding incorrect folding. The procedure used is the same as the sequence initial-ization in RNA-SSD. That is, the assignment is done in such a way that we assign complementary bases in the intermolecular helices that satisfy the primary base constraints of both structures. We assign to the helix regions G • C pairs with a higher probability p(G • C) than A • U pairs because they are energetically more favourable. We set p(G • C) = 0.66 for our exper-iments. Finally, a tabu mechanism is used to minimize the undesired but energetically favourable interactions between subsequences of X\ and Xi. In this mechanism we assign bases to short segments in a helix. The sequence motif is only admissible if it does not form more than a certain number of consecutive base pairs with any previously used motifs or with itself. The motif length lm and the motif distance threshold dm are set as 10 and 5 respectively which are the same values used by Andronescu et al. [4] to initialize a sequence in RNA-SSD. 64 Chapter 5. Interaction of Two RNA Molecules 3'»-05' • -o • -o • -o V ° :::: Interface I U C C G U G< (a) 3'* • • n U C C G J G # . • •••• 5' 05' O O O O O , G G C A cO<-(b) Figure 5.6: "Interface" algorithm to design duplexes, (a) Interface I: bases in the intermolecular helix are fixed, (b) Design of structures D\ and D2 independently with RNA-SSD. procedure RNA-SSD-duplex-interface input: target RNA secondary structure duplex R*, duplex length n and linkage location b (n, R*,b) output: RNA sequences X\ and Xi repeat fix the interface I; that is, the bases in the intermolecular helices of R*; design the interacting strands D\ and Di of R* independently, using RNA-SSD and subject to the primary structure constraints J, to obtain sequences X\ and Xi; evaluate the candidate solution X1X2 with PairFold; until X\ and Xi are predicted by PairFold to fold into R* or a cutoff time is reached; return X\ and Xi end RNA-SSD-duplex-interface. Figure 5.7: Pseudocode for designing duplexes with the "Interface" algo-rithm. 65 Chapter 5. Interaction of Two RNA Molecules 5.2 Exper imen t s In this section we describe the experiments that we performed to compare the performance of the three different algorithms described in the previous section for designing duplexes. The performance of the algorithms is mea-sured in terms of the running time to design duplexes and their success rate, that is, the number of designed sequences that fold correctly into a given target duplex. Our data sets consist of biological and artificially generated R N A du-plexes. The biological set consists of pseudoknot-free ribozyme-RNA target duplexes on which Andronescu et al. tested the PairFold algorithm [5]. This set is described in Table 5.1. Furthermore, artificial duplexes are ob-tained using the R N A Structure Generator [4]. This program generates single structures with the statistical features from Table 4.4 derived from biological structures. We obtained two different duplexes from a single struc-ture in the following way. The first duplex is generated by splitting the given structure at a position that is selected uniformly at random and such that the resulting duplex has intermolecular base pairs. The second duplex is obtained from a sequence by removing all unpaired bases of a hairpin that is selected uniformly at random. In both cases we obtain two structures that form an intermolecular helix (see Figure 5.8). We refer to this set of duplexes as BIOMD-I and BIOMD-II if they are generated by the first and second method, respectively. Table 5.2 provides further information on the set of artificially generated duplexes. For the R N A - S S D algorithm used in every algorithm we fixed a maxi-mal number UM = 10 000 of attempts to concatenate subsequences . X ^ i and Xk2 into a valid sequence for the substructure i?£ at node k of the decom-position tree. We performed 100 runs on each problem instance in order to get reasonably steady statistics on the running time of each algorithm. For structures of length 500, which are computationally expensive to design, we performed only 10 runs instead of 100. Each run is terminated after a cutoff time of 1800 C P U seconds if it has not resulted in the desired du-plex. We compute the expected run-time for each structure with equation 4.2. A l l computational experiments were carried out on PCs with Intel(R) Xeon(TM) 3.06GHz processors (only one processor was used in our exper-iments), 1024 K B cache, and 2GB R A M running Red Hat Linux, kernel version 2.6.5-7.252-smp. 66 Chapter 5. Interaction of Two RNA Molecules ID Description Size (b h ases) h BIOD-1 Hammerhead ribozyme R32 and substrate 32 11 BIOD-2 AUG-cleaving hammerhead-like ribozyme and substrate 36 16 BIOD-3 Hammerhead ribozyme and no-tail gene target mRNA 38 17 in zebrafish BIOD-4 Reverse-joined hairpin ribozyme HP-RJTL ans substrate 65 14 BIOD-5 Conventional hairpin ribozyme with sequence variation 55 14 HP-WTSV1 and substrate BIOD-6 Hairpin-derived twin ribozyme HP-DS1 and substrate 120 34 BIOD-7 Hairpin ribozyme and substrate 92 21 BIOD-8 Hairpin ribozyme RzGlOl and substrate 14 50 BIOD-9 Minimal two-way helical junction 2WJ-SV5 hairpin 32 40 ribozyme-substrate complex BIOD-10 X-motif ribozyme model 43X and S21 substrate RNA 43 21 BIOD-11 MR8-1 ribozyme (derived from X-motif ribozyme) and 49 21 S21 substrate RNA BIOD-12 ATP-sensitive allosteric ribozyme construct IV-up 59 14 and substrate BIOD-13 ATP-insensitive ribozyme construct IV-down 59 14 and substrate BIOD-14 5'CYbUT RNA crosslinked to gCYb-558 59 28 Table 5.1: Pseudoknot-free ribozyme-substrate duplexes obtained from A n -dronescu et al. [5]. These duplexes have been selected arbitrarily from the biological literature. The length of the ribozyme and the target structure is specified by l\ and I2, respectively. 67 Chapter 5. Interaction of Two RNA Molecules (a) (b) (c) Figure 5.8: Artificially generated duplex, (a) Biologically motivated struc-ture generated with R N A Structure Generator, (b) The arrow in (a) indi-cates the place where the structure is cut to obtain the duplex shown in (b) or the hairpin that is removed to obtain the duplex in (c). ID Size of complex (bases) Number of structures generated BIOMD-I-50 BIOMD-II-50 [50,75) 10 BIOMD-I-100 BIOMD-II-100 [100,125) 10 BIOMD-I-200 BIOMD-II-200 [200,225) 10 BIOMD-I-500 BIOMD-II-500 [500,525) 10 Table 5.2: Biologically motivated duplexes (BIOMD). The strands used to generate the duplexes have statistical features derived from biological struc-tures (see Table 4.5). BIOMD-I refers to duplexes generated by splitting a single strand in a random position as explained in Section 5.2. BIOMD-I I are generated by removing a hairpin from a single strand. 68 Chapter 5. Interaction of Two RNA Molecules o <u .in, CD E => o_ O T3 £ O CD C L X CD CD O CO t CD 10000 100 h 1 h 0.01 0.01 1 100 10000 Linker, expected CPU time [sec] Figure 5.9: Correlation between the expected running times in C P U seconds of the Linker (x-axis) and the Interface (y-axis) approaches to design the biological duplexes from Table 5.1. 5.3 Resul ts In this section we present the performance of Linker, Interface and Internal to design duplexes. 5.3.1 Performance of Linker and Interface Figures 5.9 and 5.10 show the correlation between the expected running times of the Linker and the Interface algorithms for designing biological and artificially generated duplexes respectively. Note that for every problem instance, the expected running time of the Linker algorithm is less than or equal to the expected running time of Interface. In addition, Linker designs more duplexes from our data set than Interface. Figure 5.10 (c) shows the only duplex (from BIOMD-I-200) that Linker did not design; this duplex was designed by Interface. The secondary structure of this duplex is shown in Figure 5.11 (a). In contrast, the Interface algorithm failed to design thirteen artificially generated duplexes that are designed with Linker. Furthermore, twenty one artificially generated duplexes were not designed by either Linker or Interface. We also compute the success rate as the number of designed duplexes that fold correctly to a given structure in all of the runs. The success rate 69 Chapter 5. Interaction of Two RNA Molecules 1e+06 10000 100 0.01 ** o G BIOMD-l-50 o BIOMD-ll-50 * 0.01 1 100 10000 1e+06 Linker, expected CPU time [sec] (a) 8 1e+06 CL O 10000 100 h BIOMD-l-100 o BIOMD-ll-100 * 100 10000 1e+06 Linker, expected CPU time [sec] (b) 1e+06 10000 100 * o < 100 10000 BIOMD-l-200 o BIOMD-ll-200 * j» 1e+06 3 O t 0) 1e+06 Linker, expected CPU time [sec] (c) 10000 100 * G * o si© * 100 BIOMD-l-500 o BIOMD-ll-500 * 10000 1e+06 Linker, expected CPU time [sec] (d) Figure 5.10: Correlation between the expected running times of the Linker and the Interface algorithms to design artificially generated duplexes. We arbitrarily (but unambiguously) report the running time for structures that Linker or Interface is unable to design as 106 C P U seconds. Structures that are designed by none of the approaches are excluded, (a) BIOMD-50; (b) BIOMD-100; (c) BIOMD-200; (d) BIOMD-500. for Linker is in general lower than Interface for almost all the biological duplexes in BIOD of Table 5.1 and in the artificial duplexes in BIOMD-50-I and BIOM-50-II of Table 5.2, see columns SR(R*) of Linker and Interface of Tables 5.3 and 5.4. Furthermore, Tables 5.5-5.7 show that the Linker algorithm has a better performance than Interface for the longer duplexes BIOMD-100, BIOMD-200 and BIOMD-500. Note that the success rate, SR(S*), of Linker for designing single structures S* = Di(L)D2 is one for 70 Chapter 5. Interaction of Two RNA Molecules Figure 5.11: (a) Duplex number 10 from the data set BIOMD-I-200, see Table 5.6. (b) The Linker algorithm concatenates both molecules S* = D\Di and design a sequence X = X1X2 that folds correctly into S*. (c) Secondary structure of X i and X2 predicted by PairFold. Note that Xi and X2 do not fold correctly into the target duplex R*. almost every problem instance in our data set whereas the success rate SR(R*) of the duplex is in general lower than SR(S*). This means that even though single structures S* are easy to design with R N A - S S D , the cor-responding sequences X\ and X2 do not always fold correctly into the target duplex R*. Figure 5.11 shows an example of this situation. Remember from Subsection 5.1.2 that RNA-SSD-Linker is terminated when the structure S* is successfully designed or when a cutoff time is reached. Since single structures S* are easy to design, the running time of Linker is low even 71 Chapter 5. Interaction of Two RNA Molecules though it does not correctly design the target structures R*. Interface has the opposite behaviour for small duplexes in the sets BIOD and BIOMD-50. That is, the success rate for designing D\ and D2 is lower than SR(i?*). This suggests that even though when RNA-SSD does not design sequences X\ and X2 that fold into D\ and D2 respectively, X1X2 may fold into the desired duplex like the example in Figure 5.12. Note that the bases fixed in the interface interact, forming undesired base pairs, and therefore the se-quences X\ and X2 do not fold correctly into D\ and D2 respectively. These undesired interactions do not appear in the duplex because the constraints of both sequences form a more stable intermolecular helix. We also observed the opposite behaviour where the duplex is not designed even though the sequences X\ and X2 fold correctly into D\ and Di- Figure 5.13 shows this case. Note that there are only two bases that are incorrectly paired in the duplex which form an internal loop instead of two consecutive bulges. 72 Chapter 5. Interaction of Two RNA Molecules Figure 5.12: (a) Biological duplex BIOD-14. The corresponding sequences were designed by the Interface algorithm. The shaded bases belong to the intermolecular helix. The sequences X\ and X2 from (b) and (c) are designed independently by RNA-SSD. The shaded bases in this case correspond to nucleotides constrained in the interface. Note that X\ and X2 do not fold correctly into D\ and D2 respectively. 73 Chapter 5. Interaction of Two RNA Molecules Figure 5.13: (a) Target structure BIOD-7. (b) Duplex formed by the se-quences Xi and X2 designed by Interface. There are two bases that are incorrectly paired that form an internal loop instead of the two consecutive bulges of Figure (a), (c) The sequences X\ and X2 fold correctly into T)\ and D%-74 Chapter 5. Interaction of Two RNA Molecules 5.3.2 Performance of Linker and Internal When Linker is compared with Internal in Figures 5.14 and 5.15, we notice that overall Linker has a lower running time than Internal. Most of the duplexes not designed by Linker are also not designed by Internal. In Figure 5.15 (c) and (d) we can see that there are only two duplexes from B I O M D -1-200 and one from BIOMD-I-500 not designed by Linker that Internal was able to design. On the other hand, there are two duplexes from BIOMD-II -500 designed by Linker but not by Internal (see Figure 5.15 (d)). The performance of Linker and Internal in terms of the success rate SR(R*) is shown in Tables 5.3-5.7. The Internal algorithm designs a larger fraction of sequences that fold correctly into the target duplex than Linker. Note that the main difference between Linker and Internal when designing structures from BIOMD-I is the evaluation function. The structures D\ and Di of this data set can be concatenated without adding a linker L to form a single structure S* = D\Di allowed by the thermodynamic parameters. Linker decomposes the structure S* = D\Di and always uses the RNAfold function from the Vienna Package to evaluate the candidate solutions, even though the bases of the subsequence belong to different molecules. On the other hand, Internal decomposes the structure S* and evaluates the candi-dates solutions with the appropriate function: PairFold if the subsequence belongs to both molecules D\ and Di or the RNAfold function if it belongs only to D\ or Di- Therefore, it is not surprising that the success rate for Internal is higher than Linker. 75 Chapter 5. Interaction of Two RNA Molecules 100 o CD y>, CD E 0. O •o £ o 0) Q . X CD IB c CD c 0.01 0.1 1 10 • 100 Linker, expected CPU time [sec] Figure 5.14: Correlation between the expected running times in C P U sec-onds of the Linker (x-axis) and the Internal (y-axis) approaches to design the biological duplexes from Table 5.1. 76 Chapter 5. Interaction of Two RNA Molecules 100 10 0.1 0.01 0.01 o * 9-6 BIOMD-l-50 o BIOMD-ll-50 * 0.1 10 Linker, expected CPU time [sec] (a) 1000 •I 100 Q. o 100 10 0.1 SK s* a' BIOMD-l-100 BIOMD-ll-100 i i 0.1 10 100 1000 Linker, expected CPU time [sec] (b) 1e+06 i 10000 o SK .•' o' 100 1 SK * *.<S> o O BIOMD-l-200 o BIOMD-ll-200 SK I . I . I 1 100 10000 1e+06 Linker, expected CPU time [sec] (?) zi Q_ O T> CD t3 a> 1e+06 10000 100 SK SK SK 'SK© BIOMD-l-500 o BIOMD-ll-500 SK 100 10000 1e+06 Linker, expected CPU time [sec] (d) Figure 5.15: Correlation between the expected running times of the Linker and the Internal algorithms to design artificially generated duplexes. We arbitrarily (but unambiguously) report the running time for structures that Linker or Internal is unable to design as 106 C P U seconds. Structures that are designed by none of the approaches are excluded, (a) BIOMD-50; (b) BIOMD-100; (c) BIOMD-200; (d) BIOMD-500. 77 Chapter 5. Interaction of Two RNA Molecules o CD to, CD E 1e+06 ~ 10000 o " D £ o CD CL X CD CD~ o CO t CD 100 0.01 T • r + + / ' ribozyme-substrate duplexes + | 0.01 1 100 10000 1e+06 Internal, expected CPU time [sec] Figure 5.16: Correlation between the expected running times in C P U sec-onds of the Internal (x-axis) and the Interface (y-axis) approaches to design the biological duplexes from Table 5.1. 5.3.3 Performance of Interface and Internal Internal performs much better than Interface. From Figures 5.16 and 5.17, we see that the running times of Internal are in general lower than Inter-face. Furthermore, all duplexes designed by Interface were also designed by Internal. Tables 5.3-5.7 show that the success rate, SR(R*), of Internal is higher or equal than Interface in almost every problem instance. 78 Chapter 5. Interaction of Two RNA Molecules % 1e+06 = 10000 o "S 100 0.01 I 1 1 1 SK G O G ' * 0 G * .0'' 9' I BIOMD-l-50 G BIOMD-ll-50 * i . i 0.01 1 100 10000 Internal, expected CPU time [sec] (a) ZD CL O 1e+06 10000 100 G 9 BIOMD-l-100 G BIOMD-ll-100 * 1 100 10000 1e+06 Internal, expected CPU time [sec] (b) BIOMD-l-200 G BIOMD-ll-200 * 100 10000 1e+06 Internal, expected CPU time [sec] (c) UL 1 e+06 (0 E 3 0 . O T3 t 0) 10000 100 G G G G 100 BIOMD-l-500 o BIOMD-ll-500 * 10000 1e+06 Internal, expected CPU time [sec] (d) Figure 5.17: Correlation between the expected running times of the Internal and the Interface algorithms to design artificially generated duplexes. We arbitrarily (but unambiguously) report the running time for structures that Interface or Internal is unable to design as 106 C P U seconds. Structures that are not designed by none of the approaches are excluded, (a) BIOMD-50; (b) BIOMD-100; (c) BIOMD-200; (d) BIOMD-500. 79 Chapter 5. Interaction of Two RNA Molecules ID Linker Interface Internal SR{R*) SR(S*) SR{R*) SR(DX) SR(D2) SR(R*) BIOD-1 87/100 100/100 100/100 77/100 82/100 100/100 BIOD-2 87/100 100/100 100/100 42/100 52/100 100/100 BIOD-3 90/100 100/100 100/100 37/100 58/100 100/100 BIOD-4 68/100 100/100 100/100 62/100 58/100 100/100 BIOD-5 94/100 100/100 100/100 75/100 76/100 100/100 BIOD-6 65/100 100/100 9/100 3/100 3/100 100/100 BIOD-7 96/100 100/100 1/100 21/100 12/100 100/100 BIOD-8 60/100 100/100 100/100 81/100 77/100 100/100 BIOD-9 89/100 100/100 64/100 23/100 8/100 100/100 BIOD-10 95/100 100/100 100/100 32/100 22/100 100/100 BIOD-11 93/100 100/100 100/100 54/100 91/100 100/100 BIOD-12 60/100 100/100 100/100 56/100 80/100 100/100 BIOD-13 50/100 100/100 100/100 57/100 86/100 100/100 BIOD-14 66/100 100/100 100/100 12/100 13/100 100/100 Table 5.3: Performance results for Linker, Interface and Internal on sets of biological duplexes. The columns SR(-) show the fraction of structures for which the respective algorithm found solutions in all of the runs (SR stands for success rate). R* is the target duplex; S* is the structure that results from concatenating the molecules D\ and D2 of the duplex; D\ and D2 are designed independently by RNA-SSD in the Interface algorithm. 80 Chapter 5. Interaction of Two RNA Molecules BIOMD-50-I Linker Interface Internal ID SR(R*) SR(S*) SR(R*) SR(Dx) SR{D2) SR(R*) 1 31/100 100/100 100/100 62/100 73/100 100/100 2 31/100 100/100 100/100 85/100 78/100 100/100 3 99/100 100/100 100/100 100/100 100/100 100/100 4 0/100 100/100 0/100 99/100 95/100 0/100 5 90/100 100/100 100/100 58/100 27/100 100/100 6 36/100 100/100 100/100 100/100 100/100 100/100 7 0/100 100/100 0/100 0/100 99/100 0/100 8 1/100 100/100 86/100 55/100 51/100 95/100 9 89/100 100/100 100/100 100/100 100/100 100/100 10 81/100 100/100 100/100 92/100 90/100 100/100 BIOMD-50-II 1 94/100 100/100 100/100 14/100 17/100 100/100 2 75/100 100/100 99/100 33/100 29/100 100/100 3 91/100 100/100 100/100 35/100 32/100 100/100 4 o/roo 100/100 0/100 24/100 14/100 0/100 5 90/100 100/100 91/100 41/100 14/100 100/100 6 79/100 100/100 11/100 23/100 16/100 100/100 7 87/100 100/100 100/100 76/100 68/100 100/100 8 52/100 100/100 100/100 22/100 22/100 100/100 9 75/100 100/100 100/100 90/100 73/100 100/100 10 36/100 100/100 95/100 26/100 32/100 100/100 Table 5.4: Performance results for Linker, Interface and Internal on the set of artificially generated structures BIOMD-50. 81 Chapter 5. Interaction of Two RNA Molecules ID BIOMD-100-I Linker Interface Internal SR(R*) SR(S*) SR(R*) SR(D!) SR(D2) SR{R*) 1 92/100 100/100 0/100 17/100 4/100 100/100 2 96/100 100/100 5/100 2/100 2/100 100/100 3 9/100 100/100 19/100 92/100 93/100 100/100 4 38/100 100/100 28/100 68/100 68/100 100/100 5 0/100 100/100 0/100 0/100 7/100 0/100 6 100/100 100/100 96/100 8/100 14/100 100/100 7 0/100 100/100 0/100 100/100 100/100 0/100 8 87/100 100/100 24/100 3/100 1/100 100/100 9 0/100 100/100 0/100 0/100 0/100 0/100 10 0/100 100/100 0/100 43/100 0/100 0/100 BIOMD-100-II 1 91/100 100/100 2/100 5/100 1/100 100/100 2 85/100 100/100 7/100 3/100 1/100 100/100 3 49/100 100/100 6/100 70/100 44/100 100/100 4 89/100 100/100 2/100 5/100 1/100 100/100 5 88/100 100/100 38/100 0/100 2/100 100/100 6 94/100 100/100 94/100 7/100 5/100 100/100 7 74/100 100/100 46/100 2/100 0/100 100/100 8 25/100 100/100 3/100 1/100 2/100 100/100 9 47/100 100/100 0/100 0/100 0/100 100/100 10 93/100 100/100 100/100 26/100 42/100 100/100 Table 5.5: Performance results for Linker, Interface and Internal on the set of artificially generated structures BIOMD-100. 82 Chapter 5. Interaction of Two RNA Molecules BIOMD-200-I ID Linker Interface Internal SR(R*) SR(S*) SR(R*) SR(D!) SR{D2) SR(R*) 1 7/100 97/100 0/100 0/100 2/100 97/100 2 97/100 100/100 9/100 17/100 26/100 100/100 3 31/100 100/100 2/100 10/100 11/100 100/100 4 0/100 100/100 0/100 18/100 18/100 0/100 5 62/100 100/100 3/100 4/100 6/100 100/100 6 0/100 100/100 0/100 0/100 87/100 0/100 7 0/100 100/100 0/100 58/100 79/100 0/100 8 20/100 100/100 6/100 6/100 6/100 100/100 9 0/100 100/100 0/100 5/100 5/100 91/100 10 0/100 100/100 2/100 89/100 81/100 33/100 BIOMD-200-II 1 87/100 100/100 31/100 21/100 13/100 98/100 2 90/100 100/100 0/100 5/100 1/100 100/100 3 90/100 100/100 0/100 0/100 0/100 100/100 4 62/100 100/100 88/100 90/100 89/100 100/100 5 68/100 100/100 91/100 61/100 57/100 100/100 6 79/100 99/100 0/100 0/100 0/100 87/100 7 90/100 100/100 3/100 0/100 0/100 100/100 8 85/100 100/100 7/100 3/100 2/100 100/100 9 71/100 100/100 98/100 83/100 72/100 100/100 10 71/100 100/100 67/100 89/100 77/100 99/100 Table 5.6: Performance results for Linker, Interface and Internal on the set of artificially generated structures BIOMD-200. 83 Chapter 5. Interaction of Two RNA Molecules BIOMD-500-I Linker Interface Internal ID SR(R*) SR(S*) SR{R*) SR(Di) SR{D2) SR(R* 1 8/10 9/10 0/10 2/10 1/10 10/10 2 0/10 0/10 0/10 0/10 0/10 0/10 3 2/10 10/10 0/10 4/10 8/10 10/10 4 0/10 10/10 0/10 10/10 10/10 9/10 5 0/10 0/10 0/10 0/10 0/10 0/10 6 9/10 10/10 0/10 2/10 0/10 10/10 7 0/10 0/10 0/10 10/10 0/10 0/10 8 0/10 6/10 0/10 0/10 0/10 0/10 9 6/10 10/10 5/10 10/10 10/10 10/10 10 0/10 0/10 0/10 0/10 0/10 0/10 BIOMD-500-II 1 6/10 10/10 0/10 7/10 0/10 1/10 2 0/10 1/10 0/10 0/10 0/10 0/10 3 2/10 10/10 0/10 3/10 0/10 0/10 4 7/10 10/10 0/10 0/10 0/10 0/10 5 0/10 0/10 0/10 0/10 0/10 0/10 6 9/10 10/10 10/10 9/10 10/10 1/10 7 0/10 0/10 0/10 0/10 0/10 0/10 8 3/10 5/10 0/10 0/10 0/10 1/10 9 7/10 10/10 2/10 5/10 2/10 10/10 10 0/10 0/10 0/10 0/10 0/10 0/10 Table 5.7: Performance results for Linker, Interface and Internal on the set of artificially generated structures BIOMD-500. Overall we observed that Linker has the best running time among the three algorithms but Internal has the highest success rate. However, for small duplexes, the running time of Internal is in general not much longer than Linker. Notice that duplexes of BIOMD-500 are particulary difficult to design with every approach. Almost half of them are not designed by Linker and Internal and only three of them are designed by Interface. 84 Chapter 5. Interaction of Two RNA Molecules 5.3.4 Hardness of duplex design We also compared the hardness of solving the problem of R N A secondary-structure design of duplexes with the design of single structures. Figure 5.18 shows the median expected running time for designing the artificially generated duplexes B I O M D of Table 5.2 and the biologically motivated sin-gle structures B I O M from Chapter 4. For every approach we see that the duplexes are more difficult to design than single structures. In addition, from Figures 5.18 (a) and (c) there is no evidence that the time to design duplexes with Linker and Internal scale exponentially with the length of the structure. 7? 100000 10000 CD z> Q_ O 1000 100 10 1 0.1 0.01 10 Linker 3 a. O 100000 10000 1000 100 10 1 0.1 0.01 I BIOMD-I o BIOMD-II * - „B49M + :5.4*10' 8x" ? S r o * f O y/^ -10 100 structure length (a) 1000 Interface Internal BIOMD-I BIOMD-II • B . 4 - 1 0 : 6 x ^ Q_ o 100000 10000 1000 100 10 1 0.1 0.01 BIOMD-I BIOMD-II 5.4*10"° X 100 structure length (b) 1000 10 100 structure length (c) 1000 Figure 5.18: Scaling analysis of the median (Q50) of expected run-time (y-axis) of artificially generated duplex B I O M D of lengths 50, 100, 200 and 500 and artificially generated structures B I O M of lengths 50, 75, 100, 125, 150, 200 and 500 (x-axis). Duplexes are designed with Linker, Interface and Internal in Figure (a), (b) and (c) respectively. In all cases, single structures from B I O M are designed with RNA-SSD. The line corresponds to the best fit of the B I O M data (obtained in Chapter 4) for structures with lengths 50 to 150 using a polynomial of degree three. 85 Chapter 5. Interaction of Two RNA Molecules 5.4 S u m m a r y We describe three different approaches for R N A duplex design: Linker, In-terface and Internal. These are the first algorithms for designing duplexes since to the best of our knowledge, there are no algorithms for designing complexes reported in the literature. A scaling analysis on these algorithms show that there is no evidence that the running time of Linker and Internal scale exponentially with the length of the duplex. We also compared the performance between the algorithms. Interface often has the problem that the duplex R* is not designed even though the sequences X\ and X2 fold correctly into D\ and D2. We found that Internal designs more sequences than Linker and Interface that fold correctly into the target duplex R*. One explanation for this is that, different from Linker, Internal uses the correct evaluation function, PairFold, when a subsequence of the decomposition tree belongs to different strands D\ and D2. Linker has a low running time because the algorithm is terminated when the structure S* is successfully designed or when a cutoff time is reached and in general single structures S* are easy to design. However the designed sequences XjX2 often do not fold correctly into the desired duplex R*. Furthermore, the theoretical running time of PairFold is 0 (n 3 ) where n is the sum of the two sequence lengths. However in practice the running time is longer due to additional checking for the location of the intermolecular linkage in the calculation of the free energy of the duplex [3]. That is, in practice the running time of PairFold is a constant times more expensive than RNAfold that evaluates designs for single sequences. Therefore, there is a trade-off between using PairFold internally or not since in the former case is more expensive even though it designs more duplexes. 86 Chapter 6 Stability In this chapter, we describe an extension of RNA-SSD to design stable struc-tures. It is desirable to design stable structures for certain situations. For instance, the catalytic activity of novel ribozymes can be improved by stabi-lizing certain regions of the molecule [46]. In nanobiotechnology, the design of tectoRNAs consist of self assembling R N A strands that form stable shapes [34]. To the best of our knowledge, the only algorithm that explicitly designs stable structures is the adaptive walk from Dirks et al. [25]. However, this algorithm performs well only on short structures. There are several criteria for designing a structure. Dirks et al. [25] describe two paradigms for designing stable structures. The first one is a positive design that optimizes sequence affinity to the target structure. The second one is a negative design that optimizes sequence specificity to the target structure. These approaches are described in more detail in Section 2.5. A stable structure is the one that has high affinity and high specificity. Currently, there are no efficient algorithms that explicitly design stable structures. The I N F O - R N A algorithm from Busch and Backofen [16] uses positive design in the initialisation function where it searches for a sequence that adopts the given target structure S* with the lowest possible energy. A negative design is implemented in the SLS procedure of I N F O - R N A since it searches for sequences with M F E structure as the target structure 5*. Chapter 3 includes a detailed description of I N F O - R N A . For each designed sequence and for each biological sequence Busch and Backofen calculate the probability that the sequence folds into the target structure. They found that sequences designed by I N F O - R N A have a higher probability to fold into the target structure than the ones obtained with R N A - S S D and therefore are more stable. Dirks et al. [25] implement several design methods with different eval-uation criteria to investigate which one is the best in terms of achieving both high affinity and high specificity to the target structure. Each design method consists of a criterion score and an adaptive walk for optimizing that score. After each sequence search is performed, they check to see whether the sequence is quenched, in the sense that no mutation of a single base 87 Chapter 6. Stability pair or of a single unpaired base improves the design according to the design metric. If the sequence is not quenched, they run a further adaptive walk, checking every 1000 steps to see if the sequence is quenched and terminating the search when quenching is achieved. Note that quenched structures are local minima of the search space. This simple adaptive walk is useful to com-pare different design metrics. However, the design of long stable structures requires a more efficient algorithm where the running time is decreased. The RNA-SSD algorithm is based on negative design because it focuses on the specificity to the target structure. The initial sequence in R N A - S S D is determined by a tabu mechanism that prevents incorrect pairings between short sequence motifs. Furthermore, the designed sequences are evaluated with the M F E criterion but this criterion does not ensure high affinity to the target structure, as discussed in Section 2.5. This could explain why the sequences designed by I N F O - R N A are more stable than the ones obtained with R N A - S S D . In order to design explicitly stable structures we incorporate stability measures for positive and negative design in the R N A - S S D algorithm. We design for stability inside RNA-SSD at the lowest level of the algorithm where the smallest substructures of the decomposition tree are designed and we expect that the "merge" of these substructures is also stable. A n SLS algorithm is used to find a stable structure with respect to some stability criteron. The performance of the new RNA-SSD, that we called R N A - S S D -stability, is evaluated by comparing the stability of the sequences designed with this algorithm and other available approaches like the adaptive walk [25] and I N F O - R N A [16]. We found that RNA-SSD-stability performs bet-ter than I N F O - R N A which does not use any stability measure to design structures. When RNA-SSD-stability is compared with the adaptive walk, we find that our algorithm designs more stable sequences in less time, es-pecially for structures that are difficult to design. However, for very long run-times, the adaptive walk performs better than RNA-SSD-stability. 6.1 The RNA-SSD-stability algorithm We design for stability inside RNA-SSD at the lowest level of the algorithm where the smallest substructures of the decomposition tree are designed. We modified R N A - S S D to design stable substructures, Ski a n d Sfc2, from the leaves of the decomposition tree. After merging Ski and Ski, we expect to get a stable structure Sk that folds into the desired target S^. Several sequences may be designed within a given cutoff time but we report the most stable one. A n outline of the algorithm is shown in Figure 6.1. 88 Chapter 6. Stability-procedure RNA-SSD-s tab i l i ty input: target RNA secondary structure S*, length n and stability measure output: RNA sequence X repeat initialise sequence X and hierarchically decompose S* and X; for each leaf ki of the decomposition tree repeat use the SLS-fold algorithm to design a sequence Xki with M F E structure Shi', use Xki as the initial sequence of SLS-stable to obtain a stable sequence; update the most stable sequence Xk%, until nr iterations are performed or a cutoff time is reached; end for for each non-leaf k of the decomposition tree with children fci. and k2 concatenate X^i.and Xk2 to obtain Xk; obtain the M F E structure Sk of Xk with RNAfold; if Sfc is different from Sk repeat let Ski the substructure with the higher relative fraction of conflicting bases among Sk\ and Sk2', repeat use the SLS-fold algorithm to design a sequence Xki with M F E structure Ski; use Xki as the initial sequence of SLS-stable to obtain a stable sequence; update the most stable sequence Xkt; until nr iterations are performed or a cutoff time is reached; concatenate Xki and Xk2 to obtain Xk; obtain the M F E structure Sk of Xk with RNAfold; until Sk is identical to Sk or a maximal of % attempts has been performed or a cutoff time is reached end if end for update the most stable sequence X; until a cutoff time is reached return X end RNA-SSD-stabi l i ty . Figure 6.1: Pseudocode for designing stable structures with RNA-SSD-stability. 89 Chapter 6. Stability The design of a stable subsequence with minimal energy substructure Ski is performed in two parts. First, we design a subsequence that folds correctly into the target substructure with the SLS procedure from R N A - S S D described by Andronescu et al. [4]. For clarity, we will call this procedure SLS-fold. Then we use X^ as the initial subsequence of a different SLS algorithm that we designed, SLS-stable, which searches for stable sequences. Both functions, SLS-fold and SLS-stable, are repeated nr times or until a cutoff time is reached. In our experiments we set nr equal to five since this has turned out to be a reasonable value to compare the stability of several sequences without spending much time designing each sequence. In each iteration the most stable sequence X^ is stored. Then, for each internal node k of the tree with children kl and k2, sequences X^ and X^2 are concatenated to get a sequence Xk for structure Sk from a higher level of the decomposition tree. The SLS algorithm that searches for stable subsequences, or SLS-stable, is based on a randomized first-improvement strategy. This procedure iter-atively modifies single unpaired bases or base pairs of a candidate strand. The base position and the new assignment are determined uniformly at ran-dom. In each step, the neighbour subsequence is retained if it is more stable than the current one. Otherwise, it is kept with probability pa = 0.2 to overcome local optima. Furthermore, only neighbour subsequences are ac-cepted if they fold correctly into the target substructure. RNAfold is used to determined the M F E structure of the candidate solution. A n outline of the algorithm is shown in Figure 6.2. In order to terminate SLS-stable we test two approaches. The first one terminates the search if the stability of the structure is not improved after nt iterations. We tried different values of nt without achieving good per-formance of RNA-SSD-stability in terms of the stability of the structure. Another approach that we found to produce better results, terminates the search after a fixed number of rif steps. This value depends on the cutoff time that has been specified for the RNA-SSD-stability algorithm. If the cutoff time is small then we have to reduce n j to ensure that SLS-stable spends the same time designing every substructure of the decomposition tree. We set rtf to 500 since this has turned out to be the best value during our experiments where the cutoff time is at least 3600 C P U seconds. For the same reason, we reduced the number of search steps of SLS-fold from 1000 to 500. The stability measure is a parameter of our algorithm. Is possible to opti-mize the probability of the target structure in the ensemble of the designed sequence p{S*) or the average number of incorrect nucleotides n(5*) (see 90 Chapter 6. Stability procedure SLS-stable input: sequence X designed by SLS-fold output: RNA stable sequence X' repeat select a base from X uniformly at random; flip the selected base uniformly at random to obtain X'\ obtain the M F E structure of X' with RNAfold; if X' has M F E structure S and X' is more stable than X accept X'; end if if X' has M F E structure S and X' is less stable than X accept X' with probability pa; end if until n / iterations are performed return X; end SLS-stable. Figure 6.2: Pseudocode for SLS-stable. Chapter 2). We include these stability measures to achieve high specificity and affinity to the target structure. 6.2 Experiments In order to compare the performance of our algorithm with other approaches, we used some biological structures from Table 4.3 and two small artificial structures designed by Dirks et al. [25]. They generated multiloop structures with different numbers of branches. We list all the structures of our data set in Table 6.1. Each structure of our data set is designed by optimizing the probabil-ity, p(s*), of the target structure in the ensemble (see Equation 2.2). In a different experiment, the same set of structures is designed by optimiz-ing the average number of incorrect nucleotides, n(s*) (see Equation 2.5). Sequences with p(s*) close to one or n(s*) close to zero have high affinity and specificity to the target structure. In Chapter 2 we showed that this is not always possible to achieve, especially for long structures. Therefore, we discussed an alternative measure of stability introduced by Vofi et al. [64]. They work with R N A shapes which are abstract representations of 91 Chapter 6. Stability No. Description Size (bases) A - l Artificial 1-multiloop 71 A-2 Artificial 3-multiloop 122 B - l VS Ribozyme from Neurospora mitochondria 167 B-2 U3 snoRNA 5' domain from Chlamydomonas 79 reinhardtii, in vivo probing B-3 XS1 Ribozyme, Bacillus subtilus P R N A based ribozyme 314 B-4 Homo Sapiens RiboNuclease P R N A 342 B-5 S20 m R N A from E. coli 372 B-6 Group II intron ribozyme D135 from Saccharomyces 583 cerevisiae mitochondria Table 6.1: The first two structures are artificial R N A s generated by Dirks et al. [25]. Artificial 1-multiloop has one multiloop with four branches. Artificial 3-multiloop has three multiloops, each of them with three branches. The rest are biological structures that were obtained arbitrarily from the biological literature; the same structures were used by Andronescu et al. [4]-an R N A secondary structure. A sequence is stable if there is a shape that dominates the probabilities of other shapes in the ensemble. The probabil-ity of a shape is the sum of the probabilities of all structures that have this shape. Therefore, we can compare the stability of the designed sequences by computing the accumulated probabilities of all structures that share the shape that contains the target structure. The higher the probability, the more stable is the structure. There are several levels of abstractions of a shape; we work with the most abstract level that excludes unpaired bases and combines nested helices. For example, the secondary structures . . ( ( ( . ( ( . . ( ( ( . . . . ) ) ) . ( ( ( )))))))) . . and . . ( ( ( ( ( ( . . . . ) ) ) . ( ( ( ) ) ) . . ) ) ) . . have the same shape [ [ ] [ ] ]. The shape probabilities are computed with the RNAshape package from Vofi et al. [61, 62]. The RNAshapes software predicts optimal and subopti-mal abstract shapes of a given R N A molecule with a dynamic programming algorithm. RNAshapes takes as input a sequence and outputs the proba-bility of all the shapes. Vofi et al. [62] explain that the running time of 92 Chapter 6. Stability RNAshape is exponential in the number of shapes and therefore it only al-lows the calculation of the probability of shapes for structures shorter than 250 bases. For structures longer than 250 bases and shorter than 500 bases, it only calculates probabilities for shapes with the lowest free energy shreps. A shrep, or shape representative structure, is defined as the structure with the lowest free energy of a shape. Shapes with lowest free energy are often also the shapes of highest probabilities, but not necessarily so. The I N F O - R N A algorithm of Busch and Backofen [16] does not design stable structures explicitly. Therefore we generate several sequences within a given cutoff time and report the sequence with the lowest average number of incorrect nucleotides n(S*) and also the sequence with the highest prob-ability p(S*) of the target structure in the ensemble. In order to compare RNA-SSD-stability with I N F O - R N A , we fixed a cutoff time of 3 600 C P U seconds for both algorithms and performed 50 runs on each structure of Table 6.1. There are some considerations that we have to make in order to compare RNA-SSD-stability and the adaptive walk algorithm of Dirks et ai. First, the SLS-fold function from RNA-SSD assigns C G pairs to helix regions with a higher probability (0.66) than A - U pairs (0.34). In contrast, the probability of assigning C-G and A - U to a base pair is the same for the adaptive walk algorithm. Therefore, we used the probability model of Dirks et al. to assign bases. The thermodynamic parameters for computing the stability of a structure is also different in both algorithms. The adaptive walk uses the thermodynamic parameters implemented by Zuker et al. [41] whereas we use the thermodynamic parameters from the Vienna Package [29]. Therefore, we evaluate the stability of the sequences designed by the adaptive walk with the thermodynamic parameters from the Vienna package to compare the performance of both algorithms. Note that this may penalize the adaptive walk since it searches for stable structures using a slightly different thermodynamic model. We fixed the same cutoff time for RNA-SSD-stability and the adaptive walk algorithm in the following way. We run the adaptive walk one time on a given structure until a local minimum is found or for a maximum of 15 days with the stability criteria n(S*) and p(S*). If the running time is less than one hour or more than two days, the cutoff time for both algorithms is fixed to 3600 C P U seconds. Otherwise the cutoff time is fixed as the amount of time required by the adaptive walk to find a local minimum. Table 6.2 shows the running time required by the adaptive walk to find a local minimum in our dataset and the cutoff time used in every case. We performed 10 runs on structures A-2 and B - l that have long cutoff times 93 Chapter 6. Stability No. n(S*) p(S*) running time cutoff time for running time cutoff time for adaptive walk adaptive walk adaptive walk adaptive walk and RNA-SSD- and RNA-SSD-stability stability A - l 212 3 600 1397 3600 A-2 10 800 10 800 10 800 10 800 B - l 36000 36 000 14400 14400 B-2 3 601 3 600 3601 3600 B-3 621 032 3 600 656159 3 600 B-4 782 536 3 600 1226132 3 600 B-5 1038233 3 600 205135 3600 B-6 > 15 days 3600 > 15 days 3 600 Table 6.2: Cutoff times for the adaptive walk and RNA-SSD-stability. The "running time" columns represent the time spent by the adaptive walk, in C P U seconds, to find a local minimum for each structure. The adaptive walk was run only one time for each stability criteria n(S*) and p(S*). The "cutoff time" columns represents the maximum time that we allow the adaptive walk and the RNA-SSD-stability to find a solution. (10 800 and 36 000 C P U seconds, respectively) and 50 runs for the rest of the structures that have a cutoff time of 3 600 C P U seconds. A l l computational experiments were carried out on PCs with Intel(R) Xeon(TM) 3.06GHz processors (only one processor was used in our exper-iments), 1024 K B cache, and 2GB R A M running Red Hat Linux, kernel version 2.6.5-7.252-smp. 6.3 Results In this section we discuss the performance of RNA-SSD-stability compared with I N F O - R N A and the adaptive walk of Dirks et al. 6.3.1 Comparison of RNA-SSD-stability and INFO-RNA Figures 6.3 to 6.6 show the stability of the sequences designed by both algorithms when n(S*) and p(S*) are optimized. RNA-SSD-stability de-signs more stable sequences than I N F O - R N A using both stability criteria 94 Chapter 6. Stability for every structure of our data set. Tables 6.3 and 6.4 also show the advan-tage of RNA-SSD-stability over I N F O - R N A . The best sequences designed by RNA-SSD-stability are more stable than the best sequences designed by I N F O - R N A . This is because, unlike I N F O - R N A , RNA-SSD-stability explic-itly optimizes for stability. The corresponding values of n(S*) and p(S*) for the biological sequences are also included in Figures 6.4 to 6.6. Note that the biological sequences are less stable than the sequences designed by RNA-SSD-stability and INFO-R N A . Biological structures may not be very stable because R N A folds while it is being transcribed and therefore alternative stable motifs that would compete with the formation of the functional structure during co-transcriptional folding are suppressed [43]. There is also experimental ev-idence, using mutational analysis, that more stable structures may reduce biological function [19]. The correlation p „ ) P between the stability metrics n(S*) and p(S*) of the sequences generated for a given structure is indicated in Tables 6.3 and 6.4. In general, this correlation is higher for sequences with stable structures. We think that these sequences have one local minimum in the free energy landscape of the structures from the ensemble and that the structures that dominate the probability of the ensemble are not very different from each other. We also observe that there is no consistent advantage in optimizing n(S*) versus p(S*). The values of n(S*) and p(S*) that we obtained are very similar when the average number of incorrect nucleotides or the probability of the target structure in the ensemble are optimized. 95 Chapter 6. Stability 0.01 0.01 R N A - S S D I N F O - R N A 0.1 1 optimize n(S*) (c) 10 CO Q . 0.1 0.01 0.01 R N A - S S D x I N F O - R N A o 0.1 1 n(S*) (d) 10 Figure 6.3: Correlation between the average number of incorrect nucleotides v and the probability of the target in the ensemble when sequences are de-signed with RNA-SSD-stability and I N F O - R N A . Logarithmic values are used in both axes with n(S*) on the x-axis and 1 — p(S*) on the y-axis. A structure is stable if n(S*) and 1 — p(S*) are small. Panels (a) and (b) correspond to structure A - l and panels (c) and (d) to structure A-2. The stability metric n(S*) is optimized in panels (a) and (c) whereas 1 — p(S*) is optimized in panels (b) and (d). 96 Chapter 6. Stability 0.1 0.01 - o' -. I biological sequence : x - -RNA-SSD x INFO-RNA o . . . 1 1 0.1 optimize n(S*) (a) 10 C/3 Q. 0.1 0.01 0.1 biological sequence RNA-SSD INFO-RNA 1 n(S*) (b) 10 0.1 biological sequence RNA-SSD INFO-RNA 1 10 optimize n(S*) (c) 100 0.1 0.01 0.1 biological sequence RNA-SSD INFO-RNA 10 n(S*) (d) 100 Figure 6.4: Correlation between the stability metrics n(S*) and p(S*). Pan-els (a) and (b) correspond to structure B - l and panels (c) and (d) to struc-ture B-2. The values of n(S*) and 1 —p(S*) of the corresponding biological sequence are indicated in each panel. 97 Chapter 6. Stability biological sequence RNA-SSD INFO-RNA 10 optimize n(S*) (a) 100 0.1 0.01 0.1 biological sequence RNA-SSD x INFO-RNA a 1 n(S*) (b) 10 100 I F x i 3 >x biological sequence RNA-SSD x i INFO-RNA e 10 100 optimize n(S*) (c) 0.1 biological sequence RNA-SSD x INFO-RNA a 1000 10 100 n(S*) (d) 1000 Figure 6.5: Correlation between the stability metrics n(S*) and p{S*). Pan-els (a) and (b) correspond to structure B-3 and panels (c) and (d) to struc-ture B-4. The values of n(S*) and 1 -p(S*) of the corresponding biological sequence are indicated in each panel. 98 Chapter 6. Stability 10 : >a«:x xo biological sequence RNA-SSD INFO-RNA 100 optimize n(S*) (a) 0.1 1000 10 biological sequence RNA-SSD x INFO-RNA e 100 n(S*) (b) 1000 biological sequence RNA-SSD x INFO-RNA e CL 10 100 optimize n(S*) (c) 0.1 1000 biological sequence RNA-SSD INFO-RNA 10 100 1000 n(S*) (d) Figure 6.6: Correlation between the stability metrics n(S*) and p(S*). Pan-els (a) and (b) correspond to structure B-5 and panels (c) and (d) to struc-ture B-6. The values of n(5*) and 1 -p(S*) of the corresponding biological sequence are indicated in each panel. 99 Chapter 6. Stability RNA-SSD-stability No. n(S*) P(S*) Pn,p best median best median Cy A - l 0.055 0.116 0.258 0.974 0.947 0.014 -0.899 A-2 0.103 0.211 0.273 0.954 0.910 0.024 -0.793 B - l 0.306 0.626 0.724 0.876 0.8112 0.056 -0.366 B-2 0.322 0.351 0.046 0.846 0.838 0.006 -0.912 B-3 3.188 11.032 0.819 0.371 0.129 0.856 -0.536 B-4 2.223 4.920 0.880 0.466 0.252 0.484 -0.673 B-5 15.306 26.858 0.492 0.129 0.005 4.060 -0.299 B-6 3.421 5.650 0.839 0.260 0.138 0.357 -0.381 INFO-RNA No. n(S*) p(S*) Pn,p best median Cy best median Cy A - l 0.266 0.605 0.375 0.943 0.896 0.083 -0.662 A-2 0.437 0.797 0.166 0.896 0.846 0.020 -0.714 B - l 0.708 1.038 0.093 0.850 0.820 0.030 -0.486 B-2 0.880 0.906 0.011 0.610 0.606 0.003 -0.573 B-3 12.48 19.66 0.161 4.738-10" -02 2 . 2 3 3 - 1 0 - 0 3 1.443 -0.446 B-4 12.31 18.46 0.166 2.489-10" -02 1 .467-10- 0 3 1.544 -0.605 B-5 20.55 31.27 0.192 1.239-10" -03 8 . 1 0 4 - H r 0 5 1.317 -0.147 B-6 14.66 20.82 0.112 3.059-10" -02 2 . 4 2 9 - 1 0 - 0 3 1.430 -0.389 Table 6.3: Statistics for the stability of the sequences designed by R N A -SSD-stability and I N F O - R N A when n(S*) is optimized. 100 Chapter 6. Stability RNA-SSD-stability No. n(S') p(S*) Pn,p best median Cy best median Cy A - l 0.061 0.114 0.324 0.972 0.952 0.013 -0.818 A-2 0.102 0.207 0.363 0.955 0.913 0.019 -0.741 B - l 3.611 11.83 0.802 0.447 0.183 0.601 -0.64 B-2 0.323 0.365 0.070 0.845 0.833 0.008 -0.889 B-3 3.611 10.839 0.802 0.447 0.188 0.601 -0.641 B-4 2.146 6.260 1.047 0.470 0.291 0.644 -0.793 B-5 17.646 43.212 0.483 0.050 0.002 3.056 -0.358 B-6 2.933 6.5311 0.628 0.340 0.185 0.404 -0.698 INFO-RNA No. n(S') Pn>p best median Cy best median Cy A - l 0.266 0.753 0.508 0.943 0.905 0.020 -0.776 A-2 0.437 1.084 0.649 0.896 0.860 0.016 -0.361 B - l 0.932 1.139 0.158 0.857 0.834 0.011 -0.609 B-2 0.880 0.916 0.017 0.610 0.607 0.001 -0.482 B-3 12.48 31.18 0.458 0.090 0.020 0.636 -0.127 B-4 12.51 24.84 0.428 0.058 0.005 1.277 -0.282 B-5 20.55 53.55 0.260 3.428-10-03 5.856-10-04 0.960 -0.280 B-6 14.66 28.41 0.384 0.052 0.010 0.675 -0.293 Table 6.4: Statistics for the stability of the sequences designed by R N A -SSD-stability and I N F O - R N A when p(S*) is optimized. 101 Chapter 6. Stability We also test the performance of RNA-SSD-stability and I N F O - R N A us-ing the shape probability measure of VoB et al. [64]. The solution quality distribution for both algorithms is shown in Figures 6.7 and 6.8. RNA-SSD-stability has a better performance than I N F O - R N A since it designs more sequences with higher shape probability than I N F O - R N A , where the shape contains the target structure. The probability of a shrep was calculated for the best and the median sequence of B-3, B-4 and B-5 (see Table 6.5). Structures B-3, B-4 and B-5 have more than 250 bases and therefore RNAshape only calculates proba-bilities for shapes with the lowest energy shreps. Shapes with lowest free energy are often also the shapes of highest probabilities, but not necessar-ily so. Therefore, we verified that the target structure corresponds to the shrep with lowest free energy. Note that these probabilities are higher for sequences generated by RNA-SSD-stability than for sequences generated by I N F O - R N A . 102 Chapter 6. Stability 0.8 h 0.6 0.4 0.2 RNA-SSD INFO-RNA 0.93 0.94 0.95 0.96 0.97 0.98 0.99 probability (a) 0.8 h 0.6 0.4 2 0.2 RNA-SSD INFO-RNA 0.93 0.94 0.95 0.96 0.97 0.98 0.99 probability (b) 0.8 0.6 0.4 £ 0.2 RNA-SSD INFO-RNA 0.94 0.95 0.96 0.97 0.98 0.99 probability (c) 0.8 0.6 F 0.4 h 0.2 RNA-SSD INFO-RNA 0.94 0.95 0.96 0.97 0.98 0.99 1 probability (d) Figure 6.7: Solution quality distribution of RNA-SSD-stability and INFO-R N A with respect to the probability of shapes. Panels (a) and (b) corre-spond to structure A - l and panels (c) and (d) to structure A-2. The stability metric n(S*) is optimized in panels (a) and (c) whereas p(S*) is optimized in panels (b) and (d). 103 Chapter 6. Stability 0.8 0.6 0.4 0.2 0.98 0.984 0.988 0.992 0.996 probability (a) 1 0.8 0.6 0.4 0.2 RNA-SSD INFO-RNA 0.996 0.998 probability (c) 1 0.8 0.6 0.4 0.2 RNA-SSD INFO-RNA 0.98 0.984 0.988 0.992 0.996 probability (b) 0.8 0.6 0.4 0.2 RNA-SSD INFO-RNA 0.996 0.998 probability (d) Figure 6.8: Solution quality distribution of RNA-SSD-stability and INFO-R N A with respect to the probability shapes. Panels (a) and (b) correspond to structure B - l and panels (c) and (d) to structure B-2. The stability metric n(S*) is optimized in panels (a) and (c) whereas p(S*) is optimized in panels (b) and (d). 104 Chapter 6. Stability optimize n(S*) Structure RNA-SSD-stability INFO-RNA best median best median B-3 B-4 B-5 0.955 0.994 0.204 0.929 0.854 0.587 0.865 * 0.103 0.149 0.760 0.062 optimize p(S*) B-3 B-4 B-5 0.908 0.514 0.692 0.572 0.684 0.311 0.835 0.482 0.120 0.257 0.107 0.054 Table 6.5: Probability of the shrep with the lowest free energy for the best and the median sequences designed by RNA-SSD-stability and I N F O - R N A . Sequences from the upper and lower Table were obtained by optimizing n(S*) and p(S*) respectively. We do not report any value for (*) because the target structure was not found in the shapes generated by RNAshape. For this structure we generate up to 10 shapes with free energy 10% above the M F E . This suggests that the shape probability of this sequence is low. 6.3.2 Comparison of RNA-SSD-stability and adaptive walk Table 6.6 shows the performance of RNA-SSD-stability and the adaptive walk. Both algorithms have a success rate of one for structures A - l , A-2, B - l and B-2. RNA-SSD-stability was also able to find a solution in every run for the biological structure B-6. Note that the adaptive walk did not find a sequence that folds correctly in 50 runs for structures B-3, B-4, B-5 and B-6 when using a cutoff time of 3 600 C P U seconds. From these results, we concluded that the shorter structures A - l , A-2, B - l and B-2 are easier to design than the longer biological structures B-3 to B-6. We also compare the stability of the sequences designed by R N A - S S D -stability and adaptive walk. Figures 6.9-6.14 show the quality of the se-quences designed by both algorithms with respect to the metrics n(S*) and p(S*). For the easiest structures A - l , A-2, B - l and B-2, the adaptive walk designs some sequences that are more stable than those obtained with R N A -SSD-stability. However, in Tables 6.7 and 6.8 we can see that the stability of the best sequences designed by RNA-SSD-stability is similar to the best sequences designed by the adaptive walk. In addition, the coefficient of variation of n(S*) and p(S*) is smaller for the RNA-SSD-stability results 105 Chapter 6. Stability No. "(5*) success rate success rate cutoff RNA-SSD- adaptive cutoff RNA-SSD- adaptive time stability walk time stability walk A - l 3 600 50/50 50/50 3600 50/50 50/50 A-2 10800 10/10 10/10 10 800 10/10 10/10 B - l 36000 10/10 10/10 14400 10/10 10/10 B-2 3 600 50/50 50/50 3600 50/50 50/50 B-3 3 600 47/50 0/50 3600 48/50 0/50 B-4 3 600 46/50 0/50 3 600 39/50 0/50 B-5 3 600 10/50 0/50 3600 12/50 0/50 B-6 3600 50/50 0/50 3600 50/50 0/50 Table 6.6: Performance results for RNA-SSD-stability and adaptive walk on artificial and biological R N A structures. The stability criteria to design the R N A structures are n(5*) andp(5*). The "cutoff time" columns indicate the running time in C P U seconds for both algorithms on our reference machine. The success rates show the fraction of runs for which the respective algorithm found a sequence that fold correctly into the target structure. than for those obtained by the adaptive walk. For example, when n(S*) is optimized to design B-2, the most stable sequence has an average number of incorrect nucleotides of 0.322 and 0.312 when we use RNA-SSD-stability and the adaptive walk respectively. The coefficient of variation within fifty runs is 0.046 for RNA-SSD-stability and 0.586 for the adaptive walk. Moreover, in Figures 6.9 to 6.11 we can see that the adaptive walk also designs less stable structures than RNA-SSD-stability. For the most difficult structures B-3 to B-6 in our data set, RNA-SSD-stability designs more stable struc-tures than the adaptive walk within the given cutoff time. In general, there is a big difference between the best sequences designed by both algorithms, see Figures 6.12 to 6.14 and Tables 6.7 and 6.8. The sequences designed for B - l to B-6 by RNA-SSD-stability are more stable than the biological sequences. The stability of the biological sequences is included in Figures 6.10 to 6.14 and in Table 6.9. Notice that both algorithms often design more stable sequences than the biological ones. 106 Chapter 6. Stability Q. 0.01 0.01 0.01 0.1 t 0.01 Figure 6.9: Correlation between the stability metrics n(S*) and p(S*) of the sequences designed for the artificial structure A - l . In panels (a) and (b) the sequences are designed by the adaptive walk and RNA-SSD-stability re-spectively using n(S*) as the stability measure. Panels (c) and (d) show the quality of the sequences designed by adaptive walk and RNA-SSD-stability respectively when 1 — p(S*) is optimized. 107 Chapter 6. Stability 0.01 0.01 o 5 O adaptive walk RNA-SSD 0.1 1 optimize n(S*) (a) CO Q. 10 0.1 0.01 0.01 <9 adaptive walk x RNA-SSD o 0.1 1 n(S*) (b) 10 0.1 0.01 0.1 biological sequence adaptive walk RNA-SSD biological sequence 1 optimize n(S*) (c) § 0.1 E 10 0.01 biological sequence adaptive walk x RNA-SSD o biological sequence a 0.1 1 n(S*) 10 (d) Figure 6.10: Correlation between the stability metrics n(S*) and p(S*) of the sequences designed for the artificial structure A-2 (panels (a) and (b)) and the biological structure B - l (panels (c) and (d)) using the adaptive walk and the RNA-SSD-stability algorithm, respectively. The values of n(S*) and p(S*) for the biological sequence corresponding to B - l are also included. 108 Chapter 6. Stability 0.1 0.01 0.1 adaptive walk biological sequence 1 10 optimize n(S*) (a) 100 0.1 0.01 0.1 RNA-SSD o biological sequence a 1 10 optimize n(S*) (b) 100 0.1 x adaptive walk biological sequence 10 n(S*) (c) 100 0.1 0.01 0.1 RNA-SSD o biological sequence * 10 n(S*; (d) 100 Figure 6.11: Correlation between the stability metrics n(S*) and p(S*) of the sequences designed for the biological structure B-2. The values of n(S*) and p(S*) for the biological sequence corresponding to B-2 are also included in each panel. 109 Chapter 6. Stability biological sequence RNA-SSD adaptive walk 10 100 optimize n(S") (a) 0.1 1000 3<y ° biological sc-i l i l equence RNA-SSD adaptive walk 10 100 n(S*; (b) 1000 biological sequence RNA-SSD adaptive walk 10 100 optimize n(S*) (c) E Q. 1000 0.1 T " T " ^ y t 0 3 "oo biological sequence RNA-SSD adaptive walk 10 100 n(S*) (d) 1000 Figure 6.12: Correlation between the stability metrics n(S*) and p(S*) of the sequences designed for B-3 (panels (a) and (b)) and B-4 (panels (c) and (d)). The stability of the corresponding biological sequence is indicated by an arrow. 110 Chapter 6. Stability 10 biological sequence adaptive walk CO Q . 100 optimize n(S") (a) 1000 0.1 • I 1 ' 1 ' I yMSXBO GEHDO O A — 0 t biological sequence RNA-SSD o 1 10 100 optimize n(S*) (b) 1000 10 biological sequence adaptive walk 100 n(S*) (c) CO o. 1000 0.1 10 biological sequence RNA-SSD 100 n(S*) (d) 1000 Figure 6.13: Correlation between the stability metrics n(S*) and p(S*) of the sequences designed for the biological structure B-5. The stability of the corresponding biological sequence is indicated by an arrow. I l l Chapter 6. Stability-OA biological sequence RNA-SSD adaptive walk 10 100 optimize n(S*) (a) 1000 0.1 biological sequence RNA-SSD o adaptive walk x 10 100 n(S*) (b) 1000 Figure 6.14: Correlation between the stability metrics n(S*) and p(S*) of the sequences designed for B-6. The stability of the corresponding biological sequence is indicated by an arrow. 112 Chapter 6. Stability RNA-SSD-stability No. n(5*) Pn,p best median Cy best median Cy A - l 0.055 0.116 0.258 0.974 0.947 0.014 -0.899 A-2 0.107 0.167 0.341 0.947 0.927 0.016 -0.671 B - l 0.366 0.529 0.377 0.878 0.860 0.025 -0.340 B-2 0.322 0.351 0.046 0.846 0.838 0.006 -0.912 B-3 3.188 11.032 0.819 0.371 0.129 0.856 -0.536 B-4 2.223 4.920 0.880 0.466 0.252 0.484 -0.673 B-5 15.306 26.858 0.492 0.129 0.005 4.060 -0.299 B-6 3.421 5.650 0.839 0.260 0.138 0.357 -0.381 Adaptive Walk No. n(S*) p(S*) Pn,p best median Cy best median Cy A - l 0.046 0.059 3.159 0.978 0.972 0.218 -0.969 A-2 0.061 0.073 1.283 0.971 0.970 1.283 -0.980 B - l 0.171 1.392 0.650 0.940 0.431 0.510 -0.975 B-2 0.312 0.343 0.586 0.850 0.837 0.132 -0.996 B-3 45.39 81.31 0.215 2.502-10- 1 6 0.0 7.071 -0.177 B-4 68.11 103.00 0.174 i - i o - 2 0 0.0 7.071 -0.291 B-5 81.97 119.70 0.131 0.0 0.0 — — B-6 95.28 151.80 0.213 0.0 0.0 — — Table 6.7: Statistics for the stability of the sequences designed by R N A - S S D -stability and adaptive walk when n(S*) is optimized. The best, median and coefficient of variation, cv, are computed for the values of n(S*) and p(S*) of the sequences designed for a given structure. The correlation, pn%v, between n(S*) and p(S*) of all the runs of a given structure is also included. 113 Chapter 6. Stability RNA-SSD-stability No. n(S*) p(S*) Pn,p best median best median Cy A - l 0.061 0.114 0.324 0.972 0.952 0.013 -0.818 A-2 0.118 0.187 0.190 0.945 0.930 0.013 -0.874 B - l 0.328 0.534 0.347 0.909 0.861 0.033 -0.635 B-2 0.323 0.365 0.070 0.845 0.833 0.008 -0.889 B-3 3.611 10.839 0.802 0.447 0.188 0.601 -0.641 B-4 2.146 6.260 1.047 0.470 0.291 0.644 -0.793 B-5 17.646 43.212 0.483 0.050 0.002 3.056 -0.358 B-6 2.933 6.5311 0.628 0.340 0.185 0.404 -0.698 Adaptive Walk No. n(S*) P(S*) Pn,p best median Cy best median Cy A - l 0.046 0.050 1.265 0.978 0.976 0.042 -0.996 A-2 0.059 1.002 1.021 0.972 0.490 0.789 -0.965 B - l 0.158 0.300 0.995 0.915 0.893 0.228 -0.948 B-2 0.312 0.353 0.709 0.850 0.837 0.175 -0.973 B-3 53.27 103.3 0.208 7.653-10" -15 0.0 6.392 -0.209 B-4 66.32 101.4 0.182 9.965-10" -15 0.0 6.094 -0.315 B-5 73.32 114.0 0.197 1.36-10-18 0.0 6.318 -0.164 B-6 91.39 142.5 0.200 0.0 0.0 — — Table 6.8: Statistics for the stability of the sequences designed by R N A -SSD-stability and adaptive walk when p(S*) is optimized. biological sequences No. n(S*) P(S*) B - l 8.995 0.06317 B-2 10.572 0.00645 B-3 95.032 5.8 -IO- 1 9 B-4 153.477 1.4-10-33 B-5 152.951 3.2-10-24 B-6 158.400 9.05 -IO- 1 5 Table 6.9: Stability of biological sequences with respect to the metrics n(S*) and p(S*). 114 Chapter 6. Stability From Table 6.7 and 6.8, we observe that there is a positive correlation between the stability measures n(S*) and p(S*), see column pntP. Sequences with small number of incorrect nucleotides in the ensemble also have a high probability of observing the target structure in the ensemble. This corre-lation is stronger for the easiest structures A - l , A-2, B - l and B-2 where better values of n(S*) and p(S*) are obtained in the designed sequences. On the other hand, we did not find evidence in our data set that there is an advantage in optimizing n(S*) versus p{S*). The values of n(5*) or p(S*) obtained with both stability criteria are very similar. This suggests that there is only one local minimum in the free energy landscape of the struc-tures from the ensemble of designed sequences and that the structures that dominate the probability of the ensemble are not very different from each other. In Figures 6.15 and 6.16 we compare the shape probabilities of the se-quences designed by RNA-SSD-stability and by the adaptive walk. Struc-tures B-3 to B-6 are excluded from the analysis since the adaptive walk did not find a solution within a cutoff time of 3600 C P U seconds. The shape probabilities of the sequences designed by the adaptive walk are higher than the ones obtained with RNA-SSD-stability for the artificial structures and for the biological structure B - l . Maybe the adaptive walk finds shapes that have higher probability because it explicitly searches for sequences where no mutation of a single base pair or of a single unpaired base improves the design according to the design metric n(S*) or p(S*). Even though the SLS-stable algorithm from RNA-SSD-stability searches for a local minimum, it does not check if no mutations of a given sequence improve the stability of the structure. However, the difference in the quality of the solution is small between both algorithms. We did not find evidence that the shape probabil-ities of the sequences designed by n(S*) are higher than the ones design by optimizing p(S*). For the artificial structures and the biological structure B-2 we obtained higher shape probabilities when n(S*) is optimized, but this is not the case for B - l . 115 Chapter 6. Stability 1 0.8 0.6 0.4 0.2 RNA-SSD ' adaptive walk _ l I L _ 0.996 0.997 0.998 0.999 1 probability (a) 1 0.8 0.6 0.4 0.2 RNA-SSD adaptive walk 0.998 0.999 probability (c) 1 0.8 0.6 0.4 0.2 RNA-SSD adaptive walk 0.996 0.997 0.998 0.999 probability (b) 1 0.8 0.6 0.4 0.2 RNA-SSD adaptive walk 0.998 0.999 probability (d) Figure 6.15: Solution quality distribution for RNA-SSD-stability and adap-tive walk. The x-axis shows the probability of the shape that contains the target structure for every sequence designed by RNA-SSD-stability and the adaptive walk. The y-axis gives the fraction of structures whose probabil-ity shape is at most the x-value. Panels (a) and (b) show the stability of the sequences designed for structure A - l . Panels (c) and (d) correspond to structure A-2. The stability metric n(S*) is optimized in panels (a) and (c) whereas p(S*) is optimized in panels (b) and (d). 116 Chapter 6. Stability 0.98 0.984 0.988 0.992 0.996 probability (c) 0.98 0.984 0.988 0.992 0.996 probability (d) Figure 6.16: Solution quality distribution of RNA-SSD-stability and adap-tive walk with respect to the probability shapes. Panels (a) and (b) corre-spond to structure B - l and panels (c) and (d) to structure B-2. The stability metric n(S*) is optimized in panels (a) and (c) whereas p(S*) is optimized in panels (b) and (d). 117 Chapter 6. Stability In a different experiment, we performed one run in each algorithm using the cutoff time required by the adaptive walk to find a local minimum for the biological structures B-3, B-4, B-5 and B-6. Table 6.10 shows the cutoff time for each structure and the quality of the designed sequence by both al-gorithms when n(S*) and p(S*) are optimized. Structure B-6 is not included in Table 6.10 because the adaptive walk requires more than 15 days to find a local minimum for this structure. We observed that sequences designed by the adaptive walk are more stable than those designed by RNA-SSD-stability for these long run-times. We think that this is because RNA-SSD-stability has several parameters that have to be optimized in order to get a good per-formance, in particular the noise introduced in the stochastic local search of SLS-stable. Note that the sequence designed by RNA-SSD-stability for B-5 has a very high value of n(S*) and a very low value of p(S*) when the probability of the target structure in the ensemble is optimized. The sta-bility of the structures designed by RNA-SSD-stability can be improved if we modify some parameters. For instance, Table 6.11 shows that when the noise of the SLS-stable algorithm (pa) is decreased from 0.2 to 0.1 or 0.0 we obtained more stable sequences. On the other hand, we increased the number of sequences nr generated at the leaves of the decomposition tree (see Section 6.1) without obtaining good results. Better results are obtained when pa = 0.1 or when pa = 0.1 and nr = 15. 118 Chapter 6. Stability-No. optimize n(S*) RNA-SSD-stability adaptive walk running time n(5*) p(S*) n(S*) p(S*) B-3 B-4 B-5 621032 782 536 1038233 3.243 5.570 7.642 0.371 0.519 0.129 0.284 0.501 0.225 0.891 0.743 0.916 No. optimize p(S*) RNA-SSD-stability adaptive walk running time n(S') p(S*) n(S") P(S*) B-3 B-4 B-5 656159 1226132 205135 3.203 1.426 10.671 0.522 0.542 0.240 0.258 1.237 0.354 0.903 0.551 0.998 Table 6.10: Stability of sequences designed by RNA-SSD-stability and adap-tive walk on biological R N A structures when n(S*) is optimized (upper Ta-ble) and when p(S*) is optimized (lower Table). We run the adaptive walk one time until it finds a local minima. The running time in C P U seconds for the respective structure is indicated in the second column. In all cases, a valid sequence was found. 119 Chapter 6. Stability Pa nr success n(S*) P(S*) rate (1/0) 0.1 5 1 1.205 0.781 0.0 5 1 3.082 0.655 0.2 10 1 18.598 0.230 0.2 15 1 7.1517 0.191 0.1 15 1 1.280 0.736 Table 6.11: Stability of the sequence designed by RNA-SSD-stability for the biological structure B-5. We performed one run of RNA-SSD-stability, where p(S*) is optimized, with a cutoff time of 205135 C P U seconds and different values of pa and nr. The default values of pa and nr are 0.2 and 5 respectively. The success rate is one if the sequence fold correctly. The metrics n(S*) and p(S*) are used to evaluate the stability of the sequences. 6.4 S u m m a r y We extended RNA-SSD to design stable structures since to the best of our knowledge there are no algorithms reported in the literature that design sta-ble structures explicitly. Our extension of RNA-SSD, RNA-SSD-stability, designs for stability inside RNA-SSD at the lowest level of the algorithm where the smallest substructures of the decomposition tree are designed. A n SLS algorithm is used to find a stable substructure with respect to some stability criteron like n(S*) or p{S*). Stable structures are obtained by merg-ing these substructures. By comparing RNA-SSD-stability with I N F O - R N A and the adaptive walk of Dirks et al. we found that our algorithm has a clear advantage over I N F O - R N A since it always designs more stable structures using the stability measures n(S*), p(S*) and the shape probability. This advantage is because I N F O - R N A does not design stable structures explic-itly. In contrast, the adaptive walk sometimes designs more stable structures than RNA-SSD-stability with the criteria n(S*), p(S*) and always designs more stable structures with respect to the probability shape measure. How-ever, difficult structures are better designed by RNA-SSD-stability with the criteria n(S*) and p(S*) when the cutoff time of the algorithms is small. For longer run-times, the adaptive walk designs more stable structures than RNA-SSD-stability. This suggests that some parameters of our algorithm can be optimized to improve its performance. We also believe that the per-120 Chapter 6. Stability formance of RNA-SSD-stability can be improved by optimizing the stability of the sequences designed at every level of the decomposition tree, possibly by keeping several candidate subsequences to concatenate, and selecting the most stable. 121 Chapter 7 C o n c l u s i o n s a n d F u t u r e W o r k In the past R N A was generally thought to be a passive genetic blueprint. However, the past few decades of intensive research have revealed that there is a vast number of R N A s that do not code for proteins. R N A differs little from D N A in chemical terms, but by contrast exhibits a remarkable confor-mation flexibility and functional versatility. For these reasons R N A s repre-sent a fruitful medium for the design of new medical products and industrial devices at the nano-scale. In this thesis we improved the performance and scope of an existing algorithm called the R N A Secondary Structure Designer (RNA-SSD) [4], one of the state of the art algorithmic methods for R N A secondary structure design. Our work makes several contributions to the rational design of R N A secondary structures. First, we performed a scaling analysis, using an im-proved version of RNA-SSD, and the RNAinverse algorithm from the Vienna package [29], to investigate the empirical complexity of the R N A secondary structure design problem, that is, the scaling of the typical difficulty of the design task for various classes of R N A structures as the size of the target structure is increased. Our empirical analysis helps us to better understand the strengths and limitations of the RNA-SSD and the RNAinverse algo-rithms. For this study we use a large set of structures (5000 in total) of different lengths. One part of this set was generated randomly, and the other part was generated with structural and statistical properties (such as loop size, number of multiloops, etc.) based on different classes of biological RNAs . We investigated the hardness of the design of these structures and found that the algorithm scales polynomially with the size of the structure. When R N A - S S D is compared with RNAinverse, the former algorithm per-forms substantially better, both in terms of speed as well as with respect to the range of structures that can be designed. We compare both algo-rithms on random structures and find that the median running time scales proportional to a polynomial of degree three for R N A - S S D and of degree five for RNAinverse approximately, where n is the size of the structure. The 122 Chapter 7. Conclusions and Future Work structures not design by RNA-SSD are also not design by RNAinverse; fur-thermore, we believe that most of these structures are impossible to design using the current thermodynamic model for R N A secondary structure. It was also possible to identify some structural motifs that make the R N A design task harder. In particular, short stems separated by loops are difficult to design. Short stems are not stable enough to compensate for the penalties associated with adjacent loops, and therefore, energetically more favourable motifs are preferred. Some of these motifs are not allowed by the thermodynamic model [41], yet they are found in biological structures. For other motifs, which have short stems separated by loops that are allowed by the thermodynamic model, it is possible to improve the performance of RNA-SSD by modifying the structural decomposition approach in such a way that at the split points, the boundary conditions from the original struc-ture are replicated. Intuitively, this leads to an increased probability that when merging the respective subsequences, the correct secondary structure is obtained. We also added primary structure constraints to the R N A - S S D algorithm since they comprise an important aspect in the design of R N A s because they permit interaction with other molecules. Experiments on biologically motivated structures show that the problem scales polynomially with the size of the structure. In general there is an advantage in the design if we impose primary base constraints in stems. When we try to determine if the structure design is easier as we increase the number of fixed positions, we find that this is not always the case. The design of some structures gets harder when approximately 50% of the bases are constrained. This suggests a reduction in the search space size that depends on the properties of the structure. Very recently, Busch and Backofen [16] have introduced a new SLS algo-rithm for the R N A secondary structure design problem, dubbed I N F O - R N A (INnverse FOlding of R N A ) . Different from RNA-SSD, I N F O - R N A uses dy-namic programming to determine an initial sequence that adopts the given target structure T with the lowest possible energy. Then, it uses an improved SLS procedure that performs search steps based on a look-ahead mechanism for determining energetically favorable sequences in combination with the structural decomposition approach of RNAinverse in order to find a sequence with M F E structure T. In most cases, I N F O - R N A performs better than the improved version of RNA-SSD described in this work, and we therefore ex-pect that its empirical median expected run time also shows polynomial scaling with input size (possibly with better constants than RNA-SSD) . However, compared with RNAinverse and RNA-SSD, R N A - I N F O is more 123 Chapter 7. Conclusions and Future Work biased towards sequences that form low-energy structures and can hence be expected to find more restricted ensembles of solutions to any given R N A secondary structure design problem. We conjecture that by combining fea-tures of R N A - S S D and R N A - I N F O , in particular RNA-SSD's less biased initialisation and balanced hierarchical decomposition approach with R N A -INFO's more efficient SLS procedure, further performance improvements could be achieved. Furthermore, R N A - I N F O currently does not support pri-mary structure constraints, and it would be interesting (and not too hard) to incorporate these into a future version. Another contribution of this work is the extension of R N A - S S D to design duplexes that form intermolecular helices. To the best of our knowledge, our extended versions of RNA-SSD are the first algorithms for design of complexes of R N A molecules. This problem has some challenges. First, the number of candidate solutions grows exponentially with the length of the du-plex. Another problem is that components of the designed duplex can form undesirable interactions. Furthermore, the evaluation of the designed du-plexes can be expensive and most algorithms, like PairFold from Andronescu et al. [5] and the extended partition function from Dirks et al. [26], that predict the duplex of two R N A sequences do not consider pseudoknots. We describe three different algorithms: Linker, Interface and Internal. In each case, the candidate solutions are evaluated with the PairFold algorithm that predicts the secondary structure of a pair of sequences. A scaling analysis on these algorithms show that there is no evidence that the running time of Linker and Internal scale exponentially with the length of the duplex. B y comparing the performance of each algorithm we find that Interface has the lowest running time and designs less duplexes in our data set than the other two algorithms. One explanation for this is that Internal often has the problem that the duplex R* is not designed even though the sequences X\ and X2 fold correctly into the individual structures D\ and D2 of each molecule. That is, the designed sequences form undesirable interactions. We also found that Linker has the best running time but Internal has the highest success rate. A n explanation for this is that the Internal algorithm designs duplexes inside R N A - S S D and evaluates the substructures with the correct function: RNAfold for subsequences that belong to a single molecule and PairFold for subsequences that belong to both molecules. However, Internal has a higher running time for designing duplexes than Linker since it performs more evaluations of the candidate solutions with PairFold. The PairFold algorithm has a theoretical running time of 0 (n 3 ) where n is the sum of the two sequence lengths. This is the same complexity of the RNAfold function from the Vienna Package 124 Chapter 7. Conclusions and Future Work that predicts the structure of a single sequence. However, Andronescu [3] argues that, in practice, the time of PairFold is longer than RNAfold due to additional checking for the location of the intermolecular linkage in the calculation of the free energy of the duplex. A n insight gained from our comparison is that, overall, better designs are obtained when PairFold is used to evaluate designs of substructures internally in the algorithm, despite the higher overhead of using PairFold. To the best of our knowledge there are no algorithms reported in the lit-erature that design stable structures explicitly. Therefore we also extend the R N A - S S D to design stable structures by minimizing the average number of incorrect nucleotides, n(S*), or by maximizing the probability of the target structure in the ensemble, p{S*). By using either of these stability met-rics, we can design sequences that fold with high specificity and affinity into the target structure. The stability of the sequences designed by R N A - S S D are compared with those designed by I N F O - R N A of Busch and Backofen [16] and the adaptive walk of Dirks et al. [25] using three stability criteria: n(S*), p(S*) and the shape probability where the shape contains the target structure. We find that RNA-SSD performs better than I N F O - R N A , which does not use any stability measure to design structures. When R N A - S S D is compared with the adaptive walk, which explicitly designs stable structures using different stability measures, we find that our algorithm designs more stable sequences than the adaptive walk in less time, especially for structures that are difficult to design. We also find that the variability of the stability in the sequences designed by RNA-SSD is less than the ones obtained with the adaptive walk. However, for long run-times, the adaptive walk performs better than R N A - S S D . We think that this is because R N A - S S D has several parameters that have to be optimized in order to get a good performance, in particular the noise introduced in the stochastic local search, SLS-stable, that searches for stable sequences. Our work can be continued in several directions. First, the structural decomposition approach of the RNA-SSD algorithm could be improved by modifying the algorithm in such a way that the fraction of constrained bases in each substructure is balanced. Another improvement that has been al-ready proposed by Andronescu et al. [4] is to consider split points at motifs other than multiloops; it may be noted that such a modification could easily be extended with primary base constraints. Some parameters of RNA-SSD can be optimized to improve its perfor-mance. One of them is the cutoff time that affects the design of structures in the following way. The hierarchical decomposition procedure of R N A - S S D divides the given target structure S into small structure components which 125 Chapter 7. Conclusions and Future Work are designed independently by an SLS algorithm. For long structures, the SLS algorithm may spend a lot of time designing a given substructure of a leaf. If the cutoff time is not long enough then the SLS algorithm will not be able to spend the same amount of time designing the substructures from the rest of the leaves of the decomposition tree. This suggests that it is con-venient to increase the cutoff time for long structures and balance the time spend by SLS searching for substructures at each leaf of the decomposition tree. The procedures described in this work for designing R N A duplexes can be easily generalized to design for the interaction of more than two molecules. They also can be easily adapted to design complexes with pseudoknot inter-acting structures. The PairFold function will be replaced by another func-tion that allows the prediction of secondary structures with the smallest M F E in which two or more given sequences can fold and where pseudoknot interactions are allowed. The methods of Dirks et al. [26] or Alkan et al. [1] can be used in this case to evaluate the designed sequences. The performance of the Interface algorithm for designing duplexes can be improved based on the observation that sometimes it does not design sequences X\ and X^ that fold into S\ and S2 respectively but they form the desire duplex D. To avoid this situation and decrease the run-time of Interface we can avoid undesired interactions with the bases of the interface and other bases when designing the individual sequences X\ and X2 with RNA-SSD. This can be achieved by adding a tabu mechanism to the R N A -SSD algorithm that forbids the assignation of bases if more than a certain number of consecutive bases pair with the bases of the interface. There are some directions in which we can improve the stability of the sequences designed by RNA-SSD. For instance, we have some evidence that the parameters of the SLS algorithm that designs stable structures (SLS-stable) can be optimized. In particular, it is convenient to have some noise in SLS-stable, although it should be low. Currently, the SLS-stable procedure terminates the search after a fixed number of steps. This parameter can also be optimized to balance the time spent by SLS-stable to search for stable substructures at each leaf of the decomposition tree. Another direction in which RNA-SSD can be improved is the following. Currently, our algorithm designs stable subsequences only at the leaves of the decomposition tree. We expect to get stable sequences when merging the subsequences but this may not be always the case. Therefore, we can optimize the stability of the sequences designed at every level of the decomposition tree, possibly by keeping several candidate subsequences to concatenate, and selecting the most stable. 126 Chapter 7. Conclusions and Future Work This work can also be extended to design R N A secondary structures based on folding pathways. Some RNAs , specially long molecules, do not adopt the structure with the lowest free energy. The differences between real structures and the minimum free energy states are believed to be determined mainly by defects in the energy rules used or by the existence of specific folding pathways capturing molecules in local minima, R N A may enter intermediate conformational states that are key to its functionality and have a significant impact on gene expression. Another motivation is the design of a sequence that has certain intermediate folds on its pathway to its final and most stable structure for applications in nanotechnology [12, 55]. A n algorithm that designs a given structure based on its folding pathways may have Kinefold [68] as the evaluation function. Kinefold uses a stochastic simulation to predict co-transcriptional folding paths of natural R N A s by helix formation and dissociation. Further extensions of this work include the design of secondary struc-tures with pseudoknots, the design of tertiary structures and the design for interaction with other molecules like proteins and metabolites. Finally, it will be interesting to test our designed structures in the wet lab to verify that the sequences designed by RNA-SSD fold correctly into the desire secondary structures. In this research we discussed the importance of nucleic acid design. Novel ribozymes are used in the design of drugs and self-assembling R N A struc-tures are designed to form stable devices for applications in nanotechnology. Also, the design of R N A molecules with specific structural features is useful to investigate the function of natural RNAs . This thesis expands the capa-bility of tools for R N A design in several ways. The R N A - S S D was modified to support primary structure constraints, to design duplexes and to design stable structures. 127 B i b l i o g r a p h y [1] C Alkan, E Karakoc, J Nadeau, C Sahinalp, and K Zhang. R N A - R N A interaction prediction and antisense R N A target search. Journal of Computational Biology, 13(2):267-282, 2006, 2006. [2] P Anderson, J Monforte, R Tritz, S Nesbitt, J Hearst, and A Hampel. Mutagenesis of the hairpin ribozyme. Nucleic Acid Research, 22:1096-1100, 1994. [3] M Andronescu. Algorithms for predicting the secondary structure of pairs and combined sets of nucleic acid strands. M.Sc. thesis, University of British Columbia, Computer Science Department, 2003. [4] M Andronescu, A P Fejes, F Hutter, A . Condon, and H H Hoos. A new algorithm for R N A secondary structure design. Journal of Molecular Biology, 336(3):607-624, 2004. [5] M Andronescu, ZC Zhang, and A . Condon. Secondary structure pre-diction of interacting R N A molecules. Journal of Molecular Biology, 345(5):987-1001, 2005. [6] R D Barish, P W K Rothemund, and E Winfree. Two computational primitives for algorithmic self-assembly: Copying and counting. Nano Letters, 5(12):2586-2592, 2005. [7] A Barroso-delJesus and A Berzal-Herranz. Selection of targets and the most efficient hairpin ribozymes for inactivation of mRNAs using a self-cleaving R N A library. EMBO Reports, 2(12):1112-1118, 2001. [8] Y Benenson, B Gi l , U Ben-Dor, R Adar, and E Shapiro. A n autonomous molecular computer for logical control of gene expression. Letters to Nature, Nature 429:423-429, 2004: [9] H M Berman, J Westbrook, Z Feng, G Gilliland, T N Bhat, H Weissig, I N Shindyalov, and P E Bourne. The protein data bank. Nucleic Acids Research, 28:235-242, 2000. 128 Bibliography [10] H M Berman, J Westbrook, Z Feng, G Gilliland, T N Bhat, H Weissig, I N Shindyalov, and P E Bourne. Protein Data Bank http://www.rcsb.org/pdb/home/home.do, last visited: February 2007. [11] K R Birikh, P A Heaton, and F Eckstein. The structure, function and application of the hammerhead ribozyme. European Journal of Bio-chemistry, 245:1-16, 1997. [12] J S Bois, S Venkataraman, H M T Choi, A J Spakowitz, Z-G W Wang, and N Pierce. Topological constraints in nucleic acid hybridization kinetics. Nucleic Acid Research, 33:4090-4095, 2005. [13] D N Bolon and S L Mayo. Enzyme-like proteins by computational design. PNAS, 98:1427414279, 2001. [14] R R Breaker. Are engineered proteins getting competition from R N A ? Current Opinion in Biotechnology, 7:442-448, 1996. [15] R R Breaker. Engineered allosteric ribozymes as biosensor components. Current Opinion in Biotechnology, 13:31-39, 2002. [16] A Busch and R Backofen. I N F O - R N A - a fast approach to inverse R N A Folding. Bioinformatics Advance Access, 2006. [17] J J Cannone, S Subramanian, M N Schnare, J R Collett, L M D'Souza, Y Du, B Feng, N Lin , L V Madabusi, K M Muller, N Pande, Z Shang, N Yu , and R R Gutell. The comparative R N A Web (CRW) site: A n online database of comparative sequence and structure information for ribosomal, intron, and other RNAs . BMC Bioinformatics, 3:15, 2002. [18] J J Cannone, S Subramanian, M N Schnare, J R Collett, L M D'Souza, Y Du, B Feng, N Lin , L V Madabusi, K M Muller, N Pande, Z Shang, N Y u , and R R Gutell. The comparative R N A web (CRW) site. http://www.rna.icmb.utexas.edu/, last visited: February 2007. [19] Y Chen and W Stephen. Compensatory evolution of a precursor mes-senger R N A secondary structure in the Drosophila melanogaster A d h gene. PNAS, 100:20:11499-11504, 2003. [20] A Chworos, I Severcan, A I Koyfman, P Weinkam, E Oroudjev, H G Hansma, and L Jaeger. Building programmable jigsaw puzzles with R N A . Science, 306:2068-2072, 2004. 129 Bibliography [21] A Cornish-Bowden. Nomenclature for incompletely specified bases in nucleic acid sequences: recommendations 1984. Nucleic Acids Research, 13(9):30213030, 1985. [22] M H de Smit and J van Duin. Secondary structure of the ribosome binding site determines translational efficiency: a quantitative analysis. In Proceedings of the National Academy of Sciences, pages 7668-7672, 1990. [23] M H de Smit and J van Duin. Control of translation by m R N A sec-ondary structure in escherichia coli.a quantitative analysis of literature data. Journal of Molecular Biology, 244(2): 144-50, 1994. [24] J R Desjarlais and N D Clarke. Computer search algorithms in protein modification and design. Current Opinion in Structural Biology, 8:471-475, 1998. [25] R M Dirks, M Lin , E Winfree, and N A Pierce. Paradigms for com-putational nucleic acid design. Nucleic Acid Research, 32:1392-1403, 2004. [26] R M Dirks, J S Bois, J M Schaeffer, E Winfree, and N A Pierce. Ther-modynamic analysis of interacting nucleic acid strands. SIAM Review, in press. [27] G Ferbeyre, V Bourdeau, M Pageau, P Miramontes, and R Cedergren. Distribution of hammerhead-like R N A motifs through the GeneBank. Genome Research, 10:1011-1019, 2000. [28] R R Gutell, N Larsen, and C R Woese. Lessons from an evolving r R N A : 16s and 23s r R N A structures from a comparative perspective. Micro-biology Review, 58:10-26, 1994. [29] I L Hofacker. Vienna R N A secondary structure server. Nucleic Acids Research, 31(13):3429-31, 2003. [30] I L Hofacker. The Vienna R N A Secondary Structure Package http://www.tbi.univie.ac.at/ ivo/rna/, last visited: February 2007. [31] I L Hofacker, W Fontana, P F Stadler, L S Bonhoeffer, M Tacker, and P Schuster. Fast folding and comparison of R N A secondary structures. Chemical Monthly, 125:167-188, 1994. 130 Bibliography [32] H H Hoos and T Stiitzle. Stochastic Local: Search Foundations and Applications. Morgan Kaufmann/Elsevier, 2004. [33] F Hutter, Y Hamadi, H Hoos, and K Leyton-Brown. Performance prediction and automated tuning of randomized and parametric algo-rithms. In Proceedings of the 12th International Conference on Princi-ples and Practice of Constraint Programming, pages 213-228, 2006. [34] L Jaeger, E Westhof, and N B Leontis. TectoRNA: modular assembly units for the construction of R N A nano-objects. Nucleic Acid Research, 29(2):455-463, 2001. [35] G Koev, S Liu , R Beckett, and W A Miller. The 3'-terminal structure required for replication of barley yellow dwarf virus R N A contains an embedded 3' end. Virology, 292:114-126, 2002. [36] F Kramer and D Mills. Secondary structure formation during R N A -synthesis. Nucleic Acids Research, 9(19):5109-5124, 1981. [37] B Kuhlman, G Dantas, G C Ireton, G Varani, B L Stoddard, and D Baker. Design of a novel globular protein fold with atomic-level accuracy.. Science, 302(5649):1364 - 1368, 2003. [38] K A LeCuyer and D M Crothers. Kinetics of an R N A conformational switch. PNAS, 91:3373-3377, 1994. [39] R B Lyngs0 and M Zuker. Fast evaluation of internal loops in R N A secondary structure prediction. Bioinformatics, 15(6):440-5, 1999. [40] C Mao, T H LaBean, J H Reif, and N C Seeman. Logical computa-tion using algorithmic self-assembly of D N A triple-crossover molecules. Nature, 407:493-496, 2000. [41] D H Mathews, J Sabina, M Zuker, and D H Turner. Expanded sequence dependence of thermodynamic parameters improves prediction of R N A secondary structure. Journal of Molecular Biology, 288:911-940, 1999. [42] J S McCaskill . The equilibrium partition function and base pair binding probabilities for R N A secondary structure. Biopolymers, 29:1105-1119, 1990. [43] I Meyer and I Miklos. Co-transcriptional folding is encoded within R N A genes. BMC Molecular Biology, 5:10, 2004. 131 Bibliography [44] A P Niles and E Winfree. Protein design is NP-hard. Protein Engi-neering, 15:779-782, 2002. [45] H W Pley, K M Flaherty, and D B McKay. Three-dimensional structure of a hammerhead ribozyme. Nature, 372(3):68-74, 1994. [46] E Puerta-Fernandez, C Romero-Lopez, A Barroso-delJesus, and A Berzal-Herranz. Ribozymes: recent advances in the development of R N A tools. FEMS Microbiology Reviews, 27:75-97, 2003. [47] B Rastegari. Linear time algorithm for parsing R N A secondary struc-ture. M.Sc. thesis, University of British Columbia, Computer Science Department, 2004. [48] D Repsilber, S Wiese, M Rachen, A Schroder, D Riesner, and G Steger. Formation of metastable R N A structures by sequential folding during transcription: Time-resolved structural analysis of potato spindle tuber viroid (-)-stranded R N A by temperature-gradient gel electrophoresis. RNA, 5:574-584, 1999. [49] E Rivas and SR Eddy. A dynamic programming algorithm for R N A structure prediction including pseudoknots. Journal of Molecular Biol-ogy, 285:2053-2068, 1999. [50] T Ro-Choi and Y Choi. Structural elements of dynamic R N A strings. Molecules and Cells, 16(2):201-210, 2003. [51] P W K Rothemund. Folding D N A to create nanoscale shapes and patters. Nature, 440:297-302, 2006. [52] J SantaLucia Jr and D H Turner. Measuring the thermodynamics of R N A secondary structure formation. Biopolymers, 44:309-319, 1997. [53] P Schuster, W Fontana, P Stadler, and I L Hofacker. From sequences to shapes and back: a case study in R N A secondary structures. In Proceedings-Royal Society of London. Biological sciences, pages 279-284, 1994. [54] G Seelig, D Soloveichik, D Y Zhang, and E Winfree. Enzyme-free nucleic acid logic circuits. Science, 314(5805):1585-1588, 2006. [55] N Seeman. Nucleic acid nanostructures and topology. Angewandte Chemie International Edition, 37:3220-3238, 1998. 132 Bibliography [56] N C Seeman. De novo design of sequences for nucleic acid structural engineering. Journal of Biomolecular Structure & Dynamics, 3:573-81, 1990. [57] N C Seeman. Structural D N A nanotechnology: A n overview. Methods in Molecular Biology, 303:143-66, 2005. [58] D Shu, W-D Mol l , Z Deng, C Mao, and P Guo. Bottom-up assembly of R N A arrays and superstructures as potential parts in nanotechnology. Nano Letters, 4:1717-1723, 2004. [59] F C Simmel, W U Dittmer, and A Reuter. Control and function for D N A nanodevices. In 4th IEEE Conference on Nanotechnology, pages 305-307, 2004. [60] R W Simons and M Grunberg-Manago. RNA structure and function. Cold Spring Harbor Laboratory Press, 1998. [61] P Steffen, B Vofi, M Rehmsmeier, J Reeder, and R Giegerich. RNAshapes: an integrated R N A analysis package based on abstract shapes. Bioinformatics, 22:500-503, 2006. [62] P Steffen, B Vofi, M Rehmsmeier, J Reeder, and R Giegerich. RNAshapes 2.1.1 http://bibiserv. techfak. uni- bielefeld. de/rnashapes/, last visited: February 2007. [63] N A Vaish, A R Kore, and F Eckstein. Survey and summary, recent de-velopments in the hammerhead ribozyme field. Nucleic Acids Research, 26(23) :5237-5242, 1998. [64] B Vofi, R Giegerich, and M Rehmsmeier. Complete probabilistic anal-ysis of R N A shapes. BMC Biology, 4:5, 2006. [65] Q S Wang and P J U N R A U . Ribozyme motif structure mapped using random recombination and selection. RNA, 11:404-411, 2005. [66] E Winfree, F Liu , L Wenzler, and N Seeman. Design and self-assembly of 2D D N A crystals. Nature, 394:539-544, 1998, [67] S Wuchty, W Fontana, I L Hofacker, and P Schuster. Complete subop-timal folding of R N A and the stability of secondary structure. Biopoly-mers, 49:145-165, 1999. 133 Bibliography [68] A Xayaphoummine, T Bucher, and H Isambert. Kinefold web server for R N A / D N A folding path and structure prediction including pseudo-knots and knots. Nucleic Acid Research, 33:W605-W610, 2005. [69] Q Yu , D B Pecchia, S L Kingsley, J E Heckman, and J M Burke. Cleav-age of highly structured viral R N A molecules by combinatorial libraries of hairpin ribozymes. Jornal of Biological Chemistry, 273 (36):23524-23533, 1998. [70] K Yue and K A Di l l . Inverse protein folding problem: designing polymer sequences. PNAS, 89:4163-4167, 1992. [71] M Zuker. Prediction of R N A secondary structure by energy minimiza-tion. Methods in Molecular Biology, 25:267-94, 1994. [72] M Zuker. Mfold web server for nucleic acid folding and hybridization prediction. Nucleic Acids Research, 31:4306-3415, 2003. [73] M Zuker. Mfold web server http://www. bioinfo. rpi. edu/applications/mfold, last visited: February 2007. [74] M Zuker, D H Mathews, and D H Turner. Algorithms and thermody-namics for RNA secondary structure prediction: a practical guide. In RNA Biochemistry and Biotechnology. (Barciszewski, J &. Clark, B . F. C , eds), N A T O ASI Series, Kluwer Academic Publishers, Dodretch, 1999. [75] M Zuker and P Stiegler. Optimal computer folding of large R N A sequences using thermodynamics and auxiliary information. Nucleic Acids Research, 9:133-148, 1981. 134 "1 A p p e n d i x A : U n d e s i g n a b l e s t r u c t u r e s We will prove that the structural motifs from Figure 4.9 (a) and (b) are impossible to design with the current thermodynamic parameters. Consider the structure motif B from Figure 4.9 (a), which is formed by two bulges of size two and three, respectively, where Xj G {A G,C,U} for all i. We will show that this structure is impossible to design with the standard thermodynamic model, because the internal loop I of size seven, formed by breaking the base pair Zj+i • Xj-$ (see Figure 1 (a)), is energetically more favourable. Figure 1: (a) Motif I: internal loop formed by breaking the base pair Xi+i • Xj-3 from motif B . (b) Motif II: internal loop formed by breaking the base pair Xi_|_6 • xj-4 from motif 21. Let X be the set of assignments of bases in which the base pairs are complementary. Let AG(B,X) and AG(I,X) be the energies of motif B and / , respectively, for X G X. We will show that (a) AG(I, X) < AG{B, X) VX eX 135 Appendix A: Undesignable structures Parameter Explanation Values AG-length-I(n) Destabilizing energy of internal loop of size n (n = 4, 5,6,...) 1.7, 1.8, 2.0, ... AG-length-B(n) Destabilizing energy of bulge of size n (n — 1, 2,3,...) 3.8, 2.8, 3.2, ... AG-Internal-n (Si, Sj, JSJ-L-I, Sj—i) Terminal mismatch free energy of closing base pair (Si • Sj) and neighbouring free bases S^+i and Sj-i -1.1, -0.7, -0.4, 0.0 and 0.7 AG-Asymmetry (li, I2) Penalty for asymmetric internal loops M I N = { 0.5- \h-h\ Table 1: Free energy parameters for internal loops. following the notation of Andronescu [3] and using the thermodynamic pa-rameters of Mathews et al. [41] for computing AG(B,X) and AG(I,X). Table 1 shows the definition of parameters and their possible values involved in the calculation of the free energy of an internal loop. AG(I,X) = AG -length-1(7) + A G — Internal — n(xi,Xj,Xi+\,Xj-\) + AG — Internal — n(xj+5, Xj-4, Xi+4, Xj-%) + AG- Asymmetry(4,3) V X <E X =>- max{AG (7 , X)} = 2.7 + max{AGInternal - n(xi, Xj, Xj+i , Xj-i) + AG — Internal — n(xi+5, Xj_4, Xi+4, Xj-3)} = 2.7 + 2-0.7 = 4.1 On the other hand, AG(B,X) = AG - length - B(2) + A G - length — 5 (3 ) = 2.8 + 3.2 = 6 y X €X Then m a x { A G ( / , X ) } <.min{AG(B,X)} 136 Appendix A: Undesignable structures Therefore A G ( J , X) < AG(B, X) V l £ r V Consider the motif 21 in Figure 4.9 (b), which is formed by two internal loops both of size eight. We will show that this structure motif is impossible to design, because the internal loop II of size eighteen, formed by breaking the base pair Xj+6 • Xj-A (Figure 1 (b)), is more favourable. Let A G — Internal — n(xj, Xj, Xi+i , Xj-i) = x\ AG — Internal — n(xj_io , Xj+io, Xj-9, xi+§) = x2 AG - Internal - n(xj_4, X i + 6 , xj-3i xi+s) — 2/1 AG — Internal — n(xi+&, Xj_4, Xj+7, X J S ) = 2/2 where 2:1,2:2,2/1,2/2 € { - 1 . 1 , - 0 . 7 , - 0 . 4 , 0 , 0 . 7 } . Then AG(II,X) = AG - length - 7(18) + x x + x2 V X e X =>AG(1I,X) = 3.I + X1 + X2 VXeX On the other hand, A G ( 2 J , X) = [ A G - length - 1(8) + 2 1+2 / 1 + A G - Asymmetry(3,5)] + [ A G - length - 1(8) + x2 + 2/2 + A G - Asymmetry(3,5)] VX ex = » A G ( 2 / , X ) = (2.3 + x i + y i + 1) + (2.3 + y2 + x 2 + 1) V X e # = ^ A G ( 2 / , X ) = 6.6 + x i + 2/1 + 2/2 + 22 V X e * Therefore A G ( 1 / , X) < A G ( 2 J , X ) V X e X 137
- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Theses and Dissertations /
- Computational RNA secondary structure design : empirical...
Open Collections
UBC Theses and Dissertations
Featured Collection
UBC Theses and Dissertations
Computational RNA secondary structure design : empirical complexity and improved methods Aguirre-Hernández, Rosalía 2007
pdf
Notice for Google Chrome users:
If you are having trouble viewing or searching the PDF with Google Chrome, please download it here instead.
If you are having trouble viewing or searching the PDF with Google Chrome, please download it here instead.
Page Metadata
Item Metadata
Title | Computational RNA secondary structure design : empirical complexity and improved methods |
Creator |
Aguirre-Hernández, Rosalía |
Publisher | University of British Columbia |
Date Issued | 2007 |
Description | Ribonucleic acids play fundamental roles in cellular processes and their function is directly related to their structure. The research reported in this thesis is focused on the design of RNA strands that are predicted to fold to a given secondary structure, according to a standard thermodynamic model. The design of RNA structures is important for applications in therapeutics and nanotechnology. This work also applies to DNA with the appropriate thermodynamic model for DNA molecules. The overall goal of this research is to improve the performance and scope of algorithmic methods for RNA secondary structure design. First, we investigate the hardness of this problem, since its theoretical complexity is unknown. A scaling analysis on random and biologically generated structures supports the hypothesis that the running time of the RNA Secondary Structure Designer (RNA-SSD) algorithm, one of the state of the art algorithms for designing secondary structures, scales polynomially with the size of the structure. We found that structures with small stems separated by loops are difficult to design. Our improvements to the RNA-SSD algorithm include the support for primary structure constraints, where bases or base types are fixed in certain positions of the sequence. Such constraints are important, for example, when designing RNAs such as ribozymes or tRNAs, where certain base positions must be fixed in order to permit interaction with other molecules. We investigate the correlation between the number and the location of the primary structure constraints and the performance of RNA-SSD. In the second part of our research, we have extended the RNA-SSD algorithm to design for stability, rather than minimum free energy folding. We measure stability according to several criteria such as high probability of observing the minimum free energy structure, and low average number of incorrectly paired nucleotides in the ensemble of structures for the designed sequence. The design of complexes of RNA molecules, that is RNA molecules that interact with each other, is relevant for many applications. We describe several ways to design stable structures and complexes, and we also discuss the advantages and limitations of each approach. |
Genre |
Thesis/Dissertation |
Type |
Text |
Language | eng |
Date Available | 2011-02-11 |
Provider | Vancouver : University of British Columbia Library |
Rights | For non-commercial purposes only, such as research, private study and education. Additional conditions apply, see Terms of Use https://open.library.ubc.ca/terms_of_use. |
DOI | 10.14288/1.0080432 |
URI | http://hdl.handle.net/2429/31202 |
Degree |
Doctor of Philosophy - PhD |
Program |
Mathematics |
Affiliation |
Science, Faculty of Mathematics, Department of |
Degree Grantor | University of British Columbia |
Campus |
UBCV |
Scholarly Level | Graduate |
AggregatedSourceRepository | DSpace |
Download
- Media
- 831-ubc_2007-317259.pdf [ 6.64MB ]
- Metadata
- JSON: 831-1.0080432.json
- JSON-LD: 831-1.0080432-ld.json
- RDF/XML (Pretty): 831-1.0080432-rdf.xml
- RDF/JSON: 831-1.0080432-rdf.json
- Turtle: 831-1.0080432-turtle.txt
- N-Triples: 831-1.0080432-rdf-ntriples.txt
- Original Record: 831-1.0080432-source.json
- Full Text
- 831-1.0080432-fulltext.txt
- Citation
- 831-1.0080432.ris
Full Text
Cite
Citation Scheme:
Usage Statistics
Share
Embed
Customize your widget with the following options, then copy and paste the code below into the HTML
of your page to embed this item in your website.
<div id="ubcOpenCollectionsWidgetDisplay">
<script id="ubcOpenCollectionsWidget"
src="{[{embed.src}]}"
data-item="{[{embed.item}]}"
data-collection="{[{embed.collection}]}"
data-metadata="{[{embed.showMetadata}]}"
data-width="{[{embed.width}]}"
data-media="{[{embed.selectedMedia}]}"
async >
</script>
</div>
Our image viewer uses the IIIF 2.0 standard.
To load this item in other compatible viewers, use this url:
https://iiif.library.ubc.ca/presentation/dsp.831.1-0080432/manifest