inear time algorithm for parsing R N A secondary structure Baharak Rastegari B.Sc , Sharif University of Technology, Tehran, Iran, 2002 A THESIS S U B M I T T E D IN PARTIAL F U L F I L L M E N T OF T H E R E Q U I R E M E N T S FOR T H E D E G R E E OF Master of Science in T H E F A C U L T Y OF G R A D U A T E STUDIES (Department of Computer Science) We accept this thesis as conforming to the required standard The University of British Columbia August 2004 © Baharak Rastegari, 2004 FACULTY OF GRADUATE STUDIES THE UNIVERSITY OF BRITISH COLUMBIA Library Authorization In presenting this thesis in partial fulfillment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the head of my department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission. Name of Author (please print) Date (dd/mm/yyyy) Title of Thesis: Lit) & & f f~'V> f ri\go r> f A m f o y f>ar<ir\j r?v M A S e . C o n c\gfj—sf-rucrhu D E 9 R E E : hXfi^ir-^ r > £ <^€JP*CJQ. Y e a r : ^ o o ^ Department of Corr\ OQ^eY 5 C l g n ^ g ' The University of British Columbia Vancouver, BC Canada • grad.ubc.ca/forms/?formlD=THS page 1 of 1 last updated: 20-Jul-04 Abstract RNA secondary structure prediction is an important problem in computational molecular biology. Experiments show that existing polynomial time prediction algorithms have limited success in predicting correctly the base pairs, i.e. secondary structure, in known biological RNA structures. One limitation of many current algorithms is that they can pre-dict only restricted classes of structures, excluding many so-called pseudoknotted secondary structures. The type of the pseudoknotted structures that occur in biological structures, as well as the type of structures handled by current algorithms, have been poorly understood, making it difficult to assess the generality of current algorithms. In this thesis we present a comprehensive and precise classification of structural elements and loops in a secondary structure, along with a linear time algorithm for parsing secondary structures into their structural elements. The parsing algorithm, along with the classification scheme for the loops in a pseu-doknotted secondary structure, can be used in analysing existing prediction algorithms to determine which known biological RNA structures can not be predicted by the algorithms. This analysis can help us to design new and more powerful prediction algorithms. Furthermore, we present two applications of our work: (i) linear time free energy calculation algorithm, and (ii) linear time test for Akutsu's[2] algorithm class. We present a linear time algorithm for calculating the free energy of a given secondary structure. This algorithm can be useful especially in heuristic prediction algorithms, as they commonly use a procedure to calculate the free energy for a given sequence and structures. We also present a linear time algorithm to test whether the prediction algorithm introduced by Akutsu[2] can handle a given structure. The result of our analysis on al-gorithm of Akutsu on some sets of biological structures shows that although it is proved theoretically that the class of structures handled by Akutsu's algorithm is more general than that handled by the algorithm of Dirks and Pierce [7], they can both handle the same class of given biological structures. ii Contents Abstract ii Contents iii List of Tables v List of Figures vi Acknowledgements vii Dedication viii 1 Introduction 1 1.1 Parsing Secondary Structures: Problem and Motivation 1 1.2 Contributions 3 1.3 Overview 5 2 Components of a Pseudoknotted Secondary Structure 6 2.1 Components of a Pseudoknotted Secondary Structure 6 2.2 Pseudoknotted Structural Elements 8 2.3 Loops in Pseudoknotted Structure 11 2.3.1 Location Attribute of Loops 13 2.4 Parse Tree . . 14 3 Related Work 16 3.1 Classification Scheme 16 3.2 Pseudoknotted Secondary Structure Prediction 16 3.3 Free Energy Calculation 18 4 Parsing Algorithm 19 4.1 Closed Region Finding Algorithm 19 4.1.1 Intuition . 19. 4.1.2 Invariants 22 iii 4.1.3 Example 22 4.2 Closed Region Finding Algorithm: Proof 24 4.2.1 Mark Invariant Proof 24 4.2.2 List Invariant Proof . . 27 4.2.3 Stack Invariant Proof 28 4.2.4 Closed Region Invariant Proof 28 4.2.5 Add-to-tree Calls Theorem 29 4.3 Constructing the Parse Tree 30 4.4 Parsing Algorithm Running Time 31 5 Energy Calculation 32 5.1 Loop Type Finding 33 5.2 Band Finding - 33 5.3 Loop Location Finding 34 5.4 Inner Loop Finding 35 6 Akutsu's Algorithm Structure Class 37 6:1 Simple Pseudoknot Structures 37 6.1.1 Equivalence of Definitions 39 6.1.2 Simple Pseudoknot Membership Test: Linear Time Algorithm . . . . 40 6.2 Secondary Structures with Simple Pseudoknots . 41 6.2.1 Secondary Structure with Simple Pseudoknots Membership Test . . 43 6.3 Secondary Structures with Recursive Pseudoknots . 45 6.3.1 Recursive Pseudoknot Membership Test 47 6.4 Classification of Biological Structures 47 7 Conclusion and Future Work 49 Bibliography 51 iv List of Tables 4.1 Execution of Algorithm 1 on secondary structure of Figure 4.1(a). There is one row for each base index, plus an initial row, 0. Column 1 shows which base index is scanned by the algorithm. The remaining columns describe what happens when index i is scanned (i.e when A = i). Column 2 gives the list L along with the marks on list elements and column 3 gives the stack, when i has just been scanned (when the algorithm reaches line 31). Column 4 lists which closed region has been identified (and added to the tree) by the algorithm, if any, when i is scanned 23 6.1 Structure classification. Columns 2-7 present data for each RNA data set. For each data set (column), the entry in first row lists the number of structures in the data set. The second row lists the average number of base pairs in the struc-tures. The remaining rows list the number of structures of the data set that are in the PKF, Akutsu(secondary structures with simple pseudoknots), L&P, D&P, Akutsu(secondary structures with recursive pseudoknots) and R&E classes, respec-tively 48 v List of Figures 1.1 R N A s e c o n d a r y s t r u c t u r a l c o m p o n e n t s ( l e f t - r igh t ) : s t e m , h a i r p i n l o o p , b u l g e l o o p , i n t e r n a l l o o p , m u l t i - b r a n c h l o o p , p s e u d o k n o t • 2 1.2 (a) A p s e u d o k n o t free s t r u c t u r e , (b) T h e t ree o f l o o p s for t he s e c o n d a r y s t r u c t u r e g i v e n i n p a r t (a) 2 1.3 (a) A p s e u d o k n o t t e d s t r u c t u r e , (b) T h e t ree o f l o o p s for t he s e c o n d a r y s t r u c t u r e g i v e n i n p a r t (a) 3 2 .1 (a) P s e u d o k n o t ' free s e c o n d a r y s t r u c t u r e o f l e n g t h 27. (b) A r c d i a g r a m r e p r e s e n t a t i o n o f t h e s t r u c t u r e i n p a r t (a) 7 2 .2 (a) P s e u d o k n o t t e d s e c o n d a r y s t r u c t u r e o f l e n g t h 13. (b) A r c d i a g r a m r e p r e s e n t a t i o n o f t h e s t r u c t u r e i n p a r t (a) 8 2 .3 (a) A r c d i a g r a m r e p r e s e n t a t i o n o f a n R N A s e c o n d a r y s t r u c t u r e R. R c o n s i s t s o f t w o c l o s e d r eg ions [1;40] a n d [41; 82]. ( b ) [ l ; 4 0 ] c l o s e d r e g i o n , (c) [41; 82] c l o s e d r e g i o n . 10 2 . 4 P a r s e t ree for t he s t r u c t u r e i n F i g u r e 2 .3 . C o n s i d e r n o d e V = ( 1 2 , 3 6 ) as a n e x a m p l e . (15, 24) a n d ( 2 6 , 3 4 ) a re V ' s c h i l d r e n . C = [12; 36] is V s c o r r e s p o n d i n g c l o s e d r e g i o n a n d i t s p a t t e r n is P = bcdeedcfggfb. C, the p r i v a t e c l o s e d r e g i o n o f V, c o n t a i n s 12.36 as base p a i r a n d i t s p r i v a t e c l o s e d p a t t e r n is P' — bb 15 4 .1 (a) A r c d i a g r a m r e p r e s e n t a t i o n o f a n R N A s e c o n d a r y s t r u c t u r e , (b) P a r s e t ree for the s t r u c t u r e i n p a r t (a) • 2 0 5.1 A r c d i a g r a m r e p r e s e n t a t i o n o f a n R N A s e c o n d a r y s t r u c t u r e ( c l o s e d r e g i o n [41; 82] i n F i g u r e 2 .3(c)) 3 5 6.1 (a) S e c o n d a r y s t r u c t u r e w i t h p a t t e r n P = abebcaddce: S i m p l e p s e u d o k n o t w i t h j o = Left(e) a n d j'0 = Left(d) as wi tnesses , (b) S e c o n d a r y s t r u c t u r e w i t h p a t t e r n . P = abbcaddeec: R e c u r s i v e p s e u d o k n o t as P = abbcaPieec w h e r e P i = dd a n d abbcaeec a re b o t h s i m p l e p s e u d o k n o t . (c) S e c o n d a r y s t r u c t u r e w i t h p a t t e r n P = abbcaddece: n e i t h e r s i m p l e o r r e c u r s i v e 3 8 6 .2 A r c d i a g r a m r e p r e s e n t a t i o n o f s t r u c t u r e s i n F i g u r e 6.1 3 9 6 .3 P a r s e t rees for t he s t r u c t u r e s i n F i g u r e 6.1 4 4 vi Acknowledgements First of all, I would like to thank my supervisor Dr. Anne Condon, who has guided me through research in bioinformatics since I did not even know what it meant. I am deeply grateful to her encouragements , her support, great research ideas she shared with me and her great guidance throughout the whole work. I would also like to thank Dr. Will Evans for being my second reader and for helpful comments. I want to thank the people in Beta Lab and other collaborators who helped me throughout, especially Mirela Andronescue who has always helped me with her helpful comments and suggestions. I also want to thank my dearest friends in Vancouver for understanding me during the last months of work and keeping me calm whenever I felt frustrated. Last but not the least, I want to thank my parents, my dear sister Tahmineh and my dear brother Siavash for their love, support and trust. B A H A R A K R A S T E G A R I The University of British Columbia August 2004 vii To my dad and ma. viii Chapter 1 Introduction In this chapter, we first describe the research problem that we address in this thesis, parsing a given RNA secondary structure to its structural components, and give some motivations to support the importance of the work. Then in Section 1.2, we describe the contributions of our work in other applications such as calculating the free energy of a given R N A secondary structure. The last section is a brief overview of this thesis. 1.1 Parsing Secondary Structures: Problem and Motivation An RNA molecule can be represented as a single stranded sequence consisting of the fol-lowing four bases or nucleotides: Adenine (A), Guanine (G), Cytosine (C), and Uracil (U). This single stranded sequence is called primary structure or sequence. RNA molecules play diverse roles in the cell: as carriers of information, catalysts in cellular processes, and mediators in determining the expression level of genes [6]. The molecule folds into a functional shape (structure) by forming intramolecular base pairs among some of its bases. These pairings arise from the Hydrogen-bonding forces between pairs of bases. The secondary structure describes which bases of the molecule bond with each other. Figure 1.2(a) and Figure 1.3(a) illustrate the base pairs in two simple secondary structures. The base pairs together with unpaired bases in these examples form the fol-lowing structural components: stacked pairs (stems), hairpin loops, bulge loops, internal loops, multiloops, and pseudoknots (Figure 1.1). Then the secondary structure can also be represented by a list of structural components. More detailed definition is given in Chapter 2. The structure of a functional RNA molecule is often the key to its function and interaction with other molecules, since it is conserved across many organisms though the sequence may vary. Therefore predicting the secondary structure of RNA molecules (or being more general, nucleic acids) is one of the most important problems in computational molecular biology. Comparative sequence analysis is the most reliable approach for secondary structure prediction but it needs several sequences available. When just a single molecule is available, 1 i| 9 $ § JL Figure 1.1: RNA secondary structural components (left-right): stem, hairpin loop, bulge loop, internal loop, multi-branch loop, pseudoknot. computational prediction of its secondary structure from its base sequence is based on the premise that out of exponentially many possibilities, an RNA molecule is most likely to fold into the minimum free energy (mfe) structure. The free energy of a given structure is estimated by summing thermodynamic and entropic free energy terms associated with the component loops of the secondary structure. Unfortunately, finding the mfe secondary structure for a given R N A sequence is NP-hard [11]. Several polynomial time algorithms have been proposed for predicting the mfe secondary structure from restricted classes of secondary structure. The most well known such class is that of pseudoknot free secondary structures and the work on prediction of sec-ondary structure has mostly focused on prediction of pseudoknot free secondary structure. In Figure 1.2(a) a pseudoknot free structure is shown. Many biological RNA structures are pseudoknot free but there are also many natural structures which are pseudoknotted. Experiments showed that existing polynomial time algorithms are capable of predicting at most 70% of base pairs correctly in known (experimentally determined) biological RNA structures. A C C U — A C C 17 C — G 27 u u - C U C U G A II I I I c - G A G G G A C A 1 A (a) (b) Figure 1.2: (a) A pseudoknot free structure, (b) The tree of loops for the secondary structure given in part (a). Heuristic algorithms provide one widely used approach for predicting secondary 2 structures. These algorithms commonly need a procedure to calculate the free energy for a given sequence and structure. So it is useful here to be able to compute what are the loops of the structure. If the structure is pseudoknot free, loops are either nested in each other, or disjoint, so it is natural to represent the loops and their relationships as a tree (Figure 1.2(b)). Although it was not reported before, it is straightforward to construct the tree from the description of the pseudoknot free secondary structure in linear time, if the structure is represented in standard format (see Chapter 2). Construction of the tree from the description of the secondary structure is what we mean by parsing the structure. No previous work has provided a comprehensive way to classify the types of loops that arise in pseudoknotted structures, although certain types of pseudoknotted structures have been named. Figure 1.3(a) shows a simple pseudoknotted structure. Figure 1.3: (a) A pseudoknotted structure, (b) The tree of loops for the secondary structure given in part (a). Although there are algorithms for finding the minimum free energy secondary struc-ture from limited classes of pseudoknotted structures, the papers describing these algorithms only implicitly (via description of their algorithm) explain what is their energy model, and what class of structures their algorithm can handle.. While our long-term goal is to develop better algorithms for predicting pseudoknotted secondary structures, we felt that the first step would be to elucidate what is the energy model used by current algorithms. In turn, doing this means that a categorization of the types of loops which arise in pseudoknotted secondary structures is needed, and an efficient algorithm for parsing secondary structures into their component loops (Figure 1.3(b)) is useful. It should be noted here that the simple example in Figure 1.3(a) does not include all the different types of pseudoknotted loops that can arise in a real secondary structure (see Chapter 2). 1.2 Contributions In this thesis we present a comprehensive way to classify the type of loops that arise in a pseudoknotted structure, and also an algorithm which parses a given secondary structure to 3 its structural components. Our parsing algorithm can be used in analysing and understand-ing known experimental structures (in order to design more efficient prediction algorithms), efficient free energy calculation of a secondary structure and other applications. The first two contributions of our work are: • A classification scheme for the loops in a pseudoknotted secondary structure (pre-sented in Chapter 2), and • A linear time parsing algorithm for constructing the parse tree of a given secondary structure (presented in Chapter 4). The linear time algorithm for parsing a secondary structure can be used in several applications. In this work we present two of them as the next two contributions of our work: • A linear time algorithm to calculating the free energy of a given secondary structure. Standard thermodynamic functions calculate the free energy of each loop using infor-mation about the loop such as: (i) its external base pair, (ii) an ordered list of its base pairs or tuples, (iii) number of unpaired bases in the loop, and (iv) type and the location of the loop. The free energy of a secondary structure is calculated by summing the free energy of its component loops. So it could be easily calculated in linear time if we have all loops and needed information. We use our algorithm for parsing the secondary structure to identify all loops in the structure, plus the needed information about them, in linear time. Therefore it is possible to calculate the free energy of a given secondary structure in linear time. • A linear time algorithm to test whether the prediction algorithm introduced by Akutsu [2] can handle a given structure. There are polynomial time algorithms which have been designed to predict the sec-ondary structure for R N A strands. As the problem is NP-hard , each of them can • handle a restricted class of structures. It is useful to have a characterization of the •classes of structures handled by each of these algorithms. Moreover, it is good to know whether the class of structures handled by an algorithm includes most known biological structures. Condon et al.[5] provide a simple characterization of the classes of structures handled by Rivas and Eddy ( R & E ) [13], Dirks and Pierce ( D & P ) [7] and Lyngso and Pederson ( L & P ) [11] algorithms. They also provide a linear time algorithm to test if an input structure is in the R & E , D & P , and L & P classes. But they didn't provide a test algorithm for Aku t su class. Using the parsing algorithm, we introduce a linear time algorithm to test whether an input structure is in the Akutsu class. Then we apply it on the same structures used by Condon et al.[5], to identify the number of structures that can be handled by the Aku t su algorithm. 4 1.3 Overview In Chapter 2, we provide definitions of structural elements and loops in RNA secondary structure, where some notions, such as closed regions, bands and etc, are introduced in this work for the first time. In Chapter 3, we describe the previous related works, mostly in predicting RNA secondary structure. In Chapter 4, we describe our linear time algorithm for parsing a given structure to its closed regions (as structural components). We also provide a precise proof as to why it works accurately and in linear time. We present an application of our parsing algorithm in calculating the free energy of a given secondary structure, in Chapter 5. In Chapter 6, we present another application, namely a linear time test algorithm for Akutsu class. We also apply Akutsu class test algorithm on the number of known biological structures and present the result in this chapter. In Chapter 7 we end with a conclusion of our work and possible future work. 5 Chapter 2 Components of a Pseudoknotted Secondary Structure In this chapter we present a formal definition of RNA secondary structure, along with precise definition and classification of structural elements and loops in a secondary structure. For the first time, in this work we introduce closed region and bands notions, in Section 2.2. In Section 2.3 we present a precise definition of different type of loops in a secondary structure, where types of loops in pseudoknotted structure such as interior-pseudoknotted loop, multi-pseudoknotted loop and pseudoknotted loop are introduced for the first time. In the last section, the notion of parse tree is introduced. 2.1 Components of a Pseudoknotted Secondary Structure An RNA molecule can be represented as a single stranded sequence consisting of the fol-lowing four bases or nucleotides: Adenine (A), Guanine (G), Cytosine (C), and Uracil (U). This single stranded sequence is called primary structure or sequence. The sequence has distinct 5' and 3' ends. The molecule folds into a functional shape by forming intramolecular base pairs among some of its bases. These pairings arise from the Hydrogen-bonding forces between pairs of bases. The set of these base pairings is known as the secondary structure of the RNA. The Watson-Crick base pairs: {AU,CG,GC,UA} and wobble pairs: {GU,UG} are the set of common base pairings. Formally, a set R of base pairs is called a secondary structure if each base is in at most one base pair. We index the bases consecutively from the 5' end toward the 3' end, starting at 1. Let i.j (i < j) denote that the base indexed i is paired with the base indexed j, in which case we write i.j £ R. For convenience, we write i.j to mean i.j £ R. We refer to the length of the sequence by n (Figure 2.1(a) and Figure 2.2(a)). There are different ways for representing RNA secondary structure. Figure 2.1(a) shows an example of usual representation of RNA secondary structure. But to easily explain 6 20 1 2 3 4 5 6 7 8 9 1 0 1 1 1 2 1 3 1 4 1 5 1 6 1 7 1 8 1 9 20 21 22 23 24 25 26 27 (b) Figure 2.1: (a) Pseudoknot free secondary structure of length 27. (b) Arc diagram representation of the structure in part (a) how algorithmst work, it is usually easier to use other representations. In this work, we use arc diagrams to represent secondary structures. In the arc diagram representation of structure R, base indices are shown as vertices on a straight line (backbone), ordered from the 5' end, and arcs (always above the straight line) indicate base pairs (Figure 2.1(b) and Figure 2.2(b)). We introduce a pattern representation of secondary structures as an alternative, which will be used in Chapter 6. In a pattern, information about unpaired bases and consequently the base indices is lost but the pattern of nesting or overlaps among base pairs is preserved. (We note that the definition of pattern could be extended so that unpaired bases are represented using a special symbol.) For example, the pattern representation of structure in Figure 2.1 is P = abcdeedcfggfba. To define patterns precisely, we introduce some notation. We use e to denote the empty string. Let Nn denote the natural numbers between 1 and n (inclusive). For any string s over alphabet E , s J. a denotes the string s with all occurrences of a removed, and 5 | E ' , E ' C E , denotes the string s with all occurrences of all a £ E ' removed. Also |s| denotes the size (length) of s. 7 Patterns: A string P (of even length) over some alphabet S is a secondary structure pattern, or simply a pattern, if every symbol of £ occurs either exactly twice, or not at all, in P. We say that secondary structure R for a strand of length n corresponds to pattern P if there exists a mapping m : Nn —> £ U {e} with the following properties: (i) if i.j 6 R then m(i) € £ and m(i) = m(j), (ii) if i.j and j i ^ R for all j G Nn, then m(z) = e, and (iii) P = m(l)m(2)...m(n). We refer to the index of the first and the second occurrence of any symbol cr in P by Left(P, cr) and Right(P, a) respectively. When P is understood, we use Left (a) and Right (a). In the same way, the first occurrence of any symbol is called a Left symbol where the second occurrence is called a Right symbol. For example, pattern P = abcdeedcfggfba corresponds to the structure in Figure 2.1(b), and for symbol a, left(a) = 1 and Right (a) = 14. 1 2 3 4 Figure 2.2: (a) Pseudoknotted secondary structure of length 13. (b) Arc diagram representation of the structure in part (a) 2.2 Pseudoknot ted Structural Elements Here we introduce two new notions in a secondary structure: closed region and band. Roughly speaking, a region C is a contiguous set of base indices, and if C is closed then there is no base index in C which is paired with a base index which is not in C. C is a pseudoknotted closed region if there is no arc diagram representation of it such that no two arcs cross each other. C is a pseudoknot free closed region if it is not a pseudoknotted closed region. Figure 2.1(b) and Figure 2.2(b) represent a pseudoknot free and a pseudoknotted closed region respectively. Within a pseudoknotted closed region, we define a notion of a band. Bands in a pseudoknotted closed region refer to pseudoknotted stems. In Figure 2.2(b), the base pairs (arcs) 1.11 and 2.10 form a band, and also the base pairs (arcs) 8.13 and 9.12 form a band. In the rest of the section we formally define the above notions. All of the following definitions are with respect to a fixed secondary structure R with pattern representation P. We use to denote the set of indices i,i + 1 , j and call it 8 a region if i < j. The union of two non-overlapping regions is called a gapped region. Closed Regions: We say that (i < j) is weakly closed if it contains at least one base pair and for all base pairs i'.j' of R, i' £ [i; j] if and only if j' £ We say that is closed, and write if either (i) i = 1 and j = n or (ii) [i; j] is weakly closed and for all I with i < I < j, [i;l] and [l;j] are not weakly closed. Each closed region corresponds to a substructure (subset of base pairs) of R, and also to a pattern. We say that a pattern is closed and refer to it as a closed pattern if its corresponding region is a closed region. For example, abba and abab are closed patterns, but aabb is not (unless aabb corresponds to the entire structure). Let [i; j'] be a closed region. If i' and j are such that i.j and i'.j' then we say that i.j and i'.j' are the external base pairs of [i;j']. If i.j' then the region has just one external base pair. We also refer to i and j' as i;j"s left and right borders respectively. Let [i;j] and [i';j'\ be closed with i < i'. If j < i', we say. that [i;j] and [i';j'] are disjoint; otherwise we say that [i')j'} is nested in [i;j]. For example, in Figure 2.3 [12; 36] is closed. [15; 34] is weakly closed but it is not closed as [15; 24] is weakly closed. Pattern P = bcdeedcfggfb corresponds to 12; 36 and therefore is a closed pattern. 12.36 is the external base pair of 12; 36 where 12 and 36 are the left and the right borders of it respectively. [15; 24] and [26; 34] are disjoint closed regions and both are nested in [12; 36]. Pseudoknotted Closed Region: We say that [i;j] is a pseudoknotted closed region of R if i; j and i.j R. We say that the indices i and j are the left and right borders of the pseudoknotted region [i;j]. Pair i.j is pseudoknotted if there exists i'.j' with i < i' < j < j' or i! < i < j' < j. We also refer to i and j as pseudoknotted base indices. A pattern P is called a pseudoknotted pattern if it corresponds to a pseudoknotted closed region. For example, in Figure 2.3 [57; 69] is a pseudoknotted closed region and its corre-sponding pattern P = abccdebaed is a pseudoknotted pattern. 57.67 and 64.69 are its external base pairs, where 57 and 69 are the left and the right borders of it respectively. 57.67, 58.66, 64.69 and 65.68 are all pseudoknotted pairs and 57, 58,64,66, 67, 68 and 69 are all pseudoknotted base indices. Pseudoknot Free Closed Region: We say that the closed region i; j is pseudoknot free if there are no two pairs i\.j\ and i2-J2, with i < h,j\,i2,J2 < j such that either h < i-2 < ji < J2 or %2 < i\ < J2 < ji- A pattern P is called a pseudoknot free pattern if it corresponds to a pseudoknot free closed region. For example in Figure 2.3 [12; 36] is a pseudoknot free closed region and its corre-sponding pattern P — bcdeedcfggfb is a pseudoknot free pattern. 9 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 (c) Figure 2.3: (a) Arc diagram representation of an RNA secondary structure R. R consists of two closed regions [1;40] and [41; 82]. (b)[l;40] closed region, (c) [41; 82] closed region. Bands: Let i2-J2 be a pseudoknotted base pair. We say that is directly banded in i\.j\ if (i) h < il < 32 < ji, and (ii) [i\ + l,%2 - 1] and [J2 + 1, j i - 1] are weakly closed. Note that the "is directly banded in" relation is reflextive. We let "are banded" be the symmetric and transitive closure of the "is directly banded in" relation. Let B be an equivalence class under the "are banded" relation. That is, B is a set of base pairs such that every two base pairs in B are banded and every base pair in B is pseudoknotted. B has closing base pairs and i2-J2 such that for every base pair i.j in B, ii < i < 12 and j'2 < j < j\. Note that i\.j\ may equal %2-J2-A gapped region [-ii; ] U [jj;ji] is a band of a pseudoknotted region if for some equivalence class B, i\.j\ and i[.j[ are the closing pairs of B. Base pair i.j or closed 10 region i ; j is contained in band U [j[; j i ] , if and only if either i , j € [i\',i'i\ or hj G b i i J i ] - Then we simply say that i . j is in a band region. Base pair i.j spans band [ i i j i ' J U if n < i < i' and j[ < j < j\. We say that [ i i j i ' J U [ j j ; j i ] is a frand of pseudoknotted region [i; j] if i < i\ < ji < j and there is no closed region [p; q] with i<p<i\<j\<q<j-We say that i i . j i and i'l-j'i are band's closing pairs. i\.j\ is the outer and i[.j[ is the inner closing pair of the band. We say that [in i[] and [j[; j{\ are band regions. i\ and i[ are the borders of [ii;i'i] band region and j[andji are the borders of [j[; j\\ band region. We refer to i\ and j\ by the left and the right border of the band respectively. For example in Figure 2.3 [57; 58] U [66; 67] is a band of pseudoknotted closed region [57; 69] and 57.67 and 58.66 are the outer and the inner closing pair of the band. [57; 58] and [66; 67] are band regions where 57 and 67 are. the left and the right border of the band. 57.67 and 58.66 both span the band. We can infer the following corollaries from the above definitions: Corollary 2.2.1 A weakly closed region C can be decomposed to C\ U Ci U ... U Cn U C where each C{ ,1 < i < n, is a closed region and C is a set of unpaired base indices. Corollary 2.2.2 Assume that i i - J i , *2-j2, ••••,in-jn > h < *2 < ••• < in, are the set of base pairs that spans a band, [ii;i'i] U [ i i ; j i ] - Then we have jn < .... < J2 < ji-Corollary 2.2.3 Each base pair i.j in a pseudoknotted closed region either spans a band or belongs to some nested closed region. 2.3 Loops in Pseudoknotted Structure According to thermodynamic models of RNA secondary structure, the free energy of the secondary structure of an RNA is calculated based on the summation of free energy of all the loops in it (Figure 1.2). Our definitions of hairpin and interior loops given below are standard for pseudoknot free structures. The definition of multi loops and external loop is generalized: Hairpin loop: A hairpin loop contains a single external base pair i.j and all bases in [i + l,j — 1] must be unpaired. For example, in Figure 2.3 18.22 is a hairpin loop external base pair. Interior loop: An interior loop contains two unpseudoknotted base pairs i.j and i'.j' where i < j' < j' < j and all bases in [i + 1, i' — 1] U [j' + 1, j — 1] are unpaired. We refer to i.j and i'.j' as the interior loop external and internal base pairs respec-tively. There are two special cases of interior loops: stacked pairs, for which i' = i + 1 and j' = j — 1 and bulge loops, for which either i ' = i + 1 or j' = j — 1 (but not both). 11 For example, in Figure 2.3 26.34 and 28.32 are the external and internal base pairs of an interior loop. 11.37 and 12.36 form a stacked pair, and 16.23 and 18.22 form a bulge loop. Multiloop: A multiloop contains an external base pair i.j and k > 1 tuples (ii,ji), (12, J2), (ik, jk) where ii; jt, 1 < I < k and i < i\ < j\ < i2 < J2 < ••• < ik < jk < j and all bases in [i, j] — 1 < I < k are unpaired. Furthermore, if ii-ji, 1 < I < k then k should be at least 2. In Figure 2.3 12.36 is a multiloop external base pair with (15,24) and (26, 34) as tuples. External loop: An external loop contains k > 0 tuples (*2, J2), (ik,jk) where ii;jl, 1 < I < k and %\ < j\ < « 2 < J2 < ••• < ik < jk, along with the bases in [l;n] — Ui<;<fc[i;; ji], all of which must be unpaired. In Figure 2.3, (1,40) and (41,82) are the tuples of an external loop tuples. We next introduce further types of elementary structures which are the consequence of having pseudoknotted base pairs and pseudoknotted regions. Several algorithms that predict pseudoknotted secondary structures, implicitly assign energies to these types of loops. Pseudoknotted loop: Assume that [i; j'] is pseudoknotted region. Let [i\\i'i] U [j[;ji] U [i2;i'2]U[j'2; } u [fm'Jm] be bands of [i;j'\. Let [px; q{\, [p2; qi\, \Pk\qk] be closed regions which are nested in [i; j] — (U^Ji; ; i[] Uj^ j^ [j[; ji]).. The pseudoknotted loop corresponding to [i; j'} is the set {(ii,ji), {i'^j'^l < I < m} U {(pi,qi)\l <l<k}, together with the bases in [ i , / ] - u t i ^ ; * ] - u ^ 1 [ i z ; t { ] - u £ 1 [ j l ' ; j J ] all of which must be unpaired. For example, in Figure 2.3, [1;40] corresponds to a pseudoknotted loop with [1;2] U [5; 6] U [3; 4] U [9; 10] U [7; 8] U [39; 40] as bands and [11; 37] as the closed region nested in it; [41; 82] also corresponds to a pseudoknotted loop with [41; 41] U [56; 56] U [42; 43] U [47; 48] U [44; 46] U [72; 82] U [49; 55] U [70; 71] as bands and [57; 69] as the closed region nested in it. Interior-pseudoknotted loops: An interior-pseudoknotted loop contains two base pairs i.j and i'.j' where i < %' < j' < j, all bases in [i + 1, i! — 1] U [j' + 1, j — 1] are unpaired and there is a band [bi; bi'} U \bj'\ bj] such that bi < i < bi' and bj' < j < bj. We refer to i.j and i'.j' as the interior-pseudoknotted loop external and internal base pairs respectively. For example, in Figure 2.3 1.6 and 2.5 are the external and internal base pairs of an interior-pseudoknotted loop. 12 Multi-pseudoknotted loops: A multi-pseudoknotted loop contains an external base pair i.j and k > 1 tuples (h,ji), (^,.72), (ikJk) where: (i) it;ji, 1 < I < k and i < h < ji < 12 < J2 < ••• <ik< jk < j, along with the bases in - ^i<i<k[k] all of which must be unpaired, (ii) there is a band [bi; bi']U[bj'; bj] such that bi < i < bi' and bj' < j < bj, and (iii) there is exactly one tuple (ii,ji) for w h i c h j i is not true (i.e [ii;ji] is not a closed region) and i\.j\ spans a band (bi < ii << bi' and bj' < ji << bj). For example, in Figure 2.3 44.82 is the external base pair of a multi-loop which has (45,73) and (74,81) as tuples. » Multi-Pseudoknotted and Interior-pseudoknotted loops are called spanning band loops. If, in the above definition of interior loop, we remove the constraint that i.j and i'.j' should be unpseudoknotted, then interior-pseudoknotted loop would be an special case of interior loop. Also if for exactly one of the tuples (ii,ji) (where ii-ji) in the above definition of multiloop we allow [if,ji] not to be closed, then multi-pseudoknotted loop would be a special case of multiloop. We can infer the following corollaries from the above definition and the definitions of Section 2.2: Corollary 2.3.1 Every unpaired base is in exactly one loop and every base pair is in exactly two loops. An external loop is a weakly closed region. Every other loop, except interior-pseudoknotted and multi-pseudoknotted loops, corresponds to a closed region. Interior-pseudoknotted and multi-pseudoknotted loops are not closed as their external base pair spans a band. Hairpin, interior and multi loops correspond to unpseudoknotted closed regions and pseudoknotted loops correspond to pseudoknotted closed regions. Corollary 2.3.2 Every spanning band base pair, except the inner closing pair of the band, is either an interior-pseudoknotted or a multi-pseudoknotted loop external base pair. 2.3.1 Location Attribute of Loops By introducing bands, each loop nested in a pseudoknot loop gets a location status as follows which is needed in calculating its free energy. in-Band loops: A Loop corresponding to a closed region which is nested in one of a bands' regions is an in-Band loop. The closing pairs of such a loop is in a band region. In Figure 2.3 50; 54 corresponds to an in-Band hairpin loop and [74; 81] corresponds to an in-Band pseudoknotted loop. out-Band loops: Let PR denote a pseudoknotted region. A loop corresponding to a closed region which is nested in PR but is not nested in any band of PR is an out-Band loop. 13 In Figure 2.3 11; 37 corresponds to an out-Band multiloop and [57; 69] corresponds to an out-Band pseudoknotted loop. span-Band loops: A Loop which is nested in a pseudoknotted region, but neither is in-Band nor out-Band is a span-Band loop. This kind of loop is either a M u l t i -pseudoknotted loop or an Interior-pseudoknotted loop. For example, in Figure 2.3 44; 82 corresponds to a multi-pseudoknotted loop and 1; 6 corresponds to an interior-pseudoknotted loop. 2.4 Parse Tree Let and [i';f] be closed regions. We say that is a child of if [i';f] is nested in and is not nested in any with i < i". We say that and [i';f] are siblings if they are children of the same closed region and i ^ i'. So the closed regions form a tree structure. A n ordered tree T(R) is called the parse tree of R if: (i) there is a 1-1 correspondence between nodes of the tree and closed regions of R, and (ii) if node V corresponds to closed region C then V is the parent of all the nodes whose corresponding closed regions are nested in C. The children of each node are ordered by the left index of the closed region. We also refer to T(R) by T(P) where P is the pattern representation of R. Assume that C is the closed region corresponding to node V and C \ , C m are the closed regions correspond to the children of V. Then we say that the pattern corresponding to C also corresponds to node V. Also, C = C — U™ x C i is called the private region corresponding to V and we refer to the pattern corresponding to C as the private pattern of V. (Figure 2.4) The parse tree of R contains information about i?'s closed regions, but it could be easily updated to contain other information such as band regions, loop types (see Chapter 5). 14 F i g u r e 2 .4 : P a r s e t ree for t he s t r u c t u r e i n F i g u r e 2 .3 . C o n s i d e r n o d e V = ( 1 2 , 3 6 ) as a n e x a m p l e . ( 1 5 , 2 4 ) a n d ( 2 6 , 3 4 ) a re V s c h i l d r e n . C = [12; 36] is V s c o r r e s p o n d i n g c l o s e d r e g i o n a n d i t s p a t t e r n is P = bcdeedcfggfb. C, t he p r i v a t e c l o s e d r e g i o n o f V, c o n t a i n s 12.36 as base p a i r a n d i t s p r i v a t e c l o s e d p a t t e r n is P' = bb. 15 Chapter 3 Related Work In this section, we describe related work in three cases. In Section 3.1 we describe work on classification of structural motifs that are found in pseudoknotted secondary structures. In Section 3.2 we discuss related work on algorithms for predicting pseudoknotted secondary structures, and means for classifying the structures handled by these algorithms. Related work on calculating the free energy of given secondary structure is discussed in Section 3.3. It should be noted that there is no related work on parsing a secondary structure, although parsing a pseudoknot free secondary structure is very similar to constructing an expression tree from an arithmetic expression whose parentheses are well-formed. 3.1 Classification Scheme There are several works ori pseudoknotted secondary structures which studied different pos-sible structures that could be formed as a consequence of having pseudoknotted base pairs [12, 9]. They tried to classify pseudoknots (pseudoknotted substructures) into different types and study the types of pseudoknots which occur in known biological structures. How-ever, their classification are incomplete and ambiguous (e.g. the definition of H-type and B-type pseudoknots) [10]. Our parsing algorithm can be used to classify different types of pseudoknots that arise in real structures. Although it is not part of this thesis, our notions of bands and our loop classifications can help to provide a precise definition for different type of pseudoknots. 3.2 Pseudoknotted Secondary Structure Prediction General pseudoknotted RNA secondary structure prediction is NP-hard and therefore sev-eral incomplete polynomial complexity algorithms have been proposed to solve this problem. Dynamic programming and heuristic algorithms are two widely used approaches. There are several dynamic programming algorithms for RNA secondary structure with pseudoknot prediction. Each of them can handle a restricted class of structures as 16 they try to find a minimum free energy (mfe) secondary structure for a limited classes of pseudoknotted structures. Rivas and Eddy (R&E) [13] proposed an algorithm with a complexity of 0(n 6 ) in running time which can handle a large class of structures. They use different free energy parameters for the loops which are nested in a pseudoknot loop and consider a loop (interior or multi) to be nested in a pseudoknot if there is a band with two closing base pairs il • ji and i\ •j[ such that ii < i < i[ and j[ < j < ji (span-band loop). Uemura et al. (U) [15] proposed algorithms of time complexity (9(n4) for simple pseudoknots and 0(n5) or more for the other more complicated pseudoknots. Their al-gorithms are based on tree adjoining grammar (TAG) and therefore are not so easy to understand. Akutsu (A) [2] re-formulated the method of Uemura et al. [15] and introduced sim-ple dynamic programming algorithms for RNA secondary structure prediction. Akutsu's algorithms are of time complexity (3(n4) for simple pseudoknots and 0(n 5 ) for more compli-cated pseudoknots. The algorithms are easy to understand and to modify and hence easier . to cope with various score functions (i.e. free energy functions). However, the algorithms of Akutsu [2] handles a more restricted class of structures than does the algorithm of Uemura et al. (U) [15]. Dirks and Pierce (D&P) [7] described an 0(n5) dynamic programming algorithm for computing the partition function and minimum free energy structure of an RNA (or DNA) strand. The class of pseudoknotted structures that their algorithm can handle consists of structures which pseudoknot loops have exactly two bands. They consider a loop to be nested in a pseudoknot loop (and hence use different parameters to calculate its free energy) if the loop is of the kind out-band. Lyngso and Pederson (L&P) [11] introduced a simple (9(n5) algorithm which can handle the structures with really simple pseudoknot loops. The class of structures they can handle allows pseudoknot loops with exactly two bands. They do not allow the pseudoknot loops nested in another one and also do not make a difference between the free energy of the loops whether they are nested in a pseudoknot loop or not. It is proved by Condon et al.[5] that there is actually a relation between the class of structures each of these algorithms can handle : P K F C L & P C D&P C A C U C R & E . (PKF represents the class of pseudoknot free structures) This shows that there is actually a trade off between the complexity of the algorithm and the generality class of structures it can handle. There are also several heuristic (and incomplete) algorithms for RNA secondary structure prediction [14, 8]. Heuristic algorithms find a locally minimum-energy secondary structure over the whole search space. 17 3.3 Free Energy Calculat ion Heuristic algorithms commonly use a function for calculating the free energy of substructures (as partial solutions) in the process of searching for a low-energy structure in the search space. But there is no explicit work on calculating the free energy of given secondary structure in linear time. Here we can use our linear time free energy calculation algorithm, to calculate the free energy of partial solution substructures. 18 Chapter 4 Parsing Algorithm In this chapter we introduce a linear time algorithm for parsing a secondary structure into its closed regions. A parsing algorithm creates the tree T(R) of closed regions in R. The algorithm for finding closed regions is described in Section 4.1. In Section 4.2, we prove that the closed region finding algorithm identifies all closed regions in the structure accurately. In Section 4.3 we explain how the parse tree is constructed gradually, by adding a node to the tree for each closed region, when it is identified. The last section explains the time complexity of the parsing algorithm, which is linear. > To build the tree of closed regions, our algorithm takes as input a linked list repre-sentation L of a secondary structure R for a strand of length n. In this representation, list elements are the base indices, with bidirectional links between adjacent elements and addi-tionally bidirectional links between paired indices. For convenience, we define the function bp(i) to be j if i.j G R or j.i G R, and to 0 if i is unpaired. 4.1 Closed Region Finding Algorithm The Build-Closed-Regions-Tree algorithm, presented in Algorithm 1, scans the list of indices from left to right and marks indices i with i < bp(i), so as to identify closed regions efficiently. The tree of closed regions is built gradually by calling the Add-To-Tree procedure from the algorithm right after identifying each closed region. We provide the intuition behind the algorithm in Section 4.1.1 and describe four invariants of it in section 4.1.2. 4.1.1 Intuition To identify a closed region it is enough to find its borders. Every unpseudoknotted base pair ,i.j, is the external base pair of a closed region and therefore i and j are the borders of i;j. But this is not true for pseudoknotted base pairs. According to the definition, a base pair i.j is pseudoknotted if there exists i'.j' with i! < i < j' < j of 4 < i' < j < j ' , where i in the former and i' in the latter case couldn't be a closed region border. 19 Scanning the list L of indices from left to right, if we reach j = bp[i) knowing that there is an index i', i < i' < j, whose pair j' has not been scanned yet, we can conclude that both i and i' are pseudoknotted and i' is not a closed region border. To facilitate this procedure we introduce three types of marks for base indices, B, P and N. A base index 1 < bp(i) is marked as (i) B (for "base pair found") if its base pair is found, (ii) P (for "pseudoknotted") if we know that it is pseudoknotted, and (iii) N ("not a border") if we know that it is not a border of a closed region. For example, in Figure 4.1(a): (a) When we scan index 7, index 3 is P-marked as 7 = 6p(3). (b) When we scan index 10, indices 8 and 9 are P-marked and iV-marked as 2 < i < 10 < bp(i) for i = 8 and i = 9. Also, 2 is P-marked and P-marked as 10 = 6p(2) and 2 < 8 < bp(2) < bp(8). Note that when index 10 has just been scanned, it is not clear what is the base pair of 1. Depending on the base pair of 1, 2 may or may not be the left border of a closed region, and so 2 is not A^-marked yet. (c) When we scan index 11, 1 is P-marked and P-marked and 2 is A^-marked, as now we know that 2 is in the same closed region as 1 and therefore 2 is not the left border of a closed region. More details on this example are given in Section 4.1.3. To facilitate the process of marking, and keep it working efficiently, Algorithm 1 makes use of a stack ST. While scanning the list and marking base indices, for efficiency reasons the algorithm removes base indices from the list when they are no longer needed. The following base indices are removed from the list L: (i) base indices which are not the border of any closed region, and (ii) base indices which are the borders of an already identified closed region. The stack and list also provide an efficient way to identify closed regions. For exam-ple, in Figure 4.1(a), when index 7 is scanned, its base pair, 3, is on top of the stack and this identifies that [3; 7] is a closed region. Identifying pseudoknotted closed region is more complicated but is explained more fully below (Section 4.1.2 and Section 4.1.3). Figure 4.1: (a) Arc diagram representation of an RNA secondary structure, (b) Parse tree for the structure in part (a). 20 a l g o r i t h m Build-Closed-Regions-Tree i n p u t : linked list L representation of structure R for a strand of length n o u t p u t : tree T of closed regions of R 1 initialize T to contain one node labeled [ l ,n] ; 2 initialize stack ST to be empty; 3 A := 1; 4 repeat 5 i f A < bp(X) t h e n 6 Push(A, ST); 7 e l se i f bp(X) — 0 t h e n / / A is unpaired 8 Remove(A, L); 9 else / / bp(X) < A 10 B-mark(6p(A)); 11 i f Top(ST) = bp(X) =Pred(A, L) t h e n / / (bp(X), A) is a closed region 12 Pop(ST) ; 13 i f !(6p(A) = 1 and A = n) t h e n Add-To-Tree([6p(A), A] ,T) ; 14 Remove(6p(A) : L); 15 e l se i f mark(Pred(bp(X), L)) = P8zB and bp(X) = Pred(A, L) t h e n / / A is the right border of a pseudoknotted region 16 h<- Pop (ST) ; / / note that h =Pred(6p(A), L) 17 Remove(/i, L ) ; 18 i f !(/» = 1 and A = n) t h e n Add-To-Tree([/i, A] ,T) ; 19 Remove(6p(-A), L); 20 else // bp(X) • A is pseudoknotted but A isn't the right border of a closed region 21 P-mark(6p(A)); 22 w h i l e Top(ST) > 6p(A) do 23 j^-Pop(ST); 24 P-mark( j ) ; 25 TY-mark(j); 26 i f mark{j) = BkNkP t h e n 27 Remove(j, L); 28 i f mark{bp(X)) = BkNkP t h e n 29 Remove(6p(A),L); 30 Remove(A,L); 31 A — A + l ; 32 u n t i l A = n + 1; 33 r e t u r n T A l g o r i t h m 1: B u i l d the tree of closed regions. 21 4.1.2 Invariants The following invariants are maintained when index / has just been scanned. The Mark Invariant formalizes what marks an index has when index I has just been scanned by the algorithm. The List and Stack Invariants formalize what is in the list L and stack ST when index / has just been scanned. Finally, the Closed Region Invariant formalizes how a closed region can be identified by the algorithm. Mark Invariant: For all indices i in L with i < I and i < bp(i), P-mark invariant: i is P-marked if and only if bp{i) < l; P-mark invariant: i is P-marked if and only if for some i', i < i! < bp(i) < I and bp(i) < bp(i'), or i' <i < bp(i') < I and bp(i') < bp(i); and TV-mark invariant: i is TV-marked if and only if there exists i' in the same closed region as i with i' < i < bp(i') < I. List Invariant: For all a < I, a is not in list L if and only if either (i) a is P-marked, P-marked, and ^-marked, (ii) bp(a) < a, or (iii) for some b <l, a; b. Stack Invariant: ST contains exactly those indices i < I such that i is in L and i is not ./V-marked. Closed Region Invariant: If h ^ 1 or / + 1 ^ n, then [h, I + 1] is a closed region of R if and only if bp(l + 1) < I + 1, h is on the top of the stack ST, bp(l + 1) is / + I's predecessor in list L, and either (i) h — bp(l +1) or (ii) h is P-marked and P-marked and h is bp(l + l)'s predecessor in list L. Note that [1; n] is a special case of closed region which is added to the T(R) in the first step by the algorithm, so we excluded it from the closed region invariant and the following sections in this chapter. 4.1.3 Example Consider the structure in Figure 4.1(a) as an example. Table 4.1 shows the list, stack, marks and closed region identified, as Algorithm 1 is executed on the example. (Here, we do not show the links in L , but they can be easily infered from the figure.) When index 6 is scanned by the algorithm, it is removed from L as it is unpaired. Then Top(ST) = 3 — bp(7) = Pred(7), and therefore algorithm identifies (3,7) as a closed region, and add (3,7) to the parse tree, when it scans.7. 22 When index 10 = bp(2) is scanned by the algorithm, indices 8 and 9 are P-marked and iV-marked, as 2 < i < 10 < bp(i) for i £ {8,9}, and are removed from the stack ST. Moreover, index 2 is P-marked and P-marked. When the algorithm scans index 11 = bp(l), index 2 is iV-marked and as it was P-marked and P-marked before it is removed from L (and therefore ST). Moreover, index 1 is P-marked and P-marked. When index 12 = bp(9) is scanned by the algorithm, index 9 is P-marked and as it was P-marked and iV-marked before it is removed from L. Then 8 = 6p(13) — Pred(13) and Top(ST) — 1 = pred(8) and 1 is P-marked and P-marked, but 13 = n, therefore the algorithm does not identify (1,13) as a closed region, since (1,13) is already in the tree. At the end, the parse tree for the structure in Figure 4.1(a) has (1,13) in the root and (3,7) as its child (Figure 4.1(b)). A List L with marks on elements Stack ST Closed region identified 0 {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13} {} 1 {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13} {i} 2 {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13} {1,2} 3 {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13} {1, 2,3} 4 {1, 2, 3, , 5, 6, 7, 8, 9, 10, 11, 12, 13} {1, 2, 3} 5 {1, 2, 3, , , 6, 7, 8, 9, 10, 11, 12, 13} {1, 2, 3} 6 {1, 2, 3, , , ,7, 8, 9, 10, 11, 12, 13} {1, 2, 3} 7 {1, 2, , , , , ,8, 9, 10, 11, 12, 13} {1. 2} (3,7) 8 {1, 2, , , , , ,8, 9, 10, 11, 12, 13} {1, 2, 8} (3,7) 9 {1, 2, , , , , ,8, 9, 10, 11, 12, 13} {1, 2, 8, 9} (3,7) . 10 {1, 2, , , , , , 8, 9, , 11, 12, 13} B P P P N N {1, 2} (3,7) 11 {1, , , , , , , 8, 9, , ,12, 13} B P P P N N {1} (3,7) 12 B P P N {1} (3,7) 13 { 1 5 5 J J 5 1 1 ) 5 5 5 } {} (3,7) Table 4.1: Execution of Algorithm 1 on secondary structure of Figure 4.1(a). There is one row for each base index, plus an initial row, 0. Column 1 shows which base index is scanned by the algorithm. The remaining columns describe what happens when index i is scanned (i.e when A = i). Column 2 gives the list L along with the marks on list elements and column 3 gives the stack, when i has just been scanned (when the algorithm reaches line 31). Column 4 lists which closed region has been identified (and added to the tree) by the algorithm, if any, when i is scanned. 23 4.2 Closed Region Finding Algorithm: Proof We prove in Theorem 4.2.1 that Algorithm 1 calls Add — to — tree([i,j],R) if and only if is closed. First , we prove that the closed region invariant, and therefore the three other invariants, are maintained by the algorithm. We prove in Claim 4.2.1 that all the above invariants are maintained by the algo-rithm, on an input list L that represents structure R for a strand of length n, when I has just been scanned. Formally, we say that I has just been scanned when line 32 of the algorithm is reached with A = Z + 1 or line 4 is reached and I = 0. We also say I is scanned when lines 5 to 30 of the algorithm are executed with A = I. Claim 4.2.1 For any input L all invariants are true when I has just been scanned, for all l,0<l<n. Proof We use induction to prove Claim 4.2.1. It is obviously true when we have I = 0 (base case). Let I > 0. Assume that the claim is true for all k < I (Induction Hypothesis). We prove that the claim is true for Z. 4.2.1 Mark Invariant Proof Here we prove that all three mark invariants are true. P-mark: When I is scanned, if i < bp(i) = I then i is P-marked on line 10 of the algorithm and no other index is P-marked. Since marks are never removed from an index, by the Induction Hypothesis the B-mark invariant is true for all i with i < bp(i) < I, therefore the B-mark invariant is true. P-mark: We can decompose each case of the P-mark invariant into two subcases: • The case "For some i', i < i! < bp(i) < I and bp(i) < 6p(i')" is decomposed to Case 1.1 For some i', i < i' < bp{i) < I and bp(i) < bp(i') Case 1.2 For some i', i < i! < bp(i) = I and bp(i) < bp(i'), which is equivalent to bp(l) <i' < I < bp(i'), where bp(i) = I. • The case "For some i', i! < i < bp(i') < I and bp(i') < bp(i)" is decomposed to Case 2.1 For some i', i' < i < bp(i') < I and bp(i') < bp(i) Case 2.2 For some i', i! < i < bp(i') — I and bp{i') < bp(i) which is equivalent to bp(l) < i < I < bp(i), where bp(i') — I. Now we need to prove that i is P-marked if and only if one of the above four cases holds. 24 Part 1 i is P-marked if one the above four cases holds. Proof If one of the two first subcases, Case 1.1 or Case 2.1, holds then i is P-marked by the Induction Hypothesis. Assume that the Case 1.2 holds, which means that bp(l) < i' < I < bp(i'). If bp(l) is not P-marked when I — 1 has just been scanned then one of the following occurs: (i) i' is not TV-marked and therefore (by the Stack invariant and Induction Hypothesis) bp(l) is not on the top of the stack. In this case the algorithm skips line 11. (ii) i' is iV-marked. In this case there exists i", in the same closed region as i', with i" < i! < bp(i") < I (by the N-mark invariant and Induction Hypothesis). If bp(l) is not>-already P-marked then (by the P-mark invariant and Induction Hypothesis) there is no i" with i" < bp(l) < bp(i") < I. Therefore bp(l) < i" < I. We choose the first such i". This i" can not be iV-marked, otherwise it won't be the first such i" (by the N-mark invariant). Therefore i" is in L and separates bp(l) from I in L. So bp{l) is not Vs predecessor and the algorithm skips line 15. As a result, the algorithm reaches line 20 and bp(l) is P-marked in line 21. As the last case, assume that Case 2.2 holds, which means that bp(l) < i < I < bp{i). If i is not P-marked when I — 1 has just been scanned then, using the same explanation as above, the algorithm skips lines 11 and 15 and reaches line 20. bpil) < i < I < bp(i) and i is not P-marked, so i is not TY-marked (otherwise it is iV-marked in line 25 and therefore P-marked in line 24). Therefore i is still on the stack (by the List, Stack and Mark invariants) and it is above bp(l) on the stack. So i is P-marked in line 24. • Part 2 One of the above four cases holds if i is P-marked. Proof If i is P-marked when l—l has just been scanned then by the Induction hypothesis one of the two first subcases, Case 1.1 or Case 2.1, holds. Assume that i is not P-marked when I — 1 has just been scanned and it receives a P-mark when I is scanned, i can receive a P-mark only in lines 21 or 24. Assume that % receives P-mark in line 21, which means that i = bp(l). As the algorithm reaches that line, which means that it skipped lines 11 and 15, so [bp(l),l] is not a closed region and there exists an i' such that either i' < bp (I) < bp(i') < I or bp{l) < i' < I < bp(i') (by the closed region definition). The former case means that bp(l) is already P-marked , contradicting the fact that bp(l) is not P-marked yet, therefore the latter case is true and Case 1.2 holds. As in the last case, assume that i receives P-mark in line 24, where i ^ bp(l). It means that i is above bp(l) on the stack and therefore bp(l) < i < I. Also 25 I < bp(i), otherwise either [i,bp(i)\ is a closed region, and therefore i is not on the stack (by the list and the Stack invariants), or [i, bp(i)] is not a closed region and (by the closed region definition) (i,bp(i)) is a pseudoknotted base pair and i is already P-marked. Therefore bp(l) < i < I < bp(i) and Case 2.2 holds. • iV-mark: We need to prove that i is iV-marked if and only if there exists i' in the same closed region as i with il < i < bp(i') < I. Part 1 i is A-marked if there exists i' in the same closed region as i with i' < i < bp{i') <l. Proof If there exists i' in the same closed region as i with i' < i < bp(i') < I then by the Induction Hypothesis i is iV-marked when I — 1 has just been scanned. Assume that i is not Af-marked when I — 1 has just been scanned and i' < i < bp(i') = I. If bp(l) < i < I < bp(i) then using the same proof as Part 1 - Case 2.2 in above (by the P-mark invariant) i is A^-marked in line 25. If bp(l) < i < bp(i) < I then i is still on the stack (by the List and the Stack invariants) and in the list L. So bp (I) is not on the top of the stack and the algorithm skips line 11. Furthermore i separates bp(l) and I in L so bp{l) is not the predecessor of I which causes the algorithm to skip line 15. As a result, the algorithm runs line 20 and i is A^-marked in line 25. • Part 2 There exists i' in the same closed region as i with i' < i < bp(i') < I if i is A^-marked. Proof If i is Af-marked when I — 1 has just been scanned then by the Induction Hy-pothesis there exists %' in the same closed region as i, with i' < i < bp(i') < I. Assume that i is not A^-marked when I — 1 has just been scanned and it receives an A-mark when / is scanned, i can receive an A^-mark only in line 25. In this case i should be on the stack above bp(l) which means that bp(l) < i < I. If I < bp{i) then it is clear that i is in the same closed region as bp(l). Otherwise, if bp(i) < I, as i is still on the stack it should be in the same closed region as bp(l) (by the List and the Stack invariants). Therefore bp(l) < i < I and i is in the same closed region as bp{l). • Having all of these, the Mark invariant is true. 26 4.2.2 List Invariant Proof We need to prove that for all a < I, a is not in the list L if and only if (i) a is P-marked, P-marked, and /V-marked, (ii) bp(a) < a, or (iii) for some b < I, a; b. Part 1 a is not in L if any of the above three cases holds. Proof If a < I then by Induction Hypothesis when I — 1 has just been scanned a is not in L if either (i) a is P-marked, P-marked and /V-marked, (ii) bp(a) < a, or (iii) for some b < I, a;b. Now we need to prove that by scanning I, (i) a < I is removed from L if a receives its last mark, which could be either N, P or P , (ii) I is removed from L if bp(l) < I, and also (iii) index h, h < Z, is removed from L if h;l. (i) Assume that when I is scanned a receives its last mark, which is either N, P or P. If N is the last mark then a should get it in line 25 and the algorithm checks right after that, in line 26, whether a has all three marks and removes it. If P is the last mark, I should get it either in line 21 or 24. In the former case the algorithm checks in line 28 and in the latter checks in line 26 whether a has all three marks and removes it. If B is the last mark then a should get it in line 10 and a = bp(l). Then the algorithm continues either from line 15 or 20 (the algorithm should skip line 11 as a = bp(l) is A^-marked before and by the Stack invariant a — bp{l) is not on the stack). If the algorithm runs line 15 then a = bp(l) is removed in line 19, and if it runs line 20 then a = bp(l) is removed in line 29. (ii) Assume that bp(l) < I. Then I is removed from the list either by line 8, if I is not paired, or line 30, if I is paired. (iii) Assume that there exists h < I such that h;l. Then I is removed from the list either in line 14 or line 17 by the Closed region invariant. • Part 2 Either of the above three cases holds if a is not in L. Proof If a < I is not in the list L when / — 1 has just been scanned then either (i) a is P-marked, P-marked and A^-marked, (ii) bp(a) < a, or (iii) for some b < I, a; b. Assume that a is removed from L when I is scanned. An index is removed from L in either of these lines: 8,14,17,19,27, 29 or 30. (i) If a is removed from L either in line 19, line 27 or line 29 then a has all three marks, P, A^ and P. 27 (ii) If a is removed from L either in line 8 or line 30 then a — I and bp(a) < a. (iii) If a is removed from L either, in line 14 or line 17 then by the Closed region invariant there exists h < I such that h; I. • 4.2.3 Stack Invariant Proof When the algorithm removes an index i from the list L it removes i from the stack too (lines 12 and 16). The algorithm also removes all indices, which are iV-marked, from the stack in line 23. Furthermore, an index is not removed from the stack in any other place in the algorithm. So the stack invariant is maintained. 4.2.4 Closed Region Invariant Proof We need to prove that [h, I + 1] is a closed region of R if and only if bpil + 1) < I + 1, h is on the top of the stack ST, bp(l + 1) is / + I's predecessor, and either (i) h = bp(l + 1) or (ii) h is P-marked and P-marked and h is bp{l + l)'s predecessor. Part 1 [h, I + 1] is a closed region if one of the above two cases holds. Proof (i) Assume that when I has been just scanned h = bp{l + 1) < I + 1 is on the top of the stack and is / + I's predecessor. It means that there is no i' such that h < i' < I + 1 < bp(i') (otherwise by the List invariant, i' is still in the list and h could not be / + I's predecessor) or i' < h < bp(i') < I + 1 (otherwise by the N-mark invariant h is A-marked and therefore by the Stack invariant should not be on the stack). Therefore [h, I + 1] is a closed region. (ii) Assume that h is on the top of the stack and is P-marked and P-marked. There-fore there is no i' such that i' < h < bp(i') < I and h is in the same closed region as i'. (Otherwise h is A-marked, by the N-mark invariant). Suppose that h is bp(l + l)'s predecessor which is I + I's predecessor. It means that there is no base pair i' such that h<i'<l + l< bp(i') (otherwise i' is on the top of the stack by the Stack invariant). Therefore [h, I + 1] is a closed region. • Part 2 One of the above two cases holds if [h, I + 1] is a closed region. Proof [h, I + 1] is a closed region so bp(l + 1) < I + 1. h is either paired with / + 1 or not : 28 (i) Assume that h = bp(l + 1). Then, by Lemma 4.2.1 below, there is no index remaining in L between h and I +1. So h is on the top of the stack (by the Stack invariant) and it is I + l's predecessor. Therefore the first case, (i), of the Closed region invariant is satisfied. (ii) Assume that h is not paired with l + l. Then [h, I + 1] is a pseudoknotted closed region. It is clear, by Lemma 4.2.1, that the only index remaining in L between h and I + 1 is bp{l + 1). [bp(I + 1); I + 1] is not a closed region (otherwise [h, I + 1] is not a closed region by the closed region definition) and as [h, I + 1] is a closed region so there is no index i' such that bp{l + 1) < i' < I + 1 < bp(i'). Therefore, there is an index i' such that i' < bp(l + 1) < bp(i') < I + 1 which means that bp (I + 1) is iV-marked (by the N-mark invariant) and is not on the stack. So h is on the top of the stack (by the Stack invariant) , h is P-marked and B-marked (by the Mark invariant and closed region definition) and it is bp{l + l)'s predecessor in the list (by the List invariant), which is I + l's predecessor itself. Therefore the second case, (ii), of the Closed region invariant is satisfied. • Lemma 4.2.1 If [h;l] is a closed region then, when I — 1 has just been scanned, the only indices a, h < a < I, which remain in L are : h, I and bp(l). Proof All a such that h < bp(a) < a < I are removed from L when I — 1 has just been scanned by the List invariant. All a such that h < a < bp(a) < I are P-marked (by the B-mark invariant). If a is just P-marked then there is no i' such that il < a < bp(i') < bp(a) or o < i! < bp(a) < bp(i') (by the P-mark and N-mark invariants) and therefore [a, bp(a)} is a closed region and a is removed from L by List invariant . If a is P-marked too, then either a;b (for some b, b < I) or a is iV-marked too (by the N-mark invariant). In both cases a is removed from L by the List invariant. So the only indices which remain are: h, I and bp(l). • • 4.2.5 Add-to-tree Calls Theorem Theorem 4.2.1 On input R, Algorithm 1 calls Add-to-tree([h, I + 1],R) if and only if [h, 1 + 1} is a closed region of R. Proof By the closed region invariant, the proof is as follows: We will show that Algorithml calls Add-to-tree([h, I + 1],R) if and only if when I has just been scanned, 29 bp(l + 1) < I + 1, h is on the top of the stack ST, bp (I + 1) is Z + l's predecessor, and either (i) h = bp(l + 1) or (ii) h is P-marked and P-marked and h is bp{l + l)'s predecessor. Part 1 Algorithml calls Add-to-tree([h, I + 1], R) if one of the above two cases holds. Proof Assume that the first case holds, so when Z has just been scanned, bp(l + 1) < I + 1, h is on the top of the stack ST, h = bp(l + 1) and h is Z + l's predecessor. So while scanning l + l, the algorithm runs line 10 and then as all conditions in line 11 are satisfied it calls Add-to-tree([h, I + 1],R). Now assume that the second case holds, so when I has just been scanned, bp(l + 1) < l + l, h is on the top of the stack ST and it is P-marked and P-marked and also is 6p(Z+l)'s predecessor, where bp(l + l) is Z + l's predecessor. So scanning Z + l algorithm runs line 10 and then skips line 11 as h is on the top of the stack and not bp(l + l). Then as all conditions in line 15 are satisfied the algorithm calls Add-to-tree([h,l + 1],R). • Part 2 One of the above two cases holds if Algorithml calls Add-to-tree([h, I + 1},R). Proof The algorithm calls Add-to-tree([h, I + 1],R) if it reaches either line 13 or 18 when scanning Z + l . The algorithm reaches line 13 only if bp(l + 1) is on the top of the stack and bp(l + 1) =Pred((Z + 1), L). In this case it identifies [bp(l + 1), Z + 1] as a closed region which means that h = bp(l + 1). So the first case holds. The algorithm reaches line 18 (and skips line 11) only if bp(l + 1) is not on the top of the stack, bp(l + 1) =Pred(Z + 1) and Pred(6p(Z + 1),L) is marked as P & S . In this case it identifies [h, I + 1] as a closed region where h is on the top of the stack. Note that h should be the predecessor of (bp(l + 1),L). So the second case holds. • • 4.3 Construct ing the Parse Tree The Parse tree of structure R is built gradually by calling Add-To-Tree procedure from Closed region finding algorithm whenever a closed region is identified. The Add-To-Tree(\h,l},T) procedure works as follows: • Create a new node [h, 1} if [h, 1} ^ [l,n]. 30 • For each child [a, b] of the root [l,n] of T,. check whether a > h, and make [a, b] a child of [h, I] (and thus no longer a child of [1, n]) if this is the case. • Make [h,l] a child of [l,ra]. By maintaining a list of the children of each node ordered by the left border index, and traversing them from the greatest to the smallest, this can be done efficiently in linear time in the number of children [h, 1} has. 4.4 Parsing Algorithm Running Time Algorithm 1 scans the whole list once from left to right. So each index is visited once, and is pushed on the stack, popped off the stack, removed from the list, and receives each mark at most once. Therefore Algorithm 1 accesses each index for a constant number of times, and ignoring the cost of calls to the Add-To-Tree, procedure, it is linear in the number of indices, n, in the list. The Add-To-Tree(\h,l],T) procedure is linear in the number of children [h, I] has. So the whole cost of calls to Add-To-Tree procedure is linear in the number of nodes in the tree T, which is at most n. Therefore the parsing tree algorithm is linear in the number of indices, n, in the list and thus in the number of base pairs in the structure R. 31 Chapter 5 Energy Calculation Standard thermodynamic models calculate the free energy of each loop using certain infor-mation about the loop, such as: (i) its external base pair, (ii) an ordered list of its base pairs or tuples, (iii) number of unpaired bases in the loop, and (iv) type and the location of the loop (see Section 2.3.1). The free energy of a secondary structure is calculated by summing the free energy of its component loops. Here we represent an algorithm that identifies all loops in the structure plus needed information about them in linear time, using the parsing algorithm. Then all we need to calculate the the free energy of the structure is to add up the free energy of all loops, and clearly it can be done in linear time. Given a structure R of length n, the parsing algorithm creates the parse tree, T(R), where we have [l;n] in the root and every other node in the tree corresponds to a closed region of R. The external loop of R corresponds to the root node. Every other loop, except interior-pseudoknotted and multi-pseudoknotted loops, corresponds to a closed region and therefore to a node in the tree. We can easily derive all the information, except the location, for unpseudoknotted loops from T{R) in linear time. Interior-pseudoknotted and multi-pseudoknotted loops are not closed as their exter-nal base pair spans the band. To identify these types of loops, the location status of all the other loops, and also to complete the information needed for pseudoknotted loops, we need to find the band regions for each pseudoknotted closed region(pseudoknotted loops). Having band regions for a pseudoknotted closed region i;j and knowing the location status of its children, we have all needed information for i; j's corresponding pseudoknotted loop. The algorithm does the following steps to achieve the above goals. Loop type finding: The algorithm decides the type of loop corresponding to each closed region. Band finding: The algorithm finds the list of bands for each pseudoknotted closed region. At the end of this step an ordered list of bands regions (ordered by the left border 32 . index) is assigned to the corresponding pseudoknotted closed region. Loop location finding: The algorithm figures out the location status of all loops in the tree, using the ordered list of band regions. Inner loop finding: The algorithm identifies all Multi-pseudoknotted and Interior-pseudoknotted loops, which we refer to by Inner Loops, in addition to their necessary information. 5.1 Loop Type Finding Each closed region i; j corresponds to a loop whose type can be determined as follows, using the definition of each loop (Chapter 2). • Pseudoknot loop Closed region i; j corresponds to a pseudoknotted loop if i.j £ R (i;j is a pseudoknotted closed region). • Hairpin Loop Pseudoknot free closed region i; j corresponds to a hairpin loop if it has no children in T(R). • Interior Loop Pseudoknot free closed region i; j corresponds to an interior loop if it has exactly one child in T(R), which is also not a pseudoknotted closed region. • Multi Loop Pseudoknot free closed region i;j corresponds to a multiloop if either (i) it has more than one child in T(R), or (ii) it has one child which is a pseudoknotted closed region. 5.2 Band Finding Our band finding algorithm is given in Algorithm 2. Let C = i;j be a pseudoknotted closed region in R. We defined a linked list rep-resentation L for R, in Chapter 4. Now let BL be a sublist of L starting from index i to index j. In the first step we remove the following from BL: (i) unpaired base indices, and (ii) base indices corresponding to nested closed regions. Then by Corollary 2.2.3, BL only contains indices corresponding to spanning band base pairs. In other words, BL is a linked list representation of spanning band base pairs in i;j. Inspired by Corollary 2.2.2, Algorithm 2 scans the list BL from left to right to identify bands and their region's borders. Next-leftBase(6^, BL) returns i, the first index after b\ in BL for which bp(i) > i. i.bp(i) will be the outer closing pair of the next band. We could easily get the ordered list of band regions, BandRegions (ordered by the left border index) by doing the following after identifying each band: (i) removing the base 33 a l g o r i t h m Band-Finding i n p u t : linked list BL representation of spanning band base pairs in i;j 1 h := i; 2 r epea t 3 bj := bp(bi); // bi.bj is the outer closing pair of a band, B 4 b[ := ^ 5 &$:=&$; 7 w h i l e Next(6^, BL) = bp(Prev(b'j,BL)) do 8 b\ := Next^.BL); 9 ^ :=Prev(6;.,£L); // b[.b'j is the inner closing pair of the band B II So B = [b^b'tl U ty; bj] is a band of i;j 10 bi :=Next-leftBase(^,BL); 11 u n t i l bi = j + 1; A l g o r i t h m 2: Find Bands of a pseudoknotted closed region. indices corresponding to the band regions from BL, and instead (ii) storing the band regions borders. For example, in Figure 5.1 BL corresponding to 1;42 includes the following base pairs: 1.16, 2.8, 3.7, 4.42, 5.33, 6.32, 9.31 and 15.30. Algorithm 2 figures out the bands in the following order: (1) [1; 1]U[16; 16], (2) [2;3]u[7;8], (3) [4; 6]U[32; 42], and [9; 15]U[30; 31]. So we have the ordered list of band regions as BandRegions = {[1; 1], [2; 3], [4; 6], [7; 8], [9; 15], [16; 16], [30; 31], [32; 42]}. R u n n i n g T i m e : We can do the first step for all of the closed regions by first removing all unpaired bases from L and then removing nested closed regions (which all have already been identified) from each closed region. A closed region i';j' can be removed (from its parent closed region) in constant time by making Prev(i', L) and Next(/, L) to be adjacent (by updating the links). The former is linear in n, number of base pairs, and the latter is linear in the number of nodes in the tree, which is at most n. So the whole cost of the first is linear in n (number of base pairs in R). Algorithm 2 scans £?L,the spanning band base pairs list, once. So it is linear in the size of BL. Therefore it takes O(n) time for all of the closed regions. Thus, finding BandRegions list for all pseudoknotted closed regions in R is linear in n (number of base pairs in R). 5.3 Loop Location Finding For pseudoknotted closed region C, let ChildList be the list of its children ordered by the left border index. So having BandRegions, an ordered list of band regions, (Section 5.2) 34 1 2 3 4 5 6 7 8 9 1011 12 13 1415 16 17 18 192021 2 2 2 3 2 4 2 5 2 6 2 7 2 8 2 9 3 0 3 1 32 3 3 3 4 3 5 36 3 7 3 8 3 9 4 0 4 1 42 Figure 5.1: Arc diagram representation of an RNA secondary structure (closed region [41; 82] in Figure 2.3(c)). we can easily figure out the location status of C's children. All we need to do is to scan both ordered list from left to right (in parallel, similar to Merge Sort procedure) and check whether a child is nested in a band region or not. A child corresponds to.a in-Band loop in the former and to a out-Band loop in the latter case. This can be done in linear time in the size of BandRegions and ChildList. So the whole cost for all of the closed regions is linear in n (number of base pairs in R) plus the number of nodes in the tree, which is at most n. Therefore, loop location finding algorithm is linear in n (number of base pairs in R). 5.4 Inner L o o p F ind ing By Corollary 2.3.2, each spanning band base pair, except the inner closing pair of the bands, corresponds to either an interior-pseudoknotted or a multi-pseudoknotted loop external base pair. Inspired by Corollary 2.3.2 and definitions for interior-pseudoknotted and multi-pseudoknotted loops (Chapter 2), the Inner Loop Finding algorithm does the following for each pseudoknotted closed region i;j to identify all loops of interior-pseudoknotted and multi-pseudoknotted types: 1. Take the link list representation of spanning band base pairs, BL (Section 5.2), and ordered list of children, ChildList (Section 5.3) as input.. 2. Similar to loop location finding algorithm (Section 5.3) scan BL and ChildList from left to right and assign to each index i the number of children which are nested between i and Next(i), Nested(i), along with the list of their indices, Tuples(i). 3. Each index i, bp(i) > i, in BL corresponds to an interior-pseudoknotted loop with i.bp(i) as the external and Next(i, BL).Pvev(bp(i), BL)) as the internal base pairs, if the followings are satisfied: (i) Next(i, BL) = bp(Prev(bp(i),BL)) (ii) Nested(i) = Nested(Prev(bp(i), BL)) = 0. 35 4. Each index i, bp[i) > i, in BL corresponds to a multi-pseudoknotted loop with i.bp(i) as external base pair if the followings are satisfied: (i) Next(i, BL) — bp(Pvev(bp(i), BL)) (ii) Nested(i) > 0 or Nested(Prev(bp(i), BL)) > 0. Children corresponding to Tuples(i) and Tuples{Prev{bp(i),BL)) along with (Next(i, BL), Prev(bp(i), BL))) will be the tuples corresponding to this multi-pseudoknotted loop. Running Time: Step 2 can be easily (similar to Merge Sort procedure) done in linear time, in the size of BL and ChildList, for each pseudoknotted closed region Then the algorithm needs one scan of BL to identify all inner loops corresponding to i;j. So the above algorithm works in linear time, in the size of BL and ChildList, for each pseudoknotted closed region i;j. Thus, the whole cost is linear in n (number of base pairs in R) plus the number of nodes in the tree, which is at most n. Therefore, finding all inner loops of the structure R is linear in n (number of base pairs in R). 36 Chapter 6 Akutsu's Algorithm Structure Class A k u t s u [2] i n t r o d u c e d s i m p l e d y n a m i c p r o g r a m m i n g a lgo r i t hms for R N A secondary s t ruc-tu re p r e d i c t i o n . T h i s a l g o r i t h m is capable of p r e d i c t i n g the secondary s t ruc tu re for a re-s t r i c t ed class of p s e u d o k n o t t e d R N A s t ruc tures . W e analyse the de f in i t i on of pseudoknots p r o v i d e d i n A k u t s u ' s [2] w o r k a n d present a concise cha rac t e r i za t i on of the class of s t ruc tures A k u t s u ' s a l g o r i t h m c a n handle . A k u t s u def ined pseudoknots i n the secondary s t ruc tu re to be e i ther s i m p l e or recur-sive where recurs ive pseudokno t is a genera l ized ve r s ion of s i m p l e pseudokno t . W e discuss s imp le pseudoknots i n Sec t i on 6.1 a n d recurs ive pseudoknots ( w h i c h is equivalent to sec-o n d a r y s t ruc tu re w i t h recurs ive pseudoknots ) i n Sec t ion 6 . 3 . I n S e c t i o n 6 .2 we descr ibe secondary s t ruc tures w i t h s imp le pseudoknots w h i c h are an in t e rmed ia t e class of s t ruc tures def ined b y A k u t s u [2]. 6.1 Simple Pseudoknot Structures Definition 6.1.1 Akutsu definition for simple pseudoknot: Let P be a pattern of size In over an alphabet E of size n. P is called a s imp le pseudokno t if there exist positions (indices) jo = jo(P) a n d j'o = J'o(P) f l < Jo — Jo < 2 n j for which the following conditions are satisfied: • A l : Each a G E satisfies either 1 < Left(a) < j'0 < Right(a) < jo or j'0 < Left(a) < jo < Right(a) < 2 n . • A2: If a and b, for each a,b G E , satisfy either Left{a) < Left(b) < j'0 or j'0 < Left(a) < Left{b), then Right(a) > Right(b) holds. The two indices jo and j'0 witness that P is a s imp le pseudokno t and we refer to them as witnesses for P. (Figure 6.1(a)) 3 7 (a) (b) (c) F i g u r e 6 . 1 : (a) S e c o n d a r y s t r u c t u r e w i t h p a t t e r n P = abebcaddce: S i m p l e p s e u d o k n o t w i t h jo = Left(e) a n d j0 = Left(d) as wi tnesses , (b) S e c o n d a r y s t r u c t u r e w i t h p a t t e r n P = abbcaddeec: R e c u r s i v e p s e u d o k n o t as P = abbcaPieec w h e r e Pi = demand abbcaeec a re b o t h s i m p l e p s e u d o k n o t . (c) S e c o n d a r y s t r u c t u r e w i t h p a t t e r n P = abbcaddece: n e i t h e r s i m p l e o r r e c u r s i v e . A c c o r d i n g t o A l , s y m b o l s i n a simple pseudoknot P c a n b e d i v i d e d i n t o t w o g r o u p s : • G l : e a c h s y m b o l a G Gl s a t i s f i e s 1 < Left{a) < j'0 < Right(a) < jo-• G2: e a c h s y m b o l a ; £ G2 s a t i s f i e s j'0 < Left(a) < j o < Right(a) < 2n. S o . i t c a n b e d e r i v e d d i r e c t l y f r o m A2 t h a t P J. Gl i s o f t h e f o r m a\a2...aTar...a2ai, w h e r e a ; G G 1 , V 1 < i < r. T h e s a m e i s t r u e f o r P J, G2. F o r e x a m p l e , i n t h e s t r u c t u r e o f F i g u r e 6 . 1 ( a ) , a, b £ Gl a n d c, d, e 6 G2. Definition 6.1.2 Our definition for simple pseudoknot: A pattern P is a s i m p l e s t p s e u d o k n o t if and only if it admits either of these two cases: ' . • BI: It is equal to a\a\. • B2: It is equal to either aiaiP\aia\P2 or a\PiaiaiaiP2, where a\Pia\P2 is a s i m p l e s t p s e u d o k n o t . A pattern P. is a s i m p l e p s e u d o k n o t if and only if either it is a s i m p l e s t p s e u d o k n o t or it is equal to a\P\aiaiai+i... arar ... ai+\aiP2, for some a\, a ; , ...ar G £ , where a\P\aiP2 is a s i m p l e s t p s e u d o k n o t . Corollary 6.1.1 It can be easily infered, by using simple induction on the definition of s i m p l e s t p s e u d o k n o t , that there is no L e f t symbol after (on the right side of) the second occurrence of a\, where a\ is the first symbol in the pattern. Then, it can be easily infered, from the s i m p l e p s e u d o k n o t definition, that a s i m p l e s t p s e u d o k n o t pattern is equivalent to the s i m p l e p s e u d o k n o t pattern in which there is no L e f t symbol after (on the right side of) the second occurrence of a\, where a\ is the first symbol in the pattern. 38 a b e b c a d d c e a b b c a d d e e ' c a b b e a d d e ' c e (a) (b) ' (c) Figure 6.2: Arc diagram representation of structures in Figure 6.1. 6.1.1 Equivalence of Definitions Theorem 6.1.1 Definition 6.1.1 and 6.1.2 are equivalent. Proof Part 1 If P satisfies Definition 6.1.1 then P satisfies Definition 6.1.2. Proof We use induction to prove this. If P is of length 2 and satisfies Definition 6.1.1 then it is equal to a\a\ and so satisfies Definition 6.1.2 (Base case). Assume that it is true for all P of length k < I. Let P, of length I, satisfy Definition 6.1.1. Let j'0(P) and jo(P) be witnesses for P. Then either (i) there is a Left symbol after the second occurrence of a x , or (ii) there isn't. All of the symbols after the second occurrence of a\ should belong to G2. So in case(i), these symbols form a a;a; + i . . . arar ... ai+\aiP2 pattern, where P2 only consists of Right symbols. Therefore P — aiPiaiaia^i... arar ... ai+\aiP2. If we let P' = axPiaiP2, let jo(P') = Right(P',ax) and let j'0(P') = j'0{P) then both conditions A l and A2 are still satisfied for P' with j'o(P') and jo(P') as witnesses. So, P' — a\P\a\P2 satisfies Definition 6.1.1 and therefore, by induction, it satisfies Definition 6.1.2 and in fact must be a simplest pseudoknot, since P2 only contains Right symbols (Corollary 6.1.1). Therefore P satisfies Definition 6.1.2. Now consider case(ii) where there is no Left symbol after the second occurrence of a\. The symbol O j , right before the second occurrence of a\, either belongs to G l or G2. To satisfy Definition 6.1.1 it must be that (i) P = aiaiP\aia\P2, if a; £ G l , or (ii) P = aiP\aiaiaiP2, if G G2, while P2 only consists of Right symbols which definitely belong to G2. In both cases jo(P) = Right(P, ai) and j'Q = Right(P, ai) — 1 if Pi = e or j'0(P) < Right(P, a i ) - l otherwise. If we let P' = P | au let jo(P') = Right(P', ax) and make j'0(P') to be equal to either (1) Right(P', ax) if P x = e, or (2) j'Q(P) if Pi ^ e and ai G G2, or (3) j'0(P) — 1 if Pi ^ e and a* G G l , then both conditions A l and A2 are still satisfied for P' with J'Q(P') and jo(P') as witnesses. So, P' = aiPiaiP2 satisfies Definition 6.1.1 and therefore, by induction, it satisfies Definition 6.1.2 and in fact must be a simplest pseudoknot, since P2 only contains Right symbols (Corollary 39 6.1.1). So, by simplest pseudoknot definition, P is a simplest pseudoknot and therefore by Corollary 6.1.1 P satisfies Definition 6.1.2. • Part 2 If P satisfies Definition 6.1.2 then P satisfies Definition 6.1.1. Proof We use induction to prove this. If P is of length 2 and satisfies Definition 6.1.2 then it is equal to a\a\ and so satisfies Definition 6.1.1 (Base case). Assume that it is true for all P of length k < I. Let P, of length I, satisfy Definition 6.1.2. Suppose that either P = aiaiP\aia\P2 or P = aiP\aia\aiP2 and P' — a\PiaiP2 is a simplest pseudoknot. By Corollary 6.1.1 and using induction, P' satisfies Definition 6.1.1. Let j'0(P') and jo(P') be witnesses for P', so the following are true: (i) jo{P') — Right(P',ai), and (ii) j'0 = Right(P' ,a\) if Pi = e or j'0(P') is an index of a symbol in Pi otherwise. If we let jo(P) = Right(P,a\) and let j'0(P) to be equal to either (1) Right(P,ai) - 1 if Pi = e, or (2) j'0{P') if Pi ^ e and at e G2 , or (3) j'0(P') + 1 if Pi ^ e and ai G GI, then both conditions A l and A2 are true for P and therefore P satisfies Definition 6.1.1. Finally, suppose that P = aiPiaia;aj+i.. . arar ... ai+\a,iPi and P' = aiPi<2iP2 is a simplest pseudoknot. By Corollary 6.1.1 and using induction, P' satisfies Definition 6.1.1. Let j'0(P') and jo(P') be witnesses for P', so the following are true: (i) jo(P') = Right(P', ai) (ii) j'0 — Right(P', a\) if Pi = e or j'0(P') is an index of a symbol in Pi otherwise. If we let jo(P) = Left(P, ar) and j'0(P) — j'o(P') then both conditions A l and A2 are true for P and therefore P satisfies Definition 6.1.1. • • 6.1.2 Simple Pseudoknot Membership Test: Linear Time Algorithm The algorithm for testing whether a pattern P is simple pseudoknot has two steps. In the first step it deals with the a ;a j + i . . . arar ... a^di subpattern and removes it from P make the pattern a simplest pseudoknot. This can be done by scanning the symbols in the pattern starting from the symbol after the second occurrence of a\ and doing the following: 1. While the current symbol, <2j, is a Left symbol push it on the top of the stack ST (which is empty at the begining) and move to the next symbol. 2. While the current symbol, a^ , is a Right symbol and we have a; on the top of the stack, pop. a; from top of the stack and move to the next symbol. 40 3. If the stack ST is empty remove all of the symbols between the second occurrence of a\ and the current symbol from P. Running time: It can be easily seen that the above algorithm works in linear time as it checks each symbol at most once. Next we should figure out if P is simplest pseudoknot. We use both cases in the definition of simplest pseudoknot to construct the desired algorithm. We define two simplify operations according to B2: i aiaiSiaiCLiS-z is converted to axSiaxS2-i i a\S\aia\aiS2 is converted to a\S\a\S2-We define one more operation, final operation according to B I : i i i a\a\ is converted to e. In these cases we say that a simple/final operation is applicable to a\. The algorithm for testing whether the pattern P is a simplest pseudoknot works as follows: 1. While one of the simplify operations, i or i i , is applicable on the first symbohai, apply it. 2. Do the final operation, i i i , on a\ if it is applicable. 3. Return true if the pattern is empty and false otherwise. For example, pattern P = abebcaddce corresponding to the structure in Figure 6.1(a) is a simple pseudoknot pattern as we can simplify it as the following: (1) abebcaddce —> abebcace by removing dd in the first step, (2) abebcace —> abebae by (ii), (3)abebae —> aeae by (i), (4) aeae —> aa by (ii), and finally (5) aa —> e by (iii). Simple test running time: Each simplify operation takes constant time, and the algorithm does at most n (number of distinct symbols in P) simplify operation on the pattern. Therefore the running time of the algorithm will be 0{n). So the whole test can be done in linear time. 6.2 Secondary Structures with Simple Pseudoknots Definition 6.2.1 A pattern P is called a pattern with simple pseudoknots if and only if for some strings Si, Pi, S2, P2, St, Pt, St+i G £ * : . P = S1P1S2P2.:.StPtSt+1, 41 • each Pi, 1 < i < t, is a simple pseudoknot, and • P' — S1S2 • • • StSt+\ is a pseudoknot free pattern. Note that a pseudoknot free pattern corresponds to the case oft = 0 and qualifies as a pattern with simple pseudoknots. We say that a secondary structure Ris a secondary structure with simple pseudo-knots if its corresponding pattern, P, is a pattern with simple pseudoknots. We give a characterization of secondary structure with simple pseudoknots in The-orem 6.2.1. First we prove two useful lemmas. Lemma 6.2.1 If pattern P is a simple pseudoknot then it is a closed pattern. Proof Let pattern P = ai...am be a simple pseudoknot. To prove that P is a closed pattern it is enough to prove that its corresponding structure, [Left(a\); Right(am)], is a closed region. Clearly [Left(ai)\Right(am)] is a weakly closed region. Assume that it is not closed. Then there is ar, ar ^ a\ and ar ^ am, such that either [Left(ai); Right(aT)} or [Left(ar)\Right(am)} is weakly closed. In the former case, Right(ar) should be greater than Right(a\) which means that ar belongs to G2. Therefore Right(ar) > Left(am) and we also have Right(ar) < Right(am). Then we can conclude that [Left(ai); Right(aT)] is not weakly closed. In the latter case, Left(ar) should be less than Left(am) which means that ar belongs to First group. Therefore Left{ar) < Right(a\) and we also have Left{aT) > Right{ai). Then we can conclude that \Left(ar); Right(am)] is not weakly closed. (Contradiction) • Lemma 6.2.2 Any closed region of a simple pseudoknot structure R of length n is either [l;n] or is pseudoknot free. Proof Type 1: regions with i < j'0 and j > jo. Suppose that l.i' and j'.n. By Definition 6.1.1 j'o < i' and j1 < jo, so both i' and j' are in [i;j]. Therefore 1 and n should be in [i; j] which means that [i;j] = [l;n]. So this type only contains one region which is the whole structure R. Type 2: regions [i;j] such that either: (i) i < j'0 and j < jo (where clearly j'0 < j), or (ii) io — ^ — Jo and jo < j- Then it is easy to see that i.j. So this type of closed region is pseudoknot free. • 42 Theorem 6.2.1 A pattern P is a pattern with simple pseudoknots if and only if all of the closed patterns corresponding to the nodes in its parse tree T(P) are either pseudoknot free or simple pseudoknot. Proof To prove our theorem, we need to prove the following. Part 1 If for all of the nodes V in parse tree T(P) one of the following is true, then P is a pattern with simple pseudoknots :' (i) the pattern corresponding to V is pseudoknot free, or (ii) the pattern corresponding to V is a simple pseudoknot. Proof Assume that each node V in T(P) either corresponds to a pseudoknot free closed pattern or its corresponding pseudoknotted closed pattern is a simple pseudoknot. By Lemma 6.2.2 there are no two nodes V and U such that their corresponding pseu-doknotted closed patterns are simple pseudoknots and one of them is the ancestor of another. Assume that P\,...,Pt are the closed patterns corresponding to the pseu-doknotted nodes, ordered by the left index of hte closed pattern. So we can write P = S1P1S2P2 • • • StPtSt+i where each Pj, 1 < i < t, is a simple pseudoknot pattern and S1S2 • •. StSt+i is a pseudoknot free pattern. Therefore P is a pattern with simple pseudoknots. • Part 2 If P is a pattern with simple pseudoknots then one of the following is true for any node V in T(P): (i) the pattern corresponding to V is pseudoknot free, or (ii) the pattern corresponding to V is a simple pseudoknot. Proof Assume that P is a pattern with simple pseudoknots. So P = S\P\S2P2 • • • StPtSt+i where each Pj, 1 < i < t, is a simple pseudoknot pattern and 5152 ... StSt+i is a pseudoknot free pattern. Each P{, 1 < i < t is a closed pattern (by Lemma 6.2.1) and so a node in T(P) is associated with it. The other nodes in the tree should correspond to pseudoknot free closed patterns. So closed regions corresponding to all nodes V are either pseudoknot free or simple pseudoknot. • • 6.2.1 Secondary Structure with Simple Pseudoknots Membership Test Therefore the following algorithm can easily test whether a secondary structure R, with pattern representation P (Lemma 6.2.3), is a secondary structure with simple pseudoknots: 43 1. Construct the parse tree T(P). 2. Starting from the root, check the nodes in each level of T(P). If the simple pseudoknot test returns true on all of the pseudoknotted closed patterns corresponding to nodes in T(P) then return true and return false otherwise. For example, the structure in Figure 6.1(a) is a secondary structure with simple pseudoknots. The only pseudoknotted node in its parse tree (Figure 6.3(a)) is the root node whose pattern is P = abebcadalce and P is a simple pseudoknot. But, the structure in Figure 6.1(b) is not a secondary structure with simple pseudoknots. The pattern P = abbcaddeec corresponding to its root node (Figure 6.3(b)) is not a simple pseudoknot. Simple structure test running time: Constructing the parse tree can be done in O(n) running time, n = number of base pairs in the structure, (Lemma 6.2.3). Then all we need is to run the simple pseudoknot test on all of the pseudoknotted nodes in the tree. According to Lemma 6.2.2, if the closed region corresponding to a node V is a simple pseudoknot, then none of the nodes in the subtree rooted at node V is pseudoknotted, and therefore the algorithm does not need to run the simple pseudoknot test on them. So the whole simple pseudoknot test on the nodes of T(R) is linear in the number of base pairs and the running time of the algorithm is 0{n). Lemma 6.2.3 We can obtain all the patterns and the private patterns corresponding to closed regions of structure R easily in linear time. This also follows that we can obtain TUP) from T(R) in linear time. Proof Consider the following algorithm which assigns a symbol to each paired base index in L, the linked list representation of R: (i) let i = 1, (ii) scan L from left to right, when I < bp(l) 0 is scanned assign symbol a* to both I and bp (I) and increase i. It is easy to see that the above algorithm is linear in the length of L (and therefore in the length of R). Having L with symbol assignments and T(R) we can easily figure out the pattern corresponding to each closed region in R , and therefore T(P), in linear time in n (number of base pairs in R). (a) (b) Figure 6.3: Parse trees for the structures in Figure 6.1. 44 To obtain the private pattern PC corresponding to a closed region C, we just need to remove the patterns corresponding to C"s nested closed regions from PC. The pattern corresponding to nested closed region i';.f can be removed in constant time by making Prev(i',L) and Next(/ ,L) to be adjacent (by updating the links). So this can be done in linear time in the number of nested closed region in C. Therefore the cost for the whole closed regions is linear in the number of closed region in R (number of nodes in the tree) which is at most n (number of base pairs in R). So we can obtain all patterns and private patterns corresponding to closed regions of R in linear time in n (number of base pairs in it!). • 6.3 Secondary Structures with Recursive Pseudoknots Definition 6.3.1 A pattern P is called a recursive pseudoknot if and only if P is a simple pseudoknot or P = P ^ A ' where P2 is a nonempty simple pseudoknot and P\P[ is a recursive pseudoknot. We say that an RNA secondary structure R is a secondary structure with recursive pseudoknots or conveniently recursive pseudoknot structure if its corresponding pattern P is a recursive pseudoknot. Claim 6.3.1 Given the above definition we claim that the pattern P is a recursive pseudo-knot if and only if all of the private patterns corresponding to the nodes in T(P) are simple pseudoknot. Proof The simple pseudoknot test returns true on a node if its corresponding private pat-tern is a simple pseudoknot. So to prove our claim we need to prove the following. Part 1 If the simple pseudoknot test returns true on all nodes in T(P), then P is a recursive pseudoknot. Proof We prove it by using induction on the size of P, 2n. It is clearly true for 2n = 2 (Base case). Let assume that it is true for all patterns of size 21 < 2n (Induction Hypothesis). We prove that it is true for any pattern P of length 2n. Let L i be one of the leaf nodes in T(P). Assume that simple pseudoknot test returns true on L\ which means that its corresponding private pattern C\ is a simple pseu-doknot. Either L\ is the only node in the tree (which means that C\ = P) or there are more nodes in the tree than L\. In the former case, since P is simple pseudoknot, by Definition 6.3.1 P is a recursive pseudoknot. 45 In the latter case P = P\CiP[. By eliminating L i from T(P) we can get the parse tree, T\, corresponding to P\P[. Assume that the simple pseudoknot test returns true on all of the nodes in T(P) which means that it returns true on all of the nodes in T\. Using induction P\P[ is recursive pseudoknot. We saw before that C\ is a nonempty simple pseudoknot. Therefore, by the definition of recursive pseudoknot P = P\C\P[ is a recursive pseudoknot. • Part 2 If P is a recursive pseudoknot then the simple pseudoknot test returns true on all of the nodes in T(P). Proof We prove it by using induction on the size of P, 2n. It is clearly true for 2n = 2 (Base case). Let assume that it is true for all patterns of size 21 < 2n (Induction Hypothesis). We prove that it is true for any pattern P of length 2n. Assume P is a recursive pseudoknot. Then either P is a simple pseudoknot or it is of the form P\P2P{ where P2 is a nonempty simple pseudoknot and P\P[ is a recursive pseudoknot. In the former case the simple pseudoknot test should return true on all of the nodes in T(P). (Using'Lemma 6.3.1). In the latter case assume that T2 and T\ are respectively the parse trees for P\P2P[ and P2. It is easy to see that T(P) can be obtained by concatenating T\ and T2. (This is done by finding a node in T\ which should be the parent of the root node in T2 and making the parent-child relationship between them.) Using induction, the simple pseudoknot test should return true on all of the nodes in T\ and T2 and therefore on all of the nodes in T(P). This makes the proof complete. • • Lemma 6.3.1 If pattern P is a simple pseudoknot then the simple pseudoknot test returns true on all of the nodes in T(P). Proof ~^ We use induction to prove this. It is true for 2n = 2 (Base case). Assume that it is true for all structures of size 2k < 2n (Induction Hypothesis). We prove that it is true for any structure of size 2n. If T(P) consists of just one node then this node should return true. Assume that T(P) has more than one node. By Lemma 6.2.2 it should have some nodes of the second type, Type 2. As the Type 2 nodes correspond to the pseudoknot free closed regions then there should be a leaf node i.j such that either j'0 = j or jo = j . It is clear that i.j 46 is a simple pseudoknot, and it is easy to see that by removing i.j from R, the remaining structure is still a simple pseudoknot. So by induction, the simple pseudoknot test returns true on all of the other nodes in T(P). • 6.3.1 Recursive Pseudoknot Membership Test Now we can easily test whether a secondary structure R, with pattern representation P (Lemma 6.2.3), is a recursive pseudoknot structure by the following algorithm: 1. Construct the parse tree T(P). 2. If the simple pseudoknot test returns true on all of the private patterns corresponding to the nodes in T(P) then return true and return false otherwise. For example, the structure in Figure 6.1(b) is a recursive pseudoknot. A l l private patterns corresponding to the nodes in its parse tree (Figure 6.3(b)) which includes acac, aa, bb and ee are simple pseudoknots. But the structure in Figure 6.1(c) is not a recursive pseudoknot. The private pattern corresponding to its root node (Figure 6.3(c)) which is abbcaece is not a simple pseudoknot. Recursive structure test running time: Constructing the parse tree can be done in 0(n) running time, n = number of base pairs in the structure, (Lemma 6.2.3). Then all we need is to run simple pseudoknot test on all of the private patterns corresponding to the nodes in the tree (Lemma 6.2.3). The simple pseudoknot test is linear (in the number of base pairs) too, so the running time of the whole algorithm is O(n). 6.4 Classification of Biological Structures Condon et al.[5] provide a simple characterization of the classes of structures handled by Rivas and Eddy (R&E) [13], Dirks and Pierce (D&P) [7], Lyngso and Pederson (L&P) [11] and Akutsu [2] algorithms. They also provide a linear time algorithm to test if an input structure is in the R & E , D&P, and L & P classes. They applied their algorithms to classify biological structures from PseudoBase (PBase) [16], the Nucleic Acids Database (NDB) [3], 16S and 23S ribosomal R N A and Group I and Group II Introns from Gutell Database [4]. We also applied our algorithms for Akutsu class membership, for both secondary structures with simple pseudoknots (Section 6.2) and secondary structures with recursive pseudoknots (Section 6.3), on the same biological structures. Table 6.1 presents our results and also the results reported by Condon et al.[5]. As reported by Condon et al.[5], the R & E structure class is indeed very general, whereas the L & P class misses almost all of the 16s rRNA structures. The Akutsu Simple 47 P B a s e 1 6 S 2 3 S G p I I n t r o n G p I I I n t r o n N D B # S t r s 2 4 0 1 5 2 6 9 10 3 12 A v g . 1 4 . 2 4 6 6 7 6 3 . 1 1 2 8 . 9 2 0 9 3 1 2 . 4 # B P s P K F 0 0 14 0 0 1 A k u t s u S 2 0 4 0 14 0 0 1 L & P 2 3 1 12 14 1 0 0 1 D & P 2 3 2 1 5 0 14 1 0 0 5 A k u t s u R 2 3 2 1 5 0 14 10 0 5 R & E 2 4 0 152 2 5 1 0 0 7 T a b l e 6 . 1 : S t r u c t u r e c l a s s i f i c a t i o n . C o l u m n s 2-7 p resen t d a t a for e a c h R N A d a t a set. F o r e a c h d a t a set ( c o l u m n ) , t he e n t r y i n first r o w l i s t s t he n u m b e r o f s t r u c t u r e s i n t he d a t a set. T h e s e c o n d r o w l i s t s t he ave rage n u m b e r o f base p a i r s i n t he s t r u c t u r e s . T h e r e m a i n i n g r o w s l i s t t h e n u m b e r o f s t r u c t u r e s o f t h e d a t a set t h a t a re i n t he P K F , A k u t s u ( s e c o n d a r y s t r u c t u r e s w i t h s i m p l e p s e u d o k n o t s ) , L & P , D & P , A k u t s u ( s e c o n d a r y s t r u c t u r e s w i t h r e c u r s i v e p s e u d o k n o t s ) a n d R & E classes , r e s p e c t i v e l y . c l a s s b e h a v e s w o r s e t h a n L & P , as i t m i s s e s a l l o f t h e s t r u c t u r e s i n G r o u p I a d d i t i o n a l l y . A l t h o u g h i t i s p r o v e d t h e o r e t i c a l l y t h a t A k u t s u R e c u r s i v e c l a s s i s m o r e g e n e r a l t h a n D & P c l a s s , t h e t e s t r e s u l t s h o w s t h a t t h e y c a n b o t h h a n d l e t h e s a m e c l a s s o f g i v e n b i o l o g i c a l s t r u c t u r e s . 4 8 Chapter 7 Conclusion and Future Work RNA molecules play diverse roles in the cell. The structure of RNA molecules is often the key to their function and predicting the secondary structure of RNA molecules is thus an important problem in computational molecular biology. Algorithms for computational prediction of RNA secondary structures, which find the minimum free energy (mfe) structure are widely used, but experiments show that existing polynomial time algorithms are not so powerful in predicting the base pairs in known biological RNA structures (experimental structures) correctly. Therefore, improving existing algorithms or designing new algorithms, which are more capable in predicting the real secondary structure of RNA sequences, is an important challenge. Towards this goal, in this thesis, we presented a precise definition of the structural elements in a secondary structure, and also a comprehensive way to classify the type of loops that arise in pseudoknotted structure (Chapter 2). Then we introduced a parsing algorithm which parses a given secondary structure to its structural components (Chapter 4). This parsing algorithm, along with the classification scheme for the loops in a pseudoknotted secondary structure, can be used in analysing existing prediction algorithms and known biological RNA structures in order to design new, more powerful, prediction algorithms. Our parsing algorithm can be used in other applications. In this thesis we presented two of them: (i) calculating the free energy of a given secondary structure in linear time (Chapter 5), and (ii) constructing a linear time algorithm to test whether the prediction algorithm introduced by Akutsu [2] can handle a given structure (Chapter 6). Our work can be continued in future in several directions: • Heuristic algorithms commonly use a procedure to calculate the free energy for a given sequence and structure. Incorporating the linear free energy calculation algorithm (presented in Chapter 5), into heuristic algorithms may cause improvements in their efficiency. • The parsing algorithm can be used in order to analyse known biological RNA struc-tures, in order to find out what structures occur more frequently in biology. 49 The method for parsing a secondary structure appears to yield a linear time algo-rithm for finding the maximum independent set in a circle graph. The previous best algorithm had running time O(nlogn) [1]. The parsing algorithm plus extra functions presented in energy calculation part seems to be useful in calculating the minimum number of base pairs that need to be removed in order to make a secondary structure pseudoknot free. The classification model we presented in this thesis can be studied and improved if needed. This wasn't an easy task before, but now as we have some new notions and conventions it can be done more easily. There is no precise characterization of the class of structures handled by Uemura et al.[15] algorithm (Uemura). Our work may be useful in providing such a character-ization, and also constructing an algorithm to test whether Uemura's algorithm can handle a given structure. 50 Bibliography [1] A p o s t o l i c o , A . , M i k h a i l J . A t a l l a h , S u s a n n e E . H a m b r u s c h . 1992. New. c l i q u e a n d i n d e p e n d e n t set a l g o r i t h m s for c i r c l e g r a p h s . Discrete Applied Mathematics 36: 1-24. [2] A k u t s u , T . 2000 . D y n a m i c p r o g r a m m i n g a l g o r i t h m s for R N A s e c o n d a r y s t r u c t u r e p r e d i c t i o n w i t h p s e u d o k n o t s . Discrete Applied Mathematics 1 0 4 : 4 5 - 6 2 . [3] B e r m a n , H . M . et a l . 1992 T h e N u c l e i c A c i d D a t a b a s e : A c o m p r e h e n s i v e r e l a t i o n a l d a t a b a s e o f t h r e e - d i m e n s i o n a l s t r u c t u r e s o f n u c l e i c a c i d s . Biophys. J. 63 :751-759 . [4] C a n n o n e J . J . et a l . 2002 . T h e C o m p a r a t i v e R N A W e b ( C R W ) s i te : A n o n l i n e d a t a b a s e o f c o m p a r a t i v e sequence a n d s t r u c t u r e i n f o r m a t i o n for r i b o s o m a l , i n t r o n a n d o t h e r R N A s . BioMed Central Bioinformatics 3 :15. [ C o r r e c t i o n : B i o M e d C e n t r a l B i o i n f o r m a t i c s . 3:15.] [5] C o n d o n , A . , D a v y , B . , R a s t e g a r i , B . , Z h a o , S. a n d T a r r a n t , T . 2004 . C l a s s i f y i n g R N A p s e u d o -k n o t t e d s t r u c t u r e s . Theoretical Computer Science t o be a p p e a r ? . [6] D e n n i s , C . 2002 . T h e b r a v e n e w w o r l d o f R N A . Nature 4 1 8 ( 1 1 ) : 1 2 2 - 1 2 4 . [7] D i r k s , R . M . a n d P i e r c e , N . A . 2 0 0 3 . A p a r t i t i o n f u n c t i o n a l g o r i t h m for n u c l e i c a c i d s e c o n d a r y s t r u c t u r e i n c l u d i n g p s e u d o k n o t s . J. Cornput. Chem. 24 (13 ) :1664-1677 . [8] G u l t y a e v , A . P . , V a n B a t e n b u r g , F . H . D . a n d P l e i j , C . W . A . 1995 . T h e c o m p u t e r s i m u l a t i o n o f R N A f o l d i n g p a t h w a y s u s i n g a gene t i c a l g o r i t h m . Journal of Molecular Biology 250 : 3 7 - 5 1 . [9] G u l t y a e v , A . P . , V a n B a t e n b u r g , F . H . D . a n d P l e i j , C . W . A . 1999. A n a p p r o x i m a t i o n o f l o o p free e n e r g y va lues o f R N A H - p s e u d o k n o t s . RNA 5: 609-617 . [10] H a s l i n g e r , M . C . 2 0 0 1 . P r e d i c t i o n a l g o r i t h m s for r e s t r i c t e d R N A p s e u d o k n o t s . [11] L y n g s 0 R . B . a n d P e d e r s e n , C . N . 2000 . R N A p s e u d o k n o t p r e d i c t i o n i n e n e r g y - b a s e d m o d e l s . J. Computational Biology 7 ( 3 ) : 4 0 9 - 4 2 7 . [12] P l e i j , C . W . A . 1990. P s e u d o k n o t s : a n e w m o t i f i n t he R N A g a m e . Trends Biochem Science 15(4) : 143-147; [13] R i v a s E . a n d E d d y , S. R . 1999 , A d y n a m i c p r o g r a m m i n g a l g o r i t h m for R N A s t r u c t u r e p r e d i c t i o n i n c l u d i n g p s e u d o k n o t s . J. Molecular Biology 2 8 5 : 2 0 5 3 - 2 0 6 8 . ' [14] R u a n , J . , S t o r m o , G . D . a n d Z h a n g , W . 2004 . I t e r a t i v e L o o p M a t c h i n g ( I L M ) a n a p p r o a c h t o t he p r e d i c t i o n o f R N A s e c o n d a r y s t r u c t u r e s w i t h p s e u d o k n o t s . Bioinformatics 20 -1 : 5 8 - 6 6 . 5 1 [15] U e m u r a , Y . , H a s e g a w a , A . , K o b a y a s h i , S. a n d Y o k o m o r i , T . 1999. T r e e a d j o i n i n g g r a m m a r s for R N A s t r u c t u r e p r e d i c t i o n . Theoretical Computer Science 2 1 0 : 2 7 7 - 3 0 3 . [16] V a n B a t e n b u r g , F . H . D . , G u l t y a e v , A . P , P l e i j , C . W . A . , N g , J . a n d O l i e h o e k J . 2000 . P s e u -dobase : a d a t a b a s e w i t h R N A p s e u d o k n o t s . Nucl. Acids Res 2 8 ( l ) : 2 0 1 - 2 0 4 . 52
- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Theses and Dissertations /
- Linear time algorithm for parsing RNA secondary structure
Open Collections
UBC Theses and Dissertations
Featured Collection
UBC Theses and Dissertations
Linear time algorithm for parsing RNA secondary structure Rastegari, Baharak 2004
pdf
Page Metadata
Item Metadata
Title | Linear time algorithm for parsing RNA secondary structure |
Creator |
Rastegari, Baharak |
Date Issued | 2004 |
Description | RNA secondary structure prediction is an important problem in computational molecular biology. Experiments show that existing polynomial time prediction algorithms have limited success in predicting correctly the base pairs, i.e. secondary structure, in known biological RNA structures. One limitation of many current algorithms is that they can predict only restricted classes of structures, excluding many so-called pseudoknotted secondary structures. The type of the pseudoknotted structures that occur in biological structures, as well as the type of structures handled by current algorithms, have been poorly understood, making it difficult to assess the generality of current algorithms. In this thesis we present a comprehensive and precise classification of structural elements and loops in a secondary structure, along with a linear time algorithm for parsing secondary structures into their structural elements. The parsing algorithm, along with the classification scheme for the loops in a pseudoknotted secondary structure, can be used in analysing existing prediction algorithms to determine which known biological RNA structures can not be predicted by the algorithms. This analysis can help us to design new and more powerful prediction algorithms. Furthermore, we present two applications of our work: (i) linear time free energy calculation algorithm, and (ii) linear time test for Akutsu's[2] algorithm class. We present a linear time algorithm for calculating the free energy of a given secondary structure. This algorithm can be useful especially in heuristic prediction algorithms, as they commonly use a procedure to calculate the free energy for a given sequence and structures. We also present a linear time algorithm to test whether the prediction algorithm introduced by Akutsu[2] can handle a given structure. The result of our analysis on algorithm of Akutsu on some sets of biological structures shows that although it is proved theoretically that the class of structures handled by Akutsu's algorithm is more general than that handled by the algorithm of Dirks and Pierce [7], they can both handle the same class of given biological structures. |
Extent | 2899489 bytes |
Genre |
Thesis/Dissertation |
Type |
Text |
FileFormat | application/pdf |
Language | eng |
Date Available | 2009-11-24 |
Provider | Vancouver : University of British Columbia Library |
Rights | For non-commercial purposes only, such as research, private study and education. Additional conditions apply, see Terms of Use https://open.library.ubc.ca/terms_of_use. |
IsShownAt | 10.14288/1.0051394 |
URI | http://hdl.handle.net/2429/15658 |
Degree |
Master of Science - MSc |
Program |
Computer Science |
Affiliation |
Science, Faculty of Computer Science, Department of |
Degree Grantor | University of British Columbia |
GraduationDate | 2004-05 |
Campus |
UBCV |
Scholarly Level | Graduate |
AggregatedSourceRepository | DSpace |
Download
- Media
- 831-ubc_2004-0607.pdf [ 2.77MB ]
- Metadata
- JSON: 831-1.0051394.json
- JSON-LD: 831-1.0051394-ld.json
- RDF/XML (Pretty): 831-1.0051394-rdf.xml
- RDF/JSON: 831-1.0051394-rdf.json
- Turtle: 831-1.0051394-turtle.txt
- N-Triples: 831-1.0051394-rdf-ntriples.txt
- Original Record: 831-1.0051394-source.json
- Full Text
- 831-1.0051394-fulltext.txt
- Citation
- 831-1.0051394.ris
Full Text
Cite
Citation Scheme:
Usage Statistics
Share
Embed
Customize your widget with the following options, then copy and paste the code below into the HTML
of your page to embed this item in your website.
<div id="ubcOpenCollectionsWidgetDisplay">
<script id="ubcOpenCollectionsWidget"
src="{[{embed.src}]}"
data-item="{[{embed.item}]}"
data-collection="{[{embed.collection}]}"
data-metadata="{[{embed.showMetadata}]}"
data-width="{[{embed.width}]}"
async >
</script>
</div>
Our image viewer uses the IIIF 2.0 standard.
To load this item in other compatible viewers, use this url:
http://iiif.library.ubc.ca/presentation/dsp.831.1-0051394/manifest