Empirical Analysis of the MEA and MFE RNA Secondary Structure Prediction Algorithms by Monir Hajiaghayi B.Sc., Sharif University of Technology, 2008 A THESIS SUBMITTED IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE in The Faculty of Graduate Studies (Computer Science) THE UNIVERSITY OF BRITISH COLUMBIA (Vancouver) October, 2010 c Monir Hajiaghayi 2010 Abstract RNA molecules play critical roles in the cells of organisms, including roles in gene regulation, catalysis, and synthesis of proteins [1]. Since their functions largely depend on their folded structures, having accurate and efficient methods for RNA structure prediction is increasingly valuable. A number of computational approaches have been developed to predict RNA secondary structure. There have been some recent advances for two of these approaches, namely Minimum Free Energy (MFE) and Maximum Expected Accuracy (MEA) [2, 3]. The main goals of this thesis are twofold: 1) to empirically analyze the accuracy (i.e. the number of correctly predicted base pairs) of the MFE and MEA algorithms in different energy models and 2) to estimate the free energy parameters specifically for the MEA by using the constraint generation (CG) approach [2]. We present new results on the degree to which different factors influence the prediction accuracy MFE and MEA. The factors that we consider are: structural information that is provided in addition to RNA sequence, thermodynamic parameters, and data set size. Structural information significantly increases the accuracy of these methods. The relative performance of MFE and MEA changes according to the free energy parameters used; however, both have the best performance when they use Andronescu et al.’s BL* parameter set [2]. Having bigger data sets results in less variation in prediction accuracy of the MFE and MEA algorithms. Furthermore, we try to find better free energy parameters for the MEA algorithm. For our purpose, we adapt the CG approach so that it specifically ii Abstract optimizes parameters for MEA. Overall, when parameters are trained using MFE, they slightly outperform the parameters estimated for MEA. For the MEA algorithm, we also study the effect of parameter γ which controls the relative sensitivity and positive predictive value (PPV). We obtain that when γ = 1, the accuracy of MEA (in terms of F-measure on the BL* parameter set) is better than its accuracy for other values of γ. We also find that the sensitivity and PPV of MEA will interestingly change for different values of γ using the BL* parameter set. iii Table of Contents Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii Table of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . iv List of Tables vi . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . ix Dedication x . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 Introduction . . . . . . 1.1 Goals and Motivation 1.2 Contributions . . . . 1.3 Outline . . . . . . . 1.4 Glossary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 2 3 5 5 2 Related Work and Algorithms . . . . . 2.1 Data Sets . . . . . . . . . . . . . . . . 2.1.1 Format of RNA sequence . . . 2.2 Free Energy Models . . . . . . . . . . 2.2.1 Turner Model . . . . . . . . . . 2.2.2 MultiRNAFold Model . . . . . 2.3 RNA Secondary Structure Algorithms 2.4 Accuracy Measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 9 11 13 14 17 18 21 iv Table of Contents 2.5 Approaches to Parameter Estimation . . . . . . . . . . . . . 25 3 Sensitivity of RNA Prediction Algorithms to Input Data 3.1 Dependency of Prediction Accuracy on the Format of RNA Sequence . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.2 Dependency of Prediction Accuracy on the Thermodynamic Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.3 Dependency of Prediction Accuracy on Data Set Size . . . . 3.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 Parameter Estimation for Pseudoknot-free Secondary ture Prediction Using the MEA Approach . . . . . . 4.1 Parameter Estimation for the MEA Approach . . . . . 4.2 Parameter Estimation for Varying γ . . . . . . . . . . 4.3 Training and Testing on Restricted Sets . . . . . . . . 4.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . 30 . 30 . 34 . 37 . 43 Struc. . . . . . . . . . . . . . . . . . . . 46 47 51 58 59 Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62 Appendix A Differences between the features of the T99-MRF and T04MRF models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66 v List of Tables 2.1 2.2 2.3 Statistics of the structural set. . . . . . . . . . . . . . . . . . . 11 Overview of the different RNA types in MT and MA data sets. 12 Different versions of the MEA and MFE algorithms. . . . . . . 22 3.1 Sensitivity and PPV of rsMEA and rsMFE methods on the restricted and unrestricted MT data sets. . . . . . . . . . . . . Accuracy comparison of different prediction algorithms with various parameter sets on the S-Full-Test set. . . . . . . . . . Prediction accuracy of MEA and MFE algorithms in terms of F-measure on the unrestricted MT data set. . . . . . . . . . . Prediction accuracy of MEA and MFE algorithms in terms of F-measure on the unrestricted MA data set. . . . . . . . . . . F-measure values obtained from 10 subsets of size 16, sampled from the group I intron class in the MA set with size 89. . . . F-measure values obtained from 10 different subsets of the group I Intron class with sizes equal to 32 and 64, respectively. F-measure values obtained from 10 different subsets of size 6, sampled from the ribonuclease P RNA class in the MA set with size 399 . . . . . . . . . . . . . . . . . . . . . . . . . . . F-measure values obtained from 10 different subsets of size 423, sampled from the transfer RNA class in the MA set of size 489 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.2 3.3 3.4 3.5 3.6 3.7 3.8 35 37 39 39 42 43 43 44 vi List of Tables 4.1 4.2 4.3 4.4 4.5 4.6 Accuracy comparison of ubcMEA and ubcMFE achieved for various parameter sets . . . . . . . . . . . . . . . . . . . . . . Prediction accuracy of the ubcMEA algorithm obtained by varying γ. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Comparison of sensitivity for different values of γ in the ubcMEA and rsMEA algorithms on several RNA classes of the original MT data set. . . . . . . . . . . . . . . . . . . . . . . . . . . . Comparison of positive predictive value (PPV) for different values of γ in the ubcMEA and rsMEA algorithms on several RNA classes of the original MT data set. . . . . . . . . . . . . Comparison of F-measure for different values of γ in the ubcMEA and rsMEA algorthims on several RNA classes of the original MT data set. . . . . . . . . . . . . . . . . . . . . . . . . . . . Accuracy comparison of the ubcMEA algorithm with various parameter sets on the restricted and unrestricted S-Full-test. . 48 54 57 57 58 60 A.1 Feature types and the number of each one in the T99-MRF and T04-MRF models. . . . . . . . . . . . . . . . . . . . . . . 67 vii List of Figures 2.1 2.2 Secondary structure of RNA molecule PDB 00923. . . . . . . . 13 Secondary structure of an arbitrary RNA sequence . . . . . . 16 3.1 Reference structure and secondary structures predicted by the ubcMEA algorithm on restricted and unrestricted format of one transfer RNA sequence. . . . . . . . . . . . . . . . . . . Distribution of the percentage of forced unpaired nucleotides in the transfer RNA class of the MT set. . . . . . . . . . . . F-measure correlation between the MEA and MFE algorithms with BL* parameter set on S-Full-test. . . . . . . . . . . . . Variance of the prediction accuracy of ribonuclease P RNA in the MA set for various subset sizes (6, 16, 32, 64, 128, 256). 3.2 3.3 3.4 4.1 . 31 . 33 . 36 . 44 Performance of the MEA approach by using different free energy parameter sets. . . . . . . . . . . . . . . . . . . . . . . . 55 viii Acknowledgements I owe my deepest gratitude to my supervisor, Anne Condon, whose encouragement, guidance and support from the initial to the final level enabled me to understand and finish the thesis. I would like to thank Holger Hoos for being my second reader and for useful comments and suggestions. Thanks to Mirela Andronescu for her helpful discussions and collaborations throughout this research. I am grateful to all of those who supported me in any respect during the completion of the thesis. Many thanks go to my love Ali, for his support and for giving me lots of joy and happiness. Last but not the least, I want to thank my parents and my siblings Mohammad, Mahdi, and Mehri for their continual support and trust . Monir Hajiaghayi ix Dedication This thesis is dedicated to my parents who taught me that even the hardest task can be accomplished if it is done one step at a time. x Chapter 1 Introduction In order to understand the functions of any organism, the importance of RNA molecules cannot be ignored. These molecules transmit the genetic information through the cells. They also act as catalysts and as regulators of gene expression [4]. Therefore, it is very important to determine their tertiary structures. The tertiary structure of an RNA molecule is considerably determined by its secondary structure [5]. So, secondary structure prediction of RNA molecules is a significant problem in the computational molecular biology [5]. Prediction of secondary structure from base sequence, using an experimental approach is quite expensive [6]. Thus computational methods are required to enhance efficiency. Researchers have studied the problem of computational prediction of RNA structure for several decades. They have found many structure prediction algorithms which employ thermodynamic parameter sets ( which will be described later) for their predictions. Understanding the relative performance of these computational methods is important in improving the prediction of RNA secondary structure. In this thesis, we start by considering how the performance of prediction algorithms depends on input data. We then consider how to obtain improved thermodynamic parameter sets for a recently-proposed prediction algorithm. In the reminder of this chapter, we first describe our goals and motivations to support why our work is important. Next, in Sections 2 and 3, we outline our contributions and the organization of this thesis. Finally, for convenience 1 1.1. Goals and Motivation of the reader a glossary of terms is given in Section 4. 1.1 Goals and Motivation There were two recent advances in secondary structure prediction of particular interest to us. First, Andronescu et al. [2] were able to obtain significant improvements to minimum free energy (MFE) prediction by using state-of-the-art methods for optimizing thermodynamic parameters. Second, Lu et al. [3] proposed a new maximum expected accuracy (MEA) method and showed that it outperformed their MFE algorithm on a large data set. We were interested in seeing if we could obtain yet further improvements in structure prediction, by training new parameters specifically for MEA. To undertake this work, it was useful to have our own implementation of the MEA algorithm that could be used with the parameter training methods of Andronescu et al. As a result, we worked not only with the algorithms of Lu and Mathews et al., which we refer to as rsMFE and rsMEA, but also with the MFE algorithm of Andronescu et al., referred to as ubcMFE, and a new implementation of MEA which we developed, called ubcMEA. Early on, when doing our own baseline comparison of the MEA and MFE algorithms, we realized that there were many aspects of the comparison method that strongly influenced their relative performance. These included (i) whether or not some information about the structures is provided to the algorithms, in addition to the base sequence, (ii) which thermodynamic model and parameters were used by the algorithms, and (iii) what structural data sets were used to measure the prediction accuracy of the algorithms. For example, when reporting their results on rsMEA vs rsMFE , Lu et al. used “restricted format” data sets in which certain bases of sequences are in lower case, indicating that the base is unpaired in the true structure. We realized that, before attempting to obtain improved parameters, we first needed a better 2 1.2. Contributions understanding of the ways in which the format (restricted vs unrestricted) of the data set and other factors influenced algorithm accuracy. As a result, the goals of this thesis are twofold: first, we aim to understand what are robust ways to compare the relative performance of these algorithms. For example, we consider the effect of thermodynamic parameters on the prediction accuracy. Understanding the effects of these factors helps us to configure them in the best way for different algorithms and make their results more robust. Second, we seek approaches to enhance the prediction accuracy (i.e., increase the number of correctly predicted base pairs) of RNA secondary structures. These goals are motivated by the fact that having more accurate and reliable predictions will enhance our ability to understand RNA function. 1.2 Contributions We make the following contributions in this thesis: 1. Analysis of the performance of four prediction algorithms, considering: • the effect of structural information available in RNA sequences • the effect of thermodynamic parameters • the effect of data set size. We show that using the structural information of RNA sequences increases the prediction accuracy. To understand the effect of structural information, we derive new “restricted format” data sets, in which some bases which are unpaired in the reference structure are changed to lower case, from existing “unrestricted format” data sets (in which all bases are in uppercase). We also show that, depending on the thermodynamic set used for prediction, the relative performance of MEA vs 3 1.2. Contributions MFE changes. That is, for some thermodynamic sets, MFE does better, while for others, MEA does better. Finally we conclude that it is important to use large data sets when measuring the accuracy of algorithms. Some results in the literature are broken down by classes of sequences, grouped by biological function (e.g., group I Introns) and the size of the classes in some available data sets is quite small (fewer than ten sequences). The reported accuracy then depends significantly on the specific composition of the data for that class, and may not be representative of the performance of the algorithm on that class as a whole. 2. Performing training on unrestricted data sets for RNA free energy parameters of the ubcMEA and ubcMFE algorithms. We show that the parameters trained specifically for ubcMEA improve its prediction accuracy, however, those trained parameters are not suitable for the ubcMFE algorithm and vice versa. We also give some evidence that the BL* free energy parameter set obtained by Andronescu et al [2] has the highest quality parameters in the MultiRNAFold (MRF) free energy model for both, ubcMEA and ubcMFE. 3. Considering the effect of parameter γ which is a sensitivity versus PPV trade-off parameter in the MEA algorithm on the prediction accuracy. Our experiments indicate the best accuracy (in terms of F-measure) is achieved when γ is equal to 1 in MEA. 4. Training free energy parameters for the ubcMEA algorithm in a restricted way, i.e., some nucleotides of the sequences in the training and testing sets are forced to be single stranded. We show that training of the parameters in an unrestricted way still outperforms their training in a restricted way. Our work is beneficial to the RNA community by providing the dependency and accuracy analysis of four discovered algorithms which are currently con4 1.3. Outline sidered to have the highest performance (in terms of prediction accuracy) among all other prediction methods. Since the reliability of prediction algorithms is an important issue, this analysis will be helpful. 1.3 Outline An overview of biological data and most important work related to this research is given in chapter 2. Then we present the dependency analysis of four different prediction algorithms in chapter 3. We study three influencing factors in prediction accuracy, the format of RNA sequence, the thermodynamic parameters, and the data set size. In chapter 4 we describe how we train thermodynamic parameters for the MEA algorithm using the CG approach and measure the performance of these parameters. We also study the effect of parameter γ which controls the relative sensitivity and PPV in the MEA algorithm. Finally, we present our results on training of free energy parameters for the MEA algorithm when some nucleotides in the input data are forced to be single stranded. 1.4 Glossary For the convenance of the readers, we provide a glossary of terms used throughout this thesis. Many terms are explained in more detail in chapter 2. • Secondary structure y: an RNA secondary structure y for the sequence x = x1 x2 . . . xn ∈ {A, C, G, U }∗ is a set of pairs i.j, with i, j ∈ {1, . . . , n} so that for any pair, i! = j and that for any two pairs i.j and i0 .j 0 which i0 , j 0 ∈ {1, . . . , n}, i, j, i0 , j 0 are all different. • Pseudoknot-free secondary structure: A pseudoknot-free RNA secondary structure for the sequence x = x1 x2 . . . xn ∈ {A, C, G, U }∗ is an RNA 5 1.4. Glossary secondary structure in which any two pairs i.j and i0 .j 0 with i < j, i0 < j 0 also satisfy i < i0 and have either of the following forms: i < i0 < j 0 < j, or i < j < i0 < j 0 . • Reference structure y ∗ : The known secondary structure for a sequence x which is determined by comparative sequence analysis or by experimental methods, such as NMR. • Free energy parameter set Θ: A vector of free energy parameters (θ1 , θ2 , ..., θp ), with free energy parameter θi corresponding to feature fi . A feature is an RNA secondary structure fragment whose thermodynamics are considered to be important for RNA folding, and p is the number of features in the model. • Free energy function ∆G(x, y, Θ): A function that defines the standard free energy change of a sequence x folded into a specific secondary structure y that is consistent with x respect to Θ. By standard, we mean free energy at standard temperature (37◦ C), salt concentration (1nM) and pressure (1atm). • Structural set S: A data set that contains RNA sequences with reference secondary structures. This data set is comprised of pairs (x, y ∗ ) where x is an RNA sequence and y ∗ is its reference secondary structure. The number of RNA sequences in the structural set S is denoted by s. • Thermodynamic set T : A data set comprised of quads (x,y ∗ ,e, δ) where x is an RNA sequence, y ∗ is the reference secondary structure, e is the experimental free energy change, and δ is the reported experimental error. This data set contains a large number of thermodynamic data determined by optical melting experiments. • Partition function Z(x, Θ) : A function, which can be calculated by a dynamic programming algorithm given an RNA sequence x and energy 6 1.4. Glossary parameter set Θ, which permits the computation of probabilities of secondary structures and base pairs. It is defined as: Z(x, Θ) = X e −∆G(x,ω,Θ) RT ω∈Ω(x) where R is the gas constant, T is the absolute temperature and Ω(x) is the set of possible pseudoknot-free secondary structures for the sequence x. • P (y, x, Θ): The probability of the structure y for the RNA sequence x, based on the given parameter Θ. Py,x,Θ = 1 e−∆G(x,y,Θ) Z(x, Θ) • Pi,j,x,Θ : The probability of base pair i.j in the RNA sequence x, based on the given parameter set Θ: Pi,j,x,Θ = 1 Z(x, Θ) X e ∆G(x,γ,Θ) RT γ∈Γi,j (x) where Γi,j (x) is the set of possible secondary structures that contain the base pair i.j. • Constraint Generation (CG) [2]: A parameter estimation algorithm that enforces reference structures to have energies lower than other structures for and RNA molecule based on iteratively generated constraints. • Boltzmann Likelihood (BL) [2]: A parameter estimation algorithm that gives a set of free energy parameters which maximize the Boltzmann probability of a set of reference RNA structures. 7 1.4. Glossary Accuracy measures • Sensitivity: A measure of secondary structure prediction accuracy, showing the ratio of correctly predicted base pairs to the base pairs in the reference structure. The possible values are between 0 and 1; the closer to 1, the better the prediction. sensitivity = number of correctly predicted base pairs number of base pairs in the reference structure • Positive predictive value (PPV): A measure of secondary structure prediction accuracy, showing the ratio of correctly predicted base pairs to the total number of predicted base pairs. The possible values are between 0 and 1; the closer to 1, the better the prediction. PPV = number of correctly predicted base pairs number of predicted base pairs • F-measure. The harmonic mean of sensitivity and positive predictive value. The possible values are between 0 and 1; the closer to 1, the better the prediction. F-measure = 2 × sensitivity × PPV sensitivity + PPV 8 Chapter 2 Related Work and Algorithms In the first part of this chapter, we introduce different data sets used in this thesis. Next, we will explain two free energy models and their features in Section 2.2. Finally, in Section 2.3 and 2.4, we will describe some related work on the estimation of free energy parameters and prediction of RNA secondary structure. 2.1 Data Sets In this thesis, prediction and evaluation of RNA secondary structures and training of parameters are performed based on four different data sets. An in-depth overview of each data set is presented here. • The first set, S-Full, has been collected by Andronescu et al. [2]. This data set contains 3245 RNA sequences with reference secondary structures from RNA STRAND v2.0 [7].It includes a large number of ribosomal RNA molecules, transfer RNAs, transfer messenger RNAs, ribonuclease P RNAs, signal recognition particle RNAs, and secondary structures obtained from tertiary structures. S-Full is split into two parts: approximately 80% for training (S-Full-Train) and the remaining about 20% for testing (S-Full-Test) of the accuracy of algorithm and estimated parameter sets. This data set contains only structures up to 700 nucleotides in length. • The second data set which we refer to as MT, is the one used by Lu et al. [3]. This data set consists of various classes of RNA including 9 2.1. Data Sets small-subunit (16S) and large-subunit (23S) ribosomal RNA, 5S ribosomal RNA, signal recognition RNA, group I introns, group II introns, ribonuclease P RNA, and transfer RAN. The small-subunit (16S)rRNA are divided into domains as defined by Jaeger et al. [8] and the largesubunit (23S)rRNA sequences have been divided into domains of fewer than 700 nucleotides each. • In order to better compare our results with those reported in Lu et al.’s work [3], we have formed a new set of sequences that is a subset of the S-Full set and includes those types of RNA that are common between S-Full and Lu et al.’s set. We call this subset of S-Full set the MA data set. The S-Full, MT, and MA sets contain RNA sequences with their reference secondary structures determined by comparative sequence analysis [9], which is considered to be a reliable method. • The third data set used in this thesis is, Thermodynamic set. This set, called T-Full [2], includes RNA sequences from 1291 optical melting experiments with their reference secondary structures and measured free energy changes. This data set is comprised of quadruples (x, y ∗ , e, δ), where x is an RNA sequence and y ∗ is its reference secondary structure, e is the experimental free energy change of sequence x folded into secondary structure y ∗ at the standard temperature and δ is the reported experimental error (if no experimental error was reported, Andronescu et al. considered the error to be the maximum value of 5% × ei and 0.1). Table 2.1 shows the number of sequences, average length and standard deviation of length for the structural sets S-Full, S-Full-Test, S-Full-Train, and the thermodynamic set T-Full. Further, Table 2.2 shows the number of sequences, average length, standard deviation of length and the number of sequences for different RNA classes of the MT and MA data sets. 10 2.1. Data Sets Data set S-Full S-Full-Train S-Full-Test T-Full No 3245 2386 659 1291 Avg length 269.6 267.3 278.7 18.8 Std 185.2 184.7 186.7 12.3 Table 2.1: Statistics (number of sequences, average length and standard deviation in length) of the structural and thermodynamic sets [2]. It is notable that some types of RNA in the MT set have duplicated sequences (the number of sequences with duplicates was shown in parentheses for each RNA type in Table 2.2). However, for compatibility between this thesis and other research, we have performed our experiments with the original set that includes repeated sequences. 2.1.1 Format of RNA sequence We define two representations for each RNA sequence: (1) the representation which includes both lower- and upper-case nucleotides and (2) the representation in which all nucleotides are in upper-case. In fact, those nucleotides that are shown in lower-case are forced to remain unpaired. By forced unpaired, we mean that the nucleotide is unpaired in the reference structure of sequence x. These lower-case nucleotides provide extra information about the structure of an RNA. This information can be determined by some experimental approaches such as SHAPE experiments [10, 11]. The following example shows the RNA sequence of molecule PDB 00923 in two representations. The first and second lines show the sequence in the first and second representation respectively. 1 CCACGGUuCGAAUCCGUGGC 2 CCACGGUUCGAAUCCGUGGC 11 Type of RNA No. in MA mean±std 377.88 ± 167.18 460.6 ± 151.3 119.5 ± 2.69 344.88 ± 66.42 668.7 ± 70.92 382.5 ± 41.66 267.95± 61.72 77.48 ± 4.8 No. in common mean±std 675 159 128 89 2 399 364 489 2305 485.66± 113.02 453.44± 117.85 120.98±3.21 368.49 ±103.58 578± 47 332.78±52.34 227.04±109.53 77.19± 5.13 0 4 58 0 0 4 58 354 478 Table 2.2: Overview of different RNA classes in the MT and MA data sets, the number of RNAs, the mean length, standard deviation and the number of common sequences in these two data sets for each RNA class. The numbers in parentheses show the number of sequences in the corresponding RNA class of MT with duplicates. There is no duplicate in the MA set. We have considered duplicated sequences in computing the mean and standard deviation of all RNA classes. The numbers in parentheses show the number of duplicated sequences in each class. The last column presents the number of common sequences in the MA and MT sets without duplicates. 2.1. Data Sets 16S Ribosomal RNA 23S Ribosomal RNA 5S Ribosomal RNA Group I intron Group II intron Ribonuclease P RNA Signal Recognition RNA Transfer RNA ; total No. in MT (with duplicates) 88 (88) 26 (27) 305 (309) 16 (16) 3 (3) 6 (6) 91 (91) 423 (484) 958 (1024) 12 2.2. Free Energy Models Figure 2.1 shows the structure of this molecule while the only lower-case nucleotide in its sequence is also shown. Figure 2.1: Secondary structure of the RNA molecule PDB 00923 obtained from the STRAND database, rnasoft.ca/sstrand. The orange circle shows the only unpaired nucleotide in this molecule. From now on, for clarity, we call the first representation of an RNA sequence, restricted and the second one unrestricted. 2.2 Free Energy Models An energy model contains a set of features each of which is a fragment of RNA secondary structure whose thermodynamics are considered to be important for RNA folding. The free energy parameter set Θ is a vector of free energy parameters (θ1 , θ2 , ..., θp ) whose free energy parameter θi corresponds to feature fi . For a given parameter set Θ, a free energy function ∆G(x, y, Θ) is defined as a function that determines the standard free energy change of a sequence x folded into a specific secondary structure y. Usually, the free energy values of parameters are reported at a standard temperature of 37◦ C. In the reminder of this section, we first outline the main features of the most widely used RNA energy model, the Turner model. Then we give an overview of another free energy model called MutltiRNAFold (MRF), which is used 13 2.2. Free Energy Models frequently in this thesis. 2.2.1 Turner Model The Turner model [12] called “Turner99”is the most widely used energy model for prediction of RNA secondary structures. The parameters of the model were derived from optical melting experiments, the most commonly used experimental approach to determine the free energy change of RNA structures. The Turner model is a nearest neighbor thermodynamic model, i.e., it assumes that the stability of a base pair or loop depends on its sequence and the adjacent base pair or unpaired bases. The features of the Turner model have been mostly designed to reflect the physical characteristics of RNA molecules, observed over years from experimental data. Further refinements of the parameters were made by Lu et al. [13] based on new experimental data. We call this revised version of parameters “Turner04”. The number of free energy values tabulated in the Turner model is more than 7600 in both Turner99 and Turner04 versions; but most of these can be determined by applying simple extrapolation rules to much fewer free energy parameters. This model is used in several software packages such as Mfold [14], RNAstructure [15], the Vienna RNA package [16] and Nupack [17]. A feature f is denoted by a feature name (a, b, c, ...), where a, b, c, ... ∈ {A, C, G, U } are the nucleotides on which the feature depends. It is notable that there are some RNA structural motifs including stacked pairs, hairpin loops, internal loops, bulge loops, multi-loops and the exterior loop that each of which may be comprised of one or more features. For example, the Turner model includes the features of hairpin loop terminal mismatch(a, b, c, d), where a-d forms the hairpin loop closing base pair, and b and c are the 1 We employed the same notation as used in Andronescu’s thesis [2] for feature definition 14 2.2. Free Energy Models first two unpaired bases of the hairpin loop, forming stacking interactions with the closing base pair. In what follows we just name the main feature categories in this model . Figure 2.2 shows different motifs in an arbitrary RNA molecule. • Stacked pair features • Hairpin loop terminal mismatch features • Features for 1 × 1, 1 × 2 and 2 × 2 internal loops, where x × y means there is x unpaired nucleotide on one side and y unpaired nucleotides on another side of the internal loop • Internal loop terminal mismatch features • Features for the number of unpaired nucleotides in hairpin loops, internal loops and bulge loops • Multi-loop features • Dangling end features • Other features, including special cases of stable hairpin loops, asymmetric internal loops and penalty for intermolecular initiation for the case of interacting RNA molecules; If we denote the sequence by x, the secondary structure by y, the parameters of a free energy model by a vector Θ, and the number of times a feature i occurs in y by ci (x, y), then the energy function of this free energy model is linear in the parameters and it is calculated by the following equation [2]: ∆G(x, y, Θ) = p X ci (x, y)θi = c(x, y)T · Θ (2.1) i=1 2 These features are described in more details in Andronescu’s thesis [2]. 15 2.2. Free Energy Models Figure 2.2: Secondary structure of an arbitrary RNA sequence. Marked in orange boxes are RNA structural motifs, including stacked pairs (marked by 1), internal loop (marked by 2), bulge loop (marked by 3), hairpin loop (marked by 4), multi-loop (marked by 5) and dangling end (marked by 6). 16 2.2. Free Energy Models For example, the free energy change of the secondary structure shown in Figure 2.2 under Turner99 model is equal to: ∆G = X ∆G(stackedpairs)+ X ∆G(hairpinloops)+ X ∆G(internalloops)+ X ∆G(bulgeloops)+ X ∆G(multi − loops). 2.2.2 MultiRNAFold Model Andronescu et al. have introduced the MutliRNAFold (MRF) model, which is derived from the Turner99 model, but includes only 363 features. In fact, features in the full Turner 99 model can be obtained by extrapolating the features of this model. By applying a similar approach as Andronescu et al., we derived the MultiRNAFold04 model which contains 471 features. However, not all features in Turner04 are included in MultiRNAFold04. For example, Turner04 includes coaxial stacking features while the MultiRNAFOld04 model, doesn’t contain these features. For simplicity, in the reminder of this thesis we call the free energy models of MultiRNAFold and MultiRNAFold04 the T99-MRF and T04-MRF models, respectively. Andronescu has listed the features of the T99-MRF model in her thesis. A summary of feature types and the number of each type in these two free energy models is also attached in Appendix A. 17 2.3. RNA Secondary Structure Algorithms 2.3 RNA Secondary Structure Algorithms There are several different approaches to pseudoknot-free RNA secondary structure prediction. In this section we summarize two of these methods that are most related to our work, namely free energy minimization and maximum expected base pair accuracy. Free energy Minimization Method The minimum free energy (MFE) method, introduced first by Nussinov et al. [18] and generalized to more realistic energy models by Zuker and Stiegler [19], has been the most widely known method for pseudoknot-free RNA secondary structure prediction for decades. This method is based on a dynamic programming algorithm that is guaranteed to find the secondary structure with the minimum free energy, with respect to a nearest neighbor free energy model. We denote an RNA sequence with x, and the set of all possible pseudoknot-free secondary structures for x with Ω, the free energy function ∆G(x, y, Θ) with respect to the vector Θ of free energy parameters, then the minimum free energy secondary structure y M F E is: y M F E ∈ arg miny∈Υ ∆G(x, y, Θ) (2.2) where arg miny F (y) denotes the (set of) y for which F (y) is minimum, and G(x, y, Θ) is the free energy change of the structure y. Briefly, Zuker and Stiegler’s dynamic programming algorithm is as follows. For each index i and j with 1 ≤ i < j ≤ n, the algorithm determines which of the four main structural motifs (hairpin loop, stacked pair, internal loop or multi-branched loop) with the exterior pair i.j has the lowest free energy. Recurrence relations are applied and several two-dimensional matrices with minimum free energies for each i and j are filled. As it is a dynamic programming algorithm, a backtracking procedure is necessary in order to build the set of base pair that gives the MFE secondary structure. Zuker and Stiegler’s algorithm 18 2.3. RNA Secondary Structure Algorithms use Θ(n4 ) time and Θ(n2 ) space. The time complexity has been reduced to Θ(n3 ) by Lyngso et al. [20] at the cost of an increased space complexity of Θ(n3 ). The Zuker and Stiegler algorithm is similar to the Viterbi algorithm for finding the most likely state sequences in hidden Markov models, and to the CYK algorithm for determining how a string can be generated by a given (stochastic) context-free grammar. A number of implementations are based on the Zuker and Stiegler algorithm and other closely related algorithms: Mfold [14], RNAfold from the Vienna RNA Package [16], RNAstructure [15], Simfold [21], and CONTRAfold [22]. In this thesis, we extensively use the Simfold and RNAstructure [15, 21] implementation of the Zuker and Stiegler algorithm for parameter estimation and for the evaluation of prediction accuracy with various parameter sets. For clarity, from now on, we call them ubcMFE and rsMFE respectively. Maximum Expected Base-pair Accuracy Method Another prediction algorithm, MaxExpect [3], makes RNA secondary structure prediction by maximizing the expected base-pair accuracy. The CONTRAfold algorithm [22] was an earlier method that employs base pair probabilities predicted with a statistical parameter estimation method for its predictions. Given a sequence x and a thermodynamic parameter set Θ, a partition function (described later in this section) can be used to predicts the probabilities of base-pair and single-stranded nucleotides in the structure for sequence x by using the Θ set. The expected accuracy of an alternative structure yl for sequence xl is given by: 19 2.3. RNA Secondary Structure Algorithms Expected accuracy(yl , xl , Θ) = X γ × 2Pbp (i, j, xl , Θ)+ (i,j)∈yl X Pss (i, xl , Θ) (i,Nx +1)∈yl (2.3) where Pbp (i, j, xl , Θ) is the base-pair (bp) probability for the base pair i.j of sequence xl with the parameter set Θ ; Pss (i, xl , Θ) is the single-stranded (ss) nucleotide probability for nucleotide at position i and (i, Nx + 1) means that nucleotide i is unpaired. In this expression the pairing probabilities are weighted by a factor γ [22]. The MaxExpect algorithm discovers the most probable structure for sequence xl by finding the structure with maximum expected accuracy. This method also predicts suboptimal structures to present as alternative suggestions for the structure. To predict the optimal structure, Lu et al. use a dynamic programming algorithm that is based on Nussinov-style recursions [23]. They then expanded the method to predict suboptimal structures. We briefly explain the partition function algorithm required for MaxExpect calculations. Partition Function Algorithm The partition function of an RNA sequence x, with respect to energy parameter Θ is defined as: Z(x, Θ) = X e −∆GT (x,ω,Θ) RT (2.4) ω∈Ω(x) where R is the gas constant, T is the absolute temperature and Ω(x) is the set of possible secondary structures for the sequence x. The partition function can be computed by a dynamic programming algorithm proposed by McCaskill [24] for the case when Ω(x) contains all 20 2.4. Accuracy Measures pseudoknot-free secondary structures . This algorithm allows the computation of probabilities of secondary structures and base pairs. So, given the partition function Z, the probability of the secondary structure y is calculated as: P (y, x, Θ) = 1 · e−∆G(x,y,Θ) Z(x, Θ) (2.5) Using the quantity above, we are able to compute the probability of an ensemble of related structures, which is important for biological function of RNA structures [24]. Finally, the probability of forming base pair i.j in the RNA sequence x, based on the given parameter set Θ, is given by: Pi,j,x,Θ = 1 · Z(x, Θ) X e ∆GT (x,γ,Θ) RT (2.6) γ∈Γi,j (x) where Γi,j (x) is the set of possible pseudoknot-free secondary structures that contain the base pair i:j. Throughout this thesis, we work with different versions of the MEA and MFE algorithms that employ different energy models. Table 2.3 shows a summary of these algorithms. We call our implementation of MEA and Andronescu’s implementation of MFE, using the MultiRNAFold free energy model, ubcMEA and ubcMFE, respectively. We also refer to the MEA and MFE approaches used in RNAstructure package rsMEA and rsMFE. The free energy model used by rsMEA and rsMFE is the Turner model. 2.4 Accuracy Measures To measure the accuracy of a predicted RNA secondary structure, it is common to compare the correctness of its base pairs relative to a reference structure, while ignoring the correctness of unpaired bases [12]. We consider a 21 2.4. Accuracy Measures Algorithm rsMEA rsMFE ubcMEA ubcMFE Free energy Model Turner99 Turner99 MultiRNAFold MultiRNAFold No. of parameters 7600 7600 ∼370-470 ∼370-470 Table 2.3: Different versions of the MEA and MFE algorithms with energy models used by each method. base pair i.j to be correct in the predicted structure if and only if this base pair exists in the reference structure. However, Lu et al. [3] consider a base pair i.j to be correct if any of the following is a base pair in the reference structure: i.j, (i − 1).j, (i + 1).j, i.(j − 1) or i.(j + 1). The reason is the uncertainty of base pair matches in comparative sequence analysis methods. Although we agree with this reasoning, we do not count such slipped base pairs as correct ones in this thesis, in order to be consistent with Andronescu et al. work [2]. To evaluate the accuracy of a predicted structure, we need the following definitions. A true positive (TP) occurs when two nucleotides are correctly predicted to pair with each other, a false negative (FN) which occurs when a base pair exists in the reference structure, but the two bases are not predicted to pair with each other (even if one or both of them are predicted to pair with other bases), and false positive (FP) occurs when a predicted base pair does not appear in the reference structure (even if one or both of the bases are known to pair with other bases). The formal definitions of TP, FP and FN in the context of RNA secondary structure prediction accuracy are as below. Note that the variables yl and yl∗ represent the predicted structure and reference structure for sequence xl , respectively. Definition 1 The base pair {i, j} ∈ yl is a true positive (TP) if and only if 22 2.4. Accuracy Measures {i, j} ∈ y∗l . Definition 2 The base pair {i, j} ∈ yl∗ is a false negative (FN) if and only if {i, j} ∈ / yl . Definition 3 The base pair {i, j} ∈ yl is a false positive (FP) if and only if {i, j} ∈ / yl∗ . Throughout this thesis, we use three measures for determining the structural prediction accuracy, namely sensitivity (also called precision or precision rate), positive predictive value or PPV (also called recall), and Fmeasure (in short F) which combines the sensitivity and PPV into a single measure. The mathematical definitions of these measures are as follows: Sensitivity = PPV = number of correctly predicted base pairs #T P = #T P + #F N number of base pairs in the reference structure (2.7) #T P number of correctly predicted base pairs = #T P + #F P number of predicted base pairs F-measure = 2 × sensitivity × PPV . sensitivity + PPV (2.8) (2.9) Sensitivity represents the ratio of correctly predicted base pairs to the total base pairs in the reference structures. PPV indicates the fraction of correctly predicted base pairs, out of all predicted base pairs. If the denominator is 0 for sensitivity and PPV, then the corresponding measure is undefined, and is not included when we average the measure over several sequences of different RNA classes. The F-measure is the harmonic mean of the sensitivity and PPV. This value is close to the arithmetic mean when the two numbers are close to each other. However, it is smaller when one of the 23 2.4. Accuracy Measures numbers is close to 0, thus penalizing predictions for which either of sensitivity or PPV is poor. We consider the F-measure to be 0, if both sensitivity and PPV are 0. It is notable that throughout this thesis, we calculate two kinds of average for our measures, weighted and unweighted. We illustrate the meanings of these averages by formulas below. Let have n RNA classes named C1 , C2 , ..., Cn with the cardinalities l1 , l2 , ..., ln , respectively, and suppose we want to calculate the average of measure “M ” over these classes. Then, we will have: Unweighted Average of M = P P P ∀C2i ∈C2 M (C2i ) ∀C1i ∈C1 M (C1i ) ∀Cni ∈Cn M (Cni ) + + ... + l1 l2 ln Weighted average of M = P P P ∀C1i ∈C1 M (C1i ) + ∀C2i ∈C1 M (C2i ) + ... + ∀Cni ∈C1 M (Cni ) l1 + l2 + ...ln (2.10) (2.11) Weighted average counts each sequence equally, regardless of which class it belongs to. Unweighted average, on the other hand, counts each class equally. Therefore, in the latter case, for classes with few sequences, each sequence has more weight than if it were in a class with more sequences. We give an example to make the meanings more clear. Let us suppose that we have a set of RNA classes with different cardinalities. We intend to add a new RNA class of very small size to this set. If we use the unweighted average for measuring the accuracy, then the size of the new added class is ignored and it contributes equally in the average of accuracy. However, if we use the weighted average, then the contribution of this class in the average of accuracy will be very low. So, depending on the importance 24 2.5. Approaches to Parameter Estimation of the RNA sequences or RNA classes, we use the weighted or unweighted average, respectively. In order to take advantages of each of these averages, we compute both in most of our experiments. However, Lu et al. [3] just calculate the unweighted average and Andronescu et al. [2] just compute the weighted average. Therefore, we consider the unweighted average for the comparison of our results with the work of Lu et al. and the weighted average for the comparison of ours with the work of Andronescu et al. 2.5 Approaches to Parameter Estimation Since the quality of free energy parameters is a significant factor in the accuracy of predictions made by many RNA secondary structure prediction algorithms (we will discuss this factor in Section 3.2 in more detail), many researchers have focused on the problem of making better estimation for free energy parameters available in Turner model. Towards this goal, Andronescu et al. [2] presented two approaches for free energy parameter estimation, namely, 1) Constraint Generation (CG) and 2) Boltzmann Likelihood (BL). These two methods are the first computational approaches to RNA free energy parameter estimation that can be trained on large sets of structural as well as thermodynamic data. In this section, we briefly explain these methods. Constraint Generation (CG) The first approach that we describe is constraint generation (CG) [25]. Given the structural set S and the thermodynamic set T , Andronescu et al.’s CG approach has a convex quadratic objective function with linear equality and inequality constraints; therefore, the problem has one global optimum and can be solved with standard optimizers for convex functions. This problem 3 To be consistent in this thesis, the notation is partially changed compared to the original notation of Andronescu et al. [2] 25 2.5. Approaches to Parameter Estimation can be written in matrix form as minimize ||δ||2 + λ||ξ||2 subject to MS · Θ − δ < 0 (2.12) MT · Θ − ξ = e 1 ||Θ − µ||2 ≤ η p Θ(0) − B ≤ Θ ≤ Θ(0) + B δ≥0 Each row of the matrix MS is [c(xl , yl ∗ ) − c(xl , yl )]T · Θ where yl ∗ is the reference structure of sequence xl , yl ∈ Ω(xl ) \ {yl ∗ } for all l ∈ {1, . . . , s}, and δ is the vector of slack values δl,yl which are introduced to handle infeasible constraints (i.e., there is no solution Θ that satisfies all constraints simultaneously). Also each row of the matrix MT is c(xj , yj∗ ) for each (xj , yj∗ , ej ) ∈ T and ξ is the vector of ξj values where ej is the error in predicting the known free energy ej . In addition, to ensure that Θ is not too far from Θ(0) , which is the initial set of parameters, the parameter B is used. Besides that, to avoid that too many variables reach the upper limits determined by the bounds, and also to avoid overfitting, a ridge regularizer is added as a constraint. Therefore, µ and η are the regularizer mean and bound that are given to the algorithm. It is important to notice that the number of structural constraints can grow exponentially with the size of the input, since for each (xl , yl ∗ ) ∈ S there may be exponentially many structures in Ω(x)l . To avoid this problem, Andronescu et al. proposed a heuristic algorithm which iteratively estimates Θ using constraints MS Θ − δ ≤ 0 for a matrix MS which only includes rows for a manageable subset of structures yl . Specifically, starting from an empty 26 2.5. Approaches to Parameter Estimation set of structures and an initial set of parameters Θ(0) , in each iteration of their algorithm, for sequence xl from S, they predict its MFE secondary structure yl using the current parameter vector Θ(k−1) and then add the following constraint. [c(xl , yl ∗ ) − c(xl , yl )]T · Θ(k) − δl,yl ≤ 0. (2.13) There are three variants of CG, namely, NOMCG, DIMCG, and LAMCG. The one described above is called NOMCG. Since the results of these three variants were comparable on the MFE algorithm [2] (with less than 0.012 difference in their accuracies) and also it wasn’t applicable for us to run LAMCG which is known as the best method, we used NOMCG in this thesis. Boltzmann Likelihood (BL) Another approach introduced by Andronescu et al. for parameter estimation is the Boltzmann Likelihood (BL) [25]. Given the structural set S, this method is based on maximizing the conditional likelihood of the reference structures in the set S. Similar approaches are used by Do et al. [22] in training parameters for their CONTRAfold algorithm, Benos et al. [26] in parameter estimation for DNA-protein interactions and Howe [27] in optimizing the weights for gene structure prediction. The BL algorithm can be described as follows: it estimates a parameter set ΘBL that maximizes the following posterior probability distribution over the structural set S and thermodynamic set T which are assumed to be independent. P (Θ|S, T ) ∝ P (S|Θ) · P (T |Θ) · P (Θ) (2.14) ΘBL is also known as MAP (maximum a posterior) estimate and can be 27 2.5. Approaches to Parameter Estimation written as: ΘBL ∈ arg maxΘ P (Θ|S, T ) (2.15) The above equation, mathematically, is equivalent to the following quantity: ΘBL ∈ arg minΘ FBL (Θ) (2.16) where FBL := −logP (S|Θ) − logP (T |Θ) − logP (Θ|µ, τ0 ) (2.17) The first term P (S|Θ) incorporates the structural set and is computed below. P (S|Θ) := Πsi=1 P (yi∗ |xi , Θ) (2.18) where P (yi∗ |xi , Θ) is the conditional probability of the reference structure yi∗ for the sequence xi in S and the parameter set Θ. P (yi∗ |xi , Θ) is also computed by using a Boltzmann Distribution [2]: P (y|x, Θ) := 1 1 · exp(− · ∆G(x, y, Θ)) Z(x, Θ) RT (2.19) where Z denotes the partition function algorithm defined in Section 2.3. The second term P (T |Θ) which incorporates the Thermodynamic set is given by: (2.20) P (T |Θ) := Πti=1 P (ej|xj , yj , Θ, ρ) where shows the distribution of energies of structure yj for sequence xj in the T set and known free energy ej and the hyperparameter ρ. This quantity is solved by a Gaussian distribution with mean ej and precision (i.e., inverse of variance) ρ. Finally, the third term P (Θ|µ, τ0 ) shows a regularization prior distribution 28 2.5. Approaches to Parameter Estimation and again can be solved by a Gaussian distribution with mean µ and precision τ0 . A significant point is that FBL is a convex function [28], thus the BL approach certainly solves the problem and finds the globally optimal parameter set. Andronescu et al. showed that the BL approach gives slightly better results than CG, but it is more expensive to run (in terms of time). Therefore, we have focused on CG in this thesis. 29 Chapter 3 Sensitivity of RNA Prediction Algorithms to Input Data In this chapter, we study to what degree the prediction accuracy of MFE and MEA algorithms is dependent on input data. By input data we mean different data sets and different thermodynamic parameter sets used for prediction. Understanding this dependency is important when we want to compare the relative merit of different algorithmic approaches, such as MFE and MEA, because it can help us tease apart to what degree a difference in performance is due to the algorithmic approach, as opposed to other factors, such as the choice of test data set. We start by exploring whether restricted or unrestricted data sets have any impact on prediction accuracy in Section 3.1. Then we study the dependency of prediction accuracy on the thermodynamic parameter sets in Section 3.2. Finally, we consider how the size of a data set will influence on the prediction accuracy of RNA secondary structures in Section 3.3. 3.1 Dependency of Prediction Accuracy on the Format of RNA Sequence In this section we demonstrate how some structural information available in RNA sequences is effective in determining RNA secondary structures with higher accuracy. The information we focus on identifies some forced unpaired 30 3.1. Dependency of Prediction Accuracy on the Format of RNA Sequence nucleotides. As we described in Section 2.1.1, there are two representations for each RNA sequence, restricted and unrestricted. By restricted we mean that there exist some lower case letters which identify nucleotides that are unpaired in the reference structure. Correspondingly, we can define two representations for each data set, restricted and unrestricted. Restricted sets have some sequences with the restricted format, and unrestricted sets have no sequence with the restricted format. There are some prediction algorithms, such as rsMEA, rsMFE, ubcMEA and ubcMFE, that take into account these forced unpaired nucleotides in their calculation. For example, Figure 3.1(b) shows the secondary structure predicted by ubcMEA when it uses the restricted format for one of the transfer RNA sequences in the MT set, and Figure 3.1(c) shows the secondary structure when this algorithm uses the unrestricted format for the same molecule. (a) Reference structure (b) Predicted secondary (c) Predicted secondary structure by using restricted structure by using unreformat stricted format Figure 3.1: Reference structure and secondary structures predicted by the ubcMEA algorithm on restricted and unrestricted format of one transfer RNA sequence, nupack.org. The orange circles in (a) show the places of forced unpaired nucleotides in this molecule. (b) shows the perfect prediction made by ubcMEA on the restricted format, while the prediction on the unrestricted format shown in (c) has poor accuracy. 31 3.1. Dependency of Prediction Accuracy on the Format of RNA Sequence To investigate the degree to which the restricted format improves prediction accuracy, we compare rsMEA and rsMFE on restricted and unrestricted representations of the MT data set. The reason that we chose rsMFE and rsMEA algorithms is that Lu et al. [3] have done their experiments by using these algorithms on the MT set, which contains restricted data in just two classes. These two classes are the transfer RNA class which includes 423 out of 484 restricted sequences and the group II intron class which just has 3 sequences in total and only 1 one of them is restricted. So we were curious to see how the results will change when we do the experiments on other created restricted classes and also on the unrestricted MT set. First, we should explain how we created the restricted MT set. As we described earlier, in the original MT set there are just two RNA classes that have restricted sequences. To investigate the effect of restricted sequences in all of the classes, we decided to generate them artificially in the other ones by using the trend of forced unpaired nucleotides in the transfer RNA class. In order to determine the average percentage of forced unpaired nucleotides in this class, we first calculated for each sequence in the transfer RNA class the percentage of lower case letters per sequence. Figure 3.2 shows the distribution of percentage of forced unpaired nucleotides for this class. We then computed the average of these percentages which is 7.6%. Next, we created other restricted classes with the same average. In other words, for each nucleotide in a sequence x which is unpaired in the reference structure, we selected it to be unpaired with the probability equal to the average of percentage of forced unpaired nucleotides in the transfer RNA class divided by 100 (i.e., it is equal to 0.076). Table 3.1 shows the results of the algorithms, rsMEA and rsMFE, on the restricted and unrestricted MT sets. As one can see in Table 3.1, different measurements of prediction accuracy (such as sensitivity, PPV or F-measure) 32 3.1. Dependency of Prediction Accuracy on the Format of RNA Sequence 70 60 Frequency 50 40 30 20 10 0‐1 1‐2 2‐3 3‐4 4‐5 5‐6 6‐7 7‐8 8‐9 9‐10 10‐11 11‐12 12‐13 13‐14 14‐15 15‐16 16‐17 17‐18 18‐19 19‐20 20‐21 21‐22 22‐23 0 Percentage of Forced Unpaired Nucleotides Figure 3.2: Distribution of the percentage of forced unpaired nucleotides in the transfer RNA class of the MT set. The x axis shows the percentage of forced unpaired nucleotides, and the y axis shows the corresponding frequencies. on the unrestricted MT set are much worse than the restricted MT set (e.g., in terms of unweighted average of F-measure, it is worse by 0.061 for rsMEA and by 0.053 for rsMFE). These results show that knowing some structural information about the forced unpaired nucleotides is highly effective in the performance of these two algorithms. We note that, unlike Lu et al., we didn’t take into consideration “slipped” bases in our measures for sensitivity and PPV. Therefore, the results in this table are not consistent with those in David Lu’s paper [3]. Nevertheless, there are some similarities between our results and those of Lu et al. In particular, the unweighted average of PPV is improved by 2 percent when the rsMEA algorithm is used for prediction both on the unrestricted MT data set (0.596 versus 0.577) and the restricted MT data set (0.674 versus 0.647) while the unweighted average of sensitivity hardly changes (e.g., 0.664 versus 33 3.2. Dependency of Prediction Accuracy on the Thermodynamic Parameters 0.666 for the unrestricted MT set). We noted that Lu et al. measure the unweighted PPV average, however, we also computed the weighted average of different measures. One interesting point about the weighted average is that the 2 percent improvement for PPV disappears altogether on both the restricted and unrestricted sets and is reduced to less than 1 percent on these sets. 3.2 Dependency of Prediction Accuracy on the Thermodynamic Parameters Recall from Section 2.2.1, that we work with two free energy models in this thesis, the Turner model and MultiRNAFold model. Summarized in Section 2.3, the rsMFE and rsMEA algorithms use the first model, and the ubcMFE and ubcMEA methods use the second model. In this section, we consider the effect of thermodynamic parameters in prediction accuracy of these algorithms. For this purpose we calculated the accuracy (i.e., F-measure) of these algorithms with different free energy parameter sets on the S-Full-Test set. We chose S-Full-Test as our test set in this and later sections, because it is the set that Andronescu et al. [2] used for their testing, which makes it possible to compare our results with theirs. Table 3.2 presents the prediction accuracy achieved for the various parameter sets. Since the rsMEA and rsMFE algorithms use the Turner model, their prediction accuracies on the other parameter sets using the MultiRNAFold model cannot be obtained. Also, as the ubcMFE and ubcMEA algorithms use the MultiRNAFold model, they cannot provide the prediction accuracy on the Full T99 parameter set that is in format of the Turner model. The results illustrate three important points. First, the prediction accuracy of these algorithms significantly varies over different parameter sets. For example, the ubcMFE and ubcMEA algorithms have the best performance 34 Restricted MT rsMEA rsMFE PPV Sensitivity PPV Sensitivity 0.596 0.618 0.554 0.604 0.699 0.743 0.659 0.718 0.681 0.730 0.689 0.757 0.680 0.697 0.636 0.670 0.831 0.847 0.777 0.821 0.553 0.541 0.512 0.527 0.502 0.603 0.511 0.633 0.849 0.858 0.834 0.855 0.674 0.705 0.647 0.698 0.737 0.769 0.728 0.776 0.689 0.671 0.753 0.751 Table 3.1: Sensitivity and PPV of rsMEA and rsMFE methods on the restricted and unrestricted MT data sets. 35 3.2. Dependency of Prediction Accuracy on the Thermodynamic Parameters Type of RNA 16S Ribosomal RNA 23S Ribosomal RNA 5S Ribosomal RNA Group I intron Group II intron Ribonuclease P RNA Signal Recognition RNA Transfer RNA Unweighted Average Weighted Average Unweighted Average (F-measure) Weighted Average (F-measure) Unrestricted MT rsMEA rsMFE PPV Sensitivity PPV Sensitivity 0.553 0.597 0.509 0.574 0.652 0.713 0.611 0.685 0.594 0.658 0.602 0.688 0.61 0.646 0.569 0.633 0.668 0.839 0.617 0.817 0.511 0.523 0.51 0.536 0.463 0.589 0.491 0.642 0.713 0.74 0.706 0.749 0.596 0.664 0.577 0.666 0.638 0.687 0.633 0.701 0.628 0.618 0.660 0.665 F"mesure!of!MEA!with!B BL*!parameter!set!on! S"Full"TTest 3.2. Dependency of Prediction Accuracy on the Thermodynamic Parameters 1.2 1 0.8 06 0.6 0.4 0.2 0 0 0.2 0.4 0.6 0.8 1 1.2 F"measures!of!MFE!with!BL*!parameter!set!on!S"Full"Test Figure 3.3: F-measure correlation between the MEA and MFE algorithms with BL* parameter set on S-Full-test. The Pearson correlation coefficient is 0.885. on the BL* parameter set (both have accuracy equal to 0.677), but have very poor results on the T04-MRF parameter set (0.521, 0.534 respectively). These results show that the quality of parameters is significantly effective in the final prediction accuracy. They also indicate that the BL* parameter set obtained by BL approach (described in Section 2.5) contains parameters with high quality compared with the other parameter sets. Second, the relative performance of MEA versus MFE varies, with MEA being slightly better than MFE in two cases, MFE being slightly better than MEA in one case, and both having the same accuracy in one case, namely BL*. As Table 3.2 shows ubcMEA andf ubcMFE have the same overall performance when they use the BL* parameter set. Figure 3.3 also shows the F-measure of the ubcMEA versus ubcMFE methods for the individual se- 36 3.3. Dependency of Prediction Accuracy on Data Set Size quences in the S-Full-Test data set. This figure illustrates that the F-measure of ubcMEA and ubcMFE with the BL* parameter set is comparable for most of individual sequences in S-FUll-Test. However, there are some sequences for which ubcMFE predicts their structures correctly while ubcMEA doesn’t and vice versa. For instance, ubcMFE makes perfect predictions for some of the transfer RNA sequences while ubcMEA doesn’t. Algorithm ubcMEA ubcMFE rsMEA rsMFE F-Measure T99-MRF BL* T04-MRF 0.574 0.677 0.534 0.598 0.677 0.521 RNA Structure Results n/a n/a n/a n/a n/a n/a Full T99 n/a n/a 0.618 0.604 Table 3.2: Accuracy comparison of different prediction algorithms with various parameter sets on the S-Full-Test set. The table presents the prediction accuracy of different algorithms with different thermodynamic sets in terms of F-measure. The parameter sets T99-MRF and T04-MRF are the Turner99 and Turner04 parameters in MultiRNAFold format. BL* is the parameter set obtained by BL approach of Andronescu et al. [2]. Also, the Full T99 parameter set is the parameter set obtained by Lu et al. [12] (see Section 2.2 for more details). “n/a” means this result is not applicable for this algorithm because the respective parameter set is not in the format of required energy model. 3.3 Dependency of Prediction Accuracy on Data Set Size In the first section of this chapter we studied the dependency of structure prediction accuracy on the structural information available in an RNA sequence . In the second section, we considered the impact of thermodynamic parameters on prediction accuracy. In this section, our purpose is to study another factor that affects the accuracy, namely the data set size. As we 37 3.3. Dependency of Prediction Accuracy on Data Set Size described in Section 2.1, we have three different data sets, S-Full, MT, and MA, which are comprised of RNA sequences with their reference secondary structures determined by reliable comparative sequence analysis [9], and thus these data sets are considered to be reliable. Tables 2.1 and 2.2 give a summary of these three sets. We hypothesize that the accuracy of an algorithm on a particular class of data, such as group I intron, varies quite significantly depending on the particular data set used, and in particular on the size of that data set. To assess our hypothesis, we applied the rsMEA, rsMFE, ubcMEA and ubcMFE algorithms on the MT and MA sets, which contain common types of RNA while the size of each individual class in MA is different than the one in MT. Tables 3.3 and 3.4 show the comparative results of these algorithms on various RNA classes in the MA and MT data sets. As one can see, the results of these four algorithm in terms of F-measure are very different for most of RNA classes in MA and MT sets. For example, for some types of RNA, such as ribonuclease P RNA and signal Recognition RNA, these algorithms perform better on MA than MT, and for some others, such as group I intron and 23S ribosomal RNA, they perform better on MT. We speculate that the main reason for this variation is related to the different size of RNA classes. For instance, the class of group I intron in the MT set has just 16 sequences, while this class has 89 sequences in the MA set, likewise, ribonuclease P RNA class in MT only has 6 sequences, however, its size is 399 in the MA set. One interesting point in these two tables is that the results of these algorithms on the transfer RNA class, which has roughly the same size in the MA and MT sets, is roughly the same. Therefore, we hypothesize that having bigger RNA classes and thus bigger data sets leads to less variation in the accuracy of RNA structure prediction. In order to test our hypothesis we designed the following experiment: 38 3.3. Dependency of Prediction Accuracy on Data Set Size Type of RNA No. Mean ± std 16S Ribosomal RNA 23S Ribosomal RNA 5S Ribosomal RNA Group I intron Group II intron Ribonuclease P RNA Signal Recognition RNA Transfer RNA Unweighted Average Weighted Average 88 27 309 16 3 6 91 484 377.88 ± 167.18 460.6 ± 151.3 119.5 ± 2.69 344.88 ± 66.42 668.7 ± 70.92 382.5 ± 41.66 267.95± 61.72 77.48 ± 4.8 F-meas (BL*) ubcMEA ubcMFE 0.649 0.621 0.711 0.683 0.739 0.743 0.705 0.65 0.720 0.739 0.471 0.460 0.641 0.621 0.718 0.775 0.669 0.662 0.710 0.732 F-meas (FullT99) rsMEA rsMFE 0.574 0.539 0.681 0.646 0.625 0.642 0.627 0.599 0.744 0.703 0.517 0.522 0.518 0.557 0.726 0.727 0.627 0.617 0.660 0.665 Table 3.3: Prediction accuracy of MEA and MFE algorithms in terms of F-measure on the unrestricted MT data set. Type of RNA No. Mean ± std 16S Ribosomal RNA 23S Ribosomal RNA 5S Ribosomal RNA Group I intron Group II intron Ribonuclease P RNA Signal Recognition RNA Transfer RNA Unweighted Average Weighted Average 675 159 128 89 2 399 364 489 485.66 ± 113.02 453.44 ± 117.85 120.98± 3.21 368.49 ±103.58 578±47 332.78± 52.34 227.04± 109.53 77.19± 5.13 F-meas (BL*) ubcMEA ubcMFE 0.630 0.645 0.648 0.644 0.785 0.781 0.646 0.632 0.54 0.609 0.646 0.604 0.732 0.722 0.711 0.767 0.667 0.676 0.676 0.683 F-meas (FullT99) rsMEA rsMFE 0.561 0.522 0.589 0.562 0.617 0.63 0.577 0.551 0.472 0.471 0.617 0.576 0.61 0.626 0.729 0.728 0.597 0.583 0.620 0.601 Table 3.4: Prediction accuracy of MEA and MFE algorithms in terms of F-measure on the unrestricted MA data set. 39 3.3. Dependency of Prediction Accuracy on Data Set Size • We selected two RNA classes for which the number of sequences in the MT set is much less than the number in the MA set, and we also selected one class for which the number of sequences in MT and MA are nearly the same. For the first two classes, we chose group I Intron and ribonuclease P RNA for the following reasons: – These classes have different sizes in the MA set so that they can be considered as the medium-size and big-size classes. Furthermore, there is a considerable size difference for these two classes in the MT and MA sets. – There is a huge accuracy variation for all four algorithms on these two classes. For example for group I Intron, the accuracy of ubcMEA on the MT and MA sets is 0.705 and 0.646, respectively, and for ribonuclease P RNA, the accuracy of ubcMFE on MT and MA is 0.460 and 0.604, respectively. We also opted for the transfer RNA class as our third class because it is large and has the same size in both MA and MT sets. • For the second step, we created 10 different subsets for each of the ”Group I Intron“, “Ribonuclease P RNA”, and ”Transfer RNA“ classes in the MA set. The size of each subset is equal to the size of the corresponding class with no duplicates in the MT set for the first round. For example for group I intron we created 10 different subsets of this class in MA with size 16 (i.e., the size of group I Intron in MT). It is notable that in order to have all subsets being equally likely, we randomly and uniformly sampled sequences from its corresponding class in MA without replacement until we had 10 sets of equal size. • As the third step, we ran ubcMEA, ubcMFE, rsMEA and rsMFE on these subsets and then reported their average and standard deviation values. 40 3.3. Dependency of Prediction Accuracy on Data Set Size • Finally, we performed some more rounds for the group I Intron and ribonuclease P RNA classes by repeating step 2 and 3, while in each new round the size of the subsets is changed. In more detail, for group I intron we performed 2 more rounds for the subset sizes of 32, and 64 and for ribonuclease P RNA we did 5 more rounds for the subset sizes of 16, 32, 64, 128, and 256. We added the last step because we hypothesize that if the size of these subsets becomes bigger, then the variance of accuracy in those will be less. Also to show that the bigger the size, the less variation in accuracy, we performed this experiment except the last step on the transfer RNA class. Tables 3.5, 3.7 and 3.8 describe in detail the results of running the above experiment on the group I intron, ribonuclease P RNA, and transfer RNA classes with subset sizes of 16, 6, and 423 respectively. The first row of these tables shows the prediction accuracy of all four algorithm on the corresponding unrestricted class of MT. Also, Table 3.6 gives a summary of results for other subset sizes of the group I Intron, including the smallest and biggest accuracy, the average and the variance of each one. To avoid confusion, just the variances of different subset sizes of ribonuclease P RNA are shown in Figure 3.4. As one can see in Table 3.6 and Figure 3.4, the variance is decreased by increasing the size of each subset for both group I Intron and ribonuclease P RNA classes, except for one point in ribonuclease P RNA (i.e., increasing the subset size from 32 to 64). This point indicates that still both sizes 32 and 64 are not enough for the ribonuclease P RNA class. We also noticed that there is not a large decrease (at most 0.009) in variance when the subset size is increased from 32 to 64 for group I intron and when the subset size is increased from 64 to 128 for ribonuclease P RNA. However, there is still about 4% difference between maximum and minimum accuracy of all algorithms for the subset size of 64 of group I Intron and for the subset size of 128 of ribonuclease P RNA. On the other hand, for ribonuclease P 41 3.3. Dependency of Prediction Accuracy on Data Set Size RNA with the subset size of 256, the variance is less than 0.01 and the difference between the maximum and minimum accuracy is also less than 34%. Therefore, we recommend that the size of each RNA class used for testing should be greater than 150. Further, as we expected, Table 3.8 shows the variance of accuracy for the transfer RNA class of size 423 is very low (at most 0.005). This supports our hypothesis that having bigger data sets will result in less variation in accuracy. According to the variances and differences of the smallest and biggest accuracies for different subset sizes of group I intron and ribonuclease P RNA, we may conclude that the size of the group I intron in the MA set is still not adequate for testing; however, the size of the ribonuclease P RNA class seems to be adequate. Having some classes with not adequate size provide good justification for using unweighted average which doesn’t take into account the size of classes. Subset MT-grp1 1 2 3 4 5 6 7 8 9 10 Average std BL* ubcMEA ubcMFE 0.694 0.65 0.717 0.696 0.692 0.687 0.650 0.640 0.700 0.677 0.633 0.613 0.661 0.642 0.706 0.686 0.690 0.673 0.673 0.633 0.683 0.673 0.681 0.662 0.026 0.028 FullT99 rsMEA rsMFE 0.627 0.599 0.614 0.572 0.628 0.599 0.585 0.587 0.598 0.566 0.565 0.533 0.602 0.588 0.615 0.591 0.596 0.579 0.567 0.535 0.625 0.593 0.600 0.574 0.022 0.023 Table 3.5: F-measure values obtained from 10 subsets of size 16, sampled from the group I intron class in the MA set with size 89. 42 3.4. Conclusion (a) F-measure values obtained from 10 different subsets of size 32, sampled from the group I intron class in the MA set with size 89 Parameter set Alg. Smallest acc. Largest acc. average Std ubcMEA 0.614 0.659 0.640 0.015 BL* ubcMFE 0.593 0.646 0.622 0.019 rsMEA 0.526 0.596 0.561 0.018 Full T99 rsMFE 0.492 0.565 0.541 0.022 (b) F-measure values obtained from 10 different subsets of size 64, sampled from the group I intron class in the MA set with size 89 Parameter set Alg. Smallest acc. Largest acc. Average Std ubcMEA 0.628 0.662 0.647 0.012 BL* ubcMFE 0.623 0.652 0.636 0.010 rsMEA 0.552 0.589 0.578 0.012 Full T99 rsMFE 0.527 0.563 0.552 0.013 Table 3.6: F-measure values obtained from 10 different subsets of the group I Intron class with sizes equal to 32 and 64, respectively. Subset RPrna-MT 1 2 3 4 5 6 7 8 9 10 Average std BL* ubcMEA ubcMFE 0.471 0.460 0.607 0.507 0.723 0.694 0.635 0.525 0.635 0.584 0.636 0.498 0.54 0.437 0.656 0.614 0.684 0.649 0.677 0.597 0.645 0.635 0.644 0.574 0.049 0.080 FullT99 rsMEA rsMFE 0.517 0.522 0.569 0.450 0.663 0.644 0.618 0.561 0.601 0.522 0.585 0.572 0.541 0.439 0.587 0.527 0.67 0.612 0.598 0.547 0.644 0.601 0.608 0.548 0.041 0.066 Table 3.7: F-measure values obtained from 10 different subsets of size 6, sampled from the ribonuclease P RNA class in the MA set with size 399 3.4 Conclusion In this chapter, we studied several factors that influence the RNA secondary structure prediction accuracy of four algorithms, namely, ubcMFE, ubcMEA, 43 Variance 3.4. Conclusion 0.1 0.09 0.08 0.07 0.06 0.05 0.04 0.03 0.02 0.01 0 ubcMEA ubcMFE rsMEA rsMFE 0 50 100 150 200 250 300 Subset size Figure 3.4: Variance of the prediction accuracy of ribonuclease P RNA in the MA set for various subset sizes (6, 16, 32, 64, 128, 256). Subset tRNA-MT 1 2 3 4 5 6 7 8 9 10 Average std BL* ubcMEA ubcMFE 0.718 0.775 0.712 0.766 0.714 0.771 0.711 0.768 0.717 0.77 0.711 0.769 0.707 0.765 0.705 0.759 0.71 0.769 0.709 0.767 0.705 0.767 0.710 0.767 0.004 0.003 FullT99 rsMEA rsMFE 0.726 0.727 0.726 0.729 0.729 0.730 0.729 0.728 0.739 0.734 0.728 0.726 0.727 0.724 0.723 0.721 0.733 0.736 0.729 0.727 0.725 0.723 0.729 0.728 0.004 0.005 Table 3.8: F-measure values obtained from 10 different subsets of size 423, sampled from the transfer RNA class in the MA set of size 489 44 3.4. Conclusion rsMFE, and rsMEA (described in Section 2.3). The first factor that we considered was the availability of some knowledge about RNA sequences. We found that if the data set contains some structural information, namely places of some of the forced unpaired nucleotides in the reference structure, then there is a significant improvement in the prediction accuracy. This suggests that using some approaches to analyze the shape of the RNA sequences (e.g., using SHAPE experiments [10, 11]) is very important in the prediction of the RNA structures. The second factor that we studied was the effect of thermodynamic parameters on the final prediction accuracy. We found two important points by considering this factor. First, we found that the prediction accuracy of ubcMFE, ubcMEA, rsMFE, and rsMFE substantially varies on different parameter sets. Subsequently we found that the BL* parameter set contains thermodynamic parameters of high quality and still outperforms other parameter sets. Second, we got that the relative performance of MFE versus MEA varies on different parameter sets. In other words, dependent on the parameter set used, one of these algorithms is better than the other one. In this way, we cannot say generally, whether MEA outperforms MFE or MFE outperforms MEA. Finally, we considered to what degree the size of a data set is important in the measured accuracy of a prediction algorithm. In this case, we found that having bigger data sets lead to the less variation in the prediction accuracy and gives more robust results. We concluded that the size of some of the classes in the MT set data set are too small to give meaningful results for that class. 45 Chapter 4 Parameter Estimation for Pseudoknot-free Secondary Structure Prediction Using the MEA Approach As we described in Section 2.2, Andronescu et al. [2] have presented two computational approaches for parameter estimation, namely, the Boltzmann Likelihood (BL) and Constraint Generation (CG) algorithms. Since the second approach, CG, is computationally cheaper than the first one and also estimates parameters that compare well to those obtained using BL in terms of producing accurate predictions [25], we decided to focus on this one in this thesis. Since the most widely used algorithms, such as MFE and MEA, depend on an RNA energy model for prediction, having a good energy model is critical for the success of these approaches. Therefore, in the following sections of this chapter, we will discuss in detail how we estimate parameters by using the ubcMEA method in CG, what is the effect of varying γ (the sensitivity versus PPV trade off) in training and testing parameters, how restricted sets are important in our training and finally a brief conclusion to our work. 46 4.1. Parameter Estimation for the MEA Approach 4.1 Parameter Estimation for the MEA Approach In this section we describe how we used the ubcMEA method instead of ubcMFE for predicting the secondary structure yl of the sequence xl in each iteration of the CG approach, specifically NOMCG. In fact, we are going to predict the secondary structure of sequence xl by maximizing its expected base-pair accuracy rather than minimizing its free energy. Briefly, the free energy minimization method yields the single best guess of the secondary structure for an RNA sequence because of its high probability in an ensemble [13]. Also, in order to make predictions, the MFE method assumes that thermodynamic parameters have the required resolution to choose the correct conformation; however, it is not a perfect assumption in reality. On the other hand, the MEA approach, using base pair probabilities for its prediction, has the advantage of not explicitly depending on this assumption. Base pair probabilities calculated from the partition function may be less sensitive to the errors of the thermodynamic parameters than is the MFE prediction [29]. They can also provide information about the base pairs shared by multiple secondary structures. It seems the MEA approach, which utilizes base pair probabilities can overcome the uncertainties arising from the errors of thermodynamic parameters. Therefore, we was carious to see if by estimating specifically for MEA, the prediction accuracy would be better than parameters estimated for MFE (w.r.t CG method). In order to estimate parameters for the ubcMEA algorithm, we adapted the CG parameter estimating method as summarized in Section 2.4.1 below. Starting from an initial set of parameters Θ(0) and an empty set of structures, in each iteration, by using the current parameter vector Θ(k−1) , we predict the secondary structure yl of xl using the MEA method which maximizes the 47 BL* T99-MRF T04-MRF BL* T99-MRF T04-MRF Final Parameter set Training Set #F ubcMEA F-measure All parameters initial BL* 363 0.677 T99-MRF 363 0.574 T04-MRF 471 0.534 All parameters trained (50 iterations) MEA*-g1 T-Full + S-Full-Train 363 0.643 MEA*T99-g1 T-Full + S-Full-Train 363 0.645 MEA*T04-g1 T-Full + S-Full-Train 471 0.631 MFE*-g1 T-Full + S-Full-Train 363 0.619 MFE*T99-g1 T-Full + S-Full-Train 363 0.616 MFE*T04-g1 T-Full + S-Full-Train 471 0.612 ubcMFE F-measure 0.677 0.598 0.521 0.623 0.606 0.571 0.653 0.660 0.624 Table 4.1: Accuracy comparison of ubcMEA and ubcMFE achieved for various parameter sets on S-FullTest. The columns, in order, present the initial parameter set used for structure prediction, the final parameter set obtained by training, the training set used for estimating parameters, the number of features, and unweighted average of F-measure for the MEA and MFE approaches on S-Full-Test. The bold values are the best for the corresponding row. We denote the parameter sets estimated by the CG approach when using the MEA prediction algorithm as MEA* and those estimated by MFE as MFE*. For the MEA* sets, the “g1” suffix indicates that the MEA algorithm was run with γ = 1 4.1. Parameter Estimation for the MEA Approach Initial Parameter set 48 4.1. Parameter Estimation for the MEA Approach following quantity: Expected accuracy(yl , xl , Θ(k−1) ) = X X γ.2 × Pbp (i, j, xl , Θ(k−1) ) + i.j∈yl Pss (i, xl , Θ(k−1) ) (4.1) i.(Nx +1)∈yl where Pbp (i, j, xl , Θ(k−1) ) is the base-pair (bp) probability for the base pair i.j (i.e., described in Section 2.3); Pss (i, xl , Θ(k−1) ) is the single-stranded (ss) nucleotide probability for nucleotide at position i and i.(Nx + 1) means that nucleotide i is unpaired. In this expression the pairing probabilities are weighted by a factor γ that is set to 1 in these experiments (we study other values of γ in the next section). Finally, the original CG approach adds the expression 4.1 as a constraint. In order to calculate the expected accuracy defined in equation 4.1 we use our implementation of MEA, ubcMEA, introduced in Section 2.3.2. Table 4.1 shows the comparative results between ubcMEA and ubcMFE on the S-Full-Test data set. Columns 5 and 6, respectively, present the prediction accuracy (in terms of F-measure) of ubcMEA and ubcMFE when they use the thermodynamic parameter set of the corresponding row mentioned in the “final parameter set” column. Some interesting points arise from these results. First, they show that parameters estimated specifically for ubcMEA have less accurate performance when they are employed with ubcMFE and vice versa. To see this point, compare the results of columns 5 and 6 in each row of the training part (the part named All parameters trained ). This comparison shows how well each of these algorithms perform on the parameter set obtained by the other one. For example, the accuracy of ubcMEA and ubcMFE with the MEA*-g1 parameter set (i.e., specifically estimated for ubcMEA with γ = 1) is 0.643 and 0.623, respectively. It indicates that the CG approach only estimates 49 4.1. Parameter Estimation for the MEA Approach parameters for a specific algorithm and then the obtained parameter sets are not suitable for employing in other prediction methods. The other point is that the ubcMEA and ubcMFE algorithms have the highest performance when they use BL* among all parameter sets in table 4.1. The BL* parameter set was obtained by the BL approach which doesn’t estimate parameters specifically for an algorithm but does maximize reference structure likelihood, which is defined in terms of structure probabilities (refereed to the CG and BL definitions in Chapter 2). So, we hypothesize that BL* can be effective for other prediction methods as well. The results also indicate that there is no significant improvement in the prediction accuracy of ubcMEA when we used it instead of ubcMFE for estimating parameters in CG except for one case (i.e., the MEA*T04-g1 parameters). However, the improvement in this case is less than 1 percent. To see this, compare in order the results in the first three rows of column 5 with those in the next three rows of column 6 in the training part. We hypothesize that the reason for this is as follows: the ubcMEA algorithm uses base-pair probabilities; however, CG doesn’t explicitly estimate parameters to maximize the correctness of base pair probabilities , instead, it satisfies some conditions to minimize the free energy of the predicted structures in each iteration (according to the CG definition in Chapter 2). So, we suggest that the CG approach is not the best method to estimate parameters for the MEA algorithm. It is possible that changing CG, so that it takes into account the base pair probabilities, would result in better accuracy for ubcMEA than ubcMFE. A further observation that can be made from the table is that using neither ubcMEA or ubcMFE on any of the estimated parameters will outperform the results of these algorithms with the BL* set. In fact, this point shows one more piece of evidence for our conclusions in Section 3.2 and indicates that BL* still has the parameters with the highest quality. 50 4.2. Parameter Estimation for Varying γ The last point that can be observed from Table 4.1 is that the initial parameter set used by CG appears to matter at least somewhat. For instance, the accuracy of ubcMFE when it employs the MEA*-g1 parameter set is 0.623 and when it uses the MEA*T99-g1 parameter set is 0.606. The first parameter set was obtained by training with initial parameter set, BL*, and the second one is the one obtained by tainting with T99-MRF as the initial parameter set. However, if the initial parameter set didn’t affect the estimated parameters, ubcMFE should give the same accuracy when it uses each of these parameter sets. We note that there is another parameter set called BL-FR (with Fmeasure 0.697) obtained by Andronescu et al. [2] which outperforms the BL* parameter set. Since BL-FR uses another free energy model (i.e., it has a lot more features than MultiRNAFold model used by BL*) we didn’t work on this parameter set throughout this thesis. We speculate using ubcMEA with BL-FR should outperform all of the results in Table 4.1. However, it is not still clear that ubcMEA on BL-FR would be better than ubcMFE on BL-FR. 4.2 Parameter Estimation for Varying γ As Equation 4.1 shows, there is a parameter γ which controls the relative sensitivity and PPV in the MEA algorithm. Lu et al. [25] showed that when γ = 1, rsMEA offers improved PPV, compared with rsMFE, while both have the same sensitivity at the level of 0.73. However, there is one important difference between the rsMEA and ubcMEA algorithms: rsMEA uses the Turner model while ubcMEA uses the MultiRNAFold model. As we described earlier, the number of features in these 51 4.2. Parameter Estimation for Varying γ two models are different, so the impact of γ can be different in these two algorithms. For example, when γ is equal to 1 in the ubcMEA approach, we get a higher PPV than sensitivity for structures predicted by the ubcMEA method on the MT set, while the sensitivity is higher than PPV in rsMEA for the same γ. This effect made us decide to study the impact of γ on our predictions to see for which values of γ the sensitivity of ubcMEA will be higher than PPV and vice versa. We hypothesized that the choice of γ that is used, when training parameters, can affect the estimated parameters. Because, the structures predicted by the ubcMEA method had higher PPV than sensitivity, so we decided to give a higher weight to the double strandedness (i.e., the propensity for base pair formation) versus single strandedness. We expected that training which gives higher weight to double strandedness might not only increase sensitivity, but also F-measure, because training provides the opportunity to find the best balance between the number of correctly predicted base pairs versus overall predicted base pairs for a given γ Increasing the double strandedness means that γ > 1, thus we did three more training runs with γ greater than 1 (γ equal to 2, 5, and 10). We didn’t set γ to be more than 10 for two reasons: 1) high expense of training for many γ (in terms of time and space) and 2 same level of accuracy on S-Full-Test achieved from γ equal to 5 and γ equal to 10 (Table 4.2). Therefore, we hypothesized that γ greater than 10 will not further increase the double strandedness. Table 4.2 shows the results obtained by estimating parameters with the CG approach for different γ values. As one can see in this table, we obtain a slight improvement in accuracy of ubcMEA with γ = 1 by at most 0.012 when we use the parameter set MEA*-g5 (i.e., achieved by training parameters when γ = 5). This result shows that some γ > 1 might enhance the quality of the estimated parameters according to our expectation. 52 4.2. Parameter Estimation for Varying γ In the next step, we varied γ from 10−5 to 105 , with even spacing on a logarithmic scale in order to find the optimal preference of double strandedness versus single strandedness for different free energy parameter sets on S-Full-Test. The plot of sensitivity as a function of PPV on this data set is shown in Figure 4.1. Our experiments indicate that the γ value of 1 results in the highest performance for all of these parameter sets. That is, ubcMEA works best when we have equal contribution for double strandedness and single strandedness in equation 4.1 as does rsMEA. Lu et al. provided a similar plot in their work [3] that shows the performance of different methods when γ is varied between 10−5 and 106 on a logarithmic scale. However, the shape of our plot is different from that of Lu et al. In fact, their figure doesn’t have the curvature that exists in our plot when γ is less then 10−3 . We hypothesized that the main reasons for this difference might include: 1) they took into account “slipped” bases (defined in Section 2.4) in computing sensitivity and PPV while we didn’t, 2) they performed their experiments on the original MT set which contains two classes that are restricted, while we used S-Full-Test which is an unrestricted set. “Slipped” bases cause more predicted base pairs to be correct. Testing on a restricted set also leads to predict less incorrect base pairs. These two factors together result in having more correct base pairs out of all predicted base pairs, specifically, when γ is very small. Recall from Section 2.4 that sensitivity is the percentage of base pairs correctly predicted and PPV is the percentage of predicted base pairs that are in the reference structure. Therefore, prediction of less incorrect base pairs and thus more correct ones will result in obtaining higher values for these measures than those obtained for ubcMEA, specifically, for lower values of γ. Further, in order to investigate the effect of γ on sensitivity and PPV 53 BL* BL* BL* BL* Final Parameter set All parameters MEA*-g1 MEA*-g2 MEA*-g5 MEA*-g10 Training Set trained (50 iterations) T-Full + S-Full-Train T-Full + S-Full-Train T-Full + S-Full-Train T-Full + S-Full-Train #F ubcMEA F-measure 363 363 363 363 0.643 0.652 0.655 0.654 Table 4.2: Prediction accuracy of the ubcMEA algorithm on various parameter sets obtained by varying γ in the CG parameter estimating approach. The columns in order, present the initial parameter set used, the final parameter set obtained by estimating for the ubcMEA method with γ = 1, the training set used, the number of features and the F-measure of ubcMEA on S-Full-Test. 4.2. Parameter Estimation for Varying γ Initial Parameter set 54 4.2. Parameter Estimation for Varying γ 0.8 γ =10^6 0.7 γ =1 0.6 Sensitivity ity 0.5 MFE‐BL* MEA*‐g1 0.4 MEA*‐g2 MEA*‐g5 0.3 MEA‐BL* MEA*‐T04‐g1 0.2 Full‐T99 0.1 γ =10^‐5 0 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 Positive Predictive Value(PPV) Figure 4.1: Performance of the MEA approach by using different free energy parameter sets on S-Full-Test, which is calculated as a function of γ that varies between 10−5 and 106 on a logarithmic scale. For comparison the performance of MFE method on BL* parameters is also shown as a single point in this plot. The point in which γ = 1 is also shown for each parameter set. 55 4.2. Parameter Estimation for Varying γ of the ubcMEA method on the MT set and to compare it with the results of rsMEA on the same set, we varied γ between 15 = 0.2, 12 = 0.5, 1, 2, and 5. Since the BL* parameter set has high quality parameters, we opted for it as our initial parameter set. Columns 2 to 6 in Tables 4.3 and 4.4 show sensitivity and positive predictive values obtained by varying γ in the ubcMEA method on the original MT set. The last three columns in these tables show the sensitivity and PPV obtained by rsMEA for γ = 0.2, γ = 1, and γ = 5 on the same set. The results in these tables illustrate some important points. First, they show the rsMEA algorithm is less sensitive to the variation of γ. In other words, increasing and decreasing the values of γ, change sensitivity and PPV of ubcMEA more than rsMEA. For example, the unweighted average of sensitivity for ubcMEA is improved by 0.062 when we increase the value of γ from 1 to 5; however, the unweighted average of sensitivity for rsMEA is improved by 0.017 for the same increasing of γ. Another point that can be observed from this table is that, the difference between the average of sensitivity and PPV (for both weighted and unweighted average) in the ubcMEA method for a given value of γ is much higher than this difference in rsMEA for the same γ. For example, the difference between the unweighted average of sensitivity and PPV in rsMEA with γ = 0.2 is 0.361, while this difference is 0.089 for rsMEA. This observation shows that the sensitivity and PPV of rsMEA are more balanced than ubcMEA for different values of γ. The last point that can be obtained from this results is that, while PPV of ubcMEA is decreasing for larger γ, it is still better than PPV of rsMEA for the same γ. It is noteworthy that ubcMEA and rsMEA still have the highest unweighted average of F-measure when γ = 1 (0.687 and 0.654, respectively). However, the F-measure of ubcMEA slightly changes to 0.685 and 0.679 for γ values of 2 and 5 and has a explicit decreasing to 0.655 and 0.584 for γ values of 0.5 and 0.2, respectively (see Table 4.5). 56 4.2. Parameter Estimation for Varying γ Type of RNA 16S Ribosomal RNA 23S Ribosomal RNA 5S Ribosomal RNA Group I Intron Group II Intron Ribonuclease P RNA Signal Recognition RNA Transfer RNA Unweighted Average Weighted Average γ = 0.2 0.453 0.471 0.567 0.395 0.366 0.368 0.544 0.511 0.459 0.522 ubcMEA-BL* γ = 0.5 γ =1 γ = 2 0.565 0.631 0.643 0.593 0.677 0.717 0.688 0.739 0.778 0.549 0.652 0.685 0.541 0.650 0.801 0.403 0.439 0.437 0.658 0.689 0.718 0.694 0.810 0.817 0.586 0.661 0.700 0.671 0.753 0.774 γ =5 0.66 0.685 0.801 0.709 0.865 0.440 0.727 0.836 0.723 0.794 rsMEA-Full-T99 γ = 0.2 γ = 1 γ = 5 0.526 0.597 0.605 0.639 0.713 0.709 0.595 0.658 0.676 0.567 0.646 0.658 0.737 0.847 0.875 0.480 0.523 0.527 0.537 0.589 0.602 0.774 0.857 0.884 0.607 0.679 0.696 0.669 0.736 0.763 Table 4.3: Comparison of sensitivity for different values of γ in the ubcMEA and rsMEA algorithms on several RNA classes of the original MT data set. Type of RNA 16S Ribosomal RNA 23S Ribosomal RNA 5S Ribosomal RNA Group I Intron Group II Intron Ribonuclease P RNA Signal Recognition RNA Transfer RNA Unweighted Average Weighted Average γ = 0.2 0.750 0.854 0.858 0.845 0.934 0.680 0.710 0.932 0.820 0.869 ubcMEA-BL* γ = 0.5 γ =1 γ = 2 0.687 0.668 0.607 0.790 0.739 0.696 0.777 0.739 0.706 0.792 0.768 0.688 0.892 0.862 0.836 0.559 0.507 0.466 0.644 0.600 0.571 0.885 0.879 0.823 0.753 0.720 0.674 0.808 0.786 0.739 γ=5 0.580 0.662 0.671 0.645 0.806 0.422 0.551 0.799 0.642 0.711 rsMEA-Full-T99 γ = 0.2 γ = 1 γ = 5 0.615 0.553 0.518 0.725 0.652 0.616 0.665 0.594 0.564 0.682 0.610 0.562 0.891 0.831 0.804 0.586 0.511 0.488 0.521 0.463 0.445 0.880 0.847 0.830 0.696 0.633 0.603 0.751 0.700 0.653 Table 4.4: Comparison of positive predictive value (PPV) for different values of γ in the ubcMEA and rsMEA algorithms on several RNA classes of the original MT data set. 57 4.3. Training and Testing on Restricted Sets Type of RNA 16S Ribosomal RNA 23S Ribosomal RNA 5S Ribosomal RNA Group I Intron Group II Intron Ribonuclease P RNA Signal Recognition RNA Transfer RNA Unweighted Average Weighted Average γ = 0.2 0.565 0.607 0.683 0.538 0.526 0.478 0.616 0.660 0.584 0.650 ubcMEA-BL* γ = 0.5 γ = 1 0.620 0.649 0.678 0.711 0.729 0.739 0.649 0.705 0.673 0.741 0.468 0.471 0.651 0.641 0.778 0.843 0.655 0.688 0.732 0.769 γ=2 0.625 0.707 0.740 0.686 0.818 0.451 0.636 0.820 0.685 0.756 γ=5 0.617 0.701 0.73 0.676 0.834 0.431 0.627 0.817 0.679 0.749 rsMEA-Full-T99 γ = 0.2 γ = 1 γ = 5 0.567 0.574 0.558 0.679 0.681 0.659 0.628 0.625 0.615 0.619 0.627 0.606 0.807 0.839 0.838 0.527 0.517 0.507 0.529 0.518 0.511 0.823 0.852 0.856 0.647 0.654 0.644 0.707 0.720 0.716 Table 4.5: Comparison of F-measure for different values of γ in the ubcMEA and rsMEA algorthims on several RNA classes of the original MT data set. 4.3 Training and Testing on Restricted Sets As we described in Section 3.1 we obtained much better prediction accuracy on restricted set than unrestricted set. We wondered whether it is possible to get even better results on restricted or unrestricted sets, if training is also done on restricted sets. We thought that having restricted sets will result in more accurate predictions in each iteration of CG, and thus it may cause to get better estimated parameters. To assess our suggestion, we needed to create restricted training and testing sets. Recall from Section 3.1 that, in order to make a set restricted, for each nucleotide in the sequence x which is unpaired in the reference structure, we selected it to be forced unpaired with probability equal to the average of percentage of forced unpaired nucleotides in the transfer RNA class of the original MT set divided by 100. We created our restricted S-Full-Train and S-Full-Test sets by using this approach. We performed our training in the same way as described in Section 4.1 except that this time we used the restricted S-Full-Train and S-Full-Test sets 58 4.4. Conclusion for training and testing, respectively. To find out how effective these new parameter sets are, we calculated the prediction accuracy of ubcMEA with these new parameter sets on both restricted and unrestricted S-Full-Test. Table 4.6 shows the results obtained by estimating of parameters using restricted data sets. None of the new parameter sets outperform the BL* parameter set on both the restricted and unrestricted S-Full-Test (the F-measure is 0.738 and 0.677, respectively). Also none of the new parameter sets, rMEA*-g1, rMEA*-g2, rMEA*-g5, and rMEA*T04-g1 which are obtained by training on sets of restricted structures, have any explicit improvement over their corresponding sets obtained by training on unrestricted sets. For example the accuracy of ubcMEA when it uses the MEA*T04-g1 parameter set is 0.631; however, its accuracy when it employs rMEA*T04-g1 which is obtained by training on the restricted S-Full-Train is 0.616. Also the accuracy of ubcMEA is slightly reduced when it employs the rMEA*-g1 parameter set compared with its accuracy on the corresponding set, MEA*g1 which is obtained by training on the unrestricted set (0.639 versus 0.643). We speculate that the reason for getting no improvement in this training is as follows: training parameters using the restricted sets might not be a good idea. However, these restricted sets will result in more accurate predicted structures in each iteration of CG, the parameters estimated in this way, are more restricted. So, they don’t provide high performance when they are used for testing on restricted and unrestricted sets. 4.4 Conclusion In this chapter we first described how we trained thermodynamic parameters, specifically, for the ubcMEA algorithm using CG. Our experiments showed that while the estimated parameters will increase the prediction accuracy of 59 Initial Parameter set BL* MEA*-g5 MEA*T04-g1 - Training Set All parameters initial P - 363 - 363 - 471 All parameters trained (50 iteration) rMEA*-g1 T-Full + res-S-Full-Train 363 rMEA*-g2 T-Full + res-S-Full-Train 363 rMEA*-g5 T-Full + res-S-Full-Train 363 rMEAT04-g1 T-Full + res-S-Full-Train 471 F-measure restricted S-Full-Test S-Full-Test 0.738 0.705 0.684 0.677 0.655 0.631 0.698 0.708 0.709 0.687 0.639 0.651 0.654 0.616 Table 4.6: Accuracy comparison of ubcMEA algorithm with various parameter sets on the restricted and unrestricted S-Full-test. The BL* parameter set was obtained by Andronescu et al.[2]. MEA*-g1 and MEA*T04-g1 parameter sets are obtained by using CG on unrestricted training and testing sets. The last four rows show the results obtained by training parameters on the restricted S-Full-Train set. We denote the parameter sets estimated by CG when using the MEA prediction algorithm on the restricted sets as rMEA*. The bold values are the best for the corresponding column. 4.4. Conclusion BL* BL* BL* T04 Final Parameter set 60 4.4. Conclusion this method compared with rsMEA, the BL* parameter set still outperforms the other sets and gives the highest accuracy on this algorithm. We also found that the parameters estimated specifically for ubcMEA have less accurate performance when they are used in ubcMFE and vice versa. We then focused on tuning the parameter γ which indicates the relative weight of double strandedness versus single strandedness. We concluded the best prediction accuracy for ubcMEA is achieved when we have equal contribution for double and single strandedness (i.e., γ = 1). We also found that increasing the value of γ is more effective in sensitivity average of ubcMEA than rsMEA. Finally, we performed a restricted training for the ubcMEA algorithm using CG. It means that we estimated parameters by using restricted data sets. Our results showed that the unrestricted training still gives the parameters with higher quality than the restricted training in most of the cases. 61 Bibliography [1] Tania A. Baker, James D. Watson, and Stephen Bell. Molecular biology of the gene. Pearson, Benjamin Cummings, January 2004. [2] M. Andronescu. Computational approaches for RNA energy parameter estimation. PhD thesis, University of British Columbia, Dept. of Computer Science, 2008. [3] Z. J. Lu, J. W. Gloor, and D. H. Mathews. Improved RNA secondary structure prediction by maximizing expected pair accuracy. RNA (New York, N.Y.), 15(10):18051813, October 2009. PMID: 19703939. [4] C. Dennis. Gene regulation: The brave new world of RNA. Nature, 418(6894):122–124, July 2002. [5] I. Tinoco and C. Bustamante. How RNA folds. Journal of Molecular Biology, 293(2):271–281, October 1999. PMID: 10550208. [6] B. Felden. RNA structure: experimental analysis. Current Opinion in Microbiology, 10(3):286–291, June 2007. [7] M. Andronescu, V. Bereg, H. H. Hoos, and A. Condon. RNA STRAND: the RNA secondary structure and statistical analysis database. BMC Bioinformatics, 9:340, 2008. PMID: 18700982. [8] J. A. Jaeger, D. H. Turner, and M. Zuker. Improved predictions of secondary structures for RNA. Proceedings of the National Academy of Sciences of the United States of America, 86(20):7706 7710, October 1989. 62 . Bibliography [9] J. J. Cannone, S. Subramanian, M. N. Schnare, J. R. Collett, L. M. D’Souza, Y. Du, B. Feng, N. Lin, L. V. Madabusi, K. M. Mller, N. P., Z. Shang, N. Yu, and R. R. Gutell. The comparative RNA web (CRW) site: an online database of comparative sequence and structure information for ribosomal, intron, and other RNAs. BMC Bioinformatics, 3:2, 2002. PMID: 11869452. [10] E. J. Merino, K. A. Wilkinson, J. L. Coughlan, and K. M. Weeks. RNA structure analysis at single nucleotide resolution by selective 2’-hydroxyl acylation and primer extension (SHAPE). Journal of the American Chemical Society, 127(12):4223–4231, March 2005. PMID: 15783204. [11] K. A. Wilkinson, E. J. Merino, and K. M. Weeks. Selective 2’-hydroxyl acylation analyzed by primer extension (SHAPE): quantitative RNA structure analysis at single nucleotide resolution. Nature Protocols, 1(3):1610–1616, 2006. PMID: 17406453. [12] D. H. Mathews, J. Sabina, M. Zuker, and D. H. Turner. Expanded sequence dependence of thermodynamic parameters improves prediction of RNA secondary structure. Journal of Molecular Biology, 288(5):911940, May 1999. PMID: 10329189. [13] D. H. Mathews, M. D. Disney, J. L. Childs, S. J. Schroeder, M. Zuker, and D. H. Turner. Incorporating chemical modification constraints into a dynamic programming algorithm for prediction of RNA secondary structure. Proceedings of the National Academy of Sciences of the United States of America, 101(19):72877292, May 2004. PMID: 15123812. [14] M. Zuker. Mfold web server for nucleic acid folding and hybridization prediction. Nucleic Acids Research, 31(13):34063415, July 2003. PMID: 12824337. [15] D. H. Mathews. Using an RNA secondary structure partition function to determine confidence in base pairs predicted by free energy minimiza63 . Bibliography tion. RNA (New York, N.Y.), 10(8):11781190, August 2004. PMID: 15272118. [16] I. L. Hofacker, W. Fontana, P. F. Stadler, L. S. Bonhoeffer, M. Tacker, and P. Schuster. Fast folding and comparison of RNA secondary structures. Monatshefte for Chemie Chemical Monthly, 125(2):167188, 1994. [17] J. N. Zadeh, C. D. Steenberg, J. S. Bois, B. R. Wolfe, M. B. Pierce, A. R. Khan, R. M. Dirks, and N. A. Pierce. NUPACK: analysis and design of nucleic acid systems. Journal of Computational Chemistry, pages n/a–n/a, 2010. [18] R. Nussinov, G. Pieczenik, J. R. Griggs, and D. J. Kleitman. Algorithms for loop matchings. SIAM Journal on Applied Mathematics, 35(1):68– 82, July 1978. [19] M. Zuker and P. Stiegler. Optimal computer folding of large RNA sequences using thermodynamics and auxiliary information. Nucleic Acids Research, 9(1):133148, January 1981. PMID: 6163133. [20] R. B. Lyngso, M. Zuker., and C. N. Pedersen. Fast evaluation of internal loops in RNA secondary structure prediction. Bioinformatics, 15(6):440 445, June 1999. [21] M. Andronescu. Algorithms for predicting the secondary structure of pairs and combinatorial sets of nucleic acid strands. MSc thesis, University of British Columbia, Dept. of Computer Science, 2003. [22] C. B. Do, D. A. Woods, and S. Batzoglou. CONTRAfold: RNA secondary structure prediction without physics-based models. Bioinformatics, 22(14):e90 e98, July 2006. [23] R. Nussinov and A. B. Jacobson. Fast algorithm for predicting the secondary structure of single-stranded RNA. Proceedings of the National 64 . Bibliography Academy of Sciences of the United States of America, 77(11):6309 6313, November 1980. [24] J. S. McCaskill. The equilibrium partition function and base pair binding probabilities for RNA secondary structure. Biopolymers, 29(67):11051119, June 1990. PMID: 1695107. [25] M. Andronescu, A. Condon, H. H. Hoos, D. H. Mathews, and K. P. Murphy. Efficient parameter estimation for RNA secondary structure prediction. Bioinformatics, 23(13):i19 i28, July 2007. [26] P. V. Benos, A. S. Lapedes, D. S. Fields, and G. D. Stormo. SAMIE: statistical algorithm for modeling interaction energies. Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing, page 115126, 2001. PMID: 11262933. [27] K. Howe. Gene prediction using a configurable system for the integration of data by dynammic programming. PhD thesis, University of Cambridge, 2003. [28] B. Taskar. Learning Structured Prediction Models: A Large Margin Approach. PhD thesis, Stanford University, 2005. [29] D. M. Layton and R. Bundschuh. A statistical analysis of RNA folding algorithms through thermodynamic parameter perturbation. Nucleic Acids Research, 33(2):519 524, January 2005. 65 Appendix A Differences between the features of the T99-MRF and T04-MRF models Table A.1 shows the types of features and the number of each type in the T99-MRF and T04-MRF models. As one can see in this table, the T04-MRF model has more features than T99-MRF in two feature types, namely Tstackh and Tloop. It is notable that similar features can be assigned different free energy values in these two models. 66 Appendix A. Differences between the features of the T99-MRF and T04-MRF models Type of Feature Stack Tstackh Internal-AU-GU closure penalty Internal-GA-AG mismatch Internal-UU mismatch Internal loop 1*1 Misc.internal11 basic mismatch Misc.internal11 GG mismatch Internal loop 1*2 Misc.internal21 match Misc.internal21-AU closure Internal loop 2*2 Misc.internal22 delta same size Misc.internal22 delta different size Misc.internal22 delta 1stable 1unstable Misc.internal22 delta-AC Misc.internal22 match Dangles Internal penalty by size(4, 5, 6) Bulge penalty by size(1-6) Hairpin penalty by size(3-9) Misc.hairpin-GGG Misc.hairpin(c1, c2, c3) Misc.multi offset Misc.multi helix penalty Misc.multi free base penalty Misc.intermolecular initiation Tloop (tetra and triloop) Total No. of features in T99-MRF 22 96 1 1 1 30 1 1 52 1 1 49 1 1 1 1 1 48 3 6 7 1 3 1 1 1 1 30 363 No. of features in T04-MRF 22 156 1 1 1 30 1 1 52 1 1 49 1 1 1 1 1 48 3 6 7 1 3 1 1 1 1 78 471 Table A.1: Feature types and the number of each one in the T99-MRF and T04-MRF models. 67
- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Theses and Dissertations /
- Empirical analysis of the MEA and MFE RNA secondary...
Open Collections
UBC Theses and Dissertations
Featured Collection
UBC Theses and Dissertations
Empirical analysis of the MEA and MFE RNA secondary structure prediction Hajiaghayi, Monir 2010
pdf
Notice for Google Chrome users:
If you are having trouble viewing or searching the PDF with Google Chrome, please download it here instead.
If you are having trouble viewing or searching the PDF with Google Chrome, please download it here instead.
Page Metadata
Item Metadata
Title | Empirical analysis of the MEA and MFE RNA secondary structure prediction |
Creator |
Hajiaghayi, Monir |
Publisher | University of British Columbia |
Date Issued | 2010 |
Description | RNA molecules play critical roles in the cells of organisms, including roles in gene regulation, catalysis, and synthesis of proteins. Since their functions largely depend on their folded structures, having accurate and efficient methods for RNA structure prediction is increasingly valuable. A number of computational approaches have been developed to predict RNA secondary structure. There have been some recent advances for two of these approaches, namely Minimum Free Energy (MFE) and Maximum Expected Accuracy (MEA). The main goals of this thesis are twofold: 1) to empirically analyze the accuracy (i.e., the number of correctly predicted base pairs) of the MFE and MEA algorithms in different energy models and 2) to estimate the free energy parameters specifically for the MEA by using the constraint generation (CG) approach. We present new results on the degree to which different factors influence the prediction accuracy MFE and MEA. The factors that we consider are: structural information that is provided in addition to RNA sequence, thermodynamic parameters, and data set size. Structural information significantly increases the accuracy of these methods. The relative performance of MFE and MEA changes according to the free energy parameters used; however, both have the best performance when they use Andronescu et al.'s BL* parameter set. Having bigger data sets results in less variation in prediction accuracy of the MFE and MEA algorithms. Furthermore, we try to find better free energy parameters for the MEA algorithm. For our purpose, we adapt the CG approach so that it specifically optimizes parameters for MEA. Overall, when parameters are trained using MFE, they slightly outperform the parameters estimated for MEA. For the MEA algorithm, we also study the effect of parameter γ which controls the relative sensitivity and positive predictive value (PPV). We obtain that when γ=1, the accuracy of MEA (in terms of F-measure on the BL* parameter set) is better than its accuracy for other values of γ. We also find that the sensitivity and PPV of MEA will interestingly change for different values of γ using the BL* parameter set. |
Genre |
Thesis/Dissertation |
Type |
Text |
Language | eng |
Date Available | 2010-10-22 |
Provider | Vancouver : University of British Columbia Library |
Rights | Attribution-NonCommercial-NoDerivatives 4.0 International |
DOI | 10.14288/1.0051976 |
URI | http://hdl.handle.net/2429/29473 |
Degree |
Master of Science - MSc |
Program |
Computer Science |
Affiliation |
Science, Faculty of Computer Science, Department of |
Degree Grantor | University of British Columbia |
GraduationDate | 2010-11 |
Campus |
UBCV |
Scholarly Level | Graduate |
Rights URI | http://creativecommons.org/licenses/by-nc-nd/4.0/ |
AggregatedSourceRepository | DSpace |
Download
- Media
- 24-ubc_2010_fall_hajiaghayi_monir.pdf [ 906.7kB ]
- Metadata
- JSON: 24-1.0051976.json
- JSON-LD: 24-1.0051976-ld.json
- RDF/XML (Pretty): 24-1.0051976-rdf.xml
- RDF/JSON: 24-1.0051976-rdf.json
- Turtle: 24-1.0051976-turtle.txt
- N-Triples: 24-1.0051976-rdf-ntriples.txt
- Original Record: 24-1.0051976-source.json
- Full Text
- 24-1.0051976-fulltext.txt
- Citation
- 24-1.0051976.ris
Full Text
Cite
Citation Scheme:
Usage Statistics
Share
Embed
Customize your widget with the following options, then copy and paste the code below into the HTML
of your page to embed this item in your website.
<div id="ubcOpenCollectionsWidgetDisplay">
<script id="ubcOpenCollectionsWidget"
src="{[{embed.src}]}"
data-item="{[{embed.item}]}"
data-collection="{[{embed.collection}]}"
data-metadata="{[{embed.showMetadata}]}"
data-width="{[{embed.width}]}"
data-media="{[{embed.selectedMedia}]}"
async >
</script>
</div>
Our image viewer uses the IIIF 2.0 standard.
To load this item in other compatible viewers, use this url:
https://iiif.library.ubc.ca/presentation/dsp.24.1-0051976/manifest