The Open Collections website will be unavailable July 27 from 2100-2200 PST ahead of planned usability and performance enhancements on July 28. More information here.

UBC Theses and Dissertations

UBC Theses Logo

UBC Theses and Dissertations

Detecting common secondary structure elements in RNA sequences Shah, Sohrab P 2005

Your browser doesn't seem to have a PDF viewer, please download the PDF to view this item.

Notice for Google Chrome users:
If you are having trouble viewing or searching the PDF with Google Chrome, please download it here instead.

Item Metadata

Download

Media
831-ubc_2005-0311.pdf [ 6.49MB ]
Metadata
JSON: 831-1.0051575.json
JSON-LD: 831-1.0051575-ld.json
RDF/XML (Pretty): 831-1.0051575-rdf.xml
RDF/JSON: 831-1.0051575-rdf.json
Turtle: 831-1.0051575-turtle.txt
N-Triples: 831-1.0051575-rdf-ntriples.txt
Original Record: 831-1.0051575-source.json
Full Text
831-1.0051575-fulltext.txt
Citation
831-1.0051575.ris

Full Text

Detecting common secondary structure elements in R N A sequences by Sohrab P. Shah B.Sc. (Hons) Biology, Queens University at Kingston, 1996 B.Sc. Computer Science, University of British Columbia, 2001 A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF M a s t e r o f S c i e n c e in THE FACULTY OF GRADUATE STUDIES (Computer Science) The Universi ty of B r i t i sh Co lumbia April 2005 © Sohrab P. Shah, 2005 Abst rac t As evidence for the important and diverse roles of RNA molecules in our cellular machinery continues to grow, there is an increasing interest in developing compu-tational methods to analyse RNA sequences. Sets of evolutionarily related RNA sequences contain signals at both the sequence and secondary structure levels that can be exploited to detect motifs common to all or a portion of those sequences. Mo-tifs conserved in evolution are believed to be functionally important and therefore detection of such motifs could yield novel functional RNA sequences. We developed an algorithm called DISCO to detect conserved motifs in a set of unaligned RNA sequences. Our algorithm uses a powerful probabilistic formalism called covariance models (CM) to model motifs. We introduce a novel approach to initialise a CM using pairwise and multiple sequence alignment. The CM is then iteratively refined using expectation maximisation. Our initialisation method can operate on sequence signals alone using only a portion of the input sequences to initialise a CM to recover the remaining motif instances. We tested our algorithm on 26 data sets derived from Rfam seed alignments of microRNA (miRNA) precursors and conserved elements in the untranslated re-gions of mRNAs (UTR elements). By three measures of specificity and positive predictive value, our algorithm performed well on the miRNA data sets and showed a bi-modal distribution for the UTR element data sets where the motif was com-pletely missed, or very accurately predicted. In a comparison test with a competing algorithm, DISCO outperformed RNAProfile in measures of sensitivity and positive predictive value, although the running time of RNAProfile was considerably faster. The accuracy of our algorithm was unaffected by average percent pairwise sequence identity, overall length or number of sequences in the input data, indicating that DISCO could be run with similar accuracy on diverse data sets. The running time of DISCO is 0(W3 + L2W2 + L 3 ) where W is the width of the motif and L is the length of the longest sequence in the input data. This is an improvement on SLASH, the only other RNA motif finding algorithm in the literature that uses CMs. n Table of Contents Abstract ii Table of Contents iii List of Tables viii List of Figures ix List of Algorithms xi Acknowledgements xii Dedication xiii 1 Introduction 1 1.1 Challenging the central dogma of biology 2 1.2 Terminology 3 1.3 Examples of ncRNAs and R N A elements 3 1.3.1 U T R elements 4 1.3.2 microRNAs 5 1.4 Discovering new ncRNAs and R N A elements 6 1.5 Thesis outline 7 2 Related Work 8 iii 2.1 Considering sequence and secondary structure in R N A sequence analy-sis 8 2.2 Techniques for R N A sequence analysis 9 2.2.1 Dynamic programming with thermodynamic energy models . 9 2.2.2 Probabilistic R N A modeling using covariance models 9 2.2.3 Simultaneous alignment and consensus structure prediction of unaligned R N A sequences 16 2.3 Consensus structure prediction from aligned R N A sequences 16 2.3.1 Al i fo ld 17 2.3.2 Pfold 17 2.3.3 Limitations of consensus structure prediction 18 2.4 ab-initio R N A gene detection algorithms 19 2.4.1 Q R N A . 19 2.4.2 R N A z 19 2.5 Searching for R N A s from predefined models 20 2.5.1 R N A M o t i f 20 2.5.2 Infernal and Rfam 21 2.5.3 Rsearch 22 2.6 Specific R N A gene detectors 22 2.7 R N A motif discovery in unaligned sequences 23 2.7.1 F O L D A L I G N 24 2.7.2 S L A S H 25 2.7.3 RNAPro f i l e 25 2.7.4 C o m R N A 27 2.7.5 G P R M 28 2.7.6 C A R N A C 29 2.7.7 Al ido t 30 2.8 Summary 31 iv 3 The Disco Algorithm 32 3.1 Assessment of related work 32 3.1.1 Global vs local sequence alignments 34 3.1.2 Using pairwise and multiple sequence alignment to initialise a C M 34 3.1.3 Expectation Maximizat ion for C M refinement 35 3.2 Proposed new algorithm: D I S C O 35 3.2.1 Initialisation phase 35 3.2.2 Refinement phase 38 3.2.3 Complexity 39 3.2.4 Input and output 39 3.2.5 Parameters 39 3.2.6 Implementation 40 4 Experiments 50 4.1 Major questions and new ideas 50 4.1.1 Question 1: W h i c h properties more strongly represent a motif embedded in a set of unaligned R N A sequences? 50 4.1.2 Question 2: C a n a C M be initialised using only a few sequences? 51 4.1.3 Question 3: C a n a crude secondary structure filter be used to filter out subsequences not expected to be an instance of the motif? 51 4.2 Da ta 51 4.3 Prel iminary experiments . 53 4.4 Fixed parameter experiments 54 4.5 Evaluation methods 54 4.5.1 Score 54 4.5.2 Sensitivity and positive predictive value 54 4.5.3 Reported accuracy measures 56 v 4.6 Comparison with RNAProfile 57 5 Results 63 5.1 Preliminary experiments 63 5.1.1 Sequence method of alignment is superior 63 5.1.2 k = 6 gives best results for NS and NPPV . . 64 5.1.3 Dot-composition threshold 64 5.2 Fixed parameter experiments 69 5.2.1 Score is not an indicator of accuracy 73 5.2.2 Testing the effects of properties of the input data 76 5.3 DISCO is more accurate, but considerably slower than RNAProfile . 83 6 Discussion 96 6.1 Interpretation of results 97 6.1.1 Sequence information is more important than secondary struc-ture in the initialisation phase 97 6.1.2 Relatively few sequences can be used to initialise the CM . . 98 6.1.3 The unpaired nucleotide filter improves performance but does not compromise accuracy for miRNA data sets 98 6.1.4 Relatively poor accuracy of aligned nucleotides 99 ,6.1.5 Poor UTR element results 100 6.2 Improvements on other methods 100 6.2.1 Comparison between DISCO and RNAProfile 101 6.3 Drawbacks and limitations of the DISCO method 102 6.3.1 Limitations of covariance models 102 6.3.2 Reliance on predictive folding 102 6.4 Potential improvements and future work 103 6.4.1 Multiple sequence alignment method 103 6.4.2 The use of priors when initialising the CM 104 vi 6.4.3 Using phylogenetic weighting 104 6.4.4 Optimisations 104 7 Conclusions 106 Bibliography 108 vii List of Tables 3.1 Parameters of the D I S C O algorithm 46 4.1 Description and characteristics of the U T R test data sets 58 4.2 Description and characteristics of the m i R N A test data sets 59 4.3 Parameters for RNAProf i l e in comparison experiment 62 5.1 Results of fixed parameter experiments for m i R N A data 90 5.2 Results of fixed parameter experiments for U T R data 91 5.3 Correlation statistics measuring association between P I D and accuracy 92 5.4 Correlation statistics measuring association between length of the seed alignment and accuracy 93 5.5 Correlation statistics measuring association between length of the input data and accuracy 94 5.6 Correlation statistics measuring association between number of se-quences in the input data and accuracy 95 v i i i List of Figures 3.1 R I B O S U M - 8 5 - 6 0 scoring matrix 47 3.2 D I S C O S U B scoring matrix 48 3.3 Sample output of the D I S C O algorithm 49 4.1 Seed alignment of RF00237 in Stockholm format 60 5.1 Distr ibut ion of NS for each alignment method 65 5.2 Distr ibut ion of NPPV for each alignment method 66 5.3 Distributions of NS for each k = 2 to k = 7 67 5.4 Distributions of NPPV for each k = 2 to k = 7 68 5.5 Distributions of NS for each value of d 70 5.6 Distributions of NPPV for each value of d 71 5.7 Distr ibut ion of proportion of unpaired nucleotides for m i R N A seed alignments 72 5.8 Distr ibut ion of accuracy results for seventeen m i R N A test data sets 74 5.9 Distr ibut ion of accuracy results for nine U T R element test data sets 75 5.10 Scatter plot of AS against normalised score 77 5.11 Scatter plot of APPV against normalised score 78 5.12 Scatter plot of NS against normalised score 79 5.13 Scatter plot of NPPV against normalised score 80 5.14 Scatter plot of SS against normalised score 81 5.15 Scatter plot of SPPV against normalised score 82 ix 5.16 Comparison of NS between DISCO and RNAProfile 85 5.17 Comparison of NPPV between DISCO and RNAProfile 86 5.18 Comparison of SS between DISCO and RNAProfile 87 5.19 Comparison of SPPV between DISCO and RNAProfile 88 5.20 Running time vs size of input data for DISCO and RNAProfile . . . 89 x List of Algorithms 1 D I S C O algorithm • 41 2 Initialisation 42 3 Mul t ip le alignment algorithm 43 4 Sequence to profile algorithm 44 5 Expectat ion Maximisat ion algorithm 45 x i Acknowledgement s I would like to thank my advisor Anne Condon for her guidance, patience and dedication to teaching. Thanks also to Wye th Wasserman for invaluable comments on this work and to Francis Ouellette for providing me wi th professional experiences that gave me a thorough immersion into the field of bioinformatics. Thanks to my colleagues in the Beta-lab at U B C Computer Science for stimulating discussions about their work, my work and bioinformatics in general. Most of al l , thank you to my family: Nik iah , Zubin and Zahra for their support and patience on rainy Sundays and to my parents for always being there when I needed them. SOHRAB P . SHAH The University of British Columbia April 2005 x i i To my son Zubin and my daughter Zahra. xiii Chapter 1 Introduction One of the most surprising facts revealed by the H u m a n Genome Project is that our complement of D N A contains much fewer protein coding genes than originally thought [10, 69]. T h e most recent estimates put the number of protein coding genes at between 20,000 and 25,000 [11]. In comparing other eukaryotic organisms' genomes such as the worm (13,000), or the fly (18,000), it is clear that the number of genes i n an organism is not a good measure of biological complexity. If not the number of protein coding genes, then what can explain the complexi ty of our species in comparison to others? A growing body of literature is point ing to R N A genes as one possible expla-nation. R N A has been largely under-studied i n comparison to proteins, but work done i n the last decade is revealing promising insights into the importance of R N A . One class of R N A s called non-coding R N A s ( n c R N A s ) are transcribed from D N A , but do not get translated into proteins - n c R N A s are the functional molecules them-selves. W h i l e some n c R N A s such as t ransfer-RNAs and r ibosomal R N A s have been well characterised i n the literature for decades, it is becoming increasingly apparent that a diverse array of n c R N A s that have important roles i n the normal function of cellular processes remain to be discovered. In addi t ion to n c R N A s there are R N A elements embedded i n other R N A molecules that also influence biochemical 1 processes i n the cell . R N A elements are relatively smal l R N A s that are part of a larger R N A molecule. These elements are often responsible for fine-grained control of translat ion of m R N A transcr ipts 1 . Numerous examples of n c R N A s and R N A elements are included i n reposi-tories such as R f a m (ht tp : / /www.sanger .ac .uk/Sof tware /Rfam/) [20, 21]. A s evi-denced by the growth of R f a m (approximately 400 families of n c R N A s and counting), new R N A molecules are being discovered at a rapid rate. These new discoveries of R N A s w i t h diverse and cr i t ica l functions give us strong evidence that R N A molecules play a more significant role in cellular process than previously thought. 1.1 C h a l l e n g i n g t h e c e n t r a l d o g m a o f b i o l o g y In a recent review article, M a t t i c k [50] describes R N A as the "architects of eukaryotic complexi ty". He suggests that the central dogma of molecular biology that governs the "gene —> R N A transcript —> protein" process is somewhat incomplete. M a t t i c k paints a picture of a system of R N A molecules that are an intricate part of the biochemical processes i n the cell and exert control over t ranscript ion, t ranslat ion and thereby gene expression. In another review article, E d d y describes experimentally determined families of R N A genes that have a wide array of functions [14]. He describes a rapidly evolving, diverse array of R N A molecules that are involved i n gene regulation, R N A processing, catalysis of sub-unit formation i n cr i t ica l cellular machinery. E a c h of these reviews postulate that we are gaining new insights about the R N A world . M o r e discoveries of R N A molecules w i l l further our understanding of how they contribute to biochemical pathways and cellular processes i n a l l forms of life. Furthermore, i t is now clear that a complete understanding of cellular processes must include R N A molecules as key participants. *An m R N A is an intermediate biochemical form of a protein coding gene that is created through the process of transcription from D N A . The m R N A is then translated (in whole or in part) into a protein through the process of translation. 2 1.2 Terminology R N A (ribonucleic acid) molecules are polymers composed of four different nucleotides - Adenosine ( A ) , Cytosine (C) , Guanine (G) and U r a c i l (U) . A n R N A sequence has a direct ionali ty due to the biochemically dist inct 5' and 3' ends of the nucleotides 2 . The order of the nucleotides is called the sequence of the molecule. R N A molecules also exhibit folding patterns where nucleotides in the sequence can base-pair w i t h other nucleotides i n the sequence and form hydrogen bonds. T h i s acts to thermody-namically stabilise the molecule i n the cell. The resultant set of pairs of nucleotides defines the secondary structure of the molecule. T h e secondary structure is closely l inked to the function of the molecule and is conserved i n evolution, despite po-tential sequence divergence. T h e concept of evolutionary conservation of secondary structure is highlighted throughout this thesis and w i l l be discussed i n depth in later sections. 1.3 Examples of ncRNAs and RNA elements Recent experimental evidence has uncovered several examples of n c R N A s and R N A elements that influence biochemical activities in the cell . These include m i c r o R N A s ( m i R N A s ) , which are smal l (approximately 22 nucleotide) molecules that b ind to m R N A s and influence protein expression [38, 40]. In addi t ion, R N A elements in the untranslated regions ( U T R s ) of m R N A transcripts of certain genes are essential for regulation of the translat ional machinery [9, 23, 25, 39]. There are also numerous examples of human disease genes whose causative agents are gain of function mu-tations related to changing R N A secondary structure of their m R N A [4, 31], and elements i n human v i r a l pathogens that are being investigated as possible drug tar-2 The 5th and 3rd carbon atoms in the sugars of all the non-terminal nucleotides of an R N A molecule are covalently bonded to a phosphate group. The end nucleotides of an R N A molecule either have their 5th carbon unbonded to a phosphate (5' end), or the 3rd carbon unbonded to a phosphate group (3' end). 3 gets [44, 67]. T h e remainder of this thesis, however, w i l l focus on m i R N A s and U T R elements and the following section w i l l give detailed explanations and examples of these R N A s i n part icular . 1.3.1 UTR elements General ly speaking, U T R elements are small R N A structures that occur i n the 5' or 3' U T R s of m R N A transcripts. Usually, these elements are target b inding sites for proteins that once bound, influence the process of t ranslat ion of the m R N A into protein. We outline several examples of U T R elements below. Iron response element The i ron response element ( IRE) is an R N A element of approximately 30 nucleotides found i n the 5' U T R of ferri t in genes and the 3' U T R of the transferrin receptor genes. T h e I R E forms a hai rp in secondary structure which is a b inding target for two forms of the i ron regulation protein ( IRP) [9, 24]. In low concentrations of i ron, I R P binds to the I R E which prevents the binding of the 43S r ibosomal subunit complex, thereby inh ib i t ing the translat ion of the ferritin protein. In high concentrations of i ron, the I R P does not b ind to the I R E and translat ion of ferri t in proceeds so that i ron can be sequestered by the ferritin protein [25]. In contrast, the transferrin receptor I R E s b ind I R P in the presence of i ron. T h i s has the effect of stabil ising the m R N A of the receptor against degradation and enables t ransla t ion [25]. These two essential proteins in i ron metabolism, ferrit in and transferrin receptor, are under translational control that is mediated in part by a U T R element. Selenocysteine insertion sequence (SECIS) The selenocysteine insertion sequence (SECIS) is an R N A element found in the 3' U T R of animal selenoprotein m R N A s [39]. The selenocysteine c o d o n 3 ( U G A ) also 3 A codon is a group of three successive nucleotides in an mRNA that code for a specific amino acid. There are 43 = 64 possible codons, and each codon encodes one of 20 possible 4 codes for a stop codon. T h e selenocysteine insertion sequence is a ha i rp in structure required by the translat ional machinery to dist inguish the selenocysteine codon from a stop codon [39]. Internal ribosomal entry sites Another class of U T R elements are the internal r ibosomal entry sites ( I R E S ) present i n some eukaryotic and v i r a l m R N A s . These R N A structure elements permit in i -t ia t ion of t ranslat ion under physiological circumstances such as mitosis, apoptosis, hypoxia and some v i r a l infections [23]. Several genes are known to contain I R E S whose structures are conserved, w i th significant sequence var iabi l i ty [23]. 1.3.2 microRNAs M i c r o R N A s are a class of n c R N A s that are known to influence the function of the post-transcriptional machinery i n eukaryotes [33, 59]. A n in i t i a l genome wide survey of human m i R N A s estimates that m i R N A s constitute 1-2% of a l l eukaryotic genes and that they exert regulatory influence on the product ion of 10% of a l l human protein products [33]. M i R N A s are first transcribed as m i R N A precursors that form stem-loop sec-ondary structures [37]. T h e mature m i R N A sequence (21-25 nucleotides) is excised from the ha i rp in by an enzyme called Dicer [37]. T h e mature m i R N A binds to its m R N A target by complementary base pair ing [33]. Th i s b inding process induces cleavage of the m R N A or represses translat ion by unknown mechanisms [33]. F i rs t discovered i n the nematode worm (Caenorhabditis elegans), the lin-4 and let-7 m i R N A s were shown to be involved in temporal control of development events [38, 40, 43]. These m i R N A s exhibited expression patterns i n larval develop-ment and homologues were found in animals w i t h bi lateral symmetry [43]. amino acids. The translation machinery moves along the m R N A codon by codon and adds the appropriate amino acid to the growing protein. 5 T h e studies out l ined i n [38, 40, 43] represented the first high-thoughput dis-coveries of m i R N A s . Subsequent papers have hinted at the extent to which m i R N A s are active i n eukaryotic organsims, especially higher order plants and animals. John et al. [33] reported more than 2000 m R N A s w i t h m i R N A target sites conserved across mammals and Sempere et al. [59] reported tissue-specific expression patterns for 119 previously unreported m i R N A s including 19 brain-specific m i R N A s imp l i -cated in human neuronal development. Pfeffer et al. [56] found several m i R N A s expressed in Epps te in-Bar r virus ( E B V ) that exploit silencing of t ranslat ion as a mechanism to regulate host genes. Th is work implies that other viruses may also encode m i R N A s and i t further demonstrates the diverse functionali ty of m i R N A s . 1.4 Discovering new ncRNAs and RNA elements The examples out l ined above illustrate the importance of n c R N A s and R N A el-ements i n biological functions. It is compell ing to consider the possibil i ty that numerous n c R N A s and R N A elements w i t h diverse functional roles remain to be discovered. Developing computat ional methods to detect previously unknown ncR-N A s and elements w i l l contribute to this discovery process. Accura te tools w i l l help identify putative targets for biochemical assays. A s previously mentioned, n c R N A s and R N A elements exhibi t sequence and secondary structure conservation through evolution. T h i s implies that regions of D N A or R N A sequences that have functional roles w i l l exhibit shared sequence and secondary structure patterns. Computa t iona l tools can exploit this phenomenon by searching sequences from different organisms in order to detect shared sequence and secondary structure patterns, and thereby detect putat ive functional regions i n R N A sequences. The work described in this thesis investigates a new method for computat ional discovery of shared sequences and secondary structures i n a set of R N A sequences. We test this method for its abi l i ty to detect U T R elements and m i R N A s i n publ ic ly available R N A data sets. 6 1.5 Thesis outline The remainder of thesis describes computat ional tools for processing R N A sequences (Chapter 2), formalises the computat ional problem to detect conserved secondary structure motifs i n unaligned R N A sequences (Chapter 3), describes a new approach to address this problem (Chapter 3), describes experiments to test the performance of the a lgor i thm (Chapter 4), presents the results of those experiments (Chapter 5) and concludes w i t h a discussion of those results (Chapter 6) and future directions for this work. 7 Chapter 2 Related Work To address the need for computat ional tools to analyse R N A sequences, numerous methods have been developed. T h e following sections discuss the different classes of computat ional problems related to R N A sequence and structure pat tern discovery, review current methods and discuss some of their strengths and l imitat ions. Th i s chapter is meant to be a broad survey of the literature related to R N A sequence analysis and serves as necessary background material for the formal description of the computat ional problem addressed by our algori thm described i n Chapter 3. 2.1 Considering sequence and secondary structure in R N A sequence analysis D N A and R N A sequences are under different evolutionary selection pressures. W h e n analysing a set of D N A sequences, the analysis of the sequence itself is usually suffi-cient to detect patterns. R N A , however is subject to evolutionary constraints at the secondary structure level. T h i s results i n the phenomenon that related R N A genes share common secondary structure, but may have l i t t le sequence similari ty. Th i s renders most D N A and protein sequence mot i f finding and alignment algorithms potentially unsuitable for R N A sequence analysis. There are several categories of 8 algorithms that take bo th sequence information and secondary structure informa-t ion of a set of R N A sequences into account. These categories are defined by the computat ional problem they address. We outline five categories of computat ional problems and published algorithms designed to solve these problems. A l l categories are included as they have some relevance to the novel a lgor i thm introduced i n Chap-ter 3. 2.2 T e c h n i q u e s f o r R N A s e q u e n c e a n a l y s i s We begin w i t h a description of several techniques which are used for R N A sequence analysis, R N A sequence and structure modeling and al igning R N A sequences and secondary structures. 2.2.1 Dynamic programming with thermodynamic energy models Most of the work we w i l l describe deals w i th the predict ion of secondary structure using mult iple sequences as input. Often, it is necessary to predict the secondary structure of a single sequence. Th i s problem has been solved by Zuker and Steigler [71] who used a dynamic programming algori thm based on a thermodynamic energy model scoring function. T h i s approach is implemented i n the m F o l d [71] program, the R N A F o l d program [26] and Pa i rFo ld [1]. 2.2.2 Probabilistic RNA modeling using covariance models A major contr ibut ion to R N A structure modeling is the concept of covariance models ( C M ) due to E d d y and D u r b i n [16]. The C M framework allows for R N A sequences and structures to be modeled probabilist ically. The framework uses stochastic con-text free grammars ( S C F G ) which can be constructed from a mult iple alignment of related R N A s and a consensus secondary structure. Analogous to hidden markov models ( H M M s ) for sequence alignment, S C F G s provide a powerful formalism for 9 aligning sequences to a model of an R N A family. In fact, a C M is a generalisation of an H M M - p r o f i l e for modeling sequence motifs or protein families [16]. Us ing the C Y K / I N S I D E scanning algori thm (see [15]), one can generate an opt imal alignment (using dynamic programming) of a sequence to a S C F G that can include insertions and deletions [13]. Us ing this a lgori thm, instances of the R N A family represented by the model can be detected i n a genome or a database of sequences of interest. To understand this alignment step, we need to first define S C F G s and C M s and describe how a C M is buil t from the input . We w i l l describe how C M s are constructed and the I N S I D E alignment a lgor i thm i n detail i n the following sections. F o r m a l d e s c r i p t i o n s o f S C F G s a n d C M s A C M is a specialised S C F G designed to model a mult iple R N A sequence alignment and consensus secondary structure wi th position-specific scores [13, 15, 16]. A n S C F G is made up of a set of M non-terminal states, K different terminal symbols over an alphabet ( A , C , G , U for R N A ) , and a set of product ion (or emission) rules of the form V —• 7, where V is a non-terminal state and 7 is a s t r ing over the set of (non-terminal and terminal) symbols that includes the empty str ing e. E a c h product ion rule is associated wi th a normalized probabi l i ty (summing to 1) for any given non-terminal V. A C M is a specific type of S C F G adapted to model R N A s at the sequence and secondary structure levels. A C M has seven types of states and product ion rules (see Table 1 i n [15]). The product ion probabi l i ty of a given state v is the product of an emission probabil i ty ev and a t ransi t ion probabil i ty tv. C M s are made up of seven types of states (P, L , R , B , D , S, E - see Table 1 i n [15]). E a c h state has its own emission and transi t ion probabil i t ies which can be derived from the input alignment and consensus secondary structure (see below). The types of states correspond to alignment 'events'. For example, consensus base pairs are modeled w i t h a P state, consensus single stranded residues by L and R 10 states, deletions relative to the consensus by D states. The branching topology of the RNA secondary structure is modeled by begin (B), start (S) and end (E) states. This topology creates an ordered tree. The tree structure makes the alignment algorithm tractable, however it creates a limitation of CMs in that they cannot model pseudoknots or base triples. Modeling pseudoknots would require a more complicated data structure that supported cycles. This is beyond the framework of SCFGs. The next section will describe how to create and set parameters correspond-ing to the states and their emissions and transitions. Steps in constructing a C M Creating a C M from a multiple alignment and a consensus structure involves three steps. First, a guide tree (explained below) is created from the consensus structure. Next, an empty C M (with no emission probabilities or transition probabilities) is created from the guide tree. Finally, the C M transition probabilities and emission probabilities are calculated from the sequences in the alignment. We will now discuss each step in detail. The guide tree The guide tree is made up of eight types of nodes (node type shown in parentheses): MATP (P) which models a base pair, MATL (L) and MATR (R) which model an unpaired base, BIF (B) which models a branch in the structure, ROOT (S) which is the begin node, BEGL (S) and BEGR (S) which are the begin nodes for adjacent branches in the tree, and END (E) which is the end node for a branch. The guide tree is created in the following way from the input: first, a consensus structure is derived from the input (base pairings are assigned and columns with a high proportion of insertions are ignored). Once the consensus structure is determined, the guide tree is created by first creating a root node, then traversing the structure, 11 assigning unpaired bases to M A T L or M A T R nodes and paired bases to M A T P nodes. Branching structures are specified with B I F nodes and then starting each branch with either a B E G L or B E G R node. A l l branches end in E N D nodes. By convention, M A T L nodes are always used before M A T R nodes in the case of unpaired bases. (See [15] for more details on this process). The result of the guide tree creation is a binary tree made up of nodes in the set { M A T P , M A T L , M A T R , BIF , R O O T , B E G L , B E G R , E N D } . Guide tree to covariance model Once the guide tree is created, it is 'expanded' into a covariance model. Each node type in the guide tree can be in a particular state, where the state is in the set of possible states for that node type. Consider a given node to be a discrete random variable. The set of states for a node are just the possible values of the random variable (eg six sides of a dice). The possible states for each node type are listed in Table 3 of [15]. There are two types of states for each node type: split set states and insert set states. For example, M A T P nodes have 6 possible states: M P , M L , M R , D split set states and IL, IR insert set states. Each state has a set of transitions to their child states. The set of transitions depends on whether the state is a split set state or insert set state. Split set states transition to every insert set state in the same node and to every split set state in the next node. Insert set states self transition and also transition to every split set state in the next node. For IL states, there is a special case in that there is a transition to the IR state in the same node. The B state makes an obligate transition to the S states of the child B E G L and B E G R nodes. So in summary, a guide tree is a set of nodes connected by the consensus secondary structure. The C M is a set of states determined by the nodes in the guide tree that can connect to other states within the same node, or other states in the child node. The structure of the C M is therefore made of 1) an array of states 12 in the set { M P , M L , M R , D , I L , I R , B , S , E } whose ordering is constrained by the guide tree from which it was derived and 2) a set of directed 'edges' or transitions between states. The edges can connect a state to itself, or can connect a state to a state i n an adjacent downstream node i n the guide tree._ W i t h the 'empty ' C M now i n place, the t ransi t ion probabili t ies and emis-sion probabili t ies can now be calculated from the sequences i n the input mult iple alignment. T h i s is the parameterisation step, analagous to determining the state transi t ion probabili t ies i n an H M M . Th i s step is also known as the t ra ining step i n the broader machine learning literature. Parameterizing a C M Each sequence i n the mult iple alignment can be represented unambiguously by a unique parse tree over the topology of the 'empty' C M . Once each parse tree has been determined, the t ransi t ion and emission probabili t ies can be computed. T h e transit ion probabili t ies are the observed counts of transitions from state v —> 7 where v is the parent state and 7 is the chi ld state. These counts are normalised over the chi ld states and a Dir ichle t prior is used to smooth the probabili t ies - this results i n no 0 probabili t ies. The emission probabilit ies are s imi l ia r ly computed by counting the observed unpaired nucleotides for M A T R and M A T L states and the observed pairs for M A T P states. A g a i n the counts are normalised and a Dir ichlet prior is used. A paramaterizat ion therefore results i n a M - b y - M t ransi t ion mat r ix where M is the number of types of states i n the model and ^(7) are the entries in the matr ix , a l -by-4 mat r ix for the unpaired emissions and a l -by-16 mat r ix for the paired emissions. Note that in the special case of B states, the t ransi t ion probabilities are 1 to the S states of the branches. 13 From CMs to sequence alignments A s stated previously, the most pract ical use of a C M is to al ign a sequence to i t . Similar to sequence alignment problems, we wish to compute the op t ima l alignment of the sequence to the R N A profile represented by the C M . It is possible to calcu-late the log-probabil i ty of the most l ikely C M parse tree using the C Y K / I N S I D E algori thm (see [15]). One. can also calculate the probabi l i ty of a l l possible parse trees using a ' sum' instead of 'max ' . T h i s results i n the l ikel ihood of the data given the model, or P(x\9) where x is the sequence and 9 are the parameters of the C M . We also would like the unambiguous alignment of each nucleotide i n the sequence to each nucleotide i n the structure that is modeled by the C M (allowing for gaps). Th i s can be derived from the op t imal parse tree which is computed using the I N S I D E 1 " a l -gor i thm described i n [15]. T h i s a lgori thm includes a traceback procedure to recover the opt imal parse tree from the matrices computed i n I N S I D E . The INSIDE algorithm The I N S I D E a lgor i thm is analogous to the Forwards a lgor i thm for H M M s and is described i n [15]. It computes a 3-dimensional mat r ix av(i,j) where v is a state i n the C M and is a subsequence for which the parse tree is being calculated. Each entry in the ma t r ix is the max imum likel ihood of the sub-parse-tree rooted at state v that generates the subsequence Th i s a lgor i thm works from the inside of a structure and proceeds outwards. In other words, i t starts at the empty str ing and nul l trees of end states and proceeds outward, generating longer subsequences and larger parse-trees. It begins at the highest numbered state and proceeds to the lowest numbered state. For each state, the resultant ma t r ix is upper-triangular containing i n each cell the m a x i m u m likelihood over the possible ch i ld states of v of transi t ioning from the current state to the next state given i and j and the current state v. Because i t works backwards through the states, the matrices for the chi ld states of v are already filled in . A s an aside, the states in the C M are enumerated 14 such that the indices of chi ld states are always greater than their parent states. A t the end of the run, the l ikel ihood of the opt imal parse tree is i n the cell av(l, L) where L is the length of the sequence. T h e score of the alignment is a log-odds ratio, which is the difference of the log-l ikelihood of the alignment and the log l ikel ihood that the sequence was generated at random w i t h independently and identical ly dis t r ibuted nucleotide emission probabili t ies [16]. T h e l o g - o d d s s c o r e w i l l p r o v e t o b e i m p o r t a n t w h e n w e i n t r o d u c e o u r a p p r o a c h i n C h a p t e r 3. U s i n g E M t o i t e r a t i v e l y r e f ine a C M E d d y and D u r b i n [16] introduce an E M method for learning the most l ikely C M from a set of unaligned R N A sequences. Consider the case where a mult iple alignment is given. Th i s approach uses covarying positions i n the mult iple alignment to determine the R N A consensus structure. The i r a lgori thm uses the Nussinov-Jacobsen/Zuker folding a lgor i thm [71], but instead of maximiz ing the stacking energies, they max-imize the mutua l information content, which is calculated on compensatory muta-tions i n the sequence alignment that preserve the secondary structure. Once the consensus structure backbone is i n place, it is converted into a C M and then the parameters are estimated i n the usual way. Al ignments are re-estimated by then aligning the sequences to the model. T h e n the consensus structure and the new C M are re-estimated given the new alignment. Th i s procedure iterates un t i l the model parameters do not change significantly between iterations. A significant problem is how to achieve a mult iple alignment to start wi th . C r e a t i n g a n i n i t i a l m u l t i p l e a l i g n m e n t for a C M a n d E M as a m e t h o d for i t e r a t i v e r e f i n e m e n t a r e d i s c u s s e d e x t e n s i v e l y w h e n w e i n t r o d u c e o u r a l g o r i t h m i n C h a p t e r 3. 15 2.2.3 Simultaneous alignment and consensus structure prediction of unaligned R N A sequences A l i g n i n g two sequences allows us to infer the positions i n the alignment that are conserved and the positions that are not. We can assume that conservation gives us information at a nucleotide level as to which positions are evolut ionari ly con-strained. Furthermore, once an alignment is attained, a common secondary struc-ture can be inferred by examining positions in the alignment where compensatory mutations have occurred to preserve a secondary structure. Algor i thms have t r ied to exploit these characteristics of R N A sequences to simultaneously al ign and predict the secondary structure of two or more sequences. Such algorithms operate un-der the assumption that sequence conservation and compensatory mutations which preserve secondary structure imp ly evolutionary constraint and furthermore imply biologically functional importance. These algorithms attempt to represent a set of R N A sequences w i t h a single consensus secondary structure and produce a mult iple alignment of the sequences i n the process. Examples of algori thms that compute a simultaneous alignment and consensus structure predict ion of unaligned R N A se-quences are given in Section 2.7.1 and Section 2.7.2. 2.3 Consensus structure prediction from aligned R N A sequences The algorithms described i n this section address the predict ion of a consensus struc-ture for a set of aligned R N A sequences. They make use of sets of evolutionari ly related sequences i n order to infer a conserved secondary structure. In part icu-lar, these methods capitalise on the occurrence of compensatory mutations i n the columns of the alignment. These compensatory mutations indicate the preservation of secondary structure even i n the context of a mutated sequence. These algorithms avoid the high computat ional cost of simultaneously al igning and predict ing sec-16 ondary structure as i n [18] and are therefore applicable to longer R N A s such as 16S or 23S r ibosomal R N A s [28] provided a quali ty mult iple alignment is available. 2.3.1 Alifold The A l i f o l d dynamic programming algori thm [28] incorporates bo th thermodynamic stabil i ty and compensatory mutations to generate a consensus secondary structure of a mult iple R N A sequence alignment. The classical energy model for folding R N A se-quences used i n [71] is modified to include a covariance measure of pairwise columns i n the mult iple alignment. A simplified view of this model is that a compensatory muta t ion that preserves a base pair is comparable to the energy gained by extending a helix by one base pair. The a lgor i thm has an 0 ( L 3 ) running t ime where L is the length of the alignment. In contrast to other methods such as [35], the folding a lgor i thm is only run once, thereby reducing computat ional effort. T h e A l i f o l d a l g o r i t h m p l a y s a n i m p o r t a n t r o l e i n o u r a p p r o a c h as d e s c r i b e d i n C h a p t e r 3. 2.3.2 Pfold Pfold [35, 36] estimates the m a x i m u m a pr ior i secondary structure from an alignment of R N A sequences assumed to have identical secondary structure. T h e key idea is that P fo ld uses phylogenetic information combined w i t h S C F G s . T h e algori thm produces the most l ikely structure based on an evolutionary model that takes into account bo th nucleotide subst i tut ion rates and base-pair subst i tut ion rates. These substi tut ion rates are based on large sets of published alignments of t R N A s and large sub-unit r ibosomal R N A s . The grammar used by Pfo ld was shown to be the most effective grammar in a recent evaluation [12]. The key idea of P F o l d is that the algori thm derives a phylogenetic tree from the alignment using a m a x i m u m likelihood method over the possible trees calculated from the subst i tut ion matrices. The tree is then used together w i t h the alignment 17 and the grammar to infer the most l ikely structure. T h e authors demonstrate that using phylogenetic information confers a significant performance advantage in pre-dic t ing the true structure. In recent work [36] the authors present an opt imisat ion of the tree-estimation step which significantly improves performance. The major advantage of this method is the incorporat ion of an evolutionary model into a Bayesian approach to inferring secondary structure. T h i s allows the output to have a posterior probabil i ty associated w i t h it - a quanti tat ive probabil ist ic measure that is not available w i t h most other approaches. Moreover, by using the S C F G formalism, one can easily compute the l ikel ihood of each posi t ion i n the structure, a l lowing users to assess which positions i n the structure are more l ikely than others [35]. T h e major drawbacks are that P F o l d cannot handle pseudoknots (due to the use of S C F G s (see Section 2.2.2)), its i terative use of the relatively expensive 0 ( L 3 ) I N S I D E algori thm for phylogenetic tree and secondary structure estimation and the need for a good input alignment to achieve satisfactory results. We w i l l discuss the implicat ions of iterations of the I N S I D E a lgor i thm i n the context of opt imising our own work i n Chapter 6. Also , we w i l l explore the idea of using phylogenetic information and evolutionary models further in Chapter 6. 2.3.3 Limitations of consensus structure prediction The major l imi ta t ion of both A l i f o l d and Pfo ld is the requirement of a user-supplied alignment as input . P fo ld has the added assumption that each sequence i n the alignment share the identical secondary structure. T h i s introduces a cycl ica l prob-lem where a good qual i ty mult iple alignment can be achieved i f a consensus structure is known, but consensus structure determination requires a good qual i ty mul t ip le alignment. We w i l l address this problem i n our work and present a solution based on pairwise sequence alignment wi th secondary structure filters. 18 2 . 4 a b - i n i t i o R N A g e n e d e t e c t i o n a l g o r i t h m s Ab- in i t i o predict ion of R N A genes has proved to be a significantly more difficult challenge than predict ing genes that encode proteins. T h i s difficulty is due to a lack of obvious stat is t ical signals emitted by R N A genes [57] i n genomic sequence. Whereas protein-coding sequences emit signals based on codon bias and specific signals called splice site consensus sequences in eukaryotic genomes [8], n c R N A s lack such discr iminat ing features. However, two notable efforts have shown progress i n ab-init io R N A gene detection. 2.4.1 Q R N A The Q R N A algor i thm [58] takes two related D N A sequences as input and classifies the ind iv idua l nucleotides i n the sequences as protein coding, n c R N A or 'other'. Th i s method exploits patterns of compensatory mutations observed i n n c R N A s and patterns of synonymous codon substitutions i n protein coding sequence. T h e authors implement a p a i r - S C F G model for n c R N A s and a p a i r - H M M for protein-coding and 'other'. T h e a lgor i thm takes i n a pair-wise alignment of two sequences and calculate a log-odds score for the n c R N A model compared to the two H M M s . T h i s method was used to classify regions i n the E. coli genome not known to code for proteins. Four other bacterial species were used as comparative sequence i n the input . T h e authors speculate that this method could be used as a screening tool to detect n c R N A s i n genomic sequence. T h e major drawbacks of this method are the 0(L3) running t ime for the p a i r - S C F G scanning algori thm which makes the a lgor i thm unusable for long sequences, and the l imi t a t ion to two aligned sequences as input . 2.4.2 R N A z Washiet l et al. [68] used comparative sequence analysis and thermodynamic sta-bi l i ty to classify mult iple sequence alignments as n c R N A sequence or n o n - n c R N A sequence. T h e a lgor i thm takes as input a mult iple sequence alignment of two or 19 more sequences. U s i n g Al i fo ld , the algori thm first calculates the average m i n i m u m free energy ( M F E ) of the consensus secondary structure of the input mult iple align-ment. The M F E measure is normalized using standard regression techniques into a z score. A structure conservation index (SCI) is calculated that measures the degree of conservation of secondary structure of the sequences i n the input mult iple align-ment. F i n a l l y using bo th the z score and the S C I , the alignment is classified w i t h a support vector machine as n c R N A or not. The authors report better results and substantially improved running times when compared to Q R N A [58]. T h e method is a significant step forward i n ab-initio R N A gene detection. Notably , previous work has shown that M F E of R N A sequences does not emit strong enough statis-t ica l signals to be used i n a R N A detection algori thm [57]. T h i s work shows that thermodynamic s tabi l i ty can indeed be useful for predict ion of n c R N A s . We w i l l discuss this not ion further i n Chapter 6. 2.5 Searching for RNAs from predefined models 2.5.1 RNAMotif R N A M o t i f [47] uses deterministic descriptors to model R N A secondary structure motifs. T h e descriptors describe rules that instances of the mot i f must follow i n order to be retrieved i n a search of a sequence database. R N A M o t i f , therefore provides a R N A secondary structure definition language. T h e language allows R N A structures to be s t r ic t ly or loosely defined depending on the amount of specific knowledge (sequence or base pair ing constraints) the user has about the R N A structure they are t ry ing to model . T h e R N A M o t i f description language can describe complex R N A structures at bo th the secondary and tertiary levels. T h e major drawback of this type of model ing is that if the constraints are too specific, the searching algori thm is bound to miss instances of the mot i f that are subtle variants. Furthermore, i f the constraints are too general, many false positives w i l l be returned. T h i s latter issue 20 is addressed by Fogel et al. [17] who use evolutionary computat ion to filter large 'hit l ists ' returned from the R N A M o t i f search algori thm. 2.5.2 Infernal and Rfam Infernal [15] is follow up work to probabilist ic modeling w i t h C M s [13, 16]. Th i s work describes a memory-efficient improvement to the or iginal C M I N S I D E alignment algori thm that reduces the memory requirement to 0(L2logL) from 0 ( L 3 ) where L is the length of the sequence. T h i s improvement enabled the development of the R f a m database [20] (h t tp : / /www.sanger .ac .uk/Sof tware /Rfam/) , by making alignments of C M s to longer sequences tractable w i t h respect to memory requirements. R f a m now has sequence and secondary structure mult iple alignments and C M s of nearly 400 R N A families whose act ivi ty has been experimentally verified and published in the literature. T h i s enables searching any new genomic sequence for homologues of the R N A families included i n Rfam. The major l imi ta t ion is that the alignment algori thm remains 0(L3) i n t ime complexity and so searching large genomes remains a computat ional challenge. T h i s is addressed to an extent by using heuristics to filter the sequence being searched using sequence alignment first [20] or using more robust techniques that filter out low-scoring sequences i n 0(L2) t ime and only run I N S I D E on the remaining sequences [70]. Another important use of R f a m is in testing mot i f detection algorithms and R N A searching tools. T h e database contains two types of s t ructural mult iple align-ments, seed and full , that represent families of n c R N A s and U T R elements. T h e seed alignments are hand curated mult iple alignments w i t h accompanying anno-tated consensus structure information. T h e seed sequences are a l l derived from G e n B a n k / E M B L sequences so obtaining the sequence that flanks the seed sequence is t r iv ia l . T h e seed sequences themselves are experimentally val idated sequences published i n the literature. T h e quali ty of alignments and ease of use make this data attractive to use as test data for new algorithms. Indeed, several studies have 21 recently used R f a m alignments as test data [66, 68, 70]. T h i s indicates that R f a m data is gaining acceptance as a quali ty source for R N A sequence and structure align-ments. We also chose to use R f a m data as a source of h igh qual i ty alignments to evaluate our own algor i thm (see Section 4.2 for more details). 2.5.3 Rsearch The Rsearch a lgor i thm [34] is buil t on top the Infernal [15] system. T h e authors present a local alignment search tool that allows a user to search a database of se-quences for homologues of a single R N A sequence w i t h known secondary structure. The key contr ibut ion of this work is the development of empir ical ly derived substi-tut ion matrices (called R I B O S U M matrices) that are specific to R N A sequences and secondary structures. These matrices were developed under the B L O S U M model of evolutionary divergence and contain subst i tut ion rates for unpaired nucleotides and base-paired nucleotides. T h e matrices are used to parameterise a 'single-sequence' C M . Instead of der iving the emission probabilit ies from a mult iple sequence align-ment as described i n Section 2.2.2, the emission probabili t ies are set from a R I -B O S U M matr ix . Trans i t ion probabilit ies are derived using a s tandard affine gap formulation (see [34] for further details). B y using R I B O S U M matrices, a C M can be constructed from a single sequence wi th known secondary structure. Another key contr ibut ion of this work is a local alignment variant of the C M alignment al-gor i thm presented i n [13. 16]. T h e authors report higher sensit ivity and specificity than standard loca l sequence alignment search tools that do not consider secondary structure. 2.6 Specific RNA gene detectors The work described thus far addresses general approaches for R N A sequence analy-sis. Specific R N A families emit characteristics that can be exploi ted for better predictions. For example, the snoscan algori thm [46] searches for 2 ' -0-r ibose methy-22 lat ion guide s n o R N A genes in a sequence database based on specific sequence and secondary structure patterns derived from experimentally determined s n o R N A se-quences. Similar ly , t rnascan-SE algori thm [45] scans for t R N A genes. These a l -gorithms are by design highly specific to part icular gene families. Consider ing the diversity of n c R N A s believed to exist in nature [14, 50], it is imprac t ica l to conceive of designing an a lgor i thm for each type of n c R N A . Tools such as C M s offer gener-ali ty and hence are advantageous in this regard when bui ld ing algorithms designed to discover new n c R N A s or R N A elements. However, it would be an improvement i f specific attributes of an R N A family could be incorporated into a general model . We explore this idea i n our algori thm by fine-tuning parameters to detect specific types of R N A sequences. 2.7 R N A motif discovery in unaligned sequences Often only parts of related R N A sequences have functional elements or motifs that are conserved. U n t i l now, we have only considered predict ion of secondary structure and alignments where a given set of sequences contain a shared structure that spans the entire length of the sequences. A s previously discussed in Chapter 1, there are numerous examples of R N A elements that represent only a por t ion of each sequence i n a set of related sequences. The U T R elements such as I R E and S E C I S are two such examples. In addi t ion, several algorithms presented i n previous sections of this chapter have required a mult iple sequence alignment as input . In this section we present algorithms that detect conserved motifs in a set of unaligned sequences that are expected to have only a port ion of the sequence contain the motif. These algorithms receive the most treatment in this chapter as they address the same computat ional problem as our approach. 23 2.7.1 F O L D A L I G N One approach to this problem is outl ined i n G o r o d k i n et al. [18]. The authors describe an algori thm, F O L D A L I G N , that uses a greedy strategy to create a mult iple alignment of R N A sequences that have a common structure i n a por t ion of each sequence. A 4-D dynamic programming alignment a lgor i thm is used to perform pairwise alignments. T h e scoring system takes into account bo th sequence and structure s imi lar i ty of gapped pairs of residues in the two sequences. T h e alignment algori thm can be thought of as a mixture of the Smi th -Wate rman [62] and Nussinov-Jacobsen algorithms [52], A scoring matr ix S is developed that contains subst i tut ion scores for al igning to bki where a and b are input sequences w i t h the nucleotide at posi t ion i i n a, paired to the nucleotide at posi t ion j i n a, and the nucleotide at posit ion k i n 6, paired to the nucleotide at posi t ion I in b. S is a 25 by 25 mat r ix that includes values for subst i tut ing a l l combinations of paired substitutions of A C G U - , where ' - ' represents a gap that has been inserted in a sequence for the purpose of alignment. T h e 4-D dynamic programming mat r ix Dij^i contains the best score of aligning a{...aj w i t h bk-.-h. T h e greedy strategy is used to bui ld mult iple alignments using the results of a l l the possible pair-wise alignments i n the input sequences. F i r s t , a l l sequences are compared to each other, then a l l pairwise alignments are compared to a l l the sequences such that no one sequence is included more than once i n each comparison. Nex t the ' t r iplet ' alignments can be compared to each sequence such that each sequence is only included once i n the result. T h i s may continue un t i l a l l sequences have been compared. T h i s has complexity 0(NNL4) where N is the number of sequences and L is the length of the longest sequence. To opt imize the greedy algori thm, the search space is heavily pruned by e l iminat ing redundant and low-scoring alignments. O n l y a fixed number of highest scoring alignments are kept at any one i terat ion of the process. Th i s reduces the complexi ty to 0(NALA). This a lgor i thm introduces an opt imal pairwise local alignment procedure 24 and scoring ma t r ix that takes into account both sequence and secondary structure. W h i l e the approach is mathematical ly rigorous, the t ime complexi ty is prohibit ive for longer sequences. 2.7.2 SLASH G o r o d k i n et al. [19] combine C M s w i t h F O L D A L I G N . T h i s is done to improve on the t ime complexi ty of F O L D A L I G N . The algori thm uses F O L D A L I G N to generate a 'seed' alignment and consensus secondary structure to init ial ise a C M . T h e key point is that this step only uses a subset of the input sequences. F r o m this seed alignment and consensus secondary structure, a C M is constructed (see Section 2.2.2) and the remaining sequences are aligned using the C M I N S I D E alignment a lgori thm described i n Section 2.2.2. The notion of a C M being composed of on ly a por t ion of the input sequences and then used to 'recover' the motifs i n the remaining sequences is important as we incorporate a similar technique i n our a lgori thm. Despite the improvement, this a lgor i thm s t i l l has complexity 0(N4L4) where N is the number of sequences used for the 'seed' alignment and L is the length of the longest of those sequences. Th i s remains usable only on small sets of short sequences. We introduce a method of 'seeding' a C M that has a much lower t ime complexi ty i n Chapter 3. 2.7.3 RNAProfile The R N A P r o f i l e a lgori thm [53] outputs the most conserved regions of a set of un-aligned R N A sequences according to a s imilar i ty measure that accounts for both sequence and secondary structure. The algori thm proceeds by first selecting a set of candidate subsequences from the input sequences that w i l l be further analysed i n a later step. T h i s in i t i a l step exhaustively searches a l l possible subsequences (referred to as regions) i n the input for candidate regions that contain a given num-ber of stems i n the secondary structure (derived by dynamic programming w i t h a thermodynamic energy.model). The authors estimate there to be O(L) candidate 25 regions where L is the length of the sequence [53]. Once in i t i a l candidate regions are selected, they are then evaluated for their s imilar i ty to each other. T h e algori thm proceeds by al igning each candidate region in sequence 51 to every other candidate region in sequence 5 2 using a Needleman-Wunsch variant that takes into account both sequence and secondary structure i n the scoring function. We w i l l explore a similar method i n our approach outlined i n Chapter 3. T h e resulting alignments are converted to posi t ion specific profiles that represent the frequency of each nucleotide (and gap) at each posi t ion in the alignment. These profiles are ordered based on the alignment scores and only a fixed number of the best-scoring profiles are kept. After this step, the candidate regions from sequence 53 are aligned to each profile and a new set of high scoring profiles is stored. Th i s procedure continues unt i l the candidate regions from a l l of the sequences have been aligned. T h e output is the highest scoring profiles after a l l sequences have been processed. The R N A P r o f i l e method has several dist inguishing features. T h e major ad-vantage is the lack of required prior information. T h e only required input parameter is the number of stems expected to occur in the motif. In addi t ion, the use of profiles enables the a lgor i thm to more readily detect instances of the mot i f that may have diverged considerably. T h e output has a quantitative measure of the 'fitness' of the region predicted to be an instance of the motif. The a lgor i thm relies on predict-ing the secondary structure through dynamic programming and a thermodynamic energy model . W h i l e this is expected to work well for motifs < 100 nucleotides i n length, the accuracy of R N A P r o f i l e is tied to the accuracy of the folding routines. Th i s is also an issue i n our approach and we w i l l explore this further i n Chapters 3 and 6. Another drawback is that R N A P r o f i l e does not report aligned instances of the motifs i t predicts. 26 2.7.4 ComRNA J i et al [32] introduce a method for detecting conserved R N A structure motifs using graph theoretical methods. The i r method has three major steps: 1) find a l l possible stable stems i n a sequence, 2) find a l l potential conserved stems shared by subsets of sequences and 3) assemble compatible sets of conserved structures to construct consensus secondary structure profiles. In step 1), the stable stems are identified through a branch and bound pro-cedure combined w i t h stacking energy parameters to evaluate biochemical stability. In step 2), comparing stems across sequences is done by al igning each pair of sequences globally using the Needleman-Wunsch [51] a lgor i thm to identify highly conserved regions. Conserved regions are defined as having at least 80% sequence identity over 10 or more nucleotides. The conserved regions are used as anchor regions for the stem comparisons. Similari t ies between stems are measured using five features: helix length, helix sequence, loop sequence, stem stabili ty, and relative positions of the stem start and end coordinates i n the whole sequence. These features are used to compute a weighted sum divided by the sum of the relative s tabi l i ty of the two stems i n their respective sequences. The compatible set of stems are those that are found i n a m i n i m u m k out of the N sequences. In step 3), the popula t ion of stems i n the entire input da ta are part i t ioned into A^-partite graph w i t h stems as nodes. Edges are only allowed between nodes in different part i t ions and are weighted according to how similar they are from step 2). N is the number os sequences i n the input and stems that originate from the same sequence are placed i n the same part i t ion. The problem is to find max ima l cliques (cliques that are not fully contained i n larger cliques) of at least size k i n the A^-partite graph. T h e max ima l cliques w i l l contain stems that are shared by at least k sequences. M a x i m a l clique finding is an N P - h a r d problem, so an approximat ion that uses a depth-first enumeration approach is used. T h e output of the algori thm is the stems found i n the max ima l cliques containing at least k stems. 27 This approach has several advantages: it can predict pseudoknotted struc-tures, it allows for predict ion of motifs not shared by all sequences and it reports a given number of best scoring motifs. A strong at tr ibute of this paper is that the authors d id a thorough comparison of their approach w i t h other published methods and present a quanti tat ive measure for evaluation of performance. We used a sim-ilar measure to assess the results of our experiments (see Chapter 4). The major disadvantage of their approach is that the maximal-cl ique finding step has an expo-nential run-time, al though i n practice, the authors report acceptable run-times for data sets that are of comparable size to our test da ta sets (see Chapte r 4). 2.7.5 G P R M The genetic programming for R N A motifs ( G P R M ) algori thm [30] uses a different technique to address the R N A mot i f finding problem. T h e G P R M algor i thm fo-cuses on finding base-paired segments and non base-paired segments that compose the secondary structure of the motif. The user specifies the m a x i m u m number of segments and the length range(s) of the segments. Under the genetic programming terminology, an ind iv idua l i n a populat ion is a motif. G P R M randomly generates an in i t i a l popula t ion of motifs that follow the user's specifications. T h e fitness of the individuals i n the popula t ion are measured as a function of the number of se-quences i n the input ted t ra in ing set expected to contain the mot i f (sensitivity) and the number of sequences predicted to contain the mot i f that actual ly do contain the motif (positive predictive value). The algori thm randomly selects two motifs and the mot i f w i t h the higher fitness gets selected for the genetic operation step. T h i s step takes a mot i f and transforms it using specific rules (see [30] for more details). The new ind iv idua l is added to the populat ion and the process iterates unt i l there are no more fit individuals in the populat ion than the most fit ind iv idua l that has already been processed, or a m a x i m u m number of iterations is reached. T h e run-ning t ime of the a lgor i thm is approximately 0(L5N) where N is the to ta l number 28 of sequences and L is the length of the longest sequence. 2.7.6 C A R N A C The C A R N A C algor i thm [66] predicts conserved secondary structure elements i n a family of related n c R N A s . T h e algori thm combines energy min imiza t ion , phylo-genetic comparison and sequence conservation i n a three step approach. F i rs t , a l l predicted stems (by dynamic programming w i t h a thermodynamic energy model) below a free-energy threshold are selected. Next a l l pairs of sequences are folded using a pairwise folding a lgori thm outl ined in [55] to give an op t imal consensus sec-ondary structure for the pair. Th i s step is done by choosing pairs of stems from the two sequences that are i n local ly conserved regions, are identical i n structure and have at least one compensatory mutat ion. For TV sequences, this results in N — I structures for each sequence. Next , the predicted stems from the pairwise folds are filtered using graph-theoretic techniques. Us ing a graph structure w i t h predicted stems from the first step as nodes, edges are drawn between two stems i f they are from two different sequences and if the stem also appears in the pairwise fold of those two sequences. In addi t ion a l l identical stems (not included i n the previous step) are included at this t ime. Connected components i n the graph are then scored based on the number of nodes, the number of stems i n each sequence, the to ta l num-ber of edges and the number of edges between identical stems. T h e highest scoring connected components have 1 stem per sequence and are fully connected. F ina l ly , the secondary structure for each sequence is determined by incorporat ing stems from the connected components greedily by first choosing stems from the highest scoring connected components. Overlapping stems are not permit ted. T h e authors report good results for RNase P , ciliate telomerase R N A and enterovirus U T R s al though the assessment of the results is largely qualitative. A n advantage of this method is its lower computat ional complexi ty (0(N3)) compared to F O L D A L I G N (0(N4L4)) where N is the number of sequences and L is the length of the longest sequence. 29 The l imita t ions are the lack of quantitative or probabil ist ic output and the lack of aligned conserved motifs. In other words, one could not bu i ld a C M for searching other sequence databases from the output of C A R N A C . 2.7.7 Alidot The A l i d o t a lgor i thm [27] begins by performing a mult iple sequence alignment using C L U S T A L - W . The secondary structures are predicted independently of the sequence alignment using the free-energy model implemented i n the V i e n n a package [26]. The sequence alignment is used to introduce gaps into the structures, and they are i n tu rn aligned. These aligned structures ' are then plot ted using 'mountain plots ' which are representations of secondary structures. T h e mounta in plots of the aligned structures are used to produce a 'consensus mounta in ' . A l l predicted base pairs i n the consensus mountain are then removed and sorted according to the following cri teria: i) no nucleotide pairs more than once (no base triples), ii) no base pairs cross i.e. there may not exist two base pairs (i,j) and (k,l) such that % < k < j < I. Base pairs are ranked according to their 'credibi l i ty ' . Cred ib i l i ty of a base pair is determined using a number of cri teria. F i r s t , the residues for a l l sequences at positions i and j are retrieved and for each sequence, i t is determined whether a ' legal ' base pair is formed ( G C , C G , A U , U A , G U , U G ) . Second, the more sequences that contain a legal base pair, the more credible the base. T h e presence of consistent (a s tandard base-pair is preserved) and compensatory (strength of base-pair conserved) mutations also lend credibi l i ty to the base pair. T h i r d , symmetric base pairs are more credible than non-symmetric base pairs. A base pair, is symmetric i f j is the most frequent pair ing partner of i i n the sequence set and vice-versa. Last ly , pseudo-entropy is calculated for al l the base pairs. L o w entropy yields more credibili ty. The sorted list of base pairs is then scanned from the top -removing a l l base pairs that conflict wi th a higher ranking one. Base pairs below a frequency threshold are also removed. 30 T h e main advantages to this method is efficient computat ion: the algori thm only computes structures for conserved regions and therefore it can be used on long sequences (more than 10Kb) provided they have few conserved regions. The major drawbacks are the alignment step which only considers sequence information. In the absence of sequence conservation, the mult iple alignment w i l l be a weak start ing point. A l so the output (mountain plots) does not provide a probabil is t ic quanti ty w i t h which to evaluate the results. 2.8 Summary We have outl ined the major areas of computat ional R N A sequence and secondary structure analysis related to the work we present in Chapter 3. T h e following chap-ter summarises this related work and introduces and describes a new approach to detecting R N A motifs i n unaligned R N A sequences. 31 Chapter 3 The Disco Algor i thm 3.1 Assessment of related work We have outl ined the major areas of computat ional work related to processing R N A sequences at the sequence and secondary structure level. In Chapter 2, we intro-duced algorithms to predict conserved motifs, predict consensus secondary struc-tures, detect new n c R N A s and align R N A sequences. These algorithms introduce al l of the major concepts that are needed to understand the a lgor i thm we present i n this chapter. T h e challenge of discovering R N A motifs i n unaligned sequences has been met w i t h a diversity of approaches (see Section 2.7), but there is no one superior method for tackl ing this problem. A s mentioned in Section 2.2.2, C M s offer a powerful probabil is t ic formalism to model a set of related R N A sequences. T h e y are similar to H M M s i n their construction. H M M s have been shown to be sensitive protein sequence alignment tools [5, 63, 64]. C M s are par t icular ly robust if given an adequate mult iple alignment and accurate consensus secondary structure [20, 21, 34]. Therefore i n order to use C M s in mot i f discovery of unaligned R N A sequences, we must first t ry to construct a good mult iple alignment and consensus secondary structure that represents the motif. A s previously mentioned, the S L A S H algori thm spends 0(NALA) t ime doing this step. Th i s t ime complexi ty makes S L A S H only 32 feasible for smal l sets of short sequences. W i t h the exception of S L A S H , no other methods described i n Section 2.7 make use of C M s to detect motifs. Another technique introduced i n Chapter 2 is expectation maximisa t ion ( E M ) for iterative refinement of a C M (see Section 2.2.2). None of the tools out-lined in Section 2.7 use iterative refinement techniques. In mot i f finding i n D N A and protein sequences, E M forms the basis of a large body of work [2, 3, 7, 41, 42] and we explore its use in the R N A moti f finding problem. We formulate the problem as follows: consider a set of N unaligned R N A sequences S where a por t ion of the sequences M is expected to contain a motif of w id th approximately W. The opt imal C M C that represents the mot i f consists of the opt imal mult iple alignment of the instances of the mot i f i n S and the corresponding consensus secondary structure. We measure the quali ty of a C M w i t h a score Cscore which is the sum of the bi t scores of the best alignment using the I N S I D E algori thm of the C M to each sequence i n S. The precise problem then is to discover the C M C w i t h the m a x i m u m C s c o r e given S and W. To solve this problem, we designed an a lgor i thm that improves on the com-plexity of S L A S H for ' in i t ia l i s ing ' a C M and includes an iterative refinement phase using E M that improves this in i t i a l C M . The algori thm has a 0(W3-L+L2-W2+L3) run-time where W is the expected w id th of the motif and L is length of the longest sequence i n the input data (see Section 3.2.3 for further details). Since E M is only guaranteed to converge on local opt ima, we do not expect to always detect C , how-ever we hope that given a good starting point, E M w i l l converge on a model that is composed of a majori ty of the instances of the motif. T h e remaining sections of this chapter outline key concepts and ideas that we introduce, and describe our algori thm, called D I S C O i n detail . Before describing our algori thm, we need to consider some key concepts that we used to motivate our method. 33 3.1.1 Global vs local sequence alignments Several algorithms presented in Chapter 2 input and process global alignments. G l o b a l alignments consider the entire length of each sequence when comput ing align-ments. T h i s is not applicable for finding motifs, since they are expected to only cover a por t ion of each sequence in a set of R N A sequences. However, it may yet be advan-tageous to use tools that process global alignments but i n a local context. Consider a mult iple alignment made of a por t ion of each sequence. T h e alignment is global i n the sense that it is a mult iple sequence alignment, but it is local i n the context of the input data. T h i s ' local mult iple alignment' can take advantage of tools that process global alignments. T h i s idea is used i n [19]. Reca l l that a first step is run F O L D A L I G N on a por t ion of the input sequences to detect conserved elements, then use C M s to detect these elements i n the remainder of the sequences. We employ a related use of C M s i n our approach. We also use A l i f o l d , which also operates on a global alignment, by providing i t w i t h a local mult iple alignment for which we want to predict the secondary structure. 3.1.2 Using pairwise and multiple sequence alignment to initialise a C M Our research tests several methods for determining a good alignment and secondary structure of a mot i f to be used as input to bui lding a C M . In part icular , we explore the use of 0(W2) alignment techniques based on the Needleman-Wunsch algori thm [51] (see Section 3.2.1) to produce a mult iple alignment suitable for C M ini t ia l isat ion, where W is the w i d t h of the motif. Th i s 0(W2) a lgor i thm overcomes prohibit ive running t ime of S L A S H and provides a method to achieve a 'coarse-grained' mult iple alignment and consensus structure to initialise a C M that is later refined. 34 3.1.3 Expectation Maximization for C M refinement Expec ta t ion max imiza t ion ( E M ) is used to improve the qual i ty of the C M by re-peatedly al igning the C M to the input data and re-estimating its parameters and secondary structure based on those alignments. T h i s idea was or iginal ly introduced i n [16] and we test its val idi ty in this work. We expect that the alignment presented to the C M w i l l improve w i t h iterative refinement. O u r work represents the first use to our knowledge of E M i n the R N A motif finding domain. 3.2 Proposed new algorithm: DISCO The D I S C O algor i thm takes as input a set of unaligned R N A sequences and finds a C M that represents a mot i f shared by the input sequences. T h e a lgor i thm outputs a mult iple sequence alignment of the instances of the mot i f and a consensus secondary structure. T h e D I S C O algori thm is best described i n two phases, the in i t ia l isa t ion phase and the refinement phase. T h e goal of the ini t ia l isa t ion phase is to use pair-wise and mult iple sequence alignment of subsequences of w i d t h W, combined wi th secondary structure predict ion using compensatory mutations to init ial ise a C M that represents the motif. T h e refinement phase uses expectation maximisa t ion to i teratively refine the C M using the I N S I D E alignment a lgori thm. T h e algori thm is depicted in pseudocode i n A l g o r i t h m 1. 3.2.1 Initialisation phase S l i d i n g w i n d o w s e c o n d a r y s t r u c t u r e p r e d i c t i o n We first enumerate a l l windows of wid th W i n the input da ta and predict the secondary structure of each window using the implementat ion of Zuker 's a lgori thm as published i n Andronescu et al. [1]. Th i s step gives us a dot-bracket representation of each W - m e r i n the input , where each posit ion in the W-mer is assigned a character i n the alphabet ' ( ' , ' ) ' . Ma tched ' ( ' and ' ) ' indicate base-paired positions and ' . ' 35 indicates unpaired positions. P a i r w i s e a l i g n m e n t o f W - m e r s The next step is to pairwise align the W-mers . E a c h W-mer is aligned to every other W - m e r using the Needleman-Wunsch opt imal alignment a lgori thm. T h i s is done i n one of three ways: 1. 'sequence': using sequence information only w i t h a R I B O S U M 8 5 - 6 0 [34] scor-ing mat r ix (see Figure 3.1) 2. 'structure': using the dot-bracket representation of the secondary structure only, w i t h a scoring mat r ix (called D I S C O S U B ) that is s imilar to Pavesi et al. [53] (see Figure 3.2) 3. 'combinat ion ' : using a combination of 1) and 2) that uses R I B O S U M 8 5 - 6 0 for unpaired nucleotides that align, and D I S C O S U B for paired nucleotides that al ign The algori thm was implemented i n this way i n order to test the properties of the input da ta (sequence, structure or combination) that contained the strongest signals for C M ini t ia l isa t ion (see Chapter 4). The entries in the D I S C O S U B mat r ix were determined using in tu i t ion and should be considered arbitrary. T h i s is discussed further i n Chapter 6. To avoid an 0(L2) number of pairwise alignments, we introduced a filter to reduce computat ional effort while maintaining accuracy. For each W-mer , we calculate its 'dot-composi t ion ' ( D C ) , meaning the propor t ion of unpaired nucleotides i n its secondary structure. We ignore al l W-mers wi th a D C of greater than a threshold d. T h e remaining W-mers are called anchors. Furthermore, we do not align any two W-mers i f their D C differ by more than 20% (arbi t rar i ly selected). The highest k scoring W-mers that align to each anchor W - m e r (Wa) is stored i n sorted order according to alignment score i n an array H w i t h H[1] = Wa. 36 M u l t i p l e a l i g n m e n t o f a set o f W-mers Each set H from the previous step is converted to a mult iple alignment using a progressive alignment technique (see A l g o r i t h m 3). F i rs t , the alignment of Wa to H[2] is converted to a profile alignment P. E a c h co lumn of P is represented by g-dimensional vector Pi containing the frequency of occurrence of the alphabet ' A ' / C ' / G ' / U ' , ' - 1 (or ' ( ' , ' . ' , ' ) ' , '-'), at a posi t ion i i n the alignment, where ' - ' represents a gap in the alignment and q is the number of characters i n the alphabet. P is then updated by aligning H[3] to P so that P now contains a profile alignment of Wa, H[2] and H[3]. A t this step, the dynamic programming mat r ix for the alignment calculates a score based on aligning a single sequence to a profile. The score S y for aligning posi t ion i of the VF-mer w to posi t ion j of the profile P is calculated as £ ) p MpjaWi where Pja is the frequency of character a i n column j of P and Wi is the character at posit ion i of w and M is the scoring mat r ix (one of R I B O S U M 8 5 - 6 0 or D I S C O S U B ) . P is s imilar ly updated un t i l a l l W-mers i n H have been aligned. A t the end of this step, we have a mult iple alignment of the highest k scoring pairwise W-xners to Wa- We store a fixed number I of the highest scoring mult iple alignments. These are then passed to the refinement phase. P r e d i c t i o n o f c o n s e n s u s s t r u c t u r e f r o m m u l t i p l e a l i g n m e n t A consensus structure for each of the I mult iple alignments that are kept in the previous step is predicted using Al i fo ld [28] (see Chapter 2 for a description of Al i fo ld ) . We now have mult iple alignments and corresponding secondary structures - the necessary inputs for creating C M s . A C M for each of the I mul t ip le alignments and secondary structures is then init ial ised using the c m b u i l d routine from the Infernal package [15], and the inti t ial ised C M s are refined in the refinement phase, described next. 37 3.2.2 Refinement phase Expectation Maximisation Using the ini t ial ised C M , we apply the I N S I D E alignment a lgor i thm to al ign the C M to each sequence in the input w i t h the cmsearch routine from Infernal. Us ing the gapped representation of the best scoring 'h i t ' for each sequence, a new mult iple alignment is created. T h e observed insertions and deletions are a l l relative to the same C M , making it possible to construct the mult iple alignment as follows. In the case of a deletion i n the gapped representation of the 'h i t ' , the gap is s imply maintained and the sequence is added to the mult iple alignment. In the case of an insertion i n the gapped representation of the 'h i t ' , a gap is inserted i n every other 'h i t ' at that posi t ion and the sequence is added to the alignment. We score the resultant mult iple alignment as the sum of the bit scores for each hit . T h e bit score is a log-odds score that is the difference of the l ikel ihood of the hit al igning to the C M (calculated by the I N S I D E algorithm) and the l ikel ihood of random sequence aligning to the C M . A s i n the ini t ia l isat ion phase, a secondary structure from this new alignment is then predicted wi th Al i fo ld and a new C M is bui l t from the mul-tiple sequence alignment and consensus secondary structure. T h e refined C M is realigned to the sequences to generate a new mult iple alignment and a new con-sensus secondary structure. T h i s process is repeated un t i l the score of the mult iple alignment no longer improves, or a max imum number of iterations is reached. T h e pseudocode for this step is shown i n A l g o r i t h m 5. Output T h e algori thm outputs the highest scoring C M detected i n the refinement phase. 38 3 . 2 . 3 C o m p l e x i t y The worst case t ime complexi ty of D I S C O is 0(W3 • L + L2 • W2 + L 3 ) where W is the user-inputted expected w i d t h of the mot i f and L is the length of the longest sequence i n the input data. The W3 • L term is from the predictive folding step of each i y - m e r i n the data, shown i n A l g o r i t h m 2, line 3. T h e L2 • W2 te rm is from the pairwise alignment of each W - m e r to every other W-mer (shown i n A l g o r i t h m 2, line 11). T h e m a x i m u m number of pairwise alignments is ((L — W) x N)2, however due to the threshold d introduced above, we expect that i n practice, the running time for this step w i l l be substantially better than 0(L2 • W2). F ina l ly , the L3 term comes from the refinement phase in which the I N S I D E a lgor i thm (the cmsearch method) is run repeatedly over the input data (see A l g o r i t h m 5, line 9). 3 . 2 . 4 I n p u t a n d o u t p u t The a lgori thm takes a set of unaligned sequences i n F A S T A format as input . A sample output is shown i n Figure 3.3. The output contains the score, mult iple alignment and consensus secondary structure produced by the most l ikely C M found by the a lgori thm. T h e index of the parent sequence (by locat ion i n the input data) of each sub-sequence and its posi t ion i n its parent sequence are also given i n the output. 3 . 2 . 5 P a r a m e t e r s The adjustable inpu t parameters are presented i n Table 3.1. R e q u i r e d p a r a m e t e r s A key parameter is W, the w i d t h of the motif. Another key parameter is a - the method of sequence alignment. If users expect a strong secondary structure signal, they can choose the 'structure' method, or they can choose the 'sequence' method if they expect the mot i f to be highly conserved at the sequence level. 39 Running-time parameters There are several running-t ime enhancing parameters. A s the dot-composit ion threshold d is lowered, fewer W-mers w i l l be considered for pairwise alignment. o is the overlap used to enumerate the W-mers in the input data. W-mers are enu-merated by s l iding a W-s ized window across each sequence. T h e overlap parameter alters how many positions to overlap when sl iding the window to the next posit ion. For example, i f W — 10 and o = 9, the window slides one posi t ion and al l W-mers i n the input are enumerated. However, i f W — 10 and o — 5 the s l iding window steps skips over five positions before enumerating the next W-mer. T h i s has a profound effect on the number of pairwise alignments that are performed i n the ini t ia l isat ion phase, reducing the number by a factor of (W - o ) 2 . In addi t ion, k - the m a x i m u m number of Warners used to create a mult iple alignment is a key parameter that we w i l l discuss in Chapter 4 and Chaper 6. F ina l ly , I is the m a x i m u m number of models on which the refinement phase is run. Reducing I reduces the number of times the I N S I D E algor i thm is run (which is 0(LZ)). Matrix parameters Other matrices can be used i n place of R I B O S U M 8 5 - 6 0 or D I S C O S U B . T h e y must be i n the same format as depicted i n Figure 3.1 and Figure 3.2. 3.2.6 Implementation The algori thm is implemented i n the C / C + + programming language. A l l functions are implemented i n C , but the main executable file is implemented in C + + due to a dependency on a C + + library. A l l source code is available by request from the author. 40 A l g o r i t h m 1 Pseudocode of D I S C O algori thm. T h e procedure DISCO(I,d,k,l,a,T) returns a C M C representing a conserved mot i f i n the unaligned sequences / . d is the dot-composit ion threshold, k is the m a x i m u m number of W-mers to be included in a mult iple alignment i n the ini t ia l isa t ion phase (see A l g o r i t h m 2 ) , I is the m a x i m u m number of high-scoring mul t ip le alignments to pass to the refinement phase, a is the method of sequence alignment used i n the ini t ia l isat ion phase and T is the m a x i m u m number of iterations to use i n the refinement phase. ExpectationMaximisation is shown i n A l g o r i t h m 5. 1: p r o c e d u r e DISCO {I,d,k,I,a,T) 2: Cset *— Initialisation^, d, k, I, a, T) 3: maxSc < oo 4: fo r a l l C e Cset d o 5: (C) <— ExpectationMaximisation(C, I) 6: i f score(C) > maxSc t h e n 7: maxSc <— score(C') 8: maxC <- C 9: e n d i f 10: e n d for 11: Re tu rn maxC 12: e n d p r o c e d u r e 41 Algorithm 2 Pseudocode of Initialisation procedure. Parameters are as described in Algorithm 1. Note that DotCompositionQ refers to the proportion of unpaired nucleotides of the sequence, sizeQ refers to the number of entries in the set, sort() sorts the entries in descending order by score and last refers to the index of the lowest scoring entry in the set. cmbuild is described in [15] and Alifold is described in [28]. Align is the Needleman-Wunsch pairwise alignment algorithm described in [13]. MultipleAlign is shown in Algorithm 3. procedure INITIALISATION^,d,k,l,a,T) for all W-mer w £ I do Fold(w) end for maxSp < oo , Cset •*— {} for all W-mer w € / do maxS < c o , ifu>[l] <— w for all W-mer x £ / do if DotComposition(w) < d then if DotComposition(w) — DotComposition(x) < 20 then A <— Align(w, x) t> Pairwise alignment of w and x if score(A) > maxS then maxS <— score(A) if size(Hw) < k then Hw <— grow Array(Hw,x) > appends x to Hw else Hw[last] <— x end if sort(Hw) end if end if end if end for (Pw) MultipleAlignment(Hw) if score(Pw) > maxSv then SS <— Alifold(Pw) > Predict the 2ndary struct from the alignment C <- cmbuild(Pw, SS) > Build a new CM maxSp <— score(Pw) if size(Cset) < I then Cset«- {Cset, C) else Cset[last] <- C end if sort(Cset) end if end for Return Cset end procedure 42 Algorithm 3 Pseudocode for creating a multiple alignment from a set of sequences ordered with the first sequence as an anchor sequence and the rest of the sequences sorted in descending order according to how well they pairwise align to the anchor sequence. The ProfileAlign procedure is implemented exactly as described in [13]. It returns the score and profile alignment of P and P' where Pi is a column vector containing the frequency of each character in the 'alphabet' of the sequences (eg 'ACGU-' ) 1 procedure M u L T l P L E A L l G N M E N T ( i f ) 2 P <- Sequence2Profile(H[l}) 3 P' <- Sequence2Profile(H[2}) 4 P <- ProfileAlign(P,P')) 5 for i <— 3, i < size(H) do 6 P' *- Sequence2Profile(H[i)) 7 P «- ProfileAlign(P, P') 8 end for 9 Return P 10 end procedure 43 Algorithm 4 Pseudocode for converting a sequence into a profile procedure SEQUENCE2PROFILE(S') for all positions i E S do for j *— l,j <— size(alphabet(S)) do Pj,i - 0 end for index <— index(Si) Pindex,i < 1 end for Return P end procedure 44 Algorithm 5 Pseudocode for ExpectationMaximisation(C, I, T) where C is a C M , / is the input set of unaligned R N A sequences and T is the m a x i m u m number of iterations. hits2multipleSequenceAlignment s imply creates a mult iple sequence alignment from the best scoring hits from each sequence. T h i s is possible since a l l alignments are to the same C M and hence have a l l insertions and deletions relative to the same model . 1: procedure EXPECTATIONMAXIMISATION(C, I, T) 2: max Score < 1 3: scorec <— 0 4: t <- 0 5: while scorec > maxScore or t < T do 6: scorec <— 0 7: hitsc <— {} 8: for all s € I do 9: hitss <— cmsearch(C, s) 10: if size(hitss) > 0 then 11: maxHitg <— max(score(hitss)) 12: hitsc <— {hitsc,maxHits} 13: scorec <— scorec + score(maxHits) 14: end if 15: end for 16: if scorec > maxScore then 17: maxScore <— scorec 18: MSA <— hits2multipleSequenceAlignment(hitsc,C) 19: SS *- Alifold1 MSA) 20: C <- cmbuild(MSA, SS) 21: end if 22: t <- t + 1 23: end while 24: R e t u r n C 25: end procedure 45 P a r a m e t e r D e s c r i p t i o n W Expec ted w i d t h of mot i f a Sequence alignment method ('sequence', 'structure' , 'combination') d Dot-composi t ion threshold for pairwise alignment step 0 Overlapping nucleotides in W-mev enumeration k M a x i m u m number of VF-mers to include i n mul t ip le alignment step I Number of models on which to run refinement phase T M a x i m u m number of iterations i n refinement phase m Scoring mat r ix for alignment using sequence method b Scoring ma t r ix for alignment using structure method Table 3.1: Parameters of the D I S C O algor i thm 46 R I B 0 S U M 8 5 - 6 0 A C G U A 2 22 C -1 86 1 16 G -1 46 - 2 48 1 03 U -1 39 - 1 05 -1 74 1 . 6 5 Figure 3.1: T h e unpaired nucleotide por t ion of the empir ical ly derived R I B O S U M 8 5 -60 [34] subst i tut ion mat r ix . T h i s mat r ix is used i n the 'sequence' and 'combinat ion ' alignment methods described i n Section 3.2.1. 47 D O T B R A C K E T - 2 . 0 ( ) ( ) 3 . 0 - 1 . 5 - 1 . 5 3 . 0 1 . 5 1 . 5 Figure 3.2: T h e D I S C O S U B mat r ix showing the subst i tut ion scores for al igning characters from the dot-bracket representation of secondary structure. Matched parentheses are given a score of 3.0, matched 'dots' or unpaired nucleotides are given a score of 1.5 and al l mismatches are given a score of -1.5. T h i s ma t r ix is used in the 'structure' and 'combinat ion ' alignment methods described i n Section 3.2.1. 48 S C O R E : 559 ( ( ( . ( ( ( ( ( ) ) ) ) ) ) ) ) 0 47 G-- T - G G T C G C G T C A A C A G T G T T T G A T C - G - - A A C A - C C T G T 1 12 G A T - T C T T G C T T C A A C A G T G T T T G A A C G G - - A A T T - T C T T T 2 5 G-- T - T C T T G T T T C A A C A G T G A T T G A A C G G - - A A C T - C C T C T 3 9 G-- T T A C C T G C T T C A A C A G T G C T T G A A C G G C A A C - - C T T C T 4 27 G-- T - T C T T G C T T C A A C A G T G A T T G A A C G G - - A A C T - C C T C T 5 23 G-- T - T C T T G C T T C A A C A G T G T T T G A A C G G - - A A C - - C C T C T 6 160 G-- T - T C T T G C T T C A A C A G T A T T T G A A C G G - - A A C - - C C T C T 7 1305 G - T - T C C T G C G T C A A C A G T G C T T G G A C G G - - A A C - - C G G C C 8 2 G - T - T C C T G C T T C A A C A G T G C T T G G A C G G - A A C - - C C G G C 9 12 G - T — C C T G C T T C A A C A G T G C T T G A A C G G - A A C - - C C G G C 10 28 G - T C T C T T G C T T C A A C A G T G T T T G G A C G G - - A A C A - G A T C C 11 948 G - T T T C C T G C T T C A G C A G T G C T T G G A C G G - - A A C - - C C G G C 12 3 G - T C T C C T G C T T C A A C A G T G C T T G G A C G G - - A G C - - C C G G T 13 9 G - T G T C T T G C T T C A A C A G T G T T T G A A C G G - - A A C A G A C - C C 14 1189 G - T - A C T T G C T T C A A C A G T G T T T G A A C G G - - A A C A G A C - C C 15 398 G-- T A T C T T G C T T C A A C A G T G T T T G G A C G G - - A A C A G A C - C C Figure 3.3: Sample output of the D I S C O algori thm. T h e score is given on line one of the output file. T h e next line is empty, followed by the consensus structure of the mult iple alignment. T h e remaining lines are the mult iple alignment of the sub-sequences used to construct the C M . The aligned sequences are preceded w i t h the index of the parent sequence of the sub-sequence and the start posi t ion of the sub-sequence i n the parent sequence. 4 9 Chapter 4 Experiments 4.1 Major questions and new ideas We set out to learn a C M that accurately models a mot i f w i t h i n a set of unaligned R N A sequences. To explore the performance of our method we posed major ques-tions about the inherent properties of the data and how they might be exploited to accomplish this task. 4.1.1 Question 1: Which properties more strongly represent a mo-tif embedded in a set of unaligned RNA sequences? A l l of the papers mentioned i n this chapter comment that bo th sequence and sec-ondary structure signals must be taken into account i n R N A sequence analysis. However, there is a lack of consensus on which properties - the sequence or the secondary structure emit the stronger signals. We compared three different align-ment strategies for in i t ia l is ing a C M : a) sequence alone, b) structure alone and c) a combinat ion of sequence and structure. The details of these alignment algorithms are presented i n Section 3.2.1. 5 0 4.1.2 Question 2: Can a C M be initialised using only a few se-quences? To minimise the cost of creating a mult iple alignment to init ial ise a C M , we explored the idea of only using a subset of the sequences to create the mul t ip le alignment. We wanted to test whether a relatively crude mult iple alignment created from a subset of the input sequences was of high enough quali ty to create a C M that could then recover the remaining motifs i n the iterative refinement phase. 4.1.3 Question 3: Can a crude secondary structure filter be used to filter out subsequences not expected to be an instance of the motif? Before constructing the mult iple alignment, our a lgor i thm first folds each subse-quence of length W in the input using a dynamic programming a lgor i thm that uses a thermodynamic energy model . Th is step produces a dot-bracket representation (see Section 3.2.1) of the secondary structure of each subsequence representing the base-paired nucleotides and the unpaired nucleotides. We used this representation to filter out subsequences in the data that had more than a m i n i m u m proport ion of their nucleotides unpaired in their secondary structure. T h i s was done by s imply counting the ' . ' i n the dot-bracket representation of the W - m e r and d iv id ing by W (recall the parameter d from Chapter 3). G i v e n that the subsequent step is to pairwise al ign a l l the remaining subsequences using the methods introduced i n 4.1.1, we expect this filtering step to improve the run-time of the a lgori thm. We tested different thresholds to assess how the filter affected accuracy. 4.2 Data We used m i c r o R N A s and U T R elements as test data sets for the D I S C O algo-r i thm. We selected m i R N A families and U T R element families from the R f a m 51 database using the keyword searches ' m i c r o R N A ' and ' U T R ' on the R f a m website (h t tp : / /www.sanger .ac .uk/Sof tware /Rfam/) . M i c r o R N A s and U T R elements were selected i n light of their important role i n post- transcript ional gene regulation (see Chapter 1). In addi t ion, m i R N A s and U T R elements were of ideal size (30-100 bp) to prototype our a lgor i thm. We used the R f a m seed alignments as 'ground t ru th ' alignments for testing the D I S C O algori thm. T h e seed alignments are curated mult iple alignments of indi -v idua l members of an R N A family. The consensus secondary structure is annotated on this mult iple alignment. R f a m uses these seed alignments and secondary struc-tures to construct C M s , which are then used to search large genomic databases for other members of the family. There are several advantages to using data from Rfam. These are out l ined below: • The seed alignments from R f a m are hand curated and nearly a l l sequences in -cluded i n the seed alignments have been experimentally determined and pub-lished in the literature. A l l R f a m records are tagged w i t h P u b m e d identifiers which point to the original papers that describe the R N A molecules. • Near ly a l l the sequences included in the seed alignments are flanked by genomic sequence or U T R sequence. Th i s makes it possible to extract a larger 'super-sequence' that contains a member of the seed alignment w i t h i n i t . T h i s is essential for testing, since our algori thm is designed to detect shared sub-structures i n a set of sequences. • A l l sequences included i n the Rfam seed alignments have E M B L / G e n B a n k accession numbers. T h i s makes data retrieval fairly straightforward, where otherwise this can be an onerous task. We used the At l a s integrated database for data retrieval [61]. A n example R f a m seed alignment i n Stockholm format (see [15]) is given i n F i g -ure 4.1. 52 F r o m the in i t i a l set of families retrieved w i t h the keyword searches, we re-moved al l families w i t h fewer than four members, w i t h more than twenty members, w i th length more than 151 and U T R element families whose members extended into coding sequence. T h e last cri terion reflects our opinion that protein coding sequences have dist inct properties that would confound their analysis. We d id not impose any taxonomic filters. Tables 4.1 and 4.2 list and describe some characteris-tics of the nine U T R da ta sets and seventeen m i R N A data sets used i n this analysis. Us ing the larger 'parent ' sequences given by the G e n B a n k accession numbers i n the seed alignments, we constructed the test data sets as follows: for U T R data, the entire U T R i n which the seed sequence was embedded was extracted; for m i R N A data, the m i R N A plus 200 nucleotides upstream and downstream of the m i R N A were extracted. In some cases, extracting 200 nucleotides was not possible due the proximi ty of the m i R N A to an end of the sequence. In such cases, we extracted as much flanking sequence as possible to the end of the sequence. 4.3 Preliminary experiments A set of eight R f a m seed alignments (RF00047, RF00104 , RF00129 , RF00172 , RF00180, RF00237 , RF00241 , RF00256) was used to evaluate three different pa-rameters: • Al ignment method: structure alone, sequence alone, combinat ion (a=0,l ,2) • Number of sequences used to construct mult iple alignment for ini t ia l isa t ion (k=2-7) • Dot-composi t ion threshold (d=0.45,0.50,0.55,0.60,0.65) These experiments were designed to reveal the best parameters for running the algo-r i t hm on m i c r o R N A s and U T R elements and were designed to address the questions outl ined i n Sections 4.1.1, 4.1.2 and 4.1.3. 53 4.4 F i x e d parameter experiments We estimated the op t imal parameters from the results (see Chapter 5) of the prel im-inary experiments and ran the D I S C O algori thm on the remaining da ta sets using those op t imal parameters. For these experiments, we used the 'sequence' method for alignment, k = 6, d — 0.40 for m i R N A data and d — 0.55 for U T R data. For data sets w i t h < 6 sequences, we used k = N, where N is the number of sequences i n the input data. A m a x i m u m of T = 10 iterations was used for the refinement phase. W was set to the length of the seed alignment +2, and overlap, o was set to W — 1. I was set to 15. 4.5 Eva lua t i on methods Figure 3.3 shows an example of the D I S C O output. It shows a score, a mult iple alignment, a consensus structure and the positions of the mot i f instances i n the parent sequence. T h e output ted score and mult iple alignment for each run of the algori thm were used to calculate the measures of accuracy explained below. 4.5.1 Score A s mentioned i n Chapter 3, the output of the a lgor i thm is a consensus structure, a mult iple sequence alignment and a score that reflects the quali ty of the mult iple alignment. T h e score is a sum of the l ikel ihood of the model given each sequence. The higher the score, the better the alignment. We assessed the correlation of the score to the measures l isted below to determine whether a higher score meant better performance. 4.5.2 Sensitivity and positive predictive value The score measures the qual i ty of the alignment, but this is an insufficient measure on its own, since the a lgor i thm may produce a very high scoring alignment that 54 does not contain members of the Rfam seed alignment. T h i s could arise i f the input data contained other regions of s imilar i ty that were more easily detectable than the members of the R f a m seed alignment. To get a quanti tat ive measure of the accuracy of the final mult iple alignment, we chose to use sensit ivity ( S E N S ) and positive predictive value ( P P V ) using three measures of accuracy. To define S E N S and P P V , we first need to describe four other terms: • true positives ( T P ) : the number of pairs of nucleotides aligned i n the output that were also aligned i n the Rfam seed alignment • false positives ( F P ) : the number of pairs of nucleotides aligned i n the output that were not aligned in the Rfam seed alignment • true negatives ( T N ) : the number of pairs of nucleotides unaligned in the output that were not aligned i n the Rfam seed alignment • false negatives ( F N ) : the number of pairs of nucleotides unaligned i n the output that were aligned i n the Rfam seed alignment S E N S is defined as T P / ( T P + F N ) , or the number of true positives over the to-ta l number of aligned nucleotides i n the R f a m seed alignment. P P V is defined as T P / ( T P + F P ) , or the number of true positives over the to ta l number of aligned nucleotides i n the output. Often, when measuring accuracy i n tests of this nature specificity, defined as T N / ( T N + F P ) , is used as a complementary measure to sen-sit ivity. However, in this case T N is difficult to conceptualize as it represents a l l correctly predicted unpaired nucleotides, which makes l i t t le sense i n this scenario. Us ing P P V and S E N S , we can now approximate the Mat thews correlation coefficient [49] ( C C ) in the following way. CC w VSENS • PPV (4.1) This measure was original ly used i n [19] and i n several subsequent papers [30, 32, 22]. Since we are interested i n model ing and recovering motifs, our measures are sl ightly 55 different and focus on the aligned nucleotides and recovering specific nucleotides that are part of the motif. T h i s is another advantage of using the R f a m seed alignments i n that we have a nucleotide-level 'ground-truth ' to compare our results against. O v e r l a p p i n g n u c l e o t i d e s In addi t ion to aligned pairs of nucleotides, we can also measure the number of nu-cleotides from each sequence included in the output that are also i n the R f a m seed alignment, or overlapping nucleotides. In this case T P is the number of nucleotides in the output that were also i n the i n Rfam seed alignment, F P is the number of nucleotides in the output that were not i n the R f a m seed alignment, and F N is the number of nucleotides not i n the output that were i n the R f a m seed alignments. S E N S , P P V and C C can then be calculated i n the same way as explained i n Sec-t ion 4.5.2. S e q u e n c e - l e v e l o v e r l a p Final ly , we can define success at a coarse level which gives a measure of whether the motif was found i n each sequence or not. We define a T P i n this scenario i f 50% of the overlapping nucleotides i n each sequence are T P nucleotides. F P and F N can be s imilar ly defined. 4.5.3 Reported accuracy measures In Chapter 5 we report accuracy based on aligned nucleotides, overlapping nu-cleotides and sequence-level overlap measures for each of the test da ta sets. E a c h accuracy measure consists of three separate values: sensitivity, positive predictive value and correlation coefficient. F r o m here on, these w i l l be referred to as: AS, APPV and ACC for aligned nucleotides, NS, NPPV and NCC for overlapping nucleotides and SS, SPPV and SCC for sequence-level overlap. We first show the results for the pre l iminary experiments, and follow this w i t h the results of the 56 fixed-parameter experiments. We also tested the effect of average pairwise sequence identity of the seed alignments and overall size of the input da ta on the results. 4.6 Comparison with RNAProfile We compared our a lgor i thm wi th R N A P r o f i l e . NS, NPPV, SS and SPPV accu-racy measures and running t ime were compared for the following da ta sets: RF00037, RF00130, RF00164, RF00180 , RF00185, RF00239, RF00241 and RF00256 . AS and APPV measures were not compared because R N A P r o f i l e does not output aligned sequences. We chose four data sets from the U T R element group and four data sets from the m i R N A group. We also chose data that included sets where D I S C O failed to detect the mot i f (RF00130, RF00180, RF00185) and sets where D I S C O performed well (RF00037, RF00164, RF00239, RF00241 and RF00256) in the fixed parameter experiments (see Tables 5.1 and 5.2). Ideally, a l l da ta sets used i n the fixed parameter settings would have been used in the comparison w i t h R N A P r o f i l e , but this was infeasible due to t ime constraints. Similar ly , we would have l iked to include A l i d o t and F O L D A L I G N i n this comparison. T h i s work is being done to prepare a manuscript for publ icat ion. T h e parameters used to run D I S C O were as specified i n Section 4.4 and R N A P r o f i l e parameters are given in Table 4.3. For R N A P r o f i l e the length-based parameters IR and LR were set s imilar ly to the D I S C O W parameter. LR was set to the length of the R f a m seed alignment +2 and IR was set to length of seed alignment —20. The '20' was chosen based on the default differential between IR and LR. The length-specific parameters for R N A P r o f i l e were set to confer prior knowledge of the wid th of the mot i f to R N A P r o f i l e in order to achieve a fair comparison w i t h D I S C O . Default values were used for a l l other R,NAProf i le parameters. O n l y subsequences reported by R N A P r o f i l e w i th positive fitness values were included as predicted mot i f instances. T h i s decision was based on the not ion re-ported i n Pavesi et al. [53] that subsequences wi th negative fitness values should be 57 Rfam id Description Num Length %id RF00031 Selenocysteine insertion sequence 19 88 40.60 RF00032 Histone 3' UTR stem-loop 13 26 73.03 RF00037 Iron response element 16 30 84.36 RF00109 Vimentin 3' UTR protein-binding region 12 94 81.65 RF00164 Coronavirus 3' stem-loop II-like motif (s2m) 16 43 84.52 RF00176 Tombusvirus 3' UTR region IV 17 92 93.52 RF00180 Renin stability regulatory element (REN-SRE) 13 37 89.61 RF00185 Flavivirus 3' UTR pseudoknot 14 102 91.82 RF00214 Retrovirus direct repeat 1 (drl) 19 95 89.47 Table 4.1: Descr ip t ion and characteristics of the U T R test data sets showing the number of members i n the seed alignment (Num), the length of the seed alignment (Length) and the mean percent pairwise nucleotide identi ty of the members i n the seed alignment (%id) 58 R f a m i d D e s c r i p t i o n N u m L e n g t h % i d RF00027 let-7 microRNA precursor 12 90 70 88 RF00051 mir-17 microRNA precursor family 4 82 72 77 RF00052 lin-4 microRNA precursor 9 74 70 26 RF00053 mir-7 microRNA precursor 6 93 67 13 RF00075 mir-166 microRNA precursor 11 151 59 78 RF00076 mir-181 microRNA precursor 4 76 80 22 RF00103 mir-1 microRNA precursor family 7 80 70 01 RF00130 mir-192/215 microRNA precursor 4 110 70 39 RF00131 mir-30 microRNA precursor 4 72 82 55 RF00239 mir-124 microRNA precursor family 6 87 73 31 RF00241 mir-8/mir-141/mir-200 microRNA precursor 9 81 64 60 RF00246 mir-135 microRNA precursor family 5 91 71 56 RF00247 mir-160 microRNA precursor family 7 137 66 48 RF00248 mir-148/mir-152 microRNA precursor family 5 88 73 24 RF00251 mir-219 microRNA precursor family 7 76 83 87 RF00256 mir-196 microRNA precursor family 14 96 74 06 RF00364 mir-BART2 microRNA precursor family 8 62 92 85 Table 4.2: Descr ip t ion and characteristics of the m i R N A test da ta sets showing the number of members i n the seed alignment (Num), the length of the seed alignment (Length) and the mean percent pairwise nucleotide identi ty of the members i n the seed alignment (%id) 59 AE003516.3/109215-109275 AC005316.1/63325-63386 AF155142.1/45725-45784 AC116051.4/129891-129952 Z81467.1/13319-13384 #=GC SS.cons GUCUUUGGUUAUCUAGCUGUA. UGAGUGA. UAAAUA. . ACGU. CAUAAAG CUCUUUGGUUAUCUAGCUGUA. UGAGUGC.CACAGA.GCCGU.CAUAAAG AUCUUUGGUUAUCUAGCUGUA.UGAGUGU.AUUGG...UCUU.CAUAAAG AUCUUUGGUUAUCUAGCUGUA. UGAGUGG. UGUGGA. GUCUU. CAUAAAG AUCUUUGGUGAUUCAGCUUCAAUGAUUGGCUACAGGUUUCUUUCAUAAAG . . . < « < < < « < < < « « . . < . < « > . » > . . » AE003516.3/109215-109275 AG005316.1/63325-63386 AF155142.1/45725-45784 AC116051.4/129891-129952 Z81467.1/13319-13384 #=GC SS.cons / / CUAGCUUACCGAAGUU CUAGAUAACCGAAAGU CUAGAUAACCGAAAGU CUAGAUAACCGAAAGU CUAGGUUACCAAAGCU > > > » » > » » > . . . Figure 4.1: Seed alignment of RF00237 i n Stockholm format. T h i s format consists of a mult iple alignment w i t h secondary structure annotat ion given i n the bot tom line. A l so note the G e n B a n k accession numbers and coordinates that are provided to facilitate straightforward retrieval of flanking sequence for the test data sets. 60 suspected to come from sequences not containing an instance of the motif. To assess running time, we used the U n i x t i m e command to report the num-ber of C P U seconds used. A l l running t ime analysis was performed on the identical machine w i t h no other processes, except system processes running concurrently. Analys is was performed on an Intel X e o n processor at 2 . 4 G H z w i t h 1Gb of R A M . 61 Rfam id IR LR RF00037 20 40 RF00130 90 112 RF00164 20 45 RF00180 20 40 RF00185 82 104 RF00239 67 89 RF00241 61 83 RF00256 76 98 Table 4.3: Parameters for R N A P r o f i l e in comparison experiment. T h e only parame-ters that were set were IR and LR. Default values were used for a l l other parameters. 62 Chapter 5 Results 5.1 P r e l i m i n a r y e x p e r i m e n t s We ran the D I S C O algor i thm on eight sets of data to identify parameters giving the best results. We varied the method of alignment a (sequence, structure, com-bination), the dot-composi t ion threshold d (0.45, 0.55, 0.60, 0.65) and the number of W-mevs used to construct the mult iple alignment k (2-7). In a l l , there were 788 runs i n the prel iminary experiments. The next few sections show the results of these experiments and provide the justification for the parameters that were chosen for the fixed parameter experiments. 5.1.1 Sequence method of alignment is superior Figures 5.1 and 5.2 show the distr ibutions of NS and NPPV (see Section 4.5.3 for an explanat ion of NS and NPPV) of the cumulative results of a l l the runs for the sequence, combinat ion and structure alignment methods. A l l the distr ib-utions i n this chapter are shown as box-and-whisker p lo t s 1 . T h e NS results (see Figure 5.1) show that the sequence method was significantly better than structure ^ox-and-whisker plots show distributions as a box with a line in the box indicating the median of the distribution, the top and bottom edges of the box indicating the third and first quartiles and the ends of the whiskers indicating the maximum and minimum values of the distribution. The points shown on the plots are considered outliers. 63 (Welch T w o Sample t-test, t=12.84 and p=2.2E-16) and combinat ion (Welch T w o Sample t-test, t=12.44 and p=2.2E-16). The NPPV results (see F igure 5.2) s im-i lar ly show sequence to be significantly better than structure (Welch T w o Sample t-test, t=12.84 and p=2.2E-16) and combinat ion (Welch T w o Sample t-test, t=12.44 and p=2.2E-16). Based on these results we chose a ^sequence for the fixed para-meter experiments. 5.1.2 k = 6 gives best results for NS and NPPV Figures 5.3 and 5.4 show the distributions for the results of the a =sequence runs of the prel iminary experiments for k = 2 to k = 7 (from this point on a l l analyses include only a ^sequence runs due to poor performance of the structure and combi-nat ion methods). B y quali tat ive assessment of Figures 5.3 and 5.4, the results for k = 6 seem slightly better than for the other values of k. For NS, the median values increased w i t h k, however the mean of k — 6 (0.55 ± 0.30) was higher than k — 7 (0.49±0.36). For NPPV, the mean and median for k = 6 were 0 . 7 1 ± 0 . 3 5 and 0.84. k — 5 and k — 7 had comparable values for the mean and median (0.69 ± 0.38 and 0.84 for k = 5) and (0.63 ± 0.38 and 0.83 for k = 7). T h e results for k = 6 had the highest mean and the lowest standard deviat ion compared to the k = 5 and k = 7 results. There was a lack of statist ically significant differences between k = 5,6,7, therefore we chose k = 6 based on highest mean and lowest s tandard deviat ion for the fixed parameter experiments. For input data sets w i th fewer than six sequences, we set k = N, where N was the number of sequences. 5.1.3 Dot-composition threshold Figures 5.5 and 5.6 show the distributions for the results of the a ^sequence runs of the prel iminary experiments grouped by d. A l though Figure 5.5 seems to indicate that d — 0.50 produced more accurate NS results, the d is t r ibut ion for d = 0.50 was not stat is t ical ly significantly better than the dis t r ibut ion for d = 0.45 or d = 0.55. 64 d i s t r i b u t i o n o f N S for e a c h a l i g n m e n t m e t h o d Figure 5.1: Box-and-whisker plots showing the d is t r ibut ion of N S over a l l runs for structure, combinat ion and sequence alignment methods. T h e sequence method showed significantly better performance than the structure method (Welch T w o Sample t-test, t=12.84 and p=2.2E-16) and the combinat ion method (Welch T w o Sample t-test, t=12.44 and p=2.2E-16). 65 d i s t r i b u t i o n o f N P P V f o r e a c h a l i g n m e n t m e t h o d o 00 d CM d o d structure combination sequence Figure 5.2: Box-and-whisker plots showing the dis t r ibut ion of N P P V over a l l runs for structure, combinat ion and sequence alignment methods. T h e sequence method showed significantly better performance than the structure method (Welch T w o Sample t-test, t—13.33 and p=2.2E-16) and the combinat ion method (Welch T w o Sample t-test, t=11.35 and p=2.2E-16). 66 oo d CD d <3-d § CM d o d k=2 k=3 k=4 k=5 k=6 k=7 Figure 5.3: Box-and-whisker plots of a l l a ^sequence runs showing the dis t r ibut ion of NS for fc — 2 to fc — 7. T h e median values increased w i t h k, however the mean of fc = 6 (0.55) was higher than fc = 7. A l so , the standard deviat ion of fc = 6 (0.30) was lower than for fc = 7 (0.36). These results show that fc = 6 produced the best accuracy when measured w i t h NS. 67 Figure 5.4: Box-and-whisker plots of a l l a ^sequence runs showing the dis t r ibut ion of NPPV for k — 2 to k = 7. Based on the results shown i n Figure 5.3, we only compared k = 6 to k = 5 and k = 7. The mean, median and s tandard deviat ion for k = 6 were 0.71, 0.84 and 0.35. k = 5 and k = 7 had comparable values for the mean, median and s tandard deviat ion (0.69, 0.84 and 0.38 for k = 5) and (0.63, 0.83 and 0.38 for k = 7). T h e mean for k = 6 was highest and the standard deviat ion for k — 6 was lowest compared to k = 5 and k = 7. These results show that k = 6 produced the best accuracy when measured w i t h NPPV. 68 The results for NPPV showed no observable trend over the value of d. G i v e n the lack of an obvious choice for d, we used biological in tu i t ion to set the value of d. For the m i R N A data sets, we noted that the motifs are highly structured. We plotted the dis t r ibut ion of the proport ion of unpaired nucleotides for the consensus structure of the 40 m i R N A data sets available from R f a m (see Figure 5.7). Based on this data, we set d = 0.40 for the m i R N A data. Ideally, we would have included d = 0.40 i n the prel iminary experiments (and therefore i n Figures 5.5 and 5.6) to give a more ra t ional basis for this choice. We ini t ia l ly thought that d = 0.40 would be too stringent, however by examining the dis t r ibut ion of propor t ion of unpaired nucleotides, we decided to choose d = 0.40 based on intui t ion. The propor t ion of unpaired nucleotides was much more widely dis tr ibuted for U T R elements (data not shown). We arbi t rar i ly chose d — 0.55 (the middle value of our experiments) for the U T R element data. We recognise that this was not ideal, and we had hoped the prel iminary experiments would provide a more rat ional basis for choosing d. We w i l l discuss this further i n Chapter 6. 5.2 Fixed parameter experiments Table 5.1 shows score (SC) and accuracy measures (AS, APPV, ACC, NS, NPPV, NCC, SS, SPPV and SCC) for the seventeen m i R N A data sets. T h e mean, median and standard deviat ion of the distr ibutions of these accuracy measures are also shown i n Table 5.1 and the distributions themselves are shown i n Figure 5.8. Table 5.2 shows score (SC) and accuracy measures (AS, APPV, ACC, NS, NPPV, NCC, SS, SPPV and SCC) for the nine U T R element data sets. T h e mean, s tandard deviat ion and median of the distr ibutions of these accuracy measures are also shown i n Table 5.2 and the distributions themselves are shown i n Figure 5.9. T h e D I S C O algor i thm detected the motifs for the majori ty of the da ta sets. For m i R N A data, the mean and median NS were 0.66 ± 0.36 and 0.85 while the mean and median NPPV were 0.76 ± 0 . 3 5 and 0.90. T h i s indicates that on average, i 69 o to Z 00 o CO o o CN c i o c i T d=0.45 d=0.50 d=0.55 d=0.60 d=0.65 Figure 5.5: Box-and-whisker plots of a l l a =sequence runs showing the dis t r ibut ion of NS for d = 0.45,0.50,0.55,0.60,0.65. The mean and median values were (0.45, 0.40), (0.46, 0.45), (0.44, 0.36), (0.38, 0.28) and (0.42, 0.35) respectively, d = 0.50 had the highest values for NS, al though the d is t r ibut ion d = 0.50 was not stat ist ically significantly different from the dis t r ibut ion of d — 0.45 (Welch T w o Sample t-test, t=-0.12, p=0.90) or from the dis t r ibut ion of d = 0.55 (Welch T w o Sample t-test, t=-0.28, p=0.78). 70 > D_ D_ 00 d d d CN d o d d=0.45 d=0.50 1 d=0.55 d=0.60 d=0.65 Figure 5.6: Box-and-whisker plots of a l l a =sequence runs showing the dis t r ibut ion of NPPV for d = 0.45,0.50,0.55,0.60,0.65. T h e mean and median values were (0.65, 0.86), (0.67, 0.86), (0.65, 0.86), (0.61, 0.83) and (0.63, 0.83) respectively. N o observable t rend due to d was apparent from this data. 71 o C D O in o o o C M Figure 5.7: Box-and-whisker plot showing the dis t r ibut ion of the propor t ion of un-paired nucleotides i n the consensus secondary structure of a l l 40 m i R N A seed align-ments from Rfam. 72 66% of nucleotides in the seed alignments were found in the best scoring C M and that 76% of the nucleotides found in the best C M were part of the seed alignment (see Table 5.1). DISCO recovered at least 67% of the seed sequences in twelve out of seventeen miRNA data sets (by the SS measure). The mean and median SS were 0.71 ± 0.40 and 0.86 and the mean and median SPPV were 0.80 ± 0.39 and 1.00 respectively for the miRNA data. The mean and median AS were 0.46 ± 0.30 and 0.52 while the mean and median APPV were 0.59±0.36 and 0.79. The slightly lower performance by the AS and APPV measures indicate that although the majority of motifs are being detected by DISCO, they are not necessarily accurately aligned in the output when compared to the seed alignments. This will be discussed further in Chapter 6. The large sources of error are most likely attributed to five of the data sets in which the motifs were essentially missed by the algorithm. The results for the UTR element data sets, shown in Table 5.2 were less promising. The NS mean and median were 0.49 ±0.48 and 0.57. The NPPV mean and median were 0.45 ± 0.44 and 0.53. The performance was similarly poor by the other measures. The mean and median for both SS and SPPV were 0.53 ± 0.51 and 0.83. The mean and median for AS were 0.46 ± 0.46 and 0.48. The mean and median for APPV were 0.41 ± 0.49 and 0.49. This relatively poor performance and very large standard deviations are due the the fact that the algorithm completely missed the motif in four of the nine UTR element data sets. Table 5.2 shows that the data sets for which the motif was found show relatively high accuracy for all the measures. For example the NS mean of the remaining five data sets was 0.89 ±0.18. The problems in detecting UTR elements will be discussed in Chapter 6. 5.2.1 Score is not an indicator of accuracy Of great interest to us was whether the score of the C M (recall this was the sum of the bit scores of the best hit of each sequence aligned to the C M with the INSIDE algorithm) was a good indicator of accuracy. To test this, we first normalised the 73 Figure 5.8: D i s t r ibu t ion of accuracy results for seventeen m i R N A test data sets by the AS, APPV, ACC, NS, NPPV, NCC, SS, SPPV and SCC measures. In general, the a lgor i thm performed well on the m i R N A data by the S, NPPV, NCC, SS, SPPV and SCC measures. The distr ibutions for the alignment measures indicate that despite good recovery of the seed sequences, the alignments were not necessarily accurate. 74 Figure 5.9: Distribution of accuracy results for nine U T R element test data sets by the AS, APPV, ACC, NS, NPPV, NCC, SS, SPPV and SCC measures. In general, the algorithm did not perform as well as it did on the m i R N A data. This is indicated by the very wide distributions on all the measures. This was mainly due the the algorithm completely missing four of the nine motifs and therefore contributing 0 values to each of the accuracy measures in these cases. The algorithm however, performed quite well on the remaining five data sets. 75 score S by the number N of sequences in the input to give: S' = S/N. Normal isa t ion was necessary since the score of a high scoring C M is expected to have a 'bit score' contr ibut ion from most of the sequences i n the input . We then plot ted S' against AS, APPV, NS, NPPV, SS and SPPV for a l l of the fixed parameter results and tested each measure of accuracy for a statist ical correlation w i t h score using a Pearson's product moment correlation test. A l l measures were posi t ively correlated w i t h score and AS, APPV, NS and SS were stat is t ical ly significantly correlated. Scatter plots of correlation against the accuracy measures along w i t h the correlation coefficient ( C C ) and p-values of the correlation tests are shown i n Figures 5.10, 5.11, 5.12, 5.13, 5.14 and 5.15. These results appear to indicate that score is an indicator of accuracy. Ideally the vertical axis intercept of the fitted line i n Figures 5.10, 5.11, 5.12, 5.13, 5.14 and 5.15 would go through the origin (0,0) meaning that a C M w i t h score of 0 had 0 true positives. Th i s is not the case i n our output, meaning that the a lgor i thm is producing true positive results despite low-scoring output. Furthermore, there are six cases for which a l l accuracy measures are zero, meaning that the a lgor i thm completely missed the motif. We investigated this data further by e l iminat ing a l l score-accuracy measure pairs w i t h zero values for the accuracy measures and replot t ing the data. T h e correlations were non-significant for al l accuracy measures except for 715 (correlation-coefficient = 0.57, p=0.01) (data not shown). These results demonstrate that for cases where the algori thm successfully identified the motifs, there is no correlation between score and accuracy. Therefore, we were unable to conclude that score was a positive indicator of accuracy. Cases where score is low, but accuracy is high need to be examined closely to determine why this occurs. We w i l l discuss this further i n Chapter 6. 5.2.2 Testing the effects of properties of the input data We wanted to test i f the algori thm accuracy was sensitive to inherent properties of the input data. T h i s was possible, since the data sets were of variable length and 76 CO < oo d co d d CM d o d 20 40 60 80 100 normalised score 120 140 Figure 5.10: Scatter plot of AS against normalised score. AS was stat is t ical ly sig-nificantly correlated w i t h score (Pearson's product-moment correlation, correlation coefficient = 0.56, p=0.00). 77 CO o CD O CD CN c i o O 20 40 60 80 100 normalised score 120 140 Figure 5.11: Scatter plot of APPV against normalised score. APPV was statis-t ical ly significantly correlated wi th score (Pearson's product-moment correlation, correlation coefficient = 0.40, p=0.04). 78 to 00 o CD O o CN d o c i 20 40 60 80 100 120 140 normalised score Figure 5.12: Scatter plot of NS against normalised score. NS was stat is t ical ly sig-nificantly correlated w i t h score (Pearson's product-moment correlation, correlation coefficient = 0.45, p=0.02). 79 0.8 1.0 o o o o ° o o o o o CD O o O CN d o o d CO o o o 0 20 40 60 80 100 120 140 normalised score Figure 5.13: Scatter plot of NPPV against normalised score. NPPV was positively correlated with score, although the correlation was not statistically significant (Pear-son's product-moment correlation, correlation coefficient = 0.29, p=0.14). 80 CO CO oo d CO d d CM d q d normalised score Figure 5.14: Scatter plot of SS against normalised score. SS was stat is t ical ly sig-nificantly correlated w i t h score (Pearson's product-moment correlation, correlation coefficient = 0.42, p=0.03). 81 00 C) CO O > Q_ 0. CO c i o O 1 20 40 60 80 100 normalised score 120 140 Figure 5.15: Scatter plot of SPPV against normalised score. SPPV was positively correlated with score, although the correlation was not statistically significant (Pear-son's product-moment correlation, correlation coefficient = 0.36, p=0.07). 82 of varying sequence s imi lar i ty (see Tables 4.1 and 4.2). We ran Pearson product-moment correlation tests to see i f performance was affected by: • average nucleotide percent pairwise identity (PID) of R f a m seed alignment members • length of R f a m seed alignment • length of input data • number of sequences i n the input data None of the accuracy measures were significantly correlated w i t h any of the data properties l isted above (see Tables 5.3, 5.4, 5.5 and 5.6). A l l of the correlation coefficients were positive for the PID tests, suggesting a positive influence of P I D on the results, however none of the tests yielded stat is t ical ly significant results. Similar ly, a l l of the correlation coefficients for the length of da ta test were negative, suggesting a negative influence of length of data on the results, however none of the tests yielded a s tat is t ical ly significant results. These results indicate that despite a mi ld bias towards high P I D motifs and smaller input data, the a lgor i thm can be run on different data sets w i th respect to size, number of sequences and PID and produce results that are independent of these properties. We w i l l discuss the implications of these results w i th respect to the question posed i n Section 4.1.2 i n Chapter 6. 5.3 D I S C O i s m o r e a c c u r a t e , b u t c o n s i d e r a b l y s l o w e r t h a n R N A P r o f i l e Figures 5.16, 5.17, 5.18, 5.19 compare NS, NPPV, SS and SPPV accuracy mea-sures of the best scoring model for D I S C O against the best scoring model for R N A P r o f i l e . D I S C O generally outperformed R N A P r o f i l e in accuracy for a l l mea-sures. M e a n NS was 0.56 w i t h standard deviat ion 0.45 for D I S C O and 0.41 wi th 83 standard deviation 0.26 for RNAProfile. Mean NPPV was 0.56 with standard deviation 0.43 for DISCO and 0.48 with standard deviation 0.31 for RNAProfile. Mean SS was 0.58 with standard deviation 0.50 for DISCO and 0.48 with stan-dard deviation 0.31 for RNAProfile. Mean SPPV was 0.58 with standard deviation 0.50 for DISCO and 0.48 with standard deviation 0.31 for RNAProfile. In gen-eral, DISCO had better accuracy for RF00037, RF00164, RF00239, RF00241 and RF00256. RNAProfile had better accuracy for RF00130 and RF00185, two data sets where DISCO completely missed the motif. Both programs missed the motif in the RF00185 data set. Overall, we conclude that DISCO is more sensitive and has higher positive predictive value than RNAProfile. Figure 5.20 shows the running time of DISCO and RNAProfile plotted against the size of the input data. RNAProfile had considerably faster running time than DISCO for all data sets by approximately one order of magnitude (mean log ratio of DISCO running time to RNAProfile was 1.46). 84 CM CO 00 d CD d d CM d o d DISCO RNAProfile RF00037 RF00130 RF00164 RF00180 RF00185 RF00239 RF00241 RF00256 Figure 5.16: Compar i son of NS between D I S C O and R N A P r o f i l e . D I S C O outper-formed R N A P r o f i l e for the RF00037, RF00164, RF00239, RF00241 and RF00256 data sets, while R N A P r o f i l e outperformed D I S C O for R F 0 0 1 3 0 and RF00185 da ta sets. M e a n and s tandard deviat ion NS were 0.56 and 0.45 for D I S C O and 0.41 and 0.26 for R N A P r o f i l e . B y the NS measure, D I S C O was more sensitive than R N A P r o f i l e . 85 CM 00 d > D_ % CD z d d CM d o d • DISCO • RNAProfile i RF00037 RF00130 RF00164 RF00180 RF00185 RF00239 RF00241 RF00256 Figure 5.17: Compar i son of NPPV between D I S C O and R N A P r o f i l e . D I S C O out-performed R N A P r o f i l e for the RF00164. RF00239, RF00241 and RF00256 data sets, while R N A P r o f i l e outperformed D I S C O for RF00037, RF00130 and RF00185 data sets. M e a n and s tandard deviation NPPV were 0.56 and 0.43 for D I S C O and 0.48 and 0.31 for R N A P r o f i l e . B y the NPPV measure, D I S C O had better positive predictive value than R N A P r o f i l e . 80 CN • DISCO • RNAProfile o CO CO RF00037 RF00130 RF00164 RF00180 RF00185 RF00239 RF00241 RF00256 Figure 5.18: Comparison of SS between DISCO and RNAProfile. DISCO outper-formed RNAProfile for the RF00037, RF00164, RF00241 and RF00256 data sets, while RNAProfile outperformed DISCO for RF00130 and RF00185 data sets. Mean and standard deviation SS were 0.58 and 0.50 for DISCO and 0.48 and 0.31 for RNAProfile. By the SS measure. DISCO was more sensitive than RNAProfile. 87 • DISCO RNAProfile o RF00037 RF00130 RF00164 RF00180 RF00185 RF00239 RF00241 RF00256 Figure 5.19: Comparison of SPPV between DISCO and RNAProfile. DISCO out-performed RNAProfile for the RF00037, RF00164, RF00241 and RF00256 data sets, while RNAProfile outperformed DISCO for RF00130 and RF00185 data sets. Mean and standard deviation SS were 0.58 and 0.50 for DISCO and 0.48 and 0.31 for RNAProfile. By the SPPV measure, DISCO was more sensitive than RNAProfile. 88 CO <D E +-» c L_ D) O co H CM H o D I S C O * RNAPro f i l e T T T o o T 2000 3000 4000 5000 6000 7000 8000 input s ize (bp) Figure 5.20: R u n n i n g t ime vs size of input data for D I S C O and R N A P r o f i l e . R N A P r o f i l e ran faster than D I S C O for a l l data sets. The mean log ratio of D I S C O to R N A P r o f i l e was 1.46 indicat ing on average, R N A P r o f i l e was faster by approximately one order of magnitude. 89 ID SC AS APPV ACC NS NPPV NCC SS SPPV SCC 027 539 0.59 0.79 0.68 0.85 0.92 0.88 0.83 1.00 0.91 051 79 0.23 0.29 0.26 0.90 1.00 0.95 1.00 1.00 1.00 052 343 0.64 0.80 0.72 0.88 0.93 0.90 0.89 1.00 0.94 053 121 0.39 0.79 0.56 0.67 0.91 0.78 0.83 1.00 0.91 075 0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 076 96 0.81 0.78 0.79 0.97 0.85 0.91 1.00 1.00 1.00 103 231 0.48 0.89 0.65 0.65 1.00 0.81 0.86 1.00 0.93 130 22 0.02 0.09 0.04 0.06 0.12 0.08 0.00 0.00 0.00 131 111 0.46 0.45 0.45 0.94 0.90 0.92 1.00 1.00 1.00 239 210 0.52 0.88 0.68 0.69 0.85 0.77 0.67 0.67 0.67 241 345 0.64 0.75 0.69 0.83 0.93 0.88 1.00 1.00 1.00 246 132 0.72 0.86 0.79 0.90 0.96 0.93 1.00 1.00 1.00 247 140 0.00 0.00 0.00 0.12 0.81 0.31 0.14 1.00 0.37 248 152 0.76 0.88 0.82 0.87 0.91 0.89 1.00 1.00 1.00 251 328 0.74 0.87 0.80 0.89 0.89 0.89 0.86 0.86 0.86 256 1069 0.84 0.84 0.84 0.97 0.93 0.95 1.00 1.00 1.00 364 556 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 M 0.46 0.59 0.52 0.66 0.76 0.70 0.71 0.80 0.74 Med 0.52 0.79 0.68 0.85 0.91 0.88 0.86 1.00 0.93 StD 0.30 0.36 0.32 0.36 0.35 0.35 0.40 0.39 0.39 Table 5.1: Results of fixed parameter experiments for m i R N A data. T h e ID co lumn shows the last three digits of the R f a m accession number of each m i R N A data set. M, Med and StD rows show the mean, median and s tandard deviat ion of each measure of accuracy over a l l da ta sets. 90 ID SC AS A P P V A C C NS N P P V N C C SS S P P V S C C 031 19 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 032 17 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 037 559 0.92 0.78 0.85 0.95 0.79 0.87 1.00 1.00 1.00 109 868 0.48 0.49 0.48 0.57 0.53 0.55 0.83 0.83 0.83 164 938 0.93 0.89 0.91 0.94 0.86 0.90 1.00 1.00 1.00 176 2248 0.92 0.86 0.89 1.00 0.92 0.96 1.00 1.00 1.00 180 322 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 185 488 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 214 2715 0.87 0.70 0.78 0.95 0.95 0.95 0.95 0.95 0.95 M 0.46 0.41 0.43 0.49 0.45 0.47 0.53 0.53 0.53 Med 0.48 0.49 0.49 0.57 0.53 0.55 0.83 0.83 0.83 StD 0.46 0.41 0.43 0.48 0.44 0.46 0.51 0.51 0.51 Table 5.2: Results of fixed parameter experiments for U T R data. T h e ID co lumn shows the last three digits of the R f a m accession number of each U T R element data set. M , Med and StD rows show the mean, median and s tandard deviat ion of each measure of accuracy over a l l da ta sets. 91 M e t h o d cc P A S 0.29 0.15 A P P V 0.14 0.51 A C C 0.22 0.28 N S 0.21 0.30 N P P V 0.04 0.85 N C C 0.14 0.49 S S 0.18 0.39 S P P V 0.08 0.68 S C C 0.15 0.47 Table 5.3: Corre la t ion statistics measuring association between P I D and accuracy. N o accuracy measures were significantly correlated wi th P I D . al though a l l were positively correlated to some degree. 92 M e t h o d cc P A S -0.21 0.3 A P P V -0.15 0.47 A C C -0.19 0.36 N S -0.16 0.43 N P P V 0.01 0.95 N C C . -0.10 0.61 S S -0.15 0.45 S P P V 0.00 0.99 S C C -0.11 0.59 Table 5.4: Corre la t ion statistics measuring association between length of the seed alignment and accuracy. T h e correlation coefficient ( C C ) of the Pearson product-moment test and associated p-value (p) are shown for each measure of accuracy. N o accuracy measures were significantly correlated wi th the length of the seed align-ment, suggesting that D I S C O does not show a bias based on length of the motif. 93 M e t h o d CC P A S -0.03 0.88 A P P V -0.18 0.38 A C C -0.10 0.61 N S -0.20 0.33 N P P V -0.34 0.09 N C C -0.26 0.19 S S -0.23 0.26 S P P V -0.27 0.18 S C C -0.25 0.22 Table 5.5: Corre la t ion statistics measuring association between length of the input data and accuracy. T h e correlation coefficient ( C C ) of the Pearson product-moment test and associated p-value (p) are shown for each measure of accuracy. N o mea-sures were significantly correlated wi th the length of the input data, suggesting that D I S C O does not show a stat is t ical ly significant bias based on the size of the input . However, a l l measures had a negative C C which indicates a smal l effect of size on the results. 94 M e t h o d cc P A S 0.12 0.55 A P P V -0.09 0.66 A C C 0.02 0.91 N S -0.10 0.61 N P P V -0.30 0.14 N C C -0.19 0.35 S S -0.14 0.51 S P P V -0.20 0.33 S C C -0.16 0.44 Table 5.6: Corre la t ion statistics measuring association between number of sequences in the input da ta and accuracy. T h e correlation coefficient ( C C ) of the Pearson product-moment test and associated p-value (p) are shown for each measure of accuracy. N o accuracy measures were significantly correlated wi th the number of sequences i n the input data, suggesting that D I S C O does not show a bias based on the number of sequences i n the input . 95 Chapter 6 Discussion We developed an a lgor i thm called D I S C O to detect the most l ikely covariance model representing a mot i f embedded in a given set of unaligned R N A sequences. We tested our a lgor i thm on 26 data sets from R f a m from two categories of R N A molecules: m i R N A s and U T R elements. The data sets we constructed consisted of selected members of the R f a m family flanked by genomic or U T R sequence so that each instance of the mot i f was embedded i n a larger sequence. O u r a lgor i thm performed quite well for the m i R N A data sets and showed a type of b i -modal d is t r ibut ion for the U T R elements where the motif was very accurately found, or it was not found at a l l (see Chapter 5). We found that the score of the C M produced by the algori thm was not correlated w i t h our measures of accuracy, suggesting that score is not an indicator of performance. We also found that the measures of accuracy were not significantly correlated w i t h any inherent properties of the input data, indicat ing that the a lgor i thm has an unbiased performance w i t h respect to sequence similar i ty of the mot i f instance, length of the motif instances, length of the input data and the number of sequences in the input data. A comparison w i t h a similar algori thm, R N A P r o f i l e , showed that D I S C O produced more sensitive output w i t h higher positive predictive value. 96 6.1 Interpretation of results 6.-1.1 Sequence information is more important than secondary struc-ture in the initialisation phase In Chapter 4 we posed three major questions that we hoped our a lgor i thm and experiments would help to answer. F i rs t , we wanted to determine what properties of the data - sequence or secondary structure emitted the stronger signals for mot i f detection (see Section 4.1.1). T h e results of the prel iminary experiments showed that the sequence method of alignment was far superior to the structure and the combinat ion methods (see Figures 5.1 and 5.2). These results indicate that the sequence carries more information than the secondary structure and that sequence information is generally sufficient to create a crude mult iple alignment to init ialise a C M , which necessarily introduces secondary structure information i n the refinement phase. Surprisingly, there was no statist ically significant correlation between ac-curacy and pairwise sequence identity of the mot i f sequences using the sequence method. T h i s is counter-intuitive and merits further study. It would be important to find an empir ica l ly derived threshold of pairwise sequence identi ty of the motifs below which the sequence method accuracy degraded. For the RF00185 test data set in the fixed parameter experiments, the ac-curacy was 0 for a l l methods. However, we re-ran the a lgor i thm w i t h the same parameters except we used the structure alignment method instead. T h e accuracy results were NS = 1.00, NPPV = 0.88, SS = 1.00, and SPPV = 1.00. A l t h o u g h not as extreme, a s imilar improvement i n accuracy using the structure method was achieved for RF00180 , where the sequence results for a l l measures were 0, but the structure method gave: NS = 0.54, NPPV = 0.48, SS = 1.00 and SPPV = 1.00. Score results were 322 and 578 for the sequence and structure method respectively. These two examples indicate that while the sequence method of alignment gave the 97 most accurate results in general, the structure method is superior for specific data sets. More work is needed to see if there are detectable properties i n the data that could indicate the selective use of the sequence or structure alignment method. 6.1.2 Relatively few sequences can be used to initialise the C M The second question we posed was whether a sufficiently good C M could be i n i -tialised using only a subset of the input sequences (see Section 4.1.2). We ran our algori thm on data sets which contained between four and nineteen sequences using a fixed value of k = 6 for a l l da ta sets where the number of sequences N i n the input data was at least 6. For the remaining data sets, we set k = N. Reca l l that the parameter k is the m a x i m u m number of W-mers to include i n the mul t ip le align-ment step of the in i t ia l i sa t ion phase of the algor i thm (see A l g o r i t h m 2). There was no stat is t ical ly significant bias detected when the accuracy measures were tested for correlation w i t h N. T h i s indicates that i n general, the a lgor i thm works equally well w i t h a fixed k independent of N. Th i s gives us good evidence that our method can present a good in i t ia l i sa t ion mult iple alignment to bu i ld a C M that is capable of retrieving the remainder of the motifs in the input data through i terative refinement. 6.1.3 The unpaired nucleotide filter improves performance but does not compromise accuracy for m i R N A data sets Final ly , we tested to see i f a simple secondary structure filter could improve per-formance while mainta ining accuracy (see Section 4.1.3). W h e n designing the a l -gor i thm we were concerned w i t h the 0(L2 • W2) te rm of the run-t ime complexi ty where L is the length of the longest sequence in the input , and W is the given w i d t h of the motif. Reca l l that this term arises from the exhaustive pairwise alignment of a l l VF-mers i n the input . For large data sets, this step is very expensive, so we introduced a threshold measure to reduce the number of pairwise alignments per-formed. O n l y W-mevs w i t h a proport ion of unpaired nucleotides lower than a user 98 inputted d were considered for pairwise alignment. O f major concern was whether this threshold el iminated W-mers that were mot i f instances i n the data. For the fixed parameter experiments, we used a relatively stringent threshold of d — 0.40 for the m i R N A experiments, and the results were satisfactory for most da ta sets (see Table 5.1). T h i s gives us some indicat ion that setting d to capitalise on spe-cific s tructural properties of the mot i f can yie ld good results and improve run-time performance. T h i s at tr ibute of our algori thm is unique when compared to the other algorithms presented i n Section 2.7. We view this as a strength of our system that it can be tuned to take advantage of the structural properties of the mot i f i f they are known ahead of t ime. Recent work by Bonnet et al. [6] and Washei t l et al. [68] has shown that m i n i m u m free energy signals are detectable i n certain types of R N A s and our a lgor i thm is poised to take advantage of this information. Some motifs, however, are highly unstructured, and would not be detectable w i t h only a m i n i m u m threshold. Implementing a m a x i m u m threshold as well would provide a range of propor t ion of unpaired nucleotides for W-mers to be admit ted into the search space. W e believe this idea should be explored further and would further enhance our a lgor i thm. 6.1.4 Relatively poor accuracy of aligned nucleotides Recal l from Figures 5.8 and 5.9 the relatively poor accuracy w i t h respect to the AS and APPV methods. Measur ing accuracy of alignments i n this way was perhaps not the right approach i n hindsight. Consider that the dynamic programming algorithms for bo th the alignment methods we use i n the ini t ia l isa t ion phase and the I N S I D E algori thm potent ial ly have mult iple paths i n their traceback routines. Since we broke ties i n the scoring mat r ix by choosing a matched pair over a gap, this could have introduced a bias. 99 6.1.5 Poor U T R element results Table 5.2 shows that the a lgori thm completely missed the mot i f i n four out of nine U T R da ta sets. G i v e n that three of these four sets were U T R s i n predominantly mammal ian m R N A s , it is not too surprising that the 'sequence' alignment method for the ini t ia l isa t ion phase presented non-motif sequences to the refinement phase. Considering the proximi ty of the U T R s to coding sequence, i t is reasonable to as-sume that these sequences may be under evolutionary selection pressures to main-ta in their sequence. Indeed, Shabal ina et al. [60] recently reported the existence of highly conserved sequences i n U T R s , detected through a genome wide comparison of orthologous m R N A s from eukaryotic species. H igh ly conserved patterns at the se-quence level would most certainly influence the performance of our algori thm, which is not specifically designed for m R N A s . Pedersen et al. [54] introduce a comparative method for finding and folding R N A secondary structures w i t h i n protein-coding re-gions. Th i s work is of specific interest to the problem of detecting U T R elements and should be carefully considered in any modifications to our work that deal w i t h biases i n m R N A sequences. 6.2 Improvements on other methods We introduced three improvements on other methods i n our a lgori thm. F i r s t , we used the powerful probabil is t ic framework offered by C M s to both model and detect motifs i n our input data. W i t h the exception of S L A S H [19], none of the other methods described i n Section 2.7 model motifs i n this way. T h e use of C M s have a great advantage in that they offer a sensitive alignment a lgor i thm ( I N S I D E ) to search for an instance of a C M in a given sequence. W i t h respect to our work, this has a two-fold benefit in that the I N S I D E algori thm can be used in the refinement phase and that the output of D I S C O can be easily used to detect instances of the predicted motif i n other sequence databases of interest. Second, we reduced the worst-case 100 t ime complexi ty to init ialise a C M from 0(N4L4) i n S L A S H to 0 ( L 2 • W2) where N is the number of sequences i n the input, L is the length the longest sequence i n the input and W is the user inputted approximate w i d t h of the motif. Th i s theoretical improvement is enhanced by the threshold parameter d which i n our algori thm w i l l substantial ly reduce the ( (L — W) x A ^ ) 2 m a x i m u m possible number of pairwise alignments performed i n the ini t ia l isa t ion phase. T h i r d , we introduce iterative refinement using E M to the R N A moti f discovery problem. None of the methods described in Section 2.7 use iterative refinement. G i v e n its widespread use in the sequence mot i f finding domain, we believe the use of E M is a worthwhile technique and confers an advantage to our algori thm. 6.2.1 Comparison between DISCO and RNAProfile Our a lgori thm shows better accuracy than R N A P r o f i l e (see Figures 5.16, 5.17, 5.18 and 5.19) yet is considerably slower that R N A P r o f i l e (see Figure 5.20). We at-tr ibute bo th the superior accuracy and slower running t ime to the use of C M s . T h e Needleman-Wunsch based alignment algori thm of R N A P r o f i l e considers each posi-t ion of the sequences to be independently derived by definition. One strength of the C M I N S I D E a lgor i thm is the use of t ransi t ion probabili t ies between the states i n the C M data structure which correspond to basepairs or unpaired bases i n the sequence. T h e t ransi t ion probabili t ies introduce a dependence between these states, and al though they are costly i n running time, i n our opinion the t ransi t ion probabi l -ities confer an advantage over the profiles and associated alignment a lgori thm used by R N A P r o f i l e . We d i d not investigate the exact nature of this advantage but we believe it merits further study. Furthermore, comparison w i t h other algorithms such as A l i d o t and F O L D A L I G N is on-going work that is being prepared for submission to a journal for publ icat ion. 101 6.3 D r a w b a c k s a n d l i m i t a t i o n s o f t h e D I S C O m e t h o d We acknowledge several drawbacks to our approach. Perhaps the most significant l imi ta t ion is the need to specify the wid th W of the motif. Tools such as C A R N A C [66] and R N A P r o f i l e [53] require different input parameters. R N A P r o f i l e only re-quires the number of stems the mot i f is expected to have. C A R N A C does not require any other input except the unaligned sequences. The need to specify W is a l i m i -tat ion, but it should be noted that the sequences that make up the output ted C M need not be exactly W nucleotides long. Reca l l that the I N S I D E a lgor i thm allows for insertions and deletions and so the mult iple alignment used i n the refinement phase of the a lgor i thm is expected to contain variable-length sequences. 6.3.1 Limitations of covariance models W h i l e providing a robust probabil ist ic framework for model ing sets of related R N A sequences, C M s have two notable l imitat ions. F i r s t , the bifurcating tree structure underlying the C M is incapable of modeling pseudoknots. O u r a lgor i thm w i l l most l ikely not be able to detect pseudoknots which are detectable w i t h c o m R N A [32]. Second, the complexi ty of the I N S I D E algori thm is 0(L3) where L is the length of the sequence. T h i s makes our algori thm prohibi t ively expensive to run on long sequences. However, Weinberg and Ruzzo [70] recently reported a method that can filter out sequences i n a database to be searched wi th a C M in 0(L2) t ime wi th no reduction i n accuracy. Use of this method in the refinement phase should be considered as a potent ial opt imisat ion. 6.3.2 Reliance on predictive folding There are two parts of our a lgori thm where predictive folding is performed. In the ini t ia l isat ion phase, we use the Zuker algori thm to predict the fold of each W-mer based on thermodynamic energy. Th i s method is known to have l imi ted accuracy of about 73%, measured by proport ion of correctly predicted base pairs [48]. T h i s is an 102 acknowledged l imi ta t ion i n our approach, but the Zuker a lgor i thm remains the state of the art for single sequence predict ion of secondary structure. A n addi t ional source of error could come from the consensus structure predict ion of the mul t ip le alignment in the refinement phase using Al i fo ld . We d id not perform a rigorous evaluation of the accuracy of the consensus structure predictions and therefore we do not know to what extent or how frequently the consensus structure is incorrect ly determined. This merits further study which should include a comparative evaluation of A l i f o l d and Pfo ld at bo th the run-t ime and accuracy levels. 6.4 Potential improvements and future work W h i l e our results were encouraging, there are several areas where the D I S C O al-gori thm could be improved. In the ini t ia l isat ion phase, we tested three alignment methods. T h e 'sequence' alignment method was superior. For the 'structure' and 'combinat ion ' methods, we constructed a scoring mat r ix using in tu i t ion rather than empir ical results. A rigourously derived scoring ma t r ix for the 'structure' method would provide a more accurate comparison to the 'sequence' method which used a matr ix , R I B O S U M 8 5 - 6 0 that was derived using m a x i m u m l ikel ihood methods un-der the B L O S U M model of evolution (see [34]). G i v e n that for some data sets, the 'structure' method d i d outperform the 'sequence' method, we feel this further work has merit. 6.4.1 Multiple sequence alignment method We used a crude heuristic to construct a mult iple alignment (see A l g o r i t h m 3). T h i s method is missing the guide tree creation step used by hierarchical mul t ip le alignment methods such as Clus ta lw [65] and outl ined i n D u r b i n et al. [13]. We did not compare our mult iple alignment algori thm to hierarchical methods, and so we do not know if our heuristic negatively affected accuracy of our alignment. A n assessment of specific cases where this method of mult iple alignment i n int roducing 103 error is necessary to determine i f our mult iple alignment method is adequate. 6.4.2 The use of priors when initialising the C M A uniform Dir ichle t prior was used to intialise bo th the t ransi t ion and emission probabilit ies of the C M . T h e effect of different priors and the use of any other empirical ly derived statistics i n the construction of the C M were not investigated. G iven the numerous (almost 400) C M s now available i n Rfam, it would be interesting, to estimate a more data-driven prior from these existing sets. Furthermore, priors for specific types of R N A s (eg m i R N A s ) could be estimated and opt ional ly used i f the user had prior knowledge of the type of mot i f they were expecting to discover. 6.4.3 Using phylogenetic weighting In the field of comparative genomics, a growing body of l i terature is report ing dif-ferent models to incorporate phylogenetic distance i n analysing sets of sequences where the ind iv idua l sequences originate from different organisms. Knudsen and He in [36] infer a phylogenetic tree using m a x i m u m l ikel ihood methods and use the distances i n the tree to help infer a consensus secondary structure using S C F G s . Us-ing a s imilar approach of weighting the alignment scores i n the in i t ia l isa t ion phase is bound to more accurately reflect the s imilar i ty of the sequences and i n effect normalise the scores by evolutionary distance. The work of Holmes [29] describes an evolutionary model for R N A structure and its use i n construct ing p a i r - S C F G s to al ign two homologous R N A s . Exp lo r ing the use of such evolutionary models for R N A sequences would undoubtedly add an beneficial layer of accuracy to detection of motifs i n sequences from different organisms. 6.4.4 Optimisations Given the relatively high complexi ty of our algori thm, we have enumerated a number of optimisations that would improve the running time. Reca l l that the first step 104 of the ini t ia l isa t ion phase is to fold every W-mer i n the input . Cur ren t ly this is implemented as a ' s l id ing window' across each sequence i n the input , moving one posit ion at a t ime. A s each successive fold is only different by one nucleotide, W — 1 columns and rows of the dynamic programming matrices used for folding could be saved and used i n the calculat ion of the secondary structure of the next 'window' . Another opt imisa t ion could be implemented in the pairwise alignment step. A s only a fixed number of high-scoring W-mers are kept for each W - m e r , it may be possible to tel l early in the alignment process i f the alignment score w i l l be sufficiently high. Implementing this early detection would reduce the number of complete W 2 operations and offer a substantial speed-up for the cases where the sequences do not align well . F ina l ly , our algori thm is very amenable to paral lel izat ion. It should be very straightforward to implement message passing using the Message Passing Interface ( M P I ) , so that the algori thm could run on a dis t r ibuted memory cluster. Paral le l iza t ion could be achieved for the folding of W-mers , the pairwise alignment step and the iterative refinement step. Given a dis tr ibuted memory cluster w i th x nodes, the a lgor i thm theoretically would run faster by a factor propor t ional to x. 105 Chapter 7 Conclusions We designed and implemented an algori thm called D I S C O to discover an opt imal covariance model ( C M ) representing a motif expected to occur i n a given set of unaligned R N A sequences. We were able to conclude that sequence information used to create a mul t ip le alignment of motif sequences exhibits stronger signals than secondary structure information. We also demonstrated that a proport ion of sequences of the input data could be used i n the in i t ia l isa t ion phase to crudely construct a C M . Mos t often the C M could retrieve the motifs i n the remaining sequences in the refinement phase. Final ly , we were able to use a simple filter that admit ted only W-mevs w i t h a lower than d proport ion of unpaired nucleotides into the search space wi thout noticeable loss of accuracy. Results on test data sets from Rfam showed that our a lgor i thm performed well on m i R N A da ta sets, however it showed inconsistent results on U T R element data sets. T h e score we used to measure the quali ty of the C M was not sufficiently correlated w i t h accuracy measures we used to evaluate the performance of our algo-r i thm, meaning that we were unable to conclude that score was a positive indicatory of accuracy. We were not able to detect any significant bias i n our results w i t h re-spect to pairwise sequence identity of the mot i f sequences, number of sequences i n the input data, length of the input data or length of the motif. T h i s suggests that 106 our a lgori thm can be applied generally to different types of data. Our a lgor i thm improves on competing algorithms i n its use of covariance models for model ing motifs, a fast ini t ia l isat ion phase to generate the C M and the use of iterative refinement to improve the C M once ini t ial ised. In addit ion, we introduce a parameter that allows the user to tune the a lgor i thm for da ta sets that exhibit specific s t ructural properties if known before hand. T h e a lgor i thm has a 0(W3 • L + L2 • W2 + L 3 ) run-t ime complexity where W is the expected w i d t h of the motif, L is the length of the longest sequence in the input data, however we suggest several optimisations where this could be improved i n Chapter 6. A comparison between R N A P r o f i l e and D I S C O showed that D I S C O was more sensitive and had higher positive predictive value than R N A P r o f i l e al though due to the use of C M s , the running t ime was considerably slower. The D I S C O algor i thm represents a new approach to detecting motifs i n un-aligned R N A sequences. We showed encouraging results w i t h this prototype i m -plementation and expect that, w i th the improvements suggested i n Chapter 6, this algori thm could be widely applied to analysing sets of R N A sequences for conserved sequences and secondary structures. T h e work presented i n this thesis is a step forward i n computa t ional R N A sequence analysis and we hope i t w i l l contribute to identifying novel functional R N A sequences. 107 Bibliography [1] M . Andronescu, Z C . Zhang, and A . Condon. Secondary structure predict ion of interacting R N A molecules. J Mol Biol, 345(5):987-1001, Feb 2005. [2] T . L . Ba i ley and C . E lkan . Unsupervised learning of mult iple motifs in biopoly-mers using expectation maximiza t ionm. Machine Learning, 21:51-83, 1995. [3] T . L . Ba i l ey and C . E lkan . F i t t i n g a mixture model by expectat ion maximiza t ion to discover motifs i n biopolymers. Proc Int Conf Intell Syst Mol Biol, 2:28-36, 1994. [4] I. Barret te, G . Poisson, P . Gendron, and F . Major . Pseudoknots i n pr ion protein inRNAs confirmed by comparative sequence analysis and pat tern searching. Nucleic Acids Res, 29(3):753-758, Feb 2001. [5] A . Ba teman , E . Birney, R . D u r b i n , SR . Eddy, R D . F i n n , and E L . Sonnhammer. P fam 3.1: 1313 mult iple alignments and profile H M M s match the majori ty of proteins. Nucleic Acids Res, 27(l):260-262, J an 1999. [6] E . Bonnet , J . Wuyt s , P . Rouze, and Y . V a n de Peer. Evidence that m i c r o R N A precursors, unlike other non-coding R N A s , have lower folding free energies than random sequences. Bioinformatics, 20(17):2911-2917, N o v 2004. [7] J . Buhler and M . Tompa . F i n d i n g motifs using random projections. J Comput Biol, 9:225-242, 2002. [8] C . Burge and S. K a r l i n . Predic t ion of complete gene structures i n human genomic D N A . J Mol Biol, 268(l):78-94, A p r 1997. [9] J L . Casey, M W . Hentze, D M . Koel ler , S W . Caughman, T A . Rouaul t , R D . Klausner , and J B . Harford. Iron-responsive elements: regulatory R N A se-quences that control m R N A levels and translation. Science, 240(4854):924-928, M a y 1988. 108 [10] F S . Col l ins , E D . Green, A E . Guttmacher , M S . Guyer , and . . A vis ion for the future of genomics research. Nature, 422(6934):835-847, A p r 2003. [11] I N T E R N A T I O N A L H U M A N G E N O M E S E Q U E N C I N G C O N S O R T I U M . F i n -ishing the euchromatic sequence of the human genome. Nature, 431(7011):931-945, Oct 2004. [12] R D . Dowel l and S R . Eddy. Eva lua t ion of several lightweight stochastic context-free grammars for R N A secondary structure predict ion. BMC Bioinformatics, 5(1):71-71, J u n 2004. [13] R . D u r b i n , S.R. Eddy , K r o g h A . , and Mi tch i son G . Biological sequence analysis. Cambridge Univers i ty Press, 1998. [14] SR . Eddy . Non-coding R N A genes and the modern R N A world . Nat Rev Genet, 2(12):919-929, Dec 2001. [15] S R . Eddy . A memory-efficient dynamic programming a lgor i thm for op t imal alignment of a sequence to an R N A secondary structure. BMC Bioinformatics, 3(1):18-18, J u l 2002. [16] S R . E d d y and R . D u r b i n . R N A sequence analysis using covariance models. Nucleic Acids Res, 22( l l ) :2079-2088, J u n 1994. [17] G B . Fogel, V W . Por to , D G . Weekes, D B . Fogel, R H . Griffey, J A . M c N e i l , E . Lesnik, D J . Ecker, and R . Sampath. Discovery of R N A structural elements using evolutionary computat ion. Nucleic Acids Res, 30(23):5310-5317, Dec 2002. [18] J . Gorodk in , L J . Heyer, and G D . Stormo. F i n d i n g the most significant common sequence and structure motifs i n a set of R N A sequences. Nucleic Acids Res, 25(18):3724-3732, Sep 1997. [19] J . Gorodk in , S L . S t r ick l in , and G D . Stormo. Discovering common stem-loop motifs i n unaligned R N A sequences. Nucleic Acids Res, 29(10) :2135-2144, M a y 2001. [20] S. Griffiths-Jones. A . Bateman, M . Marsha l l , A . K h a n n a , and S R . Eddy . Rfam: an R N A family database. Nucleic Acids Res, 31(1):439-441, J an 2003. [21] S. Griffiths-Jones, S. M o x o n , M . Marsha l l , A . K h a n n a , S R . Eddy , and A . Bate-man. Rfam: annotat ing non-coding R N A s in complete genomes. Nucleic Acids Res, 33 Database Issue: 121-124, J an 2005. 109 [22] J H . Havgaard, R . Lyngso, G D . Stormo, and J . Gorodk in . Pairwise local struc-tura l alignment of R N A sequences w i t h sequence s imi lar i ty less than 40%. Bioinformatics, J an 2005. [23] C U . Hel len and P . Sarnow. Internal ribosome entry sites i n eukaryotic m R N A molecules. Genes Dev, 15(13):1593-1612, J u l 2001. [24] M W . Hentze, S W . Caughman, J L . Casey, D M . Koel ler , T A . Rouaul t , J B . Har-ford, and R D . Klausner . A model for the structure and functions of iron-responsive elements. Gene, 72(1-2):201-208, Dec 1988. [25] M W . Hentze and L C . K i i h n . Molecular control of vertebrate i ron metabolism: m R N A - b a s e d regulatory circuits operated by iron, n i t r ic oxide, and oxidative stress. Proc Natl Acad Sci USA, 93(16):8175-8182, A u g 1996: [26] I L . Hofacker. V i e n n a R N A secondary structure server. Nucleic Acids Res, 31(13):3429-3431, J u l 2003. [27] I L . Hofacker, M . Fekete, C . F l a m m , M A . Huynen , S. Rauscher, P E . Stolorz, and P F . Stadler. A u t o m a t i c detection of conserved R N A structure elements i n complete R N A virus genomes. Nucleic Acids Res, 26(16):3825-3836, A u g 1998. [28] I L . Hofacker, M . Fekete, and P F . Stadler. Secondary structure predict ion for aligned R N A sequences. J Mol Biol, 319(5): 1059-1066, J u n 2002. [29] I. Holmes. A probabil ist ic model for the evolution of R N A structure. BMC Bioinformatics, 5(1):166-166, Oct 2004. [30] Y J . H u . P red ic t ion of consensus structural motifs in a family of coregulated R N A sequences. Nucleic Acids Res, 30(17):3886-3893, Sep 2002. [31] A . Jasinska, G . Mich lewsk i , M . de Mezer, K . Sobczak, P . K o z l o w s k i , M . Napier-ala, and W J . Krzyzos iak . Structures of trinucleotide repeats i n human tran-scripts and their functional implications. Nucleic Acids Res, 31(19):5463-5468, Oc t 2003. [32] Y . J i , X . X u , and G D . Stormo. A graph theoretical approach for predict ing common R N A secondary structure motifs including pseudoknots i n unaligned sequences. Bioinformatics, 20(10):1591-1602, J u l 2004. [33] B . John, A J . Enr igh t , A . A r a v i n , T . Tuschl , C . Sander, and D S . M a r k s . H u m a n M i c r o R N A targets. PLoS Biol, 2(11):-1373, Nov 2004. 110 [34] R J . K l e i n and S R . Eddy . R S E A R C H : F i n d i n g homologs of single structured R N A sequences. BMC Bioinformatics, 4( l ) :44-44, Sep 2003. [35] B . Knudsen and J . Hein . R N A secondary structure predict ion using stochastic context-free grammars and evolutionary history. Bioinformatics, 15(6):446-454, J u n 1999. [36] B . Knudsen and J . Hein . Pfold : R N A secondary structure predict ion using stochastic context-free grammars. Nucleic Acids Res, 31(13):3423-3428, J u l 2003. [37] J . K r o l , K . Sobczak, U . Wi lczynska , M . Dra th , A . Jasinska, D . Kaczynska , and W J . Krzyzos iak . St ructura l features of m i c r o R N A ( m i R N A ) precursors and their relevance to m i R N A biogenesis and small interfering R N A / s h o r t ha i rp in R N A design. J Biol Chem, 279(40) :42230-42239, Oc t 2004. [38] M . Lagos-Quintana, R . Rauhut , W . Lendeckel, and T . Tuschl . Identification of novel genes coding for small expressed R N A s . Science, 294(5543):853-858, Oct 2001. [39] A . Lamber t , A . Lescure, and D . Gautheret. A survey of metazoan selenocysteine insertion sequences. Biochimie, 84(9):953-959, Sep 2002. [40] N C . L a u , L P . L i m , E G . Weinstein, and D P . Bar t e l . A n abundant class of t iny R N A s w i t h probable regulatory roles i n Caenorhabdi t is elegans. Science, 294(5543) :858-862, Oc t 2001. [41] C . E . Lawrence, S .F . Al t schu l , M . S . Boguski , J .S . L i u , A . F . Neuwald , and J . C . Woot ton . Detect ing subtle sequence signals: a gibbs sampl ing strategy for mult iple alignment. Science, 262:208-214, 1993. [42] C . E . Lawrence and A . A . Rei l ly . A n expectation maximiza t ion (em) algori thm for the identification and characterization of common sites i n unaligned biopoly-mer sequences. Proteins, 7:41-51, 1990. [43] R C . Lee and V . Ambros . A n extensive class of smal l R N A s i n Caenorhabdit is elegans. Science, 294(5543) :862-864, Oc t 2001. [44] E A . Lesnik, R . Sampath, and D J . Ecker. Rev response elements ( R R E ) i n lentiviruses: an R N A M o t i f algorithm-based strategy for R R E predict ion. Med Res Rev, 22(6):617-636, Nov 2002. 111 [45] T M . Lowe and S R . Eddy. t R N A s c a n - S E : a program for improved detection of transfer R N A genes in genomic sequence. Nucleic Acids Res, 25(5):955-964, M a r 1997. [46] T M . Lowe and S R . Eddy . A computat ional screen for methyla t ion guide snoR-N A s i n yeast. Science, 283(5405):1168-1171, Feb 1999. [47] T J . Macke, D J . Ecker, R R . Gute l l , D . Gautheret, D A . Case, and R . Sampath. R N A M o t i f , an R N A secondary structure definition and search algori thm. Nu-cleic Acids Res, 29(22):4724-4735, Nov 2001. [48] D H . Mathews. Pred ic t ing a set of min ima l free energy R N A secondary struc-tures common to two sequences. Bioinformatics, Feb 2005. [49] B W . Mat thews . Compar ison of the predicted and observed secondary structure of T 4 phage lysozyme. Biochim Biophys Acta, 405(2):442-451, Oct 1975. [50] JS . Ma t t i ck . Non-coding R N A s : the architects of eukaryotic complexi ty. EMBO Rep, 2(11):986-991, N o v 2001. [51] S B . Needleman and C D . Wunsch. A general method applicable to the search for similari t ies i n the amino acid sequence of two proteins. J Mol Biol, 48(3) :443-453, M a r 1970. [52] R . Nussinov and A B . Jacobson. Fast a lgori thm for predict ing the secondary structure of single-stranded R N A . Proc Natl Acad Sci USA, 77(11):6309-6313, N o v 1980. [53] G . Pavesi, G . M a u r i , M . Stefani, and G . Pesole. R N A P r o f i l e : an algori thm for finding conserved secondary structure motifs i n unaligned R N A sequences. Nucleic Acids Res, 32(10):3258-3269, 2004. [54] JS . Pedersen, I M . Meyer , R . Forsberg, P. Simmonds, and J . Hein . A compara-tive method for finding and folding R N A secondary structures w i th in protein-coding regions. Nucleic Acids Res, 32(16):4925-4936, 2004. [55] O . Perriquet, H . Touzet, and M . Dauchet. F i n d i n g the common structure shared by two homologous R N A s . Bioinformatics, 19(1): 108-116, J a n 2003. [56] S. Pfeffer, M . Zavolan, F A . Grasser, M . Chien , J J . Russo, J . J u , B . John, A J . Enr ight , D . M a r k s . C . Sander, and T . Tuschl . Identification of virus-encoded m i c r o R N A s . Science, 304(5671):734-736, A p r 2004. 112 [57] E . Rivas and SR . Eddy . Secondary structure alone is generally not stat ist ically significant for the detection of noncoding R N A s . Bioinformatics, 16(7):583-605, J u l 2000. [58] E . Rivas, R J . K l e i n , T A . Jones, and SR . Eddy. Computa t iona l identification of noncoding R N A s i n E . coli by comparative genomics. Curr Biol, 11(17):1369-1373, Sep 2001. [59] L F . Sempere, S. Freemantle, I. Pi tha-Rowe, E . Moss , E . Dmit rovsky, and V . Ambros . Expression profiling of mammal ian m i c r o R N A s uncovers a sub-set of brain-expressed m i c r o R N A s wi th possible roles i n murine and human neuronal differentiation. Genome Biol, 5(3):-862, 2004. [60] S A . Shabalina, A Y . Ogurtsov, I B . Rogozin , E V . K o o n i n , and D J . L i p m a n . C o m -parative analysis of orthologous eukaryotic m R N A s : potent ial h idden functional signals. Nucleic Acids Res, 32(5): 1774-1782, 2004. [61] SP . Shah, Y . Huang , T . X u , M M . Yuen , J . L i n g , and B F . Ouellette. A t l a s - a data warehouse for integrative bioinformatics. BMC Bioinformatics, 6(1):34-34, Feb 2005. [62] T F . S m i t h and M S . Waterman. Identification of common molecular subse-quences. J Mol Biol, 147(1):195-197, M a r 1981. [63] E L . Sonnhammer, S R . Eddy, E . Birney, A . Bateman, and R . D u r b i n . Pfam: mult iple sequence alignments and HMM-pro f i l e s of protein domains. Nucleic Acids Res, 26(l) :320-322, J an 1998. [64] E L . Sonnhammer, S R . Eddy , and R . Durb in . P fam: a comprehensive database of protein domain families based on seed alignments. Proteins, 28(3):405-420, J u l 1997. [65] J D . Thompson , D G . Higgins, and T J . Gibson . C L U S T A L W : improving the sensit ivity of progressive mult iple sequence alignment through sequence weight-ing, position-specific gap penalties and weight mat r ix choice. Nucleic Acids Res, 22(22):4673-4680, N o v 1994. [66] H . Touzet and O . Perriquet. C A R N A C : folding families of related R N A s . Nu-cleic Acids Res, 32(Web Server issue): 142-145, J u l 2004. [67] D I . V a n R y k and S. Venkatesan. Real-t ime kinetics of H I V - 1 R e v - R e v response element interactions. Defini t ion of min ima l b inding sites on R N A and protein and stoichiometric analysis. J Biol Chem, 274(25): 17452-17463, J u n 1999. 113 [68] S. Washie t l , I L . Hofacker, and P F . Stadler. Fast and reliable predict ion of noncoding R N A s . Proc Natl Acad Sci USA, 102(7):2454-2459, Feb 2005. [69] R H . Waterston, K . L indb lad-Toh , E . Birney, J . Rogers, J F . A b r i l , P . Agar -wal , R . Agarwala , R . Ainscough, M . Alexandersson, P . A n , S E . Antonarakis , J . A t t w o o d , R . Baertsch, J . Bailey, K . Bar low, S. Beck, E . Berry, B . B i r -ren, T . B l o o m , P . Bork , M . Botcherby, N . Bray, M R . Brent , D G . Brown , S D . B r o w n , C . B u l t , J . B u r t o n , J . But ler , R D . Campbe l l , P . Ca rn inc i , S. Caw-ley, F . Chiaromonte , A T . Chinwal la , D M . Church , M . C l a m p , C . Clee, F S . Col l ins , L L . Cook, R R . Copley, A . Coulson, O. Couronne, J . Cuff, V . Curwen, T . Cut t s , M . Daly , R . D a v i d , J . Davies, K D . Delehaunty, J . De r i , E T . Der-mitzakis , C . Dewey, N J . Dickens, M . Diekhans, S. Dodge, I. Dubchak, D M . D u n n , SR . Eddy , L . E l n i t s k i , R D . Emes, P. Eswara , E . Eyras , A . Felsenfeld, G A . Fewell , P . F l icek , K . Foley, W N . Frankel , L A . Ful ton , R S . Ful ton , T S . Furey, D . Gage, R A . Gibbs , G . Glusman, S. Gnerre, N . G o l d m a n , L . G o o d -stadt, D . Grafham, T A . Graves, E D . Green, S. Gregory, R . Guigo , M . Guyer , R C . Hardison, D . Haussler, Y . Hayashizaki , L W . Hi l l ie r , A . Hinr ichs , W . Hlav-ina, T . Holzer , F . Hsu , A . Hua , T . Hubbard , A . Hunt , I. Jackson, D B . Jaffe, L S . Johnson, M . Jones, T A . Jones, A . Joy, M . K a m a l , E K . Kar l s son , D . Karo lch ik , A . Kasp rzyk , J . K a w a i , E . Keibler , C . Ke l l s , W J . Ken t , A . K i r b y , D L . K o l b e , I. Korf , R S . Kucher lapa t i , E J . Kulbokas , D . K u l p , T . Landers, J P . Leger, S. Leonard, I. Le tunic , R . Levine, J . L i , M . L i , C . L l o y d , S. Lucas , B . M a , D R . Mag lo t t , E R . Mard i s , L . Mat thews, E . Mauce l i , J H . Mayer , M . M c -Carthy, W R . M c C o m b i e , S. M c L a r e n , K . M c L a y , J D . M c P h e r s o n , J . M e l d r i m , B . Mered i th , J P . Mesirov, W . M i l l e r , T L . Mine r , E . M o n g i n , K T . Montgomery, M . Morgan , R . M o t t , J C . M u l l i k i n , D M . Muzny , W E . Nash , J O . Nelson, M N . Nhan , R . N i c o l , Z . N i n g , C . Nusbaum, M J . O 'Connor , Y . Okazak i , K . Oliver , E . Overton-Larty , L . Pachter, G . Parra , K H . Pepin , J . Peterson, P . Pevzner, R . P l u m b , C S . P o h l , A . Pol iakov, T C . Ponce, C P . Pont ing , S. Potter , M . Qua i l , A . Reymond , B A . Roe , K M . Rosk in , E M . R u b i n , A G . Rus t , R . Santos, V . Sapo-jnikov, B . Schultz, J . Schultz, M S . Schwartz, S. Schwartz, C . Scott, S. Sea-man, S. Searle, T . Sharpe, A . Sheridan, R . Shownkeen, S. Sims, J B . Singer, G . Slater, A . Smit , D R . Smi th , B . Spencer, A . Stabenau, N . Stange-Thomann, C . Sugnet, M . Suyama, G . Tesler, J . Thompson, D . Torrents, E . Trevaskis, J . Tromp, C . U c l a , A . Ure t a -V ida l , J P . V inson , A C . V o n Niederhausern, C M . Wade, M . W a l l , R J . Weber, R B . Weiss, M C . Wendl , A P . West, K . Wetter-strand, R . Wheeler , S. Whelan , J . Wierzbowski , D . Wi l l ey , S. W i l l i a m s , R K . W i l s o n , E . Win te r , K C . Worley, D . W y m a n , S. Yang , SP. Yang , E M . Zdobnov, M C . Zody, and E S . Lander . Ini t ia l sequencing and comparative analysis of the 114 mouse genome. Nature, 420(6915):520-562, Dec 2002. [70] Z. Weinberg and W L . Ruzzo . Exp lo i t i ng conserved structure for faster anno-ta t ion of non-coding R N A s without loss of accuracy. Bioinformatics, 20 Suppl 1:334-334, A u g 2004. [71] M . Zuker and P . Stiegler. O p t i m a l computer folding of large R N A sequences using thermodynamics and auxi l iary information. Nucleic Acids Res, 9(1):133-148, J an 1981. 115 

Cite

Citation Scheme:

        

Citations by CSL (citeproc-js)

Usage Statistics

Share

Embed

Customize your widget with the following options, then copy and paste the code below into the HTML of your page to embed this item in your website.
                        
                            <div id="ubcOpenCollectionsWidgetDisplay">
                            <script id="ubcOpenCollectionsWidget"
                            src="{[{embed.src}]}"
                            data-item="{[{embed.item}]}"
                            data-collection="{[{embed.collection}]}"
                            data-metadata="{[{embed.showMetadata}]}"
                            data-width="{[{embed.width}]}"
                            data-media="{[{embed.selectedMedia}]}"
                            async >
                            </script>
                            </div>
                        
                    
IIIF logo Our image viewer uses the IIIF 2.0 standard. To load this item in other compatible viewers, use this url:
https://iiif.library.ubc.ca/presentation/dsp.831.1-0051575/manifest

Comment

Related Items