UBC Theses and Dissertations
Detecting common secondary structure elements in RNA sequences
Shah, Sohrab P
As evidence for the important and diverse roles of RNA molecules in our cellular machinery continues to grow, there is increasing interest in developing computational methods to analyse RNA sequences. Sets of evolutionarily related RNA sequences contain signals at both the sequence and secondary structure levels that can be exploited to detect motifs common to all or a portion of those sequences. Motifs conserved in evolution are believed to be functionally important, and detecting such motifs could therefore yield novel functional RNA sequences. We developed an algorithm called DISCO to detect conserved motifs in a set of unaligned RNA sequences. Our algorithm uses a powerful probabilistic formalism called covariance models (CMs) to model motifs. We introduce a novel approach to initialising a CM using pairwise and multiple sequence alignment. The CM is then iteratively refined using expectation maximisation. Our initialisation method can operate on sequence signals alone, using only a portion of the input sequences to initialise a CM that then recovers the remaining motif instances. We tested our algorithm on 26 data sets derived from Rfam seed alignments of microRNA (miRNA) precursors and conserved elements in the untranslated regions of mRNAs (UTR elements). By three measures of specificity and positive predictive value, our algorithm performed well on the miRNA data sets and showed a bimodal distribution on the UTR element data sets, where the motif was either completely missed or very accurately predicted. In a comparison with a competing algorithm, RNAProfile, DISCO achieved higher sensitivity and positive predictive value, although RNAProfile ran considerably faster. The accuracy of our algorithm was unaffected by average percent pairwise sequence identity, overall length or number of sequences in the input data, indicating that DISCO could be run with similar accuracy on diverse data sets.
The running time of DISCO is O(W³ + L²W² + L³), where W is the width of the motif and L is the length of the longest sequence in the input data. This is an improvement on SLASH, the only other RNA motif-finding algorithm in the literature that uses CMs.
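To give a feel for the stated bound, the following minimal Python sketch evaluates its three terms for a given motif width W and sequence length L and reports which one is largest. It is only an illustration of the formula: the function names `disco_cost_terms` and `dominant_term` are hypothetical, constant factors are ignored, and the numbers say nothing about the actual running time of DISCO.

```python
def disco_cost_terms(W, L):
    """Return the three terms of the bound O(W^3 + L^2 W^2 + L^3).

    Constant factors are ignored, so the values are only meaningful
    relative to one another. (Hypothetical helper, not part of DISCO.)
    """
    return {
        "W^3": W ** 3,
        "L^2*W^2": (L ** 2) * (W ** 2),
        "L^3": L ** 3,
    }


def dominant_term(W, L):
    """Name the largest of the three terms for the given W and L."""
    terms = disco_cost_terms(W, L)
    return max(terms, key=terms.get)


if __name__ == "__main__":
    # Illustrative (made-up) motif widths and sequence lengths.
    for W, L in [(5, 100), (20, 200), (50, 500)]:
        print(f"W={W}, L={L}: dominant term is {dominant_term(W, L)}")
```

Comparing the last two terms directly, L²W² exceeds L³ exactly when W² > L, so under this bound the mixed term tends to dominate unless the motif is very narrow relative to the longest sequence.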