Detecting common secondary structure elements in RNA sequences

UBC Theses and Dissertations

Shah, Sohrab P

Abstract

As evidence for the important and diverse roles of RNA molecules in our cellular machinery continues to grow, there is an increasing interest in developing computational methods to analyse RNA sequences. Sets of evolutionarily related RNA sequences contain signals at both the sequence and secondary structure levels that can be exploited to detect motifs common to all or a portion of those sequences. Motifs conserved in evolution are believed to be functionally important and therefore detection of such motifs could yield novel functional RNA sequences. We developed an algorithm called DISCO to detect conserved motifs in a set of unaligned RNA sequences. Our algorithm uses a powerful probabilistic formalism called covariance models (CM) to model motifs. We introduce a novel approach to initialise a CM using pairwise and multiple sequence alignment. The CM is then iteratively refined using expectation maximisation. Our initialisation method can operate on sequence signals alone using only a portion of the input sequences to initialise a CM to recover the remaining motif instances. We tested our algorithm on 26 data sets derived from Rfam seed alignments of microRNA (miRNA) precursors and conserved elements in the untranslated regions of mRNAs (UTR elements). By three measures of specificity and positive predictive value, our algorithm performed well on the miRNA data sets and showed a bi-modal distribution for the UTR element data sets where the motif was completely missed, or very accurately predicted. In a comparison test with a competing algorithm, DISCO outperformed RNAProfile in measures of sensitivity and positive predictive value, although the running time of RNAProfile was considerably faster. The accuracy of our algorithm was unaffected by average percent pairwise sequence identity, overall length or number of sequences in the input data, indicating that DISCO could be run with similar accuracy on diverse data sets. 
The running time of DISCO is O(W³ + L²W² + L³) where W is the width of the motif and L is the length of the longest sequence in the input data. This is an improvement on SLASH, the only other RNA motif finding algorithm in the literature that uses CMs.
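
The abstract reports accuracy in terms of sensitivity and positive predictive value (PPV). As a minimal sketch of how such measures are commonly computed at the nucleotide level, the Python below compares a predicted motif span against an annotated one. The function name, the position-set representation, and the example coordinates are illustrative assumptions; the exact definitions used in the thesis may differ.

```python
def sensitivity_and_ppv(true_positions, predicted_positions):
    """Return (sensitivity, ppv) for two collections of nucleotide indices.

    Hypothetical helper: standard textbook definitions, not necessarily
    the measures defined in the thesis.
    """
    true_set = set(true_positions)
    pred_set = set(predicted_positions)

    tp = len(true_set & pred_set)  # motif positions that were predicted
    fn = len(true_set - pred_set)  # motif positions that were missed
    fp = len(pred_set - true_set)  # predicted positions outside the motif

    # Guard against empty inputs to avoid division by zero.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    ppv = tp / (tp + fp) if (tp + fp) else 0.0
    return sensitivity, ppv


# Made-up example: annotated motif at positions 10..29, prediction at 15..34.
sens, ppv = sensitivity_and_ppv(range(10, 30), range(15, 35))
print(f"sensitivity={sens:.2f}, PPV={ppv:.2f}")
```

Under these definitions, a prediction that misses the motif entirely scores zero on both measures, which matches the "completely missed" mode described for the UTR element data sets.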

Item Metadata

Title	Detecting common secondary structure elements in RNA sequences
Creator	Shah, Sohrab P
Publisher	University of British Columbia
Date Issued	2005
Genre	Thesis/Dissertation
Type	Text
Language	eng
Date Available	2009-12-11
Provider	Vancouver : University of British Columbia Library
Rights	For non-commercial purposes only, such as research, private study and education. Additional conditions apply, see Terms of Use https://open.library.ubc.ca/terms_of_use.
DOI	10.14288/1.0051575
URI	http://hdl.handle.net/2429/16465
Degree (Theses)	Master of Science - MSc
Program (Theses)	Computer Science
Affiliation	Science, Faculty of; Computer Science, Department of
Degree Grantor	University of British Columbia
Graduation Date	2005-05
Campus	UBCV
Scholarly Level	Graduate
Aggregated Source Repository	DSpace

Item Media

ubc_2005-0311.pdf -- 6.49MB
