UBC Theses and Dissertations

The role of pre-mRNA secondary structure in gene splicing in Saccharomyces cerevisiae Rogic, Sanja 2006

The role of pre-mRNA secondary structure in gene splicing in Saccharomyces cerevisiae

by Sanja Rogic

B.Sc., The Faculty of Mathematics, The University of Belgrade, 1994
M.Sc., The University of British Columbia, 2000

A THESIS SUBMITTED IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF Doctor of Philosophy in The Faculty of Graduate Studies (Computer Science)

The University Of British Columbia
October 12, 2006
© Sanja Rogic 2006

Abstract

The process of gene splicing, which involves the excision of introns from a pre-mRNA and the joining of exons into mature mRNA, is one of the essential steps in protein production. Although this process has been extensively studied, it is still not clear how the splice sites are accurately identified and correctly paired across the intron. It is currently believed that identification is accomplished through base-pairing interactions between the splice sites and the spliceosomal snRNAs. However, the relatively conserved sequences at the splice sites are often indistinguishable from similar sequences that are not involved in splicing. This suggests that not only sequence but also other features of pre-mRNA may play a role in splicing. A number of authors have studied the effects of pre-mRNA secondary structure on splicing, but these studies are usually limited to one or a small number of genes, and therefore the conclusions are usually gene-specific.
This thesis aims to complement previous studies of the role of pre-mRNA secondary structure in splicing by performing a comprehensive computational study of structural characteristics of Saccharomyces cerevisiae introns and their possible role in pre-mRNA splicing. We identify long-range interactions in the secondary structures of all long introns that effectively shorten the distance between the donor site and the branchpoint sequence. The shortened distances are distributed similarly to the branchpoint distances in short yeast introns, which are presumed to be optimal for splicing, and very differently from the corresponding distances in random and exonic sequences. We show that in the majority of cases, these stems are conserved among closely related yeast species.

Furthermore, we formulate a model of structural requirements for efficient splicing of yeast introns that explains previous splicing studies of the RP51B intron. We also test our model by laboratory experiments, which verify our computational predictions.

Finally, we use different computational approaches to identify any structural context at the boundaries or within yeast introns. Our study reveals statistically significant biases, which we use to train machine learning classifiers to distinguish between real and pseudo splice sites.

Contents

Abstract
Contents
List of Tables
List of Figures
Acknowledgements
Dedication
1 Introduction
2 Background and related work
2.1 Pre-mRNA splicing and splice site recognition
2.2 RNA secondary structure and prediction
2.3 Secondary structure of pre-mRNAs and its effects on gene splicing
2.4 Using secondary structure information for splice site prediction
3 Intron structure and splicing in Saccharomyces cerevisiae
3.1 S. cerevisiae introns and splicing
3.1.1 Conservation of splicing signals
3.1.2 Pre-mRNA splicing
3.2 S. cerevisiae intron dataset
3.2.1 Dataset construction
3.2.2 Length distribution and architecture of yeast introns
3.2.3 Secondary structure in yeast introns
3.3 Intron dataset for phylogenetic analysis
4 Zipper stems in long yeast introns
4.1 Definition and initial identification of zipper stems
4.2 Length-bounded zipper stems
4.2.1 Control datasets
4.2.2 Multiple zipper stems
4.3 Thermodynamically stable zipper stems
4.3.1 Identification of thermodynamically stable zipper stems
4.3.2 Multiple zipper stems
4.4 Phylogenetic analysis of zipper stems
4.4.1 Comparative analysis by visual inspection
4.4.2 Comparative analysis using programs for comparative RNA structure prediction
4.4.3 Comparative structure analysis on STRIN 5'L introns
4.5 Conclusions
5 Splicing efficiency and branchpoint distance
5.1 Experimental results for the RP51B intron
5.2 Structural and branchpoint distance analysis of RP51B mutants
5.3 Refinement of zipper stem hypothesis
5.3.1 Including suboptimal structures in the analysis
5.3.2 A new way of calculating the branchpoint distance
5.3.3 Computation of structural characteristics of introns
5.4 Branchpoint-distance analysis of RP51B mutants using the refined model
5.5 Branchpoint-distance analysis on the STRIN dataset
5.6 Validation by biological experiments
5.6.1 Verification of the experimental system
5.6.2 Mutant design
5.6.3 Experimental procedure
5.6.4 Results and discussion
5.7 Conclusions
6 Structural characteristics of yeast introns
6.1 Structural stability of introns vs. random sequences
6.1.1 Z-score analysis of global intron structure
6.1.2 Z-score analysis of local structure
6.2 Free bases at splicing signals
6.3 MFE of basepairing between splicing signals and snRNAs
6.3.1 snRNA-pre-mRNA interactions during splicing
6.3.2 PairFold experiments using arbitrary pseudo sites
6.3.3 PairFold experiments using more specific pseudo sites
6.4 Phylogenetic analysis of sequences around splice signals
6.4.1 Phylogenetic analysis of sensu stricto species
6.4.2 Searching for common motifs in STRIN introns
6.5 Conclusions
7 Using structural information for intron identification
7.1 Machine learning and classification
7.1.1 Neural networks
7.1.2 Support vector machines
7.2 Learning the splicing efficiency function
7.2.1 Training on shortened branchpoint distance and structure probability data
7.2.2 Training on structural summary statistics
7.3 Using weak structural signals to improve accuracy of computational splice site and intron prediction
7.3.1 Improving the accuracy of splice site prediction
7.3.2 Improving the accuracy of intron prediction
7.4 Conclusions
8 Conclusions and future work
Bibliography
A The STRIN dataset
B Experimental procedure
C Outputs from StructureAnalyze for RP51B mutants
C.1 Output for the Libri's mutants (introns only)
C.2 Output for the Libri's mutants (introns and 5' flanking region)
C.3 Output for the Libri's mutants (introns and both flanking regions)
C.4 Output for the Charpentier's mutants (introns only)
C.5 Output for the Charpentier's mutants (introns and 5' flanking region)
C.6 Output for the Charpentier's mutants (introns and both flanking regions)
D Sequences of new RP51B mutants

List of Tables

3.1 An excerpt from the Ares lab Yeast Intron Database (Grate and Ares, 2002)
3.2 An excerpt from YIDB (Lopez and Seraphin, 2000)
4.1 Summary of KS test results for all pair-wise comparisons between datasets of STRIN 5'S introns, STRIN 5'L introns, exonic sequences and random sequences.
The p-value highlighted in boldface is greater than 0.05, indicating that the hypothesis that two compared datasets stem from the same distribution cannot be rejected.
4.2 The p-values from the KS test applied to the dataset of the 5'L branchpoint distances shortened by thermodynamically stable zipper stems and the dataset of the 5'S branchpoint distances. The numbers in the first row are the values for the loop threshold ($t_l$) and the numbers in the first column are the values for the energy threshold ($t_e$). The p-values highlighted in boldface are greater than 0.05; for these $t_e$ and $t_l$ values the hypothesis that two compared datasets stem from the same distribution cannot be rejected at the standard significance level of $\alpha = 0.05$.
4.3 The p-values from the KS test applied to the dataset of the 5'L branchpoint distances shortened by multiple thermodynamically stable zipper stems and the dataset of the 5'S branchpoint distances. The numbers in the first row are the values for the loop threshold ($t_l$) and the numbers in the first column are the values for the energy threshold ($t_e$). The p-values highlighted in boldface are greater than 0.05; for these $t_e$ and $t_l$ values the hypothesis that two compared datasets stem from the same distribution cannot be rejected at the standard significance level of $\alpha = 0.05$.
4.4 Introns used for our comparative RNA structure analysis
4.5 Results of comparative RNA structure analysis. Details are given in the text (S.cer = S. cerevisiae, S.par = S. paradoxus, S.mik = S. mikatae, S.bay = S.
bayanus).
5.1 Correlation of the shortened branchpoint distance (d) with splicing efficiency for Libri's mutants. Shortened branchpoint distances were calculated by two versions of algorithms for zipper stem identification: one uses the stem length to select stable zipper stems (first column), and the other uses thermodynamics criteria for stem selection (second column). Levels of splicing efficiency were inferred from Libri et al. (1995).
5.2 Correlating shortened branchpoint distance (d) with splicing efficiency for Libri's mutants where short flanking regions were folded with intronic sequences. Shortened branchpoint distances were calculated by two versions of algorithms for zipper stem identification: one which uses the stem length to select stable zipper stems ($d_{length=5}$) and the other which uses thermodynamics criteria for stem selection ($d_{t_e=-10,\,t_l=6}$). Levels of splicing efficiency were inferred from Libri et al. (1995).
5.3 Summary statistics for Libri's mutants. Levels of splicing efficiency were determined from Figure 5.2.
5.4 Summary statistics for Charpentier's mutants. Levels of splicing efficiency were inferred from Figures 2 and 3 and Table 1 in the article by Charpentier and Rosbash (1996).
5.5 Basepairing probabilities of contact conformation (Figure 5.6) for Libri's mutants.
The probabilities were calculated by the RNAfold program.
5.6 Basepairing probabilities of contact conformation for Charpentier's mutants
5.7 Characteristics of newly designed RP51B mutants: # of cc - number of structures with the contact conformation within 5% from the MFE; p1 - sum of normalized probabilities $P_{norm}(R_{ij})$ for all of the structures $R_{ij}$ that contain contact conformation; p2 - basepairing probability of interaction between the donor site and the branchpoint sequence based on the partition function; avg - average branchpoint distance as given in Equation 5.4 (p. 112); r_weight - summary statistics defined in Equation 5.5 (p. 112); BP - structural configuration of branchpoint sequence (loop or stem).
6.1 STRIN long introns that have very low Z-scores
6.2 Average values for Z-score and free energies for native and random sequences in the vicinity of splice sites. The values are calculated for sliding windows of size 50 nt and then averaged over all sliding window positions. The p-values are calculated using the Wilcoxon rank-sum test.
6.3 Motifs predicted for extended donor, acceptor and branchpoint sequences and their features: number of sequences from STRIN dataset in which motif instances are found, sum of weights for all instances found, average weighted alignment score (alignment between covariance model and sequence), and average folding energy of the motif.
The last column of the table is the average score from the program cmsearch (see text).
7.1 Fraction of correctly classified instances for various mfold parameter settings and for two types of sequences: intron-only and introns with 50-nt flanking regions
7.2 Fraction of correctly classified instances for various mfold parameter settings and for two types of sequences: intron-only and introns with 50-nt flanking regions. The branchpoint distance attributes, $d_i$, are sorted in ascending order.
7.3 Fraction of correctly classified instances, which in this case is equal to the true positive rate (sensitivity), for various mfold parameters and for two different sortings of attributes
7.4 The accuracy results for machine learning classification of 5 RP51B mutants (bad1, bad3, good2, good3, good4). The attributes of feature vectors for both the training and test sets are listed in the column Attributes; the percent of suboptimality used to calculate the attribute values is given in the column % subopt. The accuracy of the prediction, given separately for NN and SVM, is the fraction of correctly classified instances (both positive and negative) achieved by the respective classifier.
7.5 The accuracy results for machine learning classification of STRIN 5'L introns. The attributes of feature vectors for both the training and testing sets are listed in the column Attributes; the percent of suboptimality used to calculate the attribute values is given in the column % subopt.
The accuracy of the prediction, given separately for NN and SVM, is the fraction of correctly classified instances (in this case also the true positive rate) achieved by the respective classifier.
7.6 The accuracy results for neural network classification of 1212 candidate donor sites. The attributes of feature vectors for both the training and testing sets are listed in the column Attributes. Accuracy is the fraction of correctly classified instances (both positive and negative) given by the classifier; Sn is the sensitivity of prediction, defined as $TP/(TP+FN)$; PPV is the positive predictive value; TN - number of correctly predicted negative instances, FP - number of negative instances predicted as positives, TP - number of correctly predicted positive instances, FN - number of positive instances predicted as negatives.
7.7 Accuracy results for neural network classification of 542 candidate introns. The attributes of feature vectors for both the training and testing sets are listed in the column Attributes.
Accuracy is the fraction of correctly classified instances (both positive and negative) given by the classifier; Sn is the sensitivity of prediction; PPV is the positive predictive value; TN - number of correctly predicted negative instances, FP - number of negative instances predicted as positives, TP - number of correctly predicted positive instances, FN - number of positive instances predicted as negatives.

List of Figures

1.1 The molecular processes involved in the pathway leading from DNA to protein in eukaryotic cells
2.1 Splicing of pre-mRNA
2.2 Consensus sequences involved in intron splicing in humans. R stands for purines (adenine (A) and guanine (G)) and Y stands for pyrimidines (cytosine (C), thymine (T) and uracil (U)). N stands for any nucleotide. The figure was modified from Moore (2000).
2.3 An example of RNA secondary structure where the bullets represent the nucleotides in the RNA sequence and the black lines between them represent basepairing interactions. The basic structural elements are annotated. The figure was taken from Andronescu (2003).
3.1 (a) Distribution histogram and (b) cumulative distribution of intron lengths (L) in the STRIN dataset
3.2 Architecture of a yeast intron: consensus 5' and 3' splice sites and branchpoint sequence are given (Y = C or U). The branchpoint distance is the distance between the 5' splice site and the branchpoint sequence.
Some authors extend this distance up to the branchpoint adenine.
3.3 (a) Distribution histogram and (b) cumulative distribution of 5' splice site - branchpoint distances in the STRIN dataset
3.4 (a) Distribution histogram and (b) cumulative distribution of branchpoint - 3' splice site distances in the STRIN dataset
3.5 Correlation between intron length and branchpoint distance in the STRIN dataset (r = 0.99)
3.6 Correlation between intron length and branchpoint - 3' splice site distance in the STRIN dataset (r = 0.13)
3.7 Phylogenetic tree for Saccharomyces sensu stricto species derived based on sequence divergence of ribosomal DNA sequences (Kellis, 2003). The time labels (in mya = million years ago) given at the nodes of the tree indicate when the species, represented by the branches coming out from a node, diverged from the most common ancestor.
3.8 An example of a ClustalW alignment taken from the supplementary data by Kellis et al. (2003). Nucleotides conserved in all four species are marked. The intron is indicated by a double dashed line. Potential donor sites and branchpoint sequences are annotated by strings of Ds and Bs, respectively.
3.9 Intron length distribution histograms and intron length correlation plots between S. cerevisiae and Saccharomyces sensu stricto species: (a), (b) S. paradoxus, (c), (d) S. mikatae, (e), (f) S. bayanus
3.10 Branchpoint distance distribution histograms and branchpoint distance correlation plots between S.
cerevisiae and Saccharomyces sensu stricto species: (a), (b) S. paradoxus, (c), (d) S. mikatae, (e), (f) S. bayanus
4.1 Secondary structure of the YGL030W intron. The dot-bracket notation for this structure with highlighted zipper stem is shown at the bottom of the figure.
4.2 Illustration of shortened branchpoint distance
4.3 Distribution histograms for (a) branchpoint distance for 5'S introns (d) and (b) branchpoint distance for zipped 5'L introns (d)
4.4 Cumulative distributions of the 5' splice site - branchpoint distance for 5'S introns and 5'L introns folded in secondary structure
4.5 Secondary structure of the YDL079C intron
4.6 Distribution histograms for (a) branchpoint distance for 5'S introns and (b) branchpoint distance for zipped 5'L introns with a minimum zipper stem length of 5 nt
4.7 Cumulative distributions of the 5' splice site - branchpoint distance for 5'S introns and 5'L introns folded in secondary structure with a minimum zipper stem length of 5 nt
4.8 Distributions for (a) branchpoint distance for 5'S introns, (b) branchpoint distance for zipped 5'L introns, (c) branchpoint distance of zipped exonic sequences, and (d) branchpoint distance of zipped random sequences.
All zipped structures have a minimum zipper stem length of 5 nt.
4.9 Cumulative distributions for the branchpoint distance in 5'S introns, shortened branchpoint distance in zipped 5'L introns, and shortened branchpoint distance of zipped exonic and random sequences. All zipped structures have a minimum zipper stem length of 5 nt.
4.10 Secondary structure of the YGR214W intron. The four zipper stems are enumerated in the order by which they were identified by the algorithm. The shortened distance is calculated by counting nucleotides between the 5' splice site and the branchpoint sequence that are not enclosed in or between complementary sequences of the found zipper stems (black letters).
4.11 Distributions for (a) branchpoint distance for 5'S introns and (b) branchpoint distance for 5'L introns zipped with one or more stems with minimum length of 5 nt
4.12 Cumulative distributions of the 5' splice site - branchpoint distance for 5'S introns and 5'L introns, random and exonic sequences zipped with one or more stems with minimum length of 5 nt
4.13 Distribution histogram (a) and cumulative distribution plot (b) of free energies of stems in naturally occurring RNA molecules (obtained from SSTRAND database)
4.14 Distribution histograms and cumulative distribution plots of the number of free bases in (a), (b) bulges and (c), (d) internal loops of naturally occurring RNA molecules (obtained from SSTRAND database)
4.15 Distributions of (a) branchpoint distances for 5'S introns and (b) branchpoint distances for 5'L introns that were shortened by multiple thermodynamically stable zipper stems ($t_e = -10$ and $t_l = 6$)
4.16 Cumulative distributions of the 5' splice site - branchpoint distance for 5'S introns and 5'L introns zipped with one or more stems ($t_e = -10$ and $t_l = 6$)
4.17 An example of compensatory mutations. The mutations in the second sequence are compensatory since they maintain basepairing.
4.18 LAGAN multiple sequence alignment for intron YCR031C. Potential zipper stem regions are highlighted in S. cerevisiae, S. paradoxus, S. mikatae and S. bayanus sequences. Black boxes indicate the location of the conserved zipper stem.
4.19 Minimum free energy secondary structures for intron YCR031C in S. cerevisiae and S. paradoxus. The 5' splice site, branchpoint and potential stem region are annotated for each structure. Conserved zipper stems found by comparative analysis are magnified and shown in boxes.
Base-pairs conserved among all four species are highlighted.
4.20 Minimum free energy secondary structures for intron YCR031C in S. mikatae and S. bayanus. The 5' splice site, branchpoint and potential stem region are annotated for each structure. Conserved zipper stems found by comparative analysis are magnified and shown in boxes. Base-pairs conserved among all four species are highlighted.
4.21 Pfold result for the alignment of the YCR031C intron. The first line is a common structure for all the sequences. The individual structures are found by applying the common structure to each sequence and extending stems, if possible. The last line indicates the reliability of prediction for each nucleotide in the alignment. The stem that satisfies the requirements for a zipper stem is enclosed in a box.
4.22 Alifold result for the alignment of the YCR031C intron. The output contains the consensus sequence and the optimal consensus structure in dot-bracket notation followed by its energy. A graphical representation of the structure is also given. The stem that satisfies the requirements for a zipper stem is enclosed in a box.
4.23 Distributions of shortened branchpoint distances for 5'L STRIN introns where the analyzed secondary structure was the consensus structure for all sensu stricto species. The consensus structures were produced by (a) Alifold and (b) Pfrali.
4.24 Cumulative distributions of shortened branchpoint distances for 5'L STRIN introns where the analyzed secondary structure was the consensus structure for all sensu stricto species. The consensus structures were produced by (a) Alifold and (b) Pfrali.
5.1 Reproduction of Figure 1B from Charpentier and Rosbash (1996): putative interaction between the UB1 and DB1 regions.
5.2 Reproduction of Figure 2(C) from Libri et al. (1995): copper growth assay from Libri's mutants. In the upper left corner of the figure is a schematic drawing showing the locations of the mutants. CuSO4 concentrations (in $10^{-3}$ moles/litre) are indicated under each panel.
5.3 Minimum free energy secondary structure of 3mUB1 mutant predicted by mfold
5.4 Conversion from the RNA secondary structure to the graph representing it.
(a) Graphical representation of the secondary structure of an intron (circles represent basepairing interactions, i.e., hydrogen bonds). (b) Graph representing the RNA structure in (a).
5.5 Pseudo-code for the procedure StructureAnalyze described in Section 5.3.3
5.6 A part of the RP51B wild type intron secondary structure that shows basepairing between the donor site and the branchpoint sequence
5.7 Comparing branchpoint distances for STRIN long introns and corresponding random sequences: distribution of minimum branchpoint distances as explained in the text (a: distribution histogram and b: cumulative distribution); c, d: distribution of average branchpoint distances; e, f: distribution of the most probable branchpoint distances
5.8 Comparing branchpoint distances for STRIN long introns with flanking regions and corresponding random sequences: distribution of minimum branchpoint distances as explained in the text (a: distribution histogram and b: cumulative distribution); c, d: distribution of average branchpoint distances; e, f: distribution of the most probable branchpoint distances
5.9 Comparing probabilities of the basepairing between the donor site and the branchpoint sequence for STRIN long introns and corresponding random sequences. a: distribution histogram of the basepairing probabilities; b: cumulative distribution of the basepairing probabilities
5.10 Protein expression results for the RP51B gene containing some of Libri's mutant introns obtained by our experimental approach. Protein expression level is normalized with respect to wild type expression level. Shaded boxes represent the mean value for several different samples and error bars represent +/- 1 standard deviation for these samples. The error bar for the wild type intron comes from comparison of two different wild type samples.
5.11 Protein expression results for the RP51B gene containing our new mutant introns. Protein expression level is normalized with respect to wild type expression level. Shaded boxes represent the mean value for several different samples and error bars represent +/- 1 standard deviation for these samples.
5.12 Part of the MFE secondary structure prediction for mutant good5 that shows the donor site and branchpoint sequence basepairing as well as the very stable zipper stem that stabilizes their interaction
6.1 (a) Distribution of Z-scores for all STRIN introns. Standard normal distribution (mean = 0 and standard deviation = 1),
tha t is expec ted d i s t r i b u t i o n for Z-scores, is s h o w n i n dashed l ine, (b) Quan t i l e -quan t i l e p lo t of Z-scores against s t a n d a r d n o r m a l d i s t r i b u t i o n . . . ' . . - 148 6.2 (a) D i s t r i b u t i o n of Z-scores for l ong S T R I N in t rons . S t a n -d a r d n o r m a l d i s t r i b u t i o n is shown i n dashed l ine, (b) C o r r e -s p o n d i n g quan t i l e -quan t i l e p lo t 149 List of Figures xx 6.3 (a) D i s t r i b u t i o n of Z-scores for shor t S T R I N in t rons . S t a n -d a r d n o r m a l d i s t r i b u t i o n is s h o w n i n dashed l ine, (b) C o r r e -s p o n d i n g quan t i l e -quan t i l e p lo t 150 6.4 A v e r a g e Z-scores for each s l i d i n g w i n d o w p o s i t i o n a r o u n d the spl ice s ignals for 5 ' L S T R I N in t rons . T h e size of the s l i d i n g w i n d o w is 50 nt a n d the step size is 10 nt. T h e Z-values are p l o t t e d for the m i d d l e p o s i t i o n of each s l i d i n g w i n d o w (6 s l i d i n g w i n d o w pos i t ions i n to ta l ) 154 6.5 D i s t r i b u t i o n s of average M F E for 50-n t - s l id ing w i n d o w f rom . na t ive l o n g S T R I N in t rons a n d generated r a n d o m sequences: (a) f rom —40 to +10 n t ' w i t h respect to the 5' spl ice site (p-value = 0.005) (b) q-q p lo t c o m p a r i n g d i s t r i b u t i o n s of na t ive a n d r a n d o m sequences (c) f rom —50 to 0 w. r . t . the 3 ' spl ice site (p-value 1.9 • 1 0 ~ 4 ) (d) co r r e spond ing q-q p lo t 156 6.6 D i s t r i b u t i o n s of average M F E for 50-n t - s l id ing w i n d o w f rom na t ive long S T R I N in t rons a n d generated r a n d o m sequences: (a) f rom - 2 0 to +30 w. r . t . the 3' spl ice site (p-value = 0.002) (b) q-q p lo t c o m p a r i n g d i s t r i bu t i ons of na t ive a n d r a n d o m se-quences (c) f rom 0 to +50 w.r . t . 
the b e g i n n i n g of b r a n c h p o i n t sequence (p-value = 0.002) (d) co r r e spond ing q-q p lo t 157 6.7 A v e r a g e Z-scores for each s l i d i n g w i n d o w p o s i t i o n a r o u n d the spl ice s ignals for a l l S T R I N in t rons . T h e size of the s l i d i n g w i n d o w is 50 nt a n d the step size is 10 nt. T h e Z-values are p l o t t e d for the m i d d l e p o s i t i o n of each s l i d i n g w i n d o w (6 s l i d i n g w i n d o w pos i t ions i n to ta l ) 158 6.8 D i s t r i b u t i o n s of the n u m b e r of u n p a i r e d nucleot ides i n b r a n c h -po in t a n d n o n - b r a n c h sequences w h e n folded i n g l o b a l i n t r o n i c secondary s t ruc tures . T h e n o n - b r a n c h sequences are i n t r o n i c sequence w i n d o w s of l eng th 7 nt tha t do not over lap w i t h real b r a n c h p o i n t sites 160 6.9 D i s t r i b u t i o n s of free bases for donor (a) a n d acceptor (b) sites. T h e l eng th of rea l a n d pseudo donor sites is 6 nt, w h i l e the l eng th of rea l a n d pseudo acceptor sites is 10 nt 162 List of Figures x x i 6.10 P i c t o g r a m s of donor (a), b r a n c h p o i n t (b) a n d acceptor (c) sites i n S. cerevisiae. T h e p i c tog rams were generated u s ing the W e b t o o l P i c t o g r a m avai lable at h t t p : / / g e n e s . m i t . e d u / p i c t o g r a m . h t m l (accessed i n M a y 2006). T h e height of the letters i n the p i c t o g r a m is p r o p o r t i o n a l to the i r re la t ive fre-quencies at each p o s i t i o n i n the i n p u t sequences, cons ider -i n g the b a c k g r o u n d d i s t r i b u t i o n of bases ( A = 3 1 % , C = 1 9 % , G = 1 9 % , a n d T = 3 1 % for a l l yeast O R F s w i t h 1000-nt f l ank ing regions) . T h e i n p u t d a t a used is generated f rom the S T R I N dataset 164 6.11 S t r u c t u r e of U l s n R N A (human) a n d i ts basepa i r ing in ter-ac t ion w i t h the first 6 i n t r o n i c nucleot ides (yeast) . 
T h e pre-m R N A sequences are shaded, w i t h the 5' exon s h o w n as a shaded b o x 166 6.12 U l a n d U 2 s n R N A in terac t ions w i t h a p r e - m R N A molecu le . T h e 5' spl ice sequence is t y p i c a l for h igher eukaryotes where the b i n d i n g between U l s n R N A a n d the 5' spl ice site is more extensive t h a n i n yeast. T h e b r a n c h p o i n t sequence s h o w n is t y p i c a l for S. cerevisiae ( h t t p : / / w w w . l i b r a r y . c s i . c u n y . e d u , May 21)06). . . '. . 167 6.13 Seconda ry s t ruc tu re in terac t ions w i t h i n the t r i - s n R N P c o m -p lex U 4 / U 6 . U 5 a n d w i t h a p r e - m R N A i n S. cerevisiae. T h e p r e - m R N A sequences are shaded ( G U A U G U sequence at the donor site, U A C U A A C sequence at the b r a n c h p o i n t , Y A G sequence at the acceptor si te) , w i t h exons s h o w n as shaded boxes. T h e a r row depic ts the 'nuc leophi le a t t ack ' by b r a n c h -po in t A , a chemica l r eac t ion tha t in i t ia tes the the cleavage at the 5' e x o n / i n t r o n j u n c t i o n 168 6.14 S t r u c t u r e of the h u m a n U 5 s n R N A . T w o conserved s t e m / l o o p s are l abe led I a n d II . 11-nt loop I is the one tha t in teracts w i t h the 5' a n d 3' spl ice sites 169 List of Figures x x i i 6.15 T h e average M F E of fo ld ing between U l s n R N A a n d s l i d i n g sequence w i n d o w f rom the +/— 100 nt reg ion a r o u n d the donor site. Fo r each s l i d i n g w i n d o w p o s i t i o n ( w i n d o w size = 11 nt , . s tep size — 1 nt) the M F E of fo ld ing is averaged over a l l S T R I N sequences 171 6.16 D i s t r i b u t i o n h i s tograms of fo ld ing M F E between (a) U l sn-R N A a n d d o n o r / n o n - d o n o r sequence w i n d o w s ( w i n d o w size = 11 nt) a n d (b) c u m u l a t i v e d i s t r i b u t i o n s of the same da t a . . 
6.17 The average MFE of folding between U1 snRNA and a sliding sequence window from a random sequence. For each sliding window position (window size = 11 nt, step size = 1 nt) the MFE value is averaged over all random sequences
6.18 (a) Distribution histograms of folding MFE between U1 snRNA and donor/non-donor sequence windows, where pseudo donor sites are required to have the consensus GU dinucleotide and (b) cumulative distributions of the same data
6.19 The Receiver Operating Characteristic (ROC) curves for the thermodynamic donor-identification approach. Different plots correspond to different selections of real and pseudo donor sites: all real donor sites and all pseudo sites containing a GU dinucleotide (STRIN); real and pseudo sites selected using the weak constraint (STRINwc); real and pseudo sites selected using the strong constraint (STRINsc); results for human data from Garland and Aalberts (2004). The dashed straight line at a 45-degree angle is known as the 'no-discrimination' line and it indicates no predictive ability
6.20 (a) Distribution histograms of folding MFE between U6 snRNA and donor/non-donor sequence windows where pseudo donor sites are required to contain the consensus GU and (b) cumulative distributions of the same data
6.21 (a) Distribution histograms of folding MFE between U2 snRNA and branch/non-branch sequence windows where pseudo branchpoint sites were found using a positional weight matrix and (b) cumulative distributions of the same data
6.22 ROC curves for the thermodynamic donor- and branchpoint-identification approach
6.23 Average alignment scores between the motif covariance model and the instances of the motifs in real extended donor sequences from the STRIN dataset and 10 datasets of shuffled donor sequences. Data points correspond to the average score values for each dataset. The motif numbers as they have been identified by CMfinder are shown on the x-axis. They correspond to the motif names in Table 6.3
6.24 Average alignment scores between the motif covariance model and the instances of the motifs in real extended acceptor sequences from the STRIN dataset and 10 datasets of shuffled acceptor sequences
6.25 Average alignment scores between the motif covariance model and the instances of the motifs in real extended branchpoint sequences from the STRIN dataset and 10 datasets of shuffled branchpoint sequences
7.1 An example of a feed-forward three-layer neural network architecture
7.2 Non-linear mapping of the input space into the feature space: training data that was not linearly separable in the input space becomes so in the feature space
B.1 Flow chart of the experimental procedure

Acknowledgements

Being co-supervised by three supervisors can be both an enriching and challenging experience. In my case, it was much more the former, since my mentors complemented each other so well. First and foremost, I want to acknowledge my primary thesis supervisor, Alan Mackworth, who provided me with unconditional research, personal and financial support for many years during my Masters and Doctoral studies. His constant encouragements and invaluable advice were essential for my progress. I am also extremely grateful to Holger Hoos, whose research guidance was indispensable throughout my Doctoral studies. I especially appreciate his extensive engagement in the final phases of my thesis writing when his thorough comments helped me to significantly improve my dissertation.
And last, but not least, I extend sincere appreciation to Francis Ouellette, my third thesis co-supervisor, who taught me a lot about Biology and Bioinformatics and who played a vital part in creating a great Bioinformatics community in Vancouver, which I am proud to be a member of.

I would also like to thank Anne Condon, an additional member of my thesis committee. Anne was always available to meet with me and offer me valuable professional and personal advice.

I am thankful to Ben Montpetit, my research collaborator, for his great laboratory work and his patience when answering my numerous questions. I am also grateful to Ben's supervisor Phil Hieter, who made our collaboration possible.

My PhD experience was made more enjoyable by my friends and colleagues from the Beta lab. I would like to thank all of them for their friendship, help with research problems and our fun discussions and lab outings.

I owe so much to my parents, for their unconditional love and support.

Finally, but most importantly, my deepest gratitude goes to my family, my husband Novak for his endless patience, tolerance and encouragements and for his help in producing many figures in this document and my beautiful and clever children Pavle, Tara and Luka, who inspire me to always go forward.
To Novak, Pavle, Tara, Luka and Sofia

Chapter 1 Introduction

The central dogma of molecular biology states that the flow of genetic information is from DNA to RNA to protein: genes, which are parts of DNA that store genetic information, are transcribed, i.e., copied, to messenger RNA (mRNA), which carries this information outside the cell nucleus into the cytoplasm where it gets translated into proteins (Figure 1.1). In eukaryotic organisms, protein-coding genes are often interrupted by intervening sequences, called introns, that must be removed from mRNA before it gets translated in order to produce functional proteins.

Splicing of precursor mRNA (pre-mRNA), which involves the excision of introns from a primary RNA transcript and ligation of exons into mature mRNA, is one of the essential cellular processes in eukaryotic organisms. Although this process has been extensively studied since the discovery of splicing almost three decades ago (Chow et al., 1977; Berget et al., 1977), resulting in a thorough understanding of the splicing pathway and identification of the numerous components of the splicing machinery, there are still unanswered questions. One of them is: how are the splice sites accurately identified and correctly paired across the intron? It is currently believed that identification is accomplished, at least partially, through basepairing interaction between the conserved sequences at the exon/intron boundaries and within introns and the small spliceosomal RNAs.
However, these conserved sequences are short and not well defined, and are often hard to distinguish from the numerous unutilized sequences throughout the genome.

The relative conservation of the splicing signals is used for the computational identification of exon/intron boundaries, which is an essential part of gene-finding in eukaryotic genomes. Failure to distinguish between real splice sites and unused, 'pseudo' sites is one of the major reasons for the low accuracy of computational gene prediction in longer genomic DNA sequences (Pertea et al., 2001; Rogic et al., 2001).

Figure 1.1: The molecular processes involved in the pathway leading from DNA to protein in eukaryotic cells.

The fact that the known primary sequence determinants are not able to unambiguously specify splice sites has prompted scientists to speculate about the role of pre-mRNA secondary structure in splicing. There is a large body of biological literature, a sample of which is given in Section 2.3, that describes the different ways in which pre-mRNA secondary structure can affect splicing. In many cases, the induced structural changes have an inhibitory effect on splicing, suggesting that certain structural arrangements or structural elements are important for splicing. The secondary structure of pre-mRNA has also been suggested to have a regulatory effect on alternative splicing.
However, most of these experimental studies have focused on the analysis of a single gene or a small set of genes, and the universal role of pre-mRNA secondary structure has not been determined.

Elucidating the complex details of the gene-splicing process is of significant importance for biology and medicine: it has been estimated that ~15% of human genetic diseases are caused by errors in splicing (Krawczak et al., 1992). This number is likely to be larger since the study focused only on point mutations in the vicinity of splice sites, ignoring mutations in other cis and trans splicing factors (Faustino and Cooper, 2003). Consequently, improved understanding of the splicing process and splice site recognition would lead to better computational models and higher prediction accuracy of gene-finding programs. This, again, is an important goal of genomics, where one of the major tasks is accurate annotation of large volumes of genomic sequences generated by numerous genome sequencing projects, which is initially accomplished by computational methods. This highlights the importance of the problem we study in this thesis, namely, the correlation between pre-mRNA secondary structure and splicing.

Even though the splicing mechanism is, in general, universal for all eukaryotes, there are some minor differences between the organisms that need to be considered.
Therefore, it is beneficial to focus on a single organism when establishing and testing our initial hypothesis about the above-mentioned relationship, which can later be extended and/or modified to apply to other eukaryotic species.

For the research work in this thesis, we chose Saccharomyces cerevisiae, the simplest intron-containing eukaryote, which is also a well-established model organism. We selected S. cerevisiae based on its thorough, experimentally supported annotation, large number of splicing studies and limited intron sizes. The S. cerevisiae dataset that we use for our study is carefully assembled using three different intron databases and relies on comparative genomic studies to confirm annotated splice sites.

We begin our study exploring the hypothesis suggested by Parker and Patterson (1987) that yeast introns with large distances between the 5' splice site and the branchpoint sequence can fold into secondary structures that would shorten this distance to one that is optimal for spliceosome assembly. This hypothesis was confirmed for a limited number of yeast introns by comprehensive biological experiments that demonstrated that the existence of such secondary structure elements is essential for splicing efficiency. Structural elements that exhibit a similar effect on splicing were also found in some introns of Drosophila melanogaster and its related species.
Furthermore, shortening of long distances between the essential splicing signals was observed for higher eukaryotes, more specifically mammals, where folding of long intron sequences is facilitated by protein binding and interactions (see Section 2.3).

These studies indicate that pre-mRNA secondary structure within introns might be essential for efficient splicing of long introns in all eukaryotic species, but it is hard to claim universality of the phenomenon based on a very small sample size. Motivated by this limitation, we perform a more extensive computational study of all long introns in S. cerevisiae, with the rationale that combining computational evidence for a large dataset of introns with convincing biological evidence for a few introns will strengthen the hypothesis. We commence our study by searching long yeast introns for secondary structure elements that could bring the 5' splice site and the branchpoint sequence into closer proximity and then analyze the resulting shortened distances with respect to distances between splicing signals in short yeast introns, which are assumed to be optimal for splicing.

After our initial analysis, we further refine our method to take into account additional biological evidence, and relax our initial criteria for the targeted structural elements within the long yeast introns.
The computational analysis in this part is based on previously mentioned biological studies that examined the relationship between secondary structure formation within yeast introns and experimentally determined splicing efficiency levels. We also reconsider the computation of the secondary structure of introns and shortened distances between the splicing signals to make them less error sensitive. Encouraged by promising results of the computational study, we validate our hypothesis using biological laboratory experiments.

In the second part of this thesis, we consider a different role of pre-mRNA secondary structure in gene splicing: can secondary structure elements serve as additional identifiers of exon/intron boundaries and introns? We employ different approaches to identify thermodynamically stable or conserved structural motifs or specific structural contexts in the vicinity of splice signals or within introns. We also apply a thermodynamic approach to detect splicing signals based on knowledge of their interactions with small nuclear RNAs. Finally, we use a machine learning approach to investigate the potential of discovered structural characteristics of yeast introns to improve the accuracy of computational splice site and intron recognition in S. cerevisiae.
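The idea of a structure-shortened distance between the 5' splice site and the branchpoint sequence can be illustrated with a minimal sketch. It assumes, purely for illustration, that a folded intron is given in dot-bracket notation and that the distance is the shortest path in a graph whose edges are backbone bonds and base pairs, each counted as one step. The function names, the toy structure and this distance definition are assumptions of the sketch, not the procedure developed in the later chapters.

```python
from collections import deque


def pair_table(dot_bracket):
    """Map every paired position to its partner in a dot-bracket string."""
    stack, pairs = [], {}
    for i, symbol in enumerate(dot_bracket):
        if symbol == "(":
            stack.append(i)
        elif symbol == ")":
            if not stack:
                raise ValueError("unbalanced structure")
            j = stack.pop()
            pairs[i], pairs[j] = j, i
    if stack:
        raise ValueError("unbalanced structure")
    return pairs


def structural_distance(dot_bracket, start, end):
    """Breadth-first shortest path between two positions.

    Assumption of this sketch: each backbone bond and each base pair
    counts as exactly one step.
    """
    pairs = pair_table(dot_bracket)
    n = len(dot_bracket)
    dist = {start: 0}
    queue = deque([start])
    while queue:
        i = queue.popleft()
        if i == end:
            return dist[i]
        neighbours = [i - 1, i + 1]          # backbone neighbours
        if i in pairs:
            neighbours.append(pairs[i])      # base-pairing partner
        for j in neighbours:
            if 0 <= j < n and j not in dist:
                dist[j] = dist[i] + 1
                queue.append(j)
    raise ValueError("end position not reachable")


# Hypothetical toy structure: a six-base-pair stem with a four-base loop.
toy_structure = ".." + "(" * 6 + "." * 4 + ")" * 6 + "." * 4
unfolded = "." * len(toy_structure)

print(structural_distance(unfolded, 0, 21))       # 21
print(structural_distance(toy_structure, 0, 21))  # 7
```

In the toy structure, the stem reduces the distance between the two ends from 21 steps to 7, which is the kind of shortening the hypothesis refers to.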
In summary, we perform a comprehensive study of secondary structure characteristics of yeast introns and their relationship to pre-mRNA splicing, using a combination of computational, statistical, phylogenetic and experimental approaches. We consider various aspects and functions of RNA secondary structure. The obtained results support our initial hypothesis that pre-mRNA secondary structure is capable of modifying distances between important splicing signals in long introns, bringing them into closer proximity, which is thought to be the optimal configuration for splicing. The identified structural elements are also conserved among Saccharomyces sensu stricto species, indicating their functional significance. Our model that describes RNA structural requirements for splicing is able to explain the confusing results of several previously reported experimental studies. We further validate it by carefully designed experimental testing.

Our computational and statistical exploration of structural characteristics within introns and around the splice sites identifies a number of structural features that are specific to yeast introns and not observable in a random RNA sequence with the same sequence features.
Namely, the 5' splice site sequences are found to have a tendency to be relatively free of secondary structure and to bind more favorably to snRNAs than pseudo sites; branchpoint sequences tend to have more free bases when folded globally, as observed in some previous studies (Hall et al., 1988; Stephan and Kirby, 1993; Mougin et al., 1996; Chen and Stephan, 2003); and certain structural motifs are found to be over-represented in the neighbourhoods of 5' and 3' splice sites. Our machine learning experiments based on these findings give further support to the validity of our splicing model and demonstrate the ability of these methods to reduce the number of false positive splice site and intron predictions.

Organization of the thesis

In Chapter 2, we provide the necessary background for the topics covered in this thesis, give an overview of the biological literature that motivated our research and discuss existing work that incorporates secondary structure information in computational splice site prediction methods. Chapter 3 describes important characteristics of S. cerevisiae introns and splicing, discusses general features of intron architecture and describes the assembly and properties of the STRIN dataset of yeast introns and the dataset designed for phylogenetic analysis. Chapter 4 introduces the notion of a zipper stem, an RNA structural element that shortens the distance between the 5' splice site and the branchpoint sequence, and describes different computational approaches for its identification.
The second part of the chapter investigates the conservation of zipper stems among closely related Saccharomyces sensu stricto species. In Chapter 5, the initial model that describes RNA structural requirements for splicing is further refined to take into account additional biological evidence and to allow relaxing of the initial criteria for the zipper stems within long yeast introns. The chapter also discusses the validation of our model using laboratory experiments done in collaboration with Ben Montpetit and Phil Hieter from the Department of Medical Genetics at UBC. Further structural analysis of yeast introns, as well as the structural context at the splice signals, is discussed in Chapter 6. The machine learning experiments described in Chapter 7 test whether the efficiency of pre-mRNA splicing can be predicted based on secondary structure characteristics of introns. In the second part of the chapter, the weak structural signals discussed in Chapter 6 are used to improve the accuracy of computational splice site and intron prediction by filtering out false positive predictions based on classification methods from machine learning. Chapter 8 summarizes and discusses the results obtained and outlines directions for further research.
Supplementary information and data are provided in a number of appendices: the sequences in the STRIN dataset are listed in Appendix A; the procedure for experimental testing of our splicing model is described in Appendix B; printouts from our StructureAnalyze procedure, which calculates structural characteristics important for splicing, are provided in Appendix C; and the sequences of the RP51B intron mutants that we designed for the purposes of experimental verification are specified in Appendix D.

Chapter 2 Background and related work

In this chapter, we first introduce the reader to the general process of pre-mRNA splicing and identification of splice sites by the spliceosome machinery. Next, we describe computational methods for predicting splice sites that are based on various sequence-based approaches. The secondary structure of RNA and its computational prediction, which we heavily base our thesis on, are also discussed. A sample of the large body of literature on the effects of pre-mRNA secondary structure on splicing is presented, providing the motivation for our thesis research. Finally, we describe existing attempts to integrate pre-mRNA structural information with sequence information to predict splice sites.

2.1 Pre-mRNA splicing and splice site recognition

The process of gene splicing, which involves excision of introns from a primary mRNA transcript and ligation of exons into mature mRNA, is one of the essential steps in protein production.
In cells, this process is usually catalyzed by a large ribonucleo-protein complex, called the spliceosome, which is composed of five small nuclear RNAs (U1, U2, U4, U5 and U6)*, assembled into small ribonucleo-protein particles (snRNPs) and numerous non-snRNP splicing factors. The exceptions are group I and II introns, which are capable of self-splicing, tRNA introns, where splicing is catalyzed by protein enzymes, and U12-type introns, which are spliced by a compositionally distinct spliceosome (Abelson et al., 1998; Staley and Guthrie, 1998; Lopez and Seraphin, 2000).

* The U3 snRNA is not involved in splicing, but participates in the processing of pre-ribosomal RNA.

Splicing consists of two consecutive trans-esterification reactions. In the first reaction, the donor splice site at the 5' exon/intron junction is cleaved and the intron 5' end is ligated to the branchpoint, located typically 20 to 40 nt upstream of the 3' splice site. In the second reaction, cleavage of the acceptor (3') splice site releases the intron as a lariat structure and the 5' and 3' exons are joined together (Figure 2.1).

Splice site recognition and spliceosome assembly occur simultaneously: the 5' splice site is initially recognized through complementary base-pairing interactions with the U1 snRNA (Mount et al., 1983; Zhuang and Weiner, 1986).
In higher eukaryotes, this base-pairing stretches over approximately nine nucleotides (nt), encompassing the last two or three exonic nucleotides and the first five or six nucleotides of the intron. Subsequently, the branchpoint sequence base-pairs with the U2 snRNA (Black et al., 1985; Parker et al., 1987; Zhuang and Weiner, 1989; Wu and Manley, 1989). This interaction involves the U2 snRNA sequence GGUG and the branchpoint consensus signal YYRAY, with the unpaired branchpoint adenosine (A) bulged out of the RNA duplex (Query et al., 1994). The U4/U5/U6 tri-snRNP is then added to this pre-mRNA-snRNA complex, with U6 base-pairing with the 5' splice site intronic sequences (Kandels-Lewis and Seraphin, 1993; Sontheimer and Steitz, 1993) and U5 forming non-canonical base-pairing interactions with the 5' and 3' terminal exonic nucleotides (Newman and Norman, 1992). The complex then undergoes a series of structural rearrangements, transforming into a mature spliceosome that is capable of catalyzing splicing reactions (Staley and Guthrie, 1998). This summary is a simplified version of this complex event, since it neglects important functions of the associated splicing factors.

Splicing of introns has to be performed with single-nucleotide precision in order to produce functional proteins. This requires that the actual splice sites be accurately recognized and correctly paired across the intron.
[Figure 2.1 is not reproduced here; its final panel is labelled "3' splice site cleavage and joining of two exons".]
Figure 2.1: Splicing of pre-mRNA.

[Figure 2.2 is not reproduced here; its recoverable labels are: 5' splice site (AG|GURAGU), branchpoint sequence (YURAY), polypyrimidine tract, 3' splice site (AG|), and "11 to > 40 nt".]
Figure 2.2: Consensus sequences involved in intron splicing in humans. R stands for purines (adenine (A) and guanine (G)) and Y stands for pyrimidines (cytosine (C), thymine (T) and uracil (U)). N stands for any nucleotide. The figure was modified from Moore (2000).

The recognition of splice sites is, at least partially, achieved by formation of Watson-Crick base-pairs between some spliceosomal snRNAs and short consensus sequences located at the 5' splice site and the branchpoint (an example for human introns is given in Figure 2.2). Conserved sequences are also found at the 3' splice site and in the form of a polypyrimidine tract (located immediately upstream from the 3' splice site), which mediate splicing through their interactions with splicing factors and non-base-pairing interactions with snRNAs and other intronic sequences (Madhani and Guthrie, 1994). However, these consensus sequences are not uniquely associated with functional splice sites; there are numerous occurrences of these signals throughout the genome not utilized by the splicing machinery.
This is illustrated in a study by Sun and Chasin (2000), where position weight matrices (described in the next section) were trained on 2400 instances of real human donor and acceptor sites to search for splice sites in the 42-kb human hprt gene, which contains 8 introns. This approach identified 8 real donor sites along with over 100 pseudo donor sites that have scores higher than the lowest scoring real donor site. The results were even more discouraging for acceptor sites, since 683 pseudo sites were predicted. It is still not fully understood how the precise specificity required to distinguish correct splice sites from similar 'pseudo-sites' is achieved or how the correct donor/acceptor pairs are brought together.

Computational splice site recognition

Identification of splice sites is an essential component of computational gene-finding in eukaryotic genomes. Relying on biological knowledge and results, researchers in computational biology approach this problem by modeling consensus sequences around splice sites and within introns. Various methods are used to model splicing signals, such as the following: the simple consensus sequence model, which either looks for a specific sequence motif or allows some alternative nucleotides at certain positions in the motif; position weight matrices, which represent the frequency of appearance of the A, C, G, and T nucleotides at each position of the consensus sequence; and weight arrays, which exploit statistical dependences between adjacent nucleotides (Fickett, 1996; Burge, 1997; Salzberg, 1997).
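To make the position weight matrix idea concrete, the following is a minimal sketch, not the method of any of the cited studies. It assumes a uniform background frequency of 0.25 per nucleotide, a pseudocount of 1, and a made-up four-sequence alignment (`toy_sites`); the function names `build_pwm`, `score` and `scan` are hypothetical.

```python
import math


def build_pwm(sites, pseudocount=1.0, alphabet="ACGT"):
    """Build a log-odds position weight matrix from aligned, equal-length sites.

    Assumption: the background frequency of every nucleotide is 0.25.
    """
    length = len(sites[0])
    pwm = []
    for pos in range(length):
        counts = {base: pseudocount for base in alphabet}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        pwm.append({b: math.log2((counts[b] / total) / 0.25) for b in alphabet})
    return pwm


def score(pwm, window):
    """Sum the per-position log-odds scores of a candidate window."""
    return sum(column[base] for column, base in zip(pwm, window))


def scan(pwm, sequence, threshold):
    """Return (position, score) for every window scoring at least threshold."""
    width = len(pwm)
    hits = []
    for i in range(len(sequence) - width + 1):
        s = score(pwm, sequence[i:i + width])
        if s >= threshold:
            hits.append((i, s))
    return hits


# Hypothetical toy alignment of donor-like sites, invented for illustration.
toy_sites = ["GTAAGT", "GTGAGT", "GTAAGA", "GTATGT"]
toy_pwm = build_pwm(toy_sites)
hits = scan(toy_pwm, "CCGTAAGTCC", threshold=3.0)
```

Lowering the threshold in `scan` admits more windows, which is the mechanism behind the pseudo-sites discussed above: short motifs match by chance, and more often in longer sequences.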
Weight matrices and weight arrays are used to score candidate sequence motifs. Neural networks and decision trees are also used for identification of splicing signals (Hebsgaard et al., 1996; Kulp et al., 1996; Burge, 1997).

These sequence sensors are usually not used in isolation, but are integrated with content sensors that use coding statistics to distinguish between coding and non-coding regions. The integrated approaches can either be stand-alone splice site predictors or gene-finders that attempt to identify entire gene structures (splice sites in intron-containing genes and the boundaries of coding regions). These methods yield better accuracy for splice site recognition because they eliminate false positive splice sites that do not have an expected shift in coding potential (Brunak et al., 1991). There are a number of methods used to combine signal detection with coding statistics for stand-alone splice site prediction, including neural networks (Hebsgaard et al., 1996); Bayesian networks (Arita et al., 2002; Churbanov et al., 2006); rule-based expert systems (Vignal et al., 1999); and discriminant analysis (Solovyev et al., 1994).

An example of an integrated approach for predicting splice sites is linear discriminant analysis (Solovyev et al., 1994), which we use to initially predict donor and acceptor splice sites as described in Section 7.2.2. Linear discriminant analysis is a procedure that finds a linear combination of sequence measures that provides maximum discrimination between real and pseudo-sites.
The linear discriminant function is of the following form:

$$z = \sum_{i=1}^{p} a_i x_i \tag{2.1}$$

where $x_1, \ldots, x_p$ are the values of $p$ features of sequence $x$, and $a = (a_1, \ldots, a_p)$ is a vector of coefficients derived from the training set by maximizing the ratio of between-class variation to within-class variation of $z$. The input sequence $x$ is classified as a real site if $z > c$ and as a pseudo site if $z < c$, where $c$ is a threshold constant derived from the training set in the same way as the coefficient vector $a$.

Sequence features used to detect donor sites include: agreement with consensus region, average triplet preference in the potential coding region, coding statistics for the coding and intron regions (octanucleotide preference), and the G-richness of the region. The features used for acceptor sites include: agreement with consensus region, average triplet preference in the branchpoint region, agreement with typical polypyrimidine region, as well as coding statistics for the coding and intron regions.

The splice site prediction accuracy of current gene-finding programs is 70-80% when tested on short genomic sequences containing exactly one gene with a relatively simple exon/intron structure (Rogic et al., 2001). The accuracy level drops significantly for more realistic, longer genomic sequences containing multiple multi-intronic genes.
The reason for this drop in accuracy is that the consensus sequences used to identify splice sites are short and not well defined, and the number of false positive signals that are accepted by signal sensors grows with the length of a DNA input sequence. For the linear discriminant analysis described above, the reported numbers of false positive predictions are on average 1.5 false donor sites per true donor site, and 6 false acceptor sites per true acceptor site (Solovyev et al., 1994). However, our analysis in Section 7.2.2 shows that on the yeast dataset used in this thesis the results are much worse: on average there are more than five false donor sites per true donor site and eight false acceptor sites per true acceptor site.

2.2 RNA secondary structure and prediction

Ribonucleic acid (RNA) is a nucleic acid polymer with a backbone of ribose sugar rings linked by phosphate groups. Each sugar has one of the four bases adenine (A), guanine (G), cytosine (C), and uracil (U) linked to it as a side group. The phosphate groups link the 5' carbon of one ribose to the 3' carbon of the next, which imposes a directionality on the backbone from 5' to 3'. RNA usually occurs as a single strand and in many cases forms intra-molecular base-pairing interactions, which constitute the secondary structure of the molecule. Base pairing is accomplished through hydrogen bonding between complementary bases: C can pair with G, and U can pair with either A or G. These basepairs are called Watson-Crick or canonical basepairs (G-U is called a wobble pair).
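The pairing rules above can be written down directly. The sketch below is illustrative only: it encodes the three allowed pairs (C-G, A-U and the G-U wobble pair) and checks a candidate structure written in dot-bracket notation, a common textual convention that is introduced here as an assumption and is not used in the thesis itself. The names `can_pair` and `is_valid_structure` are hypothetical.

```python
# Allowed pairs as described in the text: C-G, A-U, and the G-U wobble pair.
PAIRS = {("A", "U"), ("U", "A"), ("C", "G"), ("G", "C"), ("G", "U"), ("U", "G")}


def can_pair(a, b):
    """Return True if bases a and b can form a basepair."""
    return (a.upper(), b.upper()) in PAIRS


def is_valid_structure(sequence, dot_bracket):
    """Check a dot-bracket string against a sequence read 5' to 3'.

    '(' and ')' mark the two partners of a basepair and '.' marks an unpaired
    base. Every bracket pair must be balanced and must join bases that can pair.
    """
    if len(sequence) != len(dot_bracket):
        return False
    open_positions = []
    for j, symbol in enumerate(dot_bracket):
        if symbol == "(":
            open_positions.append(j)
        elif symbol == ")":
            if not open_positions:
                return False
            i = open_positions.pop()
            if not can_pair(sequence[i], sequence[j]):
                return False
        elif symbol != ".":
            return False
    return not open_positions
```

For example, `is_valid_structure("GGGAAAUCC", "(((...)))")` accepts a three-basepair stem closed by a G-U wobble pair, whereas replacing the U with an A makes the innermost pair impossible.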
The secondary structure of RNA is composed of a set of elementary structures such as stems, hairpin loops, internal loops, bulges and multi-loops. An example of RNA secondary structure with annotated structural elements is given in Figure 2.3. Pseudoknots are another type of RNA structural element; they occur frequently in nature and in some cases have important functions (Pleij and Bosch, 1989). However, they are often considered to be a part of the RNA tertiary structure.

For many functional RNAs, tertiary structure is a key determinant of their biological function. However, our understanding of tertiary structure formation and interactions is limited, as is the available experimental data. Secondary structure is generally believed to play a crucial role in tertiary structure formation, since most tertiary interactions are thought to arise after the formation of a stable secondary structure, when the molecule is able to bend around the flexible, single-stranded regions (Brion and Westhof, 1997; Tinoco and Bustamante, 1999). The tertiary structure interactions that arise in the later stages of folding are usually too weak to disrupt secondary structure that has already formed.

[Figure 2.3 is not reproduced here; its recoverable labels include: bulge, multi-domain, dangling bases, external base, stem (helix), stacked loop, hairpin loop.]
Figure 2.3: An example of RNA secondary structure where the bullets represent the nucleotides in the RNA sequence and the black lines between them represent basepairing interactions. The basic structural elements are annotated. The figure was taken from Andronescu (2003).

RNA secondary structure prediction

The most common approaches for RNA secondary structure prediction are based on the assumption that, at equilibrium, any RNA molecule folds into its lowest free energy state. Therefore, the aim of structure prediction is to determine a minimum free energy (MFE) structure. Most MFE secondary structure algorithms use dynamic programming to perform a complete evaluation of all feasible structures given an RNA sequence, and determine one with minimum free energy.

The calculation of free energies is based on a nearest neighbour thermodynamic model, which considers the energy contributions of stacking interactions between neighbouring basepairs, assuming these energies are independent and additive. The energy of an RNA molecule is calculated by adding energy contributions of stacking interactions and various types of loops. The energy contributions of stacked basepairs and some types of loops have been experimentally determined, while for the other structural elements they have been estimated (Xia et al., 1998; Mathews et al., 1999). These energy contributions are encoded as the parameters of the energy model.
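The dynamic programming idea can be illustrated with a deliberately simplified stand-in. The sketch below is a Nussinov-style recursion that assigns a fixed energy to each basepair; it is not the nearest neighbour model, which scores stacks and loops, and the energy values in `PAIR_ENERGY`, the minimum hairpin loop size `MIN_LOOP`, and the function names are all assumptions made for illustration. It shares with the real algorithms the principle of filling a table over all subsequences and then tracing back one optimal pseudoknot-free structure, here reported in dot-bracket notation.

```python
# Hypothetical per-pair energies in kcal/mol, chosen only for illustration.
PAIR_ENERGY = {
    ("G", "C"): -3.0, ("C", "G"): -3.0,
    ("A", "U"): -2.0, ("U", "A"): -2.0,
    ("G", "U"): -1.0, ("U", "G"): -1.0,
}
MIN_LOOP = 3  # a hairpin loop must enclose at least this many unpaired bases


def fill(seq):
    """E[i][j] is the minimum energy of any pseudoknot-free structure on seq[i..j]."""
    n = len(seq)
    E = [[0.0] * n for _ in range(n)]
    for span in range(MIN_LOOP + 1, n):
        for i in range(n - span):
            j = i + span
            best = E[i + 1][j]  # case 1: base i stays unpaired
            for k in range(i + MIN_LOOP + 1, j + 1):  # case 2: i pairs with k
                e = PAIR_ENERGY.get((seq[i], seq[k]))
                if e is None:
                    continue
                outer = E[k + 1][j] if k < j else 0.0
                best = min(best, e + E[i + 1][k - 1] + outer)
            E[i][j] = best
    return E


def fold(seq):
    """Return (minimum energy, dot-bracket structure) under the toy model."""
    n = len(seq)
    E = fill(seq)
    structure = ["."] * n
    stack = [(0, n - 1)]
    while stack:
        i, j = stack.pop()
        if j - i <= MIN_LOOP:
            continue  # too short to contain a basepair
        if E[i][j] == E[i + 1][j]:
            stack.append((i + 1, j))
            continue
        for k in range(i + MIN_LOOP + 1, j + 1):
            e = PAIR_ENERGY.get((seq[i], seq[k]))
            if e is None:
                continue
            outer = E[k + 1][j] if k < j else 0.0
            if E[i][j] == e + E[i + 1][k - 1] + outer:
                structure[i], structure[k] = "(", ")"
                stack.append((i + 1, k - 1))
                if k < j:
                    stack.append((k + 1, j))
                break
    return (E[0][n - 1] if n else 0.0), "".join(structure)
```

The three nested loops over `span`, `i` and `k` give this sketch a running time that is cubic in the sequence length.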
The most widely used energy model is Turner's energy model (Freier et al., 1986; Turner et al., 1987; Turner and Sugimoto, 1988; Mathews et al., 1999).

The most commonly used MFE programs for RNA secondary structure prediction are mfold (Zuker and Jacobson, 1998) and RNAfold (Hofacker et al., 1994). Both programs use dynamic programming for the identification of the MFE structures and base their energy calculations on Turner's energy model. The running time for these two algorithms is $O(n^3)$, where $n$ is the length of the sequence.

The MFE folding approach has a number of shortcomings:

• Prediction of MFE secondary structures has limited accuracy, that is, the predicted structures are not guaranteed to accurately reflect the physical ground state of the respective RNAs. This is partially due to the imperfect underlying energy model, which contains a number of approximative and extrapolated parameters. The experimentally derived parameters are based on very short RNA strands (~20 nt) that were used to determine the free energy of different secondary structure elements. Consequently, applying these thermodynamic parameters to longer RNA sequences will sometimes result in inaccurate free energy calculations. In general, the prediction accuracy decreases with increasing RNA sequence length, and computational prediction is considered unreliable for sequences longer than several hundred nucleotides (Morgan and Higgs, 1996; Mathews et al., 1999).
Another approximation is the nearest neighbour thermodynamic model itself, since the independence and additivity of structural elements constituting RNA structure are assumptions that may not be entirely accurate.

• Another simplification that most of the prediction algorithms make is that they can predict only pseudoknot-free secondary structures. This limitation is a consequence of the dynamic programming approach, which cannot handle pseudoknots in their most general form. There are ways to include some classes of pseudoknots in predictions, but this always results in substantially increased computational time. In fact, if all pseudoknots are included, the secondary structure prediction problem becomes NP-hard (Lyngsø and Pedersen, 2000). Some examples of RNA secondary structure prediction programs that include pseudoknot predictions are pknots by Rivas and Eddy (1999) ($O(n^6)$ time and $O(n^4)$ space) and pknotsRG by Reeder and Giegerich (2004) ($O(n^4)$ time and $O(n^2)$ space), both considering only a restricted class of pseudoknots, as well as HotKnots by Ren et al. (2005) and a genetic algorithm by Gultyaev et al. (1995), the latter two of which are based on heuristic approaches.

• RNA molecules do not fold in isolation, and contact with proteins or other RNA molecules may play an important role in structure formation.
Considering our current biological knowledge and ability to computationally model the folding and structure of RNA sequences, we will not be able to take into account the cellular environment and RNA-protein molecular interactions in the near future. However, recently some attempts have been made to model the secondary structure interactions between two or more RNA molecules (Rehmsmeier et al., 2004; Andronescu et al., 2005; Mückstein et al., 2006).

One way to deal with inaccuracies of RNA secondary structure prediction is to also compute near-optimal structures. The prediction of suboptimal structures was first proposed by Wuchty et al. (1999) and later implemented in the mfold and RNAfold programs. The computation of suboptimal structures in addition to the optimal structure is important not only because the native structure can be buried in the near-optimal space due to the inaccuracy of the underlying energy model, but also because there is evidence that some RNAs can oscillate between different structures or exist in a population of structures (Christoffersen and McSwiggen, 1994; Betts and Spremulli, 1994; Freyhult et al., 2005). It has been shown that, on average, the accuracy of secondary structure prediction algorithms increases by more than 20% when 750 suboptimal structures are generated, as opposed to generating the MFE structure only (Mathews et al., 1999).
Another important advance in MFE structure prediction was made by McCaskill (1990), who proposed a dynamic programming algorithm for calculating the partition function. The partition function is a quantity from statistical mechanics that encodes the statistical properties of a system in thermodynamic equilibrium. The partition function for the ensemble of all possible secondary structures for a given RNA sequence can be calculated using the following formula:

$$Q = \sum_{S \in \mathcal{S}} e^{-\Delta G(S)/RT} \tag{2.2}$$

where $\mathcal{S}$ is the set of all structures for the given RNA sequence, $\Delta G(S)$ is the free energy of the structure $S$, $R$ is the physical gas constant with the value $R = 1.987\ \mathrm{cal \cdot K^{-1} \cdot mol^{-1}}$, and $T$ is the absolute temperature.

The partition function algorithm allows calculation of basepairing probabilities within the thermodynamic ensemble of structures. It has been shown that these basepairing probabilities provide measures of confidence for MFE structure prediction (Mathews, 2004) and that they are less affected by uncertainties in energy parameters than are MFE structures (Layton and Bundschuh, 2005).

There are other approaches to RNA secondary structure prediction that are not based on MFE computation. One example is Sfold (Ding and Lawrence, 2003; Ding et al., 2005, 2006), which samples structures according to their probabilities derived from Boltzmann statistics.
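The Boltzmann weighting behind both the partition function and this kind of sampling can be illustrated numerically. The sketch below is not McCaskill's algorithm, which sums over all structures implicitly by dynamic programming; it simply evaluates the sum of exp(-ΔG/RT) over a small, explicitly listed set of hypothetical structures. It assumes free energies given in kcal/mol (hence the gas constant is converted from cal to kcal) and a default temperature of 310.15 K; the name `boltzmann_probabilities` is hypothetical.

```python
import math

# Gas constant converted to kcal K^-1 mol^-1 (the text gives 1.987 cal K^-1 mol^-1).
R_KCAL = 1.987e-3


def boltzmann_probabilities(free_energies, temperature=310.15):
    """Return (Q, probabilities) for structures with the given free energies.

    free_energies: free energies in kcal/mol, one per structure (assumed units).
    temperature: absolute temperature in kelvin (310.15 K is 37 degrees Celsius).
    """
    rt = R_KCAL * temperature
    weights = [math.exp(-dg / rt) for dg in free_energies]
    q = sum(weights)
    return q, [w / q for w in weights]


# Hypothetical ensemble of three structures, invented for illustration.
q, probabilities = boltzmann_probabilities([-7.0, -6.0, -4.0])
```

Because the weights are exponential in the free energy, a structure that is only 1 kcal/mol less stable than the MFE structure is already several times less probable at this temperature, yet far from negligible.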
The structures sampled by Sfold are further clustered according to structural similarity, and a small number of centroids is returned that can be taken as a representative ensemble of potentially relevant structures.

The preceding discussion of computational methods for RNA secondary structure prediction assumed that only a single RNA sequence is available. When several related RNA sequences from different organisms are available, it is possible to predict conserved secondary structure using comparative structure analysis. This approach is discussed further in Section 4.4.

2.3 Secondary structure of pre-mRNAs and its effects on gene splicing

The fact that splice sites are not specified unambiguously by primary sequence prompted scientists to speculate about the effects of higher-order RNA structure on gene splicing. One of the early experiments, conducted by Solnick (1985), showed that when an exon is sequestered (i.e., isolated from the rest of the pre-mRNA) within a loop of potential RNA structure, it is omitted from the mature mRNA. Similar experiments indicated that sequestering of a 5' splice site can have a negative effect on splicing (Eperon et al., 1988). Goguel et al. (1993) went slightly further in their experiments, determining the size of the stem containing the 5' splice site that is needed for splice site inhibition to be observed (> 9 nt). They observed a similar splicing effect if the branchpoint sequence was enclosed in a 15-nt long stem.
Experiments in vivo showed a weaker effect on splicing. A study in a plant, Nicotiana plumbaginifolia, revealed that 18-24 nt RNA hairpin loops strongly inhibit splicing when they sequester the 5' splice site or are placed within a short intron (Liu et al., 1995). 3' splice sites that are sequestered within such structures are still utilized, and for some introns the efficiency of splicing is improved if hairpins are present. A somewhat contradictory observation emerged from a study by Lin and Rossi (1996), where artificially introduced stem/loop structures inserted 5 nt downstream of the 3' splice site in the ACT fusion gene abolished splicing in vivo. A potential explanation for this effect is that single-stranded sequences in this region are required for effective interaction with U5 snRNA (Newman and Norman, 1992).

It was also shown that pre-mRNA secondary structure can have a positive effect on splicing by shortening the effective distance between the 5' splice site and the branchpoint to the optimal distance. Experiments on several Saccharomyces cerevisiae genes identified complementary sequences in the vicinity of 5' splice sites and branchpoints whose base-pairing interactions are essential for splicing efficiency (Newman, 1987; Goguel and Rosbash, 1993; Libri et al., 1995; Charpentier and Rosbash, 1996; Howe and Ares, 1997). Existence of a similar intronic intra-molecular structure in the yeast U3A snoRNA precursor was confirmed by comprehensive structural probing of the molecule (Mougin et al., 1996).
This stem/loop structure is also conserved in the second U3 snoRNA gene in S. cerevisiae, even though the intronic sequences of the two genes are significantly different (Brule et al., 1995). Similar secondary structure interactions were also identified in Drosophila melanogaster's Adh intron 1: mutational analysis showed that either the disruption of the identified stem or its stabilization resulted in reduced splicing efficiency (Chen and Stephan, 2003). This structure was found to be conserved in the Drosophila subgenus, and it has been proposed that its role is to force the branchpoint sequence downstream into an unpaired conformation.

Functional stem regions were also found between the branchpoint sequence and the 3' splice site: these structures have been shown to have an important shortening effect on unusually long distances between these two splicing elements (Gattoni et al., 1988; Chebli et al., 1989) and also to enable utilization of a distant 3' splice site by sequestering the closer, alternative one (Deshler et al., 1989).

A recent study by Martinez-Contreras et al. (2006) indicates that intronic secondary structure interactions may be important for efficient splicing of long mammalian introns. The authors observed that the insertion of two hnRNP A1 protein binding sites at the ends of an artificially enlarged mammalian intron increased splicing four-fold. It was suggested that two bound hnRNP A1 proteins interact, thus causing the intron to 'loop out'.
Replacing these binding sites with 20-nt inverted repeats had a similar effect on splicing, indicating that looping-out of introns is important for efficient splicing. The observed phenomenon is in agreement with the role of secondary structure interactions observed in yeast, and the authors suggest that yeast secondary structure interactions are substituted by hnRNP protein interactions in mammals. This is supported by over-representation of hnRNP-binding-site-like motifs at the ends of mammalian introns.

Changes in pre-mRNA secondary structure can also be the cause of human genetic diseases: for example, exclusion of exon 7 in the human SMN1 and SMN2 genes, caused by disruption of a 24-nt stem/loop structure in intron 7, is the cause of spinal muscular atrophy (Miyaso et al., 2003). The secondary structure serves as a splicing enhancer that binds some uncharacterized protein factors. There is also a competing theory, which argues that primary structure changes, more specifically loss of a sequence-based splicing enhancer or gain of a splicing silencer, are responsible for exclusion of exon 7 in SMN2 (Cartegni and Krainer, 2002; Kashima and Manley, 2003).

The secondary structure of pre-mRNA has been shown to play a role in autoregulation of expression of some proteins: if production of a protein is in excess, the protein binds to its own pre-mRNA or mRNA and prevents splicing or translation. One example is the S. cerevisiae protein RPL30 (formerly known as L32).
This protein regulates the splicing of its own gene by binding to a stem/loop structure that is formed between complementary sequences at the 5' end of the pre-mRNA transcript and at the 5' splice site of its only intron (Eng and Warner, 1991). It was proposed that RPL30 binding stabilizes the stem and prevents binding of U1 snRNA to the 5' splice site.

An example of a different splicing autoregulatory function was observed in Yra1p, a small yeast RNA binding protein (Preker and Guthrie, 2006). If in excess, Yra1p is toxic to the organism; steady-state levels of the protein are essential for viability. These levels are maintained through positive and negative splicing regulation. The YRA1 gene has one large intron (776 nt) and a non-canonical branchpoint sequence (GACUAAC), both of which are negative regulators of its splicing (shortening the intron or mutating the branchpoint sequence to the canonical one improves the splicing but has a negative effect on cell growth). The YRA1 intron contains a stem/loop structure that brings the 5' and 3' splice sites closer together and acts as a positive regulator of intron splicing. This stem is evolutionarily conserved in all budding yeast species. The disruption of the stem leads to reduced splicing levels and improved cell growth.

Pre-mRNA secondary structure is also found to have a regulatory effect on alternative splicing.
For several cases of alternative splicing events it has been shown that secondary structure elements suppress expression of some exons in certain tissues, while they are normally expressed in others (Libri et al., 1991; Clouet d'Orval et al., 1991; Blanchette and Chabot, 1997; Coleman and Roesser, 1998; Hutton et al., 1998). An interesting example is the inclusion stem (iStem), a long-range RNA structure element found in Drosophila melanogaster's Dscam gene (Kreahling and Graveley, 2005). The Dscam gene theoretically encodes 38016 different proteins due to alternative splicing of 95 of its 115 exons. The alternatively spliced exons are organized in four clusters, and the iStem, which is found in the intron preceding cluster 4, is required for efficient inclusion of all the exons in this cluster. The iStem is a large stem/loop structure, with a 27-nt long stem and 275 nt in the loop, which is also conserved in other species of the Drosophila subgenus. The function of the iStem is not precisely known, but it is suspected that it serves as a binding site for some protein factors.

2.4 Using secondary structure information for splice site prediction

The numerous examples of RNA secondary structure affecting splicing prompted scientists to consider secondary structure elements as additional identifiers of spliceosomal signals. This hypothesis was tested for acceptor splice site prediction by Patterson et al. (2002) using a machine learning approach.
They used Martin Reese's benchmark dataset for the evaluation of gene-finding algorithms (Reese et al., 1999) to extract 100-nucleotide-long subsequences centred around the acceptor splice sites (positive examples) and 100-nucleotide-long subsequences centred around AG dinucleotides not annotated as splice sites (negative examples). The resulting dataset contained 3960 subsequences, with the same number of positive and negative examples, and was used for training and testing in 10-fold cross-validation experiments. For each subsequence in the dataset, a comprehensive set of foldings was obtained using the mfold RNA secondary structure prediction algorithm (Zuker and Jacobson, 1998). The foldings, annotated by their free energy, were used to calculate three structural metrics: optimal folding energy; max helix, which is the probability of helix formation in a close neighbourhood; and the neighbour pairing correlation model (NPCM), which was used to form an aggregate model of the structure by training two Markov models for positive and negative examples and using them to score sequences in the test dataset. The metrics were aggregated in feature vectors and used to train support vector machines (SVMs) and decision trees. SVM and decision tree classification was used to enhance the accuracy of traditional sequence-based splice site prediction methods. They achieved a 5-10% reduction in error rate, compared with strictly sequence-based approaches.
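As a rough illustration of how such a dataset can be assembled, the sketch below collects fixed-width windows around every AG dinucleotide in a sequence and labels each window by whether the AG is annotated as an acceptor site. The function name, the 0-based coordinate convention, and the choice to let the AG straddle the window centre are assumptions made for this example; they are not taken from Patterson et al. (2002).

```python
def extract_ag_windows(seq, annotated_ag_positions, width=100):
    """Collect labelled windows centred on AG dinucleotides.

    seq: nucleotide string.
    annotated_ag_positions: set of 0-based indices of the 'A' in each AG
        annotated as an acceptor splice site (hypothetical input format).
    Returns a list of (window, label) pairs; label is 1 for annotated
    sites and 0 for every other AG dinucleotide.
    """
    half = width // 2
    examples = []
    for i in range(len(seq) - 1):
        if seq[i:i + 2] != "AG":
            continue
        start = i - half + 1      # places 'A' at window index half - 1
        end = start + width
        if start < 0 or end > len(seq):
            continue              # no full-width window fits around this AG
        label = 1 if i in annotated_ag_positions else 0
        examples.append((seq[start:end], label))
    return examples


# Toy sequence with two AG dinucleotides; only the first is "annotated".
toy_seq = "C" * 60 + "AG" + "C" * 60 + "AG" + "C" * 60
toy_examples = extract_ag_windows(toy_seq, {60})
```

The text states that the dataset was balanced but not how the balance was obtained, so no subsampling of negative examples is shown here.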
Another, more recent approach used nucleotide base-pairing information to predict splice sites in Saccharomyces cerevisiae introns (Marashi et al., 2006b) using a neural network approach. The authors selected 154 intron-containing yeast genes and predicted their MFE structures using the mfold algorithm. They considered 20-nt windows around donor and acceptor splice sites as positive examples and 20-nt windows around non-splice-site GU and AG nucleotides as negative examples. Instead of using the traditional 4-letter RNA alphabet ({A, C, G, U}) they used an extended 8-letter alphabet, which considers whether a nucleotide is basepaired or not ({A_S, C_S, G_S, U_S, A_L, C_L, G_L, U_L}; L = loop, S = stem). These 20-nt feature vectors were used for training and testing of separate three-layer perceptron neural networks for donor and acceptor sites. Half of the splice site instances were used for training and the other half for testing and cross-validation. Using this simple structure information improved the accuracy of splice site prediction: the correlation coefficient for donor site prediction was 0.98 when structural information was used and 0.89 when only sequence information was used. For acceptor sites, the respective correlation coefficient values were 0.70 and 0.57.

The results from these studies show that using secondary structure information in addition to sequence-based measures leads to improved accuracy of splice site prediction.
It also provides indirect evidence for the role of pre-mRNA secondary structure in gene splicing.

Chapter 3

Intron structure and splicing in Saccharomyces cerevisiae

Good biological datasets are essential for most types of bioinformatics analysis. If sequence datasets are used, it is very important that the number of sequencing errors be minimal and that the sequence annotation, including the locations of gene structure elements, upstream and downstream gene regions and any additional, functionally important sequence motifs, be accurate. Any errors in data can result in faulty results and conclusions. The availability of data is also a concern. Even though the influx of biological data is enormous, sometimes it is hard to find a complete and good quality dataset for the type of analysis to be performed. Due to these concerns, we decided to conduct our research on Saccharomyces cerevisiae, the simplest intron-containing organism.

Saccharomyces cerevisiae, also known as brewer's or baker's yeast, was the first eukaryote to have its genome fully sequenced (Goffeau et al., 1996). The genome of this unicellular organism contains 12 million bases, divided among 16 chromosomes. Initial sequence annotation found 5885 potential protein-encoding genes, but this number is constantly being updated (Velculescu et al., 1997; Kowalczuk et al., 1999; Blandin et al., 2000; Wood et al., 2001; Kellis et al., 2003).
The current number of open reading frames (ORFs) is 6604, of which 4412 are 'verified' ORFs, meaning that there exists experimental evidence that a gene product is produced in S. cerevisiae (SGD, July 2006).

Despite the differences in exon/intron structure between yeast and higher eukaryotes, and considering the universality of splicing among all intron-containing organisms, S. cerevisiae is often used as a model organism for splicing studies. This is due to its intrinsic advantages as an experimental system, given the possibility of a controlled growth environment and the simplicity of genetic manipulation (Goffeau et al., 1996). We chose to base our research on yeast for its thorough, experimentally supported annotation, large number of splicing studies and limited intron sizes.

3.1 S. cerevisiae introns and splicing

Unlike most eukaryotic genomes, the yeast genome has few introns. In August 2003, when we collected our data, the number of genes that were annotated to contain spliceosomal introns in the Ares lab Yeast Intron Database (http://www.cse.ucsc.edu/research/compbio/yeast_introns.html) (Grate and Ares, 2002) was 239. Although few in number, intron-containing genes produce a disproportionately large share of yeast mRNA: more than 10000 of the nearly 38000 mRNA molecules made each hour are derived from genes that have introns (Ares et al., 1999). S. cerevisiae introns are usually limited to at most one per gene: 229 intron-containing genes in the Ares Database have only one intron, 8 contain 2 introns and 2 have several alternatively spliced intron variants. Yeast introns are shorter on average than introns in higher eukaryotes, are primarily located near the 5' end of the gene, and have highly conserved splice sites and branchpoint sequence (Spingola et al., 1999).

3.1.1 Conservation of splicing signals

The consensus sequence for the donor site is GUAUGU, which is by far the most commonly used 5' splice site. There are a few other variants that are used to a much lesser extent, but the first and fifth positions (both G in GUAUGU) are invariant among all annotated introns (Spingola et al., 1999). The branchpoint sequence, UACUAAC, is highly conserved in yeast, and only small deviations from the canonical sequence are tolerated (Langford et al., 1984). This is in contrast with mammalian branchpoint sequences, which are very poorly conserved (Padgett et al., 1986). Usually, only the first nucleotide in the consensus is variable, with few exceptions. The last four positions are invariant for all annotated introns (Spingola et al., 1999). The vast majority of 3' splice sites have the consensus sequence YAG (Y = C or U). In yeast, the polypyrimidine tract, which is typically highly conserved in higher eukaryotes (see Figure 2.2), is conserved in ~65% of introns (Kupfer et al., 2004) and restricted mostly to U residues (Umen and Guthrie, 1995). We also discuss S. cerevisiae splice site signals and branchpoint sequence in Section 6.3, focusing on their basepairing interactions with snRNAs.

3.1.2 Pre-mRNA splicing

Splicing of pre-mRNA in Saccharomyces cerevisiae is generally the same as in other eukaryotic organisms. The splicing process starts with U1 snRNP recognition of the 5' splice site and subsequent binding to it, forming what is called a 'commitment complex'. Next, U2 snRNA binds to the branchpoint sequence, followed by the assembly of the U4/U5/U6 tri-snRNP with the pre-spliceosome. The U1 snRNA gets displaced from the 5' splice site, and the U6 snRNA establishes basepairing interactions with the donor site. The basepairing interactions between the U6 and U4 snRNAs are disrupted, and U6 basepairs with the U2 snRNA to form the mature (active) spliceosome. As in other eukaryotes, the splicing event consists of two consecutive trans-esterification reactions: the first one cleaves the pre-mRNA at the 5' exon/intron junction and the intron's 5' end is ligated to the branchpoint; the second one cleaves the pre-mRNA at the 3' splice site, releasing the intron as a lariat structure, and joins the 5' and 3' exons together.

The yeast spliceosome contains more than 75 splicing protein factors, some of which are directly associated with snRNAs in small ribonucleoprotein particles (snRNP proteins) and others that are not parts of snRNPs (non-snRNP proteins).
These proteins have essential roles in the splicing process: examples are the BBP protein that recognizes and binds to the branchpoint sequence prior to U2 snRNA binding, and the MUD2 protein that binds to the polypyrimidine tract and 3' splice site (Brow, 2002).

While the five small nuclear RNAs are well conserved in length, sequence, and especially structure among mammals, they are quite different in yeast (Kretzner et al., 1990). The sequence similarity between mammalian and yeast counterparts is very low (except for U6 snRNA), and three of the yeast snRNAs are significantly longer than in mammals (U2 snRNA is 6 times longer). However, careful structural analysis of these RNAs has revealed that most structural elements found in higher eukaryotes are also present in yeast snRNAs (Kretzner et al., 1990). It was shown that yeast-specific structural domains of snRNA can be deleted with little or no effect on splicing (Igel and Ares, 1988; Rymond and Rosbash, 1992).

3.2 S. cerevisiae intron dataset

In order to obtain a high quality yeast intron dataset we consulted three databases: the Ares lab Yeast Intron Database, the Yeast Intron DataBase, and the Comprehensive Yeast Genome Database. For additional information, we used the Saccharomyces Genome Database (SGD), which is a comprehensive collection of genetic and molecular biological information about S. cerevisiae.
The Ares lab Yeast Intron Database

The Ares lab Yeast Intron Database (AYID) is a searchable database that contains 239 Saccharomyces cerevisiae spliceosomal introns (Grate and Ares, 2002). For each intron, the following information is given: SGD feature, SGD synonyms, SGD locus and description. All these entries are linked to the SGD and MIPS (Munich Information Center for Protein Sequences) databases. FASTA files with intronic sequences with and without 50-nt flanking regions are also provided. The format of the AYID database is illustrated in Table 3.1.

SGD feature | SGD synonyms | SGD locus | Description
SNR17A | U3A snoRNA | [ares] | Not in a protein ORF, but in U3A snoRNA
SNR17B | U3B snoRNA | [ares] | Not in a protein ORF, but in U3B snoRNA
YAL001C | FUN24 tsv115 YAL001C | TFC3 | RNA polymerase transcription initiation factor TFIIIC (tau), 138 kDa subunit
YAL003W | TEF5 YAL003W | EFB1 | Translation elongation factor EF-1beta, GDP/GTP exchange factor for Tef1p/Tef2p
YAL030W | YAL030W | SNC1 | Synaptobrevin (v-SNARE) homolog present on post-Golgi vesicles

Table 3.1: An excerpt from the Ares lab Yeast Intron Database (Grate and Ares, 2002).

Each gene in yeast has a systematic name and a gene name. Systematic names are of the form YCXDDDS, where 'Y' stands for 'Yeast'; 'C' indicates the chromosome number (A=1, B=2, etc.); 'X' can be either 'L' or 'R', indicating the position of the gene relative to the centromere (Left or Right);
'DDD' is a three-digit number assigned to the gene based on its relative position from the centromere (ORFs are numbered from the centromere to the telomere); and 'S' stands for 'W' or 'C', to indicate which strand the ORF is in (Watson, forward, or Crick, reverse). The yeast gene nomenclature requires that the name consist of three letters (the gene symbol) followed by an integer (e.g., ACT1).

The Yeast Intron DataBase

The Yeast Intron DataBase (YIDB) contains information about all introns encoded in the nuclear and mitochondrial genomes of S. cerevisiae (Lopez and Seraphin, 2000). It can be accessed at http://www.embl-heidelberg.de/ExternalInfo/seraphin/yidb.html (last accessed in August 2003). Introns are divided into tables according to the mechanism of excision: pre-mRNA introns, tRNA introns, the HAC1 intron, and group I and II introns. For 255 pre-mRNA introns, the following information is provided in a tabular format: ORF name, EMBL (European Molecular Biology Laboratory) database accession number, transcription frequency, partial sequence of exon 1 (5' exon), sequences of the 5' and 3' splice sites and the branchpoint, as well as intron size. The entries are linked to the MIPS and EMBL databases. The format of the YIDB database is illustrated in Table 3.2.

ORF name | EMBL acc num | Trans freq | Exon 1 | 5' splice site | Branchpoint | 3' splice site | Intron size
YAL001C | L22015 | 1.1 | ggaa | gtatgtt | tttactaacga | taacgacacattgaag | 90
YAL003W | L22015 | 52.3 | aagg | gtatgtt | attactaacaa | tctccttttaaaatag | 366
YAL016W | L05146 | 4.5 | ctgc | gtatgtc | aatactaacgt | ataattgagtggtcag | 883
YAL030W | U12980 | 2.3 | agct | gtaagta | tatactaactt | tcgtgtttatttttag | 113
YAL042W | U12980 | 10.4 | gcgg | gtatgaa | agtactaacgg | aacttttcacttttag | 425

Table 3.2: An excerpt from YIDB (Lopez and Seraphin, 2000).

The Comprehensive Yeast Genome Database

The MIPS Comprehensive Yeast Genome Database (CYGD) is a searchable database that contains a variety of data related to the genome of S. cerevisiae (Mewes et al., 2002). Its section about Hemiascomycetous yeast spliceosomal introns can be accessed at http://mips.gsf.de/proj/yeast/reviews/intron/, with further links to the S. cerevisiae intron table and a FASTA file with intron sequences (last accessed in February 2006). For 271 S. cerevisiae spliceosomal introns in the database, the following information is provided in a tabular format: ORF/gene name, intron name, intron length, partial sequences of exon 1 (5' exon) and exon 2 (3' exon), sequences of the 5' and 3' splice sites, branchpoint sequence, distances from the 5' splice site to the branchpoint and from the branchpoint to the 3' splice site, reference, evidence (experimental or putative) and comments. The entries are linked to the MIPS and PubMed databases.
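Since all three databases are keyed by systematic ORF names, the YCXDDDS scheme described above for AYID can be decoded mechanically. The following is a minimal sketch, not part of any of the databases; the function name and the returned field names are hypothetical, and the sketch deliberately rejects anything that does not match the plain seven-character pattern (for example non-ORF features such as SNR17A, or names carrying a suffix).

```python
import re

# 'Y', chromosome letter (A-P covers 16 chromosomes), arm (L/R),
# three-digit ORF number, strand (W/C).
_SYSTEMATIC_NAME = re.compile(r"^Y([A-P])([LR])(\d{3})([WC])$")


def parse_systematic_name(name):
    """Decode a systematic yeast ORF name such as 'YAL001C'."""
    match = _SYSTEMATIC_NAME.match(name.upper())
    if match is None:
        raise ValueError(f"not a plain systematic ORF name: {name!r}")
    chrom, arm, number, strand = match.groups()
    return {
        "chromosome": ord(chrom) - ord("A") + 1,   # A=1, B=2, ...
        "arm": "left" if arm == "L" else "right",  # relative to centromere
        "number": int(number),                      # counted from centromere
        "strand": "Watson" if strand == "W" else "Crick",
    }
```

For example, parsing YAL001C from Table 3.1 gives chromosome 1, left arm, ORF number 1, Crick strand.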
3.2.1 Dataset construction

We constructed our dataset by including introns that have consistent annotations between at least two of the three databases previously discussed. Since the vast majority of yeast intron-containing genes contain only one intron and only a few contain two introns per gene, we decided to include only the former, leaving the latter for later consideration. This was done solely to make the dataset more uniform. The number of introns found to have a consistent annotation between at least two databases was 227. Eleven of these were excluded because they were not supported by the latest comparative genomic study (Kellis et al., 2003) and were marked as possible misannotations. An additional two introns, belonging to the genes YLR202C and YOR318C, have been excluded from the dataset because they were labeled as 'dubious' in SGD. Consequently, our final yeast intron dataset contains 214 pre-mRNA introns.

The consistency of annotation allows us to combine intron information from two sources in a classic database join operation, e.g., the intron sequence from the AYID database, which is not available in YIDB, and the branchpoint sequence from the YIDB database, which is not available in AYID. All of these introns are part of protein-coding genes, 95 of which code for ribosomal proteins, 84 have other, known cellular functions and 35 code for proteins of unknown function. The dataset contains 159 experimentally verified and 55 putative introns.
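The selection and join steps described in this subsection can be sketched as follows. This is only an illustration under stated assumptions: each database is represented as a hypothetical dictionary from ORF name to an annotation, "consistent" is taken to mean that two databases store an identical annotation value (the exact criterion is not spelled out in the text), and all function names and coordinates are invented for the example. The branchpoint strings are the ones shown in Table 3.2.

```python
from collections import Counter


def consistent_introns(ayid, yidb, cygd):
    """Keep annotations that at least two of the three databases agree on.

    Each argument maps ORF name -> annotation (here a (start, end) tuple).
    With three sources, at most one annotation per ORF can reach two votes.
    """
    votes = Counter()
    for database in (ayid, yidb, cygd):
        for orf, annotation in database.items():
            votes[(orf, annotation)] += 1
    return {orf: ann for (orf, ann), count in votes.items() if count >= 2}


def join_annotations(selected, intron_seqs, branchpoints):
    """Inner join on ORF name: sequence from one source, branchpoint from another."""
    joined = {}
    for orf in selected:
        if orf in intron_seqs and orf in branchpoints:
            joined[orf] = {
                "intron_seq": intron_seqs[orf],
                "branchpoint": branchpoints[orf],
            }
    return joined


# Toy coordinates (hypothetical, not real annotations).
ayid = {"YAL001C": (100, 189), "YAL003W": (50, 415)}
yidb = {"YAL001C": (100, 189), "YAL003W": (50, 410)}
cygd = {"YAL003W": (50, 415), "YAL030W": (10, 122)}
selected = consistent_introns(ayid, yidb, cygd)

toy_seqs = {"YAL001C": "GUAUGUNNNUACUAACNNNUAG"}          # placeholder sequence
bps = {"YAL001C": "tttactaacga", "YAL003W": "attactaacaa"}  # from Table 3.2
joined = join_annotations(selected, toy_seqs, bps)
```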
The vast majority of introns are located in the translated portion of a gene, while 12 introns are located in the 5' untranslated region (UTR). This information was collected in January 2006 when the dataset was last updated. We will refer to this dataset as the STRuctural INtron (STRIN) dataset. Appendix A lists all the introns in the dataset (names of the genes which contain them) along with some of their characteristics.

3.2.2 Length distribution and architecture of yeast introns

The length distribution of introns in the STRIN dataset is shown in Figure 3.1. We observe that this distribution is primarily bimodal: the first mode is located around 100 nt and the other at 400 nt. This is in agreement with previous observations of the length distribution of yeast introns (Spingola et al., 1999). The shortest intron in the dataset has a length of 58 nt and the longest one has a length of 1002 nt.

Figure 3.1: (a) Distribution histogram and (b) cumulative distribution of intron lengths (L) in the STRIN dataset.

The architecture of yeast introns was first examined by Parker and Patterson (1987), who tried to establish patterns of spatial arrangements between conserved sequence elements involved in splicing. The motivation for this work was the assumption that optimal spacing of conserved intron sequences would promote spliceosome assembly by positioning snRNAs and other protein factors involved in splicing in proper geometry to interact with each other. Evidence of non-random spacing would support this assumption. For their study, Parker and Patterson used a dataset of 43 fungal introns, including 21 from S. cerevisiae. They distinguished two classes of introns based on the distance from the 5' splice site to the branchpoint sequence:

• 5' short (5'S) introns, for which the distance between the two elements was about 40 nt, and
• 5' long (5'L) introns, with a significantly larger spacing of 200-450 nt.

Similarly, based on the distance between the branchpoint sequence and the 3' splice site, the authors proposed the following classification:

• 3' short (3'S) introns, for which the distance between the two elements was between 5 and 15 nucleotides, and
• 3' long (3'L) introns, with a spacing of 22-137 nt.

A schematic of yeast intron architecture is given in Figure 3.2.

Figure 3.2: Architecture of a yeast intron: consensus 5' and 3' splice sites and branchpoint sequence are given (Y = C or U). The branchpoint distance is the distance between the 5' splice site and the branchpoint sequence. Some authors extend this distance up to the branchpoint adenine.
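The two classifications above can be written down as simple threshold rules. The sketch below is an illustration only: the function names are hypothetical, and the 200-nt cutoff separating 5'S from 5'L introns is an assumption (it is the lower end of the 5'L range quoted above, since "about 40 nt" does not define an upper bound for 5'S). Distances that fall outside the quoted 3' ranges are reported as unclassified rather than forced into a class.

```python
def classify_five_prime(branchpoint_distance, cutoff=200):
    """Label an intron 5'S or 5'L from its 5' splice site - branchpoint distance.

    cutoff is an assumed boundary, not a value given by Parker and Patterson.
    """
    return "5'S" if branchpoint_distance < cutoff else "5'L"


def classify_three_prime(bp_to_3ss_distance):
    """Label an intron 3'S (5-15 nt) or 3'L (22-137 nt) as quoted in the text."""
    if 5 <= bp_to_3ss_distance <= 15:
        return "3'S"
    if 22 <= bp_to_3ss_distance <= 137:
        return "3'L"
    return "unclassified"
```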
In order to see if these spatial classifications still hold for a larger set of yeast introns, we plotted distributions of distances between the 5' splice site and branchpoint sequence (which we call 'branchpoint distances') (Figure 3.3) and distances between the branchpoint and the 3' splice site (Figure 3.4). In AYID, these distances are referred to as 'lariat length' and 'tail length', respectively. The branchpoint distance distribution appears to be bimodal, which is in agreement with the findings of Parker and Patterson (1987). Classifying STRIN introns into two groups based on their branchpoint distance yields 104 5'S introns and 110 5'L introns. 5' short introns range from 40 to 200 nt, with an average length of 115 nt and an average branchpoint distance of 71 nt, and 5' long introns range from 200 to 500 nt (with few longer exceptions), with an average length of 417 nt and an average branchpoint distance of 372 nt.

Figure 3.3: (a) Distribution histogram and (b) cumulative distribution of 5' splice site - branchpoint distances in the STRIN dataset.

Figure 3.4: (a) Distribution histogram and (b) cumulative distribution of branchpoint - 3' splice site distances in the STRIN dataset.

Visual comparison of the distributions in Figure 3.1 and Figure 3.3 suggests that the intron length and the branchpoint distance may be correlated in yeast introns. This is confirmed in the correlation plot given in Figure 3.5, which shows a tight linear correlation (with a Pearson correlation coefficient of r = 0.99) between the two intron characteristics. The same correlation coefficient value was reported in Kupfer et al. (2004), where it was found that high correlation between intron length and branchpoint distance is a common characteristic of five diverse fungal species.

Figure 3.5: Correlation between intron length and branchpoint distance in the STRIN dataset (r = 0.99).

The distribution of distances between the branchpoint and the 3' splice site in the STRIN dataset is unimodal (Figure 3.4), which is in contrast with the observations in Parker and Patterson (1987). This may be explained by the small size of the sample used by Parker and Patterson. The observed mode is near 35 nt, and the distances range from 10 to 80 nt, with a few longer exceptions. The correlation plot shown in Figure 3.6 does not show any evidence of a significant correlation between intron length and the distance between the branchpoint and the 3' splice site (r = 0.13). A similar correlation coefficient value was reported in Kupfer et al. (2004). The Pearson correlation coefficient between branchpoint distances and distances between the branchpoint sequence and the 3' splice site is r = -0.005.

Figure 3.6: Correlation between intron length and branchpoint - 3' splice site distance in the STRIN dataset (r = 0.13).

Our analysis of spatial arrangements in the STRIN dataset supports in general the findings of Parker and Patterson (1987). The distances between conserved intron sequences are not random but grouped around one or two modes. The distance distribution for the branchpoint - 3' splice site suggests that a spacing between 10 and 80 nt is optimal for spliceosomal assembly, and thus is evolutionarily conserved. This is also supported by findings that the distance from the branchpoint nucleotide is a critical parameter for 3' splice site activation (Luukkonen and Seraphin, 1997).

The observation of two modes for the branchpoint distribution leads us to believe that there are two optimal spacings between the 5' splice site and the branchpoint sequence. This appears to contradict our knowledge of spliceosome assembly, which is universal for any intron type. The shorter distance seems likely to facilitate direct interactions between the U1 and U2 snRNPs.
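The correlation coefficients quoted above for Figures 3.5 and 3.6 are ordinary Pearson coefficients. A minimal sketch of the computation is given below; the function name and the toy numbers in the usage lines are hypothetical and are not values from the STRIN dataset.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal length, at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of centred products and centred squares.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return sxy / math.sqrt(sxx * syy)


# Toy data: branchpoint distance equal to intron length minus a fixed tail.
toy_lengths = [100, 200, 300, 400]
toy_bp_distances = [60, 160, 260, 360]
toy_r = pearson_r(toy_lengths, toy_bp_distances)
```

In the toy data the branchpoint distance differs from the intron length by a constant, which is the idealized version of the near-perfect linear relationship reported for Figure 3.5.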
The minimum distance needed for the assembly of the U1 and U2 snRNPs with pre-mRNA is around 40 nucleotides: the region bound to the U1 snRNP is approximately 15 nt (3 nt upstream of the splice junction and 12 nt downstream) (Mount et al., 1983), and the region bound to the U2 snRNP is about 35 nt (25 nt upstream of the branchpoint adenosine and 10 nt downstream) (Black et al., 1985). It was also shown that yeast pre-mRNA splicing requires a minimum branchpoint distance of 40 nt; shortening this distance leads to splicing inhibition (Thompson-Jager and Domdey, 1987; Kohrer and Domdey, 1988). These findings are consistent with our data: the minimum branchpoint distance observed in the STRIN dataset is 42 nucleotides.

If shorter spacing is optimal for spliceosome assembly, how do the U1 and U2 snRNPs interact in 5'L introns? Parker and Patterson (1987) suggest the formation of pre-mRNA secondary structure that would bring the 5' splice site and the branchpoint into closer proximity. In their yeast dataset, they observed 12 introns with large branchpoint distances that have a potential to form helical structures between the 5' splice site and the branchpoint. A common feature of these structures is a relatively constant shortened branchpoint distance (obtained after subtracting nucleotides enclosed in the stem and loop) of around 45 nt that resembles the spacing in 5'S introns.

3.2.3 Secondary structure in yeast introns

Parker and Patterson's proposal is not unique.
Several other authors have studied secondary structure elements in yeast introns and their effect on splicing. Experimental analysis of the splicing efficiency of the S. cerevisiae CYH2 gene, which involved deletion and rearrangement of intron sequences, identified two small regions, one downstream of the 5' splice site and the other upstream of the branchpoint sequence, that were found to be essential for splicing in vitro and in vivo (Newman, 1987). The elements were found to be complementary in sequence, suggesting possible basepairing interactions.

In 1993, Goguel and Rosbash observed some communication between non-conserved sequences downstream from the 5' splice site and upstream from the branchpoint region of the RP51B S. cerevisiae intron. These interactions were further confirmed by Libri et al. (1995) and by Charpentier and Rosbash (1996), where comprehensive mutational and structure-probing analysis determined the exact structure of the stem-loop formed in the wild type (wt) intron. These studies also demonstrated that this complementary pairing is essential for efficient splicing in vitro and in vivo. Libri et al. found that if the wild type basepairing interaction was disrupted, splicing could be restored by alternative basepairing, indicating that the existence, not the location, of the stem is essential for efficient splicing of the RP51B intron. The work of Libri et al. and Charpentier and Rosbash is discussed in more detail in Section 5.1.

A study of exon skipping in the multiply interrupted YL8A S. cerevisiae gene revealed the important role of complementary intron sequences that promote exon inclusion in mature RNA (Howe and Ares, 1997). The sequences are located downstream of the 5' splice site and upstream of the branchpoint in each of the two YL8A introns. Destroying the complementarity of the sequences reduced the amount of correctly spliced pre-mRNA and induced exon skipping. Based on this observation, Howe and Ares suggest that these basepairing interactions serve as intron identifiers, promoting correct pairing of appropriate splice sites in multi-intron pre-mRNAs.

3.3 Intron dataset for phylogenetic analysis

Biological sequences and structures that have important functions are usually conserved during evolution. This makes comparative genomics, which compares genomes of related species, a powerful tool for identifying functional elements without previous knowledge of function.

[Figure 3.7: tree with leaves S. cerevisiae, S. paradoxus, S. mikatae, and S. bayanus; node labels ~5 mya, ~10 mya, and ~20 mya]

Figure 3.7: Phylogenetic tree for Saccharomyces sensu stricto species derived based on sequence divergence of ribosomal DNA sequences (Kellis, 2003). The time labels (in mya = million years ago) given at the nodes of the tree indicate when the species, represented by the branches coming out from a node, diverged from their most recent common ancestor.
In this thesis, we use a technique from comparative genomics, known as phylogenetic or comparative structure analysis, to look at the conservation of secondary structures or structural motifs, which we consider important for splicing. For this purpose, we chose three species closely related to Saccharomyces cerevisiae: S. paradoxus, S. mikatae, and S. bayanus. These three species were sequenced and used for comparative studies, along with S. cerevisiae, by Kellis et al. (2003). The species were sequenced by seven-fold redundant coverage and assembled into 230-500 kb long scaffolds that cover most of the genome (~95%).

S. cerevisiae, S. paradoxus, S. mikatae, and S. bayanus belong to the Saccharomyces sensu stricto group, and their phylogenetic tree is given in Figure 3.7. S. cerevisiae has 90% (80%) average nucleotide identity with S. paradoxus in coding (intergenic) regions, 84% (70%) with S. mikatae, and 80% (62%) with S. bayanus. The species were found to have enough sequence similarity to allow reliable alignment of orthologous regions, but sufficient sequence divergence to allow recognition of functional elements (Kellis et al., 2003). The genomes of the four species were aligned by pair-wise comparison of the species' ORFs.

The comparative genetic analysis performed by Kellis et al. resulted in major changes in the S. cerevisiae gene catalogue, reducing the total number of genes by approximately 500 and redefining gene boundaries in more than 300 cases. They also identified 17 mis-annotated introns, which we excluded during construction of the STRIN dataset, and predicted 58 new introns.

[Figure 3.8: ClustalW alignment of the S. cerevisiae (Scer), S. paradoxus (Spar), S. mikatae (Smik), and S. bayanus (Sbay) sequences, with a consensus line and one splice annotation line per species]

Figure 3.8: An example of a ClustalW alignment taken from the supplementary data by Kellis et al. (2003). Nucleotides conserved in all four species are marked in the consensus line. The intron is indicated by a double dashed line. Potential donor sites and branchpoint sequences are annotated by strings of Ds and Bs, respectively.
We used multiple sequence alignments provided by Kellis et al. (2003) to extract the orthologous intron sequences. The alignments are provided as supplementary data and are available at http://www-genome.wi.mit.edu/annotation/fungi/comp_yeasts/downloads.html (last accessed in June 2006). They were produced by the ClustalW program (Thompson et al., 1994) and are in the format shown in Figure 3.8.

We searched for the multiple sequence alignments of introns from the STRIN dataset. There were 174 introns found in Kellis' data, which we further filtered to exclude alignments where only an S. cerevisiae sequence was present (YGR296W, YHL050C, YHR203C, YIL177C, YJL225C, YLR464W, YNL339C, YPL283C, YPR202W). We also examined the alignments for any inconsistencies with respect to donor, acceptor and branchpoint sequence alignments. This revealed additional problematic alignments where either the donor or branchpoint sequence was not properly aligned or the aligned sequences significantly diverged from the known consensus sequence. There were 9 intron alignments with these problems: YBR186W, YJR079W, YKL002W, YKL157W, YLR093C, YLR211C, YNL012W, YOL047C, and YPL129W. These introns were excluded from our phylogenetic dataset. An additional four introns had problematic alignments between the S. cerevisiae and S. bayanus sequences: YGL137W, YJL177W, YLR078C, YOR182C. We excluded the S. bayanus sequences from these alignments. Finally, the intron YMR292W alignment between S. cerevisiae and S. paradoxus sequences was suspicious (the 7-nt sequence annotated as the branchpoint sequence for S. paradoxus was very different from the consensus sequence), so we excluded the S. paradoxus sequence from it.

These filtering steps reduced the number of alignments to 155. Among these there were 7 containing sequences from two species, 51 containing sequences from three species and 97 aligning all four species. An intron sequence from S. cerevisiae was present in each alignment, the S. paradoxus sequence was present in 146 alignments, the S. mikatae sequence in 120 alignments and the S. bayanus sequence in 134 alignments. We calculated the average nucleotide identity in intronic regions based on these alignments: S. cerevisiae has 74% average nucleotide identity with S. paradoxus, 58% with S. mikatae, and 50% with S. bayanus.

We used these orthologous intron sequences from closely related species to investigate the conservation of intron architecture. We extracted the intron sequences from the alignments and computed intron length distributions for each species. The distribution histograms as well as correlation plots that compare intron length between S. cerevisiae and other species are given in Figure 3.9.

These intron length distribution histograms look very much like the distribution histogram for S. cerevisiae (Figure 3.1): the intron length range is almost the same and all distributions are bimodal, with the first mode located around 100 nt and the other around 400 nt.
This observation, as well as the high correlation between intron lengths in S. cerevisiae and the other three species (see correlation plots in Figure 3.9; the corresponding Pearson correlation coefficients are r = 0.97, r = 0.96, and r = 0.96 for S. paradoxus, S. mikatae, and S. bayanus, respectively), indicates that intron lengths have been conserved among sensu stricto species.

[Figure 3.9: six panels; intron length distribution for S. paradoxus, S. mikatae, and S. bayanus (x-axis: intron length), and intron length correlation between S. cerevisiae and each species (x-axis: intron length for S.cer)]

Figure 3.9: Intron length distribution histograms and intron length correlation plots between S. cerevisiae and Saccharomyces sensu stricto species: (a), (b) S. paradoxus, (c), (d) S. mikatae, (e), (f) S. bayanus.

We also looked at the conservation of branchpoint distances in the sensu stricto species. The branchpoint distances for the three related sequences were computed based on the multiple sequence alignments and known branchpoint distances for S. cerevisiae introns. The distribution histograms, as well as the correlation plots that compare branchpoint distances between S. cerevisiae and the other sensu stricto species, are shown in Figure 3.10. The correlation coefficients corresponding to the given correlation plots are r = 0.999, r = 0.996, and r = 0.994 for S. paradoxus, S. mikatae, and S. bayanus, respectively. This indicates that the branchpoint distances are conserved even better than the intron lengths and suggests functional importance of this intron characteristic.

If pre-mRNA secondary structure plays a role in splicing, either to shorten the distance between the 5' splice site and the branchpoint sequence or in some other way, it seems reasonable to assume that these structural features of introns will be conserved among sensu stricto species. We investigate this hypothesis further in Sections 4.4 and 6.4.

[Figure 3.10: six panels; branchpoint distance distribution for S. paradoxus, S. mikatae, and S. bayanus (x-axis: 5' splice site - branchpoint distance), and BP distance correlation between S. cerevisiae and each species (x-axis: BP distance for S.cer)]

Figure 3.10: Branchpoint distance distribution histograms and branchpoint distance correlation plots between S. cerevisiae and Saccharomyces sensu stricto species: (a), (b) S. paradoxus, (c), (d) S. mikatae, (e), (f) S. bayanus.
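The Pearson correlation coefficients quoted for Figures 3.9 and 3.10 can be reproduced with a few lines of code. The following is a minimal sketch under stated assumptions: the function name pearson_r and the example lengths are hypothetical and are not taken from the thesis data or implementation.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sums of squares around the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical lengths (nt) of five orthologous introns; not thesis data.
scer_lengths = [98, 105, 312, 401, 450]
spar_lengths = [101, 103, 320, 395, 462]
r = pearson_r(scer_lengths, spar_lengths)
```

Each pair of values must come from the same orthologous intron, so introns missing from a species' alignment would have to be dropped from both lists before the coefficient is computed.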
Chapter 4

Zipper stems in long yeast introns

Motivated by the discoveries of complementary basepairing in several Saccharomyces cerevisiae introns, in this chapter we investigate if the ability to form a stem structure between the 5' splice site and the branchpoint is something common to all introns with long branchpoint distance. We search the secondary structures of 5'L STRIN introns for stems that bring the 5' splice site and the branchpoint sequence into closer proximity and analyze the effect of these stems on the resulting branchpoint distances. We experiment with different stem selection criteria and consider single and multiple stems. The results obtained generally support the hypothesis that stem structures in a long yeast intron can shorten the branchpoint distance to what is believed to be the optimal one.

The second part of the chapter investigates the conservation of zipper stems among closely related Saccharomyces sensu stricto species, using visual inspection of the multiple sequence alignments and secondary structures as well as automatic comparative structure approaches on a selected subset of introns.

4.1 Definition and initial identification of zipper stems

We used the Vienna RNA secondary structure package (Hofacker et al., 1994) to calculate secondary structures of 5'L introns in the STRIN dataset. Its RNA folding function, RNAfold, is based on a dynamic programming algorithm, which calculates the pseudoknot-free secondary structure with minimum free energy as well as the equilibrium partition function and basepairing probabilities. Version 1.4 of the package was downloaded from http://www.tbi.univie.ac.at/~ivo/RNA/ (September 2003) and compiled. The input to the RNAfold function consists of one or more sequences in FASTA format. The output of the program is a secondary structure in dot-bracket notation, which is usually defined as follows:

Definition 1 (RNA secondary structure in dot-bracket notation) Let $R = r_1 r_2 \ldots r_n$ be an RNA sequence ($r_i \in \{A, C, G, U\}$). The secondary structure of sequence $R$ is given by a string $S = s_1 s_2 \ldots s_n$, where $s_k$ ($1 \le k \le n$) is one of the symbols '(', '.', and ')'. A basepair between bases $r_i$ and $r_j$, where $i < j$, is represented by $s_i$ = '(' and $s_j$ = ')'. An unpaired base $r_k$ is represented by $s_k$ = '.'. For each position $i$ in $S$, the number of opening brackets in $s_1 \ldots s_i$ has to be greater than or equal to the number of closing brackets. The total numbers of opening and closing brackets in $S$ have to be equal.*

The RNAfold function was run on a FASTA file with the 110 5'L introns from the STRIN dataset. For each intron sequence, the program calculated a structure with the minimum free energy. The structures were further computationally analyzed to find a stem that would shorten the branchpoint distance.
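The two conditions at the end of Definition 1 are exactly what a stack-based parser checks. The sketch below is one possible way to turn a dot-bracket string into a table of basepairs; the function name parse_dot_bracket and the example string are hypothetical and are not part of the Vienna RNA package or of the thesis code.

```python
def parse_dot_bracket(structure):
    """Return a dict mapping each paired position to its partner (1-based).

    Raises ValueError if the string violates Definition 1, i.e. if a closing
    bracket has no open partner or an opening bracket is never closed.
    """
    pairs = {}
    open_positions = []  # stack of positions of still-unmatched '('
    for pos, symbol in enumerate(structure, start=1):
        if symbol == "(":
            open_positions.append(pos)
        elif symbol == ")":
            if not open_positions:
                raise ValueError(f"unmatched ')' at position {pos}")
            partner = open_positions.pop()
            pairs[partner] = pos
            pairs[pos] = partner
        elif symbol != ".":
            raise ValueError(f"unexpected symbol {symbol!r} at position {pos}")
    if open_positions:
        raise ValueError(f"unmatched '(' at position {open_positions[-1]}")
    return pairs


# Hypothetical 15-nt structure with two stacked stems; not thesis data.
example = "((..((...))..))"
pair_table = parse_dot_bracket(example)
```

Because every closing bracket is matched to the most recent unmatched opening bracket, the resulting pairs are always nested, which is the pseudoknot-free property of the structures RNAfold reports.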
A stem structure in RNA secondary structure prediction is formally defined in Definition 2, as follows:

Definition 2 (Stem) A stem in an RNA secondary structure is defined by basepairing between two complementary subsequences of $R$: $r_i r_{i+1} \ldots r_{i+k}$ and $r_{j-k} r_{j-k+1} \ldots r_j$, and represented in secondary structure $S$ as a substring $s_i s_{i+1} \ldots s_{i+k}$ of opening brackets and a substring $s_{j-k} s_{j-k+1} \ldots s_j$ of closing brackets, where $s_i$ and $s_j$, $s_{i+1}$ and $s_{j-1}$, ..., $s_{i+k}$ and $s_{j-k}$ are matching brackets.

* The pseudoknot-free RNA secondary structure in dot-bracket notation can also be defined by a simple context-free grammar:

$$S \rightarrow \varepsilon \mid .S \mid (S)S \qquad (4.1)$$

Note that a stem as defined in Definition 2 is uninterrupted, i.e., does not allow for the occurrence of any unpaired bases that would form bulges or internal loops within the stem.

We designed an algorithm that identifies a stem that brings the 5' splice site and the branchpoint into closer proximity. The location of the branchpoint was obtained by taking the branchpoint sequence for a particular intron from the YIDB table (see Table 3.2) and finding its location in the intron sequence taken from the AYID database. The algorithm parses a string of brackets and dots given by the RNAfold program and looks at the stems that appear between the 5' splice site and the branchpoint sequence. The stem that has the largest distance between the complementary sequences forming it is selected and returned as the result.
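The stem search described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the thesis implementation: the input is assumed to be a valid dot-bracket string such as RNAfold produces, d is the branchpoint distance (so the branchpoint sequence starts at position d + 1), and only basepairs (i, j) with j < d are considered, following Definition 3 of this chapter. The function names pair_table and find_zipper_stem and the example structure are hypothetical.

```python
def pair_table(structure):
    """Map each paired 1-based position to its partner (valid input assumed)."""
    pairs, stack = {}, []
    for pos, symbol in enumerate(structure, start=1):
        if symbol == "(":
            stack.append(pos)
        elif symbol == ")":
            partner = stack.pop()
            pairs[partner], pairs[pos] = pos, partner
    return pairs


def find_zipper_stem(structure, d):
    """Return (i, j, length) for the stem with maximal j - i and j < d.

    i and j are the outermost paired positions of the stem; length counts the
    consecutive basepairs (i, j), (i+1, j-1), ... with no unpaired base in
    between (Definition 2). Returns None if no basepair closes before d.
    Ties in j - i are resolved in favour of the pair that closes first.
    """
    pairs = pair_table(structure)
    best = None
    for i, j in pairs.items():
        if i < j < d and (best is None or j - i > best[1] - best[0]):
            best = (i, j)
    if best is None:
        return None
    i, j = best
    length = 1
    # Extend inwards while the next positions still pair with each other.
    while i + length < j - length and pairs.get(i + length) == j - length:
        length += 1
    return i, j, length


# Hypothetical 36-nt structure: a 4-bp stem (1..12) and a 3-bp stem (15..30).
structure = "((((....))))" + ".." + "(((" + "." * 10 + ")))" + "." * 6
zipper = find_zipper_stem(structure, d=33)
```

In this sketch a stem that extends past position d is truncated rather than discarded, so only its inner basepairs that close before d are counted.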
We call such a stem a 'zipper' stem, since it 'zips' the intron, bringing the 5' splice site and the branchpoint sequence closer together. The formal definition of a zipper stem is given in Definition 3. A zipper stem is always located between the 5' splice site and the branchpoint, even though these two sequences could be brought closer together by basepairing interactions between the first complementary sequence located between the 5' splice site and the branchpoint and the second complementary sequence located between the branchpoint and the 3' splice site. A generalization of the zipper stem definition that captures such effects will be considered later. An example of an intron structure, with annotated 5' splice site, branchpoint sequence and zipper stem, is given in Figure 4.1.

Definition 3 (Zipper stem) A stem represented by two substrings $s_i \ldots s_{i+k}$ and $s_{j-k} \ldots s_j$ of matching opening and closing brackets, where $j < d$ ($d + 1$ is the first position of the branchpoint sequence), is called a zipper stem if it is the stem that brings the 5' splice site and the branchpoint closest together, i.e., if $j - i$ is maximal among all possible stems in $S$ for which $j < d$.

[Figure 4.1: drawing of the predicted secondary structure with its dot-bracket string underneath]

Figure 4.1: Secondary structure of the YGL030W intron. The dot-bracket notation for this structure with highlighted zipper stem is shown at the bottom of the figure.

Potential zipper stems were found for all 5'L introns in the dataset. Since it was previously speculated that the role of these stems would be to shorten the distance between the 5' splice site and the branchpoint to the operational distance that is observed for 5'S introns (Parker and Patterson, 1987; Newman, 1987; Goguel and Rosbash, 1993; Libri et al., 1995), we tested if we could observe this effect in our dataset. Once the locations of the potential zipper stems were found, we calculated the shortened branchpoint distance for all the sequences.

Definition 4 (Shortened branchpoint distance) If $s_i \ldots s_{i+k}$ and $s_{j-k} \ldots s_j$ are two substrings of matching opening and closing brackets that form the identified zipper stem, the shortened branchpoint distance has the following value:

$$\hat{d} = i + (d - j) \qquad (4.2)$$

where $d$ is the original branchpoint distance (the first position of the branchpoint sequence is $d + 1$); for an illustration see Figure 4.2.

If the algorithm fails to identify a zipper stem for a certain sequence, its original, linear branchpoint distance will be included in the distribution of shortened distances.

The distribution of shortened branchpoint distances for 5'L STRIN introns, along with the distribution of branchpoint distance for 5'S introns, is given in Figure 4.3. The cumulative distribution plot is shown in Figure 4.4. The two distributions do not appear to support the hypothesized effect.
The datasets have almost identical means (72 nt for 5'S introns and 71 nt for zipped 5'L introns), but many zipped 5'L introns have branchpoint distances shorter than 40 nt or longer than 200 nt, which is not observed for the 5'S introns.

To test these differences statistically, we performed the Kolmogorov-Smirnov test (KS test) (Hollander and Wolfe, 1999), which is used to determine whether the underlying probability distributions of two datasets differ. The KS test is non-parametric, i.e., it does not require any assumption about the distribution of the data (unlike Student's t-test), and it does not depend on data binning (unlike the chi-square test), which makes it suitable for our analysis.

[Figure 4.2: schematic of an intron with a zipper stem, labelling the shortened branchpoint distance, position d + 1, and the branchpoint sequence]

Figure 4.2: Illustration of shortened branchpoint distance.

The test calculates the maximum distance between two cumulative distributions (the D statistic), and then the p-value associated with this number can be obtained. For p-values smaller than the selected significance level, the null hypothesis that the two datasets stem from the same underlying distribution is rejected. We used a significance level of 0.05 for all statistical tests performed.

Applying the KS test to the datasets of branchpoint distances for 5'S introns and branchpoint distances for zipped 5'L introns, we obtained D = 0.3 with a corresponding p-value < 0.0001. Thus, the null hypothesis that the two datasets stem from the same underlying distribution is rejected.
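To make the D statistic concrete, the following sketch computes it for two samples together with an approximate p-value. Several things here are assumptions rather than details from the thesis: the function name ks_two_sample is hypothetical, the p-value uses the standard asymptotic Kolmogorov distribution (which is only an approximation for small samples), and the software that produced the p-values reported in this chapter is not stated in the text and may have used a different method.

```python
import math
from bisect import bisect_right


def ks_two_sample(sample_a, sample_b):
    """Two-sample KS test: return (D, approximate p-value).

    D is the largest absolute difference between the two empirical
    cumulative distribution functions (ECDFs).
    """
    a, b = sorted(sample_a), sorted(sample_b)
    n, m = len(a), len(b)
    d_stat = 0.0
    # The ECDFs only change at observed values, so checking those suffices.
    for x in a + b:
        cdf_a = bisect_right(a, x) / n
        cdf_b = bisect_right(b, x) / m
        d_stat = max(d_stat, abs(cdf_a - cdf_b))
    # Asymptotic p-value: Q(lam) = 2 * sum_{k>=1} (-1)^(k-1) exp(-2 k^2 lam^2)
    lam = d_stat * math.sqrt(n * m / (n + m))
    if lam < 0.2:
        # The series converges slowly here and the p-value is essentially 1.
        return d_stat, 1.0
    p_value = 2.0 * sum(
        (-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
        for k in range(1, 101)
    )
    return d_stat, min(max(p_value, 0.0), 1.0)


# Hypothetical branchpoint distances (nt); not thesis data.
short_introns = [42, 45, 48, 50, 52, 55, 60, 70]
zipped_introns = [20, 35, 47, 51, 58, 90, 150, 230]
d_value, p_value = ks_two_sample(short_introns, zipped_introns)
```

The null hypothesis would be rejected at the 0.05 level used in this chapter whenever the returned p-value is below 0.05.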
[Figure 4.3: two histogram panels; (a) branchpoint distance distribution for 5'S introns (x-axis: branchpoint distance), (b) branchpoint distance distribution for 'zipped' 5'L introns (x-axis: shortened branchpoint distance)]

Figure 4.3: Distribution histograms for (a) branchpoint distance for 5'S introns ($d$) and (b) branchpoint distance for zipped 5'L introns ($\hat{d}$).

Figure 4.4: Cumulative distributions of the 5' splice site - branchpoint distance for 5'S introns and 5'L introns folded in secondary structure.

4.2 Length-bounded zipper stems

Upon closer inspection of the intron structures with very short identified zipper stems, we usually found longer, more thermodynamically stable stems that could zip the intron between the 5' splice site and the branchpoint. One example is given in Figure 4.5. In this case the zipper stem found was only 2 nt long and is actually part of a larger stem formed by basepairing interactions between the 5' splice site and the branchpoint sequence. It is questionable if this basepairing interaction is present in vivo since the U1 and U2 snRNPs have to bind to these sequences in order to initiate splicing. The other possibility is that there are enzymes that open this structure prior to splicing (Wagner et al., 1998; Wang et al., 1998).
In any case, even if the 5' splice site and the branchpoint were not enclosed in this stem, a 2-nucleotide-long stem does not seem thermodynamically stable enough to hold two ends of the molecule together, especially if the enclosed loop is large.

Therefore, we modified our algorithm to find zipper stems whose length is greater than or equal to a minimum stem length defined in the algorithm. We ran the algorithm for arbitrary minimum stem lengths of 3, 5 and 7 basepairs (bp). For a stem length of at least 7 bp, the algorithm failed to find a stem in many introns from the dataset. The best results, in terms of the value of the D statistic, were obtained when the minimum length requirement for the zipper stem was 5 bp. The distribution of shortened branchpoint distances is shown in Figure 4.6, along with the original branchpoint distance distribution for 5'S introns.

Figure 4.5: Secondary structure of the YDL079C intron.

Imposing the minimum stem-length requirement did improve the results, but the null hypothesis that the dataset of 5'S branchpoint distances and the dataset of shortened 5'L branchpoint distances stem from the same underlying distribution was still rejected (p-value = 0.024). Cumulative distribution plots for the two datasets are given in Figure 4.7.
From Figure 4.6(b) and Figure 4.7 it is evident that the number of zipped introns that have very short branchpoint distances has been reduced; however, this also resulted in an increase in the number of zipped introns with longer (> 200 nt) branchpoint distances.

[Figure 4.6: two histogram panels; (a) branchpoint distance distribution for 5'S introns (x-axis: branchpoint distance), (b) branchpoint distance distribution for 'zipped' 5'L introns, stem >= 5 bp (x-axis: shortened branchpoint distance)]

Figure 4.6: Distribution histograms for (a) branchpoint distance for 5'S introns and (b) branchpoint distance for zipped 5'L introns with a minimum zipper stem length of 5 nt.

Figure 4.7: Cumulative distributions of the 5' splice site - branchpoint distance for 5'S introns and 5'L introns folded in secondary structure with a minimum zipper stem length of 5 nt.

4.2.1 Control datasets

Despite the differences between the two distributions, their modes are still around the same value of about 50 nt. To test the significance of this phenomenon we investigated if a similar effect can be observed for random and exon sequences. For this purpose, we generated two control datasets: a dataset of randomly generated RNA sequences and a dataset of exonic subsequences. The first dataset contains 500 randomly generated DNA sequences that on average have the same GC content as the STRIN dataset (34%) and whose length distribution is identical to the length distribution of 5'L introns.
The second dataset was generated by extracting windows of exonic sequence from STRIN exons (exons from genes that have STRIN introns) by sliding a window of variable length, which is drawn from the 5'L intron length distribution, over the exon sequences. The resulting dataset has 449 exonic sequences that have the same length distribution as the 5'L STRIN introns.

The sequences in these control datasets were folded using RNAfold, and the obtained MFE secondary structures were processed to find potential zipper stems between the beginning of the sequence and the assumed branchpoint location. This location is the same as the branchpoint location of the intron whose length was used to model a particular random or exonic sequence. Identified zipper stems had to be at least 5 bp long. The distributions of the resulting shortened branchpoint distances are plotted in Figures 4.8(c) and 4.8(d). The cumulative distributions for the branchpoint distance in 5'S introns, shortened branchpoint distance in zipped 5'L introns, and shortened branchpoint distance of zipped exonic and random sequences are given in Figure 4.9.

Figure 4.8: Distributions for (a) branchpoint distance for 5'S introns, (b) branchpoint distance for zipped 5'L introns, (c) branchpoint distance of zipped exonic sequences, and (d) branchpoint distance of zipped random sequences. All zipped structures have a minimum zipper stem length of 5 nt.

Figure 4.9: Cumulative distributions for the branchpoint distance in 5'S introns, shortened branchpoint distance in zipped 5'L introns, and shortened branchpoint distance of zipped exonic and random sequences. All zipped structures have a minimum zipper stem length of 5 nt.

Figure 4.8 shows that the distributions for random and exonic sequences, which are very similar, are quite different from the other two distributions: the distribution mode at around 50 nt observed for both 5'S and 5'L introns does not exist for random and exonic sequences, whose distributions appear to be approximately uniform between 50 and 250 nt. The differences are also evident in the cumulative distribution plots, shown in Figure 4.9, where the curves for random and exonic sequences overlap, while the curve for 5'L introns is closer to that for 5'S introns. Statistical analysis further emphasizes the differences between the datasets: the KS test comparing 5'S introns with exonic and random sequences produced D = 0.5227 (p-value < 10^-21) and D = 0.5081 (p-value < 10^-20), respectively. Table 4.1 summarizes the results of the KS test for all pair-wise comparisons between the four datasets.

              5'S introns                  5'L introns                  exon seq
5'L introns   D = 0.20, p-value = 0.02
exon seq      D = 0.52, p-value = 10^-21   D = 0.43, p-value = 10^-15
random seq    D = 0.51, p-value = 10^-20   D = 0.41, p-value = 10^-14   D = 0.03, p-value = 0.95

Table 4.1: Summary of KS test results for all pair-wise comparisons between datasets of STRIN 5'S introns, STRIN 5'L introns, exonic sequences and random sequences. The p-value highlighted in boldface is greater than 0.05, indicating that the hypothesis that two compared datasets stem from the same distribution cannot be rejected.

It is necessary to point out that random and exonic sequences are different by nature, mostly because of the evolutionary pressure imposed on exon sequences, and that for some other characteristics they would probably exhibit profound differences (e.g., codon frequencies). The high similarity between the distributions for exonic and random sequences further emphasizes the non-randomness of the distribution mode for the shortened branchpoint distances observed in 5'L introns. One possible implication of the mode phenomenon is that the optimal branchpoint distance for splicing of yeast introns is about 50 nt, which in long introns, as our results suggest, is achieved by shortening of the original distance as a result of zipper stem formation.
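For readers unfamiliar with the D statistic reported for the KS test, the following sketch computes the two-sample Kolmogorov-Smirnov statistic, i.e., the largest vertical gap between the two empirical cumulative distribution functions. It is an illustration only and is not the software used to produce the values above; the function name is hypothetical, and no p-value is computed (a library routine such as `scipy.stats.ks_2samp` can be used for that).

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic D = max |F_a(x) - F_b(x)| over all x."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    d = 0.0
    # Both empirical CDFs are step functions, so the maximum gap is
    # attained at one of the observed values.
    for x in set(a) | set(b):
        f_a = bisect_right(a, x) / len(a)  # fraction of sample_a <= x
        f_b = bisect_right(b, x) / len(b)  # fraction of sample_b <= x
        d = max(d, abs(f_a - f_b))
    return d
```

A value of D close to 0 means the two cumulative curves nearly coincide, while a value close to 1 means the samples barely overlap.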
4.2.2 Multiple zipper stems

Motivated by the considerable number of 5'L shortened branchpoint distances longer than 150 nt, which is not observed for 5'S branchpoint distances, we investigated the possibility that more than one zipper stem is involved in shortening the distance between the 5' splice site and the branchpoint. For this purpose, we modified our algorithm to look for an ensemble of zipper stems in the following way: once a zipper stem according to Definition 3 is found, the secondary structure S is modified to exclude that stem (bases in the stem are marked as unpaired), and the search continues for the next zipper stem that brings the 5' splice site and the branchpoint sequence closest together. This process is repeated until no stems can be found that are larger than or equal to the minimum stem size defined in the algorithm. The shortened branchpoint distance of an intron zipped with several stems is calculated as described before, by counting the nucleotides between the 5' splice site and the branchpoint that are not enclosed in or between complementary sequences of the found zipper stems. An example is given in Figure 4.10, which shows four zipper stems identified by the algorithm. The black letters, which are not within the highlighted regions, are the nucleotides that are included in the distance calculation.
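The distance calculation for several zipper stems can be sketched as follows. The sketch assumes that each stem is summarized by the positions (i, j) of its outermost basepair, so that every nucleotide from i to j inclusive counts as enclosed in or between the complementary sequences. The function name, the 0-based coordinates, and the treatment of the end positions are illustrative conventions, not details of the original program.

```python
def shortened_distance(splice_site, branchpoint, stems):
    """Count nucleotides in [splice_site, branchpoint) not enclosed by any stem.

    stems is a list of (i, j) positions of the outermost basepair of each
    zipper stem, with i < j.
    """
    enclosed = set()
    for i, j in stems:
        # Every position from i to j is inside the stem or the region it closes.
        enclosed.update(range(i, j + 1))
    return sum(
        1 for pos in range(splice_site, branchpoint) if pos not in enclosed
    )
```

For example, under these assumptions a 100 nt stretch with stems closing positions 10 to 59 and 70 to 89 has a shortened distance of 30 nt.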
The algorithm was run for the arbitrary minimum stem lengths of 3, 5, and 7 nucleotides and the best results, with respect to D statistic values, were obtained for the minimum stem length of 5 bp. The shortened branchpoint distance distribution for 5'L introns and the branchpoint distance distribution for 5'S introns are shown in Figure 4.11. Again, the modes of the two distributions are located at around 50 nt, but the distribution of the data for the lower values in each case is quite different, with many more sequences having very short branchpoint distances for the zipped 5'L introns. This observation is also supported by a KS test, which resulted in rejection of the null hypothesis that the two samples stem from the same underlying distribution (p-value < 0.001).

The analysis was repeated with the two control datasets, this time resulting in distributions of shortened branchpoint distances for random and exonic sequences that were more similar to the original 5'S branchpoint distance distribution than the 5'L shortened branchpoint distance distribution (D = 0.1971 and D = 0.1864 for the exonic and random sequences, respectively, p-value < 0.001 for both). However, similar to the single-stem analysis, the distance distributions of the control datasets appear more uniform, without a prominent mode, unlike the distributions for 5'S and 5'L introns. This phenomenon can be observed in the cumulative distribution plot given in Figure 4.12.
Figure 4.10: Secondary structure of the YGR214W intron. The four zipper stems are enumerated in the order in which they were identified by the algorithm. The shortened distance is calculated by counting nucleotides between the 5' splice site and the branchpoint sequence that are not enclosed in or between complementary sequences of the found zipper stems (black letters).

Figure 4.11: Distributions for (a) branchpoint distance for 5'S introns and (b) branchpoint distance for 5'L introns zipped with one or more stems with minimum length of 5 nt.

Figure 4.12: Cumulative distributions of the 5' splice site - branchpoint distance for 5'S introns and 5'L introns, random and exonic sequences zipped with one or more stems with minimum length of 5 nt.

In summary, in this section we analyzed the shortened branchpoint distances for STRIN 5'L introns that were zipped by one or more zipper stems, with or without the minimum stem-length requirement. We found that in the case of zipper stems with the minimum stem-length requirement, considering a single stem yields better results than when there are no constraints on zipper stem selection.
The obtained shortened branchpoint distances are reasonably similar to the original branchpoint distances in 5'S introns, with the most common distance being ~50 nt, but there are a number of distances that are outside of the optimal range. The corresponding distances for the two control datasets were found to be more uniformly distributed, which indicates that zipper stems in long introns have a specific effect on the branchpoint distances.

4.3 Thermodynamically stable zipper stems

As explained earlier, the zipper stems identified by our program have to be uninterrupted (Definition 3) and to satisfy a minimum length requirement. However, a stem's length is just a crude approximation of its thermodynamic stability. It is possible to directly calculate a stem's thermodynamic stability, i.e., its free energy, using Turner's energy model (Freier et al., 1986; Turner et al., 1987; Turner and Sugimoto, 1988; Mathews et al., 1999). This model is described in detail in Section 2.2. While Turner's energy model is usually used to calculate the free energy of an entire RNA molecule as well as for finding the minimum free energy structure for a given RNA sequence, it can also be used to calculate the free energy of some parts of an RNA structure. We used this approach to improve our method for zipper stem identification.
4.3.1 Identification of thermodynamically stable zipper stems

While our goal of finding the stem that maximally zips the intron (i.e., brings the donor site and the branchpoint sequences closest to each other) remains unchanged, we next consider different requirements for the zipper stems. As RNA stems in nature are often interrupted by internal loops and bulges, we modified our algorithm to also consider this type of stem. We accomplished this in the following manner: the algorithm first searches for the basepair s_i·s_j that has the maximal distance between the bases involved, i.e., j - i is maximal among all possible basepairs found between the donor site and the branchpoint sequence. Once such a basepair is found, a subroutine is called to identify the stem containing that basepair (the basepair will be the first basepair in the stem). The subroutine has one input parameter, the loop threshold (t_l), that controls the maximum allowable size of internal loops and bulges: an internal loop or bulge will be included as part of the stem only if its number of free bases is less than or equal to t_l. In this way, stems with unrealistically large loops/bulges that would destabilize them are not allowed. The free energy of a loop is roughly proportional to its size, with larger loops having higher free energy (Zuker and Jacobson, 1998). Thus, using a loop size threshold has essentially the same effect as using a loop energy threshold.
The subroutine keeps extending the potential zipper stem, starting from the initial basepair, until it reaches a multi-loop or an internal loop or bulge whose number of free bases exceeds the loop threshold. The subroutine returns the last basepair that is considered part of the potential zipper stem along with the free energy of the stem. Once the free energy of the stem has been calculated, it is compared to the energy threshold (t_e), and if it is lower than this threshold, the stem is predicted as the zipper stem. If the stem is rejected, the RNA secondary structure is modified to exclude the found stem and the search is repeated. Once a thermodynamically favorable stem is found, the shortened branchpoint distance is calculated in the same manner as before.

Since the optimal free energy of a stem that is supposed to zip the intron is unknown, we performed a small empirical study to determine the range of possible values. To get some initial idea about the free energies of stems in naturally occurring RNA molecules, we used the RNA SSTRAND (RNA Secondary Structure and Statistical Analysis Database) database (Andronescu et al.) at http://www.rnasoft.ca/sstrand/. When last accessed (June 2006), this database contained 3356 RNA secondary structures on which users can perform various types of statistical analyses. Using the database interface, we obtained the distribution of free energies of stems over all RNA molecules in the database.
Note that only continuous stems are considered for this analysis (Andronescu, 2006). The distribution histogram in Figure 4.13 shows that the free energy of a typical stem without mismatches ranges between -2 and -20 kcal/mol. The large number of stems with a free energy of 0 kcal/mol corresponds to isolated basepairs (Andronescu, 2006).

Figure 4.13: Distribution histogram (a) and cumulative distribution plot (b) of free energies of stems in naturally occurring RNA molecules (obtained from SSTRAND database).

Guided by this distribution we tried several values for the free energy threshold and analyzed the zipper stems found along with the corresponding shortened branchpoint distances. For t_e > -5, the zipper stems found are usually very short and appear inadequate to hold the two ends of the intron together. There is usually a more stable stem close by that would be identified if the energy threshold were lower. On the other hand, for t_e < -12 there is a large number of 5'L introns whose structures do not contain any stems that would satisfy the free energy criterion. Based on this analysis, we have chosen several values for t_e that are within the acceptable range, namely: -5, -7, -10 and -12. The values for the loop threshold (t_l ∈ {2, 3, 4, 5, 6, 7, 8}) were chosen based on our observation of the secondary structures of the 5'L introns and on the distributions of the number of free bases in bulges and internal loops obtained from the SSTRAND database (Figure 4.14).

We ran the modified algorithm for identification of thermodynamically stable zipper stems for all combinations of t_e and t_l values. For each particular pair of t_e and t_l values, the algorithm identified a zipper stem that satisfies the requirements for each of the 5'L introns and calculated a shortened branchpoint distance. To statistically test the difference between the shortened distance distribution for the 5'L introns and the original branchpoint distance distribution for 5'S introns, we used the KS test. The p-values thus obtained are shown in Table 4.2.

t_e \ t_l     2         3         4         5         6         7         8
  -5        0.003     0.003     0.002     0.002     0.002     0.002     0.002
  -7        0.019     0.029     0.055     0.055     0.038     0.017     0.017
 -10      < 0.001   < 0.001     0.008     0.008     0.013     0.019     0.019
 -12      < 0.001   < 0.001   < 0.001   < 0.001   < 0.001     0.001     0.002

Table 4.2: The p-values from the KS test applied to the dataset of the 5'L branchpoint distances shortened by thermodynamically stable zipper stems and the dataset of the 5'S branchpoint distances. The numbers in the first row are the values for the loop threshold (t_l) and the numbers in the first column are the values for the energy threshold (t_e). The p-values highlighted in boldface are greater than 0.05; for these t_e and t_l values the hypothesis that two compared datasets stem from the same distribution cannot be rejected at the standard significance level of α = 0.05.
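The stem-extension subroutine of Section 4.3.1 can be sketched as below. The sketch assumes a pseudoknot-free structure given in dot-bracket notation (as produced by folding programs such as RNAfold) and covers only the loop threshold test. The free energy calculation is left out because it requires the parameters of Turner's energy model. All function names are hypothetical.

```python
def pair_table(dot_bracket):
    """Convert dot-bracket notation to a list: pairs[k] = partner of k, or -1."""
    pairs = [-1] * len(dot_bracket)
    stack = []
    for k, symbol in enumerate(dot_bracket):
        if symbol == "(":
            stack.append(k)
        elif symbol == ")":
            m = stack.pop()
            pairs[m], pairs[k] = k, m
    return pairs


def extend_stem(pairs, i, j, loop_threshold):
    """Extend a stem inward from basepair (i, j); return its last basepair.

    Extension stops at a hairpin loop, at a multi-loop, or at an internal
    loop or bulge with more than loop_threshold free bases.
    """
    while True:
        p = i + 1
        while p < j and pairs[p] == -1:  # skip free bases on the 5' side
            p += 1
        if p >= j:
            return (i, j)  # hairpin loop: nothing left to extend into
        q = pairs[p]
        if any(pairs[k] != -1 for k in range(q + 1, j)):
            return (i, j)  # another helix branches off: multi-loop
        free_bases = (p - i - 1) + (j - q - 1)
        if free_bases > loop_threshold:
            return (i, j)  # internal loop or bulge is too large
        i, j = p, q
```

For the structure `((.((...)).))`, which contains a 2-base internal loop, a call with a loop threshold of 2 extends the stem from basepair (0, 12) through the loop to (4, 8), whereas a threshold of 1 stops at (1, 11).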
For most of the values for t_e and t_l, the p-values are less than 0.05, indicating statistically significant differences between the datasets of branchpoint distances. However, for t_e = -7 and for t_l ∈ {4, 5}, the p-value was 0.055, indicating that the hypothesis that the two compared datasets stem from the same distribution cannot be rejected.

Figure 4.14: Distribution histograms and cumulative distribution plots of the number of free bases in (a), (b) bulges and (c), (d) internal loops of naturally occurring RNA molecules (obtained from SSTRAND database).

In order to test the significance of these results, we also ran the algorithm on the two control datasets, described in Section 4.2.1. Similar to our analysis with the 5'L introns, we ran the algorithm for all possible pairs of values for the energy and loop threshold. For both control datasets and for any pair of t_e and t_l values the KS test rejected the null hypothesis at the significance level of 0.05 (the p-values are smaller than 0.001 for any t_e ∈ {-5, -7, -10, -12} and for any t_l ∈ {2, ..., 8}).

4.3.2 Multiple zipper stems

Analogous to our earlier phase of zipper stem analysis, we wanted to explore the possibility that there can be more than one zipper stem that would shorten the branchpoint distance. For this purpose, we modified our algorithm for identifying thermodynamically stable zipper stems to find all the stems that satisfy given thermodynamic criteria and that effectively shorten the distance between the donor site and the branchpoint sequence. The algorithm was run for the same values of t_e and t_l as before. The p-values calculated by the KS test are given in Table 4.3.

t_e \ t_l     2         3         4         5         6         7         8
  -5      < 0.001   < 0.001   < 0.001   < 0.001   < 0.001   < 0.001   < 0.001
  -7        0.011     0.007   < 0.001   < 0.001   < 0.001   < 0.001   < 0.001
 -10        0.001     0.001     0.218     0.370     0.575     0.543     0.543
 -12      < 0.001   < 0.001     0.002     0.002     0.009     0.032     0.042

Table 4.3: The p-values from the KS test applied to the dataset of the 5'L branchpoint distances shortened by multiple thermodynamically stable zipper stems and the dataset of the 5'S branchpoint distances. The numbers in the first row are the values for the loop threshold (t_l) and the numbers in the first column are the values for the energy threshold (t_e). The p-values highlighted in boldface are greater than 0.05, and for these t_e and t_l values the hypothesis that two compared datasets stem from the same distribution cannot be rejected at the standard significance level of α = 0.05.

As can be seen from this data, for t_e = -10 and t_l ∈ {4, ..., 8}, the algorithm finds zipper stems that shorten the 5'L branchpoint distance in such a way that the resulting shortened distances closely resemble the optimal distances found in 5'S introns. This is illustrated in Figure 4.15, which shows the distributions of the original branchpoint distances for the 5'S dataset and shortened branchpoint distances for the 5'L dataset, when t_e = -10 and t_l = 6. The distributions appear very similar, with almost identical ranges and modes around 50 nt. The similarity is even more apparent in the cumulative distribution plots (Figure 4.16), where the respective function curves are nearly identical (KS test: D = 0.105, p-value = 0.575). The algorithm still finds only one zipper stem in 79 introns and more than one zipper stem in the remaining 31 introns.

We also identified multiple thermodynamically stable zipper stems for the random and exonic control datasets and compared their shortened branchpoint distances to the 5'S branchpoint distances using the KS test. The test rejected the null hypothesis for all combinations of energy and loop thresholds at a significance level of 0.05 (p-value < 0.01).

Overall, the identification of zipper stems based on the thermodynamic approach yielded better results than the stem-length based approach. This is encouraging since the former one is based on sound thermodynamic principles.
We investigated the effects of single or multiple zipper stems on branchpoint distances in 5'L STRIN introns, and the results indicate that zipping the introns with multiple, more stable zipper stems, with minimum free energy ΔG(S) < -10 kcal/mol, yields better results. These results were compared with two control datasets, for which a similar effect of zipper stems on the analogue of branchpoint distances was not observed.

Figure 4.15: Distributions of (a) branchpoint distances for 5'S introns and (b) branchpoint distances for 5'L introns that were shortened by multiple thermodynamically stable zipper stems (t_e = -10 and t_l = 6).

Figure 4.16: Cumulative distributions of the 5' splice site - branchpoint distance for 5'S introns and 5'L introns zipped with one or more stems (t_e = -10 and t_l = 6).

4.4 Phylogenetic analysis of zipper stems

In order to investigate if the zipper stems that we found in 5'L introns of S. cerevisiae are evolutionarily conserved among the sensu stricto species, we need to determine secondary structures for the corresponding introns in S. paradoxus, S. mikatae, and S. bayanus, find potential zipper stems and examine if they are at the same location as the stems in S. cerevisiae.
Another approach is to use comparative (a.k.a. phylogenetic) analysis of structures based on covariation analysis and/or structural alignment. Covariation analysis of RNA secondary structure is based on the assumption that a mutation that disrupts the Watson-Crick base-pairing of a functionally important RNA stem has a deleterious effect, which may be overcome by a second, compensatory mutation that restores base-pairing. An example of compensatory mutations is given in Figure 4.17. There are several approaches for identification of conserved RNA secondary structures, which will be discussed in Section 4.4.2.

Comparative RNA structure analysis is usually crucially dependent on the multiple sequence alignment algorithm used, and often visual inspection is needed to recognize errors in the alignment. This is why, for the initial phylogenetic analysis, we selected a small subset of nine 5'L STRIN introns. These introns were previously proposed to contain a stem-loop structure between the 5' splice site and the branchpoint sequence (Parker and Patterson, 1987). The selected introns are given in Table 4.4.

Figure 4.17: An example of compensatory mutations. The mutations in the second sequence are compensatory since they maintain basepairing.
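The idea behind covariation analysis can be made concrete with a small sketch that compares the basepairs of a reference sequence with the aligned positions of a second sequence. It assumes gap-free aligned RNA sequences and, by default, accepts only Watson-Crick pairs; whether G-U wobble pairs should also count is a modelling choice left to the `allowed` argument. The names and the example sequences are hypothetical and do not reproduce Figure 4.17.

```python
WATSON_CRICK = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}


def classify_pairs(reference, other, pairs, allowed=WATSON_CRICK):
    """Classify each reference basepair (i, j) in the aligned sequence other.

    Returns one label per pair: "conserved" (same two bases), "compensatory"
    (both bases changed, pairing kept), "single change" (one base changed,
    pairing kept) or "disrupted" (the two bases can no longer pair).
    """
    labels = []
    for i, j in pairs:
        ref_pair = (reference[i], reference[j])
        new_pair = (other[i], other[j])
        if new_pair not in allowed:
            labels.append("disrupted")
            continue
        changed = sum(r != n for r, n in zip(ref_pair, new_pair))
        labels.append(("conserved", "single change", "compensatory")[changed])
    return labels
```

For the reference `GCAAAAGC` with basepairs (0, 7) and (1, 6), the aligned sequence `GUAAAAAC` keeps the first pair unchanged and replaces the C-G pair by U-A, which is the compensatory case. The sequence `GUAAAAGC` carries only one of the two changes and therefore disrupts the second pair.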
ORF name   Name in Parker and    Intron   Intron     Branchpoint
           Patterson (1987)      length   location   sequence
YCR031C    rp59                  307      8-314      UACUAAC
YDR447C    rp51B                 314      4-317      UACUAAC
YFL039C    ACT1                  308      11-318     UACUAAC
YGL030W    rp73                  230      4-233      UACUAAC
YGL103W    CYH2                  511      50-560     UACUAAC
YKL180W    rpL17A                306      310-615    UACUAAC
YML024W    rp51A                 398      4-401      UACUAAC
YNL301C    rp28B                 432      113-544    UACUAAC
YOL127W    rpL25                 414      14-427     UACUAAC

Table 4.4: Introns used for our comparative RNA structure analysis.

4.4.1 Comparative analysis by visual inspection

We obtained optimal structures for the nine selected introns from Table 4.4 and for each of the four species using the mfold Web server (Zuker, 2003). For each structure we marked stem regions that are located relatively close to the 5' splice site and the branchpoint, which makes them possible candidates for zipper stems (Figures 4.19 and 4.20 show an example of this approach applied to intron YCR031C). These complementary sequence regions were subsequently labeled in the multiple sequence alignments. Since the quality of multiple sequence alignments is essential for comparative RNA structure analysis, we did not want to be solely dependent on the ClustalW results provided by Kellis et al. (2003). Therefore, we used another popular alignment Web server, LAGAN (http://lagan.stanford.edu/lagan_web/index.shtml, last accessed in June 2006), from which we obtained multiple sequence alignments using the Multi-LAGAN program (Brudno et al., 2003).
An example of a labeled LAGAN multiple sequence alignment for intron YCR031C is shown in Figure 4.18.

The ClustalW and LAGAN alignments do not differ significantly; nevertheless, LAGAN produces slightly better results for our data: zipper stems were more easily detectable using these alignments, and a greater number of conserved stems was detected by the phylogenetic analysis based on LAGAN alignments. Thus, we mostly based our findings on the LAGAN alignments.

Inspecting the labeled alignments, we looked for sequence regions where the labels overlap, i.e., where each column in the alignment is labeled for each of the species. If they exist and are complementary, these regions form a stem which is conserved in sensu stricto species (black boxes in Figure 4.18). This stem is selected as a potential zipper stem.

Figure 4.18: LAGAN multiple sequence alignment for intron YCR031C. Potential zipper stem regions are highlighted in S. cerevisiae, S. paradoxus, S. mikatae and S. bayanus sequences. Black boxes indicate the location of the conserved zipper stem.

Figure 4.19: Minimum free energy secondary structures for intron YCR031C in S. cerevisiae and S. paradoxus. The 5' splice site, branchpoint and potential stem region are annotated for each structure. Conserved zipper stems found by comparative analysis are magnified and shown in boxes. Basepairs conserved among all four species are highlighted.

Figure 4.20: Minimum free energy secondary structures for intron YCR031C in S. mikatae and S. bayanus. The 5' splice site, branchpoint and potential stem region are annotated for each structure. Conserved zipper stems found by comparative analysis are magnified and shown in boxes. Basepairs conserved among all four species are highlighted.

Conserved zipper stems were found for five introns: YCR031C, YFL039C (2 stems), YGL030W, YGL103W, and YOL127W. Introns YKL180W and YML024W have a conserved zipper stem between S. cerevisiae and S. bayanus, but slightly different base-pairing for S. paradoxus (YKL180W) and S. mikatae (YML024W). However, when we checked the suboptimal structure predictions for the two disagreeing species, we were able to find structures that have stems at the same location as for the other two species. Intron YDR447C has a conserved zipper stem between S. cerevisiae and S. paradoxus, but slightly different base-pairing for S. bayanus, which remained consistent even for the suboptimal structures (we looked only at the suboptimal structures within 5% from the MFE). Finally, for intron YNL301C we could not find a conserved zipper stem; nevertheless, complementary sequences close to the 5' splice site and the branchpoint sequence were found for each species.
The stems that we identified as conserved do not necessarily have to have conserved basepairing, although this is mostly the case. Since the zipper stem hypothesis states that the role of a zipper stem is to shorten the branchpoint distance to one that is optimal for splicing, the exact basepairing within the stem or the stem structure itself (internal loops and bulges) is not essential. Even if the stem's location is not the same but somewhat shifted, the stem's function would be unchanged. Therefore, the fact that a conserved stem was not found among all four species does not contradict the zipper stem hypothesis, especially considering that we are dealing with imperfect multiple sequence alignments and imperfect secondary structure predictions.

4.4.2 Comparative analysis using programs for comparative RNA structure prediction

There are a number of programs available for comparative RNA structure prediction, and they usually implement one of three basic approaches. In the case of relatively high sequence similarity among related RNA sequences, when it is possible to compute a good quality multiple sequence alignment, covariance or compensatory mutation analysis is used to process the alignments and predict common structures for aligned sequences. Examples of the algorithms from this class are Alidot (Hofacker et al., 1998), Pfold (Knudsen and Hein, 1999), ConStruct (Luck et al., 1999), Pfrali (Hofacker and Stadler, 1999) and Alifold (Hofacker et al., 2002). In the case when RNA sequences are more diverged and a reliable multiple sequence alignment cannot be achieved, the secondary structures of the sequences are predicted independently and then structurally aligned. An example of a program that uses this approach is RNA Forester, which uses a tree representation of predicted RNA structures to align them by applying a generalization of sequence alignment techniques (Hochsmann et al., 2003). The third class of comparative RNA structure prediction tools attempts to align given RNA sequences while simultaneously predicting their folding and common structure. This approach is preferred for computing a sequence alignment of RNA sequences since it uses secondary structure information, unlike ClustalW, LAGAN or other multiple alignment programs, which are based solely on primary sequences. It is a well-known biological phenomenon that structure is better conserved than sequence, especially for functional RNAs, and taking the structural information into account is essential for accurate alignment of these sequences. However, computing alignment and secondary structure simultaneously is very computationally expensive, which limits the number and length of sequences that can be aligned. Examples of the third class of programs are Foldalign (Gorodkin et al., 1997), Dynalign (Mathews and Turner, 2002) and Carnac (Touzet and Perriquet, 2004).
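To make the first approach more concrete, the following is a minimal Python sketch of a compensatory mutation check on one pair of alignment columns. It is an illustration only and does not reproduce the scoring used by Alidot, Pfold, ConStruct, Pfrali or Alifold; the function name column_pair_support and the toy alignment are assumptions introduced here. A column pair supports a conserved basepair when every sequence can form one of the six standard RNA basepairs at that position, and the support is stronger when several different pair types occur.

```python
# The six standard RNA basepairs (Watson-Crick pairs plus the G-U wobble pair).
CANONICAL = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def column_pair_support(alignment, i, j):
    """Summarise how well alignment columns i and j (0-based) support a basepair.

    Returns (number of distinct canonical pair types observed,
             number of sequences that cannot form a canonical pair).
    """
    pair_types = set()
    n_inconsistent = 0
    for seq in alignment:
        # Accept DNA-style input by treating T as U.
        x = seq[i].upper().replace("T", "U")
        y = seq[j].upper().replace("T", "U")
        if (x, y) in CANONICAL:
            pair_types.add((x, y))
        else:
            n_inconsistent += 1  # gap, mismatch or non-canonical pair
    return len(pair_types), n_inconsistent


# Hypothetical toy alignment: columns 0 and 4 covary, columns 1 and 3 never pair.
toy_alignment = ["GAAAC", "AAAAU", "CAAAG", "GAAAU"]
print(column_pair_support(toy_alignment, 0, 4))  # (4, 0)
print(column_pair_support(toy_alignment, 1, 3))  # (0, 4)
```

In this toy case, columns 0 and 4 show four different pair types with no inconsistent sequence, which is the pattern a compensatory mutation analysis treats as evidence for a conserved basepair.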
The algorithms for comparative RNA structure analysis can also be differentiated depending on whether they compute globally or locally conserved structure of the input RNA sequences. The majority of the programs mentioned attempt to predict a common global secondary structure (Pfold, ConStruct, Alifold, RNA Forester, Dynalign), while others predict only locally conserved structures.

Since in our study we have relatively reliable intron alignments of four closely related Saccharomyces sensu stricto species, we used the tools that rely on these multiple sequence alignments. We tested the following programs on our 9-intron dataset: Pfold and Alifold, which predict a consensus structure for all sequences in the input sequence alignment, and Alidot and Pfrali, which detect conserved substructures. Since zipper stems are long-range structural motifs, we should be able to detect them either in a global consensus structure of intron sequences or as a conserved stem.

Pfold uses context-free grammars to predict a common structure for a given alignment of RNA sequences. It does not look for compensatory mutations directly, but estimates a phylogenetic tree from the alignment and uses it for maximum a posteriori approximation. The Web version of Pfold can be used for a set of at most 40 RNA sequences, with a length limitation of 500 nt. The Web server can be accessed at http://www.daimi.au.dk/~compbio/rnafold/ (last accessed in July 2006).
Alifold is a part of the Vienna package for RNA secondary structure prediction (Hofacker, 2003). It uses a modified energy model that integrates thermodynamic and phylogenetic information. It takes a ClustalW multiple sequence alignment as its input and computes a consensus structure of the sequences. The maximum size of the input file is 10 KB. The Web server can be accessed at http://rna.tbi.univie.ac.at/cgi-bin/alifold.cgi (last accessed in July 2006). We also used a local copy of the program from version 1.5 of the Vienna package.

Alidot is also one of the Vienna package structure prediction tools. It starts with a ClustalW alignment of sequences and independently computes the MFE secondary structure of each of them. Using both the sequence alignment and the MFE structure predictions, Alidot aligns the structures and produces a set of candidate basepairs. These basepairs are further filtered to exclude any inconsistencies and to check for compensatory mutations.

Pfrali is yet another RNA structure prediction tool contained in the Vienna package. The approach is very similar to Alidot's, with the difference that Pfrali uses basepairing probabilities obtained from McCaskill's partition function algorithm (McCaskill, 1990) instead of MFE structure predictions. Since the basepairing probabilities contain information about a large number of plausible structures, this approach is less likely to miss any conserved structural elements.
Each of the nine introns, with its corresponding LAGAN alignment, was processed by Pfold, Alifold, Alidot and Pfrali. The first two algorithms give the common secondary structure for all of the sequences in the input alignment. The structure is given in dot-bracket notation and, for Alifold, also in conventional graphical representation. Examples of results for intron YCR031C are shown in Figures 4.21 and 4.22. Alidot and Pfrali have identical output formats: both list all candidate basepairs, and the final secondary structure, containing only the basepairs that pass all the filtering steps, is given in dot-bracket notation. For each basepair, the following information is given: the location of the interacting bases, the number of sequences in the multiple sequence alignment in which the listed basepair is not one of the six standard RNA basepairs, and an indicator that shows whether the basepair conflicts with a basepair that is predicted with a higher confidence. Some additional statistics are given, but we use none of these for our analysis. The final structure given in dot-bracket notation is an ensemble of all basepairs that are not labeled as conflicting; however, some of them may be supported by only one sequence in the alignment. For our study, we wanted to have better basepair support and thus considered only basepairs that were inconsistent with at most one sequence.
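The filtering described above can be sketched in a few lines of Python. This is a hedged illustration, not the code used in the study: the CandidatePair record is a hypothetical stand-in for one entry of the candidate basepair list and does not reproduce the actual Alidot or Pfrali output format. The sketch also assumes that the retained basepairs are nested (no pseudoknots), so that they can be written in dot-bracket notation.

```python
from typing import List, NamedTuple


class CandidatePair(NamedTuple):
    """Hypothetical record for one candidate basepair."""
    i: int                # 1-based alignment column of the 5' base
    j: int                # 1-based alignment column of the 3' base
    n_inconsistent: int   # sequences that cannot form a standard basepair here
    conflicting: bool     # True if it clashes with a higher-confidence basepair


def filter_pairs(pairs: List[CandidatePair], max_inconsistent: int = 1) -> List[CandidatePair]:
    """Keep non-conflicting basepairs inconsistent with at most max_inconsistent sequences."""
    return [p for p in pairs
            if not p.conflicting and p.n_inconsistent <= max_inconsistent]


def to_dot_bracket(pairs: List[CandidatePair], length: int) -> str:
    """Write the kept basepairs as a dot-bracket string of the given alignment length."""
    structure = ["."] * length
    for p in pairs:
        structure[p.i - 1] = "("
        structure[p.j - 1] = ")"
    return "".join(structure)


# Hypothetical candidates on a 10-column alignment.
candidates = [
    CandidatePair(1, 10, 0, False),  # kept: fully consistent
    CandidatePair(2, 9, 1, False),   # kept: inconsistent with one sequence
    CandidatePair(3, 8, 2, False),   # dropped: inconsistent with two sequences
    CandidatePair(4, 7, 0, True),    # dropped: labeled as conflicting
]
kept = filter_pairs(candidates)
print(to_dot_bracket(kept, 10))  # ((......))
```

Raising max_inconsistent keeps more basepairs and therefore longer stems, at the cost of weaker support from the alignment.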
The output of this post-processing step is a secondary structure in dot-bracket notation that contains all locally conserved substructures.

Figure 4.21: Pfold result for the alignment of the YCR031C intron. The first line is a common structure for all the sequences. The individual structures are found by applying the common structure to each sequence and extending stems, if possible. The last line indicates the reliability of prediction for each nucleotide in the alignment. The stem that satisfies the requirements for a zipper stem is enclosed in a box.

Figure 4.22: Alifold result for the alignment of the YCR031C intron. The output contains the consensus sequence and the optimal consensus structure in dot-bracket notation followed by its energy. A graphical representation of the structure is also given. The stem that satisfies the requirements for a zipper stem is enclosed in a box.

For each intron and for each program's predictions, we manually searched for a conserved stem that would bring the 5' splice site and the branchpoint sequence into closer proximity. Potential stems were annotated and compared to the findings of the comparative analysis based on visual inspection. A summary of the results from this analysis is given in Table 4.5. The first column of the table shows if a conserved zipper stem was found by visual inspection; in one case where conservation was found only between two species, those species are given. For each of the programs, two columns are provided: the first of these indicates if a program found a stem that satisfies our zipper stem requirements and also shows the number of these stems in parentheses if it is greater than one. The second column indicates if a predicted stem is present in the minimum free energy structure (obtained by mfold) of each of four species (this is the stem found by visual inspection). The four programs we used predict only uninterrupted stems, while zipper stems found by visual inspection can contain internal loops or bulges. This sometimes results in a zipper stem found by visual inspection matching two or more stems predicted by the programs. If a predicted stem was found only in some of the species, the species containing it are indicated. Pfold found one or more potential zipper stems in 8 introns.
One of the introns (YGL103W) was over the Pfold length limit. Alidot also identified zipper stems in 8 introns, failing to identify any conserved stem region for intron YNL301C. Alifold and Pfrali found one or more stems for all of the introns. For introns YCR031C, YGL030W, YGL103W, YKL180W and YOL127W all of the programs identified the zipper stem that was found by visual inspection. In three cases (YDR447C, YFL039C, YML024W) two of four programs did not confirm the stem found by visual inspection. The reason for this might be either that the stems identified by visual inspection did not have conserved basepairing but just overlapping locations or that the programs that missed them did so due to their specific weaknesses (since the stems were identified by the other two programs). It is interesting that all programs except Alidot found a relatively long stem in YNL301C that was not identified by visual inspection. The identified stem was present only in the MFE prediction for S. mikatae, while the MFE predictions for the other three species do not contain it. Considering that the stem seems to be relatively stable, it is very likely that it is contained in some of the suboptimal predictions for the other species. The zipper stems identified by our analysis shortened the branchpoint distances to 46-81 nt. These distances correspond well to the distances observed for 5'S introns (Figure 4.3(a)).
We believe that the results presented in this section are very encouraging, providing further support for the existence of zipper stems. For seven 5'L introns in our test dataset zipper stems were conserved among all four species; for one intron (YDR447C), conservation was observed only between the two closest species, S. cerevisiae and S. paradoxus. The high level of conservation observed for detected stems suggests functional significance. Alifold and Pfrali, which are based on compensatory mutations, were able to find conserved stems satisfying zipper stem requirements for all nine introns.

intron    zipper stem   Pfold                   Alifold                 Alidot                  Pfrali
          found by v.i. found     in MFE struct found     in MFE struct found     in MFE struct found     in MFE struct
YCR031C   +             +         +             +         +             +         +             +(2)      +
YDR447C   S.cer,S.par   +         -             +(2)      +             +         -             +         +
YFL039C   +             +         +             +(2)      -             +         -             +         +
YGL030W   +             +         +             +         +             +         +             +         +
YGL103W   +             too long  n/a           +(3)      +             +(3)      +             +(3)      +
YKL180W   +             +         S.par         +         +             +         S.par         +(3)      +
YML024W   +             +         -             +         -             +(2)      +             +(3)      S.cer,S.mik
YNL301C   -             +         S.mik         +(3)      S.mik         -         -             +(2)      S.mik
YOL127W   +             +(2)      +             +(3)      +             +(2)      S.cer,S.mik   +(3)      +

Table 4.5: Results of comparative RNA structure analysis. Details are given in the text (S.cer = S. cerevisiae, S.par = S. paradoxus, S.mik = S. mikatae, S.bay = S. bayanus).

4.4.3 Comparative structure analysis on STRIN 5'L introns

Taking into account what we learned on the small, 9-intron test dataset, we performed an automated phylogenetic analysis on the introns from our phylogenetic dataset (Section 3.3).
Based on Table 4.5, it seems that among the programs we tested Alifold and Pfrali were best suited for detection of conserved potential zipper stems. Both programs were able to predict one or more potential zipper stems in all of the nine introns, and in the majority of cases these stems overlapped the zipper stems identified by visual inspection. Each of these programs has some drawbacks. Alifold's prediction does not exclude basepairs that are inconsistent with some of the sequences in the input alignment, but keeps them as possible cases of sequencing or alignment errors or non-canonical basepairs (Hofacker et al., 2002). Thus, some of the zipper stems identified by Alifold might not be conserved in all four species. Pfrali also includes basepairs that are inconsistent with some of the sequences in the input alignment; however, this information is part of its output and we used it to post-process the final prediction to include only basepairs that are inconsistent with at most one sequence. The post-processing step introduces more free bases in the final structure prediction and results in shorter stems with fewer consecutive basepairs.

For the identification of conserved zipper stems using Alifold and Pfrali, we selected 49 STRIN 5'L introns for which we have all Saccharomyces sensu stricto sequences aligned (see Section 3.3). The more sequences there are in the input multiple sequence alignment, the better the program's performance. (For optimal performance, Hofacker et al. (2004) suggest at least 5 sequences with pairwise sequence identity of around 80%.)
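Assuming that the 80% figure quoted above refers to pairwise sequence identity, the following minimal Python sketch shows one way to check an alignment against such a guideline. Conventions for the denominator differ between tools; here, as an assumption, columns in which either sequence has a gap are skipped. The function names and the toy sequences are introduced for illustration only.

```python
from itertools import combinations


def pairwise_identity(a, b, gap="-"):
    """Fraction of identical residues over columns where neither sequence has a gap."""
    matches = 0
    compared = 0
    for x, y in zip(a, b):
        if x == gap or y == gap:
            continue  # assumption: gap columns do not count towards identity
        compared += 1
        if x.upper() == y.upper():
            matches += 1
    return matches / compared if compared else 0.0


def mean_pairwise_identity(alignment):
    """Average pairwise identity over all pairs of aligned sequences."""
    pairs = list(combinations(alignment, 2))
    return sum(pairwise_identity(a, b) for a, b in pairs) / len(pairs)


# Hypothetical aligned fragments: 6 of 7 ungapped columns are identical.
print(round(pairwise_identity("GTATGT-A", "GTACGTTA"), 3))  # 0.857
```

With only four sensu stricto sequences per intron, such a check mainly indicates how far a given alignment is from the conditions under which the programs are expected to perform best.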
We used LAGAN to align the sequences since for our data it produced slightly better alignments than ClustalW. LAGAN alignments had to be reformatted into ClustalW format, since the latter is the only acceptable input format for the programs we used. The alignments were processed by Alifold and Pfrali, which produced a common secondary structure (global for Alifold and local for Pfrali) for all of the sequences in the alignment. The secondary structure was further processed using the algorithm described in Section 4.3.1, which identifies a candidate zipper stem as the stem that maximally zips the intron and calculates the shortened branchpoint distance. Based on some empirical testing, we chose the algorithm's input parameters to be te = -4 and ti = 5, which ensures that the minimum number of consecutive basepairs in a stem is three (often found in Pfrali predictions).

Of the 49 common secondary structures predicted by Alifold, our algorithm found potential zipper stems in 48. Of the 97 Pfrali predictions, stems were found in only 29 cases, due to the previously mentioned post-processing step that we applied to eliminate basepairs that were inconsistent with more than one sequence in the alignment (without this post-processing, stems were found in 45 cases). The distributions of shortened branchpoint distances for introns where zipper stems were found are shown in Figure 4.23; the cumulative distribution plots are shown in Figure 4.24.
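The idea of selecting the stem that maximally zips the intron can be illustrated with a simplified Python sketch. It is not the algorithm of Section 4.3.1: the thresholds te and ti are not modelled, a stem is taken to be a run of at least three consecutive stacked basepairs, and the shortened distance is assumed, for illustration only, to be the number of bases from the 5' splice site to the outermost basepair of the stem, plus one step across that basepair, plus the number of bases from there to the branchpoint. All function names are hypothetical.

```python
def pair_table(db):
    """Map the 0-based index of each 5' base to its 3' partner in a dot-bracket string."""
    stack, pairs = [], {}
    for k, c in enumerate(db):
        if c == "(":
            stack.append(k)
        elif c == ")":
            pairs[stack.pop()] = k
    return pairs


def uninterrupted_stems(db, min_pairs=3):
    """Return (i, j, n): outermost basepair (i, j) and length n of each stacked run."""
    pairs = pair_table(db)
    stems = []
    for i in sorted(pairs):
        j = pairs[i]
        if pairs.get(i - 1) == j + 1:
            continue  # (i, j) is stacked inside another pair, so not outermost
        n = 1
        while pairs.get(i + n) == j - n:
            n += 1
        if n >= min_pairs:
            stems.append((i, j, n))
    return stems


def shortened_distance(db, branchpoint, min_pairs=3):
    """Shortest assumed 5' splice site (index 0) to branchpoint distance over all stems."""
    best = branchpoint  # distance when no stem zips the intron
    for i, j, _ in uninterrupted_stems(db, min_pairs):
        if j < branchpoint:  # the whole stem must lie upstream of the branchpoint
            best = min(best, i + 1 + (branchpoint - j))
    return best


# Hypothetical 19-nt structure: one 3-bp stem closing a 7-nt loop.
toy = ".." + "(((" + "." * 7 + ")))" + "." * 4
print(shortened_distance(toy, branchpoint=18))  # 7
```

In this toy structure the unzipped distance of 18 is reduced to 7 because the path from the 5' splice site to the branchpoint crosses the outermost basepair instead of following the 13 enclosed bases.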
When we compare these distributions with the distribution of branchpoint distances for short STRIN introns (Figure 4.15(a)), we observe that the majority of computed branchpoint distances fall within the distance range found for S. cerevisiae 5'S introns. For each program there are several exceptions where the branchpoint distances are either longer or shorter than for the 5'S introns. Zipper stems predicted by Pfrali, which are more reliable than those predicted by Alifold since they have to be conserved between at least three species, are better at shortening the branchpoint distance to the optimal range. This suggests that some of the shortened branchpoint distances calculated based on Alifold's predictions that are outside the optimal range might be a result of spurious stems conserved only between two species.

Figure 4.23: Distributions of shortened branchpoint distances for 5'L STRIN introns where the analyzed secondary structure was the consensus structure for all sensu stricto species. The consensus structures were produced by (a) Alifold and (b) Pfrali.

Figure 4.24: Cumulative distributions of shortened branchpoint distances for 5'L STRIN introns where the analyzed secondary structure was the consensus structure for all sensu stricto species. The consensus structures were produced by (a) Alifold and (b) Pfrali.

The results of this larger-scale phylogenetic analysis argue in favor of evolutionarily conserved zipper stems among Saccharomyces sensu stricto species. In a majority of cases, the conserved stems that were found by this fully automated comparative structure approach reduced branchpoint distances to values thought to be optimal for splicing. There are several possible explanations why this approach failed to identify conserved zipper stems for all of the considered introns (Pfrali) or identified the ones that might not be the real zipper stems, resulting in shortened branchpoint distances outside the optimal range. First, both programs are heavily dependent on the underlying multiple sequence alignments, which may contain errors. Second, both programs depend on RNA secondary structure prediction methods that have limited accuracy (Mathews et al., 1999; Gutell et al., 2002). Errors in either multiple sequence alignment or secondary structure prediction would result in faulty consensus structure prediction. Another possible explanation is that in some cases approximate stem locations are conserved rather than the exact basepairing interactions; Alifold and Pfrali would fail to identify these stems.
As discussed before, the zipper stem hypothesis states that the role of a zipper stem is to shorten the branchpoint distance to one that is optimal for splicing; thus the exact basepairing within the stem is not essential. Even if the stem's location is not the same but somewhat shifted, the stem's function would be unchanged.

4.5 Conclusions

The analyses in this chapter were motivated by biological studies that identified long-range basepairing interactions in a number of long introns that result in shortening of the distance between the 5' splice site and the branchpoint sequence. It is believed that the relatively short branchpoint distance is necessary for spliceosomal assembly and efficient splicing of introns. To complement these biological studies we conducted a computational analysis of yeast introns, which can be classified into two groups based on their branchpoint distance: short (5'S) and long (5'L). The current hypothesis is that the branchpoint distances in the 5'S introns are optimal for splicing and that 5'L introns achieve this optimal distance by secondary structure formation. Our computational analysis focused on detecting stems in secondary structures of the 5'L introns that would shorten the branchpoint distance to be within the optimal range.

Zipper stems were found in all of the 5'L introns, which is not surprising considering the strong tendency of RNA sequences to form basepairing interactions. However, the shortened branchpoint distances of zipped long introns are distributed similarly to the branchpoint distances of short (5'S) yeast introns and differently than corresponding distances of zipped random and exonic sequences. The distribution modes for 5'S and 5'L introns are located around 50 nt, while the corresponding distances for zipped random and exonic sequences are uniformly distributed, without prominent modes. This was the case for all types of zipper stems that we considered. One possible implication of the mode phenomenon is that the optimal branchpoint distance for splicing of yeast introns is about 50 nt, which in long introns, as our results show, is achieved by shortening of the original distance by formation of zipper stems.

With further refinement of the zipper stem identification process, by considering only thermodynamically stable stems with limited internal loop sizes, the distributions of shortened branchpoint distances for 5'L introns and original branchpoint distances for 5'S introns became almost identical. In contrast, the distributions for random and exonic sequences were significantly different.

We also performed comparative structure analysis to analyze conservation of zipper stems among closely related yeast species. A careful manual analysis on a sample intron dataset of 9 introns found zipper stems that were conserved between four sensu stricto species.
This is a significant result considering that the sequence conservation is not very high within the introns (50-74%). A similar, more automated analysis on a larger set of STRIN introns identified conserved zipper stems in almost all of them. The resulting shortened branchpoint distances fall within the optimal range observed in 5'S introns. Evolutionary conservation of zipper stems gives further support to their functional significance.

Chapter 5

Splicing efficiency and branchpoint distance

The zipper stem analysis in Chapter 4 provided evidence that the long branchpoint distances in STRIN 5'L introns can be shortened to distances presumed to be optimal for spliceosome assembly by one or more stem structures. The more or less ubiquitous presence of zipper stems in the secondary structures of long introns suggests that they might be important for efficient splicing of these introns. We investigate this hypothesis for the case of the RP51B intron and a number of its mutants, for which the experimentally measured splicing efficiency results are correlated with the branchpoint distances shortened by the zipper stems.

As we will show in this chapter, the shortened branchpoint distances obtained using the approaches discussed in Chapter 4 do not explain the observed splicing efficiency results.
This prompted us to modify our model of the role of intronic pre-mRNA secondary structure in splicing by considering not only the MFE structure prediction of intron sequences but also near-optimal predictions, and by refining our calculation of shortened branchpoint distances. The refined distance calculation takes into account the entire structure of the intron, eliminating the need to search for a particular zipper stem. The zipper stem criteria are thus effectively relaxed, allowing complex stem structures and positioning of the 3' constituent of the stem downstream from the branchpoint sequence.

The refined model is shown to be in better agreement with RP51B experimental results, suggesting its potential to identify introns that have optimal structural conformation for splicing. We test the new approach on STRIN long introns and random sequences with similar sequence characteristics as 5'L introns.

Finally, we describe the design of several new RP51B mutants based on the refined model of the role of intronic pre-mRNA secondary structure in splicing. We use these mutants to test our computational findings by biological laboratory experiments.

5.1 Experimental results for the RP51B intron

As briefly mentioned in Section 3.2.3, the pre-mRNA of the S. cerevisiae ribosomal protein rp51b has been used extensively for the analysis of secondary structure within introns and of its role in intron splicing.
In 1993, Goguel and Rosbash observed that a secondary structure interaction between two sequence segments located downstream of the 5' splice site and upstream of the branchpoint sequence promotes efficient splicing of the RP51B pre-mRNA. To further test the importance of this secondary structure in splicing, Libri et al. (1995) employed a copper resistance gene (CUP1), whose expression is dependent on splicing, as a reporter gene. They inserted the RP51B intron into the coding region of CUP1, which is otherwise intronless. Interrupted in this way, the cup1 protein is going to be functional, i.e., a yeast cell is going to be viable in the copper-containing medium, only if excision of the RP51B intron is successful.

In order to test the sensitivity of splicing to alterations in the stem as proposed by Goguel and Rosbash, Libri et al. introduced mutations in the interacting regions designated UB1 (upstream box 1) and DB1 (downstream box 1). They created 9 mutants: 3mUB1 (3 nt mutation), 4mUB1 (4 nt), 5mUB1 (5 nt), 6mUB1 (6 nt) and 8mUB1 (8 nt), where mutations fall in the UB1 region; 3mDB1 (3 nt) and 5mDB1 (5 nt), where mutations fall in the DB1 region and are compensatory mutations to the mutations in 3mUB1 and 5mUB1, respectively; and 3mUB1/3mDB1 and 5mUB1/5mDB1, which are double mutants. All of the single mutants are expected to disrupt the secondary structure, while the double mutants are predicted to restore it.
The copper sensitivity assay showed that for all single mutants except 8mUB1, splicing was reduced. Surprisingly, 8mUB1 had a similar level of splicing as the wildtype intron. Out of the two double mutants, 5mUB1/5mDB1 was able to partially rescue splicing, while for 3mUB1/3mDB1 splicing was severely inhibited. These unexpected results were suggested to be the result of some secondary structure rearrangements; however, the secondary structure of the mutants 8mUB1 and 3mUB1/3mDB1 was not further explored.

Another interesting observation that emerged from the Libri et al. study is that UB1 and DB1 are not the only sequence segments in the RP51B intron whose interaction can facilitate pre-mRNA splicing: if the wildtype interaction was disrupted, splicing could be restored by alternative basepairing, where the mutated UB1 sequence would pair with another block of sequence upstream or downstream from the branchpoint sequence. One of the alternative DB1 blocks was even found downstream from the branchpoint sequence. The stem formed using this DB1 sequence still shortens the distance between the 5' splice site and the branchpoint sequence, indicating that the existence, rather than the exact sequence constituents, of the stem is essential for efficient splicing of the RP51B intron.

In a follow-up study, Charpentier and Rosbash (1996) attempted to answer some of the important questions regarding secondary structure in the RP51B intron.
Using newly designed mutants of the RP51B intron, they investigated at which step of splicing the stem has a functional role. They concluded that this happens at the time of commitment complex formation, i.e., the recognition of the donor site by the U1 snRNA and its consequent binding to it. The authors propose that the stable stem contributes to this complex formation.

Charpentier and Rosbash also performed structural probing of the stem, whose secondary structure had been predicted only computationally. They confirmed that basepairing interactions between the UB1 and DB1 regions are indeed formed in vitro and in vivo, and that the structure of the stem formed is close to the one predicted (see Figure 5.1). The exact nature of the interaction is still not known since the techniques used for structural probing are not very precise. Their mutational analysis also confirmed that the effects on splicing in vivo were parallel to the effect observed in vitro, which is an important indication that secondary structure plays an essential role in splicing of the RP51B intron in nature.

Figure 5.1: Reproduction of Figure 1B from Charpentier and Rosbash (1996): putative interaction between the UB1 and DB1 regions.

Even though these studies have shown that the stem structure in the RP51B intron is essential for pre-mRNA splicing, they were unable to explain its real function. Libri et al. (1995) and Charpentier and Rosbash (1996) maintained the hypothesis suggested in Parker and Patterson (1987) that the structure might serve to reduce the distance between the 5' splice site and the branchpoint sequence to a distance optimal for spliceosomal interactions and pre-mRNA splicing.

5.2 Structural and branchpoint distance analysis of RP51B mutants

As mentioned earlier, splicing efficiency results for some of the RP51B mutants were different than expected. The assumption behind the mutant design in Libri et al. (1995) and Charpentier and Rosbash (1996) was that any mutation within the zipper stem, a stem bringing the donor and branchpoint sequence closer together, would disrupt the stem and change the intron secondary structure in such a way that the resulting branchpoint distance would be greater than for the wildtype intron. However, the resulting structures and branchpoint distances were never tested experimentally or computationally. This prompted us to compute intron secondary structures and shortened branchpoint distances of these mutants in an attempt to explain the experimental splicing efficiency results.

mutant         d_{length=5}   d_{t_e=-10,t_l=6}   splicing efficiency
wt             10             55                  efficient
3mUB1          42             42                  slightly reduced
5mUB1          42             42                  slightly reduced
8mUB1          42             42                  efficient
3mDB1          42             42                  inhibited
5mDB1          133            46                  inhibited
3mUB1/3mDB1    42             42                  inhibited
5mUB1/5mDB1    133            83                  slightly reduced
6mUB1          188            64                  inhibited
4mUB1          188            86                  reduced

Table 5.1: Correlation of the shortened branchpoint distance (d) with splicing efficiency for Libri's mutants. Shortened branchpoint distances were calculated by two versions of algorithms for zipper stem identification: one uses the stem length to select stable zipper stems (first column), and the other uses thermodynamics criteria for stem selection (second column). Levels of splicing efficiency were inferred from Libri et al. (1995).

We first employed our algorithms for zipper stem identification, which we described in Chapter 4, to calculate the shortened branchpoint distance for the wildtype RP51B gene and for all of the mutants described in Libri et al. (1995). We ran both versions of our algorithm, one of which uses stem length to select stable zipper stems and the other of which uses stem thermodynamics for the selection. In both cases, we ran the respective algorithm with the parameters that have been shown to produce the best results, as explained in Chapter 4. For the first version of the algorithm, only stems longer than or equal to 5 bp were considered; for the thermodynamics version, the energy threshold (t_e) was set to -10 kcal/mol and the loop threshold (t_l) was set to 6.
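As a rough illustration of the two selection criteria just described, the following Python sketch filters a list of candidate stems. The `Stem` record, its field names and the example values are hypothetical and are not taken from the thesis; only the thresholds (stems of at least 5 bp, or a stem free energy of at most -10 kcal/mol with a loop threshold of 6) come from the text. Reading the loop threshold as a maximum internal loop size is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Stem:
    """Hypothetical summary of one candidate zipper stem."""
    length_bp: int          # number of basepairs in the stem
    energy: float           # stem free energy in kcal/mol (more negative = more stable)
    max_internal_loop: int  # largest internal loop or bulge in the stem, in nt


def select_by_length(stems, min_length=5):
    """Length-based version: keep stems of at least `min_length` basepairs."""
    return [s for s in stems if s.length_bp >= min_length]


def select_by_thermodynamics(stems, t_e=-10.0, t_l=6):
    """Thermodynamics-based version: keep stems at least as stable as t_e
    whose largest internal loop does not exceed t_l (assumed interpretation)."""
    return [s for s in stems if s.energy <= t_e and s.max_internal_loop <= t_l]


# Made-up candidates, for illustration only.
candidates = [Stem(4, -3.2, 0), Stem(7, -11.5, 2), Stem(12, -14.0, 9)]
by_length = select_by_length(candidates)
by_energy = select_by_thermodynamics(candidates)
```

With these made-up candidates, the length criterion keeps the second and third stems, while the thermodynamic criterion keeps only the second one, since the third has an internal loop larger than the threshold.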
Table 5.1 shows the shortened branchpoint distances calculated by our two algorithms along with the results of the splicing efficiency assay performed by Libri et al. (1995). The splicing efficiency levels were inferred from Figure 2(C) in Libri et al. (1995) (reproduced in Figure 5.2), which shows the results of the copper growth assay for wildtype and the mutants. Figure 5.2 shows pictures of three plates containing different concentrations of copper sulfate (CuSO4), where mutant cells have been growing in colonies (white spots). The size and intensity of the white spots indicate colony size and the level of copper resistance. Unfortunately, this type of measurement is not very precise and cannot be accurately quantified, so we approximated it with four different levels of splicing efficiency: efficient, slightly reduced, reduced and inhibited.

Figure 5.2: Reproduction of Figure 2(C) from Libri et al. (1995): copper growth assay from Libri's mutants. In the upper left corner of the figure is a schematic drawing showing the locations of the mutants. CuSO4 concentrations (in 10^-3 moles/litre) are indicated under each panel.

It is clear from Table 5.1 that the shortened branchpoint distances calculated by our zipper stem prediction algorithms do not explain the splicing efficiency levels for the Libri mutants. For example, the rows for 5mUB1, 8mUB1 and 3mDB1 have the same shortened distance of 42 for both versions of the algorithm, but different splicing efficiency levels.
One potential problem with our approach is that we assume that the secondary structure of an intron is independent of its flanking exonic or 5' UTR sequences (it is often the case that introns in S. cerevisiae are located right after the first codon). This is most probably an unrealistic assumption, since an intron is just a part of pre-mRNA, which, as with any RNA molecule, will tend to fold into its minimum free energy structure. However, pre-mRNAs are not free molecules but associate with many different proteins, protein complexes and other RNAs. It has been shown that a large number of processing factors associate co-transcriptionally with nascent RNA: 5' end-capping processing factors associate with the emerging transcript when it is only 20-40 bp long (Neugebauer, 2002), spliceosomal RNAs (U1, U2, U4, U5 and U6 snRNPs) along with a large number of splicing factors are co-transcriptionally recruited (Gornemann et al., 2005; Kotovic et al., 2003), and 3' end cleavage and polyadenylation factors are also bound throughout the length of the nascent RNA (Yu et al., 2004; Bentley, 2005). There are also many proteins and protein complexes responsible for transcription regulation, RNA editing and quality control, nuclear export, localization in the cytoplasm, translation regulation and degradation that have been shown to associate with pre-mRNA either during or after transcription (Neugebauer, 2002; Jensen et al., 2003; Hieronymus and Silver, 2004; Yu et al., 2004). Since these interactions will have an effect on the structure formation of a pre-mRNA molecule, predicting the secondary structure of the entire pre-mRNA, using current computational approaches that do not take into account such interactions, is unlikely to be successful.

Another reason why we should not consider predicting secondary structure of the entire pre-mRNA is the existence of co-transcriptional splicing: it has been shown that splicing often occurs during transcription while RNA polymerase II is still transcribing the downstream portion of a gene. This phenomenon has been observed for multicellular eukaryotes (Osheim et al., 1985; Beyer and Osheim, 1988; Bauren and Wieslander, 1994; Wuarin and Schibler, 1994; Tennyson et al., 1995; Wetterberg et al., 1996), but also more recently for S. cerevisiae (Elliott and Rosbash, 1996; Kotovic et al., 2003; Gornemann et al., 2005). Since the purpose of our computational structure prediction is to determine which secondary structure elements are essential for splicing, we are only interested in the portion of the nascent pre-mRNA that has been synthesized when splicing occurs. However, the precise part of the nascent pre-mRNA that serves as the splicing substrate is not known.
Finally, the accuracy of computational RNA secondary structure prediction decreases with increased RNA sequence length and is considered unreliable for sequences longer than several hundred nucleotides (Morgan and Higgs, 1996; Mathews et al., 1999). Considering that the average ORF size in the STRIN dataset is 1065 nt, the average pre-mRNA size for the STRIN dataset is about 1300 nt (the average combined 5' UTR and 3' UTR length in yeast is about 250 nt (Hurowitz and Brown, 2003)), which means that predicting accurate secondary structures of STRIN pre-mRNAs would be difficult.

Based on these arguments, we believe that folding only intronic sequences is a reasonable approximation of the secondary structure within an intron, but we also repeated the same analysis where introns were folded with short flanking sequences on one or both sides. Since the splicing efficiency of the RP51B intron and its mutants was analyzed using the CUP1 gene as a reporter gene, we have considered the RP51B intron and its flanking sequences in this context. The RP51B intron was inserted into the genomic CUP1 gene after the first codon (Stutz and Rosbash, 1994); thus the 5' flanking sequence mainly consists of the 5' UTR region, which is at most 68 nt long (Karin et al., 1984; Zhang and Dietrich, 2005). We used RNAfold to fold the RP51B intron and all of Libri's mutants, including the first ATG codon and the 68-nt region upstream of the gene start. The shortened branchpoint distances were computed as before.
The same analysis was performed with both 5' upstream and 50 nt downstream regions (beginning of the second exon). The results are shown in Table 5.2.

               with 5' flanking seq                with both flanking seq
mutant         d_{length=5}   d_{t_e=-10,t_l=6}    d_{length=5}   d_{t_e=-10,t_l=6}    splicing efficiency
wt             35             84                   256            155                  efficient
3mUB1          50             163                  243            156                  slightly reduced
5mUB1          50             112                  255            155                  slightly reduced
8mUB1          50             143                  256            155                  efficient
3mDB1          46             46                   256            155                  inhibited
5mDB1          229            117                  256            155                  inhibited
3mUB1/3mDB1    202            163                  243            156                  inhibited
5mUB1/5mDB1    233            105                  255            155                  slightly reduced
6mUB1          229            102                  256            155                  inhibited
4mUB1          50             85                   256            155                  reduced

Table 5.2: Correlating shortened branchpoint distance (d) with splicing efficiency for Libri's mutants where short flanking regions were folded with intronic sequences. Shortened branchpoint distances were calculated by two versions of algorithms for zipper stem identification: one which uses the stem length to select stable zipper stems (d_{length=5}) and the other which uses thermodynamics criteria for stem selection (d_{t_e=-10,t_l=6}). Levels of splicing efficiency were inferred from Libri et al. (1995).

Analysis of the shortened branchpoint distances obtained by the two versions of our algorithm suggests that including flanking sequences in computation of the intronic secondary structure still does not explain the splicing efficiency results from Libri et al. (1995). In the case where only the 5' UTR of the CUP1 gene was included for predicting secondary structure, an overall trend seems to exist suggesting that mutants that are more efficiently spliced have lower values for d_{length=5}. However, there are a few contradictory examples that do not support this theory: mutants 3mDB1 and 5mDB1, which both show inhibited splicing, have very different d_{length=5} values (46 and 229 nt, respectively). Also, for the double mutant 5mUB1/5mDB1, which has slightly reduced splicing, the shortened distance d_{length=5} equals 233 nt, which is more than for any other mutant, including those with apparently non-existent splicing. When both flanking sequences are included, the 68-nt-long 5' UTR of the CUP1 gene and 50 nt from the beginning of the second CUP1 exon, the resulting shortened distances, d_{length=5} and d_{t_e=-10,t_l=6}, are roughly equal for all of the mutants, and fail to discriminate between the mutants that are spliced and those that are not.

Since we are aware that our shortened distance calculation is just an approximation with respect to the real distance within the folded molecule, we also wanted to look at the predicted secondary structures of these mutants and see if we can detect any structural differences that would explain their different splicing results. We computed the structures using the mfold Web server at http://www.bioinfo.rpi.edu/applications/mfold/old/rna/form1.cgi (last accessed in April 2006).
Our secondary structure analysis of the intron mutants, which basically involved visual inspection and determination of structural differences and similarities, mainly focused on the part of the structure that included the donor site and the branchpoint sequence, since we are interested in the distance between these two sites. The observed structural domain was almost identical for the 3mUB1, 5mUB1, 8mUB1, 3mDB1, 3mUB1/3mDB1 and 5mUB1/5mDB1 mutants, some of which have very different splicing efficiency levels. Moreover, the entire secondary structures of the 3mUB1 and 3mUB1/3mDB1 mutants were almost identical, with only three basepairs difference, while the copper resistance experiments showed that the first one is spliced with only slightly reduced efficiency and that the second one is not (Figure 5.3). A similar analysis of secondary structures of Libri's mutants folded with their flanking regions showed that with the addition of flanking sequences, the secondary structure of the mutants tends to be even less susceptible to short mutations; in the majority of cases, there weren't any major differences between the overall secondary structures of the mutants. Thus, visual comparison of the minimum free energy secondary structures of Libri's mutants folded with or without flanking sequences failed to explain the observed differences in the splicing efficiency.

Assuming that the splicing efficiency results from Libri et al. (1995) are accurate and that MFE prediction of secondary structure is reliable, we were not able to detect any correlation between the experimental results and the distance between the donor site and the branchpoint sequence. There is still a possibility that the distance between the two sites in the tertiary structure of the RP51B intron is quite different from our approximations and that this real distance would determine the efficiency of intron excision. Nevertheless, we believe that the more likely explanation for our inability to find any correlation between splicing efficiency and branchpoint distance is that our analysis was limited only to a single, minimum free energy prediction of the secondary structure of the mutants.

5.3 Refinement of zipper stem hypothesis

In order to remedy this limitation, we modify our approach by considering not only MFE structure prediction of intron sequences but also near-optimal predictions and refining our calculation of shortened branchpoint distances.

5.3.1 Including suboptimal structures in the analysis

According to the 'thermodynamic hypothesis', which was initially established for protein chains (Anfinsen, 1973), the tertiary structure of a protein or RNA molecule in its normal physiological environment is the one in which the Gibbs free energy of the whole system is the lowest. However, in the case of RNA, there are many alternative structures whose free energy is very close to the optimal one.
It is possible that an RNA molecule can oscillate between these different structures or that different RNA molecules with the same nucleotide sequence can fold into these different, but energetically similar structures.

Figure 5.3: Minimum free energy secondary structure of 3mUB1 mutant predicted by mfold (plot annotation: dG = -56.39 [initially -61.7], YDR447C_3mUB1).

For functional, non-coding RNAs, such as tRNAs and rRNAs, there is a strong evolutionary pressure to maintain the unique, functional structure; mRNAs, on the other hand, do not have functional constraints on their global structure, and it is likely that they exist in a population of structures. Some evidence for this has been described in Christoffersen and Mcswiggen (1994), Betts and Spremulli (1994) and Freyhult et al. (2005).

Another reason why it would be beneficial to consider RNA secondary structures other than the MFE structure, especially when using computational prediction methods, is the known inaccuracy of the RNA secondary structure prediction algorithms (see also Section 2.2). Therefore, it is possible that the native structure is not equivalent to the predicted MFE structure but to one of the suboptimal ones that still have free energies very close to the MFE. Some examples of this phenomenon have been noted in the literature (Wuchty et al., 1999).
Considering this possibility, we have decided to include in our analysis all of the predicted suboptimal structures whose free energy is within 5% of the minimum free energy (this is the default value for mfold predictions; other values of suboptimality percentage are used in later analysis). Technically, this is easy to do since the folding programs that we use allow the user to specify this percentage.

5.3.2 A new way of calculating the branchpoint distance

The calculation of distance is very important for our analysis. The zipper stem hypothesis that we are testing implies that a relatively short branchpoint distance is required for efficient splicing of an intron. Therefore, we need to be able to approximate this distance as closely as possible. For a direct calculation of the actual branchpoint distance in three-dimensional space, it is necessary to have accurate tertiary structure prediction. As currently there are no reasonably reliable algorithms for predicting RNA tertiary structure, we have to base our distance calculation on the RNA secondary structure. Even at this level it is hard to decide how to calculate the branchpoint distance.

In Chapter 4 we calculated this distance as follows: once the zipper stem is predicted, we summed the number of bases from the donor site to the zipper stem and from the zipper stem to the branchpoint sequence (see Definition 4).
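The Chapter 4 calculation summarized above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: Definition 4 is not reproduced in this section, so the function name, the 0-based coordinates, and the convention for counting the bases at the stem boundaries are all hypothetical, and the stem itself is assumed to contribute nothing to the distance.

```python
def shortened_distance(donor, branchpoint, stem_5p, stem_3p):
    """Approximate the shortened branchpoint distance given one zipper stem.

    donor, branchpoint: positions of the 5' splice site and the branchpoint.
    stem_5p, stem_3p:   outermost positions of the 5' and 3' strands of the stem.
    Assumes donor <= stem_5p < stem_3p <= branchpoint (stem lies between the sites).
    """
    if not donor <= stem_5p < stem_3p <= branchpoint:
        raise ValueError("stem must lie between the donor site and the branchpoint")
    # Bases from the donor site to the stem, plus bases from the stem to the
    # branchpoint; everything enclosed by the stem is skipped.
    return (stem_5p - donor) + (branchpoint - stem_3p)


# Made-up coordinates: a 300-nt linear distance shortened by a stem whose
# strands start 20 nt after the donor site and end 20 nt before the branchpoint.
example = shortened_distance(donor=0, branchpoint=300, stem_5p=20, stem_3p=280)
```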
This method assumes that there is no secondary structure formation from the zipper stem towards the ends of the sequence. While somewhat consistent with our hypothesis that only the zipper stem is essential for shortening the distance, this assumption is obviously unrealistic. There is no reason to believe that any part of the intronic sequence will remain unfolded unless it is previously bound by some other molecule. We thus decided to refine our model such that the distance between donor and branchpoint sites is calculated in the context of an entirely folded intron. The need to identify zipper stems, as defined in Definition 3, is thereby eliminated, allowing more flexible structure (arbitrary stem free energy and internal loop/bulge size) and location of a stem that brings the 5' splice site and branchpoint into close proximity (the second complementary sequence can be located downstream from the branchpoint sequence, as observed by Libri et al. (1995)). This refinement also implies that the entire structure will be important for determining the branchpoint distance, rather than just a single stem.

Even with this assumption, there is no unique way to calculate branchpoint distance. To the best of our knowledge, there have been no previous attempts to compute the distance between a pair of nucleotides in RNA secondary structure.
If we assume that helices are rigid, then calculating the distance between two nucleotides located in the same helix is relatively straightforward, assuming that we will use a number of nucleotides or basepairs as an abstract distance measurement. The task becomes more complicated if two nucleotides are found in different helices, or in general are separated by any loop (multi-loop or internal loop), because the mutual position of the two helices is not known: if the helices are closer to each other, the distance should be shorter, and if they are farther apart, for example because the angle between them is larger, the distance should be larger. However, this mutual positioning of the secondary structure elements can be considered a part of the tertiary structure of an RNA molecule and presently, we cannot approximate distances between them. Instead, we have based the distance calculation solely on the basepairing information: the distance calculation within a stem is performed by counting the number of basepairs separating two sites, while in a loop it is done by counting the number of free bases. To perform this calculation, we employed Dijkstra's shortest-path algorithm (Dijkstra, 1959).

Dijkstra's algorithm

Dijkstra's algorithm, named after its creator, Edsger Dijkstra, is an algorithm that solves the single-source shortest path problem for a directed graph with non-negative edge weights.
The input to the algorithm consists of a directed graph $G$ with associated edge weights, a source vertex $s$, and a target vertex $t$. Each edge of the graph is an ordered pair of vertices $(u, v)$ representing a connection from vertex $u$ to vertex $v$. Weights of edges are given by a weight function $w : E \to [0, \infty)$, so that $w(u, v)$ is a non-negative cost of moving from vertex $u$ to vertex $v$. The weight of an edge can be thought of as the distance between the two vertices. Then, the distance between two vertices in a graph, i.e., the cost of the path between the two vertices, is equivalent to the sum of the costs of the edges in the path. For a given pair of vertices $s$ and $t$ in $V$, the algorithm finds the path from $s$ to $t$ with the lowest cost (i.e., shortest distance).

The algorithm works by keeping for each vertex $v$ the cost $d(v)$ of the shortest path found from the source vertex $s$ to $v$. Initially, $d(s) = 0$ and $d(v) = \infty$ for all other vertices $v \in V$ except $s$. The basic operation in Dijkstra's algorithm is edge relaxation: if there is an edge from $u$ to $v$, then the shortest known path from $s$ to $u$ can be extended to a path from $s$ to $v$, whose cost is $d(u) + w(u, v)$. If the result is less than the current $d(v)$, we can replace this value with the new value. Edge relaxation is applied to edge $(u, v)$ only once, when $d(u)$ has reached its final value. The algorithm maintains two sets of vertices: $S$ contains only vertices for which the shortest path is known, and $Q$ contains all the other vertices. At the beginning, set $S$ is empty, and at each step the vertex with the lowest value of $d(u)$ is moved from $Q$ to $S$.
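The description above can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation used in the thesis: the function name `dijkstra`, the edge-dictionary input format and the variable names `settled` and `queue` (standing in for the sets S and Q) are assumptions made for the example. The next vertex is selected by a linear scan over Q, which corresponds to the simple array-based variant of the algorithm.

```python
import math


def dijkstra(edges, source, target):
    """Return the cost of the cheapest path from source to target.

    edges maps a directed edge (u, v) to its non-negative weight w(u, v).
    Returns math.inf if target cannot be reached from source.
    """
    # Build adjacency lists and the vertex set V from the edge dictionary.
    adjacency = {}
    vertices = {source, target}
    for (u, v), w in edges.items():
        if w < 0:
            raise ValueError("edge weights must be non-negative")
        adjacency.setdefault(u, []).append((v, w))
        vertices.update((u, v))

    d = {v: math.inf for v in vertices}  # d(v) = infinity for all v != s
    d[source] = 0                        # d(s) = 0
    settled = set()                      # S: vertices with final d(v)
    queue = set(vertices)                # Q: all the other vertices

    while queue:
        u = min(queue, key=d.get)        # vertex in Q with the lowest d(u)
        if d[u] == math.inf:             # everything left is unreachable
            break
        queue.remove(u)
        settled.add(u)                   # move u from Q to S
        if u == target:
            break
        for v, w in adjacency.get(u, []):
            # Edge relaxation: extend the best path to u by the edge (u, v).
            if d[u] + w < d[v]:
                d[v] = d[u] + w
    return d[target]
```

For example, with edges `{("s", "a"): 1, ("a", "t"): 5, ("s", "b"): 2, ("b", "t"): 1}` the call `dijkstra(edges, "s", "t")` follows the path through `b`.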
The time complexity of the algorithm is $O(n^2)$, where $n$ is the number of vertices in a graph.

Computing branchpoint distance using Dijkstra's algorithm

To calculate the branchpoint distance, we consider a predicted secondary structure of the intronic pre-mRNA as an undirected graph where nucleotide bases are vertices of the graph and edges are bonds between the nucleotides. These bonds can be either sugar-phosphate bonds between the nucleotides in the RNA chain or the hydrogen bonds between paired bases in a given RNA secondary structure. Figure 5.4 shows the conversion from an RNA secondary structure to the graph representing it. Since Dijkstra's algorithm requires a directed graph, we will represent each undirected edge $(u, v)$ as two directed edges, $(u, v)$ and $(v, u)$. All edges in the RNA graph have a uniform weight $w(u, v) = 1$.

In our implementation of the algorithm, the inputs to the program are an RNA structure in dot-bracket notation and the locations of two bases for which the distance needs to be calculated. These bases are the first nucleotide of the intron and the bulging A in the branchpoint sequence (UACUAAC). The output of the program is the shortest distance between these two bases, which we are going to consider as the branchpoint distance for the given secondary structure.

5.3.3 Computation of structural characteristics of introns

In order to do the branchpoint distance analysis based on our refined model, we wrote a program that runs all of the analyses needed to obtain the desired structure information.
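As an illustration of the distance computation of Section 5.3.2, the following sketch builds the graph implicitly from a dot-bracket string and runs Dijkstra's algorithm with unit edge weights. It is a minimal sketch under stated assumptions rather than the program written for the thesis: the function names `pair_table` and `branchpoint_distance` are hypothetical, positions are 0-based, only `(`, `)` and `.` are interpreted, and a binary heap is used in place of the linear scan for selecting the next vertex.

```python
import heapq
import math


def pair_table(structure):
    """Map each paired position in a dot-bracket string to its partner."""
    stack, partner = [], {}
    for i, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(i)
        elif symbol == ")":
            if not stack:
                raise ValueError("unbalanced dot-bracket string")
            j = stack.pop()
            partner[i], partner[j] = j, i
    if stack:
        raise ValueError("unbalanced dot-bracket string")
    return partner


def branchpoint_distance(structure, source, target):
    """Shortest distance between two bases (0-based positions).

    Vertices are bases; edges are backbone bonds (i, i + 1) and hydrogen
    bonds between paired bases. Every edge has weight 1 and is usable in
    both directions.
    """
    partner = pair_table(structure)
    n = len(structure)
    dist = [math.inf] * n
    dist[source] = 0
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue                      # stale heap entry
        if u == target:
            return d
        neighbours = [v for v in (u - 1, u + 1) if 0 <= v < n]
        if u in partner:
            neighbours.append(partner[u])
        for v in neighbours:
            if d + 1 < dist[v]:           # relax edge (u, v), w(u, v) = 1
                dist[v] = d + 1
                heapq.heappush(heap, (d + 1, v))
    return math.inf
```

In a real analysis `source` would be the first nucleotide of the intron and `target` the bulging A of the branchpoint sequence. Because all weights equal 1, a breadth-first search would return the same distances; Dijkstra's algorithm is kept here to mirror the text.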
Given a file with RNA sequences and a file with corresponding branchpoint distances, for each sequence in the input file the program produces the shortened branchpoint distance, the structure's free energy and the probability of each suboptimal structure within 5% from the MFE.

Our program calls two RNA secondary structure prediction programs: mfold and RNAfold. We use both of these programs because we need features that are not available in a single program: mfold does not have an option for computing the partition function, and RNAfold does not have an option for selecting a set of suboptimal structures that are within a certain percent of suboptimality from the minimum free energy.

Figure 5.4: Conversion from the RNA secondary structure to the graph representing it. (a) Graphical representation of the secondary structure of an intron (circles represent basepairing interactions, i.e., hydrogen bonds). (b) Graph representing the RNA structure in (a).

The program first runs RNAfold to calculate the equilibrium partition function $Q$ for each of the sequences in the input file (see also Section 2.2). RNAfold computes the partition function when run with the option '-p', but its value is not a part of RNAfold's regular output, so we had to slightly modify the source code to have it printed out.
Once the partition function is computed for a given RNA sequence we can calculate the probability of each predicted structure given its free energy. Since the partition function is computed by RNAfold and is thus based on the energy model used by the Vienna package, the free energy of a structure also needs to be computed using the same energy model in order to be consistent. Although both RNAfold (along with all of the other programs from the Vienna package) and mfold use Turner's energy model, they use different versions of it, which sometimes result in slightly different free energy values for the same RNA sequence. Thus, we first run mfold to predict the MFE structure and suboptimal structures within 5% from the MFE and then run RNAeval, a program which is a part of the Vienna package, to compute the free energies of the structures predicted by mfold.

The percentage of suboptimality is one of the parameters of mfold. We also used default values for the other mfold parameters, including the window parameter. This parameter controls how many structures will be computed and how different they must be from one another. It takes on positive integer values; a smaller value results in more computed structures that may be quite similar to one another, while a larger value results in fewer, more varied structures. If this parameter is not chosen by the user, a default value is selected according to the length of the input sequence. After experimenting with different values for this parameter, we decided to use the default value.
Using the partition function value computed by RNAfold, $Q$, and the free energy value of the structure $R_{ij}$, $\Delta G(R_{ij})$, computed by RNAeval, we can now calculate the probability of that structure in the ensemble of all possible structures ($R_{i1}, R_{i2}, \ldots$) for the given sequence $S_i$:

\[ P(R_{ij}) = \frac{e^{-\Delta G(R_{ij})/RT}}{Q} \tag{5.1} \]

where $RT$ is the product of the gas constant and the absolute temperature.

    Procedure StructureAnalyze
    input: file with n RNA sequences S_1, ..., S_n,
           file with corresponding branchpoint distances d_1, ..., d_n
    foreach (S_i, where i in {1, ..., n})
        run RNAfold to obtain the value for Q
        run mfold to predict all structures R_ij, j = 1, ..., m within 5% from MFE
        foreach (R_ij, where j in {1, ..., m})
            run RNAeval to calculate dG(R_ij) (free energy of R_ij)
            calculate P(R_ij)
            run dijkstra(R_ij, d_i) to calculate branchpoint distance d_ij
            output dG(R_ij) from mfold, dG(R_ij) from RNAeval, d_ij and P(R_ij)
        end foreach
    end foreach

Figure 5.5: Pseudo-code for the procedure StructureAnalyze described in Section 5.3.3.

Finally, using Dijkstra's algorithm we calculate the branchpoint distance for each of the computed suboptimal structures. At this stage we also count the number of bases of the branchpoint sequence that remained unbound in the given secondary structure. We will use this value later to test if the pairing status of the branchpoint sequence is important for splicing efficiency.
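The two per-structure quantities described above, the Boltzmann probability of Equation 5.1 and the number of unpaired branchpoint bases, can be sketched as follows. This is an illustration only: the function names are hypothetical, the temperature is assumed to be 37 °C with free energies in kcal/mol, and the numerical value of the gas constant is an assumption that should be matched to the energy model actually used. The second function relies on the standard identity $Q = e^{-G_{ens}/RT}$, which is convenient when only the ensemble free energy $G_{ens}$ is reported instead of $Q$ itself.

```python
import math

GAS_CONSTANT = 1.98717e-3   # kcal / (mol K); assumed value
TEMPERATURE = 310.15        # K, i.e. 37 degrees C (assumed folding temperature)
RT = GAS_CONSTANT * TEMPERATURE


def structure_probability(delta_g, partition_function, rt=RT):
    """Equation 5.1: Boltzmann weight of a structure divided by Q."""
    return math.exp(-delta_g / rt) / partition_function


def structure_probability_from_ensemble(delta_g, ensemble_energy, rt=RT):
    """Same probability, computed from the ensemble free energy -RT ln Q."""
    return math.exp((ensemble_energy - delta_g) / rt)


def free_branchpoint_bases(structure, bp_start, bp_length=7):
    """Count unpaired ('.') positions inside the branchpoint sequence.

    structure is a dot-bracket string; bp_start is the 0-based position of
    the first base of the branchpoint sequence (7 nt for UACUAAC).
    """
    return structure[bp_start:bp_start + bp_length].count(".")
```

Working with the ensemble free energy avoids forming very large values of $Q$ for long sequences, which is a design consideration rather than something stated in the text.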
The output of the program includes mfold's free energy value, RNAeval's free energy value, the branchpoint distance and the computed structure probability for each suboptimal structure predicted by mfold (all of them being within 5% of the MFE). The top-level pseudo-code for the described procedure, which we call StructureAnalyze, is given in Figure 5.5.

Considering that the time complexity of RNAfold and mfold is $O(l^3)$ and the time complexity of RNAeval and Dijkstra's algorithm is $O(l^2)$, where $l$ is the length of the given input sequence, the time complexity of the procedure StructureAnalyze is $O(n(l^3 + m \cdot l^2))$. The value of $m$, which is the number of predicted suboptimal structures within 5% of the MFE, is typically in the order of 10 (the maximum value for the STRIN dataset is 34). Even when the percentage of suboptimality is increased to 20% the value for $m$ remains similar (the maximum value for the STRIN dataset is 44), due to mfold's window parameter. Since $l$ is greater than $m$, the overall time complexity for the procedure is $O(n \cdot l^3)$.

Post-processing of the results

The output of the procedure StructureAnalyze is piped into a post-processing procedure that computes some additional values for each predicted structure, $R_{ij}$, of a given sequence, $S_i$, and also calculates summary statistics for each of the given RNA sequences.
$P(R_{ij})$ is the computed probability of a structure in the ensemble of all possible structures for a given sequence; since the number of all possible structures grows exponentially with the length of the sequence, this probability is usually a very small number. Since we are considering only the structures that are within 5% from the MFE we also compute a normalized (or relative) probability in the following way:

\[ P_{norm}(R_{ij}) = \frac{P(R_{ij})}{\sum_{k=1}^{m} P(R_{ik})} \tag{5.2} \]

Another value that is calculated in this post-processing phase is $\text{b\_weight}_{ij}$, which is dependent on the number of free bases in the branchpoint sequence. It is defined in the following way:

\[ \text{b\_weight}_{ij} = \begin{cases} P_{norm}(R_{ij}) \times 5 & : \text{\# of free bases in branchpoint} > 3 \\ P_{norm}(R_{ij}) & : \text{otherwise} \end{cases} \tag{5.3} \]

The purpose of $\text{b\_weight}_{ij}$ is to take into account the hypothesis that structural accessibility of the branchpoint sequence (e.g., being located in a loop) has a positive effect on splicing efficiency. We will discuss this topic further in Section 6.2. The structures where four or more branchpoint nucleotides are in a loop are more heavily weighted than the structures where the branchpoint is mostly in a stem. The multiplication factor of five was arbitrarily chosen.

Once all predicted structures for a given sequence $S_i$ have been processed, the program computes several summary statistics:

average distance - this is the average branchpoint distance for all of the structures predicted by mfold with the percentage of suboptimality equal to 5.
If there are $m$ predicted structures, the $\text{average distance}_i$ is computed in the following way:

\[ \text{average distance}_i = \frac{\sum_{j=1}^{m} d_{ij}}{m} \tag{5.4} \]

If our hypothesis that a relatively short branchpoint distance is required for efficient splicing were correct, lower values of average distance would indicate higher splicing efficiency.

r_weight - this is a heuristic measure, which is based on the hypothesis that introns with highly probable suboptimal structures that have short branchpoint distance are spliced more efficiently. It is defined in the following way:

\[ \text{r\_weight}_i = \sum_{j=1}^{m} \frac{P_{norm}(R_{ij})}{d_{ij}} \tag{5.5} \]

The addends in the above sum are computed for each predicted secondary structure and are directly proportional to the relative (normalized) probability of that structure $P_{norm}(R_{ij})$ (thus 'r' in r_weight) and inversely proportional to its branchpoint distance. Higher values of $\text{r\_weight}_i$ should indicate more efficient splicing of the sequence $S_i$.

r_b_weight - this value is analogous to $\text{r\_weight}_i$ but also takes into account the number of free bases in the branchpoint sequences. It is defined as follows:

\[ \text{r\_b\_weight}_i = \sum_{j=1}^{m} \frac{\text{b\_weight}_{ij}}{d_{ij}} \tag{5.6} \]

where b_weight is defined in Equation 5.3. Higher values of $\text{r\_b\_weight}_i$ should indicate more efficient splicing.

5.4 Branchpoint-distance analysis of RP51B mutants using the refined model

We used our updated approach for analyzing intronic secondary structures and branchpoint distances to analyze the wild type RP51B intron and all of the mutants described by Libri et al. (1995) and by Charpentier and Rosbash (1996).
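The summary statistics of Equations 5.2 to 5.6 can be sketched as a single post-processing function. This is a minimal illustration, not the post-processing program used in the thesis: the function name `summarize`, the tuple layout of its input and the parameter names `factor` and `min_free` are assumptions, and the numbers in the usage note are made up.

```python
def summarize(structures, factor=5, min_free=4):
    """Compute (average distance, r_weight, r_b_weight) for one sequence.

    structures is a list of (probability, distance, free_bases) tuples, one
    per predicted structure R_ij: the ensemble probability P(R_ij), the
    branchpoint distance d_ij and the number of unpaired branchpoint bases.
    """
    if not structures:
        raise ValueError("at least one predicted structure is required")
    total = sum(p for p, _, _ in structures)
    average_distance = sum(d for _, d, _ in structures) / len(structures)
    r_weight = 0.0
    r_b_weight = 0.0
    for p, d, free in structures:
        p_norm = p / total                          # Equation 5.2
        # Equation 5.3: more than 3 free bases, i.e. four or more.
        b_weight = p_norm * factor if free >= min_free else p_norm
        r_weight += p_norm / d                      # addend of Equation 5.5
        r_b_weight += b_weight / d                  # addend of Equation 5.6
    return average_distance, r_weight, r_b_weight
```

For instance, `summarize([(3e-6, 5, 2), (1e-6, 20, 5)])` describes a sequence with two predicted structures, one in contact conformation with a mostly paired branchpoint and one with a long distance and an accessible branchpoint.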
The new approach to the structural analysis considers not only the MFE structure, but also all of the suboptimal structures that are within 5% from the minimum free energy and that are significantly different from each other according to mfold's criteria. The branchpoint distance is also calculated differently than before, using Dijkstra's shortest path algorithm, described in Section 5.3.2.

We included all of Libri's mutants in the analysis: 3mUB1, 5mUB1, 8mUB1, 3mDB1, 5mDB1, 3mUB1/3mDB1, 5mUB1/5mDB1, 6mUB1, and 4mUB1. The mutants were also analyzed using our original structure and branchpoint-distance analysis described in Section 5.2.

In addition, we analyzed all of the RP51B intron mutants that were described by Charpentier and Rosbash (1996). These are mut-UB1i, which has an inverted UB1 sequence (upstream box 1; 5' complementary region), mut-DB1i, which has an inverted DB1 sequence (downstream box 1; 3' complementary region), mut-UB1iDB1i, which has both UB1 and DB1 sequences inverted to make them complementary to each other, mut-5, which reduces the consecutive pairing region to 5 basepairs, mut-12, which improves pairing to 12 consecutive basepairs (eliminating one one-nucleotide bulge), and mut-18, which extends pairing to 18 consecutive basepairs (eliminating all three bulges in the pairing region, see Figure 5.1).

Similar to the approach of Libri et al.
(1995), Charpentier and Rosbash (1996) inserted the mutated introns into the CUP1 gene. The insertion was made after the first codon (Stutz and Rosbash, 1994), thus the 5' flanking sequence consisted mainly of the 5' UTR region, which is at most 68 nt long (Karin et al., 1984; Zhang and Dietrich, 2005). The 3' flanking sequence is part of the CUP1 coding sequence. Analogous to the analysis done in Section 5.2, we considered both Libri's and Charpentier's mutant introns in isolation, and also including one (5') or both flanking sequences. As previously mentioned, the 5' flanking sequence consisted of the first CUP1 codon and 68 nt of its 5' UTR, and the 3' flanking sequence consisted of 50 nt of CUP1 coding sequence downstream from the inserted intron.

The splicing efficiency of the mutants was not directly quantified by Charpentier and Rosbash (1996), but it can be inferred from some of the figures in their paper. Unlike the splicing efficiency analysis conducted by Libri et al. (1995), which was done by copper growth assay, Charpentier and Rosbash used gel electrophoresis of formed spliceosomal complexes. The pre-mRNAs, lariat intermediate complex and lariat product complex were resolved based on size, and the splicing efficiency level was approximated from the intensity of the bands. Thus, levels of splicing efficiency for the wild type pre-mRNA and mutant pre-mRNAs can only be approximated based on the published gel images.
Another complication in assessing splicing efficiency levels for Charpentier's mutants is that there is not a direct comparison of all of the mutant results with the wild type results. In some cases, the authors used the wild type intron and the mutated introns with the donor site modified to correspond to the consensus donor sequences. With these modifications, introns are in general spliced more efficiently than in the original RP51B context. The splicing efficiency levels can still be approximated relative to one another based on Figures 2 and 3 from the article by Charpentier and Rosbash (1996). Four different levels can be observed: normal for wt, reduced for mut-UB1i, mut-DB1i, and mut-5, slightly improved for mut-UB1iDB1i and improved for mut-12 and mut-18. These levels cannot be compared directly to splicing efficiency levels for Libri's mutants.

All of the described mutants were processed using the program StructureAnalyze described in Section 5.3.3. The detailed output of the post-processing procedure is given in Appendix C.

The summary statistics for all of Libri's mutants are reported in Table 5.3. From the table we can see that there is an interesting correlation between the average branchpoint distance and the splicing efficiency levels: sequences that are more efficiently spliced (wild type, 3mUB1, 5mUB1, 8mUB1, 5mUB1/5mDB1, and 4mUB1) have lower values for the average distance than those that are poorly spliced.
If we assign numerical values to the descriptive splicing efficiency labels (efficient = 1, slightly reduced = 2, reduced = 3 and inhibited = 4) we can compute the Pearson correlation coefficient between these values and the average branchpoint distance (r = 0.95).

Looking at the detailed output given in Appendix C.1, we can observe that all of the sequences that are spliced efficiently or with slightly reduced efficiency have one or more predicted structures (MFE or suboptimal) for which the branchpoint distance is very short, $d_{ij} = 5$. Analysis of the secondary structures of these sequences reveals that this distance corresponds to a structural conformation where the donor site and the branchpoint have two basepairing interactions between them. The part of the RNA secondary structure that shows this contact conformation is illustrated in Figure 5.6.

Even more intriguing is the fact that the number of predicted structures that have this contact conformation between the donor site and the branchpoint seems to be proportional to the level of splicing efficiency. The wild type intron has 5 of these structures (efficient splicing), the 3mUB1 and 5mUB1 mutants have one structure each where $d_{ij} = 5$ (slightly reduced splicing), the 8mUB1 mutant has 3 of these structures (efficient splicing), and the 5mUB1/5mDB1 mutant has one such structure (slightly reduced splicing).
Mutant 4mUB1, which has reduced but not completely inhibited splicing, does not have a structure with contact conformation within 5% from the MFE; however, if the percentage of suboptimality is increased to 10%, one of the predicted structures has a branchpoint distance of 5.

Figure 5.6: A part of the RP51B wild type intron secondary structure that shows basepairing between the donor site and the branchpoint sequence.

    mutant         avg   r_weight   r_b_weight   splicing efficiency
    wt             20    0.1494     0.1494       efficient
    3mUB1          28    0.0319     0.0573       slightly reduced
    5mUB1          29    0.0302     0.0509       slightly reduced
    8mUB1          26    0.0283     0.0283       efficient
    3mDB1          41    0.0244     0.0244       inhibited
    5mDB1          41    0.0296     0.0631       inhibited
    3mUB1/3mDB1    43    0.0243     0.0243       inhibited
    5mUB1/5mDB1    34    0.0364     0.1218       slightly reduced
    6mUB1          48    0.0208     0.1042       inhibited
    4mUB1          38    0.0494     0.2458       reduced

Table 5.3: Summary statistics for Libri's mutants. Levels of splicing efficiency were determined from Figure 5.2.

The value of $\text{r\_weight}_i$, which is supposed to be higher for the sequences that are more efficiently spliced, seems to have some predictive power, since there is a good correspondence between the values and the efficiency of splicing (r = -0.52). In general, efficiently spliced sequences have higher values than those with poor splicing.
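The correlation coefficients quoted for Libri's mutants can be recomputed from the rounded values in Table 5.3. The sketch below is illustrative: the function name `pearson` and the variable names are assumptions, the efficiency labels are encoded as described in the text (efficient = 1, slightly reduced = 2, reduced = 3, inhibited = 4), and because the table values are rounded the results only approximate the reported coefficients.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Rows of Table 5.3 in order: wt, 3mUB1, 5mUB1, 8mUB1, 3mDB1, 5mDB1,
# 3mUB1/3mDB1, 5mUB1/5mDB1, 6mUB1, 4mUB1.
avg_distance = [20, 28, 29, 26, 41, 41, 43, 34, 48, 38]
r_weight = [0.1494, 0.0319, 0.0302, 0.0283, 0.0244,
            0.0296, 0.0243, 0.0364, 0.0208, 0.0494]
efficiency = [1, 2, 2, 1, 4, 4, 4, 2, 4, 3]

r_avg = pearson(avg_distance, efficiency)       # positive: longer = poorer
r_rweight = pearson(r_weight, efficiency)       # negative: higher = better
```

The sign convention follows from the encoding: larger label values mean poorer splicing, so a statistic that increases with efficiency yields a negative coefficient.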
The other summary statistic, r_b_weight, which takes into account the number of free bases in the branchpoint sequence, does not correlate well with splicing efficiency (r = -0.14).

For Libri's mutants with flanking sequences, there seem to be no characteristic structures or branchpoint distances that would distinguish efficiently spliced sequences from ones that are not. Likewise, none of the summary statistics corresponds well with the observed splicing efficiency levels (see Appendices C.2 and C.3).

The branchpoint distance results for Charpentier's mutants are similar to the results for Libri's mutants: the average branchpoint distances are lower for the sequences that are efficiently spliced (wild type, mut-UB1iDB1i, mut-12, and mut-18). If we assign numerical values to the descriptive splicing efficiency labels (improved = 1, slightly improved = 2, normal = 3 and reduced = 4), we can compute the correlation coefficient between the average branchpoint distance and splicing efficiency (r = 0.92). This, again, corresponds to the ability of these sequences to fold in such a way as to bring the donor site and the branchpoint sequences very close to each other: each of the efficiently spliced sequences has a number of contact-conformation structures (see Appendix C.4). The mutants that have reduced splicing do not fold into this type of structure, except for mut-DB1i, which has one suboptimal structure (of relatively low probability) with d_ij = 5. This does not necessarily contradict the general trend, since the splicing efficiency measurements are very imprecise and it is possible that mut-DB1i is spliced more efficiently than mut-UB1i and mut-5, which would explain the presence of the short-branchpoint-distance structure.

mutant          avg   r_weight   r_b_weight   splicing efficiency
wt               20   0.1494     0.1494       efficient
mut-UB1i         38   0.0454     0.2240       reduced
mut-DB1i         30   0.0424     0.0978       reduced
mut-UB1iDB1i     13   0.1569     0.1578       slightly improved
mut-5            35   0.0491     0.2455       reduced
mut-12           13   0.1569     0.1578       improved
mut-18           13   0.1569     0.1578       improved

Table 5.4: Summary statistics for Charpentier's mutants. Levels of splicing efficiency were inferred from Figures 2 and 3 and Table 1 in the article by Charpentier and Rosbash (1996).

The value of the r_weight summary statistic also correlates well with the splicing efficiency levels: r_weight values are significantly higher for efficiently spliced sequences, indicating good prediction potential (r = -0.89). This is not the case for r_b_weight, whose values do not correlate well with the splicing efficiency levels (r = 0.29).

For Charpentier's mutants with flanking sequences, there appear to be no characteristic structures or branchpoint distances that would distinguish sequences that are spliced efficiently from ones that are not. Also, none of the summary statistics correlates well with the splicing efficiency level (see Appendices C.5 and C.6).

To conclude, structural and branchpoint distance analysis of the RP51B introns described by Libri et al. (1995) and by Charpentier and Rosbash (1996) has demonstrated the benefits of our new approach. Considering not only the MFE structure but also a certain subset of suboptimal structures, as well as improving the calculation of the branchpoint distance, allowed us to identify some structural characteristics of RP51B intron mutants that may be responsible for their differential splicing. Namely, for all of the 16 sequences analyzed, the ability to form highly probable secondary structures with short branchpoint distance ensures their efficient splicing. In addition, it seems that the number of these structures and their probabilities correlate well with the splicing efficiency levels. This observation is also reflected in the average branchpoint distance value, which is always lower for the sequences that are more efficiently spliced. At this point it is not clear if it is the short distance itself or the specific contact conformation between the donor site and the branchpoint sequences that is important for splicing.

Our previous branchpoint distance analysis, based on zipper-stem identification, was not able to detect these differences, since it considered only the MFE structure of the mutants.
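The correlation coefficients quoted for Table 5.4 can be checked with a few lines of code. The following is a minimal sketch, assuming that r denotes the Pearson coefficient and that the wild type is coded as normal = 3 (Table 5.4 labels it 'efficient', so this coding is an assumption); the names LABEL_CODE, TABLE_5_4 and pearson are illustrative only.

```python
from math import sqrt

# Ordinal coding given in the text. Coding the wild type as "normal"
# is an assumption (Table 5.4 itself labels it "efficient").
LABEL_CODE = {"improved": 1, "slightly improved": 2, "normal": 3, "reduced": 4}

# (mutant, avg distance, r_weight, r_b_weight, efficiency label), from Table 5.4
TABLE_5_4 = [
    ("wt",           20, 0.1494, 0.1494, "normal"),
    ("mut-UB1i",     38, 0.0454, 0.2240, "reduced"),
    ("mut-DB1i",     30, 0.0424, 0.0978, "reduced"),
    ("mut-UB1iDB1i", 13, 0.1569, 0.1578, "slightly improved"),
    ("mut-5",        35, 0.0491, 0.2455, "reduced"),
    ("mut-12",       13, 0.1569, 0.1578, "improved"),
    ("mut-18",       13, 0.1569, 0.1578, "improved"),
]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


codes = [LABEL_CODE[row[4]] for row in TABLE_5_4]
r_avg = pearson([row[1] for row in TABLE_5_4], codes)
r_weight = pearson([row[2] for row in TABLE_5_4], codes)
r_b_weight = pearson([row[3] for row in TABLE_5_4], codes)

print(f"avg distance: r = {r_avg:.2f}")
print(f"r_weight:     r = {r_weight:.2f}")
print(f"r_b_weight:   r = {r_b_weight:.2f}")
```

Because a larger code means less efficient splicing, a positive r for the average distance and a negative r for r_weight both point in the direction described in the text.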
Also, the definition of a zipper stem (Definition 3) excluded stems that contained either the donor site or the branchpoint sequence and thus would not be able to identify the stem found in the RP51B intron and its efficiently spliced mutants that brings the donor and branchpoint sequences into contact conformation.

The new branchpoint distance analysis yielded good results only when intron sequences without any flanking regions were folded. Adding flanking regions to the introns eliminates any recognizable differences between efficiently spliced mutants and those that are poorly spliced.

Calculation of the probability of close branchpoint distance using the partition function

Instead of searching for structures that have short branchpoint distances or, in the case of the RP51B gene, that have basepairing interactions between the donor site and the branchpoint, the probability of contact conformation for a given sequence can be obtained directly. This probability corresponds to the basepairing probabilities of the paired nucleotides in question, and is calculated by the RNAfold program using the partition function. Each basepair probability reflects a sum of all weighted structures in which the chosen basepair occurs. Thus, these basepairing probabilities also take into account the structures that were not within 5% from the MFE, eliminating the necessity to choose an arbitrary suboptimality percentage value.

mutant          G-C probability   U-A probability   splicing efficiency
wt              0.40              0.40              efficient
3mUB1           0.33              0.33              slightly reduced
5mUB1           0.31              0.31              slightly reduced
8mUB1           0.34              0.34              efficient
3mDB1           0.01              0.01              inhibited
5mDB1           < 0.01            < 0.01            inhibited
3mUB1/3mDB1     0.01              0.01              inhibited
5mUB1/5mDB1     0.11              0.11              slightly reduced
6mUB1           0.05              0.05              inhibited
4mUB1           0.18              0.18              reduced

Table 5.5: Basepairing probabilities of contact conformation (Figure 5.6) for Libri's mutants. The probabilities were calculated by the RNAfold program.

Figure 5.6 shows the relevant basepairs formed between the donor site and the branchpoint sequence when the predicted structure contains what we call 'contact conformation'. The first basepair is formed between the first intron base (G) and the third base of the branchpoint sequence (C), and the second basepair is between the second base in the intron (U) and the second base of the branchpoint sequence (A). The probabilities for these particular basepairs can be obtained when the RNAfold program is run with the '-p' option, which invokes computation of the partition function and basepairing probabilities. The basepair probability values for the wild type RP51B intron and all of Libri's mutants are given in Table 5.5. It can be observed that all of the sequences that are efficiently spliced have higher values for the basepair probabilities than the sequences that are poorly spliced (r = 0.92).
The correlation is not strictly linear since, for example, the mutant sequence 8mUB1 has almost the same basepair probability value as 3mUB1 and 5mUB1, although it is more efficiently spliced than these two. Similarly, the mutant 5mUB1/5mDB1 is more efficiently spliced than 4mUB1, but this is not reflected in the basepair probability values.

For Charpentier's mutants, the basepair probabilities are also higher for the sequences that are more efficiently spliced (Table 5.6): all of the sequences that are efficiently spliced (wild type, mut-UB1iDB1i, mut-12, and mut-18) have basepair probabilities of 0.40, while the other sequences have lower values (r = 0.79). The mutant mut-DB1i has a relatively high basepair probability value with respect to the other two mutants, possibly for the same reason as given in the previous section: the splicing efficiency measurements in Charpentier and Rosbash (1996) lack precision, and it is possible that mut-DB1i is spliced more efficiently than mut-UB1i and mut-5. Another reason might be the imprecision of the energy model on which the basepair probabilities are based.

mutant          G-C probability   U-A probability   splicing efficiency
wt              0.40              0.40              normal
mut-UB1i        0.04              0.04              reduced
mut-DB1i        0.25              0.25              reduced
mut-UB1iDB1i    0.40              0.40              slightly improved
mut-5           0.04              0.04              reduced
mut-12          0.40              0.40              improved
mut-18          0.40              0.40              improved

Table 5.6: Basepairing probabilities of contact conformation for Charpentier's mutants.
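To illustrate how the two contact-conformation probabilities could be read off in practice, here is a minimal sketch. It assumes the dot-plot text that RNAfold writes when run with '-p', in which each line ending in 'ubox' holds the two 1-based positions of a pair and the square root of its probability. The function names, the sample text and the branchpoint position are hypothetical and are not taken from the thesis.

```python
def pair_probabilities(dot_plot_text):
    """Return {(i, j): p} from the 'ubox' lines of a dot-plot file.

    Assumed line layout: "i j sqrt(p) ubox" with 1-based positions i < j.
    """
    probs = {}
    for line in dot_plot_text.splitlines():
        fields = line.split()
        if len(fields) != 4 or fields[3] != "ubox":
            continue
        try:
            i, j, sqrt_p = int(fields[0]), int(fields[1]), float(fields[2])
        except ValueError:
            continue  # header or comment line that merely ends in 'ubox'
        probs[(i, j)] = sqrt_p ** 2
    return probs


def contact_probabilities(probs, bp_start):
    """Probabilities of the two contact-conformation basepairs.

    bp_start is the 1-based position of the first base of the branchpoint
    sequence; the intron is assumed to start at position 1 (no flanks).
    """
    g_c = probs.get((1, bp_start + 2), 0.0)  # intron base 1 with branchpoint base 3
    u_a = probs.get((2, bp_start + 1), 0.0)  # intron base 2 with branchpoint base 2
    return g_c, u_a


# Synthetic example, not real RNAfold output for any intron.
SAMPLE_DOT_PLOT = """\
%i j sqrt(p(i,j)) ubox
1 12 0.632455532 ubox
2 11 0.632455532 ubox
3 10 0.100000000 ubox
1 12 0.9500000 lbox
"""

g_c, u_a = contact_probabilities(pair_probabilities(SAMPLE_DOT_PLOT), bp_start=10)
print(f"G-C: {g_c:.2f}  U-A: {u_a:.2f}")
```

Pairs that are absent from the file are treated as having probability 0, which is a simplification: very small probabilities may simply have been omitted from the output.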
Overall, based on the results for Libri's and Charpentier's mutants, it seems that the basepair probabilities can be good indicators for splicing efficiency.

5.5 Branchpoint-distance analysis on the STRIN dataset

According to the preceding analysis of RP51B mutants, it appears that intron sequences that can significantly shorten their branchpoint distance by forming secondary structures are spliced more efficiently. The predicted secondary structures with short branchpoint distances have to be highly probable structures, i.e., structures with close-to-optimal free energies.

Having this model of splicing in mind, we analyzed all of the long introns in the STRIN dataset to see if they exhibit structural characteristics similar to the wild type RP51B intron. The STRIN dataset contains 110 5'L introns, where the 'linear' distance between the donor site and the branchpoint sequence is greater than 200 nt. However, 12 of these introns are 5' UTR introns (i.e., are found in the 5' untranslated region of genes), which we decided to exclude from this analysis for two reasons. The first reason is that in this analysis we also want to consider the flanking regions of introns, and for the 5' UTR introns information about their exact location is not available. For the introns found in the coding sequence, the location of splice sites is calculated from the first coding nucleotide.
This cannot be done for the 5' UTR introns; thus, in the Ares database from which we extracted intron locations these introns are assigned location 0, and in the SGD database the 5' UTR introns are not annotated at all. The second reason for not including the introns located in 5' UTRs is that, although they are removed by the same spliceosome machinery, it is possible that their requirements for splicing efficiency are different. It has been shown that the cap-binding complex (CBC), which binds to the 5' end of the nascent pre-mRNA, plays a role in the recognition of the 5' splice site of the first intron by the U1 snRNP during the formation of the spliceosomal commitment complex (Lewis et al., 1996a,b). Another study on the ACT1 intron found a correlation between intron position and splicing efficiency (Klinz and Gallwitz, 1985), and showed that splicing efficiency decreases with increased distance between the RNA cap site and the intron. Thus, it is possible that the proximity of 5' UTR introns to the pre-mRNA cap may have an effect on the splicing efficiency of these introns, which would diminish the role of the branchpoint-distance-shortening structure formation.

For each of the remaining 98 long introns, we computed all of the secondary structures within 5% from the MFE (with mfold's window parameter set to default value).
For each sequence, we computed the minimum and the average branchpoint distance for all of the structures predicted and the most probable branchpoint distance, which is the distance calculated for the structure that has the highest probability, i.e., the minimum free energy structure. We also computed the minimum, average and most probable branchpoint distance for all of the long introns that were folded with 50 nt flanking sequence on both ends.

As control datasets, we generated two sets of 98 random sequences (one corresponding to intron sequences only and one corresponding to introns with flanking sequences) that have the same length distribution and the same GC content as the intron datasets. Similar to the analysis of STRIN introns, we use the procedure StructureAnalyze to compute the distances corresponding to branchpoint distances in STRIN introns for each sequence in a control dataset. The distances are computed between the first nucleotide in the sequence and the nucleotide found at the same location as the start of the branchpoint sequence in the STRIN intron with the same length as a control sequence. For brevity, we still call these distances branchpoint distances. The distribution histograms and the cumulative distribution plots are shown in Figures 5.7 and 5.8.
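One simple way to build such a control set is sketched below. It assumes that matching is done per sequence, so that each random sequence has exactly the length and the G+C count of one intron; the text only states that length distribution and GC content were matched, so this per-sequence scheme and the name random_like are assumptions made for illustration.

```python
import random


def random_like(seq, rng):
    """Random RNA sequence with the same length and G+C count as seq."""
    n_gc = sum(base in "GC" for base in seq)
    bases = [rng.choice("GC") for _ in range(n_gc)]
    bases += [rng.choice("AU") for _ in range(len(seq) - n_gc)]
    rng.shuffle(bases)  # scatter the G/C positions along the sequence
    return "".join(bases)


# Toy input, not a real STRIN intron.
rng = random.Random(0)  # fixed seed so the control set is reproducible
toy_intron = "GUAUGUUAAUAUGGACUAAAGGAGGC"
control = random_like(toy_intron, rng)
print(len(control), sum(base in "GC" for base in control))
```

Drawing G versus C (and A versus U) with equal probability keeps the GC content exact but does not preserve the individual base frequencies of the intron.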
From Figure 5.7 it can be observed that yeast long introns tend to fold into structures that have shorter branchpoint distances than folded random sequences of the same length and GC content: approximately one-third (33 sequences) of STRIN long introns have a minimum branchpoint distance of 5, the same as for the RP51B intron, while this is the case for only 6 of the random sequences. This tendency of yeast introns to have shorter branchpoint distances than random sequences can also be seen in the plots for average and most probable branchpoint distances. To test the statistical significance of these differences, we performed a Kolmogorov-Smirnov test, which determines if two distributions differ significantly (see also Chapter 4). We computed D statistics and p-values for all three pairs of datasets plotted in Figure 5.7. For STRIN and random datasets of minimum and average branchpoint distances, the null hypothesis of no difference was rejected (with p-values of 0.008 and 0.012, respectively). This was not the case for the datasets of most probable branchpoint distances (p-value = 0.13).

Figure 5.8 implies similar conclusions for intronic sequences with flanking regions.
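For readers who want to reproduce this kind of comparison, the two-sample Kolmogorov-Smirnov test can be sketched as follows. The D statistic is the largest gap between the two empirical cumulative distributions; the p-value uses the asymptotic Kolmogorov series with a commonly used effective-sample-size correction. The text does not say which implementation produced the reported values, so both the approximation and the name ks_two_sample are assumptions.

```python
from bisect import bisect_right
from math import exp, sqrt


def ks_two_sample(a, b):
    """Return (D, approximate p-value) for two samples of numbers."""
    a, b = sorted(a), sorted(b)
    n, m = len(a), len(b)
    # The largest ECDF gap is attained at one of the observed values.
    d = max(abs(bisect_right(a, x) / n - bisect_right(b, x) / m) for x in a + b)
    en = sqrt(n * m / (n + m))
    lam = (en + 0.12 + 0.11 / en) * d
    if lam < 0.3:
        # The series converges slowly here and its value is essentially 1.
        return d, 1.0
    p = 2.0 * sum((-1) ** (k - 1) * exp(-2.0 * k * k * lam * lam)
                  for k in range(1, 101))
    return d, min(max(p, 0.0), 1.0)


# Toy samples, not the STRIN distances.
d_same, p_same = ks_two_sample([5, 7, 9, 20], [5, 7, 9, 20])
d_apart, p_apart = ks_two_sample([1, 2, 3, 4, 5], [6, 7, 8, 9, 10])
print(d_same, p_same)
print(d_apart, round(p_apart, 4))
```

The asymptotic p-value is only approximate for small samples or for data with many ties, such as integer distances, so it should be read as indicative.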
Although adding upstream and downstream sequences did not seem beneficial for the branchpoint distance analysis of the RP51B intron (Section 5.4), the analysis of all of the STRIN long introns indicates that even when folded with the flanking sequences, long yeast introns have different structural characteristics than random sequences of the same length and GC content. The KS test rejected the null hypothesis for all three pairs of datasets (for minimum branchpoint distance p-value = 0.0009, for average branchpoint distance p-value < 10^-4, and for most probable branchpoint distance p-value = 0.0001).

Figure 5.7: Comparing branchpoint distances for STRIN long introns and corresponding random sequences: distribution of minimum branchpoint distances as explained in the text (a: distribution histogram; b: cumulative distribution); c, d: distribution of average branchpoint distances; e, f: distribution of the most probable branchpoint distances.

Figure 5.8: Comparing branchpoint distances for STRIN long introns with flanking regions and corresponding random sequences: distribution of minimum branchpoint distances as explained in the text (a: distribution histogram; b: cumulative distribution); c, d: distribution of average branchpoint distances; e, f: distribution of the most probable branchpoint distances.

Since the most common branchpoint distance for the STRIN long introns is 5, we wanted to explore further if this distance always corresponds to the contact conformation observed for the RP51B intron and the extent of contribution of the branchpoint sequence and the canonical dinucleotide at the donor site to the formation of the basepairs between these two sites.
Analyzing all of the suboptimal structures of the introns that have a minimum branchpoint distance of 5, we found that out of 33 introns that have this distance, only 13 have the same contact conformation as the RP51B intron. This does not necessarily mean that the rest of the introns cannot form this contact between the two sites, since for some of them the basepairing probabilities are relatively high (see next section).

In order to test the contribution of the GU sequences at the donor site and the branchpoint sequence (UACUAAC), we generated two new datasets of random sequences with the same characteristics as the previous ones but which have the canonical GU dinucleotide at the donor site (beginning of the sequence for the random sequences that have the same length distribution as STRIN introns and 50 nt from the beginning of the sequence for the random sequences that correspond to introns with flanking regions), dinucleotide AG at the acceptor site and canonical branchpoint sequence UACUAAC at the same location as the intron with the corresponding length. Thus, these sequences are very much like the real yeast long introns except that the sequences that are not essential for splicing are randomized. We computed the minimum, average and the most probable branchpoint distance for each random sequence in the dataset and compared these distributions to the corresponding distributions for the STRIN introns.
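The construction of these signal-bearing controls can be sketched as below, under the assumption that the three signals simply overwrite the corresponding positions of an already generated random sequence. The function plant_signals, its 0-based index conventions and the example coordinates are hypothetical and chosen only for illustration.

```python
BRANCHPOINT = "UACUAAC"


def plant_signals(random_seq, donor, bp_start, acceptor_end):
    """Overwrite a random sequence with the canonical splicing signals.

    donor        0-based index of the first intron base (0 without flanks,
                 50 with 50 nt flanks)
    bp_start     0-based index of the first branchpoint base
    acceptor_end 0-based index just past the last intron base
    """
    if not (donor + 2 <= bp_start
            and bp_start + len(BRANCHPOINT) <= acceptor_end - 2
            and acceptor_end <= len(random_seq)):
        raise ValueError("signal positions overlap or fall outside the sequence")
    seq = list(random_seq)
    seq[donor:donor + 2] = "GU"                                # donor dinucleotide
    seq[bp_start:bp_start + len(BRANCHPOINT)] = BRANCHPOINT   # branchpoint sequence
    seq[acceptor_end - 2:acceptor_end] = "AG"                  # acceptor dinucleotide
    return "".join(seq)


# Toy example: a 300 nt placeholder "intron" with the branchpoint at index 250.
control = plant_signals("A" * 300, donor=0, bp_start=250, acceptor_end=300)
print(control[:2], control[250:257], control[-2:])
```

Overwriting eleven positions changes the GC content of a sequence slightly; whether the original controls compensated for this is not stated in the text.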
For introns and random sequences without flanking regions, the distribution differences are not as prominent as for the first type of random sequences: the Kolmogorov-Smirnov test failed to reject the null hypothesis of no difference for all three types of distribution. However, there is still a marked difference with respect to very short branchpoint distances: while the STRIN dataset has 33 sequences with a minimum branchpoint distance of 5, there are only 6 such sequences in the random dataset (5 have contact conformation) and another 12 that have even shorter distances. The results for the sequences with flanking regions are similar to the results in Figure 5.8. For all three pairs of datasets we can reject the hypothesis that they stem from the same underlying distributions.

Calculation of the probability of close branchpoint distance using partition function

Instead of searching for structures that have short branchpoint distances, we can use basepair probabilities that take into account all possible suboptimal structures. Basepair probabilities are calculated using the partition function (running RNAfold with the '-p' option), with each probability reflecting a sum of all weighted structures in which the chosen basepair occurs.

As we previously did for the RP51B mutants, we can calculate the probabilities of contact conformation for all of the STRIN long introns and compare them with probabilities for random sequences.
For this analysis, we used the random sequences that have the same length distribution and GC content as the STRIN long introns but also have authentic splicing signal sequences (see previous section). For each of the sequences in these two datasets (for this analysis we did not consider sequences with flanking regions), we computed the basepairing probability of interaction between the donor site and the branchpoint sequence, as shown in Figure 5.6. The distributions of obtained values are plotted in Figure 5.9.

From the figure it can be observed that the STRIN dataset, when compared to the random dataset, has fewer sequences that have very low probability of contact conformation (p < 0.1) and more sequences that have higher probability. The KS test rejects the null hypothesis that the underlying distributions are identical (D = 0.36 and p-value < 10^-4).

Figure 5.9: Comparing probabilities of the basepairing between the donor site and the branchpoint sequence for STRIN long introns and corresponding random sequences. a: distribution histogram of the basepairing probabilities; b: cumulative distribution of the basepairing probabilities.

We can also compute the probability of a short branchpoint distance using basepair probabilities. Instead of extracting the specific probability of basepairing between the first and second nucleotides of the donor sequence with the third and second nucleotides (respectively) of the branchpoint sequence, we can choose the highest probability of basepairing between the donor dinucleotide and any dinucleotide within a certain window from the branchpoint sequence. This basepairing interaction will determine the maximum branchpoint distance for that sequence. We experimented with different window sizes up to 25 nt; for the sizes less than 20 nt, the KS test yielded p-values below 0.05, indicating that we can reject the null hypothesis that the two datasets stem from the same underlying distribution.

In summary, the results presented in Section 5.5 imply that yeast long introns, when compared to randomly generated sequences that resemble real introns in many respects, are more likely to fold into structures that have short branchpoint distances. There is a significant class of yeast long introns (1/3 of them) that can fold into secondary structures that have a shortened branchpoint distance of 5, while this is the case for only a few random sequences.
This observation is also supported by analysis of basepairing probabilities between the donor site and the branchpoint sequence.

5.6 Validation by biological experiments

Based on our structural and branchpoint distance analysis of Libri's and Charpentier's mutants (Section 5.4), we modified our previously proposed hypothesis of the role of secondary structure on intron splicing as follows:

The existence of highly probable secondary structures (whose free energy is within 5% from the minimum free energy) that have short branchpoint distance (calculated by Dijkstra's algorithm, see Section 5.3.2) is required for efficient splicing of a yeast intron.

In order to test the validity of this model of splicing, we needed to test it on introns that have not been used in the derivation of the model. We decided to design additional RP51B mutants whose splicing efficiency would be tested by laboratory experiments.

Validation of computational prediction by laboratory experiments is a very important component of bioinformatics research. It can provide additional support to computational results and make them more significant to the biological community, thus contributing to the research in both fields. We performed our laboratory experiments in collaboration with Dr. Philip Hieter's group at UBC's Michael Smith Laboratories, whose focus is the molecular biology of Saccharomyces cerevisiae.
The RP51B intron mutants that we designed were assembled and tested by Dr. Hieter's doctoral student Ben Montpetit.

Our experimental approach differs from that used by Libri et al. (1995) and by Charpentier and Rosbash (1996) in several ways. The mutated intron sequences are inserted back into the RP51B gene, instead of the CUP1 gene, which allows us to analyze the splicing of this intron in its endogenous environment. Another difference is that we estimated the splicing efficiency directly from the protein expression levels that we quantified using a state-of-the-art fluorescence imaging system. This makes our measurements more precise than those of Libri et al. (1995) and of Charpentier and Rosbash (1996). Ideally, splicing efficiency should be measured by the relative ratio of pre-mRNAs and mRNAs in a cell (Pikielny and Rosbash, 1985); however, our laboratory environment was not suitable for RNA isolation and quantification.

5.6.1 Verification of the experimental system

To verify that our experimental system works and that we can obtain the same results as in Libri et al. (1995), we synthesized some of Libri's mutants and tested their protein expression levels. The sequences tested were the wild type RP51B intron, and the 5mUB1, 3mUB1, 8mUB1, 5mDB1, and 3mDB1 mutated introns. The experimental procedure is described in Appendix B. Figure 5.10 displays the results of our experiments.
For each sequence, between four and six sample protein abundance measurements were obtained that were normalized with respect to the wild type protein levels. The normalized values were used to generate the plot. The shaded boxes represent the mean values for all the samples and the error bars represent +/- 1 standard deviation. The error bar for the wild type intron comes from comparison of two different wild type samples.

Figure 5.10: Protein expression results for the RP51B gene containing some of Libri's mutant introns (3mUB1, 5mUB1, 8mUB1, 3mDB1, 5mDB1) obtained by our experimental approach. Protein expression level is normalized with respect to wild type expression level. Shaded boxes represent the mean value for several different samples and error bars represent +/- 1 standard deviation for these samples. The error bar for the wild type intron comes from comparison of two different wild type samples.

The results in Figure 5.10 are not exactly the same as the splicing efficiency results in Libri et al. (1995), but the difference between the mutants that are more efficiently spliced and the ones that are not is obvious. Mutants 3mUB1 and 5mUB1, which showed slightly reduced levels of splicing in Libri et al. (1995), are still reduced compared to the wild type expression levels.
The difference is that the 8mUB1 mutant now seems to be less efficiently spliced than 3mUB1 and 5mUB1, which does not agree with the results in Libri et al. (1995). Expression levels for mutants 3mDB1 and 5mDB1 might also seem different than in the previous study, since their protein level appears not to be significantly reduced. However, we cannot assume that there is a perfect correlation between the rp51b copper resistance and splicing efficiency levels (see Figure 5.2). Thus, the lack of spots on the copper sulfate plates for mutants 3mDB1 and 5mDB1 does not necessarily imply that their splicing is completely inhibited. If we apply the KS test to determine the statistical significance of the differences between the wild type and mutant expression levels, we get p-values of 0.07 for 3mUB1, 5mUB1, and 8mUB1, indicating no significant difference between the samples, and a p-value of 0.005 for 3mDB1 and 5mDB1, which indicates that the hypothesis of no difference between the samples should be rejected.

In general, it is not realistic to expect identical results from two different experimental procedures, especially since the changes in expression levels of two different proteins (cup1 and rp51b) were measured. Overall, we consider our experimental results to be fairly consistent with Libri's results, which gives us confidence to use our experimental system for further analysis.

5.6.2 Mutant design

We designed 10 additional RP51B intron mutants for the purposes of testing our current model of the role of intronic pre-mRNA secondary structure on splicing.
Five of the mutants were designed to have structurally unfavorable characteristics for efficient splicing ('bad' mutants) and the other five mutants have structural characteristics that are supposed to ensure efficient splicing ('good' mutants). The most important structural characteristic used for mutant design was branchpoint distance, calculated according to our model. In general, the main criterion for the design was that 'good' mutants have short branchpoint distances and that 'bad' mutants have longer distances. More specifically, the requirement for the 'good' mutants was that they have multiple contact conformations (Figure 5.6), short average branchpoint distance and higher r_weight values. On the other hand, the 'bad' mutants were not supposed to have any structures with contact conformation or otherwise short branchpoint distance. Their average branchpoint distance should be significantly higher than for the wild type intron and the 'good' mutants, and their r_weight values should be lower. We also looked at the basepairing probabilities for the contact conformation and the number of free bases in the branchpoint sequence. The importance of having an unstructured branchpoint sequence was studied and discussed by Hall et al. (1988), Stephan and Kirby (1993), Mougin et al. (1996), and Chen and Stephan (2003).
We carried out the design process manually, guided by the MFE structure of the wild type intron: for 'bad' mutants the main goal was to disrupt any structural elements (stems) that bring the donor site and the branchpoint closer together, and for the 'good' mutants these structures were stabilized. All the mutations are single-block mutations up to 20 nt long, where sequences of contiguous nucleotides were substituted by new sequences designed to change the secondary structure. Four of the five 'bad' mutants were obtained by mutating the original RP51B intron sequence, and the bad4 mutant was obtained by mutating the sequence of Libri's 8mUB1 mutant. Similarly, one of the 'good' mutants, goodd, was obtained by mutating Libri's 3mDB1 mutant. Also, to test if contact conformation itself is important for splicing or just the resulting short branchpoint distance is, we created one of the 'good' mutants (good3) such that it would not have any structures with contact conformation but would have many structures where d_ij = 9. The sequences of the mutated introns are given in Appendix D. Table 5.7 shows values for various quantities and characteristics used in the design process.
mutant   # of cc   P1     P2      avg   r_weight   BP
wt       5         0.70   0.40    20    0.1494     loop
bad1     0         0.0    0.0     42    0.0243     loop
bad2     0         0.0    0.0     45    0.0227     loop
bad3     0         0.0    0.0     43    0.0226     stem
bad4     0         0.0    0.21    36    0.0494     stem
bad5     0         0.0    0.005   33    0.0303     stem
good1    7         1.0    0.99    5     0.2000     loop
good2    6         0.83   0.40    13    0.1721     loop
good3    0         0.0    0.03    9     0.0968     loop
good4    5         0.61   0.80    17    0.1361     loop
good5    8         0.76   0.70    7     0.1615     loop

Table 5.7: Characteristics of newly designed RP51B mutants: # of cc - number of structures with the contact conformation within 5% from the MFE; P1 - sum of normalized probabilities P_norm(R_ij) for all of the structures R_ij that contain contact conformation; P2 - basepairing probability of interaction between the donor site and the branchpoint sequence based on the partition function; avg - average branchpoint distance as given in Equation 5.4 (p. 112); r_weight - summary statistics defined in Equation 5.5 (p. 112); BP - structural configuration of branchpoint sequence (loop or stem).

5.6.3 Experimental procedure

The selected mutants were first synthesized using designed primers and PCR reaction and then inserted into the RP51B gene, which has a deleted intron. The cells were then allowed to grow and produce rp51b protein. The protein expression levels were measured using Western blotting analysis and then quantified using a specialized imaging system. The details of the experimental procedure are given in Appendix B.
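The quantified measurements are reported in this chapter as values normalized to the wild type level and summarized by a mean and +/- 1 standard deviation. The following is a minimal sketch of that bookkeeping, not the procedure actually used in the laboratory. The function names, the intensity values, and the use of the sample standard deviation (N - 1 denominator) are all assumptions made for illustration.

```python
from statistics import mean, stdev


def normalize_to_wild_type(mutant_intensities, wild_type_intensity):
    """Express each quantified band intensity as a fraction of the wild type."""
    if wild_type_intensity <= 0:
        raise ValueError("wild type intensity must be positive")
    return [value / wild_type_intensity for value in mutant_intensities]


def summarize(normalized_values):
    """Return (mean, sample standard deviation) of the normalized values."""
    return mean(normalized_values), stdev(normalized_values)


# Hypothetical band intensities in arbitrary units (not data from this study).
wild_type = 200.0
mutant = [90.0, 110.0, 100.0, 120.0]

normalized = normalize_to_wild_type(mutant, wild_type)
avg, sd = summarize(normalized)
print(f"mean = {avg:.3f}, +/- 1 SD = {sd:.3f}")
```

A single wild type reference value is used here for simplicity; with several wild type samples, one would presumably normalize against their mean.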
As mentioned before, quantifying the level of protein expression is not an ideal measurement of splicing efficiency: the first assumption that we are making is that the level of protein abundance is proportional to the mRNA abundance in the cell. However, there are a number of post-transcriptional and post-translational events that can affect this proportionality on the global level. Although it has been shown that the general trend is that abundant mRNAs encode abundant proteins and that the average protein per mRNA ratio is relatively constant through the full range of mRNA abundances (2500-4800 protein molecules per mRNA molecule), the results for individual genes can be very different: genes that have similar levels of mRNA abundance can have 30-fold variation in protein levels and vice versa (Gygi et al., 1999; Ghaemmaghami et al., 2003; Greenbaum et al., 2003; Beyer et al., 2004; Moore, 2005). Since in our experiments we are dealing with only one gene, it is relatively safe to assume that post-transcriptional and post-translational events will have the same effect on all of the mutants tested. Consequently, the observed differences in the protein abundance level should reflect differences in the mRNA abundance level. However, the opposite may not necessarily be true: there may be changes in the mRNA level due to splicing deficiency or enhancement that are not going to be reflected in the protein abundance level (since post-splicing events can regulate the protein expression level).
The second assumption that we are making is that any change in splicing efficiency will be reflected in the mRNA levels, which is not necessarily true: Pikielny and Rosbash (1985) observed that for some of the mutants they tested, the levels of pre-mRNA were significantly increased, while there were no changes in the mRNA level. Similar conclusions were drawn from a genome-wide analysis of yeast splicing in which the authors used the splice junction index SJ = mRNA/(pre-mRNA + mRNA) and the intron accumulation index IA = pre-mRNA/(pre-mRNA + mRNA) to analyze the effects of mutations on splicing (Clark et al., 2002). Using only one of these two indexes or using only the pre-mRNA to mRNA ratio failed to detect all of the cases in which splicing was modified.

Overall, if the protein abundance levels for different mutants are different, we can conclude that this is a consequence of changes in splicing efficiency. However, if the protein abundance levels for the wild type intron and a mutant intron appear to be the same, then we still cannot exclude the possibility of modified splicing efficiency.

5.6.4 Results and discussion

The results of protein level abundance for the RP51B gene with our new mutated introns are given in Figure 5.11. For each mutant, between four and six sample protein abundance measurements were obtained that were normalized with respect to the wild type protein levels.
The shaded boxes represent the mean normalized values for all the samples and the error bars represent +/- 1 standard deviation. The results for mutants bad2, bad5, and good1 are missing because for bad2 and bad5 we were not able to insert the designed intron mutants into the RP51B gene, and for good1 there was no observable protein expression. We were not able to resolve the cause of these problems.

[Figure 5.11 plot: "Protein expression results for new mutants"; x-axis: introns inserted into RP51B gene (bad1, bad3, bad4, good2, good3, good4, good5)]

Figure 5.11: Protein expression results for the RP51B gene containing our new mutant introns. Protein expression level is normalized with respect to wild type expression level. Shaded boxes represent the mean value for several different samples and error bars represent +/- 1 standard deviation for these samples.

From Figure 5.11, we can see that mutants bad1 and bad3 have reduced splicing efficiency (or more precisely, protein expression levels) when compared to the wt, as expected (the KS test applied on the distributions of protein expression levels rejected the null hypothesis of no difference; p-value = 0.005 for both mutants). Mutant bad4 has somewhat reduced splicing efficiency but not as much as the other two bad mutants. There are two possible reasons for this:

• The probability of basepairing interaction between the donor site and the branchpoint sequence is 0.21, even though there were no structures within 5% from the MFE that had the contact conformation.
  This means that there are probably many less probable structures that have this conformation. If we analyze all of the suboptimal structures predicted by mfold (with default window parameter) that are within 20% from the MFE, we can see that this assumption is true: there are suboptimal structures in this free energy range with d_ij = 5 and one of them is only 70 times less probable than the MFE structure. Thus, these slightly less probable structures may be sufficient to ensure relatively efficient splicing.

• The minimum free energy structure for this mutant has a branchpoint distance of 20, which still may be short enough for relatively efficient splicing.

The distribution of protein expression levels for bad4 is still significantly different from that for the wild type intron (p-value = 0.002).

Mutants good2, good3, and good4 are all spliced efficiently, as predicted. Mutant good3 does not have the contact conformation in any of the predicted structures. However, many of these structures have a short branchpoint distance of 9, suggesting that a specific structural arrangement between the donor site and the branchpoint sequence is not required for efficient splicing.

Mutant good5 shows reduced levels of protein abundance, which is in disagreement with our prediction. A possible explanation for this phenomenon may be the existence of a very stable stem (the free energy of the stem is ΔG = -36.6 kcal/mol) that holds the 5' splice site and the branchpoint together (Figure 5.12).
This zipper stem may be too stable to be disrupted, but a disruption would be needed in order to allow the spliceosome to bind to the splice signals.

Since good1 has an even more stable zipper stem than good5 (ΔG = -46.6 kcal/mol), it is possible that this could cause total inhibition of splicing. Another reason could be the proximity of the mutated block to the 5' splice site (9 nt apart). However, judging by the other results it seems unlikely that there would be no detectable amounts of protein in the cell. Thus, we still suspect that this result is due to a possible experimental error.

Figure 5.12: Part of the MFE secondary structure prediction for mutant good5 that shows the donor site and branchpoint sequence basepairing as well as the very stable zipper stem that stabilizes their interaction.

Overall, the results on the new RP51B intron mutants are consistent with our model of the role of intronic secondary structure in gene splicing.

5.7 Conclusions

In this chapter we extended our previous approach for calculating shortened branchpoint distances in 5'L yeast introns in two ways: (1) by considering not only the MFE structure but also a subset of suboptimal structures with free energies close to the MFE; and (2) by improving the calculation of the shortened branchpoint distance, taking into account the entire structure of an intron.
This new approach allowed us to identify some structural characteristics of the RP51B intron and its mutants that seem to be responsible for their differential splicing. We observed that the mutants with a very short branchpoint distance corresponding to a specific contact conformation between the donor site and the branchpoint sequence are spliced more efficiently.

We applied the new model to the STRIN dataset and observed that yeast long introns, when compared to randomly generated sequences (which resemble real introns in many respects), are more likely to fold into structures that have short branchpoint distances. This tendency is especially strong for very short branchpoint distances (approximately 5), resulting in a relative abundance of some specific structural conformations between the donor site and the branchpoint sequence. The contact conformation itself might imply that the canonical branchpoint sequence is maintained not only because it is complementary to a sequence in the U2 snRNA that interacts with the branchpoint sequence, but also because it is important for the secondary structure of the intron.

Obviously, having a contact conformation or a very short branchpoint distance is not a requirement for splicing, since there are many STRIN introns that do not have these characteristics. Still, it seems that there is a significant class of yeast introns that can fold into secondary structures with very short branchpoint distances (one-third of STRIN introns have a branchpoint distance of 5). Judging by the experimental results on one of these (Section 5.6), this short distance is important for efficient splicing. For the remaining STRIN introns, there are four possible explanations:

• Their branchpoint distance is still relatively short and sufficient for efficient splicing.

• Their predicted branchpoint distance is long due to inaccuracy of the secondary structure predictions or branchpoint distance calculation.

• Their predicted branchpoint distance is long, but they have some other mechanism to achieve an optimal conformation for spliceosome assembly, or splicing efficiency is stimulated in another way (e.g., by protein factors).

• Their predicted branchpoint distance is long, resulting in reduced splicing efficiency that is optimal for proper functioning of the gene in question; one example is the yeast gene YRA1 (Preker and Guthrie, 2006).

Finally, we computationally designed new RP51B intron mutants and predicted their splicing efficiency levels based on our new model of structural requirements for efficient splicing. The predictions were verified by laboratory experiments: the designed mutants were synthesized and inserted into the RP51B gene and then the expression level of the rp51b protein was quantified. The obtained measurements, which are thought to be proportional to splicing efficiency levels, were found to be consistent with our computational predictions.
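The comparisons between wild type and mutant expression levels in Section 5.6 rely on the KS test. As a rough illustration of what such a comparison involves, the sketch below computes the two-sample KS statistic and an exact permutation p-value, which is feasible for samples of four to six measurements. This is an assumption-laden stand-in: the text does not state which KS implementation or p-value approximation was used, the function names are hypothetical, and the numbers are invented, so the resulting p-values are not expected to match those reported above.

```python
from itertools import combinations


def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    def ecdf(sample, x):
        return sum(1 for v in sample if v <= x) / len(sample)

    points = sorted(set(sample_a) | set(sample_b))
    return max(abs(ecdf(sample_a, x) - ecdf(sample_b, x)) for x in points)


def permutation_p_value(sample_a, sample_b):
    """Fraction of all relabelings whose KS statistic is at least the observed one."""
    pooled = list(sample_a) + list(sample_b)
    observed = ks_statistic(sample_a, sample_b)
    hits = total = 0
    for indices in combinations(range(len(pooled)), len(sample_a)):
        chosen = set(indices)
        group_a = [pooled[i] for i in chosen]
        group_b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        total += 1
        # Small tolerance guards against floating-point ties.
        if ks_statistic(group_a, group_b) >= observed - 1e-12:
            hits += 1
    return hits / total


# Hypothetical normalized expression levels (not data from this study).
wild_type_levels = [1.00, 0.95, 1.05, 0.98, 1.02]
mutant_levels = [0.40, 0.35, 0.50, 0.45, 0.38]

d = ks_statistic(wild_type_levels, mutant_levels)
p = permutation_p_value(wild_type_levels, mutant_levels)
print(f"D = {d:.2f}, permutation p-value = {p:.4f}")
```

With samples this small, the set of attainable p-values is coarse, which is one reason to read borderline values with caution.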
Chapter 6

Structural characteristics of yeast introns

In the previous two chapters we focused on identification and analysis of a specific structural formation within yeast introns whose role is to shorten the large distance between the 5' splice site and the branchpoint sequence, thereby enabling proper spliceosome assembly. However, it is possible that pre-mRNA secondary structure has other functions related to gene splicing in yeast. In this chapter we conduct various analyses on the STRIN dataset, with the goal of identifying any structural characteristics that might be important for splicing. First, we investigate if intron sequences are more structurally stable than random sequences with the same sequence characteristics. We also analyze the stability of local structures in the vicinity of and at the splice signals, and investigate the stability of basepairing interactions between splice sites and snRNAs and its contribution to intron identification. Finally, we look for any conserved structural motifs in the vicinity of the splice signals.

6.1 Structural stability of introns vs. random sequences

There have been several attempts to assess the 'foldability' of naturally occurring RNA sequences - their tendency to fold into more stable secondary structures than expected by chance. In 1999, Seffens and Digby examined 51 mRNA sequences from several different organisms to determine if their minimum free energies are more negative than for randomized mRNA sequences with the same composition and length (Seffens and Digby, 1999). They employed six different mRNA randomization procedures, randomizing either entire mRNAs, coding regions or untranslated regions and preserving either base composition or codon composition. For each native mRNA and each randomization procedure, they generated 10 randomized mRNA sequences, calculated their minimum free energies using the mfold algorithm and analyzed the differences, using what they call 'segment score'. The segment score for an mRNA sequence is the number of standard deviations the mean of the randomized set is away from the native free energy. This value is more often called the Z-score. In general, the Z-score of a number x with respect to a set s_1, ..., s_N of numbers is defined by:

$$Z(x) = \frac{x - \mu}{\sigma} \qquad (6.1)$$

where $\mu = \frac{s_1 + \ldots + s_N}{N}$ and $\sigma = \sqrt{\frac{\sum_{i=1}^{N} (s_i - \mu)^2}{N - 1}}$ are the mean and standard deviation, respectively, of the numbers s_1, ..., s_N. If we want to compare the free energies of RNA sequences, x would be the minimum free energy of the native RNA and s_1, ..., s_N would be the minimum free energies for N randomized RNA sequences. In a recent comparison study of six RNA folding measures that estimate how well an RNA sequence folds, Z-score was found to be the most sensitive measure (Freyhult et al., 2005).
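The Z-score calculation described above can be sketched in a few lines. This is an illustrative sketch only: the function name and the free energy values are hypothetical, the energies would in practice come from a folding program such as mfold, and the sample standard deviation (N - 1 denominator) is assumed.

```python
from statistics import mean, stdev


def z_score(native_energy, randomized_energies):
    """Number of standard deviations separating the native value from the
    mean of the randomized set; negative means the native RNA is more stable."""
    if len(randomized_energies) < 2:
        raise ValueError("need at least two randomized sequences")
    mu = mean(randomized_energies)
    sigma = stdev(randomized_energies)  # sample standard deviation (assumed)
    if sigma == 0:
        raise ValueError("randomized energies have zero variance")
    return (native_energy - mu) / sigma


# Hypothetical minimum free energies in kcal/mol (not data from this study).
native = -10.0
randomized = [-8.0, -6.0, -7.0, -9.0]

z = z_score(native, randomized)
print(f"Z-score = {z:.3f}")
```

A negative result indicates that the native sequence folds into a structure with lower free energy than the average randomized sequence.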
The study by Seffens and Digby found that the native mRNAs have significantly lower free energies than the randomized sequences: the average Z-score for the whole-randomized set (whole mRNA, base composition preserved) was -1.23, and for the coding-random set (only coding regions randomized) it was -0.87. The Z-score values were slightly higher for other randomization procedures.

This study was later challenged by Workman and Krogh, who re-examined the same set of mRNAs and concluded that the apparent higher stability of native mRNAs could be explained by differences in dinucleotide compositions between native and randomized sequences (Workman and Krogh, 1999). Their rationale was that since most of the algorithms for RNA secondary structure prediction (including mfold) use a nearest neighbour thermodynamic model, which assumes that the stability of a specific basepair depends on the neighbouring bases, the RNA folding thermodynamics is strongly dependent on basepair stacking interactions. Therefore, in order to have a fair comparison between structural stability of native and randomized RNA sequences, dinucleotide content needs to be preserved.

Workman and Krogh (1999) used 46 mRNAs from the original set of Seffens and Digby (1999), and generated 10 randomized sequences for each of them preserving either mononucleotide or dinucleotide composition.
When only mononucleotide content was preserved, they obtained similar Z-scores as did Seffens and Digby (1999) (Z-score = -1.59). When dinucleotide content was preserved, the Z-score values were much higher (-0.20 for the first order Markov random sequences). They concluded that mRNA sequences, in general, do not form more stable extended structure than random sequences. They pointed out that the analysis they performed applies only to global secondary structure of the molecules, while more local structural interactions, such as hairpin loops, could not be detected using this approach.

Workman and Krogh (1999) also discussed the calculation of p-values associated with obtained Z-scores, which is important for determining the statistical significance of the scores. The distribution of Z-scores for random sequences can be approximated by a normal distribution with mean 0 and standard deviation 1. Assuming this, the significance of Z-scores can be approximated using an extreme value distribution that captures the likelihood that the given Z-score of a biological RNA sequence is larger than the maximum Z-score from a collection of randomized versions of the same sequence. In other words, p-values associated with Z-scores can be computed as the ratio of random sequences with a Z-score lower than that of the native sequence. This analysis was further extended by Rivas and Eddy (2000), who estimated that, in the case where 100 random sequences are generated for each native sequence, the Z-score for a single sequence has to be at least of the order of -3.8 to be considered significant (at the 0.01 significance level). For the significance level of 0.05 this value would approximately be -2.

For a set of native sequences we can compute the upper bound for the average Z-score in order for i