UBC Theses and Dissertations

Methods for microRNA profiling and discovery using massively parallel sequencing Morin, Ryan David 2007

Methods for microRNA profiling and discovery using massively parallel sequencing

by

Ryan David Morin
B.Sc., Simon Fraser University, 2003

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE in The Faculty of Sciences (Bioinformatics)

The University of British Columbia
October 2007

© Ryan David Morin, 2007

ABSTRACT

MicroRNAs (miRNAs) are emerging as important, albeit poorly characterized, regulators of gene expression. Here, I review the current knowledge of miRNAs in humans, including their biogenesis, modes of action and various methods for studying them. To fully elucidate the various functions of miRNAs in humans, we require a more complete understanding of their numbers and expression changes amongst different cell types. This document includes a description of a new method for surveying the expression of miRNAs that employs the new Illumina sequencing technology. A set of methods is presented that enables identification of sequences belonging to known miRNAs as well as variability in their mature sequences. As well, a novel system for miRNA gene discovery using these data is described. Application of this approach to RNA from human embryonic stem cells (hESCs) obtained before and after differentiation into embryoid bodies (EBs) revealed the sequences and expression levels of 362 known plus 170 novel miRNA genes.
Of these, 190 known and 31 novel microRNA sequences exhibited significant expression differences between these two developmental states. Owing to the increased number of sequence reads, these libraries currently represent the deepest miRNA sampling in any human cell type, spanning nearly six orders of magnitude of expression. Predicted targets of the differentially expressed miRNAs were ranked to identify those that are likely under cooperative miRNA regulation in either hESCs or EBs. The predicted targets of those miRNAs enriched in either sample shared common features. Included amongst the high-ranked predicted gene targets are those implicated in differentiation, cell cycle control, programmed cell death and transcriptional regulation. Direct validation of these predicted targets or global discovery of miRNA targets should reveal the functions of these sequences in the differentiation of hESCs.
TABLE OF CONTENTS

Abstract
Table of Contents
List of Tables
List of Figures
Acknowledgements
Dedication
Co-authorship statement
1 An introduction to microRNAs
  1.1 MicroRNA biogenesis
  1.2 MicroRNA function
  1.3 Methods for microRNA target prediction
  1.4 MicroRNA discovery, detection and profiling
  1.5 Perspectives and research goals
  1.6 References
2 Application of massively parallel sequencing to microRNA profiling and discovery in human embryonic stem cells
  2.1 List of authors
    2.1.1 Manuscript Status (footnote)
  2.2 Introduction
  2.3 Results
    2.3.1 Sequencing and annotation of small RNAs
    2.3.2 Variability in microRNA processing
    2.3.3 Enzymatic modification of microRNAs
    2.3.4 MicroRNAs differentially expressed between hESCs and EBs
    2.3.5 Novel microRNA genes
    2.3.6 Targets of differentially expressed microRNAs
  2.4 Discussion
  2.5 Methods
    2.5.1 Small RNA library preparation
    2.5.2 Small RNA genome mapping and quantification
    2.5.3 Differential expression detection
    2.5.4 Small RNA annotation
    2.5.5 Novel miRNA detection
    2.5.6 Cooperative miRNA target prediction
    2.5.7 Gene ontology analysis
  2.6 References
3 Concluding remarks
  3.1 Goals accomplished
  3.2 Future directions
  3.3 References
Appendices
  Appendix A: Supplementary Tables
  Appendix B: Supplementary Methods and accompanying data

LIST OF TABLES

Table 2.1: Top 20 miRNAs differentially expressed between the hESC and EB libraries
Table 2.2: Putative novel miRNAs with significant differential expression
Table 2.3: Top 25 statistically over-represented seeds of miRNA sequences found in hESCs or EBs

LIST OF FIGURES

Figure 2.1: Distribution of sequence counts for the different classes of small RNAs
Figure 2.2: The repertoire of isomiRs and 3' modifications of hsa-mir-191
Figure 2.3: Clustering of over-represented Gene Ontology classes in predicted targets of differential microRNAs

ACKNOWLEDGEMENTS

I am grateful to my rotation supervisors, Dr. Peter Unrau and Dr. Francis Ouellette, for assisting me in developing my scientific skills and providing guidance throughout the duration of my degree. I extend my deepest gratitude to my supervisor, Dr. Marco Marra, for providing me with an excellent work environment and thesis project. This work would not have been possible without the ongoing moral support of my loving wife, Sarah.

To my parents, grandparents, wife, Sarah, and daughter, Natalie

CO-AUTHORSHIP STATEMENT

The manuscript included in this document is a culmination of work conducted by both laboratory scientists and bioinformaticians. Drs. Michael O'Connor and Connie Eaves produced the human embryonic stem cell and embryoid body RNA that was used for small RNA sequencing. Dr. Martin Hirst, Anna-liisa Prabhu, Yongjun Zhao, Helen McDonald and Thomas Zeng prepared small RNA libraries and sequenced the material. Malachi Griffith and Drs.
Florian Kuchenbauer and Allen Delaney provided bioinformatics support and helpful suggestions for strategies and improvements to the data analysis and presentation. Ryan Morin and Marco Marra co-conceived the project. Marco Marra provided methodological, conceptual, emotional and financial support. Ryan Morin performed all database and algorithm design and bioinformatic analyses, wrote the manuscript and produced all figures. Rhonda Oshanek edited the document for content and grammar.

1 AN INTRODUCTION TO MICRORNAS

1.1 microRNA biogenesis

MicroRNAs (miRNAs) are short RNA molecules, 19-25 nucleotides (nt) in length, that derive from stable fold-back sub-structures of larger transcripts (Cai et al. 2004). MiRNA genes can exist in the introns of spliced mRNAs, as solitary non-coding transcripts, or within polycistronic transcripts containing multiple miRNA genes (Bartel 2004). In most cases, these primary miRNA transcripts (pri-miRNAs) are cleaved by a complex of Drosha and its cofactor DGCR-8, producing one or more miRNA precursor (pre-miRNA) hairpins with a 2-nt overhang at their 3' ends (Lund et al. 2004). Pre-miRNAs are exported from the nucleus to the cytoplasm by Exportin-5, after which they are further cleaved by Dicer, releasing short RNA duplexes with a 2-nt 3' overhang at both ends (Lund et al. 2004).
The recently discovered 'mirtrons', which have been observed in both Drosophila melanogaster and Caenorhabditis elegans, are an exception to this mechanism as they undergo nuclear export and Dicer cleavage without Drosha processing (Ruby et al. 2007). In either case, after cleavage by Dicer, one of the strands of the remaining RNA duplex is thought to be preferentially incorporated into the RNA-induced silencing complex (RISC), leaving an inactive molecule (miRNA*) that is subsequently degraded (O'Toole et al. 2006).

1.2 microRNA function

Of the miRNA/miRNA* duplex, which comprises a short strand from each arm of the pre-miRNA hairpin, only one strand is generally thought to become an active miRNA molecule, hence assembled within RISC (Bartel 2004). The active strand is identified based on the thermodynamic properties at both ends of the duplex, including the identity of the two nucleotides in the 3' overhang (O'Toole et al. 2006). However, for an increasing number of miRNA genes (currently 79) (http://microrna.sanger.ac.uk), mature sequences from both hairpin arms (i.e., both strands of the miRNA/miRNA* duplex) are known to accumulate in approximately equal abundance in the cell and are both thought to be biologically active. The activity of miRNA and miRNA* sequences is often presumed based on cloning frequencies or homology to known miRNAs. However, large-scale sequencing projects are leading to a steady increase in the number of miRNA genes known to produce two functional miRNAs.
As the miRNA genes of this type accumulate, a new notation is being implemented that may replace the limited miRNA/miRNA* notation. This system, which is used throughout the remainder of the text, disambiguates miRNAs arising from a single pre-miRNA using their positioning 5' (5p) or 3' (3p) within the pre-miRNA hairpin (Griffiths-Jones et al. 2006).

In addition to the mature miRNA, the RISC protein complex can contain a variety of proteins. A key component of RISC is a member of the Argonaute protein family, of which there are four in human (EIF2C1, EIF2C2, EIF2C3 and EIF2C4) (Tolia et al. 2007). Once assembled within RISC, miRNAs can elicit post-transcriptional repression of target genes. In animals, miRNA-target interactions occur through semi-complementary base pairing to sites usually within the 3'-UTR of the target transcript (Bartel 2004). Strong base pairing at the 5' end of the miRNA, generally termed the seed region, is thought to be a requisite for miRNA-target interaction. If the target site contains a perfect complement to the miRNA seed sequence, relaxed hybridization is tolerated along the remainder of the miRNA/mRNA duplex. If, however, there is a lack of complementarity in the seed, stronger complementarity along the remainder of the miRNA/mRNA duplex is required for a strong interaction (Bartel 2004). Hybridization of miRNAs (within RISC) to their target mRNAs can result in post-transcriptional regulation by two separate mechanisms: transcript degradation or translational repression.
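
The seed-pairing requirement described above can be illustrated with a minimal sketch. It is not the method of any published target prediction program: it assumes that the seed spans positions 2-8 from the 5' end of the miRNA, that only perfect Watson-Crick matches count, and that both sequences are written 5' to 3' in the RNA alphabet. The function names and the toy sequences are hypothetical and serve only as an example.

```python
from typing import List

# Illustrative sketch only: exact seed-site search, not a full target predictor.
# Assumption: the seed is positions 2-8 (1-based) from the 5' end of the miRNA.

COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def reverse_complement(rna: str) -> str:
    """Return the reverse complement of an RNA sequence (A, C, G, U only)."""
    return "".join(COMPLEMENT[base] for base in reversed(rna.upper()))


def seed_of(mirna: str, start: int = 2, end: int = 8) -> str:
    """Extract the seed using 1-based, inclusive positions from the 5' end."""
    return mirna.upper()[start - 1:end]


def find_seed_sites(mirna: str, utr: str) -> List[int]:
    """Return 0-based offsets in the UTR that pair perfectly with the seed."""
    site = reverse_complement(seed_of(mirna))
    utr = utr.upper()
    return [i for i in range(len(utr) - len(site) + 1)
            if utr[i:i + len(site)] == site]


if __name__ == "__main__":
    toy_mirna = "UGAGGUAGUAGGUUGUAUAGUU"  # toy miRNA, 5'->3' (hypothetical input)
    toy_utr = "AAACUACCUCAGGGCUACCUCUU"   # made-up 3'-UTR fragment, 5'->3'
    print(seed_of(toy_mirna))
    print(find_seed_sites(toy_mirna, toy_utr))
```

A search of this kind captures only the first case in the text, a perfect seed complement. Sites with an imperfect seed compensated by stronger pairing elsewhere, as well as conservation filters, would require additional scoring that is deliberately left out of this sketch.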
MiRNAs that anneal with extensive complementarity to their target sequences and are accompanied by a RISC containing EIF2C2 are thought to direct the cleavage of target messenger RNAs. This process is analogous to the action of short interfering RNAs (siRNAs) (Liu et al. 2004; Meister et al. 2004). In animals, the more widespread mechanism of miRNA action is that of translational repression. This process is mediated by RISCs containing EIF2C1 and generally involves miRNAs with reduced complementarity to their target sites (Bartel 2004). The binding of multiple RISCs to the same transcript can enhance this effect considerably (Grimson et al. 2007; Doench et al. 2003).

1.3 Methods for microRNA target prediction

That mammalian cells tolerate relaxed complementarity between miRNAs and their targets poses many problems in computational 'target prediction'. Hence, creating accurate target prediction algorithms is an active field of study, with numerous emerging approaches that rely on differing methodologies and assumptions. These algorithms use different methods to assess the relative complementarity between miRNAs and their potential targets (John et al. 2004). Presently, the three leading programs for mammalian miRNA target prediction are PicTar (Krek et al. 2005), MiRanda (John et al. 2004) and TargetScanS (referred to as TargetScan throughout the remainder of this document) (Lewis et al. 2005).
To enhance their specificity, all three methods enforce some degree of evolutionary conservation within the region of candidate target sites. These methods also consider pairing within the seed region of the miRNA to be more important than in the remainder of the sequence. Though all three approaches use similar inputs, they result in moderate to poor agreement between their target predictions, with the strongest agreement between the outputs of PicTar and TargetScan (Sethupathy et al. 2006). A common drawback amongst all these programs is that they lack predictions for recently discovered miRNAs, since their database updates do not occur as frequently as updates to miRBase. TargetScan was used in the analyses described herein because it provides the ability to perform 'custom' target predictions with novel miRNA sequences.

1.4 microRNA discovery, detection and profiling

In addition to current deficiencies in target identification, the field of miRNA research also lacks a complete list of the human miRNA genes. Additionally, evaluation of expression levels of the known miRNAs amongst the numerous tissue and cell types is a field still in its infancy (Landgraf et al. 2007). Both of these deficiencies can be attributed to a lack of a robust methodology for profiling the expression level of each miRNA in a sample.
The current commercially available high-throughput methodologies rely on primers or probes designed to detect each of the current reference miRNA sequences residing in miRBase, which acts as the central repository for all known miRNAs (Griffiths-Jones 2006). These systems, which can be purchased as individual probes or in a microarray format (Castoldi et al. 2007), allow concurrent profiling of the expression of many miRNAs (Zhao et al. 2006). The recent incorporation of Locked Nucleic Acid (LNA) nucleotide analogs into probes promises a great enhancement in the robustness of these technologies, potentially facilitating discrimination between miRNAs with single nucleotide differences (Vester et al. 2004). MiRNA microarrays offer good reproducibility and facilitate clustering of samples by similar miRNA expression profiles (Davison et al. 2006; Porkka et al. 2007). However, these methodologies are restricted to detecting and profiling only the known miRNA sequences previously identified by sequencing or homology searches, or to querying candidate regions of the genome that have been predicted to produce novel miRNAs (Berezikov et al. 2006a).

An emerging alternative method for miRNA detection and profiling involves generating cDNA libraries from small RNA cellular fractions, followed by deep sequencing using a variety of approaches (Lu et al. 2007).
Previously, sequencing-based strategies for miRNA detection have been hindered by the requirements of concatamerization and cloning as well as the expense of capillary DNA sequencing (Cummins et al. 2006; Pfeffer et al. 2005). Nevertheless, small RNA sequencing has several advantages over hybridization-based methodologies. Discovery of novel miRNAs need not rely on querying candidate regions of the genome for expression using microarrays. Rather, novel miRNAs can be discovered by direct observation of their mature sequences, followed by validation of the folding potential of flanking genomic DNA using various approaches (Berezikov et al. 2006a; Cummins et al. 2006). Small RNA sequencing also offers the potential to detect variation in mature miRNA length and enzymatic modification of miRNAs such as RNA editing (Kawahara et al. 2007) and 3' nucleotide additions (Aravin et al. 2005; Landgraf et al. 2007). In contrast to capillary sequencing, recently developed 'next-generation' sequencing technologies offer the prospect of a vast yet inexpensive increase in throughput, therefore providing a more complete view of the miRNA transcriptome. With the depth of sequencing now possible, we have an opportunity to identify low-abundance miRNAs or those exhibiting modest expression differences between samples, which may not be detected by hybridization-based methods such as microarrays. Next-generation miRNA profiling has already been accomplished in a variety of organisms (Fahlgren et al.
2007; Kasschau et al. 2007; Rajagopalan et al. 2006; Yao et al. 2007; Berezikov et al. 2006b) using, at first, Massively Parallel Signature Sequencing (MPSS) methodology (Nakano et al. 2006) and more recently the Roche/454 platform (Berezikov et al. 2006b; Margulies et al. 2005). The recently released Illumina sequencing platform provides two to three orders of magnitude greater depth from a single run when compared to the output of current competing technologies (Berezikov et al. 2006b).

1.5 Perspectives and research goals

Three next-generation sequencing technologies have been released in the past few years (Bennett 2004; Margulies et al. 2005; Shendure et al. 2005). Each technology generates sequence reads in varying quantities with different lengths and error profiles. The work outlined in the accompanying manuscript describes a focused view of the miRNA profiling capabilities of the Illumina sequencing method. Many new algorithms and tools are required to facilitate analysis of the large data sets produced by these technologies. A major goal of this project was to produce a set of bioinformatics tools, including a database and associated software, which would allow a skilled researcher to gain biologically relevant and statistically meaningful insights from this type of data. This methodology was first used to profile the miRNA changes involved in differentiation of human embryonic stem cells.
A secondary goal of this project was to identify genes potentially regulated by these miRNAs, in hopes of identifying genes of central importance to the pluripotency or self-renewal of these cells. The following chapter discusses progress towards these goals and should provide a foundation for future analyses using this or similar sequencing approaches.

1.6 References

Aravin, A. and Tuschl, T. 2005. Identification and characterization of small RNAs involved in RNA silencing. FEBS Lett 579: 5830-40.

Bartel, D. P. 2004. MicroRNAs: genomics, biogenesis, mechanism, and function. Cell 116: 281-97.

Bennett, S. 2004. Solexa Ltd. Pharmacogenomics 5: 433-438.

Berezikov, E., Cuppen, E., and Plasterk, R. H. 2006a. Approaches to microRNA discovery. Nat Genet 38 Suppl: S2-7.

Berezikov, E., Thuemmler, F., van Laake, L. W., Kondova, I., Bontrop, R., Cuppen, E., and Plasterk, R. H. 2006b. Diversity of microRNAs in human and chimpanzee brain. Nat Genet 38: 1375-7.

Cai, X., Hagedorn, C. H., and Cullen, B. R. 2004. Human microRNAs are processed from capped, polyadenylated transcripts that can also function as mRNAs. RNA 10: 1957-66.

Castoldi, M., Benes, V., Hentze, M. W., and Muckenthaler, M. U. 2007. miChip: A microarray platform for expression profiling of microRNAs based on locked nucleic acid (LNA) oligonucleotide capture probes. Methods 43: 146-152.

Cummins, J. M., He, Y., Leary, R. J., Pagliarini, R., Diaz, L. A., Jr., Sjoblom, T., Barad, O., Bentwich, Z., Szafranska, A. E., Labourier, E., et al. 2006.
The colorectal microRNAome. Proc Natl Acad Sci USA 103: 3687-92.

Davison, T. S., Johnson, C. D., and Andruss, B. F. 2006. Analyzing micro-RNA expression using microarrays. Methods Enzymol 411: 14-34.

Doench, J. G., Petersen, C. P., and Sharp, P. A. 2003. siRNAs can function as miRNAs. Genes Dev 17: 438-442.

Fahlgren, N., Howell, M. D., Kasschau, K. D., Chapman, E. J., Sullivan, C. M., Cumbie, J. S., Givan, S. A., Law, T. F., Grant, S. R., Dangl, J. L., et al. 2007. High-Throughput Sequencing of Arabidopsis microRNAs: Evidence for Frequent Birth and Death of MIRNA Genes. PLoS ONE 2: e219.

Griffiths-Jones, S. 2006. miRBase: the microRNA sequence database. Methods Mol Biol 342: 129-38.

Griffiths-Jones, S., Grocock, R. J., van Dongen, S., Bateman, A., and Enright, A. J. 2006. miRBase: microRNA sequences, targets and gene nomenclature. Nucleic Acids Res 34: D140-4.

Grimson, A., Farh, K. K., Johnston, W. K., Garrett-Engele, P., Lim, L. P., and Bartel, D. P. 2007. MicroRNA targeting specificity in mammals: determinants beyond seed pairing. Mol Cell 27: 91-105.

John, B., Enright, A. J., Aravin, A., Tuschl, T., Sander, C., and Marks, D. S. 2004. Human MicroRNA targets. PLoS Biol 2: e363.

Kasschau, K. D., Fahlgren, N., Chapman, E. J., Sullivan, C. M., Cumbie, J. S., Givan, S. A., and Carrington, J. C. 2007. Genome-Wide Profiling and Analysis of Arabidopsis siRNAs. PLoS Biol 5: e57.

Kawahara, Y., Zinshteyn, B., Sethupathy, P., Iizasa, H., Hatzigeorgiou, A. G., and Nishikura, K. 2007. Redirection of silencing targets by adenosine-to-inosine editing of miRNAs. Science 315: 1137-40.

Krek, A., Grun, D., Poy, M. N., Wolf, R., Rosenberg, L., Epstein, E. J., MacMenamin, P., da Piedade, I., Gunsalus, K. C., Stoffel, M., et al. 2005. Combinatorial microRNA target predictions. Nat Genet 37: 495-500.

Landgraf, P., Rusu, M., Sheridan, R., Sewer, A., Iovino, N., Aravin, A., Pfeffer, S., Rice, A., Kamphorst, A. O., Landthaler, M., et al. 2007. A Mammalian microRNA Expression Atlas Based on Small RNA Library Sequencing. Cell 129: 1401-14.

Lewis, B. P., Burge, C. B., and Bartel, D. P. 2005. Conserved seed pairing, often flanked by adenosines, indicates that thousands of human genes are microRNA targets. Cell 120: 15-20.

Liu, J., Carmell, M. A., Rivas, F. V., Marsden, C. G., Thomson, J. M., Song, J. J., Hammond, S. M., Joshua-Tor, L., and Hannon, G. J. 2004. Argonaute2 is the catalytic engine of mammalian RNAi. Science 305: 1437-1441.

Lu, C., Meyers, B. C., and Green, P. J. 2007. Construction of small RNA cDNA libraries for deep sequencing. Methods 43: 110-117.

Lund, E., Guttinger, S., Calado, A., Dahlberg, J. E., and Kutay, U. 2004. Nuclear export of microRNA precursors. Science 303: 95-8.

Margulies, M., Egholm, M., Altman, W. E., Attiya, S., Bader, J. S., Bemben, L. A., Berka, J., Braverman, M. S., Chen, Y. J., Chen, Z., et al. 2005. Genome sequencing in microfabricated high-density picolitre reactors. Nature 437: 376-80.

Meister, G. and Tuschl, T. 2004. Mechanisms of gene silencing by double-stranded RNA. Nature 431: 343-349.

Nakano, M., Nobuta, K., Vemaraju, K., Tej, S. S., Skogen, J. W., and Meyers, B. C. 2006. Plant MPSS databases: signature-based transcriptional resources for analyses of mRNA and small RNA. Nucleic Acids Res 34: D731-5.

O'Toole, A. S., Miller, S., Haines, N., Zink, M. C., and Serra, M. J. 2006. Comprehensive thermodynamic analysis of 3' double-nucleotide overhangs neighboring Watson-Crick terminal base pairs. Nucleic Acids Res 34: 3338-44.

Pfeffer, S., Sewer, A., Lagos-Quintana, M., Sheridan, R., Sander, C., Grasser, F. A., van Dyk, L. F., Ho, C. K., Shuman, S., Chien, M., et al. 2005. Identification of microRNAs of the herpesvirus family. Nat Methods 2: 269-76.

Porkka, K. P., Pfeiffer, M. J., Waltering, K. K., Vessella, R. L., Tammela, T. L., and Visakorpi, T. 2007. MicroRNA expression profiling in prostate cancer. Cancer Res 67: 6130-5.

Rajagopalan, R., Vaucheret, H., Trejo, J., and Bartel, D. P. 2006. A diverse and evolutionarily fluid set of microRNAs in Arabidopsis thaliana. Genes Dev 20: 3407-25.

Ruby, J. G., Jan, C. H., and Bartel, D. P. 2007. Intronic microRNA precursors that bypass Drosha processing. Nature 448: 83-86.

Sethupathy, P., Megraw, M., and Hatzigeorgiou, A. G. 2006.
A guide through present computational approaches for the identification of mammalian microRNA targets. Nat Methods 3: 881-886.

Shendure, J., Porreca, G. J., Reppas, N. B., Lin, X., McCutcheon, J. P., Rosenbaum, A. M., Wang, M. D., Zhang, K., Mitra, R. D., and Church, G. M. 2005. Accurate multiplex polony sequencing of an evolved bacterial genome. Science 309: 1728-1732.

Tolia, N. J. and Joshua-Tor, L. 2007. Slicer and the Argonautes. Nat Chem Biol 1: 36-43.

Vester, B. and Wengel, J. 2004. LNA (locked nucleic acid): high-affinity targeting of complementary RNA and DNA. Biochemistry 43: 13233-41.

Yao, Y., Guo, G., Ni, Z., Sunkar, R., Du, J., Zhu, J. K., and Sun, Q. 2007. Cloning and characterization of microRNAs from wheat (Triticum aestivum L.). Genome Biol 8: R96.

Zhao, J. J., Hua, Y. J., Sun, D. G., Meng, X. X., Xiao, H. S., and Ma, X. 2006. Genome-wide microRNA profiling in human fetal nervous tissues by oligonucleotide microarray. Childs Nerv Syst 22: 1419-25.

2 Application of massively parallel sequencing to microRNA profiling and discovery in human embryonic stem cells

2.1 List of Authors

Ryan D. Morin, Michael D. O'Connor, Malachi Griffith, Florian Kuchenbauer, Allen Delaney, Anna-liisa Prabhu, Yongjun Zhao, Helen McDonald, Thomas Zeng, Martin Hirst, Connie J. Eaves and Marco A. Marra.

2.2 Introduction

MicroRNAs (miRNAs) are short RNA molecules, 19-25 nt in length, that derive from stable fold-back sub-structures of larger transcripts (Cai et al. 2004).
In most cases, primary miRNA transcripts (pri-miRNAs) are cleaved by a complex of Drosha and its cofactor DGCR-8, producing one or more pre-miRNA hairpins, each with a 2-nt 3' overhang (Lund et al. 2004). Nuclear export of these precursors is mediated by Exportin 5 (Lund et al. 2004), after which they are released as short double-stranded RNA duplexes following a second cleavage by Dicer (Lund et al. 2004). Based on the thermodynamic stability of each end of this duplex, one of the strands is thought to be preferentially incorporated into the RISC protein complex, producing a biologically active miRNA and an inactive miRNA* (O'Toole et al. 2006). Once assembled within RISC, miRNAs can elicit down-regulation of target genes by blocking their translation, inducing Ago-2-mediated degradation or potentially inducing deadenylation (Aravin et al. 2005; Giraldez et al. 2006).

* A version of this chapter has been submitted to Genome Research to be considered for publication as a research article.

In animals, miRNA-target interactions occur through semi-complementary base pairing, usually within the 3'-UTR of the target transcript. The relaxed complementarity between miRNAs and their target sites poses many problems in computational target prediction; hence, this is an active field of study with numerous emerging approaches that rely on differing methodologies and assumptions. Common target prediction algorithms assess complementarity of miRNAs to their targets (John et al.
2004), many enforcing strong pairing within the seed region of the miRNA (positions 2-8) (Grimson et al. 2007; Lewis et al. 2005). As they were discovered only relatively recently, the study of miRNAs is a young and rapidly changing field in which undiscovered genes remain and the sequence modifications and activity of mature miRNAs are not well understood. Before the effect of miRNAs on gene regulation can be globally studied, a robust method for profiling the expression level of each miRNA in a sample is required. The current commercially available high-throughput methodologies rely on primers or probes designed to detect each of the current reference miRNA sequences residing in miRBase, which acts as the central repository for known miRNAs (Griffiths-Jones 2006). These systems, which are often available in array form, allow concurrent profiling of many miRNAs (Zhao et al. 2006). These approaches feature good reproducibility and facilitate clustering of samples by similar miRNA expression profiles (Davison et al. 2006; Porkka et al. 2007). However, probe-based methodologies are generally restricted to the detection and profiling of only the known miRNA sequences previously identified by sequencing or homology searches. Sequencing-based applications for identifying and profiling miRNAs have been hindered by laborious cloning techniques and the expense of capillary DNA sequencing (Cummins et al. 2006; Pfeffer et al. 2005).
Nevertheless, direct small RNA sequencing has several advantages over hybridization-based methodologies. Discovery of novel miRNAs need not rely on querying candidate regions of the genome, but rather can be achieved by direct observation and validation of the folding potential of flanking genomic sequence (Berezikov et al. 2006a; Cummins et al. 2006). Direct sequencing also offers the potential to detect variation in mature miRNA length, as well as enzymatic modification of miRNAs such as RNA editing (Kawahara et al. 2007) and 3' nucleotide additions (Aravin et al. 2005; Landgraf et al. 2007). In contrast to capillary sequencing, recently available 'next-generation' sequencing technologies offer inexpensive increases in throughput, thereby providing a more complete view of the miRNA transcriptome. With the added depth of sequencing now possible, an opportunity exists to identify low-abundance miRNAs or those exhibiting modest expression differences between samples, including those that may not be detected by hybridization-based methods. Next-generation miRNA profiling has already been realized in a few organisms (Fahlgren et al. 2007; Kasschau et al. 2007; Rajagopalan et al. 2006; Yao et al. 2007) using the Massively Parallel Signature Sequencing (MPSS) methodology (Nakano et al. 2006) and more recently the Roche/454 platform (Margulies et al. 2005).
The recently released Illumina sequencing platform provides approximately two orders of magnitude greater depth than current competing technologies from a single run (Berezikov et al. 2006b), hence providing a more cost-effective option when compared to other sequencing strategies (http://www.solexa.com). We sought to use this new technology to extensively profile and identify changes in miRNA expression that occur within a previously characterized model system. Pluripotent human embryonic stem cells (hESCs) can be cultured under non-adherent conditions that induce them to differentiate into cells belonging to all three germ layers and form cell aggregates termed embryoid bodies (EBs) (Itskovitz-Eldor et al. 2000; Bhattacharya et al. 2004). Samples of undifferentiated hESCs and differentiated cells from EBs were chosen for miRNA profiling, first because the pluripotency of ESCs is known to require the presence of miRNAs (Bernstein et al. 2003; Song et al. 2006; Wang et al. 2007) and second because specific changes in miRNA expression are thought to accompany differentiation (Chen et al. 2007). The hESC messenger RNA transcriptome has been extensively studied by ourselves and others (Abeyta et al. 2004; Bhattacharya et al. 2005; Bhattacharya et al. 2004; Boyer et al. 2005; Hirst et al. 2007; Sato et al. 2003) but, to date, very little is known about the specific miRNAs that may play roles in their pluripotency or differentiation (Suh et al. 2004).
2.3 Results

2.3.1 Sequencing and annotation of small RNAs

Sequencing of small RNA libraries yielded 6,147,718 and 6,014,187 unfiltered 37-nucleotide (nt) sequence reads from hESCs and EBs, respectively. After removal of reads containing ambiguous base calls, 5,261,520 (hESC) and 5,192,421 (EB) unique sequences remained with counts varying between 1 and 38,390. After mapping to the human genome, a total of 766,199 (hESC) and 724,091 (EB) genome-mapped/trimmed small RNA sequences remained, represented by 4,351,479 and 3,886,865 reads, respectively. These sequences were annotated as one of the known classes of small RNA genes or degradation fragments of larger noncoding RNAs (Methods). Sequences derived from 362 distinct miRNA genes were identified. Based on their median expression level, miRNAs were the most abundant class of small RNAs on average, but spanned the entire range of expression, with sequence counts from singletons up to ~120 thousand (Figure 2.1, panel a). The piRNAs, previously thought to exist only in germline cells, were also found at relatively low levels. A total of 9,012 and 4,606 sequence reads corresponded to piRNAs in hESCs and EBs, respectively. The sampling depth provided by the Illumina technology, spanning approximately 5 orders of magnitude, appears to extend the dynamic range of miRNA expression within a cell by two additional orders of magnitude (Berezikov et al. 2006a).
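The filtering and collapsing of reads described in this section can be illustrated with a minimal Python sketch. It assumes that an ambiguous base call is any character other than A, C, G or T, and the toy reads are invented for illustration; it is not the pipeline that produced the counts reported above, which is described in the Methods.

```python
from collections import Counter

def collapse_reads(reads):
    """Drop reads with ambiguous base calls and count each unique sequence."""
    valid = set("ACGT")  # assumption: anything else (e.g. N) is ambiguous
    counts = Counter()
    for read in reads:
        seq = read.strip().upper()
        if seq and set(seq) <= valid:
            counts[seq] += 1
    return counts

# Hypothetical toy input: two identical reads, one read with an N, one other read.
toy_reads = [
    "TAAGTGCTTCCATGTTTTGGTGA",
    "TAAGTGCTTCCATGTTTTGGTGA",
    "TAAGTGCTTCCATGTTTNGGTGA",
    "TGAGGTAGTAGGTTGTATAGTT",
]
unique_counts = collapse_reads(toy_reads)
```

Each key of `unique_counts` is a unique sequence and each value is its read count, which is the form in which expression levels are compared throughout this chapter.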
2.3.2 Variability in microRNA processing

In both libraries, miRNAs frequently exhibited variation from their 'reference' sequences, producing multiple mature variants that we hereafter refer to as isomiRs. In many cases the miRNA* sequence and its isomiRs were also observed in our libraries. The existence of isomiRs (Figure 2.2 shows an example) is commonly reported in miRNA cloning studies but is generally dismissed, with either the sequence matching the miRBase record or the most frequently observed isomiR chosen as the reference sequence (Landgraf et al. 2007; Ruby et al. 2006). It appears that much of the isomiR variability can be explained by variability in either Dicer or Drosha cleavage positions within the pri-miRNA hairpin. Notably, our data show that choosing a different isomiR sequence for measuring miRNA abundance can affect our ability to detect differential miRNA abundance. Based on our analysis, the read count for the most abundant isomiR, rather than the miRBase reference sequence, provides the most robust approach for comparing miRNA expression between libraries (see Supplementary Methods in Appendix B). In 140 cases, this most abundant sequence did not correspond exactly to the current miRBase reference sequence. Most commonly, this was due to length variation at the 3' end of the miRNAs, though isomiRs with variation at either the 5' end or both ends were also observed.
This suggests either that the relative abundance of isomiRs may vary across tissues or that the original submission of these miRNAs to miRBase was incorrect. The latter appears more likely when we consider that, following the most recent update to miRBase, more of the most abundant sequences in our libraries corresponded to the updated miRBase reference sequences.

2.3.3 Enzymatic modification of microRNAs

Although some reads were detected that may represent the result of pre-miRNA editing by adenosine deaminases (observed as A to G transitions) or cytidine deaminases (producing C to U transitions), examples of these were infrequent and few were significantly above the background level of other apparent sequencing errors. The more prevalent type of modification noted amongst the miRNAs was single-nucleotide 3' extension, which occurred in multiple isomiRs from nearly all observed miRNA genes (316 genes). These modifications produce an isomiR that matches the genome at every position except the terminal nucleotide. The nucleotide most commonly added was Adenine (1130 distinct examples), followed by Uridine (1008 examples), Cytosine (733 examples) then Guanine (508 examples). The prevalence of modifications differed amongst the miRNAs, with some miRNAs having more reads representing modified than unmodified forms.
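A minimal sketch of how such single-nucleotide 3' additions might be flagged is given here. It assumes that each read has already been paired with the genomic sequence of equal length at its mapped locus; the function name and the toy sequences are hypothetical and serve only to restate the definition above (a match at every position except the terminal nucleotide).

```python
from collections import Counter

def three_prime_addition(read, genomic):
    """Return the non-templated 3' nucleotide, or None if the read is not
    a single-nucleotide 3' addition relative to the genomic sequence."""
    if len(read) != len(genomic) or len(read) < 2:
        return None
    if read[:-1] == genomic[:-1] and read[-1] != genomic[-1]:
        return read[-1]
    return None

# Hypothetical (read, genomic sequence at mapped locus) pairs.
toy_pairs = [
    ("ACGTTGCA", "ACGTTGCC"),  # terminal A differs from the genome
    ("ACGTTGCT", "ACGTTGCC"),  # terminal T (U in RNA) differs from the genome
    ("ACGTTGCC", "ACGTTGCC"),  # exact match, unmodified
    ("ACCTTGCC", "ACGTTGCC"),  # internal mismatch, not a 3' addition
]
added = Counter(
    base for base in (three_prime_addition(r, g) for r, g in toy_pairs) if base
)
```

Tallying the returned nucleotides over all isomiRs, as `added` does for the toy pairs, gives the kind of per-nucleotide totals quoted in this section.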
End modifications were not limited to the most common isomiR, nor did they show significant differences between the two libraries, suggesting that they may not bear direct significance in hESCs, but rather may reflect a general cellular process of miRNA modification (see Figure 2.2 for an example). For some of the miRNAs showing the highest ratios of modified/unmodified isomiRs, the same modification was previously observed across multiple human and mouse tissues (Landgraf et al. 2007). The study by Landgraf and colleagues employed a different sequencing approach, suggesting that these variations are not inherent to the sequencing strategy used. For example, hsa-miR-326 was found with a terminal Adenine addition in 66% of the corresponding reads in both of our libraries and multiple members of the family hsa-miR-30 showed Uridine additions in more than half of their reads. Both of these specific modifications were noted in the Landgraf study, where the same modifications were also observed in the orthologous mouse miRNAs. The significance of the apparent evolutionary conservation of these modifications is addressed below.

2.3.4 MicroRNAs differentially expressed between hESCs and EBs

We tested for differential expression between the hESC and EB libraries using all sequences, regardless of their annotation; hence, each isomiR was independently tested for differential expression.
The total number of differentially expressed sequences corresponding to the miRBase reference sequence (120 total) was lower than the total number of miRNAs identified as differentially expressed using the most abundant isomiR for each pre-miRNA arm (190 total from 165 miRNA genes) (Table 2.1 and Supplementary Table 1, Appendix A). All 120 of the miRNA genes identified as differentially expressed, based only on the sequences corresponding perfectly to those in miRBase, were amongst the 165 identified by the most abundant sequence in our data. Employing the count of the most abundant sequence for the calculation of differential expression also allows identification of the supposed miR* sequences that are differentially expressed between hESCs and EBs but are currently absent from miRBase (25 total). This reinforces the concept that the most abundant, rather than the reference miRNA sequence, is the most useful for identifying differentially expressed miRNAs.

2.3.5 Novel microRNA genes

We sought to identify novel miRNA genes amongst the unclassified sequences in our libraries. After annotation, 40,699 of the small RNA sequences in the hESC library and 40,598 in the EB library (ignoring singletons) remained unclassified because they derived from un-annotated regions of the human genome. As they were of generally low expression and often similar in sequence to known genomic repeats, we assumed many of these were either randomly derived fragments of various RNAs or regulatory siRNAs (Berezikov et al. 2006b; Watanabe et al. 2006).
To identify candidate novel miRNAs amongst these, we employed both in-house and publicly available algorithms (Methods). Once good candidates were identified, we compared both their mature sequences and the predicted pre-miRNA sequences to those of all currently known miRNAs to aid in classifying them into families. The total set of novel miRNA candidates, included in Supplementary Table 2 (Appendix A), comprises 129 unique miRNA sequences from up to 170 distinct genes. Compared to other miRNAs in our data, these miRNAs exhibited modest expression (mean count=58), perhaps indicating that most of them may not perform significant functions in these cell types. However, 31 exhibited significant differential expression between the two libraries and are thus potentially biologically important in the cells in which they were present (Table 2.2).

2.3.6 Targets of differentially expressed microRNAs

Strong base pairing between the seed region of a miRNA and the UTR of its target mRNA is important for its activity; hence many target prediction algorithms enforce strong seed complementarity and evolutionary conservation in the complementary region of potential targets (Grimson et al. 2007). As such, the repertoire of predicted targets of miRNAs with identical seeds often overlaps considerably. We sought to determine whether miRNAs sharing identical seeds demonstrated coexpression. By considering positions 2-8 of all non-singleton isomiRs of known miRNAs from each library, we identified 1009 distinct seeds.
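The seed grouping just described can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the procedure used for the analysis: the function names are hypothetical, the read counts are invented, and the first toy sequence is the mature hsa-miR-302a sequence as commonly listed (treat it as illustrative), with two made-up isomiRs derived from it.

```python
from collections import Counter

def seed(mature_rna):
    """Return positions 2-8 (1-based) of a mature miRNA or isomiR sequence."""
    return mature_rna[1:8]

def seed_counts(library):
    """Sum read counts over all non-singleton sequences that share a seed."""
    totals = Counter()
    for sequence, count in library.items():
        if count > 1:  # singletons are ignored, as in the text
            totals[seed(sequence)] += count
    return totals

# Hypothetical library: sequence -> read count.
toy_library = {
    "UAAGUGCUUCCAUGUUUUGGUGA": 120,  # reference-like sequence
    "UAAGUGCUUCCAUGUUUUGGUG": 30,    # 3' variant, same seed
    "AAGUGCUUCCAUGUUUUGGUGA": 8,     # 5' variant, shifted seed
    "UGAGGUAGUAGGUUGUAUAGUU": 1,     # singleton, excluded
}
totals = seed_counts(toy_library)
```

In this toy case the 3' variant pools with the reference-like sequence under one seed, whereas the 5' variant contributes to a different, non-canonical seed, which is why 5' variation increases the number of distinct seeds observed.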
Many of these represented the non-canonical seeds of isomiRs arising from variation at their 5' end. Of these seeds, 114 were significantly over-represented in the hESC library while 106 were over-represented in the EB library (Fisher Exact Test, alpha=0.05/1009) (Table 2.3 and Supplementary Table 3 in Appendix A). As expected, we found that many of the seeds with the largest changes in relative levels between the two libraries corresponded to the differentially expressed miRNAs in Table 2.1. Notably, in some cases, we observed pairs of miRNAs that shared a common seed yet exhibited inverse expression changes. A clear example of this from our data is the inverse abundances of hsa-miR-302a (hESC-enriched) and hsa-miR-327 (EB-enriched), which share the seed sequence AAGUGCU. This behavior may mask the effect of differential expression of some miRNAs in each library. Hence, focusing on the net change in seed levels, rather than distinct miRNAs, may be important in this context. Some miRNAs appear in this table more than once, suggesting that more than one isomiR from a single miRNA gene can contribute to net changes in miRNAs with a given seed during differentiation, and these isomiRs could potentially regulate different sets of transcripts.

The cooperative action of multiple miRNAs can be multiplicative and, in some cases, synergistic (Grimson et al. 2007); hence, transcripts with more predicted target sites for co-expressed miRNAs should be most drastically affected by those miRNAs.
In an attempt to highlight potentially significant targets of differentially expressed miRNAs, we identified genes with predicted target sites for multiple hESC-enriched miRNAs or EB-enriched miRNAs using TargetScan. Measures were taken to compensate for UTR length, miRNAs with identical target sites, and a general preponderance of target sites in some genes (Methods). A total of 591 likely cooperative targets of EB-enriched and 461 targets of hESC-enriched miRNAs were identified by this approach. As these genes are likely under redundant post-transcriptional regulation by multiple miRNAs, they could comprise genes of central importance to the maintenance of these cells. Surprisingly, these two gene lists showed a significant overlap of 64 genes (p<0.0001, permutation test), suggesting that some of the miRNA-regulated genes in hESCs may also be regulated in EBs by a different set of miRNAs. The genes that have been highlighted herein as likely targets of the differentially expressed miRNAs would be expected to be significant to hESC biology. This was supported by an examination of the Gene Ontology 'biological process' classifications that are significantly over-represented amongst these genes. This analysis revealed that many of the genes in both groups have been previously associated with differentiating stem cells and included those involved in differentiation, development and regulation of transcription (Skottman et al. 2005). Interestingly, some of these GO terms were enriched only in the predicted targets of hESC-enriched or EB-enriched miRNAs.
For example, genes involved in programmed cell death were enriched amongst the predicted targets of hESC-enriched miRNAs while those involved in cell proliferation were enriched amongst the predicted targets of EB-enriched miRNAs (Figure 2.3).

2.4 Discussion

These data highlight the potential of a new massively parallel sequencing strategy to profile miRNA expression in hESCs and EBs and the changes that occur during differentiation. Between the two libraries, we identified 190 differentially expressed known and 31 novel miRNAs corresponding to 165 and 33 distinct miRNA genes. This is a striking advance in contrast to a recent array-based study of murine ESCs, which revealed a much smaller number of miRNAs differentially expressed between ESCs and EBs (Chen et al. 2007). The latter study identified only 23 candidate miRNAs as either ESC-specific or down-regulated during differentiation, as well as 10 miRNAs that appear to be enriched in EBs. It is encouraging that, of the 27 listed mouse miRNAs with identifiable human orthologs, 14 exhibited expression patterns consistent with the hESC results presented here, whilst many of the others exhibited insignificant changes in expression. Further, of the additional differentially expressed miRNAs reported here (Table 2.1), many were originally discovered in human or murine ESCs (Houbaviy et al. 2003; Suh et al. 2004). Some of these previous studies generalized these miRNAs as hESC-specific in their expression.
However, compared to EBs, our method did not detect miRNAs that were expressed exclusively in hESCs; this is likely a consequence of the increased sampling depth of the method used here. On the other hand, we did find 16 miRNA sequences in EBs that were not observed in hESCs (Supplementary Table 1, Appendix A). This is consistent with the postulate that the number of expressed miRNAs increases during differentiation (Strauss et al. 2006) and further supports the importance of miRNAs during ESC differentiation (Wang et al. 2007). However, because EBs likely represent a more diverse population of cells, the total number of expressed miRNAs in any given cell type may not necessarily increase. Thus, the miRNAs highlighted by this method include those previously implicated in pluripotency and differentiation, but the added sampling depth demonstrates that these miRNAs are not uniquely expressed by undifferentiated hESCs. This result may suggest that the processes involved in differentiation that are under miRNA-mediated regulation are dictated by the repertoire and relative miRNA levels, rather than the number of distinct miRNAs being expressed. With few exceptions (Berezikov et al. 2006b), recent large-scale cloning efforts have provided minimal yields of new miRNA genes. This may be due to the dominance of the highly expressed miRNAs in small RNA libraries as well as difficulties associated with library normalization (Landgraf et al. 2007; Shivdasani 2006).
Here, we present 170 candidates for novel miRNA genes that have passed multiple levels of annotation criteria. This gene list was obtained by combining two separate miRNA classification tools relying on different assumptions and input data. MiPred relies on a Random Forest algorithm to classify miRNAs based on structural properties (Jiang et al. 2007). Our custom classifier, which uses a support vector machine (SVM), assigns a classification score based on a separate set of structural descriptors including data specific to this sequencing method. As a result, this classifier has provided the best classification accuracy of any machine learning-based miRNA classification method (see Methods). At least 31 of the novel miRNAs identified in this study show evidence of regulated expression during hESC differentiation (Table 2.2), suggesting that they may also have roles in maintaining the pluripotent status of hESCs or their ability to self-renew. In support of this, five of these miRNA sequences were found to share a common seed sequence with at least one of the differentially expressed known miRNAs (Table 2.1). Many of the remaining novel miRNAs with overall low abundance and insignificant expression differences may later prove unimportant in the context of hESC biology. It is encouraging, however, that many of the known miRNAs resided in the same range of expression as these novel miRNAs (Figure 2.1, panel a), suggesting that many miRNAs may exhibit low-level expression in hESCs detectable only by deep sequencing.
Like many of the known miRNAs present in these libraries, these low-abundance novel miRNAs may be observed in higher quantities amongst other tissues where they are functionally important.

In addition to novel miRNA genes, this study has aided in identifying a diverse population of variants of known miRNAs, collectively termed isomiRs. The functional implications of the widespread 3' modifications and RNA editing are unclear in the context of hESCs as neither process appeared to vary significantly between the two libraries. It is important to note that the diverse 3' nucleotide additions observed here reiterate previous observations of this nature (Landgraf et al. 2007). By comparing some of the most prevalent 3' additions in our libraries to those from diverse human and mouse tissues (Landgraf et al. 2007), it is evident that the nature of the added nucleotide is nonrandom and evolutionarily conserved. Some of the remaining non-canonical isomiRs, which appear to derive from variability in Dicer and Drosha cleavage positions, were quite abundant. By including all isomiRs and grouping those that share an identical seed sequence, we revealed 222 seeds that were represented in significantly different quantities in these two libraries (Table 2.3 and Supplementary Table 3, Appendix A). As the non-canonical isomiRs share most of their sequence with the most highly expressed isomiRs, it is possible that these variants share a common set of targets with the canonical miRNAs.
Those isomiRs resulting from variation at the 5' end may be of particular interest as they bear a different seed sequence than the reference miRNA, which indicates their potential to target different transcripts. However, whether any of these non-canonical variants associate within RISC remains to be experimentally determined. If they are found to associate within RISC, the presence of isomiRs may have implications for future annotation of miRNAs and the development of new target prediction algorithms. A recent update to miRBase resulted in better agreement with the most abundant sequences in our libraries, but some of the most prevalent isomiRs in our libraries still do not perfectly correspond to the miRBase entry. This suggests that the most abundant sequences have yet to be reliably determined for all miRNAs. If this is the case, further large-scale efforts such as this should result in the truly most common isomiR replacing the erroneous sequences residing in miRBase. There was a large overlap of predicted target genes for the most prevalent miRNAs enriched in either hESCs or EBs. Since a large proportion of these genes were predicted by TargetScan to be targets of numerous miRNAs, genes were weighted based on their total number of predicted target sites. By taking only the genes with weights above a certain threshold, we derived two relatively small groups of genes that are strong candidates for cooperative targeting by hESC-enriched or EB-enriched miRNAs.
This enabled resolution of key gene classes in each group (Figure 2.3), but it does not assert that these are true in vivo targets, nor does it provide a comprehensive list of the targets of these miRNAs. Further directed validation and gene expression profiling will be necessary to elucidate the true in vivo targets. These lists do, however, provide the foundation for an alternate view of gene regulation in hESCs. Previously, focus has been placed on changes to mRNA levels during differentiation. By shifting focus to the genes targeted by miRNAs in this context, it is plausible that genes previously unlinked to differentiation and pluripotency may be discovered.

Adaptation of a common adapter ligation method to a novel sequencing platform has allowed us to generate a view of the small RNA component of differentiating hESCs at an unprecedented depth. Using a combination of novel miRNA annotation and discovery techniques, we have revealed the largest list to date of miRNA sequences. This list includes a diverse population of miRNA variants, termed isomiRs, with variation at both the 5' and 3' ends. We also present numerous novel human miRNAs, some of which may be important in the control of hESC pluripotency or differentiation. Notably, many of the miRNAs exhibiting the most significant differences between hESCs and EBs appear to regulate genes involved in transcriptional regulation, differentiation and development.

2.5 Methods

2.5.1 Small RNA Library Preparation

Embryonic stem cell samples were derived from lines maintained at the Terry Fox Laboratory.
The stem cell line known as H9 was chosen because it is well characterized and was used in the development of differentiation protocols at this facility. Undifferentiated H9 hESCs were cultured on Matrigel (BD Biosciences, San Jose, CA) coated dishes in maintenance medium consisting of DMEM/F12 containing 20% Knockout Serum Replacer (Invitrogen, Carlsbad, CA), 0.1 mM β-mercaptoethanol, 0.1 mM non-essential amino acids, 1 mM glutamine and 4 ng/ml FGF2 (R&D Systems) and conditioned by mitotically inactivated mouse embryonic fibroblasts (Xue et al. 2005). For EB differentiation, hESCs were harvested via 0.05% trypsin (Invitrogen) supplemented with 0.5 mM CaCl2 and the resultant cell aggregates cultured in non-adherent dishes (BD Biosciences) for up to 30 days in maintenance medium lacking FGF2 (medium changes performed as necessary) (Itskovitz-Eldor et al. 2000; Dvash et al. 2004). At appropriate time points RNA was extracted into Trizol, and aliquots of total RNA from Day 0 (hESC) and Day 15 (EB) were subjected to miRNA library construction as follows. For each library, 10 µg of DNase I (DNA-free™ kit; Ambion, Austin, TX) treated total RNA was size fractionated on a 15% Tris-Borate-EDTA (TBE) urea polyacrylamide gel and a 15-30 base pair fraction excised. RNA was eluted from the polyacrylamide gel slice in 600 µl of 0.3 M NaCl overnight at 4°C.
The resulting gel slurry was passed through a Spin-X filter column (Corning) and precipitated in two 300 µl aliquots by the addition of 800 µl of EtOH and 3 µl of Mussel Glycogen (5 mg/ml; Invitrogen). After washing with 75% EtOH, the pellets were allowed to air dry at 25°C and pooled in DEPC water. The 5' RNA adapter (5'-GUUCAGAGUUCUACAGUCCGACGAUC-3') was ligated to the RNA pool with T4 RNA ligase (Ambion) in the presence of RNaseOut (Invitrogen) overnight at 25°C. The ligation reaction was stopped by the addition of 2X formamide loading dye. The ligated RNA was size fractionated on a 15% TBE urea polyacrylamide gel and a 40-60 base pair fraction excised. RNA was eluted from the polyacrylamide gel slice in 600 µl of 0.3 M NaCl overnight at 4°C. The RNA was eluted from the gel and precipitated as described above, followed by re-suspension in DEPC-treated water.

The 3' RNA adapter (5'-pUCGUAUGCCGUCUUCUGCUUGidT-3'; p = phosphate; idT = inverted deoxythymidine) was subsequently ligated to the precipitated RNA with T4 RNA ligase (Ambion) in the presence of RNaseOut (Invitrogen) overnight at 25°C. The ligation reaction was stopped by the addition of 10 µl of 2X formamide loading dye. Ligated RNA was size fractionated on a 10% TBE urea polyacrylamide gel and the 60-100 base pair fraction excised. The RNA was eluted from the polyacrylamide gel, precipitated as described above, and re-suspended in 5.0 µl of DEPC water.
The RNA was converted to single-stranded cDNA using Superscript II reverse transcriptase (Invitrogen) and Illumina's small RNA RT-Primer (5'-CAAGCAGAAGACGGCATACGA-3') following the manufacturer's instructions. The resulting cDNA was PCR-amplified with Hotstart Phusion DNA Polymerase (NEB) in 15 cycles using Illumina's small RNA primer set (5'-CAAGCAGAAGACGGCATACGA-3'; 5'-AATGATACGGCGACCACCGA-3'). PCR products were purified on a 12% TBE urea polyacrylamide gel and eluted into elution buffer (5:1, LoTE: 7.5 M Ammonium Acetate) overnight at 4°C. The resulting gel slurry was passed through a Spin-X filter (Corning) and precipitated by the addition of 1100 µl of EtOH, 133 µl of 7.5 M Ammonium Acetate and 3 µl of Mussel Glycogen (20 mg/ml; Invitrogen). After washing with 75% EtOH, the pellet was allowed to air dry at 25°C and dissolved in EB buffer (Qiagen) by incubation at 4°C for 10 min. The purified PCR products were quantified on the Agilent DNA 1000 chip (Agilent) and diluted to 10 nM for sequencing on the Illumina 1G analyzer.

2.5.2 Small RNA Genome Mapping and Quantification

No reads aligned to the genome beyond position 28, so we trimmed all reads at 30 nt to reduce the number of unique sequences while still retaining at least 2 nt of linker sequence for the longest alignments. We counted the occurrences of each unique sequence read and used only the unique sequences for further analysis. We aligned each sequence to the human reference genome (NCBI build 36.1) using megablast (version 2.2.11).
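The trimming and counting step described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the code used in this work; the function name `collapse_reads` and the toy reads are hypothetical.

```python
from collections import Counter

def collapse_reads(reads, trim_length=30):
    """Trim every read to trim_length nt, then count each unique trimmed sequence."""
    return Counter(read[:trim_length] for read in reads)

# Toy example with a shorter trim length so the effect is easy to see.
counts = collapse_reads(["ACGTAC", "ACGTAG", "TTGCAA"], trim_length=5)
# "ACGTAC" and "ACGTAG" both trim to "ACGTA" and are counted together.
```

Only the keys of the returned counter would need to be aligned, while the stored counts preserve the abundance of each sequence for later quantification.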
We filtered these alignments and retained only those that included the first nucleotide of the read and were devoid of insertions, deletions and mismatches. For every read, the longest alignment was determined, and this subsequence, as well as the positions of every alignment of this length, was stored in a database (to a maximum of 100 alignments). The sequence following the aligned region was checked for the presence of the 3' linker sequence. The counts for all reads containing the same aligned subsequence were summed to provide a metric of the total frequency of that small RNA molecule in the original RNA sample. The presence and length of intervening sequences between the alignment and the linker were recorded to enable identification and separate quantification of small RNAs with 3' additions.

2.5.3 Differential Expression Detection

All unique small RNA sequences were compared between the two libraries (hESC and EB) for differential expression using the Bayesian method developed by Audic and Claverie (Audic and Claverie 1997). This approach was developed for analysis of digital gene expression profiles and accounts for the sampling variability of sequences with low counts by modeling the Poisson distribution. Sequences were deemed significantly differentially expressed if the P-value given by this method was < 0.001 and there was at least a 1.5-fold change in sequence counts between the two libraries.
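As a minimal sketch of the test described above, the Python below evaluates the Audic and Claverie (1997) conditional probability p(y | x) of observing y copies of a sequence in one library given x copies in the other, and applies the two thresholds used here (P < 0.001 and at least a 1.5-fold change). The library totals `n1` and `n2`, the doubling of the smaller tail to obtain a two-sided P-value, and the treatment of zero counts are assumptions made for illustration; they are not taken from the original analysis.

```python
import math

def ac_log_pmf(x, y, n1, n2):
    """log p(y | x) from Audic & Claverie (1997), with r = n2 / n1:
    p(y | x) = r**y * (x + y)! / (x! * y! * (1 + r)**(x + y + 1))."""
    r = n2 / n1
    return (y * math.log(r)
            + math.lgamma(x + y + 1) - math.lgamma(x + 1) - math.lgamma(y + 1)
            - (x + y + 1) * math.log1p(r))

def ac_p_value(x, y, n1, n2):
    """Two-sided P-value (assumption: twice the smaller tail, capped at 1)."""
    pmf_y = math.exp(ac_log_pmf(x, y, n1, n2))
    # P(Y <= y), summed term by term
    lower = sum(math.exp(ac_log_pmf(x, k, n1, n2)) for k in range(y + 1))
    # P(Y >= y); clipped at 0 to absorb floating-point round-off
    upper = max(0.0, 1.0 - lower + pmf_y)
    return min(1.0, 2.0 * min(lower, upper))

def is_differential(x, y, n1, n2, alpha=0.001, min_fold=1.5):
    """Apply both criteria: P < alpha and a raw-count fold change >= min_fold."""
    fold = max(x, y) / max(1, min(x, y))  # assumption: a zero count is treated as 1
    return ac_p_value(x, y, n1, n2) < alpha and fold >= min_fold
```

For example, `is_differential(1110, 13163, 10**6, 10**6)` uses the mir-199a counts from Table 2.1 with hypothetical library totals of one million reads each. Very small upper-tail probabilities underflow to zero in this simple summation, so it is not suited to reporting exact P-values for extreme cases.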
Unless stated otherwise, the most frequently observed isomiR was used as the diagnostic sequence for comparison of miRNA expression between libraries. Multiple testing corrections were unnecessary as comparison between two libraries is considered a single test by this method (Stekel et al. 2000).

2.5.4 Small RNA annotation

Each small RNA sequence was annotated if its full sequence contained one or more nucleotides of recognizable 3' linker sequence, had at most 25 perfect full-length alignments to the human genome, and was detected at least twice. Genomic positions of each sequence were compared to genome annotations obtained from the UCSC Genome Browser download page (http://hgdownload.cse.ucsc.edu/downloads.html). The data used for annotation included the EnsEMBL genes, RepeatMasker and sno/miRNA tables. The positions of all miRNA genes were also downloaded from miRBase and used for this positional annotation (http://microrna.sanger.ac.uk/sequences/ftp.shtml). Sequences overlapping annotations from these tables were classified into one of the following classes, listed in the priority used: miRNA, tRNA, rRNA, scaRNA, CD-Box, scRNA, snoRNA, snRNA, srpRNA, genomic repeat or known transcript (exonic or intronic). Transfer RNAs (tRNAs) and ribosomal RNAs (rRNAs), which are involved in the cellular translation machinery, were the most abundant classes of sequences after miRNAs, yet these are generally considered relatively uninteresting in the context of small RNAs.
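The priority rule above amounts to a first-match lookup. A minimal sketch in Python, assuming each sequence has already been reduced to the set of annotation classes it overlaps (the function name `classify` and that input format are hypothetical):

```python
# Annotation classes in the priority order listed in the text.
PRIORITY = [
    "miRNA", "tRNA", "rRNA", "scaRNA", "CD-Box", "scRNA",
    "snoRNA", "snRNA", "srpRNA", "genomic repeat", "known transcript",
]

def classify(overlapping_classes):
    """Return the highest-priority class present, or 'unknown' if none overlap."""
    for cls in PRIORITY:
        if cls in overlapping_classes:
            return cls
    return "unknown"
```

A sequence overlapping both a snoRNA and an intron of a known transcript would therefore be reported as a snoRNA.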
The other aforementioned classes of small RNAs are involved in either subcellular targeting (srpRNA), splicing or other chemical modifications required for the maturation of ncRNAs and mRNAs (CD-Box/snRNAs/snoRNAs/scaRNA). All mapped sequences were also directly searched against the known human Piwi-interacting RNAs (piRNAs) (Aravin et al. 2007) currently found in piRNAbank (Lakshmi and Agrawal 2007). PiRNAs were previously thought to exist only in mammalian germline cells and function in the silencing of genomic retrotransposons. As piRNAs are considerably longer than miRNAs, we used the first 18 nt of the piRNAs to search for perfect matches amongst our sequences. Sequences that did not overlap any of the aforementioned annotations (in any one of their genomic positions), or had more than 25 perfect alignments to the genome, were automatically classified as 'unknown'. The unknown sequences with 25 or fewer perfect alignments to the genome were used, along with those corresponding to introns and genomic repeats, for novel miRNA prediction. Sequences identified as miRNAs were named based on the specific miRNA gene they overlapped as well as the arm (either 5' or 3') with respect to the pre-miRNA. Where miRNA naming was ambiguous due to identical sequences shared by related miRNA genes, sequences were named arbitrarily by one of their possible parent miRNA genes.
For those miRNA genes producing identical mature sequences (e.g., hsa-miR-9-1), the trailing number was dropped from the name. MiRBase (release 10.0) reference sequences were used for the comparison of the common isomiRs to the reference miRNA sequences.

2.5.5 Novel microRNA detection

Candidate miRNA gene loci were identified by finding distinct small RNA sequences lacking annotations that shared partially overlapping genomic positions on the same strand (termed 'hotspots'). These hotspots are suggestive of the presence of two or more isomiRs and were observed for nearly all the known miRNAs in our libraries. 300 nt of genomic sequence flanking each seed was extracted, reverse-complemented where appropriate, and folded using RNALfold (Hofacker 2003), which identifies locally stable sub-structures within a query RNA sequence. Following the known structural constraints of miRNAs (Ambros et al. 2003), the largest un-branched fold-back sub-structure was identified from each structure and redundant structures were then removed. Structures were also removed if the seed region (where the original small RNA sequences derived) spanned the loop. The remaining structures and their sequences comprised a set of candidate miRNA genes with expressed sequence and more than one candidate isomiR. The putative pre-miRNA sequences were then enriched for likely real pre-miRNA hairpins using two machine-learning approaches. The previously published method, termed MiPred, relies on a Random Forest (RF) algorithm (Jiang et al. 2007) and uses a combination of structural and thermodynamic parameters. The second approach was devised specifically for this study and employs a support vector machine (SVM) classifier that mostly uses parameters not used by the RF method. The SVM functionality was provided in the e1071 package for R.
The parameters used by this classifier include the relative proportion of each nucleotide and the number of each type of base pair in the optimal pre-miRNA folded structure. As well, some parameters describing the folded structure are common to other SVM approaches, including MFEI, AMFE (Zhang et al. 2006), Normalized Pairing Propensity (Ng Kwang Loong et al. 2007) and loop length. The parameters that are unique to this type of sequencing data are descriptors of the seed itself and its positioning within the pre-miRNA structure. Specifically, these include seed length, positioning with respect to the loop, as well as the ratio of paired to unpaired nucleotides within the seed. This classifier was trained using the folded flanking sequence of the miRNAs present in the two libraries for positive examples and the folded flanking sequences of small RNA sequences classified as either tRNA, rRNA, snRNA or snoRNA for negative examples. After training, the classifier showed a sensitivity of 0.973 and specificity of 0.988, which represents the upper level of discrimination of pre-miRNAs reported for any machine learning method to date. The intersection of the positive predictions of the SVM and Random Forest methods (98 total) was used as an initial set of reliable novel miRNA predictions. This was supplemented with miRNAs showing either significant differential expression between the two libraries, significant sequence similarity to known miRNAs, or overlap with an EvoFold prediction for an evolutionarily conserved hairpin (Pedersen et al. 2006).
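To make some of the classifier inputs listed above concrete, the following Python sketch derives a few of them from a sequence and its dot-bracket structure: nucleotide proportions, counts of each base-pair type, loop length, seed length, and the ratio of paired to unpaired seed nucleotides. It is a hypothetical illustration rather than the feature code used in this study. The function names, the 0-based end-exclusive seed coordinates, and the assumption of a single unbranched hairpin with at least one base pair are all introduced here, and the thermodynamic parameters (MFEI, AMFE) are omitted.

```python
def base_pairs(structure):
    """Return (i, j) index pairs from a dot-bracket string."""
    stack, pairs = [], []
    for i, ch in enumerate(structure):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            pairs.append((stack.pop(), i))
    return pairs

# Sorted two-letter key -> conventional pair name.
PAIR_NAMES = {"AU": "AU", "CG": "GC", "GU": "GU"}

def hairpin_features(seq, structure, seed_start, seed_end):
    """Compute simple structural descriptors of a candidate hairpin."""
    seq = seq.upper().replace("T", "U")
    pairs = base_pairs(structure)
    feats = {"frac_" + b: seq.count(b) / len(seq) for b in "ACGU"}
    for name in PAIR_NAMES.values():
        feats["pairs_" + name] = 0
    for i, j in pairs:
        key = "".join(sorted(seq[i] + seq[j]))
        if key in PAIR_NAMES:
            feats["pairs_" + PAIR_NAMES[key]] += 1
    i, j = max(pairs)  # innermost pair of an unbranched hairpin
    feats["loop_length"] = j - i - 1
    seed = structure[seed_start:seed_end]
    unpaired = seed.count(".")
    paired = len(seed) - unpaired
    feats["seed_length"] = len(seed)
    feats["seed_paired_ratio"] = paired / unpaired if unpaired else float("inf")
    return feats
```

Here "seed" follows the usage in this section, meaning the region of the hairpin covered by the observed small RNA sequences, not the 7-nt targeting seed of a mature miRNA.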
2.5.6 Cooperative miRNA target prediction

TargetScan (release 4.0) predicted targets for known miRNAs were downloaded from the download page (http://www.targetscan.org/). Only miRNAs with counts of at least 100 in either hESC or EB were included in target analyses. Genes with target sites for at least two co-expressed miRNAs (either hESC-enriched or EB-enriched) were identified as potential cooperative targets. To compensate for a potential bias towards genes with many predicted target sites (due to various issues), these genes were given a lower rank than those with few predicted target sites. The rank score of a gene was calculated by dividing the number of predicted target sites for co-expressed miRNAs by the total number of target sites for that gene. In this context, the co-expressed miRNAs were those that were over-represented in the particular library and had a count of at least 100. We used a cutoff of 0.15 (rank) to produce the two sets of high-ranked candidate cooperative targets of hESC-enriched and EB-enriched miRNAs (Figure 2.3).

2.5.7 Gene Ontology analysis

GoStat (Beissbarth et al. 2004) was employed for identification of significantly enriched 'biological process' Gene Ontology (GO) (Ashburner et al. 2000) terms in the two lists of likely targets of hESC-enriched and EB-enriched miRNAs (P<0.01). Custom software was used to extract genes and GO terms from the GoStat output. Genes and GO terms were then clustered using hierarchical clustering in R. Heat maps were created in R using the heatmap function (default parameters).
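The rank score defined in Section 2.5.6 can be written out as a short calculation. The Python below is a sketch under stated assumptions: the input is a hypothetical mapping from each gene to its number of predicted sites per miRNA, the function names are invented for illustration, and the cutoff is applied as "greater than or equal to 0.15", which the text does not specify.

```python
def rank_score(sites_by_mirna, coexpressed):
    """Sites for co-expressed miRNAs divided by all predicted sites for the gene."""
    total = sum(sites_by_mirna.values())
    if total == 0:
        return 0.0
    co_sites = sum(n for mirna, n in sites_by_mirna.items() if mirna in coexpressed)
    return co_sites / total

def cooperative_targets(genes, coexpressed, cutoff=0.15, min_mirnas=2):
    """Keep genes hit by at least min_mirnas co-expressed miRNAs with score >= cutoff."""
    selected = {}
    for gene, sites in genes.items():
        hitting = [m for m, n in sites.items() if m in coexpressed and n > 0]
        if len(hitting) >= min_mirnas:
            score = rank_score(sites, coexpressed)
            if score >= cutoff:  # assumption: the cutoff is inclusive
                selected[gene] = score
    return selected
```

Under this definition, a gene with two sites for co-expressed miRNAs among 100 predicted sites in total scores 0.02 and is excluded, whereas a gene with three such sites out of four scores 0.75 and is retained.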
Table 2.1 Top 20 miRNAs differentially expressed between the hESC and EB libraries

miRNA gene (hsa)  Hairpin arm  Most abundant sequence     hESC count  EB count  P-value    Fold change
mir-199a          3-p          ACAGTAGTCTGCACATTGGTTA     1110        13163     0          11.86
mir-372           3-p          AAAGTGCTGCGACATTTGAGCGT    1388        13653     0          9.84
mir-122a          5-p          TGGAGTGTGACAATGGTGTTTG     436         2565      0          5.88
mir-152           5-p          TCAGTGCATGACAGAACTTGG      622         3028      0          4.87
mir-10a           5-p          TACCCTGTAGATCCGAATTTGT     948         3887      0          4.10
let-7a            5-p          TGAGGTAGTAGGTTGTATAGTT     11902       2951      0          4.03
mir-302a          5-p          TAAACGTGGATGTACTTGCTTT     36800       9917      0          3.71
mir-222           3-p          AGCTACATCTGGCTACTGGGTCTC   4719        1331      0          3.55
mir-340           5-p          TTATAAAGCAATGAGACTGATT     2247        7198      0          3.20
mir-363           3-p          AATTGCACGGTATCCATCTGTA     5775        17912     0          3.10
mir-21            5-p          TAGCTTATCAGACTGATGTTGAC    39818       21003     0          1.90
mir-26a           5-p          TTCAAGTAATCCAGGATAGGCT     4892        8530      0          1.74
mir-26b           5-p          TTCAAGTAATTCAGGATAGGTT     1003        2957      1.39E-278  2.95
mir-130a          3-p          CAGTGCAATGTTAAAAGGGCAT     2334        4798      2.20E-265  2.06
mir-744           5-p          TGCGGGGCTAGGGCTAACAGCA     4166        1516      9.13E-259  2.75
mir-302b          3-p          TAAGTGCTTCCATGTTTTAGTAG    15169       8855      1.39E-213  1.71
mir-30d           5-p          TGTAAACATCCCCGACTGGAAGCT   2798        4988      3.29E-205  1.78
mir-146b          5-p          TGAGAACTGAATTCCATAGGCTGT   703         2075      2.27E-196  2.95
mir-25            3-p          CATTGCACTTGTCTCGGTCTGA     24268       15875     1.52E-189  1.53
mir-373           3-p          GAAGTGCTTCGATTTTGGGGTGT    60          788       5.75E-184  13.13
mir-423           5-p          TGAGGGGCAGAGAGCGAGACTTT    9844        5538      4.50E-162  1.78
mir-320           3-p          AAAAGCTGGGTTGAGAGGGCGA     5967        2978      1.33E-148  2.00
mir-1             3-p          TGGAATGTAAAGAAGTATGTAT     7421        4051      2.39E-137  1.83
mir-371           5-p          ACTCAAACTGTGGGGGCACTT      1649        3020      6.89E-133  1.83
mir-302d          3-p          TAAGTGCTTCCATGTTTGAGTGT    8599        5047      3.95E-119  1.70

Table 2.2 Putative novel miRNAs exhibiting significant differential expression

hESC EB Chromo- Length Fold some (nt) most common sequence count count Change chr2 19 AAIGGAI I I I IGGAGCAGG 374 245 1.53 17 84 131 1.56 chr11 TAAGTGCTTCCATGCTT chrX 22 TTCATTCGGCTGTCCAGATGTA 1269 774 1.64 194 114 1.70 chr16 19 GCCTGTCTGAGCGTCGCTT chrl 21 
TATTCATTTATCCCCAGCCTACA 126 67 1.88 17 TCCCTGTTCGGGCGCCA 1302 1.94 chr19 670 chrl 19 40 85 2.13 IGGAI I I I IGGAICAGGGA 72 chrl 17 AACAG CTAAGGACTG CA 31 2.32 273 2.34 chr6 19 GIG IAAGCAGGGICGI I I I 639 54 2.70 chrX 22 20 IACGIAGAIAIAIAIG IAI I I I chr5 18 GTCCCTGTTCAGGCGCCA 24 66 2.75 chr15 26 GATGATGATGGCAGCAAATTCTGAAA 81 27 3.00 chr22 19 ACGTTGGCTCTGGTGGTG 35 11 3.18 chr8 18 ATCCCCAGCACCTCCACC 59 18 3.28 22 TGCTGGATCAGTGGTTCGAGTC chr10 16 57 3.56 23 CCTCAGGGCTGTAGAACAGGGCT chr15 163 45 3.62 chr4 22 CTGGACTGAGCCGTGCTACTGG 44 161 3.66 chr10 22 ACTCGGCGTGGCGTCGGTCGTG 228 50 4.56 chr2 24 TTGCAGCTGCCTGGGAGTGACTTC 63 13 4.85 chr4 17 TCCGAGTCACGGCACCA 309 55 5.62 chrl 17 GTGGGGGAGAGGCTGTC 17 6.06 103 chrl 22 ATGGGTGAATTTGTAGAAGGAT 7 45 6.43 chr7 19 AAACIGIAAI I AC I I I IGG 21 3 7.00 chrX 18 GCATGGGTGGTTCAGTGG 407 52 7.83 24 AGCCTGGAAGCTGGAGCCTGCAGT chr10 25 3 8.33 chrX 22 AGGAGGAATTGGTGCTGGTCTT 2 25 12.50 chr22 17 AAAAAGACACCCCCCAC 21 9.00 189 chr14 22 ACCCGTCCCGTTCGTCCCCGGA 4 53 13.25 2 15.50 19 ATGGATAAGGCTTTGGCTT 31 chr11 chr15 18 CGGGCGTGGTGGTGGGGG 2 17.50 35 chr11 24 GTGGGAAGGAATTACAAGACAGTT 72 2 36.00  Putative miRNA Seed family sequence AUGGAUU mir-503 AAGUGCU mir-664 mir-761 mir-565 mir-620 mir-376 mir-650 mir-650 mir-502 mir-661 mir-345  UCAUUCG CCUGUCU AUUCAUU CCCUGUU GGAUUUU ACAGCUA UGUAAGC ACGUAGA UCCCUGU AUGAUGA CGUUGGC UCCCCAG GCUGGAU CUCAGGG UGGACUG CUCGGCG UGCAGCU CCGAGTC UGGGGGG  mir1072 UGGGUGA mir-548a AACUGUA mir-146a CAUGGGU mir-566 GCCUGGA mir-766* GGAGGAA mir396b AAAAGAC mir-638 CCCGUCC UGGAUAA mir-566 GGGCGUG UGGGAAG

Table 2.3 Top 25 statistically over-represented seeds of miRNA sequences found in hESCs or EBs (Fisher's Exact Test with Bonferroni correction)

Seed     hESC count  EB count  Fold change  Corrected P-value  miRNAs with seed
AGUAGUC  86          1478      17x          0                  mir-199a
CAGUAGU  1801        22288     12.4x        0                  mir-199a
ACAGUAG  241         2749      11.4x        0                  mir-199a
UUAAACG  3452        398       8.7x         0                  mir-302a-5p
CUUAAAC  16762       2332      7.2x         0                  mir-302a-5p
GGAGUGU  846         5216      6.2x         0                  mir-122
AGUGCUG  2484        12641     5.1x         0                  mir-372, mir-512
ACCCUGU  1503        5689      3.8x         0                  mir-10
AAACGUG  55712       16461     3.4x         0                  mir-302a-5p, mir-424
UAUAAAG  3002        9702      3.2x         0                  mir-340
GAGAACU  1547        4976      3.2x         0                  mir-146
UUGCACG  1845        5561      3.0x         0                  mir-363
GAGGUAG  29280       10258     2.9x         0                  let-7
CAGUGCA  3596        8559      2.4x         0                  mir-148, mir-152, mir-130
AAAGCUG  16448       7613      2.2x         0                  mir-320
AGUGCAA  4463        9594      2.1x         0                  mir-454, mir-301, mir-130
UCAAGUA  6898        14089     2.0x         0                  mir-26
GCUACAU  47591       24280     2.0x         0                  mir-221, mir-222
GUAAACA  8789        15479     1.8x         0                  mir-30
GAGGGGC  17586       10518     1.7x         0                  mir-423/mir-885
AGCUUAU  65376       39424     1.7x         0                  mir-21, mir-590
CUGGACU  17819       25378     1.4x         0                  mir-378
GCAGCAU  153193      187218    1.2x         0                  mir-103, mir-107, mir-885
GCGGGGC  5220        1956      2.7x         8.76E-305          mir-744
CUCAAAC  4306        8173      1.9x         1.62E-296          mir-92a-5p, mir-371

miRNAs indicated in bold are cases in which their non-canonical isomiR contains this seed.

Figure 2.1 Distribution of sequence counts for the different classes of small RNAs.

[Figure 2.1a: box plots of sequence counts by small RNA class. Figure 2.1b: pie chart of class totals; legible labels include microRNA 52% and unknown 8%.]

a) The box plot (left) shows the relative expression levels of sequences in each of the 8 major classes of small RNAs (log10 transformation) in the hESC small RNA library. MiRNAs were the most highly expressed class (mean sequence count=253, median sequence count=9). The most abundant miRNA in this library was mir-103, which had 91,398 instances of the most common isomiR in this library and nearly 120,000 in the matched library (EB). The highest log-transformed count between the two libraries (right) for all miRNAs identified as differentially expressed (dark grey) is roughly normal (mean=2.32, median=2.17, representing counts of 1743 and 151, respectively). The miRNAs that were detected in at least one of the libraries but were not significantly differentially expressed are shown in light grey for comparison.
There is a slight enrichment of miRNAs with lower absolute expression in this group (mean=1.35, median=1.24), suggesting miRNAs with higher absolute expression levels are more likely to be identified as differentially expressed. b) Total counts for the 8 classes in the box plot are summarized. They are represented as a fraction of the total sequences that had at least one perfect alignment to the human reference genome (1,631,559 total).

Figure 2.2 The repertoire of isomiRs and 3' modifications of hsa-mir-191.

hsa-mir-191 pre-miRNA: chr3 49033055 - 49033146 (-)
GGCAGGAGAGCAGGGGACGAAATCCAAGCGCAGCTGGAATGCTCTGGAGACAACAGCTGCTTT

Sequence                       Count
GGGGACGAAATCCAAGCGCAGC         2
AGACAACAGCTGCTTTTGGGATTCCGTTG  3
ACAACAGCTGCTTTTGGGATTCCGTTG    12
CAACAGCTGCTTTTGGGATTCCGTTG     4
AACAGCTGCTTTTGGGATTCCGTT       15
AACAGCTGCTTTTGGGATTCCGTTG      243
GACAGCTGCTTTTGGGATTCCGTTG      44
TACAGCTGCTTTTGGGATTCCGTTG      339
CACAGCTGCTTTTGGGATTCCGTTG      36
ACAGCTGCTTTTGGGATTCCGTTG       1760
TCAGCTGCTTTTGGGATTCCG          2
TCAGCTGCTTTTGGGATTCCGTT        258
CAGCTGCTTTTGGGATTCCGTTG        3294
AGCTGCTTTTGGGATTCCGTT          44
AGCTGCTTTTGGGATTCCGTTG         883
TGCTGCTTTTGGGATTCCGTT          10
TGCTGCTTTTGGGATTCCGTTG         96
CTGCTTTTGGGATTCCGTT            21
CTGCTTTTGGGATTCCGTTG           21
TGCTTTTGGGATTCCGTT             3
TGCTTTTGGGATTCCGTTG            36
GCTTTTGGGATTCCGTT              8
GCTTTTGGGATTCCGTTG             39
CTTTTGGGATTCCGTTG              500

A diverse variety of isomiRs were observed for many of the known and novel miRNAs. Sequences representing the miR* were also commonly observed (highlighted in pink). The reference miRNA sequence from miRBase (highlighted in bold) was generally not the most frequently observed isomiR (shown in blue).
Though variation at the 3' end is generally much more common than at the 5' end, target preference of these isomiRs may differ. Sequences with evidence of 3' additions of nucleotides (red) were common, with certain miRNAs more heavily modified than others. The predicted structure of the pre-miRNA is represented at the bottom.

Figure 2.3 Clustering of over-represented Gene Ontology classes in predicted targets of differential microRNAs.

a)
GO:0012502 induction of programmed cell death
GO:0006917 induction of apoptosis
GO:0043065 positive regulation of apoptosis
GO:0043068 positive regulation of programmed cell death
GO:0042330 taxis
GO:0006935 chemotaxis
GO:0006839 mitochondrial transport
GO:0009611 response to wounding
GO:0009605 response to external stimulus
GO:0006950 response to stress
GO:0006954 inflammatory response
GO:0006952 defense response
GO:0048522 positive regulation of cellular process
GO:0048518 positive regulation of biological process
GO:0048869 cellular developmental process
GO:0030154 cell differentiation
GO:0032502 developmental process
GO:0048856 anatomical structure development
GO:0032774 RNA biosynthetic process
GO:0006351 transcription, DNA-dependent
GO:0006355 regulation of transcription, DNA-dependent
GO:0031323 regulation of cellular metabolic process
GO:0019222 regulation of metabolic process
GO:0050794 regulation of cellular process
GO:0050789 regulation of biological process
GO:0065007 biological regulation
GO:0044238 primary metabolic process
GO:0044237 cellular metabolic process
GO:0043170 macromolecule metabolic process

b)
GO:0042127 regulation of cell proliferation
GO:0008283 cell proliferation
GO:0008284 positive regulation of cell proliferation
GO:0048518 positive regulation of biological process
GO:0009653 anatomical structure morphogenesis
GO:0007586 digestion
GO:0044255 cellular lipid metabolic process
GO:0016043 cellular component organization and biogenesis
GO:0048731 system development
GO:0048856 anatomical structure development
GO:0007275 multicellular organismal development
GO:0048513 organ development
GO:0032501 multicellular organismal process
GO:0048869 cellular developmental process
GO:0030154 cell differentiation
GO:0032502 developmental process
GO:0032774 RNA biosynthetic process
GO:0006351 transcription, DNA-dependent
GO:0006355 regulation of transcription, DNA-dependent
GO:0016070 RNA metabolic process
GO:0006139 nucleobase, nucleoside, nucleotide and nucleic acid metabolic process
GO:0050789 regulation of biological process
GO:0065007 biological regulation
GO:0043283 biopolymer metabolic process
GO:0043170 macromolecule metabolic process
GO:0044238 primary metabolic process
GO:0044237 cellular metabolic process

Shown are heat map representations of Gene Ontology (GO) (Ashburner et al. 2000) terms over-represented amongst predicted cooperative targets (y-axis) of (a) hESC-enriched miRNAs and (b) EB-enriched miRNAs. All genes with statistically over-represented GO annotations were included (p<0.01, x-axis) as identified by GoStat (Beissbarth et al. 2004). GO terms common to both sets of genes were those involved in transcriptional regulation, differentiation and development.
Those GO terms unique to hESC-enriched miRNA targets were associated with programmed cell death, response to stress, and cell motility, while those unique to EB-enriched miRNA targets describe various aspects of cell proliferation regulation as well as nucleotide and nucleic acid metabolism.

2.6 References

Abeyta, M. J., Clark, A. T., Rodriguez, R. T., Bodnar, M. S., Pera, R. A., and Firpo, M. T. 2004. Unique gene expression signatures of independently-derived human embryonic stem cell lines. Hum Mol Genet 13: 601-8.
Ambros, V., Bartel, B., Bartel, D. P., Burge, C. B., Carrington, J. C., Chen, X., Dreyfuss, G., Eddy, S. R., Griffiths-Jones, S., Marshall, M., Matzke, M., et al. 2003. A uniform system for microRNA annotation. RNA 9: 277-9.
Aravin, A. A., Sachidanandam, R., Girard, A., Fejes-Toth, K., and Hannon, G. J. 2007. Developmentally regulated piRNA clusters implicate MILI in transposon control. Science 316: 744-747.
Aravin, A. and Tuschl, T. 2005. Identification and characterization of small RNAs involved in RNA silencing. FEBS Lett 579: 5830-40.
Ashburner, M., Ball, C. A., Blake, J. A., Botstein, D., Butler, H., Cherry, J. M., Davis, A. P., Dolinski, K., Dwight, S. S., Eppig, J. T., et al. 2000. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 25: 25-9.
Audic, S. and Claverie, J. M. 1997. The significance of digital gene expression profiles. Genome Res 7: 986-95.
Beissbarth, T. and Speed, T. P. 2004. GOstat: find statistically overrepresented Gene Ontologies within a group of genes. Bioinformatics 20: 1464-1465.
Berezikov, E., Cuppen, E., and Plasterk, R. H. 2006a. Approaches to microRNA discovery. Nat Genet 38 Suppl: S2-7.
Berezikov, E., Thuemmler, F., van Laake, L. W., Kondova, I., Bontrop, R., Cuppen, E., and Plasterk, R. H. 2006b. Diversity of microRNAs in human and chimpanzee brain. Nat Genet 38: 1375-7.
Bernstein, E., Kim, S. Y., Carmell, M. A., Murchison, E. P., Alcorn, H., Li, M. Z., Mills, A. A., Elledge, S. J., Anderson, K. V., and Hannon, G. J. 2003. Dicer is essential for mouse development. Nat Genet 35(3): 215-7.
Bhattacharya, B., Cai, J., Luo, Y., Miura, T., Mejido, J., Brimble, S. N., Zeng, X., Schulz, T. C., Rao, M. S., and Puri, R. K. 2005. Comparison of the gene expression profile of undifferentiated human embryonic stem cell lines and differentiating embryoid bodies. BMC Dev Biol 5: pre-print.
Bhattacharya, B., Miura, T., Brandenberger, R., Mejido, J., Luo, Y., Yang, A. X., Joshi, B. H., Ginis, I., Thies, R. S., Amit, M., et al. 2004. Gene expression in human embryonic stem cell lines: unique molecular signature. Blood 103: 2956-64.
Boyer, L. A., Lee, T. I., Cole, M. F., Johnstone, S. E., Levine, S. S., Zucker, J. P., Guenther, M. G., Kumar, R. M., Murray, H. L., Jenner, R. G., et al. 2005. Core transcriptional regulatory circuitry in human embryonic stem cells. Cell 122: 947-56.
Cai, X., Hagedorn, C. H., and Cullen, B. R. 2004. Human microRNAs are processed from capped, polyadenylated transcripts that can also function as mRNAs. RNA 10: 1957-66.
Chen, C., Ridzon, D., Lee, C. T., Blake, J., Sun, Y., and Strauss, W. M. 2007. Defining embryonic stem cell identity using differentiation-related microRNAs and their potential targets. Mamm Genome.
Cummins, J. M., He, Y., Leary, R. J., Pagliarini, R., Diaz, L. A., Jr., Sjoblom, T., Barad, O., Bentwich, Z., Szafranska, A. E., Labourier, E., et al. 2006. The colorectal microRNAome. Proc Natl Acad Sci USA 103: 3687-92.
Davison, T. S., Johnson, C. D., and Andruss, B. F. 2006. Analyzing micro-RNA expression using microarrays. Methods Enzymol 411: 14-34.
Dvash, T., Mayshar, Y., Darr, H., McElhaney, M., Barker, D., Yanuka, O., Kotkow, K. J., Rubin, L. L., Benvenisty, N., and Eiges, R. 2004. Temporal gene expression during differentiation of human embryonic stem cells and embryoid bodies. Hum Reprod 19: 2875-83.
Fahlgren, N., Howell, M. D., Kasschau, K. D., Chapman, E. J., Sullivan, C. M., Cumbie, J. S., Givan, S. A., Law, T. F., Grant, S. R., Dangl, J. L., et al. 2007. High-Throughput Sequencing of Arabidopsis microRNAs: Evidence for Frequent Birth and Death of MIRNA Genes. PLoS ONE 2: e219.
Giraldez, A. J., Mishima, Y., Rihel, J., Grocock, R. J., Van Dongen, S., Inoue, K., Enright, A. J., and Schier, A. F. 2006. Zebrafish MiR-430 promotes deadenylation and clearance of maternal mRNAs. Science 312: 75-9.
Griffiths-Jones, S. 2006. miRBase: the microRNA sequence database. Methods Mol Biol 342: 129-38.
Grimson, A., Farh, K. K., Johnston, W. K., Garrett-Engele, P., Lim, L. P., and Bartel, D. P. 2007. MicroRNA targeting specificity in mammals: determinants beyond seed pairing. Mol Cell 27: 91-105.
Hirst, M., Delaney, A., Rogers, S. A., Schnerch, A., Persaud, D. R., O'Connor, M. D., Zeng, T., Moksa, M., Fichter, K., Mah, D., et al. 2007. LongSAGE profiling of nine human embryonic stem cell lines. Genome Biol 8: R113.
Hofacker, I. L. 2003. Vienna RNA secondary structure server. Nucleic Acids Res 31: 3429-31.
Houbaviy, H. B., Murray, M. F., and Sharp, P. A. 2003. Embryonic stem cell-specific MicroRNAs. Dev Cell 5: 351-8.
Itskovitz-Eldor, J., Schuldiner, M., Karsenti, D., Eden, A., Yanuka, O., Amit, M., Soreq, H., and Benvenisty, N. 2000. Differentiation of human embryonic stem cells into embryoid bodies compromising the three embryonic germ layers. Mol Med 6(2): 88-95.
Jiang, P., Wu, H., Wang, W., Ma, W., Sun, X., and Lu, Z. 2007. MiPred: classification of real and pseudo microRNA precursors using random forest prediction model with combined features. Nucleic Acids Res 35: W339-44.
John, B., Enright, A. J., Aravin, A., Tuschl, T., Sander, C., and Marks, D. S. 2004. Human MicroRNA targets. PLoS Biol 2: e363.
Kasschau, K. D., Fahlgren, N., Chapman, E. J., Sullivan, C. M., Cumbie, J. S., Givan, S. A., and Carrington, J. C. 2007. Genome-Wide Profiling and Analysis of Arabidopsis siRNAs. PLoS Biol 5: e57.
Kawahara, Y., Zinshteyn, B., Sethupathy, P., Iizasa, H., Hatzigeorgiou, A. G., and Nishikura, K. 2007. Redirection of silencing targets by adenosine-to-inosine editing of miRNAs. Science 315: 1137-40.
Lakshmi, S. and Agrawal, S. 2007. piRNABank: a web resource on classified and clustered Piwi-interacting RNAs. Nucleic Acids Res. In press.
Landgraf, P., Rusu, M., Sheridan, R., Sewer, A., Iovino, N., Aravin, A., Pfeffer, S., Rice, A., Kamphorst, A. O., Landthaler, M., et al. 2007. A Mammalian microRNA Expression Atlas Based on Small RNA Library Sequencing. Cell 129: 1401-14.
Lewis, B. P., Burge, C. B., and Bartel, D. P. 2005. Conserved seed pairing, often flanked by adenosines, indicates that thousands of human genes are microRNA targets. Cell 120: 15-20.
Lund, E., Guttinger, S., Calado, A., Dahlberg, J. E., and Kutay, U. 2004. Nuclear export of microRNA precursors. Science 303: 95-8.
Margulies, M., Egholm, M., Altman, W. E., Attiya, S., Bader, J. S., Bemben, L. A., Berka, J., Braverman, M. S., Chen, Y. J., Chen, Z., et al. 2005. Genome sequencing in microfabricated high-density picolitre reactors. Nature 437: 376-80.
Nakano, M., Nobuta, K., Vemaraju, K., Tej, S. S., Skogen, J. W., and Meyers, B. C. 2006. Plant MPSS databases: signature-based transcriptional resources for analyses of mRNA and small RNA. Nucleic Acids Res 34: D731-5.
Ng Kwang Loong, S. and Mishra, S. K. 2007. Unique folding of precursor microRNAs: quantitative evidence and implications for de novo identification. RNA 13: 170-87.
O'Toole, A. S., Miller, S., Haines, N., Zink, M. C., and Serra, M. J. 2006. Comprehensive thermodynamic analysis of 3' double-nucleotide overhangs neighboring Watson-Crick terminal base pairs. Nucleic Acids Res 34: 3338-44.
Pedersen, J. S., Bejerano, G., Siepel, A., Rosenbloom, K., Lindblad-Toh, K., Lander, E. S., Kent, J., Miller, W., and Haussler, D. 2006. Identification and classification of conserved RNA secondary structures in the human genome. PLoS Comput Biol 2: e33.
Pfeffer, S., Sewer, A., Lagos-Quintana, M., Sheridan, R., Sander, C., Grasser, F. A., van Dyk, L. F., Ho, C. K., Shuman, S., Chien, M., et al. 2005. Identification of microRNAs of the herpesvirus family. Nat Methods 2: 269-76.
Porkka, K. P., Pfeiffer, M. J., Waltering, K. K., Vessella, R. L., Tammela, T. L., and Visakorpi, T. 2007. MicroRNA expression profiling in prostate cancer. Cancer Res 67: 6130-5.
Rajagopalan, R., Vaucheret, H., Trejo, J., and Bartel, D. P. 2006. A diverse and evolutionarily fluid set of microRNAs in Arabidopsis thaliana. Genes Dev 20: 3407-25.
Ruby, J. G., Jan, C., Player, C., Axtell, M. J., Lee, W., Nusbaum, C., Ge, H., and Bartel, D. P. 2006. Large-scale sequencing reveals 21U-RNAs and additional microRNAs and endogenous siRNAs in C. elegans. Cell 127: 1193-207.
Sato, N., Sanjuan, I. M., Heke, M., Uchida, M., Naef, F., and Brivanlou, A. H. 2003. Molecular signature of human embryonic stem cells and its comparison with the mouse. Dev Biol 260: 404-13.
Shivdasani, R. A. 2006. MicroRNAs: regulators of gene expression and cell differentiation. Blood 108: 3646-53.
Skottman, H., Mikkola, M., Lundin, K., Olsson, C., Stromberg, A. M., Tuuri, T., Otonkoski, T., Hovatta, O., and Lahesmaa, R. 2005. Gene expression signatures of seven individual human embryonic stem cell lines. Stem Cells 23: 1343-56.
Song, L. and Tuan, R. S. 2006. MicroRNAs and cell differentiation in mammalian development. Birth Defects Res C Embryo Today 78: 140-9.
Stekel, D. J., Git, Y., and Falciani, F. 2000. The comparison of gene expression from multiple cDNA libraries. Genome Res 10: 2055-61.
Strauss, W. M., Chen, C., Lee, C. T., and Ridzon, D. 2006. Nonrestrictive developmental regulation of microRNA gene expression. Mamm Genome 17: 833-40.
Suh, M. R., Lee, Y., Kim, J. Y., Kim, S. K., Moon, S. H., Lee, J. Y., Cha, K. Y., Chung, H. M., Yoon, H. S., Moon, S. Y., et al. 2004. Human embryonic stem cells express a unique set of microRNAs. Dev Biol 270: 488-98.
Wang, Y., Medvid, R., Melton, C., Jaenisch, R., and Blelloch, R. 2007. DGCR8 is essential for microRNA biogenesis and silencing of embryonic stem cell self-renewal. Nat Genet 39(3): 380-5.
Watanabe, T., Takeda, A., Tsukiyama, T., Mise, K., Okuno, T., Sasaki, H., Minami, N., and Imai, H. 2006. Identification and characterization of two novel classes of small RNAs in the mouse germline: retrotransposon-derived siRNAs in oocytes and germline small RNAs in testes. Genes Dev 20: 1732-43.
Xue, C., Li, F., He, T., Liu, G. P., Li, Y., and Zhang, X. 2005. Classification of real and pseudo microRNA precursors using local structure-sequence features and support vector machine. BMC Bioinformatics 6: 310.
Yao, Y., Guo, G., Ni, Z., Sunkar, R., Du, J., Zhu, J. K., and Sun, Q. 2007. Cloning and characterization of microRNAs from wheat (Triticum aestivum L.). Genome Biol 8: R96.
Zhang, B., Pan, X. P., Cox, S. B., Cobb, G. P., and Anderson, T. A. 2006. Evidence that miRNAs are different from other RNAs. Cell Mol Life Sci 63.
Zhao, J. J., Hua, Y. J., Sun, D. G., Meng, X. X., Xiao, H. S., and Ma, X. 2006. Genome-wide microRNA profiling in human fetal nervous tissues by oligonucleotide microarray. Childs Nerv Syst 22: 1419-25.

3 CONCLUDING REMARKS

3.1 Goals accomplished

A strong motivating factor in this project was the general lack of software for analyzing miRNA sequence data. The software that is currently available for this purpose lags far behind current technological capabilities, with recent releases still limited to analysis of data from 'dideoxy' (Sanger) sequencing methods (Tian et al. 2007).
Considering the benefits of next-generation sequencing methods presented in the previous chapters, use of older sequencing approaches for miRNA profiling is likely to diminish in favour of these new methods. To catalyze this transition, the scientific community will require access to software that is capable of analyzing and summarizing data from experiments employing these new platforms. As researchers may be reluctant to adopt new techniques, they must also be convinced of the benefits such a change will provide. To this end, many valuable bioinformatic tools for processing Illumina small RNA sequence data have been developed throughout this project. Although improvements are required to enhance the scope and robustness of this software, it has utility in interpreting the vast amount of data that is produced by this technology. Specifically, chapter two provides an example of the various types of information that can be extracted from these data using this set of tools.

Overall, the analyses included here revealed some underappreciated intricacies of miRNA biogenesis and some specific insights into gene regulation in human embryonic stem cells. These results include the observations that miRNAs demonstrate variability at the 5' and 3' ends, likely resulting from inconsistent Drosha and Dicer cleavage; undergo RNA editing by ADAR; and are subject to a plethora of 3' extensions that appear to be specific to certain miRNAs and evolutionarily conserved.
The biological significance of these findings is unclear and they may later prove to be artefactual. However, the fact that most of these observations recapitulate those of other groups suggests they are at least reproducible. As well, this analysis has resulted in a list of 170 novel miRNA genes, some of which may also be important in human embryonic stem cells. Of the known miRNA genes identified here, many were already known to be important in hESCs (Suh et al. 2004). This technique also revealed miRNAs expressed at relatively low levels, whose expression has never before been reported in hESCs. Some of these also demonstrated significant expression changes between these samples, thus increasing the list of miRNAs potentially important in either pluripotency or differentiation.

Key to discovering the role of these miRNAs in hESCs is the correct identification of their gene targets. Target genes, predicted by one of the most reliable target prediction algorithms (TargetScan), were ranked for their potential for cooperative regulation by co-expressed miRNAs. Although this reduced the sensitivity of target prediction considerably, it provided smaller sets of genes that are potentially important in pluripotency or differentiation. The statistically overrepresented gene ontology (GO) terms (Figure 2.3) support this, as these smaller gene lists are enriched for genes involved in these processes.

3.2 Future directions

The data included here represent only a small sample of the capabilities of this new sequencing technology.
The Illumina sequencing apparatus allows 8 separate libraries to be sequenced concurrently, offering a choice between enhanced sampling depth or multiple biological replicates. Biological and technical replicates would allow more robust statistical tests to be applied, potentially revealing more subtle miRNA expression changes that were missed in this study. Samples extracted from differentiating hESCs at earlier and later time points could also reveal trends in these miRNA expression changes, which may fluctuate during the process of differentiation. Later time points may also reveal further stages of differentiation or could reveal whether the EB sample was fully differentiated (devoid of pluripotent cells). These additional experiments would solidify our current views of the hESC miRNA transcriptome and would likely reveal the variability in miRNA levels that is unrelated to differentiation.

A complete view of the miRNA expression levels and changes associated with differentiation is of limited utility without knowledge of their gene targets. However, assays for validating miRNA/target interactions are laborious, thus few have been biologically verified to date. In fact, a publicly available database known as TarBase (Sethupathy et al. 2006) hosts the currently known interactions and it includes very few of the miRNAs highlighted in this study. Hence, an important next step would be to validate some of the candidate targets of cooperative miRNA regulation identified here (Figure 2.3).
Perhaps more importantly, large-scale efforts to identify miRNA/target interactions need to be developed. A few recent studies have demonstrated successful co-immunoprecipitation between components of RISC and their bound mRNAs in both human and mouse tissues (Duan and Jin 2006; Beitzinger et al. 2007). Such approaches may be adapted into high-throughput methods for identifying mRNA targets of the miRNA pathway and potentially the specific sites bound by these proteins. If a full view of all mRNAs targeted by miRNAs can be produced, the field of miRNA research will be strengthened considerably.

Finally, a few of the observations in this project are of unknown functional relevance. RNA editing of pre-miRNAs can affect the mature miRNA sequence and is thought to affect a miRNA's repertoire of targets (Blow et al. 2006; Kawahara et al. 2007). However, as the relative levels of edited miRNAs found here were less than 1% of the unmodified forms, it seems unlikely that they are functional. Further work using more small RNA sequence libraries should reveal whether editing of miRNAs is a widespread phenomenon.

Two other modes of miRNA variation were discussed here: 3' extension by an unknown mechanism and end variation due to variable Drosha and Dicer cleavage sites. Both can result in miRNAs of different lengths, and the 5' variants have a different seed sequence than the canonical miRNA.
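
The point about 5' variants and the seed can be illustrated with a minimal sketch. It assumes the common convention that the seed is nucleotides 2-8 of the mature sequence, which is a choice made for this illustration and not necessarily the window used in this thesis. The let-7a sequence is the most common read listed in Supplementary Table 1 (DNA alphabet); the two variant reads and the function name `seed` are hypothetical, constructed only for this example.

```python
"""Sketch: how a 5' end shift changes the seed of a miRNA read."""


def seed(mature: str, start: int = 2, end: int = 8) -> str:
    """Return the seed, assumed here to be nucleotides 2-8 (1-based, inclusive)."""
    if len(mature) < end:
        raise ValueError("sequence is shorter than the seed window")
    return mature[start - 1:end]


# let-7a 5-p, most common sequence from Supplementary Table 1.
canonical = "TGAGGTAGTAGGTTGTATAGTT"

# Hypothetical isomiRs, built only for illustration.
five_prime_variant = canonical[1:]      # 5' end shifted one nucleotide downstream
three_prime_variant = canonical + "A"   # single-nucleotide 3' extension

for name, read in [
    ("canonical", canonical),
    ("5' variant", five_prime_variant),
    ("3' variant", three_prime_variant),
]:
    print(f"{name:12s} seed = {seed(read)}")
```

Under this assumption, the 5' shift yields a different seed while the 3' extension leaves it untouched, which is why only the 5' variants are expected to alter seed-based target predictions.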
As these 5' variants may also have different in vivo targets, determining whether these isomiRs associate with functional RISCs is an important next step towards concluding whether they are also functional. The 3' variability of miRNAs is less likely to affect their targets, as miRNA/mRNA complementarity in this region is less important. Instead, it is possible that variation at the 3' end performs a different function entirely. A recent study has shown that the sequence at the extreme 3' end of some miRNAs can affect their subcellular localization (Hwang et al. 2007). Hence, further study may reveal that these isomiRs reside preferentially in different cellular locales.

3.3 References

Beitzinger, M., Peters, L., Zhu, J. Y., Kremmer, E., and Meister, G. 2007. Identification of Human microRNA Targets From Isolated Argonaute Protein Complexes. RNA Biol 4.
Blow, M. J., Grocock, R. J., van Dongen, S., Enright, A. J., Dicks, E., Futreal, P. A., Wooster, R., and Stratton, M. R. 2006. RNA editing of human microRNAs. Genome Biol 7: R27.
Duan, R. and Jin, P. 2006. Identification of messenger RNAs and microRNAs associated with fragile X mental retardation protein. Methods Mol Biol 342: 267-276.
Hwang, H. W., Wentzel, E. A., and Mendell, J. T. 2007. A hexanucleotide element directs microRNA nuclear import. Science 315: 97-100.
Kawahara, Y., Zinshteyn, B., Sethupathy, P., Iizasa, H., Hatzigeorgiou, A. G., and Nishikura, K. 2007. Redirection of silencing targets by adenosine-to-inosine editing of miRNAs. Science 315: 1137-40.
Sethupathy, P., Corda, B., and Hatzigeorgiou, A. G. 2006. TarBase: A comprehensive database of experimentally supported animal microRNA targets. RNA 12: 192-7.
Suh, M. R., Lee, Y., Kim, J. Y., Kim, S. K., Moon, S. H., Lee, J. Y., Cha, K. Y., Chung, H. M., Yoon, H. S., Moon, S. Y., et al. 2004. Human embryonic stem cells express a unique set of microRNAs. Dev Biol 270: 488-98.
Tian, F., Zhang, H., Zhang, X., Song, C., Xia, Y., Wu, Y., and Liu, X. 2007. miRAS: a data processing system for miRNA expression profiling study. BMC Bioinformatics 8: 285.

APPENDICES

Appendix A: Supplementary Tables

Supplementary Table 1: All significantly differentially expressed miRNAs

microRNA (hsa) | arm | most common sequence | hESC count | EB count | P-value | fold change
mir-199a | 3-p | ACAGTAGTCTGCACATTGGTTA | 1110 | 13163 | 0 | 11.86
mir-372 | 3-p | AAAGTGCTGCGACATTTGAGCGT | 1388 | 13653 | 0 | 9.84
mir-122a | 5-p | TGGAGTGTGACAATGGTGTTTG | 436 | 2565 | 0 | 5.88
mir-152 | 5-p | TCAGTGCATGACAGAACTTGG | 622 | 3028 | 0 | 4.87
mir-10a | 5-p | TACCCTGTAGATCCGAATTTGT | 948 | 3887 | 0 | 4.10
let-7a | 5-p | TGAGGTAGTAGGTTGTATAGTT | 11902 | 2951 | 0 | 4.03
mir-302a | 5-p | TAAACGTGGATGTACTTGCTTT | 36800 | 9917 | 0 | 3.71
mir-222 | 3-p | AGCTACATCTGGCTACTGGGTCTC | 4719 | 1331 | 0 | 3.55
mir-340 | 5-p | TTATAAAGCAATGAGACTGATT | 2247 | 7198 | 0 | 3.20
mir-363 | 3-p | AATTGCACGGTATCCATCTGTA | 5775 | 17912 | 0 | 3.10
mir-21 | 5-p | TAGCTTATCAGACTGATGTTGAC | 39818 | 21003 | 0 | 1.90
mir-221 | 3-p | AGCTACATTGTCTGCTGGGTTTC | 16275 | 8716 | 0 | 1.87
mir-26a | 5-p | TTCAAGTAATCCAGGATAGGCT | 4892 | 8530 | 0 | 1.74
mir-26b | 5-p | TTCAAGTAATTCAGGATAGGTT | 1003 | 2957 | 1.39E-278 | 2.95
mir-130a | 3-p | CAGTGCAATGTTAAAAGGGCAT | 2334 | 4798 | 2.20E-265 | 2.06
mir-744 | 5-p | TGCGGGGCTAGGGCTAACAGCA | 4166 | 1516 | 9.13E-259 | 2.75
mir-594 | 5-p | ATGGATAAGGCATTGGC | 1717 | 211 | 1.96E-253 | 8.14
mir-302b | 3-p | TAAGTGCTTCCATGTTTTAGTAG | 15169 | 8855 | 1.39E-213 | 1.71
mir-30d | 5-p | TGTAAACATCCCCGACTGGAAGCT | 2798 | 4988 | 3.29E-205 | 1.78
mir-146b | 5-p | TGAGAACTGAATTCCATAGGCTGT | 703 | 2075 | 2.27E-196 | 2.95
mir-25 | 3-p | CATTGCACTTGTCTCGGTCTGA | 24268 | 15875 | 1.52E-189 | 1.53
mir-373 | 3-p | GAAGTGCTTCGATTTTGGGGTGT | 60 | 788 | 5.75E-184 | 13.13
mir-423 | 5-p | TGAGGGGCAGAGAGCGAGACTTT | 9844 | 5538 | 4.50E-162 | 1.78
mir-320 | 3-p | AAAAGCTGGGTTGAGAGGGCGA | 5967 | 2978 | 1.33E-148 | 2.00
mir-1 | 3-p | TGGAATGTAAAGAAGTATGTAT | 7421 | 4051 | 2.39E-137 | 1.83
mir-371 | 5-p | ACTCAAACTGTGGGGGCACTT | 1649 | 3020 | 6.89E-133 | 1.83
mir-302d | 3-p | TAAGTGCTTCCATGTTTGAGTGT | 8599 | 5047 | 3.95E-119 | 1.70
mir-378 | 3-p | ACTGGACTTGGAGTCAGAAGG | 4021 | 6149 | 5.05E-119 | 1.53
mir-30a | 5-p | TGTAAACATCCTCGACTGGAAGCT | 1654 | 2822 | 2.67E-105 | 1.71
mir-302c | 3-p | TAAGTGCTTCCATGTTTCAGT | 3929 | 1984 | 4.03E-95 | 1.98
let-7e | 5-p | TGAGGTAGGAGGTTGTATAGTT | 899 | 1794 | 4.88E-95 | 2.00
mir-92 | 5-p | AGGTTGGGATCGGTTGCAATGCT | 788 | 138 | 2.91E-94 | 5.71
mir-204 | 5-p | TTCCCTTTGTCATCCTATGCCT | 690 | 1505 | 7.42E-94 | 2.18
mir-421 | 3-p | ATCAACAGACATTAATTGGGCGC | 579 | 1331 | 1.31E-90 | 2.30
mir-20b | 5-p | CAAAGTGCTCATAGTGCAGGTAG | 281 | 862 | 2.51E-86 | 3.07
mir-130b | 3-p | CAGTGCAATGATGAAAGGGCAT | 836 | 1626 | 4.23E-82 | 1.94
mir-146a | 5-p | TGAGAACTGAATTCCAT | 107 | 556 | 3.04E-78 | 5.20
mir-181a | 5-p | AACATTCAACGCTGTCGGTGAGTTT | 571 | 1220 | 8.39E-74 | 2.14
mir-106a | 5-p | AAAAGTGCTTACAGTGCAGGTAG | 236 | 696 | 4.74E-67 | 2.95
let-7c | 5-p | TGAGGTAGTAGGTTGTATGGTT | 360 | 30 | 1.94E-64 | 12.00
mir-184 | 3-p | TGGACGGAGAACTGATAAGGGT | 403 | 44 | 2.15E-64 | 9.16
mir-708 | 5-p | AAGGAGCTTACAATCTAGCTGGG | 397 | 990 | 3.68E-64 | 2.49
mir-205 | 5-p | TCCTTCATTCCACCGGAGTCTGT | 502 | 0 | 9.77E-63 | n/a
mir-127 | 3-p | TCGGATCCGTCTGAGCTTGGCT | 81 | 406 | 9.77E-63 | 5.01
let-7i | 5-p | TGAGGTAGTAGTTTGTGCTGTT | 347 | 33 | 4.34E-59 | 10.52
mir-302a | 3-p | TAAGTGCTTCCATGTTTTGGTGA | 5239 | 3237 | 8.21E-58 | 1.62
mir-331 | 3-p | GCCCCTGGGCCTATCCTAGAA | 1129 | 434 | 1.21E-53 | 2.60
mir-889 | 3-p | TTAATATCGGACAACCATTGT | 0 | 170 | 2.09E-53 | n/a
mir-210 | 3-p | CTGTGCGTGTGACAGCGGCTGA | 28 | 247 | 1.54E-51 | 8.82
mir-302d | 5-p | ACTTTAACATGGAGGCACTTGCT | 258 | 20 | 8.58E-48 | 12.90
mir-143 | 3-p | TGAGATGAAGCACTGTAGCTC | 314 | 680 | 7.06E-43 | 2.17
mir-423 | 3-p | AGCTCGGTCTGAGGCCCCTCAGT | 1225 | 560 | 3.17E-40 | 2.19
mir-503 | 5-p | TAGCAGCGGGAACAGTTCTG | 235 | 30 | 2.17E-35 | 7.83
mir-25 | 5-p | AGGCGGAGACTTGGGCAATTGCT | 1002 | 456 | 1.44E-33 | 2.20
mir-17 | 3-p | ACTGCAGTGAAGGCACTTGTAG | 146 | 377 | 1.27E-31 | 2.58
let-7b | 5-p | TGAGGTAGTAGGTTGTGT | 284 | 57 | 1.38E-31 | 4.98
mir-335 | 5-p | TCAAGAGCAATAACGAAAAATG | 119 | 331 | 1.19E-30 | 2.78
mir-27b | 3-p | TTCACAGTGGCTAAGTTCTGC | 195 | 442 | 1.38E-30 | 2.27
mir-129 | 3-p | AAGCCCTTACCCCAAAAAGCAT | 946 | 449 | 1.22E-28 | 2.11
mir-30e | 3-p | CTTTCAGTCGGATGTTTACAG | 1741 | 994 | 1.72E-28 | 1.75
mir-371 | 3-p | AAGTGCCGCCATCTTTTGAGTGT | 9 | 114 | 1.31E-27 | 12.67
mir-19b | 3-p | TGTGCAAATCCATGCAAAACTGA | 492 | 794 | 7.29E-27 | 1.61
mir-30e | 5-p | TGTAAACATCCTTGACTGGAAGCT | 447 | 738 | 1.34E-26 | 1.65
let-7g | 5-p | TGAGGTAGTAGTTTGTACAGTT | 321 | 100 | 1.69E-22 | 3.21
mir-134 | 5-p | TGTGACTGGTTGACCAGAGGGG | 10 | 97 | 7.80E-22 | 9.70
mir-181d | 5-p | AACATTCATTGTTGTCGGTGGGTT | 281 | 82 | 1.38E-21 | 3.43
mir-20a | 5-p | TAAAGTGCTTATAGTGCAGGTAG | 377 | 608 | 6.56E-21 | 1.61
mir-92b | 5-p | AGGGACGGGACGCGGTGCAGTGTT | 556 | 250 | 7.10E-20 | 2.22
mir-877 | 5-p | GTAGAGGAGATGGCGCAGGGG | 101 | 10 | 1.50E-19 | 10.10
let-7f | 5-p | TGAGGTAGTAGATTGTATAGTT | 2004 | 1281 | 1.60E-19 | 1.56
mir-483 | 5-p | AAGACGGGAGGAAAGAAGGGAG | 31 | 133 | 1.81E-19 | 4.29
mir-219 | 3-p | AGAGTTGAGTCTGGACGTCCCG | 3 | 67 | 6.98E-19 | 22.33
mir-23b | 3-p | ATCACATTGCCAGGGATTACCAC | 134 | 286 | 1.57E-18 | 2.13
mir-93 | 5-p | CAAAGTGCTGTTCGTGCAGGTA | 319 | 556 | 5.26E-18 | 1.74
mir-660 | 5-p | TACCCATTGCATATCGGAGTTG | 46 | 150 | 2.41E-17 | 3.26
mir-801 | 5-p | TGGCACACGTAGGGCAAC | 90 | 185 | 7.04E-17 | 2.06
let-7d | 5-p | AGAGGTAGTAGGTTGCATAGTT | 77 | 4 | 7.19E-17 | 19.25
mir-518f | 3-p | GAAAGCGCTTCTCTTTAGAGGA | 84 | 205 | 1.20E-16 | 2.44
mir-296 | 3-p | AGGGTTGGGTGGAGGCT | 97 | 11 | 2.74E-16 | 8.82
mir-192 | 5-p | TGACCTATGAATTGACAGCCAGT | 79 | 210 | 3.14E-16 | 2.66
mir-599 | 5-p | TTTGATAAGCTGACATGGGACAG | 0 | 48 | 1.32E-15 | n/a
mir-145 | 5-p | GTCCAGTTTTCCCAGGAATCCCT | 57 | 146 | 4.79E-13 | 2.56
mir-873 | 5-p | GCAGGAACTTGTGAGTCTCCT | 203 | 358 | 1.75E-12 | 1.76
mir-486 | 3-p | TCCTGTACTGAGCTGCCCCGAG | 293 | 125 | 1.91E-12 | 2.34
mir-329 | 3-p | AACACACCTGGTTAACCTCTTT | 0 | 35 | 2.74E-12 | n/a
mir-20b | 3-p | ACTGTAGTATGGGCACTTCCAGT | 205 | 331 | 4.18E-12 | 1.61
mir-125a | 5-p | TCCCTGAGACCCTTTAACCTGTG | 189 | 311 | 5.35E-12 | 1.65
mir-302c | 5-p | TTTAACATGGGGGTACCTGCT | 522 | 280 | 9.08E-12 | 1.86
mir-487b | 3-p | AATCGTACAGGGTCATCCACTT | 0 | 31 | 5.70E-11 | n/a
mir-769 | 5-p | TGAGACCTCTGGGTTCTGAGCT | 170 | 298 | 1.78E-10 | 1.75
mir-28 | 3-p | CACTAGATTGTGAGCTCCTGGA | 344 | 169 | 1.89E-10 | 2.04
mir-106b | 3-p | CCGCACTGTGGGTACTTGCTGC | 197 | 77 | 2.58E-10 | 2.56
mir-135b | 5-p | TATGGCTTTTCATTCCTATGTGA | 30 | 54 | 2.67E-10 | 1.80
mir-598 | 3-p | TACGTCATCGTTGTCATCGTCA | 168 | 271 | 3.79E-10 | 1.61
mir-187 | 3-p | TCGTGTCTTGTGTTGCAGCCGG | 281 | 130 | 4.07E-10 | 2.16
mir-27a | 3-p | TTCACAGTGGCTAAGTTC | 77 | 157 | 4.45E-10 | 2.04
mir-199b | 5-p | CCCAGTGTTTAGACTATCTGTTC | 1 | 30 | 4.97E-10 | 30.00
mir-135a | 5-p | TATGGCTTTTTATTCCTATGTGA | 59 | 132 | 5.44E-10 | 2.24
mir-323 | 3-p | GCACATTACACGGTCGACCTCT | 0 | 28 | 5.55E-10 | n/a
mir-218 | 5-p | TTGTGCTTGATCTAACCATGT | 1 | 28 | 5.55E-10 | 28.00
mir-96 | 5-p | TTTGGCACTAGCACATTTTTGCT | 44 | 110 | 6.99E-10 | 2.50
mir-185 | 5-p | TGGAGAGAAAGGCAGTTCCTG | 500 | 311 | 1.11E-09 | 1.61
mir-151 | 5-p | CTAGACTGAAGCTCCTTGAGGA | 1179 | 786 | 1.16E-09 | 1.50
mir-539 | 3-p | ATCATACAAGGACAATTTCTTT | 0 | 27 | 1.19E-09 | n/a
mir-379 | 5-p | TGGTAGACTATGGAAGGTAGG | 7 | 46 | 1.58E-09 | 6.57
mir-215 | 5-p | ATGACCTATGAATTGACAGACA | 3 | 38 | 2.70E-09 | 12.67
mir-135a | 3-p | ATATAGGGATTGGAGCCGTGGC | 15 | 65 | 3.02E-09 | 4.33
mir-99a | 5-p | AACCCGTAGATCCGATCTTGTG | 4 | 30 | 4.03E-09 | 7.50
mir-768 | 5-p | GTTGGAGGATGAAAGTACGGA | 69 | 151 | 4.21E-09 | 2.19
mir-518b | 3-p | CAAAGCGCTCCCCTTTAGAGGT | 215 | 94 | 4.64E-09 | 2.29
mir-494 | 3-p | TGAAACATACACGGGAAACCTCT | 2 | 32 | 4.73E-09 | 16.00
mir-330 | 3-p | GCAAAGCACACGGCCTGCAGAGAG | 187 | 86 | 5.09E-09 | 2.17
mir-410 | 3-p | AATATAACACAGATGGCCTGT | 0 | 25 | 5.41E-09 | n/a
mir-941 | 3-p | CACCCGGCTGTGTGCACATGTGC | 114 | 41 | 1.12E-08 | 2.78
mir-594 | 3-p | GCCCCAGTGGCCTAATGGA | 9 | 46 | 2.00E-08 | 5.11
mir-193b | 3-p | AACTGGCCCTCAAAGTCCCGGT | 288 | 146 | 2.74E-08 | 1.97
mir-367 | 3-p | TGGATTGTTAAGCCAATGACAGAA | 45 | 5 | 2.89E-08 | 9.00
mir-382 | 5-p | GAAGTTGTTCGTGGTGGATTCG | 0 | 24 | 3.59E-08 | n/a
mir-335 | 3-p | TTTTTCATTATTGCTCCTGACC | 72 | 137 | 5.41E-08 | 1.90
mir-29c | 3-p | TAGCACCATTTGAAATCGGTTA | 45 | 100 | 7.98E-08 | 2.22
mir-296 | 5-p | AGGGCCCCCCCTCAATCCTGT | 60 | 12 | 9.65E-08 | 5.00
mir-302b | 3-p | ACTTTAACATGGAAGTGCTTTCT | 343 | 188 | 1.07E-07 | 1.82
mir-518a | 3-p | AGAAGATCTCAAGCTGTGA | 75 | 20 | 2.22E-07 | 3.75
mir-224 | 5-p | CAAGTCACTAGTGGTTCCGTTTAG | 108 | 43 | 3.19E-07 | 2.51
mir-520g | 3-p | ACAAAGTGCTTCCCTTTAGAGTGT | 74 | 20 | 3.33E-07 | 3.70
mir-519a | 3-p | AGAAGATCTCAGGCTGTGTC | 65 | 16 | 4.78E-07 | 4.06
mir-22 | 3-p | AAGCTGCCAGTTGAAGAACTGT | 359 | 205 | 5.20E-07 | 1.75
mir-9 | 3-p | TCTTTGGTTATCTAGCTGTATGA | 427 | 256 | 8.66E-07 | 1.67
mir-221 | 5-p | ACCTGGCATACAATGTAGATTTCT | 637 | 413 | 1.04E-06 | 1.54
mir-574 | 3-p | CACGCTCATGCACACACCCACA | 65 | 17 | 1.12E-06 | 3.82
mir-543 | 3-p | AAACATTCGCGGTGCACTTCTT | 2 | 25 | 1.86E-06 | 12.50
mir-140 | 3-p | ACCACAGGGTAGAACCAC | 137 | 219 | 2.18E-06 | 1.60
mir-10b | 5-p | TACCCTGTAGAACCGAATTTGT | 2 | 23 | 2.43E-06 | 11.50
mir-455 | 5-p | TATGTGCCTTTGGACTACATCG | 36 | 78 | 3.55E-06 | 2.17
mir-518c | 5-p | CTCTGGAGGGAAGCACTTTCTGTT | 43 | 8 | 3.69E-06 | 5.38
mir-197 | 3-p | TTCACCACCTTCTCCACCCAGC | 145 | 66 | 4.75E-06 | 2.20
mir-760 | 3-p | CGGCTCTGGGTCTGTGGGG | 25 | 2 | 4.90E-06 | 12.50
mir-516 | 5-p | CATCTGGAGGTAAGAAGCACTTTGT | 170 | 92 | 6.14E-06 | 1.85
mir-486 | 3-p | CGGGGCAGCTCAGTACAGGATA | 19 | 0 | 6.54E-06 | n/a
mir-129 | 5-p | CTTTTTGCGGTCTGGGCTTGC | 126 | 55 | 7.28E-06 | 2.29
mir-374b | 5-p | ATATAATACAACCTGCTAAG | 84 | 146 | 1.05E-05 | 1.74
mir-154 | 3-p | AATCATACACGGTTGACCTATT | 0 | 15 | 1.07E-05 | n/a
mir-484 | 5-p | TCAGGCTCAGTCCCCTGCCGAT | 65 | 20 | 1.10E-05 | 3.25
mir-187 | 5-p | GCTACAACACAGGACCCGGGCG | 18 | 0 | 1.23E-05 | n/a

microRNA (hsa) | arm | most common sequence | hESC count | EB count | P-value
mir-211 | 3-p | TTCCCTTTGTCATCCTTCGCCT | 28 | 3 | 1.27E-05
mir-577 | 5-p | GTAGATAAAATATTGGTACCTG | 53 | 96 | 1.64E-05
mir-369 | 3-p | AATAATACATGGTTGATCTTT | 2 | 20 | 1.86E-05
mir-382 | 3-p | AATCATTCACGGACAACACTTT | 2 | 20 | 1.86E-05
mir-200a | 3-p | TAACACTGTCTGGTAACGATGTT | 87 | 136 | 2.28E-05
mir-485 | 3-p | GTCATACACGGCTCTCCTCTCT | 0 | 14 | 2.28E-05
mir-766 | 5-p | AGGAGGAATTGGTGCTGGTCTT | 25 | 3 | 2.41E-05
mir-7 | 3-p | AGGTAGACTGGGATTTGTTGTT | 41 | 9 | 2.58E-05
mir-483 | 3-p | TCACTCCTCTCCTCCCGTCTT | 59 | 101 | 3.59E-05
mir-520a | 3-p | AAAGTGCTTCCCTTTGGACTG | 363 | 226 | 3.64E-05
mir-589 | 5-p | TGAGAACCACGTCTGCTCTGAGC | 62 | 23 | 4.20E-05
mir-519c | 3-p | AGAAGATCTCAGCCTGTGAC | 44 | 11 | 4.35E-05
mir-374 | 5-p | TTATAATACAACCTGATAAGT | 59 | 108 | 4.81E-05
mir-432 | 5-p | TCTTGGAGTAGGTCATTGGG | 0 | 13 | 4.87E-05
mir-217 | 5-p | TACTGCATCAGGAACTGATTGGA | 1 | 16 | 5.02E-05
mir-31 | 5-p | AGGCAAGATGCTGGCATAGCTG | 235 | 135 | 6.16E-05
mir-196b | 5-p | TAGGTAGTTTCCTGTTGTTGGG | 2 | 18 | 7.11E-05
mir-362 | 5-p | AATCCTTGGAACCTAGGTGTGAGT | 14 | 66 | 8.78E-05
mir-424 | 3-p | CAGCAGCAATTCATGTTTTGAA | 52 | 16 | 8.78E-05
mir-493 | 5-p | TTGTACATGGTAGGCTTTCATT | 0 | 13 | 9.19E-05
mir-512 | 3-p | AAGTGCTGTCATAGCTGA | 37 | 76 | 9.63E-05
mir-33 | 5-p | GTGCATTGTAGTTGCATTGCA | 43 | 78 | 0.000100716
mir-182 | 5-p | TTTGGCAATGGTAGAACTCACACTGGT | 73 | 32 | 0.000129595
mir-923 | 3-p | CAGGATTCCCTCAGTAA | 13 | 1 | 0.000161215
mir-193b | 5-p | CGGGGTTTTGAGGGCGAGATG | 19 | 2 | 0.000173717
mir-874 | 3-p | CTGCCCTGGCCCGAGGGACCGAC | 19 | 2 | 0.000173717
mir-95 | 3-p | TTCAACGGGTATTTATTGAGC | 19 | 2 | 0.000173717
mir-542 | 5-p | TGTGACAGATTGATAACTGAAA | 48 | 82 | 0.00020581
mir-24 | 3-p | TGGCTCAGTTCAGCAGGAACA | 47 | 87 | 0.000216923
mir-499 | 5-p | TTAAGACTTGCAGTGATGTTTA | 30 | 59 | 0.000231403
mir-887 | 3-p | GTGAACGGGCGCCATCCCGAGGCT | 3 | 19 | 0.000343537
mir-155 | 5-p | TTAATGCTAATCGTGATAGGGGT | 35 | 9 | 0.000348001
mir-185 | 3-p | AGGGGCTGGCTTTCCTCT | 35 | 9 | 0.000348001
mir-18b | 5-p | TAAGGTGCATCTAGTGCAGT | 0 | 11 | 0.000382853
mir-375 | 3-p | TTTGTTCGTTCGGCTCGC | 19 | 46 | 0.000415915
mir-203 | 3-p | GTGAAATGTTTAGGACCACTAG | 52 | 85 | 0.000459785
mir-495 | 3-p | AAACAAACATGGTGCACTTCTT | 0 | 10 | 0.000474427
mir-339 | 5-p | TGAGCGCCTCGACGACAGAGCCG | 29 | 55 | 0.000616533
mir-21 | 3-p | CAACACCAGTCGATGGGCTGTCT | 33 | 10 | 0.000631908
mir-618 | 3-p | AAACTCTACTTGTCCTTCTGAGT | 9 | 27 | 0.000678534
mir-28 | 5-p | AAGGAGCTCACAGTCTATTGAGT | 21 | 4 | 0.000767493
mir-520b | 3-p | AAAGTGCTTCCTTTTAGAG | 0 | 10 | 0.000781386
mir-641 | 5-p | AAAGACATAGGATAGAGTCACCT | 0 | 10 | 0.000781386
mir-199a | 5-p | CCCAGTGTTCAGACTACCTGTTC | 1 | 10 | 0.000781386
mir-301 | 5-p | CAGTGCAATAGTATTGTCAAAGCAT | 16 | 37 | 0.000792328
mir-504 | 5-p | AGACCCTGGTCTGCACTCTATCT | 49 | 20 | 0.000811599
mir-372 | 5-p | CCTCAAATGTGGAGCACTATTC | 1 | 12 | 0.000823076
mir-194 | 3-p | TGTAACAGCAACTCCATGTGGAA | 12 | 31 | 0.000885533
mir-518a | 5-p | CTGCAAAGGGAAGCCCTTTCT | 2 | 14 | 0.000978556

fold change 9.33 1.81 10.00 10.00 1.56 n/a 8.33 4.56 1.71 1.61 2.70
4.00 1.83 n/a 16.00 1.74 9.00 4.71 3.25 n/a 2.05 1.81 2.28 13.00 9.50 9.50 9.50 1.71 1.85 1.97 6.33 3.89 3.89 n/a 2.42 1.63 n/a 1.90 3.30 3.00 5.25 n/a n/a 10.00 2.31 2.45 12.00 2.58 7.00  60  Supplementary Table 2: All novel m i R N A s reported in this study  Chromosome  Start  End  Most abundant isomiR  Fold Change (if significant)  EB count  hESC count 374  245  1.53  84  131  1.56  1269  774  1.64  GCCTGTCTGAGCGTCGCTT  194  114  1.70  TATTCATTTATCCCCAGCCTACA  126  67  1.88  62716246  TCCCTGTTCGGGCGCCA  670  1302  1.94  131417429  131417451  TCCCTGTTCGGGCGCCA  670  1302  1.94  chrl  19096156  19096178  TGGATTTTTGGATCAGGGA  40  85  2.13  chrl  555992  556022  AACAGCTAAGGACTGCA  72  31  2.32  chr6  26645776  26645799  639  273  2.34  chrX  117404429  117404454  TAC GTAG ATATATATGTATTTT  20  54  2.70  chr5  41511501  41511523  GTCCCTGTTCAGGCGCCA  24  66  2.75 3.00  chr2  177173995  177174018  7212576  7212600  chrX  113855921  113855947  chr16  33873061  33873083  218440512  218440575  62716221  chr6  chr11  chrl chr19  AATGGATTTTTGGAGCAGG TAAGTGCTTCCATGCTT TTCATTCGGCTGTCCAGATGTA  GTGTAAGCAGGGTCGI I I I  chr15  62841721  '62841750  GATGATGATGGCAGCAAATTCTGAAA  81  27  chr22  18453595  18453657  ACGTTGGCTCTGGTGGTG  35  11  3.18  chr8 .  
141129908  141129930  ATCCCCAGCACCTCCACC  59  18  3.28  chr10  100145015  100145040  TGCTGGATCAGTGGTTCGAGTC  16  57  3.56  chr15  50356653  50356679  CCTCAGGGCTGTAGAACAGGGCT  163  45  3.62  chr4  66825201  66825227  CTGGACTGAGCCGTGCTACTGG  44  161  3.66  105144044  105144071  ACTCGGCGTGGCGTCGGTCGTG  228  50  4.56  chr2  25405021  25405049  63  13  4.85  chr4  164234207  164234230  TCCGAGTCACGGCACCA  309  55  5.62  chrl  150066842  150066862  GTGGGGGAGAGGCTGTC  103  17  6.06  chrl  68421844  68421869  7  45  6.43  chr7  146706060  146706085  AAACTGTAATTACTTTTGG  21  3  7.00  chrX  21990206  21990231  GCATGGGTGGTTCAGTGG  407  52  7.83  chr10  70189095  70189123  AGCCTGGAAGCTGGAGCCTGCAGT  25  3  8.33  chr22  35428858  35428879  AAAAAGACAC C C C C CAC  189  21  9.00  chr14  101096450  101096475  4  53  13.25  chr11  90241994  90242016  31  2  15.50 17.50  chr10  TTGCAGCTGCCTGGGAGTGACTTC  ATGGGTGAATTTGTAGAAGGAT  ACCCGTCCCGTTCGTCCCCGGA ATGGATAAGGCTTTGGCTT  chr15  20014621  20014642  CGGGCGTGGTGGTGGGGG  35  2  chr11  93106277  93106306  GTGGGAAGGAATTACAAGACAGTT  72  2  36.00  chrl  20299  20324  TTGGGACATACTTATGCTAAA  4  1  -  chrl  21187455  21187480  AGGCATTGACTTCTCACTAGCT  2  2  -  chrl  38242469  38242487  AGAGGAACAGTTCATTC  10  2  -  chrl  86844553  86844578  AAAAGTAATTGCGGATTTTGCC  2  3  -  chrl  94984022  94984045  ACTGGGCTTGGAGTCAGAAG  98  130  chrl  148091054  148091079  GCGAGATCGCGCAGGACTTT  3  0  -  61  Chromosome  Start  Most abundant isomiR  End  Fold Change (if significant)  EB count  hESC count  CGGATGAGCAAAGAAAGTGGTT  7  2  -  4  6  -  184  126  -  20  13  -  chrl  166234525  166234550  chrl  169337500  169337525  TTAGGCCGCAGATCTGGGTGA  chrl  191372302  191372327  TAGTACTGTGCATATCATCTAT  chrl  222511371  222511397  AAAAGCTGGGTTGAGAGGGCAA  chr10  14518592  14518617  CAGGATGTGGTCAAGTGTTGTT  9  2  -  chMO  56037643  56037672  AAAACTGTAATTACTTTTG G AC  64  31  -  chr10  64802775  64802800  TTAGGGCCCTGGCTCCATCTCC  13  15  -  chMO  93405230  93405250  
TGAACCCAGGAGGCGGA  6  9  -  chr10  104030627  104030644  AGATGGATTGTTCTGGG  2  1  -  chMO  112738724  112738749  AAAAACTGAGACTACTTTTGCA  38  26  -  chr11  3535184  3535209  AAAAGTAATTGCGGATTTTGCC  2  3  -  chM1  8662350  8662379  ATCGAGGCTAGAGTCACGCTTGG  8  10  -  chr11  8662888  8662951  TGTGGACTTTGGTCAGTGAAC  2  1  -  chr11  61491650  61491669  GCCGAGAGTCGTCGGGGTT  3  .1  -  chM1  67457297  67457322  AAAAGTAATTGCGGATTTTGCC  2  3  -  chM1  69807738  69807763  AAAAGTACTTGCGGATTTTGCT  57  64  -  chr11  93106536  93106561  TTTGAGGCTACAGTGAGATGTG  2  2  -  chM1  93839357  93839382  AAAAGTATTTGCGGGTTTTGTC  14  14  -  chM2  9371838  9371863  TTGGGACATACTTATGCTAAA  4  1  -  chM2  43950119.  43950145  AGTGAACCACAGACCTGAGATGT  3  3  -  chM2  47334539  47334568  TGGCCCTGACTGAAGACCAGCAGT  6  0  -  chr12  48914229  48914253  TGGGTGGTCTGGAGATTTGTGC  4  0  -  chM2  61368489  61368514  AAAAACTGTAATTACTTTT  4  1  -  chr12  67953235  67953252  TCATATTGCTTCTTTCT  5  2  -  chM2  78337170  78337195  AGAAGGAAATTGAATTCATTTA  4  5  -  chr12  96409820  96409846  ACTCTAGCTGCCAAAGGCGCT  2  9  -  chM2  97517768  97517795  GACTCGACTCGTGTGCGGACATTT  7  2  -  chr12  111617267  111617292  TTGGGACATACTTATGCTAAA  4  1  -  chM3  44500861  44500886  TCTGGGCAACAAAGTGAGACCT  23  33  -  chr13  53784117  53784134  TTC AAGT AATTC AG GTG  4  6  -  chM3  70373329  70373354  AAAAACTGTAATTACTTTT  4  1  -  chr13  106981564  106981588  CCTGTTGAAGTGTAATCCCCA  2  3  -  chM4  22317565  22317585  ATATAATACAACCTGCTT  45  74  -  chM4  63631551  63631577  AAAAGTAATCGCGGTTTTTGTC  27  16  -  chM4  76802325  76802346  ATCCCACCTCTGCCACCA  4  2  -  chr15  41873222  41873242  TCGTTTGCCTTTTTCTGCTT  1  2  -  chr15  51017861  51017887  TTGAGAAGGAGGCTGCTG  0  3  -  chM5  66267369  66267386  AAAAGAAATGTACAGAG  1  4  -  chr15  84114778  84114803  TAAAGAGCCCTGTGGAGACA  3  9  -  62  Chromosome  Start  End  Most abundant isomiR  Fold Change (if significant)  EB count  hESC count  -  TTGGGACATACTTATGCTAAA  4  1  TCTGGAAGGAGACTGTAT 
 0  1  AAAAGTAATCGCGGTTTTTGTC  27  16  2598205  AGAGAAGAAGATCAGCCTGCA  4  2  13387635  13387661  AAAAGTAATCGCGGTTTTTGTC  27  16  -  16126095  16126119  TGGACTGCCCTGATCTGGAGA  3  2  -  chr15  100318227  100318252  chr16  1952199  1952219  chr16  11307845  11307871  chr17  2598181  chr17 chr17  -  6  8  60  58  AAAAACTGTAATTACTTTT  4  1  TTGGGACATACTTATGCTAAA  4  1  CTGGAGATATGGAAGAGCTGTGT  76  38  58883583  TCTACAAAGGAAAGCGCTTTCT  12  10  58953309  58953334  TCTACAAAGGAAAGCGCTTTCT  12  10  chr2  56081400  56081425  AAATCTCTGCAGGCAAATGTG  0  4  chr2  62828390  62828415  TAGCAAAAACTGCAGTTACTTT  6  2  chr2  70333567  70333592  TCTGGGCAACAAAGTGAGACCT  23  33  chr2  114057048  114057073  TTGGGACATACTTATGCTAAA  4  '1  chr2  180433815  180433840  AGTTAGGATTAGGTCGTGGAA  10  5  -  chr2  180433850  180433875  AGTTAGGATTAGGTCGTGGAA  10  5  -  chr2  189551103  189551128  AAGTGATCTAAAGGCCTACAT  0  6 1  -  chr17  76721656  76721681  ACGGTGCTGGATGTGGCCTTT  chr18  26132900  26132917  TAATTGCTTCCATGTTT  chr18  55567933  55567958  chr19  23043  23068  chr19  20371125  20371152  chr19  58883558  chr19  -  -  chr2  207842293  207842318  TTGGGACATACTTATGCTAAA  4  chr2  212999243  212999270  AACTGTAATTACTTTTGCACCAAC  2  0  -  chr2  232286322  232286348  AAGTAGTTGGTTTGTATGAGATGGTT  0  1  -  chr20  2581424  2581452  TGGGAACGGGTTCCGGCAGACGCTG  4  5  -  chr20  33505224  33505250  TGGAGTCCAGGAATCTGCATTTT  2  2  -  chr20  36487253  36487280  AAACTCTGGAGCTTTCGTACATGC  7  0  chr20  36578662  36578687  CCAAAACTGCAGTTACTTTTGC  8  2  chr20  47330266  47330296  ATATATGATGACTTAGCTTTT  7  22  chr20  59962070  59962094  AGTGAATGATGGGTTCTGACC  2  3  13  14  7  6  -  chr22  18616665  18616690 .TGCAGGACCAAGATGAGCCCT  chr22  20337277  20337302  GCTCTGACGAGGTTGCACTACT  chr22  20337312  20337340  CAGTGCAATGATATTGTCAAAGCAT  24  30  chr22  25281235  25281262  AAAAGTAATTGCGGTCTTTGGT  82  58  -  chr22  39818495  39818513  TCGCCTCCTCCTCTCCC  0  1  _  chr22  43975501  43975526  ACGCCCTTCCCCCCCTTCTTCA  16  3  chr3  
70572562  70572587  AAAAACTGTAATTACTTTT  4  1  chr3  71673877  71673902  TCTATACAGACCCTGGCTTTTC  2  6  chr3  119201514  119201540  TGGAGTCCAGGAATCTGCATTTT  2  2  chr3  126992024  126992049  AAAAGTAATTGCGGATTTTGCC  2  3  -  -  63  Chromosome  Start  End  Most abundant isomiR  hESC count  Fold Change (if significant)  EB count  chr3  129563700  129563720  TCCCACCGCTGCCACCC  6  5  chr3  150090065  150090092  GGCCAGGCCTGGTGGCTCACTTT  6  1  chr3  165372000  165372027  ATGGTACCCTGGCATACTGAGT  5  3  chr3  187985862  187985882  GAAAACTGCAGTGACATGTT  0  2  chr3  187987156  187987186  ACCTTCTTGTATAAGCACTGTGCTAAA  5  1  chr4  4136998  4137023  AAAAGTAATTG C G G ATTTTG C C  2  3  chr4  9166974  9166999  AAAAGTAATTGCGGATTTTGCC  2  3  -  chr4  36104418  36104443  CGGATGAGCAAAGAAAGTGGTT  7  2  -  chr4  71081401  71081426  TTGGGACATACTTATGCTAAA  4  1  -  chr4  102470541  102470568  AGGATGAGCAAAGAAAGTAGATT  111  70  chr4  114247470  114247495  AACTGGATCAATTATAGGAGTG  3  2  chr4  148485239  148485266  AAAACTGTAATTACTTTTGTAC  21  12  chr4  183327488  183327513  TTTTCAACTCTAATGGGAGAGA  14  1  -  chr5  100180098  100180123  TAGCAAAAACTGCAGTTACTTT  6  2  -  chr5  109877434  109877461  AACTGTAATTACTTTTGCACCAAC  2  0  chr5  132791203  132791229  TGGAGTCCAGGAATCTGCATTTT  2  2  chr5  153706904  153706929  TGTGAGGTTGGCATTGTTGTCT  12  12  chr5  154045576  154045602  TTTAGAGACGGGGTCTTGCTCT  19  8  chr5  175727567  175727592  CTTGGCACCTAGCAAGCACTCA  20  25  -  chr6  54084910  54084927  AACAGCATTGCACTTGT  1  3  -  chr6  74286439  74286468  AAAGGAAAAGACTCATATCAA  3  0  -  chr6  132155005  132155031  AAAAGTAATCGCGGTTTTTGTC  27  16  chr6  132634811  132634835  TGGGAAAGGAAAAGACTC  4  2  chr6  133180049  133180074  AAGCCAGCCAATGAA  4  0  chr6  133180152  133180179  TAAGGTCCCTGAGAATGGCTAT  38  24  chr6  163910779  163910796  AAATACAATAAAATCTG  0  2  chr6  167331306  167331331  TACGCGCAGACCACAGGATGTC  4  3  chr7  18133379  18133404  TTGGGACATACTTATGCTAAA  4  1  chr7  34946939  34946964  
CAAAAGTAATTGTGGATTTTGT  3  2  chr7  67622028  67622053  TCTGGGCAACAAAGTGAGACCT  .23  33  -  chr7  91671271  91671298  TCTGGGCAACAAAGTGAGACCT  23  33  _  chr7  100156228  100156248  GTGGGGGAGAGGCTGTG  18  9  chr8  7043408  7043433  AAAAGTAATTGCGGATTTTGCC  2  3  chr8  7983960  7983985  AAAAGTAATTGCGGATTTTGCC  2  3  chr8  12479920  12479945  AAAAGTAATTGCGGATTTTGCC  2  3  -  chr8  12538020  12538045  AAAAGTAATTGCGGATTTTGCC  2  3  -  chr8  26962356  26962382  AAAAGTAATCGCGGTTTTTGTC  27  16  chr8  101105387  101105415  GGGCGACAAAGCAAGACTCTTTCTT  4  0  -  chr8  142865520  142865545  TTGGGACATACTTATGCTAAA  4  1  -  -  -  -  -  -  64  Chromosome  Start  Most abundant isomiR  End  hESC count  Fold Change (if significant)  EB count  chr8  144193152  144193177  TTGGGACATACTTATGCTAAA  4  1  -  chr9  20214  20239  TTGGGACATACTTATGCTAAA  4  1  -  chr9  28878884  28878910  GAGACTGATGAGTTC C C GGG A  41  23  chr9  68292057  68292082  TTCTGGAATTCTGTGTGAGGGA  2  8  chr9  91868406  91868431  AAAAGTAATTGCGGATTTTGCC  2  3  chr9  99165696  99165721  TTGGGACATACTTATGCTAAA  4  1  chr9  116078285  116078309  149  103  chrX  32569518  32569545  AAAACTGTAATTACTTTTGTAC  21  12  chrX  69159472  69159498  CTGTCCTAAGGTTGTTGAGTT  28  15  chrX  83367459  83367484  AAAAGTAATTGCGGATTTTGCC  2  3  chrX  93354756  93354781  AAAAGTAATTGCGGATTTTGCC  2  3  -  chrX  94204843  94204867  CAAAGGTATTTGTGGTTTTTG  2  0  -  chrX  100287314  100287339  TAGCAAAAACTGCAGTTACTTT  6  2  -  chrX  113793427  113793450  CAAGTCTTATTTGAGCACCTGTT  2  2  -  chrX  118805342  118805371  CATGACAGATTGACATGGACAATT  7  0  chrY  13478002  13478025  AGATGAACTTGAAAGAAGACCAT  2  1  AATGGCTTTTTGGAGCAGG  -  -  -  65  Supplementary Table 3: Seed s e q u e n c e s differentially represented between libraries Seed sequence AGUAGUC CAGUAGU ACAGUAG UUAAACG CUUAAAC GGAGUGU AGUGCUG ACCCUGU AAACGUG UAUAAAG GAGAACU UUGCACG GAGGUAG CAGUGCA AAAGCUG AGUGCAA UCAAGUA GCUACAU GUAAACA GAGGGGC AGCUUAU CUGGACU GCAGCAU GCGGGGC CUCAAAC UGGAUAA GGAAUGU 
AAAGUGC UAAGCCA GGUUGGG UCAACAG UGUGCGU GGACGGA CCCCUGG UCCCUUU AGGAGCU UAAACGU UCACAGU CGGAUCC GCUCGGU AUGGAUA UAAUAUC GAGAUGA CACAGUG CUUGGAG AGUGCUU  hESC count 86 1801 241 3452 16762 846 2484 1503 55712 3002 1547 1845 29280 3596 16448 4463 6898 47591 8789 17586 65376 17819 153193 5220 4306 1770 9601 8585 2225 1037 1003 129 616 1910 837 705 759 719 115 1793 322 0 412 6807 0 8315  EB count 1478 22288 2749 398 2332 5216 12641 5689 16461 9702 4976 5561 10258 8559 7613 9594 14089 24280 15479 10518 39424 25378 187218 1956 8173 233 5384 12847 550 199 2283 747 69 773 1862 1649 165 1573 578 817 16 228 1036 8616 193 6098  hESC background 860319 860319 860319 860319 860319 860319 860319 860319 860319 860319 860319 860319 860319 860319 860319 860319 860319 860319 860319 860319 860319 860319 860319 860319 860319 860319 860319  EB background 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472  860319 860319 860319 860319 860319 860319 860319 860319 860319 860319 860319 860319 860319 860319 860319 860319 860319 860319 860319  826472 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472  Fold Change 17.19 12.38 11.41 8.67 7.19 6.17 5.09 3.79 3.38 3.23 3.22 3.01 2.85 2.38 2.16 2.15 2.04 1.96 1.76 1.67 1.66 1.42 1.22 2.67 1.90 7.60 1.78 1.50 4.05 5.21 2.28 5.79 8.93 2.47 2.22 2.34 4.60 2.19 5.03 2.19 20.13 n/a 2.51 1.27 n/a 1.36  Corrected Pvalue 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0  b 8.76E-305 1.62E-296 1.00E-275 4.47E-223 8.91E-221 7.95E-220 8.95E-127 1.83E-121 5.50E-109 4.40E-103 3.31 E-97 9.74E-95 5.90E-92 1.60E-83 2.42E-77 2.41 E-76 8.24E-72 1.18E-69 2.34E-68 8.38E-65 9.60E-62 1.63E-57 8.08E-55  66  Seed sequence CUUUAAC GGCGGAG UUUCAGU AGCCCUU AUGGCUU UAGAGGA CUGCAGU AACGGAA GGGACGG CAAGAGC ACAGUAC AGCAGCG UUAACAU UUGAUAA GGAUUGU AGUGCCG UGCCGCC GACCUAU 
AGACGGG CCUUCAU ACCCAUU AAGCUGG GUGACUG CCUGUAC CCCUGUA GAAGAUC ACAUUCA UGACCUA GUGCAAA CAAAGCA CGCACUG CAGGAAC GAGUUGA UCCAGUU AGCUCGG AACGUGG UAUAAUA CCAGUGU UGGCACA ACUAGAU GGUGGGG CAAGACU ACGUCAU GGGUUGG AAGACUG .GUGCCGC UCAUACA ACACACC AAUACUG GGCAAGA  hESC count 860 1303 4065 1311 350 277 340 6891 1059 240 2763 548 1179 0 216 19 11 545 56 1246 106 810 16 529 40 394 3751 2439 755 1176 490 531 3 85 700 221 3711 0 93 514 223 115 517 184 96 11 0 6 3266 585  EB count 315 602 2680 612 834 29 815 5034 472 643 3740 175 579 127 22 194 164 1017 273 660 350 374 160 200 216 124 4644 3150 1198 678 201 902 93 269 352 53 4467 73 269 238 61 14 812 49 10 93 54 76 3839 316  hESC background 860319 860319 860319 860319 860319 860319 860319 860319 860319 860319 860319 860319 860319 860319 860319 860319 860319 860319 860319 860319 860319 860319 860319 860319 860319 860319 860319 860319 860319 860319 860319 860319 860319 860319 860319 860319 860319 860319 860319 860319 860319 860319 860319 860319 860319 860319 860319 860319 860319 860319  EB background 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472  Fold Change 2.73 2.16 1.52 2.14 2.38 9.55 2.40 1.37 2.24 2.68 1.35 3.13 2.04 n/a 9.82 10.21 14.91 1.87 n/a 1.89 3.30 2.17 10.00 2.65 5.40 3.18 1.24 1.29 1.59 1.73 2.44 1.70 31.00 3.16 1.99 4.17 1.20 n/a 2.89 2.16 3.66 8.21 1.57 3.76 9.60 8.45 n/a 12.67 1.18 1.85  Corrected Pvalue 3.22E-51 3.00E-50 1.72E^9 2.02E-49 2.40E-47 5.51 E-47 8.22E-47 1.63E-46 4.81 E-44 6.25E-44 7.14E-40 1.87E-39 3.85E-39 4.54E-37 1.01E-36 2.94E-36 9.64E-35 1.72E-34 1.73E-34 1.15E-33 7.40E-31 9.93E-31 1.61E-29 1.94E-29 3.77E-29 1.50E-28 6.63E-28 4.08E-25 8.80E-25 4.92E-24 2.68E-23 3.38E-23 3.11E-22 3.65E-22 2.77E-21 
2.77E-21 3.20E-21 2.44E-20 1.30E-19 6.75E-19 1.76E-18 2.78E-17 1.10E-15 1.88E-15 3.14E-15 3.19E-15 1.88E-14 2.07E-14 2.42E-14 3.86E-14  67  Seed sequence GUAGUCU CGGGGCU CCAGUGC GGCAGUG UAGACUG AAUGGAU CCCCAGU CUGUAGU AAAGCAC GAGACCU GGGGCAG CAUGGGU CGUGUCU AAGCCAG UAUCAGA UUUUUGC CACAUUA ACGUGGA UGCACGG UAAACAU GGCACAC UUGGAGG UAUAGGG AGCCAUG UGUGCUU UAGCUUA CAGCAUU AUCGUAC CCCUGAG GGGCCCC AAGUUGU UCACCAC CUACAUC AUCCUUG AACAUUC UGCCCUG GAAUUGU GGCUCUG GUGCAAU GCUUAUC UAAUGCU CAGGGAU AUUGCAC ACUGGCC AAAGCGC GAAACAU AGCUGCC GUAAGUG AAGUCAC GACCCUG  hESC count 2 203 52 2741 2781 74 44 472 285 248 156 49 414 116 1019 232 0 130 29 159 117 419 15 40 2 124 949 0 721 74 2 217 304 40 4 43 21 49 72 520 139 80 137973 458 654 2 442 229 369 122  EB count 62 67 0 2084 2119 6 148 722 120 440 44 0 211 27 1328 95 40 36 109 302 244 627 79 0 45 36 1226 36 969 12 43 95 155 119 47 2 84 4 164 321 50 18 135989 278 862 36 269 115 217 45  hESC background 860319 860319 860319 860319 860319 860319 860319 860319 860319 860319 860319 860319 860319 860319 860319 860319 860319 860319 860319 860319 860319 860319 860319 860319 860319 860319 860319 860319 860319 860319 860319 860319 860319 860319 860319 860319 860319 860319 860319 860319 860319 860319 860319 860319. 
860319 860319 860319 860319 860319 860319  EB background 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472  Fold Change 31.00 3.03 n/a 1.32 1.31 12.33 3.36 1.53 2.38 1.77 3.55 n/a 1.96 4.30 1.30 2.44 n/a 3.61 3.76 1.90 2.09 1.50 5.27 n/a 22.50 3.44 1.29 n/a 1.34 6.17 21.50 2.28 1.96 2.98 11.75 21.50 4.00 12.25 2.28 1.62 2.78 4.44 1.01 1.65 1.32 18.00 1.64 1.99 1.70 2.71  Corrected Pvalue 4.59E-14 6.61E-13 7.08E-13 7.16E-13 8.33E-13 1.81E-12 2.06E-12 2.07E-12 2.82E-12 3.44E-12 4.72E-12 5.41 E-12 1.84E-11 1.08E-10 2.04E-10 3.15E-10 4.08E-10 4.65E-10 5.88E-10 1.27E-09 1.42E-09 1.45E-09 2.27E-09 2.44E-09 4.22E-09 6.27E-09 7.05E-09 7.08E-09 7.20E-09 1.21E-08 1.60E-08 4.49E-08 4.71 E-08 4.89E-08 7.50E-08 8.01 E-08 1.01E-07 1.01E-07 2.38E-07 2.98E-07 3.23E-07 3.93E-07 4.83E-07 1.02E-06 1.05E-06 1.66E-06 2.67E-06 6.19E-06 6.27E-06 9.34E-06  68  Seed sequence GGAGGAA CACUCCU CUACAUU AGCCAGG AACACUG AUAUAAC ACCCGUA ACCCGGC AAUGCCC UUGGCAC UACUGCA AAGCUCG UUUAACA GGAGAGA AACACCA CUACAAC AUCAUAC UUUUCAU UAUUGCA AGAACUG ACAUUAC CACAGCU CACGGUA AUCAUUC UGAACGG GUGACAG CUUUGGU GAGAACC CCUGGCA UGUACAU CACAGGG AAGAUCU CAAGUAA UAAGACU AUGUGCC CUAGAUU ACUGCAU CUGUGCG ACGCUCA UUGUUCG UGGAGGA AGCAUUG AUCUGGA AGACUGG GUAACAG CGAUUGC GGGGCUG CUAAGCC AAAAGCU GGCUCAG  hESC count 57 134" 1042 27 105 0 1637 148 1011 63 3 112 41 2555 144 25 0 203 275 9 0 0 0 2 11 88 737 206 1427 0 333 173 84 37 96 81 2 2 84 431 26 270 235 57 15 19 72 60 125 943  EB count 10 236 765 0 197 25 1900 63 744 137 34 42 5 2092 63 0 23 314 156 47 22 22 22 29 49 166 527 108 1117 21 204 86 157 90 171 28 26 26 30 565 72 163 137 16 51 58 25 18 59 1111  hESC background 860319 860319 860319 860319 860319 860319 860319 860319 
860319 860319 860319 860319 860319 860319 860319 860319 860319 860319 860319 860319 860319 860319 860319 860319 860319 860319 860319 860319 860319 860319 860319 860319 860319 860319 860319 860319 860319 860319 860319 860319 860319 860319 860319 860319 860319 860319 860319 860319 860319 860319  EB background 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472  Fold Change 5.70 1.76 1.36 n/a 1.88 n/a 1.16 2.35 1.36 2.17 11.33 2.67 8.20 1.22 2.29 n/a n/a 1.55 1.76 5.22 n/a n/a n/a 14.50 4.45 1.89 1.40 1.91 1.28 n/a 1.63 2.01 1.87 2.43 1.78 2.89 13.00 13.00 2.80 1.31 2.77 1.66 1.72 3.56 3.40 3.05 2.88 3.33 2.12 1.18  Corrected Pvalue 9.50E-06 1.18E-05 1.55E-05 1.72E-05 1.80E-05 1.81E-05 2.05E-05 2.48E-05 3.31 E-05 3.53E-05 4.30E-05 4.46E-05 5.62E-05 6.37E-05 6.62E-05 6.75E-05 7.55E-05 9.19E-05 0.000100705 0.000111875 0.000154039 0.000154039 0.000154039 0.000161419 0.00021004 0.000210775 0.0002156 0.000233133 0.000301431 0.000314382 0.000321823 0.000390211 0.000571587 0.000739075 0.000840298 0.001084265 0.001120023 0.001120023 0.001186869 0.001187676 0.001289031 0.002278633 0.002768409 0.002930674 0.003219056 0.003246499 0.003559141 0.003716032 0.004052592 0.004326709  69  Seed sequence CUCACAC UACCCUG AUAAUAC ACUGCAA ACUCCUC GAAUGUA AAGGACA UCAACGG GAGUGUG AUAAAGC GACUGGG GGGGUUU ACAGCAG AGCACCA CUCAAAU UCUUGAG UUCUAAU UAGAUAA CUUCAUG AGGGGCA GUGCUUC UGAAAUG ACUUUAA CAGGCUC  hESC count 660 0 14 3 54 105 80 181 35 20 17 61 0 4400 2 77 64 59 10 190 349 70 70 142  EB count 805 17 48 26 108 47 31 101 79 56 0 21 15 3850 21 31 117 110 38 112 237 124 28 77  hESC background 860319 860319 860319 860319 860319 860319 860319 860319 860319 860319 860319 860319 860319 860319 860319 860319 
860319 860319 860319 860319 860319 860319 860319 860319  EB background 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472 826472  Fold Change 1.22 n/a 3.43 8.67 2.00 2.23 2.58 1.79 .2.26 2.80 n/a 2.90 n/a 1.14 10.50 2.48 1.83 1.86 3.80 1.70 1.47 1.77 2.50 1.84  Corrected Pvalue 0.005340937 0.005454676 0.005639196 0.006144601 0.006407272 0.006947813 0.009501052 0.009508369 0.013463942 0.015650839 0.016244698 0.020082178 0.022720944 0.024924325 0.026958039 0.027743405 0.027744697 0.029032117 0.031394002 0.032600265 0.03499445 0.037971577 0.045295024 0.045618849  70  A p p e n d i x B: Supplementary Methods and accompanying data  Comparing miRNA expression levels using five different metrics We assessed the use o f five different metrics to summarize m i R N A expression levels based on the two sequence libraries. The first three metrics are calculated from sequences with either a perfect match, or sequences with perfect 3' or 5' matches to the miRBase reference sequence. The other two were calculated using either the most common or the sum o f all sequences representing a single arm o f a particular m i R N A . The pairwise correlation between the h E S C and E B libraries using each o f the five metrics is included in Table A l (below). The most robust discrimination between these libraries, manifested as a low R value, resulted from comparison using the most 2  frequently observed sequence from each arm o f every m i R N A (R =0.87, Spearman 2  Correlation). Using only the sequences matching the miRBase reference sequence gave reasonable, but not as large, a discrimination between libraries (R =0.92). The 'sum o f 2  all tags' metric showed an even lower correlation, suggesting it may provide better discrimination between libraries. 
However, this metric did not perform well when used to cluster libraries of different origin in both human and mouse (not shown).

Table A1: Spearman correlation between hESC and EB libraries using each of the five metrics for miRNA expression
hESC 3prime hESC hESC most hESC sum of hESC 5prime miRBase miRBase common tag tags miRBase match match 0.92 EB miRBase 0.68 0.57 0.85 0.93 EB most common tag 0.87 0.64 0.76 0.89 0.85 EB sum of tags -0.16 -0.09 0.66 -0.12 -0.16 EB 5' miRBase match 0.77 0.70 0.92 0.84 0.29 EB 3' miRBase match 0.67 0.68 0.66 0.85 0.95

Detecting RNA editing events

The two types of RNA editing that have been characterized in mammals are adenine and cytosine deamination reactions. The former is observed in sequence reads as an A to G transition, while the latter is observed as a C to T transition. To search for evidence of RNA editing, we aligned all unaligned reads to the human genome using a local alignment tool that allows up to two mismatches (ELAND, unpublished). Taking all the reads that aligned to miRNA loci using this tool, we produced multiple sequence alignments representing 'modified' versions of each miRNA. These multiple sequence alignments were converted to sequence logos to facilitate visualization of recurrent editing events. Figure A1 shows the identification of a known A->I editing site in hsa-miR-151 (Blow et al. 2006) as well as a potential C->U editing site. It is also possible that the latter results from minor contamination by the mouse fibroblast feeder cells on which the hESCs were grown, since the mismatching nucleotide corresponds to the mouse genome sequence.

Figure A1: Sequence logo representing hsa-miR-151 reads with imperfect genome alignments
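The mismatch screen described under "Detecting RNA editing events" can be illustrated with a short Python sketch. It is not the pipeline used in this study: the function names, the (offset, sequence) read representation, the `min_reads` threshold and the toy sequences are all assumptions made for illustration, and reads are assumed to be ungapped and on the same strand as the reference. The sketch tallies mismatches per reference position and keeps only A to G and C to T changes, the read-level signatures of A->I and C->U editing.

```python
from collections import Counter

# Read-level mismatch signatures of the two known mammalian editing reactions.
EDIT_SIGNATURES = {("A", "G"): "A-to-I", ("C", "T"): "C-to-U"}


def tally_mismatches(reference, reads, max_mismatches=2):
    """Count (position, ref_base, read_base) mismatches over ungapped reads.

    Each read is an (offset, sequence) pair; offset is the 0-based start of
    the read on the reference. Reads with more than max_mismatches
    differences are skipped, mirroring a two-mismatch alignment limit.
    """
    counts = Counter()
    for offset, seq in reads:
        window = reference[offset:offset + len(seq)]
        if len(window) != len(seq):
            continue  # read runs past the end of the reference
        diffs = [
            (offset + i, ref_base, read_base)
            for i, (ref_base, read_base) in enumerate(zip(window, seq))
            if ref_base != read_base
        ]
        if len(diffs) <= max_mismatches:
            counts.update(diffs)
    return counts


def candidate_edits(counts, min_reads=2):
    """Return (position, label, read_count) for recurrent editing-like mismatches."""
    hits = []
    for (pos, ref_base, read_base), n in sorted(counts.items()):
        label = EDIT_SIGNATURES.get((ref_base, read_base))
        if label is not None and n >= min_reads:
            hits.append((pos, label, n))
    return hits


# Toy data (made up, not from the libraries in this study).
reference = "ACGTACGTAC"
reads = [(0, "ACGTGCGT"), (2, "GTGCGTAC"), (0, "ATGTACGT"), (1, "CGTACGTAC")]
print(candidate_edits(tally_mismatches(reference, reads)))
```

Requiring a minimum number of supporting reads is one simple way to separate recurrent events from sequencing errors; as the text notes for the putative C->U site, a recurrent mismatch can still have other explanations, such as contaminating mouse sequence.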
