UBC Theses and Dissertations

Methods for microRNA profiling and discovery using massively parallel sequencing Morin, Ryan David 2007

Methods for microRNA profiling and discovery using massively parallel sequencing

by

Ryan David Morin

B.Sc., Simon Fraser University, 2003

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE in The Faculty of Sciences (Bioinformatics)

The University of British Columbia

October 2007

© Ryan David Morin, 2007

ABSTRACT

MicroRNAs (miRNAs) are emerging as important, albeit poorly characterized, regulators of gene expression. Here, I review the current knowledge of miRNAs in humans, including their biogenesis, modes of action and various methods for studying them. To fully elucidate the various functions of miRNAs in humans, we require a more complete understanding of their numbers and expression changes amongst different cell types. This document includes a description of a new method for surveying the expression of miRNAs that employs the new Illumina sequencing technology. A set of methods is presented that enables identification of sequences belonging to known miRNAs as well as variability in their mature sequences. As well, a novel system for miRNA gene discovery using these data is described. Application of this approach to RNA from human embryonic stem cells (hESCs) obtained before and after differentiation into embryoid bodies (EBs) revealed the sequences and expression levels of 362 known plus 170 novel miRNA genes. Of these, 190 known and 31 novel microRNA sequences exhibited significant expression differences between these two developmental states. Owing to the increased number of sequence reads, these libraries currently represent the deepest miRNA sampling in any human cell type spanning nearly six orders of magnitude of expression.
Predicted targets of the differentially expressed miRNAs were ranked to identify those that are likely under cooperative miRNA regulation in either hESCs or EBs. The predicted targets of those miRNAs enriched in either sample shared common features. Included amongst the high-ranked predicted gene targets are those implicated in differentiation, cell cycle control, programmed cell death and transcriptional regulation. Direct validation of these predicted targets or global discovery of miRNA targets should reveal the functions of these sequences in the differentiation of hESCs.

TABLE OF CONTENTS

Abstract
Table of Contents
List of Tables
List of Figures
Acknowledgements
Dedication
Co-authorship statement
1 An introduction to microRNAs
1.1 MicroRNA biogenesis
1.2 MicroRNA function
1.3 Methods for microRNA target prediction
1.4 MicroRNA discovery, detection and profiling
1.5 Perspectives and research goals
1.6 References
2 Application of massively parallel sequencing to microRNA profiling and discovery in human embryonic stem cells
2.1 List of authors
2.1.1 Manuscript Status (footnote)
2.2 Introduction
2.3 Results
2.3.1 Sequencing and annotation of small RNAs
2.3.2 Variability in microRNA processing
2.3.3 Enzymatic modification of microRNAs
2.3.4 MicroRNAs differentially expressed between hESCs and EBs
2.3.5 Novel microRNA genes
2.3.6 Targets of differentially expressed microRNAs
2.4 Discussion
2.5 Methods
2.5.1 Small RNA library preparation
2.5.2 Small RNA genome mapping and quantification
2.5.3 Differential expression detection
2.5.4 Small RNA annotation
2.5.5 Novel miRNA detection
2.5.6 Cooperative miRNA target prediction
2.5.7 Gene ontology analysis
2.6 References
3 Concluding remarks
3.1 Goals accomplished
3.2 Future directions
3.3 References
Appendices
Appendix A: Supplementary Tables
Appendix B: Supplementary Methods and accompanying data

LIST OF TABLES

Table 2.1: Top 20 miRNAs differentially expressed between the hESC and EB libraries
Table 2.2: Putative novel miRNAs with significant differential expression
Table 2.3: Top 25 statistically over-represented seeds of miRNA sequences found in hESCs or EBs

LIST OF FIGURES

Figure 2.1: Distribution of sequence counts for the different classes of small RNAs
Figure 2.2: The repertoire of isomiRs and 3' modifications of hsa-mir-191
Figure 2.3: Clustering of over-represented Gene Ontology classes in predicted targets of differential microRNAs

ACKNOWLEDGEMENTS

I am grateful to my rotation supervisors, Dr. Peter Unrau and Dr. Francis Ouellette, for assisting me in developing my scientific skills and providing guidance throughout the duration of my degree. I extend my deepest gratitude to my supervisor, Dr. Marco Marra, for providing me with an excellent work environment and thesis project. This work would not have been possible without the ongoing moral support of my loving wife, Sarah.

To my parents, grandparents, wife, Sarah, and daughter, Natalie

CO-AUTHORSHIP STATEMENT

The manuscript included in this document is a culmination of work conducted by both laboratory scientists and bioinformaticians. Drs. Michael O'Connor and Connie Eaves produced the human embryonic stem cell and embryoid body RNA that was used for small RNA sequencing. Dr. Martin Hirst, Anna-liisa Prabhu, Yongjun Zhao, Helen McDonald and Thomas Zeng prepared small RNA libraries and sequenced the material. Malachi Griffith and Drs.
Florian Kuchenbauer and Allen Delaney provided bioinformatics support and helpful suggestions for strategies and improvements to the data analysis and presentation. Ryan Morin and Marco Marra co-conceived the project. Marco Marra provided methodological, conceptual, emotional and financial support. Ryan Morin performed all database and algorithm design, bioinformatic analyses, wrote the manuscript and produced all figures. Rhonda Oshanek edited the document for content and grammar.

1 AN INTRODUCTION TO MICRORNAS

1.1 microRNA biogenesis

MicroRNAs (miRNAs) are short RNA molecules, 19-25 nucleotides (nt) in length that derive from stable fold-back sub-structures of larger transcripts (Cai et al. 2004). MiRNA genes can exist in the introns of spliced mRNAs, either as solitary non-coding transcripts, or within polycistronic transcripts containing multiple miRNA genes (Bartel 2004). In most cases, these primary miRNA transcripts (pri-miRNAs) are cleaved by a complex of Drosha and its cofactor DGCR-8 producing one or more miRNA precursor (pre-miRNA) hairpins with a 2-nt overhang at their 3' ends (Lund et al. 2004). Pre-miRNAs are exported from the nucleus to the cytoplasm by Exportin-5, after which they are further cleaved by Dicer, releasing short RNA duplexes with a 2-nt 3' overhang at both ends (Lund et al. 2004). The recently discovered 'mirtrons', which have been observed in both Drosophila melanogaster and Caenorhabditis elegans are an exception to this mechanism as they undergo nuclear export and Dicer cleavage without Drosha processing (Ruby et al. 2007). In either case, after cleavage by Dicer, one of the strands of the remaining RNA duplex is thought to be preferentially incorporated into the RNA-induced silencing complex (RISC), leaving an inactive molecule (miRNA*) that is subsequently degraded (O'Toole et al. 2006).

1.2 microRNA function

Of the miRNA/miRNA* duplex, which comprises a short strand from each arm of the pre-miRNA hairpin, only one strand is generally thought to become an active miRNA molecule, hence assembled within RISC (Bartel 2004). The active strand is identified based on the thermodynamic properties at both ends of the duplex, including the identity of the two nucleotides in the 3' overhang (O'Toole et al. 2006). However, for an increasing number of miRNA genes (currently 79) (http://microrna.sanger.ac.uk), mature sequences from both hairpin arms (i.e., both strands of the miRNA/miRNA* duplex) are known to accumulate in approximately equal abundance in the cell and are both thought to be biologically active. The activity of miRNA and miRNA* sequences is often presumed based on cloning frequencies or homology to known miRNAs. However, large-scale sequencing projects are leading to a steady increase in the number of miRNA genes known to produce two functional miRNAs. As the miRNA genes of this type accumulate, a new notation is being implemented that may replace the limited miRNA/miRNA* notation. This system, which is used throughout the remainder of the text, disambiguates miRNAs arising from a single pre-miRNA using their positioning 5' (5p) or 3' (3p) within the pre-miRNA hairpin (Griffiths-Jones et al. 2006).

In addition to the mature miRNA, the RISC protein complex can contain a variety of proteins. A key component of RISC is a member of the Argonaute protein family, of which there are four in human (EIF2C1, EIF2C2, EIF2C3 and EIF2C4) (Tolia et al. 2007). Once assembled within RISC, miRNAs can elicit post-transcriptional repression of target genes.
In animals, miRNA-target interactions occur through semi-complementary base pairing to sites usually within the 3'-UTR of the target transcript (Bartel 2004). Strong base pairing at the 5' end of the miRNA, generally termed the seed region, is thought to be a requisite for miRNA-target interaction. If the target site contains a perfect complement to the miRNA seed sequence, relaxed hybridization is tolerated along the remainder of the miRNA/mRNA duplex. If, however, there is a lack of complementarity in the seed, stronger complementarity along the remainder of the miRNA/mRNA duplex is required for a strong interaction (Bartel 2004).

Hybridization of miRNAs (within RISC) to their target mRNAs can result in post-transcriptional regulation by two separate mechanisms: transcript degradation or translational repression. MiRNAs that anneal with extensive complementarity to their target sequences and are accompanied by a RISC containing EIF2C2 are thought to direct the cleavage of target messenger RNAs. This process is analogous to the action of short interfering RNAs (siRNAs) (Liu et al. 2004; Meister et al. 2004). In animals, the more widespread mechanism of miRNA action is that of translational repression. This process is mediated by RISCs containing EIF2C1 and generally involves miRNAs with reduced complementarity to their target sites (Bartel 2004). The binding of multiple RISCs to the same transcript can enhance this effect considerably (Grimson et al. 2007; Doench et al. 2003).

1.3 Methods for microRNA target prediction

That mammalian cells tolerate relaxed complementarity between miRNAs and their targets poses many problems in computational 'target prediction'. Hence, creating accurate target prediction algorithms is an active field of study, with numerous emerging approaches that rely on differing methodologies and assumptions.
These algorithms use different methods to assess the relative complementarity between miRNAs and their potential targets (John et al. 2004). Presently, the three leading programs for mammalian miRNA target prediction are PicTar (Krek et al. 2005), miRanda (John et al. 2004) and TargetScanS (referred to as TargetScan throughout the remainder of this document) (Lewis et al. 2005). To enhance their specificity, all three methods enforce some degree of evolutionary conservation within the region of candidate target sites. These methods also consider pairing within the seed region of the miRNA to be more important than in the remainder of the sequence. Though all three approaches use similar inputs, they result in moderate to poor agreement between their target predictions, with the strongest agreement between the outputs of PicTar and TargetScan (Sethupathy et al. 2006). A common drawback amongst all these programs is that they lack predictions for recently discovered miRNAs, since their database updates do not occur as frequently as updates to miRBase. TargetScan was used in the analyses described herein because it provides the ability to perform 'custom' target predictions with novel miRNA sequences.

1.4 microRNA discovery, detection and profiling

In addition to current deficiencies in target identification, the field of miRNA research also lacks a complete list of the human miRNA genes. Additionally, evaluation of expression levels of the known miRNAs amongst the numerous tissue and cell types is a field still in its infancy (Landgraf et al. 2007). Both of these deficiencies can be attributed to a lack of a robust methodology for profiling the expression level of each miRNA in a sample.
The current commercially available high-throughput methodologies rely on primers or probes designed to detect each of the current reference miRNA sequences residing in miRBase, which acts as the central repository for all known miRNAs (Griffiths-Jones 2006). These systems, which can be purchased as individual probes or in a microarray format (Castoldi et al. 2007), allow concurrent profiling of the expression of many miRNAs (Zhao et al. 2006). The recent incorporation of Locked Nucleic Acid (LNA) nucleotide analogs into probes promises a great enhancement in the robustness of these technologies, potentially facilitating discrimination between miRNAs with single nucleotide differences (Vester et al. 2004). MiRNA microarrays offer good reproducibility and facilitate clustering of samples by similar miRNA expression profiles (Davison et al. 2006; Porkka et al. 2007). However, these methodologies are restricted to detecting and profiling of only the known miRNA sequences previously identified by sequencing or homology searches, or querying candidate regions of the genome that have been predicted to produce novel miRNAs (Berezikov et al. 2006a).

An emerging alternative method for miRNA detection and profiling involves generating cDNA libraries from small RNA cellular fractions, followed by deep sequencing using a variety of approaches (Lu et al. 2007). Previously, sequencing-based strategies for miRNA detection have been hindered by the requirements of concatamerization and cloning as well as the expense of capillary DNA sequencing (Cummins et al. 2006; Pfeffer et al. 2005). Nevertheless, small RNA sequencing has several advantages over hybridization-based methodologies. Discovery of novel miRNAs need not rely on querying candidate regions of the genome for expression using microarrays.
Rather, novel miRNAs can be discovered by direct observation of their mature sequences, followed by validation of the folding potential of flanking genomic DNA using various approaches (Berezikov et al. 2006a; Cummins et al. 2006). Small RNA sequencing also offers the potential to detect variation in mature miRNA length and enzymatic modification of miRNAs such as RNA editing (Kawahara et al. 2007) and 3' nucleotide additions (Aravin et al. 2005; Landgraf et al. 2007). In contrast to capillary sequencing, recently developed 'next-generation' sequencing technologies offer the prospect of a vast yet inexpensive increase in throughput, therefore providing a more complete view of the miRNA transcriptome. With the depth of sequencing now possible, we have an opportunity to identify low-abundance miRNAs or those exhibiting modest expression differences between samples, which may not be detected by hybridization-based methods such as microarrays. Next-generation miRNA profiling has already been accomplished in a variety of organisms (Fahlgren et al. 2007; Kasschau et al. 2007; Rajagopalan et al. 2006; Yao et al. 2007; Berezikov et al. 2006b) using, at first, Massively Parallel Signature Sequencing (MPSS) methodology (Nakano et al. 2006) and more recently the Roche/454 platform (Berezikov et al. 2006b; Margulies et al. 2005). The recently released Illumina sequencing platform provides two to three orders of magnitude of greater depth from a single run when compared to the output of current competing technologies (Berezikov et al. 2006b).

1.5 Perspectives and research goals

Three next-generation sequencing technologies have been released in the past few years (Bennett 2004; Margulies et al. 2005; Shendure et al. 2005). Each technology generates sequence reads in varying quantities with different lengths and error profiles.
The work outlined in the accompanying manuscript describes a focused view of the miRNA profiling capabilities of the Illumina sequencing method. Many new algorithms and tools are required to facilitate analysis of the large data sets produced by these technologies. A major goal of this project was to produce a set of bioinformatics tools, including a database and associated software, which would allow a skilled researcher to gain biologically relevant and statistically meaningful insights from this type of data. This methodology was first used to profile the miRNA changes involved in differentiation of human embryonic stem cells. A secondary goal of this project was to identify genes potentially regulated by these miRNAs, in hopes of identifying genes of central importance to the pluripotency or self-renewal of these cells. The following chapter discusses progress towards these goals and should provide a foundation for future analyses using this or similar sequencing approaches.

1.6 References

Aravin, A. and Tuschl, T. 2005. Identification and characterization of small RNAs involved in RNA silencing. FEBS Lett 579: 5830-40.
Bartel, D. P. 2004. MicroRNAs: genomics, biogenesis, mechanism, and function. Cell 116: 281-97.
Bennett, S. 2004. Solexa Ltd. Pharmacogenomics 5: 433-438.
Berezikov, E., Cuppen, E., and Plasterk, R. H. 2006a. Approaches to microRNA discovery. Nat Genet 38 Suppl: S2-7.
Berezikov, E., Thuemmler, F., van Laake, L. W., Kondova, I., Bontrop, R., Cuppen, E., and Plasterk, R. H. 2006b. Diversity of microRNAs in human and chimpanzee brain. Nat Genet 38: 1375-7.
Cai, X., Hagedorn, C. H., and Cullen, B. R. 2004. Human microRNAs are processed from capped, polyadenylated transcripts that can also function as mRNAs. RNA 10: 1957-66.
Castoldi, M., Benes, V., Hentze, M. W., and Muckenthaler, M. U. 2007. miChip: A microarray platform for expression profiling of microRNAs based on locked nucleic acid (LNA) oligonucleotide capture probes. Methods 43: 146-152.
Cummins, J. M., He, Y., Leary, R. J., Pagliarini, R., Diaz, L. A., Jr., Sjoblom, T., Barad, O., Bentwich, Z., Szafranska, A. E., Labourier, E., et al. 2006. The colorectal microRNAome. Proc Natl Acad Sci USA 103: 3687-92.
Davison, T. S., Johnson, C. D., and Andruss, B. F. 2006. Analyzing micro-RNA expression using microarrays. Methods Enzymol 411: 14-34.
Doench, J. G., Petersen, C. P., and Sharp, P. A. 2003. siRNAs can function as miRNAs. Genes Dev 17: 438-442.
Fahlgren, N., Howell, M. D., Kasschau, K. D., Chapman, E. J., Sullivan, C. M., Cumbie, J. S., Givan, S. A., Law, T. F., Grant, S. R., Dangl, J. L., et al. 2007. High-Throughput Sequencing of Arabidopsis microRNAs: Evidence for Frequent Birth and Death of MIRNA Genes. PLoS ONE 2: e219.
Griffiths-Jones, S. 2006. miRBase: the microRNA sequence database. Methods Mol Biol 342: 129-38.
Griffiths-Jones, S., Grocock, R. J., van Dongen, S., Bateman, A., and Enright, A. J. 2006. miRBase: microRNA sequences, targets and gene nomenclature. Nucleic Acids Res 34: D140-4.
Grimson, A., Farh, K. K., Johnston, W. K., Garrett-Engele, P., Lim, L. P., and Bartel, D. P. 2007. MicroRNA targeting specificity in mammals: determinants beyond seed pairing. Mol Cell 27: 91-105.
John, B., Enright, A. J., Aravin, A., Tuschl, T., Sander, C., and Marks, D. S. 2004. Human MicroRNA targets. PLoS Biol 2: e363.
Kasschau, K. D., Fahlgren, N., Chapman, E. J., Sullivan, C. M., Cumbie, J. S., Givan, S. A., and Carrington, J. C. 2007. Genome-Wide Profiling and Analysis of Arabidopsis siRNAs. PLoS Biol 5: e57.
Kawahara, Y., Zinshteyn, B., Sethupathy, P., Iizasa, H., Hatzigeorgiou, A. G., and Nishikura, K. 2007. Redirection of silencing targets by adenosine-to-inosine editing of miRNAs. Science 315: 1137-40.
Krek, A., Grun, D., Poy, M. N., Wolf, R., Rosenberg, L., Epstein, E. J., MacMenamin, P., da Piedade, I., Gunsalus, K. C., Stoffel, M., et al. 2005. Combinatorial microRNA target predictions. Nat Genet 37: 495-500.
Landgraf, P., Rusu, M., Sheridan, R., Sewer, A., Iovino, N., Aravin, A., Pfeffer, S., Rice, A., Kamphorst, A. O., Landthaler, M., et al. 2007. A Mammalian microRNA Expression Atlas Based on Small RNA Library Sequencing. Cell 129: 1401-14.
Lewis, B. P., Burge, C. B., and Bartel, D. P. 2005. Conserved seed pairing, often flanked by adenosines, indicates that thousands of human genes are microRNA targets. Cell 120: 15-20.
Liu, J., Carmell, M. A., Rivas, F. V., Marsden, C. G., Thomson, J. M., Song, J. J., Hammond, S. M., Joshua-Tor, L., and Hannon, G. J. 2004. Argonaute2 is the catalytic engine of mammalian RNAi. Science 305: 1437-1441.
Lu, C., Meyers, B. C., and Green, P. J. 2007. Construction of small RNA cDNA libraries for deep sequencing. Methods 43: 110-117.
Lund, E., Guttinger, S., Calado, A., Dahlberg, J. E., and Kutay, U. 2004. Nuclear export of microRNA precursors. Science 303: 95-8.
Margulies, M., Egholm, M., Altman, W. E., Attiya, S., Bader, J. S., Bemben, L. A., Berka, J., Braverman, M. S., Chen, Y. J., Chen, Z., et al. 2005. Genome sequencing in microfabricated high-density picolitre reactors. Nature 437: 376-80.
Meister, G. and Tuschl, T. 2004. Mechanisms of gene silencing by double-stranded RNA. Nature 431: 343-349.
Nakano, M., Nobuta, K., Vemaraju, K., Tej, S. S., Skogen, J. W., and Meyers, B. C. 2006. Plant MPSS databases: signature-based transcriptional resources for analyses of mRNA and small RNA. Nucleic Acids Res 34: D731-5.
O'Toole, A. S., Miller, S., Haines, N., Zink, M. C., and Serra, M. J. 2006. Comprehensive thermodynamic analysis of 3' double-nucleotide overhangs neighboring Watson-Crick terminal base pairs. Nucleic Acids Res 34: 3338-44.
Pfeffer, S., Sewer, A., Lagos-Quintana, M., Sheridan, R., Sander, C., Grasser, F. A., van Dyk, L. F., Ho, C. K., Shuman, S., Chien, M., et al. 2005. Identification of microRNAs of the herpesvirus family. Nat Methods 2: 269-76.
Porkka, K. P., Pfeiffer, M. J., Waltering, K. K., Vessella, R. L., Tammela, T. L., and Visakorpi, T. 2007. MicroRNA expression profiling in prostate cancer. Cancer Res 67: 6130-5.
Rajagopalan, R., Vaucheret, H., Trejo, J., and Bartel, D. P. 2006. A diverse and evolutionarily fluid set of microRNAs in Arabidopsis thaliana. Genes Dev 20: 3407-25.
Ruby, J. G., Jan, C. H., and Bartel, D. P. 2007. Intronic microRNA precursors that bypass Drosha processing. Nature 448: 83-86.
Sethupathy, P., Megraw, M., and Hatzigeorgiou, A. G. 2006. A guide through present computational approaches for the identification of mammalian microRNA targets. Nat Methods 3: 881-886.
Shendure, J., Porreca, G. J., Reppas, N. B., Lin, X., McCutcheon, J. P., Rosenbaum, A. M., Wang, M. D., Zhang, K., Mitra, R. D., and Church, G. M. 2005. Accurate multiplex polony sequencing of an evolved bacterial genome. Science 309: 1728-1732.
Tolia, N. J. and Joshua-Tor, L. 2007. Slicer and the Argonautes. Nat Chem Biol 1: 36-43.
Vester, B. and Wengel, J. 2004. LNA (locked nucleic acid): high-affinity targeting of complementary RNA and DNA. Biochemistry 43: 13233-41.
Yao, Y., Guo, G., Ni, Z., Sunkar, R., Du, J., Zhu, J. K., and Sun, Q. 2007. Cloning and characterization of microRNAs from wheat (Triticum aestivum L.). Genome Biol 8: R96.
Zhao, J. J., Hua, Y. J., Sun, D. G., Meng, X. X., Xiao, H. S., and Ma, X. 2006. Genome-wide microRNA profiling in human fetal nervous tissues by oligonucleotide microarray. Childs Nerv Syst 22: 1419-25.

2 Application of massively parallel sequencing to microRNA profiling and discovery in human embryonic stem cells

2.1 List of Authors

Ryan D. Morin, Michael D. O'Connor, Malachi Griffith, Florian Kuchenbauer, Allen Delaney, Anna-liisa Prabhu, Yongjun Zhao, Helen McDonald, Thomas Zeng, Martin Hirst, Connie J. Eaves and Marco A. Marra.

2.2 Introduction

MicroRNAs (miRNAs) are short RNA molecules, 19-25 nt in length, that derive from stable fold-back sub-structures of larger transcripts (Cai et al. 2004). In most cases, primary miRNA transcripts (pri-miRNAs) are cleaved by a complex of Drosha and its cofactor DGCR-8, producing one or more pre-miRNA hairpins, each with a 2-nt 3' overhang (Lund et al. 2004). Nuclear export of these precursors is mediated by Exportin 5 (Lund et al. 2004), after which they are released as short double-stranded RNA duplexes following a second cleavage by Dicer (Lund et al. 2004). Based on the thermodynamic stability of each end of this duplex, one of the strands is thought to be preferentially incorporated into the RISC protein complex, producing a biologically active miRNA and an inactive miRNA* (O'Toole et al. 2006). Once assembled within RISC, miRNAs can elicit down-regulation of target genes by blocking their translation, inducing Ago-2 mediated degradation or potentially inducing deadenylation (Aravin et al. 2005; Giraldez et al. 2006).
In animals, miRNA-target interactions occur through semi-complementary base pairing, usually within the 3'-UTR of the target transcript. The relaxed complementarity between miRNAs and their target sites poses many problems in computational target prediction; hence, this is an active field of study with numerous emerging approaches that rely on differing methodologies and assumptions. Common target prediction algorithms assess complementarity of miRNAs to their targets (John et al. 2004), many enforcing strong pairing within the seed region of the miRNA (positions 2-8) (Grimson et al. 2007; Lewis et al. 2005).

* A version of this chapter has been submitted to Genome Research to be considered for publication as a research article.

As they were discovered only relatively recently, the study of miRNAs is a young and rapidly changing field in which undiscovered genes remain and the sequence modifications and activity of mature miRNAs are not well understood. Before the effect of miRNAs on gene regulation can be globally studied, a robust method for profiling the expression level of each miRNA in a sample is required. The current commercially available high-throughput methodologies rely on primers or probes designed to detect each of the current reference miRNA sequences residing in miRBase, which acts as the central repository for known miRNAs (Griffiths-Jones 2006). These systems, which are often available in array form, allow concurrent profiling of many miRNAs (Zhao et al. 2006). These approaches feature good reproducibility and facilitate clustering of samples by similar miRNA expression profiles (Davison et al. 2006; Porkka et al. 2007). However, probe-based methodologies are generally restricted to the detection and profiling of only the known miRNA sequences previously identified by sequencing or homology searches.
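
As a rough illustration of the seed-pairing criterion mentioned earlier in this introduction (strong pairing at miRNA positions 2-8), the following minimal Python sketch scans a 3'-UTR for perfect seed complements. It is not the algorithm of TargetScan, PicTar, miRanda or any other published tool; the function names `seed_site` and `find_seed_matches` and the toy sequences are hypothetical, and real predictors add conservation filters and scoring that are omitted here.

```python
from typing import List

# Watson-Crick complement table for the RNA alphabet.
_COMPLEMENT = str.maketrans("ACGU", "UGCA")


def seed_site(mirna: str, first: int = 2, last: int = 8) -> str:
    """Return the target-site sequence (5'->3') that pairs perfectly
    with the miRNA seed, taken as 1-based positions first..last."""
    seed = mirna[first - 1:last]
    # Pairing is antiparallel, so the site is the reverse complement.
    return seed.translate(_COMPLEMENT)[::-1]


def find_seed_matches(mirna: str, utr: str) -> List[int]:
    """Return 0-based offsets of every perfect seed match in `utr`."""
    site = seed_site(mirna)
    hits = []
    pos = utr.find(site)
    while pos != -1:
        hits.append(pos)
        pos = utr.find(site, pos + 1)  # step by one to allow overlapping hits
    return hits


# Toy inputs for illustration only (hypothetical sequences).
toy_mirna = "UGAGGUAGUAGGUUGUAUAGUU"
toy_utr = "AAACUACCUCAGGGCUACCUCA"
print(seed_site(toy_mirna))
print(find_seed_matches(toy_mirna, toy_utr))
```

A perfect-match scan of this kind only captures the strictest case; sites with weaker seed pairing compensated by complementarity elsewhere in the duplex would need a more permissive comparison.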

Sequencing-based applications for identifying and profiling miRNAs have been hindered by laborious cloning techniques and the expense of capillary DNA sequencing (Cummins et al. 2006; Pfeffer et al. 2005). Nevertheless, direct small RNA sequencing has several advantages over hybridization-based methodologies. Discovery of novel miRNAs need not rely on querying candidate regions of the genome, but rather can be achieved by direct observation and validation of the folding potential of flanking genomic sequence (Berezikov et al. 2006a; Cummins et al. 2006). Direct sequencing also offers the potential to detect variation in mature miRNA length, as well as enzymatic modification of miRNAs such as RNA editing (Kawahara et al. 2007) and 3' nucleotide additions (Aravin et al. 2005; Landgraf et al. 2007). In contrast to capillary sequencing, recently available 'next-generation' sequencing technologies offer inexpensive increases in throughput, thereby providing a more complete view of the miRNA transcriptome. With the added depth of sequencing now possible, there is an opportunity to identify low-abundance miRNAs or those exhibiting modest expression differences between samples, including those that may not be detected by hybridization-based methods. Next-generation miRNA profiling has already been realized in a few organisms (Fahlgren et al. 2007; Kasschau et al. 2007; Rajagopalan et al. 2006; Yao et al. 2007) using the Massively Parallel Signature Sequencing (MPSS) methodology (Nakano et al. 2006) and more recently the Roche/454 platform (Margulies et al. 2005). The recently released Illumina sequencing platform provides approximately two orders of magnitude of greater depth than current competing technologies from a single run (Berezikov et al. 2006b), hence providing a more cost-effective option when compared to other sequencing strategies (http://www.solexa.com).

We sought to use this new technology to extensively profile and identify changes in miRNA expression that occur within a previously characterized model system. Pluripotent human embryonic stem cells (hESCs) can be cultured under non-adherent conditions that induce them to differentiate into cells belonging to all three germ layers and form cell aggregates termed embryoid bodies (EBs) (Itskovitz-Eldor et al. 2000; Bhattacharya et al. 2004). Samples of undifferentiated hESCs and differentiated cells from EBs were chosen for miRNA profiling, first because the pluripotency of ESCs is known to require the presence of miRNAs (Bernstein et al. 2003; Song et al. 2006; Wang et al. 2007) and second because specific changes in miRNA expression are thought to accompany differentiation (Chen et al. 2007). The hESC messenger RNA transcriptome has been extensively studied by ourselves and others (Abeyta et al. 2004; Bhattacharya et al. 2005; Bhattacharya et al. 2004; Boyer et al. 2005; Hirst et al. 2007; Sato et al. 2003) but to date, very little is known about the specific miRNAs that may play roles in their pluripotency or differentiation (Suh et al. 2004).

2.3 Results

2.3.1 Sequencing and annotation of small RNAs

Sequencing of small RNA libraries yielded 6,147,718 and 6,014,187 unfiltered 37-nucleotide (nt) sequence reads from hESCs and EBs, respectively. After removal of reads containing ambiguous base calls, 5,261,520 (hESC) and 5,192,421 (EB) unique sequences remained with counts varying between 1 and 38,390. After mapping to the human genome, a total of 766,199 (hESC) and 724,091 (EB) genome-mapped/trimmed small RNA sequences were represented by 4,351,479 and 3,886,865 reads.
These sequences were annotated as one of the known classes of small RNA genes or degradation fragments of larger non-coding RNAs (Methods). Sequences derived from 362 distinct miRNA genes were identified. Based on their median expression level, miRNAs were the most abundant class of small RNAs on average, but spanned the entire range of expression, with sequence counts from singletons up to ~120 thousand (Figure 2.1, panel a). The piRNAs, previously thought to exist only in germline cells, were also found at relatively low levels. A total of 9,012 and 4,606 sequence reads corresponded to piRNAs in hESCs and EBs, respectively. The sampling depth provided by the Illumina technology, spanning approximately 5 orders of magnitude, appears to extend the dynamic range of miRNA expression within a cell by two additional orders of magnitude (Berezikov et al. 2006a).

2.3.2 Variability in microRNA processing

In both libraries, miRNAs frequently exhibited variation from their 'reference' sequences, producing multiple mature variants that we hereafter refer to as isomiRs. In many cases the miRNA* sequence and its isomiRs were also observed in our libraries. The existence of isomiRs (Figure 2.2 shows an example) is commonly reported in miRNA cloning studies but generally dismissed with either the sequence matching the miRBase record or the most frequently observed isomiR chosen as a reference sequence (Landgraf et al. 2007; Ruby et al. 2006). It appears that much of the isomiR variability can be explained by variability in either Dicer or Drosha cleavage positions within the pri-miRNA hairpin. Notably, our data show that choosing a different isomiR sequence for measuring miRNA abundance can affect our ability to detect differential miRNA abundance.
Based on our analysis, the read count for the most abundant isomiR, rather than the miRBase reference sequence, provides the most robust approach for comparing miRNA expression between libraries (see Supplementary Methods in Appendix B). In 140 cases, this most abundant sequence did not correspond exactly to the current miRBase reference sequence. Most commonly, this was due to length variation at the 3' end of the miRNAs, though isomiRs with variation at either the 5' end or both ends were also observed. This suggests either that the relative abundance of isomiRs may vary across tissues or that the original submission of this miRNA to miRBase was incorrect. The latter appears more likely when we consider that following the most recent update to miRBase, more of the most abundant sequences in our libraries then corresponded to the updated miRBase reference sequences.

2.3.3 Enzymatic modification of microRNAs

Although some reads were detected that may represent the result of pre-miRNA editing by adenosine deaminases (observed as A to G transitions) or cytidine deaminases (producing C to U transitions), examples of these were infrequent and few were significantly above the background level of other apparent sequencing errors. The more prevalent type of modification noted amongst the miRNAs was the single-nucleotide 3' extension, which occurred in multiple isomiRs from nearly all observed miRNA genes (316 genes). These modifications produce an isomiR that matches the genome at every position except the terminal nucleotide. The nucleotide most commonly added was Adenine (1130 distinct examples), followed by Uridine (1008 examples), Cytosine (733 examples) then Guanine (508 examples). The prevalence of modifications differed amongst the miRNAs, with some miRNAs having more reads representing modified than unmodified forms.
End modifications were not limited to the most common isomiR, nor did they show significant differences between the two libraries, suggesting that they may not bear direct significance in hESCs, but rather may reflect a general cellular process of miRNA modification (see Figure 2.2 for an example). For some of the miRNAs showing the highest ratios of modified/unmodified isomiRs, the same modification was previously observed across multiple human and mouse tissues (Landgraf et al. 2007). The study by Landgraf and colleagues employed a different sequencing approach, suggesting that these variations are not inherent to the sequencing strategy used. For example, hsa-miR-326 was found with a terminal Adenine addition in 66% of the corresponding reads in both of our libraries and multiple members of the family hsa-miR-30 showed Uridine additions in more than half of their reads. Both of these specific modifications were noted in the Landgraf study, where the same modifications were also observed in the orthologous mouse miRNAs. The significance of the apparent evolutionary conservation of these modifications is addressed below.

2.3.4 MicroRNAs differentially expressed between hESCs and EBs

We tested for differential expression between the hESC and EB libraries using all sequences, regardless of their annotation; hence, each isomiR was independently tested for differential expression. The total number of differentially expressed sequences corresponding to the miRBase reference sequence (120 total) was less than the total number of miRNAs identified as differentially expressed using the most abundant isomiR for each pre-miRNA arm (190 total from 165 miRNA genes) (Table 2.1 and Supplementary Table 1, Appendix A).
All 120 of the miRNA genes identified as differentially expressed, based only on the sequences corresponding perfectly to those in miRBase, were amongst the 165 identified by the most abundant sequence in our data. Employing the count of the most abundant sequence for the calculation of differential expression also allows identification of the supposed miR* sequences that are differentially expressed between hESCs and EBs but are currently absent from miRBase (25 total). This reinforces the concept that the most abundant, rather than the reference miRNA sequence, is the most useful for identifying differentially expressed miRNAs.

2.3.5 Novel microRNA genes

We sought to identify novel miRNA genes amongst the unclassified sequences in our libraries. After annotation, 40,699 of the small RNA sequences in the hESC library and 40,598 in the EB library (ignoring singletons) remained unclassified because they derived from un-annotated regions of the human genome. As they were of generally low expression and often similar in sequence to known genomic repeats, we assumed many of these were either randomly derived fragments of various RNAs or regulatory siRNAs (Berezikov et al. 2006b; Watanabe et al. 2006). To identify candidate novel miRNAs amongst these, we employed both in-house and publicly available algorithms (Methods). Once good candidates were identified, we compared both their mature sequences and the predicted pre-miRNA sequences to those of all currently known miRNAs to aid in classifying them into families. The total set of novel miRNA candidates, included in Supplementary Table 2 (Appendix A), comprises 129 unique miRNA sequences from up to 170 distinct genes.
Compared to other miRNAs in our data, these miRNAs exhibited modest expression (mean count=58), perhaps indicating that most of them may not perform significant functions in these cell types. However, 31 exhibited significant differential expression between the two libraries and are thus potentially biologically important in the cells in which they were present (Table 2.2).

2.3.6 Targets of differentially expressed microRNAs

Strong base pairing between the seed region of a miRNA and the UTR of its target mRNA is important for its activity; hence many target prediction algorithms enforce strong seed complementarity and evolutionary conservation in the complementary region of potential targets (Grimson et al. 2007). As such, the repertoire of predicted targets of miRNAs with identical seeds often overlaps considerably. We sought to determine whether miRNAs sharing identical seeds demonstrated coexpression. By considering positions 2-8 of all non-singleton isomiRs of known miRNAs from each library, we identified 1009 distinct seeds. Many of these represented the non-canonical seeds of isomiRs arising from variation at their 5' end. Of these seeds, 114 were significantly over-represented in the hESC library while 106 were over-represented in the EB library (Fisher Exact Test, alpha=0.05/1009) (Table 2.3 and Supplementary Table 3 in Appendix A). As expected, we found that many of the seeds with the largest changes in relative levels between the two libraries corresponded to the differentially expressed miRNAs in Table 2.1. Notably, in some cases, we observed pairs of miRNAs that shared a common seed yet exhibited inverse expression changes. A clear example of this from our data is the inverse abundances of hsa-miR-302a (hESC-enriched) and hsa-miR-327 (EB-enriched), which share the seed sequence AAGUGCU.
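The seed grouping described above can be illustrated with a minimal Python sketch. It is not the software used in this work: the function names, the toy sequences and the read counts are hypothetical, chosen only so that two sequences share the seed AAGUGCU and change in opposite directions between libraries.

```python
from collections import defaultdict

# Bonferroni-style threshold quoted in the text (alpha = 0.05 / 1009 seeds).
BONFERRONI_ALPHA = 0.05 / 1009


def seed_of(sequence):
    """Return positions 2-8 (1-based) of a small RNA sequence."""
    return sequence[1:8]


def seed_totals(counts):
    """Sum read counts per seed, ignoring singleton sequences."""
    totals = defaultdict(int)
    for sequence, count in counts.items():
        if count > 1:  # singletons are excluded, as in the text
            totals[seed_of(sequence)] += count
    return dict(totals)


# Hypothetical counts (illustration only, not data from the libraries).
hesc_counts = {
    "UAAGUGCUUCCAUGUUUUGG": 900,
    "AAAGUGCUGCGACAUUUGAG": 100,
    "UGAGGUAGUAGGUUGUAUAG": 1,  # singleton, dropped
}
eb_counts = {
    "UAAGUGCUUCCAUGUUUUGG": 300,
    "AAAGUGCUGCGACAUUUGAG": 700,
}

hesc_seeds = seed_totals(hesc_counts)
eb_seeds = seed_totals(eb_counts)
```

In this toy case each sequence changes several-fold between libraries, yet the summed count for the shared seed is identical in both, which is the masking effect at issue here.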
This behavior may mask the effect of differential expression of some miRNAs in each library. Hence, focusing on the net change in seed levels, rather than distinct miRNAs, may be important in this context. Some miRNAs appear in this table more than once, suggesting that more than one isomiR from a single miRNA gene can contribute to net changes in miRNAs with a given seed during differentiation, and these isomiRs could potentially regulate different sets of transcripts.

The cooperative action of multiple miRNAs can be multiplicative and, in some cases, synergistic (Grimson et al. 2007); hence, transcripts with more predicted target sites for co-expressed miRNAs should be most drastically affected by those miRNAs. In an attempt to highlight potentially significant targets of differentially expressed miRNAs, we identified genes with predicted target sites for multiple hESC-enriched miRNAs or EB-enriched miRNAs using TargetScan. Measures were taken to compensate for UTR length, miRNAs with identical target sites, and a general preponderance of target sites in some genes (Methods). A total of 591 likely cooperative targets of EB-enriched and 461 targets of hESC-enriched miRNAs were identified by this approach. As these genes are likely under redundant post-transcriptional regulation by multiple miRNAs, they could comprise genes of central importance to the maintenance of these cells. Surprisingly, these two gene lists showed a significant overlap of 64 genes (p<0.0001, permutation test), suggesting that some of the miRNA-regulated genes in hESCs may also be regulated in EBs by a different set of miRNAs. The genes that have been highlighted herein as likely targets of the differentially expressed miRNAs would be expected to be significant to hESC biology.
This was supported by an examination of the Gene Ontology 'biological process' classifications that are significantly over-represented amongst these genes. This analysis revealed that many of the genes in both groups have been previously associated with differentiating stem cells and included those involved in differentiation, development and regulation of transcription (Skottman et al. 2005). Interestingly, some of these GO terms were enriched only in the predicted targets of hESC-enriched or EB-enriched miRNAs. For example, genes involved in programmed cell death were enriched amongst the predicted targets of hESC-enriched miRNAs while those involved in cell proliferation were enriched amongst the predicted targets of EB-enriched miRNAs (Figure 2.3).

2.4 Discussion

These data highlight the potential of a new massively parallel sequencing strategy to profile miRNA expression in hESCs and EBs and the changes that occur during differentiation. Between the two libraries, we identified 190 differentially expressed known and 31 novel miRNAs corresponding to 165 and 33 distinct miRNA genes. This is a striking advance in contrast to a recent array-based study of murine ESCs, which revealed a much smaller number of miRNAs differentially expressed between ESCs and EBs (Chen et al. 2007). The latter study identified only 23 candidate miRNAs as either ESC-specific or down-regulated during differentiation, as well as 10 miRNAs that appear to be enriched in EBs. It is encouraging that, of the 27 listed mouse miRNAs with identifiable human orthologs, 14 exhibited expression patterns consistent with the hESC results presented here, whilst many of the others exhibited insignificant changes in expression.
Further, of the additional differentially expressed miRNAs reported here (Table 2.1), many were originally discovered in human or murine ESCs (Houbaviy et al. 2003; Suh et al. 2004). Some of these previous studies generalized these miRNAs as hESC-specific in their expression. However, compared to EBs, our method did not detect miRNAs that were expressed exclusively in hESCs; this is likely a consequence of the increased sampling depth of the method used here. On the other hand, we did find 16 miRNA sequences in EBs that were not observed in hESCs (Supplementary Table 1, Appendix A). This is consistent with the postulate that the number of expressed miRNAs increases during differentiation (Strauss et al. 2006) and further supports the importance of miRNAs during ESC differentiation (Wang et al. 2007). However, because EBs likely represent a more diverse population of cells, the total numbers of expressed miRNAs in any given cell type may not necessarily increase. Thus, the miRNAs highlighted by this method include those previously implicated in pluripotency and differentiation, but the added sampling depth demonstrates that these miRNAs are not uniquely expressed by undifferentiated hESCs. This result may suggest that the processes involved in differentiation that are under miRNA-mediated regulation are dictated by the repertoire and relative miRNA levels, rather than the number of distinct miRNAs being expressed.

With few exceptions (Berezikov et al. 2006b), recent large-scale cloning efforts have provided minimal yields of new miRNA genes. This may be due to the dominance of the highly expressed miRNAs in small RNA libraries as well as difficulties associated with library normalization (Landgraf et al. 2007; Shivdasani 2006).
Here, we present 170 candidates for novel miRNA genes that have passed multiple levels of annotation criteria. This gene list was obtained by combining two separate miRNA classification tools relying on different assumptions and input data. MiPred relies on a Random Forest algorithm to classify miRNAs based on structural properties (Jiang et al. 2007). Our custom classifier, which uses a support vector machine (SVM), assigns a classification score based on a separate set of structural descriptors including data specific to this sequencing method. As a result, this classifier has provided the best classification accuracy of any machine learning-based miRNA classification method (see Methods).

At least 31 of the novel miRNAs identified in this study show evidence of regulated expression during hESC differentiation (Table 2.2), suggesting that they may also have roles in maintaining the pluripotent status of hESCs or their ability to self-renew. In support of this, five of these miRNA sequences were found to share a common seed sequence with at least one of the differentially expressed known miRNAs (Table 2.1). Many of the remaining novel miRNAs with overall low abundance and insignificant expression differences may later prove unimportant in the context of hESC biology. It is encouraging, however, that many of the known miRNAs resided in the same range of expression as these novel miRNAs (Figure 2.1, panel a), suggesting that many miRNAs may exhibit low-level expression in hESCs detectable only by deep sequencing. Like many of the known miRNAs present in these libraries, these low-abundance novel miRNAs may be observed in higher quantities amongst other tissues where they are functionally important.
In addition to novel miRNA genes, this study has aided in identifying a diverse population of variants of known miRNAs, collectively termed isomiRs. The functional implications of the widespread 3' modifications and RNA editing are unclear in the context of hESCs as neither process appeared to vary significantly between the two libraries. It is important to note that the diverse 3' nucleotide additions observed here reiterate previous observations of this nature (Landgraf et al. 2007). By comparing some of the most prevalent 3' additions in our libraries to those from diverse human and mouse tissues (Landgraf et al. 2007), it is evident that the nature of the added nucleotide is nonrandom and evolutionarily conserved. Some of the remaining non-canonical isomiRs, which appear to derive from variability in Dicer and Drosha cleavage positions, were quite abundant. By including all isomiRs and grouping those that share an identical seed sequence, we revealed 222 seeds that were represented in significantly different quantities in these two libraries (Table 2.3 and Supplementary Table 3, Appendix A). As the non-canonical isomiRs share most of their sequence with the most highly expressed isomiRs, it is possible that these variants share a common set of targets with the canonical miRNAs. Those isomiRs resulting from variation at the 5' end may be of particular interest as they bear a different seed sequence than the reference miRNA, indicating their potential to target different transcripts. However, whether any of these non-canonical variants associate within RISC remains to be experimentally determined. If they are found to associate within RISC, the presence of isomiRs may have implications in future annotation of miRNAs and the development of new target prediction algorithms.
A recent update to miRBase resulted in better agreement between its entries and the most abundant sequences in our libraries, but some of the most prevalent isomiRs in our libraries still do not perfectly correspond to the miRBase entry. This suggests that the most abundant sequences have yet to be reliably determined for all miRNAs. If this is the case, further large-scale efforts such as this should result in the truly most common isomiR replacing the erroneous sequences residing in miRBase.

There was a large overlap of predicted target genes for the most prevalent miRNAs enriched in either hESCs or EBs. Since a large proportion of these genes were predicted by TargetScan to be targets of numerous miRNAs, genes were weighted based on their total number of predicted target sites. By taking only the genes with weights above a threshold (a rank score of 0.15; see Methods), we derived two relatively small groups of genes that are strong candidates for cooperative targeting by hESC-enriched or EB-enriched miRNAs. This enabled resolution of key gene classes in each group (Figure 2.3), but it does not assert that these are true in vivo targets, nor does it provide a comprehensive list of the targets of these miRNAs. Further directed validation and gene expression profiling will be necessary to elucidate the true in vivo targets. These lists do, however, provide the foundation for an alternate view of gene regulation in hESCs. Previously, focus has been placed on changes to mRNA levels during differentiation. By shifting focus to the genes targeted by miRNAs in this context, it is plausible that genes previously unlinked to differentiation and pluripotency may be discovered.

Adaptation of a common adapter ligation method to a novel sequencing platform has allowed us to generate a view of the small RNA component of differentiating hESCs at an unprecedented depth.
Following a combination of novel miRNA annotation and discovery techniques, we have revealed the largest list to date of miRNA sequences. This list includes a diverse population of miRNA variants, termed isomiRs, with variation at both the 5' and 3' ends. We also present numerous novel human miRNAs, some of which may be important in the control of hESC pluripotency or differentiation. Notably, many of the miRNAs exhibiting the most significant differences between hESCs and EBs appear to regulate genes involved in transcriptional regulation, differentiation and development.

2.5 Methods

2.5.1 Small RNA Library Preparation

Embryonic stem cell samples were derived from lines maintained at the Terry Fox Laboratory. The stem cell line known as H9 was chosen because it is well characterized and was used in the development of differentiation protocols at this facility. Undifferentiated H9 hESCs were cultured on Matrigel (BD Biosciences, San Jose, CA) coated dishes in maintenance medium consisting of DMEM/F12 containing 20% Knockout Serum Replacer (Invitrogen, Carlsbad, CA), 0.1 mM β-mercaptoethanol, 0.1 mM non-essential amino acids, 1 mM glutamine and 4 ng/ml FGF2 (R&D Systems) and conditioned by mitotically inactivated mouse embryonic fibroblasts (Xue et al. 2005). For EB differentiation, hESCs were harvested via 0.05% trypsin (Invitrogen) supplemented with 0.5 mM CaCl2 and the resultant cell aggregates cultured in non-adherent dishes (BD Biosciences) for up to 30 days in maintenance medium lacking FGF2 (medium changes performed as necessary) (Itskovitz-Eldor et al. 2000; Dvash et al. 2004). At appropriate time points RNA was extracted into Trizol and aliquots of total RNA from Day 0 (hESC) and Day 15 (EB) were subjected to miRNA library construction as follows.
For each library, 10 µg of DNase I (DNA-free™ kit; Ambion, Austin, TX) treated total RNA was size fractionated on a 15% Tris-Borate-EDTA (TBE) urea polyacrylamide gel and a 15-30 base pair fraction excised. RNA was eluted from the polyacrylamide gel slice in 600 µl of 0.3 M NaCl overnight at 4°C. The resulting gel slurry was passed through a Spin-X filter column (Corning) and precipitated in two 300 µl aliquots by the addition of 800 µl of EtOH and 3 µl of Mussel Glycogen (5 mg/ml; Invitrogen). After washing with 75% EtOH, the pellets were allowed to air dry at 25°C and pooled in DEPC water. The 5' RNA adapter (5'-GUUCAGAGUUCUACAGUCCGACGAUC-3') was ligated to the RNA pool with T4 RNA ligase (Ambion) in the presence of RNase Out (Invitrogen) overnight at 25°C. The ligation reaction was stopped by the addition of 2X formamide loading dye. The ligated RNA was size fractionated on a 15% TBE urea polyacrylamide gel and a 40-60 base pair fraction excised. RNA was eluted from the polyacrylamide gel slice in 600 µl of 0.3 M NaCl overnight at 4°C. The RNA was eluted from the gel and precipitated as described above followed by re-suspension in DEPC-treated water.

The 3' RNA adapter (5'-pUCGUAUGCCGUCUUCUGCUUGidT-3'; p = phosphate; idT = inverted deoxythymidine) was subsequently ligated to the precipitated RNA with T4 RNA ligase (Ambion) in the presence of RNase Out (Invitrogen) overnight at 25°C. The ligation reaction was stopped by the addition of 10 µl of 2X formamide loading dye. Ligated RNA was size fractionated on a 10% TBE urea polyacrylamide gel and the 60-100 base pair fraction excised. The RNA was eluted from the polyacrylamide gel and precipitated from the gel as described above and re-suspended in 5.0 µl of DEPC water.
The RNA was converted to single stranded cDNA using Superscript II reverse transcriptase (Invitrogen) and Illumina's small RNA RT-Primer (5'-CAAGCAGAAGACGGCATACGA-3') following the manufacturer's instructions. The resulting cDNA was PCR-amplified with Hotstart Phusion DNA Polymerase (NEB) in 15 cycles using Illumina's small RNA primer set (5'-CAAGCAGAAGACGGCATACGA-3'; 5'-AATGATACGGCGACCACCGA-3'). PCR products were purified on a 12% TBE urea polyacrylamide gel and eluted into elution buffer (5:1, LoTE:7.5 M Ammonium Acetate) overnight at 4°C. The resulting gel slurry was passed through a Spin-X filter (Corning) and precipitated by the addition of 1100 µl of EtOH, 133 µl of 7.5 M Ammonium Acetate and 3 µl of Mussel Glycogen (20 mg/ml; Invitrogen). After washing with 75% EtOH, the pellet was allowed to air dry at 25°C and dissolved in EB buffer (Qiagen) by incubation at 4°C for 10 min. The purified PCR products were quantified on the Agilent DNA 1000 chip (Agilent) and diluted to 10 nM for sequencing on the Illumina 1G analyzer.

2.5.2 Small RNA Genome Mapping and Quantification

No reads aligned to the genome after position 28, so we trimmed all reads at 30-nt to reduce the number of unique sequences while still retaining at least 2-nt of linker sequence for the longest alignments. We counted the occurrences of each unique sequence read and used only the unique sequences for further analysis. We aligned each sequence to the human reference genome (NCBI build 36.1) using megablast (version 2.2.11). We filtered these alignments and retained only those that included the first nucleotide of the read and were devoid of insertions, deletions and mismatches.
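The trimming, collapsing and alignment-filtering steps above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the pipeline that was actually run: the `Alignment` record and both function names are hypothetical, and ambiguous base calls are assumed to be encoded as "N".

```python
from collections import Counter
from dataclasses import dataclass

TRIM_LENGTH = 30  # reads were trimmed at 30 nt


@dataclass
class Alignment:
    """Hypothetical summary of one read-to-genome alignment."""
    query_start: int  # 1-based read position where the alignment begins
    mismatches: int
    gaps: int


def collapse_reads(reads):
    """Drop reads with ambiguous calls (assumed 'N'), trim, count uniques."""
    usable = (read[:TRIM_LENGTH] for read in reads if "N" not in read)
    return Counter(usable)


def keep_alignment(alignment):
    """Keep alignments that start at read position 1 with no indels or mismatches."""
    return (
        alignment.query_start == 1
        and alignment.mismatches == 0
        and alignment.gaps == 0
    )


# Toy 37-nt reads: the first two differ only beyond position 30.
toy_reads = [
    "A" * 28 + "TCGTATGCC",
    "A" * 28 + "TCGTATGCG",
    "N" + "A" * 36,
    "C" * 37,
]
unique_counts = collapse_reads(toy_reads)
```

Trimming before counting is what merges the first two toy reads into one unique sequence with a count of 2.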
For every read, the longest alignment was determined, and this subsequence, as well as the positions for every alignment of this length, was stored in a database (to a maximum of 100 alignments). The sequence following the aligned region was checked for presence of the 3' linker sequence. The counts for all reads containing the same aligned sub-sequence were summed to provide a metric of the total frequency of that small RNA molecule in the original RNA sample. The presence and length of intervening sequences between the alignment and the linker were recorded to enable identification and separate quantification of small RNAs with 3' additions.

2.5.3 Differential Expression Detection

All unique small RNA sequences were compared between the two libraries (hESC and EB) for differential expression using the Bayesian method developed by Audic & Claverie (Audic and Claverie 1997). This approach was developed for analysis of digital gene expression profiles and accounts for the sampling variability of sequences with low counts by modeling the Poisson distribution. Sequences were deemed significantly differentially expressed if the P-value given by this method was < 0.001 and there was at least a 1.5-fold change in sequence counts between the two libraries. Unless stated otherwise, the most frequently observed isomiR was used as the diagnostic sequence for comparison of miRNA expression between libraries. Multiple testing corrections were unnecessary as comparison between two libraries is considered a single test by this method (Stekel et al. 2000).

2.5.4 Small RNA annotation

Each small RNA sequence was annotated if its full sequence contained one or more nucleotides of recognizable 3' linker sequence, had at most 25 perfect full-length alignments to the human genome, and was detected at least twice.
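As an aside on the test used in Section 2.5.3, the following Python sketch shows one way the Audic and Claverie probability and the two significance criteria could be computed. It is not the implementation used here. The function names are hypothetical, and the one-tailed cumulative P-value and the normalization of fold change by library size are assumptions, since neither detail is specified in the text. The probability itself follows the published form p(y|x) = (N2/N1)^y (x+y)! / (x! y! (1 + N2/N1)^(x+y+1)).

```python
import math


def ac_probability(x, y, n1, n2):
    """Audic-Claverie p(y | x) for counts x, y in libraries of size n1, n2."""
    ratio = n2 / n1
    log_p = (
        y * math.log(ratio)
        + math.lgamma(x + y + 1)
        - math.lgamma(x + 1)
        - math.lgamma(y + 1)
        - (x + y + 1) * math.log1p(ratio)
    )
    return math.exp(log_p)


def ac_p_value(x, y, n1, n2):
    """Smaller cumulative tail of p(. | x) at y (an assumed convention)."""
    lower = sum(ac_probability(x, k, n1, n2) for k in range(y + 1))
    upper = max(0.0, 1.0 - lower + ac_probability(x, y, n1, n2))
    return min(lower, upper, 1.0)


def is_differential(x, y, n1, n2, alpha=0.001, min_fold=1.5):
    """Apply the P < 0.001 and >= 1.5-fold criteria from Section 2.5.3."""
    low, high = sorted((x / n1, y / n2))  # assumed: size-normalized rates
    fold_ok = high > 0 and (low == 0 or high / low >= min_fold)
    return fold_ok and ac_p_value(x, y, n1, n2) < alpha


# Library sizes taken from the mapped read totals reported in Section 2.3.1.
HESC_READS, EB_READS = 4_351_479, 3_886_865
example = is_differential(100, 300, HESC_READS, EB_READS)
```

Working in log space with `math.lgamma` avoids overflow in the factorial terms at the high read counts seen for abundant miRNAs.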
Genomic positions of each sequence were compared to genome annotations obtained from the UCSC Genome Browser download page (http://hgdownload.cse.ucsc.edu/downloads.html). The data used for annotation included the EnsEMBL genes, RepeatMasker and sno/miRNA tables. The positions of all miRNA genes were also downloaded from miRBase and used for this positional annotation (http://microrna.sanger.ac.uk/sequences/ftp.shtml). Sequences overlapping annotations from these tables were classified into one of the following classes, listed in the priority used: miRNA, tRNA, rRNA, scaRNA, CD-Box, scRNA, snoRNA, snRNA, srpRNA, genomic repeat or known transcript (exonic or intronic). Transfer RNAs (tRNAs) and ribosomal RNAs (rRNAs), which are involved in the cellular translation machinery, were the most abundant classes of sequences after miRNAs, yet these are generally considered relatively uninteresting in the context of small RNAs. The other aforementioned classes of small RNAs are involved in either subcellular targeting (srpRNA), splicing or other chemical modifications required for the maturation of ncRNAs and mRNAs (CD-Box/snRNAs/snoRNAs/scaRNA). All mapped sequences were also directly searched against the currently known human Piwi-interacting RNAs (piRNAs) (Aravin et al. 2007) found currently in piRNAbank (Lakshmi and Agrawal 2007). PiRNAs were previously thought to exist only in mammalian germline cells and function in the silencing of genomic retrotransposons. As piRNAs are considerably longer than miRNAs, we used the first 18-nt of the piRNAs to search for perfect matches amongst our sequences.
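The priority-ordered classification listed above can be expressed compactly. The Python sketch below is illustrative only: the `classify` function and its inputs are hypothetical, and the set of overlapping annotation classes is assumed to have been computed beforehand from the annotation tables. The 25-alignment limit is the one stated in this section.

```python
# Annotation classes in the priority order given in the text.
PRIORITY = [
    "miRNA", "tRNA", "rRNA", "scaRNA", "CD-Box", "scRNA",
    "snoRNA", "snRNA", "srpRNA", "genomic repeat", "known transcript",
]
MAX_ALIGNMENTS = 25  # more perfect alignments than this yields 'unknown'


def classify(overlapping_classes, n_alignments):
    """Return the highest-priority class that a sequence overlaps."""
    if n_alignments > MAX_ALIGNMENTS:
        return "unknown"
    for name in PRIORITY:
        if name in overlapping_classes:
            return name
    return "unknown"


# A sequence overlapping both an intron and a snoRNA is called a snoRNA.
label = classify({"known transcript", "snoRNA"}, n_alignments=1)
```

Because the list is scanned in order, a sequence that overlaps several annotation types always receives the earliest class in the list.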
Sequences that did not overlap any of the aforementioned annotations (in any one of their genomic positions), or had more than 25 perfect alignments to the genome, were automatically classified as 'unknown'. The unknown sequences with 25 or fewer perfect alignments to the genome were used, along with those corresponding to introns and genomic repeats, for novel miRNA prediction. Sequences identified as miRNAs were named based on the specific miRNA gene they overlapped as well as the arm (either 5' or 3') with respect to the pre-miRNA. Where miRNA naming was ambiguous due to identical sequences shared by related miRNA genes, sequences were named arbitrarily by one of their possible parent miRNA genes. For those miRNA genes producing identical mature sequences (e.g., hsa-miR-9-1), the trailing number was dropped from the name. MiRBase (release 10.0) reference sequences were used for the comparison of the common isomiRs to the reference miRNA sequences.

2.5.5 Novel microRNA detection

Candidate miRNA gene loci were identified by finding distinct small RNA sequences lacking annotations that shared partially overlapping genomic positions on the same strand (termed 'hotspots'). These hotspots are suggestive of the presence of two or more isomiRs and were observed for nearly all the known miRNAs in our libraries. For each hotspot, 300 nt of genomic sequence flanking the seed (here, the region from which the original small RNA sequences derived) was extracted, reverse-complemented where appropriate, and folded using RNALfold (Hofacker 2003), which identifies locally stable sub-structures within a query RNA sequence. Following the known structural constraints of miRNAs (Ambros et al. 2003), the largest un-branched fold back sub-structure was identified from each structure and redundant structures were then removed. Structures were also removed if the seed region spanned the loop.
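The hotspot definition above (two or more distinct sequences with overlapping positions on the same strand) can be sketched as a simple interval-clustering pass. This Python example is an illustration, not the in-house tool: the `Hit` record, the inclusive coordinate convention and the function name are all hypothetical.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    """Hypothetical genomic placement of one unannotated small RNA."""
    chrom: str
    strand: str
    start: int
    end: int  # inclusive
    sequence: str


def find_hotspots(hits):
    """Group overlapping same-strand hits; keep groups with >= 2 distinct sequences."""
    hotspots = []
    cluster, cluster_end = [], -1
    ordered = sorted(hits, key=lambda h: (h.chrom, h.strand, h.start, h.end))
    for hit in ordered:
        same_locus = (
            bool(cluster)
            and hit.chrom == cluster[0].chrom
            and hit.strand == cluster[0].strand
            and hit.start <= cluster_end
        )
        if same_locus:
            cluster.append(hit)
            cluster_end = max(cluster_end, hit.end)
        else:
            if len({h.sequence for h in cluster}) >= 2:
                hotspots.append(cluster)
            cluster, cluster_end = [hit], hit.end
    if len({h.sequence for h in cluster}) >= 2:
        hotspots.append(cluster)
    return hotspots


toy_hits = [
    Hit("chr1", "+", 100, 121, "seqA"),
    Hit("chr1", "+", 102, 123, "seqB"),  # overlaps seqA on the same strand
    Hit("chr1", "-", 101, 122, "seqC"),  # opposite strand, not grouped
    Hit("chr2", "+", 500, 520, "seqD"),  # isolated
]
toy_hotspots = find_hotspots(toy_hits)
```

Sorting by chromosome, strand and start position lets overlapping hits be collected in a single pass.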
The remaining structures and their sequences comprised a set of candidate miRNA genes with expressed sequence and more than one candidate isomiR. The putative pre-miRNA sequences were then enriched for likely real pre-miRNA hairpins using two machine-learning approaches. The first, a previously published method termed MiPred, relies on a Random Forest (RF) algorithm (Jiang et al. 2007) and uses a combination of structural and thermodynamic parameters. The second approach was devised specifically for this study; it employs a support vector machine (SVM) classifier and mostly uses parameters not used by the RF method. The SVM functionality was provided by the e1071 package for R. The parameters used by this classifier include the relative proportion of each nucleotide and the number of each type of base pair in the optimal pre-miRNA folded structure. As well, some parameters describing the folded structure are common to other SVM approaches, including MFEI, AMFE (Zhang et al. 2006), Normalized Pairing Propensity (Ng Kwang Loong et al. 2007) and loop length. The parameters that are unique to this type of sequencing data are descriptors of the seed itself and its positioning within the pre-miRNA structure. Specifically, these include seed length, positioning with respect to the loop, as well as the ratio of paired to unpaired nucleotides within the seed. This classifier was trained using the folded flanking sequence of the miRNAs present in the two libraries for positive examples and the folded flanking sequences of small RNA sequences classified as either tRNA, rRNA, snRNA or snoRNA for negative examples. After training, the classifier showed a sensitivity of 0.973 and specificity of 0.988, which represents the upper level of discrimination of pre-miRNAs reported for any machine learning method to date.
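To make a few of the classifier parameters concrete, the following sketch computes nucleotide proportions, base-pair counts and seed descriptors from a sequence and its dot-bracket structure, along with sensitivity and specificity from confusion-matrix counts. It is illustrative only: the function names and feature names are hypothetical, the classifier itself was built with the e1071 package in R and is not reproduced here, and the sketch reports the paired fraction of the seed rather than the paired-to-unpaired ratio (an assumption made to avoid division by zero for fully paired seeds).

```python
from collections import Counter


def hairpin_features(seq, structure, seed_start, seed_end):
    """Compute a few illustrative features of a folded candidate hairpin.

    seq: RNA sequence (A, C, G, U); structure: dot-bracket string of equal length.
    seed_start, seed_end: 0-based, half-open coordinates of the seed.
    """
    if len(seq) != len(structure):
        raise ValueError("sequence and structure must have equal length")
    n = len(seq)
    features = {f"frac_{b}": seq.count(b) / n for b in "ACGU"}

    # Recover base pairs from the dot-bracket string with a stack.
    stack, pairs = [], Counter()
    for i, ch in enumerate(structure):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            j = stack.pop()
            pairs["".join(sorted(seq[j] + seq[i]))] += 1
    for kind in ("AU", "CG", "GU"):
        features[f"pairs_{kind}"] = pairs[kind]

    # Seed descriptors: length and fraction of paired positions.
    seed = structure[seed_start:seed_end]
    features["seed_length"] = len(seed)
    features["seed_paired_fraction"] = sum(ch in "()" for ch in seed) / len(seed)
    return features


def sensitivity_specificity(tp, fn, tn, fp):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    return tp / (tp + fn), tn / (tn + fp)
```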
The intersection of the positive predictions of the SVM and Random Forest methods (98 total) was used as an initial set of reliable novel miRNA predictions. This was supplemented with miRNAs showing either significant differential expression between the two libraries, significant sequence similarity to known miRNAs, or overlap with an EvoFold prediction for an evolutionarily conserved hairpin (Pedersen et al. 2006).

2.5.6 Cooperative miRNA target prediction

TargetScan (release 4.0) predicted targets for known miRNAs were downloaded from the download page (http://www.targetscan.org/). Only miRNAs with counts of at least 100 in either hESC or EB were included in target analyses. Genes with target sites for at least two co-expressed miRNAs (either hESC-enriched or EB-enriched) were identified as potential cooperative targets. To compensate for potentially biasing some genes with many predicted target sites (due to various issues), these genes were given a lower rank than those with few predicted target sites. The rank score of a gene was calculated by dividing the number of predicted target sites for co-expressed miRNAs by the total number of target sites for that gene. In this context, the co-expressed miRNAs were those that were over-represented in the particular library and had a count of at least 100. We used a cutoff of 0.15 (rank) to produce the two sets of high-ranked candidate cooperative targets of hESC-enriched and EB-enriched miRNAs (Figure 2.3).

2.5.7 Gene Ontology analysis

GoStat (Beissbarth et al. 2004) was employed for identification of significantly enriched 'biological process' Gene Ontology (GO) (Ashburner et al. 2000) terms in the two lists of likely targets of hESC-enriched and EB-enriched miRNAs (P<0.01). Custom software was used to extract Genes and GO terms from the GoStat output.
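For reference, the rank score defined in section 2.5.6 (target sites for co-expressed miRNAs divided by all target sites for the gene) can be sketched as below. The input layout, the function name cooperative_targets and the treatment of the 0.15 cutoff as inclusive are assumptions made for this sketch; the set of co-expressed miRNAs is taken as given.

```python
def cooperative_targets(sites_by_gene, coexpressed, cutoff=0.15, min_mirnas=2):
    """Rank genes by the fraction of their target sites hit by co-expressed miRNAs.

    sites_by_gene: dict mapping gene -> list of miRNA names, one per predicted site.
    coexpressed: set of miRNA names enriched in one library (count >= 100).
    Returns a dict of gene -> rank score, sorted from highest to lowest score.
    """
    ranked = {}
    for gene, sites in sites_by_gene.items():
        co_sites = [m for m in sites if m in coexpressed]
        # Require sites for at least two distinct co-expressed miRNAs.
        if len(set(co_sites)) < min_mirnas:
            continue
        score = len(co_sites) / len(sites)
        if score >= cutoff:  # inclusive cutoff is an assumption
            ranked[gene] = score
    return dict(sorted(ranked.items(), key=lambda kv: kv[1], reverse=True))
```

A gene with two co-expressed sites among 22 predicted sites scores about 0.09 and is excluded, whereas a gene with two co-expressed sites among four scores 0.5 and is retained.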
Genes and GO terms were then clustered using hierarchical clustering in R. Heat maps were created in R using the heatmap function (default parameters).

Table 2.1 Top 20 miRNAs differentially expressed between the hESC and EB libraries

miRNA gene (hsa)  Hairpin arm  Most abundant sequence    hESC count  EB count  P-value    Fold change
mir-199a          3-p          ACAGTAGTCTGCACATTGGTTA    1110        13163     0          11.86
mir-372           3-p          AAAGTGCTGCGACATTTGAGCGT   1388        13653     0          9.84
mir-122a          5-p          TGGAGTGTGACAATGGTGTTTG    436         2565      0          5.88
mir-152           5-p          TCAGTGCATGACAGAACTTGG     622         3028      0          4.87
mir-10a           5-p          TACCCTGTAGATCCGAATTTGT    948         3887      0          4.10
let-7a            5-p          TGAGGTAGTAGGTTGTATAGTT    11902       2951      0          4.03
mir-302a          5-p          TAAACGTGGATGTACTTGCTTT    36800       9917      0          3.71
mir-222           3-p          AGCTACATCTGGCTACTGGGTCTC  4719        1331      0          3.55
mir-340           5-p          TTATAAAGCAATGAGACTGATT    2247        7198      0          3.20
mir-363           3-p          AATTGCACGGTATCCATCTGTA    5775        17912     0          3.10
mir-21            5-p          TAGCTTATCAGACTGATGTTGAC   39818       21003     0          1.90
mir-26a           5-p          TTCAAGTAATCCAGGATAGGCT    4892        8530      0          1.74
mir-26b           5-p          TTCAAGTAATTCAGGATAGGTT    1003        2957      1.39E-278  2.95
mir-130a          3-p          CAGTGCAATGTTAAAAGGGCAT    2334        4798      2.20E-265  2.06
mir-744           5-p          TGCGGGGCTAGGGCTAACAGCA    4166        1516      9.13E-259  2.75
mir-302b          3-p          TAAGTGCTTCCATGTTTTAGTAG   15169       8855      1.39E-213  1.71
mir-30d           5-p          TGTAAACATCCCCGACTGGAAGCT  2798        4988      3.29E-205  1.78
mir-146b          5-p          TGAGAACTGAATTCCATAGGCTGT  703         2075      2.27E-196  2.95
mir-25            3-p          CATTGCACTTGTCTCGGTCTGA    24268       15875     1.52E-189  1.53
mir-373           3-p          GAAGTGCTTCGATTTTGGGGTGT   60          788       5.75E-184  13.13
mir-423           5-p          TGAGGGGCAGAGAGCGAGACTTT   9844        5538      4.50E-162  1.78
mir-320           3-p          AAAAGCTGGGTTGAGAGGGCGA    5967        2978      1.33E-148  2.00
mir-1             3-p          TGGAATGTAAAGAAGTATGTAT    7421        4051      2.39E-137  1.83
mir-371           5-p          ACTCAAACTGTGGGGGCACTT     1649        3020      6.89E-133  1.83
mir-302d          3-p          TAAGTGCTTCCATGTTTGAGTGT   8599        5047      3.95E-119  1.70

Table 2.2 Putative novel miRNAs exhibiting significant differential expression

Chromosome  Length (nt)  Most common sequence        hESC count  EB count  Fold change  Putative miRNA family  Seed sequence
chr2        19           AATGGATTTTTGGAGCAGG         374         245       1.53         -                      AUGGAUU
chr11       17           TAAGTGCTTCCATGCTT           84          131       1.56         mir-503                AAGUGCU
chrX        22           TTCATTCGGCTGTCCAGATGTA      1269        774       1.64                                UCAUUCG
chr16       19           GCCTGTCTGAGCGTCGCTT         194         114       1.70         -                      CCUGUCU
chr1        21           TATTCATTTATCCCCAGCCTACA     126         67        1.88         mir-664                AUUCAUU
chr19       17           TCCCTGTTCGGGCGCCA           670         1302      1.94         mir-761                CCCUGUU
chr1        19           TGGATTTTTGGATCAGGGA         40          85        2.13         -                      GGAUUUU
chr1        17           AACAGCTAAGGACTGCA           72          31        2.32         -                      ACAGCUA
chr6        19           GTGTAAGCAGGGTCGTTTT         639         273       2.34         mir-565                UGUAAGC
chrX        22           TACGTAGATATATATGTATTTT      20          54        2.70         mir-620                ACGUAGA
chr5        18           GTCCCTGTTCAGGCGCCA          24          66        2.75         -                      UCCCUGU
chr15       26           GATGATGATGGCAGCAAATTCTGAAA  81          27        3.00         mir-376                AUGAUGA
chr22       19           ACGTTGGCTCTGGTGGTG          35          11        3.18         mir-650                CGUUGGC
chr8        18           ATCCCCAGCACCTCCACC          59          18        3.28         mir-650                UCCCCAG
chr10       22           TGCTGGATCAGTGGTTCGAGTC      16          57        3.56         mir-502                GCUGGAU
chr15       23           CCTCAGGGCTGTAGAACAGGGCT     163         45        3.62         -                      CUCAGGG
chr4        22           CTGGACTGAGCCGTGCTACTGG      44          161       3.66         mir-661                UGGACUG
chr10       22           ACTCGGCGTGGCGTCGGTCGTG      228         50        4.56         -                      CUCGGCG
chr2        24           TTGCAGCTGCCTGGGAGTGACTTC    63          13        4.85         -                      UGCAGCU
chr4        17           TCCGAGTCACGGCACCA           309         55        5.62         mir-345                CCGAGTC
chr1        17           GTGGGGGAGAGGCTGTC           103         17        6.06                                UGGGGGG
chr1        22           ATGGGTGAATTTGTAGAAGGAT      7           45        6.43         mir1072                UGGGUGA
chr7        19           AAACTGTAATTACTTTTGG         21          3         7.00         mir-548a               AACUGUA
chrX        18           GCATGGGTGGTTCAGTGG          407         52        7.83         mir-146a               CAUGGGU
chr10       24           AGCCTGGAAGCTGGAGCCTGCAGT    25          3         8.33         mir-566                GCCUGGA
chrX        22           AGGAGGAATTGGTGCTGGTCTT      25          2         12.50        mir-766*               GGAGGAA
chr22       17           AAAAAGACACCCCCCAC           189         21        9.00         mir396b                AAAAGAC
chr14       22           ACCCGTCCCGTTCGTCCCCGGA      4           53        13.25        mir-638                CCCGUCC
chr11       19           ATGGATAAGGCTTTGGCTT         31          2         15.50        -                      UGGAUAA
chr15       18           CGGGCGTGGTGGTGGGGG          35          2         17.50        mir-566                GGGCGUG
chr11       24           GTGGGAAGGAATTACAAGACAGTT    72          2         36.00        -                      UGGGAAG

Table 2.3 Top 25 statistically over-represented seeds of miRNA sequences found in hESCs or EBs (Fisher's Exact Test with Bonferroni correction)

Seed     hESC count  EB count  Fold change  Corrected P-value  miRNAs with seed
AGUAGUC  86          1478      17x          0                  mir-199a
CAGUAGU  1801        22288     12.4x        0                  mir-199a
ACAGUAG  241         2749      11.4x        0                  mir-199a
UUAAACG  3452        398       8.7x         0                  mir-302a-5p
CUUAAAC  16762       2332      7.2x         0                  mir-302a-5p
GGAGUGU  846         5216      6.2x         0                  mir-122
AGUGCUG  2484        12641     5.1x         0                  mir-372,mir-512
ACCCUGU  1503        5689      3.8x         0                  mir-10
AAACGUG  55712       16461     3.4x         0                  mir-302a-5p,mir-424
UAUAAAG  3002        9702      3.2x         0                  mir-340
GAGAACU  1547        4976      3.2x         0                  mir-146
UUGCACG  1845        5561      3.0x         0                  mir-363
GAGGUAG  29280       10258     2.9x         0                  let-7
CAGUGCA  3596        8559      2.4x         0                  mir-148,mir-152,mir-130
AAAGCUG  16448       7613      2.2x         0                  mir-320
AGUGCAA  4463        9594      2.1x         0                  mir-454,mir-301,mir-130
UCAAGUA  6898        14089     2.0x         0                  mir-26
GCUACAU  47591       24280     2.0x         0                  mir-221,mir-222
GUAAACA  8789        15479     1.8x         0                  mir-30
GAGGGGC  17586       10518     1.7x         0                  mir-423/mir-885
AGCUUAU  65376       39424     1.7x         0                  mir-21,mir-590
CUGGACU  17819       25378     1.4x         0                  mir-378
GCAGCAU  153193      187218    1.2x         0                  mir-103,mir-107,mir-885
GCGGGGC  5220        1956      2.7x         8.76E-305          mir-744
CUCAAAC  4306        8173      1.9x         1.62E-296          mir-92a-5p,mir-371

miRNAs indicated in bold are cases in which their non-canonical isomiR contains this seed

Figure 2.1 Distribution of sequence counts for the different classes of small RNAs.

[Figure 2.1a: box plots of sequence counts. Figure 2.1b: pie chart of small RNA class fractions; legible label: microRNA 52%.]

a) The box plot (left) shows the relative expression levels of sequences in each of the 8 major classes of small RNAs (log10 transformation) in the hESC small RNA library. MiRNAs were the most highly expressed class (mean sequence count=253, median sequence count=9). The most abundant miRNA in this library was mir-103, which had 91,398 instances of the most common isomiR in this library and nearly 120,000 in the matched library (EB). The highest log-transformed count between the two libraries (right) for all miRNAs identified as differentially expressed (dark grey) is roughly normal (mean=2.32, median=2.17, representing a count of 1743 and 151 respectively).
The miRNAs that were detected in at least one of the libraries but were not significantly differentially expressed are shown in light grey for comparison. There is a slight enrichment of miRNAs with lower absolute expression in this group (mean=1.35, median=1.24), suggesting miRNAs with higher absolute expression levels are more likely to be identified as differentially expressed. b) Total counts for the 8 classes in the box plot are summarized. They are represented as a fraction of the total sequences that had at least one perfect alignment to the human reference genome (1,631,559 total).

Figure 2.2 The repertoire of isomiRs and 3' modifications of hsa-mir-191.

hsa-mir-191 pre-miRNA: chr3 49033055 - 49033146 (-)

[Figure 2.2: alignment of the observed isomiR sequences, with their read counts, against the pre-miRNA sequence, with the predicted secondary structure shown in dot-bracket notation.]

A diverse variety of isomiRs were observed for many of the known and novel miRNAs. Sequences representing the miR* were also commonly observed (highlighted in pink).
The reference miRNA sequence from miRBase (highlighted in bold) was generally not the most frequently observed isomiR (shown in blue). Though variation at the 3' end is generally much more common than at the 5' end, target preference of these isomiRs may differ. Sequences with evidence of 3' additions of nucleotides (red) were common, with certain miRNAs more heavily modified than others. The predicted structure of the pre-miRNA is represented at the bottom.

Figure 2.3 Clustering of over-represented Gene Ontology classes in predicted targets of differential microRNAs.

[Figure 2.3: heat maps of GO terms (y-axis) against genes (x-axis) for (a) targets of hESC-enriched miRNAs and (b) targets of EB-enriched miRNAs.]

Shown are heat map representations of Gene Ontology (GO) (Ashburner et al. 2000) terms over-represented amongst predicted cooperative targets (y-axis) of (a) hESC-enriched miRNAs and (b) EB-enriched miRNAs. All genes with statistically over-represented GO annotations were included (p<0.01, x-axis) as identified by GoStat (Beissbarth et al. 2004). GO terms common to both sets of genes were those involved in transcriptional regulation, differentiation and development.
Those GO terms unique to hESC-enriched miRNA targets were associated with programmed cell death, response to stress, and cell motility while those unique to EB-enriched miRNA targets describe various aspects of cell proliferation regulation as well as nucleotide and nucleic acid metabolism.

2.6 References

Abeyta, M. J., Clark, A. T., Rodriguez, R. T., Bodnar, M. S., Pera, R. A., and Firpo, M. T. 2004. Unique gene expression signatures of independently-derived human embryonic stem cell lines. Hum Mol Genet 13: 601-8.
Ambros, V., Bartel, B., Bartel, D. P., Burge, C. B., Carrington, J. C., Chen, X., Dreyfuss, G., Eddy, S. R., Griffiths-Jones, S., Marshall, M., Matzke, M., et al. 2003. A uniform system for microRNA annotation. RNA 9: 277-9.
Aravin, A. A., Sachidanandam, R., Girard, A., Fejes-Toth, K., and Hannon, G. J. 2007. Developmentally regulated piRNA clusters implicate MILI in transposon control. Science 316: 744-747.
Aravin, A. and Tuschl, T. 2005. Identification and characterization of small RNAs involved in RNA silencing. FEBS Lett 579: 5830-40.
Ashburner, M., Ball, C. A., Blake, J. A., Botstein, D., Butler, H., Cherry, J. M., Davis, A. P., Dolinski, K., Dwight, S. S., Eppig, J. T., et al. 2000. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 25: 25-9.
Audic, S. and Claverie, J. M. 1997. The significance of digital gene expression profiles. Genome Res 7: 986-95.
Beissbarth, T. and Speed, T. P. 2004. GOstat: find statistically overrepresented Gene Ontologies within a group of genes. Bioinformatics 20: 1464-1465.
Berezikov, E., Cuppen, E., and Plasterk, R. H. 2006a. Approaches to microRNA discovery. Nat Genet 38 Suppl: S2-7.
Berezikov, E., Thuemmler, F., van Laake, L. W., Kondova, I., Bontrop, R., Cuppen, E., and Plasterk, R. H. 2006b. Diversity of microRNAs in human and chimpanzee brain. Nat Genet 38: 1375-7.
Bernstein, E., Kim, S. Y., Carmell, M. A., Murchison, E. P., Alcorn, H., Li, M. Z., Mills, A. A., Elledge, S. J., Anderson, K. V., and Hannon, G. J. 2003. Dicer is essential for mouse development. Nat Genet 35(3): 215-7.
Bhattacharya, B., Cai, J., Luo, Y., Miura, T., Mejido, J., Brimble, S. N., Zeng, X., Schulz, T. C., Rao, M. S., and Puri, R. K. 2005. Comparison of the gene expression profile of undifferentiated human embryonic stem cell lines and differentiating embryoid bodies. BMC Dev Biol 5: pre-print.
Bhattacharya, B., Miura, T., Brandenberger, R., Mejido, J., Luo, Y., Yang, A. X., Joshi, B. H., Ginis, I., Thies, R. S., Amit, M., et al. 2004. Gene expression in human embryonic stem cell lines: unique molecular signature. Blood 103: 2956-64.
Boyer, L. A., Lee, T. I., Cole, M. F., Johnstone, S. E., Levine, S. S., Zucker, J. P., Guenther, M. G., Kumar, R. M., Murray, H. L., Jenner, R. G., et al. 2005. Core transcriptional regulatory circuitry in human embryonic stem cells. Cell 122: 947-56.
Cai, X., Hagedorn, C. H., and Cullen, B. R. 2004. Human microRNAs are processed from capped, polyadenylated transcripts that can also function as mRNAs. RNA 10: 1957-66.
Chen, C., Ridzon, D., Lee, C. T., Blake, J., Sun, Y., and Strauss, W. M. 2007. Defining embryonic stem cell identity using differentiation-related microRNAs and their potential targets. Mamm Genome.
Cummins, J. M., He, Y., Leary, R. J., Pagliarini, R., Diaz, L. A., Jr., Sjoblom, T., Barad, O., Bentwich, Z., Szafranska, A. E., Labourier, E., et al. 2006. The colorectal microRNAome. Proc Natl Acad Sci USA 103: 3687-92.
Davison, T. S., Johnson, C. D., and Andruss, B. F. 2006. Analyzing micro-RNA expression using microarrays. Methods Enzymol 411: 14-34.
Dvash, T., Mayshar, Y., Darr, H., McElhaney, M., Barker, D., Yanuka, O., Kotkow, K. J., Rubin, L. L., Benvenisty, N., and Eiges, R. 2004. Temporal gene expression during differentiation of human embryonic stem cells and embryoid bodies. Hum Reprod 19: 2875-83.
Fahlgren, N., Howell, M. D., Kasschau, K. D., Chapman, E. J., Sullivan, C. M., Cumbie, J. S., Givan, S. A., Law, T. F., Grant, S. R., Dangl, J. L., et al. 2007. High-Throughput Sequencing of Arabidopsis microRNAs: Evidence for Frequent Birth and Death of MIRNA Genes. PLoS ONE 2: e219.
Giraldez, A. J., Mishima, Y., Rihel, J., Grocock, R. J., Van Dongen, S., Inoue, K., Enright, A. J., and Schier, A. F. 2006. Zebrafish MiR-430 promotes deadenylation and clearance of maternal mRNAs. Science 312: 75-9.
Griffiths-Jones, S. 2006. miRBase: the microRNA sequence database. Methods Mol Biol 342: 129-38.
Grimson, A., Farh, K. K., Johnston, W. K., Garrett-Engele, P., Lim, L. P., and Bartel, D. P. 2007. MicroRNA targeting specificity in mammals: determinants beyond seed pairing. Mol Cell 27: 91-105.
Hirst, M., Delaney, A., Rogers, S. A., Schnerch, A., Persaud, D. R., O'Connor, M. D., Zeng, T., Moksa, M., Fichter, K., Mah, D., et al. 2007. LongSAGE profiling of nine human embryonic stem cell lines. Genome Biol 8: R113.
Hofacker, I. L. 2003. Vienna RNA secondary structure server. Nucleic Acids Res 31: 3429-31.
Houbaviy, H. B., Murray, M. F., and Sharp, P. A. 2003. Embryonic stem cell-specific MicroRNAs. Dev Cell 5: 351-8.
Itskovitz-Eldor, J., Schuldiner, M., Karsenti, D., Eden, A., Yanuka, O., Amit, M., Soreq, H., and Benvenisty, N. 2000. Differentiation of human embryonic stem cells into embryoid bodies compromising the three embryonic germ layers. Mol Med 6(2): 88-95.
Jiang, P., Wu, H., Wang, W., Ma, W., Sun, X., and Lu, Z. 2007. MiPred: classification of real and pseudo microRNA precursors using random forest prediction model with combined features. Nucleic Acids Res 35: W339-44.
John, B., Enright, A. J., Aravin, A., Tuschl, T., Sander, C., and Marks, D. S. 2004. Human MicroRNA targets. PLoS Biol 2: e363.
Kasschau, K. D., Fahlgren, N., Chapman, E. J., Sullivan, C. M., Cumbie, J. S., Givan, S. A., and Carrington, J. C. 2007. Genome-Wide Profiling and Analysis of Arabidopsis siRNAs. PLoS Biol 5: e57.
Kawahara, Y., Zinshteyn, B., Sethupathy, P., Iizasa, H., Hatzigeorgiou, A. G., and Nishikura, K. 2007. Redirection of silencing targets by adenosine-to-inosine editing of miRNAs. Science 315: 1137-40.
Lakshmi, S. and Agrawal, S. 2007. piRNABank: a web resource on classified and clustered Piwi-interacting RNAs. Nucleic Acids Res. In press.
Landgraf, P., Rusu, M., Sheridan, R., Sewer, A., Iovino, N., Aravin, A., Pfeffer, S., Rice, A., Kamphorst, A. O., Landthaler, M., et al. 2007. A Mammalian microRNA Expression Atlas Based on Small RNA Library Sequencing. Cell 129: 1401-14.
Lewis, B. P., Burge, C. B., and Bartel, D. P. 2005. Conserved seed pairing, often flanked by adenosines, indicates that thousands of human genes are microRNA targets. Cell 120: 15-20.
Lund, E., Guttinger, S., Calado, A., Dahlberg, J. E., and Kutay, U. 2004. Nuclear export of microRNA precursors. Science 303: 95-8.
Margulies, M., Egholm, M., Altman, W. E., Attiya, S., Bader, J. S., Bemben, L. A., Berka, J., Braverman, M. S., Chen, Y. J., Chen, Z., et al. 2005. Genome sequencing in microfabricated high-density picolitre reactors. Nature 437: 376-80.
Nakano, M., Nobuta, K., Vemaraju, K., Tej, S. S., Skogen, J. W., and Meyers, B. C. 2006. Plant MPSS databases: signature-based transcriptional resources for analyses of mRNA and small RNA. Nucleic Acids Res 34: D731-5.
Ng Kwang Loong, S. and Mishra, S. K. 2007. Unique folding of precursor microRNAs: quantitative evidence and implications for de novo identification. RNA 13: 170-87.
O'Toole, A. S., Miller, S., Haines, N., Zink, M. C., and Serra, M. J. 2006. Comprehensive thermodynamic analysis of 3' double-nucleotide overhangs neighboring Watson-Crick terminal base pairs. Nucleic Acids Res 34: 3338-44.
Pedersen, J. S., Bejerano, G., Siepel, A., Rosenbloom, K., Lindblad-Toh, K., Lander, E. S., Kent, J., Miller, W., and Haussler, D. 2006. Identification and classification of conserved RNA secondary structures in the human genome. PLoS Comput Biol 2: e33.
Pfeffer, S., Sewer, A., Lagos-Quintana, M., Sheridan, R., Sander, C., Grasser, F. A., van Dyk, L. F., Ho, C. K., Shuman, S., Chien, M., et al. 2005. Identification of microRNAs of the herpesvirus family. Nat Methods 2: 269-76.
Porkka, K. P., Pfeiffer, M. J., Waltering, K. K., Vessella, R. L., Tammela, T. L., and Visakorpi, T. 2007. MicroRNA expression profiling in prostate cancer. Cancer Res 67: 6130-5.
Rajagopalan, R., Vaucheret, H., Trejo, J., and Bartel, D. P. 2006. A diverse and evolutionarily fluid set of microRNAs in Arabidopsis thaliana. Genes Dev 20: 3407-25.
Ruby, J. G., Jan, C., Player, C., Axtell, M. J., Lee, W., Nusbaum, C., Ge, H., and Bartel, D. P. 2006. Large-scale sequencing reveals 21U-RNAs and additional microRNAs and endogenous siRNAs in C. elegans. Cell 127: 1193-207.
Sato, N., Sanjuan, I. M., Heke, M., Uchida, M., Naef, F., and Brivanlou, A. H. 2003. Molecular signature of human embryonic stem cells and its comparison with the mouse. Dev Biol 260: 404-13.
Shivdasani, R. A. 2006. MicroRNAs: regulators of gene expression and cell differentiation. Blood 108: 3646-53.
Skottman, H., Mikkola, M., Lundin, K., Olsson, C., Stromberg, A. M., Tuuri, T., Otonkoski, T., Hovatta, O., and Lahesmaa, R. 2005. Gene expression signatures of seven individual human embryonic stem cell lines. Stem Cells 23: 1343-56.
Song, L. and Tuan, R. S. 2006. MicroRNAs and cell differentiation in mammalian development. Birth Defects Res C Embryo Today 78: 140-9.
Stekel, D. J., Git, Y., and Falciani, F. 2000. The comparison of gene expression from multiple cDNA libraries. Genome Res 10: 2055-61.
Strauss, W. M., Chen, C., Lee, C. T., and Ridzon, D. 2006. Nonrestrictive developmental regulation of microRNA gene expression. Mamm Genome 17: 833-40.
Suh, M. R., Lee, Y., Kim, J. Y., Kim, S. K., Moon, S. H., Lee, J. Y., Cha, K. Y., Chung, H. M., Yoon, H. S., Moon, S. Y., et al. 2004. Human embryonic stem cells express a unique set of microRNAs. Dev Biol 270: 488-98.
Wang, Y., Medvid, R., Melton, C., Jaenisch, R., and Blelloch, R. 2007. DGCR8 is essential for microRNA biogenesis and silencing of embryonic stem cell self-renewal. Nat Genet 39(3): 380-5.
Watanabe, T., Takeda, A., Tsukiyama, T., Mise, K., Okuno, T., Sasaki, H., Minami, N., and Imai, H. 2006. Identification and characterization of two novel classes of small RNAs in the mouse germline: retrotransposon-derived siRNAs in oocytes and germline small RNAs in testes. Genes Dev 20: 1732-43.
Xue, C., Li, F., He, T., Liu, G. P., Li, Y., and Zhang, X. 2005. Classification of real and pseudo microRNA precursors using local structure-sequence features and support vector machine. BMC Bioinformatics 6: 310.
Yao, Y., Guo, G., Ni, Z., Sunkar, R., Du, J., Zhu, J. K., and Sun, Q. 2007. Cloning and characterization of microRNAs from wheat (Triticum aestivum L.). Genome Biol 8: R96.
Zhang, B., Pan, X. P., Cox, S. B., Cobb, G. P., and Anderson, T. A. 2006. Evidence that miRNAs are different from other RNAs. Cell Mol Life Sci 63: .
Zhao, J. J., Hua, Y. J., Sun, D. G., Meng, X. X., Xiao, H. S., and Ma, X. 2006. Genome-wide microRNA profiling in human fetal nervous tissues by oligonucleotide microarray. Childs Nerv Syst 22: 1419-25.

3 CONCLUDING REMARKS

3.1 Goals accomplished

A strong motivating factor in this project was the general lack of software for analyzing miRNA sequence data. The software that is currently available for this purpose lags far behind current technological capabilities, with recent releases still limited to analysis of data from 'dideoxy' (Sanger) sequencing methods (Tian et al. 2007). Considering the benefits of next-generation sequencing methods presented in the previous chapters, use of older sequencing approaches for miRNA profiling is likely to diminish in favour of these new methods. To catalyze this transition, the scientific community will require access to software that is capable of analyzing and summarizing data from experiments employing these new platforms. As researchers may be apprehensive to adopt new techniques, they must also be convinced of the benefits such a change will provide. To this end, many valuable bioinformatic tools for processing Illumina small RNA sequence data have been developed throughout this project. Although improvements are required to enhance the scope and robustness of this software, it has utility in interpreting the vast amount of data that is produced by this technology.
Specifically, chapter two provides an example of the various types of information that can be extracted from these data using this set of tools.

Overall, the analyses included here revealed some underappreciated intricacies of miRNA biogenesis and some specific insights into gene regulation in human embryonic stem cells. These results include the observation that miRNAs demonstrate variability at the 5' and 3' ends, likely resulting from inconsistent Drosha and Dicer cleavage; undergo RNA editing by ADAR; and are subject to a plethora of 3' extensions that appear to be specific to certain miRNAs and evolutionarily conserved. The biological significance of these findings is unclear and they may later prove to be artefactual. However, the fact that most of these observations recapitulate those of other groups suggests they are at least reproducible. As well, this analysis has resulted in a list of 170 novel miRNA genes, some of which may also be important in human embryonic stem cells. Of the known miRNA genes identified here, many were already known to be important in hESCs (Suh et al. 2004). This technique also revealed miRNAs expressed at relatively low levels, whose expression has never before been reported in hESCs. Some of these also demonstrated significant expression changes between these samples, thus increasing the list of miRNAs potentially important in either pluripotency or differentiation. Key to discovering the role of these miRNAs in hESCs is the correct identification of their gene targets. Target genes, predicted by one of the most reliable target prediction algorithms (TargetScan), were ranked for their potential for cooperative regulation by co-expressed miRNAs.
Although this reduced the sensitivity of target prediction considerably, it provided smaller sets of genes that are potentially important in pluripotency or differentiation. The statistically over-represented gene ontology (GO) terms (Figure 2.3) support this, as these smaller gene lists are enriched for genes involved in these processes.

3.2 Future directions

The data included here represent only a small sample of the capabilities of this new sequencing technology. The Illumina sequencing apparatus allows 8 separate libraries to be sequenced concurrently, while offering a choice between enhanced sampling depth or multiple biological replicates. Biological and technical replicates would allow more robust statistical tests to be applied, potentially revealing more subtle miRNA expression changes that were missed in this study. Samples extracted from differentiating hESCs at earlier and later time points could also reveal trends in these miRNA expression changes, which may fluctuate during the process of differentiation. Later time points may also reveal further stages of differentiation or could reveal whether the EB sample was fully differentiated (devoid of pluripotent cells). These additional experiments would solidify our current views of the hESC miRNA transcriptome and would likely reveal the variability in miRNA levels that is unrelated to differentiation.

A complete view of the miRNA expression levels and changes associated with differentiation is of limited utility without knowledge of their gene targets. However, assays for validating miRNA/target interactions are laborious, thus few have been biologically verified to date. In fact, a publicly available database known as TarBase (Sethupathy et al. 2006) hosts the currently known interactions and it includes very few of the miRNAs highlighted in this study.
Hence, an important next step would be to validate some of the candidate targets of cooperative miRNA regulation identified here (Figure 2.3). Perhaps more importantly, large-scale efforts to identify miRNA/target interactions need to be developed. A few recent studies have demonstrated successful co-immunoprecipitation between components of RISC and their bound mRNAs in both human and mouse tissues (Duan and Jin 2006; Beitzinger et al. 2007). Such approaches may be adapted into high-throughput methods for identifying mRNA targets of the miRNA pathway and potentially the specific sites bound by these proteins. If a full view of all mRNAs targeted by miRNAs can be produced, the field of miRNA research will be strengthened considerably.

Finally, a few of the observations in this project are of unknown functional relevance. RNA editing of pre-miRNAs can affect the mature miRNA sequence and is thought to affect a miRNA's repertoire of targets (Blow et al. 2006; Kawahara et al. 2007). However, as the relative levels of edited miRNAs found here were less than 1% of the unmodified forms, it seems unlikely that they are functional. Further work using more small RNA sequence libraries should reveal whether editing of miRNAs is a widespread phenomenon. Two other modes of miRNA variation were discussed here: 3' extension by an unknown mechanism and end variation due to variable Drosha and Dicer cleavage sites. Both can result in miRNAs of different lengths, and the 5' variants have a different seed sequence than the canonical miRNA. As they may also have different in vivo targets, determining whether these isomiRs associate with functional RISCs is an important next step towards concluding whether they are also functional.
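To illustrate how a shifted 5' end changes the seed, the following Python sketch extracts the seed from reads that begin one nucleotide upstream of, at, and one nucleotide downstream of the canonical 5' end. It assumes the common convention that the seed spans nucleotides 2-8 of the mature miRNA, which is not defined in this chapter. The function names and the placeholder upstream base "N" are hypothetical; the example sequence is the most common mir-199a 3-p read listed in Supplementary Table 1.

```python
def seed(read, start=2, length=7):
    """Return the seed of a 5'->3' read as RNA (assumed here: nucleotides 2-8)."""
    rna = read.upper().replace("T", "U")
    return rna[start - 1:start - 1 + length]


def seeds_by_offset(context, canonical_start, offsets=(-1, 0, 1)):
    """Map each 5' offset to the seed of a read starting at that position.

    context is a stretch of precursor sequence; canonical_start is the
    0-based index of the canonical 5' end within it.
    """
    return {off: seed(context[canonical_start + off:]) for off in offsets}


# Most common mir-199a 3-p read (Supplementary Table 1), preceded by a
# placeholder "N" because the true upstream base is not given here.
context = "N" + "ACAGTAGTCTGCACATTGGTTA"
result = seeds_by_offset(context, canonical_start=1)
for offset, s in sorted(result.items()):
    print(offset, s)
```

Under this assumption the three start positions give the seeds ACAGUAG, CAGUAGU and AGUAGUC, each of which is listed in Supplementary Table 3.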
The 3' variability of miRNAs is less likely to affect their targets, as miRNA/mRNA complementarity in this region is less important. Instead, it is possible that variation at the 3' end performs a different function entirely. A recent study has shown that the sequence at the extreme 3' end of some miRNAs can affect their subcellular localization (Hwang et al. 2007). Hence, further study may reveal that these isomiRs reside preferentially in different cellular locales.

3.3 References

Beitzinger, M., Peters, L., Zhu, J. Y., Kremmer, E., and Meister, G. 2007. Identification of Human microRNA Targets From Isolated Argonaute Protein Complexes. RNA Biol 4.
Blow, M. J., Grocock, R. J., van Dongen, S., Enright, A. J., Dicks, E., Futreal, P. A., Wooster, R., and Stratton, M. R. 2006. RNA editing of human microRNAs. Genome Biol. 7: R27.
Duan, R. and Jin, P. 2006. Identification of messenger RNAs and microRNAs associated with fragile X mental retardation protein. Methods Mol. Biol. 342: 267-276.
Hwang, H. W., Wentzel, E. A., and Mendell, J. T. 2007. A hexanucleotide element directs microRNA nuclear import. Science 315: 97-100.
Kawahara, Y., Zinshteyn, B., Sethupathy, P., Iizasa, H., Hatzigeorgiou, A. G., and Nishikura, K. 2007. Redirection of silencing targets by adenosine-to-inosine editing of miRNAs. Science 315: 1137-40.
Sethupathy, P., Corda, B., and Hatzigeorgiou, A. G. 2006. TarBase: A comprehensive database of experimentally supported animal microRNA targets. RNA 12: 192-7.
Suh, M. R., Lee, Y., Kim, J. Y., Kim, S. K., Moon, S. H., Lee, J. Y., Cha, K. Y., Chung, H. M., Yoon, H. S., Moon, S. Y., et al. 2004. Human embryonic stem cells express a unique set of microRNAs. Dev Biol 270: 488-98.
Tian, F., Zhang, H., Zhang, X., Song, C., Xia, Y., Wu, Y., and Liu, X. 2007. miRAS: a data processing system for miRNA expression profiling study. BMC Bioinformatics 8: 285.

APPENDICES

Appendix A: Supplementary Tables

Supplementary Table 1: All significantly differentially expressed miRNAs

microRNA (hsa)  arm  most common sequence  hESC count  EB count  P-value  fold change
mir-199a  3-p  ACAGTAGTCTGCACATTGGTTA  1110  13163  0  11.86
mir-372  3-p  AAAGTGCTGCGACATTTGAGCGT  1388  13653  0  9.84
mir-122a  5-p  TGGAGTGTGACAATGGTGTTTG  436  2565  0  5.88
mir-152  5-p  TCAGTGCATGACAGAACTTGG  622  3028  0  4.87
mir-10a  5-p  TACCCTGTAGATCCGAATTTGT  948  3887  0  4.10
let-7a  5-p  TGAGGTAGTAGGTTGTATAGTT  11902  2951  0  4.03
mir-302a  5-p  TAAACGTGGATGTACTTGCTTT  36800  9917  0  3.71
mir-222  3-p  AGCTACATCTGGCTACTGGGTCTC  4719  1331  0  3.55
mir-340  5-p  TTATAAAGCAATGAGACTGATT  2247  7198  0  3.20
mir-363  3-p  AATTGCACGGTATCCATCTGTA  5775  17912  0  3.10
mir-21  5-p  TAGCTTATCAGACTGATGTTGAC  39818  21003  0  1.90
mir-221  3-p  AGCTACATTGTCTGCTGGGTTTC  16275  8716  0  1.87
mir-26a  5-p  TTCAAGTAATCCAGGATAGGCT  4892  8530  0  1.74
mir-26b  5-p  TTCAAGTAATTCAGGATAGGTT  1003  2957  1.39E-278  2.95
mir-130a  3-p  CAGTGCAATGTTAAAAGGGCAT  2334  4798  2.20E-265  2.06
mir-744  5-p  TGCGGGGCTAGGGCTAACAGCA  4166  1516  9.13E-259  2.75
mir-594  5-p  ATGGATAAGGCATTGGC  1717  211  1.96E-253  8.14
mir-302b  3-p  TAAGTGCTTCCATGTTTTAGTAG  15169  8855  1.39E-213  1.71
mir-30d  5-p  TGTAAACATCCCCGACTGGAAGCT  2798  4988  3.29E-205  1.78
mir-146b  5-p  TGAGAACTGAATTCCATAGGCTGT  703  2075  2.27E-196  2.95
mir-25  3-p  CATTGCACTTGTCTCGGTCTGA  24268  15875  1.52E-189  1.53
mir-373  3-p  GAAGTGCTTCGATTTTGGGGTGT  60  788  5.75E-184  13.13
mir-423  5-p  TGAGGGGCAGAGAGCGAGACTTT  9844  5538  4.50E-162  1.78
mir-320  3-p  AAAAGCTGGGTTGAGAGGGCGA  5967  2978  1.33E-148  2.00
mir-1  3-p  TGGAATGTAAAGAAGTATGTAT  7421  4051  2.39E-137  1.83
mir-371  5-p  ACTCAAACTGTGGGGGCACTT  1649  3020  6.89E-133  1.83
mir-302d  3-p  TAAGTGCTTCCATGTTTGAGTGT  8599  5047  3.95E-119  1.70
mir-378  3-p  ACTGGACTTGGAGTCAGAAGG  4021  6149  5.05E-119  1.53
mir-30a  5-p  TGTAAACATCCTCGACTGGAAGCT  1654  2822  2.67E-105  1.71
mir-302c  3-p  TAAGTGCTTCCATGTTTCAGT  3929  1984  4.03E-95  1.98
let-7e  5-p  TGAGGTAGGAGGTTGTATAGTT  899  1794  4.88E-95  2.00
mir-92  5-p  AGGTTGGGATCGGTTGCAATGCT  788  138  2.91E-94  5.71
mir-204  5-p  TTCCCTTTGTCATCCTATGCCT  690  1505  7.42E-94  2.18
mir-421  3-p  ATCAACAGACATTAATTGGGCGC  579  1331  1.31E-90  2.30
mir-20b  5-p  CAAAGTGCTCATAGTGCAGGTAG  281  862  2.51E-86  3.07
mir-130b  3-p  CAGTGCAATGATGAAAGGGCAT  836  1626  4.23E-82  1.94
mir-146a  5-p  TGAGAACTGAATTCCAT  107  556  3.04E-78  5.20
mir-181a  5-p  AACATTCAACGCTGTCGGTGAGTTT  571  1220  8.39E-74  2.14
mir-106a  5-p  AAAAGTGCTTACAGTGCAGGTAG  236  696  4.74E-67  2.95
let-7c  5-p  TGAGGTAGTAGGTTGTATGGTT  360  30  1.94E-64  12.00
mir-184  3-p  TGGACGGAGAACTGATAAGGGT  403  44  2.15E-64  9.16
mir-708  5-p  AAGGAGCTTACAATCTAGCTGGG  397  990  3.68E-64  2.49
mir-205  5-p  TCCTTCATTCCACCGGAGTCTGT  502  0  9.77E-63  n/a
mir-127  3-p  TCGGATCCGTCTGAGCTTGGCT  81  406  9.77E-63  5.01
let-7i  5-p  TGAGGTAGTAGTTTGTGCTGTT  347  33  4.34E-59  10.52
mir-302a  3-p  TAAGTGCTTCCATGTTTTGGTGA  5239  3237  8.21E-58  1.62
mir-331  3-p  GCCCCTGGGCCTATCCTAGAA  1129  434  1.21E-53  2.60
mir-889  3-p  TTAATATCGGACAACCATTGT  0  170  2.09E-53  n/a
mir-210  3-p  CTGTGCGTGTGACAGCGGCTGA  28  247  1.54E-51  8.82
mir-302d  5-p  ACTTTAACATGGAGGCACTTGCT  258  20  8.58E-48  12.90
mir-143  3-p  TGAGATGAAGCACTGTAGCTC  314  680  7.06E-43  2.17
mir-423  3-p  AGCTCGGTCTGAGGCCCCTCAGT  1225  560  3.17E-40  2.19
mir-503  5-p  TAGCAGCGGGAACAGTTCTG  235  30  2.17E-35  7.83
mir-25  5-p  AGGCGGAGACTTGGGCAATTGCT  1002  456  1.44E-33  2.20
mir-17  3-p  ACTGCAGTGAAGGCACTTGTAG  146  377  1.27E-31  2.58
let-7b  5-p  TGAGGTAGTAGGTTGTGT  284  57  1.38E-31  4.98
mir-335  5-p  TCAAGAGCAATAACGAAAAATG  119  331  1.19E-30  2.78
mir-27b  3-p  TTCACAGTGGCTAAGTTCTGC  195  442  1.38E-30  2.27
mir-129  3-p  AAGCCCTTACCCCAAAAAGCAT  946  449  1.22E-28  2.11
mir-30e  3-p  CTTTCAGTCGGATGTTTACAG  1741  994  1.72E-28  1.75
mir-371  3-p  AAGTGCCGCCATCTTTTGAGTGT  9  114  1.31E-27  12.67
mir-19b  3-p  TGTGCAAATCCATGCAAAACTGA  492  794  7.29E-27  1.61
mir-30e  5-p  TGTAAACATCCTTGACTGGAAGCT  447  738  1.34E-26  1.65
let-7g  5-p  TGAGGTAGTAGTTTGTACAGTT  321  100  1.69E-22  3.21
mir-134  5-p  TGTGACTGGTTGACCAGAGGGG  10  97  7.80E-22  9.70
mir-181d  5-p  AACATTCATTGTTGTCGGTGGGTT  281  82  1.38E-21  3.43
mir-20a  5-p  TAAAGTGCTTATAGTGCAGGTAG  377  608  6.56E-21  1.61
mir-92b  5-p  AGGGACGGGACGCGGTGCAGTGTT  556  250  7.10E-20  2.22
mir-877  5-p  GTAGAGGAGATGGCGCAGGGG  101  10  1.50E-19  10.10
let-7f  5-p  TGAGGTAGTAGATTGTATAGTT  2004  1281  1.60E-19  1.56
mir-483  5-p  AAGACGGGAGGAAAGAAGGGAG  31  133  1.81E-19  4.29
mir-219  3-p  AGAGTTGAGTCTGGACGTCCCG  3  67  6.98E-19  22.33
mir-23b  3-p  ATCACATTGCCAGGGATTACCAC  134  286  1.57E-18  2.13
mir-93  5-p  CAAAGTGCTGTTCGTGCAGGTA  319  556  5.26E-18  1.74
mir-660  5-p  TACCCATTGCATATCGGAGTTG  46  150  2.41E-17  3.26
mir-801  5-p  TGGCACACGTAGGGCAAC  90  185  7.04E-17  2.06
let-7d  5-p  AGAGGTAGTAGGTTGCATAGTT  77  4  7.19E-17  19.25
mir-518f  3-p  GAAAGCGCTTCTCTTTAGAGGA  84  205  1.20E-16  2.44
mir-296  3-p  AGGGTTGGGTGGAGGCT  97  11  2.74E-16  8.82
mir-192  5-p  TGACCTATGAATTGACAGCCAGT  79  210  3.14E-16  2.66
mir-599  5-p  TTTGATAAGCTGACATGGGACAG  0  48  1.32E-15  n/a
mir-145  5-p  GTCCAGTTTTCCCAGGAATCCCT  57  146  4.79E-13  2.56
mir-873  5-p  GCAGGAACTTGTGAGTCTCCT  203  358  1.75E-12  1.76
mir-486  3-p  TCCTGTACTGAGCTGCCCCGAG  293  125  1.91E-12  2.34
mir-329  3-p  AACACACCTGGTTAACCTCTTT  0  35  2.74E-12  n/a
mir-20b  3-p  ACTGTAGTATGGGCACTTCCAGT  205  331  4.18E-12  1.61
mir-125a  5-p  TCCCTGAGACCCTTTAACCTGTG  189  311  5.35E-12  1.65
mir-302c  5-p  TTTAACATGGGGGTACCTGCT  522  280  9.08E-12  1.86
mir-487b  3-p  AATCGTACAGGGTCATCCACTT  0  31  5.70E-11  n/a
mir-769  5-p  TGAGACCTCTGGGTTCTGAGCT  170  298  1.78E-10  1.75
mir-28  3-p  CACTAGATTGTGAGCTCCTGGA  344  169  1.89E-10  2.04
mir-106b  3-p  CCGCACTGTGGGTACTTGCTGC  197  77  2.58E-10  2.56
mir-135b  5-p  TATGGCTTTTCATTCCTATGTGA  30  54  2.67E-10  1.80
mir-598  3-p  TACGTCATCGTTGTCATCGTCA  168  271  3.79E-10  1.61
mir-187  3-p  TCGTGTCTTGTGTTGCAGCCGG  281  130  4.07E-10  2.16
mir-27a  3-p  TTCACAGTGGCTAAGTTC  77  157  4.45E-10  2.04
mir-199b  5-p  CCCAGTGTTTAGACTATCTGTTC  1  30  4.97E-10  30.00
mir-135a  5-p  TATGGCTTTTTATTCCTATGTGA  59  132  5.44E-10  2.24
mir-323  3-p  GCACATTACACGGTCGACCTCT  0  28  5.55E-10  n/a
mir-218  5-p  TTGTGCTTGATCTAACCATGT  1  28  5.55E-10  28.00
mir-96  5-p  TTTGGCACTAGCACATTTTTGCT  44  110  6.99E-10  2.50
mir-185  5-p  TGGAGAGAAAGGCAGTTCCTG  500  311  1.11E-09  1.61
mir-151  5-p  CTAGACTGAAGCTCCTTGAGGA  1179  786  1.16E-09  1.50
mir-539  3-p  ATCATACAAGGACAATTTCTTT  0  27  1.19E-09  n/a
mir-379  5-p  TGGTAGACTATGGAAGGTAGG  7  46  1.58E-09  6.57
mir-215  5-p  ATGACCTATGAATTGACAGACA  3  38  2.70E-09  12.67
mir-135a  3-p  ATATAGGGATTGGAGCCGTGGC  15  65  3.02E-09  4.33
mir-99a  5-p  AACCCGTAGATCCGATCTTGTG  4  30  4.03E-09  7.50
mir-768  5-p  GTTGGAGGATGAAAGTACGGA  69  151  4.21E-09  2.19
mir-518b  3-p  CAAAGCGCTCCCCTTTAGAGGT  215  94  4.64E-09  2.29
mir-494  3-p  TGAAACATACACGGGAAACCTCT  2  32  4.73E-09  16.00
mir-330  3-p  GCAAAGCACACGGCCTGCAGAGAG  187  86  5.09E-09  2.17
mir-410  3-p  AATATAACACAGATGGCCTGT  0  25  5.41E-09  n/a
mir-941  3-p  CACCCGGCTGTGTGCACATGTGC  114  41  1.12E-08  2.78
mir-594  3-p  GCCCCAGTGGCCTAATGGA  9  46  2.00E-08  5.11
mir-193b  3-p  AACTGGCCCTCAAAGTCCCGGT  288  146  2.74E-08  1.97
mir-367  3-p  TGGATTGTTAAGCCAATGACAGAA  45  5  2.89E-08  9.00
mir-382  5-p  GAAGTTGTTCGTGGTGGATTCG  0  24  3.59E-08  n/a
mir-335  3-p  TTTTTCATTATTGCTCCTGACC  72  137  5.41E-08  1.90
mir-29c  3-p  TAGCACCATTTGAAATCGGTTA  45  100  7.98E-08  2.22
mir-296  5-p  AGGGCCCCCCCTCAATCCTGT  60  12  9.65E-08  5.00
mir-302b  3-p  ACTTTAACATGGAAGTGCTTTCT  343  188  1.07E-07  1.82
mir-518a  3-p  AGAAGATCTCAAGCTGTGA  75  20  2.22E-07  3.75
mir-224  5-p  CAAGTCACTAGTGGTTCCGTTTAG  108  43  3.19E-07  2.51
mir-520g  3-p  ACAAAGTGCTTCCCTTTAGAGTGT  74  20  3.33E-07  3.70
mir-519a  3-p  AGAAGATCTCAGGCTGTGTC  65  16  4.78E-07  4.06
mir-22  3-p  AAGCTGCCAGTTGAAGAACTGT  359  205  5.20E-07  1.75
mir-9  3-p  TCTTTGGTTATCTAGCTGTATGA  427  256  8.66E-07  1.67
mir-221  5-p  ACCTGGCATACAATGTAGATTTCT  637  413  1.04E-06  1.54
mir-574  3-p  CACGCTCATGCACACACCCACA  65  17  1.12E-06  3.82
mir-543  3-p  AAACATTCGCGGTGCACTTCTT  2  25  1.86E-06  12.50
mir-140  3-p  ACCACAGGGTAGAACCAC  137  219  2.18E-06  1.60
mir-10b  5-p  TACCCTGTAGAACCGAATTTGT  2  23  2.43E-06  11.50
mir-455  5-p  TATGTGCCTTTGGACTACATCG  36  78  3.55E-06  2.17
mir-518c  5-p  CTCTGGAGGGAAGCACTTTCTGTT  43  8  3.69E-06  5.38
mir-197  3-p  TTCACCACCTTCTCCACCCAGC  145  66  4.75E-06  2.20
mir-760  3-p  CGGCTCTGGGTCTGTGGGG  25  2  4.90E-06  12.50
mir-516  5-p  CATCTGGAGGTAAGAAGCACTTTGT  170  92  6.14E-06  1.85
mir-486  3-p  CGGGGCAGCTCAGTACAGGATA  19  0  6.54E-06  n/a
mir-129  5-p  CTTTTTGCGGTCTGGGCTTGC  126  55  7.28E-06  2.29
mir-374b  5-p  ATATAATACAACCTGCTAAG  84  146  1.05E-05  1.74
mir-154  3-p  AATCATACACGGTTGACCTATT  0  15  1.07E-05  n/a
mir-484  5-p  TCAGGCTCAGTCCCCTGCCGAT  65  20  1.10E-05  3.25
mir-187  5-p  GCTACAACACAGGACCCGGGCG  18  0  1.23E-05  n/a
mir-211  3-p  TTCCCTTTGTCATCCTTCGCCT  28  3  1.27E-05  9.33
mir-577  5-p  GTAGATAAAATATTGGTACCTG  53  96  1.64E-05  1.81
mir-369  3-p  AATAATACATGGTTGATCTTT  2  20  1.86E-05  10.00
mir-382  3-p  AATCATTCACGGACAACACTTT  2  20  1.86E-05  10.00
mir-200a  3-p  TAACACTGTCTGGTAACGATGTT  87  136  2.28E-05  1.56
mir-485  3-p  GTCATACACGGCTCTCCTCTCT  0  14  2.28E-05  n/a
mir-766  5-p  AGGAGGAATTGGTGCTGGTCTT  25  3  2.41E-05  8.33
mir-7  3-p  AGGTAGACTGGGATTTGTTGTT  41  9  2.58E-05  4.56
mir-483  3-p  TCACTCCTCTCCTCCCGTCTT  59  101  3.59E-05  1.71
mir-520a  3-p  AAAGTGCTTCCCTTTGGACTG  363  226  3.64E-05  1.61
mir-589  5-p  TGAGAACCACGTCTGCTCTGAGC  62  23  4.20E-05  2.70
mir-519c  3-p  AGAAGATCTCAGCCTGTGAC  44  11  4.35E-05  4.00
mir-374  5-p  TTATAATACAACCTGATAAGT  59  108  4.81E-05  1.83
mir-432  5-p  TCTTGGAGTAGGTCATTGGG  0  13  4.87E-05  n/a
mir-217  5-p  TACTGCATCAGGAACTGATTGGA  1  16  5.02E-05  16.00
mir-31  5-p  AGGCAAGATGCTGGCATAGCTG  235  135  6.16E-05  1.74
mir-196b  5-p  TAGGTAGTTTCCTGTTGTTGGG  2  18  7.11E-05  9.00
mir-362  5-p  AATCCTTGGAACCTAGGTGTGAGT  14  66  8.78E-05  4.71
mir-424  3-p  CAGCAGCAATTCATGTTTTGAA  52  16  8.78E-05  3.25
mir-493  5-p  TTGTACATGGTAGGCTTTCATT  0  13  9.19E-05  n/a
mir-512  3-p  AAGTGCTGTCATAGCTGA  37  76  9.63E-05  2.05
mir-33  5-p  GTGCATTGTAGTTGCATTGCA  43  78  0.000100716  1.81
mir-182  5-p  TTTGGCAATGGTAGAACTCACACTGGT  73  32  0.000129595  2.28
mir-923  3-p  CAGGATTCCCTCAGTAA  13  1  0.000161215  13.00
mir-193b  5-p  CGGGGTTTTGAGGGCGAGATG  19  2  0.000173717  9.50
mir-874  3-p  CTGCCCTGGCCCGAGGGACCGAC  19  2  0.000173717  9.50
mir-95  3-p  TTCAACGGGTATTTATTGAGC  19  2  0.000173717  9.50
mir-542  5-p  TGTGACAGATTGATAACTGAAA  48  82  0.00020581  1.71
mir-24  3-p  TGGCTCAGTTCAGCAGGAACA  47  87  0.000216923  1.85
mir-499  5-p  TTAAGACTTGCAGTGATGTTTA  30  59  0.000231403  1.97
mir-887  3-p  GTGAACGGGCGCCATCCCGAGGCT  3  19  0.000343537  6.33
mir-155  5-p  TTAATGCTAATCGTGATAGGGGT  35  9  0.000348001  3.89
mir-185  3-p  AGGGGCTGGCTTTCCTCT  35  9  0.000348001  3.89
mir-18b  5-p  TAAGGTGCATCTAGTGCAGT  0  11  0.000382853  n/a
mir-375  3-p  TTTGTTCGTTCGGCTCGC  19  46  0.000415915  2.42
mir-203  3-p  GTGAAATGTTTAGGACCACTAG  52  85  0.000459785  1.63
mir-495  3-p  AAACAAACATGGTGCACTTCTT  0  10  0.000474427  n/a
mir-339  5-p  TGAGCGCCTCGACGACAGAGCCG  29  55  0.000616533  1.90
mir-21  3-p  CAACACCAGTCGATGGGCTGTCT  33  10  0.000631908  3.30
mir-618  3-p  AAACTCTACTTGTCCTTCTGAGT  9  27  0.000678534  3.00
mir-28  5-p  AAGGAGCTCACAGTCTATTGAGT  21  4  0.000767493  5.25
mir-520b  3-p  AAAGTGCTTCCTTTTAGAG  0  10  0.000781386  n/a
mir-641  5-p  AAAGACATAGGATAGAGTCACCT  0  10  0.000781386  n/a
mir-199a  5-p  CCCAGTGTTCAGACTACCTGTTC  1  10  0.000781386  10.00
mir-301  5-p  CAGTGCAATAGTATTGTCAAAGCAT  16  37  0.000792328  2.31
mir-504  5-p  AGACCCTGGTCTGCACTCTATCT  49  20  0.000811599  2.45
mir-372  5-p  CCTCAAATGTGGAGCACTATTC  1  12  0.000823076  12.00
mir-194  3-p  TGTAACAGCAACTCCATGTGGAA  12  31  0.000885533  2.58
mir-518a  5-p  CTGCAAAGGGAAGCCCTTTCT  2  14  0.000978556  7.00

Supplementary Table 2: All novel miRNAs reported in this study

Chromosome  Start  End  Most abundant isomiR  hESC count  EB count  Fold change (if significant)
chr2  177173995  177174018  AATGGATTTTTGGAGCAGG  374  245  1.53
chr11  7212576  7212600  TAAGTGCTTCCATGCTT  84  131  1.56
chrX  113855921  113855947  TTCATTCGGCTGTCCAGATGTA  1269  774  1.64
chr16  33873061  33873083  GCCTGTCTGAGCGTCGCTT  194  114  1.70
chr1  218440512  218440575  TATTCATTTATCCCCAGCCTACA  126  67  1.88
chr19  62716221  62716246  TCCCTGTTCGGGCGCCA  670  1302  1.94
chr6  131417429  131417451  TCCCTGTTCGGGCGCCA  670  1302  1.94
chr1  19096156  19096178  TGGATTTTTGGATCAGGGA  40  85  2.13
chr1  555992  556022  AACAGCTAAGGACTGCA  72  31  2.32
chr6  26645776  26645799  GTGTAAGCAGGGTCGI I I I  639  273  2.34
chrX  117404429  117404454  TACGTAGATATATATGTATTTT  20  54  2.70
chr5  41511501  41511523  GTCCCTGTTCAGGCGCCA  24  66  2.75
chr15  62841721  62841750  GATGATGATGGCAGCAAATTCTGAAA  81  27  3.00
chr22  18453595  18453657  ACGTTGGCTCTGGTGGTG  35  11  3.18
chr8  141129908  141129930  ATCCCCAGCACCTCCACC  59  18  3.28
chr10  100145015  100145040  TGCTGGATCAGTGGTTCGAGTC  16  57  3.56
chr15  50356653  50356679  CCTCAGGGCTGTAGAACAGGGCT  163  45  3.62
chr4  66825201  66825227  CTGGACTGAGCCGTGCTACTGG  44  161  3.66
chr10  105144044  105144071  ACTCGGCGTGGCGTCGGTCGTG  228  50  4.56
chr2  25405021  25405049  TTGCAGCTGCCTGGGAGTGACTTC  63  13  4.85
chr4  164234207  164234230  TCCGAGTCACGGCACCA  309  55  5.62
chr1  150066842  150066862  GTGGGGGAGAGGCTGTC  103  17  6.06
chr1  68421844  68421869  ATGGGTGAATTTGTAGAAGGAT  7  45  6.43
chr7  146706060  146706085  AAACTGTAATTACTTTTGG  21  3  7.00
chrX  21990206  21990231  GCATGGGTGGTTCAGTGG  407  52  7.83
chr10  70189095  70189123  AGCCTGGAAGCTGGAGCCTGCAGT  25  3  8.33
chr22  35428858  35428879  AAAAAGACACCCCCCAC  189  21  9.00
chr14  101096450  101096475  ACCCGTCCCGTTCGTCCCCGGA  4  53  13.25
chr11  90241994  90242016  ATGGATAAGGCTTTGGCTT  31  2  15.50
chr15  20014621  20014642  CGGGCGTGGTGGTGGGGG  35  2  17.50
chr11  93106277  93106306  GTGGGAAGGAATTACAAGACAGTT  72  2  36.00
chr1  20299  20324  TTGGGACATACTTATGCTAAA  4  1  -
chr1  21187455  21187480  AGGCATTGACTTCTCACTAGCT  2  2  -
chr1  38242469  38242487  AGAGGAACAGTTCATTC  10  2  -
chr1  86844553  86844578  AAAAGTAATTGCGGATTTTGCC  2  3  -
chr1  94984022  94984045  ACTGGGCTTGGAGTCAGAAG  98  130  -
chr1  148091054  148091079  GCGAGATCGCGCAGGACTTT  3  0  -
chr1  166234525  166234550  CGGATGAGCAAAGAAAGTGGTT  7  2  -
chr1  169337500  169337525  TTAGGCCGCAGATCTGGGTGA  4  6  -
chr1  191372302  191372327  TAGTACTGTGCATATCATCTAT  184  126  -
chr1  222511371  222511397  AAAAGCTGGGTTGAGAGGGCAA  20  13  -
chr10  14518592  14518617  CAGGATGTGGTCAAGTGTTGTT  9  2  -
chr10  56037643  56037672  AAAACTGTAATTACTTTTGGAC  64  31  -
chr10  64802775  64802800  TTAGGGCCCTGGCTCCATCTCC  13  15  -
chr10  93405230  93405250  TGAACCCAGGAGGCGGA  6  9  -
chr10  104030627  104030644  AGATGGATTGTTCTGGG  2  1  -
chr10  112738724  112738749  AAAAACTGAGACTACTTTTGCA  38  26  -
chr11  3535184  3535209  AAAAGTAATTGCGGATTTTGCC  2  3  -
chr11  8662350  8662379  ATCGAGGCTAGAGTCACGCTTGG  8  10  -
chr11  8662888  8662951  TGTGGACTTTGGTCAGTGAAC  2  1  -
chr11  61491650  61491669  GCCGAGAGTCGTCGGGGTT  3  1  -
chr11  67457297  67457322  AAAAGTAATTGCGGATTTTGCC  2  3  -
chr11  69807738  69807763  AAAAGTACTTGCGGATTTTGCT  57  64  -
chr11  93106536  93106561  TTTGAGGCTACAGTGAGATGTG  2  2  -
chr11  93839357  93839382  AAAAGTATTTGCGGGTTTTGTC  14  14  -
chr12  9371838  9371863  TTGGGACATACTTATGCTAAA  4  1  -
chr12  43950119  43950145  AGTGAACCACAGACCTGAGATGT  3  3  -
chr12  47334539  47334568  TGGCCCTGACTGAAGACCAGCAGT  6  0  -
chr12  48914229  48914253  TGGGTGGTCTGGAGATTTGTGC  4  0  -
chr12  61368489  61368514  AAAAACTGTAATTACTTTT  4  1  -
chr12  67953235  67953252  TCATATTGCTTCTTTCT  5  2  -
chr12  78337170  78337195  AGAAGGAAATTGAATTCATTTA  4  5  -
chr12  96409820  96409846  ACTCTAGCTGCCAAAGGCGCT  2  9  -
chr12  97517768  97517795  GACTCGACTCGTGTGCGGACATTT  7  2  -
chr12  111617267  111617292  TTGGGACATACTTATGCTAAA  4  1  -
chr13  44500861  44500886  TCTGGGCAACAAAGTGAGACCT  23  33  -
chr13  53784117  53784134  TTCAAGTAATTCAGGTG  4  6  -
chr13  70373329  70373354  AAAAACTGTAATTACTTTT  4  1  -
chr13  106981564  106981588  CCTGTTGAAGTGTAATCCCCA  2  3  -
chr14  22317565  22317585  ATATAATACAACCTGCTT  45  74  -
chr14  63631551  63631577  AAAAGTAATCGCGGTTTTTGTC  27  16  -
chr14  76802325  76802346  ATCCCACCTCTGCCACCA  4  2  -
chr15  41873222  41873242  TCGTTTGCCTTTTTCTGCTT  1  2  -
chr15  51017861  51017887  TTGAGAAGGAGGCTGCTG  0  3  -
chr15  66267369  66267386  AAAAGAAATGTACAGAG  1  4  -
chr15  84114778  84114803  TAAAGAGCCCTGTGGAGACA  3  9  -
chr15  100318227  100318252  TTGGGACATACTTATGCTAAA  4  1  -
chr16  1952199  1952219  TCTGGAAGGAGACTGTAT  0  1  -
chr16  11307845  11307871  AAAAGTAATCGCGGTTTTTGTC  27  16  -
chr17  2598181  2598205  AGAGAAGAAGATCAGCCTGCA  4  2  -
chr17  13387635  13387661  AAAAGTAATCGCGGTTTTTGTC  27  16  -
chr17  16126095  16126119  TGGACTGCCCTGATCTGGAGA  3  2  -
chr17  76721656  76721681  ACGGTGCTGGATGTGGCCTTT  6  8  -
chr18  26132900  26132917  TAATTGCTTCCATGTTT  60  58  -
chr18  55567933  55567958  AAAAACTGTAATTACTTTT  4  1  -
chr19  23043  23068  TTGGGACATACTTATGCTAAA  4  1  -
chr19  20371125  20371152  CTGGAGATATGGAAGAGCTGTGT  76  38  -
chr19  58883558  58883583  TCTACAAAGGAAAGCGCTTTCT  12  10  -
chr19  58953309  58953334  TCTACAAAGGAAAGCGCTTTCT  12  10  -
chr2  56081400  56081425  AAATCTCTGCAGGCAAATGTG  0  4  -
chr2  62828390  62828415  TAGCAAAAACTGCAGTTACTTT  6  2  -
chr2  70333567  70333592  TCTGGGCAACAAAGTGAGACCT  23  33  -
chr2  114057048  114057073  TTGGGACATACTTATGCTAAA  4  1  -
chr2  180433815  180433840  AGTTAGGATTAGGTCGTGGAA  10  5  -
chr2  180433850  180433875  AGTTAGGATTAGGTCGTGGAA  10  5  -
chr2  189551103  189551128  AAGTGATCTAAAGGCCTACAT  0  6  -
chr2  207842293  207842318  TTGGGACATACTTATGCTAAA  4  1  -
chr2  212999243  212999270  AACTGTAATTACTTTTGCACCAAC  2  0  -
chr2  232286322  232286348  AAGTAGTTGGTTTGTATGAGATGGTT  0  1  -
chr20  2581424  2581452  TGGGAACGGGTTCCGGCAGACGCTG  4  5  -
chr20  33505224  33505250  TGGAGTCCAGGAATCTGCATTTT  2  2  -
chr20  36487253  36487280  AAACTCTGGAGCTTTCGTACATGC  7  0  -
chr20  36578662  36578687  CCAAAACTGCAGTTACTTTTGC  8  2  -
chr20  47330266  47330296  ATATATGATGACTTAGCTTTT  7  22  -
chr20  59962070  59962094  AGTGAATGATGGGTTCTGACC  2  3  -
chr22  18616665  18616690  TGCAGGACCAAGATGAGCCCT  13  14  -
chr22  20337277  20337302  GCTCTGACGAGGTTGCACTACT  7  6  -
chr22  20337312  20337340  CAGTGCAATGATATTGTCAAAGCAT  24  30  -
chr22  25281235  25281262  AAAAGTAATTGCGGTCTTTGGT  82  58  -
chr22  39818495  39818513  TCGCCTCCTCCTCTCCC  0  1  -
chr22  43975501  43975526  ACGCCCTTCCCCCCCTTCTTCA  16  3  -
chr3  70572562  70572587  AAAAACTGTAATTACTTTT  4  1  -
chr3  71673877  71673902  TCTATACAGACCCTGGCTTTTC  2  6  -
chr3  119201514  119201540  TGGAGTCCAGGAATCTGCATTTT  2  2  -
chr3  126992024  126992049  AAAAGTAATTGCGGATTTTGCC  2  3  -
chr3  129563700  129563720  TCCCACCGCTGCCACCC  6  5  -
chr3  150090065  150090092  GGCCAGGCCTGGTGGCTCACTTT  6  1  -
chr3  165372000  165372027  ATGGTACCCTGGCATACTGAGT  5  3  -
chr3  187985862  187985882  GAAAACTGCAGTGACATGTT  0  2  -
chr3  187987156  187987186  ACCTTCTTGTATAAGCACTGTGCTAAA  5  1  -
chr4  4136998  4137023  AAAAGTAATTGCGGATTTTGCC  2  3  -
chr4  9166974  9166999  AAAAGTAATTGCGGATTTTGCC  2  3  -
chr4  36104418  36104443  CGGATGAGCAAAGAAAGTGGTT  7  2  -
chr4  71081401  71081426  TTGGGACATACTTATGCTAAA  4  1  -
chr4  102470541  102470568  AGGATGAGCAAAGAAAGTAGATT  111  70  -
chr4  114247470  114247495  AACTGGATCAATTATAGGAGTG  3  2  -
chr4  148485239  148485266  AAAACTGTAATTACTTTTGTAC  21  12  -
chr4  183327488  183327513  TTTTCAACTCTAATGGGAGAGA  14  1  -
chr5  100180098  100180123  TAGCAAAAACTGCAGTTACTTT  6  2  -
chr5  109877434  109877461  AACTGTAATTACTTTTGCACCAAC  2  0  -
chr5  132791203  132791229  TGGAGTCCAGGAATCTGCATTTT  2  2  -
chr5  153706904  153706929  TGTGAGGTTGGCATTGTTGTCT  12  12  -
chr5  154045576  154045602  TTTAGAGACGGGGTCTTGCTCT  19  8  -
chr5  175727567  175727592  CTTGGCACCTAGCAAGCACTCA  20  25  -
chr6  54084910  54084927  AACAGCATTGCACTTGT  1  3  -
chr6  74286439  74286468  AAAGGAAAAGACTCATATCAA  3  0  -
chr6  132155005  132155031  AAAAGTAATCGCGGTTTTTGTC  27  16  -
chr6  132634811  132634835  TGGGAAAGGAAAAGACTC  4  2  -
chr6  133180049  133180074  AAGCCAGCCAATGAA  4  0  -
chr6  133180152  133180179  TAAGGTCCCTGAGAATGGCTAT  38  24  -
chr6  163910779  163910796  AAATACAATAAAATCTG  0  2  -
chr6  167331306  167331331  TACGCGCAGACCACAGGATGTC  4  3  -
chr7  18133379  18133404  TTGGGACATACTTATGCTAAA  4  1  -
chr7  34946939  34946964  CAAAAGTAATTGTGGATTTTGT  3  2  -
chr7  67622028  67622053  TCTGGGCAACAAAGTGAGACCT  23  33  -
chr7  91671271  91671298  TCTGGGCAACAAAGTGAGACCT  23  33  -
chr7  100156228  100156248  GTGGGGGAGAGGCTGTG  18  9  -
chr8  7043408  7043433  AAAAGTAATTGCGGATTTTGCC  2  3  -
chr8  7983960  7983985  AAAAGTAATTGCGGATTTTGCC  2  3  -
chr8  12479920  12479945  AAAAGTAATTGCGGATTTTGCC  2  3  -
chr8  12538020  12538045  AAAAGTAATTGCGGATTTTGCC  2  3  -
chr8  26962356  26962382  AAAAGTAATCGCGGTTTTTGTC  27  16  -
chr8  101105387  101105415  GGGCGACAAAGCAAGACTCTTTCTT  4  0  -
chr8  142865520  142865545  TTGGGACATACTTATGCTAAA  4  1  -
chr8  144193152  144193177  TTGGGACATACTTATGCTAAA  4  1  -
chr9  20214  20239  TTGGGACATACTTATGCTAAA  4  1  -
chr9  28878884  28878910  GAGACTGATGAGTTCCCGGGA  41  23  -
chr9  68292057  68292082  TTCTGGAATTCTGTGTGAGGGA  2  8  -
chr9  91868406  91868431  AAAAGTAATTGCGGATTTTGCC  2  3  -
chr9  99165696  99165721  TTGGGACATACTTATGCTAAA  4  1  -
chr9  116078285  116078309  AATGGCTTTTTGGAGCAGG  149  103  -
chrX  32569518  32569545  AAAACTGTAATTACTTTTGTAC  21  12  -
chrX  69159472  69159498  CTGTCCTAAGGTTGTTGAGTT  28  15  -
chrX  83367459  83367484  AAAAGTAATTGCGGATTTTGCC  2  3  -
chrX  93354756  93354781  AAAAGTAATTGCGGATTTTGCC  2  3  -
chrX  94204843  94204867  CAAAGGTATTTGTGGTTTTTG  2  0  -
chrX  100287314  100287339  TAGCAAAAACTGCAGTTACTTT  6  2  -
chrX  113793427  113793450  CAAGTCTTATTTGAGCACCTGTT  2  2  -
chrX  118805342  118805371  CATGACAGATTGACATGGACAATT  7  0  -
chrY  13478002  13478025  AGATGAACTTGAAAGAAGACCAT  2  1  -

Supplementary Table 3: Seed sequences differentially represented between libraries

Seed sequence  hESC count  EB count  hESC background  EB background  Fold change  Corrected P-value
AGUAGUC  86  1478  860319  826472  17.19  0
CAGUAGU  1801  22288  860319  826472  12.38  0
ACAGUAG  241  2749  860319  826472  11.41  0
UUAAACG  3452  398  860319  826472  8.67  0
CUUAAAC  16762  2332  860319  826472  7.19  0
GGAGUGU  846  5216  860319  826472  6.17  0
AGUGCUG  2484  12641  860319  826472  5.09  0
ACCCUGU  1503  5689  860319  826472  3.79  0
AAACGUG  55712  16461  860319  826472  3.38  0
UAUAAAG  3002  9702  860319  826472  3.23  0
GAGAACU  1547  4976  860319  826472  3.22  0
UUGCACG  1845  5561  860319  826472  3.01  0
GAGGUAG  29280  10258  860319  826472  2.85  0
CAGUGCA  3596  8559  860319  826472  2.38  0
AAAGCUG  16448  7613  860319  826472  2.16  0
AGUGCAA  4463  9594  860319  826472  2.15  0
UCAAGUA  6898  14089  860319  826472  2.04  0
GCUACAU  47591  24280  860319  826472  1.96  0
GUAAACA  8789  15479  860319  826472  1.76  0
GAGGGGC  17586  10518  860319  826472  1.67  0
AGCUUAU  65376  39424  860319  826472  1.66  0
CUGGACU  17819  25378  860319  826472  1.42  0
GCAGCAU  153193  187218  860319  826472  1.22  0
GCGGGGC  5220  1956  860319  826472  2.67  8.76E-305
CUCAAAC  4306  8173  860319  826472  1.90  1.62E-296
UGGAUAA  1770  233  860319  826472  7.60  1.00E-275
GGAAUGU  9601  5384  860319  826472  1.78  4.47E-223
AAAGUGC  8585  12847  860319  826472  1.50  8.91E-221
UAAGCCA  2225  550  860319  826472  4.05  7.95E-220
GGUUGGG  1037  199  860319  826472  5.21  8.95E-127
UCAACAG  1003  2283  860319  826472  2.28  1.83E-121
UGUGCGU  129  747  860319  826472  5.79  5.50E-109
GGACGGA  616  69  860319  826472  8.93  4.40E-103
CCCCUGG  1910  773  860319  826472  2.47  3.31E-97
UCCCUUU  837  1862  860319  826472  2.22  9.74E-95
AGGAGCU  705  1649  860319  826472  2.34  5.90E-92
UAAACGU  759  165  860319  826472  4.60  1.60E-83
UCACAGU  719  1573  860319  826472  2.19  2.42E-77
CGGAUCC  115  578  860319  826472  5.03  2.41E-76
GCUCGGU  1793  817  860319  826472  2.19  8.24E-72
AUGGAUA  322  16  860319  826472  20.13  1.18E-69
UAAUAUC  0  228  860319  826472  n/a  2.34E-68
GAGAUGA  412  1036  860319  826472  2.51  8.38E-65
CACAGUG  6807  8616  860319  826472  1.27  9.60E-62
CUUGGAG  0  193  860319  826472  n/a  1.63E-57
AGUGCUU  8315  6098  860319  826472  1.36  8.08E-55
CUUUAAC  860  315  860319  826472  2.73  3.22E-51
GGCGGAG  1303  602  860319  826472  2.16  3.00E-50
UUUCAGU  4065  2680  860319  826472  1.52  1.72E-49
AGCCCUU  1311  612  860319  826472  2.14  2.02E-49
AUGGCUU  350  834  860319  826472  2.38  2.40E-47
UAGAGGA  277  29  860319  826472  9.55  5.51E-47
CUGCAGU  340  815  860319  826472  2.40  8.22E-47
AACGGAA  6891  5034  860319  826472  1.37  1.63E-46
GGGACGG  1059  472  860319  826472  2.24  4.81E-44
CAAGAGC  240  643  860319  826472  2.68  6.25E-44
ACAGUAC  2763  3740  860319  826472  1.35  7.14E-40
AGCAGCG  548  175  860319  826472  3.13  1.87E-39
UUAACAU  1179  579  860319  826472  2.04  3.85E-39
UUGAUAA  0  127  860319  826472  n/a  4.54E-37
GGAUUGU  216  22  860319  826472  9.82  1.01E-36
AGUGCCG  19  194  860319  826472  10.21  2.94E-36
UGCCGCC  11  164  860319  826472  14.91  9.64E-35
GACCUAU  545  1017  860319  826472  1.87  1.72E-34
AGACGGG  56  273  860319  826472  n/a  1.73E-34
CCUUCAU  1246  660  860319  826472  1.89  1.15E-33
ACCCAUU  106  350  860319  826472  3.30  7.40E-31
AAGCUGG  810  374  860319  826472  2.17  9.93E-31
GUGACUG  16  160  860319  826472  10.00  1.61E-29
CCUGUAC  529  200  860319  826472  2.65  1.94E-29
CCCUGUA  40  216  860319  826472  5.40  3.77E-29
GAAGAUC  394  124  860319  826472  3.18  1.50E-28
ACAUUCA  3751  4644  860319  826472  1.24  6.63E-28
UGACCUA  2439  3150  860319  826472  1.29  4.08E-25
GUGCAAA  755  1198  860319  826472  1.59  8.80E-25
CAAAGCA  1176  678  860319  826472  1.73  4.92E-24
CGCACUG  490  201  860319  826472  2.44  2.68E-23
CAGGAAC  531  902  860319  826472  1.70  3.38E-23
GAGUUGA  3  93  860319  826472  31.00  3.11E-22
UCCAGUU  85  269  860319  826472  3.16  3.65E-22
AGCUCGG  700  352  860319  826472  1.99  2.77E-21
AACGUGG  221  53  860319  826472  4.17  2.77E-21
UAUAAUA  3711  4467  860319  826472  1.20  3.20E-21
CCAGUGU  0  73  860319  826472  n/a  2.44E-20
UGGCACA  93  269  860319  826472  2.89  1.30E-19
ACUAGAU  514  238  860319  826472  2.16  6.75E-19
GGUGGGG  223  61  860319  826472  3.66  1.76E-18
CAAGACU  115  14  860319  826472  8.21  2.78E-17
ACGUCAU  517  812  860319  826472  1.57  1.10E-15
GGGUUGG  184  49  860319  826472  3.76  1.88E-15
AAGACUG  96  10  860319  826472  9.60  3.14E-15
GUGCCGC  11  93  860319  826472  8.45  3.19E-15
UCAUACA  0  54  860319  826472  n/a  1.88E-14
ACACACC  6  76  860319  826472  12.67  2.07E-14
AAUACUG  3266  3839  860319  826472  1.18  2.42E-14
GGCAAGA  585  316  860319  826472  1.85  3.86E-14
GUAGUCU  2  62  860319  826472  31.00  4.59E-14
CGGGGCU  203  67  860319  826472  3.03  6.61E-13
CCAGUGC  52  0  860319  826472  n/a  7.08E-13
GGCAGUG  2741  2084  860319  826472  1.32  7.16E-13
UAGACUG  2781  2119  860319  826472  1.31  8.33E-13
AAUGGAU  74  6  860319  826472  12.33  1.81E-12
CCCCAGU  44  148  860319  826472  3.36  2.06E-12
CUGUAGU  472  722  860319  826472  1.53  2.07E-12
AAAGCAC  285  120  860319  826472  2.38  2.82E-12
GAGACCU  248  440  860319  826472  1.77  3.44E-12
GGGGCAG  156  44  860319  826472  3.55  4.72E-12
CAUGGGU  49  0  860319  826472  n/a  5.41E-12
CGUGUCU  414  211  860319  826472  1.96  1.84E-11
AAGCCAG  116  27  860319  826472  4.30  1.08E-10
UAUCAGA  1019  1328  860319  826472  1.30  2.04E-10
UUUUUGC  232  95  860319  826472  2.44  3.15E-10
CACAUUA  0  40  860319  826472  n/a  4.08E-10
ACGUGGA  130  36  860319  826472  3.61  4.65E-10
UGCACGG  29  109  860319  826472  3.76  5.88E-10
UAAACAU  159  302  860319  826472  1.90  1.27E-09
GGCACAC  117  244  860319  826472  2.09  1.42E-09
UUGGAGG  419  627  860319  826472  1.50  1.45E-09
UAUAGGG  15  79  860319  826472  5.27  2.27E-09
AGCCAUG  40  0  860319  826472  n/a  2.44E-09
UGUGCUU  2  45  860319  826472  22.50  4.22E-09
UAGCUUA  124  36  860319  826472  3.44  6.27E-09
CAGCAUU  949  1226  860319  826472  1.29  7.05E-09
AUCGUAC  0  36  860319  826472  n/a  7.08E-09
CCCUGAG  721  969  860319  826472  1.34  7.20E-09
GGGCCCC  74  12  860319  826472  6.17  1.21E-08
AAGUUGU  2  43  860319  826472  21.50  1.60E-08
UCACCAC  217  95  860319  826472  2.28  4.49E-08
CUACAUC  304  155  860319  826472  1.96  4.71E-08
AUCCUUG  40  119  860319  826472  2.98  4.89E-08
AACAUUC  4  47  860319  826472  11.75  7.50E-08
UGCCCUG  43  2  860319  826472  21.50  8.01E-08
GAAUUGU  21  84  860319  826472  4.00  1.01E-07
GGCUCUG  49  4  860319  826472  12.25  1.01E-07
GUGCAAU  72  164  860319  826472  2.28  2.38E-07
GCUUAUC  520  321  860319  826472  1.62  2.98E-07
UAAUGCU  139  50  860319  826472  2.78  3.23E-07
CAGGGAU  80  18  860319  826472  4.44  3.93E-07
AUUGCAC  137973  135989  860319  826472  1.01  4.83E-07
ACUGGCC  458  278  860319  826472  1.65  1.02E-06
AAAGCGC  654  862  860319  826472  1.32  1.05E-06
GAAACAU  2  36  860319  826472  18.00  1.66E-06
AGCUGCC  442  269  860319  826472  1.64  2.67E-06
GUAAGUG  229  115  860319  826472  1.99  6.19E-06
AAGUCAC  369  217  860319  826472  1.70  6.27E-06
GACCCUG  122  45  860319  826472  2.71  9.34E-06
GGAGGAA  57  10  860319  826472  5.70  9.50E-06
CACUCCU  134  236  860319  826472  1.76  1.18E-05
CUACAUU  1042  765  860319  826472  1.36  1.55E-05
AGCCAGG  27  0  860319  826472  n/a  1.72E-05
AACACUG  105  197  860319  826472  1.88  1.80E-05
AUAUAAC  0  25  860319  826472  n/a  1.81E-05
ACCCGUA  1637  1900  860319  826472  1.16  2.05E-05
ACCCGGC  148  63  860319  826472  2.35  2.48E-05
AAUGCCC  1011  744  860319  826472  1.36  3.31E-05
UUGGCAC  63  137  860319  826472  2.17  3.53E-05
UACUGCA  3  34  860319  826472  11.33  4.30E-05
AAGCUCG  112  42  860319  826472  2.67  4.46E-05
UUUAACA  41  5  860319  826472  8.20  5.62E-05
GGAGAGA  2555  2092  860319  826472  1.22  6.37E-05
AACACCA  144  63  860319  826472  2.29  6.62E-05
CUACAAC  25  0  860319  826472  n/a  6.75E-05
AUCAUAC  0  23  860319  826472  n/a  7.55E-05
UUUUCAU  203  314  860319  826472  1.55  9.19E-05
UAUUGCA  275  156  860319  826472  1.76  0.000100705
AGAACUG  9  47  860319  826472  5.22  0.000111875
ACAUUAC  0  22  860319  826472  n/a  0.000154039
CACAGCU  0  22  860319  826472  n/a  0.000154039
CACGGUA  0  22  860319  826472  n/a  0.000154039
AUCAUUC  2  29  860319  826472  14.50  0.000161419
UGAACGG  11  49  860319  826472  4.45  0.00021004
GUGACAG  88  166  860319  826472  1.89  0.000210775
CUUUGGU  737  527  860319  826472  1.40  0.0002156
GAGAACC  206  108  860319  826472  1.91  0.000233133
CCUGGCA  1427  1117  860319  826472  1.28  0.000301431
UGUACAU  0  21  860319  826472  n/a  0.000314382
CACAGGG  333  204  860319  826472  1.63  0.000321823
AAGAUCU  173  86  860319  826472  2.01  0.000390211
CAAGUAA  84  157  860319  826472  1.87  0.000571587
UAAGACU  37  90  860319  826472  2.43  0.000739075
AUGUGCC  96  171  860319  826472  1.78  0.000840298
CUAGAUU  81  28  860319  826472  2.89  0.001084265
ACUGCAU  2  26  860319  826472  13.00  0.001120023
CUGUGCG  2  26  860319  826472  13.00  0.001120023
ACGCUCA  84  30  860319  826472  2.80  0.001186869
UUGUUCG  431  565  860319  826472  1.31  0.001187676
UGGAGGA  26  72  860319  826472  2.77  0.001289031
AGCAUUG  270  163  860319  826472  1.66  0.002278633
AUCUGGA  235  137  860319  826472  1.72  0.002768409
AGACUGG  57  16  860319  826472  3.56  0.002930674
GUAACAG  15  51  860319  826472  3.40  0.003219056
CGAUUGC  19  58  860319  826472  3.05  0.003246499
GGGGCUG  72  25  860319  826472  2.88  0.003559141
CUAAGCC  60  18  860319  826472  3.33  0.003716032
AAAAGCU  125  59  860319  826472  2.12  0.004052592
GGCUCAG  943  1111  860319  826472  1.18  0.004326709
CUCACAC  660  805  860319  826472  1.22  0.005340937
UACCCUG  0  17  860319  826472  n/a  0.005454676
AUAAUAC  14  48  860319  826472  3.43  0.005639196
ACUGCAA  3  26  860319  826472  8.67  0.006144601
ACUCCUC  54  108  860319  826472  2.00  0.006407272
GAAUGUA  105  47  860319  826472  2.23  0.006947813
AAGGACA  80  31  860319  826472  2.58  0.009501052
UCAACGG  181  101  860319  826472  1.79  0.009508369
GAGUGUG  35  79  860319  826472  2.26  0.013463942
AUAAAGC  20  56  860319  826472  2.80  0.015650839
GACUGGG  17  0  860319  826472  n/a  0.016244698
GGGGUUU  61  21  860319  826472  2.90  0.020082178
ACAGCAG  0  15  860319  826472  n/a  0.022720944
AGCACCA  4400  3850  860319  826472  1.14  0.024924325
CUCAAAU  2  21  860319  826472  10.50  0.026958039
UCUUGAG  77  31  860319  826472  2.48  0.027743405
UUCUAAU  64  117  860319  826472  1.83  0.027744697
UAGAUAA  59  110  860319  826472  1.86  0.029032117
CUUCAUG  10  38  860319  826472  3.80  0.031394002
AGGGGCA  190  112  860319  826472  1.70  0.032600265
GUGCUUC  349  237  860319  826472  1.47  0.03499445
UGAAAUG  70  124  860319  826472  1.77  0.037971577
ACUUUAA  70  28  860319  826472  2.50  0.045295024
CAGGCUC  142  77  860319  826472  1.84  0.045618849

Appendix B: Supplementary Methods and accompanying data

Comparing miRNA expression levels using five different metrics

We assessed the use of five different metrics to summarize miRNA expression
levels based on the two sequence libraries. The first three metrics are calculated from sequences with a perfect match, a perfect 3' match, or a perfect 5' match to the miRBase reference sequence. The other two were calculated using either the most common sequence or the sum of all sequences representing a single arm of a particular miRNA. The pairwise correlation between the hESC and EB libraries using each of the five metrics is given in Table A1 (below). The most robust discrimination between these libraries, manifested as a low R2 value, resulted from comparison using the most frequently observed sequence from each arm of every miRNA (R2=0.87, Spearman correlation). Using only the sequences matching the miRBase reference sequence gave reasonable, though weaker, discrimination between libraries (R2=0.92). The 'sum of all tags' metric showed an even lower correlation (0.66 in Table A1), suggesting it may provide better discrimination between libraries. However, this metric did not perform well when used for clustering libraries of different origin in both human and mouse (not shown).

Table A1: Spearman correlation between hESC and EB libraries using each of the five metrics for miRNA expression

                      hESC      hESC most    hESC sum   hESC 5'         hESC 3'
                      miRBase   common tag   of tags    miRBase match   miRBase match
EB miRBase            0.92      0.68         0.57       0.85            0.93
EB most common tag    0.76      0.87         0.64       0.89            0.85
EB sum of tags        -0.16     -0.09        0.66       -0.16           -0.12
EB 5' miRBase match   0.77      0.70         0.29       0.92            0.84
EB 3' miRBase match   0.67      0.68         0.66       0.85            0.95

Detecting RNA editing events

The two types of RNA editing that have been characterized in mammals are adenine and cytosine deamination reactions. The former is observed in sequence reads as an A to G transition, while the latter is observed as a C to T transition.
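As an illustration of how these two signatures could be recognised in aligned reads, the following Python sketch labels each mismatch between a read and its reference. It is not the procedure used in this work: the function name, the labels and the example sequences are hypothetical, and it assumes an ungapped alignment in which read and reference are given on the same strand as the transcript.

```python
# Mismatch signatures expected from the two deamination reactions,
# written as (reference base, read base) on the DNA alphabet.
# The label strings are illustrative, not terminology from the thesis.
EDIT_SIGNATURES = {
    ("A", "G"): "A-to-I candidate",
    ("C", "T"): "C-to-U candidate",
}


def classify_mismatches(reference, read):
    """Return (position, ref_base, read_base, label) for every mismatch.

    Positions are 1-based. Assumes both sequences have equal length,
    contain no gaps, and are oriented like the transcript.
    """
    if len(reference) != len(read):
        raise ValueError("reference and read must have equal length")
    calls = []
    pairs = zip(reference.upper(), read.upper())
    for pos, (ref_base, read_base) in enumerate(pairs, start=1):
        if ref_base != read_base:
            label = EDIT_SIGNATURES.get((ref_base, read_base), "other")
            calls.append((pos, ref_base, read_base, label))
    return calls
```

A mismatch labelled as a candidate here is only consistent with editing; sequencing error, polymorphism or contamination would produce the same signature in a single read.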
To search for evidence of RNA editing, we aligned all previously unaligned reads to the human genome using a local alignment tool that allows up to two mismatches (ELAND, unpublished). Taking all the reads that aligned to miRNA loci using this tool, we produced multiple sequence alignments representing 'modified' versions of each miRNA. These multiple sequence alignments were converted to sequence logos to facilitate visualization of recurrent editing events. Figure A1 shows the identification of a known A->I editing site in hsa-miR-151 (Blow et al. 2006) as well as a potential C->U editing site. It is also possible that the latter results from minor contamination by the mouse fibroblast feeder cells on which the hESCs were grown, considering that the mismatching nucleotide matches the mouse genome sequence.

Figure A1: Sequence logo representing hsa-miR-151 reads with imperfect genome alignments
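The step from aligned reads to a logo rests on per-position base frequencies. The Python sketch below shows one way such frequencies could be tallied and screened for recurrent non-reference bases. It is an assumption-laden illustration rather than the tooling used here: the function names, the `min_fraction` threshold and its default value are hypothetical, and the reads are assumed to be ungapped and already trimmed to the same coordinates as the reference.

```python
from collections import Counter


def column_frequencies(aligned_reads):
    """Per-column base frequencies for equal-length, ungapped reads."""
    if not aligned_reads:
        raise ValueError("no reads supplied")
    length = len(aligned_reads[0])
    if any(len(read) != length for read in aligned_reads):
        raise ValueError("all reads must have the same length")
    freqs = []
    for column in zip(*aligned_reads):
        counts = Counter(column)
        total = sum(counts.values())
        freqs.append({base: n / total for base, n in counts.items()})
    return freqs


def recurrent_variants(aligned_reads, reference, min_fraction=0.1):
    """List (position, ref_base, alt_base, fraction) for non-reference
    bases seen in at least min_fraction of reads (hypothetical cutoff)."""
    freqs = column_frequencies(aligned_reads)
    if len(reference) != len(freqs):
        raise ValueError("reference length must match read length")
    hits = []
    for pos, (ref_base, column) in enumerate(zip(reference, freqs), start=1):
        for base, fraction in column.items():
            if base != ref_base and fraction >= min_fraction:
                hits.append((pos, ref_base, base, fraction))
    return hits
```

Calling `recurrent_variants` on the reads aligned to one miRNA locus would return the positions a logo would show as mixed columns; any real cutoff would need to account for the sequencing error rate.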
