UBC Theses and Dissertations

Analysis of undifferentiated human embryonic stem cell lines using Serial Analysis of Gene Expression Schnerch, Angelique 2005


ANALYSIS OF UNDIFFERENTIATED HUMAN EMBRYONIC STEM CELL LINES USING SERIAL ANALYSIS OF GENE EXPRESSION

by

ANGELIQUE SCHNERCH

B.Sc., University of British Columbia, 2000

A THESIS SUBMITTED IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE in THE FACULTY OF GRADUATE STUDIES (Medical Genetics)

THE UNIVERSITY OF BRITISH COLUMBIA

December 2005

© Angelique Schnerch, 2005

Abstract

Since the first reported isolation of immortalized human embryonic stem cell (hESC) lines in 1998 (Thomson et al., 1998), methods for directing their differentiation to various specialized derivatives have been extensively demonstrated and provide hope for future therapeutic applications. Characterization of the key molecular factors governing hESC self-renewal and pluripotency is necessary for ongoing efforts in deriving therapeutically useful cell types and in modelling human embryonic and oncogenic development. To this end, the GSC Gene Expression Laboratory has generated 11 global gene expression profiles of 8 undifferentiated hESC lines (NIH stem cell registry codes BG01, ES03, ES04, WA01, WA07, WA09, WA13, and WA14) using long Serial Analysis of Gene Expression (long SAGE). I analysed a database of the hESC long SAGE data consisting of 2,613,475 total tags corresponding to 379,465 transcripts. By employing various comprehensive tag-to-gene mapping resources I have provided a detailed survey of the genes expressed and differentially expressed in multiple hESC lines. A suite of stemness-associated factors was observed in the hESC SAGE data. We also observed molecular components of several pathways involved in embryonic development, the cell cycle, and programmed cell death. Comprehensive interspecies pair-wise comparisons between the hESC libraries and publicly available human SAGE libraries identified up-regulated transcripts in embryonic stem cells.
A robust computational approach was designed that identified tags expressed solely in hESCs compared to 247 normal and malignant cells. A key feature of the approach was to isolate tags derived from sequences conserved across the human, mouse and rat genomes; this led to the identification of 301 candidate novel transcripts that may be integral to human pluripotent stem cells. These studies represent an important step in the development of high throughput approaches to the analysis of early human developmental processes and will be a strategic element in more comprehensive interspecies comparisons of ES cells to identify preserved control mechanisms.

ANALYSIS OF UNDIFFERENTIATED HUMAN EMBRYONIC STEM CELL LINES USING SERIAL ANALYSIS OF GENE EXPRESSION
Abstract
Table of Contents
List of Tables
List of Figures
List of Abbreviations
Acknowledgements
Dedication
1. Introduction
1.1 Human embryonic stem cells
1.1.1 Overview of human embryonic stem cell biology
1.1.1.1 Human embryonic stem cells and cell cycle regulation
1.1.1.2 Human embryonic stem cells and cancer
1.1.2 Contribution of mouse embryonic stem cells to the understanding of human embryonic stem cells
1.1.2.1 LIF/gp130 signalling in mESC
1.1.2.2 Transcriptional regulators of pluripotency in mESC
1.2 Global gene expression profiling approaches in stem cells
1.2.1 Overview of high throughput gene expression profiling platforms
1.2.2 Large-Scale Genomic Approaches to the Study of Human Embryonic Stem Cells
1.3 Objectives
1.4 Specific aims and rationale
2.
Catalogue of the undifferentiated hESC transcriptome
2.1 Introduction
2.2 Methods
2.2.1 Cell Culture
2.2.2 SAGE Library Construction
2.2.3 SAGE Library Sequencing
2.2.4 Tag-to-gene mapping
2.2.4.1 Comprehensive mapping of SAGE tags
2.2.4.2 Assigning functional annotations
2.3 Results and discussion
2.3.1 Database of expressed long and regular SAGE tags
2.3.2 Analysis of unmapped tags
2.3.2.1 Species ambiguous tags
2.3.3 Detection of stem cell associated genes in WA09
2.3.4 Developmental signalling pathway expression in embryonic stem cells
2.3.5 Cell cycle regulation and programmed cell death pathways in hESCs
2.3.6 HESC gene ontology
2.4 Conclusions
3. Comparison between hESC and cancer/non-cancer differentiated cells/tissues
3.1 Introduction
3.2 Methods
3.2.1 SAGE library downloads
3.2.2 Cluster analysis
3.2.2.1 Random sampling script
3.2.3 Differential gene expression analysis
3.3 Results and discussion
3.3.1 Pair-wise library comparisons
3.3.2 Isolation of differentially expressed tags
3.4 Conclusions
4. Computational approach for the identification of candidate novel genes in undifferentiated hESC SAGE libraries
4.1 Introduction
4.2 Methods
4.2.1 SAGE library acquisition and tag processing
4.2.2 SAGE tag processing
4.2.3 Comprehensive mapping of SAGE tags (CMOST)
4.2.4 Tag mapping database construction
4.2.5 BLAST analysis
4.2.6 Mouse tag to gene mapping
4.3 Results and Discussion
4.3.1 Selection of tags for the isolation of candidate novel genes
4.3.2 Mouse annotation of hESC tags
4.4 Conclusions
List of Appendices
Bibliography
List of Tables

Table 1 In-vitro/vivo differentiation of Human Embryonic Stem Cells
Table 2 Candidate ES pluripotency genes confirmed or identified by global gene expression profiling (confirmed/identified by: down-regulation upon differentiation, high expression pattern in ES compared to differentiated tissues and/or RT-PCR)
Table 3 Summary of hESC long SAGE dataset
Table 4 Summary of CMOST data sources for tag-to-gene mapping
Table 5 Top 10 expressed SAT tags
Table 6 Short-list of most highly expressed stemness associated genes in the WA09 line (absolute tag counts listed; WA09 library size equals 441,795 tags sequenced)
Table 7 Detection of genes up-regulated in pluripotent stem cells and potential hESC markers (pluripotent stem cell associated genes, PAGs)
Table 8 Summary of Wnt ligand and receptor expression in the WA09 long SAGE library, a pooling of 11 hESC libraries generated from 8 stem cell lines (hESCs), and a pooling of 12 normal adult and fetal samples representative of 4 tissues (Differentiated)
Table 9 Expression of Wnt signalling components in the WA09 long SAGE library, a pooling of 11 hESC libraries generated from 8 stem cell lines (hESCs), and a pooling of 12 normal adult and fetal samples representative of 4 tissues (Differentiated)
Table 10 Differential gene expression of Wnt signalling components compared between the hESC metalibrary and differentiated metalibrary
Table 11 Differential gene expression of Wnt signalling components compared between the WA09 library and differentiated metalibrary
Table 12 Expression of the TGFβ signalling network ligands, receptors and transcriptional targets in the WA09 long SAGE library, a pooling of 11 hESC libraries generated from 8 stem cell lines (hESCs), and a pooling of 12 normal adult and fetal samples representative of 4 tissues (Differentiated)
Table 13 Expression of the TGFβ signalling network activators and inhibitors in the WA09 long SAGE library, a pooling of
11 hESC libraries generated from 8 stem cell lines (hESCs), and a pooling of 12 normal adult and fetal samples representative of 4 tissues (Differentiated)
Table 14 Differential gene expression of TGFβ signalling components compared between the hESC metalibrary and differentiated metalibrary
Table 15 Differential gene expression of TGFβ signalling components compared between the WA09 library and differentiated metalibrary
Table 16 LIF signalling network expression in human embryonic stem cells
Table 17 Differential gene expression of the cell cycle compared between the hESC metalibrary and differentiated metalibrary
Table 18 Differential gene expression of DNA repair mechanisms compared between the hESC metalibrary and differentiated metalibrary
Table 19 Differential gene expression of apoptosis pathways compared between the hESC metalibrary and differentiated metalibrary
Table 20 Genes that were differentially expressed in the autophagic cell death pathway compared between the hESC metalibrary and differentiated metalibrary
Table 21 Pearson correlation coefficients (r) between human embryonic stem cell line SAGE expression profiles
Table 22 Comparisons of short and extracted short SAGE libraries measured by Pearson correlation (r)
Table 23 Comparisons of long SAGE libraries measured by Pearson correlation (r)
Table 24 The intersection of the top most highly expressed tags in MAPC and hESC libraries
Table 25 Calculation of Pearson correlation between the hESC metalibrary and MAPC4 using Correlate (written by Allen Delaney, Gene Expression Informatics)
Table 26 Summary of the total number of tag sequences up- or down-regulated in the hESC SAGE libraries compared to ectoderm-derived normal CGAP (n-CGAP) libraries, mesoderm-derived n-CGAP libraries, and endoderm-derived n-CGAP libraries
Table 27 Up-regulated gene list in u-hESC metalibrary compared to n-CGAP libraries
Table 28 Genes up-regulated in hESC compared to ectoderm-derived libraries
Table 29 Differentially expressed in hESC versus ectoderm-derived libraries only
Table 30 Genes up-regulated in hESC compared to mesoderm-derived libraries
Table 31 Differentially expressed in hESC versus mesoderm-derived libraries only
Table 32 Genes up-regulated in hESC compared to endoderm-derived libraries
Table 33 Differentially expressed in hESC versus endoderm-derived libraries only
Table 34 CMOST Best Mapping results for 20,047 novel hESC tags
Table 35 Top 25 BLASTN hits (mouse MAF genomic regions against mouse RefSeq transcripts)
Table 36 Candidate mouse orthologous genomic sequences were analyzed using BLASTN against mouse RefSeq transcripts
Table 37 25 mouse transcripts identified by BLASTN analysis of candidate orthologous mouse sequences against mouse RefSeq transcripts
Table 38 TBLASTX analysis of candidate human genomic regions derived from the UCSC multiple alignment format (MAF)
Table 39 TBLASTX mouse hits associated with development/differentiation, proliferation, transcriptional regulation (DNA dependent and epigenetically), genomic stability, and cell cycle checkpoints

List of Figures

Figure 1 Differentiation of Human Embryonic Stem Cells
Figure 2 Retinoblastoma cell cycle control pathway
Figure 3 Jak/Stat Pathway
Figure 4 SAGE protocol
Figure 5 CMOST schematic of methodology adapted from http://www.bcgsc.ca/downloads/genex/DS/cmost_plugin_userdoc.htm
Figure 6 Distribution of WA09 tags sequenced (log scale Y-axis) classed according to gene expression level (X-axis)
Figure 7 The intersection between non-redundant human and mouse genomic sequences (UCSC) totaled 289,453 species ambiguous tags (SATs)
Figure 8 Detection of pluripotency genes and markers of differentiation
Figure 9 The Wnt signalling pathway expression in hESC long SAGE libraries (hESC metalibrary) and differentiated normal CGAP long SAGE libraries (n-CGAP metalibrary)
Figure 10 Expression of the TGFβ signalling pathway in hESC long SAGE libraries (hESC metalibrary) and differentiated normal CGAP long SAGE libraries (n-CGAP metalibrary)
Figure 11 The expression of LIF signalling pathway components in a pooling of 11 hESC SAGE libraries and a pooling of 12 normal adult and fetal SAGE libraries
Figure 12 Cell Cycle Expression. Cell cycle genes detected in the hESC metalibrary and differentiated metalibrary
Figure 13 Expression of DNA repair machinery in the u-hESC and n-CGAP metalibraries
Figure 14 Apoptotic programmed cell death pathway expression in hESC compared with n-CGAP
Figure 15 Autophagic cell death
Figure 16 Gene Ontology (GO) "slim" molecular
functions expressed in the hESC metalibrary, hESC cell line WA09, and the differentiated metalibrary (n-CGAP metalibrary)
Figure 17 Trilaminar embryonic disc
Figure 18 Matrix file format and the fitch settings used in this analysis
Figure 19 Hierarchical clustering of short and extracted normal SAGE libraries
Figure 20 Hierarchical clustering of long SAGE libraries from CGAP and our own database of normal and malignant cells
Figure 21 Incremental random sampling of WA09(L) extracted short SAGE total tags
Figure 22 Incremental random sampling of WA09(L) extracted short SAGE total tags and normal differentiated CGAP (n-CGAP) extracted short SAGE libraries
Figure 23 Ensembl BlastView of sequence esc01
Figure 24 Ensembl BlastView of sequence esc02
Figure 25 The computational approach for the selection of candidate long SAGE tags to detect novel transcripts in hESC
Figure 26 Venn diagram listing the tag types for the hESC, cancer and normal SAGE library comparisons
Figure 27 Novel gene discovery candidate tags distribution of absolute gene expression levels
Figure 28 The computational method for annotating hESC candidate tags

List of Abbreviations

Abbreviation  Name
BCCA  British Columbia Cancer Agency
BER  Base excision repair
BLAST  Basic Local Alignment Search Tool
BLASTN  Nucleotide-nucleotide BLAST
BMP  Bone morphogenetic protein
bp  base pairs
c-CGAP  Malignant CGAP SAGE libraries
cDNA  Complementary DNA
CGAP  Cancer Genome Anatomy Project
CHEK2  CHK2 checkpoint homolog (S.
pombe)
CMOST  Comprehensive mapping of SAGE tags
CNS  Central nervous system
DNA  Deoxyribonucleic acid
DNMT3B  DNA (cytosine-5-)-methyltransferase 3 beta
DPPA4  Developmental pluripotency associated 4
EB  Embryoid body
ECC  Embryonic carcinoma cell
e.g.  Example given
ESC  Embryonic stem cell
EST  Expressed sequence tags
FC  Fold change
FGF  Fibroblast growth factor
FOX  Forkhead box transcription factors
G1  Gap 1
G2  Gap 2
GEO  Gene Expression Omnibus
GLGI  Generation of longer cDNA fragments from SAGE tags for gene identification
GNL3  Nucleostemin
GO  Gene Ontology
gp130/IL6ST  Interleukin-6 signal transducer
GSC  Genome Sciences Centre
hCG  Human chorionic gonadotropin
hECC  Human embryonic carcinoma cell
hESC  Human embryonic stem cell
HMG  High mobility group
HRR  Homologous recombination repair
HSC  Haematopoietic stem cell
ICM  Inner cell mass
IL-6  Interleukin-6 cytokines
irr-MEFs  Irradiated mouse embryonic fibroblasts (inactive MEFs)
JAK  Janus tyrosine kinases
Kb  kilobase
LIF  Leukemia inhibitory factor
LIFR  Leukemia inhibitory factor receptor
ln  Natural log
M  Mitotic phase
MAF  University of California Santa Cruz (UCSC) multiple alignment format
MAPC  Multipotent adult progenitor cell
Mb  megabase
MEFs  Mouse embryonic fibroblasts (mouse feeder layers)
mESC  Mouse embryonic stem cells
MGC  Mammalian Gene Collection
MPSS  Massively parallel signature sequencing
n/a  Not applicable/available
NCBI  National Center for Biotechnology Information
n-CGAP  Normal adult and fetal CGAP SAGE libraries
NER  Nucleotide excision repair
NHEJ  Non-homologous end joining
NIH  National Institutes of Health
ORFs  Open reading frames
PAG  Pluripotent stem cell associated genes
PAGE  Polyacrylamide gel electrophoresis
PBS  Phosphate buffer solution
PCR  Polymerase chain reaction
Perl  Practical extraction and report language
PGC  Primordial germ cell
PNS  Peripheral nervous system
POU5F1  POU (Pit Oct Unc) domain, class 5, transcription factor 1
preHEP  Pre-hepatocyte-like cells
preNEU  Pre-neuronal-like cells
r  Pearson correlation coefficient
r²  Coefficient of determination
RACE  3'/5' rapid amplification of cDNA ends
RB  Retinoblastoma
RefSeq  Reference Sequence project
RNA  Ribonucleic acid
rRNA  Ribosomal RNA
RT-PCR  Reverse transcriptase polymerase chain reaction
S  Synthesis phase
SAGE  Serial analysis of gene expression
SAT  Species ambiguous tags
SHP  Protein tyrosine phosphatases
siRNAs  Small-interfering RNA molecules
SOX2  SRY (sex determining region Y)-box 2
SSEA  Stage specific embryonic antigen
STAT  Signal transducer and activator of transcription
TBLASTX  Translated query vs. translated database
TDGF1  Teratocarcinoma-derived growth factor 1
TERT  Telomerase reverse transcriptase
TGFβ  Transforming growth factor beta
TRA  Tumour rejection antigen
TS  Trophoblast cells
UCSC  University of California Santa Cruz
u-hESC  undifferentiated human embryonic stem cells
UTF1  Undifferentiated embryonic stem cell transcription factor
UTR  Untranslated region
WNT  Wingless-type MMTV integration site family
ZFP42  Zinc finger protein 42

Acknowledgements

Dr. Marco Marra and Dr. Steven Jones, both for supporting and directing my research during my graduate studies at the BCCA GSC. I sincerely thank you for encouraging me to pursue my MSc and providing me the opportunity to learn more about bioinformatics and its application to global transcription profiling in Medical Genetics. Thanks especially for bearing with me as I continued to add analysis after analysis, page after page to my thesis.

The BCCA Genome Sciences Centre (GSC). Many thanks to the laboratory and bioinformatics groups at the GSC. Without your efforts, both in generating high quality and high throughput data and analysis tools, my work would not be possible.
Many individuals have provided me with invaluable advice during the course of my studies and have succeeded in fostering a collaborative and positive work environment.

Dr. Pamela Hoodless and Dr. Keith Humphries for their support and input as members of my thesis advisory committee.

Genome British Columbia and the National Cancer Institute (USA) for funding this project.

Dedication

This work is dedicated to my parents, Maria and Donald Schnerch, for their support and encouragement in every aspect of my personal and professional development. This work is also dedicated to Dr. Brian Yang for his caring, patience, and unconditional support during my successes and struggles as a graduate student.

1. Introduction

1.1 Human embryonic stem cells

1.1.1 Overview of human embryonic stem cell biology

Stem cells are defined in part by their ability to give rise to progeny of more restricted developmental potential. The potential to differentiate is greatest in the totipotent fertilized egg, the ancestor of all embryonic and extra-embryonic cell types. This potential progressively decreases according to a developmental timeline. The pluripotent stem cell has a more restricted developmental potential. Members of this category include primordial germ cells (PGC), embryonic carcinoma cells (ECC) and embryonic stem cells (ESC). Pluripotent stem cells may differentiate to several different cell types of embryonic origin but not extra-embryonic origin, and thus are unable to give rise to entire organisms in and of themselves. In 1998, James A. Thomson first reported the isolation of 5 immortalized human embryo-derived pluripotent cell lines, termed human embryonic stem cells (hESCs) (Thomson et al., 1998). HESCs are derived from the inner cell mass of the pre-implantation blastocyst (5 days post-fertilization), which gives rise to the embryo proper. HESCs maintain an undifferentiated state and proliferate for many passages in culture (near indefinitely).
The presence of alkaline phosphatase, high levels of telomerase activity and a sustained normal karyotype are key attributes of the hESC lines derived by Thomson et al. (1998). Additionally, these cells express an array of cell surface antigens and molecular markers responsible for maintenance of an undifferentiated and pluripotent phenotype. The cell surface antigens are the stage specific embryonic antigen (SSEA)-3, SSEA-4, tumour rejection antigen (TRA)-1-60 and TRA-1-81 (Reubinoff et al., 2000; Thomson et al., 1998). Molecular markers of the undifferentiated state consist of a suite of transcription and growth factors, namely: POU domain, class 5, transcription factor 1 (POU5F1 or OCT3/4), teratocarcinoma-derived growth factor 1 (TDGF1 or Cripto), SRY (sex determining region Y)-box 2 (SOX2), fibroblast growth factor 4 (FGF4) and zinc finger protein 42 (ZFP42 or REX1). Characteristic of pluripotent cells, hESCs differentiate into various specialized derivatives of the three embryonic germ layers: endoderm, mesoderm and ectoderm. This capability has been demonstrated extensively (Figure 1; Table 1) (Assady et al., 2001; Carpenter et al., 2004; Lebkowski et al., 2001; Levenberg et al., 2002; Reubinoff et al., 2000; Zhang et al., 2001) and promises to be exploitable in future therapeutic applications, particularly in regenerative medicine.

1.1.1.1 Human embryonic stem cells and cell cycle regulation

Human and mouse embryonic stem cells (mESCs) have unconventional cell cycles compared to somatic cells. MESCs proliferate rapidly, doubling every 8 to 12 hours. HESC cycle times are longer, with a doubling every 35 to 40 hours, yet the cells are highly proliferative nonetheless. Somatic cells are also capable of rapid cycle times. For example, cultured mammalian fibroblasts have an approximate cell-cycle time of 20 hours (Alberts et al., 1998), raising the question: what distinguishes the cell cycle of embryonic stem cells from that of somatic cell types?
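As a rough illustration of the doubling times quoted above, the following Python sketch compares the theoretical fold expansion of each cell type over a fixed window. It assumes ideal exponential growth with no cell death or lag phase, which real cultures do not achieve; the 120-hour window and the function name fold_expansion are illustrative choices, not values or methods taken from this thesis.

```python
def fold_expansion(hours, doubling_time_h):
    """Fold increase in cell number after `hours` of ideal exponential growth.

    Assumes every cell divides once per doubling time and none die
    (a simplifying assumption, not a measured property of any culture).
    """
    return 2 ** (hours / doubling_time_h)


# Hypothetical observation window of 5 days, chosen only for illustration.
WINDOW_H = 120

# Doubling times as quoted in the text (range end points for ESCs).
doubling_times_h = {
    "mESC (8 h)": 8,
    "mESC (12 h)": 12,
    "fibroblast (20 h)": 20,
    "hESC (35 h)": 35,
    "hESC (40 h)": 40,
}

for label, td in doubling_times_h.items():
    fold = fold_expansion(WINDOW_H, td)
    print(f"{label:18s} {WINDOW_H / td:5.1f} doublings, {fold:,.0f}-fold")
```

Under these assumptions a 40-hour doubling time gives three doublings (8-fold expansion) in 120 hours, whereas an 8-hour doubling time gives fifteen doublings, so doubling time alone places the cultured fibroblast between the mouse and human ESC figures.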
The answer can be attributed to alterations in cell cycle regulation. Checkpoints exist before each transition to a new phase of the cell cycle. At these checkpoints a cell is either detained from further progression through the cycle or released to continue its passage if various conditions have been met (e.g., adequate cell size, undamaged DNA and a favourable environment). The retinoblastoma (RB) protein regulates the checkpoint between the G1 and S phases of the cell cycle (Goodrich et al., 1991; Hatakeyama and Weinberg, 1995; Weinberg, 1995) (Figure 2). RB prevents cell cycle progression by inhibition of E2F transcription factors (Dyson, 1998; Fattaey et al., 1993; Helin, 1998; Helin et al., 1993; Johnson et al., 1994; Narita et al., 2003), ultimately preventing cell proliferation. Cell cycle arrest can be relieved upon growth factor signalling and activation of specific cyclins and cyclin-dependent kinases (CDKs) that disrupt the association between RB and E2F by hyper-phosphorylation of RB (Buchkovich et al., 1989). In both human and mouse ESCs, the RB pathway is inactive, resulting in a noticeably shortened G1 phase and rapid cellular proliferation. HESCs may express an inhibitor of RB, termed nucleostemin, preventing cell cycle arrest (Tsai and McKay, 2002). In a recent study of gene expression in hESC, RB, proapoptotic genes and regulators of the p53 pathway were expressed at low levels while positive regulators of the cell cycle were highly expressed (Brandenberger et al., 2004a).

Table 1 In-vitro/vivo differentiation of Human Embryonic Stem Cells. Summary of directed differentiation of hESCs to a number of cells/tissue types representative of endoderm, mesoderm and ectoderm origins.

Cell/tissue type | Reference
Astrocytes | (Zhang et al., 2001)
Bone | (Itskovitz-Eldor et al., 2000; Thomson et al., 1998)
Cardiomyocytes | (Kehat et al., 2003; Mummery et al., 2002; Xu et al., 2002a)
Cartilage | (Itskovitz-Eldor et al., 2000; Thomson et al., 1998)
Embryoid bodies | (Itskovitz-Eldor et al., 2000; Thomson et al., 1998)
Endothelial cells | (Levenberg et al., 2002)
Erythrocytes | (Itskovitz-Eldor et al., 2000; Thomson et al., 1998)
Fetal glomeruli | (Itskovitz-Eldor et al., 2000; Thomson et al., 1998)
Ganglia | (Itskovitz-Eldor et al., 2000; Thomson et al., 1998)
Granulocytes | (Itskovitz-Eldor et al., 2000; Thomson et al., 1998)
Gut | (Itskovitz-Eldor et al., 2000; Thomson et al., 1998)
Hair | (Itskovitz-Eldor et al., 2000; Thomson et al., 1998)
Haematopoietic colony-forming cells | (Kaufman et al., 2001)
Hepatocyte/Hepatocyte-like cells | (Lavon and Benvenisty, 2005; Rambhatla et al., 2003)
Insulin producing beta cells | (Assady et al., 2001)
Keratinizing squamous epithelium | (Itskovitz-Eldor et al., 2000; Thomson et al., 1998)
Macrophage | (Itskovitz-Eldor et al., 2000; Thomson et al., 1998)
Megakaryocytes | (Itskovitz-Eldor et al., 2000; Thomson et al., 1998)
Neural epithelium | (Itskovitz-Eldor et al., 2000; Thomson et al., 1998)
Neural progenitor cells | (Carpenter et al., 2001; Itskovitz-Eldor et al., 2000)
Neurons | (Carpenter et al., 2001; Itskovitz-Eldor et al., 2000; Reubinoff et al., 2000; Schuldiner et al., 2000; Zhang et al., 2001)
Oligodendrocytes | (Zhang et al., 2001)
Osteoblasts | (Sottile et al., 2003)
Respiratory epithelium | (Itskovitz-Eldor et al., 2000; Thomson et al., 1998)
Smooth muscle | (Itskovitz-Eldor et al., 2000; Thomson et al., 1998)
Striated muscle | (Itskovitz-Eldor et al., 2000; Thomson et al., 1998)
Trophoblasts | (Thomson et al., 1998; Xu et al., 2002)
Yolk sac | (Thomson et al., 1998)

Figure 1 Differentiation of Human Embryonic Stem Cells. Embryo-derived pluripotent stem cells have been directed to differentiate to embryoid bodies and derivatives of the three embryonic germ layers.

1.1.1.2 Human embryonic stem cells and cancer

Pluripotent stem cells and cancer cells are both immortal and tumorigenic.
Similarities observed between hESC and cancer may be attributed to the expression of similar sets of genes involved in cell cycle regulation, apoptosis, and cellular senescence. A key step prior to the development of some cancers is the loss of the RB pathway via genetic changes (Classon and Harlow, 2002), while in hESC the RB pathway is naturally inactivated.

Figure 2 Retinoblastoma cell cycle control pathway. [Diagram not reproduced. Labels: G0, G1, S, G2, M; G1/S phase transition; E2F; Differentiation; RB; CDK/Cyclin complex; FGF; G2/M transition; Checkpoint control; DNA repair; DNA replication; Apoptosis.]

Cancer and hESC immortality may also be attributed to telomerase reverse transcriptase (TERT) activity. In postnatal somatic cells, TERT is normally repressed; consequently telomere ends become progressively shorter after each cell division, so that a cell can divide only a finite number of times before it is targeted for cell death. Deregulation of TERT may be involved in oncogenesis. Again, where cancer involves genetic deregulation of key genes, hESCs "naturally" maintain TERT activity in culture; in both cell types the effect is to extend the proliferative lifespan. Similarities between stem cells and tumour cells have led to the hypothesis that deregulated stem cells may lead to cancer development (Brickman and Burdon, 2002; Mullor et al., 2002; Pardal et al., 2003). Thus an additional and equally important advantage arising from the derivation of hESCs is their utility as a model for understanding cancer development, from which novel therapeutic targets may be discovered for use in future cancer treatments.
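The telomere argument made earlier in this section can be sketched as a toy calculation. The Python function below counts how many divisions a cell lineage can undergo before its telomeres fall below a critical length when TERT is repressed, and reports no limit when TERT is active. All lengths and loss rates are hypothetical placeholders chosen as round numbers, not measurements, and the model deliberately ignores every other route to senescence or cell death.

```python
def divisions_until_arrest(telomere_bp, loss_per_division_bp,
                           critical_bp, tert_active=False):
    """Toy model of replicative lifespan limited by telomere shortening.

    Returns the number of divisions possible before the telomere would
    drop below `critical_bp`, or None when TERT is active (no limit under
    this simplified model, since telomere length is assumed maintained).
    """
    if tert_active:
        return None
    divisions = 0
    # Divide only while the next division keeps the telomere at or above
    # the critical length.
    while telomere_bp - loss_per_division_bp >= critical_bp:
        telomere_bp -= loss_per_division_bp
        divisions += 1
    return divisions


# Hypothetical parameters (illustrative only).
START_BP, LOSS_BP, CRITICAL_BP = 10_000, 100, 5_000

somatic = divisions_until_arrest(START_BP, LOSS_BP, CRITICAL_BP)
stem_like = divisions_until_arrest(START_BP, LOSS_BP, CRITICAL_BP,
                                   tert_active=True)
print("TERT repressed:", somatic, "divisions")
print("TERT active:", "no limit in this model" if stem_like is None else stem_like)
```

With these placeholder values the TERT-repressed lineage stops after 50 divisions, while the TERT-active lineage has no division limit, which is the qualitative contrast the text draws between postnatal somatic cells on the one hand and hESCs or cancer cells on the other.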
The promise of hESCs has stimulated interest in thoroughly understanding the biology of the cell lines currently available (http://stemcells.nih.gov/research/registry/; NIH Human Embryonic Stem Cell Registry code: BG01, BG02, BG03, SA01, SA02, ES01, ES02, ES03, ES04, ES05, ES06, MI01, RE03, TE04, TE06, UC01, UC06, WA01, WA07, WA09, WA13, and WA14). Two current areas of study in hESC research are uncovering the gene functions that direct differentiation down specific developmental paths and elucidating the gene functions that maintain hallmark ES attributes.

1.1.2 Contribution of mouse embryonic stem cells to the understanding of human embryonic stem cells

Relatively little is known about the genes governing properties unique to embryonic stem cells. The study of mouse embryonic stem cells (mESC) (Evans and Kaufman, 1981; Martin, 1981) uncovered a handful of genes involved in maintaining pluripotency. A caveat of applying lessons learned in mESC to human studies is that embryonic development and embryonic cell types are not equivalent between mice and humans (Fougerousse et al., 2000; Ginis et al., 2004). Nonetheless, genes uncovered in the mouse have provided a short-list of factors that may be functionally conserved in hESC.

1.1.2.1 LIF/gp130 signalling in mESC

The JAK/STAT signalling pathway is well characterized in mESC and is initiated extrinsically by the leukemia inhibitory factor (LIF) (Smith et al., 1988; Williams et al., 1988) or related family members of the interleukin-6 (IL-6) type cytokines (Burdon et al., 1999a; Burdon et al., 1999b) (Figure 3). Paracrine signalling through LIF is a result of culturing mESC on a monolayer of mouse embryonic fibroblasts. LIF is bound by a cell surface receptor heterodimer composed of the LIF receptor (LIF-R) and gp130. The activated receptor complex subsequently recruits various Janus tyrosine kinases (JAKs).
JAKs, in turn, can recruit the STAT3 (signal transducer and activator of transcription) protein and various SHPs (protein tyrosine phosphatases) to initiate pathways that maintain self-renewal and suppress differentiation. Although hESCs can be similarly cultured on MEFs, they do not require LIF or signalling via gp130 to maintain an undifferentiated state (Reubinoff et al., 2000; Thomson et al., 1998). Downstream of LIF, the extent to which STATs and JAKs affect the hESC phenotype is not fully known.

Figure 3 Jak/Stat Pathway. Depicted are the ligands, receptors and intracellular components of the Jak/Stat pathway. The progression from pathway activation via LIF ultimately to transcriptional activation of stem cell maintenance genes is shown (Heinrich et al., 1998).

1.1.2.2 Transcriptional regulators of pluripotency in mESC

Gene expression analyses of mESCs have identified a number of transcriptional regulators of pluripotency. An established marker for undifferentiated embryonic stem cells in mouse and humans is POU5F1 (also known as OCT3/4). POU5F1 is expressed throughout the totipotent cells of the early mouse embryo. The first differentiation event amongst these cells results in the formation of the blastocyst, composed of an extra-embryonic layer, the trophoectoderm, and cells that will ultimately become the embryo proper, the inner cell mass (ICM). Not until this event occurs does POU5F1 expression become restricted to the inner cell mass of the pre-implantation blastocyst and downregulated in the trophoectoderm (Nichols et al., 1998). Later in development, POU5F1 expression is rapidly downregulated upon differentiation of the ICM, making its presence a defining characteristic of totipotent and pluripotent cell types. POU5F1 is critical for preventing the differentiation of totipotent and pluripotent cells of the embryo (Pesce et al., 1998a; Pesce et al., 1998b).
The gene functions in maintenance of pluripotency and self-renewal in embryonic stem cells via repression of select differentiation factors and activation of stem cell maintenance genes (for a concise review see Pesce and Scholer, 2001). Targets for repression include the human chorionic gonadotropin (hCG) α and β subunits (Liu et al., 1997; Liu and Roberts, 1996). Release from repression of the α and β hCG subunits may be an initial event in differentiation to the trophoectoderm and later derivatives (Pesce and Scholer, 2001). FOXD3, of the forkhead box (FOX) family of proteins, is expressed in embryonic stem cells and in the embryonic neural crest. FOXD3 activates other FOX family proteins, which may initiate early embryonic lineage decisions such as differentiation to endoderm and endodermal organogenesis. POU5F1 can interfere with FOXD3 binding domains to repress its ability to activate downstream FOX proteins and ultimately repress hESC differentiation to endodermal derivatives. Through repression of various differentiation factors, POU5F1 can act as a gatekeeper of the pluripotent state in embryonic stem cells, rendering their potential to become multiple cell types quiescent. POU5F1 also serves to transcriptionally activate several downstream targets such as ZFP42, Creatine kinase B, Makorin 1, Importin β, Histone H2A.Z and the ribosomal protein S7 (Du et al., 2001). Expression of many of these genes is not necessarily restricted to the ICM or embryonic stem cells, and their role in maintaining pluripotency is unknown. The zinc finger protein ZFP42, however, is developmentally regulated and known to be associated with pluripotent cells (Rogers et al., 1991). Interestingly, ZFP42 is capable of binding DNA to regulate transcription and, like other downstream targets of POU5F1, may be necessary for stem cell maintenance, although its role remains unclear.
In conjunction with the high-mobility group (HMG) domain protein SOX2, POU5F1 can synergistically activate transcriptional targets (Ambrosetti et al., 1997). One confirmed target of POU5F1 and SOX2 is the fibroblast growth factor 4 (FGF4) gene, which encodes a secreted signalling molecule. FGF4 is involved in the viability of the blastocyst, in the development of the heart, and in the outgrowth and patterning of the developing mouse limb (Feldman et al., 1995; Niswander et al., 1993). Thus, regulation of this growth factor may have a role in ESC survival and/or differentiation. The gene encoding the undifferentiated embryonic cell transcription factor (UTF1) is an additional target for synergistic co-regulation by POU5F1 and SOX2 (Nishimoto et al., 1999). UTF1 is expressed mainly in pluripotent cell types, particularly mouse and human ESC. The precise role of UTF1 in early embryonic development or in the maintenance of ESC is not understood.

SOX2 can also antagonize POU5F1 activation of transcription. In the case of the Osteopontin (OPN) gene, shortly before differentiation of cells of the ICM to primitive endoderm, POU5F1 is up-regulated and can form homodimers on the OPN enhancer, leading to its expression (Botquin et al., 1998). SOX2 is additionally able to bind to the OPN enhancer to form a complex with POU5F1 that disrupts enhancer activity and represses OPN expression (Ambrosetti et al., 1997). Upon the proper developmental cue, SOX2 is downregulated in the ICM prior to POU5F1, possibly providing a mechanism by which POU5F1 can regulate the formation of the primitive endoderm in mouse embryos.

Several genes may interact to stabilize the POU5F1 complex with DNA or to act as bridging factors in connection with transcriptional machinery. Other members of the high-mobility group protein family, HMG1 and HMG2, have been shown to interact with POU proteins.
Both proteins may act with POU5F1 to facilitate and/or cement DNA binding or to activate a transactivation domain (Pesce et al., 1999). Unique to embryonic stem cells, POU5F1 can bind target genes distal to transcriptional start sites to initiate gene expression (Scholer et al., 1991). This suggests other co-activators may connect POU5F1 to transcriptional machinery by acting as bridging factors. Several viral genes have been proposed as bridging factors in differentiated cell types, such as the adenoviral E1A protein (Scholer et al., 1991). The particular bridging factors that link POU5F1 to transcriptional machinery in embryonic stem cells have yet to be elucidated in vivo, but it is probable that they would bear strong functional similarity to E1A.

Not all of the genes identified as important to ESC maintenance are intimately associated with POU5F1. Nanog was recently identified in mESCs as maintaining pluripotency and self-renewal independently of the STAT3/LIF pathway (Chambers et al., 2003; Mitsui et al., 2003). The gene shares homology with homeodomain-containing transcription factors of the NK2 family, which are implicated in several aspects of cell type specification and maintenance of differentiated tissue (Wang et al., 2003). Nanog expression is limited to ESC and a small selection of tissue types including teratocarcinoma cells, germ cells and various tumour types (Chambers et al., 2003; Mitsui et al., 2003). Though the gene does not require POU5F1 for its expression, both Nanog and POU5F1 appear to be required in combination to effect self-renewal and pluripotency (Cavaleri and Scholer, 2003). A human ortholog of Nanog, located in a syntenic region and bearing 85% sequence identity to the mouse gene in its functional domain, was also uncovered and may function similarly in hESCs.

Our understanding of human embryology has previously relied on studies of embryogenesis in model organisms, namely the mouse.
Many developmentally important genes and pathways demonstrate high evolutionary conservation, providing justification for the use of the mouse to provide insight into early human development. This justification has also extended to the study of hESC. Expression studies in mESC have provided a handful of candidate genes that maintain pluripotency and are conserved in hESC. However, several differences between human and mouse embryology exist, both morphologically and molecularly (Fougerousse et al., 2000; Ginis et al., 2004). For this reason, it is not surprising that some key molecular pluripotency pathways that exist in the mouse are not conserved in the human, the most obvious example being the LIF/gp130 signalling pathway. With the isolation of hESC, investigators can now begin to generate an accurate model of the mechanisms involved in early human embryology through global gene expression profiling techniques.

1.2 Global gene expression profiling approaches in stem cells

1.2.1 Overview of high throughput gene expression profiling platforms

The genomics era ushered in the development of high throughput platforms to characterize gene expression differences between normal and diseased states, between pharmacologically treated and untreated cells/tissues, and between different developmental stages in human and model organisms. In relation to the study of embryonic stem cells, transcriptome analysis consists of comparing undifferentiated ESC with differentiated derivatives, somatic cells, or non-pluripotent cells. Current technologies capable of characterizing transcriptomes fall under two broad categories: hybridization based approaches (cDNA and oligonucleotide microarray) (Lipshutz et al., 1999; Lockhart et al., 1996; Schena et al., 1998) and sequence based approaches (EST sequencing projects, massively parallel signature sequencing (MPSS) and serial analysis of gene expression (SAGE)) (Brenner et al., 2000; Velculescu et al., 1999; Velculescu et al., 1995).

1.2.2 Large-Scale Genomic Approaches to the Study of Human Embryonic Stem Cells

Recently a handful of groups have approached the study of a number of hESC lines using the aforementioned large-scale genomic technologies. Table 2 (located at the end of this section) provides a summary of all the genes discussed below.

One such study originated from the lab of James A. Thomson and described the expression profile of 5 hESC lines (NIH Human Embryonic Stem Cell Registry codes WA01, WA07, WA09, WA13 and WA14) and human embryonic carcinoma cell (hECC) lines using microarray technology (Sperger et al., 2003). One major goal of the analysis was to identify genes specifically expressed at higher levels in pluripotent cell types. In particular, Thomson's group found a set of 895 genes both highly expressed and shared between hESC lines and hECC lines, which may represent genes important in maintaining pluripotency. Not surprisingly, POU5F1 was one of the most highly expressed genes in hESC and hECC lines. DPPA4, which has been similarly identified in several other gene expression studies in hESC lines (Richards et al., 2004; Sato et al., 2003), was also highly co-expressed with POU5F1 based on comparisons of the hESC and hECC lines with microarray profiles from 29 germ cell tumour lines, 14 samples of normal testis and 17 somatic cell lines. The most highly expressed genes in hESC and hECC lines identified in this study included DNMT3B, FOXD3, and SOX2. These genes were also among the most highly expressed in additional gene expression profiles of hESCs (Brandenberger et al., 2004a; Brandenberger et al., 2004b; Carpenter et al., 2004; Richards et al., 2004; Sato et al., 2003).
Representatives from signalling pathways identified in this study that may be important in maintaining undifferentiated hESC (u-hESCs) included: Frizzled 7/8 (Wnt/β-catenin pathway), fibroblast growth factor (FGF) receptor genes 1-4 (FGF signalling pathway), and the bone morphogenetic protein (BMP) receptor, type 1A (BMP signalling pathway).

Sato et al. (2003) utilized Affymetrix GeneChip technology (Lockhart et al., 1996) to compare the expression profiles of the WA01 hESC line (Thomson et al., 1998) to published mESC SAGE data (R1 line) (Anisimov et al., 2002). They proposed that many properties are likely shared between human and mouse pluripotent cells and reported their list of evolutionarily conserved molecular factors. This study sought to elucidate genes involved in maintaining a pluripotent state. The undifferentiated WA01 expression profile was compared to those of WA01 cells differentiated to embryoid bodies (EBs), to neurons, and by non-lineage directed differentiation. In total, 918 genes were enriched¹ in u-hESCs compared to non-lineage directed differentiated hESCs. Genes confirmed by RT-PCR to be downregulated upon differentiation included: POU5F1, TDGF1, Id-IH, Jumonji, Lefty A, FGF13, Sprouty, Hey2, SOX2, FGF2, Thy1, ADCY2, BIRC5, CHEK2, MCF1, PAK1, PTTG1, LCK, and LDB2. Notable examples of highly enriched genes in WA01 include POU5F1, Lefty A (a downstream target of POU5F1), and ZFP42. Nearly a quarter of enriched genes were identified as uncharacterized ESTs; thus many of the genes potentially involved in maintaining embryonic stem "cellness" have not been extensively studied. Several ligands and receptors of signalling pathways (e.g., the FGF, BMP and WNT pathways, and negative regulators of TGFβ) were also among the highly enriched genes in WA01.

¹ Enriched genes are significantly downregulated in WA01 differentiated cell populations; of these 918 genes/transcripts only 42 were absent across all differentiated WA01 populations.

The intersection between genes enriched in u-hESCs and genes enriched in mouse embryonic stem cells was identified. A total of 227 genes were enriched in both species and included many ESTs of unknown function, members of the BMP/TGFβ signalling pathway, components of the chromatin-remodelling machinery, components of phosphatidyl-inositol signalling, and several proposed pluripotent marker genes (POU5F1, TDGF1, DPPA4, and CHEK2). The remaining 691 genes in WA01 either did not have a mouse ortholog or were not enriched in mESCs. For example, SOCS1, an inhibitor of the STAT3 pathway, the ultimate effector of LIF signalling, was enriched in WA01 compared to mESCs. Thus, this gene may account for one major difference between human and mouse ES cells, the inability of LIF to maintain an undifferentiated state in hESC. The remaining genes, outside of this intersection, may represent human specific markers and provide an explanation for various species differences in ES "cellness".

The properties of four feeder-free hESC lines, WA01, WA07, WA09 and WA14 (Thomson et al., 1998), were investigated using various methodologies such as RT-PCR, microarray, and cDNA library sequencing (Carpenter et al., 2004). hESC lines are traditionally cultured on irradiated feeder layers of mouse embryonic fibroblasts (MEFs). The feeder-free lines used in this study express the same cell-surface markers as MEF-cultured lines and a suite of pluripotency molecular markers. Such markers include: CD9, CD90, hTERT, POU5F1, SOX2, ZFP42, UTF1, TDGF1, and BCRP. cDNA libraries were generated from u-hESCs and EBs; they demonstrated yet again that POU5F1 and SOX2 are highly abundant in ES cells and down-regulated upon differentiation. Carpenter et al. observed that hESC are tightly adhered; upon further investigation they discovered high expression levels of gap junction genes and functional gap junctions in WA01 and WA09. Microarray analysis of WA01, WA07 and WA09 demonstrated that most genes were similarly expressed in all lines.
There was no evidence of genes that were significantly differentially expressed and unique to a particular cell line. It is important to note that only 2,802 cDNA clones were used for this analysis, as genes with low expression were filtered out computationally. It is plausible that cell-line specific differences may lie in genes expressed at low levels; thus a more comprehensive analysis of commonalities between hESC lines is required.

Richards et al. (2004) constructed short SAGE (14-mer tag sequences) libraries to investigate the transcriptome of two human pluripotent stem cell lines, ES03 and ES04 (ES Cell International; http://www.escellinternational.com). Both hESC lines were compared to each of 21 normal and cancer human SAGE libraries in an effort to elucidate genes differentially expressed in ES cells. Based on SAGE results for genes potentially enriched in hESC, RT-PCR of candidate hESC-specific marker genes was performed in u-hESC, differentiated hESC, and various human fetal and adult somatic tissue types. To identify genes evolutionarily conserved in pluripotent embryonic stem cells, the R1 mESC line was compared to ES03 and ES04. Richards et al. utilized UniGene (http://www.ncbi.nlm.nih.gov/) to map tags to genes and LocusLink (http://www.ncbi.nih.gov/entrez/) to classify the molecular functions of tag-to-gene mappings. Additionally, CGAP SAGE Genie (http://cgap.nci.nih.gov/SAGE/AnatomicViewer) and the NCBI SAGEmap (http://www.ncbi.nih.gov/SAGE/) were utilised to select the best tag for ambiguous tag-to-gene mappings. However, using these particular databases still left over 13% of tags without a reliable assignment to a UniGene cluster (the analysis excluded singleton tag mappings). The use of UniGene may complicate tag-to-gene mappings as it contains sequence redundancies. Furthermore, a major confounding factor introduced by the short SAGE technology is the ambiguity of tag-to-gene mappings due to the lack of gene specificity in 14 bp tags.
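The limited specificity of 14 bp tags can be illustrated by counting how many distinct tag sequences each tag length allows. The sketch below is only an illustration; it assumes that the quoted tag lengths (14 bp for short SAGE, 21 bp for long SAGE) include the 4 bp CATG anchoring site, so that only the remaining bases vary. The function name is hypothetical.

```python
# Illustrative sketch: size of the possible tag space for short vs long SAGE.
# Assumption: tag lengths include the invariant 4 bp NlaIII anchor (CATG).

ANCHOR = "CATG"


def tag_space(tag_length: int, anchor: str = ANCHOR) -> int:
    """Number of distinct tags of the given total length that begin with the anchor."""
    variable_bases = tag_length - len(anchor)
    return 4 ** variable_bases


short_space = tag_space(14)  # 10 variable bases
long_space = tag_space(21)   # 17 variable bases

print(f"short SAGE tag space: {short_space:,}")
print(f"long SAGE tag space:  {long_space:,}")
print(f"fold increase:        {long_space // short_space:,}")
```

With roughly a million possible short tags, distinct transcripts are far more likely to share a tag by chance than they are in the much larger long SAGE tag space.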
The ambiguity of short tags has been partially addressed by the introduction of long SAGE (Saha et al., 2002), which provides significantly increased accuracy in tag-to-gene mapping. A combined total of 145,015 tags, corresponding to 31,852 transcripts, were sequenced for ES03 and ES04 (67,807 tags and 77,208 tags respectively). Of these 31,852 distinct tag sequences, 64.2% (20,447) were singletons. Nearly half of all singletons mapped to ESTs, hypothetical genes and some known transcripts. Notable transcripts mapped by a singleton tag include FOXD3 and GBX2, transcription factors that are highly expressed in mouse embryonic stem cells and in the inner cell mass of mouse blastocysts. Singleton tags were ultimately excluded from further analysis.

Richards et al. (2004) sought to validate the pluripotent phenotype of the hESC lines by assaying for the presence or absence of SAGE tags corresponding to a variety of markers of pluripotency and of early embryonic differentiation. This study proposed the following genes as candidate hESC markers²: POU5F1, SOX2, NANOG, ZFP42 (present only in ES03), HESX1, FLJ14549, FLJ21837, DPPA4, TGIF, DNMT3B, LIN-28, NPM1, TDGF1, GDF3, CHEK2, OC90, CLDN6, GJA1, CKS1B, ERH, HMGA1 and TNFRSF6.

² Candidate hESC markers were investigated by semiquantitative RT-PCR in undifferentiated and differentiated ES03 and ES04 lines. Many of these genes were present only in the undifferentiated stem cell lines and/or showed a marked decrease during stem cell differentiation.

These genes are not exclusive to hESC or the inner cell mass of pre-implantation embryos. In particular, NANOG, HESX1, FLJ21837, DPPA4, LIN-28, CLDN6, GJA1, CKS1B, ERH, HMGA1 and TNFRSF6 are expressed at low levels in various differentiated tissues. The majority of these genes also show only slight to modest declines upon hESC differentiation.
Only OC90 and FLJ14549 demonstrated a significant decrease in expression upon hESC differentiation (based on RT-PCR analysis of candidate markers) and were not present in the various differentiated tissue types assayed. Genes associated with differentiation were detected in the ES03 SAGE library, namely LECT1, TGFα, and IFRD1.

Richards et al. (2004) also addressed the differences in gene expression profiles between ES03 and ES04. Most differentially expressed genes were expressed at low levels (tag counts of <3). Differential expression of splice isoforms was also evident between hESC lines, e.g., for basic transcription factor 3 (BTF3). The paper also investigated similarities between mouse and human ES cells. Transcription factors originally defined in mouse ES cells as associated with pluripotency, such as SOX2, HESX1, UTF1, POU5F1, and ZFP42, were expressed at consistently higher levels in hESC. As expected, members of the leukemia inhibitory factor signalling pathway as well as FGF4, TDGF1, GBX2, and Nanog were expressed at higher levels, or were uniquely present, in mouse ES cells. The differences demonstrated in this paper at a molecular level further strengthen the argument against mouse ES cells as an accurate model for studying human embryonic pluripotency.

A comprehensive hESC transcriptome profiling study was published recently and employed signature MPSS and EST sequencing of three feeder-free cell lines, WA01, WA07 and WA09 (Brandenberger et al., 2004a; Brandenberger et al., 2004b). MPSS employs ligation based sequencing of 16-20 bp sequences ('signatures') bound to microbead arrays to enable the rapid sequencing of millions of signatures in parallel (Brenner et al., 2000; Jongeneel et al., 2003). The depth of MPSS sampling approached 3 million sequenced signatures 17 bp in length, yet these corresponded to only 22,136 distinct signatures; 3% of signatures were unmapped to transcripts and genomic sequences.
Housekeeping genes were unsurprisingly among the most highly expressed. More importantly, known u-hESC markers, SOX2, DNMT3B, and POU5F1, were also among the top 200 expressed genes. Represented in the top 200 expressed genes were the FGF signalling pathway and the Ras pathway. The high expression of FGFR1 indicated that hESC may utilize a number of FGFs, but there is an apparent requirement for basic FGF (FGF2) in hESC. Differences between mouse and human ES cells were demonstrated, notably the inactivity of the LIF pathway in hESC (shown by the lack of expression of a number of its constituents in the MPSS and EST libraries). Additionally, ERAS, PEPP1 and PEPP2, which are present in mESC, were undetected in the pooled hESC sample. However, PEPP1 and PEPP2 were detectable in our own SAGE data (discussed below); the failure to detect these transcripts is therefore likely a limitation of MPSS and not necessarily biologically significant. Several Wnt and TGFβ signalling network constituents were expressed and are likely to be important in ESC maintenance. High levels of soluble frizzled receptors and E-cadherin of the Wnt signalling pathway were detected. In general, the analysis revealed that many signalling pathway transcripts are expressed, but so are their negative regulators, which may support the view that transcriptional repression maintains an undifferentiated state in hESC.

Brandenberger et al. (2004b) compared the MPSS library to an MPSS database of 36 human tissues and cell lines to look for genes unique to or over-expressed in hESC. They discovered 13 highly enriched uncharacterized genes that were expressed in u-hESCs, down-regulated upon differentiation, and generally absent in other cell types, making them good candidate markers for u-hESCs. MPSS has several advantages in permitting impressively deep transcriptome coverage.
Shortcomings of the technique do exist; the method of sequencing cannot resolve palindromic sequence hybridization. Notably, this analysis failed to detect Nodal, which can be detected using SAGE. MPSS requires that transcripts possess a type II restriction enzyme site (DpnII), as does the SAGE technique, which utilizes NlaIII. Thus a small proportion of transcripts, those lacking a DpnII site and/or an NlaIII site, go undetected by the corresponding method. For example, SNRPF, which was enriched in microarray analysis of hESC (Bhattacharya et al., 2004), lacks a DpnII site but not an NlaIII site. There are half as many DpnII sites in the genome as NlaIII sites, and 1% of transcripts lack an NlaIII site while 4% of transcripts lack a DpnII site (Pleasance et al., 2003) (A. Delaney, personal communication). MPSS and SAGE also complement one another, as some DpnII sites produce ambiguous tags while the corresponding NlaIII site does not, and vice versa. This analysis ambiguously detected a marker of undifferentiated hESC, ZFP42 (REX1), while our SAGE dataset detected an unambiguous tag mapping to the gene. Thus, MPSS and SAGE are complementary and the utilization of both techniques may be particularly important for the detection of some hESC specific genes.

Table 2 Candidate ES pluripotency genes confirmed or identified by global gene expression profiling (confirmed/identified by: down-regulation upon differentiation, high expression pattern in ES compared to differentiated tissues and/or RT-PCR). References: 1, Brandenberger et al.; 2, Carpenter et al.; 3, Richards et al.; 4, Sato et al.; 5, Sperger et al.

Genes:
Gene        References
BCRP        2
CD9         2
CD90        2
hTERT       2
UTF1        2
ZFP42       2,3,4
CKS1B       3
CLDN6       3
ERH         3
FLJ14549    3
FLJ21837    3
GBX2        3
GDF3        3
GJA1        3
HESX1       3
HMGA1       3
LIN28       3
NANOG       3
NPM1        3
OC90        3
TGIF        3
TNFRSF6     3
ADCY2       4
BIRC5       4
HEY2        4
JARID2      4
LCK         4
LDB2        4
MCF1        4
PAK1        4
PTTG1       4
SOCS1       4
SPRY1       4
THY1        4
CHEK2       3,4
BMPR1A      5
DPPA4       3,5
FOXD3       3,5
DNMT3B      1,3,5
POU5F1      1,2,3,4,5
SOX2        1,2,3,4,5

Pathways:
Pathway   Gene        References
WNT       Frizzled7   1,4,5
WNT       Frizzled8   1,4,5
TGFβ      TDGF1       1,2,3,4
TGFβ      LEFTY2      1,2,3,4
TGFβ      ID1         1,2,3,4
FGF       FGF1        1,4,5
FGF       FGF13       1,4,5
FGF       FGF2        1,4,5
FGF       FGF3        1,4,5
FGF       FGF4        1,4,5
FGF       FGFR1       1,4
BMP       BMPR1A      1,4,5

1.3 Objectives

Few large-scale cDNA and EST projects of the human pre-implantation embryo or its cell types exist. The isolation of human embryonic stem cell lines has made possible some initial investigations of the transcriptome of a limited number of these cell lines. Much remains unknown about the genes governing the undifferentiated state and the sequence of events that determine stem cell fate decisions. Consequently, hESCs are a rich source for novel gene discovery. We sought to generate a comprehensive catalogue of genes expressed in 8 NIH approved human embryonic stem cell lines (BG01, ES03, ES04, WA01, WA07, WA09, WA13, and WA14) using long Serial Analysis of Gene Expression (long SAGE). Because SAGE is based on a unique sequence tag that can be easily quantified, it offers the opportunity to directly compare libraries constructed by different labs using differing protocols. Additionally, because SAGE, unlike hybridization based techniques, does not require a priori knowledge of a gene for its detection, it is amenable to novel gene discovery. The Gene Expression Laboratory sequenced 11 long SAGE libraries, at a minimum depth of 200,000 total tags per library, generated from undifferentiated hESC mRNA from 8 cell lines, to more accurately describe the number of transcripts expressed in a cell at any given time and the absolute levels at which these transcripts are expressed. At this depth of tag sequencing we are more likely to identify known and novel genes that are transiently expressed and/or expressed at low levels (singleton or doubleton tags).

1.4 Specific aims and rationale

Aim 1. To analyze the database of hESC expressed SAGE tags to provide an annotated catalogue of expressed genes characterizing undifferentiated hESCs.

Aim 2. To compare the hESC SAGE libraries to publicly available adult and fetal human SAGE libraries to identify genes that are differentially expressed (up-regulated or down-regulated) across all embryonic stem cell lines.

Aim 3. To generate a computational approach to identify candidate novel genes in undifferentiated hESCs using multiple interspecies comparisons to all publicly available normal and malignant human SAGE libraries and multiple species sequence conservation.

2. Catalogue of the undifferentiated hESC transcriptome

Contributions

The culture of hESC lines and RNA extractions were completed at the laboratories of James A. Thomson (WA01, WA07, WA09, WA13, and WA14; Wisconsin Regional Primate Research Center, University of Wisconsin, Madison, WI, USA), Martin F. Pera (ES03 and ES04; Monash Institute of Reproduction & Development, Monash University, Melbourne, Victoria, Australia), and Allan Robins (BG01; BresaGen, Inc., Athens, Georgia, USA). Long SAGE library construction was completed by Jaswinder Khattra, Jennifer Asano, and Sean Rogers (Gene Expression Laboratory) and the libraries were sequenced by the Production Group at the BCCA Genome Sciences Centre (GSC). Affymetrix GeneChip wet-lab work was completed by the Gene Expression Lab (BCCA GSC). Analysis of Affymetrix data was completed by Jaswinder Khattra, who provided the list of genes used in Chapter 2.3.3, Detection of stem cell associated genes in WA09. DiscoverySpace software and database were designed by the Gene Expression Bioinformatics Group (BCCA GSC). The Audic-Claverie statistical analysis script was written by Mehrdad Oveisi (Gene Expression Bioinformatics; BCCA GSC). All computational analyses such as tag-to-gene mapping, creation of additional tag-mapping resources, functional annotation using Gene Ontology (GO) terms, signalling pathway analyses/figure generation, and statistical analysis were completed by Angelique Schnerch (BCCA GSC and the Department of Medical Genetics, University of British Columbia).

2.1 Introduction

Global gene expression in human embryonic stem cells has been studied using a number of technologies such as microarray (Bhattacharya et al., 2004; Carpenter et al., 2004; Sato et al., 2003; Sperger et al., 2003), short SAGE (Richards et al., 2004), EST sequencing (Brandenberger et al., 2004b; Carpenter et al., 2004), and MPSS (Brandenberger et al., 2004a). These studies have each examined a subset of the embryonic stem cell lines available. The intersection between gene lists from the published studies is small, although recent analysis of the microarray data has demonstrated greater overlap between studies than originally supposed (Suarez-Farinas et al., 2005). It remains the case that profiles generated using different technologies show little overlap for genes expressed at low levels (Evans et al., 2002). Even when comparing gene lists across different microarray platforms, most similarities are between highly expressed genes in the different ES lines. Additional genes that may be significant to several ES cell lines could be expressed infrequently and will be missed by meta-analysis of these published studies. To describe the hESC transcriptome, several lines should be investigated using the same technology; in this analysis long SAGE was utilized. It has been widely established that SAGE is both qualitative and quantitative, producing a digital transcriptome survey of known and novel transcripts (Saha et al., 2002; Velculescu et al., 1999; Velculescu et al., 1995). To capture infrequent transcripts and approximate the biological numbers of genes expressed in a cell type at a given time point, each library must be deeply sequenced. Long SAGE libraries, sequenced to 200,000 total tags or more per library on average, were generated for 8 NIH approved human embryonic stem cell lines. The WA09 line was sequenced to a depth of over 400,000 total tags to more closely approximate the number of transcripts expressed in a cell at a given time-point.
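The relationship between sequencing depth and the capture of infrequent transcripts can be made concrete with a simple sampling model. The sketch below is only an illustration, assuming that tags are drawn independently and at random from the transcript pool; the function names and the example abundance of 1 transcript in 100,000 are hypothetical and are not values taken from this study.

```python
import math

# Illustrative sketch under a simple random-sampling assumption:
# each sequenced tag comes from a given transcript with probability `abundance`.


def detection_probability(abundance: float, total_tags: int) -> float:
    """Probability of observing at least one tag for the transcript in `total_tags` draws."""
    return 1.0 - (1.0 - abundance) ** total_tags


def depth_for_probability(abundance: float, target: float) -> int:
    """Smallest number of tags giving at least `target` probability of detection."""
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - abundance))


rare = 1e-5  # hypothetical relative abundance: 1 transcript in 100,000

for depth in (200_000, 400_000):
    p = detection_probability(rare, depth)
    print(f"depth {depth:,}: P(detected) = {p:.3f}")

print(f"tags needed for 95% detection: {depth_for_probability(rare, 0.95):,}")
```

Under this model, doubling the depth from 200,000 to 400,000 tags noticeably raises the chance of sampling a transcript of this abundance at least once, which is the rationale for sequencing one library more deeply.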
The entire dataset generated greatly exceeds other short/long SAGE libraries available, particularly the short libraries previously constructed for the ES03 and ES04 lines (145,015 combined total tags) (Richards et al., 2004). This provided the opportunity to describe the majority of known genes and to identify rare novel transcripts in multiple hESC lines. Using a comprehensive tag-to-gene mapping strategy, I aimed to catalogue the genes expressed in the WA09 long SAGE library and the genes common to all hESC lines for which a long SAGE library was constructed. The expression of genes implicated in maintaining stem cell self-renewal or plasticity was assayed in the WA09 library and in a pooling of all stem cell libraries (hESC metalibrary). Additionally, I aimed to assess the expression of genes involved in various developmentally regulated pathways, the cell cycle, programmed cell death and metabolism using a combination of Comprehensive Mapping of SAGE Tag (CMOST) mappings, Cancer Genome Anatomy Project (CGAP) SAGE Genie mappings and Gene Ontology (GO) terms.

2.2 Methods

2.2.1 Cell Culture

Cell culture and RNA isolation were completed at the Wisconsin National Primate Research Centre (University of Wisconsin-Madison). The human ES cell line WA09 (NIH Stem Cell Registry code) was cultured on murine embryonic fibroblast (MEF) feeders. The ES line has a normal XX karyotype, expresses high levels of telomerase and has been shown to stain for cell surface markers that characterize undifferentiated primate embryonic stem cells (Thomson et al., 1998). Markers of pluripotency were assayed and included stage-specific embryonic antigens (SSEA3 and SSEA4), human embryonal carcinoma marker antigens (TRA-1-60 and TRA-1-81), and alkaline phosphatase. Colonies were cultured according to previously established protocols (Zwaka and Thomson, 2003).
Briefly, WA09 cells (with a doubling time of approximately 20 hours) were harvested for gene expression profiling at passage number 38, when they were approximately 80% confluent. Cells were harvested by treatment with 1 mg/ml of collagenase (Invitrogen) at 37°C for 10 minutes until the edges of the colonies curled away from the feeder layers. Next, cells were treated with 5 mg/ml of dispase (Invitrogen) for 5 minutes at 37°C until colonies dissociated from the plate. Cells were collected and washed with PBS (phosphate buffered saline solution) prior to total RNA extraction. Both short and long SAGE libraries were constructed from 20 µg of total RNA isolated from WA09 cells using TRIzol Reagent (Invitrogen) and Phase Lock Gel™ tubes according to the manufacturer's protocol (Brinkmann Instruments). RNA quality assessment, SAGE library construction and sequencing were conducted at the Gene Expression Laboratory of the BCCA GSC. RNA quality was assessed using an Agilent 2100 Bioanalyzer (Agilent Technologies) and RNA 6000 Nano LabChip kit (Caliper Technologies).

2.2.2 SAGE Library Construction

Prior to SAGE library construction, potential genomic DNA contamination was removed from WA09 total RNA using DNase I (Invitrogen) treatment. The long SAGE library (Velculescu et al., 1995) was constructed according to the I-SAGE kit protocol (Invitrogen) using the kit reagents with adaptations for generating 17 bp tags (long SAGE protocol version 1.0a; http://www.sagenet.org) (Figure 4). Specific alterations include: the use of long SAGE linker molecules, scale-up PCR oligonucleotide primers specific to the long SAGE linkers (Invitrogen), and the use of an alternate type IIS restriction enzyme ("tagging" enzyme), MmeI (NEB), as opposed to BsmFI in the short SAGE protocol. To increase the recovery and purity of DNA, Phase Lock Gel™ tubes (Eppendorf) were used at each phenol-chloroform extraction step (Ye et al., 2000). Ditags were amplified with 25 PCR cycles.
Ditags were purified using polyacrylamide gel electrophoresis (PAGE); 34 bp sized bands for long SAGE ditags were cut from the gel and ethanol precipitated to extract the DNA sample. Ditags were ligated to form concatemers containing up to 30-50 21-mer tags. Concatemers were also PAGE purified. Purified concatemers were cloned into a pZErO-1 (Invitrogen) sequencing vector. The ligated vector-concatemers were transformed into One Shot® TOP10 electrocompetent Escherichia coli (Invitrogen) and transformants were selected on low-salt LB/zeocin agar plates. Resistance to the antibiotic, Zeocin, denotes a successful transformation; these colonies were picked robotically into 384-well plates. Prior to sequencing, transformants were analyzed using colony PCR to determine the size and percent of inserts present in the SAGE library.

Figure 4 SAGE protocol. Protocol adapted from Invitrogen I-SAGE kit (http://www.invitrogen.com).
1. Mix H9 total RNA with oligo dT streptavidin beads.
2. Synthesize double stranded cDNA.
3. Digest with anchoring enzyme (NlaIII).
4. Divide in half and ligate adaptors (A and B) containing the recognition sequence for the type IIS restriction enzyme (tagging enzyme), BsmFI (14 bp tag) or MmeI (21 bp tag).
5. Cleave with tagging enzyme, releasing adaptor plus 14 bp/21 bp tag from streptavidin beads.
6. Fill in overhangs (Klenow reaction) and ligate to form ditag.
7. PCR amplify using ditag primers.
8. Cut adaptors with NlaIII to release ditag.
9. Ligate ditags to form concatemers (20-50 tags per concatemer).
10. Clone into pZErO-1 and sequence.

Table 3 Summary of hESC long SAGE dataset. Listed below is the NIH code for each cell line, the SAGE library accession (provided by the BCCA GSC), cell line karyotype, a description of the growth conditions and experimental manipulations where applicable, total tags sequenced and the corresponding unique tag types used for subsequent analysis. All libraries are publicly available from the CGAP and from http://www.transcriptomES.org.

Cell line   Library accession   Karyotype   Description                                        Total tags   Tag types
BG01        SHE19               46XY        Grown on mouse embryonic fibroblasts (MEF)         201,651      50,268
ES03        SHE10               46XX        Grown on MEF                                       205,276      41,571
ES04        SHE11               46XY        Grown on MEF                                       209,177      49,411
WA01        SHE17               46XY        Grown on MEF                                       276,177      56,426
WA01-M      SHE16               46XY        Grown on matrigel                                  218,169      45,637
WA01-7      SHES7               46XY        Grown on matrigel; OCT4 knock-in unselected        266,057      54,318
WA01-8      SHES8               46XY        Grown on matrigel; OCT4 knock-in G418 selected     196,607      39,999
WA07        SHE13               46XX        Grown on MEF                                       272,422      57,283
WA09        SHES2               46XX        Grown on MEF                                       466,042      80,946
WA13        SHE15               46XY        Grown on MEF                                       221,060      47,389
WA14        SHE14               46XY        Grown on MEF                                       212,136      47,732

2.2.3 SAGE Library Sequencing

DNA was sequenced using BigDye primer cycle sequencing reagents and analyzed on ABI PRISM 3700 and 3730 XL capillary DNA sequencers (Applied Biosystems). Phred (Ewing and Green, 1998; Ewing et al., 1998) was used to process and assess the quality of the sequences produced.
Custom scripts (Gene Expression Bioinformatics Group; http://www.bcgsc.ca/bioinfo/ge/) were used to process the sequencing reads by removing vector sequences and low-quality sequences and by identifying non-recombinant clones. Additional scripts from the same group were used to extract ditags from the sequencing reads and ultimately to extract 21 bp tags for the libraries constructed from WA09 RNA. Linker-derived tags (long SAGE linker sequences: TCGGACGTACATCGTTA and TCGGATATTAAGCCTAG) and sequencing errors were removed prior to analysis. Additional libraries constructed for NIH approved human embryonic stem cells were similarly cultured and prepared for long SAGE library construction using the Gene Expression Laboratory pipeline described above. Table 3 lists the libraries sequenced at the BCCA GSC and described in this research.

2.2.4 Tag-to-gene mapping

2.2.4.1 Comprehensive mapping of SAGE tags

SAGE gene expression data analysis utilized the Java application DiscoverySpace (version 3.2.1; http://www.bcgsc.ca/bioinfo/software/discoveryspace). The application provides a graphical interface to visualize and access a centralized database (DiscoveryDB) comprising multiple public biological data sources. Tag-to-gene mappings were generated using the CMOST algorithm for use within DiscoverySpace (http://www.bcgsc.ca/bioinfo/software/discoveryspace/CMOST_plugin_docs/view). The CMOST approach attempts to account for single base-pair permutations, insertions and deletions that could prevent a SAGE tag sequence from mapping to a known sequenced transcript or to the genome. For high-quality sequence tags that would otherwise remain unmapped, CMOST enables tag-to-gene mappings for tags that may not be derived from artifacts introduced by the SAGE protocol.
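The single-base modifications that CMOST considers can be illustrated with a short sketch. This is not the CMOST implementation; the function name `single_base_variants` is hypothetical, and it simply assumes that an off-by-one tag is any sequence reachable from the observed tag by one substitution, one insertion or one deletion, without any of the length handling or filtering the real tool may apply.

```python
from typing import Set

BASES = "ACGT"


def single_base_variants(tag: str) -> Set[str]:
    """Return every sequence one edit away from `tag`.

    Covers single-base substitutions, insertions and deletions. The input
    tag itself is never included in the result.
    """
    variants: Set[str] = set()
    for i in range(len(tag)):
        # Substitution: replace the base at position i with each other base.
        for base in BASES:
            if base != tag[i]:
                variants.add(tag[:i] + base + tag[i + 1:])
        # Deletion: drop the base at position i (result is one base shorter).
        variants.add(tag[:i] + tag[i + 1:])
    for i in range(len(tag) + 1):
        # Insertion: add each base before position i (result is one base longer).
        for base in BASES:
            variants.add(tag[:i] + base + tag[i:])
    return variants


if __name__ == "__main__":
    canonical = "TATCACTTTTTTCTTAA"  # POU5F1 tag as listed in Table 6
    variants = single_base_variants(canonical)
    substitutions = [v for v in variants if len(v) == len(canonical)]
    # 51: three alternative bases at each of the 17 positions
    print(len(substitutions))
```

Each variant would then be looked up in the same virtual tag databases as the canonical tag, which is why the number of candidate mappings per tag grows quickly and a prioritization scheme is needed.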
The CMOST approach introduces each theoretically possible permutation of the experimentally observed (canonical) SAGE tag sequence; a modified tag is thus a single base-pair permutation away from the canonical sequence. The canonical tag and each modified tag are mapped to the available databases (Table 4 and Figure 5). Off-by-one tag mappings are only considered if a tag does not have a canonical tag mapping at a position in close proximity to the 3' most end of a transcript. Ambiguity in tag mapping can result when a single tag maps to multiple genes. Should a transcript be alternatively spliced, matches to sites other than the 3' most CATG site may be valid. As a result, if an ambiguous tag is of interest (e.g., not derived from repetitive sequence) it can be experimentally resolved using the generation of longer cDNA fragments from SAGE tags for gene identification (GLGI) (Chen et al., 2002; Chen et al., 2000). The technique uses the SAGE tag as the gene-specific primer and an anchored oligo(dT) primer to amplify the cDNA sequence from which the SAGE tag was derived, similar to 5' or 3' rapid amplification of cDNA ends (Frohman et al., 1988). Resolving many ambiguous tag matches would be costly and labour intensive. Consequently, previous SAGE profiling experiments have removed ambiguous tags from further analysis (Richards et al., 2004). I have similarly excluded ambiguous tags from further analysis of the hESC transcriptome profile.

Table 4 Summary of CMOST data sources for tag-to-gene mapping. Tags were extracted from all positions in a transcript/sequence in the sense and antisense orientation.

Resource                  Version       Sequence entries  Tags extracted  Unique tags
MGC                       May 04 '05    18,293            304,062         220,722
RefSeq                    May 04 '05    28,826            629,921         477,298
Ensembl transcripts       30.35c        33,869            682,788         448,281
Ensembl EST transcripts   23.34e.1      43,710            313,656         195,495
Transcription units       30.35c        22,218            9,948,449       7,624,140
Mitochondria              NC_001807.4   1                 90              90
Non-coding                May 12 '05    43                4,327           4,206
UCSC Human Genome         NCBI35.nov    25,820            24,067,102      18,206,956
TOTAL                                                     35,950,395      27,177,188
Database size: 3.9 GB

Figure 5 CMOST schematic of methodology, adapted from http://www.bcgsc.ca/downloads/genex/DS/cmostjluginuserdoc.htm. Each tag in a long SAGE library undergoes the following modifications prior to tag mapping: single base permutations, single base insertion/deletion. Both modified and unmodified tags are mapped to the virtual tag databases listed below.
[Figure 5 schematic not reproduced. Flow: SAGE library, then tag modification (single base permutation, single base insertion, single base deletion), then virtual tag databases. Sense and antisense databases each comprise: MGC/RefSeq, Ensembl transcripts, Mitochondrion, Non-protein coding, Transcription units, Ensembl EST transcripts, Golden Path.]

CMOST mappings produce multiple tag-to-gene matches because current transcript databases contain redundancies and disparate naming conventions. Multiple matched tags may also arise from mapping off-by-one sequences. To select the most reliable CMOST tag mapping (defined as a tag mapping to a single gene or genomic location), a hierarchical approach to tag mapping was utilized. Tags were mapped to various publicly available transcript/sequence databases, with the highest quality transcript database used first. Once a tag mapped to a known transcript in one database it was excluded from analysis with subsequent databases to avoid redundancies. Data sources were prioritized using the following parameters (refer to http://www.bcgsc.ca for CMOST documentation):

(i) Data sources. Data sources were ranked based on reliability and functional annotation.
Data source reliability was defined by high quality full-length cDNA sequence information derived from automated gene predictions and manual curation. The Mammalian Gene Collection (MGC) provided the most reliable cDNA sequences, comprising full open reading frames (ORFs) with evidence of splicing. Where a cDNA did not have homology to a known protein, the sequence was manually annotated and experimentally verified (Gerhard et al., 2004). The Reference Sequence project (RefSeq) also provides non-redundant, experimentally verified and computationally annotated cDNA sequences, which link transcript, chromosomal and protein information. A proportion of RefSeq cDNAs were solely ab initio predicted (ab initio: from the beginning) (Pruitt et al., 2003), so RefSeq was given lower priority than MGC. The order from highest to lowest priority data source was: MGC (http://mgc.nci.nih.gov; MGC 2004); RefSeq (http://www.ncbi.nlm.nih.gov; Pruitt et al. 2003); Ensembl transcripts (exon sequences only) (Birney et al., 2004; Hubbard, 2002; Hubbard et al., 2002; Kasprzyk et al., 2004) (http://www.ensembl.org); Genbank Human Mitochondrial Sequence (Accession AY289102.1); Genbank non-coding sequences (http://www.ncbi.nlm.nih.gov/Genbank); Ensembl transcription units, which demarcate a region in the genome bounded by a transcription initiation/termination site and encoding a primary transcript (to account for sequence lacking an annotated 3' UTR, an additional 1000 bp from the genomic region adjacent to the end of the transcript is included); Ensembl EST transcripts; and Golden Path (Genbank Human Genome Assembly Contigs build 34, January 2004).

(ii) Tag orientation (sense or antisense); tags in the sense orientation were given priority over antisense tags.

(iii) Proximity to the 3' most CATG site.
(iv) Tag modifications; in cases where both the experimentally observed tag and the CMOST modified tag mapped to an expressed sequence, the gene specified by the unmodified tag took precedence.

In addition to the DiscoverySpace transcript mapping resources, CGAP provided a reliable tag-to-gene mapping database for regular and long SAGE tags (Boon et al. 2002; http://cgap.nci.nih.gov/SAGE/). Tag mappings were derived from 105 virtual tag databases, which were ranked according to the percentage of tags in each database that were represented in the "confident SAGE tag list" (a list of tags reliably observed in multiple SAGE libraries) (Boon et al., 2002). CGAP virtual tags were extracted from the following seven transcript sequence sources (Boon et al. 2002):

(i) MGC.
(ii) RefSeq.
(iii) Predicted transcripts from chromosome 22.
(iv) The human mitochondrial genome (GenBank accession X93334).
(v) The "20K set" transcript database, generated by taking the longest non-EST cDNA for each UniGene cluster.
(vi) Clustered UniGene sequences comprising the "Consensus sequences" databases (Hsest).
(vii) "Unclustered EST" databases.

The databases (excepting the predicted transcript and mitochondrial databases) were further subdivided according to the presence of a 3' poly(A) tail of 5 adenosines or more, a poly(A) signal, or both. Virtual tag databases were additionally constructed to detect alternative polyadenylation and internal cDNA synthesis priming from a stretch of adenosines other than the poly(A) tail. Tags were extracted from the four NlaIII sites closest to the 3' most end of each subdivided transcript sequence and parsed into the virtual tag databases. CGAP mappings were made available for download at the following ftp site: ftp://ftp1.nci.nih.gov/pub/SAGE/HUMAN/.
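The hierarchical selection described in this section can be sketched as follows. This is an illustrative reading of criteria (i) to (iv), not the DiscoverySpace or CMOST code: the `Hit` record, the `rank` ordering and the function `assign_gene` are hypothetical names, and the sketch assumes the criteria are applied strictly in the listed order, with the first data source that yields any hit deciding the outcome.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple


@dataclass(frozen=True)
class Hit:
    gene: str
    sense: bool     # True if the tag matches the sense strand of the transcript
    position: int   # 1 = 3'-most CATG site, 2 = next site upstream, ...
    modified: bool  # True if the match came from an off-by-one (modified) tag


def rank(hit: Hit) -> Tuple[int, int, int]:
    """Lower tuples are preferred: sense, then 3' proximity, then unmodified."""
    return (0 if hit.sense else 1, hit.position, 1 if hit.modified else 0)


def assign_gene(
    hits_by_source: Sequence[Tuple[str, List[Hit]]]
) -> Optional[Tuple[str, str]]:
    """Pick one gene for a tag, or None if it is unmapped or ambiguous.

    `hits_by_source` lists (data source name, hits) pairs in priority order,
    e.g. MGC first and the genome assembly last.
    """
    for source, hits in hits_by_source:
        if not hits:
            continue  # nothing in this source; fall through to the next one
        best = min(rank(h) for h in hits)
        genes = {h.gene for h in hits if rank(h) == best}
        if len(genes) == 1:
            return source, genes.pop()
        return None  # several genes tie at the best rank: ambiguous, excluded
    return None  # no source produced a hit: unmapped


if __name__ == "__main__":
    example = [
        ("MGC", []),
        ("RefSeq", [Hit("GENE_A", True, 2, False), Hit("GENE_B", False, 1, False)]),
    ]
    print(assign_gene(example))  # ('RefSeq', 'GENE_A'): sense outranks antisense
```

Stopping at the first source with a hit mirrors the rule that a tag mapped in one database is excluded from later databases; a tie between different genes at the best rank is treated as an ambiguous tag and dropped.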
2.2.4.2 Assigning functional annotations

GO terms (http://www.geneontology.org) (Ashburner et al., 2000) can be used to assign annotations that broadly describe functional categories at the level of the entire transcriptome. The following criteria were devised to ensure that only the most reliable tag-to-gene mappings were used to generate a transcriptome view of function in WA09 and of transcripts commonly expressed across all ES lines:

(i) Tags were unmodified.
(ii) Tag-to-gene mappings were in the sense orientation.
(iii) Tags were derived from positions 1-3 in a transcript (position 1 being the 3' most NlaIII site in a transcript and positions 2 and 3 being farther upstream).
(iv) Tags were unambiguously assigned to a transcript.

In the case of tag types mapping to mitochondrial or non-protein coding sequences, tag orientation and position were omitted. C-shell scripts were used to process DiscoverySpace output for ease of parsing tag mappings according to the above criteria (Appendix 2a). GO terms were assigned, where available, to tags meeting the above criteria and mapping to Ensembl transcripts using DiscoverySpace. The Ensembl website provided the Biomart resource to rapidly obtain GO terms for each transcript accession (Ensembl transcript ID or UniGene ID) (Hubbard et al., 2005). GO terms are exceedingly detailed; hence the GO slims, 'slimmed down' versions of the GO terms, were used to provide a minimal set of molecular functions and biological processes to describe the hESC and normal CGAP (n-CGAP) library transcriptomes. These terms were obtained from the GO web site (http://www.geneontology.org/GO.slims.html). The C-shell scripts and GO slims used in this analysis are provided in Appendix 2b.

2.3 Results and discussion

2.3.1 Database of expressed long and regular SAGE tags

The WA09 long SAGE library totalled 466,042 sequenced tags.
Recent developments made by the Gene Expression Lab/Informatics at the BCCA GSC have since included ditag sequences, tag clustering (accounting for off-by-one tag sequences) (Siddiqui et al., submitted) and individual tag quality scores based on SAGE library construction/sequencing error (Siddiqui et al., submitted). With these improvements and quality assurances, the WA09 dataset used in this analysis, which includes all tags with a tag construction/sequencing error of P<0.05, totaled 441,795 tags corresponding to 56,532 unique tag types. Figure 6a depicts the distribution of sequenced tags categorized according to gene expression level. The bulk of tag types (53.9%) were observed once in the library, while highly expressed tags (>50 tags per type) were infrequent, representing 2% of all tag types. Figure 6b focuses on the contribution of mapped and unmapped tags to the total number of tags sequenced at low-to-mid expression levels. With this quality-trimmed dataset, 92% of tag types mapped successfully to a transcript or a region of the human genome. The remaining 8% of tags that did not map to a human sequence were mainly low expression tags (Figure 6b), 77% of which were singletons. Potential sources of unmapped tags include: tags spanning novel splice junctions (as a proposed future analysis, such tags may be identified using the SAGE2Splice software; http://www.cisreg.ca/SAGE2Splice/); tags derived from mouse feeder contamination; tags containing small sequence polymorphisms (2 bp or greater); and artifacts of the SAGE protocol (e.g., RT-PCR). The full list of tag-to-gene mappings is provided in Appendix 2c.

2.3.2 Analysis of unmapped tags

Mouse embryonic fibroblast (MEF) contamination was assessed by mapping tags unassigned to a human sequence to mouse transcript databases (RefSeq) using CMOST and to the mouse genome assembly NCBI build 30.
A total of 1,488 tags mapped to a mouse transcript or genomic sequence, accounting for 2.6% of all WA09 tag types (full mappings listed in Appendix 2d). Though undifferentiated WA09 populations of cells are carefully separated from differentiated embryonic stem cells and MEF prior to RNA extraction, the WA09 total RNA used for SAGE derives from a heterogeneous population of mostly undifferentiated cells with a small population of differentiated cells and MEF. Potential MEF contamination was additionally investigated by assessing the theoretical "contamination" based on the overlap of human and mouse tag sequences.

Figure 6 (A) Distribution of WA09 tags sequenced (log scale Y-axis) classed according to gene expression level (X-axis). Tags expressed once in the library (singletons) are the most represented expression class (30,459 tag types; 53.9% of all unique tag types). The total number of singleton tag types is indicated on the graph. (B) The percentage of total tags sequenced contributed by mapped and unmapped tags at low (1-10 tags/type) and mid-level gene expression (11-50 tags/type).
[Figure 6 plots not reproduced. Panel A: tag types per expression class on a log-scale Y-axis, with the singleton class labelled 30,459. Panel B: percentage of mapped and unmapped tags across the low and mid expression classes.]

2.3.2.1 Species ambiguous tags

Both human and mouse long SAGE UCSC genomic tag sequences were extracted from DiscoveryDB (Gene Expression Informatics; http://www.bcgsc.ca/gelab). The number of human genomic tags extracted was 27,858,501, which corresponded to 18,715,283 unique tag types; 27,725,360 mouse tags were extracted, which corresponded to 18,467,657 unique tag types.
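Extracting genomic tags and reducing them to unique tag types, as described above, can be sketched in a few lines, together with the comparison of two genomes for shared tag types. This is a minimal illustration, not the DiscoveryDB procedure: the function names are hypothetical, and the sketch assumes a tag is the 17 bases immediately 3' of each CATG on either strand, skipping tags that run off the sequence or contain ambiguous bases.

```python
import re
from typing import Set

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def genomic_tags(seq: str, tag_len: int = 17) -> Set[str]:
    """Collect the unique tags found 3' of every CATG on both strands."""
    tags: Set[str] = set()
    forward = seq.upper()
    for strand in (forward, reverse_complement(forward)):
        for match in re.finditer("CATG", strand):
            start = match.end()
            tag = strand[start:start + tag_len]
            # Keep only full-length tags made of unambiguous bases.
            if len(tag) == tag_len and set(tag) <= set("ACGT"):
                tags.add(tag)
    return tags


if __name__ == "__main__":
    # Toy sequences standing in for two genomes (not real data).
    human = genomic_tags("G" * 17 + "CATG" + "A" * 17)
    mouse = genomic_tags("T" * 17 + "CATG" + "A" * 17)
    shared = human & mouse  # tag types identical in both sequences
    print(len(human), len(mouse), len(shared))
```

Because the result is a set, repeated occurrences of a tag collapse to one tag type, which is the distinction the text draws between total genomic tags and unique tag types.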
In total there were 289,453 unique tag types that were identical in human and mouse genomic sequences; these tags were termed "Species Ambiguous Tags" (SATs) (Figure 7) and provided a tag-mapping database to flag sequences that may occur in the hESC libraries due to MEF contamination (Appendix 2e). A fraction of SATs are low complexity sequences derived from repetitive regions of the genome. There were 56,866 tags of this type that mapped to 2 or more locations in the genome; these tags accounted for 20% of SATs.

Figure 7 The intersection between non-redundant human and mouse genomic sequences (UCSC) totaled 289,453 species ambiguous tags (SATs).
[Figure 7 diagram not reproduced. Recoverable labels: SATs 289,453; HUMAN 18,425,830.]

SATs may derive from genes highly conserved between species. A similar effect occurs in microarray experiments, in which probes can cross-hybridize between gene family members both within a single species and potentially between species (e.g., human and mouse). A total of 3,906 tags (6.9% of all WA09 tag types) were SATs (complete list of mappings available in Appendix 2f), 18.0% of which were present in the WA09 library 10 times or more. Table 5 lists the ten most highly expressed SATs and their human transcript mappings. Many of the genes in this list are highly expressed in mammalian cells (e.g., ribosomal proteins). Thus a proportion of the SATs in the WA09 library may have been derived from MEFs. Tags that mapped to SATs and those that mapped only to mouse sequences provided a theoretical estimate of MEF contamination for hESC lines cultured on mouse feeders (2.6% of WA09 tag types mapped only to mouse sequences and 6.9% of WA09 tag types mapped to SATs).

Table 5 Top 10 expressed SAT tags.
Sequence           Count  # Genomic mappings  Transcript description
TGTGTTGAGAGCTTCTC  4094   12                  Eukaryotic translation elongation factor 1 alpha 1
TGGTGTTGAGGAAAGCA  1407   7                   Unknown (protein for MGC:87887)
CCAGAACAGACTGGTGA  1347   3                   Ribosomal protein L30
CTGTTGATTGCTAAATG  904    45                  Heterogeneous nuclear ribonucleoprotein A1, isoform A
GTGTAATAAGACATAAC  823    1                   HNRPA2B1 protein
CGCTGGTTCCAGCAGAA  821    2                   Ribosomal protein L11
TTCATTATAATCTCAAA  820    7                   Prothymosin, alpha (gene sequence 28)
ATCAAGGGTGTTACACT  819    10                  RPL9 protein
AAGGAGATGGGAACTCC  795    19                  Ribosomal protein L31
GGCAAGAAGAAGATCGC  703    3                   Ribosomal protein L27

2.3.3 Detection of stem cell associated genes in WA09

Several known stem cell markers have been isolated by others in mouse and human embryonic stem cells, teratocarcinoma cells, trophoblast stem cells and multipotent stem cells (haematopoietic stem cells and neural stem cells) (Brandenberger et al., 2004a; Brandenberger et al., 2004b; Ivanova et al., 2002; Kelly and Rizzino, 2000; Ramalho-Santos et al., 2002; Richards et al., 2004; Sperger et al., 2003; Tanaka et al., 2002). In an experiment using Affymetrix oligonucleotide arrays, genes present only in 7 different NIH approved hESC lines (NIH code: WA01, WA07, WA09, WA13, WA14, ES03, and ES04) were identified, producing a set of 84 genes (J. Khattra, unpublished). Altogether, 854 transcripts implicated in embryonic stem cell biology, identified by extensive literature searches or by the Affymetrix experiment, were sought in the WA09 long SAGE library; I have termed these pluripotent stem cell associated genes (PAGs) (see Appendix 2g for a list of PAGs and their SAGE tag sequences). The long SAGE technique, which relies on the NlaIII restriction enzyme site, was capable of detecting 763 of the transcripts (89%). The remainder escaped detection via SAGE. Stemness genes expressed in the WA09 line are listed in Appendix 2h. Table 6 lists the most highly expressed PAGs in the library.
Expression of this set of genes was also surveyed across 12 n-CGAP long SAGE libraries, a pooling of the n-CGAP libraries (n-CGAP metalibrary), and a pooling of all the hESC lines (hESC metalibrary). The n-CGAP libraries were pooled to provide greater transcript coverage than a single CGAP library, which was generally sequenced to a depth far less than half the size of our hESC libraries (typically sequenced to a depth of 200,000 total tags). Note that all singletons in the n-CGAP metalibrary were excluded from analysis. Table 7 tabulates the number of PAGs detected and their representation amongst total tags sequenced across each library or metalibrary. The hESC metalibrary expressed 83% of all stemness genes while WA09 individually expressed 74% of the set. In contrast, the n-CGAP metalibrary, which has 41,209 tag types corresponding to 651,881 total tags, detected 61.7% of the stemness genes. The average detection of genes per CGAP library was 35.8%. The exception to this average was the fetal brain library (63.8% of genes detected), which was sequenced to 300,000 tags and individually expressed a greater diversity of tag types than the other normal libraries.

Figure 8 highlights the detection of hallmark pluripotency genes in each hESC library compared to a multipotent adult progenitor cell (MAPC) line (Reyes et al., 2001) and fetal brain (CGAP). All pluripotency markers were highly expressed in each hESC line. In contrast, PAGs were expressed at low levels in MAPC and the brain. Factors that were expressed at low levels in hESC were not present in the non-ES libraries. Both SOX2 and GNL3 were expressed in one or both of the non-ES libraries. GNL3 (also known as nucleostemin) plays a role in stem cell proliferation. The gene was expressed in the MAPC library at levels comparable to the hESC lines.

Table 6 Short-list of the most highly expressed stemness associated genes in the WA09 line (absolute tag counts listed; WA09 library size equals 441,795 tags sequenced). (1) The IL6ST tag sequence is highly repetitive. (2) EEF1A1 is a species ambiguous tag.

Sequence               Count  Gene symbol  Hs.Id      Cytoband
TGTGTTGAGAGCTTCTC (2)  4094   EEF1A1       Hs.439552  6q14.1
TTGGTCCTCTGCCCTGG (1)  2256   IL6ST        Hs.532082  5q11
TGAAATAAAACTCAGTA      1026   NPM1         Hs.519452  5q35
TTGGTGAAGGAAGAAGT      988    TMSB4X       Hs.522584  Xq21.3-q22
GGAACAAACAGATCGAA      980    CD24         Hs.375108  6q21
CGCCGGAACACCATTCT      942    RPL4         Hs.186350  15q22
GCATAATAGGTGTTAAA      866    RAMP         Hs.126774
GTGTAATAAGACATAAC      823    HNRPA2B1     Hs.487774  7p15
TCAGATCTTTGTACGTA      792    RPS4X        Hs.446628  Xq13.1
ATTTGTCCCAGCCTGGG      779    HMGA1        Hs.518805  6p21
TCGTCTTTATCGCTCAG      706    RPS7         Hs.534346  2p25
GAAGCAGGACCAGTAAG      674    CFL1         Hs.170622  11q13
TGAGGGAATAAACCTGG      571    TPI1         Hs.524219  12p13
TAAATAATTTCCATATT      532    HSPE1        Hs.1197    2q33.1
TGTTCTGGAGAGTGTTC      496    GJA1         Hs.74471   6q21-q23.2
TATCAATATTCACTTGA      476    SFRP1        Hs.213424  8p12-p11.1
TTTACTGCTAGAAACCA      435    LIN28        Hs.86154   1p36.11
GGCTGGGGGCCAGGGCT      430    PFN1         Hs.494691  17p13.3
TGCTTCATCTGTGGGAT      427    RAN          Hs.10842   12q24.3
TACCAGTGTACTGCTTT      422    HSPD1        Hs.471014  2q33.1
GGGGAAATCGCCAGCTT      405    TMSB10       Hs.446574  2p11.2
GATCACAGTTTGCTTTG      402    LDHB         Hs.446149  12p12.2-p12.1
TATCACTTTTTTCTTAA      396    POU5F1       Hs.249184  6p21.31
GACGTGTGGGCGCGACT      350    H2AFZ        Hs.119192  4q24
TTAAACCTCAAATAAAT      345    HNRPDL       Hs.527105  4q13-q21
CCGCCTCCGGGAATGAG      330    SNRPN        Hs.525700  15q11.2
GATGCTGCCAATTTTGA      297    RPL22        Hs.515329  1p36.3-p36.2
AACTAAAAAAAAAAAAA      277    RPS27A       Hs.546292  2p16
GAGGACACAGATGACTC      276    PODXL        Hs.16426   7q32-q33
ATGATGATGATGGGACT      270    SLC25A5      Hs.522767  Xq24-q26
ATGTAGTAGTGTCTTAC      264    HNRPD        Hs.480073  4q21.1-q21.2
TTTTATGGGTAACTTTT      261    CCNG1        Hs.79101   5q32-q34
CTGCCTTCTTGGGGATT      242    PPP1CC       Hs.79081   12q24.1-q24.2
GCCTTCCAATAAAAAAT      235    DDX5         Hs.279806  17q21
TCATAGAAACCTTGATT      233    CCT5         Hs.1600    5p15.2
TCCTCAAGATAAAGTCT      232    ERH          Hs.509791  14q24.1
TTGGAGATCTCTATTGT      221    NDUFA4       Hs.50098   7p21.3
TAGCTACAGGACATTTT      211    DNMT3B       Hs.251673  20q11.2
AATATTGAGAAGAAACT      209    EIF3S6       Hs.405590  8q22-q23
TAATTCTTCTCTATTGT      206    CCT3         Hs.491494  1q23
ATAGACATAAAATTGGT      198    C1QBP        Hs.555866  17p13.3
TTAGCAATAAATGATGT      194    TXNL5        Hs.408236  17p13.1
CTTATTTGTTTTAAAAC      191    PLS3         Hs.496622  Xq23
TCAAATGCATCCTCTAG      189    HNRPC        Hs.449114  14q11.2
TAGCTGAGACATAAATT      184    KPNA2        Hs.159557  17q23.1-q23.3
TATATATTTGAACTAAT      180    SOX2         Hs.518438  3q26.3-q27
AAAATTTACAGTTTGCC      179    LECT1        Hs.421391  13q14-q21
CGGCCCAACGCCAAGAA      171    HRMT1L2      Hs.20521   19q13.3
TCTGTCAAGACCAAGAT      168    ATP5O        Hs.409140  21q22.1-q22.2
AAATAAAGAATTTAAAG      165    MGST1        Hs.389700  12p12.3-p12.1

Table 7 Detection of genes up-regulated in pluripotent stem cells and potential hESC markers (pluripotent stem cell associated genes, PAGs). (1) There were 763 tags detecting candidate markers in total. (2) Frequency is % of total tags sequenced per library or metalibrary.
Library                      Number of stemness genes  % stemness genes detected (1)  Frequency of genes in library (2)
WBC                          224                       29.4                           3.5
WBC                          284                       37.2                           3.3
WBC                          269                       35.3                           3.4
WBC                          298                       39.1                           3.6
WBC                          284                       37.2                           3.4
WBC                          241                       31.6                           4.3
WBC                          256                       33.6                           5.8
Pancreas                     110                       14.4                           1.7
Breast myoepithelium         267                       35                             4.1
Substantia nigra             254                       33.3                           3
Liver vascular endothelium   302                       39.6                           3
Fetal brain                  487                       63.8                           3.2
Differentiated Metalibrary   471                       61.7                           5.1
hESC Metalibrary             631                       82.7                           8
WA09                         564                       73.9                           9.5

Figure 8 Detection of pluripotency genes and markers of differentiation. Tag frequencies (tags per 400,000 total tags) were plotted along the Y-axis; gene names were plotted along the X-axis; SAGE libraries were plotted along the Z-axis. LIFR and STAT3 maintain undifferentiated mESC though their function is not conserved in hESC. Both hCGA and hCGB are down-regulated by POU5F1. NES is a marker for ectodermal differentiation (the gene is a specific marker for neural stem cells), ACTC is a marker for mesodermal differentiation and AFP is a marker of early endodermal differentiation. U-hESC libraries all express high levels of pluripotent cell markers and low levels of early differentiation markers. WA01(7), WA01(8), and ES03 express the highest levels of neural stem cell and muscle markers, suggesting that the heterogeneous population of cells that comprised these hESC cultures may contain a greater number of differentiating cells than the other hESC lines.

Genes involved in mESC maintenance (LIFR and STAT3) were not expressed or were expressed at low levels across all libraries (Figure 8). Early markers of differentiation/embryonic development such as hCGA and hCGB are down-regulated by POU5F1; these genes were expressed at low levels or absent in the hESC SAGE data. Markers of ectodermal and mesodermal fate (NES and ACTC respectively) were significantly expressed across the hESC lines. NES is involved in neuronal differentiation while ACTC expression is indicative of muscle differentiation.
The hESC libraries most highly expressing these differentiation markers were WA01(7), WA01(8), ES03 and ES04. Several groups have claimed that ES cells might undergo a default differentiation pathway to neuronal development (Ramalho-Santos et al., 2002). NES is a marker of primitive neural stem cells that have been isolated from the developing mouse epiblast. The detection of NES may be indicative of spontaneous neural differentiation of these lines in culture.

2.3.4 Developmental signalling pathway expression in embryonic stem cells

The regulation of developmental signalling pathways, such as the Wnt, TGFβ and Jak-Stat pathways, is important for maintaining mammalian pluripotent cell types and directing their differentiation. The aim of this analysis was to investigate the breadth and depth of gene expression of specific developmental pathways in hESCs compared to differentiated adult and fetal tissues. These pathways, having previously been implicated in pluripotent stem cell biology, are expected to be near fully expressed in undifferentiated ES cells, including pathway antagonists. Expression of pathway antagonists may be critical for ES developmental plasticity and distinctly different from the expression pattern observed for adult and fetal tissues. I also hypothesize that the full expression of pathway ligands and activators attests to the developmental plasticity of ES cells, as the Wnt and TGFβ signalling pathways direct many disparate cellular fates; this observation should similarly be unique to hESCs. Gene expression of developmental pathways was assayed in the WA09 embryonic stem cell line, a pooling of 11 hESC long SAGE libraries from 8 lines (hESC metalibrary), and a pooling of 12 CGAP long SAGE libraries (n-CGAP metalibrary) from normal adult and fetal samples representative of 4 tissues (blood, pancreas, breast, and brain).
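Comparing a single library with pooled metalibraries requires putting tag counts on a common scale. The sketch below shows one way this could be done, assuming each library is a simple tag-to-count table; the function names are hypothetical and the scaling to 651,881 total tags (the n-CGAP metalibrary size reported above) is a plain proportional rescaling, not necessarily the exact procedure used for the tables in this chapter.

```python
from collections import Counter
from typing import Iterable


def pool(libraries: Iterable[Counter]) -> Counter:
    """Sum tag counts across libraries to form a metalibrary."""
    meta: Counter = Counter()
    for library in libraries:
        meta.update(library)
    return meta


def drop_singletons(library: Counter) -> Counter:
    """Remove tags observed only once, as was done for the n-CGAP metalibrary."""
    return Counter({tag: n for tag, n in library.items() if n > 1})


def normalize(count: int, library_total: int, target_total: int = 651_881) -> float:
    """Rescale a raw tag count to the depth of the target library."""
    return count * target_total / library_total


if __name__ == "__main__":
    # Toy libraries with made-up tags (not real data).
    meta = pool([Counter({"AAA": 1, "CCC": 2}), Counter({"AAA": 1, "GGG": 1})])
    print(drop_singletons(meta))
    # One tag in a library of 466,042 tags, rescaled to 651,881 tags.
    print(round(normalize(1, 466_042), 1))
```

Under this rescaling a single tag from a library of 466,042 tags becomes roughly 1.4, which matches the smallest non-zero WA09 values reported in Table 8; whether that library total was the one actually used there is an assumption.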
The WNT signalling pathway regulates developmental differentiation and has been implicated in stem cell fate and self-renewal (Aubert et al., 2002; Reya et al., 2003; Taipale and Beachy, 2001; Tang et al., 2002; Walsh and Andrews, 2003; Willert et al., 2003). More specifically, WNT signalling regulates multiple events during embryogenesis (Logan and Nusse, 2004) such as neuronal differentiation and limb development (Cadigan and Nusse, 1997). Wnt signalling has particular relevance to stem cell biology given its noted role in cellular proliferation, and it has previously been suggested to function in haematopoietic stem cell (HSC) renewal (Reya et al., 2003; Willert et al., 2003). I have focused on the canonical pathway mediated by the Frizzled receptors and β-catenin (CTNNB1). The hESC metalibrary expressed the majority of WNT pathway genes (Figure 9) (Tables 8 and 9) (the complete list of Wnt pathway gene expression is provided in Appendix 2i). In a recent study, gene expression in hESC (lines WA01, WA07, and WA09) was measured using EST sequencing (Brandenberger et al., 2004b) and similarly demonstrated the expression of much of the WNT pathway, except for WNT1, WNT2B, WNT10B, and WNT11, which I detected in our long SAGE data (WNT1, 2B, 3-6, 8B, 10B-11) (Figure 9) (Tables 8 and 9). A greater variety of Wnt ligands was expressed in hESCs than in the n-CGAP metalibrary, but all were expressed at low levels (<5 tags per 650,000 total tags). The individual WA09 line and the hESC metalibrary expressed a wider repertoire of Wnt pathway receptors (Frizzled 1-4, 6-10; LRP5-6) than the n-CGAP metalibrary (WA09 expressed 6 out of 9 Frizzled genes and 2 out of 2 LRP genes; the hESC metalibrary expressed 8 out of 9 Frizzled genes and both LRP genes; the n-CGAP metalibrary expressed 4 out of 9 Frizzled genes and LRP5).
Expression levels for FZD7 and FZD2 in WA09 and the hESC metalibrary were significantly up-regulated (p<0.01) compared to the differentiated samples (Table 10; Table 11). FZD2 and FZD7 were greater than 20-fold more abundant in WA09 than in the differentiated library (Table 11), and the genes were greater than 15-fold and 8-fold more abundant in the hESC metalibrary compared to differentiated tissues (Table 10).

Figure 9 The Wnt signalling pathway expression in hESC long SAGE libraries (hESC metalibrary) and differentiated normal CGAP long SAGE libraries (n-CGAP metalibrary). Tag expression levels were normalized to the total tags in the n-CGAP metalibrary (651,881 total tags). Expression levels were denoted by the heat-map. Boxes with vertical lines (low gene expression) represent >0-10 tags per gene, boxes with light gray dots (mid-level gene expression) represent >10-100 tags per gene, horizontal lined and dark gray boxes with dots (high gene expression) reflect >100-500 and >500 tags per gene respectively. The top box reflects hESC metalibrary expression; the bottom box reflects differentiated metalibrary expression. ~| inhibition; -> activation.
[Figure 9 diagram not reproduced. Recoverable labels include SFRP1, SFRP5, CER1, WIF1, DKK1, PORCN, FZD2, FZD6, FZD7, LRP6, DVL1, FRAT2, GSK3β, AXIN1, APC, CTNNB1, CTNNBIP1, TCF7, LEF1, SOX17, SMAD3, SMAD4, CTBP1, CCND1, CCND2, CCND3 and FOSL1, the WNT ligand genes, the cell membrane and nuclear membrane, and links to cadherin-mediated cell adhesion, MAPK signalling, TGFβ signalling, ubiquitin-mediated proteolysis and the cell cycle.]

Table 8 Summary of Wnt ligand and receptor expression in the WA09 long SAGE library, a pooling of 11 hESC libraries generated from 8 stem cell lines (hESCs), and a pooling of 12 normal adult and fetal samples representative of 4 tissues (Differentiated). hESC tags were normalized to the total number of tags expressed in the Differentiated library (651,881 total tags). Tags were mapped to genes using the CGAP SAGE Genie Best Tag mappings (http://cgap.nci.nih.gov/SAGE).

UniGene accession  Gene symbol  WA09     hESCs    Differentiated  Role in Wnt pathway signalling
Hs.94234           FZD1         0        0        0               Receptor
Hs.31664           FZD10        0        0.7      0               Receptor
Hs.142912          FZD2         19.6     8.1      0               Receptor
Hs.40735           FZD3         1.4      2.5      2               Receptor
Hs.19545           FZD4         4.2      4.3      5               Receptor
Hs.292464          FZD6         2.8      2.2      2               Receptor
Hs.173859          FZD7         120.3    90.1     5               Receptor
Hs.302634          FZD8         1.4      1.1      0               Receptor
Hs.534367          FZD9         0        0.9      0               Receptor
Hs.6347            LRP5         4.2      4        4               Receptor
Hs.549194          LRP6         1.4      1.3      0               Receptor
Hs.248164          WNT1         0        0.4      0               Ligand
Hs.121540          WNT10A       0        0        0               Ligand
Hs.91985           WNT10B       0        1.3      14              Ligand
Hs.108219          WNT11        1.4      1.1      5               Ligand
Hs.272375          WNT16        0        0        0               Ligand
Hs.128553          WNT2         0        0        0               Ligand
Hs.258575          WNT2B        1.4      0.4      0               Ligand
Hs.445884          WNT3         0        0.2      0               Ligand
Hs.336930          WNT3A        0        0        0               Ligand
Hs.25766           WNT4         0        0.7      0               Ligand
Hs.152213          WNT5A        2.8      1.6      0               Ligand
Hs.306051          WNT5B        1.4      0.7      2               Ligand
Hs.29764           WNT6         0        0.7      0               Ligand
Hs.72290           WNT7A        0        0        3               Ligand
Hs.512714          WNT7B        0        0        0               Ligand
Hs.259471          WNT8A        0        0        0               Ligand
Hs.421281          WNT8B        0        0.2      0               Ligand
Hs.558416          WNT9A        0        0        0               Ligand
Hs.326420          WNT9B        0        0        0               Ligand

Table 9 Expression of Wnt signalling components in the WA09 long SAGE library, a pooling of 11 hESC libraries generated from 8 stem cell lines (hESCs), and a pooling of 12 normal adult and fetal samples representative of 4 tissues (Differentiated). hESC tags were normalized to the total number of tags expressed in the Differentiated library (651,881 total tags). The effect of each component on pathway activity is indicated. Tags were mapped to genes using the CGAP SAGE Genie Best Tag mappings (http://cgap.nci.nih.gov/SAGE).

UniGene accession  Gene symbol  WA09     hESCs    Differentiated  Effect on Wnt pathway signalling
Hs.459759          CREBBP       2.8      2.7      13              Activation
Hs.529862          CSNK1A1      89.5     74.6     29              Activation
Hs.474833          CSNK1E       78.3     69.7     38              Activation
Hs.74375           DVL1         51.8     36.5     68              Activation
Hs.140720          FRAT2        58.7     81.4     91              Activation
Hs.386453          PORCN        12.6     8.1      6               Activation
Hs.194350          PRKACA       4.2      3.6      34              Activation
Hs.3260            PSEN1        14       12.8     2               Activation
Hs.515846          RUVBL2       19.6     20.8     6               Activation
Hs.555881          SMAD3        2.8      2.7      2               Activation
Hs.75862           SMAD4        33.6     30       11              Activation
Hs.476018          CTNNB1       169.3    71       17              Activation (transcription)
Hs.555947          LEF1         1.4      2.5      4               Activation (transcription)
Hs.519580          TCF7         8.4      6.1      5               Activation (transcription)
Hs.158932          APC          0        0.2      31              Inactivation
Hs.512765          AXIN1        26.6     17.5     10              Inactivation
Hs.255973          CRI1         85.3     70.1     61              Inactivation
Hs.18949           CRI2         9.8      5.8      6               Inactivation
Hs.208597          CTBP1        46.2     39.2     74              Inactivation
Hs.463759          CTNNBIP1     18.2     15.9     0               Inactivation
Hs.12248           CXXC4        0        0        2               Inactivation
Hs.445733          GSK3B        1.4      6.7      3               Inactivation
Hs.507681          MAP3K7IP1    12.6     7.8      11              Inactivation
Hs.432453          MAP3K8       0        1.1      4               Inactivation
Hs.445496          MAP3K9       5.6      3.4      4               Inactivation
Hs.431550          MAP4K4       14       13.2     10              Inactivation
Hs.298434          NKD1         0        0.2      0               Inactivation
Hs.208759          NLK          2.8      1.8      0               Inactivation
Hs.483408          PPP2CA       81.1     58.9     24              Inactivation
Hs.533124          SENP5        11.2     6.5      14              Inactivation
Hs.213424          SFRP1        664.4    226.6    15              Inactivation (extracellular)
Hs.105700          SFRP4        0        0        32              Inactivation (extracellular)
Hs.279565          SFRP5        2.8      3.6      7               Inactivation (extracellular)
Hs.98367           SOX17        0        5.4      21              Inactivation (extracellular)
Hs.248204          CER1         0        1.6      0               Inactivation (extracellular)
Hs.40499           DKK1         0        0.2      0               Inactivation (extracellular)
Hs.284122          WIF1         0        0.7      4               Inactivation (extracellular)

Table 9 lists extracellular and intracellular activators and inhibitors of Wnt signalling. The intracellular machinery that propagates the Wnt signal to the nucleus, where transcription of target genes may occur, was detected in its entirety in the WA09 library, hESC metalibrary and n-CGAP metalibrary.
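The normalization and heat-map binning behind Tables 8 and 9 and Figure 9 can be illustrated with a minimal Python sketch. It assumes simple proportional scaling of raw tag counts to the 651,881 tags of the differentiated metalibrary. The function names and the 500,000-tag library in the usage example are hypothetical and are not taken from this thesis.

```python
# Total tags in the differentiated (n-CGAP) metalibrary, as stated in the table captions.
REFERENCE_TOTAL = 651_881


def normalize(raw_count, library_total, reference_total=REFERENCE_TOTAL):
    """Scale a raw tag count to the reference library size (assumed proportional scaling)."""
    return raw_count * reference_total / library_total


def heatmap_bin(tags):
    """Assign a normalized tag count to the expression bins used in the figure legends."""
    if tags <= 0:
        return "not detected"
    if tags <= 10:
        return "low (>0-10)"
    if tags <= 100:
        return "mid (>10-100)"
    if tags <= 500:
        return "high (>100-500)"
    return "high (>500)"


if __name__ == "__main__":
    # Hypothetical library of 500,000 tags with 10 raw tags for one gene.
    print(round(normalize(10, 500_000), 1))
    # Normalized values reported in Tables 8 and 9 (FZD7 in hESCs, SFRP1 in WA09).
    print(heatmap_bin(90.1))
    print(heatmap_bin(664.4))
```

With the values from Tables 8 and 9, FZD7 in the hESC metalibrary (90.1 tags) falls in the mid-level bin and SFRP1 in WA09 (664.4 tags) falls in the highest bin.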
A number of Wnt pathway activators were significantly up-regulated in hESCs (p<0.01) (Table 10 lists the genes differentially expressed in the hESC metalibrary versus the differentiated metalibrary; Table 11 lists the genes differentially expressed in WA09 versus the differentiated metalibrary) and included CSNK1A1, CSNK1E, PSEN1, SMAD4, and CTNNB1. The major effects of WNT signalling hinge upon the nuclear translocation of CTNNB1 to activate the transcription of target genes (e.g., transcription factors and cell cycle regulatory genes). Constitutive activation of β-catenin (CTNNB1) was suggested to underlie tumorigenesis in self-renewing tissues such as the colon (Kielman et al., 2002). Over-expression of CTNNB1 and/or the absence of a Wnt-receptor interaction results in the degradation of excess cytoplasmic protein, regulated by a multi-protein complex consisting of GSK3β, AXIN, and APC (Cadigan and Nusse, 1997). In hESCs, APC was absent in the WA09 line and expressed at less than 1 tag/650,000 in the entire metalibrary (Table 9). CTNNB1 expression was more than 9-fold higher in WA09 and approximately 4-fold higher in the hESC metalibrary than in the differentiated metalibrary (Tables 9 to 11). This observation and the apparent lack of expression of key regulators of CTNNB1 protein levels suggest that Wnt signalling was active in hESCs and unlikely to be downregulated at the protein level.

Table 10 Differential gene expression of Wnt signalling components compared between the hESC metalibrary and differentiated metalibrary. Significance of expression differences was determined using Audic-Claverie statistics (Audic and Claverie, 1997) with P≤0.01 (a P-value of 0 is equivalent to P≤0.0001). Fold-difference in expression is denoted as the natural log (ln) ratio of the hESC metalibrary/differentiated metalibrary. Genes with negative ln ratios (the grayed portion of the original table) are downregulated in hESCs. Tags were mapped to genes using the CGAP SAGE Genie Best Tag mappings (http://cgap.nci.nih.gov/SAGE).

UniGene accession  Gene symbol  P-value  Ln ratio hESC metalibrary/differentiated metalibrary
Hs.408312          TP53         0        3.5348
Hs.463759          CTNNBIP1     0        2.7811
Hs.173859          FZD7         0        2.7116
Hs.213424          SFRP1        0        2.6515
Hs.523852          CCND1        0        2.3138
Hs.142912          FZD2         0.0011   2.1153
Hs.495656          TBL1X        0        1.6067
Hs.3260            PSEN1        0.0011   1.4662
Hs.476018          CTNNB1       0        1.3761
Hs.508524          CACYBP       0        1.2584
Hs.515846          RUVBL2       0.0007   1.1018
Hs.75862           SMAD4        0.0004   0.9248
Hs.529862          CSNK1A1      0        0.9143
Hs.483408          PPP2CA       0        0.8615
Hs.202453          MYC          0.0083   0.7108
Hs.474833          CSNK1E       0.0002   0.5838
Hs.74375           DVL1         0        -0.6298
Hs.208597          CTBP1        0        -0.6426
Hs.98367           SOX17        0        -1.3678
Hs.459759          CREBBP       0.0001   -1.5697
Hs.500812          BTRC         0        -1.6699
Hs.534307          CCND3        0        -1.7732
Hs.194350          PRKACA       0        -2.2177
Hs.91985           WNT10B       0        -2.2577
Hs.525704          JUN          0        -2.7228
Hs.72290           WNT7A        0.0022   -2.8819
Hs.220971          FOSL2        0        -3.3414
Hs.158932          APC          0        -4.2682
Hs.105700          SFRP4        0        -4.9921

Table 11 Differential gene expression of Wnt signalling components compared between the WA09 library and differentiated metalibrary. Significance of expression differences was determined using Audic-Claverie statistics (Audic and Claverie, 1997) with P≤0.01 (a P-value of 0 is equivalent to P≤0.0001). Fold-difference in expression is denoted as the natural log (ln) ratio of WA09/differentiated metalibrary. Genes with negative ln ratios (the grayed portion of the original table) are downregulated in WA09. Tags were mapped to genes using the CGAP SAGE Genie Best Tag mappings (http://cgap.nci.nih.gov/SAGE).

UniGene accession  Gene symbol  P-value  Ln ratio WA09/differentiated metalibrary
Hs.408312          TP53         0        3.9191
Hs.213424          SFRP1        0        3.7284
Hs.142912          FZD2         0        3.0436
Hs.173859          FZD7         0        3.0097
Hs.463759          CTNNBIP1     0        2.9746
Hs.495656          TBL1X        0        2.3833
Hs.476018          CTNNB1       0        2.2492
Hs.523852          CCND1        0        1.9697
Hs.3260            PSEN1        0.0039   1.6349
Hs.483408          PPP2CA       0        1.1942
Hs.529862          CSNK1A1      0        1.1088
Hs.75862           SMAD4        0.0014   1.0696
Hs.508524          CACYBP       0.0003   1.0287
Hs.474833          CSNK1E       0.0005   0.7151
Hs.171626          SKP1A        0        0.7095
Hs.534307          CCND3        0        -1.215
Hs.500812          BTRC         0        -1.6305
Hs.194350          PRKACA       0        -1.8335
Hs.91985           WNT10B       0.0006   -2.3725
Hs.220971          FOSL2        0        -2.6089
Hs.98367           SOX17        0        -2.7555
Hs.158932          APC          0        -3.1301
Hs.105700          SFRP4        0        -3.1609
Hs.525704          JUN          0        -3.7249

A number of molecular factors that have inhibitory effects on this developmental pathway were up-regulated in hESCs (CTNNBIP1, SFRP1, and PPP2CA) (Tables 10 and 11). CTNNBIP1 directly interacts with CTNNB1 to prevent its interaction with the TCF/LEF complex, thereby preventing transcriptional activation of target genes. In hESCs, this gene was significantly more highly expressed than in differentiated samples (16-fold and 19-fold greater in the hESC metalibrary and WA09 line respectively). The extracellular inhibitor of Wnt receptors, SFRP1, was also significantly more abundant in hESC samples (Tables 10 and 11). Additionally, SFRP1 was shown to be up-regulated in a published EST analysis of hESCs compared to their differentiated derivatives (embryoid bodies, pre-neuronal, and pre-hepatocyte-like cells) (Brandenberger et al., 2004). The use of different SFRP genes in regulating Wnt-receptor interactions is likely to inhibit specific aspects of Wnt signalling.
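Tables 10 and 11 report expression differences as natural-log ratios, whereas the text quotes fold differences. The two are related by fold = e^(ln ratio). The following minimal Python sketch shows the conversion using the CTNNBIP1 values from Tables 10 and 11; the function name is hypothetical and chosen only for illustration.

```python
import math


def fold_change(ln_ratio):
    """Convert a natural-log expression ratio into a fold difference."""
    return math.exp(ln_ratio)


if __name__ == "__main__":
    # CTNNBIP1 ln ratios taken from Table 10 (hESC metalibrary) and Table 11 (WA09).
    for label, ln_ratio in [("hESC metalibrary", 2.7811), ("WA09", 2.9746)]:
        print(f"CTNNBIP1, {label}: {fold_change(ln_ratio):.1f}-fold")
```

These ratios correspond to roughly 16-fold and 19-fold differences, matching the values quoted for CTNNBIP1. A negative ln ratio gives a fold value below 1, meaning lower expression in the stem cell sample.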
Wnt knockout phenotypes in the mouse provided evidence that loss-of-function of specific Wnts had disparate effects on many aspects of embryonic development, including defects in gastrulation, mesoderm patterning, neural crest development, and placental development (Aulehla et al., 2003; Barrow et al., 2003; Ikeya et al., 1997; Parr et al., 2001). One hypothesis is that high expression of SFRP1 in hESCs antagonizes developmental differentiation and promotes mitotic activity. Targeted silencing of SFRP1 in stem cells by small-interfering RNA molecules (siRNAs) (Caplen et al., 2001; Dykxhoorn et al., 2003) would provide the necessary functional validation to confirm or refute this proposed effect. Downstream of CTNNB1, transcriptional co-factors (TCF7 and LEF1) were expressed in WA09 at low levels while antagonists to their action (NLK and CTBP1) were concurrently expressed. Several target genes of the WNT pathway were detected in the hESC lines, including the cell cycle regulators CCND1, 2, and 3 and the transcription factors MYC and FOSL1 (Figure 9). The expression of the CCND genes, particularly CCND1, which was more highly expressed in stem cells than in differentiated samples (greater than 7-fold expression difference; p<0.0001), may have a role in preventing differentiation by maintaining hESC mitotic activity. A wider repertoire of agonists and antagonists of the WNT pathway was expressed across all the hESC lines than in a pooling of 4 normal adult and fetal tissues. The high expression levels of antagonists of the pathway such as SFRP1 and CTNNBIP1 were not similarly detected in differentiated tissues, suggesting that Wnt signalling is tightly regulated. SFRP1 and CTNNBIP1 may play a role in suppressing the specific differentiation effects of the Wnt pathway while some unknown mechanism may propagate Wnt signalling effects on cellular proliferation.
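The significance values in Tables 10 and 11 were obtained with Audic-Claverie statistics. As an illustration only, the Python sketch below implements the conditional probability p(y|x) published by Audic and Claverie (1997) together with a simple one-sided tail sum. This thesis does not state which software or tail convention was used, so the function names, the tail convention, and the example counts (normalized FZD7 values rounded to integers and treated as raw counts from equally sized libraries) are all assumptions.

```python
import math


def ac_prob(x, y, n1, n2):
    """Audic-Claverie probability of seeing y tags in library 2 (size n2)
    given x tags for the same gene in library 1 (size n1)."""
    q = n2 / n1
    log_p = (
        y * math.log(q)
        + math.lgamma(x + y + 1) - math.lgamma(x + 1) - math.lgamma(y + 1)
        - (x + y + 1) * math.log1p(q)
    )
    return math.exp(log_p)


def ac_pvalue(x, y, n1, n2):
    """One-sided P-value: the smaller of the lower and upper cumulative tails.
    The choice of tail convention is an assumption made for this sketch."""
    lower = math.fsum(ac_prob(x, k, n1, n2) for k in range(y + 1))
    upper = 1.0 - math.fsum(ac_prob(x, k, n1, n2) for k in range(y))
    return min(1.0, max(0.0, min(lower, upper)))


if __name__ == "__main__":
    # Illustrative counts only: 5 tags (differentiated) versus 90 tags (hESC),
    # both expressed per 651,881 total tags.
    p = ac_pvalue(5, 90, 651_881, 651_881)
    print("significant at P<=0.01:", p <= 0.01)
```

Because the probabilities are computed in log space with `math.lgamma`, the sketch remains numerically stable for the tag counts of several hundred that occur in these libraries.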
hESCs exist on the cusp of differentiation and express hallmark developmental pathways, but these pathways are held in check until the necessary environmental cue is received to direct cellular fate down a specific path.

The transforming growth factor beta (TGFβ) and Nodal signalling pathway governs various differentiation events such as osteoblast differentiation, neurogenesis, angiogenesis, embryo differentiation and placenta formation (Figure 10). Nodal signalling specifically induces mesoderm and endoderm in the early embryo and has a role in left-right axis determination (Hart et al., 2005). Expression of a Nodal recombinant protein and constitutive expression of Nodal during hESC differentiation resulted in the retention of pluripotency markers in embryoid bodies and the expression of extra-embryonic endoderm markers (Vallier et al., 2004). The results further indicated that Nodal signalling prevented a 'default' progression to a neuroectoderm fate in comparison to control embryoid bodies. These findings implicated Nodal and TGFβ signalling in pluripotency and in early cell fate decisions in vitro.

Figure 10 Expression of the TGFβ signalling pathway in hESC long SAGE libraries (hESC metalibrary) and differentiated normal CGAP long SAGE libraries (n-CGAP metalibrary). Tag expression levels were normalized to the total tags in the n-CGAP metalibrary (651,881 total tags). Expression levels were denoted by the heat-map. Boxes with vertical lines (low gene expression) represent >0-10 tags per gene, boxes with light gray dots (mid-level gene expression) represent >10-100 tags per gene, and horizontal lined boxes and dark gray boxes with dots (high gene expression) reflect >100-500 and >500 tags per gene respectively. Top box reflects hESC metalibrary expression; bottom box reflects n-CGAP metalibrary expression. -| inhibition; -> activation.

[Figure 10 diagram not reproduced. Recoverable labels include THBS1-4, COMP, NOG, BMP4, LOC283155, AMHR2, SMAD1, SMAD2, SMAD3, SMAD4, SMAD6, SMAD7, LEFTY1, TGFBR1, TGFBR2, FST, INHBA, ZFYVE9, ZFYVE16, NODAL, CER1, ACVR2, MAPK1, MAPK3, MYC, CDKN2B, PITX2, FOXH1 and DRAP1, the cell membrane and nuclear membrane, and the outcomes angiogenesis, apoptosis, differentiation and neurogenesis, cell cycle G1 arrest, embryo differentiation, left-right axis determination and mesoderm/endoderm induction.]

The role of TGFβ signalling in undifferentiated ES cells has been previously suggested by various gene expression studies documenting rapid and significant down-regulation of TDGF1 upon differentiation (Brandenberger et al., 2004b; Carpenter et al., 2004; Richards et al., 2004). Our own data showed the near complete expression of the TGFβ signalling network in hESCs (Tables 12 and 13). Appendix 2j is the complete list of TGFβ gene expression in the WA09 line, hESC metalibrary, and differentiated metalibrary. Genes that were significantly differentially expressed (P≤0.01) between hESCs and the differentiated metalibrary are listed in Tables 14 and 15. Table 12 summarizes the expression of TGFβ signalling network ligands, receptors, and transcriptional targets in the SAGE libraries. Signalling through different heterodimers in the pathway leads to distinct cellular fates. Activation of the LOC283155/AMHR2 heterodimer directs cells to osteoblast differentiation, neurogenesis, or ventral mesoderm specification. The TGFBR1/TGFBR2 receptor complex plays a role in angiogenesis, apoptosis and cell cycle regulation.
TGFBR1/ACVR2 signalling regulates embryo differentiation, gonadal growth and placenta formation. Lastly, the receptor complex involved in Nodal signalling determines left-right axis patterning and mesoderm/endoderm induction during embryonic development.

Table 12 Expression of the TGFβ signalling network ligands, receptors and transcriptional targets in the WA09 long SAGE library, a pooling of 11 hESC libraries generated from 8 stem cell lines (hESCs), and a pooling of 12 normal adult and fetal samples representative of 4 tissues (Differentiated). hESC tags were normalized to the total number of tags expressed in the Differentiated library (651,881 total tags). Tags were mapped to genes using the CGAP SAGE Genie Best Tag mappings (http://cgap.nci.nih.gov/SAGE).

UniGene accession  Gene symbol  WA09     hESC     Differentiated  Role in TGFβ signalling network
Hs.73853           BMP2         0        3.4      5               Ligand
Hs.68879           BMP4         2.8      8.3      0               Ligand
Hs.296648          BMP5         0        0        0               Ligand
Hs.285671          BMP6         0        0        0               Ligand
Hs.473163          BMP7         8.4      10.5     2               Ligand
Hs.494158          BMP8A        0        0        0               Ligand
Hs.409964          BMP8B        0        0        0               Ligand
Hs.370414          NODAL        4.2      30.3     0               Ligand
Hs.1103            TGFB1        5.6      13       175             Ligand
Hs.133379          TGFB2        0        0.2      0               Ligand
Hs.2025            TGFB3        0        0        12              Ligand
Hs.1573            GDF5         0        0        0               Ligand
Hs.447688          GDF7         0        0        0               Ligand
Hs.438918          ACVR1B       2.79752  4.03408  57              Receptor
Hs.470174          ACVR2        0        0        0               Receptor
Hs.437877          AMHR2        0        0.2      0               Receptor
Hs.448651          LOC283155    1.4      0.9      0               Receptor
Hs.494622          TGFBR1       37.8     19.5     4               Receptor
Hs.82028           TGFBR2       2.8      2        20              Receptor
Hs.72901           CDKN2B       0        0        0               Transcriptional target
Hs.449410          FOXH1        0        0.4      0               Transcriptional target
Hs.202453          MYC          9.8      24.2     11              Transcriptional target
Hs.92282           PITX2        4.2      8.1      2               Transcriptional target
Hs.462590          TIAF1        7        4.5      47              Transcriptional target

Table 13 Expression of the TGFβ signalling network activators and inhibitors in the WA09 long SAGE library, a pooling of 11 hESC libraries generated from 8 stem cell lines (hESCs), and a pooling of 12 normal adult and fetal samples representative of 4 tissues (Differentiated). hESC tags were normalized to the total number of tags expressed in the Differentiated library (651,881 total tags). Tags were mapped to genes using the CGAP SAGE Genie Best Tag mappings (http://cgap.nci.nih.gov/SAGE). The effect of each gene on pathway activity is listed.

UniGene accession  Gene symbol  WA09     hESC     Differentiated  Effect on TGFβ signalling
Hs.9914            FST          23.8     19.7     0               Activation (extracellular)
Hs.385870          TDGF1        151.1    400.3    0               Activation (extracellular)
Hs.519005          SMAD1        9.8      4        3               Activation (intracellular)
Hs.465061          SMAD2        12.6     7.4      5               Activation (intracellular)
Hs.75862           SMAD4        33.6     30       11              Activation (intracellular)
Hs.459759          CREBBP       2.8      2.7      13              Activation (transcription)
Hs.504609          ID1          318.9    249.2    27              Activation (transcription)
Hs.524461          SP1          9.8      6.5      3               Activation (transcription)
Hs.856             IFNG         0        0        0               Inhibition
Hs.241570          TNF          0        0        11              Inhibition
Hs.248204          CER1         0        1.6      0               Inhibition (extracellular)
Hs.166186          CHRD         1.4      0.7      16              Inhibition (extracellular)
Hs.1584            COMP         0        0        5               Inhibition (extracellular)
Hs.156316          DCN          0        0.4      84              Inhibition (extracellular)
Hs.28792           INHBA        0        1.1      5               Inhibition (extracellular)
Hs.278239          LEFTY1       21       160.9    3               Inhibition (extracellular)
Hs.248201          NOG          0        0        0               Inhibition (extracellular)
Hs.164226          THBS1        12.6     20.2     31              Inhibition (extracellular)
Hs.371147          THBS2        4.2      4.5      8               Inhibition (extracellular)
Hs.169875          THBS3        12.6     4.3      10              Inhibition (extracellular)
Hs.211426          THBS4        4.2      4.9      2               Inhibition (extracellular)
Hs.356742          DRAP1        72.7     63.6     98              Inhibition (intracellular)
Hs.431850          MAPK1        7        10.5     5               Inhibition (intracellular)
Hs.861             MAPK3        11.2     13.4     16              Inhibition (intracellular)
Hs.474949          RBX1         72.7     94.6     85              Inhibition (intracellular)
Hs.153863          SMAD6        2.8      4.7      5               Inhibition (intracellular)
Hs.465087          SMAD7        12.6     9        4               Inhibition (intracellular)
Hs.189329          SMURF1       4.2      6.7      10              Inhibition (intracellular)
Hs.482660          ZFYVE16      4.2      3.8      16              Inhibition (intracellular)
Hs.532345          ZFYVE9       5.6      6.5      7               Inhibition (intracellular)
Hs.255973          CRI1         85.3     70.1     61              Inhibition (transcription)
Hs.18949           CRI2         9.8      5.8      6               Inhibition (transcription)
Hs.108371          E2F4         19.6     12.6     6               Inhibition (transcription)
Hs.207745          RBL1         0        0        0               Inhibition (transcription)
Hs.79353           TFDP1        11.2     7.2      5               Inhibition (transcription)

Table 14 Differential gene expression of TGFβ signalling components compared between the hESC metalibrary and differentiated metalibrary. Significance of expression differences was determined using Audic-Claverie statistics (Audic and Claverie, 1997) with P≤0.01 (a P-value of 0 is equivalent to P≤0.0001). Fold-difference in expression is denoted as the natural log (ln) ratio of hESC metalibrary/differentiated metalibrary. Genes with negative ln ratios (the grayed portion of the original table) are downregulated in hESCs. Tags were mapped to genes using the CGAP SAGE Genie Best Tag mappings (http://cgap.nci.nih.gov/SAGE).

UniGene accession  Gene symbol  P-value  Ln ratio hESC/differentiated
Hs.385870          TDGF1        0        5.9927
Hs.278239          LEFTY1       0        3.696
Hs.370414          NODAL        0        3.4171
Hs.9914            FST          0        2.993
Hs.504609          ID1          0        2.187
Hs.68879           BMP4         0.0009   2.142
Hs.494622          TGFBR1       0.0001   1.3723
Hs.473163          BMP7         0.006    1.277
Hs.75862           SMAD4        0.0004   0.9248
Hs.483408          PPP2CA       0        0.8615
Hs.202453          MYC          0.0083   0.7108
Hs.23650           MAZ          0        -0.4069
Hs.356742          DRAP1        0.0003   -0.4382
Hs.247077          RHOA         0        -0.5569
Hs.507916          TGFB1I4      0        -1.2902
Hs.482660          ZFYVE16      0.0001   -1.4384
Hs.459759          CREBBP       0.0001   -1.5697
Hs.82028           TGFBR2       0        -2.2375
Hs.462590          TIAF1        0        -2.3223
Hs.1103            TGFB1        0        -2.5885
Hs.166186          CHRD         0        -2.9425
Hs.1584            COMP         0.0001   -3.2874
Hs.241570          TNF          0        -3.9805
Hs.2025            TGFB3        0        -4.0605
Hs.156316          DCN          0        -4.8396

Table 15 Differential gene expression of TGFβ signalling components compared between the WA09 library and differentiated metalibrary. Significance of expression differences was determined using Audic-Claverie statistics (Audic and Claverie, 1997) with P≤0.01 (a P-value of 0 is equivalent to P≤0.0001). Fold-difference in expression is denoted as the natural log (ln) ratio of WA09/differentiated metalibrary. Genes with negative ln ratios (the grayed portion of the original table) are downregulated in WA09. Tags were mapped to genes using the CGAP SAGE Genie Best Tag mappings (http://cgap.nci.nih.gov/SAGE).

UniGene accession  Gene symbol  P-value  Ln ratio WA09/differentiated
Hs.385870          TDGF1        0        5.0269
Hs.9914            FST          0        3.226
Hs.504609          ID1          0        2.4371
Hs.494622          TGFBR1       0        2.0584
Hs.278239          LEFTY1       0.0004   1.7219
Hs.483408          PPP2CA       0        1.1942
Hs.75862           SMAD4        0.0014   1.0696
Hs.247077          RHOA         0        -1.0561
Hs.507916          TGFB1I4      0        -1.3326
Hs.82028           TGFBR2       0.0012   -1.6103
Hs.462590          TIAF1        0        -1.7439
Hs.166186          CHRD         0.0017   -1.8045
Hs.241570          TNF          0.0031   -2.1493
Hs.2025            TGFB3        0.0018   -2.2294
Hs.1103            TGFB1        0        -3.2255
Hs.156316          DCN          0        -4.1071

The TGFBR1 and TGFBR2 receptors were expressed in hESCs; TGFBR1 was expressed in excess in the hESC lines compared to n-CGAP libraries (approximately 4-fold and 8-fold increased expression in the pooled hESC library and the WA09 line respectively; P≤0.0001) (Tables 14 and 15). The ACVR1B and ACVR2B receptors mediate the effects of Nodal signalling. ACVR1B was detected at low levels in hESCs at less than 1 tag per 650,000 total tags (not detected in the WA09 line or the n-CGAP metalibrary) and ACVR2B was not detected in any human long SAGE library. This suggests that Nodal signalling was not functional in u-hESCs; however, observations to the contrary were noted in the SAGE data. Nodal expression was significantly greater in the hESC metalibrary (30-fold higher; P<0.0001) and downstream transcriptional targets (PITX2 and FOXH1) were expressed in the hESC libraries. A co-receptor of the Nodal ligand, TDGF1, was up-regulated in hESCs (Brandenberger et al., 2004b) and expressed at high levels in our SAGE data while undetected in differentiated tissues (Table 13). Additionally, RT-PCR expression of ACVR1B and ACVR2B was reported in undifferentiated hESCs (Vallier et al., 2004), which implies that the necessary receptors were expressed but below the sensitivity of SAGE detection. BMP4 signalling in human ES cells has been shown to orchestrate early differentiation events (Xu et al., 2002b); thus the expression of the pathway genes should be evident in the SAGE data.
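Whether a rare transcript such as ACVR1B or ACVR2B is sampled at all depends on library depth. As a rough illustration, the Python sketch below estimates the chance of observing at least one tag for a transcript of a given abundance. It assumes that tag sampling is approximately Poisson, which is an assumption of this sketch rather than a model stated in this thesis; the function name and example values are hypothetical.

```python
import math


def detection_probability(transcript_fraction, library_size):
    """Probability of sampling at least one tag, assuming Poisson sampling.

    transcript_fraction: hypothetical share of all tags belonging to the transcript
    library_size: total number of tags sequenced in the library
    """
    expected_tags = transcript_fraction * library_size
    return 1.0 - math.exp(-expected_tags)


if __name__ == "__main__":
    # A transcript present at 1 tag per 650,000, sampled at two hypothetical depths.
    for depth in (650_000, 2_600_000):
        p = detection_probability(1 / 650_000, depth)
        print(f"{depth:>9} tags: P(detected) = {p:.2f}")
```

Under this assumption, a transcript present at about one tag per 650,000 is missed in a substantial fraction of libraries of that size, which is consistent with receptors being detectable by RT-PCR yet absent from individual SAGE libraries.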
Diverse BMP ligands and their corresponding receptors (LOC283155 and AMHR2) were expressed in the hESC SAGE libraries; BMP4 and BMP7 were significantly more highly expressed in stem cells than in the adult and fetal samples (>3-fold higher expression; P≤0.006) (Table 14). The transcriptional target of BMP signalling, ID1, also had a higher expression value in the hESC libraries (>8-fold increased expression; P<0.0001) (Tables 14 and 15). ID1 regulates transcription epigenetically and was suggested to induce cellular proliferation indirectly through protection against apoptosis (Ling et al., 2003). High levels of ID1 may indicate the inhibition of TGFβ signalling-induced apoptosis as a mechanism by which WA09 maintains cellular proliferation. Table 13 lists the expression of extracellular and intracellular activators and inhibitors of the pathway in hESC and n-CGAP libraries. The activation of SMAD proteins propagates TGFβ and Nodal signalling. The complete repertoire of SMADs was expressed in all SAGE libraries. Lefty1, an inhibitor of TGFβ and Nodal signal transduction, was highly differentially expressed in hESCs; its expression was more than 5-fold higher in WA09 than in adult/fetal libraries and more than 40-fold higher in the hESC metalibrary (P≤0.0004) (Tables 14 and 15). High expression of both TGFβ and Nodal ligands and their repressors suggests that tight regulation of their signalling effects is important to the undifferentiated stem cell state and to aspects of stem cell maintenance. I also observed that INHBA was expressed at 5 tags/650,000 in the hESC metalibrary but absent in the WA09 line. The gene inhibits TGFβ and Nodal, and ultimately promotes embryo differentiation. This observation was coupled with the moderately high expression of its repressor, FST, which appeared to be co-expressed with TGFβ and Nodal in hESCs (its expression was not detected in the differentiated libraries) (Table 14).
Negative regulation of INHBA and its low level of expression were consistent with the u-hESC phenotype. Similar to the Wnt signalling pathway, we observed that many TGFβ pathway genes were expressed in hESCs while pathway inhibitors (such as FST and Lefty1) were uniquely expressed at high levels in hESCs. TGFβ pathway expression alludes to a currently undefined role for the pathway in stem cell maintenance and to its function in hESC developmental plasticity.

Mouse embryonic stem cells (mESCs) maintain pluripotency via leukemia inhibitory factor (LIF) signalling through the LIFR/IL6ST receptor, which leads to STAT3-mediated transcriptional activation (Yoshida et al., 1994). The same response to LIF does not hold true in hESCs, where both LIF and its downstream effectors were expressed at low quantities in comparison to mESCs (Carpenter et al., 2004; Richards et al., 2004). I looked to confirm previous reports of low expression levels for LIF signalling in our hESC SAGE data and similarly observed low levels of LIF, LIFR, Janus kinases (JAK) and STAT3, and the absence of ERAS (Table 16) (Figure 11). Appendix 2k is a catalogue of LIF signalling pathway expression in WA09, the hESC metalibrary, and the differentiated metalibrary. IL6ST may not be expressed in hESCs: its long SAGE tag sequence mapped ambiguously in the genome, and previous studies reported low levels of expression or did not detect the receptor (Brandenberger et al., 2004a; Brandenberger et al., 2004b; Carpenter et al., 2004; Rho et al., 2005; Richards et al., 2004). In mESCs, the LIF signalling pathway components were highly expressed (Anisimov et al., 2002) and critical to maintaining the undifferentiated and pluripotent state. In our hESC data the critical factors responsible for mESC maintenance were expressed at low levels; thus the pathway is unlikely to be functionally conserved in the human.

Table 16 LIF signalling network expression in human embryonic stem cells.
hESC tags were normalized to the total number of tags expressed in the Differentiated library (651,881 total tags). Tags were mapped to genes using the CGAP SAGE Genie Best Tag mappings (http://cgap.nci.nih.gov/SAGE). The IL6ST* sequence maps to multiple genes.

UniGene accession  Gene symbol  WA09     hESC     Differentiated  Effect/role on LIF signalling
Hs.447330          ERAS         0        0        0               Activation
Hs.444356          GRB2         39.2     21.6     81              Activation
Hs.523875          INPPL1       58.7     19.5     106             Activation
Hs.434374          JAK2         1.4      0.4      0               Activation
Hs.515247          JAK3         2.8      1.1      25              Activation
Hs.278733          SOS1         2.8      1.1      5               Activation
Hs.291533          SOS2         4.2      2.1      3               Activation
Hs.470943          STAT1        21       12       50              Activation (transcription)
Hs.530595          STAT2        49       39.3     202             Activation (transcription)
Hs.463059          STAT3        1.4      1.9      5               Activation (transcription)
Hs.437058          STAT5A       1.4      2.1      33              Activation (transcription)
Hs.524518          STAT6        7        2.5      67              Activation (transcription)
Hs.2250            LIF          1.4      0.7      2               Ligand
Hs.133421          LIFR         1.4      0.7      2               Receptor
Hs.454699          IL6ST        0        0        0               Receptor
Hs.532082          IL6ST*       3155.6   1918.1   2555            Receptor

Figure 11 The expression of LIF signalling pathway components in a pooling of 11 hESC SAGE libraries and a pooling of 12 normal adult and fetal SAGE libraries. Tag expression levels were normalized to the total tags in the n-CGAP metalibrary (651,881 total tags). Expression levels were denoted by the heat-map. Boxes with vertical lines (low gene expression) represent >0-10 tags per gene, boxes with light gray dots (mid-level gene expression) represent >10-100 tags per gene, and horizontal lined boxes and dark gray boxes with dots (high gene expression) reflect >100-500 and >500 tags per gene respectively. Top box reflects hESC metalibrary expression; bottom box reflects n-CGAP metalibrary expression. -| inhibition; -> activation. IL6ST* sequence maps to multiple genes.

[Figure 11 diagram not reproduced. Recoverable labels include ERAS, PI3-K, INPPL1, STAT, MAPKKK, MAPKK, MAPK1, MAPK3, POU5F1 and NANOG, the cell membrane and nuclear membrane, the outcomes teratoma formation and ES renewal and pluripotency, and gene family panels for PI3-K genes (PIK3C2A, PIK3C2B, PIK3C3, PIK3CB, PIK3CD), STAT genes (STAT1, STAT2, STAT3, STAT5A, STAT6), RAS genes (KRAS2, RRAS, RRAS2, HRAS, MRAS), MAPKKK genes (MAP3K3 to MAP3K12, MAP3K14) and MAPKK genes (MAP2K1 to MAP2K7).]

2.3.5 Cell cycle regulation and programmed cell death pathways in hESCs

To gain a broad overview of the cell cycle and cell death pathways in human embryonic stem cells, the genes found in the metalibrary were mapped to the pathway regulators. See Figures 12 and 13 for cell cycle and DNA repair mechanisms, Figures 14 and 15 for apoptosis and autophagic cell death pathways, and Tables 17-20 for differentially expressed genes in the cell cycle, DNA repair, apoptosis and autophagic cell death pathways. Positive regulators of cell cycle progression (such as E2F1, ORC genes and MCM genes) and DNA damage checkpoints (such as TP53, PCNA and CHEK1/2) were significantly more highly expressed in hESCs than in terminally differentiated tissue types (Table 17) (Figure 12). Proapoptotic genes (such as IL1R1, TNF, and TNFSF10) were significantly more highly expressed in adult and fetal libraries than in hESCs (Figure 14) (Table 19). RB1 was detected at low levels in hESCs (3.5 tags/650,000 total tags), while its proposed inhibitor, nucleostemin (GNL3), was more abundantly expressed (75 tags/650,000 total tags) (Tsai and McKay, 2002). RB1 expression leads to cell cycle arrest and its activity is disrupted in many forms of cancer, resulting in uncontrolled cell growth and deregulated apoptosis (Hanahan and Weinberg, 2000; Hickman and Helin, 2002; Hickman et al., 2002).
In hESCs, the regulators involved in responding to stress and DNA damage are highly expressed, at levels significantly different from those in adult and fetal SAGE libraries (Table 18). Though hESCs share the property of inactive RB1 and related genes with malignant tissues, the loss of cell cycle and apoptosis regulation seen in cancer distinguishes malignant phenotypes from hESCs.

Figure 12 Cell cycle expression. Cell cycle genes detected in the hESC metalibrary and differentiated metalibrary. Expression was normalized to tags/650,000 total tags. Tag expression levels were normalized to the total tags in the differentiated metalibrary (651,881 total tags). Expression levels were denoted by the heat-map. Boxes with vertical lines (low gene expression) represent >0-10 tags per gene, boxes with light gray dots (mid-level gene expression) represent >10-100 tags per gene, horizontal lined and dark gray boxes with dots (high gene expression) reflect >100-500 and >500 tags per gene respectively. Top box reflects hESC metalibrary expression; bottom box reflects differentiated metalibrary expression. -| inhibition; -> activation.

Figure 13 Expression of DNA repair machinery in the u-hESC and n-CGAP metalibraries. Repair mechanisms include: double-strand break repair (homologous recombination repair (HRR) or non-homologous end joining (NHEJ)), nucleotide excision repair (NER) and base excision repair (BER). Expression was normalized to tags/650,000 total tags. Tag expression levels were normalized to the total tags in the n-CGAP metalibrary (651,881 total tags). Expression levels were denoted by the heat-map. Boxes with vertical lines (low gene expression) represent >0-10 tags per gene, boxes with light gray dots (mid-level gene expression) represent >10-100 tags per gene, horizontal lined and dark gray boxes with dots (high gene expression) reflect >100-500 and >500 tags per gene respectively. Top box reflects hESC metalibrary expression; bottom box reflects n-CGAP metalibrary expression.
-| inhibition; -> activation.

[Figure 13 diagrams not reproduced. Panels: HRR (double-strand break, 5' to 3' resection, end-binding, RAD51-mediated reactions, branch migration, ligation, Holliday junction resolution); NHEJ (end recognition by G22P1, XRCC5 and PRKDC, end processing, end bridging, ligation); NER (pre-incision complex, 5' incision, 3' incision); BER (damaged nucleotide, short-patch and long-patch repair, specialized excision repair genes).]

Figure 14 Apoptotic programmed cell death pathway expression in hESC compared with n-CGAP. A. Extrinsic pathway and survival factors. B. Ca2+ induced cell death. C. DNA damage induced cell death. Expression was normalized to tags/650,000 total tags. Tag expression levels were normalized to the total tags in the n-CGAP metalibrary (651,881 total tags). Expression levels were denoted by the heat-map.
Boxes with vertical lines (low gene expression) represent >0-10 tags per gene, boxes with light gray dots (mid-level gene expression) represent >10-100 tags per gene, horizontal lined and dark gray boxes with dots (high gene expression) reflect >100-500 and >500 tags per gene respectively. Top box reflects hESC metalibrary expression; bottom box reflects n-CGAP metalibrary expression. -| inhibition; -> activation.

[Figure 14 diagram not reproduced. Regions shown: extrinsic pathway (death ligands and receptors, caspase cascade, cleavage of caspase substrates), survival factors, intrinsic pathway (stress signals, mitochondrion, DNA fragmentation), cell membrane.]

Figure 15 Autophagic cell death. Pathway detected in the hESC metalibrary and n-CGAP metalibrary. MAP1LC3B (mammalian homologue to yeast ATG8) undergoes post-translational processing resulting in conjugation to phosphatidylethanolamine (PE) for activation. Additional ATG8 homologues are shown. Expression was normalized to tags/650,000 total tags. Tag expression levels were normalized to the total tags in the n-CGAP metalibrary (651,881 total tags). Expression levels were denoted by the heat-map.
Boxes with vertical lines (low gene expression) represent >0-10 tags per gene, boxes with light gray dots (mid-level gene expression) represent >10-100 tags per gene, horizontal lined and dark gray boxes with dots (high gene expression) reflect >100-500 and >500 tags per gene respectively. Top box reflects hESC metalibrary expression; bottom box reflects n-CGAP metalibrary expression. -| inhibition; -> activation.

[Figure 15 diagram not reproduced. Stages shown: conjugate formation, preautophagosomal structure (PAS), cup-shaped preautophagosomal membrane, autophagosome, autophagy.]

Table 17 Differential gene expression of the cell cycle compared between the hESC metalibrary and differentiated metalibrary. Significance of expression differences was determined using Audic-Claverie statistics (Audic and Claverie, 1997) with P <= 0.01. Fold-difference in expression is denoted as the natural log (ln) ratio of hESC/differentiated metalibrary. Grayed-portion of the table (rows with negative ln ratios) corresponds to genes that are downregulated in hESCs. Tags were mapped to genes using the CGAP SAGE Genie Best Tag mappings (http://cgap.nci.nih.gov/SAGE).

UniGene accession  Gene symbol  P-value  Ln ratio hESC/differentiated
Hs.28312           MAD2L1       0        5.0027
Hs.194698          CCNB2        0        4.4654
Hs.517582          MCM5         0        4.0417
Hs.329989          PLK1         0        3.7977
Hs.84113           CDKN3        0        3.6974
Hs.23960           CCNB1        0        3.6461
Hs.85137           CCNA2        0        3.5478
Hs.408312          TP53         0        3.5348
Hs.558433          PTTG1        0        3.3629
Hs.524947          CDC20        0        3.3327
Hs.17908           ORC1L        0        3.2049
Hs.334562          CDC2         0        3.1678
Hs.36708           BUB1B        0        3.1294
Hs.474217          CDC45L       0        3.0688
Hs.291363          CHEK2        0        2.9703
Hs.24529           CHEK1        0        2.7385
Hs.96055           E2F1         0        2.7091
Hs.533573          CDC7         0        2.6633
Hs.244723          CCNE1        0        2.6315
Hs.656             CDC25C       0        2.6153
Hs.249441          WEE1         0        2.3294
Hs.523852          CCND1        0        2.3138
Hs.23348           SKP2         0        2.3055
Hs.469649          BUB1         0.0003   2.2886
Hs.95577           CDK4         0        2.2361
Hs.196102          RB1CC1       0.0005   2.218
Hs.147433          PCNA         0        2.1648
Hs.20447           PAK4         0        2.1063
Hs.410228          ORC3L        0        2.1045
Hs.208414          ASK          0.0014   2.0879
Hs.477481          MCM2         0        2.0879
Hs.153479          ESPL1        0.0014   2.0879
Hs.1634            CDC25A       0.0046   1.9056
Hs.3352            HDAC2        0        1.8489
Hs.19400           MAD2L2       0        1.7431
Hs.269408          E2F3         0        1.7365
Hs.88556           HDAC1        0        1.7233
Hs.16349           KIAA0431     0        1.7233
Hs.445758          E2F5         0.0003   1.5954
Hs.313544          GNL3         0        1.5489
Hs.49760           ORC6L        0        1.5001
Hs.418533          BUB3         0        1.3913
Hs.558364          ORC4L        0.0002   1.3609
Hs.517517          EP300        0.0072   1.1254
Hs.438720          MCM7         0        1.0066
Hs.135465          E2F6         0.0062   0.9753
Hs.75862           SMAD4        0.0004   0.9248
Hs.558307          CDK7         0.0071   0.8558
Hs.433201          CDK2AP1      0        0.7711
Hs.431048          ABL1         0.0099   0.7216
Hs.491682          PRKDC        0.0001   0.6213
Hs.36915           SMAD3        0.0001   0.3712
Hs.179565          MCM3         0.0005   -0.267
Hs.520974          YWHAG        0        -0.4186
Hs.200063          HDAC7A       0.0019   -0.498
Hs.150423          CDK9         0.0062   -0.6341
Hs.523835          DOC-1R       0.0001   -0.8921
Hs.106070          CDKN1C       0        -0.9099
Hs.238990          CDKN1B       0        -1.0101
Hs.77313           CDK10        0        -1.0238
Hs.438782          HDAC5        0        -1.0506
Hs.6764            HDAC6        0        -1.1728
Hs.80409           GADD45A      0        -1.2378
Hs.271791          ATR          0.0001   -1.262
Hs.534307          CCND3        0        -1.7732
Hs.310536          HDAC8        0.0013   -1.7833
Hs.370771          CDKN1A       0        -1.7988
Hs.417050          CCNA1        0.0023   -1.8321
Hs.513645          PAK6         0        -1.9011
Hs.110571          GADD45B      0        -2.5046
Hs.525324          CDKN2C       0.0004   -3.105
Hs.32539           PAK7         0        -3.8935
Hs.2025            TGFB3        0        -4.0605

Table 18 Differential gene expression of DNA repair mechanisms compared between the hESC metalibrary and differentiated metalibrary. Significance of expression differences was determined using Audic-Claverie statistics (Audic and Claverie, 1997) with P <= 0.01.
Fold-difference in expression is denoted as the natural log (ln) ratio of hESC/differentiated metalibrary. Grayed-portion of the table (rows with negative ln ratios) corresponds to genes that are downregulated in hESCs. Tags were mapped to genes using the CGAP SAGE Genie Best Tag mappings (http://cgap.nci.nih.gov/SAGE).

UniGene accession  Gene symbol  P-value  Ln ratio hESC/differentiated
Hs.409065          FEN1         0        3.8127
Hs.445052          MSH6         0        3.4022
Hs.156519          MSH2         0        3.3719
Hs.191334          UNG          0        3.0153
Hs.523220          RAD54L       0        2.9817
Hs.291363          CHEK2        0        2.9703
Hs.558896          CHEK2        0        2.9703
Hs.518475          RFC4         0        2.9471
Hs.139226          RFC2         0        2.9111
Hs.24529           CHEK1        0        2.7385
Hs.531879          RAD1         0        2.5819
Hs.446564          DDB2         0        2.4747
Hs.446554          RAD51        0.0001   2.4556
Hs.209945          TDP1         0.0001   2.4164
Hs.78016           PNKP         0.0001   2.3756
Hs.487540          RPA3         0        2.3221
Hs.498248          EXO1         0.0003   2.2656
Hs.150477          WRN          0.0006   2.1933
Hs.147433          PCNA         0        2.1648
Hs.192649          MRE11A       0.0009   2.142
Hs.100299          LIG3         0.0009   2.142
Hs.487294          ERCC2        0.0011   2.1153
Hs.302003          FANCE        0.0017   2.0598
Hs.555936          APEX2        0.0021   2.0308
Hs.169348          BLM          0.0046   1.9056
Hs.512592          RRM2B        0        1.7949
Hs.19400           MAD2L2       0        1.7431
Hs.500721          MMS19L       0        1.7233
Hs.16349           KIAA0431     0        1.7233
Hs.491695          UBE2V2       0        1.4748
Hs.534331          NUDT1        0.0002   1.4748
Hs.461925          RPA1         0        1.4619
Hs.558417          XRCC3        0.0001   1.3948
Hs.115474          RFC3         0.0002   1.3493
Hs.520189          ELOVL5       0        1.2991
Hs.344812          TREX1        0.006    1.277
Hs.177766          PARP1        0        1.2684
Hs.111749          PMS1         0.0001   1.2498
Hs.388739          XRCC5        0        1.2043
Hs.521640          RAD23B       0        1.2041
Hs.66196           NTHL1        0        1.1112
Hs.191356          GTF2H2       0.0008   1.0911
Hs.524630          UBE2N        0.002    1.0821
Hs.279413          POLD1        0        1.0694
Hs.523230          POLL         0.0068   1.0381
Hs.488624          PMS2L3       0.0068   1.0381
Hs.73722           APEX1        0        0.9969
Hs.194143          BRCA1        0        0.8644
Hs.558307          CDK7         0.0071   0.8558
Hs.271353          MUTYH        0.0088   0.7958
Hs.306791          POLD2        0.0066   0.729
Hs.477879          H2AFX        0.0014   0.7078
Hs.292493          G22P1        0        0.6421
Hs.491682          PRKDC        0.0001   0.6213
Hs.412587          RAD51C       0.0042   0.5895
Hs.459596          MPG          0.0014   -0.5459
Hs.98493           XRCC1        0.0006   -0.9675
Hs.34012           BRCA2        0        -1.0227
Hs.35947           MBD4         0        -1.0463
Hs.385986          UBE2B        0        -1.1389
Hs.271791          ATR          0.0001   -1.262
Hs.469872          ERCC3        0        -1.3133
Hs.475538          XPC          0        -1.3158
Hs.422901          GTF2H2       0        -1.334
Hs.288798          MUS81        0        -1.3778
Hs.129727          XRCC2        0.0001   -1.5697
Hs.512732          NEIL1        0        -1.6208
Hs.258429          ERCC5        0        -1.9118
Hs.208388          FANCD2       0.0015   -2.1887
Hs.232021          REV3L        0        -2.3429
Hs.408557          ELOVL2       0        -3.6159

Table 19 Differential gene expression of apoptosis pathways compared between the hESC metalibrary and differentiated metalibrary. Significance of expression differences was determined using Audic-Claverie statistics (Audic and Claverie, 1997) with P <= 0.01. Fold-difference in expression is denoted as the natural log (ln) ratio of hESC/differentiated metalibrary. Grayed-portion of the table (rows with negative ln ratios) corresponds to genes that are downregulated in hESCs. Tags were mapped to genes using the CGAP SAGE Genie Best Tag mappings (http://cgap.nci.nih.gov/SAGE).

UniGene accession  Gene symbol  P-value  Ln ratio hESC/differentiated
Hs.408312          TP53         0        3.5348
Hs.289052          BCL2L12      0        3.3486
Hs.424932          PDCD8        0        2.7239
Hs.522506          TRAF2        0.0001   2.3962
Hs.149032          PIK3R4       0.0009   2.142
Hs.141125          CASP3        0        1.9056
Hs.517841          PRKAR2A      0.007    1.8366
Hs.198998          CHUK         0.0085   1.8002
Hs.437060          CYCS         0        1.7937
Hs.16349           KIAA0431     0        1.7233
Hs.3280            CASP6        0.0016   1.4311
Hs.175343          PIK3C2A      0        1.1553
Hs.521456          TNFRSF10B    0        0.9781
Hs.514494          FBF1         0.0001   0.8398
Hs.159428          BAX          0        0.8005
Hs.502842          CAPN1        0.0004   -0.4711
Hs.515371          CAPNS1       0        -0.487
Hs.503704          BIRC2        0.0088   -0.528
Hs.522819          IRAK1        0        -0.5877
Hs.356076          BIRC4        0.0009   -0.7315
Hs.435512          PPP3CA       0.0027   -0.7647
Hs.484782          DFFA         0        -0.7778
Hs.280342          PRKAR1A      0        -0.8337
Hs.43505           IKBKG        0.0001   -0.9146
Hs.515406          AKT2         0        -0.9306
Hs.420106          ENDOG        0.0055   -0.936
Hs.523309          BAG3         0.0001   -1.1346
Hs.525622          AKT1         0        -1.1591
Hs.474150          BID          0        -1.1824
Hs.532826          MCL1         0        -1.2
Hs.433068          PRKAR2B      0        -1.2193
Hs.371344          PIK3R2       0        -1.3844
Hs.401745          TNFRSF10A    0        -1.3844
Hs.431926          NFKB1        0.0015   -1.4086
Hs.550753          PRKAR1B      0        -1.6655
Hs.73090           NFKB2        0        -1.7928
Hs.132225          PIK3R1       0        -1.8057
Hs.552567          APAF1        0        -1.8365
Hs.500067          PPP3CB       0        -1.9011
Hs.478275          TNFSF10      0        -2.1409
Hs.194350          PRKACA       0        -2.2177
Hs.278901          PIK3R5       0        -2.378
Hs.82116           MYD88        0        -2.3855
Hs.487325          PRKACB       0        -2.4072
Hs.227817          BCL2A1       0        -2.6996
Hs.126256          IL1B         0        -2.8819
Hs.390736          CFLAR        0.0004   -3.105
Hs.460996          TRADD        0        -3.2184
Hs.81328           NFKBIA       0        -3.2248
Hs.196472          IL3RA        0.0001   -3.2874
Hs.498292          AKT3         0        -3.5105
Hs.149413          PPP3CC       0        -3.575
Hs.127799          BIRC3        0        -3.7982
Hs.241570          TNF          0        -3.9805
Hs.557403          IL1R1        0        -4.4043

Table 20 Genes that were differentially expressed in the autophagic cell death pathway compared between the hESC metalibrary and differentiated metalibrary. Significance of expression differences was determined using Audic-Claverie statistics (Audic and Claverie, 1997) with P <= 0.01. Fold-difference in expression is denoted as the natural log (ln) ratio of hESC/differentiated metalibrary. Grayed-portion of the table (rows with negative ln ratios) corresponds to genes that are downregulated in hESCs. Tags were mapped to genes using the CGAP SAGE Genie Best Tag mappings (http://cgap.nci.nih.gov/SAGE).
UniGene accession  Gene symbol  P-value  Ln ratio hESC/differentiated
Hs.149032          PIK3R4       0.0009   2.142
Hs.90093           HSPA4        0        1.3178
Hs.338207          FRAP1        0        1.1291
Hs.369373          SEC23B       0.0002   0.9299
Hs.555993          APG10L       0.0002   0.7682
Hs.356061          MAP1LC3B     0        -0.7953
Hs.477126          APG3L        0        -1.0278
Hs.461379          GABARAPL2    0        -1.087
Hs.81964           SEC24C       0        -1.1616
Hs.47061           ULK1         0        -1.1702
Hs.283610          APG4B        0        -1.4021
Hs.84359           GABARAP      0        -1.4091
Hs.8763            APG4A        0.0062   -2.0064
Hs.237886          DAPK2        0.0021   -2.4119
Hs.472513          MAP1LC3A     0        -3.4781

2.3.6 HESC gene ontology

Human embryonic stem cells may be a rich source of many different transcript types encompassing a variety of disparate biological processes and molecular functions. The expression of a wide variety of genes, particularly at low levels, may play a key role in enabling stem cells to differentiate to a multitude of cell types when provided with the appropriate developmental cue. Various genes important to maintaining a stem cell's developmental potential are expressed at low levels, such as the ligands of the WNT and TGFβ developmental pathways. Additionally, many characteristics of pluripotent stem cell biology have not been explained by molecular factors described in previous studies of the hESC transcriptome. Much remains unknown regarding the molecular factors governing hESC maintenance and pluripotency; it is therefore likely that the genes yet to be discovered in these roles are expressed at the limits of detection of the transcriptome profiling methods used to date. Generally, transcription factors and cell specific factors are expressed at levels undetectable by most gene expression techniques (Brandenberger 2004a) or are transiently expressed. Upon differentiation, particular combinations of these low expression genes may be up-regulated or downregulated, locking in the stem cell's fate.
Key molecular factors that control stem cell fate determination and maintenance of developmental plasticity are enigmatic, leading to the expectation that several transcripts identified in the hESC lines will be uncharacterized, in contrast to those of most terminally differentiated cell types. Figure 16 compares the functional state of genes shared across all hESC lines and the WA09 transcriptome to the n-CGAP metalibrary. In total 7,364 tags were commonly expressed across all of the ES lines. These tags were mapped using CMOST and resulted in 5,284 unambiguous tag-to-gene mappings in which the tag sequence localized to one of the three nearest NlaIII sites to the 3' end of the transcript (denoted position 1, 2 and 3, where position 1 is the 3' most NlaIII site in a transcript). Gene Ontology (GO) terms were assigned to these tags. GO terms were similarly assigned for unambiguous tag-to-gene mappings in the WA09 library (15,509 tags) and differentiated metalibrary (13,523 tags). Analysis of the gene ontologies revealed that molecular functions expressed in hESCs and normal CGAP libraries were distinctly different based on Audic-Claverie statistical analysis of total tags expressed in each functional category (see Appendix 21 for a table of pair-wise comparisons of molecular functions and corresponding p-values). Genes involved in nucleic acid binding, signal transduction, and cell death/aging were more highly represented in adult derived SAGE libraries than in hESCs (p<0.0001). Genes that were involved in protein synthesis/processing, transport, cell cycle regulation, and chromatin binding were more highly represented in hESC libraries than in the differentiated metalibrary (p<0.0001).
The preponderance of protein binding/degradation genes may support the theory that hESCs express many genes transiently (Golan-Mashiach et al., 2005); increased degradation machinery may therefore be needed to maintain the specific level of protein expression necessary for an undifferentiated and pluripotent state. The molecular integrity of hESCs has been demonstrated extensively by numerous accounts of hESCs possessing a stable karyotype after several passages (Hwang et al., 2004; Reubinoff et al., 2000; Thomson et al., 1998). The high representation of genes involved in cell cycle regulation in hESCs, observed in the GO molecular function assignments and closely investigated in my assessment of cell cycle/DNA repair gene expression in hESCs, was consistent with the genomic stability of hESCs and their proliferative capabilities.

Figure 16 Gene Ontology (GO) "slim" molecular functions expressed in the hESC metalibrary, hESC cell line WA09, and the differentiated metalibrary (n-CGAP metalibrary). Tags were mapped to Ensembl transcripts and EST transcripts using CMOST. Additionally, tags were mapped to embryonic ESTs (see Chapter 4.2.3 for a detailed account of tag-mapping resource construction) and mitochondrial sequences (CMOST). Tags that mapped to sense positions 1, 2, and 3 in a transcript and tags that mapped to all positions in EST and mitochondrial sequences were used in this analysis. The numbers of unique tags per functional category were depicted. The inset legend lists all functional categories corresponding to pie chart colours. The hESC metalibrary chart displayed category names and the percentage of all unique tags used in this analysis represented in that category. Global molecular functions in hESCs are significantly different from those in the pooling of adult/fetal samples. Nucleic acid binding and signal transduction genes account for a greater proportion of functions in adult/fetal libraries (p<0.0001).
Protein binding/degradation, transport, and cell cycle regulation account for a greater proportion of overall functions in hESC (p<0.0001).

Pie chart values for Figure 16 (percentage of unique tags per functional category):

Functional category             hESC metalibrary  WA09     Differentiated metalibrary
DNA metabolism/repair           0.05%             0.07%    0.04%
Protein synthesis/processing    2.84%             1.91%    0.03%
Antioxidant                     0.10%             0.05%    0.08%
Defense/immunity                0.12%             0.09%    0.20%
Chromatin binding               1.98%             1.26%    0.34%
Negative regulation cell cycle  6.20%             8.14%    0.61%
Protein degradation             10.54%            9.09%    0.83%
Catalytic                       1.43%             2.14%    1.54%
Cell death/aging                0.36%             0.23%    1.86%
Protein folding/chaperone       0.38%             0.33%    1.86%
Cell adhesion                   1.24%             0.73%    2.37%
Transport                       10.49%            10.72%   2.77%
Structural/cytoskeletal         11.64%            12.42%   5.25%
Uncharacterized                 2.10%             2.70%    7.39%
Nucleic acid binding            0.50%             0.29%    18.83%
Signal transduction             0.02%             0.01%    14.32%
Metabolism                      16.36%            9.69%    14.03%
Protein binding                 20.70%            28.79%   10.92%
Transcription/RNA processing    6.56%             6.94%    8.63%
Zinc ion binding                6.39%             4.40%    8.11%

2.4 Conclusions

This chapter provided a summary of the hESC transcriptome profiles sampled using SAGE.
The dataset described was among the most comprehensive and was representative of 8 cell lines and 2,613,475 total tags corresponding to 379,465 transcripts (unique tag types). Over 700 transcripts implicated in mammalian pluripotent stem cell biology (termed PAGs) were assayed in a pooling of the hESC SAGE libraries and the WA09 line. The hESC libraries expressed 10-20% more PAGs at higher expression levels than any n-CGAP library. A suite of well-defined pluripotency markers (including POU5F1) and primitive differentiated cell markers were surveyed in hESCs. Pluripotency markers were significantly higher in hESCs and generally absent in the differentiated samples. Differentiation markers were expressed at low levels in hESCs. The expression of developmental pathways, such as the Wnt signalling network, was investigated in hESCs. In accordance with the "Just in case" hypothesis (Golan-Mashiach et al., 2005), hESCs uniquely expressed a greater variety of pathway genes than differentiated samples. Comparison of other key pathways, such as the cell cycle, DNA repair, and programmed cell death pathways, also revealed increased diversity of pathway genes expressed in hESCs. Most intriguing was the observation that cell cycle checkpoints and DNA repair genes were significantly more highly expressed in hESCs than in n-CGAP libraries. From the GO analysis of global molecular functions I observed that the distribution of expressed genes with annotated molecular functions in embryonic undifferentiated cells was distinctly different from that of the adult and fetal samples. Genes involved in cell death and aging were prevalent in n-CGAP libraries in contrast to hESC libraries, while cell cycle regulatory genes were more highly expressed in hESCs. These characteristics were consistent with and necessary for the maintenance of immortalized and proliferating hESCs.

3. Comparison between hESC and cancer/non-cancer differentiated cells/tissues

Contributions

Computational analysis was completed by Angelique Schnerch (BCCA GSC and the Department of Medical Genetics, University of British Columbia) with the hierarchical clustering script and the Audic-Claverie statistics script contributed by Allen Delaney (BCCA GSC) and Mehrdad Oveisi (BCCA GSC) respectively.

3.1 Introduction

Genes important to the pluripotent and undifferentiated state of hESC may be up-regulated in hESC compared to differentiated cells. Conversely, genes important in hESC differentiation may be down-regulated in hESC. The hESC expression profile generated by SAGE lends itself to direct pair-wise comparison to cancer and non-cancer differentiated cells and tissue types to enable the elucidation of differentially expressed transcripts. The first portion of this analysis consisted of pair-wise comparisons within the set of hESC long SAGE libraries. Also performed were comparisons between hESC libraries and other human SAGE libraries publicly available from the CGAP website (http://cgap.nci.nih.gov/SAGE). I selected all normal CGAP SAGE libraries and categorized them according to their embryonic germ layer origin (endoderm, mesoderm, or ectoderm). The germ layers arise during the first differentiation event of the embryo proper, termed gastrulation. This event takes place in the third week of development. Tissue types derived from each germ layer are depicted in Figure 17. The Pearson correlation coefficient was used as the measure of relatedness between expression levels of matched tag sequences in each library-pair comparison (Pearson 1896). Cluster analysis based on Pearson correlation was additionally employed. This provides a method for grouping similar libraries into respective categories, ultimately defining which cell and tissue types were similar or dissimilar to embryonic stem cells.
Individual transcript differences between the hESC transcriptome and multiple differentiated transcriptomes were investigated as part of this analysis to define a pluripotent molecular signature. Using multiple hESC lines provided an opportunity to elucidate genes that were both differentially expressed and common to all hESC lines.

Figure 17 Trilaminar embryonic disc. The embryonic germ layers, endoderm, mesoderm and ectoderm, will develop into the differentiated derivatives in the adult human listed below. TS - trophoblast cells; CNS - central nervous system; PNS - peripheral nervous system (Moore 2003).
Derivatives:
• Ectoderm: epidermis, CNS (brain and spinal cord), PNS (nerves), eye (retina), ear
• Mesoderm: musculoskeletal (ligaments, muscles, skeleton), vascular system (blood, bone marrow), reproductive system (urogenital)
• Endoderm: gastrointestinal (pharynx, digestive tube), major glands, respiratory system (nasal cavity, larynx, trachea, lungs)

The ultimate goal of studying hESCs is related to their application in the clinical setting, namely the directed differentiation of stem cells for transplantation therapies. Prefacing this is the characterization of the molecular factors and signalling networks that govern stem cell properties and differentiation. In recent years the study of hESC biology has grown rapidly, yet much remains unclear about the stem cell molecular machinery and its function. A number of pluripotency markers have been described in the literature that were shown to be downregulated upon differentiation in mammalian embryonic stem cells (mouse or human ESCs). The candidate markers distinguished the u-hESCs from their nearest neighbours along the developmental time-course, yet many of these genes are not unique to embryo-derived pluripotent cells.
Upon differentiation some of the most well characterized markers of u-hESC, such as POU5F1 and SOX2, are not immediately down-regulated (e.g., still present at detectable levels after two weeks of differentiation) (C. Eaves, personal communication) (Cai et al., 2005). One aim of this study was to assess the undifferentiated state of the hESCs used to generate our SAGE data through observation of the differential expression of published pluripotency markers in hESC versus adult cells; this will further refine our knowledge of, and the appropriateness of, current u-hESC markers. A second aim of analyzing differential gene expression was to identify novel markers distinguishing u-hESCs from adult and fetal cells. This analysis took the approach of comparing hESC SAGE expression profiles to pre-existing publicly available SAGE data, composed mainly of adult tissues and cell lines.

3.2 Methods

3.2.1 SAGE library downloads

SAGE libraries constructed from non-pluripotent tissues and cell lines were obtained from the CGAP SAGE Genie resource (http://cgap.nci.nih.gov/SAGE) and our own database of long SAGE libraries. In total, 65 normal short and extracted short (see footnote 4) SAGE libraries from CGAP (n-CGAP) were used for comparison to the hESC expression profiles. All CGAP long SAGE libraries, both normal (n-CGAP) and malignant (c-CGAP), were utilized in this analysis. The tissues and cell lines represented are listed in Appendix 3a. A subset consisting of 43 libraries was categorized according to embryonic germ layer origin; library descriptions, including the tissue type and germ layer, are additionally listed in Appendix 3a.

3.2.2 Cluster analysis

For comparison to short SAGE libraries, our own hESC long SAGE tags were truncated at the 3' end to yield extracted short sequences. SAGE libraries were grouped together according to a distance measure of 1-Pearson correlation coefficient (1-r).
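The distance measure described above can be illustrated with a short Python sketch. It is not the matrix.pl script used in this study; the function names and the dictionary representation of a library (tag sequence mapped to tag count) are hypothetical. It assumes that truncation keeps the first 14 bases of each 21 bp tag, that counts of long tags collapsing onto the same short tag are summed, and that r is computed only over tags detected in both libraries.

```python
import math
from collections import Counter


def to_extracted_short(long_tags, length=14):
    """Collapse long SAGE tag counts onto their first `length` bases.

    Assumption: counts of long tags sharing the same short prefix are summed.
    """
    short = Counter()
    for tag, count in long_tags.items():
        short[tag[:length]] += count
    return dict(short)


def pearson_distance(lib_a, lib_b):
    """Return 1 - r, with r computed over tags present in both libraries."""
    shared = sorted(set(lib_a) & set(lib_b))
    if len(shared) < 2:
        raise ValueError("need at least two matched tags")
    xs = [lib_a[t] for t in shared]
    ys = [lib_b[t] for t in shared]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant counts")
    return 1.0 - cov / math.sqrt(var_x * var_y)


def lower_triangular(libraries):
    """Build lower-triangular rows (diagonal included) of 1 - r distances.

    `libraries` is a list of (name, tag_counts) pairs in the desired order.
    """
    rows = []
    for i, (name, lib_i) in enumerate(libraries):
        dists = [pearson_distance(lib_i, libraries[j][1]) for j in range(i)]
        rows.append((name, dists + [0.0]))
    return rows
```

A lower-triangular layout is chosen here only because it matches the matrix format passed to fitch in this analysis; a full square matrix would carry the same information.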
A matrix containing pair-wise comparisons for all libraries versus themselves was generated based on 1-r using "matrix.pl" (Appendix 3b) (based on a script written by Allen Delaney; Gene Expression Informatics; http://www.bcgsc.ca/bioinfo/ge). The matrix was used to construct a hierarchical tree using the tree clustering Fitch-Margoliash algorithm (fitch) (Fitch and Margoliash, 1967). Figure 18 depicts the matrix file format and the fitch settings used in this analysis. The tree constructed (*.tre file) was viewed using the program TreeViewX (http://darwin.zoology.gla.ac.uk/~rpage/treeviewx/).

Footnote 4: Extracted short libraries are long SAGE libraries whose 21 bp sequences were shortened to 14 bp computationally (21- and 14-bp tags include the NlaIII consensus site "CATG").

Figure 18 Matrix file format and the fitch settings used in this analysis.

Matrix file (the original figure is cropped at the right edge and at the bottom, so rows from shes9 onward are incomplete and a trailing 0 may be the start of a cropped value):

   19
breast.lib  0
fetalbrain  0.3365 0
liver.lib   0.3864 0.213 0
pancreas.l  0.4605 0.4134 0.1691 0
snigra.lib  0.704 0.4591 0.1643 0.2635 0
ubc1.lib    0.2923 0.139 0.2356 0.3904 0.4375 0
ubc2.lib    0.3824 0.1311 0.2308 0.4282 0.4155 0.0234 0
shes2       0.3436 0.1557 0.3109 0.4444 0.4509 0.1346 0.1477 0
shes7       0.3261 0.1539 0.3087 0.4914 0.5021 0.1654 0.1644 0.0468 0
shes8       0.3284 0.1603 0.2725 0.4655 0.4392 0.1662 0.1626 0.0615 0.014 0
shes9       0.3952 0.1692 0.2004 0.3427 0.2877 0.1344 0.1318 0.0398 0.0894
she10       0.2684 0.1845 0.3707 0.5088 0.594 0.2174 0.2371 0.0671 0.0692 0
she11       0.4265 0.1373 0.2355 0.4067 0.3588 0.1324 0.1153 0.0382 0.0418
she13       0.5723 0.2024 0.4015 0.5224 0.4896 0.1465 0.1253 0.0857 0.1523
she14       0.3606 0.1665 0.1804 0.3126 0.2957 0.1806 0.1803 0.0601 0.064 0
she15       0.4231 0.1517 0.2304 0.4239 0.3768 0.2083 0.1853 0.0891 0.0542
she16       0.4709 0.1654 0.3179 0.4959 0.5077 0.1861 0.1648 0.0889 0.0861
she17       0.3717 0.1734 0.1704 0.3049 0.2792 0.1837 0.1831 0.0607 0.0688

Fitch settings:

Fitch-Margoliash method version 3.6a2.1
Settings for this run:
  D      Method (F-M, Minimum Evolution)?  Fitch-Margoliash
  U                 Search for best tree?  Yes
  P                                Power?  2.00000
  -      Negative branch lengths allowed?  No
  O                        Outgroup root?  No, use as outgroup species 1
  L         Lower-triangular data matrix?  Yes
  R         Upper-triangular data matrix?  No
  S                        Subreplicates?  No
  G                Global rearrangements?  No
  J     Randomize input order of species?  No. Use input order
  M           Analyze multiple data sets?  No
  0   Terminal type (IBM PC, ANSI, none)?  ANSI
  1    Print out the data at start of run  No
  2  Print indications of progress of run  Yes
  3                        Print out tree  Yes
  4       Write out trees onto tree file?  Yes
Y to accept these or type the letter for one to change
J
Random number seed (must be odd)?
1235
Number of times to jumble?

3.2.2.1 Random sampling script

To assess the nature of tag type diversity in hESCs versus comparator libraries (each sequenced to a different depth), a random sample was taken of the WA09 long SAGE library tags (converted to extracted short SAGE) to model increasing sampling depth, using multiple iterations (25 iterations). The Perl script is found in Appendix 3c (original script written by Paul Stothard; https://www.gchelpdesk.ualberta.ca/repository/viewRepository.php).

3.2.3 Differential gene expression analysis

A set of 43 non-pluripotent SAGE libraries was selected to provide a sample of tissues derived from the three embryonic germ layers and was subdivided accordingly (12 endoderm libraries, 16 mesoderm libraries, and 15 ectoderm libraries). "Endoderm" libraries were constructed from the gastrointestinal tract, major glands and the respiratory system. "Mesoderm" libraries were constructed from the musculoskeletal system, vascular system and the reproductive system. Lastly, "ectoderm" libraries were constructed from the epidermis, central nervous system (CNS), peripheral nervous system (PNS), and the eye. (Refer to Figure 17 for an illustration of the embryonic germ layers and derivative tissues).
HESC extracted short SAGE libraries were compared to CGAP libraries; upon association back to the corresponding 21mer sequence, the most abundant 21mer was used for further analysis (tag count >20). Differentially expressed tags were mapped to genes using CGAP SAGE Genie mappings and to genomic sequences using genomic contigs (NT_*), complete genomic sequences (NC_*), or genomic regions (NG_*) from RefSeq. Tag-to-gene mappings were filtered based on their uniqueness in the transcriptome and their uniqueness in the genome. All tags that uniquely mapped to a single transcript were parsed; for tags that did not map to a known transcript, only tags that mapped uniquely in the genome were parsed.

Differential gene expression was calculated using the script sagematrix.sh (Mehrdad Oveisi; Gene Expression Bioinformatics Group; http://www.bcgsc.ca/bioinfo/ge/), which employed the Audic and Claverie statistic (Audic and Claverie, 1997). Sagematrix.sh generated a P-value to allow the assessment of whether the observed changes in gene expression were significant. Another descriptor of differential gene expression was the fold change (FC), which is equivalent to the frequency of a tag in library "x" divided by its frequency in library "y". The script outputs the natural log of the fold change according to the equation lnFC = ln((x/Nx)/(y/Ny)), where x (or y) is the tag count and Nx (or Ny) is the total number of tags in library "x" (or "y"). Genes were first defined as significantly differentially expressed if they met a P-value cut-off of 0.05 or less. The definition was further refined to require a 3-fold or 1/3-fold change across each library in a subgroup of the embryonic germ layers.
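The fold-change equation and the two cut-offs above can be illustrated with a minimal Python sketch. It is not the sagematrix.sh implementation, which is not reproduced in this chapter: the function names are hypothetical, the probability function follows the published Audic and Claverie (1997) formula, and taking the smaller of the two cumulative tails as the P-value is an assumption made for illustration.

```python
import math


def ln_fold_change(x, n_x, y, n_y):
    """lnFC = ln((x/Nx)/(y/Ny)) for one tag counted in libraries "x" and "y"."""
    if x <= 0 or y <= 0:
        raise ValueError("lnFC is undefined when either tag count is zero")
    return math.log((x / n_x) / (y / n_y))


def audic_claverie_pmf(y, x, n_x, n_y):
    """P(y | x): chance of y tags in library "y" given x tags in library "x".

    Audic and Claverie (1997):
    p(y|x) = (Ny/Nx)^y * (x+y)! / (x! * y! * (1 + Ny/Nx)^(x+y+1)),
    evaluated in log space to avoid overflow for large counts.
    """
    ratio = n_y / n_x
    log_p = (y * math.log(ratio)
             + math.lgamma(x + y + 1) - math.lgamma(x + 1) - math.lgamma(y + 1)
             - (x + y + 1) * math.log1p(ratio))
    return math.exp(log_p)


def audic_claverie_p(x, n_x, y, n_y):
    """Smaller cumulative tail around the observed y (assumed P-value definition)."""
    lower = sum(audic_claverie_pmf(k, x, n_x, n_y) for k in range(y + 1))
    upper = max(0.0, 1.0 - lower + audic_claverie_pmf(y, x, n_x, n_y))
    return min(lower, upper)


def passes_cutoffs(p_value, ln_fc, p_max=0.05, fold=3.0):
    """True when P <= 0.05 and the change is at least 3-fold in either direction."""
    return p_value <= p_max and abs(ln_fc) >= math.log(fold)


# Illustrative counts only (not thesis data): 40 vs. 10 tags, equal library sizes.
ln_fc = ln_fold_change(40, 100000, 10, 100000)
print(ln_fc, passes_cutoffs(audic_claverie_p(40, 100000, 10, 100000), ln_fc))
```

Working on the log scale makes the 3-fold and 1/3-fold criteria symmetric, so one absolute-value test covers both up- and down-regulation.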
An exception to the requirement for a change across every library in a subgroup was the list of genes downregulated in the mesoderm subgroup, which was consistently downregulated across 15 of the 16 libraries (CGAP56, constructed from embryonic kidney, did not share a subset of commonly downregulated tags with the other mesoderm derivatives).

3.3 Results and discussion

3.3.1 Pair-wise library comparisons

Pearson's correlation (Eisen et al., 1998), as used in gene expression analysis, reflects the degree of linear relationship between two libraries with respect to the expression levels of commonly detected tags. Values range from "+1" to "-1". A correlation of "+1" means that there is a perfect positive linear relationship between libraries. Conversely, a correlation of "-1" means there is a perfect negative linear relationship between the libraries.

The first goal of this analysis was to determine how similar ES libraries were to one another. Table 21 is a matrix of correlations between 12 embryonic stem cell libraries constructed from 8 cell lines. Multiple libraries were constructed for WA01 (4 libraries; each constructed from a different RNA sample collected from cells exposed to different experimental conditions5) and WA09 (2 libraries constructed from the same RNA sample using short (WA09s) and long (WA09L) SAGE protocols).

5 Four long SAGE libraries were constructed from the WA01 line: WA01m was cultured on matrigel; WA01 was cultured on irradiated mouse embryonic fibroblasts (irr-MEFs); WA01(7) was cultured on irr-MEFs, POU5F1 knocked-in, unselected; WA01(8) was cultured on irr-MEFs, POU5F1 knocked-in, G418-selected.

Table 21 Pearson correlation coefficients (r) between human embryonic stem cell line SAGE expression profiles. Values were generated using matrix.pl for short and extracted short SAGE libraries. Although matrix.pl produces 1-r values for each pair-wise comparison, the table reports r values. The highest r value between any two u-hESC libraries is 0.974 (WA01 compared with WA14).
Note that all libraries except for WA09(s) are long SAGE. Bracketed numbers refer to the notes below the table.

            ES03       ES04       BG01       WA07       WA14       WA13       WA01(m)[1] WA01[1]    WA01(7)[1] WA01(8)[1] WA09(s)[2] WA09(L)[2]
ES03        1          0.8683     0.8659     0.7427     0.8825     0.8717     0.8275     0.8495     0.8747     0.8402     0.7249     0.8538
ES04        0.8683     1          0.8849     0.8993[3]  0.9166     0.9534[3]  0.8834     0.9215     0.9246     0.9008     0.6494     0.9009
BG01        0.8659     0.8849     1          0.7573     0.9309     0.8262     0.8137     0.9429     0.9294     0.8782     0.8162[3]  0.9027
WA07        0.7427     0.8993     0.7573     1          0.7913     0.8097     0.8899[3]  0.8133     0.7986     0.7478     0.4767     0.845
WA14        0.8825[3]  0.9166     0.9309     0.7913     1          0.8708     0.8417     0.974[3]   0.9072     0.8478     0.7559     0.8907
WA13        0.8717     0.9534[3]  0.8262     0.8097     0.8708     1          0.8454     0.8597     0.8885     0.8918     0.6177     0.85
WA01(m)[1]  0.8275     0.8834     0.8137     0.8899     0.8417     0.8454     1          0.8603[4]  0.8289[4]  0.7675[4]  0.4954     0.8352
WA01[1]     0.8495     0.9215     0.9429[3]  0.8133     0.974[3]   0.8597     0.8603[4]  1          0.9138[4]  0.8544[4]  0.7461     0.8938
WA01(7)[1]  0.8747     0.9246     0.9294     0.7986     0.9072     0.8885     0.8289[4]  0.9138[4]  1          0.9658[3]  0.7743     0.9381[3]
WA01(8)[1]  0.8402     0.9008     0.8782     0.7478     0.8478     0.8918     0.7675[4]  0.8544[4]  0.9658[3]  1          0.7574     0.9067
WA09(s)[2]  0.7249     0.6494     0.8162     0.4767     0.7559     0.6177     0.4954     0.7461     0.7743     0.7574     1          0.7604[4]
WA09(L)[2]  0.8538     0.9009     0.9027     0.845      0.8907     0.85       0.8352     0.8938     0.9381     0.9067     0.7604[4]  1

1. WA01 stem cell lines under different experimental conditions. WA01(m) was grown on matrigel. WA01 was grown on irradiated mouse fibroblasts. WA01(7) was Oct4 knock-in, unselected (reference), and WA01(8) was Oct4 knock-in, G418 selected.
2. WA09 stem cell lines. The same RNA sample was used to construct the short SAGE (s) and long SAGE (L) libraries.
3. Highest Pearson's correlation for a library against all other stem cell libraries (read according to the library listed in the column heading).
4. Pearson's correlations between libraries of the same cell line.
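As an illustration of how a correlation matrix such as Table 21 and the 1-r distance matrix of Figure 18 relate, the following Python sketch computes Pearson's r over the tags detected in both libraries and lays the 1-r values out as lower-triangular rows. It is a sketch under stated assumptions, not the matrix.pl script: the function names and the toy tag counts are hypothetical, and restricting the calculation to shared tags is an assumption based on the description of commonly detected tags above.

```python
import math


def pearson_r(lib_a, lib_b):
    """Pearson r over tags present in both libraries (dicts of tag -> count)."""
    shared = sorted(set(lib_a) & set(lib_b))
    n = len(shared)
    if n < 2:
        raise ValueError("need at least two shared tags")
    xs = [lib_a[tag] for tag in shared]
    ys = [lib_b[tag] for tag in shared]
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("counts are constant in one library")
    return sxy / math.sqrt(sxx * syy)


def lower_triangular_distances(libraries):
    """Rows of (name, [1 - r against each earlier library..., 0.0])."""
    names = list(libraries)
    rows = []
    for i, name in enumerate(names):
        dists = [1.0 - pearson_r(libraries[name], libraries[names[j]])
                 for j in range(i)]
        rows.append((name, dists + [0.0]))  # trailing 0 is the self-distance
    return rows


# Toy libraries with made-up counts, for illustration only.
toy = {
    "libA": {"CATGAAAA": 1, "CATGCCCC": 2, "CATGGGGG": 3},
    "libB": {"CATGAAAA": 2, "CATGCCCC": 4, "CATGGGGG": 6, "CATGTTTT": 9},
    "libC": {"CATGAAAA": 3, "CATGCCCC": 2, "CATGGGGG": 1},
}
for name, dists in lower_triangular_distances(toy):
    print(name, " ".join(f"{d:.4f}" for d in dists))
```

Because the distance is 1-r, identical profiles sit at distance 0 and perfectly anti-correlated profiles at distance 2, which is why branch lengths in the trees can be read directly as dissimilarity.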
The hESC libraries were more highly correlated to one another than they were to n-CGAP libraries in most cases (Figures 19 and 20). Table 21 lists the highest correlation value for each ES library pair-wise comparison. WA09s was the exception; it was more highly correlated to just 7 of the 12 embryonic stem cell libraries than to non-pluripotent cell lines/tissues (ES03, BG01, WA14, WA01, WA01(7), WA01(8), and WA09L) (Appendix 3d). The remaining lines did not correlate closely with WA09s and included ES04 (r=0.6494), WA07 (r=0.4767), WA13 (r=0.6177), and WA01m (r=0.4954). Furthermore, WA09s was most similar to BG01 (r=0.8162) as opposed to its long SAGE counterpart WA09L (r=0.7604). These results suggest that short SAGE and long SAGE techniques are not directly comparable. The expectation (assuming both techniques are unbiased samples of the ES transcriptome) was a correlation close to 1 in the case of the WA09 libraries. It has been suggested that the long SAGE technique may have a bias in transcriptome sampling (personal communication, Allen Delaney; Gene Expression Informatics; http://www.bcgsc.ca/ge/bioinfo), providing an explanation for the discrepancies in r-values in the short to long comparisons. Pearson correlation is also sensitive to outliers (e.g., highly expressed tags), which distort the linear relationship between libraries. This relates back to the suggestion that the long SAGE technique may selectively sample tags in a total RNA sample while short SAGE may be more representative of a random sampling of the transcriptome (Delaney, A., personal communication). Thus outliers arising from biased sampling of the transcriptome are a likely cause of the weaker than expected correlation between WA09s and the long ES libraries.

Figure 19 Hierarchical clustering of short and extracted normal SAGE libraries. HESC libraries are boxed in red. Abbreviations: vasc endo, vascular endothelium.
[Figure 19: tree image not reproduced. Leaves are CGAP libraries (labelled by accession number and tissue) and hESC libraries; scale bar, 0.1.]

Figure 20 Hierarchical clustering of long SAGE libraries from CGAP and our own database of normal and malignant cells. Libraries constructed from PCR SAGE (Zhao et al., submitted) include an additional cDNA amplification step compared to conventional long SAGE (Saha et al., 2002). The technique may preferentially amplify a small subpopulation of transcripts, possibly resulting in highly expressed tags shared between all libraries constructed using PCR-SAGE, in effect biasing the Pearson correlation. Abbreviations: HSC, haematopoietic stem cells; MAPC, multipotent adult progenitor cells.
[Figure 20: tree image not reproduced. The hESC libraries (WA14, WA01, BG01, WA13, WA07, WA01m, ES04, ES03, WA01(8), WA01(7), WA09L) are labelled as one group and the PCR SAGE libraries (HSC1-3, leukemia1-3) as another; the remaining leaves are MAPC3-5 and CGAP long SAGE libraries; scale bar, 0.1.]

Multiple libraries were constructed for the WA01 line. A reciprocal highest correlation value was observed only in the case of hESC libraries expressing a POU5F1-reporter gene construct generated by homologous recombination (Zwaka and Thomson, 2003), WA01(7) and WA01(8) (Table 20). The POU5F1-reporter gene construct was reported to antagonize endogenous POU5F1 (Thomson, J., personal communication). POU5F1 is normally up-regulated in ES compared to non-pluripotent cells. Diminishing POU5F1 levels would be expected to affect the linear relationship of these libraries to other stem cell libraries. However, detection of POU5F1 is not noticeably different in either WA01(7) or WA01(8) compared to the remaining hESC lines (see Figure 8, Chapter 2.3.3). WA01m was constructed from cells removed from irr-MEFs and cultured on matrigel; it was expected to have the least mouse sequence contamination from incomplete removal of the embryonic feeder layer before mRNA isolation. WA01m was more similar to WA07 and ES04 than to other WA01 libraries. This might suggest that libraries more similar to WA01m (under constant experimental conditions, e.g., grown on irradiated MEFs) have less mouse embryonic fibroblast contamination.
The WA01 library (grown on irr-MEFs) was strongly correlated to WA14 (r=0.974) while the WA01(7) and WA01(8) libraries were noticeably less related to the stem cell lines (r values ranging from 0.8544 to 0.9138). These libraries do not cluster with one another in Figures 19 and 20 owing to the differences in experimental conditions.

The second goal of the hierarchical clustering was to measure the relatedness of hESCs to non-pluripotent cells and tissue types. Most publicly available libraries are short SAGE, so to do this comparison with the available resources, I had to perform the following two analyses: short SAGE versus extracted short SAGE (artificial 14mer sequences generated from 21mer tags) and long SAGE versus long SAGE. I used the coefficient of determination (r²) to obtain a mean correlation across multiple samples using the pair-wise Pearson correlation coefficients (Zar 1996). The mean r² calculated across all hESC libraries for each pair-wise comparison to a non-pluripotent library is presented in Tables 22 and 23.

Pearson correlations between long and short SAGE were problematic, as shown by the weaker than expected correlation between WA09(s) and WA09(L) (r=0.7604; refer to Table 21). The WA09(s) library was most correlated to the BG01 ES library (r=0.8162) as opposed to a non-pluripotent short SAGE library. This provides a relative measure of a perfect positive relationship between two libraries when comparing short to long SAGE. All correlations generated comparing non-pluripotent short libraries to hESC can be interpreted in relation to r=0.8162 approximating r=1. Table 22 lists the mean r² in descending order for short versus extracted short SAGE library comparisons. HESCs were most similar to CGAP29, a library constructed from vascular endothelium (r²=0.493; r=0.702). Mesoderm-derived libraries constituted 6 of the top 10 most correlated to hESCs (tissues represented: vascular endothelium, bone marrow, and prostate) (Table 22).
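The averaging step described above (squaring each pair-wise r, taking the mean across the hESC libraries, and reporting the square root alongside it) can be sketched in Python as follows. This is an illustrative sketch only: the function name and the example r values are hypothetical, and the exact procedure used to build Tables 22 and 23 is assumed rather than taken from a script in the appendices.

```python
import math
from statistics import mean, stdev


def mean_r_squared(r_values):
    """Return (mean r^2, sample SD of r^2, sqrt of mean r^2) for a list of r."""
    squared = [r * r for r in r_values]
    m = mean(squared)
    return m, stdev(squared), math.sqrt(m)


# Hypothetical r values of several hESC libraries against one comparator library.
example_r = [0.68, 0.72, 0.70, 0.74, 0.66]
m, sd, r_back = mean_r_squared(example_r)
print(f"mean r2 = {m:.3f}, SD = {sd:.3f}, r = {r_back:.3f}")

# The CGAP29 entry of Table 22: a mean r2 of 0.49346856 corresponds to r of about 0.702.
print(round(math.sqrt(0.49346856), 3))
```

Note that squaring discards the sign of r, so the back-transformed value is only meaningful when the underlying correlations are positive, as they are for the comparisons reported here.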
The top 10 most similar libraries to hESC included 2 ectoderm-derived libraries (eye lens and fetal brain), an endoderm derivative (lung), and a universal reference RNA sample; thus libraries originating from all three germ layers are among the ten most similar to hESCs. The reference RNA sample is a pool of multiple human cancer cell lines6 and is typically employed on microarrays for subtractive hybridization (http://www.stratagene.com). The top ten most similar libraries to the hESC libraries demonstrated weak correlations to the stem cell expression profiles (r² ranging from 0.35 to 0.49, and r from 0.59 to 0.70). From this analysis it was apparent that hESCs were not strongly related to any cell/tissue types other than themselves.

Table 22 Comparisons of short and extracted short SAGE libraries measured by Pearson correlation (r). The r² (coefficient of determination) values are a composite of comparisons for each extracted short SAGE hESC library against an adult or fetal normal CGAP library, combined to determine the average r² between all hESC lines and a differentiated cell/tissue type. Bolded libraries are extracted short SAGE libraries.
Mean r²      STDEV        Library accession  Tissue
0.49346856   0.044130197  CGAP29    Vascular normal CS control
0.436696766  0.081714228  CGAP409   Bone marrow normal AP CD34+/CD38-/lin-
0.425607806  0.068133099  CGAP443   Bone marrow normal AP CD34+/CD38+/lin+
0.424168977  0.048945875  CGAP72    Prostate normal MD PR317
0.407536843  0.082553484  CGAP1363  Eye lens B UIHI0
0.404822256  0.052949823  CGAP585   Lung normal CL L15
0.385883144  0.076145984  CGAP430   Brain fetal normal B S1
0.379414384  0.145578391  CGAP168   Universal reference human RNA CL
0.352839606  0.056480824  CGAP59    Prostate normal B 2
0.352831891  0.065611645  CGAP30    Vascular normal CS VEGF+
0.342395811  0.10246759   CGAP71    Kidney embryonic CL 293+beta-catenin
0.336588001  0.058926787  CGAP56    Kidney embryonic CL 293-control
0.320422764  0.04935973   CGAP584   Lung normal CL L16
0.312863044  0.047560295  CGAP131   Breast normal myoepithelium AP myoepithelial1
0.30365421   0.105265262  CGAP160   Placenta first trimester normal B 1
0.286733789  0.111589805  CGAP132   Breast normal B hyperplasial
0.28305822   0.037743008  CGAP49    Ovary normal CL IOSE29EC-11
0.267713931  0.081263737  CGAP405   Lymph Node normal B 1
0.24370988   0.091797008  CGAP404   Bone marrow normal B D01
0.241963306  0.068989617  CGAP362   Pancreas normal B 1
0.241153717  0.062168582  CGAP47    Breast normal epithelium AP Br N
0.241089385  0.037388241  CGAP9     Brain astrocyte normal CL NHA5
0.230604608  0.070626513  CGAP389   Thyroid normal B 001
0.209778502  0.117741133  CGAP203   Prostate normal epithelium CS senescent
0.209105008  0.059021753  CGAP398   Stomach normal B antrum
0.188514773  0.115570664  CGAP169   Prostate normal epithelium CS confluent
0.175561659  0.079993497  CGAP182   Breast normal organoid B
0.1719172    0.046039738  CGAP429   Vascular endothelium normal liver associated AP NLEC1
0.166050311  0.097843172  CGAP20    Ovary normal CS HOSE 4
0.165926198  0.065339335  CGAP135   Liver normal B 1
0.163196162  0.08269416   CGAP183   Breast normal stroma AP 1
0.163022158  0.052677852  CGAP181   White Blood Cells normal breast associated AP
0.161263383  0.060715932  CGAP141   Muscle normal B old
0.160178048  0.06111842   CGAP420   Breast normal myoepithelium AP IDC7
0.159110087  0.085138344  CGAP161   Placenta normal B 1
0.153291684  0.033958039  CGAP66    Pancreas normal CS HI26
0.151307066  0.048247808  CGAP136   Heart normal B 1
0.150896881  0.06236684   CGAP163   Retina Peripheral normal B 2
0.145797983  0.057929581  CGAP164   Retina macula normal B HMAC2
0.145020103  0.032635067  CGAP523   Brain normal leptomeninges B AL2
0.143986203  0.030734044  CGAP65    Pancreas normal CS HX
0.13424494   0.044472162  CGAP142   Muscle normal B young
0.129687028  0.024569831  CGAP669   Breast normal FS NER
0.129592598  0.047957376  CGAP55    Brain normal cerebellum B BB542
0.126930023  0.024188429  CGAP31    Breast normal epithelium AP 1
0.118323856  0.033198779  CGAP167   Spinal cord normal B 1
0.117928767  0.042967668  CGAP421   Brain normal substantia nigra B 1
0.112012558  0.042897007  CGAP364   Placenta hydatidiform mole B 1
0.110079643  0.037097221  CGAP133   Brain normal peds cortex B HI571
0.106257967  0.017455054  CGAP180   Vascular endothelium normal breast associated AP 1
0.099761571  0.043706959  CGAP53    Brain normal thalamus B 1
0.099292949  0.04449798   CGAP99    Lung normal B 1
0.095455116  0.03664259   CGAP134   Stomach normal epithelium B body1
0.091490914  0.025179833  CGAP98    Leukocytes normal B 1
0.088486662  0.040775561  CGAP165   Retina Pigment epithelium normal B 1
0.078527597  0.032035846  CGAP91    Peritoneum normal B 13
0.061807725  0.028743426  CGAP162   Retina Peripheral normal B 1
0.054822587  0.023020957  CGAP34    Brain normal cerebellum B 1
0.046971211  0.033483769  CGAP94    Kidney normal B 1
0.045586712  0.030369154  CGAP11    Colon normal B NC1
0.044868973  0.021719298  CGAP8     Brain normal cortex B BB542
0.044665395  0.02993352   CGAP1103  Skin normal B NS
0.036764595  0.020819044  CGAP10    Brain normal cortex B pool6
0.033665333  0.014154952  CGAP44    Fibroblasts CL postcrisis
0.02950209   0.020564078  CGAP12    Colon normal B NC2

6 Reference RNA comprised a pool of total RNA from 10 cell lines from various organs and tissues: B-lymphocyte (plasmacytoma, myeloma), mammary gland (adenocarcinoma), liver (hepatoblastoma), cervix (adenocarcinoma), testis (embryonal carcinoma), brain (glioblastoma), melanoma, liposarcoma, macrophage (histiocytic lymphoma, histocyte), and T-lymphoblast (lymphoblastic leukemia).

Table 23 Comparisons of long SAGE libraries measured by Pearson correlation (r). The r² (coefficient of determination) values are a composite of comparisons for each long SAGE hESC library compared to a library from CGAP or from our own database of normal and malignant cells/tissues; these values were combined to determine the average r² between all hESC lines and a differentiated cell/tissue type. Bolded libraries were used to generate extracted short SAGE libraries for the normal CGAP dataset.

Mean r²      STDEV     Library accession  Tissue
0.640058639  0.079251  sc004    Multipotent adult progenitor cells
0.521967335  0.043546  CGAP656  Brain fetal normal B S1
0.508129685  0.075261  sc005    Multipotent adult progenitor cells
0.427168466  0.114232  CGAP703  Breast carcinoma
0.403683813  0.101766  CGAP683  Breast tumor fibroblast
0.402505235  0.11343   CGAP723  Breast fibroadenoma
0.355476285  0.09022   sc003    Multipotent adult progenitor cells
0.328857095  0.048539  CGAP655  Vascular endothelium normal liver associated AP 1
0.323480686  0.046945  CGAP675  Breast carcinoma
0.320603852  0.040084  CGAP673  Breast carcinoma
0.319053675  0.072053  CGAP643  Pancreas normal B 1
0.310902763  0.046104  CGAP645  Breast carcinoma
0.306680995  0.078762  CGAP647  Breast normal myoepithelium AP IDC7
0.287196046  0.063246  CGAP657  Breast carcinoma
0.244757859  0.082618  CGAP963  Lung adenocarcinoma
0.236308343  0.057561  CGAP649  Breast carcinoma
0.221390168  0.023823  shs01    Haematopoietic stem cells
0.196770585  0.032394  shs02    Haematopoietic stem cells
0.195653974  0.077994  CGAP648  Brain normal substantia nigra B 1
0.131343191  0.024434  sle02    Leukemia
0.100959373  0.014366  shs03    Haematopoietic stem cells
0.053800208  0.012898  sle01    Leukemia
0.017756184  0.006893  sle03    Leukemia

ES are transcriptionally complex, expressing more types of genes than other cell states (Gerecht-Nir et al., 2005; Golan-Mashiach et al., 2005). This complexity was defined by the expression of a wide repertoire of genes with disparate developmental roles. ES are thought to be primed for differentiation to any cell type in the embryo proper by expressing genes encoding the differentiation machinery (at low levels); this property has been termed the "Just in case" theory (Golan-Mashiach et al., 2005). The maintenance of these levels is pivotal to the undifferentiated phenotype, and until a response is elicited by a specific growth or differentiation signal, these expression levels are stably preserved. This was evidenced by the complete expression of embryonic developmental pathways and the high expression of pathway antagonists in undifferentiated hESCs (u-hESCs) (see Chapter 2.3.4: Developmental signalling pathway expression in embryonic stem cells) (Brandenberger et al., 2004b). Thus transcriptional complexity also distinguishes ES from other cell types.

Figure 21 presents an assessment of tag type diversity in one ESC line (WA09L) versus non-ESC CGAP libraries. To take into account the inconsistent depths of sequencing that contribute to tag type diversity, the WA09 library was randomly sampled at intervals comparable to CGAP library sizes. Consistent with the literature, at most sampling depths WA09 expressed more tag types than other cell states (Figure 21), testifying to the wide breadth of genes expressed in ESC. Statistical testing of the tag type diversity at various sampling depths was performed using Student's t-test (Zar 1996) for extracted CGAP short SAGE libraries and the WA09 line (Figure 22).
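The random sub-sampling used to model sequencing depth can be sketched in Python as follows. This is not the Perl script of Appendix 3c: the function name and toy counts are hypothetical, and drawing tags without replacement is an assumption; only the use of 25 iterations per depth is taken from the methods.

```python
import random


def tag_types_at_depth(tag_counts, depth, iterations=25, seed=1):
    """Mean number of distinct tag sequences seen when `depth` tags are drawn.

    tag_counts : dict of tag sequence -> observed count in the full library
    depth      : number of tags to draw (without replacement) per iteration
    """
    # Expand the library into one entry per sequenced tag.
    pool = [tag for tag, count in tag_counts.items() for _ in range(count)]
    if depth > len(pool):
        raise ValueError("sampling depth exceeds the library size")
    rng = random.Random(seed)
    observed = [len(set(rng.sample(pool, depth))) for _ in range(iterations)]
    return sum(observed) / iterations


# Toy library with made-up counts, sampled at increasing depths.
toy_library = {"CATGAAAAAAAAAA": 50, "CATGCCCCCCCCCC": 30,
               "CATGGGGGGGGGGG": 15, "CATGTTTTTTTTTT": 5}
for depth in (10, 25, 50, 100):
    print(depth, tag_types_at_depth(toy_library, depth))
```

Plotting the returned means against depth gives a saturation curve of the kind compared with the CGAP libraries in Figures 21 and 22; the per-iteration values could also be kept to supply the standard deviations used in the t-tests.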
Similarly, WA09 showed a greater diversity of tag types at various sampling depths than adult/fetal samples (with the exception of the fetal brain library, which expressed more tag types at a depth of 300,000 total tags than WA09). The number of WA09 tag types expressed at 60,000 total tags was not significantly different from the number expressed in the liver vascular endothelium library (Figure 22).

Figure 21 Incremental random sampling of WA09(L) extracted short SAGE total tags. In silico generated WA09 libraries totalling 10,000-400,000 total tags were plotted against tag diversity (unique tag sequences) for each generated library. Experimental short SAGE libraries (CGAP) were plotted to compare tag diversity to a WA09 sampling of a similar total tag count.

[Figure 21: plot not reproduced. x-axis: thousands of total tags (0-450); y-axis: unique tag sequences (0-120,000); α = 0.05. Series: WA09, pooled lung carcinoma, pooled medulloblastoma, pooled prostate carcinoma cell line, lymphoma, pooled foreskin fibroblast, brain (greater than 95% white matter), heart, lung, astrocytoma, pooled pancreas cancer cell line, pediatric frontal cortex, pooled pancreas neoplasia, colon epithelium, ependymoma, stomach cancer, glioblastoma, prostate, liver, thalamus, blood (leukocyte), astrocyte, reference RNA, peritoneum (mesothelial cells), cerebellum, kidney (embryonic), ovary surface epithelium, kidney, monocyte leukemia, mesothelioma, stomach epithelium, pancreas (duct epithelium), ovarian cancer.]

Figure 22 Incremental random sampling of WA09(L) extracted short SAGE total tags and normal differentiated CGAP (n-CGAP) extracted short SAGE libraries. WA09 libraries totalled 10,000-400,000 total tags and were plotted against the number of unique tag sequences for each generated library. Significance of the difference between tag types expressed at each sampling depth was determined using the Student's t-test (see inset data table). Standard deviations were included.

[Figure 22: plot not reproduced. x-axis: thousands of total tags (0-350); y-axis: unique tag sequences (0-90,000). Series: hESC, pancreas, brain substantia nigra, WBC, breast myoepithelium, liver vascular endothelium, fetal brain.]

n-CGAP tissue                  Thousands of total tags  Tag types (n-CGAP)  Tag types (hESCs)  T-TEST
Pancreas                       20                       7863.2              11046              5.27562E-25
Brain substantia nigra         40                       16587               18566              7.40044E-39
WBC                            40                       15651.33            18566              4.17339E-15
Breast myoepithelium           60                       21334.6             24878              5.00258E-22
Liver vascular endothelium[1]  60                       24883.8             24878              0.400221756
WBC                            80                       26831               30539              7.58436E-41
WBC                            100                      29866.13            35727              1.6408E-15
Fetal brain[2]                 300                      78472               76400              2.61476E-30

1 Insignificant difference in tag type diversity between hESCs and n-CGAP.
2 The fetal brain library expressed more tag types than the WA09 library at 300,000 total tags. Note that whole brain was used and was thus representative of multiple cell types (e.g., neurons and glial cells).

I additionally examined the degree of similarity between long SAGE libraries (hESC, n/c-CGAP, and our own database of normal and malignant cells). Table 23 presents the mean r² values in descending order for the hESC lines versus a CGAP library. The correlations are additionally depicted in Figure 20 as a hierarchical tree; recall that branch lengths are measured as 1-r. Three major branches were identified from the cluster analysis: hESCs cluster closely together in one branch of the tree, a fetal brain library comprises its own branch, and the remaining CGAP libraries (mainly generated from carcinomas and adult progenitor/stem cells) cluster into 3 subgroups of the last branch.
Based on the evidence presented in Table 23 (mean r² values between long SAGE libraries and hESCs), hESCs were highly correlated to a multipotent adult progenitor cell (MAPC) line (MAPC4; r²=0.64, r=0.80), fetal brain (CGAP656), and another MAPC library constructed for the same line (MAPC5). MAPCs, isolated from bone marrow and CD34/CD45 negative7, have been shown to possess characteristics similar to pluripotent cell lines in their ability to differentiate into most mesodermal and neuroectodermal cells in vitro and all embryonic lineages in vivo (Schwartz et al., 2002). The similarities in developmental potential between MAPCs and ESCs demonstrated experimentally were reflected in the high correlation between the SAGE libraries from each cell type.

As stated earlier, Pearson's correlations are affected by extremes in gene expression levels; thus if two libraries commonly express a highly abundant tag, this may inappropriately result in a stronger correlation than biology would suggest. To test whether the correlation between MAPC and hESC was heavily influenced by outlier tags, the top 10 most abundantly expressed genes amongst all hESC libraries and MAPC4 were extracted. Taking the intersection between all tags resulted in two sequences common to the hESC and MAPC libraries and typically expressed at roughly 1000 tags per 100,000 total tags per individual library (Table 24). Pearson correlations were calculated for the MAPC and hESC metalibrary including the two abundantly expressed genes and excluding these genes (Table 25). The highly expressed genes appear to have a minor effect on the correlation between hESC and MAPC, but this does not invalidate the finding that both stem cell libraries share strong similarities in long SAGE transcriptome profiles.

Haematopoietic stem cell (HSC) libraries used in this analysis were among the least similar to ESCs (Table 23).
I assume that the result was primarily due to technical differences in long SAGE library construction; all HSC and leukemia libraries were constructed using the PCR-SAGE technique, which utilized an amplification step prior to long SAGE construction. Pearson correlation is biased towards highly expressed tags; the introduction of the mRNA amplification step in PCR-SAGE may have resulted in the preferential sampling of certain abundant transcripts shared in HSCs. Thus the correlation coefficient comparing conventional long SAGE and PCR SAGE may not accurately describe the biological similarity between the cell types sampled, similar to the notion that short SAGE and long SAGE were not accurately compared using Pearson's correlation coefficients (A. Delaney, personal communication).

7 CD34 and CD45 are hematopoietic stem cell markers.

Table 24 The intersection of the top most highly expressed tags in MAPC and hESC libraries. Tag counts for each sequence are normalized to tags per 200,000.

Library   GAAAAATGGTTGATGGA                   TGTGTTGAGAGCTTCTC
          LAMR1 (ribosomal protein SA)        EEF1A1 (eukaryotic translation elongation factor 1 alpha 1)
MAPC4     1,068.5                             1,253.9
WA09(L)   579.5                               860.3
WA01(7)   769                                 1,122.1
WA01(8)   784.5                               987.8
ES03      1,796.9                             1,660.6
ES04      1,653.7                             1,985.4
WA07      1,382.2                             2,584.6
WA14      1,933.4                             1,763.7
WA13      1,500.7                             1,820
WA01(m)   1,037.5                             3,003.5
WA01      1,780.6                             1,804.5
BG01      1,749.3                             1,893

Table 25 Calculation of Pearson correlation between the hESC metalibrary and MAPC4 using Correlate (written by Allen Delaney, Gene Expression Informatics). The first comparison includes all tags expressed in both libraries and the second comparison excludes the two most abundant tags that are shared between hESC and MAPC libraries (*minus_high_exp libraries). Min - minimum tag count for a gene; Max - maximum tag count for a gene; Sum - total tags in MAPC4 or hESC metalibrary; Mean - mean tag count level for the library; SD - standard deviation of the mean; Correlation matrix - Pearson correlation coefficient calculated for genes detected in both MAPC4 and the hESC metalibrary.

Analysis for 24,701 cases of 2 variables:

Variable   MAPC4     hESC meta
Min        1         1
Max        1,676     22,697
Sum        182,559   1,842,608
Mean       7.4       74.6
SD         36.3      424.2

Correlation matrix:
           MAPC4     hESC meta
MAPC4      1         0.8142
hESC meta  0.8142    1

Analysis for 24,699 cases of 2 variables:

Variable   MAPC4 minus high exp   hESC meta minus high exp
Min        1                      1
Max        1,676                  17,410
Sum        180,242                1,802,692
Mean       7.3                    73.0
SD         34.8                   383.8

Correlation matrix:
                                 MAPC4 minus high expressers   hESC meta minus high expressers
MAPC4 minus high expressers      1                             0.7989
hESC meta minus high expressers  0.7989                        1

Fetal brain was found to be among the most highly similar libraries to hESCs in both the in silico short SAGE analysis (Table 22) and this long SAGE comparison. Neural cell types are transcriptionally complex, expressing many transcripts at low levels (Evans et al., 2002). On average, fetal brain and the hESC libraries have more tag types in common than ESCs have with other cell states (data not shown). Additionally, pathways functioning in early neural development, such as the Wnt and Nodal pathways, were implicated in ESC maintenance (see Chapter 2.3.4) (Brandenberger et al., 2004b).

A caveat of the analysis was the paucity of tissues sampled for long SAGE library construction. The comparisons possible were between tissues and cell types separated by years of development. With the current human SAGE data, it was not possible to describe global similarities between earlier germ layer derivatives and hESCs.
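The outlier check summarised in Tables 24 and 25 (correlate two libraries, remove the tags that rank among the most abundant in both, and correlate again) can be sketched in Python. This is an illustration under stated assumptions and not the Correlate program: the function names and toy counts are hypothetical, and the sketch compares two libraries directly rather than building the hESC metalibrary.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def correlation_with_and_without_top_tags(lib_a, lib_b, top_n=10):
    """Return (r over all shared tags, r without shared top tags, removed tags)."""
    top_a = set(sorted(lib_a, key=lib_a.get, reverse=True)[:top_n])
    top_b = set(sorted(lib_b, key=lib_b.get, reverse=True)[:top_n])
    abundant = top_a & top_b  # tags that are highly expressed in both libraries
    shared = sorted(set(lib_a) & set(lib_b))
    trimmed = [tag for tag in shared if tag not in abundant]
    r_all = pearson([lib_a[t] for t in shared], [lib_b[t] for t in shared])
    r_trimmed = pearson([lib_a[t] for t in trimmed], [lib_b[t] for t in trimmed])
    return r_all, r_trimmed, abundant


# Toy data (made-up counts): one dominant shared tag masks an otherwise
# inverse relationship between the remaining tags.
lib_a = {"hi": 1000, "t1": 1, "t2": 2, "t3": 3, "t4": 4}
lib_b = {"hi": 900, "t1": 4, "t2": 3, "t3": 2, "t4": 1}
print(correlation_with_and_without_top_tags(lib_a, lib_b, top_n=1))
```

In the toy data the correlation flips sign once the dominant tag is removed, which is the extreme case of the distortion discussed above; in Table 25 the corresponding change was small (0.8142 to 0.7989).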
The conclusion one may draw is that hESCs bear no particular similarity to most other derivatives of the embryonic germ layers represented in the adult, except for other adult stem cell populations such as MAPCs.

As stated earlier, conclusions drawn from comparing extracted short SAGE to short SAGE, or conventional long SAGE to PCR SAGE, using the Pearson correlation are not necessarily reflective of the biological similarities/dissimilarities between cells and tissue types. Higher correlations were observed between comparisons using the same technique (e.g., long SAGE versus long SAGE) as opposed to a comparison between two extracted short SAGE libraries (see bolded libraries in Tables 22 and 23). However, I did observe the conservation of similar correlations between tissue types with both long SAGE and extracted short SAGE libraries to the ES libraries. For example, all ES long SAGE libraries were still more highly related to the short SAGE library generated for WA09 than any other n-CGAP short SAGE library. Similarly, all n-CGAP long SAGE libraries that were truncated to short sequences for comparison maintained the relative distance to the ES library cluster that they demonstrated in the long SAGE clustering analysis. It remains that the tissues sampled to construct long SAGE libraries were severely limited in comparison to the wealth of short SAGE transcriptome profiles generated to date. Given the enhanced ability to map long SAGE tags and the complications in comparing SAGE libraries constructed with different protocols, there is a need to repeat these analyses once long SAGE libraries have been generated for a wider breadth of tissues and cell types.

3.3.2 Isolation of differentially expressed tags

Genes that define the pluripotent state may have a differential expression pattern in undifferentiated hESC compared to differentiated populations.
Cell fate determination pathway constituents are likely to be down-regulated while genes involved in cell proliferation, cell survival and maintenance of genomic integrity may be up-regulated in ES. I wanted to determine if the genes up-regulated in the hESC SAGE data compared to adult and fetal CGAP libraries overlapped with published stem cell marker lists (identified in u-hESCs and embryonic cells). I also sought to define new markers of u-hESCs to distinguish these cells from subpopulations of adult/fetal tissues grouped according to embryonic germ layer origin. Distinguishing u-hESCs from endoderm-libraries, mesoderm-libraries, or ectoderm-libraries may provide us with insight into which genes are necessary to direct differentiation to a specific germ-layer derivative. A subset of genes may identify molecular differences between hESCs and all tissue types of a given embryonic germ layer, or the differences may be cell/tissue-type specific.

The frequency of a transcript in the hESC metalibrary was compared to its frequency in n-CGAP libraries classified according to their embryonic germ-layer origin. In total there were 43 pair-wise comparisons (15 ectoderm derived, 16 mesoderm derived, and 12 endoderm derived libraries). Significantly up-regulated or down-regulated genes were determined by Audic-Claverie statistics and defined by a P-value of 0.05 or less. In general, significant up-regulation in hESC demonstrated a minimum 3-fold change while down-regulation demonstrated a minimum of 1/3rd-fold change. Among all pair-wise comparisons to the germ layer derivatives 4,771 sequences were up-regulated and 359,529 were down-regulated.
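The Audic-Claverie comparison named above can be sketched as follows. This is a minimal illustration, not the software used in the analysis: it implements the conditional probability published by Audic and Claverie (1997) for seeing y tags in a library of size n2 given x tags in a library of size n1, and it assumes a one-sided tail in the direction of the observed change. The function names and the tail convention are assumptions.

```python
import math

def ac_prob(x, y, n1, n2):
    """Audic-Claverie probability of y tags in library 2 (size n2)
    given x tags in library 1 (size n1), computed in log space."""
    ratio = n2 / n1
    log_p = (y * math.log(ratio)
             + math.lgamma(x + y + 1) - math.lgamma(x + 1) - math.lgamma(y + 1)
             - (x + y + 1) * math.log1p(ratio))
    return math.exp(log_p)

def ac_pvalue(x, y, n1, n2):
    """One-sided tail in the direction of the observed change
    (an assumed convention, not necessarily the one used in the thesis)."""
    lower = sum(ac_prob(x, k, n1, n2) for k in range(y))  # P(Y < y | x)
    if y * n1 >= x * n2:
        # y is relatively higher than x: upper tail P(Y >= y | x)
        return max(0.0, 1.0 - lower)
    # y is relatively lower than x: lower tail P(Y <= y | x)
    return min(1.0, lower + ac_prob(x, y, n1, n2))

def fold_change(x, y, n1, n2):
    """Ratio of normalized tag frequencies, library 2 over library 1 (x must be > 0)."""
    return (y / n2) / (x / n1)
```

Under these assumptions, a tag would be called up-regulated in library 2 when `ac_pvalue(x, y, n1, n2) <= 0.05`, and the text's observation of a roughly 3-fold minimum corresponds to `fold_change(...) >= 3` for most such tags.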
The discrepancy between the absolute numbers of up-regulated and down-regulated tags was due to the use of many disparate tissue types expressing numerous genes that specifically define their terminally differentiated phenotype. Sequences that were consistently up-regulated or down-regulated across all of the ectodermal, mesodermal, or endodermal derived libraries were isolated (Table 26). The exception was a set of genes down-regulated among the mesodermal derivatives that were differentially expressed across 15 of the 16 libraries. CGAP56 (embryonic kidney) did not share a subset of commonly down-regulated tags across the remaining mesoderm derivatives.

Table 26 Summary of the total number of tag sequences up- or down-regulated in the hESC SAGE libraries compared to ectoderm-derived normal CGAP (n-CGAP) libraries, mesoderm-derived n-CGAP libraries, and endoderm-derived n-CGAP libraries.

Germ layer | Libraries | Up-regulated | Down-regulated | Total tags per germ layer
Ectoderm | All 15 libraries | 105 | 9 | 374,275 [1] (186,134 [2])
Ectoderm | Any 15 libraries | 4,303 | 274,274 |
Mesoderm | All 16 libraries | 70 | 5 | 311,998 [1] (209,158 [2])
Mesoderm | Any 16 libraries | 2,702 | 230,265* |
Endoderm | All 12 libraries | 95 | 8 | 182,787 [1] (135,391 [2])
Endoderm | Any 12 libraries | 2,706 | 219,009 |
3 germ layers | All 43 libraries | 32 | 0 |
3 germ layers | Any 43 libraries | 4,771 | 359,529 |

1. Sum of tag counts for each SAGE library in a germ layer category
2. Total unique tag sequences
hESC extracted short SAGE libraries were compared to CGAP libraries.

Genes differentially expressed in the hESC metalibrary versus all n-CGAP libraries, or versus subsets of endoderm-libraries, mesoderm-libraries, or ectoderm-libraries, were listed in Tables 27-33. Note that Tables 28, 30, and 32, which represent the subset of differentially expressed tags in ectoderm, mesoderm, and endoderm derivatives respectively, do not list the genes commonly down-regulated across all CGAP libraries or the genes that are specifically down-regulated in one set of embryonic germ layer libraries.
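The distinction between tags called in all libraries of a germ layer and tags called in any library can be sketched with set operations. This is an illustrative reconstruction under stated assumptions: per-library calls are taken as already made, the function names are hypothetical, and the toy tag names are invented.

```python
def summarize_calls(calls_by_library):
    """calls_by_library maps library name -> set of tags called up-regulated.
    Returns (tags called in every library, tags called in at least one)."""
    sets = list(calls_by_library.values())
    if not sets:
        return set(), set()
    return set.intersection(*sets), set.union(*sets)

def layer_specific(consistent_by_layer, layer):
    """Tags consistent within one germ layer but not consistent in any other."""
    others = set().union(
        *(tags for name, tags in consistent_by_layer.items() if name != layer)
    )
    return consistent_by_layer[layer] - others

# Toy example, invented for illustration only.
ecto_calls = {"lib1": {"A", "B", "C"}, "lib2": {"A", "B"}, "lib3": {"A", "D"}}
in_all, in_any = summarize_calls(ecto_calls)
consistent = {"ecto": {"A", "X"}, "meso": {"A", "Y"}, "endo": {"A"}}
ecto_only = layer_specific(consistent, "ecto")
```

The intersection shrinks quickly as libraries are added, which is one way to read the large gap between the "All" and "Any" rows of Table 26.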
The genes represented in Tables 28, 30, and 32 were consistently up-regulated across at least two germ layer groupings. Tables 29, 31, and 33 listed genes that were only differentially expressed in hESC versus one set of embryonic germ layer libraries. Genes that were solely differentially expressed in hESC versus ectoderm libraries included several tags that only mapped to a single genomic location (Table 29; highlighted in yellow). The genes detected by the SAGE tags (termed ecto01-ecto07) could function to repress genes directing ectoderm differentiation. Conversely, ecto08 and ecto09 were down-regulated in embryonic stem cells compared to ectoderm libraries; they might encode determinants of an ectodermal fate in the early embryo. Other hypothetical uncharacterized genes shown to be differentially expressed in hESC versus mesoderm or endoderm libraries were listed in Tables 31 and 33 respectively and are termed meso01-05 and endo01-03 (up-regulated in hESC).

Table 27 Up-regulated gene list in u-hESC metalibrary compared to n-CGAP libraries. (Note: there were no genes that were consistently down-regulated in hESCs compared to all n-CGAP libraries). Italicized sequences have an ambiguous tag-to-gene mapping.
Sequence | Count | Genome hits | Symbol | GO | Fold change
AAAATTTACAGTTTGCC | 482 | 1 | LECT1 [1] | Molecular function unclassified; Skeletal development; Angiogenesis | 9.576184
ATGATGATGATGGGACT | 1873 | 2 | SLC25A5 [1] | Transporter; Mitochondrial carrier protein; Nucleoside, nucleotide and nucleic acid transport; Transport | 6.464523
ATGTTAATAAAATAGGC | 720 | 1 | GPC4 [1] | Cell adhesion molecule; Extracellular matrix glycoprotein; Cell adhesion | 12.5975
CAAACACCGTTGTAACC | 818 | 1 | Hs.334219 | | 17.83319
CAAATTTTATTGTTAGT | 844 | 1 | DPPA4 [1] | Molecular function unclassified | 18.13544
CAGTCTCTCAAGTCCCG | 3869 | 8 | CSRP1 | Molecular function unclassified; Muscle development; Cell proliferation and differentiation | 6.896997
CAGTCTCTCAAGTCCCG | 3869 | 8 | RPS10 | Ribosomal protein; Protein biosynthesis | 6.896997
CCCTCCTGGACAAGGCT | 774 | 3 | HMGA1 [1] | Molecular function unclassified | 11.6491
GAAAAGGGTTTTCTTTT | 1600 | 1 | LAPTM4B [1] | Transporter | 13.05209
GAAAGAAAGAGAGGAAA | 1164 | 1 | cDNA CS0CAP004YK13 | | 24.29679
GAGGACACAGATGACTC | 1365 | 1 | PODXL [1] | Extracellular matrix glycoprotein; Cell structure | 16.32674
GCTGTTTATTTCACCTG | 1002 | 656 | Similar to XP_375833.1 | | 20.37089
GGGCTGTGAAATGGGTG | 721 | 1 | esc01/nt_016354 | - | 15.949
GTCCTGGTGGTGGGGGG | 641 | 1 | esc02/nt_037704 | | 13.47483
GTTTAAATCGACTGTTT | 1699 | 1 | PSMA2 [1] | Other proteases; Proteolysis | 5.805942
TAATTCTACCAAGGTCT | 1521 | 5 | TDGF1 [1] | Growth factor; Ligand-mediated signalling; Developmental processes; Cell proliferation and differentiation | 33.30267
TACTGGTTTGTATATTT | 581 | 1 | FLJ35259 | | 12.09318
TAGCTACAGGACATTTT | 988 | 1 | DNMT3B [1] | DNA methyltransferase; DNA methyltransferase; DNA metabolism | 20.35203
TATCACTTTTTTCTTAA | 2655 | 4 | POU5F1 [2] | Homeobox transcription factor; Nucleic acid binding; mRNA transcription regulation; Developmental processes | 58.23732
TTCATTATAATCTCAAA | 6202 | 7 | PTMA | Molecular function unclassified | 11.89386
TTGCTCACACAAAAAAA | 997 | 1 | BM759098 | | 14.8713
TTGCTCACACAAAAAAA | 997 | 1 | BM692360 | | 14.8713
TTGCTCACACAAAAAAA | 997 | 1 | CD247421 (WA01 EST library; NIH-MGC) | | 14.8713
TTTACTGCTAGAAACCA | 1391 | 4 | LIN28 [1] | Other RNA-binding protein | 35.94167
TTTTATGGGTAACTTTT | 1304 | 1 | CCNG1 [1] | Kinase activator; Cell cycle control | 10.99607

1. Described in previous studies as potential hESC markers.
2. POU5F1 (highlighted in green) is the only known transcription factor presented in this list.

Table 28 Genes up-regulated in hESC compared to ectoderm-derived libraries. These genes were found to be similarly differentially expressed in at least one other grouping of germ-layer derivatives. (See legend below for explanation of table).

Meta sequence | Count | Gene symbol | Mean P-value | Mean ln(ratio) | Gene Ontology
GATCCCAACATTGTTGG | 1980 | ATP5B | 0.002953333 | 1.87785625 | Other ligand-gated ion channel; Anion channel; Hydrogen transporter; ATP synthase; Hydrolase; Purine metabolism; Cation transport; ATP synthesis->F1 beta
ATAGACATAAAATTGGT | 1158 | C1QBP [1] | 0.00022 | 1.7311 | Complement component; Antibacterial response protein; Complement-mediated immunity
CTGTGACACAGCTTGCC | 1219 | CCT2 | 0 | 2.2029 | Chaperonin; Protein folding
CTATATTTTTTAAAATC | 587 | CTSC | 0.00138 | 2.18983125 | Cysteine protease; Proteolysis
ATTTGTCCCAGCCTGGG | 5258 | HMGA1 [2] | 0 | 2.98228125 | Molecular function unclassified; Biological process unclassified
TCTGCAAAGGAGAAGTC | 1081 | HMGB2 [2] | 0.000686667 | 2.12788125 | HMG box transcription factor; Chromatin/chromatin-binding protein; Chromatin packaging and remodeling; p53 pathway->High mobility group protein 1
TTGCTCACAAAAAAAAA | 900 | Hs.382100 | 6.67E-05 | 2.5868375 |
GGTTGAAAAAAAAAAAA | 564 | Hs.446545 | 0.001486667 | 2.16851875 |
ACCATTGGATTCATCCT | 1015 | IFITM1 | 0.002033333 | 2.1404625 | Other miscellaneous function protein; Cell proliferation and differentiation
AATAAAACACATTTTAT | 607 | LEFTB [3] | 0.00122 | 2.3153 | TGF-beta superfamily member; Biological process unclassified; TGF-beta signalling pathway->transforming growth factor beta
AAATAAAGAATTTAAAG | 1380 | MGST1 | 0.0013 | 2.17775 | Other transferase; Protein modification; Detoxification
TACAAAACCATTTTTTC | 1439 | NCL | 0.000693333 | 1.9036625 | Ribonucleoprotein; rRNA metabolism
GAATCGGTTATACTCGG | 1535 | NDUFS5 | 2.67E-05 | 1.719325 | Oxidoreductase; Electron transport
TATTTTAAATGCCACCT | 437 | esc03/nt_007741 | 0.008546667 | 1.71855625 |
CGAACAAAAGACTTCGG | 967 | esc04/nt_011875 | 4.67E-05 | 2.61644375 |
GAAAGAAAGAGAGGAAA | 1164 | esc05/nt_022517 | 6.67E-06 | 2.9392875 |
CGCACAATCATTGAGTT | 569 | esc06/nt_033903 | 0.00218 | 2.32865 |
AAAAGAAACTTGTGCTT | 2314 | PABPC1 | 0.000333333 | 2.0241625 | Nucleic acid binding; Pre-mRNA processing
AATACTTTTGTATTGCT | 871 | PAI-RBP1 | 0.00012 | 2.17576875 | Other RNA-binding protein; Interacts with chromatin-remodeling factor CHD3; Biological process unclassified
CAATAAATGTTCTGGTT | 9202 | RPL37 | 0 | 1.56810625 | Ribosomal protein; Protein biosynthesis
GCTTTTAAGGATACCGG | 6329 | RPS20 | 0 | 2.2446875 | Ribosomal protein; Protein biosynthesis
TAATAAAGGTGTTTATT | 8849 | RPS8 | 0.000473333 | 1.41510625 | Ribosomal protein; Protein biosynthesis
GATTATTGGGATTGTAG | 473 | SEPHS1 | 0.004733333 | 1.81708125 | Other transferase; Amino acid biosynthesis
TATCAATATTCACTTGA | 743 | SFRP1 [3] | 0.002733333 | 2.0177875 | Other signalling molecule; Ligand-mediated signalling; Angiogenesis->Frizzled-Related Protein; Wnt signalling pathway->secreted frizzled-related protein
CTGTCATTTGTAATATG | 1121 | SFRS3 | 0.00148 | 1.5174625 | Molecular function unclassified; mRNA splicing
TGATAGTCTGAAATATG | 507 | Similar to KIAA1606 | 0.003433333 | 2.10314375 |
TACATTTTCATATTAGA | 1632 | SNRPG [1] | 0 | 2.42744375 | mRNA splicing factor; mRNA splicing
TATAATCTTTATGGCTT | 601 | SSBP1 | 0.002793333 | 1.47771875 | Single-stranded DNA-binding protein; DNA replication; DNA repair; DNA replication
ATAAAGTAACTGGTTTG | 897 | STRAP | 0.006193333 | 1.16710625 | Other miscellaneous function protein; Receptor protein serine/threonine kinase signalling pathway
TACGTACTGCCTGCCCG | 650 | TIMM13 | 0.004486667 | 1.7855875 | Other miscellaneous function protein; Intracellular protein traffic; Protein targeting; Transport; Hearing
GGTTTGGCTTAGGCTGG | 2384 | UQCRH | 0.000433333 | 1.24695 | Reductase; Oxidative phosphorylation

Table 28 Legend:
Italicized tags: 3 or more genomic hits
TGATAGTCTGAAATATG: 512 genomic hits
1. Downregulated upon differentiation (Brandenberger et al., 2004b; Richards et al., 2004; Sato et al., 2003)
2. POU5F1 target/cofactor/co-expressed
3. Embryonic developmental pathway genes

Table 29 Differentially expressed in hESC versus ectoderm-derived libraries only. (See inset legend for table explanations).

Meta sequence | Count | Gene symbol | Mean P-value | Mean ln(ratio) | Gene Ontology
GGAACAAACAGATCGAA | 4238 | CD24 | 0 | 3.37100625 | Molecular function unclassified; Immunity and defense; Developmental processes; Cell proliferation and differentiation
ATTTCCTTGAATGTGGC | 981 | Hypothetical gene X69397 | 0.00004 | 2.74305625 |
ATTAAGAGGGACGGCCG | 1671 | ecto01/nt_005058 [1] | 1.33E-05 | 2.29159375 |
GTGACAGAATTGATATC | 897 | UGP2 | 0.000546667 | 2.06641875 | Nucleotidyltransferase; Other polysaccharide metabolism
GTCTTGAACTGAGAGTC | 510 | ASNS | 0.004686667 | 1.89019375 | Synthetase; Other ligase; Amino acid biosynthesis
CTTAAGATTCAACTGGG | 432 | NOP5/NOP58 | 0.007033333 | 1.82459375 | Ribonucleoprotein; rRNA metabolism
GCTACTATTAGATCAGG | 506 | ecto02/nt_008183 [1] | 0.0033 | 1.8190125 |
CAGATCTTTGTGAAGAC | 1903 | UBA52 | 0.000406667 | 1.7732625 | Molecular function unclassified; Proteolysis
TGGTGTTGAGGAAAGCA | 7903 | RPS18 | 0 | 1.76955625 | Ribosomal protein; Protein biosynthesis
GTTTCTATCAATGTGAA | 672 | LYPLA1 | 0.002933333 | 1.68834375 | Phospholipase; Lipid metabolism
AATATTGAGAAGAAACT | 846 | EIF3S6 | 0.00354 | 1.6439 | Translation initiation factor; Protein biosynthesis; Translational regulation; Oncogene
TATCTGTCTACTTTCTC | 1164 | SET | 0.000733333 | 1.5966875 | Phosphatase inhibitor; DNA replication; DNA replication
CTGCTATACGAGAGAAT | 4530 | RPL5 | 0 | 1.596225 | Ribosomal protein; Protein biosynthesis
CTCCTCACCTGTATTTT | 3697 | ecto03/nt_011109 [1] | 0 | 1.584275 |
GTATCTTCACATCTTGG | 876 | HSBP1 | 0.003993333 | 1.57703125 | Transcription cofactor; Chaperone; mRNA transcription regulation; Other metabolism
TACAAGAGGAAGTACTC | 3398 | RPL6 | 0.000206667 | 1.53090625 | Ribosomal protein; Protein biosynthesis
GTGTAATAAGACATAAC | 2164 | HNRPA2B1 | 0.000193333 | 1.52698125 | Ribonucleoprotein; RNA localization
GAGTAGAGAAAAGAGAC | 635 | ecto04/nt_008470 [1] | 0.005886667 | 1.5026625 |
GTTTTTGCTTCAGCGGC | 1410 | ecto05/nt_005403 [1] | 0.00104 | 1.4751375 |
TTCTTGTGGCGCTTCTC | 2115 | ecto06/nt_011109 [1] | 0.001633333 | 1.39864375 |
GCATTTAAATAAAAGAT | 2713 | EEF1B2 | 0.000666667 | 1.3701125 | Translation elongation factor; Protein biosynthesis
TGAATCTGGGTGGGATA | 804 | ecto07/nt_008470 [1] | 0.00168 | 1.34310625 |
AATCCTGTGGAGCATCC | 3172 | RPL8 | 0.002046667 | 1.286325 | Other RNA-binding protein; Ribosomal protein; Protein biosynthesis
GGCTTTACCCTTCCCTG | 1956 | EIF5A | 0.002153333 | 1.11913125 | Translation initiation factor; Protein biosynthesis
ATAGGTCAGAAAGTGTA | 58 | CLSTN1 [2] | 0.00595 | -2.46867 | Cell adhesion molecule; Calmodulin related protein; Annexin; Cell adhesion-mediated signalling; Cell adhesion; Calcium ion homeostasis
CGCCGACGATGCCCAGA | 16 | G1P3 [2] | 0.001543 | -3.54579 | Molecular function unclassified; Immunity and defense
GGCTGTACCCAAGCTGA | 79 | ecto08/nt_004487 [1] | 0.003357 | -2.58104 |
AGAGGTGGTGTGCAAAA | 10 | ecto09/nt_011362 [1] | 0.0021 | -3.02723 |
ACGGAACAATAGGACTC | 2 | PTGDS [2] | 0.003686 | -5.97111 | Synthase; Isomerase; Fatty acid biosynthesis; Muscle contraction
TGCACTTCAAGAAAATG | 7 | SPARCL1 [2] | 0.000436 | -4.83601 | Extracellular matrix glycoprotein; Other immune and defense; Cell proliferation and differentiation
TCTCTGATGCTTTGTAT | 47 | TIMP2 [2] | 0.000179 | -2.79686 | Metalloprotease inhibitor; Proteolysis
TACATAATTACTAATCA | 18 | TncRNA [2] | 0.003193 | -3.12937 |

Table 29 Legend:
Italicized tags: 3 or more genomic hits
TGATAGTCTGAAATATG: 512 genomic hits
1. Candidate novel regulators of ectoderm differentiation
2. Genes down-regulated in hESC

Table 30 Genes up-regulated in hESC compared to mesoderm-derived libraries. These genes were found to be similarly differentially expressed in at least one other grouping of germ-layer derivatives. See legend below for table explanations.
Meta sequence | Count | Gene symbol | Mean P-value | Mean ln(ratio) | Gene Ontology
GATCCCAACATTGTTGG | 1980 | ATP5B | 6.25E-06 | 1.7480375 | Other ligand-gated ion channel; Anion channel; Hydrogen transporter; ATP synthase; Hydrolase; Purine metabolism; Cation transport; ATP synthesis->F1 beta
ACAATGTTGTAGTGTCC | 616 | CRABP1 [1] | 0.000106 | 2.54705625 | Other transfer/carrier protein; Lipid and fatty acid transport; Lipid and fatty acid binding; Vitamin/cofactor transport; Steroid hormone-mediated signalling; Transport; Ectoderm development
CTATATTTTTTAAAATC | 587 | CTSC | 0.001094 | 2.30685 | Cysteine protease; Proteolysis
ATTTGTCCCAGCCTGGG | 5258 | HMGA1 [2] | 0 | 2.64125 | Molecular function unclassified; Biological process unclassified
AATAAAACACATTTTAT | 607 | LEFTB [3] | 0.000125 | 2.51866875 | TGF-beta superfamily member; Biological process unclassified; TGF-beta signalling pathway->transforming growth factor beta
GTCTACTTTAGGTGTGC | 1011 | MGC8685 | 1.88E-05 | 2.93589375 |
AAATAAAGAATTTAAAG | 1380 | MGST1 | 0.00215 | 2.23279375 | Other transferase; Protein modification; Detoxification
TACAAAACCATTTTTTC | 1439 | NCL | 0.00155 | 1.45008125 | Ribonucleoprotein; rRNA metabolism
TTTTCTTCTTTGGCTTG | 458 | esc07/nt_016354 | 0.0011 | 2.2051125 |
GAAAGAAAGAGAGGAAA | 1164 | esc05/nt_022517 | 0 | 3.22756875 |
CGCACAATCATTGAGTT | 569 | esc06/nt_033903 | 0.003738 | 2.1805625 |
TCTGTACACCTGTCCCC | 6678 | RPS11 | 0 | 2.705575 | Ribosomal protein; Protein biosynthesis
GCTTTTAAGGATACCGG | 6329 | RPS20 | 1.88E-05 | 1.2977875 | Ribosomal protein; Protein biosynthesis
GATTATTGGGATTGTAG | 473 | SEPHS1 | 0.00125 | 2.024225 | Other transferase; Amino acid biosynthesis
TTGCTCACAAAAAAAAA | 900 | Similar to HERV-H LTR-associating 3 | 6.25E-06 | 2.51748125 |
TGATAGTCTGAAATATG | 507 | Similar to KIAA1606 | 0.0007 | 2.16529375 | -
TACATTTTCATATTAGA | 1632 | SNRPG [1] | 8.13E-05 | 1.76109375 | mRNA splicing factor; mRNA splicing
CCGCCTCCGGGAATGAG | 1626 | SNRPN | 0 | 1.99073125 | mRNA splicing factor; mRNA splicing
TATATATTTGAACTAAT | 483 | SOX2 [2] | 0.00085 | 2.30940625 | HMG box transcription factor; Nucleic acid binding; mRNA transcription regulation; Neurogenesis
TACGTACTGCCTGCCCG | 650 | TIMM13 | 0.001269 | 1.90310625 | Other miscellaneous function protein; Intracellular protein traffic; Protein targeting; Transport; Hearing
TTATAATATAATGTTTT | 484 | TMEFF1 | 0.000563 | 2.33475 | Surfactant; Biological process unclassified
CAGTCTAAAATGCTTCA | 1763 | UCHL1 | 0.00075 | 3.011175 | Cysteine protease; Proteolysis; Parkinson disease->Ubiquitin C-terminal hydrolase-L1; Ubiquitin proteasome pathway->26S proteasome

Table 30 Legend:
Italicized tags: 3 or more genomic hits
TGATAGTCTGAAATATG: 512 genomic hits
1. Downregulated upon differentiation (Brandenberger et al., 2004b; Richards et al., 2004; Sato et al., 2003)
2. POU5F1 target/cofactor/coexpressed
3. Embryonic developmental pathway genes

Table 31 Differentially expressed in hESC versus mesoderm-derived libraries only. See legend for further explanation of the table.

Meta sequence | Count | Gene symbol | Mean P-value | Mean ln(ratio) | Gene Ontology
GGAACACACAGCACAGA | 844 | PSAT1 | 0.001006 | 2.2769625 | Transaminase; Amino acid biosynthesis
TTTTGTTAGTGCAAAAA | 368 | CLDN6 [1] | 0.001731 | 2.20416875 | Tight junction; Cell structure
CATCCAAAAATCAACAA | 390 | FLJ10884 | 0.002294 | 2.114 | Transcription factor; Nuclease; Nucleoside, nucleotide and nucleic acid metabolism; Other metabolism
GGGATTTTGTATAACCA | 669 | PMAIP1 | 0.003369 | 2.0699875 | Molecular function unclassified; Oncogenesis
GAGAAAACCCGGTACGC | 437 | meso01/nt_005612 [3] | 0.002188 | 2.06904375 |
CGTTCCTGCGGACGATC | 910 | ID1 [2] | 0.000975 | 1.9713125 | Other transcription factor; mRNA transcription regulation; Angiogenesis; TGF-beta signalling pathway->Transforming growth factor beta
ATGCTCCTGAGTAGAAC | 317 | Hs.471439 | 0.007894 | 1.92333125 |
AAAATATATCTCTGGAC | 306 | meso02/nt_016354 [3] | 0.009806 | 1.8768125 |
CTGCCTTCTTGGGGATT | 982 | PPP1CC | 0.001688 | 1.8400625 | Protein phosphatase; Other select calcium binding proteins; Other signal transduction; D1/D5 dopamine receptor mediated signalling pathway->Protein Phosphatase-1
GCTTCCTAAATGGCCCT | 329 | CBS | 0.010956 | 1.82614375 | Synthase; Lyase; Amino acid biosynthesis; Cysteine biosynthesis->O-Acetylserine-lyase
AAAGCAATCAACCCTGT | 399 | meso03/nt_009714 [3] | 0.003519 | 1.805775 |
CTTTGCACTCTCCTTTG | 552 | TCEA1 | 0.007538 | 1.80185 | Basal transcription factor; Nucleic acid binding; mRNA transcription elongation
GTCAACTGCTTCAGCTT | 669 | FLJ12666 | 0.003619 | 1.72971875 | Molecular function unclassified; Biological process unclassified
TTGGCATTGTCCCCTTT | 480 | meso04/nt_004487 [3] | 0.004913 | 1.615025 |
CATCTAAACTGCTGGGC | 1336 | WBSCR1 | 0.000944 | 1.3997 | Translation initiation factor; Protein biosynthesis
CAATGCTGCCAGCATTG | 1413 | meso05/nt_010755 [3] | 0.004094 | 1.27195625 |
TGCCTGCACCAGGAGAC | 169 | CST3 [4] | 0.005887 | -1.52227 | Cysteine protease inhibitor; Proteolysis
GAGTGGGGGCTTCACTC | 21 | DPP7 [4] | 0.003667 | -2.64715 | Serine protease; Proteolysis
CTGACCTGTGTTTCCTC | 128 | HLA-B [4] | 0.001333 | -2.2564 | Major histocompatibility complex antigen; MHCI-mediated immunity
TAGGTTGTCTAAAAATA | 2824 | TPT1 [4] | 6.67E-06 | -1.34433 | Non-motor microtubule binding protein; Immunity and defense

Table 31 Legend:
1. Downregulated upon differentiation (Brandenberger et al., 2004b; Richards et al., 2004; Sato et al., 2003)
2. Embryonic developmental pathway genes
3. Candidate novel regulators of mesoderm differentiation
4. Genes down-regulated in hESC

Table 32 Genes up-regulated in hESC compared to endoderm-derived libraries. These genes were found to be similarly differentially expressed in at least one other grouping of germ-layer derivatives. See legend for table explanations.
Meta sequence | Count | Gene symbol | Mean P-value | Mean ln(ratio) | Gene Ontology
GATCCCAACATTGTTGG | 1980 | ATP5B | 0.00095 | 1.807075 | Other ligand-gated ion channel; Anion channel; Hydrogen transporter; ATP synthase; Hydrolase; Purine metabolism; Cation transport; ATP synthesis->F1 beta
ATAGACATAAAATTGGT | 1158 | C1QBP [1] | 0.0038 | 1.580358 | Complement component; Antibacterial response protein; Complement-mediated immunity
CTGTGACACAGCTTGCC | 1219 | CCT2 | 0.00005 | 2.21255 | Chaperonin; Protein folding
ACAATGTTGTAGTGTCC | 616 | CRABP1 [1] | 0.001775 | 2.171125 | Other transfer/carrier protein; Lipid and fatty acid transport; Lipid and fatty acid binding; Vitamin/cofactor transport; Steroid hormone-mediated signalling; Transport; Ectoderm development
CTATATTTTTTAAAATC | 587 | CTSC | 0.005233 | 1.655708 | Cysteine protease; Proteolysis
TCTGCAAAGGAGAAGTC | 1081 | HMGB2 [2] | 0.000133 | 2.441175 | HMG box transcription factor; Chromatin/chromatin-binding protein; Chromatin packaging and remodeling
GGTTGAAAAAAAAAAAA | 564 | Hs.446545 | 0.002033 | 2.309683 |
ACCATTGGATTCATCCT | 1015 | IFITM1 | 4.17E-05 | 2.357142 | Other miscellaneous function protein; Cell proliferation and differentiation
GTCTACTTTAGGTGTGC | 1011 | MGC8685 | 3.33E-05 | 2.8341 |
TACAAAACCATTTTTTC | 1439 | NCL | 0.001192 | 1.832258 | Ribonucleoprotein; rRNA metabolism
GAATCGGTTATACTCGG | 1535 | NDUFS5 | 0.000992 | 1.61035 | Oxidoreductase; Electron transport
TATTTTAAATGCCACCT | 437 | esc03/nt_007741 | 0.010342 | 1.726975 |
CGAACAAAAGACTTCGG | 967 | esc04/nt_011875 | 0.004608 | 2.152867 |
TTTTCTTCTTTGGCTTG | 458 | esc07/nt_016354 | 0.008492 | 2.034658 |
GAAAGAAAGAGAGGAAA | 1164 | esc05/nt_022517 | 8.33E-06 | 2.970467 |
AAAAGAAACTTGTGCTT | 2314 | PABPC1 | 0.003317 | 1.834633 | Nucleic acid binding; Pre-mRNA processing
AATACTTTTGTATTGCT | 871 | PAI-RBP1 | 0.001383 | 1.927425 | Other RNA-binding protein; Interacts with chromatin-remodeling factor CHD3; Biological process unclassified
CAATAAATGTTCTGGTT | 9202 | RPL37 | 8.33E-06 | 1.687508 | Ribosomal protein; Protein biosynthesis
TCTGTACACCTGTCCCC | 6678 | RPS11 | 0 | 2.886167 | Ribosomal protein; Protein biosynthesis
GCTTTTAAGGATACCGG | 6329 | RPS20 | 0 | 2.0331 | Ribosomal protein; Protein biosynthesis
TAATAAAGGTGTTTATT | 8849 | RPS8 | 0.000367 | 1.31705 | Ribosomal protein; Protein biosynthesis
TATCAATATTCACTTGA | 743 | SFRP1 [3] | 0.0005 | 2.410908 | Other signalling molecule; Ligand-mediated signalling; Angiogenesis->Frizzled-Related Protein; Wnt signalling pathway->secreted frizzled-related protein
CTGTCATTTGTAATATG | 1121 | SFRS3 | 0.004692 | 1.452883 | Molecular function unclassified; mRNA splicing
TGATAGTCTGAAATATG | 507 | Similar to KIAA1606 | 0.005508 | 2.1248 | Molecular function unclassified; Biological process unclassified
TACATTTTCATATTAGA | 1632 | SNRPG [1] | 0 | 2.0292 | mRNA splicing factor; mRNA splicing
CCGCCTCCGGGAATGAG | 1626 | SNRPN | 0 | 2.469467 | mRNA splicing factor; mRNA splicing
TATATATTTGAACTAAT | 483 | SOX2 [2] | 0.005608 | 2.029417 | HMG box transcription factor; Nucleic acid binding; mRNA transcription regulation; Neurogenesis
TATAATCTTTATGGCTT | 601 | SSBP1 | 0.006525 | 1.746742 | Single-stranded DNA-binding protein; DNA replication; DNA repair; DNA replication
ATAAAGTAACTGGTTTG | 897 | STRAP | 0.001392 | 1.642317 | Other miscellaneous function protein; Receptor protein serine/threonine kinase signalling pathway
TTATAATATAATGTTTT | 484 | TMEFF1 | 0.005608 | 2.120967 | Surfactant; Biological process unclassified
CAGTCTAAAATGCTTCA | 1763 | UCHL1 | 0 | 3.316525 | Cysteine protease; Proteolysis
GGTTTGGCTTAGGCTGG | 2384 | UQCRH | 0.000417 | 1.142608 | Reductase; Oxidative phosphorylation

Table 32 Legend:
Italicized tags: 3 or more genomic hits
TGATAGTCTGAAATATG: 512 genomic hits
1. Downregulated upon differentiation (Brandenberger et al., 2004b; Richards et al., 2004; Sato et al., 2003)
2. POU5F1 target/cofactor/coexpressed
3. Embryonic developmental pathway genes

Table 33 Differentially expressed in hESC versus endoderm-derived libraries only. Legend is provided below for table explanations.
Meta sequence | Count | Gene symbol | Mean P-value | Mean ln(ratio) | Gene Ontology
AAAATAAAGAGCCATAG | 963 | APEX1 | 0.00825 | 1.638917 | Transcription cofactor; Exodeoxyribonuclease; Endodeoxyribonuclease; Lyase; DNA repair
GCAAAACCAGCTGGTGG | 891 | CCT8 | 0.006408 | 1.631908 | Chaperonin; Protein folding
GGCTGCCCTGGGCAGCC | 860 | DPYSL3 | 0.000592 | 2.24465 | Other hydrolase; Nucleoside, nucleotide and nucleic acid metabolism; Axon guidance mediated by semaphorins->CRMP 3-associated molecule; Axon guidance mediated by semaphorins->Collapsin response mediator protein
TCCTCAAGATAAAGTCT | 1170 | ERH [1] | 0.000192 | 1.716958 | Other transcription factor; mRNA transcription regulation
TTGTAAACTTAAGTGGC | 976 | FKBP3 | 0.000583 | 1.7755 | Other isomerase; Protein folding; T-cell mediated immunity; Other neuronal activity
ATGGCAAGGGACAAAGC | 1025 | FLJ35696 | 0.005483 | 1.260883 | Molecular function unclassified; Biological process unclassified
CTTTATGTGATAGTATT | 609 | HDAC2 [1] | 0.008425 | 1.652225 | Transcription factor; Nucleic acid binding; Deacetylase; mRNA transcription regulation; Chromatin packaging and remodeling; Protein modification; Cell cycle control; Wnt signalling pathway->Histone deacetylase; p53 pathway->Histone deacetylase 1
GAAATTTAAAGCAGGTT | 2057 | HMGB1 [2] | 0.000108 | 2.405842 | HMG box transcription factor; Chromatin/chromatin-binding protein; Chromatin packaging and remodeling; p53 pathway->High mobility group protein 1
CCAGGAGGAATGCCTGG | 2766 | HSPA8 | 0.000475 | 1.401025 | Hsp 70 family chaperone; Protein folding; Protein complex assembly; Stress response; Apoptosis signalling pathway->Heat shock protein 70; Parkinson disease->Heat shock protein 70
TCAACTTCTGGCTCCTC | 871 | endo01/nt_009714 [3] | 0.0002 | 2.193617 |
GATTTCCTTGAAGCAGG | 730 | endo02/nt_022853 [3] | 0.000775 | 2.290733 |
GGCTGATTTTTATTACC | 971 | endo03/nt_023133 [3] | 0.000925 | 2.011842 |
TGTACTACTTAAGTTTA | 685 | PAICS | 0.0023 | 1.960483 | Synthase; Ligase; Purine metabolism
AATTTTATTTCTGTTTG | 844 | PCBP1 | 0.00435 | 1.545675 | Ribonucleoprotein; mRNA polyadenylation; mRNA end-processing and stability
CTTATTTGTTTTAAAAC | 575 | PLS3 | 0.006108 | 1.794983 | Non-motor actin binding protein; Other developmental process; Other oncogenesis; Cell structure
GGAATCCAATCTGTTGC | 665 | PTTG1 [1] | 0.003167 | 1.986775 | Other transcription factor; Nucleic acid binding; DNA repair; mRNA transcription regulation; Cell cycle control; Oncogene
TTGAATTTGTTTGTTAG | 434 | RBPMS | 0.0145 | 1.525408 | Nuclease; Biological process unclassified
GAATTAACATTAAACTT | 2254 | YWHAE | 0.001842 | 1.779142 | Other miscellaneous function protein; Non-vertebrate process; EGF receptor signalling pathway->14-3-3; FGF signalling pathway->14-3-3; PI3 kinase pathway->14-3-3; Parkinson disease->14-3-3; p53 pathway->14-3-3 sigma; p53 pathway->14-3-3
TTGAAGCTTTAAGAACT | 1 | CXCL2 [4] | 0.014108 | -4.34608 | Chemokine; Cytokine and chemokine mediated signalling pathway; Calcium mediated signalling; NF-kappaB cascade; Ligand-mediated signalling; T-cell mediated immunity; Macrophage-mediated immunity; Granulocyte-mediated immunity; Cell proliferation and differentiation; Cell motility
AGCAGTGACGGATAGTT | 1 | EVA1 [4] | 0.0153 | -4.11228 | Cell adhesion molecule; Cell adhesion
GATGAATCCGGGGTATG | 1 | LOC120224 [4] | 0.01845 | -4.31724 |
TGTGGGAAATCCTGCGT | 1 | SLPI [4] | 0.014917 | -4.54698 | Serine protease inhibitor; Biological process unclassified

Table 33 Legend:
Italicized tags: 3 or more genomic hits
1. Downregulated upon differentiation (Brandenberger et al., 2004b; Richards et al., 2004; Sato et al., 2003)
2. POU5F1 target/cofactor/coexpressed
3. Candidate novel regulators of endoderm differentiation
4. Genes down-regulated in hESC
A number of genes have been shown by various groups to be both highly expressed in ES and rapidly down-regulated upon differentiation. These include POU5F1, DNMT3β, DPPA4, HMGA1, and TDGF1 (Richards et al., 2004; Sato et al., 2003; Sperger et al., 2003), all of which were over-expressed in the ES SAGE libraries versus all normal adult and fetal libraries available from CGAP (72 libraries) (Appendix 3e) in an additional experiment (Table 27). Many of the genes found to be most significantly over-expressed in ES are in some way related to or regulated by POU5F1, a well established regulator of self-renewal in both human and mouse ES. HMGA1 is a downstream target of POU5F1 that has been implicated in cancer metastasis (Chuma et al., 2004) and in changes in chromatin structure linked to the regulation of gene expression (Harrer et al., 2004). HMGA1 may serve to regulate the transcription of the molecular machinery involved in ES cell proliferation. The spatial and temporal expression of POU5F1 is thought to be regulated by its methylation status (Fuhrmann et al., 1999). Given the co-expression of DNMT3β in ES, this particular methyltransferase is a likely candidate for the epigenetic regulation of POU5F1 and possibly of many other genes involved in ES self-renewal. DNMT3β is important for embryonic development and is purported to have a role in cancer cell survival and gene silencing (Beaulieu et al., 2002; Rhee et al., 2002). Methyltransferases work together with histone deacetylases to suppress gene expression by epigenetically modifying chromatin according to the histone code hypothesis (Cheung et al., 2000).
The histone code hypothesis proposes that histone proteins and their modifications can lead to inherited differences in transcriptional activity and silencing (Jenuwein and Allis, 2001). The observation of significantly higher expression of DNMT3B and the histone deacetylase HDAC2 (Table 33) in hESC suggests there may be coordination between the actions of these genes. In addition to POU5F1, other early developmental signals, for example those provided by the TGFβ1-Nodal pathway, may be critical for ES maintenance. In particular, TDGF1 over-expression may be necessary for ES cell proliferation.

This analysis uncovered two tags that did not map to a known transcript (CGAP Best Tag Mapping) but each mapped to a single genomic location (referred to as esc01 and esc02). Further investigation of the tags using Ensembl Multi BlastView (http://www.ensembl.org) yielded information about the regions surrounding the tag sequence. Esc01 mapped to chromosome 4q25 and was antisense to predicted transcripts bearing similarity to a gene of unknown function, HDCMA18P (Figure 23). The zebrafish homologue of HDCMA18P functions in nuclear mRNA splicing (annotated by GeneOntology terms). Upstream of esc01 and within 500 bp are several microRNA sequences. Many miRNA genes are found in clusters in the genome and are presumed to be expressed from the same primary transcript (Wienholds et al., 2005). Based on the neighbourhood of expressed genes surrounding esc01, one might predict that the tag sequence could be derived from a pri-microRNA sequence that could regulate HDCMA18P gene expression in cis or various other molecular targets in trans. MicroRNAs, which may primarily function in gene repression, are indicated to play a role in certain developmental processes, such as brain morphogenesis in zebrafish embryos (Giraldez et al., 2005), and may be enriched in embryonic stem cells. In support of this hypothesis, Dicer, a double-stranded RNA-specific endoribonuclease essential for the generation of miRNA (Bartel, 2004), was detected across all of the ESC SAGE libraries. Additionally, Dicer expression was nearly absent in differentiated tissues. One might additionally hypothesize that HDCMA18P, which was expressed mainly in hESCs and at low levels (fewer than approximately 10 tags per 200,000), may be tightly regulated by the hypothetical microRNA from which esc01 was derived. If HDCMA18P preferentially processes a suite of genes, e.g., genes associated with a differentiated phenotype, the miRNA gene potentially detected by esc01 could be necessary for pluripotency.

Figure 23 Ensembl BlastView of sequence esc01 (chromosome 4, positions 113,924,500-113,928,000, 4.02 Kb window; tracks include the esc01 Blast hit, Ensembl and Genscan transcripts, and annotated miRNA features).

Figure 24 depicted the Ensembl BlastView result for the alignment of tag sequence esc02 to chromosome 8q24.3.
The surrounding sequence revealed that the long SAGE tag resided in the 3' end of a human cDNA sequence similar to forkhead box H1 (FOXH1). The FOXH1 gene is homologous to the 5' region of the cDNA sequence, suggesting that the transcript identified by esc02 may be an alternative transcript of the FOXH1 gene that differs at its 3' most end. FOXH1 is a transcription factor and downstream target of the TGFβ-Nodal signalling pathway. Together with other components of the Nodal pathway, FOXH1 is thought to co-regulate embryonic mesodermal morphogenesis (Hart et al., 2005). Recall that TDGF1, a pathway agonist, was also up-regulated in the hESC metalibrary in comparison to 72 n-CGAP libraries (Table 26), while Lefty1, a TGFβ antagonist, was up-regulated in ES compared to ectoderm and mesoderm libraries (Tables 28 and 30). Lastly, a transcriptional target of Nodal signalling (ID1) was also up-regulated in hESC versus all mesoderm libraries (Table 31). Several agonists and antagonists of the TGFβ/Nodal pathway are over-expressed in human ES versus differentiated libraries (see Figure 10 and Tables 12-14, Chapter 2.3.4). This result is consistent with the idea that ES cells subscribe to a "Just in case" philosophy (Golan-Mashiach et al., 2005), expressing a suite of tightly regulated genes involved in directing differentiation and maintaining the desired phenotype; thus undifferentiated ES are primed to exhibit developmental plasticity when supplied with the appropriate developmental cue. The role of TGFβ-Nodal signalling in hESC maintenance remains unclear, but over-expression of the pathway constituents suggests that it is important to ES biology.

Figure 24 Ensembl BlastView of sequence esc02. Esc02 localizes to the 3' most end of a predicted transcript similar to FOXH1. Note that the transcript is a probable alternate isoform of the FOXH1 gene. [Genome browser image: chromosome 8, forward and reverse strands, approximately 145,668,000 to 145,671,500 (4.02 Kb), showing the esc02 BLAST hit, Genscan and EST-based transcript predictions, the Ensembl known transcripts KIFC2 and FOXH1, and the cDNA BC051376.1 (Homo sapiens mRNA similar to forkhead box H1, cDNA clone IMAGE: 6650515; EMBL: BC051376.1).]

3.4 Conclusions

A caveat of this analysis is that comparing hESC with adult normal tissues does not allow the observed changes in expression to be related to early embryonic differentiation, which limits any discussion of the significance of the differentially expressed gene sets. More informative comparisons have been reported by other groups, which compared ES lines to their immediately differentiated progeny (e.g., embryoid bodies) or to specific embryonic lineages derived from undifferentiated ES (Brandenberger et al., 2004a; Brandenberger et al., 2004b). This analysis did identify a set of 22 genes that were significantly more abundant in multiple u-hESC lines compared to 72 adult and fetal differentiated samples. Known genes such as POU5F1 and DNMT3B were expressed at significantly higher levels in u-hESCs compared to all n-CGAP samples. Among the list of "up-regulated" genes were characterized genes that have not been previously implicated in stem cell maintenance; these genes will be of interest as new candidate markers of the undifferentiated and pluripotent state.
The candidate tags esc01 and esc02 were more than 3-fold more abundant in u-hESCs than in any n-CGAP library. Future analysis of these candidates, such as GLGI and functional studies, will be necessary to determine if these tags correspond to uncharacterized markers of human pluripotent cells. The work presented in this Chapter may define a core set of ES maintenance genes, including both previously defined genes and genes novel to what is currently known in human ES biology.

4. Computational approach for the identification of candidate novel genes in undifferentiated hESC SAGE libraries

Contributions

The development of an approach for the identification of novel genes and computational analyses were completed by Angelique Schnerch (BCCA GSC and the Department of Medical Genetics, UBC). Portions of this Chapter were used for a submitted publication (Hirst et al., submitted).

4.1 Introduction

Embryonic stem cell genes are largely underrepresented among EST and cDNA databases, although a handful of gene expression studies of various hESC lines have been performed (Brandenberger et al., 2004a; Brandenberger et al., 2004b; Carpenter et al., 2004; Richards et al., 2004; Sato et al., 2003; Sperger et al., 2003). I hypothesize that novel genes remain to be discovered in human embryonic stem cell lines. These genes may function in maintaining the hESC undifferentiated state. Multiple global gene expression profiles generated from human embryonic stem cell lines will provide the necessary resources to identify novel genes. The use of several hESC lines from different providers, together with deep sampling of each transcriptome using serial analysis of gene expression, represents a comprehensive approach.
The SAGE Library Production Group (BCCA GSC; http://www.bcgsc.ca/) constructed 11 long SAGE libraries from the RNA of 8 NIH approved cell lines (WA01, WA07, WA09, WA13, WA14, ES03, ES04, BG01), providing a resource for novel gene discovery (see Table 3, Chapter 2.2.3 for a listing of the hESC data generated at the BCCA GSC). The first aim of this analysis was to identify tags enriched in hESC long SAGE libraries (hESC metalibrary). Identification of enriched tags was accomplished by pooling the data to construct a metalibrary and then comparing the metalibrary to publicly available human SAGE libraries constructed from various tissue and cell types. Tags found exclusively in the hESC metalibrary may lead to the discovery of candidate novel genes or alternatively spliced transcripts. Alternatively, tags exclusive to hESC may be derived from known transcripts unrepresented by the pool of tissues and cell types selected for comparison to the hESC lines.

The second aim of this analysis was to generate a computational method for the selection of candidate tags for novel gene discovery (Figure 25). I hypothesized that tags that map to a human genomic sequence outside the vicinity of a known transcript may provide likely candidates for novel genes. To select such tags, I generated two in silico tag-to-sequence mapping databases. The first database of in silico tags was generated from human genomic sequences suspected of being orthologous to mouse and/or rat genomic sequences. A common approach to novel gene prediction is to determine sequence conservation across multiple species; such conservation is thought to correlate with possible functionality of an uncharacterized sequence of interest (Kellis et al., 2004). The second database contains tags derived in silico from sequences that lie within 2 kb downstream of the 3' untranslated region (UTR) of known transcripts. Approximately 30% of Ensembl genes do not have an annotated 3' UTR.
CMOST mappings to Ensembl transcription units attempt to account for the 3' UTRs of all Ensembl genes. The addition of 1000 bp to each sequence approximates the UTR based on an estimated average 3' UTR length of 600 bp (Zhang, 1998). The artificial UTR is added regardless of whether a transcript has a previously annotated UTR, which can overestimate the UTR region for several genes. CMOST may also underestimate the UTR length, as in the case of the Fukutin-related protein with a 3' UTR length of over 1500 bp (Brockington et al., 2001), necessitating the mapping of tags to longer regions 3' of the polyadenylation site of genes. The purpose of mapping tags to both CMOST Ensembl transcription units and estimated 3' UTR sequences was to capture hESC tags that mapped in close proximity to a known gene. I made the assumption that a tag in close proximity to the 3' end of a gene is most likely derived from the 3' UTR. A potential concern regarding this assumption was that tags mapping to overestimated 3' UTRs could belong to the 5' region of a downstream gene or to an intergenic genomic sequence.

Figure 25 The computational approach for the selection of candidate long SAGE tags to detect novel transcripts in hESC.
STEP 1: 20,047 tag types.
STEP 2: (2A) CMOST best mapping; (2B) embryonic EST mappings; (2C) mouse CMOST/CGAP mappings. Results: genomic 1,024 (961 unambiguous); embryonic ESTs 62; known transcript 3,365; mouse mapping 1,443; unmapped 14,153.
STEP 3: Mapping to human-mouse-rat orthologous sequences. Results: mapped 499; unmapped 524.
STEP 4: Mapping to "enhanced" 3' UTR (2 kb). Results: < 2 kb downstream from an Ensembl transcript 38; > 2 kb downstream from an Ensembl transcript 461.
STEP 5: Tag quality filtering (P-value < 0.05) (Siddiqui et al., submitted). Result: 301 hESC candidate tags.
I also assumed that tags mapping further than 2 kb from a known gene fell into an intergenic genomic sequence and may originate from a novel gene [8]. Both databases provide resources for the selection of candidate hESC long SAGE tags that may correspond to a novel gene, based on sequence conservation across multiple species and location in a genomic region lacking previous expression data.

[8] Exceptional cases with UTR lengths outside of 2 kb exist (e.g., the human tissue inhibitor of metalloproteinase-3 (gb|U14394) has a 3,663 bp 3' UTR length). Makalowski, W., Zhang, J., and Boguski, M. S. (1996). Comparative analysis of 1196 orthologous mouse and human full-length mRNA and protein sequences. Genome Res 6, 846-57.

4.2 Methods

4.2.1 SAGE library acquisition and tag processing

CGAP 21mer and 14mer human SAGE libraries were downloaded from the SAGE Genie website (April 13, 2005; http://www.cgap.nci.nih.gov/SAGE) (Boon et al., 2002) (Appendix 4a).

4.2.2 SAGE tag processing

Long SAGE hESC tags were processed using the protocol described below:

1. I truncated seven base pairs from the 3' end of each long SAGE tag sequence (21mer) using "cut -c 1-10" (UNIX command line), resulting in an in silico generated 14mer library. (Tag sequences are stored without the 4 bp CATG anchor, as in Table 35, so the first ten characters of a stored 17-base sequence correspond to a 14mer.)
2. I produced one file containing all tag types found in a pooling of 11 hESC long SAGE libraries (hESC metalibrary) and a second file containing all tag types found in a pooled sample of 247 CGAP libraries (CGAP metalibrary). In each file respectively, tag types were labeled "HESC" or "CGAP" (tab delimited) (command: cat <filename> | awk '{print $1"\t""HESC"}').
3. The hESC metalibrary and CGAP metalibrary files were concatenated. All tags were enumerated from a sorted list of hESC and CGAP tags to determine if they occur in one or both metalibraries using the UNIX command "cat <hESC metalibrary> <CGAP metalibrary> | sort | uniq -w 10 -c". The command "cat" concatenates two or more files; "uniq -w 10" compares only the first ten characters of each line of a file; lastly, "-c" prefixes each line with a count of the number of times a unique tag type occurs in the concatenated file.
4. Tags that occurred exclusively in the hESC metalibrary were isolated using "grep -w 1" (isolating tags that occurred once in the concatenated metalibrary) followed by "grep -w HESC" (isolating tags that occurred only in the hESC metalibrary).

4.2.3 Comprehensive mapping of SAGE tags (CMOST)

Tags enriched in the hESC metalibrary were mapped using the DiscoverySpace (version 3.2.4) CMOST plug-in (Comprehensive Mapping of SAGE Tags Best Mapping Algorithm) described in Chapter 2.2.4.1. The CMOST best mapping strategy first identifies all of the tags that are derived from a previously characterized transcript before tags are mapped to the genome. This allowed me to parse the novel tags that mapped solely to uncharacterized regions of the human genome. The following conditions were imposed to parse tags that mapped to genes/genomic sequences from the CMOST mapping result: all single base-pair tag modifications were ignored, tag sequences mapping to a known gene were excluded, and tag sequences mapping to multiple genomic locations were excluded.

4.2.4 Tag mapping database construction

I extracted in silico 21 bp tags from three data sources:

1. A set of 148,453 EST sequences generated from undifferentiated hESC lines and their differentiated derivatives, described below (Brandenberger et al., 2004b).
a. GR_ES_: ESTs derived from a pooled sample of 3 undifferentiated hESC lines (u-hESCs) grown under feeder-free conditions (lines WA01, WA07, and WA09) (37,081 sequences)
b. GR_EB_: ESTs derived from u-hESCs differentiated to embryoid bodies (37,555 sequences)
c. GR_preNEU_: ESTs derived from u-hESCs differentiated to neuroectoderm-like cells (38,206 sequences)
d. GR_preHEP_: ESTs derived from u-hESCs differentiated to hepatocyte-like cells (35,611 sequences)
Sequences were downloaded using Entrez nucleotide (http://www.ncbi.nlm.nih.gov/; accession numbers CF227093-CF227275, CN255152-CN315425, CN331906-CN373615, CN385955-CN394390, CN394392-CN432241).
2. The UCSC multiple alignment format (MAF) version 1, comprised of multiple alignments of the Human July 2003 (hg16) genome, the Mouse February 2003 (mm3) assembly and the Rat June 2003 (rn3) assembly (http://hgdownload.cse.ucsc.edu/downloads.html#human). Multi-species homologous genomic regions were determined based on BLASTZ scores (http://www.ncbi.nih.gov/BLAST/) and stored in files according to human chromosomal location. Human and mouse alignments were formatted for SAGE tag extraction using ad hoc UNIX commands.
3. Intergenic regions of Ensembl transcripts. In this analysis, intergenic was defined as a region of the genome not occupied by a known gene. Specifically, the intergenic region was at least 2 kb beyond the 3' end of each Ensembl transcript. Mappings were generated for the 2 kb of sequence adjacent to the 3' end of each transcript. A hESC SAGE tag was defined as intergenic if it did not map to this data source. Sequences were obtained using Ensembl EnsMart (version 19.34a.1, NCBI genome assembly 34) (http://www.ensembl.org/).

For each sequence in a data source an NlaIII site was used to demarcate the start of a 21 bp or 14 bp [9] SAGE tag. Tags were extracted from all possible sites in a sequence. SAGE tags were extracted using the Perl script SAGE_tag_positions.pl (written by E. Pleasance) (Pleasance et al., 2003). Tag-to-gene mappings were completed using the UNIX command "join".

[9] 14 bp tags were only extracted from mouse orthologous genomic regions (UCSC Multiple Alignment Format).
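The in silico extraction described above can be sketched in a few lines. The following is only an illustration under stated assumptions and is not the SAGE_tag_positions.pl script: the function names are hypothetical, the tag is taken to begin at the CATG recognition site of NlaIII (so a 21 bp tag is CATG plus 17 bases), and scanning both strands is an assumption about how "all possible sites" was interpreted.

```python
# Minimal sketch of in silico SAGE tag extraction (hypothetical helper names).
ANCHOR = "CATG"  # NlaIII recognition site

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def extract_tags(seq, tag_len=21, both_strands=True):
    """List (strand, position, tag) for every NlaIII site followed by a full tag.

    position is the 0-based offset of the CATG on the scanned strand;
    sites too close to the sequence end to give a full-length tag are skipped.
    """
    strands = [("+", seq.upper())]
    if both_strands:  # assumption: tags may derive from either strand
        strands.append(("-", reverse_complement(seq)))
    tags = []
    for strand, s in strands:
        start = s.find(ANCHOR)
        while start != -1:
            tag = s[start:start + tag_len]
            if len(tag) == tag_len:
                tags.append((strand, start, tag))
            start = s.find(ANCHOR, start + 1)
    return tags


if __name__ == "__main__":
    example = "AACATG" + "ACGTACGTACGTACGTA" + "TT"  # toy sequence
    for strand, pos, tag in extract_tags(example):
        print(strand, pos, tag)
```

Calling extract_tags(sequence, tag_len=14) would give the short-tag equivalent used for the mouse orthologous regions; the resulting tag lists can then be joined to sequence identifiers to build a tag-to-sequence table.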
4.2.5 BLAST analysis

To provide theoretical (in silico) functional annotation of the candidate novel genes, I used the program blastall to run BLAST (Basic Local Alignment Search Tool) (Altschul et al., 1990) with selected human and mouse genomic sequences as queries against a database of 26,205 mouse RefSeq genes (Release 5, May 2004). The parameters I designated for the blastall program were as follows: "blastall -p <program> -d mouse.rna.fna -i <query file> -e 0.01", where <program> is blastn [10] or tblastx [11]. Here "-p" is the program name, "-d" is the formatted [12] database to BLAST against, "-i" is the set of FASTA formatted query sequences, and "-e" is the expectation value. BLAST results were parsed using parse_blast.pl (written by Erin Pleasance; BCCA GSC).

[10] The BLASTN program compares a nucleotide sequence against a nucleotide sequence database.
[11] The TBLASTX program compares the six-frame conceptual translation products of a nucleotide query sequence (both strands) against the six-frame translations of a nucleotide sequence database.
[12] formatdb should be used to format the FASTA databases. This must be done before blastall can be run locally.

4.2.6 Mouse tag to gene mapping

14mer mESC tags were mapped to CGAP best tag (http://cgap.nci.nih.gov/SAGE) (Boon et al., 2002) using the UNIX command "join". Mouse CGAP mappings (Mm_short.best_tag.gz) were downloaded from ftp://ftp1.nci.nih.gov/pub/SAGE/MOUSE/.

4.3 Results and Discussion

4.3.1 Selection of tags for the isolation of candidate novel genes

The hESC metalibrary was constructed by computationally pooling 11 long SAGE libraries made from 8 hESC lines. The metalibrary consisted of 2,613,475 total tags corresponding to 379,465 unique tag types (transcripts). To identify tags enriched across the hESC lines I compared the metalibrary to 247 publicly available human SAGE libraries downloaded from the CGAP SAGE Genie (April 13, 2003; http://www.cgap.nci.nih.gov/SAGE) (Boon et al., 2002).
The SAGE Genie website provides a resource for the retrieval and analysis of human and mouse gene expression data (Lai et al., 1999). The human CGAP libraries used in this analysis contained 15,683,409 tags representing 654,491 transcripts. Currently, the majority of publicly available SAGE libraries are short SAGE. Consequently, all long SAGE tags were truncated to short tags (extracted short SAGE) for direct comparison to the CGAP short SAGE libraries. The Venn diagram (Figure 26) depicts the tag sequences common to the hESC and CGAP metalibraries. These sequences were excluded to isolate tags found solely in hESCs (Step 1 of Figure 25).

Note: CGAP best tag mapping searches for the best tag for a gene.

Figure 26 Venn diagram listing the tag types for the hESC, cancer and normal SAGE library comparisons. We pooled 11 hESC libraries (green) for a total of 222,337 unique tag types (translates to 379,465 long tag types). The normal metalibrary (orange) is a pooling of 77 CGAP libraries totaling 410,583 tag types. The cancer metalibrary (blue) consisted of 170 CGAP libraries, yielding 578,798 tag types.

Figure 27 Distribution of absolute gene expression levels for the novel gene discovery candidate tags. The x-axis contains expression level bins (bin size = 1) and the y-axis is the frequency of tag types. [Plot: logarithmic y-axis from 1.0E-01 to 1.0E+05; x-axis labelled "Expression level"; annotation "18,907 tags".]

There were 18,700 extracted short SAGE tags enriched in hESCs. These tags were associated back to their appropriate long SAGE tag sequences, which resulted in 20,047 sequences for further analysis. Enriched tags were defined as those found only in the hESC data. Expression levels of the enriched tags were not normally distributed; I predominantly observed low abundance tags (shown in Figure 27).
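The selection of hESC-exclusive tags can be illustrated with a small set-based sketch. It is not the cut/sort/uniq/grep pipeline used in this work: the function name and the toy tag sequences are hypothetical, and tags are assumed to be stored without the CATG anchor, so that the first ten bases of a 17-base long tag give the short tag. The sketch also shows why the number of long tags recovered can exceed the number of exclusive short tags, since several long tags may share one short prefix.

```python
from collections import defaultdict


def hesc_exclusive_tags(hesc_long_tags, cgap_short_tags, short_len=10):
    """Map each short tag seen only in hESC to the long tags it was derived from.

    hesc_long_tags: iterable of long tag sequences (anchor removed; assumption)
    cgap_short_tags: iterable of short tag sequences from the comparison libraries
    """
    short_to_long = defaultdict(set)
    for tag in hesc_long_tags:
        # Truncate the long tag to its short-SAGE equivalent.
        short_to_long[tag[:short_len]].add(tag)
    exclusive = set(short_to_long) - set(cgap_short_tags)
    # Associate each exclusive short tag back to its long tag sequence(s).
    return {short: sorted(short_to_long[short]) for short in sorted(exclusive)}


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    hesc = [
        "AAAAAAAAAACCCCCCC",
        "AAAAAAAAAAGGGGGGG",
        "TTTTTTTTTTAAAAAAA",
        "GGGGGGGGGGTTTTTTT",
    ]
    cgap = ["TTTTTTTTTT", "CCCCCCCCCC"]
    result = hesc_exclusive_tags(hesc, cgap)
    n_long = sum(len(v) for v in result.values())
    print(len(result), "exclusive short tags ->", n_long, "long tags")
```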
Reports have previously defined the lower limit of gene expression reliably detectable by SAGE at 5 tags (Evans et al., 2002). As most highly abundant transcripts have previously been characterized I expected that candidates for novel transcripts would be expressed at low levels to have escaped discovery. A recent improvement in SAGE data processing implemented by Gene Expression Informatics at the BCCA GSC has included tag quality measures based on SAGE library construction/sequencing errors (Siddiqui et al., submitted). High quality singleton tags (e.g., defined by P-value < 0.05) have been used to resolve novel genes in the Mouse Atlas of Gene Expression data (http://www.mouseatlas.org) and our own hESC data that led to the successful generation of longer cDNA sequences from SAGE tags (Siddiqui et al., submitted; Hirst et al., submitted). Thus, SAGE can detect gene expression at the singleton or doubleton level and can provide evidence of bona fide gene expression where previous transcript information does not exist. Enriched tags were mapped to identify those tags that were generated from known transcripts and to select for those tags that mapped only to the genome or the set of embryonic EST sequences (Brandenberger et al., 2004b) (Appendix 4b) (Steps 2A and 2B of Figure 25). A proportion of tags enriched in hESCs were mapped to previously characterized transcripts because the CGAP SAGE libraries were not representative of all of the tissues/cell lines used to construct cDNA and EST sequencing projects. Table 34 summarizes the CMOST Best Mapping results and mappings to embryo-derived ESTs. Briefly, 3,365 tag types mapped to a known transcript/EST and 1,024 tag types mapped to a human genomic sequence of which 961 tags mapped to a single genomic location and were used for further analysis. From this analysis I found that the majority of tags were initially unmapped (15,615 tag types). 
The inability to map these tags may be attributed to various reasons such as sequence polymorphisms, tags spanning novel splice junctions, contaminating sequences (e.g., derived from MEFs), and experimental artifacts of the SAGE protocol and sequencing. It is also possible that the mapping resources are incomplete. Recently, an EST sequencing project was completed for undifferentiated hESC and its differentiated derivatives (embryoid bodies (EB), neural ectoderm-like cells (preNEU), and hepatocyte-like cells (preHEP)) (Brandenberger et al., 2004b). These sequences are not currently available for tag mapping using CMOST. Thus, in addition to CMOST I mapped enriched hESC tags to a tag-to-gene mapping database created from these embryonic EST sequences (Step 2B of Figure 25). A small collection of unmapped tags (19 tag types) were resolved by mapping to the above EST data set. In the metalibrary I found that 275 tags mapped to a mouse transcript (CGAP Best Tag; http://cgap.nci.nih.gov/SAGE) and 1,168 tags mapped to the mouse genome (Step 2C, Figure 25). The hESC library WA01(m) was cultured on matrigel as opposed to MEF; thus, mouse sequence contamination should not be observed. As expected, unmapped tags from this library did not map to a mouse sequence (data not shown).

Table 34 CMOST Best Mapping results for 20,047 novel hESC tags. Listed below are the data source and tag types for sense and antisense tags. The percentage of total novel tag types (20,047 tags) per tag mapping database is shown in brackets.

Database                                      Sense types    Antisense types    Types (all)
MGC                                           268 (1.34)     208 (1.04)         476 (2.38)
RefSeq                                        328 (1.64)     185 (0.93)         513 (2.56)
Ensembl Transcript (ENST)                     89 (0.45)      58 (0.29)          147 (0.74)
Non-protein coding                            0              1 (0.01)           1 (0.01)
Ensembl EST (ENSESTT)                         299 (1.5)      145 (0.73)         444 (2.22)
Ensembl Transcription Unit (ENSG)             1140 (5.69)    644 (3.22)         1784 (8.9)
All mappings to transcripts/EST sequences                                       3365 (17.8)
Genomic (UCSC Human Genome)                                                     1,028 (5.13)
Unambiguous genomic mappings                                                    965 (4.81)
Embryonic ESTs                                36 (0.18)      22 (0.11)          58 (0.30)
Unmapped                                                                        15,596 (77.80)
Total                                                                           20047

Tags selected for further analysis included sequences that unambiguously mapped to a genomic location and sequences that mapped to embryonic ESTs (1,023 tag types) (Step 2, Figure 25 and Table 34). The complete list of mappings to CMOST, ES-derived EST sequences, and multiple alignment sequences are found in Appendix 4c. The collection of 1,023 hESC enriched tags that mapped to genomic sequences and embryonic ESTs were mapped to human sequences with known orthology to mouse and/or rat sequences as determined by the UCSC Multiple Alignment Format (MAF) (Step 3, Figure 25). Figure 25 also illustrates the loss of hESC-enriched tag types as filters were imposed to define a set of candidate novel transcripts. I isolated 499 sequences that mapped to the MAF tag-mapping database. I then applied the criteria that these tags should not map to 3' flanking regions ("enhanced" 3' UTRs) defined from known Ensembl transcripts (Step 4, Figure 25) and that the hESC-enriched tags should additionally have a significant p-value (α ≤ 0.05) (Step 5, Figure 25). P-values were calculated based on SAGE tag sequence quality per base pair and the library construction error calculated on a per library basis (Siddiqui et al., submitted). After applying these constraints, 301 tags satisfied these criteria (Appendix 4d). Roughly 1/6th of the tags (56 tags) were present in more than one cell line (Appendix 4e).
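The successive filters of Steps 3 to 5 can be pictured as a simple funnel. The sketch below is a hedged illustration only: the record fields (in_maf, distance_bp, p_value), the function name and the toy records are all hypothetical, and the strict "less than 0.05" comparison follows the Figure 25 label rather than any documented implementation.

```python
def select_candidates(tags, min_distance_bp=2000, alpha=0.05):
    """Apply the three filters in order and record how many tags survive each.

    Each tag is a dict with hypothetical keys:
      in_maf      - True if the tag maps to a human-mouse-rat conserved region
      distance_bp - distance downstream of the nearest known transcript 3' end
      p_value     - tag quality p-value
    """
    funnel = {"input": len(tags)}

    conserved = [t for t in tags if t["in_maf"]]
    funnel["maf_mapped"] = len(conserved)

    # Keep tags lying beyond the 2 kb "enhanced" 3' UTR of known transcripts.
    distal = [t for t in conserved if t["distance_bp"] > min_distance_bp]
    funnel["beyond_enhanced_utr"] = len(distal)

    passed = [t for t in distal if t["p_value"] < alpha]
    funnel["quality_passed"] = len(passed)
    return passed, funnel


if __name__ == "__main__":
    # Toy records, invented for illustration only.
    toy = [
        {"tag": "t1", "in_maf": True, "distance_bp": 5000, "p_value": 0.01},
        {"tag": "t2", "in_maf": True, "distance_bp": 500, "p_value": 0.01},
        {"tag": "t3", "in_maf": False, "distance_bp": 9000, "p_value": 0.001},
        {"tag": "t4", "in_maf": True, "distance_bp": 3000, "p_value": 0.20},
    ]
    candidates, counts = select_candidates(toy)
    print([t["tag"] for t in candidates], counts)
```

Recording the count after each filter mirrors the way Figure 25 reports the loss of tag types at each step.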
Particularly in the case of singleton/doubleton tags in an individual library, expression in two or more libraries further substantiated the likelihood that the tag was observed in the mRNA sampled as opposed to being introduced by some artifact of the technology. The stringent criteria that high quality hESC novel tags map to sequences conserved in multiple species, lie distal to known genes, and are observed in 2 or more hESC libraries further reinforce the likelihood that the tags were derived from novel genes.

4.3.2 Mouse annotation of hESC tags

I annotated our selections of the "best" candidates for novel gene discovery across multiple hESC lines using a computational approach illustrated in Figure 28. The multiple alignment format mappings can infer potential functionality because sequences conserved in multiple species are often coding or regulatory sequences under selective pressure to be maintained in the genome. By using the human and corresponding mouse genomic sequences as queries for TBLASTX and BLASTN respectively, I can identify known mouse transcripts in regions of interest (Figure 28B, example 1 and 2). This strengthens the argument that sequence conservation correlates with functional conservation. I additionally annotated hESC novel tags with short SAGE data generated from an R1 mouse embryonic stem cell (mESC) line (GEO accession: GSM580; http://www.ncbi.nlm.nih.gov/projects/geo) (Anisimov et al., 2002). I hypothesize that the observation of regions mapped by novel hESC and mESC tags may be indicative of a novel gene central to mammalian embryonic stem cell biology (Figure 28B, example 1). The mESC short SAGE library consisted of 137,906 total tags, which represented 44,570 unique tag types. A caveat of the 14 bp tag length is the inability to unambiguously map to genomic sequence. For this reason, I mapped mESC tags to mouse transcripts orthologous to candidate human genomic regions.
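Restricting short-tag mapping to a small set of candidate transcripts can be sketched as a dictionary lookup that separates unambiguous from ambiguous assignments. This is an illustration only, not the "join"-based procedure used in this work: the function name is hypothetical, the two unambiguous examples reuse tag, count and RefSeq accession values listed in Table 35, and the second transcript identifier attached to the third tag is invented purely to show the ambiguous case.

```python
def map_short_tags(observed_counts, tag_index):
    """Assign observed short tags to candidate transcripts.

    observed_counts: {tag sequence: count in the mESC library}
    tag_index: {tag sequence: set of candidate transcript identifiers}
    Returns (unambiguous, ambiguous); tags absent from the index are dropped.
    """
    unambiguous, ambiguous = {}, {}
    for tag, count in observed_counts.items():
        hits = tag_index.get(tag, set())
        if len(hits) == 1:
            unambiguous[tag] = (next(iter(hits)), count)
        elif len(hits) > 1:
            ambiguous[tag] = (sorted(hits), count)
    return unambiguous, ambiguous


if __name__ == "__main__":
    index = {
        "TGGCTCGGTC": {"NM_009609.2"},  # Actg1 (Table 35)
        "GCCCTATGCT": {"NM_023047.2"},  # Dpysl5 (Table 35)
        # Second identifier below is hypothetical, added to show ambiguity.
        "AAAATAAAAA": {"NM_175509.3", "HYPOTHETICAL_TRANSCRIPT"},
    }
    counts = {"TGGCTCGGTC": 364, "GCCCTATGCT": 2, "AAAATAAAAA": 4, "CCCCCCCCCC": 1}
    unique_hits, multi_hits = map_short_tags(counts, index)
    print(unique_hits)
    print(multi_hits)
```

Limiting the index to transcripts orthologous to the candidate regions keeps the number of ambiguous 10-base matches small, which is the motivation given in the text for not mapping the short tags directly to the genome.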
Figure 28 The computational method for annotating hESC candidate tags. Candidate tags used in this analysis are orthologous to mouse and/or rat genomic sequences and map > 2 kb downstream of a known transcript. A. The candidate regions of orthology consist of human and mouse genomic sequences identified by a hESC novel tag. Homology to mouse RefSeq transcripts was determined by BLAST analysis of human and mouse genomic sequences using the programs TBLASTX and BLASTN respectively. Mouse transcripts in a region orthologous to a hESC human tag mapping were isolated and further used to map an R1 mouse embryonic stem cell line (mESC) short SAGE library. B. Examples of annotated hESC candidate tags and the number of hESC novel tags annotated.

Panel A (flow diagram): hESC candidate tags; candidate regions of orthology (Multiple Alignment Format); extract orthologous human genomic sequences and extract orthologous mouse genomic sequences; blastall against mouse RefSeq transcripts (TBLASTX for human sequences, BLASTN for mouse sequences); candidate mouse transcripts; map to R1 mESC tags; annotation of hESC candidate tags.

Panel B (schematic examples):
Example 1: hESC tag mapping to a genomic region > 2 kb downstream of an Ensembl transcript; supported by a mouse transcript containing an R1 mESC tag and rat sequence conservation (7 tag types).
Example 2: hESC tag mapping to a genomic region > 2 kb downstream of an Ensembl transcript; supported by a mouse transcript and rat sequence conservation (59 tag types).
Example 3: hESC tag mapping to a genomic region > 2 kb downstream of an Ensembl transcript; supported by mouse and rat sequence conservation (235 tag types).

The first analysis aligned the mouse MAF sequences (corresponding to candidate regions in the human genome in which a novel hESC tag localized) to mouse RefSeq transcripts.
For each query, the best BLAST hit with a minimum of 99% sequence identity, an HSP (High-scoring Segment Pair) length of at least 200 bp, and an E-value of 0.0001 or less was retained. Table 35 is a subset of the highest scoring alignments (see Appendix 4f for the full table). Using the described requirements, I isolated 32 transcripts annotating the same number of hESC tags. mESC tags localized to 7 of the highest scoring mouse transcripts (Table 36). These 7 mouse transcripts and the corresponding mESC and hESC tags represent novel candidates in which the human tag identified a genomic region with an orthologous mouse transcript that was additionally identified by a mouse tag (illustrated in Figure 28B, example 1). Candidate novel hESC tags were termed nESC01-nESC07. The transcripts corresponded to predicted cDNAs and the following genes: Gm704 (nESC06), Dpysl5 (nESC03), and Actg1 (nESC07). The gene model transcript, Gm704, is similar to the fidgetin-like 1 gene, which is a member of a superfamily of ATPases associated with a wide variety of cellular activities, including membrane fusion, proteolysis, and DNA replication. The mouse gene dihydropyrimidinase-like 5, Dpysl5, is thought to play a role in growth cone guidance during neural development. Lastly, Actg1 is a cytoplasmic gamma-actin, which functions in sarcomere organization. The remaining four transcripts mapped by a mESC tag were functionally uncharacterized cDNAs or predicted cDNAs.

Table 35 Top 25 BLASTN hits (mouse MAF genomic regions against mouse RefSeq transcripts). Listed in descending order of BLASTN score are: hESC tag sequence, count, location of the human MAF sequence and start site, mESC tag sequence, count, mouse RefSeq hit, and score.
Columns (separated by semicolons): hESC meta tag; Count; Human genomic sequence; mESC tag to gene; Count; Hit; Score

TACTTTGAACCGAGGAG; 1; chr15 (start: 82576680); n/a; n/a; gi|20467422|ref|NM_139001.1| Mus musculus chondroitin sulfate proteoglycan 4 (Cspg4), mRNA; 7027
AATCGGTCACACCAGCC; 1; chr6 (start: 11243073); n/a; n/a; gi|51767667|ref|XM_488673.1| PREDICTED: Mus musculus cDNA sequence BC024659 (BC024659), mRNA; 4918
TACCACTCTCTGTATGG; 1; chr2 (start: 214068623); n/a; n/a; gi|31544068|ref|NM_011770.2| Mus musculus zinc finger protein, subfamily 1A, 2 (Helios) (Zfpn1a2), mRNA; 4117
AATCCCCCCGCCCCCTC; 2; chr3 (start: 32903559); n/a; n/a; gi|51765137|ref|XM_356199.2| PREDICTED: Mus musculus similar to abnormal cell LINeage LIN-41, heterochronic gene; Drosophila dappled/vertebrate TRipartite Motif protein related; B-box zinc finger, Filamin and NHL repeat containing protein (123.8 kD) (lin-41) (LOC382112), mRNA; 2882
TAGGTGGCCCTGTCTCC*; 1; chrX (start: 52138590); TGGCTCGGTC; 364; gi|48762677|ref|NM_009609.2| Mus musculus actin, gamma, cytoplasmic 1 (Actg1), mRNA; 2577
AGTGCGGAGTCCCCTTC; 4; chr17 (start: 1379996); n/a; n/a; gi|40254309|ref|NM_177182.3| Mus musculus RIKEN cDNA A830053O21 gene (A830053O21Rik), mRNA; 2393
AGTTGGGGTCTTGGGGA; 1; chr2 (start: 208891744); AAAATAAAAA; 4; gi|40254301|ref|NM_175509.3| Mus musculus RIKEN cDNA 9430067K14 gene (9430067K14Rik), mRNA; 2313
TCATCGCTTTAATACTG; 1; chr12 (start: 50497423); GGCCCCCACA; 16; gi|51829921|ref|XM_204880.3| PREDICTED: Mus musculus similar to fidgetin-like 1 (LOC278718), mRNA; 2250
CATAACCCAGGAAACAT; 1; chr2 (start: 166737128); n/a; n/a; gi|31712027|ref|NM_153409.2| Mus musculus RIKEN cDNA A330102K23 gene (A330102K23Rik), mRNA; 2024
GTTTATTAGTCTGGATT; 1; chr8 (start: 49025333); n/a; n/a; gi|31543917|ref|NM_023585.2| Mus musculus ubiquitin-conjugating enzyme E2 variant 2 (Ube2v2), mRNA; 1885
TCAAATCTCAGAGCATC; 1; chr7 (start: 98845093); n/a; n/a; gi|47106055|ref|NM_007481.2| Mus musculus ADP-ribosylation factor 6 (Arf6), mRNA; 1844
TGTGGCAGCTGGTGGAA*; 1; chr2 (start: 233482950); n/a; n/a; gi|40254535|ref|NM_021306.2| Mus musculus endothelin converting enzyme-like 1 (Ecel1), mRNA; 1633
GATAGGAACTCTTCCTG*; 1; chr22 (start: 25388488); n/a; n/a; gi|51711414|ref|XM_489103.1| PREDICTED: Mus musculus RIKEN cDNA A230057G18 gene (A230057G18Rik), mRNA; 1550
CACGGCACACACAGGCA; 1; chr7 (start: 71212286); n/a; n/a; gi|10946703|ref|NM_021371.1| Mus musculus calneuron 1 (Caln1), mRNA; 1306
CGATTTCTTAGAGAGAT; 1; chr2 (start: 42692470); n/a; n/a; gi|23943841|ref|NM_153512.1| Mus musculus potassium voltage-gated channel, subfamily G, member 3 (Kcng3), mRNA; 1241
CCTCGAGGGCACCGCGG; ; chr22 (start: 40547832); n/a; n/a; gi|51829829|ref|XM_139488.2| PREDICTED: Mus musculus hypothetical LOC207939 (LOC207939), mRNA; 1183
ATGTAGACAAAATTAGC; 1; chr3 (start: 57082089); GCTGACATTT; 2; gi|21312475|ref|NM_027265.1| Mus musculus RIKEN cDNA 2810004A10 gene (2810004A10Rik), mRNA; 1148
TGTCGTCTTGGGGTTGA; 1; chr2 (start: 27146887); GCCCTATGCT; 2; gi|40789287|ref|NM_023047.2| Mus musculus dihydropyrimidinase-like 5 (Dpysl5), mRNA; 1021
CGTCCGCCTGCCTGCCT; 1; chr16 (start: 28269370); n/a; n/a; gi|21704179|ref|NM_145587.1| Mus musculus SH3-binding kinase (Sbk), mRNA; 932
CGACTTTTATTTCTGAC; 1; chr6 (start: 22253423); n/a; n/a; gi|51767594|ref|XM_488666.1| PREDICTED: Mus musculus L
O C 4 3 2 7 4 0 ( L O C 4 3 2 7 4 0 ) , m R N A 922 T C T G C C T G A T A C C A A A C 1 c h r l 7 (start: 33566977) n/a n/a gi |31543573|ref |NM_011235.2| M u s musculus R A D 5 1 - l i k e 3 (S. cerevisiae) (Rad5113), m R N A 876 C A T C T T T C G G C C C A T T C 1 c h r l 7 (start: 80065894) n/a n/a gi |51766572|refjXM 358416.2| P R E D I C T E D : M u s musculus hypothetical L O C 3 80741 ( L O C 3 80741), m R N A 829 T C G A A G T A A A A T T C A A C 2 chr7 (start: 102348828) C G T T G G A T T C 4 g i |51710838 |ref |XM 131914.4| P R E D I C T E D : M u s musculus R I K E N c D N A 3110004018 gene (3110004018Rik) , m R N A 785 A A T G T C A T T A A A T A C C T 1 chr7 (start: 47742444) n/a n/a gi |31982268|ref |NM_008316.2| M u s musculus H u s l homolog (S. pombe) ( H u s l ) , m R N A 660 T C C A C T C A A C T G T A C A A 4 c h r l (start: 53600547) n/a n/a gi |28077004|ref |NM 028355.1| M u s musculus R I K E N c D N A 2810475A17 gene (2810475A17Rik) , m R N A 622 *Map to ESTs derived from undifferentiated hESCs (Brandenberger et al., 2004b) (gi|47283420|gb|CN267006.1, gi|47331838|gb|CN315424.1, and gi|47417639|gb|CN430045.1 respectively) 177 Table 36 Candidate mouse orthologous genomic sequences were analyzed using BLASTN against mouse RefSeq transcripts. Identified transcripts were mapped to mESC tags. The hESC metalibrary tag and count, mESC tag and count, and the gene name are shown below. Novel hESC candidates were given a unique identifier (nESC##). 
ID  hESC meta tag  Count  mESC tag  Count  Gene name
nESC01  AGTTGGGGTCTTGGGGA  1  AAAATAAAAA  4  RIKEN cDNA 9430067K14
nESC02  TCGAAGTAAAATTCAAC  2  CGTTGGATTC  4  PREDICTED: RIKEN cDNA 3110004O18
nESC03  TGTCGTCTTGGGGTTGA  1  GCCCTATGCT  2  dihydropyrimidinase-like 5
nESC04  CGATTTACCTACTTGAA  1  GCCGCGTCCG  3  PREDICTED: Mus musculus LOC434078
nESC05  ATGTAGACAAAATTAGC  1  GCTGACATTT  2  RIKEN cDNA 2810004A10
nESC06  TCATCGCTTTAATACTG  1  GGCCCCCACA  16  PREDICTED: similar to fidgetin-like 1
nESC07  TAGGTGGCCCTGTCTCC  1  TGGCTCGGTC  364  actin, gamma, cytoplasmic 1

Many genes critical to maintaining hallmark ESC properties, such as POU5F1 and SOX2, are orthologous in human and mouse. To reiterate, species-conserved gene expression in orthologous sequences strengthens the proposition that a novel and functional transcript can be detected using SAGE. This hypothesis is further reinforced by the presence of a known mouse transcript in the region of conservation. The remaining mouse transcripts were identified by a novel hESC tag mapped to an orthologous human genomic sequence but were not identified by a mESC SAGE tag (25 transcripts corresponding to 25 hESC tags; these novel hESC tags were termed nESC08-nESC32) (Figure 28B, example 2). The BLASTN analysis of candidate orthologous mouse genomic sequences identified 12 uncharacterized or predicted cDNA sequences and 13 functionally characterized transcripts. Table 37 lists the genes and a brief description of their functional role. Many of these mouse transcripts have a predefined human ortholog. Perhaps, by way of duplication in the human genome, a novel paralogous gene may exist in these candidate genomic regions. Notable transcripts included an SH3-binding kinase (Sbk) (nESC10), the chromodomain helicase DNA binding protein (Chd1) (nESC31), and Rad51-like 3 (Rad51l3) (nESC13).
A number of SH3 domain-containing proteins are downstream targets of the Jak/Stat pathway, the principal signalling pathway maintaining an undifferentiated state in mESCs. Expression of Sbk in ESCs may indicate a role for participants of the Jak/Stat pathway in hESC maintenance, contrary to the current belief that the pathway is non-functional in human pluripotent cells. CHD1 may play an important role in gene regulation through the modification of chromatin structure, altering the access of transcriptional machinery to its chromosomal DNA template. In Chapter 3.3.2 I found that other genes involved in epigenetic gene regulation (DNMT3β and HDAC2) were significantly up-regulated in u-hESCs compared to multiple adult and fetal samples (Tables 27 and 33). These results, in conjunction with the possible identification of an additional CHD1 ortholog novel to hESCs, suggest that specific epigenetic machinery may be expressed in hESCs and unique to the stemness phenotype. Published findings support a crucial role for the DNA repair gene Rad51l3 in normal mammalian development, recombination, and maintenance of mammalian genome stability (Smiraldo et al., 2005). I similarly noted significantly higher expression of DNA damage checkpoint and DNA repair genes (such as RAD51, RAD54L and RAD23B) in hESCs compared to n-CGAP libraries (Table 8, Chapter 2.3.5). Maintaining genomic stability would be critical to the immortality of hESC lines.

Table 37 25 mouse transcripts identified by BLASTN analysis of candidate orthologous mouse sequences against mouse RefSeq transcripts. The corresponding hESC tag sequence, count, mouse transcript accession, gene symbol, HSP size, and a description covering gene ontology, conserved domains, and/or Entrez Gene summaries are listed. Novel hESC candidates were given an identifier (nESC##).
ID  hESC sequence  Count  Accession  Gene symbol  HSP size
    Description

nESC08  CACGGCACACACAGGCA  1  gi|10946703  CALN1  659
    This gene encodes a protein with high similarity to the calcium-binding proteins of the calmodulin family. GO: biological process unknown, cellular component unknown
nESC09  TACTTTGAACCGAGGAG  1  gi|20467422  CSPG4  3557
    The human CSPG4 plays a role in stabilizing cell-substratum interactions during early events of melanoma cell spreading on endothelial basement membranes. Data suggest that CSPG4 is a novel marker for epidermal stem cells that contributes to their patterned distribution by promoting stem cell clustering (Legg et al., 2003). CSPG4 represents an integral membrane chondroitin sulfate proteoglycan expressed by human malignant melanoma cells. GO: activation of MAPK, cell proliferation, glial cell migration, transmembrane receptor protein tyrosine kinase signalling pathway, plasma membrane
nESC10  CGTCCGCCTGCCTGCCT  1  gi|21704179  SBK1  482
    Conserved domain: cd00180: S_TKc; Serine/Threonine protein kinases, catalytic domain
nESC11  CGATTTCTTAGAGAGAT  1  gi|23943841  KCNG3  626
    Conserved domain: pfam00520: Ion_trans; Ion transport protein
nESC12  TCCACTCAACTGTACAA  4  gi|28077004  2810475A17RIK  314
    Transmembrane protein 48 (Tmem48)
nESC13  TCTGCCTGATACCAAAC  1  gi|31543573  RAD51L3  442
    1. Findings support a crucial role for mammalian RAD51D in normal development, recombination, and maintaining mammalian genome stability (Smiraldo et al., 2005). 2. A fragment of Rad51B interacts with the C-terminus and linker of Rad51C, and this region of Rad51C also interacts with mRad51D and Xrcc3 (Miller et al., 2004). GO: Base-excision repair
nESC14  GTTTATTAGTCTGGATT  1  gi|31543917  UBE2V2  975
    Conserved domain: cd00195: UBCc; Ubiquitin-conjugating enzyme E2, catalytic (UBCc) domain
nESC15  TACCACTCTCTGTATGG  1  gi|31544068  ZFPN1A2  2104
    GO: transcription factor activity
nESC16  ATATGCAGCAGGATCAC  1  gi|31560079  KCNMB2  224
    Conserved domain: pfam03185: CaKB; Calcium-activated potassium channel, beta subunit
nESC17  CATAACCCAGGAAACAT  1  gi|31712027  A330102K23RIK  1045
    GO: apoptosis
nESC18  AATGTCATTAAATACCT  1  gi|31982268  HUS1  337
    1. Evidence for a requirement for Rad17 and Hus1 to induce G(2) arrest as well as Vpr-induced phosphorylation of histone 2A variant X (H2AX) and formation of nuclear foci containing H2AX and breast cancer susceptibility protein 1 (Zimmerman et al., 2004). 2. Hus1-deficient mouse cells had an impaired S checkpoint after exposure to DNA strand break-inducing agents such as camptothecin or ionizing radiation (Wang et al., 2004). 3. Hus1 is required specifically for one of two separable mammalian checkpoint pathways that respond to distinct forms of genome damage during S phase (Weiss et al., 2003).
nESC19  AGTGCGGAGTCCCCTTC  4  gi|40254309  A830053O21RIK  1207
    Conserved domains: cd00083: HLH; Helix-loop-helix domain, found in specific DNA-binding proteins that act as transcription factors
nESC20  TGTGGCAGCTGGTGGAA*  1  gi|40254535  ECEL1  824
    Conserved domains: pfam05649: Peptidase_M13_N; Peptidase family M13
nESC21  TCAAATCTCAGAGCATC  1  gi|47106055  ARF6  941
    GO: small GTPase mediated signal transduction, intracellular protein transport, vesicle-mediated transport
nESC22  CTCAGAGCGCGCAGGTC  1  gi|51556212  AL024069  251
    Conserved domain: cd01214: CG8312; CG8312 Phosphotyrosine-binding (PTB) domain
nESC23  GACTGCAAGAACCTAAG  1  gi|51709794  CIPP  202
    This gene encodes a multivalent PDZ domain protein, which is expressed exclusively in brain and kidney. This protein selectively interacts with Kir family members, N-methyl-D-aspartate receptor subunits, neurexins and neuroligins, and cell surface molecules enriched in synaptic membranes. This protein may serve as a scaffold that brings structurally diverse but functionally connected proteins into close proximity at the synapse.
nESC24  GATAGGAACTCTTCCTG*  1  gi|51711414  A230057G18RIK  782
    n/a
nESC25  AATCCCCCCGCCCCCTC  2  gi|51765137  Gm1127  1454
    Human conserved domain: cd00200: WD40; WD40 domain, found in a number of eukaryotic proteins that cover a wide variety of functions including adaptor/regulatory modules in signal transduction, pre-mRNA processing and cytoskeleton assembly
nESC26  CATCTTTCGGCCCATTC  1  gi|51766572  Gm1567  418
    n/a
nESC27  CGACTTTTATTTCTGAC  1  gi|51767594  LOC432740  465
    n/a
nESC28  AATCGGTCACACCAGCC  1  gi|51767667  BC024659  2553
    n/a
nESC29  CCCGACCCCGCGCTCTT  1  gi|51828549  LOC432582  276
    n/a
nESC30  CCTCGAGGGCACCGCGG  2  gi|51829829  LOC207939  597
    n/a
nESC31  ACGCCGAGAAAGCAAGC  1  gi|6680927  CHD1  207
    The CHD family of proteins is characterized by the presence of chromo (chromatin organization modifier) domains and SNF2-related helicase/ATPase domains. CHD genes alter gene expression possibly by modification of chromatin structure thus altering access of the transcriptional apparatus to its chromosomal DNA template.
nESC32  GCAGTAGGTAGAGTCAC  1  gi|7242198  RASGRF2  288
    GO: small GTPase mediated signal transduction, guanyl-nucleotide exchange factor activity

*Map to ESTs derived from undifferentiated hESCs (Brandenberger et al., 2004b) ES|gi|47331838|gb|CN315424.1|CN315424 and ES|gi|47417639|gb|CN430045.1|CN430045 respectively. GACTGCAAGAACCTAAG maps to hep|gi|47358539|gb|CN358605.1|CN358605 (derived from hESC differentiated to hepatocyte-like cells).

A second analysis involved aligning the human genomic sequences identified by a novel hESC tag to mouse RefSeq transcripts using TBLASTX.
The purpose of this analysis was to determine whether the human region might be orthologous to a mouse transcript of known function. Human sequences were selected based on a p-value cut-off of 0.0001, which resulted in the identification of 61 candidate mouse transcripts, of which 28 transcripts did not overlap with the BLASTN analysis described above. Table 38 lists the novel tag sequence, count, human MAF chromosome and start site, TBLASTX score, HSP size (bp), hit accession, and gene symbol (novel hESC tags were termed nESC33-nESC66). Transcripts were ordered from highest to lowest score. Altogether, 34 tag sequences hit 28 genes (illustrated in Figure 25, example 2).

The predicted transcript LOC432880 is similar to reverse transcriptase and was associated with 3 different human genomic locations that contained repetitive sequences (nESC43, nESC44, and nESC50). Similarly, LOC277923 is a mouse reverse transcriptase gene that corresponded to 2 human genomic locations with repetitive sequences (nESC40 and nESC52). I also observed instances where multiple novel tags aligned to the same MAF genomic region, specifically chromosome 4 (start site: 30893704), chromosome 13 (start site: 53945546), and chromosome X (start site: 106734806) (nESC57 and nESC60).

More than 50% (15) of the mouse genes identified were predicted transcripts or uncharacterized cDNA sequences. These transcripts were orthologous to 17 human genomic regions that were identified by 19 novel hESC tags (Table 38). These predicted mouse genes suggested that a transcribed sequence existed in a region homologous to a hypothetical novel gene in hESCs. Furthermore, several of the identified mouse transcripts, both predicted and validated, were involved in development and differentiation, transcriptional regulation, proliferation, genomic stability, and cell cycle checkpoints. Each gene was associated with a single hESC tag.
Descriptions of the genes and their suggested functions or conserved domains are listed in Table 39.

Table 38 TBLASTX analysis of candidate human genomic regions derived from the UCSC multiple alignment format (MAF). Human sequences were compared against mouse RefSeq transcripts. Listed below are the human embryonic stem cell novel tag sequence, count, MAF chromosomal region (in human) and alignment start site, TBLASTX score, HSP size (bp), hit accession, and gene symbol. Transcripts are ordered from highest to lowest score.

ID  hESC meta tag  Count  Human genomic region  Score  HSP  Accession  Mouse symbol
nESC33  ATCTGAGACAGACAGTT  1  chr3 (start: 185064773)  544  580  gi|51873059  EEF1A1
nESC34  GAGAGCGGATTTTGACT  1  chr5 (start: 106788290)  221  1990  gi|46560569  EFNA5
nESC35  CTATCTAGTGCCAAAAA  1  chr21 (start: 34106567)  132  338  gi|6754391  ITSN1
nESC36  AAACTTCAACATATGGT  1  chr6 (start: 41806237)  111  533  gi|38259219  USP49
nESC37  GCTGTAGGCGCAATGAG  1  chr1 (start: 113991788)  111  2402  gi|9055361  SYT6
nESC38  TAGTCTGCTATGACCAC  1  chr16 (start: 53507325)  110  512  gi|27734121  1700047E16RIK
nESC39  CCATTGGTCTCCATTCC  1  chr4 (start: 95992)  92  104  gi|31543955  WEE1
nESC40  CTAGACTAGAAACCACA  1  chr7 (start: 19771451)  87  637  gi|54312063  LOC277923
nESC41  GCTATCTTGAATGGGGT*  1  chr4 (start: 30893704)  81  1919  gi|51770847  LOC381153
nESC42  GGTAGGTTAAGAAAGAT*  1  chr4 (start: 30893704)  81  1919  gi|51770847  LOC381153
nESC43  AAGGGTTAGACTAGATA  1  chr11 (start: 115529238)  80  737  gi|51768555  LOC432880
nESC44  TGGTATGCAATAAATAT  1  chr8 (start: 129525130)  69  874  gi|51768555  LOC432880
nESC45  GCTTATGGCTAGAGAAT  1  chr2 (start: 203442081)  67  1657  gi|6680803  BMPR2
nESC46  GCTCTCTGAATAGCTTT  1  chr9 (start: 3514336)  65  1376  gi|34328188  RFX3
nESC47  CTACAAAACCGAAAGCA  [?]  chr13 (start: 54062598)  65  203  gi|16716602  GTF2IRD2
nESC48  CGAACATTTCCTAATGA  1  chr12 (start: 78446474)  62  418  gi|51830454  LOC436367
nESC49  CCTTTGCTTCCCTTTCC  1  chr2 (start: 36556311)  56  298  gi|51770404  CRIM1
nESC50  TGGATGTCAATTTGTTC  1  chr3 (start: 8279218)  52  293  gi|51768555  LOC432880
nESC51  CGGCGGGGCAGCCGACG  [?]  chr3 (start: 32830280)  52  408  gi|51765138  LOC384985
nESC52  TAGATACCAAGTTGTCC  1  chrX (start: 55811260)  51  368  gi|54312063  LOC277923
nESC53  GTACTGCACAATTCAGA  1  chr5 (start: 142120336)  51  313  gi|13626035  SIGLECL1
nESC54  CCAACGTGAAGTGATTT  1  chr6 (start: 14968830)  49  378  gi|51769069  LOC432935
nESC55  CCACATCCGATGCATAG  1  chr18 (start: 51671682)  49  3311  gi|6753409  CER1
nESC56  ATTACAGTGCCCTCAAA  1  chr2 (start: 113452195)  48  214  gi|31559867  B930067F20RIK
nESC57  AAGTCCCCGTTTGTTTT*  2  chr13 (start: 53945546)  47  776  gi|31541931  COX15
nESC58  TATATGATCATTACTAA*  12  chr13 (start: 53945546)  47  776  gi|31541931  COX15
nESC59  AACTGATAGCTGGAAGG*  8  chrX (start: 106734806)  45  995  gi|51708184  LOC433611
nESC60  TGATTGTAGATGTACCT*  2  chrX (start: 106734806)  45  995  gi|51708184  LOC433611
nESC61  GACGAAGAACCTTGTCC  1  chr9 (start: 128427310)  45  607  gi|13878228  SYTL1
nESC62  TAAACGCTGCCCTTAAA  1  chr4 (start: 18788855)  44  386  gi|31342579  4932435O22RIK
nESC63  TATATAGCTTGCATTTC  2  chr20 (start: 38449627)  37  638  gi|31340738  9330169L03RIK
nESC64  CCAGTCCGTTTTCTGGT  1  chr9 (start: 134725680)  36  1394  gi|37718971  5930405J04RIK
nESC65  ACTGACATTTAGCTAGT  1  chr18 (start: 23587615)  33  418  gi|46411175  CEECAM1
nESC66  CTACATAGTCCTGCATT  1  chr4 (start: 56903132)  30  96  gi|37674223  GNL3L

AAGTCCCCGTTTGTTTT and TGATTGTAGATGTACCT hit multiple MAF genomic regions. LOC432880i and LOC432880J correspond to human repetitive sequences.
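The selection behind Table 38 (keep TBLASTX hits passing the 0.0001 cut-off, set aside transcripts already recovered by BLASTN, and order the rest by score) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the pipeline used in the thesis: the `Hit` record, the `select_hits` function, and the inclusive comparison against the cut-off are hypothetical, and the p-values in the example are invented placeholders because the tables do not report them.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    """One TBLASTX hit of a human genomic region against a mouse RefSeq transcript."""
    tag: str        # novel hESC SAGE tag that identified the human region
    accession: str  # accession of the mouse transcript that was hit
    score: float    # TBLASTX score
    p_value: float  # significance of the hit (placeholder values below)


def select_hits(hits, blastn_accessions, p_cutoff=1e-4):
    """Keep significant hits not already found by BLASTN, highest score first."""
    kept = [
        h for h in hits
        if h.p_value <= p_cutoff and h.accession not in blastn_accessions
    ]
    return sorted(kept, key=lambda h: h.score, reverse=True)


# The first two records reuse tags, accessions and scores from Table 38
# (WEE1 and EEF1A1). The last two are hypothetical records added to exercise
# the filters. All p-values are invented for illustration.
hits = [
    Hit("CCATTGGTCTCCATTCC", "gi|31543955", 92, 1e-20),
    Hit("ATCTGAGACAGACAGTT", "gi|51873059", 544, 1e-50),
    Hit("TCTGCCTGATACCAAAC", "gi|31543573", 100, 1e-30),  # already in the BLASTN set
    Hit("NNNNNNNNNNNNNNNNN", "hypothetical_weak_hit", 25, 0.05),  # fails the cut-off
]
blastn_accessions = {"gi|31543573"}

selected = select_hits(hits, blastn_accessions)
for h in selected:
    print(h.tag, h.accession, h.score)
```

Keeping the BLASTN accessions in a set makes the overlap check a constant-time lookup, which matters little at this scale but keeps the filter readable.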
Table 39 TBLASTX mouse hits associated with development/differentiation, proliferation, transcriptional regulation (DNA dependent and epigenetic), genomic stability, and cell cycle checkpoints. Gene name, classification, and a description (defined by Entrez Gene, Gene Ontology, and/or conserved functional domains) are provided.

Name  Sequence  Classification  Description

1700047E16RIK
    Sequence: Mus musculus RIKEN cDNA 1700047E16 gene (1700047E16Rik), mRNA.
    Description: Conserved domains: COG0419: SbcC; ATPase involved in DNA repair [DNA replication, recombination, and repair]. cd00030: C2; Protein kinase C conserved region 2 (CalB). pfam00038: Filament; Intermediate filament protein. pfam05557: MAD; Mitotic checkpoint protein.
WEE1
    Sequence: Mus musculus wee 1 homolog (S. pombe) (Wee1), mRNA.
    Description: This gene encodes a nuclear protein, which is a tyrosine kinase belonging to the Ser/Thr family of protein kinases. The human homologue catalyzes the inhibitory tyrosine phosphorylation of CDC2/cyclin B kinase, and appears to coordinate the transition between DNA replication and mitosis by protecting the nucleus from cytoplasmically activated CDC2 kinase.
BMPR2
    Sequence: Mus musculus bone morphogenic protein receptor, type II (serine/threonine kinase) (Bmpr2), mRNA.
    Description: 1. Results define a pathway linking the bone morphogenetic protein receptor BMPRII to regulation of actin and provide insights into how extracellular signals modulate LIMK1 activity during dendritogenesis. 2. BMPR-1A, -2, and Noggin are up-regulated in undifferentiated mesenchymal cells and regenerating muscle fibers during the early phase of BMP-2-induced bone formation. GO: transforming growth factor beta receptor activity, protein serine/threonine kinase and tyrosine kinase activity, anterior/posterior pattern formation, protein amino acid phosphorylation
RFX3
    Sequence: Mus musculus regulatory factor X, 3 (influences HLA class II expression) (Rfx3), mRNA.
    Description: The transcription factor RFX3 directs nodal cilium development and left-right asymmetry specification. Mol Cell Biol. 2004 May;24(10):4417-27. The human homologue may have a role as a modulator of Ras signalling in epithelial cells.
GTF2IRD2
    Sequence: Mus musculus GTF2I repeat domain containing 2 (Gtf2ird2), mRNA.
    Description: The exact function of this gene product is not known. It is inferred to be a transcription factor based on the presence of GTF2I-like repeats (containing helix-loop-helix motifs), also found in other proteins such as GTF2IRD1 and GTF2I.
CRIM1
    Sequence: PREDICTED: Mus musculus cysteine-rich motor neuron 1 (Crim1), mRNA.
    Description: 1. Modulates BMP activity by affecting its processing and delivery to the cell surface. 2. Has a role in capillary formation and maintenance during angiogenesis. GO: insulin-like growth factor binding, serine-type endopeptidase inhibitor activity, regulation of cell growth, extracellular region.
SIGLECL1
    Sequence: Mus musculus SIGLEC-like 1 (Siglecl1), mRNA.
    Description: Sialic acid-binding immunoglobulin-like lectins (SIGLECs) are a family of cell surface proteins belonging to the immunoglobulin superfamily. They mediate protein-carbohydrate interactions by selectively binding to different sialic acid moieties present on glycolipids and glycoproteins. This gene encodes a member of the SIGLEC3-like subfamily of SIGLECs. Members of this subfamily are characterized by an extracellular V-set immunoglobulin-like domain followed by two C2-set immunoglobulin-like domains, and the cytoplasmic tyrosine-based motifs ITIM and SLAM-like. The encoded protein, upon tyrosine phosphorylation, has been shown to recruit the Src homology 2 domain-containing protein-tyrosine phosphatases SHP1 and SHP2. It has been suggested that the protein is involved in the negative regulation of macrophage signalling by functioning as an inhibitory receptor. This gene is located in a cluster with other SIGLEC3-like genes on 19q13.4. Alternatively spliced transcript variants encoding distinct isoforms have been described for this gene.
CER1
    Sequence: Mus musculus cerberus 1 homolog (Xenopus laevis) (Cer1), mRNA.
    Description: 1. Cerberus-like and Lefty-1 function redundantly to modulate Nodal signalling during gastrulation and regulate patterning of the primitive streak. 2. Role of Cerberus-like in mouse embryogenesis. GO: cytokine activity, extracellular space.
5930405J04RIK
    Sequence: Mus musculus RIKEN cDNA 5930405J04 gene (5930405J04Rik), mRNA.
    Description: Conserved domains: COG5259: RSC8; RSC chromatin remodeling complex subunit RSC8 [Chromatin structure and dynamics / Transcription], COG5271: MDN1; AAA ATPase containing von Willebrand factor type A (vWA) domain [General function prediction only]. pfam04433: SWIRM; SWIRM domain

4.4 Conclusions

To summarize, computational analysis of the human SAGE data from u-hESCs, adult and fetal cells led to the identification of 20,047 tags enriched in u-hESCs. Upon the implementation of stringent criteria, particularly the mapping of these enriched tags to genomic sequences with multiple species sequence conservation, I isolated 301 high quality tags novel to the hESC SAGE data. Computational annotation of the human genomic sequences identified by a novel hESC tag isolated 60 candidate mouse transcripts (identified by BLASTN and TBLASTX analysis) corresponding to 64 novel hESC tags. A subset of the identified mouse transcripts was also identified by a mESC SAGE tag (7 candidate transcripts; nESC01-nESC07). These candidate genes may be integral to pluripotency, as evidenced by their identification in both human and mouse gene expression data. After further characterization of the novel hESC tags associated with a mouse ortholog, I observed a preponderance of transcripts implicated in cell cycle regulation (e.g., Wee1; nESC39) or genomic stability (e.g., Rad51l3; nESC13), which also represent a class of genes highly represented in hESCs.
Other candidate novel hESC genes, such as nESC31 associated with the mouse transcript CHD1, may later prove to regulate epigenetic gene expression in conjunction with the known embryonic methyltransferase DNMT3β.

Our goal was to devise a method to select for novel genes expressed in the human embryonic stem cell lines. To this end I have isolated a number of candidate hESC tags supported by various in silico resources. These tags will be further investigated by using the SAGE tags as gene-specific primers in 3' and 5' rapid amplification of cDNA ends (RACE) (Frohman et al., 1988) to obtain longer sequences unique to hESCs. The novel tags identified using this approach have been used to obtain full-length cDNA sequences using 5' and 3' RACE in the follow-up study to this analysis (Hirst et al., submitted). This subsequent analysis describes the cloning and characterization of the novel transcript SPD4 (shares promoter with DPPA4). The SAGE tag first identifying the gene was present at 3 tags in 2.6 million total tags in the u-hESC metalibrary. Functional analysis revealed that SPD4 may encode a miRNA, based on sequence homology to known miRNAs and its ability to form a stem-loop structure, a required feature of miRNAs. In addition, quantitative PCR using RNA from undifferentiated and differentiated hESCs showed a reduction in SPD4 in response to differentiation.

Our efforts have demonstrated that the candidate novel genes found in the hESC SAGE data are expressed at low levels, but that the majority represent real transcripts. These transcripts, such as SPD4, are associated with undifferentiated hESCs and may prove to be necessary for stem cell maintenance and pluripotency.
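To put the abundance of the SPD4 tag in perspective, the raw count can be normalized to tags per million (tpm). The short Python sketch below works through that arithmetic. It assumes that the "2.6 million total tags" quoted above is the 2,613,475-tag total reported for the hESC long SAGE database in the Abstract; the function name `tags_per_million` is hypothetical and is not taken from the thesis scripts.

```python
def tags_per_million(count: int, total_tags: int) -> float:
    """Normalize a raw SAGE tag count to tags per million (tpm)."""
    if total_tags <= 0:
        raise ValueError("total_tags must be positive")
    return count / total_tags * 1_000_000


# 3 observations of the SPD4 tag; the library size of 2,613,475 tags is
# assumed to be the metalibrary total given in the Abstract.
spd4_tpm = tags_per_million(3, 2_613_475)
print(f"SPD4 tag abundance: {spd4_tpm:.2f} tpm")  # SPD4 tag abundance: 1.15 tpm
```

At roughly one tag per million, a transcript of this abundance would be easy to miss in a single, smaller library, which is consistent with the observation that the candidate novel genes are expressed at low levels.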
List of Appendices (http://www.bcgsc.ca/people/angels/htdocs/Thesis_appendices/)

Appendix 2a Process Discovery SPACE mappings script
Appendix 2b GO slim Molecular Function terms and scripts
Appendix 2c WA09 CMOST mappings
Appendix 2d Unmapped WA09 mouse CMOST mappings
Appendix 2e Database of species ambiguous tags
Appendix 2f WA09 species ambiguous tag mappings
Appendix 2g Database of pluripotent stem cell associated genes (PAGs)
Appendix 2h WA09 PAG mappings
Appendix 2i WA09 WNT mappings
Appendix 2j WA09 TGF-beta mappings
Appendix 2k WA09 Jak/Stat mappings
Appendix 2l Statistical testing of GO molecular functions between u-hESCs and n-CGAP libraries
Appendix 3a CGAP SAGE libraries
Appendix 3b Matrix.pl script
Appendix 3c Random tag generator script
Appendix 3d Correlation matrix
Appendix 3e Up-regulated in hESC versus nCGAP libraries
Appendix 4a Normal and malignant CGAP library list
Appendix 4b All novel hESC candidates
Appendix 4c Enriched hESC tag mappings
Appendix 4d Candidate novel hESC gene list
Appendix 4e Novel hESC tag matrix
Appendix 4f BLASTN results
Appendix 4g TBLASTX results

Bibliography

Journal Articles

Altschul, S. F., Gish, W., Miller, W., Myers, E. W., and Lipman, D. J. (1990). Basic local alignment search tool. J Mol Biol 215, 403-10.
Ambrosetti, D. C., Basilico, C., and Dailey, L. (1997). Synergistic activation of the fibroblast growth factor 4 enhancer by Sox2 and Oct-3 depends on protein-protein interactions facilitated by a specific spatial arrangement of factor binding sites. Mol Cell Biol 17, 6321-9.
Anisimov, S. V., Tarasov, K. V., Tweedie, D., Stern, M. D., Wobus, A. M., and Boheler, K. R. (2002). SAGE identification of gene transcripts with profiles unique to pluripotent mouse R1 embryonic stem cells. Genomics 79, 169-76.
Ashburner, M., Ball, C. A., Blake, J. A., Botstein, D., Butler, H., Cherry, J. M., Davis, A. P., Dolinski, K., Dwight, S. S., Eppig, J. T., Harris, M. A., Hill, D.
P., Issel-Tarver, L., Kasarskis, A., Lewis, S., Matese, J. C., Richardson, J. E., Ringwald, M., Rubin, G. M., and Sherlock, G. (2000). Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 25, 25-9.
Assady, S., Maor, G., Amit, M., Itskovitz-Eldor, J., Skorecki, K. L., and Tzukerman, M. (2001). Insulin production by human embryonic stem cells. Diabetes 50, 1691-7.
Aubert, J., Dunstan, H., Chambers, I., and Smith, A. (2002). Functional gene screening in embryonic stem cells implicates Wnt antagonism in neural differentiation. Nat Biotechnol 20, 1240-5.
Audic, S., and Claverie, J. M. (1997). The significance of digital gene expression profiles. Genome Res 7, 986-95.
Aulehla, A., Wehrle, C., Brand-Saberi, B., Kemler, R., Gossler, A., Kanzler, B., and Herrmann, B. G. (2003). Wnt3a plays a major role in the segmentation clock controlling somitogenesis. Dev Cell 4, 395-406.
Barrow, J. R., Thomas, K. R., Boussadia-Zahui, O., Moore, R., Kemler, R., Capecchi, M. R., and McMahon, A. P. (2003). Ectodermal Wnt3/beta-catenin signaling is required for the establishment and maintenance of the apical ectodermal ridge. Genes Dev 17, 394-409.
Bartel, D. P. (2004). MicroRNAs: genomics, biogenesis, mechanism, and function. Cell 116, 281-97.
Beaulieu, N., Morin, S., Chute, I. C., Robert, M. F., Nguyen, H., and MacLeod, A. R. (2002). An essential role for DNA methyltransferase DNMT3B in cancer cell survival. J Biol Chem 277, 28176-81.
Bhattacharya, B., Miura, T., Brandenberger, R., Mejido, J., Luo, Y., Yang, A. X., Joshi, B. H., Ginis, I., Thies, R. S., Amit, M., Lyons, I., Condie, B. G., Itskovitz-Eldor, J., Rao, M. S., and Puri, R. K. (2004). Gene expression in human embryonic stem cell lines: unique molecular signature. Blood 103, 2956-64.
Birney, E., Andrews, D., Bevan, P., Caccamo, M., Cameron, G., Chen, Y., Clarke, L., Coates, G., Cox, T., Cuff, J., Curwen, V., Cutts, T., Down, T., Durbin, R., Eyras, E., Fernandez-Suarez, X. M., Gane, P., Gibbins, B., Gilbert, J., Hammond, M., Hotz, H., Iyer, V., Kahari, A., Jekosch, K., Kasprzyk, A., Keefe, D., Keenan, S., Lehvaslaiho, H., McVicker, G., Melsopp, C., Meidl, P., Mongin, E., Pettett, R., Potter, S., Proctor, G., Rae, M., Searle, S., Slater, G., Smedley, D., Smith, J., Spooner, W., Stabenau, A., Stalker, J., Storey, R., Ureta-Vidal, A., Woodwark, C., Clamp, M., and Hubbard, T. (2004). Ensembl 2004. Nucleic Acids Res 32 Database issue, D468-70.
Boon, K., Osorio, E. C., Greenhut, S. F., Schaefer, C. F., Shoemaker, J., Polyak, K., Morin, P. J., Buetow, K. H., Strausberg, R. L., De Souza, S. J., and Riggins, G. J. (2002). An anatomy of normal and malignant gene expression. Proc Natl Acad Sci USA 99, 11287-92.
Botquin, V., Hess, H., Fuhrmann, G., Anastassiadis, C., Gross, M. K., Vriend, G., and Scholer, H. R. (1998). New POU dimer configuration mediates antagonistic control of an osteopontin preimplantation enhancer by Oct-4 and Sox-2. Genes Dev 12, 2073-90.
Brandenberger, R., Khrebtukova, I., Thies, R. S., Miura, T., Jingli, C., Puri, R., Vasicek, T., Lebkowski, J., and Rao, M. (2004a). MPSS profiling of human embryonic stem cells. BMC Dev Biol 4, 10.
Brandenberger, R., Wei, H., Zhang, S., Lei, S., Murage, J., Fisk, G. J., Li, Y., Xu, C., Fang, R., Guegler, K., Rao, M. S., Mandalam, R., Lebkowski, J., and Stanton, L. W. (2004b). Transcriptome characterization elucidates signaling networks that control human ES cell growth and differentiation. Nat Biotechnol 22, 707-16.
Brenner, S., Johnson, M., Bridgham, J., Golda, G., Lloyd, D. H., Johnson, D., Luo, S., McCurdy, S., Foy, M., Ewan, M., Roth, R., George, D., Eletr, S., Albrecht, G., Vermaas, E., Williams, S. R., Moon, K., Burcham, T., Pallas, M., DuBridge, R. B., Kirchner, J., Fearon, K., Mao, J., and Corcoran, K. (2000). Gene expression analysis by massively parallel signature sequencing (MPSS) on microbead arrays. Nat Biotechnol 18, 630-4.
Brickman, J. M., and Burdon, T. G. (2002). Pluripotency and tumorigenicity. Nat Genet 32, 557-8.
Brockington, M., Blake, D. J., Prandini, P., Brown, S. C., Torelli, S., Benson, M. A., Ponting, C. P., Estournet, B., Romero, N. B., Mercuri, E., Voit, T., Sewry, C. A., Guicheney, P., and Muntoni, F. (2001). Mutations in the fukutin-related protein gene (FKRP) cause a form of congenital muscular dystrophy with secondary laminin alpha2 deficiency and abnormal glycosylation of alpha-dystroglycan. Am J Hum Genet 69, 1198-209.
Buchkovich, K., Duffy, L. A., and Harlow, E. (1989). The retinoblastoma protein is phosphorylated during specific phases of the cell cycle. Cell 58, 1097-105.
Burdon, T., Chambers, I., Stracey, C., Niwa, H., and Smith, A. (1999a). Signaling mechanisms regulating self-renewal and differentiation of pluripotent embryonic stem cells. Cells Tissues Organs 165, 131-43.
Burdon, T., Stracey, C., Chambers, I., Nichols, J., and Smith, A. (1999b). Suppression of SHP-2 and ERK signalling promotes self-renewal of mouse embryonic stem cells. Dev Biol 210, 30-43.
Cadigan, K. M., and Nusse, R. (1997). Wnt signaling: a common theme in animal development. Genes Dev 11, 3286-305.
Cai, J., Chen, J., Liu, Y., Miura, T., Luo, Y., Loring, J. F., Freed, W. J., Rao, M. S., and Zeng, X. (2005). Assessing self-renewal and differentiation in hESC lines. Stem Cells.
Caplen, N. J., Parrish, S., Imani, F., Fire, A., and Morgan, R. A. (2001). Specific inhibition of gene expression by small double-stranded RNAs in invertebrate and vertebrate systems. Proc Natl Acad Sci USA 98, 9742-7.
Carpenter, M. K., Inokuma, M. S., Denham, J., Mujtaba, T., Chiu, C. P., and Rao, M. S. (2001). Enrichment of neurons and neural precursors from human embryonic stem cells. Exp Neurol 172, 383-97.
Carpenter, M. K., Rosler, E. S., Fisk, G. J., Brandenberger, R., Ares, X., Miura, T., Lucero, M., and Rao, M. S. (2004). Properties of four human embryonic stem cell lines maintained in a feeder-free culture system. Dev Dyn 229, 243-58.
Cavaleri, F., and Scholer, H. R. (2003). Nanog: a new recruit to the embryonic stem cell orchestra. Cell 113, 551-2.
Chambers, I., Colby, D., Robertson, M., Nichols, J., Lee, S., Tweedie, S., and Smith, A. (2003). Functional expression cloning of Nanog, a pluripotency sustaining factor in embryonic stem cells. Cell 113, 643-55.
Chen, J., Lee, S., Zhou, G., and Wang, S. M. (2002). High-throughput GLGI procedure for converting a large number of serial analysis of gene expression tag sequences into 3' complementary DNAs. Genes Chromosomes Cancer 33, 252-61.
Chen, J. J., Rowley, J. D., and Wang, S. M. (2000). Generation of longer cDNA fragments from serial analysis of gene expression tags for gene identification. Proc Natl Acad Sci USA 97, 349-53.
Cheung, P., Allis, C. D., and Sassone-Corsi, P. (2000). Signaling to chromatin through histone modifications. Cell 103, 263-71.
Chuma, M., Saeki, N., Yamamoto, Y., Ohta, T., Asaka, M., Hirohashi, S., and Sakamoto, M. (2004). Expression profiling in hepatocellular carcinoma with intrahepatic metastasis: identification of high-mobility group I(Y) protein as a molecular marker of hepatocellular carcinoma metastasis. Keio J Med 53, 90-7.
Classon, M., and Harlow, E. (2002). The retinoblastoma tumour suppressor in development and cancer. Nat Rev Cancer 2, 910-7.
Du, Z., Cong, H., and Yao, Z. (2001). Identification of putative downstream genes of Oct-4 by suppression-subtractive hybridization. Biochem Biophys Res Commun 282, 701-6.
Dykxhoorn, D. M., Novina, C. D., and Sharp, P. A. (2003). Killing the messenger: short RNAs that silence gene expression. Nat Rev Mol Cell Biol 4, 457-67.
Dyson, N. (1998). The regulation of E2F by pRB-family proteins. Genes Dev 12, 2245-62.
Eisen, M. B., Spellman, P. T., Brown, P. O., and Botstein, D. (1998). Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci USA 95, 14863-8.
Evans, M. J., and Kaufman, M. H. (1981). Establishment in culture of pluripotential cells from mouse embryos. Nature 292, 154-6.
Evans, S. J., Datson, N. A., Kabbaj, M., Thompson, R. C., Vreugdenhil, E., De Kloet, E. R., Watson, S. J., and Akil, H. (2002). Evaluation of Affymetrix Gene Chip sensitivity in rat hippocampal tissue using SAGE analysis. Serial Analysis of Gene Expression. Eur J Neurosci 16, 409-13.
Fattaey, A., Helin, K., and Harlow, E. (1993). Transcriptional inhibition by the retinoblastoma protein. Philos Trans R Soc Lond B Biol Sci 340, 333-6.
Feldman, B., Poueymirou, W., Papaioannou, V. E., DeChiara, T. M., and Goldfarb, M. (1995). Requirement of FGF-4 for postimplantation mouse development. Science 267, 246-9.
Fitch, W. M., and Margoliash, E. (1967). Construction of phylogenetic trees. Science 155, 279-84.
Fougerousse, F., Bullen, P., Herasse, M., Lindsay, S., Richard, I., Wilson, D., Suel, L., Durand, M., Robson, S., Abitbol, M., Beckmann, J. S., and Strachan, T. (2000). Human-mouse differences in the embryonic expression patterns of developmental control genes and disease genes. Hum Mol Genet 9, 165-73.
Frohman, M. A., Dush, M. K., and Martin, G. R. (1988). Rapid production of full-length cDNAs from rare transcripts: amplification using a single gene-specific oligonucleotide primer. Proc Natl Acad Sci USA 85, 8998-9002.
Gerecht-Nir, S., Dazard, J. E., Golan-Mashiach, M., Osenberg, S., Botvinnik, A., Amariglio, N., Domany, E., Rechavi, G., Givol, D., and Itskovitz-Eldor, J. (2005). Vascular gene expression and phenotypic correlation during differentiation of human embryonic stem cells. Dev Dyn 232, 487-97.
Gerhard, D. S., Wagner, L., Feingold, E. A., Shenmen, C. M., Grouse, L. H., Schuler, G., Klein, S. L., Old, S., Rasooly, R., Good, P., Guyer, M., Peck, A. M., Derge, J. G., Lipman, D., Collins, F. S., Jang, W., Sherry, S., Feolo, M., Misquitta, L., Lee, E., Rotmistrovsky, K., Greenhut, S. F., Schaefer, C. F., Buetow, K., Bonner, T.
I., Haussler, D., Kent, J., Kiekhaus, M., Furey, T., Brent, M., Prange, C , Schreiber, K., Shapiro, N., Bhat, N. K., Hopkins, R. F., Hsie, F., Driscoll, T., Soares, M. B., Casavant, T. L., Scheetz, T. E., Brown-stein, M. J., Usdin, T. B., Toshiyuki, S., Carninci, P., Piao, Y., Dudekula, D. B., Ko, M. S., Kawakami, K., Suzuki, Y., Sugano, S., Gruber, C. E., Smith, M. R., Simmons, B., Moore, T., Waterman, R., Johnson, S. L., Ruan, Y., Wei, C. L., Mathavan, S., Gunaratne, P. H., Wu, J., Garcia, A. M., Hulyk, S. W., Fuh, E., Yuan, Y., Sneed, A., Kowis, C , Hodgson, A., Muzny, D. M., McPherson, J., Gibbs, R. A., Fahey, J., Helton, E., Ketteman, M., Madan, A., Rodrigues, S., Sanchez, A., Whiting, M., Madari, A., Young, A. C , Wetherby, K. D., Granite, S. J., Kwong, P. N., Brinkley, C. P., Pearson, R. L., Bouffard, G. G., Blakesly, R. W., Green, E. D., Dickson, M. C., Rodriguez, A. C., Grimwood, J., Schmutz, J., Myers, R. M., Butterfield, Y. S., Griffith, M., Griffith, O. L., Krzywinski, M. I., Liao, N., Morrin, R., Palmquist, D., et al. (2004). The status, quality, and expansion of the NIH full-length cDNA project: the Mammalian Gene Collection (MGC). Genome Res 14, 2121-7. Ginis, I., Luo, Y., Miura, T., Thies, S., Brandenberger, R., Gerecht-Nir, S., Amit, M., Hoke, A., Carpenter, M. K., Itskovitz-Eldor, J., and Rao, M. S. (2004). Differences between human and mouse embryonic stem cells. Dev Biol 269, 360-80. Giraldez, A. J., Cinalli, R. M., Glasner, M. E., Enright, A. J., Thomson, J. M., Baskerville, S., Hammond, S. M., Bartel, D. P., and Schier, A. F. (2005). MicroRNAs regulate brain morphogenesis in zebrafish. Science 308, 833-8. Golan-Mashiach, M., Dazard, J. E., Gerecht-Nir, S., Amariglio, N., Fisher, T., Jacob-Hirsch, J., Bielorai, B., Osenberg, S., Barad, O., Getz, G., Toren, A., Rechavi, G., Itskovitz-Eldor, J., Domany, E., and Givol, D. (2005). Design principle of gene expression used by human stem cells: implication for pluripotency. Faseb J19, 147-9. 
Goodrich, D. W., Wang, N. P., Qian, Y. W., Lee, E. Y., and Lee, W. H. (1991). The retinoblastoma gene product regulates progression through the Gl phase of the cell cycle. Cell 67, 293-302. Hanahan, D., and Weinberg, R. A. (2000). The hallmarks of cancer. Cell 100, 57-70. Harrer, M., Luhrs, FL, Bustin, M., Scheer, U., and Hock, R. (2004). Dynamic interaction of HMGAla proteins with chromatin. J Cell Sci 111, 3459-71. Hart, A. H., Willson, T. A., Wong, M., Parker, K., and Robb, L. (2005). Transcriptional regulation of the homeobox gene Mixll by TGF-beta and FoxHl. Biochem Biophys Res Commun 333, 1361-9. Hatakeyama, M., and Weinberg, R. A. (1995). The role of RB in cell cycle control. Prog Cell Cycle Res 1,9-19. Heinrich, P. C , Behrmann, I., Muller-Newen, G., Schaper, F., and Graeve, L. (1998). Interleukin-6-type cytokine signalling through the gpl30/Jak/STAT pathway. Biochem J 334 ( Pt 2), 297-314. Helin, K. (1998). Regulation of cell proliferation by the E2F transcription factors. Curr Opin Genet Dev 8, 28-35. Helin, K., Harlow, E., and Fattaey, A. (1993). Inhibition of E2F-1 transactivation by direct binding of the retinoblastoma protein. Mol Cell Biol 13, 6501-8. 202 Hickman, E. S., and Helin, K. (2002). The regulation of APAF1 expression during development and tumourigenesis. Apoptosis 7,167-71. Hickman, E. S., Moroni, M. C , and Helin, K. (2002). The role of p53 and pRB in apoptosis and cancer. Curr Opin Genet Dev 12, 60-6. Hubbard, T. (2002). Biological information: making it accessible and integrated (and trying to make sense of it). Bioinformatics 18 Suppl 2, SI 40. Hubbard, T., Andrews, D., Caccamo, M., Cameron, G., Chen, Y., Clamp, M., Clarke, L., Coates, G., Cox, T., Cunningham, F., Curwen, V., Cutts, T., Down, T., Durbin, R., Fernandez-Suarez, X. 
M., Gilbert, J., Hammond, M., Herrero, J., Hotz, H., Howe, K., Iyer, V., Jekosch, K., Kahari, A., Kasprzyk, A., Keefe, D., Keenan, S., Kokocinsci, F., London, D., Longden, I., McVicker, G., Melsopp, C , Meidl, P., Potter, S., Proctor, G., Rae, M., Rios, D., Schuster, M., Searle, S., Severin, J., Slater, G., Smedley, D., Smith, J., Spooner, W., Stabenau, A., Stalker, J., Storey, R., Trevanion, S., Ureta-Vidal, A., Vogel, J., White, S., Woodwark, C , and Birney, E. (2005). Ensembl 2005. Nucleic Acids Res 33, D447-53. Hubbard, T., Barker, D., Birney, E., Cameron, G., Chen, Y., Clark, L., Cox, T., Cuff, J., Curwen, V., Down, T., Durbin, R., Eyras, E., Gilbert, J., Hammond, M., Huminiecki, L., Kasprzyk, A., Lehvaslaiho, H., Lijnzaad, P., Melsopp, C , Mongin, E., Pettett, R., Pocock, M., Potter, S., Rust, A., Schmidt, E., Searle, S., Slater, G., Smith, J., Spooner, W., Stabenau, A., Stalker, J., Stupka, E., Ureta-Vidal, A., Vastrik, I., and Clamp, M. (2002). The Ensembl genome database project. Nucleic Acids Res 30, 38-41. Hwang, W. S., Ryu, Y. J., Park, J. H., Park, E. S., Lee, E. G., Koo, J. M., Jeon, H. Y., Lee, B. C , Kang, S. K., Kim, S. J., Ahn, C , Hwang, J. H., Park, K. Y., Cibelli, J. B., and Moon, S. Y. (2004). Evidence of a pluripotent human embryonic stem cell line derived from a cloned blastocyst. Science 303, 1669-74. Ikeya, M., Lee, S. M., Johnson, J. E., McMahon, A. P., and Takada, S. (1997). Wnt signalling required for expansion of neural crest and CNS progenitors. Nature 389, 966-70. Itskovitz-Eldor, J., Schuldiner, M., Karsenti, D., Eden, A., Yanuka, O., Amit, M., Soreq, H. , and Benvenisty, N. (2000). Differentiation of human embryonic stem cells into embryoid bodies compromising the three embryonic germ layers. Mol Med 6, 88-95. Ivanova, N. B., Dimos, J. T., Schaniel, C , Hackney, J. A., Moore, K. A., and Lemischka, I. R. (2002). A stem cell molecular signature. Science 298, 601-4. Jenuwein, T., and Allis, C. D. (2001). Translating the histone code. 
Science 293, 1074-80. Johnson, D. G., Ohtani, K., and Nevins, J. R. (1994). Autoregulatory control of E2F1 expression in response to positive and negative regulators of cell cycle progression. Genes Dev 8, 1514-25. Jongeneel, C. V., Iseli, C , Stevenson, B. J., Riggins, G. J., Lai, A., Mackay, A., Harris, R. A., O'Hare, M. J., Neville, A. M., Simpson, A. J., and Strausberg, R. L. (2003). Comprehensive sampling of gene expression in human cell lines with massively parallel signature sequencing. Proc Natl Acad Sci USA 100,4702-5. Kasprzyk, A., Keefe, D., Smedley, D., London, D., Spooner, W., Melsopp, C , Hammond, M., Rocca-Serra, P., Cox, T., and Birney, E. (2004). EnsMart: a generic system for fast and flexible access to biological data. Genome Res 14, 160-9. Kehat, I., Amit, M., Gepstein, A., Huber, I., Itskovitz-Eldor, J., and Gepstein, L. (2003). Development of cardiomyocytes from human ES cells. Methods Enzymol 365, 461-73. Kellis, M., Patterson, N., Birren, B., Berger, B., and Lander, E. S. (2004). Methods in comparative genomics: genome correspondence, gene identification and regulatory motif discovery. J Comput Biol 11, 319-55. Kelly, D. L., and Rizzino, A. (2000). DNA microarray analyses of genes regulated during the differentiation of embryonic stem cells. Mol Reprod Dev 56, 113-23. Kielman, M. F., Rindapaa, M., Gaspar, C , van Poppel, N., Breukel, C , van Leeuwen, S., Taketo, M. M., Roberts, S., Smits, R., and Fodde, R. (2002). Ape modulates embryonic stem-cell differentiation by controlling the dosage of beta-catenin signaling. Nat Genet 32, 594-605. 2 0 5 Lai, A., Lash, A. E., Altschul, S. F., Velculescu, V., Zhang, L., McLendon, R. E., Marra, M. A., Prange, C , Morin, P. J., Polyak, K., Papadopoulos, N., Vogelstein, B., Kinzler, K. W., Strausberg, R. L., and Riggins, G. J. (1999). A public database for gene expression in human cancers. Cancer Res 5 9 , 5403-7. Lavon, N., and Benvenisty, N. (2005). 
Study of hepatocyte differentiation using embryonic stem cells. J Cell Biochem. Lebkowski, J. S., Gold, J., Xu, C., Funk, W., Chiu, C. P., and Carpenter, M. K. (2001). Human embryonic stem cells: culture, differentiation, and genetic modification for regenerative medicine applications. Cancer Jl Suppl 2, S83-93. Legg, J., Jensen, U . B., Broad, S., Leigh, I., and Watt, F. M. (2003). Role of melanoma chondroitin sulphate proteoglycan in patterning stem cells in human interfollicular epidermis. Development 130 , 6049-63. Levenberg, S., Golub, J. S., Amit, M., Itskovitz-Eldor, J., and Langer, R. (2002). Endothelial cells derived from human embryonic stem cells. Proc Natl Acad Sci U SA99, 4391-6. Ling, M. T., Wang, X., Ouyang, X. S., Xu, K., Tsao, S. W., and Wong, Y. C. (2003). Id-1 expression promotes cell survival through activation of NF-kappaB signalling pathway in prostate cancer cells. Oncogene 22, 4498-508. Lipshutz, R. J., Fodor, S. P., Gingeras, T. R., and Lockhart, D. J. (1999). High density synthetic oligonucleotide arrays. Nat Genet 21, 20-4. 206 Liu, L., Leaman, D., Villalta, M., and Roberts, R. M. (1997). Silencing of the gene for the alpha-subunit of human chorionic gonadotropin by the embryonic transcription factor Oct-3/4. Mol Endocrinol 11, 1651-8. Liu, L., and Roberts, R. M. (1996). Silencing of the gene for the beta subunit of human chorionic gonadotropin by the embryonic transcription factor Oct-3/4. J Biol Chem 111, 16683-9. Lockhart, D. J., Dong, H., Byrne, M. C , Follettie, M. T., Gallo, M. V., Chee, M. S., Mittmann, M., Wang, C , Kobayashi, M., Horton, H., and Brown, E. L. (1996). Expression monitoring by hybridization to high-density oligonucleotide arrays. Nat Biotechnol 14, 1675-80. Logan, C. Y., and Nusse, R. (2004). The Wnt signaling pathway in development and disease. Annu Rev Cell Dev Biol 20, 781-810. Makalowski, W., Zhang, J., and Boguski, M. S. (1996). 
Comparative analysis of 1196 orthologous mouse and human full-length mRNA and protein sequences. Genome Res 6, 846-57. Martin, G. R. (1981). Isolation of a pluripotent cell line from early mouse embryos cultured in medium conditioned by teratocarcinoma stem cells. Proc Natl Acad Sci USA 18, 7634-8. Miller, K. A., Sawicka, D., Barsky, D., and Albala, J. S. (2004). Domain mapping of the Rad51 paralog protein complexes. Nucleic Acids Res 32, 169-78. Mitsui, K., Tokuzawa, Y., Itoh, H., Segawa, K., Murakami, M., Takahashi, K., Maruyama, M., Maeda, M., and Yamanaka, S. (2003). The homeoprotein Nanog is required for maintenance of pluripotency in mouse epiblast and ES cells. Cell 113, 631-42. Mullor, J. L., Sanchez, P., and Altaba, A. R. (2002). Pathways and consequences: Hedgehog signaling in human disease. Trends Cell Biol 12, 562-9. Mummery, C , Ward, D., van den Brink, C. E., Bird, S. D., Doevendans, P. A., Opthof, T., Brutel de la Riviere, A., Tertoolen, L., van der Heyden, M., and Pera, M. (2002). Cardiomyocyte differentiation of mouse and human embryonic stem cells. JAnat 200, 233-42. Narita, M., Nunez, S., Heard, E., Lin, A. W., Hearn, S. A., Spector, D. L., Harmon, G. J., and Lowe, S. W. (2003). Rb-mediated heterochromatin formation and silencing of E2F target genes during cellular senescence. Cell 113, 703-16. Nichols, J., Zevnik, B., Anastassiadis, K., Niwa, H., Klewe-Nebenius, D., Chambers, I., Scholer, H., and Smith, A. (1998). Formation of pluripotent stem cells in the mammalian embryo depends on the POU transcription factor Oct4. Cell 95, 379-91. Nishimoto, M., Fukushima, A., Okuda, A., and Muramatsu, M. (1999). The gene for the embryonic stem cell coactivator UTF1 carries a regulatory element which selectively interacts with a complex composed of Oct-3/4 and Sox-2. Mol Cell Biol 19, 5453-65. 208 Niswander, L., Tickle, C , Vogel, A., Booth, I., and Martin, G. R. (1993). 
FGF-4 replaces the apical ectodermal ridge and directs outgrowth and patterning of the limb. Cell 75, 579-87. Pardal, R., Clarke, M. F., and Morrison, S. J. (2003). Applying the principles of stem-cell biology to cancer. Nat Rev Cancer 3, 895-902. Parr, B. A., Cornish, V. A., Cybulsky, M. I., and McMahon, A. P. (2001). WntVb regulates placental development in mice. Dev Biol 237, 324-32. Pesce, M., Anastassiadis, K., and Scholer, H. R. (1999). Oct-4: lessons of totipotency from embryonic stem cells. Cells Tissues Organs 165, 144-52. Pesce, M., Gross, M. K., and Scholer, H. R. (1998a). In line with our ancestors: Oct-4 and the mammalian germ. Bioessays 20, 722-32. Pesce, M., and Scholer, H. R. (2001). Oct-4: gatekeeper in the beginnings of mammalian development. Stem Cells 19, 271-8. Pesce, M., Wang, X., Wolgemuth, D. J., and Scholer, H. (1998b). Differential expression of the Oct-4 transcription factor during mouse germ cell differentiation. Mech Dev 71, 89-98. Pleasance, E. D., Marra, M. A., and Jones, S. J. (2003). Assessment of SAGE in transcript identification. Genome Res 13, 1203-15. 209 Pruitt, K. D., Tatusova, T., and Maglott, D. R. (2003). NCBI Reference Sequence project: update and current status. Nucleic Acids Res 31, 34-7. Ramalho-Santos, M., Yoon, S., Matsuzaki, Y., Mulligan, R. C , and Melton, D. A. (2002). "Sternness": transcriptional profiling of embryonic and adult stem cells. Science 298, 597-600. Rambhatla, L., Chiu, C. P., Kundu, P., Peng, Y., and Carpenter, M. K. (2003). Generation of hepatocyte-like cells from human embryonic stem cells. Cell Transplant 12,1-11. Reubinoff, B. E., Pera, M. F., Fong, C. Y., Trounson, A., and Bongso, A. (2000). Embryonic stem cell lines from human blastocysts: somatic differentiation in vitro. Nat Biotechnol 18, 399-404. Reya, T., Duncan, A. W., Ailles, L., Domen, J., Scherer, D. C , Willert, K., Hintz, L., Nusse, R., and Weissman, I. L. (2003). 
A role for Writ signalling in self-renewal of haematopoietic stem cells. Nature 423,409-14. Reyes, M., Lund, T., Lenvik, T., Aguiar, D., Koodie, L., and Verfaillie, C. M. (2001). Purification and ex vivo expansion of postnatal human marrow mesodermal progenitor cells. Blood 98, 2615-25. Rhee, I., Bachman, K. E., Park, B. H., Jair, K. W., Yen, R. W., Schuebel, K. E., Cui, H., Feinberg, A. P., Lengauer, C , Kinzler, K. W., Baylin, S. B., and Vogelstein, B. 210 (2002). DNMT1 and DNMT3b cooperate to silence genes in human cancer cells. Nature 416, 552-6. Rho, J. Y., Yu, K., Han, J. S., Chae, J. I., Koo, D. B., Yoon, H. S., Moon, S. Y., Lee, K. K., and Han, Y. M. (2005). Transcriptional profiling of the developmentally important signalling pathways in human embryonic stem cells. Hum Reprod. Richards, M., Tan, S. P., Tan, J. H., Chan, W. K., and Bongso, A. (2004). The transcriptome profile of human embryonic stem cells as defined by SAGE. Stem Cells 22, 51-64. Rogers, M. B., Hosier, B. A., and Gudas, L. J. (1991). Specific expression of a retinoic acid-regulated, zinc-finger gene, Rex-1, in preimplantation embryos, trophoblast and spermatocytes. Development 113, 815-24. Saha, S., Sparks, A. B., Rago, C , Akmaev, V., Wang, C. J., Vogelstein, B., Kinzler, K. W., and Velculescu, V. E. (2002). Using the transcriptome to annotate the genome. Nat Biotechnol 20, 508-12. Sato, N., Sanjuan, I. M., Heke, M., Uchida, M., Naef, F., and Brivanlou, A. H. (2003). Molecular signature of human embryonic stem cells and its comparison with the mouse. Dev Biol 260, 404-13. Schena, M., Heller, R. A., Theriault, T. P., Konrad, K., Lachenmeier, E., and Davis, R. W. (1998). Microarrays: biotechnology's discovery platform for functional genomics. Trends Biotechnol 16, 301-6. Scholer, H. R., Ciesiolka, T., and Gruss, P. (1991). A nexus between Oct-4 and El A: implications for gene regulation in embryonic stem cells. Cell 66,291-304. Schuldiner, M., Yanuka, O., Itskovitz-Eldor, J., Melton, D. 
A., and Benvenisty, N. (2000). Effects of eight growth factors on the differentiation of cells derived from human embryonic stem cells. Proc Natl Acad Sci USA 97, 11307-12. Schwartz, R. E., Reyes, M., Koodie, L., Jiang, Y., Blackstad, M., Lund, T., Lenvik, T., Johnson, S., Hu, W. S., and Verfaillie, C. M. (2002). Multipotent adult progenitor cells from bone marrow differentiate into functional hepatocyte-like cells. J Clin Invest 109, 1291-302. Smiraldo, P. G., Gruver, A. M., Osborn, J. C., and Pittman, D. L. (2005). Extensive chromosomal instability in Rad51d-deficient mouse cells. Cancer Res 65, 2089-96. Smith, A. G., Heath, J. K., Donaldson, D. D., Wong, G. G., Moreau, J., Stahl, M., and Rogers, D. (1988). Inhibition of pluripotential embryonic stem cell differentiation by purified polypeptides. Nature 336, 688-90. Sottile, V., Thomson, A., and McWhir, J. (2003). In vitro osteogenic differentiation of human ES cells. Cloning Stem Cells 5, 149-55. Sperger, J. M., Chen, X., Draper, J. S., Antosiewicz, J. E., Chon, C. H., Jones, S. B., Brooks, J. D., Andrews, P. W., Brown, P. O., and Thomson, J. A. (2003). Gene 212 expression patterns in human embryonic stem cells and human pluripotent germ cell tumors. Proc Natl Acad Sci USA 100, 13350-5. Suarez-Farinas, M., Noggle, S., Heke, M., Hemmati-Brivanlou, A., and Magnasco, M. O. (2005). Comparing independent microarray studies: the case of human embryonic stem cells. BMC Genomics 6, 99. Taipale, J., and Beachy, P. A. (2001). The Hedgehog and Wnt signalling pathways in cancer. Nature 411, 349-54. Tanaka, T. S., Kunath, T., Kimber, W. L., Jaradat, S. A., Stagg, C. A., Usuda, M., Yokota, T., Niwa, H., Rossant, J., and Ko, M. S. (2002). Gene expression profiling of embryo-derived stem cells reveals candidate genes associated with pluripotency and lineage specificity. Genome Res 12,1921-8. Tang, K., Yang, J., Gao, X., Wang, C , Liu, L., Kitani, H., Atsumi, T., and Jing, N. (2002). 
Wnt-1 promotes neuronal differentiation and inhibits gliogenesis in PI9 cells. Biochem Biophys Res Commun 293, 167-73. Thomson, J. A., Itskovitz-Eldor, J., Shapiro, S. S., Waknitz, M. A., Swiergiel, J. J., Marshall, V. S., and Jones, J. M. (1998). Embryonic stem cell lines derived from human blastocysts. Science 282, 1145-7. Tsai, R. Y., and McKay, R. D. (2002). A nucleolar mechanism controlling cell proliferation in stem cells and cancer cells. Genes Dev 16,2991-3003. 213 Vallier, L., Reynolds, D., and Pedersen, R. A. (2004). Nodal inhibits differentiation of human embryonic stem cells along the neuroectodermal default pathway. Dev Biol 275,403-21. Velculescu, V. E., Madden, S. L., Zhang, L., Lash, A. E., Yu, J., Rago, C , Lai, A., Wang, C. J., Beaudry, G. A., Ciriello, K. M., Cook, B. P., Dufault, M. R., Ferguson, A. T., Gao, Y., He, T. C , Hermeking, H., Hiraldo, S. K., Hwang, P. M., Lopez, M. A., Luderer, H. F., Mathews, B., Petroziello, J. M., Polyak, K., Zawel, L., Kinzler, K. W., and et al. (1999). Analysis of human transcriptomes. Nat Genet 23, 387-8. Velculescu, V. E., Zhang, L., Vogelstein, B., and Kinzler, K. W. (1995). Serial analysis of gene expression. Science 270, 484-7. Walsh, J., and Andrews, P. W. (2003). Expression of Wnt and Notch pathway genes in a pluripotent human embryonal carcinoma cell line and embryonic stem cell. Apmis 111, 197-210; discussion 210-1. Wang, S. H., Tsai, M. S., Chiang, M. F., and Li, H. (2003). A novel NK-type homeobox gene, ENK (early embryo specific NK), preferentially expressed in embryonic stem cells. Gene Expr Patterns 3, 99-103. Wang, X., Guan, J., Hu, B., Weiss, R. S., Iliakis, G., and Wang, Y. (2004). Involvement of Husl in the chain elongation step of DNA replication after exposure to camptothecin or ionizing radiation. Nucleic Acids Res 32, 767-75. 214 Weinberg, R. A. (1995). The retinoblastoma protein and cell cycle control. Cell 81, 323-30. Weiss, R. S., Leder, P., and Vaziri, C. (2003). 
Critical role for mouse Husl in an S-phase DNA damage cell cycle checkpoint. Mol Cell Biol 23, 791-803. Wienholds, E., Kloosterman, W. P., Miska, E., Alvarez-Saavedra, E., Berezikov, E., de Bruijn, E., Horvitz, H. R., Kauppinen, S., and Plasterk, R. H. (2005). MicroRNA expression in zebrafish embryonic development. Science 309, 310-1. Willert, K., Brown, J. D., Danenberg, E., Duncan, A. W., Weissman, I. L., Reya, T., Yates, J. R., 3rd, and Nusse, R. (2003). Wnt proteins are lipid-modified and can act as stem cell growth factors. Nature 423, 448-52. Williams, R. L., Hilton, D. J., Pease, S., Willson, T. A., Stewart, C. L., Gearing, D. P., Wagner, E. F., Metcalf, D., Nicola, N. A., and Gough, N. M. (1988). Myeloid leukaemia inhibitory factor maintains the developmental potential of embryonic stem cells. Nature 336, 684-7. Xu, C , Police, S., Rao, N., and Carpenter, M. K. (2002a). Characterization and enrichment of cardiomyocytes derived from human embryonic stem cells. Circ Res 91, 501-8. Xu, R. H., Chen, X., Li, D. S., Li, R., Addicks, G. C , Glennon, C , Zwaka, T. P., and Thomson, J. A. (2002b). BMP4 initiates human embryonic stem cell differentiation to trophoblast. Nat Biotechnol 20, 1261-4. 215 Ye, S. Q., Zhang, L. Q., Zheng, F., Virgil, D., and Kwiterovich, P. O. (2000). miniSAGE: gene expression profiling using serial analysis of gene expression from 1 microg total RNA. Anal Biochem 287, 144-52. Yoshida, K., Chambers, I., Nichols, J., Smith, A., Saito, M., Yasukawa, K., Shoyab, M., Taga, T., and Kishimoto, T. (1994). Maintenance of the pluripotential phenotype of embryonic stem cells through direct activation of gpl30 signalling pathways. Mech Dev 45, 163-71. Zhang, M. Q. (1998). Statistical features of human exons and their flanking regions. Hum Mol Genet 7,919-32. Zhang, S. C , Wernig, M., Duncan, I. D., Brustle, O., and Thomson, J. A. (2001). In vitro differentiation of transplantable neural precursors from human embryonic stem cells. 
Nat Biotechnol 19, 1129-33. Zimmerman, E. S., Chen, J., Andersen, J. L., Ardon, O., Dehart, J. L., Blackett, J., Choudhary, S. K., Camerini, D., Nghiem, P., and Planelles, V. (2004). Human immunodeficiency virus type 1 Vpr-mediated G2 arrest requires Radl7 and Husl and induces nuclear BRCA1 and gamma-H2AX focus formation. Mol Cell Biol 24, 9286-94. Zwaka, T. P., and Thomson, J. A. (2003). Homologous recombination in human embryonic stem cells. Nat Biotechnol 21, 319-21. 216 Submitted publications Hirst, M. et al. LongSAGE Transcriptome analysis of Nine Human Embryonic Stem Cell lines reveals novel transcripts and an over representation of RNA binding proteins. Nature Biotechnology (Submitted). Siddiqui, A. et al. Mouse Atlas of Gene Expression: Large-Scale Digital Gene Expression Profiles from Precisely Defined Developing C57BL/6J mouse tissues and cells. Proceedings of the National Academy of Sciences of the United States ofAmerica (Submitted). Textbooks Alberts, B., D. Bray, A. Johnson, J. Lewis, M. Raff, K. Roberts, and P. Walter. 1998. Essential Cell Biology: An Introduction to the Molecular Biology of the Cell. Union Square West, NY: Garland Publishing Inc. Moore, K.L., and Persaud, T.V.N. 2003. The Developing Human: Clinically Oriented Embryology. 7th Edition. Saunders.. Zar, J.H. 1996. Biostatistical Analysis. 3rd Edition. Upper Saddle River, NJ: Apprentice Hall Inc. 2 1 7 
