Open Collections

UBC Theses and Dissertations

Analysis of undifferentiated human embryonic stem cell lines using Serial Analysis of Gene Expression Schnerch, Angelique 2005

ANALYSIS OF UNDIFFERENTIATED HUMAN EMBRYONIC STEM CELL LINES USING SERIAL ANALYSIS OF GENE EXPRESSION

by ANGELIQUE SCHNERCH
B.Sc., University of British Columbia, 2000

A THESIS SUBMITTED IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE in THE FACULTY OF GRADUATE STUDIES (Medical Genetics)

THE UNIVERSITY OF BRITISH COLUMBIA
December 2005
© Angelique Schnerch, 2005

Abstract

Since the first reported isolation of immortalized human embryonic stem cell (hESC) lines in 1998 (Thomson et al., 1998), methods for directing their differentiation to various specialized derivatives have been extensively demonstrated and provide hope for future therapeutic applications. Characterization of the key molecular factors governing hESC self-renewal and pluripotency is necessary for ongoing efforts in deriving therapeutically useful cell types and in modelling human embryonic and oncogenic development. To this end the GSC Gene Expression Laboratory has generated 11 global gene expression profiles of 8 undifferentiated hESC lines using long Serial Analysis of Gene Expression (long SAGE) (NIH stem cell registry codes BG01, ES03, ES04, WA01, WA07, WA09, WA13, and WA14). I analysed a database of the hESC long SAGE data consisting of 2,613,475 total tags corresponding to 379,465 transcripts. By employing various comprehensive tag-to-gene mapping resources I have provided a detailed survey of the genes expressed and differentially expressed in multiple hESC lines. A suite of stemness-associated factors was observed in the hESC SAGE data. We also observed molecular components of several pathways involved in embryonic development, the cell cycle, and programmed cell death. Comprehensive interspecies pair-wise comparisons between the hESC libraries and publicly available human SAGE libraries identified up-regulated transcripts in embryonic stem cells.
A robust computational approach was designed to identify tags expressed solely in hESCs compared to 247 normal and malignant cells. A key feature of the approach was to isolate tags derived from sequences conserved across the human, mouse and rat genomes; this led to the identification of 301 candidate novel transcripts that may be integral to human pluripotent stem cells. These studies represent an important step in the development of high throughput approaches to the analysis of early human developmental processes and will be a strategic element in more comprehensive interspecies comparisons of ES cells to identify preserved control mechanisms.

Table of Contents

Abstract
Table of Contents
List of Tables
List of Figures
List of Abbreviations
Acknowledgements
Dedication
1. Introduction
  1.1 Human embryonic stem cells
    1.1.1 Overview of human embryonic stem cell biology
      1.1.1.1 Human embryonic stem cells and cell cycle regulation
      1.1.1.2 Human embryonic stem cells and cancer
    1.1.2 Contribution of mouse embryonic stem cells to the understanding of human embryonic stem cells
      1.1.2.1 LIF/gp130 signalling in mESC
      1.1.2.2 Transcriptional regulators of pluripotency in mESC
  1.2 Global gene expression profiling approaches in stem cells
    1.2.1 Overview of high throughput gene expression profiling platforms
    1.2.2 Large-Scale Genomic Approaches to the Study of Human Embryonic Stem Cells
  1.3 Objectives
  1.4 Specific aims and rationale
2. Catalogue of the undifferentiated hESC transcriptome
  2.1 Introduction
  2.2 Methods
    2.2.1 Cell Culture
    2.2.2 SAGE Library Construction
    2.2.3 SAGE Library Sequencing
    2.2.4 Tag-to-gene mapping
      2.2.4.1 Comprehensive mapping of SAGE tags
      2.2.4.2 Assigning functional annotations
  2.3 Results and discussion
    2.3.1 Database of expressed long and regular SAGE tags
    2.3.2 Analysis of unmapped tags
      2.3.2.1 Species ambiguous tags
    2.3.3 Detection of stem cell associated genes in WA09
    2.3.4 Developmental signalling pathway expression in embryonic stem cells
    2.3.5 Cell cycle regulation and programmed cell death pathways in hESCs
    2.3.6 HESC gene ontology
  2.4 Conclusions
3. Comparison between hESC and cancer/non-cancer differentiated cells/tissues
  3.1 Introduction
  3.2 Methods
    3.2.1 SAGE library downloads
    3.2.2 Cluster analysis
      3.2.2.1 Random sampling script
    3.2.3 Differential gene expression analysis
  3.3 Results and discussion
    3.3.1 Pair-wise library comparisons
    3.3.2 Isolation of differentially expressed tags
  3.4 Conclusions
4. Computational approach for the identification of candidate novel genes in undifferentiated hESC SAGE libraries
  4.1 Introduction
  4.2 Methods
    4.2.1 SAGE library acquisition and tag processing
    4.2.2 SAGE tag processing
    4.2.3 Comprehensive mapping of SAGE tags (CMOST)
    4.2.4 Tag mapping database construction
    4.2.5 BLAST analysis
    4.2.6 Mouse tag to gene mapping
  4.3 Results and Discussion
    4.3.1 Selection of tags for the isolation of candidate novel genes
    4.3.2 Mouse annotation of hESC tags
  4.4 Conclusions
List of Appendices
Bibliography

List of Tables

Table 1 In-vitro/vivo differentiation of Human Embryonic Stem Cells
Table 2 Candidate ES pluripotency genes confirmed or identified by global gene expression profiling (confirmed/identified by: down-regulation upon differentiation, high expression pattern in ES compared to differentiated tissues and/or RT-PCR)
Table 3 Summary of hESC long SAGE dataset
Table 4 Summary of CMOST data sources for tag-to-gene mapping
Table 5 Top 10 expressed SAT tags
Table 6 Short-list of most highly expressed stemness associated genes in the WA09 line (absolute tag counts listed; WA09 library size equals 441,795 tags sequenced)
Table 7 Detection of genes up-regulated in pluripotent stem cells and potential hESC markers (pluripotent stem cell associated genes, PAGs)
Table 8 Summary of Wnt ligand and receptor expression in the WA09 long SAGE library, a pooling of 11 hESC libraries generated from 8 stem cell lines (hESCs), and a pooling of 12 normal adult and fetal samples representative of 4 tissues (Differentiated)
Table 9 Expression of Wnt signalling components in the WA09 long SAGE library, a pooling of 11 hESC libraries generated from 8 stem cell lines (hESCs), and a pooling of 12 normal adult and fetal samples representative of 4 tissues (Differentiated)
Table 10 Differential gene expression of Wnt signalling components compared between the hESC metalibrary and differentiated metalibrary
Table 11 Differential gene expression of Wnt signalling components compared between the WA09 library and differentiated metalibrary
Table 12 Expression of the TGFβ signalling network ligands, receptors and transcriptional targets in the WA09 long SAGE library, a pooling of 11 hESC libraries generated from 8 stem cell lines (hESCs), and a pooling of 12 normal adult and fetal samples representative of 4 tissues (Differentiated)
Table 13 Expression of the TGFβ signalling network activators and inhibitors in the WA09 long SAGE library, a pooling of 11 hESC libraries generated from 8 stem cell lines (hESCs), and a pooling of 12 normal adult and fetal samples representative of 4 tissues (Differentiated)
Table 14 Differential gene expression of TGFβ signalling components compared between the hESC metalibrary and differentiated metalibrary
Table 15 Differential gene expression of TGFβ signalling components compared between the WA09 library and differentiated metalibrary
Table 16 LIF signalling network expression in human embryonic stem cells
Table 17 Differential gene expression of the cell cycle compared between the hESC metalibrary and differentiated metalibrary
Table 18 Differential gene expression of DNA repair mechanisms compared between the hESC metalibrary and differentiated metalibrary
Table 19 Differential gene expression of apoptosis pathways compared between the hESC metalibrary and differentiated metalibrary
Table 20 Genes that were differentially expressed in the autophagic cell death pathway compared between the hESC metalibrary and differentiated metalibrary
Table 21 Pearson correlation coefficients (r) between human embryonic stem cell line SAGE expression profiles
Table 22 Comparisons of short and extracted short SAGE libraries measured by Pearson correlation (r)
Table 23 Comparisons of long SAGE libraries measured by Pearson correlation (r)
Table 24 The intersection of the top most highly expressed tags in MAPC and hESC libraries
Table 25 Calculation of Pearson correlation between the hESC metalibrary and MAPC4 using Correlate (written by Allen Delaney, Gene Expression Informatics)
Table 26 Summary of the total number of tag sequences up- or down-regulated in the hESC SAGE libraries compared to ectoderm-derived normal CGAP (n-CGAP) libraries, mesoderm-derived n-CGAP libraries, and endoderm-derived n-CGAP libraries
Table 27 Up-regulated gene list in u-hESC metalibrary compared to n-CGAP libraries
Table 28 Genes up-regulated in hESC compared to ectoderm-derived libraries
Table 29 Differentially expressed in hESC versus ectoderm-derived libraries only
Table 30 Genes up-regulated in hESC compared to mesoderm-derived libraries
Table 31 Differentially expressed in hESC versus mesoderm-derived libraries only
Table 32 Genes up-regulated in hESC compared to endoderm-derived libraries
Table 33 Differentially expressed in hESC versus endoderm-derived libraries only
Table 34 CMOST Best Mapping results for 20,047 novel hESC tags
Table 35 Top 25 BLASTN hits (mouse MAF genomic regions against mouse RefSeq transcripts)
Table 36 Candidate mouse orthologous genomic sequences were analyzed using BLASTN against mouse RefSeq transcripts
Table 37 25 mouse transcripts identified by BLASTN analysis of candidate orthologous mouse sequences against mouse RefSeq transcripts
Table 38 TBLASTX analysis of candidate human genomic regions derived from the UCSC multiple alignment format (MAF)
Table 39 TBLASTX mouse hits associated with development/differentiation, proliferation, transcriptional regulation (DNA dependent and epigenetically), genomic stability, and cell cycle checkpoints

List of Figures

Figure 1 Differentiation of Human Embryonic Stem Cells
Figure 2 Retinoblastoma cell cycle control pathway
Figure 3 Jak/Stat Pathway
Figure 4 SAGE protocol
Figure 5 CMOST schematic of methodology adapted from http://www.bcgsc.ca/downloads/genex/DS/cmost_plugin_userdoc.htm
Figure 6 Distribution of WA09 tags sequenced (log scale Y-axis) classed according to gene expression level (X-axis)
Figure 7 The intersection between non-redundant human and mouse genomic sequences (UCSC) totaled 289,453 species ambiguous tags (SATs)
Figure 8 Detection of pluripotency genes and markers of differentiation
Figure 9 The Wnt signalling pathway expression in hESC long SAGE libraries (hESC metalibrary) and differentiated normal CGAP long SAGE libraries (n-CGAP metalibrary)
Figure 10 Expression of the TGFβ signalling pathway in hESC long SAGE libraries (hESC metalibrary) and differentiated normal CGAP long SAGE libraries (n-CGAP metalibrary)
Figure 11 The expression of LIF signalling pathway components in a pooling of 11 hESC SAGE libraries and a pooling of 12 normal adult and fetal SAGE libraries
Figure 12 Cell Cycle Expression. Cell cycle genes detected in the hESC metalibrary and differentiated metalibrary
Figure 13 Expression of DNA repair machinery in the u-hESC and n-CGAP metalibraries
Figure 14 Apoptotic programmed cell death pathway expression in hESC compared with n-CGAP
Figure 15 Autophagic cell death
Figure 16 Gene Ontology (GO) "slim" molecular functions expressed in the hESC metalibrary, hESC cell line WA09, and the differentiated metalibrary (n-CGAP metalibrary)
Figure 17 Trilaminar embryonic disc
Figure 18 Matrix file format and the fitch settings used in this analysis
Figure 19 Hierarchical clustering of short and extracted normal SAGE libraries
Figure 20 Hierarchical clustering of long SAGE libraries from CGAP and our own database of normal and malignant cells
Figure 21 Incremental random sampling of WA09(L) extracted short SAGE total tags
Figure 22 Incremental random sampling of WA09(L) extracted short SAGE total tags and normal differentiated CGAP (n-CGAP) extracted short SAGE libraries
Figure 23 Ensembl BlastView of sequence esc01
Figure 24 Ensembl BlastView of sequence esc02
Figure 25 The computational approach for the selection of candidate long SAGE tags to detect novel transcripts in hESC
Figure 26 Venn diagram listing the tag types for the hESC, cancer and normal SAGE library comparisons
Figure 27 Novel gene discovery candidate tags distribution of absolute gene expression levels
Figure 28 The computational method for annotating hESC candidate tags

List of Abbreviations

Abbreviation | Name
BCCA | British Columbia Cancer Agency
BER | Base excision repair
BLAST | Basic Local Alignment Search Tool
BLASTN | Nucleotide-nucleotide BLAST
BMP | Bone morphogenic protein
bp | base pairs
c-CGAP | Malignant CGAP SAGE libraries
cDNA | Complementary DNA
CGAP | Cancer Genome Anatomy Project
CHEK2 | CHK2 checkpoint homolog (S. pombe)
CMOST | Comprehensive mapping of SAGE tags
CNS | Central nervous system
DNA | Deoxyribonucleic acid
DNMT3B | DNA (cytosine-5-)-methyltransferase 3 beta
DPPA4 | Developmental pluripotency associated 4
EB | Embryoid body
ECC | Embryonic carcinoma cell
e.g. | Example given
ESC | Embryonic stem cell
EST | Expressed sequence tags
FC | Fold change
FGF | Fibroblast growth factor
FOX | Forkhead box transcription factors
G1 | Gap 1
G2 | Gap 2
GEO | Gene Expression Omnibus
GLGI | Generation of longer cDNA fragments from SAGE tags for gene identification
GNL3 | Nucleostemin
GO | Gene Ontology
gp130/IL6ST | Interleukin-6 signal transducer
GSC | Genome Sciences Centre
hCG | Human chorionic gonadotropin
hECC | Human embryonic carcinoma cell
hESC | Human embryonic stem cell
HMG | High mobility group
HRR | Homologous recombination repair
HSC | Haematopoietic stem cell
ICM | Inner cell mass
IL-6 | Interleukin-6 cytokines
irr-MEFs | Irradiated mouse embryonic fibroblasts (inactive MEFs)
JAK | Janus tyrosine kinases
Kb | kilobase
LIF | Leukemia inhibitory factor
LIFR | Leukemia inhibitory factor receptor
ln | Natural log
M | Mitotic phase
MAF | University of California Santa Cruz (UCSC) multiple alignment format
MAPC | Multipotent adult progenitor cell
Mb | megabase
MEFs | Mouse embryonic fibroblasts (mouse feeder layers)
mESC | Mouse embryonic stem cells
MGC | Mammalian Gene Collection
MPSS | Massively parallel signature sequencing
n/a | Not applicable/available
NCBI | National Center for Biotechnology Information
n-CGAP | Normal adult and fetal CGAP SAGE libraries
NER | Nucleotide excision repair
NHEJ | Non-homologous end joining
NIH | National Institutes of Health
ORFs | Open reading frames
PAG | Pluripotent stem cell associated genes
PAGE | Polyacrylamide gel electrophoresis
PBS | Phosphate buffer solution
PCR | Polymerase chain reaction
Perl | Practical extraction and report language
PGC | Primordial germ cell
PNS | Peripheral nervous system
POU5F1 | POU (Pit Oct Unc) domain, class 5, transcription factor 1
preHEP | Pre-hepatocyte-like cells
preNEU | Pre-neuronal-like cells
r | Pearson correlation coefficient
r² | Coefficient of determination
RACE | 3'/5' rapid amplification of cDNA ends
RB | Retinoblastoma
RefSeq | Reference Sequence project
RNA | Ribonucleic acid
rRNA | Ribosomal RNA
RT-PCR | Reverse transcriptase polymerase chain reaction
S | Synthesis phase
SAGE | Serial analysis of gene expression
SAT | Species ambiguous tags
SHP | Protein tyrosine phosphatases
siRNAs | Small-interfering RNA molecules
SOX2 | SRY (sex determining region Y)-box 2
SSEA | Stage specific embryonic antigen
STAT | Signal transducer and activator of transcription
TBLASTX | Translated query vs. translated database
TDGF1 | Teratocarcinoma-derived growth factor 1
TERT | Telomerase reverse transcriptase
TGFβ | Transforming growth factor beta
TRA | Tumour rejection antigen
TS | Trophoblast cells
UCSC | University of California Santa Cruz
u-hESC | undifferentiated human embryonic stem cells
UTF1 | Undifferentiated embryonic stem cell transcription factor
UTR | Untranslated region
WNT | Wingless-type MMTV integration site family
ZFP42 | Zinc finger protein 42

Acknowledgements

Dr. Marco Marra and Dr. Steven Jones, both for supporting and directing my research during my graduate studies at the BCCA GSC. I sincerely thank you for encouraging me to pursue my MSc and providing me the opportunity to learn more about bioinformatics and its application to global transcription profiling in Medical Genetics. Thanks especially for bearing with me as I continued to add analysis after analysis, page after page to my thesis.

The BCCA Genome Sciences Centre (GSC). Many thanks to the laboratory and bioinformatics groups at the GSC. Without your efforts, both in generating high quality and high throughput data and analysis tools, my work would not be possible.
Many individuals have provided me with invaluable advice during the course of my studies and have succeeded in fostering a collaborative and positive work environment.

Dr. Pamela Hoodless and Dr. Keith Humphries, for their support and input as members of my thesis advisory committee.

Genome British Columbia and the National Cancer Institute (USA), for funding this project.

Dedication

This work is dedicated to my parents, Maria and Donald Schnerch, for their support and encouragement in every aspect of my personal and professional development.

This work is also dedicated to Dr. Brian Yang for his caring, patience, and unconditional support during my successes and struggles as a graduate student.

1. Introduction

1.1 Human embryonic stem cells

1.1.1 Overview of human embryonic stem cell biology

Stem cells are defined in part by their ability to give rise to progeny of more restricted developmental potential. The potential to differentiate is greatest in the totipotent fertilized egg, the ancestor of all embryonic and extra-embryonic cell types. This potential progressively decreases according to a developmental timeline. The pluripotent stem cell has a more restricted developmental potential. Members of this category include primordial germ cells (PGC), embryonic carcinoma cells (ECC) and embryonic stem cells (ESC). Pluripotent stem cells may differentiate to several different cell types of embryonic origin but not extra-embryonic origin, and thus are unable to give rise to entire organisms in and of themselves. In 1998, James A. Thomson first reported the isolation of 5 immortalized human embryo-derived pluripotent cell lines, termed human embryonic stem cells (hESCs) (Thomson et al., 1998). HESCs are derived from the inner cell mass of the pre-implantation blastocyst (5 days post-fertilization), which gives rise to the embryo proper.
HESCs maintain an undifferentiated state and proliferate for many passages in culture (near indefinitely). The presence of alkaline phosphatase, high levels of telomerase activity and a sustained normal karyotype are key attributes of the hESC lines derived by Thomson et al. (1998). Additionally, these cells express an array of cell surface antigens and molecular markers responsible for maintenance of an undifferentiated and pluripotent phenotype. The cell surface antigens are the stage specific embryonic antigen (SSEA)-3, SSEA-4, tumour rejection antigen (TRA)-1-60 and TRA-1-81 (Reubinoff et al., 2000; Thomson et al., 1998). Molecular markers of the undifferentiated state consist of a suite of transcription and growth factors, namely: POU domain, class 5, transcription factor 1 (POU5F1 or OCT3/4), teratocarcinoma-derived growth factor 1 (TDGF1 or Cripto), SRY (sex determining region Y)-box 2 (SOX2), fibroblast growth factor 4 (FGF4) and zinc finger protein 42 (ZFP42 or REX1). Characteristic of pluripotent cells, hESCs differentiate into various specialized derivatives of the three embryonic germ layers: endoderm, mesoderm and ectoderm. This capability has been demonstrated extensively (Figure 1) (Table 1) (Assady et al., 2001; Carpenter et al., 2004; Lebkowski et al., 2001; Levenberg et al., 2002; Reubinoff et al., 2000; Zhang et al., 2001) and promises to be exploitable in future therapeutic applications, particularly in regenerative medicine.

1.1.1.1 Human embryonic stem cells and cell cycle regulation

Human and mouse embryonic stem cells (mESCs) have unconventional cell cycles compared to somatic cells. MESCs proliferate rapidly, doubling every 8 to 12 hours. HESC cycle times are longer than in mouse; doubling every 35-40 hours, they are highly proliferative nonetheless. Somatic cells are also capable of rapid cycle times.
For example, cultured mammalian fibroblasts have an approximate cell-cycle time of 20 hours (Alberts et al., 1998), which raises the question: what distinguishes the cell cycle of embryonic stem cells from that of somatic cell types? The answer can be attributed to alterations in cell cycle regulation. Checkpoints exist before each transition to a new phase of the cell cycle. At these checkpoints a cell is either detained from further progression through the cycle or released to continue its passage if various conditions have been met (e.g., an adequate cell size, undamaged DNA, favourable environment etc.). The retinoblastoma (RB) protein regulates the checkpoint between the G1 and S phase of the cell cycle (Goodrich et al., 1991; Hatakeyama and Weinberg, 1995; Weinberg, 1995) (Figure 2). RB prevents cell cycle progression by inhibition of E2F transcription factors (Dyson, 1998; Fattaey et al., 1993; Helin, 1998; Helin et al., 1993; Johnson et al., 1994; Narita et al., 2003), ultimately preventing cell proliferation. Cell cycle arrest can be relieved upon growth factor signalling and activation of specific cyclins and cyclin-dependent kinases (CDKs) that disrupt the association between RB and E2F by hyper-phosphorylation of RB (Buchkovich et al., 1989). In both human and mouse ESCs, the RB pathway is inactive, resulting in a noticeably shortened G1 phase and rapid cellular proliferation. HESCs may express an inhibitor of RB, termed nucleostemin, preventing cell cycle arrest (Tsai and McKay, 2002).

Table 1 In-vitro/vivo differentiation of Human Embryonic Stem Cells. Summary of directed differentiation of hESCs to a number of cells/tissue types representative of endoderm, mesoderm and ectoderm origins.

Cell/tissue type | Reference
Astrocytes | (Zhang et al., 2001)
Bone | (Itskovitz-Eldor et al., 2000; Thomson et al., 1998)
Cardiomyocytes | (Kehat et al., 2003; Mummery et al., 2002; Xu et al., 2002a)
Cartilage | (Itskovitz-Eldor et al., 2000; Thomson et al., 1998)
Embryoid bodies | (Itskovitz-Eldor et al., 2000; Thomson et al., 1998)
Endothelial cells | (Levenberg et al., 2002)
Erythrocytes | (Itskovitz-Eldor et al., 2000; Thomson et al., 1998)
Fetal glomeruli | (Itskovitz-Eldor et al., 2000; Thomson et al., 1998)
Ganglia | (Itskovitz-Eldor et al., 2000; Thomson et al., 1998)
Granulocytes | (Itskovitz-Eldor et al., 2000; Thomson et al., 1998)
Gut | (Itskovitz-Eldor et al., 2000; Thomson et al., 1998)
Hair | (Itskovitz-Eldor et al., 2000; Thomson et al., 1998)
Haematopoietic colony-forming cells | (Kaufman et al., 2001)
Hepatocyte/Hepatocyte-like cells | (Lavon and Benvenisty, 2005; Rambhatla et al., 2003)
Insulin producing beta cells | (Assady et al., 2001)
Keratinizing squamous epithelium | (Itskovitz-Eldor et al., 2000; Thomson et al., 1998)
Macrophage | (Itskovitz-Eldor et al., 2000; Thomson et al., 1998)
Megakaryocytes | (Itskovitz-Eldor et al., 2000; Thomson et al., 1998)
Neural epithelium | (Itskovitz-Eldor et al., 2000; Thomson et al., 1998)
Neural progenitor cells | (Carpenter et al., 2001; Itskovitz-Eldor et al., 2000)
Neurons | (Carpenter et al., 2001; Itskovitz-Eldor et al., 2000; Reubinoff et al., 2000; Schuldiner et al., 2000; Zhang et al., 2001)
Oligodendrocytes | (Zhang et al., 2001)
Osteoblasts | (Sottile et al., 2003)
Respiratory epithelium | (Itskovitz-Eldor et al., 2000; Thomson et al., 1998)
Smooth muscle | (Itskovitz-Eldor et al., 2000; Thomson et al., 1998)
Striated muscle | (Itskovitz-Eldor et al., 2000; Thomson et al., 1998)
Trophoblasts | (Thomson et al., 1998; Xu et al., 2002)
Yolk sac | (Thomson et al., 1998)

Figure 1 Differentiation of Human Embryonic Stem Cells. Embryo-derived pluripotent stem cells have been directed to differentiate to embryoid bodies and derivatives of the three embryonic germ layers.

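The doubling times quoted above (8 to 12 hours for mESCs, 35 to 40 hours for hESCs, about 20 hours for cultured fibroblasts) can be put side by side with a small calculation. The following is only an illustrative sketch: it assumes idealized, uninterrupted exponential growth with no cell death or lag phase, and the names `fold_expansion`, `expansion_range` and `DOUBLING_TIMES_H` are hypothetical helpers introduced here, not part of the thesis.

```python
def fold_expansion(doubling_time_h: float, elapsed_h: float) -> float:
    """Fold increase in cell number after elapsed_h hours.

    Assumes idealized exponential growth: N(t) = N0 * 2 ** (t / Td).
    """
    if doubling_time_h <= 0:
        raise ValueError("doubling time must be positive")
    return 2.0 ** (elapsed_h / doubling_time_h)


# Doubling times in hours as quoted in the text, stored as (fastest, slowest).
DOUBLING_TIMES_H = {
    "mESC": (8.0, 12.0),
    "hESC": (35.0, 40.0),
    "fibroblast": (20.0, 20.0),
}


def expansion_range(cell_type: str, elapsed_h: float) -> tuple:
    """Return (lowest, highest) fold expansion for the quoted doubling-time range."""
    fastest, slowest = DOUBLING_TIMES_H[cell_type]
    return fold_expansion(slowest, elapsed_h), fold_expansion(fastest, elapsed_h)


if __name__ == "__main__":
    # Compare the three cell types over five days (120 hours) of culture.
    for name in DOUBLING_TIMES_H:
        low, high = expansion_range(name, 120.0)
        print(f"{name}: {low:.1f}x to {high:.1f}x after 120 h")
```

Under these assumptions a cultured fibroblast population would out-expand hESCs over the same period, which is consistent with the point made in the text that cycle time alone does not distinguish embryonic stem cells from somatic cells.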
In a recent study of gene expression in hESCs, RB, pro-apoptotic genes and regulators of the p53 pathway were expressed at low levels while positive regulators of the cell cycle were highly expressed (Brandenberger et al., 2004a).

Figure 2 Retinoblastoma cell cycle control pathway. [Diagram; recoverable labels: cell cycle phases G0, G1, S, G2 and M; G1/S phase transition; G2/M transition; FGF; CDK/Cyclin complex; RB; E2F; differentiation; checkpoint control; DNA repair; DNA replication; apoptosis.]

1.1.1.2 Human embryonic stem cells and cancer

Pluripotent stem cells and cancer cells are both immortal and tumorigenic. Similarities observed between hESC and cancer may be attributed to the expression of similar sets of genes involved in cell cycle regulation, apoptosis, and cellular senescence. A key step prior to the development of some cancers is the loss of the RB pathway via genetic changes (Classon and Harlow, 2002) while in hESC the RB pathway is naturally inactivated. Cancer and hESC immortality may also be attributed to telomerase reverse transcriptase (TERT) activity. In postnatal somatic cells, TERT is normally repressed; consequently telomere ends become progressively shorter after each cell division, resulting in a finite number of times a cell can divide before it is targeted for cell death. Deregulation of TERT may be involved in oncogenesis. Again, where cancer involves genetic deregulation of key genes, hESCs "naturally" maintain TERT activity in culture, serving the same purpose of extending the proliferative lifespan in both cell types. Similarities between stem cells and tumour cells have led to the hypothesis that deregulated stem cells may lead to cancer development (Brickman and Burdon, 2002; Mullor et al., 2002; Pardal et al., 2003).
Thus, an additional and equally important advantage arising from the derivation of hESCs is their utility as a model for understanding cancer development, from which novel therapeutic targets may be discovered for use in future cancer treatments. The promise of hESCs has stimulated interest in thoroughly understanding the biology of the cell lines currently available (http://stemcells.nih.gov/research/registry/; NIH Human Embryonic Stem Cell Registry codes: BG01, BG02, BG03, SA01, SA02, ES01, ES02, ES03, ES04, ES05, ES06, MI01, RE03, TE04, TE06, UC01, UC06, WA01, WA07, WA09, WA13, and WA14). Two current areas of study in hESC research are uncovering the gene functions directing differentiation down specific developmental paths and elucidating the gene functions that maintain hallmark ES attributes.

1.1.2 Contribution of mouse embryonic stem cells to the understanding of human embryonic stem cells

Relatively little is known about the genes governing properties unique to embryonic stem cells. The study of mouse embryonic stem cells (mESC) (Evans and Kaufman, 1981; Martin, 1981) uncovered a handful of genes involved in maintaining pluripotency. A caveat of applying lessons learned in mESC to human studies is that embryonic development and embryonic cell types are not equivalent between mice and humans (Fougerousse et al., 2000; Ginis et al., 2004). Nonetheless, genes uncovered in the mouse have provided a short-list of factors that may be functionally conserved in hESC.

1.1.2.1 LIF/gp130 signalling in mESC

The JAK/STAT signalling pathway is well characterized in mESC and is initiated extrinsically by the leukemia inhibitory factor (LIF) (Smith et al., 1988; Williams et al., 1988) or related family members of the interleukin-6 (IL-6) type cytokines (Burdon et al., 1999a; Burdon et al., 1999b) (Figure 3). Paracrine signalling through LIF is a result of culturing mESC on a monolayer of mouse embryonic fibroblasts.
LIF is bound by a cell surface receptor heterodimer composed of the LIF receptor (LIF-R) and gp130. The activated receptor complex subsequently recruits various Janus tyrosine kinases (JAKs). JAKs, in turn, can recruit the STAT3 (signal transducer and activator of transcription) protein and various SHPs (protein tyrosine phosphatases) to initiate pathways that maintain self-renewal and suppress differentiation. HESCs, though they can be similarly cultured on MEFs, do not require LIF or signalling via gp130 to maintain an undifferentiated state (Reubinoff et al., 2000; Thomson et al., 1998). Downstream of LIF, the extent to which STATs and JAKs do or do not affect the hESC phenotype is not fully known.

Figure 3 Jak/Stat Pathway. Depicted are the ligands, receptors and intracellular components of the Jak/Stat pathway. The progression from pathway activation via LIF ultimately to transcriptional activation of stem cell maintenance genes is shown (Heinrich et al., 1998).

1.1.2.2 Transcriptional regulators of pluripotency in mESC

Gene expression analyses of mESCs have identified a number of transcriptional regulators of pluripotency. An established marker for undifferentiated embryonic stem cells in mouse and humans is POU5F1 (also known as OCT3/4). POU5F1 is expressed throughout the totipotent cells of the early mouse embryo. The first differentiation event amongst these cells results in the formation of the blastocyst, composed of an extraembryonic layer, the trophoectoderm, and cells that will ultimately become the embryo proper, the inner cell mass. Not until this event occurs does POU5F1 expression become restricted to the inner cell mass of the pre-implantation blastocyst and downregulated in the trophoectoderm (Nichols et al., 1998). Later on in development, POU5F1 expression is rapidly downregulated upon differentiation of the inner cell mass (ICM), making its presence a defining characteristic of totipotent and pluripotent cell types.
POU5F1 is critical for preventing the differentiation of totipotent and pluripotent cells of the embryo (Pesce et al., 1998a; Pesce et al., 1998b). The gene functions in maintenance of pluripotency and self-renewal in embryonic stem cells via repression and activation of select differentiation factors and stem cell maintenance genes respectively (for a concise review see Pesce and Scholer, 2001). Targets for repression include the human chorionic gonadotropin (hCG) α and β subunits (Liu et al., 1997; Liu and Roberts, 1996). Release from repression of the α and β hCG subunits may be an initial event in differentiation to the trophoectoderm and later derivatives (Pesce and Scholer, 2001). FOXD3, of the forkhead box (FOX) family of proteins, is expressed in embryonic stem cells and in the embryonic neural crest. FOXD3 activates other FOX family proteins, which may initiate early embryonic lineage decisions such as differentiation to endoderm and endodermal organogenesis. POU5F1 can interfere with FOXD3 binding domains to repress its ability to activate downstream FOX proteins and ultimately repress hESC differentiation to endodermal derivatives. Through repression of various differentiation factors, POU5F1 can act as a gatekeeper of the pluripotent state in embryonic stem cells, rendering their potential to become multiple cell types quiescent.

POU5F1 also serves to transcriptionally activate several downstream targets such as ZFP42, Creatine kinase B, Makorin 1, Importin β, Histone H2A.Z and the ribosomal protein S7 (Du et al., 2001). Expression of many of these genes is not necessarily restricted to the ICM or embryonic stem cells, and their role in maintaining pluripotency is unknown. The zinc finger protein, ZFP42, however, is developmentally regulated and known to be associated with pluripotent cells (Rogers et al., 1991).
Interestingly, ZFP42 is capable of binding DNA to regulate transcription and, like other downstream targets of POU5F1, may be necessary for stem cell maintenance, although its role remains unclear. In conjunction with the high-mobility group (HMG) domain protein SOX2, POU5F1 can synergistically activate transcriptional targets (Ambrosetti et al., 1997). One confirmed target of POU5F1 and SOX2 is the fibroblast growth factor 4 (FGF4) gene, which encodes a secreted signalling molecule. FGF4 is involved in the viability of the blastocyst, in the development of the heart, as well as in the outgrowth and patterning of the developing mouse limb (Feldman et al., 1995; Niswander et al., 1993). Thus, regulation of this growth factor may have a role in ESC survival and/or differentiation. The gene encoding the undifferentiated embryonic cell transcription factor (UTF1) is an additional target for synergistic co-regulation by POU5F1 and SOX2 (Nishimoto et al., 1999). UTF1 is expressed mainly in pluripotent cell types, particularly mouse and human ESC. The precise role of UTF1 in early embryonic development or in the maintenance of ESC is not understood. SOX2 can also function to antagonize POU5F1 activation of transcription. In the case of the Osteopontin (OPN) gene, shortly before differentiation of cells of the ICM to primitive endoderm, POU5F1 is up-regulated and can form homodimers on the OPN enhancer, leading to its expression (Botquin et al., 1998). SOX2 is additionally able to bind to the OPN enhancer to form a complex with POU5F1 that disrupts enhancer activity and represses OPN expression (Ambrosetti et al., 1997). Upon the proper developmental cue, SOX2 is downregulated in the ICM prior to POU5F1, possibly providing a mechanism by which POU5F1 can regulate the formation of the primitive endoderm in mouse embryos. Several genes may interact to stabilize the POU5F1 complex with DNA or to act as bridging factors in connection with transcriptional machinery.
Other members of the high-mobility group protein family, HMG1 and HMG2, have been shown to interact with POU proteins. Both proteins may act with POU5F1 to facilitate and/or cement DNA binding or to activate a transactivation domain (Pesce et al., 1999). Unique to embryonic stem cells, POU5F1 can bind target genes distal to transcriptional start sites to initiate gene expression (Scholer et al., 1991). This suggests other co-activators may connect POU5F1 to transcriptional machinery by acting as bridging factors. Several viral genes have been proposed as bridging factors in differentiated cell types, such as the adenoviral E1A protein (Scholer et al., 1991). The particular bridging factors that link POU5F1 to transcriptional machinery in embryonic stem cells have yet to be elucidated in vivo, but it is probable that they would bear strong functional similarity to E1A.

Not all of the genes identified as important to ESC maintenance are intimately associated with POU5F1. Recently, Nanog was identified in mESCs to maintain pluripotency and self-renewal independently of the STAT3/LIF pathway (Chambers et al., 2003; Mitsui et al., 2003). The gene shares homology to homeodomain-containing transcription factors of the NK2 family, which are implicated in several aspects of cell type specification and maintenance of differentiated tissue (Wang et al., 2003). Nanog expression is limited to ESC and a small selection of tissue types including teratocarcinoma cells, germ cells and various tumour types (Chambers et al., 2003; Mitsui et al., 2003). Though the gene does not require POU5F1 for its expression, both Nanog and POU5F1 appear to be required in combination to effect self-renewal and pluripotency (Cavaleri and Scholer, 2003). A human ortholog of Nanog was also uncovered, bearing 85% sequence identity in its functional domain to the mouse gene in a syntenic region, and may function similarly in hESCs.
Our understanding of human embryology has previously relied on studies of embryogenesis in model organisms, namely the mouse. Many developmentally important genes and pathways demonstrate high evolutionary conservation, providing justification for the use of the mouse to provide insight on early human development. This justification has also extended to the study of hESC. Expression studies in mESC have provided a handful of candidate genes that maintain pluripotency and are conserved in hESC. However, several differences between human and mouse embryology exist both morphologically and molecularly (Fougerousse et al., 2000; Ginis et al., 2004). For this reason, it is not surprising that some key molecular pluripotency pathways that exist in the mouse are not conserved in the human, the most obvious example being the LIF/gp130 signalling pathway. With the isolation of hESC, investigators can now begin to generate an accurate model of the mechanisms involved in early human embryology through global gene expression profiling techniques.

1.2 Global gene expression profiling approaches in stem cells

1.2.1 Overview of high throughput gene expression profiling platforms

The genomics era ushered in the development of high throughput platforms to characterize gene expression differences between normal and diseased states, pharmacologically treated and untreated cells/tissues, and between different developmental stages in human and model organisms. In relation to the study of embryonic stem cells, transcriptome analysis consists of comparing undifferentiated ESC with differentiated derivatives, somatic, or non-pluripotent cells.
Current technologies capable of characterizing transcriptomes fall under two broad categories: hybridization based approaches (cDNA and oligonucleotide microarray) (Lipshutz et al., 1999; Lockhart et al., 1996; Schena et al., 1998) and sequence based approaches (EST sequencing projects, massively parallel signature sequencing (MPSS) and serial analysis of gene expression (SAGE)) (Brenner et al., 2000; Velculescu et al., 1999; Velculescu et al., 1995).

1.2.2 Large-Scale Genomic Approaches to the Study of Human Embryonic Stem Cells

Recently a handful of groups have approached the study of a number of hESC lines using the aforementioned large-scale genomic technologies. Table 1 (located at the end of this section) provides a summary of all the genes discussed below. One such study, from the lab of James A. Thomson, described the expression profiles of 5 hESC lines (NIH Human Embryonic Stem Cell Registry code: WA01, WA07, WA09, WA13 and WA14) and human embryonic carcinoma cell (hECC) lines using microarray technology (Sperger et al., 2003). One major goal of the analysis was to identify genes specifically expressed at higher levels in pluripotent cell types. In particular, Thomson's group found a set of genes both highly expressed and shared between hESC lines and hECC lines (895 genes), which may represent genes important in maintaining pluripotency. Not surprisingly, POU5F1 was one of the most highly expressed genes in hESC and hECC lines. DPPA4, which has been similarly identified in several other gene expression studies in hESC lines (Richards et al., 2004; Sato et al., 2003), was also highly co-expressed with POU5F1 based on comparisons of the hESC and hECC lines with microarray profiles from 29 germ cell tumour lines, 14 samples of normal testis and 17 somatic cell lines. The most highly expressed genes in hESC and hECC lines identified in this study included DNMT3B, FOXD3, and SOX2.
These genes were identified among the most highly expressed in additional gene expression profiles of hESCs (Brandenberger et al., 2004a; Brandenberger et al., 2004b; Carpenter et al., 2004; Richards et al., 2004; Sato et al., 2003). Representatives from signalling pathways identified in this study that may be important in maintaining undifferentiated hESC (u-hESCs) included: Frizzled 7/8 (Wnt/β-catenin pathway), fibroblast growth factor (FGF) receptor genes 1-4 (FGF signalling pathway), and the bone morphogenetic protein (BMP) receptor, type 1A (BMP signalling pathway).

Sato et al. (2003) utilized Affymetrix GeneChip technology (Lockhart et al., 1996) to compare the expression profiles of the WA01 hESC line (Thomson et al., 1998) to published mESC SAGE data (R1 line) (Anisimov et al., 2002). They proposed that many properties are likely shared between human and mouse pluripotent cells and reported their list of evolutionarily conserved molecular factors. This study sought to elucidate genes involved in maintaining a pluripotent state. The undifferentiated WA01 expression profile was compared to that of WA01 cells differentiated into embryoid bodies (EBs), into neurons, or by non-lineage directed differentiation. 918 genes were enriched in u-hESCs compared to non-lineage directed differentiated hESCs (footnote 1). Genes confirmed by RT-PCR to be downregulated upon differentiation included: POU5F1, TDGF1, Id-IH, Jumonji, Lefty A, FGF13, Sprouty, Hey2, SOX2, FGF2, Thy1, ADCY2, BIRC5, CHEK2, MCF1, PAK1, PTTG1, LCK, and LDB2. Notable examples of highly enriched genes in WA01 include: POU5F1, Lefty A (a downstream target of POU5F1), and ZFP42. Nearly a quarter of enriched genes were identified as uncharacterized ESTs; thus many of the genes potentially involved in maintaining embryonic stem "cellness" have not been extensively studied.
Several ligands and receptors of signalling pathways (e.g., FGF, BMP, TGFβ negative regulators, and WNT pathways) are additionally among the highly enriched genes in WA01.

Footnote 1: Enriched genes are significantly downregulated in WA01 differentiated cell populations; of these 918 genes/transcripts only 42 were absent across all differentiated WA01 populations.

The intersection between genes enriched in u-hESCs and genes enriched in mouse embryonic stem cells was identified. 227 genes were enriched in both species and included many ESTs of unknown function, members of the BMP/TGFβ signalling pathway, components of the chromatin-remodelling machinery, phosphatidyl-inositol signalling as well as several proposed pluripotent marker genes (POU5F1, TDGF1, DPPA4, and CHEK2). 691 genes in WA01 either did not have a mouse ortholog or were not enriched in mESCs. For example, SOCS1, an inhibitor of the STAT3 pathway, the ultimate effector of LIF signalling, was enriched in WA01 compared to mESCs. Thus, this gene may account for one major difference between human and mouse ES cells, the inability of LIF to maintain an undifferentiated state in hESC. The remaining genes, outside of this intersection, may represent human specific markers and provide an explanation for various species differences in ES "cellness".

The properties of four feeder-free hESC lines, WA01, WA07, WA09 and WA14 (Thomson et al., 1998), were investigated using various methodologies such as RT-PCR, microarray, and cDNA library sequencing (Carpenter et al., 2004). hESC lines are traditionally cultured on irradiated feeder layers of mouse embryonic fibroblasts (MEFs). The feeder-free lines used in this study express the same cell-surface markers as MEF-cultured lines and a suite of pluripotency molecular markers. Such markers include: CD9, CD90, hTERT, POU5F1, SOX2, ZFP42, UTF1, TDGF1, and BCRP.
cDNA libraries were generated from u-hESCs and EBs; they demonstrated yet again that POU5F1 and SOX2 are highly abundant in ES and down-regulated upon differentiation. Carpenter et al. observed that hESC are tightly adhered; upon further investigation they discovered high expression levels of gap junction genes and functional gap junctions in WA01 and WA09. Microarray analysis of WA01, WA07 and WA09 demonstrated that most genes were similarly expressed in all lines. There was no evidence of genes that were significantly differentially expressed and unique to a particular cell line. It is important to note that only 2,802 cDNA clones were used for this analysis, as genes with low expression were filtered out computationally. It is plausible that cell-line specific differences lie in genes expressed at low levels; thus a more comprehensive analysis of commonalities between hESC is required.

Richards et al. (2004) constructed short SAGE (14mer tag sequences) libraries to investigate the transcriptome of two human pluripotent stem cell lines, ES03 and ES04 (ES Cell International; http://www.escellinternational.com). Both hESC lines were compared to each of 21 normal and cancer human SAGE libraries in an effort to elucidate genes differentially expressed in ES. Based on SAGE results of genes potentially enriched in hESC, RT-PCR of candidate hESC-specific marker genes was performed in u-hESC, differentiated hESC, and various human fetal and adult somatic tissue types. To identify genes evolutionarily conserved in pluripotent embryonic stem cells the R1 mESC line was compared to ES03 and ES04. Richards et al. utilized UniGene (http://www.ncbi.nlm.nih.gov/) to map tags to genes and LocusLink (http://www.ncbi.nih.gov/entrez/) to classify the molecular functions of tag-to-gene mappings.
Additionally, CGAP SAGE Genie (http://cgap.nci.nih.gov/SAGE/AnatomicViewer) and the NCBI SAGEmap (http://www.ncbi.nih.gov/SAGE/) were utilised to select the best tag for ambiguous tag-to-gene mappings. However, even with these databases, over 13% of tags lacked a reliable assignment to a UniGene cluster (singleton tags were excluded from this analysis). The use of UniGene may complicate tag-to-gene mappings as it contains sequence redundancies. Furthermore, a major confounding factor introduced by the short SAGE technology is the ambiguity of tag-to-gene mappings due to the lack of gene specificity in 14 bp tags. This shortcoming has been partially addressed by the introduction of long SAGE (Saha et al., 2002), which provides significantly increased accuracy in tag-to-gene mapping.

A combined total of 145,015 tags, corresponding to 31,852 transcripts, were sequenced for ES03 and ES04 (67,807 tags and 77,208 tags sequenced respectively). Of the 31,852 distinct tags, 64.2% (20,447 tags) were singletons. Nearly half of all singletons mapped to ESTs, hypothetical genes and some known transcripts. Notable transcripts mapped by a singleton tag include FOXD3 and GBX2, transcription factors that are highly expressed in mouse embryonic stem cells and in the inner cell mass of mouse blastocysts. Singleton tags were ultimately excluded from further analysis.

Richards et al. (2004) sought to validate the hESC lines' pluripotent phenotype by assaying for the presence/absence of SAGE tags corresponding to a variety of markers of pluripotency and of early embryonic differentiation. This study proposed the following genes as candidate hESC markers (footnote 2): POU5F1, SOX2, NANOG, ZFP42 (present only in ES03), HESX1, FLJ14549, FLJ21837, DPPA4, TGIF, DNMT3B, LIN-28, NPM1, TDGF1, GDF3, CHEK2, OC90, CLDN6, GJA1, CKS1B, ERH, HMGA1 and TNFRSF6. These genes are not exclusive to hESC or the inner cell mass of pre-implantation embryos. In particular, NANOG, HESX1, FLJ21837, DPPA4, LIN-28, CLDN6, GJA1, CKS1B, ERH, HMGA1 and TNFRSF6 are expressed at low levels in various differentiated tissues. The majority of these genes also show only slight to modest declines upon hESC differentiation. Only OC90 and FLJ14549 demonstrated a significant decrease in expression upon hESC differentiation (based on RT-PCR analysis of candidate markers) and were not present in the various differentiated tissue types assayed. Genes associated with differentiation were detected in the ES03 SAGE library, namely LECT1, TGFα, and IFRD1.

Footnote 2: Candidate hESC markers were investigated by semiquantitative RT-PCR in undifferentiated and differentiated ES03 and ES04 lines. Many of these genes were present only in the undifferentiated stem cell lines and/or showed a marked decrease during stem cell differentiation.

Richards et al. (2004) also addressed the differences in gene expression profiles between ES03 and ES04. Most differentially expressed genes were expressed at low levels (tag counts of <3). Differential expression of splice isoforms was also evident between hESC lines, e.g., basic transcription factor 3 (BTF3). The paper also investigated similarities between mouse and human ES cells. Transcription factors originally defined in mouse ES cells to be associated with pluripotency, such as SOX2, HESX1, UTF1, POU5F1, and ZFP42, were expressed at consistently higher levels in hESC. As expected, members of the leukemia inhibitory factor signalling pathway as well as FGF4, TDGF1, GBX2, and Nanog were expressed at higher levels or uniquely present in mouse ES cells. The differences demonstrated in this paper at a molecular level further strengthen the argument against mouse ES cells as an accurate model to study human embryonic pluripotency.
A comprehensive hESC transcriptome profiling study was published recently and employed MPSS and EST sequencing of three feeder-free cell lines, WA01, WA07 and WA09 (Brandenberger et al., 2004a; Brandenberger et al., 2004b). MPSS employs ligation based sequencing of 16-20 bp sequences ('signatures') bound to microbead arrays to enable the rapid sequencing of millions of signatures in parallel (Brenner et al., 2000; Jongeneel et al., 2003). The depth of MPSS sampling approached 3 million sequenced signatures 17 bp in length, yet only corresponded to 22,136 distinct signatures; 3% of signatures were unmapped to transcripts and genomic sequences. Housekeeping genes were unsurprisingly among the most highly expressed. More importantly, known u-hESC markers, SOX2, DNMT3B, and POU5F1, were also among the top 200 expressed genes. Represented in the top 200 expressed genes were the FGF signalling pathway and the Ras pathway. The high expression of FGFR1 indicated that hESC may utilize a number of FGFs, but there is an apparent requirement for basic FGF (FGF2) in hESC. Differences between mouse and human ES were demonstrated, notably the inactivity of the LIF pathway in hESC (shown by the lack of expression of a number of constituents in MPSS and EST libraries). Additionally, ERAS, PEPP1 and PEPP2, which are present in mESC, were undetected in the pooled hESC sample. However, PEPP1 and PEPP2 were detectable in our own SAGE data (discussed below); therefore the inability to detect these transcripts is a limitation of MPSS and is not necessarily biologically significant. Several Wnt and TGFβ signalling network constituents were expressed and are likely to be important in ESC maintenance. High levels of soluble frizzled receptors and E-cadherin of the Wnt signalling pathway were detected.
In general, the analysis revealed that many signalling pathway transcripts are expressed; however, so are their negative regulators, which may support the view that transcriptional repression maintains an undifferentiated state in hESC. Brandenberger et al. (2004b) compared the MPSS library to an MPSS database of 36 human tissues and cell lines to look for genes unique to or over-expressed in hESC. They discovered 13 highly enriched uncharacterized genes that were expressed in u-hESCs and down-regulated upon differentiation. These genes were generally absent in other cell types, making them good candidate markers for u-hESCs.

MPSS has the advantage of permitting impressively deep transcriptome coverage. Shortcomings of the technique do exist; the method of sequencing cannot resolve palindromic sequence hybridization. Notably, this analysis failed to detect Nodal, which can be detected using SAGE. MPSS requires that transcripts possess a type II restriction enzyme site (DpnII), as does the SAGE technique, which utilizes NlaIII. Thus the small proportion of transcripts lacking a DpnII and/or NlaIII site go undetected by the corresponding method or methods. For example, SNRPF, which was enriched in microarray analysis of hESC (Bhattacharya et al., 2004), lacks a DpnII site but not an NlaIII site. There are half as many DpnII sites in the genome as NlaIII sites, and 1% of transcripts lack an NlaIII site while 4% of transcripts lack a DpnII site (Pleasance et al., 2003; A. Delaney, personal communication). MPSS and SAGE also complement one another, as some DpnII sites produce ambiguous tags while the corresponding NlaIII site does not, and vice versa. The MPSS analysis ambiguously detected a marker of undifferentiated hESC, ZFP42 (REX1), while our SAGE dataset detected an unambiguous tag mapping to the gene. Thus, MPSS and SAGE are complementary and the utilization of both techniques may be particularly important for the detection of some hESC specific genes.
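The dependence of both methods on an anchoring restriction site can be illustrated with a minimal sketch. It assumes the standard recognition sequences CATG (NlaIII) and GATC (DpnII), and it takes the tag from the 3'-most site on the transcript. The function names, the `downstream` parameter (set here to 17 bases to mimic a 21 bp long SAGE tag; an MPSS signature would use a different length), and the two toy sequences are hypothetical and are not taken from either published pipeline.

```python
from typing import Dict, Optional

# Recognition sequences of the two anchoring enzymes discussed above.
ANCHOR_SITES = {"NlaIII": "CATG", "DpnII": "GATC"}


def three_prime_tag(transcript: str, enzyme: str, downstream: int = 17) -> Optional[str]:
    """Return the 3'-most anchoring site plus `downstream` bases, or None."""
    site = ANCHOR_SITES[enzyme]
    seq = transcript.upper()
    pos = seq.rfind(site)  # position of the 3'-most occurrence of the site
    if pos == -1:
        return None  # no site: this enzyme cannot yield a tag for the transcript
    tag = seq[pos:pos + len(site) + downstream]
    if len(tag) < len(site) + downstream:
        return None  # simplification: discard tags that would run off the 3' end
    return tag


def detectable_by(transcript: str) -> Dict[str, bool]:
    """Report, per enzyme, whether the transcript yields a full-length tag."""
    return {name: three_prime_tag(transcript, name) is not None for name in ANCHOR_SITES}


# Hypothetical toy transcripts: one carries only a CATG site, the other only GATC.
toy_nla_only = "ACGTTCATGAACCGGTTAACCGGTTACGTACGT"
toy_dpn_only = "ACGTTGATCAACCGGTTAACCGGTTACGTACGT"

print(detectable_by(toy_nla_only))
print(detectable_by(toy_dpn_only))
```

A transcript such as `toy_dpn_only` would be invisible to an NlaIII-based protocol in this simplified model, which is the situation the text describes for SNRPF with the enzymes reversed. Real pipelines also have to deal with poly(A) tails, strand orientation and ambiguous tag-to-gene assignments, none of which are modelled here.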
Table 2 Candidate ES pluripotency genes confirmed or identified by global gene expression profiling (confirmed/identified by: down-regulation upon differentiation, high expression pattern in ES compared to differentiated tissues and/or RT-PCR)

Gene        References
BCRP        2
CD9         2
CD90        2
hTERT       2
UTF1        2
ZFP42       2,3,4
CKS1B       3
CLDN6       3
ERH         3
FLJ14549    3
FLJ21837    3
GBX2        3
GDF3        3
GJA1        3
HESX1       3
HMGA1       3
LIN28       3
NANOG       3
NPM1        3
OC90        3
TGIF        3
TNFRSF6     3
ADCY2       4
BIRC5       4
HEY2        4
JARID2      4
LCK         4
LDB2        4
MCF1        4
PAK1        4
PTTG1       4
SOCS1       4
SPRY1       4
THY1        4
CHEK2       3,4
BMPR1A      5
DPPA4       3,5
FOXD3       3,5
DNMT3B      1,3,5
POU5F1      1,2,3,4,5
SOX2        1,2,3,4,5

Pathway     Gene        References
WNT         Frizzled7   1,4,5
WNT         Frizzled8   1,4,5
TGFβ        TDGF1       1,2,3,4
TGFβ        LEFTY2      1,2,3,4
TGFβ        ID1         1,2,3,4
FGF         FGF1        1,4,5
FGF         FGF13       1,4,5
FGF         FGF2        1,4,5
FGF         FGF3        1,4,5
FGF         FGF4        1,4,5
FGF         FGFR1       1,4
BMP         BMPR1A      1,4,5

References: 1 Brandenberger et al.; 2 Carpenter et al.; 3 Richards et al.; 4 Sato et al.; 5 Sperger et al.

1.3 Objectives

Few large-scale cDNA and EST projects of the human pre-implantation embryo or its cell types exist. The isolation of human embryonic stem cell lines has made possible some initial investigations of the transcriptome of a limited number of these cell lines. Much remains unknown about the genes governing the undifferentiated state and the sequence of events that determine stem cell fate decisions. Consequently, hESCs are a rich source for novel gene discovery. We sought to generate a comprehensive catalogue of genes expressed in 8 NIH approved human embryonic stem cell lines (BG01, ES03, ES04, WA01, WA07, WA09, WA13, and WA14) using long Serial Analysis of Gene Expression (long SAGE). SAGE offers the opportunity to directly compare libraries constructed by diverse labs and differing protocols based on a unique sequence tag, which can be easily quantified. Additionally, because SAGE, unlike hybridization based techniques, does not require a priori knowledge of a gene for its detection, it is amenable to novel gene discovery.
The Gene Expression Laboratory sequenced 11 long SAGE libraries, at a minimum depth of 200,000 total tags per library, generated from undifferentiated hESC mRNA from 8 cell lines to more accurately describe the number of transcripts expressed in a cell at any given time and the absolute levels at which these transcripts are expressed. At this depth of tag sequencing we are more likely to identify known and novel genes that are transiently expressed and/or expressed at low levels (singleton or doubleton tags).

1.4 Specific aims and rationale

Aim 1. To analyze the database of hESC expressed SAGE tags to provide an annotated catalogue of expressed genes characterizing undifferentiated hESCs.

Aim 2. To compare the hESC SAGE libraries to publicly available adult and fetal human SAGE libraries to identify genes that are differentially expressed (up-regulated or down-regulated) across all embryonic stem cell lines.

Aim 3. To generate a computational approach to identify candidate novel genes in undifferentiated hESCs using multiple interspecies comparisons to all publicly available normal and malignant human SAGE libraries and multiple species sequence conservation.

2. Catalogue of the undifferentiated hESC transcriptome

Contributions

The culture of hESC lines and RNA extractions were completed at the laboratories of James A. Thomson (WA01, WA07, WA09, WA13, and WA14; Wisconsin Regional Primate Research Center, University of Wisconsin, Madison, WI, USA), Martin F. Pera (ES03 and ES04; Monash Institute of Reproduction & Development, Monash University, Melbourne, Victoria, Australia), and Allan Robins (BG01; BresaGen, Inc., Athens, Georgia, USA). Long SAGE library construction was completed by Jaswinder Khattra, Jennifer Asano, and Sean Rogers (Gene Expression Laboratory) and sequenced by the Production Group at the BCCA Genome Sciences Centre (GSC). Affymetrix GeneChip wet-lab work was completed by the Gene Expression Lab (BCCA GSC).
Analysis of Affymetrix data was completed by Jaswinder Khattra, who provided the list of genes used in Chapter 2.3.3 Detection of stem cell associated genes in WA09. DiscoverySpace software and database were designed by the Gene Expression Bioinformatics Group (BCCA GSC). Audic-Claverie statistical analysis script was written by Mehrdad Oveisi (Gene Expression Bioinformatics; BCCA GSC). All computational analyses such as tag-to-gene mapping, creation of additional tag-mapping resources, functional annotation using Gene Ontology (GO) terms, signalling pathway analyses/figure generation, and statistical analysis were completed by Angelique Schnerch (BCCA GSC and the Department of Medical Genetics, University of British Columbia).

2.1 Introduction

Global gene expression in human embryonic stem cells has been studied using a number of technologies such as microarray (Bhattacharya et al., 2004; Carpenter et al., 2004; Sato et al., 2003; Sperger et al., 2003), short SAGE (Richards et al., 2004), EST sequencing (Brandenberger et al., 2004b; Carpenter et al., 2004), and MPSS (Brandenberger et al., 2004a). These studies have each examined a subset of the embryonic stem cell lines available. The intersection between gene lists from the published studies is small, although recent analysis of the microarray data has demonstrated greater overlap between studies than originally supposed (Suarez-Farinas et al., 2005). It remains that the profiles generated using different technologies show little overlap for genes expressed at low levels (Evans et al., 2002). Even when comparing gene lists across different microarray platforms most similarities are between highly expressed genes in the different ES lines. Additional genes that may be significant to several ES cell lines could be expressed infrequently and will be missed by meta-analysis of these published studies.
To describe the hESC transcriptome several lines should be investigated using the same technology; in this analysis long SAGE was utilized. It has been widely established that SAGE is both qualitative and quantitative, producing a digital transcriptome survey of known and novel transcripts (Saha et al., 2002; Velculescu et al., 1999; Velculescu et al., 1995). To capture infrequent transcripts and approximate the biological numbers of genes expressed in a cell type at a given time point, each library must be deeply sequenced. Long SAGE libraries, sequenced to 200,000 total tags or more per library on average, were generated for 8 NIH approved human embryonic stem cell lines. The WA09 line was sequenced to a depth of over 400,000 total tags to more closely approximate the number of transcripts expressed in a cell at a given time-point. The entire dataset generated greatly exceeds in size other short/long SAGE libraries available, particularly the short libraries previously constructed for the ES03 and ES04 lines (145,015 combined total tags) (Richards et al., 2004). Therefore, it was possible to describe the majority of known genes and to identify rare novel transcripts in multiple hESC lines.

Using a comprehensive tag-to-gene mapping strategy, I aimed to catalogue the genes expressed in the WA09 long SAGE library and the genes common to all hESC lines for which a long SAGE library was constructed. The expression of genes implicated in maintaining stem cell self-renewal or plasticity was assayed in the WA09 library and a pooling of all stem cell libraries (hESC metalibrary). Additionally, I aimed to assess the expression of genes involved in various developmentally regulated pathways, the cell cycle, programmed cell death and metabolism using a combination of Comprehensive Mapping of SAGE Tag (CMOST) mappings, Cancer Genome Anatomy Project (CGAP) SAGE Genie mappings and Gene Ontology (GO) terms.
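The ideas of a pooled metalibrary, of tags common to all lines, and of singleton tags can be sketched in a few lines of Python. This is a minimal illustration that assumes each library is held as a mapping from tag sequence to count; the function names and the toy tags and counts are hypothetical, and the sketch is not the DiscoverySpace implementation used in this work.

```python
from collections import Counter
from typing import Dict, Set


def pool_libraries(libraries: Dict[str, Counter]) -> Counter:
    """Sum tag counts across libraries to form a pooled metalibrary."""
    meta = Counter()
    for counts in libraries.values():
        meta.update(counts)  # Counter.update adds counts rather than replacing them
    return meta


def tags_common_to_all(libraries: Dict[str, Counter]) -> Set[str]:
    """Return the tags observed at least once in every library."""
    tag_sets = [set(counts) for counts in libraries.values()]
    if not tag_sets:
        return set()
    return set.intersection(*tag_sets)


def singleton_fraction(counts: Counter) -> float:
    """Fraction of distinct tags (tag types) that were observed exactly once."""
    if not counts:
        return 0.0
    singletons = sum(1 for n in counts.values() if n == 1)
    return singletons / len(counts)


# Hypothetical toy libraries; real long SAGE tags are 21 bp sequences.
toy_libraries = {
    "lib_A": Counter({"tagA": 5, "tagB": 1}),
    "lib_B": Counter({"tagA": 3, "tagC": 1}),
}

meta = pool_libraries(toy_libraries)
print(meta)
print(tags_common_to_all(toy_libraries))
print(singleton_fraction(meta))
```

Pooling raises the count of tags shared between lines while leaving line-specific rare tags as singletons, which is why the text treats the metalibrary and the per-line comparison as separate questions.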
2.2 Methods

2.2.1 Cell Culture

Cell culture and RNA isolation were completed at the Wisconsin National Primate Research Centre (University of Wisconsin-Madison). The human ES cell line WA09 (NIH Stem Cell Registry code) was cultured on murine embryonic fibroblast (MEF) feeders. The ES line has a normal XX karyotype, expresses high levels of telomerase and has been shown to stain for cell surface markers that characterize undifferentiated primate embryonic stem cells (Thomson et al., 1998). Markers of pluripotency were assayed and included stage-specific embryonic antigens (SSEA3 and SSEA4), human embryonal carcinoma marker antigens (TRA-1-60 and TRA-1-81), and alkaline phosphatase. Colonies were cultured according to protocols previously established (Zwaka and Thomson, 2003). Briefly, WA09 cells (with a doubling time of approximately 20 hours) were harvested for gene expression profiling at passage number 38 and were approximately 80% confluent. Cells were harvested by treatment with 1 mg/ml of collagenase (Invitrogen) at 37°C for 10 minutes until the edges of the colonies curled away from the feeder layers. Next, cells were treated with 5 mg/ml of dispase (Invitrogen) for 5 minutes at 37°C until colonies dissociated from the plate. Cells were collected and washed with PBS (phosphate buffered saline solution) prior to total RNA extraction. Both short and long SAGE libraries were constructed from 20 mg of total RNA isolated from WA09 cells using TRIzol Reagent (Invitrogen) and Phase Lock Gel™ tubes according to the manufacturer's protocol (Brinkmann Instruments). RNA isolation, RNA quality assessment, SAGE library construction and sequencing were conducted at the Gene Expression laboratory of BCCA GSC. RNA quality was assessed using an Agilent 2100 Bioanalyzer (Agilent Technologies) and RNA 6000 Nano LabChip kit (Caliper Technologies).
2.2.2 SAGE Library Construction

Prior to SAGE library construction, potential genomic DNA contamination was removed from WA09 total RNA using DNase I (Invitrogen) treatment. The long SAGE library (Velculescu et al., 1995) was constructed according to the I-SAGE kit protocol (Invitrogen) using the kit reagents with adaptations for generating 17 bp tags (long SAGE protocol version 1.0a; http://www.sagenet.org) (Figure 4). Specific alterations include: the use of long SAGE linker molecules, scale-up PCR oligonucleotide primers specific to the long SAGE linkers (Invitrogen), and the use of an alternate type IIS restriction enzyme ("tagging" enzyme), MmeI (NEB), as opposed to BsmFI in the short SAGE protocol. To increase the recovery and purity of DNA, Phase Lock Gel™ tubes (Eppendorf) were used at each phenol-chloroform extraction step (Ye et al., 2000). Ditags were amplified with 25 PCR cycles. Ditags were purified using polyacrylamide gel electrophoresis (PAGE); 34 bp sized bands for long SAGE ditags were cut from the gel and ethanol precipitated to extract the DNA sample. Ditags were ligated to form concatemers containing up to 30-50 21-mer tags. Concatemers were also PAGE purified. Purified concatemers were cloned into a pZErO-1 (Invitrogen) sequencing vector. The ligated vector-concatemers were transformed into One Shot® Top 10 electrocompetent Escherichia coli (Invitrogen) and transformants were selected on low-salt LB/zeocin agar plates. Resistance to the antibiotic Zeocin denotes a successful transformation; these colonies were picked robotically into 384-well plates. Prior to sequencing, transformants were analyzed using colony PCR to determine the size and percent of inserts present in the SAGE library.

Figure 4 SAGE protocol. Protocol adapted from Invitrogen I-SAGE kit (http://www.invitrogen.com).
1. Mix H9 total RNA with oligo dT streptavidin beads.
2. Synthesize double stranded cDNA.
3. Digest with anchoring enzyme (NlaIII).
4. Divide in half and ligate adaptors (A and B) containing the recognition sequence for the type IIS restriction enzyme (tagging enzyme), BsmFI (14 bp tag) or MmeI (21 bp tag).
5. Cleave with tagging enzyme, releasing adaptor plus 14 bp/21 bp tag from streptavidin beads.
6. Fill in overhangs (Klenow reaction) and ligate to form ditag.
7. PCR amplify using ditag primers.
8. Cut adaptors with NlaIII to release ditag.
9. Ligate ditags to form concatemers (20-50 tags per concatemer).
10. Clone into pZErO-1 and sequence.

2.2.3 SAGE Library Sequencing

DNA was sequenced using BigDye primer cycle sequencing reagents and analyzed on ABI PRISM 3700 and 3730 XL capillary DNA sequencers (Applied Biosystems). Phred (Ewing and Green 1998; Ewing et al. 1998) was used to process and assess quality of the sequences produced. Custom scripts (Gene Expression Bioinformatics Group; http://www.bcgsc.ca/bioinfo/ge/) were used to process the sequencing reads, removing vector sequences and low-quality sequences and identifying non-recombinant clones. Additional scripts (Gene Expression Bioinformatics Group; http://www.bcgsc.ca/bioinfo/ge/) were utilized to extract ditags from sequencing reads and ultimately to extract 21 bp tags for the libraries constructed from WA09 RNA. Linker-derived tags (long SAGE linker sequences: TCGGACGTACATCGTTA and TCGGATATTAAGCCTAG) and sequencing errors were removed prior to analysis.

Additional libraries constructed for NIH approved human embryonic stem cells were similarly cultured and prepared for long SAGE library construction using the Gene Expression Laboratory pipeline described above. Table 3 lists the libraries sequenced at BCCA GSC and described in this research.

Table 3 Summary of hESC long SAGE dataset. Listed below is the NIH code for each cell line, the SAGE library accession (provided by the BCCA GSC), cell line karyotype, a description of the growth conditions and experimental manipulations where applicable, total tags sequenced and the corresponding unique tag types used for subsequent analysis. All libraries are publicly available from the CGAP and from http://www.transcriptomES.org.

Cell line  Library accession  Karyotype  Description                                     Total tags  Tag types
BG01       SHE19              46XY       Grown on mouse embryonic fibroblasts (MEF)      201,651     50,268
ES03       SHE10              46XX       Grown on MEF                                    205,276     41,571
ES04       SHE11              46XY       Grown on MEF                                    209,177     49,411
WA01       SHE17              46XY       Grown on MEF                                    276,177     56,426
WA01-M     SHE16              46XY       Grown on matrigel                               218,169     45,637
WA01-7     SHES7              46XY       Grown on matrigel; OCT4 knockin unselected      266,057     54,318
WA01-8     SHES8              46XY       Grown on matrigel; OCT4 knockin G418 selected   196,607     39,999
WA07       SHE13              46XX       Grown on MEF                                    272,422     57,283
WA09       SHES2              46XX       Grown on MEF                                    466,042     80,946
WA13       SHE15              46XY       Grown on MEF                                    221,060     47,389
WA14       SHE14              46XY       Grown on MEF                                    212,136     47,732

2.2.4 Tag-to-gene mapping

2.2.4.1 Comprehensive mapping of SAGE tags

SAGE gene expression data analysis utilized the Java application, DiscoverySpace (version 3.2.1; http://www.bcgsc.ca/bioinfo/software/discoveryspace).
The application provides a graphical interface to visualize and access a centralized database (DiscoveryDB) comprising multiple public biological data sources. Tag-to-gene mappings were generated using the CMOST algorithm for use within DiscoverySpace (http://www.bcgsc.ca/bioinfo/software/discoveryspace/CMOST_plugin_docs/view). The CMOST approach attempts to account for single base-pair permutations, insertions and deletions that potentially prevent a SAGE tag sequence from mapping to a known sequenced transcript or to the genome. For unmapped high quality sequence tags, CMOST enables tag-to-gene mappings for tags that may not be derived from artifacts introduced by the SAGE protocol. The approach introduces each theoretically possible permutation of the experimentally observed (canonical) SAGE tag sequence; thus a modified tag differs from the canonical sequence by a single base-pair permutation. The canonical tag and each modified tag are mapped to the available databases (Table 4 and Figure 5).

Table 4 Summary of CMOST data sources for tag-to-gene mapping. Tags were extracted from all positions in a transcript/sequence in the sense and antisense orientation.

Resource                 Version      Sequence entries  Tags extracted  Unique tags
MGC                      May 04 '05   18,293            304,062         220,722
RefSeq                   May 04 '05   28,826            629,921         477,298
Ensembl transcripts      30.35c       33,869            682,788         448,281
Ensembl EST transcripts  23.34e.1     43,710            313,656         195,495
Transcription units      30.35c       22,218            9,948,449       7,624,140
Mitochondria             NC_001807.4  1                 90              90
Non-coding               May 12 '05   43                4,327           4,206
UCSC Human Genome        NCBI35.nov   25,820            24,067,102      18,206,956
TOTAL                                                   35,950,395      27,177,188
Database size: 3.9 GB

Figure 5 CMOST schematic of methodology adapted from http://www.bcgsc.ca/downloads/genex/DS/cmostjluginuserdoc.htm. Each tag in a long SAGE library undergoes the following modifications prior to tag mapping: single base permutation, single base insertion and single base deletion. Both modified and unmodified tags are mapped, in the sense and antisense orientation, to the following virtual tag databases: MGC/RefSeq, Ensembl transcripts, Mitochondrion, Non-protein coding, Transcription units, Ensembl EST transcripts and Golden Path.

Off-by-one tag mappings are only considered if a tag does not have a canonical tag mapping at a position in close proximity to the 3' most end of a transcript. Ambiguity in tag mapping can result when a single tag maps to multiple genes. Should a transcript be alternatively spliced, matches to sites other than the 3' most CATG site may be valid. As a result, if an ambiguous tag is of interest (e.g., not derived from repetitive sequence) it can be experimentally resolved using the generation of longer cDNA fragments from SAGE tags for gene identification (GLGI) (Chen et al., 2002; Chen et al., 2000). The technique uses the SAGE tag as the gene-specific primer and an anchored oligo(dT) primer to amplify the cDNA sequence from which the SAGE tag was derived, similar to 5' or 3' rapid amplification of cDNA ends (Frohman et al., 1988). Resolving many ambiguous tag matches would be costly and labour intensive. Consequently, previous SAGE profiling experiments have removed ambiguous tags from further analysis (Richards et al., 2004). I have similarly excluded ambiguous tags from further analysis of the hESC transcriptome profile.
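The off-by-one strategy described above can be illustrated with a minimal Python sketch. It is not the CMOST implementation: the function names, the dictionary used as a virtual tag database, and the choice to leave insertion and deletion variants at their altered length are all assumptions made for illustration only.

```python
BASES = "ACGT"


def single_base_variants(tag):
    """Return every sequence one substitution, insertion or deletion from `tag`."""
    variants = set()
    for i, original in enumerate(tag):
        # Substitutions (called "permutations" in the text above).
        for base in BASES:
            if base != original:
                variants.add(tag[:i] + base + tag[i + 1:])
        # Deletion of the base at position i.
        variants.add(tag[:i] + tag[i + 1:])
    # Insertion of each base at every possible position.
    for i in range(len(tag) + 1):
        for base in BASES:
            variants.add(tag[:i] + base + tag[i:])
    return variants


def map_tag(tag, virtual_tags):
    """Map a tag to genes, preferring the canonical (unmodified) sequence.

    `virtual_tags` is a hypothetical dict: tag sequence -> set of gene names.
    Returns a tuple (genes, modified, ambiguous).
    """
    genes = set(virtual_tags.get(tag, ()))
    if genes:
        return genes, False, len(genes) > 1
    # Off-by-one variants are consulted only when the canonical tag fails.
    variant_genes = set()
    for variant in single_base_variants(tag):
        variant_genes |= virtual_tags.get(variant, set())
    return variant_genes, bool(variant_genes), len(variant_genes) > 1
```

A tag flagged as ambiguous by such a routine (more than one gene) would be excluded from further analysis, in line with the policy stated above.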
CMOST mappings produce multiple tag-to-gene matches given the nature of current transcript databases, which contain redundancies and disparate naming conventions. Multiple matched tags may also arise due to mapping off-by-one sequences. To select for the most reliable CMOST tag mapping (defined as a tag mapping to a single gene or genomic location), a hierarchical approach to tag mapping was utilized. Tags were mapped to various publicly available transcript/sequence databases with the rationale that the highest quality transcript database was used first to map tags. As tags mapped to a known transcript in one database they were excluded from further analysis with subsequent databases to avoid redundancies. Data sources were prioritized using the following parameters (refer to http://www.bcgsc.ca for CMOST documentation):

(i)  Data sources. Data sources were ranked based on reliability and functional annotation. Data source reliability was defined by high quality full-length cDNA sequence information derived from automated gene predictions and manual curation. The Mammalian Gene Collection (MGC) provided the most reliable cDNA sequences comprising full open reading frames (ORFs) with evidence of splicing. In the case where a cDNA does not have homology to a known protein these sequences were manually annotated and experimentally verified (Gerhard et al., 2004). The Reference sequence project (RefSeq) also provides non-redundant experimentally verified and computationally annotated cDNA sequences, which link transcript, chromosomal and protein information. A proportion of RefSeq cDNAs were solely ab initio ("from the beginning") predicted (Pruitt et al., 2003); thus these data sources were given a lower priority than MGC data sources. The order from highest to lowest priority data source was: MGC (http://mgc.nci.nih.gov; MGC 2004), RefSeq (http://www.ncbi.nlm.nih.gov; Pruitt et al. 2003), Ensembl transcripts (exon sequences only) (Birney et al., 2004; Hubbard, 2002; Hubbard et al., 2002; Kasprzyk et al., 2004) (http://www.ensembl.org), Genbank Human Mitochondrial Sequence (Accession AY289102.1), Genbank Non-coding sequences (http://www.ncbi.nlm.nih.gov/Genbank), Ensembl transcription units, which demarcate a region in the genome bounded by a transcription initiation/termination site and encoding a primary transcript (to account for sequence lacking an annotated 3' UTR an additional 1000 bp from the genomic region adjacent to the end of the transcript is included), Ensembl EST transcripts, and Golden path (Genbank Human Genome Assembly Contigs build 34, January 2004).

(ii)  Tag orientation (sense or antisense); tags in the sense orientation were given priority over antisense tags.

(iii)  Proximity to the 3' most CATG site.

(iv)  Tag modifications; in cases where both the experimentally observed tag and the CMOST modified tag mapped to an expressed sequence, the gene specified by the unmodified tag took precedence.

In addition to DiscoverySpace transcript mapping resources, CGAP provided a reliable tag-to-gene mapping database for regular and long SAGE tags (Boon et al. 2002; http://cgap.nci.nih.gov/SAGE/). Tag mappings were derived from 105 virtual tag databases which were ranked according to the percentage of tags contained in each database represented in the "confident SAGE tag list" (a list of tags reliably observed in multiple SAGE libraries) (Boon et al., 2002). CGAP virtual tags were extracted from the following seven transcript sequence sources: (i) MGC. (ii) RefSeq. (iii) Predicted transcripts from chromosome 22. (iv) The human mitochondrial genome (GenBank accession X93334). (v) The "20K set" transcript database that was generated by taking the longest non-EST cDNA for each UniGene cluster. (vi) Clustered UniGene sequences comprising the "Consensus sequences" databases (Hsest).
(vii) "Unclustered EST" databases (Boon et al. 2002). The databases (excepting predicted transcript and mitochondrial databases) were further subdivided according to the presence of a 3' poly(A) tail of 5 adenosines or more, a poly(A) signal or both. Virtual tag databases to detect internal cDNA synthesis priming from a stretch of adenosines other than the poly(A) tail and alternative polyadenylation were additionally constructed. Tags were extracted from the four closest NlaIII sites from the 3' most end of a transcript for each subdivided transcript sequence and parsed into the virtual tag databases. CGAP mappings were made available for download at the following ftp site: ftp://ftp1.nci.nih.gov/pub/SAGE/HUMAN/.

2.2.4.2 Assigning functional annotations

GO terms (http://www.geneontology.org) (Ashburner et al., 2000) can be used to assign annotations that broadly describe functional categories on the level of the entire transcriptome. The following criteria were devised to ensure only the most reliable tag-to-gene mappings were used to generate a transcriptome view of function in WA09 and of transcripts commonly expressed across all ES lines:

(i)  Tags were unmodified.

(ii)  Tag-to-gene mappings were in the sense orientation.

(iii)  Tags were derived from positions 1-3 in a transcript (position 1 being the 3' most NlaIII site in a transcript and positions 2 and 3 being farther upstream).

(iv)  Tags were unambiguously assigned to a transcript.

In the case of tag types mapping to mitochondrial or non-protein coding sequences, tag orientation and position were omitted. C-shell scripts were used to process DiscoverySpace output for ease of parsing tag-mappings according to the above criteria (Appendix 2a). GO terms were assigned where available to tags meeting the above criteria and mapping to Ensembl transcripts using DiscoverySpace.
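The four reliability criteria above, together with the exemption for mitochondrial and non-protein coding sequences, can be expressed as a single predicate. The following Python sketch is illustrative only: the analysis itself used C-shell scripts (Appendix 2a), and the record layout, field names and source labels here are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TagMapping:
    tag: str
    source: str        # hypothetical label, e.g. "Ensembl transcripts"
    genes: tuple       # every gene/transcript the tag maps to
    modified: bool     # True if an off-by-one variant was needed
    orientation: str   # "sense" or "antisense"
    position: int      # 1 = 3'-most NlaIII site; 2 and 3 lie farther upstream


# Sources for which orientation and position are not checked (assumed labels).
EXEMPT_SOURCES = {"Mitochondria", "Non-coding"}


def is_reliable(m):
    """Apply criteria (i)-(iv) to one tag-to-gene mapping."""
    if m.modified:                      # (i) tag must be unmodified
        return False
    if len(set(m.genes)) != 1:          # (iv) unambiguous assignment
        return False
    if m.source in EXEMPT_SOURCES:      # orientation and position omitted
        return True
    # (ii) sense orientation and (iii) position 1-3
    return m.orientation == "sense" and 1 <= m.position <= 3
```

Only mappings for which such a predicate holds would go on to receive GO annotations.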
The Ensembl website provided the Biomart resource to rapidly obtain GO terms for each transcript accession (Ensembl transcript ID or UniGene ID) (Hubbard et al., 2005). GO terms are exceedingly detailed; hence a set of annotations given by the GO slims, 'slimmed down' versions of the GO terms, provides a minimal set of molecular functions and biological processes to describe the hESC and normal CGAP (n-CGAP) library transcriptomes. These terms were obtained from the GO web site (http://www.geneontology.org/GO.slims.html). The C-shell scripts and GO slims used in this analysis are provided in Appendix 2b.

2.3 Results and discussion

2.3.1 Database of expressed long and regular SAGE tags

The WA09 long SAGE library totalled 466,042 sequenced tags. Recent developments made by the Gene Expression Lab/Informatics at BCCA GSC have since included ditag sequences, tag clustering (accounting for off-by-one tag sequences) (Siddiqui et al., submitted) and individual tag quality scores based on SAGE library construction/sequencing error (Siddiqui et al., submitted). With these improvements and quality assurances the WA09 dataset used in this analysis, which includes all tags with a tag construction/sequencing error of P<0.05, totalled 441,795 tags corresponding to 56,532 unique tag types. Figure 6a depicts the distribution of sequenced tags categorized according to gene expression level. The bulk of tag types (53.9%) were observed once in the library while highly expressed tags (>50 tags per type) were infrequent, representing 2% of all tag types. Figure 6b focuses on the contribution of mapped and unmapped tags to the total number of tags sequenced in low-to-mid expression levels. Using this quality trimmed dataset resulted in the successful mapping of 92% of tag types to a transcript or a region of the human genome. The remaining 8% of tags that did not map to a human sequence were mainly low expression tags shown in Figure 6b, 77% of which were singletons.
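The expression classes used in this section (singleton, low, mid and high) can be tabulated from a tag-count table with a few lines of Python. The sketch below is illustrative and was not part of the original analysis; the function names and class labels are assumptions, while the class boundaries follow the Figure 6 caption (1-10, 11-50 and >50 tags per type).

```python
from collections import Counter


def expression_classes(tag_counts):
    """Count tag types per expression class.

    `tag_counts` is a dict: tag sequence -> number of times it was observed.
    """
    classes = Counter()
    for count in tag_counts.values():
        if count == 1:
            classes["singleton"] += 1
        elif count <= 10:
            classes["low (2-10)"] += 1
        elif count <= 50:
            classes["mid (11-50)"] += 1
        else:
            classes["high (>50)"] += 1
    return classes


def percent(part, whole):
    """Percentage rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


# Reported WA09 figures: 30,459 singleton tag types out of 56,532 tag types.
singleton_share = percent(30_459, 56_532)
```

Applied to the reported WA09 figures, the singleton share computed this way agrees with the 53.9% quoted above.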
Potential sources of unmapped tags included: tags spanning novel splice junctions (as a proposed future analysis, such tags may be identified using the SAGE2Splice software; http://www.cisreg.ca/SAGE2Splice/); tags derived from mouse feeder contamination; small sequence polymorphisms (2 bp or greater); and artifacts of the SAGE protocol (e.g., RT-PCR). The full list of tag-to-gene mappings is provided in Appendix 2c.

2.3.2 Analysis of unmapped tags

Mouse embryonic fibroblast (MEF) contamination was assessed by mapping tags unassigned to a human sequence to mouse transcript databases (RefSeq) using CMOST and to the mouse genome assembly NCBI build 30. A total of 1,488 tags mapped to a mouse transcript or genomic sequence, accounting for 2.6% of all WA09 tag types (full mappings listed in Appendix 2d). Though undifferentiated WA09 populations of cells are carefully separated from differentiated embryonic stem cells and MEF prior to RNA extraction, the WA09 total RNA used for SAGE is derived from a heterogeneous population of mostly undifferentiated cells and a small population of differentiated cells and MEF. Potential MEF contamination was additionally investigated by assessing the theoretical "contamination" based on the overlap of human and mouse tag sequences.

Figure 6 (A) Distribution of WA09 tags sequenced (log scale Y-axis) classed according to gene expression level (X-axis). Tags expressed once in the library (singletons) are the most represented expression class (30,459 tag types; 53.9% of all unique tag types). The total number of singleton tag types is indicated on the graph. (B) The percentage of total tags sequenced contributed by mapped and unmapped tags at low (1-10 tags/type) and mid-level gene expression (11-50 tags/type).

2.3.2.1 Species ambiguous tags

Both human and mouse long SAGE UCSC genomic tag sequences were extracted from DiscoveryDB (Gene Expression Informatics; http://www.bcgsc.ca/gelab). The number of human genomic tags extracted was 27,858,501, which corresponded to 18,715,283 unique tag types; 27,725,360 mouse tags were extracted, which corresponded to 18,467,657 unique tag types. In total there were 289,453 unique tag types that were identical in human and mouse genomic sequences; these tags were termed "Species Ambiguous Tags" (SATs) (Figure 7) and provided a tag-mapping database to flag sequences that may occur in the hESC libraries due to MEF contamination (Appendix 2e). A fraction of SATs are low complexity sequences derived from repetitive regions of the genome. There were 56,866 tags of this type that mapped to 2 or more locations in the genome; these tags accounted for 20% of SATs.

Figure 7 The intersection between non-redundant human and mouse genomic sequences (UCSC) totalled 289,453 species ambiguous tags (SATs). Values shown in the diagram: human only, 18,425,830; SATs, 289,453.

SATs may derive from genes highly conserved between species, similarly exemplified in microarray experiments in which cross-hybridization of probes can occur between gene family members both within a single species and potentially between species (e.g., human and mouse). A total of 3,906 tags (6.9% of all WA09 tag sequences) were SATs (complete list of mappings available in Appendix 2f), 18.0% of which were present in the WA09 library 10 times or more. Table 5 lists the ten most highly expressed SATs and their human transcript mapping. Many of the genes in this list are highly expressed in mammalian cells (e.g., ribosomal proteins). Thus a proportion of the SATs in the WA09 library may have been derived from MEFs.
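Conceptually, the SAT database is the set intersection of the human and mouse genomic tag types, and flagging a library is a second intersection against that set. The Python sketch below illustrates the idea with invented toy sequences; it is not the DiscoveryDB procedure, and all function and variable names are assumptions. The last lines recompute, from the numbers reported above, the human-only count shown in Figure 7 and the quoted WA09 percentages.

```python
def species_ambiguous_tags(human_tags, mouse_tags):
    """Tag types present in both genomes (set intersection)."""
    return set(human_tags) & set(mouse_tags)


def flag_tags(library_tags, sats):
    """Split a library's tag types into SAT and non-SAT groups."""
    library_tags = set(library_tags)
    flagged = library_tags & sats
    return flagged, library_tags - flagged


def percent(part, whole):
    """Percentage rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


# Toy example with invented 5-mers, purely to show the set operations.
human = {"AAAAA", "CCCCC", "GGGGG"}
mouse = {"CCCCC", "GGGGG", "TTTTT"}
sats = species_ambiguous_tags(human, mouse)
flagged, unflagged = flag_tags({"AAAAA", "CCCCC"}, sats)

# Consistency checks on figures reported in the text.
human_only = 18_715_283 - 289_453            # human tag types minus SATs
sat_share_of_wa09 = percent(3_906, 56_532)   # SATs among WA09 tag types
mouse_only_share = percent(1_488, 56_532)    # mouse-only among WA09 tag types
```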
Tags that mapped to SATs and those that only mapped to mouse sequences provided a theoretical estimate of MEF contamination for hESC cell lines cultured on mouse feeders (2.6% of WA09 tags only mapped to mouse sequences and 6.9% of WA09 tags mapped to SATs).

Table 5 Top 10 expressed SAT tags.

Sequence           Count  # Genomic mappings  Transcript description
TGTGTTGAGAGCTTCTC  4094   12                  Eukaryotic translation elongation factor 1 alpha 1
TGGTGTTGAGGAAAGCA  1407   7                   Unknown (protein for MGC:87887)
CCAGAACAGACTGGTGA  1347   3                   Ribosomal protein L30
CTGTTGATTGCTAAATG  904    45                  Heterogeneous nuclear ribonucleoprotein A1, isoform A
GTGTAATAAGACATAAC  823    1                   HNRPA2B1 protein
CGCTGGTTCCAGCAGAA  821    2                   Ribosomal protein L11
TTCATTATAATCTCAAA  820    7                   Prothymosin, alpha (gene sequence 28)
ATCAAGGGTGTTACACT  819    10                  RPL9 protein
AAGGAGATGGGAACTCC  795    19                  Ribosomal protein L31
GGCAAGAAGAAGATCGC  703    3                   Ribosomal protein L27

2.3.3 Detection of stem cell associated genes in WA09

Several known stem cell markers have been isolated by others in mouse and human embryonic stem cells, teratocarcinoma cells, trophoblast stem cells and multipotent stem cells (haematopoietic stem cells and neural stem cells) (Brandenberger et al., 2004a; Brandenberger et al., 2004b; Ivanova et al., 2002; Kelly and Rizzino, 2000; Ramalho-Santos et al., 2002; Richards et al., 2004; Sperger et al., 2003; Tanaka et al., 2002). In an experiment using Affymetrix oligonucleotide arrays, genes present only in 7 different NIH approved hESC lines (NIH code: WA01, WA07, WA09, WA13, WA14, ES03, and ES04) were identified, producing a set of 84 genes (J. Khattra, unpublished). Altogether, the expression of 854 transcripts implicated in embryonic stem cell biology (which I have termed pluripotent stem cell associated genes (PAGs)), determined by extensive literature searches or the Affymetrix experiment, was sought in the WA09 long SAGE library (see Appendix 2g for a list of PAGs and their SAGE tag sequences).
The long SAGE technique, which utilizes the NlaIII restriction enzyme site, was capable of detecting 763 of the transcripts (89%). The remainder escaped detection via SAGE. Stemness genes expressed in the WA09 line are listed in Appendix 2h. Table 6 lists the most highly expressed PAGs in the library. Expression of this set of genes was also surveyed across 12 n-CGAP long SAGE libraries, a pooling of the n-CGAP libraries (n-CGAP metalibrary), and a pooling of all the hESC lines (hESC metalibrary). The n-CGAP libraries were pooled to provide greater transcript coverage than a single CGAP library, which was generally sequenced to a depth far less than half the size of our hESC libraries (typically sequenced to a depth of 200,000 total tags). Note that all singletons in the n-CGAP metalibrary were excluded from analysis.

Table 6 Short-list of most highly expressed stemness associated genes in the WA09 line (absolute tag counts listed; WA09 library size equals 441,795 tags sequenced). [1] IL6ST tag sequence is highly repetitive. [2] EEF1A1 is a species ambiguous tag.

Sequence               Count  Gene symbol  Hs.Id      Cytoband
TGTGTTGAGAGCTTCTC [2]  4094   EEF1A1       Hs.439552  6q14.1
TTGGTCCTCTGCCCTGG [1]  2256   IL6ST        Hs.532082  5q11
TGAAATAAAACTCAGTA      1026   NPM1         Hs.519452  5q35
TTGGTGAAGGAAGAAGT      988    TMSB4X       Hs.522584  Xq21.3-q22
GGAACAAACAGATCGAA      980    CD24         Hs.375108  6q21
CGCCGGAACACCATTCT      942    RPL4         Hs.186350  15q22
GCATAATAGGTGTTAAA      866    RAMP         Hs.126774
GTGTAATAAGACATAAC      823    HNRPA2B1     Hs.487774  7p15
TCAGATCTTTGTACGTA      792    RPS4X        Hs.446628  Xq13.1
ATTTGTCCCAGCCTGGG      779    HMGA1        Hs.518805  6p21
TCGTCTTTATCGCTCAG      706    RPS7         Hs.534346  2p25
GAAGCAGGACCAGTAAG      674    CFL1         Hs.170622  11q13
TGAGGGAATAAACCTGG      571    TPI1         Hs.524219  12p13
TAAATAATTTCCATATT      532    HSPE1        Hs.1197    2q33.1
TGTTCTGGAGAGTGTTC      496    GJA1         Hs.74471   6q21-q23.2
TATCAATATTCACTTGA      476    SFRP1        Hs.213424  8p12-p11.1
TTTACTGCTAGAAACCA      435    LIN28        Hs.86154   1p36.11
GGCTGGGGGCCAGGGCT      430    PFN1         Hs.494691  17p13.3
TGCTTCATCTGTGGGAT      427    RAN          Hs.10842   12q24.3
TACCAGTGTACTGCTTT      422    HSPD1        Hs.471014  2q33.1
GGGGAAATCGCCAGCTT      405    TMSB10       Hs.446574  2p11.2
GATCACAGTTTGCTTTG      402    LDHB         Hs.446149  12p12.2-p12.1
TATCACTTTTTTCTTAA      396    POU5F1       Hs.249184  6p21.31
GACGTGTGGGCGCGACT      350    H2AFZ        Hs.119192  4q24
TTAAACCTCAAATAAAT      345    HNRPDL       Hs.527105  4q13-q21
CCGCCTCCGGGAATGAG      330    SNRPN        Hs.525700  15q11.2
GATGCTGCCAATTTTGA      297    RPL22        Hs.515329  1p36.3-p36.2
AACTAAAAAAAAAAAAA      277    RPS27A       Hs.546292  2p16
GAGGACACAGATGACTC      276    PODXL        Hs.16426   7q32-q33
ATGATGATGATGGGACT      270    SLC25A5      Hs.522767  Xq24-q26
ATGTAGTAGTGTCTTAC      264    HNRPD        Hs.480073  4q21.1-q21.2
TTTTATGGGTAACTTTT      261    CCNG1        Hs.79101   5q32-q34
CTGCCTTCTTGGGGATT      242    PPP1CC       Hs.79081   12q24.1-q24.2
GCCTTCCAATAAAAAAT      235    DDX5         Hs.279806  17q21
TCATAGAAACCTTGATT      233    CCT5         Hs.1600    5p15.2
TCCTCAAGATAAAGTCT      232    ERH          Hs.509791  14q24.1
TTGGAGATCTCTATTGT      221    NDUFA4       Hs.50098   7p21.3
TAGCTACAGGACATTTT      211    DNMT3B       Hs.251673  20q11.2
AATATTGAGAAGAAACT      209    EIF3S6       Hs.405590  8q22-q23
TAATTCTTCTCTATTGT      206    CCT3         Hs.491494  1q23
ATAGACATAAAATTGGT      198    C1QBP        Hs.555866  17p13.3
TTAGCAATAAATGATGT      194    TXNL5        Hs.408236  17p13.1
CTTATTTGTTTTAAAAC      191    PLS3         Hs.496622  Xq23
TCAAATGCATCCTCTAG      189    HNRPC        Hs.449114  14q11.2
TAGCTGAGACATAAATT      184    KPNA2        Hs.159557  17q23.1-q23.3
TATATATTTGAACTAAT      180    SOX2         Hs.518438  3q26.3-q27
AAAATTTACAGTTTGCC      179    LECT1        Hs.421391  13q14-q21
CGGCCCAACGCCAAGAA      171    HRMT1L2      Hs.20521   19q13.3
TCTGTCAAGACCAAGAT      168    ATP5O        Hs.409140  21q22.1-q22.2
AAATAAAGAATTTAAAG      165    MGST1        Hs.389700  12p12.3-p12.1

Table 7 tabulates the number of PAGs detected and their representation amongst total tags sequenced across each library or metalibrary. The hESC metalibrary expressed 83% of all stemness genes while WA09 individually expressed 74% of the set. In contrast, the n-CGAP metalibrary, which has 41,209 tag types corresponding to 651,881 total tags, detected 61.7% of the stemness genes. The average detection of genes per CGAP library was 35.8%. The exception to this average was the fetal brain library (detected 63.8% of genes), which was sequenced to 300,000 tags and individually expressed a greater diversity of tag types compared to other normal libraries. Figure 8 highlights the detection of hallmark pluripotency genes in each hESC library compared to a multipotent adult progenitor (MAPC) line (Reyes et al., 2001) and fetal brain (CGAP). All pluripotency markers were highly expressed in each hESC line.
In contrast, PAGs were expressed at low levels in MAPC and the brain. Factors that were expressed at low levels in hESC were not present in the non-ES libraries. Both SOX2 and GNL3 were expressed in one or both of the non-ES libraries. GNL3 (also known as nucleostemin) plays a role in stem cell proliferation. The gene was expressed in the MAPC library at levels comparable to the hESC lines.

Table 7 Detection of genes up-regulated in pluripotent stem cells and potential hESC markers (pluripotent stem cell associated genes, PAGs). [1] There were 763 tags detecting candidate markers in total. Frequency is % of total tags sequenced per library or metalibrary.

Library                      Number of stemness genes  % stemness genes detected [1]  Frequency of genes in library
WBC                          224                       29.4                           3.5
WBC                          284                       37.2                           3.3
WBC                          269                       35.3                           3.4
WBC                          298                       39.1                           3.6
WBC                          284                       37.2                           3.4
WBC                          241                       31.6                           4.3
WBC                          256                       33.6                           5.8
Pancreas                     110                       14.4                           1.7
Breast myoepithelium         267                       35                             4.1
Substantia nigra             254                       33.3                           3
Liver vascular endothelium   302                       39.6                           3
Fetal brain                  487                       63.8                           3.2
Differentiated Metalibrary   471                       61.7                           5.1
hESC Metalibrary             631                       82.7                           8.2
WA09                         564                       73.9                           9.5

Figure 8 Detection of pluripotency genes and markers of differentiation. Tag frequencies (tags per 400,000 total tags) were plotted along the Y-axis; gene names were plotted along the X-axis; SAGE libraries were plotted along the Z-axis. LIFR and STAT3 maintain undifferentiated mESC though their function is not conserved in hESC. Both hCGA and hCGB are down-regulated by POU5F1. NES is a marker for ectodermal differentiation (the gene is a specific marker for neural stem cells), ACTC is a marker for mesodermal differentiation and AFP is a marker of early endodermal differentiation. U-hESC libraries all express high levels of pluripotent cell markers and low levels of early differentiation markers.
WA01(7), WA01(8), and ES03 express the highest levels of neural stem cell and muscle markers, suggesting that the heterogeneous population of cells that comprised these hESC cultures may contain a greater number of differentiating cells than the other hESC lines.

Genes involved in mESC maintenance (LIFR and STAT3) were not expressed or were expressed at low levels across all libraries (Figure 8). Early markers of differentiation/embryonic development such as hCGA and hCGB are down-regulated by POU5F1; these genes were shown to be expressed at low levels or absent in the hESC SAGE data. Markers of ectodermal and mesodermal fate (NES and ACTC respectively) were significantly expressed across the hESC lines. NES is involved in neuronal differentiation while ACTC expression is indicative of muscle differentiation. The hESC libraries most highly expressing these differentiation markers were WA01(7), WA01(8), ES03 and ES04. Several groups have claimed that ES cells might undergo a default differentiation pathway to neuronal development (Ramalho-Santos et al., 2002). NES is a marker of primitive neural stem cells that have been isolated from the developing mouse epiblast. The detection of NES may be indicative of spontaneous neural differentiation of these lines in culture.

2.3.4 Developmental signalling pathway expression in embryonic stem cells

The regulation of developmental signalling pathways, such as the Wnt, TGFβ, and Jak-Stat pathways, is important for maintaining mammalian pluripotent cell types and directing their differentiation. The aim of this analysis was to investigate the breadth and depth of gene expression of specific developmental pathways in hESCs compared to differentiated adult and fetal tissues. These pathways, being previously implicated in pluripotent stem cell biology, are expected to be near fully expressed in undifferentiated ES cells, including pathway antagonists.
Expression of pathway antagonists may be critical for ES developmental plasticity and distinctly different from the expression pattern observed for adult and fetal tissues. I also hypothesize that the full expression of pathway ligands and activators attests to the developmental plasticity of ES cells, as the Wnt and TGFβ signalling pathways direct many disparate cellular fates; this observation should similarly be unique to hESCs. Gene expression of developmental pathways was assayed in the WA09 embryonic stem cell line, a pooling of 11 hESC long SAGE libraries from 8 lines (hESC metalibrary), and a pooling of 12 CGAP long SAGE libraries (n-CGAP metalibrary) from normal adult and fetal samples representative of 4 tissues (blood, pancreas, breast, and brain). The WNT signalling pathway regulates developmental differentiation and has been implicated in stem cell fate and self-renewal (Aubert et al., 2002; Reya et al., 2003; Taipale and Beachy, 2001; Tang et al., 2002; Walsh and Andrews, 2003; Willert et al., 2003). More specifically, WNT signalling regulates multiple events during embryogenesis (Logan and Nusse, 2004) such as neuronal differentiation and limb development (Cadigan and Nusse, 1997). The role of Wnt signalling has particular relevance to stem cell biology with its noted role in cellular proliferation and has previously been suggested to function in haematopoietic stem cell (HSC) renewal (Reya et al., 2003; Willert et al., 2003). I have focused on the canonical pathway mediated by the Frizzled receptors and β-catenin (CTNNB1). The hESC metalibrary expressed the majority of WNT pathway genes (Figure 9) (Tables 8 and 9) (the complete list of Wnt pathway gene expression is provided in Appendix 2i).
In a recent study, gene expression in hESC (lines WA01, WA07, and WA09) was measured using EST sequencing (Brandenberger et al., 2004b) and similarly demonstrated the expression of much of the WNT pathway, except for WNT1, WNT2B, WNT10B, and WNT11, which I detected in our long SAGE data (WNT1, 2B, 3-6, 8B, 10B-11) (Figure 9) (Tables 8 and 9). The variety of Wnt ligands expressed in hESCs exceeded that expressed in the n-CGAP metalibrary, but all were expressed at low levels (<5 tags per 650,000 total tags). The individual WA09 line and the hESC metalibrary expressed a wider repertoire of Wnt pathway receptors (Frizzled 1-4, 6-10; LRP5-6) than the n-CGAP metalibrary (WA09 expressed 6 out of 9 Frizzled genes and 2 out of 2 LRP genes; hESC metalibrary expressed 8 out of 9 Frizzled and both LRP genes; n-CGAP metalibrary expressed 4 out of 9 Frizzled genes and LRP5). Expression levels for FZD7 and FZD2 in WA09 and the hESC metalibrary were significantly up-regulated (p<0.01) compared to the differentiated samples (Table 10; Table 11). FZD2 and FZD7 were greater than 20-fold more abundant in WA09 than the differentiated library (Table 11) and the genes were greater than 15-fold and 8-fold more abundant in the hESC metalibrary compared to differentiated tissues (Table 10).

Figure 9 The Wnt signalling pathway expression in hESC long SAGE libraries (hESC metalibrary) and differentiated normal CGAP long SAGE libraries (n-CGAP metalibrary). Tag expression levels were normalized to the total tags in the n-CGAP metalibrary (651,881 total tags). Expression levels were denoted by the heat-map. Boxes with vertical lines (low gene expression) represent >0-10 tags per gene, boxes with light gray dots (mid-level gene expression) represent >10-100 tags per gene, horizontal lined and dark gray boxes with dots (high gene expression) reflect >100-500 and >500 tags per gene respectively.
The top box reflects hESC metalibrary expression; bottom box reflects differentiated metalibrary expression. -| inhibition; -> activation.

Table 8 Summary of Wnt ligand and receptor expression in the WA09 long SAGE library, a pooling of 11 hESC libraries generated from 8 stem cell lines (hESCs), and a pooling of 12 normal adult and fetal samples representative of 4 tissues (Differentiated). hESC tags were normalized to the total number of tags expressed in the Differentiated library (651,881 total tags). Tags were mapped to genes using the CGAP SAGE Genie Best Tag mappings (http://cgap.nci.nih.gov/SAGE).

UniGene accession  Gene symbol  WA09  hESCs  Differentiated  Role in Wnt pathway signalling
Hs.94234  FZD1  0  0  0  Receptor
Hs.31664  FZD10  0  0.7  0  Receptor
Hs.142912  FZD2  19.6  8.1  0  Receptor
Hs.40735  FZD3  1.4  2.5  2  Receptor
Hs.19545  FZD4  4.2  4.3  5  Receptor
Hs.292464  FZD6  2.8  2.2  2  Receptor
Hs.173859  FZD7  120.3  90.1  5  Receptor
Hs.302634  FZD8  1.4  1.1  0  Receptor
Hs.534367  FZD9  0  0.9  0  Receptor
Hs.6347  LRP5  4.2  4  4  Receptor
Hs.549194  LRP6  1.4  1.3  0  Receptor
Hs.248164  WNT1  0  0.4  0  Ligand
Hs.121540  WNT10A  0  0  0  Ligand
Hs.91985  WNT10B  0  1.3  14  Ligand
Hs.108219  WNT11  1.4  1.1  5  Ligand
Hs.272375  WNT16  0  0  0  Ligand
Hs.128553  WNT2  0  0  0  Ligand
Hs.258575  WNT2B  1.4  0.4  0  Ligand
Hs.445884  WNT3  0  0.2  0  Ligand
Hs.336930  WNT3A  0  0  0  Ligand
Hs.25766  WNT4  0  0.7  0  Ligand
Hs.152213  WNT5A  2.8  1.6  0  Ligand
Hs.306051  WNT5B  1.4  0.7  2  Ligand
Hs.29764  WNT6  0  0.7  0  Ligand
Hs.72290  WNT7A  0  0  3  Ligand
Hs.512714  WNT7B  0  0  0  Ligand
Hs.259471  WNT8A  0  0  0  Ligand
Hs.421281  WNT8B  0  0.2  0  Ligand
Hs.558416  WNT9A  0  0  0  Ligand
Hs.326420  WNT9B  0  0  0  Ligand

Table 9 Expression of Wnt signalling components in the WA09 long SAGE library, a pooling of 11 hESC libraries generated from 8 stem cell lines (hESCs), and a pooling of 12 normal adult and fetal samples representative of 4 tissues (Differentiated). hESC tags were normalized to the total number of tags expressed in the Differentiated library (651,881 total tags). The effect of each component on pathway activity is indicated. Tags were mapped to genes using the CGAP SAGE Genie Best Tag mappings (http://cgap.nci.nih.gov/SAGE).

UniGene accession  Gene symbol  WA09  hESCs  Differentiated  Effect on Wnt pathway signalling
Hs.459759  CREBBP  2.8  2.7  13  Activation
Hs.529862  CSNK1A1  89.5  74.6  29  Activation
Hs.474833  CSNK1E  78.3  69.7  38  Activation
Hs.74375  DVL1  51.8  36.5  68  Activation
Hs.140720  FRAT2  58.7  81.4  91  Activation
Hs.386453  PORCN  12.6  8.1  6  Activation
Hs.194350  PRKACA  4.2  3.6  34  Activation
Hs.3260  PSEN1  14  12.8  2  Activation
Hs.515846  RUVBL2  19.6  20.8  6  Activation
Hs.555881  SMAD3  2.8  2.7  2  Activation
Hs.75862  SMAD4  33.6  30  11  Activation
Hs.476018  CTNNB1  169.3  71  17  Activation
Hs.555947  LEF1  1.4  2.5  4  Activation (transcription)
Hs.519580  TCF7  8.4  6.1  5  Activation (transcription)
Hs.158932  APC  0  0.2  31  Inactivation
Hs.512765  AXIN1  26.6  17.5  10  Inactivation
Hs.255973  CRI1  85.3  70.1  61  Inactivation
Hs.18949  CRI2  9.8  5.8  6  Inactivation
Hs.208597  CTBP1  46.2  39.2  74  Inactivation
Hs.463759  CTNNBIP1  18.2  15.9  0  Inactivation
Hs.12248  CXXC4  0  0  2  Inactivation
Hs.445733  GSK3B  1.4  6.7  3  Inactivation
Hs.507681  MAP3K7IP1  12.6  7.8  11  Inactivation
Hs.432453  MAP3K8  0  1.1  4  Inactivation
Hs.445496  MAP3K9  5.6  3.4  4  Inactivation
Hs.431550  MAP4K4  14  13.2  10  Inactivation
Hs.298434  NKD1  0  0.2  0  Inactivation
Hs.208759  NLK  2.8  1.8  0  Inactivation
Hs.483408  PPP2CA  81.1  58.9  24  Inactivation
Hs.533124  SENP5  11.2  6.5  14  Inactivation
Hs.213424  SFRP1  664.4  226.6  15  Inactivation (extracellular)
Hs.105700  SFRP4  0  0  32  Inactivation (extracellular)
Hs.279565  SFRP5  2.8  3.6  7  Inactivation (extracellular)
Hs.98367  SOX17  0  5.4  21  Inactivation (extracellular)
Hs.248204  CER1  0  1.6  0  Inactivation (extracellular)
Hs.40499  DKK1  0  0.2  0  Inactivation (extracellular)
Hs.284122  WIF1  0  0.7  4  Inactivation (extracellular)

Table 9 lists extracellular and intracellular activators and inhibitors of Wnt signalling. The intracellular machinery that propagates the Wnt signal to the nucleus, where the transcription of target genes may occur, was detected in its entirety in the WA09 library, hESC metalibrary and n-CGAP metalibrary.
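The captions of Tables 8 and 9 state that tag counts were normalized to the size of the differentiated metalibrary (651,881 total tags), and the metalibraries themselves are poolings of individual libraries. A minimal sketch of both steps follows. It assumes that pooling means summing raw tag counts and that normalization is a simple proportional rescaling; the tag names, raw counts and library size used in the example are hypothetical and are not taken from the thesis data.

```python
from collections import Counter

# Total tags in the differentiated (n-CGAP) metalibrary, as quoted in the table captions.
REFERENCE_TOTAL = 651_881


def pool_libraries(libraries):
    """Sum raw tag counts across libraries to form a metalibrary (assumed pooling rule)."""
    pooled = Counter()
    for library in libraries:
        pooled.update(library)
    return pooled


def normalize(raw_count, library_total, reference_total=REFERENCE_TOTAL):
    """Rescale a raw tag count to the reference library size, to one decimal place."""
    if library_total <= 0:
        raise ValueError("library_total must be positive")
    return round(raw_count * reference_total / library_total, 1)


# Hypothetical raw counts for two libraries; the tag names are placeholders.
lib_a = {"TAG_1": 12, "TAG_2": 3}
lib_b = {"TAG_1": 8, "TAG_3": 1}
metalibrary = pool_libraries([lib_a, lib_b])

# Hypothetical metalibrary size: twice the reference total, so counts are halved.
print(normalize(metalibrary["TAG_1"], 2 * REFERENCE_TOTAL))
```

Rescaling every library to the same total is what allows the WA09, hESCs and Differentiated columns to be compared directly as tags per roughly 650,000.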
A number of Wnt pathway activators were significantly up-regulated in hESCs (p<0.01) (Table 10 lists genes up-regulated in the hESC metalibrary versus the n-CGAP metalibrary; Table 11 lists the genes up-regulated in WA09 versus the n-CGAP metalibrary) and included CSNK1A1, CSNK1E, PSEN1, SMAD4, and CTNNB1. The major effects of WNT signalling hinge upon the nuclear translocation of CTNNB1 to activate the transcription of target genes (e.g., transcription factors and cell cycle regulatory genes). Constitutive activation of β-catenin (CTNNB1) was suggested to underlie tumorigenesis in self-renewing tissues such as the colon (Kielman et al., 2002). Over-expression of CTNNB1 and/or the absence of a Wnt-receptor interaction results in the degradation of excess cytoplasmic protein, regulated by a multi-protein complex consisting of GSK3β, AXIN, and APC (Cadigan and Nusse, 1997). In hESCs, APC was absent in the WA09 line and expressed at less than 1 tag/650,000 in the entire metalibrary (Table 9). CTNNB1 expression was more than 9-fold higher in WA09 and approximately 4-fold higher in the hESC metalibrary than in the differentiated metalibrary (Tables 10 and 11). This observation and the apparent lack of expression of key regulators of CTNNB1 protein levels suggest that Wnt signalling was active in hESCs and unlikely to be downregulated at the protein level.

Table 10 Differential gene expression of Wnt signalling components compared between the hESC metalibrary and differentiated metalibrary. Significance of expression differences was determined using Audic-Claverie statistics (Audic and Claverie, 1997) with P=<0.01 (a P-value of 0 is equivalent to P=<0.0001). Fold-difference in expression is denoted as the natural log (ln) ratio of the hESC metalibrary/differentiated metalibrary. Grayed-portion of the table corresponds to genes that are downregulated in hESCs. Tags were mapped to genes using the CGAP SAGE Genie Best Tag mappings (http://cgap.nci.nih.gov/SAGE).

UniGene accession  Gene symbol  P-value  Ln ratio hESC metalibrary/differentiated metalibrary
Hs.408312  TP53  0  3.5348
Hs.463759  CTNNBIP1  0  2.7811
Hs.173859  FZD7  0  2.7116
Hs.213424  SFRP1  0  2.6515
Hs.523852  CCND1  0  2.3138
Hs.142912  FZD2  0.0011  2.1153
Hs.495656  TBL1X  0  1.6067
Hs.3260  PSEN1  0.0011  1.4662
Hs.476018  CTNNB1  0  1.3761
Hs.508524  CACYBP  0  1.2584
Hs.515846  RUVBL2  0.0007  1.1018
Hs.75862  SMAD4  0.0004  0.9248
Hs.529862  CSNK1A1  0  0.9143
Hs.483408  PPP2CA  0  0.8615
Hs.202453  MYC  0.0083  0.7108
Hs.474833  CSNK1E  0.0002  0.5838
Hs.74375  DVL1  0  -0.6298
Hs.208597  CTBP1  0  -0.6426
Hs.98367  SOX17  0  -1.3678
Hs.459759  CREBBP  0.0001  -1.5697
Hs.500812  BTRC  0  -1.6699
Hs.534307  CCND3  0  -1.7732
Hs.194350  PRKACA  0  -2.2177
Hs.91985  WNT10B  0  -2.2577
Hs.525704  JUN  0  -2.7228
Hs.72290  WNT7A  0.0022  -2.8819
Hs.220971  FOSL2  0  -3.3414
Hs.158932  APC  0  -4.2682
Hs.105700  SFRP4  0  -4.9921

Table 11 Differential gene expression of Wnt signalling components compared between the WA09 library and differentiated metalibrary. Significance of expression differences was determined using Audic-Claverie statistics (Audic and Claverie, 1997) with P=<0.01 (a P-value of 0 is equivalent to P=<0.0001). Fold-difference in expression is denoted as the natural log (ln) ratio of WA09/differentiated metalibrary. Grayed-portion of the table corresponds to genes that are downregulated in WA09. Tags were mapped to genes using the CGAP SAGE Genie Best Tag mappings (http://cgap.nci.nih.gov/SAGE).

UniGene accession  Gene symbol  P-value  Ln ratio WA09/differentiated metalibrary
Hs.408312  TP53  0  3.9191
Hs.213424  SFRP1  0  3.7284
Hs.142912  FZD2  0  3.0436
Hs.173859  FZD7  0  3.0097
Hs.463759  CTNNBIP1  0  2.9746
Hs.495656  TBL1X  0  2.3833
Hs.476018  CTNNB1  0  2.2492
Hs.523852  CCND1  0  1.9697
Hs.3260  PSEN1  0.0039  1.6349
Hs.483408  PPP2CA  0  1.1942
Hs.529862  CSNK1A1  0  1.1088
Hs.75862  SMAD4  0.0014  1.0696
Hs.508524  CACYBP  0.0003  1.0287
Hs.474833  CSNK1E  0.0005  0.7151
Hs.171626  SKP1A  0  0.7095
Hs.534307  CCND3  0  -1.215
Hs.500812  BTRC  0  -1.6305
Hs.194350  PRKACA  0  -1.8335
Hs.91985  WNT10B  0.0006  -2.3725
Hs.220971  FOSL2  0  -2.6089
Hs.98367  SOX17  0  -2.7555
Hs.158932  APC  0  -3.1301
Hs.105700  SFRP4  0  -3.1609
Hs.525704  JUN  0  -3.7249

A number of molecular factors that have inhibitory effects on this developmental pathway were up-regulated in hESCs (CTNNBIP1, SFRP1, and PPP2CA) (Tables 10 and 11). CTNNBIP1 directly interacts with CTNNB1 to prevent its interaction with the TCF/LEF complex, thereby preventing transcriptional activation of target genes. In hESCs, this gene was significantly more highly expressed than in differentiated samples (16-fold and 19-fold greater in the hESC metalibrary and WA09 line respectively). The extracellular inhibitor of Wnt receptors, SFRP1, was also significantly more abundant in hESC samples (Tables 10 and 11). Additionally, SFRP1 was shown to be up-regulated by a published EST analysis of hESCs compared to their differentiated derivatives (embryoid bodies, pre-neuronal, and pre-hepatocyte like cells) (Brandenberger et al. 2004). The use of different SFRP genes to regulate Wnt-receptor interactions was likely to inhibit specific aspects of Wnt signalling. Wnt knockout phenotypes in the mouse provided evidence that loss-of-function of specific Wnts had disparate effects on many aspects of embryonic development, ranging from defects in gastrulation and mesoderm patterning to defects in neural crest and placental development (Aulehla et al., 2003; Barrow et al., 2003; Ikeya et al., 1997; Parr et al., 2001). Hypothetically, high expression of SFRP1 in hESCs may antagonize developmental differentiation and promote mitotic activity. Targeted silencing of SFRP1 in stem cells by small-interfering RNA molecules (siRNAs) (Caplen et al., 2001; Dykxhoorn et al., 2003) would provide the necessary functional validation to confirm or refute this proposed effect.
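Tables 10 and 11 report expression differences as natural log ratios, while the text quotes fold differences. The two are related by fold = e^(ln ratio). The short sketch below shows that conversion using the CTNNBIP1 and SFRP4 values from the tables; the function names are illustrative only.

```python
import math


def fold_change(ln_ratio):
    """Convert a natural-log expression ratio into a fold difference."""
    return math.exp(ln_ratio)


def describe(gene, ln_ratio):
    """Report the fold difference with its direction relative to the differentiated library."""
    fold = math.exp(abs(ln_ratio))
    direction = "higher" if ln_ratio > 0 else "lower"
    return f"{gene}: {fold:.1f}-fold {direction}"


# Ln ratios quoted in Tables 10 and 11.
print(describe("CTNNBIP1 (hESC metalibrary)", 2.7811))
print(describe("CTNNBIP1 (WA09)", 2.9746))
print(describe("SFRP4 (hESC metalibrary)", -4.9921))
```

A positive ln ratio marks a gene as more abundant in the stem cell library; a negative one marks it as more abundant in the differentiated metalibrary.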
Downstream of CTNNB1, transcriptional co-factors (TCF7 and LEF1) were expressed in WA09 at low levels while antagonists to their action were concurrently expressed (NLK and CTBP1). Several target genes of the WNT pathway were detected in the hESC lines, including the cell cycle regulators CCND1, 2, and 3 and the transcription factors MYC and FOSL1 (Figure 9). The expression of the CCND genes, particularly CCND1, which was more highly expressed in stem cells than in differentiated samples (greater than 7-fold expression difference; p<0.0001), may have a role in preventing differentiation by maintaining hESC mitotic activity. A wider repertoire of agonists and antagonists to the WNT pathway was expressed across all the hESC lines in contrast to a pooling of 4 normal adult and fetal tissues. The high expression levels of antagonists to the pathway such as SFRP1 and CTNNBIP1 were not similarly detected in differentiated tissues, suggesting that Wnt signalling is tightly regulated. SFRP1 and CTNNBIP1 may play a role in suppressing the specific differentiation effects of the Wnt pathway while some unknown mechanism may propagate Wnt signalling effects on cellular proliferation. hESCs exist on the cusp of differentiation and express hallmark developmental pathways, but these are held in check until the necessary environmental cue is received to direct cellular fate down a specific path.

The transforming growth factor beta (TGFβ) and Nodal signalling pathway governs various differentiation events such as osteoblast differentiation, neurogenesis, angiogenesis, embryo differentiation and placenta formation (Figure 10). Nodal signalling specifically induces mesoderm and endoderm in the early embryo and has a role in left-right axis determination (Hart et al., 2005).
Expression of a recombinant Nodal protein and constitutive expression of Nodal during hESC differentiation resulted in the retention of pluripotency markers in embryoid bodies and the expression of extra-embryonic endoderm markers (Vallier et al., 2004). The results further indicated that Nodal signalling prevented a 'default' progression to a neuroectoderm fate in comparison to control embryoid bodies.

Figure 10 Expression of the TGFβ signalling pathway in hESC long SAGE libraries (hESC metalibrary) and differentiated normal CGAP long SAGE libraries (n-CGAP metalibrary). Tag expression levels were normalized to the total tags in the n-CGAP metalibrary (651,881 total tags). Expression levels were denoted by the heat-map. Boxes with vertical lines (low gene expression) represent >0-10 tags per gene, boxes with light gray dots (mid-level gene expression) represent >10-100 tags per gene, horizontal lined and dark gray boxes with dots (high gene expression) reflect >100-500 and >500 tags per gene respectively. Top box reflects hESC metalibrary expression; bottom box reflects n-CGAP metalibrary expression. -| inhibition; -> activation.
These findings implicated Nodal and TGFβ signalling in pluripotency and in early cell fate decisions in vitro. The role of TGFβ signalling in undifferentiated ES cells has been previously suggested by various gene expression studies documenting rapid and significant down-regulation of TDGF1 upon differentiation (Brandenberger et al., 2004b; Carpenter et al., 2004; Richards et al., 2004). Our own data showed the near complete expression of the TGFβ signalling network in hESCs (Tables 12 and 13). Appendix 2j is the complete list of TGFβ gene expression in the WA09 line, hESC metalibrary, and differentiated metalibrary. Genes that were significantly differentially expressed (P=<0.01) between hESCs and the differentiated metalibrary are listed in Tables 14 and 15. Table 12 summarizes the expression of TGFβ signalling network ligands, receptors, and transcriptional targets in the SAGE libraries.

Signalling through different heterodimers in the pathway leads to distinct cellular fates. Activation of the LOC283155/AMHR2 heterodimer directs cells to osteoblast differentiation, neurogenesis, or ventral mesoderm specification. The TGFBR1/TGFBR2 receptor complex plays a role in angiogenesis, apoptosis and cell cycle regulation. TGFBR1/ACVR2 signalling regulates embryo differentiation, gonadal growth and placenta formation. Lastly, the receptor complex involved in Nodal signalling determines left-right axis patterning and mesoderm/endoderm induction during embryonic development.
The TGFBR1 and TGFBR2 receptors were expressed in hESCs; TGFBR1 was expressed in excess in the hESC lines compared to n-CGAP libraries (4-fold and 8-fold increased expression in the pooled hESC library and WA09 line respectively; P=<0.0001) (Tables 14 and 15).

Table 12 Expression of the TGFβ signalling network ligands, receptors and transcriptional targets in the WA09 long SAGE library, a pooling of 11 hESC libraries generated from 8 stem cell lines (hESCs), and a pooling of 12 normal adult and fetal samples representative of 4 tissues (Differentiated). hESC tags were normalized to the total number of tags expressed in the Differentiated library (651,881 total tags). Tags were mapped to genes using the CGAP SAGE Genie Best Tag mappings (http://cgap.nci.nih.gov/SAGE).

UniGene accession  Gene symbol  WA09  hESC  Differentiated  Role in TGFβ signalling network
Hs.73853  BMP2  0  3.4  5  Ligand
Hs.68879  BMP4  2.8  8.3  0  Ligand
Hs.296648  BMP5  0  0  0  Ligand
Hs.285671  BMP6  0  0  0  Ligand
Hs.473163  BMP7  8.4  10.5  2  Ligand
Hs.494158  BMP8A  0  0  0  Ligand
Hs.409964  BMP8B  0  0  0  Ligand
Hs.370414  NODAL  4.2  30.3  0  Ligand
Hs.1103  TGFB1  5.6  13  175  Ligand
Hs.133379  TGFB2  0  0.2  0  Ligand
Hs.2025  TGFB3  0  0  12  Ligand
Hs.1573  GDF5  0  0  0  Ligand
Hs.447688  GDF7  0  0  0  Ligand
Hs.438918  ACVR1B  2.79752  4.03408  57  Receptor
Hs.470174  ACVR2  0  0  0  Receptor
Hs.437877  AMHR2  0  0.2  0  Receptor
Hs.448651  LOC283155  1.4  0.9  0  Receptor
Hs.494622  TGFBR1  37.8  19.5  4  Receptor
Hs.82028  TGFBR2  2.8  2  20  Receptor
Hs.72901  CDKN2B  0  0  0  Transcriptional target
Hs.449410  FOXH1  0  0.4  0  Transcriptional target
Hs.202453  MYC  9.8  24.2  11  Transcriptional target
Hs.92282  PITX2  4.2  8.1  2  Transcriptional target
Hs.462590  TIAF1  7  4.5  47  Transcriptional target

Table 13 Expression of the TGFβ signalling network activators and inhibitors in the WA09 long SAGE library, a pooling of 11 hESC libraries generated from 8 stem cell lines (hESCs), and a pooling of 12 normal adult and fetal samples representative of 4 tissues (Differentiated). hESC tags were normalized to the total number of tags expressed in the Differentiated library (651,881 total tags). Tags were mapped to genes using the CGAP SAGE Genie Best Tag mappings (http://cgap.nci.nih.gov/SAGE). The effect of each gene on pathway activity is listed.

UniGene accession  Gene symbol  WA09  hESC  Differentiated  Effect on TGFβ signalling
Hs.9914  FST  23.8  19.7  0  Activation (extracellular)
Hs.385870  TDGF1  151.1  400.3  0  Activation (extracellular)
Hs.519005  SMAD1  9.8  4  3  Activation (intracellular)
Hs.465061  SMAD2  12.6  7.4  5  Activation (intracellular)
Hs.75862  SMAD4  33.6  30  11  Activation (intracellular)
Hs.459759  CREBBP  2.8  2.7  13  Activation (transcription)
Hs.504609  ID1  318.9  249.2  27  Activation (transcription)
Hs.524461  SP1  9.8  6.5  3  Activation (transcription)
Hs.856  IFNG  0  0  0  Inhibition
Hs.241570  TNF  0  0  11  Inhibition
Hs.248204  CER1  0  1.6  0  Inhibition (extracellular)
Hs.166186  CHRD  1.4  0.7  16  Inhibition (extracellular)
Hs.1584  COMP  0  0  5  Inhibition (extracellular)
Hs.156316  DCN  0  0.4  84  Inhibition (extracellular)
Hs.28792  INHBA  0  1.1  5  Inhibition (extracellular)
Hs.278239  LEFTY1  21  160.9  3  Inhibition (extracellular)
Hs.248201  NOG  0  0  0  Inhibition (extracellular)
Hs.164226  THBS1  12.6  20.2  31  Inhibition (extracellular)
Hs.371147  THBS2  4.2  4.5  8  Inhibition (extracellular)
Hs.169875  THBS3  12.6  4.3  10  Inhibition (extracellular)
Hs.211426  THBS4  4.2  4.9  2  Inhibition (extracellular)
Hs.356742  DRAP1  72.7  63.6  98  Inhibition (intracellular)
Hs.431850  MAPK1  7  10.5  5  Inhibition (intracellular)
Hs.861  MAPK3  11.2  13.4  16  Inhibition (intracellular)
Hs.474949  RBX1  72.7  94.6  85  Inhibition (intracellular)
Hs.153863  SMAD6  2.8  4.7  5  Inhibition (intracellular)
Hs.465087  SMAD7  12.6  9  4  Inhibition (intracellular)
Hs.189329  SMURF1  4.2  6.7  10  Inhibition (intracellular)
Hs.482660  ZFYVE16  4.2  3.8  16  Inhibition (intracellular)
Hs.532345  ZFYVE9  5.6  6.5  7  Inhibition (intracellular)
Hs.255973  CRI1  85.3  70.1  61  Inhibition (transcription)
Hs.18949  CRI2  9.8  5.8  6  Inhibition (transcription)
Hs.108371  E2F4  19.6  12.6  6  Inhibition (transcription)
Hs.207745  RBL1  0  0  0  Inhibition (transcription)
Hs.79353  TFDP1  11.2  7.2  5  Inhibition (transcription)

Table 14 Differential gene expression of TGFβ signalling components compared between the hESC metalibrary and differentiated metalibrary. Significance of expression differences was determined using Audic-Claverie statistics (Audic and Claverie, 1997) with P=<0.01 (a P-value of 0 is equivalent to P=<0.0001). Fold-difference in expression is denoted as the natural log (ln) ratio of hESC metalibrary/differentiated metalibrary. Grayed-portion of the table corresponds to genes that are downregulated in hESCs. Tags were mapped to genes using the CGAP SAGE Genie Best Tag mappings (http://cgap.nci.nih.gov/SAGE).

UniGene accession  Gene symbol  P-value  Ln ratio hESC/differentiated
Hs.385870  TDGF1  0  5.9927
Hs.278239  LEFTY1  0  3.696
Hs.370414  NODAL  0  3.4171
Hs.9914  FST  0  2.993
Hs.504609  ID1  0  2.187
Hs.68879  BMP4  0.0009  2.142
Hs.494622  TGFBR1  0.0001  1.3723
Hs.473163  BMP7  0.006  1.277
Hs.75862  SMAD4  0.0004  0.9248
Hs.483408  PPP2CA  0  0.8615
Hs.202453  MYC  0.0083  0.7108
Hs.23650  MAZ  0  -0.4069
Hs.356742  DRAP1  0.0003  -0.4382
Hs.247077  RHOA  0  -0.5569
Hs.507916  TGFB1I4  0  -1.2902
Hs.482660  ZFYVE16  0.0001  -1.4384
Hs.459759  CREBBP  0.0001  -1.5697
Hs.82028  TGFBR2  0  -2.2375
Hs.462590  TIAF1  0  -2.3223
Hs.1103  TGFB1  0  -2.5885
Hs.166186  CHRD  0  -2.9425
Hs.1584  COMP  0.0001  -3.2874
Hs.241570  TNF  0  -3.9805
Hs.241570  TNF  0  -3.9805
Hs.2025  TGFB3  0  -4.0605
Hs.156316  DCN  0  -4.8396

Table 15 Differential gene expression of TGFβ signalling components compared between the WA09 library and differentiated metalibrary. Significance of expression differences was determined using Audic-Claverie statistics (Audic and Claverie, 1997) with P=<0.01 (a P-value of 0 is equivalent to P=<0.0001). Fold-difference in expression is denoted as the natural log (ln) ratio of WA09/differentiated metalibrary. Grayed-portion of the table corresponds to genes that are downregulated in WA09. Tags were mapped to genes using the CGAP SAGE Genie Best Tag mappings (http://cgap.nci.nih.gov/SAGE).

UniGene accession  Gene symbol  P-value  Ln ratio WA09/differentiated
Hs.385870  TDGF1  0  5.0269
Hs.9914  FST  0  3.226
Hs.504609  ID1  0  2.4371
Hs.494622  TGFBR1  0  2.0584
Hs.278239  LEFTY1  0.0004  1.7219
Hs.483408  PPP2CA  0  1.1942
Hs.75862  SMAD4  0.0014  1.0696
Hs.247077  RHOA  0  -1.0561
Hs.507916  TGFB1I4  0  -1.3326
Hs.82028  TGFBR2  0.0012  -1.6103
Hs.462590  TIAF1  0  -1.7439
Hs.166186  CHRD  0.0017  -1.8045
Hs.241570  TNF  0.0031  -2.1493
Hs.241570  TNF  0.0031  -2.1493
Hs.2025  TGFB3  0.0018  -2.2294
Hs.1103  TGFB1  0  -3.2255
Hs.156316  DCN  0  -4.1071

The ACVR1B and ACVR2B receptors mediate the effects of Nodal signalling. ACVR1B was detected at low levels in hESCs at less than 1 tag per 650,000 total tags (not detected in the WA09 line or the n-CGAP metalibrary) and ACVR2B was not detected in any human long SAGE library. This suggests that Nodal signalling was not functional in u-hESCs; however, observations to the contrary were noted in the SAGE data. Nodal expression was significantly greater in the hESC metalibrary (30-fold higher; P<0.0001) and downstream transcriptional targets (PITX2 and FOXH1) were expressed in the hESC libraries. A co-receptor of the Nodal ligand, TDGF1, was upregulated in hESCs (Brandenberger et al., 2004b) and expressed at high levels in our SAGE data while undetected in differentiated tissues (Table 13). Additionally, expression of ACVR1B and ACVR2B was detected by RT-PCR in undifferentiated hESCs (Vallier et al., 2004), which implies that the necessary receptors were expressed but below the sensitivity of SAGE detection. BMP4 signalling in human ES cells has been shown to orchestrate early differentiation events (Xu et al., 2002b); thus the expression of the pathway genes should be evidenced in the SAGE data.
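The P-values in Tables 14 and 15 are attributed to Audic-Claverie statistics (Audic and Claverie, 1997). As a rough illustration of what that test computes, the sketch below implements the published probability p(y | x) of observing y tags in one library given x tags in another, and a one-sided tail P-value derived from it. The tail rule, the function names and the tag counts in the example are assumptions made for illustration; the thesis does not state which software or exact procedure was used.

```python
import math


def ac_probability(x, y, n1, n2):
    """Audic-Claverie p(y | x): probability of y tags in library 2 (size n2)
    given x tags in library 1 (size n1):
    (n2/n1)^y * (x+y)! / (x! * y! * (1 + n2/n1)^(x+y+1)), computed in log space."""
    ratio = n2 / n1
    log_p = (
        y * math.log(ratio)
        + math.lgamma(x + y + 1)
        - math.lgamma(x + 1)
        - math.lgamma(y + 1)
        - (x + y + 1) * math.log(1 + ratio)
    )
    return math.exp(log_p)


def ac_pvalue(x, y, n1, n2):
    """Assumed tail rule: the smaller of P(Y <= y) and P(Y >= y).
    Values far below about 1e-15 are limited by floating-point precision."""
    lower = sum(ac_probability(x, k, n1, n2) for k in range(y + 1))
    upper = 1.0 - lower + ac_probability(x, y, n1, n2)
    return max(0.0, min(lower, upper, 1.0))


# Hypothetical counts: 0 tags in a differentiated library versus 30 tags
# in a stem cell library of the same size.
print(ac_pvalue(0, 30, 650_000, 650_000) < 0.0001)
```

Because the statistic depends on the absolute tag counts as well as their ratio, a modest fold difference at high counts can be significant while a larger fold difference at very low counts may not be.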
Diverse BMP ligands and their corresponding receptors (LOC283155 and AMHR2) were expressed in the hESC SAGE libraries; BMP4 and BMP7 were significantly more highly expressed in stem cells than in the adult and fetal samples (>3-fold higher expression; P=<0.006) (Table 14). The transcriptional target of BMP signalling, ID1, also had a higher expression value in the hESC libraries (>8-fold increased expression; P<0.0001) (Tables 14 and 15). ID1 regulates transcription epigenetically and was suggested to induce cellular proliferation indirectly through protection against apoptosis (Ling et al., 2003). High levels of ID1 may indicate the inhibition of TGFβ signalling induced apoptosis as a mechanism by which WA09 maintains cellular proliferation.

Table 13 lists the expression of extracellular and intracellular activators and inhibitors of the pathway in hESC and n-CGAP libraries. The activation of SMAD proteins propagates TGFβ and Nodal signalling. The complete repertoire of SMADs was expressed in all SAGE libraries. Lefty1, an inhibitor of TGFβ and Nodal signal transduction, was highly differentially expressed in hESCs; its expression was more than 5-fold higher in WA09 compared to adult/fetal libraries and more than 40-fold higher in the hESC metalibrary (P=<0.0004) (Tables 14 and 15). High expression of both TGFβ and Nodal ligands and their repressors suggests that tight regulation of their signalling effects is important to the stem cell undifferentiated state and aspects of stem cell maintenance. I also observed that INHBA was expressed at 5 tags/650,000 in the hESC metalibrary but absent in the WA09 line. The gene inhibits TGFβ and Nodal, and ultimately promotes embryo differentiation. This observation was coupled with the moderately high expression of its repressor, FST, which appeared to be co-expressed with TGFβ and Nodal in hESCs (its expression was not detected in the differentiated libraries) (Table 14).
Negative regulation of INHBA and its low level of expression were consistent with the u-hESC phenotype. Similar to the Wnt signalling pathway, we observed that many TGFβ pathway genes were expressed in hESCs while pathway inhibitors (such as FST and Lefty1) were uniquely expressed at high levels in hESCs. TGFβ pathway expression alludes to a currently undefined role for the pathway in stem cell maintenance and to its function in hESC developmental plasticity.

Mouse embryonic stem cells (mESCs) maintain pluripotency via leukemia inhibitory factor (LIF) signalling through the LIFR/IL6ST receptor, which leads to STAT3-mediated transcriptional activation (Yoshida et al., 1994). The same response to LIF does not hold true in hESCs, where both LIF and its downstream effectors were expressed at low quantities in comparison to mESCs (Carpenter et al., 2004; Richards et al., 2004). I looked to confirm previous reports of low expression levels for LIF signalling in our hESC SAGE data and similarly observed low levels of LIF, LIFR, the Janus kinases (JAK), and STAT3, and the absence of ERAS (Table 16) (Figure 11). Appendix 2k is a catalogue of LIF signalling pathway expression in WA09, the hESC metalibrary, and the differentiated metalibrary. IL6ST may not be expressed in hESCs: its long SAGE tag sequence mapped ambiguously in the genome, and previous studies reported low levels of expression or did not detect the receptor (Brandenberger et al., 2004a; Brandenberger et al., 2004b; Carpenter et al., 2004; Rho et al., 2005; Richards et al., 2004). In mESCs, the LIF signalling pathway components were highly expressed (Anisimov et al., 2002) and critical to maintaining the undifferentiated and pluripotent state. In our hESC data the critical factors responsible for mESC maintenance are expressed at low levels; thus the pathway is unlikely to be functionally conserved in the human.

Table 16 LIF signalling network expression in human embryonic stem cells.
hESC tags were normalized to the total number of tags expressed in the Differentiated library (651,881 total tags). Tags were mapped to genes using the CGAP SAGE Genie Best Tag mappings (http://cgap.nci.nih.gov/SAGE). The IL6ST* sequence maps to multiple genes.

UniGene accession  Gene symbol  WA09  hESC  Differentiated  Effect/role on LIF signalling
Hs.447330  ERAS  0  0  0  Activation
Hs.444356  GRB2  39.2  21.6  81  Activation
Hs.523875  INPPL1  58.7  19.5  106  Activation
Hs.434374  JAK2  1.4  0.4  0  Activation
Hs.515247  JAK3  2.8  1.1  25  Activation
Hs.278733  SOS1  2.8  1.1  5  Activation
Hs.291533  SOS2  4.2  2.1  3  Activation
Hs.470943  STAT1  21  12  50  Activation (transcription)
Hs.530595  STAT2  49  39.3  202  Activation (transcription)
Hs.463059  STAT3  1.4  1.9  5  Activation (transcription)
Hs.437058  STAT5A  1.4  2.1  33  Activation (transcription)
Hs.524518  STAT6  7  2.5  67  Activation (transcription)
Hs.2250  LIF  1.4  0.7  2  Ligand
Hs.133421  LIFR  1.4  0.7  2  Receptor
Hs.454699  IL6ST  0  0  0  Receptor
Hs.532082  IL6ST*  3155.6  1918.1  2555  Receptor

Figure 11 The expression of LIF signalling pathway components in a pooling of 11 hESC SAGE libraries and a pooling of 12 normal adult and fetal SAGE libraries. Tag expression levels were normalized to the total tags in the n-CGAP metalibrary (651,881 total tags). Expression levels were denoted by the heat-map. Boxes with vertical lines (low gene expression) represent >0-10 tags per gene, boxes with light gray dots (mid-level gene expression) represent >10-100 tags per gene, horizontal lined and dark gray boxes with dots (high gene expression) reflect >100-500 and >500 tags per gene respectively. Top box reflects hESC metalibrary expression; bottom box reflects n-CGAP metalibrary expression. -| inhibition; -> activation. IL6ST* sequence maps to multiple genes.
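The figure captions assign each gene to a heat-map category according to its normalized tag count. The sketch below writes those categories out as a small function. Treating each upper bound as inclusive (for example, exactly 10 tags counted as low) and labelling a count of zero as "not detected" are assumptions; the category names are illustrative.

```python
def expression_bin(tags):
    """Assign a normalized tag count to one of the heat-map categories in the captions."""
    if tags <= 0:
        return "not detected"
    if tags <= 10:
        return "low (>0-10)"
    if tags <= 100:
        return "mid (>10-100)"
    if tags <= 500:
        return "high (>100-500)"
    return "high (>500)"


# hESC metalibrary values read from Table 16.
for gene, tags in [("ERAS", 0), ("GRB2", 21.6), ("IL6ST*", 1918.1)]:
    print(gene, expression_bin(tags))
```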
2.3.5 Cell cycle regulation and programmed cell death pathways in hESCs

To gain a broad overview of the cell cycle and death pathways in human embryonic stem cells, the genes found in the metalibrary were mapped to the pathway regulators. See Figures 12 and 13 for cell cycle and DNA repair mechanisms; Figures 14 and 15 for apoptosis and autophagic cell death pathways; see Tables 17-20 for differentially expressed genes in the cell cycle, DNA repair, apoptosis and autophagic cell death pathways. Positive regulators of cell cycle progression (such as E2F1, ORC genes and MCM genes) and DNA damage checkpoints (such as TP53, PCNA and CHEK1/2) were significantly more highly expressed in hESCs than in terminally differentiated tissue types (Table 17) (Figure 12). Proapoptotic genes (such as IL1R1, TNF, and TNFSF10) were significantly more highly expressed in adult and fetal libraries in comparison to hESCs (Figure 14) (Table 19). RB1 was detected at low levels in hESCs (3.5 tags/650,000 total tags), while its proposed inhibitor, nucleostemin (GNL3), was more abundantly expressed (75 tags/650,000 total tags) (Tsai and McKay, 2002). RB1 expression leads to cell cycle arrest and its activity is disrupted in many forms of cancer, resulting in uncontrolled cell growth and deregulated apoptosis (Hanahan and Weinberg, 2000; Hickman and Helin, 2002; Hickman et al., 2002).
In hESCs, the regulators involved in responding to stress and DNA damage are highly expressed, at levels significantly different from those in adult and fetal SAGE libraries (Table 18). Though hESCs share the property of inactive RB1 and related genes with malignant tissues, the loss of cell cycle and apoptosis regulation that characterizes cancer phenotypes distinguishes them from hESCs.

Figure 12 Cell cycle expression. Cell cycle genes detected in the hESC metalibrary and differentiated metalibrary. Expression was normalized to tags/650,000 total tags. Tag expression levels were normalized to the total tags in the differentiated metalibrary (651,881 total tags). Expression levels were denoted by the heat-map. Boxes with vertical lines (low gene expression) represent >0-10 tags per gene, boxes with light gray dots (mid-level gene expression) represent >10-100 tags per gene, horizontal lined and dark gray boxes with dots (high gene expression) reflect >100-500 and >500 tags per gene respectively. Top box reflects hESC metalibrary expression; bottom box reflects differentiated metalibrary expression. -| inhibition; -> activation.

Figure 13 Expression of DNA repair machinery in the u-hESC and n-CGAP metalibraries. Repair mechanisms include: double-strand break repair (homologous recombination repair (HRR) or non-homologous end joining (NHEJ)), nucleotide excision repair (NER) and base excision repair (BER). Expression was normalized to tags/650,000 total tags. Tag expression levels were normalized to the total tags in the n-CGAP metalibrary (651,881 total tags). Expression levels were denoted by the heat-map. Boxes with vertical lines (low gene expression) represent >0-10 tags per gene, boxes with light gray dots (mid-level gene expression) represent >10-100 tags per gene, horizontal lined and dark gray boxes with dots (high gene expression) reflect >100-500 and >500 tags per gene respectively.
Top box reflects hESC metalibrary expression; bottom box reflects n-CGAP metalibrary expression. -| inhibition; -> activation.

[Figure 13 diagrams not reproduced. Panels and stage labels recoverable from the diagrams:
DSB repair by HRR: 5' to 3' resection (MRE11A, RAD50); RAD52 end-binding; RAD51-mediated reactions (RAD51, RAD52, BRCA1, BRCA2); branch migration; ligation; Holliday junction resolution.
DSB repair by NHEJ: end recognition by G22P1, XRCC5 and PRKDC; end processing (MRE11A, RAD50); end bridging (LIG4, XRCC4); ligation.
NER: pre-incision complexes PIC1 and PIC2 (GTF2H4, ERCC2, ERCC3, ERCC5, XPA, RPA1); 5' incision (ERCC1, ERCC4); 3' incision; RFC genes (RFC1 to RFC5), PCNA, POLE, POLD1, POLD2.
BER: damaged nucleotide; specialized excision repair genes (UNG, MUTYH, TDG, MBD4, NTHL1, MPG, SMUG1, OGG1); APEX1; short-patch and long-patch pathways (POLB, XRCC1, LIG3, LIG1, PARP1, RFC genes, PCNA, FEN1, POLD1, POLD2, POLE).]

Figure 14 Apoptotic programmed cell death pathway expression in hESC compared with n-CGAP. A. Extrinsic pathway and survival factors. B. Ca2+ induced cell death. C. DNA damage induced cell death. Expression was normalized to tags/650,000 total tags. Tag expression levels were normalized to the total tags in the n-CGAP metalibrary (651,881 total tags). Expression levels were denoted by the heat-map.
Boxes with vertical lines (low gene expression) represent >0-10 tags per gene, boxes with light gray dots (mid-level gene expression) represent >10-100 tags per gene, horizontal lined and dark gray boxes with dots (high gene expression) reflect >100-500 and >500 tags per gene respectively. Top box reflects hESC metalibrary expression; bottom box reflects n-CGAP metalibrary expression. -| inhibition; -> activation.

[Figure 14 diagram not reproduced. Labels recoverable from the diagram: extrinsic pathway (death ligands and receptors, including FASLG, FAS, TNFSF10, TNFRSF10A to TNFRSF10C, FADD, TRADD, CFLAR), caspases (CASP3, CASP6, CASP7, CASP8, CASP9, CASP10), intrinsic pathway and stress signals (mitochondrion, CYCS, APAF1, BID, BAX, BCL2, BCL2L1, ENDOG, PDCD8), survival factors and survival genes (IL1A, IL1R1, MYD88, IRAK1, IRAK3, MAP3K14, CHUK, NFKBIA, NFKB1, NFKB2, AKT1, AKT3, BIRC2, NGFB, IL3RA, PRKAR1A), and the outcomes "Cleavage of Caspase Substrate", "DNA Fragmentation", "Degradation", "Survival" and "Apoptosis".]

Figure 15 Autophagic cell death. Pathway detected in the hESC metalibrary and n-CGAP metalibrary. MAP1LC3B (mammalian homologue to yeast ATG8) undergoes post-translational processing resulting in conjugation to phosphatidylethanolamine (PE) for activation. Additional ATG8 homologues are shown. Expression was normalized to tags/650,000 total tags. Tag expression levels were normalized to the total tags in the n-CGAP metalibrary (651,881 total tags). Expression levels were denoted by the heat-map.
Boxes with vertical lines (low gene expression) represent >0-10 tags per gene, boxes with light gray dots (mid-level gene expression) represent >10-100 tags per gene, horizontal lined and dark gray boxes with dots (high gene expression) reflect >100-500 and >500 tags per gene respectively. Top box reflects hESC metalibrary expression; bottom box reflects n-CGAP metalibrary expression. -| inhibition; -> activation.

[Figure 15 diagram not reproduced. Stages labelled in the diagram: preautophagosomal structure (PAS), conjugate formation, cup-shaped preautophagosomal membrane, autophagosome, autophagy. Genes shown include BECN1, PIK3C3, FRAP1, APG12L, APG5L, APG10L, APG7L, APG3L, MAP1LC3A, MAP1LC3B, GABARAP and GABARAPL2.]

Table 17 Differential gene expression of the cell cycle compared between the hESC metalibrary and differentiated metalibrary. Significance of expression differences was determined using Audic-Claverie statistics (Audic and Claverie, 1997) with P ≤ 0.01. Fold-difference in expression is denoted as the natural log (ln) ratio of hESC/differentiated metalibrary. Grayed-portion of the table corresponds to genes that are downregulated in hESCs. Tags were mapped to genes using the CGAP SAGE Genie Best Tag mappings (http://cgap.nci.nih.gov/SAGE).

UniGene accession   Gene symbol   P-value   Ln ratio hESC/differentiated
Hs.28312            MAD2L1        0         5.0027
Hs.194698           CCNB2         0         4.4654
Hs.517582           MCM5          0         4.0417
Hs.329989           PLK1          0         3.7977
Hs.84113            CDKN3         0         3.6974
Hs.23960            CCNB1         0         3.6461
Hs.85137            CCNA2         0         3.5478
Hs.408312           TP53          0         3.5348
Hs.558433           PTTG1         0         3.3629
Hs.524947           CDC20         0         3.3327
Hs.17908            ORC1L         0         3.2049
Hs.334562           CDC2          0         3.1678
Hs.36708            BUB1B         0         3.1294
Hs.474217           CDC45L        0         3.0688
Hs.291363           CHEK2         0         2.9703
Hs.24529            CHEK1         0         2.7385
Hs.96055            E2F1          0         2.7091
Hs.533573           CDC7          0         2.6633
Hs.244723           CCNE1         0         2.6315
Hs.656              CDC25C        0         2.6153
Hs.249441           WEE1          0         2.3294
Hs.523852           CCND1         0         2.3138
Hs.23348            SKP2          0         2.3055
Hs.469649           BUB1          0.0003    2.2886
Hs.95577            CDK4          0         2.2361
Hs.196102           RB1CC1        0.0005    2.218
Hs.147433           PCNA          0         2.1648
Hs.20447            PAK4          0         2.1063
Hs.410228           ORC3L         0         2.1045
Hs.208414           ASK           0.0014    2.0879
Hs.477481           MCM2          0         2.0879
Hs.153479           ESPL1         0.0014    2.0879
Hs.1634             CDC25A        0.0046    1.9056
Hs.3352             HDAC2         0         1.8489
Hs.19400            MAD2L2        0         1.7431
Hs.269408           E2F3          0         1.7365
Hs.88556            HDAC1         0         1.7233
Hs.16349            KIAA0431      0         1.7233
Hs.445758           E2F5          0.0003    1.5954
Hs.313544           GNL3          0         1.5489
Hs.49760            ORC6L         0         1.5001
Hs.418533           BUB3          0         1.3913
Hs.558364           ORC4L         0.0002    1.3609
Hs.517517           EP300         0.0072    1.1254
Hs.438720           MCM7          0         1.0066
Hs.135465           E2F6          0.0062    0.9753
Hs.75862            SMAD4         0.0004    0.9248
Hs.558307           CDK7          0.0071    0.8558
Hs.433201           CDK2AP1       0         0.7711
Hs.431048           ABL1          0.0099    0.7216
Hs.491682           PRKDC         0.0001    0.6213
Hs.36915            SMAD3         0.0001    0.3712
Hs.179565           MCM3          0.0005    -0.267
Hs.520974           YWHAG         0         -0.4186
Hs.200063           HDAC7A        0.0019    -0.498
Hs.150423           CDK9          0.0062    -0.6341
Hs.523835           DOC-1R        0.0001    -0.8921
Hs.106070           CDKN1C        0         -0.9099
Hs.238990           CDKN1B        0         -1.0101
Hs.77313            CDK10         0         -1.0238
Hs.438782           HDAC5         0         -1.0506
Hs.6764             HDAC6         0         -1.1728
Hs.80409            GADD45A       0         -1.2378
Hs.271791           ATR           0.0001    -1.262
Hs.534307           CCND3         0         -1.7732
Hs.310536           HDAC8         0.0013    -1.7833
Hs.370771           CDKN1A        0         -1.7988
Hs.417050           CCNA1         0.0023    -1.8321
Hs.513645           PAK6          0         -1.9011
Hs.110571           GADD45B       0         -2.5046
Hs.525324           CDKN2C        0.0004    -3.105
Hs.32539            PAK7          0         -3.8935
Hs.2025             TGFB3         0         -4.0605

Table 18 Differential gene expression of DNA repair mechanisms compared between the hESC metalibrary and differentiated metalibrary. Significance of expression differences was determined using Audic-Claverie statistics (Audic and Claverie, 1997) with P ≤ 0.01.
Fold-difference in expression is denoted as the natural log (ln) ratio of hESC/differentiated metalibrary. Grayed-portion of the table corresponds to genes that are downregulated in hESCs. Tags were mapped to genes using the CGAP SAGE Genie Best Tag mappings (http://cgap.nci.nih.gov/SAGE).

UniGene accession   Gene symbol   P-value   Ln ratio hESC/differentiated
Hs.409065           FEN1          0         3.8127
Hs.445052           MSH6          0         3.4022
Hs.156519           MSH2          0         3.3719
Hs.191334           UNG           0         3.0153
Hs.523220           RAD54L        0         2.9817
Hs.291363           CHEK2         0         2.9703
Hs.558896           CHEK2         0         2.9703
Hs.518475           RFC4          0         2.9471
Hs.139226           RFC2          0         2.9111
Hs.24529            CHEK1         0         2.7385
Hs.531879           RAD1          0         2.5819
Hs.446564           DDB2          0         2.4747
Hs.446554           RAD51         0.0001    2.4556
Hs.209945           TDP1          0.0001    2.4164
Hs.78016            PNKP          0.0001    2.3756
Hs.487540           RPA3          0         2.3221
Hs.498248           EXO1          0.0003    2.2656
Hs.150477           WRN           0.0006    2.1933
Hs.147433           PCNA          0         2.1648
Hs.192649           MRE11A        0.0009    2.142
Hs.100299           LIG3          0.0009    2.142
Hs.487294           ERCC2         0.0011    2.1153
Hs.302003           FANCE         0.0017    2.0598
Hs.555936           APEX2         0.0021    2.0308
Hs.169348           BLM           0.0046    1.9056
Hs.512592           RRM2B         0         1.7949
Hs.19400            MAD2L2        0         1.7431
Hs.500721           MMS19L        0         1.7233
Hs.16349            KIAA0431      0         1.7233
Hs.491695           UBE2V2        0         1.4748
Hs.534331           NUDT1         0.0002    1.4748
Hs.461925           RPA1          0         1.4619
Hs.558417           XRCC3         0.0001    1.3948
Hs.115474           RFC3          0.0002    1.3493
Hs.520189           ELOVL5        0         1.2991
Hs.344812           TREX1         0.006     1.277
Hs.177766           PARP1         0         1.2684
Hs.111749           PMS1          0.0001    1.2498
Hs.388739           XRCC5         0         1.2043
Hs.521640           RAD23B        0         1.2041
Hs.66196            NTHL1         0         1.1112
Hs.191356           GTF2H2        0.0008    1.0911
Hs.524630           UBE2N         0.002     1.0821
Hs.279413           POLD1         0         1.0694
Hs.523230           POLL          0.0068    1.0381
Hs.488624           PMS2L3        0.0068    1.0381
Hs.73722            APEX1         0         0.9969
Hs.194143           BRCA1         0         0.8644
Hs.558307           CDK7          0.0071    0.8558
Hs.271353           MUTYH         0.0088    0.7958
Hs.306791           POLD2         0.0066    0.729
Hs.477879           H2AFX         0.0014    0.7078
Hs.292493           G22P1         0         0.6421
Hs.491682           PRKDC         0.0001    0.6213
Hs.412587           RAD51C        0.0042    0.5895
Hs.459596           MPG           0.0014    -0.5459
Hs.98493            XRCC1         0.0006    -0.9675
Hs.34012            BRCA2         0         -1.0227
Hs.35947            MBD4          0         -1.0463
Hs.385986           UBE2B         0         -1.1389
Hs.271791           ATR           0.0001    -1.262
Hs.469872           ERCC3         0         -1.3133
Hs.475538           XPC           0         -1.3158
Hs.422901           GTF2H2        0         -1.334
Hs.288798           MUS81         0         -1.3778
Hs.129727           XRCC2         0.0001    -1.5697
Hs.512732           NEIL1         0         -1.6208
Hs.258429           ERCC5         0         -1.9118
Hs.208388           FANCD2        0.0015    -2.1887
Hs.232021           REV3L         0         -2.3429
Hs.408557           ELOVL2        0         -3.6159

Table 19 Differential gene expression of apoptosis pathways compared between the hESC metalibrary and differentiated metalibrary. Significance of expression differences was determined using Audic-Claverie statistics (Audic and Claverie, 1997) with P ≤ 0.01. Fold-difference in expression is denoted as the natural log (ln) ratio of hESC/differentiated metalibrary. Grayed-portion of the table corresponds to genes that are downregulated in hESCs. Tags were mapped to genes using the CGAP SAGE Genie Best Tag mappings (http://cgap.nci.nih.gov/SAGE).

UniGene accession   Gene symbol   P-value   Ln ratio hESC/differentiated
Hs.408312           TP53          0         3.5348
Hs.289052           BCL2L12       0         3.3486
Hs.424932           PDCD8         0         2.7239
Hs.522506           TRAF2         0.0001    2.3962
Hs.149032           PIK3R4        0.0009    2.142
Hs.141125           CASP3         0         1.9056
Hs.517841           PRKAR2A       0.007     1.8366
Hs.198998           CHUK          0.0085    1.8002
Hs.437060           CYCS          0         1.7937
Hs.16349            KIAA0431      0         1.7233
Hs.3280             CASP6         0.0016    1.4311
Hs.175343           PIK3C2A       0         1.1553
Hs.521456           TNFRSF10B     0         0.9781
Hs.514494           FBF1          0.0001    0.8398
Hs.159428           BAX           0         0.8005
Hs.502842           CAPN1         0.0004    -0.4711
Hs.515371           CAPNS1        0.0009    -0.487
Hs.503704           BIRC2         0.0027    -0.528
Hs.522819           IRAK1         0         -0.5877
Hs.356076           BIRC4         0         -0.7315
Hs.435512           PPP3CA        0.0001    -0.7647
Hs.484782           DFFA          0         -0.7778
Hs.280342           PRKAR1A       0.0088    -0.8337
Hs.43505            IKBKG         0         -0.9146
Hs.515406           AKT2          0         -0.9306
Hs.420106           ENDOG         0.0055    -0.936
Hs.523309           BAG3          0.0001    -1.1346
Hs.525622           AKT1          0         -1.1591
Hs.474150           BID           0         -1.1824
Hs.532826           MCL1          0         -1.2
Hs.433068           PRKAR2B       0         -1.2193
Hs.371344           PIK3R2        0         -1.3844
Hs.401745           TNFRSF10A     0         -1.3844
Hs.431926           NFKB1         0.0015    -1.4086
Hs.550753           PRKAR1B       0         -1.6655
Hs.73090            NFKB2         0         -1.7928
Hs.132225           PIK3R1        0         -1.8057
Hs.552567           APAF1         0         -1.8365
Hs.500067           PPP3CB        0         -1.9011
Hs.478275           TNFSF10       0         -2.1409
Hs.194350           PRKACA        0         -2.2177
Hs.278901           PIK3R5        0         -2.378
Hs.82116            MYD88         0         -2.3855
Hs.487325           PRKACB        0         -2.4072
Hs.227817           BCL2A1        0         -2.6996
Hs.126256           IL1B          0         -2.8819
Hs.390736           CFLAR         0.0004    -3.105
Hs.460996           TRADD         0         -3.2184
Hs.81328            NFKBIA        0         -3.2248
Hs.196472           IL3RA         0.0001    -3.2874
Hs.498292           AKT3          0         -3.5105
Hs.149413           PPP3CC        0         -3.575
Hs.127799           BIRC3         0         -3.7982
Hs.241570           TNF           0         -3.9805
Hs.557403           IL1R1         0         -4.4043

Table 20 Genes that were differentially expressed in the autophagic cell death pathway compared between the hESC metalibrary and differentiated metalibrary. Significance of expression differences was determined using Audic-Claverie statistics (Audic and Claverie, 1997) with P ≤ 0.01. Fold-difference in expression is denoted as the natural log (ln) ratio of hESC/differentiated metalibrary. Grayed-portion of the table corresponds to genes that are downregulated in hESCs. Tags were mapped to genes using the CGAP SAGE Genie Best Tag mappings (http://cgap.nci.nih.gov/SAGE).
UniGene accession   Gene symbol   P-value   Ln ratio hESC/differentiated
Hs.149032           PIK3R4        0.0009    2.142
Hs.90093            HSPA4         0         1.3178
Hs.338207           FRAP1         0         1.1291
Hs.369373           SEC23B        0.0002    0.9299
Hs.555993           APG10L        0.0002    0.7682
Hs.356061           MAP1LC3B      0         -0.7953
Hs.477126           APG3L         0         -1.0278
Hs.461379           GABARAPL2     0         -1.087
Hs.81964            SEC24C        0         -1.1616
Hs.47061            ULK1          0         -1.1702
Hs.283610           APG4B         0         -1.4021
Hs.84359            GABARAP       0         -1.4091
Hs.8763             APG4A         0.0062    -2.0064
Hs.237886           DAPK2         0.0021    -2.4119
Hs.472513           MAP1LC3A      0         -3.4781

2.3.6 HESC gene ontology

Human embryonic stem cells may be a rich source of many different transcript types encompassing a variety of disparate biological processes and molecular functions. The expression of a wide variety of genes, particularly at low levels, may play a key role in enabling stem cells to differentiate to a multitude of cell types when provided with the appropriate developmental cue. Various genes important to maintaining a stem cell's developmental potential are expressed at low levels, as in the case of ligands of the WNT and TGFβ developmental pathways. Additionally, many characteristics of pluripotent stem cell biology have not been explained by molecular factors described in previous studies of the hESC transcriptome. Much remains unknown regarding the molecular factors governing hESC maintenance and pluripotency; it is therefore likely that genes yet to be discovered to function in these aspects are expressed at the detection limits of the transcriptome profiling methods used to date. Generally, transcription factors and cell-specific factors are expressed at levels undetectable by most gene expression techniques (Brandenberger 2004a) or are transiently expressed. Upon differentiation, particular combinations of these low-expression genes may be up-regulated or down-regulated, locking in the stem cell's fate.
Key molecular factors that control stem cell fate determination and maintenance of developmental plasticity remain enigmatic, leading to the expectation that more of the transcripts identified in the hESC lines will be uncharacterized than in most terminally differentiated cell types. Figure 16 compares the functional state of genes shared across all hESC lines, and of the WA09 transcriptome, with the n-CGAP metalibrary. In total, 7,364 tags were commonly expressed across all of the ES lines. These tags were mapped using CMOST, resulting in 5,284 unambiguous tag-to-gene mappings in which the tag sequence localized to one of the three NlaIII sites nearest the 3' end of the transcript (denoted positions 1, 2 and 3, where position 1 is the 3'-most NlaIII site in a transcript). Gene Ontology (GO) terms were assigned to these tags. GO terms were similarly assigned for unambiguous tag-to-gene mappings in the WA09 library (15,509 tags) and differentiated metalibrary (13,523 tags). Analysis of the gene ontologies revealed that molecular functions expressed in hESCs and normal CGAP libraries were distinctly different, based on Audic-Claverie statistical analysis of total tags expressed in each functional category (see Appendix 21 for a table of pair-wise comparisons of molecular functions and corresponding p-values). Genes involved in nucleic acid binding, signal transduction, and cell death/aging were more highly represented in adult derived SAGE libraries than in hESCs (p<0.0001). Genes involved in protein synthesis/processing, transport, cell cycle regulation, and chromatin binding were more highly represented in hESC libraries than in the differentiated metalibrary (p<0.0001).
The preponderance of protein binding/degradation genes may support the theory that hESCs express many genes transiently (Golan-Mashiach et al., 2005); increased degradation machinery may therefore be needed to maintain the specific level of protein expression necessary for an undifferentiated and pluripotent state. The molecular integrity of hESCs has been demonstrated extensively by numerous accounts of hESCs possessing a stable karyotype after several passages (Hwang et al., 2004; Reubinoff et al., 2000; Thomson et al., 1998). The high representation of genes involved in cell cycle regulation in hESCs, observed in the GO molecular function assignments and closely investigated in my assessment of cell cycle/DNA repair gene expression in hESCs, was consistent with the genomic stability of hESCs and their proliferative capabilities.

Figure 16 Gene Ontology (GO) "slim" molecular functions expressed in the hESC metalibrary, hESC cell line WA09, and the differentiated metalibrary (n-CGAP metalibrary). Tags were mapped to Ensembl transcripts and EST transcripts using CMOST. Additionally, tags were mapped to embryonic ESTs (see Chapter 4.2.3 for a detailed account of tag-mapping resource construction) and mitochondrial sequences (CMOST). Tags that mapped to sense positions 1, 2, and 3 in a transcript and tags that mapped to all positions in EST and mitochondrial sequences were used in this analysis. The numbers of unique tags per functional category were depicted. The inset legend lists all functional categories corresponding to pie chart colours. The hESC metalibrary chart displayed category names and the percentage of all unique tags used in this analysis represented in that category. Global molecular functions in hESCs are significantly different from those in the pooling of adult/fetal samples. Nucleic acid binding and signal transduction genes account for a greater proportion of functions in adult/fetal libraries (p<0.0001).
Protein binding/degradation, transport, and cell cycle regulation account for a greater proportion of overall functions in hESC (p<0.0001).

Percentage of unique tags per functional category in the three pie charts of Figure 16:

Functional category               HESC MetaLibrary   WA09     Differentiated MetaLibrary
DNA metabolism/repair             0.05%              0.07%    0.04%
Protein synthesis/processing      2.84%              1.91%    0.03%
Nucleic acid binding              0.50%              0.29%    18.83%
Antioxidant                       0.10%              0.05%    0.08%
Signal transduction               0.02%              0.01%    14.32%
Defense/immunity                  0.12%              0.09%    0.20%
Chromatin binding                 1.98%              1.26%    0.34%
Negative regulation cell cycle    6.20%              8.14%    0.61%
Metabolism                        16.36%             9.69%    14.03%
Protein degradation               10.54%             9.09%    0.83%
Catalytic                         1.43%              2.14%    1.54%
Cell death/aging                  0.36%              0.23%    1.86%
Protein binding                   20.70%             28.79%   10.92%
Protein folding/chaperone         0.38%              0.33%    1.86%
Cell adhesion                     1.24%              0.73%    2.37%
Transport                         10.49%             10.72%   2.77%
Transcription/RNA processing      6.56%              6.94%    8.63%
Structural/cytoskeletal           11.64%             12.42%   5.25%
Zinc ion binding                  6.39%              4.40%    8.11%
Uncharacterized                   2.10%              2.70%    7.39%

2.4 Conclusions

This chapter provided a summary of the hESC transcriptome profiles sampled using SAGE.
The dataset described was among the most comprehensive and was representative of 8 cell lines and 2,613,475 total tags corresponding to 379,465 transcripts (unique tag types). Over 700 transcripts implicated in mammalian pluripotent stem cell biology (termed PAGs) were assayed in a pooling of the hESC SAGE libraries and in the WA09 line. The hESC libraries expressed 10-20% more PAGs at higher expression levels than any n-CGAP library. A suite of well-defined pluripotency markers (including POU5F1) and primitive differentiated cell markers was surveyed in hESCs. Pluripotency markers were significantly higher in hESCs and generally absent in the differentiated samples. Differentiation markers were expressed at low levels in hESCs. The expression of developmental pathways, such as the Wnt signalling network, was investigated in hESCs. In accordance with the "Just in case" hypothesis (Golan-Mashiach et al., 2005), hESCs uniquely expressed a greater variety of pathway genes than differentiated samples. Other key pathways, such as the cell cycle, DNA repair, and programmed cell death pathways, also showed increased diversity of pathway genes expressed in hESCs. Most intriguing was the observation that cell cycle checkpoints and DNA repair genes were significantly more highly expressed in hESCs than in n-CGAP libraries. From the GO analysis of global molecular functions I observed that the distribution of expressed genes with annotated molecular functions in embryonic undifferentiated cells was distinctly different from that in the adult and fetal samples. Genes involved in cell death and aging were prevalent in n-CGAP libraries in contrast to hESC libraries, while cell cycle regulatory genes were more highly expressed in hESCs. These characteristics were consistent with and necessary for the maintenance of immortalized and proliferating hESCs.

3. Comparison between hESC and cancer/non-cancer differentiated cells/tissues

Contributions

Computational analysis was completed by Angelique Schnerch (BCCA GSC and the Department of Medical Genetics, University of British Columbia), with the hierarchical clustering script and the Audic-Claverie statistics script contributed by Allen Delaney (BCCA GSC) and Mehrdad Oveisi (BCCA GSC) respectively.

3.1 Introduction

Genes important to the pluripotent and undifferentiated state of hESC may be up-regulated in hESC compared to differentiated cells. Conversely, genes important in hESC differentiation may be down-regulated in hESC. The hESC expression profile generated by SAGE lends itself to direct pair-wise comparison with cancer and non-cancer differentiated cells and tissue types to enable the elucidation of differentially expressed transcripts. The first portion of this analysis consisted of pair-wise comparisons within the set of hESC long SAGE libraries. Also performed were comparisons between hESC libraries and other human SAGE libraries publicly available from the CGAP website (http://cgap.nci.nih.gov/SAGE). I selected all normal CGAP SAGE libraries and categorized them according to their embryonic germ layer origin (endoderm, mesoderm, or ectoderm). The germ layers arise during the first differentiation event of the embryo proper, termed gastrulation. This event takes place in the third week of development. Tissue types derived from each germ layer are depicted in Figure 17. The Pearson correlation coefficient was used as the measure of relatedness between expression levels of matched tag sequences in each library-pair comparison (Pearson 1896). Cluster analysis based on Pearson correlation was additionally employed. This provides a method for grouping similar libraries into respective categories, ultimately defining which cell and tissue types were similar or dissimilar to embryonic stem cells.
Figure 17 Trilaminar embryonic disc. The embryonic germ layers, endoderm, mesoderm and ectoderm, will develop into the various differentiated derivatives in the adult human listed below. TS - trophoblast cells; CNS - central nervous system; PNS - peripheral nervous system (Moore 2003).

Derivatives:
• Ectoderm: epidermis, CNS (brain and spinal cord), PNS (nerves), eye (retina), ear
• Mesoderm: musculoskeletal (ligaments, muscles, skeleton), vascular system (blood, bone marrow), reproductive system (urogenital)
• Endoderm: gastrointestinal (pharynx, digestive tube), major glands, respiratory system (nasal cavity, larynx, trachea, lungs)

Individual transcript differences between the hESC transcriptome and multiple differentiated transcriptomes were investigated as part of this analysis to define a pluripotent molecular signature. Using multiple hESC lines provided an opportunity to elucidate genes that were both differentially expressed and common to all hESC lines. The ultimate goal of studying hESCs is related to their application in the clinical setting, namely the directed differentiation of stem cells for transplantation therapies. Prefacing this is the characterization of the molecular factors and signalling networks that govern stem cell properties and differentiation. In recent years the study of hESC biology has grown rapidly, yet much remains unclear about the stem cell molecular machinery and its function. A number of pluripotency markers have been described in the literature that were shown to be downregulated upon differentiation in mammalian embryonic stem cells (mouse or human ESCs). The candidate markers distinguished the u-hESCs from their nearest neighbours along the developmental time-course, yet many of these genes are not unique to embryo-derived pluripotent cells.
Upon differentiation, some of the most well characterized markers of u-hESC, such as POU5F1 and SOX2, are not immediately down-regulated (e.g., they are still present at detectable levels after two weeks of differentiation) (C. Eaves, personal communication; Cai et al., 2005). One aim of this study was to assess the undifferentiated state of the hESCs used to generate our SAGE data by observing the differential expression of published pluripotency markers in hESC versus adult cells; this will further refine our knowledge of current u-hESC markers and of their appropriateness. A second aim of analyzing differential gene expression was to identify novel markers distinguishing u-hESCs from adult and fetal cells. This analysis took the approach of comparing hESC SAGE expression profiles to pre-existing publicly available SAGE data, comprised mainly of adult tissues and cell lines.

3.2 Methods

3.2.1 SAGE library downloads

SAGE libraries constructed from non-pluripotent tissues and cell lines were obtained from the CGAP SAGE Genie resource (http://cgap.nci.nih.gov/SAGE) and our own database of long SAGE libraries. In total, 65 normal short and extracted short (see footnote 4) SAGE libraries from CGAP (n-CGAP) were used for comparison to the hESC expression profiles. All CGAP long SAGE libraries, both normal (n-CGAP) and malignant (c-CGAP), were utilized in this analysis. The tissues and cell lines represented are listed in Appendix 3a. A subset consisting of 43 libraries was categorized according to embryonic germ layer origin; library descriptions, including the tissue type and germ layer, are additionally listed in Appendix 3a.

3.2.2 Cluster analysis

For comparison to short SAGE libraries, our own hESC long SAGE tags were truncated at the 3' end to yield extracted short sequences. SAGE libraries were grouped together according to a distance measure of 1-Pearson correlation coefficient (1-r).
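The tag truncation and the 1-r distance measure can be illustrated with a minimal Python sketch. This is not the matrix.pl script of Appendix 3b: the function names (extract_short, pearson_distance, lower_triangular) are hypothetical, the toy tag counts are invented for illustration, and the sketch assumes that r is computed only over tags detected in both libraries of a pair.

```python
from collections import Counter
from math import sqrt


def extract_short(long_counts, short_len=14):
    """Collapse long SAGE tags to their first `short_len` bases, summing counts."""
    short_counts = Counter()
    for tag, count in long_counts.items():
        short_counts[tag[:short_len]] += count
    return short_counts


def pearson_distance(lib_a, lib_b):
    """Return 1 - r, with r computed over tags present in both libraries (assumption)."""
    shared = sorted(set(lib_a) & set(lib_b))
    if len(shared) < 2:
        raise ValueError("need at least two shared tags")
    xs = [lib_a[tag] for tag in shared]
    ys = [lib_b[tag] for tag in shared]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant counts")
    return 1.0 - sxy / sqrt(sxx * syy)


def lower_triangular(libraries):
    """Build lower-triangular rows of 1 - r values, with a zero on the diagonal."""
    names = list(libraries)
    rows = []
    for i, name in enumerate(names):
        row = [pearson_distance(libraries[name], libraries[names[j]]) for j in range(i)]
        rows.append((name, row + [0.0]))
    return rows


# Invented toy libraries of 14-bp tags (illustration only).
toy = {
    "libA": {"CATGAAAAAAAAAA": 10, "CATGCCCCCCCCCC": 20, "CATGGGGGGGGGGG": 30},
    "libB": {"CATGAAAAAAAAAA": 20, "CATGCCCCCCCCCC": 40, "CATGGGGGGGGGGG": 60},
    "libC": {"CATGAAAAAAAAAA": 30, "CATGCCCCCCCCCC": 20, "CATGGGGGGGGGGG": 10},
}
for name, row in lower_triangular(toy):
    print(name, " ".join(f"{d:.4f}" for d in row))
```

Because r is unchanged when every count in a library is multiplied by a constant, libraries sequenced to different depths can be compared this way without first rescaling them to a common total.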
A matrix containing pair-wise comparisons for all libraries versus themselves was generated based on 1-r using "matrix.pl" (Appendix 3b) (based on a script written by Allen Delaney; Gene Expression Informatics; http://www.bcgsc.ca/bioinfo/ge). The matrix was used to construct a hierarchical tree using the Fitch-Margoliash tree clustering algorithm (fitch) (Fitch and Margoliash, 1967). Figure 18 depicts the matrix file format and the fitch settings used in this analysis. The tree constructed (*.tre file) was viewed using the program TreeViewX (http://darwin.zoology.gla.ac.uk/~rpage/treeviewx/).

Footnote 4: Extracted short libraries are long SAGE libraries whose 21 bp sequences were shortened to 14 bp computationally (21- and 14-bp tags include the NlaIII consensus site "CATG").

Figure 18 Matrix file format and the fitch settings used in this analysis.

Matrix file

   19
breast.lib  0
fetalbrain  0.3365 0
liver.lib   0.3864 0.213 0
pancreas.l  0.4605 0.4134 0.1691 0
snigra.lib  0.704 0.4591 0.1643 0.2635 0
ubc1.lib    0.2923 0.139 0.2356 0.3904 0.4375 0
ubc2.lib    0.3824 0.1311 0.2308 0.4282 0.4155 0.0234 0
shes2       0.3436 0.1557 0.3109 0.4444 0.4509 0.1346 0.1477 0
shes7       0.3261 0.1539 0.3087 0.4914 0.5021 0.1654 0.1644 0.0468 0
shes8       0.3284 0.1603 0.2725 0.4655 0.4392 0.1662 0.1626 0.0615 0.014 0
[The rows for shes9, she10, she11, she13, she14, she15, she16 and she17 are cut off in the source figure and are not reproduced.]

Fitch settings

Fitch-Margoliash method version 3.6a2.1

Settings for this run:
  D      Method (F-M, Minimum Evolution)?      Fitch-Margoliash
  U      Search for best tree?                 Yes
  P      Power?                                2.00000
  -      Negative branch lengths allowed?      No
  O      Outgroup root?                        No, use as outgroup species 1
  L      Lower-triangular data matrix?         Yes
  R      Upper-triangular data matrix?         No
  S      Subreplicates?                        No
  G      Global rearrangements?                No
  J      Randomize input order of species?     No. Use input order
  M      Analyze multiple data sets?           No
  0      Terminal type (IBM PC, ANSI, none)?   ANSI
  1      Print out the data at start of run    No
  2      Print indications of progress of run  Yes
  3      Print out tree                        Yes
  4      Write out trees onto tree file?       Yes

  Y to accept these or type the letter for one to change
  J
  Random number seed (must be odd)?
  1235
  Number of times to jumble?

3.2.2.1 Random sampling script

To assess the nature of tag type diversity in hESCs versus comparator libraries (each sequenced to different depths), a random sample was taken of the WA09 long SAGE library tags (converted to extracted short SAGE) to model increasing sampling depth, using multiple iterations (25 iterations) (Perl script found in Appendix 3c; original script written by Paul Stothard; https://www.gchelpdesk.ualberta.ca/repository/ViewRepository.php).

3.2.3 Differential gene expression analysis

A set of 43 non-pluripotent SAGE libraries was selected to provide a sample of tissues derived from the three embryonic germ layers, and the libraries were subdivided accordingly (12 endoderm libraries, 16 mesoderm libraries, and 15 ectoderm libraries). "Endoderm" libraries were constructed from the gastrointestinal tract, major glands and the respiratory system. "Mesoderm" libraries were constructed from the musculoskeletal system, vascular system and the reproductive system. Lastly, "ectoderm" libraries were constructed from the epidermis, central nervous system (CNS), peripheral nervous system (PNS), and the eye. (Refer to Figure 17 for an illustration of the embryonic germ layers and derivative tissues).
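The random sampling of Section 3.2.2.1 can be illustrated with a short Python sketch. It is not the Perl script of Appendix 3c: the function name mean_unique_tags is hypothetical, and the sketch assumes that tags are drawn without replacement and that the quantity of interest is the mean number of distinct tag types over the 25 iterations at each sampling depth.

```python
import random


def mean_unique_tags(tag_counts, depths, iterations=25, seed=1):
    """Mean number of distinct tag types seen when sampling `depth` tags.

    tag_counts maps tag sequence -> observed count; sampling is without
    replacement from the expanded pool of individual tags (an assumption).
    """
    pool = [tag for tag, count in tag_counts.items() for _ in range(count)]
    rng = random.Random(seed)
    means = {}
    for depth in depths:
        if depth > len(pool):
            raise ValueError("sampling depth exceeds library size")
        total = 0
        for _ in range(iterations):
            total += len(set(rng.sample(pool, depth)))
        means[depth] = total / iterations
    return means


# Invented toy library: one abundant tag type and several rare ones.
toy_library = {"CATGAAAAAAAAAA": 50, "CATGCCCCCCCCCC": 5,
               "CATGGGGGGGGGGG": 2, "CATGTTTTTTTTTT": 1}
for depth, mean_types in mean_unique_tags(toy_library, [5, 20, 58]).items():
    print(depth, round(mean_types, 2))
```

Plotting the mean number of tag types against depth gives a saturation curve, which is one way to compare tag diversity between libraries that were sequenced to different depths.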
HESC extracted short SAGE libraries were compared to CGAP libraries; upon association back to the corresponding 21mer sequence, the most abundant 21mer was used for further analysis (tag count >20). Differentially expressed tags were mapped to genes using CGAP SAGE Genie mappings and to genomic sequences using genomic contigs (NT_*), complete genomic sequences (NC_*), or genomic regions (NG_*) from RefSeq. Tag-to-gene mappings were filtered based on their uniqueness in the transcriptome and their uniqueness in the genome. All tags that uniquely mapped to a single transcript were parsed; for tags that did not map to a known transcript, only tags that mapped uniquely in the genome were parsed. Differential gene expression was calculated using the script sagematrix.sh (Mehrdad Oveisi; Gene Expression Bioinformatics Group; http://www.bcgsc.ca/bioinfo/ge/), which employed the Audic and Claverie statistic (Audic and Claverie, 1997). Sagematrix.sh generated a P-value to allow assessment of whether the observed changes in gene expression were significant. Another descriptor of differential gene expression was the fold change (FC), which is equivalent to the frequency of a tag in library "x" divided by its frequency in library "y". The script outputs the natural log of the fold change according to the equation lnFC = ln((x/Nx)/(y/Ny)), where x (or y) is the tag count and Nx (or Ny) is the total number of tags in library "x" (or "y"). Genes were first defined as significantly differentially expressed if they met a P-value cut-off of 0.05 or less. The definition was further refined by requiring that the genes demonstrate a 3-fold or 1/3rd-fold change across each library in a subgroup of the embryonic germ layers.
The exception was the list of genes downregulated in the mesoderm subgroup, which was consistently downregulated across 15 of the 16 libraries (CGAP56, constructed from embryonic kidney, did not share a subset of commonly downregulated tags with the other mesoderm derivatives).

3.3 Results and discussion

3.3.1 Pair-wise library comparisons

The use of Pearson's correlation (Eisen et al., 1998) in gene expression analysis reflects the degree of linear relationship between two libraries with respect to expression levels of commonly detected tags. Values range from "+1" to "-1". A correlation of "+1" means that there is a perfect positive linear relationship between libraries. Conversely, a correlation of "-1" means there is a perfect negative linear relationship between the libraries.

The first goal of this analysis was to determine how similar ES libraries were to one another. Table 21 is a matrix of correlations between 12 embryonic stem cell libraries constructed from 8 cell lines. Multiple libraries were constructed for WA01 (4 libraries; each constructed from a different RNA sample collected from cells exposed to different experimental conditions⁵) and WA09 (2 libraries constructed from the same RNA sample using short (WA09s) and long (WA09L) SAGE protocols).

⁵ Four long SAGE libraries constructed from the WA01 line: WA01m cell line was cultured on matrigel; WA01 cultured on irradiated mouse embryonic fibroblasts (irr-MEFs); WA01(7) cultured on irr-MEFs, POU5F1 knocked-in, unselected; WA01(8) cultured on irr-MEFs, POU5F1 knocked-in, G418-selected.

Table 21 Pearson correlation coefficients (r) between human embryonic stem cell line SAGE expression profiles. Values were generated using matrix.pl for short and extracted short SAGE libraries. Although matrix.pl produces 1-r values for each pair-wise comparison, the table reflects r values. The highest r value between any two u-hESC libraries is 0.974 (WA01 compared with WA14).
Note that all libraries except for WA09(s) are long SAGE. ES03  ES04 1  ES03  WA07  BG01  0.8683  0.8659  ES04  0.8683  1  0.8849  BG01  0.8659  -0.8849  1  WA07  0.7427  0.8993 0.9166  WA14  0.7427 0.8993  WA13  WAOl(m)  0.7249  0.8538  0.9008  0.6494  0.9009  0.9309  0.8262  0.8137  0.9429  0.9294  0.8782  0.7913  0.8097  0.8133  0.7986  (3>  0.8597  0.7573  0.7573  1  0.8899  (3)  0.8708  0.8417  1  0.8454  0.8275  0.8834  0.8137  0.8899  0.8417  0.8454  1  0.8495  0.9215  WA01(7)'  0.8747  WA0U8)  1  WA09(s)  2  1  WA09(I)  2  (3)  0.8133  0.9246  0.9294  0.8402  0.9008  0.7249  0.6494  0.8538  0.9009  0.974  0.974  0.8603  (4)  (3)  0.9027  0.7478  0.4767  0.845  0.9072  0.8478  0.7559  0.8907  0.8885  0.8918  0.6177  0.85  0.4954  0.8352  0.8162  0.8289  (4)  0.7675  (4)  0.9138  <4)  0.8544  (4)  0.7461  0.9658  (3)  0.7743  (3)  0.8597  0.8603  <4>  0.7986  0.9072  0.8885  0.8289  (4>  0.9138  (4)  0.8782  0.7478  0.8478  0.8918  0.7675  <4>  0.8544  (4)  1  0.7574  0.8162  0.4767  0.7559  0.6177  0.4954  0.7461  0.7743  0.7574  1  0.9027  0.845  0.8907  0.85  0.8352  0.8938  0.9381  0.9067  0.9429  2  0.8402  1  WAOl(m)  WA09(L)  0.9246  0.8708  WA01  2  0.8747  0.8097  1  WA09(s)  0.9215  0.7913  WA13  WA01(8)'  0.8495  0.8262  0.9534  1  0.8834  0.9534  0.9309  0.8717  WA01C7)  0.8275  0.9166  <3)  0.8825  WA14  (3)  1  (3)  0.8825  (3>  0.8717  WA01  1  1  1 0.9658  (3)  0.7604  <4)  0.8938 0.9381  (3)  0.9067 0.7604  <4)  1  1. WA01 stem cell lines under different experimental conditions. WA01(m) was grown on matrigel. WA01 was grown on irradiated mouse fibroblasts. WA01(7) Oct4 knock-in unselected (reference) and WA01(8) Oct4 knock-in G418 selected.  2. WA09 stem cell lines. Same RNA sample used to construct short SAGE (s) and long SAGE (L) libraries. 3. Highest Pearson's correlation for a library against all other stem cell libraries (Read according to library listed in the column heading). 4. 
Pearson's correlations between libraries of the same cell line.

The hESC libraries were more highly correlated to one another than they were to n-CGAP libraries in most cases (Figures 19 and 20). Table 21 lists the highest correlation value for each ES library pair-wise comparison. WA09s was the exception; it was more highly correlated to just 7 out of 12 embryonic stem cell libraries than to non-pluripotent cell lines/tissues (ES03, BG01, WA14, WA01, WA01(7), WA01(8), and WA09L) (Appendix 3d). The remaining lines did not correlate closely with WA09s and included ES04 (r=0.6494), WA07 (r=0.4767), WA13 (r=0.6177), and WA01m (r=0.4954). Furthermore, WA09s was most similar to BG01 (r=0.8162) as opposed to its long SAGE counterpart WA09L (r=0.7604). These results suggest that short SAGE and long SAGE techniques are not directly comparable. The expectation (assuming both techniques are unbiased samples of the ES transcriptome) was a correlation close to 1 in the case of the WA09 libraries. It has been suggested that the long SAGE technique may have a bias in transcriptome sampling (personal communication, Allen Delaney; Gene Expression Informatics; http://www.bcgsc.ca/ge/bioinfo), providing an explanation for the discrepancies in r-values in the short to long comparisons. Pearson correlation is also sensitive to outliers (e.g., highly expressed tags), which distort the linear relationship between libraries. This relates back to the suggestion that the long SAGE technique may selectively sample tags in a total RNA sample while short SAGE may be more representative of a random sampling of the transcriptome (Delaney, A., personal communication). Thus outliers arising from biased sampling of the transcriptome are a likely cause of the weaker than expected correlation between WA09s and the long ES libraries.

Figure 19 Hierarchical clustering of short and extracted normal SAGE libraries. HESC libraries are boxed in red.
Abbreviations: vasc endo, vascular endothelium.

[Figure 19 tree: hESC libraries and n-CGAP short and extracted short SAGE libraries, each leaf labelled with its CGAP accession and tissue; scale bar 0.1.]

Figure 20 Hierarchical clustering of long SAGE libraries from CGAP and our own database of normal and malignant cells. Libraries constructed from PCR SAGE (Zhao et al., submitted) include an additional cDNA amplification step compared to conventional long SAGE (Saha et al., 2002). The technique may preferentially amplify a small subpopulation of transcripts, possibly resulting in highly expressed tags shared between all libraries constructed using PCR-SAGE, in effect biasing the Pearson correlation. Abbreviations: HSC, haematopoietic stem cells; MAPC, multipotent adult progenitor cells.
[Figure 20 tree: long SAGE libraries, with the hESC cluster and the PCR-SAGE cluster (HSC and leukemia libraries) labelled; scale bar 0.1.]

Multiple libraries were constructed for the WA01 line. A reciprocal highest correlation value was observed only in the case of hESC libraries expressing a POU5F1-reporter gene construct generated by homologous recombination (Zwaka and Thomson, 2003), WA01(7) and WA01(8) (Table 20). The POU5F1-reporter gene construct was reported to antagonize endogenous POU5F1 (Thomson, J., personal communication). POU5F1 is normally up-regulated in ES compared to non-pluripotent cells. Diminishing POU5F1 levels would be expected to affect the linear relationship of these libraries to other stem cell libraries. However, detection of POU5F1 is not noticeably different in either WA01(7) or WA01(8) compared to the remaining hESC lines (see Figure 8, Chapter 2.3.3). WA01m was constructed from cells removed from irr-MEFs and cultured on matrigel; it was expected to have the least mouse sequence contamination from incomplete removal of the embryonic feeder layer before mRNA isolation. WA01m was more similar to WA07 and ES04 than to other WA01 libraries. This might suggest that libraries more similar to WA01m (under constant experimental conditions, e.g., grown on irradiated MEFs) have less mouse embryonic fibroblast contamination.
The WA01 library (grown on irr-MEFs) was strongly correlated to WA14 (r=0.974) while the WA01(7) and WA01(8) libraries were noticeably less related to the stem cell lines (r values ranging from 0.8544-0.9138). These libraries do not cluster with one another in Figures 19 and 20 due to the variance in experimental conditions.

The second goal of the hierarchical clustering was to measure the relatedness of hESCs to non-pluripotent cells and tissue types. Most publicly available libraries are short SAGE, so to do this comparison with the available resources, I had to perform the following two analyses: short SAGE versus extracted short SAGE (artificial 14mer sequences generated from 21mer tags) and long SAGE versus long SAGE. I used the coefficient of determination (r²) to obtain a mean correlation across multiple samples using the pair-wise Pearson correlation coefficients (Zar 1996). The mean r² calculated across all hESC libraries for each pair-wise comparison to a non-pluripotent library is presented in Tables 22 and 23. Pearson correlations between long and short SAGE were problematic, as shown by the weaker than expected correlation between WA09(s) and WA09(L) (r=0.7604; refer to Table 21). The WA09(s) library was most correlated to the BG01 ES library (r=0.8162) as opposed to a non-pluripotent short SAGE library. This provides a relative measure of a perfect positive relationship between two libraries when comparing short to long SAGE. All correlations generated comparing non-pluripotent short libraries to hESC can be interpreted in relationship to r=0.8162 approximating r=1. Table 22 lists the mean r² in descending order for short versus extracted short SAGE library comparisons. HESCs were most similar to CGAP29, a library constructed from vascular endothelium (r²=0.493; r=0.702).
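The averaging step described above can be sketched as follows. This is a minimal illustration, not the script used in the thesis: the function name `mean_r_squared` is hypothetical, and the use of the sample standard deviation for the STDEV column is an assumption, since the text does not say which estimator was used.

```python
import math
import statistics


def mean_r_squared(r_values):
    """Summarize pair-wise Pearson r values for one comparator library.

    r_values: Pearson r between each hESC library and the comparator.
    Returns (mean r^2, standard deviation of r^2, sqrt of mean r^2).
    """
    r2 = [r * r for r in r_values]
    mean_r2 = statistics.mean(r2)
    # Assumption: sample standard deviation (n - 1 denominator).
    sd_r2 = statistics.stdev(r2) if len(r2) > 1 else 0.0
    # Converting the mean r^2 back to r gives the second value quoted
    # in the text, e.g. sqrt(0.493) is about 0.702 for CGAP29.
    return mean_r2, sd_r2, math.sqrt(mean_r2)
```

Averaging r² rather than r keeps every term non-negative, so a mix of positive and negative correlations cannot cancel out in the mean.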
Mesoderm-derived libraries constituted 6 of the top 10 most correlated to hESCs (tissues represented: vascular endothelium, bone marrow, and prostate) (Table 22). The top 10 most similar libraries to hESC included 2 ectoderm-derived libraries (eye lens and fetal brain), an endoderm derivative (lung), and a universal reference RNA sample; thus libraries originating from all three germ layers are among the ten most similar to hESCs. The reference RNA sample is a pooling of multiple human cancer cell lines⁶ and is typically employed on microarrays for subtractive hybridization (http://www.stratagene.com). The top ten most similar libraries to the hESC libraries demonstrated weak correlations to the stem cell expression profiles (r² ranging from 0.35 to 0.49, equivalent to r of 0.59-0.7). From this analysis it was apparent that hESCs were not strongly related to any cell/tissue types other than themselves.

Table 22 Comparisons of short and extracted short SAGE libraries measured by Pearson correlation (r). The r² (coefficient of determination) values are a composite of comparisons for each extracted short SAGE hESC library against an adult or fetal normal CGAP library, combined to determine the average r² between all hESC lines and a differentiated cell/tissue type. Bolded libraries are extracted short SAGE libraries.
Mean r²      STDEV        Library accession  Tissue
0.49346856   0.044130197  CGAP29    Vascular normal CS control
0.436696766  0.081714228  CGAP409   Bone marrow normal AP CD34+/CD38-/lin-
0.425607806  0.068133099  CGAP443   Bone marrow normal AP CD34+/CD38+/lin+
0.424168977  0.048945875  CGAP72    Prostate normal MD PR317
0.407536843  0.082553484  CGAP1363  Eye lens B UIHI0
0.404822256  0.052949823  CGAP585   Lung normal CL L15
0.385883144  0.076145984  CGAP430   Brain fetal normal B S1
0.379414384  0.145578391  CGAP168   Universal reference human RNA CL
0.352839606  0.056480824  CGAP59    Prostate normal B 2
0.352831891  0.065611645  CGAP30    Vascular normal CS VEGF+
0.342395811  0.10246759   CGAP71    Kidney embryonic CL 293+beta-catenin
0.336588001  0.058926787  CGAP56    Kidney embryonic CL 293-control
0.320422764  0.04935973   CGAP584   Lung normal CL L16
0.312863044  0.047560295  CGAP131   Breast normal myoepithelium AP myoepithelial1
0.30365421   0.105265262  CGAP160   Placenta first trimester normal B 1
0.286733789  0.111589805  CGAP132   Breast normal B hyperplasia1
0.28305822   0.037743008  CGAP49    Ovary normal CL IOSE29EC-11
0.267713931  0.081263737  CGAP405   Lymph Node normal B 1
0.24370988   0.091797008  CGAP404   Bone marrow normal B D01
0.241963306  0.068989617  CGAP362   Pancreas normal B 1
0.241153717  0.062168582  CGAP47    Breast normal epithelium AP Br N
0.241089385  0.037388241  CGAP9     Brain astrocyte normal CL NHA5
0.230604608  0.070626513  CGAP389   Thyroid normal B 001
0.209778502  0.117741133  CGAP203   Prostate normal epithelium CS senescent
0.209105008  0.059021753  CGAP398   Stomach normal B antrum
0.188514773  0.115570664  CGAP169   Prostate normal epithelium CS confluent
0.175561659  0.079993497  CGAP182   Breast normal organoid B
0.1719172    0.046039738  CGAP429   Vascular endothelium normal liver associated AP NLEC1
0.166050311  0.097843172  CGAP20    Ovary normal CS HOSE 4
0.165926198  0.065339335  CGAP135   Liver normal B 1
0.163196162  0.08269416   CGAP183   Breast normal stroma AP 1
0.163022158  0.052677852  CGAP181   White Blood Cells normal breast associated AP
0.161263383  0.060715932  CGAP141   Muscle normal B old
0.160178048  0.06111842   CGAP420   Breast normal myoepithelium AP IDC7
0.159110087  0.085138344  CGAP161   Placenta normal B 1
0.153291684  0.033958039  CGAP66    Pancreas normal CS H126
0.151307066  0.048247808  CGAP136   Heart normal B 1
0.150896881  0.06236684   CGAP163   Retina Peripheral normal B 2
0.145797983  0.057929581  CGAP164   Retina macula normal B HMAC2
0.145020103  0.032635067  CGAP523   Brain normal leptomeninges B AL2
0.143986203  0.030734044  CGAP65    Pancreas normal CS HX
0.13424494   0.044472162  CGAP142   Muscle normal B young
0.129687028  0.024569831  CGAP669   Breast normal FS NER
0.129592598  0.047957376  CGAP55    Brain normal cerebellum B BB542
0.126930023  0.024188429  CGAP31    Breast normal epithelium AP 1
0.118323856  0.033198779  CGAP167   Spinal cord normal B 1
0.117928767  0.042967668  CGAP421   Brain normal substantia nigra B 1
0.112012558  0.042897007  CGAP364   Placenta hydatidiform mole B 1
0.110079643  0.037097221  CGAP133   Brain normal peds cortex B H1571
0.106257967  0.017455054  CGAP180   Vascular endothelium normal breast associated AP
0.099761571  0.043706959  CGAP53    Brain normal thalamus B 1
0.099292949  0.04449798   CGAP99    Lung normal B 1
0.095455116  0.03664259   CGAP134   Stomach normal epithelium B body1
0.091490914  0.025179833  CGAP98    Leukocytes normal B 1
0.088486662  0.040775561  CGAP165   Retina Pigment epithelium normal B 1
0.078527597  0.032035846  CGAP91    Peritoneum normal B 13
0.061807725  0.028743426  CGAP162   Retina Peripheral normal B 1
0.054822587  0.023020957  CGAP34    Brain normal cerebellum B 1
0.046971211  0.033483769  CGAP94    Kidney normal B 1
0.045586712  0.030369154  CGAP11    Colon normal B NC1
0.044868973  0.021719298  CGAP8     Brain normal cortex B BB542
0.044665395  0.02993352   CGAP1103  Skin normal B NS
0.036764595  0.020819044  CGAP10    Brain normal cortex B pool6
0.033665333  0.014154952  CGAP44    Fibroblasts CL postcrisis
0.02950209   0.020564078  CGAP12    Colon normal B NC2

⁶ Reference RNA comprised a pool of total RNA from 10 cell lines from various organs and tissues listed below: B-lymphocyte (plasmacytoma, myeloma), mammary gland (adenocarcinoma), liver (hepatoblastoma), cervix (adenocarcinoma), testis (embryonal carcinoma), brain (glioblastoma), melanoma, liposarcoma, macrophage (histiocytic lymphoma, histocyte), and T-lymphoblast (lymphoblastic leukemia).

Table 23 Comparisons of long SAGE libraries measured by Pearson correlation (r). The r² (coefficient of determination) values are a composite of comparisons for each long SAGE hESC library compared to a CGAP library or our own database of normal and malignant cells/tissues; these values were combined to determine the average r² between all hESC lines and a differentiated cell/tissue type. Bolded libraries were used to generate extracted short SAGE libraries for the normal CGAP dataset.
Mean r²      STDEV     Library accession  Tissue
0.640058639  0.079251  sc004    Multipotent adult progenitor cells
0.521967335  0.043546  CGAP656  Brain fetal normal B S1
0.508129685  0.075261  sc005    Multipotent adult progenitor cells
0.427168466  0.114232  CGAP703  Breast carcinoma
0.403683813  0.101766  CGAP683  Breast tumor fibroblast
0.402505235  0.11343   CGAP723  Breast fibroadenoma
0.355476285  0.09022   sc003    Multipotent adult progenitor cells
0.328857095  0.048539  CGAP655  Vascular endothelium normal liver associated AP 1
0.323480686  0.046945  CGAP675  Breast carcinoma
0.320603852  0.040084  CGAP673  Breast carcinoma
0.319053675  0.072053  CGAP643  Pancreas normal B 1
0.310902763  0.046104  CGAP645  Breast carcinoma
0.306680995  0.078762  CGAP647  Breast normal myoepithelium AP IDC7
0.287196046  0.063246  CGAP657  Breast carcinoma
0.244757859  0.082618  CGAP963  Lung adenocarcinoma
0.236308343  0.057561  CGAP649  Breast carcinoma
0.221390168  0.023823  shs01    Haematopoietic stem cells
0.196770585  0.032394  shs02    Haematopoietic stem cells
0.195653974  0.077994  CGAP648  Brain normal substantia nigra B 1
0.131343191  0.024434  sle02    Leukemia
0.100959373  0.014366  shs03    Haematopoietic stem cells
0.053800208  0.012898  sle01    Leukemia
0.017756184  0.006893  sle03    Leukemia

ES are transcriptionally complex, expressing more types of genes than other cell states (Gerecht-Nir et al., 2005; Golan-Mashiach et al., 2005). This complexity was defined by the expression of a wide repertoire of genes with disparate developmental roles. ES are thought to be primed for differentiation to any cell type in the embryo proper by expressing genes encoding the differentiation machinery (at low levels); this property has been termed the "Just in case" theory (Golan-Mashiach et al., 2005).
The maintenance of these levels is pivotal to the undifferentiated phenotype and, until a response is elicited by a specific growth or differentiation signal, these expression levels are stably preserved. This was evidenced by the complete expression of embryonic developmental pathways and the high expression of pathway antagonists in undifferentiated hESCs (u-hESCs) (see Chapter 2.3.4: Developmental signalling pathway expression in embryonic stem cells) (Brandenberger et al., 2004b). Thus transcriptional complexity also distinguishes ES from other cell types.

Figure 21 assesses tag type diversity in one ESC line (WA09L) versus non-ESC CGAP libraries. To take into account the inconsistent depths of sequencing that contribute to tag type diversity, the WA09 library was randomly sampled at intervals comparable to CGAP library sizes. Consistent with the literature, at most sampling depths WA09 expressed more tag types than other cell states (Figure 21), testifying to the wide breadth of genes expressed in ESC. Statistical testing of the tag type diversity at various sampling depths was performed using Student's t-test (Zar 1996) for extracted CGAP short SAGE libraries and the WA09 line (Figure 22). Similarly, WA09 showed a greater diversity of tag types at various sampling depths than adult/fetal samples (with the exception of the fetal brain library, which expressed more tag types at a depth of 300,000 total tags than WA09). The number of WA09 tag types expressed at 60,000 total tags was not significantly different from the number of tag types expressed in the liver vascular endothelium library (Figure 22).

Figure 21 Incremental random sampling of WA09(L) extracted short SAGE total tags. In silico generated WA09 libraries totalling 10,000-400,000 total tags were plotted against tag diversity (unique tag sequences) for each generated library.
Experimental short SAGE libraries (CGAP) were plotted to compare tag diversity to a WA09 sampling of a similar total tag count.

[Figure 21 plot: unique tag sequences (y-axis) against thousands of total tags (x-axis, 50 to 450) for the in silico WA09 samples and the individual CGAP libraries.]

Figure 22 Incremental random sampling of WA09(L) extracted short SAGE total tags and normal differentiated CGAP (n-CGAP) extracted short SAGE libraries. WA09 libraries totaled 10,000-400,000 total tags and were plotted against number of unique tag sequences for each generated library. Significance of difference between tag types expressed at each sampling depth was determined using the Student's t-test (see inset data table). Standard deviations were included.
[Figure 22 plot: unique tag sequences (y-axis) against thousands of total tags (x-axis, 50 to 350) for hESC, pancreas, brain substantia nigra, WBC, breast myoepithelium, liver vascular endothelium, and fetal brain.]

Inset data table

n-CGAP tissue                Thousands of total tags  Tag types (nCGAP)  Tag types (hESCs)  T-TEST
Pancreas                     20                       7863.2             11046              5.27562E-25
Brain substantia nigra       40                       16587              18566              7.40044E-39
WBC                          40                       15651.33           18566              4.17339E-15
Breast myoepithelium         60                       21334.6            24878              5.00258E-22
Liver vascular endothelium¹  60                       24883.8            24878              0.400221756
WBC                          80                       26831              30539              7.58436E-41
WBC                          100                      29866.13           35727              1.6408E-15
Fetal brain²                 300                      78472              76400              2.61476E-30

¹ Insignificant difference in tag type diversity between hESCs and n-CGAP.
² The fetal brain library expressed more tag types than the WA09 library at 300,000 total tags. Note that whole brain was used and was thus representative of multiple cell types (e.g., neurons and glial cells).

I additionally examined the degree of similarity between long SAGE libraries (hESC, n/c-CGAP, and our own database of normal and malignant cells). Table 23 presents the mean r² values in descending order for the hESC lines versus a CGAP library. The correlations were additionally depicted in Figure 20 as a hierarchical tree; recall that branch lengths are measured as 1-r. Three major branches were identified from the cluster analysis: hESCs cluster closely together in one branch of the tree, a fetal brain library comprises its own branch, and the remaining CGAP libraries (mainly generated from carcinomas and adult progenitor/stem cells) cluster into 3 subgroups of the last branch.
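The 1-r distances behind the tree can be sketched as below. This is a hedged illustration rather than a reconstruction of matrix.pl: the function names are hypothetical, libraries are assumed to be simple tag-to-count dictionaries, and restricting the correlation to tags detected in both libraries follows the description of "commonly detected tags" in the text.

```python
import math


def pearson_r(lib_a, lib_b):
    """Pearson r over tags present in both libraries (dict: tag -> count)."""
    shared = sorted(set(lib_a) & set(lib_b))
    n = len(shared)
    if n < 2:
        raise ValueError("need at least two shared tags")
    xs = [lib_a[t] for t in shared]
    ys = [lib_b[t] for t in shared]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    # Undefined (division by zero) if either library has constant counts.
    return sxy / math.sqrt(sxx * syy)


def lower_triangular_distances(libraries):
    """Return (name, row) pairs of 1 - r distances, lower-triangular.

    libraries: dict mapping library name -> {tag: count}; row i holds
    the distances to libraries 0..i-1 followed by the 0 diagonal.
    """
    names = list(libraries)
    rows = []
    for i, name in enumerate(names):
        row = [1.0 - pearson_r(libraries[name], libraries[other])
               for other in names[:i]]
        rows.append((name, row + [0.0]))
    return rows
```

Because r ranges from -1 to +1, the distance 1-r ranges from 0 (identical profiles) to 2 (perfectly anti-correlated profiles), which is why closely related libraries sit on short branches.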
Based on the evidence presented in Table 23 (mean r² values between long SAGE libraries and hESCs), hESCs were highly correlated to a multipotent adult progenitor cell (MAPC) line (MAPC4; r²=0.64, r=0.80), fetal brain (CGAP656), and another MAPC library constructed for the same line (MAPC5). MAPCs, isolated from bone marrow and CD34/CD45 negative⁷, have been shown to possess characteristics similar to pluripotent cell lines in their ability to differentiate into most mesodermal and neuroectodermal cells in vitro and all embryonic lineages in vivo (Schwartz et al., 2002). The similarities in developmental potential between MAPCs and ESCs demonstrated experimentally were reflected in the high correlation between the SAGE libraries from each cell type. As stated earlier, Pearson's correlations are affected by extremes in gene expression levels; thus if two libraries commonly express a highly abundant tag, this may inappropriately result in a stronger correlation than biology would suggest. To test whether the correlation between MAPC and hESC was heavily influenced by outlier tags, the top 10 most abundantly expressed genes amongst all hESC libraries and MAPC4 were extracted. Taking the intersection between all tags resulted in two sequences common to the hESC and MAPC libraries and typically expressed at roughly 1000 tags per 100,000 total tags per individual library (Table 24). Pearson correlations were calculated for MAPC4 and the hESC metalibrary, both including and excluding the two abundantly expressed genes (Table 25). The highly expressed genes appear to have a minor effect on the correlation between hESC and MAPC, but this does not invalidate the finding that both stem cell libraries share strong similarities in long SAGE transcriptome profiles. Haematopoietic stem cell (HSC) libraries used in this analysis were among the least similar to ESCs (Table 23).
I assume that the result was primarily due to technical differences in long SAGE library construction; all HSC and leukemia libraries were constructed using the PCR-SAGE technique, which utilized an amplification step prior to long SAGE construction. Pearson correlation is biased towards highly expressed tags; the introduction of the mRNA amplification step in PCR-SAGE may have resulted in the preferential sampling of certain abundant transcripts shared in HSCs. Thus the correlation coefficient comparing conventional long SAGE and PCR SAGE may not accurately describe the biological similarity between the cell types sampled, similar to the notion that short SAGE and long SAGE were not accurately compared using Pearson's correlation coefficients (A. Delaney, personal communication).

⁷ CD34 and CD45 are hematopoietic stem cell markers.

Table 24 The intersection of the top most highly expressed tags in MAPC and hESC libraries. Tag counts for each sequence are normalized to tags per 200,000.

Highly expressed tags:
  Tag 1: GAAAAATGGTTGATGGA, LAMR1 (ribosomal protein SA)
  Tag 2: TGTGTTGAGAGCTTCTC, EEF1A1 (eukaryotic translation elongation factor 1 alpha 1)

Library   Tag 1 (LAMR1)  Tag 2 (EEF1A1)
MAPC4     1,068.5        1,253.9
WA09(L)   579.5          860.3
WA01(7)   769            1,122.1
WA01(8)   784.5          987.8
ES03      1,796.9        1,660.6
ES04      1,653.7        1,985.4
WA07      1,382.2        2,584.6
WA14      1,933.4        1,763.7
WA13      1,500.7        1,820
WA01(m)   1,037.5        3,003.5
WA01      1,780.6        1,804.5
BG01      1,749.3        1,893

Table 25 Calculation of Pearson correlation between the hESC metalibrary and MAPC4 using Correlate (written by Allen Delaney, Gene Expression Informatics). The first comparison includes all tags expressed in both libraries and the second comparison excludes the two most abundant tags that are shared between hESC and MAPC libraries (*minus_high_exp libraries). Min - minimum tag count for a gene; Max - maximum tag count for a gene; Sum - total tags in MAPC4 or hESC metalibrary; Mean - mean tag count level for the library; SD - standard deviation of the mean; Correlation matrix - Pearson correlation coefficient calculated for genes detected in both MAPC4 and the hESC metalibrary.

Analysis for 24,701 cases of 2 variables:

Variable   MAPC4    hESC meta
Min        1        1
Max        1,676    22,697
Sum        182,559  1,842,608
Mean       7.4      74.6
SD         36.3     424.2

Correlation matrix:
           MAPC4   hESC meta
MAPC4      1       0.8142
hESC meta  0.8142  1

Analysis for 24,699 cases of 2 variables:

Variable   MAPC4 minus high exp  hESC meta minus high exp
Min        1                     1
Max        1,676                 17,410
Sum        180,242               1,802,692
Mean       7.3                   73.0
SD         34.8                  383.8

Correlation matrix:
                                 MAPC4 minus high expressers  hESC meta minus high expressers
MAPC4 minus high expressers      1                            0.7989
hESC meta minus high expressers  0.7989                       1

Fetal brain was found to be among the most highly similar libraries to hESCs in both the in silico short SAGE analysis (Table 22) and this long SAGE comparison. Neural cell types are transcriptionally complex, expressing many transcripts at low levels (Evans et al., 2002). On average, fetal brain and the hESC libraries have more tag types in common than ESCs have with other cell states (data not shown). Additionally, pathways functioning in early neural development, such as the Wnt and Nodal pathways, were implicated in ESC maintenance (see Chapter 2.3.4) (Brandenberger et al., 2004b).

A caveat of the analysis was the paucity of tissues sampled for long SAGE library construction. The comparisons possible were between tissues and cell types separated by years of development. With the current human SAGE data, it was not possible to describe global similarities between earlier germ layer derivatives and hESCs.
The conclusion one may draw is that hESCs bear no particular similarity to most other derivatives of the embryonic germ layers represented in the adult, except for other adult stem cell populations such as MAPCs. As stated earlier, conclusions drawn from comparing extracted short SAGE to short SAGE or conventional long SAGE to PCR-SAGE using the Pearson correlation are not necessarily reflective of the biological similarities/dissimilarities between cells and tissue types. Higher correlations were observed in comparisons using the same technique (e.g., long SAGE versus long SAGE) than in comparisons between two extracted short SAGE libraries (see bolded libraries in Tables 22 and 23). However, I did observe the conservation of similar correlations between tissue types with both long SAGE and extracted short SAGE libraries to the ES libraries. For example, all ES long SAGE libraries were still more highly related to the short SAGE library generated for WA09 than to any other n-CGAP short SAGE library. Similarly, all n-CGAP long SAGE libraries that were truncated to short sequences for comparison maintained the relative distance to the ES library cluster that they demonstrated in the long SAGE clustering analysis. It remains that the tissues sampled to construct long SAGE libraries were severely limited in comparison to the wealth of short SAGE transcriptome profiles generated to date. With the enhanced ability to map long SAGE tags and the complications in comparing SAGE libraries constructed with different protocols, there is a need to repeat these analyses once long SAGE libraries have been generated for a wider breadth of tissues and cell types.

3.3.2 Isolation of differentially expressed tags

Genes that define the pluripotent state may have a differential expression pattern in undifferentiated hESC compared to differentiated populations.
Cell fate determination pathway constituents are likely to be down-regulated while genes involved in cell proliferation, cell survival and maintenance of genomic integrity may be up-regulated in ES. I wanted to determine if the genes up-regulated in the hESC SAGE data compared to adult and fetal CGAP libraries overlapped with published stem cell marker lists (identified in u-hESCs and embryonic cells). I also sought to define new markers of u-hESCs to distinguish these cells from subpopulations of adult/fetal tissues grouped according to embryonic germ layer origin. Distinguishing u-hESCs from endoderm-libraries, mesoderm-libraries, or ectoderm-libraries may provide us with insight into which genes are necessary to direct differentiation to a specific germ-layer derivative. A subset of genes may identify molecular differences between hESCs and all tissue types of a given embryonic germ layer or the differences may be cell/tissue-type specific. The frequency of a transcript in the hESC metalibrary was compared to its frequency in n-CGAP libraries classified according to their embryonic germ-layer origin. In total there were 43 pair-wise comparisons (15 ectoderm derived, 16 mesoderm derived, and 12 endoderm derived libraries). Significantly up-regulated or down-regulated genes were determined by Audic-Claverie statistics and defined by a P-value of 0.05 or less.
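A single pair-wise comparison of this kind can be sketched in Python as follows. The sketch assumes the published Audic and Claverie formulation, in which the probability of observing y tags in a library of size N2 given x tags in a library of size N1 is p(y|x) = (N2/N1)^y (x+y)! / [x! y! (1 + N2/N1)^(x+y+1)]. The tail convention for the P-value (the smaller of the two cumulative tails), the replacement of zero counts by one tag before taking the ratio, and all function names are assumptions made for illustration; they are not taken from the software used for the thesis.

```python
import math


def ac_log_pmf(y, x, n1, n2):
    """Log of the Audic-Claverie probability p(y | x).

    x: tag count in library 1 (total size n1)
    y: tag count in library 2 (total size n2)
    """
    ratio = n2 / n1
    return (y * math.log(ratio)
            + math.lgamma(x + y + 1) - math.lgamma(x + 1) - math.lgamma(y + 1)
            - (x + y + 1) * math.log1p(ratio))


def ac_p_value(x, y, n1, n2):
    """Smaller cumulative tail of p(. | x) at the observed y (assumed convention)."""
    lower = sum(math.exp(ac_log_pmf(k, x, n1, n2)) for k in range(y + 1))
    upper = 1.0 - lower + math.exp(ac_log_pmf(y, x, n1, n2))
    return min(1.0, max(0.0, min(lower, upper)))


def ln_ratio(x, y, n1, n2, scale=200_000):
    """Natural log of the ratio of normalized counts (library 2 over library 1).

    Counts are scaled to tags per `scale`; zero counts are floored at one tag
    (an illustrative choice) so that the ratio is defined.
    """
    norm_x = max(x, 1) * scale / n1
    norm_y = max(y, 1) * scale / n2
    return math.log(norm_y / norm_x)


def classify(x, y, n1, n2, alpha=0.05):
    """Return 'up', 'down' or 'ns' for library 2 (e.g. hESC) versus library 1."""
    if ac_p_value(x, y, n1, n2) > alpha:
        return "ns"
    return "up" if ln_ratio(x, y, n1, n2) > 0 else "down"
```

Here library 2 plays the role of the hESC metalibrary and library 1 the comparison library, so a positive ln(ratio) corresponds to up-regulation in hESC. Working with log-probabilities through `math.lgamma` avoids overflow in the factorial terms for abundant tags.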
In general, significant up-regulation in hESC demonstrated a minimum 3-fold change while down-regulation demonstrated a minimum of 1/3-fold change. Among all pair-wise comparisons to the germ layer derivatives 4,771 sequences were up-regulated and 359,529 were down-regulated. The discrepancy between the absolute numbers of up-regulated and down-regulated tags was due to the use of many disparate tissue types expressing numerous genes that specifically define their terminally differentiated phenotype. Sequences that were consistently up-regulated or down-regulated across all of the ectodermal, mesodermal, or endodermal derived libraries were isolated (Table 26). The exception was a set of genes down-regulated among the mesodermal derivatives that were differentially expressed across 15 of the 16 libraries. CGAP56 (embryonic kidney) did not share a subset of commonly down-regulated tags across the remaining mesoderm derivatives.

Table 26 Summary of the total number of tag sequences up- or down-regulated in the hESC SAGE libraries compared to ectoderm-derived normal CGAP (n-CGAP) libraries, mesoderm-derived n-CGAP libraries, and endoderm-derived n-CGAP libraries.
Germ layer       Libraries           Up-regulated    Down-regulated    Total tags per germ layer
Ectoderm         All 15 libraries    105             9                 374,275 [1] (186,134 [2])
                 Any 15 libraries    4,303           274,274
Mesoderm         All 16 libraries    70              5                 311,998 [1] (209,158 [2])
                 Any 16 libraries    2,702           230,265*
Endoderm         All 12 libraries    95              8                 182,787 [1] (135,391 [2])
                 Any 12 libraries    2,706           219,009
3 germ layers    All 43 libraries    32              0
                 Any 43 libraries    4,771           359,529

1.  Sum of tag counts for each SAGE library in a germ layer category
2.  Total unique tag sequences

hESC extracted short SAGE libraries were compared to CGAP libraries. Genes differentially expressed in the hESC metalibrary versus all n-CGAP libraries, or versus subsets of endoderm-libraries, mesoderm-libraries, or ectoderm-libraries, were listed in Tables 27-33. Note that Tables 28, 30, and 32, which represent the subsets of differentially expressed tags in ectoderm, mesoderm, and endoderm derivatives respectively, do not list the genes commonly down-regulated across all CGAP libraries or the genes that are specifically down-regulated in one set of embryonic germ layer libraries. The genes represented in Tables 28, 30, and 32 were consistently up-regulated across at least two germ layer groupings. Tables 29, 31, and 33 listed genes that were only differentially expressed in hESC versus one set of embryonic germ layer libraries. Genes that were solely differentially expressed in hESC versus ectoderm libraries included several tags that only mapped to a single genomic location (Table 29; highlighted in yellow). The genes detected by the SAGE tags (termed ecto01-ecto07) could function to repress genes directing ectoderm differentiation.
Conversely, ecto08 and ecto09 were down-regulated in embryonic stem cells compared to ectoderm libraries; they might encode determinants of an ectodermal fate in the early embryo. Other hypothetical uncharacterized genes shown to be differentially expressed in hESC versus mesoderm or endoderm libraries were listed in Tables 31 and 33 respectively and are termed meso01-05 and endo01-03 (up-regulated in hESC).

Table 27 Up-regulated gene list in u-hESC metalibrary compared to n-CGAP libraries. (Note: there were no genes that were consistently down-regulated in hESCs compared to all n-CGAP libraries). Italicized sequences have an ambiguous tag-to-gene mapping.

Fold change  Sequence  Count  Genome hits  Symbol  GO  AAAATTTACAGTTTGCC  482  1  LECT1  9.576184  ATGATGATGATGGGACT  1873  2  SLC25A5  Molecular function unclassified; Skeletal development; Angiogenesis Transporter; Mitochondrial carrier protein; Nucleoside, nucleotide and nucleic acid transport; Transport  ATGTTAATAAAATAGGC  720  1  GPC4'  Cell adhesion molecule; Extracellular matrix glycoprotein; Cell adhesion  12.5975  CAAACACCGTTGTAACC  818  1  Hs.334219  CAAATTTTATTGTTAGT  844  1  DPPA4  18.13544  CAGTCTCTCAAGTCCCG  3869  8  CSRP1  Molecular function unclassified Molecular function unclassified; Muscle development; Cell proliferation and differentiation  6.896997  CAGTCTCTCAAGTCCCG  3869  8  RPS10  Ribosomal protein; Protein biosynthesis  6.896997  CCCTCCTGGACAAGGCT  774  3  HMGA1  Molecular function unclassified  11.6491  GAAAAGGGTTTTCTTTT  1600  1  LAPTM4B  GAAAGAAAGAGAGGAAA  1164  1  cDNA CS0CAP004YK13  GAGGACACAGATGACTC  1365  1  PODXL  GCTGTTTATTTCACCTG  1002  656  Similar to XP 375833.1  GGGCTGTGAAATGGGTG  721  1  esc01/nt 016354  GTCCTGGTGGTGGGGGG  641  1  esc02/nt 037704  
GTTTAAATCGACTGTTT  1699  1  PSMA2  TAATTCTACCAAGGTCT  1521  5  TDGF1  TACTGGTTTGTATATTT  581  1  FLJ35259  TAGCTACAGGACATTTT  988  1  DNMT3B  TATCACTTTTTTCTTAA  2655  4  P0U5F1  TTCATTATAATCTCAAA  6202  7  PTMA  TTGCTCACACAAAAAAA  997  1  BM759098  1  1  6.464523  17.83319  1  1  1  13.05209  Transporter  24.29679 16.32674  Extracellular matrix glycoprotein; Cell structure  1  1  1  1  2  20.37089 -  15.949 13.47483  Other proteases; Proteolysis Growth factor; Ligand-mediated signalling; Developmental processes; Cell proliferation and differentiation  5.805942  DNA methyltransferase; DNA methyltransferase; DNA metabolism Homeobox transcription factor; Nucleic acid binding; mRNA transcription regulation; Developmental processes  20.35203  Molecular function unclassified  11.89386  33.30267 12.09318  58.23732  14.8713  138  TTGCTCA CA CAAAAAAA  997  1  BM692360 CD247421 (WA01 EST library; NIH-MGC)  TTTACTGCTAGAAACCA  1391  4  LIN28  TTTTATGGGTAACTTTT  1304  1  CCNG1  TTGCTCACACAAAAAAA  997  1  1  1  14.8713 14.8713 Other RNA-binding protein  35.94167  Kinase activator; Cell cycle control  10.99607  1.  Described in previous studies as potential hESC markers.  2.  POU5F1 (highlighted in green) is the only known transcription factor presented in this list.  Table 28 Genes up-regulated in hESC compared to ectoderm-derived libraries. These genes were found to be similarly differentially expressed in at least one other grouping of germ-layer derivatives. (See legend below for explanation of table).  
Meta sequence  Count  Gene symbol  Mean P-value  Mean In (ratio)  0.002953333  1.87785625  Gene Ontology Other ligand-gated ion channel; Anion channel; Hydrogen transporter; ATP synthase; Hydrolase; Purine metabolism; Cation transport; ATP synthesis->Fl beta;;  GATCCCAACATTGTTGG  1980  ATP5B  ATAGACATAAAATTGGT  1158  C1QBP  CTGTGACACAGCTTGCC  1219  CCT2  0.00022  1.7311  Complement component; Antibacterial response protein; Complement-mediated immunity;  0  2.2029  CTATATTTTTTAAAATC  587  CTSC  Chaperonin; Protein folding;  0.00138  2.18983125  Cysteine protease; Proteolysis;  HMGA1  2  A TTTGTCCCA GCCTGGG  5258  0  2.98228125  1081  HMGB2  2  0.000686667  2.12788125  Molecular function unclassified; Biological process unclassified; HMG box transcription factor; Chromatin/chromatin-binding protein; Chromatin packaging and remodeling;p53 pathway->High mobility group protein 1;  TCTGCAAAGGAGAAGTC TTGCTCACAAAAAAAAA  900  Hs.382100  6.67E-05  2.5868375  GGTTGAAAAAAAAAAAA  564  Hs.446545  0.001486667  2.16851875  ACCATTGGATTCATCCT  1015  IFITM1  0.002033333  2.1404625  AATAAAACACATTTTAT  607  LEFTB  0.00122  2.3153  Other miscellaneous function protein; Cell proliferation and differentiation; TGF-beta superfamily member; Biological process unclassified; TGF-beta signalling pathwaytransforming growth factor beta;;  AAATAAAGAATTTAAAG  1380  MGST1  0.0013  2.17775  Other transferase; Protein modification; Detoxification;  TACAAAACCATTTTTTC  1439  NCL  0.000693333  1.9036625  Ribonucleoprotein; rRNA metabolism;  GAATCGGTTATACTCGG  1535  2.67E-05  1.719325  Oxidoreductase; Electron transport;  TATTTTAAATGCCACCT  437  0.008546667  1.71855625  CGAACAAAAGACTTCGG  967  NDUFS5 esc03/ nt 007741 esc04/ nt 011875  4.67E-05  2.61644375  1  3  139  GAAAGAAAGAGAGGAAA  1164  CGCACAATCATTGAGTT  569  esc05/ nt 022517 esc06/ nt 033903  AAAAGAAACTTGTGCTT  2314  AA TACTTTTGTA TTGCT CAA TAAA TGTTCTGGTT  6.67E-06  2.9392875  0.00218  2.32865  PABPC1  0.000333333  2.0241625  871  PAI-RBP1  
0.00012  2.17576875  Nucleic acid binding; Pre-mRNA processing; Other RNA-binding protein; Interacts with chromatin-remodeling factor C H D 3 ; Biological process unclassified;  9202  RPL37  0  1.56810625  Ribosomal protein; Protein biosynthesis;  GCTTTTAAGGA TACCGG  6329  RPS20  0  2.2446875  Ribosomal protein; Protein biosynthesis;  TAATAAAGGTGTTTATT  8849  RPS8  0.000473333  1.41510625  Ribosomal protein; Protein biosynthesis;  GA TTA TTGGGA TTGTAG  473  SEPHS1  0.004733333  1.81708125  TATCAATATTCACTTGA  743  SFRP1  0.002733333  2.0177875  Other transferase; Amino acid biosynthesis; Other signalling molecule; Ligand-mediated signalling; Angiogenesis->Frizzled-Related Protein; Wnt signalling pathway->secretedfrizzled-relatedprotein;  CTGTCA TTTGTAA TA TG  1121  0.00148  1.5174625  Molecular function unclassified; mRNA splicing;  TGA TAGTCTGAAA TA TG  507  SFRS3 Similar to KIAA1606  0.003433333  2.10314375  TACATTTTCATATTAGA  1632  SNRPG  0  2.42744375  mRNA splicing factor; mRNA splicing;  1.47771875  3  1  TATAATCTTTATGGCTT  601  SSBP1  0.002793333  ATAAAGTAACTGGTTTG  897  STRAP  0.006193333  1.16710625  TACGTACTGCCTGCCCG  650  TIMM13  0.004486667  1.7855875  Single-stranded DNA-binding protein; DNA replication; DNA repair; DNA replication; Other miscellaneous function protein; Receptor protein serine/threonine kinase signalling pathway; Other miscellaneous function protein; Intracellular protein traffic; Protein targeting; Transport; Hearing;  GGTTTGGCTTAGGCTGG  2384  UQCRH  0.000433333  1.24695  Reductase; Oxidative phosphorylation;  Table 28 Legend: Italicized tags 3 or more genomic hits TGATAGTCTGAAATATG 512 genomic hits 1.  Downregulated upon differentiation (Brandenberger et al., 2004b; Richards et al., 2004; Sato et al, 2003)  2.  POU5F1 target/cofactor/co-expressed  3.  Embryonic developmental pathway genes  1 4 0  Table 29 Differentially expressed in hESC versus ectoderm-derived libraries only. (See inset legend for table explanations). 
Meta sequence  Count  Gene Symbol  Mean P-value  Mean In (ratio)  GGAACAAACAGATCGAA  4238  0  3.37100625  ATTTCCTTGAATGTGGC  981  0.00004  2.74305625  ATTAAGAGGGACGGCCG  1671  CD24 Hypothetical gene X69397 ectoOl nt 005058  1.33E-05  2.29159375  GTGACAGAATTGATATC  897  UGP2  0.000546667  2.06641875  1  Gene Ontology Molecular function unclassified; Immunity and defense; Developmental processes; Cell proliferation and differentiation  Nucleotidyltransferase; Other polysaccharide metabolism  GTCTTGAACTGAGAGTC  510  ASNS  0.004686667  1.89019375  Synthetase; Other ligase; Amino acid biosynthesis  CTTAAGATTCAACTGGG  432  0.007033333  1.82459375  Ribonucleoprotein; rRNA metabolism  GCTACTATTAGATCAGG  506  N0P5/N0P58 ecto02 nt 008183'  0.0033  1.8190125  CAGATCTTTGTGAAGAC  1903  UBA52  0.000406667  1.7732625  Molecular function unclassified; Proteolysis  TGGTGTTGAGGAAAGCA  7903  RPS18  0  1.76955625  Ribosomal protein; Protein biosynthesis  GTTTCTATCAATGTGAA  672  LYPLA1  0.002933333  1.68834375  Phospholipase; Lipid metabolism  AA TA TTGAGAAGAAACT  846  EIF3S6  0.00354  1.6439  Translation initiation factor; Protein biosynthesis; Translational regulation; Oncogene  TA TCTGTCTACTTTCTC  1164  SET  0.000733333  1.5966875  Phosphatase inhibitor; DNA replication; DNA replication  CTGCTATACGAGAGAAT  4530  0  1.596225  Ribosomal protein; Protein biosynthesis  CTCCTCACCTGTATTTT  3697  RPL5 ecto03 nt 011109  0  1.584275  GTATCTTCACATCTTGG  876  HSBP1  0.003993333  1.57703125  Transcription cofactor; Chaperone; mRNA transcription regulation; Other metabolism  TACAAGAGGAAGTACTC  3398  RPL6  0.000206667  1.53090625  Ribosomal protein; Protein biosynthesis  GTGTAATAAGACATAAC  2164  0.000193333  1.52698125  Ribonucleoprotein; RNA localization  GAGTAGAGAAAAGAGAC  635  1  0.005886667  1.5026625  GTTTTTGCTTCAGCGGC  1410  1  0.00104  1.4751375  TTCTTGTGGCGCTTCTC  2115  HNRPA2B1 ecto04 nt 008470 ecto05 nt 005403 ecto06 nt 011109'  GCATTTAAATAAAAGAT  2713  TGAATCTGGGTGGGATA 
AATCCTGTGGAGCATCC  1  0.001633333  1.39864375  0.000666667  1.3701125  804  EEF1B2 ecto07 nt 008470  0.00168  1.34310625  3172  RPL8  0.002046667  1.286325  1  Translation elongation factor; Protein biosynthesis  Other RNA-binding protein;Ribosomal protein; Protein biosynthesis  GGCTTTA CCCTTCCCTG  1956  EIF5A  0.002153333  1.11913125  0.00595  -2.46867  Translation initiation factor; Protein biosynthesis Cell adhesion molecule; Calmodulin related protein; Annexin; Cell adhesion-mediated signalling; Cell adhesion; Calcium ion homeostasis;  ATAGGTCAGAAAGTGTA  58  CLSTN1  CGCCGACGATGCCCAGA  16  GGCTGTACCCAAGCTGA  79  AGAGGTGGTGTGCAAAA  10  G1P3 ecto08 nt 004487 ecto09 nt 011362  0.001543  -3.54579  Molecular function unclassified; Immunity and defense;  1  0.003357  -2.58104  1  ACGGAACAATAGGACTC  2  PTGDS  0.0021  -3.02723  0.003686  -5.97111  TGCACTTCAAGAAAATG  7  SPARCL1  TCTCTGATGCTTTGTAT  47  TIMP2  TACATAATTACTAATCA  18  TncRNA  2  2  2  2  2  2  0.000436  -4.83601  Synthase; Isomerase; Fatty acid biosynthesis; Muscle contraction; Extracellular matrix glycoprotein; Other immune and defense; Cell proliferation and differentiation;  0.000179  -2.79686  Metalloprotease inhibitor; Proteolysis;  0.003193  -3.12937

Table 29 Legend:
Italicized tags: 3 or more genomic hits
TGATAGTCTGAAATATG: 512 genomic hits
1.  Candidate novel regulators of ectoderm differentiation
2.  Genes down-regulated in hESC

Table 30 Genes up-regulated in hESC compared to mesoderm-derived libraries. These genes were found to be similarly differentially expressed in at least one other grouping of germ-layer derivatives. See legend below for table explanations.  
Meta sequence  Count  Gene symbol  Mean Pvalue  Mean Ln (ratio)  GATCCCAACATTGTTGG  1980  ATP5B  625E-06  1.7480375  ACAATGTTGTAGTGTCC  616  CRABPl'  0.000106  2.54705625  Gene Ontology Other ligand-gated ion channel; Anion channel; Hydrogen transporter; ATP synthase; Hydrolase; Purine metabolism; Cation transport; ATP synthesis->Fl beta; Other transfer/carrier protein; Lipid and fatty acid transport; Lipid and fatty acid binding; Vitamin/cofactor transport; Steroid hormone-mediated signalling; Transport; Ectoderm development;  0.001094  2.30685  Cysteine protease; Proteolysis;  0  2.64125  Molecular function unclassified; Biological process unclassified; TGF-beta superfamily member; Biological process unclassified; TGF-beta signalling pathwaytransforming growth factor beta;  CTATATTTTTTAAAATC  587  CTSC  A TTTGTCCCAGCCTGGG  5258  HMGA1  AATAAAACACATTTTAT  607  LEFTB  0.000125  2.51866875  GTCTACTTTAGGTGTGC  1011  MGC8685  1.88E-05  2.93589375  AAATAAAGAATTTAAAG  1380  MGST1  0.00215  2.23279375  Other transferase; Protein modification; Detoxification;  TACAAAACCATTTTTTC  1439  0.00155  1.45008125  Ribonucleoprotein; rRNA metabolism;  TTTTCTTCTTTGGCTTG  458  0.0011  2.2051125  GAAAGAAAGAGAGGAAA  1164  0  3.22756875  CGCACAATCATTGAGTT  569  NCL esc07/ nt 016354 esc05/ nt 022517 esc06/ nt 033903  0.003738  2.1805625  TCTGTACACCTGTCCCC  6678  RPS11  0  2.705575  Ribosomal protein; Protein biosynthesis;  GCTTTTAAGGA TACCGG  6329  RPS20  1.88E-05  1.2977875  Ribosomal protein; Protein biosynthesis;  GATTATTGGGATTGTAG  473  0.00125  2.024225  Other transferase; Amino acid biosynthesis;  TTGCTCACAAAAAAAAA  900  6.25E-06  2.51748125  TGA TAGTCTGAAA TA TG  507  SEPHS1 Similar to HERV-H LTRassociating 3 Similar to KIAA1606  0.0007  2.16529375  2  3  -  TACA TTTTCA TA TTAGA  1632  SNRPG  8.13E-05  1.76109375  mRNA splicing factor; mRNA splicing;  CCGCCTCCGGGAATGAG  1626  SNRPN  0  1.99073125  mRNA splicing factor; mRNA splicing;  TATATATTTGAACTAAT  483  SOX2  0.00085  
2.30940625  HMG box transcription factor; Nucleic acid binding; mRNA transcription regulation; Neurogenesis;  1.90310625  Other miscellaneous function protein; Intracellular protein traffic; Protein targeting; Transport;  TACGTACTGCCTGCCCG  650  1  2  TIMM13  0.001269  143  Hearing; TTATAATATAATGTTTT  484  TMEFF1  0.000563  2.33475  CAGTCTAAAATGCTTCA  1763  UCHL1  0.00075  3.011175  Surfactant; Biological process unclassified; Cysteine protease; Proteolysis; Parkinson disease->Ubiquitin C-terminal hydrolase-Ll; Ubiquitin proteasome pathway->26S proteasome;  Table 30 Legend:  Italicized tags 3 or more genomic hits TGATAGTCTGAAATATG 512 genomic hits 1.  Downregulated upon differentiation (Brandenberger et al., 2004b; Richards et al., 2004; Sato et al., 2003)  2.  POU5F1 target/cofactor/coexpressed  3.  Embryonic developmental pathway genes  Table 31 Differentially expressed in hESC versus mesoderm-derived libraries only. See legend for further explanation of the table. Meta sequence  Count  Gene symbol  Mean Pvalue  Mean Ln (ratio)  Gene Ontology  GGAACACACAGCACAGA  844  PS ATI  0.001006  2.2769625  Transaminase; Amino acid biosynthesis;  TTTTGTTAGTGCAAAAA  368  CLDN6  0.001731  2.20416875  Tight junction; Cell structure;  CATCCAAAAATCAACAA  390  FLJ10884  0.002294  2.114  Transcription factor; Nuclease; Nucleoside, nucleotide and nucleic acid metabolism; Other metabolism;  GGGATTTTGTATAACCA  669  0.003369  2.0699875  Molecular function unclassified; Oncogenesis;  GAGAAAACCCGGTACGC  437  PMAIP1 mesoOl/ nt 005612  0.002188  2.06904375  CGTTCCTGCGGACGATC  910  ID1  0.000975  1.9713125  ATGCTCCTGAGTAGAAC  317  0.007894  1.92333125  AAAATATATCTCTGGAC  306  Hs.471439 meso02/ nt 016354  0.009806  1.8768125  CTGCCTTCTTGGGGATT  982  PPP1CC  0.001688  1.8400625  Protein phosphatase; Other select calcium binding proteins; Other signal transduction; D1/D5 dopamine receptor mediated signalling pathway->Protein Phosphatase-1;;  GCTTCCTAAATGGCCCT  329  0.010956  1.82614375  
Synthase; Lyase; Amino acid biosynthesis; Cysteine biosynthesis->0-Acetylserine-lyase  AAAGCAATCAACCCTGT  399  CBS meso03/ nt 009714  0.003519  1.805775  CTTTGCACTCTCCTTTG  552  TCEA1  0.007538  1.80185  Basal transcription factor; Nucleic acid binding; mRNA transcription elongation;  GTCAACTGCTTCAGCTT  669  FLJ12666  0.003619  1.72971875  Molecular function unclassified; Biological process unclassified;  1  3  2  3  3  Other transcription factor; mRNA transcription regulation; Angiogenesis; TGF-beta signalling pathway->Transforming growth factor beta;  144  TTGGCATTGTCCCCTTT  480  CATCTAAACTGCTGGGC  1336  CAATGCTGCCAGCATTG  meso04/ nt 004487  0.004913  1.615025  0.000944  1.3997  1413  WBSCR1 meso05/ nt_010755  0.004094  1.27195625  TGCCTGCACCAGGAGAC  169  CST3  4  0.005887  -1.52227  GAGTGGGGGCTTCACTC  21  DPP7  4  0.003667  -2.64715  Serine protease; Proteolysis;  CTGACCTGTGTTTCCTC  128  HLA-B  0.001333  -2.2564  Major histocompatibility complex antigen; MHCI-mediated immunity;  TAGGTTGTCTAAAAATA  2824  TPT1  6.67E-06  -1.34433  Non-motor microtubule binding protein; Immunity and defense;  3  3  4  4  Translation initiation factor; Protein biosynthesis;  Cysteine protease inhibitor; Proteolysis;  Table 31 Legend:  1.  Downregulated upon differentiation (Brandenberger et al., 2004b; Richards et al., 2004; Sato et al., 2003)  2.  Embryonic developmental pathway genes  3.  Candidate novel regulators of mesoderm differentiation  4.  Genes down-regulated in hESC  Table 32 Genes up-regulated in hESC compared to endoderm-derived libraries. These genes were found to be similarly differentially expressed in at least one other grouping of germ-layer derivatives. See legend for table explanations.  
Meta sequence  Count  Gene symbol  GATCCCAACATTGTTGG  1980  ATP5B  ATAGACATAAAATTGGT  1158  C1QBP  CTGTGACACAGCTTGCC  1219  CCT2  ACAATGTTGTAGTGTCC  616  CRABP1  CTATATTTTTTAAAATC  587  CTSC  TCTGCAAAGGAGAAGTC  1081  HMGB2  GGTTGAAAAAAAAAAAA  564  Hs.446545  1  1  2  Mean Pvalue  Mean ln(ratio)  0.00095  1.807075  Gene Ontology Other ligand-gated ion channel; Anion channel; Hydrogen transporter; ATP synthase; Hydrolase; Purine metabolism; Cation transport; ATP synthesis->Fl beta;  0.0038  1.580358  Complement component; Antibacterial response protein; Complement-mediated immunity  0.00005  2.21255  0.001775  2.171125  Chaperonin; Protein folding Other transfer/carrier protein; Lipid and fatty acid transport; Lipid and fatty acid binding; Vitamin/cofactor transport; Steroid hormone-mediated signalling; Transport; Ectoderm development  0.005233  1.655708  0.000133  2.441175  0.002033  2.309683  Cysteine protease; Proteolysis HMG box transcription factor; Chromatin/chromatin-binding protein; Chromatin packaging and remodeling  145  Other miscellaneous function protein; Cell proliferation and differentiation  ACCATTGGATTCATCCT  1015  IFITM1  4.17E-05  2.357142  GTCTACTTTAGGTGTGC  1011  MGC8685  333E-05  2.8341  TACAAAACCATTTTTTC  1439  NCL  0.001192  1.832258  Ribonucleoprotein; rRNA metabolism  GAATCGGTTATACTCGG  1535  0.000992  1.61035  Oxidoreductase; Electron transport  TATTTTAAATGCCACCT  437  0.010342  1.726975  CGAACAAAAGACTTCGG  967  0.004608  2.152867  TTTTCTTCTTTGGCTTG  458  0.008492  2.034658  GAAAGAAAGAGAGGAAA  1164  NDUFS5 esc03/ nt 007741 esc04/ nt 011875 esc07/ nt 016354 esc05/ nt 022517  8.33E-06  2.970467  AAAAGAAACTTGTGCTT  2314  PABPC1  0.003317  1.834633  AA TA CTTTTGTA  871  PAI-RBP1  0.001383  1.927425  Nucleic acid binding; Pre-mRNA processing Other RNA-binding protein; Interacts with chromatin-remodeling factor C H D 3 ; Biological process unclassified  9202  RPL37  8.33E-06  1.687508  Ribosomal protein; Protein biosynthesis  TCTGTACACCTGTCCCC  
6678  RPS11  0  2.886167  Ribosomal protein; Protein biosynthesis  GCTTTTAAGGATACCGG  6329  RPS20  0  2.0331  Ribosomal protein; Protein biosynthesis  8849  RPS8  0.000367  1.31705  TATCAATATTCACTTGA  743  SFRP1  0.0005  2.410908  Ribosomal protein; Protein biosynthesis Other signalling molecule; Ligand-mediated signalling; Angiogenesis->Frizzled-Related Protein; Wnt signalling pathway->secreted frizzled-related protein;  CTGTCA  TA TG  1121  0.004692  1.452883  Molecular function unclassified; mRNA splicing  TA TG  507  SFRS3 Similar to KIAA1606  0.005508  2.1248  Molecular function unclassified; Biological process unclassified  1632  SNRPG  0  2.0292  mRNA splicing factor; mRNA splicing  CCGCCTCCGGGAATGAG  1626  SNRPN  0  2.469467  mRNA splicing factor; mRNA splicing  TATATATTTGAACTAAT  483  SOX2  2  0.005608  2.029417  H M G box transcription factor; Nucleic acid binding; mRNA transcription regulation; Neurogenesis  TATAATCTTTATGGCTT  601  SSBP1  0.006525  1.746742  Single-stranded DNA-binding protein; DNA replication; DNA repair; DNA replication Other miscellaneous function protein; Receptor protein serine/threonine kinase signalling pathway  CAA TAAA  TTGCT  TGTTCTGGTT  TAA TAAAGGTGTTTA  TTTGTAA  TGA TAGTCTGAAA  TT  TACA TTTTCA TA TTAGA  3  1  ATAAAGTAACTGGTTTG  897  STRAP  0.001392  1.642317  TTATAATATAATGTTTT  484  TMEFF1  0.005608  2.120967  Surfactant; Biological process unclassified  CAGTCTAAAATGCTTCA  1763  UCHL1  0  3.316525  Cysteine protease; Proteolysis  GGTTTGGCTTAGGCTGG  2384  UQCRH  0.000417  1.142608  Reductase; Oxidative phosphorylation  146  Table 32 Legend: Italicized tags 3 or more genomic hits TGATAGTCTGAAATATG 512 genomic hits 1.  Downregulated upon differentiation (Brandenberger et al., 2004b; Richards et al., 2004; Sato et al., 2003)  2.  POU5F1 target/cofactor/coexpressed  3.  Embryonic developmental pathway genes  Table 33 Differentially expressed in hESC versus endoderm-derived libraries only. 
Legend is provided below for table explanations.  Meta sequence  Count  Gene symbol  Mean Pvalue  Mean In(ratio)  Gene Ontology  AAAATAAAGAGCCATAG  963  APEX1  0.00825  1.638917  Transcription cofactor; Exodeoxyribonuclease; Endodeoxyribonuclease; Lyase; DNA repair;  GCAAAACCAGCTGGTGG  891  CCT8  0.006408  1.631908  GGCTGCCCTGGGCAGCC  860  DPYSL3  0.000592  2.24465  Chaperonin; Protein folding; Other hydrolase; Nucleoside, nucleotide and nucleic acid metabolism; Axon guidance mediated by semaphorins->CRMP 3-associated molecule; Axon guidance mediated by semaphorins->Collapsin response mediator protein;  TCCTCAAGATAAAGTCT  1170  ERH  0.000192  1.716958  Other transcription factor; mRNA transcription regulation;  TTGTAAACTTAAGTGGC  976  FKBP3  0.000583  1.7755  Other isomerase; Protein folding; T-cell mediated immunity; Other neuronal activity;  ATGGCAAGGGACAAAGC  1025  FLJ35696  0.005483  1.260883  CTTTATGTGATAGTATT  609  HDAC2  0.008425  1.652225  GAAA TTTAAAGCAGGTT  2057  HMGB1  0.000108  2.405842  CCAGGAGGAATGCCTGG  2766  0.000475  1.401025  TCAACTTCTGGCTCCTC  871  3  0.0002  2.193617  GATTTCCTTGAAGCAGG  730  3  0.000775  2.290733  GGCTGATTTTTATTACC  971  HSPA8 endoOl/ nt 009714 endo02/ nt 022853 endo03/ nt 023133  Molecular function unclassified; Biological process unclassified; Transcription factor; Nucleic acid binding; Deacetylase; mRNA transcription regulation; Chromatin packaging and remodeling; Protein modification; Cell cycle control; Wnt signalling pathway->Histone deacetylase; p53 pathway->Histone deacetylase 1; HMG box transcription factor; Chromatin/chromatin-binding protein; Chromatin packaging and remodeling; p53 pathway->High mobility group protein 1; Hsp 70 family chaperone; Protein folding; Protein complex assembly; Stress response; Apoptosis signalling pathway->Heat shock protein 70;;Parkinson disease->Heat shock protein 70;  3  0.000925  2.011842  1  1  2  Synthase; Ligase; Purine metabolism;  TGTACTACTTAAGTTTA  685  PAICS  0.0023  1.960483  
AATTTTATTTCTGTTTG  844  PCBP1  0.00435  1.545675  Ribonucleoprotein; mRNA polyadenylation; mRNA end-processing and stability;  CTTATTTGTTTTAAAAC  575  PLS3  0.006108  1.794983  Non-motor actin binding protein; Other developmental process; Other oncogenesis; Cell structure;  Other transcription factor; Nucleic acid binding; DNA repair; mRNA transcription regulation; Cell cycle control; Oncogene;  0.003167  1.986775  RBPMS  0.0145  1.525408  2254  YWHAE  0.001842  1.779142  TTGAAGCTTTAAGAACT  1  CXCL2  0.014108  -4.34608  Nuclease; Biological process unclassified; Other miscellaneous function protein; Non-vertebrate process; EGF receptor signalling pathway->14-3-3; FGF signalling pathway->14-3-3; PI3 kinase pathway->14-3-3; Parkinson disease->14-3-3; p53 pathway->14-3-3 sigma; p53 pathway->14-3-3; Chemokine; Cytokine and chemokine mediated signalling pathway; Calcium mediated signalling; NFkappaB cascade; Ligand-mediated signalling; T-cell mediated immunity; Macrophage-mediated immunity; Granulocyte-mediated immunity; Cell proliferation and differentiation; Cell motility;  AGCAGTGACGGATAGTT  1  EVA1  0.0153  -4.11228  Cell adhesion molecule; Cell adhesion;  GATGAATCCGGGGTATG  1  LOC120224  0.01845  -4.31724  TGTGGGAAATCCTGCGT  1  SLPI  0.014917  -4.54698  GGAATCCAATCTGTTGC  665  PTTG1  TTGAATTTGTTTGTTAG  434  GAATTAACATTAAACTT  4  1  4  4  4  Serine protease inhibitor; Biological process unclassified;

Table 33 Legend:
Italicized tags: 3 or more genomic hits
1.  Downregulated upon differentiation (Brandenberger et al., 2004b; Richards et al., 2004; Sato et al., 2003)
2.  POU5F1 target/cofactor/coexpressed
3.  Candidate novel regulators of endoderm differentiation
4.  Genes down-regulated in hESC
A number of genes have been shown by various groups to be both highly expressed in ES and rapidly down-regulated upon differentiation. These include POU5F1, DNMT3B, DPPA4, HMGA1, and TDGF1 (Richards et al., 2004; Sato et al., 2003; Sperger et al., 2003), all of which were over-expressed in the ES SAGE libraries versus all normal adult and fetal libraries available from CGAP (72 libraries) (Appendix 3e) in an additional experiment (Table 27). Many of the genes found to be most significantly over-expressed in ES are in some way related to or regulated by POU5F1, a well established regulator of self-renewal in both human and mouse ES. HMGA1 is a downstream target of POU5F1 that has been implicated in cancer metastasis (Chuma et al., 2004) and in changes in chromatin structure linked to the regulation of gene expression (Harrer et al., 2004). HMGA1 may serve to regulate the transcription of the molecular machinery involved in ES cell proliferation. POU5F1 spatial and temporal expression is thought to be regulated by its methylation status (Fuhrmann et al., 1999). Given the co-expression of DNMT3B in ES, this particular methyltransferase is a likely candidate for the epigenetic regulation of POU5F1 and possibly many other genes involved in ES self-renewal. DNMT3B is important for embryonic development and is also purported to have a role in cancer cell survival and gene silencing (Beaulieu et al., 2002; Rhee et al., 2002). Methyltransferases work together with histone deacetylases to suppress gene expression by epigenetically modifying chromatin according to the histone code hypothesis (Cheung et al., 2000).
The histone code hypothesis proposes that histone proteins and their modifications can lead to inherited differences in transcriptional activity and silencing (Jenuwein and Allis, 2001). The observation of significantly higher expression of DNMT3B and the histone deacetylase HDAC2 (Table 33) in hESC suggests there may be coordination between the actions of these genes. In addition to POU5F1, other early developmental signals, for example those provided by the TGFβ1-Nodal pathway, may be critical for ES maintenance. In particular, TDGF1 over-expression may be necessary for ES cell proliferation.

This analysis uncovered two tags (referred to as esc01 and esc02) that did not map to a known transcript (CGAP Best Tag Mapping) but each mapped to a single genomic location. Further investigation of the tags using Ensembl Multi BlastView (http://www.ensembl.org) yielded information about the regions surrounding the tag sequence. Esc01 mapped to chromosome 4q25 and was antisense to predicted transcripts bearing similarity to a gene of unknown function, HDCMA18P (Figure 23). The zebrafish homologue of HDCMA18P functions in mRNA nuclear splicing (annotated by GeneOntology terms). Several microRNA sequences lie upstream of esc01, within 500 bp. Many miRNA genes are found in clusters in the genome and are presumed to be expressed from the same primary transcript (Wienholds et al., 2005). Based on the neighbourhood of expressed genes surrounding esc01, one might predict that the tag sequence could be derived from a pri-microRNA sequence that could regulate HDCMA18P gene expression in cis or various other molecular targets in trans. MicroRNAs, which may primarily function in gene repression, are indicated to play a role in certain developmental processes, such as brain morphogenesis in zebrafish embryos (Giraldez et al., 2005), and may be enriched in embryonic stem cells. In support of this hypothesis, Dicer, a double-stranded RNA-specific ribonuclease essential for the generation of miRNA (Bartel, 2004), was detected across all of the ESC SAGE libraries. Additionally, Dicer expression was nearly totally absent in differentiated tissues. One might additionally hypothesize that HDCMA18P, which was expressed mainly in hESCs and at low levels (fewer than approximately 10 tags per 200,000), may be tightly regulated by the hypothetical microRNA from which esc01 was derived. If HDCMA18P preferentially processes a suite of genes, e.g., genes associated with a differentiated phenotype, the miRNA gene potentially detected by esc01 could be necessary for pluripotency.

Figure 23 Ensembl BlastView of sequence esc01.
[Ensembl BlastView image: chromosome 4, approximately 113,924,500 to 113,928,000 bp (4.02 kb); tracks shown include human cDNAs, EMBL mRNAs, Unigene, human RefSeqs, Genscan, EST transcripts, Ensembl transcripts, DNA contigs, BLAST hits and ncRNA; the esc01 BLAST hit is displayed alongside several annotated miRNA transcripts.]

Figure 24 depicts the Ensembl BlastView result for the alignment of tag sequence esc02 to chromosome 8q24.3. 
The surrounding sequence revealed that the long SAGE tag resided in the 3' end of a human cDNA sequence similar to forkhead box H1 (FOXH1). The FOXH1 gene is homologous to the 5' region of the cDNA sequence, suggesting that the transcript identified by esc02 may be an alternative transcript of the FOXH1 gene differing at the 3'-most end. FOXH1 is a transcription factor and downstream target of the TGFβ-Nodal signalling pathway. Together with other components of the Nodal pathway, FOXH1 is thought to co-regulate embryonic mesodermal morphogenesis (Hart et al., 2005). Recall that TDGF1, a pathway agonist, was also up-regulated in the hESC metalibrary in comparison to 72 n-CGAP libraries (Table 26), while Lefty 1, a TGFβ antagonist, was up-regulated in ES compared to ectoderm and mesoderm libraries (Tables 28 and 30). Lastly, a transcriptional target of Nodal signalling (ID1) was also up-regulated in hESC versus all mesoderm libraries (Table 31). Several agonists and antagonists of the TGFβ/Nodal pathway are overexpressed in human ES versus differentiated libraries (see Figure 10 and Tables 12-14, Chapter 2.3.4). This result is consistent with the idea that ES cells subscribe to a "just in case" philosophy (Golan-Mashiach et al., 2005), expressing a suite of tightly regulated genes involved in directing differentiation and maintaining the desired phenotype; thus undifferentiated ES cells are primed to exhibit developmental plasticity when supplied with the appropriate developmental cue. The role of TGFβ-Nodal signalling in hESC maintenance remains unclear, but over-expression of the pathway constituents suggests that it is of import to ES biology.

Figure 24 Ensembl BlastView of sequence esc02. Esc02 localizes to the 3'-most end of a predicted transcript similar to FOXH1. Note that the transcript is a probable alternate isoform of the FOXH1 gene.
[Ensembl BlastView image: chromosome 8; tracks shown include human cDNAs, EMBL mRNAs, Unigene, human RefSeqs, Genscan, EST transcripts, Ensembl transcripts, DNA contigs and BLAST hits; the esc02 BLAST hit lies next to the Ensembl known transcript FOXH1 and the cDNA BC051376.1, Homo sapiens mRNA similar to forkhead box H1 (cDNA clone IMAGE: 6650515).]

3.4 Conclusions

A caveat in discussing the significance of the sets of genes differentially expressed in hESC versus adult normal tissues is that the changes in expression pattern cannot be related to early embryonic differentiation. More informative comparisons have been reported by other groups, which compared ES lines to their immediately differentiated progeny (e.g., embryoid bodies) or to specific embryonic lineages derived from undifferentiated ES cells (Brandenberger et al., 2004a; Brandenberger et al., 2004b). This analysis did identify a set of 22 genes that were significantly more abundant in multiple u-hESC lines compared to 72 adult and fetal differentiated samples. Known genes such as POU5F1 and DNMT3B were expressed at significantly higher levels in u-hESCs compared to all n-CGAP samples. 
Among the list of "up-regulated" genes were characterized genes that have not been previously implicated in stem cell maintenance; these genes will be of interest as new candidate markers of the undifferentiated and pluripotent state. The candidate tags esc01 and esc02 were more than 3-fold more abundant in u-hESCs than in any n-CGAP library. Future analysis of these candidates, such as GLGI and functional studies, will be necessary to determine whether these tags correspond to uncharacterized markers of human pluripotent cells. The work presented in this Chapter may define a core set of ES maintenance genes, including both previously defined genes and genes novel to what is currently known in human ES biology.

4. Computational approach for the identification of candidate novel genes in undifferentiated hESC SAGE libraries

Contributions

The development of an approach for the identification of novel genes and the computational analyses were completed by Angelique Schnerch (BCCA GSC and the Department of Medical Genetics, UBC). Portions of this Chapter were used for a submitted publication (Hirst et al., submitted).

4.1 Introduction

Embryonic stem cell genes are largely underrepresented in EST and cDNA databases, although a handful of gene expression studies of various hESC lines have been performed (Brandenberger et al., 2004a; Brandenberger et al., 2004b; Carpenter et al., 2004; Richards et al., 2004; Sato et al., 2003; Sperger et al., 2003). I hypothesize that novel genes remain to be discovered in human embryonic stem cell lines. These genes may function in maintaining the hESC undifferentiated state. Multiple global gene expression profiles generated from human embryonic stem cell lines will provide the necessary resources to identify novel genes in a comprehensive fashion. The use of several hESC lines from different providers and deep sampling of each transcriptome using serial analysis of gene expression represent a comprehensive approach. 
The SAGE Library Production Group (BCCA GSC; http://www.bcgsc.ca/) constructed 11 long SAGE libraries from the RNA of 8 NIH-approved cell lines (WA01, WA07, WA09, WA13, WA14, ES03, ES04, BG01), providing a resource for novel gene discovery (see Table 3, Chapter 2.2.3 for a listing of the hESC data generated at the BCCA GSC). The first aim of this analysis was to identify tags enriched in hESC long SAGE libraries (hESC metalibrary). Identification of enriched tags was accomplished by pooling the data to construct a metalibrary and then comparing the metalibrary to publicly available human SAGE libraries constructed from various tissue and cell types. Tags found exclusively in the hESC metalibrary may lead to the discovery of candidate novel genes or alternatively spliced transcripts. Alternatively, tags exclusive to hESC may be derived from known transcripts unrepresented by the pool of tissues and cell types selected for comparison to the hESC lines. The second aim of this analysis was to generate a computational method for the selection of candidate tags for novel gene discovery (Figure 25). I hypothesized that tags that map to a human genomic sequence outside the vicinity of a known transcript may provide likely candidates for novel genes. To select such tags, I generated two in silico tag-to-sequence mapping databases. The first database of in silico tags was generated from human genomic sequences suspected of being orthologous to mouse and/or rat genomic sequences. A common approach to novel gene prediction is to determine sequence conservation across multiple species; such conservation is thought to correlate more strongly with possible functionality of an uncharacterized sequence of interest (Kellis et al., 2004). The second database contains tags derived in silico from sequences that lie 2 kb outside of the 3' untranslated region (UTR) of known transcripts. Approximately 30% of Ensembl genes do not have an annotated 3' UTR. 
CMOST mappings to Ensembl transcription units attempt to account for the 3' UTRs of all Ensembl genes. The addition of 1000 bp to each sequence approximates the UTR based on an estimated average 3' UTR length of 600 bp (Zhang, 1998). The artificial UTR is added regardless of whether a gene has a previously annotated UTR, which can overestimate the UTR region for several genes. CMOST may also underestimate the UTR length, as in the case of the Fukutin-related protein, which has a 3' UTR of over 1500 bp (Brockington et al., 2001), necessitating the mapping of tags to longer regions 3' of the polyadenylation site of genes.

Figure 25 The computational approach for the selection of candidate long SAGE tags to detect novel transcripts in hESC.
STEP 1: 20,047 tag types
STEP 2: Mapping (STEP 2A: CMOST best mapping; STEP 2B: embryonic EST mappings; STEP 2C: mouse CMOST/CGAP mappings)
    Known transcript: 3,365
    Genomic: 1,024 (961 unambiguous)
    Embryonic ESTs: 62
    Mouse mapping: 1,443
    Unmapped: 14,153
STEP 3: Mapping to human-mouse-rat orthologous sequences
    Mapped: 499
    Unmapped: 524
STEP 4: Mapping to "enhanced" 3' UTR (2 kb)
    < 2 kb downstream from an Ensembl transcript: 38
    > 2 kb downstream from an Ensembl transcript: 461
STEP 5: Tag quality filtering (P-value < 0.05) (Siddiqui et al., submitted)
    hESC candidate tags (P-value 0.05): 301

The purpose of mapping tags to both CMOST Ensembl transcription units and estimated 3' UTR sequences was to capture hESC tags that mapped in close proximity to a known gene. I made the assumption that a tag in close proximity to the 3' end of a gene is most likely derived from the 3' UTR. A potential concern regarding this assumption was that tags mapping to overestimated 3' UTRs could belong to the 5' region of a downstream gene or to an intergenic genomic sequence. 
I also assumed that tags mapping further than 2 kb from a known gene fell into an intergenic genomic sequence and may originate from a novel gene (footnote 8). Both databases provide resources for the selection of candidate hESC long SAGE tags that may correspond to a novel gene, based on sequence conservation across multiple species and location in a genomic region lacking previous expression data.

Footnote 8: Exceptional cases with UTR lengths outside of 2 kb exist (e.g., the human tissue inhibitor of metalloproteinase-3 (gb|U14394) has a 3,663 bp 3' UTR length). Makalowski, W., Zhang, J., and Boguski, M. S. (1996). Comparative analysis of 1196 orthologous mouse and human full-length mRNA and protein sequences. Genome Res 6, 846-57.

4.2 Methods

4.2.1 SAGE library acquisition and tag processing

CGAP 21mer and 14mer human SAGE libraries were downloaded from the SAGE Genie website (April 13, 2005; http://www.cgap.nci.nih.gov/SAGE) (Boon et al., 2002) (Appendix 4a).

4.2.2 SAGE tag processing

Long SAGE hESC tags were processed using the protocol described below:

1. I truncated seven base pairs from the 3' end of a long SAGE tag sequence (21mer) using "cut -c 1-10" (UNIX command line), resulting in an in silico generated 14mer library.
2. I produced one file containing all tag types found in a pooling of 11 hESC long SAGE libraries (hESC metalibrary) and a second file containing all tag types found in a pooled sample of 247 CGAP libraries (CGAP metalibrary). In each file, tag types were labeled "HESC" or "CGAP", respectively (tab delimited) (command: <filename> | awk '{print $1"\t""HESC"}').
3. The hESC metalibrary and CGAP metalibrary files were concatenated. All tags were enumerated from a sorted list of hESC and CGAP tags to determine whether they occur in one or both metalibraries using the UNIX command "cat hESC metalibrary CGAP metalibrary | sort | uniq -w 1-10 -c". 
The command "cat" concatenates two or more files; "uniq -w 1-10" compares only the first ten characters of each line of a file; lastly, "-c" provides a count (tab delimited) of the number of times a unique tag type occurs in the concatenated file.
4. Tags that occurred exclusively in the hESC metalibrary were isolated using "grep -w 1" (isolating tags that occurred once in the concatenated metalibrary) followed by "grep -w HESC" (isolating tags that occurred only in the hESC metalibrary).

4.2.3 Comprehensive mapping of SAGE tags (CMOST)

Tags enriched in the hESC metalibrary were mapped using the DiscoverySpace (version 3.2.4) CMOST plug-in (Comprehensive Mapping of SAGE Tags Best Mapping Algorithm) described in Chapter 2.2.4.1. The CMOST best mapping strategy first identifies all of the tags that are derived from a previously characterized transcript before tags are mapped to the genome. This allowed me to parse the novel tags that mapped solely to uncharacterized regions of the human genome. The following conditions were imposed to parse tags that mapped to genes/genomic sequences from the CMOST mapping result: all single base-pair tag modifications were ignored, tag sequences mapping to a known gene were excluded, and tag sequences mapping to multiple genomic locations were excluded.

4.2.4 Tag mapping database construction

I extracted in silico 21 bp tags from three data sources:

1. A set of 148,453 EST sequences generated from undifferentiated hESC lines and their differentiated derivatives, described below (Brandenberger et al., 2004b).
   a. GR_ES_: ESTs derived from a pooled sample of 3 undifferentiated hESC lines (u-hESCs) grown under feeder-free conditions (lines WA01, WA07, and WA09) (37,081 sequences)
   b. GR_EB_: ESTs derived from u-hESCs differentiated to embryoid bodies (37,555 sequences)
   c. GR_preNEU_: ESTs derived from u-hESCs differentiated to neuroectoderm-like cells (38,206 sequences)
   d. GR_preHEP_: ESTs derived from u-hESCs differentiated to hepatocyte-like cells (35,611 sequences)
   Sequences were downloaded using Entrez nucleotide (http://www.ncbi.nlm.nih.gov/; accession numbers CF227093-CF227275, CN255152-CN315425, CN331906-CN373615, CN385955-CN394390, CN394392-CN432241).
2. The UCSC multiple alignment format (MAF) files (version 1), comprising multiple alignments of the human July 2003 (hg16) assembly, the mouse February 2003 (mm3) assembly and the rat June 2003 (rn3) assembly (http://hgdownload.cse.ucsc.edu/downloads.html#human). Multi-species homologous genomic regions were determined based on BLASTZ scores (http://www.ncbi.nlm.nih.gov/BLAST/) and stored in files according to human chromosomal location. Human and mouse alignments were formatted for SAGE tag extraction using ad hoc UNIX commands.
3. Intergenic regions of Ensembl transcripts. In this analysis, intergenic was defined as a region of the genome not occupied by a known gene. Specifically, the intergenic region was at least 2 kb beyond the 3' end of each Ensembl transcript. Mappings were generated for sequence 2 kb in length and adjacent to the 3' end of a transcript. A hESC SAGE tag was defined as intergenic if it did not map to this data source. Sequences were obtained using Ensembl EnsMart (version 19.34a.1, NCBI genome assembly 34) (http://www.ensembl.org/).

For each sequence in a data source, an NlaIII site was used to demarcate the start of a 21 bp or 14 bp SAGE tag. Tags were extracted from all possible sites in a sequence (footnote 9). SAGE tags were extracted using the Perl script SAGE_tag_positions.pl (written by E. Pleasance) (Pleasance et al., 2003). Tag-to-gene mappings were completed using the UNIX command "join".

Footnote 9: 14 bp tags were only extracted from mouse orthologous genomic regions (UCSC Multiple Alignment Format).

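The in silico tag extraction and tag-to-sequence join described in Section 4.2.4 can be sketched as follows. This is a minimal illustration, not the SAGE_tag_positions.pl script used in the thesis. It assumes that the stated tag length includes the CATG recognition site of NlaIII (so a 21 bp tag is CATG plus 17 bases) and it scans only the strand it is given; the function names and the example sequence IDs are hypothetical.

```python
from collections import defaultdict

NLAIII_SITE = "CATG"  # recognition sequence of the NlaIII anchoring enzyme


def extract_tags(sequence, tag_length=21):
    """Return (offset, tag) pairs for every NlaIII site on the given strand.

    Assumption: tag_length counts the CATG site itself, so a 21 bp tag is
    CATG plus 17 bases and a 14 bp tag is CATG plus 10 bases.
    """
    seq = sequence.upper()
    tags = []
    offset = seq.find(NLAIII_SITE)
    while offset != -1:
        tag = seq[offset:offset + tag_length]
        if len(tag) == tag_length:  # skip sites too close to the sequence end
            tags.append((offset, tag))
        offset = seq.find(NLAIII_SITE, offset + 1)
    return tags


def build_tag_index(sequences, tag_length=21):
    """Map each in silico tag to the sorted IDs of the sequences containing it."""
    index = defaultdict(set)
    for seq_id, sequence in sequences.items():
        for _, tag in extract_tags(sequence, tag_length):
            index[tag].add(seq_id)
    return {tag: sorted(ids) for tag, ids in index.items()}


def map_observed_tags(observed_tags, index):
    """Join observed tags against the index; unmapped tags get an empty list."""
    return {tag: index.get(tag, []) for tag in observed_tags}
```

A dictionary lookup plays the role of the UNIX "join" here: a tag that hits exactly one sequence ID is an unambiguous mapping, while an empty list marks a tag that is absent from the data source.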
4.2.5 BLAST analysis

To provide in silico (theoretical) functional annotation of the candidate novel genes, I used the program blastall to run BLAST (Basic Local Alignment Search Tool) (Altschul et al., 1990) with selected human and mouse genomic sequences (query) against a database of 26,205 mouse RefSeq genes (Release 5, May 2004). The parameters I designated for the blastall program were as follows: "blastall -p blastn or tblastx -d mouse.rna.fna -i <query file> -e 0.01" (footnotes 10 and 11), where "-p" is the program name, "-d" is the formatted database to BLAST against (footnote 12), "-i" is the set of FASTA formatted query sequences, and "-e" is the expectation value. BLAST results were parsed using parse_blast.pl (written by Erin Pleasance; BCCA GSC).

Footnote 10: The BLASTN program compares a nucleotide sequence against a nucleotide sequence database.
Footnote 11: The TBLASTX program compares the six-frame conceptual translation products of a nucleotide query sequence (both strands) against the six-frame translations of a nucleotide sequence database.
Footnote 12: formatdb should be used to format the FASTA databases. This must be done before blastall can be run locally.

4.2.6 Mouse tag to gene mapping

14mer mESC tags were mapped to CGAP best tag (http://cgap.nci.nih.gov/SAGE) (Boon et al., 2002) using the UNIX command "join". Mouse CGAP mappings (Mm_short.best_tag.gz) were downloaded from ftp://ftp1.nci.nih.gov/pub/SAGE/MOUSE/.

4.3 Results and Discussion

4.3.1 Selection of tags for the isolation of candidate novel genes

The hESC metalibrary was constructed by computationally pooling 11 long SAGE libraries made from 8 hESC lines. The metalibrary consisted of 2,613,475 total tags corresponding to 379,465 unique tag types (transcripts). To identify tags enriched across the hESC lines I compared the metalibrary to 247 publicly available human SAGE libraries downloaded from the CGAP SAGE Genie (April 13, 2003; http://www.cgap.nci.nih.gov/SAGE) (Boon et al., 2002). 
The SAGE Genie website provides a resource for the retrieval and analysis of human and mouse gene expression data (Lai et al., 1999). The human CGAP libraries used in this analysis contained 15,683,409 tags representing 654,491 transcripts. Currently, the majority of publicly available SAGE libraries are short SAGE. Consequently, all long SAGE tags were truncated to short tags (extracted short SAGE) for direct comparison to the CGAP short SAGE libraries. The Venn diagram (Figure 26) depicts tag sequences common to the hESC and CGAP metalibraries. These sequences were excluded to isolate tags found solely in hESCs (Step 1 of Figure 25).

Footnote (CGAP best tag): CGAP best tag mapping searches for the best tag for a gene.

Figure 26 Venn diagram listing the tag types for the hESC, cancer and normal SAGE library comparisons. We pooled 11 hESC libraries (green) for a total of 222,337 unique tag types (translates to 379,465 long tag types). The normal metalibrary (orange) is a pooling of 77 CGAP libraries totaling 410,583 tag types. The cancer metalibrary (blue) consisted of 170 CGAP libraries, yielding 578,798 tag types.

Figure 27 Distribution of absolute gene expression levels for the novel gene discovery candidate tags. The x-axis contains expression level bins (bin size = 1) and the y-axis is the frequency of tag types.
[Plot: x-axis "Expression level"; y-axis ticks from 1.0E-01 to 1.0E+05; plot annotation "18,907 tags".]

There were 18,700 extracted short SAGE tags enriched in hESCs. These tags were associated back to their appropriate long SAGE tag sequences, resulting in 20,047 sequences for further analysis. Enriched tags were defined as those found only in the hESC data. Expression levels of the enriched tags were not normally distributed; I predominantly observed low abundance tags (shown in Figure 27). 
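The truncate-and-exclude comparison behind Step 1 can be sketched in a few lines. This is an illustration only, not the UNIX pipeline used in the thesis. It assumes tags are stored without the CATG site, as 17-base long tags and 10-base short tags, which matches the tag sequences shown in the tables; the function name and the example tags are hypothetical.

```python
def exclusive_short_tags(hesc_long_tags, cgap_short_tags, short_length=10):
    """Group hESC long tags by their short-tag prefix, keeping only prefixes
    that never occur in the CGAP short-tag set.

    Assumption: tags are stored without the CATG site, so long tags are
    17 bases and short tags are their first 10 bases.
    """
    cgap = set(cgap_short_tags)
    exclusive = {}
    for long_tag in hesc_long_tags:
        short_tag = long_tag[:short_length]  # in silico truncation to short SAGE
        if short_tag not in cgap:
            exclusive.setdefault(short_tag, []).append(long_tag)
    return exclusive


# Hypothetical tags for illustration only.
hesc = ["AAAAAAAAAACCCCCCC", "AAAAAAAAAAGGGGGGG", "TTTTTTTTTTAAAAAAA"]
cgap = ["TTTTTTTTTT"]
result = exclusive_short_tags(hesc, cgap)
n_short = len(result)
n_long = sum(len(long_tags) for long_tags in result.values())
```

Because several long tags can share the same 10-base prefix, the number of exclusive short tags can be smaller than the number of long tags they expand back to, which is the pattern seen above (18,700 short tags associated with 20,047 long tag sequences).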
Reports have previously defined the lower limit of gene expression reliably detectable by SAGE at 5 tags (Evans et al., 2002). As most highly abundant transcripts have previously been characterized, I expected that candidates for novel transcripts would be expressed at low levels to have escaped discovery. A recent improvement in SAGE data processing implemented by Gene Expression Informatics at the BCCA GSC has included tag quality measures based on SAGE library construction/sequencing errors (Siddiqui et al., submitted). High quality singleton tags (e.g., defined by P-value < 0.05) have been used to resolve novel genes in the Mouse Atlas of Gene Expression data (http://www.mouseatlas.org) and our own hESC data that led to the successful generation of longer cDNA sequences from SAGE tags (Siddiqui et al., submitted; Hirst et al., submitted). Thus, SAGE can detect gene expression at the singleton or doubleton level and can provide evidence of bona fide gene expression where previous transcript information does not exist.

Enriched tags were mapped to identify those tags that were generated from known transcripts and to select for those tags that mapped only to the genome or the set of embryonic EST sequences (Brandenberger et al., 2004b) (Appendix 4b) (Steps 2A and 2B of Figure 25). A proportion of tags enriched in hESCs were mapped to previously characterized transcripts because the CGAP SAGE libraries were not representative of all of the tissues/cell lines used to construct cDNA and EST sequencing projects. Table 34 summarizes the CMOST Best Mapping results and mappings to embryo-derived ESTs. Briefly, 3,365 tag types mapped to a known transcript/EST and 1,024 tag types mapped to a human genomic sequence, of which 961 tags mapped to a single genomic location and were used for further analysis. From this analysis I found that the majority of tags were initially unmapped (15,615 tag types). 
The inability to map these tags may be attributed to various causes such as sequence polymorphisms, tags spanning novel splice junctions, contaminating sequences (e.g., derived from MEFs), and experimental artifacts of the SAGE protocol and sequencing. It is also possible that the mapping resources are incomplete. Recently, an EST sequencing project was completed for undifferentiated hESC and its differentiated derivatives (embryoid bodies (EB), neural ectoderm-like cells (preNEU), and hepatocyte-like cells (preHEP)) (Brandenberger et al., 2004b). These sequences are not currently available for tag mapping using CMOST. Thus, in addition to CMOST, I mapped enriched hESC tags to a tag-to-gene mapping database created from these embryonic EST sequences (Step 2B of Figure 25). A small collection of unmapped tags (19 tag types) was resolved by mapping to the above EST data set. In the metalibrary I found that 275 tags mapped to a mouse transcript (CGAP Best Tag; http://cgap.nci.nih.gov/SAGE) and 1,168 tags mapped to the mouse genome (Step 2C, Figure 25). The hESC library WA01(m) was cultured on matrigel as opposed to MEF; thus, mouse sequence contamination should not be observed. As expected, unmapped tags from this library did not map to a mouse sequence (data not shown).

Table 34 CMOST Best Mapping results for 20,047 novel hESC tags. Listed below are the data source and tag types for sense and antisense tags. The percentage of total novel tag types (20,047 tags) per tag mapping database is shown in brackets.

Database                                           Sense Types     Antisense Types   Types (all)
MGC                                                268 (1.34)      208 (1.04)        476 (2.38)
RefSeq                                             328 (1.64)      185 (0.93)        513 (2.56)
Ensembl Transcript (ENST)                          89 (0.45)       58 (0.29)         147 (0.74)
Non-protein coding                                 0               1 (0.01)          1 (0.01)
Ensembl EST (ENSESTT)                              299 (1.5)       145 (0.73)        444 (2.22)
Ensembl Transcription Unit (ENSG)                  1140 (5.69)     644 (3.22)        1784 (8.9)
All mappings to transcripts/EST sequences                                            3365 (17.8)
Genomic (UCSC Human Genome)                                                          1,028 (5.13)
Unambiguous genomic mappings                                                         965 (4.81)
Embryonic ESTs                                     36 (0.18)       22 (0.11)         58 (0.30)
Unmapped                                                                             15,596 (77.80)
Total                                                                                20047

Tags selected for further analysis included sequences that unambiguously mapped to a genomic location and sequences that mapped to embryonic ESTs (1,023 tag types) (Step 2, Figure 25 and Table 34). The complete list of mappings to CMOST, ES-derived EST sequences, and multiple alignment sequences is found in Appendix 4c. The collection of 1,023 hESC-enriched tags that mapped to genomic sequences and embryonic ESTs was mapped to human sequences with known orthology to mouse and/or rat sequences as determined by the UCSC Multiple Alignment Format (MAF) (Step 3, Figure 25). Figure 25 also illustrates the loss of hESC-enriched tag types as filters were imposed to define a set of candidate novel transcripts. I isolated 499 sequences that mapped to the MAF tag-mapping database. I then applied the criteria that these tags should not map to 3' flanking regions ("enhanced" 3' UTRs) defined from known Ensembl transcripts (Step 4, Figure 25) and that the hESC-enriched tags should additionally have a significant P-value (α ≤ 0.05) (Step 5, Figure 25). P-values were calculated based on SAGE tag sequence quality per base pair and the library construction error calculated on a per library basis (Siddiqui et al., submitted). After applying these constraints, 301 tags satisfied these criteria (Appendix 4d). 
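Steps 3 to 5 amount to a cascade of three filters, which can be sketched as below. This is a hedged illustration rather than the procedure actually run for the thesis: the TagRecord fields and the select_candidates function are hypothetical names, the distance to the nearest Ensembl transcript is assumed to be precomputed, a tag lying exactly 2,000 bp downstream is treated as outside the flank, and a P-value equal to the threshold is accepted. Neither boundary case is specified in the text.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TagRecord:
    tag: str
    in_conserved_region: bool      # Step 3: maps to a human-mouse-rat aligned region
    bp_past_3prime: Optional[int]  # Step 4: bp downstream of the nearest Ensembl
                                   # transcript 3' end; None if no transcript nearby
    p_value: float                 # Step 5: tag quality P-value


def select_candidates(records: List[TagRecord],
                      flank_bp: int = 2000,
                      alpha: float = 0.05) -> List[TagRecord]:
    """Apply the three filters in order and return the surviving tags."""
    selected = []
    for record in records:
        if not record.in_conserved_region:
            continue  # Step 3: not in an orthologous alignment
        if record.bp_past_3prime is not None and record.bp_past_3prime < flank_bp:
            continue  # Step 4: inside the "enhanced" 3' UTR of a known transcript
        if record.p_value > alpha:
            continue  # Step 5: fails the tag quality threshold
        selected.append(record)
    return selected


# Hypothetical records for illustration only.
records = [
    TagRecord("KEEP", True, 5000, 0.01),
    TagRecord("NOT_CONSERVED", False, 5000, 0.01),
    TagRecord("NEAR_GENE", True, 800, 0.01),
    TagRecord("LOW_QUALITY", True, 5000, 0.20),
]
candidates = [record.tag for record in select_candidates(records)]
```

Ordering the filters from cheapest to most specific mirrors the counts reported above (1,023 tags, then 499, then 461, then 301), and keeping each test separate makes it straightforward to report how many tags each step removes.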
Roughly 1/6th of the tags (56 tags) were present in more than one cell line (Appendix 4e). Particularly in the case of singleton/doubleton tags in an individual library, expression in two or more libraries further substantiated the likelihood that the tag was observed in the mRNA sampled as opposed to being introduced by some artifact of the technology. The stringent criteria that high quality hESC novel tags map to sequences conserved in multiple species, lie distal to known genes, and are observed in 2 or more hESC libraries further reinforce the likelihood that the tags were derived from novel genes.

4.3.2 Mouse annotation of hESC tags

I annotated our selections of the "best" candidates for novel gene discovery across multiple hESC lines using a computational approach illustrated in Figure 28. The multiple alignment format mappings can infer potential functionality because sequences conserved in multiple species are often coding or regulatory sequences under selective pressure to be maintained in the genome. By using the human and corresponding mouse genomic sequences as queries for TBLASTX and BLASTN, respectively, I can identify known mouse transcripts in regions of interest (Figure 28B, examples 1 and 2). This strengthens the argument that sequence conservation correlates with functional conservation. I additionally annotated hESC novel tags with short SAGE data generated from an R1 mouse embryonic stem cell (mESC) line (GEO accession: GSM580; http://www.ncbi.nlm.nih.gov/projects/geo) (Anisimov et al., 2002). I hypothesize that the observation of regions mapped by novel hESC and mESC tags may be indicative of a novel gene central to mammalian embryonic stem cell biology (Figure 28B, example 1). The mESC short SAGE library consisted of 137,906 total tags, which represented 44,570 unique tag types. A caveat of the 14 bp tag length is the inability to unambiguously map to genomic sequence. 
For this reason, I mapped mESC tags to mouse transcripts orthologous to candidate human genomic regions.

Figure 28 The computational method for annotating hESC candidate tags. Candidate tags used in this analysis are orthologous to mouse and/or rat genomic sequences and map > 2 kb downstream of a known transcript. A. The candidate regions of orthology consist of human and mouse genomic sequences identified by a hESC novel tag. Homology to mouse RefSeq transcripts was determined by BLAST analysis of human and mouse genomic sequences using the programs TBLASTX and BLASTN, respectively. Mouse transcripts in a region orthologous to a hESC human tag mapping were isolated and further used to map an R1 mouse embryonic stem cell line (mESC) short SAGE library. B. Examples of annotated hESC candidate tags and the number of hESC novel tags annotated.

A. Flowchart:
   hESC candidate tags
   -> Multiple Alignment Format
   -> Candidate regions of orthology
   -> Extract orthologous human genomic sequences (TBLASTX) and extract orthologous mouse genomic sequences (BLASTN)
   -> Blastall against mouse RefSeq transcripts
   -> Candidate mouse transcripts
   -> Map to R1 mESC tags
   -> Annotation of hESC candidate tags

B. Example 1: hESC tag mapping to a genomic region > 2 kb downstream of an Ensembl transcript; supported by a mouse transcript containing an R1 mESC tag and rat sequence conservation (7 tag types).
   Example 2: hESC tag mapping to a genomic region > 2 kb downstream of an Ensembl transcript; supported by a mouse transcript and rat sequence conservation (59 tag types).
   Example 3: hESC tag mapping to a genomic region > 2 kb downstream of an Ensembl transcript; supported by mouse and rat sequence conservation (235 tag types).
   [Each example is drawn as aligned human, rat and mouse tracks showing the 3' UTR of the known transcript and the positions of the hESC tag and, in example 1, the R1 mESC tag.]

The first analysis aligned the mouse MAF sequences (corresponding to candidate regions in the human genome in which a novel hESC tag localized) to mouse RefSeq transcripts. Best BLAST hits with at least 99% sequence identity, an HSP (High-scoring Segment Pair) length of at least 200 bp, and an E-value of 0.0001 or less were retained. Table 35 is a subset of the highest scoring alignments (see Appendix 4f for the full table). Using the described requirements, I isolated 32 transcripts annotating the same number of hESC tags. mESC tags localized to 7 of the highest scoring mouse transcripts (Table 36). These 7 mouse transcripts and the corresponding mESC and hESC tags represent novel candidates in which the human tag identified a genomic region with an orthologous mouse transcript that was additionally identified by a mouse tag (illustrated in Figure 28B, example 1). Candidate novel hESC tags were termed nESC01-nESC07. The transcripts corresponded to predicted cDNAs and the following genes: Gm704 (nESC06), Dpysl5 (nESC03), and Actg1 (nESC07). The gene model transcript, Gm704, is similar to the fidgetin-like 1 gene, which is a member of a superfamily of ATPases associated with a wide variety of cellular activities, including membrane fusion, proteolysis, and DNA replication. The mouse gene dihydropyrimidinase-like 5, Dpysl5, is thought to play a role in growth cone guidance during neural development. Lastly, Actg1 is a cytoplasmic gamma-actin, which functions in sarcomere organization. The remaining four genes mapped by a mESC tag were functionally uncharacterized cDNAs or predicted cDNAs.

Table 35 Top 25 BLASTN hits (mouse MAF genomic regions against mouse RefSeq transcripts). Listed in descending order of BLASTN score are: hESC tag sequence, count, location of the human MAF sequence and start site, mESC tag sequence, count, mouse RefSeq hit, and score. 
h E S C meta tag TACTTTGAACCGAGGAG  Count 1  Human genomic  m E S C tag to  sequence  gene  chrl5  n/a  Count  Hit  Score  n/a  g i | 2 0 4 6 7 4 2 2 | r e f | N M _ 1 3 9 0 0 1 . 1 | M u s m u s c u l u s c h o n d r o i t i n sulfate  7027  proteoglycan 4 ( C s p g 4 ) , m R N A  (start: 82576680) AATCGGTCACACCAGCC  1  chr6  n/a  n/a  TACCACTCTCTGTATGG  1  chr2  n/a  n/a  2  chr3  4918  g i | 3 1 5 4 4 0 6 8 | r e f ] N M 0 1 1 7 7 0 . 2 | M u s m u s c u l u s z i n c finger protein,  4117  subfamily I A , 2 (Helios) ( Z f p n l a 2 ) , m R N A  (start: 2 1 4 0 6 8 6 2 3 ) AATCCCCCCGCCCCCTC  gi|51767667|reflXM 488673.1| P R E D I C T E D : M u s musculus c D N A sequence B C 0 2 4 6 5 9 ( B C 0 2 4 6 5 9 ) , m R N A  (start: 11243073)  n/a  n/a  gi|51765137|reflXM_356199.2| P R E D I C T E D : M u s musculus  2882  similar to a b n o r m a l cell L I N e a g e L I N - 4 1 , heterochronic gene;  (start: 3 2 9 0 3 5 5 9 )  D r o s o p h i l a d a p p l e d / vertebrate T R i p a r t i t e M o t i f protein related; B - b o x z i n c finger, F i l a m i n a n d N H L repeat c o n t a i n i n g protein (123.8 k D ) (lin-41) ( L O C 3 8 2 1 1 2 ) , m R N A TAGGTGGCCCTGTCTCC*  1  chrX  TGGCTCGGTC  364  4  chrl7  n/a  n/a  1  chr2  AAAATAAAAA  4  1  chrl2  GGCCCCCACA  16  1  chr2  n/a  n/a  1  chr8  n/a  n/a  1  chr7  n/a  n/a  1  chr2  n/a  n/a  1  chr22 (start: 2 5 3 8 8 4 8 8 )  gi|51829921|reflXM_204880.3| P R E D I C T E D : M u s musculus fidgetin-like  gi|31712027|refjNM  2250  1 (LOC278718), m R N A 153409.2| M u s m u s c u l u s R I K E N c D N A  2024  gi|31543917|refjNM_023585.2| M u s musculus ubiquitin-  1885  gi|47106055|reflNM_007481.2| M u s musculus ADP-ribosylation  1844  gi|40254535|ref]NM_021306.2| M u s musculus endothelin  1633  converting e n z y m e - l i k e 1 ( E c e l l ) , m R N A  (start: 2 3 3 4 8 2 9 5 0 ) GATAGGAACTCTTCCTG*  2313  factor 6 ( A r f 6 ) , m R N A  (start: 9 8 8 4 5 0 9 3 ) TGTGGCAGCTGGTGGAA*  175509.3| M u s m u s c u l u 
s R I K E N c D N A  conjugating e n z y m e E 2 variant 2 ( U b e 2 v 2 ) , m R N A  (start: 4 9 0 2 5 3 3 3 ) TCAAATCTCAGAGCATC  2393  A 3 3 0 1 0 2 K 2 3 gene ( A 3 3 0 1 0 2 K 2 3 R i k ) , m R N A  (start: 166737128) GTTTATTAGTCTGGATT  gi|40254301|refjNM  similar to  (start: 50497423) CATAACCCAGGAAACAT  177182.3| M u s m u s c u l u s R I K E N c D N A  9 4 3 0 0 6 7 K 1 4 gene ( 9 4 3 0 0 6 7 K 1 4 R i k ) , m R N A  (start: 2 0 8 8 9 1 7 4 4 ) TCATCGCTTTAATACTG  gi|40254309|ref|NM  A 8 3 0 0 5 3 O 2 1 gene ( A 8 3 0 0 5 3 O 2 1 R i k ) , m R N A  (start: 1379996) AGTTGGGGTCTTGGGGA  2577  cytoplasmic 1 ( A c t g l ) , m R N A  (start: 52138590) AGTGCGGAGTCCCCTTC  g i | 4 8 7 6 2 6 7 7 | r e f | N M _ 0 0 9 6 0 9 . 2 | M u s m u s c u l u s actin, g a m m a ,  n/a  n/a  gi|51711414|ref|XM 489103.1| P R E D I C T E D : M u s musculus R I K E N c D N A A 2 3 0 0 5 7 G 1 8 gene ( A 2 3 0 0 5 7 G 1 8 R i k ) , m R N A  1550  CACGGCACACACAGGCA  1  chr7  n/a  n/a  (start: 7 1 2 1 2 2 8 6 ) CGATTTCTTAGAGAGAT  1  chr2 chr22  n/a  - n/a  1  chr3  n/a  n/a  1  chr2  GCTGACATTT  2  1  chrl6  GCCCTATGCT  2  1  chr6  n/a  n/a  1  chrl7  n/a  n/a  1  chrl7  n/a  n/a  2  chr7  n/a  n/a  1  chr7  CGTTGGATTC  4  4  c h r l (start: 53600547)  1148  gi|40789287|ref|NM_023047.2| M u s musculus  1021  gi|21704179|ref|NM_145587.11 M u s musculus S H 3 - b i n d i n g  932  gi|51767594|ref|XM 488666.1| P R E D I C T E D : M u s musculus  922  g i | 3 1 5 4 3 5 7 3 | r e f | N M _ 0 1 1 2 3 5 . 2 | M u s m u s c u l u s R A D 5 1 - l i k e 3 (S.  
876  gi|51766572|refjXM  358416.2| P R E D I C T E D : M u s musculus  829  gi|51710838|ref|XM  131914.4| P R E D I C T E D : M u s m u s c u l u s  785  R I K E N c D N A 3 1 1 0 0 0 4 0 1 8 gene ( 3 1 1 0 0 0 4 0 1 8 R i k ) , m R N A n/a  n/a  (start: 4 7 7 4 2 4 4 4 ) TCCACTCAACTGTACAA  cDNA  hypothetical L O C 3 80741 ( L O C 3 8 0 7 4 1 ) , m R N A  (start: 102348828) AATGTCATTAAATACCT  gi[21312475|ref|NM 027265.1| M u s musculus R I K E N  cerevisiae) (Rad5113), m R N A  (start: 8 0 0 6 5 8 9 4 ) TCGAAGTAAAATTCAAC  1183  LOC432740 (LOC432740), m R N A  (start: 3 3 5 6 6 9 7 7 ) CATCTTTCGGCCCATTC  139488.2| P R E D I C T E D : M u s m u s c u l u s  kinase ( S b k ) , m R N A  (start: 2 2 2 5 3 4 2 3 ) TCTGCCTGATACCAAAC  gi|51829829|ref|XM  dihydropyrimidinase-like 5 (Dpysl5), m R N A  (start: 2 8 2 6 9 3 7 0 ) CGACTTTTATTTCTGAC  1241  2 8 1 0 0 0 4 A 1 0 gene ( 2 8 1 0 0 0 4 A 1 0 R i k ) , m R N A  (start: 2 7 1 4 6 8 8 7 ) CGTCCGCCTGCCTGCCT  g i 2 3 9 4 3 8 4 1 | r e f ] N M _ 1 5 3 5 1 2 . 1 | M u s m u s c u l u s p o t a s s i u m voltage-  hypothetical L O C 2 0 7 9 3 9 ( L O C 2 0 7 9 3 9 ) , m R N A  (start: 57082089) TGTCGTCTTGGGGTTGA  1306  gated c h a n n e l , s u b f a m i l y G , m e m b e r 3 ( K c n g 3 ) , m R N A  (start: 4 0 5 4 7 8 3 2 ) ATGTAGACAAAATTAGC  021371.1| M u s musculus calneuron 1  (Calnl), m R N A  (start: 4 2 6 9 2 4 7 0 ) CCTCGAGGGCACCGCGG  gi|l0946703|refjNM  g i | 3 1 9 8 2 2 6 8 | r e f | N M _ 0 0 8 3 1 6 . 2 | M u s m u s c u l u s H u s l h o m o l o g (S.  
660  pombe) (Hus1), mRNA  n/a  n/a  gi|28077004|ref|NM_028355.1| Mus musculus RIKEN cDNA  622  2810475A17 gene (2810475A17Rik), mRNA

*Map to ESTs derived from undifferentiated hESCs (Brandenberger et al., 2004b) (gi|47283420|gb|CN267006.1, gi|47331838|gb|CN315424.1, and gi|47417639|gb|CN430045.1 respectively)

Table 36 Candidate mouse orthologous genomic sequences were analyzed using BLASTN against mouse RefSeq transcripts. Identified transcripts were mapped to mESC tags. The hESC metalibrary tag and count, mESC tag and count, and the gene name are shown below. Novel hESC candidates were given a unique identifier (nESC##).

ID      hESC meta tag      Count  mESC tag    Count  Gene name
nESC01  AGTTGGGGTCTTGGGGA  1      AAAATAAAAA  4      RIKEN cDNA 9430067K14
nESC02  TCGAAGTAAAATTCAAC  2      CGTTGGATTC  4      PREDICTED: RIKEN cDNA 3110004O18
nESC03  TGTCGTCTTGGGGTTGA  1      GCCCTATGCT  2      dihydropyrimidinase-like 5
nESC04  CGATTTACCTACTTGAA  1      GCCGCGTCCG  3      PREDICTED: Mus musculus LOC434078
nESC05  ATGTAGACAAAATTAGC  1      GCTGACATTT  2      RIKEN cDNA 2810004A10
nESC06  TCATCGCTTTAATACTG  1      GGCCCCCACA  16     PREDICTED: similar to fidgetin-like 1
nESC07  TAGGTGGCCCTGTCTCC  1      TGGCTCGGTC  364    actin, gamma, cytoplasmic 1

Many genes critical to maintaining hallmark ESC properties, such as POU5F1 and SOX2, are orthologous in human and mouse. To reiterate, species-conserved gene expression in orthologous sequences strengthens the proposition that a novel and functional transcript can be detected using SAGE. This hypothesis is further reinforced by the presence of a known mouse transcript in the region of conservation. The remaining mouse transcripts were identified by a novel hESC tag mapped to an orthologous human genomic sequence but were not identified by a mESC SAGE tag (25 transcripts corresponding to 25 hESC tags; these novel hESC tags were termed nESC08-nESC32) (Figure 28B, example 2).
The BLASTN analysis of candidate orthologous mouse genomic sequences identified 12 uncharacterized or predicted cDNA sequences and 13 functionally characterized transcripts. Table 37 lists the genes and a brief description of their functional role. Many of these mouse transcripts have a predefined human ortholog. Perhaps, by way of duplication in the human genome, a novel paralogous gene may exist in these candidate genomic regions. Notable transcripts included an SH3-binding kinase (Sbk) (nESC10), the chromodomain helicase DNA binding protein (Chd1) (nESC31), and Rad51-like 3 (Rad51l3) (nESC13). A number of SH3 domain containing proteins are downstream targets of the Jak/Stat pathway, the principal signalling pathway maintaining an undifferentiated state in mESCs. The detection of an Sbk-related transcript in hESCs may indicate a role for participants of the Jak/Stat pathway in hESC maintenance, contrary to the current belief that the pathway is non-functional in human pluripotent cells. CHD1 may play an important role in gene regulation through the modification of chromatin structure, altering the access of transcriptional machinery to its chromosomal DNA template. In Chapter 3.3.2 I found that other genes involved in epigenetic gene regulation (DNMT3β and HDAC2) were significantly up-regulated in u-hESCs compared to multiple adult and fetal samples (Tables 27 and 33). These results, in conjunction with the possible identification of an additional CHD1 ortholog novel to hESC, suggest that specific epigenetic machinery may be expressed in hESC and unique to the stemness phenotype. Published findings support a crucial role for the DNA repair gene Rad51l3 in normal mammalian development, recombination, and maintenance of mammalian genome stability (Smiraldo et al., 2005).
I similarly noted significantly higher expression of DNA damage checkpoint and DNA repair genes (such as RAD51, RAD54L and RAD23B) in hESCs compared to n-CGAP libraries (Table 8, Chapter 2.3.5). Maintaining genomic stability would be critical to the immortality of hESC lines.

Table 37 25 mouse transcripts identified by BLASTN analysis of candidate orthologous mouse sequences against mouse RefSeq transcripts. The corresponding hESC tag sequence, count, mouse transcript accession, gene symbol, HSP size and a description covering gene ontology, conserved domains, and/or Entrez gene summaries are listed. Novel hESC candidates were given an identifier (nESC##).

ID      hESC sequence       Count  Accession    Gene symbol    HSP size
        Description

nESC08  CACGGCACACACAGGCA   1      gi|10946703  CALN1          659
        This gene encodes a protein with high similarity to the calcium-binding proteins of the calmodulin family. GO: biological process unknown, cellular component unknown
nESC09  TACTTTGAACCGAGGAG   1      gi|20467422  CSPG4          3557
        The human CSPG4 plays a role in stabilizing cell-substratum interactions during early events of melanoma cell spreading on endothelial basement membranes. Data suggest that CSPG4 is a novel marker for epidermal stem cells that contributes to their patterned distribution by promoting stem cell clustering (Legg et al., 2003). CSPG4 represents an integral membrane chondroitin sulfate proteoglycan expressed by human malignant melanoma cells. GO: activation of MAPK, cell proliferation, glial cell migration, transmembrane receptor protein tyrosine kinase signalling pathway, plasma membrane
nESC10  CGTCCGCCTGCCTGCCT   1      gi|21704179  SBK1           482
        Conserved domain: cd00180: STKc; Serine/Threonine protein kinases, catalytic domain
nESC11  CGATTTCTTAGAGAGAT   1      gi|23943841  KCNG3          626
        Conserved domain: pfam00520: Ion_trans; Ion transport protein
nESC12  TCCACTCAACTGTACAA   4      gi|28077004  2810475A17RIK  314
        Transmembrane protein 48 (Tmem48)
nESC13  TCTGCCTGATACCAAAC   1      gi|31543573  RAD51L3        442
        1. Findings support a crucial role for mammalian RAD51D in normal development, recombination, and maintaining mammalian genome stability (Smiraldo et al., 2005). 2. A fragment of Rad51B interacts with the C-terminus and linker of Rad51C, and this region of Rad51C also interacts with mRad51D and Xrcc3 (Miller et al., 2004). GO: base-excision repair
nESC14  GTTTATTAGTCTGGATT   1      gi|31543917  UBE2V2         975
        Conserved domain: cd00195: UBCc; Ubiquitin-conjugating enzyme E2, catalytic (UBCc) domain
nESC15  TACCACTCTCTGTATGG   1      gi|31544068  ZFPN1A2        2104
        GO: transcription factor activity
nESC16  ATATGCAGCAGGATCAC   1      gi|31560079  KCNMB2         224
        Conserved domain: pfam03185: CaKB; Calcium-activated potassium channel, beta subunit
nESC17  CATAACCCAGGAAACAT   1      gi|31712027  A330102K23RIK  1045
        GO: apoptosis
nESC18  AATGTCATTAAATACCT   1      gi|31982268  HUS1           337
        1. Evidence for a requirement for Rad17 and Hus1 to induce G(2) arrest as well as Vpr-induced phosphorylation of histone 2A variant X (H2AX) and formation of nuclear foci containing H2AX and breast cancer susceptibility protein 1 (Zimmerman et al., 2004). 2. Hus1-deficient mouse cells had an impaired S checkpoint after exposure to DNA strand break-inducing agents such as camptothecin or ionizing radiation (Wang et al., 2004). 3. Hus1 is required specifically for one of two separable mammalian checkpoint pathways that respond to distinct forms of genome damage during S phase (Weiss et al., 2003).
nESC19  AGTGCGGAGTCCCCTTC   4      gi|40254309  A830053O21RIK  1207
        Conserved domains: cd00083: HLH; Helix-loop-helix domain, found in specific DNA-binding proteins that act as transcription factors
nESC20  TGTGGCAGCTGGTGGAA*  1      gi|40254535  ECEL1          824
        Conserved domains: pfam05649: Peptidase_M13_N; Peptidase family M13
nESC21  TCAAATCTCAGAGCATC   1      gi|47106055  ARF6           941
        GO: small GTPase mediated signal transduction, intracellular protein transport, vesicle-mediated transport
nESC22  CTCAGAGCGCGCAGGTC   1      gi|51556212  AL024069       251
        Conserved domain: cd01214: CG8312; CG8312 Phosphotyrosine-binding (PTB) domain
nESC23  GACTGCAAGAACCTAAG   1      gi|51709794  CIPP           202
        This gene encodes a multivalent PDZ domain protein, which is expressed exclusively in brain and kidney. This protein selectively interacts with Kir family members, N-methyl-D-aspartate receptor subunits, neurexins and neuroligins, and cell surface molecules enriched in synaptic membranes. This protein may serve as a scaffold that brings structurally diverse but functionally connected proteins into close proximity at the synapse.
nESC24  GATAGGAACTCTTCCTG*  1      gi|51711414  A230057G18RIK  782
        n/a
nESC25  AATCCCCCCGCCCCCTC   2      gi|51765137  Gm1127         1454
        Human conserved domain: cd00200: WD40; WD40 domain, found in a number of eukaryotic proteins that cover a wide variety of functions including adaptor/regulatory modules in signal transduction, pre-mRNA processing and cytoskeleton assembly
nESC26  CATCTTTCGGCCCATTC   1      gi|51766572  Gm1567         418
        n/a
nESC27  CGACTTTTATTTCTGAC   1      gi|51767594  LOC432740      465
        n/a
nESC28  AATCGGTCACACCAGCC   1      gi|51767667  BC024659       2553
        n/a
nESC29  CCCGACCCCGCGCTCTT   1      gi|51828549  LOC432582      276
        n/a
nESC30  CCTCGAGGGCACCGCGG   2      gi|51829829  LOC207939      597
        n/a
nESC31  ACGCCGAGAAAGCAAGC   1      gi|6680927   CHD1           207
        The CHD family of proteins is characterized by the presence of chromo (chromatin organization modifier) domains and SNF2-related helicase/ATPase domains. CHD genes alter gene expression possibly by modification of chromatin structure thus altering access of the transcriptional apparatus to its chromosomal DNA template.
nESC32  GCAGTAGGTAGAGTCAC   1      gi|7242198   RASGRF2        288
        GO: small GTPase mediated signal transduction, guanyl-nucleotide exchange factor activity

*Map to ESTs derived from undifferentiated hESCs (Brandenberger et al., 2004b): ES|gi|47331838|gb|CN315424.1|CN315424 and ES|gi|47417639|gb|CN430045.1|CN430045 respectively. GACTGCAAGAACCTAAG maps to hep|gi|47358539|gb|CN358605.1|CN358605 (derived from hESC differentiated to hepatocyte-like cells).

A second analysis involved the alignment of human genomic sequences identified by a novel hESC tag to mouse RefSeq transcripts using TBLASTX. The purpose of this analysis was to determine if the human region might be orthologous to a mouse transcript of known functionality. Human sequences were selected based on a p-value cut-off of 0.0001, resulting in the identification of 61 candidate mouse transcripts, of which 28 transcripts did not overlap with the BLASTN analysis described above. Table 38 lists the novel tag sequence, count, human MAF chromosome and start site, TBLASTX score, HSP size (bp), hit accession, and gene symbol (novel hESC tags were termed nESC33-nESC66). Transcripts were ordered from highest to lowest score. Altogether, 34 tag sequences hit 28 genes (illustrated in Figure 28B, example 2). The predicted transcript LOC432880 is similar to reverse transcriptase and was associated with 3 different human genomic locations that contained repetitive sequences (nESC43, nESC44, and nESC50). Similarly, LOC277923 is a mouse reverse transcriptase gene that corresponded to 2 human genomic locations with repetitive sequences (nESC40 and nESC52). I also observed instances where multiple novel tags aligned to the same MAF genomic region, specifically chromosome 4 (start site: 30893704), chromosome 13 (start site: 53945546), and chromosome X (start site: 106734806) (nESC57 and nESC60).
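The selection steps shared by the BLASTN and TBLASTX analyses (threshold filtering, keeping the best hit per query, and identifying transcripts found only by TBLASTX) can be sketched in Python. This is a minimal illustration rather than the scripts used in this thesis: it assumes the standard 12-column tabular BLAST output, treats the 0.0001 cut-off as an E-value threshold, and uses hypothetical region and transcript names.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    query: str       # genomic region used as the BLAST query
    subject: str     # RefSeq transcript that was hit
    identity: float  # percent identity of the HSP
    length: int      # alignment (HSP) length
    evalue: float
    bitscore: float


def row(*fields):
    """Build one tab-separated line of tabular BLAST output."""
    return "\t".join(str(f) for f in fields)


def parse_tabular(lines):
    """Parse 12-column tabular BLAST output into Hit records."""
    hits = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank and comment lines
        f = line.rstrip("\n").split("\t")
        hits.append(Hit(f[0], f[1], float(f[2]), int(f[3]),
                        float(f[10]), float(f[11])))
    return hits


def best_hits(hits, min_identity=0.0, min_length=0, max_evalue=1e-4):
    """Return the top-scoring hit per query among hits passing all thresholds."""
    best = {}
    for h in hits:
        if (h.identity < min_identity or h.length < min_length
                or h.evalue > max_evalue):
            continue
        if h.query not in best or h.bitscore > best[h.query].bitscore:
            best[h.query] = h
    return best


# Hypothetical example rows (not data from this thesis). Columns: query,
# subject, %identity, length, mismatches, gaps, qstart, qend, sstart, send,
# E-value, bit score.
blastn_rows = [
    row("regionA", "mouse_tx_1", 99.5, 450, 2, 0, 1, 450, 10, 459, 0.0, 850),
    row("regionA", "mouse_tx_2", 99.1, 210, 2, 0, 1, 210, 5, 214, 1e-100, 390),
    row("regionB", "mouse_tx_3", 97.0, 600, 18, 0, 1, 600, 1, 600, 0.0, 990),   # identity too low
    row("regionC", "mouse_tx_4", 99.8, 150, 0, 0, 1, 150, 1, 150, 1e-70, 290),  # HSP too short
]
tblastx_rows = [
    row("regionD", "mouse_tx_1", 62.0, 300, 114, 0, 1, 300, 1, 300, 1e-30, 130),
    row("regionE", "mouse_tx_5", 55.0, 240, 108, 0, 1, 240, 1, 240, 1e-12, 70),
    row("regionF", "mouse_tx_6", 40.0, 90, 54, 0, 1, 90, 1, 90, 0.5, 30),       # E-value too high
]

# BLASTN criteria from the text: >= 99% identity, HSP >= 200 bp, E <= 0.0001.
blastn_best = best_hits(parse_tabular(blastn_rows),
                        min_identity=99.0, min_length=200, max_evalue=1e-4)
# TBLASTX: only the significance cut-off is applied.
tblastx_best = best_hits(parse_tabular(tblastx_rows), max_evalue=1e-4)

# Transcripts found by TBLASTX that the BLASTN analysis did not already report.
blastn_transcripts = {h.subject for h in blastn_best.values()}
tblastx_only = {h.subject for h in tblastx_best.values()} - blastn_transcripts
```

In this toy input only regionA passes the BLASTN criteria, and mouse_tx_5 is the single transcript reported by TBLASTX alone, which mirrors how the 28 non-overlapping transcripts were separated from the 61 TBLASTX candidates.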
More than 50% (15) of the mouse genes identified were predicted transcripts or uncharacterized cDNA sequences. These transcripts were orthologous to 17 human genomic regions that were identified by 19 novel hESC tags (Table 38). These predicted mouse genes suggested that a transcribed sequence existed in a region homologous to a hypothetical novel gene in hESCs. Furthermore, several of the identified mouse transcripts, both predicted and validated, were involved in development and differentiation, transcriptional regulation, proliferation, genomic stability, and cell cycle checkpoints. Each gene was associated with a single hESC tag. A description of the genes and their suggested functions/conserved domains is given in Table 39.

Table 38 TBLASTX analysis of candidate human genomic regions derived from the UCSC multiple alignment format (MAF). Human sequences were compared against mouse RefSeq transcripts. Listed below is the human embryonic stem cell novel tag sequence, count, MAF chromosomal region (in human) and alignment start site, TBLASTX score, HSP size (bp), hit accession, and gene symbol. Transcripts are ordered from highest to lowest score.

ID      hESC meta tag       Count  Human genomic region       Score  HSP   Accession    Mouse symbol
nESC33  ATCTGAGACAGACAGTT   1      chr3 (start: 185064773)    544    580   gi|51873059  EEF1A1
nESC34  GAGAGCGGATTTTGACT   1      chr5 (start: 106788290)    221    1990  gi|46560569  EFNA5
nESC35  CTATCTAGTGCCAAAAA   1      chr21 (start: 34106567)    132    338   gi|6754391   ITSN1
nESC36  AAACTTCAACATATGGT   1      chr6 (start: 41806237)     111    533   gi|38259219  USP49
nESC37  GCTGTAGGCGCAATGAG   1      chr1 (start: 113991788)    111    2402  gi|9055361   SYT6
nESC38  TAGTCTGCTATGACCAC   1      chr16 (start: 53507325)    110    512   gi|27734121  1700047E16RIK
nESC39  CCATTGGTCTCCATTCC   1      chr4 (start: 95992)        92     104   gi|31543955  WEE1
nESC40  CTAGACTAGAAACCACA   1      chr7 (start: 19771451)     87     637   gi|54312063  LOC277923
nESC41  GCTATCTTGAATGGGGT*  1      chr4 (start: 30893704)     81     1919  gi|51770847  LOC381153
nESC42  GGTAGGTTAAGAAAGAT*  1      chr4 (start: 30893704)     81     1919  gi|51770847  LOC381153
nESC43  AAGGGTTAGACTAGATA   1      chr11 (start: 115529238)   80     737   gi|51768555  LOC432880
nESC44  TGGTATGCAATAAATAT   1      chr8 (start: 129525130)    69     874   gi|51768555  LOC432880
nESC45  GCTTATGGCTAGAGAAT   1      chr2 (start: 203442081)    67     1657  gi|6680803   BMPR2
nESC46  GCTCTCTGAATAGCTTT   1      chr9 (start: 3514336)      65     1376  gi|34328188  RFX3
nESC47  CTACAAAACCGAAAGCA   1      chr13 (start: 54062598)    65     203   gi|16716602  GTF2IRD2
nESC48  CGAACATTTCCTAATGA   1      chr12 (start: 78446474)    62     418   gi|51830454  LOC436367
nESC49  CCTTTGCTTCCCTTTCC   1      chr2 (start: 36556311)     56     298   gi|51770404  CRIM1
nESC50  TGGATGTCAATTTGTTC   1      chr3 (start: 8279218)      52     293   gi|51768555  LOC432880
nESC51  CGGCGGGGCAGCCGACG   1      chr3 (start: 32830280)     52     408   gi|51765138  LOC384985
nESC52  TAGATACCAAGTTGTCC   1      chrX (start: 55811260)     51     368   gi|54312063  LOC277923
nESC53  GTACTGCACAATTCAGA   1      chr5 (start: 142120336)    51     313   gi|13626035  SIGLECL1
nESC54  CCAACGTGAAGTGATTT   1      chr6 (start: 14968830)     49     378   gi|51769069  LOC432935
nESC55  CCACATCCGATGCATAG   1      chr18 (start: 51671682)    49     3311  gi|6753409   CER1
nESC56  ATTACAGTGCCCTCAAA   1      chr2 (start: 113452195)    48     214   gi|31559867  B930067F20RIK
nESC57  AAGTCCCCGTTTGTTTT*  2      chr13 (start: 53945546)    47     776   gi|31541931  COX15
nESC58  TATATGATCATTACTAA*  12     chr13 (start: 53945546)    47     776   gi|31541931  COX15
nESC59  AACTGATAGCTGGAAGG*  8      chrX (start: 106734806)    45     995   gi|51708184  LOC433611
nESC60  TGATTGTAGATGTACCT*  2      chrX (start: 106734806)    45     995   gi|51708184  LOC433611
nESC61  TATATAGCTTGCATTTC   1      chr9 (start: 128427310)    45     607   gi|13878228  SYTL1
nESC62  CCAGTCCGTTTTCTGGT   1      chr4 (start: 18788855)     44     386   gi|31342579  4932435O22RIK
nESC63  ACTGACATTTAGCTAGT   2      chr20 (start: 38449627)    37     638   gi|31340738  9330169L03RIK
nESC64  CTACATAGTCCTGCATT   1      chr9 (start: 134725680)    36     1394  gi|37718971  5930405J04RIK
nESC65  GACGAAGAACCTTGTCC   1      chr18 (start: 23587615)    33     418   gi|46411175  CEECAM1
nESC66  TAAACGCTGCCCTTAAA   1      chr4 (start: 56903132)     30     96    gi|37674223  GNL3L

AAGTCCCCGTTTGTTTT and TGATTGTAGATGTACCT hit multiple MAF genomic regions. LOC432880 corresponds to human repetitive sequences.

Table 39 TBLASTX mouse hits associated with development/differentiation, proliferation, transcriptional regulation (DNA dependent and epigenetically), genomic stability, and cell cycle checkpoints.
Gene name, classification, and a description (defined by Entrez Gene, Gene Ontology, and/or conserved functional domains) are provided.

Name: 1700047E16RIK
Sequence classification: Mus musculus RIKEN cDNA 1700047E16 gene (1700047E16Rik), mRNA.
Description: Conserved domains: COG0419: SbcC; ATPase involved in DNA repair [DNA replication, recombination, and repair]. cd00030: C2; Protein kinase C conserved region 2 (CalB). pfam00038: Filament; Intermediate filament protein. pfam05557: MAD; Mitotic checkpoint protein.

Name: WEE1
Sequence classification: Mus musculus wee 1 homolog (S. pombe) (Wee1), mRNA.
Description: This gene encodes a nuclear protein, which is a tyrosine kinase belonging to the Ser/Thr family of protein kinases. The human homologue catalyzes the inhibitory tyrosine phosphorylation of CDC2/cyclin B kinase, and appears to coordinate the transition between DNA replication and mitosis by protecting the nucleus from cytoplasmically activated CDC2 kinase.

Name: BMPR2
Sequence classification: Mus musculus bone morphogenic protein receptor, type II (serine/threonine kinase) (Bmpr2), mRNA.
Description: 1. Results define a pathway linking the bone morphogenetic protein receptor BMPRII to regulation of actin and provide insights into how extracellular signals modulate LIMK1 activity during dendritogenesis. 2. BMPR-1A, -2, and Noggin are up-regulated in undifferentiated mesenchymal cells and regenerating muscle fibers during the early phase of BMP-2-induced bone formation. GO: transforming growth factor beta receptor activity, protein serine/threonine kinase and tyrosine kinase activity, anterior/posterior pattern formation, protein amino acid phosphorylation

Name: RFX3
Sequence classification: Mus musculus regulatory factor X, 3 (influences HLA class II expression) (Rfx3), mRNA.
Description: The transcription factor RFX3 directs nodal cilium development and left-right asymmetry specification. Mol Cell Biol. 2004 May;24(10):4417-27. The human homologue may have a role as a modulator of Ras signalling in epithelial cells.

Name: GTF2IRD2
Sequence classification: Mus musculus GTF2I repeat domain containing 2 (Gtf2ird2), mRNA.
Description: The exact function of this gene product is not known. It is inferred to be a transcription factor based on the presence of GTF2I-like repeats (containing helix-loop-helix motifs), also found in other proteins such as GTF2IRD1 and GTF2I.

Name: CRIM1
Sequence classification: PREDICTED: Mus musculus cysteine-rich motor neuron 1 (Crim1), mRNA.
Description: 1. Modulates BMP activity by affecting its processing and delivery to the cell surface. 2. Has a role in capillary formation and maintenance during angiogenesis. GO: insulin-like growth factor binding, serine-type endopeptidase inhibitor activity, regulation of cell growth, extracellular region.

Name: SIGLECL1
Sequence classification: Mus musculus SIGLEC-like 1 (Siglecl1), mRNA.
Description: Sialic acid-binding immunoglobulin-like lectins (SIGLECs) are a family of cell surface proteins belonging to the immunoglobulin superfamily. They mediate protein-carbohydrate interactions by selectively binding to different sialic acid moieties present on glycolipids and glycoproteins. This gene encodes a member of the SIGLEC3-like subfamily of SIGLECs. Members of this subfamily are characterized by an extracellular V-set immunoglobulin-like domain followed by two C2-set immunoglobulin-like domains, and the cytoplasmic tyrosine-based motifs ITIM and SLAM-like. The encoded protein, upon tyrosine phosphorylation, has been shown to recruit the Src homology 2 domain-containing protein-tyrosine phosphatases SHP1 and SHP2. It has been suggested that the protein is involved in the negative regulation of macrophage signalling by functioning as an inhibitory receptor. This gene is located in a cluster with other SIGLEC3-like genes on 19q13.4. Alternatively spliced transcript variants encoding distinct isoforms have been described for this gene.

Name: CER1
Sequence classification: Mus musculus cerberus 1 homolog (Xenopus laevis) (Cer1), mRNA.
Description: 1. Cerberus-like and Lefty-1 function redundantly to modulate Nodal signalling during gastrulation and regulate patterning of the primitive streak. 2. Role of Cerberus-like in mouse embryogenesis. GO: cytokine activity, extracellular space.

Name: 5930405J04RIK
Sequence classification: Mus musculus RIKEN cDNA 5930405J04 gene (5930405J04Rik), mRNA.
Description: Conserved domains: COG5259: RSC8; RSC chromatin remodeling complex subunit RSC8 [Chromatin structure and dynamics / Transcription]. COG5271: MDN1; AAA ATPase containing von Willebrand factor type A (vWA) domain [General function prediction only]. pfam04433: SWIRM; SWIRM domain

4.4 Conclusions

To summarize, computational analysis of the human SAGE data from u-hESCs, adult and fetal cells led to the identification of 20,047 tags enriched in u-hESCs. After applying stringent criteria, particularly the mapping of these enriched tags to genomic sequences with multiple species sequence conservation, I isolated 301 high quality tags novel to the hESC SAGE data. Computational annotation of the human genomic sequences identified by a novel hESC tag isolated 60 candidate mouse transcripts (identified by BLASTN and TBLASTX analysis) corresponding to 64 novel hESC tags. A subset of the identified mouse transcripts was also identified by a mESC SAGE tag (7 candidate transcripts; nESC01-nESC07). These candidate genes may be integral to pluripotency as evidenced by their identification in human and mouse gene expression data. After further characterization of the novel hESC tags associated with a mouse ortholog, I observed a preponderance of transcripts implicated in cell cycle regulation (e.g., Wee1; nESC39) or genomic stability (e.g., Rad51l3; nESC13), which also represent a class of genes highly represented in hESCs. Other candidate novel hESC genes, such as nESC31 associated with the mouse transcript CHD1, may later prove to regulate epigenetic gene expression in conjunction with the known embryonic methyl transferase DNMT3β.
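The matching of mESC tags to candidate mouse transcripts summarized above can be illustrated with a short Python sketch. It assumes the usual SAGE convention, which is not restated in this section: the tag is the sequence immediately 3' of the 3'-most anchoring enzyme site (CATG for NlaIII), 10 nt for short SAGE and 17 nt for long SAGE. The sequences, names, and counts are invented for illustration and the function names are hypothetical.

```python
ANCHOR = "CATG"  # NlaIII recognition site (assumed anchoring enzyme)


def three_prime_tag(transcript, tag_length=10):
    """Return the tag following the 3'-most anchor site, or None if absent."""
    seq = transcript.upper()
    pos = seq.rfind(ANCHOR)  # position of the 3'-most CATG
    if pos == -1:
        return None          # transcript has no anchoring site
    start = pos + len(ANCHOR)
    tag = seq[start:start + tag_length]
    # A site too close to the 3' end cannot yield a full-length tag.
    return tag if len(tag) == tag_length else None


def transcripts_with_observed_tags(transcripts, tag_counts, tag_length=10):
    """Map transcript name -> (tag, count) for predicted tags seen in a library."""
    supported = {}
    for name, seq in transcripts.items():
        tag = three_prime_tag(seq, tag_length)
        if tag is not None and tag in tag_counts:
            supported[name] = (tag, tag_counts[tag])
    return supported


# Invented candidate transcripts and an invented short SAGE tag table.
transcripts = {
    "tx_a": "TTTCATG" + "CCCCCCCCCC" + "AAACATG" + "ACGTACGTAA" + "TTTTT",
    "tx_b": "GGGCATG" + "TTTTGGGGCC" + "AAAAA",
    "tx_c": "GGGGAAAACCCC",  # no CATG site
}
tag_counts = {"ACGTACGTAA": 4, "GGGGGGGGGG": 2}

supported = transcripts_with_observed_tags(transcripts, tag_counts)
```

Here only tx_a is supported by an observed tag. Because a 10 nt tag can match many unrelated sequences, restricting the search to transcripts in regions orthologous to a hESC tag, as done in this chapter, reduces the chance of a spurious match.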
Our goal was to devise a method to select for novel genes expressed in the human embryonic stem cell lines. To this end I have isolated a number of candidate hESC tags supported by various in silico resources. These tags will be further investigated by utilizing the SAGE tags as gene-specific primers in 3' and 5' rapid amplification of cDNA ends (RACE) (Frohman et al., 1988) to obtain longer sequences unique to hESCs. The novel tags identified using this approach have been used to obtain full-length cDNA sequences using 5' and 3' RACE in the follow-up study to this analysis (Hirst et al., submitted). This subsequent analysis describes the cloning and characterization of the novel transcript SPD4 (shares promoter with DPPA4). The SAGE tag first identifying the gene was present at 3 tags in 2.6 million total tags in the u-hESC metalibrary. Functional analysis revealed that SPD4 may encode a miRNA based on sequence homology to known miRNAs and its ability to form a stem-loop structure, a required feature of miRNAs. In addition, quantitative PCR using RNA from undifferentiated and differentiated hESCs showed a reduction in SPD4 in response to differentiation. Our efforts have demonstrated that the candidate novel genes found in the hESC SAGE data are expressed at low levels, but that the majority represent real transcripts. These transcripts, such as SPD4, are associated with undifferentiated hESCs and may prove to be necessary for stem cell maintenance and pluripotency.
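To put the abundance of the SPD4 tag in context, raw counts can be normalized to tags per million (TPM). The sketch below assumes the metalibrary size of 2,613,475 total tags reported in the Abstract and, for contrast, the most abundant mESC tag in Table 36 (364 of 137,906 tags); the function name is illustrative.

```python
def tags_per_million(tag_count, library_size):
    """Normalize a raw SAGE tag count to tags per million total tags."""
    if library_size <= 0:
        raise ValueError("library_size must be positive")
    return tag_count / library_size * 1_000_000


# SPD4 tag: 3 observations in the u-hESC metalibrary.
spd4_tpm = tags_per_million(3, 2_613_475)

# mESC tag TGGCTCGGTC (nESC07): 364 observations in 137,906 total tags.
actg1_mesc_tpm = tags_per_million(364, 137_906)

print(f"SPD4 tag: {spd4_tpm:.2f} TPM")
print(f"mESC TGGCTCGGTC tag: {actg1_mesc_tpm:.0f} TPM")
```

At roughly one tag per million, a transcript such as SPD4 would often be missed entirely in a single library of a few hundred thousand tags, which is why pooling libraries into a metalibrary matters for detecting low-abundance candidates.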
List of Appendices

(http://www.bcgsc.ca/people/angels/htdocs/Thesis_appendices/)

Appendix 2a  Process Discovery SPACE mappings script
Appendix 2b  GO slim Molecular Function terms and scripts
Appendix 2c  WA09 CMOST mappings
Appendix 2d  Unmapped WA09 mouse CMOST mappings
Appendix 2e  Database of species ambiguous tags
Appendix 2f  WA09 species ambiguous tag mappings
Appendix 2g  Database of pluripotent stem cell associated genes (PAGs)
Appendix 2h  WA09 PAG mappings
Appendix 2i  WA09 WNT mappings
Appendix 2j  WA09 TGF-beta mappings
Appendix 2k  WA09 Jak/Stat mappings
Appendix 2l  Statistical testing of GO molecular functions between u-hESCs and n-CGAP libraries
Appendix 3a  CGAP SAGE libraries
Appendix 3b  Matrix.pl script
Appendix 3c  Random tag generator script
Appendix 3d  Correlation matrix
Appendix 3e  Up-regulated in hESC versus nCGAP libraries
Appendix 4a  Normal and malignant CGAP library list
Appendix 4b  All novel hESC candidates
Appendix 4c  Enriched hESC tag mappings
Appendix 4d  Candidate novel hESC gene list
Appendix 4e  Novel hESC tag matrix
Appendix 4f  BLASTN results
Appendix 4g  TBLASTX results

Bibliography

Journal Articles

Altschul, S. F., Gish, W., Miller, W., Myers, E. W., and Lipman, D. J. (1990). Basic local alignment search tool. J Mol Biol 215, 403-10.

Ambrosetti, D. C., Basilico, C., and Dailey, L. (1997). Synergistic activation of the fibroblast growth factor 4 enhancer by Sox2 and Oct-3 depends on protein-protein interactions facilitated by a specific spatial arrangement of factor binding sites. Mol Cell Biol 17, 6321-9.

Anisimov, S. V., Tarasov, K. V., Tweedie, D., Stern, M. D., Wobus, A. M., and Boheler, K. R. (2002). SAGE identification of gene transcripts with profiles unique to pluripotent mouse R1 embryonic stem cells. Genomics 79, 169-76.

Ashburner, M., Ball, C. A., Blake, J. A., Botstein, D., Butler, H., Cherry, J. M., Davis, A. P., Dolinski, K., Dwight, S. S., Eppig, J. T., Harris, M. A., Hill, D. P., Issel-Tarver, L., Kasarskis, A., Lewis, S., Matese, J. C., Richardson, J. E., Ringwald, M., Rubin, G. M., and Sherlock, G. (2000). Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 25, 25-9.

Assady, S., Maor, G., Amit, M., Itskovitz-Eldor, J., Skorecki, K. L., and Tzukerman, M. (2001). Insulin production by human embryonic stem cells. Diabetes 50, 1691-7.

Aubert, J., Dunstan, H., Chambers, I., and Smith, A. (2002). Functional gene screening in embryonic stem cells implicates Wnt antagonism in neural differentiation. Nat Biotechnol 20, 1240-5.

Audic, S., and Claverie, J. M. (1997). The significance of digital gene expression profiles. Genome Res 7, 986-95.

Aulehla, A., Wehrle, C., Brand-Saberi, B., Kemler, R., Gossler, A., Kanzler, B., and Herrmann, B. G. (2003). Wnt3a plays a major role in the segmentation clock controlling somitogenesis. Dev Cell 4, 395-406.

Barrow, J. R., Thomas, K. R., Boussadia-Zahui, O., Moore, R., Kemler, R., Capecchi, M. R., and McMahon, A. P. (2003). Ectodermal Wnt3/beta-catenin signaling is required for the establishment and maintenance of the apical ectodermal ridge. Genes Dev 17, 394-409.

Bartel, D. P. (2004). MicroRNAs: genomics, biogenesis, mechanism, and function. Cell 116, 281-97.

Beaulieu, N., Morin, S., Chute, I. C., Robert, M. F., Nguyen, H., and MacLeod, A. R. (2002). An essential role for DNA methyltransferase DNMT3B in cancer cell survival. J Biol Chem 277, 28176-81.

Bhattacharya, B., Miura, T., Brandenberger, R., Mejido, J., Luo, Y., Yang, A. X., Joshi, B. H., Ginis, I., Thies, R. S., Amit, M., Lyons, I., Condie, B. G., Itskovitz-Eldor, J., Rao, M. S., and Puri, R. K. (2004). Gene expression in human embryonic stem cell lines: unique molecular signature. Blood 103, 2956-64.

Birney, E., Andrews, D., Bevan, P., Caccamo, M., Cameron, G., Chen, Y., Clarke, L., Coates, G., Cox, T., Cuff, J., Curwen, V., Cutts, T., Down, T., Durbin, R., Eyras, E., Fernandez-Suarez, X. M., Gane, P., Gibbins, B., Gilbert, J., Hammond, M., Hotz, H., Iyer, V., Kahari, A., Jekosch, K., Kasprzyk, A., Keefe, D., Keenan, S., Lehvaslaiho, H., McVicker, G., Melsopp, C., Meidl, P., Mongin, E., Pettett, R., Potter, S., Proctor, G., Rae, M., Searle, S., Slater, G., Smedley, D., Smith, J., Spooner, W., Stabenau, A., Stalker, J., Storey, R., Ureta-Vidal, A., Woodwark, C., Clamp, M., and Hubbard, T. (2004). Ensembl 2004. Nucleic Acids Res 32 Database issue, D468-70.

Boon, K., Osorio, E. C., Greenhut, S. F., Schaefer, C. F., Shoemaker, J., Polyak, K., Morin, P. J., Buetow, K. H., Strausberg, R. L., De Souza, S. J., and Riggins, G. J. (2002). An anatomy of normal and malignant gene expression. Proc Natl Acad Sci USA 99, 11287-92.

Botquin, V., Hess, H., Fuhrmann, G., Anastassiadis, C., Gross, M. K., Vriend, G., and Scholer, H. R. (1998). New POU dimer configuration mediates antagonistic control of an osteopontin preimplantation enhancer by Oct-4 and Sox-2. Genes Dev 12, 2073-90.

Brandenberger, R., Khrebtukova, I., Thies, R. S., Miura, T., Jingli, C., Puri, R., Vasicek, T., Lebkowski, J., and Rao, M. (2004a). MPSS profiling of human embryonic stem cells. BMC Dev Biol 4, 10.

Brandenberger, R., Wei, H., Zhang, S., Lei, S., Murage, J., Fisk, G. J., Li, Y., Xu, C., Fang, R., Guegler, K., Rao, M. S., Mandalam, R., Lebkowski, J., and Stanton, L. W. (2004b). Transcriptome characterization elucidates signaling networks that control human ES cell growth and differentiation. Nat Biotechnol 22, 707-16.

Brenner, S., Johnson, M., Bridgham, J., Golda, G., Lloyd, D. H., Johnson, D., Luo, S., McCurdy, S., Foy, M., Ewan, M., Roth, R., George, D., Eletr, S., Albrecht, G., Vermaas, E., Williams, S. R., Moon, K., Burcham, T., Pallas, M., DuBridge, R. B., Kirchner, J., Fearon, K., Mao, J., and Corcoran, K. (2000). Gene expression analysis by massively parallel signature sequencing (MPSS) on microbead arrays. Nat Biotechnol 18, 630-4.

Brickman, J. M., and Burdon, T. G. (2002). Pluripotency and tumorigenicity. Nat Genet 32, 557-8.

Brockington, M., Blake, D. J., Prandini, P., Brown, S. C., Torelli, S., Benson, M. A., Ponting, C. P., Estournet, B., Romero, N. B., Mercuri, E., Voit, T., Sewry, C. A., Guicheney, P., and Muntoni, F. (2001). Mutations in the fukutin-related protein gene (FKRP) cause a form of congenital muscular dystrophy with secondary laminin alpha2 deficiency and abnormal glycosylation of alpha-dystroglycan. Am J Hum Genet 69, 1198-209.

Buchkovich, K., Duffy, L. A., and Harlow, E. (1989). The retinoblastoma protein is phosphorylated during specific phases of the cell cycle. Cell 58, 1097-105.

Burdon, T., Chambers, I., Stracey, C., Niwa, H., and Smith, A. (1999a). Signaling mechanisms regulating self-renewal and differentiation of pluripotent embryonic stem cells. Cells Tissues Organs 165, 131-43.

Burdon, T., Stracey, C., Chambers, I., Nichols, J., and Smith, A. (1999b). Suppression of SHP-2 and ERK signalling promotes self-renewal of mouse embryonic stem cells. Dev Biol 210, 30-43.

Cadigan, K. M., and Nusse, R. (1997). Wnt signaling: a common theme in animal development. Genes Dev 11, 3286-305.

Cai, J., Chen, J., Liu, Y., Miura, T., Luo, Y., Loring, J. F., Freed, W. J., Rao, M. S., and Zeng, X. (2005). Assessing self-renewal and differentiation in hESC lines. Stem Cells.

Caplen, N. J., Parrish, S., Imani, F., Fire, A., and Morgan, R. A. (2001). Specific inhibition of gene expression by small double-stranded RNAs in invertebrate and vertebrate systems. Proc Natl Acad Sci USA 98, 9742-7.

Carpenter, M. K., Inokuma, M. S., Denham, J., Mujtaba, T., Chiu, C. P., and Rao, M. S. (2001). Enrichment of neurons and neural precursors from human embryonic stem cells. Exp Neurol 172, 383-97.
Carpenter, M. K., Rosier, E. S., Fisk, G. J., Brandenberger, R., Ares, X., Miura, T., Lucero, M., and Rao, M. S. (2004). Properties of four human embryonic stem cell lines maintained in a feeder-free culture system. Dev Dyn 2 2 9 , 243-58.  Cavaleri, F., and Scholer, H. R. (2003). Nanog: a new recruit to the embryonic stem cell orchestra. Cell 1 1 3 , 551-2.  197  Chambers, I., Colby, D., Robertson, M., Nichols, J., Lee, S., Tweedie, S., and Smith, A. (2003). Functional expression cloning of Nanog, a pluripotency sustaining factor in embryonic stem cells. Cell 1 1 3 , 643-55.  Chen, J., Lee, S., Zhou, G., and Wang, S. M. (2002). High-throughput GLGI procedure for converting a large number of serial analysis of gene expression tag sequences into 3' complementary DNAs. Genes Chromosomes Cancer 3 3 , 252-61.  Chen, J. J., Rowley, J. D., and Wang, S. M. (2000). Generation of longer cDNA fragments from serial analysis of gene expression tags for gene identification. Proc Natl Acad Sci USA 91, 349-53.  Cheung, P., Allis, C. D., and Sassone-Corsi, P. (2000). Signaling to chromatin through histone modifications. Cell 1 0 3 , 263-71.  Chuma, M., Saeki, N., Yamamoto, Y., Ohta, T., Asaka, M., Hirohashi, S., and Sakamoto, M. (2004). Expression profiling in hepatocellular carcinoma with intrahepatic metastasis: identification of high-mobility group I(Y) protein as a molecular marker of hepatocellular carcinoma metastasis. Keio J Med 5 3 , 90-7.  Classon, M., and Harlow, E. (2002). The retinoblastoma tumour suppressor in development and cancer. Nat Rev Cancer 2, 910-7.  Du, Z., Cong, H., and Yao, Z. (2001). Identification of putative downstream genes of Oct4 by suppression-subtractive hybridization. Biochem Biophys Res Commun 2S2, 701-6.  198  Dykxhoorn, D. M., Novina, C. D., and Sharp, P. A. (2003). Killing the messenger: short RNAs that silence gene expression. Nat Rev Mol Cell Biol 4, 457-67.  Dyson, N. (1998). The regulation of E2F by pRB-family proteins. 
Genes Dev 12, 224562. Eisen, M. B., Spellman, P. T., Brown, P. O., and Botstein, D. (1998). Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci USA 95, 148638. Evans, M. J., and Kaufman, M. H. (1981). Establishment in culture of pluripotential cells from mouse embryos. Nature 292, 154-6. Evans, S. J., Datson, N. A., Kabbaj, M., Thompson, R. C , Vreugdenhil, E., De Kloet, E. R., Watson, S. J., and Akil, H. (2002). Evaluation of Affymetrix Gene Chip sensitivity in rat hippocampal tissue using SAGE analysis. Serial Analysis of Gene Expression. Eur JNeurosci 16,409-13.  Fattaey, A., Helin, K., and Harlow, E. (1993). Transcriptional inhibition by the retinoblastoma protein. Philos Trans R Soc Lond B Biol Sci 340, 333-6. Feldman, B., Poueymirou, W., Papaioannou, V. E., DeChiara, T. M., and Goldfarb, M. (1995). Requirement of FGF-4 for postimplantation mouse development. Science  267,246-9. Fitch, W. M., and Margoliash, E. (1967). Construction of phylogenetic trees. Science 155, 279-84.  1 9 9  Fougerousse, F., Bullen, P., Herasse, M., Lindsay, S., Richard, I., Wilson, D., Suel, L., Durand, M., Robson, S., Abitbol, M., Beckmann, J. S., and Strachan, T. (2000). Human-mouse differences in the embryonic expression patterns of developmental control genes and disease genes. Hum Mol Genet 9, 165-73.  Frohman, M. A., Dush, M. K., and Martin, G. R. (1988). Rapid production of full-length cDNAs from rare transcripts: amplification using a single gene-specific oligonucleotide primer. Proc Natl Acad Sci USAS5, 8998-9002.  Gerecht-Nir, S., Dazard, J. E., Golan-Mashiach, M., Osenberg, S., Botvinnik, A., Amariglio, N., Domany, E., Rechavi, G., Givol, D., and Itskovitz-Eldor, J. (2005). Vascular gene expression and phenotypic correlation during differentiation of human embryonic stem cells. Dev Dyn 232,487-97.  Gerhard, D. S., Wagner, L., Feingold, E. A., Shenmen, C. M., Grouse, L. H., Schuler, G., Klein, S. 
L., Old, S., Rasooly, R., Good, P., Guyer, M., Peck, A. M., Derge, J. G., Lipman, D., Collins, F. S., Jang, W., Sherry, S., Feolo, M., Misquitta, L., Lee, E., Rotmistrovsky, K., Greenhut, S. F., Schaefer, C. F., Buetow, K., Bonner, T. I., Haussler, D., Kent, J., Kiekhaus, M., Furey, T., Brent, M., Prange, C , Schreiber, K., Shapiro, N., Bhat, N. K., Hopkins, R. F., Hsie, F., Driscoll, T., Soares, M. B., Casavant, T. L., Scheetz, T. E., Brown-stein, M. J., Usdin, T. B., Toshiyuki, S., Carninci, P., Piao, Y., Dudekula, D. B., Ko, M. S., Kawakami, K., Suzuki, Y., Sugano, S., Gruber, C. E., Smith, M. R., Simmons, B., Moore, T., Waterman, R., Johnson, S. L., Ruan, Y., Wei, C. L., Mathavan, S., Gunaratne, P. H., Wu, J., Garcia, A. M., Hulyk, S. W., Fuh, E., Yuan, Y., Sneed, A., Kowis, C , Hodgson,  A., Muzny, D. M., McPherson, J., Gibbs, R. A., Fahey, J., Helton, E., Ketteman, M., Madan, A., Rodrigues, S., Sanchez, A., Whiting, M., Madari, A., Young, A. C , Wetherby, K. D., Granite, S. J., Kwong, P. N., Brinkley, C. P., Pearson, R. L., Bouffard, G. G., Blakesly, R. W., Green, E. D., Dickson, M. C., Rodriguez, A. C., Grimwood, J., Schmutz, J., Myers, R. M., Butterfield, Y. S., Griffith, M., Griffith, O. L., Krzywinski, M. I., Liao, N., Morrin, R., Palmquist, D., et al. (2004). The status, quality, and expansion of the NIH full-length cDNA project: the Mammalian Gene Collection (MGC). Genome Res 14, 2121-7.  Ginis, I., Luo, Y., Miura, T., Thies, S., Brandenberger, R., Gerecht-Nir, S., Amit, M., Hoke, A., Carpenter, M. K., Itskovitz-Eldor, J., and Rao, M. S. (2004). Differences between human and mouse embryonic stem cells. Dev Biol 269, 36080.  Giraldez, A. J., Cinalli, R. M., Glasner, M. E., Enright, A. J., Thomson, J. M., Baskerville, S., Hammond, S. M., Bartel, D. P., and Schier, A. F. (2005). MicroRNAs regulate brain morphogenesis in zebrafish. Science 308, 833-8.  Golan-Mashiach, M., Dazard, J. 
E., Gerecht-Nir, S., Amariglio, N., Fisher, T., JacobHirsch, J., Bielorai, B., Osenberg, S., Barad, O., Getz, G., Toren, A., Rechavi, G., Itskovitz-Eldor, J., Domany, E., and Givol, D. (2005). Design principle of gene expression used by human stem cells: implication for pluripotency. Faseb J19, 147-9.  Goodrich, D. W., Wang, N. P., Qian, Y. W., Lee, E. Y., and Lee, W. H. (1991). The retinoblastoma gene product regulates progression through the Gl phase of the cell cycle. Cell 67, 293-302.  Hanahan, D., and Weinberg, R. A. (2000). The hallmarks of cancer. Cell 100, 57-70.  Harrer, M., Luhrs, FL, Bustin, M., Scheer, U., and Hock, R. (2004). Dynamic interaction of HMGAla proteins with chromatin. J Cell Sci 111, 3459-71.  Hart, A. H., Willson, T. A., Wong, M., Parker, K., and Robb, L. (2005). Transcriptional regulation of the homeobox gene Mixll by TGF-beta and FoxHl. Biochem Biophys Res Commun 333, 1361-9.  Hatakeyama, M., and Weinberg, R. A. (1995). The role of RB in cell cycle control. Prog Cell Cycle Res 1,9-19.  Heinrich, P. C , Behrmann, I., Muller-Newen, G., Schaper, F., and Graeve, L. (1998). Interleukin-6-type cytokine signalling through the gpl30/Jak/STAT pathway. Biochem J 334 ( Pt 2), 297-314.  Helin, K. (1998). Regulation of cell proliferation by the E2F transcription factors. Curr Opin Genet Dev 8, 28-35.  Helin, K., Harlow, E., and Fattaey, A. (1993). Inhibition of E2F-1 transactivation by direct binding of the retinoblastoma protein. Mol Cell Biol 13, 6501-8.  202  Hickman, E. S., and Helin, K. (2002). The regulation of APAF1 expression during development and tumourigenesis. Apoptosis 7,167-71.  Hickman, E. S., Moroni, M. C , and Helin, K. (2002). The role of p53 and pRB in apoptosis and cancer. Curr Opin Genet Dev 12, 60-6.  Hubbard, T. (2002). Biological information: making it accessible and integrated (and trying to make sense of it). Bioinformatics 18 Suppl 2, SI 40.  
Hubbard, T., Andrews, D., Caccamo, M., Cameron, G., Chen, Y., Clamp, M., Clarke, L., Coates, G., Cox, T., Cunningham, F., Curwen, V., Cutts, T., Down, T., Durbin, R., Fernandez-Suarez, X. M., Gilbert, J., Hammond, M., Herrero, J., Hotz, H., Howe, K., Iyer, V., Jekosch, K., Kahari, A., Kasprzyk, A., Keefe, D., Keenan, S., Kokocinsci, F., London, D., Longden, I., McVicker, G., Melsopp, C , Meidl, P., Potter, S., Proctor, G., Rae, M., Rios, D., Schuster, M., Searle, S., Severin, J., Slater, G., Smedley, D., Smith, J., Spooner, W., Stabenau, A., Stalker, J., Storey, R., Trevanion, S., Ureta-Vidal, A., Vogel, J., White, S., Woodwark, C , and Birney, E. (2005). Ensembl 2005. Nucleic Acids Res 33, D447-53.  Hubbard, T., Barker, D., Birney, E., Cameron, G., Chen, Y., Clark, L., Cox, T., Cuff, J., Curwen, V., Down, T., Durbin, R., Eyras, E., Gilbert, J., Hammond, M., Huminiecki, L., Kasprzyk, A., Lehvaslaiho, H., Lijnzaad, P., Melsopp, C , Mongin, E., Pettett, R., Pocock, M., Potter, S., Rust, A., Schmidt, E., Searle, S., Slater, G., Smith, J., Spooner, W., Stabenau, A., Stalker, J., Stupka, E., Ureta-  Vidal, A., Vastrik, I., and Clamp, M. (2002). The Ensembl genome database project. Nucleic Acids Res 30, 38-41.  Hwang, W. S., Ryu, Y. J., Park, J. H., Park, E. S., Lee, E. G., Koo, J. M., Jeon, H. Y., Lee, B. C , Kang, S. K., Kim, S. J., Ahn, C , Hwang, J. H., Park, K. Y., Cibelli, J. B., and Moon, S. Y. (2004). Evidence of a pluripotent human embryonic stem cell line derived from a cloned blastocyst. Science 303, 1669-74.  Ikeya, M., Lee, S. M., Johnson, J. E., McMahon, A. P., and Takada, S. (1997). Wnt signalling required for expansion of neural crest and CNS progenitors. Nature 389, 966-70.  Itskovitz-Eldor, J., Schuldiner, M., Karsenti, D., Eden, A., Yanuka, O., Amit, M., Soreq, H. , and Benvenisty, N. (2000). Differentiation of human embryonic stem cells into embryoid bodies compromising the three embryonic germ layers. Mol Med 6, 8895.  Ivanova, N. B., Dimos, J. 
T., Schaniel, C , Hackney, J. A., Moore, K. A., and Lemischka, I. R. (2002). A stem cell molecular signature. Science 298, 601-4.  Jenuwein, T., and Allis, C. D. (2001). Translating the histone code. Science 293, 1074-80.  Johnson, D. G., Ohtani, K., and Nevins, J. R. (1994). Autoregulatory control of E2F1 expression in response to positive and negative regulators of cell cycle progression. Genes Dev 8, 1514-25.  Jongeneel, C. V., Iseli, C , Stevenson, B. J., Riggins, G. J., Lai, A., Mackay, A., Harris, R. A., O'Hare, M. J., Neville, A. M., Simpson, A. J., and Strausberg, R. L. (2003). Comprehensive sampling of gene expression in human cell lines with massively parallel signature sequencing. Proc Natl Acad Sci USA 100,4702-5.  Kasprzyk, A., Keefe, D., Smedley, D., London, D., Spooner, W., Melsopp, C , Hammond, M., Rocca-Serra, P., Cox, T., and Birney, E. (2004). EnsMart: a generic system for fast and flexible access to biological data. Genome Res 14, 160-9.  Kehat, I., Amit, M., Gepstein, A., Huber, I., Itskovitz-Eldor, J., and Gepstein, L. (2003). Development of cardiomyocytes from human ES cells. Methods Enzymol 365, 461-73.  Kellis, M., Patterson, N., Birren, B., Berger, B., and Lander, E. S. (2004). Methods in comparative genomics: genome correspondence, gene identification and regulatory motif discovery. J Comput Biol 11, 319-55.  Kelly, D. L., and Rizzino, A. (2000). DNA microarray analyses of genes regulated during the differentiation of embryonic stem cells. Mol Reprod Dev 56, 113-23.  Kielman, M. F., Rindapaa, M., Gaspar, C , van Poppel, N., Breukel, C , van Leeuwen, S., Taketo, M. M., Roberts, S., Smits, R., and Fodde, R. (2002). Ape modulates embryonic stem-cell differentiation by controlling the dosage of beta-catenin signaling. Nat Genet 32, 594-605.  205  Lai, A., Lash, A. E., Altschul, S. F., Velculescu, V., Zhang, L., McLendon, R. E., Marra, M. A., Prange, C , Morin, P. J., Polyak, K., Papadopoulos, N., Vogelstein, B., Kinzler, K. 
W., Strausberg, R. L., and Riggins, G. J. (1999). A public database for gene expression in human cancers. Cancer Res 5 9 , 5403-7.  Lavon, N., and Benvenisty, N. (2005). Study of hepatocyte differentiation using embryonic stem cells. J Cell Biochem.  Lebkowski, J. S., Gold, J., Xu, C., Funk, W., Chiu, C. P., and Carpenter, M. K. (2001). Human embryonic stem cells: culture, differentiation, and genetic modification for regenerative medicine applications. Cancer Jl Suppl 2, S83-93.  Legg, J., Jensen, U . B., Broad, S., Leigh, I., and Watt, F. M. (2003). Role of melanoma chondroitin sulphate proteoglycan in patterning stem cells in human interfollicular epidermis. Development 130, 6049-63.  Levenberg, S., Golub, J. S., Amit, M., Itskovitz-Eldor, J., and Langer, R. (2002). Endothelial cells derived from human embryonic stem cells. Proc Natl Acad Sci U SA99, 4391-6.  Ling, M. T., Wang, X., Ouyang, X. S., Xu, K., Tsao, S. W., and Wong, Y. C. (2003). Id-1 expression promotes cell survival through activation of NF-kappaB signalling pathway in prostate cancer cells. Oncogene 22, 4498-508.  Lipshutz, R. J., Fodor, S. P., Gingeras, T. R., and Lockhart, D. J. (1999). High density synthetic oligonucleotide arrays. Nat Genet 21, 20-4.  206  Liu, L., Leaman, D., Villalta, M., and Roberts, R. M. (1997). Silencing of the gene for the alpha-subunit of human chorionic gonadotropin by the embryonic transcription factor Oct-3/4. Mol Endocrinol 11, 1651-8.  Liu, L., and Roberts, R. M. (1996). Silencing of the gene for the beta subunit of human chorionic gonadotropin by the embryonic transcription factor Oct-3/4. J Biol Chem 111, 16683-9.  Lockhart, D. J., Dong, H., Byrne, M. C , Follettie, M. T., Gallo, M. V., Chee, M. S., Mittmann, M., Wang, C , Kobayashi, M., Horton, H., and Brown, E. L. (1996). Expression monitoring by hybridization to high-density oligonucleotide arrays. Nat Biotechnol 14, 1675-80.  Logan, C. Y., and Nusse, R. (2004). 
The Wnt signaling pathway in development and disease. Annu Rev Cell Dev Biol 20, 781-810.  Makalowski, W., Zhang, J., and Boguski, M. S. (1996). Comparative analysis of 1196 orthologous mouse and human full-length mRNA and protein sequences. Genome Res 6, 846-57.  Martin, G. R. (1981). Isolation of a pluripotent cell line from early mouse embryos cultured in medium conditioned by teratocarcinoma stem cells. Proc Natl Acad Sci USA 18, 7634-8.  Miller, K. A., Sawicka, D., Barsky, D., and Albala, J. S. (2004). Domain mapping of the Rad51 paralog protein complexes. Nucleic Acids Res 32, 169-78.  Mitsui, K., Tokuzawa, Y., Itoh, H., Segawa, K., Murakami, M., Takahashi, K., Maruyama, M., Maeda, M., and Yamanaka, S. (2003). The homeoprotein Nanog is required for maintenance of pluripotency in mouse epiblast and ES cells. Cell 113, 631-42.  Mullor, J. L., Sanchez, P., and Altaba, A. R. (2002). Pathways and consequences: Hedgehog signaling in human disease. Trends Cell Biol 12, 562-9.  Mummery, C , Ward, D., van den Brink, C. E., Bird, S. D., Doevendans, P. A., Opthof, T., Brutel de la Riviere, A., Tertoolen, L., van der Heyden, M., and Pera, M. (2002). Cardiomyocyte differentiation of mouse and human embryonic stem cells. JAnat 200, 233-42.  Narita, M., Nunez, S., Heard, E., Lin, A. W., Hearn, S. A., Spector, D. L., Harmon, G. J., and Lowe, S. W. (2003). Rb-mediated heterochromatin formation and silencing of E2F target genes during cellular senescence. Cell 113, 703-16.  Nichols, J., Zevnik, B., Anastassiadis, K., Niwa, H., Klewe-Nebenius, D., Chambers, I., Scholer, H., and Smith, A. (1998). Formation of pluripotent stem cells in the mammalian embryo depends on the POU transcription factor Oct4. Cell 95, 37991.  Nishimoto, M., Fukushima, A., Okuda, A., and Muramatsu, M. (1999). The gene for the embryonic stem cell coactivator UTF1 carries a regulatory element which selectively interacts with a complex composed of Oct-3/4 and Sox-2. Mol Cell Biol 19, 5453-65.  
208  Niswander, L., Tickle, C , Vogel, A., Booth, I., and Martin, G. R. (1993). FGF-4 replaces the apical ectodermal ridge and directs outgrowth and patterning of the limb. Cell 75, 579-87.  Pardal, R., Clarke, M. F., and Morrison, S. J. (2003). Applying the principles of stem-cell biology to cancer. Nat Rev Cancer 3, 895-902.  Parr, B. A., Cornish, V. A., Cybulsky, M. I., and McMahon, A. P. (2001). WntVb regulates placental development in mice. Dev Biol 237, 324-32.  Pesce, M., Anastassiadis, K., and Scholer, H. R. (1999). Oct-4: lessons of totipotency from embryonic stem cells. Cells Tissues Organs 165, 144-52.  Pesce, M., Gross, M. K., and Scholer, H. R. (1998a). In line with our ancestors: Oct-4 and the mammalian germ. Bioessays 20, 722-32.  Pesce, M., and Scholer, H. R. (2001). Oct-4: gatekeeper in the beginnings of mammalian development. Stem Cells 19, 271-8.  Pesce, M., Wang, X., Wolgemuth, D. J., and Scholer, H. (1998b). Differential expression of the Oct-4 transcription factor during mouse germ cell differentiation. Mech Dev 71, 89-98.  Pleasance, E. D., Marra, M. A., and Jones, S. J. (2003). Assessment of SAGE in transcript identification. Genome Res 13, 1203-15.  209  Pruitt, K. D., Tatusova, T., and Maglott, D. R. (2003). NCBI Reference Sequence project: update and current status. Nucleic Acids Res 31, 34-7.  Ramalho-Santos, M., Yoon, S., Matsuzaki, Y., Mulligan, R. C , and Melton, D. A. (2002). "Sternness": transcriptional profiling of embryonic and adult stem cells. Science 298, 597-600.  Rambhatla, L., Chiu, C. P., Kundu, P., Peng, Y., and Carpenter, M. K. (2003). Generation of hepatocyte-like cells from human embryonic stem cells. Cell Transplant 12,111.  Reubinoff, B. E., Pera, M. F., Fong, C. Y., Trounson, A., and Bongso, A. (2000). Embryonic stem cell lines from human blastocysts: somatic differentiation in vitro. Nat Biotechnol 18, 399-404.  Reya, T., Duncan, A. W., Ailles, L., Domen, J., Scherer, D. 
C , Willert, K., Hintz, L., Nusse, R., and Weissman, I. L. (2003). A role for Writ signalling in self-renewal of haematopoietic stem cells. Nature 423,409-14.  Reyes, M., Lund, T., Lenvik, T., Aguiar, D., Koodie, L., and Verfaillie, C. M. (2001). Purification and ex vivo expansion of postnatal human marrow mesodermal progenitor cells. Blood 98, 2615-25.  Rhee, I., Bachman, K. E., Park, B. H., Jair, K. W., Yen, R. W., Schuebel, K. E., Cui, H., Feinberg, A. P., Lengauer, C , Kinzler, K. W., Baylin, S. B., and Vogelstein, B.  210  (2002). DNMT1 and DNMT3b cooperate to silence genes in human cancer cells. Nature 416, 552-6.  Rho, J. Y., Yu, K., Han, J. S., Chae, J. I., Koo, D. B., Yoon, H. S., Moon, S. Y., Lee, K. K., and Han, Y. M. (2005). Transcriptional profiling of the developmentally important signalling pathways in human embryonic stem cells. Hum Reprod.  Richards, M., Tan, S. P., Tan, J. H., Chan, W. K., and Bongso, A. (2004). The transcriptome profile of human embryonic stem cells as defined by SAGE. Stem Cells 22, 51-64.  Rogers, M. B., Hosier, B. A., and Gudas, L. J. (1991). Specific expression of a retinoic acid-regulated, zinc-finger gene, Rex-1, in preimplantation embryos, trophoblast and spermatocytes. Development 113, 815-24.  Saha, S., Sparks, A. B., Rago, C , Akmaev, V., Wang, C. J., Vogelstein, B., Kinzler, K. W., and Velculescu, V. E. (2002). Using the transcriptome to annotate the genome. Nat Biotechnol 20, 508-12.  Sato, N., Sanjuan, I. M., Heke, M., Uchida, M., Naef, F., and Brivanlou, A. H. (2003). Molecular signature of human embryonic stem cells and its comparison with the mouse. Dev Biol 260, 404-13.  Schena, M., Heller, R. A., Theriault, T. P., Konrad, K., Lachenmeier, E., and Davis, R. W. (1998). Microarrays: biotechnology's discovery platform for functional genomics. Trends Biotechnol 16, 301-6.  Scholer, H. R., Ciesiolka, T., and Gruss, P. (1991). A nexus between Oct-4 and E l A: implications for gene regulation in embryonic stem cells. 
Cell 66,291-304.  Schuldiner, M., Yanuka, O., Itskovitz-Eldor, J., Melton, D. A., and Benvenisty, N. (2000). Effects of eight growth factors on the differentiation of cells derived from human embryonic stem cells. Proc Natl Acad Sci USA 97, 11307-12.  Schwartz, R. E., Reyes, M., Koodie, L., Jiang, Y., Blackstad, M., Lund, T., Lenvik, T., Johnson, S., Hu, W. S., and Verfaillie, C. M. (2002). Multipotent adult progenitor cells from bone marrow differentiate into functional hepatocyte-like cells. J Clin Invest 109, 1291-302.  Smiraldo, P. G., Gruver, A. M., Osborn, J. C., and Pittman, D. L. (2005). Extensive chromosomal instability in Rad51d-deficient mouse cells. Cancer Res 65, 208996.  Smith, A. G., Heath, J. K., Donaldson, D. D., Wong, G. G., Moreau, J., Stahl, M., and Rogers, D. (1988). Inhibition of pluripotential embryonic stem cell differentiation by purified polypeptides. Nature 336, 688-90.  Sottile, V., Thomson, A., and McWhir, J. (2003). In vitro osteogenic differentiation of human ES cells. Cloning Stem Cells 5, 149-55.  Sperger, J. M., Chen, X., Draper, J. S., Antosiewicz, J. E., Chon, C. H., Jones, S. B., Brooks, J. D., Andrews, P. W., Brown, P. O., and Thomson, J. A. (2003). Gene  212  expression patterns in human embryonic stem cells and human pluripotent germ cell tumors. Proc Natl Acad Sci USA 100, 13350-5.  Suarez-Farinas, M., Noggle, S., Heke, M., Hemmati-Brivanlou, A., and Magnasco, M. O. (2005). Comparing independent microarray studies: the case of human embryonic stem cells. BMC Genomics 6, 99.  Taipale, J., and Beachy, P. A. (2001). The Hedgehog and Wnt signalling pathways in cancer. Nature 411, 349-54.  Tanaka, T. S., Kunath, T., Kimber, W. L., Jaradat, S. A., Stagg, C. A., Usuda, M., Yokota, T., Niwa, H., Rossant, J., and Ko, M. S. (2002). Gene expression profiling of embryo-derived stem cells reveals candidate genes associated with pluripotency and lineage specificity. Genome Res 12,1921-8.  
Tang, K., Yang, J., Gao, X., Wang, C , Liu, L., Kitani, H., Atsumi, T., and Jing, N. (2002). Wnt-1 promotes neuronal differentiation and inhibits gliogenesis in PI9 cells. Biochem Biophys Res Commun 293, 167-73.  Thomson, J. A., Itskovitz-Eldor, J., Shapiro, S. S., Waknitz, M. A., Swiergiel, J. J., Marshall, V. S., and Jones, J. M. (1998). Embryonic stem cell lines derived from human blastocysts. Science 282, 1145-7.  Tsai, R. Y., and McKay, R. D. (2002). A nucleolar mechanism controlling cell proliferation in stem cells and cancer cells. Genes Dev 16,2991-3003.  213  Vallier, L., Reynolds, D., and Pedersen, R. A. (2004). Nodal inhibits differentiation of human embryonic stem cells along the neuroectodermal default pathway. Dev Biol 275,403-21.  Velculescu, V. E., Madden, S. L., Zhang, L., Lash, A. E., Yu, J., Rago, C , Lai, A., Wang, C. J., Beaudry, G. A., Ciriello, K. M., Cook, B. P., Dufault, M. R., Ferguson, A. T., Gao, Y., He, T. C , Hermeking, H., Hiraldo, S. K., Hwang, P. M., Lopez, M. A., Luderer, H. F., Mathews, B., Petroziello, J. M., Polyak, K., Zawel, L., Kinzler, K. W., and et al. (1999). Analysis of human transcriptomes. Nat Genet 23, 387-8.  Velculescu, V. E., Zhang, L., Vogelstein, B., and Kinzler, K. W. (1995). Serial analysis of gene expression. Science 270, 484-7.  Walsh, J., and Andrews, P. W. (2003). Expression of Wnt and Notch pathway genes in a pluripotent human embryonal carcinoma cell line and embryonic stem cell. Apmis 111, 197-210; discussion 210-1.  Wang, S. H., Tsai, M. S., Chiang, M. F., and Li, H. (2003). A novel NK-type homeobox gene, ENK (early embryo specific NK), preferentially expressed in embryonic stem cells. Gene Expr Patterns 3, 99-103.  Wang, X., Guan, J., Hu, B., Weiss, R. S., Iliakis, G., and Wang, Y. (2004). Involvement of Husl in the chain elongation step of DNA replication after exposure to camptothecin or ionizing radiation. Nucleic Acids Res 32, 767-75.  214  Weinberg, R. A. (1995). 
The retinoblastoma protein and cell cycle control. Cell 81, 32330.  Weiss, R. S., Leder, P., and Vaziri, C. (2003). Critical role for mouse Husl in an S-phase DNA damage cell cycle checkpoint. Mol Cell Biol 23, 791-803.  Wienholds, E., Kloosterman, W. P., Miska, E., Alvarez-Saavedra, E., Berezikov, E., de Bruijn, E., Horvitz, H. R., Kauppinen, S., and Plasterk, R. H. (2005). MicroRNA expression in zebrafish embryonic development. Science 309, 310-1.  Willert, K., Brown, J. D., Danenberg, E., Duncan, A. W., Weissman, I. L., Reya, T., Yates, J. R., 3rd, and Nusse, R. (2003). Wnt proteins are lipid-modified and can act as stem cell growth factors. Nature 423, 448-52.  Williams, R. L., Hilton, D. J., Pease, S., Willson, T. A., Stewart, C. L., Gearing, D. P., Wagner, E. F., Metcalf, D., Nicola, N. A., and Gough, N. M. (1988). Myeloid leukaemia inhibitory factor maintains the developmental potential of embryonic stem cells. Nature 336, 684-7.  Xu, C , Police, S., Rao, N., and Carpenter, M. K. (2002a). Characterization and enrichment of cardiomyocytes derived from human embryonic stem cells. Circ Res 91, 501-8.  Xu, R. H., Chen, X., Li, D. S., Li, R., Addicks, G. C , Glennon, C , Zwaka, T. P., and Thomson, J. A. (2002b). BMP4 initiates human embryonic stem cell differentiation to trophoblast. Nat Biotechnol 20, 1261-4.  215  Ye, S. Q., Zhang, L. Q., Zheng, F., Virgil, D., and Kwiterovich, P. O. (2000). miniSAGE: gene expression profiling using serial analysis of gene expression from 1 microg total RNA. Anal Biochem 287, 144-52.  Yoshida, K., Chambers, I., Nichols, J., Smith, A., Saito, M., Yasukawa, K., Shoyab, M., Taga, T., and Kishimoto, T. (1994). Maintenance of the pluripotential phenotype of embryonic stem cells through direct activation of gpl30 signalling pathways. Mech Dev 45, 163-71.  Zhang, M. Q. (1998). Statistical features of human exons and their flanking regions. Hum Mol Genet 7,919-32.  Zhang, S. C , Wernig, M., Duncan, I. D., Brustle, O., and Thomson, J. 
A. (2001). In vitro differentiation of transplantable neural precursors from human embryonic stem cells. Nat Biotechnol 19, 1129-33.  Zimmerman, E. S., Chen, J., Andersen, J. L., Ardon, O., Dehart, J. L., Blackett, J., Choudhary, S. K., Camerini, D., Nghiem, P., and Planelles, V. (2004). Human immunodeficiency virus type 1 Vpr-mediated G2 arrest requires Radl7 and Husl and induces nuclear BRCA1 and gamma-H2AX focus formation. Mol Cell Biol 24, 9286-94.  Zwaka, T. P., and Thomson, J. A. (2003). Homologous recombination in human embryonic stem cells. Nat Biotechnol 21, 319-21.  216  Submitted publications Hirst, M. et al. LongSAGE Transcriptome analysis of Nine Human Embryonic Stem Cell lines reveals novel transcripts and an over representation of RNA binding proteins. Nature Biotechnology (Submitted). Siddiqui, A. et al. Mouse Atlas of Gene Expression: Large-Scale Digital Gene Expression Profiles from Precisely Defined Developing C57BL/6J mouse tissues and cells. Proceedings of the National Academy of Sciences of the United States ofAmerica (Submitted). Textbooks Alberts, B., D. Bray, A. Johnson, J. Lewis, M. Raff, K. Roberts, and P. Walter. 1998. Essential Cell Biology: An Introduction to the Molecular Biology of the Cell. Union Square West, NY: Garland Publishing Inc.  Moore, K.L., and Persaud, T.V.N. 2003. The Developing Human: Clinically Oriented Embryology. 7 Edition. Saunders.. th  Zar, J.H. 1996. Biostatistical Analysis. 3 Edition. Upper Saddle River, NJ: Apprentice rd  Hall Inc.  217  
