Computational Prediction of Regulatory Element Combinations and Transcription Factor Cooperativity

by

DEBRA LOUISE FULTON

B.Sc., Simon Fraser University, 2003

A THESIS SUBMITTED IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

in

THE FACULTY OF GRADUATE STUDIES (Genetics)

THE UNIVERSITY OF BRITISH COLUMBIA (Vancouver)

December 2009

© Debra Louise Fulton, 2009

Abstract

Cellular identity and function are determined, in part, by the subset of genes transcribed. Gene transcription regulation is directed by a subgroup of proteins called transcription factors (TFs), which can interact directly or indirectly with DNA to promote transcription initiation. In multi-cellular eukaryotes, gene expression often derives from synergistic and/or antagonistic interplay of multiple TFs with coordinate activity in response to physiological, developmental, and environmental stimuli. Sequence-specific interactions of TFs with DNA occur at TF binding sites (TFBS). Such TFBS can be predicted based on previously observed target DNA sequence specificity of a TF. Experimental studies have confirmed that proximally situated TFBS are often associated with synergistic interactions of multiple proteins that lead to cooperative regulation. The identification of clustered TFBS combinations (often called cis-regulatory modules) in a set of co-expressed genes can implicate regulatory roles for homologous groups of TFs that may contribute to co-regulation of a gene cohort. The identification of the specific TFs that interact with TFBS motifs is an important step in deciphering mechanisms of co-regulation.
My thesis research addressed these challenges, firstly, through the design and development of a Combination Site Analysis (CSA) algorithm to identify over-representation of combinations of TFBS in co-expressed genes and, secondly, the assembly of a comprehensive wiki-based catalog of human-mouse TFs (TFCat) using literature curation and homolog prediction approaches. These applications were incorporated within a new promoter sequence analyses procedure for the identification of TFs that may be acting cooperatively to co-regulate expression of myelin-associated genes during myelin production in the CNS. Dysregulation of gene expression is frequently implicated in human pathologies and development of approaches that identify the molecular components of transcriptional regulatory systems is an important step towards the elucidation of molecular mechanisms for the design of therapeutic interventions.

Table of Contents

Abstract
Table of Contents
List of Tables
List of Figures
List of Abbreviations and Acronyms
Acknowledgements
Co-authorship Statement
1. Introduction
1.1. Gene transcriptional regulatory mechanisms
1.1.1. Chromatin remodeling in transcription regulation
1.1.1.1. DNA methylation
1.1.1.2. Histone modification
1.1.1.3. X-inactivation
1.1.2. DNA regulatory elements
1.1.3. Transcription factor proteins
1.1.4. Transcription factor mechanisms and cooperativity control
1.2. Analysis of gene expression
1.2.1. High-throughput gene expression technologies
1.2.1.1. Normalization and quantification of microarray expression data
1.2.1.2. Analysis of gene expression differences
1.3. Experimental techniques for regulatory sequence detection
1.3.1. Identification of TF-DNA binding
1.3.1.1. Identification of TF binding site motifs
1.3.2. Regulatory sequence validation
1.3.2.1. Mouse transgenesis
1.3.2.1.1. Hprt1 mouse transgenesis system
1.4. Transcriptional regulatory region detection through comparative genomics
1.4.1. Multiple sequence alignments
1.4.2. Phylogenetic footprinting
1.5. Computational identification of regulatory mechanisms
1.5.1. Predicting TF binding sites
1.5.2. Computational detection of cis-regulatory modules
1.6. Myelinogenesis in the nervous system
1.6.1. Schwann cells
1.6.2. Oligodendrocytes
1.6.2.1. Olig transcription factors
1.6.3. Myelin Basic Protein
1.7. Thesis overview and chapter summaries
1.8. References
2. Identification of Over-represented Combinations of Transcription Factor Binding Sites in Sets of Co-expressed Genes
2.1. Chapter preamble
2.2. Introduction
2.3. Results
2.3.1. Overview and rationale of oPOSSUM II algorithm
2.3.2. TFBS classification
2.3.3. Validation with reference data sets
2.3.3.1. Yeast CLB2 cluster
2.3.3.2. Three human reference gene sets
2.3.4. Effect of set size on false positive rate
2.3.5. Web interface
2.4. Discussion
2.5. Methods
2.5.1. Background: the oPOSSUM database
2.5.2. TFBS in foreground gene set
2.5.3. Classification of TFBS profiles
2.5.4. Selection of TFBS and enumeration of combinations
2.5.5. Scoring of combinations
2.5.6. Finding significant TFs from over-represented class combinations
2.5.7. Random sampling simulations of foreground genes
2.5.8. Validation
2.6. References
3. oPOSSUM: Integrated Tools for Analysis of Regulatory Motif Over-Representation
3.1. Chapter preamble
3.2. Introduction
3.3. Results
3.3.1. Human single site analysis
3.3.2. Human combination site analysis
3.3.3. Worm single site analysis
3.3.4. Yeast single site analysis
3.4. Discussion
3.5. Methods
3.5.1. Over-representation analysis
3.5.1.1. oPOSSUM single site analysis
3.5.1.2. oPOSSUM combination site analysis
3.5.2. Species-specific databases
3.5.2.1. Human/mouse
3.5.2.2. C. elegans/C. briggsae
3.5.2.3. Yeast
3.5.3. TFBS prediction
3.6. References
4. TFCat: The Curated Catalog of Mouse and Human Transcription Factors
4.1. Chapter preamble
4.2. Introduction
4.3. Results
4.3.1. TF gene candidate selection, the annotation process, and quality assurance
4.3.2. Identification and classification of DNA binding proteins
4.3.3. Generation and assessment of mouse-human TF homology clusters to predict additional putative TFs
4.3.4. Maintenance and access of TFCat annotation data
4.4. Discussion
4.4.1. Catalog characteristics, comparisons, and utility
4.5. Materials and methods
4.5.1. Creation of four independent murine and human TF preliminary candidate data sets
4.5.1.1. Dataset I
4.5.1.2. Dataset II
4.5.1.3. Dataset III
4.5.1.4. Dataset IV
4.5.2. Standardizing TF gene candidate annotation
4.5.3. Selection and annotation of a subset of TF candidates
4.5.4. Randomly sampled quality assessment and auditing of TF annotations
4.5.5. TFC quality assurance comparisons
4.5.6. Human-mouse ortholog assignment
4.5.7. TF DNA-binding structure analysis and classification
4.5.8. Identification of homolog sets for mouse TF genes
4.5.9. Website download access, wiki publication and annotation feedback
4.6. References
5. Brain MiniPromoters by Design: Pleiades Promoter Project
5.1. Chapter preamble
5.2. Introduction
5.3. Results
5.3.1. Novel tools to study and treat the brain
5.3.2. A new score to prioritize suitable genes
5.3.3. MiniPromoter designs incorporate available information
5.3.4. ESC neural differentiation for pre-screening
5.3.5. Novel MiniPromoter expression patterns in the brain
5.3.6. A unique dataset for in silico studies
5.4. Discussion
5.5. Methods
5.5.1. Pleiades Promoter Project pipeline
5.5.2. Pleiades Promoter Project protocols
5.5.2.1.1. Hprt1 targeting vectors and MiniPromoters
5.5.2.2. Knock-in immediately 5′ of the Hprt1 locus
5.5.2.3. PCR analysis of genomic DNA
5.5.2.4. In vitro neural differentiation
5.5.3. Immunohistochemistry and histochemistry
5.6. References
6. Identification and analysis of transcriptional cis-regulatory modules directing oligodendrocytic expression of myelin-linked genes
6.1. Chapter preamble
6.2. Introduction
6.3. Results
6.3.1. Myelin gene associated conserved regions confer reporter activity
6.3.2. Myelin gene co-expression is detected in mouse forebrain and optic nerve expression profiles
6.3.3. Validation of a promoter analysis approach
6.3.4. Promoter analyses of a co-expressed oligodendrocyte gene set highlights potential TF cooperativity
6.3.5. An oligodendrocyte TF network supported by enhancer predicted regulatory elements
6.3.6. Prioritization of TFBS cooperativity predictions via enhancer feature weighting
6.4. Discussion
6.5. Methods
6.5.1. Selection of conserved regions and validation in mice
6.5.2. Isolation of genomic DNA sequences
6.5.3. Generation of reporter constructs
6.5.4. Histochemistry, fluorescence microscopy, and immunocytochemistry
6.5.5. Gene expression profiling analyses
6.5.6. Evaluation of local conservation of validated TFBS
6.5.7. Development of the promoter database and CSA algorithm adaptation
6.5.8. Promoter analyses method validation
6.5.9. Jaspar profile clustering and cluster labeling
6.5.10. Promoter analysis of oligodendrocyte co-expression data
6.5.11. Enhancer feature weighting of CRM predictions
6.5.12. Oligodendrocyte TF network construction and analysis
6.5.13. Evaluation of predicted CRM sequence characteristics
6.5.14. Analysis of overlapping oligodendrocyte CRM predictions
6.6. References
7. Discussion and Conclusions
7.1. Summary
7.2. Gene regulation analyses in humans and mice
7.2.1. Computational TF binding site detection
7.3. TF inventories for gene regulatory analyses
7.4. Detecting differential gene expression for prediction of TF gene co-regulation
7.5. Functional validation of regulatory sequences
7.6. Incorporating high throughput epigenomic data and detailed experimental analyses in models of gene regulation
7.7. References
8. Appendices
8.1. Appendix 1: supplementary for chapter 4
8.2. Appendix 2: supplementary for chapter 5
8.3. Appendix 3: supplementary for chapter 6

List of Tables

Table 1.1. Selected list of databases providing eukaryote transcription factor binding site data
Table 1.2. CRM detection techniques
Table 1.3. Selected computational human/mouse CRM detection tools/methods available before 2005
Table 3.1. oPOSSUM results for human FoxM1-regulated gene cluster
Table 3.2. oPOSSUM results for c-Fos-regulated gene cluster
Table 3.3. oPOSSUM results for skeletal muscle genes identified by Moran et al. and Tomczak et al.
Table 3.4. oPOSSUM results for worm skeletal muscle genes using worm profiles
Table 3.5. oPOSSUM results for the yeast CLB2 gene cluster
Table 4.1. Transcription factor data resources
Table 4.2. TFCat catalog statistics
Table 4.3. TFCat judgment classifications
Table 4.4. TFCat taxonomy classifications
Table 4.5. DNA-binding TF gene classification counts
Table 4.6. Large cluster ranking criteria
Table 6.1. Predicted TF regulatory network for myelin genes

List of Figures

Figure 1.1. Nucleosome core octamer particle
Figure 1.2. Model for histone acetylation/deacetylation
Figure 1.3. Eukaryotic gene regulatory architecture
Figure 1.4. Hprt1 mouse transgenesis system
Figure 1.5. Modeling TF binding sites using position weight matrices
Figure 1.6. Myelinating glial cells: oligodendrocytes and Schwann cells
Figure 1.7. Schwann cell lineage
Figure 1.8. Oligodendrocytes can ensheath multiple axons
Figure 1.9. Oligodendrocyte lineage
Figure 2.1. Overview of the oPOSSUM II analysis algorithm
Figure 2.2. The top five over-represented pair combinations of TFBS classes for muscle reference sets
Figure 2.3. Gene set size and false positive rate
Figure 3.1. Determination of one-to-one orthologs for human and mouse genes.
Figure 3.2. Identification of transcription start regions (TSRs) using a combination of EnsEMBL annotations and CAGE data
Figure 3.3. oPOSSUM Human SSA website screenshots
Figure 5.1. A resource of 240 MiniPromoters for predictable reproducible expression.
Figure 5.2. Resolution score prioritizes genes for MiniPromoter design
Figure 5.3. In vitro neural differentiation for pre-screening MiniPromoter designs
Figure 5.4. Montage of MiniPromoter expression patterns in the adult brain and retina
Figure 5.5. Specific neuronal and glial expression patterns
Figure 5.6. MiniPromoters as tools to study developmental expression patterns
Figure 5.7. A unique dataset for bioinformatics analysis
Figure 6.1. Enhancer selection and validation
Figure 6.2. Histochemical detection of β-galactosidase activity in early postnatal development
Figure 6.3. Histochemical detection of β-galactosidase activity in whole mounts at adult developmental stage
Figure 6.4. Characterization of cell populations expressing the Gjb1, Cldn11, Pou2f1 and Mal constructs
Figure 6.5. Overlap of differentially expressed genes in the two expression profile datasets
Figure 6.6. Myelin gene TFBS regulatory sub-network
Figure 7.1. Systematic integration of biological data and computational analyses to decipher gene regulatory mechanisms

List of Abbreviations and Acronyms

ABS: Annotation Regulatory Binding Site Database
ACC: UBC Animal Care Committee
ARIDs: AT-rich interaction domains
BAN2: N2 backcross of ICR into B6-Alb
bHLH: basic Helix-Loop-Helix
BTD: beta-trefoil domain
CAGE: cap analysis of gene expression
CCAC: Canadian Council on Animal Care
ChIP: chromatin immunoprecipitation
CNS: central nervous system
CRM: cis-regulatory modules
CSA: Combination Site Analysis
DBD: DNA-binding domain
DBDdb: DBD database resource
DDT: DNA binding homeobox and different transcription factors
DHTF: DNA Helix-Turn-factor
DHTM: DNA Helix Turn Modulus
DPE: Downstream Promoter Element
dsDNA: double-stranded DNA
ECB: early cell cycle box
EM: expectation maximization
EMSA: electrophoretic mobility shift assays
ENCODE: Encyclopedia of DNA Elements
EOL-MOLd: early oligodendrocytes-myelinating oligodendrocytes dataset
EOLs: early oligodendrocytes
ES: embryonic stem
ESCs: embryonic stem cell lines
Fox: Forkhead transcription factor
FWER: family wise error rate
GCM: Glial cells missing domain
GFP: Green Fluorescence Protein
GO: Gene Ontology
GOA: Gene Ontology Annotations
HAT: Histone acetyl transferase
HAT: hypoxanthine, aminopterin, thymidine
HDACs: histone deacetylases
HLH: helix-loop-helix
HMG: High Mobility Group
HMM: Hidden Markov Model
HOX: Homeodomain
Hprt1: hypoxanthine phosphoribosyltransferase
hsp: heat shock protein
IBSD: inter-binding site distance
IC: information content
ICM: inner cell mass
ID: Identifier
IEA: Inferred Electronic Annotations
INR: Initiator Recognition
IOLEDd: intersection of oligodendrocyte early development dataset
IRC: International Regulome Consortium
IUPAC: International Union of Pure and Applied Chemistry
KS-test: Kolmogorov-Smirnov test
lacZ: beta-galactosidase
MBP: myelin basic protein
MCA: most conserved alignments
ME: measurement of expression
MGD: Mouse Genome Database
MGI: Mouse Genome Informatics
MiniPs: MiniPromoters
MOLs: mature oligodendrocytes
MPSS: massively parallel signature sequencing
MSA: multiple sequence alignment
NCBI: National Center for Biotechnology Information
NFI-CTF family: Nuclear factor I - CCAAT-binding transcription factor
nGFP: native EGFP fluorescence
NR: nuclear receptor
O-2A: oligodendrocyte-type-2-astrocyte cells
OAMTF: observed approximate mean TFs
OL: oligodendrocytes
OMIM database: Online Mendelian Inheritance in Man database
OPC: oligodendrocyte precursor cells
OPC-EOLd: oligodendrocyte progenitor cells - early oligodendrocytes dataset
OPCs: oligodendrocyte progenitor cells
ORF: open reading frame
P4-P10d: Postnatal 4 vs Postnatal 10 dataset
PBM: protein binding microarray
PCR: Polymerase Chain Reaction
PDB: Protein Data Bank
PFM: position frequency matrix
pMN: motor neuron progenitor domain
Pn: Postnatal (where n is a number)
PPP: Pleiades Promoter Project
Prom: endogenous gene promoter
PWM: position weight matrix
QA: quality assurance
RCA: Inferred from reviewed computational analysis
RefSeq: NCBI reference sequences
Rel: Rel homology domain
RMA: Robust Multi-chip Analysis
RMS: rostral migratory stream
RR: candidate regulatory regions
SAGE: Serial Analysis of Gene Expression
SC: Schwann cells
Shh: Sonic Hedge Hog
shi: the mouse model shiverer
shRNA: small hairpin RNA
Sox: SRY-related HMG-box transcription factor
SP: SwissProt database
SSA: single site analysis
ssDNA: single-stranded DNA
SSM: Secondary-Structure Matching tool
SUMO: small ubiquitin related histone modifiers
TAF: TATA-binding Associated Factor
TBP: TATA Binding Protein
TFBS: transcription factor binding sites
TFC: Transcription Factor Candidate
TFCat: transcription factor catalog
TFe: TFencyclopedia
TH: tyrosine hydroxylase
TSR: transcription start regions
TSS: transcription start site
UCSC: University of California, Santa Cruz
UPTF: union of putative transcription factors
VEB: Vista Enhancer Browser
VTA: ventral tegmental area
xGal: beta galactosidase
YRSA: Yeast Regulatory Sequence Analysis system
ZF: zinc-finger

Acknowledgements

I would like to convey my appreciation and thanks to the many people who have helped me arrive at this milestone. First and foremost, I am especially grateful for the guidance and support provided by my PhD supervisor, Dr. Wyeth Wasserman. I also want to thank my PhD committee members for their expert advice, valuable feedback, and continued support: Drs. Eldon Emberly, Leah Keshet, Alan Peterson, and Elizabeth M. Simpson. Additionally, I am grateful for the supervision and guidance that I received during my initial graduate research rotation training under Dr. Fiona Brinkman, Dr. Steven Jones, and Dr. Wyeth Wasserman at Merck Frosst. I would also like to acknowledge the valuable input that I received early on from my master's phase committee members, Drs. Marco Marra, Fiona Brinkman, and Frederick Pio, and the graduate course research guidance provided by Dr. David Baillie.

I am grateful to have had the privilege of working on collaborative research with an inspiring set of researchers, including: Drs. Fiona Brinkman, Martin Ester, Tim Hughes, Steven Jones, Alan Peterson, Assim Siddiqui, Rob Sladek, Jared Roach, and Wyeth Wasserman. I have appreciated the opportunity to work, learn, interact, and socialize with a number of student trainees, post-doc fellows, and research scientists during my training, including: Jochen Brumm, Stefanie Butland, Elodie Portales-Casamar, Warren Cheung, Eric Denarier, Samar Dib, Nancy Dionne, Joanne Fox, Hana Friedman, Ben Good, Obi Griffith, Karsten Hokamp, Shannan Ho Sui, Shao-Shan (Carol) Huang, Andrew Kwon, Shang-Jung (Jessica) Lee, Yvonne Li, Alison Meynert, Carrie-Lyn Mead, Mehrdad Oveisi, Erin Pleasance, Fiona Roche, Monica Sleumer, Sarav Sundararajan, Dimas Yusuf, and the remarkable group of researchers at Merck Frosst in Montreal.

It’s been a great pleasure to work with and around the Wasserman lab research group members.
I am particularly grateful to Dora Pak for her top-notch organizational abilities and ongoing support and assistance during my time in the Wasserman lab. I also want to acknowledge the valuable systems support provided by Dave Arenillas, Miroslav Hatas, and Jonathan Lim. I would like to acknowledge and thank my salary, training, research, and travel funding sources for their financial support during my PhD training, namely: the Wasserman Laboratory, the CIHR/MSFHR Strategic Training Program in Bioinformatics, the Michael Smith Foundation for Health Research Senior Graduate Award, the Canadian Institutes of Health Research Doctoral Scholarship, the UBC Faculty of Graduate Studies PhD Tuition Fee Awards, and the Multiple Sclerosis Society of Canada endMS Network Travel Award. I greatly appreciate the UBC Genetics Graduate Program administrative support provided by Dr. Hugh Brock and Monica Deutsch and the CIHR/MSFHR Strategic Training Program in Bioinformatics administrative assistance provided by Dr. Steven Jones and Sharon Ruschkowski. I am deeply grateful to all my friends and family for their unwavering support and love during my graduate studies. I would especially like to thank my parents, Michael and Joy, and my sisters, Julia and Stephanie, for their love and encouragement.

Co-authorship Statement

The work described in this thesis was achieved, in part, through collaborative research. A summary of chapter research contributions is provided below.

Chapter 2: I am responsible for the initial design, development, and validation of the Combination Site Analysis (CSA) algorithm. The CSA algorithm and software were incorporated by Shao-Shan Huang in a website implementation, which included integration of a TFBS clustering step prototyped by Paul Perco. Shao-Shan Huang conducted further CSA algorithm validations. Dave Arenillas was responsible for implementation of the human-mouse oPOSSUM database.
Shannan Ho Sui provided the yeast promoter database used in the yeast dataset validation. James Mortimer provided microarray data. Shao-Shan Huang and Wyeth Wasserman wrote the initial draft manuscript and I provided further writing and editorial input. Shao-Shan Huang and I prepared the manuscript for conference submission.

Chapter 3: The human/mouse alternative promoter dataset was developed by Shannan Ho Sui. I redesigned and redeveloped the Combination Site Analysis (CSA) algorithm to statistically evaluate combinations of TFBS in alternative promoters and performed all related testing. Dave Arenillas redesigned the oPOSSUM database and Single Site Analysis (SSA) algorithm to accommodate the alternative promoter data. Shannan Ho Sui contributed the oPOSSUM yeast analysis website application. Andrew Kwon and Shannan Ho Sui worked on the worm-specific resource, which was architected by Andrew Kwon. Shannan Ho Sui and Dave Arenillas worked on the SSA website code and oPOSSUM portal web page. I redeveloped the CSA website and included additional web application functionality and CSA e-mail function enhancements. Shannan Ho Sui wrote the draft manuscript. I contributed writing for the CSA work and edits for the manuscript.

Chapter 4: The collaborative research work described in chapter 4 was initiated by my supervisor, Wyeth Wasserman. I was responsible for leading and managing the project collaboration and work. Initial putative TF datasets were contributed by Jared Roach, Sarav Sundararajan, Gwenael Beard, and me. Sarav Sundararajan merged the datasets and provided input on the wiki gene page design. I designed, implemented, and populated the centralized database and curation website tool. Rob Sladek and I pre-curated the merged putative TF dataset. Jared Roach, Robert Sladek, Sarav Sundararajan, Gwenael Beard, Tim Hughes, Wyeth Wasserman, and I acted as the core group of gene annotators.
I established and implemented the structural classification mapping methodology and performed the analysis of DNA-binding structures to extend the DNA-binding structural classification system. I designed and implemented the TF homology analysis approach, the wiki, and the website download portal. I wrote the manuscript and created the supplemental document. Additional manuscript writing input and edits were provided by Wyeth Wasserman and Rob Sladek.

Chapter 5: The work described in chapter 5 involved both computational analyses and multiple stages of detailed molecular work. The research project was initiated by Elizabeth M. Simpson. Molecular work was provided by research scientists and graduate students affiliated with the following principal investigator laboratories: Elizabeth M. Simpson, Dan Goldowitz, and Robert Holt. Computational analyses were performed by researchers in Wyeth Wasserman’s and Steven Jones’ laboratories. A full author list is provided in the chapter publication citation. I am responsible for the design and implementation of a set of computational analyses that predicted the transcription factors responsible for directing expression of an OLIG1 gene-associated Green Fluorescent Protein (GFP) reporter construct sequence in mice. The computational analyses included identification and a comprehensive evaluation of publicly available oligodendrocyte expression data and an in-depth TFBS feature analysis. I provided the results, tables, and writing for the ‘Regulatory element predictions in OLIG1 enhancer sequences’ section in the supplemental material. The draft manuscript was written by Elodie Portales-Casamar et al. I provided additional writing input and edits for the introduction and results sections of the manuscript.

Chapter 6: Research concepts described in chapter 6 were established by Eric Denarier and myself. I am responsible for the enhancer sequence integration concept and design of the overall computational analyses approach.
The identification of putative enhancer sequences was performed by Eric Denarier. The mouse transgenesis work and related molecular work were conducted by Eric Denarier and research associates in Alan Peterson’s laboratory. I implemented software to establish the promoter analyses database, adapted the Combination Site Analysis (CSA) algorithm, and performed all reference collection validation testing. I performed expression data analyses for all oligodendrocyte microarray datasets. I conducted CSA analyses on oligodendrocyte co-expression datasets. I devised, implemented, and performed the enhancer CRM weighting and enhancer feature weighting approaches. I wrote the manuscript and created the supplemental documents. Manuscript input and edits were provided by Wyeth Wasserman, Alan Peterson, and Eric Denarier.

1. Introduction

The wide range of eukaryotic cell types and tissues that are generated from a single eukaryotic cell (embryo) requires numerous, well-defined tissue-specific gene regulatory systems, which are initiated, maintained, or arrested in response to temporal, spatial, and environmental cues. Each gene regulatory program is a combination of potentially many participating processes: transcription, translation, splicing, post-translational modifications, degradation, diffusion, cell growth, and others. Gene expression levels can be influenced by signals that are differentially initiated in a specific spatial state environment (for example, different cell types may express the same gene at different levels) or a specific temporal state (for example, a gene’s expression level in a given cell type in an early development stage may differ from its expression state in an adult stage in that same cell type).
Dysregulation of gene expression is implicated in a wide variety of diseases, and illumination of the molecular basis of gene regulatory systems, in both a temporal and spatial state context, is an integral step towards understanding the contributory mechanisms in disease phenotypes and identifying possible therapeutic interventions. Since the primary response of a gene’s regulatory program is at the transcription step, much research has focused on the measurement of tissue-specific transcription levels and on deciphering the regulatory mechanisms that induce transcription through both experimental and computational analyses. Gene expression in higher-level eukaryotes frequently involves synergistic and/or antagonistic interplay of multiple transcription factors (TFs) that bind to DNA and/or interact with other TFs. Sequence-specific interactions of TFs with DNA occur at TF binding sites (TFBS). Studies have shown that clustered TFBS instances may lead to compatible interactions and cooperative regulation. Correspondingly, the identification of co-located TFBS motif signatures, often called cis-regulatory modules (CRMs), in a set of co-expressed genes can suggest homologous groups of TFs that contribute to the co-regulation of genes. The identification of the specific TFs that interact with these predicted motifs is an essential step in elucidating transcriptional co-regulatory systems. The overarching objective of my thesis research was to address these important challenges in gene regulatory analyses, firstly, through the design and development of an algorithm that identifies enrichment of TFBS combinations found in the non-coding regions adjacent to co-expressed genes and, secondly, with the assembly of a comprehensive catalog of human-mouse TFs (TFCat) that includes a DNA-binding domain (DBD) structural classification.
Importantly, although these tools may be applied for the detection of gene regulatory mechanisms in a variety of biological systems, they were systematically integrated into a new promoter sequence analysis method to predict the TFs that may be acting cooperatively to co-regulate expression of myelin-associated genes during central nervous system (CNS) myelinogenesis. The myelin sheath is a lipid-rich plasma membrane that wraps around the axons projecting from neural cells to enable proper conduction of impulses throughout the nervous system. The prediction and identification of TF cohorts acting in the myelin production transcriptional regulatory system is of particular importance because myelin malfunction contributes to debilitating human pathologies, such as multiple sclerosis and leukodystrophies, and elucidation of myelin gene regulation could lead to the development of treatments that improve remyelination to attenuate disease progression. The remainder of this introduction will review relevant background information for the thesis and further motivate the thesis objectives.

1.1. Gene transcriptional regulatory mechanisms

Gene expression is, in part, controlled by DNA regulatory elements that (most often) reside on the same chromosome in non-coding regions neighbouring a gene’s transcription start site (TSS). One or more TF proteins can act directly on or interact indirectly (through protein-protein interactions) with the DNA to promote transcription. Regulatory DNA elements can be located proximal to a basal promoter TSS or may exert their influence over longer distances away from a TSS (often referred to as enhancers) and/or operate to regulate multiple adjacent genes (termed locus control regions). Because protein access to the DNA sequence is required for all of these modes of action, mechanisms affecting the chromatin architecture can greatly impact gene transcription.
Brief reviews of DNA elements, TF proteins, and chromatin architecture regulatory mechanisms follow.

1.1.1. Chromatin remodeling in transcription regulation

Eukaryotic DNA is arranged into chromatin via wrapping of approximately 200 base pairs of DNA around histone octamers to form nucleosomes. Roughly 147 DNA base pairs are coiled around a histone octamer made up of two each of four core histones: H2A, H2B, H3, and H4, which establish the nucleosome core structure, and the remainder of the DNA sequence is involved in linking adjacent nucleosomes (Figure 1.1 and see review in [1]). Lysine-rich terminal tails are located at the N-termini of the four core histones and can extend beyond the surface of the nucleosome. These amino-tails are subject to specific modifications, which can precipitate chromatin structure accessibility state changes, causing the chromatin structure to become more open or closed (Figure 1.2). In eukaryotes an additional histone, H1, found at half the concentration of other histones, is bound to nucleosomes near the DNA-histone octamer coiling entry and exit point, sealing the two DNA turns (Figure 1.1) (for a review see [2]). This first stage of nucleosome packing, resembling beads on a string, can then be folded into a more compact structure, known as a solenoid form. The solenoid state is associated with DNA that is not transcriptionally active. A eukaryotic cell’s ability to maintain its differentiated state long-term is enabled by accessible chromatin structure for those genes that are actively transcribed [3]. This open chromatin architecture gives trans-activating factors access to DNA target sequences. Chromatin accessibility is influenced by the types of histones, referred to as histone variants, that are included in a nucleosome [4] and the positional placement of nucleosomes in DNA.
Nucleosome positioning can influence transcription by inhibiting or enabling access to regulatory DNA elements [5] and nucleosome arrangements can be guided by the positional placement of DNA-bound proteins [6, 7]. In addition to structural influences, there are biochemical changes that alter chromatin structure to impact gene transcription regulation: 1) DNA methylation; 2) histone modifications; and 3) X-chromosome inactivation, briefly reviewed below.

1.1.1.1. DNA methylation

DNA methylation involves the addition of a chemical modification, a methyl group, to one or more DNA bases; the methyl group can be added and removed without directly affecting the underlying DNA sequence. The most prominent DNA methylation modification in eukaryotic DNA is addition of 5-methyl groups to cytosines [8] with a preference for those located in CG dinucleotide sequences, which are frequently referred to as CpG sites. Methylation patterns can be tissue-specific, with methylated genes found in cells in which they are inactive and, conversely, active genes left unmethylated in cells in which they are transcribed. This pattern is likely due, in part, to interference of transcription factor DNA-interactions by methyl groups [9]. Clusters of CG sequences, known as CpG islands, are often found nearby or overlapping with gene promoters and first exons. With a few exceptions, CpG islands are present in an unmethylated state regardless of their associated gene’s transcription level (for a recent review see [10]) and are found at the 5’ end of constitutively transcribed genes, known as housekeeping¹ genes. Studies suggest that proteins that bind methylated DNA induce a more “closed” chromatin structure, which could result from deacetylation of histones (for review see [11]), discussed below.

1.1.1.2. Histone modification

Histones can vary in structure depending on chemical modifications.
Acetylation is one such modification that covalently attaches an acetyl group to amino acids (for example, lysines), which reduces the positive charge of the histones (primarily occurring on H3 and H4) and opens the chromatin structure (Figure 1.2). Histone acetyl transferase (HAT) enzymes catalyze this process, promoting increased transcription [12]. Conversely, deacetylation of histones via histone deacetylases (HDACs) can inhibit transcription by producing a more condensed chromatin architecture (Figure 1.2). Histone acetylation patterns have been successfully analyzed to predict enhancer signatures [13, 14]. Methylation modifications can either activate gene transcription (for example, methylation of lysine-4 in H3, referred to as H3K4me) [13] or inhibit transcription (for example, tri-methylation of lysine-9 and/or lysine-27 in H3, referred to as H3K9me3 and H3K27me3 respectively [15, 16]). Histone modification signatures have been used to successfully predict and validate functional human enhancer and promoter regions [16, 17]. These recent genome-wide studies continue to support the long-standing hypothesis of a histone tail domain-encoded ‘language’, which regulates DNA-chromatin interactions that establish the chromatin accessibility state [18]. Although they are not addressed further in this thesis, other modifications, such as attachment of ubiquitin and small ubiquitin related histone modifiers (SUMO) [19] and phosphorylation of histone H3 (for review see [20]), influence gene regulatory activity as well.

¹ Housekeeping genes are genes that are typically constitutively active in all cells because they are involved in cell maintenance functions.

1.1.1.3. X-inactivation

The process of X-chromosome inactivation compensates for the fact that mammalian females have two X chromosomes whereas males have one. This process occurs early in embryo development and the selection of paternal or maternal X chromosomes for inactivation occurs randomly.
The inactive chromosome is packed into a condensed structure known as a Barr body. The inactivation process involves both acetylation and methylation histone modifications [21]. The XIST gene is responsible for silencing genes on the same (cis) X chromosome from which it is transcribed. The hypoxanthine phosphoribosyltransferase (HPRT) locus, described later, is located on the X-chromosome.

1.1.2. DNA regulatory elements

In most cases in eukaryotes, RNA transcripts encode a single gene, in contrast to prokaryote systems where functionally linked genes are often transcribed in one multi-gene RNA molecule (e.g. the Lac operon). Consequently, eukaryote transcriptional regulatory systems must coordinate expression of the sets of genes required for a common process or signaling pathway. This regulatory scheme supports increased diversity and plasticity in the combinations of genes that can be regulated for any given pathway. Metazoan genes are regulated by a structured architecture of DNA regulatory elements (Figure 1.3). The core promoter region is approximately 60 bp surrounding the transcription start site (TSS) and houses DNA elements that interact with the basal machinery, which includes: TFIID, TFIIA, TFIIB, RNA Polymerase II, TFIIF, TFIIE, and TFIIH (for review see [22]). Combinations of regulatory elements surrounding the TSS, including the TATA, Initiator Recognition (INR), and Downstream Promoter Element (DPE) motifs, serve not only to engage and position the basal machinery, but also to provide specific selectivity for interactions with regulatory regions and TFs. A gene may be transcribed from more than one promoter, which enables greater transcriptional plasticity between different tissues and developmental timeframes, and under varying environmental conditions [23].
Notably, a landmark cap analysis of gene expression² (CAGE) study found at least 58% of protein-coding transcriptional units have two or more alternative promoters [24]. Identification of different start sites can assist with detection of proximal promoter regulatory elements that direct specific transcriptional units. Context-specific utilization of alternative promoters further complicates the identification of promoter-enhancer interactions (described below).

² In CAGE analysis, short ~20 nucleotide sequence tags that begin at the 5′ end of full-length mRNAs are sequenced to identify transcription start sites.

Eukaryotic genes typically possess regulatory regions (enhancers) that can be located distal to the TSS, situated upstream as well as downstream of the gene and/or in introns, and generally contain multiple TFBS (Figure 1.3). These regions direct tissue-specific regulatory control of gene transcription through chromatin remodeling and/or TF cooperativity interactions, which enable [25], disable [26], or insulate [27] promoter activity. Importantly, gene transcription regulation may be an integration of multiple acting enhancers, each of which exerts specific temporal and/or spatial control [28].

1.1.3. Transcription factor proteins

TFs are proteins that direct transcription of one or more genes. These proteins can either directly bind to target DNA regulatory elements [29, 30] or act to influence DNA-binding of other TFs through protein-protein interactions [31]. TF proteins are composed of one or more modular domains that facilitate functional capacities, such as: DNA-binding domains to enable interaction with DNA motifs, cooperativity domains to facilitate interactions with other TFs, and activation domains that influence interactions with core promoter-associated proteins and/or coactivators [32]. This modular architecture enables combinatorial transcriptional mechanisms, which can specify a unique activity profile.
For example, TF proteins may act alone or in combination with accessory TFs (i.e. co-activators) to produce different regulatory effects under unique conditions [33]. TFs bind DNA through both sequence-specific and non-sequence-specific interactions. Nonspecific interactions may occur with double-stranded (dsDNA) and single-stranded DNA (ssDNA) through electrostatic interactions involving positively charged protein side chains and negative DNA backbone phosphate groups. Sequence-specific protein-dsDNA binding is facilitated by hydrogen-bond donor and acceptor sites made available in the minor and major grooves of dsDNA which, when compatible, form complementary interactions with hydrogen-bonding acceptor and donor sites on a TF protein DNA binding domain (DBD) surface. Although TF-DNA binding affinity must be strong enough to allow the TF to remain on the DNA for a functionally sufficient period of time, the sequences satisfying this requirement are nevertheless degenerate [34]. Moreover, the presence of bound TFs in any given system may be modulated by specific biological parameters, such as TF protein concentration [35]. The sequence-specific tethering of a TF to DNA occurs at one or more specialized protein domain interfaces. The growing collection of solved protein-DNA binding structures [36] highlights structurally homologous classes with distinct DNA-binding mechanisms (for review see [37]). Many homologous DBD structures share similar protein sequences, which has enabled the development of models that predict DNA-binding domains [38-41] and, via inference, transcriptional roles for uncharacterized proteins [42].

1.1.4. Transcription factor mechanisms and cooperativity control

TFs can possess activation domains that modulate the initiation of transcription through indirect and/or direct interactions with targets in the basal transcription complex [43, 44].
TFs can act to repress transcription by a variety of mechanisms that include: 1) interference-binding: DNA-binding which interferes with TF binding or changes chromatin accessibility; 2) complex formation: a TF repressor binds a TF activator so that it cannot bind DNA; 3) quenching: proximal binding of a TF repressor alongside a TF activator to extinguish the activation effect; and 4) direct repression: a TF repressor that has a direct negative effect on transcription [45]. The functional complexity of TF activity is increased by DNA accessibility and interactions with multiple cis-regulatory elements and trans-acting proteins. Multiple TFs can synergistically bind DNA that is associated with nucleosomes [46-48] or engage nucleosome-free DNA [49]. Physical cooperative interactions between DNA-bound TF proteins can precipitate synergistic transcriptional activation. For example, Olig1 - Sox10 interactions can co-activate oligodendrocyte³ transcription [50]; similarly, MyoD (Myf) - Tcfe2a (E12) physical associations contribute to certain muscle gene regulation [51]. TF cooperativity can also occur between co-localized DNA-bound TFs that do not physically interact. For instance, DNA-bound TFs can produce DNA conformation changes that facilitate the binding of additional TFs [52] and multiple TFs may individually interact with the core promoter to exert a cumulative activation effect [53].

³ Oligodendrocytes are responsible for myelinating the axons of neurons in the CNS. See section 1.6 for more information.

1.2. Analysis of gene expression

The analysis of gene transcription is multi-faceted, as there are many stages in the production of a gene’s transcript that may be studied: transcription initiation, elongation, termination, and further stages of transcript processing. Gene transcription is an intermediate stage in the process of protein production from coding sequences. However, unlike some post-transcriptional processing mechanisms, the transcription of RNA is an obligatory prerequisite for protein production. Due to this requirement and the relative ease of high-throughput experimental procedures for RNA measurements, mRNA expression profiles are often utilized as a surrogate indicator of protein levels. However, recent comparisons of mRNA expression levels with protein detection assays, which allow for quantification of protein levels, suggest that the correlation between mRNA and protein levels may only be moderate at best [54]. Nevertheless, increased production of mRNA is an indication of a modulated regulatory process.

1.2.1. High-throughput gene expression technologies

Understanding when and where a gene is expressed is often a preliminary step in gene regulatory analyses and a number of technologies have emerged to identify gene expression patterns and profiles. Microarray platforms, such as spotted cDNA microarrays [55] and oligonucleotide arrays [56, 57], are commonly used to simultaneously measure the expression of a set of genes. Serial Analysis of Gene Expression⁴ (SAGE) [58] and modern transcriptome analysis via high-throughput sequencing of RNA samples (e.g. RNA-seq, see review in [59]) offer advantages over array-based procedures, in that novel transcripts can be observed and detection is not dependent upon hybridization conditions. As expression data from sequencing-based gene expression technologies were not incorporated in this thesis, the introductory focus will be placed upon array-based methods.

⁴ SAGE: in this technique, unique short sequence segments (tags) within transcripts are identified, linked, and sequenced to enumerate the number of times a tag (the proxy for a transcript) is observed.

Microarray-based gene expression measurements are often compared between the same tissues in different states, such as diseased versus healthy tissue or two developmental stages.
For example, numerous studies have examined the changes in gene expression between cancerous tissues and normal tissues (of common cellular classes) to identify cancer biomarkers [60]. Similarly, gene expression can be compared between different tissues to identify unique tissue-specific [61] and developmental-stage-specific [62] gene expression profiles. Microarray technology for parallel measurement of RNA by hybridization is well established. Microarrays are constructed by affixing target probes (cDNA or synthesized oligonucleotides) to a solid surface (a chip) within a matrix architecture using a variety of materials and assembly methods (for a technology review see [63]). Nucleic acids labeled with fluorochromes are hybridized onto the array-bound probes. When using cDNA arrays, typically two samples labeled with two different fluorescent dyes are applied to one array; when excited by lasers, the dyes emit distinct wavelengths that are interpreted and quantified by scanners. This intensity measurement provides a relative quantification of gene expression. In contrast, oligonucleotide array technology permits measurement of absolute gene expression intensity, through fluorochrome excitation and laser scanner interpretation, and, generally, only one sample is applied to each array. Oligonucleotide array technology has become the favored platform because of its inherent flexibility in probe design and the ability to compare gene expression intensities across multiple samples/chips. Gene expression studies described in the following thesis chapters were captured using oligonucleotide (Affymetrix) single-channel arrays and, therefore, subsequent discussion will be directed towards this type of microarray platform.

1.2.1.1. Normalization and quantification of microarray expression data

Microarray-measured expression values can be influenced by non-biological (technical) variation and biological variation.
The detection of differential gene expression relies on differences being consistent across more than one biological sample. Since a number of steps are required for sample preparation and execution of a microarray experiment, technical errors can be introduced along the way. Assessment of this variation requires the incorporation of more than one biological sample and replication of the technical protocol using additional microarray chips. However, the number of replicates included in each microarray experiment may be limited by the number of biological samples available and the costs associated with technical replication. After the probe array images have been captured by scanners and software, which analyze hybridization data for each probe, data preprocessing is performed to remove sources of technical variation produced by biases in dye integration, sample preparation, hybridization, and image processing. Background correction is required to remove unrelated extraneous hybridization effects. The Affymetrix oligonucleotide arrays contain both perfect match and mismatched probe pair sets targeting specific gene/mRNA sequences and probes that enable application of positive and negative controls. The challenge is to remove the non-specific background signal and summarize the probes to obtain a measurement of expression (ME) for a gene transcript. Normalization is performed during this analysis to account for scale differences in the hybridization values across multiple chips (technical replicates). One popular algorithm, which was utilized in studies described later in this thesis, is called Robust Multi-chip Analysis (RMA) [64, 65].
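As a rough illustration of how scale differences across chips can be removed, the following Python sketch applies quantile normalization, the normalization approach used within RMA, to a small matrix of probe intensities. The function name `quantile_normalize`, the toy intensity values, and the simple tie handling are assumptions made for illustration only; this is not the RMA implementation used in this thesis.

```python
def quantile_normalize(chips):
    """Quantile-normalize a list of chips (each a list of probe intensities).

    Illustrative assumptions: every chip has the same number of probes, and
    ties are broken by original probe order (production tools handle ties
    more carefully).
    """
    n_probes = len(chips[0])
    if any(len(chip) != n_probes for chip in chips):
        raise ValueError("all chips must have the same number of probes")

    # Reference distribution: mean intensity at each rank across all chips.
    sorted_chips = [sorted(chip) for chip in chips]
    reference = [sum(values) / len(chips) for values in zip(*sorted_chips)]

    normalized = []
    for chip in chips:
        # Probe indices ordered from lowest to highest intensity on this chip.
        order = sorted(range(n_probes), key=lambda i: chip[i])
        out = [0.0] * n_probes
        for rank, probe_index in enumerate(order):
            # Each probe receives the reference value for its within-chip rank.
            out[probe_index] = reference[rank]
        normalized.append(out)
    return normalized


# Hypothetical intensities: 3 chips x 4 probes; the third chip is brighter overall.
chips = [
    [5.0, 2.0, 3.0, 4.0],
    [4.0, 1.0, 4.5, 2.0],
    [9.0, 6.0, 12.0, 8.0],
]
normalized = quantile_normalize(chips)
for chip in normalized:
    print([round(value, 2) for value in chip])
```

After this step every chip shares the same set of intensity values while each probe keeps its within-chip rank. The sketch isolates the normalization idea only and does not reproduce the background correction or probe set summarization performed by RMA.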
The RMA algorithm conducts a three-step analysis that includes: 1) background correction using a model that ignores the mismatched probe values; 2) quantile normalization, which makes the probe intensity distributions the same across chips; and 3) probe set summarization to derive transcript-level gene expression measurements. Several other normalization methods are reviewed in [66].

1.2.1.2. Analysis of gene expression differences

Gene expression profiling is often applied to discover the gene expression differences between two or more classes of samples, where a class is defined as a categorical variable such as developmental time point or tissue type. Statistical comparisons are performed using tests, such as t- and F-tests, to determine the significance of the expression difference between the same probe sets on arrays. If a stringent p-value cut-off of 0.001 is utilized, the expected number of false positives is limited to about 10 in 10,000 genes tested (0.001 × 10,000 = 10). However, if there are only a small number of samples for each class, the computed within-class variance of each gene may be imprecise [67]. Modified versions of these tests, for example random variance t-tests [68], allow the variances of genes within a class to differ but assume that each variance is drawn from a single distribution that is shared by all genes in a class. Given the high number of comparisons that must be performed on microarray probe sets, adjustments must be incorporated to account for multiple statistical tests. The type I error rate is the false positive rate. There are two classes of corrections that can be applied: those that control the family wise error rate (FWER), the probability of at least one type I error, and those that control the false discovery rate (FDR), the expected proportion of type I errors within the rejected tests. The Bonferroni correction is a conservative approach that
The Bonferroni correction establishes a new significance level by dividing the desired statistical significance value by the number of performed tests. An alternative method proposed by Benjamini and Hochberg [69] controls for FDR. The Benjamini and Hochberg FDR is easily computed at each row i of ascending ordered probe set p-values after the statistical tests have been performed as: row i p-value multiplied by the total number of probe sets tested, divided by the row number i. The computed FDR for row i is an estimate of the proportion of false positive probe set expression differences with p-values less than or equal to the row i p-value. Additional commonly applied statistical correction methods are reviewed in [66].

1.3. Experimental techniques for regulatory sequence detection

Deciphering the transcriptional regulatory mechanisms for the ~30,000 human genes is central to understanding the assembly of complex biological systems that are specified by the information coded in DNA. Recent large-scale efforts, such as the ENCyclopedia Of DNA Elements (ENCODE) project [70], have begun to illuminate the complex organization of functional features in the human genome. Smaller scale studies have devoted significant resources to perform detailed gene transcription regulation investigations [71, 72], which have offered important insights into the diversity, complexity, and capacity of regulatory mechanisms. Experimental approaches have been developed for the study of protein-DNA interactions and validation of cis-regulatory sequences. A comprehensive overview of experimental techniques for gene regulatory analyses is available in a recently published text [73]. Specific methods relevant to the research in this thesis are introduced below.

1.3.1. Identification of TF-DNA binding

Experimental detection of protein-bound DNA highlights TF-DNA interactions that may influence transcriptional activity.
Several experimental methods start with labeled DNA fragments, which are incubated with a cognate double-stranded DNA (dsDNA)-binding protein. The electrophoretic mobility shift assay (EMSA), commonly referred to as a gel shift, allows separation of protein-DNA complexes from unbound sequence. Protein bound to the DNA probe retards its migration through the polyacrylamide gel relative to unbound DNA [74]. The DNA and/or protein may be recovered from the gel and subjected to further investigation. The EMSA procedure can be repeated either in the presence of unlabelled dsDNA of various sequences to reveal the binding specificity of the protein or, alternatively, in the presence of an antibody for the identification of the protein engaging the DNA (most commonly through further retardation of the protein-DNA complex, referred to as a “supershift”). An alternative technique for the study of protein-DNA interactions is called DNase I footprinting. A mixture of proteins and 5’ end-labeled dsDNA is subjected to a partial digestion by DNase I, an enzyme that preferentially cleaves naked DNA. Protein-protected DNA sequences are identified via gel mobility separation to establish the exact location of protected DNA positions from labeled ends. High-throughput methods using chromatin immunoprecipitation (ChIP) assays detect in vivo DNA-TF interactions using one of two detection procedures: 1) ChIP-chip [75] and 2) ChIP-seq [76], which differ in the method used to identify the protein-bound sequence. Both techniques involve cross-linking proteins to chromatin, shearing of DNA to isolate protein-bound sequence, and recovery of protein-DNA complexes using antibodies. After removal of the protein-DNA cross-linkages, the ChIP-chip method incorporates microarray analysis, while the ChIP-seq method subjects the pool of recovered DNA to high-throughput sequencing.
An inherent limitation of the ChIP-chip method is the number of array probes that can be applied on the chip for sequence determination. In ChIP-seq, short sequence reads are mapped to the reference genome and regions with high read densities are considered binding site locations [77]. ChIP-chip assays return DNA fragments of lengths between 200 and 1000 bp (resolution relies on the size of the chromatin fragment and probes on an array). However, deep sequencing in ChIP-seq studies can be more specific in defining locations of bound protein [77].

1.3.1.1. Identification of TF binding site motifs

The interaction of TFs with their cognate DNA target sequences is one of the key steps in transcription initiation. It is well understood that TFs bind DNA with a level of degeneracy [35, 78]. Therefore, accurate profiles of the binding properties of sequence-specific TFs can be a key component of successful analyses. Until recently, the most common experimental technique used to profile high-affinity TF binding sites was the SELEX assay (systematic evolution of ligands by exponential enrichment) [79, 80]. In brief, oligonucleotides of random sequence are incubated with a TF of interest, protein-DNA complexes are purified, and the DNA is amplified by PCR. Starting with the DNA recovered from the previous step, this process is repeated several times to reveal high-affinity target sequences. A ground-breaking study using protein binding microarray (PBM) technology (containing all 10 bp sequences) [81] identified a full range of affinity binding sites for 104 TFs [34]. Remarkably, secondary DNA binding preferences were identified for half of the TFs. Position interdependence⁵ was identified in binding sites of ~20% of the TFs studied. These results exemplify the complexity of TF-DNA sequence interpretation.
Further investigation is warranted to determine whether distinct TF sequence preference arrangements provide selective mechanisms for differential regulatory induction effects.

1.3.2. Regulatory sequence validation

A putative regulatory region can be tested for its ability to direct transcription of reporter genes. A reporter gene encodes a protein that, when expressed, possesses properties which enable its presence to be uniquely measured (for review see [82]). For example, when the green fluorescent protein (GFP, from jellyfish) is excited with blue light, it will fluoresce green. Similarly, the β-galactosidase protein (encoded by the bacterial gene lacZ) cleaves the colorless substrate X-gal into galactose and an insoluble blue product that can be visually detected and quantified. To test a regulatory region’s functional properties, the sequence is inserted into a vector carrying a reporter gene and an endogenous or exogenous promoter. The vector may be transfected in vitro into cells or tested in vivo using transgenesis techniques. While in vitro studies can provide environments that are comparable to in vivo systems, some cell line systems may lack key molecular components necessary for transcription regulation. Transgenesis studies can identify tissue-targeted specificity and may provide information about temporal-specific expression. A brief discussion regarding mouse transgenesis follows.

⁵ Position interdependence means that nucleotides co-occurring at specific positions of a binding site interact non-independently with a transcription factor protein.

1.3.2.1. Mouse transgenesis

Mouse transgenesis is a powerful experimental method that can be used to investigate the regulatory expression capacity of putative enhancer sequences in vivo. Two methods can be used to produce transgenic mice: 1) DNA injection into the male pronucleus and 2) embryonic stem (ES) cell methods.
In method (1), fertilized eggs are harvested before the sperm head has become a pronucleus, the vector is injected into the male pronucleus, and the pronuclei are allowed to fuse to form the diploid zygote nucleus before implantation of the embryos into a pseudopregnant female mouse. Method (2) involves the transformation of the vector into ES cells, selection of the vector-bearing cells, injection of ES cells into mouse blastocoels or under the zona pellucida of eight-cell uncompacted embryos, and implantation of embryos in a pseudopregnant female mouse as above. In both methods, mice bearing the transgene may display distinctive coat color to facilitate confirmation of transgene integration. Recent technological innovations allow for directed insertion of a single transgene copy into a targeted locus [83, 84]. The hypoxanthine phosphoribosyltransferase (Hprt1) locus is one such targeted destination and is used in the studies described in this thesis; it is described in some detail in the following section.

1.3.2.1.1. Hprt1 mouse transgenesis system

Until recently, it has been difficult to compare and interpret gene regulatory data obtained in mouse studies due to: 1) the variability in the number of transgenes integrated and 2) the possible chromatin influences at the integration sites. To overcome these sources of variability, a controlled transgenesis strategy, which enables single-copy integration at the Hprt1 locus (Figure 1.4), was developed [83]. This method enables the insertion of single-copy reporter constructs in a predetermined site at the Hprt1 locus located on the X-chromosome. In brief, a vector construct includes Hprt1 sequences for homologous recombination, a portion of the Hprt1 gene that is absent in a mouse ES-cell line, a reporter gene, and a basal promoter along with candidate regulatory sequences for analysis.
Destination constructs bearing the Hprt1 targeting cassette are transfected into ES cells carrying a deletion spanning the promoter and first two exons of the Hprt1 gene. The Hprt1 gene, restored by homologous recombination of the vector, confers resistance to hypoxanthine-aminopterin-thymine (HAT) selection. The HAT-resistant ES cells are injected into blastocysts or aggregated with eight-cell embryos, and the blastocysts or embryos are transferred to pseudopregnant female mice. Transgenic mice are confirmed through coat-color selection and sequencing. Typically mice are back-crossed several times to create a homogeneous genetic background, as the most effective ES cells are hybrids [85].

1.4. Transcriptional regulatory region detection through comparative genomics

TF regulatory activity is specified by information coded in DNA sequence. As such, the identification of regulatory sequence is an integral step in decoding gene expression mechanisms. The vast amount of non-coding sequence in metazoan genomes makes the identification of functional sequence challenging. The availability of multiple genome sequences has provided important context for DNA-sequence analysis methods through comparative sequence analysis and phylogenetic footprinting.

1.4.1. Multiple sequence alignments

The sequencing of multiple species genomes has motivated the development of algorithms that align sequences to depict evolutionary relationships. Several approaches have been designed specifically for DNA sequence studies. Local alignments focus on determining sequence conservation in short segments, while global alignments attempt to align the full length of sequence. At present, the most commonly applied methods adopt a hybrid approach, referred to as “glocal” alignment, that initially identifies local alignments, which are then concatenated to form longer alignments.
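The distinction between global and local alignment can be illustrated with a small dynamic programming sketch in Python. The function below is a hypothetical, score-only illustration in the style of the classical Needleman-Wunsch (global) and Smith-Waterman (local) recurrences; the scoring values (match +1, mismatch -1, linear gap -1) are arbitrary assumptions, and the sketch is not intended to represent the glocal or genome-scale aligners cited in this section.

```python
def alignment_score(a, b, local=False, match=1, mismatch=-1, gap=-1):
    """Best pairwise alignment score of sequences a and b.

    local=False scores a global alignment (full length of both sequences);
    local=True scores the best-scoring pair of sub-segments.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # Global alignment: leading gaps are penalized.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if local:
                # Local alignment: a segment may restart anywhere at score 0.
                cell = max(cell, 0)
                best = max(best, cell)
            score[i][j] = cell
    return best if local else score[-1][-1]


# Hypothetical sequences sharing one short conserved segment (GATCAG).
seq_a = "TTTTGATCAG"
seq_b = "GATCAGCCCC"
print(alignment_score(seq_a, seq_b))              # global score
print(alignment_score(seq_a, seq_b, local=True))  # local score
```

In this example the local score reflects only the shared six-base segment, whereas the global score is reduced by the unrelated flanking sequence, which is why local methods are better suited to finding short conserved segments in otherwise divergent non-coding sequence.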
Current multiple sequence alignment (MSA) algorithms use a progressive alignment approach, which relies on phylogenetic trees, in which the two most closely related species sequences are aligned first and additional pairwise alignments are performed until all sequences are incorporated [86-89]. Similarly, the popular MSA datasets [90] provided by the University of California Santa Cruz (UCSC) resource [91] are established through best-in-genome pairwise alignments across N species, with progressive alignments guided by a phylogenetic tree topology [89].

1.4.2. Phylogenetic footprinting

Comparative genomics is widely used to identify regulatory region candidates in non-coding sequence, under the assumption that sequences conserved across multiple species are functionally important. This approach, known as phylogenetic footprinting, relies on non-coding multi-species sequence alignments to identify conserved putative regulatory regions and has been applied with remarkable success. Conserved regions tested using in vivo mouse transgenesis reporter assays (described above) have demonstrated the capacity to direct tissue-targeted gene expression [92-95]. Furthermore, there is evidence that evolutionarily conserved regions contain functional regulatory motifs [96-98]. However, recent studies have highlighted cases of active regulatory elements that are not conserved [99, 100]. Notably, a recent study found that 41-89% of protein-bound locations identified in hepatocyte ChIP studies are not conserved between human and mouse [101], with the caveat that DNA binding and functional impact on gene regulation are not necessarily equivalent. It is important to recognize that the use of conservation as a filter to reduce false positive predictions in regulatory region detection will constrain sensitivity.

1.5. Computational identification of regulatory mechanisms

Exhaustive experimental validation of non-coding sequences for regulatory function is infeasible.
Computational analysis approaches have been developed to identify putative TFBS and predict regulatory mechanisms for experimental investigation. Numerous algorithmic approaches have been designed to achieve such objectives. A review of regulatory element detection methods and algorithms relevant to the scope of this thesis is presented below.

1.5.1. Predicting TF binding sites

DNA-binding TF proteins interact with short degenerate DNA motifs, typically 5-15 bp in length. A set of high-affinity TF binding sites (TFBS) can be identified through experimental analyses, as described earlier. A critical component of computational analysis of regulatory mechanisms is the prediction of TFBS within a sequence. TF binding site prediction approaches fall into two categories: 1) motif matching or 2) motif discovery. Motifs may be computationally detected using knowledge of TF sequence binding affinities, which have been experimentally identified (as described previously), to identify sequences that ‘match’ the TFBS characterization. A set of TFBS may be aligned and depicted as a degenerate consensus sequence. For example, the TF binding sequences depicted in Figure 1.5: GATCAG, GATCAT, GATCCA, GACTGT may be summarized as the consensus sequence: GAYYBD, where Y, B, and D are International Union of Pure and Applied Chemistry⁶ (IUPAC) ambiguity codes that represent one or more nucleotides (Y=[C or T]; B=[C or G or T]; D=[A or G or T]). While this depiction of TF binding sequence preferences is useful, the consensus sequence format does not incorporate the likelihood of observing nucleotides in a given motif position. More commonly, a position weight matrix (PWM) is used, which reflects the probability of encountering a particular nucleotide at each position. Importantly, it has been shown that the PWM approach provides an estimate for binding energy contributions of base pairs (see review in [102]).
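Consensus-based motif matching can be sketched in a few lines of Python by expanding each IUPAC ambiguity code into a character class and scanning a sequence with a regular expression. The function names and the scanned sequence below are hypothetical; the sketch scans the forward strand only, and it uses the consensus string GAYYBD quoted in the text purely as an example pattern.

```python
import re

# Standard IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def consensus_to_pattern(consensus):
    """Translate an IUPAC consensus string into a regular expression string."""
    return "".join("[" + IUPAC[symbol] + "]" for symbol in consensus.upper())


def find_matches(consensus, sequence):
    """Return (start, site) for every forward-strand match, overlaps included."""
    # A lookahead is used so that overlapping sites are all reported.
    pattern = re.compile("(?=(" + consensus_to_pattern(consensus) + "))")
    return [(m.start(), m.group(1)) for m in pattern.finditer(sequence.upper())]


# Hypothetical sequence scanned with the consensus quoted in the text.
print(find_matches("GAYYBD", "TTGACTGTAAGATTCA"))
# → [(2, 'GACTGT'), (10, 'GATTCA')]
```

Every site that satisfies the ambiguity codes is reported with equal standing, which illustrates the limitation noted above: the consensus format carries no information about how likely each nucleotide is at a given position.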
Initially, a position frequency matrix (PFM) is computed, which enumerates the count of each base pair in each column of a set of aligned TFBS (Figure 1.5). The PFM may be transformed to a log probability form PWM as follows:

$$W_{b,i} = \log_2 \frac{p'(b,i)}{p(b)} \qquad (1)$$

where $W_{b,i}$ is the PWM value of base $b$ at column $i$, $p'(b,i)$ is the corrected probability of base $b$ found at column $i$, and $p(b)$ is the probability of base $b$ in the genomic background; $p'(b,i)$ can be computed as:

$$p'(b,i) = \frac{C_{b,i} + p(b)\sqrt{n}}{n + \sqrt{n}} \qquad (2)$$

where $C_{b,i}$ is the count of base $b$ at column $i$ and $n$ is the total number of motifs contributing to the model. It should be noted that there are subtle variations in the formulas used by different researchers [103].

⁶ IUPAC is a nomenclature system that describes chemical compounds.

Each sequence is scanned and potential binding profiles are enumerated if they meet an indicated minimum threshold score. Although the PWM approach provides improved results over the consensus matching approach, it often produces a high number of false predictions (where false refers to sequence motifs that are likely to be suitable for protein-DNA interactions in vitro but are not functional cis-regulatory elements). An additional shortcoming of this model is the assumption of independence of nucleotides at different positions in the binding site. Several databases have been developed which provide lists of experimentally determined TFBS and TFBS profiles. A select list of these resources is provided in Table 1.1. The construction of PWM models assumes an existing alignment of TF binding sites is provided. In the absence of such an alignment, a motif discovery procedure can be applied. Motif de novo detection entails identifying over-represented short sequence patterns in a set of longer DNA sequences observed or predicted to be bound by the same TF. Popular technical approaches include expectation maximization (EM) [104, 105] and Gibbs sampling [106].
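The PFM-to-PWM transformation and the threshold-based scan described in this section can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the corrected probability is taken to be (count + p(b)·√n) / (n + √n), one of the commonly used pseudocount variants; the background is assumed uniform (p(b) = 0.25); the function names, the scanned sequence, and the threshold of 5.0 are hypothetical; and only the forward strand is scanned. The four aligned sites are those listed in the text.

```python
import math

BASES = "ACGT"


def build_pfm(sites):
    """Count each base in each column of a set of equal-length aligned sites."""
    width = len(sites[0])
    return [{b: sum(site[i] == b for site in sites) for b in BASES}
            for i in range(width)]


def build_pwm(pfm, n, background=None):
    """Convert counts to log2 odds using a sqrt(n) pseudocount (assumption)."""
    background = background or {b: 0.25 for b in BASES}
    pseudo = math.sqrt(n)
    pwm = []
    for column in pfm:
        weights = {}
        for b in BASES:
            corrected = (column[b] + background[b] * pseudo) / (n + pseudo)
            weights[b] = math.log2(corrected / background[b])
        pwm.append(weights)
    return pwm


def scan(sequence, pwm, threshold):
    """Return (start, window, score) for windows scoring at least threshold.

    The sequence is assumed to contain only the characters A, C, G and T.
    """
    width = len(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(pwm[i][base] for i, base in enumerate(window))
        if score >= threshold:
            hits.append((start, window, round(score, 2)))
    return hits


sites = ["GATCAG", "GATCAT", "GATCCA", "GACTGT"]
pwm = build_pwm(build_pfm(sites), n=len(sites))
print(scan("TTGATCAGTT", pwm, threshold=5.0))
```

Lowering the threshold in this sketch quickly admits additional windows, which mirrors the trade-off noted above between sensitivity and the number of false predictions.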
As a detailed review is beyond the scope of this thesis, interested readers may wish to refer to the reviews in [107-109].

1.5.2. Computational detection of cis-regulatory modules

The development of computational methods for the detection of CRMs is useful for the study of cooperative interactions between TFs. A diverse set of computational methods has been developed to discriminate CRMs in sequences. The common objective of each of these approaches is to detect non-random clusters of TFBS. Several CRM-based algorithms have been developed to specifically identify regulatory elements proximal to core promoters (see [110] for a recent review and evaluation of this software). Additional methods have been developed to search for CRMs in regions distal to the TSS. These ‘enhancer’ detection tools apply unique and overlapping procedures (Table 1.2). Selected algorithms pre-dating the CRM detection research presented in this thesis are summarized in Table 1.3. For a recent review of CRM analysis tools see [111].

1.6. Myelinogenesis in the nervous system

The nervous system is composed of a network of cells that enables communication through electrical impulses. This system is divided into two categories: the central nervous system (CNS) and the peripheral nervous system (PNS). The CNS consists of the brain, spinal cord and optic nerves, while the PNS resides outside the CNS, providing connections from the CNS to limbs and organs (for example, spinal roots and sciatic nerves). There are two main cell types: neurons and glial cells. Neurons are responsible for transmitting signals through the nervous system. Glial cells often reside in direct contact with neurons and perform supportive and enabling roles. Neurons are essentially made up of three main components: 1) dendrites, which receive information from other cells; 2) the cell body, which contains the nucleus and other eukaryotic organelles; and 3) axons, which conduct signals away from the cell body.
The myelin sheath is an insulating layer that forms around axons, the extensions of neurons, enabling efficient conduction of impulses along nerve fibers throughout the vertebrate nervous system. Positioned at the gaps between individual myelin sheaths are the Nodes of Ranvier, which enable the propagation of impulses. At resting potential, the fluid outside of the plasma membrane is positively charged and the interior is negatively charged. Voltage-gated sodium channels, concentrated at the Nodes of Ranvier (Figure 1.6), open in response to a depolarization⁷ of the plasma membrane, encouraging an influx of positive sodium ions, which creates a positive interior charge. A depolarization in one area of the membrane causes voltage-gated channels in adjacent regions of the membrane to open, resulting in continued influx of positive sodium ions and evoking a depolarization sweep along the axon, referred to as an action potential. The speed of action potential propagation in non-myelinated fibers is proportional to the axon diameter. Importantly, properties of the lipid-rich myelin sheath facilitate high-speed propagation of impulses without an increase in axon diameter. Myelin is elaborated from the plasma membrane of two types of glial cells: Schwann cells in the PNS, which encase a segment of a single axon (Figure 1.6 and Figure 1.7), and oligodendrocytes in the CNS, which are capable of myelinating several axons (Figure 1.6 and Figure 1.8). The myelin membrane contains a number of proteins that are responsible for its unique structural and functional properties (Figure 1.6). In the CNS, the PLP1 protein, a tetraspan membrane protein, makes up 17% of myelin proteins, while the myelin basic protein (MBP), the second largest protein component of myelin, constitutes 8% of the total protein [112].
Given the critical role that the myelin sheath plays in the nervous system, it is not surprising that dysregulation of myelin protein production results in debilitating neuropathies.

⁷ Depolarization is a change in voltage difference between the interior and exterior of a cell.

1.6.1. Schwann cells

Schwann cells (SC), which develop in the PNS, myelinate single axons. Most SCs develop from the neural crest, where precursors transition to immature SCs (around E16) and, while migrating along the axon tracts, diverge into either myelinating or non-myelinating Schwann cells (Figure 1.7) [113]. Studies have demonstrated that the non-myelinating and myelinating SC phenotypes are interconvertible and that signals emanating from axons determine these states (see review in [114]). Much effort has been devoted to determining the TFs that are modulated by these signals to identify responsible regulatory pathways (for review see [115]). A number of peripheral myelinopathies are caused by altered gene dosage [116, 117] and, therefore, it is hypothesized that gene regulation-based therapies may be an avenue for intervention.

1.6.2. Oligodendrocytes

Oligodendrocyte (OL) cells are responsible for diverse functional roles in the central nervous system (CNS), including myelin sheath formation and axonal integrity maintenance. Much like the SC maturation process, OL cell development progresses through multiple cell stage transitions, starting with a precursor cell stage and culminating in a mature myelinating oligodendrocyte (Figure 1.9). Each of these cell stages can be identified by specific protein marker expression. Fate-mapping studies in mice suggest that oligodendrocyte precursor cells (OPC) initially populate the cortex from the motor neuron progenitor domain (pMN) in the ventral ventricular zones and that a second wave of OPCs is then generated from dorsal sources [118].
This specification and maturation of OLs is largely controlled by TF gene regulation [119]. Specific TFs and TF families have demonstrated necessary roles in the OL development process, including: Olig1 and Olig2 [120-123] along with other members of the basic Helix-Loop-Helix (bHLH) family [124, 125], Nkx2-2 [126, 127], Nkx6-1 [128], Nkx6-2 [129], and Sox family TFs [130-132]. Recent studies confirm that a subset of these regulatory effects is a consequence of synergistic TF control [50, 133-136].

1.6.2.1. Olig transcription factors

Olig TF proteins belong to a sub-class of the bHLH group of DNA-binding proteins. Sonic hedgehog (Shh) is a signaling protein required for OL development in the forebrain [137-139], and it induces CNS expression of both Olig1 and Olig2 in human spinal cord and forebrain prior to the emergence of PDGFR-alpha+ and NG2+ OL progenitor cells (see protein markers in Figure 1.9) [120, 140]. Olig1 and Olig2 TF expression is responsible for the generation of OL progenitor (PDGFR-alpha+) cells in brain regions [121]. Recent studies suggest that the Olig2 TF is required for OL development in the spinal cord, while Olig1 is necessary for OL specification in the brain [120, 121, 141]. The critical role that Olig1 plays in brain OL development was further elucidated in a recent study using Olig1 knockout mice [123]. Olig1 null mice develop severe neurological defects, such as tremors and seizures, and die two weeks after birth. Although the elimination of Olig1 expression does not inhibit OL progenitor stage cell development in the brain, its absence abolishes all major myelin gene expression and myelin sheath assembly, demonstrating that Olig1 plays an essential role during initial CNS myelin elaboration. However, Olig1 may not be involved in control of myelinogenesis beyond the initial accumulation of myelinating OLs during development, since Olig1 proteins are found localized in the cytoplasm after two weeks of age [142].
Importantly, Olig1 TF proteins re-enter the nucleus following demyelinating injuries, which suggests that Olig1 may be involved in remyelination and maintenance activities in the adult brain. Deciphering the regulatory role of Olig1 in these adult OL regenerative events would be an important contribution towards elucidating myelin reparative mechanisms.

1.6.3. Myelin Basic Protein

Myelin Basic Protein (MBP), found on the cytoplasmic face of the myelin membrane, is required for myelin compaction in the PNS (Schwann cells) and CNS (oligodendrocytes) (Figure 1.6). The MBP gene is located on chromosome 18 in both murine and human genomes and is composed of seven exons. There are four known major protein forms in mice, two of which constitute 95% of the MBPs; these are translated from at least seven transcript variants, which appear to be differentially expressed at different time points (see review in [143]). In mice, MBP protein production is required for long-term viability. Mutant mice lacking the MBP gene, a mouse model called Shiverer, exhibit a shivering gait which appears a few weeks after birth and remains until premature death (between 50 and 100 days old) [144]. Given MBP’s central role in the myelin sheath architecture, a number of studies have been conducted to identify regulatory sequence that directs MBP gene expression. Recent studies have used targeted in vivo mouse transgenesis strategies to characterize the tissue-specific and temporal regulatory functions of non-coding conserved regions situated upstream of the MBP gene [93, 94, 145-147]. Notably, each of these enhancers contributes individually and/or in combination to direct MBP gene expression to SC (PNS) and/or OL (CNS) cells at different developmental stages.

1.7. Thesis overview and chapter summaries

One of the fundamental challenges in this post-genome era is determining how genes are regulated. A growing list of diseases is linked to dysregulation of gene expression.
The complexity of transcriptional mechanisms in metazoans presents great challenges in defining the specific processes that lead to dysregulation of genes. Detailed elucidation of gene regulatory mechanisms is a necessary prerequisite to the development of therapies. The myelin sheath plays a critical role in enabling and maintaining the integrity of neural signal transmission throughout the vertebrate PNS and CNS. Myelin elaboration is largely controlled through synergistic transcriptional mechanisms, and dosage changes in myelin-associated proteins are known to result in debilitating neuropathies. Discovery of the DNA-binding TFs and cooperative mechanisms that are responsible for myelin gene regulation is essential for determining potential therapeutic strategies that alleviate these severe disease phenotypes. The work described in this thesis was motivated by this fundamental objective. The aim of this thesis was to develop the necessary algorithms and resources to predict TF cooperative involvement in CNS myelin sheath elaboration. A summary of the applied approaches described in chapters two through six is provided below. In chapter 2, I describe an approach (formerly named oPOSSUM2 and currently called Combination Site Analysis, CSA), published in [148], that enables the prediction of DNA-binding TF cooperativity in non-coding regions of human or mouse co-expressed genes. When I began work on this algorithm in 2004, few tools addressed the detection of non-homotypic CRMs⁸ (Table 1.3). Our objective was to develop a computationally efficient process that identifies over-represented combinations of TFBS in a set of co-expressed genes as compared with predicted TFBS combinations found in a background set of non-coding regions (predicted in the oPOSSUM database developed in our lab [149]).
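Over-representation of a TFBS combination in a co-expressed gene set relative to a background set is commonly assessed with a test on a 2x2 contingency table, such as Fisher's exact test. The following Python sketch computes a one-sided (enrichment) p-value directly from the hypergeometric distribution. It is an illustration only: the function name and the gene counts are hypothetical, and the sketch assumes that the foreground and background gene sets are disjoint, which may differ from how a particular tool defines its background.

```python
from math import comb


def fisher_exact_greater(hits_fg, n_fg, hits_bg, n_bg):
    """One-sided Fisher exact p-value for enrichment in the foreground.

    hits_fg / n_fg: foreground genes containing the combination / total.
    hits_bg / n_bg: background genes containing the combination / total.
    The two gene sets are assumed to be disjoint (illustrative assumption).
    """
    total = n_fg + n_bg
    total_hits = hits_fg + hits_bg
    upper = min(n_fg, total_hits)
    # Probability of observing hits_fg or more foreground hits by chance.
    favourable = sum(
        comb(total_hits, k) * comb(total - total_hits, n_fg - k)
        for k in range(hits_fg, upper + 1)
    )
    return favourable / comb(total, n_fg)


# Hypothetical counts: 8 of 10 co-expressed genes contain the combination,
# compared with 20 of 1000 background genes.
print(fisher_exact_greater(8, 10, 20, 1000))
```

A small p-value indicates that the combination occurs in the co-expressed set more often than expected from its background frequency; when many combinations are evaluated, the resulting p-values are typically ranked or corrected for multiple testing.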
The evaluation of multiple predicted TFBS in sequences yields a search space that is combinatorial in nature and computationally unfavorable for conducting analyses in a reasonable timeframe. Another inherent challenge with TFBS predictions is that many TF family members bind similar sequences and, therefore, a prediction of a binding site for one TF family member can implicate the binding of homologous family members. We addressed these challenges through the design of an algorithm that implements a TF binding profile clustering step to identify statistically significant combinations of TFBS class instances, followed by an evaluation of sets of TFs that are members of the indicated classes. Additionally, incorporation of a biologically relevant inter-binding site distance (IBSD) parameter facilitated cluster aggregation of TF pairs that satisfy the IBSD to form larger CRMs, which avoided brute-force enumeration of larger CRM sizes. The CSA approach evaluates the significance of TFBS combinations found in non-coding regions of a co-expressed gene set against the frequency of the TFBS combination found in a (full ortholog) background set, using a Fisher Exact Test. CSA analyses of CRM reference collections recovered known TFBS combinations in top-ranked results. We provided community access to this CRM analysis tool through a web-based application.

⁸ A homotypic cis-regulatory module (CRM) is a cluster of similar binding sites.

The alternative promoter data, made available through a recent large-scale CAGE study [24], provides unprecedented information about multiple TSS and core promoter regions of human and mouse genes. We used this information to define multiple transcription start sites for human and mouse genes, which enabled improved alignments of human-mouse orthologous gene isoforms in the oPOSSUM database.
In chapter 3, I describe the implementation of a revised CSA algorithm that incorporates the evaluation of multiple putative promoter regions of a single gene to identify over-represented CRMs in a set of co-expressed genes (published in [150]). Validation of this revised approach demonstrated a significant improvement in recovery of TFBS combinations in the CRM reference collections. The demarcation of multiple promoter boundaries per gene can expand the range of the sequence regions searched, while limiting exploration to bounded partitions around an individual TSS. Chapter 3 also describes the analysis of CAGE data to establish alternative TSSs in the database and integration of the CSA website with an oPOSSUM Suite portal that provides centralized access to SSA (Single Site Analysis) tools for human-mouse, yeast, and worms. Coordinated regulation of gene transcription relies on TF proteins binding DNA. Recent studies have also highlighted the important role that accessory TFs play through co-factor mechanisms and chromatin structure modifications. Understanding which TFs are capable of binding a TFBS and/or identifying the TFs that are expressed in a gene expression profiling analysis is an essential step in gene regulatory analyses. We were surprised to find that a well-validated inventory of mouse-human TFs was not available for this purpose. During our efforts to assemble this resource, we encountered other researchers compiling similar lists. We combined our efforts to produce the TFCat resource for gene regulation analyses. In chapter 4, I describe the assembly of a comprehensive mouse and human TF catalog (published in [151]). TFCat is a curated catalog of mouse and human TFs. Curators assigned genes to a functional taxonomy and provided a confidence assessment for judgment classifications. All proteins linked to DNA-binding were reviewed and DNA-binding domains were mapped to a structural classification system.
Sequence-based analyses were performed to predict TF encoding genes that were not reviewed in the curation process or could not be curated due to lack of literature evidence. The TF data was made available for review and download on a wiki and web portal.

There is widespread interest in exploring the use of gene therapy to treat disease. The goal of the Pleiades Promoter Project is to create a panel of human regulatory region constructs that direct tissue- and/or cell-specific gene expression in the adult brain. The project strategy incorporates detailed analysis of gene expression profile data and computational evaluation of multi-species conserved sequences to identify putative regulatory regions neighboring gene loci that exhibit brain region-specific expression. Predicted regulatory regions are validated using an in vivo mouse transgenesis method and evaluated at an adult developmental stage (~postnatal day 56). In Chapter 5, computational analyses are described for the prediction of TFs that may be driving expression of a mouse OLIG1 promoter-reporter gene construct. OLIG1 is a TF that is expressed in both oligodendrocyte progenitor cells and mature oligodendrocytes. A set of three reporter constructs, each composed of three conserved non-coding regions, was tested using in vivo mouse transgenesis, of which one reporter construct was actively expressed in adult brain. A comprehensive evaluation of an in vitro oligodendrocyte expression dataset identified differentially expressed genes, including TFs, across eight time points. Sequence feature analyses were performed over conserved sequence segments associated with the active and inactive constructs. The synthesis of these two analyses resulted in a short list of TF candidates that are both differentially regulated across oligodendrocyte maturation and for which putative binding sites are uniquely present in the construct sequence segments.
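The synthesis step described above amounts to combining two result sets. The following minimal sketch shows one way to express it, assuming that "uniquely present" means a TF has predicted binding sites in the active construct segments and none in the inactive ones. The function name and the TF identifiers are placeholders for illustration, not results or code from this work.

```python
def shortlist_tf_candidates(differentially_expressed, active_site_tfs, inactive_site_tfs):
    """Return TFs that are differentially expressed and whose predicted
    binding sites occur only in the active construct segments.

    All three arguments are iterables of TF identifiers. Restricting to
    sites absent from the inactive constructs is an assumed reading of
    "uniquely present".
    """
    unique_to_active = set(active_site_tfs) - set(inactive_site_tfs)
    return sorted(set(differentially_expressed) & unique_to_active)


# Placeholder identifiers, not results from the thesis.
expressed = {"TF_A", "TF_B", "TF_C"}
active = {"TF_A", "TF_C", "TF_D"}
inactive = {"TF_C"}
print(shortlist_tf_candidates(expressed, active, inactive))
```

Set operations keep the two evidence sources separate until the final intersection, so either input can be regenerated (for example, with a different expression threshold) without touching the other.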
A significant amount of research is focused on the elucidation of gene regulatory mechanisms to gain insights into complex biological systems and the role of gene regulation in disease phenotypes. Myelin sheath degradation is associated with human diseases, such as multiple sclerosis, schizophrenia, and leukodystrophies. It appears that for some myelin-related neuropathies, myelin protein dosage-alterations may be causally associated with the disease phenotypes. Importantly, disease progression may be eased through therapies that promote remyelination, motivating study of the regulatory mechanisms controlling oligodendrocyte development and myelination. In chapter 6, integrated experimental and computational research is presented that predicts the synergistic relationships of TFs involved in the spatio-temporal transcriptional events of co-expressed myelin genes during oligodendrocyte development. Mouse enhancer regions neighboring myelin-associated genes that direct expression in oligodendrocytes (CNS) and/or Schwann cells (PNS) were identified. Analyses of oligodendrocyte expression data were performed to define a specific set of co-expressed genes during early oligodendrocyte development across oligodendrocyte cell-stage transitions. The differentially expressed TF subset was identified through a TFCat-microarray probe mapping analysis. A new promoter analysis approach was developed to statistically evaluate CRM predictions in the oligodendrocyte co-expression dataset. TFBS signatures present in the validated enhancers and absent in the negative enhancers were identified and used to weight CRM predictions. A regulatory feature similarity analysis was performed for CRM predictions identified in both the oligodendrocyte co-expression dataset and the validated enhancer regions to identify shared regulatory features.
Analysis results were incorporated to produce an enhancer-weighted regulatory network of TFs that may be co-regulating myelin-associated gene expression during CNS myelinogenesis.

Table 1.1. Selected list of databases providing eukaryote transcription factor binding site data

Database | Description | Reference
ABS | Annotated TFBS for human, mouse and rat promoters. | [152]
JASPAR | High quality transcription factor binding profile database. | [153]
OREGANNO | Open Regulatory Annotation database | [154]
PAZAR | A framework for collection and dissemination of cis-regulatory sequence annotation | [155]
TRANSFAC | Contains data on transcription factors, their experimentally validated binding sites, and regulated genes | [156]

Table 1.2. CRM detection techniques

# | Description | Details
1 | Window-based | TFBS clusters are detected within a specific window
2 | PFM and/or PWM-based | Uses position weight matrices to predict TFBS
3 | De novo motif detection | Uses motif discovery method to predict TFBS
4 | Conserved alignments | Requires orthologous sequences and detects TFBS in conserved species sequence alignments
5 | Gene co-expression | Searches for predicted CRMs in a set of co-expressed genes
6 | Provides learning based on training set | Uses training sets to guide or improve the predictions
7 | Alignment of predicted TFBS | Requires or provides higher weighting to aligned predicted TFBS in multiple species
8 | Model generated background | Creates a background using a model
9 | TFBS score threshold | Applies a TFBS score threshold cut-off
10 | Statistical evaluation method applied | Applies statistical test(s) to determine significance of CRM prediction

Table 1.3. Selected computational human/mouse CRM detection tools/methods available before 2005

Algorithm | Year | Description | Techniques incorporated (Table 1.2) | Reference
Cister | 2001 | Finds TFBS clusters using HMM modeling approach. Background is modeled using local window over natural sequences. | 2, 8, 10 | [157]
Comet | 2002 | Finds TFBS clusters using HMM modeling approach. Uses sliding local window null model over natural sequences. | 2, 8, 10 | [158]
Cluster-Buster | 2003 | Finds TFBS clusters using HMM modeling approach. Uses sliding local window null model over natural sequences. | 2, 8, 10 | [159]
CREME | 2003/2004 | Finds TFBS clusters within windows. Uses TFBS permutation tests for null background model. | 1, 2, 4, 5, 7, 9, 10 | [160, 161]
ModuleSearcher | 2003 | Finds best TFBS cluster in a set of sequences. Background model is 3rd order Markov Model learned using natural sequences. | 1, 2, 4, 5, 9, 10 | [162]
MSCAN | 2003 | Reports clusters of TFBS above a threshold. Background defined as base pair frequency in each window that is evaluated. | 1, 2, 5, 9, 10 | [163]
Gibbs Module Sampler | 2004 | Identifies clusters of motifs (inferred TFBS) in a set of sequences trained with known CRM datasets. | 1, 3, 4, 5, 6, 10 | [164]

Figure 1.1. Nucleosome core octamer particle
DNA is wrapped around a histone octamer to form a nucleosome.
Figure source: Wikipedia, URL: http://en.wikipedia.org/wiki/Histone_H1, copyrights released to public domain.

Figure 1.2. Model for histone acetylation/deacetylation
Histone acetylation, catalyzed by Histone Acetyl Transferase (HAT) enzymes, can reduce the positive charge of histones leading to a more open chromatin structure, which is often associated with increased transcription. Deacetylation of histones (for example, on histone amino-tails), via histone deacetylases (HDACs), can inhibit transcription by producing a more condensed chromatin conformation. In this example, the Retinoic Acid Receptor (RAR) and Retinoid X Receptor (RXR) heterodimer binds to an enhancer. When the ligand, retinoic acid, is not available, the dimer interacts with nuclear corepressors: nuclear receptor corepressor and silencing mediator for retinoid and thyroid hormone receptors (NCoR/SMRT), which binds HDAC1.
Figure source: Weaver RF: Molecular Biology, 2nd edn. New York: McGraw-Hill Higher Education; 2002 [165], used with permission.

Figure 1.3.
Eukaryotic gene regulatory architecture
The gene regulatory architecture of eukaryotes includes DNA elements that interact with the basal promoter complex and may include proximal and/or distal enhancers that bind TFs and/or TF protein complexes to facilitate or repress gene transcription. Regulatory DNA sequence resides in a chromatin state that can be altered to enable or inhibit TF protein-DNA interactions.
Figure source: With kind permission from Springer Science+Business Media: Die Naturwissenschaften, In silico identification of metazoan transcriptional regulatory regions, 90, 2003, 146-66, Wasserman WW, Krivan W, Figure 1. [166]

Figure 1.4. Hprt1 mouse transgenesis system
Embryonic stem (ES) cells carry a deletion spanning the promoter and first two exons of the hypoxanthine phosphoribosyltransferase (Hprt1) gene. The putative enhancer sequence (for example, depicted below as a Myelin Basic Protein (MBP) regulatory sequence) and reporter gene sequence (for example, lacZ) are inserted into the targeting vector, which includes the Hprt1 homology arms required to restore the Hprt1 locus by homologous recombination.
Figure source: Bronson SK, Plaehn EG, Kluckman KD, Hagaman JR, Maeda N, Smithies O: Single-copy transgenic mice with chosen-site integration. In: Proc Natl Acad Sci USA. vol. 93; 1996: 9067-9072; Figure 5 [83]. Copyright (1996) National Academy of Sciences, U.S.A.; figure adapted by the Peterson Laboratory (McGill University), used with permission.

Figure 1.5. Modeling TF binding sites using position weight matrices
An aligned set of transcription factor binding sites can be converted into a position frequency matrix (PFM), which enumerates the frequency of each nucleotide in each column. A binding site logo graphic of the PFM can be generated. A PFM is converted to a log-scale probability representation, referred to as a position weight matrix (PWM), which is used to detect and score potential binding sites.

Figure 1.6.
Myelinating glial cells: oligodendrocytes and Schwann cells
Myelinating glial cells elaborate myelin from their cell plasma membrane. Schwann cells are generated from neural crest cells and can myelinate segments of single axons in the peripheral nervous system, whereas oligodendrocytes are derived from neuroectoderm and can myelinate one or more axons in the central nervous system. Despite their unique cellular origins, these glial cells express several important myelin proteins in common.
Figure source: Used with kind permission from Editions Doin and The American Physiological Society: Pham-Dinh D. Les cellules gliales. In: Physiologie du Neurone [167], adapted in Baumann N, Pham-Dinh D. Physiol Rev. 2001 [143].

Figure 1.7. Schwann cell lineage
In peripheral nerves, Schwann cells progress through a maturation process, which initiates with the formation of immature Schwann cells from migrating neural crest cells and culminates with a transition to mature myelinating or non-myelinating Schwann cells. Myelinating Schwann cells ensheathe a single axon. Non-myelinating Schwann cells aggregate multiple C fiber axons to form Remak bundles. Extrinsic signals, such as expression of neuregulin, determine Schwann cell fate.
Figure source: Figure reprinted by permission from Macmillan Publishers Ltd: Nature Neuroscience. 8(11): 1420-1422, copyright 2005 [114].

Figure 1.8. Oligodendrocytes can ensheathe multiple axons
Figure source: Figure produced using Servier Medical Art, used with permission for academic use.

Figure 1.9. Oligodendrocyte lineage
Oligodendrocytes transition through a multi-stage developmental process.
Cell stages are characterized by the expression of specific marker proteins: i) A2B5 antigen, platelet-derived growth factor receptor-alpha (PDGFRalpha), and chondroitin sulphate proteoglycan NG2 in progenitors; ii) O4 antigen in pro-oligodendrocytes; iii) galactocerebroside (GC or O1 antigen) and 2',3'-cyclic nucleotide 3'-phosphodiesterase (CNPase) in mature oligodendrocytes; and iv) myelin proteins, such as myelin-associated glycoprotein (MAG), myelin basic protein (MBP), and proteolipid protein (PLP), in myelinating oligodendrocytes.
Figure source: Figure reprinted by permission from Macmillan Publishers Ltd: Nature Reviews. Neuroscience. 2(11): 840-843, copyright 2001 [168].

1.8. References

1. Khorasanizadeh S: The nucleosome: from genomic organization to genomic regulation. Cell 2004, 116(2):259-272.
2. Wolffe AP, Guschin D: Review: chromatin structural features and targets that regulate transcription. Journal of structural biology 2000, 129(2-3):102-122.
3. Wood WI, Felsenfeld G: Chromatin structure of the chicken beta-globin gene region. Sensitivity to DNase I, micrococcal nuclease, and DNase II. Journal of Biological Chemistry 1982, 257(13):7730-7736.
4. Jin C, Felsenfeld G: Nucleosome stability mediated by histone variants H3.3 and H2A.Z. Genes Dev 2007, 21(12):1519-1529.
5. Ozsolak F, Song JS, Liu XS, Fisher DE: High-throughput mapping of the chromatin structure of human promoters. Nature biotechnology 2007, 25(2):244-248.
6. Fu Y, Sinha M, Peterson CL, Weng Z: The insulator binding protein CTCF positions 20 nucleosomes around its binding sites across the human genome. PLoS Genet 2008, 4(7):e1000138.
7. Fedor MJ, Lue NF, Kornberg RD: Statistical positioning of nucleosomes by specific protein-binding to an upstream activating sequence in yeast. J Mol Biol 1988, 204(1):109-127.
8. Jones PA, Takai D: The role of DNA methylation in mammalian epigenetics. Science 2001, 293(5532):1068-1070.
9. Ng HH, Bird A: DNA methylation and chromatin modification.
Current opinion in genetics & development 1999, 9(2):158-163.
10. Illingworth RS, Bird AP: CpG islands--'a rough guide'. FEBS Lett 2009, 583(11):1713-1720.
11. Hendrich B, Tweedie S: The methyl-CpG binding domain and the evolving role of DNA methylation in animals. Trends in genetics : TIG 2003, 19(5):269-277.
12. Brown CE, Lechner T, Howe L, Workman JL: The many HATs of transcription coactivators. Trends in biochemical sciences 2000, 25(1):15-19.
13. Liu CL, Kaplan T, Kim M, Buratowski S, Schreiber SL, Friedman N, Rando OJ: Single-nucleosome mapping of histone modifications in S. cerevisiae. PLoS biology 2005, 3(10):e328.
14. Roh TY, Wei G, Farrell CM, Zhao K: Genome-wide prediction of conserved and nonconserved enhancers by histone acetylation patterns. Genome Res 2007, 17(1):74-81.
15. Martens JH, O'Sullivan RJ, Braunschweig U, Opravil S, Radolf M, Steinlein P, Jenuwein T: The profile of repeat-associated histone lysine methylation states in the mouse epigenome. The EMBO journal 2005, 24(4):800-812.
16. Barski A, Cuddapah S, Cui K, Roh TY, Schones DE, Wang Z, Wei G, Chepelev I, Zhao K: High-resolution profiling of histone methylations in the human genome. Cell 2007, 129(4):823-837.
17. Heintzman ND, Stuart RK, Hon G, Fu Y, Ching CW, Hawkins RD, Barrera LO, Van Calcar S, Qu C, Ching KA, Wang W, Weng Z, Green RD, Crawford GE, Ren B: Distinct and predictive chromatin signatures of transcriptional promoters and enhancers in the human genome. Nat Genet 2007, 39(3):311-318.
18. Strahl BD, Allis CD: The language of covalent histone modifications. Nature 2000, 403(6765):41-45.
19. Zhang Y: Transcriptional regulation by histone ubiquitination and deubiquitination. Genes & development 2003, 17(22):2733-2740.
20. Nowak SJ, Corces VG: Phosphorylation of histone H3: a balancing act between chromosome condensation and transcriptional activation. Trends Genet 2004, 20(4):214-220.
21.
Heard E, Disteche CM: Dosage compensation in mammals: fine-tuning the expression of the X chromosome. Genes Dev 2006, 20(14):1848-1867.
22. Smale S, Kadonaga J: The RNA polymerase II core promoter. Annu Rev Biochem 2003, 72:449-479.
23. Ayoubi TA, Van De Ven WJ: Regulation of gene expression by alternative promoters. Faseb J 1996, 10(4):453-460.
24. Carninci P, Sandelin A, Lenhard B, Katayama S, Shimokawa K, Ponjavic J, Semple CA, Taylor MS, Engstrom PG, Frith MC, Forrest AR, Alkema WB, Tan SL, Plessy C, Kodzius R, Ravasi T, Kasukawa T, Fukuda S, Kanamori-Katayama M, Kitazume Y, Kawaji H, Kai C, Nakamura M, Konno H, Nakano K, Mottagui-Tabar S, Arner P, Chesi A, Gustincich S, Persichetti F et al: Genome-wide analysis of mammalian promoter architecture and evolution. Nat Genet 2006, 38(6):626-635.
25. Merika M, Thanos D: Enhanceosomes. Current opinion in genetics & development 2001, 11(2):205-208.
26. Kamakaka RT: Silencers and locus control regions: opposite sides of the same coin. Trends in biochemical sciences 1997, 22(4):124-128.
27. West AG, Gaszner M, Felsenfeld G: Insulators: many functions, many mechanisms. Genes & development 2002, 16(3):271-288.
28. Visel A, Akiyama JA, Shoukry M, Afzal V, Rubin EM, Pennacchio LA: Functional autonomy of distant-acting human enhancers. Genomics 2009, 93(6):509-513.
29. Garvie CW, Wolberger C: Recognition of specific DNA sequences. Molecular cell 2001, 8(5):937-946.
30. Halford SE, Marko JF: How do site-specific DNA-binding proteins find their targets? Nucleic acids research 2004, 32(10):3040-3052.
31. Spiegelman BM, Heinrich R: Biological control through regulated transcriptional coactivators. Cell 2004, 119(2):157-167.
32. Triezenberg SJ: Structure and function of transcriptional activation domains. Curr Opin Genet Dev 1995, 5(2):190-196.
33. Luo Y, Ge H, Stevens S, Xiao H, Roeder RG: Coactivation by OCA-B: definition of critical regions and synergism with general cofactors.
Molecular and cellular biology 1998, 18(7):3803-3810.
34. Badis G, Berger MF, Philippakis AA, Talukder S, Gehrke AR, Jaeger SA, Chan ET, Metzler G, Vedenko A, Chen X, Kuznetsov H, Wang CF, Coburn D, Newburger DE, Morris Q, Hughes TR, Bulyk ML: Diversity and complexity in DNA recognition by transcription factors. Science 2009, 324:1720-1723.
35. L'Honore A, Lamb NJ, Vandromme M, Turowski P, Carnac G, Fernandez A: MyoD distal regulatory region contains an SRF binding CArG element required for MyoD expression in skeletal myoblasts and during muscle regeneration. Molecular biology of the cell 2003, 14(5):2151-2162.
36. Dutta S, Burkhardt K, Young J, Swaminathan GJ, Matsuura T, Henrick K, Nakamura H, Berman HM: Data deposition and annotation at the worldwide protein data bank. Molecular biotechnology 2009, 42(1):1-13.
37. Luscombe NM, Austin SE, Berman HM, Thornton JM: An overview of the structures of protein-DNA complexes. Genome biology 2000, 1(1):REVIEWS001.
38. Bateman A, Birney E, Durbin R, Eddy SR, Finn RD, Sonnhammer EL: Pfam 3.1: 1313 multiple alignments and profile HMMs match the majority of proteins. Nucleic acids research 1999, 27(1):260-262.
39. Gough J: The SUPERFAMILY database in structural genomics. Acta Crystallographica Section D, Biological Crystallography 2002, 58(Pt 11):1897-1900.
40. Murzin AG, Brenner SE, Hubbard T, Chothia C: SCOP: a structural classification of proteins database for the investigation of sequences and structures. Journal of Molecular Biology 1995, 247(4):536-540.
41. Orengo CA, Michie AD, Jones S, Jones DT, Swindells MB, Thornton JM: CATH--a hierarchic classification of protein domain structures. Structure (London, England : 1993) 1997, 5(8):1093-1108.
42. Kummerfeld SK, Teichmann SA: DBD: a transcription factor prediction database. Nucleic acids research 2006, 34(Database issue):D74-81.
43.
Boube M, Joulia L, Cribbs DL, Bourbon HM: Evidence for a mediator of RNA polymerase II transcriptional regulation conserved from yeast to man. Cell 2002, 110(2):143-151.
44. Li XY, Virbasius A, Zhu X, Green MR: Enhancement of TBP binding by activators and general transcription factors. Nature 1999, 399(6736):605-609.
45. Hanna-Rose W, Hansen U: Active repression mechanisms of eukaryotic transcription repressors. Trends in genetics : TIG 1996, 12(6):229-234.
46. Adams CC, Workman JL: Binding of disparate transcriptional activators to nucleosomal DNA is inherently cooperative. Molecular and cellular biology 1995, 15(3):1405-1421.
47. Bucceri A, Kapitza K, Thoma F: Rapid accessibility of nucleosomal DNA in yeast on a second time scale. The EMBO journal 2006, 25(13):3123-3132.
48. Anderson JD, Thastrom A, Widom J: Spontaneous access of proteins to buried nucleosomal DNA target sites occurs via a mechanism that is distinct from nucleosome translocation. Molecular and cellular biology 2002, 22(20):7147-7157.
49. Lee CK, Shibata Y, Rao B, Strahl BD, Lieb JD: Evidence for nucleosome depletion at active regulatory regions genome-wide. Nat Genet 2004, 36(8):900-905.
50. Li H, Lu Y, Smith HK, Richardson WD: Olig1 and Sox10 interact synergistically to drive myelin basic protein transcription in oligodendrocytes. J Neurosci 2007, 27(52):14375-14382.
51. Morin S, Pozzulo G, Robitaille L, Cross J, Nemer M: MEF2-dependent recruitment of the HAND1 transcription factor results in synergistic activation of target promoters. J Biol Chem 2005, 280(37):32272-32278.
52. Panne D, Maniatis T, Harrison SC: An atomic model of the interferon-beta enhanceosome. Cell 2007, 129(6):1111-1123.
53. Pettersson M, Schaffner W: Synergistic activation of transcription by multiple binding sites for NF-kappa B even in absence of co-operative factor binding to DNA. J Mol Biol 1990, 214(2):373-380.
54.
Greenbaum D, Colangelo C, Williams K, Gerstein M: Comparing protein abundance and mRNA expression levels on a genomic scale. Genome Biol 2003, 4(9):117.
55. Schena M, Shalon D, Davis RW, Brown PO: Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science 1995, 270(5235):467-470.
56. Pease AC, Solas D, Sullivan EJ, Cronin MT, Holmes CP, Fodor SP: Light-generated oligonucleotide arrays for rapid DNA sequence analysis. Proc Natl Acad Sci U S A 1994, 91(11):5022-5026.
57. Lockhart DJ, Dong H, Byrne MC, Follettie MT, Gallo MV, Chee MS, Mittmann M, Wang C, Kobayashi M, Horton H, Brown EL: Expression monitoring by hybridization to high-density oligonucleotide arrays. Nature biotechnology 1996, 14(13):1675-1680.
58. Velculescu VE, Zhang L, Vogelstein B, Kinzler KW: Serial analysis of gene expression. Science 1995, 270(5235):484-487.
59. Wang Z, Gerstein M, Snyder M: RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet 2009, 10(1):57-63.
60. Harsha HC, Kandasamy K, Ranganathan P, Rani S, Ramabadran S, Gollapudi S, Balakrishnan L, Dwivedi SB, Telikicherla D, Selvan LD, Goel R, Mathivanan S, Marimuthu A, Kashyap M, Vizza RF, Mayer RJ, Decaprio JA, Srivastava S, Hanash SM, Hruban RH, Pandey A: A compendium of potential biomarkers of pancreatic cancer. PLoS medicine 2009, 6(4):e1000046.
61. Gray PA, Fu H, Luo P, Zhao Q, Yu J, Ferrari A, Tenzen T, Yuk DI, Tsung EF, Cai Z, Alberta JA, Cheng LP, Liu Y, Stenman JM, Valerius MT, Billings N, Kim HA, Greenberg ME, McMahon AP, Rowitch DH, Stiles CD, Ma Q: Mouse brain organization revealed through direct genome-scale TF expression analysis. Science (New York, NY) 2004, 306(5705):2255-2257.
62. Cahoy JD, Emery B, Kaushal A, Foo LC, Zamanian JL, Christopherson KS, Xing Y, Lubischer JL, Krieg PA, Krupenko SA, Thompson WJ, Barres BA: A transcriptome database for astrocytes, neurons, and oligodendrocytes: a new resource for understanding brain development and function.
J Neurosci 2008, 28(1):264-278.
63. Affara NA: Resource and hardware options for microarray-based experimentation. Briefings in functional genomics & proteomics 2003, 2(1):7-20.
64. Irizarry RA, Hobbs B, Collin F, Beazer-Barclay YD, Antonellis KJ, Scherf U, Speed TP: Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics (Oxford, England) 2003, 4(2):249-264.
65. Bolstad BM, Irizarry RA, Astrand M, Speed TP: A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics 2003, 19(2):185-193.
66. Steinhoff C, Vingron M: Normalization and quantification of differential expression in gene expression microarrays. Briefings in bioinformatics 2006, 7(2):166-177.
67. Simon R: Microarray-based expression profiling and informatics. Current opinion in biotechnology 2008, 19(1):26-29.
68. Wright GW, Simon RM: A random variance model for detection of differential gene expression in small microarray experiments. Bioinformatics 2003, 19(18):2448-2455.
69. Benjamini Y, Hochberg Y: Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society 1995, 57(1):289-300.
70. The ENCODE (ENCyclopedia Of DNA Elements) Project. Science 2004, 306(5696):636-640.
71. Yuh CH, Bolouri H, Davidson EH: Cis-regulatory logic in the endo16 gene: switching from a specification to a differentiation mode of control. Development 2001, 128(5):617-629.
72. Levine M: A systems view of Drosophila segmentation. Genome Biol 2008, 9(2):207.
73. Weaver RF: Molecular biology, 4th edn. New York: McGraw-Hill Higher Education; 2008.
74. Kerr LD: Electrophoretic mobility shift assay. Methods in enzymology 1995, 254:619-632.
75. Ren B, Robert F, Wyrick JJ, Aparicio O, Jennings EG, Simon I, Zeitlinger J, Schreiber J, Hannett N, Kanin E, Volkert TL, Wilson CJ, Bell SP, Young RA: Genome-wide location and function of DNA binding proteins.
Science 2000, 290(5500):2306-2309.
76. Valouev A, Johnson DS, Sundquist A, Medina C, Anton E, Batzoglou S, Myers RM, Sidow A: Genome-wide analysis of transcription factor binding sites based on ChIP-Seq data. Nature methods 2008, 5(9):829-834.
77. Johnson DS, Mortazavi A, Myers RM, Wold B: Genome-wide mapping of in vivo protein-DNA interactions. Science 2007, 316(5830):1497-1502.
78. Tanay A: Extensive low-affinity transcriptional interactions in the yeast genome. Genome Res 2006, 16:962-972.
79. Kalionis B, O'Farrell PH: A universal target sequence is bound in vitro by diverse homeodomains. Mechanisms of development 1993, 43(1):57-70.
80. Blackwell TK: Selection of protein binding sites from random nucleic acid sequences. Methods in enzymology 1995, 254:604-618.
81. Berger MF, Bulyk ML: Protein binding microarrays (PBMs) for rapid, high-throughput characterization of the sequence specificities of DNA binding proteins. Methods in molecular biology (Clifton, NJ) 2006, 338:245-260.
82. Naylor LH: Reporter gene technology: the future looks bright. Biochemical pharmacology 1999, 58(5):749-757.
83. Bronson SK, Plaehn EG, Kluckman KD, Hagaman JR, Maeda N, Smithies O: Single-copy transgenic mice with chosen-site integration. Proceedings of the National Academy of Sciences of the United States of America 1996, 93(17):9067-9072.
84. Zambrowicz BP, Imamoto A, Fiering S, Herzenberg LA, Kerr WG, Soriano P: Disruption of overlapping transcripts in the ROSA beta geo 26 gene trap strain leads to widespread expression of beta-galactosidase in mouse embryos and hematopoietic cells. Proc Natl Acad Sci U S A 1997, 94(8):3789-3794.
85. Yang GS, Banks KG, Bonaguro RJ, Wilson G, Dreolini L, de Leeuw CN, Liu L, Swanson DJ, Goldowitz D, Holt RA, Simpson EM: Next generation tools for high-throughput promoter and expression analysis employing single-copy knock-ins at the Hprt1 locus. Genomics 2009, 93(3):196-204.
86.
Bray N, Pachter L: MAVID: constrained ancestral alignment of multiple sequences. Genome Res 2004, 14(4):693-699.
87. Brudno M, Malde S, Poliakov A, Do CB, Couronne O, Dubchak I, Batzoglou S: Glocal alignment: finding rearrangements during alignment. Bioinformatics 2003, 19 Suppl 1:i54-62.
88. Schwartz S, Elnitski L, Li M, Weirauch M, Riemer C, Smit A, Green ED, Hardison RC, Miller W: MultiPipMaker and supporting tools: Alignments and analysis of multiple genomic DNA sequences. Nucleic Acids Res 2003, 31(13):3518-3524.
89. Blanchette M, Kent WJ, Riemer C, Elnitski L, Smit AF, Roskin KM, Baertsch R, Rosenbloom K, Clawson H, Green ED, Haussler D, Miller W: Aligning multiple genomic sequences with the threaded blockset aligner. Genome Res 2004, 14(4):708-715.
90. Miller W, Rosenbloom K, Hardison RC, Hou M, Taylor J, Raney B, Burhans R, King DC, Baertsch R, Blankenberg D, Kosakovsky Pond SL, Nekrutenko A, Giardine B, Harris RS, Tyekucheva S, Diekhans M, Pringle TH, Murphy WJ, Lesk A, Weinstock GM, Lindblad-Toh K, Gibbs RA, Lander ES, Siepel A, Haussler D, Kent WJ: 28-way vertebrate alignment and conservation track in the UCSC Genome Browser. Genome Res 2007, 17(12):1797-1808.
91. Kuhn RM, Karolchik D, Zweig AS, Wang T, Smith KE, Rosenbloom KR, Rhead B, Raney BJ, Pohl A, Pheasant M, Meyer L, Hsu F, Hinrichs AS, Harte RA, Giardine B, Fujita P, Diekhans M, Dreszer T, Clawson H, Barber GP, Haussler D, Kent WJ: The UCSC Genome Browser Database: update 2009. Nucleic Acids Res 2009, 37(Database issue):D755-761.
92. Pennacchio LA, Ahituv N, Moses AM, Prabhakar S, Nobrega MA, Shoukry M, Minovitsky S, Dubchak I, Holt A, Lewis KD, Plajzer-Frick I, Akiyama J, De Val S, Afzal V, Black BL, Couronne O, Eisen MB, Visel A, Rubin EM: In vivo enhancer analysis of human conserved non-coding sequences. Nature 2006, 444(7118):499-502.
93.
Denarier E, Forghani R, Farhadi HF, Dib S, Dionne N, Friedman HC, Lepage P, Hudson TJ, Drouin R, Peterson A: Functional organization of a Schwann cell enhancer. J Neurosci 2005, 25(48):11210-11217.
94. Tuason MC, Rastikerdar A, Kuhlmann T, Goujet-Zalc C, Zalc B, Dib S, Friedman H, Peterson A: Separate proteolipid protein/DM20 enhancers serve different lineages and stages of development. J Neurosci 2008, 28(27):6895-6903.
95. Nobrega MA, Ovcharenko I, Afzal V, Rubin EM: Scanning human gene deserts for long-range enhancers. Science 2003, 302(5644):413.
96. Woolfe A, Goodson M, Goode DK, Snell P, McEwen GK, Vavouri T, Smith SF, North P, Callaway H, Kelly K, Walter K, Abnizova I, Gilks W, Edwards YJ, Cooke JE, Elgar G: Highly conserved non-coding sequences are associated with vertebrate development. PLoS biology 2005, 3(1):e7.
97. Lenhard B, Sandelin A, Mendoza L, Engstrom P, Jareborg N, Wasserman WW: Identification of conserved regulatory elements by comparative genome analysis. Journal of biology 2003, 2(2):13.
98. Martin N, Patel S, Segre JA: Long-range comparison of human and mouse Sprr loci to identify conserved noncoding sequences involved in coordinate regulation. Genome Res 2004, 14(12):2430-2438.
99. Emberly E, Rajewsky N, Siggia ED: Conservation of regulatory elements between two species of Drosophila. BMC bioinformatics 2003, 4:57.
100. McGaughey DM, Vinton RM, Huynh J, Al-Saif A, Beer MA, McCallion AS: Metrics of sequence constraint overlook regulatory sequences in an exhaustive analysis at phox2b. Genome Res 2008, 18(2):252-260.
101. Odom D, Dowell R, Jacobsen E, Gordon W, Danford T, Macisaac K, Rolfe P, Conboy C, Gifford D, Fraenkel E: Tissue-specific transcriptional regulation has diverged significantly between human and mouse. Nat Genet 2007, 39:730-732.
102. Stormo GD: DNA binding sites: representation and discovery. Bioinformatics 2000, 16(1):16-23.
103. King OD, Roth FP: A non-parametric model for transcription factor binding sites.
Nucleic Acids Res 2003, 31(19):e116.
104. Bailey TL, Elkan C: Fitting a mixture model by expectation maximization to discover motifs in biopolymers. Proceedings / International Conference on Intelligent Systems for Molecular Biology ; ISMB 1994, 2:28-36.
105. Lawrence CE, Reilly AA: An expectation maximization (EM) algorithm for the identification and characterization of common sites in unaligned biopolymer sequences. Proteins 1990, 7(1):41-51.
106. Lawrence CE, Altschul SF, Boguski MS, Liu JS, Neuwald AF, Wootton JC: Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. Science 1993, 262(5131):208-214.
107. Sandve GK, Abul O, Walseng V, Drablos F: Improved benchmarks for computational motif discovery. BMC bioinformatics 2007, 8:193.
108. Tompa M, Li N, Bailey TL, Church GM, De Moor B, Eskin E, Favorov AV, Frith MC, Fu Y, Kent WJ, Makeev VJ, Mironov AA, Noble WS, Pavesi G, Pesole G, Regnier M, Simonis N, Sinha S, Thijs G, van Helden J, Vandenbogaert M, Weng Z, Workman C, Ye C, Zhu Z: Assessing computational tools for the discovery of transcription factor binding sites. Nature biotechnology 2005, 23(1):137-144.
109. D'Haeseleer P: How does DNA sequence motif discovery work? Nature biotechnology 2006, 24(8):959-961.
110. Abeel T, Van de Peer Y, Saeys Y: Toward a gold standard for promoter prediction evaluation. Bioinformatics 2009, 25(12):i313-320.
111. Narlikar L, Ovcharenko I: Identifying regulatory elements in eukaryotic genomes. Briefings in Functional Genomics and Proteomics 2009:1-16.
112. Jahn O, Tenzer S, Werner HB: Myelin proteomics: molecular anatomy of an insulating sheath. Molecular neurobiology 2009, 40(1):55-72.
113. Jessen KR, Mirsky R: Schwann cells and their precursors emerge as major regulators of nerve development. Trends in neurosciences 1999, 22(9):402-410.
114. Nave KA, Schwab MH: Glial cells under remote control. Nature neuroscience 2005, 8(11):1420-1422.
115.
Svaren J, Meijer D: The molecular machinery of myelin gene transcription in Schwann cells. Glia 2008, 56(14):1541-1551. 116. Meyer zu Horste G, Prukop T, Nave KA, Sereda MW: Myelin disorders: Causes and perspectives of Charcot-Marie-Tooth neuropathy. J Mol Neurosci 2006, 28(1):77-88. 117. Wrabetz L, D'Antonio M, Pennuto M, Dati G, Tinelli E, Fratta P, Previtali S, Imperiale D, Zielasek J, Toyka K, Avila RL, Kirschner DA, Messing A, Feltri ML, Quattrini A: Different intracellular pathomechanisms produce diverse Myelin Protein Zero neuropathies in transgenic mice. J Neurosci 2006, 26(8):2358-2368. 118. Kessaris N, Fogarty M, Iannarelli P, Grist M, Wegner M, Richardson WD: Competing waves of oligodendrocytes in the forebrain and postnatal elimination of an embryonic lineage. Nature neuroscience 2006, 9(2):173-179. 119. Nicolay DJ, Doucette JR, Nazarali AJ: Transcriptional control of oligodendrogenesis. Glia 2007, 55(13):1287-1299. 120. Lu QR, Yuk D, Alberta JA, Zhu Z, Pawlitzky I, Chan J, McMahon AP, Stiles CD, Rowitch DH: Sonic hedgehog--regulated oligodendrocyte lineage genes encoding bHLH proteins in the mammalian central nervous system. Neuron 2000, 25(2):317-329. 121. Zhou Q, Anderson DJ: The bHLH transcription factors OLIG2 and OLIG1 couple neuronal and glial subtype specification. Cell 2002, 109(1):61-73. 122. Fu H, Qi Y, Tan M, Cai J, Takebayashi H, Nakafuku M, Richardson W, Qiu M: Dual origin of spinal oligodendrocyte progenitors and evidence for the cooperative role of Olig2 and Nkx2.2 in the control of oligodendrocyte differentiation. Development 2002, 129(3):681-693. 123. Xin M, Yue T, Ma Z, Wu FF, Gow A, Lu QR: Myelinogenesis and axonal recognition by oligodendrocytes in brain are uncoupled in Olig1-null mice. J Neurosci 2005, 25(6):1354-1365. 124. Samanta J, Kessler JA: Interactions between ID and OLIG proteins mediate the inhibitory effects of BMP4 on oligodendroglial differentiation. Development 2004, 131(17):4131-4142. 125. 
Kondo T, Raff M: Basic helix-loop-helix proteins and the timing of oligodendrocyte differentiation. Development 2000, 127(14):2989-2998. 126. Qi Y, Cai J, Wu Y, Wu R, Lee J, Fu H, Rao M, Sussel L, Rubenstein J, Qiu M: Control of oligodendrocyte differentiation by the Nkx2.2 homeodomain transcription factor. Development 2001, 128(14):2723-2733. 127. Sun T, Dong H, Wu L, Kane M, Rowitch DH, Stiles CD: Cross-repressive interaction of the Olig2 and Nkx2.2 transcription factors in developing neural tube associated with formation of a specific physical complex. J Neurosci 2003, 23(29):9547-9556. 56 128. Liu R, Cai J, Hu X, Tan M, Qi Y, German M, Rubenstein J, Sander M, Qiu M: Region-specific and stage-dependent regulation of Olig gene expression and oligodendrogenesis by Nkx6.1 homeodomain transcription factor. Development 2003, 130(25):6221-6231. 129. Southwood C, He C, Garbern J, Kamholz J, Arroyo E, Gow A: CNS myelin paranodes require Nkx6-2 homeoprotein transcriptional activity for normal structure. J Neurosci 2004, 24(50):11215-11225. 130. Stolt CC, Lommes P, Sock E, Chaboissier MC, Schedl A, Wegner M: The Sox9 transcription factor determines glial fate choice in the developing spinal cord. Genes Dev 2003, 17(13):1677-1689. 131. Stolt CC, Lommes P, Friedrich RP, Wegner M: Transcription factors Sox8 and Sox10 perform non-equivalent roles during oligodendrocyte development despite functional redundancy. Development 2004, 131(10):2349-2358. 132. Stolt CC, Schlierf A, Lommes P, Hillgartner S, Werner T, Kosian T, Sock E, Kessaris N, Richardson WD, Lefebvre V, Wegner M: SoxD proteins influence multiple stages of oligodendrocyte development and modulate SoxE protein function. Dev Cell 2006, 11(5):697-709. 133. Gokhan S, Marin-Husstege M, Yung SY, Fontanez D, Casaccia-Bonnefil P, Mehler MF: Combinatorial profiles of oligodendrocyte-selective classes of transcriptional regulators differentially modulate myelin basic protein gene expression. J Neurosci 2005, 25(36):8311-8321. 
134. Sugimori M, Nagao M, Bertrand N, Parras CM, Guillemot F, Nakafuku M: Combinatorial actions of patterning and HLH transcription factors in the spatiotemporal control of neurogenesis and gliogenesis in the developing spinal cord. Development 2007, 134(8):1617-1629. 135. Petryniak MA, Potter GB, Rowitch DH, Rubenstein JL: Dlx1 and Dlx2 control neuronal versus oligodendroglial cell fate acquisition in the developing forebrain. Neuron 2007, 55(3):417-433. 136. Liu Z, Hu X, Cai J, Liu B, Peng X, Wegner M, Qiu M: Induction of oligodendrocyte differentiation by Olig2 and Sox10: evidence for reciprocal interactions and dosage-dependent mechanisms. Dev Biol 2007, 302(2):683- 693. 137. Tekki-Kessaris N, Woodruff R, Hall AC, Gaffield W, Kimura S, Stiles CD, Rowitch DH, Richardson WD: Hedgehog-dependent oligodendrocyte lineage specification in the telencephalon. Development 2001, 128(13):2545-2554. 138. Nery S, Wichterle H, Fishell G: Sonic hedgehog contributes to oligodendrocyte specification in the mammalian forebrain. Development 2001, 128(4):527-540. 139. Spassky N, Heydon K, Mangatal A, Jankovski A, Olivier C, Queraud-Lesaux F, Goujet-Zalc C, Thomas JL, Zalc B: Sonic hedgehog-dependent emergence of oligodendrocytes in the telencephalon: evidence for a source of oligodendrocytes in the olfactory bulb that is independent of PDGFRalpha signaling. Development 2001, 128(24):4993-5004. 140. Jakovcevski I, Zecevic N: Olig transcription factors are expressed in oligodendrocyte and neuronal cells in human fetal CNS. J Neurosci 2005, 25(44):10064-10073. 57 141. Takebayashi H, Nabeshima Y, Yoshida S, Chisaka O, Ikenaka K: The basic helix-loop-helix factor olig2 is essential for the development of motoneuron and oligodendrocyte lineages. Curr Biol 2002, 12(13):1157-1163. 142. Arnett HA, Fancy SP, Alberta JA, Zhao C, Plant SR, Kaing S, Raine CS, Rowitch DH, Franklin RJ, Stiles CD: bHLH transcription factor Olig1 is required to repair demyelinated lesions in the CNS. 
Science 2004, 306(5704):2111-2115. 143. Baumann N, Pham-Dinh D: Biology of oligodendrocyte and myelin in the mammalian central nervous system. Physiological Reviews 2001, 81(2):871- 927. 144. Readhead C, Hood L: The dysmyelinating mouse mutations shiverer (shi) and myelin deficient (shimld). Behavior genetics 1990, 20(2):213-234. 145. Farhadi HF, Lepage P, Forghani R, Friedman HC, Orfali W, Jasmin L, Miller W, Hudson TJ, Peterson AC: A combinatorial network of evolutionarily conserved myelin basic protein regulatory sequences confers distinct glial- specific phenotypes. The Journal of neuroscience : the official journal of the Society for Neuroscience 2003, 23(32):10214-10223. 146. Dionne N: Structure and function of Module 3, a conserved enhancer of the myelin basic protein gene. Montreal, Quebec: McGill University; 2006. 147. Dib S: Functional Analysis Of The Myelin Basic Protein Gene Regulation. Montreal, Quebec: McGill University; 2008. 148. Huang SS, Fulton DL, Arenillas DJ, Perco P, Ho Sui SJ, Mortimer JR, Wasserman WW: Identification of over-represented combinations of transcription factor binding sites in sets of co-expressed genes. In: Series on Advances in Bioinformatics and Computational Biology Volume 3 - Proceedings of the 4th Asia-Pacific Bioinformatics Conference: 2006; Taipei, Taiwan: Imperial College Press, London UK; 2006: 247- 256. 149. Ho Sui SJ, Mortimer JR, Arenillas DJ, Brumm J, Walsh CJ, Kennedy BP, Wasserman WW: oPOSSUM: identification of over-represented transcription factor binding sites in co-expressed genes. Nucleic Acids Res 2005, 33(10):3154-3164. 150. Ho Sui SJ, Fulton DL, Arenillas DJ, Kwon AT, Wasserman WW: oPOSSUM: integrated tools for analysis of regulatory motif over-representation. Nucleic Acids Res 2007, 35(Web Server issue):W245-252. 151. Fulton DL, Sundararajan S, Badis G, Hughes TR, Wasserman WW, Roach JC, Sladek R: TFCat: the curated catalog of mouse and human transcription factors. Genome Biol 2009, 10(3):R29. 152. 
Blanco E, Farre D, Alba MM, Messeguer X, Guigo R: ABS: a database of Annotated regulatory Binding Sites from orthologous promoters. Nucleic Acids Res 2006, 34(Database issue):D63-67. 153. Sandelin A, Alkema W, Engstrom P, Wasserman WW, Lenhard B: JASPAR: an open-access database for eukaryotic transcription factor binding profiles. Nucleic acids research 2004, 32(Database issue):D91-94. 154. Griffith OL, Montgomery SB, Bernier B, Chu B, Kasaian K, Aerts S, Mahony S, Sleumer MC, Bilenky M, Haeussler M, Griffith M, Gallo SM, Giardine B, Hooghe B, Van Loo P, Blanco E, Ticoll A, Lithwick S, Portales-Casamar E, Donaldson IJ, Robertson G, Wadelius C, De Bleser P, Vlieghe D, Halfon MS, Wasserman W, Hardison R, Bergman CM, Jones SJ: ORegAnno: an open- 58 access community-driven resource for regulatory annotation. Nucleic Acids Res 2008, 36(Database issue):D107-113. 155. Portales-Casamar E, Kirov S, Lim J, Lithwick S, Swanson MI, Ticoll A, Snoddy J, Wasserman WW: PAZAR: a framework for collection and dissemination of cis-regulatory sequence annotation. Genome Biol 2007, 8(10):R207. 156. Matys V, Fricke E, Geffers R, Gossling E, Haubrock M, Hehl R, Hornischer K, Karas D, Kel AE, Kel-Margoulis OV, Kloos DU, Land S, Lewicki-Potapov B, Michael H, Munch R, Reuter I, Rotert S, Saxel H, Scheer M, Thiele S, Wingender E: TRANSFAC: transcriptional regulation, from patterns to profiles. Nucleic Acids Res 2003, 31(1):374-378. 157. Frith MC, Hansen U, Weng Z: Detection of cis-element clusters in higher eukaryotic DNA. Bioinformatics 2001, 17(10):878-889. 158. Frith MC, Spouge JL, Hansen U, Weng Z: Statistical significance of clusters of motifs represented by position specific scoring matrices in nucleotide sequences. Nucleic Acids Res 2002, 30(14):3214-3224. 159. Frith MC, Li MC, Weng Z: Cluster-Buster: Finding dense clusters of motifs in DNA sequences. Nucleic Acids Res 2003, 31(13):3666-3668. 160. 
2. Identification of Over-represented Combinations of Transcription Factor Binding Sites in Sets of Co-expressed Genes⁹

2.1. Chapter preamble

A eukaryotic gene’s regulatory program often involves coordinate binding of multiple transcription factors (TFs) that enable and/or inhibit gene transcription. This chapter describes the development of an algorithm, called Combination Site Analysis (CSA), which identifies over-represented combinations of TF binding sites (cis-regulatory modules, CRMs) in the promoter regions of co-expressed genes to infer TF cooperative control.
Validation of the algorithm with reference collections of co-regulated genes demonstrated its ability to identify known CRMs. This tool was made available to the community through a web-based interface.

⁹ A version of this chapter has been published. Huang SS*, Fulton DL*, Arenillas DJ, Perco P, Ho Sui SJ, Mortimer JR, Wasserman WW. (2006) Identification of over-represented combinations of transcription factor binding sites in sets of co-expressed genes. In: Series on Advances in Bioinformatics and Computational Biology Volume 3 - Proceedings of the 4th Asia-Pacific Bioinformatics Conference; Taipei, Taiwan: Imperial College Press, London UK; 247-256. *Joint first authors.

2.2. Introduction

The interaction between transcription factor (TF) proteins and transcription factor binding sites (TFBS) is an important mechanism in regulating gene expression. Each cell in the human body expresses genes in response to its developmental state (e.g. tissue type), external signals from neighboring cells, and environmental stimuli (e.g. stress, nutrients). Diverse regulatory mechanisms have evolved to facilitate the programming of gene expression, with a primary mechanism being TF-mediated modulation of the rate of transcript initiation. Given a finite collection of protein structures capable of binding to specific DNA sequences and the diversity of conditions to which cells must respond, it is logical and well-documented that combinatorial interplay between TFs drives much of the observed specificity of gene expression. The arrays of TFBS at which the interactions occur are often termed cis-regulatory modules (CRM) [1].

The sequence specificity of TFs has stimulated development of computational methods for discovery of TFBS on DNA sequences. Well established methods represent aligned collections of TFBS as position weight matrices (PWM).
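To make the PWM representation concrete, the following minimal sketch builds a log-odds matrix from a handful of aligned sites, then computes an information content and a relative match score. The sites, the uniform background, the pseudocount, and the rescaling of the score to a 0..1 range are all illustrative assumptions; this is not the JASPAR or oPOSSUM implementation.

```python
import math

# Hypothetical aligned binding sites (illustrative only, not a JASPAR profile).
SITES = ["TATAAA", "TATATA", "TATAAG", "CATAAA"]
BASES = "ACGT"
BACKGROUND = 0.25   # assumption: uniform base composition
PSEUDOCOUNT = 0.5   # assumption: small pseudocount to avoid log(0)


def count_matrix(sites):
    """Count each base at each aligned position."""
    length = len(sites[0])
    return [{b: sum(s[i] == b for s in sites) for b in BASES}
            for i in range(length)]


def weight_matrix(counts, n_sites):
    """Convert counts to log2-odds weights against the background."""
    total = n_sites + 4 * PSEUDOCOUNT
    return [{b: math.log2(((col[b] + PSEUDOCOUNT) / total) / BACKGROUND)
             for b in BASES}
            for col in counts]


def information_content(counts, n_sites):
    """Sum over columns of (2 bits - entropy), using raw frequencies."""
    total = 0.0
    for col in counts:
        entropy = 0.0
        for b in BASES:
            p = col[b] / n_sites
            if p > 0:
                entropy -= p * math.log2(p)
        total += 2.0 - entropy
    return total


def score(pwm, seq):
    """Sum the weight of the observed base at every position."""
    return sum(col[base] for col, base in zip(pwm, seq))


def relative_score(pwm, seq):
    """Rescale the score to 0..1 between the worst and best possible site."""
    best = sum(max(col.values()) for col in pwm)
    worst = sum(min(col.values()) for col in pwm)
    return (score(pwm, seq) - worst) / (best - worst)


COUNTS = count_matrix(SITES)
PWM = weight_matrix(COUNTS, len(SITES))
```

A relative score of this kind is one common way to express a percentage threshold on matrix matches; whether it matches the exact scoring used in the tools discussed here is not confirmed by the text.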
The sequence specificity of individual PWM profiles can be quantified by their information content, and a PWM score provides a quantitative measure of a sequence’s similarity to the binding profile (for review see Wasserman and Sandelin [2]). Searching for high scoring motifs in putative regulatory sequences with a collection of profiles (for instance, JASPAR [3]) can predict binding sites within a sequence and the interacting TFs. However, this methodology is plagued by poor specificity due to the short and variable nature of the TFBS. Phylogenetic footprinting filters have been demonstrated repeatedly to improve specificity [4]. Such filters are justified by the hypothesis that sequences of biological importance are under higher selective pressure and will thus accumulate DNA sequence changes at a slower rate than other sequences. Based on this expectation, the search for potential TFBS can be limited to the most similar non-coding regions of aligned orthologous gene sequences from species of suitable evolutionary distance. Further, one might expect that genes that are coordinately expressed are under the control of the same TFs, suggesting that over-represented TFBS in the co-expressed genes are likely to be functional. These concepts are implemented by Ho Sui et al. in the web service tool oPOSSUM [5], which, when given a set of co-expressed genes, can identify the TFBS motifs that are over-represented with respect to a background set of genes. This approach has achieved success in finding binding sites known to contribute to the regulation of reference gene sets.

Prior methods that attempt to address the known interplay between TFs at CRMs can be difficult to interpret [6-8]. We introduce a new approach rooted in the biochemical properties of TFs, which allows greater computational efficiency and improved interpretation of results.
The resulting method is assessed against diverse reference data to demonstrate its utility for the applied analysis of gene expression data.

2.3. Results

2.3.1. Overview and rationale of oPOSSUM II algorithm

Finding over-represented combinations of TFBS presents several new issues that are not encountered in single site analysis. We address two of the main challenges: computational complexity and TFBS class redundancy. Firstly, the number of possible combinations of size n from m TFBS (n ≤ m) increases combinatorially with respect to both m and n, which greatly impacts computing time. Secondly, several TFs have similar binding properties, thus subsets of profiles may be effectively redundant. Consequently, an exhaustive search is not an efficient method to find over-represented combinations of patterns.

To address both problems we introduced two approaches. Firstly, we used a novel method to group the profiles into classes. Rather than using protein sequence similarity, a hierarchical clustering procedure was applied to group the profiles into classes according to their quantitative similarity. One representative member was selected from each class for further analysis. We then searched for the occurrences of class combinations in both co-regulated genes (foreground) and a set of background genes. Secondly, we considered unordered combinations and applied an inter-binding site distance (IBSD) constraint to avoid exhaustive enumeration of all combinations, since many co-operative TFBS are found to occur in clusters without strict ordering constraints [1]. Thus, we need only consider each set of TFBS where all IBSDs satisfy the distance parameter. This approach can dramatically reduce the search space when evaluating any combination size. A scoring scheme was adopted from the Fisher exact test to compare the degree of over-representation of the class combinations.
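The IBSD-constrained search for unordered combinations described above can be sketched as follows for pairs of classes. This is a minimal illustration under stated assumptions: the site tuples `(class_name, start, end)` with inclusive coordinates, the gene names, and the decision to skip overlapping sites are hypothetical choices made for the example, not details taken from the oPOSSUM II code.

```python
def class_pairs_in_gene(sites, max_d):
    """Return unordered pairs of distinct classes whose sites lie within max_d bp.

    sites: list of (class_name, start, end) tuples with inclusive coordinates.
    """
    ordered = sorted(sites, key=lambda s: s[1])
    found = set()
    for i, (cls_a, _, end_a) in enumerate(ordered):
        for cls_b, start_b, _ in ordered[i + 1:]:
            gap = start_b - end_a - 1  # base pairs between the two sites
            if gap > max_d:
                break  # sorted by start, so every later site is farther away
            if gap < 0:
                continue  # assumption: overlapping sites are not paired
            if cls_a != cls_b:
                found.add(frozenset((cls_a, cls_b)))
    return found


def count_genes_with_pairs(genes, max_d):
    """Count, for each class pair, the number of genes that contain it."""
    counts = {}
    for sites in genes.values():
        for pair in class_pairs_in_gene(sites, max_d):
            counts[pair] = counts.get(pair, 0) + 1
    return counts


# Hypothetical predicted sites for two genes (illustrative only).
GENES = {
    "geneA": [("MEF2", 10, 19), ("SP1", 60, 69), ("SRF", 400, 409)],
    "geneB": [("SP1", 5, 14), ("MEF2", 100, 109)],
}
PAIR_COUNTS = count_genes_with_pairs(GENES, max_d=100)
```

Because each gene is counted at most once per pair, running the same function on the foreground and background gene sets yields the gene counts that a 2×2 over-representation test needs. Larger combination sizes would require extending the inner loop, which is omitted here.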
The highly over-represented class combinations were re-assessed using all possible profile combinations within the indicated classes. The overall scheme of oPOSSUM II analysis is shown in Figure 2.1. The sections below describe the details of each step.

2.3.2. TFBS classification

Three human reference sets were utilized to validate the oPOSSUM II algorithm: two independently derived skeletal muscle gene sets and a set of smooth muscle-specific genes. Each of the analyses was restricted to the prediction of vertebrate TFBS profiles. A TFBS profile clustering step is implemented in the oPOSSUM II analysis to identify profiles that possess similar binding properties. We clustered the vertebrate TFBS profiles (see Methods) and cut the resulting hierarchical dendrogram at a height threshold of 0.45 ($\mathrm{thr}_H = 0.45$) to produce clusters that, in most cases, correlated well with structural families in JASPAR (cluster tree available in web supplement). Most notably, binding profiles from the FORKHEAD, HMG and ETS families were grouped according to their structural classifications. However, as anticipated, the zinc finger profiles were dispersed into multiple groupings due to the divergent protein binding preferences in this structural class. Using this approach, 68 vertebrate TFBS in JASPAR were partitioned into 32 classes. This TF classification step facilitates the detection of over-represented TF class combinations in a set of co-expressed genes, and provides a significant reduction in the combinatorial search space.

2.3.3. Validation with reference data sets

2.3.3.1. Yeast CLB2 cluster

The yeast CLB2 gene cluster dataset contains genes whose transcription peaks at late G2/early M phase of the cell cycle. The transcription of these genes is regulated by the TF FKH, a component of the TF SFF complex, which interacts with the TF protein MCM1.
Each of the top ten scoring class combinations identified by oPOSSUM II included the binding sites of the ECB class, of which MCM1 is a member. The highest ranked TFBS combination was ECB/FKH1, which is consistent with experimental evidence and a recent TFBS analysis performed by Kreiman et al. [7]. The full set of analysis results is available on the oPOSSUM II supplementary web site.

2.3.3.2. Three human reference gene sets

Prior studies involving muscle set 1 [9] have identified clusters of muscle regulatory sites, which include MEF2, SRF, Myf/MyoD, SP1 and TEF. Figure 2.2 lists the top five over-represented class combinations for each of the three human muscle reference gene sets analyzed. Each of the score values for these combinations fell below 2.0E-3. Also listed are the five most over-represented TFBS classes, as reported by oPOSSUM single site analysis.

The classes that contain MEF2 and SP1 dominated the top combinations in both skeletal muscle sets (Figure 2.2a and Figure 2.2b). The TF Yin-Yang modulates SRF-dependent, skeletal muscle expression. Thing1-E47 is a bHLH TF localized to gut smooth muscle in adult mice. The prediction of this TFBS class may suggest binding of other bHLH myogenic factors (such as Myf). Bsap and MZF are not muscle-specific TFs. The Bsap motif is long (20 bp) and exhibits an unusual pattern of low information content distributed across the entire motif, suggesting that it may behave differently than other binding profiles. The inclusion of this profile in the JASPAR database is under review (B. Lenhard, personal communication).

Analysis of the smooth muscle genes resulted in an SRF class prediction in each of the top five combinations, consistent with previous gene regulatory studies in muscle [10]. The top combination, SP1/SRF, is required for the expression of smooth muscle myosin heavy chain in rat. Yin-Yang has been shown to stimulate smooth muscle growth.
Spz1 acts in spermatogenesis and has no known role in muscle expression.

For all three reference sets, the top scoring combinations highlighted new TFBS classes not found by the oPOSSUM single site analysis algorithm. Furthermore, several relevant muscle TFBS were identified exclusively in the oPOSSUM II combination site analysis.

2.3.4. Effect of set size on false positive rate

Random samples of foreground genes were analyzed with the oPOSSUM II algorithm to evaluate false positive prediction rates relative to input gene set size (Figure 2.3). Our analyses suggest that false prediction rates made by the oPOSSUM II algorithm are independent of input gene set size. We also noted that at low score values, the proportion of false positives is low.

2.3.5. Web interface

The oPOSSUM II web service is available at http://www.cisreg.ca/oPOSSUM_CSA/opossum2.php. A user provides a set of putatively co-expressed genes as input and specifies the parameter values to be used in the analysis. Certain parameter values may produce lengthy runtimes. To accommodate this possibility, the web service queues the analysis request and issues e-mail notification once the analysis is complete.

2.4. Discussion

The analysis of over-represented combinations of TFBS in the promoters of co-expressed genes is motivated by biochemical and genetic studies which reveal the functional importance of cis-regulatory modules. In contrast to previously described methods which identify single over-represented motifs, the analysis of combinations must solve or circumvent the consequences of a combinatorial explosion, which can precipitate prohibitive runtimes. To reduce the search space, oPOSSUM II restricts its analysis to binding site combinations using biologically justifiable criteria, namely TFBS profile similarity and an inter-binding site distance¹⁰ (IBSD) constraint (see Methods).

¹⁰ Inter-binding site distance refers to the number of base pairs between two TFBS.
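The scale of the combinatorial explosion, and the reduction obtained by working with classes rather than individual profiles, can be illustrated with a short calculation. The sketch below uses the 68 vertebrate profiles and 32 classes reported in Section 2.3.2; the combination sizes shown are arbitrary choices for illustration, and the counts ignore the further reduction from the IBSD constraint.

```python
from math import comb

N_PROFILES = 68  # vertebrate JASPAR profiles reported in Section 2.3.2
N_CLASSES = 32   # classes obtained after clustering those profiles


def search_space(size):
    """Unordered combinations of a given size: (all profiles, class representatives)."""
    return comb(N_PROFILES, size), comb(N_CLASSES, size)


for size in (2, 3, 4):
    full, reduced = search_space(size)
    print(f"size {size}: {full} profile combinations, {reduced} class combinations")
```

For pairs this gives 2278 profile combinations against 496 class combinations, and the gap widens quickly as the combination size grows.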
Our results suggest two important contributions over the existing single-site TFBS over-representation methods. Firstly, for each reference gene set analyzed, at least one relevant TF class appeared in multiple combinations, an observation that is not immediately obvious in a single site analysis. Secondly, the algorithm can discover functional TFBS that are not highlighted in a single site analysis. For example, oPOSSUM II analysis of the yeast CLB2 gene cluster predicted ECB and FKH1 as a top scoring combination pair, yet analysis of the same dataset by the single site analysis algorithm ranked these predicted TFs as first and eleventh, respectively. Similarly, the SRF and SP1 TFBS combination is reported as most significant by an oPOSSUM II analysis of the smooth muscle reference set, and these TFBS are ranked first and fourteenth in a single site analysis. These results demonstrate the power of combination site analysis. Furthermore, oPOSSUM II analyses of the microarray-based skeletal muscle reference set correctly predicted the cooperativity of MEF2 and SP1 TFs in myogenesis, which confirms the utility of incorporating a high-quality microarray dataset with a combination site analysis approach.

While our analysis results for the yeast CLB2 cluster are comparable to those reported by Kreiman et al. [7], there are significant differences between the methods applied by the two studies. The Kreiman study applies a motif similarity approach to discard redundant PWMs before searching for modules (CRMs). oPOSSUM II instead clusters PWMs to identify groups of similar profiles (classes), selects the PWM (centroid) that best represents each cluster, and initially identifies over-represented combinations of TF classes. Over-represented classes are then expanded to evaluate all relevant TFBS combinations for over-representation.
The Kreiman study reports the top scoring combinations for the Wasserman and Fickett skeletal muscle collection as SP1, SRF, TEF, and a novel motif, while oPOSSUM II identified Mef2, Myf, Srf, and Sp1 in the top scoring pairs.

A few issues should be considered in future research. Firstly, the interpretation of most PWM-based TFBS analyses is confounded by intra-class binding similarity. While this property facilitates the oPOSSUM II approach, users are left to determine which TF protein family member could be acting in the tissue and/or condition under study. For instance, over-representation of an E-box motif in the skeletal muscle analysis does not specifically highlight the MyoD TF protein; the user must consider the entire range of bHLH-domain TFs. Secondly, inter-class similarity can influence the CRM predictions. Although oPOSSUM II does not evaluate overlapping redundant TFBS combinations, unique TFBS combination predictions, which may overlap other binding site combinations, are included in the analysis. Thus, two G-rich motifs may be reported as over-represented in different combinations (for instance, the SP1 and MZF motifs in Figure 2.2c) but highlight the same candidate TFBS. A related issue is the compositional sequence bias in tissue-specific genes [11], which could be addressed through provision of a background gene set that possesses a similar sequence content bias. Finally, depending on the parameters selected, computing time requirements can be prohibitively long for a synchronous web service. Parallelization of the algorithm would be a natural way to improve the running time.

oPOSSUM II utilizes putative TFBS identified from comparative genomic analysis, in conjunction with knowledge of co-regulated expression, to search for functional combinations of TFBS that may confer a given gene expression pattern. It uses a novel scheme to classify similar binding site profiles.
Using this clustering approach, the oPOSSUM II method is able to circumvent the combinatorial challenge associated with the identification of significant TFBS combinations. Furthermore, the application of an IBSD constraint limits the number of possible combinations to analyze. Validation results suggest that a TFBS combination site analysis can provide valuable information that is not available through identification of single over-represented TFBS.

2.5. Methods

2.5.1. Background: the oPOSSUM database

Ho Sui et al. [5] describe the creation of the oPOSSUM database, which stores predicted, evolutionarily conserved TFBS to support over-representation analysis of TFBS for single TFs. Briefly, human-mouse orthologs are retrieved from Ensembl. TFBS profiles from the JASPAR database are used to identify putative TFBS within the conserved non-coding regions from 5000 base pairs (bp) upstream to 5000 bp downstream of the annotated transcription start site (TSS) on both strands. The oPOSSUM database stores the start and end positions and the matrix match score of each predicted TFBS for four score thresholds: 70, 75, 80, and 85%. This database is used as input to the oPOSSUM II algorithm to search for over-represented TFBS combinations (described below).

2.5.2. TFBS in foreground gene set

When presented with a set of co-expressed genes S, oPOSSUM II queries the oPOSSUM database for all putative TFBS T present in S within a maximum of 5000 bp upstream and 5000 bp downstream from the TSS of each gene. The analysis may be restricted to those TFs found in selected taxonomic subgroups (plant, vertebrate and insect are currently available), or to TFs whose profiles exceed a minimum information content.

2.5.3. Classification of TFBS profiles

Binding profiles for T were retrieved from the JASPAR database.
A profile comparison algorithm, either CompareACE [12] (default) or matrix aligner [13], was used to calculate the pairwise similarity scores of all the profiles using profile alignment methods. The similarity score $s(t_i, t_j)$ between profiles $t_i$ and $t_j$ was converted to a distance $d(t_i, t_j) = 1 - s(t_i, t_j)$. A distance matrix M was formed from these pairwise distances. From M, an agglomerative clustering procedure produced a hierarchy of clusters (subsets) of T. The complete linkage method was used. Cutting the cluster tree at a specified height $\mathrm{thr}_H$ partitioned T into classes.

2.5.4. Selection of TFBS and enumeration of combinations

For each class C, we selected the profile that is the most similar to the other profiles in C as the class representative. We chose this approach as we could not identify an adequate procedure that would generate a consensus profile with comparable specificity to the matrices within the class. To identify the class representative, we first calculated the sum of pairwise similarity scores $\sigma_i$ between a profile $t_i$ and the other profiles in C, i.e., $\sigma_i = \sum_{t_i, t_j \in C} s(t_i, t_j)$. The profile with the maximum sum of similarity scores was chosen.

From the selected TFBS, unordered combinations of specified size (cardinality) were created. The foreground gene set (the co-expressed genes) and the background gene set (default is all the genes in the database) were searched for occurrences of these combinations. Let $\mathrm{max}_d$ be the maximum inter-binding site distance. For each gene, occurrences of the combinations were found using a sliding window of width equal to $\mathrm{max}_d$ within the required search region. The numbers of genes containing each combination in the foreground and background gene sets were counted.

2.5.5. Scoring of combinations

The Fisher exact test detects the non-random association between two categorical variables.
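As a minimal sketch of how such a test can be computed, consider a 2×2 table of gene counts: genes with and without a given combination, in the foreground and in the background. The code below evaluates the hypergeometric probability of a table with fixed row and column sums, then sums the probabilities of all tables that are no more probable than the observed one (the usual two-sided convention). Function names and the small tie tolerance are assumptions made for the example; this is not the oPOSSUM II implementation.

```python
from math import comb


def table_probability(a11, a12, a21, a22):
    """Hypergeometric probability of one 2x2 table given its row and column sums."""
    r1, r2 = a11 + a12, a21 + a22
    c1 = a11 + a21
    return comb(r1, a11) * comb(r2, a21) / comb(r1 + r2, c1)


def fisher_score(a11, a12, a21, a22):
    """Sum the probabilities of all tables no more probable than the observed one."""
    r1, r2 = a11 + a12, a21 + a22
    c1 = a11 + a21
    p_observed = table_probability(a11, a12, a21, a22)
    total = 0.0
    # x ranges over every feasible top-left cell for the fixed margins
    for x in range(max(0, c1 - r2), min(r1, c1) + 1):
        p = table_probability(x, r1 - x, c1 - x, r2 - (c1 - x))
        if p <= p_observed * (1 + 1e-9):  # tolerance for floating-point ties
            total += p
    return total


# Hypothetical counts: 12 of 20 foreground genes and 300 of 10000 background
# genes contain the combination (illustrative numbers only).
EXAMPLE_SCORE = fisher_score(12, 8, 300, 9700)
```

Smaller values indicate stronger over-representation in the foreground, so sorting combinations by this value in ascending order gives a ranking.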
We adopted the Fisher P-values to rank the significance of the non-random association between the occurrence of a combination and membership in the foreground gene set, i.e., over-representation of the combination in the foreground compared to the background. For each combination, a two-dimensional contingency table was constructed from the foreground and background count distributions:

              Number of genes with      Number of genes without
              a given combination       a given combination
Foreground    $a_{11}$                  $a_{12}$
Background    $a_{21}$                  $a_{22}$

For $i, j = 1, 2$, the row sums are $R_i = a_{i1} + a_{i2}$, the column sums are $C_j = a_{1j} + a_{2j}$, and the total count is $N = \sum_i R_i = \sum_j C_j$. From the hypergeometric probability function, the conditional probability $P_{cutoff}$, given the row and column sums, is:

$$P_{cutoff} = \frac{(C_1!\,C_2!)(R_1!\,R_2!)}{N!\,\prod_{i,j} a_{ij}!}$$

We calculated the P-values for all other possible contingency tables with row sums equal to $R_i$ and column sums equal to $C_j$. The Fisher P-value is the sum of all the P-values less than or equal to $P_{cutoff}$.

Caution must be exercised when interpreting these Fisher P-values. Firstly, the foreground and background genes are allowed to overlap, which is a violation of an assumption for the statistical test. Secondly, the Fisher exact test model may not precisely characterize the data sets being analyzed. As a result, the Fisher P-values were used purely as a measure to compare the degree of over-representation between different combinations. We will hereafter refer to the P-values as “scores”. Although the scores do not describe the probabilistic nature of the over-representation, the ranking they provide is shown to be useful [5].

2.5.6. Finding significant TFs from over-represented class combinations

Let $\mathrm{thr}_C$ be the maximum “score” for which a TFBS combination may be considered significant. Our empirical studies of reference collections suggested that a default maximum score value of 0.01 detects relevant TF combinations. Let
x i be any TFBS class combination with a score less than or equal to ! thr H and X is the set of distinct class combinations that satisfys the score threshold: ! X = {x i | score(x i ) " thrC} . For each combination ! x i , let each of ! C 1 ,C 2 ,…,C h be a set of TFBS profiles that are represented by each of the h class profiles in that combination. Compute the Cartesian product ! Cp of ! C 1 ,C 2 ,…,C h . We call this “expanding the TFBS classes” from the class representatives. The enumeration and ranking procedures were repeated for the h-tuples in ! Cp . 73 2.5.7. Random sampling simulations of foreground genes The oPOSSUM II algorithm accommodates input gene sets of different cardinalities. To investigate the relationship between gene set size and the false positive rate. 100 random samples of r genes were selected from the background and provided as input to oPOSSUM II as foreground genes. For each sample, oPOSSUM II reported the scores for all the class combinations. As random samples of genes are not expected to be co-regulated, any predicted combination was a false positive. Let ! (0,Max s ] be the interval over which false positives are accumulated. We recorded the number of false positive class combinations for a range of maxs when r = 20,40,60,80,100. 2.5.8. Validation Three reference sets of human genes were used as input to oPOSSUM II to assess the performance of the algorithm. Two independent sets of skeletal muscle genes were tested. The first set (muscle set 1) was compiled from the reference collection identified by Wasserman and Fickett [9] and updated by a review of recent literature. A second set (muscle set 2) combines the results of microarray studies of Moran et al. [14] and Tomczak et al. [15]. The third set contains smooth muscle-specific genes experimentally verified by Nelander et al. [16]. All sets were validated with ! max d =100 , TFBS matrix score threshold = 75%, and conservation level = 1. 
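To make the role of the maximum inter-binding site distance concrete, the following minimal sketch tests whether one gene contains every class of a combination inside a single window of width $max_d$, and then counts such genes in a gene set. It assumes that each predicted site is a hypothetical `(position, class_id)` pair, that a combination consists of distinct classes, and that distance is measured between site positions; the actual oPOSSUM II search may differ in these details.

```python
def gene_has_combination(sites, combination, max_d):
    """Return True if all classes in `combination` occur within one window.

    sites       -- iterable of (position, class_id) pairs for one gene
    combination -- iterable of distinct class identifiers
    max_d       -- window width (maximum distance between site positions)
    """
    wanted = set(combination)
    hits = sorted((pos, cls) for pos, cls in sites if cls in wanted)

    counts = {}   # classes currently inside the window, with multiplicities
    start = 0     # index of the left-most site still inside the window
    for pos, cls in hits:
        counts[cls] = counts.get(cls, 0) + 1
        # Shrink the window from the left until it is at most max_d wide.
        while pos - hits[start][0] > max_d:
            old = hits[start][1]
            counts[old] -= 1
            if counts[old] == 0:
                del counts[old]
            start += 1
        if len(counts) == len(wanted):
            return True
    return False


def count_genes_with_combination(gene_sites, combination, max_d):
    """Count genes (dict: gene id -> site list) that contain the combination."""
    return sum(
        gene_has_combination(sites, combination, max_d)
        for sites in gene_sites.values()
    )
```

Applying `count_genes_with_combination` to the foreground and to the background gene sets gives the two counts of genes with a combination; subtracting each from the size of its gene set gives the counts without it.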
We compared our oPOSSUM II analysis results to the results reported in the Kreiman study [7]. This comparison included analysis of the yeast CLB2 gene cluster [17]. We used the yeast oPOSSUM database (Ho Sui, unpublished) as input to the oPOSSUM II algorithm to perform the analysis.

All supplementary information is available at: www.cisreg.ca/oPOSSUM_CSA/supplement/.

Figure 2.1. Overview of the oPOSSUM II analysis algorithm
Processing steps are numbered in the order executed. The database of predicted TFBS is identical to that of the oPOSSUM analysis system (Ho Sui et al. [5]).

Figure 2.2. The top five over-represented pair combinations of TFBS classes for muscle reference sets
The top five over-represented pair combinations of TFBS classes reported by oPOSSUM II and over-represented single TFBS sites reported by oPOSSUM for the skeletal and smooth muscle sets. The numbers are the class identifiers and enclosed in parentheses is the name of a TF within that class, which is either known to mediate transcription in the assessed tissue (*) or is a class representative.

Figure 2.3. Gene set size and false positive rate
Effect of gene set size on false positive rate observed from pairwise TFBS combinations in randomly generated foreground gene sets.

2.6. References

1. Arnone MI, Davidson EH: The hardwiring of development: organization and function of genomic regulatory systems. Development 1997, 124(10):1851-1864.
2. Wasserman WW, Sandelin A: Applied bioinformatics for the identification of regulatory elements. Nat Rev Genet 2004, 5(4):276-287.
3. Sandelin A, Alkema W, Engström P, Wasserman WW, Lenhard B: JASPAR: an open-access database for eukaryotic transcription factor binding profiles. Nucleic Acids Res 2004, 32(Database issue):D91-94.
4. Lenhard B, Sandelin A, Mendoza L, Engstrom P, Jareborg N, Wasserman WW: Identification of conserved regulatory elements by comparative genome analysis. J Biol 2003, 2(2):13.
5. Ho Sui SJ, Mortimer JR, Arenillas DJ, Brumm J, Walsh CJ, Kennedy BP, Wasserman WW: oPOSSUM: identification of over-represented transcription factor binding sites in co-expressed genes. Nucleic Acids Res 2005, 33(10):3154-3164.
6. Bluthgen N, Kielbasa SM, Herzel H: Inferring combinatorial regulation of transcription in silico. Nucleic Acids Res 2005, 33(1):272-279.
7. Kreiman G: Identification of sparsely distributed clusters of cis-regulatory elements in sets of co-expressed genes. Nucleic Acids Res 2004, 32(9):2889-2900.
8. Sharan R, Ben-Hur A, Loots GG, Ovcharenko I: CREME: Cis-Regulatory Module Explorer for the human genome. Nucleic Acids Res 2004, 32(Web Server issue):W253-256.
9. Wasserman WW, Fickett JW: Identification of regulatory regions which confer muscle-specific gene expression. J Mol Biol 1998, 278(1):167-181.
10. Madsen CS, Hershey JC, Hautmann MB, White SL, Owens GK: Expression of the smooth muscle myosin heavy chain gene is regulated by a negative-acting GC-rich element located between two positive-acting serum response factor-binding elements. J Biol Chem 1997, 272(10):6332-6340.
11. Yamashita R, Suzuki Y, Sugano S, Nakai K: Genome-wide analysis reveals strong correlation between CpG islands with nearby transcription start sites of genes and their tissue specificity. Gene 2005, 350(2):129-136.
12. Hughes JD, Estep PW, Tavazoie S, Church GM: Computational identification of cis-regulatory elements associated with groups of functionally related genes in Saccharomyces cerevisiae. J Mol Biol 2000, 296(5):1205-1214.
13. Sandelin A, Hoglund A, Lenhard B, Wasserman WW: Integrated analysis of yeast regulatory sequences for biologically linked clusters of genes. Funct Integr Genomics 2003, 3(3):125-134.
14. Moran JL, Li Y, Hill AA, Mounts WM, Miller CP: Gene expression changes during mouse skeletal myoblast differentiation revealed by transcriptional profiling. Physiol Genomics 2002, 10(2):103-111.
15. Tomczak KK, Marinescu VD, Ramoni MF, Sanoudou D, Montanaro F, Han M, Kunkel LM, Kohane IS, Beggs AH: Expression profiling and identification of novel genes involved in myogenic differentiation. Faseb J 2004, 18(2):403-405.
16. Nelander S, Mostad P, Lindahl P: Prediction of cell type-specific gene modules: identification and initial characterization of a core set of smooth muscle-specific genes. Genome Res 2003, 13(8):1838-1854.
17. Spellman PT, Sherlock G, Zhang MQ, Iyer VR, Anders K, Eisen MB, Brown PO, Botstein D, Futcher B: Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol Biol Cell 1998, 9(12):3273-3297.

3. oPOSSUM: Integrated Tools for Analysis of Regulatory Motif Over-Representation¹¹

3.1. Chapter preamble

Gene promoters contain cis-regulatory DNA elements that are required for transcription initiation. Alternative gene promoters provide regulatory control plasticity, enabling activation of specific gene expression programs under varying environmental conditions, temporal states, and tissue types. This chapter describes the identification of alternative promoters for human and mouse genes and extension of the original Combination Site Analysis (CSA) algorithm, described in Chapter 2, to identify over-represented TFBS combinations in regions surrounding alternative gene promoters. The revised CSA algorithm demonstrated a marked improvement in recovery of CRMs known to regulate gene reference collections.

3.2. Introduction

Functional genomics research often generates lists of genes with observed common properties, such as coordinated expression. For many studies, a key challenge is the generation of relevant and testable hypotheses about the regulatory networks and pathways that underlie observed co-expression.
Our strategy for elucidating regulatory mechanisms identifies over-represented sequence motifs that are present in the upstream regulatory regions of genes. The motifs may represent transcription factor binding sites (TFBSs) that have a role in regulating expression. oPOSSUM [1] and oPOSSUM2 [2] were developed to identify over-represented, predicted TFBSs and combinations of predicted TFBSs, respectively, in sets of human and mouse genes. The user inputs a list of related genes, selects the TFBS profile set to be included in the analysis, and the algorithm determines which, if any, predicted TFBSs occur in the promoters of the set of input genes more often than would be expected by chance. Both analytic approaches rely on a database of aligned, orthologous human and mouse sequences, and the delineation of conserved regions within which TFBS predictions are analyzed. While the approach does not explicitly address uncharacterized transcription factors (TFs), the effective coverage is broadened by the fact that members within certain structural families of TFs can exhibit similarities in binding specificity. Although intra-class similarity is not universal, as exemplified by the zinc-finger family of TFs [3], the observation holds true for many TF families [4, 5].

Here we describe the new release of the oPOSSUM system, which integrates the two previously developed applications, and has been expanded to accommodate new species (yeast and worms). It also includes new methods for orthology assignment, transcription start site (TSS) determination, and sequence alignment.

¹¹ A version of this chapter has been published. Ho Sui SJ, Fulton DL, Arenillas DJ, Kwon AT, Wasserman WW. (2007) oPOSSUM: Integrated tools for analysis of regulatory motif over-representation. Nucleic Acids Research. Jul; 35 Web Server Issue: W245-52.

3.3. Results

Each oPOSSUM component was validated on sets of reference genes.
The results of all validations are available as supplementary materials (Tables S4-S13 available at http://cisreg.ca/oPOSSUM/data/). In the interest of space, a single validation is described for each system.

3.3.1. Human single site analysis

Wonsey and Follettie performed a microarray analysis of genes that are transcriptionally regulated by FoxM1, a member of the forkhead family of TFs, using BT-20 cells that had been transfected with FoxM1 siRNA [6]. They identified a set of 27 genes that were specifically regulated in cells transfected with FoxM1 siRNA (Table S4 available at http://cisreg.ca/oPOSSUM/data/). The 27 Affymetrix UG144A identifiers were mapped to 27 EnsEMBL gene identifiers and submitted to the Human single site analysis (SSA) tool with default parameters. Of these, 22 genes had a unique mouse ortholog and were used in the oPOSSUM analysis. While a specific profile for FoxM1 is not present in JASPAR CORE, other members of the forkhead family were ranked in the top ten highest scoring TFBS profiles (Table 3.1). There is also a known association between HNF4, the highest scoring TFBS profile, and the forkhead TF, FOXO1, in the regulation of gluconeogenic gene expression in hepatocytes [7], which may explain the over-representation of the HNF4 profile.

We previously identified over-represented Fos binding sites in a set of genes induced after transformation by c-Fos in rat fibroblast cells [1, 8]. We analyzed 160 orthologous genes from the original list of 252 induced genes (Table S7 available at http://cisreg.ca/oPOSSUM/data/). This is a notable improvement over the previous version, where only 98 genes were included in the oPOSSUM analysis. The Fos TFBS profile ranked third in the list of over-represented TFBSs (Table 3.2). Inspection of the results using the JASPAR PhyloFACTS profiles with default parameters illustrates how inclusion of this new set of profiles provides additional, meaningful information.
The highest ranked PhyloFACTS motif (TGANTCA) is noted by JASPAR as being most similar to the binding profile for AP-1, and the third highest scoring motif (TGASTMAGC) is most similar to the bZIP TF NF-E2. AP-1 complexes are composed of Fos and Jun proteins, and the structurally related NF-E2 and AP-1 TFs bind similar sequence motifs [9].

3.3.2. Human combination site analysis

The combination site analysis (CSA) algorithm was validated on a set of mouse skeletal muscle genes comprising the union of the results of the microarray studies of Moran et al. [10] and Tomczak et al. [11] (Table S9 available at http://cisreg.ca/oPOSSUM/data/). To avoid circularity, we removed muscle-specific genes used to generate the JASPAR binding site profiles for Mef-2, Myf, Sp-1, SRF, and Tef. These factors occur in clusters in cis-regulatory modules that contribute to skeletal muscle-specific expression [12]. Table 3.3 lists the top five over-represented pairwise TFBS combinations for this set of genes, along with the JASPAR class each TF profile clustered to, and the Fisher score obtained for each pair. The five most over-represented pairs of TFBS profiles include combinations of Mef-2, SRF and Sp-1.

The inclusion of alternative promoters provides notable improvements in the Human SSA and Human CSA analyses. The same data sets were used to validate our previous and current human oPOSSUM analysis systems. Demarcation of additional promoter boundaries increases the signal in the discovery process, for both over-represented single TFBSs and combinations of TFBSs in the gene sets analyzed.

3.3.3. Worm single site analysis

Worm SSA was tested on a set of well-characterized nematode muscle genes (Table S10 available at http://cisreg.ca/oPOSSUM/data/) [13].
Analysis of 1000bp of upstream sequence, using the top 10% of conserved regions (minimum of 60% sequence identity), a matrix match threshold of 80% and the worm profiles, identified the putative muscle1 motif with a Z-score of 20.6 and a Fisher score less than 0.01 (Table 3.4). This is, however, somewhat circular, given that 19 of the 41 input genes were used to generate the putative muscle-specific worm profiles. Analysis using the JASPAR CORE profiles ranked SP1 and Su(H) within the top ten scoring profiles (Table S10B available at http://cisreg.ca/oPOSSUM/data/). Studies in Xenopus and Drosophila provide evidence that MyoD triggers Notch signaling through Su(H) for muscle determination [14, 15]. Although SP1 has been implicated in muscle CRMs, it is a general transcription factor involved in the expression of many different genes and binds to GC-rich motifs.

3.3.4. Yeast single site analysis

The yeast CLB2 gene cluster comprises 32 genes whose pattern of expression peaks at late G2/early M phase of the cell cycle (Table S11 available at http://cisreg.ca/oPOSSUM/data/). Transcription of these genes is regulated by two TFs: FKH, which is a component of the TF SFF, and MCM1, a member of the early cell cycle box (ECB) binding complex. Analysis of 500bp of upstream sequence using a matrix match threshold of 85% ranked ECB, MCM1 and FKH1 in the top five scoring TFBS profiles (Table 3.5), which is consistent with the literature [16].

3.4. Discussion

The four oPOSSUM systems, Human SSA, Human CSA, Worm SSA, and Yeast SSA, have been integrated into a user-friendly website at www.cisreg.ca/oPOSSUM_new. We recommend that users of the system begin with the SSA to quickly identify TFBSs that may be relevant to their input data sets. For sets of human and mouse genes, this can be followed with the CSA, which takes longer to process, but which can provide insights into TFBSs that may be acting in concert to regulate the set of genes.
The web implementation allows for analysis in default and custom modes. Default mode processing is faster as TFBS counts have been pre-calculated and stored for pre-defined conservation levels, matrix match thresholds and promoter lengths. In either mode, the user is required to select a species and to enter a list of gene identifiers (EnsEMBL, RefSeq, HGNC and Entrez Gene are supported for human). A number of options are available to specify the TFBS profile set to be used in the analysis. Finally, the conservation level, matrix match threshold and the promoter length can be varied. In the custom mode, users may define their own background set, which provides more control, but results in more variable processing speeds depending on the size of the background set and the parameters selected.

Upon submission, oPOSSUM SSA generates a summary of the input parameters, and produces a single table that ranks the over-represented TFBSs by descending Z-score. The table may be sorted by TF name, TF class, supergroup, information content (IC), Z-score and Fisher score (Figure 3.3A). Pop-up windows linked to each TFBS foreground count display the genes in which the putative site is located, the promoter region(s) for each gene, as well as the TFBS’s co-ordinates and score (Figure 3.3B). TFBSs that occur in overlapping promoter regions are marked by an asterisk and highlighted in yellow. The TF names are linked to the JASPAR database for easy access to information regarding the binding site profiles. The output for oPOSSUM CSA is similar, providing (i) a ranked list of over-represented TFBS class combinations, and (ii) a list of the most significant TFBS combinations (found in the set of expanded top-ranked class combinations).

Because the statistics employed rest on the assumption that DNA sequences are randomly generated, there is little reason to accept the calculated scores as accurate reflections of significance.
Instead, as suggested in the original published description of the oPOSSUM algorithm, we recommend that the scores be used as rankings rather than as significance measures. For this reason, a multiple testing correction is not applied, as it would not alter their relative ranking. Empirically, we determined that TFBS profiles with Z-scores equal to or exceeding 10 and Fisher scores less than or equal to 0.01 facilitate the identification of relevant TFBSs for our sets of reference genes [1]. However, these are relatively stringent thresholds, and we encourage users to examine the scores of top-ranked TFBS profiles before applying any cutoffs.

We provide a consistent display for all four systems. However, there are slight differences between the systems, such as different parameters for selection on the input pages which are relevant for each species database and system. Also, due to the longer processing times required to compute combinations of TFBSs, Human CSA queues the analysis request on the server and emails the completed results to the user.

The oPOSSUM system is under continued development. Efforts are underway to allow users to submit custom TF profiles to be included in the analysis. An improved search method for nuclear hormone receptors, which typically contain two half sites separated by a variable length spacer, has been developed and will be included in a future release. We will continue to add TFBS profiles as they become available, with an emphasis on expanding the repertoire of worm TFBS profiles. We believe the oPOSSUM web server is and will continue to be a useful resource for researchers attempting to move from observed co-expression to inferred mechanisms of co-regulation.

3.5. Methods

3.5.1. Over-representation analysis

3.5.1.1. oPOSSUM single site analysis

The oPOSSUM system for identifying over-represented TFBSs in sets of co-expressed genes first focused on single site analysis [1].
Two scores were developed to assess over-representation, one at the TFBS occurrence level and the other at the gene level. The Z-score, based on the normal approximation to the binomial distribution, indicates how far and in what direction the number of TFBS occurrences deviates from the background distribution's mean. The second score, from the Fisher exact test, indicates whether the proportion of genes containing the TFBS is greater than would be expected by chance. TFBS predictions situated within overlapping alternative promoters are counted only once when calculating over-representation in human and mouse genes. For C. elegans genes in operons, TFBS predictions in the upstream region of the first gene in the operon apply to all genes in the operon.

3.5.1.2. oPOSSUM combination site analysis

TFBSs do not act in isolation to initiate the transcription process. Transcriptional regulation can be viewed as mediated by arrays of cis-regulatory sequences, termed cis-regulatory modules (CRMs), which are bound by multiple TFs. In oPOSSUM2, Huang et al. (2006) [2] address the detection of over-represented sets of TFBSs in the promoters of a set of co-expressed genes. In brief, the method reduces combinatorial complexity through an initial clustering step, which partitions similar TFBS profiles into groups (herein denoted TFBS classes), followed by an analysis step that determines a representative profile for each TFBS class. The class representatives are then evaluated to detect over-represented sets of TFBS classes. Since each distinct, over-represented set of detected TFBS classes, herein described as a TFBS class combination, implicates the over-representation of one or more underlying TFBS profile-specific combinations, each of these TFBS class combinations is expanded to all possible TFBS profile-specific combinations (for the indicated classes) and then all combinations are analyzed for over-representation.
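The expansion step can be sketched as a Cartesian product over the member profiles of each class in a combination. The sketch below is illustrative only: the function names, the dictionary layout, the example scores, and the default threshold of 0.01 are assumptions, and the class memberships in the usage note are a partial example rather than the full JASPAR clustering.

```python
from itertools import product


def expand_class_combination(class_combination, class_members):
    """Expand one class combination into all profile-specific combinations.

    class_combination -- tuple of class identifiers, e.g. (4, 1)
    class_members     -- dict mapping class identifier -> list of profile names
    """
    member_lists = [sorted(class_members[c]) for c in class_combination]
    # One profile is drawn from each class, in the order the classes are given.
    return list(product(*member_lists))


def expand_significant(class_scores, class_members, threshold=0.01):
    """Expand only class combinations whose score is at or below the threshold."""
    return {
        combo: expand_class_combination(combo, class_members)
        for combo, score in class_scores.items()
        if score <= threshold
    }
```

For example, with `class_members = {1: ["SRF", "Agamous"], 4: ["MEF2A"]}`, the class combination `(4, 1)` expands to the two profile pairs MEF2A with Agamous and MEF2A with SRF, each of which would then be scored in the same way as the class combinations.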
Furthermore, given that CRMs can contain locally dense clusters of TFBSs, the system also provides for the specification of an inter-binding site distance (IBSD) constraint to confine the number of TFBS combinations that are investigated. A scoring scheme, adopted from the Fisher exact test, utilizes two sets of TFBS (class or profile-specific) combination counts to compare the degree of their over-representation: 1) the number found in the promoters of the co-expressed gene set versus 2) the number found in the promoters of genes in a background set (all genes in the database). TFBS combinations occurring in multiple alternative gene promoter regions are counted only once.

3.5.2. Species-specific databases

In addition to enhancements to the human/mouse oPOSSUM database, we introduce new species databases for studies of over-represented TFBSs in yeast and worms. While the SSA over-representation analysis remains the same for all species, differences in gene structure require that the construction of the underlying databases be particular to each species.

3.5.2.1. Human/mouse

Ambiguities in ortholog assignments and the definition of TSS positions are major challenges when performing alignments for a large proportion of human and mouse genes. We have expanded the human/mouse database through (i) the discrimination of potential orthologs from predicted paralogs based on upstream sequence similarity (Figure 3.1), and (ii) the delineation of alternative promoters for human and mouse genes (Figure 3.2) to address the alignment failure observed in previous database builds. While the inclusion of promoter comparisons for candidate ortholog assignment may be controversial, the impact is marginal as less than 1.3% of gene pairs were derived from this approach. This brings the total number of orthologs to 15162. Despite improvements in EnsEMBL’s ortholog prediction, this is only 1079 more orthologs than were present in our previous database build.
Based on the small incremental increases in mapped orthologs, we may be nearing the upper bound for the number of genes in human and mouse that are truly orthologous and detectable by sequence conservation. Detailed descriptions of transcription start region (TSR) determination and the distribution of TSRs for human and mouse genes are available as supplementary material.

For each human/mouse orthologous pair, we determine the coordinates of the longest region from the UCSC genome alignments [17] spanning all transcripts plus an additional 10kb of upstream sequence. The orthologous sequences are retrieved and re-aligned using ORCA, a pairwise global progressive alignment algorithm (described in [1]) to optimally align short, conserved blocks within longer global alignments. If possible, TSRs from human and mouse are paired in the alignment. We apply three dynamically computed and progressively more stringent conservation thresholds corresponding to the top 10%, 20%, and 30% of all 100bp non-coding windows, each with a minimum percent identity of 70%, 65%, and 60%, respectively. Of the 15162 orthologous gene pairs supplied as input to the oPOSSUM pipeline, 15121 (99.7%) successfully align, and 15027 (99.1%) have non-exonic conserved regions above 60% nucleotide identity. This is a significant improvement over the previous version of oPOSSUM.

3.5.2.2. C. elegans/C. briggsae

To facilitate transcriptional regulatory analysis of the numerous gene expression studies performed in C. elegans, we have implemented a worm version of oPOSSUM. While the database structure and pipeline procedure are very similar to that used for the human/mouse database, there are small modifications that allow for mapping of genes to their operons, as defined by Blumenthal et al. [18].
In addition, nucleotide identity thresholds for conserved regions were reduced to 60%, 55%, and 50% for the top 10%, 20%, and 30% of non-coding windows, respectively, to account for the greater sequence divergence between C. elegans and C. briggsae compared to human and mouse.

The set of orthologs for C. elegans and C. briggsae is defined by one-to-one InParanoid clusters [19] from WormBase (WS160) [20]. After filtering overlapping genes, 10592 orthologous gene pairs (of which, 2140 genes are in operons) remain for alignment. Alignments are performed on the orthologous gene sequences plus 2kb of upstream sequence (relative to the start codon) for C. elegans, and 4kb of upstream sequence for C. briggsae. Annotations are not as mature for C. briggsae, and the longer upstream region aids in the alignment of the worm promoter sequences. Alternative promoters have not been considered in this first version; however, should CAGE data or other reliable means for annotating TSSs in worms become available, efforts will certainly be made to include them. Of the 10592 worm orthologs, 9331 (88%) were successfully aligned.

3.5.2.3. Yeast

The analysis of yeast promoters is simplified by the more compact nature of the yeast genome. This characteristic diminishes the requirement for comparative methods to reduce the search space and noise inherent in larger genomes. Computational methods using S. cerevisiae sequences alone have successfully been used to identify regulatory elements associated with known sets of related genes [21, 22]. We opted to exclude phylogenetic footprinting for yeast, and instead, select promoter sequences corresponding to the 5' untranslated region 1000bp immediately upstream of the start codon of each open reading frame (ORF). Note that for all applications, users have the option to further restrict the search space if they wish. The sequences were downloaded from the Saccharomyces Genome Database [23].

3.5.3. TFBS prediction

For the metazoan species, we search for matches to TFBS profiles contained in the JASPAR CORE and JASPAR PhyloFACTS database collections [24, 25]. Additionally, we include a set of profiles compiled for C. elegans TFs from literature review for Worm SSA (Table S2 available at http://cisreg.ca/oPOSSUM/data/). Binding sites are predicted for the sequences using the TFBS suite of Perl modules for regulatory sequence analysis [26]. A predicted binding site for a given TF model is reported if the site occurs in the promoters of both orthologs above a threshold PSSM score of 70% and at equivalent positions in the alignment. Overlapping sites for the same TF are filtered such that only the highest scoring motif is kept. The genomic location, profile score, motif orientation, and local sequence conservation level of each TFBS match in orthologous genes are stored in the respective species databases. For S. cerevisiae, we compiled a collection of yeast-specific TFBS motifs from both the Yeast Regulatory Sequence Analysis (YRSA) system [27] and the literature (Table S3 available at http://cisreg.ca/oPOSSUM/data/), and record the genomic location, profile score and motif orientation for each prediction.

Based on the observation that members of the same structural family of TFs often bind to similar sequences, plant and insect matrices are available for inclusion in the analysis. The MADS family of TFs is an excellent example of conservation of binding domains between plants and vertebrates [28, 29], and there are numerous examples of conservation of binding domains across vertebrates, flies and worms. Thus, in cases where a profile for the TF of interest is not available in the database, oPOSSUM can still provide insights into the underlying regulation by suggesting a particular TF family that may be involved.

Table 3.1. oPOSSUM results for human FoxM1-regulated gene cluster

JASPAR CORE  TF Class          IC     Target gene hits  Background TFBS rate  Target TFBS rate  Z-score  Fisher Score
HNF4         Nuclear           9.62   13                0.0054                0.0085            7.19     2.64E-02
Fos          bZIP              10.67  15                0.0111                0.0146            5.72     4.29E-01
Pbx          Homeo             14.64  5                 0.0019                0.0033            5.57     3.10E-01
FOXI1        Forkhead          13.18  16                0.0153                0.0186            4.49     9.05E-02
RORA1        Nuclear Receptor  17.42  4                 0.0020                0.0029            3.54     5.04E-01
TAL1-TCF3    bHLH              14.07  12                0.0052                0.0066            3.30     5.88E-02
Staf         Zn-Finger, C2H2   17.54  3                 0.0014                0.0021            3.16     3.03E-01
Foxa2        Forkhead          12.43  13                0.0152                0.0174            3.04     4.83E-01
Foxd3        Forkhead          12.94  13                0.0172                0.0194            2.93     5.27E-01
TEAD         TEA               15.67  6                 0.0028                0.0037            2.850    4.70E-01

Table 3.2. oPOSSUM results for c-Fos-regulated gene cluster

JASPAR PhyloFACTS  Similar To  IC     Target gene hits  Background TFBS rate  Target TFBS rate  Z-score  Fisher Score
TGANTCA            AP-1        12.06  46                0.0011                0.0023            18.05    1.40E-04
GGGYGTGNY          -           14.18  82                0.0059                0.0083            15.64    4.98E-02
TGASTMAGC          NF-E2       16.60  43                0.0013                0.0024            15.64    1.19E-03
GGARNTKYCCA        -           17.13  44                0.0016                0.0026            12.54    1.11E-03
GGGAGGRR           MAZ         14.00  111               0.0171                0.0202            11.98    3.16E-01

Table 3.3. oPOSSUM results for skeletal muscle genes identified by Moran et al. and Tomczak et al.

TF name (Class ID)    TF class name    TF name (Class ID)    TF class name    Score
MEF2A (class 4)       MADS             Myf (class 22)        bHLH             1.65E-06
MEF2A (class 4)       MADS             ZNF42_1-4 (class 25)  Zn-finger, C2H2  4.24E-06
Myf (class 22)        bHLH             SRF (class 1)         MADS             2.52E-05
SP1 (class 31)        Zn-finger, C2H2  SRF (class 1)         MADS             2.68E-05
Agamous (class 1)     MADS             MEF2A (class 4)       MADS             7.63E-05

Table 3.4. oPOSSUM results for worm skeletal muscle genes using worm profiles

Worm     Status    IC     Target gene hits  Background TFBS rate  Target TFBS rate  Z-score  Fisher Score
Muscle1  Putative  11.34  6                 0.0025                0.0156            20.56    4.24E-04
Muscle2  Putative  11.97  4                 0.0022                0.0089            11.19    1.39E-02
LIN-14   Putative  9.13   9                 0.0143                0.0280            9.116    1.17E-01
Muscle3  Putative  16.67  4                 0.0029                0.0064            5.02     6.96E-02

Table 3.5. oPOSSUM results for the yeast CLB2 gene cluster

YEAST  TF Class        IC     Target gene hits  Background TFBS rate  Target TFBS rate  Z-score  Fisher Score
ECB    Unclassified    16.65  13                0.0019                0.0131            32.87    8.68E-09
MCM1   MADS            9.15   10                0.0073                0.0165            13.71    1.08E-02
FKH1   Forkhead        13.28  30                0.0305                0.0473            12.26    4.05E-02
CCA    Unclassified    16.93  3                 0.0017                0.0040            7.08     2.02E-01
LYS14  C6_Zinc finger  17.02  6                 0.0030                0.0053            5.20     9.41E-02

Figure 3.1. Determination of one-to-one orthologs for human and mouse genes.
An initial set of homologs was downloaded from EnsEMBL v41 [30]. All homologs annotated as “one2one” are extracted. To select the closest putative ortholog pairs from homologs with “one2many” or “many2many” relationships, we check for upstream conservation using the whole-genome human-mouse alignments [17]. We re-annotate unambiguously aligned homologs as putative one-to-one orthologs, adding 195 gene pairs to our set, and bringing the total number of orthologs to 15162.

Figure 3.2. Identification of transcription start regions (TSRs) using a combination of EnsEMBL annotations and CAGE data
To improve our alignments, we determine putative alternative TSSs for the human and mouse genes. For each gene, the entire repertoire of transcripts from both EnsEMBL core genes and EST genes is retrieved. The TSSs for all transcripts are recorded, followed by a clustering step such that TSSs within 500bp of one another are merged to form a transcriptional start region (TSR). For each TSR containing a transcript annotated as “known” or “novel”, we accept the TSR as is. For TSRs based solely on EST gene transcripts, we require a minimum of 5 CAGE tags as evidence for transcription initiation.

Figure 3.3. oPOSSUM Human SSA website screenshots
(A) A screenshot of the output of the oPOSSUM Human SSA analysis, with TFBS profiles ranked by Z-score. The arrows allow the user to sort and re-order the results by Fisher score, TF name, TF class, TF supergroup, or TF profile information content (IC).
Each TF name links to a pop-up window displaying the TFBS profile information. (B) Pop-up window displaying genes that contain a particular TFBS (in this case, MEF2A; partial list shown), as well as the promoter coordinates associated with each gene, and the motif locations and scores. Sites in overlapping alternative promoters are highlighted for emphasis. Such sites are only counted once in the statistical analysis.

3.6. References

1. Ho Sui SJ, Mortimer JR, Arenillas DJ, Brumm J, Walsh CJ, Kennedy BP, Wasserman WW: oPOSSUM: identification of over-represented transcription factor binding sites in co-expressed genes. Nucleic Acids Res 2005, 33(10):3154-3164.
2. Huang SS, Fulton DL, Arenillas DJ, Perco P, Ho Sui SJ, Mortimer JR, Wasserman WW: Identification of Over-represented Combinations of Transcription Factor Binding Sites in Sets of Co-expressed Genes. Advances in Bioinformatics & Computational Biology 2006, 3:247-256.
3. Urnov FD, Rebar EJ: Designed transcription factors as tools for therapeutics and functional genomics. Biochem Pharmacol 2002, 64(5-6):919-923.
4. Luscombe NM, Austin SE, Berman HM, Thornton JM: An overview of the structures of protein-DNA complexes. Genome Biol 2000, 1(1):REVIEWS001.
5. Sandelin A, Wasserman WW: Constrained Binding Site Diversity within Families of Transcription Factors Enhances Pattern Discovery Bioinformatics. J Mol Biol 2004, 338(2):207-215.
6. Wonsey DR, Follettie MT: Loss of the forkhead transcription factor FoxM1 causes centrosome amplification and mitotic catastrophe. Cancer Res 2005, 65(12):5181-5189.
7. Lin J, Tarr PT, Yang R, Rhee J, Puigserver P, Newgard CB, Spiegelman BM: PGC-1beta in the regulation of hepatic glucose and energy metabolism. J Biol Chem 2003, 278(33):30843-30848.
8. Ordway JM, Williams K, Curran T: Transcription repression in oncogenic transformation: common targets of epigenetic repression in cells transformed by Fos, Ras or Dnmt1. Oncogene 2004, 23(21):3737-3748.
9. Daftari P, Gavva NR, Shen CK: Distinction between AP1 and NF-E2 factor-binding at specific chromatin regions in mammalian cells. Oncogene 1999, 18(39):5482-5486.
10. Moran JL, Li Y, Hill AA, Mounts WM, Miller CP: Gene expression changes during mouse skeletal myoblast differentiation revealed by transcriptional profiling. Physiol Genomics 2002, 10(2):103-111.
11. Tomczak KK, Marinescu VD, Ramoni MF, Sanoudou D, Montanaro F, Han M, Kunkel LM, Kohane IS, Beggs AH: Expression profiling and identification of novel genes involved in myogenic differentiation. Faseb J 2004, 18(2):403-405.
12. Wasserman WW, Fickett JW: Identification of regulatory regions which confer muscle-specific gene expression. J Mol Biol 1998, 278(1):167-181.
13. GuhaThakurta D, Schriefer LA, Waterston RH, Stormo GD: Novel transcription regulatory elements in Caenorhabditis elegans muscle genes. Genome Res 2004, 14(12):2457-2468.
14. Rusconi JC, Corbin V: Evidence for a novel Notch pathway required for muscle precursor selection in Drosophila. Mech Dev 1998, 79(1-2):39-50.
15. Wittenberger T, Steinbach OC, Authaler A, Kopan R, Rupp RA: MyoD stimulates delta-1 transcription and triggers notch signaling in the Xenopus gastrula. Embo J 1999, 18(7):1915-1922.
16. Kreiman G: Identification of sparsely distributed clusters of cis-regulatory elements in sets of co-expressed genes. Nucleic Acids Res 2004, 32(9):2889-2900.
17. Schwartz S, Kent WJ, Smit A, Zhang Z, Baertsch R, Hardison RC, Haussler D, Miller W: Human-mouse alignments with BLASTZ. Genome Res 2003, 13(1):103-107.
18. Blumenthal T, Evans D, Link CD, Guffanti A, Lawson D, Thierry-Mieg J, Thierry-Mieg D, Chiu WL, Duke K, Kiraly M, Kim SK: A global analysis of Caenorhabditis elegans operons. Nature 2002, 417(6891):851-854.
19. O'Brien KP, Remm M, Sonnhammer EL: Inparanoid: a comprehensive database of eukaryotic orthologs. Nucleic Acids Res 2005, 33(Database issue):D476-D480.
20. Stein L, Sternberg P, Durbin R, Thierry-Mieg J, Spieth J: WormBase: network access to the genome and biology of Caenorhabditis elegans. Nucleic Acids Res 2001, 29(1):82-86.
21. Zhang MQ: Promoter analysis of co-regulated genes in the yeast genome. Comput Chem 1999, 23(3-4):233-250.
22. Tavazoie S, Hughes JD, Campbell MJ, Cho RJ, Church GM: Systematic determination of genetic network architecture. Nat Genet 1999, 22(3):281-285.
23. Cherry JM, Adler C, Ball C, Chervitz SA, Dwight SS, Hester ET, Jia Y, Juvik G, Roe T, Schroeder M, Weng S, Botstein D: SGD: Saccharomyces Genome Database. Nucleic Acids Res 1998, 26(1):73-79.
24. Vlieghe D, Sandelin A, De Bleser PJ, Vleminckx K, Wasserman WW, van Roy F, Lenhard B: A new generation of JASPAR, the open-access repository for transcription factor binding site profiles. Nucleic Acids Res 2006, 34(Database issue):D95-97.
25. Sandelin A, Alkema W, Engström P, Wasserman WW, Lenhard B: JASPAR: an open-access database for eukaryotic transcription factor binding profiles. Nucleic Acids Res 2004, 32(Database issue):D91-94.
26. Lenhard B, Wasserman WW: TFBS: Computational framework for transcription factor binding site analysis. Bioinformatics 2002, 18(8):1135-1136.
27. Sandelin A, Hoglund A, Lenhard B, Wasserman WW: Integrated analysis of yeast regulatory sequences for biologically linked clusters of genes. Funct Integr Genomics 2003, 3(3):125-134.
28. Alvarez-Buylla ER, Pelaz S, Liljegren SJ, Gold SE, Burgeff C, Ditta GS, Ribas de Pouplana L, Martinez-Castilla L, Yanofsky MF: An ancestral MADS-box gene duplication occurred before the divergence of plants and animals. Proc Natl Acad Sci U S A 2000, 97(10):5328-5333.
29. Shore P, Sharrocks AD: The MADS-box family of transcription factors. Eur J Biochem 1995, 229(1):1-13.
30. Hubbard T, Barker D, Birney E, Cameron G, Chen Y, Clark L, Cox T, Cuff J, Curwen V, Down T, Durbin R, Eyras E, Gilbert J, Hammond M, Huminiecki L, Kasprzyk A, Lehvaslaiho H, Lijnzaad P, Melsopp C, Mongin E, Pettett R, Pocock M, Potter S, Rust A, Schmidt E, Searle S, Slater G, Smith J, Spooner W, Stabenau A et al: The Ensembl genome database project. Nucleic Acids Res 2002, 30(1):38-41.

4. TFCat: The Curated Catalog of Mouse and Human Transcription Factors¹²

¹² A version of this chapter has been published. Fulton DL, Sundararajan S, Badis G, Hughes TR, Wasserman WW, Roach JC, Sladek R. (2009) TFCat: the curated catalog of mouse and human transcription factors. Genome Biology, 10(3):R29.

4.1. Chapter preamble

In chapters 2 and 3, I described the design and development of an algorithm that identifies over-represented combinations of predicted transcription factor binding site (TFBS) elements. The identification of the TF proteins that bind the predicted CRM DNA elements is an essential step in defining gene regulatory mechanisms. Importantly, members of a transcription factor (TF) family (with similar DNA binding structures) may bind similar DNA elements. A comprehensive inventory of TFs for mouse and/or human is necessary for the study of all members of each family. This need led us to establish a curated catalog of mouse and human TFs (called TFCat) and to assign the assessed DNA-binding TFs to an extended structural classification system. Homology analysis was used to expand the set of TFs with unannotated genes.

4.2. Introduction

The functional properties of cells are determined in large part by the subset of genes that they express in response to physiological, developmental and environmental stimuli. The coordinated regulation of gene transcription, which is critical in maintaining this adaptive capacity of cells, relies on proteins called transcription factors (TF) that control profiles of gene activity and regulate many different cellular functions by interacting directly with DNA [1, 2] and with non-DNA binding accessory proteins [3, 4]. While the biochemical properties and regulatory activities of both DNA-binding and accessory TFs have been experimentally characterized and extensively documented (for example, in textbooks devoted to TFs [5, 6]), a well-validated and comprehensive catalog of TFs has not been assembled for any mammalian species.

Many gene transcription studies have linked the subset of TFs that bind specific DNA sequences to the activation of individual genes and, more recently, these have been pursued on a genome-wide basis using high-throughput laboratory studies (for example by performing chromatin-immunoprecipitation) as well as computational analyses (for example by identifying over-represented DNA motifs within promoters of co-expressed genes). To facilitate such efforts, inventories of TFs have been assembled for Drosophila and Caenorhabditis species as well as for specific sub-families of mammalian TFs (Table 4.1). Since only a limited number of protein structures can mediate high-affinity DNA interactions, collections of TF subfamilies have been constructed using predictive sequence-based models for DNA-binding domains [7-10]. For example, the PFAM Hidden Markov Model (HMM) database [11] and Superfamily HMMs [12] have been applied to sets of peptide sequences to identify nearly 1900 putative TFs in the human genome [10] and over 750 fly TFs, of which 60% were well-characterized site-specific binding proteins [13].

While these collections have emphasized DNA binding proteins, recent evidence suggests that the contributions of accessory TFs may be equally or more important in establishing the spatio-temporal regulation of gene activity. For example, microarray-based chromatin immunoprecipitation studies have highlighted the key regulatory contributions of histone modifying TFs to the control of gene expression [14].
Therefore, any comprehensive study of TFs must extend beyond a narrow focus on DNA binding proteins to serve as a foundation for regulatory network analyses.

The four research laboratories contributing to this report were originally pursuing parallel efforts to compile reference collections of bona fide mammalian TFs. In order to maximize the quality and breadth of our gene curation, we combined our efforts to create a single, literature-based catalog of mouse and human TFs (called TFCat). The collection of annotations is based on published experimental evidence. Each TF gene was assigned to a functional category within a hierarchical classification system based on evidence supporting DNA binding and transcriptional activation functions for each protein. DNA-binding proteins were categorized using an established structure-based classification system [15]. A blind, random sample of the functional assessments provided by each expert was used to assess the quality of the gene annotations. The evidence-based subset of TFs was used to computationally predict additional un-annotated genes likely to encode TFs. The resulting collection is available for download from the TFCat portal and is also accessible via a wiki to encourage community input and feedback to facilitate continuous improvement of this resource.

4.3. Results

4.3.1. TF gene candidate selection, the annotation process, and quality assurance

Prior to the initiation of the TFCat collaboration, each of the four participating laboratories constructed mouse TF datasets using manual text-mining and computational approaches. As each dataset was created specifically to suit the needs of the research lab that generated it, combinations of overlapping and distinct procedures were applied to collect and filter each dataset (Figure S1 in Appendix 1). These four independently established putative TF datasets laid the foundation for this joint initiative.
To ensure the comprehensiveness and utility of our reference collection, we broadly defined a TF as any protein directly involved in the activation or repression of the initiation of synthesis of RNA from a DNA template. Incorporating this standard, the union of the four sets yielded 3230 putative mouse TFs (referred to as the UPTFs). As complete manual curation of all literature to evaluate TFs is not practical, our curation efforts were prioritized to maximize the number of reviews conducted for UPTFs linked to papers. A manual survey of PubMed abstracts was performed, using available gene symbol identifiers and aliases, to identify genes for which experimental evidence of TF function might exist. Since standardized naming conventions have not been fully applied in the older literature, the associations between abstracts and genes may be incomplete or inaccurate due to the redundant use of the same identifiers for two or more genes. In addition, we did not consider abstracts that made no mention of the gene identifiers of interest or those that, by their description, were unlikely to have conducted transcription regulation-related analyses.

From this list of 3230 putative mouse TFs, coarse pre-curation identified 1200 putative TFs with scientific papers describing their biochemical or gene regulatory activities in the PubMed database [16]. The majority of predicted TFs (2030 of 3230) had no substantive literature evidence supporting their molecular function. The remaining 1200 Transcription Factor Candidates (TFCs) were prioritized for expert annotation. Genes belonging to the TFC set that were associated with two or more papers in PubMed were selected and randomly assigned for evaluation by one or more of 17 participating reviewers.
Gene annotations were primarily performed by a single reviewer, with the exception of 20 genes assigned to multiple reviewers for initial training purposes and 50 genes assigned to pairs of reviewers for a quality assurance assessment. In total, 1058 genes (Table 4.2) have been reviewed. For each candidate, a TF confidence judgment was assigned (Table 4.3) based on the literature surveyed. Annotation of each TFC required evidence of transcriptional regulation and/or DNA-binding (e.g. a reporter gene assay and/or DNA-binding assay). A text summary of the experimental evidence was extracted and entered by the reviewer, along with the PubMed ID, the species under study, and the reviewer’s perception of the strength of the evidence supporting their judgment. Although reviewers were not obligated to continue beyond two types of experimental support, they were encouraged to review multiple papers where feasible. Based on their literature review, annotators were required to classify their determination of each TFC into a positive (TF Gene or TF Gene Candidate), neutral (no data or conflicting data) or negative group (not a TF or likely not). Of the 1058 TFCs reviewed, 83% were found to have sufficient experimental evidence to be classified either as a TF Gene or as a TF Gene Candidate.

To simplify data collection and curation, we focused our literature evidence collection and annotation efforts on mouse genes. However, literature pertaining to mouse genes and their human (or other mammalian) orthologs was used interchangeably as evidence for the annotations. Roughly 83% of the annotation literature evidence surveyed was based on a combination of mouse and human data, with roughly equal numbers of papers pertaining to each of these species. Mouse TF genes were associated with their putative human ortholog using the NCBI’s HomoloGene resource [16]. With the exception of 40 mouse genes, putative ortholog pairs were matched using defined HomoloGene groups.
All but 13 of the remaining 40 were mapped using ortholog relationships in the Mouse Genome Database [17]. Each gene’s predicted human ortholog is included in the download data and in the published wiki data.

Depending upon the subset of available papers reviewed for a given TFC, two curators could arrive at different judgments. To ascertain the consistency and quality of our reviewing approach and judgment decisions, we randomly selected 50 genes for re-review and assigned each to a second expert (Table S1 and Table S2 in Appendix 1). Out of the 100 annotations (2 reviews each for 50 genes), 37 paired gene judgments (74 annotations) were concordant and 13 paired gene judgments (26 annotations) were discordant. Examination of the discordant pairs suggested that review of different publications may have produced the disagreement in annotation. To further evaluate this assumption, we extracted a non-QA sample of multiple annotations where different reviewers curated the same genes or gene family members using the same articles (Table S3 in Appendix 1) and found that these curation judgments were in perfect agreement. Under the assumption that judgment conflicts identified in the QA sample would be resolved in favour of one of the assigned judgment calls, we conclude that 13% of judgments may be altered after additional annotation, suggesting that a system to enable continued review would be beneficial.

Since mouse and human TFs have been evolutionarily conserved among distantly related species [18], we assessed the coverage of our curated TF collection by comparing it with a list of expert annotated fly TFs documented in the FlyTF database [13]. Over half (443/753) of the FlyTF genes were found in NCBI HomoloGene groups, producing 184 fly TF-containing clusters that also contained mouse homologs. More than 85% (164/184) of these homologous TF genes were in the UPTF set.
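The proportions quoted for the quality assurance sample and the FlyTF comparison can be re-derived with a few lines of arithmetic. The sketch below is illustrative only: the counts come from the text, the variable names are our own, and the 13% figure is reproduced under the stated assumption that each discordant pair is resolved in favour of one of its two judgments, so that exactly one annotation per discordant pair changes.

```python
# Counts taken from the text; all variable names are illustrative.
qa_genes = 50                 # genes reviewed twice for quality assurance
concordant_pairs = 37         # gene pairs with matching judgments
discordant_pairs = qa_genes - concordant_pairs

annotations = 2 * qa_genes                     # two reviews per gene
concordant_annotations = 2 * concordant_pairs
discordant_annotations = 2 * discordant_pairs

# Assumption: resolving a discordant pair changes exactly one of its two judgments.
altered_fraction = discordant_pairs / annotations

# FlyTF coverage comparison.
fly_tfs = 753
fly_in_homologene = 443
clusters_with_mouse_homologs = 184
clusters_in_uptf = 164

homologene_share = fly_in_homologene / fly_tfs
uptf_share = clusters_in_uptf / clusters_with_mouse_homologs

print(f"concordant annotations: {concordant_annotations} of {annotations}")
print(f"judgments possibly altered: {altered_fraction:.0%}")
print(f"FlyTF genes in HomoloGene groups: {homologene_share:.1%}")
print(f"fly TF clusters covered by the UPTF set: {uptf_share:.1%}")
```

The same pattern extends to any re-review sample: only the number of genes and the number of concordant pairs need to change.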
Inspection of the twenty putative mouse homologs of fly TFs absent from the UPTF set led to the inclusion of five genes in both the UPTF and the TFC sets for future curation, while there were no published studies involving the mammalian proteins for the remaining fifteen genes. We also assessed TFCat’s coverage by comparing it with a classic collection of TFs prepared prior to the completion of the mouse genome [6]. After mapping 506 TFs to Entrez Gene identifiers, we found that 463 were present in the UPTF set and 423 were members of the TFC gene list. The remaining 43 genes were added to the UPTF set and the TFC list was extended to include 83 additional genes. From these analyses, we conclude that TFCat contains a large majority of known TFs.

4.3.2. Identification and classification of DNA binding proteins

Genes positively identified as TFs were categorized using a taxonomy to document their functional properties identified in the literature review (Table 4.4). Notably, 65% (571/882) of the genes judged as TFs were reported to act through a DNA binding mechanism and 94% (535/571) of these DNA-binding TFs were found to act through sequence-specific interactions mediated by a small number of protein structural domains (Table 4.5). Members of a DNA-binding TF family share strongly conserved DNA binding domains that, in most cases, have overlapping affinity for DNA-sequences; therefore a prediction of a TF binding site can suggest a role for the family but does not implicate specific family members. As such, a TF DNA-binding classification system is an essential resource for many promoter sequence analyses in which researchers should prioritize potential trans-acting candidates from a set of equally suitable candidate TFs within a structural class.
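Because a predicted binding site implicates a structural family rather than a single protein, a classification table is naturally used as a lookup from family to candidate TFs. The following is a minimal sketch of that idea and not the TFCat implementation: the `FAMILY_MEMBERS` mapping is a small hand-made excerpt chosen for illustration, and `candidates_for_site` is a hypothetical helper name.

```python
from typing import Dict, List

# Hypothetical excerpt of a family-to-gene classification (illustrative only).
FAMILY_MEMBERS: Dict[str, List[str]] = {
    "MADS": ["Mef2a", "Mef2c", "Srf"],
    "bHLH": ["Myod1", "Myf5", "Myog"],
    "Forkhead": ["Foxa2", "Foxm1"],
}


def candidates_for_site(family: str) -> List[str]:
    """Return every catalogued TF in the family implicated by a predicted site.

    A site predicted with, say, a MADS profile cannot distinguish between
    family members, so all of them are returned as equally suitable candidates.
    """
    return sorted(FAMILY_MEMBERS.get(family, []))


print(candidates_for_site("MADS"))
```

Further evidence, such as expression in the tissue of interest, would then be needed to prioritize among the returned candidates.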
Capitalizing on large-scale computational efforts for the prediction of protein domains [11, 12, 19-21], we analyzed each of the TFCat DNA-binding TF protein sequences with the full set of PFAM and Superfamily HMM domain models to predict DNA-binding domain (DBD) structures. A total of 20 Superfamily structure types were identified in our set, along with 54 PFAM DBD models (Table S4 in Appendix 1). Where possible, we linked each double-stranded DNA-binding TF to a family within an established DNA-binding structural classification system [15] that was developed initially to organize the DNA-bound protein crystal structures found in the Protein Data Bank (PDB) [22]. In light of more recent studies, along with a modification of classification requirements (see Methods), an additional set of 16 DBD family classes was added to the system to map domain structures (Table S5 in Appendix 1).

The DNA binding domain analysis offers some noteworthy observations. The homeodomain-containing genes are prominently represented in our set, comprising 24% (131/545) of the classified DBD TFs and 16% of all predicted domain occurrences. The beta-beta-alpha zinc-finger and helix-loop-helix TF families account for 14% (79/545) and 13% (71/545) of the classified genes, respectively. Given the abundance of zinc-finger proteins in the eukaryotic genomes [23] and recent predictions that this DNA-binding structure makes up a significant portion of all TFs [10], this class may be under-represented. On the other hand, since zinc-finger containing genes are involved in a wide variety of functions, the number of predicted zinc-finger proteins that possess a TF role may be overestimated. In addition, it is likely that certain families of TFs, with central roles in well-studied areas of biology, have been more widely covered in the literature, which may account for the prevalence of literature support for homeodomain TFs.
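The family shares reported above follow directly from the gene counts. As a small worked check (counts from the text, dictionary keys and variable names our own), the percentages can be recomputed as follows.

```python
# Gene counts from the text; the labels are illustrative.
classified_dbd_tfs = 545
family_counts = {
    "homeodomain": 131,
    "beta-beta-alpha zinc finger": 79,
    "helix-loop-helix": 71,
}

# Share of classified DBD TFs per family, rounded to whole percentages.
shares = {
    family: round(100 * count / classified_dbd_tfs)
    for family, count in family_counts.items()
}

# Genes in all other DBD families combined.
other_families = classified_dbd_tfs - sum(family_counts.values())

for family, share in shares.items():
    print(f"{family}: {share}%")
print(f"all other families: {other_families} genes")
```

Together the three families cover just over half of the classified DBD TFs, which illustrates how strongly a few structural classes dominate the curated set.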
The majority (392/545) of the classified DBD TFs in our list contain a single DNA interaction domain; however, a notable portion (145/545) of genes belonging to just a few protein families contain more than one instance of their designated DBD structure. These multiple instances predominantly reside in TFs containing zinc-finger, helix-turn-helix, and leucine zipper domains (Table S6 in Appendix 1). While most TFs contained single or multiple copies of a single DNA binding motif, our predictions identified eight TFs with two distinct DBDs (Table S7 in Appendix 1). We removed the second zinc finger-type domain prediction for two of the genes, Atf2 and Atf7, as this domain is characterized as a transactivation domain in Atf2 [24] and may have a similar function in family member Atf7. All other predicted gene domains were retained, based on literature that supported their activity or failed to support their removal.

Four PFAM DBD models detected in eight proteins are not represented by a solved structure and, therefore, could not be directly assigned in the classification system (see Table 4.5 – Protein Group 999). In addition, three NFI proteins were annotated with DNA-binding evidence and predicted to contain a SMAD MH1 DNA-binding domain. Interestingly, a recent study noted that the DNA-binding domains of NFI and SMAD-MH1 share significant sequence similarity [25]. These TFs were also assigned to their own family in the unclassified protein group (see Table 4.5 and Table S5 in Appendix 1 – Protein Group: 999 and Protein Family 905). A group of ten literature-based DNA-binding TFs had no predicted DBD domains (Table S8 in Appendix 1). The absence of detected DNA-binding domains may be due, in part, to the limited sensitivity of the models. For example, the Tcf20 gene (alias Spbp) purportedly contains a novel type of DNA-binding domain with an AT hook motif [26] which was not predicted by the corresponding AT hook PFAM model.
Restricted model representation is also likely the reason for the missing domain predictions of the C4 zinc finger domain in the Nr0b1 gene and the basic helix-loop-helix (bHLH) domain in the Spz1 gene. Similarly, four DBDs detected with protein group class-level Superfamily models (specifically for zinc coordinating and helix-turn-helix models) could not be further delineated to a protein family level assignment (Table S9 in Appendix 1), suggesting that their sequences deviate from the family-specific properties represented in PFAM. It is quite possible that domains involved in DNA binding by human and mouse TFs remain to be discovered.

Most TF DNA-protein interactions occur when the DNA is in a double stranded (dsDNA) state; however, a small number of TF proteins preferentially bind single-stranded DNA (ssDNA) [27, 28]. We identified in the literature review a set of sixteen single-stranded DNA-binding TFs, of which twelve contain HMM-predicted protein domains that are characterized as single-stranded RNA-DNA-binding (Table S10 in Appendix 1). There may be other DBD TFs in our list that act on both ssDNA and dsDNA but were not classified in the ssDNA DBD taxonomy because this property was not specifically characterized in the literature reviewed. The distinction and overlap between ssDNA and dsDNA binding TFs warrants future attention.

4.3.3. Generation and assessment of mouse-human TF homology clusters to predict additional putative TFs

Since a transcriptional role can be inferred for closely related TF homologs [7, 29-31], researchers interested in the analysis of gene regulatory networks would benefit from access to a broad data collection of both experimentally validated TFs and their homologs. The curated TF gene list was used to identify putative mouse TF homologs, in the genome-wide RefSeq collection, that have not yet been annotated in our catalog or that were not evaluated because they lack PubMed literature evidence.
While sequence homology is often used in preliminary analyses to infer similar protein structure and function, its success may be limited when similar protein structures have low sequence similarity [32] or short homologous protein domains. Based on recent evidence that over 15% of predicted domain families have an average length of 50 amino acids or less [33], we evaluated whether pruning BLAST-derived clusters using a previously published sequence similarity metric [34] could be further improved by explicitly including domain information. Our evaluation of both pruning methods indicated that the inclusion of domain knowledge improved homolog cluster content (Figures S2 and S3 in Appendix 1). We therefore incorporated both domain structure predictions, using HMMs, and sequence similarity in our homology-based approach to predict additional TF genes.

The homolog prediction and clustering process yielded 227 homolog clusters containing 3561 genes (3419 unique genes). The vast majority of the genes (3284/3561) are associated with only one cluster each; however, 128 genes were members of two clusters and 7 genes were present in three clusters. We also identified 72 single gene clusters (singletons), which included 36 TF genes that had significant BLAST matches only to themselves, 12 genes that derived BLAST hits which did not satisfy the homolog candidate cut-offs, 21 genes with cluster members that did not satisfy the pruning criteria, and 3 genes that had no RefSeq model sequence. Our TF-seeded homology inference analysis used cut-offs chosen to emphasize specificity, which likely excluded some true homologs (false negatives); it is therefore likely that these singletons represent TFs that share common protein structural features with low sequence similarity.

The curated TF set contains some proteins with properties not commonly associated with TF function.
For example, our catalog included the cyclin dependent kinases (cdk7, cdk8, and cdk9), which are reported to directly activate gene transcription (for a review see Malumbres et al. [35]). Consequently, the homolog analysis of TFs identified numerous other protein kinases that will likely have no direct involvement in transcription. Similarly, larger clusters seeded by TFs containing other domains not frequently associated with transcription, such as calcium-binding, ankyrin repeats, armadillo repeats, dehydrogenase, and WD40, also attracted false TF predictions. To assign a quantitative confidence metric for the large clusters of TF predictions, we developed a scoring procedure based on protein domain associations to TF activity annotations from the Gene Ontology (GO) Molecular Function sub-tree [36]. The cluster confidence metric was employed using a four-tier ranking system for clusters containing more than ten gene members (42 out of 227 homolog clusters). The majority of these clusters (52% or 22 clusters) received high scores indicating that they contain a high proportion of TF genes. Given that GO currently annotates only 39% of the TF genes in our catalog in the TF activity node in the Molecular Function sub-tree (Table S11 in Appendix 1), we expect that less frequently occurring protein domains found in small homolog clusters may not yet be represented in GO. Therefore, we did not analyze clusters containing fewer than ten members and we anticipate future refinements in the homolog cluster confidence rankings as TF gene annotation is expanded in GO.

We incorporated our curated set and cluster counts in an analysis to estimate both the total number of TFs and, as a smaller subset, the number of double-stranded DNA-binding proteins (see Methods). The cluster counts were adjusted using the observed approximate mean TF (OAMTF) proportions associated with each rank level (see Table 4.6) to account for false positives.
From this mouse RefSeq-based analysis, we arrived at an estimate of 2355 DNA-binding and accessory TFs. Since peptide sequence-dependent analyses can result in both omissions and false predictions of homologous protein structures, readers should regard this figure as a “best-guess” approximation [32]. A similar analysis conducted over the homolog clusters containing double-stranded DNA-binding TFs resulted in an estimate of 1510 DNA-interacting TFs. We also performed an extraction of DBD-containing genes from the Ensembl database using the DNA-binding domains defined in TFCat. This analysis derived a list of 1507 putative DNA-binding TFs. These estimates agree well with earlier publications [10, 37, 38].

4.3.4. Maintenance and access of TFCat annotation data

All gene annotations, mouse homolog clusters and human orthologs are published in the TFCatWiki, which is accessible from the TFCat portal. Each wiki article page houses the annotation information for one gene with its content secured against modification. Each gene article page is associated with a discussion page, which is available for comments and feedback by all wiki users. Wiki users can specify that they wish to receive periodic e-mail notification of lists of gene wiki pages and their associated discussion pages that have been updated. Semantic features and functional capabilities are included in the wiki implementation to facilitate easy access to all gene annotation data.

We established a TFCat Annotation Feedback System workflow process (Figure S4 in Appendix 1) to encourage continuous improvement of the catalogued gene entries. An issue tracking management system is integrated with the wiki to capture, queue, and track feedback contributions for follow-up by the wiki annotator. Wiki users may view a gene’s feedback report summaries and current workflow status, through an inquiry made available on each gene’s article page.
Gene annotation changes, entered through our internally-accessible TFCat annotation system, will be flagged and forwarded to the wiki through an automated updating process. Community members who wish to directly contribute to the wiki contents through the backend web application (Figure S5 in Appendix 1) may contact the authors.

The complete TF catalog resource can be downloaded from our website [39]. The website application enables download of the complete list or a subset of annotated genes by assigned judgment, functional taxonomy, and DNA-binding classification. The data extraction is run in real time against a relational database, providing access to the most current TF catalog data.

4.4. Discussion

4.4.1. Catalog characteristics, comparisons, and utility

The comprehensive catalog of TFs contained in TFCat provides an important resource for investigators studying gene regulation and regulatory networks in mammals. The curation effort assessed the scientific literature for 3230 putative mouse and human TFs, including detailed evaluation of papers describing the molecular function of 1058 TFCs, to identify 882 confirmed human and mouse TFs. Each TF was further described within TFCat using a newly developed TF taxonomy. DNA binding proteins, a subset of TFs, were mapped to a structural classification system. As an aid to researchers, an expanded set of putative TFs was generated through a homology-based sequence analysis procedure. Online access to the annotations and homology data is facilitated through a wiki system. An annotation feedback system, linked from the wiki, enables reporting and tracking of community input. An additional website application offers capabilities to extract all or a subset of the catalog data for file download.

For many researchers, the greatest utility of TFCat is the provision of an organized and comprehensive list of DNA binding proteins.
The protein-DNA structural classification system used to organize the DBD TFs in the catalog was originally proposed by Harrison [40], further modified by Luisi [41] and extended by Luscombe [15]. The DNA-binding domain (DBD) analysis and gene/domain counts (Table 4.5) confirmed that well-known DBD families are represented. The DNA-binding classification system was extended with new family classes to accommodate the majority of predicted DNA-binding structures in our curated TF set (Table 4.5 and Table S5 in Appendix 1). A new family category was included for unrepresented, double-stranded TF protein-DNA binding mechanisms that were supported by PDB structures or publications. Similar to the analysis and classification performed by Luscombe et al., we added structural domain families that were characterized by distinct DNA-binding mechanisms. However, unlike the Luscombe et al. approach, we did not consider biological function in our classification decisions. To preserve the properties of the system, the necessary extensions were made within the existing protein groups.

The value in having inventories of TFs has spurred previous efforts to compile collections of DNA-binding proteins. To evaluate the comprehensiveness of our curated collection, we compared our gene annotations with those provided by GO, and compared our DNA-binding domain classification with the domains found in a DNA-binding domain collection [42]. GO assigns molecular function labels to proteins, including functions falling under the broad category of transcription. The challenge of annotating all genes is daunting, and it was therefore not a surprise that only 39% (343) of our expert-curated collection of TFs have thus far been associated with GO terms linked to transcription (Table S11 in Appendix 1).

While TFCat is unique in its evidence-based approach to identify mouse and human TFs, there are other compilations of TF binding domain models and predictions of domain-containing proteins.
For example, a catalog of sequence-specific DNA-binding TFs (which we will refer to as DBDdb) has been compiled using HMMs to catalog double-stranded and single-stranded sequence-specific DBDs [42]. Comparison of the double-stranded DNA-binding subdivision of TFCat with the predictions in DBDdb highlights some key differences between these efforts (Tables S12, S13 and S14 in Appendix 1). For example, the TFCat DNA-binding subdivision includes only TFs with published evidence from mammalian studies, whereas the DBDdb collection includes domain predictions based on evidence of sequence-specific DNA binding in any organism. While the two TF resources overlap, they serve complementary purposes. DBDdb is a set of computational predictions generated with protein motif models associated with sequence-specific single or double-stranded binding domains, while TFCat is an expert-curated, highly specific resource that targets the organized identification of all TFs, regardless of DNA binding, in human and mouse. For example, the high mobility group (HMG) domain TFs, which exhibit both specific and non-specific DNA binding, are excluded from DBDdb but included in TFCat. Moreover, TFCat only included TFs with literature support in mammalian cells, which excludes certain domains included in DBDdb. For example, CG-I has been shown to regulate gene transcription in fly [43] but not in mammals [44].

To complement our large set of curated TF proteins, we conducted a sequence-based homology analysis, propagated from our positively-judged TFs, to predict additional TF-encoding genes. We applied a confidence ranking metric to predict the number of false positives included in larger homolog clusters (Table 4.6), which should be considered when extracting un-annotated, predicted TFs. Future adaptations of the TFCat resource could include literature-based judgments of TF homolog predictions.
While the homolog clusters as provided are an essential and useful supplement to our evidence-based TF catalog, future predictions may benefit from further structure-based homology research.

Creation of a comprehensive TF catalog provides an important first step in unravelling where, when and how each TF acts. For example, a number of recently published genome-scale studies constructed lists of predicted TFs prior to investigating the spatial and temporal expression characteristics of sets of regulatory proteins [8, 9, 45, 46], in advance of conducting a phylogenetic analysis of genes involved in transcription [47], and as initial input to the analysis of conserved non-coding regions in TF orthologs [48]. The set of literature evidence-supported TFs in TFCat will provide an important foundation for similar future studies.

TF catalogs will become increasingly important and necessary to facilitate the investigation and analysis of TF-directed biological systems. Recent ground-breaking stem cell studies [49, 50] have shown the central role of TFs in regulating stem cell pluripotency and differentiation. Understanding the central role of TFs in the control of cellular differentiation has therefore taken on increased importance. Computational predictions in regulatory network analysis of cellular differentiation often highlight a pattern consistent with binding of a structural class of TFs, but fail to delineate which TF class member is acting. TFCat will serve as a reference and organizing framework through which such linkages can progress towards the detailed investigation of candidate TF regulators.

4.5. Materials and methods

4.5.1. Creation of four independent murine and human TF preliminary candidate data sets

Four TF collections were compiled by four independent approaches. All data sets are available on the TFCat portal.

4.5.1.1. Dataset I

A list of 986 human genes considered ‘very likely’ plus 913 considered ‘possibly’ to code for TFs was manually curated in February 2004 [51] using personal knowledge combined with information in LocusLink (now Entrez Gene), the Online Mendelian Inheritance in Man database (OMIM) [52], and PubMed [16]. Selection was guided by the following definition of TF: ‘a protein that is part of a complex at the time that complex binds to DNA with the effect of modifying transcription’. Inclusion was necessarily subjective for two reasons: (1) the definition of ‘transcription factor’ is difficult to precisely constrain, and (2) there was not enough information available for many genes to be certain of their function. Genes that primarily mediate DNA repair (e.g., ERCC6) or chromatin conformation (e.g., CBX1) were excluded. To be considered, a gene had to have an Entrez Gene entry with a Genbank accession number. Text-based searches for the terms ‘transcription factor’ or ‘homeobox’ were used to identify Entrez Gene entries for further analysis. GO node descriptions including the terms: ‘nucleic acid binding’; ‘DNA binding’; and ‘transcription’ were used as a supplement to guide gene selection. A total of 998 TFs were present in the set following this initial compilation. After February 2004, periodic additions were made based on new reports in the literature.

4.5.1.2. Dataset II

The objective of this analysis was to identify a comprehensive list of DBDs for TF gene candidate extraction. Firstly, the SwissProt (SP) database [53] protein entries (obtained in April 2005) were scanned for descriptors or assigned PFAM [11] and/or Interpro [54] domains (downloaded in April 2005) indicating: DNA-binding, DNA-dependent, and transcription.
The extracted gene set was then further extended by including SP gene entries that had assignments to the biological process GO node: GO:0006355 Regulation of DNA Transcription, DNA-dependent and SP records with text descriptions that included JASPAR database transcription factor binding site (TFBS) class names [55]. A list of unique DBDs was compiled from this extraction. All domains were manually reviewed for evidence strongly suggesting DNA binding and transcription factor activity using both Interpro and PFAM domain descriptions and associated literature references. Domains that did not meet these criteria were pruned from the list. Both known and putative TF genes were extracted from Ensembl V29 database [56] using the TF DBD PFAM-based list, yielding a set of 1266 mouse and 1500 human DNA-binding TF candidates.

4.5.1.3. Dataset III

GO trees were constructed for all mouse and human entries in Entrez Gene by starting with the leaf term from gene2go [36] (downloaded July 19th, 2005) and enumerating all parent terms using file version 200507-termdb.rdf-xml. As we were interested in all genes that could be involved in altering transcription, genes were selected if they had any annotation (including Inferred Electronic Annotations, IEA) to GO terms with descriptors: "transcription regulator activity", "transcription factor activity" and/or "transcription factor binding" in their tree. 970 mouse genes and 1203 human genes were identified using this method. As this first extraction did not identify all family members of a putative transcription factor, we performed an additional extraction using the term searches: "DNA binding" and "transcription factor" against the domain information in the Interpro database [54]. The resulting genes were mapped to Entrez Gene entries using the Affymetrix annotation for the MOE-430 v2 chip.
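The GO-based selection described above (start from each gene's annotated term, enumerate all parent terms, and keep the gene if a target descriptor appears anywhere in that tree) can be sketched as follows. This is only a minimal illustration and not the software used for Dataset III: the toy `PARENTS` graph, the use of term names as keys, the gene names, and the function names `ancestors` and `select_genes` are all assumptions made for the example. A real run would load the parent relationships from the GO term database file.

```python
from collections import deque

# Hypothetical toy fragment of the GO graph: term -> list of parent terms.
PARENTS = {
    "sequence-specific TF activity": ["transcription factor activity"],
    "transcription factor activity": ["transcription regulator activity"],
    "transcription regulator activity": ["molecular_function"],
    "kinase activity": ["catalytic activity"],
    "catalytic activity": ["molecular_function"],
    "molecular_function": [],
}

# Descriptors named in the text as selection criteria.
TARGET_TERMS = {
    "transcription regulator activity",
    "transcription factor activity",
    "transcription factor binding",
}


def ancestors(term, parents):
    """Return the term itself plus every term reachable through parent links."""
    seen = {term}
    queue = deque([term])
    while queue:
        current = queue.popleft()
        for parent in parents.get(current, []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


def select_genes(gene2go, parents, targets):
    """Keep genes with at least one annotation whose tree contains a target term."""
    selected = set()
    for gene, terms in gene2go.items():
        if any(ancestors(term, parents) & targets for term in terms):
            selected.add(gene)
    return selected


# Hypothetical annotations: geneA reaches a target term, geneB does not.
gene2go = {
    "geneA": ["sequence-specific TF activity"],
    "geneB": ["kinase activity"],
}
print(sorted(select_genes(gene2go, PARENTS, TARGET_TERMS)))
```

Walking the parent links breadth-first with a `seen` set keeps the enumeration correct when a term has several parents, which is common in GO because the ontology is a directed acyclic graph rather than a strict tree.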
Merging the two lists and removing duplicate entries resulted in 2131 mouse and 2900 human candidate genes involved in transcriptional regulation.

4.5.1.4. Dataset IV

We assembled ~350,000 isoforms representing ~48,000 known and predicted protein-coding mouse genes by mapping seven collections of known and predicted mRNAs to the mouse chromosomes, and clustering them on the basis of overlap (see [57] for source sequences, a representative mRNA from each cluster, and a description of the clustering method). We then assembled 36 known transcription-factor DNA-binding domains from PFAM and SMART [58], and screened the ~350,000 isoforms using the HMMER software [59] to identify approximately 2,500 known or predicted genes containing at least one of the 36 domains. To map the International Regulome Consortium (IRC) entries to Entrez Gene, the IRC sequences [60] were compared with RefSeq sequences using BLAST. Only sequences with an expectation value of at most $10^{-5}$ were selected and subsequently mapped to Entrez Gene using the Gene2Refseq table.

4.5.2. Standardizing TF gene candidate annotation

A website annotation tool and MySQL database were developed to standardize and centralize the annotation effort (Figure S5 in Appendix 1). TF candidate judgments and a high-level taxonomy classification system were established (Tables 4.3 and 4.4) for this web-based annotation process. The secure website enables access to only those genes assigned to each annotator. Each gene annotation required input of text summarizing the journal article evidence that, to some degree, supported or refuted the judgment of a gene (or the gene’s ortholog in a closely related species) as a TF. One or more PubMed journal articles were summarized in the reviewer comments and a final judgment and general taxonomy classification were assigned. Ten trial genes, randomly selected from the list of TFCs, were assessed by four reviewers.
The set of annotations for each trial gene was evaluated for the literature evidence selected and for annotation content and formatting. This evaluation was used to develop annotation evidence guidelines and a suggested general documentation format for the annotation process, which was included in the annotator help guidelines.

4.5.3. Selection and annotation of a subset of TF candidates

The mouse TF candidate datasets were merged, using mapped NCBI Entrez Gene identifiers, into a single non-redundant dataset. Gene2PubMed file counts were extracted and merged by Entrez Gene ID. Genes were manually pre-curated for evidence supporting TF activity by scanning NCBI PubMed abstracts (where available) using both standard gene symbols and aliases and examining GeneRIF entries for each gene in the dataset. Genes with literature evidence suggesting TF function were included in the list of TFCs to be annotated. A set of TFCs associated with two or more PubMed abstracts (based on Gene2Pubmed data and excluding the large annotation project articles) were extracted from the TFC list and randomly assigned to each of seventeen reviewers based on pre-determined reviewer allocation counts. Each TFC was reviewed and judged by the assigned reviewer for TF evidence in the literature as described above. We also extracted and entered the PubMed information accompanying 22 TF DNA-binding profiles from Jaspar Database [55].

During this research project, the Entrez Gene numbers were maintained using the NCBI Gene History file. TFCat gene identifiers were maintained (changed or merged or deleted) if a corresponding change was recorded in this file.

4.5.4. Randomly sampled quality assessment and auditing of TF annotations

TF gene candidates were randomly selected from each reviewer-assigned gene set based on the assigned proportions across all reviewers to form a list of fifty genes for annotation quality assurance (QA) testing.
Each gene was allocated to two reviewers for annotation in a blind QA test. The QA gene annotations were extracted and reviewed for TF judgment and taxonomy classification consistency. A second round of annotation auditing was performed to ensure consistency in the recorded annotation data. All annotations were examined for alignment between the PubMed evidence reviewed and the assigned judgment and functional taxa. Misaligned annotations were forwarded to the annotator for review and revision.

4.5.5. TFC quality assurance comparisons

To assess sensitivity (coverage) in our initial curated TF list, we compared our gene set with TF genes identified in two TF collections. Approximately 800 gene symbols listed in a TF textbook index, authored by Joseph Locker [6], were manually reviewed and mapped, where possible, to 506 mouse Entrez Gene Identifiers using gene descriptions and citations provided in the text. A TF comparison was also performed against the list of annotated fly TFs found in the FlyTF database [13] by mapping, where possible, FlyBase identifiers to NCBI gene identifiers to locate their corresponding mouse homolog in a HomoloGene group [16].

Upon completion of the TFCat curation phase, we performed comparisons with GO [36] and the DBD Transcription Factor Prediction Database resource [42]. To compare our curated set with GO we developed software to enumerate the number of our TF genes in the GO Molecular Function sub-tree under the "transcription regulator activity" node. We used the Mouse Xref file found in the GO Annotation Database [61] to map the TF Entrez gene numbers to the gene identifiers available in the GO database. The DBD resource comparison involved downloading the mouse (Mus musculus 49_37b) and human (Homo sapiens 49_36k) predicted TF sets and development of software to extract all DBD domain models identified in those records.
We then compared the domains found in the DBD mouse/human set with those domain models annotated as DNA-binding in our curated TF set.

4.5.6. Human-mouse ortholog assignment

Human-mouse predicted orthologs were assigned using NCBI HomoloGene groups [16] with one-to-one relationships between the mouse and human gene. Those few genes that did not have a one-to-one relationship were manually inspected and, when available, a preference was given to the human non-predicted RefSeq gene model or an assignment was made using the closest Blast alignment scores between a mouse and human gene pair. Where HomoloGene entries were not available for both human and mouse, ortholog assignments identified in the Mouse Genome Database were used.

4.5.7. TF DNA-binding structure analysis and classification

A DNA-binding protein classification system, an extension of the work from Luscombe et al. [15], was utilized to classify all genes judged as TFs with DNA-binding activity. Structural assignments were made utilizing the HMMER software to enumerate a full set of Superfamily (SCOP-based) HMMs [12] with a threshold of 0.02 and PFAM HMMs [11] for each gene using gathering threshold cut-offs and a calculated model significance value ≤ $10^{-2}$. The Superfamily domain sequences predicted in the TF gene set were subjected to a PFAM HMM analysis to identify PFAM domain models that are satisfied by the same sequences (Table S4 in Appendix 1). Both redundant and non-redundant models were then mapped to the DNA-binding structure classification using model structural descriptions and based on review of related literature for PDB entries that contain these domains.

The DNA-binding classification was extended with additional family classes to accommodate the predicted DNA-binding structures encountered in the curated set of DBD TFs (Table 4.5 and Table S5 in Appendix 1).
To evaluate the structural similarity of DNA-binding domains, we performed alignments using the protein structure comparison web tool SSM (Secondary Structure Matching Service) [62]. We identified PDB entries for each of the new DBD families, with a preference for DNA-bound structures. The DNA-binding domain chains of each PDB entry were aligned with the entire PDB archive (incorporating lowest acceptable matches of 40% and defaulting the remaining parameters) to identify similar DBD structures based on Q-score metric clustering results. A new protein family classification was established if the structure aligned only to itself or was clustered (by Q-value) within its own set of family class structures. In a few cases, where a structure aligned reasonably well with another family in the classification system, PubMed articles were consulted to derive a final decision and any borderline cases were noted and described in the family class description text (Table S5 in Appendix 1). Each DNA-binding TF was then assigned to one or more DNA-binding families in the classification system if it was predicted to contain the related DBD structure.

4.5.8. Identification of homolog sets for mouse TF genes

A homolog analysis process was implemented that considers both sequence similarity and predicted protein domain commonality, and uses a computationally simplified clustering approximation, loosely motivated by proportional linkage clustering [63]. We initially identified sequence similarity using BLASTALL [64] analysis over a full mouse protein RefSeq [65] dataset with an expect value cut-off of $10^{-3}$ and enumerated all HMM PFAM domains over an extracted full representation of the mouse genome using NCBI RefSeq sequences. To extract putative homolog candidates for each TF gene we incorporated a metric, originally proposed by Li et al. [34], which considers the ratio of aligned sequence length to the entire length of each sequence.
Given the focus on mouse genes, the formula for this metric, which we will refer to as metric $I_s'$, was revised to utilize sequence similarity rather than identity. Our metric is computed as:

$$I_s' = S \times \min\left(\frac{n_1}{L_1}, \frac{n_2}{L_2}\right)$$

where $S$ is the proportion of similar amino acids (as defined by the BLOSUM62 matrix) across the hit, $L_i$ is the length of sequence $i$ ($i$ is the query or hit sequence), and $n_i$ is the number of amino acids in the aligned region of sequence $i$. We considered only homolog candidates that had a maximum hit significance of $10^{-4}$ and allowed for a high level of sensitivity by requiring that the computed $I_s'$ values were at least 0.06. We did not include any genes that had been reviewed and deemed not TFs.

Our survey of a set of TF gene family sequence characteristics suggested that some known DNA-binding domains were contained in a small fraction of the total TF protein sequence. However, similarly short alignments between a TF gene and other hit sequences (low $I_s'$ values) can yield a significant number of false positives. We used well-documented SRY-related HMG-box transcription factor (Sox) and Forkhead transcription factor (Fox) TF families (Table S15 and Table S16 in Appendix 1) to evaluate two cluster pruning strategies and selected an approach that increased cluster specificity (proportion of members of a test set in a cluster) without decreasing cluster sensitivity (number of cluster members that are members of a test set). To evaluate cluster pruning of the Blast-based clusters using strictly an $I_s'$ threshold method, we computed cluster sensitivity and cluster specificity over an increasing range of $I_s'$ values, using the Sox and Fox validation sets (Figures S2 and S3 in Appendix 1). An $I_s'$ value was computed between the query sequence and every member in the cluster and a member (gene) was pruned if the $I_s'$ did not satisfy a cut-off threshold.
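The metric and the threshold-only pruning just described can be illustrated with a short sketch. It assumes that the similarity proportion and aligned lengths have already been parsed from the alignment output. The function names `similarity_metric` and `prune_by_threshold`, the dictionary keys, and the example hit values are hypothetical and are not taken from the thesis software.

```python
def similarity_metric(s, n_query, len_query, n_hit, len_hit):
    """Compute I_s' = S * min(n1/L1, n2/L2) for one query/hit alignment.

    s         -- proportion of similar residues across the hit (0..1)
    n_query   -- number of residues of the query inside the aligned region
    len_query -- full length of the query sequence
    n_hit     -- number of residues of the hit inside the aligned region
    len_hit   -- full length of the hit sequence
    """
    if not 0.0 <= s <= 1.0:
        raise ValueError("s must be a proportion between 0 and 1")
    if len_query <= 0 or len_hit <= 0:
        raise ValueError("sequence lengths must be positive")
    coverage = min(n_query / len_query, n_hit / len_hit)
    return s * coverage


def prune_by_threshold(hits, cutoff):
    """Return the names of hits whose I_s' value meets the cutoff."""
    kept = []
    for hit in hits:
        value = similarity_metric(
            hit["s"], hit["n_query"], hit["len_query"],
            hit["n_hit"], hit["len_hit"],
        )
        if value >= cutoff:
            kept.append(hit["gene"])
    return kept


# Hypothetical hits: hitA covers 20% of the query, hitB only 10%.
hits = [
    {"gene": "hitA", "s": 0.6, "n_query": 80, "len_query": 400,
     "n_hit": 80, "len_hit": 100},
    {"gene": "hitB", "s": 0.4, "n_query": 50, "len_query": 500,
     "n_hit": 50, "len_hit": 250},
]
print(prune_by_threshold(hits, cutoff=0.06))
```

Taking the minimum of the two coverage ratios means a short domain-only match against a long protein scores low even when the aligned residues are highly similar, which is why the text pairs this metric with a domain-based criterion.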
Cluster sensitivity and cluster specificity were computed for the range of $I_s'$ values and compared. We then assessed a second cluster pruning approach over a successive range of $I_s'$ values requiring that all predicted domains in a cluster member (gene) match the query gene or, when this criterion could not be met, a particular $I_s'$ value threshold be satisfied (Figures S2 and S3 in Appendix 1). Inclusion of a domain-based method as a primary criterion for pruning, with the incorporation of a stricter $I_s'$ value criterion when the domains did not match, in most cases maintained cluster sensitivity while preserving or improving cluster specificity. Importantly, higher cluster sensitivity and cluster specificity levels enabled comprehensive Sox HMG and Fox Forkhead families to emerge when we applied a proportional linkage clustering approximation approach to merge the overlapping clusters (Figure S6 and Figure S7 in Appendix 1). While the sole application of an $I_s'$ value as a pruning criterion may not generate comprehensive TF family clusters (compare Panel B in Figures S6 and S7 in Appendix 1), our analyses suggested that this metric on its own, implemented with higher parameter values, is useful for identifying closely related sub-family members (Figure S8 in Appendix 1).

Motivated by these assessment results, we implemented a cluster pruning step that required that either all predicted PFAM enumerated domains in the TF gene be matched in a homolog candidate or that the $I_s'$ value between the query TF gene and its homolog hit be no smaller than 0.21 with a sequence similarity no less than 30%. This resulted in 830 overlapping sets consisting of 48,555 members in total. To cluster and merge the sets, we implemented a method that considers a proportional linkage median-based relationship between sets.
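The pruning rule adopted above, which produces the overlapping sets passed on to the merge step, can be sketched as a single predicate. This is an illustration under stated assumptions, not the thesis implementation: the function name `keep_candidate`, its parameters, and the domain labels in the examples are hypothetical, and the treatment of a query gene with no predicted domains (falling back to the metric test) is a choice made here because the text does not specify that case.

```python
def keep_candidate(query_domains, hit_domains, i_s, similarity,
                   min_i_s=0.21, min_similarity=0.30):
    """Decide whether a homolog candidate survives cluster pruning.

    A candidate is kept when every predicted domain of the query TF gene is
    also predicted in the candidate, or, failing that, when the I_s' value
    is at least `min_i_s` and the sequence similarity is at least
    `min_similarity` (both thresholds are the values quoted in the text).
    """
    query = set(query_domains)
    # Assumption: an empty query domain set does not count as a domain match.
    domains_match = bool(query) and query <= set(hit_domains)
    metric_ok = i_s >= min_i_s and similarity >= min_similarity
    return domains_match or metric_ok


# Hypothetical examples with illustrative domain labels.
print(keep_candidate({"HMG_box"}, {"HMG_box", "Other"}, i_s=0.05, similarity=0.20))
print(keep_candidate({"HMG_box"}, {"Forkhead"}, i_s=0.25, similarity=0.35))
print(keep_candidate({"HMG_box"}, {"Forkhead"}, i_s=0.25, similarity=0.25))
```

Checking the domain criterion first mirrors the description of the domain-based method as the primary criterion, with the stricter metric threshold acting only as a fallback when the domains do not match.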
The algorithm performed iterations of set merges, combining two sets $S$ and $T$ if at least half of the genes in the smaller set matched genes in the larger set, i.e., if there were $\lceil \min(|S|, |T|)/2 \rceil$ matching genes. To mitigate the cluster attraction strength properties of initially larger and possibly noisier clusters, the merge process iteratively considered and executed merging over smaller to progressively larger cluster cardinalities using increments of 10. Cluster membership attained a steady-state convergence within 700 iterations.

A cluster confidence metric was developed to measure the number of potential false positives in a large (cardinality > 10) homolog cluster using predicted domain content. We mapped the mouse genes with the enumerated PFAM domains to terms in the GO Molecular Function subtree. We tallied the number of times a specific domain is contained within a gene annotated to the transcription regulator activity node and its child nodes versus the number of times the domain is found in a gene annotated to some other activity node to compute a probability $P_d$ of a particular domain being associated with TF function. The majority of GO annotation evidence codes were included, with the following exceptions: IEA-Inferred from Electronic Annotation, ISS-Inferred from Sequence or Structural Similarity, and RCA-Inferred from Reviewed Computational Analysis. To evaluate cluster confidence $C_n$, we first enumerated the number of genes that contain a specific domain within a cluster, $C_d$, and the number of genes in each cluster, $C_g$, to weight a domain’s association to TF activity:

$$N_d = \frac{C_d}{C_g} P_d$$

and, secondly, included those cluster domains that satisfy

$$D = \{\, C_d \ge \lceil C_g/4 \rceil \,\}$$

to compute $C_n$, using the following equation:

$$C_n = \frac{\sum_{i \in D} N_{d_i}}{|D|}$$

All cluster confidence values and cluster membership were reviewed and qualitatively assessed based on the proportion of verified TFs and binned into four partitions with associated confidence rankings (Table 4.6).

To derive an estimate for the total number of TFs in the human and mouse species, we computed the number of known and predicted TF homologs and adjusted this amount by the cluster rank OAMTF (Table 4.6) to obtain a prediction of 2355 DNA-binding and accessory TFs. To obtain a ballpark figure for a total number of DBD TFs, we performed a separate homolog clustering analysis seeded by genes curated with double-stranded DNA binding activity and reduced the counts using the OAMTF proportions by cluster rank, where applicable. The homolog-based analysis generated an estimate of 1510 DBD TFs. To support our DBD homology-based count analysis, we developed PERL scripts to query the mouse Ensembl mus_musculus_core_47_37 and ensembl_mart_47 databases for extraction of predicted DNA-binding TFs using the identified PFAM DNA-binding domains in TFCat. This extraction produced a total of 1507 Ensembl mouse genes (1416 records supported by Mouse Genome Informatics (MGI); 23 RefSeq and Entrez Gene sourced records; 29 Uniprot/SPTREMBL predicted genes; and 39 Ensembl predicted gene models).

4.5.9. Website download access, wiki publication and annotation feedback

The MediaWiki software was used to implement the TFCatWiki, with some modifications and additions made to the base software code and configuration files. We included the Semantic MediaWiki [66] extension to facilitate access and searching. Each article page contains the annotation information for one gene and has been configured to disallow edits, while all associated discussion pages remain enabled for contribution. Software was developed to extract data from the TFCat wiki database to create the wiki pages.
We implemented a feedback tracking function using the MantisBT software system [67], a well-established, open-source, issue monitoring system, to accommodate tracking and follow-up management of TFCat feedback contributions. PHP interfaces and software were developed to populate MediaWiki user information to the feedback system and provide direct query access to feedback records by gene. We also integrated new data update flagging mechanisms into our internally-available TFCat annotation software tool to identify new or modified gene annotation information that requires re-population to the gene wiki page. 131 The MediaWiki software includes a Watch function, which issues individual e- mails when information is changed on a Wiki page by a wiki user. We developed an e- mail feature that optionally provides lists of wiki pages that have been changed via the backend auto-update process. To enable this feature, we developed an external PHP program (MediaWiki) hook and an associated MySQL database table to solicit user entry and capture of desired e-mail parameter options and notification frequency. An e-mail notification process was developed which issues e-mails for wiki content updates based on user-selected parameters. 132 Table 4.1. Transcription factor data resources Resource Organism Reference/URL Human KZNF Gene Catalog Human Huntley et al. (2006) [68] / [69] Database of bZIP Transcription Factors Human Ryu et al. (2007) [70] / [71] The Drosophila Transcription Factor Database Fly Adryan et al. (2006) [13] / [72] wTF2.0: a collection of predicted C. elegans transcription factors Worm Reece-Hoyes et al. (2005) [73] / [74] Table 4.2. TFCat catalog statistics Table 4.3. 
TFCat judgment classifications Judgment classification Number of annotations % of annotations TF gene 733 61.9 TF gene candidate 256 21.7 Probably not a TF - no evidence that it is a TF 41 3.5 Not a TF - evidence that it is not a TF 30 2.5 Indeterminate - there is no evidence for or against this gene's role as a TF 114 9.6 TF evidence conflict - there is evidence for and against this gene's role as a TF 10 0.8 Total number of genes annotated 1,058 100% Proportion of genes with positive TF judgments 882 83% Proportion of positive TFs with DNA-binding activity 571 65% Proportion of DNA-binding TFs that are (double-stranded) sequence-specific 535 94% 133 Table 4.4. TFCat taxonomy classifications Taxonomy classification Number of annotations % of annotations Basal transcription factor 39 3.7 DNA-binding: non-sequence-specific 30 2.9 DNA-binding: sequence-specific 591 56.5 DNA-binding: single-stranded RNA/DNA binding 20 1.9 Transcription factor binding: TF co-factor binding 315 30.1 Transcription regulatory activity: heterochromatin interaction/binding 51 4.9 134 Table 4.5. 
DNA-binding TF gene classification counts Protein group Protein group description Protein family Protein family description Gene count Predicted occurrences 1.1 Helix-turn-helix group 2 Homeodomain family 131 160 1.1 Helix-turn-helix group 100 Myb domain family 7 16 1.1 Helix-turn-helix group 109 Arid domain family 5 5 1.1 Helix-turn-helix group 999 No family level classification 2 2 1.2 Winged helix-turn-helix 13 Interferon regulatory factor 7 7 1.2 Winged helix-turn-helix 15 Transcription factor family 10 11 1.2 Winged helix-turn-helix 16 Ets domain family 23 23 1.2 Winged helix-turn-helix 101 GTF2I domain family 2 12 1.2 Winged helix-turn-helix 102 Forkhead domain family 26 26 1.2 Winged helix-turn-helix 103 RFX domain family 4 4 1.2 Winged helix-turn-helix 111 Slide domain family 1 1 2.1 Zinc-coordinating group 17 Beta-beta-alpha-zinc finger family 79 450 2.1 Zinc-coordinating group 18 Hormone-nuclear receptor family 43 43 2.1 Zinc-coordinating group 19 Loop-sheet-helix family 1 1 2.1 Zinc-coordinating group 104 GATA domain family 7 12 2.1 Zinc-coordinating group 105 Glial cells missing (GCM) domain family 2 2 2.1 Zinc-coordinating group 106 MH1 domain family 3 3 2.1 Zinc-coordinating group 114 Non methyl-CpG-binding CXXC domain 2 4 2.1 Zinc-coordinating group 999 No family level classification 2 2 3 Zipper-type group 21 Leucine zipper family 41 64 3 Zipper-type group 22 Helix-loop-helix family 71 71 4 Other alpha-helix group 28 High mobility group (Box) family 24 28 4 Other alpha-helix group 29 MADS box family 4 4 4 Other alpha-helix group 107 Sand domain family 3 3 4 Other alpha-helix group 115 NF-Y CCAAT-binding protein family 2 2 5 Beta-sheet group 30 TATA box-binding family 1 2 6 Beta-hairpin-ribbon group 34 Transcription factor T-domain 11 11 6 Beta-hairpin-ribbon group 108 Methyl-CpG-binding domain, MBD family 2 2 7 Other 37 Rel homology region family 10 10 7 Other 38 Stat protein family 6 6 7 Other 110 Runt domain family 3 3 7 Other 112 
Beta_Trefoil-like domain family 2 2 7 Other 113 DNA-binding LAG-1-like domain family 2 2 8 Enzyme group 47 DNA polymerase-beta family 1 7 999 Unclassified structure 901 CP2 transcription factor domain family 3 3 999 Unclassified structure 902 AF-4 protein family 1 1 999 Unclassified structure 903 DNA binding homeobox and different transcription factors (DDT) domain family 1 1 999 Unclassified structure 904 AT-hook domain family 3 6 999 Unclassified structure 905 Nuclear factor I - CCAAT-binding transcription factor (NFI-CTF) family 3 3 135 Table 4.6. Large cluster ranking criteria Cn Rank Implication for unannotated genes in cluster Fraction of observed approximate mean TFs (OAMTF) Cn ≥ 0.20 1 The majority of genes are likely TFs 95% 0.10 ≤ Cn < 0.20 2 A higher proportion of genes are likely TFs 75% 0.03 ≤ Cn < 0.10 3 A higher proportion of genes are likely not TFs 35% 0.00 ≤ Cn < 0.03 4 The majority of genes are likely not TFs 15% 136 4.6. References 1. Garvie CW, Wolberger C: Recognition of specific DNA sequences. Molecular Cell 2001, 8:937-946. 2. Halford SE, Marko JF: How do site-specific DNA-binding proteins find their targets? Nucleic Acids Research 2004, 32:3040-3052. 3. Rescan PY: Regulation and functions of myogenic regulatory factors in lower vertebrates. Comparative Biochemistry and Physiology Part B, Biochemistry & Molecular Biology 2001, 130:1-12. 4. Rosenfeld MG, Lunyak VV, Glass CK: Sensors and signals: a coactivator/corepressor/epigenetic code for integrating signal-dependent programs of transcriptional response. Genes & Development 2006, 20:1405- 1428. 5. Latchman DS: Eukaryotic transcription factors. London ; San Diego, Calif.: Elsevier Academic Press; 2004. 6. Locker J: Transcription factors. Oxford; San Diego, CA: Bios; Academic Press; 2001. 7. Babu MM, Luscombe NM, Aravind L, Gerstein M, Teichmann SA: Structure and evolution of transcriptional regulatory networks. Current Opinion in Structural Biology 2004, 14:283-291. 8. 
Gray PA, Fu H, Luo P, Zhao Q, Yu J, Ferrari A, Tenzen T, Yuk DI, Tsung EF, Cai Z, Alberta JA, Cheng LP, Liu Y, Stenman JM, Valerius MT, Billings N, Kim HA, Greenberg ME, McMahon AP, Rowitch DH, Stiles CD, Ma Q: Mouse brain organization revealed through direct genome-scale TF expression analysis. Science (New York, NY) 2004, 306:2255-2257.
9. Huntley S, Baggott DM, Hamilton AT, Tran-Gyamfi M, Yang S, Kim J, Gordon L, Branscomb E, Stubbs L: A comprehensive catalog of human KRAB-associated zinc finger genes: insights into the evolutionary history of a large family of transcriptional repressors. Genome Research 2006, 16:669-677.
10. Messina DN, Glasscock J, Gish W, Lovett M: An ORFeome-based analysis of human transcription factor genes and the construction of a microarray to interrogate their expression. Genome Research 2004, 14:2041-2047.
11. Bateman A, Birney E, Durbin R, Eddy SR, Finn RD, Sonnhammer EL: Pfam 3.1: 1313 multiple alignments and profile HMMs match the majority of proteins. Nucleic Acids Research 1999, 27:260-262.
12. Gough J: The SUPERFAMILY database in structural genomics. Acta Crystallographica Section D, Biological Crystallography 2002, 58:1897-1900.
13. Adryan B, Teichmann SA: FlyTF: a systematic review of site-specific transcription factors in the fruit fly Drosophila melanogaster. Bioinformatics (Oxford, England) 2006, 22:1532-1533.
14. Xi H, Shulha HP, Lin JM, Vales TR, Fu Y, Bodine DM, McKay RD, Chenoweth JG, Tesar PJ, Furey TS, Ren B, Weng Z, Crawford GE: Identification and characterization of cell type-specific and ubiquitous chromatin regulatory structures in the human genome. PLoS Genetics 2007, 3:e136.
15. Luscombe NM, Austin SE, Berman HM, Thornton JM: An overview of the structures of protein-DNA complexes. Genome Biology 2000, 1:REVIEWS001.
16. Wheeler DL, Barrett T, Benson DA, Bryant SH, Canese K, Chetvernin V, Church DM, DiCuccio M, Edgar R, Federhen S, Geer LY, Kapustin Y, Khovayko O, Landsman D, Lipman DJ, Madden TL, Maglott DR, Ostell J, Miller V, Pruitt KD, Schuler GD, Sequeira E, Sherry ST, Sirotkin K, Souvorov A, Starchenko G, Tatusov RL, Tatusova TA, Wagner L, Yaschenko E: Database resources of the National Center for Biotechnology Information. Nucleic Acids Research 2007, 35:D5-12.
17. Eppig JT, Blake JA, Bult CJ, Kadin JA, Richardson JE: The mouse genome database (MGD): new features facilitating a model system. Nucleic Acids Research 2007, 35:D630-637.
18. Coulier F, Popovici C, Villet R, Birnbaum D: MetaHox gene clusters. The Journal of Experimental Zoology 2000, 288:345-351.
19. Marchler-Bauer A, Anderson JB, Derbyshire MK, DeWeese-Scott C, Gonzales NR, Gwadz M, Hao L, He S, Hurwitz DI, Jackson JD, Ke Z, Krylov D, Lanczycki CJ, Liebert CA, Liu C, Lu F, Lu S, Marchler GH, Mullokandov M, Song JS, Thanki N, Yamashita RA, Yin JJ, Zhang D, Bryant SH: CDD: a conserved domain database for interactive domain family analysis. Nucleic Acids Research 2007, 35:D237-240.
20. Murzin AG, Brenner SE, Hubbard T, Chothia C: SCOP: a structural classification of proteins database for the investigation of sequences and structures. Journal of Molecular Biology 1995, 247:536-540.
21. Orengo CA, Michie AD, Jones S, Jones DT, Swindells MB, Thornton JM: CATH--a hierarchic classification of protein domain structures. Structure (London, England : 1993) 1997, 5:1093-1108.
22. Bernstein FC, Koetzle TF, Williams GJ, Meyer EF, Jr., Brice MD, Rodgers JR, Kennard O, Shimanouchi T, Tasumi M: The Protein Data Bank: a computer-based archival file for macromolecular structures. Journal of Molecular Biology 1977, 112:535-542.
23. Laity JH, Lee BM, Wright PE: Zinc finger proteins: new insights into structural and functional diversity. Current Opinion in Structural Biology 2001, 11:39-46.
24. Nagadoi A, Nakazawa K, Uda H, Okuno K, Maekawa T, Ishii S, Nishimura Y: Solution structure of the transactivation domain of ATF-2 comprising a zinc finger-like subdomain and a flexible subdomain. Journal of Molecular Biology 1999, 287:593-607.
25. Stefancsik R, Sarkar S: Relationship between the DNA binding domains of SMAD and NFI/CTF transcription factors defines a new superfamily of genes. DNA Sequence 2003, 14:233-239.
26. Rekdal C, Sjottem E, Johansen T: The nuclear factor SPBP contains different functional domains and stimulates the activity of various transcriptional activators. The Journal of Biological Chemistry 2000, 275:40288-40300.
27. Horn G, Hofweber R, Kremer W, Kalbitzer HR: Structure and function of bacterial cold shock proteins. Cellular and Molecular Life Sciences : CMLS 2007, 64:1457-1470.
28. Swamynathan SK, Nambiar A, Guntaka RV: Role of single-stranded DNA regions and Y-box proteins in transcriptional regulation of viral and cellular genes. The FASEB Journal : Official publication of the Federation of American Societies for Experimental Biology 1998, 12:515-522.
29. Gasperowicz M, Otto F: Mammalian Groucho homologs: redundancy or specificity? Journal of Cellular Biochemistry 2005, 95:670-687.
30. Hamilton AT, Huntley S, Tran-Gyamfi M, Baggott DM, Gordon L, Stubbs L: Evolutionary expansion and divergence in the ZNF91 subfamily of primate-specific zinc finger genes. Genome Research 2006, 16:584-594.
31. Lemons D, McGinnis W: Genomic evolution of Hox gene clusters. Science (New York, NY) 2006, 313:1918-1922.
32. Rost B: Twilight zone of protein sequence alignments. Protein Engineering 1999, 12:85-94.
33. Liu J, Rost B: Domains, motifs and clusters in the protein universe. Current Opinion in Chemical Biology 2003, 7:5-11.
34. Li WH, Gu Z, Wang H, Nekrutenko A: Evolutionary analyses of the human genome. Nature 2001, 409:847-849.
35. Malumbres M, Barbacid M: Mammalian cyclin-dependent kinases. Trends in Biochemical Sciences 2005, 30:630-641.
36. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA, Hill DP, Issel-Tarver L, Kasarskis A, Lewis S, Matese JC, Richardson JE, Ringwald M, Rubin GM, Sherlock G: Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nature Genetics 2000, 25:25-29.
37. Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, Baldwin J, Devon K, Dewar K, Doyle M, FitzHugh W, Funke R, Gage D, Harris K, Heaford A, Howland J, Kann L, Lehoczky J, LeVine R, McEwan P, McKernan K, Meldrim J, Mesirov JP, Miranda C, Morris W, Naylor J, Raymond C, Rosetti M, Santos R, Sheridan A, Sougnez C et al: Initial sequencing and analysis of the human genome. Nature 2001, 409:860-921.
38. Venter JC, Adams MD, Myers EW, Li PW, Mural RJ, Sutton GG, Smith HO, Yandell M, Evans CA, Holt RA, Gocayne JD, Amanatides P, Ballew RM, Huson DH, Wortman JR, Zhang Q, Kodira CD, Zheng XH, Chen L, Skupski M, Subramanian G, Thomas PD, Zhang J, Gabor Miklos GL, Nelson C, Broder S, Clark AG, Nadeau J, McKusick VA, Zinder N et al: The sequence of the human genome. Science (New York, NY) 2001, 291:1304-1351.
39. TFCat Portal Resource [http://www.tfcat.ca/]
40. Harrison SC: A structural taxonomy of DNA-binding domains. Nature 1991, 353:715-719.
41. Lilley DMJ: DNA-Protein : structural interactions. Oxford: IRL Press at Oxford University Press; 1995.
42. Kummerfeld SK, Teichmann SA: DBD: a transcription factor prediction database. Nucleic Acids Research 2006, 34:D74-81.
43. Han J, Gong P, Reddig K, Mitra M, Guo P, Li HS: The fly CAMTA transcription factor potentiates deactivation of rhodopsin, a G protein-coupled light receptor. Cell 2006, 127:847-858.
44. Finkler A, Ashery-Padan R, Fromm H: CAMTAs: calmodulin-binding transcription activators from plants to human. FEBS Letters 2007, 581:3893-3898.
45. Choi MY, Romer AI, Hu M, Lepourcelet M, Mechoor A, Yesilaltay A, Krieger M, Gray PA, Shivdasani RA: A dynamic expression survey identifies transcription factors relevant in mouse digestive tract development. Development (Cambridge, England) 2006, 133:4119-4129.
46. Kong YM, Macdonald RJ, Wen X, Yang P, Barbera VM, Swift GH: A comprehensive survey of DNA-binding transcription factor gene expression in human fetal and adult organs. Gene Expression Patterns : GEP 2006, 6:678-686.
47. Coulson RM, Ouzounis CA: The phylogenetic diversity of eukaryotic transcription. Nucleic Acids Research 2003, 31:653-660.
48. Lee AP, Yang Y, Brenner S, Venkatesh B: TFCONES: a database of vertebrate transcription factor-encoding genes and their associated conserved noncoding elements. BMC Genomics 2007, 8:441.
49. Takahashi K, Tanabe K, Ohnuki M, Narita M, Ichisaka T, Tomoda K, Yamanaka S: Induction of Pluripotent Stem Cells from Adult Human Fibroblasts by Defined Factors. Cell 2007, 131:1-12.
50. Yu J, Vodyanik MA, Smuga-Otto K, Antosiewicz-Bourget J, Frane JL, Tian S, Nie J, Jonsdottir GA, Ruotti V, Stewart R, Slukvin, II, Thomson JA: Induced Pluripotent Stem Cell Lines Derived from Human Somatic Cells. Science (New York, NY) 2007, 318:1917-1920.
51. Roach JC, Smith KD, Strobe KL, Nissen SM, Haudenschild CD, Zhou D, Vasicek TJ, Held GA, Stolovitzky GA, Hood LE, Aderem A: Transcription factor expression in lipopolysaccharide-activated peripheral-blood-derived mononuclear cells. Proceedings of the National Academy of Sciences of the United States of America 2007, 104:16245-16250.
52. Hamosh A, Scott AF, Amberger JS, Bocchini CA, McKusick VA: Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders. Nucleic Acids Research 2005, 33:D514-517.
53. Boeckmann B, Bairoch A, Apweiler R, Blatter MC, Estreicher A, Gasteiger E, Martin MJ, Michoud K, O'Donovan C, Phan I, Pilbout S, Schneider M: The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Research 2003, 31:365-370.
54. Mulder NJ, Apweiler R, Attwood TK, Bairoch A, Bateman A, Binns D, Bradley P, Bork P, Bucher P, Cerutti L, Copley R, Courcelle E, Das U, Durbin R, Fleischmann W, Gough J, Haft D, Harte N, Hulo N, Kahn D, Kanapin A, Krestyaninova M, Lonsdale D, Lopez R, Letunic I, Madera M, Maslen J, McDowall J, Mitchell A, Nikolskaya AN et al: InterPro, progress and status in 2005. Nucleic Acids Research 2005, 33:D201-205.
55. Sandelin A, Alkema W, Engstrom P, Wasserman WW, Lenhard B: JASPAR: an open-access database for eukaryotic transcription factor binding profiles. Nucleic Acids Research 2004, 32:D91-94.
56. Hubbard T, Andrews D, Caccamo M, Cameron G, Chen Y, Clamp M, Clarke L, Coates G, Cox T, Cunningham F, Curwen V, Cutts T, Down T, Durbin R, Fernandez-Suarez XM, Gilbert J, Hammond M, Herrero J, Hotz H, Howe K, Iyer V, Jekosch K, Kahari A, Kasprzyk A, Keefe D, Keenan S, Kokocinsci F, London D, Longden I, McVicker G et al: Ensembl 2005. Nucleic Acids Research 2005, 33:D447-453.
57. International Regulome Consortium Mouse Genome Project: Mouse Gene List [http://hugheslab.med.utoronto.ca/IRC/]
58. Letunic I, Copley RR, Schmidt S, Ciccarelli FD, Doerks T, Schultz J, Ponting CP, Bork P: SMART 4.0: towards genomic data integration. Nucleic Acids Research 2004, 32:D142-144.
59. HMMER - Profile HMM Software for Protein Sequence Analysis [http://hmmer.janelia.org/]
60. The International Regulome Consortium [www.internationalregulomeconsortium.ca]
61. Gene Ontology Annotation (GOA) Database [http://www.ebi.ac.uk/GOA/]
62. Krissinel E, Henrick K: Secondary-structure matching (SSM), a new tool for fast protein structure alignment in three dimensions. Acta Crystallographica Section D, Biological Crystallography 2004, 60:2256-2268.
63.
Day WHE, Edelsbrunner H: Investigation of proportional link linkage clustering methods. Journal of Classification 1985, 2:239-254.
64. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. Journal of Molecular Biology 1990, 215:403-410.
65. Pruitt KD, Tatusova T, Maglott DR: NCBI reference sequences (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Research 2007, 35:D61-65.
66. Krötzsch M, Vrandečić D, Völkel M, Haller H, Studer R: Semantic Wikipedia. Journal of Web Semantics 2007, 5:251-261.
67. MantisBT Issue Tracking Software [http://www.mantisbt.org/]
68. Huntley S, Baggott DM, Hamilton AT, Tran-Gyamfi M, Yang S, Kim J, Gordon L, Branscomb E, Stubbs L: A comprehensive catalog of human KRAB-associated zinc finger genes: insights into the evolutionary history of a large family of transcriptional repressors. Genome Research 2006, 16:669-677.
69. Human KZNF Gene Catalog [http://znf.llnl.gov]
70. Ryu T, Jung J, Lee S, Nam HJ, Hong SW, Yoo JW, Lee DK, Lee D: bZIPDB: a database of regulatory information for human bZIP transcription factors. BMC Genomics 2007, 8:136.
71. bZIPDB - Database of bZIP Transcription Factors [http://bzip.kaist.ac.kr:8080/bzip.html]
72. FlyTF - The Drosophila Transcription Factor Database [http://www.mrc-lmb.cam.ac.uk/genomes/FlyTF/]
73. Reece-Hoyes JS, Deplancke B, Shingles J, Grove CA, Hope IA, Walhout AJ: A compendium of Caenorhabditis elegans regulatory transcription factors: a resource for mapping transcription regulatory networks. Genome Biology 2005, 6:R110.
74. A Collection of Predicted C. elegans Transcription Factors [http://genomebiology.com/content/supplementary/gb-2005-6-13-r110-s1.xls]

5. Brain MiniPromoters by Design: Pleiades Promoter Project [footnotes 13, 14]

5.1. Chapter preamble

The myelin sheath wraps around axons to facilitate proper conduction of neural impulses throughout the nervous system.
Myelin is produced by two cell types: oligodendrocytes in the central nervous system and Schwann cells in the peripheral nervous system. Transcription factors (TFs) play an important role in myelination. OLIG1 is a TF that plays a critical role in oligodendrocyte development. Non-coding regions surrounding the OLIG1 gene coding sequence were evaluated for tissue-specific expression in adult brain regions by the Pleiades Promoter Project. One of the constructs demonstrated expression in oligodendrocytes and two other OLIG1-associated constructs were negative (i.e. did not express). Chapter 5 describes analyses conducted to determine the TFs that could be acting to regulate the positively expressed construct. Gene expression analyses and sequence feature evaluations of the positive and negative constructs were conducted. Drawing from the work described in Chapter 4, the TFCat resource played an important role in identifying TFs differentially expressed in oligodendrocytes. From this analysis, two TFs emerged as likely candidates responsible for the construct expression. The thesis-related work is described in Figure 5.7b and Appendix 2.

Footnote 13: A version of this chapter will be submitted for publication. Elodie Portales-Casamar, Douglas J. Swanson, Charles N. de Leeuw, Kathleen G. Banks, Shannan J. Ho Sui, Debra L. Fulton, Johar Ali, Mahsa Amirabbasi, David J. Arenillas, Nazar Babyak, Sonia F. Black, Russell J. Bonaguro, Erich Brauer, Tara R. Candido, Mauro Castellarin, Jing Chen, Ying Chen, Jason C.Y. Cheng, Vik Chopra, T. Roderick Docking, Lisa Dreolini, Cletus A. D’Souza, Erin K. Flynn, Randy Glenn, Kristi Hatakka, Taryn G. Hearty, Behzad Imanian, Steven Jiang, Shadi Khorasan-zadeh, Ivana Komljenovic, Stéphanie Laprise, Nancy Y. Liao, Jonathan S. Lim, Stuart Lithwick, Flora Liu, Jun Liu, Li Liu, Meifen Lu, Melissa McConechy, Andrea J. McLeod, Marko Milisavljevic, Jacek Mis, Katie O’Connor, Betty Palma, Diana L. Palmquist, Jean-François Schmouth, Magdalena I. Swanson, Bonny Tam, Amy Ticoll, Jenna L. Turner, Richard Varhol, Jenny Vermeulen, Russell F. Watkins, Gary Wilson, Bibiana K.Y. Wong, Siaw H. Wong, Tony Y.T. Wong, George S. Yang, Athena R. Ypsilanti, Steven J.M. Jones, Robert A. Holt, Daniel Goldowitz, Wyeth W. Wasserman, Elizabeth M. Simpson. Brain MiniPromoters by Design: Pleiades Promoter Project.

Footnote 14: The computational analyses linked to the thesis research are predominantly described in the supplementary information of the manuscript, provided as Appendix 2 of this thesis.

5.2. Introduction

There is increasing interest in the application of gene therapy to treat incurable brain, eye, and spinal cord disorders. Several promising clinical trial results have been reported for Parkinson, Huntington, and Alzheimer disease, as well as eye diseases (reviewed in [1, 2]). However, current therapeutic approaches frequently incorporate ubiquitous promoters, which may direct expression in both targeted therapeutic cells and untargeted cellular regions. Another critical issue is that gene therapy approaches can be mutagenic when delivered through random insertion approaches that cause integration of multiple copies [3]. Delivery technologies that enable region- and/or cell-type-specific expression using a controlled transgenesis strategy would greatly improve the potential applications for gene therapy. The Pleiades Promoter Project is striving to address these major challenges by enabling brain region-specific gene delivery using MiniPromoters (MiniPs). These MiniPs are being pre-tested as single-copy inserts at a defined locus in the mouse genome to enable reproducible and physiological expression levels. All Pleiades MiniPs contain entirely human DNA regulatory sequences to minimize cross-species concerns in therapy and to avoid or reduce epigenetic inactivation effects.

The mammalian central nervous system (CNS) is a complex entity comprising different neuronal and non-neuronal cell types.
Cellular identity is characterized by a unique repertoire of gene expression profiles. The Allen Brain Atlas has helped elucidate brain cell diversity by mapping the expression of approximately 20,000 genes in the adult mouse brain using in situ hybridization [4]. Another gene expression detection effort, the GENSAT project, incorporates transgenic mouse techniques to map reporter gene expression in CNS tissues of mice driven by large chromosomal segments [5]. These resources demonstrate the selectivity with which some genes are expressed in the brain. Much of metazoan gene expression is driven by modular conserved regulatory regions [6-9]. Notably, studies have highlighted similar patterns of gene expression in corresponding human and mouse brain regions [10].

The Pleiades Promoter Project implemented a three-step approach to identify putative non-coding regions that may be used to drive gene therapies. First, we selected a set of genes that were specifically expressed in adult brain regions of interest and deemed to be of therapeutic interest. Second, we computationally predicted candidate human regulatory regions that could be responsible for each gene’s tissue-specific expression pattern. Lastly, we tested the human sequences in vivo in mice using a robust mouse transgenesis strategy.

The overarching goal of the Pleiades project is to generate a bank of human DNA MiniPs, of no more than 4 kb in length to ensure suitability for gene therapy purposes, that drive expression in selected, specific adult brain regions of therapeutic interest. Our approach is founded on the premise that robust bioinformatics-based predictions of putative regulatory regions can be generated through the systematic application of computational analyses that consider prior experimental evidence.
We have improved on previously applied approaches through the application of a standardized promoter design methodology and incorporation of a consistent genomic context for integration, testing and comparison of MiniP reporter gene expression. Our expanding collection of brain-specific promoters is made publicly available at: http://www.pleiades.org.

5.3. Results

5.3.1. Novel tools to study and treat the brain

The Pleiades Promoter Project strategy incorporates computational prediction of the regulatory regions that could be responsible for gene expression patterns observed in specific brain regions or cell types [11]. However, the goal was not to recapitulate the potentially complex endogenous expression pattern of a gene, but rather to identify a subset of putative regulatory regions that drive reporter expression in targeted brain regions. For each set of gene regulatory region candidates, we designed several (from 1 to 7) MiniP constructs that incorporated unique combinations of the candidate regions (Fig. 5.1). To obtain physiological expression levels for a human-based MiniP sequence, we included the endogenous promoter of the gene. Importantly, we designed the MiniPs to be relatively short in length (≤ 4 kb) for ease of manipulation and to maintain their suitability for insertion in space-restricted molecular constructs (i.e., gene-therapy vectors) (Fig. 5.1a). Addressing an annotated set of 62 genes, we designed 240 MiniPs containing 468 different candidate regulatory regions (Fig. 5.1b). The MiniP sequences were cloned 5′ of a reporter gene sequence (EGFP, lacZ, or an EGFP/cre fusion protein). To date this research project has predicted 313 regulatory regions, which have been incorporated in 202 MiniP construct designs for 58 selected genes.
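The construct design constraint described above can be illustrated with a minimal Python sketch that enumerates combinations of candidate regulatory regions which, together with the endogenous promoter, stay within the 4 kb limit. The `Region` type, the `enumerate_designs` function, and the region names and lengths are hypothetical and serve only as an illustration; this is not the Pleiades design pipeline, which also weighed literature and conservation evidence when choosing combinations.

```python
from itertools import combinations
from typing import List, NamedTuple, Tuple

MAX_MINIP_BP = 4000  # the <= 4 kb design limit stated in the text


class Region(NamedTuple):
    """A sequence element with a name and a length in base pairs (hypothetical type)."""
    name: str
    length_bp: int


def enumerate_designs(
    promoter: Region,
    candidates: List[Region],
    max_bp: int = MAX_MINIP_BP,
) -> List[Tuple[Tuple[str, ...], int]]:
    """Return (candidate names, total length) for every non-empty combination
    of candidate regions that fits within max_bp once the promoter is added."""
    designs = []
    for k in range(1, len(candidates) + 1):
        for combo in combinations(candidates, k):
            total = promoter.length_bp + sum(r.length_bp for r in combo)
            if total <= max_bp:
                designs.append((tuple(r.name for r in combo), total))
    return designs


if __name__ == "__main__":
    # Illustrative lengths only; not taken from any Pleiades construct.
    promoter = Region("endogenous_promoter", 1200)
    candidates = [Region("cand_A", 900), Region("cand_B", 1500), Region("cand_C", 2100)]
    for names, total in enumerate_designs(promoter, candidates):
        print(names, total)
```

In practice only a handful of the admissible combinations would be built, so a filter like this would be a first pass before manual prioritization.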
Each tested construct was introduced by homologous recombination immediately 5′ of the Hprt1 locus on the X chromosome, as previously described [12, 13], which provides a single-copy knock-in insertion for reproducible expression [14] (Fig. 5.1a). Currently, 970 embryonic stem cell (ESC) lines have been derived carrying 180 MiniPs. Transgenic mice carrying 103 MiniPs have been generated and 52 transgenic mouse strains have undergone post-mortem evaluation. Notably, 36% of tested construct designs have demonstrated expression in the brain; amongst these are MiniPs expressed in both glial and neuronal cell types.

5.3.2. A new score to prioritize suitable genes

As described previously in D’Souza et al., we utilized a genome-wide evaluation approach to identify 237 candidate genes with region-specific expression patterns in a set of 30 adult brain regions of therapeutic interest [11]. To further refine our list of genes, we considered disease relevance and genomic suitability for a MiniP design. To judge disease relevance, we reviewed the literature and noted the phenotypic consequences of gene knock-outs in mice, when this information was available. MiniP design suitability was assessed by prioritizing those genes for which putative regulatory regions could be more readily distinguished. Comparative sequence analysis, often referred to as phylogenetic footprinting, has proven useful in predicting regulatory regions, with the expectation that sequences under selective pressure will be more conserved than those that are not.
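The idea behind phylogenetic footprinting can be illustrated with a minimal Python sketch that scans a pairwise human-mouse alignment with a sliding window and reports the windows whose fraction of identical columns meets a threshold. The function name, the 50-column window, and the 70% identity cutoff are assumptions chosen for illustration; they are not the parameters or software used in this project.

```python
from typing import List, Tuple


def conserved_windows(
    human_aln: str,
    mouse_aln: str,
    window: int = 50,           # assumed window size, in alignment columns
    min_identity: float = 0.7,  # assumed identity cutoff
) -> List[Tuple[int, int, float]]:
    """Return (start, end, identity) for each alignment window whose fraction
    of identical, non-gap columns is at least min_identity.
    Coordinates are 0-based, end-exclusive alignment columns."""
    if len(human_aln) != len(mouse_aln):
        raise ValueError("aligned sequences must have equal length")
    if window <= 0:
        raise ValueError("window must be positive")

    # 1 where both rows carry the same base, 0 for mismatches and gap columns.
    matches = [
        int(h == m and h != "-")
        for h, m in zip(human_aln.upper(), mouse_aln.upper())
    ]
    hits: List[Tuple[int, int, float]] = []
    if len(matches) < window:
        return hits

    running = sum(matches[:window])
    for start in range(len(matches) - window + 1):
        if start > 0:
            # Slide by one column: add the entering column, drop the leaving one.
            running += matches[start + window - 1] - matches[start - 1]
        identity = running / window
        if identity >= min_identity:
            hits.append((start, start + window, identity))
    return hits
```

Overlapping windows that pass the cutoff would normally be merged into a single conserved region before any downstream binding-site analysis; that step is omitted here for brevity.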
We thus based our gene prioritization on the following criteria: (i) the existence of known regulatory regions responsible for the gene's expression pattern in the brain, (ii) the generation of a single transcript from the gene, to avoid having to differentiate between alternative promoters, (iii) the amount of non-coding sequence to be considered for analysis, (iv) the number of conserved regions within our analysis boundaries, and (v) how well distinguished the conserved regions are relative to the overall conservation level for a gene. In brief, we sought genes containing a small number of well-defined conserved non-coding regions close to their start site. To this end, we developed a “regulatory resolution score” intended to reflect human perception of what constitutes a good candidate gene for MiniP design (see Appendix 2 information). We demonstrated that this score captures aspects of the manual curation process by comparing the results to 100 manually curated genes (Fig. 5.2a and Appendix 2 information). The 62 genes that we selected for MiniP design are heavily skewed towards higher scores in our set of 237 brain region-selective genes (Fig. 5.2b,c).

5.3.3. MiniPromoter designs incorporate available information

Our review of the literature identified important gene regulatory information for a selected set of genes, which was considered in the construct design process. Literature evidence describing regulatory function for 62 genes was reviewed, curated, and stored in the PAZAR database [15] within the “Pleiades Genes” project. Importantly, this data collection represents the first publicly available collection of detailed gene regulatory data for brain regions, which documents both validated regulatory sequences and the transcription factors (TFs) acting on them.

One of the central objectives for the design of MiniPs was to select regulatory sequence that could be functional in both human and mouse brain cells.
Although we mainly relied on mouse expression information to select our initial list of genes, the MiniPs incorporated human DNA sequences for validation in mice. To identify the human-mouse orthologous non-coding sequence, we implemented phylogenetic footprinting over human and mouse sequence alignments. Identified conserved regions were subjected to TF binding site (TFBS) prediction analyses to identify motifs that could be bound by TFs previously shown to direct specific expression patterns in the brain tissue of interest. For example, the TF NRSF/REST was included in TFBS analyses as it has been shown to bind the non-coding region of many neuronal genes to inhibit gene expression in non-neuronal tissues [16]. The TF binding models used in this study were compiled from the JASPAR database [17] and supplemented with brain-specific TFs annotated in the PAZAR “Pleiades Genes” project described above.

As illustrated in the Appendix 2 information, we developed a MiniP design pipeline that takes into account all available information from the literature, genome annotations, and the computational analyses mentioned above. Endogenous promoters were identified using 5′ cap analysis gene expression (CAGE) tags [18], genome annotated transcripts (mRNAs, ESTs), and CpG islands [19]. In the case of ADORA2A, multiple transcription start sites were identified and RT-PCR on human brain cDNA was performed (see Appendix 2 information) to identify the predominantly expressed transcript in the targeted region (striatum). In a few instances, when this information was available for candidate regulatory regions, we incorporated histone modification chromatin immunoprecipitation (ChIP) data for mouse and human cortex (Jones et al. unpublished) in our MiniP design pipeline analyses.

5.3.4. ESC neural differentiation for pre-screening

To test our pre-screening ESC differentiation strategy and to demonstrate that the genomic location immediately 5′ of the Hprt1 locus is suitable for specific expression, we selected for validation a set of previously characterized promoters that drive expression either ubiquitously or in a subset of neurons or glial cells. We evaluated the expression patterns of two MiniP constructs: Ple88, composed of previously known GFAP regulatory regions [20], for glial cell expression, and Ple53, composed of previously validated DCX regulatory regions, for neuronal expression [21] (Fig. 5.3). It should be noted that both of these genes are developmentally expressed. The Barberi et al. assay approach was chosen for the characterization since it has no region-specific bias, induces neural stem cells within 6 days, and expresses genes in a developmentally appropriate temporal fashion [22, 23]. Expression of the MiniPs followed a pattern similar to the endogenous expression of their associated genes, as shown by RT-PCR (Fig. 5.3b,f) and staining (Fig. 5.3c,g). The positive ESC pre-screening results were then confirmed in the corresponding knock-in adult mouse brains (Fig. 5.3d,h). This validation test demonstrates the utility of the MiniP-containing ESCs for directed expression in stem cell differentiation assays. While the MiniPs were designed to express in mature adult tissues, it is anticipated that a subset will function in the earlier developmental stages represented in stem cell studies, based upon the observed expression patterns for the MiniP source genes in BGEM [24], GenePaint [25], and in the literature.

5.3.5. Novel MiniPromoter expression patterns in the brain

The Pleiades resource includes MiniPs that direct expression in diverse cell types and regions of the brain, spinal cord, and eye (Fig. 5.4).
EGFP immunohistochemistry was performed on adult brains, spinal cords, and eyes of germline males to characterize each MiniP-directed expression pattern. Ple151-EGFP expresses broadly throughout cortical and subcortical regions whereas only a small cluster of neurons shows Ple111-EGFP expression in the lateral hypothalamus (Fig. 5.4). A portion of the Pleiades MiniPs drive expression of the EGFP/cre fusion protein and were analyzed for lacZ staining after recombination of the Gt(ROSA)26Sortm1Sor allele [26] (Fig. 5.4; Ple103, Ple162, Ple167, Ple176). For these assays, lacZ acts as a historical marker and is visualized wherever the MiniP driving cre was expressed during development. For example, in the case of the Ple167, Ple176, Ple177, and Ple178 constructs, it appears that cortical columns are labeled, which is an indication that reporter expression occurred early in development, before the migration of cells through the cortex. This effect produces broader labeling of regions in the adult brains (data not shown).

As expected, a subset of MiniPs directed neuronal-specific expression (examples in Fig. 5.5a-h) and others directed glial-specific expression (examples in Fig. 5.5i-p). Ple54 (based on DCX regulatory regions) and Ple111 (based on HCRT regulatory regions) express the EGFP reporter in neuronal cell populations that are immunopositive for their respective source gene. EGFP expression for the Ple54 MiniP is found in the neurogenic cells of the adult rostral migratory stream as well as olfactory bulb neurons (Fig. 5.5a-d). EGFP expression in the Ple111 transgenic brain is co-localized with Hcrt in a discrete population of neurons in the hypothalamus (Fig. 5.5e-h). Ple88 (based on GFAP regulatory regions) and Ple185 (based on S100β regulatory regions) express the EGFP reporter gene with fidelity to their respective associated genes.
Ple88 expresses in astroglial-like cells throughout the brain, spinal cord, and the eye, with a subset of cells co-expressing the Gfap source gene (Fig. 5.5i-l), similar to what has been previously described for this construct [20]. The pattern of expression of Ple185 is broad throughout the brain, most notably in cerebellar Bergmann glia and myelinated white matter tracts (Fig. 5.5m-o), as well as fiber tracts in the cortex (Fig. 5.5p). The EGFP expression pattern in the brain co-labels with myelinated axons in the cortex, striatum, and the brainstem (data not shown).

To demonstrate the value of the EGFP/cre strains carrying lacZ as a historical marker, we performed a developmental analysis of the Ple162-EGFP/cre (based on PITX3 regulatory regions) mouse to assess the history of the lacZ-positive neurons located in the ventral tegmental area (VTA) of the midbrain. Detailed analysis of these positive cells in the adult brain shows that they are neurons located just superficial to the VTA dopaminergic cells but are not TH-positive dopaminergic cells (Fig. 5.6a-c), representing a novel subpopulation of neurons not typically found in the adult PITX3-positive population [27]. The developmental analysis of lacZ expression, in whole mount and sectioned tissue, tracks these cells from their initial genesis at the mesencephalic flexure to their final destination in the adult VTA (Fig. 5.6d-k). The expression of the lacZ reporter is observed at E11.5, but not in neural tissue at E10.5, delineating the onset of Ple162 MiniP expression (Fig. 5.6h-k).

5.3.6. A unique dataset for in silico studies

As a proof of principle for our bioinformatics approach, we generated a smaller MiniP than previously available for the GFAP gene [20]. Using our MiniP design pipeline, we selected a minimal promoter and a well-conserved upstream sequence for validation (Ple90; Fig. 5.7a).
As predicted, the 1.4 kb Ple90 drives the expression of EGFP in a pattern similar to the previously characterized 2.2 kb Ple88 (Fig. 5.7a). This result was confirmed in a recent study analyzing similar GFAP promoter constructs [28].

The observed properties of MiniPs facilitate study of transcriptional regulation in specific cells and tissues through the comparison and correlation of MiniP sequences. For the MiniPs associated with OLIG1, one MiniP appears to drive expression in oligodendrocytes in the adult brain (Ple151) (see Appendix 2, Figure S4), as sought based on the endogenous gene expression pattern [29]. Two other OLIG1-associated MiniP constructs did not exhibit expression in adult brain (Ple148 and Ple150). We evaluated and compared the TF binding site predictions between the positive and negative MiniP sequences (Appendix 2 information). Our results highlight the potential involvement of EGR1 (KROX-24) in the Ple151 expression pattern (Fig. 5.7b and Appendix 2 information). Recent studies demonstrate that OLIG1 promotes the initiation of oligodendrocyte differentiation [30, 31] and is responsible for oligodendrocyte specification in some brain regions [32]. Temporal expression studies implicate EGR1 (KROX-24) as having a role in the primary response that leads to oligodendrocyte differentiation [33]. Consistent with suggestions of a role for the AP-1 family of TFs in oligodendrocyte differentiation [34], we observed predicted FOS (AP-1) binding sites in Ple151 and not in the inactive MiniPs (Appendix 2 information).

5.4. Discussion

The design of MiniPs for targeted adult brain gene expression, guided by a comprehensive collection of experimentally based regulatory data and a computationally driven pipeline, is an important new development. In the past, discovery of brain-specific promoters was obtained by low-throughput promoter-deletion studies (e.g. L7/Pcp2 [35, 36] and Camk2a [37, 38]).
The discovery of brain-specific promoters has been of enormous value, but the tedious efforts needed for identification and the sparse collection limit research and therapeutic initiatives. The GENSAT project is generating mice using engineered BACs (100-200 kb) driving EGFP regionally in the brain [5]. However, many applications, including therapeutic gene delivery, require compact and portable promoters. Conservation-driven selection of candidate regulatory sequences is being used to identify regions directing reporter gene expression in the embryonic mouse [6]. Despite great interest and demand, progress in identifying regulatory sequences for selective expression in the adult brain has been limited. The Pleiades Promoter Project has addressed this issue by undertaking a high-throughput, bioinformatically directed, parallel design process for 62 brain genes, producing MiniPs vetted in vivo in the adult mouse brain.

The success of the Pleiades design process was facilitated by the availability of large-scale gene expression studies [4, 24, 25], comparative genomics tools [39], and bioinformatics software for regulatory sequence prediction [40], which together made it possible to produce large numbers of MiniPs. By introducing the regulatory resolution scoring procedure to target the design efforts on the most tractable genes, it was possible to increase the probability of design success, a necessity given the expense of transgenic studies in the developed brain. Future efforts will further benefit from large-scale ChIP data for epigenetic marks [41], co-activator localization [42], and TF binding sites (e.g. the ENCODE project [43]), making it feasible to pursue diverse tissues and more complex designs.

It is likely that any gene expressed in the nervous system is expressed in more than one cell type.
We hypothesized that we could dissect the regulatory gene expression controls by computationally identifying sequences capable of directing discrete regions of endogenous gene expression profiles. This was observed with some MiniPs (e.g. Ple111). In other cases the regulatory designs directed novel patterns of expression (e.g. Ple176). In short, the design of MiniPs was successful in delivering gene expression profiles of utility. While 36% of the tested MiniP constructs directed expression in the brain, it is likely that many MiniPs will drive additional discrete expression in developmental stages or physiological conditions within and external to the brain that were not assessed in our initial characterization.

The Pleiades resource of plug-and-play MiniPs for brain expression will have a major impact on basic and preclinical research. The MiniPs can be easily introduced into constructs, and in many cases into viruses, for brain-, spinal cord-, and eye-directed delivery of molecules such as siRNA, cre recombinase, fluorescent reporters, and therapeutic proteins. Their dual origin, human sequence and mouse testing, suggests that they will function not only in those species, but also in rats, monkeys, and other research and clinical models. We have already demonstrated their function in mouse ESCs, and a future critical step will be assaying their function in human stem cells. Driving specific reporters, they can be used in flow-sorting experiments to enrich or exclude cells of specific neural types. In drug testing experiments they can be used to monitor the suppression or enhancement of responding cell types. Ultimately, the greatest impact of the Pleiades MiniPs is anticipated to be the added specificity for therapeutic gene delivery into the human brain. While this may be accomplished using viruses, site-specific delivery to the human genome directly or in cell therapy is an area of active research [44, 45].
The availability of a large collection of new MiniPs will play an important role in treatments designed for incurable brain diseases.

5.5. Methods

5.5.1. Pleiades Promoter Project pipeline

The Pleiades Promoter Project is building a collection of 240 MiniPs in a four-year time frame. A pipeline has been established that involves 5 specialized laboratories. The MiniP sequences were computationally designed. In silico designs were assembled into DNA molecules at a rate of 4 per week. The construct DNA was electroporated into B6129F1 ESCs (mEMS1202 or mEMS1204 [13]) at a rate of 7 per week (including controls). Per electroporation, 10-15 clones per construct were picked, expanded, and PCR-verified, to obtain ~4 correctly targeted ESC lines per construct. ESCs were microinjected into E3.5 blastocysts from an N2 backcross of ICR into B6-Alb (BAN2), selected for high blastocyst yield. Germline females were backcrossed to C57BL/6J. The brains of N2-N3 germline males (8-12 weeks) were analyzed using histochemistry procedures. A minimum of 3 brains were processed for each MiniP. Every brain was cryosectioned (20 µm sections at 640 µm intervals) from medial to lateral sagittal, and prepared for brightfield detection of EGFP or lacZ. When reporter expression was absent in at least 3 adult brains, a MiniP strain was classified as negative. Positive MiniP strains underwent further histological analyses to define the cellular pattern of gene expression. Both positive and negative strains were prepared for presentation on the Internet at http://www.pleaides.org. For the positive strains, images were captured at high resolution (12000 x 16000 pixels) and tiled into hundreds of smaller images for viewing using Zoomify® technologies. Using the Zoomify viewer, investigators are able to drill into the images quickly and efficiently without loss of fidelity.

5.5.2. Pleiades Promoter Project protocols

The general methods used have been described previously [13].
Only modifications or additions specific to this work are reported below.

5.5.2.1. Hprt1 targeting vectors and MiniPromoters

The Hprt1 targeting vectors used in this study are pEMS1306 (EGFP reporter [13]), pEMS1313 (lacZ reporter), and pEMS1302 (EGFP/cre reporter). The pEMS1313 and pEMS1302 fragments from the MCS to the end of the reporter gene were synthesized by GeneArt (Germany) and cloned into the Hprt1 targeting plasmid pJDH8A/246b [46] using EcoRI restriction sites.

MiniPs comprised up to 4 distinct genomic segments joined by fusion PCR. Each genomic segment was first PCR-amplified independently using AccuPrime Pfx DNA Polymerase (Invitrogen), PCR primers (Integrated DNA Technologies), and BAC DNA template (10 pg to 200 ng). PCR primers for the outermost 5′ and 3′ segments were tailed with the appropriate restriction sites to allow for cloning. For MiniPs with two or more segments, PCR products of upstream segments were 3′ tailed with 18 bp linkers homologous to the first 18 bp of the adjacent downstream genomic segment. Reaction conditions were 0.25 Unit enzyme, 1x AccuPrime Pfx reaction mix, and 1.0 µM each primer mix in a 20 µl volume. A 2-minute denaturation at 95 °C was followed by 30 cycles of 95 °C for 15 seconds, annealing for 30 seconds (at the Tm corresponding to the primer pair), and 68 °C for 90 seconds, plus a final extension at 68 °C for 10 minutes. The PCR reaction was run on a 1% low melting point agarose gel, visualized using SYBR Green (Invitrogen), and the product was excised and recovered from the gel using the QIAquick gel extraction kit (Qiagen). Reaction products were eluted using 30 µl of Ultrapure water (Gibco), then quantified using the NanoDrop (Thermo Scientific). For MiniPs with multiple elements, fusion PCR was performed as above, but using 2.0 µl of gel-purified first-round reaction products (10 pg to 200 ng). Additional binary fusions were executed as above until the full-length MiniP was obtained. A subset of 9 MiniPs was generated by direct synthesis at GeneArt.
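The linker-tailing rule described above (each upstream segment carries a 3′ tail homologous to the first 18 bp of its downstream neighbour, and segments are then joined by successive binary fusions) can be illustrated with a short in silico sketch. This is only an illustration of the overlap logic under simplifying assumptions: sequences are treated as plain top-strand strings, primer design and Tm calculation are ignored, and the function names and toy sequences are hypothetical rather than part of the Pleiades pipeline.

```python
LINKER_LEN = 18  # linker length stated in the text


def tail_segments(segments, linker_len=LINKER_LEN):
    """Append to every upstream segment the first `linker_len` bases of the
    adjacent downstream segment (the 3' linker tail). The last segment is
    left untailed."""
    for seg in segments:
        if len(seg) < linker_len:
            raise ValueError("each segment must be at least linker_len long")
    tailed = []
    for i, seg in enumerate(segments):
        if i + 1 < len(segments):
            tailed.append(seg + segments[i + 1][:linker_len])
        else:
            tailed.append(seg)
    return tailed


def fuse(upstream, downstream, linker_len=LINKER_LEN):
    """One binary fusion: join two products through their shared linker."""
    overlap = upstream[-linker_len:]
    if not downstream.startswith(overlap):
        raise ValueError("linker does not match the downstream segment")
    return upstream + downstream[linker_len:]


def assemble(segments, linker_len=LINKER_LEN):
    """Tail all segments, then apply binary fusions from 5' to 3'."""
    tailed = tail_segments(segments, linker_len)
    product = tailed[0]
    for nxt in tailed[1:]:
        product = fuse(product, nxt, linker_len)
    return product


# Hypothetical toy segments (not real MiniP sequences).
toy_segments = [
    "ATGCGTACGTTAGCCTAGGATCCA",
    "GGCTTAACCGGTTAACGCGTATATCG",
    "TTGACCTGAAGCTAGCTAGGCTAACC",
]
full_length = assemble(toy_segments)
```

In this sketch the assembled product is simply the concatenation of the input segments; the value of the exercise is that a mismatched linker raises an error before any wet-lab work would be attempted.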
The final MiniPs were cloned into one of our Hprt1 targeting vectors and sequence verified with primers located every 300 bp along the construct on both strands. All discrepancies between the designed and constructed sequences were inspected using the UCSC Genome Browser annotations (hg18) [47]. We tolerated discrepancies if they were known polymorphisms, were located in a non-conserved region (PhastCons Vertebrate Multiz Alignment & Conservation (17 Species) score below 0.7), or if analysis did not show any further regulatory implication. We rejected any sequence with an insertion or deletion larger than 10 bp.

5.5.2.2. Knock-in immediately 5′ of the Hprt1 locus

The mEMS1204 (B6129F1-Gt(ROSA26)Sortm1Sor/+, Hprt1b-m3/Y), mEMS1202 (B6129F1-Gt(ROSA26)Sor+/+, Hprt1b-m3/Y), and E14TG2A [48] ESC lines were electroporated with constructs built in pEMS1302, pEMS1306, or pEMS1313, respectively. Clones were maintained under HAT selection for 3-4 days of expansion in 96-well plates and then transferred to 2 x 24 wells and cultured in HT media. Once cells reached confluence, both wells were frozen in HT-freeze media and stored in liquid nitrogen (LN2).

5.5.2.3. PCR analysis of genomic DNA

Vector NTI (http://www.invitrogen.com) software was used to design PCR assays for the different constructs. MiniP-specific PCR genotyping assays are available on the http://www.pleiades.org website.

5.5.2.4. In vitro neural differentiation

Neural differentiation of ESCs was conducted as previously described [22], with the following modifications. Once confluent, ESCs were trypsinized and seeded in duplicate wells onto confluent MS-5 feeder layers at 500 cells/cm² for seven time-points. Total cell RNA was extracted with the RNeasy Plus Mini Kit (Qiagen) and used in RT-PCR analysis in both +RT and –RT conditions, using the OneStep RT-PCR Kit (Qiagen) according to the manufacturer’s instructions (details in Appendix 2 Information).
Ple53-EGFP immunohistochemistry was performed on day 11 of differentiation. Cells were washed once with 1x PBS and fixed using 4% paraformaldehyde in PBS for 15 minutes at room temperature. Cells were then blocked using Image-iT FX signal enhancer (Invitrogen) reagent and subsequently incubated with 1:1000 rabbit polyclonal anti-GFP antibody (Abcam) followed by 1:1000 Alexa-488 secondary anti-rabbit antibody (Invitrogen). Cells were imaged on a Zeiss Axiovert 200M microscope at 20x with the FITC filter set. Ple88-lacZ staining was performed as outlined at http://openwetware.org/wiki/LacZ_staining_of_cells, on day 14 of differentiation. Brightfield images were taken with the 10x objective on an Olympus Bx61 microscope.

5.5.3. Immunohistochemistry and histochemistry

EGFP expression was detected with anti-GFP using the Vectastain Elite ABC kit (Vector Labs, Burlingame, CA) and DAB as a brown chromogen, following the manufacturer’s directions. Expression of beta-galactosidase (lacZ) or the EGFP/cre fusion protein (following recombination of the ROSA26 locus) was detected with Xgal (5-Bromo-4-chloro-3-indolyl-β-D-galactopyranoside) staining as previously described [49]. High resolution serial images of brightfield material were acquired using a Nikon Optiphot-2 microscope with a LEP motorized stage connected to a Dell Precision 390 computer equipped with hardware and software from MicroBrightField, Inc. Images were captured and tiled using MBF Neurolucida Virtual Slice v8.2.3.0.

Double-label immunofluorescence for colocalization of EGFP and endogenous proteins was performed as previously described [50]. Either native EGFP fluorescence (nGFP) or anti-GFP detection with an Alexa-488 secondary antibody was combined with a second primary antiserum and detection with a Cy3 or Alexa-555 secondary antibody.
Colocalization of lacZ activity and tyrosine hydroxylase (TH) or NeuN was performed with sequential staining as previously described [49]. Primary antibodies used for these studies include: rabbit anti-DCX (1:500, Cell Signaling), rabbit anti-orexin (1:500, Chemicon), mouse anti-GFAP (1:1000, Millipore), mouse anti-S100β (1:1000, Abcam), mouse anti-NeuN (1:500, Chemicon), and mouse anti-TH (1:3000, Chemicon). Secondary antibodies include: goat anti-rabbit Alexa-488 (1:500, Molecular Probes), goat anti-rabbit-Cy3 (1:500, Jackson ImmunoResearch Laboratories, Inc.), goat anti-mouse Alexa-555 (1:500, Molecular Probes), goat anti-mouse Alexa-488 (1:500, Molecular Probes), and donkey anti-goat-Cy3 (1:500, Jackson ImmunoResearch Laboratories, Inc.). Detection of double immunofluorescence was performed using a BioRad confocal laser-scanning microscope (CLSM, BioRad, Hercules, CA).

Whole-mount Xgal histochemistry was performed on 4% paraformaldehyde-fixed embryos (E10.5, E11.5) or dissected brains (E15.5, P0.5), following a protocol similar to that described above, after preincubation of the tissue in 0.1 M PBS containing 0.3% Triton X-100. Stained embryos and brains were photographed, cryosectioned, and counterstained with neutral red for localization of lacZ-expressing cells.

Figure 5.1. A resource of 240 MiniPromoters for predictable, reproducible expression
a. Pleiades MiniP design and testing strategy using the POGZ gene as an example. The MiniP designs capture the candidate regulatory regions (RR) in various combinations upstream of the endogenous gene promoter (Prom). The MiniPs are cloned in an Hprt1 targeting vector and knocked into the mouse genome at exactly the same location every time. b. Pleiades MiniPs (designated Ple#) designed for each of the 62 genes selected. Each box represents a contiguous human DNA sequence bioinformatically identified as a candidate regulatory region.
Regulatory regions are numbered, and when the number is reversed, the sequences are placed in reverse orientation to avoid a possible alternative start site. Multiple regulatory regions are stitched together upstream of the endogenous gene promoter (Prom), represented as an arrow. In some cases more than one Prom was identified (LongProm, Prom 2) and used in conjunction with 5′UTR and first intron sequences. In a few instances, sequences from neighboring genes were included in the design, as indicated by a second gene name following the primarily selected one.

Figure 5.2. Resolution score prioritizes genes for MiniPromoter design
a. Score distribution for 100 manually curated genes. The boxes’ widths are proportional to the number of observations in the groups. The increases in scores from “1” to “4” and “5” are significant (p = 1.4e-03 and p = 7e-04, respectively; Wilcoxon test), as is the increase from “2” to “5” (p = 4.5e-02; Wilcoxon test). b. Score frequency of the selected 62 genes (black) compared to all other brain region selective genes (white). c. Pleiades genes selected for MiniP design.

Figure 5.3. In vitro neural differentiation for pre-screening MiniPromoter designs
a, e. Ple53 and Ple88 are cloned upstream of EGFP or lacZ, respectively. b, f. RT-PCR assays across seven time points of ESC neural differentiation demonstrate appropriate temporal expression. c, g. Immunohistochemistry or β-galactosidase staining demonstrates appropriate spatial expression (scale bar = 100 µm in c, 200 µm in g). d, h. Germline knock-in adult mouse brain sections analyzed by immunohistochemistry or β-galactosidase staining confirm expression in the appropriate regions (i.e., olfactory bulb and rostral migratory stream (RMS) for Ple53, and glia throughout the brain for Ple88).

Figure 5.4.
Montage of MiniPromoter expression patterns in the adult brain and retina
Presented is a sampling of positive strains expressing EGFP, detected using anti-GFP immunocytochemistry; EGFP/cre, detected using Xgal histochemistry; and lacZ, detected using Xgal histochemistry. Various brain regions containing positive staining are presented for each mouse strain. Bs, brainstem; Cb, cerebellum; Ctx, cortex; Hip, hippocampus; Hyp, hypothalamus; LC, locus coeruleus; Olf, olfactory bulb; Ret, retina; RMS, rostral migratory stream; VTA, ventral tegmental area. For each panel, the upper sagittal image is a montage photographed at 1.5x and resized to fit the frame. For the lower images of each panel, scale bars are given as left µm, right µm: a, b, h, 100, 100; c, d, 200, 50; e, 400, 400; f, i, 200, 200; g, j, 200, 100; k, 400, 100; and l, 400, 50.

Figure 5.5. Specific neuronal and glial expression patterns
a-d. Ple54-EGFP expresses EGFP in Dcx-positive cells of the rostral migratory stream (a-c) and olfactory bulb (d). e-h. Ple111-EGFP expresses EGFP in a subpopulation of hypothalamic neurons that are Hcrt (Orexin)-positive. i-l. Ple90-EGFP expresses EGFP in Gfap-positive astrocytes of the hippocampus (i-k) and Bergmann glia of the cerebellum (l). m-p. Ple185-EGFP expresses EGFP in S100β-positive Bergmann glia of the cerebellum (m-o) and myelinated fibers in the cortex (p). Scale bar, 50 µm (a-d), 100 µm (e-p). nGFP, native GFP fluorescence.

Figure 5.6. MiniPromoters as tools to study developmental expression patterns
a-c. Ple162-EGFP/cre is expressed in neurons of the ventral tegmental area (VTA) that are distinct from the tyrosine hydroxylase-positive (TH) cells (a, b). Beta-galactosidase-positive (Xgal) cells in this area co-label with the pan-neuronal marker NeuN (c). d-k. Xgal staining in whole mounts (d-g) and in histological sections (h-k) of Ple162-EGFP/cre mice across development at embryonic day (E) 10.5, 11.5, 15.5, and postnatal day (P) 0.5.
Scale bar, 50 µm (c, f, h, i, k), 100 µm (b, i inset, j), 200 µm (h inset, j inset, k inset), 500 µm (a, a inset, d), 750 µm (g), 1000 µm (e, f inset, g inset).

Figure 5.7. A unique dataset for bioinformatics analysis
a. Immunohistochemistry on adult mouse brain sections shows that the Ple90 original design recapitulates the expression of the previously characterized Ple88. b. The top panel shows the human genomic sequence around OLIG1 and OLIG2 together with the candidate regulatory regions included in Ple148, Ple150, and Ple151. Comparisons of TFBS predictions between Ple151 sequences (8, 10, and 11) and all others (5, 6, 7, 9) identify EGR1 (e) and FOS (f) binding sites putatively responsible for Ple151-specific expression. The conservation plots were captured from the UCSC genome browser.

5.6. References

1. Alexander BL, Ali RR, Alton EW, Bainbridge JW, Braun S, Cheng SH, Flotte TR, Gaspar HB, Grez M, Griesenbach U, Kaplitt MG, Ott MG, Seger R, Simons M, Thrasher AJ, Thrasher AZ, Yla-Herttuala S: Progress and prospects: gene therapy clinical trials (part 1). Gene Ther 2007, 14(20):1439-1447. 2. Aiuti A, Bachoud-Levi AC, Blesch A, Brenner MK, Cattaneo F, Chiocca EA, Gao G, High KA, Leen AM, Lemoine NR, McNeish IA, Meneguzzi G, Peschanski M, Roncarolo MG, Strayer DS, Tuszynski MH, Waxman DJ, Wilson JM: Progress and prospects: gene therapy clinical trials (part 2). Gene Ther 2007, 14(22):1555-1563. 3. Hacein-Bey-Abina S, Von Kalle C, Schmidt M, McCormack MP, Wulffraat N, Leboulch P, Lim A, Osborne CS, Pawliuk R, Morillon E, Sorensen R, Forster A, Fraser P, Cohen JI, de Saint Basile G, Alexander I, Wintergerst U, Frebourg T, Aurias A, Stoppa-Lyonnet D, Romana S, Radford-Weiss I, Gross F, Valensi F, Delabesse E, Macintyre E, Sigaux F, Soulier J, Leiva LE, Wissler M et al: LMO2-associated clonal T cell proliferation in two patients after gene therapy for SCID-X1. Science 2003, 302(5644):415-419. 4.
Lein ES, Hawrylycz MJ, Ao N, Ayres M, Bensinger A, Bernard A, Boe AF, Boguski MS, Brockway KS, Byrnes EJ, Chen L, Chen L, Chen TM, Chin MC, Chong J, Crook BE, Czaplinska A, Dang CN, Datta S, Dee NR, Desaki AL, Desta T, Diep E, Dolbeare TA, Donelan MJ, Dong HW, Dougherty JG, Duncan BJ, Ebbert AJ, Eichele G et al: Genome-wide atlas of gene expression in the adult mouse brain. Nature 2007, 445(7124):168-176. 5. Gong S, Zheng C, Doughty ML, Losos K, Didkovsky N, Schambra UB, Nowak NJ, Joyner A, Leblanc G, Hatten ME, Heintz N: A gene expression atlas of the central nervous system based on bacterial artificial chromosomes. Nature 2003, 425(6961):917-925. 6. Pennacchio LA, Ahituv N, Moses AM, Prabhakar S, Nobrega MA, Shoukry M, Minovitsky S, Dubchak I, Holt A, Lewis KD, Plajzer-Frick I, Akiyama J, De Val S, Afzal V, Black BL, Couronne O, Eisen MB, Visel A, Rubin EM: In vivo enhancer analysis of human conserved non-coding sequences. Nature 2006, 444(7118):499-502. 7. Tsumaki N, Kimura T, Tanaka K, Kimura JH, Ochi T, Yamada Y: Modular arrangement of cartilage- and neural tissue-specific cis-elements in the mouse alpha2(XI) collagen promoter. J Biol Chem 1998, 273(36):22861-22864. 8. Farhadi HF, Peterson AC: The myelin basic protein gene: a prototype for combinatorial mammalian transcriptional regulation. Adv Neurol 2006, 98:65-76. 9. Davidson S, Miller KA, Dowell A, Gildea A, Mackenzie A: A remote and highly conserved enhancer supports amygdala specific expression of the gene encoding the anxiogenic neuropeptide substance-P. Mol Psychiatry 2006, 11(4):323, 410-321. 10. Strand AD, Aragaki AK, Baquet ZC, Hodges A, Cunningham P, Holmans P, Jones KR, Jones L, Kooperberg C, Olson JM: Conservation of regional gene expression in mouse and human brain. PLoS Genet 2007, 3(4):e59. 11.
D'Souza CA, Chopra V, Varhol R, Xie YY, Bohacec S, Zhao Y, Lee LL, Bilenky M, Portales-Casamar E, He A, Wasserman WW, Goldowitz D, Marra MA, Holt RA, Simpson EM, Jones SJ: Identification of a set of genes showing regionally enriched expression in the mouse brain. BMC Neurosci 2008, 9:66. 12. Bronson SK, Plaehn EG, Kluckman KD, Hagaman JR, Maeda N, Smithies O: Single-copy transgenic mice with chosen-site integration [see comments]. Proc Natl Acad Sci U S A 1996, 93(17):9067-9072. 13. Yang GS, Banks KG, Bonaguro RJ, Wilson G, Dreolini L, de Leeuw CN, Liu L, Swanson DJ, Goldowitz D, Holt RA, Simpson EM: Next generation tools for high-throughput promoter and expression analysis employing single-copy knock-ins at the Hprt1 locus. Genomics 2008. 14. Farhadi HF, Lepage P, Forghani R, Friedman HC, Orfali W, Jasmin L, Miller W, Hudson TJ, Peterson AC: A combinatorial network of evolutionarily conserved myelin basic protein regulatory sequences confers distinct glial-specific phenotypes. J Neurosci 2003, 23(32):10214-10223. 15. Portales-Casamar E, Kirov S, Lim J, Lithwick S, Swanson MI, Ticoll A, Snoddy J, Wasserman WW: PAZAR: a framework for collection and dissemination of cis-regulatory sequence annotation. Genome Biol 2007, 8(10):R207. 16. Schoenherr CJ, Anderson DJ: Silencing is golden: negative regulation in the control of neuronal gene transcription. Curr Opin Neurobiol 1995, 5(5):566-571. 17. Sandelin A, Alkema W, Engstrom P, Wasserman WW, Lenhard B: JASPAR: an open-access database for eukaryotic transcription factor binding profiles. Nucleic acids research 2004, 32(Database issue):D91-94. 18. Shiraki T, Kondo S, Katayama S, Waki K, Kasukawa T, Kawaji H, Kodzius R, Watahiki A, Nakamura M, Arakawa T, Fukuda S, Sasaki D, Podhajska A, Harbers M, Kawai J, Carninci P, Hayashizaki Y: Cap analysis gene expression for high-throughput analysis of transcriptional starting point and identification of promoter usage. Proc Natl Acad Sci U S A 2003, 100(26):15776-15781. 19.
Gardiner-Garden M, Frommer M: CpG islands in vertebrate genomes. J Mol Biol 1987, 196(2):261-282. 20. Brenner M, Kisseberth WC, Su Y, Besnard F, Messing A: GFAP promoter directs astrocyte-specific expression in transgenic mice. J Neurosci 1994, 14(3 Pt 1):1030-1037. 21. Couillard-Despres S, Winner B, Karl C, Lindemann G, Schmid P, Aigner R, Laemke J, Bogdahn U, Winkler J, Bischofberger J, Aigner L: Targeted transgene expression in neuronal precursors: watching young neurons in the old brain. Eur J Neurosci 2006, 24(6):1535-1545. 22. Barberi T, Klivenyi P, Calingasan NY, Lee H, Kawamata H, Loonam K, Perrier AL, Bruses J, Rubio ME, Topf N, Tabar V, Harrison NL, Beal MF, Moore MA, Studer L: Neural subtype specification of fertilization and nuclear transfer embryonic stem cells and application in parkinsonian mice. Nat Biotechnol 2003, 21(10):1200-1207. 23. Cai C, Grabel L: Directing the differentiation of embryonic stem cells to neural stem cells. Dev Dyn 2007. 24. Magdaleno S, Jensen P, Brumwell CL, Seal A, Lehman K, Asbury A, Cheung T, Cornelius T, Batten DM, Eden C, Norland SM, Rice DS, Dosooye N, Shakya S, Mehta P, Curran T: BGEM: an in situ hybridization database of gene expression in the embryonic and adult mouse nervous system. PLoS Biol 2006, 4(4):e86. 25. Visel A, Thaller C, Eichele G: GenePaint.org: an atlas of gene expression patterns in the mouse embryo. Nucleic acids research 2004, 32(Database issue):D552-556. 26. Soriano P: Generalized lacZ expression with the ROSA26 Cre reporter strain. Nat Genet 1999, 21(1):70-71. 27. Smidt MP, van Schaick HS, Lanctot C, Tremblay JJ, Cox JJ, van der Kleij AA, Wolterink G, Drouin J, Burbach JP: A homeodomain gene Ptx3 has highly restricted brain expression in mesencephalic dopaminergic neurons. Proc Natl Acad Sci U S A 1997, 94(24):13305-13310. 28. Lee Y, Messing A, Su M, Brenner M: GFAP promoter elements required for region-specific and astrocyte-specific expression. Glia 2008, 56(5):481-493. 29.
Arnett HA, Fancy SP, Alberta JA, Zhao C, Plant SR, Kaing S, Raine CS, Rowitch DH, Franklin RJ, Stiles CD: bHLH transcription factor Olig1 is required to repair demyelinated lesions in the CNS. Science 2004, 306(5704):2111-2115. 30. Lu QR, Cai L, Rowitch D, Cepko CL, Stiles CD: Ectopic expression of Olig1 promotes oligodendrocyte formation and reduces neuronal survival in developing mouse cortex. Nat Neurosci 2001, 4(10):973-974. 31. Balasubramaniyan V, Timmer N, Kust B, Boddeke E, Copray S: Transient expression of Olig1 initiates the differentiation of neural stem cells into oligodendrocyte progenitor cells. Stem Cells 2004, 22(6):878-882. 32. Zhou Q, Anderson DJ: The bHLH transcription factors OLIG2 and OLIG1 couple neuronal and glial subtype specification. Cell 2002, 109(1):61-73. 33. Sock E, Leger H, Kuhlbrodt K, Schreiber J, Enderich J, Richter-Landsberg C, Wegner M: Expression of Krox proteins during differentiation of the O-2A progenitor cell line CG-4. J Neurochem 1997, 68(5):1911-1919. 34. Barnett SC, Rosario M, Doyle A, Kilbey A, Lovatt A, Gillespie DA: Differential regulation of AP-1 and novel TRE-specific DNA-binding complexes during differentiation of oligodendrocyte-type-2-astrocyte (O-2A) progenitor cells. Development 1995, 121(12):3969-3977. 35. Oberdick J, Schilling K, Smeyne RJ, Corbin JG, Bocchiaro C, Morgan JI: Control of segment-like patterns of gene expression in the mouse cerebellum. Neuron 1993, 10(6):1007-1018. 36. Anderson GW, Hagen SG, Larson RJ, Strait KA, Schwartz HL, Mariash CN, Oppenheimer JH: Purkinje cell protein-2 cis-elements mediate repression of T3-dependent transcriptional activation. Mol Cell Endocrinol 1997, 131(1):79-87. 37. Mima K, Deguchi S, Yamauchi T: Characterization of 5' flanking region of alpha isoform of rat Ca2+/calmodulin-dependent protein kinase II gene and neuronal cell type specific promoter activity. Neurosci Lett 2001, 307(2):117-121. 38.
Olson NJ, Masse T, Suzuki T, Chen J, Alam D, Kelly PT: Functional identification of the promoter for the gene encoding the alpha subunit of calcium/calmodulin-dependent protein kinase II. Proc Natl Acad Sci U S A 1995, 92(5):1659-1663. 39. Siepel A, Bejerano G, Pedersen JS, Hinrichs AS, Hou M, Rosenbloom K, Clawson H, Spieth J, Hillier LW, Richards S, Weinstock GM, Wilson RK, Gibbs RA, Kent WJ, Miller W, Haussler D: Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res 2005, 15(8):1034-1050. 40. Wasserman WW, Sandelin A: Applied bioinformatics for the identification of regulatory elements. Nat Rev Genet 2004, 5(4):276-287. 41. Heintzman ND, Hon GC, Hawkins RD, Kheradpour P, Stark A, Harp LF, Ye Z, Lee LK, Stuart RK, Ching CW, Ching KA, Antosiewicz-Bourget JE, Liu H, Zhang X, Green RD, Lobanenkov VV, Stewart R, Thomson JA, Crawford GE, Kellis M, Ren B: Histone modifications at human enhancers reflect global cell-type-specific gene expression. Nature 2009. 42. Visel A, Blow MJ, Li Z, Zhang T, Akiyama JA, Holt A, Plajzer-Frick I, Shoukry M, Wright C, Chen F, Afzal V, Ren B, Rubin EM, Pennacchio LA: ChIP-seq accurately predicts tissue-specific activity of enhancers. Nature 2009, 457(7231):854-858. 43. Birney E, Stamatoyannopoulos JA, Dutta A, Guigo R, Gingeras TR, Margulies EH, Weng Z, Snyder M, Dermitzakis ET, Thurman RE, Kuehn MS, Taylor CM, Neph S, Koch CM, Asthana S, Malhotra A, Adzhubei I, Greenbaum JA, Andrews RM, Flicek P, Boyle PJ, Cao H, Carter NP, Clelland GK, Davis S, Day N, Dhami P, Dillon SC, Dorschner MO, Fiegler H et al: Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature 2007, 447(7146):799-816. 44. Thyagarajan B, Liu Y, Shin S, Lakshmipathy U, Scheyhing K, Xue H, Ellerstrom C, Strehl R, Hyllner J, Rao MS, Chesnut JD: Creation of engineered human embryonic stem cell lines using phiC31 integrase. Stem Cells 2008, 26(1):119-126. 45.
Kuduvalli PN, Mitra R, Craig NL: Site-specific Tn7 transposition into the human genome. Nucleic acids research 2005, 33(3):857-863. 46. Heaney JD, Rettew AN, Bronson SK: Tissue-specific expression of a BAC transgene targeted to the Hprt locus in mouse embryonic stem cells. Genomics 2004, 83(6):1072-1082. 47. Karolchik D, Kuhn RM, Baertsch R, Barber GP, Clawson H, Diekhans M, Giardine B, Harte RA, Hinrichs AS, Hsu F, Kober KM, Miller W, Pedersen JS, Pohl A, Raney BJ, Rhead B, Rosenbloom KR, Smith KE, Stanke M, Thakkapallayil A, Trumbower H, Wang T, Zweig AS, Haussler D, Kent WJ: The UCSC Genome Browser Database: 2008 update. Nucleic acids research 2008, 36(Database issue):D773-779. 48. Hooper M, Hardy K, Handyside A, Hunter S, Monk M: HPRT-deficient (Lesch-Nyhan) mouse embryos derived from germline colonization by cultured cells. Nature 1987, 326(6110):292-295. 49. Reiner A, Del Mar N, Deng YP, Meade CA, Sun Z, Goldowitz D: R6/2 neurons with intranuclear inclusions survive for prolonged periods in the brains of chimeric mice. J Comp Neurol 2007, 505(6):603-629. 50. Liu L, Geisert EE, Frankfurter A, Spano AJ, Jiang CX, Yue J, Dragatsis I, Goldowitz D: A transgenic mouse class-III beta tubulin reporter using yellow fluorescent protein. Genesis 2007, 45(9):560-569.

6. Identification and analysis of transcriptional cis-regulatory modules directing oligodendrocytic expression of myelin-linked genes¹⁵

6.1. Chapter preamble

During development, oligodendrocytes progress through a maturation process that commences in a progenitor cell state and culminates in a myelinating oligodendrocyte. Studies have suggested that much of oligodendrocyte development is controlled at the transcriptional level.
In Chapter 6, I describe analyses that incorporated the Combination Site Analysis (CSA) tool (described in Chapters 2 and 3) and the transcription factor (TF) catalogue (described in Chapter 4) with a new promoter analysis method to predict TFs that cooperate to regulate myelin genes during oligodendrocyte development. Many of the TFs highlighted in the predicted TF network have been previously shown (individually) to be involved in oligodendrocyte development, and a substantial portion of the predicted TF regulators are expressed in developing oligodendrocytes, according to gene expression data.

¹⁵ A version of this chapter will be submitted for publication. Fulton DL, Denarier E, Tuason MC, Friedman HC, Wasserman WW and Peterson AC. Identification and analysis of transcriptional cis-regulatory modules directing oligodendrocytic expression of myelin-linked genes.

6.2. Introduction

In both the central and peripheral nervous systems, large caliber axons are enveloped by myelin sheaths elaborated by glial cells. Myelin is generated in the peripheral nervous system (PNS) by Schwann cells derived from the neural crest and in the central nervous system (CNS) by oligodendrocytes derived from neuroectoderm. The lipid-rich myelin sheath is composed of spirally wrapped glial cell plasma membrane that is tightly compacted through the action of multiple “myelin” proteins. Because myelin insulates the axonal membrane, action potentials depolarize the axon membrane only at the nodes between successive myelin sheaths. Such saltatory action potential propagation accelerates the rate of action potential conduction. Further, as repolarization of the axon membrane is required only at inter-sheath nodes where the axon membrane is unmyelinated, energy requirements are reduced.

Despite distinct embryological origins, cellular architecture, and maturation programs, Schwann cells and oligodendrocytes are known to express several critical myelin proteins in common.
Expression analyses spanning the oligodendrocyte maturation process suggest that the coordinate accumulation of myelin proteins is regulated largely at the transcriptional level [1]. Recent studies of the early stages of oligodendrocyte lineage specification implicate the involvement of the basic helix-loop-helix transcription factor (TF) family, including Olig1, Olig2, Mash1, and the Id proteins [2-7]. Other key TF classes operating at both early and later stages of maturation include the homeodomain TFs, such as Nkx [8-11] and Pou [12, 13], high-mobility group domain (HMG) TFs, specifically members of the Sox sub-family [14-16], and zinc-finger proteins, such as Egr1/Krox24 [17] and Zfp488 [18]. Since cooperative interactions between multiple TFs are required for selective expression of most metazoan genes, it is not surprising that recent oligodendrocyte studies have identified a key role for cooperative TF regulatory mechanisms [19-22].

Sequence analysis of DNA elements in non-coding regions, often called computational promoter analysis, has received much research attention (reviewed in [23, 24]). Throughout the paper we use the term ‘promoter’ in the broadest sense to encompass non-coding regulatory sequences that are both proximal to (typically called basal promoters) and distal from (often referred to as enhancers) the transcription start site (TSS) of a gene. For the prediction of functional transcription factor binding sites (TFBS), high-affinity TF binding preferences elucidated experimentally are represented by position weight matrices (reviewed in [25]). While the utility of a prediction of any specific functional element is constrained by poor specificity, the analysis of predicted TFBS over-representation in groups of co-expressed genes has proven to be a powerful means to infer regulatory mechanisms [26-28].
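To make the position weight matrix representation concrete, the following is a minimal sketch of how a sequence can be scanned with a log-odds matrix. The count matrix, the uniform background, the pseudocount, and the threshold are all hypothetical values chosen for illustration; they are not taken from JASPAR or from the analyses in this chapter, and the function names are assumptions.

```python
import math

# Hypothetical 4-column count matrix (illustrative values only, not a real TF profile).
COUNTS = {
    "A": [8, 0, 1, 9],
    "C": [1, 0, 8, 0],
    "G": [0, 10, 0, 1],
    "T": [1, 0, 1, 0],
}
# Assumed uniform background; a real analysis would estimate this from the genome.
BACKGROUND = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}


def build_pwm(counts, background, pseudocount=1.0):
    """Convert per-position base counts into log2 odds scores."""
    width = len(next(iter(counts.values())))
    pwm = []
    for i in range(width):
        total = sum(counts[b][i] for b in "ACGT") + pseudocount
        column = {}
        for b in "ACGT":
            freq = (counts[b][i] + pseudocount * background[b]) / total
            column[b] = math.log2(freq / background[b])
        pwm.append(column)
    return pwm


def score_window(pwm, window):
    """Sum the per-position log-odds scores for one candidate site."""
    return sum(column[base] for column, base in zip(pwm, window))


def scan(pwm, sequence, threshold):
    """Return (start, site, score) for every window scoring at or above threshold."""
    width = len(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if set(window) <= set("ACGT"):  # skip windows containing N or other symbols
            score = score_window(pwm, window)
            if score >= threshold:
                hits.append((start, window, score))
    return hits


pwm = build_pwm(COUNTS, BACKGROUND)
hits = scan(pwm, "TTAGCATT", threshold=5.0)
```

Because every window above the threshold is reported regardless of biological context, a scan of this kind illustrates why single-site predictions have poor specificity and why over-representation across co-expressed genes is used instead.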
One of the key complexities for the interpretation of such computational analysis arises from the common binding patterns often observed for structurally related TFs: a TFBS profile for one member of a TF class or sub-class can often refer equally well to other members of the set.

Regulatory sequence analysis has a proven capacity to accurately predict sets of contributing TFs. However, the vast amount of non-coding sequence between and adjacent to exons makes such predictions challenging. To reduce the search space, TFBS detection algorithms typically incorporate evolutionary sequence conservation, referred to as phylogenetic footprinting [29], to identify candidate regulatory regions that may be under selective pressure due to their regulatory function. The use of conservation criteria has proven useful in identifying bona fide regulatory regions. Cross-species preserved sequence regions have demonstrated tissue-specific expression activity in vivo [30-32]. Clusters of regulatory DNA elements, called cis-regulatory modules (CRMs), have been observed in promoter regions of tissue-specific genes [26, 27]. Promoter analysis tools frequently incorporate TFBS separation distance constraints to establish the boundaries of TFBS clusters, which define predicted CRMs [33-36].

In this report we describe experimental and computational studies designed to predict the DNA elements and associated TFs that may be cooperating to coordinately drive transcription of co-expressed myelin genes during oligodendrocyte myelination. We have characterized novel regulatory regions for a subset of myelin-associated genes, which demonstrate enhancer activity in oligodendrocytes in the CNS and/or Schwann cells in the PNS. This novel list of enhancers was combined with existing myelin gene enhancers identified by this group, for the MBP [30, 37-39] and PLP1/DM20 genes [40], to establish a reference collection of myelin gene enhancers to guide computational analyses.
We have implemented a new promoter analysis procedure that incorporates TFBS that are positionally preserved in aligned genome sequences. Expression profile analyses of multiple oligodendrocyte developmental datasets identified a co-expressed cohort of myelin-specific genes. This tissue- and temporally-specific gene cohort was subjected to CRM analyses, returning statistically ranked predictions of cooperatively acting TFBS. Integration of the TFBS combination predictions with the sequence properties of the functionally evaluated myelin gene enhancer collection highlighted common signatures of TF cooperativity. This compendium of experimentally validated enhancer regions and TF cooperativity predictions offers new insight into the potential regulatory mechanisms driving myelinogenesis in the CNS.

6.3. Results

6.3.1. Myelin gene associated conserved regions confer reporter activity

We selected eight mouse non-coding regions, conserved in human, mouse, and dog per UCSC genome sequence alignments, that were located either within (introns) or adjacent to the following myelin-associated genes: claudin 11 (Cldn11); 2',3'-cyclic nucleotide 3' phosphodiesterase (Cnp); ermin (Ermn); connexin 32 (Gjb1); myelin and lymphocyte protein, T-cell differentiation protein (Mal); oligodendrocyte transcription factor 1 (Olig1); oligodendrocyte transcription factor 2 (Olig2); and pou domain class 3, transcription factor 1 (Pou3f1). The putative regulatory regions were individually concatenated to an hsp promoter-driven eGFP-lacZ reporter gene or, in the case of the Ermn-associated sequence, which included the endogenous basal promoter, cloned directly in front of the eGFP-lacZ reporter sequence. The reporter constructs were inserted in single copy at the Hprt1 locus in mouse embryonic stem (ES) cells, lines of transgenic mice were derived (Figure 6.1 and Appendix 3 Figure S1), and the expression phenotypes conferred by such sequences were compared at multiple post-natal ages.
The emergence of both the oligodendrocyte and Schwann cell lineages is well characterized in the mouse, as is the complex spatial and temporal programming of myelin elaboration. Schwann cell progenitors are present throughout the developing PNS from mid-fetal development and, by one week of age, Schwann cell proliferation is near completion and myelin elaboration is well advanced. In contrast, glial lineage progression and myelin elaboration are more protracted in the CNS, occurring over several weeks and proceeding on markedly different schedules within the spinal cord and brain. Oligodendrocyte progenitor cells (OPCs) are first recognized in the mid-fetal spinal cord at E13.5 [41], with myelin elaboration initiating in the cervical cord during the perinatal period. Myelination then proceeds in the spinal cord in a rostral-caudal direction and, by one week of age, significant myelin has been deposited throughout. In the more rostral CNS regions, including the mouse brain and optic nerve, OPCs emerge later in mouse development in three distinct waves, including a postnatal stage [42], with myelination initiating near the end of the first week after birth. Thus, the expression programming conferred to reporter constructs during pre-weaning development was used to reveal whether the associated regulatory sequence was activated in lineage-appropriate cells and whether activation occurred prior to, coincident with, or following the initiation of myelin formation.

Three of the tested candidate regulatory regions were located adjacent to tetraspan myelin protein family members: Cldn11, Gjb1, and Mal. Cldn11 is an integral membrane protein contributing approximately 1% of the protein accumulating in CNS myelin, and recent studies suggest that Cldn11 and Plp1 may play redundant functional roles [43].
A 631 bp sequence located immediately downstream from the Cldn11 gene (Appendix 3 Figure S1) directed widespread reporter gene expression in the CNS during post-natal development (P5–P10), suggesting that expression initiated in OPCs prior to the emergence of mature oligodendrocytes and myelin elaboration (Figure 6.2). Based on continued intense labeling in white matter from ages P10 to P90, reporter expression was maintained in mature myelinating oligodendrocytes.

Gjb1 (Cx-32) is a transmembrane protein that acts as a subunit in gap junction channels and is expressed in oligodendrocytes and some neurons [44]. A 637 bp region found in intron 1 (Figure S1) directed reporter expression to both the CNS and PNS [45]. In P5 samples, both brain and optic nerves were diffusely labeled (Figure 6.2), while spinal cords were darkly and uniformly labeled from cervical through lumbar levels, an expression program consistent with expression in both OPCs and mature myelinating cells. Reporter gene activity was detected in CNS white matter in mice at 6 months of age, indicating that expression in mature oligodendrocytes continues into maturity. In the P5 PNS, labeling was observed in both spinal roots and sciatic nerves, and expression continued through to P15 but was extinguished at 6 months of age. Thus, expression is robust in Schwann cells actively elaborating myelin but ceases at a later stage of maturation when myelin elaboration is complete.

Mal is a tetraspan raft-associated proteolipid, which regulates sorting and trafficking of membrane components in myelinating cells [46]. A 680 bp sequence located just upstream from the Mal gene (Appendix 3 Figure S1) appeared to direct expression to regions where active myelin deposition had commenced; e.g., at P5, expression in the spinal cord was intense but absent in the brain except for those regions known to initiate more rostral myelination first [47].
Thus, reporter activity was observed at the chiasmal end of the optic nerves and in the deep regions of the cerebellum. Continued expression in myelinating oligodendrocytes was indicated by well-labeled white matter at both P18 and at 5 months of age. In the PNS, spinal roots and sciatic nerves were labeled at P5 and reporter expression continued in mature mice (Figure 6.3).

Two non-coding regions associated with genes expressed in myelinating glia, but not encoding proteins that accumulate in compact myelin, were investigated. The enhancer-associated genes were Ermn, which encodes a cytoskeletal protein that is expressed during late stage myelination [48], and 2',3'-cyclic nucleotide 3'-phosphodiesterase (Cnp), which encodes a cytoplasmic protein expressed early in development of the myelinating glia [49]. For Ermn, we tested a conserved region just upstream of its TSS. Sparse labeling was observed in dorsal spinal cord white matter at P10 (Figure 6.2) with no reporter expression detected in the brain and, at the adult stage, oligodendrocytes in the cerebellum and brain stem were weakly labeled (Figure 6.3). Schwann cells did not label at any age. A conserved sequence region in the first intron of the Cnp gene was investigated, but no expression was detected at any age.

Putative enhancer regions were tested for three TFs that have demonstrated key regulatory roles in oligodendrogenesis: Pou3f1 (Oct6), Olig1, and Olig2. Pou3f1 is a Pou-homeodomain containing transcription factor that is expressed in the cerebral cortex, in oligodendrocyte progenitor cells (OPCs) of the developing CNS and in astrocytes [12, 13, 50] and is highly expressed in Schwann cells [51, 52]. A 650 bp sequence located downstream of the Pou3f1 gene was tested. This sequence overlaps with a portion of a validated Schwann cell enhancer sequence reported by Mandemakers et al. [53], shown to contain a DNase I-hypersensitive site (referred to in their study as hss6).
At P5, mouse optic nerves were uniformly and densely labeled, as were peripheral nerves. Staining was intense at P15 and P17, while at later stages of development it continued in both oligodendrocytes and Schwann cells at significantly reduced levels. Olig1 is a basic helix-loop-helix transcription factor that is essential for proper oligodendrocyte development [6]. A tested sequence located downstream of the Olig1 coding exon produced expression in neurons (Figure 6.2 and Figure 6.3) but did not label oligodendrocytes or Schwann cells in early and/or adult development stages. A region upstream of the paralogous TF family member Olig2 directed transient low level expression in optic nerves of P10 mice (data not shown), with expression absent at later time points and elsewhere in the brain (Figure 6.2 and Figure 6.3).

Four highly expressed constructs (Cldn11, Gjb1, Mal, and Pou3f1) were evaluated for expression in oligodendrocytes by immunofluorescence using P10 brain sections. In the brain stem, oligodendrocytes expressing MBP also expressed these constructs (Figure 6.4A and Figure 6.4B). At developmental stage P10, oligodendrocyte progenitors expressing PDGFR-alpha were found in the brain stem and the cerebellum, where oligodendrocyte lineage progression is fully underway, and in the cortex, where oligodendrocyte transitions have recently begun. To determine whether the Mal construct expression began earlier than the myelination stage, we labeled transgenic tissues expressing the Mal construct with a PDGFR-alpha antibody. The Mal construct was expressed in a number of PDGFR-alpha positive cells in the cortex. However, no transgene expression was detected in progenitor cells in the brain stem (Figure 6.4B). Enhancer constructs associated with the Cldn11, Gjb1, and Pou3f1 genes demonstrated labeled reporter activity in cells that also express the Mbp myelination marker.
These results confirm that the tested non-coding sequences associated with the Cldn11, Gjb1, Mal, and Pou3f1 genes are capable of directing expression in oligodendrocytes.

6.3.2. Myelin gene co-expression is detected in mouse forebrain and optic nerve expression profiles

Transcriptional profiles can highlight genes that are temporally co-expressed in a given cell state [54, 55]. Promoter analyses of coordinately expressed genes may implicate mediating DNA-binding TFs. However, noisy, less cohesive groupings of putatively co-regulated genes can preclude success. A refined and reliable set of co-expressed genes is key for detection of TFBS signals present in the set of promoter sequences. The process of oligodendrocyte myelination in mouse optic nerves progresses from initial onset after developmental stage P4 (i.e. late P5) to an extensively myelinating phenotype at P10. To identify genes induced during the myelination process, we generated gene (RNA) expression profiles for mouse optic nerves at these two time points. Optic nerve tissue is composed of retinal ganglion neurons, oligodendrocytes, and astrocytes and, as such, its minimal cell heterogeneity makes it a good source of oligodendrocytes. Our analyses identified 487 differentially expressed genes at a p-value cutoff of 0.001 (see Supplemental File 1) in the P4 vs P10 dataset (P4-P10d). We mapped 487 partially redundant mouse gene identifiers to a set of 504 mouse and orthologous rat genes to perform a Gene Ontology (GO) [56] over-representation analysis using the mouse and rat annotation databases (see Appendix 3 Table S4 and Table S5). The two most significant GO molecular function categories highlighted in the analysis were GO:0005488 binding and GO:0019911 structural constituent of myelin sheath (Appendix 3 Table S4), confirming that a significant portion of the genes differentially expressed between the P4 and P10 developmental time points are transcript products for myelin structural proteins.
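As an illustration of how a GO over-representation analysis of this kind is commonly scored, the following minimal sketch computes a one-sided hypergeometric p-value for a single GO category. The text does not state which statistical test was used here, so the choice of the hypergeometric test is an assumption, and the gene counts in the example are hypothetical rather than values from the optic nerve dataset.

```python
from math import comb


def hypergeom_sf(k, N, K, n):
    """P(X >= k): chance of seeing at least k annotated genes in a study set.

    N: genes in the background (assumed: all genes with any GO annotation)
    K: background genes annotated to the GO category
    n: genes in the study set (e.g. the differentially expressed genes)
    k: study-set genes annotated to the GO category
    """
    upper = min(K, n)
    total = comb(N, n)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / total


# Hypothetical toy numbers: 10 background genes, 4 annotated, 3 in the study set,
# of which 2 carry the annotation.
p_value = hypergeom_sf(k=2, N=10, K=4, n=3)
```

In practice the p-values for all tested categories would also need a multiple-testing correction, which is omitted from this sketch.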
During the process of myelination, oligodendrocytes transition through a developmental process that primarily involves four cell states: 1) oligodendrocyte progenitor cells; 2) pre-myelinating cells; 3) active myelinating cells; and 4) oligodendrocyte maintenance. Each stage is distinguished by specific oligodendrocyte protein markers. While it is likely that more oligodendrocytes will exist in the myelinating state at P10 compared to P4, the population of oligodendrocytes at either time point will be heterogeneous. A recent study showed that oligodendrocyte marker-selected cell populations from P16 mouse forebrain are distinguished by cell stage-specific gene expression profiles [57]. We integrated evaluation of this data with analyses of our optic nerve expression profiles to identify a specific set of myelin genes that concurrently exhibited differential expression over an early development timeframe and across progenitor stage to pro-myelinating state transitions. Overlaps of differentially expressed genes were determined between the P4 vs P10 timeframe (P4-P10d), oligodendrocyte progenitor cells vs early oligodendrocytes (OPC-EOLd) and early oligodendrocytes vs myelinating oligodendrocytes (EOL-MOLd) cell stage expression datasets (Figure 6.5 and Supplemental File 1). The majority of genes with associated myelin-specific enhancers were expressed in the intersection of the P4-P10d and OPC-EOLd gene expression datasets (Figure 6.5). This co-expressed set is composed of 203 genes, which are up-regulated and/or down-regulated in response to the myelination program during early oligodendrocyte development. This cohort of co-expressed genes is herein referred to as the intersection of oligodendrocyte early development (IOLEDd) expression dataset (see Supplemental File 2).

TFs co-expressed with tissue-specific genes may be involved in the biological process under study [58, 59].
We mapped all gene expression data to a curated mouse-human TF catalogue to identify the subset of genes that are transcriptional regulators [60]. Both novel and previously-linked oligodendrocyte TF regulators were identified (see Figure 6.5 and Supplemental File 3). The co-expressed gene list (IOLEDd) and TF subset were evaluated in the promoter analyses method described below.

6.3.3. Validation of a promoter analysis approach

Building upon our Combination Site Analysis (CSA) CRM detection algorithm [35, 61], we implemented a new regulatory element prediction approach. Promoter analyses methods often focus on non-coding regions that satisfy minimum sequence conservation criteria (e.g. 60%–75% identity in sequence alignments with orthologous genes separated by ~60 Myrs of evolution) over a given length of sequence (e.g. 100 base pairs). Recent gene regulatory and chromatin immunoprecipitation (ChIP) studies have highlighted binding and activity of TFs over sequence regions that possess little or no inter-species sequence identity [62, 63]. Similarly, our evaluation of regions surrounding a sample of validated TFBS in aligned sequences [64] suggests that local sequence conservation flanking TFBS can fall below typically applied threshold cut-offs (see example Appendix 3 Figure S2). Orthologous non-coding sequences identified by alignment algorithms will exhibit a range of conservation that includes spans of low sequence conservation along with regions of highly constrained sequences [65]. To facilitate regulatory element detection across a range of sequence conservation values, our TFBS detection algorithm required only that a binding site prediction be positionally conserved in an aligned sequence block [66]. Positional conservation of predicted binding sites has been previously incorporated in promoter sequence analysis studies (for example see [34, 67]).
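The positional conservation requirement can be illustrated with a small sketch. It assumes that a site counts as positionally conserved when predictions for the same binding profile occupy the same columns of a pairwise alignment block; the exact criterion used by the algorithm is given in the Methods and may differ. The sequences, the profile name `PROFILE_X`, and the `min_overlap` parameter are hypothetical.

```python
def ungapped_to_column(aligned_seq):
    """List the alignment column of each ungapped base, in order."""
    return [col for col, ch in enumerate(aligned_seq) if ch != "-"]


def site_columns(aligned_seq, start, width):
    """Alignment columns of the first and last base of a predicted site."""
    cols = ungapped_to_column(aligned_seq)
    return cols[start], cols[start + width - 1]


def positionally_conserved(aln_a, site_a, aln_b, site_b, min_overlap=1.0):
    """Check whether two predictions cover the same alignment columns.

    Each site is (profile_name, ungapped_start, width). min_overlap is the
    required ratio of shared columns to the combined column span (assumed rule).
    """
    if site_a[0] != site_b[0]:
        return False
    a_lo, a_hi = site_columns(aln_a, site_a[1], site_a[2])
    b_lo, b_hi = site_columns(aln_b, site_b[1], site_b[2])
    overlap = min(a_hi, b_hi) - max(a_lo, b_lo) + 1
    span = max(a_hi, b_hi) - min(a_lo, b_lo) + 1
    return overlap > 0 and overlap / span >= min_overlap


# Hypothetical aligned block: the flanks differ, but the 8 bp site is aligned.
human = "ACGT--TGACGTCA"
mouse = "AC-TGGTGACGTCA"
aligned = positionally_conserved(human, ("PROFILE_X", 4, 8), mouse, ("PROFILE_X", 5, 8))
shifted = positionally_conserved(human, ("PROFILE_X", 4, 8), mouse, ("PROFILE_X", 2, 8))
```

Note that the check never measures percent identity of the flanking sequence, which is the point of the approach: a site can be retained even when the surrounding block falls below a conventional conservation threshold.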
Prior studies have identified the co-occurrence of TFBS for cooperatively acting TFs in skeletal muscle expression (i.e. MEF2, SRF, Myf/MyoD, Sp1 and Tead1/Tef) [27]. We used a recently updated reference collection of 25 human skeletal muscle-related genes (see [68] and Supplementary Table S6) for validation of the CSA algorithm. The CSA promoter analyses recovered known cis-regulatory modules composed of MEF2A, SP1, SRF, and Tead1 TFs in the top thirteen predicted pairs (Appendix 3 Table S7 and supplemental discussion). An alternative over-representation approach was evaluated, which compared the TFBS motif frequencies against random selections of five thousand orthologous gene pair promoters. The ranked TFBS pair results for both skeletal muscle promoter analyses are consistent (Appendix 3 Table S8). Remarkably, only two TFBS pair predictions varied within the top thirteen CRM pair predictions for the two background methods. In particular, a known Mef2A/Myf TFBS CRM was identified in the randomly sampled background method but not in the analyses using the full background of genes. Since the background random sampling method improved TFBS pair prediction sensitivity in our reference collection, we hypothesize that a randomized sample may serve to minimize promoter characteristic biases that are inherent in our full gene ortholog dataset. As such, the random background sampling method was applied to all subsequent analyses.

Gene expression data is generally captured over a coarse timescale and, as a result, sets of differentially and/or putatively co-expressed genes are likely to contain a certain degree of heterogeneity (i.e. genes subject to different regulatory programs). To explore the impact of such noise on the analysis, we added twenty-five randomly selected genes to the muscle test set (Appendix 3 Table S9).
Notably, the CRM prediction procedure identified the skeletal muscle TFBS pairs Mef2A/Sp1 and Mef2A/SRF within the top five ranked predictions (Appendix 3 Table S10). However, two other relevant skeletal muscle pair combinations, Mef2A/Myf and Myf/Sp1, which were correctly predicted in the ‘specific’ muscle gene set promoter analyses, were absent in the analyses of the noise-added test set. This is not surprising, as the frequency of TFBS class pair instances can be obscured in a larger, less cohesive gene set, which suggests that promoter analyses outcomes can be significantly improved through the identification and incorporation of a well-supported set of putatively co-expressed genes. We retained and incorporated this noise-added representation of the skeletal muscle test set in the remainder of our validation analyses to parallel the noise that may be present in the co-expressed oligodendrocyte gene dataset.

To test the method's ability to recover CRMs in varying lengths of search sequence spanning one or more TSS, multiple sequence range parameters were evaluated (Appendix 3 Table S11). Across all of the search regions analyzed, the Mef2A/Sp1 and Mef2A/SRF cooperativity predictions frequently appeared in the top five ranked results and always within the top ten results (Supplemental File 4). The Sp1/SRF prediction was always located within the top twenty results, except in the one case where the promoter sequences analyzed did not include the full 2000 bp promoter regions. Validation analyses confirmed that the inclusion of longer sequences does not obfuscate discovery of over-represented TFBS combinations in promoters.

6.3.4.
Promoter analyses of a co-expressed oligodendrocyte gene set highlight potential TF cooperativity

We applied our validated promoter analyses method to the co-expressed set of myelin-associated genes (the IOLEDd expression dataset) to elucidate the TF cooperativity that may be responsible for the co-regulation of myelin-associated genes (Supplemental File 5). As many individual TFBS predictions implicate the binding of homologous pairs of TFs with similar binding properties, we clustered the JASPAR vertebrate TFBS profiles, using a binding profile similarity measure, into 47 labeled TF-groups (see Methods and Supplementary Table S14). Just over half of these groups (26/47) are singleton clusters composed of one TFBS profile (i.e. they are dissimilar to other binding profiles), which include binding site properties for zinc-finger (ZF), homeodomain (HOX), helix-loop-helix (HLH), and nuclear receptor (NR) TFs. Summarization of the set of oligodendrocyte promoter analyses, using TFBS class labels, highlighted four recurring TF-group combinations that corresponded to top ten ranked hits: SP1 (ZF) - SPZ1 (HLH); ROAZ (ZF) - ETS GRP; TAL1 (HLH) - SPZ1 (HLH); HLH GRP - SPZ1 (HLH). Many of these promoter analyses predictions align with previous links between associated TF families and OLs. A discussion of their characteristics and roles is presented in section E of the Appendix 3 supplementary material.

Identification of enhancer-weighted CRM pair predictions with overlapping predicted TFBS pairs distinguished oligodendrocyte co-expressed genes with CRMs made up of three TFBS in their non-coding regions (see summary in Appendix 3 Table S15 and detailed coordinates provided in Supplemental File 6). Although we did not ascertain the statistical significance of these CRMs, we hypothesize that the compositional over-representation of overlapping CRMs could suggest an expanded TF cooperativity relationship.

6.3.5.
An oligodendrocyte TF network supported by enhancer predicted regulatory elements

Using the set of statistically-supported TFBS signatures enriched in the co-expressed oligodendrocyte genes, we capitalized on the validated regulatory capacity of our myelin gene enhancers to further substantiate TF cooperativity predictions for genes co-expressed in oligodendrocytes. We extracted and stored all human-mouse aligned TFBS predictions found in the myelin enhancer sequence collection. Each of the enhancers was classified as “positive” or “negative” based on its temporal and spatial expression patterns, i.e. whether it activated a reporter gene in oligodendrocytes and its associated gene was co-expressed in the IOLEDd set (Appendix 3 Table S16). Using the full set of promoter analyses results, we identified and mapped all instances of the ranked TFBS pair predictions found in gene promoters to class labels, which resulted in 94 ranked distinct TFBS “class/group” CRM predictions with score values <= 0.05 (see Appendix 3 Table S17). To weight the CRM predictions, we retained only the TFBS pairs that were present in the positive enhancers and not found in the negative enhancers, resulting in 14 unique class CRM predictions targeting oligodendrocyte co-expressed genes (Appendix 3 Table S18). The extracted TF sub-network for a set of oligodendrocyte myelin expressed proteins [69] included 19 individual TF classes representing 14 TF class CRM predictions (Figure 6.6 and Table 6.1). Importantly, the network highlights previously identified regulators of OL development: SOX/HMG GRP [14-16, 70], POU3F1 (POU-HOX) [12, 13, 71], and NKX/HOX GRP [2, 9, 72]. Subsequent review of gene expression data and literature evidence for other TF regulatory network nodes produced noteworthy observations.

EGR/ZF GRP: Egr1, Egr2, and Egr3 TF gene expression is down-regulated in the OPC_EOLd set.
FKH GRP: The expression of a group of Forkhead TFs was detected in the oligodendrocyte gene expression datasets, including Foxp1, Foxn2, Foxj1, Foxn3, and Foxo1, which are all up-regulated in the EOL_MOLd expression group, and Foxg1 (OPC_EOLd), Foxj3 (OPC_EOLd), and Foxf2 (P4_P10d), which were measured as down-regulated in oligodendrocyte development.

FOS (LEUZIP): AP-1 is a DNA-binding protein complex that is composed of Fos and Jun protein member interactions and was previously shown to play a role in oligodendrocyte differentiation [73]. Fos is down-regulated in the IOLEDd set.

RELGRP: The NF-κB family acts as a homodimer or heterodimer and is made up of the following family members: NF-κB-1 (p50), NF-κB-2 (p52), RelA (p65), RelB, and Rel (c-Rel). Oxidative stress-induced apoptotic death in oligodendrocytes stimulates nuclear translocation and activity of AP-1 and NF-κB TFs [74]. Expression of the Nfkb1 and Nfkb2 TFs is up-regulated in the OPC_EOLd and OPC_MOLd datasets.

MEF2A (MADS); MYF (HLH); HAND1-TCFE2A (HLH): A number of proteins identified in a recently established CNS myelin proteome list [69] are known to be regulated by Mef2A and Myf TFs in muscle: desmin (Des), actin alpha 1 (Acta1), and actin alpha cardiac muscle 1 (Actc1) [75, 76], and creatine kinase brain (Ckb) [77, 78]. We noted that Mef2A and Hand1 TFs are measured as expressed in P7 rat brain oligodendrocyte expression data [79], and another known muscle co-regulator, Tcfe2a, is detected as down-regulated in the P4_P10d and EOL_MOLd sets.

NHLH1 (HLH): The helix-loop-helix protein Nhlh1 is required for normal brain development [80] and its expression is detected in rat brain oligodendrocytes at P7 [79].

TEAD1 (TEA): Tead1/Tea is also a known regulator of muscle genes and is associated with notochord maintenance and cell proliferation and survival in mouse development [81].
A homologous family member, Tead2, which may in some cases perform functionally redundant roles with Tead1 [81], is found differentially expressed and down-regulated in the P4_P10d dataset.

6.3.6. Prioritization of TFBS cooperativity predictions via enhancer feature weighting

Validated gene enhancer sequences contain regulatory features that are responsible for reporter activity. Participating TFBS are located within a genomic sequence context that facilitates their capacity to regulate transcription. Analysis of the characteristics of confirmed regulatory sequences could reveal additional predictive properties to inform the prioritization of CRM predictions for subsequent experimental validation.

The identification of larger putative regulatory sequence regions based on DNA sequence properties has received much research attention. For example, CpG islands (unmethylated CG dinucleotides) are associated with gene regulatory sequences [82]. Accordingly, CG dinucleotide content is often incorporated in approaches that predict regulatory regions [50, 83, 84]. Similarly, proximal promoter sequences have been classified using predicted physical properties such as DNA bendability and protein-DNA twist [85], predicted DNA curvature [86], and word content frequency-based approaches [87, 88]. TF-DNA attraction forces may be coupled with local properties that facilitate the protein interactions required for synergistic and/or antagonistic interplay of TFs. The interaction potential between two separated DNA-bound TFs can be influenced by whether the binding events occur on compatible faces of the DNA helix. This implies that the distance between two TFBS may be constrained to multiples of DNA helical turns (i.e. 10.5 bp on average) [89], and this constraint results from the energetic cost of twisting the DNA [90].
Correspondingly, the capacity for a DNA sequence to bend and curve to facilitate interaction between DNA-bound TFs that are separated along the DNA may play an important role. The evaluation of local sequence features for predicted TFBS combinations may identify similar sequence attributes, across a set of putative and/or known co-regulated genes, which may be used to prioritize the CRM-containing gene promoter regions for experimental validation. We evaluated a set of properties in sequences flanking known CRMs in the skeletal muscle reference set, which included: dinucleotide frequency, number of clustered weaker TFBS, DNA Helix Turn Factor (DHTF), DNA Helix Turn Modulus (DHTM), and predicted DNA bend and curve capacity characterizations (see Methods and Supplemental File 7). As experimentally validated binding site collections that include identified active and inactive sites for human and/or mouse are currently unavailable to perform feature selection and classification analyses, we evaluated a set of selected local binding site features whose distributions were compared to randomly selected sets of non-coding sequence. We assessed the significance of the distribution of average GC-content, predicted DNA bend, and curve capacity feature values surrounding known skeletal muscle TFBS pairs found in the skeletal muscle gene promoters (predicted by the promoter analyses). Notably, for known pairs of muscle regulators such as Mef2A/Sp1 and Mef2A/SRF, the distributions of the DNA bend capacity predictions and GC content in the muscle enhancer sequences were identified as statistically significant at an average p-value cut-off of 0.05 (Appendix 3 Figure S5). The distribution of DNA curvature values was found not to be significant, consistent with a previous study [86].
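Two of the simpler local features discussed above can be sketched as follows. The GC-content function is standard; `helical_phase_offset` is an assumed stand-in that measures how far a site spacing falls from a whole number of 10.5 bp helical turns, and it is not the DHTF or DHTM definition given in the Methods. The empirical p-value helper shows one common way to compare an observed feature value against values from randomly selected sequences; whether the analyses here used this exact form is an assumption.

```python
HELICAL_TURN_BP = 10.5  # average helical repeat cited in the text


def gc_content(seq):
    """Fraction of G and C bases in a sequence window."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def helical_phase_offset(center_a, center_b, turn=HELICAL_TURN_BP):
    """Distance in bp between a site spacing and the nearest whole helical turn.

    A value near 0 suggests the two sites lie on the same face of the helix.
    Hypothetical feature for illustration only.
    """
    spacing = abs(center_b - center_a)
    remainder = spacing % turn
    return min(remainder, turn - remainder)


def empirical_p(observed, null_values):
    """Add-one corrected fraction of null values at least as large as observed."""
    hits = sum(1 for value in null_values if value >= observed)
    return (hits + 1) / (len(null_values) + 1)


# Hypothetical values: a 6 bp flank and a pair of site centres 21 bp apart.
flank_gc = gc_content("GGCCAT")
offset = helical_phase_offset(100, 121)
p = empirical_p(0.9, [0.1, 0.2, 0.95])
```

A one-sided comparison is shown; testing for unusually low feature values would require counting null values at or below the observed value instead.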
We incorporated the statistically significant features in a clustering analysis to group local sequence features neighboring CRMs predicted in both the human muscle promoter sequences (by CSA) and mouse muscle enhancer reference collection sequences. The cluster analyses highlighted feature similarity between a group of muscle promoters and muscle enhancers (see, for example, Appendix 3 Figure S6 and Supplemental File 8). Such local feature similarity may be used to support and rank the regulatory potential of predicted TFBS combination instances. Accordingly, equivalent enhancer feature enumeration and cluster analyses were performed on the oligodendrocyte CRM predictions (Supplemental File 9 and Supplemental File 10). Follow-up studies that confirm DNA-binding and/or regulatory activity may be used to re-weight feature relevance and reprioritize the predicted CRM-targeted genes for validation.

6.4. Discussion

The elaboration of CNS myelin is largely controlled at the transcriptional level during oligodendrogenesis. Exposure of the underlying transcriptional regulatory program is essential for understanding the molecular mechanisms of myelination. In this study, we coupled a new set of experimentally validated myelin gene enhancer sequence data with an improved promoter analyses method to predict combinations of TFs involved in the regulation of myelination in oligodendrocytes. We applied a robust mouse in vivo validation technique to identify a novel set of functional murine myelin-associated enhancer sequences, which augments a previously established collection of myelin gene-associated regulatory sequences. Expression profile analyses of mouse optic nerve and a publicly available mouse forebrain dataset isolated a core set of genes and TFs that are co-expressed during oligodendrocyte myelination in early murine development. We implemented a promoter analyses method, which was validated in tests with known CRMs from a skeletal muscle reference collection.
Gene promoters for the tissue- and temporally-specific oligodendrocyte gene cohort were subjected to multiple analyses to generate a compilation of predicted over-represented CRMs, which may be responsible for myelin-associated gene co-expression. Co-occurrence of predicted TFBS motif patterns found in myelin enhancers that express reporter genes in oligodendrocyte cells provided additional support for specific TF cooperativity predictions and their putative gene targets. CRM predictions were amassed to construct a predicted TF interaction network targeting oligodendrocyte myelin genes. Overlapping predicted TFBS pair coordinates were merged to extend the TF cooperativity group predictions. Sequence features in the enhancers and putative gene promoters were used to preliminarily characterize and group each of the predicted CRM instances for experimental validation. The mouse in vivo validated myelin gene enhancers provide a novel set of regulatory sequences that express in oligodendrocytes and/or Schwann cells. This collection of regulatory regions complements existing enhancer databases, such as the Vista Enhancer Browser (VEB) [91] and Pleiades Promoter Project (PPP) brain-related enhancers resource [92] (and Portales-Casamar et al. 2009, manuscript in preparation). While our validated set of tested enhancer sequences is unique, other groups have also performed targeted testing of conserved non-coding regions neighboring the Pou3f1 and Olig1 genes. The VEB database presents experimental validation data for eleven conserved human sequences neighboring the Pou3f1 gene and provides evidence for reporter activity for two of these regions in mouse embryos. Similarly, three reporter constructs consisting of nine non-coding regions adjacent to the human OLIG1 gene have been tested in mice by the PPP group, with expression detected thus far for one construct in adult-stage mice.
In addition, mouse embryo expression results for a single overlapping Olig1-associated region are presented in the VEB database. Experimental regulatory analysis of non-coding regions is a critical step in deciphering regulatory mechanisms, and compilations of distinct and complementary gene regulatory region investigations can provide critical guidance in the discovery process. Several groups have performed gene expression profiling to identify genes responsible for oligodendrocyte development and function. We hypothesized that different tissue sources of oligodendrocytes share a core set of differentially expressed genes that are responsible for oligodendrocyte myelination during early development. Our optic nerve P4-P10 expression comparison dataset shared 65% of its differentially expressed genes with the mouse forebrain stage-specific differential expression profiles, and over half of the genes with altered expression are found in common with an OPC-EOL oligodendrocyte stage transition. We identified a subset of 31 TFs that are differentially regulated during the transition from progenitor to pro-myelination. Liu et al. recently demonstrated that oligodendrocyte developmental processes in spinal cord are sensitive to TF expression-level alterations [92] and suggested that dosage-dependent TF expression may be necessary. Future temporally-specific expression studies will be required to ascertain whether incremental TF expression dosage alterations play an integral role in the CNS oligodendrocyte regulatory program. Our refined list of pro-myelination genes and identified TF protein subsets will serve as a valuable reference for future oligodendrocyte expression profiling and regulatory studies. This study has generated a set of well-supported TF cooperativity predictions specific to oligodendrocyte pro-myelination transcriptional control.
We designed a new promoter analyses method, which statistically evaluates the promoters of a cohesive set of co-expressed myelin-activating genes for TFBS combinations and weights the predictions based on common TFBS instances and sequence features found in validated enhancer sequences. This new architecture and adapted combination site analysis (CSA) algorithm provide improved TFBS cooperativity predictions for known pairs of TFBS found in CRMs in a skeletal muscle reference collection [27]. Other studies have incorporated previously validated clustered TFBS instances in CRM-specific models to predict additional genes that may be regulated by a CRM [93, 94]. In contrast, our approach relies on the sequence content information contained in multi-species aligned putative gene promoters to first, without a priori knowledge, statistically evaluate the co-occurrence of TFBS motifs and then applies a weighting for the coincidence of predicted features in validated enhancers. Another recent study applied genome-wide CRM prediction in similarly aligned multi-species sequences through the identification of predicted TFBS clusters and statistical evaluation against a background model designed to reflect the GC content of the predicted CRM instance [34]. While this approach will assuredly identify a number of bona fide CRMs in gene promoter regions, background analyses using our promoter database (see Supplementary Table S8) indicate that a surprising number of CRM instances for valid CRMs are found in gene promoters that are presumably not regulated by these TFs. The substantial incidence of false positives not only further motivates the use of a tightly co-expressed set of genes to identify potential CRMs but also emphasizes that, although common co-occurrences of TFBS motifs in co-expressed genes are a relevant indicator of CRM activity, other factors, such as chromatin accessibility and/or TF availability, will influence a regulatory outcome.
Correspondingly, the recent emergence and adoption of high-throughput chromatin accessibility assays and TF expression analyses will provide important regulatory information that should be incorporated in future CRM analyses to guide the prediction process. Moreover, on-going experimental validation of TF cooperativity is critical for the evolution of our understanding of gene regulatory programs and gene-dosage related pathologies, and it is this objective that the set of oligodendrocyte and Schwann cell targeted enhancers and the detailed compendium of OL myelin-associated TF cooperativity predictions are intended to facilitate. The emerging view of oligodendrocyte gene regulation, supported by both literature and our predicted TF network, is one of coordinated combinatoric control, with sets of TFs providing sustained input (e.g., the HMG protein Sox10) and additional TFs exerting enhancing and/or antagonizing transcriptional effects in response to environmental influences (for example, Fos/Ap1). Unraveling these complex tissue-specific regulatory mechanisms will require an evolutionary process that involves iterative and synergistic application of computationally-supported predictions that are validated and expanded through subsequent rounds of detailed experimental evaluation. Much like CNS myelin production, PNS myelin is deposited during the maturation of Schwann cells, via cell state transitions that are mainly controlled by transcriptional mechanisms (for a recent review see [95]). Failure of gene expression mechanisms in Schwann cells may be responsible for debilitating neuropathies. PNS and CNS myelin share a set of common myelin architectural proteins (Gjb1, Plp1, and Mbp) and, accordingly, several of our enhancer sequences demonstrated expression in both oligodendrocytes and Schwann cells. A dual regulatory role provides support for shared TF cooperativity control in both cell types.
Notably, recent experimental analyses of the non-coding regions neighboring the Mbp [37-39, 96] and Plp1 [40] genes have exposed a regulatory architecture that is composed of multiple enhancers that specifically target gene expression in oligodendrocytes and/or Schwann cells. A similar study applied to the promoter regions of a Schwann cell-expressed gene cohort could reveal shared and unique transcriptional regulatory signatures. The joint investigation of CNS and PNS regulatory mechanisms in such a permissive, distinguishable, and transcriptionally-controlled system offers unique opportunities to explore and elucidate the transcription regulation mechanisms that are responsible for directing commonly expressed genes in different cell types.

6.5. Methods

Supplemental files are available for download at:
URL: www.cisreg.ca/PT_Share/
User: ch6reviewer
Password: thesis

6.5.1. Selection of conserved regions and validation in mice

The University of California, Santa Cruz (UCSC) browser human-mouse-dog non-coding sequence alignments (Mouse May 2005 (mm7, based on the Build 35 assembly by NCBI)) [97] for myelin-associated genes were qualitatively reviewed to identify well-conserved putative regulatory regions.
We selected non-coding regions for the following genes (see Supplemental Information Table S2 and Figures S1 for genomic locations):
1) an intergenic region between Claudin 11 (Cldn11) and solute carrier family 7 (cationic amino acid transporter, y+ system), member 4 (Slc7a4);
2) a region in the first intron of 2',3'-cyclic nucleotide 3' phosphodiesterase (Cnp);
3) an intergenic region between mitochondrial ribosomal protein S5 (mRpS5) and myelin and lymphocyte protein, T-cell differentiation protein (Mal);
4) a region in the first intron of Connexin 32 (Gjb1);
5) a region just 5' of the transcription start site (TSS) of Ermin (Ermn);
6) a conserved region 5' upstream of the Olig2 gene;
7) a region downstream of the oligodendrocyte transcription factor 1 (Olig1) gene and quite a distance upstream from Interferon (alpha and beta) receptor 2 (Ifnar2); and
8) an intergenic region between POU domain, class 3, transcription factor 1 (Pou3f1) and UTP11-like, U3 small nucleolar ribonucleoprotein (Utp11l).

6.5.2. Isolation of genomic DNA sequences

The conserved regions were amplified by PCR with Taq DNA polymerase on genomic DNA after selection of the primers in the surrounding sequence using Primer3 [98]. Restriction sites AscI and XhoI were added to the primers for insertion and digestion (Appendix 3 Table S3).

6.5.3. Generation of reporter constructs

PCR products were digested and subcloned into AscI and XhoI sites upstream of the heat shock protein (hsp) promoter in an hspeGFPLacZ Entry vector [40], with the exception of the Ermn-associated sequence, which was inserted into the eGFPLacZ Entry vector (a similar vector where the hsp promoter is removed). These reporter constructs were recombined into a “Gateway” Destination vector bearing Hprt1 homology arms for recombination at the Hprt1 locus using the LR clonase reaction kit (Invitrogen).
The final destination vectors were amplified, sequenced across the insert, linearized by restriction enzymes AgeI or SalI, and transfected into ES cells bearing a deletion of the promoter and exon 1 of the Hprt1 gene, as previously described [96]. The Hprt1 gene is restored through recombination, which allows the recombinant cells to survive hypoxanthine, aminopterin, thymidine (HAT) selection [99]. Positive clones were sequenced at the McGill University and Génome Québec Innovation Centre (Montreal, Quebec, Canada) using the forward primer 5'-CGCTTGTCTCTGGATGGAAC-3' located in the hsp promoter and the reverse primer 5'-AGCCTGGGCAACAGAGAAATATC-3' located in the Hprt1 homology arm. Sequences were analyzed using MacVector 7.2.

6.5.4. Histochemistry, fluorescence microscopy, and immunocytochemistry

Wholemount histochemical detection of β-galactosidase activity was performed as described previously [96]. Anesthetized mice were perfused with 4% PFA in phosphate buffer. Brains were dissected and incubated overnight in fixation buffer at 4°C. The brains were then kept in cold PBS buffer. Cryosections of brain tissues were frozen in 30% sucrose, and 30 µm sections on slides were evaluated for immunofluorescence and/or direct GFP detection. Primary antibodies: Rabbit anti GFP antibody (Molecular Probes 1/200), Rat anti MBP (Chemicon 1/500), Rat anti PDGFr (BD Pharmingen 1/500) and secondary antibodies: Goat anti-Rabbit Alexa 488 for GFP detection (Molecular Probes 1/1000), and Goat anti-Rat Cy3 for the detection of the other antigens (Jackson ImmunoResearch 1/1000) were incubated at 4°C overnight. Nuclei were labelled with Hoechst.

6.5.5. Gene expression profiling analyses

Dataset 1: P4 vs P10 mouse optic nerve expression data

Optic nerves were dissected from 75 P4 C57Bl/6J mice and 70 P10 C57Bl/6J mice. RNA was extracted and hybridized on Affymetrix GeneChip 430A 2.0 arrays. The experiment was repeated a second time, with an additional technical replicate included for the P10 sample.
The probe preparation, hybridization, and scanning of microarrays were performed at the McGill University and Génome Québec Innovation Centre according to the manufacturer's instructions. Background correction and normalization were performed in the R environment [100] with the Bioconductor packages [101] using the robust multichip analysis method (RMA) [102, 103] of the affy package [104]. The mouse Affymetrix chip probes were mapped to mouse NCBI Entrez Genes [105] using Bioconductor packages [101]. A set of mouse TF genes [60] was mapped to the mouse Affymetrix probes using a custom PERL program. Differential expression analyses for the P4 and P10 datasets were performed using a two-sample T-test with a random variance model [106] and the Benjamini and Hochberg method to compute a false discovery rate (FDR; the expected proportion of type 1 errors within the rejected hypotheses), as implemented in the BRB-array software (http://linus.nci.nih.gov/~brb/). PERL software was written to convert each set of HTML-formatted expression analysis results to text files and to extract and report all significantly (p-value <= 0.001 and FDR <= 0.05) differentially expressed genes across the pairwise expression profiles. A GO term enrichment analysis was performed on the differentially expressed set of genes with the AmiGO annotation and ontology toolkit [107] using a hypergeometric test incorporating the GO annotations for the Mouse Genome Database [108] and Rat Genome Database [109].

Dataset 2: OPCs vs EOLs and EOLs vs MOLs mouse forebrain expression data

Affymetrix Mouse 430 2.0 chip CEL files were downloaded from NCBI GEO (GEO dataset ID: GSE9566) [57].
Ten CEL files recording expression for oligodendrocyte lineage cells, namely oligodendrocyte progenitor cells (OPCs), early oligodendrocytes (EOLs), and mature oligodendrocytes (MOLs), were analyzed using RMA (as described above) to obtain individual probe set expression values, and each pair of experiments was subjected to a two-sample T-test with a random variance model [106] implemented in the BRB-array software (http://linus.nci.nih.gov/~brb/). The mouse Affymetrix chip probes were mapped to mouse NCBI Entrez Genes [105] using Bioconductor packages [101]. PERL software (described above) was used to extract and report all significantly (p-value <= 0.001 and FDR <= 0.05) differentially expressed genes across the pairwise expression profiles, and TF genes were identified as described above. Gene expression profiles were compared across the P4 vs P10, OPCs vs EOLs, and EOLs vs MOLs datasets for differential gene expression overlap, enhancer library co-expression, and TF-mapped differential gene expression using custom software.

6.5.6. Evaluation of local conservation of validated TFBS

Software was developed to extract human and mouse aligned genomic sequences from the UCSC 28-way multi-species alignments data [110] around annotated TFBS defined in the Annotation Regulatory Binding Site Database (ABS) [64], made available with genomic coordinates in the PAZAR database [111]. Sequence identity surrounding TFBS instances was evaluated and reviewed.

6.5.7. Development of the promoter database and CSA algorithm adaptation

Building upon our previous Combination Site Analysis (CSA) algorithm [35] for detection of over-represented combinations of TFBS in sets of co-expressed genes, we developed a new database of human-mouse non-coding alignments, using a human-mouse subset of the UCSC 28-way alignments data [110, 112]. All UCSC Hg18 multiz28way chromosome and database files were downloaded from the UCSC FTP site (ftp://hgdownload.cse.ucsc.edu/).
Kent source utilities were downloaded and compiled on a Linux server and Perl-C wrappers were developed for selected Kent programs. Software was developed to extract 15006 alignments from the chromosome databases based on pre-computed human alignment coordinates in the oPOSSUM database [61]. Human-anchored alignment subsets for human and mouse sequences, with aligned gaps removed and exons masked, were inserted into MySQL database tables, and coordinates for each non-contiguous mouse alignment in a MAF block were recorded in a MySQL table. Binding site information for a set of glia-related TFBS (Pou2F1/Oct1; Egr1/Krox-24/NGFI-A; Egr2/Krox-20; Egr3; Egr4; Pou3f1/Oct6; Nkx2-2; Nkx2-5) was extracted from literature to supplement the Jaspar database (Appendix 3 Figure S3). All Jaspar database TFBS profiles, along with the supplemented profiles, were enumerated in human-mouse alignments requiring only that predicted TFBS overlap. CSA uses the similarity of TF binding properties to initially compute the over-representation of combinations of TFBS classes/families (using TFBS profile clustering) in a set of co-expressed genes, as compared with a background set. This predicted class-based set is then expanded and evaluated at a TFBS level to identify statistically over-represented combinations of individual TFBS for the set of co-expressed genes. We adapted the software to interface to the new database and modified the algorithm to establish one randomly selected background sample extracted from the gene promoter database for the class combination evaluation and a second, independent randomly extracted sample for the TFBS combination evaluation step.

6.5.8. Promoter analyses method validation

A reference collection of twenty-five (Ensembl-mapped) human muscle-related genes, which are known to be co-regulated by specific CRMs, was utilized for the validation of the promoter analyses approach ([68] and Appendix 3 Table S6).
This set was used as the ‘co-expressed gene set’ input list to the revised CSA algorithm (described above) in the promoter analyses method evaluation phase, using different upstream and downstream parameter search regions (Appendix 3 Table S11) and a full background set of gene promoters in the database. For each analysis we retained the TFBS inter-binding site distance (IBSD) parameter of 225 bp. Interacting DNA-bound TFs require a degree of physical proximity to facilitate their interaction. TFBS combinations have been detected in the skeletal muscle collection within 200 bp and DNA-bound TF cooperativity has been observed across distances spanning 20 helical turns (~210 bp) [90]. We ran a second round of analyses using the same muscle collection of 25 genes as input and the same parameter space values (Appendix 3 Table S11), but incorporated random background promoter sampling steps in the CSA algorithm. A ‘noisier representation’ of the co-expressed skeletal muscle collection was established by appending an additional set of 25 randomly selected genes from the promoter database to the collection, yielding a list of 50 genes for analysis (Appendix 3 Table S9). A third in silico promoter analysis experiment was performed with the noisier co-regulated skeletal muscle set using the same parameter values established earlier (Appendix 3 Table S11) with the random background promoter sampling steps.

6.5.9. Jaspar profile clustering and cluster labeling

Software was written to extract and cluster Jaspar vertebrate profiles. Pairwise profile Pearson correlations were computed using the CompareAce program and distances were computed. Clustering was performed using the hclust function in the R software package using a cut-off of 0.40. TFBS profile clusters were examined manually and each cluster with 4 or more members was assigned a “family label”.
Clusters with fewer than 4 members were labeled with the corresponding TF name and structural class (Appendix 3 Table S14).

6.5.10. Promoter analysis of oligodendrocyte co-expression data

Promoter analyses were conducted over the oligodendrocyte early development (IOLEDd) co-expression dataset (Supplemental File 2) with the same parameter values (Appendix 3 Table S11) that were applied to the muscle set. Mouse Entrez Genes in the IOLEDd set were mapped to 202 unique mouse Ensembl gene ids associated with 194 Ensembl-predicted human orthologs, of which 178 had established human-mouse ortholog records in the promoter database (Supplemental File 11). This promoter database-mapped co-expression dataset was used for all CSA analyses. A second set of CSA analyses was performed using a TFBS profile threshold cut-off of 85% with all other parameter values maintained. In total, eighteen analyses were performed, each of which provided a ranked list of TFBS pair predictions that fell below the 0.05 ranking cut-off. Software was written to map the CRM predictions to CRM class labels using the TF class label mapping described above.

6.5.11. Enhancer feature weighting of CRM predictions

To identify predicted TFBS features in the mouse enhancer sequences, we developed software to extract UCSC MAF 28-way alignments [97, 110] for an enhancer’s genomic coordinates and predicted TFBS instances in mouse-human aligned positions. All predictions for each enhancer were loaded to a MySQL database. This procedure was performed for a set of mouse muscle enhancer regions (Appendix 3 Table S12) and the full collection of myelin gene enhancers (Appendix 3 Table S13), which included enhancers previously validated by this group. Software was developed to identify and report all genomic coordinates for TFBS pairs that were predicted for putative target genes and record the instances of corresponding predicted TFBS combinations found in the active and negative enhancer sequences.
This analysis was performed for both the muscle CSA analyses and the oligodendrocyte CSA-predicted CRMs. We classified the enhancer sequences into positive and negative groups based on their tissue-specific and temporal expression. For the skeletal muscle reference collection, we used all mouse-based genes in the reference collection (Appendix 3 Table S12) in the positive group. For the oligodendrocyte dataset, we placed gene enhancers that were expressed in the CNS and also co-expressed in the IOLEDd developmental set in the positive group. The negative group comprised two enhancers that did not express: one (CNP) that was found in the IOLEDd set, and the other (Ermn) which has no probe on the P4_P10 Affymetrix chips but was found expressed and up-regulated in both the OPC_EOLd and EOLd_MOLd sets (Appendix 3 Table S16). Software was developed to evaluate the coincidence of TFBS pairs that were found in these groups for each of the TF cooperativity predictions. TFBS pair predictions were identified and retained if they had a co-occurring prediction in a positive enhancer with no occurrence in a negative enhancer, and predictions were pruned from the enhancer-weighted CRM list if TFBS combination instances were found in the negative enhancers. A PERL program was developed to map the enhancer-weighted CRM predictions to CRM class labels using the TF class label mapping described above.

6.5.12. Oligodendrocyte TF network construction and analysis

Predicted CRM gene targets identified in the oligodendrocyte CSA analyses were extracted if they were also found in a list of abundant CNS myelin proteins produced in a recent proteome study (see Figure 4a in [69]). Software was developed to establish the myelin network TF-protein node relationships, which were visualized using BioLayout [113]. TF gene differential expression was reviewed for TF regulators in the myelin sub-network in mouse optic nerve (P4 - P10) and mouse forebrain (P16) expression profile comparisons.
We also queried a P7 rat brain oligodendrocyte dataset (GEO dataset: GSE5940 [79]) using the NCBI GEO Profiles viewer [114] for evidence of TF gene expression in early oligodendrocyte development in rats.

6.5.13. Evaluation of predicted CRM sequence characteristics

For each of the skeletal muscle-specific TFBS instances predicted by our promoter analyses, we extracted 100 bp windows flanking (but not including) the TFBS motifs and evaluated each of the sequences for dinucleotide content, predicted DNA bending and curving capacity, and homogeneous clusters of TFBS [50] for TFBS score threshold values at or above 70%. DHTF (DNA Helix-Turn-Factor: the number of DNA helix turns) and DHTM (DNA Helix-Turn-Modulus: the residual distance after subtraction of whole multiples of the DNA helix turn distance) metrics were also computed for center-to-center TFBS pair motif distances. Each set of feature values for a group of skeletal muscle genes containing known muscle TFBS pair instances (predicted by our promoter analyses) was standardized and centered. Software was developed to compute Kolmogorov-Smirnov tests (KS-test) to compare the distribution of GC%-content and predicted DNA bend and DNA curve capacity features in predicted enhancer sequences versus the same set of metric distributions computed from randomly extracted non-coding sequences. Specifically, for each of the CRM predictions, a set of randomly selected non-coding sequences of 100 bp (for a set size equal to the number of target genes) was extracted and a KS-test p-value was computed between the random sample of sequence feature values and the sample of enhancer sequence feature values. Thirty random sampling iterations were performed for each CRM instance and KS-test p-values were averaged. A constant of 0.95 was added to the average KS-test p-value and a log was taken, so that features with average p-values <= 0.05 appear as histogram bar values equal to or less than 0 (Appendix 3 Figure S5).
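The resampling and log-offset steps described above can be sketched as follows. This is a hedged illustration under several assumptions, not the thesis software: the text does not state which KS-test implementation or logarithm base was used, so the sketch swaps in a pure-Python two-sample KS statistic with a common asymptotic p-value approximation and the natural log (the sign of the bar value, which is what the cut-off relies on, does not depend on the base). All function names, and the `background_pool` of precomputed feature values from random non-coding windows, are hypothetical.

```python
import math
import random


def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    n, m = len(a), len(b)
    i = j = 0
    d = 0.0
    while i < n and j < m:
        x = min(a[i], b[j])
        while i < n and a[i] <= x:   # advance past all values <= x (handles ties)
            i += 1
        while j < m and b[j] <= x:
            j += 1
        d = max(d, abs(i / n - j / m))
    return d


def ks_pvalue(d, n, m):
    """Asymptotic p-value approximation (assumption: not the thesis method)."""
    if d <= 0.0:
        return 1.0
    root = math.sqrt(n * m / (n + m))
    lam = (root + 0.12 + 0.11 / root) * d
    total, sign = 0.0, 1.0
    for k in range(1, 101):
        term = sign * math.exp(-2.0 * k * k * lam * lam)
        total += term
        if abs(term) < 1e-10:        # alternating series has converged
            return max(0.0, min(1.0, 2.0 * total))
        sign = -sign
    return 1.0                       # no convergence: samples are indistinguishable


def mean_ks_pvalue(enhancer_values, background_pool, iterations=30, rng=None):
    """Average KS p-value over repeated random background samples of equal size."""
    rng = rng or random.Random(0)
    size = len(enhancer_values)
    pvalues = []
    for _ in range(iterations):
        background = rng.sample(background_pool, size)
        d = ks_statistic(enhancer_values, background)
        pvalues.append(ks_pvalue(d, size, size))
    return sum(pvalues) / len(pvalues)


def histogram_bar(mean_p, offset=0.95):
    """log(mean_p + 0.95): at or below 0 exactly when mean_p <= 0.05."""
    return math.log(mean_p + offset)
```

Because log(1) = 0, adding 0.95 before taking the log places the 0.05 significance cut-off at zero, so significant features plot as bars at or below the axis.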
For each CRM dataset, the GC% and predicted bend capacity values were standardized (centered and scaled by the feature sample standard deviation) and Euclidean distances were computed to cluster similar feature values for CRM instances in non-coding regions of genes and predicted CRM instances in enhancer sequences. Dendrogram-based heatmaps were created using the heatmap.2 function in the R gplots package. Sequence feature analyses using the CRM predictions for the oligodendrocyte CSA predictions and myelin-associated enhancer sequences were performed in the same manner as described for the muscle dataset analyses.

6.5.14. Analysis of overlapping oligodendrocyte CRM predictions

Software was written to identify enhancer-weighted TFBS pair predictions which overlapped at one predicted TFBS genomic coordinate position to identify CRM predictions of size three.

Table 6.1. Predicted TF regulatory network for myelin genes

TF Class I        TF Class II           Predicted Targeted Myelin Genes
ETS GRP           NKX/HOX GRP           CLDN11; PLP1
ETS GRP           SP1 (ZF)              CLDN11; GJB1; TSPAN2
FKH GRP           MYF (HLH)             CLDN11; MBP; PLP1
FKH GRP           REL GRP               MAL
FOS (LEUZIP)      NKX/HOX GRP           CLDN11; MAL; MBP; PLP1
GFI (ZF)          MEF2A (MADS)          CLDN11; MAL
HLH GRP           SP1 (ZF)              CLDN11; GJB1; MAL; MBP; PLP1
HLH GRP           SPZ1 (HLH)            CLDN11; CNP; GJB1; PLP1
MEF2A (MADS)      HAND1-TCFE2A (HLH)    CLDN11; MAL; TSPAN2
MEF2A (MADS)      POU3F1 (POU-HOX)      CLDN11; TSPAN2
NHLH1 (HLH)       TEAD1 (TEA)           CLDN11; PLP1
RORA1 (NR)        SPZ1 (HLH)            PLP1
SOX/HMG GRP       EGR/ZF GRP            GJB1; MAL; MBP
TEAD1 (TEA)       SPZ1 (HLH)            CLDN11; PLP1

Figure 6.1. Enhancer selection and validation. Genomic coordinates displayed using mouse MM8-mapped coordinates (http://genome.ucsc.edu).

Figure 6.2. Histochemical detection of β-galactosidase activity in early postnatal development. Histochemical detection of β-galactosidase activity in whole mounts of mid-sagittal sections of brain and spinal cord/spinal root, dorsal root, and dorsal root ganglia at developmental timeframe P5 - P10.

Figure 6.3.
Histochemical detection of β-galactosidase activity in whole mounts at adult developmental stage. Histochemical detection of β-galactosidase activity in whole mounts of mid-sagittal sections of brain and spinal cord/spinal root and dorsal root and dorsal root ganglia at 2 – 5 months old.

Figure 6.4. Characterization of cell populations expressing the Gjb1, Cldn11, Pou3f1 and Mal constructs. A. Brain sections stained for MBP (red) and GFP (green) depict multiple co-labeled oligodendrocyte lineage cells that express the Gjb1, Cldn11, and Pou3f1 transgenes (panel labels: Pou3f1, Gjb1, Cldn11). B. The Mal transgene is expressed in oligodendrocyte precursor cells (PDGFR-alpha) in the cortex and myelinated oligodendrocytes (MBP) in the cortex and brain stem.

Figure 6.5. Overlap of differentially expressed genes in the two expression profile datasets. A. Concordance of differential gene expression; B. Myelin gene expression; C. Subset of differentially expressed genes that are TFs.

Figure 6.6. Myelin gene TFBS regulatory sub-network

6.6. References

1. Dugas JC, Tai YC, Speed TP, Ngai J, Barres BA: Functional genomic analysis of oligodendrocyte differentiation. J Neurosci 2006, 26(43):10967-10983.
2. Fu H, Qi Y, Tan M, Cai J, Takebayashi H, Nakafuku M, Richardson W, Qiu M: Dual origin of spinal oligodendrocyte progenitors and evidence for the cooperative role of Olig2 and Nkx2.2 in the control of oligodendrocyte differentiation. Development (Cambridge, England) 2002, 129(3):681-693.
3. Kondo T, Raff M: Basic helix-loop-helix proteins and the timing of oligodendrocyte differentiation. Development (Cambridge, England) 2000, 127(14):2989-2998.
4.
Lu QR, Yuk D, Alberta JA, Zhu Z, Pawlitzky I, Chan J, McMahon AP, Stiles CD, Rowitch DH: Sonic hedgehog--regulated oligodendrocyte lineage genes encoding bHLH proteins in the mammalian central nervous system. Neuron 2000, 25(2):317-329. 5. Samanta J, Kessler JA: Interactions between ID and OLIG proteins mediate the inhibitory effects of BMP4 on oligodendroglial differentiation. Development (Cambridge, England) 2004, 131(17):4131-4142. 6. Xin M, Yue T, Ma Z, Wu FF, Gow A, Lu QR: Myelinogenesis and axonal recognition by oligodendrocytes in brain are uncoupled in Olig1-null mice. J Neurosci 2005, 25(6):1354-1365. 7. Zhou Q, Anderson DJ: The bHLH transcription factors OLIG2 and OLIG1 couple neuronal and glial subtype specification. Cell 2002, 109(1):61-73. 8. Liu R, Cai J, Hu X, Tan M, Qi Y, German M, Rubenstein J, Sander M, Qiu M: Region-specific and stage-dependent regulation of Olig gene expression and oligodendrogenesis by Nkx6.1 homeodomain transcription factor. Development (Cambridge, England) 2003, 130(25):6221-6231. 9. Qi Y, Cai J, Wu Y, Wu R, Lee J, Fu H, Rao M, Sussel L, Rubenstein J, Qiu M: Control of oligodendrocyte differentiation by the Nkx2.2 homeodomain transcription factor. Development (Cambridge, England) 2001, 128(14):2723- 2733. 10. Southwood C, He C, Garbern J, Kamholz J, Arroyo E, Gow A: CNS myelin paranodes require Nkx6-2 homeoprotein transcriptional activity for normal structure. J Neurosci 2004, 24(50):11215-11225. 11. Sun T, Dong H, Wu L, Kane M, Rowitch DH, Stiles CD: Cross-repressive interaction of the Olig2 and Nkx2.2 transcription factors in developing neural tube associated with formation of a specific physical complex. J Neurosci 2003, 23(29):9547-9556. 12. Collarini EJ, Kuhn R, Marshall CJ, Monuki ES, Lemke G, Richardson WD: Down-regulation of the POU transcription factor SCIP is an early event in oligodendrocyte differentiation in vitro. Development (Cambridge, England) 1992, 116(1):193-200. 13. 
Collarini EJ, Pringle N, Mudhar H, Stevens G, Kuhn R, Monuki ES, Lemke G, Richardson WD: Growth factors and transcription factors in oligodendrocyte development. Journal of cell science 1991, 15:117-123. 14. Stolt CC, Lommes P, Friedrich RP, Wegner M: Transcription factors Sox8 and Sox10 perform non-equivalent roles during oligodendrocyte development 220 despite functional redundancy. Development (Cambridge, England) 2004, 131(10):2349-2358. 15. Stolt CC, Lommes P, Sock E, Chaboissier MC, Schedl A, Wegner M: The Sox9 transcription factor determines glial fate choice in the developing spinal cord. Genes & development 2003, 17(13):1677-1689. 16. Stolt CC, Schlierf A, Lommes P, Hillgartner S, Werner T, Kosian T, Sock E, Kessaris N, Richardson WD, Lefebvre V, Wegner M: SoxD proteins influence multiple stages of oligodendrocyte development and modulate SoxE protein function. Developmental cell 2006, 11(5):697-709. 17. Sock E, Leger H, Kuhlbrodt K, Schreiber J, Enderich J, Richter-Landsberg C, Wegner M: Expression of Krox proteins during differentiation of the O-2A progenitor cell line CG-4. Journal of neurochemistry 1997, 68(5):1911-1919. 18. Wang SZ, Dulin J, Wu H, Hurlock E, Lee SE, Jansson K, Lu QR: An oligodendrocyte-specific zinc-finger transcription regulator cooperates with Olig2 to promote oligodendrocyte differentiation. Development (Cambridge, England) 2006, 133(17):3389-3398. 19. Gokhan S, Marin-Husstege M, Yung SY, Fontanez D, Casaccia-Bonnefil P, Mehler MF: Combinatorial profiles of oligodendrocyte-selective classes of transcriptional regulators differentially modulate myelin basic protein gene expression. The Journal of neuroscience : the official journal of the Society for Neuroscience 2005, 25(36):8311-8321. 20. Li H, Lu Y, Smith HK, Richardson WD: Olig1 and Sox10 interact synergistically to drive myelin basic protein transcription in oligodendrocytes. J Neurosci 2007, 27(52):14375-14382. 21. 
Liu Z, Hu X, Cai J, Liu B, Peng X, Wegner M, Qiu M: Induction of oligodendrocyte differentiation by Olig2 and Sox10: evidence for reciprocal interactions and dosage-dependent mechanisms. Developmental biology 2007, 302(2):683-693. 22. Sugimori M, Nagao M, Bertrand N, Parras CM, Guillemot F, Nakafuku M: Combinatorial actions of patterning and HLH transcription factors in the spatiotemporal control of neurogenesis and gliogenesis in the developing spinal cord. Development (Cambridge, England) 2007, 134(8):1617-1629. 23. MacIsaac KD, Fraenkel E: Practical strategies for discovering regulatory DNA sequence motifs. PLoS computational biology 2006, 2(4):e36. 24. Wasserman WW, Sandelin A: Applied bioinformatics for the identification of regulatory elements. Nature reviews 2004, 5(4):276-287. 25. Stormo GD: DNA binding sites: representation and discovery. Bioinformatics (Oxford, England) 2000, 16(1):16-23. 26. Krivan W, Wasserman WW: A predictive model for regulatory sequences directing liver-specific transcription. Genome research 2001, 11(9):1559-1566. 27. Wasserman WW, Fickett JW: Identification of regulatory regions which confer muscle-specific gene expression. Journal of molecular biology 1998, 278(1):167-181. 28. GuhaThakurta D, Palomar L, Stormo GD, Tedesco P, Johnson TE, Walker DW, Lithgow G, Kim S, Link CD: Identification of a novel cis-regulatory element involved in the heat shock response in Caenorhabditis elegans using 221 microarray gene expression and computational methods. Genome research 2002, 12(5):701-712. 29. Sandelin A, Wasserman WW: Constrained binding site diversity within families of transcription factors enhances pattern discovery bioinformatics. Journal of molecular biology 2004, 338(2):207-215. 30. Farhadi HF, Lepage P, Forghani R, Friedman HC, Orfali W, Jasmin L, Miller W, Hudson TJ, Peterson AC: A combinatorial network of evolutionarily conserved myelin basic protein regulatory sequences confers distinct glial- specific phenotypes. 
The Journal of neuroscience : the official journal of the Society for Neuroscience 2003, 23(32):10214-10223. 31. Pennacchio LA, Ahituv N, Moses AM, Prabhakar S, Nobrega MA, Shoukry M, Minovitsky S, Dubchak I, Holt A, Lewis KD, Plajzer-Frick I, Akiyama J, De Val S, Afzal V, Black BL, Couronne O, Eisen MB, Visel A, Rubin EM: In vivo enhancer analysis of human conserved non-coding sequences. Nature 2006, 444(7118):499-502. 32. Visel A, Akiyama JA, Shoukry M, Afzal V, Rubin EM, Pennacchio LA: Functional autonomy of distant-acting human enhancers. Genomics 2009, 93(6):509-513. 33. Aerts S: Computational detection of cis -regulatory modules. In: Bioinformatics (Oxford, England). vol. 19; 2003: 5ii-14. 34. Blanchette M, Bataille AR, Chen X, Poitras C, Laganiere J, Lefebvre C, Deblois G, Giguere V, Ferretti V, Bergeron D, Coulombe B, Robert F: Genome-wide computational prediction of transcriptional regulatory modules reveals new insights into human gene expression. Genome research 2006, 16(5):656-668. 35. Huang SS, Fulton DL, Arenillas DJ, Perco P, Ho Sui SJ, Mortimer JR, Wasserman WW: Identification of over-represented combinations of transcription factor binding sites in sets of co-expressed genes. In: Series on Advances in Bioinformatics and Computational Biology Volume 3 - Proceedings of the 4th Asia-Pacific Bioinformatics Conference: 2006; Taipei, Taiwan: Imperial College Press, London UK; 2006: 247- 256. 36. Sharan R, Ben-Hur A, Loots GG, Ovcharenko I: CREME: Cis-Regulatory Module Explorer for the human genome. Nucleic acids research 2004, 32(Web Server issue):W253-256. 37. Denarier E, Forghani R, Farhadi HF, Dib S, Dionne N, Friedman HC, Lepage P, Hudson TJ, Drouin R, Peterson A: Functional organization of a Schwann cell enhancer. The Journal of neuroscience : the official journal of the Society for Neuroscience 2005, 25(48):11210-11217. 38. Dib S: Functional Analysis Of The Myelin Basic Protein Gene Regulation. Montreal, Quebec: McGill University; 2008. 39. 
Dionne N: Structure and function of Module 3, a conserved enhancer of the myelin basic protein gene. Montreal, Quebec: McGill University; 2006. 40. Tuason MC, Rastikerdar A, Kuhlmann T, Goujet-Zalc C, Zalc B, Dib S, Friedman H, Peterson A: Separate proteolipid protein/DM20 enhancers serve different lineages and stages of development. J Neurosci 2008, 28(27):6895-6903. 41. Calver AR, Hall AC, Yu WP, Walsh FS, Heath JK, Betsholtz C, Richardson WD: Oligodendrocyte population dynamics and the role of PDGF in vivo. Neuron 1998, 20(5):869-882. 222 42. Kessaris N, Fogarty M, Iannarelli P, Grist M, Wegner M, Richardson WD: Competing waves of oligodendrocytes in the forebrain and postnatal elimination of an embryonic lineage. Nature neuroscience 2006, 9(2):173-179. 43. Chow E, Mottahedeh J, Prins M, Ridder W, Nusinowitz S, Bronstein JM: Disrupted compaction of CNS myelin in an OSP/Claudin-11 and PLP/DM20 double knockout mouse. Molecular and cellular neurosciences 2005, 29(3):405- 413. 44. Bennett MV, Barrio LC, Bargiello TA, Spray DC, Hertzberg E, Saez JC: Gap junctions: new tools, new answers, new questions. Neuron 1991, 6(3):305-320. 45. Kunzelmann P, Blumcke I, Traub O, Dermietzel R, Willecke K: Coexpression of connexin45 and -32 in oligodendrocytes of rat brain. Journal of neurocytology 1997, 26(1):17-22. 46. Schaeren-Wiemers N, Bonnet A, Erb M, Erne B, Bartsch U, Kern F, Mantei N, Sherman D, Suter U: The raft-associated protein MAL is required for maintenance of proper axon--glia interactions in the central nervous system. The Journal of cell biology 2004, 166(5):731-742. 47. Foran DR, Peterson AC: Myelin acquisition in the central nervous system of the mouse revealed by an MBP-Lac Z transgene. J Neurosci 1992, 12(12):4890-4897. 48. Brockschnieder D, Sabanay H, Riethmacher D, Peles E: Ermin, a myelinating oligodendrocyte-specific protein that regulates cell morphology. J Neurosci 2006, 26(3):757-762. 49. 
Trapp BD, Bernier L, Andrews SB, Colman DR: Cellular and subcellular distribution of 2',3'-cyclic nucleotide 3'-phosphodiesterase and its mRNA in the rat central nervous system. Journal of neurochemistry 1988, 51(3):859-868. 50. Zhang C, Xuan Z, Otto S, Hover JR, McCorkle SR, Mandel G, Zhang MQ: A clustering property of highly-degenerate transcription factor binding sites in the mammalian genome. Nucleic acids research 2006, 34(8):2238-2246. 51. Bermingham JR, Jr., Scherer SS, O'Connell S, Arroyo E, Kalla KA, Powell FL, Rosenfeld MG: Tst-1/Oct-6/SCIP regulates a unique step in peripheral myelination and is required for normal respiration. Genes & development 1996, 10(14):1751-1762. 52. Jaegle M, Mandemakers W, Broos L, Zwart R, Karis A, Visser P, Grosveld F, Meijer D: The POU factor Oct-6 and Schwann cell differentiation. Science 1996, 273(5274):507-510. 53. Mandemakers W, Zwart R, Jaegle M, Walbeehm E, Visser P, Grosveld F, Meijer D: A distal Schwann cell-specific enhancer mediates axonal regulation of the Oct-6 transcription factor during peripheral nerve development and regeneration. The EMBO journal 2000, 19(12):2992-3003. 54. Marco A, Konikoff C, Karr TL, Kumar S: Relationship between gene co- expression and sharing of transcription factor binding sites in Drosophila melanogaster. Bioinformatics (Oxford, England) 2009. 55. Allocco DJ, Kohane IS, Butte AJ: Quantifying the relationship between co- expression, co-regulation and gene function. BMC bioinformatics 2004, 5:18. 56. Harris MA, Clark J, Ireland A, Lomax J, Ashburner M, Foulger R, Eilbeck K, Lewis S, Marshall B, Mungall C, Richter J, Rubin GM, Blake JA, Bult C, Dolan M, Drabkin H, Eppig JT, Hill DP, Ni L, Ringwald M, Balakrishnan R, Cherry JM, 223 Christie KR, Costanzo MC, Dwight SS, Engel S, Fisk DG, Hirschman JE, Hong EL, Nash RS et al: The Gene Ontology (GO) database and informatics resource. Nucleic acids research 2004, 32(Database issue):D258-261. 57. 
Cahoy JD, Emery B, Kaushal A, Foo LC, Zamanian JL, Christopherson KS, Xing Y, Lubischer JL, Krieg PA, Krupenko SA, Thompson WJ, Barres BA: A transcriptome database for astrocytes, neurons, and oligodendrocytes: a new resource for understanding brain development and function. J Neurosci 2008, 28(1):264-278. 58. Gray PA, Fu H, Luo P, Zhao Q, Yu J, Ferrari A, Tenzen T, Yuk DI, Tsung EF, Cai Z, Alberta JA, Cheng LP, Liu Y, Stenman JM, Valerius MT, Billings N, Kim HA, Greenberg ME, McMahon AP, Rowitch DH, Stiles CD, Ma Q: Mouse brain organization revealed through direct genome-scale TF expression analysis. Science (New York, NY) 2004, 306(5705):2255-2257. 59. Choi MY, Romer AI, Hu M, Lepourcelet M, Mechoor A, Yesilaltay A, Krieger M, Gray PA, Shivdasani RA: A dynamic expression survey identifies transcription factors relevant in mouse digestive tract development. Development (Cambridge, England) 2006, 133(20):4119-4129. 60. Fulton DL, Sundararajan S, Badis G, Hughes TR, Wasserman WW, Roach JC, Sladek R: TFCat: the curated catalog of mouse and human transcription factors. Genome biology 2009, 10(3):R29. 61. Ho Sui SJ, Fulton DL, Arenillas DJ, Kwon AT, Wasserman WW: oPOSSUM: integrated tools for analysis of regulatory motif over-representation. Nucleic acids research 2007, 35(Web Server issue):W245-252. 62. Odom D, Dowell R, Jacobsen E, Gordon W, Danford T, Macisaac K, Rolfe P, Conboy C, Gifford D, Fraenkel E: Tissue-specific transcriptional regulation has diverged significantly between human and mouse. In: Nature genetics. vol. 39; 2007: 730-732. 63. McGaughey DM, Vinton RM, Huynh J, Al-Saif A, Beer MA, McCallion AS: Metrics of sequence constraint overlook regulatory sequences in an exhaustive analysis at phox2b. Genome research 2008, 18(2):252-260. 64. Blanco E, Farre D, Alba MM, Messeguer X, Guigo R: ABS: a database of Annotated regulatory Binding Sites from orthologous promoters. Nucleic acids research 2006, 34(Database issue):D63-67. 65. 
King DC, Taylor J, Zhang Y, Cheng Y, Lawson HA, Martin J, Analysis EgfTRaMS, Chiaromonte F, Miller W, Hardison RC: Finding cis-regulatory elements using comparative genomics: some lessons from ENCODE data. In: Genome research. vol. 17; 2007: 775-786. 66. Blanchette M: Aligning Multiple Genomic Sequences With the Threaded Blockset Aligner. In: Genome research. vol. 14; 2004: 708-715. 67. Blanco E, Messeguer X, Smith TF, Guigo R: Transcription factor map alignment of promoter regions. PLoS computational biology 2006, 2(5):e49. 68. Combined Muscle Data (Human, Rat, Mouse Only) [www.cisreg.ca/tjkwon] 69. Jahn O, Tenzer S, Werner HB: Myelin proteomics: molecular anatomy of an insulating sheath. Molecular neurobiology 2009, 40(1):55-72. 70. Sohn J, Natale J, Chew LJ, Belachew S, Cheng Y, Aguirre A, Lytle J, Nait- Oumesmar B, Kerninon C, Kanai-Azuma M, Kanai Y, Gallo V: Identification of 224 Sox17 as a transcription factor that regulates oligodendrocyte development. J Neurosci 2006, 26(38):9722-9735. 71. Jensen NA, Pedersen KM, Celis JE, West MJ: Neurological disturbances, premature lethality, and central myelination deficiency in transgenic mice overexpressing the homeo domain transcription factor Oct-6. The Journal of clinical investigation 1998, 101(6):1292-1299. 72. Awatramani R, Scherer S, Grinspan J, Collarini E, Skoff R, O'Hagan D, Garbern J, Kamholz J: Evidence that the homeodomain protein Gtx is involved in the regulation of oligodendrocyte myelination. J Neurosci 1997, 17(17):6657-6668. 73. Barnett SC, Rosario M, Doyle A, Kilbey A, Lovatt A, Gillespie DA: Differential regulation of AP-1 and novel TRE-specific DNA-binding complexes during differentiation of oligodendrocyte-type-2-astrocyte (O-2A) progenitor cells. Development (Cambridge, England) 1995, 121(12):3969-3977. 74. 
Vollgraf U, Wegner M, Richter-Landsberg C: Activation of AP-1 and nuclear factor-kappaB transcription factors is involved in hydrogen peroxide- induced apoptotic cell death of oligodendrocytes. Journal of neurochemistry 1999, 73(6):2501-2509. 75. Di Padova M, Caretti G, Zhao P, Hoffman EP, Sartorelli V: MyoD acetylation influences temporal patterns of skeletal muscle gene expression. The Journal of biological chemistry 2007, 282(52):37650-37659. 76. Hinits Y, Hughes SM: Mef2s are required for thick filament formation in nascent muscle fibres. Development (Cambridge, England) 2007, 134(13):2511- 2519. 77. Hobson GM, Molloy GR, Benfield PA: Identification of cis-acting regulatory elements in the promoter region of the rat brain creatine kinase gene. Molecular and cellular biology 1990, 10(12):6533-6543. 78. Hobson GM, Mitchell MT, Molloy GR, Pearson ML, Benfield PA: Identification of a novel TA-rich DNA binding protein that recognizes a TATA sequence within the brain creatine kinase promoter. Nucleic acids research 1988, 16(18):8925-8944. 79. Nielsen JA, Maric D, Lau P, Barker JL, Hudson LD: Identification of a novel oligodendrocyte cell adhesion protein using gene expression profiling. J Neurosci 2006, 26(39):9881-9891. 80. Li CM, Yan RT, Wang SZ: Misexpression of a bHLH gene, cNSCL1, results in abnormal brain development. Dev Dyn 1999, 215(3):238-247. 81. Sawada A, Kiyonari H, Ukita K, Nishioka N, Imuta Y, Sasaki H: Redundant roles of Tead1 and Tead2 in notochord development and the regulation of cell proliferation and survival. Molecular and cellular biology 2008, 28(10):3177-3189. 82. Bird A, Taggart M, Frommer M, Miller OJ, Macleod D: A fraction of the mouse genome that is derived from islands of nonmethylated, CpG-rich DNA. Cell 1985, 40(1):91-99. 83. Ioshikhes IP, Zhang MQ: Large-scale human promoter mapping using CpG islands. Nature genetics 2000, 26(1):61-63. 84. 
Taylor J, Tyekucheva S, King DC, Hardison RC, Miller W, Chiaromonte F: ESPERR: learning strong and weak signals in genomic sequence alignments to identify functional elements. Genome research 2006, 16(12):1596-1604. 225 85. Ohler U, Niemann H, Liao G, Rubin GM: Joint modeling of DNA sequence and physical properties to improve eukaryotic promoter recognition. Bioinformatics (Oxford, England) 2001, 17 Suppl 1:S199-206. 86. Gabrielian AE, Landsman D, Bolshoy A: Curved DNA in promoter sequences. In silico biology 1999, 1(4):183-196. 87. Abnizova I, te Boekhorst R, Walter K, Gilks WR: Some statistical properties of regulatory DNA sequences, and their use in predicting regulatory regions in the Drosophila genome: the fluffy-tail test. BMC bioinformatics 2005, 6:109. 88. Pierstorff N, Bergman CM, Wiehe T: Identifying cis-regulatory modules by combining comparative and compositional analysis of DNA. Bioinformatics (Oxford, England) 2006, 22(23):2858-2864. 89. Fickett JW: Coordinate positioning of MEF2 and myogenin binding sites. Gene 1996, 172(1):GC19-32. 90. Dunn TM, Hahn S, Ogden S, Schleif RF: An operator at -280 base pairs that is required for repression of araBAD operon promoter: addition of DNA helical turns between the operator and promoter cyclically hinders repression. Proceedings of the National Academy of Sciences of the United States of America 1984, 81(16):5017-5020. 91. Visel A, Minovitsky S, Dubchak I, Pennacchio LA: VISTA Enhancer Browser-- a database of tissue-specific human enhancers. Nucleic acids research 2007, 35(Database issue):D88-92. 92. Pleiades Promoter Project [http://www.pleiades.org/] 93. Gailus-Durner V, Scherf M, Werner T: Experimental data of a single promoter can be used for in silico detection of genes with related regulation in the absence of sequence similarity. Mamm Genome 2001, 12(1):67-72. 94. 
Halfon MS, Grad Y, Church GM, Michelson AM: Computation-based discovery of related transcriptional regulatory modules and motifs using an experimentally validated combinatorial model. Genome research 2002, 12(7):1019-1028. 95. Svaren J, Meijer D: The molecular machinery of myelin gene transcription in Schwann cells. Glia 2008, 56(14):1541-1551. 96. Farhadi HF, Peterson AC: The myelin basic protein gene: a prototype for combinatorial mammalian transcriptional regulation. Advances in Neurology 2006, 98:65-76. 97. Kuhn RM, Karolchik D, Zweig AS, Wang T, Smith KE, Rosenbloom KR, Rhead B, Raney BJ, Pohl A, Pheasant M, Meyer L, Hsu F, Hinrichs AS, Harte RA, Giardine B, Fujita P, Diekhans M, Dreszer T, Clawson H, Barber GP, Haussler D, Kent WJ: The UCSC Genome Browser Database: update 2009. Nucleic acids research 2009, 37(Database issue):D755-761. 98. Rozen S, Skaletsky HJ: Primer3 on the WWW for general users and for biologist programmers. In: Bioinformatics Methods and Protocols: Methods in Molecular Biology. Edited by Krawetz S, Misener S. Totowa, NJ.: Humana Press; 2000: 365-386. 99. Bronson SK, Plaehn EG, Kluckman KD, Hagaman JR, Maeda N, Smithies O: Single-copy transgenic mice with chosen-site integration. In: Proc Natl Acad Sci USA. vol. 93; 1996: 9067-9072. 100. R Development Core Team (2007). R: A language and environment for 226 statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0. [ http://www.R-project.org] 101. Gentleman RC, Carey VJ, Bates DM, Bolstad B, Dettling M, Dudoit S, Ellis B, Gautier L, Ge Y, Gentry J, Hornik K, Hothorn T, Huber W, Iacus S, Irizarry R, Leisch F, Li C, Maechler M, Rossini AJ, Sawitzki G, Smith C, Smyth G, Tierney L, Yang JY, Zhang J: Bioconductor: open software development for computational biology and bioinformatics. Genome biology 2004, 5(10):R80. 102. 
Bolstad BM, Irizarry RA, Astrand M, Speed TP: A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics (Oxford, England) 2003, 19(2):185-193. 103. Irizarry RA, Hobbs B, Collin F, Beazer-Barclay YD, Antonellis KJ, Scherf U, Speed TP: Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics (Oxford, England) 2003, 4(2):249-264. 104. Irizarry RA, Bolstad BM, Collin F, Cope LM, Hobbs B, Speed TP: Summaries of Affymetrix GeneChip probe level data. Nucleic acids research 2003, 31(4):e15. 105. Sayers EW, Barrett T, Benson DA, Bryant SH, Canese K, Chetvernin V, Church DM, DiCuccio M, Edgar R, Federhen S, Feolo M, Geer LY, Helmberg W, Kapustin Y, Landsman D, Lipman DJ, Madden TL, Maglott DR, Miller V, Mizrachi I, Ostell J, Pruitt KD, Schuler GD, Sequeira E, Sherry ST, Shumway M, Sirotkin K, Souvorov A, Starchenko G, Tatusova TA et al: Database resources of the National Center for Biotechnology Information. Nucleic acids research 2009, 37(Database issue):D5-15. 106. Wright GW, Simon RM: A random variance model for detection of differential gene expression in small microarray experiments. Bioinformatics (Oxford, England) 2003, 19(18):2448-2455. 107. Carbon S, Ireland A, Mungall CJ, Shu S, Marshall B, Lewis S: AmiGO: online access to ontology and annotation data. Bioinformatics (Oxford, England) 2009, 25(2):288-289. 108. Bult CJ, Eppig JT, Kadin JA, Richardson JE, Blake JA: The Mouse Genome Database (MGD): mouse biology and model systems. Nucleic acids research 2008, 36(Database issue):D724-728. 109. Twigger SN, Shimoyama M, Bromberg S, Kwitek AE, Jacob HJ: The Rat Genome Database, update 2007--easing the path from disease to data and back again. Nucleic acids research 2007, 35(Database issue):D658-662. 110. 
Miller W, Rosenbloom K, Hardison RC, Hou M, Taylor J, Raney B, Burhans R, King DC, Baertsch R, Blankenberg D, Kosakovsky Pond SL, Nekrutenko A, Giardine B, Harris RS, Tyekucheva S, Diekhans M, Pringle TH, Murphy WJ, Lesk A, Weinstock GM, Lindblad-Toh K, Gibbs RA, Lander ES, Siepel A, Haussler D, Kent WJ: 28-way vertebrate alignment and conservation track in the UCSC Genome Browser. Genome research 2007, 17(12):1797-1808. 111. Portales-Casamar E, Kirov S, Lim J, Lithwick S, Swanson MI, Ticoll A, Snoddy J, Wasserman WW: PAZAR: a framework for collection and dissemination of cis-regulatory sequence annotation. Genome biology 2007, 8(10):R207. 112. Kuhn RM, Karolchik D, Zweig AS, Trumbower H, Thomas DJ, Thakkapallayil A, Sugnet CW, Stanke M, Smith KE, Siepel A, Rosenbloom KR, Rhead B, Raney BJ, Pohl A, Pedersen JS, Hsu F, Hinrichs AS, Harte RA, Diekhans M, Clawson 227 H, Bejerano G, Barber GP, Baertsch R, Haussler D, Kent WJ: The UCSC genome browser database: update 2007. Nucleic acids research 2007, 35(Database issue):D668-673. 113. Goldovsky L, Cases I, Enright AJ, Ouzounis CA: BioLayout(Java): versatile network visualisation of structural and functional relationships. Applied bioinformatics 2005, 4(1):71-74. 114. Barrett T, Troup DB, Wilhite SE, Ledoux P, Rudnev D, Evangelista C, Kim IF, Soboleva A, Tomashevsky M, Marshall KA, Phillippy KH, Sherman PM, Muertter RN, Edgar R: NCBI GEO: archive for high-throughput functional genomic data. Nucleic acids research 2009, 37(Database issue):D885-890. 228 7. Discussion and Conclusions 7.1. Summary This thesis describes the development of algorithms, tools, and novel methods for the prediction of DNA-binding TFs and TF cooperativity relationships that direct oligodendrocyte myelinogenesis. I described the development of an algorithm for detection of CRMs in sets of co-expressed genes. A comprehensive mouse-human TF catalog was developed and integrated with regulatory sequence analyses. 
This inventory provided essential information for the identification of TFs in gene expression profiling analyses and for the TF binding site class evaluations documented in this thesis. Finally, a novel method was described, which integrated the CRM detection algorithm and the TF catalog inventory, to infer co-regulation of myelin-associated proteins by cooperating TFs during oligodendrocyte myelination. This final chapter discusses assessments and conclusions drawn from the work and proposes opportunities for future research.

7.2. Gene regulation analyses in humans and mice

Only a small portion (1.5%) of the human genome codes for proteins [1]. Intense efforts towards deciphering the function of non-coding DNA have been a hallmark of research in the past decade. Towards this effort, massive amounts of data have been compiled, including gene expression profiles, regulatory sequence regions, TF DNA binding sites, and chromatin modifications. Computational approaches for the identification of discriminatory features are now plentiful. Transcriptional regulatory element detection forms a specific class of algorithms that are designed to detect TF DNA-binding motifs in non-coding sequence.

In Chapter 2, I described the development and validation of a CRM regulatory element detection algorithm, called CSA, which detects over-represented combinations of TFBS in non-coding regions of a set of co-expressed genes. The algorithm assesses predicted TFBS combinations (defined by PWM models) in well-conserved human-mouse non-coding sequences and uses a large background dataset to determine whether those combinations are over-represented in the promoters of a co-expressed set of genes. We showed that the algorithm successfully recovers many known CRMs in tissue-specific reference collections. Since the original development, new opportunities have been created by the availability of richer data sources.

7.2.1. Computational TF binding site detection

New high-throughput technologies have enabled the characterization of numerous TFBS patterns. A recent large-scale TF binding study produced a ranked list of full-affinity binding site preferences for over 100 TFs. Over half of these TFs demonstrated high-affinity primary and secondary DNA binding site preferences, and a portion of the binding sites display up to tri-nucleotide positional interdependence (footnote 16) [2]. The findings indicate that many PWM models fail to represent the breadth of TF-sequence affinity information. The probabilistic format of PWM models is unsuitable for representation of two high-affinity (primary and secondary) binding preferences. A simple solution may be to build multiple PWMs for a TF and evaluate them individually. We implemented such an approach for NKX2-5, where multiple studies demonstrated slightly different binding affinities.

Footnote 16: Positional interdependence suggests that the co-occurrence of nucleotides in specific positions is non-random.

Alternatively, rather than evaluating multiple PWMs for a TF over a sequence individually, multiple PWMs may be evaluated simultaneously. A recent analysis by Hannenhalli and Wang explored this approach, referred to as mixture modeling, which uses Expectation Maximization (EM) to facilitate concurrent evaluation of an arbitrary number of PWMs [3]. Another approach that could accommodate detection of a list of full-affinity TFBS would be to perform sequence text matching using each individual TFBS motif and/or each cluster of similar TFBS motifs represented by a consensus model (see Chapter 1). Models that incorporate pairwise nucleotide interdependence have recently been developed [4]; these could be expanded to accommodate higher levels of positional nucleotide dependence. Improved TFBS detection would be expected to advance the predictive capacity of CRM detection algorithms.
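The "multiple PWMs per TF, evaluated individually" idea above can be illustrated with a minimal sketch. It is not the implementation used for NKX2-5 in this thesis. The two toy motifs ("primary" AAGTG and "secondary" TGACA), the pseudocount, the uniform background, and all function names are hypothetical choices made only for illustration. Each window of the sequence is scored against every matrix, and the best-scoring matrix is reported for that position.

```python
import math

BASES = "ACGT"


def log_odds_matrix(counts, background=0.25, pseudocount=1.0):
    """Convert per-position base counts into log2-odds scores (a PWM)."""
    matrix = []
    for column in counts:
        total = sum(column[b] for b in BASES) + 4 * pseudocount
        matrix.append({
            b: math.log2(((column[b] + pseudocount) / total) / background)
            for b in BASES
        })
    return matrix


def score_window(matrix, window):
    """Sum the per-position log-odds scores for one sequence window."""
    return sum(col[base] for col, base in zip(matrix, window))


def best_hits(sequence, matrices):
    """For each start position, keep the best-scoring PWM among those that fit."""
    hits = []
    for start in range(len(sequence)):
        best = None
        for name, matrix in matrices.items():
            window = sequence[start:start + len(matrix)]
            if len(window) < len(matrix):
                continue  # this PWM would run off the end of the sequence
            score = score_window(matrix, window)
            if best is None or score > best[1]:
                best = (name, score)
        if best is not None:
            hits.append((start, best[0], best[1]))
    return hits


def counts_from_consensus(consensus, n=10):
    """Hypothetical toy count matrix: n observations of each consensus base."""
    return [{b: (n if b == c else 0) for b in BASES} for c in consensus]


# Two invented binding preferences for a single hypothetical TF.
TOY_MATRICES = {
    "primary": log_odds_matrix(counts_from_consensus("AAGTG")),
    "secondary": log_odds_matrix(counts_from_consensus("TGACA")),
}
```

Scoring each matrix separately keeps the primary and secondary preferences distinct, whereas a single PWM built from both sets of sites would blend them; the mixture-model approach cited above instead evaluates the matrices jointly.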
The use of multi-species sequence conservation analyses to identify functional non-coding sequences has proven fruitful [5-7]. Our initial CSA implementation focused on TFBS predicted in highly conserved human-mouse pairwise alignments (above 70% identity) [8]. This approach enabled the recovery of known CRMs from tissue- and cell-specific reference collections [9] (Chapter 2). The challenge of aligning orthologous non-coding regions is exacerbated by multiple gene isoforms and transcription start sites (TSS). The human-mouse pairwise alignments in the oPOSSUM database were subsequently improved through the identification of orthologous alternate TSS to define multiple gene promoter regions (Chapter 3). In conjunction, the CSA algorithm was adapted to statistically evaluate clusters of TFBS in regions spanning each alternative TSS of a gene, which provided notable improvements in the CRM discovery validation results (Chapter 3). While the utility of using highly conserved orthologous sequence regions in regulatory analyses is well-established, only a portion of regulatory elements are found in evolutionarily constrained sequences [10, 11]. To increase the sensitivity of our CRM analyses (Chapter 6), we predicted TFBS in ~15,000 aligned non-coding sequences (extracted from the MULTIZ 28-way chromosome datasets [12, 13]) without further imposition of sequence constraint criteria. The capacity to align multi-species sequences implies a degree of sequence conservation. However, multi-species sequences found in common by alignment algorithms need not be constrained [14]. Importantly, a high number of overlapping TFBS instances satisfy PWM models in unconstrained non-coding sequences that, at the lower bounds, can possess sequence identities of around 40% in pairwise human-mouse alignments (i.e. a level of nucleotide identity commonly observed in alignments of unrelated sequences).
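To make the identity cutoffs discussed above concrete, the following sketch computes percent identity over sliding windows of a pairwise alignment and keeps the windows that meet a threshold. It illustrates the general idea only and is not the oPOSSUM or CSA code. The function names, the window size, and the choice to ignore gapped columns (rather than count them as mismatches) are assumptions.

```python
def percent_identity(aln_a, aln_b):
    """Percent of ungapped alignment columns in which both sequences agree."""
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(aln_a.upper(), aln_b.upper()):
        if a == "-" or b == "-":
            continue  # assumption: columns containing a gap are skipped
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def conserved_windows(aln_a, aln_b, window=10, min_identity=70.0):
    """Return (start, identity) for each alignment window meeting the cutoff."""
    selected = []
    for start in range(0, len(aln_a) - window + 1):
        identity = percent_identity(
            aln_a[start:start + window], aln_b[start:start + window]
        )
        if identity >= min_identity:
            selected.append((start, identity))
    return selected
```

With `min_identity=70.0` only well-conserved windows are retained, which mirrors the original conservation filter; lowering or removing the cutoff admits weakly conserved aligned sequence, at the cost of more candidate TFBS to evaluate.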
Validation of this revised CSA approach demonstrated a marked improvement in the ordered ranking of known TFBS combinations, when using either a full background comparison or a random sample background approach. However, despite the increase in sensitivity that this revised approach offers, regulatory elements residing in human or mouse sequences that cannot be aligned will be missed. The incorporation of TF protein-DNA bound regions from ChIP studies, which can indicate potential regulatory regions regardless of sequence conservation [15], could be used to bound the TFBS element detection analyses.

7.3. TF inventories for gene regulatory analyses

In Chapter 4 of this thesis, I described the development of a comprehensive mouse-human transcription factor catalog (TFCat), a key resource in subsequent gene regulatory analyses. The catalog enabled the identification of TF gene expression observed in a multi-timepoint in vitro oligodendrocyte dataset (Chapter 5) and facilitated the detection of DNA-binding and accessory TFs that were commonly or uniquely expressed in two independently-derived in vivo oligodendrocyte expression datasets (Chapter 6). I also incorporated the structurally classified subset of DNA-binding TFs and evaluated homologous families in the TF class analyses described in Chapter 6.

The emergence of MediaWiki software [16] has facilitated the creation of numerous resources that allow entry and display of biological information. We published the TFCat data on a modified version of the MediaWiki software, which offers a combination of built-in searching capabilities (with incorporation of Semantic Wiki) and flexible display formats. The TFCat resource is anchored on mouse TF Entrez Gene IDs [17]; however, about half of the annotations were derived from human-based regulatory studies for the orthologous human TF. Curation identified a small set of mouse TF genes that do not have an apparent human ortholog.
Similarly, TF gene divergence has occurred in human TF families, largely through gene duplication and/or loss [18, 19]. Short-term future developments include the expansion of the TFCat resource with human TFs that have no detectable mouse ortholog and additional curation of DNA-binding TFs, with expansion of the DNA-binding structural classification system. The TFCat backend annotation tool currently supports curation of an unlimited number of species-specific genes. Long-term objectives include the annotation of TFs for other key model organisms.

In addition to serving an essential role in my thesis research, the TFCat collection has enabled other studies. It was used to identify the transcriptional regulators in a muscle-based expression profiling study, which, through subsequent validation using small hairpin RNA (shRNA; see footnote 17) knockdown experiments, highlighted many TFs that were not previously associated with myogenesis (Sundararajan et al. 2009, manuscript under submission). The TFCat data has been incorporated in the development of a DNA-binding TF knowledge-base called TFencyclopedia (TFe) (Yusuf et al. 2009, manuscript in preparation), which facilitates user-friendly data entry of expert knowledge and automated assembly of relevant information from other high-quality resources to present the full body of knowledge for each TF in a wiki-like framework. Work is currently underway to adopt the TFCat structural ontology into the PAZAR regulatory sequence annotation system [20] and the classification system is being incorporated into the 2010 release of the popular JASPAR database [21].

7.4. Detecting differential gene expression for prediction of TF gene co-regulation

In Chapter 5 and Chapter 6, I described the use of gene expression data to identify differentially co-expressed genes and TFs across multiple timepoints and transitions between oligodendrocyte developmental stages.
We identified cohesively expressed sets of genes and active TF subsets in each of the analyses. Low expression levels of TF genes can make their detection in microarray analyses problematic and, therefore, it is expected that some expressed TFs will have gone undetected.

The hypothesis that highly co-expressed genes will share common regulatory elements specifies a targeted approach for regulatory data analyses. Studies have demonstrated that the co-expression/co-regulation relationship is well-founded in yeast [22]. This hypothesis also holds in tissue-specific co-expression of more complex multi-cellular organisms: for example, in muscle [23] and liver [24] tissues. A recently published study in fly confirms that genes co-expressed in the same annotated Gene Ontology (GO) biological process are more likely to be enriched for common TFBS, while there is no increased likelihood that genes co-expressed in whole organism expression data share common TFBS [25]. These findings are consistent with strategies that incorporate spatial and/or cell-specific co-expression profiles to infer regulatory sequences in metazoan species.

Footnote 17: A short hairpin RNA is an RNA sequence that forms a tight hairpin turn, which can be used to silence gene expression using RNA interference.

One of the key objectives of the study described in Chapter 6 was to predict a set of CRMs that direct myelinogenesis in oligodendrocytes using a refined set of co-expressed genes. We showed that the addition of substantial noise to a reference collection of co-regulated genes (i.e. adding genes with no evidence of shared TFBS co-regulation) reduces the positional ranking of known CRMs, making the results more difficult to interpret. Oligodendrocytes elaborate myelin in different tissue-types in the brain.
We defined a cohesive oligodendrocyte-specific set of co-expressed genes for CRM analyses by combining the co-expression results for two tissue-specific oligodendrocyte expression datasets. The overlap of these differentially expressed sets established a unique cell-specific set of core genes that are highly co-expressed during myelination in the CNS. While the co-expression overlap set proved to be over-represented in genes annotated to the structural constituent of myelin sheath GO node, the dataset also included genes involved in cell maintenance and growth. The latter class of genes is less likely to be co-regulated with the myelin protein-specific cohort. As such, we extracted a focused subset of the TF network for evaluation. Notably, the sub-network of TFs predicted to regulate the co-expressed myelin-associated proteins highlighted many TFs that have known involvement in the myelinogenesis process and established new links for TFs not previously associated with oligodendrocyte myelination. While the TF regulatory network provides valuable insight into the combinatorial mechanisms that may be responsible for control of myelin protein transcription, it is not possible to delineate a priori the CRMs that are engaged for specific physiological response pathways. Importantly, the CRM predictions provide a refined hypothesis space to guide detailed experimental TFBS validation and elucidation of specific biological context requirements.

7.5. Functional validation of regulatory sequences

In Chapter 5 and Chapter 6, I described studies that incorporated in vivo mouse transgenesis assays to validate the regulatory activity of putative enhancer sequences. Several tested putative regulatory sequences activated reporter gene expression in mice, referred to as ‘positives’, and some displayed no detectable reporter gene expression, which we classified as ‘negatives’.
We used these positive and negative classifications to weight the TFBS predicted in each sequence, based on the underlying assumption that TFBS predicted in the negative sequences are not relevant for the assessed tissue (spatial) or developmental (temporal) state. However, there are several inherent challenges in using functional assays for gene regulatory sequence validation that must be considered.

Firstly, a putative regulatory sequence is integrated into a genomic context that may possess characteristics that influence reporter gene expression. This is particularly true for transgene constructs that are randomly integrated into genomic regions, which mandates evaluation of a sample of biological replicates to assess transgene expression activity. Constructs docked in a consistent, single location can also be influenced by interactions between the exogenous enhancer sequence and the endogenous local environment. For example, reporter constructs that do not carry enhancer sequences, when docked at the ubiquitously expressed Hprt1 locus in mice, demonstrate reporter expression in heart, large blood vessels, and neuronal populations [26]. Notably, reporter transgenes that include enhancer sequences largely attenuate Hprt1-directed expression [27]. However, interpretation of enhancer-directed spatial (cellular) activity, which overlaps with endogenous Hprt1-targeted cellular expression, is challenging.

Secondly, a regulatory sequence in its endogenous environment may: 1) have a different chromatin state; 2) require interactions with other regulatory sequences for its activity; and 3) require regulatory proteins whose expression is enabled or disabled in a specific temporal and/or spatial and/or physiological state.
Therefore, caution must be exercised in classifying a putative regulatory sequence that lacks the capability to activate a reporter gene in a functional assay as ‘non-functional’ or ‘negative’ and, accordingly, the predicted TFBS signatures identified in these sequences cannot be entirely discounted. Application of fine-grained experimental analyses over a range of temporal and spatial states can elucidate regulatory sequence functionality. Additionally, integration of information that captures endogenous cell-state characteristics, such as chromatin modifications and TF protein-DNA accessibility conditions, can improve the predictive capacity of regulatory sequence element detection, as discussed below.

7.6. Incorporating high throughput epigenomic data and detailed experimental analyses in models of gene regulation

Gene transcription is regulated by combined contributions from the core promoter, proximal and distal regulatory elements, DNA-binding TFs, accessory TFs and the underlying chromatin structure. The term epigenome is often used to describe the state of chromatin modifications in a cell type, which include both DNA methylation and histone modifications. Techniques for high throughput analyses of genome-wide chromatin modifications are creating opportunities to explore tissue-specific epigenomes and how they impact gene expression. Methylation detection is combined with microarray [28], bead array [29], and sequencing [30] technologies to identify regions of DNA that are protected by 5-methylcytosines, a modification which inhibits TF binding and gene activation. Histone modifications and histone variants, associated with transcriptional repression and activation, can be detected using specific antibodies with ChIP-chip [31], ChIP-SAGE [32], and ChIP-seq [33] technologies.
In addition, reagents such as DNase or MNase, combined with microarrays and high-throughput sequencing, can identify nucleosome positions [34-36] and chromatin accessibility [37, 38].

Epigenomic state data can provide important tissue-specific and/or cellular context for CRM analyses predictions. Specifically, the integration of genome-wide epigenomic data with CRM analysis algorithms will increasingly target regulatory element predictions to DNA sequences that are in transcriptionally-active states. Such approaches promise improved CRM predictions in tissue-specific gene regulatory analyses. Using the initial laboratory data for training, sequence analysis algorithms are being developed to predict chromatin characteristics such as nucleosome positioning [39] and chromatin modifications (for example, methylation state [40, 41]). Recent nucleosome positioning models derived from single cell in vitro yeast data suggest that sequence-based prediction of nucleosome occupancy compares well with in vivo data derived from worm. However, this study also highlighted differences between the datasets, which corroborate reports that additional factors influence nucleosome density [42, 43]. A recent study directly challenges the concept of a DNA sequence-based nucleosome positioning modality and suggests that a sequence-based model may not be appropriate [44]. Further data describing chromatin-state properties will precipitate the development of models that could be combined with CRM analyses to guide gene regulatory predictions.

The mechanisms of gene regulation are emerging. We have significantly expanded our understanding of physical TF-DNA structural interactions (e.g. see DNA-binding structures in PDB [45]) and TF binding affinities [2].
Libraries of regulatory regions are being compiled [7] and profiles of genome-wide TF-DNA interactions for tissue-specific states are being collected (for example, the Foxa2 study described here [46] and the recent P300 binding study [15]). While several complex cis-regulatory systems have been elucidated in multicellular organisms [47, 48], further illumination of gene regulatory CRM mechanisms is necessary to improve our understanding of gene regulation modalities and/or archetypes in more complex eukaryotes.

Experimental technologies that enable non-invasive, real-time monitoring of in vivo gene expression, such as magnetic resonance (MR) reporter genes, will enable fine-grained investigations of temporal and spatial parameters of gene regulatory mechanisms. While several MR reporter gene strategies have been developed over the last 10 years, current approaches are limited by cellular physiology (for example, in the case of ferritin reporters, there may be cell-specific rates of iron storage) and/or low MR reporter gene specificity [49]. Improvements in in vivo reporter gene technologies that do not require post-mortem evaluation will expand the resolution and capacity of transgene expression studies.

Detailed experimental investigations will be facilitated by computationally-derived hypotheses guided by current biological knowledge (Figure 7.1). The elucidation of mechanistic models in higher eukaryotes is a necessary step towards understanding how disruption of cis-regulatory control produces human pathologies and in establishing strategies that could prevent the onset of disease phenotypes and/or initiate regenerative processes [50]. The advancement of in silico gene regulatory prediction methods that leverage existing biological knowledge and data can provide refined, targeted hypotheses for in vitro and in vivo experimental discovery.

Figure 7.1.
Systematic integration of biological data and computational analyses to decipher gene regulatory mechanisms
[Figure 7.1 labels: New/refined cell/tissue physiological data; New/refined temporal data; New/refined spatial data; New/refined hypotheses; Expanded models; Prioritized experimental targets]

7.7. References

1. Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, Baldwin J, Devon K, Dewar K, Doyle M, FitzHugh W, Funke R, Gage D, Harris K, Heaford A, Howland J, Kann L, Lehoczky J, LeVine R, McEwan P, McKernan K, Meldrim J, Mesirov JP, Miranda C, Morris W, Naylor J, Raymond C, Rosetti M, Santos R, Sheridan A, Sougnez C et al: Initial sequencing and analysis of the human genome. Nature 2001, 409(6822):860-921.
2. Badis G, Berger MF, Philippakis AA, Talukder S, Gehrke AR, Jaeger SA, Chan ET, Metzler G, Vedenko A, Chen X, Kuznetsov H, Wang CF, Coburn D, Newburger DE, Morris Q, Hughes TR, Bulyk ML: Diversity and complexity in DNA recognition by transcription factors. Science 2009, 324:1720-1723.
3. Hannenhalli S: Enhanced position weight matrices using mixture models. Bioinformatics 2005, 21:i204-i212.
4. Osada R: Comparative analysis of methods for representing and searching for transcription factor binding sites. Bioinformatics 2004, 20:3516-3525.
5. Venkatesh B, Kirkness EF, Loh YH, Halpern AL, Lee AP, Johnson J, Dandona N, Viswanathan LD, Tay A, Venter JC, Strausberg RL, Brenner S: Ancient noncoding elements conserved in the human genome. Science 2006, 314(5807):1892.
6. Prabhakar S, Poulin F, Shoukry M, Afzal V, Rubin EM, Couronne O, Pennacchio LA: Close sequence comparisons are sufficient to identify human cis-regulatory elements. Genome Res 2006, 16(7):855-863.
7. Pennacchio LA, Ahituv N, Moses AM, Prabhakar S, Nobrega MA, Shoukry M, Minovitsky S, Dubchak I, Holt A, Lewis KD, Plajzer-Frick I, Akiyama J, De Val S, Afzal V, Black BL, Couronne O, Eisen MB, Visel A, Rubin EM: In vivo enhancer analysis of human conserved non-coding sequences.
Nature 2006, 444(7118):499-502.
8. Ho Sui SJ, Mortimer JR, Arenillas DJ, Brumm J, Walsh CJ, Kennedy BP, Wasserman WW: oPOSSUM: identification of over-represented transcription factor binding sites in co-expressed genes. Nucleic Acids Res 2005, 33(10):3154-3164.
9. Huang SS, Fulton DL, Arenillas DJ, Perco P, Ho Sui SJ, Mortimer JR, Wasserman WW: Identification of over-represented combinations of transcription factor binding sites in sets of co-expressed genes. In: Series on Advances in Bioinformatics and Computational Biology Volume 3 - Proceedings of the 4th Asia-Pacific Bioinformatics Conference: 2006; Taipei, Taiwan: Imperial College Press, London UK; 2006: 247-256.
10. Odom D, Dowell R, Jacobsen E, Gordon W, Danford T, Macisaac K, Rolfe P, Conboy C, Gifford D, Fraenkel E: Tissue-specific transcriptional regulation has diverged significantly between human and mouse. Nat Genet 2007, 39:730-732.
11. McGaughey DM, Vinton RM, Huynh J, Al-Saif A, Beer MA, McCallion AS: Metrics of sequence constraint overlook regulatory sequences in an exhaustive analysis at phox2b. Genome Res 2008, 18(2):252-260.
12. Kuhn RM, Karolchik D, Zweig AS, Wang T, Smith KE, Rosenbloom KR, Rhead B, Raney BJ, Pohl A, Pheasant M, Meyer L, Hsu F, Hinrichs AS, Harte RA, Giardine B, Fujita P, Diekhans M, Dreszer T, Clawson H, Barber GP, Haussler D, Kent WJ: The UCSC Genome Browser Database: update 2009. Nucleic Acids Res 2009, 37(Database issue):D755-761.
13. Miller W, Rosenbloom K, Hardison RC, Hou M, Taylor J, Raney B, Burhans R, King DC, Baertsch R, Blankenberg D, Kosakovsky Pond SL, Nekrutenko A, Giardine B, Harris RS, Tyekucheva S, Diekhans M, Pringle TH, Murphy WJ, Lesk A, Weinstock GM, Lindblad-Toh K, Gibbs RA, Lander ES, Siepel A, Haussler D, Kent WJ: 28-way vertebrate alignment and conservation track in the UCSC Genome Browser. Genome Res 2007, 17(12):1797-1808.
14.
King DC, Taylor J, Zhang Y, Cheng Y, Lawson HA, Martin J, ENCODE groups for Transcriptional Regulation and Multispecies Sequence Analysis, Chiaromonte F, Miller W, Hardison RC: Finding cis-regulatory elements using comparative genomics: some lessons from ENCODE data. Genome Res 2007, 17:775-786.
15. Visel A, Blow MJ, Li Z, Zhang T, Akiyama JA, Holt A, Plajzer-Frick I, Shoukry M, Wright C, Chen F, Afzal V, Ren B, Rubin EM, Pennacchio LA: ChIP-seq accurately predicts tissue-specific activity of enhancers. Nature 2009, 457(7231):854-858.
16. Mediawiki [http://www.mediawiki.org/wiki/MediaWiki]
17. Maglott D, Ostell J, Pruitt KD, Tatusova T: Entrez Gene: gene-centered information at NCBI. Nucleic Acids Res 2007, 35(Database issue):D26-31.
18. Shannon M, Hamilton AT, Gordon L, Branscomb E, Stubbs L: Differential expansion of zinc-finger transcription factor loci in homologous human and mouse gene clusters. Genome Res 2003, 13:1097-1110.
19. Zheng X, Wang Y, Yao Q, Yang Z, Chen K: A genome-wide survey on basic helix-loop-helix transcription factors in rat and mouse. Mamm Genome 2009, 20:236-246.
20. Portales-Casamar E, Kirov S, Lim J, Lithwick S, Swanson MI, Ticoll A, Snoddy J, Wasserman WW: PAZAR: a framework for collection and dissemination of cis-regulatory sequence annotation. Genome Biol 2007, 8(10):R207.
21. Sandelin A, Alkema W, Engstrom P, Wasserman WW, Lenhard B: JASPAR: an open-access database for eukaryotic transcription factor binding profiles. Nucleic acids research 2004, 32(Database issue):D91-94.
22. Allocco DJ, Kohane IS, Butte AJ: Quantifying the relationship between co-expression, co-regulation and gene function. BMC bioinformatics 2004, 5:18.
23. Wasserman WW, Fickett JW: Identification of regulatory regions which confer muscle-specific gene expression. J Mol Biol 1998, 278(1):167-181.
24. Krivan W, Wasserman WW: A predictive model for regulatory sequences directing liver-specific transcription. Genome Res 2001, 11(9):1559-1566.
25.
Marco A, Konikoff C, Karr TL, Kumar S: Relationship between gene co-expression and sharing of transcription factor binding sites in Drosophila melanogaster. Bioinformatics 2009.
26. Denarier E, Forghani R, Farhadi HF, Dib S, Dionne N, Friedman HC, Lepage P, Hudson TJ, Drouin R, Peterson A: Functional organization of a Schwann cell enhancer. J Neurosci 2005, 25(48):11210-11217.
27. Tuason MC, Rastikerdar A, Kuhlmann T, Goujet-Zalc C, Zalc B, Dib S, Friedman H, Peterson A: Separate proteolipid protein/DM20 enhancers serve different lineages and stages of development. J Neurosci 2008, 28(27):6895-6903.
28. Gitan RS, Shi H, Chen CM, Yan PS, Huang TH: Methylation-specific oligonucleotide microarray: a new potential for high-throughput methylation analysis. Genome Res 2002, 12(1):158-164.
29. Bibikova M, Lin Z, Zhou L, Chudin E, Garcia EW, Wu B, Doucet D, Thomas NJ, Wang Y, Vollmer E, Goldmann T, Seifart C, Jiang W, Barker DL, Chee MS, Floros J, Fan JB: High-throughput DNA methylation profiling using universal bead arrays. Genome Res 2006, 16(3):383-393.
30. Dupont JM, Tost J, Jammes H, Gut IG: De novo quantitative bisulfite sequencing using the pyrosequencing technology. Analytical biochemistry 2004, 333(1):119-127.
31. Schubeler D, MacAlpine DM, Scalzo D, Wirbelauer C, Kooperberg C, van Leeuwen F, Gottschling DE, O'Neill LP, Turner BM, Delrow J, Bell SP, Groudine M: The histone modification pattern of active genes revealed through genome-wide chromatin analysis of a higher eukaryote. Genes Dev 2004, 18(11):1263-1271.
32. Roh TY, Ngau WC, Cui K, Landsman D, Zhao K: High-resolution genome-wide mapping of histone modifications. Nature biotechnology 2004, 22(8):1013-1016.
33. Barski A, Cuddapah S, Cui K, Roh TY, Schones DE, Wang Z, Wei G, Chepelev I, Zhao K: High-resolution profiling of histone methylations in the human genome. Cell 2007, 129(4):823-837.
34. Ozsolak F, Song JS, Liu XS, Fisher DE: High-throughput mapping of the chromatin structure of human promoters.
Nature biotechnology 2007, 25(2):244-248.
35. Yuan GC, Liu YJ, Dion MF, Slack MD, Wu LF, Altschuler SJ, Rando OJ: Genome-scale identification of nucleosome positions in S. cerevisiae. Science 2005, 309(5734):626-630.
36. Johnson SM, Tan FJ, McCullough HL, Riordan DP, Fire AZ: Flexibility and constraint in the nucleosome core landscape of Caenorhabditis elegans chromatin. Genome Res 2006, 16(12):1505-1516.
37. Crawford GE, Holt IE, Whittle J, Webb BD, Tai D, Davis S, Margulies EH, Chen Y, Bernat JA, Ginsburg D, Zhou D, Luo S, Vasicek TJ, Daly MJ, Wolfsberg TG, Collins FS: Genome-wide mapping of DNase hypersensitive sites using massively parallel signature sequencing (MPSS). Genome Res 2006, 16(1):123-131.
38. Crawford GE, Davis S, Scacheri PC, Renaud G, Halawi MJ, Erdos MR, Green R, Meltzer PS, Wolfsberg TG, Collins FS: DNase-chip: a high-resolution method to identify DNase I hypersensitive sites using tiled microarrays. Nature methods 2006, 3(7):503-509.
39. Kaplan N, Moore IK, Fondufe-Mittendorf Y, Gossett AJ, Tillo D, Field Y, Leproust EM, Hughes TR, Lieb JD, Widom J, Segal E: The DNA-encoded nucleosome organization of a eukaryotic genome. Nature 2009, 458:362-366.
40. Das R, Dimitrova N, Xuan Z, Rollins RA, Haghighi F, Edwards JR, Ju J, Bestor TH, Zhang MQ: Computational prediction of methylation status in human genomic sequences. Proc Natl Acad Sci USA 2006, 103:10713-10716.
41. Previti C, Harari O, Zwir I, del Val C: Profile analysis and prediction of tissue-specific CpG island methylation classes. BMC bioinformatics 2009, 10:116.
42. Fedor MJ, Lue NF, Kornberg RD: Statistical positioning of nucleosomes by specific protein-binding to an upstream activating sequence in yeast. J Mol Biol 1988, 204(1):109-127.
43. Fu Y, Sinha M, Peterson CL, Weng Z: The insulator binding protein CTCF positions 20 nucleosomes around its binding sites across the human genome. PLoS Genet 2008, 4(7):e1000138.
44.
Zhang Y, Moqtaderi Z, Rattner BP, Euskirchen G, Snyder M, Kadonaga J, Liu XS, Struhl K: Intrinsic histone-DNA interactions are not the major determinant of nucleosome positions in vivo. Nat Struct Mol Biol 2009, 16:847-852.
45. Berman H, Henrick K, Nakamura H: Announcing the worldwide Protein Data Bank. Nature structural biology 2003, 10(12):980.
46. Wederell ED, Bilenky M, Cullum R, Thiessen N, Dagpinar M, Delaney A, Varhol R, Zhao Y, Zeng T, Bernier B, Ingham M, Hirst M, Robertson G, Marra MA, Jones S, Hoodless PA: Global analysis of in vivo Foxa2-binding sites in mouse adult liver using massively parallel sequencing. Nucleic Acids Res 2008, 36(14):4549-4564.
47. Yuh CH, Bolouri H, Davidson EH: Cis-regulatory logic in the endo16 gene: switching from a specification to a differentiation mode of control. Development 2001, 128(5):617-629.
48. Levine M: A systems view of Drosophila segmentation. Genome Biol 2008, 9(2):207.
49. Gilad AA, Ziv K, McMahon MT, van Zijl PC, Neeman M, Bulte JW: MRI reporter genes. J Nucl Med 2008, 49(12):1905-1908.
50. Kleinjan DJ, Coutinho P: Cis-ruption mechanisms: disruption of cis-regulatory control as a cause of human genetic disease. Briefings in functional genomics & proteomics 2009.

8. Appendices

8.1. Appendix 1: supplementary for chapter 4

Appendix 1 contents:
Table S1: provides gene annotation judgment summary counts from the quality assurance assessment process.
Table S2: lists the quality assurance gene pair judgment annotations.
Table S3: lists independent annotations of TFs when the same PubMed evidence was used.
Table S4: lists PFAM and Superfamily group model DNA-binding domain predictions for the annotated TF genes.
Table S5: provides a list of and descriptions for the DNA-binding classification extensions added to the Luscombe et al. classification system.
Table S6: lists the protein class counts of genes predicted to contain multiple instances of the same DNA-Binding domain.
Table S7: lists the DNA-binding TFs predicted to contain two different DNA-Binding classes.
Table S8: lists DNA-binding TFs that do not contain a detected DNA-binding domain.
Table S9: lists DNA-binding TFs with no detected protein family-level domain.
Table S10: lists the annotated single-stranded DNA Binding TFs.
Table S11: summarizes the counts enumerated in the TFCat comparison with GO.
Table S12: provides a summary of the counts enumerated in the comparison of TFCat classified HMM DNA-binding domains with the DBD database (DBDdb).
Table S13: lists Superfamily DNA binding domain comparisons with DBDdb.
Table S14: lists PFAM DNA binding domain comparisons with DBDdb.
Table S15: is a list of the Fox family test set genes.
Table S16: is a list of the Sox family test set genes.
Figure S1: provides a Venn diagram of the overlap of the four initial TF datasets.
Figure S2: provides plots for the analysis of cluster pruning methods using the Fox test set.
Figure S3: provides plots for the analysis of cluster pruning methods using the Sox test set.
Figure S4: is a figure depicting the TFCat annotation workflow implementation.
Figure S5: is a set of screen shots of the backend web-based TFCat annotation tool.
Figure S6: depicts the Sox-containing cluster membership for the evaluated cluster pruning methods.
Figure S7: depicts the Fox-containing cluster membership for the evaluated cluster pruning methods.
Figure S8: provides an example of pruned Fox-containing clusters generated using the I’s only method using a threshold of 0.21.

Appendix 1: Tables

Table S1. Quality assurance gene annotation judgment summary
Judgment | Gene Annotation Pairs
Concordant Judgments | 37
Discordant Judgments | 13
# QA Gene Annotations | 50

Table S2.
Quality assurance gene annotation pair judgments
Gene ID | Description | Reviewer | Judgment
12022 | BarH-like homeobox 1 | Jared Roach | TF Gene Candidate
12022 | BarH-like homeobox 1 | Elodie Portales-Casamar | TF Gene Candidate
12455 | cyclin T1 | Jared Roach | Not a TF
12455 | cyclin T1 | Debra Fulton | TF Gene Candidate
12590 | caudal type homeo box 1 | Sarav Sundararajan | TF Gene
12590 | caudal type homeo box 1 | Rob Sladek | TF Gene
14247 | Friend leukemia integration 1 | Gwenael Breard | TF Gene
14247 | Friend leukemia integration 1 | Elodie Portales-Casamar | TF Gene
15375 | forkhead box A1 | Gwenael Breard | TF Gene
15375 | forkhead box A1 | Rob Sladek | TF Gene
15384 | heterogeneous nuclear ribonucleoprotein A/B | Sarav Sundararajan | TF Gene
15384 | heterogeneous nuclear ribonucleoprotein A/B | Rob Sladek | TF Gene
16917 | LIM homeobox transcription factor 1 beta | Jared Roach | TF Gene
16917 | LIM homeobox transcription factor 1 beta | Wyeth Wasserman | TF Gene Candidate
17300 | forkhead box C1 | Jared Roach | TF Gene
17300 | forkhead box C1 | Elodie Portales-Casamar | TF Gene
17702 | homeo box, msh-like 2 | Wyeth Wasserman | TF Gene
17702 | homeo box, msh-like 2 | Shannan HoSui | TF Gene
18504 | paired box gene 2 | Jared Roach | TF Gene
18504 | paired box gene 2 | Shannan HoSui | TF Gene
18508 | paired box gene 6 | Jared Roach | TF Gene
18508 | paired box gene 6 | Tim Hughes | TF Gene
18854 | promyelocytic leukemia | Jared Roach | TF Gene Candidate
18854 | promyelocytic leukemia | Elodie Portales-Casamar | Not a TF
19013 | peroxisome proliferator activated receptor alpha | Gwenael Breard | TF Gene
19013 | peroxisome proliferator activated receptor alpha | Sarav Sundararajan | TF Gene
19698 | avian reticuloendotheliosis viral (v-rel) oncogene related B | Jared Roach | TF Gene
19698 | avian reticuloendotheliosis viral (v-rel) oncogene related B | Wyeth Wasserman | TF Gene
19708 | D4, zinc and double PHD fingers family 2 | Rob Sladek | Indeterminate
19708 | D4, zinc and double PHD fingers family 2 | Wyeth Wasserman | TF Gene Candidate
20182 | retinoid X receptor beta | Andrew Kwon | TF Gene
20182 | retinoid X receptor beta | David Martin | TF Gene Candidate
20466 | transcriptional regulator, SIN3A (yeast) | Gwenael Breard | TF Gene Candidate
20466 | transcriptional regulator, SIN3A (yeast) | Debra Fulton | TF Gene
20588 | SWI/SNF related, matrix associated, actin dependent regulator of chromatin, subfamily c, member 1 | David Martin | TF Gene Candidate
20588 | SWI/SNF related, matrix associated, actin dependent regulator of chromatin, subfamily c, member 1 | Shannan HoSui | TF Gene
20613 | snail homolog 1 (Drosophila) | Rob Sladek | TF Gene
20613 | snail homolog 1 (Drosophila) | Wyeth Wasserman | TF Gene Candidate
20630 | U1 small nuclear ribonucleoprotein 1C | Jared Roach | Not a TF
20630 | U1 small nuclear ribonucleoprotein 1C | Shannan HoSui | TF Evidence Conflict
20669 | SRY-box containing gene 14 | Wyeth Wasserman | TF Gene Candidate
20669 | SRY-box containing gene 14 | Mark Minie | TF Gene Candidate
20671 | SRY-box containing gene 17 | Wyeth Wasserman | TF Gene Candidate
20671 | SRY-box containing gene 17 | Elodie Portales-Casamar | TF Gene
21339 | TATA box binding protein (Tbp)-associated factor, RNA polymerase I, A | Sarav Sundararajan | Not a TF
21339 | TATA box binding protein (Tbp)-associated factor, RNA polymerase I, A | Debra Fulton | TF Gene
21387 | T-box 4 | Sarav Sundararajan | TF Gene Candidate
21387 | T-box 4 | Rob Sladek | Indeterminate
21416 | transcription factor 7-like 2, T-cell specific, HMG-box | David Martin | TF Gene
21416 | transcription factor 7-like 2, T-cell specific, HMG-box | Andrew Kwon | TF Gene
21685 | thyrotroph embryonic factor | Gwenael Breard | TF Gene
21685 | thyrotroph embryonic factor | Debra Fulton | TF Gene
21833 | thyroid hormone receptor alpha | Jared Roach | TF Gene
21833 | thyroid hormone receptor alpha | Debra Fulton | TF Gene
22032 | Tnf receptor associated factor 4 | Debra Fulton | TF Gene
22032 | Tnf receptor associated factor 4 | David Martin | Indeterminate
22608 | Y box protein 1 | Wyeth Wasserman | TF Gene
22608 | Y box protein 1 | David Martin | TF Gene
22634 | pleiomorphic adenoma gene-like 1 | Jared Roach | TF Gene
22634 | pleiomorphic adenoma gene-like 1 | Debra Fulton | TF Gene
22666 | zinc finger protein 161 | Sarav Sundararajan | TF Gene
22666 | zinc finger protein 161 | Wyeth Wasserman | TF Gene
22718 | zinc finger protein 60 | Rob Sladek | TF Gene Candidate
22718 | zinc finger protein 60 | Tim Hughes | TF Gene Candidate
22722 | zinc finger protein 64 | Wyeth Wasserman | TF Gene Candidate
22722 | zinc finger protein 64 | Elodie Portales-Casamar | Indeterminate
56233 | histone deacetylase 7A | Wyeth Wasserman | TF Gene Candidate
56233 | histone deacetylase 7A | Elodie Portales-Casamar | TF Gene
56309 | c-myc binding protein | Sarav Sundararajan | Probably Not a TF
56309 | c-myc binding protein | Rob Sladek | TF Gene Candidate
65100 | zinc finger protein of the cerebellum 5 | Gwenael Breard | TF Gene Candidate
65100 | zinc finger protein of the cerebellum 5 | Wyeth Wasserman | TF Gene
69792 | mediator of RNA polymerase II transcription, subunit 6 homolog (yeast) | Sarav Sundararajan | Not a TF
69792 | mediator of RNA polymerase II transcription, subunit 6 homolog (yeast) | Andrew Kwon | TF Gene Candidate
71458 | Bcl6 interacting corepressor | Shannan HoSui | TF Gene Candidate
71458 | Bcl6 interacting corepressor | Andrew Kwon | TF Gene Candidate
71950 | Nanog homeobox | Rob Sladek | TF Gene
71950 | Nanog homeobox | Tim Hughes | TF Gene
75901 | MAD homolog 4 interacting transcription coactivator 1 | Sarav Sundararajan | Probably Not a TF
75901 | MAD homolog 4 interacting transcription coactivator 1 | Rob Sladek | TF Gene Candidate
83796 | SWI/SNF related, matrix associated, actin dependent regulator of chromatin, subfamily d, member 2 | Debra Fulton | TF Gene Candidate
83796 | SWI/SNF related, matrix associated, actin dependent regulator of chromatin, subfamily d, member 2 | Wyeth Wasserman | TF Gene Candidate
94222 | oligodendrocyte transcription factor 3 | Rob Sladek | Indeterminate
94222 | oligodendrocyte transcription factor 3 | Tim Hughes | TF Gene Candidate
94353 | high mobility group nucleosomal binding domain 3 | Jared Roach | TF Gene Candidate
94353 | high mobility group nucleosomal binding domain 3 | Rob Sladek | TF Gene
110805 | forkhead box E1 (thyroid transcription factor 2) | Wyeth Wasserman | TF Gene Candidate
110805 | forkhead box E1 (thyroid transcription factor 2) | David Martin | TF Gene Candidate
114142 | forkhead box P2 | Rob Sladek | TF Gene
114142 | forkhead box P2 | Tim Hughes | TF Gene
114741 | suppressor of Ty 16 homolog (S. cerevisiae) | Jared Roach | Not a TF
114741 | suppressor of Ty 16 homolog (S. cerevisiae) | Tim Hughes | TF Gene
209448 | homeo box C10 | Jared Roach | TF Gene
209448 | homeo box C10 | Rob Sladek | TF Gene
216285 | cartilage homeo protein 1 | Rob Sladek | TF Gene
216285 | cartilage homeo protein 1 | Tim Hughes | TF Gene
223922 | activating transcription factor 7 | Gwenael Breard | TF Gene
223922 | activating transcription factor 7 | Sarav Sundararajan | TF Gene
246086 | one cut domain, family member 3 | David Martin | TF Gene Candidate
246086 | one cut domain, family member 3 | Alice Chou | TF Gene Candidate

Table S3.
Summary of independent annotations of TFs when the same PubMed evidence was used

Gene ID | Gene Symbol | Description | Reviewer | PubMed ID | Judgment | Taxa
18021 | Nfatc3 | nuclear factor of activated T-cells, cytoplasmic, calcineurin-dependent 3 | Sarav Sundararajan | 7650004 | TF Gene | DNA-Binding: sequence-specific
18021 | Nfatc3 | nuclear factor of activated T-cells, cytoplasmic, calcineurin-dependent 3 | Amy Ticoll | 7650004 | TF Gene | DNA-Binding: sequence-specific
18021 | Nfatc3 | nuclear factor of activated T-cells, cytoplasmic, calcineurin-dependent 3 | Stuart Lithwick | 7650004 | TF Gene | DNA-Binding: sequence-specific
15404 | Hoxa7 | homeo box A7 | Sarav Sundararajan | 7911971 | TF Gene | DNA-Binding: sequence-specific
15407 | Hoxb1 | homeo box B1 | David Martin | 7911971 | TF Gene | DNA-Binding: sequence-specific
19883 | Rora | RAR-related orphan receptor alpha | Jared Roach | 7926749 | TF Gene | DNA-Binding: sequence-specific
19883 | Rora | RAR-related orphan receptor alpha | Sarav Sundararajan | 7926749 | TF Gene | DNA-Binding: sequence-specific
19883 | Rora | RAR-related orphan receptor alpha | Rob Sladek | 7926749 | TF Gene | DNA-Binding: sequence-specific
14237 | Foxd4 | forkhead box D4 | Amy Ticoll | 7957066 | TF Gene | DNA-Binding: sequence-specific
14241 | Foxl1 | forkhead box L1 | Jaspar Database | 7957066 | TF Gene | DNA-Binding: sequence-specific
21815 | Tgif1 | TG interacting factor 1 | Debra Fulton | 8537382 | TF Gene | DNA-Binding: sequence-specific
245583 | Tgif2lx | TGFB-induced factor homeobox 2-like, X-linked | Sarav Sundararajan | 8537382 | TF Gene | DNA-Binding: sequence-specific
18185 | Nrl | neural retina leucine zipper gene | Amy Ticoll | 8552602 | TF Gene | DNA-Binding: sequence-specific
18185 | Nrl | neural retina leucine zipper gene | Stuart Lithwick | 8552602 | TF Gene | DNA-Binding: non-sequence-specific
18185 | Nrl | neural retina leucine zipper gene | Warren Cheung | 8552602 | TF Gene Candidate | DNA-Binding: sequence-specific; Transcription Factor Binding: TF co-factor binding

Table S3.
Summary of independent annotations of TFs when the same PubMed evidence was used (continued)

Gene ID | Gene Symbol | Description | Reviewer | PubMed ID | Judgment | Taxa
83796 | Smarcd2 | SWI/SNF related, matrix associated, actin dependent regulator of chromatin, subfamily d, member 2 | Debra Fulton | 8804307 | TF Gene Candidate | Transcription Regulatory Activity: heterochromatin interaction/binding
83797 | Smarcd1 | SWI/SNF related, matrix associated, actin dependent regulator of chromatin, subfamily d, member 1 | Tim Hughes | 8804307 | TF Gene | Transcription Regulatory Activity: heterochromatin interaction/binding
15437 | Hoxd8 | homeo box D8 | Shannan HoSui | 8890171 | TF Gene Candidate | DNA-Binding: sequence-specific; Transcription Factor Binding: TF co-factor binding
15438 | Hoxd9 | homeo box D9 | Amy Ticoll | 8890171 | TF Gene | DNA-Binding: sequence-specific
71702 | Cdc5l | cell division cycle 5-like (S. pombe) | Sarav Sundararajan | 8917598 | TF Evidence Conflict |
71702 | Cdc5l | cell division cycle 5-like (S. pombe) | Rob Sladek | 8917598 | TF Evidence Conflict |
11614 | Nr0b1 | nuclear receptor subfamily 0, group B, member 1 | Sarav Sundararajan | 9032275 | TF Gene | DNA-Binding: non-sequence-specific; Transcription Factor Binding: TF co-factor binding
11614 | Nr0b1 | nuclear receptor subfamily 0, group B, member 1 | Rob Sladek | 9032275 | TF Gene | DNA-Binding: non-sequence-specific
71702 | Cdc5l | cell division cycle 5-like (S. pombe) | Debra Fulton | 9038199 | TF Evidence Conflict |
71702 | Cdc5l | cell division cycle 5-like (S. pombe) | Rob Sladek | 9038199 | TF Evidence Conflict |
15396 | Hoxa11 | homeo box A11 | Rob Sladek | 9079637 | TF Gene | DNA-Binding: sequence-specific
15417 | Hoxb9 | homeo box B9 | Andrew Kwon | 9079637 | TF Gene | DNA-Binding: sequence-specific; Transcription Factor Binding: TF co-factor binding

Table S3.
Summary of independent annotations of TFs when the same PubMed evidence was used (continued)

Gene ID | Gene Symbol | Description | Reviewer | PubMed ID | Judgment | Taxa
19668 | Rbpjl | recombination signal binding protein for immunoglobulin kappa J region-like | Stuart Lithwick | 9111338 | TF Gene | DNA-Binding: sequence-specific
19668 | Rbpjl | recombination signal binding protein for immunoglobulin kappa J region-like | Warren Cheung | 9111338 | TF Gene | DNA-Binding: sequence-specific
19668 | Rbpjl | recombination signal binding protein for immunoglobulin kappa J region-like | Amy Ticoll | 9111338 | TF Gene | DNA-Binding: sequence-specific
19668 | Rbpjl | recombination signal binding protein for immunoglobulin kappa J region-like | Magdalena Swanson | 9111338 | TF Gene | DNA-Binding: sequence-specific
17977 | Ncoa1 | nuclear receptor coactivator 1 | Sarav Sundararajan | 9192892 | TF Gene Candidate | Transcription Factor Binding: TF co-factor binding
17979 | Ncoa3 | nuclear receptor coactivator 3 | Rob Sladek | 9192892 | TF Gene | Transcription Factor Binding: TF co-factor binding
11614 | Nr0b1 | nuclear receptor subfamily 0, group B, member 1 | Sarav Sundararajan | 9384387 | TF Gene | DNA-Binding: non-sequence-specific; Transcription Factor Binding: TF co-factor binding
11614 | Nr0b1 | nuclear receptor subfamily 0, group B, member 1 | Debra Fulton | 9384387 | TF Gene | DNA-Binding: non-sequence-specific; Transcription Factor Binding: TF co-factor binding
11614 | Nr0b1 | nuclear receptor subfamily 0, group B, member 1 | Rob Sladek | 9384387 | TF Gene | DNA-Binding: non-sequence-specific
13392 | Dlx2 | distal-less homeobox 2 | Andrew Kwon | 9415433 | TF Gene | DNA-Binding: sequence-specific
13396 | Dlx6 | distal-less homeobox 6 | Andrew Kwon | 9415433 | TF Gene Candidate | DNA-Binding: sequence-specific
13194 | Ddb1 | damage specific DNA binding protein 1 | Debra Fulton | 9418871 | TF Gene Candidate | Single stranded RNA/DNA binding; Transcription Factor Binding: TF co-factor binding
107986 | Ddb2 | damage specific DNA binding protein 2 | Sarav Sundararajan | 9418871 | TF Gene Candidate | Single stranded RNA/DNA binding; Transcription Factor Binding: TF co-factor binding
107986 | Ddb2 | damage specific DNA binding protein 2 | Debra Fulton | 9418871 | TF Gene Candidate | Single stranded RNA/DNA binding; Transcription Factor Binding: TF co-factor binding
12705 | Cited1 | Cbp/p300-interacting transactivator with Glu/Asp-rich carboxy-terminal domain 1 | Wyeth Wasserman | 9434189 | TF Gene | Transcription Factor Binding: TF co-factor binding
17684 | Cited2 | Cbp/p300-interacting transactivator, with Glu/Asp-rich carboxy-terminal domain, 2 | Wyeth Wasserman | 9434189 | TF Gene | Transcription Factor Binding: TF co-factor binding
19727 | Rfxank | regulatory factor X-associated ankyrin-containing protein | Tim Hughes | 9806546 | TF Gene | Transcription Factor Binding: TF co-factor binding
53970 | Rfx5 | regulatory factor X, 5 (influences HLA class II expression) | Amy Ticoll | 9806546 | TF Gene Candidate | DNA-Binding: sequence-specific; Transcription Factor Binding: TF co-factor binding
16597 | Klf12 | Kruppel-like factor 12 | Sarav Sundararajan | 9858544 | TF Gene | DNA-Binding: sequence-specific
16601 | Klf9 | Kruppel-like factor 9 | Debra Fulton | 9858544 | TF Gene | DNA-Binding: sequence-specific
71702 | Cdc5l | cell division cycle 5-like (S. pombe) | Debra Fulton | 10570151 | TF Evidence Conflict |
71702 | Cdc5l | cell division cycle 5-like (S. pombe) | Rob Sladek | 10570151 | TF Evidence Conflict |
19434 | Rax | retina and anterior neural fold homeobox | Amy Ticoll | 10625658 | TF Gene | DNA-Binding: sequence-specific
19434 | Rax | retina and anterior neural fold homeobox | Magdalena Swanson | 10625658 | TF Gene | DNA-Binding: sequence-specific
19434 | Rax | retina and anterior neural fold homeobox | Stuart Lithwick | 10625658 | TF Gene | DNA-Binding: sequence-specific
19434 | Rax | retina and anterior neural fold homeobox | Warren Cheung | 10625658 | TF Gene | Transcription Factor Binding: TF co-factor binding
11634 | Aire | autoimmune regulator (autoimmune polyendocrinopathy candidiasis ectodermal dystrophy) | Sarav Sundararajan | 10748110 | TF Gene | DNA-Binding: sequence-specific
11634 | Aire | autoimmune regulator (autoimmune polyendocrinopathy candidiasis ectodermal dystrophy) | Debra Fulton | 10748110 | TF Gene | DNA-Binding: non-sequence-specific
11925 | Neurog3 | neurogenin 3 | Amy Ticoll | 10757813 | TF Gene | DNA-Binding: sequence-specific
11925 | Neurog3 | neurogenin 3 | Magdalena Swanson | 10757813 | TF Gene | DNA-Binding: sequence-specific
11614 | Nr0b1 | nuclear receptor subfamily 0, group B, member 1 | Sarav Sundararajan | 10848616 | TF Gene | DNA-Binding: non-sequence-specific; Transcription Factor Binding: TF co-factor binding
11614 | Nr0b1 | nuclear receptor subfamily 0, group B, member 1 | Rob Sladek | 10848616 | TF Gene | DNA-Binding: non-sequence-specific
56809 | Gmeb1 | glucocorticoid modulatory element binding protein 1 | Andrew Kwon | 10894151 | TF Gene | DNA-Binding: sequence-specific; Transcription Factor Binding: TF co-factor binding
229004 | Gmeb2 | glucocorticoid modulatory element binding protein 2 | Magdalena Swanson | 10894151 | TF Gene | DNA-Binding: sequence-specific; Transcription Factor Binding: TF co-factor binding
22772 | Zic2 | zinc finger protein of the cerebellum 2 | Debra Fulton | 11053430 | TF Gene | DNA-Binding: sequence-specific
22773 | Zic3 | zinc finger protein of the cerebellum 3 | Tim Hughes | 11053430 | TF Gene | DNA-Binding: sequence-specific
19434 | Rax | retina and anterior neural fold homeobox | David Martin | 11069920 | TF Gene | Transcription Factor Binding: TF co-factor binding
19434 | Rax | retina and anterior neural fold homeobox | Warren Cheung | 11069920 | TF Gene | Transcription Factor Binding: TF co-factor binding
223922 | Atf7 | activating transcription factor 7 | Gwenael Breard | 11278933 | TF Gene | DNA-Binding: sequence-specific
223922 | Atf7 | activating transcription factor 7 | Sarav Sundararajan | 11278933 | TF Gene | DNA-Binding: sequence-specific
15417 | Hoxb9 | homeo box B9 | Andrew Kwon | 11432851 | TF Gene | DNA-Binding: sequence-specific; Transcription Factor Binding: TF co-factor binding
15430 | Hoxd10 | homeo box D10 | David Martin | 11432851 | TF Gene | DNA-Binding: sequence-specific
20689 | Sall3 | sal-like 3 (Drosophila) | Magdalena Swanson | 11836251 | TF Gene Candidate | Transcription Regulatory Activity: heterochromatin interaction/binding
58198 | Sall1 | sal-like 1 (Drosophila) | Stuart Lithwick | 11836251 | TF Gene | Transcription Regulatory Activity: heterochromatin interaction/binding

Table S3.
Summary of independent annotations of TFs when the same PubMed evidence was used (continued)

Gene ID | Gene Symbol | Description | Reviewer | PubMed ID | Judgment | Taxa
20185 | Ncor1 | nuclear receptor co-repressor 1 | Elodie Portales-Casamar | 11997503 | TF Gene | Transcription Regulatory Activity: heterochromatin interaction/binding; Transcription Factor Binding: TF co-factor binding
20602 | Ncor2 | nuclear receptor co-repressor 2 | David Martin | 11997503 | TF Gene Candidate | Transcription Factor Binding: TF co-factor binding
13194 | Ddb1 | damage specific DNA binding protein 1 | Jared Roach | 12034848 | TF Gene Candidate | Transcription Factor Binding: TF co-factor binding
13194 | Ddb1 | damage specific DNA binding protein 1 | Rob Sladek | 12034848 | TF Gene Candidate | Single stranded RNA/DNA binding
107986 | Ddb2 | damage specific DNA binding protein 2 | Rob Sladek | 12034848 | TF Gene Candidate | Transcription Factor Binding: TF co-factor binding
15410 | Hoxb3 | homeo box B3 | Elodie Portales-Casamar | 12482716 | TF Gene | DNA-Binding: sequence-specific
15412 | Hoxb4 | homeo box B4 | Elodie Portales-Casamar | 12482716 | TF Gene | DNA-Binding: sequence-specific
18503 | Pax1 | paired box gene 1 | Sarav Sundararajan | 12490554 | TF Gene | DNA-Binding: sequence-specific
18511 | Pax9 | paired box gene 9 | Rob Sladek | 12490554 | TF Gene | DNA-Binding: sequence-specific
21415 | Tcf3 | transcription factor 3 | Rob Sladek | 14627819 | TF Gene | DNA-Binding: sequence-specific
21423 | Tcfe2a | transcription factor E2a | Debra Fulton | 14627819 | TF Gene | DNA-Binding: sequence-specific
74123 | Foxp4 | forkhead box P4 | Amy Ticoll | 14701752 | TF Gene | DNA-Binding: sequence-specific
114142 | Foxp2 | forkhead box P2 | Rob Sladek | 14701752 | TF Gene | DNA-Binding: sequence-specific
114142 | Foxp2 | forkhead box P2 | Tim Hughes | 14701752 | TF Gene | DNA-Binding: sequence-specific
18185 | Nrl | neural retina leucine zipper gene | Magdalena Swanson | 15001570 | TF Gene | DNA-Binding: sequence-specific
18185 | Nrl | neural retina leucine zipper gene | Warren Cheung | 15001570 | TF Gene Candidate | DNA-Binding: sequence-specific; Transcription Factor Binding: TF co-factor binding
20937 | Suv39h1 | suppressor of variegation 3-9 homolog 1 (Drosophila) | Debra Fulton | 15107829 | TF Gene | Transcription Factor Binding: TF co-factor binding
64707 | Suv39h2 | suppressor of variegation 3-9 homolog 2 (Drosophila) | Sarav Sundararajan | 15107829 | TF Gene Candidate | Transcription Regulatory Activity: heterochromatin interaction/binding; Transcription Factor Binding: TF co-factor binding
18612 | Etv4 | ets variant gene 4 (E1A enhancer binding protein, E1AF) | Stuart Lithwick | 15138262 | TF Gene | DNA-Binding: sequence-specific; Transcription Factor Binding: TF co-factor binding
104156 | Etv5 | ets variant gene 5 | Rob Sladek | 15138262 | TF Gene | DNA-Binding: sequence-specific
93760 | Arid1a | AT rich interactive domain 1A (Swi1 like) | Jared Roach | 15170388 | TF Gene | DNA-Binding: non-sequence-specific
239985 | Arid1b | AT rich interactive domain 1B (Swi1 like) | Amy Ticoll | 15170388 | TF Gene | Transcription Regulatory Activity: heterochromatin interaction/binding; DNA-Binding: non-sequence-specific; Transcription Factor Binding: TF co-factor binding
18021 | Nfatc3 | nuclear factor of activated T-cells, cytoplasmic, calcineurin-dependent 3 | Magdalena Swanson | 15173172 | TF Gene | DNA-Binding: sequence-specific
18021 | Nfatc3 | nuclear factor of activated T-cells, cytoplasmic, calcineurin-dependent 3 | Stuart Lithwick | 15173172 | TF Gene | DNA-Binding: sequence-specific
18021 | Nfatc3 | nuclear factor of activated T-cells, cytoplasmic, calcineurin-dependent 3 | Warren Cheung | 15173172 | TF Gene | DNA-Binding: sequence-specific; Transcription Factor Binding: TF co-factor binding
20671 | Sox17 | SRY-box containing gene 17 | Elodie Portales-Casamar | 15220343 | TF Gene | DNA-Binding: sequence-specific
20680 | Sox7 | SRY-box containing gene 7 | Amy Ticoll | 15220343 | TF Gene | DNA-Binding: sequence-specific
19290 | Pura | purine rich element binding protein A | Rob Sladek | 15282343 | TF Gene | Single stranded RNA/DNA binding
19291 | Purb | purine rich element binding protein B | Amy Ticoll | 15282343 | TF Gene | Single stranded RNA/DNA binding
22774 | Zic4 | zinc finger protein of the cerebellum 4 | Gwenael Breard | 15465018 | TF Gene Candidate | DNA-Binding: sequence-specific
65100 | Zic5 | zinc finger protein of the cerebellum 5 | Wyeth Wasserman | 15465018 | TF Gene | DNA-Binding: sequence-specific
18612 | Etv4 | ets variant gene 4 (E1A enhancer binding protein, E1AF) | Stuart Lithwick | 15466854 | TF Gene | DNA-Binding: sequence-specific; Transcription Factor Binding: TF co-factor binding
104156 | Etv5 | ets variant gene 5 | Rob Sladek | 15466854 | TF Gene | DNA-Binding: sequence-specific
211323 | Nrg1 | neuregulin 1 | Amy Ticoll | 15494726 | TF Gene | Transcription Factor Binding: TF co-factor binding
211323 | Nrg1 | neuregulin 1 | Magdalena Swanson | 15494726 | TF Gene Candidate | Transcription Factor Binding: TF co-factor binding
211323 | Nrg1 | neuregulin 1 | Stuart Lithwick | 15494726 | TF Gene Candidate | Transcription Factor Binding: TF co-factor binding
211323 | Nrg1 | neuregulin 1 | Warren Cheung | 15494726 | TF Gene | Transcription Factor Binding: TF co-factor binding
66993 | Smarcd3 | SWI/SNF related, matrix associated, actin dependent regulator of chromatin, subfamily d, member 3 | Alice Chou | 15525990 | TF Gene Candidate | Transcription Factor Binding: TF co-factor binding
83797 | Smarcd1 | SWI/SNF related, matrix associated, actin dependent regulator of chromatin, subfamily d, member 1 | Tim Hughes | 15525990 | TF Gene | Transcription Regulatory Activity: heterochromatin interaction/binding
12677 | Vsx2 | visual system homeobox 2 | Elodie Portales-Casamar | 15647262 | TF Gene | DNA-Binding: sequence-specific
114889 | Vsx1 | visual system homeobox 1 homolog (zebrafish) | Tim Hughes | 15647262 | TF Gene | DNA-Binding: sequence-specific
11634 | Aire | autoimmune regulator (autoimmune polyendocrinopathy candidiasis ectodermal dystrophy) | Debra Fulton | 15649436 | TF Gene | DNA-Binding: non-sequence-specific
11634 | Aire | autoimmune regulator (autoimmune polyendocrinopathy candidiasis ectodermal dystrophy) | Rob Sladek | 15649436 | TF Gene | DNA-Binding: non-sequence-specific
107503 | Atf5 | activating transcription factor 5 | Debra Fulton | 15735663 | TF Gene | DNA-Binding: sequence-specific
223922 | Atf7 | activating transcription factor 7 | Sarav Sundararajan | 15735663 | TF Gene | DNA-Binding: sequence-specific
15395 | Hoxa10 | homeo box A10 | Jared Roach | 15886193 | TF Gene | DNA-Binding: sequence-specific
15416 | Hoxb8 | homeo box B8 | Debra Fulton | 15886193 | TF Gene | DNA-Binding: sequence-specific
15110 | Hand1 | heart and neural crest derivatives expressed transcript 1 | Amy Ticoll | 16043483 | TF Gene | Transcription Factor Binding: TF co-factor binding
15110 | Hand1 | heart and neural crest derivatives expressed transcript 1 | Stuart Lithwick | 16043483 | TF Gene | DNA-Binding: sequence-specific
15110 | Hand1 | heart and neural crest derivatives expressed transcript 1 | Warren Cheung | 16043483 | TF Gene | DNA-Binding: sequence-specific; Transcription Factor Binding: TF co-factor binding
55942 | Sertad1 | SERTA domain containing 1 | Elodie Portales-Casamar | 16098148 | TF Gene | Transcription Factor Binding: TF co-factor binding
58172 | Sertad2 | SERTA domain containing 2 | Amy Ticoll | 16098148 | TF Gene | Transcription Factor Binding: TF co-factor binding
27386 | Npas3 | neuronal PAS domain protein 3 | Amy Ticoll | 16172381 | TF Gene Candidate | DNA-Binding: sequence-specific
27386 | Npas3 | neuronal PAS domain protein 3 | Stuart Lithwick | 16172381 | TF Gene Candidate | Transcription Regulatory Activity: heterochromatin interaction/binding
27386 | Npas3 | neuronal PAS domain protein 3 | Warren Cheung | 16172381 | TF Gene Candidate | Transcription Factor Binding: TF co-factor binding
11925 | Neurog3 | neurogenin 3 | Amy Ticoll | 16511571 | TF Gene | DNA-Binding: sequence-specific
11925 | Neurog3 | neurogenin 3 | Magdalena Swanson | 16511571 | TF Gene | DNA-Binding: sequence-specific
11925 | Neurog3 | neurogenin 3 | Stuart Lithwick | 16511571 | TF Gene | DNA-Binding: non-sequence-specific
11925 | Neurog3 | neurogenin 3 | Warren Cheung | 16511571 | TF Gene | DNA-Binding: sequence-specific
73181 | Nfatc4 | nuclear factor of activated T-cells, cytoplasmic, calcineurin-dependent 4 | Magdalena Swanson | 16644691 | TF Gene | DNA-Binding: sequence-specific
73181 | Nfatc4 | nuclear factor of activated T-cells, cytoplasmic, calcineurin-dependent 4 | Stuart Lithwick | 16644691 | TF Gene | DNA-Binding: sequence-specific
73181 | Nfatc4 | nuclear factor of activated T-cells, cytoplasmic, calcineurin-dependent 4 | Warren Cheung | 16644691 | TF Gene | DNA-Binding: sequence-specific

Table S4. Detected DNA-binding domains predicted in judged TF genes using PFAM and Superfamily group models

PFAM HMM | Superfamily Relationship
AF-4 |
ARID | ARID-like
AT_hook |
Beta-trefoil | DNA-binding protein LAG-1 (CSL)
bZIP_1 |
bZIP_2 |
bZIP_Maf | A DNA-binding domain in eukaryotic transcription factors
CBFB_NFYA |
CBFD_NFYB_HMF |
CP2 |
CUT |
DDT |
E2F_TDP | Winged helix DNA-binding domain
Ets | Winged helix DNA-binding domain
Fork_head | Winged helix DNA-binding domain
GATA | Glucocorticoid receptor-like (DNA-binding domain)
GCM | GCM domain
GTF2I |
HLH | HLH, helix-loop-helix DNA-binding domain
HMG_box | HMG-box
Homeobox | Homeodomain-like
HSF_DNA-bind | Winged helix DNA-binding domain
HTH_psq | Homeodomain-like
IRF | Winged helix DNA-binding domain
LAG1-DNAbind | p53-like transcription factors
MBD | DNA-binding domain
MH1 | SMAD MH1 domain
Myb_DNA-binding | Homeodomain-like
P53 | p53-like transcription factors

Table S4.
Detected DNA-binding domains predicted in judged TF genes using PFAM and Superfamily group models (continued)

PFAM HMM | Superfamily Relationship
PAX | Homeodomain-like
Pou |
RFX_DNA_binding | Winged helix DNA-binding domain
RHD | p53-like transcription factors
RNA_pol_Rpb2_1 | beta and beta-prime subunits of DNA dependent RNA-polymerase
RNA_pol_Rpb2_2 | beta and beta-prime subunits of DNA dependent RNA-polymerase
RNA_pol_Rpb2_3 | beta and beta-prime subunits of DNA dependent RNA-polymerase
RNA_pol_Rpb2_4 | beta and beta-prime subunits of DNA dependent RNA-polymerase
RNA_pol_Rpb2_5 | beta and beta-prime subunits of DNA dependent RNA-polymerase
RNA_pol_Rpb2_6 | beta and beta-prime subunits of DNA dependent RNA-polymerase
RNA_pol_Rpb2_7 | beta and beta-prime subunits of DNA dependent RNA-polymerase
Runt | p53-like transcription factors
SAND | SAND domain-like
SLIDE | Homeodomain-like
SRF-TF | SRF-like
STAT_bind | p53-like transcription factors
T-box | p53-like transcription factors
TBP | TATA-box binding protein-like
TEA |
TF_AP-2 |
TF_Otx |
zf-C2H2 | C2H2 and C2HC zinc fingers
zf-C2HC | CCHHC domain
zf-C4 | Glucocorticoid receptor-like (DNA-binding domain)
zf-CXXC |

Table S5. DNA-binding classification extensions to Luscombe et al. classification system

Protein Structure Group | Protein Structure Family | Description | Comments
1.2) Winged Helix-Turn-Helix | | The winged HTH motif is an extension of the HTH group, which is characterized by a third or fourth alpha-helix and an adjacent beta-sheet. | Group description modification
1.1) Helix-Turn-Helix | 100) Myb Domain Family | The Myb vertebrate DBD consists of three tandem repeats of 51 to 53 amino acid residues from the amino acid terminus, herein referred to as R1, R2, and R3 (Kanei-Ishii et al. 1990). Each repeat contains three helices (α1, α2, and α3) of a helix-turn-helix motif (Ogata et al. 1992; Ogata et al. 1995), with R2 and R3 involved in specific DNA recognition and R1 covering the DNA position next to the R2 binding site. R1, R2, and R3 bind to DNA mainly in the major groove (Tahirov et al. 2002). The relationship between the protein domains of Myb and Rap1 is heavily covered in the literature. While there is a clear sequence similarity, there are important differences between the two regions in the proteins. A study by Hanaoka et al. (2001) compares the Myb-like domains. The human Rap1 protein fragment solution structure (PDB: 1FEX) appears to share structural similarity to Myb. However, yeast Rap1 associated with DNA (available in PDB: 1IGN) does not align well with Myb. It is noteworthy that the yeast (SC) protein does have observed DNA binding capacity, while the human protein seems to require interaction with TRF2 for association with DNA (O'Connor et al. 2004). | Family added
1.2) Winged Helix-Turn-Helix | 101) GTF2I Domain Family | DNA-binding studies suggest that the GTF2I domain binds DNA (Rauhala et al. 2005; Vullhorst et al. 2003). At the time of our analysis, no DNA-bound protein structures were available for review. However, structural alignments using an NMR-based structure suggest that this domain may take on a Helix-Turn-Helix configuration. Based on its unique conformation we predict that its DNA binding mechanism will likely support its own HTH family. | Family added – predicted
1.2) Winged Helix-Turn-Helix | 102) Forkhead Domain Family | The forkhead domain binds DNA as a monomer (Clark et al. 1993). Three or four helices are set against a small three-stranded antiparallel beta-sheet from which two large loops extend (Clark et al. 1993; van Dongen et al. 2000). DNA binding occurs largely through the third helix inserted in the major groove of the DNA (Clark et al. 1993). The wings of the forkhead domain also make contact with the DNA and may contribute to sequence specificity (Bravieri et al. 1997). | Family added
1.2) Winged Helix-Turn-Helix | 103) RFX Domain Family | The hRFX1 DBD consists of three alpha-helices, three beta-strands, and three connecting loops (Gajiwala et al. 2000). The third loop, connecting beta-strands S2 and S3, forms wing W1 of the winged-helix motif and makes contact with the DNA in the major groove, along with beta-strands S2 and S3. In contrast to other winged-helix DBDs, RFX has only one wing. | Family added
2) Zinc-coordinating Group | 104) GATA Domain Family | The GATA domain is composed of a core, containing a zinc coordinated by four cysteines, and a carboxyl-terminal tail (Omichinski et al. 1993). Specifically, the core consists of two anti-parallel beta-sheets and an alpha helix connected to a long loop that tethers a carboxyl-terminal tail. DNA contact is made in the major groove through a helix and the loop connecting the two beta-sheets, and the carboxyl-terminal tail wraps around the DNA, making contact with the minor groove (Omichinski et al. 1993). | Family added
2) Zinc-coordinating Group | 105) Glial Cells Missing (GCM) Domain Family | The GCM domain is composed of one five-stranded and one three-stranded beta-sheet, with three small helical segments packed against the same side of the two beta-sheets (Cohen et al. 2003). The five-stranded beta-sheet is inserted into the major groove of the DNA. Residues from the edge of the beta-sheet and the following loop and strand contact the DNA backbone and bases, providing the sequence specificity (Cohen et al. 2003). | Family added
2) Zinc-coordinating Group | 106) MH1 Domain Family | The Smad3 MH1 domain consists of four alpha-helices, six beta-strands, and five loops. The Smad MH1 domain contains a novel DNA-binding motif, an 11-residue beta-hairpin (formed by the second and third beta-strands), which is embedded asymmetrically in the major groove of DNA (Chai et al. 2003). The MH1 domain contains a bound zinc atom coordinated by three cysteines and one histidine. Removal of the zinc atom results in reduced DNA binding activity. However, not all MH1-containing proteins can bind to DNA (such as Smad1). Sequence analyses suggest that the DNA-binding domains of CTF/NFI and SMAD MH1 demonstrate significant sequence homology (Stefancsik et al. 2003). | Family added
4) Other Alpha-Helix Group | 107) Sand Domain Family | The GMEB1 Sand domain adopts a compact fold with an alpha-helical face and a twisted beta-sheet face (Surdo et al. 2003). At the time of our analysis, no DNA-bound protein structures were available for review. However, the DNA binding surface has been mapped to the alpha-helical region encompassing the KDWK motif (Bottomley et al. 2001; Surdo et al. 2003). The GMEB1 SAND domain contains a zinc-binding motif and, although the zinc ion is not required for DNA binding, it plays a role in determining the C-terminal conformation of the GMEB1 SAND domain (Surdo et al. 2003). | Family added – predicted
6) Beta Hairpin_Ribbon Group | 108) Methyl-CpG-binding Domain Family (MBD) | The MBD folds into an alpha/beta sandwich structure, which is comprised of a layer of twisted beta-sheet, backed by another layer formed by alpha-helix 1 and a hairpin loop at the C terminus. The beta-sheet is composed of two long inner strands (beta-strand 2 and beta-strand 3) sandwiched by two shorter outer strands (beta-strand 1 and beta-strand 4) (Ohki et al. 2001). A section of the inner strands that projects beyond the outer strands is embedded in the major groove at the target DNA site and, together with the loop that links beta-strand 2 and beta-strand 3, forms the principal DNA interface. The twisted beta-sheet is angled within the major groove to wedge the C-terminal part of beta-strand 4. This orientation of the beta-sheet allows alpha-helix 1 and loop 2 to mediate major groove contacts with DNA (Ohki et al. 2001). | Family added
1.2) Winged Helix | 109) Arid Domain Family | Arid family proteins can be grouped into subfamilies based on sequence similarity. A majority of these subfamilies bind DNA without obvious sequence specificity (Patsialou et al. 2005). Arid appears to interact with both the major and minor DNA grooves. Major groove DNA contact is made through insertion of a loop (Kim et al. 2004) and/or an α-helix (Iwahara et al. 2002). | Family added
7) Other | 110) Runt Domain Family | The Runt domain recognizes specific bases in both the major and minor grooves of the DNA, and binding is accomplished mainly using loops. The CBFα Runt domain makes contact with the DNA consensus sequence using three loop-containing regions: beta-strand 3 – loop 3, beta-strand 9 – loop 9, and beta-strand 12 – loop 12 (Backstrom et al. 2002). The first and third loops interact with the major groove, while the second interacts with the minor groove. Two chloride ions bind to the Runt domain, one of which is situated at the DNA-binding surface, and are shown to have a positive effect on DNA binding (Backstrom et al. 2002). Structural comparisons demonstrate that the s-type Ig fold found in the Runt domain is conserved in the Ig folds found in the DNA-binding domains of NF-kappaB, NFAT, p53, STAT-1, and the T-domain. The differences among these proteins arise in the connecting loop regions, where short additional secondary structural elements have been added that in some cases interact with the core Ig scaffold (Berardi et al. 1999). These proteins appear to form a family of structurally and functionally related DNA-binding domains. Unlike the other members of this family, the Runt domain utilizes loops at both ends of the Ig fold for DNA recognition (Berardi et al. 1999). | Added family
1.2) Winged Helix-Turn-Helix | 111) Slide Domain Family | The three core helices of the Slide domain superimpose well with c-Myb repeats. The Slide domain appears to be highly compatible with a role in DNA binding given that it has an overall positive charge and c-Myb DNA-contacting residues are largely conserved in the Slide domain (Grune et al. 2003). At the time of our analysis, no DNA-bound protein structures were available for review. However, given its similarities with the c-Myb protein structure, we predict that the Slide DNA binding structure may take on a helix-turn-helix (HTH) conformation and, as such, have classified it in its own HTH family and will revisit this assignment when a DNA-bound structure is available. | Family added – predicted
7) Other | 112) Beta-trefoil-like | The beta-trefoil domain (BTD) is a capped beta-barrel. The prototypical BTD consists of four strands repeated in a three-fold arrangement, where beta-strands 1 and 4 form the walls of the barrel and beta-strands 1 and 2 form the cap of the barrel. However, the CSL protein's BTD possesses minor deviations from this (Kovall et al. 2004). In CSL the BTD makes specific contact with the DNA minor groove and non-specific contact with the phosphate-ribose backbone (Kovall et al. 2004). | Family added
7) Other | 113) DNA-Binding LAG-1-like | The Lag-1 DNA binding domain is composed of a seven-stranded beta barrel organized into a sandwich composed of three- and four-stranded beta sheets (characteristics of an Ig-like fold) (Kovall et al. 2004). In the CSL protein, the Lag-1 DNA binding domain interacts with the major groove of the DNA (Kovall et al. 2004). | Family added
2) Zinc-coordinating Group | 114) Non-methyl CpG-binding CXXC Domain | Three cysteine residues in two CGXCXXC motifs provide coordination for two zinc ions. Both motifs adopt a similar conformation in which the second and third cysteines are contained within a small helix or form a small helix-turn-helix. At the time of our analysis, no DNA-bound protein structures were available for review. However, NMR binding and mutagenesis data define the CXXC domain as the non-methyl CpG DNA binding interface (Allen et al. 2006). | Family added – predicted
4) Other Alpha Helix Group | 115) NF-Y CCAAT-Binding Protein Family | CBF/NF-Y is a heterotrimeric complex composed of NF-YA, NF-YB and NF-YC, which are all required for DNA binding (Romier et al. 2003). CBF NF-YC and NF-YB are homologous to histones H2A and H2B (Sinha et al. 1995). Although there are no DNA-bound structures available for review, DNA recognition appears to involve both minor and major groove interactions (inferred by methylation interference patterns) (Ronchi et al. 1995). |
999) Unclassified Structure | 901) CP2 Transcription Factor Domain Family | The DNA binding domain of CP2-like genes has been experimentally identified (Rodda et al. 2001) and appears to be somewhat conserved in the fly Grainyhead TF (Uv et al. 1994). At the time of our analysis, no protein structure was available for review. | Unclassified DBD structure
999) Unclassified Structure | 902) AF-4 Protein Family | AF-4 proteins have been shown to bind DNA in vitro (Ma et al. 1996). However, the structural DNA binding mechanism is unclear. | Unclassified DBD structure
999) Unclassified Structure | 903) DNA binding homeobox and Different Transcription factors (DDT) | Experimental analysis of Fac1 (a truncated version of bromodomain PHD finger transcription factors) demonstrates that the N-terminal region, which includes the DDT domain, is involved in DNA-binding (Jordan-Sciutto et al. 1999). Secondary structure predictions suggest that DDT is composed of three alpha helices. However, fold recognition comparisons do not suggest any significant similarity to known DNA-RNA binding alpha-helical bundles (Doerks et al. 2001). | Unclassified DBD structure
999) Unclassified Structure | 904) AT-hook Domain Family | The AT-hook domain is the DNA-binding domain of the HMGI(Y) family of proteins. At the time of our analysis, the only DNA-bound structure available was HMGA1 (HMG-I(Y)). Unfortunately, this structure is derived from NMR data produced over a decade ago and does not appear sufficiently detailed to confidently assess a structural family. | Unclassified DBD structure

References for Table S5

Allen, M.D., C.G. Grummitt, C. Hilcenko, S.Y. Min, L.M. Tonkin, C.M. Johnson, S.M. Freund, M. Bycroft, and A.J. Warren. 2006. Solution structure of the nonmethyl-CpG-binding CXXC domain of the leukaemia-associated MLL histone methyltransferase. Embo J 25: 4503-4512.
Backstrom, S., M. Wolf-Watz, C. Grundstrom, T. Hard, T. Grundstrom, and U.H. Sauer. 2002. The RUNX1 Runt domain at 1.25A resolution: a structural switch and specifically bound chloride ions modulate DNA binding. J Mol Biol 322: 259-272.
Berardi, M.J., C. Sun, M. Zehr, F. Abildgaard, J. Peng, N.A. Speck, and J.H. Bushweller. 1999. The Ig fold of the core binding factor alpha Runt domain is a member of a family of structurally and functionally related Ig-fold DNA-binding domains. Structure 7: 1247-1256.
Bottomley, M.J., M.W. Collard, J.I. Huggenvik, Z. Liu, T.J. Gibson, and M. Sattler. 2001. The SAND domain structure defines a novel DNA-binding fold in transcriptional regulation. Nat Struct Biol 8: 626-633.
Bravieri, R., T. Shiyanova, T.H. Chen, D. Overdier, and X. Liao. 1997. Different DNA contact schemes are used by two winged helix proteins to recognize a DNA binding sequence. Nucleic Acids Res 25: 2888-2896.
Chai, J., J.W. Wu, N. Yan, J. Massague, N.P. Pavletich, and Y. Shi. 2003. Features of a Smad3 MH1-DNA complex. Roles of water and zinc in DNA binding. J Biol Chem 278: 20327-20331.
Clark, K.L., E.D. Halay, E. Lai, and S.K. Burley. 1993. Co-crystal structure of the HNF-3/fork head DNA-recognition motif resembles histone H5. Nature 364: 412-420.
Cohen, S.X., M. Moulin, S. Hashemolhosseini, K. Kilian, M. Wegner, and C.W. Muller. 2003. Structure of the GCM domain-DNA complex: a DNA-binding domain with a novel fold and mode of target site recognition. Embo J 22: 1835-1845.
Doerks, T., R. Copley, and P. Bork. 2001. DDT -- a novel domain in different transcription and chromosome remodeling factors. Trends Biochem Sci 26: 145-146.
Gajiwala, K.S., H. Chen, F. Cornille, B.P. Roques, W. Reith, B. Mach, and S.K. Burley. 2000. Structure of the winged-helix protein hRFX1 reveals a new mode of DNA binding. Nature 403: 916-921.

Protein Structure Group | Protein Structure Family | Description | Comments
999) Unclassified Structure | 905) Nuclear Factor I - CCAAT-binding Transcription Factor (NFI-CTF) Family | Nuclear Factor I NFI/CTF TFs can form both homo- and heterodimers (Kruse et al. 1994). At the time of our analysis, no protein structure was available for review. The family contains a conserved N-terminal DNA-binding domain (within the first 240 amino acids of CTF/NFI) (Gournari et al. 1990). Chicken and mammalian homolog genes incorporating this domain are NFI-A, NFI-B, and NFI-C (Rupp et al. 1990). Sequence analyses suggest that the DNA-binding domains of CTF/NFI and SMAD MH1 demonstrate significant sequence homology (Stefancsik et al. 2003).
Unclassified DBD structure 271 Gounari, F., R. De Francesco, J. Schmitt, P. van der Vliet, R. Cortese, and H. Stunnenberg. 1990. Amino-terminal domain of NF1 binds to DNA as a dimer and activates adenovirus DNA replication. Embo J 9: 559-566. Grune, T., J. Brzeski, A. Eberharter, C.R. Clapier, D.F. Corona, P.B. Becker, and C.W. Muller. 2003. Crystal structure and functional analysis of a nucleosome recognition module of the remodeling factor ISWI. Mol Cell 12: 449-460. Hanaoka, S., A. Nagadoi, S. Yoshimura, S. Aimoto, B. Li, T. de Lange, and Y. Nishimura. 2001. NMR structure of the hRap1 Myb motif reveals a canonical three-helix bundle lacking the positive surface charge typical of Myb DNA-binding domains. J Mol Biol 312: 167-175. Iwahara, J., M. Iwahara, G.W. Daughdrill, J. Ford, and R.T. Clubb. 2002. The structure of the Dead ringer-DNA complex reveals how AT-rich interaction domains (ARIDs) recognize DNA. Embo J 21: 1197-1209. Jordan-Sciutto, K.L., J.M. Dragich, and R. Bowser. 1999. DNA binding activity of the fetal Alz-50 clone 1 (FAC1) protein is enhanced by phosphorylation. Biochem Biophys Res Commun 260: 785-789. Kanei-Ishii, C., A. Sarai, T. Sawazaki, H. Nakagoshi, D.N. He, K. Ogata, Y. Nishimura, and S. Ishii. 1990. The tryptophan cluster: a hypothetical structure of the DNA-binding domain of the myb protooncogene product. J Biol Chem 265: 19990-19995. Kim, S., Z. Zhang, S. Upchurch, N. Isern, and Y. Chen. 2004. Structure and DNA-binding sites of the SWI1 AT-rich interaction domain (ARID) suggest determinants for sequence-specific DNA recognition. J Biol Chem 279: 16670-16676. Kovall, R.A. and W.A. Hendrickson. 2004. Crystal structure of the nuclear effector of Notch signaling, CSL, bound to DNA. Embo J 23: 3441-3451. Kruse, U. and A.E. Sippel. 1994. Transcription factor nuclear factor I proteins form stable homo- and heterodimers. FEBS Lett 348: 46-50. Ma, C. and L.M. Staudt. 1996. 
LAF-4 encodes a lymphoid nuclear protein with transactivation potential that is homologous to AF-4, the gene fused to MLL in t(4;11) leukemias. Blood 87: 734-745. O'Connor, M.S., A. Safari, D. Liu, J. Qin, and Z. Songyang. 2004. The human Rap1 protein complex and modulation of telomere length. J Biol Chem 279: 28585-28591. Ogata, K., H. Hojo, S. Aimoto, T. Nakai, H. Nakamura, A. Sarai, S. Ishii, and Y. Nishimura. 1992. Solution structure of a DNA-binding unit of Myb: a helix-turn-helix-related motif with conserved tryptophans forming a hydrophobic core. Proc Natl Acad Sci U S A 89: 6428- 6432. Ogata, K., S. Morikawa, H. Nakamura, H. Hojo, S. Yoshimura, R. Zhang, S. Aimoto, Y. Ametani, Z. Hirata, A. Sarai et al. 1995. Comparison of the free and DNA-complexed forms of the DNA-binding domain from c-Myb. Nat Struct Biol 2: 309-320. Ohki, I., N. Shimotake, N. Fujita, J. Jee, T. Ikegami, M. Nakao, and M. Shirakawa. 2001. Solution structure of the methyl-CpG binding domain of human MBD1 in complex with methylated DNA. Cell 105: 487-497. Omichinski, J.G., G.M. Clore, O. Schaad, G. Felsenfeld, C. Trainor, E. Appella, S.J. Stahl, and A.M. Gronenborn. 1993. NMR structure of a specific DNA complex of Zn-containing DNA binding domain of GATA-1. Science 261: 438-446. Patsialou, A., D. Wilsker, and E. Moran. 2005. DNA-binding properties of ARID family proteins. Nucleic Acids Res 33: 66-80. Rodda, S., S. Sharma, M. Scherer, G. Chapman, and P. Rathjen. 2001. CRTR-1, a developmentally regulated transcriptional repressor related to the CP2 family of transcription factors. J Biol Chem 276: 3324-3332. Romier, C., F. Cocchiarella, R. Mantovani, and D. Moras. 2003. The NF-YB/NF-YC structure gives insight into DNA binding and transcription regulation by CCAAT factor NF-Y. J Biol Chem 278: 1336-1345. Ronchi, A., M. Bellorini, N. Mongelli, and R. Mantovani. 1995. CCAAT-box binding protein NF-Y (CBF, CP1) recognizes the minor groove and distorts DNA. Nucleic Acids Res 23: 4565- 4572. 
Rupp, R. A., Kruse, U., Multhaup, G., Gobel, U., Beyreuther, K., Sippel, A. E. 1990. Chicken NFI/TGGCA proteins are encoded by at least three independent genes: NFI-A, NFI-B and NFI-C with homologues in mammalian genomes. Nucleic Acids Res 18: 2607-2616 272 Sinha, S., S.N. Maity, J. Lu, and B. de Crombrugghe. 1995. Recombinant rat CBF-C, the third subunit of CBF/NFY, allows formation of a protein-DNA complex with CBF-A and CBF-B and with yeast HAP2 and HAP3. Proc Natl Acad Sci U S A 92: 1624-1628. Stefancsik, R. and S. Sarkar. 2003. Relationship between the DNA binding domains of SMAD and NFI/CTF transcription factors defines a new superfamily of genes. DNA Seq 14: 233- 239. Surdo, P.L., M.J. Bottomley, M. Sattler, and K. Scheffzek. 2003. Crystal structure and nuclear magnetic resonance analyses of the SAND domain from glucocorticoid modulatory element binding protein-1 reveals deoxyribonucleic acid and zinc binding regions. Mol Endocrinol 17: 1283-1295. Tahirov, T.H., K. Sato, E. Ichikawa-Iwata, M. Sasaki, T. Inoue-Bungo, M. Shiina, K. Kimura, S. Takata, A. Fujikawa, H. Morii et al. 2002. Mechanism of c-Myb-C/EBP beta cooperation from separated sites on a promoter. Cell 108: 57-70. Uv, A.E., C.R. Thompson, and S.J. Bray. 1994. The Drosophila tissue-specific factor Grainyhead contains novel DNA-binding and dimerization domains which are conserved in the human protein CP2. Mol Cell Biol 14: 4020-4031. van Dongen, M.J., A. Cederberg, P. Carlsson, S. Enerback, and M. Wikstrom. 2000. Solution structure and dynamics of the DNA-binding domain of the adipocyte-transcription factor FREAC-11. J Mol Biol 296: 351-359. Vullhorst, D. and A. Buonanno. 2003. Characterization of general transcription factor 3, a transcription factor involved in slow muscle-specific gene expression. J Biol Chem 278: 8370-8379. Vullhorst, D. and A. Buonanno. 2005. Multiple GTF2I-like repeats of general transcription factor 3 exhibit DNA binding properties. 
Evidence for a common origin as a sequence-specific DNA interaction module. J Biol Chem 280: 31722-31731.

Table S6. Protein class counts of genes predicted to contain multiple instances of the same DNA-binding domain

Protein Group ID | Protein Group Description | Protein Family ID | Protein Family Description | # Genes With Multiple Predicted Instances
1.1 | Helix-Turn-Helix Group | 100 | Myb Domain Family | 5
1.1 | Helix-Turn-Helix Group | 2 | Homeodomain Family | 22
1.2 | Winged Helix-Turn-Helix | 101 | GTF2I Domain Family | 2
1.2 | Winged Helix-Turn-Helix | 15 | Transcription Factor Family | 1
2.1 | Zinc-coordinating Group | 104 | GATA Domain Family | 5
2.1 | Zinc-coordinating Group | 114 | Non Methyl-CpG-binding CXXC Domain | 1
2.1 | Zinc-coordinating Group | 17 | BetaBetaAlpha-zinc finger Family | 79
3 | Zipper-Type Group | 21 | Leucine Zipper Family | 23
4 | Other Alpha-Helix Group | 28 | High Mobility Group (Box) | 3
5 | Beta-sheet Group | 30 | TATA box-binding Family | 1
8 | Enzyme Group | 47 | DNA Polymerase-Beta Family | 1
999 | Unclassified Structure | 904 | AT-hook Domain Family | 2

Table S7. DNA-binding transcription factors predicted to contain two different DNA-binding classes

Gene ID | Gene Symbol | Gene Description | Predicted Protein Group | Predicted Protein Family
11909 | Atf2 | activating transcription factor 2 | Zinc-coordinating Group | BetaBetaAlpha-zinc finger Family
11909 | Atf2 | activating transcription factor 2 | Zipper-Type Group | Leucine Zipper Family
17190 | Mbd1 | methyl-CpG binding domain protein 1 | Zinc-coordinating Group | Non Methyl-CpG-binding CXXC Domain
17190 | Mbd1 | methyl-CpG binding domain protein 1 | Beta-Hairpin-Ribbon Group | Methyl-CpG-binding domain, MBD Family
19664 | Rbpj | recombination signal binding protein for immunoglobulin kappa J region | Other | Beta_Trefoil-like Domain Family
19664 | Rbpj | recombination signal binding protein for immunoglobulin kappa J region | Other | DNA-binding LAG-1-like Domain Family
19668 | Rbpjl | recombination signal binding protein for immunoglobulin kappa J region-like | Other | Beta_Trefoil-like Domain Family
19668 | Rbpjl | recombination signal binding protein for immunoglobulin kappa J region-like | Other | DNA-binding LAG-1-like Domain Family
56218 | Patz1 | POZ (BTB) and AT hook containing zinc finger 1 | Zinc-coordinating Group | BetaBetaAlpha-zinc finger Family
56218 | Patz1 | POZ (BTB) and AT hook containing zinc finger 1 | Unclassified Structure | AT-hook Domain Family
116870 | Mta1 | metastasis associated 1 | Helix-Turn-Helix Group | Myb Domain Family
116870 | Mta1 | metastasis associated 1 | Zinc-coordinating Group | GATA Domain Family
223922 | Atf7 | activating transcription factor 7 | Zinc-coordinating Group | BetaBetaAlpha-zinc finger Family
223922 | Atf7 | activating transcription factor 7 | Zipper-Type Group | Leucine Zipper Family
214162 | Mll1 | myeloid/lymphoid or mixed-lineage leukemia 1 | Zinc-coordinating Group | Non Methyl-CpG-binding CXXC Domain
214162 | Mll1 | myeloid/lymphoid or mixed-lineage leukemia 1 | Unclassified Structure | AT-hook Domain Family

Table S8. DNA-binding transcription factors with no detected DNA-binding domains

Gene ID | Gene Symbol | Gene Description
106389 | Eaf2 | ELL associated factor 2
232906 | Grlf1 | glucocorticoid receptor DNA binding factor 1
56461 | Kcnip3 | calsenilin, presenilin binding protein, EF hand transcription factor
104338 | Mynf1 | myeloid nuclear factor 1
11614 | Nr0b1 | nuclear receptor subfamily 0, group B, member 1
74451 | Pgs1 | phosphatidylglycerophosphate synthase 1
50907 | Preb | prolactin regulatory element binding
79401 | Spz1 | spermatogenic Zip 1
21411 | Tcf20 | transcription factor 20
57432 | Zc3h8 | zinc finger CCCH type containing 8

Table S9. DNA-binding transcription factors with no detected Protein Family class

Gene ID | Gene Symbol | Gene Description | Predicted Protein Group
11545 | Parp1 | poly (ADP-ribose) polymerase family, member 1 | Zinc-coordinating Group
14056 | Ezh2 | enhancer of zeste homolog 2 (Drosophila) | Helix-Turn-Helix Group
21804 | Tgfb1i1 | transforming growth factor beta 1 induced transcript 1 | Zinc-coordinating Group
245583 | Tgif2lx | TGFB-induced factor homeobox 2-like, X-linked | Helix-Turn-Helix Group

Table S10. Single-stranded DNA-binding transcription factors

Gene ID | Gene Symbol | Description
245000 | Atr | ataxia telangiectasia and rad3 related
13194 | Ddb1 | damage specific DNA binding protein 1
107986 | Ddb2 | damage specific DNA binding protein 2
15384 | Hnrpab | heterogeneous nuclear ribonucleoprotein A/B
50926 | Hnrpdl | heterogeneous nuclear ribonucleoprotein D-like
15387 | Hnrpk | heterogeneous nuclear ribonucleoprotein K
17876 | Myef2 | myelin basic protein expression factor 2, repressor
74164 | Nfx1 | nuclear transcription factor, X-box binding 1
18148 | Npm1 | nucleophosmin 1
23983 | Pcbp1 | poly(rC) binding protein 1
18521 | Pcbp2 | poly(rC) binding protein 2
19290 | Pura | purine rich element binding protein A
19291 | Purb | purine rich element binding protein B
56381 | Spen | SPEN homolog, transcriptional regulator (Drosophila)
106021 | Topors | topoisomerase I binding, arginine/serine-rich
22608 | Ybx1 | Y box protein 1

Table S11. Summary of curated TFCat TFs in Gene Ontology (GO)

GO Annotation Type | Molecular Function Sub-Tree
Excluding GO annotation types IEA (Inferred from Electronic Annotations), ISS (Inferred from Sequence or Structural Similarity Annotations), and RCA (Inferred from Reviewed Computational Analysis) | 343 / 882
All GO annotation types (including IEA, ISS, RCA) | 409 / 882

Table S12. Comparison summary of TFCat classified HMM DNA-binding domains with the DBD database (DBDdb) resource

 | Found in DBD | Not found in DBD
Classified in TFCat | 116 | 16
Not classified in TFCat | 68 |

Table S13.
Superfamily DNA binding domain list comparison with DBD Database (DBDdb) resource Model ID Domain Name In DBD In TFCat 39848 A DNA-binding domain in eukaryotic transcription factors Y Y 43644 A DNA-binding domain in eukaryotic transcription factors Y Y 35817 ARID-like Y 43437 ARID-like Y 34823 C2H2 and C2HC zinc fingers Y Y 34824 C2H2 and C2HC zinc fingers Y 34825 C2H2 and C2HC zinc fingers Y Y 34826 C2H2 and C2HC zinc fingers Y Y 35441 C2H2 and C2HC zinc fingers Y Y 35556 C2H2 and C2HC zinc fingers Y Y 37351 C2H2 and C2HC zinc fingers Y Y 37782 C2H2 and C2HC zinc fingers Y Y 40545 C2H2 and C2HC zinc fingers Y Y 41311 C2H2 and C2HC zinc fingers Y Y 41429 C2H2 and C2HC zinc fingers Y Y 42182 C2H2 and C2HC zinc fingers Y Y 42220 C2H2 and C2HC zinc fingers Y 43688 C2H2 and C2HC zinc fingers Y 43689 C2H2 and C2HC zinc fingers Y 43730 C2H2 and C2HC zinc fingers Y Y 43976 C2H2 and C2HC zinc fingers Y Y 43977 C2H2 and C2HC zinc fingers Y 43978 C2H2 and C2HC zinc fingers Y Y 43982 C2H2 and C2HC zinc fingers Y 43983 C2H2 and C2HC zinc fingers Y 43984 C2H2 and C2HC zinc fingers Y 44259 C2H2 and C2HC zinc fingers Y 44260 C2H2 and C2HC zinc fingers Y 44261 C2H2 and C2HC zinc fingers Y Y 45110 C2H2 and C2HC zinc fingers Y Y 45118 C2H2 and C2HC zinc fingers Y 45151 C2H2 and C2HC zinc fingers Y 45152 C2H2 and C2HC zinc fingers Y Y 45249 C2H2 and C2HC zinc fingers Y Y 45250 C2H2 and C2HC zinc fingers Y 45293 C2H2 and C2HC zinc fingers Y Y 45294 C2H2 and C2HC zinc fingers Y 45295 C2H2 and C2HC zinc fingers Y 45296 C2H2 and C2HC zinc fingers Y 45297 C2H2 and C2HC zinc fingers Y Y 45613 C2H2 and C2HC zinc fingers Y 280 Table S13. 
Superfamily DNA binding domain list comparison with DBD Database (DBDdb) resource (continued) Model ID Domain Name In DBD In TFCat 45631 C2H2 and C2HC zinc fingers Y Y 42508 CCHHC domain Y Y 40609 Cysteine-rich DNA binding domain, (DM domain) Y 36316 DNA-binding domain Y Y 42826 DNA-binding domain Y 41800 GCM domain Y Y 36002 Glucocorticoid receptor-like (DNA-binding domain) Y Y 36583 Glucocorticoid receptor-like (DNA-binding domain) Y 40006 Glucocorticoid receptor-like (DNA-binding domain) Y 40440 Glucocorticoid receptor-like (DNA-binding domain) Y 40589 Glucocorticoid receptor-like (DNA-binding domain) Y Y 45290 Glucocorticoid receptor-like (DNA-binding domain) Y Y 45386 Glucocorticoid receptor-like (DNA-binding domain) Y Y 45592 Glucocorticoid receptor-like (DNA-binding domain) Y Y 34803 HLH, helix-loop-helix DNA-binding domain Y Y 35101 HLH, helix-loop-helix DNA-binding domain Y Y 35112 HLH, helix-loop-helix DNA-binding domain Y Y 35113 HLH, helix-loop-helix DNA-binding domain Y Y 38629 HLH, helix-loop-helix DNA-binding domain Y Y 40898 HLH, helix-loop-helix DNA-binding domain Y Y 41437 HLH, helix-loop-helix DNA-binding domain Y Y 41452 HLH, helix-loop-helix DNA-binding domain Y Y 43065 HLH, helix-loop-helix DNA-binding domain Y Y 34886 Homeodomain-like Y Y 34887 Homeodomain-like Y 35079 Homeodomain-like Y Y 35379 Homeodomain-like Y Y 35402 Homeodomain-like Y Y 35403 Homeodomain-like Y 35741 Homeodomain-like Y 36604 Homeodomain-like Y 37777 Homeodomain-like Y 37778 Homeodomain-like Y Y 38474 Homeodomain-like Y 281 Table S13. 
Superfamily DNA binding domain list (continued) Model ID Domain Name In DBD In TFCat 38986 Homeodomain-like Y Y 40485 Homeodomain-like Y Y 40874 Homeodomain-like Y Y 40945 Homeodomain-like Y 41016 Homeodomain-like Y Y 42267 Homeodomain-like Y Y 42468 Homeodomain-like Y Y 44368 Homeodomain-like Y Y 44899 Homeodomain-like Y 45311 Homeodomain-like Y 45348 Homeodomain-like Y Y 45634 Homeodomain-like Y 35201 lambda repressor-like DNA-binding domains Y Y 36752 lambda repressor-like DNA-binding domains Y Y 38966 lambda repressor-like DNA-binding domains Y Y 43707 lambda repressor-like DNA-binding domains Y 37961 Nucleic acid-binding proteins Y 38488 Nucleic acid-binding proteins Y 40961 Nucleic acid-binding proteins Y 34796 p53-like transcription factors Y Y 34855 p53-like transcription factors Y Y 35512 p53-like transcription factors Y Y 35525 p53-like transcription factors Y Y 36065 p53-like transcription factors Y Y 38119 p53-like transcription factors Y Y 38434 p53-like transcription factors Y Y 39062 p53-like transcription factors Y Y 41017 p53-like transcription factors Y 41370 p53-like transcription factors Y Y 44112 p53-like transcription factors Y Y 45029 p53-like transcription factors Y Y 45080 p53-like transcription factors Y Y 38425 SAND domain-like Y Y 41973 SAND domain-like Y Y 44332 SAND domain-like Y 40928 SMAD MH1 domain Y Y 41015 SRF-like Y 282 Table S13. 
Superfamily DNA binding domain list (continued) Model ID Domain Name In DBD In TFCat 41235 SRF-like Y Y 43734 SRF-like Y Y 35524 STAT Y Y 35881 Transcriptional factor tubby, C-terminal domain Y 35957 Winged helix DNA-binding domain Y Y 35958 Winged helix DNA-binding domain Y Y 36285 Winged helix DNA-binding domain Y 36540 Winged helix DNA-binding domain Y Y 36707 Winged helix DNA-binding domain Y 37479 Winged helix DNA-binding domain Y 38516 Winged helix DNA-binding domain Y Y 38616 Winged helix DNA-binding domain Y Y 38992 Winged helix DNA-binding domain Y Y 39808 Winged helix DNA-binding domain Y 40891 Winged helix DNA-binding domain Y Y 45310 Winged helix DNA-binding domain Y Y 45339 Winged helix DNA-binding domain Y Y 283 Table S14. PFAM DNA binding domain list comparison with DBD Database (DBDdb) Model ID Model Name Model Description In DBD In TFCat PF05110 AF-4 AF-4 proto-oncoprotein Y Y PF01586 Basic Myogenic Basic domain Y PF00170 bZIP_1 bZIP transcription factor Y Y PF07716 bZIP_2 Basic region leucine zipper – bZIP_2 Y Y PF03131 bZIP_Maf bZIP Maf transcription factor Y Y PF02045 CBFB_NFYA CCAAT-binding transcription factor (CBF-B/NF-YA) subunit B Y Y PF00808 CBFD_NFYB_HMF Histone-like transcription factor (CBF/NF-Y) and archaeal histone Y PF03859 CG-1 CG-1 domain Y PF06573 Churchill Churchill protein Y PF04516 CP2 CP2 transcription factor Y Y PF00313 CSD 'Cold-shock' DNA-binding domain Y PF02376 CUT CUT domain Y Y PF02791 DDT DDT domain Y Y PF00751 DM DM DNA binding domain Y PF02319 E2F_TDP E2F/DP family winged-helix DNA-binding domain Y Y PF00178 Ets Ets-domain Y Y PF06818 Fez1 Fez1 Y PF00250 Fork_head Fork head domain Y Y PF00320 GATA GATA zinc finger Y Y PF03615 GCM GCM motif protein Y Y PF06320 GCN5L1 GCN5-like protein 1 (GCN5L1) Y PF00010 HLH Helix-loop-helix DNA-binding domain Y Y PF00046 Homeobox Homeobox domain Y Y PF00447 HSF_DNA-bind HSF-type DNA-binding Y Y PF08279 HTH_11 HTH domain Y PF01381 HTH_3 Helix-turn-helix Y PF05225 HTH_psq 
helix-turn-helix, Psq domain Y Y PF00605 IRF Interferon regulatory factor transcription factor Y Y PF01056 Myc_N Myc amino-terminal region Y PF05224 NDT80_PhoG NDT80 / PhoG like DNA-binding family Y PF04054 Not1 CCR4-Not complex component, Not1 Y PF00870 P53 P53 DNA-binding domain Y Y PF00292 PAX 'Paired box' domain Y Y PF00157 Pou Pou domain - N-terminal to homeobox domain Y Y PF05044 Prox1 Homeobox prospero-like protein (PROX1) Y PF02257 RFX_DNA_binding RFX DNA-binding domain Y Y PF00554 RHD Rel homology domain (RHD) Y Y PF00853 Runt Runt domain Y Y PF01342 SAND SAND domain Y Y PF03343 SART-1 SART-1 family Y PF07093 SGT1 SGT1 protein Y 284 Table S14. PFAM DNA binding domain list comparison with DBD Database (DBDdb) (continued) Model ID Model Name Model Description In DBD In TFCat PF00319 SRF-TF SRF-type transcription factor (DNA-binding and dimerisation domain) Y Y PF02864 STAT_bind STAT protein, DNA binding domain Y Y PF00907 T-box T-box Y Y PF01285 TEA TEA/ATTS domain family Y Y PF03299 TF_AP-2 Transcription factor AP-2 Y Y PF01167 Tub Tub family Y PF05764 YL1 YL1 nuclear protein Y PF01754 zf-A20 A20-like zinc finger Y PF02892 zf-BED BED zinc finger Y PF00096 zf-C2H2 Zinc finger, C2H2 type Y Y PF01530 zf-C2HC Zinc finger, C2HC type Y Y PF00105 zf-C4 Zinc finger, C4 type (two domains) Y Y PF02928 zf-C5HC2 C5HC2 zinc finger Y PF06839 zf-GRF GRF zinc finger Y PF02891 zf-MIZ MIZ zinc finger Y PF01422 zf-NF-X1 NF-X1 type zinc finger Y PF02135 zf-TAZ TAZ zinc finger Y PF01388 ARID ARID/BRIGHT DNA binding domain Y PF02178 AT_hook AT hook motif Y PF09270 Beta-trefoil Beta-trefoil Y PF00859 CTF_NF1 CTF/NF-I family transcription modulation region Y PF02946 GTF2I GTF2I-like repeat Y PF00505 HMG_box HMG (high mobility group) box Y PF09271 LAG1-DNAbind LAG1, DNA binding Y PF01429 MBD Methyl-CpG binding domain Y PF03165 MH1 MH1 domain Y PF00249 Myb_DNA-binding Myb-like DNA-binding domain Y PF09111 SLIDE SLIDE Y PF00352 TBP Transcription factor TFIID (or TATA-binding protein, 
TBP) Y PF03529 TF_Otx Otx1 transcription factor Y PF02008 zf-CXXC CXXC zinc finger domain Y 285 Table S15. Fox family gene test set Gene ID Gene Symbol Gene Description 15375 Foxa1 forkhead box A1 15376 Foxa2 forkhead box A2 15377 Foxa3 forkhead box A3 64290 Foxb1 forkhead box B1 14240 Foxb2 forkhead box B2 17300 Foxc1 forkhead box C1 14234 Foxc2 forkhead box C2 15229 Foxd1 forkhead box D1 17301 Foxd2 forkhead box D2 15221 Foxd3 forkhead box D3 14237 Foxd4 forkhead box D4 110805 Foxe1 forkhead box E1 30923 Foxe3 forkhead box E3 15227 Foxf1a forkhead box F1a 14238 Foxf2 forkhead box F2 15228 Foxg1 forkhead box G1 14106 Foxh1 forkhead box H1 14233 Foxi1 forkhead box I1 270004 Foxi2 forkhead box I2 15223 Foxj1 forkhead box J1 60611 Foxj2 forkhead box J2 230700 Foxj3 forkhead box J3 17425 Foxk1 forkhead box K1 68837 Foxk2 forkhead box K2 14241 Foxl1 forkhead box L1 26927 Foxl2 forkhead box L2 14235 Foxm1 forkhead box M1 15218 Foxn1 forkhead box N1 14236 Foxn2 forkhead box N2 71375 Foxn3 forkhead box N3 116810 Foxn4 forkhead box N4 56458 Foxo1 forkhead box O1 56484 Foxo3a forkhead box O3a 54601 Foxo4 forkhead box O4 329934 Foxo6 forkhead box O6 108655 Foxp1 forkhead box P1 114142 Foxp2 forkhead box P2 286 Table S15. Fox family gene test set (continued) Gene ID Gene Symbol Gene Description 20371 Foxp3 forkhead box P3 74123 Foxp4 forkhead box P4 15220 Foxq1 forkhead box Q1 Table S16. 
Sox family gene test set

Gene ID | Gene Symbol | Gene Description
20665 | Sox10 | SRY-box containing gene 10
20666 | Sox11 | SRY-box containing gene 11
20664 | Sox1 | SRY-box containing gene 1
20667 | Sox12 | SRY-box containing gene 12
20668 | Sox13 | SRY-box containing gene 13
20669 | Sox14 | SRY-box containing gene 14
20670 | Sox15 | SRY-box containing gene 15
20671 | Sox17 | SRY-box containing gene 17
20672 | Sox18 | SRY-box containing gene 18
223227 | Sox21 | SRY-box containing gene 21
20674 | Sox2 | SRY-box containing gene 2
214105 | Sox30 | SRY-box containing gene 30
20675 | Sox3 | SRY-box containing gene 3
20677 | Sox4 | SRY-box containing gene 4
20678 | Sox5 | SRY-box containing gene 5
20679 | Sox6 | SRY-box containing gene 6
20680 | Sox7 | SRY-box containing gene 7
20681 | Sox8 | SRY-box containing gene 8
20682 | Sox9 | SRY-box containing gene 9
21674 | Sry | sex determining region of Chr Y

Appendix 1: Figures

Figure S1. Overlap of initial datasets
The gene identifiers were evaluated for overlap in both the Union of Putative TFs (UPTF) set (Venn diagram, Figure S1 Panel A) and the Transcription Factor Candidates (TFC) set (Venn diagram, Figure S1 Panel B).
[Figure S1: Panels A and B]

Figure S2. Analysis of cluster pruning methods using the Fox test set: plots of cluster sensitivity (proportion of members of the Fox test set in a cluster), cluster specificity (number of cluster members that are members of the Fox test set), and cluster size across increasing I’s derived from two different pruning methods. Each line in a panel represents the evaluation of one cluster containing one or more Fox test set genes. Each specific cluster is represented by the same line color and point symbol combination across all plots in the test set evaluation (some line attributes cannot be easily distinguished when they overlap on a plot).
Panel A: Cluster sensitivity values for Fox clusters across increasing I’s (x-axis) when clusters are pruned using only I’s thresholds.
Panel B: Cluster specificity values for Fox clusters across increasing I’s when clusters are pruned using only I’s thresholds.
Panel C: Resulting cluster sizes (cardinalities) for Fox clusters across increasing I’s when clusters are pruned using only I’s thresholds.
Panel D: Cluster sensitivity values for Fox clusters across increasing I’s (x-axis) when clusters are pruned using domain matching as the primary criterion, with I’s thresholds applied secondarily when the domain-matching criterion is not satisfied.
Panel E: Cluster specificity values for Fox clusters across increasing I’s when clusters are pruned using domain matching as the primary criterion, with I’s thresholds applied secondarily when the domain-matching criterion is not satisfied.
Panel F: Resulting cluster sizes for Fox clusters across increasing I’s when clusters are pruned using domain matching as the primary criterion, with I’s thresholds applied secondarily when the domain-matching criterion is not satisfied.
[Figure S2: Panels A to F]

Figure S3. Analysis of cluster pruning methods using the Sox test set: plots of cluster sensitivity (proportion of members of the Sox test set in a cluster), cluster specificity (number of cluster members that are members of the Sox test set), and cluster size across increasing I’s derived from two different pruning methods. Each line in a panel represents the evaluation of one cluster containing one or more Sox test set genes. Each specific cluster is represented by the same line color and point symbol combination across all plots in the test set evaluation (some line attributes cannot be easily distinguished when they overlap on a plot).
Panel A: Cluster sensitivity values for Sox clusters across increasing I’s (x-axis) when clusters are pruned using only I’s thresholds.
Panel B: Cluster specificity values for Sox clusters across increasing I’s when clusters are pruned using only I’s thresholds.
Panel C: Resulting cluster sizes (cardinalities) for Sox clusters across increasing I’s when clusters are pruned using only I’s thresholds.
Panel D: Cluster sensitivity values for Sox clusters across increasing I’s when clusters are pruned using domain matching as the primary criterion, with I’s thresholds applied secondarily when the domain-matching criterion is not satisfied.
Panel E: Cluster specificity values for Sox clusters across increasing I’s when clusters are pruned using domain matching as the primary criterion, with I’s thresholds applied secondarily when the domain-matching criterion is not satisfied.
Panel F: Resulting cluster sizes for Sox clusters across increasing I’s when clusters are pruned using domain matching as the primary criterion, with I’s thresholds applied secondarily when the domain-matching criterion is not satisfied.
[Figure S3: Panels A to F]

Figure S4. TFCat annotation workflow
1. Annotation feedback is recorded by a wiki user.
2. A TFCatWiki annotator is assigned the report in the tracking system.
3. TFCat data are updated through the annotation tool and the action is posted on the wiki feedback system.
4. DBD classification and homolog information are updated (if necessary).
5. Wiki pages are automatically updated with the revised gene annotation data.
6. An e-mail is sent to the wiki user community regarding the annotation revisions.

Figure S5. Screen shots of the backend web-based TFCat Annotation Tool.
Panel A: Each reviewer has password-protected access to a full or partial list of genes assigned to their annotation queue.
Panel B: One or more genes may be selected for annotation review/update.
Panel C: The gene annotation page facilitates recording of PubMed article reviews and entry of functional taxa and TF judgment assignments. The tool also provides direct access to additional web-based gene information and literature resources to facilitate the curation.
[Figure S5: Panels A to C]

Figure S6. Final cluster membership for the Sox-containing test set genes. Genes that are members of the test set are colored in yellow.
Panel A: Final Sox cluster membership when clusters are pruned using domain matching as the primary criterion and I’s thresholds as the secondary criterion, in conjunction with an approximation method for merging proportionally linked clusters.
Panel B: Example of final Sox-containing merged clusters when only the I’s threshold method is applied (using an I’s value of 0.21).
[Figure S6: Panels A and B]

Figure S7. Final cluster membership for the Fox-containing test set genes. Genes that are members of the test set are colored in yellow.
Panel A: Final Fox cluster membership when clusters are pruned using domain matching as the primary criterion and I’s thresholds as the secondary criterion, in conjunction with an approximation method for merging proportionally linked clusters.
Panel B: Example of final Fox merged clusters when only the I’s threshold method is applied (using an I’s value of 0.21).
[Figure S7: Panels A and B]

Figure S8. Example of pruned Fox-containing clusters generated using the I’s-only method with a threshold of 0.21.
[Figure S8]

8.2. Appendix 2: supplementary for chapter 5

Regulatory Resolution Score

Identification of genomic sequence boundaries for regulatory resolution scoring

For regulatory resolution scoring of each gene in the Ensembl human genome database (version 46) [1], we define the transcription start site (TSS) as the start position of the 5′-most exon annotated for the gene. We then determine the boundaries of the region to be analyzed relative to the TSS as follows. In most cases, the upstream boundary is defined as the start or end position of the upstream gene (depending on the orientation of the upstream gene). If the upstream gene is less than 1 kb from the TSS of the gene of interest, we extend the analysis to introns of the upstream gene located within 10 kb of the TSS.
In most cases, the downstream boundary is the end of the gene of interest. However, if the gene is longer than 30 kb from the TSS to the end of the last exon, intronic regions within 30 kb downstream of the TSS are used. Conversely, we include 2 kb of sequence downstream of the TSS for genes shorter than 2 kb in length.

Non-exonic conserved regions

PhastCons scores and PhastCons “conserved elements” computed from comparisons of 17-way vertebrate multi-species alignments [2] were downloaded from the UCSC Genome Browser database [3]. Only PhastCons conserved elements that are both 20 bp or longer and non-overlapping with annotated human mRNAs or Ensembl human gene annotations are retained for analysis. PhastCons conserved elements separated by less than 100 bp are chained together (excluding the intervening regions) and thereafter considered part of a single longer conserved region.

Score definition

We define a raw regulatory resolution score as:

\[ \text{raw score} = \log_{10}\left( \frac{\sum_{n} l\,(c - b)}{n^{2}} \right) \]

where l is the length of the conserved region, c is the “conservation level” of the conserved region (i.e., the mean phastCons score for the conserved region), b is the baseline conservation level (i.e., the mean phastCons score for the entire genomic segment analyzed), n is the number of conserved regions, and the sum runs over the n conserved regions. Thus, for each conserved region, we consider the amount of conserved sequence and how well distinguished the region is from the background, and we penalize genes with many conserved regions. After computing the raw score, we normalize it to obtain a value between 0 and 1 using the following formula:

\[ \text{normalized score} = \frac{\text{raw score} - \min(\text{raw score})}{\max(\text{raw score}) - \min(\text{raw score})} \]

Thus, a score of 0 indicates a gene with little regulatory resolution and a score of 1 indicates a highly resolved gene.

Genome-wide distribution of regulatory resolution scores

Regulatory resolution scores were computed for all human genes as reported in the Ensembl annotations (Fig. S1).
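The filtering, chaining, and scoring steps described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the code used for the thesis: the function names, the tuple layout for conserved elements, the length-weighted mean used for the conservation level of a chained region, and the handling of genes whose summed term is not positive are all assumptions made for the example.

```python
import math
from typing import List, Optional, Tuple

# Hypothetical layout: (start, end, mean phastCons score), half-open coordinates.
Element = Tuple[int, int, float]


def chain_elements(elements: List[Element], min_len: int = 20,
                   max_gap: int = 100) -> List[Tuple[int, float]]:
    """Keep elements of at least min_len bp, then chain neighbours separated
    by fewer than max_gap bp. Returns (conserved length, conservation level)
    per region; the intervening gaps contribute to neither value."""
    kept = sorted(e for e in elements if e[1] - e[0] >= min_len)
    regions: List[List[Element]] = []
    for elem in kept:
        if regions and elem[0] - regions[-1][-1][1] < max_gap:
            regions[-1].append(elem)
        else:
            regions.append([elem])
    summary = []
    for region in regions:
        length = sum(end - start for start, end, _ in region)
        # Assumption: length-weighted mean score of the chained elements.
        weighted = sum((end - start) * score for start, end, score in region)
        summary.append((length, weighted / length))
    return summary


def raw_score(regions: List[Tuple[int, float]], baseline: float) -> Optional[float]:
    """log10 of the summed l * (c - b) over regions, divided by n squared."""
    n = len(regions)
    if n == 0:
        return None  # no conserved regions: the gene cannot be scored
    total = sum(length * (level - baseline) for length, level in regions)
    if total <= 0:
        return None  # assumption: the logarithm is undefined, so leave unscored
    return math.log10(total / n ** 2)


def normalize(raw_scores: List[Optional[float]]) -> List[Optional[float]]:
    """Min-max normalization to the interval [0, 1]; None values pass through."""
    valid = [s for s in raw_scores if s is not None]
    lo, hi = min(valid), max(valid)
    if hi == lo:
        return [None if s is None else 0.0 for s in raw_scores]
    return [None if s is None else (s - lo) / (hi - lo) for s in raw_scores]
```

Dividing by n squared rather than n means that a gene with many small conserved regions is penalized even when its total amount of conserved sequence is large.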
Of 22298 genes tested, 2411 did not contain any conserved PhastCons elements (Fig. S1B) and we therefore could not compute regulatory resolution scores for these genes. The distribution of scores is skewed towards zero, with a median score of 0.34 and a mean score of 0.36 (Fig. S1A).

Figure S1. Genome-wide distribution of regulatory resolution scores. (A) Histogram of scores. (B) Summary statistics showing the score by quartiles (Qu.), as well as the median and mean score. NA, not able to score. (C) Boxplot showing the distribution of the number of conserved regions by score intervals. (D) Boxplot showing the distribution of the number of conserved bases by score intervals. The boxes in both boxplots are drawn with widths proportional to the square roots of the number of observations in the groups.

      | Min  | 1st Qu. | Median | Mean | 3rd Qu. | Max  | NA
Score | 0.00 | 0.25    | 0.34   | 0.36 | 0.45    | 1.00 | 2411

Genes with up to 5 conserved regions receive higher scores (Fig. S1C), with genes in the top 20% of scores having an average of 2.2 conserved regions per gene. The highest scores were assigned to genes with fewer than 1,000 bp of conserved non-exonic nucleotides (Fig. S1D), with an average of 330 bp of conserved sequence per gene for genes in the top 20% of scores.

Features of genes with the highest, average, and lowest regulatory resolution

The ADCK5 locus was assigned the highest regulatory resolution score due to the presence of a single 1,277 bp highly conserved region within the upstream intergenic region (Fig. S2A). Two smaller conserved regions directly upstream of ADCK5 in the “17-Way Most Cons” track are excluded from the analysis. The larger of the two overlaps with human mRNAs, while the smaller conserved element is only 10 bp long (Fig. S2A). The low baseline conservation level across the entire region further contributed to the high score. The ELOVL3 locus (Fig. S2B) is an example of an average gene, receiving a score of 0.35, close to the genome-wide mean of 0.36.
It contains four small, conserved non-exonic regions containing a total of 186 bp of sequence within the boundaries of the analysis. The lowest scoring gene, NR4A3, features 90 small conserved regions, containing a total of 9,281 bp of non-exonic sequence, that are distributed across the entire locus (Fig. S2C). The majority of the NR4A3 locus is conserved and the conservation profile reveals few insights into the location of potential regulatory regions for targeted promoter construct design.

Manual promoter curation

Promoters for 100 genes were manually assessed based on a number of gene features, including: (i) the location of the transcription start points, (ii) the boundaries of analysis, i.e., the amount of non-coding sequence to be analyzed upstream and downstream of the gene of interest, and (iii) the number and qualitative conservation level of conserved regions located proximal to the TSS within the defined boundaries. The genes were ranked from 1 to 5 based on the curators’ perception of their suitability for MiniPromoter (MiniP) design, “1” being a gene not suitable and “5” a very good candidate.

Figure S2. Genes with (A) the highest, (B) average and (C) the lowest regulatory resolution scores. Each screenshot from the UCSC Genome Browser (NCBI Build 36.1) displays: conserved non-exonic (CNE) PhastCons elements used in the analysis; UCSC gene predictions based on RefSeq, GenBank and UniProt data [3]; transcripts for Ensembl genes based on mRNA and protein evidence [1]; a dense display of human mRNAs from GenBank [4]; CpG islands (≥50% GC content, ≥200 bp in length, and an observed CG to expected CG ratio ≥ 0.6); evolutionary conservation in 17 vertebrates based on Multiz alignments [5] and PhastCons scores [2]; and predictions of conserved elements produced by the PhastCons program (17-way Most Cons).
A. ADCK5: 1277 conserved bases in 1 conserved region. Score = 1.00
B. ELOVL3: 186 conserved bases in 4 conserved regions. Score = 0.35
C. NR4A3: 9281 conserved bases in 90 conserved regions. Score = 0.00

MiniPromoter design

Figure S3. MiniPromoter design pipeline. Resources used include Pubmed (http://www.ncbi.nlm.nih.gov/pubmed), PAZAR (http://www.pazar.info), the UCSC genome browser (http://genome.ucsc.edu), ORCAtk (http://www.cisreg.ca/cgi-bin/ORCAtk/orca), the VISTA enhancer browser (http://enhancer.lbl.gov) and histone modification ChIP (chromatin immunoprecipitation) assays performed on mouse and human cortex (Jones et al. unpublished).
[Pipeline diagram: literature annotation; transcription start position selection (CAGE tags, mRNAs, ESTs, CpG islands); sequence boundaries delineation for analysis; phylogenetic footprinting and TFBS prediction; integration of all information; histone modification ChIP assays; VISTA enhancer browser]

Identification of the boundaries for analysis

The boundaries are defined similarly to the regulatory resolution score analysis, except that if one of the neighboring genes has an expression pattern similar to the gene of interest, the boundaries are extended to include the surrounding sequences of this additional gene. In a few cases the GENSAT project [6] had generated and tested BAC mice for the gene of interest and the expression pattern reported matched the endogenous expression pattern. In such cases, the BAC sequence defined the boundaries for regulatory sequence analysis.

Phylogenetic footprinting and transcription factor binding site (TFBS) prediction

The ORCA toolkit [7] was used for the following steps:
• Retrieval and alignment of human and mouse orthologous sequences within the defined boundaries;
• Computation of the non-coding conserved regions above a user-defined threshold (ranging from 50 to 85% identity in our analyses);
• Prediction of TFBS in those conserved regions for the transcription factors that have been described to be relevant for the expression of this specific gene or for expression in the brain region of interest in general.
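To make the conservation-threshold step more concrete, the sketch below scans a gapped pairwise alignment with a sliding window and reports the alignment intervals whose percent identity meets a chosen threshold. It is a simplified illustration of phylogenetic footprinting in general and is not the ORCA toolkit's implementation; the function names, the default window size, and the default threshold are assumptions made for the example.

```python
from typing import List, Tuple


def window_identity(aln_a: str, aln_b: str, window: int = 100) -> List[float]:
    """Percent identity for every window start over two gapped aligned
    sequences of equal length. Gap columns never count as matches."""
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have the same length")
    matches = [int(a == b and a != "-")
               for a, b in zip(aln_a.upper(), aln_b.upper())]
    scores = []
    running = sum(matches[:window])  # matches in the first window
    for start in range(len(matches) - window + 1):
        scores.append(100.0 * running / window)
        if start + window < len(matches):
            # Slide by one column: add the entering column, drop the leaving one.
            running += matches[start + window] - matches[start]
    return scores


def conserved_segments(scores: List[float], window: int,
                       threshold: float = 70.0) -> List[Tuple[int, int]]:
    """Merge overlapping or adjacent windows scoring at or above threshold
    into half-open (start, end) intervals in alignment columns."""
    segments: List[Tuple[int, int]] = []
    for start, score in enumerate(scores):
        if score < threshold:
            continue
        end = start + window
        if segments and start <= segments[-1][1]:
            segments[-1] = (segments[-1][0], end)
        else:
            segments.append((start, end))
    return segments
```

Raising the threshold towards the upper end of the 50 to 85% range shortens the reported segments, which reduces the sequence searched for TFBS at the cost of possibly missing weakly conserved elements.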
The TF binding models were extracted from the JASPAR database [8] or custom-generated from the PAZAR database based on the manually curated “Pleiades Genes” project [9].

Immunofluorescence labelling of Ple151-EGFP reporter gene expression

Figure S4. Ple151-EGFP reporter expression is co-localized with an oligodendrocyte marker. Double immunofluorescent staining with rabbit anti-EGFP and neuronal and glial cell markers in an adult male N2 mouse expressing the Ple151 MiniPromoter. Panel A. Double labeling of mouse cortex for EGFP (green) and NeuN, a pan-neuronal marker (red). Panel B. Double labeling of mouse cortex for EGFP (green) and Rip, an oligodendrocyte marker (red). Examples of co-localization are indicated with white arrows. Panel C. Double labeling of mouse cortex for EGFP (green) and GFAP, an astrocyte-specific cell marker (red). TOTO3 (blue) is a nuclear counterstain.

Regulatory element predictions in OLIG1 enhancer sequences

Identification of “most conserved” aligned sequences in OLIG1 construct sequences

The genomic coordinates for each of the conserved regions constituting the tested OLIG1 MiniPs were retrieved using the BLAT sequence search tool at the UCSC browser against the Human March 2006 assembly [3]. The genomic coordinates were used as input to the UCSC Table Retrieval function to extract the human sequence alignment in the 17-way multiple mammalian-species and 28-way placental mammals “most conserved” alignments (MCA), and each aligned human sequence (with gaps) was stored in FASTA format.

Transcription factor binding site (TFBS) predictions in “most conserved” sequences

Each of the conserved regions making up a MiniP was subjected to a TFBS prediction analysis.
A PERL script was developed using the TFBS PERL module [10] and the JASPAR CORE database [8] (supplemented with additional model annotations for glia-related TFBS derived from the literature: POU2F1; EGR1; EGR2; EGR3; EGR4; POU3F1; NKX2-2; NKX2-5) to evaluate vertebrate TFBS models across the most conserved sequence elements of each region using profile score threshold levels of 75% and 80%. TFBS predictions were written to BED-formatted files for each analyzed MCA.

Analysis of in vitro oligodendrocyte gene expression profile data

To identify TF candidates that could be directing OLIG1 regulation in oligodendrocytes (Appendix 2 Figure S4), we analyzed an in vitro 8-time-point oligodendrocyte differentiation dataset produced by Dugas et al. [11]. This dataset comprises gene expression profiles recorded across a differentiation time course of purified rat cortical oligodendrocyte progenitor cells (OPC) using the Affymetrix RG_U34-A, RG_U34-B, and RG_U34-C chips. 96 Affymetrix CEL files (8 time points x 4 biological replicates x 3 chips) were obtained from J.C. Dugas. We developed R code (http://www.R-project.org) and used the Bioconductor packages [12] to perform a robust multichip analysis (RMA) [13] on each chip dataset to obtain a probe-level summarization. All pairwise comparisons were subjected to a two-sample t-test with a random variance model [14] implemented in the BRB-array software (http://linus.nci.nih.gov/~brb/). The rat Affymetrix chip probes were mapped to Entrez rat genes using Bioconductor packages. The rat Entrez Genes were mapped to mouse Entrez Genes (where possible) using Homologene [15]. A set of mouse TF genes [16] was mapped to the rat Affymetrix probes. PERL software was written to convert all HTML-formatted expression analysis results to text files and to extract and report all significantly (P-value <= 0.001) differentially expressed genes across the pairwise expression profiles; the mapped TF genes in this set were then identified.
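The final filtering and mapping step can be illustrated with a short Python sketch. The thesis work used PERL and R for this step, so the code below is only an assumed re-expression of the logic: keep pairwise comparisons whose P-value is at or below the cut-off (0.001 in the text), map each rat gene to its mouse homolog where one exists, and report the genes that belong to the TF gene set. The function name, the record layout, and the toy identifiers in the usage note are hypothetical.

```python
from typing import Dict, Iterable, List, Set, Tuple

# Hypothetical record layout: (rat gene, pairwise comparison label, P-value).
Result = Tuple[str, str, float]


def significant_tf_genes(results: Iterable[Result],
                         rat_to_mouse: Dict[str, str],
                         mouse_tf_genes: Set[str],
                         alpha: float = 0.001) -> Dict[str, List[str]]:
    """Return mouse TF genes that are differentially expressed in at least one
    pairwise comparison, with the list of comparisons supporting each gene."""
    hits: Dict[str, List[str]] = {}
    for rat_gene, comparison, p_value in results:
        if p_value > alpha:
            continue  # not significant at the chosen cut-off
        mouse_gene = rat_to_mouse.get(rat_gene)  # None when no homolog is mapped
        if mouse_gene in mouse_tf_genes:
            hits.setdefault(mouse_gene, []).append(comparison)
    return hits
```

With made-up inputs such as `[("Egr1", "OPC_vs_D2", 0.0004), ("Fos", "OPC_vs_D2", 0.02)]`, an identity homolog map, and a TF set containing both genes, only `Egr1` would be reported, because the `Fos` comparison exceeds the cut-off.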
Evaluation of TFBS predictions

The sets of TFBS predictions identified in the positively expressed MiniP construct sequence (Ple151) and in the negative MiniP construct sequences (i.e., those constructs that had no reporter gene expression: Ple148 and Ple150) were compared using a PERL script to identify the predictions that were unique to the positive MiniP (Ple151). The TFBS predictions found exclusively in the Ple151 positive construct were evaluated against the expression profile analysis results and these data were summarized (Tables S1, S2, S3, S4). The following abbreviations are used in the tables: JP: JASPAR profile name, included if different from the TF HUGO gene name; OPC: oligodendrocyte progenitor cells; D2: day 2 time point in the Dugas et al. dataset; D7: day 7 time point in the Dugas et al. dataset; D9: day 9 time point in the Dugas et al. dataset; OLs: oligodendrocytes.

Table S1. List of predicted TFBS that were unique to the positive Ple151 construct (17-way-mammals, TFBS score 80%, most-conserved sequences)
TF Predictions | TFs Differentially Expressed (U34A-C Chips)
POU2F1 | No
RORA (JP: RORA_2) | Probe mapping not available
FOS | Yes - FOSL1 (FRA-1) expression is down-regulated between OPC and D2
HINFP (JP: MIZF) | Probe mapping not available
HLF | No
T brachyury | Probe mapping not available
ELK4 | Probe mapping not available
MAX | No
E2F1 | Probe mapping not available
PAX5 | Probe mapping not available
TEAD1 | Probe mapping not available
FOXD1 | No
NR3C1 | Yes - down-regulated between D9 and acute OLs
CREB1 | No

Table S2. List of predicted TFBS that were unique to the positive Ple151 construct (17-way-mammals, TFBS score 75%, most-conserved sequences)
TF Predictions | TFs Differentially Expressed (U34A-C Chips)
ZNF143 (JP: Staf) | Probe mapping not available
RXRA-VDR | No
RORA (JP: RORA_2) | Probe mapping not available
STAT1 | Yes - up-regulated between OPC and D2
T brachyury | Probe mapping not available
EVI1 | No
RREB1 | No
GLI1 | Probe mapping not available
IRF2 | No
PAX4 | No
EGR1 | Yes - down-regulated from OPCs over time and up-regulated at D7-D9
E2F1 | Probe mapping not available
EGR4 | No
AR | No
NR3C1 | Yes - down-regulated between D9 and acute OLs

Table S3. List of predicted TFBS that were unique to the positive Ple151 construct (28-way-placental mammals, TFBS score 80%, most-conserved sequences)
TF Predictions | TFs Differentially Expressed (U34A-C Chips)
POU2F1 | No
ELK1 | No
RORA (JP: RORA_2) | Probe mapping not available
FOXF2 | Probe mapping not available
ZNF423 (JP: Roax) | No
HINFP (JP: MIZF) | Probe mapping not available
HLF | No
GABPA | Probe mapping not available
NKX2-2 | Probe mapping not available
ELK4 | Probe mapping not available
NFYA | No
IRF2 | No
E2F1 | Probe mapping not available
DDIT3 (JP: Ddit3-Cebpa) | No
PAX5 | Probe mapping not available
NKX3-1 | Probe mapping not available
TEAD1 | Probe mapping not available
FOXD1 | No
NR3C1 | Yes - down-regulated between D9 and acute OLs
NKX3-2 (JP: Bapx1) | Probe mapping not available

Table S4. List of predicted TFBS that were unique to the positive Ple151 construct (28-way-placental mammals, TFBS score 75%, most-conserved sequences)
TF Predictions | TFs Differentially Expressed (U34A-C Chips)
POU2F1 | No
ZNF143 (JP: Staf) | Probe mapping not available
RXRA (JP: RXRA-VDR) | No
RORA (JP: RORA_2) | Probe mapping not available
STAT1 | Yes - up-regulated between OPC and D2
GABPA | Probe mapping not available
NFKB1 | Yes - down-regulated between OPC and D2
RREB1 | No
IRF2 | No
PAX4 | No
EGR1 | Yes - Egr1/Krox24 expression is down-regulated after OPC time point
E2F1 | Probe mapping not available
EGR4 | No
PAX5 | Probe mapping not available
AR | No
NR3C1 | Yes - down-regulated between D9 and acute OLs

Prioritization of candidate TFBS

The compiled TFBS predictions and expression data analyses were reviewed to rank the TFBS candidates. TFBS predictions that were unique to the positive Ple151 construct and supported by both differential gene expression and corroborating literature evidence were reported (Table S5).

Table S5. Predicted TFBS candidates with differential gene expression and literature evidence support
TF Predictions | TFs Differentially Expressed (U34A-C Chips) | Literature Evidence
EGR1 | Yes - EGR1 (KROX-24) expression is down-regulated after OPC time point | EGR1 (KROX-24) may be involved in the initial oligodendrocyte differentiation primary response [17]
FOS | Yes - FOSL1 (FRA-1) expression is down-regulated between OPC and D2 | The AP-1 family of TFs may play a role in oligodendrocyte differentiation [18]

Testing of Putative ADORA2A Transcription Start Sites

Primers were designed to generate PCR products (200 to 800 bp) within unique initial exons of four ADORA2A predicted transcripts with a common reverse primer (primers listed below). Initial testing of each primer pair was undertaken using a commercially available total human brain cDNA (Ambion/Applied Biosystems).
Upon detection of a PCR product of the predicted size, the primer pair was then used for PCR analyses of a panel of first strand cDNA from human brain regions (Origene Technologies). The panel was derived from 12 regions of the human brain (frontal lobe, temporal lobe, cerebellum, hippocampus, substantia nigra, caudate nucleus, amygdala, thalamus, hypothalamus, pons, medulla, and spinal cord) and arrayed in 4 different amounts (1 ng, 0.1 ng, 10 pg, and 1 pg) on a 48 well plate. PCR reactions were set up according to the manufacturer’s instructions. In summary, 25 µl aliquots of a PCR master mix containing Platinum PCR SuperMix (Invitrogen) and one of the four primer pairs to be tested were transferred to a 48 well brain panel plate, incubated on ice for 15 minutes with occasional vortexing to dissolve the lyophilized cDNA samples in each well, and then centrifuged at 2000 rpm for 2 minutes. PCR conditions varied but in general included a single denaturation step at 94 °C for 2 minutes, followed by 35 cycles, each consisting of a 30 second annealing step at varying temperature and a 1 minute extension at 72 °C, and a final extension step of 5 minutes at 72 °C. The readout consisted of a scan of a 2% agarose gel containing samples from all brain regions represented in the panel, with at least 2 different amounts of each cDNA sample loaded. The forward primers are 5′-TTGTCCTTTCACAGGGCG-3′ (ADORA2A-BL_V1), 5′-CTTTCAGCACAGCGTGGG-3′ (ADORA2A-DL_V1), 5′-CCGAGACAGCGGGAGC-3′ (ADORA2A-EL_V1), and 5′-CCAGAGCCTTGGGATTACAG-3′ (ADORA2A-FL_V1), and the reverse primer is 5′-GCGGATGGCAATGTAGCG-3′ (ADORA2A-CODE_2R_V1).

Mouse strain availability

All procedures involving animals were in accordance with the Canadian Council on Animal Care (CCAC) and UBC Animal Care Committee (ACC) (Protocol# A05-1258 and A05-1748).

Table S6. List of the Ple mouse strains presented here
Strain Name | JAX Catalog Number | MGI number
B6.129P2-Hprt1/J | 9348 | TBA
B6.129P2-Hprt1/J | 9353 | TBA
B6.129P2-Hprt1/J | 9113 | TBA
B6.129P2-Hprt1/J | 8706 | 3830628
B6.129P2-Hprt1/J | 9119 | TBA
B6.129P2-Hprt1/J | 9118 | TBA
B6.129P2-Hprt1/J | 9114 | TBA
B6.129P2-Hprt1/J | 9115 | TBA
B6.129P2-Hprt1/J | 8708 | 3830630
B6.129P2-Hprt1/J | 8710 | TBA
B6.129P2-Hprt1/J | 9116 | TBA
B6.129P2-Hprt1/J | 8876 | TBA
B6.129P2-Hprt1/J | 8877 | TBA
B6.129P2-Hprt1/J | 8709 | TBA
B6.129P2-Hprt1/J | 8707 | 3830629
B6.129P2-Hprt1/J | 9080 | TBA
B6.129P2-Hprt1/J | 9060 | TBA
TBA = to be assigned

In vitro neural differentiation RT-PCR primers and conditions

Table S7. List of RT-PCR primers and conditions used in the in vitro neural differentiation assay
Gene | Primer name: primer sequence (5′→3′) | Product size | Input RNA | Cycles, Temp
Dcx | oEMS2910: CAAACTGGAAACCGGAGTTGTC; oEMS2911: CACAAGCAATGAACACATCATCAT | 104 bp | 200 ng | 32, 60 °C
EGFP | oEMS2905: GTCCGCCCTGAGCAAAGA; oEMS2906: TCCAGCAGGACCATGTGATC | 54 bp | 200 ng | 32, 56 °C
Gapdh | oEMS2574: CTCGTCTCATAGACAAGATGGTGAAG; oEMS2575: AGACTCCACGACATACTCAGCACC | 305 bp | 200 ng (1), 100 ng (2) | 32, 56 °C (1); 30, 56 °C (2)
Gfap | oEMS2576: GCGCTCAATGCTGGCTTCAA; oEMS2577: ACGCAGCCAGGTTGTTCTCT | 346 bp | 100 ng | 32, 58 °C
lacZ | oEMS3650: CCATCTGCTGCACGCGGAAGAA; oEMS3651: TAGAGTCGCGGCCGCTGAAGTT | 220 bp | 100 ng | 32, 58 °C
(1) is the condition used for the Ple53 RT-PCR
(2) is the condition used for the Ple88 RT-PCR

Supplementary references

1. Flicek P, Aken BL, Beal K, Ballester B, Caccamo M, Chen Y, Clarke L, Coates G, Cunningham F, Cutts T, Down T, Dyer SC, Eyre T, Fitzgerald S, Fernandez-Banet J, Graf S, Haider S, Hammond M, Holland R, Howe KL, Howe K, Johnson N, Jenkinson A, Kahari A, Keefe D, Kokocinski F, Kulesha E, Lawson D, Longden I, Megy K et al: Ensembl 2008. Nucleic acids research 2008, 36(Database issue):D707-714. 2.
Siepel A, Bejerano G, Pedersen JS, Hinrichs AS, Hou M, Rosenbloom K, Clawson H, Spieth J, Hillier LW, Richards S, Weinstock GM, Wilson RK, Gibbs RA, Kent WJ, Miller W, Haussler D: Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res 2005, 15(8):1034-1050. 3. Karolchik D, Kuhn RM, Baertsch R, Barber GP, Clawson H, Diekhans M, Giardine B, Harte RA, Hinrichs AS, Hsu F, Kober KM, Miller W, Pedersen JS, Pohl A, Raney BJ, Rhead B, Rosenbloom KR, Smith KE, Stanke M, Thakkapallayil A, Trumbower H, Wang T, Zweig AS, Haussler D, Kent WJ: The UCSC Genome Browser Database: 2008 update. Nucleic acids research 2008, 36(Database issue):D773-779. 4. Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Wheeler DL: GenBank. Nucleic acids research 2008, 36(Database issue):D25-30. 5. Blanchette M, Kent WJ, Riemer C, Elnitski L, Smit AF, Roskin KM, Baertsch R, Rosenbloom K, Clawson H, Green ED, Haussler D, Miller W: Aligning multiple genomic sequences with the threaded blockset aligner. Genome Res 2004, 14(4):708-715. 6. Gong S, Zheng C, Doughty ML, Losos K, Didkovsky N, Schambra UB, Nowak NJ, Joyner A, Leblanc G, Hatten ME, Heintz N: A gene expression atlas of the central nervous system based on bacterial artificial chromosomes. Nature 2003, 425(6961):917-925. 7. Portales-Casamar E, Arenillas D, Lim J, Swanson MI, Jiang S, McCallum A, Kirov S, Wasserman WW: The PAZAR database of gene regulatory information coupled to the ORCA toolkit for the study of regulatory sequences. Nucleic acids research 2009, 37(Database issue):D54-60. 8. Sandelin A, Alkema W, Engstrom P, Wasserman WW, Lenhard B: JASPAR: an open-access database for eukaryotic transcription factor binding profiles. Nucleic acids research 2004, 32(Database issue):D91-94. 9. Portales-Casamar E, Kirov S, Lim J, Lithwick S, Swanson MI, Ticoll A, Snoddy J, Wasserman WW: PAZAR: a framework for collection and dissemination of cis-regulatory sequence annotation. Genome Biol 2007, 8(10):R207. 10. 
Lenhard B, Wasserman WW: TFBS: Computational framework for transcription factor binding site analysis. Bioinformatics 2002, 18(8):1135-1136. 11. Dugas JC, Tai YC, Speed TP, Ngai J, Barres BA: Functional genomic analysis of oligodendrocyte differentiation. J Neurosci 2006, 26(43):10967-10983. 12. Gentleman RC, Carey VJ, Bates DM, Bolstad B, Dettling M, Dudoit S, Ellis B, Gautier L, Ge Y, Gentry J, Hornik K, Hothorn T, Huber W, Iacus S, Irizarry R, Leisch F, Li C, Maechler M, Rossini AJ, Sawitzki G, Smith C, Smyth G, Tierney L, Yang JY, Zhang J: Bioconductor: open software development for computational biology and bioinformatics. Genome Biol 2004, 5(10):R80. 13. Irizarry RA, Hobbs B, Collin F, Beazer-Barclay YD, Antonellis KJ, Scherf U, Speed TP: Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics 2003, 4(2):249-264. 14. Wright GW, Simon RM: A random variance model for detection of differential gene expression in small microarray experiments. Bioinformatics 2003, 19(18):2448-2455. 15. Wheeler DL, Barrett T, Benson DA, Bryant SH, Canese K, Chetvernin V, Church DM, Dicuccio M, Edgar R, Federhen S, Feolo M, Geer LY, Helmberg W, Kapustin Y, Khovayko O, Landsman D, Lipman DJ, Madden TL, Maglott DR, Miller V, Ostell J, Pruitt KD, Schuler GD, Shumway M, Sequeira E, Sherry ST, Sirotkin K, Souvorov A, Starchenko G, Tatusov RL et al: Database resources of the National Center for Biotechnology Information. Nucleic acids research 2008, 36(Database issue):D13-21. 16. Fulton DL, Sundararajan S, Badis G, Hughes TR, Wasserman* WW, Roach* JC, Sladek* R: TFCat: The Curated Catalog of Mouse and Human Transcription Factors. Genome Biology 2009, 10(3):R29 (*Senior manuscript authors). 17. Sock E, Leger H, Kuhlbrodt K, Schreiber J, Enderich J, Richter-Landsberg C, Wegner M: Expression of Krox proteins during differentiation of the O-2A progenitor cell line CG-4. J Neurochem 1997, 68(5):1911-1919. 18.
Barnett SC, Rosario M, Doyle A, Kilbey A, Lovatt A, Gillespie DA: Differential regulation of AP-1 and novel TRE-specific DNA-binding complexes during differentiation of oligodendrocyte-type-2-astrocyte (O-2A) progenitor cells. Development 1995, 121(12):3969-3977.

8.3. Appendix 3: supplementary for chapter 6

Supplementary material for: Identification and analysis of transcriptional cis-regulatory modules directing oligodendrocytic expression of myelin-linked genes

Supplemental files available at:
URL: www.cisreg.ca/PT_Share/
User: ch6reviewer
Password: thesis
• Supplemental File 1 – Differentially expressed genes identified in the oligodendrocyte P4 vs P10 dataset (P4-P10d)
• Supplemental File 2 – Oligodendrocyte early development (IOLEDd) expression dataset and expression datasets in Figure 6.5a Venn diagram partitions
• Supplemental File 3 – TF genes identified in the expression datasets in Figure 6.5c Venn diagram partitions
• Supplemental File 4 – Promoter analyses results for the muscle reference collection validation (HTML files)
• Supplemental File 5 – Promoter analyses results for the oligodendrocyte early development (IOLEDd) expression dataset (HTML files)
• Supplemental File 6 – Overlapping CRM predictions identified in promoter analyses of the oligodendrocyte early development (IOLEDd) expression dataset
• Supplemental File 7 – Computed feature values and genomic coordinates for CRM predictions identified in the muscle reference collection and the mouse muscle enhancer sequences
• Supplemental File 8 – Feature value cluster analyses for CRM predictions identified in muscle reference collection and the mouse muscle enhancer sequences
• Supplemental File 9 – Computed feature values and genomic coordinates for CRM predictions identified in promoter analyses of the oligodendrocyte early development (IOLEDd) expression dataset and the myelin enhancer sequences
• Supplemental File 10 – Feature value cluster analyses for CRM predictions identified in promoter analyses of the oligodendrocyte early development (IOLEDd) expression dataset and the myelin enhancer sequences
• Supplemental File 11 – Mapped Ensembl genes used in the promoter analyses for the oligodendrocyte early development (IOLEDd) expression dataset

Appendix 3 contents:
A. Myelin gene enhancer sequence selection and validation
B. Analyses of Oligodendrocyte gene co-expression
C. Implementation of promoter database and CRM analyses
D. Validation of promoter CRM analyses method
E. Promoter analyses of co-expressed oligodendrocyte gene set
F. Prioritization of TFBS cooperativity predictions via enhancer feature weighting

A. Myelin gene enhancer sequence selection and validation

Table S1. Myelin gene enhancer genomic coordinates, Mouse Assembly 36 Feb 2006 (mm8)
Gene Enhancers | Chr | From Coordinate | To Coordinate
Cldn11 | 3 | 31357206 | 31357836
Cnp | 11 | 100391366 | 100392107
Ermn | 2 | 57868209 | 57868795
Gjb1 | X | 97586055 | 97586691
Mal | 2 | 127318715 | 127319393
Olig1 | 16 | 91189705 | 91190805
Olig2 | 16 | 91097897 | 91098790
Pou3f1 | 4 | 124169971 | 124170620

Table S2.
Myelin gene-associated enhancer sequences Gene Enhancers Construct Sequences Cldn11 TGTCACCACTCCTCTGAAGCTACCCCAGGCATGATTTGTGTCCTCCACGGCCCTCCAGG AATGGCTTCGACTTCCCACATTTTTTCATTTCTACAGAGGGGCATGAAGCAGTGACTAGT CCCCCTGTCTCCACTCACCCTTCCAGACACGGCCACGGTGCCTGAGCTCCGGGTATACA CGAGCTAAAGAAACCGAGATTCCTCAAGAGCTGCTATTCAAGCCACGAGCAGCATGTGC CAACATTCCTTCCACAATCCCGCTTCAGTCCCCATGAAGCTGCACATCTGGCTGTCACAC ACTCTTTTGTGGGGCCATCTCAAAGGCTGTCCTGATGCTGCCTCTCCCCAGGGCCGGTC CCACAGTGGCCTCCCTGTCCTCCTAGCATGTGATGTCATGAGCATGAATGTCCTGCCAAA CATTTAAGATGCGACAGCTCATTGTGTACATTGTATAGAGAAACACTGCTGGGCAGATCA GTTCTAATTAAAACAAGAAAAGAGGCAGGCCAGGACAGAGCCGGACCTTGTTCTATTGTT ACCCCAAGCCACCGAGGCAGCACAAGTGTGTAACGTGCACTCTCCAAGCTCACACCCTT CTCCTTTTGTTCCAGTCTTGTCCCCAGCATCTCCAA Cnp GCTGGGTTGTAAGGTAGAGGGAATCTTTCTGAAGCTACCTAACCTCTTAGTCTCTGGCTC CAAGGATCACTCTCTGCTCCCCCTCAGAACACCTTGGTGACAGGGCACGCAAGGGCAGC AAGTGCCTGTGGCTCCCTGTCAGCCCTAACTTATATATTCTTGGCACCTCTTAAGGATGT CCTGTTTGGAGGCCTGGGGCTCTATGTATGGGCAAAGCCACAGCTGCCTTGGCTCTCTT GAGTGTGCAGAGGAGGGTGAGGGCAGGGGAGGGTGACCCAAAGACCGCCTCCCTCCA CCCAACAATGGCACCAACAGATAATACTCTCTTGTCTGGGTATAGGGGGACAATGGTCCC ACAGAGGGCTGGTGGGGGGCAGGGCAGGAGGGCACTGGACAAAGGGCATATTCACAAC CACCTCCTCCTCTGCTCTCTCAGTGCGCCCCCTCCCCCAATCTGTAGGATTAGGGCACA GGGCACCTGTATTCAGAGATCTGTGCCCCCTATTCTGCTGTGTGTTAGCAGCACCTCTCT GACAAAAACAAGACATAGCGGTTTCAGTTGTCATCTACAACGGCTTCCGCCCCTGAGAAT TTAGCTGGCTTCTGGCTTCTCTATAGCCTTGTGCAGGAACCCACTCCCAGCGCGGGCTC CACAACAGCCCTGCACTCTCTGGTGCTACCCAGACCCATTTGCCAGCTGGTGTCGCCAA GGAGCATCGACTTTCACTTCCCTGCGGCGATA Ermn TAATGGGTAGGGTGGGTGTCTGTAGACCTGGGGGGACTGGAAAATCAGCTTCCTGAGAG CACACTATGAAATCAGTTGCCAAAGCCGTTTACAATCAAGTTGTCATGTTTACAGGTCCTT GAAAAGCTGGTTTTAGTCTTTAATAGATGAGAAATGATGCCATTGTCTCCTACAAAATAGC TGAGGTTTTACTCGAACCAAATCTGTTCGGAAATGTTAAGCTGGTTACACAGAACTAATGT GCCTCAAGGTGGTCATTCTCTCTGGATTAAAGCCTGGGAACAATTGTGGCTCCTGTTACA AGGAAAATTACAATGGGCCATTATGGAGAGGGACAAAAATCTCTGTTCCCAGAGGGGGA CTGACAAGCGGCAAGTCCCCCTTTGCATGCTAACACAAAAGCCTGATTGCTTCAGATTGC TTGTTTATGCAAATTGAAGGCAACACTAATTCTGAGACTGGAGCTGCGAGGTGGTGGGG 
ATGTTGCTGATTGGCTGGGTTTTGCTCCCGGGCTGGTTACCAAACTGACAAACCCTATCA TATTCATTCTGAGCCCTGAAAACCTTTCGGAGGAATCTTGCTATTGC Gjb1 TCCTCTTGAGTCCCTTTCTTCACCAGGCCTATCTTGTTCTGCTATTAGCCTGTTCTTCGTG CACCACCACCACCACCCCGCCCCAGCTCCTTTACACCTGACTTATGGCAAACCCAGATA GCCCGGTTACCTCGGGGAGCGCCTATCCTTGAGGCCACCCAGACAGCTCCCCTATGGT CTCATGTCTGTGTAGGGGGAGTGGTCCCTGTGGTCGCGCCTGCGTCCCAATTCATCGCC TCCCGGTGCTAGAGAGTCATATGCTGCTTTATACCCAGTGTCTGTACAATGGGACCTTTG TGGGGCGCTGAACAACACCCTTTAGAATTCAGATCAAACGCCCTGACTTCCGCCCACCC TACACACAGAGCACTTGTGTCCCAGCGCCGGACACCCAGCCTGCACCCACCTGTTCTGT GCCCGCCCGGTCCATCGCCGCACCCGCCCCATGGCCCCACCAGAATAATGGCTCTGGG GGAGGGAGATAGAACACAGACAGGTGCAAGTAAGCCAGGACCCCAGGGAATCAGGGTC CTCTGGACTAAGGCCTTCCCCTATATTCTACCAGAAACGGGGTCAGCAGCAGGTTTTCTA GGGGTAGGTCCCTTTCCAGGTATGGATGAGCCAGAAGAGAGGAAAC Mal CTGCCTTTGTTTCTCTTTCTGGTGATCTTCCCCCACGAAGGGAGGGTATATGAAAGAGGC TTCTTCATGTATAGCTCTTTTCCCCTGGAAGCCAACTTCTAACACCTCATTCTGCACACAG TAGGTCCAATGAGACCATTCTTTCATAGACCACTTCATAGGAGCTGAGCCTTCCTTATAAA GGAGTCTGTGTGGGCACCCCTCAGGAATCAGCATGTCTTTGCACAGCAGTCCCAACGGT GCATTGAAGGCCTCCCTACACATGGGCCTTAAAGAGACAGTAAAGGGTTCGCACTGCCA GCCGTGGGCTGGAGCAAGCATGCCTTTGGGGCTATACAAGGGCCTGTTGTTCCCAGAAT CAATTGGGGCCACTGTGGGGTCACAATGAAGGCCTTTCAAAGAGACAAAGCGCCTCTGG CCTGGCTAAGTGTCCCTGCACACGTGGGCATCTGCCCAGAACATTTAAAATAACACGATT TGAAAAAACAGATTTAAAGTGTACAATCTAGTGGCCTTTGTTCTGTGTCTACCATGTGCAG TGGCCTTTGTTCTGTGTCCACCACATGCAGTGATGGTGGATCAGAGCTGGGTGTGGGGA GGAGGACACTAGAGCTCCCAGCTGGGACCTCTCTTCTCTGTCTTAGGTAGGAAACTGGG TATCTACACAGTTCCAGGCTCAC 332 Table S2. 
Myelin gene-associated enhancer sequences (continued) Gene Enhancers Construct Sequences Olig1 CTGCTGAAACACTCCGCTCTTCCTGGAGACTGCAGGAGGCTCGGATGGGGTGGTTGGG GCAGAGCACTGGGCATTAAGCTCCATCCTGGCTTCACATAAAGGAAGAAAACTCAAAACG CAGGGGCGGGGGGTGGGGGGGTGCAACTAAATTTAATTGGCTAGAAAGCAGGATTCCT GGGGTCTCCGCTTTCCCAGCTTCCAGAGACGGCTTTAAGAAGGGATTCTTCTCCGCCTC CCAGCACCAAAAAGAAAGGGTTGAGCTGGGGTGGTAATCCAGGTTTGGTCTGGCCAAGA CAAATGACATACAGAACATGCAGTCTCTTTGCACACGGCATGTGCTTTCATGGAAGTGGA AGGGGAGAGAAGAAGGTGGGTAAGGGGGTGTAGAAAGGGGGCACTAAAAACAACTTGC TGTTTGCATATGGTCGGGCGAAATCTGTTTTGTCCCCGTCACTGTGCAGAGTGTCAGTAG TGGGCACCGTGTCGTGACAAGTTTCTTTCTTTCTTTCTTTTCTTTTTTTCTTTTTCTTTTTTT TTTTTTAAGTTAAGCTAAGCTCACCCCGAGCCTCGGCCCCCCGAGGTCCTGCAGCTTTAA TTGTGTTCAGTATGATCTGGCGAACAATATGTCTTGATAAATGTGAATGGCGATCCCGCG CGCGGCTCCCCTTGGCATTGCCGCAAAGCCAGGCCTCGCTTTAGTGAACCCGTTCAGGA ATGTGGGTTAATTCGGCCGACCCTTTTTACCCCCACAGCTCTGTTTTTGTGAGGTTGGCA GAAACTGACACGTTTTCTTTTGCTTAATTAAGTTGTTCAGTAGAAGGCTGGTCAAACAGTT GCATCTACGTGGGGAGTCTGGGGATGCGCAGAAAGCCGTGGGCGTTTCCAACATAAAAA GGGGGATTCTGTTATTGTTTGGGTGTAGACGGGTCATTTTTTCAAAGTTATTGCTGCCTG AGTGACCACAACCTTGCGCAGAGCCCTGAAAGGACATTAAAGCTGCTTAAACACAACCAC CTTGGGCTCCAAAGTGTGTTTCTAAAAAGGAGAAAGAGATGCAGGGCGCAAGAGAAAGA ACTAGGGCGATAGGAAGCAGTGGCGTTT Olig2 GGATTGGGAAGGCATCTTGCCTCCAGCCTGGCATTTTATGAAGAATTAAAAATAACGAAG CCGGCAGAGATACGGAGGGAGGCTAATTTGGAAACCTGGAATGATCGCTTTTAATTTGTT GACAATGTGGTGTTTGCGGGGGGAGGGATTGAGGGAGCCAAGGCACAGGATGGCCGTG CTTGTCTGGGCCTTGTATTAATGGAGTAGCAGCTGAATTACGTGTTCAGAGCGAACACAT CAAAGACCATCATCTGAGATCTCATTTCTTCTATTCCACGGTGTCAAAGGTGGCTGCAAAT ATATCCAGCACTTTCTGGGTGGCATTGTTTAGGATGCTGGCCCCAGCAGATGGCCCCAA AATGTACCTCAGCCTGATGGAGCAGGGAGGGACCACATGGGATGGCATGTCTCCTCATT AGCAAACAGAAAAGAAGGGGGAAAAAATAGAAAGCAGGATGAGAAAGGAGAGAGATGAA AAGAATACACAAAGACATTTACATATTTACATTGTCTCCCTCGCATTTAAAATTCCCCCTTA GCCGAAGTGTCAGTGGGCTTCTTCCTTCTGTTTGTGTGTTTACTTGTGCTAAATAGATTAT GGTAATCAGGCACATAAATGCACGCAACTGTTATTATGGCATTTTAATCACGGACTTTTTC ACTGGAAGCCTTCATTAGGATTCCCATAATCTTACTTGTTAGTGTAACAAACAGTAATAAA 
TGCCAAAGAGCTTGGATTAAGGCAGGGGCCTATCAGGCTTTAAAAGAGGAATATGGAAAT GTTGTTGAATCTCTGCATTACTTATTGAAATGTCACATAAACAGTTCGAGATTCTAAATAG CTTACTCATTATCCAGAGAGGGACGCGGGCAACCTGAGCCCAGTAATGATGT Pou3f1 CTCGAGCTGCTCAGCCCCCCTCCCCTGACACAAACAATCCTCAGTTACCTCCCCCTCCT GCTCCCCAGAATCTGGGCACAGCTGGAGCCTGCTATGCCCTAGCCACCCCATGAATCAC CGCTCTATGGTCCACGGGGGAGTGGTCCAGGGAGCATCCTACGCTTGCTCGGGAGGAG TGAGGGCCAGAACTGCAGCCCTCAAGCAGGCAGTGTCCACAGAAACAATGGGGGCCTG TGGCTAACAGGCGGAATGCAGCTATTGTCCTGCCCTGGCCCCCAACCCCAAGGCCCCA GGTCCCCAGGCCAGGCGGCCTGGCGTGAATCAGTGCGTCAGACTCTCGTGTACCAGGG CTGGGCACACATGACCTGCTGCTTACCTTCTCTGGGTAAGCAGGAGGGTAGGCTGATGG GCCAGACCCCCACCTCTACAGCCAACCTCACAAAGGATCCTCCTGCACAGAATGAGAAG CGAGATGGGACCCACGGGAGTAAAGGCAACCTCCAACCTACCCATCCCTAGGATGCTTG AGACCAGCAGCTCTGAGACCCAAGCTATACATCACAGTGGGGAGGAGACCACATTAGAA TAATGCAGGATTAGAGTGGGGTTGCTATAGCGACGTATTAGGGCAATACATCTAGGGAG CCCCA 333 Table S3. PCR primers used in enhancer sequence amplification (restrictions sites are noted in bold). Gene Enhancers Primer 1 / Primer2 ATGCGGCGCGCCTGTCACCACTCCTCTGAAGCTACC Cldn11 ATGCCTCGAGTTGGAGATGCTGGGGACAAGAC ATGCCTCGAGGCTGGGTTGTAAGGTAGAGGGAATC Cnp ATGCGGCGCGCCTATCGCCGCAGGGAAGTGAAAGTC ATGCGGCGCGCCTAATGGGTAGGGTGGGTGTCTG Ermn ATGCCTCGAGGCAATAGCAAGATTCCTCCGAAAG ATGCGGCGCGCCTCCTCTTGAGTCCCTTTCTTCACC Gjb1 ATGCCTCGAGGTTTCCTCTCTTCTGGCTCATCC ATGCGGCGCGCCCTGCCTTTGTTTCTCTTTCTGGTG Mal ATGCCTCGAGGTGAGCCTGGAACTGTGTAGATACC ATGCGGCGCGCCCTGCTGAAACACTCCGCTCTTC Olig1 ATGCCTCGAGAAACGCCACTGCTTCCTATCGC ATGCGGCGCGCCGGATTGGGAAGGCATCTTGC Olig2 ATGCCTCGAGACATCATTACTGGGCTCAGGT ATGCCTCGAGTTCTTTGACAATGGGGCTTCTCT Pou3f1 ATGCGGCGCGCCTGGGGCTCCCTAGATGTATTGC 334 Figure S1. Genomic context for myelin gene-associated enhancer regions (displayed using mouse MM8-mapped coordinates http://genome.ucsc.edu). Cldn11 Cnp Mal Gjb1 Ermn Olig1 335 Figure S1 (cont). Genomic context for myelin gene-associated enhancer regions (displayed using mouse MM8-mapped coordinates http://genome.ucsc.edu). Olig2 Pou3f1/Oct-6 336 B. 
Analyses of Oligodendrocyte gene co-expression Table S4. Gene Ontology (GO)-molecular function term enrichment analysis of optic nerve expression data GO molecular function enrichment analysis for genes differentially expressed in murine optic nerves between postnatal day 4 and postnatal day 10 (using p-value cut-off of 0.01 and database filters: MGI – Mouse Genome Informatics and RGD – Rat Genome Database). GO Term P-value Sample frequency (MGI + RGD) Background frequency (MGI + RGD) Mouse and Rat Genes GO:0005488 binding 2.47E-07 309/504 (61.3%) 10628/22524 (47.2%) Cdca5 Fkbp10 Emilin1 Kcnj10 S100a1 Ngfr Tpd52 Tgfbr2 S100a1 Elk3 Dock9 Tcf7 Ptp4a3 Idh1 Ndn Nek6 Ywhaq Fos Chfr Bmp6 Eif4ebp2 Cxcl12 Col5a1 Ier3 Ssbp3 Myoc Arap3 Gjb2 Sept8 Jam3 Il1rap Crym Dab2ip Ctse Ywhah Cdt1 Tgm2 Sp5 Gnas Erbb3 Pdlim2 Aplp1 Cacnb4 Ascl1 Klf2 Wdr6 Tcf3 Gas6 Fzd2 Klhl2 Cdc37l1 Rras2 Taldo1 Nid2 Txnip Arpc1a Ywhah Axin1 Tgfbi Kcnj10 Elk3 Arpc1b Col5a1 Sema4d Cebpb Bmp7 Gstm1 Eng Ddx39 Egfl7 S100a4 Sdc1 Eya2 Rrm2 Plp1 Birc2 Rims2 Eps15 Myoc Spred2 Rrm2 Ncapd2 Ckap4 Dab2ip Wnt7b Adamts4 Tcf7 Eps15 Sirt2 Dgkz Mcm6 Emilin1 Mfge8 Cdc37l1 Vldlr Kctd12 Rims2 Ets2 Adipor1 Igfbp2 Plekhb1 Ywhaq Sema4d Mast2 Mknk2 Bgn Ntsr2 Strn Loxl1 Aplp1 Cdca5 Tpd52 Mknk2 Lyve1 Cebpb Gsn F2r Ccnb2 Fgf13 S100a4 Med16 Fli1 Csdc2 Fos Sytl2 Fgfr1 Acta1 Prkcz Nkx6-2 Wnt7b Fmnl3 Ramp2 Cldn11 Fzd2 Wasf2 Igfbp5 Nfic Plekhb1 Bicc1 Id4 Trp53 Itgb5 Nid2 Ehd4 Sept2 Smad1 Sox7 Actn1 Aatk Acat2 Ets2 Dlk1 Ctse Emid2 Cacnb4 Tprkb Klf2 Wasf2 Tgfbr2 Tcfe2a Scap Car2 Otud7b Ets1 Prkcz Lasp1 Acss2 Erbb3 Il1rap Cldn5 Vldlr Tgm2 Gusb Ssbp3 Mcm7 Antxr1 Med9 Rab34 Ascl1 Rasip1 Fgf13 Gli1 Rcn3 Ppic Txnip Dgkz Fcgr2b Scap Cxcr4 Bgn Cyp1b1 Ngfr Tead2 Dlk1 Sept2 Pdgfrb Rap1a Mmp14 Reep5 Gli1 Fkbp10 Mast2 Ski Lsm2 Smarcd3 Uhrf1 Rap1a Cpox Smarcd3 Igfbp2 Piga Dhcr24 Rab34 Mtap7 Ndn Nfic Igfbp5 Tead2 Eng Grb14 Gtf2ird1 Chfr Akt1 Efhd2 Loxl1 Ntsr2 Fcgr2b Akt1 Ets1 Tcf3 Pdk4 Mcm2 Plekha1 Glrb Rras2 Crym Reep5 Creb3l1 Ptx3 Ctsb Fermt3 Itgb5 
Surf4 Klf4 Dact1 Birc2 Plp1 Slc2a1 Mfge8 Cdt1 Slc2a1 Wif1 Gnas Lrrc59 Adipor1 Fdft1 Spred2 Id4 Gjb2 Klf4 Arf3 Lasp1 Eif4ebp2 Ddc Gtf2ird1 Racgap1 Smad1 Ptx3 Lyve1 Bmp6 Mcm2 Lrp10 Efhd2 Sh3gl3 Pdgfrb Ehd4 Fzr1 Lrp10 Fnta Foxf2 Mest Dock9 Nkx6-2 Mcm6 Sept9 Gamt Cldn11 Mdk Igfbp4 Timm17a Grb14 Mcm7 Pctk3 Sox7 Pxn Cxcl12 Igfbp4 Glrb Stard4 Tprkb B4galt1 B4galt1 Sh3gl3 Pxn S100a3 Dact1 Axin1 Actn1 Bmp7

GO Term: GO:0019911 structural constituent of myelin sheath
P-value: 3.56E-07
Sample frequency (MGI + RGD): 6/504 (1.2%)
Background frequency (MGI + RGD): 6/22524 (0.0%)
Mouse and Rat Genes: Mobp Mbp Mobp Plp1 Mal Tspan2

GO Term: GO:0019838 growth factor binding
P-value: 2.14E-05
Sample frequency (MGI + RGD): 15/504 (3.0%)
Background frequency (MGI + RGD): 101/22524 (0.4%)
Mouse and Rat Genes: Ngfr Tgfbr2 Il1rap Erbb3 Col5a1 Eng Igfbp2 Igfbp5 Il1rap Ngfr Igfbp2 Igfbp5 Pdgfrb Igfbp4 Igfbp4

GO Term: GO:0005515 protein binding
P-value: 1.67E-04
Sample frequency (MGI + RGD): 264/504 (52.4%)
Background frequency (MGI + RGD): 9167/22524 (40.7%)
Mouse and Rat Genes: Cdca5 Fkbp10 Emilin1 Kcnj10 S100a1 Ngfr Tpd52 Tgfbr2 S100a1 Dock9 Tcf7 Ptp4a3 Ndn Nek6 Ywhaq Chfr Bmp6 Eif4ebp2 Cxcl12 Col5a1 Ier3 Myoc Arap3 Gjb2 Sept8 Jam3 Il1rap Dab2ip Ctse Ywhah Cdt1 Tgm2 Gnas Erbb3 Pdlim2 Aplp1 Cacnb4 Klf2 Wdr6 Tcf3 Gas6 Fzd2 Klhl2 Cdc37l1 Rras2 Taldo1 Nid2 Txnip Arpc1a Ywhah Axin1 Kcnj10 Elk3 Arpc1b Col5a1 Sema4d Cebpb Bmp7 Gstm1 Eng Ddx39 S100a4 Sdc1 Rrm2 Plp1 Birc2 Rims2 Eps15 Myoc Spred2 Rrm2 Ncapd2 Ckap4 Dab2ip Wnt7b Adamts4 Eps15 Sirt2 Dgkz Mcm6 Emilin1 Mfge8 Cdc37l1 Vldlr Kctd12 Rims2 Ets2 Adipor1 Igfbp2 Plekhb1 Ywhaq Sema4d Mast2 Mknk2 Strn Loxl1 Aplp1 Cdca5 Tpd52 Mknk2 Cebpb Gsn F2r Ccnb2 Fgf13 S100a4 Med16 Fli1 Csdc2 Fos Sytl2 Fgfr1 Acta1 Prkcz Wnt7b Fmnl3 Ramp2 Cldn11 Fzd2 Wasf2 Igfbp5 Plekhb1 Id4 Trp53 Itgb5 Nid2 Ehd4 Sept2 Smad1 Actn1 Aatk Acat2 Dlk1 Ctse Emid2 Cacnb4 Tprkb Klf2 Wasf2 Tgfbr2 Tcfe2a Scap Car2 Otud7b Ets1 Prkcz Lasp1 Acss2 Erbb3 Il1rap Cldn5 Vldlr Mcm7 Antxr1 Med9 Ascl1 Rasip1 Fgf13 Gli1 Rcn3 Txnip Dgkz Fcgr2b Scap Cxcr4 Cyp1b1 Ngfr Dlk1 Sept2 Pdgfrb Rap1a
Mmp14 Reep5 Gli1 Fkbp10 Mast2 Ski Lsm2 Smarcd3 Uhrf1 Rap1a Cpox Smarcd3 Igfbp2 Piga Dhcr24 Mtap7 Ndn Nfic Igfbp5 Eng Grb14 Chfr Akt1 Efhd2 Loxl1 Fcgr2b Akt1 Ets1 Tcf3 Mcm2 Glrb Rras2 Crym Reep5 Ctsb Fermt3 Itgb5 Surf4 Dact1 Birc2 Plp1 Slc2a1 Mfge8 Cdt1 Slc2a1 Wif1 Gnas Lrrc59 Adipor1 Fdft1 Spred2 Id4 Gjb2 Lasp1 Eif4ebp2 Ddc Gtf2ird1 Racgap1 Smad1 Bmp6 Mcm2 Lrp10 Efhd2 Sh3gl3 Pdgfrb Ehd4 Fzr1 Lrp10 Fnta Mest Dock9 Mcm6 Sept9 Gamt Cldn11 Igfbp4 Timm17a Grb14 Pctk3 Pxn Cxcl12 Igfbp4 Glrb Tprkb B4galt1 B4galt1 Sh3gl3 Pxn S100a3 Dact1 Axin1 Actn1 Bmp7

GO Term: GO:0004450 isocitrate dehydrogenase (NADP+) activity
P-value: 7.25E-04
Sample frequency (MGI + RGD): 4/504 (0.8%)
Background frequency (MGI + RGD): 4/22524 (0.0%)
Mouse and Rat Genes: Idh1 Idh2 Idh2 Idh1

GO Term: GO:0005520 insulin-like growth factor binding
P-value: 5.26E-03
Sample frequency (MGI + RGD): 6/504 (1.2%)
Background frequency (MGI + RGD): 18/22524 (0.1%)
Mouse and Rat Genes: Igfbp2 Igfbp5 Igfbp2 Igfbp5 Igfbp4 Igfbp4

GO Term: GO:0003700 transcription factor activity
P-value: 7.61E-03
Sample frequency (MGI + RGD): 33/504 (6.5%)
Background frequency (MGI + RGD): 604/22524 (2.7%)
Mouse and Rat Genes: Elk3 Tcf7 Fos Ascl1 Tcf3 Elk3 Cebpb Tcf7 Cebpb Fos Nkx6-2 Nfic Trp53 Smad1 Sox7 Klf2 Tcfe2a Ets1 Ascl1 Gli1 Gli1 Nfic Gtf2ird1 Ets1 Tcf3 Creb3l1 Klf4 Klf4 Gtf2ird1 Smad1 Foxf2 Nkx6-2 Sox7

Table S5. Gene Ontology (GO)-cellular component term enrichment analysis of optic nerve expression data.
GO cellular component enrichment analysis for genes differentially expressed in murine optic nerves between postnatal day 4 and postnatal day 10 (using p-value cut-off of 0.01 and database filters: MGI – Mouse Genome Informatics and RGD – Rat Genome Database).
GO Term: GO:0043209 myelin sheath
P-value: 4.22E-05
Sample frequency (MGI + RGD): 8/504 (1.6%)
Background frequency (MGI + RGD): 22/22524 (0.1%)
Mouse and Rat Genes: Mbp Plp1 Mag Mbp Plp1 Pllp Tubb4 Mag

GO Term: GO:0044421 extracellular region part
P-value: 3.00E-04
Sample frequency (MGI + RGD): 45/504 (8.9%)
Background frequency (MGI + RGD): 853/22524 (3.8%)
Mouse and Rat Genes: Emilin1 Efemp2 Bmp6 Cxcl12 Col5a1 Myoc Erbb3 Nid2 Tgfbi Col5a1 Bmp7 Egfl7 Grn Myoc Wnt7b Adamts4 Emilin1 Mfge8 Vldlr Igfbp2 Bgn Loxl1 Gsn Tgfbi Cnp Nid2 Ltbp3 Emid2 Car2 Vldlr Tgm2 Dkk3 Bgn Col18a1 Igfbp2 Igfbp5 Loxl1 Mfge8 Grn Col18a1 Egfl7 Mdk Igfbp4 Igfbp4 Bmp7

GO Term: GO:0005576 extracellular region
P-value: 8.18E-03
Sample frequency (MGI + RGD): 48/504 (9.5%)
Background frequency (MGI + RGD): 1056/22524 (4.7%)
Mouse and Rat Genes: Emilin1 Efemp2 Bmp6 Cxcl12 Col5a1 Myoc Erbb3 Nid2 Tgfbi Col5a1 Bmp7 Gstm1 Egfl7 Grn Myoc Wnt7b Adamts4 Emilin1 Mfge8 Vldlr Igfbp2 Bgn Loxl1 Gsn Tgfbi Cnp Nid2 Ltbp3 Ltbp3 Emid2 Car2 Vldlr Tgm2 Dkk3 Bgn Col18a1 Igfbp2 Igfbp5 Loxl1 Ctsb Mfge8 Grn Col18a1 Egfl7 Mdk Igfbp4 Igfbp4 Bmp7

C. Implementation of promoter database and CRM analyses.

Figure S2. Evaluation of sequence conservation flanking transcription factor binding sites. Human-mouse sequence conservation flanking an experimentally validated Ap1/Fos transcription factor binding site (using binding site information: AP1_EXTRACTED50 extracted from the Annotated Binding Site database [1]).

Figure S3. Logos of transcription factor binding site profiles added to the Jaspar database for the promoter analyses.

D. Validation of promoter CRM analyses method

Table S6.
List of muscle reference collection genes Ensembl Gene ID HGNC Symbol Ensembl Description ENSG00000143632 ACTA1 Actin, alpha skeletal muscle (Alpha-actin-1) [Source:UniProtKB/Swiss-Prot;Acc:P68133] ENSG00000149925 ALDOA Fructose-bisphosphate aldolase A (EC 4.1.2.13)(Muscle-type aldolase)(Lung cancer antigen NY-LU-1) [Source:UniProtKB/Swiss- Prot;Acc:P04075] ENSG00000138435 CHRNA1 Acetylcholine receptor subunit alpha Precursor [Source:UniProtKB/Swiss-Prot;Acc:P02708] ENSG00000170175 CHRNB1 Acetylcholine receptor subunit beta Precursor [Source:UniProtKB/Swiss-Prot;Acc:P11230] ENSG00000135902 CHRND Acetylcholine receptor subunit delta Precursor [Source:UniProtKB/Swiss-Prot;Acc:Q07001] ENSG00000108556 CHRNE Acetylcholine receptor subunit epsilon Precursor [Source:UniProtKB/Swiss-Prot;Acc:Q04844] ENSG00000196811 CHRNG Acetylcholine receptor subunit gamma Precursor [Source:UniProtKB/Swiss-Prot;Acc:P07510] ENSG00000104879 CKM Creatine kinase M-type (EC 2.7.3.2)(Creatine kinase M chain)(M-CK) [Source:UniProtKB/Swiss-Prot;Acc:P06732] ENSG00000175084 DES Desmin [Source:UniProtKB/Swiss-Prot;Acc:P17661] ENSG00000198947 DMD Dystrophin [Source:UniProtKB/Swiss-Prot;Acc:P11532] ENSG00000198125 MB Myoglobin [Source:UniProtKB/Swiss-Prot;Acc:P02144] ENSG00000081189 MEF2C Myocyte-specific enhancer factor 2C [Source:UniProtKB/Swiss-Prot;Acc:Q06413] ENSG00000111046 MYF6 Myogenic factor 6 (Myf-6) [Source:UniProtKB/Swiss-Prot;Acc:P23409] ENSG00000109063 MYH3 Myosin-3 (Myosin heavy chain 3)(Myosin heavy chain, fast skeletal muscle, embryonic)(Muscle embryonic myosin heavy chain)(SMHCE) [Source:UniProtKB/Swiss-Prot;Acc:P11055] ENSG00000141048 MYH4 Myosin-4 (Myosin heavy chain 4)(Myosin heavy chain 2b)(MyHC- 2b)(Myosin heavy chain IIb)(MyHC-IIb)(Myosin heavy chain, skeletal muscle, fetal) [Source:UniProtKB/Swiss-Prot;Acc:Q9Y623] ENSG00000197616 MYH6 Myosin-6 (Myosin heavy chain 6)(Myosin heavy chain, cardiac muscle alpha isoform)(MyHC-alpha) [Source:UniProtKB/Swiss-Prot;Acc:P13533] 
ENSG00000092054 MYH7 Myosin-7 (Myosin heavy chain 7)(Myosin heavy chain, cardiac muscle beta isoform)(MyHC-beta)(Myosin heavy chain slow isoform)(MyHC- slow) [Source:UniProtKB/Swiss-Prot;Acc:P12883] ENSG00000168530 MYL1 Myosin light chain 1, skeletal muscle isoform (MLC1F)(A1 catalytic)(Alkali myosin light chain 1) [Source:UniProtKB/Swiss-Prot;Acc:P05976] ENSG00000198336 MYL4 Myosin light chain 4 (Myosin light chain 1, embryonic muscle/atrial isoform)(Myosin light chain alkali, GT-1 isoform) [Source:UniProtKB/Swiss-Prot;Acc:P12829] ENSG00000129152 MYOD1 Myoblast determination protein 1 (Myogenic factor 3)(Myf-3) [Source:UniProtKB/Swiss-Prot;Acc:P15172] ENSG00000122180 MYOG Myogenin (Myogenic factor 4)(Myf-4) [Source:UniProtKB/Swiss-Prot;Acc:P15173] ENSG00000007314 SCN4A Sodium channel protein type 4 subunit alpha (Sodium channel protein type IV subunit alpha)(Voltage-gated sodium channel subunit alpha Nav1.4)(Sodium channel protein skeletal muscle subunit alpha)(SkM1) [Source:UniProtKB/Swiss-Prot;Acc:P35499] ENSG00000181856 SLC2A4 Solute carrier family 2, facilitated glucose transporter member 4 (Glucose transporter type 4, insulin-responsive)(GLUT-4) [Source:UniProtKB/Swiss-Prot;Acc:P14672] ENSG00000114854 TNNC1 Troponin C, slow skeletal and cardiac muscles (TN-C) [Source:UniProtKB/Swiss-Prot;Acc:P63316] ENSG00000159173 TNNI1 Troponin I, slow skeletal muscle (Troponin I, slow-twitch isoform) [Source:UniProtKB/Swiss-Prot;Acc:P19237] 343 Table S7. Validation of promoter CRM analyses method using the muscle reference collection and a full ortholog background Top ranked CRM predictions from the promoter analyses validation using the full muscle reference collection (25 genes), a full ortholog background dataset, and the following analysis parameters: search regions: 2000 base pairs (bp) upstream and downstream of the transcription start site; inter-binding site distance constraint of 225 bp; and all vertebrate TF binding profiles. 
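The inter-binding site distance constraint named in the analysis parameters above can be illustrated with a minimal sketch. It is not the thesis implementation: the `Site` record, the `tfbs_pairs` function, the example coordinates, and the choice to measure distance as the gap between the end of one site and the start of the next (ignoring overlapping sites) are all assumptions made for illustration. Only the 225 bp limit comes from the text.

```python
from itertools import combinations
from typing import NamedTuple

MAX_DISTANCE_BP = 225  # inter-binding site distance constraint stated in the text


class Site(NamedTuple):
    """Hypothetical record for one predicted binding site in a search region."""
    tf: str     # TF binding profile name, e.g. "MEF2A"
    start: int  # 0-based start position within the search region
    end: int    # exclusive end position


def gap_between(a: Site, b: Site) -> int:
    """Bases separating two sites (negative when the sites overlap)."""
    first, second = sorted((a, b), key=lambda s: s.start)
    return second.start - first.end


def tfbs_pairs(sites, max_distance=MAX_DISTANCE_BP):
    """Collect unordered TF-name pairs that have at least one pair of
    non-overlapping sites separated by no more than max_distance bases.
    (Assumed distance definition; the thesis may measure it differently.)"""
    pairs = set()
    for a, b in combinations(sites, 2):
        if 0 <= gap_between(a, b) <= max_distance:
            pairs.add(tuple(sorted((a.tf, b.tf))))
    return pairs


# Made-up coordinates, for illustration only.
example_sites = [
    Site("MEF2A", 100, 110),
    Site("SP1", 300, 310),    # 190 bp after the MEF2A site
    Site("SRF", 1500, 1510),  # more than 225 bp from both other sites
]
print(tfbs_pairs(example_sites))
```

A gene would then count as a foreground or background "hit" for a given pair when `tfbs_pairs` reports that pair for its search region; that counting step is likewise an assumption about how the hit columns in the tables below could be derived.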
Known skeletal muscle CRMs are highlighted in yellow.

Supplemental discussion: Analyses of skeletal muscle-specific genes recovered known cis-regulatory modules composed of MEF2A, SP1, SRF, and Tead1 TFs in the top thirteen predicted pairs (Table S7). We also noted some other interesting TFBS pair predictions such as Sox/Sp1 and Mef2a/Hand1-Tcfe2a. Sox factors appear to play important roles in skeletal muscle development [2-4]. Hand1 is a member of the basic helix-loop-helix (bHLH) transcription factor family that, amongst other roles, is required for vascular smooth muscle recruitment [5] and regulation of cardiomyocyte development [6]. Hand1 also inhibits MyoD-dependent skeletal muscle cell differentiation and muscle-specific myosin heavy chain protein expression [7]. Interestingly, a recent study demonstrated that Mef2a/Hand1 interactions result in synergistic activation of Mef2a-dependent promoters in cardiac muscle (cardiomyocytes) [8]. As such, these skeletal muscle CRM cooperativity predictions may warrant further investigation.

Abbreviations used: FG=Foreground; BG=Background

Table S7. A. TF class pair over-representation (top five results).

TF Name  Class             TF Name       Class             FG Hits  FG Non Hits  BG Hits  BG Non Hits  Score
MEF2A    MADS              SP1           ZN-FINGER, C2H2   13       12           1572     13549        2.6686e-07
MEF2A    MADS              Hand1-Tcfe2a  bHLH              14       11           2292     12829        2.9178e-06
MEF2A    MADS              SRF           MADS              4        21           101      15020        2.4661e-05
RORA     NUCLEAR RECEPTOR  TAL1-TCF3     bHLH              2        23           59       15062        4.4147e-05
MEF2A    MADS              RORA          NUCLEAR RECEPTOR  9        16           1146     13975        5.6318e-05

Table S7. B. Top thirteen detected over-represented TFBS pairs.
TF Name  Class             TF Name       Class             FG Hits  FG Non Hits  BG Hits  BG Non Hits  Score
MEF2A    MADS              SP1           ZN-FINGER, C2H2   13       12           1572     13549        2.6686e-07
Sox17    HMG               SP1           ZN-FINGER, C2H2   24       1            7537     7584         7.6493e-07
Myf      bHLH              SP1           ZN-FINGER, C2H2   23       2            6697     8424         1.0893e-06
MEF2A    MADS              Pdx1          HOMEO             17       8            3494     11627        2.3789e-06
MEF2A    MADS              Hand1-Tcfe2a  bHLH              14       11           2292     12829        2.3373e-06
MEF2A    MADS              YY1           ZN-FINGER, C2H2   15       10           2918     12203        8.8036e-06
MEF2A    MADS              Nobox         HOMEO             14       11           2629     12492        1.5159e-05
MEF2A    MADS              SRF           MADS              4        21           101      15020        2.4661e-05
RORA     NUCLEAR RECEPTOR  TAL1-TCF3     bHLH              11       14           1735     13386        4.4147e-05
MEF2A    MADS              RORA          NUCLEAR RECEPTOR  9        16           1146     13975        5.6318e-05
MEF2A    MADS              ROAZ          ZN-FINGER, C2H2   6        19           433      14688        6.351e-05
MEF2A    MADS              TEAD1         TEA               6        19           486      14635        0.00011921
Sp1      ZN-FINGER, C2H2   TAL1-TCF3     bHLH              14       11           3241     11880        0.00017081

Table S8. Validation of promoter CRM analyses method using the muscle reference collection and random background sampling.
Top ranked CRM predictions from the promoter analyses validation using the full muscle reference collection (25 genes), two randomly sampled background datasets of 5000 ortholog pairs, and the following analysis parameters: search regions: 2000 base pairs (bp) upstream and downstream of the transcription start site; inter-binding site distance constraint of 225 bp; and all vertebrate TF binding profiles. Known skeletal muscle CRMs are highlighted in yellow.

Supplemental discussion: The initial ranked TFBS pair results obtained with a full background (see Table S7) and with a random background (see Table S8 below) are consistent. We computed Z ratios for the proportion of TFBS instances in the full background sample versus the randomized background sample for each matching TFBS pair prediction and found that these differences were not significant at a p-value cut-off of 0.05 (data not shown). Moreover, a number of the ranking scores for the same TFBS pair predictions produced very similar values (compare Table S7 with Table S8).
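The comparison described above can be sketched in a few lines of Python. Two assumptions are made here and are not confirmed by this appendix: that the Score column is a one-sided Fisher exact (hypergeometric upper-tail) p-value computed from the four hit counts, and that the Z ratio is the standard pooled two-proportion z statistic. The function names are hypothetical; the counts are the MEF2A/SP1 rows of Table S7 A and Table S8 A.

```python
from math import comb, erfc, sqrt


def fisher_one_sided(fg_hits, fg_non, bg_hits, bg_non):
    """Upper-tail hypergeometric probability of observing at least fg_hits
    hits in the foreground (assumed form of the over-representation score)."""
    n_fg = fg_hits + fg_non
    total_hits = fg_hits + bg_hits
    total = n_fg + bg_hits + bg_non
    upper = min(n_fg, total_hits)
    tail = sum(
        comb(total_hits, k) * comb(total - total_hits, n_fg - k)
        for k in range(fg_hits, upper + 1)
    )
    return tail / comb(total, n_fg)


def two_proportion_z(hits_a, n_a, hits_b, n_b):
    """Pooled two-proportion z statistic and its two-sided normal p-value."""
    p_a, p_b = hits_a / n_a, hits_b / n_b
    pooled = (hits_a + hits_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    return z, erfc(abs(z) / sqrt(2))


# MEF2A/SP1: foreground 13 hits, 12 non-hits; full background 1572 / 13549.
score_full = fisher_one_sided(13, 12, 1572, 13549)

# Same pair: full background proportion versus random background (504 / 4496).
z, p_value = two_proportion_z(1572, 1572 + 13549, 504, 504 + 4496)

print(f"score (full background): {score_full:.4e}")
print(f"z = {z:.3f}, two-sided p = {p_value:.3f}")
```

Under these assumptions a background difference counts as non-significant when the two-sided p-value exceeds 0.05, which corresponds to the cut-off stated in the discussion.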
Abbreviations used: FG=Foreground; BG=Background

Table S8. A. TF class pair over-representation (top five results).

TF Name  Class             TF Name       Class             FG Hits  FG Non Hits  BG Hits  BG Non Hits  Score
MEF2A    MADS              SP1           ZN-FINGER, C2H2   13       12           504      4496         2.0302e-07
MEF2A    MADS              Hand1-Tcfe2a  bHLH              14       11           728      4272         1.9014e-06
MEF2A    MADS              SRF           MADS              4        21           29       4971         1.7701e-05
MEF2A    MADS              RORA          NUCLEAR RECEPTOR  9        16           345      4655         2.867e-05
RORA     NUCLEAR RECEPTOR  TAL1-TCF3     bHLH              11       14           549      4451         3.0778e-05

Table S8. B. Top thirteen over-represented TFBS pairs.

TF Name  Class             TF Name       Class             FG Hits  FG Non Hits  BG Hits  BG Non Hits  Score
MEF2A    MADS              SP1           ZN-FINGER, C2H2   13       12           504      4496         2.0302e-07
Sox17    HMG               SP1           ZN-FINGER, C2H2   24       1            2442     2548         5.4238e-07
Myf      bHLH              SP1           ZN-FINGER, C2H2   23       2            2188     2802         1.0893e-06
MEF2A    MADS              Pdx1          HOMEO             17       8            1123     3867         1.6987e-06
MEF2A    MADS              Hand1-Tcfe2a  bHLH              14       11           739      4251         2.3373e-06
MEF2A    MADS              YY1           ZN-FINGER, C2H2   15       10           947      4043         7.4723e-06
MEF2A    MADS              Nobox         HOMEO             14       11           840      4150         1.0854e-05
RORA     NUCLEAR RECEPTOR  TAL1-TCF3     bHLH              11       14           557      4433         3.5892e-05
MEF2A    MADS              Myb           TRP-CLUSTER       12       13           683      4307         4.2071e-05
MEF2A    MADS              SRF           MADS              4        21           37       4953         4.2977e-05
MEF2A    MADS              RORA          NUCLEAR RECEPTOR  9        16           368      4622         4.8162e-05
MEF2A    MADS              ROAZ          ZN-FINGER, C2H2   6        19           143      4847         6.9023e-05
MEF2A    MADS              MYF           bHLH              9        16           392      4598         7.8459e-05

Table S9.
Mouse skeletal reference collection with 25 random genes Ensembl Gene ID HGNC Symbol Ensembl Description ENSG00000143632 ACTA1 Actin, alpha skeletal muscle (Alpha-actin-1) [Source:UniProtKB/Swiss-Prot;Acc:P68133] ENSG00000149925 ALDOA Fructose-bisphosphate aldolase A (EC 4.1.2.13)(Muscle-type aldolase)(Lung cancer antigen NY-LU-1) [Source:UniProtKB/Swiss- Prot;Acc:P04075] ENSG00000050820 BCAR1 Breast cancer anti-estrogen resistance protein 1 (CRK-associated substrate)(p130cas)(Cas scaffolding protein family member 1) [Source:UniProtKB/Swiss-Prot;Acc:P56945] ENSG00000108688 CCL7 C-C motif chemokine 7 Precursor (Small-inducible cytokine A7)(Monocyte chemoattractant protein 3)(Monocyte chemotactic protein 3)(MCP- 3)(NC28) [Source:UniProtKB/Swiss-Prot;Acc:P80098] ENSG00000138435 CHRNA1 Acetylcholine receptor subunit alpha Precursor [Source:UniProtKB/Swiss-Prot;Acc:P02708] ENSG00000170175 CHRNB1 Acetylcholine receptor subunit beta Precursor [Source:UniProtKB/Swiss-Prot;Acc:P11230] ENSG00000135902 CHRND Acetylcholine receptor subunit delta Precursor [Source:UniProtKB/Swiss-Prot;Acc:Q07001] ENSG00000108556 CHRNE Acetylcholine receptor subunit epsilon Precursor [Source:UniProtKB/Swiss-Prot;Acc:Q04844] ENSG00000196811 CHRNG Acetylcholine receptor subunit gamma Precursor [Source:UniProtKB/Swiss-Prot;Acc:P07510] ENSG00000104879 CKM Creatine kinase M-type (EC 2.7.3.2)(Creatine kinase M chain)(M-CK) [Source:UniProtKB/Swiss-Prot;Acc:P06732] ENSG00000166394 CYB5R2 NADH-cytochrome b5 reductase 2 (b5R.2)(EC 1.6.2.2) [Source:UniProtKB/Swiss-Prot;Acc:Q6BCY4] ENSG00000175084 DES Desmin [Source:UniProtKB/Swiss-Prot;Acc:P17661] ENSG00000198947 DMD Dystrophin [Source:UniProtKB/Swiss-Prot;Acc:P11532] ENSG00000163435 ELF3 ETS-related transcription factor Elf-3 (E74-like factor 3)(Epithelium-specific Ets transcription factor 1)(ESE-1)(Epithelium-restricted Ets protein ESX)(Epithelial-restricted with serine box) [Source:UniProtKB/Swiss- Prot;Acc:P78545] ENSG00000187672 ERC2 ERC protein 2 
[Source:UniProtKB/Swiss-Prot;Acc:O15083] ENSG00000138829 FBN2 Fibrillin-2 Precursor [Source:UniProtKB/Swiss-Prot;Acc:P35556] ENSG00000204007 GLT6D1 Glycosyltransferase 6 domain-containing protein 1 (EC 2.4.1.- )(Galactosyltransferase family 6 domain-containing 1) [Source:UniProtKB/Swiss-Prot;Acc:Q7Z4J2] ENSG00000100577 GSTZ1 Maleylacetoacetate isomerase (MAAI)(EC 5.2.1.2)(Glutathione S- transferase zeta 1)(EC 2.5.1.18)(GSTZ1-1) [Source:UniProtKB/Swiss- Prot;Acc:O43708] ENSG00000129636 ITFG1 T-cell immunomodulatory protein Precursor (Protein TIP)(Integrin-alpha FG-GAP repeat-containing protein 1) [Source:UniProtKB/Swiss- Prot;Acc:Q8TB96] ENSG00000111615 KRR1 KRR1 small subunit processome component homolog (HIV-1 Rev-binding protein 2)(Rev-interacting protein 1)(Rip-1) [Source:UniProtKB/Swiss- Prot;Acc:Q13601] ENSG00000181541 MAB21L2 Protein mab-21-like 2 [Source:UniProtKB/Swiss-Prot;Acc:Q9Y586] ENSG00000198125 MB Myoglobin [Source:UniProtKB/Swiss-Prot;Acc:P02144] ENSG00000081189 MEF2C Myocyte-specific enhancer factor 2C [Source:UniProtKB/Swiss-Prot;Acc:Q06413] ENSG00000139505 MTMR6 Myotubularin-related protein 6 (EC 3.1.3.-) [Source:UniProtKB/Swiss-Prot;Acc:Q9Y217] 349 Table S9. 
Mouse skeletal reference collection with 25 random genes (continued) Ensembl Gene ID HGNC Symbol Ensembl Description ENSG00000111046 MYF6 Myogenic factor 6 (Myf-6) [Source:UniProtKB/Swiss-Prot;Acc:P23409] ENSG00000109063 MYH3 Myosin-3 (Myosin heavy chain 3)(Myosin heavy chain, fast skeletal muscle, embryonic)(Muscle embryonic myosin heavy chain)(SMHCE) [Source:UniProtKB/Swiss-Prot;Acc:P11055] ENSG00000141048 MYH4 Myosin-4 (Myosin heavy chain 4)(Myosin heavy chain 2b)(MyHC- 2b)(Myosin heavy chain IIb)(MyHC-IIb)(Myosin heavy chain, skeletal muscle, fetal) [Source:UniProtKB/Swiss-Prot;Acc:Q9Y623] ENSG00000197616 MYH6 Myosin-6 (Myosin heavy chain 6)(Myosin heavy chain, cardiac muscle alpha isoform)(MyHC-alpha) [Source:UniProtKB/Swiss-Prot;Acc:P13533] ENSG00000092054 MYH7 Myosin-7 (Myosin heavy chain 7)(Myosin heavy chain, cardiac muscle beta isoform)(MyHC-beta)(Myosin heavy chain slow isoform)(MyHC-slow) [Source:UniProtKB/Swiss-Prot;Acc:P12883] ENSG00000168530 MYL1 Myosin light chain 1, skeletal muscle isoform (MLC1F)(A1 catalytic)(Alkali myosin light chain 1) [Source:UniProtKB/Swiss-Prot;Acc:P05976] ENSG00000198336 MYL4 Myosin light chain 4 (Myosin light chain 1, embryonic muscle/atrial isoform)(Myosin light chain alkali, GT-1 isoform) [Source:UniProtKB/Swiss- Prot;Acc:P12829] ENSG00000129152 MYOD1 Myoblast determination protein 1 (Myogenic factor 3)(Myf-3) [Source:UniProtKB/Swiss-Prot;Acc:P15172] ENSG00000122180 MYOG Myogenin (Myogenic factor 4)(Myf-4) [Source:UniProtKB/Swiss-Prot;Acc:P15173] ENSG00000188162 OTOG Otogelin Precursor [Source:UniProtKB/Swiss-Prot;Acc:Q6ZRI0] ENSG00000169241 RAG1AP1 RAG1-activating protein 1 (Stromal cell protein) [Source:UniProtKB/Swiss-Prot;Acc:Q9BRV3] ENSG00000054967 RELT Tumor necrosis factor receptor superfamily member 19L Precursor (Receptor expressed in lymphoid tissues) [Source:UniProtKB/Swiss- Prot;Acc:Q969Z4] ENSG00000128482 RNF112 RING finger protein 112 (Zinc finger protein 179)(Brain finger protein) 
[Source:UniProtKB/Swiss-Prot;Acc:Q9ULX5] ENSG00000007314 SCN4A Sodium channel protein type 4 subunit alpha (Sodium channel protein type IV subunit alpha)(Voltage-gated sodium channel subunit alpha Nav1.4)(Sodium channel protein skeletal muscle subunit alpha)(SkM1) [Source:UniProtKB/Swiss-Prot;Acc:P35499] ENSG00000071537 SEL1L Protein sel-1 homolog 1 Precursor (Suppressor of lin-12-like protein 1)(Sel-1L) [Source:UniProtKB/Swiss-Prot;Acc:Q9UBV2] ENSG00000099381 SETD1A Histone-lysine N-methyltransferase SETD1A (EC 2.1.1.43)(SET domain- containing protein 1A)(hSET1A)(Set1/Ash2 histone methyltransferase complex subunit SET1)(Lysine N-methyltransferase 2F) [Source:UniProtKB/Swiss-Prot;Acc:O15047] ENSG00000181856 SLC2A4 Solute carrier family 2, facilitated glucose transporter member 4 (Glucose transporter type 4, insulin-responsive)(GLUT-4) [Source:UniProtKB/Swiss- Prot;Acc:P14672] ENSG00000184402 SS18L1 SS18-like protein 1 (SYT homolog 1) [Source:UniProtKB/Swiss-Prot;Acc:O75177] ENSG00000099365 STX1B Syntaxin-1B (Syntaxin-1B1)(Syntaxin-1B2) [Source:UniProtKB/Swiss-Prot;Acc:P61266] ENSG00000114854 TNNC1 Troponin C, slow skeletal and cardiac muscles (TN-C) [Source:UniProtKB/Swiss-Prot;Acc:P63316] ENSG00000159173 TNNI1 Troponin I, slow skeletal muscle (Troponin I, slow-twitch isoform) [Source:UniProtKB/Swiss-Prot;Acc:P19237] ENSG00000198258 UBL5 Ubiquitin-like protein 5 [Source:UniProtKB/Swiss-Prot;Acc:Q9BZL1] 350 Table S9. 
Mouse skeletal reference collection with 25 random genes (continued)

Ensembl Gene ID  HGNC Symbol  Ensembl Description
ENSG00000124486  USP9X        Probable ubiquitin carboxyl-terminal hydrolase FAF-X (EC 3.1.2.15)(Ubiquitin thioesterase FAF-X)(Ubiquitin-specific-processing protease FAF-X)(Deubiquitinating enzyme FAF-X)(Fat facets protein-related, X-linked)(Ubiquitin-specific protease 9, X chromosome) [Source:UniProtKB/Swiss-Prot;Acc:Q93008]
ENSG00000157796  WDR19        WD repeat-containing protein 19 [Source:UniProtKB/Swiss-Prot;Acc:Q8NEZ3]
ENSG00000166435  XRRA1        X-ray radiation resistance-associated protein 1 [Source:UniProtKB/Swiss-Prot;Acc:Q6P2D8]
ENSG00000161914  ZNF653       Zinc finger protein 653 (67 kDa zinc finger protein)(Zinc finger protein Zip67) [Source:UniProtKB/Swiss-Prot;Acc:Q96CK0]

Table S10. Validation of promoter CRM analyses method with a dataset that included the muscle reference collection and 25 randomly selected genes and random background sampling.
Top ranked CRM predictions from the promoter analyses validation using a dataset of 50 genes (25 muscle collection genes and 25 additional randomly selected genes), two randomly sampled background datasets of 5000 ortholog pairs, and the following analysis parameters: search regions: 2000 base pairs (bp) upstream and downstream of the transcription start site; inter-binding site distance constraint of 225 bp; and all vertebrate TF binding profiles. Known skeletal muscle CRMs are highlighted in yellow.

Table S10. A. TF class pair over-representation (top five results).

TF Name  Class             TF Name            Class   FG Hits  FG Non Hits  BG Hits  BG Non Hits  Score
MEF2A    MADS              SP1                ZN-FINGER, C2H2  15  35       531      4469         0.00016293
MEF2A    MADS              Hand1-Tcfe2a (80)  bHLH    18       32           756      4244         0.00025141
MEF2A    MADS              SRF                MADS    4        46           43       4957         0.0011089
ESR1     NUCLEAR RECEPTOR  Bapx1              HOMEO   6        44           123      4877         0.0015632
RORA     NUCLEAR RECEPTOR  Hand1-Tcfe2a       bHLH    26       24           1517     3483         0.0017172

Table S10. B. Top thirteen over-represented TFBS pairs.
TF Name (ID)  Class             TF Name (ID)       Class             FG Hits  FG Non Hits  BG Hits  BG Non Hits  Score
MEF2A         MADS              SP1                ZN-FINGER, C2H2   15       35           483      4508         5.7681e-05
MEF2A         MADS              Pdx1               HOMEO             24       26           1132     3859         0.00011452
MEF2A         MADS              Hand1-Tcfe2a (80)  bHLH              18       32           717      4274         0.00013127
MEF2A         MADS              TAL1-TCF3          bHLH              10       40           249      4742         0.00017833
MEF2A         MADS              SRF                MADS              4        46           34       4957         0.00049328
Sox17         HMG               SP1                ZN-FINGER, C2H2   37       13           2461     2530         0.00053417
MEF2A         MADS              RREB1              ZN-FINGER, C2H2   3        47           15       4976         0.00067479
ESR1          NUCLEAR RECEPTOR  Bapx1              HOMEO             6        44           115      4876         0.0011318
MEF2A         MADS              HNF4A              NUCLEAR RECEPTOR  8        42           211      4780         0.0012358
Sox17         HMG               TAL1-TCF3          bHLH              22       28           1164     3827         0.0012595
MEF2A         MADS              Roaz               ZN-FINGER, C2H2   6        44           118      4873         0.0012857
YY1           ZN-FINGER, C2H2   NKX2-5 V2          HOMEO             43       7            3232     1759         0.0014695
MEF2A         MADS              NKX2-5 V2          HOMEO             17       33           798      4193         0.0016167

Table S11. Parameters used in the validation of the promoter CRM analyses method.

Jaspar TFBS Phylum  Max. TFBS Binding Site Distance (bp)  TFBS Score Threshold (%)  Upstream Sequence from TSS (bp)  Downstream Sequence from TSS (bp)
Vertebrate          225                                   80                        100                              5000
Vertebrate          225                                   80                        1000                             5000
Vertebrate          225                                   80                        2000                             2000
Vertebrate          225                                   80                        2000                             3000
Vertebrate          225                                   80                        4000                             3000
Vertebrate          225                                   80                        5000                             3000
Vertebrate          225                                   80                        6000                             3000
Vertebrate          225                                   80                        7000                             3000
Vertebrate          225                                   80                        10000                            3000

Table S12. Mouse skeletal muscle enhancer genomic coordinates.
Mouse Assembly 36, Feb 2006 (mm8)

Gene Enhancers  Human Ensembl Gene Id  Chr  From Coordinate  To Coordinate
DMD_MOUSE       ENSG00000198947        X    79208550         79208871
MYG_MOUSE       ENSG00000198125        15   76849839         76850250
MEF2C_MOUSE     ENSG00000081189        13   83981341         83981501
MYH3_MOUSE      ENSG00000125414        11   67053712         7053977
ACHA_MOUSE      ENSG00000138435        2    73381070         73381369
ACHB_MOUSE      ENSG00000170175        11   69612003         69612304
ACHG_MOUSE      ENSG00000196811        1    89036593         89036963
ACHD_MOUSE      ENSG00000135902        1    89021553         89021854
ACHE_MOUSE      ENSG00000108556        11   70435308         70435605
KCRM_MOUSE_1    ENSG00000104879        7    18568267         18568567
KCRM_MOUSE_2    ENSG00000104879        7    18568838         18569622
MYF6_MOUSE      ENSG00000111046        10   106898903        106899202
MYOD_MOUSE      ENSG00000129152        7    46239064         46239781
MYL4_MOUSE      ENSG00000198336        11   104392837        104393509
TNCC_MOUSE      ENSG00000114854        14   30038200         30038502
MYH6_MOUSE      ENSG00000197616        14   53920850         53921149

Table S13. Mouse myelin gene-associated enhancer genomic coordinates.

Mouse Assembly 36, Feb 2006 (mm8)

Gene Enhancers  Human Ensembl Gene Id  Chr  From Coordinate  To Coordinate
Cldn11          ENSG00000013297        3    31357206         31357836
Cnp             ENSG00000173786        11   100391366        100392107
Ermin           ENSG00000136541        2    57868209         57868795
Gjb1/Cx32       ENSG00000169562        X    97586055         97586691
Mal             ENSG00000172005        2    127318715        127319393
Mbp_M3          ENSG00000197971        18   82683969         82684438
Mbp_M4          ENSG00000197971        18   82680074         82680286
Olig1           ENSG00000184221        16   91189705         91190805
Olig2           ENSG00000205927        16   91097897         91098790
Plp_Pk211       ENSG00000123560        X    132185516        132186725
Plp_WmN1        ENSG00000123560        X    132172873        132174044
Plp_WmN2        ENSG00000123560        X    132174737        132175770
Pou3f1/Oct-6    ENSG00000185668        4    124169971        124170620

Table S14.
Vertebrate TF profile clustering and class label assignment Cluster # Jaspar TFBS Jaspar TFBS ID TFBS Class Label 1 Arnt MA0004 HLH GRP 1 MAX MA0058 HLH GRP 1 MYC-MAX MA0059 HLH GRP 1 Mycn MA0104 HLH GRP 1 USF1 MA0093 HLH GRP 2 Arnt-Ahr MA0006 ARNT-AHR (HLH) 2 Pax6 MA0069 PAX6 (HOX) 3 Ar MA0007 AR (NR) 3 NR3C1 MA0113 NR3C1 (NR) 4 T MA0009 BRACHYURY-T (TDOMAIN) 4 ZEB1 MA0103 ZEB1 (ZF) 5 Pax5 MA0014 PAX5 (HOX) 6 HNF4A MA0114 NR GRP 6 NR1H2-RXRA MA0115 NR GRP 6 NR2F1 MA0017 NR GRP 6 PPARG-RXRA MA0065 NR GRP 7 CREB1 MA0018 CREB1 (LEUZIP) 7 RORA_1 MA0071 RORA1 (NR) 7 RORA_2 MA0072 RORA2 (NR) 8 Cebpa MA0102 CEBPA (LEUZIP) 8 Ddit3-Cebpa MA0019 DDIT3-CEBPA (LEUZIP) 9 E2F1 MA0024 E2F1 (TFF) 10 HLF MA0043 HLF (LEUZIP) 10 NFIL3 MA0025 NFIL3 (LEUZIP) 11 ELF5 MA0136 ETS GRP 11 ELK1 MA0028 ETS GRP 11 ELK4 MA0076 ETS GRP 11 GABPA MA0062 ETS GRP 11 SPIB MA0081 ETS GRP 12 Evi1 MA0029 EVi1 (ZF) 13 Foxa2 MA0047 FKH GRP 13 FOXD1 MA0031 FKH GRP 13 Foxd3 MA0041 FKH GRP 13 FOXF2 MA0030 FKH GRP 13 FOXI1 MA0042 FKH GRP 13 Foxq1 MA0040 FKH GRP 14 Gfi MA0038 GFI (ZF) 14 NFYA MA0060 NFYA (CAATBOX) 14 PBX1 MA0070 PBX1 (HOX) 15 HNF1A MA0046 HNF1A (HOX) 16 Myb MA0100 MYB (MYB) 16 Myf MA0055 MYF (HLH) 16 NHLH1 MA0048 NHLH1 (HLH) 356 Table S14. 
Vertebrate TF profile clustering and class label assignment (continued) Cluster # Jaspar TFBS Jaspar TFBS ID TFBS Class Label 17 IRF1 MA0050 IRF1 (IRF) 17 IRF2 MA0051 IRF2 (IRF) 17 STAT1 MA0137 STAT1 (STAT) 18 MEF2A MA0052 MEF2A (MADS) 19 MZF1_1-4 MA0056 MZF1_1-4 (ZF) 19 MZF1_5-13 MA0057 MZF1_5-13 (ZF) 20 NF-kappaB MA0061 REL GRP 20 NFKB1 MA0105 REL GRP 20 REL MA0101 REL GRP 20 RELA MA0107 REL GRP 21 Lhx3 MA0134 HOX GRP 21 Lhx3_V2 MA0135 HOX GRP 21 Nkx2-5 MA0063 HOX GRP 21 Nobox MA0125 HOX GRP 21 Pdx1 MA0132 HOX GRP 22 ESR1 MA0112 ESR1 (NR) 22 PPARG MA0066 PPARG (NR) 23 Pax4 MA0068 PAX3 (HOX) 24 RREB1 MA0073 RREB1 (ZF) 25 RXRA-VDR MA0074 RXRA-VDR (NR) 26 Prrx2 MA0075 PRN2 (HOX) 27 Sox17 MA0078 SOX/HMG GRP 27 Sox5 MA0087 SOX/HMG GRP 27 SOX9 MA0077 SOX/HMG GRP 27 SRY MA0084 SOX/HMG GRP 28 SP1 MA0079 SP1 (ZF) 29 SRF MA0083 SRF (MADS) 30 Staf MA0088 STAF (ZF) 31 TEAD1 MA0090 TEAD1 (TEA) 32 TAL1-TCF3 MA0091 TAL1 (HLH) 32 YY1 MA0095 YY1 (ZF) 33 Hand1-Tcfe2a MA0092 HAND1-TCFE2A (HLH) 34 Fos MA0099 FOS (LEUZIP) 35 TP53 MA0106 TP53 (LSH) 36 Spz1 MA0111 SPZ1 (HLH) 37 Roaz MA0116 ROAZ (ZF) 38 TLX1-NFIC MA0119 TLX1-NFIC (HOX-CAAT) 39 Bapx1 MA0122 NKX/HOX GRP 39 NKX2-2 MA0506 NKX/HOX GRP 39 NKX2-5_V2 MA0507 NKX/HOX GRP 39 NKX3-1 MA0124 NKX/HOX GRP 40 ZNF354C MA0130 ZNF354C (ZF) 41 MIZF MA0131 MIZF (ZF) 357 Table S14. Vertebrate TF profile clustering and class label assignment (continued) Cluster # Jaspar TFBS Jaspar TFBS ID TFBS Class Label 42 REST MA0138 REST (ZF) 43 Pou2F1/Oct1 MA0500 POU2F1 (POU-HOX) 44 Egr1/Krox-24/NGFI-A MA0501 EGR/ZF GRP 44 Egr2/Krox-20 MA0502 EGR/ZF GRP 44 Egr3 MA0503 EGR/ZF GRP 44 Egr4 MA0504 EGR/ZF GRP 45 POU3F1/OCT6 MA0505 POU3F1 (POU-HOX) 46 CTCF_ren MA0508 CTCF (ZF) 47 Gli MA0509 GLI (ZF) 358 Table S15. 
Summary of enhancer-weighted CRMs of size three for co-expressed oligodendrocyte-targeted genes

TFBS | TFBS | TFBS | Predicted Targeted Genes
Spz1 | TEAD1 | TEAD1 | CLDN11; FOXG1B
ELK4 | NKX3-1 | NKX3-1 | CACNB4; PLP1
Spz1 | Spz1 | TEAD1 | OTUD7B; Q8NA55_HUMAN; PDLIM2; SDC1; TMEM2; NFIX
Foxq1 | NFKB1 | NFKB1 | NFIX
FOXD1 | Myf | Myf | PTHR1; OTUD7B; MMP14; ADAMTS4; RIMS2; ARMCX2; RFFL; ELK3; APLP1; TM7SF2; PTPRD; OLFML3; RNF26; CCNB2; FGFR1; CLDN11; TPD52; NFIX; RAMP2; CACNB4; TNNI1; ENPP2; PLP1
MYC-MAX | SP1 | Spz1 | ELOVL6; NR2F1; SMARCD3; KLF4; RAB33A; IGFBP4; BMP7; PPAP2C; ACTN1; LEPREL2; DAB2IP; JOSD2; DBN1; ST3GAL2; TNNI1; FOXG1B; GJB1
Egr1 | Egr1 | SRY | DBNDD2; NFIX; GJB1
Gfi | Gfi | MEF2A | NR2F1; OTUD7B; ADAMTS4; FGF13; RFFL; KLF4; APCDD1; GRB14; ADIPOR1; PTPRD; NEK6; FOS; OLFML3; COL9A1; STRN; FGFR1; CRTAP; ID4; CLDN11; OMG; FRMD4B; FZD2; PELI1; FOXG1B; ETS1
Egr2 | Sox5 | Sox5 | VASN; FGF13; KLF4; OSBPL1A; FOS; CDC37L1; SH3GL3; FGFR1; NFE2L3; POU3F1; MAL; GJB1
Egr2 | SRY | Sox17 | VASN; ADAMTS4; SMARCD3; KLF4; RAB33A; PLEKHH1; OSBPL1A; NFE2L3
Egr2 | SRY | Sox5 | RIMS2; RFFL
Foxq1 | Foxq1 | NFKB1 | NKD1
ELK4 | SP1 | SP1 | ELOVL6; FNTA; OTUD7B; Q8NA55_HUMAN; SMARCD3; DBNDD2; RIMS2; KLF4; RAB33A; ELK3; CMTM3; SDC1; KLHL2; SLC12A2; ACTN1; ACSS2; GPR153; RNF26; DAB2IP; SPRED2; FGFR1; ID4; CLDN11; PLEKHG2; NKX6-2; AMPD2; NFIX; FMNL3; FRMD4B; FZD2; CPT1A; TSPAN2; FOXG1B; GJB1
Egr1 | SRY | SRY | VASN; C7orf24; PLXNB3; MBP; RNF26; K0256_HUMAN; SH3GL3; FGFR1; NFE2L3; ETS2; MAL; FRMD4B
Hand1-Tcfe2a | Lhx3 | Lhx3 | NR2F1; MAP7; FGF13; NKD1; FAM107B; TMEM49; ETS2; OMG; CHN2
NHLH1 | TEAD1 | TEAD1 | CLDN11
Hand1-Tcfe2a | Lhx3 | MEF2A | RAB11FIP5
MYC-MAX | Spz1 | Spz1 | ARMCX2; PTPRD; ACTN1; LEPREL2; CNP; FMNL3; TNNI1; PLP1
Egr3 | Sox5 | Sox5 | FGF13; CDC37L1; FGFR1; POU3F1; MAL; GJB1
Egr2 | Egr3 | SRY | FGF13; SYTL2; FAM107B; FOS; DAB2IP; CDC37L1; MPP5; FGFR1; PLEKHG2; NFIX
Egr2 | Egr3 | Sox5 | PLEKHB1; C7orf24; DDC; SYTL2; FAM107B; DAB2IP; MPP5; STRN; ID4
Gfi | Hand1-Tcfe2a | MEF2A | GRB14; CRTAP; S100A3; ETS1
Egr2 | SRY | SRY | VASN; ADAMTS4; C7orf24; PLXNB3; DDC; KLF4; SH3GL3; FGFR1; NFE2L3; POU3F1; MAL; FRMD4B; GJB1
Egr2 | Egr2 | Sox5 | DBNDD2; MAL; GJB1
MEF2A | MEF2A | POU3F1 | NR2F1; BICC1; RIMS2; GSN; PRKCZ; ASCL1; PLEKHG2
Fos | Fos | NKX3-1 | FNTA; PLEKHB1; FGF13; RIMS2; KLF4; RAB33A; PCYT2; MBP; RRM2; PLEKHH1; MAST2; OSBPL1A; ASCL1; PTPRD; ACTN1; CD68; FOS; ARPC1A; RRAS2; ETS2; OMG; CHN2; TMEFF2; FRMD4B; CACNB4
MYC-MAX | MYC-MAX | SP1 | PTPRD; CLDN11; NKX6-2; AMPD2; FZD2
Fos | NKX3-1 | NKX3-1 | NR2F1; OTUD7B; MMP14; SLAIN1; EPS15; FGF13; ARMCX2; APCDD1; GRB14; ELK3; MBP; RRM2; NKD1; SYTL2; PLEKHA1; OSBPL1A; RHPN1; FAM107B; LRIG1; NEK6; TMEM49; PLA2G4A; OLFML3; SPRED2; ELOVL7; PIGA; ID4; TPD52; ETS2; OMG; NFIX; CACNB4; PELI1; FOXG1B
Egr1 | SRY | Sox5 | SMARCD3; RNF26; K0256_HUMAN; LEPREL2; ETS2
Hand1-Tcfe2a | Hand1-Tcfe2a | Lhx3 | EPS15; RFFL; APLP1; MBP; IDI1; PIGA; ETS2; OMG; TMEFF2; FRMD4B; MEST; ETS1
Egr1 | Egr2 | SRY | FGF13; SYTL2; FOS; DAB2IP; FGFR1; PLEKHG2; NFIX
Egr2 | Sox17 | Sox5 | KLF4; RAB33A; OSBPL1A; FAM107B; KLHL2; CDC37L1; POU3F1
Egr1 | Egr1 | Sox5 | DBNDD2; ID4; GJB1
FOXD1 | FOXD1 | Myf | EPS15; FGF13; DDC; PCYT2; GSN; IGFBP4; LRIG1; OLFML3; IDI1; CHN2; FOXG1B
MEF2A | POU3F1 | POU3F1 | EPS15; BICC1; FGF13; RIMS2; KLF4; GRB14; GRAMD3; OSBPL1A; F2R; FAM107B; KLHL2; FOS; PLA2G4A; OLFML3; K0256_HUMAN; RAB11FIP5; CRTAP; ID4; CLDN11; OMG; CHN2; FRMD4B; SC4MOL; FOXG1B; ETS1
MYC-MAX | MYC-MAX | Spz1 | CLDN11; NKX6-2; JOSD2
ELK4 | MYC-MAX | SP1 | EDG8; KLF2; CEBPB
RORA_1 | RORA_1 | Spz1 | NFIX
RORA_1 | Spz1 | Spz1 | POU3F1; PLP1
Egr2 | Sox17 | Sox17 | NR2F1; ADAMTS4; PLEKHB1; SMARCD3; C7orf24; PLXNB3; KLF4; SLC7A5; RAB33A; SYTL2; OSBPL1A; FAM107B; SH3GL3; NFE2L3; POU3F1; MAL; ETS1; GJB1
Egr3 | SRY | SRY | C7orf24; PLXNB3; DDC; MBP; SH3GL3; FGFR1; POU3F1; MAL; GJB1
Egr3 | SRY | Sox5 | ARMCX2
Egr1 | Sox5 | Sox5 | VASN; FGF13; MBP; FOS; K0256_HUMAN; LEPREL2; SH3GL3; FGFR1; NFE2L3; ETS2; MAL
Egr3 | Egr3 | Sox5 | DBNDD2; ID4
MYC-MAX | Spz1 | TEAD1 | CLDN11
ELK4 | ELK4 | SP1 | PTPRD; TMEM49; NCAPD2
Egr3 | Egr3 | SRY | DBNDD2; NFIX
Egr1 | Egr2 | Sox5 | PLEKHB1; C7orf24; SYTL2; DAB2IP; STRN; ID4; FRMD4B; ETS1
Hand1-Tcfe2a | MEF2A | MEF2A | EPS15; GSN; ID4; PELI1; FOXG1B
MYC-MAX | SP1 | SP1 | ELOVL6; MMP14; Q8NA55_HUMAN; SMARCD3; PLXNB3; APLP1; TM7SF2; IGFBP4; BMP7; PPP1R14A; SDC1; CYP51A1; PPAP2C; ACTN1; ACSS2; GPR153; CD68; FOS; DAB2IP; DHCR24; CLDN11; SEMA4D; RAMP2; FMNL3; JOSD2; DBN1; MAL; FRMD4B; ST3GAL2; FOXG1B; PLP1; GJB1
Hand1-Tcfe2a | Hand1-Tcfe2a | MEF2A | NR2F1; FGF13; RIMS2; ARMCX2; APCDD1; SYTL2; OSBPL1A; KLHL2; NEK6; FGFR1; CRTAP; OMG; MAL; FRMD4B; FZD2; ETS1
Gfi | MEF2A | MEF2A | NR2F1; BICC1; RIMS2; BMP7; SYTL2; PTPRD; KLHL2; PLA2G4A; OLFML3; ID4; PLEKHG2; SC4MOL; FOXG1B; ETS1
NHLH1 | Spz1 | TEAD1 | POU3F1
Egr1 | Egr3 | Sox5 | STRN
Egr2 | Egr2 | SRY | DBNDD2; NFIX; MAL; GJB1
Egr2 | Egr2 | Sox17 | EDG8; BMP7; ACTN1; MAL; GJB1
Gfi | MEF2A | POU3F1 | EPS15; BICC1; KLF4; APCDD1; ACTN1; PLA2G4A; IDI1; COL9A1; FGFR1; CRTAP; PLEKHG2; NSDHL; TMEM2; NFIX; CHN2; PELI1

Table S16. Oligodendrocyte enhancer groups

Enhancer | Enhancer Group
Claudin_11 | Positive
MAL | Positive
MBP_M3 | Positive
PLP_WmN1 | Positive
CNP | Negative
Ermin | Negative

E. Promoter analyses of co-expressed oligodendrocyte gene set

Supplemental discussion: All oligodendrocyte promoter analysis results were combined to identify four recurring TFBS class combinations representing the top ten ranked hits: SP1 (ZF) - SPZ1 (HLH); ROAZ (ZF) - ETS GRP; TAL1 (HLH) - SPZ1 (HLH); HLH GRP - SPZ1 (HLH).

SP1 (ZF): SP1 is up-regulated during oligodendrocyte differentiation and regulates MBP gene transcription [9-12]. We noted that expression of Sp5, an Sp1 family member, is down-regulated between the P4 and P10 time points. However, Sp5 is detected in neither the OPC-EOLd nor the EOL-MOLd gene expression sets; its expression may therefore be turned off very early in development.
Sp5 has been shown to act as a transcriptional repressor of Sp1-targeted genes and has a DNA binding specificity similar to that of SP1 [13]. In the telencephalon, Sp5 expression, via Wnt/beta-catenin signaling, decreases Nkx2.1, Mash1, Gsh2, Olig2, and Dlx2 expression and, importantly, a recent study showed that Nkx2.1 is regulated by Sp1 and Sp3 [14].

SPZ1 (HLH): The spermatogenic leucine zipper 1 (SPZ1) protein is a helix-loop-helix (HLH) protein that includes a leucine zipper at its C-terminus, which enables protein dimerization. Notably, the binding site properties of this TF do not cluster with other profiles, and its HLH domain does not satisfy the corresponding Pfam hidden Markov models using standard threshold cut-offs [15], suggesting that its DNA binding domain (DBD) sequence may be atypical. This TF is largely expressed in the testes but has been detected in the brain [16] and in oligodendrogliomas [17]. However, we could find no direct evidence that this protein is expressed in normal oligodendrocytes, and we believe that this prominent motif prediction implicates the involvement of an OL-expressed TF protein with a similar protein structure and/or DNA-binding preference.

ROAZ (ZF): ZFP423/Roaz is a ZF TF that is required for patterning the development of neuronal and glial precursors in the developing brain [18]. Notably, this TF protein appears to be expressed in oligodendrocytes in the rat brain at P7 [19], although there is currently no known tie to myelination.

HLH GRP: The HLH GRP represents TFs that bind the canonical E-box, and a number of HLH TFs have demonstrated important regulatory roles in OLs, including the Olig proteins, Mash1, and Hes (see main manuscript for further discussion). TAL1/SCL requires interaction with a Class A HLH (E12, E47, E2-2), which enables HLH dimer binding to specific DNA motifs [20]. A recent study showed that glial fate may be regulated by antagonistic interactions between Tal1 and Olig2 in the ventral neural tube [21].
It is possible that the binding of this TF, coupled with another Class A HLH protein, could act either antagonistically or synergistically in myelin gene transcription. Tal1 gene expression is detected in OLPs in the rat brain at P7 [19].

ETS GRP: The ETS GRP of TF proteins includes the Ets and Ets variant (ETV), Elk, and Elf TFs. A number of ETS transcription factors, such as Ets1, Ets2, and Elk3, are expressed but down-regulated in the IOLEDd set. However, the ETS variants Etv1 (U OLS/Myelin) and Etv6 (OPCs/OLs u) are up-regulated in the EOL_MOLd and OPC_EOLd forebrain gene expression sets, respectively. Although we found no direct literature evidence for ETS TF involvement in vertebrate oligodendrocyte development, this family of TFs has been shown to direct gene regulation in Schwann cells [22, 23].

F. An Oligodendrocyte TF network supported by enhancer predicted regulatory elements

Table S17. Summary of predicted over-represented CRM classes and targeted oligodendrocyte co-expressed genes

TF Class I | TF Class II | Predicted Targeted Genes
AR (NR) | SPZ1 (HLH) | ANTXR1; IGFBP4; OLFML3; OMG; RAB11FIP5
BRACHYURY-T (TDOMAIN) | EGR/ZF GRP | FOS; FRMD4B; LRIG1; VASN
CREB1 (LEUZIP) | SPZ1 (HLH) | LEPREL2; NKX6-2; RAB33A
E2F1 (TFF) | MIZF (ZF) | ARPC1B; ASCL1; CDCA5; COL5A1; CRTAP; FGF13; FOXG1B; MAST2; NFIX; NR2F1; PELI1; PLP1; RAB33A; SMARCD3; STRN; TMEM2; WNT7B
E2F1 (TFF) | PAX3 (HOX) | NFIX; NR2F1; RAB33A
E2F1 (TFF) | SOX/HMG GRP | ARMCX2; BICC1; CD68; CHN2; EPS15; FGF13; FMNL3; FOXG1B; FRMD4B; FZD2; NFIX; NKX6-2; PCTK3; PLP1; RFFL; TMEM49
ETS GRP | NKX/HOX GRP | ACSS2; ACTN1; ADAMTS4; ADIPOR1; AMPD2; ARPC1B; BICC1; CACNB4; CDCA5; DAB2IP; DBNDD2; ELK3; ELOVL6; FAM107B; FMNL3; FOXG1B; FRMD4B; LEPREL2; MAP7; NFIX; NKD1; NR2F1; PDLIM2; PIGA; PLEKHG2; Q8NA55_HUMAN; RAB33A; RFFL; RNF26; SLC12A2; SPRED2; TMEM49
ETS GRP | ROAZ (ZF) | AATK; ACSS2; ADAMTS4; ADIPOR1; AMPD2; ANTXR1; APLP1; ARMCX2; ARPC1B; ASCL1; BICC1; BIRC2; BMP7; CACNB4; CD68; CDCA5; CHFR; CLDN11; CMTM3; CNP; COL18A1; COL5A1; COL9A1; CPOX; CTSE; CXCL12;
DAB2IP; DBN1; DBNDD2; DDC; DHCR24; ELOVL6; ETS1; ETS2; FA2H; FAM107B; FGF13; FGFR1; FMNL3; FNTA; FOS; FOXG1B; FRMD4B; FZD2; GAMT; GAS6; GJA12; GLRB; GPR153; GRB14; GSN; ID4; IDI1; JOSD2; KLF4; LAPTM5; LEPREL2; LRIG1; MAG; MBP; MEST; MMP14; NFE2L3; NFIX; NKD1; NKX6-2; NP_001033793.1; NR2F1; OLFML3; OMG; OTUD7B; PCTK3; PCYT2; PDLIM2; PIGA; PLD4; PLEKHA1; PLEKHB1; PLEKHG2; PLEKHH1; PLP1; PLXNB3; PPP1R14A; PRKCZ; PTHR1; PTPRD; Q8NA55_HUMAN; RAB11FIP5; RAB33A; RAMP2; RFFL; RIMS2; RNF26; S100A3; SFRP2; SGK2; SH3GL3; SLAIN1; SLC12A2; SMARCD3; SYTL2; TCF7L1; TGFBR2; TIMM17A; TM7SF2; TMEM2; TNNI1; TSPAN2; VASN; WNT7B ETS GRP SP1 (ZF) AATK; ACSS2; ACTN1; ADIPOR1; AMPD2; ANTXR1; APCDD1; APLP1; ARMCX2; ARPC1B; ASCL1; BICC1; BMP7; C7orf24; CACNB4; CD68; CEBPB; CHN2; CLDN11; CMTM3; CNP; COL18A1; COL5A1; CPOX; CPT1A; CRTAP; CTSE; CXCL12; DAB2IP; DBN1; DBNDD2; DHCR24; EDG8; ELK3; ELOVL6; ELOVL7; ENPP2; ETS2; FA2H; FAM107B; FGF13; FGFR1; FMNL3; FNTA; FOS; FOXG1B; FRMD4B; FZD2; GAMT; GPR153; GSN; ID4; IER3; IER5; IGFBP4; JOSD2; K0256_HUMAN; KLF2; KLF4; KLHL2; LEPREL2; LRIG1; MAG; MAST2; MBP; MEST; MMP14; MPP5; NCAPD2; NEK6; NFIX; NKX6-2; NR2F1; OLFML3; OTUD7B; PCTK3; PCYT2; PDLIM2; PIGA; PLEKHB1; PLEKHG2; PLEKHH1; PLP1; PLXNB3; PPAP2C; PPP1R14A; PRKCZ; PTHR1; PTPRD; Q8NA55_HUMAN; RAB11FIP5; RAB33A; RAMP2; RAP2A; REEP5; RIMS2; RNF26; RRM2; SDC1; SEMA4D; SFRP2; SH3GL3; SIRT2; SLAIN1; SLC12A2; SLC1A4; SLC7A5; SMARCD3; SPRED2; ST3GAL2; STRN; SYTL2; TCF7L1; TGFBR2; TM7SF2; TMEM49; TNNI1; TSPAN2; VASN; WNT7B EVi1 (ZF) HAND1- TCFE2A (HLH) ARMCX2; CLDN11; DDC; ELOVL6; ETS2; FGF13; FOXG1B; LRIG1; NCAPD2; NKD1; NR2F1; PDLIM2; PLP1; RIMS2; S100A3; SMARCD3; TMEFF2 EVi1 (ZF) MYF (HLH) CNP; DDC; ENPP2; FOXG1B; LRIG1; MPP5; NR2F1; PDLIM2; RFP2 365 Table S17. 
Summary of predicted over-represented CRM classes and targeted oligodendrocyte co-expressed genes (continued) TF Class I TF Class II Predicted Targeted Genes EVi1 (ZF) NHLH1 (HLH) CNP; COL5A1; FOXG1B; LRIG1; SMARCD3 EVi1 (ZF) ROAZ (ZF) DDC; ELOVL6; FGFR1; NKD1; SMARCD3 FKH GRP MYF (HLH) ADAMTS4; ADIPOR1; APCDD1; APLP1; ARMCX2; ASCL1; BIRC2; CA2; CACNB4; CCNB2; CD68; CHN2; CPT1A; CTSE; DAB2IP; DBN1; DDC; ELK3; ENPP2; EPS15; FAM107B; FGF13; FGFR1; FMNL3; FOS; FOXG1B; FRMD4B; GPR153; GSN; IDI1; IGFBP4; KLF4; KLHL2; LRIG1; MMP14; MPP5; NFIX; NR2F1; OLFML3; OMG; OTUD7B; PCYT2; PLEKHB1; PLP1; PTHR1; PTPRD; Q8NA55_HUMAN; RAMP2; RFFL; RFP2; RIMS2; RNF26; RRAS2; SC4MOL; SEMA4D; SLAIN1; SYTL2; TGFBR2; TM7SF2; TMEFF2; TMEM2; TMEM49; TNNI1; TPD52 FKH GRP NHLH1 (HLH) ACTN1; ADAMTS4; ADIPOR1; APCDD1; ARMCX2; ASCL1; BICC1; CDCA5; CHN2; COL9A1; CXCL12; DAB2IP; DDC; FAM107B; FGF13; FGFR1; FMNL3; FOXG1B; FZD2; GPR153; ID4; IDI1; JAM3; KLF4; KLHL2; LRIG1; NFIX; NKD1; NKX6-2; NR2F1; OLFML3; OMG; PLEKHG2; PLP1; PRKCZ; PTHR1; PTPRD; Q8NA55_HUMAN; RAB33A; RAMP2; RFFL; RIMS2; SDC1; SFRP2; SLAIN1; SLC12A2; STRN; SYTL2; TGFBR2; TM7SF2 FKH GRP POU3F1 (POU-HOX) ADIPOR1; APCDD1; ARMCX2; ASCL1; BICC1; BIRC2; BMP7; C7orf24; CACNB4; CCNB2; CD68; CHN2; CLDN11; COL9A1; CPOX; CXCL12; DAB2IP; DDC; ELK3; ELOVL6; ELOVL7; ENPP2; EPS15; ETS1; ETS2; F2R; FAM107B; FGF13; FMNL3; FNTA; FOS; FOXG1B; FRMD4B; FZD2; GLRB; GPR153; GRAMD3; GRB14; GSN; ID4; IDI1; IER3; ITGB5; JAM3; K0256_HUMAN; KLF4; KLHL2; LRIG1; MAST2; MBP; MEST; MMP14; MPP5; NCAPD2; NEK6; NFE2L3; NFIX; NKD1; NKX6-2; NP_001033793.1; NR2F1; NSDHL; OLFML3; OMG; OSBPL1A; OTUD7B; PCYT2; PELI1; PIGA; PLA2G4A; PLEKHA1; PLEKHB1; PLEKHG2; PLEKHH1; PLP1; PRKCZ; PTHR1; PTPRD; RAB11FIP5; RAB33A; RACGAP1; RAP1A; RAP2A; REEP5; RFFL; RFP2; RIMS2; RRM2; SC4MOL; SFRP2; SH3GL3; SLAIN1; SLC12A2; SMARCD3; SPRED2; ST3GAL2; STRN; SYTL2; TALDO1; TCF7L1; TIMM17A; TMEFF2; TMEM2; TMEM49; TPD52; TSPAN2; UGT8; VASN FKH GRP REL GRP ACSS2; ADAMTS4; ADIPOR1; AMPD2; APCDD1; 
APLP1; ARMCX2; BICC1; BIRC2; CA2; CACNB4; CD68; CHN2; COL18A1; COL5A1; COL9A1; CPOX; CPT1A; CXCL12; DAB2IP; DBNDD2; DHCR24; EDG8; ELK3; ELOVL6; ELOVL7; ENPP2; EPS15; ETS1; ETS2; FAM107B; FGF13; FMNL3; FOS; FOXG1B; FRMD4B; FZD2; GRAMD3; GRB14; ID4; IER3; JAM3; JOSD2; KLF2; KLF4; KLHL2; LAPTM5; LRIG1; MEST; MPP5; NEK6; NFIX; NKD1; NP_001033793.1; NR2F1; OLFML3; OMG; OTUD7B; PCYT2; PDLIM2; PELI1; PIGA; PLA2G4A; PLEKHA1; PLEKHB1; PLEKHG2; PLEKHH1; PLP1; PTHR1; PTPRD; RAB11FIP5; RAB33A; RAMP2; RAP2A; RFFL; RFP2; RIMS2; RNF26; SDC1; SEMA4D; SH3GL3; SLAIN1; SLC7A5; SMARCD3; STRN; SYTL2; TIMM17A; TM7SF2; TMEM2; TMEM49; TNNI1; TPD52; UGT8; VASN FKH GRP SPZ1 (HLH) ACTN1; ADAMTS4; AMPD2; ARMCX2; ASCL1; CACNB4; CXCL12; CYP51A1; DAB2IP; ELK3; ELOVL6; EPS15; ETS1; FGF13; FNTA; FOXG1B; FRMD4B; GRB14; ID4; IDI1; KLF4; MMP14; NFIX; NKX6-2; NP_001033793.1; NR2F1; OLFML3; OMG; OTUD7B; PLEKHB1; PLP1; PTPRD; RAB33A; RIMS2; RNF26; RRM2; S100A3; SIRT2; SLAIN1; SLC12A2; SLC7A5; SMARCD3; ST3GAL2; SYTL2; TMEM2; TMEM49; VASN 366 Table S17. 
Summary of predicted over-represented CRM classes and targeted oligodendrocyte co-expressed genes (continued) TF Class I TF Class II Predicted Targeted Genes FOS (LEUZIP) NKX/HOX GRP AATK; ACSS2; ACTN1; ADAMTS4; ADIPOR1; AMPD2; APCDD1; APLP1; ARMCX2; ARPC1A; ASCL1; BICC1; BIRC2; BMP7; CA2; CACNB4; CCNB2; CD68; CDC37L1; CDCA5; CEBPB; CHN2; CLDN11; CMTM3; CNP; COL18A1; COL5A1; COL9A1; CPOX; CPT1A; CRTAP; DAB2IP; DBN1; DDC; DHCR24; EDG8; ELK3; ELOVL6; ELOVL7; ENPP2; EPS15; ETS1; ETS2; F2R; FAM107B; FGF13; FGFR1; FMNL3; FNTA; FOS; FOXG1B; FRMD4B; FZD2; GAS6; GLRB; GPR153; GRAMD3; GRB14; GSN; ID4; IER3; IER5; IGFBP4; ITGB5; JAM3; K0256_HUMAN; KLF2; KLF4; KLHL2; LAPTM5; LEPREL2; LRIG1; LSS; MAG; MAL; MAP7; MAST2; MBP; MEST; MMP14; NCAPD2; NEK6; NFIX; NKD1; NP_001033793.1; NR2F1; NSDHL; OLFML3; OMG; OSBPL1A; OTUD7B; PCTK3; PCYT2; PDLIM2; PELI1; PIGA; PLA2G4A; PLEKHA1; PLEKHB1; PLEKHG2; PLEKHH1; PLP1; PLXNB3; PPAP2C; PTHR1; PTPRD; Q8NA55_HUMAN; RAB11FIP5; RAB33A; RACGAP1; RAMP2; RAP1A; REEP5; RFFL; RFP2; RHPN1; RIMS2; RNF26; RRAS2; RRM2; S100A3; SC4MOL; SFRP2; SIRT2; SLAIN1; SLC1A4; SLC2A1; SLC7A5; SMARCD3; SPRED2; ST3GAL2; SYTL2; TALDO1; TCF7L1; TIMM17A; TM7SF2; TMEFF2; TMEM2; TMEM49; TPD52; TSPAN2; UGT8; VASN; WNT7B FOS (LEUZIP) NR GRP ACSS2; AMPD2; ARMCX2; CA2; CACNB4; CD68; CHN2; COL9A1; CPT1A; EDG8; ELK3; ENPP2; EPS15; ETS1; ETS2; FAM107B; FGF13; FGFR1; FMNL3; FOS; FRMD4B; GRB14; GSN; IER3; JAM3; JOSD2; LEPREL2; MAST2; MPP5; NEK6; NP_001033793.1; NR2F1; NSDHL; OTUD7B; PCYT2; PLP1; PTHR1; Q8NA55_HUMAN; RAB11FIP5; RAB33A; RAMP2; RAP1A; RAP2A; RFFL; RFP2; RIMS2; RRAS2; S100A3; SGK2; SIRT2; SMARCD3; ST3GAL2; SYTL2; TCF7L1; TM7SF2; TMEM2; TMEM49; TPD52; VASN FOS (LEUZIP) REL GRP AATK; ADAMTS4; ADIPOR1; AMPD2; APLP1; ARMCX2; ARPC1A; ASCL1; BICC1; BIRC2; CA2; CD68; CDCA5; CHN2; COL5A1; COL9A1; CPOX; CPT1A; CTSE; CXCL12; DAB2IP; DBN1; DDC; DHCR24; ELK3; ELOVL7; ENPP2; EPS15; ETS1; ETS2; F2R; FAM107B; FGF13; FMNL3; FOS; FOXG1B; FRMD4B; GRAMD3; GRB14; GSN; IER3; IGFBP4; JOSD2; 
KLF2; KLF4; KLHL2; LEPREL2; MAL; MEST; MPP5; NEK6; NFIX; NKD1; NP_001033793.1; NR2F1; OLFML3; OMG; OSBPL1A; OTUD7B; PCYT2; PDLIM2; PELI1; PIGA; PLA2G4A; PLEKHB1; PLEKHG2; PLEKHH1; PLP1; PTHR1; PTPRD; Q8NA55_HUMAN; RAB11FIP5; RAB33A; RFFL; RFP2; RIMS2; S100A3; SC4MOL; SFRP2; SGK2; SIRT2; SLC7A5; SMARCD3; SPRED2; ST3GAL2; STRN; SYTL2; TALDO1; TCF7L1; TGFBR2; TIMM17A; TM7SF2; TMEM2; TMEM49; TSPAN2; UGT8; VASN FOS (LEUZIP) SPZ1 (HLH) AATK; ACSS2; ACTN1; ARMCX2; BMP7; CA2; CACNB4; CD68; CHN2; CMTM3; COL9A1; DHCR24; ELK3; EMID2; EPS15; FGF13; FGFR1; FMNL3; FNTA; FOS; FOXG1B; FRMD4B; IDI1; IGFBP4; KLF4; LEPREL2; NFIX; NP_001033793.1; NR2F1; OLFML3; OMG; OTUD7B; PLEKHH1; PLP1; PPAP2C; PTPRD; Q8NA55_HUMAN; RAB11FIP5; RAB33A; RACGAP1; RIMS2; RNF26; RRM2; SLAIN1; SLC7A5; SMARCD3; SPRED2; SYTL2; TCF7L1; TGFBR2; TM7SF2; TPD52; VASN GFI (ZF) MEF2A (MADS) ACTN1; ADAMTS4; ADIPOR1; APCDD1; ARPC1A; ASCL1; BICC1; BIRC2; BMP7; CACNB4; CHN2; CLDN11; COL5A1; COL9A1; CRTAP; ENPP2; EPS15; ETS1; FAM107B; FGF13; FGFR1; FOS; FOXG1B; FRMD4B; FZD2; GRAMD3; GRB14; ID4; IDI1; JAM3; K0256_HUMAN; KLF4; KLHL2; MAL; MAST2; NEK6; NFIX; NR2F1; NSDHL; OLFML3; OMG; OTUD7B; PELI1; PLA2G4A; PLEKHA1; PLEKHG2; PRKCZ; PTHR1; PTPRD; RFFL; RIMS2; RRAS2; S100A3; SC4MOL; SLC1A4; STRN; SYTL2; TMEM2; TPD52 367 Table S17. 
Summary of predicted over-represented CRM classes and targeted oligodendrocyte co-expressed genes (continued) TF Class I TF Class II Predicted Targeted Genes GFI (ZF) NHLH1 (HLH) ACTN1; ADIPOR1; APCDD1; ARMCX2; ASCL1; BICC1; CHFR; CHN2; COL9A1; CXCL12; DAB2IP; DBNDD2; FGF13; FGFR1; FMNL3; FOS; FOXG1B; FRMD4B; FZD2; GRB14; KLF4; LEPREL2; MEST; NFIX; NKD1; OLFML3; OMG; PCYT2; PLEKHA1; PLEKHG2; PLEKHH1; PLP1; PTPRD; Q8NA55_HUMAN; RAB33A; RAMP2; RFFL; RIMS2; SC4MOL; SDC1; SFRP2; SLAIN1; SLC12A2; SLC2A1; SYTL2; TM7SF2; TMEM49; WNT7B GFI (ZF) RORA1 (NR) BICC1; CCNB2; CPT1A; ENPP2; ETS1; FGF13; FOS; FOXG1B; FRMD4B; GRB14; KLF4; NFIX; NR2F1; OLFML3; OSBPL1A; PCYT2; RAB33A; RFFL; SDC1; SLAIN1; TMEM2 GFI (ZF) SRF (MADS) ACTN1; ELK3; EPS15; ETS1; ETS2; FGF13; FGFR1; FOS; IER5; NFIX; NR2F1; OLFML3; RAB33A; TIMM17A; TMEM49 HAND1- TCFE2A (HLH) FOS (LEUZIP) AATK; ACTN1; ADAMTS4; ADIPOR1; AMPD2; APLP1; ARMCX2; ASCL1; BIRC2; CA2; CACNB4; CD68; CHN2; CMTM3; COL5A1; COL9A1; CPT1A; CRTAP; CYP51A1; DAB2IP; ELK3; ENPP2; EPS15; ETS1; ETS2; F2R; FAM107B; FGF13; FGFR1; FMNL3; FNTA; FOS; FOXG1B; FRMD4B; FZD2; GAS6; GPR153; GRB14; GSN; ID4; IDI1; IER3; IGFBP4; JOSD2; KLF4; KLHL2; LEPREL2; LRIG1; LSS; MAG; MAP7; MAST2; MBP; MEST; MMP14; MPP5; NCAPD2; NEK6; NFE2L3; NFIX; NKD1; NP_001033793.1; NR2F1; NSDHL; OLFML3; OMG; OSBPL1A; OTUD7B; PCTK3; PCYT2; PDLIM2; PIGA; PLEKHA1; PLEKHB1; PLEKHG2; PLEKHH1; PLP1; PLXNB3; PRKCZ; PTHR1; PTPRD; Q8NA55_HUMAN; RAB11FIP5; RAB33A; RACGAP1; RAMP2; RAP1A; RAP2A; REEP5; RFFL; RFP2; RIMS2; RNF26; RRM2; S100A3; SGK2; SH3GL3; SIRT2; SLAIN1; SLC12A2; SLC1A4; SLC7A5; SMARCD3; SPRED2; ST3GAL2; STRN; SYTL2; TALDO1; TCF7L1; TGFBR2; TIMM17A; TM7SF2; TMEFF2; TMEM2; TMEM49; TSPAN2; VASN; WNT7B HAND1- TCFE2A (HLH) HOX GRP AATK; ACTN1; ADAMTS4; ADIPOR1; AMPD2; ANTXR1; APCDD1; APLP1; ARMCX2; ARPC1A; ARPC1B; BIRC2; BMP7; C1orf93; C7orf24; CA2; CACNB4; CCNB2; CDCA5; CHN2; CLDN11; CNP; COL5A1; COL9A1; CPOX; CRTAP; CTSE; DAB2IP; DBN1; DBNDD2; DDC; ELK3; ELOVL6; ENPP2; EPS15; 
ETS1; ETS2; F2R; FAM107B; FGF13; FGFR1; FMNL3; FNTA; FOS; FOXG1B; FRMD4B; FZD2; GLRB; GPR153; GRB14; GSN; ID4; IDI1; IGFBP4; ITGB5; JAM3; KLF2; KLF4; KLHL2; LAPTM5; LEPREL2; LRIG1; LSS; MAG; MAP7; MAST2; MBP; MEST; MMP14; MOBP; MPP5; NCAPD2; NEK6; NFE2L3; NFIX; NKD1; NP_001033793.1; NR2F1; NSDHL; OLFML3; OMG; OSBPL1A; OTUD7B; PCTK3; PDLIM2; PELI1; PIGA; PLEKHA1; PLEKHB1; PLEKHH1; PLP1; PLXNB3; PPAP2C; PPP1R14A; PTHR1; PTPRD; Q8NA55_HUMAN; RAB11FIP5; RAB33A; RACGAP1; RAMP2; RAP1A; RAP2A; REEP5; RFFL; RFP2; RIMS2; RRM2; S100A3; SDC1; SEMA4D; SFRP2; SGK2; SH3GL3; SIRT2; SLAIN1; SLC12A2; SLC1A4; SLC7A5; SMARCD3; SPRED2; ST3GAL2; STRN; SYTL2; TALDO1; TCF7L1; TGFBR2; TIMM17A; TM7SF2; TMEFF2; TMEM2; TMEM49; TNNI1; TPD52; TSPAN2; UGT8; VASN; WNT7B HAND1- TCFE2A (HLH) ROAZ (ZF) AATK; ADAMTS4; ADIPOR1; AMPD2; ANTXR1; APLP1; ASCL1; BIRC2; CACNB4; CD68; CDCA5; COL5A1; CPOX; DAB2IP; DBN1; DBNDD2; DHCR24; ELOVL6; ETS1; ETS2; FGF13; FMNL3; FOS; FOXG1B; FRMD4B; FZD2; GJA12; GLRB; GPR153; GRB14; GSN; IDI1; JOSD2; KLF4; LAPTM5; LEPREL2; LRIG1; MAG; MBP; NFE2L3; NFIX; NKD1; NR2F1; OLFML3; OMG; PDLIM2; PLEKHA1; PLEKHB1; PLEKHG2; PLEKHH1; PLP1; PLXNB3; PTHR1; PTPRD; Q8NA55_HUMAN; RAB11FIP5; RAB33A; RACGAP1; RFFL; RIMS2; RNF26; S100A3; SFRP2; SH3GL3; SLC12A2; SMARCD3; TCF7L1; TGFBR2; TM7SF2; TMEM2; TMEM49; TNNI1; TSPAN2; VASN 368 Table S17. 
Summary of predicted over-represented CRM classes and targeted oligodendrocyte co-expressed genes (continued) TF Class I TF Class II Predicted Targeted Genes HAND1- TCFE2A (HLH) SPZ1 (HLH) ACTN1; ADAMTS4; AMPD2; ANTXR1; APLP1; ARMCX2; BMP7; CACNB4; CD68; CHN2; CMTM3; CNP; COL5A1; COL9A1; CRTAP; CXCL12; DAB2IP; DDC; ELK3; ELOVL6; ETS1; FAM107B; FGF13; FGFR1; FMNL3; FNTA; FOS; FOXG1B; FRMD4B; FZD2; GPR153; GRB14; GSN; ID4; IDI1; IGFBP4; JAM3; JOSD2; KLF4; LEPREL2; MAST2; MEST; NCAPD2; NFIX; NP_001033793.1; NR2F1; OLFML3; OMG; PDLIM2; PLEKHB1; PLEKHG2; PLP1; PLXNB3; PPAP2C; PPP1R14A; PTPRD; Q8NA55_HUMAN; RAB11FIP5; RAB33A; RACGAP1; RIMS2; RNF26; RRAS2; RRM2; S100A3; SDC1; SFRP2; SH3GL3; SLAIN1; SLC12A2; SLC7A5; SMARCD3; ST3GAL2; SYTL2; TGFBR2; TM7SF2; TMEM2; TMEM49; TNNI1; VASN HLF (LEUZIP) PAX6 (HOX) FOXG1B; OTUD7B HLH GRP SP1 (ZF) AATK; ACSS2; ACTN1; ADAMTS4; ADIPOR1; AMPD2; ANTXR1; APCDD1; APLP1; ARMCX2; ARPC1A; ARPC1B; ASCL1; BICC1; BMP7; C1orf93; CA2; CACNB4; CD68; CDCA5; CEBPB; CHN2; CLDN11; CMTM3; CNP; COL18A1; COL5A1; CPOX; CPT1A; CXCL12; CYP51A1; DAB2IP; DBN1; DBNDD2; DDC; DHCR24; EDG8; ELK3; ELOVL6; EMID2; ENPP2; ETS1; FA2H; FGF13; FGFR1; FMNL3; FOS; FOXG1B; FRMD4B; FZD2; GAMT; GJA12; GPR153; GRAMD3; GRB14; GSN; ID4; IER3; IER5; IGFBP4; ITGB5; JAM3; JOSD2; KLF2; KLF4; KLHL2; LAPTM5; LEPREL2; LRIG1; LSS; MAG; MAP7; MBP; MEST; MMP14; MVD; NFE2L3; NFIX; NKX6-2; NR2F1; NSDHL; OLFML3; OTUD7B; PCTK3; PCYT2; PDLIM2; PLD4; PLEKHB1; PLEKHG2; PLEKHH1; PLP1; PLXNB3; PPAP2C; PPP1R14A; PRKCZ; PTHR1; PTPRD; Q8NA55_HUMAN; RAB11FIP5; RAB33A; RAMP2; RAP2A; REEP5; RFFL; RHPN1; RIMS2; RNF26; RRAS2; RRM2; S100A3; SDC1; SEMA4D; SFRP2; SIRT2; SLAIN1; SLC12A2; SLC2A1; SLC7A5; SMARCD3; ST3GAL2; STRN; SYTL2; TALDO1; TCF7L1; TGFBR2; TM7SF2; TMEM2; TMEM49; TNNI1; TSPAN2; VASN; WNT7B HLH GRP SPZ1 (HLH) AATK; ACTN1; ADAMTS4; ADIPOR1; AMPD2; APLP1; ARMCX2; ASCL1; BMP7; CA2; CACNB4; CDCA5; CHN2; CNP; COL9A1; CPOX; CRTAP; CXCL12; DAB2IP; DBN1; DBNDD2; DHCR24; ELOVL6; EPS15; ETS1; FA2H; 
FGF13; FGFR1; FMNL3; FOS; FOXG1B; FRMD4B; FZD2; GPR153; GRAMD3; GSN; ID4; IGFBP4; JAM3; JOSD2; KLF4; LEPREL2; MEST; MMP14; NFIX; NKX6-2; NR2F1; OLFML3; OTUD7B; PDLIM2; PLEKHG2; PLP1; PLXNB3; PPAP2C; PPP1R14A; PTPRD; Q8NA55_HUMAN; RAB11FIP5; RAB33A; RAMP2; RAP2A; RFFL; RIMS2; RNF26; RRAS2; RRM2; SDC1; SFRP2; SH3GL3; SIRT2; SLC12A2; SLC7A5; SMARCD3; SPRED2; ST3GAL2; SYTL2; TCF7L1; TIMM17A; TM7SF2; TMEM2; TMEM49; TNNI1; VASN; WNT7B 369 Table S17. Summary of predicted over-represented CRM classes and targeted oligodendrocyte co-expressed genes (continued) TF Class I TF Class II Predicted Targeted Genes HOX GRP HAND1- TCFE2A (HLH) AATK; ACTN1; ADAMTS4; ADIPOR1; AMPD2; ANTXR1; APCDD1; APLP1; ARMCX2; ARPC1A; ASCL1; BIRC2; BMP7; C1orf93; C7orf24; CA2; CACNB4; CCNB2; CDCA5; CHN2; CLDN11; CNP; COL5A1; COL9A1; CPOX; CRTAP; CTSE; DAB2IP; DBN1; DBNDD2; DDC; ELK3; ELOVL6; ENPP2; EPS15; ETS1; ETS2; F2R; FAM107B; FGF13; FGFR1; FMNL3; FNTA; FOS; FOXG1B; FRMD4B; FZD2; GLRB; GPR153; GRB14; GSN; ID4; IDI1; IGFBP4; ITGB5; JAM3; KLF2; KLF4; KLHL2; LAPTM5; LEPREL2; LRIG1; LSS; MAG; MAP7; MAST2; MBP; MEST; MMP14; MOBP; MPP5; NCAPD2; NEK6; NFIX; NKD1; NP_001033793.1; NR2F1; NSDHL; OLFML3; OMG; OSBPL1A; OTUD7B; PCTK3; PDLIM2; PELI1; PIGA; PLEKHA1; PLEKHB1; PLEKHH1; PLP1; PLXNB3; PPAP2C; PPP1R14A; PTHR1; PTPRD; Q8NA55_HUMAN; RAB11FIP5; RAB33A; RACGAP1; RAMP2; RAP1A; RAP2A; REEP5; RFFL; RFP2; RIMS2; RNF26; RRM2; S100A3; SDC1; SEMA4D; SFRP2; SGK2; SH3GL3; SIRT2; SLAIN1; SLC12A2; SLC7A5; SMARCD3; SPRED2; ST3GAL2; STRN; SYTL2; TALDO1; TCF7L1; TGFBR2; TIMM17A; TM7SF2; TMEFF2 TMEM2; TMEM49; TNNI1; TPD52; TSPAN2; UGT8; VASN; WNT7B HOX GRP REL GRP ACSS2; AMPD2; ANTXR1; ARMCX2; C7orf24; CACNB4; CLDN11; COL5A1; DAB2IP; ELK3; ELOVL6; FRMD4B; GRB14; IDI1; IER3; IGFBP4; KLHL2; LEPREL2; MMP14; MPP5; NFE2L3; NKD1; NSDHL; OMG; PCTK3; PIGA; PLP1; RAB33A; RACGAP1; RFFL; RFP2; SYTL2; TPD52 HOX GRP SRF (MADS) ACTN1; ELK3; EPS15; ETS1; ETS2; FGF13; FGFR1; IER5; NFIX; NR2F1; OLFML3; RAB33A; SYTL2; TMEM49 MEF2A 
(MADS) HAND1- TCFE2A (HLH) ACTN1; APCDD1; ARMCX2; CACNB4; CRTAP; EPS15; ETS1; FAM107B; FGF13; FGFR1; FOS; FOXG1B; FRMD4B; FZD2; GRAMD3; GRB14; GSN; ID4; IER5; JAM3; KLF4; KLHL2; LEPREL2; LRIG1; NEK6; NFIX; NR2F1; OMG; OSBPL1A; OTUD7B; PELI1; PLEKHA1; PTHR1; PTPRD; RAB11FIP5; RFFL; RIMS2; S100A3; SMARCD3; SYTL2; TMEM49; TSPAN2 MEF2A (MADS) POU3F1 (POU-HOX) ACTN1; APCDD1; ASCL1; BICC1; CHN2; CLDN11; COL9A1; CPOX; CRTAP; DAB2IP; EPS15; ETS1; F2R; FAM107B; FGF13; FGFR1; FNTA; FOS; FOXG1B; FRMD4B; GAS6; GPR153; GRAMD3; GRB14; GSN; ID4; IDI1; K0256_HUMAN; KLF4; KLHL2; NFIX; NR2F1; NSDHL; OLFML3; OMG; OSBPL1A; PELI1; PLA2G4A; PLEKHG2; PRKCZ; PTHR1; PTPRD; RAB11FIP5; REEP5; RFFL; RIMS2; SC4MOL; SMARCD3; SYTL2; TMEM2; TSPAN2 MIZF (ZF) NKX/HOX GRP ARPC1B; ASCL1; DBNDD2; ETS1; FOS; FOXG1B; FZD2; KLF4; NFIX; NR2F1; PLEKHH1; PPAP2C; S100A3; SFRP2; WNT7B 370 Table S17. Summary of predicted over-represented CRM classes and targeted oligodendrocyte co-expressed genes (continued) TF Class I TF Class II Predicted Targeted Genes MYB (MYB) NKX/HOX GRP AATK; ACSS2; ACTN1; ADAMTS4; ADIPOR1; AMPD2; APLP1; ARMCX2; ARPC1A; ARPC1B; ASCL1; BICC1; BIRC2; BMP7; CA2; CACNB4; CCNB2; CD68; CDCA5; CEBPB; CHN2; CLDN11; CMTM3; CNP; COL5A1; COL9A1; CPOX; CPT1A; CRTAP; CTSE; CXCL12; DAB2IP; DBN1; DBNDD2; DDC; DHCR24; ELK3; ELOVL6; ELOVL7; EMID2; ENPP2; EPS15; ETS1; ETS2; F2R; FA2H; FAM107B; FGF13; FGFR1; FMNL3; FOS; FOXG1B; FRMD4B; FZD2; GAMT; GAS6; GJA12; GPR153; GRB14; GSN; ID4; IDI1; IER3; IER5; IGFBP4; ITGB5; JOSD2; KLF4; KLHL2; LAPTM5; LEPREL2; LRIG1; LSS; MAG; MAST2; MBP; MMP14; NCAPD2; NFIX; NKD1; NKX6-2; NP_001033793.1; NR2F1; NSDHL; OLFML3; OMG; OSBPL1A; OTUD7B; PCTK3; PCYT2; PDLIM2; PELI1; PIGA; PLD4; PLEKHA1; PLEKHB1; PLEKHG2; PLEKHH1; PLP1; PLXNB3; PPP1R14A; PTHR1; PTPRD; Q8NA55_HUMAN; RAB11FIP5; RAB33A; RAP2A; REEP5; RFFL; RFP2; RHPN1; RIMS2; RNF26; RRAS2; RRM2; S100A3; SC4MOL; SDC1; SEMA4D; SFRP2; SGK2; SIRT2; SLAIN1; SLC12A2; SLC1A4; SLC2A1; SMARCD3; SPRED2; ST3GAL2; STRN; SYTL2; 
TCF7L1; TGFBR2; TIMM17A; TM7SF2; TMEM2; TMEM49; TNNI1; TPD52; VASN; WNT7B MYB (MYB) SPZ1 (HLH) AATK; ACTN1; ADAMTS4; AMPD2; APLP1; ARPC1A; BMP7; CA2; CHN2; CMTM3; CNP; CRTAP; CXCL12; DAB2IP; DBN1; ELK3; ELOVL6; EMID2; ETS1; FA2H; FGF13; FMNL3; FOS; FOXG1B; FRMD4B; FZD2; GPR153; GRB14; GSN; IGFBP4; JAM3; KLF4; LEPREL2; MAST2; NEK6; NFIX; NKX6-2; NP_001033793.1; NR2F1; OLFML3; OMG; OTUD7B; PLP1; PLXNB3; PPAP2C; PPP1R14A; PTPRD; Q8NA55_HUMAN; RAB33A; RACGAP1; RAMP2; RAP2A; RFFL; RIMS2; RNF26; SLAIN1; SLC12A2; SLC1A4; SLC7A5; SMARCD3; SYTL2; TGFBR2; TM7SF2; TNNI1; VASN MYF (HLH) SOX/HMG GRP APCDD1; ARMCX2; ASCL1; CACNB4; CHN2; CNP; DDC; ELK3; EPS15; F2R; FGF13; FOS; GPR153; GRB14; JOSD2; MMP14; NFIX; OLFML3; OMG; PLEKHA1; PLP1; PLXNB3; RAP2A; RFFL; RRAS2; SC4MOL; SEMA4D; SLAIN1; SYTL2; TNNI1; TPD52 MYF (HLH) SPZ1 (HLH) ACTN1; ADAMTS4; AMPD2; APLP1; ARMCX2; BMP7; CHN2; CMTM3; CNP; CRTAP; DAB2IP; DBN1; DHCR24; ELK3; ELOVL6; FGF13; FMNL3; FRMD4B; FZD2; GPR153; GSN; IDI1; IGFBP4; JAM3; KLF4; MAST2; MMP14; NFIX; NR2F1; OLFML3; OTUD7B; PLEKHH1; PLP1; PLXNB3; PTPRD; Q8NA55_HUMAN; RAB11FIP5; RAB33A; RACGAP1; RIMS2; RNF26; SEMA4D; SFRP2; SLAIN1; SLC1A4; SMARCD3; SYTL2; TGFBR2; TM7SF2; TNNI1; VASN NHLH1 (HLH) NKX/HOX GRP AATK; ACSS2; ADAMTS4; ADIPOR1; AMPD2; ANTXR1; APCDD1; APLP1; ARMCX2; ASCL1; BICC1; BMP7; C1orf93; CD68; CEBPB; CHFR; CHN2; CLDN11; CMTM3; COL5A1; COL9A1; CPT1A; CRTAP; DAB2IP; DBN1; DBNDD2; ENPP2; FA2H; FAM107B; FGF13; FGFR1; FMNL3; FNTA; FOS; FOXG1B; FRMD4B; FZD2; GAMT; GJA12; GPR153; GRAMD3; GRB14; GSN; ID4; IER3; IGFBP4; JOSD2; KLF4; KLHL2; LEPREL2; LRIG1; MAG; MBP; MEST; MMP14; NFIX; NKX6-2; NR2F1; OLFML3; OMG; PCYT2; PIGA; PLEKHB1; PLEKHG2; PLP1; PLXNB3; PPP1R14A; PTPRD; Q8NA55_HUMAN; RAB11FIP5; RAB33A; RACGAP1; RAMP2; RFFL; RFP2; RIMS2; RRM2; SC4MOL; SFRP2; SIRT2; SLAIN1; SLC12A2; SMARCD3; SPRED2; ST3GAL2; STRN; SYTL2; TGFBR2; TM7SF2; TMEM2; TMEM49; TSPAN2 371 Table S17. 
Summary of predicted over-represented CRM classes and targeted oligodendrocyte co-expressed genes (continued) TF Class I TF Class II Predicted Targeted Genes NHLH1 (HLH) SOX/HMG GRP ADAMTS4; ADIPOR1; APCDD1; ASCL1; CHN2; CNP; COL9A1; DDC; FGF13; FMNL3; FOXG1B; NFIX; NKD1; OMG; PCYT2; PLEKHA1; PTPRD; RFFL; RRM2; SGK2; SYTL2 NHLH1 (HLH) SP1 (ZF) ACTN1; ADAMTS4; AMPD2; APCDD1; APLP1; ASCL1; BMP7; C1orf93; CEBPB; CHN2; CLDN11; CNP; COL18A1; COL5A1; CPOX; CRTAP; CYP51A1; DAB2IP; DBN1; DBNDD2; EDG8; EMID2; ENPP2; FA2H; FGF13; FGFR1; FMNL3; FNTA; FOS; FRMD4B; GAMT; GSN; ID4; IER3; IGFBP4; KLF2; KLF4; LEPREL2; MAG; MBP; MEST; MMP14; NFE2L3; NFIX; NKX6-2; NR2F1; OLFML3; PLEKHB1; PLEKHG2; PLP1; PLXNB3; PPP1R14A; PRKCZ; PTHR1; PTPRD; Q8NA55_HUMAN; RAB11FIP5; RAB33A; RIMS2; RRM2; SDC1; SFRP2; SLC12A2; SMARCD3; STRN; TMEM49; VASN; WNT7B NHLH1 (HLH) SPZ1 (HLH) AATK; ACTN1; ADAMTS4; APLP1; ARMCX2; CRTAP; DAB2IP; DBN1; DBNDD2; FGF13; FGFR1; GPR153; GSN; IGFBP4; LEPREL2; MMP14; NFIX; NKX6-2; NR2F1; OLFML3; PLP1; PPP1R14A; PTPRD; RAB33A; RACGAP1; SFRP2; SLC12A2; SMARCD3; SYTL2; TGFBR2; TNNI1; VASN NHLH1 (HLH) TEAD1 (TEA) CHN2; NKD1; PLEKHA1; SYTL2; WNT7B NKX/HOX GRP MIZF (ZF) ARPC1A; ARPC1B; ASCL1; BMP7; CDCA5; CEBPB; CHN2; CMTM3; COL5A1; CRTAP; DBNDD2; ETS1; FGF13; FMNL3; FOS; FOXG1B; FZD2; GSN; KLF4; LAPTM5; LRIG1; MMP14; NFIX; NR2F1; PLEKHH1; PLP1; RAB33A; S100A3; SFRP2; SMARCD3; WNT7B NR GRP FOS (LEUZIP) APLP1; ARMCX2; ASCL1; BICC1; CD68; CHN2; COL9A1; CPT1A; EDG8; ENPP2; ETS1; FAM107B; FGF13; FOXG1B; FRMD4B; GSN; IER3; JAM3; MAST2; MPP5; NEK6; NP_001033793.1; NSDHL; OLFML3; OMG; PCYT2; PLEKHH1; PLP1; RAB11FIP5; RAB33A; RAMP2; RFP2; RIMS2; RRAS2; SIRT2; SPRED2; TM7SF2; VASN PAX3 (HOX) ROAZ (ZF) RAB33A; RIMS2; SMARCD3 PAX3 (HOX) SP1 (ZF) NFIX; NR2F1; RAB33A; RIMS2; SMARCD3 PAX3 (HOX) SPZ1 (HLH) NFIX; NR2F1; RAB33A; RIMS2 PBX1 (HOX) SRF (MADS) FGF13; NFIX; TIMM17A PRN2 (HOX) HAND1- TCFE2A (HLH) AATK; ACTN1; ADAMTS4; ADIPOR1; AMPD2; ANTXR1; APCDD1; APLP1; ARMCX2; ARPC1A; BIRC2; 
BMP7; C1orf93; C7orf24; CA2; CACNB4; CCNB2; CHN2; CLDN11; CMTM3; CNP; COL5A1; COL9A1; CPOX; CRTAP; DAB2IP; DBN1; DBNDD2; DDC; ELK3; ELOVL6; ENPP2; EPS15; ETS1; ETS2; F2R; FAM107B; FGF13; FGFR1; FMNL3; FNTA; FOXG1B; FRMD4B; FZD2; GLRB; GRB14; GSN; ID4; IDI1; IGFBP4; ITGB5; JAM3; KLF4; KLHL2; LAPTM5; LEPREL2; LRIG1; LSS; MAP7; MAST2; MBP; MEST; MPP5; NEK6; NFE2L3; NFIX; NKD1; NP_001033793.1; NR2F1; NSDHL; OMG; OSBPL1A; OTUD7B; PCTK3; PELI1; PIGA; PLEKHA1; PLEKHB1; PLEKHH1; PLP1; PLXNB3; PPAP2C; PTHR1; PTPRD; Q8NA55_HUMAN; RAB11FIP5; RAB33A; RACGAP1; RAMP2; RAP1A; RAP2A; REEP5; RFFL; RFP2; RIMS2; RRAS2; RRM2; S100A3; SDC1; SEMA4D; SFRP2; SGK2; SH3GL3; SLAIN1; SLC12A2; SLC1A4; SLC7A5; SMARCD3; ST3GAL2; STRN; SYTL2; TALDO1; TGFBR2; TIMM17A; TM7SF2; TMEFF2; TMEM2; TMEM49; TNNI1; TPD52; TSPAN2; UGT8; VASN; WNT7B 372 Table S17. Summary of predicted over-represented CRM classes and targeted oligodendrocyte co-expressed genes (continued) TF Class I TF Class II Predicted Targeted Genes PRN2 (HOX) REL GRP ANTXR1; C7orf24; COL5A1; ELK3; ELOVL6; FMNL3; KLF4; KLHL2; MPP5; NFE2L3; NKD1; PCTK3; PIGA; PLP1; RFFL; RFP2; SYTL2 REL GRP FOS (LEUZIP) ADAMTS4; ADIPOR1; AMPD2; APLP1; BICC1; BIRC2; CA2; CACNB4; CD68; CHN2; COL5A1; COL9A1; CXCL12; DAB2IP; DDC; DHCR24; ELK3; ELOVL7; ENPP2; EPS15; ETS1; ETS2; FMNL3; FOS; FOXG1B; FRMD4B; GPR153; GRAMD3; IGFBP4; JOSD2; KLF4; LEPREL2; MAL; MBP; MMP14; MPP5; NFIX; NKD1; NKX6-2; NR2F1; OLFML3; PDLIM2; PLA2G4A; PLEKHB1; PLEKHG2; PLP1; PTHR1; PTPRD; Q8NA55_HUMAN; RAB11FIP5; RAB33A; RFFL; RFP2; RIMS2; RNF26; SEMA4D; SGK2; SLAIN1; SLC12A2; SMARCD3; STRN; SYTL2; TGFBR2; TIMM17A; TM7SF2; TMEM2; TMEM49; TSPAN2; VASN REL GRP HOX GRP ACSS2; AMPD2; ARMCX2; C7orf24; COL5A1; ELK3; ELOVL6; FOXG1B; GPR153; IDI1; IGFBP4; KLHL2; LEPREL2; NFE2L3; NKD1; PCTK3; PIGA; PTPRD; RAB33A; RACGAP1; TNNI1 ROAZ (ZF) ETS GRP AATK; ACSS2; ADAMTS4; ADIPOR1; AMPD2; ANTXR1; APCDD1; APLP1; ARMCX2; ARPC1B; ASCL1; BICC1; BIRC2; BMP7; CACNB4; CD68; CDCA5; CHFR; CLDN11; CMTM3; COL18A1; 
COL5A1; COL9A1; CTSE; DAB2IP; DBN1; DBNDD2; DDC; DHCR24; ELOVL6; ETS1; ETS2; FA2H; FAM107B; FGF13; FGFR1; FMNL3; FNTA; FOS; FOXG1B; FRMD4B; FZD2; GAS6; GLRB; GPR153; GRB14; GSN; ID4; IDI1; JOSD2; KLF4; LAPTM5; LEPREL2; MAG; MBP; MEST; MMP14; NFIX; NKD1; NP_001033793.1; NR2F1; OLFML3; OMG; OTUD7B; PDLIM2; PIGA; PLD4; PLEKHA1; PLEKHB1; PLEKHG2; PLEKHH1; PLP1; PLXNB3; PPP1R14A; PRKCZ; PTHR1; PTPRD; Q8NA55_HUMAN; RAB11FIP5; RAB33A; RACGAP1; RAMP2; RFFL; RIMS2; RNF26; S100A3; SFRP2; SIRT2; SLAIN1; SLC12A2; SMARCD3; SYTL2; TCF7L1; TIMM17A; TM7SF2; TMEFF2; TMEM2; TMEM49; TNNI1; VASN; WNT7B ROAZ (ZF) MIZF (ZF) APLP1; ARPC1B; ASCL1; BMP7; CDCA5; CXCL12; FGF13; FRMD4B; GSN; ID4; KLF4; LRIG1; RAB33A; SMARCD3; TGFBR2; WNT7B ROAZ (ZF) NKX/HOX GRP ACSS2; ADAMTS4; ADIPOR1; AMPD2; ANTXR1; APLP1; ARPC1B; ASCL1; BICC1; BIRC2; BMP7; CACNB4; CDCA5; CEBPB; CHFR; CHN2; CLDN11; CMTM3; CNP; COL5A1; COL9A1; CPOX; CTSE; DAB2IP; DBN1; DBNDD2; DHCR24; ELOVL6; ETS1; ETS2; FA2H; FGF13; FGFR1; FNTA; FOS; FOXG1B; FRMD4B; FZD2; GAMT; GLRB; GPR153; GRB14; GSN; ID4; IER3; JOSD2; LEPREL2; LRIG1; MAG; MBP; MEST; MMP14; NFIX; NKD1; NR2F1; OLFML3; OMG; OTUD7B; PCYT2; PDLIM2; PLEKHG2; PLEKHH1; PLP1; PPP1R14A; PTHR1; PTPRD; Q8NA55_HUMAN; RAB11FIP5; RAB33A; RACGAP1; RFFL; RIMS2; RNF26; S100A3; SLAIN1; SMARCD3; SYTL2; TCF7L1; TGFBR2; TIMM17A; TM7SF2; TMEFF2; TMEM2; TNNI1; WNT7B 373 Table S17. 
Summary of predicted over-represented CRM classes and targeted oligodendrocyte co-expressed genes (continued) TF Class I TF Class II Predicted Targeted Genes RORA1 (NR) FOS (LEUZIP) AMPD2; BICC1; CA2; CACNB4; CD68; COL5A1; DAB2IP; EPS15; FAM107B; FGF13; FOS; GRB14; GSN; KLF4; LEPREL2; NR2F1; OTUD7B; PCYT2; PLP1; Q8NA55_HUMAN; RIMS2; SYTL2; VASN RORA1 (NR) SPZ1 (HLH) GSN; KLF4; LEPREL2; NFIX; PLP1; RIMS2 RORA1 (NR) YY1 (ZF) ACTN1; ADAMTS4; ARPC1A; BICC1; CA2; CACNB4; CD68; CHN2; DAB2IP; DDC; ENPP2; EPS15; FAM107B; FGF13; FGFR1; FOXG1B; FRMD4B; FZD2; GAS6; GRB14; GSN; KLF4; LEPREL2; MAP7; MAST2; MMP14; NFIX; NKD1; NR2F1; NSDHL; OLFML3; OSBPL1A; PCYT2; PLEKHH1; PLP1; PTHR1; Q8NA55_HUMAN; RAB33A; RIMS2; RNF26; SFRP2; SLAIN1; SMARCD3; STRN; TM7SF2; TMEM2; VASN SOX/HMG GRP EGR/ZF GRP ACTN1; ADAMTS4; AMPD2; APCDD1; ARMCX2; BMP7; C7orf24; CDC37L1; CLDN11; CNP; CTSE; DAB2IP; DBNDD2; DDC; DHCR24; EDG8; ELOVL6; EMID2; ETS1; ETS2; FAM107B; FGF13; FGFR1; FMNL3; FOS; FOXG1B; FRMD4B; FZD2; GRB14; ID4; K0256_HUMAN; KLF4; KLHL2; LEPREL2; LRIG1; MBP; MEST; MPP5; NFE2L3; NFIX; NR2F1; OLFML3; OSBPL1A; PELI1; PLEKHB1; PLEKHG2; PLEKHH1; PLP1; PLXNB3; PTHR1; RAB33A; RFFL; RIMS2; RNF26; SEMA4D; SGK2; SH3GL3; SLC2A1; SLC7A5; SMARCD3; STRN; SYTL2; VASN SOX/HMG GRP FOS (LEUZIP) ACTN1; ADAMTS4; AMPD2; ANTXR1; APCDD1; APLP1; ARMCX2; ARPC1A; ASCL1; BICC1; BIRC2; BMP7; CA2; CACNB4; CCNB2; CD68; CDC37L1; CEBPB; CHN2; CLDN11; CMTM3; COL18A1; COL5A1; COL9A1; CPOX; CPT1A; CTSE; CYP51A1; DAB2IP; DDC; EDG8; ELK3; ELOVL6; ELOVL7; EMID2; ENPP2; EPS15; ETS1; ETS2; F2R; FAM107B; FGF13; FGFR1; FMNL3; FNTA; FOS; FOXG1B; FRMD4B; FZD2; GLRB; GPR153; GRAMD3; GRB14; GSN; ID4; IDI1; IER3; IER5; IGFBP4; ITGB5; JAM3; JOSD2; KLF2; KLF4; KLHL2; LAPTM5; LEPREL2; LRIG1; LSS; MAG; MAP7; MAST2; MBP; MEST; MMP14; MPP5; NEK6; NFE2L3; NFIX; NKD1; NP_001033793.1; NR2F1; NSDHL; OLFML3; OMG; OSBPL1A; OTUD7B; PCYT2; PELI1; PIGA; PLEKHA1; PLEKHB1; PLEKHG2; PLEKHH1; PLP1; PLXNB3; PRKCZ; PTHR1; PTPRD; Q8NA55_HUMAN; RAB11FIP5; 
RAB33A; RACGAP1; RAMP2; RAP1A; RAP2A; REEP5; RFFL; RFP2; RIMS2; RRAS2; RRM2; S100A3; SC4MOL; SFRP2; SGK2; SH3GL3; SIRT2; SLAIN1; SLC12A2; SLC1A4; SLC7A5; SMARCD3; SPRED2; STRN; SYTL2; TALDO1; TCF7L1; TIMM17A; TMEFF2; TMEM2; TMEM49; TNNI1; TPD52; TSPAN2; UGT8; VASN; WNT7B SOX/HMG GRP HAND1- TCFE2A (HLH) APCDD1; ARMCX2; ARPC1A; CACNB4; CHN2; COL9A1; DAB2IP; FAM107B; FGFR1; FMNL3; FRMD4B; IDI1; IGFBP4; LEPREL2; NFIX; OLFML3; OMG; PIGA; PLEKHA1; PLP1; PTPRD; RACGAP1; RFFL; RIMS2; SFRP2; SMARCD3; TGFBR2; TMEFF2; TMEM49; TSPAN2; UGT8 SOX/HMG GRP SPZ1 (HLH) ACTN1; AMPD2; ARMCX2; DAB2IP; ELK3; ELOVL6; KLF4; LEPREL2; NFIX; OLFML3; OMG; PLEKHH1; PLP1; PTPRD; RAB33A; RIMS2; SDC1; SYTL2; TMEM49 SP1 (ZF) CEBPA (LEUZIP) AMPD2; ASCL1; CEBPB; FZD2; IGFBP4; MMP14; NR2F1; PLEKHH1; PRKCZ; PTPRD; SDC1; SH3GL3; SIRT2; SYTL2 374 Table S17. Summary of predicted over-represented CRM classes and targeted oligodendrocyte co-expressed genes (continued) TF Class I TF Class II Predicted Targeted Genes SP1 (ZF) ETS GRP AATK; ACSS2; ACTN1; ADAMTS4; ADIPOR1; AMPD2; ANTXR1; APCDD1; APLP1; ARMCX2; ARPC1A; ARPC1B; ASCL1; BICC1; BMP7; C1orf93; C7orf24; CA2; CACNB4; CD68; CEBPB; CHN2; CLDN11; CMTM3; CNP; COL18A1; COL5A1; COL9A1; CPOX; CPT1A; CRTAP; CTSE; CXCL12; DAB2IP; DBN1; DBNDD2; DDC; DHCR24; EDG8; ELK3; ELOVL6; ELOVL7; EMID2; ENPP2; EPS15; ETS1; ETS2; F2R; FA2H; FAM107B; FGF13; FGFR1; FMNL3; FNTA; FOS; FOXG1B; FRMD4B; FZD2; GAMT; GJA12; GLRB; GPR153; GRAMD3; GRB14; GSN; ID4; IER3; IER5; IGFBP4; ITGB5; JOSD2; KLF2; KLF4; KLHL2; LAPTM5; LEPREL2; LRIG1; LSS; MAG; MAL; MAST2; MBP; MEST; MMP14; MPP5; NCAPD2; NEK6; NFE2L3; NFIX; NKD1; NKX6-2; NP_001033793.1; NR2F1; OLFML3; OSBPL1A; OTUD7B; PCYT2; PDLIM2; PIGA; PLD4; PLEKHB1; PLEKHG2; PLEKHH1; PLP1; PLXNB3; PPAP2C; PPP1R14A; PRKCZ; PTHR1; PTPRD; Q8NA55_HUMAN; RAB11FIP5; RAB33A; RAMP2; RAP2A; REEP5; RFFL; RHPN1; RIMS2; RNF26; RRAS2; RRM2; S100A3; SDC1; SEMA4D; SFRP2; SGK2; SH3GL3; SIRT2; SLAIN1; SLC12A2; SLC1A4; SLC7A5; SMARCD3; SPRED2; ST3GAL2; STRN; 
SYTL2; TALDO1; TCF7L1; TGFBR2; TM7SF2; TMEM2; TMEM49; TNNI1; TSPAN2; VASN; WNT7B SP1 (ZF) FOS (LEUZIP) ACTN1; ADAMTS4; AMPD2; ANTXR1; APLP1; ARMCX2; ARPC1A; ASCL1; BMP7; CA2; CACNB4; CD68; CDCA5; CEBPB; CHN2; CLDN11; COL18A1; COL5A1; COL9A1; CPT1A; CTSE; DAB2IP; DBN1; ELK3; ELOVL6; EMID2; ENPP2; EPS15; ETS1; ETS2; FAM107B; FGF13; FGFR1; FMNL3; FOS; FOXG1B; FRMD4B; FZD2; GPR153; GSN; ID4; IER3; IGFBP4; ITGB5; JOSD2; KLF4; KLHL2; LEPREL2; LSS; MAG; MBP; MMP14; MPP5; NEK6; NFIX; NR2F1; NSDHL; OLFML3; OSBPL1A; OTUD7B; PDLIM2; PELI1; PIGA; PLEKHB1; PLEKHG2; PLEKHH1; PLP1; PRKCZ; PTHR1; PTPRD; Q8NA55_HUMAN; RAB11FIP5; RAB33A; RAMP2; RAP2A; REEP5; RIMS2; RNF26; S100A3; SC4MOL; SH3GL3; SIRT2; SLC7A5; SMARCD3; ST3GAL2; STRN; SYTL2; TCF7L1; TM7SF2; TNNI1; TSPAN2; VASN; WNT7B SP1 (ZF) HAND1- TCFE2A (HLH) AMPD2; APLP1; CACNB4; COL5A1; DAB2IP; DBN1; DDC; DHCR24; ELK3; FOS; FRMD4B; FZD2; GPR153; GSN; IGFBP4; LEPREL2; LSS; MAG; MPP5; PIGA; PLEKHB1; PLXNB3; PTPRD; RAB33A; RIMS2; SDC1; SH3GL3; SLC1A4; SMARCD3; TM7SF2; VASN SP1 (ZF) HLH GRP AATK; ACSS2; ACTN1; ADAMTS4; ADIPOR1; AMPD2; ANTXR1; APCDD1; APLP1; ARMCX2; ARPC1A; ARPC1B; ASCL1; BICC1; BMP7; C1orf93; CA2; CACNB4; CD68; CDCA5; CEBPB; CHN2; CLDN11; CMTM3; CNP; COL18A1; COL5A1; CPOX; CPT1A; CXCL12; CYP51A1; DAB2IP; DBN1; DBNDD2; DDC; DHCR24; EDG8; ELK3; ELOVL6; EMID2; ENPP2; ETS1; FA2H; FGF13; FGFR1; FMNL3; FOS; FOXG1B; FRMD4B; FZD2; GAMT; GJA12; GPR153; GRAMD3; GRB14; GSN; ID4; IER3; IER5; IGFBP4; ITGB5; JAM3; JOSD2; KLF2; KLF4; KLHL2; LAPTM5; LEPREL2; LRIG1; LSS; MAG; MAP7; MBP; MEST; MMP14; MVD; NFE2L3; NFIX; NKX6-2; NR2F1; NSDHL; OLFML3; OTUD7B; PCTK3; PCYT2; PDLIM2; PLD4; PLEKHB1; PLEKHG2; PLEKHH1; PLP1; PLXNB3; PPAP2C; PPP1R14A; PRKCZ; PTHR1; PTPRD; Q8NA55_HUMAN; RAB11FIP5; RAB33A; RAMP2; RAP2A; REEP5; RFFL; RHPN1; RIMS2; RNF26; RRAS2; RRM2; S100A3; SDC1; SEMA4D; SFRP2; SIRT2; SLAIN1; SLC12A2; SLC2A1; SLC7A5; SMARCD3; STRN; SYTL2; TALDO1; TCF7L1; TGFBR2; TM7SF2; TMEM2; TNNI1; TSPAN2; VASN; WNT7B SP1 (ZF) MIZF (ZF) ACSS2; 
ADAMTS4; AMPD2; APLP1; ARPC1B; ASCL1; CDCA5; CHN2; COL5A1; CRTAP; DAB2IP; DBNDD2; DHCR24; ETS1; FGF13; FMNL3; FOXG1B; FZD2; ID4; KLF4; LRIG1; MAST2; MMP14; NFIX; NR2F1; PLEKHG2; PLEKHH1; PLP1; PPAP2C; PTHR1; Q8NA55_HUMAN; RAB33A; RAP2A; RIMS2; S100A3; SFRP2; SMARCD3; TGFBR2; WNT7B 375 Table S17. Summary of predicted over-represented CRM classes and targeted oligodendrocyte co-expressed genes (continued) TF Class I TF Class II Predicted Targeted Genes SP1 (ZF) MYB (MYB) AATK; ACTN1; ADAMTS4; ADIPOR1; AMPD2; ANTXR1; APCDD1; APLP1; ARPC1B; ASCL1; BICC1; BMP7; CA2; CACNB4; CD68; CEBPB; CHN2; CLDN11; CMTM3; CNP; COL18A1; COL5A1; CPOX; CPT1A; CTSE; CYP51A1; DAB2IP; DBN1; DBNDD2; EDG8; ELK3; EMID2; ENPP2; ETS1; ETS2; F2R; FA2H; FGF13; FGFR1; FMNL3; FOS; FOXG1B; FRMD4B; FZD2; GAMT; GPR153; GRB14; GSN; ID4; IER3; IGFBP4; ITGB5; JOSD2; KLF2; KLF4; KLHL2; LEPREL2; MAG; MAP7; MAST2; MEST; MMP14; MPP5; NCAPD2; NEK6; NFE2L3; NFIX; NKX6- 2; NR2F1; OLFML3; OSBPL1A; OTUD7B; PCYT2; PDLIM2; PIGA; PLD4; PLEKHB1; PLEKHG2; PLEKHH1; PLP1; PLXNB3; PPP1R14A; PRKCZ; PTHR1; PTPRD; Q8NA55_HUMAN; RAB11FIP5; RAB33A; RAP2A; REEP5; RFFL; RHPN1; RIMS2; RNF26; RRAS2; RRM2; S100A3; SDC1; SEMA4D; SFRP2; SIRT2; SLAIN1; SLC12A2; SLC1A4; SLC7A5; SMARCD3; ST3GAL2; STRN; SYTL2; TGFBR2; TM7SF2; TMEM2; TMEM49; TNNI1; VASN; WNT7B SP1 (ZF) NKX/HOX GRP AATK; ACSS2; ACTN1; ADAMTS4; ADIPOR1; AMPD2; ANTXR1; APCDD1; APLP1; ARMCX2; ARPC1A; ARPC1B; ASCL1; BICC1; BMP7; C7orf24; CA2; CACNB4; CD68; CDCA5; CEBPB; CHN2; CLDN11; CMTM3; CNP; COL5A1; COL9A1; CPT1A; CRTAP; CTSE; CXCL12; DAB2IP; DBN1; DBNDD2; DHCR24; ELK3; ELOVL6; EMID2; ETS1; ETS2; FA2H; FAM107B; FGF13; FGFR1; FMNL3; FNTA; FOS; FOXG1B; FRMD4B; FZD2; GAMT; GJA12; GLRB; GRAMD3; GRB14; GSN; ID4; IDI1; IER3; IGFBP4; ITGB5; JOSD2; K0256_HUMAN; KLF2; KLF4; KLHL2; LEPREL2; LRIG1; MAG; MAL; MAST2; MBP; MEST; MMP14; MOBP; NCAPD2; NEK6; NFIX; NKD1; NKX6-2; NR2F1; NSDHL; OLFML3; OSBPL1A; OTUD7B; PCTK3; PCYT2; PDLIM2; PELI1; PIGA; PLD4; PLEKHB1; PLEKHG2; PLEKHH1; PLP1; 
PLXNB3; PPAP2C; PPP1R14A; PRKCZ; PTHR1; PTPRD; Q8NA55_HUMAN; RAB11FIP5; RAB33A; RAMP2; RAP2A; REEP5; RFFL; RHPN1; RIMS2; RNF26; RRAS2; RRM2; S100A3; SC4MOL; SDC1; SEMA4D; SFRP2; SGK2; SH3GL3; SIRT2; SLAIN1; SLC12A2; SLC1A4; SLC2A1; SLC7A5; SMARCD3; SPRED2; ST3GAL2; STRN; SYTL2; TALDO1; TCF7L1; TIMM17A; TM7SF2; TMEM2; TMEM49; TNNI1; TSPAN2; VASN; WNT7B SP1 (ZF) ROAZ (ZF) ACSS2; ADAMTS4; ADIPOR1; AMPD2; ANTXR1; APLP1; ARPC1B; ASCL1; CACNB4; CD68; CDCA5; CEBPB; CLDN11; CMTM3; COL18A1; COL5A1; CTSE; DAB2IP; DBN1; DBNDD2; DDC; DHCR24; ELOVL6; ETS2; FA2H; FGF13; FGFR1; FMNL3; FOS; FOXG1B; FRMD4B; GJA12; GPR153; GRB14; GSN; ID4; JOSD2; KLF4; LAPTM5; LEPREL2; MAG; MBP; MMP14; NFIX; NKD1; NR2F1; OLFML3; OTUD7B; PDLIM2; PIGA; PLEKHG2; PLEKHH1; PLP1; PPP1R14A; PRKCZ; PTHR1; Q8NA55_HUMAN; RAB11FIP5; RAB33A; RAMP2; RIMS2; RNF26; RRAS2; S100A3; SFRP2; SH3GL3; SIRT2; SLC12A2; SMARCD3; SYTL2; TCF7L1; TGFBR2; TM7SF2; TNNI1; TSPAN2; VASN; WNT7B SP1 (ZF) SPZ1 (HLH) ACTN1; ADAMTS4; AMPD2; APLP1; ASCL1; BMP7; CA2; CACNB4; CD68; CDCA5; CHN2; CMTM3; COL18A1; COL5A1; COL9A1; CXCL12; CYP51A1; DAB2IP; DBN1; DHCR24; ELK3; ELOVL6; EMID2; EPS15; ETS1; FA2H; FGF13; FGFR1; FMNL3; FOS; FOXG1B; FRMD4B; FZD2; GPR153; GRAMD3; GRB14; GSN; ID4; IER3; IGFBP4; KLF4; LEPREL2; MAL; MAST2; MMP14; NEK6; NFIX; NKX6-2; NR2F1; OLFML3; OTUD7B; PDLIM2; PLEKHB1; PLEKHG2; PLP1; PLXNB3; PPAP2C; PPP1R14A; PTPRD; Q8NA55_HUMAN; RAB11FIP5; RAB33A; RAP2A; RFFL; RIMS2; RNF26; RRAS2; S100A3; SDC1; SFRP2; SGK2; SH3GL3; SIRT2; SLC12A2; SLC7A5; SMARCD3; ST3GAL2; SYTL2; TCF7L1; TIMM17A; TM7SF2; TNNI1; VASN; WNT7B 376 Table S17. 
Summary of predicted over-represented CRM classes and targeted oligodendrocyte co-expressed genes (continued) TF Class I TF Class II Predicted Targeted Genes SP1 (ZF) TAL1 (HLH) AATK; ACTN1; ADAMTS4; ADIPOR1; AMPD2; APCDD1; APLP1; ARMCX2; ARPC1A; ASCL1; CACNB4; CNP; COL18A1; COL5A1; COL9A1; CPT1A; CYP51A1; DAB2IP; DBN1; DDC; DHCR24; FA2H; FGF13; FGFR1; FMNL3; FNTA; FOS; FOXG1B; FRMD4B; FZD2; GAMT; GPR153; GSN; ID4; IER3; JOSD2; KLF4; LAPTM5; LEPREL2; LRIG1; MAST2; NEK6; NFIX; NKX6-2; NR2F1; NSDHL; OLFML3; PCYT2; PLEKHG2; PLP1; PLXNB3; PPP1R14A; PRKCZ; PTHR1; PTPRD; Q8NA55_HUMAN; RAB11FIP5; RAB33A; RAMP2; RAP2A; RIMS2; RNF26; SDC1; SIRT2; SLAIN1; SLC2A1; SPRED2; ST3GAL2; SYTL2; TGFBR2; TM7SF2; TMEM2; TSPAN2; VASN SP1 (ZF) YY1 (ZF) AATK; ACSS2; ACTN1; ADAMTS4; ADIPOR1; AMPD2; ANTXR1; APCDD1; APLP1; ARMCX2; ARPC1A; ARPC1B; ASCL1; BICC1; BMP7; C1orf93; C7orf24; CA2; CACNB4; CD68; CDCA5; CEBPB; CHN2; CLDN11; CMTM3; CNP; COL18A1; COL5A1; COL9A1; CPOX; CPT1A; CRTAP; CTSE; CYP51A1; DAB2IP; DBN1; DBNDD2; DDC; DHCR24; EDG8; ELK3; ELOVL6; ENPP2; EPS15; ETS1; ETS2; F2R; FA2H; FAM107B; FGF13; FGFR1; FMNL3; FNTA; FOS; FOXG1B; FRMD4B; FZD2; GAMT; GJA12; GLRB; GPR153; GRAMD3; GRB14; GSN; ID4; IER3; IER5; IGFBP4; ITGB5; JAM3; JOSD2; K0256_HUMAN; KLF4; KLHL2; LAPTM5; LEPREL2; LRIG1; MAG; MAL; MAP7; MAST2; MBP; MEST; MMP14; MVD; NCAPD2; NEK6; NFE2L3; NFIX; NKD1; NKX6-2; NP_001033793.1; NR2F1; NSDHL; OLFML3; OTUD7B; PCTK3; PCYT2; PDLIM2; PELI1; PIGA; PLD4; PLEKHB1; PLEKHG2; PLEKHH1; PLP1; PLXNB3; PPP1R14A; PRKCZ; PTHR1; PTPRD; Q8NA55_HUMAN; RAB11FIP5; RAB33A; RAMP2; RAP2A; REEP5; RFFL; RHPN1; RIMS2; RNF26; RRAS2; RRM2; S100A3; SC4MOL; SDC1; SFRP2; SGK2; SH3GL3; SIRT2; SLAIN1; SLC12A2; SLC1A4; SLC2A1; SLC7A5; SMARCD3; SPRED2; ST3GAL2; STRN; SYTL2; TALDO1; TCF7L1; TGFBR2; TIMM17A; TM7SF2; TMEM2; TNNI1; TSPAN2; VASN; WNT7B SP1 (ZF) ZEB1 (ZF) ACSS2; ACTN1; ADAMTS4; AMPD2; APLP1; ARPC1A; ARPC1B; ASCL1; BMP7; CACNB4; CD68; CDCA5; CEBPB; CHN2; CLDN11; CNP; COL18A1; COL5A1; CPOX; CPT1A; CTSE; 
CYP51A1; DAB2IP; DBN1; DHCR24; ELK3; ELOVL6; EPS15; ETS1; FA2H; FAM107B; FGF13; FGFR1; FMNL3; FOS; FOXG1B; FRMD4B; FZD2; GAMT; GPR153; GRAMD3; GSN; ID4; IGFBP4; ITGB5; JOSD2; KLF4; LEPREL2; LRIG1; LSS; MAG; MAL; MBP; MMP14; MPP5; NFE2L3; NFIX; NR2F1; OLFML3; OMG; OTUD7B; PCYT2; PDLIM2; PIGA; PLD4; PLEKHB1; PLEKHG2; PLEKHH1; PLP1; PLXNB3; PPP1R14A; PRKCZ; PTHR1; PTPRD; Q8NA55_HUMAN; RAB11FIP5; RAB33A; RAMP2; RAP2A; REEP5; RIMS2; SC4MOL; SDC1; SEMA4D; SFRP2; SGK2; SH3GL3; SIRT2; SLAIN1; SLC12A2; SMARCD3; SPRED2; ST3GAL2; SYTL2; TM7SF2; TNNI1; TSPAN2; VASN; WNT7B SPZ1 (HLH) MIZF (ZF) CMTM3; CXCL12; DAB2IP; FOS; FOXG1B; KLF4; MMP14; NR2F1; PLEKHG2; PPAP2C; Q8NA55_HUMAN; RAB33A; SMARCD3; WNT7B SPZ1 (HLH) NKX/HOX GRP AATK; ACSS2; ACTN1; ADAMTS4; ADIPOR1; AMPD2; ANTXR1; APLP1; ARMCX2; ASCL1; BMP7; CA2; CACNB4; CD68; CDCA5; CHFR; CHN2; CMTM3; CNP; COL9A1; CPOX; CXCL12; DAB2IP; DBN1; DBNDD2; DHCR24; ELK3; ELOVL6; EMID2; ENPP2; EPS15; ETS1; ETS2; FA2H; FAM107B; FGF13; FGFR1; FMNL3; FOS; FOXG1B; FRMD4B; FZD2; GPR153; GRAMD3; GRB14; GSN; ID4; IER3; IGFBP4; KLF4; LEPREL2; MAL; MAST2; MEST; MMP14; NCAPD2; NFIX; NKX6- 2; NP_001033793.1; NR2F1; OLFML3; OMG; OTUD7B; PDLIM2; PIGA; PLEKHG2; PLEKHH1; PLP1; PLXNB3; PPAP2C; PTPRD; Q8NA55_HUMAN; RAB11FIP5; RAB33A; RAP2A; RFFL; RIMS2; RNF26; RRAS2; RRM2; SDC1; SGK2; SLAIN1; SLC12A2; SMARCD3; SPRED2; ST3GAL2; SYTL2; TCF7L1; TGFBR2; TIMM17A; TM7SF2; TMEM2; TMEM49 377 Table S17. 
Summary of predicted over-represented CRM classes and targeted oligodendrocyte co-expressed genes (continued) TF Class I TF Class II Predicted Targeted Genes SPZ1 (HLH) ROAZ (ZF) ADAMTS4; AMPD2; APLP1; ARMCX2; DAB2IP; DHCR24; ELOVL6; FOS; GPR153; GSN; IDI1; KLF4; MMP14; NR2F1; OLFML3; OMG; OTUD7B; PLP1; Q8NA55_HUMAN; RAB33A; RACGAP1; SFRP2; SIRT2; SMARCD3; SYTL2; TMEM2; VASN; WNT7B SPZ1 (HLH) ZNF354C (ZF) ACSS2; ACTN1; ADAMTS4; ADIPOR1; AMPD2; ANTXR1; APCDD1; APLP1; ARMCX2; ASCL1; BICC1; BMP7; C1orf93; CA2; CACNB4; CD68; CDCA5; CHFR; CHN2; CMTM3; CNP; COL18A1; COL5A1; COL9A1; CPOX; CRTAP; CXCL12; DAB2IP; DBN1; DBNDD2; DHCR24; ELK3; ELOVL6; EMID2; ENPP2; EPS15; ETS1; ETS2; FA2H; FAM107B; FGF13; FGFR1; FMNL3; FNTA; FOS; FOXG1B; FRMD4B; FZD2; GPR153; GRAMD3; GRB14; GSN; ID4; IDI1; IER3; IGFBP4; JAM3; JOSD2; KLF4; LEPREL2; MAL; MAST2; MEST; MMP14; NCAPD2; NEK6; NFIX; NKX6-2; NP_001033793.1; NR2F1; OLFML3; OMG; OTUD7B; PCYT2; PDLIM2; PLEKHB1; PLEKHG2; PLEKHH1; PLP1; PLXNB3; PPAP2C; PPP1R14A; PTPRD; Q8NA55_HUMAN; RAB11FIP5; RAB33A; RACGAP1; RAMP2; RAP2A; RFFL; RHPN1; RIMS2; RNF26; RRAS2; RRM2; S100A3; SDC1; SEMA4D; SFRP2; SGK2; SH3GL3; SIRT2; SLAIN1; SLC12A2; SLC1A4; SLC7A5; SMARCD3; SPRED2; ST3GAL2; SYTL2; TCF7L1; TGFBR2; TIMM17A; TM7SF2; TMEM2; TMEM49; TNNI1; TPD52; VASN; WNT7B SRF (MADS) HOX GRP ELK3; ENEPS15; ETS2; FGF13; FGFR1; NFIX; NR2F1; OLFML3; TAL1 (HLH) ROAZ (ZF) ADAMTS4; COL5A1; DAB2IP; DBNDD2; DDC; DHCR24; ETS2; FA2H; FGFR1; FNTA; FRMD4B; FZD2; GPR153; LAPTM5; LEPREL2; LRIG1; MEST; NR2F1; OLFML3; OMG; PCYT2; PDLIM2; PLEKHG2; PLP1; PTHR1; Q8NA55_HUMAN; RAB11FIP5; RACGAP1; RNF26; SIRT2; SYTL2; TGFBR2; TM7SF2 TAL1 (HLH) SPZ1 (HLH) ADAMTS4; AMPD2; ARMCX2; ARPC1A; CNP; DBN1; DHCR24; FGF13; FGFR1; FOXG1B; FZD2; KLF4; LEPREL2; MAST2; NFIX; NR2F1; OLFML3; OMG; PLEKHB1; PLP1; PLXNB3; PPP1R14A; PTPRD; Q8NA55_HUMAN; RAMP2; RAP2A; RNF26; SDC1; SIRT2; ST3GAL2; SYTL2; TGFBR2 TEAD1 (TEA) SPZ1 (HLH) ADAMTS4; AMPD2; ARMCX2; BMP7; CACNB4; CMTM3; CPOX; ELOVL6; EPS15; FGF13; 
FOS; FOXG1B; FRMD4B; ID4; IDI1; NCAPD2; NFIX; NR2F1; OLFML3; OTUD7B; PCYT2; PDLIM2; PLP1; Q8NA55_HUMAN; RAB33A; RIMS2; RRAS2; SDC1; TMEM2; VASN YY1 (ZF) NKX/HOX GRP AATK; ACSS2; ACTN1; ADAMTS4; AMPD2; ANTXR1; APCDD1; APLP1; ARMCX2; BICC1; BMP7; C7orf24; CA2; CACNB4; CD68; CDC37L1; CDCA5; CEBPB; CHFR; CHN2; CMTM3; COL18A1; COL9A1; CPOX; CTSE; CXCL12; DAB2IP; DBN1; DBNDD2; DHCR24; ELOVL6; ELOVL7; ENPP2; EPS15; ETS1; ETS2; FAM107B; FGF13; FGFR1; FMNL3; FNTA; FOS; FOXG1B; FRMD4B; FZD2; GAMT; GPR153; GRB14; GSN; ID4; IER3; IER5; IGFBP4; ITGB5; JOSD2; K0256_HUMAN; KLF4; KLHL2; LEPREL2; LRIG1; MAG; MAST2; MBP; MEST; MMP14; MVD; NCAPD2; NEK6; NFIX; NKD1; NKX6-2; NP_001033793.1; NR2F1; NSDHL; OLFML3; OMG; OSBPL1A; OTUD7B; PCYT2; PDLIM2; PELI1; PIGA; PLEKHA1; PLEKHB1; PLEKHH1; PLP1; PLXNB3; PPP1R14A; PRKCZ; PTHR1; PTPRD; Q8NA55_HUMAN; RAB11FIP5; RAB33A; RAMP2; REEP5; RFFL; RFP2; RIMS2; RNF26; RRM2; SDC1; SEMA4D; SFRP2; SIRT2; SLAIN1; SLC12A2; SLC2A1; SMARCD3; SPRED2; STRN; SYTL2; TCF7L1; TIMM17A; TM7SF2; TMEM2; TMEM49; TPD52; TSPAN2; UGT8; VASN; WNT7B 378 Table S17. 
Summary of predicted over-represented CRM classes and targeted oligodendrocyte co-expressed genes (continued) TF Class I TF Class II Predicted Targeted Genes YY1 (ZF) ROAZ (ZF) ACSS2; ADAMTS4; AMPD2; ANTXR1; APLP1; ARMCX2; ASCL1; BICC1; BIRC2; CACNB4; CD68; CDCA5; CHFR; CHN2; CLDN11; CMTM3; COL18A1; COL5A1; COL9A1; CPOX; CTSE; DAB2IP; DBN1; DBNDD2; DDC; DHCR24; ELOVL6; ETS1; ETS2; FA2H; FAM107B; FGF13; FGFR1; FMNL3; FNTA; FOS; FRMD4B; FZD2; GAMT; GJA12; GPR153; GRB14; GSN; ID4; IDI1; JOSD2; KLF4; LAPTM5; LEPREL2; LRIG1; MAG; MBP; MMP14; NFIX; NKD1; NR2F1; OLFML3; OMG; OTUD7B; PCYT2; PDLIM2; PLEKHB1; PLEKHG2; PLEKHH1; PLP1; PPP1R14A; PRKCZ; PTHR1; PTPRD; Q8NA55_HUMAN; RAB11FIP5; RAB33A; RAMP2; RFFL; RIMS2; RNF26; SFRP2; SIRT2; SLC12A2; SMARCD3; SYTL2; TCF7L1; TGFBR2; TIMM17A; TM7SF2; TMEM2; TNNI1; TSPAN2; VASN; WNT7B YY1 (ZF) SPZ1 (HLH) AATK; ACTN1; ADAMTS4; AMPD2; ANTXR1; APLP1; ARMCX2; ARPC1A; BMP7; C1orf93; CA2; CACNB4; CD68; CDCA5; CHFR; CHN2; CMTM3; CNP; COL9A1; CPOX; CRTAP; DAB2IP; DBN1; DBNDD2; DHCR24; ELOVL6; ETS1; FGF13; FMNL3; FNTA; FOS; FOXG1B; FRMD4B; FZD2; GPR153; GRB14; GSN; IDI1; IGFBP4; KLF4; LEPREL2; MAL; MAST2; MMP14; NFIX; NKX6-2; NP_001033793.1; NR2F1; OLFML3; OMG; OTUD7B; PDLIM2; PLEKHB1; PLP1; PLXNB3; PPP1R14A; PTPRD; Q8NA55_HUMAN; RAB11FIP5; RAB33A; RAP2A; RIMS2; RNF26; RRAS2; RRM2; SFRP2; SH3GL3; SIRT2; SLAIN1; SLC12A2; SLC1A4; SLC7A5; SMARCD3; SPRED2; ST3GAL2; SYTL2; TCF7L1; TM7SF2; TMEM2; TPD52; VASN ZEB1 (ZF) EGR/ZF GRP ACTN1; ADAMTS4; APCDD1; ARMCX2; BICC1; C1orf93; CDCA5; CHN2; CLDN11; CNP; CTSE; DAB2IP; DBN1; DBNDD2; DDC; DHCR24; EDG8; ELOVL6; EMID2; ETS1; FA2H; FGF13; FGFR1; FMNL3; FOS; FOXG1B; FRMD4B; FZD2; GRB14; ID4; IGFBP4; KLF4; KLHL2; LEPREL2; LRIG1; MAST2; MBP; MPP5; NFIX; OLFML3; OSBPL1A; PELI1; PLEKHG2; PLXNB3; PTHR1; RAB11FIP5; RIMS2; SEMA4D; SGK2; SLAIN1; SMARCD3; ST3GAL2; STRN; SYTL2; TGFBR2; TNNI1; VASN ZEB1 (ZF) SPZ1 (HLH) ACTN1; ARMCX2; ASCL1; CD68; CNP; COL9A1; DAB2IP; ELOVL6; FMNL3; FOS; IGFBP4; KLF4; LEPREL2; NFIX; 
OLFML3; OMG; OTUD7B; PLEKHH1; PLP1; PTPRD; RAB11FIP5; RAB33A; RIMS2; SDC1; SLC12A2; SYTL2; TCF7L1

Table S18. Summary of co-expressed oligodendrocyte-targeted genes for predicted, enhancer-weighted CRM classes

TF Class I | TF Class II | Predicted Targeted Genes
ETS GRP | NKX/HOX GRP | ADIPOR1; CACNB4; CLDN11; ELK3; ELOVL6; FAM107B; LEPREL2; MAP7; PLP1; RFFL
ETS GRP | SP1 (ZF) | ACSS2; ACTN1; AMPD2; ARMCX2; ARPC1B; BICC1; BMP7; CEBPB; CLDN11; CMTM3; CPT1A; CTSE; DAB2IP; DBNDD2; EDG8; ELK3; ELOVL6; ELOVL7; FAM107B; FGFR1; FMNL3; FNTA; FOXG1B; FRMD4B; FZD2; GPR153; ID4; IER3; KLF2; KLF4; KLHL2; LEPREL2; NCAPD2; NFIX; NKX6-2; OTUD7B; PCYT2; PDLIM2; PLEKHG2; PLXNB3; PPAP2C; PTHR1; PTPRD; Q8NA55_HUMAN; RAB11FIP5; RAB33A; RAP2A; RIMS2; RNF26; SDC1; SLC12A2; SMARCD3; SPRED2; STRN; SYTL2; TMEM49; TSPAN2
FKH GRP | MYF (HLH) | ADAMTS4; ADIPOR1; APCDD1; APLP1; ARMCX2; ASCL1; BIRC2; CA2; CACNB4; CCNB2; CD68; CHN2; CLDN11; CPT1A; CTSE; DAB2IP; DBN1; DDC; ELK3; ENPP2; EPS15; FAM107B; FGF13; FGFR1; FMNL3; FOS; FOXG1B; FRMD4B; GPR153; GSN; IDI1; IGFBP4; KLF4; KLHL2; LRIG1; MBP; MMP14; MPP5; NFIX; NR2F1; OLFML3; OMG; OTUD7B; PCYT2; PLEKHB1; PLP1; PTHR1; PTPRD; Q8NA55_HUMAN; RAMP2; RFFL; RFP2; RIMS2; RNF26; RRAS2; SC4MOL; SEMA4D; SLAIN1; SYTL2; TGFBR2; TM7SF2; TMEFF2; TMEM2; TMEM49; TNNI1; TPD52
FKH GRP | REL GRP | FZD2; MAL; NFIX; NKD1; PTHR1; PTPRD; RAB33A; RFFL; RIMS2; SEMA4D
FOS (LEUZIP) | NKX/HOX GRP | ACSS2; ACTN1; ADAMTS4; APCDD1; ARMCX2; ARPC1A; ASCL1; BICC1; BIRC2; CACNB4; CCNB2; CD68; CHN2; CLDN11; CPOX; DDC; ELK3; ELOVL7; EPS15; ETS1; ETS2; FAM107B; FGF13; FGFR1; FMNL3; FNTA; FOS; FOXG1B; FRMD4B; GRB14; GSN; ID4; ITGB5; JAM3; K0256_HUMAN; KLF4; KLHL2; LEPREL2; LRIG1; MAL; MAST2; MBP; MMP14; NEK6; NFIX; NKD1; NP_001033793.1; NR2F1; NSDHL; OLFML3; OMG; OSBPL1A; OTUD7B; PCYT2; PELI1; PIGA; PLA2G4A; PLEKHA1; PLEKHB1; PLEKHH1; PLP1; PTHR1; PTPRD; RAB33A; RACGAP1; RAMP2; RAP1A; RFFL; RFP2; RHPN1; RIMS2; RRAS2; RRM2; SFRP2; SLAIN1; SPRED2; SYTL2; TIMM17A; TMEFF2; TMEM2; TMEM49; TPD52
GFI (ZF) | MEF2A (MADS) | ACTN1; ADAMTS4; ADIPOR1; APCDD1; ARPC1A; ASCL1; BICC1; BIRC2; BMP7; CACNB4; CHN2; CLDN11; COL5A1; COL9A1; CRTAP; ENPP2; EPS15; ETS1; FAM107B; FGF13; FGFR1; FOS; FOXG1B; FRMD4B; FZD2; GRAMD3; GRB14; ID4; IDI1; JAM3; K0256_HUMAN; KLF4; KLHL2; MAL; MAST2; NEK6; NFIX; NR2F1; NSDHL; OLFML3; OMG; OTUD7B; PELI1; PLA2G4A; PLEKHA1; PLEKHG2; PRKCZ; PTHR1; PTPRD; RFFL; RIMS2; RRAS2; S100A3; SC4MOL; SLC1A4; STRN; SYTL2; TMEM2; TPD52
HAND1-TCFE2A (HLH) | HOX GRP | APLP1; CHN2; COL5A1; DAB2IP; ENPP2; EPS15; ETS1; ETS2; F2R; FAM107B; FGF13; FRMD4B; GRB14; IDI1; MAP7; MBP; MEST; NFIX; NKD1; NR2F1; OMG; PCTK3; PIGA; PLEKHH1; PTPRD; RAB11FIP5; RAP2A; RFFL; SLAIN1; TMEFF2; TMEM49; UGT8
HLH GRP | SP1 (ZF) | AATK; ACSS2; ACTN1; AMPD2; APLP1; BMP7; CD68; CEBPB; CLDN11; COL18A1; CYP51A1; DAB2IP; DBN1; DHCR24; EDG8; ELOVL6; FMNL3; FOS; FOXG1B; FRMD4B; FZD2; GPR153; IGFBP4; JOSD2; KLF2; KLF4; LEPREL2; MAL; MBP; MMP14; NFIX; NKX6-2; NR2F1; PCYT2; PLEKHB1; PLP1; PLXNB3; PPAP2C; PPP1R14A; PTHR1; PTPRD; Q8NA55_HUMAN; RAB33A; RAMP2; RFFL; RIMS2; SDC1; SEMA4D; SLC2A1; SLC7A5; SMARCD3; ST3GAL2; TM7SF2; TNNI1
HLH GRP | SPZ1 (HLH) | ACTN1; AMPD2; ARMCX2; BMP7; CLDN11; CNP; CPOX; CXCL12; DAB2IP; DBN1; ELOVL6; EPS15; FMNL3; FOXG1B; FZD2; GPR153; IGFBP4; JOSD2; KLF4; LEPREL2; MMP14; NKX6-2; NR2F1; PDLIM2; PLP1; PPAP2C; PTPRD; RAB33A; RNF26; SLC12A2; SMARCD3; ST3GAL2; TNNI1
MEF2A (MADS) | HAND1-TCFE2A (HLH) | ACTN1; APCDD1; ARMCX2; CACNB4; CLDN11; CRTAP; EPS15; ETS1; FAM107B; FGF13; FGFR1; FOS; FOXG1B; FRMD4B; FZD2; GRAMD3; GRB14; GSN; ID4; IER5; JAM3; KLF4; KLHL2; LEPREL2; LRIG1; MAL; NEK6; NFIX; NR2F1; OMG; OSBPL1A; OTUD7B; PELI1; PLEKHA1; PTHR1; PTPRD; RAB11FIP5; RFFL; RIMS2; S100A3; SMARCD3; SYTL2; TMEM49; TSPAN2
MEF2A (MADS) | POU3F1 (POU-HOX) | ACTN1; APCDD1; ASCL1; BICC1; CHN2; CLDN11; COL9A1; CPOX; CRTAP; DAB2IP; EPS15; ETS1; F2R; FAM107B; FGF13; FGFR1; FNTA; FOS; FOXG1B; FRMD4B; GAS6; GPR153; GRAMD3; GRB14; GSN; ID4; IDI1; K0256_HUMAN; KLF4; KLHL2; NFIX; NR2F1; NSDHL; OLFML3; OMG; OSBPL1A; PELI1; PLA2G4A; PLEKHG2; PRKCZ; PTHR1; PTPRD; RAB11FIP5; REEP5; RFFL; RIMS2; SC4MOL; SMARCD3; SYTL2; TMEM2; TSPAN2
NHLH1 (HLH) | TEAD1 (TEA) | CHN2; CLDN11; NKD1; PLEKHA1; PLP1; SYTL2; WNT7B
RORA1 (NR) | SPZ1 (HLH) | GSN; KLF4; LEPREL2; NFIX; PLP1; RIMS2
SOX/HMG GRP | EGR/ZF GRP | ACTN1; ADAMTS4; ARMCX2; BMP7; C7orf24; CDC37L1; CTSE; DAB2IP; DBNDD2; DDC; EDG8; EMID2; ETS1; ETS2; FAM107B; FGF13; FGFR1; FOS; FRMD4B; GRB14; ID4; K0256_HUMAN; KLF4; KLHL2; LEPREL2; LRIG1; MAL; MBP; MPP5; NFE2L3; NFIX; NR2F1; OSBPL1A; PLEKHB1; PLEKHG2; PLEKHH1; PLXNB3; PTHR1; RAB33A; RFFL; RIMS2; RNF26; SH3GL3; SLC2A1; SLC7A5; SMARCD3; STRN; SYTL2; VASN
TEAD1 (TEA) | SPZ1 (HLH) | ADAMTS4; AMPD2; ARMCX2; BMP7; CACNB4; CLDN11; CMTM3; CPOX; ELOVL6; EPS15; FGF13; FOS; FOXG1B; FRMD4B; ID4; IDI1; NCAPD2; NFIX; NR2F1; OLFML3; OTUD7B; PCYT2; PDLIM2; PLP1; Q8NA55_HUMAN; RAB33A; RIMS2; RRAS2; SDC1; TMEM2; VASN

G. Prioritization of TFBS cooperativity predictions via enhancer feature weighting

Figure S5. Muscle enhancer feature distribution evaluation. Histogram plot of modified Kolmogorov-Smirnov (KS-test) p-values for bend, curve, and GC-content feature metrics surrounding predicted muscle reference collection CRMs (as compared with features computed in random datasets). Histogram bars extending below or at the y-axis p-value=0 equate to average KS-test p-values less than or equal to 0.05.

Figure S6.
Heat map of agglomerative clustered feature values surrounding a predicted CRM in targeted muscle genes. Example of heat map visualization of agglomerative clustering for evaluated standardized Euclidean feature distances. Features were evaluated in sequences flanking the Mef2a-Srf CRM predictions identified in a set of targeted muscle genes by the CSA promoter analyses.