Open Collections

UBC Theses and Dissertations

Computational prediction of regulatory element combinations and transcription factor cooperativity Fulton, Debra Louise 2009



Full Text

Computational Prediction of Regulatory Element Combinations and Transcription Factor Cooperativity

by

DEBRA LOUISE FULTON
B.Sc., Simon Fraser University, 2003

A THESIS SUBMITTED IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY
in
THE FACULTY OF GRADUATE STUDIES
(Genetics)

THE UNIVERSITY OF BRITISH COLUMBIA
(Vancouver)

December 2009

© Debra Louise Fulton, 2009

Abstract

Cellular identity and function are determined, in part, by the subset of genes transcribed. Gene transcription regulation is directed by a subgroup of proteins called transcription factors (TFs), which can interact directly or indirectly with DNA to promote transcription initiation. In multi-cellular eukaryotes, gene expression often derives from synergistic and/or antagonistic interplay of multiple TFs with coordinate activity in response to physiological, developmental, and environmental stimuli. Sequence-specific interactions of TFs with DNA occur at TF binding sites (TFBS). Such TFBS can be predicted based on previously observed target DNA sequence specificity of a TF.

Experimental studies have confirmed that proximally situated TFBS are often associated with synergistic interactions of multiple proteins that lead to cooperative regulation. The identification of clustered TFBS combinations (often called cis-regulatory modules) in a set of co-expressed genes can implicate regulatory roles for homologous groups of TFs that may contribute to co-regulation of a gene cohort. The identification of the specific TFs that interact with TFBS motifs is an important step in deciphering mechanisms of co-regulation.
My thesis research addressed these challenges, firstly, through the design and development of a Combination Site Analysis (CSA) algorithm to identify over- representation of combinations of TFBS in co-expressed genes and, secondly, the assembly of a comprehensive wiki-based catalog of human-mouse TFs (TFCat) using literature curation and homolog prediction approaches.  These applications were incorporated within a new promoter sequence analyses procedure for the identification of TFs that may be acting cooperatively to co-regulate expression of myelin-associated genes during myelin production in the CNS. Dysregulation of gene expression is    iii frequently implicated in human pathologies and development of approaches that identify the molecular components of transcriptional regulatory systems is an important step towards the elucidation of molecular mechanisms for the design of therapeutic interventions.                        iv  Table of Contents Abstract...........................................................................................................................ii Table of Contents ........................................................................................................... iv List of Tables ...............................................................................................................viii List of Figures................................................................................................................ ix List of Abbreviations and Acronyms ...............................................................................x Acknowledgements ......................................................................................................xiii Co-authorship Statement ............................................................................................... xv 1. Introduction..............................................................................................................1 1.1. 
Gene transcriptional regulatory mechanisms ......................................................3 1.1.1. Chromatin remodeling in transcription regulation .......................................3 1.1.1.1. DNA methylation.................................................................................4 1.1.1.2. Histone modification ............................................................................5 1.1.1.3. X-inactivation ......................................................................................6 1.1.2. DNA regulatory elements............................................................................7 1.1.3. Transcription factor proteins .......................................................................8 1.1.4. Transcription factor mechanisms and cooperativity control .........................9 1.2. Analysis of gene expression............................................................................. 10 1.2.1. High-throughput gene expression technologies ......................................... 11 1.2.1.1. Normalization and quantification of microarray expression data......... 13 1.2.1.2. Analysis of gene expression differences ............................................. 14 1.3. Experimental techniques for regulatory sequence detection.............................. 15 1.3.1. Identification of TF-DNA binding............................................................. 16 1.3.1.1. Identification of TF binding site motifs .............................................. 17 1.3.2. Regulatory sequence validation................................................................. 18 1.3.2.1. Mouse transgenesis ............................................................................ 19 1.3.2.1.1. Hprt1 mouse transgenesis system ................................................ 19 1.4. Transcriptional regulatory region detection through comparative genomics ..... 20 1.4.1. 
Multiple sequence alignments ................................................................... 21 1.4.2. Phylogenetic footprinting.......................................................................... 21 1.5. Computational identification of regulatory mechanisms................................... 22 1.5.1. Predicting TF binding sites........................................................................ 22 1.5.2. Computational detection of cis-regulatory modules................................... 24 1.6. Myelinogenesis in the nervous system ............................................................. 25 1.6.1. Schwann cells ........................................................................................... 27 1.6.2. Oligodendrocytes ...................................................................................... 27 1.6.2.1. Olig transcription factors.................................................................... 28 1.6.3. Myelin Basic Protein................................................................................. 29 1.7. Thesis overview and chapter summaries .......................................................... 31 1.8. References ....................................................................................................... 48    v 2. Identification of Over-represented Combinations of Transcription Factor Binding Sites in Sets of Co-expressed Genes .............................................................................. 59 2.1. Chapter preamble............................................................................................. 59 2.2. Introduction ..................................................................................................... 59 2.3. Results............................................................................................................. 61 2.3.1. Overview and rationale of oPOSSUM II algorithm ................................... 61 2.3.2. 
TFBS classification................................................................................... 62 2.3.3. Validation with reference data sets............................................................ 63 2.3.3.1. Yeast CLB2 cluster ............................................................................ 63 2.3.3.2. Three human reference gene sets ........................................................ 63 2.3.4. Effect of set size on false positive rate....................................................... 64 2.3.5. Web interface............................................................................................ 65 2.4. Discussion ....................................................................................................... 65 2.5. Methods........................................................................................................... 69 2.5.1. Background: the oPOSSUM database ....................................................... 69 2.5.2. TFBS in foreground gene set..................................................................... 69 2.5.3. Classification of TFBS profiles ................................................................. 70 2.5.4. Selection of TFBS and enumeration of combinations ................................ 70 2.5.5. Scoring of combinations ........................................................................... 71 2.5.6. Finding significant TFs from over-represented class combinations............ 72 2.5.7. Random sampling simulations of foreground genes................................... 73 2.5.8. Validation ................................................................................................. 73 2.6. References ....................................................................................................... 78 3. 
oPOSSUM: Integrated Tools for Analysis of Regulatory Motif Over-Representation...................................................................................................... 80 3.1. Chapter preamble............................................................................................. 80 3.2. Introduction ..................................................................................................... 80 3.3. Results............................................................................................................. 81 3.3.1. Human single site analysis ........................................................................ 82 3.3.2. Human combination site analysis .............................................................. 83 3.3.3. Worm single site analysis.......................................................................... 84 3.3.4. Yeast single site analysis........................................................................... 84 3.4. Discussion ....................................................................................................... 85 3.5. Methods........................................................................................................... 87 3.5.1. Over-representation analysis ..................................................................... 87 3.5.1.1. oPOSSUM single site analysis............................................................ 87 3.5.1.2. oPOSSUM  combination site analysis................................................. 88 3.5.2. Species-specific databases......................................................................... 89 3.5.2.1. Human/mouse .................................................................................... 89 3.5.2.2. C. elegans/C. briggsae ........................................................................ 90 3.5.2.3. Yeast .................................................................................................. 91 3.5.3. 
TFBS prediction........................................................................................ 92 3.6. References ....................................................................................................... 99 4. TFCat: The Curated Catalog of Mouse and Human Transcription Factors............. 101 4.1. Chapter preamble........................................................................................... 101 4.2. Introduction ................................................................................................... 101    vi 4.3. Results........................................................................................................... 103 4.3.1. TF gene candidate selection, the annotation process, and quality assurance  103 4.3.2. Identification and classification of DNA binding proteins ....................... 107 4.3.3. Generation and assessment of mouse-human TF homology clusters to predict additional putative TFs............................................................................. 110 4.3.4. Maintenance and access of TFCat annotation data................................... 113 4.4. Discussion ..................................................................................................... 114 4.4.1. Catalog characteristics, comparisons, and utility ..................................... 114 4.5. Materials and methods ................................................................................... 118 4.5.1. Creation of four independent murine and human TF preliminary candidate data sets............................................................................................................... 118 4.5.1.1. Dataset I ........................................................................................... 118 4.5.1.2. Dataset II.......................................................................................... 119 4.5.1.3. 
Dataset III ........................................................................................ 119 4.5.1.4. Dataset IV ........................................................................................ 120 4.5.2. Standardizing TF gene candidate annotation ........................................... 121 4.5.3. Selection and annotation of a subset of TF candidates ............................. 121 4.5.4. Randomly sampled quality assessment and auditing of TF annotations ... 122 4.5.5. TFC quality assurance comparisons ........................................................ 123 4.5.6. Human-mouse ortholog assignment ........................................................ 124 4.5.7. TF DNA-binding structure analysis and classification ............................. 124 4.5.8. Identification of homolog sets for mouse TF genes ................................. 125 4.5.9. Website download access, wiki publication and annotation feedback ...... 130 4.6. References ..................................................................................................... 136 5. Brain MiniPromoters by Design: Pleiades Promoter Project ................................. 141 5.1. Chapter preamble........................................................................................... 141 5.2. Introduction ................................................................................................... 142 5.3. Results........................................................................................................... 144 5.3.1. Novel tools to study and treat the brain ................................................... 144 5.3.2. A new score to prioritize suitable genes .................................................. 145 5.3.3. MiniPromoter designs incorporate available information......................... 146 5.3.4. ESC neural differentiation for pre-screening ........................................... 147 5.3.5. 
Novel MiniPromoter expression patterns in the brain .............................. 148 5.3.6. A unique dataset for in silico studies ....................................................... 150 5.4. Discussion ..................................................................................................... 151 5.5. Methods......................................................................................................... 154 5.5.1. Pleiades Promoter Project pipeline.......................................................... 154 5.5.2. Pleiades Promoter Project protocols ........................................................ 155 5.5.2.1.1. Hprt1 targeting vectors and MiniPromoters ............................... 155 5.5.2.2. Knock-in immediately 5′ of the Hprt1 locus ..................................... 156 5.5.2.3. PCR analysis of genomic DNA ........................................................ 157 5.5.2.4. In vitro neural differentiation............................................................ 157 5.5.3. Immunohistochemistry and histochemistry.............................................. 158 5.6. References ..................................................................................................... 171 6. Identification and analysis of transcriptional cis-regulatory modules directing oligodendrocytic expression of myelin-linked genes.................................................... 175    vii 6.1. Chapter preamble........................................................................................... 175 6.2. Introduction ................................................................................................... 175 6.3. Results........................................................................................................... 178 6.3.1. Myelin gene associated conserved regions confer reporter activity .......... 178 6.3.2. 
Myelin gene co-expression is detected in mouse forebrain and optic nerve expression profiles............................................................................................... 183 6.3.3. Validation of a promoter analysis approach............................................. 185 6.3.4. Promoter analyses of a co-expressed oligodendrocyte gene set highlights potential TF cooperativity.................................................................................... 188 6.3.5. An oligodendrocyte TF network supported by enhancer predicted regulatory elements .............................................................................................................. 189 6.3.6. Prioritization of TFBS cooperativity predictions via enhancer feature weighting ............................................................................................................ 191 6.4. Discussion ..................................................................................................... 193 6.5. Methods......................................................................................................... 198 6.5.1. Selection of conserved regions and validation in mice............................. 199 6.5.2. Isolation of genomic DNA sequences...................................................... 199 6.5.3. Generation of reporter constructs ............................................................ 200 6.5.4. Histochemistry, fluorescence microscopy, and immunocytochemistry..... 200 6.5.5. Gene expression profiling analyses ......................................................... 201 6.5.6. Evaluation of local conservation of validated TFBS ................................ 203 6.5.7. Development of the promoter database and CSA algorithm adaptation.... 203 6.5.8. Promoter analyses method validation ...................................................... 204 6.5.9. Jaspar profile clustering and cluster labeling ........................................... 
205 6.5.10. Promoter analysis of oligodendrocyte co-expression data...................... 206 6.5.11. Enhancer feature weighting of CRM predictions ................................... 206 6.5.12. Oligodendrocyte TF network construction and analysis......................... 207 6.5.13. Evaluation of predicted CRM sequence characteristics.......................... 208 6.5.14. Analysis of overlapping oligodendrocyte CRM predictions................... 209 6.6. References ..................................................................................................... 219 7. Discussion and Conclusions ................................................................................. 228 7.1. Summary ....................................................................................................... 228 7.2. Gene regulation analyses in humans and mice................................................ 228 7.2.1. Computational TF binding site detection................................................. 229 7.3. TF inventories for gene regulatory analyses ................................................... 231 7.4. Detecting differential gene expression for prediction of TF gene co-regulation  233 7.5. Functional validation of regulatory sequences................................................ 235 7.6. Incorporating high throughput epigenomic data and detailed experimental analyses in models of gene regulation...................................................................... 236 7.7. References ..................................................................................................... 241 8. Appendices........................................................................................................... 245 8.1. Appendix 1: supplementary for chapter 4....................................................... 245 8.2. Appendix 2: supplementary for chapter 5....................................................... 304 8.3. 
Appendix 3: supplementary for chapter 6....................................................... 328    viii List of Tables Table 1.1.  Selected list of databases providing eukaryote transcription factor binding site data ....................................................................................................................... 37 Table 1.2.  CRM detection techniques ........................................................................... 37 Table 1.3.  Selected computational human/mouse CRM detection tools/methods available before 2005 ........................................................................................................... 38 Table 3.1. oPOSSUM results for human FoxM1-regulated gene cluster......................... 94 Table 3.2. oPOSSUM results for c-Fos-regulated gene cluster....................................... 94 Table 3.3. oPOSSUM results for skeletal muscle genes identified by Moran et al. and Tomczak et al. ....................................................................................................... 94 Table 3.4. oPOSSUM results for worm skeletal muscle genes using worm profiles ....... 95 Table 3.5. oPOSSUM results for the yeast CLB2 gene cluster ....................................... 95 Table 4.1. Transcription factor data resources.............................................................. 132 Table 4.2. TFCat catalog statistics ............................................................................... 132 Table 4.3. TFCat judgment classifications ................................................................... 132 Table 4.4. TFCat taxonomy classifications .................................................................. 133 Table 4.5. DNA-binding TF gene classification counts................................................ 134 Table 4.6. Large cluster ranking criteria ...................................................................... 135 Table 6.1. 
Predicted TF regulatory network for myelin genes ...................................... 210                          ix List of Figures Figure 1.1. Nucleosome core octamer particle ............................................................... 39 Figure 1.2. Model for histone acetylation/deacetylation................................................. 40 Figure 1.3. Eukaryotic gene regulatory architecture....................................................... 41 Figure 1.4. Hprt1 mouse transgenesis system ................................................................ 42 Figure 1.5. Modeling TF binding sites using position weight matrices........................... 43 Figure 1.6. Myelinating glial cells: oligodendrocytes and Schwann cells ....................... 44 Figure 1.7. Schwann cell lineage ................................................................................... 45 Figure 1.8. Oligodendrocytes can ensheath multiple axons ............................................ 46 Figure 1.9. Oligodendrocyte lineage.............................................................................. 47 Figure 2.1. Overview of the oPOSSUM II analysis algorithm........................................ 75 Figure 2.2. The top five over-represented pair combinations of TFBS classes for muscle reference sets......................................................................................................... 76 Figure 2.3. Gene set size and false positive rate............................................................. 77 Figure 3.1. Determination of one-to-one orthologs for human and mouse genes. ........... 96 Figure 3.2. Identification of transcription start regions (TSRs) using a combination of EnsEMBL annotations and CAGE data ................................................................. 97 Figure 3.3. oPOSSUM Human SSA website screenshots............................................... 98 Figure 5.1. 
A resource of 240 MiniPromoters for predictable reproducible expression. 160 Figure 5.2. Resolution score prioritizes genes for MiniPromoter design....................... 162 Figure 5.3. In vitro neural differentiation for pre-screening MiniPromoter designs ...... 164 Figure 5.4. Montage of MiniPromoter expression patterns in the adult brain and retina165 Figure 5.5. Specific neuronal and glial expression patterns.......................................... 167 Figure 5.6. MiniPromoters as tools to study developmental expression patterns........... 168 Figure 5.7. A unique dataset for bioinformatics analysis.............................................. 169 Figure 6.1. Enhancer selection and validation.............................................................. 211 Figure 6.2. Histochemical detection of β-galactosidase activity in early postnatal development ........................................................................................................ 212 Figure 6.3. Histochemical detection of β-galactosidase activity in whole mounts at adult developmental stage ............................................................................................ 213 Figure 6.4. Characterization of cell populations expressing the Gjb1, Cldn11, Pou2f1 and Mal constructs ..................................................................................................... 214 Figure 6.5. Overlap of differentially expressed genes in the two expression profile datasets................................................................................................................ 216 Figure 6.6. Myelin gene TFBS regulatory sub-network ............................................... 218 Figure 7.1. Systematic integration of biological data and computational analyses to decipher gene regulatory mechanisms.................................................................. 
240        x List of Abbreviations and Acronyms ABS Annotation Regulatory Binding Site Database ACC UBC Animal Care Committee ARIDs AT-rich interaction domains BAN2 N2 backcross of ICR into B6-Alb bHLH basic Helix-Loop-Helix BTD beta-trefoil domain CAGE cap analysis of gene expression CCAC Canadian Council on Animal Care ChIP chromatin immunoprecipitation CNS central nervous system CRM   cis-regulatory modules CSA Combination Site Analysis DBD DNA-binding domain DBDdb DBD database resource DDT DNA binding homeobox and different transcription factors DHTF DNA Helix-Turn-factor DHTM DNA Helix Turn Modulus DPE Downstream Promoter Element dsDNA double-stranded DNA ECB early cell cycle box EM expectation maximization EMSA electrophoretic mobility shift assays ENCODE Encyclopedia of DNA Elements EOL-MOLd early oligodendrocytes-myelinating oligodendrocytes  dataset EOLs early oligodendrocytes ES embryonic stem ESCs embryonic stem cell lines Fox Forkhead transcription factor FWER family wise error rate GCM Glial cells missing domain GFP Green Fluorescence Protein GO Gene Ontology GOA Gene Ontology Annotations HAT Histone acetyl transferase HAT hypoxanthine, aminopterin, thymidine HDACs histone deacetylases HLH, helix-loop-helix HMG High Mobility Group HMM Hidden Markov Model HOX, Homeodomain     xi List of Abbreviations and Acronyms (continued) Hprt1 hypoxanthine phosphoribosyltransferase hsp heat shock protein IBSD inter-binding site distance IC information content ICM inner cell mass ID Identifier IEA  Inferred Electronic Annotations INR Initiator Recognition IOLEDd intersection of oligodendrocyte early development dataset IRC International Regulome Consortium IUPAC International Union of Pure and Applied Chemistry KS-test Kolomogorov-Smirnov test lacZ beta-galatosidase MBP myelin basic protein MCA most conserved√Æ alignments ME measurement of expression MGD Mouse Genome Database MGI Mouse Genome Informatics MiniPs MiniPromoters MOLs mature oligodendrocytes MPSS 
massively parallel signature sequencing MSA multiple sequence alignment NCBI National Center for Biotechnology Information NFI-CTF family Nuclear factor I - CCAAT-binding transcription factor nGFP native EGFP fluorescence NR nuclear receptor O-2A oligodendrocyte-type-2-astrocyte cells OAMTF observed approximate mean TFs OL oligodendrocytes OMIM database Online Mendelian Inheritance in Man database OPC olidodendrocyte precursor cells OPC-EOLd oliodendrocyte progenitor cells - early oligodendrocytes  dataset OPCs oligodendrocyte progenitor cells ORF open reading frame P4-P10d Postnatal 4 vs Postnatal 10 dataset PBM protein binding microarray PCR Polymerase Chain Reaction PDB Protein Data Bank PFM position frequency matrix pMN motor neuron progenitor domain     xii List of Abbreviations and Acronyms (continued) Pn Postnatal (where n is  a number) PPP Pleiades Promoter Project Prom endogenous gene promoter PWM position weight matrix QA quality assurance RCA  Inferred from reviewed computational analysis RefSeq NCBI reference sequences Rel Rel homology domain RMA   Robust Multi-chip Analysis RMS rostral migratory stream RR candidate regulatory regions SAGE   Serial Analysis of Gene Expression SC Schwann cells Shh Sonic Hedge Hog shi the mouse model shiverer shRNA small hairpin RNA Sox SRY-related HMG-box transcription factor SP SwissProt database SSA single site analysis ssDNA single-stranded DNA SSM Secondary-Structure Matching tool SUMO small ubiquitin related histone modifiers TAF TATA-binding Associated Factor TBP TATA Binding Protein TFBS transcription factor binding sites TFC Transcription Factor Candidate TFCat transcription factor catalog TFe TFencyclopedia TH tyrosine hydroxylase TSR transcription start regions TSS transcription start site UCSC University of California, Santa Cruz UPTF union of putative transacription factors VEB Vista Enhancer Browser VTA ventral tegmental area xGal beta galactosidase YRSA Yeast Regulatory Sequence Analysis system ZF 
zinc-finger    xiii Acknowledgements I would like to convey my appreciation and thanks to the many people who have helped me arrive at this milestone. First and foremost, I am especially grateful for the guidance and support provided by my PhD supervisor, Dr. Wyeth Wasserman. I also want to thank my PhD committee members for their expert advice, valuable feedback, and continued support: Drs. Eldon Emberly, Leah Keshet, Alan Peterson, and Elizabeth M. Simpson. Additionally, I am grateful for the supervision and guidance that I received during my initial graduate research rotations training under: Dr. Fiona Brinkman, Dr. Steven Jones, and Dr. Wyeth Wasserman at Merck Frost. I would also like to acknowledge the valuable input that I received early on from my masters phase committee members: Drs. Marco Marra, Fiona Brinkman, and Frederick Pio and the graduate course research guidance provided by Dr. David Baillie.  I’ve had the privilege of working on collaborative research with an inspiring set of researchers, for which I am grateful, which include: Drs. Fiona Brinkman, Martin Ester, Tim Hughes, Steven Jones, Alan Peterson, Assim Siddiqui, Rob Sladek, Jared Roach, and Wyeth Wasserman. I have appreciated the opportunity to work, learn, interact, and socialize with a number of student trainees, post- doc fellows, and research scientists during my training, including: Jochen Brumm, Stefanie Butland, Elodie Portales-Casamar, Warren Cheung, Eric Denarier, Samar Dib, Nancy Dionne, Joanne Fox, Hana Friedman, Ben Good, Obi Griffith, Karsten Hokamp, Shannan Ho Sui, Shao-Shan (Carol) Huang, Andrew Kwon, Shang-Jung (Jessica) Lee, Yvonne Li, Alison Meynert, Carrie-Lyn Mead, Mehrdad Oveisi, Erin Pleasance, Fiona Roche, Monica Sleumer, Sarav Sundararajan, Dimas Yusuf, and the remarkable group of researchers at Merck Frost in Montreal. It’s been a great pleasure to work with and around the Wasserman lab research group members.  
I am particularly grateful to Dora Pak for her top-notch organizational abilities and on-going support and assistance during my time in the Wasserman lab. I also want to acknowledge the valuable systems support provided by Dave Arenillas, Miroslav Hatas, and Jonathan Lim. I would like to acknowledge and thank my salary, training, research, and travel funding sources for their financial support during my PhD training, namely: the Wasserman Laboratory, CIHR/MSFHR Strategic Training Program in Bioinformatics, Michael Smith Foundation for Health Research Senior Graduate Award, Canadian Institutes of Health Research Doctoral Scholarship, the UBC Faculty of Graduate Studies PhD Tuition Fee Awards, and the Multiple Sclerosis Society of Canada: endMS Network Travel Award. I greatly appreciate the UBC Genetics Graduate Program administrative support provided by Dr. Hugh Brock and Monica Deutsch and the CIHR/MSFHR Strategic Training Program in Bioinformatics administrative assistance provided by Dr. Steven Jones and Sharon Ruschkowski.

I am deeply grateful to all my friends and family for their unwavering support and love during my graduate studies. I would especially like to thank my parents, Michael and Joy, and my sisters, Julia and Stephanie, for their love and encouragement.

Co-authorship Statement

The work described in this thesis was achieved, in part, through collaborative research. A summary of chapter research contributions is provided below.

Chapter 2: I am responsible for the initial design, development, and validation of the Combination Site Analysis (CSA) algorithm. The CSA algorithm and software were incorporated by Shao-Shan Huang in a website implementation, which included integration of a TFBS clustering step prototyped by Paul Perco. Shao-Shan Huang conducted further CSA algorithm validations. Dave Arenillas was responsible for implementation of the human-mouse oPOSSUM database.
Shannan Ho Sui provided the yeast promoter database used in the yeast dataset validation. James Mortimer provided microarray data. Shao-Shan Huang and Wyeth Wasserman wrote the initial draft manuscript and I provided further writing and editorial input. Shao-Shan Huang and I prepared the manuscript for conference submission.

Chapter 3: The human/mouse alternative promoter dataset was developed by Shannan Ho Sui. I redesigned and redeveloped the Combination Site Analysis (CSA) algorithm to statistically evaluate combinations of TFBS in alternative promoters and performed all related testing. Dave Arenillas redesigned the oPOSSUM database and Single Site Analysis (SSA) algorithm to accommodate the alternative promoter data. Shannan Ho Sui contributed the oPOSSUM yeast analysis website application. Andrew Kwon and Shannan Ho Sui worked on the worm-specific resource, which was architected by Andrew Kwon. Shannan Ho Sui and Dave Arenillas worked on the SSA website code and oPOSSUM portal web page. I redeveloped the CSA website and included additional web application functionality and CSA e-mail function enhancements. Shannan Ho Sui wrote the draft manuscript. I contributed writing for the CSA work and edits for the manuscript.

Chapter 4: The collaborative research work described in chapter 4 was initiated by my supervisor, Wyeth Wasserman. I was responsible for leading and managing the project collaboration and work. Initial putative TF datasets were contributed by Jared Roach, Sarav Sundararajan, Gwenael Beard, and myself. Sarav Sundararajan merged the datasets and provided input on the wiki gene page design. I designed, implemented, and populated the centralized database and curation website tool. Rob Sladek and I precurated the merged putative TF dataset. Jared Roach, Robert Sladek, Sarav Sundararajan, Gwenael Beard, Tim Hughes, Wyeth Wasserman, and I acted as the core group of gene annotators.
I established and implemented the structural classification mapping methodology and performed the analysis of DNA-binding structures to extend the DNA-binding structural classification system. I designed and implemented the TF homology analysis approach, the wiki, and the website download portal. I wrote the manuscript and created the supplemental document. Additional manuscript writing input and edits were provided by Wyeth Wasserman and Rob Sladek.

Chapter 5: The work described in chapter 5 involved both computational analyses and multiple stages of detailed molecular work. The research project was initiated by Elizabeth M. Simpson. Molecular work was provided by research scientists and graduate students affiliated with the following principal investigator laboratories: Elizabeth M. Simpson, Dan Goldowitz, and Robert Holt. Computational analyses were performed by researchers in Wyeth Wasserman’s and Steven Jones’ laboratories. A full author list is provided in the chapter publication citation. I am responsible for the design and implementation of a set of computational analyses that predicted the transcription factors responsible for directing expression of an OLIG1 gene-associated Green Fluorescent Protein (GFP) reporter construct sequence in mice. The computational analyses included identification and a comprehensive evaluation of publicly available oligodendrocyte expression data and in-depth TFBS feature analyses. I provided the results, tables, and writing for the ‘Regulatory element predictions in OLIG1 enhancer sequences’ section in the supplemental. The draft manuscript was written by Elodie Portales-Casamar et al. I provided additional writing input and edits for the introduction and results sections of the manuscript.

Chapter 6: Research concepts described in chapter 6 were established by Eric Denarier and myself.
I am responsible for the enhancer sequence integration concept and design of the overall computational analysis approach. The identification of putative enhancer sequences was performed by Eric Denarier. The mouse transgenesis work and related molecular work were conducted by Eric Denarier and research associates in Alan Peterson’s laboratory. I implemented software to establish the promoter analyses database, adapted the Combination Site Analysis (CSA) algorithm, and performed all reference collection validation testing. I performed expression data analyses for all oligodendrocyte microarray datasets. I conducted CSA analyses on oligodendrocyte co-expression datasets. I devised, implemented, and performed the enhancer CRM weighting and enhancer feature weighting approaches. I wrote the manuscript and created the supplemental documents. Manuscript input and edits were provided by Wyeth Wasserman, Alan Peterson, and Eric Denarier.

1. Introduction

The wide range of eukaryote cell types and tissues that are generated from a single eukaryote cell (embryo) requires numerous, well-defined tissue-specific gene regulatory systems, which are initiated, maintained, or arrested in response to temporal, spatial, and environmental cues. Each gene regulatory program is a combination of potentially many participating processes: transcription, translation, splicing, post-translational modifications, degradation, diffusion, cell growth, and others. Gene expression levels can be influenced by signals that are differentially initiated in a specific spatial state environment (for example, different cell types may express the same gene at different levels) or a specific temporal state (for example, a gene’s expression level in a given cell type in an early development stage may differ from its expression state in an adult stage in that same cell type).
Dysregulation of gene expression is implicated in a wide variety of diseases. Illumination of the molecular basis of gene regulatory systems, in both a temporal and spatial state context, is an integral step towards understanding the contributory mechanisms in disease phenotypes and identifying possible therapeutic interventions.

Since the primary response of a gene’s regulatory program is at the transcription step, much research has focused on the measurement of tissue-specific transcription levels and on deciphering the regulatory mechanisms that induce transcription through both experimental and computational analyses. Gene expression in higher-level eukaryotes frequently involves synergistic and/or antagonistic interplay of multiple transcription factors (TFs) that bind to DNA and/or interact with other TFs. Sequence-specific interactions of TFs with DNA occur at TF binding sites (TFBS). Studies have shown that clustered TFBS instances may lead to compatible interactions and cooperative regulation. Correspondingly, the identification of co-located TFBS motif signatures, often called cis-regulatory modules (CRMs), in a set of co-expressed genes can suggest homologous groups of TFs that contribute to the co-regulation of genes. The identification of the specific TFs that interact with these predicted motifs is an essential step in elucidating transcriptional co-regulatory systems.

The over-arching objective of my thesis research was to address these important challenges in gene regulatory analyses, firstly, through the design and development of an algorithm that identifies enrichment of TFBS combinations found in the non-coding regions adjacent to co-expressed genes and, secondly, with the assembly of a comprehensive catalog of human-mouse TFs (TFCat) that includes a DNA-binding domain (DBD) structural classification.
Importantly, although these tools may be applied for the detection of gene regulatory mechanisms in a variety of biological systems, they were systematically integrated into a new promoter sequence analysis method to predict the TFs that may be acting cooperatively to co-regulate expression of myelin-associated genes during central nervous system (CNS) myelinogenesis. The myelin sheath is a lipid-rich plasma membrane that wraps around the axons projecting from neural cells to enable proper conduction of impulses throughout the nervous system. The prediction and identification of TF cohorts acting in the myelin production transcriptional regulatory system is of particular importance because myelin malfunction contributes to debilitating human pathologies, such as multiple sclerosis and leukodystrophies, and elucidation of myelin gene regulation could lead to the development of treatments that improve remyelination to attenuate disease progression.

The remainder of this introduction will review relevant background information for the thesis and further motivate the thesis objectives.

1.1. Gene transcriptional regulatory mechanisms

Gene expression is, in part, controlled by DNA regulatory elements that (most often) reside on the same chromosome in non-coding regions neighbouring a gene’s transcription start site (TSS). One or more TF proteins can act directly on or interact indirectly (through protein-protein interactions) with the DNA to promote transcription. Regulatory DNA elements can be located proximal to a basal promoter TSS or may exert their influence over longer distances away from a TSS (often referred to as enhancers) and/or operate to regulate multiple adjacent genes (termed locus control regions). As protein access to the DNA sequence is a necessary requirement for the enablement of these operational forms, mechanisms affecting the chromatin architecture can greatly impact gene transcription.
Brief reviews of DNA elements, TF proteins, and chromatin architecture regulatory mechanisms follow.

1.1.1. Chromatin remodeling in transcription regulation

Eukaryotic DNA is arranged into chromatin via wrapping of approximately 200 base pairs of DNA around histone octamers to form nucleosomes. Roughly 147 DNA base pairs are coiled around a histone octamer made up of two each of four core histones: H2A, H2B, H3, and H4, which establish the nucleosome core structure, and the remainder of DNA sequence is involved in linking adjacent nucleosomes (Figure 1.1 and see review in [1]). Lysine-rich terminal tails are located at the N-termini of the four core histones and can extend beyond the surface of the nucleosome. These amino-tails are subject to specific modifications, which can precipitate chromatin structure accessibility state changes, causing the chromatin structure to become more open or closed (Figure 1.2).

In eukaryotes an additional histone, H1, found at half the concentration of other histones, is bound to nucleosomes near the DNA-histone octamer coiling entry and exit point, sealing the two DNA turns (Figure 1.1) (for a review see [2]). This first stage nucleosome packing, resembling beads on a string, can then be folded into a more compact structure, known as a solenoid form. The solenoid state is associated with DNA that is not transcriptionally active. A eukaryote cell’s ability to maintain its differentiated state long-term is enabled by accessible chromatin structure for those genes that are actively transcribed [3]. This open chromatin architecture enables trans-activating factors access to DNA target sequences. Chromatin accessibility is influenced by the types of histones, referred to as histone variants, that are included in a nucleosome [4] and the positional placement of nucleosomes in DNA.
Nucleosome positioning can influence transcription by inhibiting or enabling access to regulatory DNA elements [5] and nucleosome arrangements can be guided by the positional placement of DNA-bound proteins [6, 7]. In addition to structural influences, there are biochemical changes that alter chromatin structure to impact gene transcription regulation: 1) DNA methylation; 2) histone modifications; and 3) X-chromosome inactivation, briefly reviewed below.

1.1.1.1. DNA methylation

DNA methylation involves the addition of a chemical modification, a methyl group, to one or more DNA bases; the modification can be both added and removed without direct effect on the underlying DNA sequence. The most prominent DNA methylation modification in eukaryotic DNA is addition of 5-methyl groups to cytosines [8] with a preference for those located in CG dinucleotide sequences, which are frequently referred to as CpG sites. Methylation patterns can be tissue-specific, with methylated genes found in cells in which they are inactive and, conversely, active genes left unmethylated in cells in which they are transcribed. This pattern is likely due, in part, to interference of transcription factor DNA-interactions by methyl groups [9].

Clusters of CG sequences, known as CpG islands, are often found nearby or overlapping with gene promoters and first exons. With a few exceptions, CpG islands are present in an unmethylated state regardless of their associated gene’s transcription level (for a recent review see [10]) and are found at the 5’ end of constitutively transcribed genes, known as housekeeping genes (footnote 1). Studies suggest that proteins that bind methylated DNA induce a more “closed” chromatin structure, which could result from deacetylation of histones (for review see [11]), discussed below.

1.1.1.2. Histone modification

Histones can vary in structure depending on chemical modifications.
Acetylation is one such modification that covalently attaches an acetyl group to amino acids (for example, lysines), which reduces the positive charge of the histones (primarily occurring on H3 and H4) and opens the chromatin structure (Figure 1.2). Histone acetyl transferase (HAT) enzymes catalyze this process, promoting increased transcription [12]. Conversely, deacetylation of histones via histone deacetylases (HDACs) can inhibit transcription by producing a more condensed chromatin architecture (Figure 1.2). Histone acetylation patterns have been successfully analyzed to predict enhancer signatures [13, 14]. Methylation modifications can either activate gene transcription (for example, methylation of lysine-4 in H3, referred to as H3K4me) [13] or inhibit transcription (for example, tri-methylation of lysine-9 and/or lysine-27 in H3, referred to as H3K9me3 and H3K27me3 respectively [15, 16]). Histone modification signatures have been used to successfully predict and validate functional human enhancer and promoter regions [16, 17]. These recent genome-wide studies continue to support the long-standing hypothesis of a histone tail domain-encoded ‘language’, which regulates DNA-chromatin interactions that establish the chromatin accessibility state [18].

Footnote 1: Housekeeping genes are genes that are typically constitutively active in all cells because they are involved in cell maintenance functions.

Although such modifications are not addressed further in this thesis, other small molecules such as ubiquitin and small ubiquitin related histone modifiers (SUMO) [19] and phosphorylation of histone H3 (for review see [20]) will influence gene regulatory activity as well.

1.1.1.3. X-inactivation

The process of X-chromosome inactivation compensates for the fact that mammalian females have two X chromosomes whereas males have one. This process occurs early in embryo development and the selection of paternal or maternal X chromosomes for inactivation occurs randomly.
The inactive chromosome is packed into a condensed structure known as a Barr body. The inactivation process involves both acetylation and methylation histone modifications [21]. The XIST gene is responsible for silencing genes on the same (cis) X chromosome from which it is transcribed. The hypoxanthine phosphoribosyltransferase (HPRT) locus, described later, is located on the X-chromosome.

1.1.2. DNA regulatory elements

In most cases in eukaryotes, RNA transcripts encode a single gene, in contrast to prokaryote systems where functionally linked genes are often transcribed in one multi-gene RNA molecule (e.g. the Lac operon). Consequently, eukaryote transcriptional regulatory systems must coordinate expression of sets of requisite genes required for a common process or signaling pathway. This regulatory scheme supports increased diversity and plasticity in the combinations of genes that can be regulated for any given pathway.

Metazoan genes are regulated by a structured architecture of DNA regulatory elements (Figure 1.3). The core promoter region is approximately 60 bp surrounding the transcription start site (TSS) and houses DNA elements that interact with the basal machinery, which include: TFIID, TFIIA, TFIIB, RNA Polymerase II, TFIIF, TFIIE, and TFIIH (for review see [22]). Combinations of regulatory elements surrounding the TSS, including the TATA, Initiator Recognition (INR), and Downstream Promoter Element (DPE) motifs, serve to not only engage and position the basal machinery, but also provide specific selectivity for interactions with regulatory regions and TFs.

A gene may be transcribed from more than one promoter, which enables greater transcriptional plasticity between different tissues and developmental timeframes, and under varying environmental conditions [23].
Notably, a landmark cap analysis of gene expression (CAGE; footnote 2) study found at least 58% of protein-coding transcriptional units have two or more alternative promoters [24]. Identification of different start sites can assist with detection of proximal promoter regulatory elements that direct specific transcriptional units. Context-specific utilization of alternative promoters further complicates the identification of promoter-enhancer interactions (described below).

Footnote 2: In CAGE analysis, short ~20 nucleotide sequence tags that begin at the 5′ end of full-length mRNAs are sequenced to identify transcription start sites.

Eukaryote genes typically possess regulatory regions (enhancers) that can be located distal to the TSS, situated upstream as well as downstream of the gene and/or in introns, and generally contain multiple TFBS (Figure 1.3). These regions direct tissue-specific regulatory control of gene transcription through chromatin remodeling and/or TF cooperativity interactions, which enable [25], disable [26], or insulate [27] promoter activity. Importantly, gene transcription regulation may be an integration of multiple acting enhancers, each of which exerts specific temporal and/or spatial control [28].

1.1.3. Transcription factor proteins

TFs are proteins that direct transcription of one or more genes. These proteins can either directly bind to target DNA regulatory elements [29, 30] or act to influence DNA-binding of other TFs through protein-protein interactions [31]. TF proteins are composed of one or more modular domains that facilitate functional capacities, such as: DNA-binding domains to enable interaction with DNA motifs, cooperativity domains to facilitate interactions with other TFs, and activation domains that influence interactions with core promoter-associated proteins and/or coactivators [32]. This modular architecture enables combinatorial transcriptional mechanisms, which can specify a unique activity profile.
For example, TF proteins may act alone or in combination with accessory TFs (i.e. co-activators) to produce different regulatory effects under unique conditions [33].

TFs bind DNA through both sequence-specific and non-sequence-specific interactions. Nonspecific interactions may occur with double-stranded (dsDNA) and single-stranded DNA (ssDNA) through electrostatic interactions involving positively charged protein side chains and negative DNA backbone phosphate groups. Sequence-specific protein-dsDNA binding is facilitated by hydrogen-bond donor and acceptor sites made available in the minor and major groove of dsDNA which, when compatible, form complementary interactions with hydrogen-bonding acceptor and donor sites on a TF protein DNA binding domain (DBD) surface. Although TF-DNA binding affinity must be strong enough to allow the TF to remain on the DNA for a functionally sufficient period of time, the sequences satisfying this requirement are nevertheless degenerate [34]. Moreover, the presence of bound TFs in any given system may be modulated by specific biological parameters, such as TF protein concentration [35].

The sequence-specific tethering of a TF to DNA occurs at one or more specialized protein domain interfaces. The growing collection of solved protein-DNA binding structures [36] highlights structurally homologous classes with distinct DNA-binding mechanisms (for review see [37]). Many homologous DBD structures share similar protein sequences, which has enabled the development of models that predict DNA-binding domains [38-41] and, via inference, transcriptional roles for uncharacterized proteins [42].

1.1.4. Transcription factor mechanisms and cooperativity control

TFs can possess activation domains that modulate the initiation of transcription through indirect and/or direct interactions with targets in the basal transcription complex [43, 44].
TFs can act to repress transcription by a variety of mechanisms that include: 1) interference-binding: DNA-binding which interferes with TF binding or changes chromatin accessibility; 2) complex formation: a TF repressor binds a TF activator so that it cannot bind DNA; 3) quenching: proximal binding of a TF repressor alongside a TF activator to extinguish the activation effect; and 4) a TF repressor that has a direct negative effect on transcription [45].

The functional complexity of TF activity is increased by DNA accessibility and interactions with multiple cis-regulatory elements and trans-acting proteins. Multiple TFs can synergistically bind DNA that is associated with nucleosomes [46-48] or engage nucleosome-free DNA [49]. Physical cooperative interactions between TF DNA-bound proteins can precipitate synergistic transcriptional activation. For example, Olig1 - Sox10 interactions can co-activate oligodendrocyte (footnote 3) transcription [50]; similarly, MyoD (Myf) - Tcfe2a (E12) physical associations contribute to certain muscle gene regulation [51]. TF cooperativity can also occur between co-localized DNA-bound TFs that do not physically interact. For instance, DNA-bound TFs can produce DNA conformation changes that facilitate the binding of additional TFs [52] and multiple TFs may individually interact with the core promoter to exert an accumulative activation effect [53].

Footnote 3: Oligodendrocytes are responsible for myelinating the axons of neurons in the CNS. See section 1.6 for more information.

1.2. Analysis of gene expression

The analysis of gene transcription is multi-faceted, as there are many stages in the production of a gene’s transcript that may be studied: transcription initiation, elongation, termination, and further stages of transcript processing. Gene transcription is an intermediate stage in the process of protein production from coding sequences. However,
unlike some post-transcriptional processing mechanisms, the transcription of RNA is an obligatory prerequisite for protein production. Due to this requirement and the relative ease of high-throughput experimental procedures for RNA measurements, mRNA expression profiles are often utilized as a surrogate indicator of protein levels. However, recent mRNA expression level comparisons with protein detection assays, which allow for quantification of protein levels, suggest that the correlation between mRNA and protein levels may only be moderate at best [54]. Nevertheless, increased production of mRNA is an indication of a modulated regulatory process.

1.2.1. High-throughput gene expression technologies

Understanding when and where a gene is expressed is often a preliminary step in gene regulatory analyses and a number of technologies have emerged to identify gene expression patterns and profiles. Microarray platforms, such as spotted cDNA microarrays [55] and oligonucleotide arrays [56, 57], are commonly used to simultaneously measure the expression of a set of genes. Serial Analysis of Gene Expression (SAGE; footnote 4) [58] and modern transcriptome analysis via high-throughput sequencing of RNA samples (e.g. RNA-seq, see review in [59]) offer advantages over array-based procedures, in that novel transcripts can be observed and detection is not dependent upon hybridization conditions. As expression data from sequencing-based gene expression technologies were not incorporated in this thesis, the introductory focus will be placed upon array-based methods.

Footnote 4: In the SAGE technique, unique short sequence segments (tags) within transcripts are identified, linked, and sequenced to enumerate the number of times a tag (the proxy for a transcript) is observed.

Microarray-based gene expression measurements are often compared between the same tissues in different states, such as diseased versus healthy tissue or two developmental stages.
For example, numerous studies have examined the changes in gene expression between cancerous tissues and normal tissues (of common cellular classes) to identify cancer biomarkers [60]. Similarly, gene expression can be compared between different tissues to identify unique tissue-specific [61] and developmental-stage specific [62] gene expression profiles.

Microarray technology for parallel measurement of RNA by hybridization is well established. Microarrays are constructed by affixing target probes (cDNA or synthesized oligonucleotides) to a solid surface (a chip) within a matrix architecture using a variety of materials and assembly methods (for a technology review see [63]). Nucleic acids labeled with fluorochromes are hybridized onto the array-bound probes. When using cDNA arrays, typically two samples with two different fluorescent dyes are applied to one array which, when excited by lasers, emit unique wavelengths for each dye that are interpreted and quantified by scanners. This intensity measurement provides a relative quantification of gene expression. In contrast, oligonucleotide array technology permits measurement of absolute gene expression intensity, through fluorochrome excitation and laser scanner interpretation, and, generally, only one sample is applied to each array. Oligonucleotide array technology has become the favored platform because of its inherent flexibility in probe design and the ability to compare gene expression intensities across multiple samples/chips. Gene expression studies described in the following thesis chapters were captured using oligonucleotide (Affymetrix) single-channel arrays and, therefore, subsequent discussion will be directed towards this type of microarray platform.

1.2.1.1. Normalization and quantification of microarray expression data

Microarray measured expression values can be influenced by non-biological (technical) variation and biological variation.
The detection of differential gene expression relies on the fact that differences are consistent across more than one biological sample. Since a number of steps are required for sample preparation and execution of a microarray experiment, technical errors can be introduced along the way. Assessment of this variation requires the incorporation of more than one biological sample and replication of the technical protocol using additional microarray chips. However, the number of replicates included in each microarray experiment may be limited by the number of biological samples available and the costs associated with technical replication.

After the probe array images have been captured by scanners and software, which analyze hybridization data for each probe, data preprocessing is performed to remove sources of technical variation produced by biases in: dye integration, sample preparation, hybridization, and image processing effects. Background correction is required to remove unrelated extraneous hybridization effects. The Affymetrix oligonucleotide arrays contain both perfect match and mismatched probe pair sets targeting specific gene/mRNA sequences and probes that enable application of positive and negative controls. The challenge is to remove the non-specific background signal and summarize the probes to obtain a measurement of expression (ME) for a gene transcript. Normalization is performed during this analysis to account for scale differences in the hybridization values across multiple chips (technical replicates). One popular algorithm, which was utilized in studies described later in this thesis, is called Robust Multi-chip Analysis (RMA) [64, 65].
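To make the quantile normalization idea used within RMA more concrete, the following is a minimal Python sketch. It is an illustration only and not the RMA implementation from [64, 65]: the function name `quantile_normalize`, the toy intensity matrix, and the tie-handling rule (ties broken by order of appearance) are assumptions introduced here, and background correction and probe set summarization are omitted.

```python
import numpy as np

def quantile_normalize(intensities):
    """Give every chip (column) the same intensity distribution.

    Hypothetical helper for illustration; rows are probes, columns are chips.
    """
    x = np.asarray(intensities, dtype=float)
    # Sort the probe intensities of each chip independently.
    sorted_cols = np.sort(x, axis=0)
    # Reference distribution: mean intensity across chips at each rank.
    reference = sorted_cols.mean(axis=1)
    # Rank of every probe within its own chip (0 = lowest intensity).
    # Assumption: ties are broken by order of appearance (stable sort).
    ranks = np.argsort(np.argsort(x, axis=0, kind="stable"), axis=0, kind="stable")
    # Replace each intensity with the reference value at its rank.
    return reference[ranks]

# Toy example: 4 probes measured on 3 chips (made-up values).
chips = np.array([[5.0, 4.0, 3.0],
                  [2.0, 1.0, 4.0],
                  [3.0, 4.0, 6.0],
                  [4.0, 2.0, 8.0]])
normalized = quantile_normalize(chips)
print(normalized)
```

After this step every column contains the same set of values, so differences between chips reflect only the rank order of probes within each chip rather than chip-wide scale effects.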
This algorithm conducts a three-step analysis that includes: 1) background correction using a model that ignores the mismatched probe values; 2) quantile normalization, which makes the probe intensity distributions the same across chips; and 3) probe set summarization to derive gene transcript level gene expression measurements. Several other normalization methods are reviewed in [66].

1.2.1.2. Analysis of gene expression differences

Gene expression profiling is often applied to discover the gene expression differences between two or more classes of samples, where a class is defined as a categorical variable such as developmental time point or tissue type. Statistical comparisons are performed using tests, such as t- and F-tests, to determine the significance of the expression difference between the same probe sets on arrays. If a stringent p-value cut-off of 0.001 is utilized, the expected number of false positives is limited to 10 in 10,000 genes tested. However, if there are only a small number of samples for each class, the computed within-class variance of each gene may be imprecise [67]. Modified versions of these tests, for example random variance t-tests [68], assume that the variance of genes within a class may be different; however, each variance is drawn from a single distribution that is shared by all genes in a class.

Given the high number of comparisons that must be performed on microarray probe sets, adjustments must be incorporated to account for multiple statistical tests. The type I error rate describes the false positive rate. There are two classes of corrections that can be applied: those that control the family-wise error rate (FWER), the probability of at least one type I error, and those that control the false discovery rate (FDR), the expected proportion of type I errors within the rejected tests. The Bonferroni correction is a conservative approach that controls for FWER.
The Bonferroni procedure establishes a new significance level by dividing the desired significance level by the number of tests performed.

An alternative method proposed by Benjamini and Hochberg [69] controls the FDR. The Benjamini and Hochberg FDR is easily computed at each row i of the probe set p-values, sorted in ascending order after the statistical tests have been performed, as: the row i p-value multiplied by the total number of probe sets tested, divided by the row number i. The computed FDR for row i is an estimate of the proportion of false positive probe set expression differences among those with p-values less than or equal to the row i p-value. Additional commonly applied statistical correction methods are reviewed in [66].

1.3. Experimental techniques for regulatory sequence detection

Deciphering the transcriptional regulatory mechanisms for the ~30,000 human genes is central to understanding the assembly of complex biological systems that are specified by the information coded in DNA. Recent large-scale efforts, such as the ENCyclopedia Of DNA Elements (ENCODE) project [70], have begun to illuminate the complex organization of functional features in the human genome. Smaller-scale studies have devoted significant resources to detailed investigations of gene transcription regulation [71, 72], which have offered important insights into the diversity, complexity, and capacity of regulatory mechanisms. Experimental approaches have been developed for the study of protein-DNA interactions and validation of cis-regulatory sequences. A comprehensive overview of experimental techniques for gene regulatory analyses is available in a recently published text [73]. Specific methods relevant to the research in this thesis are introduced below.

1.3.1. Identification of TF-DNA binding

Experimental detection of protein-bound DNA highlights TF-DNA interactions that may influence transcriptional activity.
Several experimental methods start with labeled DNA fragments, which are incubated with a cognate double-stranded DNA (dsDNA)-binding protein. The electrophoretic mobility shift assay (EMSA), commonly referred to as a gel shift, allows separation of protein-DNA complexes from unbound sequence. Protein bound to the DNA probe retards its migration through the polyacrylamide gel relative to unbound DNA [74]. The DNA and/or protein may be recovered from the gel and subjected to further investigation. The EMSA procedure can be repeated either in the presence of unlabelled dsDNA of various sequences to reveal the binding specificity of the protein or, alternatively, in the presence of an antibody to identify the protein engaging the DNA (most commonly through further retardation of the protein-DNA complex, referred to as a “supershift”). An alternative technique for the study of protein-DNA interactions is DNase I footprinting. A mixture of proteins and 5' end-labeled dsDNA is subjected to partial digestion by DNase I, an enzyme that preferentially cleaves naked DNA. Protein-protected DNA sequences are identified via gel mobility separation to establish the exact location of protected DNA positions relative to the labeled ends.

High-throughput methods using chromatin immunoprecipitation (ChIP) assays detect in vivo DNA-TF interactions. Two such methods, ChIP-chip [75] and ChIP-seq [76], differ in the procedure used to identify the protein-bound sequence. Both techniques involve cross-linking proteins to chromatin, shearing of DNA to isolate protein-bound sequence, and recovery of protein-DNA complexes using antibodies. After removal of the protein-DNA cross-links, the ChIP-chip method incorporates microarray analysis, while the ChIP-seq method subjects the pool of recovered DNA to high-throughput sequencing.
An inherent limitation of the ChIP-chip method is the number of array probes that can be placed on the chip for sequence determination. In ChIP-seq, short sequence reads are mapped to the reference genome and regions with high read densities are considered binding site locations [77]. ChIP-chip assays return DNA fragments of lengths between 200 and 1000 bp (resolution relies on the size of the chromatin fragments and the probes on the array). In contrast, deep sequencing in ChIP-seq studies can define the locations of bound protein more precisely [77].

1.3.1.1. Identification of TF binding site motifs

The interaction of TFs with their cognate DNA target sequences is one of the key steps in transcription initiation. It is well understood that TFs bind DNA with a level of degeneracy [35, 78]. Therefore, accurate profiles of the binding properties of sequence-specific TFs can be a key component of successful analyses. Until recently, the most common experimental technique used to profile high-affinity TF binding sites was the SELEX assay (systematic evolution of ligands by exponential enrichment) [79, 80]. In brief, oligonucleotides of random sequence are incubated with a TF of interest, protein-DNA complexes are purified, and the DNA is amplified by PCR. Starting with the DNA recovered from the previous step, this process is repeated several times to reveal high-affinity target sequences.

A ground-breaking study using protein binding microarray (PBM) technology (containing all 10 bp sequences) [81] identified a full range of affinity binding sites for 104 TFs [34]. Remarkably, secondary DNA binding preferences were identified for half of the TFs. Position interdependence (footnote 5) was identified in binding sites of ~20% of the TFs studied. These results exemplify the complexity of TF-DNA sequence interpretation.
Further investigation is warranted to determine whether distinct TF sequence preference arrangements provide selective mechanisms for differential regulatory induction effects.

1.3.2. Regulatory sequence validation

A putative regulatory region can be tested for its ability to direct transcription of reporter genes. A reporter gene encodes a protein that, when expressed, possesses properties that enable its presence to be uniquely measured (for review see [82]). For example, when the green fluorescent protein (GFP, from jellyfish) is excited with blue light, it fluoresces green. Similarly, the β-galactosidase protein (encoded by the bacterial gene lacZ) cleaves the colorless substrate X-gal into galactose and an insoluble blue product that can be visually detected and quantified. To test a regulatory region’s functional properties, the sequence is inserted into a vector carrying a reporter gene and an endogenous or exogenous promoter. The vector may be transfected in vitro into cells or tested in vivo using transgenesis techniques. While in vitro studies can provide environments that are comparable to in vivo systems, some cell line systems may lack key molecular components necessary for transcription regulation. Transgenesis studies can identify tissue-targeted specificity and may provide information about temporal-specific expression. A brief discussion regarding mouse transgenesis follows.

Footnote 5: Position interdependence suggests that co-occurring nucleotides in specific positions of a binding site interact non-independently with a transcription factor protein.

1.3.2.1. Mouse transgenesis

Mouse transgenesis is a powerful experimental method that can be used to investigate the regulatory expression capacity of putative enhancer sequences in vivo. Two methods can be used to produce transgenic mice: 1) DNA injection into the male pronucleus and 2) embryonic stem (ES) cell methods.
In method (1), fertilized eggs are harvested before the sperm head has become a pronucleus, the vector is injected into the male pronucleus, and the pronuclei are allowed to fuse to form the diploid zygote nucleus before implantation of the embryos into a pseudopregnant female mouse. Method (2) involves the transformation of the vector into ES cells, selection of the vector-bearing cells, injection of ES cells into mouse blastocoels or under the zona pellucida of eight-cell uncompacted embryos, and implantation of embryos in a pseudopregnant female mouse as above. In both methods, mice bearing the transgene may display distinctive coat color to facilitate confirmation of transgene integration. Recent technological innovations allow for directed insertion of a single transgene copy into a targeted locus [83, 84]. The hypoxanthine phosphoribosyltransferase (Hprt1) locus is one such target, which is used in the studies described in this thesis; it is described in some detail in the following section.

1.3.2.1.1. Hprt1 mouse transgenesis system

Until recently, it has been difficult to compare and interpret gene regulatory data obtained in mouse studies due to 1) the variability in the number of transgenes integrated and 2) the possible chromatin influences at the integration sites. To overcome these sources of variability, a controlled transgenesis strategy that enables single-copy integration at the Hprt1 locus (Figure 1.4) was developed [83]. This method enables the insertion of single-copy reporter constructs at a predetermined site in the Hprt1 locus, located on the X chromosome. In brief, a vector construct includes Hprt1 sequences for homologous recombination, a portion of the Hprt1 gene that is absent in a mouse ES cell line, a reporter gene, and a basal promoter along with candidate regulatory sequences for analysis.
Destination constructs bearing the Hprt1 targeting cassette are transfected into ES cells carrying a deletion spanning the promoter and first two exons of the Hprt1 gene. The Hprt1 gene, restored by homologous recombination of the vector, confers resistance to hypoxanthine-aminopterin-thymine (HAT) selection. The HAT-resistant ES cells are injected into blastocysts or aggregated with eight-cell embryos, and the blastocysts or embryos are transferred to pseudopregnant female mice. Transgenic mice are confirmed through coat-color selection and sequencing. Typically mice are back-crossed several times to create a homogeneous genetic background, as the most effective ES cells are hybrids [85].

1.4. Transcriptional regulatory region detection through comparative genomics

TF regulatory activity is specified by information coded in DNA sequence. As such, the identification of regulatory sequence is an integral step in decoding gene expression mechanisms. The vast amount of non-coding sequence in metazoan genomes makes the identification of functional sequence challenging. The availability of multiple genome sequences has provided important context for DNA-sequence analysis methods through comparative sequence analysis and phylogenetic footprinting.

1.4.1. Multiple sequence alignments

The sequencing of multiple species genomes has motivated the development of algorithms that align sequences to depict evolutionary relationships. Several approaches have been designed specifically for DNA sequence studies. Local alignments focus on determining sequence conservation in short segments, while global alignments attempt to align the full length of the sequences. At present, the most commonly applied methods adopt a hybrid approach, referred to as “glocal” alignment, that initially identifies local alignments, which are then concatenated to form longer alignments.
Current multiple sequence alignment (MSA) algorithms use a progressive alignment approach, which relies on phylogenetic trees, in which the two most closely related species sequences are aligned first and additional pairwise alignments are performed until all sequences are incorporated [86-89]. Similarly, the popular MSA datasets [90] provided by the University of California Santa Cruz (UCSC) resource [91] are established through best-in-genome pairwise alignments across N species, with progressive alignments guided by a phylogenetic tree topology [89].

1.4.2. Phylogenetic footprinting

Comparative genomics is widely used to identify regulatory region candidates in non-coding sequence, under the assumption that sequences conserved across multiple species are functionally important. This approach, known as phylogenetic footprinting, relies on non-coding multi-species sequence alignments to identify conserved putative regulatory regions and has been applied with remarkable success. Conserved regions tested using in vivo mouse transgenesis reporter assays (described above) have demonstrated the capacity to direct tissue-targeted gene expression [92-95]. Furthermore, there is evidence that evolutionarily conserved regions contain functional regulatory motifs [96-98]. However, recent studies have highlighted cases of active regulatory elements that are not conserved [99, 100]. Notably, a recent study found that 41-89% of protein-bound locations identified in hepatocyte ChIP studies are not conserved between human and mouse [101], with the caveat that DNA binding and functional impact on gene regulation are not necessarily equivalent. It is important to recognize that the use of conservation as a filter to reduce false positive predictions in regulatory region detection will constrain sensitivity.

1.5. Computational identification of regulatory mechanisms

Exhaustive experimental validation of non-coding sequences for regulatory function is infeasible.
Computational analysis approaches have been developed to identify putative TFBS and predict regulatory mechanisms for experimental investigation. Numerous algorithmic approaches have been designed to achieve such objectives. A review of regulatory element detection methods and algorithms relevant to the scope of this thesis is presented below.

1.5.1. Predicting TF binding sites

DNA-binding TF proteins interact with short degenerate DNA motifs, typically 5-15 bp in length. A set of high-affinity TF binding sites (TFBS) can be identified through experimental analyses, as described earlier. A critical component of computational analysis of regulatory mechanisms is the prediction of TFBS within a sequence. TF binding site prediction approaches fall into two categories: 1) motif matching and 2) motif discovery.

Motifs may be computationally detected using knowledge of TF sequence binding affinities, which have been experimentally identified (as described previously), to identify sequences that ‘match’ the TFBS characterization. A set of TFBS may be aligned and depicted as a degenerate consensus sequence. For example, the TF binding sequences depicted in Figure 1.5 (GATCAG, GATCAT, GATCCA, GACTGT) may be summarized as the consensus sequence GAYYVD, where Y, V, and D are International Union of Pure and Applied Chemistry (IUPAC; footnote 6) ambiguity codes that each represent more than one nucleotide (Y = [C or T]; V = [A or C or G]; D = [A or G or T]). While this depiction of TF binding sequence preferences is useful, the consensus sequence format does not incorporate the likelihood of observing each nucleotide at a given motif position. More commonly, a position weight matrix (PWM) is used, which reflects the probability of encountering a particular nucleotide at each position. Importantly, it has been shown that the PWM approach provides an estimate of the binding energy contributions of base pairs (see review in [102]).
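The consensus representation can be illustrated with a short Python sketch that derives an IUPAC consensus from aligned sites and scans a sequence for matches. The four sites are those listed in the text; the function names and the test sequence are illustrative assumptions. Computed directly from these four sites, the fifth column contains A, C and G, which corresponds to the IUPAC code V.

```python
import re

# IUPAC nucleotide ambiguity codes, keyed by the set of bases they represent.
IUPAC_BY_BASES = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y",
    frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V",
    frozenset("ACGT"): "N",
}
# Reverse lookup: ambiguity code -> sorted string of the bases it allows.
BASES_BY_IUPAC = {code: "".join(sorted(bases))
                  for bases, code in IUPAC_BY_BASES.items()}


def consensus(sites):
    """Collapse equal-length aligned sites into an IUPAC consensus string."""
    return "".join(IUPAC_BY_BASES[frozenset(column)] for column in zip(*sites))


def find_matches(consensus_seq, sequence):
    """Return (start, match) pairs, including overlapping matches."""
    pattern = "".join("[" + BASES_BY_IUPAC[code] + "]" for code in consensus_seq)
    # A zero-width lookahead lets the regex report overlapping matches.
    return [(m.start(), m.group(1))
            for m in re.finditer("(?=(" + pattern + "))", sequence)]


SITES = ["GATCAG", "GATCAT", "GATCCA", "GACTGT"]
motif = consensus(SITES)  # "GAYYVD" for the four sites above
hits = find_matches(motif, "TTGATCAGAA")
```

Every position that matches is reported with equal standing, which is the limitation that motivates the weighted matrix representation.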
Initially, a position frequency matrix (PFM) is computed, which enumerates the count of each base in each column of a set of aligned TFBS (Figure 1.5). The PFM may be transformed to a log-scale PWM as follows:

$$W_{b,i} = \log_2 \frac{p'(b,i)}{p(b)} \qquad (1)$$

where $W_{b,i}$ is the PWM value of base $b$ at column $i$, $p'(b,i)$ is the corrected probability of base $b$ found at column $i$, and $p(b)$ is the probability of base $b$ in the genomic background; $p'(b,i)$ can be computed as:

$$p'(b,i) = \frac{C_{b,i} + p(b)\sqrt{n}}{n + \sqrt{n}} \qquad (2)$$

where $C_{b,i}$ is the count of base $b$ at column $i$ and $n$ is the total number of motifs contributing to the model. It should be noted that there are subtle variations in the formulas used by different researchers [103]. Each sequence is scanned and potential binding sites are enumerated if they meet an indicated minimum threshold score. Although the PWM approach provides improved results over the consensus matching approach, it often produces a high number of false predictions (where false refers to sequence motifs that are likely to be suitable for protein-DNA interactions in vitro but are not functional cis-regulatory elements). An additional shortcoming of this model is the assumption of independence of nucleotides at different positions in the binding site. Several databases have been developed that provide lists of experimentally determined TFBS and TFBS profiles. A select list of these resources is provided in Table 1.1.

Footnote 6: IUPAC is a nomenclature system that describes chemical compounds.

The construction of PWM models assumes that an existing alignment of TF binding sites is provided. In the absence of such an alignment, a motif discovery procedure can be applied. De novo motif detection entails identifying over-represented short sequence patterns in a set of longer DNA sequences observed or predicted to be bound by the same TF. Popular technical approaches include expectation maximization (EM) [104, 105] and Gibbs sampling [106].
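Returning to the PFM-to-PWM transformation of equations (1) and (2) and the threshold scan described above, the steps can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than the implementation used in this thesis: the background is taken to be uniform (0.25 per base), the pseudocount weight is taken to be the square root of the number of sites (a common choice), and the function names, example sites and threshold are hypothetical.

```python
import math

BASES = "ACGT"


def build_pfm(sites):
    """Count each base in each column of equal-length aligned sites."""
    width = len(sites[0])
    pfm = {base: [0] * width for base in BASES}
    for site in sites:
        for i, base in enumerate(site):
            pfm[base][i] += 1
    return pfm


def pfm_to_pwm(pfm, n_sites, background=None):
    """Apply equations (1) and (2) with a sqrt(n) pseudocount (assumption)."""
    if background is None:
        background = {base: 0.25 for base in BASES}  # assumed uniform
    pseudo = math.sqrt(n_sites)
    pwm = {}
    for base in BASES:
        pwm[base] = []
        for count in pfm[base]:
            corrected = (count + background[base] * pseudo) / (n_sites + pseudo)
            pwm[base].append(math.log2(corrected / background[base]))
    return pwm


def scan(sequence, pwm, min_score):
    """Score every window on one strand; keep those at or above min_score."""
    width = len(pwm["A"])
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(pwm[base][i] for i, base in enumerate(window))
        if score >= min_score:
            hits.append((start, window, score))
    return hits


EXAMPLE_SITES = ["GATCAG", "GATCAT", "GATCCA", "GACTGT"]
example_pwm = pfm_to_pwm(build_pfm(EXAMPLE_SITES), len(EXAMPLE_SITES))
example_hits = scan("TTGATCAGAA", example_pwm, min_score=5.0)
```

The sketch scans a single strand and assumes the sequence contains only A, C, G and T; a practical scanner would also score the reverse complement and express the threshold relative to the minimum and maximum attainable scores.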
As a detailed review of motif discovery methods is beyond the scope of this thesis, interested readers may wish to refer to reviews on the subject [107-109].

1.5.2. Computational detection of cis-regulatory modules

The development of computational methods for the detection of CRMs is useful for the study of cooperative interactions between TFs. A diverse set of computational methods has been developed to discriminate CRMs in sequences. The common objective of each of these approaches is to detect non-random clusters of TFBS. Several CRM-based algorithms have been developed specifically to identify regulatory elements proximal to core promoters (see [110] for a recent review and evaluation of this software). Additional methods have been developed to search for CRMs in regions distal to the TSS. These ‘enhancer’ detection tools apply unique and overlapping procedures (Table 1.2). Selected algorithms pre-dating the CRM detection research presented in this thesis are summarized in Table 1.3. For a recent review of CRM analysis tools see [111].

1.6. Myelinogenesis in the nervous system

The nervous system is composed of a network of cells that enables communication through electrical impulses. This system is divided into two parts: the central nervous system (CNS) and the peripheral nervous system (PNS). The CNS consists of the brain, spinal cord and optic nerves, while the PNS resides outside the CNS, providing connections from the CNS to limbs and organs (for example, spinal roots and sciatic nerves). There are two main cell types: neurons and glial cells. Neurons are responsible for transmitting signals through the nervous system. Glial cells often reside in direct contact with neurons and perform supportive and enabling roles.
Neurons are essentially made up of three main components: 1) dendrites, which receive information from other cells; 2) the cell body, which contains the nucleus and other eukaryotic organelles; and 3) axons, which conduct signals away from the cell body. The myelin sheath is an insulating layer that forms around axons, enabling efficient conduction of impulses along nerve fibers throughout the vertebrate nervous system. Positioned at the gaps between individual myelin sheaths are the Nodes of Ranvier, which enable the propagation of impulses. At resting potential, the fluid outside of the plasma membrane is positively charged and the interior is negatively charged. Voltage-gated sodium channels, concentrated at the Nodes of Ranvier (Figure 1.6), open in response to a depolarization (footnote 7) of the plasma membrane, allowing an influx of positive sodium ions, which creates a positive interior charge. A depolarization in one area of the membrane causes voltage-gated channels in adjacent regions of the membrane to open, resulting in continued influx of positive sodium ions and evoking a depolarization sweep along the axon, referred to as an action potential. The speed of action potential propagation in non-myelinated fibers is proportional to the axon diameter. Importantly, properties of the lipid-rich myelin sheath facilitate high-speed propagation of impulses without an increase in axon diameter.

Myelin is elaborated from the plasma membrane of two types of glial cells: Schwann cells in the PNS, which encase a segment of a single axon (Figure 1.6 and Figure 1.7), and oligodendrocytes in the CNS, which are capable of myelinating several axons (Figure 1.6 and Figure 1.8). The myelin membrane contains a number of proteins that are responsible for its unique structural and functional properties (Figure 1.6).
In the CNS, the PLP1 protein, a tetraspan membrane protein, makes up 17% of myelin proteins, while the myelin basic protein (MBP), the second largest protein component of myelin, constitutes 8% of the total protein [112]. Given the critical role that the myelin sheath plays in the nervous system, it is not surprising that dysregulation of myelin protein production results in debilitating neuropathies.

Footnote 7: Depolarization is a change in voltage difference between the interior and exterior of a cell.

1.6.1. Schwann cells

Schwann cells (SC), which develop in the PNS, myelinate single axons. Most SCs develop from the neural crest, where precursors transition to immature SCs (around E16) and, while migrating along the axon tracts, diverge into either myelinating or non-myelinating Schwann cells (Figure 1.7) [113]. Studies have demonstrated that the non-myelinating and myelinating SC phenotypes are interconvertible and that signals emanating from axons determine these states (see review in [114]). Much effort has been devoted to determining the TFs that are modulated by these signals in order to identify the responsible regulatory pathways (for review see [115]). A number of peripheral myelinopathies are caused by altered gene dosage [116, 117] and, therefore, it is hypothesized that gene regulation-based therapies may be an avenue for intervention.

1.6.2. Oligodendrocytes

Oligodendrocyte (OL) cells are responsible for diverse functional roles in the central nervous system (CNS), including myelin sheath formation and axonal integrity maintenance. Much like the SC maturation process, OL development progresses through multiple cell stage transitions, starting with a precursor cell stage and culminating in a mature myelinating oligodendrocyte (Figure 1.9). Each of these cell stages can be identified by specific protein marker expression.
Fate-mapping studies in mice suggest that oligodendrocyte precursor cells (OPC) initially populate the cortex from the motor neuron progenitor domain (pMN) in the ventral ventricular zones and that a second wave of OPCs is then generated from dorsal sources [118]. This specification and maturation of OLs is largely controlled by TF gene regulation [119]. Specific TFs and TF families have demonstrated necessary roles in the OL development process, including: Olig1 and Olig2 [120-123] along with other members of the basic Helix-Loop-Helix (bHLH) family [124, 125], Nkx2-2 [126, 127], Nkx6-1 [128], Nkx6-2 [129], and Sox family TFs [130-132]. Recent studies confirm that a subset of these regulatory effects is a consequence of synergistic TF control [50, 133-136].

1.6.2.1. Olig transcription factors

Olig TF proteins belong to a sub-class of the bHLH group of DNA-binding proteins. Sonic hedgehog (Shh) is a signaling protein required for OL development in the forebrain [137-139], and it induces CNS expression of both Olig1 and Olig2 in human spinal cord and forebrain prior to the emergence of PDGFR-alpha+ and NG2+ OL progenitor cells (see protein markers in Figure 1.9) [120, 140]. Olig1 and Olig2 TF expression is responsible for the generation of OL progenitor (PDGFR-alpha+) cells in brain regions [121]. Recent studies suggest that the Olig2 TF is required for OL development in the spinal cord, while Olig1 is necessary for OL specification in the brain [120, 121, 141]. The critical role that Olig1 plays in brain OL development was further elucidated in a recent study using Olig1 knockout mice [123]. Olig1 null mice develop severe neurological defects, such as tremors and seizures, and die two weeks after birth.
Although the elimination of Olig1 expression does not inhibit OL progenitor stage cell development in the brain, its absence abolishes all major myelin gene expression and myelin sheath assembly, demonstrating that Olig1 plays an essential role during initial CNS myelin elaboration. However, Olig1 may not be involved in control of myelinogenesis beyond the initial accumulation of myelinating OLs during development, since Olig1 proteins are found localized in the cytoplasm after two weeks of age [142]. Importantly, Olig1 TF proteins re-enter the nucleus following demyelinating injuries, which suggests that Olig1 may be involved in remyelination and maintenance activities in the adult brain. Deciphering the regulatory role of Olig1 in these adult OL regenerative events would be an important contribution towards elucidating myelin reparative mechanisms.

1.6.3. Myelin Basic Protein

Myelin Basic Protein (MBP), found on the cytoplasmic face of the myelin membrane, is required for myelin compaction in the PNS (Schwann cells) and CNS (oligodendrocytes) (Figure 1.6). The MBP gene is located on chromosome 18 in both murine and human genomes and is composed of seven exons. There are four known major protein forms in mice, two of which constitute 95% of the MBPs; these are translated from at least seven transcript variants, which appear to be differentially expressed at different time points (see review in [143]). In mice, MBP protein production is required for long-term viability. Mutant mice lacking the MBP gene, a mouse model called Shiverer, exhibit a shivering gait that appears a few weeks after birth and remains until premature death (between 50 and 100 days old) [144].

Given MBP’s central role in the myelin sheath architecture, a number of studies have been conducted to identify regulatory sequence that directs MBP gene expression.
Recent studies have used targeted in vivo mouse transgenesis strategies to characterize the tissue-specific and temporal regulatory functions of non-coding conserved regions situated upstream of the MBP gene [93, 94, 145-147]. Notably, each of these enhancers contributes individually and/or in combination to direct MBP gene expression to SC (PNS) and/or OL (CNS) cells at different developmental stages.

1.7. Thesis overview and chapter summaries

One of the fundamental challenges in this post-genome era is determining how genes are regulated. A growing list of diseases is linked to dysregulation of gene expression. The complexity of transcriptional mechanisms in metazoans presents great challenges in defining the specific processes that lead to dysregulation of genes. Detailed elucidation of gene regulatory mechanisms is a necessary prerequisite to the development of therapies.

The myelin sheath plays a critical role in enabling and maintaining the integrity of neural signal transmission throughout the vertebrate PNS and CNS. Myelin elaboration is largely controlled through synergistic transcriptional mechanisms, and it is well known that dosage changes in myelin-associated proteins result in debilitating neuropathies. Discovery of the DNA-binding TFs and cooperative mechanisms that are responsible for myelin gene regulation is essential for determining potential therapeutic strategies that alleviate these severe disease phenotypes. The work described in this thesis was motivated by this fundamental objective.

The aim of this thesis was to develop the necessary algorithms and resources to predict TF cooperative involvement in CNS myelin sheath elaboration. A summary of the approaches described in chapters two through six is provided below.
In chapter 2, I describe an approach (formerly named oPOSSUM2 and currently called Combination Site Analysis, CSA), published in [148], that enables the prediction of DNA-binding TF cooperativity in non-coding regions of human or mouse co-expressed genes. When I began work on this algorithm in 2004, few tools addressed the detection of non-homotypic CRMs (footnote 8; Table 1.3). Our objective was to develop a computationally efficient process that identifies over-represented combinations of TFBS in a set of co-expressed genes as compared with predicted TFBS combinations found in a background set of non-coding regions (predicted in the oPOSSUM database developed in our lab [149]). The evaluation of multiple predicted TFBS in sequences yields a search space that is combinatorial in nature and computationally unfavorable for conducting analyses in a reasonable timeframe. Another inherent challenge with TFBS predictions is that many TF family members bind similar sequences and, therefore, a prediction of a binding site for one TF family member can implicate the binding of homologous family members. We addressed these challenges through the design of an algorithm that implements a TF binding profile clustering step to identify statistically significant combinations of TFBS class instances, followed by an evaluation of sets of TFs that are members of the indicated classes. Additionally, incorporation of a biologically relevant inter-binding site distance (IBSD) parameter facilitated aggregation of TF pairs that satisfy the IBSD into larger CRMs, which avoided brute-force enumeration of larger CRM sizes. The CSA approach evaluates the significance of TFBS combinations found in non-coding regions of a co-expressed gene set against the frequency of the TFBS combination found in a (full ortholog) background set, using a Fisher exact test. CSA analyses of CRM reference collections recovered known TFBS combinations in top-ranked results.
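The significance test at the core of this comparison can be sketched as a one-tailed Fisher exact test on a 2x2 table of genes with and without a given TFBS combination. This is a minimal illustration, not the oPOSSUM/CSA implementation: the function name, the treatment of the co-expressed and background sets as disjoint, and the example counts are all assumptions made for the sketch.

```python
from math import comb


def fisher_exact_enrichment(target_hits, target_total,
                            background_hits, background_total):
    """One-tailed Fisher exact p-value for over-representation.

    target_hits / target_total: genes in the co-expressed set that contain
    the TFBS combination, and the size of that set.
    background_hits / background_total: the same counts for the background
    set, ASSUMED here to be disjoint from the co-expressed set.
    """
    if not (0 <= target_hits <= target_total
            and 0 <= background_hits <= background_total):
        raise ValueError("hit counts must lie between 0 and the set size")
    total_genes = target_total + background_total
    total_hits = target_hits + background_hits
    denominator = comb(total_genes, target_total)
    p_value = 0.0
    # Sum hypergeometric probabilities of tables at least as extreme,
    # that is, with at least target_hits hits in the co-expressed set.
    for k in range(target_hits, min(target_total, total_hits) + 1):
        p_value += (comb(total_hits, k)
                    * comb(total_genes - total_hits, target_total - k)
                    / denominator)
    return p_value


# Hypothetical counts: 12 of 40 co-expressed genes versus 150 of 5000
# background genes contain a given TFBS pair.
p = fisher_exact_enrichment(12, 40, 150, 5000)
```

Because one such test is run for every candidate TFBS combination, the resulting p-values are naturally used to rank combinations, and the multiple-testing considerations discussed in section 1.2.1.2 apply.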
We provided community access to this CRM analysis tool through a web-based application.

Footnote 8: A homotypic cis-regulatory module (CRM) is a cluster of similar binding sites.

The alternative promoter data, made available through a recent large-scope CAGE study [24], provides unprecedented information about multiple TSS and core promoter regions of human and mouse genes. We used this information to define multiple transcription start sites for human and mouse genes, which enabled improved alignments of human-mouse orthologous gene isoforms in the oPOSSUM database. In chapter 3, I describe the implementation of a revised CSA algorithm that incorporates the evaluation of multiple putative promoter regions of a single gene to identify over-represented CRMs in a set of co-expressed genes (published in [150]). Validation of this revised approach demonstrated a significant improvement in recovery of TFBS combinations in the CRM reference collections. The demarcation of multiple promoter boundaries per gene can expand the range of the sequence regions searched, while limiting exploration to bounded partitions around an individual TSS. Chapter 3 also describes the analysis of CAGE data to establish alternative TSSs in the database and integration of the CSA website with an oPOSSUM Suite portal that provides centralized access to SSA (Single Site Analysis) tools for human-mouse, yeast, and worms.

Coordinated regulation of gene transcription relies on TF proteins binding DNA. Recent studies have also highlighted the important role that accessory TFs play through co-factor mechanisms and chromatin structure modifications. Understanding which TFs are capable of binding a TFBS and/or identifying the TFs that are expressed in a gene expression profiling analysis is an essential step in gene regulatory analyses. We were surprised to find that a well-validated inventory of mouse-human TFs was not available for this purpose.
During our efforts to assemble this resource, we encountered other researchers compiling similar lists. We combined our efforts to produce the TFCat resource for gene regulation analyses. In chapter 4, I describe the assembly of a comprehensive mouse and human TF catalog (published in [151]). TFCat is a curated catalog of mouse and human TFs. Curators assigned genes to a functional taxonomy and provided a confidence assessment for judgment classifications. All proteins linked to DNA-binding were reviewed and DNA-binding domains were mapped to a structural classification system. Sequence-based analyses were performed to predict TF-encoding genes that were not reviewed in the curation process or could not be curated due to lack of literature evidence. The TF data was made available for review and download on a wiki and web portal.

There is widespread interest in exploring the use of gene therapy to treat disease. The goal of the Pleiades Promoter Project is to create a panel of human regulatory region constructs that direct tissue- and/or cell-specific gene expression in the adult brain. The project strategy incorporates detailed analysis of gene expression profile data and computational evaluation of multi-species conserved sequences to identify putative regulatory regions neighboring gene loci that exhibit brain region-specific expression. Predicted regulatory regions are validated using an in vivo mouse transgenesis method and evaluated at an adult developmental stage (~postnatal day 56). In chapter 5, computational analyses are described for the prediction of TFs that may be driving expression of a mouse OLIG1 promoter-reporter gene construct. OLIG1 is a TF that is expressed in both oligodendrocyte progenitor cells and mature oligodendrocytes. A set of three reporter constructs, each composed of three conserved non-coding regions, was tested using in vivo mouse transgenesis; one of these constructs was actively expressed in adult brain.
A comprehensive evaluation of an in vitro oligodendrocyte expression dataset identified differentially expressed genes, including TFs, across eight time points. Sequence feature analyses were performed over conserved sequence segments associated with the active and inactive constructs. The synthesis of these two analyses resulted in a short list of TF candidates that are both differentially regulated across oligodendrocyte maturation and for which putative binding sites are uniquely present in the construct sequence segments.

A significant amount of research is focused on the elucidation of gene regulatory mechanisms to gain insights into complex biological systems and the role of gene regulation in disease phenotypes. Myelin sheath degradation is associated with human diseases, such as multiple sclerosis, schizophrenia, and leukodystrophies. It appears that for some myelin-related neuropathies, myelin protein dosage alterations may be causally associated with the disease phenotypes. Importantly, disease progression may be eased through therapies that promote remyelination, motivating study of the regulatory mechanisms controlling oligodendrocyte development and myelination.

In chapter 6, integrated experimental and computational research is presented that predicts the synergistic relationships of TFs involved in the spatio-temporal transcriptional events of co-expressed myelin genes during oligodendrocyte development. Mouse enhancer regions neighboring myelin-associated genes that direct expression in oligodendrocytes (CNS) and/or Schwann cells (PNS) were identified. Analyses of oligodendrocyte expression data were performed to define a specific set of co-expressed genes during early oligodendrocyte development across oligodendrocyte cell-stage transitions. The differentially expressed TF subset was identified through a TFCat-microarray probe mapping analysis.
A new promoter analysis approach was developed to statistically evaluate CRM predictions in the oligodendrocyte co-expression dataset. TFBS signatures present in the validated enhancers and absent in the negative enhancers were identified and used to weight CRM predictions. A regulatory feature similarity analysis was performed for CRM predictions identified in both the oligodendrocyte co-expression dataset and the validated enhancer regions to identify potential feature similarity. Analysis results were incorporated to produce an enhancer-weighted regulatory network of TFs that may be co-regulating myelin-associated gene expression during CNS myelinogenesis.

Table 1.1. Selected list of databases providing eukaryote transcription factor binding site data

| Database | Description | Reference |
|---|---|---|
| ABS | Annotated TFBS for human, mouse and rat promoters. | [152] |
| JASPAR | High quality transcription factor binding profile database. | [153] |
| OREGANNO | Open Regulatory Annotation database | [154] |
| PAZAR | A framework for collection and dissemination of cis-regulatory sequence annotation | [155] |
| TRANSFAC | Contains data on transcription factors, their experimentally validated binding sites, and regulated genes | [156] |

Table 1.2. CRM detection techniques

| # | Description | Details |
|---|---|---|
| 1 | Window-based | TFBS clusters are detected within a specific window |
| 2 | PFM and/or PWM-based | Uses position weight matrices to predict TFBS |
| 3 | De novo motif detection | Uses motif discovery method to predict TFBS |
| 4 | Conserved alignments | Requires orthologous sequences and detects TFBS in conserved species sequence alignments |
| 5 | Gene co-expression | Searches for predicted CRMs in a set of co-expressed genes |
| 6 | Provides learning based on training set | Uses training sets to guide or improve the predictions |
| 7 | Alignment of predicted TFBS | Requires or provides higher weighting to aligned predicted TFBS in multiple species |
| 8 | Model generated background | Creates a background using a model |
| 9 | TFBS score threshold | Applies a TFBS score threshold cut-off |
| 10 | Statistical evaluation method applied | Applies statistical test(s) to determine significance of CRM prediction |

Table 1.3. Selected computational human/mouse CRM detection tools/methods available before 2005

| Algorithm | Year | Description | Techniques incorporated (Table 1.2) | Reference |
|---|---|---|---|---|
| Cister | 2001 | Finds TFBS clusters using HMM modeling approach. Background is modeled using local window over natural sequences. | 2, 8, 10 | [157] |
| Comet | 2002 | Finds TFBS clusters using HMM modeling approach. Uses sliding local window null model over natural sequences. | 2, 8, 10 | [158] |
| Cluster-Buster | 2003 | Finds TFBS clusters using HMM modeling approach. Uses sliding local window null model over natural sequences. | 2, 8, 10 | [159] |
| CREME | 2003/2004 | Finds TFBS clusters within windows. Uses TFBS permutation tests for null background model. | 1, 2, 4, 5, 7, 9, 10 | [160, 161] |
| ModuleSearcher | 2003 | Finds best TFBS cluster in a set of sequences. Background model is 3rd order Markov Model learned using natural sequences. | 1, 2, 4, 5, 9, 10 | [162] |
| MSCAN | 2003 | Reports clusters of TFBS above a threshold. Background defined as base pair frequency in each window that is evaluated. | 1, 2, 5, 9, 10 | [163] |
| Gibbs Module Sampler | 2004 | Identifies clusters of motifs (inferred TFBS) in a set of sequences trained with known CRM datasets. | 1, 3, 4, 5, 6, 10 | [164] |

Figure 1.1. Nucleosome core octamer particle

DNA is wrapped around a histone octamer to form a nucleosome.

Figure source: Wikipedia, URL: http://en.wikipedia.org/wiki/Histone_H1, copyrights released to public domain.

Figure 1.2. Model for histone acetylation/deacetylation

Histone acetylation, catalyzed by Histone Acetyl Transferase (HAT) enzymes, can reduce the positive charge of histones leading to a more open chromatin structure, which is often associated with increased transcription. Deacetylation of histones (for example, on histone amino-tails), via histone deacetylases (HDACs), can inhibit transcription by producing a more condensed chromatin conformation. In this example, the Retinoic Acid Receptor (RAR) and Retinoid X Receptor (RXR) heterodimer binds to an enhancer. When the ligand, retinoic acid, is not available, the dimer interacts with nuclear corepressors: nuclear receptor corepressor and silencing mediator for retinoid and thyroid hormone receptors (NCoR/SMRT), which bind HDAC1.

Figure source: Weaver RF: Molecular Biology, 2nd edn. New York: McGraw-Hill Higher Education; 2002 [165], used with permission.

Figure 1.3. Eukaryotic gene regulatory architecture

The gene regulatory architecture of eukaryotes includes DNA elements that interact with the basal promoter complex and may include proximal and/or distal enhancers that bind TFs and/or TF protein complexes to facilitate or repress gene transcription. Regulatory DNA sequence resides in a chromatin state that can be altered to enable or inhibit TF protein-DNA interactions.

Figure source: With kind permission from Springer Science+Business Media: Die Naturwissenschaften, In silico identification of metazoan transcriptional regulatory regions, 90, 2003, 146-66, Wasserman WW, Krivan W, Figure 1.
[166]

Figure 1.4. Hprt1 mouse transgenesis system

Embryonic stem (ES) cells carry a deletion spanning the promoter and first 2 exons of the hypoxanthine phosphoribosyltransferase (Hprt1) gene. The putative enhancer sequence (for example, depicted below as a Myelin Basic Protein (MBP) regulatory sequence) and reporter gene sequence (for example, lacZ) are inserted into the targeting vector, which includes the Hprt1 homology arms required to restore the Hprt1 locus by homologous recombination.

Figure source: Bronson SK, Plaehn EG, Kluckman KD, Hagaman JR, Maeda N, Smithies O: Single-copy transgenic mice with chosen-site integration. In: Proc Natl Acad Sci USA. vol. 93; 1996: 9067-9072; Figure 5 [83]. Copyright (1996) National Academy of Sciences, U.S.A.; figure adapted by the Peterson Laboratory (McGill University), used with permission.

Figure 1.5. Modeling TF binding sites using position weight matrices

An aligned set of transcription factor binding sites can be converted into a position frequency matrix (PFM), which enumerates the frequency of each nucleotide in each column. A binding site logo graphic of the PFM can be generated. A PFM is converted to a log-scale probability representation, referred to as a position weight matrix (PWM), which is used to detect and score potential binding sites.

Figure 1.6. Myelinating glial cells: oligodendrocytes and Schwann cells

Myelinating glial cells elaborate myelin from their cell plasma membrane. Schwann cells are generated from neural crest cells and can myelinate segments of single axons in the peripheral nervous system, whereas oligodendrocytes are derived from neuroectoderm and can myelinate one or more axons in the central nervous system. Despite their unique cellular origins, these glial cells express several important myelin proteins in common.

Figure source: Used with kind permission from Editions Doin and The American Physiological Society: Pham-Dinh D. Les cellules gliales.
In: Physiologie du Neurone [167], adapted in Baumann N, Pham-Dinh D. Physiol Rev. 2001 [143].

Figure 1.7. Schwann cell lineage

In peripheral nerves, Schwann cells progress through a maturation process, which initiates with the formation of immature Schwann cells from migrating neural crest cells and culminates with a transition to mature myelinating or non-myelinating Schwann cells. Myelinating Schwann cells ensheathe a single axon. Non-myelinating Schwann cells aggregate multiple C fiber axons to form Remak bundles. Extrinsic signals, such as expression of neuregulin, determine Schwann cell fate.

Figure source: Figure reprinted by permission from Macmillan Publishers Ltd: Nature Neuroscience. 8(11): 1420-1422, copyright 2005 [114].

Figure 1.8. Oligodendrocytes can ensheathe multiple axons

Figure source: Figure produced using Servier Medical Art, used with permission for academic use.

Figure 1.9. Oligodendrocyte lineage

Oligodendrocytes transition through a multi-stage developmental process. Cell stages are characterized by the expression of specific marker proteins: i) A2B5 antigen, platelet-derived growth factor receptor-alpha (PDGFRalpha), and chondroitin sulphate proteoglycan NG2 in progenitors; ii) O4 antigen in pro-oligodendrocytes; iii) galactocerebroside (GC or O1 antigen) and 2',3'-cyclic nucleotide 3'-phosphodiesterase (CNPase) in mature oligodendrocytes; and iv) myelin proteins, such as myelin-associated glycoprotein (MAG), myelin basic protein (MBP), and proteolipid protein (PLP), in myelinating oligodendrocytes.

Figure source: Figure reprinted by permission from Macmillan Publishers Ltd: Nature Reviews. Neuroscience. 2(11): 840-843, copyright 2001 [168].

1.8. References

1. Khorasanizadeh S: The nucleosome: from genomic organization to genomic regulation. Cell 2004, 116(2):259-272.
2. Wolffe AP, Guschin D: Review: chromatin structural features and targets that regulate transcription. Journal of structural biology 2000, 129(2-3):102-122.
3. Wood WI, Felsenfeld G: Chromatin structure of the chicken beta-globin gene region. Sensitivity to DNase I, micrococcal nuclease, and DNase II. Journal of Biological Chemistry 1982, 257(13):7730-7736.
4. Jin C, Felsenfeld G: Nucleosome stability mediated by histone variants H3.3 and H2A.Z. Genes Dev 2007, 21(12):1519-1529.
5. Ozsolak F, Song JS, Liu XS, Fisher DE: High-throughput mapping of the chromatin structure of human promoters. Nature biotechnology 2007, 25(2):244-248.
6. Fu Y, Sinha M, Peterson CL, Weng Z: The insulator binding protein CTCF positions 20 nucleosomes around its binding sites across the human genome. PLoS Genet 2008, 4(7):e1000138.
7. Fedor MJ, Lue NF, Kornberg RD: Statistical positioning of nucleosomes by specific protein-binding to an upstream activating sequence in yeast. J Mol Biol 1988, 204(1):109-127.
8. Jones PA, Takai D: The role of DNA methylation in mammalian epigenetics. Science 2001, 293(5532):1068-1070.
9. Ng HH, Bird A: DNA methylation and chromatin modification. Current opinion in genetics & development 1999, 9(2):158-163.
10. Illingworth RS, Bird AP: CpG islands--'a rough guide'. FEBS Lett 2009, 583(11):1713-1720.
11. Hendrich B, Tweedie S: The methyl-CpG binding domain and the evolving role of DNA methylation in animals. Trends in genetics : TIG 2003, 19(5):269-277.
12. Brown CE, Lechner T, Howe L, Workman JL: The many HATs of transcription coactivators. Trends in biochemical sciences 2000, 25(1):15-19.
13. Liu CL, Kaplan T, Kim M, Buratowski S, Schreiber SL, Friedman N, Rando OJ: Single-nucleosome mapping of histone modifications in S. cerevisiae. PLoS biology 2005, 3(10):e328.
14. Roh TY, Wei G, Farrell CM, Zhao K: Genome-wide prediction of conserved and nonconserved enhancers by histone acetylation patterns. Genome Res 2007, 17(1):74-81.
15. Martens JH, O'Sullivan RJ, Braunschweig U, Opravil S, Radolf M, Steinlein P, Jenuwein T: The profile of repeat-associated histone lysine methylation states in the mouse epigenome. The EMBO journal 2005, 24(4):800-812.
16. Barski A, Cuddapah S, Cui K, Roh TY, Schones DE, Wang Z, Wei G, Chepelev I, Zhao K: High-resolution profiling of histone methylations in the human genome. Cell 2007, 129(4):823-837.
17. Heintzman ND, Stuart RK, Hon G, Fu Y, Ching CW, Hawkins RD, Barrera LO, Van Calcar S, Qu C, Ching KA, Wang W, Weng Z, Green RD, Crawford GE, Ren B: Distinct and predictive chromatin signatures of transcriptional promoters and enhancers in the human genome. Nat Genet 2007, 39(3):311-318.
18. Strahl BD, Allis CD: The language of covalent histone modifications. Nature 2000, 403(6765):41-45.
19. Zhang Y: Transcriptional regulation by histone ubiquitination and deubiquitination. Genes & development 2003, 17(22):2733-2740.
20. Nowak SJ, Corces VG: Phosphorylation of histone H3: a balancing act between chromosome condensation and transcriptional activation. Trends Genet 2004, 20(4):214-220.
21. Heard E, Disteche CM: Dosage compensation in mammals: fine-tuning the expression of the X chromosome. Genes Dev 2006, 20(14):1848-1867.
22. Smale S, Kadonaga J: The RNA polymerase II core promoter. In: Annu Rev Biochem. vol. 72; 2003: 449-479.
23. Ayoubi TA, Van De Ven WJ: Regulation of gene expression by alternative promoters. Faseb J 1996, 10(4):453-460.
24. Carninci P, Sandelin A, Lenhard B, Katayama S, Shimokawa K, Ponjavic J, Semple CA, Taylor MS, Engstrom PG, Frith MC, Forrest AR, Alkema WB, Tan SL, Plessy C, Kodzius R, Ravasi T, Kasukawa T, Fukuda S, Kanamori-Katayama M, Kitazume Y, Kawaji H, Kai C, Nakamura M, Konno H, Nakano K, Mottagui-Tabar S, Arner P, Chesi A, Gustincich S, Persichetti F et al: Genome-wide analysis of mammalian promoter architecture and evolution. Nat Genet 2006, 38(6):626-635.
25. Merika M, Thanos D: Enhanceosomes. Current opinion in genetics & development 2001, 11(2):205-208.
26. Kamakaka RT: Silencers and locus control regions: opposite sides of the same coin. Trends in biochemical sciences 1997, 22(4):124-128.
27. West AG, Gaszner M, Felsenfeld G: Insulators: many functions, many mechanisms. Genes & development 2002, 16(3):271-288.
28. Visel A, Akiyama JA, Shoukry M, Afzal V, Rubin EM, Pennacchio LA: Functional autonomy of distant-acting human enhancers. Genomics 2009, 93(6):509-513.
29. Garvie CW, Wolberger C: Recognition of specific DNA sequences. Molecular cell 2001, 8(5):937-946.
30. Halford SE, Marko JF: How do site-specific DNA-binding proteins find their targets? Nucleic acids research 2004, 32(10):3040-3052.
31. Spiegelman BM, Heinrich R: Biological control through regulated transcriptional coactivators. Cell 2004, 119(2):157-167.
32. Triezenberg SJ: Structure and function of transcriptional activation domains. Curr Opin Genet Dev 1995, 5(2):190-196.
33. Luo Y, Ge H, Stevens S, Xiao H, Roeder RG: Coactivation by OCA-B: definition of critical regions and synergism with general cofactors. Molecular and cellular biology 1998, 18(7):3803-3810.
34. Badis G, Berger MF, Philippakis AA, Talukder S, Gehrke AR, Jaeger SA, Chan ET, Metzler G, Vedenko A, Chen X, Kuznetsov H, Wang CF, Coburn D, Newburger DE, Morris Q, Hughes TR, Bulyk ML: Diversity and complexity in DNA recognition by transcription factors. In: Science. vol. 324; 2009: 1720-1723.
35. L'Honore A, Lamb NJ, Vandromme M, Turowski P, Carnac G, Fernandez A: MyoD distal regulatory region contains an SRF binding CArG element required for MyoD expression in skeletal myoblasts and during muscle regeneration. Molecular biology of the cell 2003, 14(5):2151-2162.
36. Dutta S, Burkhardt K, Young J, Swaminathan GJ, Matsuura T, Henrick K, Nakamura H, Berman HM: Data deposition and annotation at the worldwide protein data bank. Molecular biotechnology 2009, 42(1):1-13.
37. Luscombe NM, Austin SE, Berman HM, Thornton JM: An overview of the structures of protein-DNA complexes. Genome biology 2000, 1(1):REVIEWS001.
38. Bateman A, Birney E, Durbin R, Eddy SR, Finn RD, Sonnhammer EL: Pfam 3.1: 1313 multiple alignments and profile HMMs match the majority of proteins. Nucleic acids research 1999, 27(1):260-262.
39. Gough J: The SUPERFAMILY database in structural genomics. Acta Crystallographica Section D, Biological Crystallography 2002, 58(Pt 11):1897-1900.
40. Murzin AG, Brenner SE, Hubbard T, Chothia C: SCOP: a structural classification of proteins database for the investigation of sequences and structures. Journal of Molecular Biology 1995, 247(4):536-540.
41. Orengo CA, Michie AD, Jones S, Jones DT, Swindells MB, Thornton JM: CATH--a hierarchic classification of protein domain structures. Structure (London, England : 1993) 1997, 5(8):1093-1108.
42. Kummerfeld SK, Teichmann SA: DBD: a transcription factor prediction database. Nucleic acids research 2006, 34(Database issue):D74-81.
43. Boube M, Joulia L, Cribbs DL, Bourbon HM: Evidence for a mediator of RNA polymerase II transcriptional regulation conserved from yeast to man. Cell 2002, 110(2):143-151.
44. Li XY, Virbasius A, Zhu X, Green MR: Enhancement of TBP binding by activators and general transcription factors. Nature 1999, 399(6736):605-609.
45. Hanna-Rose W, Hansen U: Active repression mechanisms of eukaryotic transcription repressors. Trends in genetics : TIG 1996, 12(6):229-234.
46. Adams CC, Workman JL: Binding of disparate transcriptional activators to nucleosomal DNA is inherently cooperative. Molecular and cellular biology 1995, 15(3):1405-1421.
47. Bucceri A, Kapitza K, Thoma F: Rapid accessibility of nucleosomal DNA in yeast on a second time scale. The EMBO journal 2006, 25(13):3123-3132.
48. Anderson JD, Thastrom A, Widom J: Spontaneous access of proteins to buried nucleosomal DNA target sites occurs via a mechanism that is distinct from nucleosome translocation. Molecular and cellular biology 2002, 22(20):7147-7157.
49. Lee CK, Shibata Y, Rao B, Strahl BD, Lieb JD: Evidence for nucleosome depletion at active regulatory regions genome-wide. Nat Genet 2004, 36(8):900-905.
50. Li H, Lu Y, Smith HK, Richardson WD: Olig1 and Sox10 interact synergistically to drive myelin basic protein transcription in oligodendrocytes. J Neurosci 2007, 27(52):14375-14382.
51. Morin S, Pozzulo G, Robitaille L, Cross J, Nemer M: MEF2-dependent recruitment of the HAND1 transcription factor results in synergistic activation of target promoters. J Biol Chem 2005, 280(37):32272-32278.
52. Panne D, Maniatis T, Harrison SC: An atomic model of the interferon-beta enhanceosome. Cell 2007, 129(6):1111-1123.
53. Pettersson M, Schaffner W: Synergistic activation of transcription by multiple binding sites for NF-kappa B even in absence of co-operative factor binding to DNA. J Mol Biol 1990, 214(2):373-380.
54. Greenbaum D, Colangelo C, Williams K, Gerstein M: Comparing protein abundance and mRNA expression levels on a genomic scale. Genome Biol 2003, 4(9):117.
55. Schena M, Shalon D, Davis RW, Brown PO: Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science 1995, 270(5235):467-470.
56. Pease AC, Solas D, Sullivan EJ, Cronin MT, Holmes CP, Fodor SP: Light-generated oligonucleotide arrays for rapid DNA sequence analysis. Proc Natl Acad Sci U S A 1994, 91(11):5022-5026.
57. Lockhart DJ, Dong H, Byrne MC, Follettie MT, Gallo MV, Chee MS, Mittmann M, Wang C, Kobayashi M, Horton H, Brown EL: Expression monitoring by hybridization to high-density oligonucleotide arrays. Nature biotechnology 1996, 14(13):1675-1680.
58. Velculescu VE, Zhang L, Vogelstein B, Kinzler KW: Serial analysis of gene expression. Science 1995, 270(5235):484-487.
59. Wang Z, Gerstein M, Snyder M: RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet 2009, 10(1):57-63.
60. Harsha HC, Kandasamy K, Ranganathan P, Rani S, Ramabadran S, Gollapudi S, Balakrishnan L, Dwivedi SB, Telikicherla D, Selvan LD, Goel R, Mathivanan S, Marimuthu A, Kashyap M, Vizza RF, Mayer RJ, Decaprio JA, Srivastava S, Hanash SM, Hruban RH, Pandey A: A compendium of potential biomarkers of pancreatic cancer. PLoS medicine 2009, 6(4):e1000046.
61. Gray PA, Fu H, Luo P, Zhao Q, Yu J, Ferrari A, Tenzen T, Yuk DI, Tsung EF, Cai Z, Alberta JA, Cheng LP, Liu Y, Stenman JM, Valerius MT, Billings N, Kim HA, Greenberg ME, McMahon AP, Rowitch DH, Stiles CD, Ma Q: Mouse brain organization revealed through direct genome-scale TF expression analysis. Science (New York, NY) 2004, 306(5705):2255-2257.
62. Cahoy JD, Emery B, Kaushal A, Foo LC, Zamanian JL, Christopherson KS, Xing Y, Lubischer JL, Krieg PA, Krupenko SA, Thompson WJ, Barres BA: A transcriptome database for astrocytes, neurons, and oligodendrocytes: a new resource for understanding brain development and function. J Neurosci 2008, 28(1):264-278.
63. Affara NA: Resource and hardware options for microarray-based experimentation. Briefings in functional genomics & proteomics 2003, 2(1):7-20.
64. Irizarry RA, Hobbs B, Collin F, Beazer-Barclay YD, Antonellis KJ, Scherf U, Speed TP: Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics (Oxford, England) 2003, 4(2):249-264.
65. Bolstad BM, Irizarry RA, Astrand M, Speed TP: A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics 2003, 19(2):185-193.
66. Steinhoff C, Vingron M: Normalization and quantification of differential expression in gene expression microarrays. Briefings in bioinformatics 2006, 7(2):166-177.
67. Simon R: Microarray-based expression profiling and informatics. Current opinion in biotechnology 2008, 19(1):26-29.
68. Wright GW, Simon RM: A random variance model for detection of differential gene expression in small microarray experiments. Bioinformatics 2003, 19(18):2448-2455.
69. Benjamini Y, Hochberg Y: Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society 1995, 57(1):289-300.
70. The ENCODE (ENCyclopedia Of DNA Elements) Project. Science 2004, 306(5696):636-640.
71. Yuh CH, Bolouri H, Davidson EH: Cis-regulatory logic in the endo16 gene: switching from a specification to a differentiation mode of control. Development 2001, 128(5):617-629.
72. Levine M: A systems view of Drosophila segmentation. Genome Biol 2008, 9(2):207.
73. Weaver RF: Molecular biology, 4th edn. New York: McGraw-Hill Higher Education; 2008.
74. Kerr LD: Electrophoretic mobility shift assay. Methods in enzymology 1995, 254:619-632.
75. Ren B, Robert F, Wyrick JJ, Aparicio O, Jennings EG, Simon I, Zeitlinger J, Schreiber J, Hannett N, Kanin E, Volkert TL, Wilson CJ, Bell SP, Young RA: Genome-wide location and function of DNA binding proteins. Science 2000, 290(5500):2306-2309.
76. Valouev A, Johnson DS, Sundquist A, Medina C, Anton E, Batzoglou S, Myers RM, Sidow A: Genome-wide analysis of transcription factor binding sites based on ChIP-Seq data. Nature methods 2008, 5(9):829-834.
77. Johnson DS, Mortazavi A, Myers RM, Wold B: Genome-wide mapping of in vivo protein-DNA interactions. Science 2007, 316(5830):1497-1502.
78. Tanay A: Extensive low-affinity transcriptional interactions in the yeast genome. In: Genome Res. vol. 16; 2006: 962-972.
79. Kalionis B, O'Farrell PH: A universal target sequence is bound in vitro by diverse homeodomains. Mechanisms of development 1993, 43(1):57-70.
80. Blackwell TK: Selection of protein binding sites from random nucleic acid sequences. Methods in enzymology 1995, 254:604-618.
81. Berger MF, Bulyk ML: Protein binding microarrays (PBMs) for rapid, high-throughput characterization of the sequence specificities of DNA binding proteins. Methods in molecular biology (Clifton, NJ) 2006, 338:245-260.
82. Naylor LH: Reporter gene technology: the future looks bright. Biochemical pharmacology 1999, 58(5):749-757.
83. Bronson SK, Plaehn EG, Kluckman KD, Hagaman JR, Maeda N, Smithies O: Single-copy transgenic mice with chosen-site integration. Proceedings of the National Academy of Sciences of the United States of America 1996, 93(17):9067-9072.
84. Zambrowicz BP, Imamoto A, Fiering S, Herzenberg LA, Kerr WG, Soriano P: Disruption of overlapping transcripts in the ROSA beta geo 26 gene trap strain leads to widespread expression of beta-galactosidase in mouse embryos and hematopoietic cells. Proc Natl Acad Sci U S A 1997, 94(8):3789-3794.
85. Yang GS, Banks KG, Bonaguro RJ, Wilson G, Dreolini L, de Leeuw CN, Liu L, Swanson DJ, Goldowitz D, Holt RA, Simpson EM: Next generation tools for high-throughput promoter and expression analysis employing single-copy knock-ins at the Hprt1 locus. Genomics 2009, 93(3):196-204.
86. Bray N, Pachter L: MAVID: constrained ancestral alignment of multiple sequences. Genome Res 2004, 14(4):693-699.
87. Brudno M, Malde S, Poliakov A, Do CB, Couronne O, Dubchak I, Batzoglou S: Glocal alignment: finding rearrangements during alignment. Bioinformatics 2003, 19 Suppl 1:i54-62.
88. Schwartz S, Elnitski L, Li M, Weirauch M, Riemer C, Smit A, Green ED, Hardison RC, Miller W: MultiPipMaker and supporting tools: Alignments and analysis of multiple genomic DNA sequences. Nucleic Acids Res 2003, 31(13):3518-3524.
89. Blanchette M, Kent WJ, Riemer C, Elnitski L, Smit AF, Roskin KM, Baertsch R, Rosenbloom K, Clawson H, Green ED, Haussler D, Miller W: Aligning multiple genomic sequences with the threaded blockset aligner. Genome Res 2004, 14(4):708-715.
90. Miller W, Rosenbloom K, Hardison RC, Hou M, Taylor J, Raney B, Burhans R, King DC, Baertsch R, Blankenberg D, Kosakovsky Pond SL, Nekrutenko A, Giardine B, Harris RS, Tyekucheva S, Diekhans M, Pringle TH, Murphy WJ, Lesk A, Weinstock GM, Lindblad-Toh K, Gibbs RA, Lander ES, Siepel A, Haussler D, Kent WJ: 28-way vertebrate alignment and conservation track in the UCSC Genome Browser. Genome Res 2007, 17(12):1797-1808.
91. Kuhn RM, Karolchik D, Zweig AS, Wang T, Smith KE, Rosenbloom KR, Rhead B, Raney BJ, Pohl A, Pheasant M, Meyer L, Hsu F, Hinrichs AS, Harte RA, Giardine B, Fujita P, Diekhans M, Dreszer T, Clawson H, Barber GP, Haussler D, Kent WJ: The UCSC Genome Browser Database: update 2009. Nucleic Acids Res 2009, 37(Database issue):D755-761.
92. Pennacchio LA, Ahituv N, Moses AM, Prabhakar S, Nobrega MA, Shoukry M, Minovitsky S, Dubchak I, Holt A, Lewis KD, Plajzer-Frick I, Akiyama J, De Val S, Afzal V, Black BL, Couronne O, Eisen MB, Visel A, Rubin EM: In vivo enhancer analysis of human conserved non-coding sequences. Nature 2006, 444(7118):499-502.
93. Denarier E, Forghani R, Farhadi HF, Dib S, Dionne N, Friedman HC, Lepage P, Hudson TJ, Drouin R, Peterson A: Functional organization of a Schwann cell enhancer. J Neurosci 2005, 25(48):11210-11217.
94. Tuason MC, Rastikerdar A, Kuhlmann T, Goujet-Zalc C, Zalc B, Dib S, Friedman H, Peterson A: Separate proteolipid protein/DM20 enhancers serve different lineages and stages of development. J Neurosci 2008, 28(27):6895-6903.
95. Nobrega MA, Ovcharenko I, Afzal V, Rubin EM: Scanning human gene deserts for long-range enhancers. Science 2003, 302(5644):413.
96. Woolfe A, Goodson M, Goode DK, Snell P, McEwen GK, Vavouri T, Smith SF, North P, Callaway H, Kelly K, Walter K, Abnizova I, Gilks W, Edwards YJ, Cooke JE, Elgar G: Highly conserved non-coding sequences are associated with vertebrate development. PLoS biology 2005, 3(1):e7.
97. Lenhard B, Sandelin A, Mendoza L, Engstrom P, Jareborg N, Wasserman WW: Identification of conserved regulatory elements by comparative genome analysis. Journal of biology 2003, 2(2):13.
98. Martin N, Patel S, Segre JA: Long-range comparison of human and mouse Sprr loci to identify conserved noncoding sequences involved in coordinate regulation. Genome Res 2004, 14(12):2430-2438.
99. Emberly E, Rajewsky N, Siggia ED: Conservation of regulatory elements between two species of Drosophila. BMC bioinformatics 2003, 4:57.
100. McGaughey DM, Vinton RM, Huynh J, Al-Saif A, Beer MA, McCallion AS: Metrics of sequence constraint overlook regulatory sequences in an exhaustive analysis at phox2b. Genome Res 2008, 18(2):252-260.
101. Odom D, Dowell R, Jacobsen E, Gordon W, Danford T, Macisaac K, Rolfe P, Conboy C, Gifford D, Fraenkel E: Tissue-specific transcriptional regulation has diverged significantly between human and mouse. In: Nat Genet. vol. 39; 2007: 730-732.
102. Stormo GD: DNA binding sites: representation and discovery. Bioinformatics 2000, 16(1):16-23.
103. King OD, Roth FP: A non-parametric model for transcription factor binding sites. Nucleic Acids Res 2003, 31(19):e116.
104. Bailey TL, Elkan C: Fitting a mixture model by expectation maximization to discover motifs in biopolymers. Proceedings / International Conference on Intelligent Systems for Molecular Biology ; ISMB 1994, 2:28-36.
105. Lawrence CE, Reilly AA: An expectation maximization (EM) algorithm for the identification and characterization of common sites in unaligned biopolymer sequences. Proteins 1990, 7(1):41-51.
106. Lawrence CE, Altschul SF, Boguski MS, Liu JS, Neuwald AF, Wootton JC: Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. Science 1993, 262(5131):208-214.
107. Sandve GK, Abul O, Walseng V, Drablos F: Improved benchmarks for computational motif discovery. BMC bioinformatics 2007, 8:193.
108. Tompa M, Li N, Bailey TL, Church GM, De Moor B, Eskin E, Favorov AV, Frith MC, Fu Y, Kent WJ, Makeev VJ, Mironov AA, Noble WS, Pavesi G, Pesole G, Regnier M, Simonis N, Sinha S, Thijs G, van Helden J, Vandenbogaert M, Weng Z, Workman C, Ye C, Zhu Z: Assessing computational tools for the discovery of transcription factor binding sites. Nature biotechnology 2005, 23(1):137-144.
109. D'Haeseleer P: How does DNA sequence motif discovery work? Nature biotechnology 2006, 24(8):959-961.
110. Abeel T, Van de Peer Y, Saeys Y: Toward a gold standard for promoter prediction evaluation. Bioinformatics 2009, 25(12):i313-320.
111. Narlikar L, Ovcharenko I: Identifying regulatory elements in eukaryotic genomes. In: Briefings in Functional Genomics and Proteomics. 2009: 1-16.
112. Jahn O, Tenzer S, Werner HB: Myelin proteomics: molecular anatomy of an insulating sheath. Molecular neurobiology 2009, 40(1):55-72.
113. Jessen KR, Mirsky R: Schwann cells and their precursors emerge as major regulators of nerve development. Trends in neurosciences 1999, 22(9):402-410.
114. Nave KA, Schwab MH: Glial cells under remote control. Nature neuroscience 2005, 8(11):1420-1422.
115. Svaren J, Meijer D: The molecular machinery of myelin gene transcription in Schwann cells. Glia 2008, 56(14):1541-1551.
116. Meyer zu Horste G, Prukop T, Nave KA, Sereda MW: Myelin disorders: Causes and perspectives of Charcot-Marie-Tooth neuropathy. J Mol Neurosci 2006, 28(1):77-88.
117. Wrabetz L, D'Antonio M, Pennuto M, Dati G, Tinelli E, Fratta P, Previtali S, Imperiale D, Zielasek J, Toyka K, Avila RL, Kirschner DA, Messing A, Feltri ML, Quattrini A: Different intracellular pathomechanisms produce diverse Myelin Protein Zero neuropathies in transgenic mice. J Neurosci 2006, 26(8):2358-2368.
118. Kessaris N, Fogarty M, Iannarelli P, Grist M, Wegner M, Richardson WD: Competing waves of oligodendrocytes in the forebrain and postnatal elimination of an embryonic lineage. Nature neuroscience 2006, 9(2):173-179.
119. Nicolay DJ, Doucette JR, Nazarali AJ: Transcriptional control of oligodendrogenesis. Glia 2007, 55(13):1287-1299.
120. Lu QR, Yuk D, Alberta JA, Zhu Z, Pawlitzky I, Chan J, McMahon AP, Stiles CD, Rowitch DH: Sonic hedgehog--regulated oligodendrocyte lineage genes encoding bHLH proteins in the mammalian central nervous system. Neuron 2000, 25(2):317-329.
121. Zhou Q, Anderson DJ: The bHLH transcription factors OLIG2 and OLIG1 couple neuronal and glial subtype specification. Cell 2002, 109(1):61-73.
122. Fu H, Qi Y, Tan M, Cai J, Takebayashi H, Nakafuku M, Richardson W, Qiu M: Dual origin of spinal oligodendrocyte progenitors and evidence for the cooperative role of Olig2 and Nkx2.2 in the control of oligodendrocyte differentiation. Development 2002, 129(3):681-693.
123. Xin M, Yue T, Ma Z, Wu FF, Gow A, Lu QR: Myelinogenesis and axonal recognition by oligodendrocytes in brain are uncoupled in Olig1-null mice. J Neurosci 2005, 25(6):1354-1365.
124. Samanta J, Kessler JA: Interactions between ID and OLIG proteins mediate the inhibitory effects of BMP4 on oligodendroglial differentiation. Development 2004, 131(17):4131-4142.
125. Kondo T, Raff M: Basic helix-loop-helix proteins and the timing of oligodendrocyte differentiation. Development 2000, 127(14):2989-2998.
126. Qi Y, Cai J, Wu Y, Wu R, Lee J, Fu H, Rao M, Sussel L, Rubenstein J, Qiu M: Control of oligodendrocyte differentiation by the Nkx2.2 homeodomain transcription factor. Development 2001, 128(14):2723-2733.
127. Sun T, Dong H, Wu L, Kane M, Rowitch DH, Stiles CD: Cross-repressive interaction of the Olig2 and Nkx2.2 transcription factors in developing neural tube associated with formation of a specific physical complex. J Neurosci 2003, 23(29):9547-9556.
128.
Liu R, Cai J, Hu X, Tan M, Qi Y, German M, Rubenstein J, Sander M, Qiu M: Region-specific and stage-dependent regulation of Olig gene expression and oligodendrogenesis by Nkx6.1 homeodomain transcription factor. Development 2003, 130(25):6221-6231. 129. Southwood C, He C, Garbern J, Kamholz J, Arroyo E, Gow A: CNS myelin paranodes require Nkx6-2 homeoprotein transcriptional activity for normal structure. J Neurosci 2004, 24(50):11215-11225. 130. Stolt CC, Lommes P, Sock E, Chaboissier MC, Schedl A, Wegner M: The Sox9 transcription factor determines glial fate choice in the developing spinal cord. Genes Dev 2003, 17(13):1677-1689. 131. Stolt CC, Lommes P, Friedrich RP, Wegner M: Transcription factors Sox8 and Sox10 perform non-equivalent roles during oligodendrocyte development despite functional redundancy. Development 2004, 131(10):2349-2358. 132. Stolt CC, Schlierf A, Lommes P, Hillgartner S, Werner T, Kosian T, Sock E, Kessaris N, Richardson WD, Lefebvre V, Wegner M: SoxD proteins influence multiple stages of oligodendrocyte development and modulate SoxE protein function. Dev Cell 2006, 11(5):697-709. 133. Gokhan S, Marin-Husstege M, Yung SY, Fontanez D, Casaccia-Bonnefil P, Mehler MF: Combinatorial profiles of oligodendrocyte-selective classes of transcriptional regulators differentially modulate myelin basic protein gene expression. J Neurosci 2005, 25(36):8311-8321. 134. Sugimori M, Nagao M, Bertrand N, Parras CM, Guillemot F, Nakafuku M: Combinatorial actions of patterning and HLH transcription factors in the spatiotemporal control of neurogenesis and gliogenesis in the developing spinal cord. Development 2007, 134(8):1617-1629. 135. Petryniak MA, Potter GB, Rowitch DH, Rubenstein JL: Dlx1 and Dlx2 control neuronal versus oligodendroglial cell fate acquisition in the developing forebrain. Neuron 2007, 55(3):417-433. 136. 
Liu Z, Hu X, Cai J, Liu B, Peng X, Wegner M, Qiu M: Induction of oligodendrocyte differentiation by Olig2 and Sox10: evidence for reciprocal interactions and dosage-dependent mechanisms. Dev Biol 2007, 302(2):683- 693. 137. Tekki-Kessaris N, Woodruff R, Hall AC, Gaffield W, Kimura S, Stiles CD, Rowitch DH, Richardson WD: Hedgehog-dependent oligodendrocyte lineage specification in the telencephalon. Development 2001, 128(13):2545-2554. 138. Nery S, Wichterle H, Fishell G: Sonic hedgehog contributes to oligodendrocyte specification in the mammalian forebrain. Development 2001, 128(4):527-540. 139. Spassky N, Heydon K, Mangatal A, Jankovski A, Olivier C, Queraud-Lesaux F, Goujet-Zalc C, Thomas JL, Zalc B: Sonic hedgehog-dependent emergence of oligodendrocytes in the telencephalon: evidence for a source of oligodendrocytes in the olfactory bulb that is independent of PDGFRalpha signaling. Development 2001, 128(24):4993-5004. 140. Jakovcevski I, Zecevic N: Olig transcription factors are expressed in oligodendrocyte and neuronal cells in human fetal CNS. J Neurosci 2005, 25(44):10064-10073.     57 141. Takebayashi H, Nabeshima Y, Yoshida S, Chisaka O, Ikenaka K: The basic helix-loop-helix factor olig2 is essential for the development of motoneuron and oligodendrocyte lineages. Curr Biol 2002, 12(13):1157-1163. 142. Arnett HA, Fancy SP, Alberta JA, Zhao C, Plant SR, Kaing S, Raine CS, Rowitch DH, Franklin RJ, Stiles CD: bHLH transcription factor Olig1 is required to repair demyelinated lesions in the CNS. Science 2004, 306(5704):2111-2115. 143. Baumann N, Pham-Dinh D: Biology of oligodendrocyte and myelin in the mammalian central nervous system. Physiological Reviews 2001, 81(2):871- 927. 144. Readhead C, Hood L: The dysmyelinating mouse mutations shiverer (shi) and myelin deficient (shimld). Behavior genetics 1990, 20(2):213-234. 145. 
Farhadi HF, Lepage P, Forghani R, Friedman HC, Orfali W, Jasmin L, Miller W, Hudson TJ, Peterson AC: A combinatorial network of evolutionarily conserved myelin basic protein regulatory sequences confers distinct glial- specific phenotypes. The Journal of neuroscience : the official journal of the Society for Neuroscience 2003, 23(32):10214-10223. 146. Dionne N: Structure and function of Module 3, a conserved enhancer of the myelin basic protein gene. Montreal, Quebec: McGill University; 2006. 147. Dib S: Functional Analysis Of The Myelin Basic Protein Gene Regulation. Montreal, Quebec: McGill University; 2008. 148. Huang SS, Fulton DL, Arenillas DJ, Perco P, Ho Sui SJ, Mortimer JR, Wasserman WW: Identification of over-represented combinations of transcription factor binding sites in sets of co-expressed genes. In: Series on Advances in Bioinformatics and Computational Biology Volume 3 - Proceedings of the 4th Asia-Pacific Bioinformatics Conference: 2006; Taipei, Taiwan: Imperial College Press, London UK; 2006: 247- 256. 149. Ho Sui SJ, Mortimer JR, Arenillas DJ, Brumm J, Walsh CJ, Kennedy BP, Wasserman WW: oPOSSUM: identification of over-represented transcription factor binding sites in co-expressed genes. Nucleic Acids Res 2005, 33(10):3154-3164. 150. Ho Sui SJ, Fulton DL, Arenillas DJ, Kwon AT, Wasserman WW: oPOSSUM: integrated tools for analysis of regulatory motif over-representation. Nucleic Acids Res 2007, 35(Web Server issue):W245-252. 151. Fulton DL, Sundararajan S, Badis G, Hughes TR, Wasserman WW, Roach JC, Sladek R: TFCat: the curated catalog of mouse and human transcription factors. Genome Biol 2009, 10(3):R29. 152. Blanco E, Farre D, Alba MM, Messeguer X, Guigo R: ABS: a database of Annotated regulatory Binding Sites from orthologous promoters. Nucleic Acids Res 2006, 34(Database issue):D63-67. 153. Sandelin A, Alkema W, Engstrom P, Wasserman WW, Lenhard B: JASPAR: an open-access database for eukaryotic transcription factor binding profiles. 
Nucleic acids research 2004, 32(Database issue):D91-94. 154. Griffith OL, Montgomery SB, Bernier B, Chu B, Kasaian K, Aerts S, Mahony S, Sleumer MC, Bilenky M, Haeussler M, Griffith M, Gallo SM, Giardine B, Hooghe B, Van Loo P, Blanco E, Ticoll A, Lithwick S, Portales-Casamar E, Donaldson IJ, Robertson G, Wadelius C, De Bleser P, Vlieghe D, Halfon MS, Wasserman W, Hardison R, Bergman CM, Jones SJ: ORegAnno: an open-     58 access community-driven resource for regulatory annotation. Nucleic Acids Res 2008, 36(Database issue):D107-113. 155. Portales-Casamar E, Kirov S, Lim J, Lithwick S, Swanson MI, Ticoll A, Snoddy J, Wasserman WW: PAZAR: a framework for collection and dissemination of cis-regulatory sequence annotation. Genome Biol 2007, 8(10):R207. 156. Matys V, Fricke E, Geffers R, Gossling E, Haubrock M, Hehl R, Hornischer K, Karas D, Kel AE, Kel-Margoulis OV, Kloos DU, Land S, Lewicki-Potapov B, Michael H, Munch R, Reuter I, Rotert S, Saxel H, Scheer M, Thiele S, Wingender E: TRANSFAC: transcriptional regulation, from patterns to profiles. Nucleic Acids Res 2003, 31(1):374-378. 157. Frith MC, Hansen U, Weng Z: Detection of cis-element clusters in higher eukaryotic DNA. Bioinformatics 2001, 17(10):878-889. 158. Frith MC, Spouge JL, Hansen U, Weng Z: Statistical significance of clusters of motifs represented by position specific scoring matrices in nucleotide sequences. Nucleic Acids Res 2002, 30(14):3214-3224. 159. Frith MC, Li MC, Weng Z: Cluster-Buster: Finding dense clusters of motifs in DNA sequences. Nucleic Acids Res 2003, 31(13):3666-3668. 160. Sharan R, Ben-Hur A, Loots GG, Ovcharenko I: CREME: Cis-Regulatory Module Explorer for the human genome. Nucleic Acids Res 2004, 32(Web Server issue):W253-256. 161. Sharan R, Ovcharenko I, Ben-Hur A, Karp RM: CREME: a framework for identifying cis-regulatory modules in human-mouse conserved segments. Bioinformatics 2003, 19 Suppl 1:i283-291. 162. 
Aerts S, Van Loo P, Thijs G, Moreau Y, De Moor B: Computational detection of cis -regulatory modules. Bioinformatics 2003, 19 Suppl 2:ii5-14. 163. Alkema WB, Johansson O, Lagergren J, Wasserman WW: MSCAN: identification of functional clusters of transcription factor binding sites. Nucleic Acids Res 2004, 32(Web Server issue):W195-198. 164. Thompson W, Palumbo MJ, Wasserman WW, Liu JS, Lawrence CE: Decoding human regulatory circuits. Genome Res 2004, 14(10A):1967-1974. 165. Weaver RF: Molecular Biology, 2nd edn. New York: McGraw-Hill Higher Education; 2002. 166. Wasserman WW, Krivan W: In silico identification of metazoan transcriptional regulatory regions. Die Naturwissenschaften 2003, 90(4):156- 166. 167. Pham-Dinh D: Les cellules gliales. In: Physiologie du Neurone. Edited by Tritsch D, Chesnoy-Marchais D, Feltz A. France: Initiatives Santé; 1998: 31–90. 168. Zhang SC: Defining glial cells during CNS development. Nature reviews 2001, 2(11):840-843.       59  2. Identification of Over-represented Combinations of Transcription Factor Binding Sites in Sets of Co-expressed Genes9   2.1. Chapter preamble  A eukaryote gene’s regulatory program often involves coordinate binding of multiple transcription factors (TFs) that enable and/or inhibit gene transcription.  This chapter describes the development of an algorithm, called Combination Site Analysis (CSA), which identifies over-represented combinations of TF binding sites (cis- regulatory modules - CRMs) in the promoter regions of co-expressed genes to infer TF cooperative control. Validation of the algorithm with reference collections of co- regulated genes demonstrated its ability to identify known CRMs. This tool was made available to the community through a web-based interface.  2.2. Introduction  The interaction between transcription factor (TF) proteins and transcription factor binding sites (TFBS) is an important mechanism in regulating gene expression. 
Each cell in the human body expresses genes in response to its developmental state (e.g. tissue type), external signals from neighboring cells, and environmental stimuli (e.g. stress, nutrients). Diverse regulatory mechanisms have evolved to facilitate the programming of gene expression, with a primary mechanism being TF-mediated modulation of the rate of transcript initiation. Given a finite collection of protein structures capable of binding to specific DNA sequences and the diversity of conditions to which cells must respond, it is logical and well-documented that combinatorial interplay between TFs drives much of the observed specificity of gene expression. The arrays of TFBS at which the interactions occur are often termed cis-regulatory modules (CRM) [1].

Footnote 9: A version of this chapter has been published. Huang SS*, Fulton DL*, Arenillas DJ, Perco P, Ho Sui SJ, Mortimer JR, Wasserman WW. (2006) Identification of over-represented combinations of transcription factor binding sites in sets of co-expressed genes. In: Series on Advances in Bioinformatics and Computational Biology Volume 3 - Proceedings of the 4th Asia-Pacific Bioinformatics Conference; Taipei, Taiwan: Imperial College Press, London UK; 247-256. *Joint first authors.

The sequence specificity of TFs has stimulated development of computational methods for discovery of TFBS on DNA sequences. Well established methods represent aligned collections of TFBS as position weight matrices (PWM). The sequence specificity of individual PWM profiles can be quantified by their information content and a PWM score provides a quantitative measure of the sequence’s similarity to the binding profile (for review see Wasserman and Sandelin [2]). Searching for high scoring motifs in putative regulatory sequences with a collection of profiles (for instance, JASPAR [3]) can predict binding sites within a sequence and the interacting TFs.
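
As an illustration of the PWM scoring idea described above, the following Python sketch converts a small count matrix into log-odds weights and scans a sequence for high-scoring windows. The count matrix, the uniform background, the pseudocount, and the min-max relative score with a 0.75 cutoff are all assumptions made for this example; they are not taken from JASPAR or from the scoring used in this chapter.

```python
import math

# Hypothetical 4-position count matrix (one list per base); not a real profile.
COUNTS = {
    "A": [8, 0, 0, 1],
    "C": [0, 9, 0, 1],
    "G": [1, 0, 10, 0],
    "T": [1, 1, 0, 8],
}


def log_odds_pwm(counts, pseudocount=1.0, background=0.25):
    """Convert per-position base counts into log2-odds weights."""
    width = len(counts["A"])
    pwm = []
    for pos in range(width):
        total = sum(counts[b][pos] for b in "ACGT") + 4 * pseudocount
        pwm.append({
            b: math.log2(((counts[b][pos] + pseudocount) / total) / background)
            for b in "ACGT"
        })
    return pwm


def score_window(pwm, window):
    """Sum the weight of the observed base at each position."""
    return sum(col[base] for col, base in zip(pwm, window))


def relative_score(pwm, window):
    """Rescale the raw score to 0..1 between the worst and best possible window."""
    lo = sum(min(col.values()) for col in pwm)
    hi = sum(max(col.values()) for col in pwm)
    return (score_window(pwm, window) - lo) / (hi - lo)


def scan(pwm, sequence, threshold=0.75):
    """Return (start, window, relative score) for windows at or above threshold."""
    width = len(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        rel = relative_score(pwm, window)
        if rel >= threshold:
            hits.append((start, window, rel))
    return hits


pwm = log_odds_pwm(COUNTS)
hits = scan(pwm, "TTACGTAA")
```

Because the motif is short, several windows of an arbitrary sequence can score well by chance, which is the specificity problem taken up next.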
However, this methodology is plagued by poor specificity due to the short and variable nature of the TFBS. Phylogenetic footprinting filters have been demonstrated repeatedly to improve specificity [4]. Such filters are justified by the hypothesis that sequences of biological importance are under higher selective pressure and will thus accumulate DNA sequence changes at a slower rate than other sequences. Based on this expectation, the search for potential TFBS can be limited to the most similar non-coding regions of aligned orthologous gene sequences from species of suitable evolutionary distance. Further, one might expect that genes that are coordinately expressed are under the control of the same TFs, suggesting that over-represented TFBS in the co-expressed genes are likely to be functional. These concepts are implemented by Ho Sui et al. in the web service tool oPOSSUM [5], which, when given a set of co-expressed genes, can identify the TFBS motifs that are over-represented with respect to a background set of genes. This approach has achieved success in finding binding sites known to contribute to the regulation of reference gene sets.

Prior methods that attempt to address the known interplay between TFs at CRMs produce results that can be difficult to interpret [6-8]. We introduce a new approach rooted in the biochemical properties of TFs, which allows greater computational efficiency and improved interpretation of results. The resulting method is assessed against diverse reference data to demonstrate its utility for the applied analysis of gene expression data.

2.3. Results

2.3.1. Overview and rationale of oPOSSUM II algorithm

Finding over-represented combinations of TFBS presents several new issues that are not encountered in single site analysis. We address two of the main challenges: computational complexity and TFBS class redundancy.
Firstly, the number of possible combinations of size n from m TFBS (n ≤ m) increases combinatorially with respect to both m and n, which greatly impacts computing time. Secondly, several TFs have similar binding properties, thus subsets of profiles may be effectively redundant. Consequently, an exhaustive search is not an efficient method to find over-represented combinations of patterns.

To address both problems we introduced two approaches. Firstly, we used a novel method to group the profiles into classes. Rather than using protein sequence similarity, a hierarchical clustering procedure was applied to group the profiles into classes according to the quantitative similarity of their binding profiles. One representative member was selected from each class for further analysis. We then searched for the occurrences of class combinations in both co-regulated genes (foreground) and a set of background genes. Secondly, we considered unordered combinations and applied an inter-binding site distance (IBSD) constraint to avoid exhaustive enumeration of all combinations, since many co-operative TFBS are found to occur in clusters without strict ordering constraints [1]. Thus, we need only consider each set of TFBS where all IBSDs satisfy the distance parameter. This approach can dramatically reduce the search space when evaluating any combination size. A scoring scheme adapted from the Fisher exact test was used to compare the degree of over-representation of the class combinations. The highly over-represented class combinations were re-assessed using all possible profile combinations within the indicated classes.

The overall scheme of oPOSSUM II analysis is shown in Figure 2.1. The sections below describe the details of each step.

2.3.2. TFBS classification

Three human reference sets were utilized to validate the oPOSSUM II algorithm: two independently derived skeletal muscle gene sets and a set of smooth muscle-specific genes.
Each of the analyses was restricted to the prediction of vertebrate TFBS profiles. A TFBS profile clustering step is implemented in the oPOSSUM II analysis to identify profiles that possess similar binding properties. We clustered the vertebrate TFBS profiles (see Methods) and applied a threshold (cut) to the resulting hierarchical dendrogram at a height of 0.45 ($thr_H = 0.45$) to produce clusters that, in most cases, correlated well with structural families in JASPAR (cluster tree available in web supplement). Most notably, binding profiles from FORKHEAD, HMG and ETS families were grouped according to their structural classifications. However, as anticipated, the zinc finger profiles were dispersed into multiple groupings due to the divergent protein binding preferences in this structural class. Using this approach, 68 vertebrate TFBS profiles in JASPAR were partitioned into 32 classes. This TF classification step facilitates the detection of over-represented TF class combinations in a set of co-expressed genes and substantially reduces the combinatorial search space.

2.3.3. Validation with reference data sets

2.3.3.1. Yeast CLB2 cluster

The yeast CLB2 gene cluster dataset contains genes whose transcription peaks at late G2/early M phase of the cell cycle. The transcription of these genes is regulated by the TF FKH, a component of the TF SFF complex, which interacts with the TF protein MCM1. Each of the top ten scoring class combinations identified by oPOSSUM II included the binding sites of the ECB class, of which MCM1 is a member. The highest ranked TFBS combination was ECB/FKH1, which is consistent with experimental evidence and a recent TFBS analysis performed by Kreiman et al. [7]. The full set of analysis results is available on the oPOSSUM II supplementary web site.

2.3.3.2. Three human reference gene sets

Prior studies involving muscle set 1 [9] have identified clusters of muscle regulatory sites, which include MEF2, SRF, Myf/MyoD, SP1 and TEF. Figure 2.2 lists the top five over-represented class combinations for each of the three human muscle reference gene sets analyzed. Each of the score values for these combinations fell below 2.0E-3. Also listed are the five most over-represented TFBS classes, as reported by oPOSSUM single site analysis. The classes that contain MEF2 and SP1 dominated the top combinations in both skeletal muscle sets (Figure 2.2a and Figure 2.2b). The TF Yin-Yang modulates SRF-dependent skeletal muscle expression. Thing1-E47 is a bHLH TF localized to gut smooth muscle in adult mice. The prediction of this TFBS class may suggest binding of other bHLH myogenic factors (such as Myf). Bsap and MZF are not muscle-specific TFs. The Bsap motif is long (20 bp) and exhibits an unusual pattern of low information content distributed across the entire motif, suggesting that it may behave differently than other binding profiles. The inclusion of this profile in the JASPAR database is under review (B. Lenhard, personal communication).

Analysis of the smooth muscle genes resulted in an SRF class prediction in each of the top five combinations, consistent with previous gene regulatory studies in muscle [10]. The top combination, SP1/SRF, is required for the expression of smooth muscle myosin heavy chain in rat. Yin-Yang has been shown to stimulate smooth muscle growth. Spz1 acts in spermatogenesis and has no known role in muscle expression.

For all three reference sets, the top scoring combinations highlighted new TFBS classes not found by the oPOSSUM single site analysis algorithm. Furthermore, several relevant muscle TFBS were identified exclusively in the oPOSSUM II combination site analysis.

2.3.4. Effect of set size on false positive rate

Randomly sampled foreground gene sets were analyzed with the oPOSSUM II algorithm to evaluate false positive prediction rates relative to input gene set size (Figure 2.3). Our analyses suggest that the false prediction rates of the oPOSSUM II algorithm are independent of input gene set size. We also noted that at low score values, the proportion of false positives is low.

2.3.5. Web interface

The oPOSSUM II web service is available at http://www.cisreg.ca/oPOSSUM_CSA/opossum2.php. A user provides a set of putatively co-expressed genes as input and specifies the parameter values to be used in the analysis. Certain parameter values may produce lengthy runtimes. To accommodate this possibility, the web service queues the analysis request and issues e-mail notification once the analysis is complete.

2.4. Discussion

The analysis of over-represented combinations of TFBS in the promoters of co-expressed genes is motivated by biochemical and genetic studies which reveal the functional importance of cis-regulatory modules. In contrast to previously described methods which identify single over-represented motifs, the analysis of combinations must solve or circumvent the consequences of a combinatorial explosion, which can precipitate prohibitive runtimes. To reduce the search space, oPOSSUM II restricts its analysis to binding site combinations using biologically justifiable criteria, namely TFBS profile similarity and an inter-binding site distance (IBSD; see footnote 10) constraint (see Methods).

Footnote 10: Inter-binding site distance refers to the number of base pairs between two TFBS.

Our results suggest two important advantages over the existing single-site TFBS over-representation methods. Firstly, for each reference gene set analyzed, at least one relevant TF class appeared in multiple combinations, an observation that is not immediately obvious in a single site analysis.
Secondly, the algorithm can discover functional TFBS that are not highlighted in a single site analysis. For example, oPOSSUM II analysis of the yeast CLB2 gene cluster predicted ECB and FKH1 as a top scoring combination pair, yet analysis of the same dataset by the single site analysis algorithm ranked these predicted TFs as first and eleventh, respectively. Similarly, the SRF and SP1 TFBS combination is reported as most significant by an oPOSSUM II analysis of the smooth muscle reference set, whereas these TFBS are ranked first and fourteenth in a single site analysis. These results demonstrate the power of combination site analysis. Furthermore, oPOSSUM II analyses of the microarray-based skeletal muscle reference set correctly predicted the cooperativity of MEF2 and SP1 TFs in myogenesis, which supports the utility of combining a high-quality microarray dataset with a combination site analysis approach.

While our analysis results for the yeast CLB2 cluster are comparable to those reported by Kreiman et al. [7], there are significant differences between the methods applied by the two studies. The Kreiman study applies a motif similarity approach to discard redundant PWMs before searching for modules (CRMs). oPOSSUM II instead clusters PWMs to identify groups of similar profiles (classes), selects the PWM (centroid) that best represents each cluster, and initially identifies over-represented combinations of TF classes. Over-represented classes are then expanded to evaluate all relevant TFBS combinations for over-representation. The Kreiman study reports the top scoring combinations for the Wasserman and Fickett skeletal muscle collection as SP1, SRF, TEF, and a novel motif, while oPOSSUM II identified Mef2, Myf, Srf, and Sp1 in the top scoring pairs.

A few issues should be considered in future research. Firstly, the interpretation of most PWM-based TFBS analyses is confounded by intra-class binding similarity.
While this property facilitates the oPOSSUM II approach, users are left to determine which TF protein family member could be acting in the tissue and/or condition under study. For instance, over-representation of an E-box motif in the skeletal muscle analysis does not specifically highlight the MyoD TF protein; the user must consider the entire range of bHLH-domain TFs. Secondly, inter-class similarity can influence the CRM predictions. Although oPOSSUM II does not evaluate overlapping redundant TFBS combinations, unique TFBS combination predictions that overlap other binding site combinations are included in the analysis. Thus, two G-rich motifs may be reported as over-represented in different combinations (for instance, the SP1 and MZF motifs in Figure 2.2c) but highlight the same candidate TFBS. A related issue is the compositional sequence bias in tissue-specific genes [11], which could be addressed through provision of a background gene set that possesses a similar sequence content bias. Finally, depending on the parameters selected, computing time requirements can be prohibitively long for a synchronous web service. Parallelization of the algorithm would be a natural way to improve the running time.

oPOSSUM II utilizes putative TFBS identified from comparative genomic analysis, in conjunction with knowledge of co-regulated expression, to search for functional combinations of TFBS that may confer a given gene expression pattern. It uses a novel scheme to classify similar binding site profiles. Using this clustering approach, the oPOSSUM II method is able to circumvent the combinatorial challenge associated with the identification of significant TFBS combinations. Furthermore, the application of an IBSD constraint limits the number of possible combinations to analyze.
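
To give a rough sense of the size of this reduction, the short Python calculation below compares the number of unordered combinations that could be formed from the 68 vertebrate profiles with the number formed from the 32 classes reported in Section 2.3.2. It is only a counting sketch: it assumes combinations of distinct items, ignores the additional pruning from the IBSD constraint, and the function name is hypothetical.

```python
from math import comb

N_PROFILES = 68   # vertebrate JASPAR profiles reported in Section 2.3.2
N_CLASSES = 32    # classes obtained after clustering (Section 2.3.2)


def search_space(n_items, size):
    """Number of unordered combinations of `size` distinct items."""
    return comb(n_items, size)


for size in (2, 3, 4):
    profiles = search_space(N_PROFILES, size)
    classes = search_space(N_CLASSES, size)
    print(f"size {size}: {profiles} profile combinations vs {classes} class combinations")
# size 2: 2278 profile combinations vs 496 class combinations
# size 3: 50116 profile combinations vs 4960 class combinations
# size 4: 814385 profile combinations vs 35960 class combinations
```

The gap widens with combination size, from roughly 4.6-fold for pairs to roughly 10-fold for triples.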
Validation results suggest that a TFBS combination site analysis can provide valuable information that is not available through identification of single over-represented TFBS.

2.5. Methods

2.5.1. Background: the oPOSSUM database

Ho Sui et al. [5] describe the creation of the oPOSSUM database, which stores predicted, evolutionarily conserved TFBS to support over-representation analysis of TFBS for single TFs. Briefly, human-mouse orthologs are retrieved from Ensembl. TFBS profiles from the JASPAR database are used to identify putative TFBS within the conserved non-coding regions from 5000 base pairs (bp) upstream to 5000 bp downstream of the annotated transcription start site (TSS) on both strands. The oPOSSUM database stores the start and end positions and the matrix match score of each predicted TFBS for four score thresholds: 70, 75, 80, and 85%. This database is used as input to the oPOSSUM II algorithm to search for over-represented TFBS combinations (described below).

2.5.2. TFBS in foreground gene set

When presented with a set of co-expressed genes S, oPOSSUM II queries the oPOSSUM database for all putative TFBS T present in S within a maximum of 5000 bp upstream and 5000 bp downstream of the TSS of each gene. The analysis may be restricted to those TFs found in selected taxonomic subgroups (plant, vertebrate and insect are currently available), or to TFs whose profiles exceed a minimum information content.

2.5.3. Classification of TFBS profiles

Binding profiles for T were retrieved from the JASPAR database. A profile comparison algorithm, either CompareACE [12] (default) or matrix aligner [13], was used to calculate the pairwise similarity scores of all the profiles using profile alignment methods. The similarity score $s(t_i, t_j)$ between profiles $t_i$ and $t_j$ was converted to a distance $d(t_i, t_j) = 1 - s(t_i, t_j)$. A distance matrix M was formed from these pairwise distances.
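
A minimal Python sketch of this conversion, carried through the complete-linkage clustering and tree cut that this section goes on to describe, is given here. The profile names and similarity scores are hypothetical, the scores are assumed to lie between 0 and 1, and the pure-Python merge loop stands in for whatever clustering implementation was actually used. Stopping once the closest pair of clusters is farther apart than the threshold is equivalent to cutting a complete-linkage tree at that height, because complete-linkage merge heights never decrease.

```python
from itertools import combinations


def similarity_to_distance(similarity):
    """Apply d(t_i, t_j) = 1 - s(t_i, t_j) to every unordered profile pair."""
    return {pair: 1.0 - s for pair, s in similarity.items()}


def complete_linkage_classes(profiles, distance, thr_h):
    """Merge clusters bottom-up; stop once the next merge would exceed thr_h."""
    def d(a, b):
        return distance[frozenset((a, b))]

    def linkage(c1, c2):
        # Complete linkage: distance between the two farthest members.
        return max(d(a, b) for a in c1 for b in c2)

    clusters = [frozenset([p]) for p in profiles]
    while len(clusters) > 1:
        height, c1, c2 = min(
            ((linkage(x, y), x, y) for x, y in combinations(clusters, 2)),
            key=lambda item: item[0],
        )
        if height > thr_h:
            break
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 | c2]
    return clusters


# Hypothetical pairwise similarity scores for five made-up profiles.
raw = {
    ("P1", "P2"): 0.90, ("P1", "P3"): 0.20, ("P1", "P4"): 0.10, ("P1", "P5"): 0.30,
    ("P2", "P3"): 0.25, ("P2", "P4"): 0.15, ("P2", "P5"): 0.20,
    ("P3", "P4"): 0.70, ("P3", "P5"): 0.40, ("P4", "P5"): 0.35,
}
similarity = {frozenset(pair): s for pair, s in raw.items()}
distance = similarity_to_distance(similarity)
classes = complete_linkage_classes(["P1", "P2", "P3", "P4", "P5"], distance, thr_h=0.45)
```

With these toy scores and a cut height of 0.45, P1 and P2 form one class, P3 and P4 a second, and P5 remains on its own.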
From M, an agglomerative clustering procedure produced a hierarchy of clusters (subsets) of T. The complete linkage method was used. Cutting the cluster tree at a specified height $thr_H$ partitioned T into classes.

2.5.4. Selection of TFBS and enumeration of combinations

For each class C, we selected the profile that is the most similar to other profiles in C as the class representative. We chose this approach as we could not identify an adequate procedure that would generate a consensus profile with comparable specificity to the matrices within the class. To identify the class representative, we first calculated the sum of pairwise similarity scores $\sigma_i$ between a profile $t_i$ and the other profiles in C, i.e., $\sigma_i = \sum_{t_i, t_j \in C} s(t_i, t_j)$. The profile with the maximum sum of similarity scores was chosen. From the selected TFBS, unordered combinations of specified size (cardinality) were created. The foreground gene set (the co-expressed genes) and the background gene set (default is all the genes in the database) were searched for occurrences of these combinations. Let $max_d$ be the maximum inter-binding site distance. For each gene, occurrences of the combinations were found using a sliding window of width equal to $max_d$ within the required search region. The numbers of genes with a combination in the foreground set and in the background gene set were counted.

2.5.5. Scoring of combinations

The Fisher exact test detects the non-random association between two categorical variables. We adopted the Fisher P-values to rank the significance of non-random association between the occurrence of a combination in the foreground gene set, i.e., over-representation of the combination in the foreground compared to background. For each combination, a two-dimensional contingency table was constructed from the foreground and background count distributions:

|            | Number of genes with a given combination | Number of genes without a given combination |
| Foreground | $a_{11}$ | $a_{12}$ |
| Background | $a_{21}$ | $a_{22}$ |

For $i, j = 1, 2$, the row sum is $R_i = a_{i1} + a_{i2}$, the column sum is $C_j = a_{1j} + a_{2j}$, and the total count is $N = \sum_i R_i = \sum_j C_j$.

From the hypergeometric probability function, the conditional probability $P_{cutoff}$, given the row and column sums, is:

$$P_{cutoff} = \frac{(C_1!\,C_2!)(R_1!\,R_2!)}{N!\,\prod_{i,j} a_{ij}!}$$

We calculated the P-values for all other possible contingency tables with row sums equal to $R_i$ and column sums equal to $C_j$. The Fisher P-value is the sum of all the P-values less than or equal to $P_{cutoff}$.

Caution must be exercised when interpreting these Fisher P-values. Firstly, the foreground and background genes are allowed to overlap, which is a violation of an assumption for the statistical test. Secondly, the Fisher exact test model may not precisely characterize the data sets being analyzed. As a result, the Fisher P-values were used purely as a measure to compare the degree of over-representation between different combinations. We will hereafter refer to the P-values as “scores”. Although the scores do not describe the probabilistic nature of the over-representation, the ranking they provide is shown to be useful [5].

2.5.6. Finding significant TFs from over-represented class combinations

Let $thr_C$ be the maximum “score” for which a TFBS combination may be considered significant. Our empirical studies of reference collections suggested that a default maximum score value of 0.01 detects relevant TF combinations. Let $x_i$ be any TFBS class combination with a score less than or equal to $thr_C$, and let X be the set of distinct class combinations that satisfy the score threshold: $X = \{x_i \mid score(x_i) \le thr_C\}$. For each combination $x_i$, let each of $C_1, C_2, \ldots, C_h$ be the set of TFBS profiles represented by one of the h class profiles in that combination. Compute the Cartesian product $C_p$ of $C_1, C_2, \ldots, C_h$. We call this “expanding the TFBS classes” from the class representatives.
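The scoring and expansion steps of Sections 2.5.5 and 2.5.6 can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the oPOSSUM II code: the function names and the class-membership dictionary are hypothetical, the counts in the example are made up, and a small relative tolerance is assumed when comparing table probabilities so that floating-point ties are included in the sum. The table probability is written with binomial coefficients, which is algebraically the same as the factorial form of $P_{cutoff}$.

```python
from itertools import product
from math import comb


def fisher_score(a11, a12, a21, a22):
    """Sum the probabilities of all tables with the observed margins that
    are no more likely than the observed table (two-sided Fisher test)."""
    r1, r2 = a11 + a12, a21 + a22  # row sums R1, R2
    c1 = a11 + a21                 # column sum C1
    n = r1 + r2                    # total count N

    def table_probability(k):
        # Probability of the table whose top-left cell equals k; equal to
        # (C1! C2!)(R1! R2!) / (N! * product of the four cell factorials).
        return comb(r1, k) * comb(r2, c1 - k) / comb(n, c1)

    p_cutoff = table_probability(a11)
    k_min, k_max = max(0, c1 - r2), min(r1, c1)
    return sum(
        p
        for p in map(table_probability, range(k_min, k_max + 1))
        if p <= p_cutoff * (1 + 1e-9)  # assumed tolerance for ties
    )


def expand_class_combination(class_combination, class_members):
    """Cartesian product of the member profiles of each class."""
    return list(product(*(class_members[c] for c in class_combination)))


# Made-up counts: 8 of 10 foreground genes and 1 of 6 background genes
# contain the combination.
score = fisher_score(8, 2, 1, 5)

# Hypothetical class membership, for illustration only.
class_members = {1: ["SRF", "Agamous"], 4: ["MEF2A"]}
expanded = expand_class_combination((1, 4), class_members)
```

In keeping with the caution stated in Section 2.5.5, the value returned by `fisher_score` is best read as a ranking score rather than as a calibrated probability.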
The enumeration and ranking procedures were repeated for the h-tuples in $C_p$.

2.5.7. Random sampling simulations of foreground genes

The oPOSSUM II algorithm accommodates input gene sets of different cardinalities. To investigate the relationship between gene set size and the false positive rate, 100 random samples of r genes were selected from the background and provided as input to oPOSSUM II as foreground genes. For each sample, oPOSSUM II reported the scores for all the class combinations. As random samples of genes are not expected to be co-regulated, any predicted combination was a false positive. Let $(0, max_s]$ be the interval over which false positives are accumulated. We recorded the number of false positive class combinations for a range of $max_s$ when r = 20, 40, 60, 80, 100.

2.5.8. Validation

Three reference sets of human genes were used as input to oPOSSUM II to assess the performance of the algorithm. Two independent sets of skeletal muscle genes were tested. The first set (muscle set 1) was compiled from the reference collection identified by Wasserman and Fickett [9] and updated by a review of recent literature. A second set (muscle set 2) combines the results of microarray studies of Moran et al. [14] and Tomczak et al. [15]. The third set contains smooth muscle-specific genes experimentally verified by Nelander et al. [16]. All sets were validated with $max_d = 100$, TFBS matrix score threshold = 75%, and conservation level = 1.

We compared our oPOSSUM II analysis results to the results reported in the study by Kreiman [7]. This comparison included analysis of the yeast CLB2 gene cluster [17]. We used the yeast oPOSSUM database (Ho Sui, unpublished) as input to the oPOSSUM II algorithm to perform the analysis.

All supplementary information is available at: www.cisreg.ca/oPOSSUM_CSA/supplement/.

Figure 2.1. Overview of the oPOSSUM II analysis algorithm

Processing steps are numbered in the order executed.
The database of predicted TFBS is identical to that of the oPOSSUM analysis system (Ho Sui et al. [5]).

Figure 2.2. The top five over-represented pair combinations of TFBS classes for muscle reference sets

The top five over-represented pair combinations of TFBS classes reported by oPOSSUM II and over-represented single TFBS sites reported by oPOSSUM for the skeletal and smooth muscle sets. The numbers are the class identifiers and enclosed in parentheses is the name of a TF within that class, which is either known to mediate transcription in the assessed tissue (*) or is a class representative.

Figure 2.3. Gene set size and false positive rate

Effect of gene set size on false positive rate observed from pairwise TFBS combinations in randomly generated foreground gene sets.

2.6. References

1. Arnone MI, Davidson EH: The hardwiring of development: organization and function of genomic regulatory systems. Development 1997, 124(10):1851-1864.
2. Wasserman WW, Sandelin A: Applied bioinformatics for the identification of regulatory elements. Nat Rev Genet 2004, 5(4):276-287.
3. Sandelin A, Alkema W, Engström P, Wasserman WW, Lenhard B: JASPAR: an open-access database for eukaryotic transcription factor binding profiles. Nucleic Acids Res 2004, 32(Database issue):D91-94.
4. Lenhard B, Sandelin A, Mendoza L, Engstrom P, Jareborg N, Wasserman WW: Identification of conserved regulatory elements by comparative genome analysis. J Biol 2003, 2(2):13.
5. Ho Sui SJ, Mortimer JR, Arenillas DJ, Brumm J, Walsh CJ, Kennedy BP, Wasserman WW: oPOSSUM: identification of over-represented transcription factor binding sites in co-expressed genes. Nucleic Acids Res 2005, 33(10):3154-3164.
6. Bluthgen N, Kielbasa SM, Herzel H: Inferring combinatorial regulation of transcription in silico. Nucleic Acids Res 2005, 33(1):272-279.
7. Kreiman G: Identification of sparsely distributed clusters of cis-regulatory elements in sets of co-expressed genes.
Nucleic Acids Res 2004, 32(9):2889-2900.
8. Sharan R, Ben-Hur A, Loots GG, Ovcharenko I: CREME: Cis-Regulatory Module Explorer for the human genome. Nucleic Acids Res 2004, 32(Web Server issue):W253-256.
9. Wasserman WW, Fickett JW: Identification of regulatory regions which confer muscle-specific gene expression. J Mol Biol 1998, 278(1):167-181.
10. Madsen CS, Hershey JC, Hautmann MB, White SL, Owens GK: Expression of the smooth muscle myosin heavy chain gene is regulated by a negative-acting GC-rich element located between two positive-acting serum response factor-binding elements. J Biol Chem 1997, 272(10):6332-6340.
11. Yamashita R, Suzuki Y, Sugano S, Nakai K: Genome-wide analysis reveals strong correlation between CpG islands with nearby transcription start sites of genes and their tissue specificity. Gene 2005, 350(2):129-136.
12. Hughes JD, Estep PW, Tavazoie S, Church GM: Computational identification of cis-regulatory elements associated with groups of functionally related genes in Saccharomyces cerevisiae. J Mol Biol 2000, 296(5):1205-1214.
13. Sandelin A, Hoglund A, Lenhard B, Wasserman WW: Integrated analysis of yeast regulatory sequences for biologically linked clusters of genes. Funct Integr Genomics 2003, 3(3):125-134.
14. Moran JL, Li Y, Hill AA, Mounts WM, Miller CP: Gene expression changes during mouse skeletal myoblast differentiation revealed by transcriptional profiling. Physiol Genomics 2002, 10(2):103-111.
15. Tomczak KK, Marinescu VD, Ramoni MF, Sanoudou D, Montanaro F, Han M, Kunkel LM, Kohane IS, Beggs AH: Expression profiling and identification of novel genes involved in myogenic differentiation. Faseb J 2004, 18(2):403-405.
16. Nelander S, Mostad P, Lindahl P: Prediction of cell type-specific gene modules: identification and initial characterization of a core set of smooth muscle-specific genes. Genome Res 2003, 13(8):1838-1854.
17.
Spellman PT, Sherlock G, Zhang MQ, Iyer VR, Anders K, Eisen MB, Brown PO, Botstein D, Futcher B: Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol Biol Cell 1998, 9(12):3273-3297.

3. oPOSSUM: Integrated Tools for Analysis of Regulatory Motif Over-Representation¹¹

¹¹ A version of this chapter has been published. Ho Sui SJ, Fulton DL, Arenillas DJ, Kwon AT, Wasserman WW. (2007) oPOSSUM: Integrated tools for analysis of regulatory motif over-representation. Nucleic Acids Research. Jul; 35 Web Server Issue: W245-52.

3.1. Chapter preamble

Gene promoters contain cis-regulatory DNA elements that are required for transcription initiation. Alternative gene promoters provide regulatory control plasticity, enabling activation of specific gene expression programs under varying environmental conditions, temporal states, and tissue types. This chapter describes the identification of alternative promoters for human and mouse genes and extension of the original Combination Site Analysis (CSA) algorithm, described in Chapter 2, to identify over-represented TFBS combinations in regions surrounding alternative gene promoters. The revised CSA algorithm demonstrated a marked improvement in recovery of CRMs known to regulate gene reference collections.

3.2. Introduction

Functional genomics research often generates lists of genes with observed common properties, such as coordinated expression. For many studies, a key challenge is the generation of relevant and testable hypotheses about the regulatory networks and pathways that underlie observed co-expression. Our strategy for elucidating regulatory mechanisms identifies over-represented sequence motifs that are present in the upstream regulatory regions of genes. The motifs may represent transcription factor binding sites (TFBSs) that have a role in regulating expression.
oPOSSUM [1] and oPOSSUM2 [2] were developed to identify over-represented, predicted TFBSs and combinations of predicted TFBSs, respectively, in sets of human and mouse genes. The user inputs a list of related genes and selects the TFBS profile set to be included in the analysis; the algorithm then determines which, if any, predicted TFBSs occur in the promoters of the set of input genes more often than would be expected by chance. Both analytic approaches rely on a database of aligned, orthologous human and mouse sequences, and the delineation of conserved regions within which TFBS predictions are analyzed. While the approach does not explicitly address uncharacterized transcription factors (TFs), the effective coverage is broadened by the fact that members within certain structural families of TFs can exhibit similarities in binding specificity. Although intra-class similarity is not always the case, as exemplified by the zinc-finger family of TFs [3], the observation holds true for many TF families [4, 5].

Here we describe the new release of the oPOSSUM system, which integrates the two previously developed applications, and has been expanded to accommodate new species (yeast and worms). It also includes new methods for orthology assignment, transcription start site (TSS) determination, and sequence alignment.

3.3. Results

Each oPOSSUM component was validated on sets of reference genes. The results of all validations are available as supplementary materials (Tables S4-S13 available at http://cisreg.ca/oPOSSUM/data/). In the interest of space, a single validation is described for each system.

3.3.1. Human single site analysis

Wonsey and Follettie performed a microarray analysis of genes that are transcriptionally regulated by FoxM1, a member of the forkhead family of TFs, using BT-20 cells that had been transfected with FoxM1 siRNA [6].
They identified a set of 27 genes that were specifically regulated in cells transfected with FoxM1 siRNA (Table S4 available at http://cisreg.ca/oPOSSUM/data/). The 27 Affymetrix UG144A identifiers were mapped to 27 EnsEMBL gene identifiers and submitted to the Human single site analysis (SSA) tool with default parameters. Of these, 22 genes had a unique mouse ortholog and were used in the oPOSSUM analysis.

While a specific profile for FoxM1 is not present in JASPAR CORE, other members of the forkhead family were ranked in the top ten highest scoring TFBS profiles (Table 3.1). There is also a known association between HNF4, the highest scoring TFBS profile, and the forkhead TF FOXO1 in the regulation of gluconeogenic gene expression in hepatocytes [7], which may explain the over-representation of the HNF4 profile.

We previously identified over-represented Fos binding sites in a set of genes induced after transformation by c-Fos in rat fibroblast cells [1, 8]. We analyzed 160 orthologous genes from the original list of 252 induced genes (Table S7 available at http://cisreg.ca/oPOSSUM/data/). This is a notable improvement over the previous version where only 98 genes were included in the oPOSSUM analysis. The Fos TFBS profile ranked third in the list of over-represented TFBSs (Table 3.2). Inspection of the results using the JASPAR PhyloFACTS profiles with default parameters illustrates how inclusion of this new set of profiles provides additional, meaningful information. The highest ranked PhyloFACTS motif (TGANTCA) is noted by JASPAR as being most similar to the binding profile for AP-1, and the third highest scoring motif (TGASTMAGC) is most similar to the bZIP TF NF-E2. AP-1 complexes are composed of Fos and Jun proteins, and the structurally related NF-E2 and AP-1 TFs bind similar sequence motifs [9].

3.3.2. Human combination site analysis

The combination site analysis (CSA) algorithm was validated on a set of mouse skeletal muscle genes formed from the union of the results of the microarray studies of Moran et al. [10] and Tomczak et al. [11] (Table S9 available at http://cisreg.ca/oPOSSUM/data/). To avoid circularity, we removed muscle-specific genes used to generate the JASPAR binding site profiles for Mef-2, Myf, Sp-1, SRF, and Tef. These factors occur in clusters in cis-regulatory modules that contribute to skeletal muscle-specific expression [12]. Table 3.3 lists the top five over-represented pairwise TFBS combinations for this set of genes, along with the JASPAR class each TF profile clustered to, and the Fisher score obtained for each pair. The five most over-represented pairs of TFBS profiles include combinations of Mef-2, SRF and Sp-1.

The inclusion of alternative promoters provides notable improvements in the Human SSA and Human CSA analyses. The same data sets were used to validate our previous and current human oPOSSUM analysis systems. Demarcation of additional promoter boundaries increases the signal in the discovery process, improving the signal for both over-represented single TFBSs and combinations of TFBSs in the gene sets analyzed.

3.3.3. Worm single site analysis

Worm SSA was tested on a set of well-characterized nematode muscle genes (Table S10 available at http://cisreg.ca/oPOSSUM/data/) [13]. Analysis of 1000bp of upstream sequence, using the top 10% of conserved regions (minimum of 60% sequence identity), a matrix match threshold of 80% and the worm profiles, identified the putative muscle1 motif with a Z-score of 20.6 and a Fisher score less than 0.01 (Table 3.4). This is, however, somewhat circular, given that 19 of the 41 input genes were used to generate the putative muscle-specific worm profiles.
Analysis using the JASPAR CORE profiles ranked SP1 and Su(H) within the top ten scoring profiles (Table S10B available at http://cisreg.ca/oPOSSUM/data/). Studies in Xenopus and Drosophila provide evidence that MyoD triggers Notch signaling through Su(H) for muscle determination [14, 15]. Although SP1 has been implicated in muscle CRMs, it is a general transcription factor involved in the expression of many different genes and binds to GC-rich motifs.

3.3.4. Yeast single site analysis

The yeast CLB2 gene cluster is comprised of 32 genes whose pattern of expression peaks at late G2/early M phase of the cell cycle (Table S11 available at http://cisreg.ca/oPOSSUM/data/). Transcription of these genes is regulated by two TFs: FKH, which is a component of the TF SFF, and MCM1, a member of the early cell cycle box (ECB) binding complex. Analysis of 500bp of upstream sequence using a matrix match threshold of 85% ranked ECB, MCM1 and FKH1 in the top five scoring TFBS profiles (Table 3.5), which is consistent with the literature [16].

3.4. Discussion

The four oPOSSUM systems, Human SSA, Human CSA, Worm SSA, and Yeast SSA, have been integrated into a user-friendly website at www.cisreg.ca/oPOSSUM_new. We recommend that users of the system begin with the SSA to quickly identify TFBSs that may be relevant to their input data sets. For sets of human and mouse genes, this can be followed with the CSA, which takes longer to process, but which can provide insights into TFBSs that may be acting in concert to regulate the set of genes.

The web implementation allows for analysis in default and custom modes. Default mode processing is faster as TFBS counts have been pre-calculated and stored for pre-defined conservation levels, matrix match thresholds and promoter lengths.

In either mode, the user is required to select a species and to enter a list of gene identifiers (EnsEMBL, RefSeq, HGNC and Entrez Gene are supported for human).
A number of options are available to specify the TFBS profile set to be used in the analysis. Finally, the conservation level, matrix match threshold and the promoter length can be varied. In the custom mode, users may define their own background set, which provides more control but results in more variable processing speeds, depending on the size of the background set and the parameters selected.

Upon submission, oPOSSUM SSA generates a summary of the input parameters, and produces a single table that ranks the over-represented TFBSs by descending Z-score. The table may be sorted by TF name, TF class, supergroup, information content (IC), Z-score and Fisher score (Figure 3.3A). Pop-up windows linked to each TFBS foreground count display the genes in which the putative site is located, the promoter region(s) for each gene, as well as the TFBS’s co-ordinates and score (Figure 3.3B). TFBSs that occur in overlapping promoter regions are marked by an asterisk and highlighted in yellow. The TF names are linked to the JASPAR database for easy access to information regarding the binding site profiles.

The output for oPOSSUM CSA is similar, providing (i) a ranked list of over-represented TFBS class combinations, and (ii) a list of the most significant TFBS combinations (found in the set of expanded top-ranked class combinations).

Because the statistics employed assume that DNA sequences are randomly generated, there is little reason to accept the calculated scores as accurate reflections of significance. Instead, as suggested in the original published description of the oPOSSUM algorithm, we recommend that the scores be used as rankings rather than as significance measures. For this reason, a multiple testing correction is not applied as it does not alter their relative ranking.
Empirically, we determined that TFBS profiles with Z-scores equal to or exceeding 10 and Fisher scores less than or equal to 0.01 facilitate the identification of relevant TFBSs for our sets of reference genes [1]. However, these are relatively stringent thresholds, and we encourage users to examine the scores of top-ranked TFBS profiles before applying any cutoffs.

We provide a consistent display for all four systems. However, there are slight differences between the systems, such as different parameters for selection on the input pages which are relevant for each species database and system. Also, due to the longer processing times required to compute combinations of TFBSs, Human CSA queues the analysis request on the server and emails the completed results to the user.

The oPOSSUM system is under continued development. Efforts are underway to allow users to submit custom TF profiles to be included in the analysis. An improved search method for nuclear hormone receptor binding sites, which typically contain two half sites separated by a variable length spacer, has been developed and will be included in a future release. We will continue to add TFBS profiles as they become available, with an emphasis on expanding the repertoire of worm TFBS profiles. We believe the oPOSSUM web server is and will continue to be a useful resource for researchers attempting to move from observed co-expression to infer mechanisms of co-regulation.

3.5. Methods

3.5.1. Over-representation analysis

3.5.1.1. oPOSSUM single site analysis

The oPOSSUM system for identifying over-represented TFBSs in sets of co-expressed genes first focused on single site analysis [1]. Two scores were developed to assess over-representation, one at the TFBS occurrence level and the other at the gene level.

The Z-score, based on the normal approximation to the binomial distribution, indicates how far and in what direction the number of TFBS occurrences deviates from the background distribution's mean.
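A Z-score of this kind can be sketched as below. This is a generic normal approximation to the binomial, offered as a sketch only: this section does not state what oPOSSUM counts as a trial or whether a continuity correction is applied, so the function name, its parameters, and the example numbers are all assumptions.

```python
from math import sqrt


def z_score(observed_hits, n_trials, background_rate):
    """Normal approximation to a binomial count.

    observed_hits   -- occurrences seen in the foreground (assumed unit)
    n_trials        -- number of foreground trials (assumed definition)
    background_rate -- per-trial occurrence rate estimated from background
    """
    expected = n_trials * background_rate
    std_dev = sqrt(n_trials * background_rate * (1.0 - background_rate))
    return (observed_hits - expected) / std_dev


# Made-up numbers: 30 hits where 20 would be expected from the background.
example_z = z_score(observed_hits=30, n_trials=1000, background_rate=0.02)
```

A positive value indicates more occurrences than the background rate predicts, and a negative value indicates fewer.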
The second score, from the Fisher exact test, indicates whether the proportion of genes containing the TFBS is greater than would be expected by chance. TFBS predictions situated within overlapping alternative promoters are counted only once when calculating over-representation in human and mouse genes. For C. elegans genes in operons, TFBS predictions in the upstream region of the first gene in the operon apply to all genes in the operon.

3.5.1.2. oPOSSUM combination site analysis

TFBSs do not act in isolation to initiate the transcription process. Transcriptional regulation can be viewed as mediated by arrays of cis-regulatory sequences, termed cis-regulatory modules (CRMs), which are bound by multiple TFs. In oPOSSUM2, Huang et al. (2006) [2] address the detection of over-represented sets of TFBSs in the promoters of a set of co-expressed genes. In brief, the method reduces combinatorial complexity through an initial clustering step, which partitions similar TFBS profiles into groups (herein denoted TFBS classes), followed by an analysis step that determines a representative profile for each TFBS class. The class representatives are then evaluated to detect over-represented sets of TFBS classes. Since each distinct, over-represented set of detected TFBS classes, herein described as a TFBS class combination, implicates the over-representation of one or more underlying TFBS profile-specific combinations, each of these TFBS class combinations is expanded to all possible TFBS profile-specific combinations (for the indicated classes) and then all combinations are analyzed for over-representation. Furthermore, given that CRMs can contain locally dense clusters of TFBSs, the system also provides for the specification of an inter-binding site distance (IBSD) constraint to confine the number of TFBS combinations that are investigated.
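The effect of the IBSD constraint can be illustrated with a small sliding-window sketch. It is an assumption-laden illustration rather than the oPOSSUM implementation: the function names, the `(class_id, position)` site representation, and the choice to measure distance between site start positions are hypothetical, and the coordinates in the example are made up.

```python
def has_combination(sites, combination, max_d):
    """Return True if one site from every class in `combination` falls
    within a window of width `max_d`.

    sites       -- list of (class_id, position) pairs for one gene
    combination -- iterable of distinct class identifiers
    max_d       -- maximum inter-binding site distance (assumed to be
                   measured between site start positions)
    """
    wanted = set(combination)
    relevant = sorted((pos, cls) for cls, pos in sites if cls in wanted)
    counts = {}  # classes currently inside the window
    left = 0
    for pos, cls in relevant:
        counts[cls] = counts.get(cls, 0) + 1
        # Slide the left edge until the window is at most max_d wide.
        while pos - relevant[left][0] > max_d:
            dropped = relevant[left][1]
            counts[dropped] -= 1
            if counts[dropped] == 0:
                del counts[dropped]
            left += 1
        if len(counts) == len(wanted):
            return True
    return False


def count_genes_with_combination(gene_sites, combination, max_d):
    """Count genes (each at most once) that contain the combination."""
    return sum(
        has_combination(sites, combination, max_d)
        for sites in gene_sites.values()
    )


# Made-up site coordinates for two hypothetical genes.
gene_sites = {
    "geneA": [("MEF2A", 100), ("SRF", 150), ("Myf", 400)],
    "geneB": [("MEF2A", 100), ("SRF", 900)],
}
n_hits = count_genes_with_combination(gene_sites, ("MEF2A", "SRF"), max_d=100)
```

Counting each gene at most once mirrors the gene-level counts that feed the Fisher-based score; running the same function over the foreground and the background gene sets would give the two counts that are compared.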
A scoring scheme, adopted from the Fisher exact test, utilizes two sets of TFBS (class or profile-specific) combination counts to compare the degree of their over-representation: 1) the number found in the promoters of the co-expressed gene set versus 2) the number found in the promoters of genes in a background set (all genes in the database). TFBS combinations occurring in multiple alternative gene promoter regions are counted only once.

3.5.2. Species-specific databases

In addition to enhancements to the human/mouse oPOSSUM database, we introduce new species databases for studies of over-represented TFBSs in yeast and worms. While the SSA over-representation analysis remains the same for all species, differences in gene structure require that the construction of the underlying databases be particular to each species.

3.5.2.1. Human/mouse

Ambiguities in ortholog assignments and the definition of TSS positions are major challenges when performing alignments for a large proportion of human and mouse genes. We have expanded the human/mouse database through (i) the discrimination of potential orthologs from predicted paralogs based on upstream sequence similarity (Figure 3.1), and (ii) the delineation of alternative promoters for human and mouse genes (Figure 3.2) to address the alignment failure observed in previous database builds.

While the inclusion of promoter comparisons for candidate ortholog assignment may be controversial, the impact is marginal as less than 1.3% of gene pairs were derived from this approach. This brings the total number of orthologs to 15162. Despite improvements in EnsEMBL’s ortholog prediction, this is only 1079 more orthologs than were present in our previous database build. Based on the small incremental increases in mapped orthologs, we may be nearing the upper bound for the number of genes in human and mouse that are truly orthologous and detectable by sequence conservation.
Detailed descriptions of transcription start region (TSR) determination and the distribution of TSRs for human and mouse genes are available as supplementary material.

For each human/mouse orthologous pair, we determine the coordinates of the longest region from the UCSC genome alignments [17] spanning all transcripts plus an additional 10kb of upstream sequence. The orthologous sequences are retrieved and re-aligned using ORCA, a pairwise global progressive alignment algorithm (described in [1]) to optimally align short, conserved blocks within longer global alignments. If possible, TSRs from human and mouse are paired in the alignment. We apply three dynamically computed and progressively more stringent conservation thresholds corresponding to the top 10%, 20%, and 30% of all 100bp non-coding windows, each with a minimum percent identity of 70%, 65%, and 60%, respectively. Of the 15162 orthologous gene pairs supplied as input to the oPOSSUM pipeline, 15121 (99.7%) successfully align, and 15027 (99.1%) have non-exonic conserved regions above 60% nucleotide identity. This is a significant improvement over the previous version of oPOSSUM.

3.5.2.2. C. elegans/C. briggsae

To facilitate transcriptional regulatory analysis of the numerous gene expression studies performed in C. elegans, we have implemented a worm version of oPOSSUM. While the database structure and pipeline procedure are very similar to that used for the human/mouse database, there are small modifications that allow for mapping of genes to their operons, as defined by Blumenthal et al. [18]. In addition, nucleotide identity thresholds for conserved regions were reduced to 60%, 55%, and 50% for the top 10%, 20%, and 30% of non-coding windows, respectively, to account for the greater sequence divergence between C. elegans and C. briggsae compared to human and mouse. The set of orthologs for C. elegans and C.
briggsae is defined by one-to-one InParanoid clusters [19] from WormBase (WS160) [20]. After filtering overlapping genes, 10592 orthologous gene pairs (of which, 2140 genes are in operons) remain for alignment. Alignments are performed on the orthologous gene sequences plus 2kb of upstream sequence (relative to the start codon) for C. elegans, and 4kb of upstream sequence for C. briggsae. Annotations are not as mature for C. briggsae, and the longer upstream region aids in the alignment of the worm promoter sequences. Alternative promoters have not been considered in this first version; however, should CAGE data or other reliable means for annotating TSSs in worms become available, efforts will certainly be made to include them. Of the 10592 worm orthologs, 9331 (88%) were successfully aligned.

3.5.2.3. Yeast

The analysis of yeast promoters is simplified by the more compact nature of the yeast genome. This characteristic diminishes the requirement for comparative methods to reduce the search space and noise inherent in larger genomes. Computational methods using S. cerevisiae sequences alone have successfully been used to identify regulatory elements associated with known sets of related genes [21, 22]. We opted to exclude phylogenetic footprinting for yeast, and instead, select promoter sequences corresponding to the 5' untranslated region 1000bp immediately upstream of the start codon of each open reading frame (ORF). Note that for all applications, users have the option to further restrict the search space if they wish. The sequences were downloaded from the Saccharomyces Genome Database [23].

3.5.3. TFBS prediction

For the metazoan species, we search for matches to TFBS profiles contained in the JASPAR CORE and JASPAR PhyloFACTS database collections [24, 25]. Additionally, we include a set of profiles compiled for C. elegans TFs from literature review for Worm SSA (Table S2 available at http://cisreg.ca/oPOSSUM/data/).
Binding sites are predicted for the sequences using the TFBS suite of Perl modules for regulatory sequence analysis [26]. A predicted binding site for a given TF model is reported if the site occurs in the promoters of both orthologs above a threshold PSSM score of 70% and at equivalent positions in the alignment. Overlapping sites for the same TF are filtered such that only the highest scoring motif is kept. The genomic location, profile score, motif orientation, and local sequence conservation level of each TFBS match in orthologous genes are stored in the respective species databases. For S. cerevisiae, we compiled a collection of yeast-specific TFBS motifs from both the Yeast Regulatory Sequence Analysis (YRSA) system [27] and the literature (Table S3 available at http://cisreg.ca/oPOSSUM/data/), and record the genomic location, profile score and motif orientation for each prediction.

Based on the observation that members of the same structural family of TFs often bind to similar sequences, plant and insect matrices are available for inclusion in the analysis. The MADS family of TFs is an excellent example of conservation of binding domains between plants and vertebrates [28, 29], and there are numerous examples of conservation of binding domains across vertebrates, flies and worms. Thus, in cases where a profile for the TF of interest is not available in the database, oPOSSUM can still provide insights into the underlying regulation by suggesting a particular TF family that may be involved.

Table 3.1.
oPOSSUM results for human FoxM1-regulated gene cluster

| JASPAR CORE TF | Class | IC | Target gene hits | Background TFBS rate | Target TFBS rate | Z-score | Fisher Score |
| HNF4 | Nuclear | 9.62 | 13 | 0.0054 | 0.0085 | 7.19 | 2.64E-02 |
| Fos | bZIP | 10.67 | 15 | 0.0111 | 0.0146 | 5.72 | 4.29E-01 |
| Pbx | Homeo | 14.64 | 5 | 0.0019 | 0.0033 | 5.57 | 3.10E-01 |
| FOXI1 | Forkhead | 13.18 | 16 | 0.0153 | 0.0186 | 4.49 | 9.05E-02 |
| RORA1 | Nuclear Receptor | 17.42 | 4 | 0.0020 | 0.0029 | 3.54 | 5.04E-01 |
| TAL1-TCF3 | bHLH | 14.07 | 12 | 0.0052 | 0.0066 | 3.30 | 5.88E-02 |
| Staf | Zn-Finger, C2H2 | 17.54 | 3 | 0.0014 | 0.0021 | 3.16 | 3.03E-01 |
| Foxa2 | Forkhead | 12.43 | 13 | 0.0152 | 0.0174 | 3.04 | 4.83E-01 |
| Foxd3 | Forkhead | 12.94 | 13 | 0.0172 | 0.0194 | 2.93 | 5.27E-01 |
| TEAD | TEA | 15.67 | 6 | 0.0028 | 0.0037 | 2.850 | 4.70E-01 |

Table 3.2. oPOSSUM results for c-Fos-regulated gene cluster

| JASPAR PhyloFACTS | Similar To | IC | Target gene hits | Background TFBS rate | Target TFBS rate | Z-score | Fisher Score |
| TGANTCA | AP-1 | 12.06 | 46 | 0.0011 | 0.0023 | 18.05 | 1.40E-04 |
| GGGYGTGNY | - | 14.18 | 82 | 0.0059 | 0.0083 | 15.64 | 4.98E-02 |
| TGASTMAGC | NF-E2 | 16.60 | 43 | 0.0013 | 0.0024 | 15.64 | 1.19E-03 |
| GGARNTKYCCA | - | 17.13 | 44 | 0.0016 | 0.0026 | 12.54 | 1.11E-03 |
| GGGAGGRR | MAZ | 14.00 | 111 | 0.0171 | 0.0202 | 11.98 | 3.16E-01 |

Table 3.3. oPOSSUM results for skeletal muscle genes identified by Moran et al. and Tomczak et al.

| TF name (Class ID) | TF class name | TF name (Class ID) | TF class name | Score |
| MEF2A (class 4) | MADS | Myf (class 22) | bHLH | 1.65E-06 |
| MEF2A (class 4) | MADS | ZNF42_1-4 (class 25) | Zn-finger, C2H2 | 4.24E-06 |
| Myf (class 22) | bHLH | SRF (class 1) | MADS | 2.52E-05 |
| SP1 (class 31) | Zn-finger, C2H2 | SRF (class 1) | MADS | 2.68E-05 |
| Agamous (class 1) | MADS | MEF2A (class 4) | MADS | 7.63E-05 |

Table 3.4. oPOSSUM results for worm skeletal muscle genes using worm profiles

| Worm | Status | IC | Target gene hits | Background TFBS rate | Target TFBS rate | Z-score | Fisher Score |
| Muscle1 | Putative | 11.34 | 6 | 0.0025 | 0.0156 | 20.56 | 4.24E-04 |
| Muscle2 | Putative | 11.97 | 4 | 0.0022 | 0.0089 | 11.19 | 1.39E-02 |
| LIN-14 | Putative | 9.13 | 9 | 0.0143 | 0.0280 | 9.116 | 1.17E-01 |
| Muscle3 | Putative | 16.67 | 4 | 0.0029 | 0.0064 | 5.02 | 6.96E-02 |

Table 3.5. oPOSSUM results for the yeast CLB2 gene cluster

| YEAST TF | Class | IC | Target gene hits | Background TFBS rate | Target TFBS rate | Z-score | Fisher Score |
| ECB | Unclassified | 16.65 | 13 | 0.0019 | 0.0131 | 32.87 | 8.68E-09 |
| MCM1 | MADS | 9.15 | 10 | 0.0073 | 0.0165 | 13.71 | 1.08E-02 |
| FKH1 | Forkhead | 13.28 | 30 | 0.0305 | 0.0473 | 12.26 | 4.05E-02 |
| CCA | Unclassified | 16.93 | 3 | 0.0017 | 0.0040 | 7.08 | 2.02E-01 |
| LYS14 | C6_Zinc finger | 17.02 | 6 | 0.0030 | 0.0053 | 5.20 | 9.41E-02 |

Figure 3.1. Determination of one-to-one orthologs for human and mouse genes.

An initial set of homologs was downloaded from EnsEMBL v41 [30]. All homologs annotated as “one2one” are extracted. To select the closest putative ortholog pairs from homologs with “one2many” or “many2many” relationships, we check for upstream conservation using the whole-genome human-mouse alignments [17]. We re-annotate unambiguously aligned homologs as putative one-to-one orthologs, adding 195 gene pairs to our set, and bringing the total number of orthologs to 15162.

Figure 3.2. Identification of transcription start regions (TSRs) using a combination of EnsEMBL annotations and CAGE data

To improve our alignments, we determine putative alternative TSSs for the human and mouse genes. For each gene, the entire repertoire of transcripts from both EnsEMBL core genes and EST genes are retrieved. The TSSs for all transcripts are recorded, followed by a clustering step such that TSSs within 500bp of one another are merged to form a transcriptional start region (TSR). For each TSR containing a transcript annotated as “known” or “novel”, we accept the TSR as is. For TSRs based solely on EST gene transcripts, we require a minimum of 5 CAGE tags as evidence for transcription initiation.

Figure 3.3. oPOSSUM Human SSA website screenshots

(A) A screenshot of the output of the oPOSSUM Human SSA analysis, with TFBS profiles ranked by Z-score.
The arrows allow the user to sort and re-order the results by Fisher score, TF name, TF class, TF supergroup, or TF profile information content (IC). Each TF name links to a pop-up window displaying the TFBS profile information. (B) Pop-up window displaying genes that contain a particular TFBS (in this case, MEF2A; partial list shown), as well as the promoter coordinates associated with each gene, and the motif locations and scores. Sites in overlapping alternative promoters are highlighted for emphasis. Such sites are only counted once in the statistical analysis.

3.6. References

1. Ho Sui SJ, Mortimer JR, Arenillas DJ, Brumm J, Walsh CJ, Kennedy BP, Wasserman WW: oPOSSUM: identification of over-represented transcription factor binding sites in co-expressed genes. Nucleic Acids Res 2005, 33(10):3154-3164.
2. Huang SS, Fulton DL, Arenillas DJ, Perco P, Ho Sui SJ, Mortimer JR, Wasserman WW: Identification of Over-represented Combinations of Transcription Factor Binding Sites in Sets of Co-expressed Genes. Advances in Bioinformatics & Computational Biology 2006, 3:247-256.
3. Urnov FD, Rebar EJ: Designed transcription factors as tools for therapeutics and functional genomics. Biochem Pharmacol 2002, 64(5-6):919-923.
4. Luscombe NM, Austin SE, Berman HM, Thornton JM: An overview of the structures of protein-DNA complexes. Genome Biol 2000, 1(1):REVIEWS001.
5. Sandelin A, Wasserman WW: Constrained Binding Site Diversity within Families of Transcription Factors Enhances Pattern Discovery Bioinformatics. J Mol Biol 2004, 338(2):207-215.
6. Wonsey DR, Follettie MT: Loss of the forkhead transcription factor FoxM1 causes centrosome amplification and mitotic catastrophe. Cancer Res 2005, 65(12):5181-5189.
7. Lin J, Tarr PT, Yang R, Rhee J, Puigserver P, Newgard CB, Spiegelman BM: PGC-1beta in the regulation of hepatic glucose and energy metabolism. J Biol Chem 2003, 278(33):30843-30848.
8. Ordway JM, Williams K, Curran T: Transcription repression in oncogenic transformation: common targets of epigenetic repression in cells transformed by Fos, Ras or Dnmt1. Oncogene 2004, 23(21):3737-3748.
9. Daftari P, Gavva NR, Shen CK: Distinction between AP1 and NF-E2 factor-binding at specific chromatin regions in mammalian cells. Oncogene 1999, 18(39):5482-5486.
10. Moran JL, Li Y, Hill AA, Mounts WM, Miller CP: Gene expression changes during mouse skeletal myoblast differentiation revealed by transcriptional profiling. Physiol Genomics 2002, 10(2):103-111.
11. Tomczak KK, Marinescu VD, Ramoni MF, Sanoudou D, Montanaro F, Han M, Kunkel LM, Kohane IS, Beggs AH: Expression profiling and identification of novel genes involved in myogenic differentiation. Faseb J 2004, 18(2):403-405.
12. Wasserman WW, Fickett JW: Identification of regulatory regions which confer muscle-specific gene expression. J Mol Biol 1998, 278(1):167-181.
13. GuhaThakurta D, Schriefer LA, Waterston RH, Stormo GD: Novel transcription regulatory elements in Caenorhabditis elegans muscle genes. Genome Res 2004, 14(12):2457-2468.
14. Rusconi JC, Corbin V: Evidence for a novel Notch pathway required for muscle precursor selection in Drosophila. Mech Dev 1998, 79(1-2):39-50.
15. Wittenberger T, Steinbach OC, Authaler A, Kopan R, Rupp RA: MyoD stimulates delta-1 transcription and triggers notch signaling in the Xenopus gastrula. Embo J 1999, 18(7):1915-1922.
16. Kreiman G: Identification of sparsely distributed clusters of cis-regulatory elements in sets of co-expressed genes. Nucleic Acids Res 2004, 32(9):2889-2900.
17. Schwartz S, Kent WJ, Smit A, Zhang Z, Baertsch R, Hardison RC, Haussler D, Miller W: Human-mouse alignments with BLASTZ. Genome Res 2003, 13(1):103-107.
18. Blumenthal T, Evans D, Link CD, Guffanti A, Lawson D, Thierry-Mieg J, Thierry-Mieg D, Chiu WL, Duke K, Kiraly M, Kim SK: A global analysis of Caenorhabditis elegans operons. Nature 2002, 417(6891):851-854.
19. O'Brien KP, Remm M, Sonnhammer EL: Inparanoid: a comprehensive database of eukaryotic orthologs. Nucleic Acids Res 2005, 33(Database issue):D476-D480.
20. Stein L, Sternberg P, Durbin R, Thierry-Mieg J, Spieth J: WormBase: network access to the genome and biology of Caenorhabditis elegans. Nucleic Acids Res 2001, 29(1):82-86.
21. Zhang MQ: Promoter analysis of co-regulated genes in the yeast genome. Comput Chem 1999, 23(3-4):233-250.
22. Tavazoie S, Hughes JD, Campbell MJ, Cho RJ, Church GM: Systematic determination of genetic network architecture. Nat Genet 1999, 22(3):281-285.
23. Cherry JM, Adler C, Ball C, Chervitz SA, Dwight SS, Hester ET, Jia Y, Juvik G, Roe T, Schroeder M, Weng S, Botstein D: SGD: Saccharomyces Genome Database. Nucleic Acids Res 1998, 26(1):73-79.
24. Vlieghe D, Sandelin A, De Bleser PJ, Vleminckx K, Wasserman WW, van Roy F, Lenhard B: A new generation of JASPAR, the open-access repository for transcription factor binding site profiles. Nucleic Acids Res 2006, 34(Database issue):D95-97.
25. Sandelin A, Alkema W, Engström P, Wasserman WW, Lenhard B: JASPAR: an open-access database for eukaryotic transcription factor binding profiles. Nucleic Acids Res 2004, 32(Database issue):D91-94.
26. Lenhard B, Wasserman WW: TFBS: Computational framework for transcription factor binding site analysis. Bioinformatics 2002, 18(8):1135-1136.
27. Sandelin A, Hoglund A, Lenhard B, Wasserman WW: Integrated analysis of yeast regulatory sequences for biologically linked clusters of genes. Funct Integr Genomics 2003, 3(3):125-134.
28. Alvarez-Buylla ER, Pelaz S, Liljegren SJ, Gold SE, Burgeff C, Ditta GS, Ribas de Pouplana L, Martinez-Castilla L, Yanofsky MF: An ancestral MADS-box gene duplication occurred before the divergence of plants and animals. Proc Natl Acad Sci U S A 2000, 97(10):5328-5333.
29. Shore P, Sharrocks AD: The MADS-box family of transcription factors. Eur J Biochem 1995, 229(1):1-13.
30. Hubbard T, Barker D, Birney E, Cameron G, Chen Y, Clark L, Cox T, Cuff J, Curwen V, Down T, Durbin R, Eyras E, Gilbert J, Hammond M, Huminiecki L, Kasprzyk A, Lehvaslaiho H, Lijnzaad P, Melsopp C, Mongin E, Pettett R, Pocock M, Potter S, Rust A, Schmidt E, Searle S, Slater G, Smith J, Spooner W, Stabenau A et al: The Ensembl genome database project. Nucleic Acids Res 2002, 30(1):38-41.

4. TFCat: The Curated Catalog of Mouse and Human Transcription Factors12

4.1. Chapter preamble

In chapters 2 and 3, I described the design and development of an algorithm that identifies over-represented combinations of predicted transcription factor binding site (TFBS) elements. The identification of the TF proteins that bind the predicted CRM DNA elements is an essential step in defining gene regulatory mechanisms. Importantly, transcription factor (TF) families (with similar DNA binding structures) may bind similar DNA elements. A comprehensive inventory of TFs for mouse and/or human is necessary for the study of all members of each family. This need led us to establish a curated catalogue of mouse and human TFs (called TFCat) and to assign assessed DNA-binding TFs to an extended structural classification system. Homology analysis was used to expand the set of TFs with unannotated genes.

4.2. Introduction

The functional properties of cells are determined in large part by the subset of genes that they express in response to physiological, developmental and environmental stimuli. The coordinated regulation of gene transcription, which is critical in maintaining this adaptive capacity of cells, relies on proteins called transcription factors (TF) that control profiles of gene activity and regulate many different cellular functions by interacting directly with DNA [1, 2] and with non-DNA binding accessory proteins [3, 4].

12 A version of this chapter has been published. Fulton DL, Sundararajan S, Badis G, Hughes TR, Wasserman WW, Roach JC, Sladek R.
(2009) TFCat: the curated catalog of mouse and human transcription factors. Genome Biology, 10(3):R29.

While the biochemical properties and regulatory activities of both DNA-binding and accessory TFs have been experimentally characterized and extensively documented (for example, in textbooks devoted to TFs [5, 6]), a well-validated and comprehensive catalog of TFs has not been assembled for any mammalian species. Many gene transcription studies have linked the subset of TFs that bind specific DNA sequences to the activation of individual genes and, more recently, these have been pursued on a genome-wide basis using high-throughput laboratory studies (for example by performing chromatin-immunoprecipitation) as well as computational analyses (for example by identifying over-represented DNA motifs within promoters of co-expressed genes). To facilitate such efforts, inventories of TFs have been assembled for Drosophila and Caenorhabditis species as well as for specific sub-families of mammalian TFs (Table 4.1). Since only a limited number of protein structures can mediate high-affinity DNA interactions, collections of TF subfamilies have been constructed using predictive sequence-based models for DNA-binding domains [7-10]. For example, the PFAM Hidden Markov Model (HMM) database [11] and Superfamily HMMs [12] have been applied to sets of peptide sequences to identify nearly 1900 putative TFs in the human genome [10] and over 750 fly TFs, of which 60% were well-characterized site-specific binding proteins [13]. While these collections have emphasized DNA binding proteins, recent evidence suggests that the contributions of accessory TFs may be equally or more important in establishing the spatio-temporal regulation of gene activity. For example, microarray-based chromatin immunoprecipitation studies have highlighted the key regulatory contributions of histone modifying TFs to the control of gene expression [14].
Therefore, any comprehensive study of TFs must extend beyond a narrow focus on DNA binding proteins to serve as a foundation for regulatory network analyses.

The four research laboratories contributing to this report were originally pursuing parallel efforts to compile reference collections of bona fide mammalian TFs. In order to maximize the quality and breadth of our gene curation, we combined our efforts to create a single, literature-based catalog of mouse and human TFs (called TFCat). The collection of annotations is based on published experimental evidence. Each TF gene was assigned to a functional category within a hierarchical classification system based on evidence supporting DNA binding and transcriptional activation functions for each protein. DNA-binding proteins were categorized using an established structure-based classification system [15]. A blind, random sample of the functional assessments provided by each expert was used to assess the quality of the gene annotations. The evidence-based subset of TFs was used to computationally predict additional un-annotated genes likely to encode TFs. The resulting collection is available for download from the TFCat portal and is also accessible via a wiki to encourage community input and feedback to facilitate continuous improvement of this resource.

4.3. Results

4.3.1. TF gene candidate selection, the annotation process, and quality assurance

Prior to the initiation of the TFCat collaboration, each of the four participating laboratories constructed mouse TF datasets using manual text-mining and computation-based approaches. As each dataset was created specifically to suit the needs of the research lab that generated it, combinations of overlapping and distinct procedures were applied to collect and filter each dataset (Figure S1 in Appendix 1). These four independently established putative TF datasets laid the foundation for this joint initiative.
To ensure the comprehensiveness and utility of our reference collection, we broadly defined a TF as any protein directly involved in the activation or repression of the initiation of synthesis of RNA from a DNA template. Incorporating this standard, the union of the four sets yielded 3230 putative mouse TFs (referred to as the UPTFs). As complete manual curation of all literature to evaluate TFs is not practical, our curation efforts were prioritized to maximize the number of reviews conducted for UPTFs linked to papers. A manual survey of PubMed abstracts was performed, using available gene symbol identifiers and aliases, to identify genes for which experimental evidence of TF function might exist. Since standardized naming conventions have not been fully applied in the older literature, the associations between abstracts and genes may be incomplete or inaccurate due to the redundant use of the same identifiers for two or more genes. In addition, we did not consider abstracts that made no mention of the gene identifiers of interest or those that, by their description, were unlikely to have conducted transcription regulation-related analyses. From this list of 3230 putative mouse TFs, coarse precuration identified 1200 putative TFs with scientific papers describing their biochemical or gene regulatory activities in the PubMed database [16]. The majority of predicted TFs (2030 of 3230) had no substantive literature evidence supporting their molecular function. The remaining 1200 Transcription Factor Candidates (TFCs) were prioritized for expert annotation. Genes belonging to the TFC set that were associated with two or more papers in PubMed were selected and randomly assigned for evaluation by one or more of 17 participating reviewers.
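The selection and assignment step described above can be sketched in a few lines of Python. This is a minimal illustration only, not the procedure actually used for TFCat: the gene names, paper counts, reviewer labels, and the function name `assign_reviews` are all hypothetical, and the sketch assigns each gene to exactly one reviewer, whereas the text notes that some genes were given to more than one.

```python
import random

def assign_reviews(paper_counts, reviewers, min_papers=2, seed=0):
    """Pick candidates with at least `min_papers` PubMed papers and
    assign each one to a randomly chosen reviewer (hypothetical helper)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    selected = sorted(g for g, n in paper_counts.items() if n >= min_papers)
    return {gene: rng.choice(reviewers) for gene in selected}

# Hypothetical candidate genes with the number of associated papers.
paper_counts = {"GeneA": 5, "GeneB": 1, "GeneC": 2, "GeneD": 0}
# 17 reviewers, as stated in the text; the labels are placeholders.
reviewers = [f"reviewer_{i:02d}" for i in range(1, 18)]

assignments = assign_reviews(paper_counts, reviewers)
for gene, reviewer in assignments.items():
    print(gene, "->", reviewer)
```

Genes below the paper threshold (here `GeneB` and `GeneD`) are simply left out of the assignment, mirroring the prioritization of candidates with literature support.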
Gene annotations were primarily performed by a single reviewer, with the exception of 20 genes assigned to multiple reviewers for initial training purposes and 50 genes assigned to pairs of reviewers for a quality assurance assessment. In total, 1058 genes (Table 4.2) have been reviewed. For each candidate, a TF confidence judgment was assigned (Table 4.3) based on the literature surveyed. Annotation of each TFC required evidence of transcriptional regulation and/or DNA-binding (e.g. a reporter gene assay and/or DNA-binding assay). A text summary of the experimental evidence was extracted and entered by the reviewer, along with the PubMed ID, the species under study, and the reviewer’s perception of the strength of the evidence supporting their judgment. Although reviewers were not obligated to continue beyond two types of experimental support, they were encouraged to review multiple papers where feasible. Based on their literature review, annotators were required to classify their determination of each TFC into a positive (TF Gene or TF Gene Candidate), neutral (no data or conflicting data) or negative group (not a TF or likely not). Of the 1058 TFCs reviewed, 83% were found to have sufficient experimental evidence to be classified either as a TF Gene or as a TF Gene Candidate. To simplify data collection and curation, we focused our literature evidence collection and annotation efforts on mouse genes. However, literature pertaining to mouse genes and their human (or other mammalian) orthologs was used interchangeably as evidence for the annotations. Roughly 83% of the annotation literature evidence surveyed was based on a combination of mouse and human data, with roughly equal numbers of papers pertaining to each of these species. Mouse TF genes were associated with their putative human ortholog using the NCBI’s HomoloGene resource [16]. With the exception of 40 mouse genes, putative ortholog pairs were matched using defined HomoloGene groups.
All but 13 of the remaining 40 were mapped using ortholog relationships in the Mouse Genome Database [17]. Each gene’s predicted human ortholog is included in the download data and in the published wiki data.

Depending upon the subset of available papers reviewed for a given TFC, two curators could arrive at different judgments. To ascertain the consistency and quality of our reviewing approach and judgment decisions, we randomly selected 50 genes for re-review and assigned each to a second expert (Table S1 and Table S2 in Appendix 1). Out of the 100 annotations (2 reviews each for 50 genes), 37 paired gene judgments (74 annotations) were concordant and 13 paired gene judgments (26 annotations) were discordant. Examination of the discordant pairs suggested that review of different publications may have produced the disagreement in annotation. To further evaluate this assumption, we extracted a non-QA sample of multiple annotations where different reviewers curated the same genes or gene family members using the same articles (Table S3 in Appendix 1) and found that these curation judgments were in perfect agreement. Under the assumption that judgment conflicts identified in the QA sample would be resolved in favour of one of the assigned judgment calls, we conclude that 13% of judgments may be altered after additional annotation, suggesting that a system to enable continued review would be beneficial. Since mouse and human TFs have been evolutionarily conserved among distantly related species [18], we assessed the coverage of our curated TF collection by comparing it with a list of expert annotated fly TFs documented in the FlyTF database [13]. Over half (443/753) of the FlyTF genes were found in NCBI HomoloGene groups, producing 184 fly TF-containing clusters that also contained mouse homologs. More than 85% (164/184) of these homologous TF genes were in the UPTF set.
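The quality assurance and coverage figures quoted above follow from simple arithmetic, which the short sketch below re-derives. The counts are taken from the text; the variable names are illustrative, and the step from 13 discordant pairs to 13% of annotations relies on the stated assumption that each conflict is resolved in favour of one of the two existing calls, so that exactly one annotation per discordant pair changes.

```python
from fractions import Fraction

# QA sample: 50 genes, each reviewed by two experts.
pairs_total = 50
pairs_concordant = 37
pairs_discordant = pairs_total - pairs_concordant   # 13 pairs
annotations_total = 2 * pairs_total                 # 100 annotations

# Assumption from the text: each discordant pair is resolved in favour of
# one of its two calls, so one annotation per discordant pair is altered.
annotations_altered = pairs_discordant
altered_rate = Fraction(annotations_altered, annotations_total)

# Coverage check against FlyTF: homologous TF genes found in the UPTF set.
flytf_coverage = Fraction(164, 184)

print(f"discordant pairs: {pairs_discordant}")
print(f"annotations possibly altered: {float(altered_rate):.0%}")
print(f"FlyTF homolog coverage in UPTF: {float(flytf_coverage):.1%}")
```

Using `Fraction` keeps the ratios exact until they are formatted for display, which avoids rounding surprises when comparing against the percentages in the text.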
Inspection of the twenty putative mouse homologs of fly TFs absent from the UPTF set led to the inclusion of five genes in both the UPTF and the TFC sets for future curation, while there were no published studies involving the mammalian proteins for the remaining fifteen genes. We also assessed TFCat’s coverage by comparing it with a classic collection of TFs prepared prior to the completion of the mouse genome [6]. After mapping 506 TFs to Entrez Gene identifiers, we found that 463 were present in the UPTF and 423 were members of the TFC gene list. The remaining 43 genes were added to the UPTF and the TFC list was extended to include 83 additional genes. From these analyses, we conclude that TFCat contains a large majority of known TFs.

4.3.2. Identification and classification of DNA binding proteins

Genes positively identified as TFs were categorized using a taxonomy to document their functional properties identified in the literature review (Table 4.4). Notably, 65% (571/882) of the genes judged as TFs were reported to act through a DNA binding mechanism and 94% (535/571) of these DNA-binding TFs were found to act through sequence-specific interactions mediated by a small number of protein structural domains (Table 4.5). Members of a DNA-binding TF family share strongly conserved DNA binding domains that, in most cases, have overlapping affinity for DNA-sequences; therefore a prediction of a TF binding site can suggest a role for the family but does not implicate specific family members. As such, a TF DNA-binding classification system is an essential resource for many promoter sequence analyses in which researchers should prioritize potential trans-acting candidates from a set of equally suitable candidate TFs within a structural class.
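The point that a predicted binding site implicates a family rather than a single TF can be illustrated with a small lookup. The sketch below is an assumption-laden toy, not part of TFCat: the dictionary holds only a handful of profile names and class labels that appear in the oPOSSUM result tables of chapter 3, and the function name `candidates_for_site` is hypothetical.

```python
# Toy mapping from TF name to structural class, using a few class labels
# reported in the chapter 3 result tables; a real analysis would draw this
# mapping from the full DNA-binding classification.
TF_CLASS = {
    "FOXI1": "Forkhead",
    "Foxa2": "Forkhead",
    "Foxd3": "Forkhead",
    "MEF2A": "MADS",
    "SRF": "MADS",
    "Myf": "bHLH",
}

def candidates_for_site(profile_tf, tf_class):
    """Return every TF sharing the structural class of the matched profile.

    A hit for one profile is treated as evidence for the whole family,
    because family members often bind similar sequences.
    """
    matched_class = tf_class[profile_tf]
    return sorted(tf for tf, cls in tf_class.items() if cls == matched_class)

# A predicted Foxa2 site yields all Forkhead members as candidates.
print(candidates_for_site("Foxa2", TF_CLASS))
```

The returned list is the set of equally plausible trans-acting candidates; choosing among them requires additional evidence such as expression data.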
Capitalizing on large-scale computational efforts for the prediction of protein domains [11, 12, 19-21], we analyzed each of the TFCat DNA-binding TF protein sequences with the full set of PFAM and Superfamily HMM domain models to predict DNA-binding domain (DBD) structures. A total of 20 Superfamily structure types were identified in our set, along with 54 PFAM DBD models (Table S4 in Appendix 1). Where possible, we linked each double-stranded DNA-binding TF to a family within an established DNA-binding structural classification system [15] that was developed initially to organize the DNA-bound protein crystal structures found in the Protein Data Bank (PDB) [22]. In light of more recent studies, along with a modification of classification requirements (see Methods), an additional set of 16 DBD family classes was added to the system to map domain structures (Table S5 in Appendix 1). The DNA binding domain analysis offers some noteworthy observations. The homeodomain-containing genes are prominently represented in our set, comprising 24% (131/545) of the classified DBD TFs and 16% of all predicted domain occurrences. The beta-beta-alpha zinc-finger and helix-loop-helix TF families account for 14% (79/545) and 13% (71/545) of the classified genes, respectively. Given the abundance of zinc-finger proteins in the eukaryotic genomes [23] and recent predictions that this DNA-binding structure makes up a significant portion of all TFs [10], this class may be under-represented. On the other hand, since zinc-finger containing genes are involved in a wide variety of functions, the number of predicted zinc-finger proteins that possess a TF role may be overestimated. In addition, it is likely that certain families of TFs, with central roles in well-studied areas of biology, have been more widely covered in the literature, which may account for the prevalence of literature support for homeodomain TFs.
The majority (392/545) of the classified DBD TFs in our list contain a single DNA interaction domain; however, a notable portion (145/545) of genes belonging to just a few protein families contain more than one instance of their designated DBD structure. These multiple instances predominantly reside in TFs containing zinc-finger, helix-turn-helix, and leucine zipper domains (Table S6 in Appendix 1). While most TFs contained single or multiple copies of a single DNA binding motif, our predictions identified eight TFs with two distinct DBDs (Table S7 in Appendix 1). We removed the second zinc finger-type domain prediction for two of the genes, Atf2 and Atf7, as this domain is characterized as a transactivation domain in Atf2 [24] and may have a similar function in family member Atf7. All other predicted gene domains were retained, based on literature that supported their activity or failed to support their removal. Four PFAM DBD domain models detected in eight proteins are not represented by a solved structure and, therefore, could not be directly appointed in the classification system (see Table 4.5 – Protein Group 999). In addition, three NFI proteins were annotated with DNA-binding evidence and predicted to contain a SMAD MH1 DNA-binding domain. Interestingly, a recent study noted that the DNA-binding domains of NFI and SMAD-MH1 share significant sequence similarity [25]. These TFs were also assigned to their own family in the unclassified protein group (see Table 4.5 and Table S5 in Appendix 1 – Protein Group: 999 and Protein Family 905). A group of ten literature-based DNA-binding TFs had no predicted DBD domains (Table S8 in Appendix 1). The absence of detected DNA-binding domains may be due, in part, to the limited sensitivity of the models. For example, the Tcf20 gene (alias Spbp) purportedly contains a novel type of DNA-binding domain with an AT hook motif [26] which was not predicted by the corresponding AT hook PFAM model.
Restricted model representation is also likely the reason for the missing domain predictions of the C4 zinc finger domain in the Nr0b1 gene and the basic helix-loop-helix (bHLH) domain in the Spz1 gene. Similarly, four DBDs detected with protein group class-level Superfamily models (specifically for zinc coordinating and helix-turn-helix models) could not be further delineated to a protein family level assignment (Table S9 in Appendix 1), suggesting that their sequences deviate from the family-specific properties represented in PFAM. It is quite possible that domains involved in DNA binding by human and mouse TFs remain to be discovered. Most TF DNA-protein interactions occur when the DNA is in a double stranded (dsDNA) state; however, a small number of TF proteins preferentially bind single-stranded DNA (ssDNA) [27, 28]. We identified in the literature review a set of sixteen single-stranded DNA-binding TFs, of which twelve contain HMM-predicted protein domains that are characterized as single-stranded RNA-DNA-binding (Table S10 in Appendix 1). There may be other DBD TFs in our list that act on both ssDNA and dsDNA but were not classified in the ssDNA DBD taxonomy because this property was not specifically characterized in the literature reviewed. The distinction and overlap between ssDNA and dsDNA binding TFs warrants future attention.

4.3.3. Generation and assessment of mouse-human TF homology clusters to predict additional putative TFs

Since a transcriptional role can be inferred for closely related TF homologs [7, 29-31], researchers interested in the analysis of gene regulatory networks would benefit from access to a broad data collection of both experimentally validated TFs and their homologs. The curated TF gene list was used to identify putative mouse TF homologs, in the genome-wide RefSeq collection, that have not yet been annotated in our catalog or that were not evaluated because they lack PubMed literature evidence.
While sequence homology is often used in preliminary analyses to infer similar protein structure and function, its success may be limited when similar protein structures have low sequence similarity [32] or short homologous protein domains. Based on recent evidence that over 15% of predicted domain families have an average length of 50 amino acids or less [33], we evaluated whether pruning BLAST-derived clusters using a previously published sequence similarity metric [34] could be further improved by explicitly including domain information. Our evaluation of both pruning methods indicated that the inclusion of domain knowledge improved homolog cluster content (Figures S2 and S3 in Appendix 1). We therefore incorporated both domain structure predictions, using HMMs, and sequence similarity in our homology-based approach to predict additional TF genes. The homolog prediction and clustering process yielded 227 homolog clusters containing 3561 genes (3419 unique genes). The vast majority of the genes (3284/3561) are associated with only one cluster each; however, 128 genes were members of two clusters and 7 genes were present in three clusters. We also identified 72 single gene clusters (singletons), which included 36 TF genes that had only significant BLAST matches to themselves, 12 genes that derived BLAST hits which did not satisfy the homolog candidate cut-offs, 21 genes with cluster members that did not satisfy the pruning criteria, and 3 genes that had no RefSeq model sequence. While our TF-seeded homology inference analysis used cut-offs that likely pruned some false negatives, in an effort to emphasize specificity, it is likely that these singletons represent TFs that share common protein structural features with low sequence similarity. The curated TF set contains some proteins with properties not commonly associated with TF function.
For example, our catalog included the cyclin dependent kinases (cdk7, cdk8, and cdk9), which are reported to directly activate gene transcription (for a review see Malumbres et al. [35]). Consequently, the homolog analysis of TFs identified numerous other protein kinases that will likely have no direct involvement in transcription. Similarly, larger clusters seeded by TFs containing other domains not frequently associated with transcription, such as calcium-binding, ankyrin repeats, armadillo repeats, dehydrogenase, and WD40, also attracted false TF predictions. To assign a quantitative confidence metric for the large clusters of TF predictions, we developed a scoring procedure based on protein domain associations to TF activity annotations from the Gene Ontology (GO) Molecular Function sub-tree [36]. The cluster confidence metric was employed using a four-tier ranking system for clusters containing more than ten gene members (42 out of 227 homolog clusters). The majority of these clusters (52% or 22 clusters) received high scores indicating that they contain a high proportion of TF genes. Given that GO currently annotates only 39% of the TF genes in our catalog in the TF activity node in the Molecular Function sub-tree (Table S11 in Appendix 1), we expect that less frequently occurring protein domains found in small homolog clusters may not yet be represented in GO. Therefore, we did not analyze clusters containing fewer than ten members and we anticipate future refinements in the homolog cluster confidence rankings as TF gene annotation is expanded in GO. We incorporated our curated set and cluster counts in an analysis to estimate both the total number of TFs and, as a smaller sub-set, the number of double-stranded DNA-binding proteins (see Methods). The cluster counts were adjusted using the observed approximate mean TF (OAMTF) proportions associated with each rank level (see Table 4.6) to account for false positives.
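One plausible reading of this adjustment is a weighted sum: each cluster contributes its not-yet-curated members scaled by the TF proportion attached to its confidence rank, and the result is added to the curated count. The sketch below shows that reading only. It is an assumption, not the procedure documented in Methods; the function name, the rank proportions, and the cluster sizes are invented for illustration and do not come from Table 4.6.

```python
def estimate_tf_count(curated_tfs, clusters, tf_proportion_by_rank):
    """Estimate the total TF count under the assumed weighted-sum reading.

    curated_tfs: number of TFs already confirmed by curation.
    clusters: list of dicts with a confidence 'rank' and the number of
        'uncurated_members' predicted by homology.
    tf_proportion_by_rank: assumed mean proportion of true TFs per rank.
    """
    predicted = sum(
        c["uncurated_members"] * tf_proportion_by_rank[c["rank"]]
        for c in clusters
    )
    return curated_tfs + predicted

# Hypothetical inputs: two clusters and made-up proportions for a
# four-tier ranking (rank 1 = highest confidence).
proportions = {1: 0.9, 2: 0.6, 3: 0.3, 4: 0.1}
clusters = [
    {"rank": 1, "uncurated_members": 20},
    {"rank": 4, "uncurated_members": 40},
]
estimate = estimate_tf_count(curated_tfs=10, clusters=clusters,
                             tf_proportion_by_rank=proportions)
print(round(estimate, 1))
```

Down-weighting low-rank clusters is what keeps large kinase-like clusters, seeded by a few atypical TFs, from inflating the total.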
From this mouse RefSeq-based analysis, we arrived at an estimate of 2355 DNA-binding and accessory TFs. Since peptide sequence-dependent analyses can result in both omissions and false predictions of homologous protein structures, readers should regard this figure as a “best-guess” approximation [32]. A similar analysis conducted over the homolog clusters containing double-stranded DNA-binding TFs resulted in an estimate of 1510 DNA-interacting TFs. We also performed an extraction of DBD-containing genes from the Ensembl database using the DNA-binding domains defined in TFCat. This analysis derived a list of 1507 putative DNA-binding TFs. These estimates agree well with earlier publications [10, 37, 38].

4.3.4. Maintenance and access of TFCat annotation data

All gene annotations, mouse homolog clusters and human orthologs are published in the TFCatWiki, which is accessible from the TFCat portal. Each wiki article page houses the annotation information for one gene with its content secured against modification. Each gene article page is associated with a discussion page, which is available for comments and feedback by all wiki users. Wiki users can specify that they wish to receive periodic e-mail notification of lists of gene wiki pages and their associated discussion pages that have been updated. Semantic features and functional capabilities are included in the wiki implementation to facilitate easy access to all gene annotation data. We established a TFCat Annotation Feedback System workflow process (Figure S4 in Appendix 1) to encourage continuous improvement of the catalogued gene entries. An issue tracking management system is integrated with the wiki to capture, queue, and track feedback contributions for follow-up by the wiki annotator. Wiki users may view a gene’s feedback report summaries and current workflow status, through an inquiry made available on each gene’s article page.
Gene annotation changes, entered through our internally-accessible TFCat annotation system, will be flagged and forwarded to the wiki through an automated updating process. Community members who wish to directly contribute to the wiki contents through the backend web application (Figure S5 in Appendix 1) may contact the authors. The complete TF catalog resource can be downloaded from our website [39]. The website application enables download of the complete list or a subset of annotated genes by assigned judgment, functional taxonomy, and DNA-binding classification. The data extraction is run in real time against a relational database, providing access to the most current TF catalog data.

4.4. Discussion

4.4.1. Catalog characteristics, comparisons, and utility

The comprehensive catalog of TFs contained in TFCat provides an important resource for investigators studying gene regulation and regulatory networks in mammals. The curation effort assessed the scientific literature for 3230 putative mouse and human TFs, including detailed evaluation of papers describing the molecular function of 1058 TFCs, to identify 882 confirmed human and mouse TFs. Each TF was further described within TFCat using a newly developed TF taxonomy. DNA binding proteins, a subset of TFs, were mapped to a structural classification system. As an aid to researchers, an expanded set of putative TFs was generated through a homology-based sequence analysis procedure. Online access to the annotations and homology data is facilitated through a wiki system. An annotation feedback system, linked from the wiki, enables reporting and tracking of community input. An additional website application offers capabilities to extract all or a subset of the catalog data for file download. For many researchers, the greatest utility of TFCat is the provision of an organized and comprehensive list of DNA binding proteins.
The protein-DNA structural classification system used to organize the DBD TFs in the catalog was originally proposed by Harrison [40], further modified by Luisi [41] and extended by Luscombe [15]. The DNA binding domain (DBD) analysis and gene/domain counts (Table 4.5) confirmed that well-known DBD families are represented. The DNA-binding classification system was extended with new family classes to accommodate the majority of predicted DNA-binding structures in our curated TF set (Table 4.5 and Table S5 in Appendix 1). A new family category was included for unrepresented, double-stranded TF protein-DNA binding mechanisms that were supported by PDB structures or publications. Similar to the analysis and classification performed by Luscombe et al., we added structural domain families that were characterized by distinct DNA-binding mechanisms. However, unlike the Luscombe et al. approach, we did not consider biological function in our classification decisions. To preserve the properties of the system, the necessary extensions were made within the existing protein groups. The value in having inventories of TFs has spurred previous efforts to compile collections of DNA-binding proteins. To evaluate the comprehensiveness of our curated collection, we compared it with the gene annotations provided by GO, and compared our DNA-binding domain classification with the domains found in a DNA-binding domain collection [42]. GO assigns molecular function labels to proteins, including functions falling under the broad category of transcription. The challenge of annotating all genes is daunting and therefore it was not a surprise that only 39% (343) of our expert-curated collection of TFs have thus far been associated with GO terms linked to transcription (Table S11 in Appendix 1).
While TFCat is unique in its evidence-based approach to identify mouse and human TFs, there are other compilations of TF binding domain models and predictions of domain-containing proteins. For example, a catalog of sequence-specific DNA-binding TFs (which we will refer to as DBDdb) has been compiled using HMMs to catalog double-stranded and single-stranded sequence-specific DBD domains [42]. Comparison of the double-stranded DNA binding subdivision of TFCat with the predictions in DBDdb highlights some key differences between these efforts (Tables S12, S13 and S14 in Appendix 1). For example, the TFCat DNA binding subdivision includes only TFs with published evidence from mammalian studies, whereas the DBDdb collection includes domain predictions based on evidence of sequence-specific DNA binding in any organism. While the two TF resources overlap, they serve complementary purposes. DBDdb is a set of computational predictions generated with protein motif models associated with sequence-specific single or double-stranded binding domains, while TFCat is an expert-curated, highly specific resource that targets the organized identification of all TFs, regardless of DNA binding, in human and mouse. For example, the high mobility group (HMG) domain TFs, which exhibit both specific and non-specific DNA-binding, are excluded from DBDdb but included in TFCat. Moreover, TFCat only included TFs with literature support in mammalian cells, which excludes certain domains included in DBDdb. For example, CG-I has been shown to regulate gene transcription in fly [43] but not in mammals [44]. To complement our large set of curated TF proteins, we conducted a sequence-based homology analysis, propagated from our positively-judged TFs, to predict additional TF encoding genes.
We applied a confidence ranking metric to predict the number of false positives included in larger homolog clusters (Table 4.6), which should be considered when extracting un-annotated, predicted TFs. Future adaptations of the TFCat resource could include literature-based judgments of TF homolog predictions. While the homolog clusters as provided are an essential and useful supplement to our evidence-based TF catalog, future predictions may benefit from further structure-based homology research. Creation of a comprehensive TF catalog provides an important first step in unravelling where, when and how each TF acts. For example, a number of recently published genome-scale studies constructed lists of predicted TFs prior to investigating the spatial and temporal expression characteristics of sets of regulatory proteins [8, 9, 45, 46], in advance of conducting a phylogenetic analysis of genes involved in transcription [47], and as initial input to the analysis of conserved non-coding regions in TF orthologs [48]. The set of literature evidence-supported TFs in TFCat will provide an important foundation for similar future studies. TF catalogs will become increasingly important and necessary to facilitate the investigation and analysis of TF-directed biological systems. Recent ground-breaking stem cell studies [49, 50] have shown the central role of TFs in regulating stem cell pluripotency and differentiation. Understanding the central role of TFs in the control of cellular differentiation has therefore taken on increased importance. Computational predictions in regulatory network analysis of cellular differentiation often highlight a pattern consistent with binding of a structural class of TFs, but fail to delineate which TF class member is acting. TFCat will serve as a reference and organizing framework through which such linkages can progress towards the detailed investigation of candidate TF regulators.

4.5. Materials and methods

4.5.1. Creation of four independent murine and human TF preliminary candidate data sets

Four TF collections were compiled by four independent approaches. All data sets are available on the TFCat portal.

4.5.1.1. Dataset I

A list of 986 human genes considered ‘very likely’ plus 913 considered ‘possibly’ to code for TFs was manually curated in February 2004 [51] using personal knowledge combined with information in LocusLink (now Entrez Gene), the Online Mendelian Inheritance in Man database (OMIM) [52], and PubMed [16]. Selection was guided by the following definition of TF: ‘a protein that is part of a complex at the time that complex binds to DNA with the effect of modifying transcription’. Inclusion was necessarily subjective for two reasons: (1) the definition of ‘transcription factor’ is difficult to precisely constrain, and (2) there was not enough information available for many genes to be certain of their function. Genes that primarily mediate DNA repair (e.g., ERCC6) or chromatin conformation (e.g., CBX1) were excluded. To be considered, a gene had to have an Entrez Gene entry with a GenBank accession number. Text-based searches for the terms ‘transcription factor’ or ‘homeobox’ were used to identify Entrez Gene entries for further analysis. GO node descriptions including the terms ‘nucleic acid binding’, ‘DNA binding’, and ‘transcription’ were used as a supplement to guide gene selection. A total of 998 TFs were present in the set following this initial compilation. After February 2004, periodic additions were made based on new reports in the literature.

4.5.1.2. Dataset II

The objective of this analysis was to identify a comprehensive list of DBDs for TF gene candidate extraction. Firstly, the SwissProt (SP) database [53] protein entries (obtained in April 2005) were scanned for descriptors or assigned PFAM [11] and/or Interpro [54] domains (downloaded in April 2005) indicating: DNA-binding, DNA-dependent, and transcription.
The extracted gene set was then further extended by including SP gene entries that had assignments to the biological process GO node GO:0006355 (Regulation of DNA Transcription, DNA-dependent) and SP records with text descriptions that included JASPAR database transcription factor binding site (TFBS) class names [55]. A list of unique DBDs was compiled from this extraction. All domains were manually reviewed for evidence strongly suggesting DNA binding and transcription factor activity using both Interpro and PFAM domain descriptions and associated literature references. Domains that did not meet these criteria were pruned from the list. Both known and putative TF genes were extracted from the Ensembl V29 database [56] using the TF DBD PFAM-based list, yielding a set of 1266 mouse and 1500 human DNA-binding TF candidates.

4.5.1.3. Dataset III

GO trees were constructed for all mouse and human entries in Entrez Gene by starting with the leaf term from gene2go [36] (downloaded July 19th, 2005) and enumerating all parent terms using file version 200507-termdb.rdf-xml. As we were interested in all genes that could be involved in altering transcription, genes were selected if they had any annotation (including Inferred Electronic Annotations, IEA) to GO terms with descriptors "transcription regulator activity", "transcription factor activity" and/or "transcription factor binding" in their tree. This method identified 970 mouse genes and 1203 human genes. As this first extraction did not identify all family members of a putative transcription factor, we performed an additional extraction using the search terms "DNA binding" and "transcription factor" against the domain information in the Interpro database [54]. The resulting genes were mapped to Entrez Gene entries using the Affymetrix annotation for the MOE-430 v2 chip.
Merging the two lists and removing duplicate entries resulted in 2131 mouse and 2900 human candidate genes involved in transcriptional regulation.

4.5.1.4. Dataset IV

We assembled ~350,000 isoforms representing ~48,000 known and predicted protein-coding mouse genes by mapping seven collections of known and predicted mRNAs to the mouse chromosomes, and clustering them on the basis of overlap (see [57] for source sequences, a representative mRNA from each cluster, and a description of the clustering method). We then assembled 36 known transcription-factor DNA-binding domains from PFAM and SMART [58], and screened the ~350,000 isoforms using the HMMER software [59] to identify approximately 2,500 known or predicted genes containing at least one of the 36 domains. To map the International Regulome Consortium (IRC) entries to Entrez Gene, the IRC sequences [60] were compared with RefSeq sequences using BLAST. Only sequences with an expectation value of at most $10^{-5}$ were selected and subsequently mapped to Entrez Gene using the Gene2Refseq table.

4.5.2. Standardizing TF gene candidate annotation

A website annotation tool and MySQL database were developed to standardize and centralize the annotation effort (Figure S5 in Appendix 1). TF candidate judgments and a high-level taxonomy classification system were established (Tables 4.3 and 4.4) for this web-based annotation process. The secure website gives each annotator access to only those genes assigned to that annotator. Each gene annotation required input of text summarizing the journal article evidence that, to some degree, supported or refuted the judgment of a gene (or the gene’s ortholog in a closely related species) as a TF. One or more PubMed journal articles were summarized in the reviewer comments and a final judgment and general taxonomy classification were assigned. Ten trial genes, randomly selected from the list of TFCs, were assessed by four reviewers.
The set of annotations for each trial gene was evaluated for the literature evidence selected and for annotation content and formatting. This evaluation was used to develop annotation evidence guidelines and a suggested general documentation format for the annotation process, which was included in the annotator help guidelines.

4.5.3. Selection and annotation of a subset of TF candidates

The mouse TF candidate datasets were merged, using mapped NCBI Entrez Gene identifiers, into a single non-redundant dataset. Gene2PubMed file counts were extracted and merged by Entrez Gene ID. Genes were manually pre-curated for evidence supporting TF activity by scanning NCBI PubMed abstracts (where available) using both standard gene symbols and aliases and examining GeneRIF entries for each gene in the dataset. Genes with literature evidence suggesting TF function were included in the list of TFCs to be annotated. A set of TFCs associated with two or more PubMed abstracts (based on Gene2PubMed data and excluding the large annotation project articles) were extracted from the TFC list and randomly assigned to each of seventeen reviewers based on pre-determined reviewer allocation counts. Each TFC was reviewed and judged by the assigned reviewer for TF evidence in the literature as described above. We also extracted and entered the PubMed information accompanying 22 TF DNA-binding profiles from the JASPAR database [55]. During this research project, the Entrez Gene numbers were maintained using the NCBI Gene History file. TFCat gene identifiers were updated (changed, merged or deleted) if a corresponding change was recorded in this file.

4.5.4. Randomly sampled quality assessment and auditing of TF annotations

TF gene candidates were randomly selected from each reviewer-assigned gene set based on the assigned proportions across all reviewers to form a list of fifty genes for annotation quality assurance (QA) testing.
Each gene was allocated to two reviewers for annotation in a blind QA test. The QA gene annotations were extracted and reviewed for TF judgment and taxonomy classification consistency. A second round of annotation auditing was performed to ensure consistency in the recorded annotation data. All annotations were examined for alignment of PubMed evidence reviewed and assigned judgment and functional taxa. Misaligned annotations were forwarded to the annotator for review and revision.

4.5.5. TFC quality assurance comparisons

To assess sensitivity (coverage) in our initial curated TF list, we compared our gene set with TF genes identified in two TF collections. Approximately 800 gene symbols listed in a TF textbook index, authored by Joseph Locker [6], were manually reviewed and mapped, where possible, to 506 mouse Entrez Gene Identifiers using gene descriptions and citations provided in the text. A TF comparison was also performed against the list of annotated fly TFs found in the FlyTF database [13] by mapping, where possible, FlyBase identifiers to NCBI gene identifiers to locate their corresponding mouse homolog in a HomoloGene group [16]. Upon completion of the TFCat curation phase, we performed comparisons with GO [36] and the DBD Transcription Factor Prediction Database resource [42]. To compare our curated set with GO we developed software to enumerate the number of our TF genes in the GO Molecular Function sub-tree under the "transcription regulator activity" node. We used the Mouse Xref file found in the GO Annotation Database [61] to map the TF Entrez Gene numbers to the gene identifiers available in the GO database. The DBD resource comparison involved downloading the mouse (Mus musculus 49_37b) and human (Homo sapiens 49_36k) predicted TF sets and development of software to extract all DBD domain models identified in those records.
We then compared the domains found in the DBD mouse/human set with those domain models annotated as DNA-binding in our curated TF set.

4.5.6. Human-mouse ortholog assignment

Human-mouse predicted orthologs were assigned using NCBI HomoloGene groups [16] with one-to-one relationships between the mouse and human gene. Those few genes that did not have a one-to-one relationship were manually inspected and, when available, a preference was given to the human non-predicted RefSeq gene model or an assignment was made using the closest BLAST alignment scores between a mouse and human gene pair. Where HomoloGene entries were not available for both human and mouse, ortholog assignments identified in the Mouse Genome Database were used.

4.5.7. TF DNA-binding structure analysis and classification

A DNA-binding protein classification system, an extension of the work from Luscombe et al. [15], was utilized to classify all genes judged as TFs with DNA-binding activity. Structural assignments were made utilizing the HMMER software to enumerate a full set of Superfamily (SCOP-based) HMMs [12] with a threshold of 0.02 and PFAM HMMs [11] for each gene using gathering threshold cut-offs and a calculated model significance value ≤ $10^{-2}$. The Superfamily domain sequences predicted in the TF gene set were subjected to a PFAM HMM analysis to identify PFAM domain models that are satisfied by the same sequences (Table S4 in Appendix 1). Both redundant and non-redundant models were then mapped to the DNA-binding structure classification using model structural descriptions and based on review of related literature for PDB entries that contain these domains. The DNA-binding classification was extended with additional family classes to accommodate the predicted DNA-binding structures encountered in the curated set of DBD TFs (Table 4.5 and Table S5 in Appendix 1).
To evaluate the structural similarity of DNA-binding domains, we performed alignments using the protein structure comparison web tool SSM, the Secondary Structure Matching Service [62]. We identified PDB entries for each of the new DBD families, with a preference for DNA-bound structures. The DNA-binding domain chains of each PDB entry were aligned with the entire PDB archive (incorporating lowest acceptable matches of 40% and defaulting the remaining parameters) to identify similar DBD structures based on Q-score metric clustering results. A new protein family classification was established if the structure aligned only to itself or was clustered (by Q-score) within its own set of family class structures. In a few cases, where a structure aligned reasonably well with another family in the classification system, PubMed articles were consulted to derive a final decision and any borderline cases were noted and described in the family class description text (Table S5 in Appendix 1). Each DNA-binding TF was then assigned to one or more DNA-binding families in the classification system if it was predicted to contain the related DBD structure.

4.5.8. Identification of homolog sets for mouse TF genes

A homolog analysis process was implemented that considers both sequence similarity and predicted protein domain commonality, and uses a computationally simplified clustering approximation, loosely motivated by proportional linkage clustering [63]. We initially identified sequence similarity using BLASTALL [64] analysis over a full mouse protein RefSeq [65] dataset with an expect value cut-off of $10^{-3}$ and enumerated all HMM PFAM domains over an extracted full representation of the mouse genome using NCBI RefSeq sequences. To extract putative homolog candidates for each TF gene we incorporated a metric, originally proposed by Li et al. [34], which considers the ratio of aligned sequence length to the entire length of each sequence.
Given the focus on mouse genes, the formula for this metric, which we will refer to as $I_s'$, was revised to utilize sequence similarity rather than identity. Our metric is computed as:

$$I_s' = S \times \min(n_1/L_1,\; n_2/L_2)$$

where $S$ is the proportion of similar amino acids (as defined by the BLOSUM62 matrix) across the hit, $L_i$ is the length of sequence $i$ ($i$ is the query or hit sequence), and $n_i$ is the number of amino acids in the aligned region of sequence $i$. We considered only homolog candidates that had a maximum hit significance of $10^{-4}$ and allowed for a high level of sensitivity by requiring that the computed $I_s'$ values were at least 0.06. We did not include any genes that had been reviewed and deemed not TFs. Our survey of a set of TF gene family sequence characteristics suggested that some known DNA-binding domains were contained in a small fraction of the total TF protein sequence. However, similarly short alignments between a TF gene and other hit sequences (low $I_s'$ values) can yield a significant number of false positives. We used the well-documented SRY-related HMG-box transcription factor (Sox) and Forkhead transcription factor (Fox) families (Table S15 and Table S16 in Appendix 1) to evaluate two cluster pruning strategies and selected an approach that increased cluster specificity (proportion of members of a test set in a cluster) without decreasing cluster sensitivity (number of cluster members that are members of a test set). To evaluate cluster pruning of the BLAST-based clusters using strictly an $I_s'$ threshold method, we computed cluster sensitivity and cluster specificity over an increasing range of $I_s'$ values, using the Sox and Fox validation sets (Figures S2 and S3 in Appendix 1). An $I_s'$ value was computed between the query sequence and every member in the cluster and a member (gene) was pruned if the $I_s'$ did not satisfy a cut-off threshold.
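As a concrete illustration of the similarity metric and the initial candidate filter, the following minimal Python sketch computes the metric from alignment summary values. The function and parameter names are ours and purely illustrative; the sketch assumes that the "hit significance" is a BLAST expect value and that the similarity proportion and aligned lengths have already been extracted from the alignment.

```python
def similarity_metric(similarity, n_query, len_query, n_hit, len_hit):
    """Return S * min(n_1 / L_1, n_2 / L_2).

    similarity: proportion of similar residues across the hit (S)
    n_query, n_hit: residues of each sequence inside the aligned region (n_i)
    len_query, len_hit: full length of each sequence (L_i)
    """
    if len_query <= 0 or len_hit <= 0:
        raise ValueError("sequence lengths must be positive")
    return similarity * min(n_query / len_query, n_hit / len_hit)


def is_homolog_candidate(e_value, metric, max_e_value=1e-4, min_metric=0.06):
    """Initial filter: significant hit and a metric value of at least 0.06."""
    # Assumption: 'maximum hit significance' is read as an expect-value ceiling.
    return e_value <= max_e_value and metric >= min_metric


# Hypothetical hit: 50% similar residues, 100 aligned residues on both sides,
# query of 400 residues and hit of 200 residues.
value = similarity_metric(0.5, 100, 400, 100, 200)
print(value, is_homolog_candidate(1e-10, value))
```

Because the minimum coverage is taken over both sequences, a short alignment that covers most of a small protein but little of a large one is still penalized, which is the behaviour the length-ratio term is meant to capture.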
Cluster sensitivity and cluster specificity were computed for the range of $I_s'$ values and compared. We then assessed a second cluster pruning approach over a successive range of $I_s'$ values, requiring that all predicted domains in a cluster member (gene) match the query gene or, when this criterion could not be met, that a particular $I_s'$ value threshold be satisfied (Figures S2 and S3 in Appendix 1). Inclusion of a domain-based method as the primary criterion for pruning, with a stricter $I_s'$ value criterion applied when the domains did not match, in most cases maintained cluster sensitivity while preserving or improving cluster specificity. Importantly, higher cluster sensitivity and cluster specificity levels enabled comprehensive Sox HMG and Fox Forkhead families to emerge when we applied a proportional linkage clustering approximation approach to merge the overlapping clusters (Figure S6 and Figure S7 in Appendix 1). While the sole application of an $I_s'$ value as a pruning criterion may not generate comprehensive TF family clusters (compare Panel B in Figures S6 and S7 in Appendix 1), our analyses suggested that this metric on its own, implemented with higher parameter values, is useful for identifying closely related sub-family members (Figure S8 in Appendix 1). Motivated by these assessment results, we implemented a cluster pruning step which required that either all predicted PFAM enumerated domains in the TF gene be matched in a homolog candidate or that the $I_s'$ value between the query TF gene and its homolog hit be no smaller than 0.21 with a sequence similarity no less than 30%. This resulted in 830 overlapping sets consisting of 48,555 members in total. To cluster and merge the sets, we implemented a method that considers a proportional linkage median-based relationship between sets.
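The pruning rule adopted above (domain match first, stricter metric threshold otherwise) can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the code used in the analysis: the function name is ours, domains are represented as sets of identifier strings, and a query gene with no predicted domains is assumed to fall back to the threshold test.

```python
def keep_homolog(query_domains, hit_domains, metric, similarity,
                 min_metric=0.21, min_similarity=0.30):
    """Decide whether a homolog candidate survives cluster pruning.

    A hit is kept if every predicted domain of the query TF gene is also
    predicted in the hit, or else if the similarity metric is at least 0.21
    and the sequence similarity is at least 30%.
    """
    query_domains = set(query_domains)
    # Assumption: an empty query domain set does not count as a domain match.
    domains_match = bool(query_domains) and query_domains <= set(hit_domains)
    return domains_match or (metric >= min_metric and similarity >= min_similarity)


# Hypothetical domain labels used only for illustration.
print(keep_homolog({"HMG_box"}, {"HMG_box", "Sox_N"}, 0.05, 0.20))  # domain match
print(keep_homolog({"HMG_box"}, {"Forkhead"}, 0.25, 0.35))          # thresholds met
print(keep_homolog({"HMG_box"}, {"Forkhead"}, 0.25, 0.20))          # pruned
```

Checking the domains first means that a family member with a short but diagnostic DNA-binding domain is retained even when its overall alignment coverage is low.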
The algorithm performed iterations of set merges, combining two sets $S$ and $T$ if at least half of the genes in the smaller set matched genes in the larger set, i.e., if there were at least

$$\lceil \min(|S|, |T|)/2 \rceil$$

matching genes. To mitigate the cluster attraction strength of initially larger and possibly noisier clusters, the merge process iteratively considered and executed merging over smaller to progressively larger cluster cardinalities using increments of 10. Cluster membership attained a steady-state convergence within 700 iterations.

A cluster confidence metric was developed to measure the number of potential false positives in a large (cardinality > 10) homolog cluster using predicted domain content. We mapped the mouse genes with the enumerated PFAM domains to terms in the GO Molecular Function subtree. We tallied the number of times a specific domain is contained within a gene annotated to the transcription regulator activity node and its child nodes versus the number of times the domain is found in a gene annotated to some other activity node to compute a probability $P_d$ of a particular domain being associated with TF function. The majority of GO annotation evidence codes were included, with the following exceptions: IEA (Inferred from Electronic Annotation), ISS (Inferred from Sequence or Structural Similarity), and RCA (Inferred from Reviewed Computational Analysis). To evaluate cluster confidence $C_n$, we first enumerated the number of genes that contain a specific domain within a cluster, $C_d$, and the number of genes in each cluster, $C_g$, to weight a domain’s association to TF activity:

$$N_d = \frac{C_d}{C_g} P_d$$

and, secondly, included those cluster domains that satisfy

$$D = \{\, d : C_d \geq \lceil C_g/4 \rceil \,\}$$

to compute $C_n$, using the following equation:

$$C_n = \frac{\sum_{i \in D} N_{d_i}}{|D|}$$

All cluster confidence values and cluster membership were reviewed and qualitatively assessed based on the proportion of verified TFs and binned into four partitions with associated confidence rankings (Table 4.6). To derive an estimate for the total number of TFs in the human and mouse species, we computed the number of known and predicted TF homologs and adjusted this amount by the cluster rank OAMTF (Table 4.6) to obtain a prediction of 2355 DNA-binding and accessory TFs. To obtain an approximate figure for the total number of DBD TFs, we performed a separate homolog clustering analysis seeded by genes curated with double-stranded DNA binding activity and reduced the counts using the OAMTF proportions by cluster rank, where applicable. The homolog-based analysis generated an estimate of 1510 DBD TFs. To support our DBD homology-based count analysis, we developed PERL scripts to query the mouse Ensembl mus_musculus_core_47_37 and ensembl_mart_47 databases for extraction of predicted DNA-binding TFs using the identified PFAM DNA-binding domains in TFCat. This extraction produced a total of 1507 Ensembl mouse genes (1416 records supported by Mouse Genome Informatics (MGI); 23 RefSeq and Entrez Gene sourced records; 29 Uniprot/SPTREMBL predicted genes; and 39 Ensembl predicted gene models).

4.5.9. Website download access, wiki publication and annotation feedback

The MediaWiki software was used to implement the TFCatWiki, with some modifications and additions made to the base software code and configuration files. We included the Semantic MediaWiki [66] extension to facilitate access and searching. Each article page contains the annotation information for one gene and has been configured to disallow edits, while all associated discussion pages remain open for contribution. Software was developed to extract data from the TFCat wiki database to create the wiki pages.
We implemented a feedback tracking function using the MantisBT software system [67], a well-established, open-source, issue monitoring system, to accommodate tracking and follow-up management of TFCat feedback contributions. PHP interfaces and software were developed to populate MediaWiki user information to the feedback system and provide direct query access to feedback records by gene. We also integrated new data update flagging mechanisms into our internally-available TFCat annotation software tool to identify new or modified gene annotation information that requires re-population to the gene wiki page.     131 The MediaWiki software includes a Watch function, which issues individual e- mails when information is changed on a Wiki page by a wiki user. We developed an e- mail feature that optionally provides lists of wiki pages that have been changed via the backend auto-update process. To enable this feature, we developed an external PHP program (MediaWiki) hook and an associated MySQL database table to solicit user entry and capture of desired e-mail parameter options and notification frequency. An e-mail notification process was developed which issues e-mails for wiki content updates based on user-selected parameters.     132  Table 4.1. Transcription factor data resources  Resource Organism Reference/URL Human KZNF Gene Catalog Human Huntley et al. (2006) [68] / [69] Database of bZIP Transcription Factors Human Ryu et al. (2007) [70] / [71] The Drosophila Transcription Factor Database Fly Adryan et al. (2006) [13] / [72] wTF2.0: a collection of predicted C. elegans transcription factors Worm Reece-Hoyes et al. (2005) [73] / [74]   Table 4.2. TFCat catalog statistics     Table 4.3. 
TFCat judgment classifications  Judgment classification Number of annotations % of annotations TF gene 733 61.9 TF gene candidate 256 21.7 Probably not a TF - no evidence that it is a TF 41 3.5 Not a TF - evidence that it is not a TF 30 2.5 Indeterminate - there is no evidence for or against this gene's role as a TF 114 9.6 TF evidence conflict - there is evidence for and against this gene's role as a TF 10 0.8            Total number of genes annotated 1,058 100% Proportion of genes with positive TF judgments 882 83% Proportion of positive TFs with DNA-binding activity 571 65% Proportion of DNA-binding TFs that are (double-stranded) sequence-specific 535 94%     133   Table 4.4. TFCat taxonomy classifications  Taxonomy classification Number of annotations % of annotations Basal transcription factor 39 3.7 DNA-binding: non-sequence-specific 30 2.9 DNA-binding: sequence-specific 591 56.5 DNA-binding: single-stranded RNA/DNA binding 20 1.9 Transcription factor binding: TF co-factor binding 315 30.1 Transcription regulatory activity: heterochromatin interaction/binding 51 4.9      134  Table 4.5. 
DNA-binding TF gene classification counts

Protein group | Protein group description | Protein family | Protein family description | Gene count | Predicted occurrences
1.1 | Helix-turn-helix group | 2 | Homeodomain family | 131 | 160
1.1 | Helix-turn-helix group | 100 | Myb domain family | 7 | 16
1.1 | Helix-turn-helix group | 109 | Arid domain family | 5 | 5
1.1 | Helix-turn-helix group | 999 | No family level classification | 2 | 2
1.2 | Winged helix-turn-helix | 13 | Interferon regulatory factor | 7 | 7
1.2 | Winged helix-turn-helix | 15 | Transcription factor family | 10 | 11
1.2 | Winged helix-turn-helix | 16 | Ets domain family | 23 | 23
1.2 | Winged helix-turn-helix | 101 | GTF2I domain family | 2 | 12
1.2 | Winged helix-turn-helix | 102 | Forkhead domain family | 26 | 26
1.2 | Winged helix-turn-helix | 103 | RFX domain family | 4 | 4
1.2 | Winged helix-turn-helix | 111 | Slide domain family | 1 | 1
2.1 | Zinc-coordinating group | 17 | Beta-beta-alpha-zinc finger family | 79 | 450
2.1 | Zinc-coordinating group | 18 | Hormone-nuclear receptor family | 43 | 43
2.1 | Zinc-coordinating group | 19 | Loop-sheet-helix family | 1 | 1
2.1 | Zinc-coordinating group | 104 | GATA domain family | 7 | 12
2.1 | Zinc-coordinating group | 105 | Glial cells missing (GCM) domain family | 2 | 2
2.1 | Zinc-coordinating group | 106 | MH1 domain family | 3 | 3
2.1 | Zinc-coordinating group | 114 | Non methyl-CpG-binding CXXC domain | 2 | 4
2.1 | Zinc-coordinating group | 999 | No family level classification | 2 | 2
3 | Zipper-type group | 21 | Leucine zipper family | 41 | 64
3 | Zipper-type group | 22 | Helix-loop-helix family | 71 | 71
4 | Other alpha-helix group | 28 | High mobility group (Box) family | 24 | 28
4 | Other alpha-helix group | 29 | MADS box family | 4 | 4
4 | Other alpha-helix group | 107 | Sand domain family | 3 | 3
4 | Other alpha-helix group | 115 | NF-Y CCAAT-binding protein family | 2 | 2
5 | Beta-sheet group | 30 | TATA box-binding family | 1 | 2
6 | Beta-hairpin-ribbon group | 34 | Transcription factor T-domain | 11 | 11
6 | Beta-hairpin-ribbon group | 108 | Methyl-CpG-binding domain, MBD family | 2 | 2
7 | Other | 37 | Rel homology region family | 10 | 10
7 | Other | 38 | Stat protein family | 6 | 6
7 | Other | 110 | Runt domain family | 3 | 3
7 | Other | 112 | Beta_Trefoil-like domain family | 2 | 2
7 | Other | 113 | DNA-binding LAG-1-like domain family | 2 | 2
8 | Enzyme group | 47 | DNA polymerase-beta family | 1 | 7
999 | Unclassified structure | 901 | CP2 transcription factor domain family | 3 | 3
999 | Unclassified structure | 902 | AF-4 protein family | 1 | 1
999 | Unclassified structure | 903 | DNA binding homeobox and different transcription factors (DDT) domain family | 1 | 1
999 | Unclassified structure | 904 | AT-hook domain family | 3 | 6
999 | Unclassified structure | 905 | Nuclear factor I - CCAAT-binding transcription factor (NFI-CTF) family | 3 | 3

Table 4.6. Large cluster ranking criteria

Cn | Rank | Implication for unannotated genes in cluster | Fraction of observed approximate mean TFs (OAMTF)
Cn ≥ 0.20 | 1 | The majority of genes are likely TFs | 95%
0.10 ≤ Cn < 0.20 | 2 | A higher proportion of genes are likely TFs | 75%
0.03 ≤ Cn < 0.10 | 3 | A higher proportion of genes are likely not TFs | 35%
0.00 ≤ Cn < 0.03 | 4 | The majority of genes are likely not TFs | 15%

4.6. References

1. Garvie CW, Wolberger C: Recognition of specific DNA sequences. Molecular Cell 2001, 8:937-946.
2. Halford SE, Marko JF: How do site-specific DNA-binding proteins find their targets? Nucleic Acids Research 2004, 32:3040-3052.
3. Rescan PY: Regulation and functions of myogenic regulatory factors in lower vertebrates. Comparative Biochemistry and Physiology Part B, Biochemistry & Molecular Biology 2001, 130:1-12.
4. Rosenfeld MG, Lunyak VV, Glass CK: Sensors and signals: a coactivator/corepressor/epigenetic code for integrating signal-dependent programs of transcriptional response. Genes & Development 2006, 20:1405-1428.
5. Latchman DS: Eukaryotic transcription factors. London ; San Diego, Calif.: Elsevier Academic Press; 2004.
6. Locker J: Transcription factors. Oxford; San Diego, CA: Bios; Academic Press; 2001.
7. Babu MM, Luscombe NM, Aravind L, Gerstein M, Teichmann SA: Structure and evolution of transcriptional regulatory networks. Current Opinion in Structural Biology 2004, 14:283-291.
8.
Gray PA, Fu H, Luo P, Zhao Q, Yu J, Ferrari A, Tenzen T, Yuk DI, Tsung EF, Cai Z, Alberta JA, Cheng LP, Liu Y, Stenman JM, Valerius MT, Billings N, Kim HA, Greenberg ME, McMahon AP, Rowitch DH, Stiles CD, Ma Q: Mouse brain organization revealed through direct genome-scale TF expression analysis. Science (New York, NY) 2004, 306:2255-2257.
9. Huntley S, Baggott DM, Hamilton AT, Tran-Gyamfi M, Yang S, Kim J, Gordon L, Branscomb E, Stubbs L: A comprehensive catalog of human KRAB-associated zinc finger genes: insights into the evolutionary history of a large family of transcriptional repressors. Genome Research 2006, 16:669-677.
10. Messina DN, Glasscock J, Gish W, Lovett M: An ORFeome-based analysis of human transcription factor genes and the construction of a microarray to interrogate their expression. Genome Research 2004, 14:2041-2047.
11. Bateman A, Birney E, Durbin R, Eddy SR, Finn RD, Sonnhammer EL: Pfam 3.1: 1313 multiple alignments and profile HMMs match the majority of proteins. Nucleic Acids Research 1999, 27:260-262.
12. Gough J: The SUPERFAMILY database in structural genomics. Acta Crystallographica Section D, Biological Crystallography 2002, 58:1897-1900.
13. Adryan B, Teichmann SA: FlyTF: a systematic review of site-specific transcription factors in the fruit fly Drosophila melanogaster. Bioinformatics (Oxford, England) 2006, 22:1532-1533.
14. Xi H, Shulha HP, Lin JM, Vales TR, Fu Y, Bodine DM, McKay RD, Chenoweth JG, Tesar PJ, Furey TS, Ren B, Weng Z, Crawford GE: Identification and characterization of cell type-specific and ubiquitous chromatin regulatory structures in the human genome. PLoS Genetics 2007, 3:e136.
15. Luscombe NM, Austin SE, Berman HM, Thornton JM: An overview of the structures of protein-DNA complexes. Genome Biology 2000, 1:REVIEWS001.
16.
Wheeler DL, Barrett T, Benson DA, Bryant SH, Canese K, Chetvernin V, Church DM, DiCuccio M, Edgar R, Federhen S, Geer LY, Kapustin Y, Khovayko O, Landsman D, Lipman DJ, Madden TL, Maglott DR, Ostell J, Miller V, Pruitt KD, Schuler GD, Sequeira E, Sherry ST, Sirotkin K, Souvorov A, Starchenko G, Tatusov RL, Tatusova TA, Wagner L, Yaschenko E: Database resources of the National Center for Biotechnology Information. Nucleic Acids Research 2007, 35:D5-12.
17. Eppig JT, Blake JA, Bult CJ, Kadin JA, Richardson JE: The mouse genome database (MGD): new features facilitating a model system. Nucleic Acids Research 2007, 35:D630-637.
18. Coulier F, Popovici C, Villet R, Birnbaum D: MetaHox gene clusters. The Journal of Experimental Zoology 2000, 288:345-351.
19. Marchler-Bauer A, Anderson JB, Derbyshire MK, DeWeese-Scott C, Gonzales NR, Gwadz M, Hao L, He S, Hurwitz DI, Jackson JD, Ke Z, Krylov D, Lanczycki CJ, Liebert CA, Liu C, Lu F, Lu S, Marchler GH, Mullokandov M, Song JS, Thanki N, Yamashita RA, Yin JJ, Zhang D, Bryant SH: CDD: a conserved domain database for interactive domain family analysis. Nucleic Acids Research 2007, 35:D237-240.
20. Murzin AG, Brenner SE, Hubbard T, Chothia C: SCOP: a structural classification of proteins database for the investigation of sequences and structures. Journal of Molecular Biology 1995, 247:536-540.
21. Orengo CA, Michie AD, Jones S, Jones DT, Swindells MB, Thornton JM: CATH--a hierarchic classification of protein domain structures. Structure (London, England : 1993) 1997, 5:1093-1108.
22. Bernstein FC, Koetzle TF, Williams GJ, Meyer EF, Jr., Brice MD, Rodgers JR, Kennard O, Shimanouchi T, Tasumi M: The Protein Data Bank: a computer-based archival file for macromolecular structures. Journal of Molecular Biology 1977, 112:535-542.
23. Laity JH, Lee BM, Wright PE: Zinc finger proteins: new insights into structural and functional diversity. Current Opinion in Structural Biology 2001, 11:39-46.
24.
Nagadoi A, Nakazawa K, Uda H, Okuno K, Maekawa T, Ishii S, Nishimura Y: Solution structure of the transactivation domain of ATF-2 comprising a zinc finger-like subdomain and a flexible subdomain. Journal of Molecular Biology 1999, 287:593-607.
25. Stefancsik R, Sarkar S: Relationship between the DNA binding domains of SMAD and NFI/CTF transcription factors defines a new superfamily of genes. DNA Sequence 2003, 14:233-239.
26. Rekdal C, Sjottem E, Johansen T: The nuclear factor SPBP contains different functional domains and stimulates the activity of various transcriptional activators. The Journal of Biological Chemistry 2000, 275:40288-40300.
27. Horn G, Hofweber R, Kremer W, Kalbitzer HR: Structure and function of bacterial cold shock proteins. Cellular and Molecular Life Sciences : CMLS 2007, 64:1457-1470.
28. Swamynathan SK, Nambiar A, Guntaka RV: Role of single-stranded DNA regions and Y-box proteins in transcriptional regulation of viral and cellular genes. The FASEB Journal : Official publication of the Federation of American Societies for Experimental Biology 1998, 12:515-522.
29. Gasperowicz M, Otto F: Mammalian Groucho homologs: redundancy or specificity? Journal of Cellular Biochemistry 2005, 95:670-687.
30. Hamilton AT, Huntley S, Tran-Gyamfi M, Baggott DM, Gordon L, Stubbs L: Evolutionary expansion and divergence in the ZNF91 subfamily of primate-specific zinc finger genes. Genome Research 2006, 16:584-594.
31. Lemons D, McGinnis W: Genomic evolution of Hox gene clusters. Science (New York, NY) 2006, 313:1918-1922.
32. Rost B: Twilight zone of protein sequence alignments. Protein engineering 1999, 12:85-94.
33. Liu J, Rost B: Domains, motifs and clusters in the protein universe. Current Opinion in Chemical Biology 2003, 7:5-11.
34. Li WH, Gu Z, Wang H, Nekrutenko A: Evolutionary analyses of the human genome. Nature 2001, 409:847-849.
35. Malumbres M, Barbacid M: Mammalian cyclin-dependent kinases.
Trends in Biochemical Sciences 2005, 30:630-641.
36. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA, Hill DP, Issel-Tarver L, Kasarskis A, Lewis S, Matese JC, Richardson JE, Ringwald M, Rubin GM, Sherlock G: Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 2000, 25:25-29.
37. Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, Baldwin J, Devon K, Dewar K, Doyle M, FitzHugh W, Funke R, Gage D, Harris K, Heaford A, Howland J, Kann L, Lehoczky J, LeVine R, McEwan P, McKernan K, Meldrim J, Mesirov JP, Miranda C, Morris W, Naylor J, Raymond C, Rosetti M, Santos R, Sheridan A, Sougnez C et al: Initial sequencing and analysis of the human genome. Nature 2001, 409:860-921.
38. Venter JC, Adams MD, Myers EW, Li PW, Mural RJ, Sutton GG, Smith HO, Yandell M, Evans CA, Holt RA, Gocayne JD, Amanatides P, Ballew RM, Huson DH, Wortman JR, Zhang Q, Kodira CD, Zheng XH, Chen L, Skupski M, Subramanian G, Thomas PD, Zhang J, Gabor Miklos GL, Nelson C, Broder S, Clark AG, Nadeau J, McKusick VA, Zinder N et al: The sequence of the human genome. Science (New York, NY) 2001, 291:1304-1351.
39. TFCat Portal Resource [http://www.tfcat.ca/]
40. Harrison SC: A structural taxonomy of DNA-binding domains. Nature 1991, 353:715-719.
41. Lilley DMJ: DNA-Protein : structural interactions. Oxford: IRL Press at Oxford University Press; 1995.
42. Kummerfeld SK, Teichmann SA: DBD: a transcription factor prediction database. Nucleic Acids Research 2006, 34:D74-81.
43. Han J, Gong P, Reddig K, Mitra M, Guo P, Li HS: The fly CAMTA transcription factor potentiates deactivation of rhodopsin, a G protein-coupled light receptor. Cell 2006, 127:847-858.
44. Finkler A, Ashery-Padan R, Fromm H: CAMTAs: calmodulin-binding transcription activators from plants to human. FEBS Lett 2007, 581:3893-3898.
45.
Choi MY, Romer AI, Hu M, Lepourcelet M, Mechoor A, Yesilaltay A, Krieger M, Gray PA, Shivdasani RA: A dynamic expression survey identifies transcription factors relevant in mouse digestive tract development. Development (Cambridge, England) 2006, 133:4119-4129.
46. Kong YM, Macdonald RJ, Wen X, Yang P, Barbera VM, Swift GH: A comprehensive survey of DNA-binding transcription factor gene expression in human fetal and adult organs. Gene Expression Patterns : GEP 2006, 6:678-686.
47. Coulson RM, Ouzounis CA: The phylogenetic diversity of eukaryotic transcription. Nucleic Acids Research 2003, 31:653-660.
48. Lee AP, Yang Y, Brenner S, Venkatesh B: TFCONES: a database of vertebrate transcription factor-encoding genes and their associated conserved noncoding elements. BMC Genomics 2007, 8:441.
49. Takahashi K, Tanabe K, Ohnuki M, Narita M, Ichisaka T, Tomoda K, Yamanaka S: Induction of Pluripotent Stem Cells from Adult Human Fibroblasts by Defined Factors. Cell 2007, 131:1-12.
50. Yu J, Vodyanik MA, Smuga-Otto K, Antosiewicz-Bourget J, Frane JL, Tian S, Nie J, Jonsdottir GA, Ruotti V, Stewart R, Slukvin, II, Thomson JA: Induced Pluripotent Stem Cell Lines Derived from Human Somatic Cells. Science (New York, NY) 2007, 318:1917-1920.
51. Roach JC, Smith KD, Strobe KL, Nissen SM, Haudenschild CD, Zhou D, Vasicek TJ, Held GA, Stolovitzky GA, Hood LE, Aderem A: Transcription factor expression in lipopolysaccharide-activated peripheral-blood-derived mononuclear cells. Proceedings of the National Academy of Sciences of the United States of America 2007, 104:16245-16250.
52. Hamosh A, Scott AF, Amberger JS, Bocchini CA, McKusick VA: Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders. Nucleic Acids Res 2005, 33:D514-517.
53.
Boeckmann B, Bairoch A, Apweiler R, Blatter MC, Estreicher A, Gasteiger E, Martin MJ, Michoud K, O'Donovan C, Phan I, Pilbout S, Schneider M: The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Research 2003, 31:365-370.
54. Mulder NJ, Apweiler R, Attwood TK, Bairoch A, Bateman A, Binns D, Bradley P, Bork P, Bucher P, Cerutti L, Copley R, Courcelle E, Das U, Durbin R, Fleischmann W, Gough J, Haft D, Harte N, Hulo N, Kahn D, Kanapin A, Krestyaninova M, Lonsdale D, Lopez R, Letunic I, Madera M, Maslen J, McDowall J, Mitchell A, Nikolskaya AN et al: InterPro, progress and status in 2005. Nucleic Acids Research 2005, 33:D201-205.
55. Sandelin A, Alkema W, Engstrom P, Wasserman WW, Lenhard B: JASPAR: an open-access database for eukaryotic transcription factor binding profiles. Nucleic Acids Research 2004, 32:D91-94.
56. Hubbard T, Andrews D, Caccamo M, Cameron G, Chen Y, Clamp M, Clarke L, Coates G, Cox T, Cunningham F, Curwen V, Cutts T, Down T, Durbin R, Fernandez-Suarez XM, Gilbert J, Hammond M, Herrero J, Hotz H, Howe K, Iyer V, Jekosch K, Kahari A, Kasprzyk A, Keefe D, Keenan S, Kokocinsci F, London D, Longden I, McVicker G et al: Ensembl 2005. Nucleic Acids Research 2005, 33:D447-453.
57. International Regulome Consortium Mouse Genome Project: Mouse Gene List [http://hugheslab.med.utoronto.ca/IRC/]
58. Letunic I, Copley RR, Schmidt S, Ciccarelli FD, Doerks T, Schultz J, Ponting CP, Bork P: SMART 4.0: towards genomic data integration. Nucleic Acids Res 2004, 32:D142-144.
59. HMMER - Profile HMM Software for Protein Sequence Analysis [http://hmmer.janelia.org/]
60. The International Regulome Consortium [www.internationalregulomeconsortium.ca]
61. Gene Ontology Annotation (GOA) Database [http://www.ebi.ac.uk/GOA/]
62. Krissinel E, Henrick K: Secondary-structure matching (SSM), a new tool for fast protein structure alignment in three dimensions. Acta Crystallographica Section D, Biological Crystallography 2004, 60:2256-2268.
63. Day WHE, Edelsbrunner H: Investigation of proportional link linkage clustering methods. Journal of Classification 1985, 2:239-254.
64. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. Journal of Molecular Biology 1990, 215:403-410.
65. Pruitt KD, Tatusova T, Maglott DR: NCBI reference sequences (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Research 2007, 35:D61-65.
66. Krötzsch M, Vrandečić D, Völkel M, Haller H, Studer R: Semantic Wikipedia. Journal of Web Semantics 2007, 5:251-261.
67. MantisBT Issue Tracking Software [http://www.mantisbt.org/]
68. Huntley S, Baggott DM, Hamilton AT, Tran-Gyamfi M, Yang S, Kim J, Gordon L, Branscomb E, Stubbs L: A comprehensive catalog of human KRAB-associated zinc finger genes: insights into the evolutionary history of a large family of transcriptional repressors. Genome Res 2006, 16:669-677.
69. Human KZNF Gene Catalog [http://znf.llnl.gov]
70. Ryu T, Jung J, Lee S, Nam HJ, Hong SW, Yoo JW, Lee DK, Lee D: bZIPDB: a database of regulatory information for human bZIP transcription factors. BMC Genomics 2007, 8:136.
71. bZIPDB - Database of bZIP Transcription Factors [http://bzip.kaist.ac.kr:8080/bzip.html]
72. FlyTF - The Drosophila Transcription Factor Database [http://www.mrc-lmb.cam.ac.uk/genomes/FlyTF/]
73. Reece-Hoyes JS, Deplancke B, Shingles J, Grove CA, Hope IA, Walhout AJ: A compendium of Caenorhabditis elegans regulatory transcription factors: a resource for mapping transcription regulatory networks. Genome Biol 2005, 6:R110.
74. A Collection of Predicted C. elegans Transcription Factors [http://genomebiology.com/content/supplementary/gb-2005-6-13-r110-s1.xls]

5. Brain MiniPromoters by Design: Pleiades Promoter Project (see footnotes 13 and 14)

5.1. Chapter preamble

The myelin sheath wraps around axons to facilitate proper conduction of neural impulses throughout the nervous system.
Myelin is produced by two cell types: oligodendrocytes in the central nervous system and Schwann cells in the peripheral nervous system. Transcription factors (TFs) play an important role in myelination. OLIG1 is a TF that plays a critical role in oligodendrocyte development. Non-coding regions surrounding the OLIG1 gene coding sequence were evaluated for tissue-specific expression in adult brain regions by the Pleiades Promoter Project. One of the constructs demonstrated expression in oligodendrocytes and two other OLIG1-associated constructs were negative (i.e., did not express). Chapter 5 describes analyses conducted to determine the TFs that could be acting to regulate the positively expressed construct. Gene expression analyses and sequence feature evaluations of the positive and negative constructs were conducted.

Drawing from the work described in Chapter 4, the TFCat resource played an important role in identifying TFs differentially expressed in oligodendrocytes. From this analysis, two TFs emerged as likely candidates responsible for the construct expression. The thesis-related work is described in Figure 5.7b and Appendix 2.

13 A version of this chapter will be submitted for publication. Elodie Portales-Casamar, Douglas J. Swanson, Charles N. de Leeuw, Kathleen G. Banks, Shannan J. Ho Sui, Debra L. Fulton, Johar Ali, Mahsa Amirabbasi, David J. Arenillas, Nazar Babyak, Sonia F. Black, Russell J. Bonaguro, Erich Brauer, Tara R. Candido, Mauro Castellarin, Jing Chen, Ying Chen, Jason C.Y. Cheng, Vik Chopra, T. Roderick Docking, Lisa Dreolini, Cletus A. D’Souza, Erin K. Flynn, Randy Glenn, Kristi Hatakka, Taryn G. Hearty, Behzad Imanian, Steven Jiang, Shadi Khorasan-zadeh, Ivana Komljenovic, Stéphanie Laprise, Nancy Y. Liao, Jonathan S. Lim, Stuart Lithwick, Flora Liu, Jun Liu, Li Liu, Meifen Lu, Melissa McConechy, Andrea J. McLeod, Marko Milisavljevic, Jacek Mis, Katie O’Connor, Betty Palma, Diana L. Palmquist, Jean-François Schmouth, Magdalena I. Swanson, Bonny Tam, Amy Ticoll, Jenna L. Turner, Richard Varhol, Jenny Vermeulen, Russell F. Watkins, Gary Wilson, Bibiana K.Y. Wong, Siaw H. Wong, Tony Y.T. Wong, George S. Yang, Athena R. Ypsilanti, Steven J.M. Jones, Robert A. Holt, Daniel Goldowitz, Wyeth W. Wasserman, Elizabeth M. Simpson. Brain MiniPromoters by Design: Pleiades Promoter Project.

14 The computational analyses linked to the thesis research are predominantly described in the supplementary information of the manuscript, provided as Appendix 2 of this thesis.

5.2. Introduction

There is increasing interest in the application of gene therapy to treat incurable brain, eye, and spinal cord disorders. Several promising clinical trial results have been reported for Parkinson, Huntington, and Alzheimer disease, as well as eye diseases ([1, 2] for review). However, current therapeutic approaches frequently incorporate ubiquitous promoters, which may direct expression in both targeted therapeutic cells and in untargeted cellular regions. Another critical issue is that gene therapy approaches can be mutagenic when delivered through random insertion approaches that cause integration of multiple copies [3]. Delivery technologies that enable region- and/or cell-type specific expression using a controlled transgenesis strategy would greatly improve the potential applications for gene therapy. The Pleiades Promoter Project is striving to address these major challenges by enabling brain region-specific gene delivery using MiniPromoters (MiniPs). These MiniPs are being pre-tested as single-copy inserts at a defined locus in the mouse genome to enable reproducible and physiological expression levels. All Pleiades MiniPs contain entirely human DNA regulatory sequences to minimize cross-species concerns in therapy and avoid or reduce epigenetic inactivation effects.

The mammalian central nervous system (CNS) is a complex entity comprising different neuronal and non-neuronal cell types.
Cellular identity is characterized by a unique repertoire of gene expression profiles. The Allen Brain Atlas has helped elucidate brain cell diversity by mapping the expression of approximately 20,000 genes in the adult mouse brain using in situ hybridization [4]. Another gene expression detection effort, the GENSAT project, incorporates transgenic mouse techniques to map reporter gene expression in CNS tissues of mice driven by large chromosomal segments [5]. These resources demonstrate the selectivity with which some genes are expressed in the brain. Much of metazoan gene expression is driven by modular conserved regulatory regions [6-9]. Notably, studies have highlighted similar patterns of gene expression in corresponding human and mouse brain regions [10]. The Pleiades Promoter Project implemented a three-step approach to identify putative non-coding regions, which may be used to drive gene therapies. Firstly, a set of genes was selected that were specifically expressed in adult brain regions of interest and deemed to be of therapeutic interest. Secondly, we computationally predicted candidate human regulatory regions that could be responsible for each gene’s tissue-specific expression pattern. Lastly, we tested the human sequences in vivo in mice using a robust mouse transgenesis strategy.

The overarching goal of the Pleiades project is to generate a bank of human DNA MiniPs, of no more than 4 kb in length to ensure suitability for gene therapy purposes, that drive expression in selected, specific adult brain regions of therapeutic interest. Our approach is founded on the premise that robust bioinformatics-based predictions of putative regulatory regions can be generated through the systematic application of computational analyses that consider prior experimental evidence.
We have improved previously applied approaches through the application of a standardized promoter design methodology and incorporation of a consistent genomic context for integration, testing and comparison of MiniP reporter gene expression. Our expanding collection of brain-specific promoters is made publicly available at: http://www.pleiades.org.

5.3. Results

5.3.1. Novel tools to study and treat the brain

The Pleiades Promoter Project strategy incorporates computational prediction of the regulatory regions that could be responsible for gene expression patterns observed in specific brain regions or cell-types [11]. However, the goal was not to recapitulate the potentially complex endogenous expression pattern of a gene, but rather to identify a subset of putative regulatory regions that drive reporter expression in targeted brain regions. For each set of gene regulatory region candidates, we designed several (from 1 to 7) MiniP constructs that incorporated unique combinations of the candidate regions (Fig. 5.1). To obtain physiological expression levels for a human-based MiniP sequence, we included the endogenous promoter of the gene. Importantly, we designed the MiniPs to be relatively short in length (≤ 4 kb) for ease of manipulation and to maintain their suitability for insertion in space-restricted molecular constructs (i.e., gene-therapy vectors) (Fig. 5.1a). Addressing an annotated set of 62 genes, we designed 240 MiniPs containing 468 different candidate regulatory regions (Fig. 5.1b). The MiniP sequences were cloned 5′ of a reporter gene sequence (EGFP, lacZ, or an EGFP/cre fusion protein). To date this research project has predicted 313 regulatory regions, which have been incorporated in 202 MiniP construct designs for 58 selected genes.
Each tested construct was introduced by homologous recombination immediately 5′ of the Hprt1 locus on the X chromosome, as previously described [12, 13], which provides a single-copy knock-in insertion for reproducible expression [14] (Fig. 5.1a). Currently, 970 embryonic stem cell lines (ESCs) have been derived carrying 180 MiniPs. Transgenic mice carrying 103 MiniPs have been generated and 52 transgenic mouse strains have undergone post-mortem evaluation. Notably, 36% of tested construct designs have demonstrated expression in the brain; amongst these are MiniPs expressed in both glial and neuronal cell types.

5.3.2. A new score to prioritize suitable genes

As described previously in D’Souza et al., we utilized a genome-wide evaluation approach to identify 237 candidate genes with region-specific expression patterns in a set of 30 adult brain regions of therapeutic interest [11]. To further refine our list of genes, we considered disease relevance and genomic suitability for a MiniP design. To establish a disease relevance judgment, we reviewed the literature and noted the phenotypic consequences of gene knock-outs in mice, when this information was available. The MiniP design suitability assessment was determined by prioritizing those genes for which putative regulatory regions could be more readily distinguished. Comparative sequence analysis, often referred to as phylogenetic footprinting, has proven useful in predicting regulatory regions with the expectation that sequences under selective pressure will be more conserved than those that are not.
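The idea behind phylogenetic footprinting can be illustrated with a minimal sketch: slide a fixed-width window along a pairwise human-mouse alignment, compute the fraction of identical columns in each window, and merge windows that meet a threshold into candidate conserved regions. This is an illustration only and not the Pleiades pipeline; the function name `conserved_regions`, the window width, and the identity threshold are assumptions chosen for the example.

```python
def conserved_regions(human_aln, mouse_aln, window=10, min_identity=0.7):
    """Return (start, end) alignment-column intervals (end exclusive) covered
    by windows whose fraction of identical, non-gap columns is at least
    min_identity.

    Hypothetical helper for illustration; the window width and threshold are
    assumed values, not parameters taken from the thesis.
    """
    if len(human_aln) != len(mouse_aln):
        raise ValueError("aligned sequences must have equal length")

    # 1 where both sequences carry the same base (gaps never count), else 0.
    match = [
        int(h == m and h != "-")
        for h, m in zip(human_aln.upper(), mouse_aln.upper())
    ]

    regions = []
    for start in range(len(match) - window + 1):
        identity = sum(match[start:start + window]) / window
        if identity < min_identity:
            continue
        end = start + window
        if regions and start <= regions[-1][1]:
            # Window overlaps or touches the previous region: extend it.
            regions[-1] = (regions[-1][0], end)
        else:
            regions.append((start, end))
    return regions


if __name__ == "__main__":
    # Toy alignment: the first ten columns are identical, the rest differ.
    human = "ACGTACGTACTTTTTTTTTT"
    mouse = "ACGTACGTACAAAAAAAAAA"
    print(conserved_regions(human, mouse, window=10, min_identity=0.75))
```

Because whole windows are merged, a reported region can extend a few columns into less conserved flanking sequence; a narrower window or a stricter threshold tightens the boundaries at the cost of more fragmented regions.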
We thus based our gene prioritization on the following criteria: (i) the existence of known regulatory regions for the gene that are responsible for its expression pattern in the brain, (ii) the generation of a single transcript from the gene to avoid having to differentiate between alternative promoters, (iii) the amount of non-coding sequence to be considered for analysis, (iv) the number of conserved regions within our analysis boundaries, and (v) how well-distinguished conserved regions are relative to the overall conservation level for a gene. In brief, we sought genes containing a small number of well-defined conserved non-coding regions close to their start site. To this end, we developed a “regulatory resolution score” intended to reflect human perception of what constitutes a good candidate gene for MiniP design (see Appendix 2 information). We demonstrated that this score captures aspects of the manual curation process by comparing the results to 100 manually curated genes (Fig. 5.2a and Appendix 2 information). The 62 genes that we selected for MiniP design are heavily skewed towards higher scores in our set of 237 brain region-selective genes (Fig. 5.2b,c).

5.3.3. MiniPromoter designs incorporate available information

Our review of the literature identified important gene regulatory information for a selected set of genes, which was considered in the construct design process. Literature evidence describing regulatory function for 62 genes was reviewed, curated, and stored in the PAZAR database [15] within the “Pleiades Genes” project. Importantly, this data collection represents the first publicly-available collection of detailed gene regulatory data for brain regions, which documents both validated regulatory sequence and acting transcription factors (TFs).

One of the central objectives for the design of MiniPs was to select regulatory sequence that could be functional in both human and mouse brain cells.
Although we mainly relied on mouse expression information to select our initial list of genes, the MiniPs incorporated human DNA sequences for validation in mice. To identify the human-mouse orthologous non-coding sequence, we implemented phylogenetic footprinting over human and mouse sequence alignments. Identified conserved regions were subjected to TF binding site (TFBS) prediction analyses to identify motifs that could be bound by TFs previously shown to direct specific expression patterns in the brain tissue of interest. For example, the TF NRSF/REST was included in TFBS analyses as it has been shown to bind the non-coding region of many neuronal genes to inhibit gene expression in non-neuronal tissues [16]. The TF binding models used in this study were compiled from the JASPAR database [17] and supplemented with brain-specific TFs annotated in the PAZAR “Pleiades Genes” project described above.

As illustrated in the Appendix 2 information, we developed a MiniP design pipeline that takes into account all available information from the literature, genome annotations, and the computational analyses mentioned above. Endogenous promoters were identified using the 5′ cap analysis gene expression (CAGE) tags [18], genome annotated transcripts (mRNAs, ESTs), and CpG islands [19]. In the case of ADORA2A, multiple transcription start sites were identified and RT-PCR on human brain cDNA was performed (see Appendix 2 information) to identify the predominantly expressed transcript in the targeted region (striatum). In a few instances, we incorporated histone modification chromatin immunoprecipitation (ChIP) data for mouse and human cortex (Jones et al. unpublished) in our MiniP design pipeline analyses, when this information was available for candidate regulatory regions.

5.3.4.
ESC neural differentiation for pre-screening

To test our pre-screening ESC differentiation strategy and to demonstrate that the genomic location immediately 5′ of the Hprt1 locus is suitable for specific expression, we selected a set of previously characterized promoters that drive expression either ubiquitously or in a subset of neurons or glial cells for validation. We evaluated the expression patterns of MiniP constructs: Ple88 for glial cell expression, which was composed of previously known GFAP regulatory regions [20], and Ple53, which was composed of previously validated DCX regulatory regions, for neuronal expression [21] (Fig. 5.3). It should be noted that both of these genes are developmentally expressed. The Barberi et al. assay approach was chosen for the characterization since it has no region-specific bias, induces neural stem cells within 6 days, and expresses genes in a developmentally appropriate temporal fashion [22, 23]. Expression of the MiniPs followed a pattern that is similar to the endogenous expression of their associated gene, as shown by RT-PCR (Fig. 5.3b,f) and staining (Fig. 5.3c,g). The positive ESC pre-screening results were then confirmed in the corresponding knock-in adult mouse brains (Fig. 5.3d,h). This validation test demonstrates the utility of the MiniP-containing ESCs for directed expression in stem cell differentiation assays. While the MiniPs were designed to express in mature adult tissues, it is anticipated that a subset will function in the earlier developmental stages represented in stem cell studies, based upon the observed expression patterns for the MiniP source genes in BGEM [24], GenePaint [25], and in the literature.

5.3.5. Novel MiniPromoter expression patterns in the brain

The Pleiades resource includes MiniPs that direct expression in diverse cell types and regions of the brain, spinal cord, and eye (Fig. 5.4).
EGFP immunohistochemistry was performed on adult brains, spinal cords, and eyes of germline males to characterize each MiniP-directed expression pattern. Ple151-EGFP expresses broadly throughout cortical and subcortical regions whereas only a small cluster of neurons show Ple111-EGFP expression in the lateral hypothalamus (Fig. 5.4). A portion of the Pleiades MiniPs drive expression of the EGFP/cre fusion protein and were analyzed for lacZ staining after recombination of the Gt(ROSA)26Sor<tm1Sor> allele [26] (Fig. 5.4; Ple103, Ple162, Ple167, Ple176). For these assays, lacZ acts as a historical marker and is visualized wherever the MiniP driving cre was expressed during development. For example, in the case of Ple167, Ple176, Ple177, and Ple178 constructs, it appears that cortical columns are labeled, which is an indication that reporter expression occurred early in development, before the migration of cells through the cortex. This effect produces broader labeling of regions in the adult brains (data not shown).

As expected, a subset of MiniPs directed neuronal-specific expression (examples in Fig. 5.5a-h) and others directed glial-specific expression (examples in Fig. 5.5i-p). Ple54 (based on DCX regulatory regions) and Ple111 (based on HCRT regulatory regions) express the EGFP reporter in neuronal cell populations that are immunopositive for their respective source gene. EGFP expression for the Ple54 MiniP is found in the neurogenic cells of the adult rostral migratory stream as well as olfactory bulb neurons (Fig. 5.5a-d). EGFP expression in the Ple111 transgenic brain is co-localized with Hcrt in a discrete population of neurons in the hypothalamus (Fig. 5.5e-h). Ple88 (based on GFAP regulatory regions) and Ple185 (based on S100β regulatory region) express the EGFP reporter gene with fidelity compared to their respective associated genes.
Ple88 expresses in astroglial-like cells throughout the brain, spinal cord, and the eye with a subset of cells co-expressing the Gfap source gene (Fig. 5.5i-l) similarly to what has been previously described for this construct [20]. The pattern of expression of Ple185 is broad throughout the brain, most notably in cerebellar Bergmann glia and myelinated white matter tracts (Fig. 5.5m-o), as well as fiber tracts in the cortex (Fig. 5.5p). The EGFP expression pattern in the brain co-labels with myelinated axons in the cortex, striatum, and the brainstem (data not shown).

To demonstrate the value of the EGFP/cre strains carrying lacZ as a historical marker, we have performed a developmental analysis of the Ple162-EGFP/cre (based on PITX3 regulatory regions) mouse to assess the history of the lacZ-positive neurons located in the ventral tegmental area (VTA) of the midbrain. Detailed analysis of these positive cells in the adult brain shows that they are neurons located just superficial to the VTA dopaminergic cells but are not TH-positive dopaminergic cells (Fig. 5.6a-c), representing a novel subpopulation of neurons not typically found in the adult PITX3-positive population [27]. The developmental analysis of lacZ expression, in whole mount and sectioned tissue, tracks these cells from their initial genesis at the mesencephalic flexure to their final destination in the adult VTA (Fig. 5.6d-k). The expression of the lacZ reporter is observed at E11.5, but not in neural tissue at E10.5, delineating the onset of Ple162 MiniP expression (Fig. 5.6h-k).

5.3.6. A unique dataset for in silico studies

As a proof of principle for our bioinformatics approach, we generated a smaller MiniP than previously available for the GFAP gene [20]. Using our MiniP design pipeline, we selected a minimal promoter and a well-conserved upstream sequence for validation (Ple90; Fig. 5.7a).
As predicted, the 1.4 kb Ple90 drives the expression of EGFP in a pattern similar to the previously characterized 2.2 kb Ple88 (Fig. 5.7a). This result was confirmed in a recent study analyzing similar GFAP promoter constructs [28].

The observed properties of MiniPs facilitate study of transcriptional regulation in specific cells and tissues through the comparison and correlation of MiniP sequences. For the MiniPs associated with OLIG1, one MiniP appears to drive expression in oligodendrocytes in the adult brain (Ple151) (see Appendix 2, Figure S4), as sought based on the endogenous gene expression pattern [29]. Two other OLIG1-associated MiniP constructs did not exhibit expression in adult brain (Ple148 and Ple150). We evaluated and compared the TF binding site predictions between the positive and negative MiniP sequences (Appendix 2 information). Our results highlight the potential involvement of EGR1 (KROX-24) in the Ple151 expression pattern (Fig. 5.7b and Appendix 2 information). Recent studies demonstrate that OLIG1 promotes the initiation of oligodendrocyte differentiation [30, 31] and is responsible for oligodendrocyte specification in some brain regions [32]. Temporal expression studies implicate EGR1 (KROX-24) as having a role in the primary response that leads to oligodendrocyte differentiation [33]. Consistent with suggestions of a role for the AP-1 family of TFs in oligodendrocyte differentiation [34], we observed predicted FOS (AP-1) binding sites in Ple151 and not in the inactive MiniPs (Appendix 2 information).

5.4. Discussion

The design of MiniPs for targeted adult brain gene expression, guided by a comprehensive collection of experimentally-based regulatory data and a computationally-driven pipeline, is an important new development. In the past, discovery of brain-specific promoters was obtained by low-throughput promoter-deletion studies (e.g. L7/Pcp2 [35, 36] and Camk2a [37, 38]).
The discovery of brain-specific promoters has been of enormous value, but the tedious effort needed for identification and the sparseness of the resulting collection limit research and therapeutic initiatives. The GENSAT project is generating mice using engineered BACs (100-200 kb) driving EGFP regionally in the brain [5]. However, many applications, including therapeutic gene delivery, require compact and portable promoters. Conservation-driven selection of candidate regulatory sequences is being used to identify regions directing reporter gene expression in the embryonic mouse [6]. Despite great interest and demand, progress in identifying regulatory sequences for selective expression in the adult brain has been limited. The Pleiades Promoter Project has addressed this issue by undertaking a high-throughput, bioinformatically directed, parallel design process for 62 brain genes, producing MiniPs vetted in vivo in the adult mouse brain. The success of the Pleiades design process was facilitated by the availability of large-scale gene expression studies [4, 24, 25], comparative genomics tools [39], and bioinformatics software for regulatory sequence prediction [40], which together made it possible to produce large numbers of MiniPs. By introducing the regulatory resolution scoring procedure to target the design efforts on the most tractable genes, it was possible to increase the probability of design success, a necessity given the expense of transgenic studies in the developed brain. Future efforts will further benefit from large-scale ChIP data for epigenetic marks [41], co-activator localization [42], and TF binding sites (e.g. the ENCODE project [43]), making it feasible to pursue diverse tissues and more complex designs.

It is likely that any gene expressed in the nervous system is expressed in more than one cell type.
We hypothesized that we could dissect the regulatory gene expression controls by computationally identifying sequences capable of directing discrete regions of endogenous gene expression profiles. This was observed with some MiniPs (e.g. Ple111). In other cases, the regulatory designs directed expression in novel patterns (e.g. Ple176). In short, the design of MiniPs was successful in delivering gene expression profiles of utility. While 36% of the tested MiniP constructs directed expression in the brain, it is likely that many MiniPs will drive additional discrete expression in developmental stages or physiological conditions within and external to the brain that were not assessed in our initial characterization.

The Pleiades resource of plug-and-play MiniPs for brain expression will have a major impact on basic and preclinical research. They can be easily introduced into constructs, and in many cases, into viruses, for brain-, spinal cord-, and eye-directed delivery of molecules, such as siRNA, cre recombinase, fluorescent reporters, and therapeutic proteins. Their dual origin, human sequence and mouse testing, suggests that they will not only function in those species, but also in rats, monkeys, and other research and clinical models. We have already demonstrated their function in mouse ESCs and a future critical step will be assaying their function in human stem cells. Driving specific reporters, they can be used in flow-sorting experiments to enrich or exclude cells of specific neural types. In drug testing experiments they can be used to monitor the suppression or enhancement of responding cell types. Ultimately, the greatest impact of the Pleiades MiniPs is anticipated to be the added specificity for therapeutic gene delivery into the human brain. While this may be accomplished using viruses, site-specific delivery to the human genome directly or in cell therapy is an area of active research [44, 45].
The availability of a large collection of new MiniPs will play an important role in treatments designed for incurable brain diseases.

5.5. Methods

5.5.1. Pleiades Promoter Project pipeline

The Pleiades Promoter Project is building a collection of 240 MiniPs in a four-year time frame. A pipeline has been established that involves 5 specialized laboratories. The MiniP sequences were computationally designed. In silico designs were assembled into DNA molecules at a rate of 4 per week. The construct DNA was electroporated into B6129F1 ESCs (mEMS1202 or mEMS1204 [13]) at a rate of 7 per week (including controls). Per electroporation, 10-15 clones per construct were picked, expanded, and PCR-verified, to obtain ~4 correctly targeted ESC lines per construct. ESCs were microinjected into E3.5 blastocysts from an N2 backcross of ICR into B6-Alb (BAN2), selected for high blastocyst yield. Germline females were backcrossed to C57BL/6J. The brains of N2-N3 germline males (8-12 weeks) were analyzed using histochemistry procedures. A minimum of 3 brains were processed for each MiniP. Every brain was cryosectioned (20 µm sections at 640 µm intervals) from medial to lateral sagittal, and prepared for brightfield detection of EGFP or lacZ. When reporter expression was absent in at least 3 adult brains, a MiniP strain was classified as negative. Positive MiniP strains underwent further histological analyses to define the cellular pattern of gene expression. Both positive and negative strains were prepared for presentation on the Internet at http://www.pleiades.org. For the positive strains, images were captured at high resolution (12000 x 16000 pixels) and tiled into hundreds of smaller images for viewing using Zoomify® technologies. Using the Zoomify viewer, investigators are able to drill into the images quickly and efficiently without loss of fidelity.

5.5.2.
Pleiades Promoter Project protocols

The general methods used have been described previously [13]. Only modifications or additions specific to this work are reported below.

5.5.2.1. Hprt1 targeting vectors and MiniPromoters

The Hprt1 targeting vectors used in this study are pEMS1306 (EGFP reporter [13]), pEMS1313 (lacZ reporter), and pEMS1302 (EGFP/cre reporter). The pEMS1313 and pEMS1302 fragments from the MCS to the end of the reporter gene were synthesized by GeneArt (Germany) and cloned into the Hprt1 targeting plasmid pJDH8A/246b [46] using EcoRI restriction sites.

MiniPs comprised up to 4 distinct genomic segments joined by fusion PCR. Each genomic segment was first PCR-amplified independently using AccuPrime Pfx DNA Polymerase (Invitrogen), PCR primers (Integrated DNA Technologies), and BAC DNA template (10 pg to 200 ng). PCR primers for the outermost 5′ and 3′ segments were tailed with the appropriate restriction sites to allow for cloning. For MiniPs with two segments or more, PCR products of upstream segments were 3′ tailed with 18 bp linkers homologous to the first 18 bp of the adjacent downstream genomic segment. Reaction conditions were 0.25 Unit enzyme, 1x AccuPrime Pfx reaction mix, 1.0 µM each primer mix in a 20 µl volume. A 2-minute denaturation at 95 ºC was followed by 30 cycles of 95 ºC for 15 seconds, annealing for 30 seconds (at the Tm corresponding to the primer pair), and 68 ºC for 90 seconds, plus a final extension at 68 ºC for 10 minutes. The PCR reaction was run on a 1% low melting point agarose gel and visualized using SYBR Green (Invitrogen); the product band was excised and recovered from the gel using the QIAquick gel extraction kit (Qiagen). Reaction products were eluted using 30 µl of Ultrapure water (Gibco), then quantified using the NanoDrop (Thermo Scientific). For MiniPs with multiple elements, fusion PCR was performed as above, but using 2.0 µl of gel-purified first-round reaction products (10 pg to 200 ng).
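The linker-tailing step described above can be illustrated with a short sketch. This is not the primer-design procedure used in this work; the function names, the 20-nt annealing length, and the toy sequences are assumptions made only for illustration. Only the 18 bp linker length comes from the text. The sketch shows how a reverse primer for an upstream segment can carry a tail homologous to the start of the adjacent downstream segment, so that the two first-round products overlap and can be joined in a binary fusion.

```python
LINKER_LEN = 18  # linker length stated in the text

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def tailed_reverse_primer(upstream: str, downstream: str,
                          anneal_len: int = 20) -> str:
    """Reverse primer (5'->3') for the upstream segment.

    The primer anneals to the last `anneal_len` bases of the upstream
    segment (hypothetical default) and carries a 5' tail that is the
    reverse complement of the first LINKER_LEN bases of the downstream
    segment, so the amplified top strand ends with that linker.
    """
    if len(upstream) < anneal_len:
        raise ValueError("upstream segment shorter than annealing length")
    if len(downstream) < LINKER_LEN:
        raise ValueError("downstream segment shorter than linker length")
    return reverse_complement(upstream[-anneal_len:] + downstream[:LINKER_LEN])


def tailed_product(upstream: str, downstream: str) -> str:
    """Top strand of the first-round upstream product, including the linker."""
    return upstream + downstream[:LINKER_LEN]


def fuse(tailed_upstream: str, downstream: str) -> str:
    """Join two first-round products through their shared linker overlap."""
    overlap = downstream[:LINKER_LEN]
    if not tailed_upstream.endswith(overlap):
        raise ValueError("products do not share the expected linker overlap")
    return tailed_upstream + downstream[LINKER_LEN:]


if __name__ == "__main__":
    # Toy sequences, not real MiniP segments.
    seg1 = "ATGC" * 10
    seg2 = "GGATCCTTAAGCCGTACGATTTGACC"
    print(tailed_reverse_primer(seg1, seg2))
    print(fuse(tailed_product(seg1, seg2), seg2) == seg1 + seg2)
```

Repeating `fuse` on successive pairs mirrors the binary fusions used to build constructs with more than two segments.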
Additional binary fusions were executed as above until the full-length MiniP was obtained. A subset of 9 MiniPs was generated by direct synthesis at GeneArt.

The final MiniPs were cloned into one of our Hprt1 targeting vectors and sequence verified with primers located every 300 bp along the construct on both strands. All discrepancies between the designed and constructed sequences were inspected using the UCSC Genome Browser annotations (hg18) [47]. We tolerated discrepancies if they were known polymorphisms, located in a non-conserved region (PhastCons Vertebrate Multiz Alignment & Conservation (17 Species) score below 0.7), or if analysis did not show any further regulatory implication. We rejected any sequence with an insertion or deletion larger than 10 bp.

5.5.2.2. Knock-in immediately 5′ of the Hprt1 locus

The mEMS1204 (B6129F1-Gt(ROSA26)Sortm1Sor/+, Hprt1b-m3/Y), mEMS1202 (B6129F1-Gt(ROSA26)Sor+/+, Hprt1b-m3/Y), and E14TG2A [48] ESC lines were electroporated with constructs built in pEMS1302, pEMS1306, or pEMS1313, respectively. Clones were maintained under HAT selection for 3-4 days of expansion in 96 well plates and then transferred to 2 x 24 wells and cultured in HT media. Once cells reached confluence, both wells were frozen in HT-freeze media and stored in liquid nitrogen (LN2).

5.5.2.3. PCR analysis of genomic DNA

Vector NTI (http://www.invitrogen.com) software was used to design PCR assays for the different constructs. MiniP-specific PCR genotyping assays are available on the http://www.pleiades.org website.

5.5.2.4. In vitro neural differentiation

Neural differentiation of ESCs was conducted as previously described [22], with the following modifications. Once confluent, ESCs were trypsinized and seeded in duplicate wells onto confluent MS-5 feeder layers at 500 cells/cm² for seven time-points.
Total cell RNA was extracted with the RNeasy Plus Mini Kit (Qiagen) and used in RT-PCR analysis in both +RT and –RT conditions, using the OneStep RT-PCR Kit (Qiagen) according to the manufacturer’s instructions (details in Appendix 2 Information). Ple53-EGFP immunohistochemistry was performed on day 11 of differentiation. Cells were washed once with 1x PBS and fixed using 4% paraformaldehyde in PBS for 15 minutes at room temperature. Cells were then blocked using Image-iT FX signal enhancer (Invitrogen) reagent and subsequently incubated with 1:1000 rabbit polyclonal anti-GFP antibody (Abcam) followed by 1:1000 Alexa-488 secondary anti-rabbit antibody (Invitrogen). Cells were imaged on a Zeiss Axiovert 200M microscope at 20x with the FITC filter set. Ple88-lacZ staining was performed as outlined at http://openwetware.org/wiki/LacZ_staining_of_cells, on day 14 of differentiation. Brightfield images were taken with the 10x objective on an Olympus Bx61 microscope.

5.5.3. Immunohistochemistry and histochemistry

EGFP expression was detected with anti-GFP using the Vectastain Elite ABC kit (Vector Labs, Burlingame, CA) and DAB, as a brown chromogen, following the manufacturer’s directions. Expression of beta-galactosidase (lacZ) or the EGFP/cre fusion protein (following recombination of the ROSA26 locus) was detected with Xgal (5-Bromo-4-chloro-3-indolyl-ß-d-galactopyranoside) staining as previously described [49]. High resolution serial images of brightfield material were acquired using a Nikon Optiphot-2 microscope with a LEP motorized stage connected to a Dell Precision 390 computer equipped with hardware and software from MicroBrightField, Inc. Images were captured and tiled using MBF Neurolucida Virtual Slice v8.2.3.0.

Double-label immunofluorescence for colocalization of EGFP and endogenous proteins was performed as previously described [50].
Either native EGFP fluorescence (nGFP) or anti-GFP detection with an Alexa-488 secondary antibody was combined with a second primary antiserum and detection with a Cy3 or Alexa-555 secondary antibody. Colocalization of LacZ activity and tyrosine hydroxylase (TH) or NeuN was performed with sequential staining as previously described [49]. Primary antibodies used for these studies include: rabbit anti-DCX (1:500, Cell Signaling), rabbit anti-orexin (1:500, Chemicon), mouse anti-GFAP (1:1000, Millipore), mouse anti-S100ß (1:1000, Abcam), mouse anti-NeuN (1:500, Chemicon), mouse anti-TH (1:3000, Chemicon). Secondary antibodies include: goat anti-rabbit Alexa-488 (1:500, Molecular probes), goat anti-rabbit-Cy3 (1:500, Jackson ImmunoResearch Laboratories, Inc.), goat anti-mouse Alexa-555 (1:500, Molecular probes), goat anti-mouse Alexa-488 (1:500, Molecular probes), donkey anti-goat-Cy3 (1:500, Jackson ImmunoResearch Laboratories, Inc.). Detection of double immunofluorescence was performed using a BioRad confocal laser-scanning microscope (CLSM, BioRad, Hercules, CA).

Whole-mount Xgal histochemistry was performed on 4% paraformaldehyde-fixed embryos (E10.5, E11.5) or dissected brains (E15.5, P0.5) following a protocol similar to that described above, after preincubation of the tissue in 0.1 M PBS containing 0.3% Triton X-100. Stained embryos and brains were photographed, cryosectioned, and counterstained with neutral red for localization of lacZ expressing cells.

Figure 5.1. A resource of 240 MiniPromoters for predictable reproducible expression

a. Pleiades MiniP design and testing strategy using the POGZ gene as an example. The MiniP designs capture the candidate regulatory regions (RR) in various combinations upstream of the endogenous gene promoter (Prom). The MiniPs are cloned in an Hprt1 targeting vector and knocked into the mouse genome at exactly the same location every time. b. Pleiades MiniPs (designated Ple#) designed for each of the 62 genes selected.
Each box represents a contiguous human DNA sequence bioinformatically identified as a candidate regulatory region. Regulatory regions are numbered, and when the number is reversed, the sequences are placed in reverse orientation to avoid a possible alternative start site. Multiple regulatory regions are stitched together upstream of the endogenous gene promoter (Prom), represented as an arrow. In some cases more than one Prom was identified (LongProm, Prom 2) and used in conjunction with 5′UTR and first intron sequences. In a few instances, sequences from neighboring genes were included in the design as indicated by a second gene name following the primarily selected one.

Figure 5.2. Resolution score prioritizes genes for MiniPromoter design

a. Score distribution for 100 manually curated genes. The boxes’ widths are proportional to the number of observations in the groups. The increases in scores from “1” to “4” and “5” are significant (p = 1.4e-03 and p = 7e-04, respectively; Wilcoxon test), as well as from “2” to “5” (p = 4.5e-02; Wilcoxon test). b. Score frequency of the selected 62 genes (black) compared to all other brain region selective genes (white). c. Pleiades genes selected for MiniP design.

Figure 5.3. In vitro neural differentiation for pre-screening MiniPromoter designs

a, e. Ple53 and Ple88 are cloned upstream of EGFP or lacZ, respectively. b, f. RT-PCR assays across seven time points of ESC neural differentiation demonstrate appropriate temporal expression. c, g. Immunohistochemistry or β-galactosidase staining demonstrates appropriate spatial expression (scale bar = 100 µm in c, 200 µm in g). d, h. Germline knock-in adult mouse brain sections analyzed by immunohistochemistry or β-galactosidase staining confirm expression in the appropriate regions (i.e., olfactory bulb and rostral migratory stream (RMS) for Ple53, and glia throughout the brain for Ple88).

Figure 5.4.
Montage of MiniPromoter expression patterns in the adult brain and retina

Presented is a sampling of positive strains expressing EGFP, detected using anti-GFP immunocytochemistry; EGFP/cre, detected using Xgal histochemistry; and lacZ, detected using Xgal histochemistry. Various brain regions containing positive staining are presented for each mouse strain. Bs, brainstem; Cb, cerebellum; Ctx, cortex; Hip, hippocampus; Hyp, hypothalamus; LC, locus coeruleus; Olf, olfactory bulb; Ret, retina; RMS, rostral migratory stream; VTA, ventral tegmental area. For each panel, the upper sagittal image is a montage photographed at 1.5x and resized to fit the frame. For each panel, scale bars for the lower images are given as left, right (µm): a, b, h, 100, 100; c, d, 200, 50; e, 400, 400; f, i, 200, 200; g, j, 200, 100; k, 400, 100; and l, 400, 50.

Figure 5.5. Specific neuronal and glial expression patterns

a-d. Ple54-EGFP expresses EGFP in Dcx-positive cells of the rostral migratory stream (a-c) and olfactory bulb (d). e-h. Ple111-EGFP expresses EGFP in a subpopulation of hypothalamic neurons that are Hcrt (Orexin)-positive. i-l. Ple90-EGFP expresses EGFP in Gfap-positive astrocytes of the hippocampus (i-k) and Bergmann glia of the cerebellum (l). m-p. Ple185-EGFP expresses EGFP in S100ß-positive Bergmann glia of the cerebellum (m-o) and myelinated fibers in the cortex (p). Scale bar, 50 µm (a-d), 100 µm (e-p). nGFP, native GFP fluorescence.

Figure 5.6. MiniPromoters as tools to study developmental expression patterns

a-c. Ple162-EGFP/cre is expressed in neurons of the ventral tegmental area (VTA) that are distinct from the tyrosine hydroxylase-positive (TH) cells (a, b). Beta-galactosidase-positive (xGal) cells in this area co-label with the pan-neuronal marker NeuN (c). d-k. Xgal staining in whole mount (d-g) and in histological sections (h-k) of Ple162-EGFP/cre mice across development at embryonic day (E) 10.5, 11.5, 15.5, and postnatal day (P) 0.5.
Scale bar, 50 µm (c, f, h, i, k), 100 µm (b, i inset, j), 200 µm (h inset, j inset, k inset), 500 µm (a, a inset, d), 750 µm (g), 1000 µm (e, f inset, g inset).

Figure 5.7. A unique dataset for bioinformatics analysis

a. Immunohistochemistry on adult mouse brain sections shows that the Ple90 original design recapitulates the expression of the previously characterized Ple88. b. The top panel shows the human genomic sequence around OLIG1 and OLIG2 together with the candidate regulatory regions included in Ple148, Ple150, and Ple151. Comparisons of TFBS predictions between Ple151 sequences (8, 10, and 11) and all others (5, 6, 7, 9) identify EGR1 (e) and FOS (f) binding sites putatively responsible for Ple151 specific expression. The conservation plots were captured from the UCSC genome browser.

5.6. References

1. Alexander BL, Ali RR, Alton EW, Bainbridge JW, Braun S, Cheng SH, Flotte TR, Gaspar HB, Grez M, Griesenbach U, Kaplitt MG, Ott MG, Seger R, Simons M, Thrasher AJ, Thrasher AZ, Yla-Herttuala S: Progress and prospects: gene therapy clinical trials (part 1). Gene Ther 2007, 14(20):1439-1447.
2. Aiuti A, Bachoud-Levi AC, Blesch A, Brenner MK, Cattaneo F, Chiocca EA, Gao G, High KA, Leen AM, Lemoine NR, McNeish IA, Meneguzzi G, Peschanski M, Roncarolo MG, Strayer DS, Tuszynski MH, Waxman DJ, Wilson JM: Progress and prospects: gene therapy clinical trials (part 2). Gene Ther 2007, 14(22):1555-1563.
3. Hacein-Bey-Abina S, Von Kalle C, Schmidt M, McCormack MP, Wulffraat N, Leboulch P, Lim A, Osborne CS, Pawliuk R, Morillon E, Sorensen R, Forster A, Fraser P, Cohen JI, de Saint Basile G, Alexander I, Wintergerst U, Frebourg T, Aurias A, Stoppa-Lyonnet D, Romana S, Radford-Weiss I, Gross F, Valensi F, Delabesse E, Macintyre E, Sigaux F, Soulier J, Leiva LE, Wissler M et al: LMO2-associated clonal T cell proliferation in two patients after gene therapy for SCID-X1. Science 2003, 302(5644):415-419.
4.
Lein ES, Hawrylycz MJ, Ao N, Ayres M, Bensinger A, Bernard A, Boe AF, Boguski MS, Brockway KS, Byrnes EJ, Chen L, Chen L, Chen TM, Chin MC, Chong J, Crook BE, Czaplinska A, Dang CN, Datta S, Dee NR, Desaki AL, Desta T, Diep E, Dolbeare TA, Donelan MJ, Dong HW, Dougherty JG, Duncan BJ, Ebbert AJ, Eichele G et al: Genome-wide atlas of gene expression in the adult mouse brain. Nature 2007, 445(7124):168-176.
5. Gong S, Zheng C, Doughty ML, Losos K, Didkovsky N, Schambra UB, Nowak NJ, Joyner A, Leblanc G, Hatten ME, Heintz N: A gene expression atlas of the central nervous system based on bacterial artificial chromosomes. Nature 2003, 425(6961):917-925.
6. Pennacchio LA, Ahituv N, Moses AM, Prabhakar S, Nobrega MA, Shoukry M, Minovitsky S, Dubchak I, Holt A, Lewis KD, Plajzer-Frick I, Akiyama J, De Val S, Afzal V, Black BL, Couronne O, Eisen MB, Visel A, Rubin EM: In vivo enhancer analysis of human conserved non-coding sequences. Nature 2006, 444(7118):499-502.
7. Tsumaki N, Kimura T, Tanaka K, Kimura JH, Ochi T, Yamada Y: Modular arrangement of cartilage- and neural tissue-specific cis-elements in the mouse alpha2(XI) collagen promoter. J Biol Chem 1998, 273(36):22861-22864.
8. Farhadi HF, Peterson AC: The myelin basic protein gene: a prototype for combinatorial mammalian transcriptional regulation. Adv Neurol 2006, 98:65-76.
9. Davidson S, Miller KA, Dowell A, Gildea A, Mackenzie A: A remote and highly conserved enhancer supports amygdala specific expression of the gene encoding the anxiogenic neuropeptide substance-P. Mol Psychiatry 2006, 11(4):323, 410-321.
10. Strand AD, Aragaki AK, Baquet ZC, Hodges A, Cunningham P, Holmans P, Jones KR, Jones L, Kooperberg C, Olson JM: Conservation of regional gene expression in mouse and human brain. PLoS Genet 2007, 3(4):e59.
11.
D'Souza CA, Chopra V, Varhol R, Xie YY, Bohacec S, Zhao Y, Lee LL, Bilenky M, Portales-Casamar E, He A, Wasserman WW, Goldowitz D, Marra MA, Holt RA, Simpson EM, Jones SJ: Identification of a set of genes showing regionally enriched expression in the mouse brain. BMC Neurosci 2008, 9:66.
12. Bronson SK, Plaehn EG, Kluckman KD, Hagaman JR, Maeda N, Smithies O: Single-copy transgenic mice with chosen-site integration [see comments]. Proc Natl Acad Sci U S A 1996, 93(17):9067-9072.
13. Yang GS, Banks KG, Bonaguro RJ, Wilson G, Dreolini L, de Leeuw CN, Liu L, Swanson DJ, Goldowitz D, Holt RA, Simpson EM: Next generation tools for high-throughput promoter and expression analysis employing single-copy knock-ins at the Hprt1 locus. Genomics 2008.
14. Farhadi HF, Lepage P, Forghani R, Friedman HC, Orfali W, Jasmin L, Miller W, Hudson TJ, Peterson AC: A combinatorial network of evolutionarily conserved myelin basic protein regulatory sequences confers distinct glial-specific phenotypes. J Neurosci 2003, 23(32):10214-10223.
15. Portales-Casamar E, Kirov S, Lim J, Lithwick S, Swanson MI, Ticoll A, Snoddy J, Wasserman WW: PAZAR: a framework for collection and dissemination of cis-regulatory sequence annotation. Genome Biol 2007, 8(10):R207.
16. Schoenherr CJ, Anderson DJ: Silencing is golden: negative regulation in the control of neuronal gene transcription. Curr Opin Neurobiol 1995, 5(5):566-571.
17. Sandelin A, Alkema W, Engstrom P, Wasserman WW, Lenhard B: JASPAR: an open-access database for eukaryotic transcription factor binding profiles. Nucleic acids research 2004, 32(Database issue):D91-94.
18. Shiraki T, Kondo S, Katayama S, Waki K, Kasukawa T, Kawaji H, Kodzius R, Watahiki A, Nakamura M, Arakawa T, Fukuda S, Sasaki D, Podhajska A, Harbers M, Kawai J, Carninci P, Hayashizaki Y: Cap analysis gene expression for high-throughput analysis of transcriptional starting point and identification of promoter usage. Proc Natl Acad Sci U S A 2003, 100(26):15776-15781.
19.
Gardiner-Garden M, Frommer M: CpG islands in vertebrate genomes. J Mol Biol 1987, 196(2):261-282.
20. Brenner M, Kisseberth WC, Su Y, Besnard F, Messing A: GFAP promoter directs astrocyte-specific expression in transgenic mice. J Neurosci 1994, 14(3 Pt 1):1030-1037.
21. Couillard-Despres S, Winner B, Karl C, Lindemann G, Schmid P, Aigner R, Laemke J, Bogdahn U, Winkler J, Bischofberger J, Aigner L: Targeted transgene expression in neuronal precursors: watching young neurons in the old brain. Eur J Neurosci 2006, 24(6):1535-1545.
22. Barberi T, Klivenyi P, Calingasan NY, Lee H, Kawamata H, Loonam K, Perrier AL, Bruses J, Rubio ME, Topf N, Tabar V, Harrison NL, Beal MF, Moore MA, Studer L: Neural subtype specification of fertilization and nuclear transfer embryonic stem cells and application in parkinsonian mice. Nat Biotechnol 2003, 21(10):1200-1207.
23. Cai C, Grabel L: Directing the differentiation of embryonic stem cells to neural stem cells. Dev Dyn 2007.
24. Magdaleno S, Jensen P, Brumwell CL, Seal A, Lehman K, Asbury A, Cheung T, Cornelius T, Batten DM, Eden C, Norland SM, Rice DS, Dosooye N, Shakya S, Mehta P, Curran T: BGEM: an in situ hybridization database of gene expression in the embryonic and adult mouse nervous system. PLoS Biol 2006, 4(4):e86.
25. Visel A, Thaller C, Eichele G: GenePaint.org: an atlas of gene expression patterns in the mouse embryo. Nucleic acids research 2004, 32(Database issue):D552-556.
26. Soriano P: Generalized lacZ expression with the ROSA26 Cre reporter strain. Nat Genet 1999, 21(1):70-71.
27. Smidt MP, van Schaick HS, Lanctot C, Tremblay JJ, Cox JJ, van der Kleij AA, Wolterink G, Drouin J, Burbach JP: A homeodomain gene Ptx3 has highly restricted brain expression in mesencephalic dopaminergic neurons. Proc Natl Acad Sci U S A 1997, 94(24):13305-13310.
28. Lee Y, Messing A, Su M, Brenner M: GFAP promoter elements required for region-specific and astrocyte-specific expression. Glia 2008, 56(5):481-493.
29.
Arnett HA, Fancy SP, Alberta JA, Zhao C, Plant SR, Kaing S, Raine CS, Rowitch DH, Franklin RJ, Stiles CD: bHLH transcription factor Olig1 is required to repair demyelinated lesions in the CNS. Science 2004, 306(5704):2111-2115. 30. Lu QR, Cai L, Rowitch D, Cepko CL, Stiles CD: Ectopic expression of Olig1 promotes oligodendrocyte formation and reduces neuronal survival in developing mouse cortex. Nat Neurosci 2001, 4(10):973-974. 31. Bal